gh-152140: Rework and optimize _Extra class in zipfile#152141
gh-152140: Rework and optimize _Extra class in zipfile#152141danny0838 wants to merge 7 commits into
_Extra class in zipfile#152141Conversation
eendebakpt
left a comment
There was a problem hiding this comment.
The PR indeed creates cleaner code. But this is not purely refactoring, there are several new things:
- Early exit for empty extra data: does that occur? If so, how often
- _strip_extra_fields now returns a bytarray instead of bytes
- The parsing of the data is implemented different now (e.g. we need an additional
keep remaining trailing bytes)
Can you make benchmarks to show the new code is indeed faster (or at least no regression)?
It depends. Other ZIP tools may add Linux permission or extended date, while Python
Yes, this doesn't affect the current single use case. And it would be better for future potential caller to further process the returned value.
This is NOT a behavior change, as responsed in another comment.
I'll try to. Is there a need to include it in the test suite? Or just provide a standalone code sinipper for testing? |
|
Here's the benchmark: import timeit
import struct
import random
random.seed(42)
class _Extra(bytes):
FIELD_STRUCT = struct.Struct('<HH')
def __new__(cls, val, id=None):
return super().__new__(cls, val)
def __init__(self, val, id=None):
self.id = id
@classmethod
def read_one(cls, raw):
try:
xid, xlen = cls.FIELD_STRUCT.unpack(raw[:4])
except struct.error:
xid = None
xlen = 0
return cls(raw[:4+xlen], xid), raw[4+xlen:]
@classmethod
def split(cls, data):
# use memoryview for zero-copy slices
rest = memoryview(data)
while rest:
extra, rest = _Extra.read_one(rest)
yield extra
@classmethod
def strip(cls, data, xids):
"""Remove Extra fields with specified IDs."""
return b''.join(
ex
for ex in cls.split(data)
if ex.id not in xids
)
def strip_extra_fields(data, field_ids):
"""Remove Extra fields with specified IDs and return a bytearray.
data should be bytes or bytearray.
"""
result = bytearray()
# early return for empty extra data
if not data:
return result
# use memoryview for zero-copy slices
mv = memoryview(data)
mv_len = len(mv)
pos = 0
while pos + 4 <= mv_len:
xid, xlen = struct.unpack_from('<HH', mv, pos)
if pos + 4 + xlen > mv_len:
break
if xid not in field_ids:
result.extend(mv[pos:pos + 4 + xlen])
pos += 4 + xlen
# keep remaining trailing bytes (e.g. truncated or malformed data)
if pos < mv_len:
result.extend(mv[pos:])
return result
def strip_extra_fields_no_early_return(data, field_ids):
"""Remove Extra fields with specified IDs and return a bytearray.
data should be bytes or bytearray.
"""
result = bytearray()
# use memoryview for zero-copy slices
mv = memoryview(data)
mv_len = len(mv)
pos = 0
while pos + 4 <= mv_len:
xid, xlen = struct.unpack_from('<HH', mv, pos)
if pos + 4 + xlen > mv_len:
break
if xid not in field_ids:
result.extend(mv[pos:pos + 4 + xlen])
pos += 4 + xlen
# keep remaining trailing bytes (e.g. truncated or malformed data)
if pos < mv_len:
result.extend(mv[pos:])
return result
def strip_extra_fields_no_mv(data, field_ids):
"""Remove Extra fields with specified IDs and return a bytearray.
data should be bytes or bytearray.
"""
result = bytearray()
# early return for empty extra data
if not data:
return result
mv = data
mv_len = len(mv)
pos = 0
while pos + 4 <= mv_len:
xid, xlen = struct.unpack_from('<HH', mv, pos)
if pos + 4 + xlen > mv_len:
break
if xid not in field_ids:
result.extend(mv[pos:pos + 4 + xlen])
pos += 4 + xlen
# keep remaining trailing bytes (e.g. truncated or malformed data)
if pos < mv_len:
result.extend(mv[pos:])
return result
def strip_extra_fields_no_mv_no_early_return(data, field_ids):
"""Remove Extra fields with specified IDs and return a bytearray.
data should be bytes or bytearray.
"""
result = bytearray()
mv = data
mv_len = len(mv)
pos = 0
while pos + 4 <= mv_len:
xid, xlen = struct.unpack_from('<HH', mv, pos)
if pos + 4 + xlen > mv_len:
break
if xid not in field_ids:
result.extend(mv[pos:pos + 4 + xlen])
pos += 4 + xlen
# keep remaining trailing bytes (e.g. truncated or malformed data)
if pos < mv_len:
result.extend(mv[pos:])
return result
def strip_extra_fields_try(data, field_ids):
"""Remove Extra fields with specified IDs and return a bytearray.
data should be bytes or bytearray.
"""
result = bytearray()
# early return for empty extra data
if not data:
return result
# use memoryview for zero-copy slices
data_len = len(data)
pos = 0
while pos < data_len:
try:
xid, xlen = struct.unpack_from('<HH', data, pos)
except struct.error:
xid, xlen = None, 0
if xid not in field_ids:
result.extend(data[pos:pos + 4 + xlen])
pos += 4 + xlen
return result
class _Extra2:
@classmethod
def iter(cls, data):
"""Iter through all subfields."""
# early return for empty extra data
if not data:
return
pos, data_len = 0, len(data)
while pos < data_len:
try:
xid, xlen = struct.unpack_from('<HH', data, pos)
except struct.error:
xid, xlen = None, 0
yield data[pos:pos + 4 + xlen], xid, xlen
pos += 4 + xlen
@classmethod
def strip(cls, data, xids):
"""Remove Extra fields with specified IDs and return a bytearray."""
result = bytearray()
for xfield, xid, _ in cls.iter(data):
if xid not in xids:
result.extend(xfield)
return result
class _Extra3:
@classmethod
def iter(cls, data):
"""Iter through all subfields."""
# early return for empty extra data
if not data:
return
pos, data_len = 0, len(data)
while pos < data_len:
try:
xid, xlen = struct.unpack_from('<HH', data, pos)
except struct.error:
xid, xlen = None, 0
yield data[pos:pos + 4 + xlen], xid, xlen
pos += 4 + xlen
@classmethod
def strip(cls, data, xids):
"""Remove Extra fields with specified IDs and return a bytearray."""
result = b''
for xfield, xid, _ in cls.iter(data):
if xid not in xids:
result += xfield
return result
class _Extra4:
@classmethod
def iter(cls, data):
"""Iter through all subfields."""
# early return for empty extra data
if not data:
return
pos, data_len = 0, len(data)
while pos < data_len:
try:
xid, xlen = struct.unpack_from('<HH', data, pos)
except struct.error:
xid, xlen = None, 0
yield data[pos:pos + 4 + xlen], xid, xlen
pos += 4 + xlen
@classmethod
def strip(cls, data, xids):
"""Remove Extra fields with specified IDs and return a bytearray."""
return b''.join(
xfield
for xfield, xid, _ in cls.iter(data)
if xid not in xids
)
class _Extra5:
@classmethod
def iter(cls, data):
"""Iter through all subfields."""
# early return for empty extra data
if not data:
return
pos, data_len = 0, len(data)
while pos < data_len:
try:
xid, xlen = struct.unpack_from('<HH', data, pos)
except struct.error:
xid, xlen = None, 0
yield data[pos:pos + 4 + xlen], xid
pos += 4 + xlen
@classmethod
def strip(cls, data, xids):
"""Remove Extra fields with specified IDs and return a bytearray."""
result = b''
for xfield, xid in cls.iter(data):
if xid not in xids:
result += xfield
return result
class _Extra6:
FIELD_STRUCT = struct.Struct('<HH')
@classmethod
def iter(cls, data):
"""Iter through all subfields."""
# early return for empty extra data
if not data:
return
pos, data_len = 0, len(data)
while pos < data_len:
try:
xid, xlen = cls.FIELD_STRUCT.unpack_from(data, pos)
except struct.error:
xid, xlen = None, 0
yield data[pos:pos + 4 + xlen], xid, xlen
pos += 4 + xlen
@classmethod
def strip(cls, data, xids):
"""Remove Extra fields with specified IDs."""
result = b''
for xfield, xid, _ in cls.iter(data):
if xid not in xids:
result += xfield
return result
s = struct.Struct("<HH")
valid_extra_data = (
s.pack(0x0001, 24) + b"X" * 24 + # ZIP64 Extended Information
s.pack(0x5455, 13) + b"\x07" + struct.pack("<LLL", 1767225600, 1767225600, 1767225600) + # Extended Timestamp
s.pack(0x000a, 32) + b"\x00"*4 + struct.pack("<QQQ", 132537600000000000, 132537600000000000, 132537600000000000) + # NTFS Info
s.pack(0x0001, 16) + b"Y" * 16 + # ZIP64
s.pack(0x7875, 11) + b"\x01\x04\xe8\x03\x00\x00\x04\xe8\x03\x00\x00" # Unix UID/GID
)
empty_extra_data = b""
dataset = [valid_extra_data if i % 2 == 0 else empty_extra_data for i in range(200)]
random.shuffle(dataset)
def run_case0():
for data in dataset:
_Extra.strip(data, (1,))
def run_case1():
for data in dataset:
strip_extra_fields(data, (1,))
def run_case2():
for data in dataset:
strip_extra_fields_no_early_return(data, (1,))
def run_case3():
for data in dataset:
strip_extra_fields_no_mv(data, (1,))
def run_case4():
for data in dataset:
strip_extra_fields_no_mv_no_early_return(data, (1,))
def run_case5():
for data in dataset:
strip_extra_fields_try(data, (1,))
def run_case6():
for data in dataset:
_Extra2.strip(data, (1,))
def run_case7():
for data in dataset:
_Extra3.strip(data, (1,))
def run_case8():
for data in dataset:
_Extra4.strip(data, (1,))
def run_case9():
for data in dataset:
_Extra5.strip(data, (1,))
def run_case10():
for data in dataset:
_Extra6.strip(data, (1,))
NUMBER = 10000
print(f"=== Benchmark Results ({NUMBER * len(dataset):,} total invocations) ===")
print(f"Dataset Profile: 50% Valid Extra Data, 50% Empty Data")
i = 0
while True:
try:
time = timeit.timeit(stmt=f'run_case{i}()', globals=globals(), number=NUMBER)
except NameError:
break
if i == 0:
time0 = time
print(f'case{i}: {time:.4f} seconds ({time0 / time:.2f}x)')
i += 1
import tracemalloc
def profile_memory(func):
tracemalloc.start()
for _ in range(50):
func()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
return peak
print(f"\n=== Memory Peak Results ===")
i = 0
while True:
try:
peak = profile_memory(globals()[f'run_case{i}'])
except KeyError:
break
if i == 0:
peak0 = peak
print(
f'Peak Memory for case{i}: {peak:,} bytes '
f'(saved {((peak0 - peak) / peak0) * 100:.1f}%)'
)
i += 1And the results: The benchmark shows that the new implementation improves performance, the early return for empty data is helpful (assume that 50% is empty), removing |
_Extra class with ZipFile._strip_extra_fields()_Extra class with _strip_extra_fields() in zipfile
_Extra class with _strip_extra_fields() in zipfile_Extra class in zipfile
Introduce a helper `.update()` that simplifies the common operation of fields updating of extra data bytes, by automatically stripping existing subfields with identical IDs and inserts multiple fields in one step.
Introduce `.iter()` in place of `.strip()` to yield each subfield info as a tuple. This has better semantic and is more performant than subclassing `bytes`. The method name change also allows feature detection for the behavior changes. Remove `.read_one()` since it has never been used and is now obsolete. Make `_Extra` a normal class rather than a subclass of `bytes`, which is imperformant by having to copy the passed bytes when instantiated.
| @@ -0,0 +1,5 @@ | |||
| Rework :class:`!_Extra` in the :mod:`zipfile` module:: | |||
There was a problem hiding this comment.
_Extra is an internal class, so not relevant for users. These details can be removed from the news entry.
For just the refactoring we can skip the news entry. For the performance improvement, I would like to see a benchmark of functionality from the public api (e.g. not a benchmark of _Extra.strip, but of opening or unzipping a file)
The
_Extraclass was over-engineered. Its only active usage was itsstrip()method, called byZipFile._write_end_record()to provide the stripping logic for ZIP64 fields. The context-dependent nature of extra fields also made it difficult to be reused by_decodeExtra()or other methods efficiently. Additionally, itssplit()method called_Extradirectly rather than utilizingcls, which was a suboptimal pattern that hindered extensibility.Remove the
_Extraclass entirely and reimplement it as a private static method_strip_extra_fields()that processes a bytearray insideZipFile, positioned directly beneath its caller. This eliminates dead and suboptimal code, achieves clean encapsulation and code locality, and improves performance by avoiding temporary class allocations.Additionally, bind this method as a class property in the tests to simplify the code.
zipfilemodule #152140