Skip to content

gh-152140: Rework and optimize _Extra class in zipfile#152141

Open
danny0838 wants to merge 7 commits into
python:mainfrom
danny0838:gh-152140
Open

gh-152140: Rework and optimize _Extra class in zipfile#152141
danny0838 wants to merge 7 commits into
python:mainfrom
danny0838:gh-152140

Conversation

@danny0838

@danny0838 danny0838 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

The _Extra class was over-engineered. Its only active usage was its strip() method, called by ZipFile._write_end_record() to provide the stripping logic for ZIP64 fields. The context-dependent nature of extra fields also made it difficult to be reused by _decodeExtra() or other methods efficiently. Additionally, its split() method called _Extra directly rather than utilizing cls, which was a suboptimal pattern that hindered extensibility.

Remove the _Extra class entirely and reimplement it as a private static method _strip_extra_fields() that processes a bytearray inside ZipFile, positioned directly beneath its caller. This eliminates dead and suboptimal code, achieves clean encapsulation and code locality, and improves performance by avoiding temporary class allocations.

Additionally, bind this method as a class property in the tests to simplify the code.

@eendebakpt eendebakpt left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR indeed creates cleaner code. But this is not purely refactoring, there are several new things:

  • Early exit for empty extra data: does that occur? If so, how often
  • _strip_extra_fields now returns a bytarray instead of bytes
  • The parsing of the data is implemented different now (e.g. we need an additional keep remaining trailing bytes)

Can you make benchmarks to show the new code is indeed faster (or at least no regression)?

Comment thread Lib/zipfile/__init__.py Outdated
Comment thread Lib/zipfile/__init__.py Outdated
@danny0838

danny0838 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

The PR indeed creates cleaner code. But this is not purely refactoring, there are several new things:

  • Early exit for empty extra data: does that occur? If so, how often

It depends. Other ZIP tools may add Linux permission or extended date, while Python zipfile doesn't create extra for non-zip64 entries and ignores most extra fields, and it should be very often for the latter case.

  • _strip_extra_fields now returns a bytarray instead of bytes

Yes, this doesn't affect the current single use case. And it would be better for future potential caller to further process the returned value.

  • The parsing of the data is implemented different now (e.g. we need an additional keep remaining trailing bytes)

This is NOT a behavior change, as responsed in another comment.

Can you make benchmarks to show the new code is indeed faster (or at least no regression)?

I'll try to. Is there a need to include it in the test suite? Or just provide a standalone code sinipper for testing?

@danny0838

danny0838 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Here's the benchmark:

import timeit
import struct
import random

random.seed(42)

class _Extra(bytes):
    FIELD_STRUCT = struct.Struct('<HH')

    def __new__(cls, val, id=None):
        return super().__new__(cls, val)

    def __init__(self, val, id=None):
        self.id = id

    @classmethod
    def read_one(cls, raw):
        try:
            xid, xlen = cls.FIELD_STRUCT.unpack(raw[:4])
        except struct.error:
            xid = None
            xlen = 0
        return cls(raw[:4+xlen], xid), raw[4+xlen:]

    @classmethod
    def split(cls, data):
        # use memoryview for zero-copy slices
        rest = memoryview(data)
        while rest:
            extra, rest = _Extra.read_one(rest)
            yield extra

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs."""
        return b''.join(
            ex
            for ex in cls.split(data)
            if ex.id not in xids
        )

def strip_extra_fields(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # early return for empty extra data
    if not data:
        return result

    # use memoryview for zero-copy slices
    mv = memoryview(data)
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_early_return(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # use memoryview for zero-copy slices
    mv = memoryview(data)
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_mv(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # early return for empty extra data
    if not data:
        return result

    mv = data
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_mv_no_early_return(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    mv = data
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_try(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # early return for empty extra data
    if not data:
        return result

    # use memoryview for zero-copy slices
    data_len = len(data)
    pos = 0
    while pos < data_len:
        try:
            xid, xlen = struct.unpack_from('<HH', data, pos)
        except struct.error:
            xid, xlen = None, 0
        if xid not in field_ids:
            result.extend(data[pos:pos + 4 + xlen])
        pos += 4 + xlen

    return result

class _Extra2:
    @classmethod
    def iter(cls, data):
        """Iter through all subfields."""
        # early return for empty extra data
        if not data:
            return

        pos, data_len = 0, len(data)
        while pos < data_len:
            try:
                xid, xlen = struct.unpack_from('<HH', data, pos)
            except struct.error:
                xid, xlen = None, 0
            yield data[pos:pos + 4 + xlen], xid, xlen
            pos += 4 + xlen

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs and return a bytearray."""
        result = bytearray()
        for xfield, xid, _ in cls.iter(data):
            if xid not in xids:
                result.extend(xfield)
        return result

class _Extra3:
    @classmethod
    def iter(cls, data):
        """Iter through all subfields."""
        # early return for empty extra data
        if not data:
            return

        pos, data_len = 0, len(data)
        while pos < data_len:
            try:
                xid, xlen = struct.unpack_from('<HH', data, pos)
            except struct.error:
                xid, xlen = None, 0
            yield data[pos:pos + 4 + xlen], xid, xlen
            pos += 4 + xlen

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs and return a bytearray."""
        result = b''
        for xfield, xid, _ in cls.iter(data):
            if xid not in xids:
                result += xfield
        return result

class _Extra4:
    @classmethod
    def iter(cls, data):
        """Iter through all subfields."""
        # early return for empty extra data
        if not data:
            return

        pos, data_len = 0, len(data)
        while pos < data_len:
            try:
                xid, xlen = struct.unpack_from('<HH', data, pos)
            except struct.error:
                xid, xlen = None, 0
            yield data[pos:pos + 4 + xlen], xid, xlen
            pos += 4 + xlen

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs and return a bytearray."""
        return b''.join(
            xfield 
            for xfield, xid, _ in cls.iter(data) 
            if xid not in xids
        )

class _Extra5:
    @classmethod
    def iter(cls, data):
        """Iter through all subfields."""
        # early return for empty extra data
        if not data:
            return

        pos, data_len = 0, len(data)
        while pos < data_len:
            try:
                xid, xlen = struct.unpack_from('<HH', data, pos)
            except struct.error:
                xid, xlen = None, 0
            yield data[pos:pos + 4 + xlen], xid
            pos += 4 + xlen

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs and return a bytearray."""
        result = b''
        for xfield, xid in cls.iter(data):
            if xid not in xids:
                result += xfield
        return result

class _Extra6:
    FIELD_STRUCT = struct.Struct('<HH')
    @classmethod
    def iter(cls, data):
        """Iter through all subfields."""
        # early return for empty extra data
        if not data:
            return

        pos, data_len = 0, len(data)
        while pos < data_len:
            try:
                xid, xlen = cls.FIELD_STRUCT.unpack_from(data, pos)
            except struct.error:
                xid, xlen = None, 0
            yield data[pos:pos + 4 + xlen], xid, xlen
            pos += 4 + xlen

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs."""
        result = b''
        for xfield, xid, _ in cls.iter(data):
            if xid not in xids:
                result += xfield
        return result

s = struct.Struct("<HH")
valid_extra_data = (
    s.pack(0x0001, 24) + b"X" * 24 +  # ZIP64 Extended Information
    s.pack(0x5455, 13) + b"\x07" + struct.pack("<LLL", 1767225600, 1767225600, 1767225600) + # Extended Timestamp
    s.pack(0x000a, 32) + b"\x00"*4 + struct.pack("<QQQ", 132537600000000000, 132537600000000000, 132537600000000000) + # NTFS Info
    s.pack(0x0001, 16) + b"Y" * 16 +  # ZIP64
    s.pack(0x7875, 11) + b"\x01\x04\xe8\x03\x00\x00\x04\xe8\x03\x00\x00" # Unix UID/GID
)
empty_extra_data = b""

dataset = [valid_extra_data if i % 2 == 0 else empty_extra_data for i in range(200)]
random.shuffle(dataset)

def run_case0():
    for data in dataset:
        _Extra.strip(data, (1,))

def run_case1():
    for data in dataset:
        strip_extra_fields(data, (1,))

def run_case2():
    for data in dataset:
        strip_extra_fields_no_early_return(data, (1,))

def run_case3():
    for data in dataset:
        strip_extra_fields_no_mv(data, (1,))

def run_case4():
    for data in dataset:
        strip_extra_fields_no_mv_no_early_return(data, (1,))

def run_case5():
    for data in dataset:
        strip_extra_fields_try(data, (1,))

def run_case6():
    for data in dataset:
        _Extra2.strip(data, (1,))

def run_case7():
    for data in dataset:
        _Extra3.strip(data, (1,))

def run_case8():
    for data in dataset:
        _Extra4.strip(data, (1,))

def run_case9():
    for data in dataset:
        _Extra5.strip(data, (1,))

def run_case10():
    for data in dataset:
        _Extra6.strip(data, (1,))

NUMBER = 10000

print(f"=== Benchmark Results ({NUMBER * len(dataset):,} total invocations) ===")
print(f"Dataset Profile: 50% Valid Extra Data, 50% Empty Data")
i = 0
while True:
    try:
        time = timeit.timeit(stmt=f'run_case{i}()', globals=globals(), number=NUMBER)
    except NameError:
        break

    if i == 0:
        time0 = time
    print(f'case{i}: {time:.4f} seconds ({time0 / time:.2f}x)')

    i += 1

import tracemalloc

def profile_memory(func):
    tracemalloc.start()
    for _ in range(50):
        func()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

print(f"\n=== Memory Peak Results ===")
i = 0
while True:
    try:
        peak = profile_memory(globals()[f'run_case{i}'])
    except KeyError:
        break

    if i == 0:
        peak0 = peak
    print(
        f'Peak Memory for case{i}: {peak:,} bytes '
        f'(saved {((peak0 - peak) / peak0) * 100:.1f}%)'
    )

    i += 1

And the results:

=== Benchmark Results (2,000,000 total invocations) ===
Dataset Profile: 50% Valid Extra Data, 50% Empty Data
case0: 13.3667 seconds (1.00x)
case1: 3.8258 seconds (3.49x)
case2: 3.8045 seconds (3.51x)
case3: 3.0988 seconds (4.31x)
case4: 3.0951 seconds (4.32x)
case5: 2.8227 seconds (4.74x)
case6: 4.3608 seconds (3.07x)
case7: 3.9348 seconds (3.40x)
case8: 5.4211 seconds (2.47x)
case9: 3.8251 seconds (3.49x)
case10: 4.1201 seconds (3.24x)

=== Memory Peak Results ===
Peak Memory for case0: 2,408 bytes (saved 0.0%)
Peak Memory for case1: 677 bytes (saved 71.9%)
Peak Memory for case2: 677 bytes (saved 71.9%)
Peak Memory for case3: 245 bytes (saved 89.8%)
Peak Memory for case4: 245 bytes (saved 89.8%)
Peak Memory for case5: 245 bytes (saved 89.8%)
Peak Memory for case6: 539 bytes (saved 77.6%)
Peak Memory for case7: 523 bytes (saved 78.3%)
Peak Memory for case8: 959 bytes (saved 60.2%)
Peak Memory for case9: 523 bytes (saved 78.3%)
Peak Memory for case10: 515 bytes (saved 78.6%)

The benchmark shows that the new implementation improves performance, the early return for empty data is helpful (assume that 50% is empty), removing memoryview improves performance, and try..except works better. Accordingly, I removed the memoryview usage.

@danny0838

danny0838 commented Jun 28, 2026

Copy link
Copy Markdown
Contributor Author

Due to bug gh-152486, PR gh-152487 will be another caller of _Extra.strip from ZipInfo if it's to be implemented. Rebased to the latest main branch and reworked as zipfile._strip_extra_fields() module function. Other details are as previously discussed.

@danny0838 danny0838 changed the title gh-152140: Replace _Extra class with ZipFile._strip_extra_fields() gh-152140: Replace _Extra class with _strip_extra_fields() in zipfile Jun 28, 2026
@danny0838 danny0838 changed the title gh-152140: Replace _Extra class with _strip_extra_fields() in zipfile gh-152140: Rework and optimize _Extra class in zipfile Jun 29, 2026
Introduce a helper `.update()` that simplifies the common operation of
fields updating of extra data bytes, by automatically stripping
existing subfields with identical IDs and inserts multiple fields in
one step.
Introduce `.iter()` in place of `.strip()` to yield each subfield info
as a tuple.  This has better semantic and is more performant than
subclassing `bytes`.  The method name change also allows feature
detection for the behavior changes.

Remove `.read_one()` since it has never been used and is now obsolete.

Make `_Extra` a normal class rather than a subclass of `bytes`, which
is imperformant by having to copy the passed bytes when instantiated.
@@ -0,0 +1,5 @@
Rework :class:`!_Extra` in the :mod:`zipfile` module::

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_Extra is an internal class, so not relevant for users. These details can be removed from the news entry.

For just the refactoring we can skip the news entry. For the performance improvement, I would like to see a benchmark of functionality from the public api (e.g. not a benchmark of _Extra.strip, but of opening or unzipping a file)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants