gh-152140: Rework and optimize `_Extra` class in `zipfile` by danny0838 · Pull Request #152141 · python/cpython

danny0838 · 2026-06-25T00:35:16Z

The _Extra class was over-engineered. Its only active usage was its strip() method, called by ZipFile._write_end_record() to provide the stripping logic for ZIP64 fields. The context-dependent nature of extra fields also made it difficult to be reused by _decodeExtra() or other methods efficiently. Additionally, its split() method called _Extra directly rather than utilizing cls, which was a suboptimal pattern that hindered extensibility.

Remove the _Extra class entirely and reimplement it as a private static method _strip_extra_fields() that processes a bytearray inside ZipFile, positioned directly beneath its caller. This eliminates dead and suboptimal code, achieves clean encapsulation and code locality, and improves performance by avoiding temporary class allocations.

Additionally, bind this method as a class property in the tests to simplify the code.

Issue: Tidy and optimize zipfile module #152140

eendebakpt

The PR indeed creates cleaner code. But this is not purely refactoring, there are several new things:

Early exit for empty extra data: does that occur? If so, how often
_strip_extra_fields now returns a bytarray instead of bytes
The parsing of the data is implemented different now (e.g. we need an additional keep remaining trailing bytes)

Can you make benchmarks to show the new code is indeed faster (or at least no regression)?

danny0838 · 2026-06-25T09:06:57Z

The PR indeed creates cleaner code. But this is not purely refactoring, there are several new things:

Early exit for empty extra data: does that occur? If so, how often

It depends. Other ZIP tools may add Linux permission or extended date, while Python zipfile doesn't create extra for non-zip64 entries and ignores most extra fields, and it should be very often for the latter case.

_strip_extra_fields now returns a bytarray instead of bytes

Yes, this doesn't affect the current single use case. And it would be better for future potential caller to further process the returned value.

The parsing of the data is implemented different now (e.g. we need an additional keep remaining trailing bytes)

This is NOT a behavior change, as responsed in another comment.

Can you make benchmarks to show the new code is indeed faster (or at least no regression)?

I'll try to. Is there a need to include it in the test suite? Or just provide a standalone code sinipper for testing?

danny0838 · 2026-06-25T11:10:31Z

Here's the benchmark:

import timeit
import struct
import random

random.seed(42)

class _Extra(bytes):
    FIELD_STRUCT = struct.Struct('<HH')

    def __new__(cls, val, id=None):
        return super().__new__(cls, val)

    def __init__(self, val, id=None):
        self.id = id

    @classmethod
    def read_one(cls, raw):
        try:
            xid, xlen = cls.FIELD_STRUCT.unpack(raw[:4])
        except struct.error:
            xid = None
            xlen = 0
        return cls(raw[:4+xlen], xid), raw[4+xlen:]

    @classmethod
    def split(cls, data):
        # use memoryview for zero-copy slices
        rest = memoryview(data)
        while rest:
            extra, rest = _Extra.read_one(rest)
            yield extra

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs."""
        return b''.join(
            ex
            for ex in cls.split(data)
            if ex.id not in xids
        )

def strip_extra_fields(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # early return for empty extra data
    if not data:
        return result

    # use memoryview for zero-copy slices
    mv = memoryview(data)
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_early_return(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # use memoryview for zero-copy slices
    mv = memoryview(data)
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_mv(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # early return for empty extra data
    if not data:
        return result

    mv = data
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_no_mv_no_early_return(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    mv = data
    mv_len = len(mv)
    pos = 0
    while pos + 4 <= mv_len:
        xid, xlen = struct.unpack_from('<HH', mv, pos)
        if pos + 4 + xlen > mv_len:
            break
        if xid not in field_ids:
            result.extend(mv[pos:pos + 4 + xlen])
        pos += 4 + xlen

    # keep remaining trailing bytes (e.g. truncated or malformed data)
    if pos < mv_len:
        result.extend(mv[pos:])

    return result

def strip_extra_fields_try(data, field_ids):
    """Remove Extra fields with specified IDs and return a bytearray.

    data should be bytes or bytearray.
    """
    result = bytearray()

    # early return for empty extra data
    if not data:
        return result

    # use memoryview for zero-copy slices
    data_len = len(data)
    pos = 0
    while pos < data_len:
        try:
            xid, xlen = struct.unpack_from('<HH', data, pos)
        except struct.error:
            xid, xlen = None, 0
        if xid not in field_ids:
            result.extend(data[pos:pos + 4 + xlen])
        pos += 4 + xlen

    return result

class _Extra2:
    @classmethod
    def iter(cls, data):
        """Iter through all subfields."""
        # early return for empty extra data
        if not data:
            return

        pos, data_len = 0, len(data)
        while pos < data_len:
            try:
                xid, xlen = struct.unpack_from('<HH', data, pos)
            except struct.error:
                xid, xlen = None, 0
            yield data[pos:pos + 4 + xlen], xid, xlen
            pos += 4 + xlen

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs and return a bytearray."""
        result = bytearray()
        for xfield, xid, _ in cls.iter(data):
            if xid not in xids:
                result.extend(xfield)
        return result

class _Extra3:
    @classmethod
    def iter(cls, data):
        """Iter through all subfields."""
        # early return for empty extra data
        if not data:
            return

        pos, data_len = 0, len(data)
        while pos < data_len:
            try:
                xid, xlen = struct.unpack_from('<HH', data, pos)
            except struct.error:
                xid, xlen = None, 0
            yield data[pos:pos + 4 + xlen], xid, xlen
            pos += 4 + xlen

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs and return a bytearray."""
        result = b''
        for xfield, xid, _ in cls.iter(data):
            if xid not in xids:
                result += xfield
        return result

class _Extra4:
    @classmethod
    def iter(cls, data):
        """Iter through all subfields."""
        # early return for empty extra data
        if not data:
            return

        pos, data_len = 0, len(data)
        while pos < data_len:
            try:
                xid, xlen = struct.unpack_from('<HH', data, pos)
            except struct.error:
                xid, xlen = None, 0
            yield data[pos:pos + 4 + xlen], xid, xlen
            pos += 4 + xlen

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs and return a bytearray."""
        return b''.join(
            xfield 
            for xfield, xid, _ in cls.iter(data) 
            if xid not in xids
        )

class _Extra5:
    @classmethod
    def iter(cls, data):
        """Iter through all subfields."""
        # early return for empty extra data
        if not data:
            return

        pos, data_len = 0, len(data)
        while pos < data_len:
            try:
                xid, xlen = struct.unpack_from('<HH', data, pos)
            except struct.error:
                xid, xlen = None, 0
            yield data[pos:pos + 4 + xlen], xid
            pos += 4 + xlen

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs and return a bytearray."""
        result = b''
        for xfield, xid in cls.iter(data):
            if xid not in xids:
                result += xfield
        return result

class _Extra6:
    FIELD_STRUCT = struct.Struct('<HH')
    @classmethod
    def iter(cls, data):
        """Iter through all subfields."""
        # early return for empty extra data
        if not data:
            return

        pos, data_len = 0, len(data)
        while pos < data_len:
            try:
                xid, xlen = cls.FIELD_STRUCT.unpack_from(data, pos)
            except struct.error:
                xid, xlen = None, 0
            yield data[pos:pos + 4 + xlen], xid, xlen
            pos += 4 + xlen

    @classmethod
    def strip(cls, data, xids):
        """Remove Extra fields with specified IDs."""
        result = b''
        for xfield, xid, _ in cls.iter(data):
            if xid not in xids:
                result += xfield
        return result

s = struct.Struct("<HH")
valid_extra_data = (
    s.pack(0x0001, 24) + b"X" * 24 +  # ZIP64 Extended Information
    s.pack(0x5455, 13) + b"\x07" + struct.pack("<LLL", 1767225600, 1767225600, 1767225600) + # Extended Timestamp
    s.pack(0x000a, 32) + b"\x00"*4 + struct.pack("<QQQ", 132537600000000000, 132537600000000000, 132537600000000000) + # NTFS Info
    s.pack(0x0001, 16) + b"Y" * 16 +  # ZIP64
    s.pack(0x7875, 11) + b"\x01\x04\xe8\x03\x00\x00\x04\xe8\x03\x00\x00" # Unix UID/GID
)
empty_extra_data = b""

dataset = [valid_extra_data if i % 2 == 0 else empty_extra_data for i in range(200)]
random.shuffle(dataset)

def run_case0():
    for data in dataset:
        _Extra.strip(data, (1,))

def run_case1():
    for data in dataset:
        strip_extra_fields(data, (1,))

def run_case2():
    for data in dataset:
        strip_extra_fields_no_early_return(data, (1,))

def run_case3():
    for data in dataset:
        strip_extra_fields_no_mv(data, (1,))

def run_case4():
    for data in dataset:
        strip_extra_fields_no_mv_no_early_return(data, (1,))

def run_case5():
    for data in dataset:
        strip_extra_fields_try(data, (1,))

def run_case6():
    for data in dataset:
        _Extra2.strip(data, (1,))

def run_case7():
    for data in dataset:
        _Extra3.strip(data, (1,))

def run_case8():
    for data in dataset:
        _Extra4.strip(data, (1,))

def run_case9():
    for data in dataset:
        _Extra5.strip(data, (1,))

def run_case10():
    for data in dataset:
        _Extra6.strip(data, (1,))

NUMBER = 10000

print(f"=== Benchmark Results ({NUMBER * len(dataset):,} total invocations) ===")
print(f"Dataset Profile: 50% Valid Extra Data, 50% Empty Data")
i = 0
while True:
    try:
        time = timeit.timeit(stmt=f'run_case{i}()', globals=globals(), number=NUMBER)
    except NameError:
        break

    if i == 0:
        time0 = time
    print(f'case{i}: {time:.4f} seconds ({time0 / time:.2f}x)')

    i += 1

import tracemalloc

def profile_memory(func):
    tracemalloc.start()
    for _ in range(50):
        func()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

print(f"\n=== Memory Peak Results ===")
i = 0
while True:
    try:
        peak = profile_memory(globals()[f'run_case{i}'])
    except KeyError:
        break

    if i == 0:
        peak0 = peak
    print(
        f'Peak Memory for case{i}: {peak:,} bytes '
        f'(saved {((peak0 - peak) / peak0) * 100:.1f}%)'
    )

    i += 1

And the results:

=== Benchmark Results (2,000,000 total invocations) ===
Dataset Profile: 50% Valid Extra Data, 50% Empty Data
case0: 13.3667 seconds (1.00x)
case1: 3.8258 seconds (3.49x)
case2: 3.8045 seconds (3.51x)
case3: 3.0988 seconds (4.31x)
case4: 3.0951 seconds (4.32x)
case5: 2.8227 seconds (4.74x)
case6: 4.3608 seconds (3.07x)
case7: 3.9348 seconds (3.40x)
case8: 5.4211 seconds (2.47x)
case9: 3.8251 seconds (3.49x)
case10: 4.1201 seconds (3.24x)

=== Memory Peak Results ===
Peak Memory for case0: 2,408 bytes (saved 0.0%)
Peak Memory for case1: 677 bytes (saved 71.9%)
Peak Memory for case2: 677 bytes (saved 71.9%)
Peak Memory for case3: 245 bytes (saved 89.8%)
Peak Memory for case4: 245 bytes (saved 89.8%)
Peak Memory for case5: 245 bytes (saved 89.8%)
Peak Memory for case6: 539 bytes (saved 77.6%)
Peak Memory for case7: 523 bytes (saved 78.3%)
Peak Memory for case8: 959 bytes (saved 60.2%)
Peak Memory for case9: 523 bytes (saved 78.3%)
Peak Memory for case10: 515 bytes (saved 78.6%)

The benchmark shows that the new implementation improves performance, the early return for empty data is helpful (assume that 50% is empty), removing memoryview improves performance, and try..except works better. Accordingly, I removed the memoryview usage.

danny0838 · 2026-06-28T13:11:48Z

Due to bug gh-152486, PR gh-152487 will be another caller of _Extra.strip from ZipInfo if it's to be implemented. Rebased to the latest main branch and reworked as zipfile._strip_extra_fields() module function. Other details are as previously discussed.

Introduce a helper `.update()` that simplifies the common operation of fields updating of extra data bytes, by automatically stripping existing subfields with identical IDs and inserts multiple fields in one step.

Introduce `.iter()` in place of `.strip()` to yield each subfield info as a tuple. This has better semantic and is more performant than subclassing `bytes`. The method name change also allows feature detection for the behavior changes. Remove `.read_one()` since it has never been used and is now obsolete. Make `_Extra` a normal class rather than a subclass of `bytes`, which is imperformant by having to copy the passed bytes when instantiated.

eendebakpt · 2026-06-30T19:01:28Z

@@ -0,0 +1,5 @@
+Rework :class:`!_Extra` in the :mod:`zipfile` module::


_Extra is an internal class, so not relevant for users. These details can be removed from the news entry.

For just the refactoring we can skip the news entry. For the performance improvement, I would like to see a benchmark of functionality from the public api (e.g. not a benchmark of _Extra.strip, but of opening or unzipping a file)

bedevere-app Bot mentioned this pull request Jun 25, 2026

Tidy and optimize zipfile module #152140

Open

bedevere-app Bot added the awaiting review label Jun 25, 2026

eendebakpt reviewed Jun 25, 2026

View reviewed changes

Comment thread Lib/zipfile/__init__.py Outdated

eendebakpt reviewed Jun 25, 2026

View reviewed changes

Comment thread Lib/zipfile/__init__.py Outdated

danny0838 force-pushed the gh-152140 branch from d50f67b to 2ff724f Compare June 28, 2026 13:06

danny0838 changed the title ~~gh-152140: Replace _Extra class with ZipFile._strip_extra_fields()~~ gh-152140: Replace _Extra class with _strip_extra_fields() in zipfile Jun 28, 2026

danny0838 force-pushed the gh-152140 branch from 040e50f to b7861ce Compare June 29, 2026 16:17

danny0838 changed the title ~~gh-152140: Replace _Extra class with _strip_extra_fields() in zipfile~~ gh-152140: Rework and optimize _Extra class in zipfile Jun 29, 2026

danny0838 added 7 commits June 30, 2026 18:17

Fix _Extra.split() in zipfile

ab99c2a

Fix incorrect comment in _decodeExtra

0e5796f

Optimize _Extra and add tests

80a2413

Introduce validate parameter for _Extra.split()

b1ef21a

Introduce _Extra.update()

3a2147e

Introduce a helper `.update()` that simplifies the common operation of fields updating of extra data bytes, by automatically stripping existing subfields with identical IDs and inserts multiple fields in one step.

Add news

1016775

danny0838 force-pushed the gh-152140 branch from b7861ce to 1016775 Compare June 30, 2026 17:22

eendebakpt reviewed Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

gh-152140: Rework and optimize `_Extra` class in `zipfile`#152141

gh-152140: Rework and optimize `_Extra` class in `zipfile`#152141
danny0838 wants to merge 7 commits into
python:mainfrom
danny0838:gh-152140

danny0838 commented Jun 25, 2026 •

edited

Loading

Uh oh!

eendebakpt left a comment

Uh oh!

Uh oh!

Uh oh!

danny0838 commented Jun 25, 2026 •

edited

Loading

Uh oh!

danny0838 commented Jun 25, 2026 •

edited

Loading

Uh oh!

danny0838 commented Jun 28, 2026 •

edited

Loading

Uh oh!

eendebakpt Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -0,0 +1,5 @@
		Rework :class:`!_Extra` in the :mod:`zipfile` module::

Uh oh!

Uh oh!

Conversation

danny0838 commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eendebakpt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

danny0838 commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danny0838 commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danny0838 commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eendebakpt Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

danny0838 commented Jun 25, 2026 •

edited

Loading

danny0838 commented Jun 25, 2026 •

edited

Loading

danny0838 commented Jun 25, 2026 •

edited

Loading

danny0838 commented Jun 28, 2026 •

edited

Loading