zoneinfo: pure-Python POSIX TZ unquoted abbreviation regex accepts whitespace or non-ASCII letters (C rejects) #152248

Description

@tonghuaroot

Bug report

The pure-Python zoneinfo._parse_tz_str accepts a POSIX TZ string whose unquoted std/dst abbreviation contains characters that are not ASCII letters (for example an embedded space or a non-ASCII letter), but the C implementation rejects the same string. The two implementations disagree on what is a valid TZ string.

Verified on main (3.16.0a0). Both implementations are reached via ZoneInfo.from_file() (the same path TZStrTest.test_invalid_tzstr uses), with a TZif v3 file that has no in-band transitions so the footer drives the parse:

'AB C3'                  pure=ACCEPT  C=REJECT (ValueError: Invalid STD offset in b'AB C3')
' A B 3'                 pure=ACCEPT  C=REJECT (ValueError: Invalid STD format in b' A B 3')
'AAA4BB B,J60/2,J300/2'  pure=ACCEPT  C=REJECT (ValueError: Invalid DST offset in b'AAA4BB B,J60/2,J300/2')
'ABÀC3' (À = U+00C0)     pure=ACCEPT  C=REJECT (ValueError: Invalid STD offset in b'AB\xc3\x80C3')

For 'AB C3' the pure parser captures std='AB C', leaving 3 as the offset. For 'AAA4BB B,...' it captures dst='BB B' with an embedded space. For the non-ASCII case 'ABÀC3' the pure parser captures std='ABÀC'. The C implementation stops the abbreviation at the first byte that is not an ASCII letter and then fails on the leftover character before the digit.
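A reproduction along these lines can be sketched as follows. This is a minimal harness, not the script used for the results above: `build_tzif` and `try_load` are hypothetical helper names, the synthetic file assumes an all-zero header is enough for a transition-free TZif v3 file, and it relies on the private modules `zoneinfo._zoneinfo` (pure Python) and `_zoneinfo` (C), the latter of which may be missing on some builds.

```python
import io
import struct
from zoneinfo import _zoneinfo as py_zoneinfo  # pure-Python implementation (private module)

try:
    import _zoneinfo as c_zoneinfo  # C accelerator (private); may be absent
except ImportError:
    c_zoneinfo = None


def build_tzif(tz_str: bytes) -> bytes:
    """Return a TZif v3 file with no transitions and tz_str as its footer."""
    # Magic, version, 15 reserved bytes, then six zeroed 32-bit counts.
    header = b"TZif" + b"3" + b"\x00" * 15 + struct.pack(">6l", 0, 0, 0, 0, 0, 0)
    # With every count at zero, the v1 and v2+ data blocks are both empty,
    # so the file is just the header twice followed by the footer.
    return header + header + b"\n" + tz_str + b"\n"


def try_load(module, tz_str: bytes) -> str:
    """Load the synthetic file through module.ZoneInfo.from_file and report the outcome."""
    try:
        module.ZoneInfo.from_file(io.BytesIO(build_tzif(tz_str)))
    except ValueError as exc:
        return f"REJECT ({type(exc).__name__}: {exc})"
    return "ACCEPT"


if __name__ == "__main__":
    for text in ["AB C3", " A B 3", "AAA4BB B,J60/2,J300/2", "AB\u00c0C3"]:
        footer = text.encode()  # UTF-8, matching the default decode in the loader
        pure = try_load(py_zoneinfo, footer)
        c = try_load(c_zoneinfo, footer) if c_zoneinfo else "unavailable"
        print(f"{text!r:26} pure={pure}  C={c}")
```

The output depends on the interpreter under test; on a build that already carries the fix, both columns should report a rejection.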

The non-ASCII case is reachable through the public from_file path, not just the private parser. _zoneinfo.py:266 does _parse_tz_str(tz_str.decode()), a default UTF-8 decode, so a footer b'AB\xc3\x80C3' decodes to 'ABÀC3' and reaches the pure parser. The unpatched pure parser accepts it while the C implementation rejects it. (Only an invalid UTF-8 footer, for example a lone b'\xc0', raises UnicodeDecodeError during the decode; a valid multibyte non-ASCII letter decodes cleanly and is parsed.)
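The decode step can be checked in isolation. The sketch below uses only `bytes.decode()` with its UTF-8 default; `decode_footer` and `fails_to_decode` are hypothetical helper names, not functions in zoneinfo.

```python
def decode_footer(raw: bytes) -> str:
    """Decode footer bytes with bytes.decode(), whose default codec is UTF-8."""
    return raw.decode()


def fails_to_decode(raw: bytes) -> bool:
    """Return True if the footer bytes are not valid UTF-8."""
    try:
        raw.decode()
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    # A valid two-byte sequence for U+00C0 decodes cleanly and reaches the parser.
    print(repr(decode_footer(b"AB\xc3\x80C3")))
    # A lone 0xC0 byte is never valid UTF-8, so the decode itself raises.
    print(fails_to_decode(b"AB\xc0C3"))
```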

Root cause

Lib/zoneinfo/_zoneinfo.py:643 and :647 define the abbreviation alternative as a negated character class:

(?P<std>[^<0-9:.+-]+|<[a-zA-Z0-9+-]+>)   # line 643
...
(?P<dst>[^0-9:.+-]+|<[a-zA-Z0-9+-]+>)    # line 647

The unquoted std alternative [^<0-9:.+-]+ excludes only <, digits, :, ., +, and -; the dst alternative [^0-9:.+-]+ does not even exclude <. Everything else, including spaces, tabs, and non-ASCII letters, is admitted. (re.ASCII on line 652 does not help here: it affects shorthand classes such as \w, \d, and \s, not an explicit negated character class. This was confirmed empirically.)
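The point about re.ASCII can be checked with a few lines. This sketch tests only the unquoted std character class quoted above, outside the full parser; `leading_abbr` is a hypothetical helper name.

```python
import re

UNQUOTED_STD = r"[^<0-9:.+-]+"  # the negated class quoted from line 643


def leading_abbr(text: str, flags: int = 0):
    """Return what the negated class consumes at the start of text, or None."""
    match = re.compile(UNQUOTED_STD, flags).match(text)
    return match.group() if match else None


if __name__ == "__main__":
    for sample in ["AB C3", "AB\u00c0C3", "\tX9"]:
        # The flag makes no difference for an explicit negated class.
        print(repr(sample), repr(leading_abbr(sample)), repr(leading_abbr(sample, re.ASCII)))
    # By contrast, re.ASCII does change what the \w shorthand matches.
    print(re.compile(r"\w+", re.ASCII).match("AB\u00c0C3").group())
    print(re.compile(r"\w+").match("AB\u00c0C3").group())
```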

The C parse_abbr (Modules/_zoneinfo.c:1767-1781) walks the unquoted form with Py_ISALPHA:

else {
    str_start = ptr;
    // From the POSIX standard:
    //
    //   In the unquoted form, all characters in these fields shall be
    //   alphabetic characters from the portable character set in the
    //   current locale.
    while (Py_ISALPHA(*ptr)) {
        ptr++;
    }
    str_end = ptr;
    if (str_end == str_start) {
        return -1;
    }
}

Py_ISALPHA is ASCII-only (_Py_ctype_table[Py_CHARMASK(c)] & PY_CTF_ALPHA; only ASCII a-z/A-Z carry PY_CTF_ALPHA). So for 'AB C3' the C loop stops at the space, the abbreviation is AB, and parse_tz_delta then chokes on the space before 3. For ' A B 3' the leading space yields an empty abbreviation, raising Invalid STD format. For 'ABÀC3' the loop stops at the first byte of the multibyte À, the abbreviation is AB, and the leftover bytes before 3 cause Invalid STD offset.
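For readers who prefer Python, the unquoted branch of the C loop can be modelled roughly as below. This is an illustration of the behaviour described above, not a translation of parse_abbr: `split_unquoted_abbr` is a hypothetical name, and the quoted <...> branch is not modelled.

```python
import string

# Byte values of ASCII a-z and A-Z, standing in for Py_ISALPHA.
ASCII_LETTERS = frozenset(string.ascii_letters.encode())


def split_unquoted_abbr(tz: bytes):
    """Split tz into (abbreviation, remainder), or return None for an empty run."""
    end = 0
    # Advance while the current byte is an ASCII letter.
    while end < len(tz) and tz[end] in ASCII_LETTERS:
        end += 1
    if end == 0:
        # Mirrors the str_end == str_start check that makes the C code return -1.
        return None
    return tz[:end], tz[end:]


if __name__ == "__main__":
    for raw in [b"AB C3", b" A B 3", "AB\u00c0C3".encode(), b"A3"]:
        print(raw, split_unquoted_abbr(raw))
```

In each rejected case the remainder starts with a byte that is neither a sign nor a digit, which is where the offset parsing then fails.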

Which implementation should change, and why this is a parity fix

This aligns the pure implementation with the C implementation; it is not a behavior-policy change. The reachable surface is the TZif v2+ footer parsed by from_file / _load_file (Lib/zoneinfo/_zoneinfo.py:266), which is governed by RFC 8536. RFC 8536 section 3.3.1 specifies the footer uses the POSIX TZ string grammar, so the unquoted-abbreviation restriction applies. POSIX (The Open Group Base Specifications Issue 8, XBD 8.3) states for the unquoted form:

In the unquoted form, all characters in these fields shall be alphabetic characters from the portable character set in the current locale.

The C source already cites this same text at Modules/_zoneinfo.c:1769-1773. The quoted <...> form is the documented escape hatch for any other character, and this patch leaves it untouched. glibc historically tolerates extra characters in unquoted abbreviations, but the governing grammar (POSIX via RFC 8536) restricts the unquoted form to alphabetic characters from the portable character set, the C accelerator already enforces that, and the quoted form remains available for anything else. So this change brings the pure-Python implementation into line with both the spec and the C implementation, rather than away from them.

Fix

Tighten the unquoted alternative to ASCII letters, matching the C Py_ISALPHA loop, and keep the <...> quoted branch untouched:

(?P<std>[a-zA-Z]+|<[a-zA-Z0-9+-]+>)
(?P<dst>[a-zA-Z]+|<[a-zA-Z0-9+-]+>)

[a-zA-Z]+ is 1-or-more, deliberately matching the C loop, which only rejects an empty unquoted run (str_end == str_start). 1-2 character abbreviations such as A3 and AB3 are accepted by both implementations today and stay accepted, so both implementations continue to accept the same abbreviation lengths. A 3-char minimum ([a-zA-Z]{3,}) is intentionally not imposed: it would itself be a new divergence from C. (The comment at Lib/zoneinfo/_zoneinfo.py:632 that says std/dst "must be 3 or more characters long" reflects POSIX display guidance and is not enforced by either implementation.)
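The effect of the tightened alternatives can be sketched with a simplified stand-in for the parser regex. Only the std/dst alternatives are taken from the lines quoted in this issue; the offset sub-pattern, the catch-all rule group, the use of fullmatch, and the `build_parser` helper are assumptions made for this illustration, and "EST5EDT,M3.2.0,M11.1.0" is just a familiar sample string.

```python
import re

QUOTED = r"<[a-zA-Z0-9+-]+>"
OFFSET = r"[+-]?\d{1,3}(?::\d{2}(?::\d{2})?)?"  # assumed offset grammar


def build_parser(std_unquoted: str, dst_unquoted: str) -> re.Pattern:
    """Assemble a simplified TZ-string pattern around the given unquoted classes."""
    return re.compile(
        rf"(?P<std>{std_unquoted}|{QUOTED})"
        rf"(?:(?P<stdoff>{OFFSET})"
        rf"(?:(?P<dst>{dst_unquoted}|{QUOTED})(?P<dstoff>{OFFSET})?)?)?"
        rf"(?:,(?P<rule>.+))?",
        re.ASCII,
    )


OLD = build_parser(r"[^<0-9:.+-]+", r"[^0-9:.+-]+")  # classes before the fix
NEW = build_parser(r"[a-zA-Z]+", r"[a-zA-Z]+")       # classes proposed here

if __name__ == "__main__":
    samples = [
        "AB C3", " A B 3", "AAA4BB B,J60/2,J300/2", "AB\u00c0C3",   # should be rejected
        "A3", "AB3", "<+0330>-3:30", "EST5EDT,M3.2.0,M11.1.0",      # should stay accepted
    ]
    for tz in samples:
        old = "ACCEPT" if OLD.fullmatch(tz) else "REJECT"
        new = "ACCEPT" if NEW.fullmatch(tz) else "REJECT"
        print(f"{tz!r:28} old={old}  new={new}")
```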

Checked against the system IANA tz database (/usr/share/zoneinfo): across 94 distinct POSIX TZ footers (599 non-symlink TZif files carrying a v2+ footer, 0 of them non-ASCII), the actual _parse_tz_str function produces an identical parse outcome on every footer before and after the change (0 differences), so no real-world zone changes behavior. Zones such as <+0330>-3:30, <-04>4<-03>,..., and IST-2IDT,... keep working through the <...> branch and the plain-alpha branch.
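A survey of this kind could be sketched as below. This is not the script behind the numbers above: `read_footer`, `distinct_footers`, and `changed_footers` are hypothetical helpers, the footer is located with a simple "last newline-delimited line" heuristic rather than by walking the TZif data blocks, and the two parser callables are supplied by the caller (for example, the parser before and after the change, wrapped so that they return comparable values).

```python
import os


def read_footer(data: bytes):
    """Return the footer TZ string of a TZif v2+ file, or None if there is none."""
    # Version 1 files carry a NUL version byte and have no footer.
    if data[:4] != b"TZif" or data[4:5] in (b"", b"\x00"):
        return None
    if not data.endswith(b"\n"):
        return None
    # Heuristic: the footer sits between the last two newlines of the file.
    start = data.rfind(b"\n", 0, len(data) - 1)
    return None if start == -1 else data[start + 1 : -1]


def distinct_footers(root: str) -> set:
    """Collect the distinct non-empty footers of non-symlink TZif files under root."""
    footers = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                with open(path, "rb") as fobj:
                    footer = read_footer(fobj.read())
            except OSError:
                continue
            if footer:
                footers.add(footer)
    return footers


def changed_footers(footers, before, after) -> list:
    """Return the footers whose outcome differs between two parser callables."""
    def outcome(parse, raw):
        try:
            return ("ok", parse(raw.decode()))
        except ValueError as exc:  # UnicodeDecodeError is a ValueError subclass
            return ("error", type(exc).__name__)
    return sorted(f for f in footers if outcome(before, f) != outcome(after, f))


if __name__ == "__main__":
    found = distinct_footers("/usr/share/zoneinfo")
    print(len(found), "distinct footers,",
          sum(not f.isascii() for f in found), "non-ASCII")
```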

Labels: stdlib, type-bug