Bug report
The pure-Python zoneinfo._parse_tz_str accepts a POSIX TZ string whose unquoted std/dst abbreviation contains characters that are not ASCII letters (for example an embedded space or a non-ASCII letter), but the C implementation rejects the same string. The two implementations disagree on what is a valid TZ string.
Verified on main (3.16.0a0). Both implementations are reached via ZoneInfo.from_file() (the same path TZStrTest.test_invalid_tzstr uses), with a TZif v3 file that has no in-band transitions so the footer drives the parse:
'AB C3' pure=ACCEPT C=REJECT (ValueError: Invalid STD offset in b'AB C3')
' A B 3' pure=ACCEPT C=REJECT (ValueError: Invalid STD format in b' A B 3')
'AAA4BB B,J60/2,J300/2' pure=ACCEPT C=REJECT (ValueError: Invalid DST offset in b'AAA4BB B,J60/2,J300/2')
'ABÀC3' (À = U+00C0) pure=ACCEPT C=REJECT (ValueError: Invalid STD offset in b'AB\xc3\x80C3')
For 'AB C3' the pure parser captures std='AB C', leaving 3 as the offset. For 'AAA4BB B,...' it captures dst='BB B' with an embedded space. For the non-ASCII case 'ABÀC3' the pure parser captures std='ABÀC'. The C implementation stops the abbreviation at the first byte that is not an ASCII letter and then fails on the leftover character before the digit.
The non-ASCII case is reachable through the public from_file path, not just the private parser. _zoneinfo.py:266 does _parse_tz_str(tz_str.decode()), a default UTF-8 decode, so a footer b'AB\xc3\x80C3' decodes to 'ABÀC3' and reaches the pure parser. The unpatched pure parser accepts it while the C implementation rejects it. (Only an invalid UTF-8 footer, for example a lone b'\xc0', raises UnicodeDecodeError during the decode; a valid multibyte non-ASCII letter decodes cleanly and is parsed.)
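The decode step can be checked on its own. This is a minimal sketch that assumes only the default `bytes.decode()` behaviour described above; `decode_footer` is a hypothetical helper for illustration, not a zoneinfo function.

```python
def decode_footer(footer: bytes):
    """Decode footer bytes with the default (strict UTF-8) bytes.decode().

    Returns the decoded string, or None when the bytes are not valid UTF-8.
    """
    try:
        return footer.decode()
    except UnicodeDecodeError:
        return None


# A valid multibyte letter decodes cleanly and would reach the parser.
print(repr(decode_footer(b"AB\xc3\x80C3")))
# A lone 0xC0 byte is never valid UTF-8, so the decode itself fails.
print(repr(decode_footer(b"\xc0")))
```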
Root cause
Lib/zoneinfo/_zoneinfo.py:643 and :647 define the abbreviation alternative as a negated character class:
(?P<std>[^<0-9:.+-]+|<[a-zA-Z0-9+-]+>) # line 643
...
(?P<dst>[^0-9:.+-]+|<[a-zA-Z0-9+-]+>) # line 647
The unquoted alternative [^<0-9:.+-]+ only excludes <, digits, :, ., +, -. Everything else, including spaces, tabs, and non-ASCII letters, is admitted. (re.ASCII on line 652 does not help here: it constrains only the \w/\d/\s shorthands, not an explicit negated literal class, confirmed empirically.)
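What the negated class admits can be reproduced outside zoneinfo. The sketch below pairs the std alternative quoted above with a simplified integer-only offset group; that offset group and the helper name `old_std_capture` are assumptions for illustration, not the full pattern from `_zoneinfo.py`.

```python
import re

# The std alternative as quoted above, followed by a simplified offset
# (sign plus digits only). The real pattern has more groups than this.
_OLD = r"(?P<std>[^<0-9:.+-]+|<[a-zA-Z0-9+-]+>)(?P<offset>[+-]?\d+)"


def old_std_capture(tz_str, flags=re.ASCII):
    """Return what the old std group captures, or None if nothing matches."""
    m = re.fullmatch(_OLD, tz_str, flags)
    return m.group("std") if m else None


for s in ("AB C3", " A B 3", "AB\u00c0C3", "<+0330>-3"):
    # Print the capture with and without re.ASCII; the flag changes nothing
    # because the class is an explicit literal set, not a shorthand.
    print(repr(s), repr(old_std_capture(s)), repr(old_std_capture(s, flags=0)))
```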
The C parse_abbr (Modules/_zoneinfo.c:1767-1781) walks the unquoted form with Py_ISALPHA:
    else {
        str_start = ptr;
        // From the POSIX standard:
        //
        //   In the unquoted form, all characters in these fields shall be
        //   alphabetic characters from the portable character set in the
        //   current locale.
        while (Py_ISALPHA(*ptr)) {
            ptr++;
        }
        str_end = ptr;
        if (str_end == str_start) {
            return -1;
        }
    }
Py_ISALPHA is ASCII-only (_Py_ctype_table[Py_CHARMASK(c)] & PY_CTF_ALPHA; only ASCII a-z/A-Z carry PY_CTF_ALPHA). So for 'AB C3' the C loop stops at the space, the abbreviation is AB, and parse_tz_delta then chokes on the space before 3. For ' A B 3' the leading space yields an empty abbreviation, raising Invalid STD format. For 'ABÀC3' the loop stops at the first byte of the multibyte À, the abbreviation is AB, and the leftover bytes before 3 cause Invalid STD offset.
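The byte-level behaviour of that loop can be modelled in Python. This is a rough sketch of the unquoted branch only, assuming Py_ISALPHA is true exactly for ASCII A-Z and a-z as stated above; `scan_unquoted_abbr` is a hypothetical name, and the sketch raises ValueError where the C code returns -1.

```python
def _is_ascii_alpha(byte: int) -> bool:
    # Stand-in for Py_ISALPHA: true only for ASCII A-Z and a-z.
    return 0x41 <= byte <= 0x5A or 0x61 <= byte <= 0x7A


def scan_unquoted_abbr(data: bytes):
    """Split data into (abbreviation, remainder) like the unquoted C branch.

    Raises ValueError for an empty abbreviation, which corresponds to the
    str_end == str_start check in the C code.
    """
    end = 0
    while end < len(data) and _is_ascii_alpha(data[end]):
        end += 1
    if end == 0:
        raise ValueError("empty unquoted abbreviation")
    return data[:end], data[end:]


# The remainder is what the offset parser would see next.
print(scan_unquoted_abbr(b"AB C3"))
print(scan_unquoted_abbr("AB\u00c0C3".encode()))
```

In both calls the abbreviation is `b'AB'`, and the remainder starts with a byte that is neither a sign nor a digit, which matches the "Invalid STD offset" errors reported above.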
Which implementation should change, and why this is a parity fix
This aligns the pure implementation with the C implementation; it is not a behavior-policy change. The reachable surface is the TZif v2+ footer parsed by from_file / _load_file (Lib/zoneinfo/_zoneinfo.py:266), which is governed by RFC 8536. RFC 8536 section 3.3 specifies that the footer uses the POSIX TZ string grammar, so the unquoted-abbreviation restriction applies. POSIX (The Open Group Base Specifications Issue 8, XBD 8.3) states for the unquoted form:
In the unquoted form, all characters in these fields shall be alphabetic characters from the portable character set in the current locale.
The C source already cites this same text at Modules/_zoneinfo.c:1769-1773. The quoted <...> form is the documented escape hatch for any other character, and this patch leaves it untouched. glibc has historically tolerated extra characters in unquoted abbreviations. However, the governing grammar (POSIX via RFC 8536) restricts the unquoted form to alphabetic characters from the portable character set, and the C accelerator already enforces that restriction. The change therefore brings the pure implementation into line with both the spec and the C implementation, rather than away from them.
Fix
Tighten the unquoted alternative to ASCII letters, matching the C Py_ISALPHA loop, and keep the <...> quoted branch untouched:
(?P<std>[a-zA-Z]+|<[a-zA-Z0-9+-]+>)
(?P<dst>[a-zA-Z]+|<[a-zA-Z0-9+-]+>)
[a-zA-Z]+ is 1-or-more, deliberately matching the C loop which only rejects an empty unquoted run (str_end == str_start). 1-2 character abbreviations such as A3 and AB3 are accepted by both implementations today and stay accepted, so accept-length parity is preserved. A 3-char minimum ([a-zA-Z]{3,}) is intentionally not imposed: it would itself be a new divergence from C. (The comment at Lib/zoneinfo/_zoneinfo.py:632 that says std/dst "must be 3 or more characters long" reflects POSIX display guidance and is not enforced by either implementation.)
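The effect of the tightened class can be checked with a reduced pattern. As a sketch, the std alternative from the fix is paired here with a simplified integer-only offset group; that offset group and the helper name `new_accepts` are assumptions for illustration, not the full pattern from `_zoneinfo.py`.

```python
import re

# The proposed std alternative, followed by a simplified offset
# (sign plus digits only).
_NEW = re.compile(
    r"(?P<std>[a-zA-Z]+|<[a-zA-Z0-9+-]+>)(?P<offset>[+-]?\d+)", re.ASCII
)


def new_accepts(tz_str: str) -> bool:
    """Report whether the tightened pattern matches the whole string."""
    return _NEW.fullmatch(tz_str) is not None


# Short alphabetic and quoted abbreviations are still accepted ...
print([s for s in ("A3", "AB3", "IST-2", "<+0330>-3") if new_accepts(s)])
# ... while whitespace and non-ASCII letters are now rejected.
print([s for s in ("AB C3", " A B 3", "AB\u00c0C3") if not new_accepts(s)])
```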
Checked against the system IANA tz database (/usr/share/zoneinfo): across 94 distinct POSIX TZ footers (599 non-symlink TZif files carrying a v2+ footer, 0 of them non-ASCII), the actual _parse_tz_str function produces an identical parse outcome on every footer before and after the change (0 differences), so no real-world zone changes behavior. Zones such as <+0330>-3:30, <-04>4<-03>,..., and IST-2IDT,... keep working through the <...> branch and the plain-alpha branch.
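One way to collect the footers for such a comparison is sketched below. It is not necessarily how the check above was run: `extract_footer` and `distinct_footers` are hypothetical helpers, and the sketch assumes the RFC 8536 layout in which a v2+ file ends with a newline, the TZ string, and a final newline.

```python
from pathlib import Path


def extract_footer(data: bytes):
    """Return the footer TZ string of a v2+ TZif blob, or None if absent."""
    # Require the "TZif" magic, a version byte of "2" or later, and the
    # trailing newline that closes the footer.
    if data[:4] != b"TZif" or data[4:5] < b"2" or not data.endswith(b"\n"):
        return None
    # The TZ string contains no newline, so the last newline before the
    # final one is the footer's opening delimiter.
    start = data.rfind(b"\n", 0, len(data) - 1)
    if start == -1:
        return None
    return data[start + 1:-1]


def distinct_footers(root="/usr/share/zoneinfo"):
    """Collect distinct non-empty footers from non-symlink files under root."""
    footers = set()
    for path in Path(root).rglob("*"):
        if path.is_symlink() or not path.is_file():
            continue
        footer = extract_footer(path.read_bytes())
        if footer:
            footers.add(footer)
    return footers
```

Each collected footer can then be decoded and passed to the parser before and after the change to compare outcomes.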