zoneinfo: pure-Python POSIX TZ unquoted abbreviation regex accepts whitespace or non-ASCII letters (C rejects)

# Bug report

The pure-Python `zoneinfo._parse_tz_str` accepts a POSIX TZ string whose unquoted std/dst abbreviation contains characters that are not ASCII letters (for example an embedded space or a non-ASCII letter), but the C implementation rejects the same string. The two implementations disagree on what is a valid TZ string.

Verified on `main` (3.16.0a0). Both implementations are reached via `ZoneInfo.from_file()` (the same path `TZStrTest.test_invalid_tzstr` uses), with a TZif v3 file that has no in-band transitions so the footer drives the parse:

```
'AB C3'                  pure=ACCEPT  C=REJECT (ValueError: Invalid STD offset in b'AB C3')
' A B 3'                 pure=ACCEPT  C=REJECT (ValueError: Invalid STD format in b' A B 3')
'AAA4BB B,J60/2,J300/2'  pure=ACCEPT  C=REJECT (ValueError: Invalid DST offset in b'AAA4BB B,J60/2,J300/2')
'ABÀC3' (À = U+00C0)     pure=ACCEPT  C=REJECT (ValueError: Invalid STD offset in b'AB\xc3\x80C3')
```

For `'AB C3'` the pure parser captures `std='AB C'`, leaving `3` as the offset. For `'AAA4BB B,...'` it captures `dst='BB B'` with an embedded space. For the non-ASCII case `'ABÀC3'` the pure parser captures `std='ABÀC'`. The C implementation stops the abbreviation at the first byte that is not an ASCII letter and then fails on the leftover character before the digit.

The non-ASCII case is reachable through the public `from_file` path, not just the private parser. `_zoneinfo.py:266` does `_parse_tz_str(tz_str.decode())`, a default UTF-8 decode, so a footer `b'AB\xc3\x80C3'` decodes to `'ABÀC3'` and reaches the pure parser. The unpatched pure parser accepts it while the C implementation rejects it. (Only an *invalid* UTF-8 footer, for example a lone `b'\xc0'`, raises `UnicodeDecodeError` during the decode; a valid multibyte non-ASCII letter decodes cleanly and is parsed.)
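The decode step described above can be checked in isolation. This is a minimal sketch that exercises only `bytes.decode()` with its UTF-8 default; it does not build a TZif file or call `from_file`, and the variable names are illustrative.

```python
# A valid multibyte footer decodes cleanly, so it would be handed to the parser.
footer = b"AB\xc3\x80C3"
decoded = footer.decode()  # no argument: the default codec is UTF-8
print(repr(decoded), len(decoded))

# A lone 0xC0 byte is never valid UTF-8, so decoding fails before any TZ parsing.
decode_error = None
try:
    b"\xc0".decode()
except UnicodeDecodeError as exc:
    decode_error = exc
print(type(decode_error).__name__)
```

The five-character result shows that `À` arrives at the parser as a single non-ASCII code point rather than as two raw bytes.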

## Root cause

`Lib/zoneinfo/_zoneinfo.py:643` and `:647` define the abbreviation alternative as a negated character class:

```python
(?P<std>[^<0-9:.+-]+|<[a-zA-Z0-9+-]+>)   # line 643
...
(?P<dst>[^0-9:.+-]+|<[a-zA-Z0-9+-]+>)    # line 647
```

The unquoted alternative `[^<0-9:.+-]+` excludes only `<`, digits, `:`, `.`, `+`, and `-`. Everything else, including spaces, tabs, and non-ASCII letters, is admitted. (`re.ASCII` on line 652 does not help here: it restricts only shorthand classes such as `\w`, `\d`, and `\s`, not an explicit negated character class. This was confirmed empirically.)
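The behavior of the negated class can be reproduced outside `zoneinfo`. The sketch below isolates only the unquoted `std` alternative quoted above (it is not the full TZ-string pattern) and compiles it with `re.ASCII`; the name `UNQUOTED_STD` is made up for illustration.

```python
import re

# Only the unquoted alternative of the std group, isolated for illustration.
UNQUOTED_STD = re.compile(r"[^<0-9:.+-]+", re.ASCII)

candidates = ["AB C", " A B ", "AB\u00c0C", "A\tB"]
for candidate in candidates:
    # None of these characters is in the excluded set, so each one matches in full.
    print(repr(candidate), bool(UNQUOTED_STD.fullmatch(candidate)))

# re.ASCII does affect shorthand classes: \w stops matching U+00C0.
unicode_word = bool(re.fullmatch(r"\w+", "AB\u00c0C"))
ascii_word = bool(re.fullmatch(r"\w+", "AB\u00c0C", re.ASCII))
print(unicode_word, ascii_word)
```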

The C `parse_abbr` (`Modules/_zoneinfo.c:1767-1781`) walks the unquoted form with `Py_ISALPHA`:

```c
else {
    str_start = ptr;
    // From the POSIX standard:
    //
    //   In the unquoted form, all characters in these fields shall be
    //   alphabetic characters from the portable character set in the
    //   current locale.
    while (Py_ISALPHA(*ptr)) {
        ptr++;
    }
    str_end = ptr;
    if (str_end == str_start) {
        return -1;
    }
}
```

`Py_ISALPHA` is ASCII-only (`_Py_ctype_table[Py_CHARMASK(c)] & PY_CTF_ALPHA`; only ASCII `a-z`/`A-Z` carry `PY_CTF_ALPHA`). So for `'AB C3'` the C loop stops at the space, the abbreviation is `AB`, and `parse_tz_delta` then chokes on the space before `3`. For `' A B 3'` the leading space yields an empty abbreviation, raising `Invalid STD format`. For `'ABÀC3'` the loop stops at the first byte of the multibyte `À`, the abbreviation is `AB`, and the leftover bytes before `3` cause `Invalid STD offset`.
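For comparison, the unquoted branch of the C loop can be modeled in a few lines of Python. This is a rough sketch under the assumption that only ASCII `a-z`/`A-Z` bytes count as alphabetic, as described above; `scan_unquoted_abbr` is a hypothetical helper, not a function in CPython, and it does not reproduce the C error messages.

```python
import string

# Byte values of ASCII letters, standing in for the Py_ISALPHA check.
ASCII_LETTERS = frozenset(string.ascii_letters.encode("ascii"))

def scan_unquoted_abbr(data: bytes) -> tuple[bytes, bytes]:
    """Hypothetical model of the unquoted branch: return (abbr, rest)."""
    end = 0
    while end < len(data) and data[end] in ASCII_LETTERS:
        end += 1
    if end == 0:
        # Mirrors the `str_end == str_start` rejection in the C code.
        raise ValueError("empty unquoted abbreviation")
    return data[:end], data[end:]

print(scan_unquoted_abbr(b"AB C3"))
print(scan_unquoted_abbr("AB\u00c0C3".encode()))
```

In both calls the abbreviation ends at `AB`, and the remainder starts with a byte that the offset parser cannot accept, which is where the C implementation reports its error.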

## Which implementation should change, and why this is a parity fix

This aligns the pure implementation with the C implementation; it is not a behavior-policy change. The reachable surface is the TZif v2+ footer parsed by `from_file` / `_load_file` (`Lib/zoneinfo/_zoneinfo.py:266`), which is governed by RFC 8536. RFC 8536 section 3.3.1 specifies the footer uses the POSIX TZ string grammar, so the unquoted-abbreviation restriction applies. POSIX (The Open Group Base Specifications Issue 8, XBD 8.3) states for the unquoted form:

> In the unquoted form, all characters in these fields shall be alphabetic characters from the portable character set in the current locale.

The C source already cites this same text at `Modules/_zoneinfo.c:1769-1773`. The quoted `<...>` form is the documented escape hatch for any other character, and this patch leaves it untouched. glibc has historically tolerated extra characters in unquoted abbreviations, but the governing grammar (POSIX via RFC 8536) restricts the unquoted form to alphabetic characters from the portable character set, the C accelerator already enforces that restriction, and the quoted form remains available for anything else. The change therefore brings the pure implementation into line with both the spec and the C implementation, rather than moving it away from them.

## Fix

Tighten the unquoted alternative to ASCII letters, matching the C `Py_ISALPHA` loop, and keep the `<...>` quoted branch untouched:

```python
(?P<std>[a-zA-Z]+|<[a-zA-Z0-9+-]+>)
(?P<dst>[a-zA-Z]+|<[a-zA-Z0-9+-]+>)
```

`[a-zA-Z]+` is 1-or-more, deliberately matching the C loop which only rejects an empty unquoted run (`str_end == str_start`). 1-2 character abbreviations such as `A3` and `AB3` are accepted by both implementations today and stay accepted, so accept-length parity is preserved. A 3-char minimum (`[a-zA-Z]{3,}`) is intentionally not imposed: it would itself be a new divergence from C. (The comment at `Lib/zoneinfo/_zoneinfo.py:632` that says std/dst "must be 3 or more characters long" reflects POSIX display guidance and is not enforced by either implementation.)
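The effect of the tightened class on abbreviations alone can be sketched as follows. The two patterns are the `std` alternatives from this report taken out of the full TZ-string regex, so this compares abbreviation matching only, not complete TZ strings; `OLD_STD` and `NEW_STD` are illustrative names.

```python
import re

# std alternatives before and after the change, isolated from the full pattern.
OLD_STD = re.compile(r"[^<0-9:.+-]+|<[a-zA-Z0-9+-]+>", re.ASCII)
NEW_STD = re.compile(r"[a-zA-Z]+|<[a-zA-Z0-9+-]+>", re.ASCII)

samples = ["A", "AB", "IST", "<+0330>", "<-04>", "AB C", "AB\u00c0C", " A B "]
results = {
    s: (bool(OLD_STD.fullmatch(s)), bool(NEW_STD.fullmatch(s))) for s in samples
}
for s, (old, new) in results.items():
    print(f"{s!r:12} old={old} new={new}")
```

Short alphabetic abbreviations and the quoted forms match under both patterns; only the entries containing a space or `À` change from accepted to rejected.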

The change was checked against the system IANA tz database (`/usr/share/zoneinfo`), which holds 599 non-symlink TZif files carrying a v2+ footer. These contain 94 distinct POSIX TZ footers, none of them non-ASCII. `_parse_tz_str` itself produces an identical parse outcome on every footer before and after the change (0 differences), so no real-world zone changes behavior. Zones such as `<+0330>-3:30`, `<-04>4<-03>,...`, and `IST-2IDT,...` keep working through the `<...>` branch and the plain-alpha branch.
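A survey of this kind could be approximated with a short script. The sketch below is not necessarily the script behind the numbers above: `extract_footer` and `collect_footers` are hypothetical helpers, and the extraction assumes the footer is the final newline-delimited line of a version 2+ TZif file.

```python
from pathlib import Path
from typing import Optional

def extract_footer(data: bytes) -> Optional[bytes]:
    """Return the TZ-string footer of a TZif v2+ blob, or None if absent."""
    if data[:4] != b"TZif" or data[4:5] < b"2":
        return None  # not TZif, or a version 1 file, which has no footer
    if not data.endswith(b"\n"):
        return None
    # The footer sits between the last two newline bytes of the file.
    _, sep, footer = data[:-1].rpartition(b"\n")
    return footer if sep else None

def collect_footers(root: Path) -> dict[bytes, int]:
    """Count distinct footers across non-symlink files under root."""
    counts: dict[bytes, int] = {}
    for path in root.rglob("*"):
        if path.is_symlink() or not path.is_file():
            continue
        footer = extract_footer(path.read_bytes())
        if footer is not None:
            counts[footer] = counts.get(footer, 0) + 1
    return counts
```

Calling `collect_footers(Path("/usr/share/zoneinfo"))` and feeding each decoded key to the parser before and after the patch would give a before/after comparison; the counts depend on the installed tz database.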



### Linked PRs
* gh-152249
