gh-152248: Reject a POSIX TZ abbreviation with non-ASCII-letter characters in pure-Python zoneinfo #152249
tonghuaroot wants to merge 5 commits into
tzstr = "ABÀC3"
footer = tzstr.encode("utf-8")

def from_footer():
We can give zone_from_tzstr a new parameter for encoding rather than duplicating.
Done, added an encoding parameter to zone_from_tzstr and reused it. I kept this a separate method only because the C and pure errors differ (bytes repr vs decoded text), so each is matched against its own message.
parser_re = re.compile(
    r"""
-   (?P<std>[^<0-9:.+-]+|<[a-zA-Z0-9+-]+>)
+   (?P<std>[a-zA-Z]+|<[a-zA-Z0-9+-]+>)
And I see another divergence: C accepts an empty <>. :'-(
Good catch. The direction is the reverse of this PR though: here C is the lenient side. Its parse_abbr quoted branch has no empty check, while its own unquoted branch rejects an empty run (if (str_end == str_start) return -1;), so the pure parser is correct. Want me to fold a small C fix in here, or open a separate issue?
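For reference, a minimal sketch of why the pure parser already rejects an empty <>. It uses only the quoted alternative copied from the diff above; the helper name is_valid_quoted is hypothetical and the rest of the parser regex is left out, so this illustrates the `+` quantifier rather than the real parsing path.

```python
import re

# Quoted alternative as it appears in the diff under review. The "+"
# quantifier requires at least one character between the angle brackets.
QUOTED_ABBR = re.compile(r"<[a-zA-Z0-9+-]+>")


def is_valid_quoted(abbr):
    """Hypothetical helper: True if abbr is a well-formed quoted abbreviation."""
    return QUOTED_ABBR.fullmatch(abbr) is not None


print(is_valid_quoted("<-03>"))  # True
print(is_valid_quoted("<>"))     # False
```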
Please add it here, it's in the scope of POSIX TZ strings. This is actually spelled out by recent versions of the standard:
the quoting characters do not contribute to the three byte minimum length and {TZNAME_MAX} maximum length.
Done. The C parser now rejects an empty <>, mirroring its unquoted branch.
The pure-Python zoneinfo parser accepts a POSIX TZ string whose unquoted std/dst abbreviation contains characters other than ASCII letters (for example an embedded space or a non-ASCII letter), while the C implementation rejects it. The unquoted alternative in the parser regex is a negated class ([^<0-9:.+-]+) that admits anything except a few delimiters, whereas the C parse_abbr walks the unquoted form with Py_ISALPHA (ASCII letters only), as POSIX (via RFC 8536) requires for the unquoted form.

This tightens the unquoted alternative to [a-zA-Z]+, matching the C accelerator and POSIX, and leaves the quoted <...> form untouched. Every well-formed TZ string and all bundled IANA zones still parse unchanged; only the malformed strings that were previously accepted now raise ValueError.

The non-ASCII case is reachable through the public from_file path, which UTF-8-decodes the footer, so it is covered by a dedicated regression test in addition to the whitespace cases added to the shared invalid_tzstrs list.
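The effect of the tightening can be sketched in isolation. The two std alternatives below are copied from the diff; the OFFSET pattern and the accepts helper are simplified, hypothetical stand-ins for the rest of the parser regex, so this is an illustration under those assumptions and not the actual zoneinfo code.

```python
import re

# Simplified stand-in for the offset that follows the abbreviation
# (assumption: the real parser regex is richer than this).
OFFSET = r"(?P<stdoff>[+-]?\d{1,2})"

# Old and new "std" alternatives, as shown in the diff.
OLD = re.compile(r"(?P<std>[^<0-9:.+-]+|<[a-zA-Z0-9+-]+>)" + OFFSET)
NEW = re.compile(r"(?P<std>[a-zA-Z]+|<[a-zA-Z0-9+-]+>)" + OFFSET)


def accepts(pattern, tzstr):
    """Hypothetical helper: True if the whole string matches the pattern."""
    return pattern.fullmatch(tzstr) is not None


for tzstr in ["EST5", "<-03>3", "ABÀC3", "A B3"]:
    print(repr(tzstr), accepts(OLD, tzstr), accepts(NEW, tzstr))
```

Under these assumptions, EST5 and <-03>3 match both patterns, while ABÀC3 and A B3 match only the old negated class.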