gh-152905: Decode LC_TIME items in nl_langinfo() from glibc wide data #152911
serhiy-storchaka wants to merge 3 commits into
…e data On glibc, locale.nl_langinfo() now decodes the LC_TIME text items from the wide (_NL_W*) locale data, independently of the LC_CTYPE encoding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
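As a rough illustration of why decoding from wide data removes the dependency on the encoding (this is a standalone sketch, not code from the PR): narrow locale strings are bytes whose meaning depends on a codec, while wide strings are sequences of code points. The sample month name is taken from the el_GR output quoted later in this thread; the variable names are made up for the example.

```python
# A Greek month name as it might be stored in two narrow encodings.
text = 'Ιανουαρίου'
narrow_utf8 = text.encode('utf-8')
narrow_greek = text.encode('iso8859-7')

# The narrow forms differ, so decoding needs to know the right codec.
same_bytes = narrow_utf8 == narrow_greek
# Decoding ISO 8859-7 bytes as UTF-8 does not give back the original text.
wrong = narrow_greek.decode('utf-8', errors='replace')

# A wide string is just code points: no codec is involved in decoding it.
wide = [ord(ch) for ch in text]
decoded_from_wide = ''.join(chr(cp) for cp in wide)
```

Here `decoded_from_wide` equals `text` whatever encoding the locale uses, whereas the narrow bytes only round-trip when decoded with the codec that produced them.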
The encoding-independence guaranteed by the wide (_NL_W*) decode is glibc-specific, so gate test_nl_langinfo_encoding_independent on glibc (which also covers the previously skipped musl case). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
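A minimal sketch of such a gate, using the same `libc_ver()[0] == 'glibc'` check that appears in the test proposed later in this thread. The helper `running_on_glibc()` and the `Example` test case are hypothetical names used only for illustration.

```python
import platform
import unittest


def running_on_glibc():
    # libc_ver() inspects the running interpreter binary by default and
    # returns a (name, version) tuple; the name is 'glibc' on glibc builds.
    return platform.libc_ver()[0] == 'glibc'


class Example(unittest.TestCase):  # hypothetical test case
    @unittest.skipUnless(running_on_glibc(),
                         "wide nl_langinfo variants are glibc-specific")
    def test_glibc_only_behaviour(self):
        # Only runs where the gate above evaluated to True.
        self.assertEqual(platform.libc_ver()[0], 'glibc')
```

Because the check is a positive match on glibc, any other libc is skipped without needing a separate condition for it.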
On my Fedora 44, this change is not purely technical: it actually changes the nl_langinfo() output for multiple locales.
I wrote this script to dump all nl_langinfo() values of all locales on Linux:

import locale
import subprocess
def get_all_locales():
    cmd = ['locale', '-a']
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, text=True)
    stdout = proc.stdout
    return stdout.splitlines()
langinfo_constants = [
"DAY_1",
"DAY_2",
"DAY_3",
"DAY_4",
"DAY_5",
"DAY_6",
"DAY_7",
"ABDAY_1",
"ABDAY_2",
"ABDAY_3",
"ABDAY_4",
"ABDAY_5",
"ABDAY_6",
"ABDAY_7",
"MON_1",
"MON_2",
"MON_3",
"MON_4",
"MON_5",
"MON_6",
"MON_7",
"MON_8",
"MON_9",
"MON_10",
"MON_11",
"MON_12",
"ABMON_1",
"ABMON_2",
"ABMON_3",
"ABMON_4",
"ABMON_5",
"ABMON_6",
"ABMON_7",
"ABMON_8",
"ABMON_9",
"ABMON_10",
"ABMON_11",
"ABMON_12",
"RADIXCHAR",
"THOUSEP",
"CRNCYSTR",
"D_T_FMT",
"D_FMT",
"T_FMT",
"AM_STR",
"PM_STR",
"CODESET",
"T_FMT_AMPM",
"ERA",
"ERA_D_FMT",
"ERA_D_T_FMT",
"ERA_T_FMT",
"ALT_DIGITS",
"YESEXPR",
"NOEXPR",
"_DATE_FMT",
]
langinfo_constants = [
    name
    for name in langinfo_constants
    if hasattr(locale, name)
]
langinfo_constants.sort()
all_locales = get_all_locales()
all_locales.sort()
for loc in all_locales:
    title = f"Locale {loc}"
    print(title)
    print("=" * len(title))
    print()
    locale.setlocale(locale.LC_ALL, loc)
    for name in langinfo_constants:
        key = getattr(locale, name)
        value = locale.nl_langinfo(key)
        print(f'{name}: {value!r}')
    print()
print(f"Total: nl_langinfo() values: {len(langinfo_constants)} per locale")
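To compare two locales directly instead of diffing two full dumps, something along these lines might help. It is only a sketch: `langinfo_items()` and `diff_items()` are hypothetical helpers, not part of the script above or of the PR, and unavailable locales are reported as `None` rather than raising.

```python
import locale


def langinfo_items(loc, names):
    """Return {name: value} for locale *loc*, or None if it is unavailable."""
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_ALL, loc)
        except locale.Error:
            return None
        return {name: locale.nl_langinfo(getattr(locale, name))
                for name in names if hasattr(locale, name)}
    finally:
        # Always restore the locale that was active on entry.
        locale.setlocale(locale.LC_ALL, saved)


def diff_items(a, b):
    """Return {name: (value_a, value_b)} for the names whose values differ."""
    return {name: (a[name], b.get(name))
            for name in a if a[name] != b.get(name)}
```

For example, `diff_items(langinfo_items('el_GR.UTF-8', names), langinfo_items('el_GR.ISO8859-7', names))` would list only the items that differ between the two encodings, provided both locales are installed.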
        values.append([nl_langinfo(item) for item in items])
    if len(values) < 2:
        continue
    with self.subTest(locales=avail):
I don't see the purpose of avail. Here, it's always equal to locs (except that it's a list instead of a tuple). I suggest removing avail.
    for other in values[1:]:
        self.assertEqual(values[0], other)
This loop would only be needed if the entries of variants had more than 2 locales. Currently each entry has exactly 2 locales, so the loop seems overly complicated for what amounts to:
self.assertEqual(values[0], values[1])
            self.assertEqual(values[0], other)
    tested = True
if not tested:
    self.skipTest('no suitable locale pairs')
Currently, when the test fails, it generates a long output which can be hard to debug (I modified the code to inject a bug on purpose):
======================================================================
FAIL: test_nl_langinfo_encoding_independent (test.test__locale._LocaleTests.test_nl_langinfo_encoding_independent) (locales=['el_GR.UTF-8', 'el_GR.ISO8859-7'])
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/vstinner/python/main/Lib/test/test__locale.py", line 320, in test_nl_langinfo_encoding_independent
self.assertEqual(values[0], values[1])
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Lists differ: ['Ιαν[365 chars]μμ', '%a %d %b %Y %T %Z', '%d/%m/%Y', '%T', '%I:%M:%S %p', ''] != ['Ιαν[365 chars]μμ', '%a %d %b %Y %T %Z', '%d/%m/%Y', '%T', '%I:%M:%S %p', 'x']
First differing element 44:
''
'x'
['Ιανουαρίου',
'Φεβρουαρίου',
'Μαρτίου',
'Απριλίου',
'Μαΐου',
'Ιουνίου',
'Ιουλίου',
'Αυγούστου',
'Σεπτεμβρίου',
'Οκτωβρίου',
'Νοεμβρίου',
'Δεκεμβρίου',
'Ιαν',
'Φεβ',
'Μαρ',
'Απρ',
'Μαΐ',
'Ιουν',
'Ιουλ',
'Αυγ',
'Σεπ',
'Οκτ',
'Νοε',
'Δεκ',
'Κυριακή',
'Δευτέρα',
'Τρίτη',
'Τετάρτη',
'Πέμπτη',
'Παρασκευή',
'Σάββατο',
'Κυρ',
'Δευ',
'Τρι',
'Τετ',
'Πεμ',
'Παρ',
'Σαβ',
'πμ',
'μμ',
'%a %d %b %Y %T %Z',
'%d/%m/%Y',
'%T',
'%I:%M:%S %p',
- '']
+ 'x']
? +
An alternative is to compare a single value rather than comparing two arrays:
@unittest.skipUnless(nl_langinfo, "nl_langinfo is not available")
@unittest.skipUnless(libc_ver()[0] == 'glibc',
                     "wide nl_langinfo variants are glibc-specific")
def test_nl_langinfo_encoding_independent(self):
    # gh-152905: The LC_TIME text items are decoded independently of the
    # LC_CTYPE encoding (on glibc via the wide nl_langinfo variants), so
    # the same locale in different encodings yields identical strings.
    self.addCleanup(setlocale, LC_TIME, setlocale(LC_TIME))
    names = [f'MON_{i}' for i in range(1, 13)]
    names += [f'ABMON_{i}' for i in range(1, 13)]
    names += [f'DAY_{i}' for i in range(1, 8)]
    names += [f'ABDAY_{i}' for i in range(1, 8)]
    names += ['AM_STR', 'PM_STR',
              'D_T_FMT', 'D_FMT', 'T_FMT']
    if hasattr(locale, 'T_FMT_AMPM'):
        names.append('T_FMT_AMPM')
    if hasattr(locale, 'ALT_DIGITS'):
        names.append('ALT_DIGITS')
    items = [(name, getattr(locale, name)) for name in names]
    # The same language in a Unicode and a legacy encoding.
    variants = [
        ('ja_JP.UTF-8', 'ja_JP.EUC-JP'),
        ('fr_FR.UTF-8', 'fr_FR.ISO8859-1'),
        ('el_GR.UTF-8', 'el_GR.ISO8859-7'),
    ]
    tested = False
    for locs in variants:
        values = []
        for loc in locs:
            try:
                setlocale(LC_TIME, loc)
            except Error:
                continue
            values.append({name: nl_langinfo(item) for name, item in items})
        if len(values) < 2:
            continue
        tested = True
        for name, item in items:
            with self.subTest(locales=locs, name=name):
                self.assertEqual(values[0][name], values[1][name])
    if not tested:
        self.skipTest('no suitable locale pairs')

if hasattr(locale, 'T_FMT_AMPM'):
    items.append(locale.T_FMT_AMPM)
if hasattr(locale, 'ALT_DIGITS'):
    items.append(locale.ALT_DIGITS)
You should also test _DATE_FMT, no?
Why not also test ERA_D_FMT, ERA_D_T_FMT and ERA_T_FMT?
def test_nl_langinfo_encoding_independent(self):
    # gh-152905: The LC_TIME text items are decoded independently of the
    # LC_CTYPE encoding (on glibc via the wide nl_langinfo variants), so
    # the same locale in different encodings yields identical strings.
Please mention that ERA has no wide character variant and so is not tested.
#endif
#ifdef ERA
    if (item == ERA && *result) {
        pyresult = decode_strings(result, SIZE_MAX);
You could remove the max_count parameter of decode_strings() since it's no longer needed.
.. versionchanged:: next
   On glibc, the ``LC_TIME`` items are now decoded
   independently of the ``LC_CTYPE`` encoding.
I'm fine with the change anyway. But since the nl_langinfo() output changes on some locales, I would prefer not to backport this change.
I know, this is a point. Our … On the other hand, I found these discrepancies when I tried to use …
* Rewrite test_nl_langinfo_encoding_independent to compare each item individually (clearer failures), listing only the legacy locales and deriving the UTF-8 variant; broaden coverage to 20 locales across 17 legacy encodings. Also test ERA_D_FMT, ERA_D_T_FMT, ERA_T_FMT and _DATE_FMT; note that ERA has no wide variant and is not tested.
* Drop the now-unused max_count parameter of decode_strings().
* Mention in the docs that ERA is not affected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
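The commit does not show how the UTF-8 variant is derived from the legacy locale name. One plausible way to do it, as a sketch with a hypothetical helper `utf8_variant()`, is to replace the codeset part of the name and keep any `@modifier` suffix:

```python
def utf8_variant(legacy):
    """Map e.g. 'el_GR.ISO8859-7' to 'el_GR.UTF-8' (hypothetical helper).

    An '@modifier' suffix, if present, is preserved.
    """
    base, sep, modifier = legacy.partition('@')
    language = base.partition('.')[0]  # drop the '.CODESET' part
    return f"{language}.UTF-8{sep}{modifier}"


# Listing only the legacy names keeps the table short; the pairs are derived.
legacy_locales = ['ja_JP.EUC-JP', 'fr_FR.ISO8859-1', 'el_GR.ISO8859-7']
pairs = [(utf8_variant(name), name) for name in legacy_locales]
```

The three names in `legacy_locales` are the pairs from the earlier version of the test; the derived UTF-8 locale still has to be installed for a pair to be usable.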
On glibc, decode the LC_TIME items from the wide (_NL_W*) locale data, so the result no longer depends on the LC_CTYPE encoding.
The wide constant is always _NL_W + the narrow name, so it is filled into langinfo_constants[] by token pasting: one scan yields both the item and its wide form. ERA has no wide counterpart and keeps the narrow path.
🤖 Generated with Claude Code