gh-152905: Decode LC_TIME items in nl_langinfo() from glibc wide data #152911

Open
serhiy-storchaka wants to merge 3 commits into python:main from serhiy-storchaka:gh-152905-nl-langinfo-wide

Conversation

@serhiy-storchaka

@serhiy-storchaka serhiy-storchaka commented Jul 2, 2026

Member

On glibc, decode the LC_TIME items from the wide (_NL_W*) locale data, so the result no longer depends on the LC_CTYPE encoding.

The wide constant is always _NL_W + the narrow name, so it is filled into langinfo_constants[] by token pasting — one scan yields both the item and its wide form. ERA has no wide counterpart and keeps the narrow path.
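
The naming rule above can be modelled in a few lines of Python. This is only an illustrative sketch: the real table lives in C and is filled by the preprocessor, and `NARROW_NAMES`, `NO_WIDE_FORM` and `wide_name()` are hypothetical names. Only the `_NL_W` prefix rule and the ERA exception are taken from the description.

```python
# Hypothetical model of the naming rule; this is not the C table itself.
NARROW_NAMES = ["DAY_1", "MON_1", "AM_STR", "D_T_FMT", "ERA"]

# Per the description, ERA has no wide counterpart and keeps the narrow path.
NO_WIDE_FORM = {"ERA"}


def wide_name(name):
    """Return the wide constant name for a narrow item, or None if absent."""
    if name in NO_WIDE_FORM:
        return None
    return "_NL_W" + name


# One pass over the narrow names yields both the item and its wide form.
table = {name: wide_name(name) for name in NARROW_NAMES}
for narrow, wide in table.items():
    print(f"{narrow:8} -> {wide}")
```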

🤖 Generated with Claude Code

…e data

On glibc, locale.nl_langinfo() now decodes the LC_TIME text items from the
wide (_NL_W*) locale data, independently of the LC_CTYPE encoding.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@read-the-docs-community

read-the-docs-community Bot commented Jul 2, 2026


Documentation build overview

📚 cpython-previews | 🛠️ Build #33431572 | 📁 Comparing 85d729d against main (31864bd)

  🔍 Preview build  

3 files changed
± library/concurrent.futures.html
± library/locale.html
± whatsnew/changelog.html

@serhiy-storchaka serhiy-storchaka requested a review from vstinner July 2, 2026 18:18
The encoding-independence guaranteed by the wide (_NL_W*) decode is
glibc-specific, so gate test_nl_langinfo_encoding_independent on glibc
(which also covers the previously skipped musl case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vstinner

vstinner commented Jul 3, 2026

Member

On my Fedora 44, this change is not only technical, it does actually change nl_langinfo() output on multiple locales.

Some examples.

  • Locale ast_ES.iso885915: MON_10: "d'ochobre" => MON_10: 'd’ochobre' (different quote: U+0027 => U+2019)
  • Locale br_FR.iso885915@euro: D_T_FMT: "D'ar %A %d a viz %B %Y %T" => 'Dʼar %A %d a viz %B %Y %T' (different quote)
  • Locale es_ES.iso885915@euro: AM_STR: 'a.\xa0m.' => 'a.\u202fm.'
  • Locale oc_FR.iso88591: MON_4: "d'abril" => 'd’abril' (different quote)
  • Locale ro_RO.iso88592: DAY_3: 'marţi' => DAY_3: 'marți' (U+0163 => U+021b)
  • Locale yi_US: ABDAY_2: "מאָנ'" => "מאָנ'" (U+05d0 U+05b8 => U+fb2f)
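
A quick way to look at three of these differences is to check whether the old and new characters can be represented in the legacy encoding at all. The sketch below assumes (this is not confirmed in the thread) that the narrow locale data substitutes a nearby character when the preferred one is missing from the charset. The `CASES` list and the `encodable()` helper are hypothetical; the code points are the ones listed above.

```python
# Each case: (legacy codec, character seen before the change, character after).
CASES = [
    ("iso8859-15", "\u0027", "\u2019"),  # ast_ES MON_10 apostrophe
    ("iso8859-15", "\u00a0", "\u202f"),  # es_ES AM_STR space
    ("iso8859-2", "\u0163", "\u021b"),   # ro_RO DAY_3 letter
]


def encodable(char, codec):
    """Return True if char can be represented in the given codec."""
    try:
        char.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


report = [
    (codec, encodable(old, codec), encodable(new, codec))
    for codec, old, new in CASES
]
for (codec, old_ok, new_ok), (_, old, new) in zip(report, CASES):
    print(f"{codec}: U+{ord(old):04X} encodable={old_ok}, "
          f"U+{ord(new):04X} encodable={new_ok}")
```

In each of these cases the old character exists in the legacy charset and the new one does not, which is consistent with the wide data carrying the intended character regardless of the locale encoding.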

I wrote this script to dump all nl_langinfo() values of all locales on Linux:

import locale
import subprocess

def get_all_locales():
    cmd = ['locale', '-a']
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, text=True)
    stdout = proc.stdout
    return stdout.splitlines()

langinfo_constants = [
    "DAY_1",
    "DAY_2",
    "DAY_3",
    "DAY_4",
    "DAY_5",
    "DAY_6",
    "DAY_7",
    "ABDAY_1",
    "ABDAY_2",
    "ABDAY_3",
    "ABDAY_4",
    "ABDAY_5",
    "ABDAY_6",
    "ABDAY_7",
    "MON_1",
    "MON_2",
    "MON_3",
    "MON_4",
    "MON_5",
    "MON_6",
    "MON_7",
    "MON_8",
    "MON_9",
    "MON_10",
    "MON_11",
    "MON_12",
    "ABMON_1",
    "ABMON_2",
    "ABMON_3",
    "ABMON_4",
    "ABMON_5",
    "ABMON_6",
    "ABMON_7",
    "ABMON_8",
    "ABMON_9",
    "ABMON_10",
    "ABMON_11",
    "ABMON_12",
    "RADIXCHAR",
    "THOUSEP",
    "CRNCYSTR",
    "D_T_FMT",
    "D_FMT",
    "T_FMT",
    "AM_STR",
    "PM_STR",
    "CODESET",
    "T_FMT_AMPM",
    "ERA",
    "ERA_D_FMT",
    "ERA_D_T_FMT",
    "ERA_T_FMT",
    "ALT_DIGITS",
    "YESEXPR",
    "NOEXPR",
    "_DATE_FMT",
]
langinfo_constants = [
    name
    for name in langinfo_constants
    if hasattr(locale, name)
]
langinfo_constants.sort()

all_locales = get_all_locales()
all_locales.sort()

for loc in all_locales:
    title = f"Locale {loc}"
    print(title)
    print("=" * len(title))
    print()

    locale.setlocale(locale.LC_ALL, loc)

    for name in langinfo_constants:
        key = getattr(locale, name)
        value = locale.nl_langinfo(key)
        print(f'{name}: {value!r}')
    print()

print(f"Total: nl_langinfo() values: {len(langinfo_constants)} per locale")
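
To spot changes like the ones listed above, the output of this script from two builds can be compared line by line. The following is a minimal sketch, assuming both dumps were produced on the same machine so that they contain the same locales and items in the same order; `changed_items()` and the two sample strings are hypothetical and only mimic the dump format.

```python
def changed_items(before, after):
    """Return (old, new) pairs for lines that differ between two dumps.

    Assumes both dumps have the same lines in the same order, which holds
    when the same script runs against the same set of installed locales.
    """
    changes = []
    for old, new in zip(before.splitlines(), after.splitlines()):
        if old != new:
            changes.append((old, new))
    return changes


# Sample input in the dump format, using the oc_FR value quoted above.
before = "Locale oc_FR.iso88591\nCODESET: 'ISO-8859-1'\nMON_4: \"d'abril\"\n"
after = "Locale oc_FR.iso88591\nCODESET: 'ISO-8859-1'\nMON_4: 'd\u2019abril'\n"

for old, new in changed_items(before, after):
    print(f"{old} => {new}")
```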

Comment thread Lib/test/test__locale.py Outdated
                values.append([nl_langinfo(item) for item in items])
            if len(values) < 2:
                continue
            with self.subTest(locales=avail):

Member

I don't see the purpose of avail. Here, it's always equal to locs (except that it's a list instead of a tuple). I suggest removing avail.

Comment thread Lib/test/test__locale.py Outdated
Comment on lines +319 to +320
for other in values[1:]:
    self.assertEqual(values[0], other)

Member

This loop would only be needed if an entry in variants had more than 2 locales. Currently every entry has exactly 2, so the loop seems overly complicated for what amounts to:

self.assertEqual(values[0], values[1])

Comment thread Lib/test/test__locale.py
                    self.assertEqual(values[0], other)
            tested = True
        if not tested:
            self.skipTest('no suitable locale pairs')

Member

Currently, when the test fails, it generates a long output which can be hard to debug (I modified the code to inject a bug on purpose):

======================================================================
FAIL: test_nl_langinfo_encoding_independent (test.test__locale._LocaleTests.test_nl_langinfo_encoding_independent) (locales=['el_GR.UTF-8', 'el_GR.ISO8859-7'])
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/vstinner/python/main/Lib/test/test__locale.py", line 320, in test_nl_langinfo_encoding_independent
    self.assertEqual(values[0], values[1])
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Lists differ: ['Ιαν[365 chars]μμ', '%a %d %b %Y %T %Z', '%d/%m/%Y', '%T', '%I:%M:%S %p', ''] != ['Ιαν[365 chars]μμ', '%a %d %b %Y %T %Z', '%d/%m/%Y', '%T', '%I:%M:%S %p', 'x']

First differing element 44:
''
'x'

  ['Ιανουαρίου',
   'Φεβρουαρίου',
   'Μαρτίου',
   'Απριλίου',
   'Μαΐου',
   'Ιουνίου',
   'Ιουλίου',
   'Αυγούστου',
   'Σεπτεμβρίου',
   'Οκτωβρίου',
   'Νοεμβρίου',
   'Δεκεμβρίου',
   'Ιαν',
   'Φεβ',
   'Μαρ',
   'Απρ',
   'Μαΐ',
   'Ιουν',
   'Ιουλ',
   'Αυγ',
   'Σεπ',
   'Οκτ',
   'Νοε',
   'Δεκ',
   'Κυριακή',
   'Δευτέρα',
   'Τρίτη',
   'Τετάρτη',
   'Πέμπτη',
   'Παρασκευή',
   'Σάββατο',
   'Κυρ',
   'Δευ',
   'Τρι',
   'Τετ',
   'Πεμ',
   'Παρ',
   'Σαβ',
   'πμ',
   'μμ',
   '%a %d %b %Y %T %Z',
   '%d/%m/%Y',
   '%T',
   '%I:%M:%S %p',
-  '']
+  'x']
?   +

An alternative is to compare a single value rather than comparing two arrays:

    @unittest.skipUnless(nl_langinfo, "nl_langinfo is not available")
    @unittest.skipUnless(libc_ver()[0] == 'glibc',
                         "wide nl_langinfo variants are glibc-specific")
    def test_nl_langinfo_encoding_independent(self):
        # gh-152905: The LC_TIME text items are decoded independently of the
        # LC_CTYPE encoding (on glibc via the wide nl_langinfo variants), so
        # the same locale in different encodings yields identical strings.
        self.addCleanup(setlocale, LC_TIME, setlocale(LC_TIME))

        names = [f'MON_{i}' for i in range(1, 13)]
        names += [f'ABMON_{i}' for i in range(1, 13)]
        names += [f'DAY_{i}' for i in range(1, 8)]
        names += [f'ABDAY_{i}' for i in range(1, 8)]
        names += ['AM_STR', 'PM_STR',
                  'D_T_FMT', 'D_FMT', 'T_FMT']
        if hasattr(locale, 'T_FMT_AMPM'):
            names.append('T_FMT_AMPM')
        if hasattr(locale, 'ALT_DIGITS'):
            names.append('ALT_DIGITS')
        items = [(name, getattr(locale, name)) for name in names]

        # The same language in a Unicode and a legacy encoding.
        variants = [
            ('ja_JP.UTF-8', 'ja_JP.EUC-JP'),
            ('fr_FR.UTF-8', 'fr_FR.ISO8859-1'),
            ('el_GR.UTF-8', 'el_GR.ISO8859-7'),
        ]
        tested = False
        for locs in variants:
            values = []
            for loc in locs:
                try:
                    setlocale(LC_TIME, loc)
                except Error:
                    continue
                values.append({name: nl_langinfo(item) for name, item in items})
            if len(values) < 2:
                continue
            tested = True

            for name, item in items:
                with self.subTest(locales=locs, name=name):
                    self.assertEqual(values[0][name], values[1][name])
        if not tested:
            self.skipTest('no suitable locale pairs')

Comment thread Lib/test/test__locale.py Outdated
if hasattr(locale, 'T_FMT_AMPM'):
    items.append(locale.T_FMT_AMPM)
if hasattr(locale, 'ALT_DIGITS'):
    items.append(locale.ALT_DIGITS)

Member

You should also test _DATE_FMT, no?

Why not test ERA_D_FMT, ERA_D_T_FMT and ERA_T_FMT as well?

Comment thread Lib/test/test__locale.py
def test_nl_langinfo_encoding_independent(self):
    # gh-152905: The LC_TIME text items are decoded independently of the
    # LC_CTYPE encoding (on glibc via the wide nl_langinfo variants), so
    # the same locale in different encodings yields identical strings.

Member

Please mention that ERA has no wide-character variant and so is not tested.

Comment thread Modules/_localemodule.c Outdated
#endif
#ifdef ERA
    if (item == ERA && *result) {
        pyresult = decode_strings(result, SIZE_MAX);

Member

You could remove the max_count parameter of decode_strings() since it's no longer needed.

Comment thread Doc/library/locale.rst

.. versionchanged:: next
   On glibc, the ``LC_TIME`` items are now decoded
   independently of the ``LC_CTYPE`` encoding.

Member

Except for ERA, no?

@vstinner

vstinner commented Jul 3, 2026

Member

> On my Fedora 44, this change is not only technical, it does actually change nl_langinfo() output on multiple locales.

I'm fine with the change anyway. But since the nl_langinfo() output changes on some locales, I would prefer to not backport this change.

@serhiy-storchaka

Member Author

> On my Fedora 44, this change is not only technical, it does actually change nl_langinfo() output on multiple locales.

I know, this is the point. Our strftime wraps wcsftime instead of strftime if possible. On glibc it produces results consistent with the wide nl_langinfo. If we reimplement it in Python (there are such plans), we will need a wide nl_langinfo.

On the other hand, I found these discrepancies when I tried to use nl_langinfo in strptime. strptime should be permissive in any case: it should accept the output of both wcsftime and strftime, normalize apostrophes, etc. The current code already does this (or much of it; I need to check my unmerged patches).
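
As an illustration of what "permissive" could mean here, the sketch below folds the character variants seen in the examples earlier in this thread into a single form before comparing. It is a hypothetical helper, not the normalization that strptime currently performs; the `_NORMALIZE` table and `normalize()` are invented for this example.

```python
# Hypothetical normalization table; the actual strptime code may differ.
_NORMALIZE = str.maketrans({
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK -> ASCII apostrophe
    "\u02bc": "'",   # MODIFIER LETTER APOSTROPHE  -> ASCII apostrophe
    "\u00a0": " ",   # NO-BREAK SPACE              -> plain space
    "\u202f": " ",   # NARROW NO-BREAK SPACE       -> plain space
})


def normalize(text):
    """Fold apostrophe and space variants so both spellings compare equal."""
    return text.translate(_NORMALIZE)


# The narrow and wide spellings reported for ast_ES and es_ES above.
pairs = [
    ("d'ochobre", "d\u2019ochobre"),
    ("a.\u00a0m.", "a.\u202fm."),
]
for narrow, wide in pairs:
    print(normalize(narrow) == normalize(wide))
```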

* Rewrite test_nl_langinfo_encoding_independent to compare each item
  individually (clearer failures), listing only the legacy locales and
  deriving the UTF-8 variant; broaden coverage to 20 locales across 17
  legacy encodings.  Also test ERA_D_FMT, ERA_D_T_FMT, ERA_T_FMT and
  _DATE_FMT; note that ERA has no wide variant and is not tested.
* Drop the now-unused max_count parameter of decode_strings().
* Mention in the docs that ERA is not affected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>