fix: preserve non-ASCII (CJK) path segments in auto-generated project name #624
Open
mvanhorn wants to merge 1 commit into
… name
Percent-encode non-ASCII bytes in cbm_project_name_from_path so distinct CJK paths keep recoverable, collision-free slugs; accept '%' in cbm_validate_project_name. Bound the slug length with a hash suffix so deep multibyte paths stay within the OS filename-component limit.
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
What does this PR do?
Fixes #571:
`cbm_project_name_from_path()` derived an auto project slug by mapping every byte outside `[A-Za-z0-9._-]` to `-` and then collapsing dashes. For a path such as `/Users/yunxin/Desktop/开发/后端`, every UTF-8 byte of the CJK segments is unsafe, so each segment collapsed to a single `-` that was then trimmed away, yielding `Users-yunxin-Desktop`. Two different non-ASCII directories under the same ASCII parent collapsed to the same name, so the identifying information was silently lost and distinct repos collided. Users with non-Latin directory names saw indexing create a DB under a truncated, unrecognizable name.
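The collision described above can be reproduced with a small stand-in for the old mapping. This is a minimal sketch, not the code from `src/pipeline/fqn.c`: the names `is_safe_byte` and `old_slug_from_path` are hypothetical, and the dot collapse mentioned elsewhere in this PR is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes that survive unchanged in a slug: [A-Za-z0-9._-]. */
static int is_safe_byte(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '.' || c == '_' || c == '-';
}

/* Hypothetical stand-in for the pre-fix behavior: every unsafe byte becomes
 * '-', runs of '-' collapse to one, and leading/trailing '-' are trimmed.
 * Writes a NUL-terminated slug into out (capacity cap) and returns its
 * length. Output is truncated rather than overflowed if cap is too small. */
size_t old_slug_from_path(const char *path, char *out, size_t cap) {
    assert(path != NULL && out != NULL && cap > 0);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)path; *p; p++) {
        char c = is_safe_byte(*p) ? (char)*p : '-';
        if (c == '-' && (n == 0 || out[n - 1] == '-')) {
            continue; /* skip leading dashes and collapse repeats */
        }
        if (cap - n < 2) {
            break; /* keep room for the terminating NUL */
        }
        out[n++] = c;
    }
    while (n > 0 && out[n - 1] == '-') {
        n--; /* trim trailing dashes */
    }
    out[n] = '\0';
    return n;
}
```

With this mapping, every byte of `开发` and of `后端` is unsafe, so two sibling directories with those names under `/Users/yunxin/Desktop` both reduce to `Users-yunxin-Desktop`.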
This change percent-encodes non-ASCII bytes instead of discarding them, so non-ASCII paths keep a distinct, recoverable name. It also keeps the slug generator in agreement with `cbm_validate_project_name()` -- the same invariant that motivated the #349 space fix (a name that indexes but `resolve_store` later rejects reports project-not-found).
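A rough sketch of the percent-encoding idea and of the generator/validator invariant follows. The function names `slug_from_path_sketch` and `name_is_valid_sketch` are hypothetical, and the validator's accepted byte set is an assumption built only from the rules this description names (it is not the body of `cbm_validate_project_name()`). The dot collapse and the length bound are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes that survive unchanged in a slug: [A-Za-z0-9._-]. */
static int is_safe_byte(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '.' || c == '_' || c == '-';
}

/* Hypothetical sketch: bytes >= 0x80 become uppercase %HH, unsafe ASCII
 * bytes become '-', dash runs collapse, leading/trailing '-' are trimmed.
 * Returns the slug length; output is truncated if cap is too small. */
size_t slug_from_path_sketch(const char *path, char *out, size_t cap) {
    static const char hex[] = "0123456789ABCDEF";
    assert(path != NULL && out != NULL && cap > 0);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)path; *p; p++) {
        if (*p >= 0x80) {
            if (cap - n < 4) {
                break; /* need three bytes plus the NUL */
            }
            out[n++] = '%';
            out[n++] = hex[*p >> 4];
            out[n++] = hex[*p & 0x0F];
            continue;
        }
        char c = is_safe_byte(*p) ? (char)*p : '-';
        if (c == '-' && (n == 0 || out[n - 1] == '-')) {
            continue; /* skip leading dashes and collapse repeats */
        }
        if (cap - n < 2) {
            break;
        }
        out[n++] = c;
    }
    while (n > 0 && out[n - 1] == '-') {
        n--; /* trim trailing dashes */
    }
    out[n] = '\0';
    return n;
}

/* Assumed validator rules: reject empty names, a leading '.', any "..",
 * and any byte outside [A-Za-z0-9._%-], which also excludes '/' and '\'. */
int name_is_valid_sketch(const char *name) {
    if (name == NULL || name[0] == '\0' || name[0] == '.') {
        return 0;
    }
    if (strstr(name, "..") != NULL) {
        return 0;
    }
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        if (!is_safe_byte(*p) && *p != '%') {
            return 0;
        }
    }
    return 1;
}
```

The point of pairing the two functions is that the generator only ever emits `%` as part of a `%HH` triplet, so accepting `%` in the validator is enough for generated names to pass; a literal ASCII `%` in a path still maps to `-`.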
Changes:
`src/pipeline/fqn.c` -- `cbm_project_name_from_path()` now percent-encodes any byte `>= 0x80` as uppercase `%HH` into a new, larger buffer, while ASCII separators / unsafe bytes still map to `-`. ASCII slugs are byte-identical to before (e.g. `/home/u/my project` -> `home-u-my-project`). The existing dash/dot collapse and leading/trailing trim are preserved. Because the slug is later used as `<cache>/<name>.db`, long slugs are bounded to 200 bytes with a deterministic FNV-1a hash suffix so distinct long paths still produce distinct names and stay within the OS filename-component limit. Includes overflow guards and correct memory management.
`src/foundation/str_util.c` -- `cbm_validate_project_name()` additionally accepts `%`. It remains filesystem-safe; the existing `..`, `/`, `\`, and leading-`.` rejections are unchanged.
`tests/test_fqn.c` -- new coverage for CJK percent-encoding, distinctness, the long-path bound, the `%`-accepting validator, and an unchanged-ASCII regression.
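The length bound can be illustrated as follows. Only the 200-byte limit and the use of FNV-1a come from the description above; the 64-bit variant, the `-` plus 16 hex digit suffix layout, hashing the full untruncated slug, and the names `fnv1a64` and `bound_slug_sketch` are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SLUG_MAX 200     /* bound named in the PR description */
#define HASH_HEX_LEN 16  /* assumption: 64-bit hash printed as 16 hex digits */

/* Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime. */
uint64_t fnv1a64(const char *data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)data[i];
        h *= 0x100000001b3ULL; /* FNV prime */
    }
    return h;
}

/* Hypothetical sketch: slugs of at most SLUG_MAX bytes are copied as-is.
 * Longer slugs keep a prefix and gain "-<hash of the full slug>", so the
 * result is exactly SLUG_MAX bytes. cap must be greater than SLUG_MAX.
 * Returns the length of the string written to out. */
size_t bound_slug_sketch(const char *slug, char *out, size_t cap) {
    assert(slug != NULL && out != NULL && cap > SLUG_MAX);
    size_t len = strlen(slug);
    if (len <= SLUG_MAX) {
        memcpy(out, slug, len + 1); /* includes the NUL */
        return len;
    }
    size_t keep = SLUG_MAX - HASH_HEX_LEN - 1; /* prefix + '-' + hash */
    memcpy(out, slug, keep);
    snprintf(out + keep, cap - keep, "-%016llx",
             (unsigned long long)fnv1a64(slug, len));
    return keep + 1 + HASH_HEX_LEN;
}
```

Hashing the whole slug rather than only the discarded tail means two long paths that share a 183-byte prefix still get different suffixes unless the hash itself collides. A fixed-width suffix also keeps the output length predictable for the `<cache>/<name>.db` filename.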
Example:
`/Users/yunxin/Desktop/开发/后端` now yields `Users-yunxin-Desktop-%E5%BC%80%E5%8F%91-%E5%90%8E%E7%AB%AF` (distinct, valid) instead of `Users-yunxin-Desktop`.
Checklist
`git commit -s` -- required, CI rejects unsigned commits (DCO, see CONTRIBUTING.md)
`make -f Makefile.cbm test` -- 5689 passed, 0 failed
`make -f Makefile.cbm lint-ci` -- clang-format clean on touched files
Fixes #571