Skip to content

feat: concurrent add pipeline with crash-safe mutation recovery#104

Closed
gwokhou wants to merge 17 commits into
VectifyAI:mainfrom
gwokhou:feat/concurrent-ingestion-recovery
Closed

feat: concurrent add pipeline with crash-safe mutation recovery#104
gwokhou wants to merge 17 commits into
VectifyAI:mainfrom
gwokhou:feat/concurrent-ingestion-recovery

Conversation

@gwokhou

@gwokhou gwokhou commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Parallelize openkb add <dir> and make every KB mutation crash-recoverable through a journaled snapshot/rollback layer.

Why

Directory add was strictly serial — ingest time scaled linearly with file count and was dominated by conversion. Worse, a crash between convert and the registry write orphaned raw/source artifacts in the live KB with no way to recover or clean them up.

Changes

  • Concurrent prepare, serialized commit. File preparation (hash, duplicate prefilter, staging, conversion) runs on a ThreadPoolExecutor (file_processing_jobs, default 2); live-KB mutation — publish, PageIndex indexing, LLM compile, registry write, log append — stays under the mutation lock. Commits run in scan order for stable log.md / CLI output.
  • Crash-safe mutation journal (openkb/mutation.py). Each commit snapshots the KB paths it will touch, journals the intent, and rolls back on failure. recover_pending_journals rolls back any active journal left by an interrupted run and discards committed/rolled_back ones. SQLite sidecars (pageindex.db-wal/-shm/-journal) are snapshotted too, so long-doc failures don't leave sidecars newer than the db.
  • Recovery wired into the lock. Draining runs on every exclusive-lock acquisition, so remove/recompile/lint/chat also restore the KB before mutating — not just add. Prevents a crashed add's journal from later clobbering an intervening remove.
  • Staged conversion. convert_document writes into an isolated staging dir published atomically at commit, so a crash can no longer orphan live-KB artifacts.
  • Perf & diagnostics. Prefilter known hashes before preparing; DEBUG-level per-stage timing logs.

Tests

test_mutation.py + test_add_command.py cover snapshot rollback, commit protection, partial-failure cleanup, lock-driven recovery, and the jobs>1 pipeline end-to-end (real prepare + commit, only LLM compile mocked). Full suite: 820 passed.

🤖 Generated with Claude Code

@gwokhou gwokhou marked this pull request as ready for review June 18, 2026 16:15
gwokhou and others added 7 commits June 20, 2026 00:54
Add openkb/mutation.py, the transactional layer for KB mutations, plus
tests. MutationSnapshot snapshots the target paths, journals the intent,
and restores on failure across active/committed/rolled_back states
(children restored before parents; self-cleans its backup dir on partial
failure). snapshot_paths writes the active journal as the recovery signal;
recover_pending_journals rolls back active journals and discards terminal
ones. publish_staged_tree copies staged raw/source artifacts into place.

Module and tests only — wired into the add path in the next commit.
Parallelize directory `add`: prepare files concurrently (hash, prefilter,
staging, convert) while live-KB mutation stays serialized under the
mutation lock.

- Split add into prepare (file-local, into an isolated staging dir) and
  commit (under the lock): snapshot -> publish -> index -> compile ->
  registry write -> mark_committed, with snapshot rollback on failure.
- convert_document gains assume_locked/staging_dir/doc_name_override so
  parallel prepares never write the live KB unlocked.
- Reserve doc_names in scan order via HashRegistry.memory so same-stem
  files behave like serial adds.
- New file_processing_jobs config (default 2).
Document file_processing_jobs: add it to the Settings yaml example and
note (README Advanced options + config.yaml.example) that only file
preparation is parallelized while live-KB mutation stays serialized, so
raising it helps mainly when conversion is the bottleneck.
Add DEBUG-level timing logs across the add pipeline (lock_wait, prefilter,
prepare, index, compile, commit) via _log_add_timing, gated behind
isEnabledFor(DEBUG) so there is no cost when disabled. Surfaces where time
goes during concurrent directory add.
Hash directory inputs up front and skip files whose hash is already in the
registry before spawning prepare workers, so known duplicates no longer go
through conversion and staging. Hashing runs across the jobs workers;
files that fail to hash surface as per-file "failed" outcomes instead of
aborting the batch.
recover_pending_journals was wired only into _kb_mutation_lock (the add
path), so remove/recompile/lint/chat — which take kb_ingest_lock directly
via _with_kb_lock — never drained. An `openkb add` that crashed mid-commit
left an ACTIVE journal that an intervening `openkb remove` ignored and a
later `openkb add` then rolled back, clobbering the remove's hashes.json
edits and resurrecting the removed document.

Move draining into kb_lock's first exclusive acquisition so every mutation
entry point restores the KB to a known state before mutating. Delay-import
mutation from locks to break the locks<->mutation cycle, and drop the
now-redundant drain (plus its double-scan/double-log per file) from
_kb_mutation_lock.

Co-Authored-By: Claude <noreply@anthropic.com>
- test_exclusive_lock_drains_active_journal_before_yielding: regression
  guard for the lock-level drain — an ACTIVE journal left by a crashed add
  is rolled back the moment any exclusive lock is taken, not only on the
  add path.
- test_add_directory_jobs_gt1_runs_real_pipeline: end-to-end exercise of the
  jobs>1 ThreadPoolExecutor branch (real prepare + real commit, only LLM
  compile mocked). Every prior jobs>1 test mocked both halves, so futures
  ordering, staging publish, registry writes, and cleanup were never run.

Co-Authored-By: Claude <noreply@anthropic.com>
@gwokhou gwokhou force-pushed the feat/concurrent-ingestion-recovery branch 2 times, most recently from 114abea to a48d054 Compare June 19, 2026 17:20
@gwokhou gwokhou changed the title feat: add ingestion recovery and concurrent add pipeline feat: concurrent add pipeline with crash-safe mutation recovery Jun 19, 2026
gwokhou and others added 2 commits June 20, 2026 11:21
Cleanup pass over the concurrent-ingestion + mutation-journal feature.
Behavior preserved (full suite: 820 passed).

- mutation: stream _copy_file_atomic (mkstemp + copyfileobj + fsync +
  os.replace) instead of buffering whole files; route snapshot_paths
  file backups through it so every file copy in the module is atomic and
  streaming — large raw PDFs no longer spike peak memory.
- converter: drop the dead ConvertResult.staging_dir field (only
  _PreparedAdd.staging_dir is ever read); split convert_document into a
  lock-acquiring wrapper + _convert_document_locked so the parallel
  prepare path calls the locked form directly, removing the
  assume_locked flag and its recursive self-call.
- cli: hoist a module-level logger (drop four local rebinding sites);
  unify the jobs==1 and jobs>1 commit loops into one ThreadPoolExecutor
  path; document the defensive long-doc snapshot paths.

Co-Authored-By: Claude <noreply@anthropic.com>

@KylinMountain KylinMountain left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the design holds up well — staging isolation, serialized commit, the journal-based recovery, and the commit-time re-checks are all correct, and test coverage is solid. Nothing here blocks merge on day-to-day correctness. Three things are worth fixing first, though — #1 is the one that bites in normal use:

1. Every per-file commit deep-copies the whole KB for the rollback backupopenkb/cli.py:475, openkb/mutation.py:165

_snapshot_add_paths lists the entire wiki/concepts and wiki/entities trees plus the .openkb/files PDF blob store, and snapshot_paths copytrees them on each file's commit. A 50-file batch copies the whole KB ~50 times — O(files × corpus) I/O that scales with the KB and can fill the disk mid-batch; even adding a 2 KB .md copies the full blob store. Suggest snapshotting only the pages the doc actually touches (or hardlink/CoW), or taking the snapshot once per batch rather than once per file.

2. Publish copies bytes instead of renaming, and drops the dir-fsync + chmodopenkb/mutation.py:18

Staging is on the same filesystem as raw/ and wiki/sources/, so os.replace would be an O(1) atomic rename instead of a full streaming copy + fsync. _copy_file_atomic also omits the parent-dir fsync and fchmod that locks.atomic_write_bytes performs, so published files are less durable across a crash and inherit mkstemp's 0600 mode instead of the umask mode. Suggest os.replace with a copy fallback on EXDEV, reusing the atomic_write_bytes fsync/chmod path.

3. Interrupted batches leak staging dirsopenkb/cli.py:1067

Staging dirs are all mkdir'd up front in the futures comprehension, but _cleanup_staging only runs inside each per-file commit and there's no batch-level try/finally. A Ctrl+C, a failed file, or an mkdir error orphans the already-created add-* dirs, and recovery only scans journal/ so it never reclaims them. Suggest a batch-level cleanup, or folding staging into the recovery sweep.

Happy to send a patch for #1 if useful. The rest of what I found is rarer crash/concurrency edges and design follow-ups — fine as separate issues.

gwokhou added 2 commits June 24, 2026 21:40
…, batch cleanup

Addresses reviewer (KylinMountain) feedback on the concurrent-add /
crash-recovery PR. No day-to-day-correctness blocker; perf, durability,
and a resource leak.

VectifyAI#1 — per-file snapshot no longer copies the whole KB.
  snapshot_paths gains a hardlink_dirs opt-in: wiki/concepts, wiki/entities,
  and .openkb/files are backed up via os.link (O(1)) instead of copytree
  (EXDEV/EPERM/EACCES → copy fallback, incl. Windows OneDrive/ACL hardlink
  blocks). Per-file snapshot is now O(touched) instead of O(files × corpus),
  so the "2 KB .md copies the full blob store" case is gone. Hardlinks
  required the wiki writers (all 14 sites) to move from path.write_text
  (in-place, same inode) to atomic_write_text (temp+replace, new inode) so a
  backup keeps the old bytes while the live tree is mutated. .openkb/files is
  hardlink-safe because PageIndex only appends new {doc_id} blobs. Per-file
  failure isolation preserved (did not take the once-per-batch suggestion,
  which would roll back earlier files' shared-index changes on a later
  failure).

VectifyAI#2 — publish renames; copies are durable + umask-mode.
  publish_staged_tree → os.replace (O(1)) with EXDEV→copy fallback
  (_publish_staged_file); the rename path fsyncs the file DATA as well as the
  parent dir (parity with _copy_file_atomic, so a crash right after publish
  can't leave committed metadata pointing at un-durable bytes). _copy_file_atomic
  (snapshot/rollback/EXDEV-fallback) now fsyncs the parent dir and sets the
  umask mode instead of inheriting mkstemp's 0600, reusing locks'
  _fsync_directory/_target_mode.

VectifyAI#3 — interrupted batches no longer leak staging dirs.
  The add batch tracks its staging dirs in a try/finally that reaps any whose
  file never reached _commit_prepared_add's per-call cleanup. Batch-local
  rather than a recovery sweep, since the lock is released between per-file
  commits and a sweep could reap a mid-batch acquisition's pending dirs.

Tests: +10 covering publish rename/durability/EXDEV, hardlink snapshot +
rollback correctness, hardlink EACCES fallback (Windows), interrupted-batch
leak, a hardlink-invariant regression guard (mutation-verified), and
real-commit-path hardlink wiring. Full suite: 837 passed.
Two pre-existing resource/crash issues surfaced by the self-review of the
review-fix commit, addressed here:

recovery retry is now bounded (recover_pending_journals).
  An active journal whose rollback keeps failing (e.g. persistent ENOSPC) was
  retried on every lock acquisition forever — re-doing the failing rollback
  and never releasing the backup dir + journal. The journal now carries an
  additive "attempts" counter (default 0, backward-compatible with old
  journals); after MAX_ROLLBACK_ATTEMPTS (5) failed rollbacks recovery
  discards it with a loud "manual review needed" message, bounding on-disk
  retention instead of leaking indefinitely.

rollback of a hardlinked dir is now O(touched), not O(corpus).
  rollback() previously did rmtree(target) + copytree(backup) for every dir,
  recopying the whole tree — the .openkb/files blob store (potentially GB) on
  every failed long-doc add, and able to ENOSPC mid-rollback (feeding the
  unbounded-retry leak above). For dirs snapshotted via hardlinks, rollback
  now does an inode-diff restore (_restore_hardlinked_dir): untouched files
  already share the backup's inode and are left in place; only new/modified/
  deleted files are touched. Crash-rebuilt snapshots (no hardlink info on
  disk) keep the safe full-copy path, so the journal format is unchanged and
  recovery stays conservative.

Tests: +4 (bounded-recovery give-up; hardlinked rollback leaves untouched
files in place / removes new + restores modified / prunes new nested blob
dirs). Full suite: 841 passed.
@gwokhou

gwokhou commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

@KylinMountain Thanks your review! All three items are addressed in the latest commits, with tests added for each change (full suite: 841 passed). I also found two further issues during self-review, noted below.

Solved issues from your review 981e491

1. Per-file snapshot cost reduced to O(touched).

wiki/concepts, wiki/entities, and .openkb/files are now snapshotted with hardlinks (os.link) rather than copytree, with EXDEV/EPERM/EACCES falling back to a copy. This removes the O(files × corpus) copy, including the case where a small add copied the entire blob store.

Notes:

  • Hardlinks require every writer into these trees to write atomically (temp + replace, yielding a new inode). The wiki writers used in-place write_text, which would invalidate a hardlink backup on rewrite; all 14 have been switched to atomic_write_text. A regression test guards this — it fails if any writer reverts to in-place writes.
  • Per-file rollback granularity is preserved. A single batch-level snapshot was not adopted because it would discard shared-index changes from earlier files when a later file fails.

One assumption worth confirming: .openkb/files relies on PageIndex appending only new {doc_id} blobs per add (verified in its local backend — fresh UUID per add, dedup returns early, no in-place rewrite). If you would prefer not to rely on that, the alternative is to track removals for new blobs in that tree.

Please share your perspective on the patch you had in mind. If a simpler approach than hardlinks plus atomic writers exists, I would prefer to adopt it.

2. Publish uses rename, durable, and umask-mode.

publish_staged_tree now uses os.replace (O(1)) with an EXDEV→copy fallback. The renamed file's data and parent directory are fsync'd, and the mode is the umask default rather than mkstemp's 0600, reusing locks._fsync_directory/_target_mode as suggested.

3. Interrupted batches reclaim their staging directories.

The batch records its staging directories and reclaims any that do not reach commit via a try/finally. It is scoped to the batch rather than folded into recovery, because the lock is released between per-file commits, and a recovery sweep could reclaim another in-flight batch's directories. Remaining limitation: a hard kill or power loss mid-batch can still orphan staging directories, since the final step does not run and recovery scans only journal/.

Solved issues from self-review 5fb53b8

  • The publish rename path initially omitted the file-data fsync that the copy path performs, leaving committed metadata referencing undurable bytes after a crash. Fixed.
  • The hardlink fallback handled only EXDEV/EPERM; on Windows, OneDrive/ACL hardlink blocks surface as EACCES and would abort the snapshot. EACCES is now included in the fallback set.

Additionally, recovery retry is now bounded (a persistently-failing rollback journal previously retried indefinitely; it now gives up after five attempts with an explicit warning), and rollback for hardlinked directories is O(touched) via inode comparison, leaving untouched files in place.

Further Discussions

I am happy to hear the follow-up suggestions you mentioned (about rarer concurrency, crashes, and design).

@gwokhou gwokhou requested a review from KylinMountain June 24, 2026 14:59
recover_pending_journals' except block referenced `snapshot` unconditionally, but it is assigned inside the try — so a corrupt/empty/stray .json (or one missing kb_dir/backup_dir) made json.loads/_snapshot_from_journal raise before the assignment. The handler then threw UnboundLocalError, which propagates out of recovery and out of every exclusive kb_lock acquisition (draining runs on first acquisition), bricking add/remove/recompile/chat for the whole KB off a single bad journal.

Initialize snapshot=None per iteration; when it is still None in the except (parse/reconstruct failed), best-effort remove the unrecoverable journal, log loudly, and continue instead of crashing. Adds a parametrized regression test over malformed-journal payloads plus a lock-acquire assertion.
@gwokhou

gwokhou commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

One more solved issue from self-review bdef059

a malformed recovery journal no longer bricks the KB lock

While self-reviewing this PR I caught a latent crash in recover_pending_journals (openkb/mutation.py).

Bug

The except Exception handler referenced snapshot unconditionally, but snapshot is assigned inside the try (after json.loads / _snapshot_from_journal). A malformed journal — a corrupt/empty/stray .json in .openkb/journal/, or a valid JSON missing the kb_dir/backup_dir keys — makes parsing/construction raise before the assignment, so the handler itself threw UnboundLocalError:

except Exception as exc:
    snapshot.attempts += 1   # ← snapshot unbound → UnboundLocalError

Because journal draining runs on every first exclusive kb_lock acquisition (locks._drain_pending_journals), that propagated up through kb_lock: a single bad journal file would make every subsequent add / remove / recompile / chat fail with an opaque UnboundLocalError, bricking mutations for the entire KB.

Normal writes go through atomic_write_json, so the trigger for a malformed journal is uncommon (external corruption, manual edits, or a stray .json dropped in the dir) — but recovery must be robust to it regardless, and it wasn't.

Fix

Initialize snapshot = None per loop iteration, and in the except handle the snapshot is None case: best-effort unlink the unrecoverable journal, log loudly, and continue instead of crashing. The existing bounded-retry logic is left untouched for the case where snapshot is bound (a rollback/discard failure).

Test

Added a parametrized regression test (tests/test_mutation.py) over four malformed-journal payloads (empty, invalid JSON, missing keys, wrong shape), asserting that recovery returns cleanly and removes the bad journal — and, the key assertion, that the KB's mutation lock still acquires afterward. Full suite green (845 passed).

@KylinMountain

Copy link
Copy Markdown
Collaborator

Re-verified all three against the actual code (including reading the PageIndex local backend), not just the writeup — they hold up:

One thing I'd fold into this PR — cheap and on-theme:

lint.py:252 (fix_broken_links) still rewrites concept/entity pages in place (md.write_text(...)). It's the lone exception to the "every writer into a hardlinked tree is atomic" invariant this PR introduces. It's safe today only because fix_broken_links is reachable solely from remove / lint --fix, neither of which snapshots. But the moment hardlink_dirs is reused for remove/recompile (the natural next step for crash-safety), this in-place write mutates the shared-inode backup and rollback silently becomes a no-op — and no current test catches it, since it's off the add path.

Two one-liners, both in scope here:

  1. Switch that write_textatomic_write_text. It's the 15th sibling of the 14 writers you already converted, and it makes "all wiki writes are atomic" a clean, assertable global property instead of "14 atomic + one in-place exception."
  2. Make the invariant explicit in snapshot_paths (a cheap runtime assert or a docstring contract) so a future caller reusing hardlink_dirs can't step on it.

Actually wiring crash-safety into remove/recompile is genuinely separate work — good as a follow-up issue, not this PR.

For the rarer edges you asked about, I'll file these as separate issues:

  • SQLite hot-copypageindex.db + -wal/-shm are snapshotted as independent byte copies; if a long-doc index leaves an uncheckpointed WAL, a rollback can restore a mutually-inconsistent db/wal pair. A checkpoint (or closing the client) before the snapshot would close it.
  • convert_document cross-thread re-entrancy — lock reentrancy is thread-local, so calling the lock-acquiring convert_document from a prepare worker would deadlock. Latent (workers use _convert_document_locked today), guarded only by a docstring.
  • Advisory doc_name reservation — the lock is released before commit, so a concurrent external add of a same-stem file makes this batch's file fail with the "conflict after parallel preparation" error instead of re-deriving a suffixed name.

Nice work overall — the self-review catches (rename data-fsync, EACCES fallback, bounded retry) were the right calls.

@gwokhou

gwokhou commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

@KylinMountain Thanks again for your quick and patient review!

Issues fixed in 559ada7.

  • fix_broken_links now uses atomic_write_text, so lint/remove cleanup no longer rewrites concept/entity pages in place.
  • Added a regression test that runs fix_broken_links under a hardlinked wiki/concepts snapshot and verifies rollback restores the original wikilink content. This covers the off-add-path writer case you called out.
  • Tightened the snapshot_paths docstring to make the hardlinked-dir writer contract explicit: writers must be atomic temp+replace or append-only.

Re-ran the focused coverage:

UV_PYTHON=3.13 UV_CACHE_DIR=/tmp/uv-cache uv run --extra dev pytest tests/ test_mutation.py tests/test_lint.py -q

Result: 84 passed, 5 warnings.

(Warnings are existing Python 3.13 SWIG/native-binding deprecation warnings. They are emitted during dependency import and are unrelated to the lint atomic-write / snapshot rollback change.)

I’m leaving remove/recompile crash-safety and the rarer SQLite/re-entrancy/doc-name reservation edges as follow-up issues, as suggested.

Resolve conflicts between the concurrent-ingestion/recovery branch and
upstream's PageIndex Cloud import feature (VectifyAI#97) + config/compiler/skill
changes.

cli.py:
- imports: keep HEAD's staged-converter helpers (_convert_document_locked,
  _sanitize_stem, resolve_doc_name); add upstream's import_cloud_document.
- add(): keep HEAD's removal of @_with_kb_lock (the concurrent directory
  batch must release the mutation lock between per-file commits) and adopt
  upstream's from_pageindex_cloud param. Wrap the cloud-import call in
  _kb_mutation_lock so it matches every other add mutation path — the
  auto-merge otherwise left cloud import unlocked after the decorator was
  dropped, dropping both mutual exclusion and journal draining on acquire.
- _add_single_file_locked: remove dead local imports (body now delegates
  to _prepare_add_file / _commit_prepared_add).

test_add_command.py: both sides added disjoint tests at the same site;
keep all of them (HEAD recovery/concurrency tests + upstream cloud tests).

Full suite: 910 passed.
@gwokhou

gwokhou commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Some issues found after 66765ff, working on it (already done)

gwokhou added 3 commits June 26, 2026 12:14
import_from_pageindex_cloud mutated the live KB (summary/source writes,
compile_long_doc concept/entity pages, registry, PageIndex) with no
snapshot/rollback journal around the window — so a crash, SIGKILL, or
mid-import failure stranded orphaned artifacts the recovery layer could
not reach: the exact failure mode this branch's journal layer exists to
prevent. The merge-added _kb_mutation_lock gave mutual exclusion and
drained *old* journals but created no *new* one for the import itself.

Wrap the import → compile → registry window in snapshot_paths, reusing
the add path's commit discipline: mark_committed after the registry write
is the commit point; rollback_best_effort + discard on any failure.
doc_name is unknown until import_cloud_document runs, so summaries/sources
are snapshotted at directory granularity under the lock. Retire
_cleanup_failed_cloud_import — journal rollback supersedes it and also
covers the import_cloud_document-failure window the manual cleanup never
did. Make the post-commit append_log best-effort so a wiki/log.md write
failure can't turn a registered import into an uncaught false-failure,
matching the add path.

Tests (RED first):
- import_cloud_document failure rolls back partial writes, leaves no journal
- post-commit log failure still returns "added"
- compile-failure test updated to exercise journal rollback
- regression: `add --from-pageindex-cloud` drains a seeded active journal
  under the mutation lock (guards the lock wrap from silent removal)
- success path marks committed + discards, leaves no journal

Full suite: 914 passed.
…)/doc)

Follow-up to 5666951 addressing review feedback: the cloud import's
whole-tree snapshot was O(total sources size) time/disk and could
fail-before-work on a large/image-heavy KB — contradicting this branch's
O(1)-snapshot principle.

Split import_cloud_document into prepare_cloud_import (read-only cloud
fetch + collision-resistant name resolution, no write) so the crash-safe
CLI path knows doc_name BEFORE writing. It then snapshots only this doc's
paths via the add path's _snapshot_add_paths helper (reused verbatim)
instead of copying the whole wiki/summaries and wiki/sources trees, and
hardlinks concepts/entities/files as the add path does. Name resolution is
read-only, so running it pre-snapshot is safe.

Also (Low): document import_from_pageindex_cloud's lock contract — it must
be called under _kb_mutation_lock (caller-held, like _add_single_file_locked);
the CLI add --from-pageindex-cloud path acquires it.

import_cloud_document is kept (now a thin prepare + write) for backward
compat; test_indexer still exercises it directly.

Tests (TDD RED→GREEN): a spy test proving the snapshot covers
wiki/summaries/{doc}.md, not the whole wiki/summaries dir; existing cloud
tests updated to the prepare/write split. Full suite: 915 passed.
@gwokhou

gwokhou commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Some issues found after 66765ff, working on it

Heads up on the commits after the merge — fixing gaps I caught reviewing the merged result. Resolving the upstream conflict pulled in the PageIndex Cloud import (#97), which predates this branch's journal layer, so cloud import was mutating the live KB with no snapshot/rollback — regressing the PR's own crash-safety contract.

  • 5666951 wraps the import_cloud_document → compile → registry window in snapshot_paths (mark_committed after the registry write, rollback on any failure), makes post-commit append_log best-effort, and drops the now-redundant _cleanup_failed_cloud_import. Adds a regression test that add --from-pageindex-cloud drains a pending journal under the lock.

  • 1be4350 fixes the first pass's whole-tree snapshot: snapshotting the whole summaries/sources trees was O(corpus) and could fail-before-work on a large KB. It splits import_cloud_document into prepare_cloud_import (read-only fetch + name resolution) so doc_name is known before writing, then snapshots only this doc's paths via the existing _snapshot_add_paths.

  • eb4c1ab fixes a cross-system compatibility issue in that split: prepare_cloud_import no longer uses platform-native Path(cloud_name).stem for PageIndex Cloud display names. Cloud names are data, not local paths; Windows-style names such as C:\Users\me\Cloud Paper.pdf now resolve to the same Cloud-Paper doc_name on POSIX and Windows. Added a regression test so the POSIX-only wrong artifact (C-Users-me-Cloud-Paper.json) is not produced.

Verification:

Full suite verification for 5666951, 1be4350, eb4c1ab: 916 passed, 7 warnings in 4.88s. UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest -q

import_cloud_document is kept (thin prepare + write); test_indexer.py still exercises it. The separate Python 3.14 dependency metadata issue is project-level rather than this PR's cloud-import behavior, so I'm leaving that for a follow-up issue instead of widening this PR.

@KylinMountain

Copy link
Copy Markdown
Collaborator

@gwokhou Honestly, this PR has gotten too big to review safely — +2600 across 16 files, and the upstream merge pulled cloud import in too. It's really two things bundled: crash-safe recovery (which is great, let's keep it) and concurrency.

And I'm not sure the concurrency earns its keep: only the prepare step runs in parallel — PageIndex + LLM compile, the actual slow parts, still run serially under the lock. So the speedup is small, but it's what forces all the tricky bits (name reservation, parallel conflict checks).

Can we split it? Land the serial recovery first — it's the valuable part and gets much simpler without concurrency. Do concurrency as a follow-up but maybe low priority? What do you think?

@gwokhou

gwokhou commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

@gwokhou Honestly, this PR has gotten too big to review safely — +2600 across 16 files, and the upstream merge pulled cloud import in too. It's really two things bundled: crash-safe recovery (which is great, let's keep it) and concurrency.

And I'm not sure the concurrency earns its keep: only the prepare step runs in parallel — PageIndex + LLM compile, the actual slow parts, still run serially under the lock. So the speedup is small, but it's what forces all the tricky bits (name reservation, parallel conflict checks).

Can we split it? Land the serial recovery first — it's the valuable part and gets much simpler without concurrency. Do concurrency as a follow-up but maybe low priority? What do you think?

I agree. I'll close this PR first, and then split it into two PRs:

  1. Landing the serial crash-safe recovery/journal rollback work, keeping the cloud-import recovery fixes only where they preserve that same contract.
  2. Moving the concurrency / parallel prepare path into a follow-up PR and treating it as a lower priority unless we can demonstrate that the speedup justifies the added complexity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants