Skip to content

fix(store): preserve corrupt DB instead of deleting (#557)#620

Open
yasku wants to merge 2 commits into
DeusData:mainfrom
yasku:fix/557-preserve-corrupt-db
Open

fix(store): preserve corrupt DB instead of deleting (#557)#620
yasku wants to merge 2 commits into
DeusData:mainfrom
yasku:fix/557-preserve-corrupt-db

Conversation

@yasku

@yasku yasku commented Jun 25, 2026

Copy link
Copy Markdown

What does this PR do?

Fixes #557. The startup/query integrity check unlink()'d a project's .db (plus its -wal/-shm sidecars) on any anomaly, causing total data loss with no backup, no recovery path, and no visible warning. The trigger is frequently a false positive: a WAL leftover after a SIGKILL, a binary variant swap (standardui), or a stale-lock break can all make a recoverable DB look corrupt.

This replaces the destructive unlink with a rename to <name>.db.corrupt.<timestamp> and emits a prominent stderr message, matching the issue's suggested fix. The user can now inspect, recover, or attach the file to a bug report.

Two call sites are fixed:

  • src/mcp/mcp.c resolve_store() — the reported path, hit on the first query against a project.
  • src/pipeline/pipeline.c try_incremental_or_delete_db() — only preserves on real corruption; a healthy DB rebuilt due to a mode change is still deleted, so .corrupt files don't accumulate on normal reindexes.

If rename fails (permissions, cross-device), it falls back to unlink so no orphan files are left behind.

Checklist

  • Every commit is signed off (git commit -s) — DCO verified via scripts/check-dco.sh
  • Tests pass locally (make -f Makefile.cbm test) — 5684 passed
  • Lint passes for changed files (clang-format clean; cppcheck clean on the three modified files)
  • New behavior is covered by a test (reproduce-first): resolve_store_preserves_corrupt_db_issue557 seeds a corrupt projects row, issues a query, and asserts the original .db is gone while a .corrupt sidecar remains.

Notes on the lint checklist item (transparency)

A few constraints worth making explicit so the maintainer isn't surprised:

  • make lint-ci does not pass cleanly locally, but only on pre-existing code I did not touch. My local cppcheck is 2.21.0, which is newer/stricter than CI's and flags pre-existing findings in src/cypher/cypher.c and src/cli/cli.c (e.g. knownConditionTrueFalse, compareValueOutOfTypeRangeError). Running cppcheck scoped to just my three changed files exits clean (0). I deliberately did not touch those files to keep this PR focused (per CONTRIBUTING's "avoid unrelated reformatting").
  • clang-format version skew. CI uses clang-format-20; I only have clang-format-22 locally. Both modified source files (mcp.c, pipeline.c) are whole-file clean under v22, and since the surrounding pre-existing code shows no diff under v22 either, v20 and v22 appear to agree on this file's style. If CI's v20 disagrees on any of my lines, I'm happy to reformat.
  • clang-tidy / magic numbers in the test. The new test uses raw strncmp(..., 19) / strncmp(..., 10) literals, matching the existing style in tests/ (e.g. test_cli.c, test_httpd.c, test_pipeline.c). This is fine because test sources are not in LINT_SRCS, and CI's lint step runs cppcheck + clang-format only (clang-tidy is enforced locally, not in CI).
  • Labels. I don't have triage/write access to this repo, so I couldn't apply any labels (e.g. bug, stability/performance) to match cbm v0.8.1 silently deletes project DBs on "corrupt" detection — data loss with no recovery #557. Please add whatever is appropriate.

The startup/query integrity check unlink()'d a project's .db (plus its
-wal/-shm sidecars) on any anomaly, causing total data loss with no
backup, no recovery path, and no visible warning. The trigger is often a
false positive: a WAL leftover after a SIGKILL, a binary variant swap, or
a stale-lock break can all make a recoverable DB look corrupt.

Replace the destructive unlink with a rename to <name>.db.corrupt.<ts>
and emit a prominent stderr message so the user can inspect, recover, or
attach the file to a bug report. Two call sites are fixed:

- mcp.c resolve_store(): the reported path, hit on first query.
- pipeline.c try_incremental_or_delete_db(): only preserves on real
  corruption; a healthy DB rebuilt due to a mode change is still deleted
  so .corrupt files don't accumulate on normal reindexes.

Add a regression test (resolve_store_preserves_corrupt_db_issue557) that
seeds a corrupt projects row, issues a query, and asserts the original
.db is gone while a .corrupt sidecar remains.

Signed-off-by: yasku <aguss.cba@gmail.com>
Comment thread src/pipeline/pipeline.c Fixed
@yasku

yasku commented Jun 25, 2026

Copy link
Copy Markdown
Author

Working on fixing this ☺️

CodeQL flagged a time-of-check/time-of-use race in
try_incremental_or_delete_db(): the file at db_path was probed with
stat() and then operated on by name (rename/unlink), so the underlying
file could change between the two.

Drop the redundant stat() existence probe and derive existence from a
query-mode open (cbm_store_open_path_query, which never creates the
file): a NULL result means absent/unreadable, so we return without
touching the path. This removes the stat()->rename() dataflow CodeQL
reports and matches resolve_store() in mcp.c, which already uses this
pattern. Behavior is unchanged (absent -> skip, healthy -> reindex,
corrupt -> preserve as .corrupt).

Signed-off-by: yasku <aguss.cba@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cbm v0.8.1 silently deletes project DBs on "corrupt" detection — data loss with no recovery

2 participants