feat: add rubygems ecosystem to bq-dataset-ingest (CM-1296) #4307
themarolt wants to merge 18 commits
Signed-off-by: Uroš Marolt <uros@marolt.me>
Pull request overview
This PR extends the osspckgs bq-dataset-ingest pipeline to treat RubyGems as a first-class ecosystem, while also changing how large versions / package_dependencies full loads are merged (no longer dropping/rebuilding indexes & constraints) and adding a resume-by-job-id path for partially merged package_dependencies ingests.
Changes:
- Add RUBYGEMS to ecosystem/system filters, deps.dev dependency SQL (full + incremental), versions lookup creation, CLI bootstrap trigger, and the monitor’s known ecosystem list.
- Add resumable package_dependencies ingest support via --resume-job <id> (reuses a prior job’s GCS export and recorded progress to skip re-exporting from BigQuery).
- Remove the drop → load → rebuild index/FK workflow for versions and package_dependencies, using ON CONFLICT merges against live constraints instead; improve monitor rendering/ETA behavior.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| services/libs/data-access-layer/src/osspckgs/ingestJobs.ts | Adds DAL query to fetch resume-relevant ingest job fields (gcsPrefix + progress + merged rows). |
| services/apps/packages_worker/src/scripts/triggerBootstrap.ts | Adds RUBYGEMS to CLI and introduces --resume-job argument validation + propagation. |
| services/apps/packages_worker/src/scripts/monitorOsspckgs.ts | Buffered redraw to reduce flicker; improves status/ETA logic; adds rubygems to known ecosystems. |
| services/apps/packages_worker/src/deps-dev/workflows/ingestVersions.ts | Removes drop/rebuild flow; merges with ON CONFLICT DO NOTHING against live unique index. |
| services/apps/packages_worker/src/deps-dev/workflows/ingestDependencies.ts | Adds resume flow and removes drop/rebuild; increases merge timeout; sets step to merging. |
| services/apps/packages_worker/src/deps-dev/workflows/bootstrapOsspckgs.ts | Threads resumeJobId through bootstrap and skips watermark checks in resume mode. |
| services/apps/packages_worker/src/deps-dev/queries/systems.ts | Adds RUBYGEMS to default and valid system lists. |
| services/apps/packages_worker/src/deps-dev/queries/depsSql.ts | Adds RubyGems full+incremental dependency extraction branches. |
| services/apps/packages_worker/src/deps-dev/activities/manage.ts | Deletes index/constraint management activities (no longer used). |
| services/apps/packages_worker/src/deps-dev/activities/index.ts | Removes exports of deleted manage* activities; exports new getResumeExport. |
| services/apps/packages_worker/src/deps-dev/activities/getResumeExport.ts | Adds activity to fetch resume metadata for a prior ingest job. |
| services/apps/packages_worker/src/deps-dev/activities/createVersionsLookup.ts | Allows rubygems ecosystem in versions lookup creation. |
// Fill-constraints variant: UNIQUE constraint stays in place, so ON CONFLICT is valid. Upserts
// version_constraint only for rows where it is currently NULL — safe to run against a table already
// populated by --deps-table-b (which sets version_constraint = NULL for all rows). DISTINCT ON
// resolves duplicate (root, dep) pairs from BQ (same root/dep, different to_version) before the upsert.
// Merge against the live UNIQUE index — ON CONFLICT DO NOTHING makes every chunk idempotent, so the
// table's indexes/keys/constraints are never dropped. Both full and incremental use this path.
const MERGE_SQL = `
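The idempotency claim in the comment above can be illustrated without a database. What follows is a minimal in-memory sketch, not the PR's MERGE_SQL: `Edge`, `edgeKey`, and `mergeChunk` are hypothetical names, the column names are assumptions, and a `Map` keyed on the (root, dep) pair stands in for the live UNIQUE index. Retrying the same chunk inserts nothing the second time, which is the property ON CONFLICT DO NOTHING is relied on for.

```typescript
// Hypothetical row shape; the real package_dependencies columns are not shown in this excerpt.
interface Edge {
  rootVersionId: number
  depPackageId: number
  versionConstraint: string | null
}

// Stand-in for the UNIQUE index: one entry per (root, dep) pair.
const edgeKey = (e: Edge): string => `${e.rootVersionId}:${e.depPackageId}`

// Emulates INSERT ... ON CONFLICT DO NOTHING: rows whose key already exists are skipped.
// Returns the number of rows actually inserted.
function mergeChunk(table: Map<string, Edge>, rows: Edge[]): number {
  let inserted = 0
  for (const row of rows) {
    const key = edgeKey(row)
    if (!table.has(key)) {
      table.set(key, row)
      inserted++
    }
  }
  return inserted
}

const table = new Map<string, Edge>()
const stagedRows: Edge[] = [
  { rootVersionId: 1, depPackageId: 10, versionConstraint: '~> 2.0' },
  { rootVersionId: 1, depPackageId: 11, versionConstraint: null },
]

const firstRun = mergeChunk(table, stagedRows) // first attempt inserts both rows
const secondRun = mergeChunk(table, stagedRows) // retry of the same chunk inserts nothing
console.log({ firstRun, secondRun, rows: table.size })
```

Because a retried chunk is a no-op, a workflow can re-run any chunk after a failure without first checking what was already written.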
const { mergeStagingToTable } = proxyActivities<typeof depsDevActivities>({
  startToCloseTimeout: '1 hour',
  retry: { maximumAttempts: 1 },
})
const totalChunks = Math.ceil(fileNames.length / filesPerChunk)
let priorRowsAffected = 0
let priorStagingRows = 0
const priorTableRowCounts: Record<string, number> = {}

for (let chunkIndex = 0; chunkIndex < totalChunks; chunkIndex++) {
// Resume mode reuses a prior job's export, so there is no fresh BQ export to validate. Skip the
// incremental watermark/partition checks below — the resumed partition may not match `today`.
const resume = opts.resumeJobId != null
// Returns an existing exported job for the given GCS prefix so callers can
// skip re-running the BQ export when retrying a failed workflow.
// Loads the fields needed to resume a partially-merged chunked job by id: the exact GCS export
// path (so the same parquet files — and therefore the same chunk boundaries — are re-listed) plus
// the file-level load progress and rows-merged-so-far. progressDone/progressTotal come from the
// table_row_counts JSONB written by updateLoadingProgress. Returns null if the job id is unknown.
export async function getIngestJobForResume(
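The comment above relies on the same file listing producing the same chunk boundaries. A small sketch of that reasoning follows; `chunkFiles` and `firstChunkToRun` are hypothetical helpers, and it assumes that the listing is sorted deterministically and that progressDone counts files loaded in listing order. Neither assumption is confirmed by the excerpt shown here.

```typescript
// Split a file listing into fixed-size chunks. Sorting first keeps the boundaries
// stable across runs as long as the set of files is unchanged.
function chunkFiles(fileNames: string[], filesPerChunk: number): string[][] {
  if (!Number.isInteger(filesPerChunk) || filesPerChunk <= 0) {
    throw new Error('filesPerChunk must be a positive integer')
  }
  const sorted = fileNames.slice().sort()
  const result: string[][] = []
  for (let i = 0; i < sorted.length; i += filesPerChunk) {
    result.push(sorted.slice(i, i + filesPerChunk))
  }
  return result
}

// Assumption: progressDone counts files loaded so far, in listing order.
// A partially loaded chunk is re-run from its start, which is only safe
// because the merge itself is idempotent.
function firstChunkToRun(progressDone: number, filesPerChunk: number): number {
  return Math.floor(progressDone / filesPerChunk)
}

const listedFiles = [
  'part-0003.parquet',
  'part-0001.parquet',
  'part-0002.parquet',
  'part-0000.parquet',
  'part-0004.parquet',
]
const fileChunks = chunkFiles(listedFiles, 2) // same count as Math.ceil(5 / 2)
const resumeAt = firstChunkToRun(3, 2) // 3 files done: chunk 0 complete, chunk 1 partial
console.log(fileChunks.length, resumeAt)
```

If the listing changed between runs (files added, removed, or renamed), the chunk indexes would no longer line up with the recorded progress, which is why a resume path needs to verify the export before trusting progressDone.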
// NuGet groups deps by TargetFramework — flatten all groups, dedup handled downstream
// by DISTINCT ON in MERGE_SQL_FULL and ON CONFLICT in MERGE_SQL.
// by ON CONFLICT in MERGE_SQL (and DISTINCT ON in the fill-constraints variant).
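The flatten-then-dedup idea in this comment can be sketched in application code. This is only an illustration: in the PR the flattening happens in BigQuery SQL and the dedup in Postgres, and the `DepGroup` / `FlatDep` shapes and both function names below are assumptions made for the example.

```typescript
// Hypothetical shapes for a NuGet-style manifest with per-framework dependency groups.
interface DepGroup {
  targetFramework: string
  dependencies: { name: string; requirement: string }[]
}

interface FlatDep {
  root: string
  dep: string
  requirement: string
}

// Flatten every TargetFramework group into one list of edges for a root package.
function flattenGroups(root: string, groups: DepGroup[]): FlatDep[] {
  const edges: FlatDep[] = []
  for (const group of groups) {
    for (const d of group.dependencies) {
      edges.push({ root, dep: d.name, requirement: d.requirement })
    }
  }
  return edges
}

// Keep the first edge seen per (root, dep) pair, similar in spirit to DISTINCT ON (root, dep).
function dedupeEdges(edges: FlatDep[]): FlatDep[] {
  const seen = new Set<string>()
  const unique: FlatDep[] = []
  for (const e of edges) {
    const key = JSON.stringify([e.root, e.dep]) // unambiguous composite key
    if (!seen.has(key)) {
      seen.add(key)
      unique.push(e)
    }
  }
  return unique
}

const groups: DepGroup[] = [
  { targetFramework: 'net6.0', dependencies: [{ name: 'Newtonsoft.Json', requirement: '>= 13.0.1' }] },
  {
    targetFramework: 'netstandard2.0',
    dependencies: [
      { name: 'Newtonsoft.Json', requirement: '>= 13.0.1' },
      { name: 'System.Memory', requirement: '>= 4.5.0' },
    ],
  },
]

const flat = flattenGroups('Example.Package', groups)
const uniqueEdges = dedupeEdges(flat)
console.log(flat.length, uniqueEdges.length)
```

Flattening produces one edge per group entry, so the same dependency listed under two frameworks appears twice until the dedup step collapses it.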
Force-pushed from f60e8b1 to b92d4d0.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 9065aa6.
Summary
Adds RubyGems as a first-class ecosystem to the bq-dataset-ingest (osspckgs) pipeline, and folds in two operational improvements developed alongside it: resumable package_dependencies ingests and removal of the index drop/recreate step. RubyGems is manifest-sourced (deps.dev Scenario B): packages/versions/repos/advisories come from the standard *Latest views; dependencies come from RubyGemsRequirementsLatest.
Changes
- Adds RUBYGEMS to the systems filter, deps SQL (full + incremental via RubyGemsRequirements, RuntimeDependencies only), createVersionsLookup, the triggerBootstrap CLI, and the monitor's known-ecosystems list.
- The getResumeExport activity + getIngestJobForResume DAL query let a partially-merged package_dependencies job resume by id, reusing its GCS export and chunk boundaries (skips the multi-hour BQ export on retry).
- Removes the manage{PackageDeps,Versions}{Indexes,Constraints} activities and the drop→load→rebuild flow for all job kinds. full loads now merge against live indexes via ON CONFLICT DO NOTHING (idempotent). The day-long rebuild was only needed for the original NPM onboarding.
- The monitor shows a merging phase step during the merge loop.
Type of change
JIRA ticket
https://linuxfoundation.atlassian.net/browse/CM-1296
Note
High Risk
Changes how multi-billion-row package_dependencies/versions full loads interact with live indexes and FKs (performance and correctness depend on ON CONFLICT idempotency), plus new resume paths that must not skip unmerged parquet chunks.
Overview
Adds RubyGems to the osspckgs BQ→PG pipeline: manifest-style deps from RubyGemsRequirementsLatest / incremental RubyGemsRequirements (runtime deps only), wired through default ecosystems, versions lookup, and bootstrap CLI validation.
Introduces --resume-job for interrupted package_dependencies runs: skips BQ, reuses the job’s GCS parquet and chunk boundaries, restores ecosystems and fill intent from job meta (meta:fill, meta:ecosystems), with hard validation so missing/changed exports fail loudly instead of marking done.
Removes the full-load drop indexes/FKs → plain INSERT → rebuild path for versions and package_dependencies (deleted manage*Indexes / manage*Constraints activities). Full and incremental now merge into live unique constraints via ON CONFLICT (DISTINCT ON retained for BQ duplicate edges); deps merge timeout bumped to 4h. bqExportToGcs stamps meta:fill on reuse paths so export reuse doesn’t silently pick the wrong merge SQL. Monitor gets flicker-free redraw and file-based total ETA; status shows merging step during the merge loop.
Reviewed by Cursor Bugbot for commit 3ade70a.
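The "fail loudly" behaviour described for --resume-job can be sketched as a set of guard checks. This is a hedged illustration, not the PR's code: `ResumeInfo` and `validateResume` are hypothetical, the field names follow the ones mentioned in the review summary (gcsPrefix, progressDone, progressTotal), and it assumes progressTotal records the number of exported files. The bucket path is made up.

```typescript
// Hypothetical shape of the resume metadata loaded for a prior job.
interface ResumeInfo {
  gcsPrefix: string | null
  progressDone: number
  progressTotal: number
}

// Throws instead of returning a status, so a missing or changed export
// cannot be mistaken for a completed job.
function validateResume(jobId: string, info: ResumeInfo | null, listedFiles: string[]): void {
  if (info === null) {
    throw new Error(`resume job ${jobId} not found`)
  }
  if (!info.gcsPrefix) {
    throw new Error(`resume job ${jobId} has no recorded GCS export`)
  }
  if (listedFiles.length === 0) {
    throw new Error(`no parquet files found under ${info.gcsPrefix}`)
  }
  // Assumption: progressTotal is the file count recorded when the job first ran.
  if (listedFiles.length !== info.progressTotal) {
    throw new Error(
      `export changed: expected ${info.progressTotal} files, found ${listedFiles.length}`,
    )
  }
  if (info.progressDone > info.progressTotal) {
    throw new Error(`recorded progress ${info.progressDone} exceeds total ${info.progressTotal}`)
  }
}

// Usage: a consistent job passes silently; anything else throws.
const priorJob: ResumeInfo = {
  gcsPrefix: 'gs://example-bucket/exports/job-123/', // made-up path
  progressDone: 1,
  progressTotal: 2,
}
validateResume('123', priorJob, ['part-0000.parquet', 'part-0001.parquet'])
```

Throwing on every inconsistency keeps the workflow from reaching the step that marks the job done when nothing, or the wrong thing, was merged.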