
feat: add rubygems ecosystem to bq-dataset-ingest (CM-1296) #4307

Open
themarolt wants to merge 18 commits into main from feat/add-rubygems-ingest-CM-1296

themarolt commented Jul 3, 2026


Summary

Adds RubyGems as a first-class ecosystem to the bq-dataset-ingest (osspckgs) pipeline, and folds in two operational improvements developed alongside it: resumable package_dependencies ingests and removal of the index drop/recreate step. RubyGems is manifest-sourced (deps.dev Scenario B) — packages/versions/repos/advisories come from the standard *Latest views; dependencies come from RubyGemsRequirementsLatest.

Changes

  • RubyGems ecosystem — added RUBYGEMS to the systems filter, deps SQL (full + incremental via RubyGemsRequirements, RuntimeDependencies only), createVersionsLookup, triggerBootstrap CLI, and the monitor's known-ecosystems list.
  • Resumable deps ingest — new getResumeExport activity + getIngestJobForResume DAL query let a partially-merged package_dependencies job resume by id, reusing its GCS export and chunk boundaries (skips the multi-hour BQ export on retry).
  • Removed index drop/recreate — dropped the manage{PackageDeps,Versions}{Indexes,Constraints} activities and the drop→load→rebuild flow for all job kinds. full now merges against live indexes via ON CONFLICT DO NOTHING (idempotent). The day-long rebuild was only needed for the original NPM onboarding.
  • Monitor — buffered redraw to eliminate flicker; shows the merging phase step during the merge loop.
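
The idempotency argument behind the index-removal bullet can be illustrated with a small in-memory model. This is a sketch, not the pipeline's code: `DepEdge`, `DepsTable`, and the `(rootVersionId, depPackageId)` key are hypothetical stand-ins, since the real package_dependencies schema is not shown on this page. It mimics `INSERT ... ON CONFLICT DO NOTHING` against a unique key, so replaying a chunk after a retry inserts nothing new.

```typescript
// Hypothetical row shape; the real package_dependencies columns are not shown in this PR.
interface DepEdge {
  rootVersionId: number
  depPackageId: number
  versionConstraint: string | null
}

// Models a table that carries a UNIQUE (rootVersionId, depPackageId) constraint.
class DepsTable {
  private rows = new Map<string, DepEdge>()

  // Mirrors INSERT ... ON CONFLICT DO NOTHING: rows whose key already exists are skipped.
  // Returns the number of rows actually inserted.
  mergeChunk(chunk: DepEdge[]): number {
    let inserted = 0
    for (const edge of chunk) {
      const key = `${edge.rootVersionId}:${edge.depPackageId}`
      if (!this.rows.has(key)) {
        this.rows.set(key, edge)
        inserted++
      }
    }
    return inserted
  }

  get size(): number {
    return this.rows.size
  }
}
```

Merging the same chunk twice leaves the table unchanged the second time, which is the property that makes it safe to retry a chunk without dropping or rebuilding indexes first.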

Type of change

  • Bug fix
  • New feature
  • Refactor / cleanup
  • Performance improvement
  • Chore / dependency update
  • Documentation

JIRA ticket

https://linuxfoundation.atlassian.net/browse/CM-1296


Note

High Risk
Changes how multi-billion-row package_dependencies/versions full loads interact with live indexes and FKs (performance and correctness depend on ON CONFLICT idempotency), plus new resume paths that must not skip unmerged parquet chunks.

Overview
Adds RubyGems to the osspckgs BQ→PG pipeline: manifest-style deps from RubyGemsRequirementsLatest / incremental RubyGemsRequirements (runtime deps only), wired through default ecosystems, versions lookup, and bootstrap CLI validation.

Introduces --resume-job for interrupted package_dependencies runs: skips BQ, reuses the job’s GCS parquet and chunk boundaries, restores ecosystems and fill intent from job meta (meta:fill, meta:ecosystems), with hard validation so missing/changed exports fail loudly instead of marking done.
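
A minimal sketch of how a resume could recompute chunk boundaries and refuse to continue when the export no longer matches the recorded progress. The names `ResumeState`, `ResumePlan`, and `planResume` are assumptions made for illustration, and the validation rules are a guess at what "fail loudly" might involve; the PR's actual checks are not reproduced here.

```typescript
// Hypothetical progress record left behind by the interrupted run.
interface ResumeState {
  progressDone: number // parquet files already merged
  progressTotal: number // parquet files the original run listed
}

interface ResumePlan {
  totalChunks: number
  startChunk: number
  chunks: string[][]
}

function planResume(fileNames: string[], filesPerChunk: number, state: ResumeState): ResumePlan {
  if (!Number.isInteger(filesPerChunk) || filesPerChunk <= 0) {
    throw new Error(`filesPerChunk must be a positive integer, got ${filesPerChunk}`)
  }
  // A missing or changed export must stop the run instead of being marked done.
  if (fileNames.length === 0) {
    throw new Error('resume export contains no parquet files')
  }
  if (fileNames.length !== state.progressTotal) {
    throw new Error(`export changed: expected ${state.progressTotal} files, found ${fileNames.length}`)
  }
  if (state.progressDone < 0 || state.progressDone > state.progressTotal) {
    throw new Error(`invalid progress: ${state.progressDone} of ${state.progressTotal}`)
  }

  // Sort a copy so chunk boundaries do not depend on listing order.
  const sorted = fileNames.slice().sort()
  const chunks: string[][] = []
  for (let i = 0; i < sorted.length; i += filesPerChunk) {
    chunks.push(sorted.slice(i, i + filesPerChunk))
  }

  // Round down: a chunk that was only partly merged is replayed, never skipped.
  const startChunk = Math.floor(state.progressDone / filesPerChunk)
  return { totalChunks: chunks.length, startChunk, chunks }
}
```

Rounding the start index down relies on the merge being idempotent, so replaying a partly merged chunk is harmless, whereas rounding up could skip unmerged files.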

Removes the full-load drop indexes/FKs → plain INSERT → rebuild path for versions and package_dependencies (deleted manage*Indexes / manage*Constraints activities). Full and incremental now merge into live unique constraints via ON CONFLICT (DISTINCT ON retained for BQ duplicate edges); deps merge timeout bumped to 4h.

bqExportToGcs stamps meta:fill on reuse paths so export reuse doesn’t silently pick the wrong merge SQL. Monitor gets flicker-free redraw and file-based total ETA; status shows merging step during the merge loop.
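
One common way to get a flicker-free redraw, sketched here under the assumption that the monitor writes to an ANSI-capable terminal, is to build the whole frame in memory and emit it in a single write that overwrites the previous frame in place. `buildFrame` is a hypothetical helper, not the monitor's actual function.

```typescript
const CURSOR_HOME = '\x1b[H' // move the cursor to the top-left corner
const CLEAR_LINE_END = '\x1b[K' // erase from the cursor to the end of the line
const CLEAR_SCREEN_END = '\x1b[J' // erase from the cursor to the end of the screen

// Builds one frame as a single string so the terminal receives it in one write.
// Each line is overwritten in place; leftovers from a longer previous frame are erased.
function buildFrame(lines: string[]): string {
  const body = lines.map((line) => line + CLEAR_LINE_END + '\n').join('')
  return CURSOR_HOME + body + CLEAR_SCREEN_END
}
```

In Node this would be written with a single `process.stdout.write(buildFrame(lines))` per tick. Because the screen is never cleared before the new content arrives, there is no blank intermediate frame to flicker.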

Reviewed by Cursor Bugbot for commit 3ade70a.

Copilot AI review requested due to automatic review settings July 3, 2026 15:54
Comment thread services/apps/packages_worker/src/scripts/triggerBootstrap.ts Outdated

Copilot AI left a comment


Pull request overview

This PR extends the osspckgs bq-dataset-ingest pipeline to treat RubyGems as a first-class ecosystem, while also changing how large versions / package_dependencies full loads are merged (no longer dropping/rebuilding indexes & constraints) and adding a resume-by-job-id path for partially merged package_dependencies ingests.

Changes:

  • Add RUBYGEMS to ecosystem/system filters, deps.dev dependency SQL (full + incremental), versions lookup creation, CLI bootstrap trigger, and the monitor’s known ecosystem list.
  • Add resumable package_dependencies ingest support via --resume-job <id> (reuses a prior job’s GCS export and recorded progress to skip re-exporting from BigQuery).
  • Remove the drop → load → rebuild index/FK workflow for versions and package_dependencies, using ON CONFLICT merges against live constraints instead; improve monitor rendering/ETA behavior.
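
The `--resume-job <id>` argument validation could look roughly like the following. This is a hedged sketch: `parseResumeJob` is a hypothetical helper, and the real script's parsing and the job id format are not shown on this page, so the id is treated as an opaque non-empty string.

```typescript
// Returns the job id passed via `--resume-job <id>`, or undefined when the flag is absent.
// Throws when the flag is present without a usable value.
function parseResumeJob(argv: string[]): string | undefined {
  const idx = argv.indexOf('--resume-job')
  if (idx === -1) return undefined

  const value = argv[idx + 1]
  if (value === undefined || value.trim() === '' || value.startsWith('--')) {
    throw new Error('--resume-job requires a job id')
  }
  return value
}
```

Rejecting a following flag as the value catches the case where the id was simply forgotten, rather than silently resuming a job named after another option.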

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| services/libs/data-access-layer/src/osspckgs/ingestJobs.ts | Adds DAL query to fetch resume-relevant ingest job fields (gcsPrefix + progress + merged rows). |
| services/apps/packages_worker/src/scripts/triggerBootstrap.ts | Adds RUBYGEMS to CLI and introduces --resume-job argument validation + propagation. |
| services/apps/packages_worker/src/scripts/monitorOsspckgs.ts | Buffered redraw to reduce flicker; improves status/ETA logic; adds rubygems to known ecosystems. |
| services/apps/packages_worker/src/deps-dev/workflows/ingestVersions.ts | Removes drop/rebuild flow; merges with ON CONFLICT DO NOTHING against live unique index. |
| services/apps/packages_worker/src/deps-dev/workflows/ingestDependencies.ts | Adds resume flow and removes drop/rebuild; increases merge timeout; sets step to merging. |
| services/apps/packages_worker/src/deps-dev/workflows/bootstrapOsspckgs.ts | Threads resumeJobId through bootstrap and skips watermark checks in resume mode. |
| services/apps/packages_worker/src/deps-dev/queries/systems.ts | Adds RUBYGEMS to default and valid system lists. |
| services/apps/packages_worker/src/deps-dev/queries/depsSql.ts | Adds RubyGems full+incremental dependency extraction branches. |
| services/apps/packages_worker/src/deps-dev/activities/manage.ts | Deletes index/constraint management activities (no longer used). |
| services/apps/packages_worker/src/deps-dev/activities/index.ts | Removes exports of deleted manage* activities; exports new getResumeExport. |
| services/apps/packages_worker/src/deps-dev/activities/getResumeExport.ts | Adds activity to fetch resume metadata for a prior ingest job. |
| services/apps/packages_worker/src/deps-dev/activities/createVersionsLookup.ts | Allows rubygems ecosystem in versions lookup creation. |


Comment on lines +106 to +109
// Fill-constraints variant: UNIQUE constraint stays in place, so ON CONFLICT is valid. Upserts
// version_constraint only for rows where it is currently NULL — safe to run against a table already
// populated by --deps-table-b (which sets version_constraint = NULL for all rows). DISTINCT ON
// resolves duplicate (root, dep) pairs from BQ (same root/dep, different to_version) before the upsert.
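
The role of DISTINCT ON described in that comment can be sketched in TypeScript. In PostgreSQL, an `INSERT ... ON CONFLICT DO UPDATE` raises an error if one statement would affect the same target row twice, so duplicate (root, dep) pairs need to be collapsed before the upsert runs. `StagedEdge` and `distinctOnRootDep` are hypothetical names, and the field names are assumptions rather than the real staging columns.

```typescript
// Hypothetical staging row; the real staging columns are not shown in this PR.
interface StagedEdge {
  rootId: number
  depId: number
  toVersion: string
  versionConstraint: string | null
}

// Keeps the first row seen for each (rootId, depId) pair, similar in spirit to
// SELECT DISTINCT ON (root, dep). Which duplicate wins depends on input order,
// just as DISTINCT ON depends on ORDER BY.
function distinctOnRootDep(rows: StagedEdge[]): StagedEdge[] {
  const firstByKey = new Map<string, StagedEdge>()
  for (const row of rows) {
    const key = `${row.rootId}:${row.depId}`
    if (!firstByKey.has(key)) firstByKey.set(key, row)
  }
  return Array.from(firstByKey.values())
}
```

Two rows that share root and dep but differ in `toVersion` collapse to one, so the subsequent upsert touches each target row at most once.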
Comment on lines +42 to 44
// Merge against the live UNIQUE index — ON CONFLICT DO NOTHING makes every chunk idempotent, so the
// table's indexes/keys/constraints are never dropped. Both full and incremental use this path.
const MERGE_SQL = `
Comment on lines 23 to 26
const { mergeStagingToTable } = proxyActivities<typeof depsDevActivities>({
  startToCloseTimeout: '1 hour',
  retry: { maximumAttempts: 1 },
})
Comment on lines +125 to +130
const totalChunks = Math.ceil(fileNames.length / filesPerChunk)
let priorRowsAffected = 0
let priorStagingRows = 0
const priorTableRowCounts: Record<string, number> = {}

for (let chunkIndex = 0; chunkIndex < totalChunks; chunkIndex++) {
Comment on lines +79 to +82
// Resume mode reuses a prior job's export, so there is no fresh BQ export to validate. Skip the
// incremental watermark/partition checks below — the resumed partition may not match `today`.
const resume = opts.resumeJobId != null

Signed-off-by: Uroš Marolt <uros@marolt.me>
Signed-off-by: Uroš Marolt <uros@marolt.me>
Copilot AI review requested due to automatic review settings July 3, 2026 17:22

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Comment on lines +89 to +95
// Returns an existing exported job for the given GCS prefix so callers can
// skip re-running the BQ export when retrying a failed workflow.
// Loads the fields needed to resume a partially-merged chunked job by id: the exact GCS export
// path (so the same parquet files — and therefore the same chunk boundaries — are re-listed) plus
// the file-level load progress and rows-merged-so-far. progressDone/progressTotal come from the
// table_row_counts JSONB written by updateLoadingProgress. Returns null if the job id is unknown.
export async function getIngestJobForResume(
Comment on lines +25 to +26
// NuGet groups deps by TargetFramework — flatten all groups, dedup handled downstream
- // by DISTINCT ON in MERGE_SQL_FULL and ON CONFLICT in MERGE_SQL.
+ // by ON CONFLICT in MERGE_SQL (and DISTINCT ON in the fill-constraints variant).
Copilot AI review requested due to automatic review settings July 4, 2026 05:55

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

Comment thread services/apps/packages_worker/src/deps-dev/workflows/ingestDependencies.ts Outdated
themarolt added 2 commits July 4, 2026 20:31
Signed-off-by: Uroš Marolt <uros@marolt.me>
Signed-off-by: Uroš Marolt <uros@marolt.me>
themarolt force-pushed the feat/add-rubygems-ingest-CM-1296 branch from f60e8b1 to b92d4d0 July 4, 2026 18:31
Copilot AI review requested due to automatic review settings July 4, 2026 18:31

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

Comment thread services/apps/packages_worker/src/deps-dev/activities/bqExportToGcs.ts Outdated
Comment thread services/apps/packages_worker/src/deps-dev/activities/bqExportToGcs.ts Outdated
Comment thread services/apps/packages_worker/src/deps-dev/activities/bqExportToGcs.ts Outdated
Signed-off-by: Uroš Marolt <uros@marolt.me>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 9065aa6.

Signed-off-by: Uroš Marolt <uros@marolt.me>
Copilot AI review requested due to automatic review settings July 4, 2026 19:02

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.


2 participants