feat(webapp): billing limits — pause, reject, recovery, and settings UI#3996
kathiekiwi wants to merge 8 commits into
Walkthrough
This PR implements a billing limits feature that lets organizations configure a monthly compute spend cap. When the cap is reached, a billing platform webhook triggers a grace period during which billable environments are paused; new task triggers are rejected once the grace period ends. A recovery flow lets users increase or remove the limit and choose to resume queued runs or accept cancellation. A Redis-backed worker periodically reconciles environment pause state against billing platform data.
Add the EnvironmentPauseSource enum and migration, plus the billing-limit platform client wrappers and schemas.
Configure a spend limit, manage billing alerts, and surface org-wide banners.
Converge billable environments to paused via webhook and a reconciliation worker; block manual resume.
Reject triggers with a 422 once entitlement reports no access, and bust the entitlement cache on state changes.
Recovery UI and durable resolve: cancel queued runs before unpausing, with reconciliation as a safety net.
Optionally cancel in-progress runs on limit hit via a deduplicated bulk-cancel job.
const existing = await prismaClient.bulkActionGroup.findFirst({
  where: {
    environmentId: environment.id,
    type: BulkActionType.CANCEL,
    AND: [
      {
        params: {
          path: ["source"],
          equals: options.source,
        },
      },
      {
        params: {
          path: ["dedupeKey"],
          equals: options.dedupeKey,
        },
      },
    ],
  },
  select: { id: true, friendlyId: true },
});
🚩 Bulk cancel dedupe query scans JSONB params without an index
The BillingLimitBulkCancelService at apps/webapp/app/v3/services/billingLimit/BillingLimitBulkCancelService.server.ts:114-134 deduplicates cancel actions by querying bulkActionGroup.params JSONB path filters (path: ["source"] and path: ["dedupeKey"]). Without a GIN index on the params column of BulkActionGroup, this requires a sequential scan of all cancel bulk actions for the environment. For orgs with many historical bulk actions, this could be slow during billing limit events. The query is scoped to a single environmentId and type: CANCEL, which limits the scan somewhat.
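The dedupe semantics behind that query can be illustrated without a database. The following is a minimal in-memory sketch, assuming a cancel action is identified by the combination of environment, `source`, and `dedupeKey`; `InMemoryCancelGroups` and `createOnce` are hypothetical names, not part of the PR. The linear `find` stands in for the unindexed scan the comment describes.

```typescript
// Hypothetical in-memory stand-in for the BulkActionGroup table.
type CancelGroup = {
  id: string;
  environmentId: string;
  params: { source: string; dedupeKey: string };
};

class InMemoryCancelGroups {
  private groups: CancelGroup[] = [];
  private nextId = 1;

  // Same filter shape as the findFirst call: environment + source + dedupeKey.
  // This walks every stored group, much like a scan without an index.
  findExisting(
    environmentId: string,
    source: string,
    dedupeKey: string
  ): CancelGroup | undefined {
    return this.groups.find(
      (g) =>
        g.environmentId === environmentId &&
        g.params.source === source &&
        g.params.dedupeKey === dedupeKey
    );
  }

  // Returns the existing group when one matches; otherwise creates a new one.
  createOnce(
    environmentId: string,
    source: string,
    dedupeKey: string
  ): { group: CancelGroup; created: boolean } {
    const existing = this.findExisting(environmentId, source, dedupeKey);
    if (existing) {
      return { group: existing, created: false };
    }
    const group: CancelGroup = {
      id: `group_${this.nextId++}`,
      environmentId,
      params: { source, dedupeKey },
    };
    this.groups.push(group);
    return { group, created: true };
  }
}
```

Calling `createOnce` twice with the same arguments yields the same group, so a retried webhook or job would not create a second cancel action under this assumption.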
… tests Add the usage-bar marker, documentation, and test coverage.
CI unit-test workers have no global Postgres/Redis on localhost (testcontainers use random ports). Two latent fragilities surface once new test files shift the shard layout:
- Modules build a Redis-backed singleton at import (auto-increment counter via triggerTask.server) and throw during collection when REDIS_HOST is unset.
- Shared background singletons (OrganizationDataStoresRegistry) poll the global database at startup and reject async, which vitest flags as unhandled.
Set harmless REDIS_HOST/PORT defaults, swallow only the Prisma P1001 "can't reach database" unhandled rejection (other rejections stay fatal), and inject a runs-repository stub in the dedupe unit test so it does not reach the production clickhouse factory. Temporary infra workaround; owner: platform.
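The "swallow only P1001" rule can be sketched as a small filter. This is an illustration under stated assumptions, not the code in this commit: `isDatabaseUnreachable` and `filterRejection` are hypothetical names, and checking both `errorCode` and `code` is an assumption about which property carries the Prisma error code on the rejected value.

```typescript
// Hypothetical predicate: is this rejection the Prisma "can't reach database"
// error? Both property names are checked as an assumption, since different
// Prisma error classes may expose the code differently.
function isDatabaseUnreachable(reason: unknown): boolean {
  if (typeof reason !== "object" || reason === null) {
    return false;
  }
  const candidate = reason as { errorCode?: unknown; code?: unknown };
  return candidate.errorCode === "P1001" || candidate.code === "P1001";
}

// Hypothetical filter: swallow only the unreachable-database rejection and
// rethrow everything else so other failures stay fatal.
function filterRejection(reason: unknown): void {
  if (isDatabaseUnreachable(reason)) {
    return;
  }
  throw reason;
}
```

How such a filter is wired into the test runner is not shown in the source; a setup file that registers it as an unhandled-rejection handler is one possibility.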
if (resumeMode === "new_only") {
  await BillingLimitBulkCancelService.cancelQueuedRuns(organizationId, {
    dedupeKey: buildBillingLimitResolveDedupeKey(organizationId, resolvedAt),
  });
}

await convergeBillingLimitEnvironmentsForOrg(organizationId, "ok");
🔴 Queued runs can start executing before cancellation completes when user chooses 'Cancel queued runs' during billing limit resolve
Environments are unpaused (convergeBillingLimitEnvironmentsForOrg at billingLimitConvergeResolve.server.ts:25) immediately after the cancel job is merely enqueued (cancelQueuedRuns at billingLimitConvergeResolve.server.ts:20), so queued runs can be dequeued and start executing before the bulk-cancel worker processes them.
Impact: When a user explicitly chooses "Cancel queued runs" during billing limit resolve, some of those queued runs may execute anyway, potentially incurring charges the user was trying to avoid.
Async bulk cancel is enqueued but environments are unpaused inline before it runs
The convergeBillingLimitResolve function handles the new_only resume mode (user chose to cancel queued runs):
1. BillingLimitBulkCancelService.cancelQueuedRuns at billingLimitConvergeResolve.server.ts:20-22 creates BulkActionGroup records and enqueues processBulkAction jobs via the common worker (BillingLimitBulkCancelService.server.ts:170). The await resolves when the job is enqueued, not when cancellation is complete.
2. Immediately after, convergeBillingLimitEnvironmentsForOrg(organizationId, "ok") at billingLimitConvergeResolve.server.ts:25 unpauses all billing-limit-paused environments, restoring their concurrency limits via updateEnvConcurrencyLimits (billingLimitConvergeEnvironments.server.ts:193).
3. The run queue can now dequeue PENDING runs from those environments.
4. The bulk cancel worker job hasn't executed yet: it searches for runs with QUEUED_STATUSES (BillingLimitBulkCancelService.server.ts:59), but runs that were already dequeued in step 3 have transitioned to DEQUEUED/EXECUTING and escape cancellation.
The invariant that should hold for resumeMode === "new_only" is: no queued runs from the billing-limit pause window should be dequeued. The current ordering violates this because concurrency is restored before the cancel job runs.
Prompt for agents
The problem is in convergeBillingLimitResolve (billingLimitConvergeResolve.server.ts:19-25). When resumeMode is 'new_only', the function enqueues bulk-cancel jobs for queued runs, then immediately unpauses environments. But the bulk-cancel is async (processed by the common worker later), so the run queue can dequeue runs before the cancel job executes.
The fix should ensure that queued runs are cancelled (or at least prevented from being dequeued) BEFORE environments are unpaused. Several approaches:
1. Process the cancel inline (synchronously) instead of via the async worker, then unpause environments. This is the most reliable but may be slow for large backlogs.
2. Use a two-phase approach: first cancel queued runs inline or wait for the bulk cancel to complete, then enqueue a separate job to unpause environments.
3. If inline cancel is too expensive, consider keeping environments paused until the bulk cancel job completes, and have the bulk cancel job trigger the unpause as its final step.
The key constraint is that environment concurrency must not be restored until the cancel has processed all queued runs, so the ordering must be: cancel first, then unpause.
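The ordering constraint can be sketched in isolation. This is a minimal illustration of approach 1 or 2 under stated assumptions, not the PR's implementation: `ResolveDeps`, `cancelQueuedRunsAndWait`, `unpauseEnvironments`, and `resolveBillingLimit` are hypothetical names, and the cancel dependency is assumed to resolve only once cancellation has actually finished rather than when a job is enqueued.

```typescript
type ResumeMode = "queue" | "new_only";

// Hypothetical dependencies; the real services are not reproduced here.
type ResolveDeps = {
  // Assumed to resolve only after every queued run has been cancelled.
  cancelQueuedRunsAndWait: (organizationId: string) => Promise<number>;
  // Restores concurrency so the run queue can dequeue again.
  unpauseEnvironments: (organizationId: string) => Promise<void>;
};

async function resolveBillingLimit(
  organizationId: string,
  resumeMode: ResumeMode,
  deps: ResolveDeps
): Promise<void> {
  if (resumeMode === "new_only") {
    // Cancel first: environments stay paused while this runs, so no queued
    // run from the pause window can be dequeued in the meantime.
    await deps.cancelQueuedRunsAndWait(organizationId);
  }
  // Unpause only after the cancel step has completed (or was not requested).
  await deps.unpauseEnvironments(organizationId);
}
```

The trade-off noted in approach 1 still applies: waiting for cancellation inline may be slow for large backlogs, which is where approach 3 (the cancel job triggering the unpause as its last step) would preserve the same ordering asynchronously.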
Summary
Adds Billing Limits to the webapp.
Customers can set a monthly spend cap. When usage crosses the limit, billable environments enter a grace period. If the limit is not resolved before grace expires, new triggers are rejected until the organization increases or removes the limit.
The webapp consumes billing-limit state from the billing platform and enforces it across environments, queues, and trigger creation.
Depends on the matching cloud billing PR.
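The lifecycle in this summary can be sketched as a mapping from billing-limit state to enforcement. This is a rough illustration, not the webapp's code: the state names are borrowed from the `"ok"` literal in the resolve code and the "grace/rejected" wording under Enforcement, while `Enforcement` and `enforcementFor` are hypothetical. Treating environments as still paused in the rejected state is an assumption.

```typescript
// State names assumed from elsewhere in this PR ("ok", "grace", "rejected").
type BillingLimitState = "ok" | "grace" | "rejected";

// Hypothetical description of what the webapp enforces in each state.
type Enforcement = {
  pauseBillableEnvironments: boolean;
  rejectNewTriggers: boolean;
};

function enforcementFor(state: BillingLimitState): Enforcement {
  switch (state) {
    case "ok":
      // Under the limit, or the limit was increased or removed.
      return { pauseBillableEnvironments: false, rejectNewTriggers: false };
    case "grace":
      // Limit crossed: environments pause, but triggers are still accepted.
      return { pauseBillableEnvironments: true, rejectNewTriggers: false };
    case "rejected":
      // Grace expired: new triggers are rejected (assumed still paused).
      return { pauseBillableEnvironments: true, rejectNewTriggers: true };
  }
}
```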
User-facing changes
Billing Limits settings
New /settings/billing-limits page replaces the standalone billing-alerts page.
Configure:
Configure billing alerts and notification emails.
Resolve active billing limits by increasing or removing the limit.
Org-wide banners
Adds banners for:
Usage page
Shows the configured billing limit on the spend chart.
Enforcement
Billable environments are paused when an org enters grace.
New triggers are rejected once grace expires.
Billing-limit pauses cannot be manually resumed.
New environments created during grace/rejected inherit the correct paused state.
Recovery supports:
Infrastructure
BILLING_LIMIT as an environment pause source.
Test plan
queue and new_only).
Notes
isConfigured: false means no billing limit has been configured yet.
mode: "none" means the customer explicitly opted out.
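The distinction in these notes maps naturally onto a discriminated union. The sketch below is an assumption about shape, not the PR's schema: only `isConfigured` and `mode: "none"` come from the notes, while the `"limit"` mode, `amountCents`, and `describeBillingLimit` are hypothetical.

```typescript
// Hypothetical config shape. Only `isConfigured` and `mode: "none"` are taken
// from the notes; the "limit" variant and `amountCents` are illustrative.
type BillingLimitConfig =
  | { isConfigured: false }
  | { isConfigured: true; mode: "none" }
  | { isConfigured: true; mode: "limit"; amountCents: number };

function describeBillingLimit(config: BillingLimitConfig): string {
  if (!config.isConfigured) {
    // Nothing has been set up yet; the customer has made no choice.
    return "not configured";
  }
  if (config.mode === "none") {
    // The customer made an explicit choice to have no limit.
    return "explicitly opted out";
  }
  return `limit of ${config.amountCents} cents`;
}
```

Keeping the two "no limit" cases as separate variants lets callers tell a customer who has never visited the settings page apart from one who deliberately opted out.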