Skip to content

BQAA (Java): preview-readiness fixes — redaction, table bootstrap, drop stats, reliability#1318

Draft
caohy1988 wants to merge 4 commits into
google:mainfrom
caohy1988:fix/bqaa-preview-readiness
Draft

BQAA (Java): preview-readiness fixes — redaction, table bootstrap, drop stats, reliability#1318
caohy1988 wants to merge 4 commits into
google:mainfrom
caohy1988:fix/bqaa-preview-readiness

Conversation

@caohy1988

Copy link
Copy Markdown

Summary

Addresses preview-readiness gaps in the Java BigQuery Agent Analytics (BQAA) plugin tracked in #1316, bringing it closer to parity with the Python plugin and hardening the write path.

Important

This change was not built or tested locally (the full Maven build was not run in the authoring environment). Please rely on CI and review. Unit tests for the new behaviors still need to be added — see "Tests to add" below. Findings were verified by direct source review against the issue-pinned SHA ce6af87, which is byte-identical to current main for this package.

Fixes #1316 (partially — remaining deferred items listed below).

P1 — preview blockers

  • Secret redaction (JsonFormatter): sensitive keys (client_secret, access_token, refresh_token, id_token, api_key, password, and any temp:* key) are now redacted to [REDACTED] during truncation, matching the Python plugin's _SENSITIVE_KEYS. Case-insensitive; redaction does not set the is_truncated flag (parity). This covers tool args/results, session state, usage metadata, and custom-tag paths that flow through smartTruncate.
  • Empty STATE_DELTA guard (onEventCallback): a STATE_DELTA row is only emitted when event.actions().stateDelta() is non-empty, matching Python (no row for empty deltas).
  • Table bootstrap made production-safe (ensureTableExists / ensureTableExistsOnce):
    • New tables are created with DAY time-partitioning on timestamp (parity with Python; enables partition pruning), in addition to the existing clustering.
    • tableEnsured is now set only after a successful setup, so a transient first-run failure (auth blip, missing dataset, quota) is retried on later events instead of permanently disabling table create/upgrade for the plugin instance.
    • Concurrent-create (already exists) is treated as success rather than a hard failure.
  • root_agent_name is populated (TraceManager.initTraceIfNeeded, called from getAttributes): attributes.root_agent_name now resolves to the real root agent name instead of the sentinel default. Null-safe.
  • No-current-agent guard (resolveAgentName): the agent column is resolved defensively; workflow-driven callbacks with no current agent fall back to "unknown" instead of NPE-dropping the row.

P2 — reliability hardening

  • Drop accounting (BatchProcessor + PluginState + getDropStats()): rows lost to a full queue, append errors, or serialization errors are now counted and exposed via plugin.getDropStats() (queue_full / append_error / serialization_error). Counters survive per-invocation processor churn by folding into plugin-level totals at close.
  • Storage Write regional routing: the StreamWriter is now built with .setLocation(config.location()) so append RPCs route to the dataset's region (previously only the control-plane client set location; non-US/EU appends risked stream-not-found).
  • JVM shutdown hook: public construction paths register a best-effort drain (state.close() bounded by shutdownTimeout) so a host that never calls close() still flushes queued rows at exit. The package-private test constructor does not register a hook (avoids hook leakage in tests).
  • Schema auto-upgrade always diffs (maybeUpgradeSchema): removed the version-label-only early return that Python deliberately avoids — a table stamped with the current label but missing columns is now reconciled. Incompatible non-STRUCT type drift is logged (was silently ignored, later surfacing as opaque append failures).

P3 — operational polish

  • Config validation: batchSize, queueMaxSize, and maxContentLength are validated as positive in build().
  • View DDL identifier validation: createAnalyticsViews refuses to interpolate a project/dataset/table/viewPrefix that isn't a safe identifier (guards the string-formatted DDL).
  • Recursion depth guard in recursiveSmartTruncate (prevents StackOverflowError on pathological nesting) + a log line on the string-fallback path.
  • Hot-path INFO logs lowered to FINE (isProcessed, pending-task wait/removal, stale-span clearing, table-exists).
  • EventData.java: added the missing Apache license header and fixed stale _log_event Javadoc.

Deferred (intentionally not in this PR)

These need behavior/parity decisions and tests I did not want to ship blind:

  • Preserve the BQAA internal span tree under ambient OpenTelemetry (BQAA Java preview readiness: parity and safety gaps vs Python plugin #1316 P1 item): still prefers ambient span IDs. Changing this alters correlation semantics and needs tests.
  • Full ADK attributes.adk envelope parity (BQAA Java preview readiness: parity and safety gaps vs Python plugin #1316 P2): route/render/rewind metadata, source-event node/branch/scope, long-running pair keys.
  • Long-running (non-HITL) tool resume → TOOL_COMPLETED with pause_kind/function_call_id pair keys (BQAA Java preview readiness: parity and safety gaps vs Python plugin #1316 P1).
  • Fully off-thread first-event table ensure: ensureTableExists is still synchronous on the first event's callback. Moving it off-thread safely needs append-gating so early rows aren't lost; deferred to avoid an untested concurrency change. (Note for whoever picks this up: snapshot any mutable InvocationContext/Session data before crossing threads.)
  • Default table/view naming decision (agent_analytics.events + createViews=false vs Python agent_events + create_views=True): a product/docs decision, not a code fix.

Plugin-owned OTel span export (#1316 comment) is related to the span-tree item and is deferred with it.

Tests to add

  • Redaction across nested maps, tool args/results, session state, custom tags; is_truncated unaffected by redaction.
  • Empty state_delta produces no STATE_DELTA row.
  • Table creation sets DAY partitioning + clustering; failed first-run setup is retried on later events.
  • getDropStats() increments on queue-full / append-error / serialization-error paths.
  • Schema upgrade reconciles a same-label table that is missing a column.
  • Config builder rejects non-positive batchSize/queueMaxSize/maxContentLength.
  • Shutdown hook drains queued rows when close() is not called explicitly.

🤖 Generated with Claude Code

…p stats, reliability)

Addresses google#1316. Highlights:
- P1: secret redaction in JsonFormatter; empty-STATE_DELTA guard; DAY
  time-partitioning + retry-after-failure table bootstrap; root_agent_name
  initialization; no-current-agent guard.
- P2: drop-row accounting via getDropStats(); StreamWriter regional routing;
  JVM shutdown-hook drain; schema auto-upgrade always diffs (+ drift warning).
- P3: config numeric validation; view-identifier validation; truncation depth
  guard; hot-path INFO->FINE; EventData license header + Javadoc fix.

Not built/tested locally; unit tests still to add. Deep-semantic parity items
(span-tree-under-OTel, full ADK envelope, long-running tool resume, off-thread
first-event ensure) intentionally deferred; see PR description.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@google-cla

google-cla Bot commented Jul 1, 2026

Copy link
Copy Markdown

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@caohy1988

Copy link
Copy Markdown
Author

Review against #1316 preview-readiness issue

I re-reviewed this PR against the tracked preview blockers in #1316. This is a strong partial fix, but I would keep it draft for now. A few issue-tracked blockers remain, and I found two regressions/holes in the new implementation that should be fixed before marking ready.

Findings

  1. P1: table bootstrap still stops retrying after schema-upgrade failure.
    ensureTableExists() marks the table ready after maybeUpgradeSchema() returns (BigQueryAgentAnalyticsPlugin.java:218-221), but maybeUpgradeSchema() catches and logs BigQueryException without surfacing failure (BigQueryUtils.java:269-287). Result: a failed upgrade can still set tableEnsured=true (BigQueryAgentAnalyticsPlugin.java:163-172), leaving the instance permanently “ensured” while appends may fail on missing columns. Please make schema upgrade return success/failure or rethrow, and only set tableEnsured after create/existing-table/schema-upgrade all succeed.

  2. P1: license-header cleanup is still incomplete.
    EventData.java is fixed, but PluginState.java still starts at package ... with no Apache header (PluginState.java:1). Since BQAA Java preview readiness: parity and safety gaps vs Python plugin #1316 tracks this as release/preview hygiene, this should be fixed before preview.

  3. P2: drop stats miss close/shutdown-time drops.
    PluginState.ensureInvocationCompleted() folds counters before processor.close() (PluginState.java:266-270), and global close() does the same (PluginState.java:329-331). But BatchProcessor.close() can still flush queued rows and increment drop counters (BatchProcessor.java:273-276, write failures counted at BatchProcessor.java:149-181). Drops that happen during final drain are then removed with the processor and never reach getDropStats(). Fold after final close/drain, or fold both before and after.

  4. P2: schema drift logging still ignores mode drift.
    The PR logs incompatible type drift only (BigQueryUtils.java:328-342). The BQAA Java preview readiness: parity and safety gaps vs Python plugin #1316 follow-up finding was type/mode drift; mode differences remain silent because schemaFieldsMatch() does not compare Field.Mode for top-level or nested fields (BigQueryUtils.java:299-345). Add mode checks and warnings for nested non-STRUCT drift too.

  5. P2: no-current-agent fallback is only partial.
    The NPE/drop is fixed, but the agent column now becomes "unknown" (BigQueryAgentAnalyticsPlugin.java:292, 357-366) even when Event.author() is available on source-event rows. The issue asked for Python-style fallback to Event.author; Java currently only stores that as attributes.source_event_author (BigQueryAgentAnalyticsPlugin.java:644-656).

Coverage against #1316

Good fixes landed: redaction, empty STATE_DELTA guard, timestamp partitioning on create, retry after most table setup failures, root agent initialization, defensive agent lookup, regional StreamWriter.setLocation, config validation, view identifier validation, EventData header/Javadoc, recursion depth guard, and quieter logs.

Still intentionally unresolved from the issue: HITL/user-message resume parity, added Python event types and views, ambient OTel/internal span-tree semantics, plugin-owned OTel span export, full attributes.adk envelope, off-thread first-event ensure, default table/view naming decision, per-invocation writer scaling, and salvage/DLQ/retry behavior for bad batches.

Readiness

Recommendation: keep this PR as draft until the P1s above are fixed and tests are added. Current GH status also has cla/google failing; only lightweight checks passed so far.

…, license, agent fallback

- F1: maybeUpgradeSchema now returns success; ensureTableExists only marks the
  table ready when create/existing/upgrade all succeed (failed upgrade retries).
- F2: add missing Apache license header to PluginState.java.
- F3: fold drop counters AFTER BatchProcessor.close() so final-drain drops count.
- F4: schema drift check now also compares Field.Mode (was type-only), incl.
  nested fields via recursion.
- F5: agent column falls back to Event.author (via EventData.fallbackAgentName)
  on source-event rows (STATE_DELTA/A2A/AGENT_RESPONSE) before the "unknown"
  sentinel, matching the Python plugin.

Still untested locally; unit tests to follow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@caohy1988

Copy link
Copy Markdown
Author

Thanks — all five findings validated against source and fixed in 11d2fab. Point by point:

  1. Table bootstrap retry after upgrade failure — fixed. maybeUpgradeSchema() now returns boolean (true = up-to-date or upgraded successfully; false = update threw). ensureTableExists() sets tableReady from that result on the existing-table path, so a failed schema upgrade no longer flips tableEnsured=true and is retried on the next event. Create and no-upgrade paths set tableReady=true explicitly.

  2. License header — fixed. Added the Apache header to PluginState.java. Swept the rest of the package; it was the only remaining file starting at package ….

  3. Drop stats miss final-drain drops — fixed. foldDropStats(processor) now runs after processor.close() in both ensureInvocationCompleted() and close(), so rows dropped during the closing flush are counted before the processor is discarded. (Counters are AtomicLongs on the processor and are still readable post-close.)

  4. Mode drift — fixed. schemaFieldsMatch() now compares Field.Mode in addition to StandardType, with null normalized to NULLABLE. The warning fires on type or mode drift and covers nested non-STRUCT fields (same method recurses into STRUCTs). Message now reports type/mode on both sides.

  5. Agent fallback to Event.author — implemented. Added EventData.fallbackAgentName (Optional). resolveAgentName() now returns current agent → fallbackAgentName"unknown". The source-event rows in onEventCallback (STATE_DELTA, A2A_INTERACTION, AGENT_RESPONSE) now pass event.author() as the fallback, matching the Python behavior. HITL rows keep the live agent (an agent is always present mid-invocation there), so they were left as-is; let me know if you'd rather thread author there too.

Everything under "still intentionally unresolved" in your review remains deferred (span-tree/OTel semantics, ADK envelope, HITL long-running resume + pair keys, off-thread first-event ensure, default naming, per-invocation writer scaling, DLQ/salvage). Keeping the PR draft: still not built locally and unit tests for these behaviors are pending. The cla/google check is the CLA signature, separate from the code.

Deferred google#1316 P1 item: span_id/parent_span_id now always come from the BQAA
internal execution tree (getCurrentSpanAndParent) instead of preferring ambient
OpenTelemetry span IDs, so parent_span_id always references another logged row
rather than an unlogged framework span. Ambient OTel still governs trace_id via
getTraceId. The completed/error callbacks (LLM/tool) and getCompletedEventData
now always record the popped internal span, and the now-dead
getAmbientSpanAndParent() helper is removed.

Tests (JUnit4/vintage, matching existing style):
- JsonFormatterTest: sensitive-key redaction (top-level, case-insensitive +
  temp: prefix, nested; is_truncated unaffected) and recursion depth guard.
- BigQueryLoggerConfigTest (new): build() rejects non-positive
  batchSize/queueMaxSize/maxContentLength.
- TraceManagerTest: initTraceIfNeeded sets root agent name; null-agent keeps
  the sentinel.

Still not executed locally; CI verifies.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@caohy1988

Copy link
Copy Markdown
Author

Two follow-ups in 08fe970:

1. Internal span tree under ambient OTel (previously deferred #1316 P1). getResolvedTraceIds now always sources span_id/parent_span_id from the BQAA internal execution tree (getCurrentSpanAndParent) instead of preferring ambient OpenTelemetry span IDs, so parent_span_id always references another logged row rather than an unlogged framework span. Ambient OTel still governs trace_id (via getTraceId). The LLM/tool completed+error callbacks and getCompletedEventData now unconditionally record the popped internal span (dropped the !hasAmbient gating); the now-dead getAmbientSpanAndParent() helper is removed.

Note: this does not stop the plugin from creating real OTel spans — the duplicate-export concern is separate and still deferred. This change only fixes which span IDs land in the rows.

2. Unit tests (JUnit4/vintage, matching existing style):

  • JsonFormatterTest: sensitive-key redaction — top-level, case-insensitive, temp: prefix, nested; asserts redaction does not set is_truncated — plus the recursion depth guard.
  • BigQueryLoggerConfigTest (new): build() rejects non-positive batchSize/queueMaxSize/maxContentLength.
  • TraceManagerTest: initTraceIfNeeded sets the root agent name from the context, and a null agent keeps the sentinel.

Still not executed locally (no Maven build in my environment), so CI is the gate. Remaining deferred items from your review are unchanged: ADK envelope, HITL long-running resume + pair keys, plugin-owned OTel span export, off-thread first-event ensure, default naming, per-invocation writer scaling, DLQ/salvage.

…n STRUCT mode drift

- Tests no longer expect a STATE_DELTA row for events that never set a
  state delta; onEventCallback_populatesCorrectFields now sets a real
  stateDelta and asserts it lands in attributes, and a new test pins
  down that empty deltas emit no row.
- schemaFieldsMatch now checks mode drift on STRUCT columns themselves
  (e.g. NULLABLE vs REPEATED content_parts) before recursing into
  subfields, via a shared warnOnIncompatibleDrift helper; added a
  regression test capturing the warning.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hemasekhar-p hemasekhar-p self-assigned this Jul 2, 2026
@hemasekhar-p

Copy link
Copy Markdown
Contributor

Hi @caohy1988, thank you for your contribution! We appreciate you taking the time to submit this pull request. I noticed that the pre-checks are currently failing and make sure your PR consist of single commit as per Contribution Policy. Could you please take a look and address those comments?

@hemasekhar-p hemasekhar-p added the waiting on reporter Waiting for reaction by reporter. Failing that, maintainers will eventually closed it as stale. label Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

waiting on reporter Waiting for reaction by reporter. Failing that, maintainers will eventually closed it as stale.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BQAA Java preview readiness: parity and safety gaps vs Python plugin

2 participants