BQAA (Java): preview-readiness fixes — redaction, table bootstrap, drop stats, reliability by caohy1988 · Pull Request #1318 · google/adk-java

caohy1988 · 2026-07-01T19:04:30Z

Summary

Addresses preview-readiness gaps in the Java BigQuery Agent Analytics (BQAA) plugin tracked in #1316, bringing it closer to parity with the Python plugin and hardening the write path.

Important

This change was not built or tested locally (the full Maven build was not run in the authoring environment). Please rely on CI and review. Unit tests for the new behaviors still need to be added — see "Tests to add" below. Findings were verified by direct source review against the issue-pinned SHA ce6af87, which is byte-identical to current main for this package.

Fixes #1316 (partially — remaining deferred items listed below).

P1 — preview blockers

Secret redaction (JsonFormatter): sensitive keys (client_secret, access_token, refresh_token, id_token, api_key, password, and any temp:* key) are now redacted to [REDACTED] during truncation, matching the Python plugin's _SENSITIVE_KEYS. Case-insensitive; redaction does not set the is_truncated flag (parity). This covers tool args/results, session state, usage metadata, and custom-tag paths that flow through smartTruncate.
Empty STATE_DELTA guard (onEventCallback): a STATE_DELTA row is only emitted when event.actions().stateDelta() is non-empty, matching Python (no row for empty deltas).
Table bootstrap made production-safe (ensureTableExists / ensureTableExistsOnce):
- New tables are created with DAY time-partitioning on timestamp (parity with Python; enables partition pruning), in addition to the existing clustering.
- tableEnsured is now set only after a successful setup, so a transient first-run failure (auth blip, missing dataset, quota) is retried on later events instead of permanently disabling table create/upgrade for the plugin instance.
- Concurrent-create (already exists) is treated as success rather than a hard failure.
root_agent_name is populated (TraceManager.initTraceIfNeeded, called from getAttributes): attributes.root_agent_name now resolves to the real root agent name instead of the sentinel default. Null-safe.
No-current-agent guard (resolveAgentName): the agent column is resolved defensively; workflow-driven callbacks with no current agent fall back to "unknown" instead of NPE-dropping the row.

P2 — reliability hardening

Drop accounting (BatchProcessor + PluginState + getDropStats()): rows lost to a full queue, append errors, or serialization errors are now counted and exposed via plugin.getDropStats() (queue_full / append_error / serialization_error). Counters survive per-invocation processor churn by folding into plugin-level totals at close.
Storage Write regional routing: the StreamWriter is now built with .setLocation(config.location()) so append RPCs route to the dataset's region (previously only the control-plane client set location; non-US/EU appends risked stream-not-found).
JVM shutdown hook: public construction paths register a best-effort drain (state.close() bounded by shutdownTimeout) so a host that never calls close() still flushes queued rows at exit. The package-private test constructor does not register a hook (avoids hook leakage in tests).
Schema auto-upgrade always diffs (maybeUpgradeSchema): removed the version-label-only early return that Python deliberately avoids — a table stamped with the current label but missing columns is now reconciled. Incompatible non-STRUCT type drift is logged (was silently ignored, later surfacing as opaque append failures).

P3 — operational polish

Config validation: batchSize, queueMaxSize, and maxContentLength are validated as positive in build().
View DDL identifier validation: createAnalyticsViews refuses to interpolate a project/dataset/table/viewPrefix that isn't a safe identifier (guards the string-formatted DDL).
Recursion depth guard in recursiveSmartTruncate (prevents StackOverflowError on pathological nesting) + a log line on the string-fallback path.
Hot-path INFO logs lowered to FINE (isProcessed, pending-task wait/removal, stale-span clearing, table-exists).
EventData.java: added the missing Apache license header and fixed stale _log_event Javadoc.

Deferred (intentionally not in this PR)

These need behavior/parity decisions and tests I did not want to ship blind:

Preserve the BQAA internal span tree under ambient OpenTelemetry (BQAA Java preview readiness: parity and safety gaps vs Python plugin #1316 P1 item): still prefers ambient span IDs. Changing this alters correlation semantics and needs tests.
Full ADK attributes.adk envelope parity (BQAA Java preview readiness: parity and safety gaps vs Python plugin #1316 P2): route/render/rewind metadata, source-event node/branch/scope, long-running pair keys.
Long-running (non-HITL) tool resume → TOOL_COMPLETED with pause_kind/function_call_id pair keys (BQAA Java preview readiness: parity and safety gaps vs Python plugin #1316 P1).
Fully off-thread first-event table ensure: ensureTableExists is still synchronous on the first event's callback. Moving it off-thread safely needs append-gating so early rows aren't lost; deferred to avoid an untested concurrency change. (Note for whoever picks this up: snapshot any mutable InvocationContext/Session data before crossing threads.)
Default table/view naming decision (agent_analytics.events + createViews=false vs Python agent_events + create_views=True): a product/docs decision, not a code fix.

Plugin-owned OTel span export (#1316 comment) is related to the span-tree item and is deferred with it.

Tests to add

Redaction across nested maps, tool args/results, session state, custom tags; is_truncated unaffected by redaction.
Empty state_delta produces no STATE_DELTA row.
Table creation sets DAY partitioning + clustering; failed first-run setup is retried on later events.
getDropStats() increments on queue-full / append-error / serialization-error paths.
Schema upgrade reconciles a same-label table that is missing a column.
Config builder rejects non-positive batchSize/queueMaxSize/maxContentLength.
Shutdown hook drains queued rows when close() is not called explicitly.

🤖 Generated with Claude Code

…p stats, reliability) Addresses google#1316. Highlights: - P1: secret redaction in JsonFormatter; empty-STATE_DELTA guard; DAY time-partitioning + retry-after-failure table bootstrap; root_agent_name initialization; no-current-agent guard. - P2: drop-row accounting via getDropStats(); StreamWriter regional routing; JVM shutdown-hook drain; schema auto-upgrade always diffs (+ drift warning). - P3: config numeric validation; view-identifier validation; truncation depth guard; hot-path INFO->FINE; EventData license header + Javadoc fix. Not built/tested locally; unit tests still to add. Deep-semantic parity items (span-tree-under-OTel, full ADK envelope, long-running tool resume, off-thread first-event ensure) intentionally deferred; see PR description. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

google-cla · 2026-07-01T19:04:47Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

caohy1988 · 2026-07-01T19:13:26Z

Review against #1316 preview-readiness issue

I re-reviewed this PR against the tracked preview blockers in #1316. This is a strong partial fix, but I would keep it draft for now. A few issue-tracked blockers remain, and I found two regressions/holes in the new implementation that should be fixed before marking ready.

Findings

P1: table bootstrap still stops retrying after schema-upgrade failure.
ensureTableExists() marks the table ready after maybeUpgradeSchema() returns (BigQueryAgentAnalyticsPlugin.java:218-221), but maybeUpgradeSchema() catches and logs BigQueryException without surfacing failure (BigQueryUtils.java:269-287). Result: a failed upgrade can still set tableEnsured=true (BigQueryAgentAnalyticsPlugin.java:163-172), leaving the instance permanently “ensured” while appends may fail on missing columns. Please make schema upgrade return success/failure or rethrow, and only set tableEnsured after create/existing-table/schema-upgrade all succeed.
P1: license-header cleanup is still incomplete.
EventData.java is fixed, but PluginState.java still starts at package ... with no Apache header (PluginState.java:1). Since BQAA Java preview readiness: parity and safety gaps vs Python plugin #1316 tracks this as release/preview hygiene, this should be fixed before preview.
P2: drop stats miss close/shutdown-time drops.
PluginState.ensureInvocationCompleted() folds counters before processor.close() (PluginState.java:266-270), and global close() does the same (PluginState.java:329-331). But BatchProcessor.close() can still flush queued rows and increment drop counters (BatchProcessor.java:273-276, write failures counted at BatchProcessor.java:149-181). Drops that happen during final drain are then removed with the processor and never reach getDropStats(). Fold after final close/drain, or fold both before and after.
P2: schema drift logging still ignores mode drift.
The PR logs incompatible type drift only (BigQueryUtils.java:328-342). The BQAA Java preview readiness: parity and safety gaps vs Python plugin #1316 follow-up finding was type/mode drift; mode differences remain silent because schemaFieldsMatch() does not compare Field.Mode for top-level or nested fields (BigQueryUtils.java:299-345). Add mode checks and warnings for nested non-STRUCT drift too.
P2: no-current-agent fallback is only partial.
The NPE/drop is fixed, but the agent column now becomes "unknown" (BigQueryAgentAnalyticsPlugin.java:292, 357-366) even when Event.author() is available on source-event rows. The issue asked for Python-style fallback to Event.author; Java currently only stores that as attributes.source_event_author (BigQueryAgentAnalyticsPlugin.java:644-656).

Coverage against #1316

Good fixes landed: redaction, empty STATE_DELTA guard, timestamp partitioning on create, retry after most table setup failures, root agent initialization, defensive agent lookup, regional StreamWriter.setLocation, config validation, view identifier validation, EventData header/Javadoc, recursion depth guard, and quieter logs.

Still intentionally unresolved from the issue: HITL/user-message resume parity, added Python event types and views, ambient OTel/internal span-tree semantics, plugin-owned OTel span export, full attributes.adk envelope, off-thread first-event ensure, default table/view naming decision, per-invocation writer scaling, and salvage/DLQ/retry behavior for bad batches.

Readiness

Recommendation: keep this PR as draft until the P1s above are fixed and tests are added. Current GH status also has cla/google failing; only lightweight checks passed so far.

…, license, agent fallback - F1: maybeUpgradeSchema now returns success; ensureTableExists only marks the table ready when create/existing/upgrade all succeed (failed upgrade retries). - F2: add missing Apache license header to PluginState.java. - F3: fold drop counters AFTER BatchProcessor.close() so final-drain drops count. - F4: schema drift check now also compares Field.Mode (was type-only), incl. nested fields via recursion. - F5: agent column falls back to Event.author (via EventData.fallbackAgentName) on source-event rows (STATE_DELTA/A2A/AGENT_RESPONSE) before the "unknown" sentinel, matching the Python plugin. Still untested locally; unit tests to follow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

caohy1988 · 2026-07-01T19:20:24Z

Thanks — all five findings validated against source and fixed in 11d2fab. Point by point:

Table bootstrap retry after upgrade failure — fixed. maybeUpgradeSchema() now returns boolean (true = up-to-date or upgraded successfully; false = update threw). ensureTableExists() sets tableReady from that result on the existing-table path, so a failed schema upgrade no longer flips tableEnsured=true and is retried on the next event. Create and no-upgrade paths set tableReady=true explicitly.
License header — fixed. Added the Apache header to PluginState.java. Swept the rest of the package; it was the only remaining file starting at package ….
Drop stats miss final-drain drops — fixed. foldDropStats(processor) now runs after processor.close() in both ensureInvocationCompleted() and close(), so rows dropped during the closing flush are counted before the processor is discarded. (Counters are AtomicLongs on the processor and are still readable post-close.)
Mode drift — fixed. schemaFieldsMatch() now compares Field.Mode in addition to StandardType, with null normalized to NULLABLE. The warning fires on type or mode drift and covers nested non-STRUCT fields (same method recurses into STRUCTs). Message now reports type/mode on both sides.
Agent fallback to Event.author — implemented. Added EventData.fallbackAgentName (Optional). resolveAgentName() now returns current agent → fallbackAgentName → "unknown". The source-event rows in onEventCallback (STATE_DELTA, A2A_INTERACTION, AGENT_RESPONSE) now pass event.author() as the fallback, matching the Python behavior. HITL rows keep the live agent (an agent is always present mid-invocation there), so they were left as-is; let me know if you'd rather thread author there too.

Everything under "still intentionally unresolved" in your review remains deferred (span-tree/OTel semantics, ADK envelope, HITL long-running resume + pair keys, off-thread first-event ensure, default naming, per-invocation writer scaling, DLQ/salvage). Keeping the PR draft: still not built locally and unit tests for these behaviors are pending. The cla/google check is the CLA signature, separate from the code.

Deferred google#1316 P1 item: span_id/parent_span_id now always come from the BQAA internal execution tree (getCurrentSpanAndParent) instead of preferring ambient OpenTelemetry span IDs, so parent_span_id always references another logged row rather than an unlogged framework span. Ambient OTel still governs trace_id via getTraceId. The completed/error callbacks (LLM/tool) and getCompletedEventData now always record the popped internal span, and the now-dead getAmbientSpanAndParent() helper is removed. Tests (JUnit4/vintage, matching existing style): - JsonFormatterTest: sensitive-key redaction (top-level, case-insensitive + temp: prefix, nested; is_truncated unaffected) and recursion depth guard. - BigQueryLoggerConfigTest (new): build() rejects non-positive batchSize/queueMaxSize/maxContentLength. - TraceManagerTest: initTraceIfNeeded sets root agent name; null-agent keeps the sentinel. Still not executed locally; CI verifies. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

caohy1988 · 2026-07-01T22:44:54Z

Two follow-ups in 08fe970:

1. Internal span tree under ambient OTel (previously deferred #1316 P1). getResolvedTraceIds now always sources span_id/parent_span_id from the BQAA internal execution tree (getCurrentSpanAndParent) instead of preferring ambient OpenTelemetry span IDs, so parent_span_id always references another logged row rather than an unlogged framework span. Ambient OTel still governs trace_id (via getTraceId). The LLM/tool completed+error callbacks and getCompletedEventData now unconditionally record the popped internal span (dropped the !hasAmbient gating); the now-dead getAmbientSpanAndParent() helper is removed.

Note: this does not stop the plugin from creating real OTel spans — the duplicate-export concern is separate and still deferred. This change only fixes which span IDs land in the rows.

2. Unit tests (JUnit4/vintage, matching existing style):

JsonFormatterTest: sensitive-key redaction — top-level, case-insensitive, temp: prefix, nested; asserts redaction does not set is_truncated — plus the recursion depth guard.
BigQueryLoggerConfigTest (new): build() rejects non-positive batchSize/queueMaxSize/maxContentLength.
TraceManagerTest: initTraceIfNeeded sets the root agent name from the context, and a null agent keeps the sentinel.

Still not executed locally (no Maven build in my environment), so CI is the gate. Remaining deferred items from your review are unchanged: ADK envelope, HITL long-running resume + pair keys, plugin-owned OTel span export, off-thread first-event ensure, default naming, per-invocation writer scaling, DLQ/salvage.

…n STRUCT mode drift - Tests no longer expect a STATE_DELTA row for events that never set a state delta; onEventCallback_populatesCorrectFields now sets a real stateDelta and asserts it lands in attributes, and a new test pins down that empty deltas emit no row. - schemaFieldsMatch now checks mode drift on STRUCT columns themselves (e.g. NULLABLE vs REPEATED content_parts) before recursing into subfields, via a shared warnOnIncompatibleDrift helper; added a regression test capturing the warning. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

hemasekhar-p · 2026-07-02T09:45:53Z

Hi @caohy1988, thank you for your contribution! We appreciate you taking the time to submit this pull request. I noticed that the pre-checks are currently failing and make sure your PR consist of single commit as per Contribution Policy. Could you please take a look and address those comments?

hemasekhar-p self-assigned this Jul 2, 2026

hemasekhar-p added the waiting on reporter Waiting for reaction by reporter. Failing that, maintainers will eventually closed it as stale. label Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

BQAA (Java): preview-readiness fixes — redaction, table bootstrap, drop stats, reliability#1318

BQAA (Java): preview-readiness fixes — redaction, table bootstrap, drop stats, reliability#1318
caohy1988 wants to merge 4 commits into
google:mainfrom
caohy1988:fix/bqaa-preview-readiness

caohy1988 commented Jul 1, 2026

Uh oh!

google-cla Bot commented Jul 1, 2026

Uh oh!

caohy1988 commented Jul 1, 2026

Uh oh!

caohy1988 commented Jul 1, 2026

Uh oh!

caohy1988 commented Jul 1, 2026

Uh oh!

hemasekhar-p commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

caohy1988 commented Jul 1, 2026

Summary

P1 — preview blockers

P2 — reliability hardening

P3 — operational polish

Deferred (intentionally not in this PR)

Tests to add

Uh oh!

google-cla Bot commented Jul 1, 2026

Uh oh!

caohy1988 commented Jul 1, 2026

Review against #1316 preview-readiness issue

Findings

Coverage against #1316

Readiness

Uh oh!

caohy1988 commented Jul 1, 2026

Uh oh!

caohy1988 commented Jul 1, 2026

Uh oh!

hemasekhar-p commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants