Skip to content

telemetry: don't record error on chat span after successful LLM completion#3281

Draft
jedp-docker wants to merge 1 commit into
docker:mainfrom
jedp-docker:fix/chat-span-false-errors
Draft

telemetry: don't record error on chat span after successful LLM completion#3281
jedp-docker wants to merge 1 commit into
docker:mainfrom
jedp-docker:fix/chat-span-false-errors

Conversation

@jedp-docker

Copy link
Copy Markdown
Contributor

Summary

  • handleStream returns the moment it sees a terminal finish_reason (a deliberate latency optimisation — don't block the next turn on trailing message_stop/EOF). The deferred stream.Close() then closes the HTTP/2 body while the reader goroutine is still blocked in Recv(), causing that read to return "http2: response body closed" (or "context canceled" if the per-attempt streamCancel fires first).
  • instrumentedStream was recording this transport-level teardown as a span error even when the completion had fully succeeded — every affected span also carries gen_ai.response.finish_reasons=["stop"] and full token usage, confirming the response was complete.
  • This inflates the docker-agent error rate on every LLM call. Observed consistently across all harbor_watch.investigation traces in Tempo.

Fix: add finishReasonSeen bool to instrumentedStream (set-once on the single-consumer Recv path). Once a terminal finish reason is delivered, skip RecordError for subsequent transport errors — they're benign teardown, not failures. endOnce is still called so the span closes.

Test plan

  • Verify existing genai package tests pass: go test ./pkg/telemetry/genai/...
  • In Tempo: after deploy, confirm chat claude-* spans with finish_reasons=stop no longer appear as STATUS_CODE_ERROR
  • Confirm spans that fail before a finish reason (genuine mid-stream errors) still record errors

🤖 Generated with Claude Code

@docker-agent docker-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟢 APPROVE

The fix is correct and well-scoped. The finishReasonSeen flag is set on the single-consumer Recv path — no concurrency risk — and the teardown-suppression logic correctly gates only RecordError, leaving endOnce() unconditional so spans always close. All terminal finish reasons are covered (the flag trips on any non-empty FinishReason, not just "stop"). The explanatory comment accurately describes the transport-teardown scenario and the intended invariant.

No bugs found in the introduced code.

@aheritier aheritier added area/telemetry Telemetry, tracing, and metrics kind/fix PR fixes a bug (maps to fix:). Use on PRs only. labels Jun 27, 2026
@jedp-docker jedp-docker force-pushed the fix/chat-span-false-errors branch 4 times, most recently from dfd48a4 to 6960f55 Compare June 29, 2026 19:36
handleStream returns the moment it sees a terminal finish_reason rather
than draining the stream to io.EOF (a deliberate latency optimisation).
The deferred stream.Close() then closes the HTTP/2 body while the reader
goroutine is still blocked in Recv(), so that read returns
"http2: response body closed" (or "context canceled" if the per-attempt
streamCancel fires first). instrumentedStream was recording this as a
span error even though the LLM response had already succeeded — every
affected span also carries gen_ai.response.finish_reasons=["stop"] and
full token usage.

Fix: add finishReasonSeen (set-once bool on the single-consumer Recv
path). Once a terminal finish reason has been delivered the completion
has logically succeeded; subsequent transport-level errors are benign
teardown. Skip RecordError but still call endOnce so the span closes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@jedp-docker jedp-docker force-pushed the fix/chat-span-false-errors branch from 6960f55 to ec2a5b4 Compare June 29, 2026 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/telemetry Telemetry, tracing, and metrics kind/fix PR fixes a bug (maps to fix:). Use on PRs only.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants