telemetry: don't record error on chat span after successful LLM completion#3281
Draft
jedp-docker wants to merge 1 commit into
Draft
telemetry: don't record error on chat span after successful LLM completion#3281jedp-docker wants to merge 1 commit into
jedp-docker wants to merge 1 commit into
Conversation
docker-agent
left a comment
Contributor
There was a problem hiding this comment.
Assessment: 🟢 APPROVE
The fix is correct and well-scoped. The finishReasonSeen flag is set on the single-consumer Recv path — no concurrency risk — and the teardown-suppression logic correctly gates only RecordError, leaving endOnce() unconditional so spans always close. All terminal finish reasons are covered (the flag trips on any non-empty FinishReason, not just "stop"). The explanatory comment accurately describes the transport-teardown scenario and the intended invariant.
No bugs found in the introduced code.
dfd48a4 to
6960f55
Compare
handleStream returns the moment it sees a terminal finish_reason rather than draining the stream to io.EOF (a deliberate latency optimisation). The deferred stream.Close() then closes the HTTP/2 body while the reader goroutine is still blocked in Recv(), so that read returns "http2: response body closed" (or "context canceled" if the per-attempt streamCancel fires first). instrumentedStream was recording this as a span error even though the LLM response had already succeeded — every affected span also carries gen_ai.response.finish_reasons=["stop"] and full token usage. Fix: add finishReasonSeen (set-once bool on the single-consumer Recv path). Once a terminal finish reason has been delivered the completion has logically succeeded; subsequent transport-level errors are benign teardown. Skip RecordError but still call endOnce so the span closes. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
6960f55 to
ec2a5b4
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
handleStreamreturns the moment it sees a terminalfinish_reason(a deliberate latency optimisation — don't block the next turn on trailingmessage_stop/EOF). The deferredstream.Close()then closes the HTTP/2 body while the reader goroutine is still blocked inRecv(), causing that read to return"http2: response body closed"(or"context canceled"if the per-attemptstreamCancelfires first).instrumentedStreamwas recording this transport-level teardown as a span error even when the completion had fully succeeded — every affected span also carriesgen_ai.response.finish_reasons=["stop"]and full token usage, confirming the response was complete.docker-agenterror rate on every LLM call. Observed consistently across allharbor_watch.investigationtraces in Tempo.Fix: add
finishReasonSeen booltoinstrumentedStream(set-once on the single-consumerRecvpath). Once a terminal finish reason is delivered, skipRecordErrorfor subsequent transport errors — they're benign teardown, not failures.endOnceis still called so the span closes.Test plan
genaipackage tests pass:go test ./pkg/telemetry/genai/...chat claude-*spans withfinish_reasons=stopno longer appear asSTATUS_CODE_ERROR🤖 Generated with Claude Code