fix(server): prevent exec relays from hanging on idle connections #1992
Add HTTP/2 keepalive on supervisor multiplex connections so half-dead sessions cannot leave in-flight exec relays parked indefinitely. Configure SSH keepalive on exec relay clients so long silent commands are not timed out on stdout idle alone; wedged or orphaned relays fail after missed keepalives instead. After a command reports exit status, bound how long the gateway waits for the trailing channel close. Return UNAVAILABLE when a relay closes before reporting exit status rather than defaulting to exit code 1.

Signed-off-by: Gal Zaidman <gzaidman@nvidia.com>
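To illustrate the last change described above, here is a minimal sketch of returning UNAVAILABLE instead of a default exit code. It is not the gateway's actual code: `RelayError` and `exec_result` are hypothetical names, and the enum only stands in for a gRPC status.

```rust
/// Hypothetical error type standing in for a gRPC status; only the
/// UNAVAILABLE case is modelled in this sketch.
#[derive(Debug, PartialEq)]
pub enum RelayError {
    /// The relay closed before the command reported an exit status.
    Unavailable(String),
}

/// Maps what the relay observed to the result handed back to the caller.
/// `exit_status` is `Some(code)` if the command reported an exit status
/// before the channel closed, and `None` if the channel closed first.
pub fn exec_result(exit_status: Option<u32>) -> Result<u32, RelayError> {
    match exit_status {
        // A real exit status is passed through unchanged, including 1.
        Some(code) => Ok(code),
        // Defaulting to exit code 1 here would be indistinguishable from
        // a command that genuinely exited with status 1.
        None => Err(RelayError::Unavailable(
            "relay closed before reporting exit status".to_string(),
        )),
    }
}
```

With this shape, a caller can tell "the command failed" (`Ok(1)`) apart from "the relay was lost" (`Err(Unavailable)`) and retry only the latter.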
/ok to test b4878be |
@Gal-Zaidman have you been able to verify this resolves the issue in your environment? |
Yes. I just ran a job with 80 concurrent agents, each running an SWE-bench task with long execs (that is how harbor works): zero hangs.
Summary
Gateway `ExecSandbox` calls could hang indefinitely after the command finished when a supervisor session reset mid-exec orphaned the relay channel: the exec loop blocked on `channel.wait()` with no liveness backstop, so callers hung until their own deadline. This adds SSH and HTTP/2 keepalives and bounds the post-exit wait so a wedged or orphaned relay fails fast instead of hanging.
Related Issue
Closes #1990
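The bounded post-exit wait described in the summary can be sketched with the standard library alone. This is an illustration under stated assumptions, not the gateway's implementation: `RelayEvent`, `PostExit`, and `wait_for_close` are hypothetical names, and a `std::sync::mpsc` receiver stands in for the SSH channel.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical relay events; the real channel messages are not shown in this PR.
#[derive(Debug)]
pub enum RelayEvent {
    Data(Vec<u8>),
    ExitStatus(u32),
    Closed,
}

/// Outcome of draining the channel after an exit status was seen.
#[derive(Debug, PartialEq)]
pub enum PostExit {
    /// The channel closed cleanly within the grace period.
    Closed,
    /// The grace period elapsed, or the sender vanished, without a close.
    GaveUp,
}

/// After the exit status has arrived, wait at most `grace` for the trailing
/// close instead of blocking on the channel forever.
pub fn wait_for_close(events: &Receiver<RelayEvent>, grace: Duration) -> PostExit {
    let deadline = Instant::now() + grace;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            // Deadline reached even though events may still be arriving.
            return PostExit::GaveUp;
        }
        match events.recv_timeout(remaining) {
            Ok(RelayEvent::Closed) => return PostExit::Closed,
            // Trailing output or a repeated status: keep draining.
            Ok(_) => continue,
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
                return PostExit::GaveUp;
            }
        }
    }
}
```

The deadline is fixed once, so trailing data cannot extend the wait; an orphaned channel costs the caller at most `grace`.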
Changes
- … `channel.wait()` forever. Channel-silent execs (e.g. an agent that redirects stdout to a file) stay alive while the relay is healthy: liveness is probed via keepalive, not output idleness.
- Return `UNAVAILABLE` when a relay closes before reporting an exit status, instead of a misleading exit code 1.
- … `Timer`) on supervisor multiplex connections, to reduce the session resets that orphan relays.
- … `architecture/gateway.md`.
Testing
- `mise run pre-commit`: `rust:format:check`, `cargo clippy -D warnings`, and markdownlint are clean for this change (ran individually). Note: the local `mise run pre-commit` aborts on its `python:proto` step due to a missing `grpc_tools` dev dependency in the venv, unrelated to this change; CI runs the full suite.
- Manually validated on a Kubernetes deployment: rebuilt and deployed the gateway image, confirmed the gateway is healthy, that long channel-silent execs are not killed by the keepalive, and that the previously-observed multi-sandbox hang no longer reproduces.
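The behavior validated above, where a silent exec survives while a dead relay is detected, follows from counting missed keepalives rather than measuring output idleness. A minimal sketch of that bookkeeping, assuming a hypothetical `KeepaliveTracker` type and an illustrative threshold (the actual SSH and HTTP/2 libraries handle this internally):

```rust
/// Tracks liveness from keepalive replies only. Command output is deliberately
/// not an input, so a silent command is never mistaken for a dead relay.
/// Hypothetical type: the names and the threshold are illustrative.
#[derive(Debug)]
pub struct KeepaliveTracker {
    missed: u32,
    max_missed: u32,
}

impl KeepaliveTracker {
    pub fn new(max_missed: u32) -> Self {
        Self { missed: 0, max_missed }
    }

    /// Call when a keepalive probe was sent and no reply arrived in time.
    pub fn on_probe_timeout(&mut self) {
        self.missed = self.missed.saturating_add(1);
    }

    /// Call when the peer answered a keepalive probe; resets the streak.
    pub fn on_probe_reply(&mut self) {
        self.missed = 0;
    }

    /// The relay is treated as dead once `max_missed` consecutive probes
    /// have gone unanswered.
    pub fn is_dead(&self) -> bool {
        self.missed >= self.max_missed
    }
}
```

A command that writes nothing to stdout for an hour never touches this counter as long as the peer keeps answering probes, which is the distinction between keepalive-based and output-idle timeouts.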
Checklist