diff --git a/.agents/skills/debug-openshell-cluster/SKILL.md b/.agents/skills/debug-openshell-cluster/SKILL.md index 68ecc7749..b859b0d54 100644 --- a/.agents/skills/debug-openshell-cluster/SKILL.md +++ b/.agents/skills/debug-openshell-cluster/SKILL.md @@ -268,6 +268,48 @@ kubectl -n openshell get configmap openshell-config -o jsonpath='{.data.gateway\ kubectl -n get sandbox -o jsonpath='{.spec.template.spec.serviceAccountName}{"\n"}' ``` +If `supervisor_topology = "sidecar"` is rendered, sandbox pods should have an +`openshell-network-init` init container running `--mode=network-init`, an +`agent` container running `openshell-sandbox --mode=process`, and an +`openshell-supervisor-network` container running `--mode=network`. The init +container owns nftables setup and should be the only sidecar topology container +with `NET_ADMIN`. It also needs `CHOWN`/`FOWNER` to hand shared emptyDir state +to `proxy_uid`. The long-running network sidecar runs as +`proxy_uid` with primary GID `0` so it can read the root-owned, +group-readable projected service-account token. In sidecar topology the +`openshell-sa-token` projected volume should render `defaultMode: 288` (`0440`); +if the proxy logs `failed to read K8s SA token`, verify this token mode and the +network sidecar security context. The process container should also publish the +workload entrypoint PID to `OPENSHELL_ENTRYPOINT_PID_FILE` +(`/run/openshell-sidecar/entrypoint.pid` by default), and the network sidecar +should read it for binary-scoped policy decisions; if allowed network rules are +all denied, inspect that file and the network sidecar logs. + +If `supervisor_topology = "proxy-pod"` is rendered, each sandbox should have a +separate supervisor Deployment with one supervisor pod, a headless supervisor +Service, a proxy CA Secret, and two per-sandbox NetworkPolicies. 
The agent pod +should have `openshell.ai/sandbox-role=agent`; the supervisor pod should have +`openshell.ai/sandbox-role=supervisor`; both should share the same +`openshell.ai/sandbox-id`. The supervisor Deployment must have a controlling +`Sandbox` ownerReference. The Deployment pod template must carry the +`openshell.io/sandbox-id` annotation so the TokenReview bootstrap path can mint +a sandbox JWT. For supervisor pods, the gateway validates the +`Pod -> ReplicaSet -> Deployment -> Sandbox` owner chain, so missing +`apps/replicasets get` RBAC can also break bootstrap. If the agent cannot reach +the gateway, check DNS to the headless Service, the agent egress NetworkPolicy +DNS exception for kube-dns/CoreDNS, and the supervisor ingress NetworkPolicy +allowing only that agent pod on ports `3128` and `18080`. +Inspect all three when sandbox registration or egress enforcement fails: + +```bash +kubectl -n openshell get configmap openshell-config -o jsonpath='{.data.gateway\.toml}' | grep supervisor_topology +kubectl -n get pod -o jsonpath='{range .spec.initContainers[*]}{.name}{" "}{.command}{"\n"}{end}' +kubectl -n get pod -o jsonpath='{range .spec.containers[*]}{.name}{" "}{.command}{"\n"}{end}' +kubectl -n logs -c openshell-network-init --tail=200 +kubectl -n logs -c openshell-supervisor-network --tail=200 +kubectl -n logs -c agent --tail=200 +``` + ### Step 6: Check VM-Backed Gateways Use the VM driver logs and host diagnostics available in the user's environment. Verify: diff --git a/.agents/skills/helm-dev-environment/SKILL.md b/.agents/skills/helm-dev-environment/SKILL.md index bffa4e2e8..a2a34f8c0 100644 --- a/.agents/skills/helm-dev-environment/SKILL.md +++ b/.agents/skills/helm-dev-environment/SKILL.md @@ -60,9 +60,28 @@ mise run helm:skaffold:dev mise run helm:skaffold:run ``` -Both commands build the `gateway` and `supervisor` images and deploy the OpenShell Helm -chart. 
The `pkiInitJob` hook (a pre-install Job that runs `openshell-gateway generate-certs`) -generates mTLS secrets on first install. Envoy Gateway opt-in; see the Optional Add-ons section below. +**Supervisor sidecar topology** (build once and leave running): +```bash +mise run helm:skaffold:run:sidecar +``` + +**Supervisor proxy-pod topology** (build once and leave running): +```bash +mise run helm:skaffold:run:proxy-pod +``` + +All Skaffold commands build the `gateway` and `supervisor` images and deploy the OpenShell Helm +chart. The sidecar profile renders an `openshell-network-init` init container for +nftables setup and a non-root `openshell-supervisor-network` runtime sidecar for +proxying. The proxy-pod profile renders network supervision in a separate +supervisor Deployment with one pod and relies on Kubernetes NetworkPolicy +enforcement so the agent pod can reach only its paired supervisor plus DNS. The +default local k3s/k3d cluster keeps k3s's embedded NetworkPolicy controller +enabled; if you replace the CNI, install a policy-enforcing CNI before using +proxy-pod. The +`pkiInitJob` hook (a pre-install Job that runs `openshell-gateway +generate-certs`) generates mTLS secrets on first install. Envoy Gateway opt-in; +see the Optional Add-ons section below. The gateway Service uses ClusterIP. Access is via Envoy Gateway (port `8080`) or `kubectl port-forward`. @@ -71,6 +90,31 @@ The gateway Service uses ClusterIP. Access is via Envoy Gateway (port `8080`) or create the Secret named `openshell-ha-pg` with a `uri` key, then run `mise run helm:skaffold:run` or `mise run helm:skaffold:dev`. 
+### Kubernetes e2e profiles + +Run the default Kubernetes e2e environment: + +```bash +mise run e2e:kubernetes +``` + +Run the sidecar topology e2e environment: + +```bash +mise run e2e:kubernetes:sidecar +``` + +Run the proxy-pod topology e2e environment: + +```bash +mise run e2e:kubernetes:proxy-pod +``` + +The proxy-pod e2e task applies `ci/values-proxy-pod.yaml` through +`OPENSHELL_E2E_KUBE_EXTRA_VALUES`. Use an existing cluster with NetworkPolicy +enforcement, or let the wrapper create the default local k3d/k3s cluster with +k3s's embedded NetworkPolicy controller enabled. + ### TLS behaviour `ci/values-skaffold.yaml` sets `server.disableTls: true`, so Skaffold-based deploys run @@ -126,6 +170,18 @@ openshell sandbox list --gateway-endpoint https://localhost:8090 mise run helm:skaffold:delete ``` +For a sidecar-profile deployment: + +```bash +mise run helm:skaffold:delete:sidecar +``` + +For a proxy-pod-profile deployment: + +```bash +mise run helm:skaffold:delete:proxy-pod +``` + ### Delete the cluster entirely ```bash @@ -250,6 +306,8 @@ for dependencies still declared in `Chart.yaml`. 
| `deploy/helm/openshell/ci/values-gateway.yaml` | Envoy Gateway GRPCRoute + Gateway overlay | | `deploy/helm/openshell/ci/values-high-availability.yaml` | HA test overlay (`replicaCount: 2` with external PostgreSQL Secret) | | `deploy/helm/openshell/ci/values-keycloak.yaml` | Keycloak OIDC overlay | +| `deploy/helm/openshell/ci/values-sidecar.yaml` | Supervisor sidecar topology overlay for Kubernetes e2e/dev | +| `deploy/helm/openshell/ci/values-proxy-pod.yaml` | Supervisor proxy-pod topology overlay for Kubernetes e2e/dev; requires NetworkPolicy enforcement | | `deploy/helm/openshell/ci/values-spire.yaml` | SPIFFE/SPIRE provider token grant overlay | | `deploy/helm/openshell/ci/values-spire-stack.yaml` | SPIRE hardened chart values for local dev | | `deploy/helm/openshell/ci/values-tls-disabled.yaml` | Lint-only: TLS + auth disabled (reverse-proxy edge termination) | diff --git a/Cargo.lock b/Cargo.lock index c86773bb7..ad9136c0a 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -3680,8 +3680,10 @@ dependencies = [ "kube-runtime", "miette", "openshell-core", + "openshell-policy", "prost", "prost-types", + "rcgen", "serde", "serde_json", "temp-env", @@ -3825,6 +3827,7 @@ dependencies = [ "clap", "futures", "miette", + "nix", "openshell-core", "openshell-ocsf", "openshell-policy", @@ -3983,12 +3986,14 @@ dependencies = [ "nix", "openshell-core", "openshell-ocsf", + "openshell-policy", "rand_core 0.6.4", "russh", "rustix 1.1.4", "seccompiler", "serde_json", "sha2 0.10.9", + "temp-env", "tempfile", "tokio", "tokio-stream", @@ -4804,6 +4809,7 @@ dependencies = [ "ring", "rustls-pki-types", "time", + "x509-parser", "yasna", ] @@ -7772,6 +7778,7 @@ dependencies = [ "lazy_static", "nom", "oid-registry", + "ring", "rusticata-macros", "thiserror 1.0.69", "time", diff --git a/Cargo.toml b/Cargo.toml index f450cd5c8..c469cf1bc 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -37,7 +37,7 @@ http-body-util = "0.1" tokio-rustls = { version = "0.26", default-features = false, features = 
["logging", "tls12", "ring"] } rustls = { version = "0.23", default-features = false, features = ["std", "logging", "tls12", "ring"] } rustls-pemfile = "2" -rcgen = { version = "0.13", features = ["crypto", "pem"] } +rcgen = { version = "0.13", features = ["crypto", "pem", "x509-parser"] } webpki-roots = "1" # CLI diff --git a/architecture/build.md b/architecture/build.md index a3cb2e25f..633efa72d 100644 --- a/architecture/build.md +++ b/architecture/build.md @@ -91,10 +91,11 @@ Runtime layout: as a release artifact. Linux GNU VM driver binaries must not reference `GLIBC_*` symbols newer than `GLIBC_2.28`; release workflows verify this before publishing artifacts. -- **Supervisor**: `scratch` base, static musl binary at `/openshell-sandbox`. - Static linkage is required because the image is mounted/extracted into - sandbox environments (Docker extraction, Podman image volumes, Kubernetes - init-container copy-self) and cannot rely on a dynamic loader. +- **Supervisor**: Alpine base with `nftables`, static musl binary at + `/openshell-sandbox`. Static linkage keeps the binary usable when the image + is mounted/extracted into sandbox environments (Docker extraction, Podman + image volumes, Kubernetes init-container copy-self), while `nftables` supports + Kubernetes supervisor sidecar egress enforcement. Gateway image builds bake the corresponding supervisor image tag into the gateway binary so Docker sandboxes do not depend on `:latest` by default. diff --git a/architecture/compute-runtimes.md b/architecture/compute-runtimes.md index f122fda5d..ac239bfb8 100644 --- a/architecture/compute-runtimes.md +++ b/architecture/compute-runtimes.md @@ -81,7 +81,7 @@ The supervisor must be available inside each sandbox workload: |---|---| | Docker | Bind-mounted local supervisor binary, or a binary extracted from the configured supervisor image. | | Podman | Read-only OCI image volume containing the supervisor binary. 
| -| Kubernetes | Sandbox pod image or pod template configuration. | +| Kubernetes | Supervisor image side-loaded into the sandbox pod by image volume or init container. | | VM | Embedded in the guest rootfs bundle. | | Extension | Defined by the out-of-tree driver. | @@ -89,6 +89,20 @@ Driver-controlled environment variables must override sandbox image or template values for sandbox ID, sandbox name, gateway endpoint, relay socket path, TLS paths, and command metadata. +Kubernetes can run the supervisor in the default combined topology or in a +sidecar topology. Combined mode keeps network and process supervision in the +agent container. Sidecar mode runs network enforcement, the proxy, and gateway +loopback forwarding in a dedicated sidecar, while the agent container runs only +the process-supervision leaf and launches the user workload after the sidecar +signals readiness. In sidecar mode, an init container performs the privileged +pod-network nftables setup with `NET_ADMIN` and hands shared state ownership to +the configured proxy UID; the long-running network sidecar runs as that UID and +does not keep `NET_ADMIN`. The agent container runs as the resolved sandbox +UID/GID with no added Linux capabilities. Sidecar mode preserves gateway session +and SSH behavior, but treats the process leaf as network-only: Landlock +filesystem policy, process privilege dropping, and process/binary identity +checks are not applied there. + ## Images The gateway image and Helm chart are built from this repository. Sandbox images diff --git a/architecture/gateway.md b/architecture/gateway.md index d873b2a10..9b0e70977 100644 --- a/architecture/gateway.md +++ b/architecture/gateway.md @@ -64,9 +64,11 @@ Podman, and VM drivers deliver the initial token through supervisor-only runtime material; Kubernetes supervisors exchange a projected ServiceAccount token through `IssueSandboxToken`. 
The gateway validates that projected token with Kubernetes `TokenReview`, requires the configured sandbox service account, -checks the returned pod binding against the live pod UID, and verifies the pod's -controlling `Sandbox` ownerReference against the live Sandbox CR UID and -sandbox-id label before minting the gateway JWT. The bootstrap path accepts +checks the returned pod binding against the live pod UID, and verifies the +pod's ownership against the live Sandbox CR UID and sandbox-id label before +minting the gateway JWT. Agent pods must be directly controlled by the +`Sandbox` CR. Proxy-pod supervisor pods may be controlled through the Kubernetes +`Pod -> ReplicaSet -> Deployment -> Sandbox` chain. The bootstrap path accepts both `agents.x-k8s.io/v1beta1` ownerReferences from newer Agent Sandbox controllers and `agents.x-k8s.io/v1alpha1` ownerReferences from existing deployments. Supervisors renew gateway JWTs in memory before expiry only while diff --git a/crates/openshell-core/src/grpc_client.rs b/crates/openshell-core/src/grpc_client.rs index 96158a1d1..4f2477c25 100644 --- a/crates/openshell-core/src/grpc_client.rs +++ b/crates/openshell-core/src/grpc_client.rs @@ -167,9 +167,14 @@ async fn build_plain_channel(endpoint: &str) -> Result { .into_diagnostic() .wrap_err_with(|| format!("failed to read client key from {key_path}"))?; - let tls_config = ClientTlsConfig::new() + let mut tls_config = ClientTlsConfig::new() .ca_certificate(Certificate::from_pem(ca_pem)) .identity(Identity::from_pem(cert_pem, key_pem)); + if let Ok(server_name) = std::env::var(sandbox_env::GATEWAY_TLS_SERVER_NAME) + && !server_name.is_empty() + { + tls_config = tls_config.domain_name(server_name); + } ep = ep .tls_config(tls_config) diff --git a/crates/openshell-core/src/sandbox_env.rs b/crates/openshell-core/src/sandbox_env.rs index b457a4a8e..d1ac71580 100644 --- a/crates/openshell-core/src/sandbox_env.rs +++ b/crates/openshell-core/src/sandbox_env.rs @@ -29,6 +29,67 @@ pub 
const SANDBOX_COMMAND: &str = "OPENSHELL_SANDBOX_COMMAND"; /// Deployment-controlled telemetry toggle propagated to the sandbox supervisor. pub const TELEMETRY_ENABLED: &str = "OPENSHELL_TELEMETRY_ENABLED"; +/// Supervisor pod/runtime topology. Kubernetes sidecar mode sets this to +/// `"sidecar"`; the default combined supervisor path omits it. +pub const SUPERVISOR_TOPOLOGY: &str = "OPENSHELL_SUPERVISOR_TOPOLOGY"; + +/// Network enforcement backend selected by the compute driver. +pub const NETWORK_ENFORCEMENT_MODE: &str = "OPENSHELL_NETWORK_ENFORCEMENT_MODE"; + +/// Process enforcement mode selected by the compute driver. +/// +/// The default when unset is `"full"`, where the process supervisor enforces +/// filesystem/process policy before spawning workloads. Kubernetes sidecar +/// topology sets this to `"network-only"` so the process wrapper can run as +/// the sandbox UID without Linux capabilities while preserving SSH/session +/// behavior. +pub const PROCESS_ENFORCEMENT_MODE: &str = "OPENSHELL_PROCESS_ENFORCEMENT_MODE"; + +/// Whether network policy evaluation must bind requests to the peer binary. +/// +/// The default when unset is `"required"`. Kubernetes sidecar experiments may +/// set this to `"relaxed"` to enforce endpoint and L7 policy without per-binary +/// `/proc` identity binding. +pub const NETWORK_BINARY_IDENTITY: &str = "OPENSHELL_NETWORK_BINARY_IDENTITY"; + +/// File written by the network supervisor when sidecar networking is ready. +pub const SUPERVISOR_READY_FILE: &str = "OPENSHELL_SUPERVISOR_READY_FILE"; + +/// TCP address the process supervisor waits for before starting when the +/// network supervisor runs outside the agent process. +pub const SUPERVISOR_READY_ADDR: &str = "OPENSHELL_SUPERVISOR_READY_ADDR"; + +/// File written by the process supervisor with the workload entrypoint PID and +/// read by the network sidecar for process/binary-bound network policy checks. 
+pub const ENTRYPOINT_PID_FILE: &str = "OPENSHELL_ENTRYPOINT_PID_FILE"; + +/// Loopback address where the network sidecar forwards gateway gRPC traffic. +pub const GATEWAY_FORWARD_ADDR: &str = "OPENSHELL_GATEWAY_FORWARD_ADDR"; + +/// Optional TLS server name used when the process supervisor reaches the +/// gateway through a loopback TCP forward. +pub const GATEWAY_TLS_SERVER_NAME: &str = "OPENSHELL_GATEWAY_TLS_SERVER_NAME"; + +/// Explicit URL injected into sandbox child processes for proxy-mode egress. +/// +/// Kubernetes proxy-pod topology uses a headless Service DNS name, which +/// cannot be represented by the policy's `SocketAddr` proxy field. +pub const PROXY_URL: &str = "OPENSHELL_PROXY_URL"; + +/// Explicit listener address for the network supervisor's HTTP CONNECT proxy. +pub const PROXY_BIND_ADDR: &str = "OPENSHELL_PROXY_BIND_ADDR"; + +/// Directory where the network supervisor writes the proxy CA files consumed +/// by workload child processes. +pub const PROXY_TLS_DIR: &str = "OPENSHELL_PROXY_TLS_DIR"; + +/// Optional CA certificate PEM path used by the network supervisor instead of +/// generating an ephemeral CA. +pub const PROXY_CA_CERT_PATH: &str = "OPENSHELL_PROXY_CA_CERT_PATH"; + +/// Optional CA private key PEM path paired with [`PROXY_CA_CERT_PATH`]. +pub const PROXY_CA_KEY_PATH: &str = "OPENSHELL_PROXY_CA_KEY_PATH"; + /// Path to the CA certificate for mTLS communication with the gateway. pub const TLS_CA: &str = "OPENSHELL_TLS_CA"; @@ -71,3 +132,18 @@ pub const K8S_SA_TOKEN_FILE: &str = "OPENSHELL_K8S_SA_TOKEN_FILE"; /// exchanges without using SPIFFE for gateway authentication. pub const PROVIDER_SPIFFE_WORKLOAD_API_SOCKET: &str = "OPENSHELL_PROVIDER_SPIFFE_WORKLOAD_API_SOCKET"; + +/// Resolved sandbox UID used to override `run_as_user` when the policy +/// specifies a numeric value instead of the hardcoded "sandbox" user name. +/// +/// Set by compute drivers (Kubernetes, Docker, VM) from resolved config or +/// cluster autodetection. 
The supervisor reads this at startup and uses it +/// directly with `setuid()` / `chown()` without requiring an `/etc/passwd` +/// entry in the sandbox image. +pub const SANDBOX_UID: &str = "OPENSHELL_SANDBOX_UID"; + +/// Resolved sandbox GID paired with [`SANDBOX_UID`]. +/// +/// Used alongside UID for PVC init container `chown` operations and when the +/// supervisor drops privileges to a group other than the UID's primary group. +pub const SANDBOX_GID: &str = "OPENSHELL_SANDBOX_GID"; diff --git a/crates/openshell-driver-kubernetes/Cargo.toml b/crates/openshell-driver-kubernetes/Cargo.toml index 07fa91015..002635a71 100644 --- a/crates/openshell-driver-kubernetes/Cargo.toml +++ b/crates/openshell-driver-kubernetes/Cargo.toml @@ -16,6 +16,7 @@ path = "src/main.rs" [dependencies] openshell-core = { path = "../openshell-core", default-features = false } +openshell-policy = { path = "../openshell-policy" } tokio = { workspace = true } tonic = { workspace = true, features = ["transport"] } @@ -33,6 +34,7 @@ tracing = { workspace = true } tracing-subscriber = { workspace = true } thiserror = { workspace = true } miette = { workspace = true } +rcgen = { workspace = true } [dev-dependencies] temp-env = "0.3" diff --git a/crates/openshell-driver-kubernetes/README.md b/crates/openshell-driver-kubernetes/README.md index 831e4edf2..4a634d14f 100644 --- a/crates/openshell-driver-kubernetes/README.md +++ b/crates/openshell-driver-kubernetes/README.md @@ -53,9 +53,35 @@ pods do not need direct external ingress for SSH. ## Container Security Context -The driver grants the sandbox agent container the Linux capabilities the -supervisor needs for namespace setup and policy enforcement. It can also request -a Kubernetes AppArmor profile through `app_armor_profile`. +The default `combined` supervisor topology grants the sandbox agent container +the Linux capabilities the supervisor needs for namespace setup and process, +filesystem, and network policy enforcement. 
+ +The `sidecar` supervisor topology moves pod-level network setup into a root init +container and runs the long-lived network sidecar as a non-root UID with no +added Linux capabilities. The agent container also runs as the resolved sandbox +UID/GID with `allowPrivilegeEscalation: false` and `capabilities.drop: ["ALL"]`. +In this mode OpenShell preserves gateway session and SSH behavior, but the +process supervisor runs in network-only mode and does not apply Landlock +filesystem policy, process privilege dropping, or process/binary identity +checks. Network endpoint and L7 policy remain enforced by the network sidecar. + +The `proxy-pod` supervisor topology runs network enforcement and gateway +forwarding in a separate supervisor Deployment with one pod. The agent pod runs +only the process-mode supervisor and reaches the supervisor through a +per-sandbox headless Service. The driver creates an owner-referenced supervisor +Deployment with one replica plus Service, proxy CA Secret, and NetworkPolicy +resources so agent egress is limited to its paired supervisor pod plus DNS. If +the supervisor pod is deleted, the Deployment recreates it. + +Sidecar mode uses the pod `fsGroup` to make the projected service-account token +and sandbox client TLS secret group-readable so the non-root process supervisor +can authenticate to the gateway. Treat the agent container as trusted with +respect to those in-pod gateway credentials until a narrower credential handoff +exists. + +The driver can request a Kubernetes AppArmor profile through +`app_armor_profile`. Supported values are `Unconfined`, `RuntimeDefault`, and `Localhost/`. 
An empty or unset value omits diff --git a/crates/openshell-driver-kubernetes/src/config.rs b/crates/openshell-driver-kubernetes/src/config.rs index 4c1153b08..47de78900 100644 --- a/crates/openshell-driver-kubernetes/src/config.rs +++ b/crates/openshell-driver-kubernetes/src/config.rs @@ -15,6 +15,9 @@ pub const DEFAULT_SANDBOX_SERVICE_ACCOUNT_NAME: &str = "default"; /// Default storage size for the workspace PVC. pub const DEFAULT_WORKSPACE_STORAGE_SIZE: &str = "2Gi"; +/// Default UID for the long-running Kubernetes network proxy. +pub const DEFAULT_PROXY_UID: u32 = 1337; + /// How the supervisor binary is delivered into sandbox pods. #[derive(Debug, Clone, Copy, PartialEq, Eq, Default, Serialize, Deserialize)] #[serde(rename_all = "kebab-case")] @@ -52,6 +55,46 @@ impl FromStr for SupervisorSideloadMethod { } } +/// How the supervisor is arranged inside Kubernetes sandbox pods. +#[derive(Debug, Clone, Copy, PartialEq, Eq, Default, Serialize, Deserialize)] +#[serde(rename_all = "kebab-case")] +pub enum SupervisorTopology { + /// Run networking and process supervision in the agent container. + #[default] + Combined, + /// Run network supervision in a privileged sidecar and process supervision + /// as a low-capability wrapper in the agent container. + Sidecar, + /// Run network supervision in a separate supervisor pod and process + /// supervision as a low-capability wrapper in the agent pod. 
+ ProxyPod, +} + +impl std::fmt::Display for SupervisorTopology { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + match self { + Self::Combined => f.write_str("combined"), + Self::Sidecar => f.write_str("sidecar"), + Self::ProxyPod => f.write_str("proxy-pod"), + } + } +} + +impl FromStr for SupervisorTopology { + type Err = String; + + fn from_str(s: &str) -> Result<Self, Self::Err> { + match s { + "combined" => Ok(Self::Combined), + "sidecar" => Ok(Self::Sidecar), + "proxy-pod" => Ok(Self::ProxyPod), + other => Err(format!( + "unknown supervisor topology '{other}'; expected 'combined', 'sidecar', or 'proxy-pod'" + )), + } + } +} + /// Kubernetes `AppArmor` profile requested for the sandbox agent container. #[derive(Debug, Clone, PartialEq, Eq)] pub enum AppArmorProfile { @@ -176,6 +219,15 @@ pub struct KubernetesComputeConfig { pub supervisor_image_pull_policy: String, /// How the supervisor binary is delivered into sandbox pods. pub supervisor_sideload_method: SupervisorSideloadMethod, + /// Supervisor pod topology. `combined` preserves the existing single + /// root supervisor container path; `sidecar` moves pod-level network + /// enforcement into a dedicated sidecar container. + pub supervisor_topology: SupervisorTopology, + /// UID used by the long-running network proxy in sidecar and proxy-pod + /// topologies. In sidecar topology, the network init container installs + /// nftables rules that exempt this UID, so it must not match the sandbox + /// workload UID. + pub proxy_uid: u32, pub grpc_endpoint: String, pub ssh_socket_path: String, pub client_tls_secret_name: String, @@ -211,6 +263,16 @@ pub struct KubernetesComputeConfig { deserialize_with = "deserialize_provider_spiffe_workload_api_socket_path" )] pub provider_spiffe_workload_api_socket_path: String, + /// UID used for the supervisor container `securityContext.runAsUser` and + /// PVC init container ownership operations. 
When empty, the driver + /// auto-detects from `OpenShift` SCC annotations on the target namespace; + /// if those are also absent, falls back to `1000`. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub sandbox_uid: Option<u32>, + /// GID used alongside `sandbox_uid` for PVC init container operations. + /// When empty and `sandbox_uid` is set, defaults to the resolved UID. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub sandbox_gid: Option<u32>, } /// Lower bound enforced by kubelet for projected SA tokens. @@ -221,6 +283,18 @@ pub const MIN_SA_TOKEN_TTL_SECS: i64 = 600; /// pod start). pub const MAX_SA_TOKEN_TTL_SECS: i64 = 86_400; +/// Default sandbox UID used when neither config nor `OpenShift` SCC annotations +/// provide a resolved value. +pub(crate) const DEFAULT_SANDBOX_UID: u32 = 1000; + +/// The annotation key for the `OpenShift` `ServiceAccount` UID range. +/// Format: `/` (e.g. `1000000000/10000`). +pub const ANNOTATION_SCC_UID_RANGE: &str = "openshift.io/sa.scc.uid-range"; + +/// The annotation key for the `OpenShift` `ServiceAccount` supplemental groups. +/// Format: `/` (e.g. `1000000000/10000`). 
+pub const ANNOTATION_SCC_SUPPLEMENTAL_GROUPS: &str = "openshift.io/sa.scc.supplemental-groups"; + impl Default for KubernetesComputeConfig { fn default() -> Self { Self { @@ -236,6 +310,8 @@ impl Default for KubernetesComputeConfig { supervisor_image: DEFAULT_SUPERVISOR_IMAGE.to_string(), supervisor_image_pull_policy: String::new(), supervisor_sideload_method: SupervisorSideloadMethod::default(), + supervisor_topology: SupervisorTopology::default(), + proxy_uid: DEFAULT_PROXY_UID, grpc_endpoint: String::new(), ssh_socket_path: "/run/openshell/ssh.sock".to_string(), client_tls_secret_name: String::new(), @@ -246,6 +322,8 @@ impl Default for KubernetesComputeConfig { default_runtime_class_name: String::new(), sa_token_ttl_secs: 3600, provider_spiffe_workload_api_socket_path: String::new(), + sandbox_uid: None, + sandbox_gid: None, } } } @@ -277,6 +355,73 @@ impl KubernetesComputeConfig { &self.provider_spiffe_workload_api_socket_path, ) } + + pub fn validate_proxy_uid(&self) -> Result<(), String> { + if self.proxy_uid < openshell_policy::MIN_SANDBOX_UID { + return Err(format!( + "proxy_uid must be at least {}", + openshell_policy::MIN_SANDBOX_UID + )); + } + Ok(()) + } + + /// Resolve the sandbox UID/GID pair. + /// + /// Resolution order: + /// 1. Configured `sandbox_uid` / `sandbox_gid` (explicit override) + /// 2. `OpenShift` SCC namespace annotations (`sa.scc.uid-range`, + /// `sa.scc.supplemental-groups`) — passed in as the optional + /// `namespace_annotations` map + /// 3. Fallback defaults: UID=`1000`, GID=UID + pub fn resolve_sandbox_uid( + &self, + namespace_annotations: Option<&std::collections::BTreeMap<String, String>>, + ) -> u32 { + if let Some(uid) = self.sandbox_uid { + return uid; + } + // Try OpenShift SCC annotation. 
+ if let Some(anns) = namespace_annotations + && let Some(range) = anns.get(ANNOTATION_SCC_UID_RANGE) + && let Some(uid) = Self::from_open_shift_uid_range(range) + { + return uid; + } + DEFAULT_SANDBOX_UID + } + + pub fn resolve_sandbox_gid( + &self, + resolved_uid: u32, + _namespace_annotations: Option<&std::collections::BTreeMap<String, String>>, + ) -> u32 { + self.sandbox_gid + .or(self.sandbox_uid) + .unwrap_or(resolved_uid) + } + + /// Parse `OpenShift` SCC `sa.scc.uid-range` annotation. + /// + /// Format: `/` (e.g. `1000000000/10000`). + pub fn from_open_shift_uid_range(annotation: &str) -> Option<u32> { + let (start, _) = annotation.split_once('/')?; + start + .trim() + .parse::<u32>() + .ok() + .filter(|&uid| uid >= openshell_policy::MIN_SANDBOX_UID) + } + + /// Parse `OpenShift` SCC `sa.scc.supplemental-groups` annotation. + pub fn from_open_shift_supplemental_groups(annotation: &str) -> Option<u32> { + let (start, _) = annotation.split_once('/')?; + start + .trim() + .parse::<u32>() + .ok() + .filter(|&gid| gid >= openshell_policy::MIN_SANDBOX_UID) + } } fn validate_provider_spiffe_workload_api_socket_path_value( @@ -314,6 +459,7 @@ fn validate_provider_spiffe_workload_api_socket_path_value( #[cfg(test)] mod tests { use super::*; + use std::collections::BTreeMap as HashMap; #[test] fn default_workspace_storage_size_is_2gi() { @@ -324,6 +470,66 @@ mod tests { ); } + #[test] + fn default_supervisor_topology_is_combined() { + let cfg = KubernetesComputeConfig::default(); + assert_eq!(cfg.supervisor_topology, SupervisorTopology::Combined); + } + + #[test] + fn default_proxy_uid_is_dedicated_non_root_uid() { + let cfg = KubernetesComputeConfig::default(); + assert_eq!(cfg.proxy_uid, DEFAULT_PROXY_UID); + } + + #[test] + fn serde_override_supervisor_topology_sidecar() { + let json = serde_json::json!({ + "supervisor_topology": "sidecar" + }); + let cfg: KubernetesComputeConfig = serde_json::from_value(json).unwrap(); + assert_eq!(cfg.supervisor_topology, 
SupervisorTopology::Sidecar); + } + + 
#[test] + fn serde_override_supervisor_topology_proxy_pod() { + let json = serde_json::json!({ + "supervisor_topology": "proxy-pod" + }); + let cfg: KubernetesComputeConfig = serde_json::from_value(json).unwrap(); + assert_eq!(cfg.supervisor_topology, SupervisorTopology::ProxyPod); + assert_eq!(cfg.supervisor_topology.to_string(), "proxy-pod"); + } + + #[test] + fn serde_override_proxy_uid() { + let json = serde_json::json!({ + "proxy_uid": 2000 + }); + let cfg: KubernetesComputeConfig = serde_json::from_value(json).unwrap(); + assert_eq!(cfg.proxy_uid, 2000); + cfg.validate_proxy_uid().unwrap(); + } + + #[test] + fn validate_proxy_uid_rejects_privileged_uid() { + let cfg = KubernetesComputeConfig { + proxy_uid: 999, + ..KubernetesComputeConfig::default() + }; + let err = cfg.validate_proxy_uid().unwrap_err(); + assert!(err.contains("proxy_uid")); + } + + #[test] + fn serde_rejects_invalid_supervisor_topology() { + let json = serde_json::json!({ + "supervisor_topology": "daemonset" + }); + let err = serde_json::from_value::<KubernetesComputeConfig>(json).unwrap_err(); + assert!(err.to_string().contains("unknown variant")); + } + #[test] fn default_service_account_name_is_default() { let cfg = KubernetesComputeConfig::default(); @@ -459,4 +665,128 @@ mod tests { let cfg: KubernetesComputeConfig = serde_json::from_value(json).unwrap(); assert_eq!(cfg.image_pull_secrets, ["regcred", "backup-regcred"]); } + + #[test] + fn default_sandbox_uid_and_gid_are_none() { + let cfg = KubernetesComputeConfig::default(); + assert_eq!(cfg.sandbox_uid, None); + assert_eq!(cfg.sandbox_gid, None); + } + + #[test] + fn serde_override_sandbox_uid() { + let json = serde_json::json!({ + "sandbox_uid": 1500 + }); + let cfg: KubernetesComputeConfig = serde_json::from_value(json).unwrap(); + assert_eq!(cfg.sandbox_uid, Some(1500)); + } + + #[test] + fn serde_override_sandbox_gid() { + let json = serde_json::json!({ + "sandbox_gid": 2000 + }); + let cfg: KubernetesComputeConfig = 
serde_json::from_value(json).unwrap(); + assert_eq!(cfg.sandbox_gid, Some(2000)); + } + + #[test] + fn parse_openshift_uid_range() { + assert_eq!( + KubernetesComputeConfig::from_open_shift_uid_range("1000000000/10000"), + Some(1_000_000_000) + ); + assert_eq!( + KubernetesComputeConfig::from_open_shift_uid_range("1000/50000"), + Some(1000) + ); + } + + #[test] + fn parse_openshift_uid_range_rejects_below_min() { + // 999 is below MIN_SANDBOX_UID (1000); it should be rejected. + assert_eq!( + KubernetesComputeConfig::from_open_shift_uid_range("999/50000"), + None + ); + } + + #[test] + fn parse_openshift_supplemental_groups() { + assert_eq!( + KubernetesComputeConfig::from_open_shift_supplemental_groups("1000/50000"), + Some(1000) + ); + } + + #[test] + fn resolve_sandbox_uid_prefers_config() { + let cfg = KubernetesComputeConfig { + sandbox_uid: Some(5000), + ..KubernetesComputeConfig::default() + }; + // Config value should win even when annotations are present. + let mut anns: HashMap<String, String> = HashMap::new(); + anns.insert( + ANNOTATION_SCC_UID_RANGE.to_string(), + "1000000000/10000".to_string(), + ); + assert_eq!(cfg.resolve_sandbox_uid(Some(&anns)), 5000); + } + + #[test] + fn resolve_sandbox_uid_falls_back_to_openshift_annotation() { + let cfg = KubernetesComputeConfig::default(); + let mut anns: HashMap<String, String> = HashMap::new(); + anns.insert( + ANNOTATION_SCC_UID_RANGE.to_string(), + "1000000000/10000".to_string(), + ); + assert_eq!(cfg.resolve_sandbox_uid(Some(&anns)), 1_000_000_000); + } + + #[test] + fn resolve_sandbox_uid_falls_back_to_default() { + let cfg = KubernetesComputeConfig::default(); + // No config, no annotations. + assert_eq!(cfg.resolve_sandbox_uid(None), DEFAULT_SANDBOX_UID); + // Empty annotations map. 
+ let anns: HashMap<String, String> = HashMap::new(); + assert_eq!(cfg.resolve_sandbox_uid(Some(&anns)), DEFAULT_SANDBOX_UID); + } + + #[test] + fn resolve_sandbox_gid_prefers_config() { + let cfg = KubernetesComputeConfig { + sandbox_uid: Some(5000), + sandbox_gid: Some(6000), + ..KubernetesComputeConfig::default() + }; + assert_eq!( + cfg.resolve_sandbox_gid(cfg.resolve_sandbox_uid(None), None), + 6000 + ); + } + + #[test] + fn resolve_sandbox_gid_falls_back_to_uid() { + let cfg = KubernetesComputeConfig { + sandbox_uid: Some(5000), + ..KubernetesComputeConfig::default() + }; + // sandbox_gid is None, should fall back to sandbox_uid. + assert_eq!( + cfg.resolve_sandbox_gid(cfg.resolve_sandbox_uid(None), None), + 5000 + ); + } + + #[test] + fn resolve_sandbox_gid_falls_back_to_resolved_uid() { + let cfg = KubernetesComputeConfig::default(); + // Both are None, should use the resolved UID. + let uid = cfg.resolve_sandbox_uid(None); + assert_eq!(cfg.resolve_sandbox_gid(uid, None), uid); + } } diff --git a/crates/openshell-driver-kubernetes/src/driver.rs b/crates/openshell-driver-kubernetes/src/driver.rs index 909568302..06f0f365c 100644 --- a/crates/openshell-driver-kubernetes/src/driver.rs +++ b/crates/openshell-driver-kubernetes/src/driver.rs @@ -3,12 +3,16 @@ //! Kubernetes compute driver. 
+use super::AppArmorProfile; use crate::config::{ - AppArmorProfile, DEFAULT_SANDBOX_SERVICE_ACCOUNT_NAME, DEFAULT_WORKSPACE_STORAGE_SIZE, - KubernetesComputeConfig, SupervisorSideloadMethod, + DEFAULT_PROXY_UID, DEFAULT_SANDBOX_SERVICE_ACCOUNT_NAME, DEFAULT_SANDBOX_UID, + DEFAULT_WORKSPACE_STORAGE_SIZE, KubernetesComputeConfig, SupervisorSideloadMethod, + SupervisorTopology, }; use futures::{Stream, StreamExt, TryStreamExt}; -use k8s_openapi::api::core::v1::{Event as KubeEventObj, Node}; +use k8s_openapi::api::apps::v1::Deployment; +use k8s_openapi::api::core::v1::{Event as KubeEventObj, Namespace, Node, Secret, Service}; +use k8s_openapi::api::networking::v1::NetworkPolicy; use kube::api::{Api, ApiResource, DeleteParams, ListParams, PostParams}; use kube::core::gvk::GroupVersionKind; use kube::core::{DynamicObject, ObjectMeta}; @@ -31,7 +35,9 @@ use openshell_core::proto::compute::v1::{ watch_sandboxes_event, }; use openshell_core::proto_struct::{struct_to_json_object, value_to_json}; +use rcgen::{CertificateParams, DnType, IsCa, KeyPair, KeyUsagePurpose}; use serde::Deserialize; +use serde::de::DeserializeOwned; use std::collections::BTreeMap; use std::pin::Pin; use std::sync::Arc; @@ -217,6 +223,9 @@ impl KubernetesComputeDriver { config .validate_provider_spiffe_workload_api_socket_path() .map_err(KubernetesDriverError::Precondition)?; + config + .validate_proxy_uid() + .map_err(KubernetesDriverError::Precondition)?; let base_config = match kube::Config::incluster() { Ok(c) => c, Err(_) => kube::Config::infer() @@ -330,6 +339,47 @@ impl KubernetesComputeDriver { )) } + /// Resolve sandbox UID/GID from config or `OpenShift` SCC namespace annotations. + /// + /// Returns `(uid, gid, ns_annotations_map)`: + /// - If `sandbox_uid` is set in config, returns that (with fallback GID) + /// - Otherwise fetches the target namespace and checks for + /// `openshift.io/sa.scc.uid-range` / `openshift.io/sa.scc.supplemental-groups` + /// annotations. 
+ /// - If neither config nor `OpenShift` is found, returns `(1000, 1000, {})` as defaults. + async fn resolve_sandbox_identity(&self) -> (u32, u32, BTreeMap<String, String>) { + // Explicit config takes priority; skip the namespace lookup entirely. + if self.config.sandbox_uid.is_some() { + let uid = self.config.resolve_sandbox_uid(None); + let gid = self.config.resolve_sandbox_gid(uid, None); + return (uid, gid, BTreeMap::new()); + } + + // Try to read namespace annotations for OpenShift SCC. + // Namespace is a cluster-scoped resource, so Api::all is the right + // handle; the get() below fetches the target namespace by name. + let ns_api: Api<Namespace> = Api::all(self.client.clone()); + if let Ok(Ok(ns)) = + tokio::time::timeout(KUBE_API_TIMEOUT, ns_api.get(self.config.namespace.as_str())).await + { + let anns = ns.metadata.annotations.unwrap_or_default(); + let uid = self.config.resolve_sandbox_uid(Some(&anns)); + // Collect supplemental groups annotation for sandbox init containers. + let gid = anns + .get(crate::config::ANNOTATION_SCC_SUPPLEMENTAL_GROUPS) + .map_or(uid, |sup_range| { + KubernetesComputeConfig::from_open_shift_supplemental_groups(sup_range) + .unwrap_or(uid) + }); + (uid, gid, anns) + } else { + // Namespace fetch failed or timed out; fall back to defaults. + let uid = DEFAULT_SANDBOX_UID; + let gid = uid; + (uid, gid, BTreeMap::new()) + } + } + async fn has_gpu_capacity(&self) -> Result { let nodes: Api<Node> = Api::all(self.client.clone()); let node_list = nodes.list(&ListParams::default()).await?; @@ -471,11 +521,21 @@ impl KubernetesComputeDriver { .supported_agent_sandbox_api(self.client.clone()) .await .map_err(KubernetesDriverError::Message)?; + + // Resolve sandbox UID/GID from config or OpenShift SCC namespace annotations. 
+ let (resolved_user_id, resolved_group_id, ns_annotations) = + self.resolve_sandbox_identity().await; + let mut obj = DynamicObject::new(name, &agent_sandbox_api.resource); obj.metadata = ObjectMeta { name: Some(name.to_string()), namespace: Some(self.config.namespace.clone()), labels: Some(sandbox_labels(sandbox)), + annotations: if ns_annotations.is_empty() { + None + } else { + Some(ns_annotations) + }, ..Default::default() }; let params = SandboxPodParams { @@ -485,6 +545,9 @@ impl KubernetesComputeDriver { supervisor_image: &self.config.supervisor_image, supervisor_image_pull_policy: &self.config.supervisor_image_pull_policy, supervisor_sideload_method: self.config.supervisor_sideload_method, + supervisor_topology: self.config.supervisor_topology, + proxy_uid: self.config.proxy_uid, + namespace: &self.config.namespace, service_account_name: &self.config.service_account_name, sandbox_id: &sandbox.id, sandbox_name: &sandbox.name, @@ -501,21 +564,25 @@ impl KubernetesComputeDriver { provider_spiffe_workload_api_socket_path: &self .config .provider_spiffe_workload_api_socket_path, + sandbox_uid: resolved_user_id, + sandbox_gid: resolved_group_id, }; + validate_proxy_identity(&params)?; + obj.data = sandbox_to_k8s_spec(sandbox.spec.as_ref(), &params); - match tokio::time::timeout( + let created = match tokio::time::timeout( KUBE_API_TIMEOUT, agent_sandbox_api.api.create(&PostParams::default(), &obj), ) .await { - Ok(Ok(_result)) => { + Ok(Ok(result)) => { info!( sandbox_id = %sandbox.id, sandbox_name = %name, "Sandbox created in Kubernetes successfully" ); - Ok(()) + result } Ok(Err(err)) => { warn!( @@ -524,7 +591,7 @@ impl KubernetesComputeDriver { error = %err, "Failed to create sandbox in Kubernetes" ); - Err(KubernetesDriverError::from_kube(err)) + return Err(KubernetesDriverError::from_kube(err)); } Err(_elapsed) => { warn!( @@ -533,12 +600,197 @@ impl KubernetesComputeDriver { timeout_secs = KUBE_API_TIMEOUT.as_secs(), "Timed out creating sandbox in Kubernetes" ); 
- Err(KubernetesDriverError::Message(format!( + return Err(KubernetesDriverError::Message(format!( "timed out after {}s waiting for Kubernetes API", KUBE_API_TIMEOUT.as_secs() - ))) + ))); } + }; + + if self.config.supervisor_topology == SupervisorTopology::ProxyPod + && let Err(err) = self + .create_proxy_pod_resources( + sandbox, + sandbox.spec.as_ref(), + &params, + &created, + &agent_sandbox_api.resource.api_version, + ) + .await + { + warn!( + sandbox_id = %sandbox.id, + sandbox_name = %name, + error = %err, + "Failed to create proxy-pod resources; deleting Sandbox CR" + ); + self.cleanup_proxy_pod_resources(name).await; + let _ = tokio::time::timeout( + KUBE_API_TIMEOUT, + agent_sandbox_api.api.delete(name, &DeleteParams::default()), + ) + .await; + return Err(err); } + + Ok(()) + } + + async fn create_proxy_pod_resources( + &self, + sandbox: &Sandbox, + spec: Option<&SandboxSpec>, + params: &SandboxPodParams<'_>, + sandbox_cr: &DynamicObject, + sandbox_api_version: &str, + ) -> Result<(), KubernetesDriverError> { + let names = proxy_pod_resource_names(&sandbox.name); + let template_environment = spec + .and_then(|spec| spec.template.as_ref()) + .map(|template| template.environment.clone()) + .unwrap_or_default(); + let spec_environment = spec_pod_env(spec); + let deployment_owner_ref = + proxy_pod_owner_reference(sandbox_cr, sandbox_api_version, true)?; + let dependent_owner_ref = + proxy_pod_owner_reference(sandbox_cr, sandbox_api_version, false)?; + let (ca_cert_pem, ca_key_pem) = generate_proxy_pod_ca()?; + + let secret = proxy_pod_ca_secret( + &names, + params, + dependent_owner_ref.clone(), + &ca_cert_pem, + &ca_key_pem, + ); + let service = proxy_pod_supervisor_service(&names, params, dependent_owner_ref.clone()); + let agent_egress = + proxy_pod_agent_egress_network_policy(&names, params, dependent_owner_ref.clone()); + let supervisor_ingress = + proxy_pod_supervisor_ingress_network_policy(&names, params, dependent_owner_ref); + let supervisor_deployment 
= proxy_pod_supervisor_deployment( + &names, + &template_environment, + &spec_environment, + params, + deployment_owner_ref, + ); + + let secrets: Api<Secret> = Api::namespaced(self.client.clone(), &self.config.namespace); + let services: Api<Service> = Api::namespaced(self.client.clone(), &self.config.namespace); + let policies: Api<NetworkPolicy> = + Api::namespaced(self.client.clone(), &self.config.namespace); + let deployments: Api<Deployment> = + Api::namespaced(self.client.clone(), &self.config.namespace); + + tokio::time::timeout( + KUBE_API_TIMEOUT, + secrets.create(&PostParams::default(), &secret), + ) + .await + .map_err(|_| { + KubernetesDriverError::Message(format!( + "timed out after {}s creating proxy-pod CA secret", + KUBE_API_TIMEOUT.as_secs() + )) + })? + .map_err(KubernetesDriverError::from_kube)?; + tokio::time::timeout( + KUBE_API_TIMEOUT, + services.create(&PostParams::default(), &service), + ) + .await + .map_err(|_| { + KubernetesDriverError::Message(format!( + "timed out after {}s creating proxy-pod service", + KUBE_API_TIMEOUT.as_secs() + )) + })? + .map_err(KubernetesDriverError::from_kube)?; + tokio::time::timeout( + KUBE_API_TIMEOUT, + policies.create(&PostParams::default(), &agent_egress), + ) + .await + .map_err(|_| { + KubernetesDriverError::Message(format!( + "timed out after {}s creating proxy-pod agent egress NetworkPolicy", + KUBE_API_TIMEOUT.as_secs() + )) + })? + .map_err(KubernetesDriverError::from_kube)?; + tokio::time::timeout( + KUBE_API_TIMEOUT, + policies.create(&PostParams::default(), &supervisor_ingress), + ) + .await + .map_err(|_| { + KubernetesDriverError::Message(format!( + "timed out after {}s creating proxy-pod supervisor ingress NetworkPolicy", + KUBE_API_TIMEOUT.as_secs() + )) + })? 
+ .map_err(KubernetesDriverError::from_kube)?; + tokio::time::timeout( + KUBE_API_TIMEOUT, + deployments.create(&PostParams::default(), &supervisor_deployment), + ) + .await + .map_err(|_| { + KubernetesDriverError::Message(format!( + "timed out after {}s creating proxy-pod supervisor deployment", + KUBE_API_TIMEOUT.as_secs() + )) + })? + .map_err(KubernetesDriverError::from_kube)?; + + info!( + sandbox_id = %sandbox.id, + sandbox_name = %sandbox.name, + supervisor_deployment = %names.supervisor_deployment, + service = %names.service, + "Created proxy-pod supervisor resources" + ); + Ok(()) + } + + async fn cleanup_proxy_pod_resources(&self, sandbox_name: &str) { + let names = proxy_pod_resource_names(sandbox_name); + let secrets: Api<Secret> = Api::namespaced(self.client.clone(), &self.config.namespace); + let services: Api<Service> = Api::namespaced(self.client.clone(), &self.config.namespace); + let policies: Api<NetworkPolicy> = + Api::namespaced(self.client.clone(), &self.config.namespace); + let deployments: Api<Deployment> = + Api::namespaced(self.client.clone(), &self.config.namespace); + + let _ = tokio::time::timeout( + KUBE_API_TIMEOUT, + deployments.delete(&names.supervisor_deployment, &DeleteParams::default()), + ) + .await; + let _ = tokio::time::timeout( + KUBE_API_TIMEOUT, + policies.delete( + &names.supervisor_ingress_network_policy, + &DeleteParams::default(), + ), + ) + .await; + let _ = tokio::time::timeout( + KUBE_API_TIMEOUT, + policies.delete(&names.agent_egress_network_policy, &DeleteParams::default()), + ) + .await; + let _ = tokio::time::timeout( + KUBE_API_TIMEOUT, + services.delete(&names.service, &DeleteParams::default()), + ) + .await; + let _ = tokio::time::timeout( + KUBE_API_TIMEOUT, + secrets.delete(&names.proxy_ca_secret, &DeleteParams::default()), + ) + .await; } pub async fn delete_sandbox(&self, name: &str) -> Result { @@ -551,6 +803,9 @@ impl KubernetesComputeDriver { let agent_sandbox_api = self 
.supported_agent_sandbox_api(self.client.clone()) .await?; + if 
self.config.supervisor_topology == SupervisorTopology::ProxyPod { + self.cleanup_proxy_pod_resources(name).await; + } match tokio::time::timeout( KUBE_API_TIMEOUT, agent_sandbox_api.api.delete(name, &DeleteParams::default()), @@ -932,6 +1187,43 @@ const SUPERVISOR_VOLUME_NAME: &str = "openshell-supervisor-bin"; /// Name of the init container that installs the supervisor binary. const SUPERVISOR_INIT_CONTAINER_NAME: &str = "openshell-supervisor-install"; +/// Name of the init container that prepares pod-level sidecar networking. +const SUPERVISOR_NETWORK_INIT_CONTAINER_NAME: &str = "openshell-network-init"; + +/// Container name for the network-only supervisor sidecar. +const SUPERVISOR_NETWORK_SIDECAR_NAME: &str = "openshell-supervisor-network"; + +/// Shared volume used by the network sidecar to signal readiness to the +/// process-only supervisor in the agent container. +const SIDECAR_STATE_VOLUME_NAME: &str = "openshell-sidecar-state"; +const SIDECAR_STATE_MOUNT_PATH: &str = "/run/openshell-sidecar"; +const SIDECAR_READY_FILE: &str = "/run/openshell-sidecar/supervisor.ready"; +const SIDECAR_ENTRYPOINT_PID_FILE: &str = "/run/openshell-sidecar/entrypoint.pid"; +const SIDECAR_SSH_SOCKET_FILE: &str = "/run/openshell-sidecar/ssh.sock"; + +/// Shared TLS work directory. The network sidecar writes the proxy CA bundle +/// here, while the agent container consumes it after the readiness file exists. +const SIDECAR_TLS_VOLUME_NAME: &str = "openshell-supervisor-tls"; +const SIDECAR_TLS_MOUNT_PATH: &str = "/etc/openshell-tls/proxy"; +const SIDECAR_CLIENT_TLS_MOUNT_PATH: &str = "/etc/openshell-tls/proxy/client"; + +/// Loopback listener owned by the network sidecar. The process-only supervisor +/// connects here for gateway gRPC, and the sidecar forwards bytes to the real +/// gateway endpoint using its own network privileges. 
+const SIDECAR_GATEWAY_FORWARD_ADDR: &str = "127.0.0.1:18080"; + +const LABEL_SANDBOX_ROLE: &str = "openshell.ai/sandbox-role"; +const SANDBOX_ROLE_AGENT: &str = "agent"; +const SANDBOX_ROLE_SUPERVISOR: &str = "supervisor"; +const PROXY_POD_PROXY_PORT: u16 = 3128; +const PROXY_POD_GATEWAY_FORWARD_PORT: u16 = 18080; +const PROXY_POD_GATEWAY_FORWARD_ADDR: &str = "0.0.0.0:18080"; +const PROXY_POD_NETWORK_ENFORCEMENT_MODE: &str = "proxy-pod"; +const PROXY_POD_CA_SECRET_MOUNT_PATH: &str = "/var/run/openshell-proxy-ca"; +const PROXY_POD_CA_CERT_FILE: &str = "openshell-ca.pem"; +const PROXY_POD_CA_KEY_FILE: &str = "openshell-ca-key.pem"; +const PROXY_POD_SSH_SOCKET_FILE: &str = "/tmp/openshell/ssh.sock"; + /// Build the emptyDir volume that holds the supervisor binary. /// /// The init container writes the binary here; the agent container reads it. @@ -1006,28 +1298,12 @@ fn supervisor_init_container( spec } -/// Apply supervisor side-load transforms to an already-built pod template JSON. -/// -/// Depending on the sideload method: -/// - **`ImageVolume`**: mounts the supervisor OCI image directly as a read-only -/// volume (no init container needed, requires K8s >= v1.33). -/// - **`InitContainer`**: injects an emptyDir volume and an init container that -/// copies the supervisor binary from the supervisor image into that volume. -/// -/// In both cases, the agent container gets a command override to run the -/// side-loaded binary and `runAsUser: 0` so it can create network namespaces, -/// set up the proxy, and configure Landlock/seccomp. -fn apply_supervisor_sideload( - pod_template: &mut serde_json::Value, +fn apply_supervisor_binary_source( + spec: &mut serde_json::Map, supervisor_image: &str, supervisor_image_pull_policy: &str, method: SupervisorSideloadMethod, ) { - let Some(spec) = pod_template.get_mut("spec").and_then(|v| v.as_object_mut()) else { - return; - }; - - // 1. 
Add the volume (image source or emptyDir depending on method) let volumes = spec .entry("volumes") .or_insert_with(|| serde_json::json!([])) @@ -1046,7 +1322,6 @@ fn apply_supervisor_sideload( } } - // 2. Add the init container only for the init-container method if method == SupervisorSideloadMethod::InitContainer { let init_containers = spec .entry("initContainers") @@ -1059,8 +1334,35 @@ fn apply_supervisor_sideload( )); } } +} + +/// Apply supervisor side-load transforms to an already-built pod template JSON. +/// +/// Depending on the sideload method: +/// - **`ImageVolume`**: mounts the supervisor OCI image directly as a read-only +/// volume (no init container needed, requires K8s >= v1.33). +/// - **`InitContainer`**: injects an emptyDir volume and an init container that +/// copies the supervisor binary from the supervisor image into that volume. +/// +/// In both cases, the agent container gets a command override to run the +/// side-loaded binary as root so it can create network namespaces, set up the +/// proxy, and configure Landlock/seccomp. +#[allow(clippy::similar_names)] +fn apply_supervisor_sideload( + pod_template: &mut serde_json::Value, + supervisor_image: &str, + supervisor_image_pull_policy: &str, + method: SupervisorSideloadMethod, + sandbox_uid: u32, + sandbox_gid: u32, +) { + let Some(spec) = pod_template.get_mut("spec").and_then(|v| v.as_object_mut()) else { + return; + }; + + apply_supervisor_binary_source(spec, supervisor_image, supervisor_image_pull_policy, method); - // 3. 
Find the agent container and add volume mount + command override + // Find the agent container and add volume mount + command override let Some(containers) = spec.get_mut("containers").and_then(|v| v.as_array_mut()) else { return; }; @@ -1101,152 +1403,968 @@ fn apply_supervisor_sideload( if let Some(volume_mounts) = volume_mounts { volume_mounts.push(supervisor_volume_mount()); } + + // Inject resolved sandbox UID/GID as environment variables so the + // supervisor can use them directly without /etc/passwd lookups. + let env = container + .entry("env") + .or_insert_with(|| serde_json::json!([])) + .as_array_mut(); + if let Some(env) = env { + env.push(serde_json::json!({ + "name": openshell_core::sandbox_env::SANDBOX_UID.to_string(), + "value": sandbox_uid.to_string(), + })); + env.push(serde_json::json!({ + "name": openshell_core::sandbox_env::SANDBOX_GID.to_string(), + "value": sandbox_gid.to_string(), + })); + } } } -/// Apply workspace persistence transforms to an already-built pod template. -/// -/// This injects: -/// 1. A volume mount on the agent container at `/sandbox`. -/// 2. An init container (same image) that seeds the PVC with the image's -/// original `/sandbox` contents on first use. -/// -/// The PVC volume itself is **not** added here — the Sandbox CRD controller -/// automatically creates a volume for each entry in `volumeClaimTemplates` -/// (following the `StatefulSet` convention). Adding one here would create a -/// duplicate volume name and fail pod validation. -/// -/// The init container mounts the PVC at a temporary path so it can still see -/// the image's `/sandbox` directory. It checks for a sentinel file and skips -/// the copy if the PVC was already initialised. 
-fn apply_workspace_persistence( - pod_template: &mut serde_json::Value, - image: &str, - image_pull_policy: &str, -) { - let Some(spec) = pod_template.get_mut("spec").and_then(|v| v.as_object_mut()) else { - return; - }; +fn sidecar_state_volume_mount() -> serde_json::Value { + serde_json::json!({ + "name": SIDECAR_STATE_VOLUME_NAME, + "mountPath": SIDECAR_STATE_MOUNT_PATH, + }) +} - // 1. Add workspace volume mount to the agent container - let containers = spec.get_mut("containers").and_then(|v| v.as_array_mut()); - if let Some(containers) = containers { - let mut target_index = None; - for (i, c) in containers.iter().enumerate() { - if c.get("name").and_then(|v| v.as_str()) == Some("agent") { - target_index = Some(i); - break; - } - } - let index = target_index.unwrap_or(0); +fn sidecar_tls_volume_mount() -> serde_json::Value { + serde_json::json!({ + "name": SIDECAR_TLS_VOLUME_NAME, + "mountPath": SIDECAR_TLS_MOUNT_PATH, + }) +} - if let Some(container) = containers.get_mut(index).and_then(|v| v.as_object_mut()) { - let volume_mounts = container - .entry("volumeMounts") - .or_insert_with(|| serde_json::json!([])) - .as_array_mut(); - if let Some(volume_mounts) = volume_mounts { - volume_mounts.push(serde_json::json!({ - "name": WORKSPACE_VOLUME_NAME, - "mountPath": WORKSPACE_MOUNT_PATH - })); - } - } +fn sidecar_process_gateway_endpoint(grpc_endpoint: &str) -> String { + if grpc_endpoint.is_empty() { + String::new() + } else if grpc_endpoint.starts_with("https://") { + format!("https://{SIDECAR_GATEWAY_FORWARD_ADDR}") + } else { + format!("http://{SIDECAR_GATEWAY_FORWARD_ADDR}") } +} - // 3. Add the init container that seeds the PVC from the image - let init_containers = spec - .entry("initContainers") - .or_insert_with(|| serde_json::json!([])) - .as_array_mut(); - if let Some(init_containers) = init_containers { - // The init container mounts the PVC at a temp path so it can still - // read the image's original /sandbox contents. 
It copies them into - // the PVC only when the sentinel file is absent. - // - // Prefer a tar stream over `cp -a`: some sandbox images contain - // self-referential symlinks under `/sandbox/.uv`, and GNU cp can - // fail while seeding the PVC even though preserving the symlink as-is - // is valid. `tar` copies the tree without dereferencing those links. - // - // The inner `[ -d ... ]` guard handles custom images that don't have - // a /sandbox directory — the copy is skipped but the sentinel is - // still written so subsequent starts are instant. - let copy_cmd = format!( - "if [ ! -f {WORKSPACE_INIT_MOUNT_PATH}/{WORKSPACE_SENTINEL} ]; then \ - if [ -d {WORKSPACE_MOUNT_PATH} ]; then \ - tar -C {WORKSPACE_MOUNT_PATH} -cf - . | tar -C {WORKSPACE_INIT_MOUNT_PATH} -xpf -; \ - fi && \ - touch {WORKSPACE_INIT_MOUNT_PATH}/{WORKSPACE_SENTINEL}; \ - fi" - ); +fn gateway_tls_server_name(grpc_endpoint: &str) -> Option { + let rest = grpc_endpoint.strip_prefix("https://")?; + let authority = rest.split('/').next().unwrap_or(rest); + if authority.is_empty() { + return None; + } + if let Some(bracketed) = authority.strip_prefix('[') { + return bracketed.split(']').next().map(str::to_string); + } + authority + .split(':') + .next() + .filter(|host| !host.is_empty()) + .map(str::to_string) +} - let mut init_spec = serde_json::json!({ - "name": WORKSPACE_INIT_CONTAINER_NAME, - "image": image, - "command": ["sh", "-c", copy_cmd], - "securityContext": { "runAsUser": 0 }, - "volumeMounts": [{ - "name": WORKSPACE_VOLUME_NAME, - "mountPath": WORKSPACE_INIT_MOUNT_PATH - }] - }); - if !image_pull_policy.is_empty() { - init_spec["imagePullPolicy"] = serde_json::json!(image_pull_policy); - } - init_containers.push(init_spec); +#[derive(Debug, Clone)] +struct ProxyPodResourceNames { + supervisor_deployment: String, + service: String, + proxy_ca_secret: String, + agent_egress_network_policy: String, + supervisor_ingress_network_policy: String, +} + +fn proxy_pod_resource_names(sandbox_name: 
&str) -> ProxyPodResourceNames { + ProxyPodResourceNames { + supervisor_deployment: dns_label_name("os-sup", sandbox_name), + service: dns_label_name("os-svc", sandbox_name), + proxy_ca_secret: dns_label_name("os-ca", sandbox_name), + agent_egress_network_policy: dns_label_name("os-eg", sandbox_name), + supervisor_ingress_network_policy: dns_label_name("os-ing", sandbox_name), } } -/// Build the default `volumeClaimTemplates` array for sandbox pods. -/// -/// Provides a single PVC named "workspace" that backs the `/sandbox` -/// directory. The init container seeds it from the image on first use. -fn default_workspace_volume_claim_templates(storage_size: &str) -> serde_json::Value { - let size = if storage_size.is_empty() { - DEFAULT_WORKSPACE_STORAGE_SIZE - } else { - storage_size - }; - serde_json::json!([{ - "metadata": { - "name": WORKSPACE_VOLUME_NAME - }, - "spec": { - "accessModes": ["ReadWriteOnce"], - "resources": { - "requests": { - "storage": size - } +fn dns_label_name(prefix: &str, name: &str) -> String { + let mut hash = 0xcbf2_9ce4_8422_2325_u64; + for byte in name.as_bytes() { + hash ^= u64::from(*byte); + hash = hash.wrapping_mul(0x0000_0100_0000_01b3); + } + let suffix_hash = hash & 0xffff_ffff; + let suffix = format!("{suffix_hash:08x}"); + let mut sanitized = name + .chars() + .map(|c| { + let c = c.to_ascii_lowercase(); + if c.is_ascii_alphanumeric() || c == '-' { + c + } else { + '-' } - } - }]) + }) + .collect::(); + sanitized = sanitized + .trim_matches('-') + .split('-') + .filter(|part| !part.is_empty()) + .collect::>() + .join("-"); + if sanitized.is_empty() { + sanitized = "sandbox".to_string(); + } + let max_base_len = 63usize.saturating_sub(prefix.len() + suffix.len() + 2); + if sanitized.len() > max_base_len { + sanitized.truncate(max_base_len); + sanitized = sanitized.trim_matches('-').to_string(); + } + format!("{prefix}-{sanitized}-{suffix}") } -/// Parameters shared by `sandbox_to_k8s_spec` and `sandbox_template_to_k8s`. 
-struct SandboxPodParams<'a> { - default_image: &'a str, - image_pull_policy: &'a str, - image_pull_secrets: &'a [String], - supervisor_image: &'a str, - supervisor_image_pull_policy: &'a str, - supervisor_sideload_method: SupervisorSideloadMethod, - service_account_name: &'a str, - sandbox_id: &'a str, - sandbox_name: &'a str, - grpc_endpoint: &'a str, - ssh_socket_path: &'a str, - client_tls_secret_name: &'a str, - host_gateway_ip: &'a str, - enable_user_namespaces: bool, - app_armor_profile: Option<&'a AppArmorProfile>, - workspace_default_storage_size: &'a str, - default_runtime_class_name: &'a str, +fn proxy_pod_service_dns(service_name: &str, namespace: &str) -> String { + format!("{service_name}.{namespace}.svc.cluster.local") +} + +fn proxy_pod_process_gateway_endpoint(service_dns: &str, grpc_endpoint: &str) -> String { + if grpc_endpoint.is_empty() { + String::new() + } else if grpc_endpoint.starts_with("https://") { + format!("https://{service_dns}:{PROXY_POD_GATEWAY_FORWARD_PORT}") + } else { + format!("http://{service_dns}:{PROXY_POD_GATEWAY_FORWARD_PORT}") + } +} + +fn proxy_pod_proxy_url(service_dns: &str) -> String { + format!("http://{service_dns}:{PROXY_POD_PROXY_PORT}") +} + +fn apply_host_gateway_aliases( + spec: &mut serde_json::Map, + host_gateway_ip: &str, +) { + if host_gateway_ip.is_empty() { + return; + } + spec.insert( + "hostAliases".to_string(), + serde_json::json!([{ + "ip": host_gateway_ip, + "hostnames": ["host.docker.internal", "host.openshell.internal"] + }]), + ); +} + +fn copy_log_level_env( + env: &mut Vec, + template_environment: &std::collections::HashMap, + spec_environment: &std::collections::HashMap, +) { + if let Some(value) = spec_environment + .get(openshell_core::sandbox_env::LOG_LEVEL) + .or_else(|| template_environment.get(openshell_core::sandbox_env::LOG_LEVEL)) + { + upsert_env(env, openshell_core::sandbox_env::LOG_LEVEL, value); + } +} + +fn supervisor_sidecar_env( + template_environment: &std::collections::HashMap, 
+ spec_environment: &std::collections::HashMap, + params: &SandboxPodParams<'_>, +) -> Vec { + let mut env = Vec::new(); + apply_required_env( + &mut env, + params.sandbox_id, + params.sandbox_name, + params.grpc_endpoint, + "", + !params.client_tls_secret_name.is_empty(), + provider_spiffe_socket_path(params), + ); + if !params.client_tls_secret_name.is_empty() { + upsert_env( + &mut env, + openshell_core::sandbox_env::TLS_CA, + &format!("{SIDECAR_CLIENT_TLS_MOUNT_PATH}/ca.crt"), + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::TLS_CERT, + &format!("{SIDECAR_CLIENT_TLS_MOUNT_PATH}/tls.crt"), + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::TLS_KEY, + &format!("{SIDECAR_CLIENT_TLS_MOUNT_PATH}/tls.key"), + ); + } + copy_log_level_env(&mut env, template_environment, spec_environment); + upsert_env( + &mut env, + openshell_core::sandbox_env::SUPERVISOR_TOPOLOGY, + "sidecar", + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::NETWORK_ENFORCEMENT_MODE, + "sidecar-nftables", + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::NETWORK_BINARY_IDENTITY, + "relaxed", + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::SUPERVISOR_READY_FILE, + SIDECAR_READY_FILE, + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::ENTRYPOINT_PID_FILE, + SIDECAR_ENTRYPOINT_PID_FILE, + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::GATEWAY_FORWARD_ADDR, + SIDECAR_GATEWAY_FORWARD_ADDR, + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::PROXY_TLS_DIR, + SIDECAR_TLS_MOUNT_PATH, + ); + env +} + +fn supervisor_sidecar_container( + template_environment: &std::collections::HashMap, + spec_environment: &std::collections::HashMap, + params: &SandboxPodParams<'_>, +) -> serde_json::Value { + let mut container = serde_json::json!({ + "name": SUPERVISOR_NETWORK_SIDECAR_NAME, + "image": params.supervisor_image, + "command": [ + SUPERVISOR_IMAGE_BINARY_PATH, + "--mode=network", + ], + "env": 
supervisor_sidecar_env(template_environment, spec_environment, params), + "securityContext": { + "runAsUser": params.proxy_uid, + "runAsGroup": params.sandbox_gid, + "runAsNonRoot": true, + "allowPrivilegeEscalation": false, + "capabilities": { + "drop": ["ALL"] + } + }, + "volumeMounts": [ + sidecar_state_volume_mount(), + sidecar_tls_volume_mount(), + { + "name": "openshell-sa-token", + "mountPath": "/var/run/secrets/openshell", + "readOnly": true + } + ] + }); + if !params.supervisor_image_pull_policy.is_empty() { + container["imagePullPolicy"] = serde_json::json!(params.supervisor_image_pull_policy); + } + if params.provider_spiffe_enabled { + container["volumeMounts"] + .as_array_mut() + .expect("volumeMounts is an array") + .push(serde_json::json!({ + "name": SPIFFE_WORKLOAD_API_VOLUME_NAME, + "mountPath": spiffe_socket_mount_path(params.provider_spiffe_workload_api_socket_path), + "readOnly": true, + })); + } + if let Some(profile) = params.app_armor_profile { + container["securityContext"]["appArmorProfile"] = app_armor_profile_to_k8s(profile); + } + container +} + +fn supervisor_network_init_container(params: &SandboxPodParams<'_>) -> serde_json::Value { + let mut container = serde_json::json!({ + "name": SUPERVISOR_NETWORK_INIT_CONTAINER_NAME, + "image": params.supervisor_image, + "command": [ + SUPERVISOR_IMAGE_BINARY_PATH, + "--mode=network-init", + "--proxy-uid", + params.proxy_uid.to_string(), + "--proxy-gid", + params.sandbox_gid.to_string(), + "--sidecar-state-dir", + SIDECAR_STATE_MOUNT_PATH, + "--sidecar-tls-dir", + SIDECAR_TLS_MOUNT_PATH, + ], + "securityContext": { + "runAsUser": 0, + "allowPrivilegeEscalation": false, + "capabilities": { + "drop": ["ALL"], + "add": ["NET_ADMIN", "NET_RAW", "CHOWN", "FOWNER"] + } + }, + "volumeMounts": [ + sidecar_state_volume_mount(), + sidecar_tls_volume_mount(), + ] + }); + if !params.supervisor_image_pull_policy.is_empty() { + container["imagePullPolicy"] = 
serde_json::json!(params.supervisor_image_pull_policy); + } + if !params.client_tls_secret_name.is_empty() { + container["volumeMounts"] + .as_array_mut() + .expect("volumeMounts is an array") + .push(serde_json::json!({ + "name": "openshell-client-tls", + "mountPath": "/etc/openshell-tls/client", + "readOnly": true + })); + } + if let Some(profile) = params.app_armor_profile { + container["securityContext"]["appArmorProfile"] = app_armor_profile_to_k8s(profile); + } + container +} + +fn apply_supervisor_sidecar_topology( + pod_template: &mut serde_json::Value, + template_environment: &std::collections::HashMap<String, String>, + spec_environment: &std::collections::HashMap<String, String>, + params: &SandboxPodParams<'_>, +) { + let Some(spec) = pod_template.get_mut("spec").and_then(|v| v.as_object_mut()) else { + return; + }; + + let pod_security_context = spec + .entry("securityContext") + .or_insert_with(|| serde_json::json!({})); + if let Some(sc) = pod_security_context.as_object_mut() { + sc.insert("fsGroup".to_string(), serde_json::json!(params.sandbox_gid)); + } + + apply_supervisor_binary_source( + spec, + params.supervisor_image, + params.supervisor_image_pull_policy, + params.supervisor_sideload_method, + ); + + let volumes = spec + .entry("volumes") + .or_insert_with(|| serde_json::json!([])) + .as_array_mut(); + if let Some(volumes) = volumes { + volumes.push(serde_json::json!({ + "name": SIDECAR_STATE_VOLUME_NAME, + "emptyDir": {} + })); + volumes.push(serde_json::json!({ + "name": SIDECAR_TLS_VOLUME_NAME, + "emptyDir": {} + })); + } + + let init_containers = spec + .entry("initContainers") + .or_insert_with(|| serde_json::json!([])) + .as_array_mut(); + if let Some(init_containers) = init_containers { + init_containers.push(supervisor_network_init_container(params)); + } + + let Some(containers) = spec.get_mut("containers").and_then(|v| v.as_array_mut()) else { + return; + }; + + let target_index = containers + .iter() + .position(|c| c.get("name").and_then(|v| v.as_str()) == 
Some("agent")) + .unwrap_or(0); + + if let Some(container) = containers + .get_mut(target_index) + .and_then(|v| v.as_object_mut()) + { + container.insert( + "command".to_string(), + serde_json::json!([ + format!("{}/openshell-sandbox", SUPERVISOR_MOUNT_PATH), + "--mode=process" + ]), + ); + + let security_context = container + .entry("securityContext") + .or_insert_with(|| serde_json::json!({})); + if let Some(sc) = security_context.as_object_mut() { + sc.insert( + "runAsUser".to_string(), + serde_json::json!(params.sandbox_uid), + ); + sc.insert( + "runAsGroup".to_string(), + serde_json::json!(params.sandbox_gid), + ); + sc.insert("runAsNonRoot".to_string(), serde_json::json!(true)); + sc.insert( + "allowPrivilegeEscalation".to_string(), + serde_json::json!(false), + ); + sc.insert( + "capabilities".to_string(), + serde_json::json!({ + "drop": ["ALL"] + }), + ); + } + + let volume_mounts = container + .entry("volumeMounts") + .or_insert_with(|| serde_json::json!([])) + .as_array_mut(); + if let Some(volume_mounts) = volume_mounts { + volume_mounts.push(supervisor_volume_mount()); + volume_mounts.push(sidecar_state_volume_mount()); + volume_mounts.push(sidecar_tls_volume_mount()); + } + + let env = container + .entry("env") + .or_insert_with(|| serde_json::json!([])) + .as_array_mut(); + if let Some(env) = env { + let process_endpoint = sidecar_process_gateway_endpoint(params.grpc_endpoint); + upsert_env( + env, + openshell_core::sandbox_env::ENDPOINT, + &process_endpoint, + ); + if let Some(server_name) = gateway_tls_server_name(params.grpc_endpoint) { + upsert_env( + env, + openshell_core::sandbox_env::GATEWAY_TLS_SERVER_NAME, + &server_name, + ); + } + upsert_env( + env, + openshell_core::sandbox_env::SUPERVISOR_TOPOLOGY, + "sidecar", + ); + upsert_env( + env, + openshell_core::sandbox_env::NETWORK_ENFORCEMENT_MODE, + "sidecar-nftables", + ); + upsert_env( + env, + openshell_core::sandbox_env::PROCESS_ENFORCEMENT_MODE, + "network-only", + ); + upsert_env( + 
env, + openshell_core::sandbox_env::SSH_SOCKET_PATH, + SIDECAR_SSH_SOCKET_FILE, + ); + upsert_env( + env, + openshell_core::sandbox_env::SUPERVISOR_READY_FILE, + SIDECAR_READY_FILE, + ); + upsert_env( + env, + openshell_core::sandbox_env::ENTRYPOINT_PID_FILE, + SIDECAR_ENTRYPOINT_PID_FILE, + ); + upsert_env( + env, + openshell_core::sandbox_env::PROXY_TLS_DIR, + SIDECAR_TLS_MOUNT_PATH, + ); + upsert_env( + env, + openshell_core::sandbox_env::SANDBOX_UID, + &params.sandbox_uid.to_string(), + ); + upsert_env( + env, + openshell_core::sandbox_env::SANDBOX_GID, + &params.sandbox_gid.to_string(), + ); + } + } + + containers.push(supervisor_sidecar_container( + template_environment, + spec_environment, + params, + )); +} + +fn proxy_pod_ca_source_volume_mount() -> serde_json::Value { + serde_json::json!({ + "name": "openshell-proxy-pod-ca-source", + "mountPath": PROXY_POD_CA_SECRET_MOUNT_PATH, + "readOnly": true + }) +} + +fn proxy_pod_ca_tls_volume_mount() -> serde_json::Value { + serde_json::json!({ + "name": "openshell-proxy-pod-tls", + "mountPath": SIDECAR_TLS_MOUNT_PATH, + }) +} + +fn proxy_pod_ca_init_container( + image: &str, + image_pull_policy: &str, + sandbox_gid: u32, +) -> serde_json::Value { + let copy_cmd = format!( + "set -eu; \ + mkdir -p {SIDECAR_TLS_MOUNT_PATH}; \ + cp {PROXY_POD_CA_SECRET_MOUNT_PATH}/{PROXY_POD_CA_CERT_FILE} {SIDECAR_TLS_MOUNT_PATH}/{PROXY_POD_CA_CERT_FILE}; \ + bundle={SIDECAR_TLS_MOUNT_PATH}/ca-bundle.pem; \ + found=0; \ + for path in /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt /etc/ssl/ca-bundle.pem /etc/ssl/cert.pem; do \ + if [ -f \"$path\" ]; then cat \"$path\" > \"$bundle\"; found=1; break; fi; \ + done; \ + if [ \"$found\" = 0 ]; then : > \"$bundle\"; fi; \ + printf '\\n' >> \"$bundle\"; \ + cat {PROXY_POD_CA_SECRET_MOUNT_PATH}/{PROXY_POD_CA_CERT_FILE} >> \"$bundle\"" + ); + let mut init_spec = serde_json::json!({ + "name": "openshell-proxy-ca-install", + "image": image, + "command": ["sh", "-c", copy_cmd], + 
"securityContext": { + "runAsUser": 0, + "runAsGroup": sandbox_gid, + "allowPrivilegeEscalation": false, + "capabilities": { + "drop": ["ALL"] + } + }, + "volumeMounts": [ + proxy_pod_ca_source_volume_mount(), + proxy_pod_ca_tls_volume_mount(), + ] + }); + if !image_pull_policy.is_empty() { + init_spec["imagePullPolicy"] = serde_json::json!(image_pull_policy); + } + init_spec +} + +fn apply_proxy_pod_affinity( + spec: &mut serde_json::Map, + sandbox_id: &str, +) { + if sandbox_id.is_empty() { + return; + } + + let affinity = spec + .entry("affinity".to_string()) + .or_insert_with(|| serde_json::json!({})); + if !affinity.is_object() { + *affinity = serde_json::json!({}); + } + let affinity = affinity + .as_object_mut() + .expect("affinity was converted to object"); + let pod_affinity = affinity + .entry("podAffinity".to_string()) + .or_insert_with(|| serde_json::json!({})); + if !pod_affinity.is_object() { + *pod_affinity = serde_json::json!({}); + } + let pod_affinity = pod_affinity + .as_object_mut() + .expect("podAffinity was converted to object"); + let required = pod_affinity + .entry("requiredDuringSchedulingIgnoredDuringExecution".to_string()) + .or_insert_with(|| serde_json::json!([])); + if !required.is_array() { + *required = serde_json::json!([]); + } + if let Some(required) = required.as_array_mut() { + required.push(serde_json::json!({ + "labelSelector": { + "matchLabels": proxy_pod_match_labels(sandbox_id, SANDBOX_ROLE_SUPERVISOR) + }, + "topologyKey": "kubernetes.io/hostname" + })); + } +} + +fn apply_supervisor_proxy_pod_topology( + pod_template: &mut serde_json::Value, + params: &SandboxPodParams<'_>, +) { + let Some(spec) = pod_template.get_mut("spec").and_then(|v| v.as_object_mut()) else { + return; + }; + + let pod_security_context = spec + .entry("securityContext") + .or_insert_with(|| serde_json::json!({})); + if let Some(sc) = pod_security_context.as_object_mut() { + sc.insert("fsGroup".to_string(), serde_json::json!(params.sandbox_gid)); + } 
+ + apply_supervisor_binary_source( + spec, + params.supervisor_image, + params.supervisor_image_pull_policy, + params.supervisor_sideload_method, + ); + + apply_proxy_pod_affinity(spec, params.sandbox_id); + + let names = proxy_pod_resource_names(params.sandbox_name); + let service_dns = proxy_pod_service_dns(&names.service, params.namespace); + + let volumes = spec + .entry("volumes") + .or_insert_with(|| serde_json::json!([])) + .as_array_mut(); + if let Some(volumes) = volumes { + volumes.push(serde_json::json!({ + "name": "openshell-proxy-pod-ca-source", + "secret": { + "secretName": names.proxy_ca_secret, + "defaultMode": 0o444, + "items": [{ + "key": PROXY_POD_CA_CERT_FILE, + "path": PROXY_POD_CA_CERT_FILE, + }] + } + })); + volumes.push(serde_json::json!({ + "name": "openshell-proxy-pod-tls", + "emptyDir": {} + })); + } + + let image = spec + .get("containers") + .and_then(|v| v.as_array()) + .and_then(|containers| containers.first()) + .and_then(|container| container.get("image")) + .and_then(|value| value.as_str()) + .unwrap_or(params.default_image) + .to_string(); + let init_containers = spec + .entry("initContainers") + .or_insert_with(|| serde_json::json!([])) + .as_array_mut(); + if let Some(init_containers) = init_containers { + init_containers.push(proxy_pod_ca_init_container( + &image, + params.image_pull_policy, + params.sandbox_gid, + )); + } + + let Some(containers) = spec.get_mut("containers").and_then(|v| v.as_array_mut()) else { + return; + }; + let target_index = containers + .iter() + .position(|c| c.get("name").and_then(|v| v.as_str()) == Some("agent")) + .unwrap_or(0); + if let Some(container) = containers + .get_mut(target_index) + .and_then(|v| v.as_object_mut()) + { + container.insert( + "command".to_string(), + serde_json::json!([ + format!("{}/openshell-sandbox", SUPERVISOR_MOUNT_PATH), + "--mode=process" + ]), + ); + + let security_context = container + .entry("securityContext") + .or_insert_with(|| serde_json::json!({})); + if let 
Some(sc) = security_context.as_object_mut() { + sc.insert( + "runAsUser".to_string(), + serde_json::json!(params.sandbox_uid), + ); + sc.insert( + "runAsGroup".to_string(), + serde_json::json!(params.sandbox_gid), + ); + sc.insert("runAsNonRoot".to_string(), serde_json::json!(true)); + sc.insert( + "allowPrivilegeEscalation".to_string(), + serde_json::json!(false), + ); + sc.insert( + "capabilities".to_string(), + serde_json::json!({ + "drop": ["ALL"] + }), + ); + } + + let volume_mounts = container + .entry("volumeMounts") + .or_insert_with(|| serde_json::json!([])) + .as_array_mut(); + if let Some(volume_mounts) = volume_mounts { + volume_mounts.push(supervisor_volume_mount()); + volume_mounts.push(proxy_pod_ca_tls_volume_mount()); + } + + let env = container + .entry("env") + .or_insert_with(|| serde_json::json!([])) + .as_array_mut(); + if let Some(env) = env { + let process_endpoint = + proxy_pod_process_gateway_endpoint(&service_dns, params.grpc_endpoint); + upsert_env( + env, + openshell_core::sandbox_env::ENDPOINT, + &process_endpoint, + ); + if let Some(server_name) = gateway_tls_server_name(params.grpc_endpoint) { + upsert_env( + env, + openshell_core::sandbox_env::GATEWAY_TLS_SERVER_NAME, + &server_name, + ); + } + upsert_env( + env, + openshell_core::sandbox_env::SUPERVISOR_TOPOLOGY, + "proxy-pod", + ); + upsert_env( + env, + openshell_core::sandbox_env::NETWORK_ENFORCEMENT_MODE, + PROXY_POD_NETWORK_ENFORCEMENT_MODE, + ); + upsert_env( + env, + openshell_core::sandbox_env::PROCESS_ENFORCEMENT_MODE, + "network-only", + ); + upsert_env( + env, + openshell_core::sandbox_env::SSH_SOCKET_PATH, + PROXY_POD_SSH_SOCKET_FILE, + ); + upsert_env( + env, + openshell_core::sandbox_env::PROXY_URL, + &proxy_pod_proxy_url(&service_dns), + ); + upsert_env( + env, + openshell_core::sandbox_env::SUPERVISOR_READY_ADDR, + &format!("{service_dns}:{PROXY_POD_PROXY_PORT}"), + ); + upsert_env( + env, + openshell_core::sandbox_env::PROXY_TLS_DIR, + SIDECAR_TLS_MOUNT_PATH, + ); + 
upsert_env( + env, + openshell_core::sandbox_env::SANDBOX_UID, + &params.sandbox_uid.to_string(), + ); + upsert_env( + env, + openshell_core::sandbox_env::SANDBOX_GID, + &params.sandbox_gid.to_string(), + ); + } + } +} + +/// Apply workspace persistence transforms to an already-built pod template. +/// +/// This injects: +/// 1. A volume mount on the agent container at `/sandbox`. +/// 2. An init container (same image) that seeds the PVC with the image's +/// original `/sandbox` contents on first use. +/// +/// The PVC volume itself is **not** added here — the Sandbox CRD controller +/// automatically creates a volume for each entry in `volumeClaimTemplates` +/// (following the `StatefulSet` convention). Adding one here would create a +/// duplicate volume name and fail pod validation. +/// +/// The init container mounts the PVC at a temporary path so it can still see +/// the image's `/sandbox` directory. It checks for a sentinel file and skips +/// the copy if the PVC was already initialised. +#[allow(clippy::similar_names)] +fn apply_workspace_persistence( + pod_template: &mut serde_json::Value, + image: &str, + image_pull_policy: &str, + sandbox_uid: u32, + sandbox_gid: u32, +) { + let Some(spec) = pod_template.get_mut("spec").and_then(|v| v.as_object_mut()) else { + return; + }; + + // 1. 
Add workspace volume mount to the agent container + let containers = spec.get_mut("containers").and_then(|v| v.as_array_mut()); + if let Some(containers) = containers { + let mut target_index = None; + for (i, c) in containers.iter().enumerate() { + if c.get("name").and_then(|v| v.as_str()) == Some("agent") { + target_index = Some(i); + break; + } + } + let index = target_index.unwrap_or(0); + + if let Some(container) = containers.get_mut(index).and_then(|v| v.as_object_mut()) { + let volume_mounts = container + .entry("volumeMounts") + .or_insert_with(|| serde_json::json!([])) + .as_array_mut(); + if let Some(volume_mounts) = volume_mounts { + volume_mounts.push(serde_json::json!({ + "name": WORKSPACE_VOLUME_NAME, + "mountPath": WORKSPACE_MOUNT_PATH + })); + } + } + } + + // 2. Add the init container that seeds the PVC from the image + let init_containers = spec + .entry("initContainers") + .or_insert_with(|| serde_json::json!([])) + .as_array_mut(); + if let Some(init_containers) = init_containers { + // The init container mounts the PVC at a temp path so it can still + // read the image's original /sandbox contents. It copies them into + // the PVC only when the sentinel file is absent. + // + // Prefer a tar stream over `cp -a`: some sandbox images contain + // self-referential symlinks under `/sandbox/.uv`, and GNU cp can + // fail while seeding the PVC even though preserving the symlink as-is + // is valid. `tar` copies the tree without dereferencing those links. + // Archive only the contents, not the `/sandbox` directory entry + // itself, so extraction never tries to chmod the PVC mount root. + // Extract without restoring owner, mode, or timestamps so the + // non-root init container can seed kubelet-owned PVCs. + // + // The inner `[ -d ... ]` guard handles custom images that don't have + // a /sandbox directory — the copy is skipped but the sentinel is + // still written so subsequent starts are instant. + let copy_cmd = format!( + "if [ ! 
-f {WORKSPACE_INIT_MOUNT_PATH}/{WORKSPACE_SENTINEL} ]; then \ + if [ -d {WORKSPACE_MOUNT_PATH} ]; then \ + tmp=$(mktemp) && rm -f \"$tmp\" && \ + (cd {WORKSPACE_MOUNT_PATH} && find . -mindepth 1 -maxdepth 1 -exec tar -cf \"$tmp\" {{}} +) && \ + if [ -f \"$tmp\" ]; then \ + tar -C {WORKSPACE_INIT_MOUNT_PATH} --no-same-owner --no-same-permissions --touch -xf \"$tmp\" && \ + rm -f \"$tmp\"; \ + fi; \ + fi && \ + touch {WORKSPACE_INIT_MOUNT_PATH}/{WORKSPACE_SENTINEL}; \ + fi" + ); + + let mut init_spec = serde_json::json!({ + "name": WORKSPACE_INIT_CONTAINER_NAME, + "image": image, + "command": ["sh", "-c", copy_cmd], + "securityContext": { + "runAsUser": sandbox_uid, + "runAsGroup": sandbox_gid, + "fsGroup": sandbox_gid, + }, + "volumeMounts": [{ + "name": WORKSPACE_VOLUME_NAME, + "mountPath": WORKSPACE_INIT_MOUNT_PATH + }] + }); + if !image_pull_policy.is_empty() { + init_spec["imagePullPolicy"] = serde_json::json!(image_pull_policy); + } + init_containers.push(init_spec); + } +} + +/// Build the default `volumeClaimTemplates` array for sandbox pods. +/// +/// Provides a single PVC named "workspace" that backs the `/sandbox` +/// directory. The init container seeds it from the image on first use. +fn default_workspace_volume_claim_templates(storage_size: &str) -> serde_json::Value { + let size = if storage_size.is_empty() { + DEFAULT_WORKSPACE_STORAGE_SIZE + } else { + storage_size + }; + serde_json::json!([{ + "metadata": { + "name": WORKSPACE_VOLUME_NAME + }, + "spec": { + "accessModes": ["ReadWriteOnce"], + "resources": { + "requests": { + "storage": size + } + } + } + }]) +} + +/// Parameters shared by `sandbox_to_k8s_spec` and `sandbox_template_to_k8s`. 
+struct SandboxPodParams<'a> { + default_image: &'a str, + image_pull_policy: &'a str, + image_pull_secrets: &'a [String], + supervisor_image: &'a str, + supervisor_image_pull_policy: &'a str, + supervisor_sideload_method: SupervisorSideloadMethod, + supervisor_topology: SupervisorTopology, + proxy_uid: u32, + namespace: &'a str, + service_account_name: &'a str, + sandbox_id: &'a str, + sandbox_name: &'a str, + grpc_endpoint: &'a str, + ssh_socket_path: &'a str, + client_tls_secret_name: &'a str, + host_gateway_ip: &'a str, + enable_user_namespaces: bool, + app_armor_profile: Option<&'a AppArmorProfile>, + workspace_default_storage_size: &'a str, + default_runtime_class_name: &'a str, /// Lifetime (seconds) of the projected `ServiceAccount` token used /// for the bootstrap `IssueSandboxToken` exchange. sa_token_ttl_secs: i64, provider_spiffe_enabled: bool, provider_spiffe_workload_api_socket_path: &'a str, + /// Resolved sandbox UID for supervisor `runAsUser` and env var. + sandbox_uid: u32, + /// Resolved sandbox GID for PVC init container operations. 
+ sandbox_gid: u32, } impl Default for SandboxPodParams<'_> { @@ -1258,6 +2376,9 @@ impl Default for SandboxPodParams<'_> { supervisor_image: "", supervisor_image_pull_policy: "", supervisor_sideload_method: SupervisorSideloadMethod::default(), + supervisor_topology: SupervisorTopology::default(), + proxy_uid: DEFAULT_PROXY_UID, + namespace: "default", service_account_name: DEFAULT_SANDBOX_SERVICE_ACCOUNT_NAME, sandbox_id: "", sandbox_name: "", @@ -1272,10 +2393,27 @@ impl Default for SandboxPodParams<'_> { sa_token_ttl_secs: 3600, provider_spiffe_enabled: false, provider_spiffe_workload_api_socket_path: "", + sandbox_uid: DEFAULT_SANDBOX_UID, + sandbox_gid: DEFAULT_SANDBOX_UID, } } } +fn validate_proxy_identity(params: &SandboxPodParams<'_>) -> Result<(), KubernetesDriverError> { + if matches!( + params.supervisor_topology, + SupervisorTopology::Sidecar | SupervisorTopology::ProxyPod + ) && params.proxy_uid == params.sandbox_uid + { + let topology = params.supervisor_topology.to_string(); + return Err(KubernetesDriverError::Precondition(format!( + "proxy_uid ({}) must not match sandbox_uid ({}) in {topology} topology", + params.proxy_uid, params.sandbox_uid + ))); + } + Ok(()) +} + fn spec_pod_env(spec: Option<&SandboxSpec>) -> std::collections::HashMap<String, String> { let mut env = spec.map_or_else(Default::default, |s| s.environment.clone()); if let Some(s) = spec.filter(|s| !s.log_level.is_empty()) { @@ -1398,7 +2536,8 @@ fn sandbox_template_to_k8s_with_gpu_requirements( .iter() .map(|(key, value)| (key.clone(), serde_json::Value::String(value.clone()))) .collect::<serde_json::Map<String, serde_json::Value>>(); - if params.provider_spiffe_enabled { + let proxy_pod_topology = params.supervisor_topology == SupervisorTopology::ProxyPod; + if params.provider_spiffe_enabled || proxy_pod_topology { pod_labels.insert( LABEL_MANAGED_BY.to_string(), serde_json::Value::String(LABEL_MANAGED_BY_VALUE.to_string()), @@ -1410,6 +2549,12 @@ fn sandbox_template_to_k8s_with_gpu_requirements( ); } } + if proxy_pod_topology { + 
pod_labels.insert( + LABEL_SANDBOX_ROLE.to_string(), + serde_json::Value::String(SANDBOX_ROLE_AGENT.to_string()), + ); + } if !pod_labels.is_empty() { metadata.insert("labels".to_string(), serde_json::Value::Object(pod_labels)); } @@ -1586,13 +2731,22 @@ fn sandbox_template_to_k8s_with_gpu_requirements( serde_json::Value::Array(vec![serde_json::Value::Object(container)]), ); - // Add TLS secret volume. Mode 0400 (owner-read) prevents the - // unprivileged sandbox user from reading the mTLS private key. + // Add TLS secret volume. Combined mode uses mode 0400 because the + // supervisor starts as root and drops privileges before running workload + // children. Sidecar mode keeps the process supervisor non-root, so it uses + // pod fsGroup + 0440 to preserve gateway session and SSH control behavior. let mut volumes: Vec<serde_json::Value> = Vec::new(); if !params.client_tls_secret_name.is_empty() { + let client_tls_default_mode = match params.supervisor_topology { + SupervisorTopology::Combined => 0o400, + SupervisorTopology::Sidecar | SupervisorTopology::ProxyPod => 0o440, + }; volumes.push(serde_json::json!({ "name": "openshell-client-tls", - "secret": { "secretName": params.client_tls_secret_name, "defaultMode": 256 } + "secret": { + "secretName": params.client_tls_secret_name, + "defaultMode": client_tls_default_mode + } })); } if params.provider_spiffe_enabled { @@ -1607,7 +2761,12 @@ fn sandbox_template_to_k8s_with_gpu_requirements( // Projected ServiceAccountToken volume — kubelet writes a short-lived // audience-bound JWT into /var/run/secrets/openshell/token and rotates // it automatically. The supervisor exchanges this for a gateway-minted - // JWT via `IssueSandboxToken` once at startup. + // JWT via `IssueSandboxToken` once at startup. In sidecar topology both + // supervisor containers run with the sandbox GID and need group-read access. 
+ let sa_token_default_mode = match params.supervisor_topology { + SupervisorTopology::Combined => 0o400, + SupervisorTopology::Sidecar | SupervisorTopology::ProxyPod => 0o440, + }; volumes.push(serde_json::json!({ "name": "openshell-sa-token", "projected": { @@ -1618,21 +2777,13 @@ fn sandbox_template_to_k8s_with_gpu_requirements( "path": "token" } }], - "defaultMode": 256 + "defaultMode": sa_token_default_mode } })); spec.insert("volumes".to_string(), serde_json::Value::Array(volumes)); // Add hostAliases so sandbox pods can reach the Docker host. - if !params.host_gateway_ip.is_empty() { - spec.insert( - "hostAliases".to_string(), - serde_json::json!([{ - "ip": params.host_gateway_ip, - "hostnames": ["host.docker.internal", "host.openshell.internal"] - }]), - ); - } + apply_host_gateway_aliases(&mut spec, params.host_gateway_ip); let mut template_value = serde_json::Map::new(); if !metadata.is_empty() { @@ -1642,18 +2793,41 @@ fn sandbox_template_to_k8s_with_gpu_requirements( let mut result = serde_json::Value::Object(template_value); - apply_supervisor_sideload( - &mut result, - params.supervisor_image, - params.supervisor_image_pull_policy, - params.supervisor_sideload_method, - ); + match params.supervisor_topology { + SupervisorTopology::Combined => { + apply_supervisor_sideload( + &mut result, + params.supervisor_image, + params.supervisor_image_pull_policy, + params.supervisor_sideload_method, + params.sandbox_uid, + params.sandbox_gid, + ); + } + SupervisorTopology::Sidecar => { + apply_supervisor_sidecar_topology( + &mut result, + &template.environment, + spec_environment, + params, + ); + } + SupervisorTopology::ProxyPod => { + apply_supervisor_proxy_pod_topology(&mut result, params); + } + } // Inject workspace persistence (init container + PVC volume mount) so // that /sandbox data survives pod rescheduling. Skipped when the user // provides custom volumeClaimTemplates to avoid conflicts. 
if inject_workspace { - apply_workspace_persistence(&mut result, image, params.image_pull_policy); + apply_workspace_persistence( + &mut result, + image, + params.image_pull_policy, + params.sandbox_uid, + params.sandbox_gid, + ); } result @@ -1736,13 +2910,516 @@ fn apply_resource_quantity_map( merge_string_map(section_value, values); } -fn image_pull_secret_refs(secrets: &[String]) -> Vec<serde_json::Value> { - secrets - .iter() - .map(|secret| secret.trim()) - .filter(|secret| !secret.is_empty()) - .map(|secret| serde_json::json!({ "name": secret })) - .collect() +fn image_pull_secret_refs(secrets: &[String]) -> Vec<serde_json::Value> { + secrets + .iter() + .map(|secret| secret.trim()) + .filter(|secret| !secret.is_empty()) + .map(|secret| serde_json::json!({ "name": secret })) + .collect() +} + +fn k8s_object<T>(value: serde_json::Value) -> T +where + T: DeserializeOwned, +{ + serde_json::from_value(value).expect("driver rendered an invalid Kubernetes object") +} + +fn generate_proxy_pod_ca() -> Result<(String, String), KubernetesDriverError> { + let ca_key = KeyPair::generate().map_err(|err| { + KubernetesDriverError::Message(format!("failed to generate CA key: {err}")) + })?; + + let mut params = CertificateParams::default(); + params.is_ca = IsCa::Ca(rcgen::BasicConstraints::Unconstrained); + params + .distinguished_name + .push(DnType::CommonName, "OpenShell Proxy Pod Sandbox CA"); + params + .distinguished_name + .push(DnType::OrganizationName, "OpenShell"); + params.key_usages = vec![KeyUsagePurpose::KeyCertSign, KeyUsagePurpose::CrlSign]; + + let ca_cert = params.self_signed(&ca_key).map_err(|err| { + KubernetesDriverError::Message(format!("failed to generate CA certificate: {err}")) + })?; + Ok((ca_cert.pem(), ca_key.serialize_pem())) +} + +fn proxy_pod_owner_reference( + sandbox_cr: &DynamicObject, + api_version: &str, + controller: bool, +) -> Result<serde_json::Value, KubernetesDriverError> { + let name = + sandbox_cr.metadata.name.as_deref().ok_or_else(|| { + KubernetesDriverError::Message("created Sandbox is missing 
name".into()) + })?; + let uid = + sandbox_cr.metadata.uid.as_deref().ok_or_else(|| { + KubernetesDriverError::Message("created Sandbox is missing uid".into()) + })?; + Ok(serde_json::json!({ + "apiVersion": sandbox_cr + .types + .as_ref() + .map_or(api_version, |types| types.api_version.as_str()), + "kind": SANDBOX_KIND, + "name": name, + "uid": uid, + "controller": controller, + "blockOwnerDeletion": false, + })) +} + +fn proxy_pod_labels(sandbox_id: &str, role: &str) -> serde_json::Value { + let mut labels = serde_json::Map::new(); + labels.insert( + LABEL_MANAGED_BY.to_string(), + serde_json::json!(LABEL_MANAGED_BY_VALUE), + ); + labels.insert(LABEL_SANDBOX_ID.to_string(), serde_json::json!(sandbox_id)); + labels.insert(LABEL_SANDBOX_ROLE.to_string(), serde_json::json!(role)); + serde_json::Value::Object(labels) +} + +fn proxy_pod_match_labels(sandbox_id: &str, role: &str) -> serde_json::Value { + let mut labels = serde_json::Map::new(); + labels.insert(LABEL_SANDBOX_ID.to_string(), serde_json::json!(sandbox_id)); + labels.insert(LABEL_SANDBOX_ROLE.to_string(), serde_json::json!(role)); + serde_json::Value::Object(labels) +} + +fn proxy_pod_object_meta( + name: &str, + namespace: &str, + sandbox_id: &str, + role: &str, + owner_ref: serde_json::Value, +) -> serde_json::Value { + serde_json::json!({ + "name": name, + "namespace": namespace, + "labels": proxy_pod_labels(sandbox_id, role), + "annotations": { + "openshell.io/sandbox-id": sandbox_id + }, + "ownerReferences": [owner_ref] + }) +} + +fn proxy_pod_supervisor_env( + template_environment: &std::collections::HashMap<String, String>, + spec_environment: &std::collections::HashMap<String, String>, + params: &SandboxPodParams<'_>, +) -> Vec<serde_json::Value> { + let mut env = Vec::new(); + apply_required_env( + &mut env, + params.sandbox_id, + params.sandbox_name, + params.grpc_endpoint, + "", + false, + provider_spiffe_socket_path(params), + ); + if !params.client_tls_secret_name.is_empty() { + upsert_env( + &mut env, + openshell_core::sandbox_env::TLS_CA, + 
&format!("{SIDECAR_CLIENT_TLS_MOUNT_PATH}/ca.crt"), + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::TLS_CERT, + &format!("{SIDECAR_CLIENT_TLS_MOUNT_PATH}/tls.crt"), + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::TLS_KEY, + &format!("{SIDECAR_CLIENT_TLS_MOUNT_PATH}/tls.key"), + ); + } + copy_log_level_env(&mut env, template_environment, spec_environment); + upsert_env( + &mut env, + openshell_core::sandbox_env::SUPERVISOR_TOPOLOGY, + "proxy-pod", + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::NETWORK_ENFORCEMENT_MODE, + PROXY_POD_NETWORK_ENFORCEMENT_MODE, + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::NETWORK_BINARY_IDENTITY, + "relaxed", + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::GATEWAY_FORWARD_ADDR, + PROXY_POD_GATEWAY_FORWARD_ADDR, + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::PROXY_BIND_ADDR, + &format!("0.0.0.0:{PROXY_POD_PROXY_PORT}"), + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::PROXY_TLS_DIR, + SIDECAR_TLS_MOUNT_PATH, + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::PROXY_CA_CERT_PATH, + &format!("{PROXY_POD_CA_SECRET_MOUNT_PATH}/{PROXY_POD_CA_CERT_FILE}"), + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::PROXY_CA_KEY_PATH, + &format!("{PROXY_POD_CA_SECRET_MOUNT_PATH}/{PROXY_POD_CA_KEY_FILE}"), + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::SANDBOX_UID, + &params.sandbox_uid.to_string(), + ); + upsert_env( + &mut env, + openshell_core::sandbox_env::SANDBOX_GID, + &params.sandbox_gid.to_string(), + ); + env +} + +fn proxy_pod_ca_secret( + names: &ProxyPodResourceNames, + params: &SandboxPodParams<'_>, + owner_ref: serde_json::Value, + cert_pem: &str, + key_pem: &str, +) -> Secret { + let mut string_data = serde_json::Map::new(); + string_data.insert( + PROXY_POD_CA_CERT_FILE.to_string(), + serde_json::json!(cert_pem), + ); + string_data.insert( + PROXY_POD_CA_KEY_FILE.to_string(), + serde_json::json!(key_pem), + ); + 
k8s_object(serde_json::json!({ + "apiVersion": "v1", + "kind": "Secret", + "metadata": { + "name": names.proxy_ca_secret, + "namespace": params.namespace, + "labels": proxy_pod_labels(params.sandbox_id, SANDBOX_ROLE_SUPERVISOR), + "ownerReferences": [owner_ref], + }, + "type": "Opaque", + "stringData": serde_json::Value::Object(string_data) + })) +} + +fn proxy_pod_supervisor_service( + names: &ProxyPodResourceNames, + params: &SandboxPodParams<'_>, + owner_ref: serde_json::Value, +) -> Service { + k8s_object(serde_json::json!({ + "apiVersion": "v1", + "kind": "Service", + "metadata": { + "name": names.service, + "namespace": params.namespace, + "labels": proxy_pod_labels(params.sandbox_id, SANDBOX_ROLE_SUPERVISOR), + "ownerReferences": [owner_ref], + }, + "spec": { + "clusterIP": "None", + "publishNotReadyAddresses": true, + "selector": proxy_pod_match_labels(params.sandbox_id, SANDBOX_ROLE_SUPERVISOR), + "ports": [ + { + "name": "http-proxy", + "port": PROXY_POD_PROXY_PORT, + "targetPort": PROXY_POD_PROXY_PORT, + "protocol": "TCP" + }, + { + "name": "gateway-forward", + "port": PROXY_POD_GATEWAY_FORWARD_PORT, + "targetPort": PROXY_POD_GATEWAY_FORWARD_PORT, + "protocol": "TCP" + } + ] + } + })) +} + +fn proxy_pod_supervisor_deployment( + names: &ProxyPodResourceNames, + template_environment: &std::collections::HashMap<String, String>, + spec_environment: &std::collections::HashMap<String, String>, + params: &SandboxPodParams<'_>, + owner_ref: serde_json::Value, +) -> Deployment { + let mut container = serde_json::json!({ + "name": SUPERVISOR_NETWORK_SIDECAR_NAME, + "image": params.supervisor_image, + "command": [ + SUPERVISOR_IMAGE_BINARY_PATH, + "--mode=network", + ], + "env": proxy_pod_supervisor_env(template_environment, spec_environment, params), + "ports": [ + {"name": "http-proxy", "containerPort": PROXY_POD_PROXY_PORT, "protocol": "TCP"}, + {"name": "gateway-fwd", "containerPort": PROXY_POD_GATEWAY_FORWARD_PORT, "protocol": "TCP"} + ], + "readinessProbe": { + "tcpSocket": {"port": 
PROXY_POD_PROXY_PORT}, + "periodSeconds": 2, + "failureThreshold": 30 + }, + "securityContext": { + "runAsUser": params.proxy_uid, + "runAsGroup": params.sandbox_gid, + "runAsNonRoot": true, + "allowPrivilegeEscalation": false, + "capabilities": { + "drop": ["ALL"] + } + }, + "volumeMounts": [ + { + "name": "openshell-sa-token", + "mountPath": "/var/run/secrets/openshell", + "readOnly": true + }, + { + "name": "openshell-proxy-pod-ca-source", + "mountPath": PROXY_POD_CA_SECRET_MOUNT_PATH, + "readOnly": true + }, + proxy_pod_ca_tls_volume_mount(), + ] + }); + if !params.supervisor_image_pull_policy.is_empty() { + container["imagePullPolicy"] = serde_json::json!(params.supervisor_image_pull_policy); + } + if !params.client_tls_secret_name.is_empty() { + container["volumeMounts"] + .as_array_mut() + .expect("volumeMounts is an array") + .push(serde_json::json!({ + "name": "openshell-client-tls", + "mountPath": SIDECAR_CLIENT_TLS_MOUNT_PATH, + "readOnly": true + })); + } + if params.provider_spiffe_enabled { + container["volumeMounts"] + .as_array_mut() + .expect("volumeMounts is an array") + .push(serde_json::json!({ + "name": SPIFFE_WORKLOAD_API_VOLUME_NAME, + "mountPath": spiffe_socket_mount_path(params.provider_spiffe_workload_api_socket_path), + "readOnly": true, + })); + } + if let Some(profile) = params.app_armor_profile { + container["securityContext"]["appArmorProfile"] = app_armor_profile_to_k8s(profile); + } + + let mut spec = serde_json::json!({ + "serviceAccountName": params.service_account_name, + "automountServiceAccountToken": false, + "securityContext": { + "fsGroup": params.sandbox_gid + }, + "containers": [container], + "volumes": [ + { + "name": "openshell-sa-token", + "projected": { + "sources": [{ + "serviceAccountToken": { + "audience": "openshell-gateway", + "expirationSeconds": params.sa_token_ttl_secs, + "path": "token" + } + }], + "defaultMode": 0o440 + } + }, + { + "name": "openshell-proxy-pod-ca-source", + "secret": { + "secretName": 
names.proxy_ca_secret, + "defaultMode": 0o440 + } + }, + { + "name": "openshell-proxy-pod-tls", + "emptyDir": {} + } + ] + }); + if !params.default_runtime_class_name.is_empty() { + spec["runtimeClassName"] = serde_json::json!(params.default_runtime_class_name); + } + if let Some(spec_obj) = spec.as_object_mut() { + apply_host_gateway_aliases(spec_obj, params.host_gateway_ip); + } + let image_pull_secrets = image_pull_secret_refs(params.image_pull_secrets); + if !image_pull_secrets.is_empty() { + spec["imagePullSecrets"] = serde_json::Value::Array(image_pull_secrets); + } + if !params.client_tls_secret_name.is_empty() { + spec["volumes"] + .as_array_mut() + .expect("volumes is an array") + .push(serde_json::json!({ + "name": "openshell-client-tls", + "secret": { + "secretName": params.client_tls_secret_name, + "defaultMode": 0o440 + } + })); + } + if params.provider_spiffe_enabled { + spec["volumes"] + .as_array_mut() + .expect("volumes is an array") + .push(serde_json::json!({ + "name": SPIFFE_WORKLOAD_API_VOLUME_NAME, + "csi": { + "driver": "csi.spiffe.io", + "readOnly": true + } + })); + } + + k8s_object(serde_json::json!({ + "apiVersion": "apps/v1", + "kind": "Deployment", + "metadata": proxy_pod_object_meta( + &names.supervisor_deployment, + params.namespace, + params.sandbox_id, + SANDBOX_ROLE_SUPERVISOR, + owner_ref + ), + "spec": { + "replicas": 1, + "selector": { + "matchLabels": proxy_pod_match_labels(params.sandbox_id, SANDBOX_ROLE_SUPERVISOR) + }, + "template": { + "metadata": { + "labels": proxy_pod_labels(params.sandbox_id, SANDBOX_ROLE_SUPERVISOR), + "annotations": { + "openshell.io/sandbox-id": params.sandbox_id + } + }, + "spec": spec + } + } + })) +} + +fn proxy_pod_agent_egress_network_policy( + names: &ProxyPodResourceNames, + params: &SandboxPodParams<'_>, + owner_ref: serde_json::Value, +) -> NetworkPolicy { + k8s_object(serde_json::json!({ + "apiVersion": "networking.k8s.io/v1", + "kind": "NetworkPolicy", + "metadata": { + "name": 
names.agent_egress_network_policy, + "namespace": params.namespace, + "labels": proxy_pod_labels(params.sandbox_id, SANDBOX_ROLE_AGENT), + "ownerReferences": [owner_ref], + }, + "spec": { + "podSelector": { + "matchLabels": proxy_pod_match_labels(params.sandbox_id, SANDBOX_ROLE_AGENT) + }, + "policyTypes": ["Egress"], + "egress": [ + { + "to": [{ + "podSelector": { + "matchLabels": proxy_pod_match_labels(params.sandbox_id, SANDBOX_ROLE_SUPERVISOR) + } + }], + "ports": [ + {"protocol": "TCP", "port": PROXY_POD_PROXY_PORT}, + {"protocol": "TCP", "port": PROXY_POD_GATEWAY_FORWARD_PORT} + ] + }, + { + "to": [{ + "namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": "kube-system"}}, + "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}} + }], + "ports": [ + {"protocol": "UDP", "port": 53}, + {"protocol": "TCP", "port": 53} + ] + }, + { + "to": [{ + "namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": "kube-system"}}, + "podSelector": {"matchLabels": {"k8s-app": "coredns"}} + }], + "ports": [ + {"protocol": "UDP", "port": 53}, + {"protocol": "TCP", "port": 53} + ] + } + ] + } + })) +} + +fn proxy_pod_supervisor_ingress_network_policy( + names: &ProxyPodResourceNames, + params: &SandboxPodParams<'_>, + owner_ref: serde_json::Value, +) -> NetworkPolicy { + k8s_object(serde_json::json!({ + "apiVersion": "networking.k8s.io/v1", + "kind": "NetworkPolicy", + "metadata": { + "name": names.supervisor_ingress_network_policy, + "namespace": params.namespace, + "labels": proxy_pod_labels(params.sandbox_id, SANDBOX_ROLE_SUPERVISOR), + "ownerReferences": [owner_ref], + }, + "spec": { + "podSelector": { + "matchLabels": proxy_pod_match_labels(params.sandbox_id, SANDBOX_ROLE_SUPERVISOR) + }, + "policyTypes": ["Ingress"], + "ingress": [{ + "from": [{ + "podSelector": { + "matchLabels": proxy_pod_match_labels(params.sandbox_id, SANDBOX_ROLE_AGENT) + } + }], + "ports": [ + {"protocol": "TCP", "port": PROXY_POD_PROXY_PORT}, + {"protocol": "TCP", "port": 
PROXY_POD_GATEWAY_FORWARD_PORT} + ] + }] + } + })) } fn app_armor_profile_to_k8s(profile: &AppArmorProfile) -> serde_json::Value { @@ -2134,6 +3811,15 @@ mod tests { assert!(!should_try_next_sandbox_api_version(&err)); } + fn rendered_env<'a>(container: &'a serde_json::Value, name: &str) -> Option<&'a str> { + container["env"] + .as_array()? + .iter() + .find(|item| item.get("name").and_then(|value| value.as_str()) == Some(name))? + .get("value")? + .as_str() + } + #[test] fn driver_config_rejects_invalid_shape() { let template = SandboxTemplate { @@ -2271,6 +3957,8 @@ mod tests { "custom-image:latest", "IfNotPresent", SupervisorSideloadMethod::InitContainer, + 1500, // sandbox_uid + 1500, // sandbox_gid ); let sc = &pod_template["spec"]["containers"][0]["securityContext"]; @@ -2300,6 +3988,8 @@ mod tests { "supervisor-image:latest", "IfNotPresent", SupervisorSideloadMethod::InitContainer, + 1000, // sandbox_uid + 1000, // sandbox_gid ); let sc = &pod_template["spec"]["containers"][0]["securityContext"]; @@ -2325,6 +4015,8 @@ mod tests { "supervisor-image:latest", "IfNotPresent", SupervisorSideloadMethod::InitContainer, + 1000, // sandbox_uid + 1000, // sandbox_gid ); // Volume should be an emptyDir @@ -2399,6 +4091,8 @@ mod tests { "supervisor-image:latest", "IfNotPresent", SupervisorSideloadMethod::ImageVolume, + 1000, // sandbox_uid + 1000, // sandbox_gid ); let volumes = pod_template["spec"]["volumes"] @@ -2453,6 +4147,8 @@ mod tests { "supervisor-image:latest", "", SupervisorSideloadMethod::ImageVolume, + 1000, // sandbox_uid + 1000, // sandbox_gid ); let volume = &pod_template["spec"]["volumes"][0]; @@ -2463,6 +4159,474 @@ mod tests { ); } + #[test] + fn sidecar_topology_renders_process_agent_and_network_sidecar() { + let params = SandboxPodParams { + supervisor_topology: SupervisorTopology::Sidecar, + supervisor_sideload_method: SupervisorSideloadMethod::InitContainer, + supervisor_image: "supervisor-image:latest", + supervisor_image_pull_policy: 
"IfNotPresent", + grpc_endpoint: "https://openshell-gateway.openshell.svc:8080", + client_tls_secret_name: "openshell-client-tls", + proxy_uid: 2200, + namespace: "default", + sandbox_uid: 1500, + sandbox_gid: 1500, + ..SandboxPodParams::default() + }; + let pod_template = sandbox_template_to_k8s( + &SandboxTemplate { + image: "agent-image:latest".to_string(), + ..SandboxTemplate::default() + }, + false, + &std::collections::HashMap::new(), + false, + &params, + ); + + assert!( + pod_template["spec"]["shareProcessNamespace"].is_null(), + "sidecar mode no longer needs a shared process namespace when binary identity is relaxed" + ); + assert_eq!(pod_template["spec"]["securityContext"]["fsGroup"], 1500); + let containers = pod_template["spec"]["containers"].as_array().unwrap(); + assert_eq!(containers.len(), 2); + + let agent = containers + .iter() + .find(|container| container["name"] == "agent") + .unwrap(); + assert_eq!( + agent["command"], + serde_json::json!([ + format!("{SUPERVISOR_MOUNT_PATH}/openshell-sandbox"), + "--mode=process" + ]) + ); + assert_eq!(agent["securityContext"]["runAsUser"], 1500); + assert_eq!(agent["securityContext"]["runAsGroup"], 1500); + assert_eq!(agent["securityContext"]["runAsNonRoot"], true); + assert_eq!( + agent["securityContext"]["capabilities"], + serde_json::json!({ + "drop": ["ALL"] + }) + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::ENDPOINT), + Some("https://127.0.0.1:18080") + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::GATEWAY_TLS_SERVER_NAME), + Some("openshell-gateway.openshell.svc") + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::PROCESS_ENFORCEMENT_MODE), + Some("network-only") + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::SSH_SOCKET_PATH), + Some(SIDECAR_SSH_SOCKET_FILE) + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::SUPERVISOR_READY_FILE), + Some(SIDECAR_READY_FILE) + ); + assert_eq!( + rendered_env(agent, 
openshell_core::sandbox_env::ENTRYPOINT_PID_FILE), + Some(SIDECAR_ENTRYPOINT_PID_FILE) + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::PROXY_TLS_DIR), + Some(SIDECAR_TLS_MOUNT_PATH) + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::SANDBOX_UID), + Some("1500") + ); + + let sidecar = containers + .iter() + .find(|container| container["name"] == SUPERVISOR_NETWORK_SIDECAR_NAME) + .unwrap(); + assert_eq!(sidecar["image"], "supervisor-image:latest"); + assert_eq!(sidecar["imagePullPolicy"], "IfNotPresent"); + assert_eq!( + sidecar["command"], + serde_json::json!([SUPERVISOR_IMAGE_BINARY_PATH, "--mode=network"]) + ); + assert_eq!(sidecar["securityContext"]["runAsUser"], 2200); + assert_eq!(sidecar["securityContext"]["runAsGroup"], 1500); + assert_eq!(sidecar["securityContext"]["runAsNonRoot"], true); + assert_eq!( + sidecar["securityContext"]["capabilities"], + serde_json::json!({ + "drop": ["ALL"] + }) + ); + assert_eq!( + rendered_env(sidecar, openshell_core::sandbox_env::ENDPOINT), + Some("https://openshell-gateway.openshell.svc:8080") + ); + assert_eq!( + rendered_env(sidecar, openshell_core::sandbox_env::GATEWAY_FORWARD_ADDR), + Some(SIDECAR_GATEWAY_FORWARD_ADDR) + ); + assert_eq!( + rendered_env( + sidecar, + openshell_core::sandbox_env::NETWORK_BINARY_IDENTITY + ), + Some("relaxed") + ); + assert_eq!( + rendered_env(sidecar, openshell_core::sandbox_env::ENTRYPOINT_PID_FILE), + Some(SIDECAR_ENTRYPOINT_PID_FILE) + ); + assert_eq!( + rendered_env(sidecar, openshell_core::sandbox_env::PROXY_TLS_DIR), + Some(SIDECAR_TLS_MOUNT_PATH) + ); + assert_eq!( + rendered_env(sidecar, openshell_core::sandbox_env::TLS_CA), + Some("/etc/openshell-tls/proxy/client/ca.crt") + ); + let sidecar_mounts = sidecar["volumeMounts"].as_array().unwrap(); + assert!( + !sidecar_mounts + .iter() + .any(|mount| mount["name"] == "openshell-client-tls"), + "runtime sidecar should use the init-copied TLS files, not the root-owned Secret mount" + ); + let 
volumes = pod_template["spec"]["volumes"].as_array().unwrap(); + let sa_token = volumes + .iter() + .find(|volume| volume["name"] == "openshell-sa-token") + .unwrap(); + assert_eq!(sa_token["projected"]["defaultMode"], 0o440); + let client_tls = volumes + .iter() + .find(|volume| volume["name"] == "openshell-client-tls") + .unwrap(); + assert_eq!(client_tls["secret"]["defaultMode"], 0o440); + + let init_containers = pod_template["spec"]["initContainers"].as_array().unwrap(); + let network_init = init_containers + .iter() + .find(|container| container["name"] == SUPERVISOR_NETWORK_INIT_CONTAINER_NAME) + .unwrap(); + assert_eq!(network_init["image"], "supervisor-image:latest"); + assert_eq!(network_init["imagePullPolicy"], "IfNotPresent"); + assert_eq!( + network_init["command"], + serde_json::json!([ + SUPERVISOR_IMAGE_BINARY_PATH, + "--mode=network-init", + "--proxy-uid", + "2200", + "--proxy-gid", + "1500", + "--sidecar-state-dir", + SIDECAR_STATE_MOUNT_PATH, + "--sidecar-tls-dir", + SIDECAR_TLS_MOUNT_PATH + ]) + ); + assert_eq!( + network_init["securityContext"]["capabilities"], + serde_json::json!({ + "drop": ["ALL"], + "add": ["NET_ADMIN", "NET_RAW", "CHOWN", "FOWNER"] + }) + ); + let network_init_mounts = network_init["volumeMounts"].as_array().unwrap(); + assert!(network_init_mounts.iter().any(|mount| { + mount["name"] == "openshell-client-tls" + && mount["mountPath"] == "/etc/openshell-tls/client" + })); + } + + #[test] + fn sidecar_topology_adds_shared_state_and_tls_volumes() { + let params = SandboxPodParams { + supervisor_topology: SupervisorTopology::Sidecar, + supervisor_sideload_method: SupervisorSideloadMethod::ImageVolume, + supervisor_image: "supervisor-image:latest", + grpc_endpoint: "http://openshell-gateway.openshell.svc:8080", + ..SandboxPodParams::default() + }; + let pod_template = sandbox_template_to_k8s( + &SandboxTemplate::default(), + false, + &std::collections::HashMap::new(), + false, + &params, + ); + + let volumes = 
pod_template["spec"]["volumes"].as_array().unwrap(); + assert!( + volumes + .iter() + .any(|volume| volume["name"] == SIDECAR_STATE_VOLUME_NAME) + ); + assert!( + volumes + .iter() + .any(|volume| volume["name"] == SIDECAR_TLS_VOLUME_NAME) + ); + assert!(volumes.iter().any(|volume| { + volume["name"] == SUPERVISOR_VOLUME_NAME && volume["image"].is_object() + })); + + let containers = pod_template["spec"]["containers"].as_array().unwrap(); + for container_name in ["agent", SUPERVISOR_NETWORK_SIDECAR_NAME] { + let container = containers + .iter() + .find(|container| container["name"] == container_name) + .unwrap(); + let mounts = container["volumeMounts"].as_array().unwrap(); + assert!(mounts.iter().any(|mount| { + mount["name"] == SIDECAR_STATE_VOLUME_NAME + && mount["mountPath"] == SIDECAR_STATE_MOUNT_PATH + })); + assert!(mounts.iter().any(|mount| { + mount["name"] == SIDECAR_TLS_VOLUME_NAME + && mount["mountPath"] == SIDECAR_TLS_MOUNT_PATH + })); + } + } + + #[test] + fn sidecar_topology_rejects_proxy_uid_matching_sandbox_uid() { + let params = SandboxPodParams { + supervisor_topology: SupervisorTopology::Sidecar, + proxy_uid: 1500, + namespace: "default", + sandbox_uid: 1500, + ..SandboxPodParams::default() + }; + + let err = validate_proxy_identity(&params).unwrap_err(); + assert!(matches!(err, KubernetesDriverError::Precondition(_))); + assert!(err.to_string().contains("proxy_uid")); + } + + #[test] + fn proxy_pod_topology_renders_process_agent_with_proxy_service() { + let params = SandboxPodParams { + supervisor_topology: SupervisorTopology::ProxyPod, + supervisor_sideload_method: SupervisorSideloadMethod::InitContainer, + supervisor_image: "supervisor-image:latest", + namespace: "agents", + sandbox_id: "sandbox-123", + sandbox_name: "example-sandbox", + grpc_endpoint: "https://openshell-gateway.openshell.svc:8080", + proxy_uid: 2200, + sandbox_uid: 1500, + sandbox_gid: 1500, + host_gateway_ip: "172.17.0.1", + ..SandboxPodParams::default() + }; + let pod_template 
= sandbox_template_to_k8s( + &SandboxTemplate { + image: "agent-image:latest".to_string(), + ..SandboxTemplate::default() + }, + false, + &std::collections::HashMap::new(), + false, + &params, + ); + + let names = proxy_pod_resource_names("example-sandbox"); + let service_dns = proxy_pod_service_dns(&names.service, "agents"); + let agent = &pod_template["spec"]["containers"][0]; + + assert_eq!( + pod_template["metadata"]["labels"][LABEL_SANDBOX_ROLE], + SANDBOX_ROLE_AGENT + ); + assert_eq!( + agent["command"], + serde_json::json!([ + format!("{SUPERVISOR_MOUNT_PATH}/openshell-sandbox"), + "--mode=process" + ]) + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::ENDPOINT), + Some(format!("https://{service_dns}:18080").as_str()) + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::GATEWAY_TLS_SERVER_NAME), + Some("openshell-gateway.openshell.svc") + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::PROXY_URL), + Some(format!("http://{service_dns}:3128").as_str()) + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::SUPERVISOR_READY_ADDR), + Some(format!("{service_dns}:3128").as_str()) + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::NETWORK_ENFORCEMENT_MODE), + Some(PROXY_POD_NETWORK_ENFORCEMENT_MODE) + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::PROCESS_ENFORCEMENT_MODE), + Some("network-only") + ); + assert_eq!( + rendered_env(agent, openshell_core::sandbox_env::SSH_SOCKET_PATH), + Some(PROXY_POD_SSH_SOCKET_FILE) + ); + + let containers = pod_template["spec"]["containers"].as_array().unwrap(); + assert_eq!(containers.len(), 1); + let volumes = pod_template["spec"]["volumes"].as_array().unwrap(); + assert!(volumes.iter().any(|volume| { + volume["name"] == "openshell-proxy-pod-ca-source" + && volume["secret"]["secretName"] == names.proxy_ca_secret + })); + assert!(volumes.iter().any(|volume| { + volume["name"] == "openshell-proxy-pod-tls" && 
volume["emptyDir"].is_object() + })); + + let affinity = &pod_template["spec"]["affinity"]["podAffinity"]["requiredDuringSchedulingIgnoredDuringExecution"] + [0]; + assert_eq!( + affinity["labelSelector"]["matchLabels"][LABEL_SANDBOX_ROLE], + SANDBOX_ROLE_SUPERVISOR + ); + assert_eq!(affinity["topologyKey"], "kubernetes.io/hostname"); + } + + #[test] + fn proxy_pod_companion_resources_bind_one_agent_to_one_supervisor() { + let params = SandboxPodParams { + supervisor_topology: SupervisorTopology::ProxyPod, + supervisor_image: "supervisor-image:latest", + namespace: "agents", + service_account_name: "openshell-sandbox", + sandbox_id: "sandbox-123", + sandbox_name: "example-sandbox", + grpc_endpoint: "http://openshell-gateway.openshell.svc:8080", + proxy_uid: 2200, + sandbox_uid: 1500, + sandbox_gid: 1500, + host_gateway_ip: "172.17.0.1", + ..SandboxPodParams::default() + }; + let names = proxy_pod_resource_names(params.sandbox_name); + let owner_ref = serde_json::json!({ + "apiVersion": "agents.x-k8s.io/v1beta1", + "kind": "Sandbox", + "name": params.sandbox_name, + "uid": "sandbox-cr-uid", + "controller": true, + "blockOwnerDeletion": false + }); + + let supervisor = serde_json::to_value(proxy_pod_supervisor_deployment( + &names, + &std::collections::HashMap::new(), + &std::collections::HashMap::new(), + &params, + owner_ref.clone(), + )) + .unwrap(); + assert_eq!( + supervisor["metadata"]["ownerReferences"][0]["controller"], + true + ); + assert_eq!( + supervisor["metadata"]["annotations"]["openshell.io/sandbox-id"], + "sandbox-123" + ); + assert_eq!( + supervisor["metadata"]["labels"][LABEL_SANDBOX_ROLE], + SANDBOX_ROLE_SUPERVISOR + ); + assert_eq!(supervisor["kind"], "Deployment"); + assert_eq!(supervisor["spec"]["replicas"], 1); + assert_eq!( + supervisor["spec"]["selector"]["matchLabels"][LABEL_SANDBOX_ROLE], + SANDBOX_ROLE_SUPERVISOR + ); + assert_eq!( + supervisor["spec"]["template"]["metadata"]["labels"][LABEL_SANDBOX_ROLE], + SANDBOX_ROLE_SUPERVISOR + ); + 
assert_eq!( + supervisor["spec"]["template"]["spec"]["hostAliases"][0]["ip"], + params.host_gateway_ip + ); + let hostnames = supervisor["spec"]["template"]["spec"]["hostAliases"][0]["hostnames"] + .as_array() + .unwrap(); + assert!(hostnames.contains(&serde_json::json!("host.openshell.internal"))); + let container = &supervisor["spec"]["template"]["spec"]["containers"][0]; + assert_eq!( + rendered_env(container, openshell_core::sandbox_env::PROXY_BIND_ADDR), + Some("0.0.0.0:3128") + ); + assert_eq!( + rendered_env(container, openshell_core::sandbox_env::GATEWAY_FORWARD_ADDR), + Some(PROXY_POD_GATEWAY_FORWARD_ADDR) + ); + + let agent_egress = serde_json::to_value(proxy_pod_agent_egress_network_policy( + &names, + &params, + owner_ref.clone(), + )) + .unwrap(); + assert_eq!( + agent_egress["spec"]["policyTypes"], + serde_json::json!(["Egress"]) + ); + assert_eq!( + agent_egress["spec"]["podSelector"]["matchLabels"][LABEL_SANDBOX_ROLE], + SANDBOX_ROLE_AGENT + ); + assert_eq!( + agent_egress["spec"]["egress"][0]["to"][0]["podSelector"]["matchLabels"] + [LABEL_SANDBOX_ROLE], + SANDBOX_ROLE_SUPERVISOR + ); + + let supervisor_ingress = serde_json::to_value(proxy_pod_supervisor_ingress_network_policy( + &names, &params, owner_ref, + )) + .unwrap(); + assert_eq!( + supervisor_ingress["spec"]["policyTypes"], + serde_json::json!(["Ingress"]) + ); + assert_eq!( + supervisor_ingress["spec"]["ingress"][0]["from"][0]["podSelector"]["matchLabels"] + [LABEL_SANDBOX_ROLE], + SANDBOX_ROLE_AGENT + ); + } + + #[test] + fn proxy_pod_topology_rejects_proxy_uid_matching_sandbox_uid() { + let params = SandboxPodParams { + supervisor_topology: SupervisorTopology::ProxyPod, + proxy_uid: 1500, + namespace: "default", + sandbox_uid: 1500, + ..SandboxPodParams::default() + }; + + let err = validate_proxy_identity(&params).unwrap_err(); + assert!(matches!(err, KubernetesDriverError::Precondition(_))); + assert!(err.to_string().contains("proxy-pod")); + } + /// Regression test: TLS mount path must match env 
var paths. /// The volume is mounted at a specific path and the env vars must point to /// files within that same path, otherwise the sandbox will fail to start @@ -2945,6 +5109,8 @@ mod tests { &mut pod_template, "openshell/sandbox:latest", "IfNotPresent", + 1000, // sandbox_uid + 1000, // sandbox_gid ); // Init container @@ -2955,7 +5121,8 @@ mod tests { assert_eq!(init_containers[0]["name"], WORKSPACE_INIT_CONTAINER_NAME); assert_eq!(init_containers[0]["image"], "openshell/sandbox:latest"); assert_eq!(init_containers[0]["imagePullPolicy"], "IfNotPresent"); - assert_eq!(init_containers[0]["securityContext"]["runAsUser"], 0); + // init container runs as the resolved sandbox UID (not root) + assert_eq!(init_containers[0]["securityContext"]["runAsUser"], 1000); // Init container mounts PVC at temp path, not /sandbox let init_mounts = init_containers[0]["volumeMounts"] @@ -2998,7 +5165,13 @@ mod tests { } }); - apply_workspace_persistence(&mut pod_template, "my-custom-image:v2", "IfNotPresent"); + apply_workspace_persistence( + &mut pod_template, + "my-custom-image:v2", + "IfNotPresent", + 1000, + 1000, + ); let init_image = pod_template["spec"]["initContainers"][0]["image"] .as_str() @@ -3020,7 +5193,7 @@ mod tests { } }); - apply_workspace_persistence(&mut pod_template, "img:latest", "Always"); + apply_workspace_persistence(&mut pod_template, "img:latest", "Always", 1000, 1000); let cmd = pod_template["spec"]["initContainers"][0]["command"] .as_array() @@ -3034,6 +5207,16 @@ mod tests { script.contains("tar -C"), "init script must seed image contents with a tar stream" ); + assert!( + script.contains("find . 
-mindepth 1 -maxdepth 1"), + "init script must archive sandbox contents without the mount root entry" + ); + assert!( + script.contains("--no-same-owner") + && script.contains("--no-same-permissions") + && script.contains("--touch"), + "init script must avoid restoring metadata onto the PVC root" + ); } #[test] diff --git a/crates/openshell-driver-kubernetes/src/lib.rs b/crates/openshell-driver-kubernetes/src/lib.rs index 22b0a8703..8c326f6af 100644 --- a/crates/openshell-driver-kubernetes/src/lib.rs +++ b/crates/openshell-driver-kubernetes/src/lib.rs @@ -6,8 +6,9 @@ pub mod driver; pub mod grpc; pub use config::{ - AppArmorProfile, DEFAULT_SANDBOX_SERVICE_ACCOUNT_NAME, DEFAULT_WORKSPACE_STORAGE_SIZE, - KubernetesComputeConfig, SupervisorSideloadMethod, + AppArmorProfile, DEFAULT_PROXY_UID, DEFAULT_SANDBOX_SERVICE_ACCOUNT_NAME, + DEFAULT_WORKSPACE_STORAGE_SIZE, KubernetesComputeConfig, SupervisorSideloadMethod, + SupervisorTopology, }; pub use driver::{KubernetesComputeDriver, KubernetesDriverError}; pub use grpc::ComputeDriverService; diff --git a/crates/openshell-driver-kubernetes/src/main.rs b/crates/openshell-driver-kubernetes/src/main.rs index f7eeeba42..1d70d7657 100644 --- a/crates/openshell-driver-kubernetes/src/main.rs +++ b/crates/openshell-driver-kubernetes/src/main.rs @@ -10,8 +10,8 @@ use tracing_subscriber::EnvFilter; use openshell_core::VERSION; use openshell_core::proto::compute::v1::compute_driver_server::ComputeDriverServer; use openshell_driver_kubernetes::{ - AppArmorProfile, ComputeDriverService, DEFAULT_SANDBOX_SERVICE_ACCOUNT_NAME, - KubernetesComputeConfig, KubernetesComputeDriver, SupervisorSideloadMethod, + AppArmorProfile, ComputeDriverService, DEFAULT_PROXY_UID, DEFAULT_SANDBOX_SERVICE_ACCOUNT_NAME, + KubernetesComputeConfig, KubernetesComputeDriver, SupervisorSideloadMethod, SupervisorTopology, }; #[derive(Parser, Debug)] @@ -80,6 +80,16 @@ struct Args { )] supervisor_sideload_method: SupervisorSideloadMethod, + #[arg( + long, + env = 
"OPENSHELL_SUPERVISOR_TOPOLOGY", + default_value = "combined" + )] + supervisor_topology: SupervisorTopology, + + #[arg(long, env = "OPENSHELL_PROXY_UID", default_value_t = DEFAULT_PROXY_UID)] + proxy_uid: u32, + #[arg(long, env = "OPENSHELL_ENABLE_USER_NAMESPACES")] enable_user_namespaces: bool, @@ -117,6 +127,8 @@ async fn main() -> Result<()> { .unwrap_or_else(|| openshell_core::config::DEFAULT_SUPERVISOR_IMAGE.to_string()), supervisor_image_pull_policy: args.supervisor_image_pull_policy.unwrap_or_default(), supervisor_sideload_method: args.supervisor_sideload_method, + supervisor_topology: args.supervisor_topology, + proxy_uid: args.proxy_uid, grpc_endpoint: args.grpc_endpoint.unwrap_or_default(), ssh_socket_path: args.sandbox_ssh_socket_path, client_tls_secret_name: args.client_tls_secret_name.unwrap_or_default(), @@ -135,6 +147,8 @@ async fn main() -> Result<()> { provider_spiffe_workload_api_socket_path: args .provider_spiffe_workload_api_socket_path .unwrap_or_default(), + sandbox_uid: None, + sandbox_gid: None, }) .await .into_diagnostic()?; diff --git a/crates/openshell-driver-podman/README.md b/crates/openshell-driver-podman/README.md index c0c84132b..8ff778de2 100644 --- a/crates/openshell-driver-podman/README.md +++ b/crates/openshell-driver-podman/README.md @@ -128,8 +128,8 @@ sequenceDiagram C->>C: entrypoint: /opt/openshell/bin/openshell-sandbox ``` -The supervisor image from `deploy/docker/Dockerfile.supervisor` copies the static -`openshell-sandbox` binary to `/openshell-sandbox`. +The supervisor image from `deploy/docker/Dockerfile.supervisor` provides the +static `openshell-sandbox` binary at `/openshell-sandbox`. Mounting that image at `/opt/openshell/bin` makes the binary available as `/opt/openshell/bin/openshell-sandbox`. 
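The path mapping described in the README text above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: the helper name `mounted_path` and its parameters are hypothetical and do not appear in the driver source. The one detail worth showing is that `Path::join` discards the base when the joined path is absolute, so the leading slash of the in-image path has to be stripped first.

```rust
use std::path::{Path, PathBuf};

/// Map a file inside a mounted image to the path the container sees.
///
/// `mount_point` is where the image filesystem is mounted and
/// `path_in_image` is the file's path inside the image. Both names are
/// hypothetical and chosen for illustration only.
pub fn mounted_path(mount_point: &str, path_in_image: &str) -> PathBuf {
    // `Path::join` replaces the base when given an absolute path, so strip
    // any leading slashes before joining.
    let relative = path_in_image.trim_start_matches('/');
    Path::new(mount_point).join(relative)
}
```

Calling `mounted_path("/opt/openshell/bin", "/openshell-sandbox")` yields `/opt/openshell/bin/openshell-sandbox`, which matches the location the README gives for the side-loaded binary.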
diff --git a/crates/openshell-driver-podman/src/container.rs b/crates/openshell-driver-podman/src/container.rs index 66d0d9d90..60e9f07a7 100644 --- a/crates/openshell-driver-podman/src/container.rs +++ b/crates/openshell-driver-podman/src/container.rs @@ -839,9 +839,8 @@ pub fn build_container_spec_with_token_and_gpu_devices( // Side-load the supervisor binary from a standalone OCI image. // Podman resolves image_volumes at the libpod layer, mounting the // image's filesystem at the destination path without starting a - // container from it. The supervisor image is FROM scratch with just - // the binary at /openshell-sandbox, so it appears at - // /opt/openshell/bin/openshell-sandbox. + // container from it. The supervisor image exposes the binary at + // /openshell-sandbox, so it appears at /opt/openshell/bin/openshell-sandbox. image_volumes, hostname: format!("sandbox-{}", sandbox.name), // Override the image's ENTRYPOINT so the supervisor binary runs diff --git a/crates/openshell-driver-vm/src/driver.rs b/crates/openshell-driver-vm/src/driver.rs index d5d9565e7..643d31834 100644 --- a/crates/openshell-driver-vm/src/driver.rs +++ b/crates/openshell-driver-vm/src/driver.rs @@ -207,7 +207,7 @@ enum GuestImagePayloadSource { LocalDocker { rootfs_archive: PathBuf }, } -#[derive(Debug, Clone)] +#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)] pub struct VmDriverConfig { pub openshell_endpoint: String, pub state_dir: PathBuf, @@ -225,8 +225,19 @@ pub struct VmDriverConfig { pub gpu_enabled: bool, pub gpu_mem_mib: u32, pub gpu_vcpus: u8, + /// Resolved sandbox UID for rootfs `/etc/passwd` entry. + /// When empty, defaults to 10001 (the legacy hardcoded value). + #[serde(default, skip_serializing_if = "Option::is_none")] + pub sandbox_uid: Option<u32>, + /// Resolved sandbox GID for rootfs `/etc/passwd` and `/etc/group` entries. + /// When empty, defaults to the resolved UID. 
+ #[serde(default, skip_serializing_if = "Option::is_none")] + pub sandbox_gid: Option<u32>, } +/// Default sandbox UID used by the VM driver when no config value is set. +pub const DEFAULT_SANDBOX_UID: u32 = 10001; + impl Default for VmDriverConfig { fn default() -> Self { Self { @@ -246,11 +257,23 @@ impl Default for VmDriverConfig { gpu_enabled: false, gpu_mem_mib: 8192, gpu_vcpus: 4, + sandbox_uid: None, + sandbox_gid: None, } } } impl VmDriverConfig { + /// Resolve the sandbox UID, falling back to `DEFAULT_SANDBOX_UID`. + pub fn resolve_sandbox_uid(&self) -> u32 { + self.sandbox_uid.unwrap_or(DEFAULT_SANDBOX_UID) + } + + /// Resolve the sandbox GID, falling back to the resolved UID. + pub fn resolve_sandbox_gid(&self, resolved_uid: u32) -> u32 { + self.sandbox_gid.unwrap_or(resolved_uid) + } + fn requires_tls_materials(&self) -> bool { self.openshell_endpoint.starts_with("https://") } @@ -2545,14 +2568,19 @@ impl VmDriver { let image_identity_owned = image_identity.to_string(); let exported_rootfs_for_build = exported_rootfs.clone(); let prepared_rootfs_for_build = prepared_rootfs.clone(); + let sandbox_user_id = self.config.resolve_sandbox_uid(); + let sandbox_group_id = self.config.resolve_sandbox_gid(sandbox_user_id); self.publish_vm_progress( sandbox_id, "PreparingRootfs", - format!("Preparing VM rootfs for local image \"{image_ref}\""), + format!( + "Preparing VM rootfs for local image \"{image_ref}\" (sandbox uid={sandbox_user_id})" + ), HashMap::from([ ("image_ref".to_string(), image_ref.to_string()), ("image_source".to_string(), "local_docker".to_string()), ("image_identity".to_string(), image_identity.to_string()), + ("sandbox_uid".to_string(), sandbox_user_id.to_string()), ]), ); let prepare_result = tokio::task::spawn_blocking(move || { @@ -2560,6 +2588,8 @@ impl VmDriver { prepare_sandbox_rootfs_from_image_root( &prepared_rootfs_for_build, &image_identity_owned, + sandbox_user_id, + sandbox_group_id, ) .map_err(|err| { format!("vm sandbox image 
'{image_ref_owned}' is not base-compatible: {err}") @@ -2678,20 +2708,27 @@ impl VmDriver { let image_ref_owned = image_ref.to_string(); let image_identity_owned = image_identity.to_string(); let prepared_rootfs_for_build = prepared_rootfs.clone(); + let sandbox_user_id = self.config.resolve_sandbox_uid(); + let sandbox_group_id = self.config.resolve_sandbox_gid(sandbox_user_id); self.publish_vm_progress( sandbox_id, "PreparingRootfs", - format!("Preparing VM rootfs for image \"{image_ref}\""), + format!( + "Preparing VM rootfs for image \"{image_ref}\" (sandbox uid={sandbox_user_id})" + ), HashMap::from([ ("image_ref".to_string(), image_ref.to_string()), ("image_source".to_string(), "registry".to_string()), ("image_identity".to_string(), image_identity.to_string()), + ("sandbox_uid".to_string(), sandbox_user_id.to_string()), ]), ); let prepare_result = tokio::task::spawn_blocking(move || { prepare_sandbox_rootfs_from_image_root( &prepared_rootfs_for_build, &image_identity_owned, + sandbox_user_id, + sandbox_group_id, ) .map_err(|err| { format!("vm sandbox image '{image_ref_owned}' is not base-compatible: {err}") diff --git a/crates/openshell-driver-vm/src/main.rs b/crates/openshell-driver-vm/src/main.rs index 57db7b64b..17718f952 100644 --- a/crates/openshell-driver-vm/src/main.rs +++ b/crates/openshell-driver-vm/src/main.rs @@ -214,6 +214,8 @@ async fn main() -> Result<()> { gpu_enabled: args.gpu, gpu_mem_mib: args.gpu_mem_mib, gpu_vcpus: args.gpu_vcpus, + sandbox_uid: None, + sandbox_gid: None, }) .await .map_err(|err| miette::miette!("{err}"))?; diff --git a/crates/openshell-driver-vm/src/rootfs.rs b/crates/openshell-driver-vm/src/rootfs.rs index d59e7b4b9..a2499d806 100644 --- a/crates/openshell-driver-vm/src/rootfs.rs +++ b/crates/openshell-driver-vm/src/rootfs.rs @@ -29,8 +29,10 @@ pub const fn sandbox_guest_init_path() -> &'static str { pub fn prepare_sandbox_rootfs_from_image_root( rootfs: &Path, image_identity: &str, + sandbox_user_id: u32, + 
sandbox_group_id: u32, ) -> Result<(), String> { - prepare_sandbox_rootfs(rootfs)?; + prepare_sandbox_rootfs(rootfs, sandbox_user_id, sandbox_group_id)?; validate_sandbox_rootfs(rootfs)?; fs::write( rootfs.join(ROOTFS_VARIANT_MARKER), @@ -348,7 +350,11 @@ fn append_symlink_to_archive( .map_err(|e| format!("append symlink {}: {e}", source_path.display())) } -fn prepare_sandbox_rootfs(rootfs: &Path) -> Result<(), String> { +fn prepare_sandbox_rootfs( + rootfs: &Path, + sandbox_user_id: u32, + sandbox_group_id: u32, +) -> Result<(), String> { for relative in ["opt/openshell/.initialized", "opt/openshell/.rootfs-type"] { remove_rootfs_path(rootfs, relative)?; } @@ -377,7 +383,7 @@ fn prepare_sandbox_rootfs(rootfs: &Path) -> Result<(), String> { fs::create_dir_all(&opt_dir).map_err(|e| format!("create {}: {e}", opt_dir.display()))?; fs::write(opt_dir.join(".rootfs-type"), "sandbox\n") .map_err(|e| format!("write sandbox rootfs marker: {e}"))?; - ensure_sandbox_guest_user(rootfs)?; + ensure_sandbox_guest_user(rootfs, sandbox_user_id, sandbox_group_id)?; create_sandbox_mountpoint(&rootfs.join("sandbox"))?; create_sandbox_mountpoint(&rootfs.join("image-cache"))?; create_sandbox_mountpoint(&rootfs.join("lower"))?; @@ -752,16 +758,17 @@ fn temporary_injection_path(image_path: &Path) -> PathBuf { )) } -fn ensure_sandbox_guest_user(rootfs: &Path) -> Result<(), String> { - const SANDBOX_UID: u32 = 10001; - const SANDBOX_GID: u32 = 10001; - +fn ensure_sandbox_guest_user( + rootfs: &Path, + sandbox_user_id: u32, + sandbox_group_id: u32, +) -> Result<(), String> { let etc_dir = rootfs.join("etc"); fs::create_dir_all(&etc_dir).map_err(|e| format!("create {}: {e}", etc_dir.display()))?; ensure_line_in_file( &etc_dir.join("group"), - &format!("sandbox:x:{SANDBOX_GID}:"), + &format!("sandbox:x:{sandbox_group_id}:"), |line| line.starts_with("sandbox:"), )?; ensure_line_in_file(&etc_dir.join("gshadow"), "sandbox:!::", |line| { @@ -769,7 +776,9 @@ fn ensure_sandbox_guest_user(rootfs: 
&Path) -> Result<(), String> { })?; ensure_line_in_file( &etc_dir.join("passwd"), - &format!("sandbox:x:{SANDBOX_UID}:{SANDBOX_GID}:OpenShell Sandbox:/sandbox:/bin/bash"), + &format!( + "sandbox:x:{sandbox_user_id}:{sandbox_group_id}:OpenShell Sandbox:/sandbox:/bin/bash" + ), |line| line.starts_with("sandbox:"), )?; ensure_line_in_file( @@ -936,7 +945,9 @@ mod tests { fs::write(rootfs.join("bin/sed"), b"sed").expect("write sed"); fs::write(rootfs.join("sbin/ip"), b"ip").expect("write ip"); - prepare_sandbox_rootfs(&rootfs).expect("prepare sandbox rootfs"); + // Use a non-standard UID so the test doesn't collide with the default. + let uid = 20001; + prepare_sandbox_rootfs(&rootfs, uid, uid).expect("prepare sandbox rootfs"); validate_sandbox_rootfs(&rootfs).expect("validate sandbox rootfs"); assert!(rootfs.join("srv/openshell-vm-sandbox-init.sh").is_file()); @@ -955,12 +966,14 @@ mod tests { assert!( fs::read_to_string(rootfs.join("etc/passwd")) .expect("read passwd") - .contains("sandbox:x:10001:10001:OpenShell Sandbox:/sandbox:/bin/bash") + .contains(&format!( + "sandbox:x:{uid}:{uid}:OpenShell Sandbox:/sandbox:/bin/bash" + )) ); assert!( fs::read_to_string(rootfs.join("etc/group")) .expect("read group") - .contains("sandbox:x:10001:") + .contains(&format!("sandbox:x:{uid}:")) ); assert_eq!( fs::read_to_string(rootfs.join("etc/hosts")).expect("read hosts"), @@ -980,7 +993,7 @@ mod tests { fs::create_dir_all(rootfs.join("sandbox")).expect("create sandbox workdir"); fs::write(rootfs.join("sandbox/app.py"), "print('hello')\n").expect("write app"); - prepare_sandbox_rootfs(&rootfs).expect("prepare sandbox rootfs"); + prepare_sandbox_rootfs(&rootfs, 10001, 10001).expect("prepare sandbox rootfs"); assert!(rootfs.join("sandbox").is_dir()); assert_eq!( diff --git a/crates/openshell-policy/src/lib.rs b/crates/openshell-policy/src/lib.rs index 9d5dc5b25..f1721146e 100644 --- a/crates/openshell-policy/src/lib.rs +++ b/crates/openshell-policy/src/lib.rs @@ -917,6 +917,41 @@ 
fn from_proto(policy: &SandboxPolicy) -> PolicyFile { } } +// --------------------------------------------------------------------------- +// Sandbox UID/GID constants +// --------------------------------------------------------------------------- + +/// Minimum accepted UID for sandbox process identity. +/// UIDs below this are reserved for system users and are rejected. +pub const MIN_SANDBOX_UID: u32 = 1000; + +/// Maximum accepted UID for sandbox process identity. +/// UIDs above this exceed typical OS limits and are rejected. +pub const MAX_SANDBOX_UID: u32 = 2_000_000_000; + +/// The literal string value accepted as a valid sandbox user/group name. +const SANDBOX_NAME: &str = "sandbox"; + +/// Validate whether a process identity field value is acceptable. +/// +/// Accepts either the literal `"sandbox"` or a numeric UID/GID parsed as +/// `u32` within the range `[MIN_SANDBOX_UID, MAX_SANDBOX_UID]`. +/// +/// Rejects: +/// - The empty string (callers should use `ensure_sandbox_process_identity` +/// to fill defaults before validation) +/// - UID 0 or values below `MIN_SANDBOX_UID` +/// - Values above `MAX_SANDBOX_UID` +/// - Non-numeric strings other than `"sandbox"` (e.g. 
`"root"`, `"nobody"`) +pub fn is_valid_sandbox_identity(value: &str) -> bool { + if value == SANDBOX_NAME { + return true; + } + value + .parse::<u32>() + .is_ok_and(|uid| (MIN_SANDBOX_UID..=MAX_SANDBOX_UID).contains(&uid)) +} + // --------------------------------------------------------------------------- // Public API // --------------------------------------------------------------------------- @@ -1090,7 +1125,10 @@ impl fmt::Display for PolicyViolation { fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { match self { Self::InvalidProcessIdentity { field, value } => { - write!(f, "{field} must be 'sandbox', got '{value}'") + write!( + f, + "{field} must be 'sandbox' or a numeric UID/GID in range [{MIN_SANDBOX_UID}, {MAX_SANDBOX_UID}], got '{value}'" + ) } Self::PathTraversal { path } => { write!(f, "path contains '..' traversal component: {path}") @@ -1168,17 +1206,18 @@ pub fn validate_sandbox_policy( ) -> std::result::Result<(), Vec<PolicyViolation>> { let mut violations = Vec::new(); - // Check process identity — must be "sandbox". + // Check process identity — must be "sandbox" or a numeric UID/GID + // within the acceptable sandbox range. // `ensure_sandbox_process_identity` should be called before this to - // fill in defaults; anything other than "sandbox" is rejected. + // fill in defaults; any invalid value is rejected.
if let Some(ref process) = policy.process { - if process.run_as_user != "sandbox" { + if !is_valid_sandbox_identity(&process.run_as_user) { violations.push(PolicyViolation::InvalidProcessIdentity { field: "run_as_user", value: process.run_as_user.clone(), }); } - if process.run_as_group != "sandbox" { + if !is_valid_sandbox_identity(&process.run_as_group) { violations.push(PolicyViolation::InvalidProcessIdentity { field: "run_as_group", value: process.run_as_group.clone(), @@ -2031,6 +2070,180 @@ network_policies: assert!(s.contains("sandbox")); } + // ---- is_valid_sandbox_identity tests ---- + + #[test] + fn valid_identity_accepts_sandbox() { + assert!(is_valid_sandbox_identity("sandbox")); + } + + #[test] + fn valid_identity_accepts_numeric_uid_in_range() { + assert!(is_valid_sandbox_identity("1000")); + assert!(is_valid_sandbox_identity("50000")); + assert!(is_valid_sandbox_identity("1000660000")); + } + + #[test] + fn valid_identity_accepts_boundary_uids() { + assert!(is_valid_sandbox_identity(&MIN_SANDBOX_UID.to_string())); + assert!(is_valid_sandbox_identity(&MAX_SANDBOX_UID.to_string())); + } + + #[test] + fn valid_identity_rejects_zero() { + assert!(!is_valid_sandbox_identity("0")); + } + + #[test] + fn valid_identity_rejects_system_uids_below_min() { + assert!(!is_valid_sandbox_identity("999")); + assert!(!is_valid_sandbox_identity("100")); + assert!(!is_valid_sandbox_identity("1")); + } + + #[test] + fn valid_identity_rejects_uid_above_max() { + assert!(!is_valid_sandbox_identity( + &MAX_SANDBOX_UID.saturating_add(1).to_string() + )); + } + + #[test] + fn valid_identity_rejects_non_numeric_names() { + assert!(!is_valid_sandbox_identity("root")); + assert!(!is_valid_sandbox_identity("nobody")); + assert!(!is_valid_sandbox_identity("user")); + } + + #[test] + fn valid_identity_rejects_empty_string() { + assert!(!is_valid_sandbox_identity("")); + } + + // ---- Policy validation with numeric UIDs ---- + + #[test] + fn validate_accepts_numeric_uid_in_range() 
{ + let policy = SandboxPolicy { + version: 1, + process: Some(ProcessPolicy { + run_as_user: "1000".into(), + run_as_group: "5000".into(), + }), + filesystem: None, + landlock: None, + network_policies: HashMap::new(), + }; + assert!(validate_sandbox_policy(&policy).is_ok()); + } + + #[test] + fn validate_accepts_boundary_uids() { + let policy = SandboxPolicy { + version: 1, + process: Some(ProcessPolicy { + run_as_user: MIN_SANDBOX_UID.to_string(), + run_as_group: MAX_SANDBOX_UID.to_string(), + }), + filesystem: None, + landlock: None, + network_policies: HashMap::new(), + }; + assert!(validate_sandbox_policy(&policy).is_ok()); + } + + #[test] + fn validate_rejects_uid_out_of_range_low() { + let mut policy = restrictive_default_policy(); + policy.process = Some(ProcessPolicy { + run_as_user: "500".into(), + run_as_group: "sandbox".into(), + }); + let violations = validate_sandbox_policy(&policy).unwrap_err(); + assert!(violations.iter().any(|v| matches!( + v, + PolicyViolation::InvalidProcessIdentity { + field: "run_as_user", + .. + } + ))); + } + + #[test] + fn validate_rejects_uid_out_of_range_high() { + let mut policy = restrictive_default_policy(); + policy.process = Some(ProcessPolicy { + run_as_user: (MAX_SANDBOX_UID + 1).to_string(), + run_as_group: "sandbox".into(), + }); + let violations = validate_sandbox_policy(&policy).unwrap_err(); + assert!(violations.iter().any(|v| matches!( + v, + PolicyViolation::InvalidProcessIdentity { + field: "run_as_user", + .. + } + ))); + } + + #[test] + fn validate_rejects_root_string() { + let mut policy = restrictive_default_policy(); + policy.process = Some(ProcessPolicy { + run_as_user: "root".into(), + run_as_group: "sandbox".into(), + }); + let violations = validate_sandbox_policy(&policy).unwrap_err(); + assert!(violations.iter().any(|v| matches!( + v, + PolicyViolation::InvalidProcessIdentity { + field: "run_as_user", + .. 
+ } + ))); + } + + #[test] + fn validate_rejects_nobody_string() { + let mut policy = restrictive_default_policy(); + policy.process = Some(ProcessPolicy { + run_as_user: "nobody".into(), + run_as_group: "nogroup".into(), + }); + let violations = validate_sandbox_policy(&policy).unwrap_err(); + assert_eq!(violations.len(), 2); + } + + #[test] + fn validate_accepts_mixed_sandbox_name_and_uid() { + // run_as_user as "sandbox" name, run_as_group as numeric UID + let policy = SandboxPolicy { + version: 1, + process: Some(ProcessPolicy { + run_as_user: "sandbox".into(), + run_as_group: "1000".into(), + }), + filesystem: None, + landlock: None, + network_policies: HashMap::new(), + }; + assert!(validate_sandbox_policy(&policy).is_ok()); + } + + #[test] + fn policy_violation_display_includes_range() { + let v = PolicyViolation::InvalidProcessIdentity { + field: "run_as_user", + value: "root".into(), + }; + let s = format!("{v}"); + assert!(s.contains("sandbox")); + assert!(s.contains(&MIN_SANDBOX_UID.to_string())); + assert!(s.contains(&MAX_SANDBOX_UID.to_string())); + assert!(s.contains("root")); + } + // ---- Multi-port and host wildcard tests ---- #[test] diff --git a/crates/openshell-sandbox/Cargo.toml b/crates/openshell-sandbox/Cargo.toml index 086dbe02c..a5d344910 100644 --- a/crates/openshell-sandbox/Cargo.toml +++ b/crates/openshell-sandbox/Cargo.toml @@ -33,6 +33,9 @@ clap = { workspace = true } # Error handling miette = { workspace = true } +# Unix ownership for Kubernetes sidecar init setup +nix = { workspace = true } + # TLS crypto provider install (main.rs) rustls = { workspace = true } diff --git a/crates/openshell-sandbox/src/lib.rs b/crates/openshell-sandbox/src/lib.rs index d5967d1f3..cb402f515 100644 --- a/crates/openshell-sandbox/src/lib.rs +++ b/crates/openshell-sandbox/src/lib.rs @@ -13,10 +13,10 @@ mod mechanistic_mapper; #[cfg_attr(not(target_os = "linux"), allow(dead_code))] mod metadata_server; -use miette::Result; +use miette::{IntoDiagnostic, 
Result}; use std::future::Future; use std::sync::Arc; -use std::sync::atomic::AtomicU32; +use std::sync::atomic::{AtomicU32, Ordering}; use std::time::Duration; use tracing::{debug, info, warn}; @@ -64,12 +64,23 @@ use openshell_core::denial::DenialEvent; use openshell_core::policy::{NetworkMode, NetworkPolicy, ProxyPolicy, SandboxPolicy}; use openshell_core::provider_credentials::ProviderCredentialState; use openshell_supervisor_network::opa::OpaEngine; +use openshell_supervisor_process::process::ProcessEnforcementMode; pub use openshell_supervisor_process::process::{ProcessHandle, ProcessStatus}; use openshell_supervisor_process::skills; +use tokio::io::copy_bidirectional; +use tokio::net::{TcpListener, TcpStream}; use tokio::sync::mpsc::UnboundedSender; #[cfg(target_os = "linux")] use tokio::time::timeout; +const SIDECAR_NETWORK_ENFORCEMENT_MODE: &str = "sidecar-nftables"; +const PROXY_POD_NETWORK_ENFORCEMENT_MODE: &str = "proxy-pod"; +const SIDECAR_TLS_DIR: &str = "/etc/openshell-tls/proxy"; +const SIDECAR_CA_CERT: &str = "openshell-ca.pem"; +const SIDECAR_CA_BUNDLE: &str = "ca-bundle.pem"; +const SIDECAR_PROCESS_PROXY_ADDR: &str = "127.0.0.1:3128"; +const SIDECAR_READY_TIMEOUT_SECS: u64 = 120; + /// Run a command in the sandbox. 
/// /// # Errors @@ -125,6 +136,24 @@ pub async fn run_sandbox( } } + let external_network_enforcement = external_network_enforcement_enabled(); + let sidecar_network_enforcement = sidecar_network_enforcement_enabled(); + let process_enforcement_mode = process_enforcement_mode(); + let sidecar_ready_file = supervisor_ready_file(); + let supervisor_ready_addr = supervisor_ready_addr(); + if process_enabled + && !network_enabled + && let Some(path) = sidecar_ready_file.as_deref() + { + wait_for_supervisor_ready(path).await?; + } + if process_enabled + && !network_enabled + && let Some(addr) = supervisor_ready_addr.as_deref() + { + wait_for_supervisor_ready_addr(addr).await?; + } + // Load policy and initialize OPA engine let openshell_endpoint_for_proxy = openshell_endpoint.clone(); let sandbox_name_for_agg = sandbox.clone(); @@ -218,6 +247,12 @@ pub async fn run_sandbox( // Shared PID: set after process spawn so the proxy can look up // the entrypoint process's /proc/net/tcp for identity binding. let entrypoint_pid = Arc::new(AtomicU32::new(0)); + if network_enabled + && !process_enabled + && let Some(path) = entrypoint_pid_file() + { + spawn_entrypoint_pid_file_watcher(path, entrypoint_pid.clone()); + } // Create the workload's network namespace. It is shared infrastructure: // the proxy binds to its host-side veth IP, the bypass monitor reads @@ -225,7 +260,7 @@ pub async fn run_sandbox( // it via setns(). The RAII handle lives in this frame for the duration // of the sandbox. #[cfg(target_os = "linux")] - let netns = if network_enabled { + let netns = if network_enabled && !external_network_enforcement { openshell_supervisor_process::netns::create_netns_for_proxy(&policy)? 
} else { None @@ -295,6 +330,34 @@ pub async fn run_sandbox( None }; + let _gateway_forward = if network_enabled && external_network_enforcement { + let endpoint = openshell_endpoint_for_proxy.as_deref().ok_or_else(|| { + miette::miette!("external network enforcement requires an OpenShell gateway endpoint") + })?; + Some(start_gateway_forward_from_env(endpoint).await?) + } else { + None + }; + + #[cfg(target_os = "linux")] + if network_enabled && external_network_enforcement { + if !matches!(policy.network.mode, NetworkMode::Proxy) { + return Err(miette::miette!( + "external network enforcement requires proxy network mode" + )); + } + if let Some(path) = sidecar_ready_file.as_deref() { + write_supervisor_ready(path)?; + } + } + + #[cfg(not(target_os = "linux"))] + if network_enabled && external_network_enforcement { + return Err(miette::miette!( + "external network enforcement is only supported on Linux" + )); + } + // Spawn the denial-aggregator flush task. The aggregator drains denial // events from the proxy + bypass monitor, batches them, and ships // summaries to the gateway via `SubmitPolicyAnalysis`. 
@@ -445,8 +508,17 @@ pub async fn run_sandbox( } } + let process_policy = process_policy_for_topology(&policy, sidecar_network_enforcement)?; + let exit_code = if process_enabled { - let ca_file_paths = networking.as_ref().and_then(|n| n.ca_file_paths.clone()); + let ca_file_paths = networking + .as_ref() + .and_then(|n| n.ca_file_paths.clone()) + .or_else(|| { + external_network_enforcement + .then(sidecar_ca_file_paths) + .flatten() + }); openshell_supervisor_process::run::run_process( program, @@ -457,7 +529,8 @@ pub async fn run_sandbox( sandbox_id.as_deref(), openshell_endpoint.as_deref(), ssh_socket_path, - &policy, + &process_policy, + process_enforcement_mode, entrypoint_pid, provider_credentials, provider_env, @@ -518,6 +591,240 @@ async fn wait_for_shutdown_signal() { } } +fn sidecar_network_enforcement_enabled() -> bool { + std::env::var(openshell_core::sandbox_env::NETWORK_ENFORCEMENT_MODE) + .is_ok_and(|value| value == SIDECAR_NETWORK_ENFORCEMENT_MODE) +} + +fn external_network_enforcement_enabled() -> bool { + std::env::var(openshell_core::sandbox_env::NETWORK_ENFORCEMENT_MODE).is_ok_and(|value| { + matches!( + value.as_str(), + SIDECAR_NETWORK_ENFORCEMENT_MODE | PROXY_POD_NETWORK_ENFORCEMENT_MODE + ) + }) +} + +fn process_enforcement_mode() -> ProcessEnforcementMode { + match std::env::var(openshell_core::sandbox_env::PROCESS_ENFORCEMENT_MODE) + .unwrap_or_else(|_| "full".to_string()) + .as_str() + { + "network-only" => ProcessEnforcementMode::NetworkOnly, + _ => ProcessEnforcementMode::Full, + } +} + +fn supervisor_ready_file() -> Option<String> { + std::env::var(openshell_core::sandbox_env::SUPERVISOR_READY_FILE) + .ok() + .filter(|value| !value.is_empty()) +} + +fn supervisor_ready_addr() -> Option<String> { + std::env::var(openshell_core::sandbox_env::SUPERVISOR_READY_ADDR) + .ok() + .filter(|value| !value.is_empty()) +} + +fn entrypoint_pid_file() -> Option<String> { + std::env::var(openshell_core::sandbox_env::ENTRYPOINT_PID_FILE) + .ok() + .filter(|value|
!value.is_empty()) +} + +fn spawn_entrypoint_pid_file_watcher(path: String, entrypoint_pid: Arc<AtomicU32>) { + tokio::spawn(async move { + let pid_path = std::path::PathBuf::from(&path); + loop { + match std::fs::read_to_string(&pid_path) { + Ok(contents) => match contents.trim().parse::<u32>() { + Ok(pid) if pid > 0 => { + entrypoint_pid.store(pid, Ordering::Release); + info!(path, pid, "Loaded sidecar workload entrypoint PID"); + return; + } + Ok(_) | Err(_) => { + debug!(path, contents = %contents.trim(), "Ignoring invalid entrypoint PID file contents"); + } + }, + Err(err) if err.kind() == std::io::ErrorKind::NotFound => {} + Err(err) => { + debug!(path, error = %err, "Failed to read entrypoint PID file"); + } + } + tokio::time::sleep(Duration::from_millis(100)).await; + } + }); +} + +async fn wait_for_supervisor_ready(path: &str) -> Result<()> { + let ready_path = std::path::Path::new(path); + let deadline = tokio::time::Instant::now() + Duration::from_secs(SIDECAR_READY_TIMEOUT_SECS); + loop { + if ready_path.exists() { + info!(path, "Network supervisor sidecar is ready"); + return Ok(()); + } + if tokio::time::Instant::now() >= deadline { + return Err(miette::miette!( + "timed out waiting for network supervisor sidecar readiness file {path}" + )); + } + tokio::time::sleep(Duration::from_millis(250)).await; + } +} + +async fn wait_for_supervisor_ready_addr(addr: &str) -> Result<()> { + let deadline = tokio::time::Instant::now() + Duration::from_secs(SIDECAR_READY_TIMEOUT_SECS); + loop { + match TcpStream::connect(addr).await { + Ok(_) => { + info!(addr, "Network supervisor TCP endpoint is ready"); + return Ok(()); + } + Err(err) if tokio::time::Instant::now() >= deadline => { + return Err(miette::miette!( + "timed out waiting for network supervisor TCP endpoint {addr}: {err}" + )); + } + Err(_) => { + tokio::time::sleep(Duration::from_millis(250)).await; + } + } + } +} + +#[cfg(target_os = "linux")] +fn write_supervisor_ready(path: &str) -> Result<()> { + let ready_path =
std::path::Path::new(path); + if let Some(parent) = ready_path.parent() { + std::fs::create_dir_all(parent).into_diagnostic()?; + } + std::fs::write(ready_path, b"ready\n").into_diagnostic()?; + info!(path, "Network supervisor sidecar readiness file written"); + Ok(()) +} + +fn sidecar_ca_file_paths() -> Option<(std::path::PathBuf, std::path::PathBuf)> { + let tls_dir = std::env::var(openshell_core::sandbox_env::PROXY_TLS_DIR) + .unwrap_or_else(|_| SIDECAR_TLS_DIR.to_string()); + let cert = std::path::Path::new(&tls_dir).join(SIDECAR_CA_CERT); + let bundle = std::path::Path::new(&tls_dir).join(SIDECAR_CA_BUNDLE); + (cert.exists() && bundle.exists()).then_some((cert, bundle)) +} + +fn process_policy_for_topology( + policy: &SandboxPolicy, + sidecar_network_enforcement: bool, +) -> Result<SandboxPolicy> { + let mut process_policy = policy.clone(); + if sidecar_network_enforcement && matches!(process_policy.network.mode, NetworkMode::Proxy) { + let proxy = process_policy + .network + .proxy + .get_or_insert(ProxyPolicy { http_addr: None }); + if proxy.http_addr.is_none() { + proxy.http_addr = Some(SIDECAR_PROCESS_PROXY_ADDR.parse().into_diagnostic()?); + } + } + Ok(process_policy) +} + +struct GatewayForwardHandle { + task: tokio::task::JoinHandle<()>, +} + +impl Drop for GatewayForwardHandle { + fn drop(&mut self) { + self.task.abort(); + } +} + +async fn start_gateway_forward_from_env(endpoint: &str) -> Result<GatewayForwardHandle> { + let listen_addr = + std::env::var(openshell_core::sandbox_env::GATEWAY_FORWARD_ADDR).map_err(|_| { + miette::miette!( + "{} is required for sidecar gateway forwarding", + openshell_core::sandbox_env::GATEWAY_FORWARD_ADDR + ) + })?; + start_gateway_forward(&listen_addr, endpoint).await +} + +async fn start_gateway_forward(listen_addr: &str, endpoint: &str) -> Result<GatewayForwardHandle> { + let upstream = gateway_tcp_addr(endpoint)?; + let listener = TcpListener::bind(listen_addr).await.into_diagnostic()?; + info!( + listen_addr, + upstream, "Gateway loopback TCP forward started for sidecar
topology" + ); + + let task = tokio::spawn(async move { + loop { + let (mut inbound, peer) = match listener.accept().await { + Ok(accepted) => accepted, + Err(e) => { + warn!(error = %e, "Gateway forward accept failed"); + continue; + } + }; + let upstream = upstream.clone(); + tokio::spawn(async move { + let mut outbound = match TcpStream::connect(&upstream).await { + Ok(stream) => stream, + Err(e) => { + warn!(peer = %peer, upstream, error = %e, "Gateway forward connect failed"); + return; + } + }; + if let Err(e) = copy_bidirectional(&mut inbound, &mut outbound).await { + debug!(peer = %peer, error = %e, "Gateway forward connection closed with error"); + } + }); + } + }); + + Ok(GatewayForwardHandle { task }) +} + +fn gateway_tcp_addr(endpoint: &str) -> Result<String> { + let (scheme, rest) = endpoint + .split_once("://") + .ok_or_else(|| miette::miette!("gateway endpoint must include a URL scheme"))?; + let default_port = match scheme { + "http" => 80, + "https" => 443, + other => { + return Err(miette::miette!( + "unsupported gateway endpoint scheme '{other}' for sidecar forwarding" + )); + } + }; + let authority = rest.split('/').next().unwrap_or(rest); + if authority.is_empty() { + return Err(miette::miette!("gateway endpoint is missing a host")); + } + if authority.starts_with('[') { + let closing = authority + .find(']') + .ok_or_else(|| miette::miette!("invalid bracketed IPv6 gateway endpoint"))?; + let host = &authority[..=closing]; + let port = authority[closing + 1..] + .strip_prefix(':') + .and_then(|value| value.parse::<u16>().ok()) + .unwrap_or(default_port); + return Ok(format!("{host}:{port}")); + } + let (host, port) = match authority.rsplit_once(':') { + Some((host, port)) if !host.is_empty() => { + (host, port.parse::<u16>().unwrap_or(default_port)) + } + _ => (authority, default_port), + }; + Ok(format!("{host}:{port}")) +} + /// Flush aggregated denial summaries to the gateway via `SubmitPolicyAnalysis`.
async fn flush_proposals_to_gateway( endpoint: &str, @@ -1927,8 +2234,24 @@ fn format_setting_value(es: &openshell_core::proto::EffectiveSetting) -> String )] mod tests { use super::*; + use openshell_core::policy::{ + FilesystemPolicy, LandlockPolicy, NetworkMode, NetworkPolicy, ProcessPolicy, ProxyPolicy, + }; use std::sync::atomic::{AtomicBool, Ordering}; + fn proxy_policy(http_addr: Option) -> SandboxPolicy { + SandboxPolicy { + version: 1, + filesystem: FilesystemPolicy::default(), + network: NetworkPolicy { + mode: NetworkMode::Proxy, + proxy: Some(ProxyPolicy { http_addr }), + }, + landlock: LandlockPolicy::default(), + process: ProcessPolicy::default(), + } + } + fn effective_bool(value: bool) -> openshell_core::proto::EffectiveSetting { openshell_core::proto::EffectiveSetting { value: Some(openshell_core::proto::SettingValue { @@ -1940,6 +2263,73 @@ mod tests { } } + #[test] + fn sidecar_process_policy_sets_loopback_proxy_addr() { + let policy = proxy_policy(None); + + let process_policy = process_policy_for_topology(&policy, true).unwrap(); + + let http_addr = process_policy + .network + .proxy + .and_then(|proxy| proxy.http_addr) + .expect("sidecar process policy should set proxy address"); + assert_eq!(http_addr.to_string(), SIDECAR_PROCESS_PROXY_ADDR); + assert!( + policy + .network + .proxy + .as_ref() + .expect("original policy should keep proxy config") + .http_addr + .is_none(), + "process policy normalization must not mutate the network policy" + ); + } + + #[test] + fn non_sidecar_process_policy_preserves_proxy_addr() { + let policy = proxy_policy(None); + + let process_policy = process_policy_for_topology(&policy, false).unwrap(); + + assert!( + process_policy + .network + .proxy + .and_then(|proxy| proxy.http_addr) + .is_none() + ); + } + + #[test] + fn gateway_tcp_addr_uses_explicit_port() { + assert_eq!( + gateway_tcp_addr("https://openshell-gateway.openshell.svc:8080").unwrap(), + "openshell-gateway.openshell.svc:8080" + ); + } + + #[test] + 
fn gateway_tcp_addr_uses_scheme_default_port() { + assert_eq!( + gateway_tcp_addr("https://openshell-gateway.openshell.svc").unwrap(), + "openshell-gateway.openshell.svc:443" + ); + assert_eq!( + gateway_tcp_addr("http://openshell-gateway.openshell.svc").unwrap(), + "openshell-gateway.openshell.svc:80" + ); + } + + #[test] + fn gateway_tcp_addr_preserves_ipv6_brackets() { + assert_eq!( + gateway_tcp_addr("https://[fd00::1]:8443").unwrap(), + "[fd00::1]:8443" + ); + } + #[test] fn apply_ocsf_json_setting_enables_from_initial_settings_snapshot() { let enabled = AtomicBool::new(false); diff --git a/crates/openshell-sandbox/src/main.rs b/crates/openshell-sandbox/src/main.rs index 91b145c2e..f4b9676eb 100644 --- a/crates/openshell-sandbox/src/main.rs +++ b/crates/openshell-sandbox/src/main.rs @@ -35,15 +35,26 @@ const DEBUG_RPC_SUBCOMMAND: &str = "debug-rpc"; /// Default `--mode` value: run both supervisor leaves in a single binary. const DEFAULT_MODE: &str = "network,process"; +const SIDECAR_STATE_DIR: &str = "/run/openshell-sidecar"; +const SIDECAR_TLS_DIR: &str = "/etc/openshell-tls/proxy"; +#[cfg(target_os = "linux")] +const CLIENT_TLS_DIR: &str = "/etc/openshell-tls/client"; +#[cfg(target_os = "linux")] +const SIDECAR_CLIENT_TLS_SUBDIR: &str = "client"; +#[cfg(target_os = "linux")] +const CLIENT_TLS_FILES: [&str; 3] = ["ca.crt", "tls.crt", "tls.key"]; /// Which supervisor leaves are enabled in this process. /// /// Parsed from a comma-separated `--mode` value, e.g. `network`, -/// `process`, or `network,process`. At least one must be set. +/// `process`, or `network,process`. `network-init` is a one-shot setup mode +/// used by the Kubernetes sidecar topology and cannot be combined with other +/// mode components. At least one must be set. 
#[derive(Clone, Copy, Debug)] struct Mode { network: bool, process: bool, + network_init: bool, } impl std::str::FromStr for Mode { @@ -53,20 +64,27 @@ impl std::str::FromStr for Mode { let mut mode = Self { network: false, process: false, + network_init: false, }; for part in s.split(',').map(str::trim).filter(|p| !p.is_empty()) { match part { "network" => mode.network = true, "process" => mode.process = true, + "network-init" => mode.network_init = true, other => { return Err(format!( - "unknown mode component '{other}' (expected 'network' and/or 'process')" + "unknown mode component '{other}' (expected 'network', 'process', or 'network-init')" )); } } } - if !mode.network && !mode.process { - return Err("--mode must enable at least one of: network, process".into()); + if mode.network_init && (mode.network || mode.process) { + return Err("--mode=network-init cannot be combined with other components".into()); + } + if !mode.network && !mode.process && !mode.network_init { + return Err( + "--mode must enable at least one of: network, process, network-init".into(), + ); } Ok(mode) } @@ -149,9 +167,29 @@ struct Args { /// "network" and/or "process". Defaults to both (single-binary /// topology). Use --mode=network for a network-only sidecar, or /// --mode=process for a process-only supervisor when network - /// enforcement runs in another pod. + /// enforcement runs in another pod. Use --mode=network-init only in + /// the Kubernetes init container that prepares sidecar nftables. #[arg(long, default_value = DEFAULT_MODE)] mode: Mode, + + /// UID that the long-running Kubernetes network proxy will run as. + /// In sidecar topology, `--mode=network-init` installs nftables rules + /// that exempt this UID. + #[arg(long, env = "OPENSHELL_PROXY_UID", default_value_t = 1337)] + proxy_uid: u32, + + /// GID assigned to shared sidecar state directories. Defaults to + /// `--proxy-uid` when omitted. 
+ #[arg(long, env = "OPENSHELL_PROXY_GID")] + proxy_gid: Option<u32>, + + /// Shared state directory between the network init container and sidecar. + #[arg(long, env = "OPENSHELL_SIDECAR_STATE_DIR", default_value = SIDECAR_STATE_DIR)] + sidecar_state_dir: String, + + /// Shared TLS work directory between the network init container and sidecar. + #[arg(long, env = "OPENSHELL_PROXY_TLS_DIR", default_value = SIDECAR_TLS_DIR)] + sidecar_tls_dir: String, } /// Copy the running executable to `dest`, creating parent directories as @@ -194,6 +232,131 @@ fn copy_self(dest: &str) -> Result<()> { Ok(()) } +#[cfg(target_os = "linux")] +fn prepare_sidecar_directory(path: &Path, uid: u32, gid: u32, mode: u32) -> Result<()> { + use miette::Context as _; + use nix::unistd::{Gid, Uid, chown}; + use std::os::unix::fs::PermissionsExt; + + std::fs::create_dir_all(path) + .into_diagnostic() + .wrap_err_with(|| format!("failed to create sidecar directory {}", path.display()))?; + let mut perms = std::fs::metadata(path).into_diagnostic()?.permissions(); + perms.set_mode(mode); + std::fs::set_permissions(path, perms) + .into_diagnostic() + .wrap_err_with(|| format!("failed to chmod sidecar directory {}", path.display()))?; + chown(path, Some(Uid::from_raw(uid)), Some(Gid::from_raw(gid))) + .into_diagnostic() + .wrap_err_with(|| { + format!( + "failed to chown sidecar directory {} to {uid}:{gid}", + path.display() + ) + })?; + Ok(()) +} + +#[cfg(target_os = "linux")] +fn copy_sidecar_client_tls_if_present( + source_dir: &Path, + sidecar_tls_dir: &Path, + uid: u32, + gid: u32, +) -> Result<()> { + use miette::Context as _; + use nix::unistd::{Gid, Uid, chown}; + use std::os::unix::fs::PermissionsExt; + + if !source_dir.exists() { + return Ok(()); + } + + let dest_dir = sidecar_tls_dir.join(SIDECAR_CLIENT_TLS_SUBDIR); + prepare_sidecar_directory(&dest_dir, uid, gid, 0o750)?; + for file_name in CLIENT_TLS_FILES { + let source = source_dir.join(file_name); + if !source.exists() { + return
Err(miette::miette!( + "client TLS source file is missing: {}", + source.display() + )); + } + let dest = dest_dir.join(file_name); + std::fs::copy(&source, &dest) + .into_diagnostic() + .wrap_err_with(|| { + format!( + "failed to copy client TLS file {} to {}", + source.display(), + dest.display() + ) + })?; + let mut perms = std::fs::metadata(&dest).into_diagnostic()?.permissions(); + perms.set_mode(0o400); + std::fs::set_permissions(&dest, perms) + .into_diagnostic() + .wrap_err_with(|| { + format!("failed to chmod copied client TLS file {}", dest.display()) + })?; + chown(&dest, Some(Uid::from_raw(uid)), Some(Gid::from_raw(gid))) + .into_diagnostic() + .wrap_err_with(|| { + format!( + "failed to chown copied client TLS file {} to {uid}:{gid}", + dest.display() + ) + })?; + } + + Ok(()) +} + +#[cfg(target_os = "linux")] +fn run_network_init( + proxy_uid: u32, + proxy_gid: u32, + sidecar_state_dir: &str, + sidecar_tls_dir: &str, +) -> Result<()> { + if proxy_uid < openshell_policy::MIN_SANDBOX_UID { + return Err(miette::miette!( + "--proxy-uid must be at least {}", + openshell_policy::MIN_SANDBOX_UID + )); + } + if proxy_gid < openshell_policy::MIN_SANDBOX_UID { + return Err(miette::miette!( + "--proxy-gid must be at least {}", + openshell_policy::MIN_SANDBOX_UID + )); + } + + let sidecar_state_dir = Path::new(sidecar_state_dir); + let sidecar_tls_dir = Path::new(sidecar_tls_dir); + prepare_sidecar_directory(sidecar_state_dir, proxy_uid, proxy_gid, 0o775)?; + prepare_sidecar_directory(sidecar_tls_dir, proxy_uid, proxy_gid, 0o755)?; + copy_sidecar_client_tls_if_present( + Path::new(CLIENT_TLS_DIR), + sidecar_tls_dir, + proxy_uid, + proxy_gid, + )?; + openshell_supervisor_process::netns::install_sidecar_bypass_rules(proxy_uid) +} + +#[cfg(not(target_os = "linux"))] +fn run_network_init( + _proxy_uid: u32, + _proxy_gid: u32, + _sidecar_state_dir: &str, + _sidecar_tls_dir: &str, +) -> Result<()> { + Err(miette::miette!( + "--mode=network-init is only supported on 
Linux" + )) +} + fn main() -> Result<()> { // Handle `copy-self ` before clap so it works without any of the // sandbox flags. Kubernetes init containers invoke this path to seed an @@ -222,6 +385,16 @@ fn main() -> Result<()> { let args = Args::parse(); + if args.mode.network_init { + let proxy_gid = args.proxy_gid.unwrap_or(args.proxy_uid); + return run_network_init( + args.proxy_uid, + proxy_gid, + &args.sidecar_state_dir, + &args.sidecar_tls_dir, + ); + } + // Try to open a rolling log file; fall back to stderr-only logging if it fails // (e.g., /var/log is not writable in custom workload images). // Rotates daily, keeps the 3 most recent files to bound disk usage. @@ -421,4 +594,24 @@ mod tests { let final_path = dest_dir.join("openshell-sandbox"); assert!(final_path.exists(), "binary should land inside dest dir"); } + + #[test] + fn mode_parses_network_init_standalone() { + let mode = "network-init".parse::().unwrap(); + assert!(mode.network_init); + assert!(!mode.network); + assert!(!mode.process); + } + + #[test] + fn mode_rejects_combined_network_init() { + let err = "network-init,network".parse::().unwrap_err(); + assert!(err.contains("cannot be combined")); + } + + #[test] + fn mode_rejects_empty_value() { + let err = "".parse::().unwrap_err(); + assert!(err.contains("at least one")); + } } diff --git a/crates/openshell-server/src/auth/k8s_sa.rs b/crates/openshell-server/src/auth/k8s_sa.rs index eed0e5f08..f27b90067 100644 --- a/crates/openshell-server/src/auth/k8s_sa.rs +++ b/crates/openshell-server/src/auth/k8s_sa.rs @@ -5,8 +5,9 @@ //! //! Path-scoped to `IssueSandboxToken`. Validates a projected SA token //! presented by a sandbox pod, reads the pod's `openshell.io/sandbox-id` -//! annotation, verifies the pod is controlled by the corresponding Sandbox CR, -//! and returns a [`Principal::Sandbox`] with +//! annotation, verifies the pod is controlled by the corresponding Sandbox CR +//! 
either directly or through a supervisor Deployment controller chain, and +//! returns a [`Principal::Sandbox`] with //! [`SandboxIdentitySource::K8sServiceAccount`]. The `IssueSandboxToken` handler //! then mints a gateway-signed JWT for that sandbox id; subsequent gRPC calls //! from the supervisor use the gateway-minted JWT validated by @@ -19,10 +20,11 @@ use super::authenticator::Authenticator; use super::principal::{Principal, SandboxIdentitySource, SandboxPrincipal}; use async_trait::async_trait; use k8s_openapi::api::{ + apps::v1::{Deployment, ReplicaSet}, authentication::v1::{TokenReview, TokenReviewSpec, TokenReviewStatus, UserInfo}, core::v1::Pod, }; -use k8s_openapi::apimachinery::pkg::apis::meta::v1::ObjectMeta; +use k8s_openapi::apimachinery::pkg::apis::meta::v1::{ObjectMeta, OwnerReference}; use kube::Error as KubeError; use kube::api::{Api, ApiResource, PostParams}; use kube::core::{DynamicObject, gvk::GroupVersionKind}; @@ -45,7 +47,10 @@ const SANDBOX_API_VERSION_V1BETA1: &str = "v1beta1"; const SANDBOX_API_VERSION_V1ALPHA1: &str = "v1alpha1"; const SANDBOX_API_VERSION_FULL_V1BETA1: &str = "agents.x-k8s.io/v1beta1"; const SANDBOX_API_VERSION_FULL_V1ALPHA1: &str = "agents.x-k8s.io/v1alpha1"; +const APPS_API_VERSION_FULL_V1: &str = "apps/v1"; const SANDBOX_KIND: &str = "Sandbox"; +const REPLICA_SET_KIND: &str = "ReplicaSet"; +const DEPLOYMENT_KIND: &str = "Deployment"; const SANDBOX_ID_LABEL: &str = "openshell.ai/sandbox-id"; const POD_NAME_EXTRA: &str = "authentication.kubernetes.io/pod-name"; const POD_UID_EXTRA: &str = "authentication.kubernetes.io/pod-uid"; @@ -148,11 +153,21 @@ struct SandboxOwnerReference { uid: String, } +#[derive(Debug, Clone, PartialEq, Eq)] +struct ControllerOwnerReference { + api_version: String, + kind: String, + name: String, + uid: String, +} + /// Resolver backed by the apiserver's `TokenReview` API and `kube::Client` /// for the per-pod annotation lookup. 
pub struct LiveK8sResolver { token_reviews_api: Api, pods_api: Api, + replica_sets_api: Api, + deployments_api: Api, sandboxes_api_v1beta1: Api, sandboxes_api_v1alpha1: Api, expected_audience: String, @@ -169,6 +184,8 @@ impl LiveK8sResolver { ) -> Self { let token_reviews_api: Api = Api::all(client.clone()); let pods_api: Api = Api::namespaced(client.clone(), namespace); + let replica_sets_api: Api = Api::namespaced(client.clone(), namespace); + let deployments_api: Api = Api::namespaced(client.clone(), namespace); let sandbox_gvk_v1beta1 = GroupVersionKind::gvk(SANDBOX_API_GROUP, SANDBOX_API_VERSION_V1BETA1, SANDBOX_KIND); let sandbox_resource_v1beta1 = ApiResource::from_gvk(&sandbox_gvk_v1beta1); @@ -185,6 +202,8 @@ impl LiveK8sResolver { Self { token_reviews_api, pods_api, + replica_sets_api, + deployments_api, sandboxes_api_v1beta1, sandboxes_api_v1alpha1, expected_audience, @@ -214,6 +233,129 @@ impl LiveK8sResolver { Ok(None) } + + async fn sandbox_owner_for_pod( + &self, + pod: &Pod, + pod_name: &str, + ) -> Result { + match direct_sandbox_owner_reference(pod) { + Ok(owner) => Ok(owner), + Err(err) => { + let Some(controller) = controller_owner_reference( + pod.metadata.owner_references.as_deref().unwrap_or_default(), + ) else { + return Err(err); + }; + if controller.api_version != APPS_API_VERSION_FULL_V1 + || controller.kind != REPLICA_SET_KIND + { + return Err(err); + } + self.sandbox_owner_for_replica_set_controller(&controller, pod_name) + .await + } + } + } + + async fn sandbox_owner_for_replica_set_controller( + &self, + replica_set_owner: &ControllerOwnerReference, + pod_name: &str, + ) -> Result { + let replica_set = self + .replica_sets_api + .get_opt(&replica_set_owner.name) + .await + .map_err(|e| { + warn!( + pod = %pod_name, + replica_set = %replica_set_owner.name, + error = %e, + "failed to fetch ReplicaSet for pod identity validation" + ); + Status::internal(format!("replicaset GET failed: {e}")) + })? 
+ .ok_or_else(|| { + warn!( + pod = %pod_name, + replica_set = %replica_set_owner.name, + "pod controller ReplicaSet was not found" + ); + Status::permission_denied("pod controller ReplicaSet not found") + })?; + validate_object_uid( + replica_set.metadata.uid.as_deref().unwrap_or_default(), + &replica_set_owner.uid, + "pod controller ReplicaSet UID mismatch", + )?; + + let deployment_owner = controller_owner_reference( + replica_set + .metadata + .owner_references + .as_deref() + .unwrap_or_default(), + ) + .ok_or_else(|| { + warn!( + pod = %pod_name, + replica_set = %replica_set_owner.name, + "ReplicaSet has no controlling Deployment ownerReference" + ); + Status::permission_denied("ReplicaSet is not controlled by a Deployment") + })?; + if deployment_owner.api_version != APPS_API_VERSION_FULL_V1 + || deployment_owner.kind != DEPLOYMENT_KIND + { + warn!( + pod = %pod_name, + replica_set = %replica_set_owner.name, + owner_api_version = %deployment_owner.api_version, + owner_kind = %deployment_owner.kind, + "ReplicaSet controller is not an apps/v1 Deployment" + ); + return Err(Status::permission_denied( + "ReplicaSet is not controlled by a Deployment", + )); + } + + let deployment = self + .deployments_api + .get_opt(&deployment_owner.name) + .await + .map_err(|e| { + warn!( + pod = %pod_name, + deployment = %deployment_owner.name, + error = %e, + "failed to fetch Deployment for pod identity validation" + ); + Status::internal(format!("deployment GET failed: {e}")) + })? 
+ .ok_or_else(|| { + warn!( + pod = %pod_name, + deployment = %deployment_owner.name, + "ReplicaSet controller Deployment was not found" + ); + Status::permission_denied("ReplicaSet controller Deployment not found") + })?; + validate_object_uid( + deployment.metadata.uid.as_deref().unwrap_or_default(), + &deployment_owner.uid, + "ReplicaSet controller Deployment UID mismatch", + )?; + + sandbox_owner_reference_from_owner_refs( + deployment + .metadata + .owner_references + .as_deref() + .unwrap_or_default(), + "Deployment", + ) + } } #[async_trait] @@ -293,7 +435,7 @@ impl K8sIdentityResolver for LiveK8sResolver { let sandbox_id = pod_sandbox_id(&pod)?; - let owner = sandbox_owner_reference(&pod)?; + let owner = self.sandbox_owner_for_pod(&pod, &identity.pod_name).await?; let sandbox_cr = self.get_sandbox_cr_for_owner(&owner).await.map_err(|e| { warn!( pod = %identity.pod_name, @@ -406,8 +548,18 @@ fn pod_sandbox_id(pod: &Pod) -> Result { } #[allow(clippy::result_large_err)] -fn sandbox_owner_reference(pod: &Pod) -> Result { - let owner_refs = pod.metadata.owner_references.as_deref().unwrap_or_default(); +fn direct_sandbox_owner_reference(pod: &Pod) -> Result { + sandbox_owner_reference_from_owner_refs( + pod.metadata.owner_references.as_deref().unwrap_or_default(), + "pod", + ) +} + +#[allow(clippy::result_large_err)] +fn sandbox_owner_reference_from_owner_refs( + owner_refs: &[OwnerReference], + object_kind: &str, +) -> Result { let mut sandbox_refs = owner_refs .iter() .filter(|owner| is_supported_sandbox_owner_reference(owner)); @@ -424,27 +576,28 @@ fn sandbox_owner_reference(pod: &Pod) -> Result { SANDBOX_API_VERSION_FULL_V1BETA1, SANDBOX_API_VERSION_FULL_V1ALPHA1, ], - "pod Sandbox ownerReference uses unsupported apiVersion" + object_kind = %object_kind, + "Sandbox ownerReference uses unsupported apiVersion" ); } - return Err(Status::permission_denied( - "pod is not controlled by an OpenShell Sandbox", - )); + return Err(Status::permission_denied(format!( + 
"{object_kind} is not controlled by an OpenShell Sandbox" + ))); }; if sandbox_refs.next().is_some() { - return Err(Status::permission_denied( - "pod has multiple OpenShell Sandbox owners", - )); + return Err(Status::permission_denied(format!( + "{object_kind} has multiple OpenShell Sandbox owners" + ))); } if owner.controller != Some(true) { - return Err(Status::permission_denied( - "pod Sandbox ownerReference is not controlling", - )); + return Err(Status::permission_denied(format!( + "{object_kind} Sandbox ownerReference is not controlling" + ))); } if owner.name.is_empty() || owner.uid.is_empty() { - return Err(Status::permission_denied( - "pod Sandbox ownerReference is incomplete", - )); + return Err(Status::permission_denied(format!( + "{object_kind} Sandbox ownerReference is incomplete" + ))); } Ok(SandboxOwnerReference { api_version: owner.api_version.clone(), @@ -453,9 +606,32 @@ fn sandbox_owner_reference(pod: &Pod) -> Result { }) } -fn is_supported_sandbox_owner_reference( - owner: &k8s_openapi::apimachinery::pkg::apis::meta::v1::OwnerReference, -) -> bool { +fn controller_owner_reference(owner_refs: &[OwnerReference]) -> Option { + let owner = owner_refs + .iter() + .find(|owner| owner.controller == Some(true))?; + Some(ControllerOwnerReference { + api_version: owner.api_version.clone(), + kind: owner.kind.clone(), + name: owner.name.clone(), + uid: owner.uid.clone(), + }) +} + +#[allow(clippy::result_large_err)] +fn validate_object_uid(actual_uid: &str, expected_uid: &str, message: &str) -> Result<(), Status> { + if actual_uid != expected_uid { + warn!( + expected_uid = %expected_uid, + actual_uid = %actual_uid, + %message + ); + return Err(Status::permission_denied(message.to_string())); + } + Ok(()) +} + +fn is_supported_sandbox_owner_reference(owner: &OwnerReference) -> bool { owner.kind == SANDBOX_KIND && matches!( owner.api_version.as_str(), @@ -629,6 +805,17 @@ mod tests { } } + fn app_controller_owner(kind: &str, name: &str, uid: &str) -> 
OwnerReference { + OwnerReference { + api_version: APPS_API_VERSION_FULL_V1.to_string(), + block_owner_deletion: None, + controller: Some(true), + kind: kind.to_string(), + name: name.to_string(), + uid: uid.to_string(), + } + } + fn pod_with_owner_refs(owner_references: Vec) -> Pod { Pod { metadata: ObjectMeta { @@ -780,7 +967,7 @@ mod tests { fn sandbox_owner_reference_extracts_controlling_sandbox_owner() { let pod = pod_with_owner_refs(vec![sandbox_owner("sandbox-a", "cr-uid-a")]); - let owner = sandbox_owner_reference(&pod).expect("expected Sandbox owner"); + let owner = direct_sandbox_owner_reference(&pod).expect("expected Sandbox owner"); assert_eq!( owner, @@ -800,7 +987,7 @@ mod tests { "cr-uid-a", )]); - let owner = sandbox_owner_reference(&pod).expect("expected v1alpha1 Sandbox owner"); + let owner = direct_sandbox_owner_reference(&pod).expect("expected v1alpha1 Sandbox owner"); assert_eq!( owner, @@ -816,7 +1003,7 @@ mod tests { fn sandbox_owner_reference_rejects_missing_owner() { let pod = pod_with_owner_refs(vec![]); - let err = sandbox_owner_reference(&pod).expect_err("missing owner must fail"); + let err = direct_sandbox_owner_reference(&pod).expect_err("missing owner must fail"); assert_eq!(err.code(), tonic::Code::PermissionDenied); } @@ -829,8 +1016,8 @@ mod tests { "cr-uid-a", )]); - let err = - sandbox_owner_reference(&pod).expect_err("unsupported apiVersion must fail closed"); + let err = direct_sandbox_owner_reference(&pod) + .expect_err("unsupported apiVersion must fail closed"); assert_eq!(err.code(), tonic::Code::PermissionDenied); } @@ -841,7 +1028,7 @@ mod tests { owner.controller = Some(false); let pod = pod_with_owner_refs(vec![owner]); - let err = sandbox_owner_reference(&pod).expect_err("non-controller owner must fail"); + let err = direct_sandbox_owner_reference(&pod).expect_err("non-controller owner must fail"); assert_eq!(err.code(), tonic::Code::PermissionDenied); } @@ -853,11 +1040,50 @@ mod tests { sandbox_owner("sandbox-b", 
"cr-uid-b"), ]); - let err = sandbox_owner_reference(&pod).expect_err("multiple owners must fail"); + let err = direct_sandbox_owner_reference(&pod).expect_err("multiple owners must fail"); assert_eq!(err.code(), tonic::Code::PermissionDenied); } + #[test] + fn controller_owner_reference_extracts_controlling_apps_owner() { + let pod = pod_with_owner_refs(vec![app_controller_owner( + REPLICA_SET_KIND, + "supervisor-rs", + "rs-uid", + )]); + + let owner = controller_owner_reference(pod.metadata.owner_references.as_deref().unwrap()) + .expect("expected controller owner"); + + assert_eq!( + owner, + ControllerOwnerReference { + api_version: APPS_API_VERSION_FULL_V1.to_string(), + kind: REPLICA_SET_KIND.to_string(), + name: "supervisor-rs".to_string(), + uid: "rs-uid".to_string(), + } + ); + } + + #[test] + fn sandbox_owner_reference_from_deployment_requires_controlling_sandbox_owner() { + let deployment_owner_refs = vec![sandbox_owner("sandbox-a", "cr-uid-a")]; + + let owner = sandbox_owner_reference_from_owner_refs(&deployment_owner_refs, "Deployment") + .expect("expected Deployment Sandbox owner"); + + assert_eq!( + owner, + SandboxOwnerReference { + api_version: SANDBOX_API_VERSION_FULL_V1BETA1.to_string(), + name: "sandbox-a".to_string(), + uid: "cr-uid-a".to_string(), + } + ); + } + #[test] fn validate_sandbox_owner_reference_requires_matching_cr_uid_and_label() { let owner = SandboxOwnerReference { diff --git a/crates/openshell-supervisor-network/data/sandbox-policy.rego b/crates/openshell-supervisor-network/data/sandbox-policy.rego index efcdf0732..d70c69b74 100644 --- a/crates/openshell-supervisor-network/data/sandbox-policy.rego +++ b/crates/openshell-supervisor-network/data/sandbox-policy.rego @@ -19,6 +19,10 @@ allow_network if { network_policy_for_request } +binary_identity_required if { + object.get(object.get(data, "runtime", {}), "require_binary_identity", true) +} + # --- Deny reasons (specific diagnostics for debugging policy denials) --- deny_reason 
:= "missing input.network" if { @@ -131,6 +135,12 @@ endpoint_allowed(policy, network) if { endpoint.ports[_] == network.port } +# Binary matching can be relaxed by trusted runtime configuration. In that +# mode, network policies are endpoint/L7 scoped and ignore policy.binaries. +binary_allowed(_, _) if { + not binary_identity_required +} + # Binary matching: exact path. # SHA256 integrity is enforced in Rust via trust-on-first-use (TOFU) cache, # not in Rego. The proxy computes and caches binary hashes at runtime. @@ -161,6 +171,10 @@ binary_allowed(policy, exec) if { glob.match(b.path, ["/"], p) } +user_declared_binary_allowed(_, _) if { + not binary_identity_required +} + user_declared_binary_allowed(policy, exec) if { some b b := policy.binaries[_] diff --git a/crates/openshell-supervisor-network/src/identity.rs b/crates/openshell-supervisor-network/src/identity.rs index fce568f41..5e89c3503 100644 --- a/crates/openshell-supervisor-network/src/identity.rs +++ b/crates/openshell-supervisor-network/src/identity.rs @@ -100,23 +100,34 @@ impl BinaryIdentityCache { /// Returns `Ok(hash)` if it matches, `Err` if the hash changed (binary tampered). 
#[cfg_attr(not(target_os = "linux"), allow(dead_code))] pub fn verify_or_cache(&self, path: &Path) -> Result { - self.verify_or_cache_with_hasher(path, procfs::file_sha256) + self.verify_or_cache_with_paths(path, path, procfs::file_sha256) } - fn verify_or_cache_with_hasher(&self, path: &Path, mut hash_file: F) -> Result + #[cfg(target_os = "linux")] + pub fn verify_or_cache_process_exe(&self, display_path: &Path, pid: u32) -> Result { + let proc_exe = PathBuf::from(format!("/proc/{pid}/exe")); + self.verify_or_cache_with_paths(display_path, &proc_exe, procfs::file_sha256) + } + + fn verify_or_cache_with_paths( + &self, + cache_path: &Path, + access_path: &Path, + mut hash_file: F, + ) -> Result where F: FnMut(&Path) -> Result, { let start = std::time::Instant::now(); - let metadata = std::fs::metadata(path) - .map_err(|error| miette::miette!("Failed to stat {}: {error}", path.display()))?; + let metadata = std::fs::metadata(access_path) + .map_err(|error| miette::miette!("Failed to stat {}: {error}", cache_path.display()))?; let fingerprint = FileFingerprint::from_metadata(&metadata); let cached = self .hashes .lock() .map_err(|_| miette::miette!("Binary identity cache lock poisoned"))? 
- .get(path) + .get(cache_path) .cloned(); if let Some(cached_binary) = &cached @@ -125,7 +136,7 @@ impl BinaryIdentityCache { debug!( " verify_or_cache: {}ms CACHE HIT path={}", start.elapsed().as_millis(), - path.display() + cache_path.display() ); return Ok(cached_binary.hash.clone()); } @@ -133,29 +144,29 @@ impl BinaryIdentityCache { debug!( " verify_or_cache: CACHE MISS size={} path={}", metadata.len(), - path.display() + cache_path.display() ); - let current_hash = hash_file(path)?; + let current_hash = hash_file(access_path)?; let mut hashes = self .hashes .lock() .map_err(|_| miette::miette!("Binary identity cache lock poisoned"))?; - if let Some(existing) = hashes.get(path) + if let Some(existing) = hashes.get(cache_path) && existing.hash != current_hash { return Err(miette::miette!( "Binary integrity violation: {} hash changed (cached: {}, current: {})", - path.display(), + cache_path.display(), existing.hash, current_hash )); } hashes.insert( - path.to_path_buf(), + cache_path.to_path_buf(), CachedBinary { hash: current_hash.clone(), fingerprint, @@ -165,7 +176,7 @@ impl BinaryIdentityCache { debug!( " verify_or_cache TOTAL (cold): {}ms path={}", start.elapsed().as_millis(), - path.display() + cache_path.display() ); Ok(current_hash) @@ -212,13 +223,13 @@ mod tests { let mut hash_calls = 0; let hash1 = cache - .verify_or_cache_with_hasher(tmp.path(), |path| { + .verify_or_cache_with_paths(tmp.path(), tmp.path(), |path| { hash_calls += 1; procfs::file_sha256(path) }) .unwrap(); let hash2 = cache - .verify_or_cache_with_hasher(tmp.path(), |path| { + .verify_or_cache_with_paths(tmp.path(), tmp.path(), |path| { hash_calls += 1; procfs::file_sha256(path) }) @@ -238,7 +249,7 @@ mod tests { let mut hash_calls = 0; let hash1 = cache - .verify_or_cache_with_hasher(tmp.path(), |path| { + .verify_or_cache_with_paths(tmp.path(), tmp.path(), |path| { hash_calls += 1; procfs::file_sha256(path) }) @@ -254,7 +265,7 @@ mod tests { .unwrap(); let hash2 = cache - 
.verify_or_cache_with_hasher(tmp.path(), |path| { + .verify_or_cache_with_paths(tmp.path(), tmp.path(), |path| { hash_calls += 1; procfs::file_sha256(path) }) @@ -275,7 +286,7 @@ mod tests { let mut hash_calls = 0; cache - .verify_or_cache_with_hasher(&path, |path| { + .verify_or_cache_with_paths(&path, &path, |path| { hash_calls += 1; procfs::file_sha256(path) }) @@ -292,7 +303,7 @@ mod tests { .set_modified(original_mtime) .unwrap(); - let result = cache.verify_or_cache_with_hasher(&path, |path| { + let result = cache.verify_or_cache_with_paths(&path, &path, |path| { hash_calls += 1; procfs::file_sha256(path) }); @@ -301,6 +312,28 @@ mod tests { assert_eq!(hash_calls, 2); } + #[test] + fn display_path_can_differ_from_access_path() { + let mut tmp = tempfile::NamedTempFile::new().unwrap(); + tmp.write_all(b"binary content").unwrap(); + tmp.flush().unwrap(); + let display_path = Path::new("/usr/bin/python3"); + + let cache = BinaryIdentityCache::new(); + let hash = cache + .verify_or_cache_with_paths(display_path, tmp.path(), procfs::file_sha256) + .unwrap(); + + assert!(!hash.is_empty()); + assert!( + cache + .hashes + .lock() + .unwrap() + .contains_key(Path::new("/usr/bin/python3")) + ); + } + #[test] fn hash_mismatch_returns_error() { let dir = tempfile::tempdir().unwrap(); diff --git a/crates/openshell-supervisor-network/src/l7/tls.rs b/crates/openshell-supervisor-network/src/l7/tls.rs index 70e198f42..c211200a8 100644 --- a/crates/openshell-supervisor-network/src/l7/tls.rs +++ b/crates/openshell-supervisor-network/src/l7/tls.rs @@ -63,6 +63,28 @@ impl SandboxCa { }) } + /// Load an existing CA certificate and private key from PEM. + pub fn from_pem(ca_cert_pem: &str, ca_key_pem: &str) -> Result { + let ca_key = KeyPair::from_pem(ca_key_pem).into_diagnostic()?; + let ca_cert = CertificateParams::from_ca_cert_pem(ca_cert_pem) + .into_diagnostic()? 
+ .self_signed(&ca_key) + .into_diagnostic()?; + + Ok(Self { + ca_cert, + ca_key, + ca_cert_pem: ca_cert_pem.to_string(), + }) + } + + /// Load an existing CA certificate and private key from files. + pub fn from_files(cert_path: &Path, key_path: &Path) -> Result { + let ca_cert_pem = std::fs::read_to_string(cert_path).into_diagnostic()?; + let ca_key_pem = std::fs::read_to_string(key_path).into_diagnostic()?; + Self::from_pem(&ca_cert_pem, &ca_key_pem) + } + /// Returns the CA certificate in PEM format. pub fn cert_pem(&self) -> &str { &self.ca_cert_pem @@ -519,4 +541,18 @@ mod tests { "bundle should contain at least one cert", ); } + + #[test] + fn sandbox_ca_loads_from_pem() { + let ca = SandboxCa::generate().unwrap(); + let key_pem = ca.ca_key.serialize_pem(); + let loaded = SandboxCa::from_pem(ca.cert_pem(), &key_pem).unwrap(); + + assert_eq!(loaded.cert_pem(), ca.cert_pem()); + assert!( + CertCache::new(loaded) + .get_or_generate("example.com") + .is_ok() + ); + } } diff --git a/crates/openshell-supervisor-network/src/opa.rs b/crates/openshell-supervisor-network/src/opa.rs index fbab5fedd..850c38320 100644 --- a/crates/openshell-supervisor-network/src/opa.rs +++ b/crates/openshell-supervisor-network/src/opa.rs @@ -18,6 +18,7 @@ use std::sync::{ Arc, Mutex, atomic::{AtomicU64, Ordering}, }; +use tracing::info; /// Baked-in rego rules for OPA policy evaluation. 
/// These rules define the network access decision logic and static config @@ -55,6 +56,49 @@ pub struct NetworkInput { pub cmdline_paths: Vec, } +pub(crate) fn network_binary_identity_required() -> bool { + std::env::var(openshell_core::sandbox_env::NETWORK_BINARY_IDENTITY).map_or(true, |value| { + !matches!( + value.as_str(), + "relaxed" | "disabled" | "endpoint-only" | "false" | "0" + ) + }) +} + +fn inject_runtime_policy_data(data: &mut serde_json::Value, require_binary_identity: bool) { + let Some(obj) = data.as_object_mut() else { + return; + }; + obj.insert( + "runtime".to_string(), + serde_json::json!({ + "require_binary_identity": require_binary_identity, + }), + ); +} + +fn emit_binary_identity_mode(require_binary_identity: bool, source: &str) { + info!( + require_binary_identity, + source, "Configured OPA runtime binary identity mode" + ); + openshell_ocsf::ocsf_emit!( + openshell_ocsf::ConfigStateChangeBuilder::new(openshell_ocsf::ctx::ctx()) + .severity(openshell_ocsf::SeverityId::Informational) + .status(openshell_ocsf::StatusId::Success) + .state(openshell_ocsf::StateId::Enabled, "configured") + .unmapped( + "require_binary_identity", + serde_json::json!(require_binary_identity) + ) + .unmapped("source", serde_json::json!(source)) + .message(format!( + "OPA runtime binary identity mode configured [source:{source} require_binary_identity:{require_binary_identity}]" + )) + .build() + ); +} + /// Sandbox configuration extracted from OPA data at startup. 
pub struct SandboxConfig { pub filesystem: FilesystemPolicy, @@ -146,7 +190,9 @@ impl OpaEngine { engine .add_policy_from_file(policy_path) .map_err(|e| miette::miette!("{e}"))?; - let data_json = preprocess_yaml_data(&yaml_str)?; + let require_binary_identity = network_binary_identity_required(); + emit_binary_identity_mode(require_binary_identity, "files"); + let data_json = preprocess_yaml_data(&yaml_str, require_binary_identity)?; engine .add_data_json(&data_json) .map_err(|e| miette::miette!("{e}"))?; @@ -160,11 +206,24 @@ impl OpaEngine { /// /// Preprocesses the YAML data to expand access presets and validate L7 config. pub fn from_strings(policy: &str, data_yaml: &str) -> Result { + Self::from_strings_with_binary_identity_required( + policy, + data_yaml, + network_binary_identity_required(), + ) + } + + pub(crate) fn from_strings_with_binary_identity_required( + policy: &str, + data_yaml: &str, + require_binary_identity: bool, + ) -> Result { let mut engine = regorus::Engine::new(); engine .add_policy("policy.rego".into(), policy.into()) .map_err(|e| miette::miette!("{e}"))?; - let data_json = preprocess_yaml_data(data_yaml)?; + emit_binary_identity_mode(require_binary_identity, "strings"); + let data_json = preprocess_yaml_data(data_yaml, require_binary_identity)?; engine .add_data_json(&data_json) .map_err(|e| miette::miette!("{e}"))?; @@ -193,11 +252,25 @@ impl OpaEngine { /// gap between user-specified symlink paths (e.g., `/usr/bin/python3`) and /// kernel-resolved canonical paths (e.g., `/usr/bin/python3.11`). 
pub fn from_proto_with_pid(proto: &ProtoSandboxPolicy, entrypoint_pid: u32) -> Result { + Self::from_proto_with_pid_and_binary_identity_required( + proto, + entrypoint_pid, + network_binary_identity_required(), + ) + } + + fn from_proto_with_pid_and_binary_identity_required( + proto: &ProtoSandboxPolicy, + entrypoint_pid: u32, + require_binary_identity: bool, + ) -> Result { + emit_binary_identity_mode(require_binary_identity, "proto"); let data_json_str = proto_to_opa_data_json(proto, entrypoint_pid); // Parse back to Value for preprocessing, then re-serialize let mut data: serde_json::Value = serde_json::from_str(&data_json_str) .map_err(|e| miette::miette!("internal: failed to parse proto JSON: {e}"))?; + inject_runtime_policy_data(&mut data, require_binary_identity); // Validate BEFORE expanding presets let (errors, warnings) = crate::l7::validate_l7_policies(&data); @@ -720,9 +793,10 @@ fn parse_process_policy(val: &regorus::Value) -> ProcessPolicy { } /// Preprocess YAML policy data: parse, normalize, validate, expand access presets, return JSON. -fn preprocess_yaml_data(yaml_str: &str) -> Result { +fn preprocess_yaml_data(yaml_str: &str, require_binary_identity: bool) -> Result { let mut data: serde_json::Value = serde_yml::from_str(yaml_str) .map_err(|e| miette::miette!("failed to parse YAML data: {e}"))?; + inject_runtime_policy_data(&mut data, require_binary_identity); // Normalize port → ports for all endpoints so Rego always sees "ports" array.
normalize_endpoint_ports(&mut data); @@ -2264,6 +2338,88 @@ process: assert!(eval_l7(&engine, &input)); } + #[test] + fn l7_get_allowed_by_rules_when_binary_identity_relaxed() { + let engine = + OpaEngine::from_strings_with_binary_identity_required(TEST_POLICY, L7_TEST_DATA, false) + .expect("Failed to load relaxed L7 test data"); + let mut input = l7_input("api.example.com", 8080, "GET", "/repos/myorg/foo"); + input["exec"]["path"] = "".into(); + assert!(eval_l7(&engine, &input)); + } + + #[test] + fn relaxed_binary_identity_preserves_matched_policy_and_l7_for_proto() { + let mut network_policies = std::collections::HashMap::new(); + network_policies.insert( + "test_l7".to_string(), + NetworkPolicyRule { + name: "test_l7".to_string(), + endpoints: vec![NetworkEndpoint { + host: "host.k3d.internal".to_string(), + port: 56123, + protocol: "rest".to_string(), + enforcement: "enforce".to_string(), + rules: vec![L7Rule { + allow: Some(L7Allow { + method: "GET".to_string(), + path: "/allowed".to_string(), + command: String::new(), + query: std::collections::HashMap::new(), + operation_type: String::new(), + operation_name: String::new(), + fields: Vec::new(), + params: std::collections::HashMap::new(), + }), + }], + allowed_ips: vec!["192.168.0.0/16".to_string()], + ..Default::default() + }], + binaries: vec![NetworkBinary { + path: "/usr/bin/curl".to_string(), + ..Default::default() + }], + }, + ); + let proto = ProtoSandboxPolicy { + version: 1, + filesystem: Some(ProtoFs { + include_workdir: true, + read_only: vec![], + read_write: vec![], + }), + landlock: Some(openshell_core::proto::LandlockPolicy { + compatibility: "best_effort".to_string(), + }), + process: Some(ProtoProc { + run_as_user: "sandbox".to_string(), + run_as_group: "sandbox".to_string(), + }), + network_policies, + }; + let engine = OpaEngine::from_proto_with_pid_and_binary_identity_required(&proto, 0, false) + .expect("engine from relaxed proto"); + let network_input = NetworkInput { + host: 
"host.k3d.internal".into(), + port: 56123, + binary_path: PathBuf::new(), + binary_sha256: String::new(), + ancestors: vec![], + cmdline_paths: vec![], + }; + let action = engine.evaluate_network_action(&network_input).unwrap(); + assert_eq!( + action, + NetworkAction::Allow { + matched_policy: Some("test_l7".to_string()) + } + ); + + let mut input = l7_input("host.k3d.internal", 56123, "GET", "/allowed"); + input["exec"]["path"] = "".into(); + assert!(eval_l7(&engine, &input)); + } + #[test] fn l7_post_allowed_by_rules() { let engine = l7_engine(); @@ -4592,6 +4748,46 @@ process: ); } + #[test] + fn relaxed_binary_identity_allows_declared_endpoint_without_binary_match() { + let engine = OpaEngine::from_strings_with_binary_identity_required( + TEST_POLICY, + INFERENCE_TEST_DATA, + false, + ) + .expect("Failed to load relaxed binary identity test data"); + let input = NetworkInput { + host: "api.anthropic.com".into(), + port: 443, + binary_path: PathBuf::from("/tmp/unlisted-agent"), + binary_sha256: "unused".into(), + ancestors: vec![], + cmdline_paths: vec![], + }; + + let action = engine.evaluate_network_action(&input).unwrap(); + assert_eq!( + action, + NetworkAction::Allow { + matched_policy: Some("claude_code".to_string()) + }, + ); + assert!( + engine.query_exact_declared_endpoint_host(&input).unwrap(), + "relaxed identity should preserve exact declared endpoint handling" + ); + + let undeclared = NetworkInput { + host: "api.openai.com".into(), + ..input + }; + let action = engine.evaluate_network_action(&undeclared).unwrap(); + assert!( + matches!(action, NetworkAction::Deny { .. 
}), + "relaxed identity must not allow undeclared endpoints" + ); + } + #[test] fn unknown_endpoint_returns_deny() { let engine = inference_engine(); diff --git a/crates/openshell-supervisor-network/src/proxy.rs b/crates/openshell-supervisor-network/src/proxy.rs index 0d2c8c025..c38ecbd3a 100644 --- a/crates/openshell-supervisor-network/src/proxy.rs +++ b/crates/openshell-supervisor-network/src/proxy.rs @@ -42,6 +42,8 @@ const TUNNEL_PROTOCOL_PEEK_POLL: std::time::Duration = std::time::Duration::from const TUNNEL_PROTOCOL_PEEK_POLL: std::time::Duration = std::time::Duration::from_millis(1); const INFERENCE_LOCAL_HOST: &str = "inference.local"; const INFERENCE_LOCAL_PORT: u16 = 443; +#[cfg(target_os = "linux")] +const SIDECAR_SUPERVISOR_TOPOLOGY: &str = "sidecar"; /// Hostnames injected by compute drivers as `/etc/hosts` aliases for the host /// machine. Traffic to these names is eligible for the trusted-gateway SSRF @@ -1426,7 +1428,7 @@ fn resolve_owner_identity( })?; let bin_hash = identity_cache - .verify_or_cache(&bin_path) + .verify_or_cache_process_exe(&bin_path, owner_pid) .map_err(|e| IdentityError { reason: format!("binary integrity check failed: {e}"), binary: Some(bin_path.clone()), @@ -1434,11 +1436,15 @@ fn resolve_owner_identity( ancestors: vec![], })?; - let ancestors = crate::procfs::collect_ancestor_binaries(owner_pid, entrypoint_pid); + let ancestor_identities = collect_ancestor_identities(owner_pid, entrypoint_pid); + let ancestors: Vec = ancestor_identities + .iter() + .map(|(_, path)| path.clone()) + .collect(); - for ancestor in &ancestors { + for (ancestor_pid, ancestor) in &ancestor_identities { identity_cache - .verify_or_cache(ancestor) + .verify_or_cache_process_exe(ancestor, *ancestor_pid) .map_err(|e| IdentityError { reason: format!( "ancestor integrity check failed for {}: {e}", @@ -1463,6 +1469,31 @@ fn resolve_owner_identity( }) } +#[cfg(target_os = "linux")] +fn collect_ancestor_identities(pid: u32, stop_pid: u32) -> Vec<(u32, 
PathBuf)> { + const MAX_DEPTH: usize = 64; + let mut ancestors = Vec::new(); + let mut current = pid; + + for _ in 0..MAX_DEPTH { + let ppid = match crate::procfs::read_ppid(current) { + Some(p) if p > 0 && p != current => p, + _ => break, + }; + + if let Ok(path) = crate::procfs::binary_path(ppid.cast_signed()) { + ancestors.push((ppid, path)); + } + + if ppid == stop_pid || ppid == 1 { + break; + } + current = ppid; + } + + ancestors +} + /// Resolve the identity of the process owning a TCP peer connection. /// /// Walks `/proc//net/tcp` to find the socket inode, locates @@ -1573,8 +1604,17 @@ fn evaluate_opa_tcp( } }; - let pid = entrypoint_pid.load(Ordering::Acquire); - if pid == 0 { + if !crate::opa::network_binary_identity_required() { + let result = evaluate_endpoint_only_opa(engine, host, port); + debug!( + "evaluate_opa_tcp endpoint-only: host={host} port={port} action={:?}", + result.action + ); + return result; + } + + let entrypoint_pid = entrypoint_pid.load(Ordering::Acquire); + let Some(proc_net_anchor_pid) = proc_net_anchor_pid(entrypoint_pid) else { return deny( "entrypoint process not yet spawned".into(), None, @@ -1582,12 +1622,12 @@ fn evaluate_opa_tcp( vec![], vec![], ); - } + }; let total_start = std::time::Instant::now(); let peer_port = peer_addr.port(); - let identity = match resolve_process_identity(pid, peer_port, identity_cache) { + let identity = match resolve_process_identity(proc_net_anchor_pid, peer_port, identity_cache) { Ok(id) => id, Err(err) => { return deny( @@ -1641,6 +1681,52 @@ fn evaluate_opa_tcp( result } +#[cfg(target_os = "linux")] +fn proc_net_anchor_pid(entrypoint_pid: u32) -> Option<u32> { + if entrypoint_pid != 0 { + return Some(entrypoint_pid); + } + sidecar_topology_enabled().then(std::process::id) +} + +#[cfg(target_os = "linux")] +fn sidecar_topology_enabled() -> bool { + std::env::var(openshell_core::sandbox_env::SUPERVISOR_TOPOLOGY) + .is_ok_and(|value| value == SIDECAR_SUPERVISOR_TOPOLOGY) +} + +fn 
evaluate_endpoint_only_opa(engine: &OpaEngine, host: &str, port: u16) -> ConnectDecision { + let input = crate::opa::NetworkInput { + host: host.to_string(), + port, + binary_path: PathBuf::new(), + binary_sha256: String::new(), + ancestors: vec![], + cmdline_paths: vec![], + }; + + match engine.evaluate_network_action_with_generation(&input) { + Ok((action, generation)) => ConnectDecision { + action, + generation, + binary: None, + binary_pid: None, + ancestors: vec![], + cmdline_paths: vec![], + }, + Err(e) => ConnectDecision { + action: NetworkAction::Deny { + reason: format!("policy evaluation error: {e}"), + }, + generation: engine.current_generation(), + binary: None, + binary_pid: None, + ancestors: vec![], + cmdline_paths: vec![], + }, + } +} + /// Non-Linux stub: OPA identity binding requires /proc. #[cfg(not(target_os = "linux"))] fn evaluate_opa_tcp( @@ -1648,9 +1734,13 @@ fn evaluate_opa_tcp( engine: &OpaEngine, _identity_cache: &BinaryIdentityCache, _entrypoint_pid: &AtomicU32, - _host: &str, - _port: u16, + host: &str, + port: u16, ) -> ConnectDecision { + if !crate::opa::network_binary_identity_required() { + return evaluate_endpoint_only_opa(engine, host, port); + } + ConnectDecision { action: NetworkAction::Deny { reason: "identity binding unavailable on this platform".into(), @@ -2152,14 +2242,24 @@ fn query_l7_route_snapshot( }; match engine.query_endpoint_configs_with_generation(&input) { - Ok((vals, generation)) => Some(L7RouteSnapshot { - configs: vals + Ok((vals, generation)) => { + let configs: Vec<_> = vals .into_iter() .filter_map(|val| crate::l7::parse_l7_config(&val)) .map(|config| L7ConfigSnapshot { config }) - .collect(), - generation, - }), + .collect(); + debug!( + host, + port, + generation, + config_count = configs.len(), + "Forward proxy L7 route lookup complete" + ); + Some(L7RouteSnapshot { + configs, + generation, + }) + } Err(e) => { let event = NetworkActivityBuilder::new(openshell_ocsf::ctx::ctx()) 
.activity(ActivityId::Fail) @@ -3337,10 +3437,29 @@ async fn handle_forward_proxy( } }; let policy_str = matched_policy.as_deref().unwrap_or("-"); + debug!( + host = %host_lc, + port, + binary = %binary_str, + binary_pid = %pid_str, + matched_policy = %policy_str, + decision_generation = decision.generation, + current_generation = opa_engine.current_generation(), + action = ?decision.action, + "Forward proxy L4 policy decision" + ); let sandbox_entrypoint_pid = entrypoint_pid.load(Ordering::Acquire); let forward_generation_guard = match opa_engine.generation_guard(decision.generation) { Ok(guard) => guard, Err(e) => { + warn!( + host = %host_lc, + port, + decision_generation = decision.generation, + current_generation = opa_engine.current_generation(), + error = %e, + "Forward proxy rejected request because policy generation changed after L4 decision" + ); emit_l7_tunnel_close_after_policy_change(&host_lc, port, e); emit_activity_simple(activity_tx, true, "policy_stale"); respond( @@ -3401,6 +3520,15 @@ async fn handle_forward_proxy( && !route.configs.is_empty() { if route.generation != forward_generation_guard.captured_generation() { + warn!( + host = %host_lc, + port, + decision_generation = decision.generation, + guard_generation = forward_generation_guard.captured_generation(), + route_generation = route.generation, + current_generation = opa_engine.current_generation(), + "Forward proxy rejected request because L7 route lookup used a different policy generation" + ); emit_l7_tunnel_close_after_policy_change( &host_lc, port, @@ -3426,6 +3554,14 @@ async fn handle_forward_proxy( let tunnel_engine = match opa_engine.clone_engine_for_tunnel(route.generation) { Ok(engine) => engine, Err(e) => { + warn!( + host = %host_lc, + port, + route_generation = route.generation, + current_generation = opa_engine.current_generation(), + error = %e, + "Forward proxy rejected request because L7 tunnel engine could not be cloned" + ); 
emit_l7_tunnel_close_after_policy_change(&host_lc, port, e); emit_activity_simple(activity_tx, true, "policy_stale"); respond( @@ -4105,6 +4241,14 @@ async fn handle_forward_proxy( }; if let Err(e) = forward_generation_guard.ensure_current() { + warn!( + host = %host_lc, + port, + captured_generation = forward_generation_guard.captured_generation(), + current_generation = forward_generation_guard.current_generation(), + error = %e, + "Forward proxy rejected request because policy changed before upstream connect" + ); emit_l7_tunnel_close_after_policy_change(&host_lc, port, e); emit_activity_simple(activity_tx, true, "policy_stale"); respond( @@ -4243,6 +4387,14 @@ async fn handle_forward_proxy( }; if let Err(e) = forward_generation_guard.ensure_current() { + warn!( + host = %host_lc, + port, + captured_generation = forward_generation_guard.captured_generation(), + current_generation = forward_generation_guard.current_generation(), + error = %e, + "Forward proxy rejected request because policy changed before relay" + ); emit_l7_tunnel_close_after_policy_change(&host_lc, port, e); respond( client, @@ -4379,6 +4531,46 @@ mod tests { use tokio::io::{AsyncRead, AsyncReadExt, AsyncWriteExt}; use tokio::net::{TcpListener, TcpStream}; + #[test] + fn endpoint_only_opa_allows_declared_endpoint_without_process_identity() { + let policy = include_str!("../data/sandbox-policy.rego"); + let data = r#" +version: 1 +network_policies: + test_l7: + name: test_l7 + endpoints: + - host: host.k3d.internal + port: 56123 + protocol: rest + enforcement: enforce + rules: + - allow: + method: GET + path: /allowed + binaries: + - path: /usr/bin/curl +"#; + let engine = OpaEngine::from_strings_with_binary_identity_required(policy, data, false) + .expect("relaxed engine"); + + let decision = evaluate_endpoint_only_opa(&engine, "host.k3d.internal", 56123); + assert_eq!( + decision.action, + NetworkAction::Allow { + matched_policy: Some("test_l7".to_string()), + } + ); + 
assert!(decision.binary.is_none()); + assert!(decision.ancestors.is_empty()); + + let denied = evaluate_endpoint_only_opa(&engine, "api.example.com", 443); + assert!( + matches!(denied.action, NetworkAction::Deny { .. }), + "endpoint-only mode must still deny undeclared endpoints" + ); + } + fn websocket_l7_config( protocol: crate::l7::L7Protocol, websocket_credential_rewrite: bool, diff --git a/crates/openshell-supervisor-network/src/run.rs b/crates/openshell-supervisor-network/src/run.rs index 9553e0673..5a8123415 100644 --- a/crates/openshell-supervisor-network/src/run.rs +++ b/crates/openshell-supervisor-network/src/run.rs @@ -54,6 +54,38 @@ pub struct Networking { pub policy_local_ctx: Arc, } +fn sandbox_ca_for_proxy() -> Result<SandboxCa> { + let cert_path = std::env::var(openshell_core::sandbox_env::PROXY_CA_CERT_PATH).ok(); + let key_path = std::env::var(openshell_core::sandbox_env::PROXY_CA_KEY_PATH).ok(); + match (cert_path, key_path) { + (Some(cert_path), Some(key_path)) => SandboxCa::from_files( + std::path::Path::new(&cert_path), + std::path::Path::new(&key_path), + ), + (None, None) => SandboxCa::generate(), + _ => Err(miette::miette!( + "{} and {} must be set together", + openshell_core::sandbox_env::PROXY_CA_CERT_PATH, + openshell_core::sandbox_env::PROXY_CA_KEY_PATH + )), + } +} + +fn explicit_proxy_bind_addr() -> Result<Option<SocketAddr>> { + let Some(value) = std::env::var(openshell_core::sandbox_env::PROXY_BIND_ADDR) + .ok() + .filter(|value| !value.trim().is_empty()) + else { + return Ok(None); + }; + value.parse::<SocketAddr>().map(Some).map_err(|err| { + miette::miette!( + "invalid {} value {value:?}: {err}", + openshell_core::sandbox_env::PROXY_BIND_ADDR + ) + }) +} + /// Set up the networking stack: ephemeral CA + TLS state, proxy server, /// and the SSH-side proxy URL / netns FD. /// @@ -196,12 +228,14 @@ pub async fn run_networking( // the proxy, so it's owned here. 
let identity_cache = opa_engine.map(|_| Arc::new(BinaryIdentityCache::new())); - // Generate ephemeral CA and TLS state for HTTPS L7 inspection. - // The CA cert is written to disk so sandbox processes can trust it. + // Generate or load a CA and TLS state for HTTPS L7 inspection. The CA cert + // is written to disk so sandbox processes can trust it. let (tls_state, ca_file_paths) = if matches!(policy.network.mode, NetworkMode::Proxy) { - match SandboxCa::generate() { + match sandbox_ca_for_proxy() { Ok(ca) => { - let tls_dir = std::path::Path::new("/etc/openshell-tls"); + let tls_dir = std::env::var(openshell_core::sandbox_env::PROXY_TLS_DIR) + .unwrap_or_else(|_| "/etc/openshell-tls".to_string()); + let tls_dir = std::path::Path::new(&tls_dir); let system_ca_bundle = read_system_ca_bundle(); match write_ca_files(&ca, tls_dir, &system_ca_bundle) { Ok(paths) => { @@ -217,7 +251,7 @@ pub async fn run_networking( .severity(SeverityId::Informational) .status(StatusId::Success) .state(StateId::Enabled, "enabled") - .message("TLS termination enabled: ephemeral CA generated") + .message("TLS termination enabled") .build() ); (Some(state), Some(paths)) @@ -244,7 +278,7 @@ pub async fn run_networking( .status(StatusId::Failure) .state(StateId::Disabled, "disabled") .message(format!( - "Failed to generate ephemeral CA, TLS termination disabled: {e}" + "Failed to initialize proxy CA, TLS termination disabled: {e}" )) .build() ); @@ -273,9 +307,11 @@ pub async fn run_networking( // originating inside the namespace can reach the proxy. Otherwise the // proxy falls back to the policy-declared http_addr (loopback in // tests, etc.). 
- let bind_addr = proxy_bind_ip.map(|ip| { - let port = proxy_policy.http_addr.map_or(3128, |addr| addr.port()); - SocketAddr::new(ip, port) + let bind_addr = explicit_proxy_bind_addr()?.or_else(|| { + proxy_bind_ip.map(|ip| { + let port = proxy_policy.http_addr.map_or(3128, |addr| addr.port()); + SocketAddr::new(ip, port) + }) }); // Build inference context for local routing of intercepted inference calls. diff --git a/crates/openshell-supervisor-process/Cargo.toml b/crates/openshell-supervisor-process/Cargo.toml index 1163cc954..dc6a396eb 100644 --- a/crates/openshell-supervisor-process/Cargo.toml +++ b/crates/openshell-supervisor-process/Cargo.toml @@ -13,6 +13,7 @@ rust-version.workspace = true [dependencies] openshell-core = { path = "../openshell-core" } openshell-ocsf = { path = "../openshell-ocsf" } +openshell-policy = { path = "../openshell-policy" } anyhow = { workspace = true } base64 = { workspace = true } @@ -41,6 +42,7 @@ seccompiler = "0.5" tempfile = "3" [dev-dependencies] +temp-env = "0.3" tempfile = "3" [lints] diff --git a/crates/openshell-supervisor-process/src/netns/mod.rs b/crates/openshell-supervisor-process/src/netns/mod.rs index cc7b1d84c..d3d9063a0 100644 --- a/crates/openshell-supervisor-process/src/netns/mod.rs +++ b/crates/openshell-supervisor-process/src/netns/mod.rs @@ -467,6 +467,123 @@ pub fn create_netns_for_proxy( } } +/// Install pod-network bypass enforcement for Kubernetes sidecar topology. +/// +/// This runs in the current network namespace, not in a per-workload netns. +/// The rules allow loopback and the proxy UID, then reject direct +/// TCP/UDP egress from other UIDs so traffic must use the sidecar's local +/// proxy. +/// +/// # Errors +/// +/// Returns an error when `nft` is unavailable or the ruleset cannot be loaded. 
+pub fn install_sidecar_bypass_rules(proxy_uid: u32) -> Result<()> { + match install_sidecar_nft_bypass_rules(proxy_uid) { + Ok(()) => Ok(()), + Err(nft_error) => { + warn!( + error = %nft_error, + "Failed to install nftables sidecar rules; trying iptables-legacy fallback" + ); + install_sidecar_iptables_legacy_bypass_rules(proxy_uid).map_err(|iptables_error| { + miette::miette!( + "sidecar nft ruleset load failed: {nft_error}; sidecar iptables-legacy fallback failed: {iptables_error}" + ) + }) + } + } +} + +fn install_sidecar_nft_bypass_rules(proxy_uid: u32) -> Result<()> { + let nft_cmd = find_nft().ok_or_else(|| { + miette::miette!( + "trusted nft helper not found; sidecar network enforcement requires nftables" + ) + })?; + let log_prefix = Some("openshell:sidecar-bypass:"); + let ruleset = nft_ruleset::generate_sidecar_bypass_ruleset(proxy_uid, log_prefix); + run_nft_current_namespace(&nft_cmd, &ruleset) +} + +const SIDECAR_IPTABLES_CHAIN: &str = "OPENSHELL_SIDECAR_BYPASS"; + +fn install_sidecar_iptables_legacy_bypass_rules(proxy_uid: u32) -> Result<()> { + let iptables_cmd = find_iptables_legacy().ok_or_else(|| { + miette::miette!( + "trusted iptables-legacy helper not found; sidecar network enforcement fallback unavailable" + ) + })?; + + cleanup_sidecar_iptables_legacy_rules(&iptables_cmd); + + let proxy_uid_arg = proxy_uid.to_string(); + let commands: Vec<Vec<&str>> = vec![ + vec!["-N", SIDECAR_IPTABLES_CHAIN], + vec!["-A", SIDECAR_IPTABLES_CHAIN, "-o", "lo", "-j", "ACCEPT"], + vec![ + "-A", + SIDECAR_IPTABLES_CHAIN, + "-m", + "conntrack", + "--ctstate", + "ESTABLISHED,RELATED", + "-j", + "ACCEPT", + ], + vec![ + "-A", + SIDECAR_IPTABLES_CHAIN, + "-m", + "owner", + "--uid-owner", + &proxy_uid_arg, + "-j", + "ACCEPT", + ], + vec![ + "-A", + SIDECAR_IPTABLES_CHAIN, + "-p", + "tcp", + "-j", + "REJECT", + "--reject-with", + "tcp-reset", + ], + vec![ + "-A", + SIDECAR_IPTABLES_CHAIN, + "-p", + "udp", + "-j", + "REJECT", + "--reject-with", + "icmp-port-unreachable", + ], + 
vec!["-A", "OUTPUT", "-j", SIDECAR_IPTABLES_CHAIN], + ]; + + for args in commands { + if let Err(e) = run_iptables_legacy_current_namespace(&iptables_cmd, &args) { + cleanup_sidecar_iptables_legacy_rules(&iptables_cmd); + return Err(e); + } + } + + Ok(()) +} + +fn cleanup_sidecar_iptables_legacy_rules(iptables_cmd: &str) { + while run_iptables_legacy_current_namespace( + iptables_cmd, + &["-D", "OUTPUT", "-j", SIDECAR_IPTABLES_CHAIN], + ) + .is_ok() + {} + let _ = run_iptables_legacy_current_namespace(iptables_cmd, &["-F", SIDECAR_IPTABLES_CHAIN]); + let _ = run_iptables_legacy_current_namespace(iptables_cmd, &["-X", SIDECAR_IPTABLES_CHAIN]); +} + /// Run an `ip` command on the host. fn run_ip(args: &[&str]) -> Result<()> { let ip_path = find_trusted_binary("ip", IP_SEARCH_PATHS)?; @@ -490,6 +607,62 @@ fn run_ip(args: &[&str]) -> Result<()> { Ok(()) } +fn run_iptables_legacy_current_namespace(iptables_cmd: &str, args: &[&str]) -> Result<()> { + debug!( + command = %format!("{iptables_cmd} {}", args.join(" ")), + "Running iptables-legacy sidecar command" + ); + + let output = Command::new(iptables_cmd) + .args(args) + .output() + .into_diagnostic()?; + + if !output.status.success() { + let stderr = String::from_utf8_lossy(&output.stderr); + return Err(miette::miette!( + "{iptables_cmd} {} failed: {}", + args.join(" "), + stderr.trim() + )); + } + + Ok(()) +} + +fn run_nft_current_namespace(nft_cmd: &str, ruleset: &str) -> Result<()> { + use std::io::Write; + let mut tmp = tempfile::Builder::new() + .prefix("openshell-sidecar-nft-") + .suffix(".conf") + .tempfile() + .into_diagnostic()?; + tmp.write_all(ruleset.as_bytes()).into_diagnostic()?; + let ruleset_path = tmp.path().to_string_lossy().to_string(); + + debug!( + command = %format!("{nft_cmd} -f {ruleset_path}"), + "Loading nftables sidecar ruleset" + ); + + let output = Command::new(nft_cmd) + .args(["-f", &ruleset_path]) + .output() + .into_diagnostic()?; + + drop(tmp); + + if !output.status.success() { + let 
stderr = String::from_utf8_lossy(&output.stderr); + return Err(miette::miette!( + "sidecar nft ruleset load failed: {}", + stderr.trim() + )); + } + + Ok(()) +} + /// Run an `ip` command inside a network namespace via `nsenter --net=`. /// /// We use `nsenter` instead of `ip netns exec` because `ip netns exec` @@ -605,6 +778,11 @@ fn enable_nf_log_all_netns() { /// Well-known paths where nft may be installed. const NFT_SEARCH_PATHS: &[&str] = &["/usr/sbin/nft", "/sbin/nft", "/usr/bin/nft"]; +const IPTABLES_LEGACY_SEARCH_PATHS: &[&str] = &[ + "/usr/sbin/iptables-legacy", + "/sbin/iptables-legacy", + "/usr/bin/iptables-legacy", +]; fn find_trusted_binary<'a>(name: &str, paths: &'a [&str]) -> Result<&'a str> { paths @@ -629,6 +807,12 @@ fn find_nft() -> Option<String> { .map(String::from) } +fn find_iptables_legacy() -> Option<String> { + find_trusted_binary("iptables-legacy", IPTABLES_LEGACY_SEARCH_PATHS) + .ok() + .map(String::from) +} + #[cfg(test)] mod tests { use super::*; @@ -668,6 +852,16 @@ mod tests { } } + #[test] + fn iptables_legacy_search_paths_are_absolute() { + for path in IPTABLES_LEGACY_SEARCH_PATHS { + assert!( + path.starts_with('/'), + "IPTABLES_LEGACY_SEARCH_PATHS entry must be absolute: {path}" + ); + } + } + #[test] #[ignore = "requires root privileges"] fn test_create_and_drop_namespace() { diff --git a/crates/openshell-supervisor-process/src/netns/nft_ruleset.rs b/crates/openshell-supervisor-process/src/netns/nft_ruleset.rs index ba7aeb936..d7ec5132e 100644 --- a/crates/openshell-supervisor-process/src/netns/nft_ruleset.rs +++ b/crates/openshell-supervisor-process/src/netns/nft_ruleset.rs @@ -53,6 +53,46 @@ pub fn generate_bypass_ruleset(host_ip: &str, proxy_port: u16, log_prefix: Optio ) } +/// Generate a pod-network ruleset for Kubernetes sidecar enforcement. +/// +/// The network sidecar and the process supervisor share a pod network +/// namespace. 
The sidecar runs as `proxy_uid` and owns external egress; +/// sandbox traffic must use loopback services hosted by that sidecar +/// (gateway forward and HTTP CONNECT proxy). +pub fn generate_sidecar_bypass_ruleset(proxy_uid: u32, log_prefix: Option<&str>) -> String { + let log_tcp = log_prefix + .map(|p| { + format!( + "\n tcp flags syn limit rate 5/second burst 10 packets log prefix \"{p}\" flags skuid" + ) + }) + .unwrap_or_default(); + let log_udp = log_prefix + .map(|p| { + format!( + "\n meta l4proto udp limit rate 5/second burst 10 packets log prefix \"{p}\" flags skuid" + ) + }) + .unwrap_or_default(); + + format!( + r#"table inet openshell_sidecar_bypass {{ + chain output {{ + type filter hook output priority 0; policy accept; + + oifname "lo" accept + ct state established,related accept + meta skuid {proxy_uid} accept{log_tcp} + meta nfproto ipv4 meta l4proto tcp reject with icmp type port-unreachable + meta nfproto ipv6 meta l4proto tcp reject with icmpv6 type port-unreachable{log_udp} + meta nfproto ipv4 meta l4proto udp reject with icmp type port-unreachable + meta nfproto ipv6 meta l4proto udp reject with icmpv6 type port-unreachable + }} +}} +"# + ) +} + #[cfg(test)] mod tests { use super::*; @@ -145,4 +185,27 @@ mod tests { "UDP log rule must come before UDP reject rule" ); } + + #[test] + fn sidecar_ruleset_allows_supervisor_uid_and_loopback() { + let ruleset = generate_sidecar_bypass_ruleset(1337, None); + assert!(ruleset.contains("table inet openshell_sidecar_bypass")); + assert!(ruleset.contains("oifname \"lo\" accept")); + assert!(ruleset.contains("meta skuid 1337 accept")); + } + + #[test] + fn sidecar_ruleset_rejects_tcp_and_udp_egress() { + let ruleset = generate_sidecar_bypass_ruleset(0, Some("openshell:sidecar:test:")); + assert!(ruleset.contains("meta nfproto ipv4 meta l4proto tcp reject")); + assert!(ruleset.contains("meta nfproto ipv6 meta l4proto tcp reject")); + assert!(ruleset.contains("meta nfproto ipv4 meta l4proto udp reject")); 
+ assert!(ruleset.contains("meta nfproto ipv6 meta l4proto udp reject")); + assert_eq!( + ruleset + .matches("log prefix \"openshell:sidecar:test:\"") + .count(), + 2 + ); + } } diff --git a/crates/openshell-supervisor-process/src/process.rs b/crates/openshell-supervisor-process/src/process.rs index c1b6b4532..1bf5a055b 100644 --- a/crates/openshell-supervisor-process/src/process.rs +++ b/crates/openshell-supervisor-process/src/process.rs @@ -11,7 +11,7 @@ use crate::netns::NetworkNamespace; use crate::sandbox; use miette::{IntoDiagnostic, Result}; use nix::sys::signal::{self, Signal}; -use nix::unistd::{Group, Pid, User}; +use nix::unistd::{Gid, Group, Pid, Uid, User}; use openshell_core::policy::{NetworkMode, SandboxPolicy}; use std::collections::HashMap; use std::ffi::CString; @@ -28,11 +28,37 @@ use std::sync::OnceLock; use tokio::process::{Child, Command}; use tracing::debug; +/// Process/filesystem enforcement performed by the process supervisor. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum ProcessEnforcementMode { + /// Preserve the existing supervisor behavior: prepare filesystem policy, + /// drop privileges, and apply Landlock/seccomp to workload processes. + Full, + /// Preserve process launch and SSH/session behavior, but skip controls + /// that require root or extra Linux capabilities. Kubernetes sidecar mode + /// uses this when network policy is enforced by the network sidecar. 
+ NetworkOnly, +} + +impl ProcessEnforcementMode { + #[must_use] + pub const fn enforces_process_controls(self) -> bool { + matches!(self, Self::Full) + } +} + const SUPERVISOR_ONLY_ENV_VARS: &[&str] = &[ openshell_core::sandbox_env::SANDBOX_TOKEN, openshell_core::sandbox_env::SANDBOX_TOKEN_FILE, openshell_core::sandbox_env::K8S_SA_TOKEN_FILE, + openshell_core::sandbox_env::TLS_CA, + openshell_core::sandbox_env::TLS_CERT, + openshell_core::sandbox_env::TLS_KEY, openshell_core::sandbox_env::PROVIDER_SPIFFE_WORKLOAD_API_SOCKET, + openshell_core::sandbox_env::PROXY_URL, + openshell_core::sandbox_env::PROXY_BIND_ADDR, + openshell_core::sandbox_env::PROXY_CA_CERT_PATH, + openshell_core::sandbox_env::PROXY_CA_KEY_PATH, ]; pub fn is_supervisor_only_env_var(key: &str) -> bool { @@ -54,6 +80,35 @@ fn inject_provider_env(cmd: &mut Command, provider_env: &HashMap } } +fn configured_proxy_url( + policy: &SandboxPolicy, + netns_proxy_enabled: bool, +) -> Result<Option<String>> { + if !matches!(policy.network.mode, NetworkMode::Proxy) { + return Ok(None); + } + + if let Ok(proxy_url) = std::env::var(openshell_core::sandbox_env::PROXY_URL) { + let trimmed = proxy_url.trim(); + if !trimmed.is_empty() { + return Ok(Some(trimmed.to_string())); + } + } + + let proxy = policy.network.proxy.as_ref().ok_or_else(|| { + miette::miette!("Network mode is set to proxy but no proxy configuration was provided") + })?; + + if netns_proxy_enabled { + let port = proxy.http_addr.map_or(3128, |addr| addr.port()); + return Ok(Some(format!("http://10.200.0.1:{port}"))); + } + + Ok(proxy + .http_addr + .map(|http_addr| format!("http://{http_addr}"))) +} + #[cfg(unix)] pub fn harden_child_process() -> Result<()> { use rustix::process::{Resource, Rlimit, setrlimit}; @@ -443,6 +498,7 @@ impl ProcessHandle { workdir: Option<&str>, interactive: bool, policy: &SandboxPolicy, + enforcement_mode: ProcessEnforcementMode, netns: Option<&NetworkNamespace>, ca_paths: Option<&(PathBuf, PathBuf)>, provider_env: &HashMap, @@ 
-453,6 +509,7 @@ impl ProcessHandle { workdir, interactive, policy, + enforcement_mode, netns.and_then(NetworkNamespace::ns_fd), ca_paths, provider_env, @@ -465,12 +522,14 @@ impl ProcessHandle { /// /// Returns an error if the process fails to start. #[cfg(not(target_os = "linux"))] + #[allow(clippy::too_many_arguments)] pub fn spawn( program: &str, args: &[String], workdir: Option<&str>, interactive: bool, policy: &SandboxPolicy, + enforcement_mode: ProcessEnforcementMode, ca_paths: Option<&(PathBuf, PathBuf)>, provider_env: &HashMap, ) -> Result { @@ -480,6 +539,7 @@ impl ProcessHandle { workdir, interactive, policy, + enforcement_mode, ca_paths, provider_env, ) @@ -493,6 +553,7 @@ impl ProcessHandle { workdir: Option<&str>, interactive: bool, policy: &SandboxPolicy, + enforcement_mode: ProcessEnforcementMode, netns_fd: Option, ca_paths: Option<&(PathBuf, PathBuf)>, provider_env: &HashMap, @@ -517,27 +578,11 @@ impl ProcessHandle { cmd.current_dir(dir); } - if matches!(policy.network.mode, NetworkMode::Proxy) { - let proxy = policy.network.proxy.as_ref().ok_or_else(|| { - miette::miette!( - "Network mode is set to proxy but no proxy configuration was provided" - ) - })?; - // When using network namespace, set proxy URL to the veth host IP - if netns_fd.is_some() { - // The proxy is on 10.200.0.1:3128 (or configured port) - let port = proxy.http_addr.map_or(3128, |addr| addr.port()); - let proxy_url = format!("http://10.200.0.1:{port}"); - // Both uppercase and lowercase variants: curl/wget use uppercase, - // gRPC C-core (libgrpc) checks lowercase http_proxy/https_proxy. - for (key, value) in child_env::proxy_env_vars(&proxy_url) { - cmd.env(key, value); - } - } else if let Some(http_addr) = proxy.http_addr { - let proxy_url = format!("http://{http_addr}"); - for (key, value) in child_env::proxy_env_vars(&proxy_url) { - cmd.env(key, value); - } + if let Some(proxy_url) = configured_proxy_url(policy, netns_fd.is_some())? 
{ + // Both uppercase and lowercase variants: curl/wget use uppercase, + // gRPC C-core (libgrpc) checks lowercase http_proxy/https_proxy. + for (key, value) in child_env::proxy_env_vars(&proxy_url) { + cmd.env(key, value); } } @@ -552,18 +597,30 @@ impl ProcessHandle { // process where the tracing subscriber is functional. The child's // pre_exec context cannot reliably emit structured logs. #[cfg(target_os = "linux")] - sandbox::linux::log_sandbox_readiness(policy, workdir); + if enforcement_mode.enforces_process_controls() { + sandbox::linux::log_sandbox_readiness(policy, workdir); + } // Phase 1 (as root): Prepare Landlock ruleset by opening PathFds. // This MUST happen before drop_privileges() so that root-only paths // (e.g. mode 700 directories) can be opened. See issue #803. #[cfg(target_os = "linux")] - let prepared_sandbox = sandbox::linux::prepare(policy, workdir) - .map_err(|err| miette::miette!("Failed to prepare sandbox: {err}"))?; + let prepared_sandbox = if enforcement_mode.enforces_process_controls() { + Some( + sandbox::linux::prepare(policy, workdir) + .map_err(|err| miette::miette!("Failed to prepare sandbox: {err}"))?, + ) + } else { + None + }; #[cfg(target_os = "linux")] - let supervisor_identity_mount = supervisor_identity_mount_from_env().map_err(|err| { - miette::miette!("Failed to prepare supervisor identity isolation: {err}") - })?; + let supervisor_identity_mount = if enforcement_mode.enforces_process_controls() { + supervisor_identity_mount_from_env().map_err(|err| { + miette::miette!("Failed to prepare supervisor identity isolation: {err}") + })? + } else { + None + }; // Set up process group for signal handling (non-interactive mode only). // In interactive mode, we inherit the parent's process group to maintain @@ -575,7 +632,7 @@ impl ProcessHandle { // Wrap in Option so we can .take() it out of the FnMut closure. // pre_exec is only called once (after fork, before exec). 
#[cfg(target_os = "linux")] - let mut prepared_sandbox = Some(prepared_sandbox); + let mut prepared_sandbox = prepared_sandbox; #[allow(unsafe_code)] unsafe { cmd.pre_exec(move || { @@ -600,8 +657,10 @@ impl ProcessHandle { // Drop privileges. initgroups/setgid/setuid need access to // /etc/group and /etc/passwd which would be blocked if // Landlock were already enforced. - drop_privileges(&policy) - .map_err(|err| std::io::Error::other(err.to_string()))?; + if enforcement_mode.enforces_process_controls() { + drop_privileges(&policy) + .map_err(|err| std::io::Error::other(err.to_string()))?; + } harden_child_process().map_err(|err| std::io::Error::other(err.to_string()))?; @@ -629,12 +688,14 @@ impl ProcessHandle { } #[cfg(not(target_os = "linux"))] + #[allow(clippy::too_many_arguments)] fn spawn_impl( program: &str, args: &[String], workdir: Option<&str>, interactive: bool, policy: &SandboxPolicy, + enforcement_mode: ProcessEnforcementMode, ca_paths: Option<&(PathBuf, PathBuf)>, provider_env: &HashMap, ) -> Result { @@ -656,17 +717,9 @@ impl ProcessHandle { cmd.current_dir(dir); } - if matches!(policy.network.mode, NetworkMode::Proxy) { - let proxy = policy.network.proxy.as_ref().ok_or_else(|| { - miette::miette!( - "Network mode is set to proxy but no proxy configuration was provided" - ) - })?; - if let Some(http_addr) = proxy.http_addr { - let proxy_url = format!("http://{http_addr}"); - for (key, value) in child_env::proxy_env_vars(&proxy_url) { - cmd.env(key, value); - } + if let Some(proxy_url) = configured_proxy_url(policy, false)? { + for (key, value) in child_env::proxy_env_vars(&proxy_url) { + cmd.env(key, value); } } @@ -697,13 +750,17 @@ impl ProcessHandle { // Drop privileges before applying sandbox restrictions. // initgroups/setgid/setuid need access to /etc/group and /etc/passwd // which may be blocked by Landlock. 
- drop_privileges(&policy) - .map_err(|err| std::io::Error::other(err.to_string()))?; + if enforcement_mode.enforces_process_controls() { + drop_privileges(&policy) + .map_err(|err| std::io::Error::other(err.to_string()))?; + } harden_child_process().map_err(|err| std::io::Error::other(err.to_string()))?; - sandbox::apply(&policy, workdir.as_deref()) - .map_err(|err| std::io::Error::other(err.to_string()))?; + if enforcement_mode.enforces_process_controls() { + sandbox::apply(&policy, workdir.as_deref()) + .map_err(|err| std::io::Error::other(err.to_string()))?; + } Ok(()) }); @@ -788,17 +845,36 @@ impl Drop for ProcessHandle { } } -/// Validate that the `sandbox` user exists in this image. +/// Validate that the configured sandbox identity exists in this image. /// -/// All sandbox images must include a `sandbox` user for privilege dropping. -/// This check runs at supervisor startup (inside the container) where we can -/// inspect `/etc/passwd`. If the user is missing, the sandbox fails fast -/// with a clear error instead of silently running child processes as root. +/// When the identity is the literal `"sandbox"`, verifies the user exists +/// in `/etc/passwd` (all sandbox images ship with one). +/// +/// When the identity is a numeric UID, skips the passwd lookup entirely — +/// the kernel will use the resolved UID regardless of whether an entry +/// exists in `/etc/passwd`. Logs an OCSF event confirming numeric UID usage. +/// Non-numeric, non-"sandbox" values are rejected. #[cfg(unix)] pub fn validate_sandbox_user(policy: &SandboxPolicy) -> Result<()> { - let user_name = policy.process.run_as_user.as_deref().unwrap_or("sandbox"); + let identity = policy.process.run_as_user.as_deref().unwrap_or("sandbox"); + + // Numeric UID — no passwd entry required; kernel resolves directly. 
+ if openshell_policy::is_valid_sandbox_identity(identity) && identity.parse::().is_ok() { + openshell_ocsf::ocsf_emit!( + openshell_ocsf::ConfigStateChangeBuilder::new(openshell_ocsf::ctx::ctx()) + .severity(openshell_ocsf::SeverityId::Informational) + .status(openshell_ocsf::StatusId::Success) + .state(openshell_ocsf::StateId::Enabled, "validated") + .message(format!( + "Accepted numeric UID {identity} (no passwd entry required)" + )) + .build() + ); + return Ok(()); + } - if user_name.is_empty() || user_name == "sandbox" { + // "sandbox" name — must exist in /etc/passwd. + if identity == "sandbox" { match User::from_name("sandbox") { Ok(Some(_)) => { openshell_ocsf::ocsf_emit!( @@ -820,11 +896,36 @@ pub fn validate_sandbox_user(policy: &SandboxPolicy) -> Result<()> { return Err(miette::miette!("failed to look up 'sandbox' user: {e}")); } } + } else if !identity.is_empty() { + // Non-numeric, non-sandbox string — attempt passwd lookup. + // This catches cases where someone accidentally put "root" or similar. + match User::from_name(identity) { + Ok(Some(_)) => { + tracing::warn!( + identity, + "non-sandbox user accepted via passwd entry; \ + consider using a numeric UID for UID-injected images" + ); + } + Ok(None) => { + return Err(miette::miette!( + "unrecognized sandbox identity '{identity}'; \ + expected 'sandbox' or a numeric UID in range [{MIN_SANDBOX_UID}, {MAX_SANDBOX_UID}]" + )); + } + Err(e) => { + return Err(miette::miette!( + "failed to look up identity '{identity}': {e}" + )); + } + } } Ok(()) } +pub use openshell_policy::{MAX_SANDBOX_UID, MIN_SANDBOX_UID}; + /// Prepare a `read_write` path for the sandboxed process. /// /// Returns `true` when the path was created by the supervisor and therefore @@ -863,9 +964,13 @@ fn prepare_read_write_path(path: &Path) -> Result { /// Creates `read_write` directories if they don't exist and sets ownership /// on newly-created paths to the configured sandbox user/group. 
This runs as /// the supervisor (root) before forking the child process. +/// +/// Accepts both name-based identities (resolved via `/etc/passwd`) and numeric +/// UIDs/GIDs (passed directly to `chown` without a passwd lookup). #[cfg(unix)] pub fn prepare_filesystem(policy: &SandboxPolicy) -> Result<()> { use nix::unistd::chown; + use nix::unistd::{Gid, Uid}; let user_name = match policy.process.run_as_user.as_deref() { Some(name) if !name.is_empty() => Some(name), @@ -881,27 +986,22 @@ pub fn prepare_filesystem(policy: &SandboxPolicy) -> Result<()> { return Ok(()); } - // Resolve user and group - let uid = if let Some(name) = user_name { - Some( - User::from_name(name) - .into_diagnostic()? - .ok_or_else(|| miette::miette!("Sandbox user not found: {name}"))? - .uid, - ) - } else { - None + // Resolve UID: numeric values are passed directly; names resolve via passwd. + let uid = match user_name { + Some(name) if name.parse::().is_ok() => { + Some(Uid::from_raw(name.parse().into_diagnostic()?)) + } + Some(name) => User::from_name(name).into_diagnostic()?.map(|u| u.uid), + _ => None, }; - let gid = if let Some(name) = group_name { - Some( - Group::from_name(name) - .into_diagnostic()? - .ok_or_else(|| miette::miette!("Sandbox group not found: {name}"))? - .gid, - ) - } else { - None + // Resolve GID: numeric values are passed directly; names resolve via group. + let gid = match group_name { + Some(name) if name.parse::().is_ok() => { + Some(Gid::from_raw(name.parse().into_diagnostic()?)) + } + Some(name) => Group::from_name(name).into_diagnostic()?.map(|g| g.gid), + _ => None, }; // Create missing read_write paths and only chown the ones we created. @@ -954,27 +1054,59 @@ pub fn drop_privileges(policy: &SandboxPolicy) -> Result<()> { return Ok(()); } - let user = if let Some(name) = user_name { - User::from_name(name) - .into_diagnostic()? - .ok_or_else(|| miette::miette!("Sandbox user not found: {name}"))? 
- } else { - User::from_uid(nix::unistd::geteuid()) - .into_diagnostic()? - .ok_or_else(|| miette::miette!("Failed to resolve current user"))? + // Resolve UID: numeric values are used directly; names resolve via passwd. + let target_uid = match user_name { + Some(name) if name.parse::().is_ok() => Uid::from_raw(name.parse().into_diagnostic()?), + Some(name) => { + User::from_name(name) + .into_diagnostic()? + .ok_or_else(|| miette::miette!("Sandbox user not found: {name}"))? + .uid + } + None => nix::unistd::geteuid(), }; - let group = if let Some(name) = group_name { - Group::from_name(name) + // Resolve group: if a numeric GID is configured use it directly. + // Otherwise try name resolution, then fall back to current user's primary group. + let target_gid = match group_name { + Some(name) if name.parse::().is_ok() => Gid::from_raw(name.parse().into_diagnostic()?), + Some(name) => { + Group::from_name(name) + .into_diagnostic()? + .ok_or_else(|| miette::miette!("Sandbox group not found: {name}"))? + .gid + } + None => match target_uid.as_raw() { + 0 => nix::unistd::getegid(), + _ => Group::from_gid( + User::from_uid(target_uid) + .into_diagnostic()? + .ok_or_else(|| miette::miette!("Failed to resolve user from UID {target_uid}"))? + .gid, + ) .into_diagnostic()? - .ok_or_else(|| miette::miette!("Sandbox group not found: {name}"))? + .map_or_else(nix::unistd::getegid, |g| g.gid), + }, + }; + + // Resolve the user record for initgroups (if name-based) or skip it (numeric UID). + let user = if user_name.is_some() { + Some( + User::from_uid(target_uid) + .into_diagnostic()? + .ok_or_else(|| { + miette::miette!("Failed to resolve user record for UID {target_uid}") + })?, + ) } else { - Group::from_gid(user.gid) - .into_diagnostic()? - .ok_or_else(|| miette::miette!("Failed to resolve user primary group"))? + None }; - if user_name.is_some() { + // Set supplementary groups only when we have a name-based identity. 
+ // Numeric UIDs may not have a passwd entry, so initgroups would fail. + if let Some(ref user) = user + && target_uid != nix::unistd::geteuid() + { let user_cstr = CString::new(user.name.clone()).map_err(|_| miette::miette!("Invalid user name"))?; #[cfg(any( @@ -993,18 +1125,20 @@ pub fn drop_privileges(policy: &SandboxPolicy) -> Result<()> { target_os = "redox" )))] { - nix::unistd::initgroups(user_cstr.as_c_str(), group.gid).into_diagnostic()?; + nix::unistd::initgroups(user_cstr.as_c_str(), target_gid).into_diagnostic()?; } } - nix::unistd::setgid(group.gid).into_diagnostic()?; + if target_gid != nix::unistd::getegid() { + nix::unistd::setgid(target_gid).into_diagnostic()?; + } // Verify effective GID actually changed (defense-in-depth, CWE-250 / CERT POS37-C) let effective_gid = nix::unistd::getegid(); - if effective_gid != group.gid { + if effective_gid != target_gid { return Err(miette::miette!( "Privilege drop verification failed: expected effective GID {}, got {}", - group.gid, + target_gid, effective_gid )); } @@ -1012,15 +1146,17 @@ pub fn drop_privileges(policy: &SandboxPolicy) -> Result<()> { #[cfg(target_os = "linux")] drop_capability_bounding_set()?; - if user_name.is_some() { - nix::unistd::setuid(user.uid).into_diagnostic()?; + if let Some(_user) = user { + if target_uid != nix::unistd::geteuid() { + nix::unistd::setuid(target_uid).into_diagnostic()?; + } // Verify effective UID actually changed (defense-in-depth, CWE-250 / CERT POS37-C) let effective_uid = nix::unistd::geteuid(); - if effective_uid != user.uid { + if effective_uid != target_uid { return Err(miette::miette!( "Privilege drop verification failed: expected effective UID {}, got {}", - user.uid, + target_uid, effective_uid )); } @@ -1028,11 +1164,11 @@ pub fn drop_privileges(policy: &SandboxPolicy) -> Result<()> { // Verify root cannot be re-acquired (CERT POS37-C hardening). // If we dropped from root, setuid(0) must fail; success means privileges // were not fully relinquished. 
- if nix::unistd::setuid(nix::unistd::Uid::from_raw(0)).is_ok() && user.uid.as_raw() != 0 { + if nix::unistd::setuid(Uid::from_raw(0)).is_ok() && target_uid.as_raw() != 0 { return Err(miette::miette!( "Privilege drop verification failed: process can still re-acquire root (UID 0) \ after switching to UID {}", - user.uid + target_uid )); } } @@ -1601,7 +1737,7 @@ mod tests { let current_user = User::from_uid(nix::unistd::geteuid()) .unwrap() .expect("current user entry"); - let restricted_group = Group::from_gid(nix::unistd::Gid::from_raw(0)) + let restricted_group = Group::from_gid(Gid::from_raw(0)) .unwrap() .expect("gid 0 group entry"); if restricted_group.gid == nix::unistd::getegid() { @@ -1736,4 +1872,54 @@ mod tests { Some(PathBuf::from("/run/spire")) ); } + + // ---- Numeric UID tests (Phase 2) ---- + + #[test] + fn drop_privileges_accepts_numeric_uid() { + // When running as non-root, a numeric UID/GID that matches the + // current process should succeed without any passwd lookup. + if nix::unistd::geteuid().is_root() { + return; + } + + let uid_raw = nix::unistd::geteuid().as_raw(); + let gid_raw = nix::unistd::getegid().as_raw(); + + let policy = policy_with_process(ProcessPolicy { + run_as_user: Some(uid_raw.to_string()), + run_as_group: Some(gid_raw.to_string()), + }); + + assert!( + drop_privileges(&policy).is_ok(), + "should accept current process UID/GID as numeric strings" + ); + } + + #[test] + fn drop_privileges_numeric_uid_skips_initgroups() { + // When running as non-root with a numeric user but group matches, + // initgroups should not be called (guard: target_uid != geteuid()). + if nix::unistd::geteuid().is_root() { + return; + } + + let current_uid = nix::unistd::geteuid().as_raw(); + + // Refer to the current group by name rather than by numeric GID.
+ let current_group = Group::from_gid(nix::unistd::getegid()) + .expect("should resolve current group") + .expect("current group should exist"); + + let policy = policy_with_process(ProcessPolicy { + run_as_user: Some(current_uid.to_string()), // numeric UID, no passwd entry needed + run_as_group: Some(current_group.name), // name-based group + }); + + assert!( + drop_privileges(&policy).is_ok(), + "should accept numeric UID with name-based group (initgroups guarded)" + ); + } } diff --git a/crates/openshell-supervisor-process/src/run.rs b/crates/openshell-supervisor-process/src/run.rs index 5a5c203a2..10bf2f157 100644 --- a/crates/openshell-supervisor-process/src/run.rs +++ b/crates/openshell-supervisor-process/src/run.rs @@ -33,7 +33,7 @@ use openshell_core::denial::DenialEvent; #[cfg(target_os = "linux")] use crate::managed_children; -use crate::process::ProcessHandle; +use crate::process::{ProcessEnforcementMode, ProcessHandle}; fn ocsf_ctx() -> &'static openshell_ocsf::SandboxContext { openshell_ocsf::ctx::ctx() @@ -57,6 +57,7 @@ pub async fn run_process( openshell_endpoint: Option<&str>, ssh_socket_path: Option, policy: &SandboxPolicy, + enforcement_mode: ProcessEnforcementMode, entrypoint_pid: Arc, provider_credentials: ProviderCredentialState, provider_env: std::collections::HashMap, @@ -71,13 +72,17 @@ pub async fn run_process( // must include a "sandbox" user for privilege dropping; failing fast here // beats silently running children as root. #[cfg(unix)] - crate::process::validate_sandbox_user(policy)?; + if enforcement_mode.enforces_process_controls() { + crate::process::validate_sandbox_user(policy)?; + } // Create read_write directories and chown newly-created ones to the // sandbox user/group. Runs as the supervisor (root) before the child // is forked so the workload sees writable paths it owns. 
#[cfg(unix)] - crate::process::prepare_filesystem(policy)?; + if enforcement_mode.enforces_process_controls() { + crate::process::prepare_filesystem(policy)?; + } // Eagerly fetch initial settings and install the agent skill if the // proposals flag is on at startup, rather than waiting for the policy @@ -198,31 +203,10 @@ pub async fn run_process( // their env so cooperative tools (curl, npm, Node) route through the // CONNECT proxy. Linux uses the netns host_ip; on other targets fall back // to the policy-declared http_addr directly. - let ssh_proxy_url = if matches!(policy.network.mode, NetworkMode::Proxy) { - #[cfg(target_os = "linux")] - { - netns.map(|ns| { - let port = policy - .network - .proxy - .as_ref() - .and_then(|p| p.http_addr) - .map_or(3128, |addr| addr.port()); - format!("http://{}:{port}", ns.host_ip()) - }) - } - #[cfg(not(target_os = "linux"))] - { - policy - .network - .proxy - .as_ref() - .and_then(|p| p.http_addr) - .map(|addr| format!("http://{addr}")) - } - } else { - None - }; + #[cfg(target_os = "linux")] + let ssh_proxy_url = ssh_proxy_url_for_policy(policy, netns.map(NetworkNamespace::host_ip)); + #[cfg(not(target_os = "linux"))] + let ssh_proxy_url = ssh_proxy_url_for_policy(policy, None); let ssh_socket_path: Option = ssh_socket_path.map(std::path::PathBuf::from); if let Some(listen_path) = ssh_socket_path.clone() { @@ -251,6 +235,7 @@ pub async fn run_process( ca_paths, provider_credentials_clone, user_env_clone, + enforcement_mode, ) .await { @@ -317,6 +302,7 @@ pub async fn run_process( workdir, interactive, policy, + enforcement_mode, netns, ca_file_paths.as_ref(), &provider_env, @@ -329,12 +315,16 @@ pub async fn run_process( workdir, interactive, policy, + enforcement_mode, ca_file_paths.as_ref(), &provider_env, )?; // Store the entrypoint PID so the proxy can resolve TCP peer identity entrypoint_pid.store(handle.pid(), Ordering::Release); + if let Some(path) = entrypoint_pid_file() { + write_entrypoint_pid_file(&path, 
handle.pid())?; + } ocsf_emit!( ProcessActivityBuilder::new(ocsf_ctx()) .activity(ActivityId::Open) @@ -387,6 +377,49 @@ pub async fn run_process( Ok(status.code()) } +fn entrypoint_pid_file() -> Option { + std::env::var(openshell_core::sandbox_env::ENTRYPOINT_PID_FILE) + .ok() + .filter(|value| !value.is_empty()) +} + +fn write_entrypoint_pid_file(path: &str, pid: u32) -> Result<()> { + let pid_path = std::path::Path::new(path); + if let Some(parent) = pid_path.parent() { + std::fs::create_dir_all(parent).into_diagnostic()?; + } + std::fs::write(pid_path, format!("{pid}\n")).into_diagnostic()?; + info!( + path, + pid, "Published workload entrypoint PID for network sidecar" + ); + Ok(()) +} + +fn ssh_proxy_url_for_policy( + policy: &SandboxPolicy, + netns_proxy_host: Option, +) -> Option { + if !matches!(policy.network.mode, NetworkMode::Proxy) { + return None; + } + + if let Ok(proxy_url) = std::env::var(openshell_core::sandbox_env::PROXY_URL) { + let trimmed = proxy_url.trim(); + if !trimmed.is_empty() { + return Some(trimmed.to_string()); + } + } + + let proxy = policy.network.proxy.as_ref()?; + if let Some(host) = netns_proxy_host { + let port = proxy.http_addr.map_or(3128, |addr| addr.port()); + return Some(format!("http://{host}:{port}")); + } + + proxy.http_addr.map(|addr| format!("http://{addr}")) +} + /// Eagerly fetch initial settings and install the agent-driven policy /// proposal skill if the flag is on at startup. 
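Reviewer sketch (not part of the patch): the precedence that `ssh_proxy_url_for_policy` applies can be restated as a standalone function, which may make the ordering easier to check against the tests added further down. `ProxyConfig`, `resolve_proxy_url`, and `DEFAULT_PROXY_PORT` are hypothetical names introduced only for this illustration, and the environment lookup is replaced by an explicit `env_override` argument so the logic stays pure.

```rust
use std::net::{IpAddr, SocketAddr};

/// Hypothetical, trimmed-down stand-in for the proxy section of the policy.
pub struct ProxyConfig {
    pub http_addr: Option<SocketAddr>,
}

/// Port used when the policy declares a proxy but no explicit address.
const DEFAULT_PROXY_PORT: u16 = 3128;

/// Illustrative restatement of the proxy URL precedence (assumed names).
pub fn resolve_proxy_url(
    proxy_mode: bool,
    env_override: Option<&str>,
    proxy: Option<&ProxyConfig>,
    netns_host: Option<IpAddr>,
) -> Option<String> {
    // 1. Outside proxy mode no URL is ever produced.
    if !proxy_mode {
        return None;
    }
    // 2. A non-blank override wins over everything derived from the policy.
    if let Some(url) = env_override.map(str::trim).filter(|url| !url.is_empty()) {
        return Some(url.to_string());
    }
    // 3. Without a proxy section there is nothing left to derive a URL from.
    let proxy = proxy?;
    // 4. A namespace host is combined with the policy port (or the default).
    //    Like the patch, this formats the host as-is, so it assumes IPv4.
    if let Some(host) = netns_host {
        let port = proxy.http_addr.map_or(DEFAULT_PROXY_PORT, |addr| addr.port());
        return Some(format!("http://{host}:{port}"));
    }
    // 5. Otherwise fall back to the policy-declared address, if any.
    proxy.http_addr.map(|addr| format!("http://{addr}"))
}
```

Under these assumptions, calling `resolve_proxy_url(true, None, Some(&proxy), Some(host))` with a policy address on port 8080 yields the namespace host paired with port 8080, matching the intent of `ssh_proxy_url_prefers_netns_host_with_policy_port`.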
/// @@ -443,3 +476,81 @@ async fn install_initial_agent_skill(sandbox_id: Option<&str>, openshell_endpoin ); } } + +#[cfg(test)] +mod tests { + use super::*; + use openshell_core::policy::{ + FilesystemPolicy, LandlockPolicy, NetworkMode, NetworkPolicy, ProcessPolicy, ProxyPolicy, + }; + + static PROXY_ENV_LOCK: std::sync::Mutex<()> = std::sync::Mutex::new(()); + + fn policy(mode: NetworkMode, http_addr: Option) -> SandboxPolicy { + SandboxPolicy { + version: 1, + filesystem: FilesystemPolicy::default(), + network: NetworkPolicy { + mode, + proxy: http_addr.map(|http_addr| ProxyPolicy { + http_addr: Some(http_addr), + }), + }, + landlock: LandlockPolicy::default(), + process: ProcessPolicy::default(), + } + } + + fn with_proxy_url(proxy_url: Option<&str>, test: F) -> T + where + F: FnOnce() -> T, + { + let _guard = PROXY_ENV_LOCK.lock().expect("proxy env lock poisoned"); + temp_env::with_var(openshell_core::sandbox_env::PROXY_URL, proxy_url, test) + } + + #[test] + fn ssh_proxy_url_uses_policy_addr_without_netns() { + with_proxy_url(None, || { + let policy = policy(NetworkMode::Proxy, Some(([127, 0, 0, 1], 3128).into())); + + assert_eq!( + ssh_proxy_url_for_policy(&policy, None).as_deref(), + Some("http://127.0.0.1:3128") + ); + }); + } + + #[test] + fn ssh_proxy_url_prefers_netns_host_with_policy_port() { + with_proxy_url(None, || { + let policy = policy(NetworkMode::Proxy, Some(([127, 0, 0, 1], 8080).into())); + + assert_eq!( + ssh_proxy_url_for_policy(&policy, Some([10, 200, 0, 1].into())).as_deref(), + Some("http://10.200.0.1:8080") + ); + }); + } + + #[test] + fn ssh_proxy_url_skips_non_proxy_mode() { + with_proxy_url(None, || { + let policy = policy(NetworkMode::Allow, Some(([127, 0, 0, 1], 3128).into())); + + assert_eq!(ssh_proxy_url_for_policy(&policy, None), None); + }); + } + + #[test] + fn ssh_proxy_url_prefers_env_override() { + with_proxy_url(Some("http://openshell-supervisor.default.svc:3128"), || { + let policy = policy(NetworkMode::Proxy, 
Some(([127, 0, 0, 1], 8080).into())); + + assert_eq!( + ssh_proxy_url_for_policy(&policy, Some([10, 200, 0, 1].into())).as_deref(), + Some("http://openshell-supervisor.default.svc:3128") + ); + }); + } +} diff --git a/crates/openshell-supervisor-process/src/ssh.rs b/crates/openshell-supervisor-process/src/ssh.rs index 955ec780c..c55a6d877 100644 --- a/crates/openshell-supervisor-process/src/ssh.rs +++ b/crates/openshell-supervisor-process/src/ssh.rs @@ -6,7 +6,7 @@ use crate::child_env; #[cfg(target_os = "linux")] use crate::managed_children; -use crate::process::{drop_privileges, is_supervisor_only_env_var}; +use crate::process::{ProcessEnforcementMode, drop_privileges, is_supervisor_only_env_var}; use crate::sandbox; use miette::{IntoDiagnostic, Result}; use nix::pty::{Winsize, openpty}; @@ -42,6 +42,7 @@ type SshServerInit = ( fn ssh_server_init( listen_path: &Path, ca_file_paths: &Option<(PathBuf, PathBuf)>, + enforcement_mode: ProcessEnforcementMode, ) -> Result { let mut rng = OsRng; let host_key = PrivateKey::random(&mut rng, Algorithm::Ed25519).into_diagnostic()?; @@ -55,13 +56,16 @@ fn ssh_server_init( let config = Arc::new(config); let ca_paths = ca_file_paths.as_ref().map(|p| Arc::new(p.clone())); - // Ensure the parent directory exists and is root-owned with 0700 - // permissions. The sandbox entrypoint runs as an unprivileged user; it - // must not be able to enter this directory and connect to the socket. + // In full enforcement mode the supervisor starts as root and can isolate + // the SSH socket in a root-only directory before spawning unprivileged + // children. In network-only sidecar mode the process supervisor itself + // runs as the sandbox UID, so the driver points the socket at a writable + // sidecar state volume and accepts that Unix permissions no longer isolate + // same-UID child processes from the socket. 
if let Some(parent) = listen_path.parent() { std::fs::create_dir_all(parent).into_diagnostic()?; #[cfg(unix)] - { + if enforcement_mode.enforces_process_controls() { use std::os::unix::fs::PermissionsExt; let perms = std::fs::Permissions::from_mode(0o700); std::fs::set_permissions(parent, perms).into_diagnostic()?; @@ -108,21 +112,23 @@ pub async fn run_ssh_server( ca_file_paths: Option<(PathBuf, PathBuf)>, provider_credentials: ProviderCredentialState, user_environment: HashMap, + enforcement_mode: ProcessEnforcementMode, ) -> Result<()> { - let (listener, config, ca_paths) = match ssh_server_init(&listen_path, &ca_file_paths) { - Ok(v) => { - // Signal that the SSH server has bound the socket and is ready to - // accept connections. The parent task awaits this before spawning - // the entrypoint process, ensuring exec requests won't race - // against server startup. - let _ = ready_tx.send(Ok(())); - v - } - Err(err) => { - let _ = ready_tx.send(Err(err)); - return Ok(()); - } - }; + let (listener, config, ca_paths) = + match ssh_server_init(&listen_path, &ca_file_paths, enforcement_mode) { + Ok(v) => { + // Signal that the SSH server has bound the socket and is ready to + // accept connections. The parent task awaits this before spawning + // the entrypoint process, ensuring exec requests won't race + // against server startup. + let _ = ready_tx.send(Ok(())); + v + } + Err(err) => { + let _ = ready_tx.send(Err(err)); + return Ok(()); + } + }; loop { let (stream, _peer) = listener.accept().await.into_diagnostic()?; @@ -145,6 +151,7 @@ pub async fn run_ssh_server( ca_paths, provider_credentials, user_environment, + enforcement_mode, ) .await { @@ -172,6 +179,7 @@ async fn handle_connection( ca_file_paths: Option>, provider_credentials: ProviderCredentialState, user_environment: HashMap, + enforcement_mode: ProcessEnforcementMode, ) -> Result<()> { // Access is gated by the Unix-socket filesystem permissions (root-only), // not by an application-level preface. 
The supervisor bridges the @@ -195,6 +203,7 @@ async fn handle_connection( ca_file_paths, provider_credentials, user_environment, + enforcement_mode, ); russh::server::run_stream(config, stream, handler) .await @@ -223,6 +232,7 @@ struct SshHandler { ca_file_paths: Option>, provider_credentials: ProviderCredentialState, user_environment: HashMap, + enforcement_mode: ProcessEnforcementMode, channels: HashMap, } @@ -236,6 +246,7 @@ impl SshHandler { ca_file_paths: Option>, provider_credentials: ProviderCredentialState, user_environment: HashMap, + enforcement_mode: ProcessEnforcementMode, ) -> Self { Self { policy, @@ -245,6 +256,7 @@ impl SshHandler { ca_file_paths, provider_credentials, user_environment, + enforcement_mode, channels: HashMap::new(), } } @@ -468,6 +480,7 @@ impl russh::server::Handler for SshHandler { self.ca_file_paths.clone(), &self.provider_credentials.child_env_with_gcp_resolved(), &self.user_environment, + self.enforcement_mode, )?; let state = self.channels.get_mut(&channel).ok_or_else(|| { anyhow::anyhow!("subsystem_request on unknown channel {channel:?}") @@ -564,6 +577,7 @@ impl SshHandler { self.ca_file_paths.clone(), &provider_env, &self.user_environment, + self.enforcement_mode, )?; state.pty_master = Some(pty_master); state.input_sender = Some(input_sender); @@ -582,6 +596,7 @@ impl SshHandler { self.ca_file_paths.clone(), &provider_env, &self.user_environment, + self.enforcement_mode, )?; state.input_sender = Some(input_sender); } @@ -661,12 +676,20 @@ impl Default for PtyRequest { /// Derive the session USER and HOME from the policy's `run_as_user`. /// -/// Falls back to `("sandbox", "/sandbox")` when the policy has no explicit user, -/// preserving backward compatibility with images that use the default layout. +/// For name-based identities, looks up the home directory via `/etc/passwd` +/// (or defaults to `/home/{user}`). 
+/// +/// For numeric UIDs, there is no passwd entry — falls back to +/// `("{uid}", "/sandbox")` so the agent session still has a meaningful +/// USER identifier. fn session_user_and_home(policy: &SandboxPolicy) -> (String, String) { match policy.process.run_as_user.as_deref() { Some(user) if !user.is_empty() => { - // Look up the user's home directory from /etc/passwd. + // Numeric UID — no passwd entry expected; use default HOME. + if user.parse::().is_ok() { + return (user.to_string(), "/sandbox".to_string()); + } + // Name-based identity — look up home from /etc/passwd. let home = nix::unistd::User::from_name(user) .ok() .flatten() @@ -740,6 +763,7 @@ fn spawn_pty_shell( ca_file_paths: Option>, provider_env: &HashMap, user_environment: &HashMap, + enforcement_mode: ProcessEnforcementMode, ) -> anyhow::Result<(std::fs::File, mpsc::Sender>)> { let winsize = Winsize { ws_row: to_u16(pty.row_height.max(1)), @@ -798,12 +822,20 @@ fn spawn_pty_shell( // Probe Landlock availability from the parent process where tracing works. #[cfg(target_os = "linux")] - sandbox::linux::log_sandbox_readiness(policy, workdir.as_deref()); + if enforcement_mode.enforces_process_controls() { + sandbox::linux::log_sandbox_readiness(policy, workdir.as_deref()); + } // Phase 1 (as root): Prepare Landlock ruleset before drop_privileges. 
#[cfg(target_os = "linux")] - let prepared_sandbox = sandbox::linux::prepare(policy, workdir.as_deref()) - .map_err(|err| anyhow::anyhow!("Failed to prepare sandbox: {err}"))?; + let prepared_sandbox = if enforcement_mode.enforces_process_controls() { + Some( + sandbox::linux::prepare(policy, workdir.as_deref()) + .map_err(|err| anyhow::anyhow!("Failed to prepare sandbox: {err}"))?, + ) + } else { + None + }; #[cfg(unix)] { @@ -813,6 +845,7 @@ fn spawn_pty_shell( workdir.clone(), slave_fd, netns_fd, + enforcement_mode, #[cfg(target_os = "linux")] prepared_sandbox, )?; @@ -905,6 +938,7 @@ fn spawn_pipe_exec( ca_file_paths: Option>, provider_env: &HashMap, user_environment: &HashMap, + enforcement_mode: ProcessEnforcementMode, ) -> anyhow::Result>> { let mut cmd = command.map_or_else( || { @@ -947,12 +981,20 @@ fn spawn_pipe_exec( // Probe Landlock availability from the parent process where tracing works. #[cfg(target_os = "linux")] - sandbox::linux::log_sandbox_readiness(policy, workdir.as_deref()); + if enforcement_mode.enforces_process_controls() { + sandbox::linux::log_sandbox_readiness(policy, workdir.as_deref()); + } // Phase 1 (as root): Prepare Landlock ruleset before drop_privileges. 
#[cfg(target_os = "linux")] - let prepared_sandbox = sandbox::linux::prepare(policy, workdir.as_deref()) - .map_err(|err| anyhow::anyhow!("Failed to prepare sandbox: {err}"))?; + let prepared_sandbox = if enforcement_mode.enforces_process_controls() { + Some( + sandbox::linux::prepare(policy, workdir.as_deref()) + .map_err(|err| anyhow::anyhow!("Failed to prepare sandbox: {err}"))?, + ) + } else { + None + }; #[cfg(unix)] { @@ -961,6 +1003,7 @@ fn spawn_pipe_exec( policy.clone(), workdir.clone(), netns_fd, + enforcement_mode, #[cfg(target_os = "linux")] prepared_sandbox, )?; @@ -1060,7 +1103,9 @@ fn spawn_pipe_exec( mod unsafe_pty { #[cfg(not(target_os = "linux"))] use super::sandbox; - use super::{Command, RawFd, SandboxPolicy, Winsize, drop_privileges, setsid}; + use super::{ + Command, ProcessEnforcementMode, RawFd, SandboxPolicy, Winsize, drop_privileges, setsid, + }; #[cfg(unix)] use std::os::unix::process::CommandExt; @@ -1099,17 +1144,21 @@ mod unsafe_pty { _workdir: Option, slave_fd: RawFd, netns_fd: Option, - #[cfg(target_os = "linux")] prepared: crate::sandbox::linux::PreparedSandbox, + enforcement_mode: ProcessEnforcementMode, + #[cfg(target_os = "linux")] prepared: Option, ) -> anyhow::Result<()> { // Wrap in Option so we can .take() it out of the FnMut closure. // pre_exec is only called once (after fork, before exec). #[cfg(target_os = "linux")] - let mut prepared = Some(prepared); + let mut prepared = prepared; #[cfg(target_os = "linux")] - let supervisor_identity_mount = crate::process::supervisor_identity_mount_from_env() - .map_err(|err| { + let supervisor_identity_mount = if enforcement_mode.enforces_process_controls() { + crate::process::supervisor_identity_mount_from_env().map_err(|err| { anyhow::anyhow!("failed to prepare supervisor identity isolation: {err}") - })?; + })? 
+ } else { + None + }; unsafe { cmd.pre_exec(move || { setsid().map_err(|err| std::io::Error::other(err.to_string()))?; @@ -1118,6 +1167,7 @@ mod unsafe_pty { enter_netns_and_sandbox( netns_fd, &policy, + enforcement_mode, #[cfg(target_os = "linux")] supervisor_identity_mount, #[cfg(target_os = "linux")] @@ -1144,20 +1194,25 @@ mod unsafe_pty { policy: SandboxPolicy, _workdir: Option, netns_fd: Option, - #[cfg(target_os = "linux")] prepared: crate::sandbox::linux::PreparedSandbox, + enforcement_mode: ProcessEnforcementMode, + #[cfg(target_os = "linux")] prepared: Option, ) -> anyhow::Result<()> { #[cfg(target_os = "linux")] - let mut prepared = Some(prepared); + let mut prepared = prepared; #[cfg(target_os = "linux")] - let supervisor_identity_mount = crate::process::supervisor_identity_mount_from_env() - .map_err(|err| { + let supervisor_identity_mount = if enforcement_mode.enforces_process_controls() { + crate::process::supervisor_identity_mount_from_env().map_err(|err| { anyhow::anyhow!("failed to prepare supervisor identity isolation: {err}") - })?; + })? + } else { + None + }; unsafe { cmd.pre_exec(move || { enter_netns_and_sandbox( netns_fd, &policy, + enforcement_mode, #[cfg(target_os = "linux")] supervisor_identity_mount, #[cfg(target_os = "linux")] @@ -1171,6 +1226,7 @@ mod unsafe_pty { fn enter_netns_and_sandbox( netns_fd: Option, policy: &SandboxPolicy, + enforcement_mode: ProcessEnforcementMode, #[cfg(target_os = "linux")] supervisor_identity_mount: Option< &crate::process::SupervisorIdentityMountNamespace, >, @@ -1199,7 +1255,9 @@ mod unsafe_pty { // Drop privileges. initgroups/setgid/setuid need /etc/group and // /etc/passwd which would be blocked if Landlock were already enforced. 
- drop_privileges(policy).map_err(|err| std::io::Error::other(err.to_string()))?; + if enforcement_mode.enforces_process_controls() { + drop_privileges(policy).map_err(|err| std::io::Error::other(err.to_string()))?; + } crate::process::harden_child_process() .map_err(|err| std::io::Error::other(err.to_string()))?; @@ -1212,7 +1270,9 @@ mod unsafe_pty { } #[cfg(not(target_os = "linux"))] - sandbox::apply(policy, None).map_err(|err| std::io::Error::other(err.to_string()))?; + if enforcement_mode.enforces_process_controls() { + sandbox::apply(policy, None).map_err(|err| std::io::Error::other(err.to_string()))?; + } Ok(()) } @@ -1527,6 +1587,112 @@ mod tests { assert_eq!(rx_b.recv().unwrap(), b"still-alive"); } + // ----------------------------------------------------------------------- + // session_user_and_home tests (Phase 2: numeric UID support) + // ----------------------------------------------------------------------- + + #[test] + fn session_user_and_home_returns_numeric_uid_as_user() { + use openshell_core::policy::{ + FilesystemPolicy, LandlockPolicy, NetworkPolicy, ProcessPolicy, + }; + let policy = SandboxPolicy { + version: 1, + filesystem: FilesystemPolicy::default(), + network: NetworkPolicy::default(), + landlock: LandlockPolicy::default(), + process: ProcessPolicy { + run_as_user: Some("1000".into()), + run_as_group: None, + }, + }; + let (user, home) = session_user_and_home(&policy); + assert_eq!(user, "1000"); + // Numeric UID has no passwd entry — defaults to /sandbox. 
+ assert_eq!(home, "/sandbox"); + } + + #[test] + fn session_user_and_home_returns_name_from_passwd() { + use openshell_core::policy::{ + FilesystemPolicy, LandlockPolicy, NetworkPolicy, ProcessPolicy, + }; + let policy = SandboxPolicy { + version: 1, + filesystem: FilesystemPolicy::default(), + network: NetworkPolicy::default(), + landlock: LandlockPolicy::default(), + process: ProcessPolicy { + run_as_user: Some("sandbox".into()), + run_as_group: None, + }, + }; + let (user, home) = session_user_and_home(&policy); + assert_eq!(user, "sandbox"); + // Name-based — should resolve via passwd (or /home/{user}). + assert!(!home.is_empty()); + } + + #[test] + fn session_user_and_home_defaults_to_sandbox_when_empty() { + use openshell_core::policy::{ + FilesystemPolicy, LandlockPolicy, NetworkPolicy, ProcessPolicy, + }; + let policy = SandboxPolicy { + version: 1, + filesystem: FilesystemPolicy::default(), + network: NetworkPolicy::default(), + landlock: LandlockPolicy::default(), + process: ProcessPolicy { + run_as_user: Some(String::new()), + run_as_group: None, + }, + }; + let (user, home) = session_user_and_home(&policy); + assert_eq!(user, "sandbox"); + assert_eq!(home, "/sandbox"); + } + + #[test] + fn session_user_and_home_defaults_to_sandbox_when_none() { + use openshell_core::policy::{ + FilesystemPolicy, LandlockPolicy, NetworkPolicy, ProcessPolicy, + }; + let policy = SandboxPolicy { + version: 1, + filesystem: FilesystemPolicy::default(), + network: NetworkPolicy::default(), + landlock: LandlockPolicy::default(), + process: ProcessPolicy { + run_as_user: None, + run_as_group: None, + }, + }; + let (user, home) = session_user_and_home(&policy); + assert_eq!(user, "sandbox"); + assert_eq!(home, "/sandbox"); + } + + #[test] + fn session_user_and_home_handles_large_numeric_uid() { + use openshell_core::policy::{ + FilesystemPolicy, LandlockPolicy, NetworkPolicy, ProcessPolicy, + }; + let policy = SandboxPolicy { + version: 1, + filesystem: 
FilesystemPolicy::default(), + network: NetworkPolicy::default(), + landlock: LandlockPolicy::default(), + process: ProcessPolicy { + run_as_user: Some("1000660000".into()), + run_as_group: None, + }, + }; + let (user, home) = session_user_and_home(&policy); + assert_eq!(user, "1000660000"); + assert_eq!(home, "/sandbox"); + } + /// `install_pre_exec_no_pty` runs drop_privileges and succeeds when the /// current user/group is already the configured one (no actual uid change). /// @@ -1567,21 +1733,24 @@ mod tests { policy, None, None, // no netns fd + ProcessEnforcementMode::Full, #[cfg(target_os = "linux")] - sandbox::linux::prepare( - &SandboxPolicy { - version: 0, - filesystem: FilesystemPolicy::default(), - network: NetworkPolicy::default(), - landlock: LandlockPolicy::default(), - process: ProcessPolicy { - run_as_user: None, - run_as_group: None, + Some( + sandbox::linux::prepare( + &SandboxPolicy { + version: 0, + filesystem: FilesystemPolicy::default(), + network: NetworkPolicy::default(), + landlock: LandlockPolicy::default(), + process: ProcessPolicy { + run_as_user: None, + run_as_group: None, + }, }, - }, - None, - ) - .expect("prepare should succeed in test environment"), + None, + ) + .expect("prepare should succeed in test environment"), + ), ) .expect("install pre_exec should succeed"); diff --git a/deploy/docker/Dockerfile.supervisor b/deploy/docker/Dockerfile.supervisor index c84cc70e9..c760bbc89 100644 --- a/deploy/docker/Dockerfile.supervisor +++ b/deploy/docker/Dockerfile.supervisor @@ -5,10 +5,10 @@ # Supervisor image build. # -# The final image is `scratch`: it only carries the static `openshell-sandbox` -# binary used by Docker extraction, Podman image volumes, and the Kubernetes -# init container copy-self path. A static musl binary lets the image stay -# `scratch` while still being executable as an init container. 
+# The final image carries the static `openshell-sandbox` binary used by Docker +# extraction, Podman image volumes, and the Kubernetes init container copy-self +# path. It also includes nftables so the Kubernetes supervisor sidecar can +# install pod-namespace egress enforcement rules. # # The Rust binary is built natively before this image build runs and staged at: # deploy/docker/.build/prebuilt-binaries//openshell-sandbox @@ -19,17 +19,16 @@ # target) and uploads it as an artifact, which is downloaded into the same # staging directory before the image build job runs. -FROM scratch AS supervisor +FROM alpine:3.22 AS supervisor ARG TARGETARCH -# --chmod=0550 drops world-execute and survives the actions/upload-artifact -# + download-artifact roundtrip (which strips exec perms). Ownership is left -# at root (0:0) deliberately: the Podman driver mounts this image as a -# read-only image volume into the sandbox container and drops DAC_OVERRIDE, -# so the container's UID 0 must own the binary to read+exec it. Mode 0550 -# (r-xr-x---) is the security win; the chown to a non-root UID was breaking -# Podman without buying anything since the container is always UID 0. -COPY --chmod=0550 deploy/docker/.build/prebuilt-binaries/${TARGETARCH}/openshell-sandbox /openshell-sandbox +RUN apk add --no-cache nftables iptables iptables-legacy + +# --chmod=0555 restores execute bits after the actions/upload-artifact + +# download-artifact roundtrip strips them. Ownership stays root (0:0) for +# Podman image-volume mounts, while world-execute lets the Kubernetes +# network sidecar run this binary as the dedicated non-root proxy UID. 
+COPY --chmod=0555 deploy/docker/.build/prebuilt-binaries/${TARGETARCH}/openshell-sandbox /openshell-sandbox ENTRYPOINT ["/openshell-sandbox"] diff --git a/deploy/helm/openshell/README.md b/deploy/helm/openshell/README.md index e6d539592..477a59839 100644 --- a/deploy/helm/openshell/README.md +++ b/deploy/helm/openshell/README.md @@ -236,7 +236,9 @@ add `ci/values-spire.yaml` to the OpenShell release values files. | supervisor.image.pullPolicy | string | `""` | Supervisor image pull policy. Defaults to the gateway image pull policy when empty. | | supervisor.image.repository | string | `"ghcr.io/nvidia/openshell/supervisor"` | Supervisor image repository. | | supervisor.image.tag | string | `""` | Supervisor image tag. Defaults to the chart appVersion when empty. | +| supervisor.proxyUid | int | `1337` | UID for the long-running network sidecar or proxy supervisor pod. In sidecar topology, the network init container installs nftables rules that exempt this UID. | | supervisor.sideloadMethod | string | `""` | How the supervisor binary is delivered into sandbox pods. Empty (default) = auto-detect from cluster version: K8s >= v1.35 -> "image-volume" (ImageVolume enabled by default; GA in v1.36) K8s < v1.35 -> "init-container" (copies via init container + emptyDir) On K8s v1.33-v1.34 with the ImageVolume feature gate manually enabled, set this to "image-volume" explicitly. | +| supervisor.topology | string | `"combined"` | Supervisor pod topology for Kubernetes sandboxes. "combined" runs the current single supervisor container in the agent pod. "sidecar" runs network enforcement in a dedicated sidecar and the process supervisor as a low-capability wrapper in the agent container. "proxy-pod" runs network enforcement in a separate supervisor Deployment and restricts the agent pod to that supervisor through NetworkPolicy. | | tolerations | list | `[]` | Tolerations for the gateway pod. 
| | workload.allowMultiReplicaStatefulSet | bool | `false` | Allow replicaCount > 1 while rendering a StatefulSet. Prefer workload.kind=deployment for external database-backed multi-replica gateways; this override exists for operators who explicitly require StatefulSet identity or storage semantics. | | workload.kind | string | `"statefulset"` | Gateway workload controller kind. Use `statefulset` for the default SQLite database, or `deployment` when server.externalDbSecret points at an external database. | diff --git a/deploy/helm/openshell/ci/values-proxy-pod.yaml b/deploy/helm/openshell/ci/values-proxy-pod.yaml new file mode 100644 index 000000000..b7cb533fd --- /dev/null +++ b/deploy/helm/openshell/ci/values-proxy-pod.yaml @@ -0,0 +1,18 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +# CI/dev overlay for exercising the Kubernetes proxy-pod topology. +# +# This topology relies on Kubernetes NetworkPolicy enforcement: the agent pod is +# isolated to its paired supervisor pod plus DNS. The local k3s/k3d workflow +# must therefore run with the k3s network policy controller enabled, or with a +# custom policy-enforcing CNI installed before deploying this profile. +# +# Merge after values.yaml and ci/values-skaffold.yaml: +# helm install ... -f values.yaml -f ci/values-skaffold.yaml -f ci/values-proxy-pod.yaml +# +# Or set: +# OPENSHELL_E2E_KUBE_EXTRA_VALUES=deploy/helm/openshell/ci/values-proxy-pod.yaml +# before running `mise run e2e:kubernetes`. +supervisor: + topology: proxy-pod diff --git a/deploy/helm/openshell/ci/values-sidecar.yaml b/deploy/helm/openshell/ci/values-sidecar.yaml new file mode 100644 index 000000000..dac9e810f --- /dev/null +++ b/deploy/helm/openshell/ci/values-sidecar.yaml @@ -0,0 +1,13 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +# CI/dev overlay for exercising the Kubernetes supervisor sidecar topology. +# +# Merge after values.yaml and ci/values-skaffold.yaml: +# helm install ... -f values.yaml -f ci/values-skaffold.yaml -f ci/values-sidecar.yaml +# +# Or set: +# OPENSHELL_E2E_KUBE_EXTRA_VALUES=deploy/helm/openshell/ci/values-sidecar.yaml +# before running `mise run e2e:kubernetes`. +supervisor: + topology: sidecar diff --git a/deploy/helm/openshell/skaffold.yaml b/deploy/helm/openshell/skaffold.yaml index d571c3bd9..cf99be69a 100644 --- a/deploy/helm/openshell/skaffold.yaml +++ b/deploy/helm/openshell/skaffold.yaml @@ -119,6 +119,13 @@ deploy: # To enable SPIFFE/SPIRE provider token grants (requires the # spire-crds and spire releases above): #- ci/values-spire.yaml + # To exercise the Kubernetes supervisor sidecar topology: + #- ci/values-sidecar.yaml + # To exercise proxy-pod topology, use the proxy-pod Skaffold profile + # against a cluster with NetworkPolicy enforcement enabled. Stock k3s + # includes its embedded network policy controller; if you replace the + # CNI, install a policy-enforcing CNI before deploying this profile. 
+ #- ci/values-proxy-pod.yaml # To test multi-replica external PostgreSQL behavior: #- ci/values-high-availability.yaml setValueTemplates: @@ -126,3 +133,14 @@ deploy: image.tag: '{{.IMAGE_TAG_openshell_gateway}}' supervisor.image.repository: '{{.IMAGE_REPO_openshell_supervisor}}' supervisor.image.tag: '{{.IMAGE_TAG_openshell_supervisor}}' +profiles: + - name: sidecar + patches: + - op: add + path: /deploy/helm/releases/0/valuesFiles/- + value: ci/values-sidecar.yaml + - name: proxy-pod + patches: + - op: add + path: /deploy/helm/releases/0/valuesFiles/- + value: ci/values-proxy-pod.yaml diff --git a/deploy/helm/openshell/templates/gateway-config.yaml b/deploy/helm/openshell/templates/gateway-config.yaml index 7037be88f..75e7c9af5 100644 --- a/deploy/helm/openshell/templates/gateway-config.yaml +++ b/deploy/helm/openshell/templates/gateway-config.yaml @@ -113,6 +113,8 @@ data: grpc_endpoint = {{ include "openshell.grpcEndpoint" . | quote }} service_account_name = {{ include "openshell.sandboxServiceAccountName" . | quote }} supervisor_sideload_method = {{ include "openshell.supervisorSideloadMethod" . | quote }} + supervisor_topology = {{ .Values.supervisor.topology | default "combined" | quote }} + proxy_uid = {{ .Values.supervisor.proxyUid | default 1337 }} sa_token_ttl_secs = {{ .Values.server.sandboxJwt.k8sSaTokenTtlSecs | default 3600 }} {{- if .Values.server.providerTokenGrants.spiffe.enabled }} provider_spiffe_workload_api_socket_path = {{ .Values.server.providerTokenGrants.spiffe.workloadApiSocketPath | quote }} diff --git a/deploy/helm/openshell/templates/role.yaml b/deploy/helm/openshell/templates/role.yaml index 5ecc4428a..76d07e992 100644 --- a/deploy/helm/openshell/templates/role.yaml +++ b/deploy/helm/openshell/templates/role.yaml @@ -35,10 +35,54 @@ rules: # returned pod name and UID to the pod's `openshell.io/sandbox-id` # annotation. 
patch is intentionally NOT granted — the annotation is set # once at pod create and must remain immutable for the lifetime of the - # sandbox. + # sandbox. create/delete/list/watch are intentionally not granted; the Agent + # Sandbox controller creates agent pods, and proxy-pod supervisors are + # managed through per-sandbox Deployments. - apiGroups: - "" resources: - pods verbs: - get + # Proxy-pod topology creates one supervisor Deployment, one supervisor + # Service, and one CA Secret per sandbox. All are owner-referenced to the + # Sandbox CR for garbage collection. The gateway also reads the generated + # ReplicaSet during K8s ServiceAccount bootstrap to verify the supervisor + # pod's Pod -> ReplicaSet -> Deployment -> Sandbox owner chain. + - apiGroups: + - apps + resources: + - deployments + verbs: + - create + - delete + - get + - list + - watch + - apiGroups: + - apps + resources: + - replicasets + verbs: + - get + - apiGroups: + - "" + resources: + - services + - secrets + verbs: + - create + - delete + - get + - list + - watch + - apiGroups: + - networking.k8s.io + resources: + - networkpolicies + verbs: + - create + - delete + - get + - list + - watch diff --git a/deploy/helm/openshell/tests/gateway_config_test.yaml b/deploy/helm/openshell/tests/gateway_config_test.yaml index c2708a20f..283c85b85 100644 --- a/deploy/helm/openshell/tests/gateway_config_test.yaml +++ b/deploy/helm/openshell/tests/gateway_config_test.yaml @@ -83,6 +83,33 @@ tests: path: data["gateway.toml"] pattern: '(?ms)\[openshell\.drivers\.kubernetes\].*?service_account_name\s*=\s*"openshell-sandbox"' + - it: renders supervisor topology under [openshell.drivers.kubernetes] + template: templates/gateway-config.yaml + set: + supervisor.topology: sidecar + asserts: + - matchRegex: + path: data["gateway.toml"] + pattern: '(?ms)\[openshell\.drivers\.kubernetes\].*?supervisor_topology\s*=\s*"sidecar"' + + - it: renders proxy-pod supervisor topology under [openshell.drivers.kubernetes] + 
template: templates/gateway-config.yaml + set: + supervisor.topology: proxy-pod + asserts: + - matchRegex: + path: data["gateway.toml"] + pattern: '(?ms)\[openshell\.drivers\.kubernetes\].*?supervisor_topology\s*=\s*"proxy-pod"' + + - it: renders proxy uid under [openshell.drivers.kubernetes] + template: templates/gateway-config.yaml + set: + supervisor.proxyUid: 2200 + asserts: + - matchRegex: + path: data["gateway.toml"] + pattern: '(?ms)\[openshell\.drivers\.kubernetes\].*?proxy_uid\s*=\s*2200' + - it: renders sandbox image pull secrets under [openshell.drivers.kubernetes] template: templates/gateway-config.yaml set: diff --git a/deploy/helm/openshell/tests/sandbox_namespace_test.yaml b/deploy/helm/openshell/tests/sandbox_namespace_test.yaml index ee89fce53..d2a3f27dd 100644 --- a/deploy/helm/openshell/tests/sandbox_namespace_test.yaml +++ b/deploy/helm/openshell/tests/sandbox_namespace_test.yaml @@ -57,6 +57,49 @@ tests: path: metadata.namespace value: other-ns + - it: grants only pod get for sandbox token bootstrap + template: templates/role.yaml + asserts: + - contains: + path: rules + content: + apiGroups: + - "" + resources: + - pods + verbs: + - get + + - it: grants sandbox RBAC for proxy-pod supervisor Deployments + template: templates/role.yaml + asserts: + - contains: + path: rules + content: + apiGroups: + - apps + resources: + - deployments + verbs: + - create + - delete + - get + - list + - watch + + - it: grants ReplicaSet get for proxy-pod supervisor token bootstrap + template: templates/role.yaml + asserts: + - contains: + path: rules + content: + apiGroups: + - apps + resources: + - replicasets + verbs: + - get + - it: uses explicit sandboxNamespace for sandbox RoleBinding template: templates/rolebinding.yaml set: diff --git a/deploy/helm/openshell/values.yaml b/deploy/helm/openshell/values.yaml index d7ff8b257..2fd3d2ef7 100644 --- a/deploy/helm/openshell/values.yaml +++ b/deploy/helm/openshell/values.yaml @@ -44,6 +44,17 @@ supervisor: # On K8s 
v1.33-v1.34 with the ImageVolume feature gate manually enabled, # set this to "image-volume" explicitly. sideloadMethod: "" + # -- Supervisor pod topology for Kubernetes sandboxes. + # "combined" runs the current single supervisor container in the agent pod. + # "sidecar" runs network enforcement in a dedicated sidecar and the process + # supervisor as a low-capability wrapper in the agent container. + # "proxy-pod" runs network enforcement in a separate supervisor Deployment and + # restricts the agent pod to that supervisor through NetworkPolicy. + topology: "combined" + # -- UID for the long-running network sidecar or proxy supervisor pod. In sidecar + # topology, the network init container installs nftables rules that exempt + # this UID. + proxyUid: 1337 # -- Image pull secrets attached to gateway and helper pods. imagePullSecrets: [] diff --git a/docs/kubernetes/access-control.mdx b/docs/kubernetes/access-control.mdx index 8824b6de1..5409a4b11 100644 --- a/docs/kubernetes/access-control.mdx +++ b/docs/kubernetes/access-control.mdx @@ -5,7 +5,7 @@ title: "Access Control" sidebar-title: "Access Control" description: "Configure OIDC user authentication or reverse-proxy auth termination for a Kubernetes-deployed OpenShell gateway." keywords: "Generative AI, Cybersecurity, Kubernetes, Authentication, mTLS, OIDC, Keycloak, Entra ID, Okta, Gateway Auth" -position: 4 +position: 5 --- The OpenShell gateway supports two access-control models for human callers on Kubernetes: diff --git a/docs/kubernetes/ingress.mdx b/docs/kubernetes/ingress.mdx index a47637073..a572004fb 100644 --- a/docs/kubernetes/ingress.mdx +++ b/docs/kubernetes/ingress.mdx @@ -5,7 +5,7 @@ title: "Ingress" sidebar-title: "Ingress" description: "Expose the OpenShell gateway externally using the Kubernetes Gateway API and a GRPCRoute." 
keywords: "Generative AI, Cybersecurity, Kubernetes, Gateway API, Envoy Gateway, GRPCRoute, Ingress, External Access" -position: 3 +position: 4 --- By default, the OpenShell gateway is only reachable inside the cluster. To let CLI clients connect without a `kubectl port-forward`, expose the gateway through an ingress. diff --git a/docs/kubernetes/managing-certificates.mdx b/docs/kubernetes/managing-certificates.mdx index 179388151..885a5c952 100644 --- a/docs/kubernetes/managing-certificates.mdx +++ b/docs/kubernetes/managing-certificates.mdx @@ -5,7 +5,7 @@ title: "Managing Certificates" sidebar-title: "Managing Certificates" description: "Configure the OpenShell Helm chart to use cert-manager for mTLS certificate issuance and automatic renewal." keywords: "Generative AI, Cybersecurity, Kubernetes, cert-manager, PKI, TLS, mTLS, Certificates" -position: 2 +position: 3 --- The OpenShell gateway uses mTLS certificates for transport between the gateway and sandbox supervisors. These certificates are not Kubernetes user authentication; configure OIDC or a trusted access proxy for user access. The Helm chart supports two ways to provision and manage the certificate bundle: diff --git a/docs/kubernetes/openshift.mdx b/docs/kubernetes/openshift.mdx index b8313bdfe..caf799b51 100644 --- a/docs/kubernetes/openshift.mdx +++ b/docs/kubernetes/openshift.mdx @@ -5,7 +5,7 @@ title: "OpenShift" sidebar-title: "OpenShift" description: "Install the OpenShell Helm chart on OpenShift, including the SCC binding and chart overrides required by OpenShift's Security Context Constraints." 
keywords: "Generative AI, Cybersecurity, Kubernetes, OpenShift, SCC, Security Context Constraints, Helm, Gateway, Installation" -position: 5 +position: 6 --- diff --git a/docs/kubernetes/setup.mdx b/docs/kubernetes/setup.mdx index 5ab786519..9996f337d 100644 --- a/docs/kubernetes/setup.mdx +++ b/docs/kubernetes/setup.mdx @@ -160,6 +160,8 @@ The most commonly changed values are: | `server.enableLoopbackServiceHttp` | Enable local plaintext HTTP for loopback sandbox service URLs. Defaults to `true`. | | `pkiInitJob.serverDnsNames` / `certManager.serverDnsNames` | Additional gateway server DNS SANs. Wildcard SANs also enable sandbox service URLs under that domain. | | `supervisor.sideloadMethod` | How the supervisor binary is delivered into sandbox pods. Leave empty to auto-detect based on cluster version: clusters running Kubernetes 1.35 or later use `image-volume` (ImageVolume GA in 1.36); older clusters use `init-container`. Set explicitly to `image-volume` on Kubernetes 1.33 or 1.34 with the ImageVolume feature gate enabled, or to `init-container` to force the legacy path on any version. | +| `supervisor.topology` | Sandbox pod topology. Leave as `combined` for the original full-enforcement path, set `sidecar` when the agent container should run non-root without added Linux capabilities, or set `proxy-pod` to run network enforcement in a separate supervisor pod with NetworkPolicy isolation. Refer to [Topology](/kubernetes/topology). | +| `supervisor.proxyUid` | Non-root UID for the long-running network sidecar or proxy supervisor pod. The UID must not match the sandbox UID. 
| Use a values file for repeatable deployments: @@ -213,6 +215,10 @@ The namespaced Role covers sandbox lifecycle and identity: | `agents.x-k8s.io` | `sandboxes`, `sandboxes/status` | create, delete, get, list, patch, update, watch | | `""` | `events` | get, list, watch | | `""` | `pods` | get | +| `apps` | `deployments` | create, delete, get, list, watch | +| `apps` | `replicasets` | get | +| `""` | `services`, `secrets` | create, delete, get, list, watch | +| `networking.k8s.io` | `networkpolicies` | create, delete, get, list, watch | The ClusterRole grants node inspection and token validation: @@ -243,6 +249,7 @@ The gateway exposes `/healthz` for process liveness and `/readyz` for dependency ## Next Steps +- To choose between combined, sidecar, and proxy-pod sandbox topology, refer to [Topology](/kubernetes/topology). - To enable automatic certificate rotation with cert-manager, refer to [Managing Certificates](/kubernetes/managing-certificates). - To expose the gateway externally without port-forwarding, refer to [Ingress](/kubernetes/ingress). - To configure OIDC or reverse-proxy authentication, refer to [Access Control](/kubernetes/access-control). diff --git a/docs/kubernetes/topology.mdx b/docs/kubernetes/topology.mdx new file mode 100644 index 000000000..2f19ec6a8 --- /dev/null +++ b/docs/kubernetes/topology.mdx @@ -0,0 +1,202 @@ +--- +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +title: "Kubernetes Sandbox Topology" +sidebar-title: "Topology" +description: "Choose between combined, sidecar, and proxy-pod topology for Kubernetes sandbox pods." +keywords: "Generative AI, Cybersecurity, Kubernetes, Sandboxing, Sidecar, Network Policy, RuntimeClass" +position: 2 +--- + +Kubernetes sandbox pods can run the OpenShell supervisor in `combined`, +`sidecar`, or `proxy-pod` topology. 
Choose the topology based on which controls +you need inside the pod, how much privilege your cluster allows on the agent +container, and whether the cluster enforces Kubernetes NetworkPolicies. + +## Choose a Topology + +The default `combined` topology preserves the full OpenShell enforcement model. +Use `sidecar` only when you accept network-focused enforcement in exchange for a +lower-privilege agent container. + +| Topology | Use when | Main tradeoff | +|---|---|---| +| `combined` | You need OpenShell network, filesystem, and process controls in the sandbox workload. | The agent container carries the Linux capabilities the supervisor needs. | +| `sidecar` | You need the agent container to run as non-root without added Linux capabilities, and network policy is the primary control. | Filesystem policy, process privilege dropping, and process/binary identity checks are not applied by the process supervisor. | +| `proxy-pod` | You need network enforcement to run outside the agent pod and your cluster enforces Kubernetes NetworkPolicies. | Requires a NetworkPolicy-enforcing CNI or controller; process/filesystem enforcement stays network-only. | + +## Privilege Model + +The long-running container permissions differ by topology: + +| Topology | Pod or container | UID/GID | Privilege escalation | Capabilities | Result | +|---|---|---|---|---|---| +| `combined` | Agent container, which also runs the supervisor | Not forced by topology | Not explicitly disabled by the driver | Adds `SYS_ADMIN`, `NET_ADMIN`, `SYS_PTRACE`, and `SYSLOG`; adds `SETUID`, `SETGID`, and `DAC_READ_SEARCH` when user namespaces are enabled | Full supervisor controls run in the agent container. | +| `sidecar` | Agent container, process-only supervisor | `sandbox_uid:sandbox_gid` | `false` | Drops `ALL` | Agent and workload run without added Linux capabilities. 
| +| `sidecar` | Network supervisor sidecar | `proxyUid:sandbox_gid` | `false` | Drops `ALL` | Long-running proxy sidecar is also non-root without added capabilities. | +| `proxy-pod` | Agent pod container, process-only supervisor | `sandbox_uid:sandbox_gid` | `false` | Drops `ALL` | Agent and workload run without added Linux capabilities in their own pod. | +| `proxy-pod` | Supervisor pod container, network proxy only | `proxyUid:sandbox_gid` | `false` | Drops `ALL` | Long-running proxy runs outside the agent pod without added capabilities. | + +Short-lived setup containers still have the permissions needed to prepare the +pod: + +| Topology | Setup container | UID/GID | Privilege escalation | Capabilities | Purpose | +|---|---|---|---|---|---| +| `combined` | Supervisor install init container | `0` | Not set | Not set | Copies the supervisor binary into the agent container volume. | +| `sidecar` | Network init container | `0` | `false` | Drops `ALL`; adds `NET_ADMIN`, `NET_RAW`, `CHOWN`, and `FOWNER` | Installs pod-local nftables rules and prepares shared sidecar state. | +| `proxy-pod` | Supervisor install init container | `0` | Not set | Not set | Copies the supervisor binary into the agent pod volume. | +| `proxy-pod` | Proxy CA install init container | `0:sandbox_gid` | `false` | Drops `ALL` | Copies proxy CA material into the agent pod TLS volume. | + +## Combined Topology + +Combined topology is the original Kubernetes mode and remains the default. The +agent container starts the OpenShell supervisor, and the supervisor launches the +workload after applying sandbox setup. + +Combined topology keeps these controls in one supervisor path: + +- Network endpoint and L7 policy enforcement. +- Filesystem policy enforcement. +- Process and binary identity checks. +- Privilege drop into the sandbox user. +- Gateway relay, SSH sessions, exec, and file sync. 
+ +Because the supervisor performs network namespace setup and process/filesystem +controls from the agent container, Kubernetes grants that container elevated +Linux capabilities. Use this mode when you need the complete OpenShell sandbox +contract and your cluster policy permits those capabilities. + +## Sidecar Topology + +Sidecar topology splits the supervisor into a network sidecar and a +low-privilege process supervisor in the agent container. + +The pod contains these OpenShell-managed pieces: + +| Component | Runs as | Purpose | +|---|---|---| +| Network init container | Root with setup capabilities | Installs pod-level nftables rules and prepares shared sidecar state. | +| Network sidecar | `supervisor.proxyUid` | Runs the proxy, enforces network policy, writes proxy TLS material, and forwards gateway traffic on loopback. | +| Agent container | Resolved sandbox UID/GID | Runs the process supervisor and launches the user workload. | + +In this topology, the agent container runs with `runAsNonRoot: true`, +`allowPrivilegeEscalation: false`, and `capabilities.drop: ["ALL"]`. The +long-running network sidecar also drops all Linux capabilities. The root init +container keeps the setup capabilities needed to configure pod networking. + +Sidecar mode preserves gateway session behavior, including SSH connectivity, +because the process supervisor still owns the session relay. The network sidecar +handles outbound enforcement and forwards the process supervisor's gateway +traffic to the real gateway endpoint. + + +Sidecar mode runs the process supervisor in network-only mode. OpenShell still +enforces network endpoint and L7 policy through the sidecar, but the process +supervisor does not apply Landlock filesystem policy, process privilege +dropping, or process/binary identity checks. + + +## Proxy-Pod Topology + +Proxy-pod topology moves network enforcement and gateway forwarding into a +separate supervisor Deployment with one pod. 
The agent pod runs the process +supervisor and reaches the supervisor through a per-sandbox headless Service. + +OpenShell creates these per-sandbox resources: + +- Agent pod labeled `openshell.ai/sandbox-role=agent`. +- Supervisor Deployment with one pod labeled `openshell.ai/sandbox-role=supervisor`. +- Headless Service for the supervisor pod. +- Proxy CA Secret shared through mounts. +- NetworkPolicy that limits agent egress to the supervisor pod and DNS. +- NetworkPolicy that accepts supervisor ingress only from the paired agent pod. + +The supervisor Deployment has a controlling `Sandbox` ownerReference so +Kubernetes garbage collection removes it when the sandbox is deleted. The +Deployment recreates the supervisor pod if the pod is deleted independently. + + +Proxy-pod topology requires NetworkPolicy enforcement to work as OpenShell +expects. The target cluster must have a policy-enforcing CNI or equivalent +NetworkPolicy controller before deploying this topology. Without enforcement, +the agent pod is not forced through its paired supervisor proxy, so the +agent-to-supervisor isolation policy is only declarative. + + +## Credential Exposure + +Sidecar and proxy-pod topologies use pod `fsGroup` and group-readable projected +credentials so the non-root process supervisor can authenticate to the gateway. +This includes the projected ServiceAccount token used for sandbox token +bootstrap and the sandbox client TLS secret. + +Treat the agent container as trusted with respect to those in-pod gateway +credentials. Use `combined` topology when that credential exposure is not +acceptable for your deployment. + +## RuntimeClass Isolation + +RuntimeClass isolation can add a stronger container boundary, but support +depends on the topology and runtime: + +- `proxy-pod` has been tested with Kata Containers and gVisor and is functional + when the cluster enforces NetworkPolicies. 
+- `sidecar` is experimental with Kata Containers and is known to fail with + gVisor because sidecar mode depends on pod-local network rule setup. + +Runtime classes do not re-enable the OpenShell filesystem and process controls +that sidecar and proxy-pod modes relax. Use them as an additional workload +boundary, not as a replacement for the combined topology's full supervisor +controls. + +You can set a default runtime class in the Kubernetes driver configuration or +override it per sandbox with driver config: + +```shell +openshell sandbox create \ + --driver-config-json '{"kubernetes":{"pod":{"runtime_class_name":"kata-containers"}}}' \ + -- claude +``` + +## Enable Alternate Topologies + +Set `supervisor.topology=sidecar` in the Helm chart values: + +```yaml +supervisor: + topology: sidecar + proxyUid: 1337 +``` + +`proxyUid` must be a non-root UID and must not match the sandbox UID. +The network init container exempts this UID from proxy redirection so the +sidecar can reach the gateway. + +Set `supervisor.topology=proxy-pod` to use proxy-pod mode: + +```yaml +supervisor: + topology: proxy-pod + proxyUid: 1337 +``` + +The same `proxyUid` value is used as the non-root UID for the proxy +supervisor pod created by the Deployment and must not match the sandbox UID. + +For direct gateway TOML configuration, set the equivalent Kubernetes driver +fields: + +```toml +[openshell.drivers.kubernetes] +supervisor_topology = "sidecar" +proxy_uid = 1337 +``` + +Leave `supervisor.topology` unset, or set it to `combined`, to keep the original +single-container supervisor path. + +## Next Steps + +- To install OpenShell on Kubernetes, refer to [Setup](/kubernetes/setup). +- To configure gateway authentication, refer to [Access Control](/kubernetes/access-control). +- To review the driver fields, refer to [Gateway Configuration File](/reference/gateway-config). 
diff --git a/docs/reference/gateway-config.mdx b/docs/reference/gateway-config.mdx index 2aaa6e7b0..994d6907c 100644 --- a/docs/reference/gateway-config.mdx +++ b/docs/reference/gateway-config.mdx @@ -177,6 +177,18 @@ supervisor_image_pull_policy = "IfNotPresent" # Use the image volume on Kubernetes >= 1.35 (GA in 1.36); switch to "init-container" # on older clusters or where the ImageVolume feature gate is off. supervisor_sideload_method = "image-volume" +# "combined" runs the existing single supervisor container with full process, +# filesystem, and network enforcement in the agent container. "sidecar" moves +# pod-level network enforcement and gateway forwarding into a network sidecar. +# "proxy-pod" moves network enforcement and gateway forwarding into a separate +# supervisor Deployment and uses NetworkPolicy to force agent egress through it. In +# sidecar and proxy-pod modes, the agent container runs non-root with no added +# Linux capabilities and process/filesystem enforcement is network-only. +supervisor_topology = "combined" +# UID used by the long-running network sidecar or proxy supervisor pod. In +# sidecar topology, the network init container installs nftables rules that +# exempt this UID. +proxy_uid = 1337 grpc_endpoint = "https://openshell-gateway.agents.svc:8080" ssh_socket_path = "/run/openshell/ssh.sock" client_tls_secret_name = "openshell-client-tls" @@ -194,6 +206,12 @@ sa_token_ttl_secs = 3600 # shared roots such as /run, /var, /tmp, and /etc are rejected. # Supervisor-to-gateway auth still uses gateway JWTs. provider_spiffe_workload_api_socket_path = "/spiffe-workload-api/spire-agent.sock" +# Explicit sandbox UID/GID for the supervisor container securityContext and +# PVC init container. When unset, the driver auto-detects from OpenShift SCC +# namespace annotations (openshift.io/sa.scc.uid-range) if present, falling +# back to 1000 on non-OpenShift clusters. 
+# sandbox_uid = 1500 +# sandbox_gid = 1500 ``` ### Docker @@ -306,6 +324,9 @@ overlay_disk_mib = 4096 guest_tls_ca = "/var/lib/openshell/guest-tls/ca.pem" guest_tls_cert = "/var/lib/openshell/guest-tls/client.pem" guest_tls_key = "/var/lib/openshell/guest-tls/client-key.pem" +# Resolved sandbox UID/GID for the rootfs /etc/passwd entry. +# Defaults to 10001 when unset; matching GID is used if sandbox_gid is empty. +# sandbox_uid = 20001 ``` ### Extension Driver diff --git a/docs/reference/sandbox-compute-drivers.mdx b/docs/reference/sandbox-compute-drivers.mdx index 341d9e9f4..8dd43497f 100644 --- a/docs/reference/sandbox-compute-drivers.mdx +++ b/docs/reference/sandbox-compute-drivers.mdx @@ -304,10 +304,30 @@ For maintainer-level implementation details, refer to the [Kubernetes driver REA | `supervisor_image` | `supervisor.image.repository` / `supervisor.image.tag` | Set the supervisor image that provides the `openshell-sandbox` binary. | | `supervisor_image_pull_policy` | `supervisor.image.pullPolicy` | Set the Kubernetes image pull policy for the supervisor image. | | `supervisor_sideload_method` | `supervisor.sideloadMethod` | How the supervisor binary is delivered into sandbox pods. Leave empty to auto-detect from cluster version. Set to `image-volume` to mount the supervisor OCI image directly as a volume (requires Kubernetes 1.33+ with the ImageVolume feature gate; GA in 1.36), or `init-container` to copy it through an init container on older clusters. | +| `supervisor_topology` | `supervisor.topology` | Set `combined` for the default single supervisor path, `sidecar` to move pod-level network enforcement and gateway forwarding into a dedicated sidecar, or `proxy-pod` to run network enforcement in a separate supervisor Deployment with NetworkPolicy isolation. | +| `proxy_uid` | `supervisor.proxyUid` | UID used by the long-running network sidecar or proxy supervisor pod. 
In `sidecar` topology, the network init container exempts this UID from proxy redirection. | | `app_armor_profile` | `server.appArmorProfile` | Set the sandbox agent container's AppArmor profile. Helm defaults this to `Unconfined` so AppArmor-enabled nodes do not block supervisor network namespace setup. Set the Helm value to an empty string to omit the field, or use `RuntimeDefault` or `Localhost/` for operator-managed profiles. | | `workspace_default_storage_size` | `server.workspaceDefaultStorageSize` | Set the default workspace PVC size for new sandboxes. | | `sa_token_ttl_secs` | `server.sandboxJwt.k8sSaTokenTtlSecs` | Set the projected ServiceAccount token TTL used for the bootstrap token exchange. | +In `combined` topology, the agent container carries the Linux capabilities +needed by the supervisor for network namespace setup, Landlock filesystem +policy, process privilege changes, and network policy enforcement. In `sidecar` +and `proxy-pod` topology, the agent container runs as the resolved sandbox +UID/GID with no added Linux capabilities. Sidecar mode uses a root init +container for nftables setup and a long-running non-root sidecar. Proxy-pod mode +creates a separate non-root supervisor Deployment with one pod, a headless +Service, a proxy CA Secret, and per-sandbox NetworkPolicies. The Deployment +recreates the supervisor pod if it is deleted. Both modes keep gateway session +and SSH behavior, but the process supervisor runs in network-only mode: +filesystem policy, process privilege dropping, and +process/binary identity checks are not applied by the process container. + +Sidecar mode uses pod `fsGroup` so the non-root process supervisor can read the +projected ServiceAccount token and sandbox client TLS secret required for +gateway authentication. Treat the workload container as trusted with respect to +those in-pod gateway credentials. 
+ The Kubernetes driver creates namespaced `agents.x-k8s.io` `Sandbox` resources from the Kubernetes SIG Apps [agent-sandbox](https://github.com/kubernetes-sigs/agent-sandbox) project. It detects the served Sandbox API at runtime, caches the selected API version for the gateway process, and uses `v1beta1` when available before falling back to `v1alpha1`, so supported Agent Sandbox installations work without version-specific operator configuration. The Agent Sandbox controller turns those resources into sandbox pods and related storage. If Agent Sandbox is upgraded in place, restart the OpenShell gateway after the controller and CRD rollout completes so the gateway can detect the served API versions again. @@ -315,3 +335,30 @@ If Agent Sandbox is upgraded in place, restart the OpenShell gateway after the c `Sandbox.spec.volumeClaimTemplates` is immutable after creation. To change storage configuration, delete the sandbox and create a new one with the updated spec. + +## Sandbox User Identity + +OpenShell accepts both the hardcoded username `"sandbox"` and numeric UIDs in `[1000, 2_000_000_000]` for the supervisor's process identity (the policy's `run_as_user` field). The driver resolves the UID at sandbox creation time and passes it to the supervisor via environment variables. + +### Kubernetes / OpenShift + +The Kubernetes driver auto-detects the sandbox UID from OpenShift SCC namespace annotations: + +- `openshift.io/sa.scc.uid-range` (format: `<start>/<length>`, e.g. `1000000000/10000`) provides the UID. +- `openshift.io/sa.scc.supplemental-groups` provides the GID when present; otherwise the resolved UID is used as the GID. +- On non-OpenShift clusters, or when annotations are absent, the driver falls back to `1000`. + +You can override autodetection with explicit `sandbox_uid` / `sandbox_gid` config in `[openshell.drivers.kubernetes]`. When set, the driver skips namespace annotation lookup entirely.
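The resolution order described above can be sketched in shell. This is an illustration only, not the driver's code: `resolve_sandbox_uid` is a hypothetical helper, and it assumes the UID is taken from the start of the `uid-range` value (the part before the `/`), which the text implies but does not state outright.

```shell
# Hypothetical helper that mirrors the documented precedence:
#   1. explicit sandbox_uid from [openshell.drivers.kubernetes]
#   2. openshift.io/sa.scc.uid-range namespace annotation
#   3. fallback to 1000
resolve_sandbox_uid() {
  local explicit_uid="${1:-}"  # configured sandbox_uid, empty when unset
  local uid_range="${2:-}"     # annotation value, empty when absent

  if [ -n "${explicit_uid}" ]; then
    # Explicit config wins; the annotation is not consulted.
    printf '%s\n' "${explicit_uid}"
  elif [ -n "${uid_range}" ]; then
    # Assumption: use the range start, i.e. everything before the first "/".
    printf '%s\n' "${uid_range%%/*}"
  else
    # Non-OpenShift cluster or missing annotation.
    printf '1000\n'
  fi
}
```

For example, `resolve_sandbox_uid "" "1000000000/10000"` prints `1000000000`, while `resolve_sandbox_uid 1500 "1000000000/10000"` prints `1500` because the explicit value takes precedence.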
+ +The resolved UID/GID appear in: + +- Supervisor container environment variables (`OPENSHELL_SANDBOX_UID`, `OPENSHELL_SANDBOX_GID`) for direct kernel-level privilege dropping without `/etc/passwd` lookups. +- PVC init container `securityContext.runAsUser/runAsGroup/fsGroup` for workspace ownership operations. + +### VM Driver + +The VM driver injects the sandbox UID into the rootfs guest's `/etc/passwd`, `/etc/group`, and `/etc/gshadow` during rootfs preparation. Default UID is `10001`; configure `sandbox_uid` in `[openshell.drivers.vm]` to use a different value. + +### Custom Images + +Custom sandbox images no longer need a baked-in `"sandbox"` user. If your image requires a passwd entry for tools like `sudo` or `ssh`, add one manually (e.g. `RUN useradd -m -u 1500 deploy`). The supervisor resolves the numeric UID directly via `setuid()` without needing `/etc/passwd`. diff --git a/e2e/with-kube-gateway.sh b/e2e/with-kube-gateway.sh index 47b8730dc..8a84fc6ad 100755 --- a/e2e/with-kube-gateway.sh +++ b/e2e/with-kube-gateway.sh @@ -20,6 +20,12 @@ # files, relative to the repository root or absolute, to layer additional chart # configuration on top of ci/values-skaffold.yaml. # +# Proxy-pod topology: +# Use OPENSHELL_E2E_KUBE_EXTRA_VALUES=deploy/helm/openshell/ci/values-proxy-pod.yaml +# or `mise run e2e:kubernetes:proxy-pod`. The target cluster must enforce +# Kubernetes NetworkPolicies; the ephemeral k3d/k3s path keeps k3s's embedded +# network policy controller enabled. +# # Image source: # - Ephemeral k3d mode builds local `openshell/{gateway,supervisor}:${IMAGE_TAG}` # images by default, imports them into k3d, then installs the chart. This @@ -80,6 +86,7 @@ EXTERNAL_PG_FIXTURE_SERVICE="openshell-e2e-postgres" EXTERNAL_PG_FIXTURE_USER="openshell" EXTERNAL_PG_FIXTURE_PASSWORD="openshell-e2e-postgres" EXTERNAL_PG_FIXTURE_DATABASE="openshell" +PROXY_POD_E2E=0 # Isolate CLI/SDK gateway metadata from the developer's real config. 
export XDG_CONFIG_HOME="${WORKDIR}/config" @@ -393,6 +400,46 @@ require_cmd() { fi } +configure_fixture_container_engine() { + local selected_engine="" + + if [ -n "${CONTAINER_ENGINE:-}" ]; then + selected_engine="$(printf '%s' "${CONTAINER_ENGINE}" | tr '[:upper:]' '[:lower:]')" + case "${selected_engine}" in + docker|podman) + export CONTAINER_ENGINE="${selected_engine}" + return + ;; + *) + echo "ERROR: CONTAINER_ENGINE=${CONTAINER_ENGINE} is invalid; expected docker or podman" >&2 + exit 2 + ;; + esac + fi + + case "${KUBE_CONTEXT}" in + k3d-*) + selected_engine="docker" + ;; + kind-*) + case "$(printf '%s' "${KIND_EXPERIMENTAL_PROVIDER:-}" | tr '[:upper:]' '[:lower:]')" in + podman) + selected_engine="podman" + ;; + *) + selected_engine="docker" + ;; + esac + ;; + *) + return + ;; + esac + + export CONTAINER_ENGINE="${selected_engine}" + echo "Using ${CONTAINER_ENGINE} for Kubernetes e2e host-side fixture containers." +} + require_cmd helm require_cmd kubectl require_cmd curl @@ -423,6 +470,8 @@ else KUBE_CONTEXT="k3d-${CLUSTER_NAME}" fi +configure_fixture_container_engine + if [ -z "${OPENSHELL_E2E_KUBE_BUILD_IMAGES+x}" ]; then if [ "${CLUSTER_CREATED_BY_US}" = "1" ]; then OPENSHELL_E2E_KUBE_BUILD_IMAGES=1 @@ -569,6 +618,9 @@ if [ -n "${OPENSHELL_E2E_KUBE_EXTRA_VALUES:-}" ]; then IFS=':' read -r -a extra_values_files <<< "${OPENSHELL_E2E_KUBE_EXTRA_VALUES}" for values_file in "${extra_values_files[@]}"; do [ -n "${values_file}" ] || continue + if [[ "${values_file}" == *"values-proxy-pod.yaml" ]]; then + PROXY_POD_E2E=1 + fi if [[ "${values_file}" != /* ]]; then values_file="${ROOT}/${values_file}" fi @@ -576,6 +628,11 @@ if [ -n "${OPENSHELL_E2E_KUBE_EXTRA_VALUES:-}" ]; then done fi +if [ "${PROXY_POD_E2E}" = "1" ]; then + echo "Proxy-pod e2e profile enabled; target cluster must enforce Kubernetes NetworkPolicies." + echo "Ephemeral k3d/k3s mode uses k3s's embedded NetworkPolicy controller unless the cluster is customized externally." 
+fi + if [ "${OPENSHELL_E2E_KUBE_DB_SCENARIOS:-0}" = "1" ]; then # --- Multi-scenario mode: test all database backends --- DB_PASSED=0 diff --git a/examples/bring-your-own-container/Dockerfile b/examples/bring-your-own-container/Dockerfile index 61f283970..fc65bd695 100644 --- a/examples/bring-your-own-container/Dockerfile +++ b/examples/bring-your-own-container/Dockerfile @@ -14,15 +14,19 @@ RUN apt-get update && apt-get install -y --no-install-recommends \ curl iproute2 nftables \ && rm -rf /var/lib/apt/lists/* -# Create the sandbox user for non-root execution. -# Use a high UID range to avoid conflicts with host users when running without -# user namespace remapping (UID in container = UID on host). -RUN groupadd -g 1000660000 sandbox && \ - useradd -m -u 1000660000 -g sandbox sandbox +# The sandbox user is injected at runtime by the compute driver. +# Kubernetes: resolved from OpenShift SCC namespace annotations or explicit +# sandbox_uid config. VM: resolves to 10001 by default, configurable in +# gateway TOML. +# +# Images no longer need a baked-in "sandbox" user — numeric UIDs are accepted +# and the driver passes them directly to setuid()/chown() at sandbox start. +# If your image requires a passwd entry for tools like ssh or sudo, add one +# manually (e.g. RUN useradd -m -u 1500 deploy). -RUN install -d -o sandbox -g sandbox /sandbox +RUN install -d /sandbox WORKDIR /sandbox -COPY --chown=sandbox:sandbox app.py . +COPY app.py . 
EXPOSE 8080 diff --git a/tasks/helm.toml b/tasks/helm.toml index f25dadb09..a26c7ecae 100644 --- a/tasks/helm.toml +++ b/tasks/helm.toml @@ -55,16 +55,46 @@ description = "Run skaffold dev for deploy/helm/openshell (iterative deploy)" dir = "deploy/helm/openshell" run = "skaffold dev" +["helm:skaffold:dev:sidecar"] +description = "Run skaffold dev with the Kubernetes supervisor sidecar topology" +dir = "deploy/helm/openshell" +run = "skaffold dev -p sidecar" + +["helm:skaffold:dev:proxy-pod"] +description = "Run skaffold dev with proxy-pod topology; requires NetworkPolicy enforcement in the target cluster" +dir = "deploy/helm/openshell" +run = "skaffold dev -p proxy-pod" + ["helm:skaffold:run"] description = "Run skaffold run for deploy/helm/openshell (one-shot deploy)" dir = "deploy/helm/openshell" run = "skaffold run" +["helm:skaffold:run:sidecar"] +description = "Run skaffold run with the Kubernetes supervisor sidecar topology" +dir = "deploy/helm/openshell" +run = "skaffold run -p sidecar" + +["helm:skaffold:run:proxy-pod"] +description = "Run skaffold run with proxy-pod topology; requires NetworkPolicy enforcement in the target cluster" +dir = "deploy/helm/openshell" +run = "skaffold run -p proxy-pod" + ["helm:skaffold:delete"] description = "Run skaffold delete for deploy/helm/openshell" dir = "deploy/helm/openshell" run = "skaffold delete" +["helm:skaffold:delete:sidecar"] +description = "Run skaffold delete for the Kubernetes supervisor sidecar topology" +dir = "deploy/helm/openshell" +run = "skaffold delete -p sidecar" + +["helm:skaffold:delete:proxy-pod"] +description = "Run skaffold delete for the Kubernetes proxy-pod topology" +dir = "deploy/helm/openshell" +run = "skaffold delete -p proxy-pod" + ["helm:skaffold:diagnose"] description = "Run skaffold diagnose for deploy/helm/openshell" dir = "deploy/helm/openshell" diff --git a/tasks/scripts/helm-k3s-local.sh b/tasks/scripts/helm-k3s-local.sh index f9ac186f5..7cbc98429 100755 --- 
a/tasks/scripts/helm-k3s-local.sh +++ b/tasks/scripts/helm-k3s-local.sh @@ -69,6 +69,10 @@ Environment: macOS uses k3d from mise (Docker required). Linux can use this flow only when k3d is installed explicitly; otherwise use kind or an existing cluster context. Pair with: mise run helm:skaffold:dev + +The proxy-pod Skaffold profile relies on Kubernetes NetworkPolicy enforcement. +This helper leaves k3s's embedded network policy controller enabled; if you +replace the CNI, install a policy-enforcing CNI before using that profile. EOF } diff --git a/tasks/test.toml b/tasks/test.toml index a3c56bb2b..f8279e03e 100644 --- a/tasks/test.toml +++ b/tasks/test.toml @@ -114,6 +114,16 @@ run = [ "AGENT_SANDBOX_VERSION=v0.4.6 e2e/rust/e2e-kubernetes.sh", ] +["e2e:kubernetes:sidecar"] +description = "Run Kubernetes e2e with the supervisor sidecar topology overlay" +env = { OPENSHELL_E2E_KUBE_EXTRA_VALUES = "deploy/helm/openshell/ci/values-sidecar.yaml" } +run = "e2e/rust/e2e-kubernetes.sh" + +["e2e:kubernetes:proxy-pod"] +description = "Run Kubernetes e2e with the proxy-pod topology overlay; requires NetworkPolicy enforcement in the target cluster" +env = { OPENSHELL_E2E_KUBE_EXTRA_VALUES = "deploy/helm/openshell/ci/values-proxy-pod.yaml" } +run = "e2e/rust/e2e-kubernetes.sh" + ["e2e:kubernetes:db"] description = "Run Kubernetes e2e with all database backend scenarios (SQLite and external PostgreSQL with existingSecret)" env = { OPENSHELL_E2E_KUBE_DB_SCENARIOS = "1" }