fix(auth): handle PermissionError on workload certificates to avoid startup hang and crash#17568
fix(auth): handle PermissionError on workload certificates to avoid startup hang and crash#17568nbayati wants to merge 4 commits into
Conversation
…tartup hang and crash In sandboxed environments (such as when running the gcloud CLI within a snap package manager sandbox), AppArmor rules can block reading the workload credentials directory or certificate files, raising a PermissionError. Previously, `os.path.exists()` caught this PermissionError and returned `False`. The library interpreted this as the files not being ready yet (e.g. due to a startup race) and entered a 30-second retry loop before failing with a RefreshError. This change: 1. Updates `_is_certificate_file_ready` to use `os.stat` and propagate `PermissionError`. 2. Catches `PermissionError` immediately during certificate lookup, logging a warning and falling back to unbound tokens without retrying or crashing. 3. Adds corresponding unit tests.
There was a problem hiding this comment.
Code Review
This pull request improves error handling when checking and accessing agent identity certificates. Specifically, it updates _is_certificate_file_ready to propagate PermissionError while catching other OSError exceptions, and handles PermissionError in get_agent_identity_certificate_path by logging a warning and falling back to unbound tokens. Corresponding unit tests have been added to verify these scenarios. There are no review comments, so I have no additional feedback to provide.
There was a problem hiding this comment.
Thank you for this contribution!
System-level / Off-diff Finding:
Graceful OSError Handling in get_and_parse_agent_identity_certificate()
get_and_parse_agent_identity_certificate() (around line 208) does not handle exceptions when reading the certificate file. Because os.path.getsize only requires directory traversal / metadata search permissions, _is_certificate_file_ready can evaluate to True even for a file that the current process does not have read permissions to open (e.g., owned by root with 600). When open(cert_path, "rb") is called, it will crash with a raw PermissionError.
Wrap the file-read in a try...except OSError block to gracefully fail-closed with a controlled RefreshError:
try:
with open(cert_path, "rb") as cert_file:
cert_bytes = cert_file.read()
except OSError as e:
raise exceptions.RefreshError(
f"Failed to read agent identity certificate file at {cert_path}: {e}. "
"Token binding protection is failing. You can turn off this protection by setting "
f"{environment_vars.GOOGLE_API_PREVENT_AGENT_TOKEN_SHARING_FOR_GCP_SERVICES} to false "
"to fall back to unbound tokens."
) from e| ) | ||
| has_logged_cert_warning = True | ||
|
|
||
| except PermissionError as e: |
There was a problem hiding this comment.
When GOOGLE_API_PREVENT_AGENT_TOKEN_SHARING_FOR_GCP_SERVICES is true (default), token-binding protection is enforced. Silently returning None and falling back to unbound tokens upon encountering a PermissionError on cert_config_path violates this security posture and creates a silent downgrade vulnerability (e.g., if a low-privileged co-tenant restricts access to the config file, forcing a silent downgrade to request unbound tokens).
If the explicitly provided config exists but is unreadable or malformed, the client must fail-closed by logging a critical error and raising exceptions.RefreshError.
except PermissionError as e:
_LOGGER.error(
"Security Critical: Permission denied when accessing certificate config or file: %s. "
"Aborting token lookup to prevent silent security downgrade to unbound tokens.",
e,
)
raise exceptions.RefreshError(f"Failed to read certificate config or file: {e}") from eThere was a problem hiding this comment.
This seems to be a duplicate of the other comment: #17568 (comment)
| "Token binding protection cannot be enabled. Falling back to unbound tokens.", | ||
| e, | ||
| ) | ||
| return None |
There was a problem hiding this comment.
Returning None inside the new except PermissionError as e: block bypasses the retry loop and the subsequent warning logs, resulting in a completely silent fallback with zero diagnostics or visibility for administrators. If a silent downgrade fallback is indeed the intended design (which is discouraged due to security policy), a highly visible warning or error log must be output before returning. Otherwise, fail-closed by raising exceptions.RefreshError.
There was a problem hiding this comment.
We need to fall back to unbound tokens here instead of failing closed. Since token binding is enabled by default, failing closed on a PermissionError would completely break gcloud/CLI users running inside sandboxes (like Snap packages) on GCE VMs where the host certificates are blocked by AppArmor.
To address the visibility concern, we do log a warning via _LOGGER.warning before returning. We decided against using warnings.warn because it prints messy Python warnings to stderr by default, which is confusing for users who didn't explicitly set this up. Standard users who don't have workload certificates mounted on their VMs won't see any warnings anyway since they exit early when the well-known directory isn't found.
There was a problem hiding this comment.
I think we may want to catch here too. I think it's possible for get_agent_identity_certificate_path to work correctly but then have .read to raise a PermissionError since .stat and .read require different permissions
There was a problem hiding this comment.
Done! I have wrapped the file read block in a try...except OSError block to catch read-level permissions errors and fall back to unbound tokens.
| if not path: | ||
| return False | ||
| try: | ||
| return os.path.getsize(path) > 0 |
There was a problem hiding this comment.
nit: the PR description mentions using os.stat , but the code uses os.path.getsize . Functionally equivalent since getsize uses stat under the hood, but just pointing it out!
There was a problem hiding this comment.
The new commit uses os.stat directly. I'll double check the PR description after making all the changes requested by reviewers to ensure it's still valid. Thanks for pointing it out :)
In sandboxed environments (such as when running the gcloud CLI within a snap package manager sandbox), AppArmor rules can block reading the workload credentials directory or certificate files, raising a PermissionError.
Previously,
os.path.exists()caught this PermissionError and returnedFalse. The library interpreted this as the files not being ready yet (e.g. due to a startup race) and entered a 30-second retry loop before failing with a RefreshError.This change:
_is_certificate_file_readyto useos.statand propagatePermissionError.PermissionErrorimmediately during certificate lookup, logging a warning and falling back to unbound tokens without retrying or crashing.Fixes b/522957573