fix(auth): handle PermissionError on workload certificates to avoid startup hang and crash by nbayati · Pull Request #17568 · googleapis/google-cloud-python

nbayati · 2026-06-24T20:46:41Z

In sandboxed environments (such as when running the gcloud CLI within a snap package manager sandbox), AppArmor rules can block reading the workload credentials directory or certificate files, raising a PermissionError.

Previously, os.path.exists() caught this PermissionError and returned False. The library interpreted this as the files not being ready yet (e.g. due to a startup race) and entered a 30-second retry loop before failing with a RefreshError.

This change:

Updates _is_certificate_file_ready to use os.stat and propagate PermissionError.
Catches PermissionError immediately during certificate lookup, logging a warning and falling back to unbound tokens without retrying or crashing.
Adds corresponding unit tests.

Fixes b/522957573

…tartup hang and crash In sandboxed environments (such as when running the gcloud CLI within a snap package manager sandbox), AppArmor rules can block reading the workload credentials directory or certificate files, raising a PermissionError. Previously, `os.path.exists()` caught this PermissionError and returned `False`. The library interpreted this as the files not being ready yet (e.g. due to a startup race) and entered a 30-second retry loop before failing with a RefreshError. This change: 1. Updates `_is_certificate_file_ready` to use `os.stat` and propagate `PermissionError`. 2. Catches `PermissionError` immediately during certificate lookup, logging a warning and falling back to unbound tokens without retrying or crashing. 3. Adds corresponding unit tests.

gemini-code-assist

Code Review

This pull request improves error handling when checking and accessing agent identity certificates. Specifically, it updates _is_certificate_file_ready to propagate PermissionError while catching other OSError exceptions, and handles PermissionError in get_agent_identity_certificate_path by logging a warning and falling back to unbound tokens. Corresponding unit tests have been added to verify these scenarios. There are no review comments, so I have no additional feedback to provide.

lsirac

Thank you for this contribution!

System-level / Off-diff Finding:

Graceful OSError Handling in `get_and_parse_agent_identity_certificate()`

get_and_parse_agent_identity_certificate() (around line 208) does not handle exceptions when reading the certificate file. Because os.path.getsize only requires directory traversal / metadata search permissions, _is_certificate_file_ready can evaluate to True even for a file that the current process does not have read permissions to open (e.g., owned by root with 600). When open(cert_path, "rb") is called, it will crash with a raw PermissionError.

Wrap the file-read in a try...except OSError block to gracefully fail-closed with a controlled RefreshError:

    try:
        with open(cert_path, "rb") as cert_file:
            cert_bytes = cert_file.read()
    except OSError as e:
        raise exceptions.RefreshError(
            f"Failed to read agent identity certificate file at {cert_path}: {e}. "
            "Token binding protection is failing. You can turn off this protection by setting "
            f"{environment_vars.GOOGLE_API_PREVENT_AGENT_TOKEN_SHARING_FOR_GCP_SERVICES} to false "
            "to fall back to unbound tokens."
        ) from e

lsirac · 2026-06-24T23:57:58Z

                )
                has_logged_cert_warning = True

+        except PermissionError as e:


When GOOGLE_API_PREVENT_AGENT_TOKEN_SHARING_FOR_GCP_SERVICES is true (default), token-binding protection is enforced. Silently returning None and falling back to unbound tokens upon encountering a PermissionError on cert_config_path violates this security posture and creates a silent downgrade vulnerability (e.g., if a low-privileged co-tenant restricts access to the config file, forcing a silent downgrade to request unbound tokens).

If the explicitly provided config exists but is unreadable or malformed, the client must fail-closed by logging a critical error and raising exceptions.RefreshError.

except PermissionError as e: _LOGGER.error( "Security Critical: Permission denied when accessing certificate config or file: %s. " "Aborting token lookup to prevent silent security downgrade to unbound tokens.", e, ) raise exceptions.RefreshError(f"Failed to read certificate config or file: {e}") from e

This seems to be a duplicate of the other comment: #17568 (comment)

lsirac · 2026-06-24T23:57:58Z

+                "Token binding protection cannot be enabled. Falling back to unbound tokens.",
+                e,
+            )
+            return None


Returning None inside the new except PermissionError as e: block bypasses the retry loop and the subsequent warning logs, resulting in a completely silent fallback with zero diagnostics or visibility for administrators. If a silent downgrade fallback is indeed the intended design (which is discouraged due to security policy), a highly visible warning or error log must be output before returning. Otherwise, fail-closed by raising exceptions.RefreshError.

We need to fall back to unbound tokens here instead of failing closed. Since token binding is enabled by default, failing closed on a PermissionError would completely break gcloud/CLI users running inside sandboxes (like Snap packages) on GCE VMs where the host certificates are blocked by AppArmor.

To address the visibility concern, we do log a warning via _LOGGER.warning before returning. We decided against using warnings.warn because it prints messy Python warnings to stderr by default, which is confusing for users who didn't explicitly set this up. Standard users who don't have workload certificates mounted on their VMs won't see any warnings anyway since they exit early when the well-known directory isn't found.

macastelaz · 2026-06-25T01:55:56Z

I think we may want to catch here too. I think it's possible for get_agent_identity_certificate_path to work correctly but then have .read to raise a PermissionError since .stat and .read require different permissions

Done! I have wrapped the file read block in a try...except OSError block to catch read-level permissions errors and fall back to unbound tokens.

macastelaz · 2026-06-25T01:56:53Z

+    if not path:
+        return False
+    try:
+        return os.path.getsize(path) > 0


nit: the PR description mentions using os.stat , but the code uses os.path.getsize . Functionally equivalent since getsize uses stat under the hood, but just pointing it out!

The new commit uses os.stat directly. I'll double check the PR description after making all the changes requested by reviewers to ensure it's still valid. Thanks for pointing it out :)

…ects

…ad errors

nbayati requested review from a team as code owners June 24, 2026 20:46

gemini-code-assist Bot reviewed Jun 24, 2026

View reviewed changes

lsirac reviewed Jun 24, 2026

View reviewed changes

macastelaz reviewed Jun 25, 2026

View reviewed changes

nbayati added 3 commits June 25, 2026 16:49

fix(auth): restrict certificate readiness check to regular files

981a60a

test(auth): mock os.path.exists locally to avoid test runner side eff…

5a7ce2c

…ects

fix(auth): fallback to unbound tokens on certificate permission or re…

022d2ae

…ad errors

nbayati requested review from lsirac and macastelaz June 25, 2026 20:27

macastelaz approved these changes Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(auth): handle PermissionError on workload certificates to avoid startup hang and crash#17568

fix(auth): handle PermissionError on workload certificates to avoid startup hang and crash#17568
nbayati wants to merge 4 commits into
googleapis:mainfrom
nbayati:fix/auth-mtls-permission-denied

nbayati commented Jun 24, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

lsirac left a comment •

edited

Loading

Uh oh!

Uh oh!

lsirac Jun 24, 2026

Uh oh!

nbayati Jun 25, 2026

Uh oh!

lsirac Jun 24, 2026

Uh oh!

nbayati Jun 25, 2026

Uh oh!

Uh oh!

macastelaz Jun 25, 2026

Uh oh!

nbayati Jun 25, 2026

Uh oh!

macastelaz Jun 25, 2026

Uh oh!

nbayati Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

nbayati commented Jun 24, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

lsirac left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

System-level / Off-diff Finding:

Graceful OSError Handling in get_and_parse_agent_identity_certificate()

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lsirac left a comment •

edited

Loading

Graceful OSError Handling in `get_and_parse_agent_identity_certificate()`