Skip to content

fix(auth): handle PermissionError on workload certificates to avoid startup hang and crash#17568

Open
nbayati wants to merge 4 commits into
googleapis:mainfrom
nbayati:fix/auth-mtls-permission-denied
Open

fix(auth): handle PermissionError on workload certificates to avoid startup hang and crash#17568
nbayati wants to merge 4 commits into
googleapis:mainfrom
nbayati:fix/auth-mtls-permission-denied

Conversation

@nbayati

@nbayati nbayati commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

In sandboxed environments (such as when running the gcloud CLI within a snap package manager sandbox), AppArmor rules can block reading the workload credentials directory or certificate files, raising a PermissionError.

Previously, os.path.exists() caught this PermissionError and returned False. The library interpreted this as the files not being ready yet (e.g. due to a startup race) and entered a 30-second retry loop before failing with a RefreshError.

This change:

  1. Updates _is_certificate_file_ready to use os.stat and propagate PermissionError.
  2. Catches PermissionError immediately during certificate lookup, logging a warning and falling back to unbound tokens without retrying or crashing.
  3. Adds corresponding unit tests.

Fixes b/522957573

…tartup hang and crash

In sandboxed environments (such as when running the gcloud CLI within a snap package manager sandbox), AppArmor rules can block reading the workload credentials directory or certificate files, raising a PermissionError.

Previously, `os.path.exists()` caught this PermissionError and returned `False`. The library interpreted this as the files not being ready yet (e.g. due to a startup race) and entered a 30-second retry loop before failing with a RefreshError.

This change:
1. Updates `_is_certificate_file_ready` to use `os.stat` and propagate `PermissionError`.
2. Catches `PermissionError` immediately during certificate lookup, logging a warning and falling back to unbound tokens without retrying or crashing.
3. Adds corresponding unit tests.
@nbayati nbayati requested review from a team as code owners June 24, 2026 20:46

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request improves error handling when checking and accessing agent identity certificates. Specifically, it updates _is_certificate_file_ready to propagate PermissionError while catching other OSError exceptions, and handles PermissionError in get_agent_identity_certificate_path by logging a warning and falling back to unbound tokens. Corresponding unit tests have been added to verify these scenarios. There are no review comments, so I have no additional feedback to provide.

@lsirac lsirac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution!

System-level / Off-diff Finding:

Graceful OSError Handling in get_and_parse_agent_identity_certificate()

get_and_parse_agent_identity_certificate() (around line 208) does not handle exceptions when reading the certificate file. Because os.path.getsize only requires directory traversal / metadata search permissions, _is_certificate_file_ready can evaluate to True even for a file that the current process does not have read permissions to open (e.g., owned by root with 600). When open(cert_path, "rb") is called, it will crash with a raw PermissionError.

Wrap the file-read in a try...except OSError block to gracefully fail-closed with a controlled RefreshError:

    try:
        with open(cert_path, "rb") as cert_file:
            cert_bytes = cert_file.read()
    except OSError as e:
        raise exceptions.RefreshError(
            f"Failed to read agent identity certificate file at {cert_path}: {e}. "
            "Token binding protection is failing. You can turn off this protection by setting "
            f"{environment_vars.GOOGLE_API_PREVENT_AGENT_TOKEN_SHARING_FOR_GCP_SERVICES} to false "
            "to fall back to unbound tokens."
        ) from e

Comment thread packages/google-auth/google/auth/_agent_identity_utils.py Outdated
)
has_logged_cert_warning = True

except PermissionError as e:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When GOOGLE_API_PREVENT_AGENT_TOKEN_SHARING_FOR_GCP_SERVICES is true (default), token-binding protection is enforced. Silently returning None and falling back to unbound tokens upon encountering a PermissionError on cert_config_path violates this security posture and creates a silent downgrade vulnerability (e.g., if a low-privileged co-tenant restricts access to the config file, forcing a silent downgrade to request unbound tokens).

If the explicitly provided config exists but is unreadable or malformed, the client must fail-closed by logging a critical error and raising exceptions.RefreshError.

        except PermissionError as e:
            _LOGGER.error(
                "Security Critical: Permission denied when accessing certificate config or file: %s. "
                "Aborting token lookup to prevent silent security downgrade to unbound tokens.",
                e,
            )
            raise exceptions.RefreshError(f"Failed to read certificate config or file: {e}") from e

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a duplicate of the other comment: #17568 (comment)

"Token binding protection cannot be enabled. Falling back to unbound tokens.",
e,
)
return None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning None inside the new except PermissionError as e: block bypasses the retry loop and the subsequent warning logs, resulting in a completely silent fallback with zero diagnostics or visibility for administrators. If a silent downgrade fallback is indeed the intended design (which is discouraged due to security policy), a highly visible warning or error log must be output before returning. Otherwise, fail-closed by raising exceptions.RefreshError.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to fall back to unbound tokens here instead of failing closed. Since token binding is enabled by default, failing closed on a PermissionError would completely break gcloud/CLI users running inside sandboxes (like Snap packages) on GCE VMs where the host certificates are blocked by AppArmor.

To address the visibility concern, we do log a warning via _LOGGER.warning before returning. We decided against using warnings.warn because it prints messy Python warnings to stderr by default, which is confusing for users who didn't explicitly set this up. Standard users who don't have workload certificates mounted on their VMs won't see any warnings anyway since they exit early when the well-known directory isn't found.

Comment thread packages/google-auth/tests/test_agent_identity_utils.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may want to catch here too. I think it's possible for get_agent_identity_certificate_path to work correctly but then have .read to raise a PermissionError since .stat and .read require different permissions

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! I have wrapped the file read block in a try...except OSError block to catch read-level permissions errors and fall back to unbound tokens.

if not path:
return False
try:
return os.path.getsize(path) > 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the PR description mentions using os.stat , but the code uses os.path.getsize . Functionally equivalent since getsize uses stat under the hood, but just pointing it out!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new commit uses os.stat directly. I'll double check the PR description after making all the changes requested by reviewers to ensure it's still valid. Thanks for pointing it out :)

@nbayati nbayati requested review from lsirac and macastelaz June 25, 2026 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants