-
Notifications
You must be signed in to change notification settings - Fork 0
Automated Test: authz-service-improve-caching-pr #323
Automated Test: authz-service-improve-caching-pr #323
Conversation
* remove the use of client side cache for in-proc authz client Co-authored-by: Gabriel MABILLE <gabriel.mabille@grafana.com> * add a permission denial cache, fetch perms if not in either of the caches Co-authored-by: Gabriel MABILLE <gabriel.mabille@grafana.com> * Clean up tests Co-authored-by: Ieva <ieva.vasiljeva@grafana.com> * Cache tests Co-authored-by: Ieva <ieva.vasiljeva@grafana.com> * Add test to list + cache Co-authored-by: Ieva <ieva.vasiljeva@grafana.com> * Add outdated cache test Co-authored-by: Ieva <ieva.vasiljeva@grafana.com> * Re-organize metrics Co-authored-by: Ieva <ieva.vasiljeva@grafana.com> --------- Co-authored-by: Gabriel MABILLE <gabriel.mabille@grafana.com>
📝 WalkthroughWalkthroughThese changes introduce a dual-layer caching system for authorization checks in the RBAC service. A new denial cache is added to fast-path denied permission requests, alongside a unified caching pathway for identity permissions. The client construction is updated to use a remote authz client initialized with a no-op cache option. Changes
Sequence DiagramsequenceDiagram
participant Client
participant Service
participant DenialCache as Denial Cache
participant PermCache as Perm Cache
participant Store as Store/DB
Client->>Service: Check(namespace, user, action, resource)
activate Service
Service->>DenialCache: Get(denial cache key)
activate DenialCache
DenialCache-->>Service: Found (denied)
deactivate DenialCache
alt Denial cached
Service->>Service: Return denied (fast-path)
Service-->>Client: Denied
else Cache miss
Service->>PermCache: getCachedIdentityPermissions(user)
activate PermCache
PermCache-->>Service: Permissions (if cached)
deactivate PermCache
alt Permissions cached
Service->>Service: Evaluate against cached perms
else Cache miss
Service->>Store: Fetch permissions
activate Store
Store-->>Service: Permissions
deactivate Store
Service->>PermCache: Cache permissions
end
Service->>Service: Evaluate request
alt Permission granted
Service-->>Client: Allowed
else Permission denied
Service->>DenialCache: Set(denial cache key)
Service-->>Client: Denied
end
end
deactivate Service
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@pkg/services/authz/rbac/service_test.go`:
- Around line 973-995: The test's claim that the perm cache "allows" access is
wrong because it sets map[string]bool{"dashboards:uid:dash1": false}; update the
test to set the permission entry to true so it actually grants access in the
perm cache and thus demonstrates that the explicit denial in s.permDenialCache
(userPermDenialCacheKey) short-circuits Check; locate the test block around func
t.Run "Should deny on explicit cache deny entry" and change the s.permCache.Set
call (using userPermCacheKey) to store a map with the dashboard key set to true
so the final require/assert still expects resp.Allowed == false.
In `@pkg/services/authz/rbac/service.go`:
- Around line 180-192: List currently uses getCachedIdentityPermissions and
accepts its result as authoritative, unlike Check which falls through to
getIdentityPermissions when cache indicates denial; update the List function to
mirror Check's behavior by detecting a cached "deny" or stale/empty permission
result from s.getCachedIdentityPermissions(ctx, listReq.Namespace,
listReq.IdentityType, listReq.UserUID, listReq.Action) and then calling
s.getIdentityPermissions(ctx, listReq.Namespace, listReq.IdentityType,
listReq.UserUID, listReq.Action) to refresh from the DB; ensure you update the
metrics labels the same way Check does and preserve the existing error logging
(ctxLogger.Error with subject and error) and return semantics when the DB fetch
fails.
- Around line 116-121: The denial cache (permDenialCache / permDenialKey) is
checked before the permission cache which can cause stale denials to override
newly granted permissions; change the logic to consult the permission cache
(e.g., s.permCache.Get or the existing permission lookup path) before
short-circuiting on s.permDenialCache.Get, or if keeping the denial-first order,
validate the denial entry against the permission cache/DB (invalidate denial if
a positive permission exists) so grants are not blocked while perm cache TTLs
expire.
🧹 Nitpick comments (3)
pkg/services/authz/rbac/cache.go (1)
26-32: Cache key prefix collision betweenuserPermCacheKeyanduserPermDenialCacheKey.Both functions share the same
namespace + ".perm_" + userUID + "_"prefix and use_as separator without escaping. Since both caches share the same underlyingcache.Cache, a theoretical collision exists:userPermCacheKey(ns, uid, action+"_"+name+"_"+parent)would produce the same key asuserPermDenialCacheKey(ns, uid, action, name, parent).In practice this is unlikely because
actionvalues use colons (e.g.,dashboards:read) and UIDs are constrained, but a distinct prefix (e.g.,.perm_deny_) for the denial key would eliminate the ambiguity entirely.🛡️ Suggested fix
func userPermDenialCacheKey(namespace, userUID, action, name, parent string) string { - return namespace + ".perm_" + userUID + "_" + action + "_" + name + "_" + parent + return namespace + ".perm_deny_" + userUID + "_" + action + "_" + name + "_" + parent }pkg/services/authz/rbac/service.go (2)
116-137: Check flow correctly falls through to DB on stale cache — but metrics label is misleading.When cached permissions are found but deny access, execution falls through to the full DB fetch (supporting the "outdated cache" scenario tested in
TestService_CacheCheck). However, the metric at line 137 records this as a cache miss ("false"), even though the cache was hit but simply didn't grant access. This conflates "cache miss" with "cache hit, stale/insufficient."Consider using a third label value or a separate metric to distinguish "cache miss" from "cache hit but stale," as this will aid debugging cache effectiveness in production.
342-368:getCachedIdentityPermissionsmay trigger a DB call for user identifiers.For
TypeUser/TypeServiceAccount, this function callss.GetUserIdentifiers(ctx, ns, userID)at line 356, which can fall through to the database if the identity isn't cached yet. This is a side effect in what is conceptually a "cache-only" lookup path. It won't cause correctness issues (the result is cached for subsequent calls), but it makes the function's behavior less predictable for callers who expect it to be a pure cache read.
| t.Run("Should deny on explicit cache deny entry", func(t *testing.T) { | ||
| s := setupService() | ||
|
|
||
| s.idCache.Set(ctx, userIdentifierCacheKey("org-12", "test-uid"), *userID) | ||
|
|
||
| // Explicitly deny access to the dashboard | ||
| s.permDenialCache.Set(ctx, userPermDenialCacheKey("org-12", "test-uid", "dashboards:read", "dash1", "fold1"), true) | ||
|
|
||
| // Allow access to the dashboard to prove this is not checked | ||
| s.permCache.Set(ctx, userPermCacheKey("org-12", "test-uid", "dashboards:read"), map[string]bool{"dashboards:uid:dash1": false}) | ||
|
|
||
| resp, err := s.Check(ctx, &authzv1.CheckRequest{ | ||
| Namespace: "org-12", | ||
| Subject: "user:test-uid", | ||
| Group: "dashboard.grafana.app", | ||
| Resource: "dashboards", | ||
| Verb: "get", | ||
| Name: "dash1", | ||
| Folder: "fold1", | ||
| }) | ||
| require.NoError(t, err) | ||
| assert.False(t, resp.Allowed) | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test comment is misleading — perm cache value false doesn't actually "allow" access.
Line 982 sets map[string]bool{"dashboards:uid:dash1": false}, with the comment "Allow access to the dashboard to prove this is not checked." However, a false value in the scope map would deny access anyway. To truly prove that the denial cache short-circuits before the permission cache is consulted, set the value to true:
🧪 Suggested fix for a more convincing test
- // Allow access to the dashboard to prove this is not checked
- s.permCache.Set(ctx, userPermCacheKey("org-12", "test-uid", "dashboards:read"), map[string]bool{"dashboards:uid:dash1": false})
+ // Set permission cache to allow access — proves denial cache takes precedence
+ s.permCache.Set(ctx, userPermCacheKey("org-12", "test-uid", "dashboards:read"), map[string]bool{"dashboards:uid:dash1": true})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| t.Run("Should deny on explicit cache deny entry", func(t *testing.T) { | |
| s := setupService() | |
| s.idCache.Set(ctx, userIdentifierCacheKey("org-12", "test-uid"), *userID) | |
| // Explicitly deny access to the dashboard | |
| s.permDenialCache.Set(ctx, userPermDenialCacheKey("org-12", "test-uid", "dashboards:read", "dash1", "fold1"), true) | |
| // Allow access to the dashboard to prove this is not checked | |
| s.permCache.Set(ctx, userPermCacheKey("org-12", "test-uid", "dashboards:read"), map[string]bool{"dashboards:uid:dash1": false}) | |
| resp, err := s.Check(ctx, &authzv1.CheckRequest{ | |
| Namespace: "org-12", | |
| Subject: "user:test-uid", | |
| Group: "dashboard.grafana.app", | |
| Resource: "dashboards", | |
| Verb: "get", | |
| Name: "dash1", | |
| Folder: "fold1", | |
| }) | |
| require.NoError(t, err) | |
| assert.False(t, resp.Allowed) | |
| }) | |
| t.Run("Should deny on explicit cache deny entry", func(t *testing.T) { | |
| s := setupService() | |
| s.idCache.Set(ctx, userIdentifierCacheKey("org-12", "test-uid"), *userID) | |
| // Explicitly deny access to the dashboard | |
| s.permDenialCache.Set(ctx, userPermDenialCacheKey("org-12", "test-uid", "dashboards:read", "dash1", "fold1"), true) | |
| // Set permission cache to allow access — proves denial cache takes precedence | |
| s.permCache.Set(ctx, userPermCacheKey("org-12", "test-uid", "dashboards:read"), map[string]bool{"dashboards:uid:dash1": true}) | |
| resp, err := s.Check(ctx, &authzv1.CheckRequest{ | |
| Namespace: "org-12", | |
| Subject: "user:test-uid", | |
| Group: "dashboard.grafana.app", | |
| Resource: "dashboards", | |
| Verb: "get", | |
| Name: "dash1", | |
| Folder: "fold1", | |
| }) | |
| require.NoError(t, err) | |
| assert.False(t, resp.Allowed) | |
| }) |
🤖 Prompt for AI Agents
In `@pkg/services/authz/rbac/service_test.go` around lines 973 - 995, The test's
claim that the perm cache "allows" access is wrong because it sets
map[string]bool{"dashboards:uid:dash1": false}; update the test to set the
permission entry to true so it actually grants access in the perm cache and thus
demonstrates that the explicit denial in s.permDenialCache
(userPermDenialCacheKey) short-circuits Check; locate the test block around func
t.Run "Should deny on explicit cache deny entry" and change the s.permCache.Set
call (using userPermCacheKey) to store a map with the dashboard key set to true
so the final require/assert still expects resp.Allowed == false.
| permDenialKey := userPermDenialCacheKey(checkReq.Namespace.Value, checkReq.UserUID, checkReq.Action, checkReq.Name, checkReq.ParentFolder) | ||
| if _, ok := s.permDenialCache.Get(ctx, permDenialKey); ok { | ||
| s.metrics.permissionCacheUsage.WithLabelValues("true", checkReq.Action).Inc() | ||
| s.metrics.requestCount.WithLabelValues("false", "true", req.GetVerb(), req.GetGroup(), req.GetResource()).Inc() | ||
| return &authzv1.CheckResponse{Allowed: false}, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Denial cache takes absolute precedence — could serve stale denials after permission grants.
The denial cache short-circuits before even checking the permission cache. If an admin grants a user access, the denial cache entry (up to 30s TTL) will continue to reject requests even though the permission cache would have expired and the DB would return the updated permissions.
Since the permission cache and denial cache share the same TTL but are keyed differently (perm cache by action only, denial cache by action+name+parent), there's no coordinated invalidation. This is likely an acceptable trade-off given the short TTL, but it should be documented and considered if TTLs are ever increased.
🤖 Prompt for AI Agents
In `@pkg/services/authz/rbac/service.go` around lines 116 - 121, The denial cache
(permDenialCache / permDenialKey) is checked before the permission cache which
can cause stale denials to override newly granted permissions; change the logic
to consult the permission cache (e.g., s.permCache.Get or the existing
permission lookup path) before short-circuiting on s.permDenialCache.Get, or if
keeping the denial-first order, validate the denial entry against the permission
cache/DB (invalidate denial if a positive permission exists) so grants are not
blocked while perm cache TTLs expire.
| permissions, err := s.getCachedIdentityPermissions(ctx, listReq.Namespace, listReq.IdentityType, listReq.UserUID, listReq.Action) | ||
| if err == nil { | ||
| s.metrics.permissionCacheUsage.WithLabelValues("true", listReq.Action).Inc() | ||
| } else { | ||
| s.metrics.permissionCacheUsage.WithLabelValues("false", listReq.Action).Inc() | ||
|
|
||
| permissions, err = s.getIdentityPermissions(ctx, listReq.Namespace, listReq.IdentityType, listReq.UserUID, listReq.Action) | ||
| if err != nil { | ||
| ctxLogger.Error("could not get user permissions", "subject", req.GetSubject(), "error", err) | ||
| s.metrics.requestCount.WithLabelValues("true", "true", req.GetVerb(), req.GetGroup(), req.GetResource()).Inc() | ||
| return nil, err | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent caching strategy between Check and List.
Check (lines 123–136) falls through to the database when cached permissions deny access, handling the "outdated cache" case. List, however, directly uses cached permissions without any fallthrough. This means:
Checkon resource X: cache says no → DB fetch → finds new permission → allowsList: cache says no X → returns stale list without X
This could lead to a brief inconsistency where Check allows access to a resource that List doesn't include. Consider applying the same fallthrough logic or documenting this as intentional.
🤖 Prompt for AI Agents
In `@pkg/services/authz/rbac/service.go` around lines 180 - 192, List currently
uses getCachedIdentityPermissions and accepts its result as authoritative,
unlike Check which falls through to getIdentityPermissions when cache indicates
denial; update the List function to mirror Check's behavior by detecting a
cached "deny" or stale/empty permission result from
s.getCachedIdentityPermissions(ctx, listReq.Namespace, listReq.IdentityType,
listReq.UserUID, listReq.Action) and then calling s.getIdentityPermissions(ctx,
listReq.Namespace, listReq.IdentityType, listReq.UserUID, listReq.Action) to
refresh from the DB; ensure you update the metrics labels the same way Check
does and preserve the existing error logging (ctxLogger.Error with subject and
error) and return semantics when the DB fetch fails.
This pull request was automatically created by
@coderabbitai/e2e-reviewer.Batch created pull request.
Summary by CodeRabbit
Refactor
Tests