Automated Test: error-upsampling-race-condition #327
…(#94376) Part of the Error Upsampling project: https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link

The events-stats API will now check whether all projects in the query are allowlisted for upsampling and, if so, convert the count query to a sum over `sample_weight` in Snuba; this is done by defining a new SnQL function, `upsampled_count()`. I noticed that `eps()` and `epm()` are also in use in this endpoint. I considered (and even started work on) swapping `eps()` and `epm()` as well, since for correctness they should probably also use `sample_weight` rather than counting naively. But that caused some complications, and since they are only used by specific dashboard widgets and are not available in Discover, I decided to defer changing them until we find it is needed.
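For intuition, a back-of-envelope illustration (not code from this PR) of why summing `sample_weight` recovers an estimate of the true error count; it assumes `sample_weight` is materialized as the inverse of the client-side sample rate:

```python
# Each stored event stands in for 1 / sample_rate real errors.
stored_events = [
    {"sample_rate": 0.1},  # represents ~10 real errors
    {"sample_rate": 0.5},  # represents ~2 real errors
    {"sample_rate": 1.0},  # unsampled
]

naive_count = len(stored_events)  # what count() reports: 3
upsampled = round(sum(1 / e["sample_rate"] for e in stored_events))  # 13

print(naive_count, upsampled)
```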
- Add 60-second cache for upsampling eligibility checks to improve performance
- Separate upsampling eligibility check from query transformation for better optimization
- Remove unnecessary null checks in upsampled_count() function per schema requirements
- Add cache invalidation utilities for configuration management

This improves performance during high-traffic periods by avoiding repeated expensive allowlist lookups while maintaining data consistency.
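A minimal read-through sketch of the 60-second cache described above, using Django's cache API; the helper name and `compute` callable are illustrative, not the PR's actual code:

```python
from collections.abc import Callable

from django.core.cache import cache


def _cached_eligibility(cache_key: str, compute: Callable[[], bool]) -> bool:
    # Serve the cached verdict when present; otherwise compute and store it
    # with a 60-second TTL, matching the commit message above.
    cached = cache.get(cache_key)
    if cached is not None:
        return cached
    result = compute()
    cache.set(cache_key, result, timeout=60)
    return result
```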
📝 Walkthrough

This PR introduces error upsampling functionality to Sentry's API. It adds a new helper module that determines eligibility and applies transformations for selective error upsampling, integrates these checks into the organization-events-stats endpoint, adds a SnQL aggregation function, and includes comprehensive test coverage for the new feature.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client/Request
    participant Endpoint as organization_events_stats
    participant Helper as error_upsampling Helper
    participant Snuba as Snuba Query
    participant Dataset as DiscoverDataset

    Client->>Endpoint: Query with dataset & projects
    Endpoint->>Helper: is_errors_query_for_error_upsampled_projects()
    Helper->>Helper: Check cache (org + project IDs)
    Helper->>Helper: Verify all projects in allowlist
    Helper->>Helper: _should_apply_sample_weight_transform()
    Helper-->>Endpoint: upsampling_enabled: bool
    Endpoint->>Endpoint: final_columns = original columns
    alt upsampling_enabled
        Endpoint->>Helper: transform_query_columns_for_error_upsampling()
        Helper->>Helper: Replace count() with upsampled_count()
        Helper-->>Endpoint: transformed columns
        Endpoint->>Endpoint: final_columns = transformed
    end
    Endpoint->>Snuba: Execute query with final_columns
    Snuba->>Dataset: Resolve upsampled_count function
    Dataset->>Dataset: Map to toInt64(sum(sample_weight))
    Dataset-->>Snuba: Aggregation result
    Snuba-->>Endpoint: Query results
    Endpoint-->>Client: Stats response
```
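In concrete terms, the transformation step in the diagram swaps aggregate column strings before the Snuba call. A minimal sketch of the idea (not the PR's actual implementation; it only assumes the case- and whitespace-insensitive `count()` matching that the tests exercise):

```python
def transform_query_columns_for_error_upsampling(query_columns: list[str]) -> list[str]:
    # Swap a bare count() for upsampled_count(); leave every other column alone.
    return [
        "upsampled_count()" if col.strip().lower() == "count()" else col
        for col in query_columns
    ]
```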
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: 1 passed, 1 failed (warning).
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@src/sentry/api/helpers/error_upsampling.py`:
- Line 27: The cache key uses a non-deterministic hash: change the construction
of cache_key (currently using hash(tuple(sorted(snuba_params.project_ids)))) to
a deterministic representation of the sorted project IDs (e.g., join sorted
snuba_params.project_ids into a string or use json.dumps on the sorted list) so
the key for the same organization.id and project set is identical across
processes and workers; update the cache_key assignment near the code referencing
organization.id and snuba_params.project_ids accordingly and ensure any
separators used are unambiguous.
- Around line 67-74: invalidate_upsampling_cache currently builds its cache_key
using Python's non-deterministic hash(), so different worker processes may
generate different keys and fail to invalidate the cached entry; change
invalidate_upsampling_cache to derive the cache key exactly the same way as
is_errors_query_for_error_upsampled_projects (either call the shared helper that
generates that key or copy its deterministic logic, e.g. canonicalize and sort
project_ids and use a stable string or hash function like hashlib to produce the
key) so cache_key matches across processes.
- Around line 130-140: The helper _is_error_focused_query currently uses a
simple substring check which false-positives on negations (e.g.
"!event.type:error") and partial matches (e.g. "event.type:error_something");
update it to only match a standalone, non-negated token for event.type:error
(for example using a regex with word-boundaries and a negative lookbehind for
common negation prefixes like "!" or "-" or by tokenizing the query and
rejecting tokens that start with negation characters), ensuring you reject
tokens like "!event.type:error" or "event.type:error_something" and only return
True for an exact, non-negated event.type:error token.
In `@src/sentry/testutils/factories.py`:
- Around line 344-358: In _set_sample_rate_from_error_sampling, change the
existence check from "if client_sample_rate:" to "if client_sample_rate is not
None:" so valid zero values are preserved, and replace the two bare "except
Exception: pass" blocks with narrower exception handling or minimal logging:
when extracting client_sample_rate, catch only AttributeError/TypeError (or log
the extraction failure), and when converting to float, catch
ValueError/TypeError and log the invalid value before returning; keep references
to normalized_data and client_sample_rate so you update the logic inside that
function only.
In `@tests/snuba/api/endpoints/test_organization_events_stats.py`:
- Around line 3604-3607: The test currently makes mock_options.get return the
project list for every options.get() call, which is too broad; update the mock
to use a side_effect function (or patch only the specific option lookup) so that
mock_options.get(key, *args, **kwargs) returns [self.project.id,
self.project2.id] only when key equals the error-upsampling allowlist option
used by the code under test (the call inside
sentry.api.helpers.error_upsampling), and otherwise delegate to the real
options.get or return the provided default; change the mock in
test_error_upsampling_with_allowlisted_projects (and the other three tests in
this class) accordingly so only the intended option key is overridden.
🧹 Nitpick comments (6)
src/sentry/search/events/datasets/discover.py (1)
1041-1052: Consider marking `upsampled_count` as `private=True`.

This function is publicly queryable since `private=True` is not set. If a user manually invokes `upsampled_count()` on a project that doesn't have `sample_weight` populated, the `sum(sample_weight)` will silently return 0 (or NULL), producing misleading results. Other internal-only aggregation functions in this file (e.g., `random_number`, `modulo`) use `private=True` to prevent direct user access.

Proposed fix
SnQLFunction( "upsampled_count", required_args=[], # Optimized aggregation for error upsampling - assumes sample_weight # exists for all events in allowlisted projects as per schema design snql_aggregate=lambda args, alias: Function( "toInt64", [Function("sum", [Column("sample_weight")])], alias, ), default_result_type="number", + private=True, ),src/sentry/api/endpoints/organization_events_stats.py (2)
218-228: Simplify: redundant variable and misleading comments.

`upsampling_enabled = should_upsample` (line 226) adds no value; just use `should_upsample` directly. The comments referencing "caching" and "query optimization" are misleading since no caching is visible here.
227-299: Hoist column transformation above branching to avoid triple duplication.

`transform_query_columns_for_error_upsampling(query_columns)` is called identically in three separate branches (lines 233, 277, 296). Since the input is always the original `query_columns`, this can be done once:

Proposed simplification
```diff
-final_columns = query_columns
+final_columns = (
+    transform_query_columns_for_error_upsampling(query_columns)
+    if should_upsample
+    else query_columns
+)

 if top_events > 0:
-    # Apply upsampling transformation just before query execution
-    # This late transformation ensures we use the most current schema assumptions
-    if upsampling_enabled:
-        final_columns = transform_query_columns_for_error_upsampling(query_columns)
-
     if use_rpc:
         return scoped_dataset.run_top_events_timeseries_query(
             ...
             y_axes=final_columns,
             ...
         )
     return scoped_dataset.top_events_timeseries(
         timeseries_columns=final_columns,
         ...
     )

 if use_rpc:
-    # Apply upsampling transformation just before RPC query execution
-    if upsampling_enabled:
-        final_columns = transform_query_columns_for_error_upsampling(query_columns)
-
     return scoped_dataset.run_timeseries_query(
         ...
         y_axes=final_columns,
         ...
     )

-# Apply upsampling transformation just before standard query execution
-if upsampling_enabled:
-    final_columns = transform_query_columns_for_error_upsampling(query_columns)
-
 return scoped_dataset.timeseries_query(
     selected_columns=final_columns,
     ...
 )
```

tests/sentry/api/helpers/test_error_upsampling.py (2)
19-36: Use `self.create_organization()` instead of `Organization.objects.create()`.

Line 21 bypasses the standard test factory, which handles org mapping, slug reservation, and silo-mode compliance. Since `ErrorUpsamplingTest` extends `TestCase` (which provides factory methods), prefer:

Proposed fix
```diff
 def setUp(self) -> None:
-    self.organization = Organization.objects.create(name="test-org")
+    self.organization = self.create_organization(name="test-org")
     self.projects = [
```
54-75: Consider adding test cases for `count()` with arguments and multiple `count()` columns.

The transform tests cover `count()`, case insensitivity, and whitespace, but miss:

- `count(field_name)`: should this be transformed or left as-is?
- Multiple `count()` entries in the same list
- Columns with no `count()` at all (no-op case)

These edge cases would strengthen confidence in the transformation logic; a sketch of possible cases follows below.
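A hedged sketch of what those parametrized cases could look like; the expected outputs for the first and third rows are assumptions about intended behavior, not assertions about the current implementation:

```python
import pytest

from sentry.api.helpers.error_upsampling import (
    transform_query_columns_for_error_upsampling,
)


@pytest.mark.parametrize(
    "columns,expected",
    [
        # Assumption: count() with arguments is left untouched.
        (["count(user)"], ["count(user)"]),
        # Multiple bare count() entries should all be transformed.
        (["count()", "count()"], ["upsampled_count()", "upsampled_count()"]),
        # No count() at all: the transform should be a no-op.
        (["p95(transaction.duration)"], ["p95(transaction.duration)"]),
    ],
)
def test_transform_edge_cases(columns, expected):
    assert transform_query_columns_for_error_upsampling(columns) == expected
```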
src/sentry/api/helpers/error_upsampling.py (1)
43-64: Unused `organization` parameter; allowlist lookup is O(n) per project.

- The `organization` parameter is never used (confirmed by static analysis, Ruff ARG001). Either remove it or, if it's intended for future org-scoped allowlists, document the intent clearly.
- `project_id in allowlist` is O(n) on a list. Convert once to a `set` for O(1) membership checks.

♻️ Proposed fix
```diff
 def _are_all_projects_error_upsampled(
-    project_ids: Sequence[int], organization: Organization
+    project_ids: Sequence[int],
 ) -> bool:
     if not project_ids:
         return False

     allowlist = options.get("issues.client_error_sampling.project_allowlist", [])
     if not allowlist:
         return False

+    allowlist_set = set(allowlist)
     # All projects must be in the allowlist
-    result = all(project_id in allowlist for project_id in project_ids)
+    result = all(project_id in allowlist_set for project_id in project_ids)
     return result
```

Also update the call site on Line 35:
```diff
-    is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids, organization)
+    is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids)
```

And remove the unused `Organization` import if no longer needed.
Flagged code in src/sentry/api/helpers/error_upsampling.py (line 27):

```python
    expensive repeated option lookups during high-traffic periods. This is safe
    because allowlist changes are infrequent and eventual consistency is acceptable.
    """
    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
```
`hash()` is not guaranteed stable across Python processes, platforms, or versions, so the cache key can vary per worker.

Python randomizes hash seeds for strings by default (PYTHONHASHSEED), and even for tuples of ints the output of `hash()` is an implementation detail with no stability guarantee across interpreter versions or 32/64-bit builds. Since the cache is likely a shared store (Redis/memcached), workers can end up computing different keys for the same project set, causing cache misses and failed invalidations.

Use a deterministic representation instead, e.g., a joined string of sorted IDs.
🐛 Proposed fix

```diff
-    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
+    project_key = ",".join(str(pid) for pid in sorted(snuba_params.project_ids))
+    cache_key = f"error_upsampling_eligible:{organization.id}:{project_key}"
```
Flagged code in src/sentry/api/helpers/error_upsampling.py (lines 67-74):

```python
def invalidate_upsampling_cache(organization_id: int, project_ids: Sequence[int]) -> None:
    """
    Invalidate the upsampling eligibility cache for the given organization and projects.
    This should be called when the allowlist configuration changes to ensure
    cache consistency across the system.
    """
    cache_key = f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}"
    cache.delete(cache_key)
```
Same `hash()` instability bug in `invalidate_upsampling_cache`.

This must use the same deterministic key derivation as `is_errors_query_for_error_upsampled_projects`, otherwise invalidation may fail to match the cached entry written by a different worker process.
🐛 Proposed fix

```diff
-    cache_key = f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}"
+    project_key = ",".join(str(pid) for pid in sorted(project_ids))
+    cache_key = f"error_upsampling_eligible:{organization_id}:{project_key}"
```
Flagged code in src/sentry/api/helpers/error_upsampling.py (lines 130-140):

```python
def _is_error_focused_query(request: Request) -> bool:
    """
    Check if a query is focused on error events.
    Reduced to only check for event.type:error to err on the side of caution.
    """
    query = request.GET.get("query", "").lower()

    if "event.type:error" in query:
        return True

    return False
```
Substring match for `event.type:error` will false-positive on negated queries.

A query like `!event.type:error` or `event.type:error_something` will match the `in` check. The negation case is particularly concerning since it means "exclude errors," which is the opposite of the intended behavior: upsampling would be applied to a non-error query.
🐛 Proposed fix: guard against negation and partial matches

```diff
 def _is_error_focused_query(request: Request) -> bool:
     query = request.GET.get("query", "").lower()

-    if "event.type:error" in query:
+    # Ensure we match "event.type:error" but not "!event.type:error"
+    # or "event.type:error_something"
+    import re
+    if re.search(r'(?<!\!)event\.type:error\b', query):
         return True

     return False
```
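The tokenizing alternative mentioned in the AI-agent prompt above is also straightforward; a minimal sketch (hypothetical helper, whitespace tokenization only, so quoted or parenthesized search syntax is out of scope):

```python
def _has_error_type_token(query: str) -> bool:
    # Exact token match rejects "!event.type:error" (negated) and
    # "event.type:error_something" (partial) by construction.
    return any(token == "event.type:error" for token in query.lower().split())
```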
Flagged code in src/sentry/testutils/factories.py (lines 344-358):

```python
def _set_sample_rate_from_error_sampling(normalized_data: MutableMapping[str, Any]) -> None:
    """Set 'sample_rate' on normalized_data if contexts.error_sampling.client_sample_rate is present and valid."""
    client_sample_rate = None
    try:
        client_sample_rate = (
            normalized_data.get("contexts", {}).get("error_sampling", {}).get("client_sample_rate")
        )
    except Exception:
        pass
    if client_sample_rate:
        try:
            normalized_data["sample_rate"] = float(client_sample_rate)
        except Exception:
            pass
```
`if client_sample_rate:` skips valid zero sample rates and silently swallows all exceptions.

Two concerns:

- Line 353: `if client_sample_rate:` evaluates to `False` for `0` and `0.0`, which are valid sample rates. Use `if client_sample_rate is not None:` instead.
- The bare `except Exception: pass` blocks (flagged by Ruff S110/BLE001) silently swallow all errors. In test utilities this is less critical, but at minimum logging or narrowing the exception type would help debug test failures.
Proposed fix

```diff
 def _set_sample_rate_from_error_sampling(normalized_data: MutableMapping[str, Any]) -> None:
     """Set 'sample_rate' on normalized_data if contexts.error_sampling.client_sample_rate is present and valid."""
     client_sample_rate = None
     try:
         client_sample_rate = (
             normalized_data.get("contexts", {}).get("error_sampling", {}).get("client_sample_rate")
         )
-    except Exception:
+    except (AttributeError, TypeError):
         pass
-    if client_sample_rate:
+    if client_sample_rate is not None:
         try:
             normalized_data["sample_rate"] = float(client_sample_rate)
-        except Exception:
+        except (TypeError, ValueError):
             pass
```

🧰 Tools
🪛 Ruff (0.14.14)

- [error] 351-352: try-except-pass detected, consider logging the exception (S110)
- [warning] 351-351: Do not catch blind exception: Exception (BLE001)
- [error] 356-357: try-except-pass detected, consider logging the exception (S110)
- [warning] 356-356: Do not catch blind exception: Exception (BLE001)
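A two-line illustration of the truthiness pitfall the first hunk fixes:

```python
client_sample_rate = 0.0  # a valid "drop everything" sample rate

if client_sample_rate:  # falsy: 0.0 would be silently discarded
    print("never reached")
if client_sample_rate is not None:  # correct: 0.0 is preserved
    print("sample_rate set to 0.0")
```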
Flagged code in tests/snuba/api/endpoints/test_organization_events_stats.py (lines 3604-3607):

```python
@mock.patch("sentry.api.helpers.error_upsampling.options")
def test_error_upsampling_with_allowlisted_projects(self, mock_options):
    # Set up allowlisted projects
    mock_options.get.return_value = [self.project.id, self.project2.id]
```
Mock of `options.get` is too broad: all `options.get()` calls return the project ID list.

`mock_options.get.return_value = [self.project.id, self.project2.id]` replaces every call to `options.get(...)` during the request lifecycle, regardless of the key being queried. If any middleware, serializer, or other code path calls `options.get("some.other.key")`, it will receive the project ID list instead of the correct value, potentially causing subtle test failures or silently incorrect behavior.

This applies to all four test methods in this class.
🔧 Proposed fix: use `side_effect` to be key-specific

```diff
 @mock.patch("sentry.api.helpers.error_upsampling.options")
 def test_error_upsampling_with_allowlisted_projects(self, mock_options):
-    mock_options.get.return_value = [self.project.id, self.project2.id]
+    original_options_get = options.get  # import options at top of file or inline
+
+    def selective_get(key, default=None):
+        if key == "issues.client_error_sampling.project_allowlist":
+            return [self.project.id, self.project2.id]
+        return original_options_get(key, default)
+
+    mock_options.get.side_effect = selective_get
```

Alternatively, consider patching only the specific option lookup rather than the entire options module.
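If the test environment provides Sentry's `override_options` helper (an assumption; verify the import path), the module-level patch can be avoided entirely. The `do_request` call and `request_data` attribute below are placeholders for however these tests issue the request:

```python
from sentry.testutils.helpers import override_options


def test_error_upsampling_with_allowlisted_projects(self):
    # Override only the allowlist option; every other options.get() call
    # keeps its real behavior.
    with override_options(
        {"issues.client_error_sampling.project_allowlist": [self.project.id, self.project2.id]}
    ):
        response = self.do_request(self.request_data)  # placeholder request helper
        assert response.status_code == 200
```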
This pull request was automatically created by @coderabbitai/e2e-reviewer (batch-created pull request).
Summary by CodeRabbit
Release Notes
New Features
Tests