Automated Test: error-upsampling-race-condition #318

Closed
wants to merge 2 commits

Conversation

admin-coderabbit (Owner) commented Feb 4, 2026

This pull request was automatically created by @coderabbitai/e2e-reviewer.

Batch-created pull request.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added error upsampling to organization event statistics, improving accuracy of error counts across projects that are part of the sampling program.

Yuval Mandelboum and others added 2 commits July 25, 2025 09:48
…(#94376)

Part of the Error Upsampling project:
https://www.notion.so/sentry/Tech-Spec-Error-Up-Sampling-1e58b10e4b5d80af855cf3b992f75894?source=copy_link

The events-stats API will now check whether all projects in the query are
allowlisted for upsampling and, if so, convert the count query into a sum over
`sample_weight` in Snuba. This is done by defining a new SnQL function,
`upsampled_count()`.
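
For illustration, the registration could look roughly like the following. This
is a minimal sketch assuming the snql_aggregate pattern used in discover.py and
the toInt64(sum(sample_weight)) aggregation described in the review below; the
import paths and exact parameters are assumptions, not the PR's literal code.

from snuba_sdk import Column, Function

from sentry.search.events.fields import SnQLFunction

# Sketch: registered in the dataset's function list alongside count(), eps(), etc.
SnQLFunction(
    "upsampled_count",
    # Weighted count: each sampled error contributes its sample_weight,
    # cast back to an integer so it behaves like a plain count().
    snql_aggregate=lambda _args, alias: Function(
        "toInt64",
        [Function("sum", [Column("sample_weight")])],
        alias,
    ),
    default_result_type="integer",
)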

I noticed that eps() and epm() are also used by this endpoint. I considered
(and even started working on) swapping those as well, since for correctness
they should probably sum `sample_weight` rather than count naively. However,
that caused some complications, and since they are only used by specific
dashboard widgets and are not available in Discover, I decided to defer
changing them until we know it is needed.
- Add 60-second cache for upsampling eligibility checks to improve performance
- Separate upsampling eligibility check from query transformation for better optimization
- Remove unnecessary null checks in upsampled_count() function per schema requirements
- Add cache invalidation utilities for configuration management

This improves performance during high-traffic periods by avoiding repeated
expensive allowlist lookups while maintaining data consistency.
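
A minimal sketch of the caching pattern described above, assuming Django's
cache framework (which Sentry uses); the helper names are illustrative, and the
key deliberately uses a deterministic string of sorted project IDs, which also
addresses the cross-worker concern raised in the review below.

from collections.abc import Callable, Sequence

from django.core.cache import cache

CACHE_TTL_SECONDS = 60  # short TTL absorbs traffic spikes without long staleness


def _eligibility_cache_key(organization_id: int, project_ids: Sequence[int]) -> str:
    # Deterministic key: sorted IDs joined into a stable string, so every
    # worker computes the same key and invalidation lines up with reads.
    ids = ",".join(map(str, sorted(project_ids)))
    return f"error_upsampling_eligible:{organization_id}:{ids}"


def cached_upsampling_eligibility(
    organization_id: int,
    project_ids: Sequence[int],
    compute: Callable[[Sequence[int]], bool],
) -> bool:
    key = _eligibility_cache_key(organization_id, project_ids)
    result = cache.get(key)
    if result is None:  # cache miss; note False is a valid cached value
        result = compute(project_ids)
        cache.set(key, result, CACHE_TTL_SECONDS)
    return result


def invalidate_upsampling_eligibility(organization_id: int, project_ids: Sequence[int]) -> None:
    # Invalidation utility for configuration changes, per the commit above.
    cache.delete(_eligibility_cache_key(organization_id, project_ids))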

coderabbit-eval bot commented Feb 4, 2026

📝 Walkthrough


This pull request introduces error upsampling functionality for Snuba queries, enabling weighted counting of errors from allowlisted projects. It includes a new helper module for eligibility checks and query column transformations, integrates these utilities into the organization events stats endpoint, adds an upsampled_count function to the discover dataset, and provides comprehensive test coverage.

Changes

• Configuration & Metadata: pyproject.toml, sentry-repo
  Adds the error_upsampling module to mypy overrides and test paths; updates the submodule commit hash.
• Core Error Upsampling Helper: src/sentry/api/helpers/error_upsampling.py
  New module providing eligibility checks via cached organization/project lookups, dataset-aware safety validation, and query column transformation from count() to upsampled_count() for weighted error counting (see the sketch after this list).
• Organization Events Stats Integration: src/sentry/api/endpoints/organization_events_stats.py
  Integrates error upsampling checks and column transformations into the query execution paths; applies the upsampling eligibility decision early, with caching; uses the transformed columns across the timeseries and top-events query flows.
• Discover Dataset Configuration: src/sentry/search/events/datasets/discover.py
  Adds an upsampled_count SnQLFunction that aggregates via toInt64(sum(sample_weight)) for efficient weighted error counting.
• Test Utilities & Sample Rate Propagation: src/sentry/testutils/factories.py
  Adds a helper to propagate the error_sampling context's client_sample_rate into normalized event data during test event storage.
• Error Upsampling Test Suite: tests/sentry/api/helpers/test_error_upsampling.py
  Comprehensive tests validating eligibility checks, column transformations, error-focused query detection, and dataset-specific upsampling application logic.
• Integration Tests: tests/snuba/api/endpoints/test_organization_events_stats.py
  End-to-end tests verifying upsampled count behavior under full-allowlist, partial-allowlist, transaction-event, and empty-allowlist scenarios.
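
To make the count() to upsampled_count() rewrite concrete, here is a
hypothetical sketch of the column transformation named above; the matching
logic is illustrative, and the real helper in the PR may differ.

def transform_query_columns_for_error_upsampling(columns: list[str]) -> list[str]:
    """Rewrite plain count() aggregates to the weighted variant.

    Aliasing back to "count" keeps the response field name unchanged
    for API clients.
    """
    transformed = []
    for column in columns:
        if column.strip().lower() == "count()":
            transformed.append("upsampled_count() as count")
        else:
            transformed.append(column)
    return transformed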

Sequence Diagram

sequenceDiagram
    participant Client
    participant Endpoint as organization_events_stats Endpoint
    participant ErrUpsampling as Error Upsampling Helper
    participant Transform as Column Transformer
    participant SnubaQuery as Snuba Query Executor
    participant Dataset as Discover Dataset

    Client->>Endpoint: Request stats (organization, projects)
    Endpoint->>ErrUpsampling: Check eligibility via is_errors_query_for_error_upsampled_projects
    ErrUpsampling->>ErrUpsampling: Check cache (org_id, project_ids)
    alt Cache miss
        ErrUpsampling->>ErrUpsampling: Validate with dataset & query context
        ErrUpsampling->>ErrUpsampling: Fetch allowlist from options
        ErrUpsampling->>ErrUpsampling: Cache result for 60s
    end
    ErrUpsampling-->>Endpoint: eligibility decision (boolean)
    
    alt Upsampling enabled
        Endpoint->>Transform: transform_query_columns_for_error_upsampling
        Transform->>Transform: Convert count() to upsampled_count() as count
        Transform-->>Endpoint: transformed_columns
        Endpoint->>SnubaQuery: Execute with transformed columns
    else Upsampling disabled
        Endpoint->>SnubaQuery: Execute with original columns
    end
    
    SnubaQuery->>Dataset: Resolve upsampled_count function
    Dataset->>Dataset: Apply toInt64(sum(sample_weight)) aggregation
    Dataset-->>SnubaQuery: Weighted error counts
    SnubaQuery-->>Endpoint: Query results
    Endpoint-->>Client: Event stats response

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 Hops of joy through upsampling's way,
Weights and caches guide the day,
Errors counted, samples bound,
Where allowlists abound!
Queries transformed, insights found—
A fluffy feat throughout the ground! 🌱

🚥 Pre-merge checks | ✅ 1 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Title check: ⚠️ Warning
    Explanation: The title 'Automated Test: error-upsampling-race-condition' does not accurately reflect the actual changes in the pull request. The PR implements error upsampling integration across multiple modules (new helper functions, endpoint modifications, dataset changes, and test coverage).
    Resolution: Update the title to something like 'Add error upsampling support for query optimization' or 'Implement error upsampling for Snuba queries'.

✅ Passed checks (1 passed)
  • Description check: ✅ Passed
    Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
      • Create PR with unit tests
      • Post copyable unit tests in a comment
      • Commit unit tests in branch error-upsampling-race-condition

Comment @coderabbit eval help to get the list of available commands and usage tips.

coderabbit-eval bot closed this Feb 4, 2026

coderabbit-eval bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@src/sentry/api/endpoints/organization_events_stats.py`:
- Around line 218-222: The upsampling eligibility check currently calls
is_errors_query_for_error_upsampled_projects(snuba_params, organization,
dataset, request) using the outer dataset, which can be incorrect when
_get_event_stats is called with a different dataset; update the call to use
scoped_dataset instead (i.e.,
is_errors_query_for_error_upsampled_projects(snuba_params, organization,
scoped_dataset, request)) so the check matches the actual dataset being queried
and preserves correct behavior across split paths.

In `@src/sentry/api/helpers/error_upsampling.py`:
- Around line 27-38: The cache key uses Python's process-randomized hash
(cache_key built from hash(tuple(sorted(snuba_params.project_ids)))), which
breaks cross-worker cache/invalidation; replace the hashed portion with a
deterministic representation of the sorted project IDs (e.g., join the sorted
snuba_params.project_ids into a stable string) when building cache_key in this
module (refer to cache_key, snuba_params.project_ids, organization.id), and make
the same change where hashing is currently used (also update
invalidate_upsampling_cache usage to use the same deterministic key generation
and ensure _are_all_projects_error_upsampled remains the eligibility check).

In `@src/sentry/testutils/factories.py`:
- Around line 344-357: The _set_sample_rate_from_error_sampling function
currently skips zero values because it checks `if client_sample_rate:` and uses
broad excepts; change the presence check to `client_sample_rate is not None` so
`"0"`/0 are accepted, and replace the blanket `except Exception` blocks with
narrow exception handling when parsing (catch ValueError and TypeError) around
`float(client_sample_rate)`; keep the initial retrieval via
normalized_data.get("contexts", {}).get("error_sampling",
{}).get("client_sample_rate") and on successful parse set
normalized_data["sample_rate"] to the parsed float.

In `@tests/snuba/api/endpoints/test_organization_events_stats.py`:
- Around line 3557-3603: Tests leak allowlist eligibility across runs because
is_errors_query_for_error_upsampled_projects caches results; clear that cache in
setUp() before applying mock_options so each test starts fresh. Locate the setUp
method in the test class and add a cache invalidation call for
is_errors_query_for_error_upsampled_projects (for example use
is_errors_query_for_error_upsampled_projects.cache_clear() or the module's
provided invalidation helper) before any mock_options mutations so cached
org+project_id results cannot affect subsequent tests.
🧹 Nitpick comments (2)
src/sentry/api/helpers/error_upsampling.py (1)

35-35: Drop or use the unused organization parameter.

organization is unused in _are_all_projects_error_upsampled (Ruff ARG001). Consider removing it from the signature and call sites, or using it for per-org allowlist selection if intended.

✅ Suggested adjustment
-def _are_all_projects_error_upsampled(
-    project_ids: Sequence[int], organization: Organization
-) -> bool:
+def _are_all_projects_error_upsampled(project_ids: Sequence[int]) -> bool:
@@
-    is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids, organization)
+    is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids)

Also applies to: 43-64

src/sentry/search/events/datasets/discover.py (1)

1041-1052: Avoid unused lambda arg warning in upsampled_count.

Ruff flags args as unused; renaming keeps lint clean.

✅ Suggested adjustment
-                    snql_aggregate=lambda args, alias: Function(
+                    snql_aggregate=lambda _args, alias: Function(

Comment on lines +218 to +222
            # Early upsampling eligibility check for performance optimization
            # This cached result ensures consistent behavior across query execution
            should_upsample = is_errors_query_for_error_upsampled_projects(
                snuba_params, organization, dataset, request
            )

⚠️ Potential issue | 🟠 Major

Use scoped_dataset for upsampling eligibility.

_get_event_stats is invoked with different datasets during split paths, but the eligibility check uses the outer dataset. This can skip upsampling on the actual dataset being queried. Use scoped_dataset here.

✅ Suggested fix
-            should_upsample = is_errors_query_for_error_upsampled_projects(
-                snuba_params, organization, dataset, request
-            )
+            should_upsample = is_errors_query_for_error_upsampled_projects(
+                snuba_params, organization, scoped_dataset, request
+            )

Comment on lines +27 to +38
    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"

    # Check cache first for performance optimization
    cached_result = cache.get(cache_key)
    if cached_result is not None:
        return cached_result and _should_apply_sample_weight_transform(dataset, request)

    # Cache miss - perform fresh allowlist check
    is_eligible = _are_all_projects_error_upsampled(snuba_params.project_ids, organization)

    # Cache for 60 seconds to improve performance during traffic spikes
    cache.set(cache_key, is_eligible, 60)

⚠️ Potential issue | 🟠 Major

Use deterministic cache keys for cross-worker consistency.

hash() is process-randomized, so cache hits and invalidation won’t line up across workers. This makes invalidate_upsampling_cache ineffective in distributed deployments. Use a stable key based on sorted IDs.

✅ Suggested adjustment
+def _make_upsampling_cache_key(organization_id: int, project_ids: Sequence[int]) -> str:
+    project_key = ",".join(map(str, sorted(project_ids)))
+    return f"error_upsampling_eligible:{organization_id}:{project_key}"
+
 def is_errors_query_for_error_upsampled_projects(
     snuba_params: SnubaParams,
     organization: Organization,
     dataset: ModuleType,
     request: Request,
 ) -> bool:
@@
-    cache_key = f"error_upsampling_eligible:{organization.id}:{hash(tuple(sorted(snuba_params.project_ids)))}"
+    cache_key = _make_upsampling_cache_key(organization.id, snuba_params.project_ids)
@@
 def invalidate_upsampling_cache(organization_id: int, project_ids: Sequence[int]) -> None:
@@
-    cache_key = f"error_upsampling_eligible:{organization_id}:{hash(tuple(sorted(project_ids)))}"
+    cache_key = _make_upsampling_cache_key(organization_id, project_ids)
     cache.delete(cache_key)

Also applies to: 73-74

Comment on lines +344 to +357
def _set_sample_rate_from_error_sampling(normalized_data: MutableMapping[str, Any]) -> None:
    """Set 'sample_rate' on normalized_data if contexts.error_sampling.client_sample_rate is present and valid."""
    client_sample_rate = None
    try:
        client_sample_rate = (
            normalized_data.get("contexts", {}).get("error_sampling", {}).get("client_sample_rate")
        )
    except Exception:
        pass
    if client_sample_rate:
        try:
            normalized_data["sample_rate"] = float(client_sample_rate)
        except Exception:
            pass

⚠️ Potential issue | 🟡 Minor

Handle zero sample rates and narrow exception handling.

if client_sample_rate: skips "0"/0 and the broad except Exception: pass triggers Ruff BLE001/S110. Consider checking for is not None and catching only parsing errors.

✅ Suggested adjustment
 def _set_sample_rate_from_error_sampling(normalized_data: MutableMapping[str, Any]) -> None:
     """Set 'sample_rate' on normalized_data if contexts.error_sampling.client_sample_rate is present and valid."""
-    client_sample_rate = None
-    try:
-        client_sample_rate = (
-            normalized_data.get("contexts", {}).get("error_sampling", {}).get("client_sample_rate")
-        )
-    except Exception:
-        pass
-    if client_sample_rate:
-        try:
-            normalized_data["sample_rate"] = float(client_sample_rate)
-        except Exception:
-            pass
+    contexts = normalized_data.get("contexts") or {}
+    if isinstance(contexts, Mapping):
+        error_sampling = contexts.get("error_sampling") or {}
+        if isinstance(error_sampling, Mapping) and "client_sample_rate" in error_sampling:
+            client_sample_rate = error_sampling.get("client_sample_rate")
+            try:
+                normalized_data["sample_rate"] = float(client_sample_rate)
+            except (TypeError, ValueError):
+                pass
🧰 Tools
🪛 Ruff (0.14.14)

[error] 351-352: try-except-pass detected, consider logging the exception

(S110)


[warning] 351-351: Do not catch blind exception: Exception

(BLE001)


[error] 356-357: try-except-pass detected, consider logging the exception

(S110)


[warning] 356-356: Do not catch blind exception: Exception

(BLE001)

Comment on lines +3557 to +3603
    def setUp(self):
        super().setUp()
        self.login_as(user=self.user)
        self.authed_user = self.user

        self.day_ago = before_now(days=1).replace(hour=10, minute=0, second=0, microsecond=0)

        self.project = self.create_project()
        self.project2 = self.create_project()
        self.user = self.create_user()
        self.user2 = self.create_user()

        # Store some error events with error_sampling context
        self.store_event(
            data={
                "event_id": "a" * 32,
                "message": "very bad",
                "type": "error",
                "exception": [{"type": "ValueError", "value": "Something went wrong 1"}],
                "timestamp": (self.day_ago + timedelta(minutes=1)).isoformat(),
                "fingerprint": ["group1"],
                "tags": {"sentry:user": self.user.email},
                "contexts": {"error_sampling": {"client_sample_rate": 0.1}},
            },
            project_id=self.project.id,
        )
        self.store_event(
            data={
                "event_id": "b" * 32,
                "message": "oh my",
                "type": "error",
                "exception": [{"type": "ValueError", "value": "Something went wrong 2"}],
                "timestamp": (self.day_ago + timedelta(hours=1, minutes=1)).isoformat(),
                "fingerprint": ["group2"],
                "tags": {"sentry:user": self.user2.email},
                "contexts": {"error_sampling": {"client_sample_rate": 0.1}},
            },
            project_id=self.project2.id,
        )
        self.wait_for_event_count(self.project.id, 1)
        self.wait_for_event_count(self.project2.id, 1)

        self.url = reverse(
            "sentry-api-0-organization-events-stats",
            kwargs={"organization_id_or_slug": self.project.organization.slug},
        )

⚠️ Potential issue | 🟡 Minor

Clear the error‑upsampling cache between tests to avoid order‑dependent behavior.

is_errors_query_for_error_upsampled_projects caches allowlist eligibility by org+project_ids. Since each test mutates the allowlist via mock_options, a cached result can leak across tests and flip expectations. Clear or invalidate the cache in setUp() to keep these tests deterministic.

Suggested fix
+from sentry.api.helpers.error_upsampling import invalidate_upsampling_cache
+
 class OrganizationEventsStatsErrorUpsamplingTest(APITestCase, SnubaTestCase):
     endpoint = "sentry-api-0-organization-events-stats"

     def setUp(self):
         super().setUp()
         self.login_as(user=self.user)
         self.authed_user = self.user
@@
         self.project = self.create_project()
         self.project2 = self.create_project()
@@
         self.url = reverse(
             "sentry-api-0-organization-events-stats",
             kwargs={"organization_id_or_slug": self.project.organization.slug},
         )
+
+        # Avoid allowlist cache bleed across tests
+        invalidate_upsampling_cache(
+            self.project.organization.id, [self.project.id, self.project2.id]
+        )
