-
Notifications
You must be signed in to change notification settings - Fork 0
Automated Test: monitor-incident-refactor-after #320
Automated Test: monitor-incident-refactor-after #320
Conversation
…0528) Since we'll be doing more with issue occurrences split out the concept of incidents into it's own logic module, as well as incident_occurrence into it's own module Part of GH-80527
📝 WalkthroughWalkthroughNew modules for incident occurrence creation and publishing are introduced, centralizing threshold-based incident logic. The mark_failed module is refactored to delegate incident processing to the new modules instead of handling it inline, reducing code duplication. Changes
Sequence DiagramsequenceDiagram
participant Monitor as MonitorCheckIn<br/>(failed)
participant Incidents as try_incident_threshold
participant Occurrence as create_incident_occurrence
participant MonitorEnv as MonitorEnvironment
participant Kafka as Kafka Producer
Monitor->>Incidents: trigger with failed_checkin
Incidents->>MonitorEnv: fetch environment state
alt environment exists
Incidents->>Incidents: load prior check-ins<br/>up to threshold
alt threshold met & no OK in history
Incidents->>MonitorEnv: set status to ERROR
Incidents->>MonitorEnv: create/fetch MonitorIncident
Incidents->>Occurrence: iterate failed_checkins
loop for each check-in
Occurrence->>Occurrence: assemble metadata<br/>(title, subtitle, evidence)
Occurrence->>Occurrence: build event data<br/>(contexts, tags, trace)
Occurrence->>Kafka: publish IssueOccurrence
end
Incidents-->>Monitor: return true
else
Incidents-->>Monitor: return false
end
else
Incidents-->>Monitor: return false
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~35 minutes Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/sentry/monitors/logic/incident_occurrence.py`:
- Around line 159-171: In get_monitor_environment_context, you create a
normalized local variable config and normalize its "schedule_type" but the
returned dict uses monitor_environment.monitor.config (original) instead; update
the returned "config" value to use the local config variable so the normalized
schedule_type is preserved (refer to get_monitor_environment_context and the
local config variable).
In `@src/sentry/monitors/logic/incidents.py`:
- Around line 37-53: The current logic can create an incident before reaching
failure_issue_threshold because after slicing previous_checkins you don't verify
you actually have that many recent check-ins; update the logic that builds
previous_checkins (using MonitorCheckIn.objects.filter(...) and
reversing/slicing into previous_checkins) to check the length after slicing and
return False if len(previous_checkins) < failure_issue_threshold so you only
proceed to the CheckInStatus.OK scan when you have at least
failure_issue_threshold check-ins for the monitor_environment and failed_checkin
context.
| def get_monitor_environment_context(monitor_environment: MonitorEnvironment): | ||
| config = monitor_environment.monitor.config.copy() | ||
| if "schedule_type" in config: | ||
| config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display() | ||
|
|
||
| return { | ||
| "id": str(monitor_environment.monitor.guid), | ||
| "slug": str(monitor_environment.monitor.slug), | ||
| "name": monitor_environment.monitor.name, | ||
| "config": monitor_environment.monitor.config, | ||
| "status": monitor_environment.get_status_display(), | ||
| "type": monitor_environment.monitor.get_type_display(), | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return the normalized config from context.
Line 168 returns the original config, which drops the schedule_type normalization done a few lines above. Return the local config instead.
🛠️ Proposed fix
- "config": monitor_environment.monitor.config,
+ "config": config,🤖 Prompt for AI Agents
In `@src/sentry/monitors/logic/incident_occurrence.py` around lines 159 - 171, In
get_monitor_environment_context, you create a normalized local variable config
and normalize its "schedule_type" but the returned dict uses
monitor_environment.monitor.config (original) instead; update the returned
"config" value to use the local config variable so the normalized schedule_type
is preserved (refer to get_monitor_environment_context and the local config
variable).
| previous_checkins = cast( | ||
| list[SimpleCheckIn], | ||
| # Using .values for performance reasons | ||
| MonitorCheckIn.objects.filter( | ||
| monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added | ||
| ) | ||
| .order_by("-date_added") | ||
| .values("id", "date_added", "status"), | ||
| ) | ||
|
|
||
| # reverse the list after slicing in order to start with oldest check-in | ||
| previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold])) | ||
|
|
||
| # If we have any successful check-ins within the threshold of | ||
| # commits we have NOT reached an incident state | ||
| if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]): | ||
| return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guard against incident creation before the failure threshold is reached.
After Line 48 slices previous_checkins, the code can still create an incident even when fewer than failure_issue_threshold failures exist. This can trigger incidents early when only 1..(threshold‑1) failures have occurred. Add a length check before proceeding.
🧯 Proposed fix
- previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+ previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+
+ if len(previous_checkins) < failure_issue_threshold:
+ return False🤖 Prompt for AI Agents
In `@src/sentry/monitors/logic/incidents.py` around lines 37 - 53, The current
logic can create an incident before reaching failure_issue_threshold because
after slicing previous_checkins you don't verify you actually have that many
recent check-ins; update the logic that builds previous_checkins (using
MonitorCheckIn.objects.filter(...) and reversing/slicing into previous_checkins)
to check the length after slicing and return False if len(previous_checkins) <
failure_issue_threshold so you only proceed to the CheckInStatus.OK scan when
you have at least failure_issue_threshold check-ins for the monitor_environment
and failed_checkin context.
This pull request was automatically created by
@coderabbitai/e2e-reviewer.Batch created pull request.
Summary by CodeRabbit
New Features
Refactor