Automated Test: monitor-incident-refactor-after #320

admin-coderabbit · 2026-02-04T18:50:35Z

This pull request was automatically created by @coderabbitai/e2e-reviewer.

Batch created pull request.

Summary by CodeRabbit

New Features
- Monitor incidents now automatically trigger and publish occurrence data when failure thresholds are exceeded
- Aggregated failure reason messages provide clearer visibility into monitor status (missed, timeout, or error)
- Incident details include metadata for tracking and investigation
Refactor
- Simplified and centralized monitor failure processing logic for improved maintainability

…0528) Since we'll be doing more with issue occurrences, split out the concept of incidents into its own logic module, and incident_occurrence into its own module. Part of GH-80527

coderabbit-eval · 2026-02-04T18:51:19Z

📝 Walkthrough

New modules for incident occurrence creation and publishing are introduced, centralizing threshold-based incident logic. The mark_failed module is refactored to delegate incident processing to the new modules instead of handling it inline, reducing code duplication.

Changes

Cohort / File(s)	Summary
Incident Occurrence Infrastructure `src/sentry/monitors/logic/incident_occurrence.py`	New module for creating and publishing incident occurrences with metadata assembly, failure reason aggregation, and Kafka integration. Includes helper functions for context preparation and status mapping.
Incident Threshold Logic `src/sentry/monitors/logic/incidents.py`	New module implementing threshold-based incident creation and management, evaluating prior check-ins against failure thresholds to determine when to raise incidents and trigger occurrence publishing.
Refactored Mark Failed `src/sentry/monitors/logic/mark_failed.py`	Simplified module delegating incident threshold processing to centralized logic; removed 260 lines of inline threshold handling, incident creation, and status mapping previously managed here.
Type Definitions `src/sentry/monitors/types.py`	Added SimpleCheckIn TypedDict for representing stripped-down check-in objects with id, date_added, and status fields.
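
For reference, a TypedDict with the three fields named in the last row might look like the following. This is a minimal sketch: the field types (`int`, `datetime`) and the example values are assumptions, since the summary only names the fields.

```python
from datetime import datetime, timezone
from typing import TypedDict


class SimpleCheckIn(TypedDict):
    """Stripped-down check-in, as described in the summary table.

    Field types are assumed for illustration; the summary only names the fields.
    """

    id: int
    date_added: datetime
    status: int


# A row shaped like what `.values("id", "date_added", "status")` would yield.
# The concrete values here are made up.
example: SimpleCheckIn = {
    "id": 1,
    "date_added": datetime(2026, 2, 4, tzinfo=timezone.utc),
    "status": 2,
}
```

A TypedDict adds no runtime cost over a plain dict, which fits the "Using .values for performance reasons" comment in the reviewed query.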

Sequence Diagram

sequenceDiagram
    participant Monitor as MonitorCheckIn<br/>(failed)
    participant Incidents as try_incident_threshold
    participant Occurrence as create_incident_occurrence
    participant MonitorEnv as MonitorEnvironment
    participant Kafka as Kafka Producer

    Monitor->>Incidents: trigger with failed_checkin
    Incidents->>MonitorEnv: fetch environment state
    alt environment exists
        Incidents->>Incidents: load prior check-ins<br/>up to threshold
        alt threshold met & no OK in history
            Incidents->>MonitorEnv: set status to ERROR
            Incidents->>MonitorEnv: create/fetch MonitorIncident
            Incidents->>Occurrence: iterate failed_checkins
            loop for each check-in
                Occurrence->>Occurrence: assemble metadata<br/>(title, subtitle, evidence)
                Occurrence->>Occurrence: build event data<br/>(contexts, tags, trace)
                Occurrence->>Kafka: publish IssueOccurrence
            end
            Incidents-->>Monitor: return true
        else
            Incidents-->>Monitor: return false
        end
    else
        Incidents-->>Monitor: return false
    end
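
The branches in the diagram can be sketched with plain values. This is an illustration only, not the code under review: `try_incident_threshold_sketch`, the string statuses, and the `publish` callback (standing in for the Kafka producer) are all hypothetical. The length check models the diagram's "threshold met" condition and corresponds to the guard proposed in the review comment on `incidents.py`, not necessarily to the code as submitted.

```python
from typing import Callable, Optional, Sequence

OK, ERROR = "ok", "error"  # stand-ins for CheckInStatus values


def try_incident_threshold_sketch(
    prior_statuses: Optional[Sequence[str]],
    threshold: int,
    publish: Callable[[str], None],
) -> bool:
    """Mirror the branches of the diagram using plain values.

    prior_statuses: newest-first statuses, or None when no environment exists.
    publish: hypothetical callback standing in for the Kafka producer.
    """
    if prior_statuses is None:  # outer `else`: no environment
        return False

    # Load prior check-ins up to the threshold, then order oldest-first.
    window = list(reversed(prior_statuses[:threshold]))

    # "threshold met": require a full window of check-ins.
    if len(window) < threshold:
        return False

    # "no OK in history": any OK in the window means no incident.
    if any(status == OK for status in window):
        return False

    # One occurrence is published per failed check-in in the window.
    for status in window:
        publish(status)
    return True


published: list[str] = []
raised = try_incident_threshold_sketch([ERROR, ERROR, ERROR], 3, published.append)
```

Passing the publisher as a callback keeps the decision logic testable without a broker, which is one way to read the split between `incidents.py` and `incident_occurrence.py`.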

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

Poem

🐰 A new path for when monitors fall,
Incidents logged and occurrences call—
From thresholds checked to Kafka's stream,
Centralized logic fuels our dream!
Hopping toward stability ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title 'Automated Test: monitor-incident-refactor-after' is vague and does not clearly describe the substantive changes made in the pull request. While it references a branch name, it provides no meaningful information about what the pull request actually implements or changes.	Replace the title with a clear, descriptive summary of the main changes, such as 'Refactor monitor incident handling and extract incident occurrence logic' to better convey the purpose of the refactoring.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbit-eval

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@src/sentry/monitors/logic/incident_occurrence.py`:
- Around line 159-171: In get_monitor_environment_context, you create a
normalized local variable config and normalize its "schedule_type" but the
returned dict uses monitor_environment.monitor.config (original) instead; update
the returned "config" value to use the local config variable so the normalized
schedule_type is preserved (refer to get_monitor_environment_context and the
local config variable).

In `@src/sentry/monitors/logic/incidents.py`:
- Around line 37-53: The current logic can create an incident before reaching
failure_issue_threshold because after slicing previous_checkins you don't verify
you actually have that many recent check-ins; update the logic that builds
previous_checkins (using MonitorCheckIn.objects.filter(...) and
reversing/slicing into previous_checkins) to check the length after slicing and
return False if len(previous_checkins) < failure_issue_threshold so you only
proceed to the CheckInStatus.OK scan when you have at least
failure_issue_threshold check-ins for the monitor_environment and failed_checkin
context.

coderabbit-eval · 2026-02-04T19:04:09Z

src/sentry/monitors/logic/incident_occurrence.py

+def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
+    config = monitor_environment.monitor.config.copy()
+    if "schedule_type" in config:
+        config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()
+
+    return {
+        "id": str(monitor_environment.monitor.guid),
+        "slug": str(monitor_environment.monitor.slug),
+        "name": monitor_environment.monitor.name,
+        "config": monitor_environment.monitor.config,
+        "status": monitor_environment.get_status_display(),
+        "type": monitor_environment.monitor.get_type_display(),
+    }


⚠️ Potential issue | 🟡 Minor

Return the normalized config from context.
Line 168 returns the original config, which drops the schedule_type normalization done a few lines above. Return the local config instead.

🛠️ Proposed fix

-        "config": monitor_environment.monitor.config,
+        "config": config,

🤖 Prompt for AI Agents

In `@src/sentry/monitors/logic/incident_occurrence.py` around lines 159 - 171, In get_monitor_environment_context, you create a normalized local variable config and normalize its "schedule_type" but the returned dict uses monitor_environment.monitor.config (original) instead; update the returned "config" value to use the local config variable so the normalized schedule_type is preserved (refer to get_monitor_environment_context and the local config variable).
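
The effect of the proposed fix can be illustrated with a stand-in object. This is a sketch, not the Sentry models: `FakeMonitor` and its integer `schedule_type` encoding are hypothetical, and only the `config` handling from the reviewed function is reproduced.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeMonitor:
    """Hypothetical stand-in for the Monitor model (not the real class)."""

    config: dict[str, Any] = field(default_factory=dict)

    def get_schedule_type_display(self) -> str:
        # Assumed mapping, for illustration only.
        return {1: "crontab", 2: "interval"}.get(self.config.get("schedule_type"), "unknown")


def context_config_before(monitor: FakeMonitor) -> dict[str, Any]:
    config = monitor.config.copy()
    if "schedule_type" in config:
        config["schedule_type"] = monitor.get_schedule_type_display()
    return {"config": monitor.config}  # bug: the normalized copy is discarded


def context_config_after(monitor: FakeMonitor) -> dict[str, Any]:
    config = monitor.config.copy()
    if "schedule_type" in config:
        config["schedule_type"] = monitor.get_schedule_type_display()
    return {"config": config}  # fix: return the normalized copy


monitor = FakeMonitor(config={"schedule_type": 1, "schedule": "* * * * *"})
before = context_config_before(monitor)
after = context_config_after(monitor)
```

In this sketch `before["config"]["schedule_type"]` keeps the raw value while `after["config"]["schedule_type"]` holds the display string, and the monitor's own `config` is left unmodified in both cases because the normalization happens on a copy.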

coderabbit-eval · 2026-02-04T19:04:10Z

src/sentry/monitors/logic/incidents.py

+            previous_checkins = cast(
+                list[SimpleCheckIn],
+                # Using .values for performance reasons
+                MonitorCheckIn.objects.filter(
+                    monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added
+                )
+                .order_by("-date_added")
+                .values("id", "date_added", "status"),
+            )
+
+            # reverse the list after slicing in order to start with oldest check-in
+            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+
+            # If we have any successful check-ins within the threshold of
+            # commits we have NOT reached an incident state
+            if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
+                return False


⚠️ Potential issue | 🟠 Major

Guard against incident creation before the failure threshold is reached.
After Line 48 slices previous_checkins, the code can still create an incident even when fewer than failure_issue_threshold failures exist. This can trigger incidents early when only 1..(threshold‑1) failures have occurred. Add a length check before proceeding.

🧯 Proposed fix

-            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+
+            if len(previous_checkins) < failure_issue_threshold:
+                return False

🤖 Prompt for AI Agents

In `@src/sentry/monitors/logic/incidents.py` around lines 37 - 53, The current logic can create an incident before reaching failure_issue_threshold because after slicing previous_checkins you don't verify you actually have that many recent check-ins; update the logic that builds previous_checkins (using MonitorCheckIn.objects.filter(...) and reversing/slicing into previous_checkins) to check the length after slicing and return False if len(previous_checkins) < failure_issue_threshold so you only proceed to the CheckInStatus.OK scan when you have at least failure_issue_threshold check-ins for the monitor_environment and failed_checkin context.
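
The scenario described in this comment can be reproduced with plain lists. This is a minimal sketch under stated assumptions: `reaches_threshold`, its `guard` flag, and the string statuses are hypothetical, and the ORM query is replaced by a newest-first list.

```python
OK, ERROR = "ok", "error"  # stand-ins for CheckInStatus values


def reaches_threshold(statuses_newest_first: list[str], threshold: int, guard: bool) -> bool:
    """Return True when the window of recent check-ins would raise an incident.

    guard=False mimics the reviewed slice-and-scan logic;
    guard=True adds the proposed length check.
    """
    window = list(reversed(statuses_newest_first[:threshold]))
    if guard and len(window) < threshold:
        return False
    return not any(status == OK for status in window)


# Two failures so far against a threshold of three.
without_guard = reaches_threshold([ERROR, ERROR], threshold=3, guard=False)
with_guard = reaches_threshold([ERROR, ERROR], threshold=3, guard=True)
```

Slicing a short list never raises in Python, so without the length check the scan over an undersized window passes trivially whenever it contains no OK status.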

ref(crons): Reorganize incident creation / issue occurrence logic (#8…

c3b32a7

…0528) Since we'll be doing more with issue occurrences, split out the concept of incidents into its own logic module, and incident_occurrence into its own module. Part of GH-80527

coderabbit-eval bot closed this Feb 4, 2026

coderabbit-eval bot reviewed Feb 4, 2026
