Skip to content

Conversation

admin-coderabbit
Copy link
Owner

@admin-coderabbit admin-coderabbit commented Feb 4, 2026

This pull request was automatically created by @coderabbitai/e2e-reviewer.

Batch created pull request.

Summary by CodeRabbit

  • New Features

    • Monitor incidents now automatically trigger and publish occurrence data when failure thresholds are exceeded
    • Aggregated failure reason messages provide clearer visibility into monitor status (missed, timeout, or error)
    • Incident details include metadata for tracking and investigation
  • Refactor

    • Simplified and centralized monitor failure processing logic for improved maintainability

…0528)

Since we'll be doing more with issue occurrences split out the concept
of incidents into it's own logic module, as well as incident_occurrence
into it's own module

Part of GH-80527
@coderabbit-eval
Copy link

coderabbit-eval bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

New modules for incident occurrence creation and publishing are introduced, centralizing threshold-based incident logic. The mark_failed module is refactored to delegate incident processing to the new modules instead of handling it inline, reducing code duplication.

Changes

Cohort / File(s) Summary
Incident Occurrence Infrastructure
src/sentry/monitors/logic/incident_occurrence.py
New module for creating and publishing incident occurrences with metadata assembly, failure reason aggregation, and Kafka integration. Includes helper functions for context preparation and status mapping.
Incident Threshold Logic
src/sentry/monitors/logic/incidents.py
New module implementing threshold-based incident creation and management, evaluating prior check-ins against failure thresholds to determine when to raise incidents and trigger occurrence publishing.
Refactored Mark Failed
src/sentry/monitors/logic/mark_failed.py
Simplified module delegating incident threshold processing to centralized logic; removed 260 lines of inline threshold handling, incident creation, and status mapping previously managed here.
Type Definitions
src/sentry/monitors/types.py
Added SimpleCheckIn TypedDict for representing stripped-down check-in objects with id, date_added, and status fields.

Sequence Diagram

sequenceDiagram
    participant Monitor as MonitorCheckIn<br/>(failed)
    participant Incidents as try_incident_threshold
    participant Occurrence as create_incident_occurrence
    participant MonitorEnv as MonitorEnvironment
    participant Kafka as Kafka Producer

    Monitor->>Incidents: trigger with failed_checkin
    Incidents->>MonitorEnv: fetch environment state
    alt environment exists
        Incidents->>Incidents: load prior check-ins<br/>up to threshold
        alt threshold met & no OK in history
            Incidents->>MonitorEnv: set status to ERROR
            Incidents->>MonitorEnv: create/fetch MonitorIncident
            Incidents->>Occurrence: iterate failed_checkins
            loop for each check-in
                Occurrence->>Occurrence: assemble metadata<br/>(title, subtitle, evidence)
                Occurrence->>Occurrence: build event data<br/>(contexts, tags, trace)
                Occurrence->>Kafka: publish IssueOccurrence
            end
            Incidents-->>Monitor: return true
        else
            Incidents-->>Monitor: return false
        end
    else
        Incidents-->>Monitor: return false
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

Poem

🐰 A new path for when monitors fall,
Incidents logged and occurrences call—
From thresholds checked to Kafka's stream,
Centralized logic fuels our dream!
Hopping toward stability

🚥 Pre-merge checks | ✅ 1 | ❌ 1
❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Automated Test: monitor-incident-refactor-after' is vague and does not clearly describe the substantive changes made in the pull request. While it references a branch name, it provides no meaningful information about what the pull request actually implements or changes. Replace the title with a clear, descriptive summary of the main changes, such as 'Refactor monitor incident handling and extract incident occurrence logic' to better convey the purpose of the refactoring.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch monitor-incident-refactor-after

Comment @coderabbit eval help to get the list of available commands and usage tips.

@coderabbit-eval coderabbit-eval bot closed this Feb 4, 2026
Copy link

@coderabbit-eval coderabbit-eval bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/sentry/monitors/logic/incident_occurrence.py`:
- Around line 159-171: In get_monitor_environment_context, you create a
normalized local variable config and normalize its "schedule_type" but the
returned dict uses monitor_environment.monitor.config (original) instead; update
the returned "config" value to use the local config variable so the normalized
schedule_type is preserved (refer to get_monitor_environment_context and the
local config variable).

In `@src/sentry/monitors/logic/incidents.py`:
- Around line 37-53: The current logic can create an incident before reaching
failure_issue_threshold because after slicing previous_checkins you don't verify
you actually have that many recent check-ins; update the logic that builds
previous_checkins (using MonitorCheckIn.objects.filter(...) and
reversing/slicing into previous_checkins) to check the length after slicing and
return False if len(previous_checkins) < failure_issue_threshold so you only
proceed to the CheckInStatus.OK scan when you have at least
failure_issue_threshold check-ins for the monitor_environment and failed_checkin
context.

Comment on lines +159 to +171
def get_monitor_environment_context(monitor_environment: MonitorEnvironment):
config = monitor_environment.monitor.config.copy()
if "schedule_type" in config:
config["schedule_type"] = monitor_environment.monitor.get_schedule_type_display()

return {
"id": str(monitor_environment.monitor.guid),
"slug": str(monitor_environment.monitor.slug),
"name": monitor_environment.monitor.name,
"config": monitor_environment.monitor.config,
"status": monitor_environment.get_status_display(),
"type": monitor_environment.monitor.get_type_display(),
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Return the normalized config from context.
Line 168 returns the original config, which drops the schedule_type normalization done a few lines above. Return the local config instead.

🛠️ Proposed fix
-        "config": monitor_environment.monitor.config,
+        "config": config,
🤖 Prompt for AI Agents
In `@src/sentry/monitors/logic/incident_occurrence.py` around lines 159 - 171, In
get_monitor_environment_context, you create a normalized local variable config
and normalize its "schedule_type" but the returned dict uses
monitor_environment.monitor.config (original) instead; update the returned
"config" value to use the local config variable so the normalized schedule_type
is preserved (refer to get_monitor_environment_context and the local config
variable).

Comment on lines +37 to +53
previous_checkins = cast(
list[SimpleCheckIn],
# Using .values for performance reasons
MonitorCheckIn.objects.filter(
monitor_environment=monitor_env, date_added__lte=failed_checkin.date_added
)
.order_by("-date_added")
.values("id", "date_added", "status"),
)

# reverse the list after slicing in order to start with oldest check-in
previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))

# If we have any successful check-ins within the threshold of
# commits we have NOT reached an incident state
if any([checkin["status"] == CheckInStatus.OK for checkin in previous_checkins]):
return False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard against incident creation before the failure threshold is reached.
After Line 48 slices previous_checkins, the code can still create an incident even when fewer than failure_issue_threshold failures exist. This can trigger incidents early when only 1..(threshold‑1) failures have occurred. Add a length check before proceeding.

🧯 Proposed fix
-            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+            previous_checkins = list(reversed(previous_checkins[:failure_issue_threshold]))
+
+            if len(previous_checkins) < failure_issue_threshold:
+                return False
🤖 Prompt for AI Agents
In `@src/sentry/monitors/logic/incidents.py` around lines 37 - 53, The current
logic can create an incident before reaching failure_issue_threshold because
after slicing previous_checkins you don't verify you actually have that many
recent check-ins; update the logic that builds previous_checkins (using
MonitorCheckIn.objects.filter(...) and reversing/slicing into previous_checkins)
to check the length after slicing and return False if len(previous_checkins) <
failure_issue_threshold so you only proceed to the CheckInStatus.OK scan when
you have at least failure_issue_threshold check-ins for the monitor_environment
and failed_checkin context.

Sign in to join this conversation on GitHub.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant