
Automated Test: span-flusher-multiprocess #333

Closed

Conversation

admin-coderabbit (Owner) commented on Feb 4, 2026

This pull request was automatically created by @coderabbitai/e2e-reviewer.

Batch created pull request.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added --flusher-processes CLI flag for the process-spans consumer to control the maximum number of parallel processing instances (default: 1); see the example after these notes.
    • Enhanced span flusher with shard-aware multiprocessing for improved throughput and better resource utilization.
    • Added shard-level metrics and improved memory pressure handling.
  • Documentation

    • Updated type-checking best practices guide.
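
For illustration, the new flag would be passed on the consumer command line, e.g. `sentry run consumer process-spans --consumer-group ingest-spans-group --flusher-processes 2` (a hypothetical invocation: the consumer group name is made up, and `sentry run consumer` is assumed to be the relevant entry point).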

coderabbit-eval bot commented on Feb 4, 2026

📝 Walkthrough

This change introduces per-shard multiprocessing to the span flusher. A new --flusher-processes CLI flag allows configuring a maximum number of worker processes. The factory layer wires this parameter to SpanFlusher, which now distributes shards across multiple processes instead of running a single process, with dedicated buffers, health tracking, and backpressure management per process.
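
To make the distribution concrete, here is a minimal sketch of one way shards could be split into contiguous ranges across workers. The helper name and the exact assignment strategy are assumptions for illustration, not the PR's actual code:

```python
def assign_shards(shards: list[int], max_processes: int) -> dict[int, list[int]]:
    """Map each worker process index to a contiguous slice of shards.

    Illustrative only: the real SpanFlusher may partition differently.
    """
    num_processes = max(1, min(max_processes, len(shards)))
    mapping: dict[int, list[int]] = {i: [] for i in range(num_processes)}
    for index, shard in enumerate(shards):
        # Integer arithmetic yields contiguous, near-equal ranges.
        mapping[index * num_processes // len(shards)].append(shard)
    return mapping

# Four shards across two processes, matching the sequence diagram below:
assert assign_shards([0, 1, 2, 3], max_processes=2) == {0: [0, 1], 1: [2, 3]}
```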

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation**<br>`CLAUDE.md` | Updated type-checking documentation to pair the anti-pattern (`hasattr`) with the correct approach (`isinstance`) for union type checks; see the example after this table. |
| **CLI Configuration**<br>`src/sentry/consumers/__init__.py` | Added `--flusher-processes` CLI flag (default 1) to process-spans consumer options, exposing max process count configuration. |
| **Factory Integration**<br>`src/sentry/spans/consumers/process/factory.py` | Added optional `flusher_processes` parameter to `ProcessSpansStrategyFactory`, passed through to the `SpanFlusher` constructor as `max_processes`. |
| **Core Flusher Implementation**<br>`src/sentry/spans/consumers/process/flusher.py` | Refactored `SpanFlusher` to manage multiple processes distributed across shards instead of a single process. Added per-process state tracking (buffers, health, backpressure), shard-to-process mapping, multi-step process initialization, and per-shard metrics/tagging. |
| **Consumer Tests**<br>`tests/sentry/spans/consumers/process/test_consumer.py` | Changed `test_basic` to use transaction mode; added new `test_flusher_processes_limit` to verify that process distribution across shards respects the configured maximum. |
| **Flusher Tests**<br>`tests/sentry/spans/consumers/process/test_flusher.py` | Updated backpressure assertion to check any process-level backpressure state instead of a single global flag. |
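
As a quick illustration of the documented `hasattr`/`isinstance` pattern (the function below is hypothetical, not taken from CLAUDE.md):

```python
def normalize(value: int | str) -> str:
    # Anti-pattern: probing for a method with hasattr duck-typing, e.g.
    #   if hasattr(value, "upper"): ...
    # Correct approach: narrow the union explicitly with isinstance.
    if isinstance(value, str):
        return value.upper()
    return str(value)
```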

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/Config
    participant Factory as ProcessSpansStrategyFactory
    participant Flusher as SpanFlusher
    participant Proc1 as Worker Process 1
    participant Proc2 as Worker Process 2
    participant Buf as SpansBuffer (per-shard)

    CLI->>Factory: Initialize with flusher_processes=2
    Factory->>Flusher: Create with max_processes=2
    Flusher->>Flusher: Compute shard-to-process mapping
    Flusher->>Flusher: Distribute 4 shards across 2 processes
    Flusher->>Proc1: Create for shards [0,1]
    Flusher->>Proc2: Create for shards [2,3]
    Proc1->>Buf: Operate on dedicated buffer (shards 0,1)
    Proc2->>Buf: Operate on dedicated buffer (shards 2,3)
    Flusher->>Flusher: Track per-process health & backpressure
    Flusher->>Proc1: Emit shard-tagged metrics
    Flusher->>Proc2: Emit shard-tagged metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Multiple whiskers, multiple spins,
Shards now scatter where work begins,
Four little burrows, two rabbit crews,
Parallel processing—fresh morning dews! 🌿

🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title 'Automated Test: span-flusher-multiprocess' is generic and vague, failing to clearly describe the main changes in the changeset. | Revise the title to clearly describe the primary change, such as 'Implement multiprocess span flusher with per-shard queue management' or 'Add flusher-processes CLI option and shard-aware process management'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch span-flusher-multiprocess

Comment @coderabbit eval help to get the list of available commands and usage tips.

coderabbit-eval bot closed this on Feb 4, 2026

coderabbit-eval bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/sentry/consumers/__init__.py`:
- Around lines 430-438: The click option `--flusher-processes` (click.Option with dest `flusher_processes`) lacks validation, so 0 or negative values are accepted. Enforce a minimum of 1 by adding a callback validator on that click.Option (or clamp the value in the span flusher factory where `flusher_processes` is consumed) so any input < 1 is rejected or coerced to 1, and make the error text state clearly that the value must be >= 1.

In `@src/sentry/spans/consumers/process/flusher.py`:
- Around lines 185-199: The metric tag key is inconsistent: the metrics.timer call for "spans.buffer.flusher.wait_produce" currently uses tags={"shards": shard_tag}, while the earlier metrics.timer and metrics.timing calls use tags={"shard": shard_tag}. Change the wait_produce call near the produce loop to use the singular "shard" key (symbols to locate: metrics.timer, metrics.timing, produce, flushed_segments, shard_tag, KafkaPayload).
- Around lines 336-347: The loop over self.processes breaks out when the deadline expires, which skips calling process.terminate() for the remaining entries. Rework the logic so that when remaining_time <= 0 the for-loop does not break but instead skips the waiting while-loop and continues to the termination step, so that isinstance(process, multiprocessing.Process): process.terminate() still runs for every process even after the deadline has passed.
🧹 Nitpick comments (3)
tests/sentry/spans/consumers/process/test_consumer.py (1)

60-62: Sleep-based synchronization is fragile.

The time.sleep(0.1) to allow flusher threads to process drift changes could flake on slow CI. Consider polling with a bounded retry loop instead.
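
One possible shape for such a loop (a sketch only; the helper name, predicate, and timeout are assumptions, not code from this PR):

```python
import time
from collections.abc import Callable

def wait_until(predicate: Callable[[], bool], timeout: float = 5.0, interval: float = 0.01) -> None:
    # Poll with a deadline instead of a fixed sleep: return as soon as the
    # condition holds, and fail loudly if it never does.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval)
    raise AssertionError(f"condition not met within {timeout}s")

# e.g. wait_until(lambda: drift_was_applied())  # hypothetical test predicate
```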

src/sentry/spans/consumers/process/flusher.py (2)

337-337: Unused loop variable process_index.

Per static analysis, process_index is unused in the loop body. Use _ or .values().

♻️ Proposed fix
-        for process_index, process in self.processes.items():
+        for process in self.processes.values():

127-132: Remove unused _create_process_for_shard method.

This method is defined but never called within the codebase. Remove it per the YAGNI principle.

Comment on lines +430 to +438

"click_options": [
    *multiprocessing_options(default_max_batch_size=100),
    click.Option(
        ["--flusher-processes", "flusher_processes"],
        default=1,
        type=int,
        help="Maximum number of processes for the span flusher. Defaults to 1.",
    ),
],


⚠️ Potential issue | 🟡 Minor

No validation that --flusher-processes is ≥ 1.

A user could pass --flusher-processes 0 or a negative value, which would cause unexpected behavior downstream (e.g., min(0, num_shards) → 0 processes, leading to empty dicts and no flushing). Consider adding a callback or clamping in the factory.

🛡️ Proposed fix: add a callback validator
+def _validate_positive_int(ctx, param, value):
+    if value < 1:
+        raise click.BadParameter("must be >= 1")
+    return value
+
 "process-spans": {
     "topic": Topic.INGEST_SPANS,
     "dlq_topic": Topic.INGEST_SPANS_DLQ,
     "strategy_factory": "sentry.spans.consumers.process.factory.ProcessSpansStrategyFactory",
     "click_options": [
         *multiprocessing_options(default_max_batch_size=100),
         click.Option(
             ["--flusher-processes", "flusher_processes"],
             default=1,
             type=int,
+            callback=_validate_positive_int,
             help="Maximum number of processes for the span flusher. Defaults to 1.",
         ),
     ],
 },
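
Worth noting: click also ships a built-in range type that would make the callback unnecessary; a minimal variant of the option using it:

```python
click.Option(
    ["--flusher-processes", "flusher_processes"],
    default=1,
    type=click.IntRange(min=1),  # click rejects values < 1 with a clear error message
    help="Maximum number of processes for the span flusher. Defaults to 1.",
)
```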

Comment on lines +185 to +199

 with metrics.timer("spans.buffer.flusher.produce", tags={"shard": shard_tag}):
     for flushed_segment in flushed_segments.values():
         if not flushed_segment.spans:
             continue

         spans = [span.payload for span in flushed_segment.spans]
         kafka_payload = KafkaPayload(None, orjson.dumps({"spans": spans}), [])
-        metrics.timing("spans.buffer.segment_size_bytes", len(kafka_payload.value))
+        metrics.timing(
+            "spans.buffer.segment_size_bytes",
+            len(kafka_payload.value),
+            tags={"shard": shard_tag},
+        )
         produce(kafka_payload)

-with metrics.timer("spans.buffer.flusher.wait_produce"):
+with metrics.timer("spans.buffer.flusher.wait_produce", tags={"shards": shard_tag}):


⚠️ Potential issue | 🟡 Minor

Inconsistent metric tag key: "shard" vs "shards".

Lines 185 and 195 use tags={"shard": shard_tag} (singular), while line 199 uses tags={"shards": shard_tag} (plural). This will create split metrics in your dashboards.

🔧 Proposed fix
-                with metrics.timer("spans.buffer.flusher.wait_produce", tags={"shards": shard_tag}):
+                with metrics.timer("spans.buffer.flusher.wait_produce", tags={"shard": shard_tag}):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

with metrics.timer("spans.buffer.flusher.produce", tags={"shard": shard_tag}):
    for flushed_segment in flushed_segments.values():
        if not flushed_segment.spans:
            continue

        spans = [span.payload for span in flushed_segment.spans]
        kafka_payload = KafkaPayload(None, orjson.dumps({"spans": spans}), [])
        metrics.timing(
            "spans.buffer.segment_size_bytes",
            len(kafka_payload.value),
            tags={"shard": shard_tag},
        )
        produce(kafka_payload)

with metrics.timer("spans.buffer.flusher.wait_produce", tags={"shard": shard_tag}):

Comment on lines +336 to +347

 # Wait for all processes to finish
 for process_index, process in self.processes.items():
     if deadline is not None:
         remaining_time = deadline - time.time()
         if remaining_time <= 0:
             break

     while process.is_alive() and (deadline is None or deadline > time.time()):
         time.sleep(0.1)

-    if isinstance(self.process, multiprocessing.Process):
-        self.process.terminate()
+    if isinstance(process, multiprocessing.Process):
+        process.terminate()


⚠️ Potential issue | 🟠 Major

break on deadline expiry skips terminate() for remaining processes, risking zombie processes.

If the deadline is exceeded while waiting for the first process, the break on line 341 exits the loop entirely, leaving remaining processes un-terminated. Since stopped.value = True is a cooperative signal, a stuck process could linger.

🐛 Proposed fix: terminate all processes regardless of deadline
     def join(self, timeout: float | None = None):
         # set stopped flag first so we can "flush" the background threads while
         # next_step is also shutting down. we can do two things at once!
         self.stopped.value = True
         deadline = time.time() + timeout if timeout else None

         self.next_step.join(timeout)

         # Wait for all processes to finish
-        for process_index, process in self.processes.items():
+        for process in self.processes.values():
             if deadline is not None:
                 remaining_time = deadline - time.time()
                 if remaining_time <= 0:
                     break

             while process.is_alive() and (deadline is None or deadline > time.time()):
                 time.sleep(0.1)

+        # Terminate all remaining processes regardless of deadline
+        for process in self.processes.values():
             if isinstance(process, multiprocessing.Process):
                 process.terminate()
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 337-337: Loop control variable process_index not used within loop body

Rename unused process_index to _process_index

(B007)

