
Bootstrap persistent state for volume servers. #1

Open
wants to merge 1 commit into base: replay-7984-base-e67973d

Conversation

admin-coderabbit (Owner) commented Jan 8, 2026

What problem are we solving?

Add support for maintenance (read-only) mode on volume servers.

See https://github.com/seaweedfs/seaweedfs/issues/7977 for details.

How are we solving the problem?

This PR implements logic to load and save persistent state information for the storages associated with volume servers, and to report state changes back to masters via heartbeat messages.

Master and node full support will follow in subsequent PRs.

How is the PR tested?

No functional changes, since nothing updates the state yet.

Checks

  • I have added unit tests if possible.
  • I will add related wiki document changes and link to this PR after merging.

Summary by CodeRabbit

  • New Features

    • Added volume server state reporting to the master, enabling servers to communicate status information such as maintenance mode during heartbeats.
    • Introduced state persistence for volume servers.
  • Refactor

    • Consolidated UUID file configuration into reusable constants for improved maintainability.


This PR implements logic to load/save persistent state information for storages
associated with volume servers, and reporting state changes back to masters
via heartbeat messages.

More work to follow!

See https://github.com/seaweedfs/seaweedfs/issues/7977 for details.
coderabbit-eval bot commented Jan 8, 2026

📝 Walkthrough

Walkthrough

The changes introduce persistent state management for volume servers with new protobuf message definitions, state storage lifecycle methods in the storage layer, and integration of state updates into heartbeat communication to the master server.

Changes

  • Protobuf Definitions (weed/pb/master.proto, weed/pb/volume_server.proto):
    Added the volume_server.proto import; extended the Heartbeat message with a new state field (tag 23). Introduced a new VolumeServerState message with a maintenance boolean field, alongside TODO notes for future RPC and response enhancements.
  • State Management Infrastructure (weed/storage/store_state.go):
    New file establishing the State type with persistent storage via protobuf. Exports the StateFileName and StateFileMode constants. Implements Load() and Save() methods for disk I/O.
  • Store Layer Integration (weed/storage/store.go):
    Added the HEARTBEAT_CHAN_SIZE constant; extended the Store struct with State and StateUpdateChan fields. New LoadState() and SaveState() methods handle the state lifecycle. State initialization now occurs during NewStore(), with lifecycle propagation to heartbeat channels.
  • Heartbeat Communication (weed/server/volume_grpc_client_to_master.go):
    Added state update channel handling in the heartbeat loop. Detects StateUpdateChan events, constructs a Heartbeat with the State field, and sends it to the master with error handling.
  • Refactoring & Comments (weed/storage/disk_location.go, weed/server/master_grpc_server.go):
    Extracted UUID file constants (UUIDFileName, UUIDFileMod); replaced string concatenation with path.Join(). Added a TODO comment in master heartbeat processing.

Sequence Diagram(s)

sequenceDiagram
    participant VS as Volume Server<br/>(Store)
    participant CH as State Update<br/>Channel
    participant HB as Heartbeat Loop<br/>(gRPC Client)
    participant MS as Master Server

    Note over VS: LoadState() or SaveState()
    VS->>VS: State lifecycle operation
    VS->>CH: Push VolumeServerState<br/>to StateUpdateChan
    
    HB->>CH: Monitor StateUpdateChan
    activate HB
    CH-->>HB: State update received
    HB->>HB: Build Heartbeat message<br/>with State field
    HB->>MS: Send Heartbeat to Master
    MS-->>HB: Acknowledge
    deactivate HB
    
    Note over HB: Continue regular heartbeat<br/>interval processing
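The flow above can be condensed into a minimal select loop. This is an illustrative sketch, not the PR's heartbeat code: a plain function value stands in for stream.Send, and the type names are simplified stand-ins for the generated protobuf types:

```go
package main

import "fmt"

// Hypothetical stand-ins for the generated protobuf types.
type VolumeServerState struct{ Maintenance bool }
type Heartbeat struct{ State *VolumeServerState }

// runHeartbeatLoop forwards state updates to the master until stop is closed.
// send stands in for stream.Send in the real gRPC client.
func runHeartbeatLoop(stateUpdates <-chan *VolumeServerState, stop <-chan struct{},
	send func(*Heartbeat) error) error {
	for {
		select {
		case st := <-stateUpdates:
			if err := send(&Heartbeat{State: st}); err != nil {
				return err // surface send failures so the caller can reconnect
			}
		case <-stop:
			return nil
		}
	}
}

func main() {
	updates := make(chan *VolumeServerState, 1)
	stop := make(chan struct{})
	var sent []*Heartbeat
	updates <- &VolumeServerState{Maintenance: true}
	err := runHeartbeatLoop(updates, stop, func(hb *Heartbeat) error {
		sent = append(sent, hb)
		close(stop) // stop the loop after the first forwarded heartbeat
		return nil
	})
	fmt.Println(err, len(sent), sent[0].State.Maintenance) // <nil> 1 true
}
```

Returning the send error, as in the real loop, lets the outer retry logic tear down and re-establish the master connection.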

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A state that persists, so grand and so new,
Through heartbeats that flutter, what wonders they do!
From storage to master, the channels convey,
Volume server health status, hooray, hooray, hooray! 💫

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Bootstrap persistent state for volume servers' directly describes the main change in this PR: adding infrastructure to load/save persistent state for volume servers.
Description check ✅ Passed The description follows the repository template with all required sections completed: problem statement, implementation approach, testing notes, and checklist items addressed appropriately.




coderabbit-eval bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @weed/server/volume_grpc_client_to_master.go:
- Around line 215-227: Store.Close() currently doesn't close StateUpdateChan nor
the delta channels (NewVolumesChan, DeletedVolumesChan, NewEcShardsChan,
DeletedEcShardsChan) and StopHeartbeat() is never invoked, risking goroutine
leaks and panics from sends after close; fix by closing those channels inside
Store.Close() (or call StopHeartbeat() first to stop the heartbeat loop), and
update producers (LoadState(), SaveState(), and any other senders) to do
non-blocking/select sends or check a store-closed flag before sending so they
won't panic when channels are closed; locate the heartbeat loop that reads
StateUpdateChan (and the StopHeartbeat implementation) and ensure shutdown
ordering: StopHeartbeat() -> drain/stop producers -> close channels in
Store.Close().
🧹 Nitpick comments (4)
weed/storage/disk_location.go (1)

51-51: Consider using filepath.Join instead of path.Join for filesystem paths.

Since this constructs a filesystem path for file I/O operations, filepath.Join would be more appropriate as it's OS-aware and handles platform-specific path separators correctly. The path package is intended for URL-like paths with forward slashes.

♻️ Proposed refactor
-	fileName := path.Join(dir, UUIDFileName)
+	fileName := filepath.Join(dir, UUIDFileName)

This would also allow removing the path import on line 6.

weed/storage/store_state.go (2)

61-64: Use consistent format verb for error.

Line 63 uses %s for the error value while line 66 uses %v. For consistency and to handle errors that may have structured information, use %v for both.

🔧 Suggested fix
 	binPb, err := proto.Marshal(st.Pb)
 	if err != nil {
-		return fmt.Errorf("failed to serialize store state %v: %s", st.Pb, err)
+		return fmt.Errorf("failed to serialize store state %v: %v", st.Pb, err)
 	}

8-8: Migrate from deprecated protobuf module.

The github.com/golang/protobuf/proto package is deprecated and scheduled for removal in 2025. Switch to google.golang.org/protobuf/proto for runtime imports. The protobuf project provides the open2opaque migration tool to automate most of the changes.

weed/storage/store.go (1)

161-180: Consider non-blocking sends to StateUpdateChan.

Both LoadState() and SaveState() use blocking sends to StateUpdateChan. While the 1024 buffer makes blocking unlikely, other channels in this file (e.g., NewEcShardsChan at lines 128-141) use non-blocking sends with select/default to avoid potential deadlocks during startup or high-load scenarios.

For consistency and safety, consider applying the same pattern here.

♻️ Suggested non-blocking send pattern
 func (s *Store) LoadState() error {
 	err := s.State.Load()
 	if s.State.Pb != nil && err == nil {
-		s.StateUpdateChan <- s.State.Pb
+		select {
+		case s.StateUpdateChan <- s.State.Pb:
+		default:
+			glog.V(2).Infof("StateUpdateChan full during LoadState, state will be reported in heartbeat")
+		}
 	}
 	return err
 }

 func (s *Store) SaveState() error {
 	if s.State.Pb == nil {
 		glog.Warningf("tried to save empty state for store %s", s.Id)
 		return nil
 	}

 	err := s.State.Save()
 	if s.State.Pb != nil && err == nil {
-		s.StateUpdateChan <- s.State.Pb
+		select {
+		case s.StateUpdateChan <- s.State.Pb:
+		default:
+			glog.V(2).Infof("StateUpdateChan full during SaveState, state will be reported in heartbeat")
+		}
 	}
 	return err
 }
📜 Review details

Configuration used: Environment YAML

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e67973d and 5148562.

⛔ Files ignored due to path filters (3)
  • weed/pb/master_pb/master.pb.go is excluded by !**/*.pb.go
  • weed/pb/volume_server_pb/volume_server.pb.go is excluded by !**/*.pb.go
  • weed/pb/volume_server_pb/volume_server_grpc.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (7)
  • weed/pb/master.proto
  • weed/pb/volume_server.proto
  • weed/server/master_grpc_server.go
  • weed/server/volume_grpc_client_to_master.go
  • weed/storage/disk_location.go
  • weed/storage/store.go
  • weed/storage/store_state.go
🔇 Additional comments (15)
weed/pb/volume_server.proto (3)

10-14: LGTM! Clean protobuf message definition.

The VolumeServerState message is well-defined with clear documentation. The single maintenance field appropriately represents the read-only mode state.


56-56: Appropriate placeholder for future work.

The TODO correctly references the tracking issue and indicates planned RPC additions for state control.


581-581: Appropriate placeholder for future integration.

The TODO correctly marks where the state will be included in status responses, aligning with the incremental rollout approach described in the PR objectives.

weed/storage/disk_location.go (2)

23-26: LGTM! Well-defined constants.

Exporting UUIDFileName and UUIDFileMod promotes reusability across the codebase and eliminates magic values.


71-71: LGTM! Using named constant improves readability.

Replacing the magic number with UUIDFileMod makes the code more maintainable and self-documenting.

weed/pb/master.proto (2)

7-8: LGTM! Necessary import added.

The import of volume_server.proto is correctly added to support the new VolumeServerState type in the Heartbeat message.


89-91: LGTM! State field properly integrated into Heartbeat.

The state field is correctly added with an appropriate tag number and clear documentation. The cross-package reference to volume_server_pb.VolumeServerState is properly formatted.

weed/server/master_grpc_server.go (1)

168-168: Appropriate placeholder for future state handling.

The TODO is well-positioned in the heartbeat processing flow and correctly references the tracking issue. This aligns with the PR's incremental approach.

weed/storage/store_state.go (3)

14-17: LGTM!

Constants are appropriately defined with sensible defaults for file naming and permissions.


19-32: LGTM!

The State struct and NewState constructor are well-structured. The constructor properly derives the file path and delegates to Load() for initialization.


34-54: LGTM!

The Load() method handles all cases correctly:

  • Missing file is treated as first-time initialization (no error)
  • Initializes Pb before attempting unmarshal
  • Properly cleans up Pb on error to avoid partial state
weed/storage/store.go (4)

19-19: LGTM!

Import added correctly for the new state-related protobuf types.


27-30: LGTM!

The HEARTBEAT_CHAN_SIZE constant provides consistent sizing for heartbeat-related channels and improves maintainability.


74-75: LGTM!

New fields integrate well with the existing Store struct. The StateUpdateChan enables propagation of state changes to heartbeat processing.


152-156: Verify fatal error handling aligns with operational expectations.

Using glog.Fatalf for state initialization failure means the volume server will not start if the state file exists but is corrupted. This is a reasonable fail-fast approach, but verify this matches the desired operational behavior. An alternative would be logging an error, initializing with empty state, and moving/renaming the corrupted file.

Comment on lines +215 to +227
case stateMessage := <-vs.store.StateUpdateChan:
stateBeat := &master_pb.Heartbeat{
Ip: ip,
Port: port,
DataCenter: dataCenter,
Rack: rack,
State: stateMessage,
}
glog.V(0).Infof("volume server %s:%d updates state to %v", vs.store.Ip, vs.store.Port, stateMessage)
if err = stream.Send(stateBeat); err != nil {
glog.V(0).Infof("Volume Server Failed to update state to master %s: %v", masterAddress, err)
return "", err
}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Find the Store struct definition and look for StateUpdateChan
echo "=== Searching for StateUpdateChan in Store struct ==="
rg -n "StateUpdateChan" --type=go -B 2 -A 2

echo -e "\n=== Searching for Store struct definition ==="
find . -type f -path "*store*" -name "*.go" | head -20

Repository: admin-coderabbit/seaweedfs

Length of output: 2118


🏁 Script executed:

#!/bin/bash

# Search for HEARTBEAT_CHAN_SIZE definition
echo "=== HEARTBEAT_CHAN_SIZE definition ==="
rg -n "HEARTBEAT_CHAN_SIZE" --type=go -B 1 -A 1

# Search for channel close operations on StateUpdateChan or all channels
echo -e "\n=== Channel close operations in store context ==="
rg -n "close.*Chan|StateUpdateChan" weed/storage/store.go

# Look for store shutdown/cleanup methods
echo -e "\n=== Store shutdown/cleanup methods ==="
rg -n "func.*store.*Close|func.*store.*Shutdown" --type=go -A 3

Repository: admin-coderabbit/seaweedfs

Length of output: 32645


🏁 Script executed:

#!/bin/bash

# Search for Store struct close/cleanup methods more broadly
echo "=== Store struct methods (weed/storage/store.go) ==="
rg -n "func.*Store\)" weed/storage/store.go | head -20

# Search for any close() or shutdown pattern in the file
echo -e "\n=== Close operations in store.go ==="
rg -n "close|Close" weed/storage/store.go

# Check if there's a destructor or cleanup anywhere
echo -e "\n=== Looking for volume server shutdown patterns ==="
rg -n "StateUpdateChan|store.*shutdown|store.*close" weed/server/ --type=go -A 2 -B 2

Repository: admin-coderabbit/seaweedfs

Length of output: 2155


🏁 Script executed:

#!/bin/bash

# Read the Store.Close() method implementation
echo "=== Store.Close() implementation ==="
sed -n '553,570p' weed/storage/store.go

# Look for any channel closing logic
echo -e "\n=== Check for channel closing in entire store.go ==="
rg -n 'close\(' weed/storage/store.go

Repository: admin-coderabbit/seaweedfs

Length of output: 800


🏁 Script executed:

#!/bin/bash

# Look for the heartbeat goroutine that reads from these channels
echo "=== Find the heartbeat goroutine that consumes StateUpdateChan ==="
rg -n "func.*SendHeartbeat" weed/server/volume_grpc_client_to_master.go -A 50 | head -100

# Check the complete Close() implementation more carefully
echo -e "\n=== Store.Close() method (exact lines) ==="
cat -n weed/storage/store.go | sed -n '553,560p'

# Look for where the heartbeat goroutine is started
echo -e "\n=== Look for goroutine starting patterns ==="
rg -n "go.*Heartbeat|SendHeartbeat|StateUpdateChan" weed/server/ --type=go

Repository: admin-coderabbit/seaweedfs

Length of output: 1865


🏁 Script executed:

#!/bin/bash

# Get the complete heartbeat sending function context
echo "=== Complete heartbeat sending logic (around line 215) ==="
sed -n '100,250p' weed/server/volume_grpc_client_to_master.go | cat -n

# Look for where the heartbeat goroutine is launched
echo -e "\n=== Search for where SendHeartbeat function is called ==="
rg -n "SendHeartbeat\(" weed/server/volume_server.go

# Check if there's a shutdown/stop mechanism for the volume server
echo -e "\n=== Look for volume server stopping patterns ==="
rg -n "func.*Stop|func.*Shutdown|vs.store.Close" weed/server/ --type=go -B 2 -A 2

Repository: admin-coderabbit/seaweedfs

Length of output: 8500


🏁 Script executed:

#!/bin/bash

# Look for stopChan handling in the select loop
echo "=== Complete select loop in doHeartbeatWithRetry (lines 114-250) ==="
sed -n '114,250p' weed/server/volume_grpc_client_to_master.go | cat -n

# Check how StopHeartbeat is used
echo -e "\n=== StopHeartbeat and stopChan handling ==="
rg -n "stopChan|StopHeartbeat" weed/server/ --type=go -B 2 -A 2

Repository: admin-coderabbit/seaweedfs

Length of output: 6132


🏁 Script executed:

#!/bin/bash

# Get more complete lines from the select statement
echo "=== Complete select statement (lines 100-300) ==="
sed -n '100,300p' weed/server/volume_grpc_client_to_master.go | cat -n

# Look for where vs.stopChan is defined and how it's used
echo -e "\n=== VolumeServer struct definition to find stopChan ==="
rg -n "type VolumeServer struct" weed/server/volume_server.go -A 50

# Search for all stopChan references
echo -e "\n=== All stopChan references ==="
rg -n "stopChan" weed/server/ --type=go

Repository: admin-coderabbit/seaweedfs

Length of output: 12019


🏁 Script executed:

#!/bin/bash

# Get the complete select statement with stopChan case
echo "=== Complete select statement with stopChan (lines 114-310) ==="
sed -n '114,310p' weed/server/volume_grpc_client_to_master.go | tail -50

# Verify if there's a closeFunc or similar for channels
echo -e "\n=== Check if Store has any close methods for channels ==="
rg -n "func.*Store.*Close|func.*Store.*Stop" weed/storage/store.go -A 5

Repository: admin-coderabbit/seaweedfs

Length of output: 2313


🏁 Script executed:

#!/bin/bash

# Check if LoadState/SaveState are ever called after shutdown
echo "=== Where LoadState and SaveState are called ==="
rg -n "LoadState|SaveState" weed/server/ --type=go -B 2 -A 2

# Check the complete shutdown flow in VolumeServer
echo -e "\n=== VolumeServer.SetStopping() and Shutdown() methods ==="
sed -n '151,165p' weed/server/volume_server.go | cat -n

# Check if there's any synchronization with heartbeat goroutine
echo -e "\n=== Heartbeat goroutine lifecycle management ==="
rg -n "isHeartbeating|StartHeartbeat" weed/server/ --type=go

Repository: admin-coderabbit/seaweedfs

Length of output: 722


🏁 Script executed:

#!/bin/bash

# Find where LoadState and SaveState might be invoked
echo "=== Search in entire codebase for LoadState/SaveState ==="
rg -n "LoadState\(\)|SaveState\(\)" --type=go | head -20

# Check if StopHeartbeat is called anywhere
echo -e "\n=== Where StopHeartbeat is called ==="
rg -n "StopHeartbeat|stopChan" weed/server/volume_server.go

# Verify the heartbeat loop fully
echo -e "\n=== Check complete stopChan handler ==="
sed -n '305,320p' weed/server/volume_grpc_client_to_master.go | cat -n

Repository: admin-coderabbit/seaweedfs

Length of output: 1044


StateUpdateChan is properly initialized and buffered, but channel lifecycle management is incomplete.

The channel is correctly defined with buffer size 1024 (HEARTBEAT_CHAN_SIZE) to prevent blocking state updates. However, verification reveals a critical gap:

  • Store.Close() does not close StateUpdateChan or related delta update channels (NewVolumesChan, DeletedVolumesChan, NewEcShardsChan, DeletedEcShardsChan)
  • StopHeartbeat() exists but is never invoked, leaving the graceful shutdown path (stopChan) unused
  • Risk of goroutine leak if the heartbeat loop is blocked on a channel receive when the store shuts down
  • Potential panic if LoadState()/SaveState() attempt to send after Store.Close()

Channels should be closed in Store.Close(), or shutdown should explicitly trigger StopHeartbeat() before closing the store to ensure clean goroutine termination.
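A minimal sketch of that ordering, with illustrative names rather than the actual Store API: the guarded non-blocking send plus close-once pattern prevents both the blocked-receive goroutine leak and the send-on-closed panic:

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative shape only; field and method names do not match the real Store.
type Store struct {
	StateUpdateChan chan string
	stopHeartbeat   chan struct{}
	closeOnce       sync.Once
	mu              sync.Mutex
	closed          bool
}

func NewStore() *Store {
	return &Store{
		StateUpdateChan: make(chan string, 8),
		stopHeartbeat:   make(chan struct{}),
	}
}

// publishState does a guarded, non-blocking send so producers cannot
// panic on a closed channel or block when the buffer is full.
func (s *Store) publishState(state string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return false
	}
	select {
	case s.StateUpdateChan <- state:
		return true
	default:
		return false
	}
}

// Close enforces the ordering: stop the heartbeat loop, stop producers,
// then close the channel, so no send can race the close.
func (s *Store) Close() {
	s.closeOnce.Do(func() {
		close(s.stopHeartbeat) // 1. signal the heartbeat loop to exit
		s.mu.Lock()
		s.closed = true // 2. fence off producers
		s.mu.Unlock()
		close(s.StateUpdateChan) // 3. safe: no sender can reach a closed channel
	})
}

func main() {
	s := NewStore()
	fmt.Println(s.publishState("maintenance=true")) // true
	s.Close()
	fmt.Println(s.publishState("maintenance=false")) // false: dropped, no panic
}
```

The sync.Once also makes Close() idempotent, which matters if both the shutdown path and a defer end up calling it.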

