Skip to content

priority: prevent init timer restarting when a child transitions from CONNECTING to CONNECTING#8813

Merged
easwars merged 2 commits intogrpc:masterfrom
easwars:priority_init_timer
Jan 8, 2026
Merged

priority: prevent init timer restarting when a child transitions from CONNECTING to CONNECTING#8813
easwars merged 2 commits intogrpc:masterfrom
easwars:priority_init_timer

Conversation

@easwars
Copy link
Copy Markdown
Contributor

@easwars easwars commented Jan 7, 2026

Fixes #8516

This PR ensures that the init timer is not restarted when a child policy transitions from CONNECTING to CONNECTING. This ensures that we will only ever wait for DefaultPriorityInitTimeout which is set to a value of 10s for a given child to become Ready before we attempt a failover to the next highest priority child.

RELEASE NOTES:

  • xds/priority: Fixed a bug causing delayed failover to lower-priority clusters when a higher-priority cluster is stuck in the CONNECTING state.

@easwars easwars added this to the 1.79 Release milestone Jan 7, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 83.32%. Comparing base (900ffa9) to head (02b3bf6).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...nternal/xds/balancer/priority/balancer_priority.go 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8813      +/-   ##
==========================================
+ Coverage   83.25%   83.32%   +0.06%     
==========================================
  Files         417      417              
  Lines       32978    32980       +2     
==========================================
+ Hits        27457    27479      +22     
+ Misses       4106     4090      -16     
+ Partials     1415     1411       -4     
Files with missing lines Coverage Δ
internal/xds/balancer/priority/balancer_child.go 90.76% <100.00%> (-3.08%) ⬇️
...nternal/xds/balancer/priority/balancer_priority.go 68.33% <83.33%> (-9.64%) ⬇️

... and 19 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

select {
case <-initTimerStarted:
t.Fatalf("Init timer restarted when subchannel moved from Ready to Idle")
case <-sCtx.Done():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If UpdateState is asynchronous, and happens after sCtx.Done, it can give false positives. This race can happen if the test environment is very slow

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UpdateState is not asynchronous. See here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pattern that we use very commonly in our tests to check for things that we expect not to happen. This is a tradeoff between reasonable test execution times and correctness/flakiness. I don't recall having any problems with this approach so far though. But if you have an idea for how we could do this better, I'd be happy to hear. Thanks.

return
}
child.state = s
curState := child.state
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: oldState?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

child.stopInitTimer()
case connectivity.Connecting:
if !child.reportedTF {
// Skip restarting the timer if the child was already in Connecting.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind documenting the whole if? Why are we not starting it in the !reportedTF case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It's a bit long now, but I had to read a bunch of stuff to get to the history behind the !reportedTF.

@easwars easwars merged commit 9b39ae2 into grpc:master Jan 8, 2026
14 checks passed
@easwars easwars deleted the priority_init_timer branch January 8, 2026 06:20
mbissa pushed a commit to mbissa/grpc-go that referenced this pull request Feb 16, 2026
… CONNECTING to CONNECTING (grpc#8813)

Fixes grpc#8516

This PR ensures that the init timer is not restarted when a child policy
transitions from CONNECTING to CONNECTING. This ensures that we will
only ever wait for `DefaultPriorityInitTimeout` which is set to a value
of `10s` for a given child to become `Ready` before we attempt a
failover to the next highest priority child.

RELEASE NOTES:
- xds/priority: Fixed a bug causing delayed failover to lower-priority
clusters when a higher-priority cluster is stuck in the CONNECTING
state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

xds: Priority policy restarts timer on CONNECTING->CONNECTING transition

3 participants