Skip to content

Simplify the heuristic for computing the next instance start time #955

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
masih opened this issue Apr 23, 2025 · 2 comments
Open

Simplify the heuristic for computing the next instance start time #955

masih opened this issue Apr 23, 2025 · 2 comments

Comments

@masih
Copy link
Member

masih commented Apr 23, 2025

The current logic is overly complicated, and burried in gpbftRunner deep enough that unit testing the raw heuristic itself in terms of conformity to basic design goals is cumbersome if not impossible.

The heuristic itself uses a number of parameters from manifest, and it's not entirely clear if we really need so many knobs to tune the instance start.

  • Refactor the heuristic into a stateless function that takes clear set of parameters.
  • Add unit tests to assert what the design goals are/should be.
  • Then simplify the logic and distil to 1 but no more than 2 parameters.

go-f3/host.go

Lines 389 to 460 in f77d509

func (h *gpbftRunner) computeNextInstanceStart(cert *certs.FinalityCertificate) (_nextStart time.Time) {
ecDelay := time.Duration(h.manifest.EC.DelayMultiplier * float64(h.manifest.EC.Period))
head, err := h.ec.GetHead(h.runningCtx)
if err != nil {
// this should not happen
log.Errorf("ec.GetHead returned error: %+v", err)
return h.clock.Now().Add(ecDelay)
}
// the head of the cert becomes the new base
baseTipSet := cert.ECChain.Head()
// we are not trying to fetch the new base tipset from EC as it might not be available
// instead we compute the relative time from the EC.Head
baseTimestamp := computeTipsetTimestampAtEpoch(head, baseTipSet.Epoch, h.manifest.EC.Period)
// Try to align instances while catching up, if configured.
if h.manifest.CatchUpAlignment > 0 {
defer func() {
now := h.clock.Now()
// If we were supposed to start this instance more than one GPBFT round ago, assume
// we're behind and try to align our start times. This helps keep nodes
// in-sync when bootstrapping and catching up.
if _nextStart.Before(now.Add(-h.manifest.CatchUpAlignment)) {
delay := now.Sub(baseTimestamp)
if offset := delay % h.manifest.CatchUpAlignment; offset > 0 {
delay += h.manifest.CatchUpAlignment - offset
}
_nextStart = baseTimestamp.Add(delay)
}
}()
}
lookbackDelay := h.manifest.EC.Period * time.Duration(h.manifest.EC.HeadLookback)
if cert.ECChain.HasSuffix() {
// we decided on something new, the tipset that got finalized can at minimum be 30-60s old.
return baseTimestamp.Add(ecDelay).Add(lookbackDelay)
}
if cert.GPBFTInstance == h.manifest.InitialInstance {
// if we are at initial instance, there is no history to look at
return baseTimestamp.Add(ecDelay).Add(lookbackDelay)
}
backoffTable := h.manifest.EC.BaseDecisionBackoffTable
attempts := 0
backoffMultipler := 2.0 // baseline 1 + 1 to account for the one ECDelay after which we got the base decistion
for instance := cert.GPBFTInstance - 1; instance > h.manifest.InitialInstance; instance-- {
cert, err := h.certStore.Get(h.runningCtx, instance)
if err != nil {
log.Errorf("error while getting instance %d from certstore: %+v", instance, err)
break
}
if cert.ECChain.HasSuffix() {
break
}
attempts += 1
if attempts < len(backoffTable) {
backoffMultipler += backoffTable[attempts]
} else {
// if we are beyond backoffTable, reuse the last element
backoffMultipler += backoffTable[len(backoffTable)-1]
}
}
backoff := time.Duration(float64(ecDelay) * backoffMultipler)
log.Debugf("backing off for: %v", backoff)
return baseTimestamp.Add(backoff).Add(lookbackDelay)
}

@masih
Copy link
Member Author

masih commented Apr 23, 2025

Small bug in picking backoff value from table, the code here always ignores the first value

go-f3/host.go

Lines 447 to 448 in f77d509

attempts += 1
if attempts < len(backoffTable) {

Not much point fixing this now, but something to sort out as part of getting this issue done.

@BigLep
Copy link
Member

BigLep commented May 2, 2025

2025-05-02: This could result in better F3 UX for the ecosystem. What would help with prioritizing this is if we get clear ecosystem signal that reducing the lookback from 2.5 minute to 1 minute would make a big difference.

The challenge with this task is that it requires careful testing making it non-trivial. As a result, we'll defer this for the future when we have more engineering bandwidwth and energy for F3 improvements and ecosystem signal that further optimizations here will move the needle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

No branches or pull requests

2 participants