Investigate why distance from head is more than expected #960

Open
BigLep opened this issue Apr 30, 2025 · 8 comments

BigLep commented Apr 30, 2025

This is a tracking issue for investigating why F3 participation post-activation is lower than what we observed in passive testing hours before.

We went from 5 epochs behind on average to ~8 epochs behind on average.

Before activation:

Image

https://grafana.f3.eng.filoz.org/d/edsu1k5s7gtfkb/f3-passive-testing?orgId=1&var-network=mainnet&var-instance=ida.f3.eng.filoz.org%3A80&from=1745798400000&to=1745884800000&viewPanel=56

After activation:

Image

https://grafana.f3.eng.filoz.org/d/edsu1k5s7gtfkb/f3-passive-testing?orgId=1&var-network=mainnet&var-instance=ida.f3.eng.filoz.org%3A80&from=1745928000000&to=1746014400000&viewPanel=56

(note: I'm not showing one contiguous graph since there is a bootstrap phase which dramatically scales up the y-axis).

@BigLep BigLep added this to F3 Apr 29, 2025
@BigLep BigLep converted this from a draft issue Apr 30, 2025

BigLep commented Apr 30, 2025

2025-04-30

  • Differences between the passive and activation manifests
    • finalization - there is an extra step in the path of finalization
    • network name - runtime may be slightly different

We're going to look into both of these.

There is a gap from instance 5 to ~30. It looks like a coordinated drop in participation.

Curio thread

The concern was that the distance from head is now 10 vs. in passive testing it was 9 5% of the time.

We're focused on "our ship" first (manifest differences).

2025-04-29

  • Forest

    • Waiting to see if/what their problems are
  • Curio

    • Confirming they are good

Going down the path of getting an observer set up so we can see who isn't participating

  • Top priority is to get observer running

Hypothesis 1: instances upgraded to the retracted version that won't be activated by the contract

  • If that were the case, they should join passive testing if we restarted it


masih commented Apr 30, 2025

Checked pubsub settings in lotus, in relation to the network name change in the activation manifest. The only difference I see in terms of code path execution in lotus is how the list of allowed topics is compiled here.


masih commented May 1, 2025

New metrics to measure time spent on checkpointing are deployed on test nodes. They initially show the process to be slow, but it fluctuates quite a bit. Letting them collect data for a while.

Image


BigLep commented May 1, 2025

Per filecoin-project/f3-activation-contract#22 (comment), let's also capture a snapshot of the minerIds that are participating so we can take diffs after future changes.
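One possible way to take such diffs, as a minimal sketch: it assumes each snapshot is simply a list of miner ID strings. The function name `diff_participants` and the miner IDs below are hypothetical and used only for illustration; they are not taken from go-f3 or the observer.

```python
from typing import Dict, Iterable, List


def diff_participants(before: Iterable[str], after: Iterable[str]) -> Dict[str, List[str]]:
    """Compare two snapshots of participating miner IDs.

    Returns the IDs that joined, left, or stayed between the snapshots,
    each as a sorted list so the output is stable across runs.
    """
    before_set, after_set = set(before), set(after)
    return {
        "joined": sorted(after_set - before_set),
        "left": sorted(before_set - after_set),
        "unchanged": sorted(before_set & after_set),
    }


if __name__ == "__main__":
    # Hypothetical miner IDs, for illustration only.
    snapshot_before = ["f01000", "f01001", "f01002"]
    snapshot_after = ["f01001", "f01002", "f01003"]
    diff = diff_participants(snapshot_before, snapshot_after)
    print("joined:", diff["joined"])
    print("left:", diff["left"])
```

Keeping the snapshots as plain ID lists means a diff can be taken against any earlier capture without needing the original metrics backend.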


masih commented May 1, 2025

Time spent checkpointing settled to a small value and is unlikely to be causing issues here. The initial delay in checkpointing was only observed during instance restart, which is expected since the node was slightly behind on syncing the chain. After that, checkpointing time dropped to a few milliseconds at the 99th percentile.
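For anyone wanting to reproduce a 99th percentile figure from raw duration samples, here is a minimal sketch using the nearest-rank definition. This is an assumption about method: the dashboard may compute percentiles differently (for example from histogram buckets), and the sample values below are made up.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so the rank stays exact for integer pct.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical checkpoint durations in milliseconds: a few slow
    # samples right after a restart, then steady state.
    durations_ms = [850.0, 420.0, 95.0] + [2.0, 3.0, 2.5, 4.0] * 50
    print("p99 including restart:", percentile(durations_ms, 99))
    print("p99 steady state:", percentile(durations_ms[3:], 99))
```

Splitting off the samples taken right after a restart makes it easier to separate the expected catch-up delay from steady-state behaviour.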

@BigLep BigLep changed the title Investigate why mainnet participation is less than expected Investigate why distance from head is is less than expected May 1, 2025
@BigLep BigLep changed the title Investigate why distance from head is is less than expected Investigate why distance from head is is more than expected May 1, 2025

BigLep commented May 1, 2025

2025-05-01 standup update:
Pubsub and finalization don't appear to be the cause. They were investigated yesterday.

A key thread is drop in participation. That is the main thread we'll pull on. To do that we will...

Next steps


masih commented May 2, 2025

As of 1824Z yesterday the participation in F3 observed by our test node increased by about 6%, to the level we observed during passive testing. This is great to see.

Image

I am left to conclude that the root cause of the unstable distance from head after activation was solely participation, as we have not seen any evidence suggesting it was caused by the slight change in code path execution (which could not have been tested passively on mainnet).

Since the increase in participation yesterday, we are back to seeing -9 as the worst case distance from head, which is consistent with what we observed during extended passive testing on mainnet.
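To make the average and worst-case numbers easy to recompute from raw samples, here is a minimal sketch. It assumes distance from head is the latest finalized epoch minus the chain head epoch (so it is negative when finality lags, matching the -9 above); the function name `distance_summary` and the sample epochs are hypothetical.

```python
from statistics import mean
from typing import Dict, Iterable, Tuple


def distance_summary(samples: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Summarise distance from head over (head_epoch, finalized_epoch) pairs.

    Distance is finalized_epoch - head_epoch, so more negative means
    finality is further behind the chain head.
    """
    distances = [finalized - head for head, finalized in samples]
    if not distances:
        raise ValueError("no samples")
    return {"mean": mean(distances), "worst": min(distances)}


if __name__ == "__main__":
    # Hypothetical (head_epoch, finalized_epoch) observations.
    observations = [(100, 95), (101, 93), (102, 97), (103, 94)]
    summary = distance_summary(observations)
    print("mean distance:", summary["mean"])
    print("worst distance:", summary["worst"])
```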

Image

We will continue to monitor participation in F3 for a bit longer before closing this issue.

@BigLep BigLep moved this from In progress to In review in F3 May 2, 2025

BigLep commented May 6, 2025

2025-05-06 standup: we agreed the actions in #960 (comment) should be done, but those are lower priority than other work items that have emerged. This issue is still part of https://github.com/filecoin-project/go-f3/milestone/7

@BigLep BigLep moved this from In review to Todo in F3 May 6, 2025