Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DEPENDENCY_TASK] Fix CDN Crash Tests #3390

Merged
merged 18 commits into from
Jun 28, 2024
Merged

[DEPENDENCY_TASK] Fix CDN Crash Tests #3390

merged 18 commits into from
Jun 28, 2024

Conversation

bfish713
Copy link
Collaborator

@bfish713 bfish713 commented Jun 27, 2024

Closes #3368

This PR:

Solves the final tests failures for the dependency task refactor. The failure was caused because for some code paths we can calculate the VID Shares, store them in the vid share map but not emit a VidShareRecv event. This only happens when running with the combined network. There are 2 places we could calculate the VID shares from a DAProposal. The first is in the DA task if the primary is down. The second is if we get a request from another peer for a VID share we'll try to calculate it if we don't have it but do have a DA Proposal. This led to some nodes not voting when they should be able to because the dependency task was waiting for a VIDShareRecv event that wouldn't come.

Typically what I was seeing in CI was that 6/10 nodes would vote and 4 would not. I think the reasoning for this is that the VID calculations were triggered by the requests for VID. What happens was the first nodes to request VID get back a share but the later nodes to request it are already calculating the VID due to the earlier requesters and don't try to request until the calculation finishes and when it does the request task cancels itself.

The sequence that could cause the bug, from the perspective of one were:

  1. Primary Network (CDN) is down
  2. DA Proposal received
  3. DA vote sent
  4. QuorumProposalRecv and validated
  5. Vid Delayed Request spawned (request not yet sent because we are waiting out the delay)
  6. DAC received
  7. VID Reqesest Received from another node, VID calculation started
  8. VID Caclulation finishes
  9. VID delayed request task awakes after waiting the delay and is cancelled because we already have our VID share
  10. View Times out because we (and others) didn't vote and the next leader could not form QC.

In this example the VID calculation could also be triggered if primary network was known down by the node.

The fix for this is to just broadcast that we have the share if we end up cancelling the VID Task because we have the share. This means there is no way for the vote dependency to not get the event if we got the share or calculated it.

This PR does not:

Key places to review:

@bfish713 bfish713 changed the title Bf/debug dep [DEPENDENCY_TASK] Fix CDN Crash Tests Jun 28, 2024
@bfish713 bfish713 marked this pull request as ready for review June 28, 2024 13:33
@bfish713 bfish713 requested review from jparr721, lukaszrzasik, ss-es and shenkeyao and removed request for elliedavidson June 28, 2024 13:35
rob-maron
rob-maron previously approved these changes Jun 28, 2024
@@ -686,6 +686,7 @@ pub(crate) async fn fetch_proposal<TYPES: NodeType>(
quorum_membership: Arc<TYPES::Membership>,
consensus: Arc<RwLock<Consensus<TYPES>>>,
) -> Result<Leaf<TYPES>> {
tracing::info!("Fetchign proposal for view {:?}", view);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe we should consider logs like these being debug

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that this should be a relatively rare occurrence. We should only get this if we miss a proposal and are catching back up but I'm not 100%

jparr721
jparr721 previously approved these changes Jun 28, 2024
Copy link
Contributor

@jparr721 jparr721 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m getting that Friday feeling!!! So glad these are done…

@bfish713 bfish713 dismissed stale reviews from jparr721 and rob-maron via ac4bf83 June 28, 2024 14:15
jparr721
jparr721 previously approved these changes Jun 28, 2024
ss-es
ss-es previously approved these changes Jun 28, 2024
Copy link
Contributor

@ss-es ss-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just some minor comments

crates/task-impls/src/consensus/helpers.rs Outdated Show resolved Hide resolved
crates/task-impls/src/request.rs Outdated Show resolved Hide resolved
crates/task-impls/src/request.rs Outdated Show resolved Hide resolved
@@ -253,6 +254,8 @@ struct DelayedRequester<TYPES: NodeType, I: NodeImplementation<TYPES>> {
pub network: Arc<I::Network>,
/// Shared state to check if the data go populated
state: Arc<RwLock<Consensus<TYPES>>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: not really related to the PR, but could this be called consensus rather than state for consistency with the other tasks?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it should be I want to follow up on this pr with one that handles the VID stuff a little smarter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! reapproved

@bfish713 bfish713 dismissed stale reviews from ss-es and jparr721 via 092b185 June 28, 2024 14:30
Copy link
Member

@shenkeyao shenkeyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@bfish713 bfish713 merged commit a8ee65d into main Jun 28, 2024
36 checks passed
@bfish713 bfish713 deleted the bf/debug-dep branch June 28, 2024 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DEPENDENCY_REFACTOR] - Fix Combined Network Tests
5 participants