[5 / 5] Introduce approval-voting-parallel #4849
Draft
This is the start of implementing the approach described here: #1617 (comment) & #1617 (comment) & #1617 (comment).
Description of changes
The end goal is an architecture with a single subsystem (approval-voting-parallel) and multiple worker types that do the work currently done by the approval-distribution and approval-voting subsystems. The main loop of the new subsystem would only distribute work to the workers.
The new subsystem will have:
worker_index = msg.validator % WORKER_COUNT, which guarantees that all assignments and approvals from the same validator reach the same worker.
On the hot path of processing messages synchronisation and waiting is needed between approval-distribution and approval-voting workers.
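The routing rule worker_index = msg.validator % WORKER_COUNT can be illustrated with a minimal sketch. This is not the code from this PR: the Msg type, the WORKER_COUNT value, and the route helper are hypothetical names introduced only for illustration, and plain vectors stand in for the per-worker channels.

```rust
/// Hypothetical number of distribution workers; the real value is not stated here.
const WORKER_COUNT: usize = 4;

/// Hypothetical stand-in for an assignment or approval message.
#[derive(Debug, Clone, PartialEq)]
struct Msg {
    /// Index of the validator that issued the message.
    validator: u32,
    /// Placeholder for the message contents.
    payload: &'static str,
}

/// Pick the worker for a message. Messages from the same validator
/// always map to the same index.
fn worker_index(msg: &Msg) -> usize {
    msg.validator as usize % WORKER_COUNT
}

/// Group messages per worker, preserving arrival order inside each queue.
/// In a real subsystem each queue would be a channel to a worker thread.
fn route(msgs: Vec<Msg>) -> Vec<Vec<Msg>> {
    let mut queues = vec![Vec::new(); WORKER_COUNT];
    for msg in msgs {
        let idx = worker_index(&msg);
        queues[idx].push(msg);
    }
    queues
}
```

Because the index depends only on the validator, an approval can never overtake the assignment it refers to by landing on a different worker, provided each worker processes its queue in order.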
Guidelines for reading
For now this pull request is not production ready (neither functionally nor cosmetically). Its purpose is to detail the approach taken with code and to flesh out any missing pieces during PoC creation, so as a rule of thumb I would avoid nitpicking and focus on why this would or wouldn't work.
The full implementation is split into 5 PRs. All of them are self-contained and improve things incrementally, even without the parallelisation being implemented or enabled. This approach was taken instead of a big-bang PR to make things easier to review and to reduce the risk of breaking these critical subsystems.
After reading the full description of this PR, the changes should be read in the following order:
Context
and be able to run multiple instances of the subsystem on different threads. No functional changes.
approval-voting-parallel subsystem that runs on different workers the logic currently in approval-distribution and approval-voting.
Results
Running subsystem-benchmarks with 1000 validators, 100 fully occupied cores, and triggering all assignments and approvals for all tranches.
Approval does not lag behind.
Master
With this PoC
Gathering enough assignments
Enough assignments are gathered in less than 500ms, which guarantees that unnecessary work does not get triggered. On master, on the same benchmark, the subsystems fall behind on work and that number goes above 32 seconds.
CPU usage:
Master
With this PoC
Next steps
@ordian @eskimor @sandreim @AndreiEres, let me know what you think.