[5 / 5] Introduce approval-voting-parallel #4849

Draft
wants to merge 1 commit into
base: alexaggh/approval-voting-parallel-2-5

alexggh

@alexggh alexggh commented Jun 20, 2024

This is the start of implementing the approach described here: #1617 (comment) & #1617 (comment) & #1617 (comment).

Description of changes

The end goal is an architecture with a single subsystem (approval-voting-parallel) and multiple worker types that fulfil the work currently done by the approval-distribution and approval-voting subsystems. The main loop of the new subsystem would only distribute work to the workers.

The new subsystem will have:

  • N approval-distribution workers: These would do the work currently done by the approval-distribution subsystem and, in addition, perform the crypto checks that an assignment is valid and that a vote is correctly signed. Work is assigned via the formula worker_index = msg.validator % WORKER_COUNT, which guarantees that all assignments and approvals from the same validator reach the same worker.
  • 1 approval-voting worker: This would receive already-validated messages and do everything approval-voting currently does, except the crypto checks, which have been moved to the approval-distribution workers.
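
The routing formula above can be sketched in a few lines of Rust. This is a minimal illustration, not the code in this PR: `ValidatorIndex` is a local stand-in type, and `WORKER_COUNT = 8` is an assumption taken from the eight numbered workers in the benchmark output further down.

```rust
/// Assumed worker count; the benchmark output in this description lists 8 workers.
const WORKER_COUNT: usize = 8;

/// Local stand-in for the validator index carried by an assignment or approval.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ValidatorIndex(u32);

/// Picks the approval-distribution worker for a message from `validator`.
/// The result depends only on the validator, so every assignment and
/// approval from one validator lands on the same worker.
fn worker_index(validator: ValidatorIndex) -> usize {
    validator.0 as usize % WORKER_COUNT
}
```

With these assumptions, `worker_index(ValidatorIndex(13))` is 5 on every call, and validators 5, 13, 21, ... all share worker 5.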

On the hot path of processing messages, no synchronisation or waiting is needed between the approval-distribution and approval-voting workers.
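
One way to picture the hand-off between the two worker types is a set of one-way channels. The sketch below uses only `std::sync::mpsc` and `std::thread`. `Message`, `crypto_check` and `run_pipeline` are hypothetical names for illustration, and the real subsystem is built on the overseer and its own channel types, not on this code.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical message: a validator index plus an opaque payload.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Message {
    validator: u32,
    payload: u64,
}

/// Placeholder for the assignment and signature checks; always passes here.
fn crypto_check(_msg: &Message) -> bool {
    true
}

/// Routes `messages` to `worker_count` distribution workers, which forward
/// checked messages to a single voting worker. Returns the messages the
/// voting worker received, in arrival order.
fn run_pipeline(messages: Vec<Message>, worker_count: usize) -> Vec<Message> {
    assert!(worker_count > 0, "at least one distribution worker is needed");
    let (to_voting, voting_rx) = mpsc::channel::<Message>();

    // Single approval-voting worker: it only sees already-checked messages.
    let voting = thread::spawn(move || voting_rx.iter().collect::<Vec<_>>());

    // N approval-distribution workers, each with its own inbox.
    let mut inboxes = Vec::with_capacity(worker_count);
    let mut workers = Vec::with_capacity(worker_count);
    for _ in 0..worker_count {
        let (tx, rx) = mpsc::channel::<Message>();
        let to_voting = to_voting.clone();
        inboxes.push(tx);
        workers.push(thread::spawn(move || {
            for msg in rx {
                if crypto_check(&msg) {
                    // One-way send: the worker never waits for a reply.
                    let _ = to_voting.send(msg);
                }
            }
        }));
    }
    // Drop the original sender so the voting worker stops once all workers finish.
    drop(to_voting);

    // Main loop: only distributes work, using the same modulo routing.
    for msg in messages {
        let idx = msg.validator as usize % worker_count;
        inboxes[idx].send(msg).expect("worker is alive");
    }
    // Closing the inboxes lets the distribution workers exit.
    drop(inboxes);

    for worker in workers {
        worker.join().expect("distribution worker panicked");
    }
    voting.join().expect("voting worker panicked")
}
```

In this sketch every send is fire-and-forget, so a distribution worker never blocks on the voting worker. Because one validator always maps to one worker and each channel is FIFO per sender, messages from a single validator also reach the voting worker in their original order.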

Screenshot 2024-06-07 at 11 28 08

Guidelines for reading

For now this pull request is not production ready (neither functionally nor cosmetically); its purpose is to detail the approach taken with code and to flesh out any missing pieces during PoC creation. As a rule of thumb, I would avoid nitpicking and focus on why this would or wouldn't work.

The full implementation is broken into 5 PRs, all of which are self-contained and improve things incrementally even without the parallelisation being implemented or enabled. This approach was taken instead of a big-bang PR to make things easier to review and to reduce the risk of breaking these critical subsystems.

After reading the full description of this PR, the changes should be read in the following order:

  1. [1 / 5] Optimize logic for gossiping assignments #4848, plus some other micro-optimizations for networks with a high number of validators. This change gives us a speed-up by itself, without any other changes.
  2. [2 / 5] Make approval-distribution logic runnable on a separate thread #4845, which contains only interface changes to decouple the subsystem from the Context and make it possible to run multiple instances of the subsystem on different threads. No functional changes.
  3. [3 / 5] Move crypto checks in the approval-distribution #4928, which moves the crypto checks from approval-voting into approval-distribution, so that approval-distribution no longer has any reason to wait on approval-voting. This change gives us a speed-up by itself, without any other changes.
  4. [4 / 5] Make approval-voting runnable on a worker thread #4846, which contains interface changes to make approval-voting runnable on a separate thread. No functional changes.
  5. This PR, where we instantiate an approval-voting-parallel subsystem that runs the logic currently in approval-distribution and approval-voting on different workers.

Results

Running subsystem-benchmarks with 1000 validators and 100 fully occupied cores, triggering all assignments and approvals for all tranches.

Approval does not lag behind.

Master

Chain selection approved  after 72500 ms hash=0x0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a

With this PoC

Chain selection approved  after 3500 ms hash=0x0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a

Gathering enough assignments

Enough assignments are gathered in less than 500 ms, which gives us a guarantee that unnecessary work does not get triggered. On master, in the same benchmark, that number goes above 32 seconds because the subsystems fall behind on work.

Screenshot 2024-06-20 at 15 48 22

Cpu usage:

Master

CPU usage, seconds                     total   per block
approval-distribution                96.9436      9.6944
approval-voting                     117.4676     11.7468
test-environment                     44.0092      4.4009

With this PoC

CPU usage, seconds                     total   per block
approval-distribution                 0.0014      0.0001  --- unused
approval-voting                       0.0437      0.0044  --- unused
approval-voting-parallel              5.9560      0.5956
approval-voting-parallel-0           22.9073      2.2907
approval-voting-parallel-1           23.0417      2.3042
approval-voting-parallel-2           22.0445      2.2045
approval-voting-parallel-3           22.7234      2.2723
approval-voting-parallel-4           21.9788      2.1979
approval-voting-parallel-5           23.0601      2.3060
approval-voting-parallel-6           22.4805      2.2481
approval-voting-parallel-7           21.8330      2.1833
approval-voting-parallel-db          37.1954      3.7195  --- the approval-voting thread

Next steps

  • Make sure, through various kinds of testing, that we are not missing anything.
  • Build consensus that this is the approach we want to take.
  • Polish the implementations to make them production ready.
  • Tests.
  • Define and implement the strategy for rolling out this change, so that the blast radius is minimal (a single validator) in case there are problems with the implementation.

@ordian @eskimor @sandreim @AndreiEres, let me know what you think.

@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-5-5 branch from a591635 to bd5529d Compare June 20, 2024 12:02
@alexggh alexggh changed the title Introduce approval-voting-parallel [5 / 5] Introduce approval-voting-parallel Jun 20, 2024
Signed-off-by: Alexandru Gheorghe <[email protected]>
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-4-5 branch from cb57906 to 4b3f489 Compare July 2, 2024 11:40
@alexggh alexggh force-pushed the alexaggh/approval-voting-parallel-5-5 branch from bd5529d to 5da132d Compare July 2, 2024 11:40
@alexggh alexggh changed the base branch from alexaggh/approval-voting-parallel-4-5 to alexaggh/approval-voting-parallel-2-5 July 2, 2024 11:46
@paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6602947

@alexggh alexggh changed the base branch from alexaggh/approval-voting-parallel-2-5 to master July 2, 2024 11:51
@alexggh alexggh changed the base branch from master to alexaggh/approval-voting-parallel-2-5 July 2, 2024 12:04