This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Expand PoV Distribution section and improve candidate types #1294

Merged · merged 5 commits on Jun 22, 2020
@@ -26,7 +26,7 @@ Register on startup an event producer with `NetworkBridge::RegisterEventProducer`

For each relay-parent in our local view update, look at all backed candidates pending availability. Distribute via gossip all erasure chunks for all candidates that we have to peers.

-We define an operation `live_candidates(relay_heads) -> Set<AbridgedCandidateReceipt>` which returns a set of candidates a given set of relay chain heads that implies a set of candidates whose availability chunks should be currently gossiped. This is defined as all candidates pending availability in any of those relay-chain heads or any of their last `K` ancestors. We assume that state is not pruned within `K` blocks of the chain-head.
+We define an operation `live_candidates(relay_heads) -> Set<CommittedCandidateReceipt>` which, for a given set of relay chain heads, returns the set of [`CommittedCandidateReceipt`s](../../types/candidate.md#committed-candidate-receipt) whose availability chunks should currently be gossiped. This is defined as all candidates pending availability in any of those relay-chain heads or any of their last `K` ancestors. We assume that state is not pruned within `K` blocks of the chain-head.
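
As a rough illustration, `live_candidates` could be computed as below; `k_ancestors` and `pending_availability` are hypothetical helpers standing in for chain traversal and runtime state queries:

```rust
use std::collections::HashSet;

// Hypothetical stand-ins; the real types and lookups come from the
// runtime API and the chain database.
type Hash = [u8; 32];
#[derive(Clone, PartialEq, Eq, Hash)]
struct CommittedCandidateReceipt; // fields elided

fn k_ancestors(_head: Hash, _k: usize) -> Vec<Hash> { unimplemented!() }
fn pending_availability(_block: Hash) -> Vec<CommittedCandidateReceipt> { unimplemented!() }

/// All candidates pending availability in any of the given heads or any
/// of their last `k` ancestors.
fn live_candidates(relay_heads: &[Hash], k: usize) -> HashSet<CommittedCandidateReceipt> {
    let mut live = HashSet::new();
    for &head in relay_heads {
        live.extend(pending_availability(head));
        for ancestor in k_ancestors(head, k) {
            live.extend(pending_availability(ancestor));
        }
    }
    live
}
```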

We will send any erasure-chunks that correspond to candidates in `live_candidates(peer_most_recent_view_update)`. Likewise, we only accept and forward messages pertaining to a candidate in `live_candidates(current_heads)`. Each erasure chunk should be accompanied by a merkle proof that it is committed to by the erasure trie root in the candidate receipt, and this gossip system is responsible for checking such proof.
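
For intuition, the proof check could look like this simplified binary-Merkle sketch. The structure actually committed to by the candidate receipt is an erasure trie, so the real proof format differs; the node-hashing function is left pluggable (the guide uses blake2-256 hashes elsewhere):

```rust
/// Which side the sibling hash sits on at each level of the branch.
enum Side { Left, Right }

/// Verify that `leaf` (the hash of an erasure chunk) is committed to by
/// `root` via the given branch proof.
fn verify_branch(
    root: [u8; 32],
    leaf: [u8; 32],
    branch: &[(Side, [u8; 32])],
    hash: impl Fn(&[u8]) -> [u8; 32],
) -> bool {
    let computed = branch.iter().fold(leaf, |acc, (side, sibling)| {
        let mut buf = [0u8; 64];
        match side {
            Side::Left => {
                buf[..32].copy_from_slice(sibling);
                buf[32..].copy_from_slice(&acc);
            }
            Side::Right => {
                buf[..32].copy_from_slice(&acc);
                buf[32..].copy_from_slice(sibling);
            }
        }
        hash(&buf)
    });
    computed == root
}
```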

@@ -12,7 +12,7 @@ Once a sufficient quorum has agreed that a candidate is valid, this subsystem no

The [Candidate Selection subsystem](candidate-selection.md) is the primary source of non-overseer messages into this subsystem. That subsystem generates appropriate [`CandidateBackingMessage`s](../../types/overseer-protocol.md#candidate-backing-message), and passes them to this subsystem.

-This subsystem validates the candidates and generates an appropriate [`Statement`](../../types/backing.md#statement-type). All `Statement`s are then passed on to the [Statement Distribution subsystem](statement-distribution.md) to be gossiped to peers. When this subsystem decides that a candidate is invalid, and it was recommended to us to second by our own Candidate Selection subsystem, a message is sent to the Candidate Selection subsystem with the candidate's hash so that the collator which recommended it can be penalized.
+This subsystem validates the candidates and generates an appropriate [`SignedStatement`](../../types/backing.md#signed-statement-type). All `SignedStatement`s are then passed on to the [Statement Distribution subsystem](statement-distribution.md) to be gossiped to peers. All [Proofs of Validity](../../types/availability.md#proof-of-validity) should be distributed via the [PoV Distribution](pov-distribution.md) subsystem. When this subsystem decides that a candidate is invalid, and our own Candidate Selection subsystem recommended that we second it, a message is sent to the Candidate Selection subsystem with the candidate's hash so that the collator which recommended it can be penalized.

## Functionality

Expand Down
114 changes: 111 additions & 3 deletions roadmap/implementors-guide/src/node/backing/pov-distribution.md
@@ -4,10 +4,118 @@ This subsystem is responsible for distributing PoV blocks. For now, unified with

## Protocol

-Handle requests for PoV block by candidate hash and relay-parent.
+`ProtocolId`: `b"povd"`

> **Contributor:** Just to make sure, are we attaching a version to these protocol ids somewhere? I.e. something along the lines of `b"povd/1"`? That would help tremendously in evolving the protocol later in the future.
>
> **Contributor (Author):** These are attached to the network bridge. I figure these can handle versioning internally, since the network bridge protocol is so generic.
>
> **Contributor:** I see. The network bridge would treat the protocol ids as opaque strings. Thereby a subsystem could later even just listen for both `povd` and `povd/2` during a transition period, correct?
>
> **Contributor (Author):** Yeah, that's right.

Input: [`PoVDistributionMessage`](../../types/overseer-protocol.md#pov-distribution-message)

Output:

- `NetworkBridge::RegisterEventProducer(ProtocolId)`
- `NetworkBridge::SendMessage([PeerId], ProtocolId, Bytes)`
- `NetworkBridge::ReportPeer(PeerId, cost_or_benefit)`


## Functionality

-Implemented as a gossip system, where `PoV`s are not accepted unless we know a `Seconded` message.
+This network protocol is responsible for distributing [`PoV`s](../../types/availability.md#proof-of-validity) by gossip. Since PoVs are heavy in practice, gossip is far from the most efficient way to distribute them. In the future, this should be replaced by a better network protocol that finds validators who have validated the block and connects to them directly.

This protocol is described in terms of "us" and our peers, with the understanding that this is the procedure that any honest node will run. It has the following goals:
- We never have to buffer an unbounded amount of data.
- PoVs will flow transitively across a network of honest nodes, stemming from the validators that originally seconded candidates requiring those PoVs.

As we are gossiping, we need to track which PoVs our peers are waiting for to avoid sending them data that they are not expecting. It is not reasonable to expect our peers to buffer unexpected PoVs, just as we will not buffer unexpected PoVs. So notifying our peers about what is being awaited is key. However, it is important that the notification system is also bounded.

For this, in order to avoid reaching into the internals of the [Statement Distribution](statement-distribution.md) Subsystem, we can rely on an expected property of candidate backing: that each validator can only second one candidate at each chain head. So we can set a cap on the number of PoVs each peer is allowed to notify us that they are waiting for at a given relay-parent. This cap will be the number of validators at that relay-parent. And the view update mechanism of the [Network Bridge](../utility/network-bridge.md) ensures that peers are only allowed to consider a certain set of relay-parents as live. So this bounding mechanism caps the amount of data we need to store per peer at any time at `sum({ n_validators_at_head(head) | head in view_heads })`. Additionally, peers should only be allowed to notify us of PoV hashes they are waiting for in the context of relay-parents in our own local view, which means that `n_validators_at_head` is implied to be `0` for relay-parents not in our own local view.
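
In code, the bound is just the following (a sketch; `n_validators_at_head` is the hypothetical runtime-API lookup mentioned above):

```rust
type Hash = [u8; 32];

/// Maximum number of awaited-PoV entries we will store for a single peer:
/// one per validator, per live relay-parent in our view.
fn per_peer_awaited_cap(
    view_heads: &[Hash],
    n_validators_at_head: impl Fn(&Hash) -> usize,
) -> usize {
    view_heads.iter().map(|head| n_validators_at_head(head)).sum()
}
```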

View updates from peers and our own view updates are received from the network bridge. These will lag somewhat behind the `StartWork` and `StopWork` messages received from the overseer, which will influence the actual data we store. The `OurViewUpdate`s from the [`NetworkBridgeEvent`](../../types/overseer-protocol.md#network-bridge-update) must be considered canonical in terms of our peers' perception of us.

Lastly, the system needs to be bootstrapped with our own perception of which PoVs we are cognizant of but awaiting data for. This is done on receipt of the `ValidatorStatement` variant of [`PoVDistributionMessage`](../../types/overseer-protocol.md#pov-distribution-message). We can ignore anything except for `Seconded` statements.

## Formal Description

This protocol can be implemented as a state machine with the following state:

```rust
struct State {
    relay_parent_state: Map<Hash, BlockBasedState>,
    peer_state: Map<PeerId, PeerState>,
    our_view: View,
}

struct BlockBasedState {
    known: Map<Hash, PoV>, // should be a shared PoV in practice. these things are heavy.
    awaited: Set<Hash>, // awaited PoVs by blake2-256 hash.
    fetching: Map<Hash, [ResponseChannel<PoV>]>, // response channels for local `FetchPoV` requests.
    n_validators: usize, // number of validators at this relay-parent.
}

struct PeerState {
    awaited: Map<Hash, Set<Hash>>, // relay-parent -> PoV hashes the peer is awaiting.
}
```

We also assume the following network messages, which are sent and received by the [Network Bridge](../utility/network-bridge.md):

```rust
enum NetworkMessage {
    /// Notification that we are awaiting the given PoVs (by hash) against a
    /// specific relay-parent hash.
    Awaiting(Hash, Vec<Hash>),
    /// Notification of an awaited PoV, in a given relay-parent context.
    /// (relay_parent, pov_hash, pov)
    SendPoV(Hash, Hash, PoV),
}
```

Here is the logic of the state machine:

*Overseer Signals*
- On `StartWork(relay_parent)`:
- Get the number of validators at that relay parent by querying the [Runtime API](../utility/runtime-api.md) for the validators and then counting them.
- Create a blank entry in `relay_parent_state` under `relay_parent` with correct `n_validators` set.
- On `StopWork(relay_parent)`:
- Remove the entry for `relay_parent` from `relay_parent_state`.
- On `Concluded`: conclude.

*PoV Distribution Messages*
- On `ValidatorStatement(relay_parent, statement)`
- If this is not `Statement::Seconded`, ignore.
- If there is an entry under `relay_parent` in `relay_parent_state`, add the `pov_hash` of the seconded Candidate's [`CandidateDescriptor`](../../types/candidate.md#candidate-descriptor) to the `awaited` set of the entry.
- If the `pov_hash` was not previously awaited and there are `n_validators` or fewer entries in the `awaited` set, send `NetworkMessage::Awaiting(relay_parent, vec![pov_hash])` to all peers.
- On `FetchPoV(relay_parent, descriptor, response_channel)`
- If there is no entry in `relay_parent_state` under `relay_parent`, ignore.
- If there is a PoV under `descriptor.pov_hash` in the `known` map, send that PoV on the channel and return.
- Otherwise, place the `response_channel` in the `fetching` map under `descriptor.pov_hash`.
- On `DistributePoV(relay_parent, descriptor, PoV)` (a sketch of this handler follows the list below)
- If there is no entry in `relay_parent_state` under `relay_parent`, ignore.
- Complete and remove any channels under `descriptor.pov_hash` in the `fetching` map.
- Send `NetworkMessage::SendPoV(relay_parent, descriptor.pov_hash, PoV)` to all peers who have the `descriptor.pov_hash` in the set under `relay_parent` in the `peer.awaited` map and remove the entry from `peer.awaited`.
- Note the PoV under `descriptor.pov_hash` in `known`.
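
A minimal sketch of the `DistributePoV` handler above, using simplified stand-ins for the `State` types from the Formal Description (the channel, peer-id, and network-send APIs are assumptions):

```rust
use std::collections::{HashMap, HashSet};
use std::marker::PhantomData;
use std::rc::Rc;

// Simplified stand-ins for the spec's types; the real ones come from the
// subsystem and network-bridge layers.
type Hash = [u8; 32];
type PeerId = u64;
struct PoV(Vec<u8>);
struct ResponseChannel<T>(PhantomData<T>);
impl<T> ResponseChannel<T> {
    fn send(self, _value: &T) {} // assumed one-shot response channel
}

struct BlockBasedState {
    known: HashMap<Hash, Rc<PoV>>,
    fetching: HashMap<Hash, Vec<ResponseChannel<PoV>>>,
    // `awaited` and `n_validators` elided; see the `State` definition above.
}

struct PeerState {
    awaited: HashMap<Hash, HashSet<Hash>>, // relay-parent -> awaited PoV hashes
}

// Assumed network send of `NetworkMessage::SendPoV(relay_parent, pov_hash, pov)`.
fn send_pov_to(_peer: PeerId, _relay_parent: Hash, _pov_hash: Hash, _pov: &PoV) {}

fn handle_distribute_pov(
    relay_parent_state: &mut HashMap<Hash, BlockBasedState>,
    peer_state: &mut HashMap<PeerId, PeerState>,
    relay_parent: Hash,
    pov_hash: Hash,
    pov: PoV,
) {
    // If there is no entry under `relay_parent`, ignore.
    let Some(state) = relay_parent_state.get_mut(&relay_parent) else { return };

    // Complete and remove any channels under `pov_hash` in the `fetching` map.
    for channel in state.fetching.remove(&pov_hash).unwrap_or_default() {
        channel.send(&pov);
    }

    // Send `SendPoV` to every peer awaiting this hash at this relay-parent,
    // removing the entry from their `awaited` sets.
    for (&peer, ps) in peer_state.iter_mut() {
        if let Some(awaited) = ps.awaited.get_mut(&relay_parent) {
            if awaited.remove(&pov_hash) {
                send_pov_to(peer, relay_parent, pov_hash, &pov);
            }
        }
    }

    // Note the PoV under `pov_hash` in `known`.
    state.known.insert(pov_hash, Rc::new(pov));
}
```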

*Network Bridge Updates*
- On `PeerConnected(peer_id, observed_role)`
- Make a fresh entry in the `peer_state` map for the `peer_id`.
- On `PeerDisconnected(peer_id)`
- Remove the entry for `peer_id` from the `peer_state` map.
- On `PeerMessage(peer_id, bytes)`
- If the bytes do not decode to a `NetworkMessage` or the `peer_id` has no entry in the `peer_state` map, report and ignore.
- If this is `NetworkMessage::Awaiting(relay_parent, pov_hashes)` (see the sketch after this list):
- If there is no entry under `peer_state.awaited` for the `relay_parent`, report and ignore.
- If `relay_parent` is not contained within `our_view`, report and ignore.
- Otherwise, if the `awaited` map combined with the `pov_hashes` would have more than `relay_parent_state[relay_parent].n_validators` entries, report and ignore. Note that we are leaning on the property of the network bridge that it sets our view based on `StartWork` messages.
- For each new `pov_hash` in `pov_hashes`, if there is a `pov` under `pov_hash` in the `known` map, send the peer a `NetworkMessage::SendPoV(relay_parent, pov_hash, pov)`.
- Otherwise, add the `pov_hash` to the `awaited` map.
- If this is `NetworkMessage::SendPoV(relay_parent, pov_hash, pov)`:
- If there is no entry under `relay_parent` in `relay_parent_state` or no entry under `pov_hash` in our `awaited` map for that `relay_parent`, report and ignore.
- If the blake2-256 hash of the pov doesn't equal `pov_hash`, report and ignore.
- Complete and remove any listeners in the `fetching` map under `pov_hash`.
- Add to `known` map.
- Send `NetworkMessage::SendPoV(relay_parent, pov_hash, pov)` to all peers who have the `pov_hash` in the set under `relay_parent` in the `peer.awaited` map, and remove the entry from `peer.awaited`.
- On `PeerViewChange(peer_id, view)`
- If the peer is unknown, ignore.
- Ensure there is an entry under `relay_parent` for each `relay_parent` in `view` within the `peer.awaited` map, creating blank `awaited` lists as necessary.
- Remove all entries under `peer.awaited` that are not within `view`.
- On `OurViewChange(view)`
- Update `our_view` to `view`
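
A minimal sketch of the `Awaiting` branch above, with the cap check made explicit (types are simplified stand-ins; reporting the peer is signalled by the `false` return):

```rust
use std::collections::{HashMap, HashSet};

type Hash = [u8; 32];

struct PeerState {
    awaited: HashMap<Hash, HashSet<Hash>>, // relay-parent -> awaited PoV hashes
}

/// Handle `NetworkMessage::Awaiting(relay_parent, pov_hashes)` from one peer.
/// Returns `false` when the peer should be reported.
fn handle_awaiting(
    peer: &mut PeerState,
    our_view: &HashSet<Hash>,
    n_validators: usize, // relay_parent_state[relay_parent].n_validators
    relay_parent: Hash,
    pov_hashes: Vec<Hash>,
    known: &HashMap<Hash, Vec<u8>>, // PoVs we already have at this relay-parent
    mut send_pov: impl FnMut(Hash, &[u8]), // assumed network send
) -> bool {
    // Peers may only await PoVs at relay-parents within our own view.
    if !our_view.contains(&relay_parent) {
        return false;
    }
    // The entry is created by `PeerViewChange`; its absence means the peer
    // is awaiting outside its own declared view.
    let Some(awaited) = peer.awaited.get_mut(&relay_parent) else { return false };

    // Enforce the cap: at most one awaited PoV per validator at this head.
    let new: HashSet<Hash> = pov_hashes
        .into_iter()
        .filter(|h| !awaited.contains(h))
        .collect();
    if awaited.len() + new.len() > n_validators {
        return false;
    }

    for pov_hash in new {
        match known.get(&pov_hash) {
            // Already known: answer immediately with `SendPoV`.
            Some(pov) => send_pov(pov_hash, pov),
            // Otherwise record that the peer awaits it.
            None => {
                awaited.insert(pov_hash);
            }
        }
    }
    true
}
```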

> TODO: this requires a lot of cross-contamination with statement distribution even if we don't implement this as a gossip system. In a point-to-point implementation, we still have to know _who to ask_, which means tracking who's submitted `Seconded`, `Valid`, or `Invalid` statements - by validator and by peer. One approach is to have the Statement gossip system just send us this information, and then we can separate the systems from the beginning instead of combining them
@@ -22,7 +22,7 @@ Implemented as a gossip protocol. Register a network event producer on startup.

Statement Distribution is the only backing subsystem which has any notion of peer nodes, which may be any full nodes on the network. Validators will also act as peer nodes.

-It is responsible for signing statements that we have generated and forwarding them, and for detecting a variety of Validator misbehaviors for reporting to [Misbehavior Arbitration](../utility/misbehavior-arbitration.md). During the Backing stage of the inclusion pipeline, it's the main point of contact with peer nodes, who distribute statements by validators. On receiving a signed statement from a peer, assuming the peer receipt state machine is in an appropriate state, it sends the Candidate Receipt to the [Candidate Backing subsystem](candidate-backing.md) to handle the validator's statement.
+It is responsible for distributing signed statements that we have generated and forwarding them, and for detecting a variety of Validator misbehaviors for reporting to [Misbehavior Arbitration](../utility/misbehavior-arbitration.md). During the Backing stage of the inclusion pipeline, it's the main point of contact with peer nodes. On receiving a signed statement from a peer, assuming the peer receipt state machine is in an appropriate state, it sends the Candidate Receipt to the [Candidate Backing subsystem](candidate-backing.md) to handle the validator's statement.

Track equivocating validators and stop accepting information from them. Forward double-vote proofs to the double-vote reporting system. Establish a data-dependency order:

@@ -37,7 +37,7 @@ The Statement Distribution subsystem sends statements to peer nodes and detects

There is a very simple state machine which governs which messages we are willing to receive from peers. Not depicted in the state machine: on initial receipt of any [`SignedFullStatement`](../../types/backing.md#signed-statement-type), validate that the provided signature does in fact sign the included data. Note that each individual parablock candidate gets its own instance of this state machine; it is perfectly legal to receive a `Valid(X)` before a `Seconded(Y)`, as long as a `Seconded(X)` has been received.

-A: Initial State. Receive `SignedFullStatement(Statement::Second)`: extract `Statement`, forward to Candidate Backing, proceed to B. Receive any other `SignedFullStatement` variant: drop it.
+A: Initial State. Receive `SignedFullStatement(Statement::Seconded)`: extract `Statement`, forward to Candidate Backing and PoV Distribution, proceed to B. Receive any other `SignedFullStatement` variant: drop it.

B: Receive any `SignedFullStatement`: check signature, forward to Candidate Backing. Receive `OverseerMessage::StopWork`: proceed to C.

14 changes: 8 additions & 6 deletions roadmap/implementors-guide/src/runtime/inclusion.md
@@ -14,7 +14,7 @@ struct AvailabilityBitfield {

struct CandidatePendingAvailability {
    core: CoreIndex, // availability core
-   receipt: AbridgedCandidateReceipt,
+   receipt: CandidateReceipt,
    availability_votes: Bitfield, // one bit per validator.
    relay_parent_number: BlockNumber, // number of the relay-parent.
    backed_in_number: BlockNumber,
@@ -28,6 +28,8 @@ Storage Layout:
bitfields: map ValidatorIndex => AvailabilityBitfield;
/// Candidates pending availability.
PendingAvailability: map ParaId => CandidatePendingAvailability;
+/// The commitments of candidates pending availability, by ParaId.
+PendingAvailabilityCommitments: map ParaId => CandidateCommitments;

/// The current validators, by their parachain session keys.
Validators: Vec<ValidatorId>;
@@ -36,8 +38,6 @@ Validators: Vec<ValidatorId>;
CurrentSessionIndex: SessionIndex;
```

-> TODO: `CandidateReceipt` and `AbridgedCandidateReceipt` can contain code upgrades which make them very large. the code entries should be split into a different storage map with infrequent access patterns

## Session Change

1. Clear out all candidates pending availability.
@@ -64,15 +64,17 @@ All failed checks should lead to an unrecoverable error making the block invalid
1. check that there is no candidate pending availability for any scheduled `ParaId`.
1. If the core assignment includes a specific collator, ensure the backed candidate is issued by that collator.
1. Ensure that any code upgrade scheduled by the candidate does not happen within `config.validation_upgrade_frequency` of `Paras::last_code_upgrade(para_id, true)`, if any, comparing against the value of `Paras::FutureCodeUpgrades` for the given para ID.
-1. Check the collator's signature on the pov block.
+1. Check the collator's signature on the candidate data.
+1. Transform each [`CommittedCandidateReceipt`](../types/candidate.md#committed-candidate-receipt) into the corresponding [`CandidateReceipt`](../types/candidate.md#candidate-receipt), setting the commitments aside (see the sketch after this list).
1. check the backing of the candidate using the signatures and the bitfields, comparing against the validators assigned to the groups, fetched with the `group_validators` lookup.
1. check that the upward messages, when combined with the existing queue size, are not exceeding `config.max_upward_queue_count` and `config.watermark_upward_queue_size` parameters.
1. create an entry in the `PendingAvailability` map for each backed candidate with a blank `availability_votes` bitfield.
+1. create a corresponding entry in the `PendingAvailabilityCommitments` with the commitments.
1. Return a `Vec<CoreIndex>` of all scheduled cores of the list of passed assignments that a candidate was successfully backed for, sorted ascending by CoreIndex.
-* `enact_candidate(relay_parent_number: BlockNumber, AbridgedCandidateReceipt)`:
+* `enact_candidate(relay_parent_number: BlockNumber, CommittedCandidateReceipt)`:
1. If the receipt contains a code upgrade, call `Paras::schedule_code_upgrade(para_id, code, relay_parent_number + config.validation_upgrade_delay)`.
> TODO: Note that this is safe as long as we never enact candidates where the relay parent is across a session boundary. In that case, which we should be careful to avoid with contextual execution, the configuration might have changed and the para may de-sync from the host's understanding of it.
-1. call `Router::queue_upward_messages` for each backed candidate.
+1. call `Router::queue_upward_messages` for each backed candidate, using the [`UpwardMessage`s](../types/messages.md#upward-message) from the [`CandidateCommitments`](../types/candidate.md#candidate-commitments).
1. Call `Paras::note_new_head` using the `HeadData` from the receipt and `relay_parent_number`.
* `collect_pending`:
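
A sketch of the `CommittedCandidateReceipt` → `CandidateReceipt` transformation used in `process_candidates` above (field and method names are assumptions):

```rust
type Hash = [u8; 32];

struct CandidateDescriptor; // fields elided
struct CandidateCommitments; // upward messages, head data, code upgrade, etc.

/// A candidate receipt carrying the full commitments.
struct CommittedCandidateReceipt {
    descriptor: CandidateDescriptor,
    commitments: CandidateCommitments,
}

/// The compact form: commitments replaced by their hash.
struct CandidateReceipt {
    descriptor: CandidateDescriptor,
    commitments_hash: Hash,
}

// Assumed hashing of the encoded commitments.
fn hash_of(_c: &CandidateCommitments) -> Hash { unimplemented!() }

impl CommittedCandidateReceipt {
    /// Split into the compact receipt plus the commitments, which are
    /// stored separately in `PendingAvailabilityCommitments`.
    fn into_plain(self) -> (CandidateReceipt, CandidateCommitments) {
        let commitments_hash = hash_of(&self.commitments);
        (
            CandidateReceipt { descriptor: self.descriptor, commitments_hash },
            self.commitments,
        )
    }
}
```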

4 changes: 2 additions & 2 deletions roadmap/implementors-guide/src/runtime/router.md
@@ -27,9 +27,9 @@ No initialization routine runs for this module.

## Routines

-* `queue_upward_messages(AbridgedCandidateReceipt)`:
+* `queue_upward_messages(ParaId, Vec<UpwardMessage>)`:
1. Update `NeedsDispatch`, enqueue the upward messages into `RelayDispatchQueue`, and modify the respective entry in `RelayDispatchQueueSize`.

## Finalization

1. Dispatch queued upward messages from `RelayDispatchQueues` in a FIFO order applying the `config.watermark_upward_queue_size` and `config.max_upward_queue_count` limits.
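
A single-queue sketch of that dispatch loop under assumed types: messages leave in FIFO order until either the count or the size budget for the block is spent.

```rust
use std::collections::VecDeque;

type UpwardMessage = Vec<u8>;

/// Dispatch queued upward messages in FIFO order, stopping once either the
/// message-count budget (~ `config.max_upward_queue_count`) or the size
/// budget (~ `config.watermark_upward_queue_size`) is exhausted.
fn dispatch_upward(
    queue: &mut VecDeque<UpwardMessage>,
    max_count: usize,
    size_watermark: usize,
    mut dispatch: impl FnMut(UpwardMessage),
) {
    let mut count = 0;
    let mut size = 0;
    while let Some(front) = queue.front() {
        if count == max_count || size + front.len() > size_watermark {
            break;
        }
        size += front.len();
        count += 1;
        let msg = queue.pop_front().expect("`front()` just returned `Some`");
        dispatch(msg);
    }
}
```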
2 changes: 1 addition & 1 deletion roadmap/implementors-guide/src/types/availability.md
@@ -5,7 +5,7 @@ candidates for the duration of a challenge period. This is done via an erasure-c

## Signed Availability Bitfield

-A bitfield [signed](backing.html#signed-wrapper) by a particular validator about the availability of pending candidates.
+A bitfield [signed](backing.md#signed-wrapper) by a particular validator about the availability of pending candidates.

