
Fix bottleneck TransactionsManager #6656

Closed
wants to merge 19 commits into from

Conversation

emhane
Member

@emhane emhane commented Feb 18, 2024

By default, concurrency in the TransactionFetcher is not enabled at a 'per peer' level. Hence, the number of inflight requests is limited by the total number of p2p connections the node can have, 130 connections by default. This number is some orders of magnitude smaller than the capacities of the other collections that implement Stream in the TransactionsManager. As a result, many hashes received in announcements end up buffered in the cache for hashes pending fetch. This is not necessarily because there is no idle peer at the network level, but rather just none in the context of the TransactionsManager, i.e. the FetchEvent stream hasn't been polled yet although many events may be ready. The problem was identified while implementing #6590, and this PR addresses #6336 in part.

This PR solves the problem by advancing the FuturesUnordered of inflight requests and processing responses synchronously on each operation. Processing a response is what marks a peer as idle. The operations are:

  • on new announcement
  • on trying to fetch hashes pending fetch

These operations attempt to send new requests to idle peers.

The alternative solution considered was increasing the concurrency parameter at the 'per peer' level to more than 1 request per peer at a time. Although this would have solved the problem at hand as a side effect, it would have modified the rate at which GetPooledTransactions requests can be sent to a peer's session. Currently it's "send one request, get one response, send the next request". This is maybe something we want to do in the future, but with stronger intention. The presented solution was also preferred because it marks peers as idle in a more deterministic way (easier to reason about).
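As a rough, self-contained illustration of the approach described above, here is a minimal sketch: completed responses are drained synchronously (which is what marks their peers idle) before new requests are attempted on an announcement. All names here (`SimpleFetcher`, `drain_ready`, `on_new_announcement`) are hypothetical stand-ins, not the actual reth types.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical, simplified model of the idea in this PR; not the reth code.
struct SimpleFetcher {
    /// peer id -> is the peer idle (no inflight request)?
    idle: HashMap<u64, bool>,
    /// Responses that have completed but were not yet processed (peer ids).
    ready_responses: VecDeque<u64>,
    /// Peers a request was sent to in the current round.
    sent: Vec<u64>,
}

impl SimpleFetcher {
    /// Drain completed responses synchronously, freeing up their peers.
    fn drain_ready(&mut self) {
        while let Some(peer) = self.ready_responses.pop_front() {
            // processing a response is what marks the peer as idle again
            self.idle.insert(peer, true);
        }
    }

    /// Called on a new announcement: drain first, then request from idle peers.
    fn on_new_announcement(&mut self, peers: &[u64]) {
        self.drain_ready();
        for &peer in peers {
            if self.idle.get(&peer).copied().unwrap_or(true) {
                self.idle.insert(peer, false); // now inflight
                self.sent.push(peer);
            }
        }
    }
}

fn main() {
    let mut f = SimpleFetcher {
        idle: HashMap::from([(1, false), (2, true)]),
        ready_responses: VecDeque::from([1]), // peer 1's response is ready
        sent: Vec::new(),
    };
    // Without the drain, peer 1 would still look busy even though its
    // response has already completed; with it, both peers get a request.
    f.on_new_announcement(&[1, 2]);
    println!("{:?}", f.sent);
}
```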

@emhane emhane added A-networking Related to networking in general C-perf A change motivated by improving speed, memory usage or disk footprint labels Feb 18, 2024
@emhane emhane requested a review from rkrasiuk February 18, 2024 19:17
Collaborator

@mattsse mattsse left a comment


why was everything made pub?

none of these types, fields, or functions should be pub

This PR solves the problem by advancing the FuturesUnordered of inflight requests and processing responses synchronously on each operation. Processing a response is what marks a peer as idle.

could you please elaborate on this, I'm not following

crates/net/network/src/budget.rs (outdated, resolved)
crates/net/network/src/transactions/fetcher.rs (outdated, resolved)
crates/net/network/src/transactions/fetcher.rs (outdated, resolved)
crates/net/network/src/transactions/fetcher.rs (outdated, resolved)
@emhane emhane requested a review from mattsse February 18, 2024 21:49
@emhane
Member Author

emhane commented Feb 18, 2024

why was everything made pub?

none of these types, fields, or functions should be pub

This PR solves the problem by advancing the FuturesUnordered of inflight requests and processing responses synchronously on each operation. Processing a response is what marks a peer as idle.

could you please elaborate on this, I'm not following

In order to link docs, so that CI would pass. I don't think it hurts anyway; validation should probably be exposed at some point for custom implementations on custom networks.

Collaborator

@mattsse mattsse left a comment


I don't understand why the on_ event handlers now all need a context, or rather why we poll active inflight requests when we handle on_new_pooled_transaction_hashes

The fetcher now polls an internal channel that is filled by the fetcher itself, like an internal buffer that doesn't need any polling

imo the fetcher should keep advancing inflight requests when it is polled

@emhane
Member Author

emhane commented Feb 22, 2024

I don't understand why the on_ event handlers now all need a context, or rather why we poll active inflight requests when we handle on_new_pooled_transaction_hashes

I used noop_context first, but then changed it based on #6656 (comment). I think it can be changed back to noop_context since 5cc593e was committed, and I will try to make that comment more verbose to explain why noop_context is ok.

The fetcher now polls an internal channel that is filled by the fetcher itself, like an internal buffer that doesn't need any polling

yes, it's an intermediary store of fetch events. This is needed since we are limited to 130 inflight requests but potentially try to queue many, many more in each loop of the tx manager future; here, for example, one event can contain up to 20k txns.

if let Poll::Ready(Some(event)) = this.transaction_events.poll_next_unpin(cx) {
    this.on_network_tx_event(event);
    some_ready = true;
}

Please re-read the description of this PR for more detail, or refer to the code comments in #6651 for a description of the orders of magnitude of flow in the tx manager future loop.
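To make the orders-of-magnitude point above concrete, here is a back-of-the-envelope sketch using the numbers from this thread (130 connections, one request per peer, up to ~20k txns in one event). It deliberately ignores that a single request can ask for multiple hashes, so it only bounds the imbalance, not the exact figure.

```rust
fn main() {
    // Numbers from this discussion: at most 130 p2p connections, with one
    // inflight GetPooledTransactions request per peer, while a single
    // network event may announce up to ~20_000 transactions.
    let max_inflight: u32 = 130;
    let hashes_in_one_event: u32 = 20_000;

    // Even if every peer were idle, one event can trigger at most
    // `max_inflight` new requests; the remaining hashes end up buffered
    // in the hashes-pending-fetch cache (simplified: one hash per request).
    let buffered = hashes_in_one_event.saturating_sub(max_inflight);
    println!("at least {buffered} hashes buffered for a later fetch attempt");
}
```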

imo the fetcher should keep advancing inflight requests when it is polled

It has done so since 5cc593e, a commit older than this review.

@emhane emhane force-pushed the emhane/one-req-per-peer-bottleneck branch from 963012b to 5cc593e Compare February 22, 2024 18:45
@emhane emhane force-pushed the emhane/one-req-per-peer-bottleneck branch from 573f333 to 358f3ac Compare February 22, 2024 19:10
@emhane emhane requested a review from mattsse February 22, 2024 19:43
Collaborator

@mattsse mattsse left a comment


This PR solves the problem by advancing the FuturesUnordered of inflight requests and processing responses synchronously on each operation. Processing a response is what marks a peer as idle.

It's not 100% clear to me what the problem is you're describing.

As far as I understood this, it introduces response buffering to free up capacity for outgoing requests.

I think in theory this makes sense, though the way it's implemented feels a bit obfuscated.

Wouldn't we get the same if we drained in-progress requests first, before handling tx fetching + new incoming messages?

crates/net/network/src/transactions/mod.rs (resolved)
Collaborator

@mattsse mattsse left a comment


I'd prefer if we lift the drain call from the event handlers to the poll loop.

Comment on lines +69 to +71

      pub(super) fetch_events_head: UnboundedMeteredReceiver<FetchEvent>,
      /// Handle for queueing [`FetchEvent`]s as a result of advancing inflight requests.
    - pub fetch_events_tail: UnboundedMeteredSender<FetchEvent>,
    + pub(super) fetch_events_tail: UnboundedMeteredSender<FetchEvent>,
Collaborator


There's no need for a sync primitive here if we're using this as a VecDeque, right?

Member Author


A VecDeque is slightly heavier since it keeps track of its length and links the queue in two directions. I don't think channels need to be used exclusively for async communication.

Collaborator


Hmm, VecDeque is a ring buffer; popping a value is just reading and shifting the head.
A channel has a bunch of sync overhead that we don't need, because nothing is shared here.

@emhane
Copy link
Member Author

emhane commented Mar 2, 2024

I'd prefer if we lift the drain call from the event handlers to the poll loop.

Then it doesn't address the bottleneck. When the node is running on more than one OS thread, sessions can make progress while the tx manager future is executing. The bottleneck can also be addressed by not being stuck so long in the tx manager future, since then fewer request attempts will be made (and fewer hashes will be buffered in the hashes-pending-fetch cache because peers are busy); see #6336.

@emhane emhane requested a review from mattsse March 2, 2024 14:30
@mattsse
Collaborator

mattsse commented Mar 2, 2024

From what I understood, the bottleneck can be attributed to the order in which poll currently processes things:

because there is no idle peer on a network level, rather just not in the context of TransactionsManager, i.e. the FetchEvent stream hasn't been polled yet but many events may be ready.

and this PR addresses this by draining (buffering) responses that are ready before fetching more txs?
It does so by draining the responses into a buffer in the event handler functions.

But we'd get the same if we called this at the top level; right now this feels a bit obfuscated.

I'd prefer if we lift the drain call from the event handlers to the poll loop.

Though it's still a bit unclear to me why this is the right fix, rather than changing the order to:

  • process responses
  • try fetch
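The ordering proposed above can be sketched roughly as follows; this is a hypothetical, simplified shape (function and variable names are stand-ins, not reth code): within one poll iteration, finished responses are processed first so the peers they free up are available to the fetch step that follows.

```rust
/// Hypothetical sketch: process responses, then try fetch, in one round.
fn poll_round(
    finished_responses: &mut Vec<u64>,   // peer ids whose request completed
    idle_peers: &mut Vec<u64>,
    hashes_pending_fetch: &mut Vec<u64>,
    requests_sent: &mut Vec<(u64, u64)>, // (peer, hash)
) {
    // 1. process responses: marks those peers idle again
    idle_peers.append(finished_responses);

    // 2. try fetch: pair pending hashes with the now-idle peers
    while let Some(&peer) = idle_peers.last() {
        match hashes_pending_fetch.pop() {
            Some(hash) => {
                idle_peers.pop();
                requests_sent.push((peer, hash));
            }
            None => break,
        }
    }
}

fn main() {
    let mut responses = vec![7]; // peer 7's response just completed
    let mut idle = vec![];
    let mut pending = vec![42]; // one hash waiting to be fetched
    let mut sent = vec![];
    // Because responses are processed first, peer 7 is free to serve the
    // pending hash within the same round.
    poll_round(&mut responses, &mut idle, &mut pending, &mut sent);
    println!("{sent:?}");
}
```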

@emhane
Member Author

emhane commented Mar 2, 2024

From what I understood, the bottleneck can be attributed to the order in which poll currently processes things:

because there is no idle peer on a network level, rather just not in the context of TransactionsManager, i.e. the FetchEvent stream hasn't been polled yet but many events may be ready.

and this PR addresses this by draining (buffering) responses that are ready before fetching more txs? It does so by draining the responses into a buffer in the event handler functions.

But we'd get the same if we called this at the top level; right now this feels a bit obfuscated.

I'd prefer if we lift the drain call from the event handlers to the poll loop.

Though it's still a bit unclear to me why this is the right fix, rather than changing the order to:

  • process responses
  • try fetch

Agreed; this is done in #6590. I'd be fine with closing this PR and addressing the issue at a higher level in the linked PR.

@gakonst
Member

gakonst commented Mar 6, 2024

#6590 is approved, should we close this? @mattsse

@emhane emhane closed this Mar 6, 2024
@emhane emhane deleted the emhane/one-req-per-peer-bottleneck branch March 6, 2024 03:04