Fix bottleneck TransactionsManager
#6656
Conversation
why was everything made pub?
none of these types, fields, or functions should be pub
This PR solves the problem by advancing the `FuturesUnordered` of inflight requests and processing responses synchronously on each op. Processing responses is what marks a peer as idle.
could you please elaborate on this, I'm not following
in order to link docs, so that CI would pass. I don't think it hurts anyway. validation should probably be exposed at some point anyway, for custom impls on custom networks.
I don't understand why the `on_*` event handlers now all need a context, or rather why we poll active inflight requests when we handle `on_new_pooled_transaction_hashes`
The fetcher now polls an internal channel that is filled by the fetcher itself, like an internal buffer that doesn't need any polling.
imo the fetcher should keep advancing inflight requests when it is polled.
I used
yes, it's an intermediary storage of fetch events. this is needed since we are limited to 130 inflight requests but are potentially trying to queue many, many more in each loop of the tx manager future; here, for example, one event contains up to 20k txns. See reth/crates/net/network/src/transactions/mod.rs, lines 1117 to 1120 at ca98261.
please re-read the description of this PR for more detail, or refer to the code comments in #6651 for a description of the orders of magnitude of flow in the tx manager future loop.
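To illustrate the ratio being described, here is a minimal sketch with made-up names and types (not the actual reth code): a single announcement can carry far more hashes than there are free inflight-request slots, so the surplus spills into the cache of hashes pending fetch.

```rust
use std::collections::VecDeque;

/// One inflight request per peer, 130 peer connections by default.
const MAX_INFLIGHT_REQUESTS: usize = 130;

type TxHash = [u8; 32];

/// Hypothetical sketch: request what the inflight budget allows and buffer the
/// rest as hashes pending fetch, to be retried once peers become idle again.
fn queue_announced_hashes(
    announced: Vec<TxHash>, // a single event can contain up to ~20k txns
    inflight_count: usize,
    hashes_pending_fetch: &mut VecDeque<TxHash>,
) {
    let free_slots = MAX_INFLIGHT_REQUESTS.saturating_sub(inflight_count);
    let mut hashes = announced.into_iter();
    for _hash in hashes.by_ref().take(free_slots) {
        // send a GetPooledTransactions request for this hash (elided)
    }
    // the remainder waits in the cache until request capacity frees up
    hashes_pending_fetch.extend(hashes);
}
```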
it does, since a commit older than this review: 5cc593e
Force-pushed from 963012b to 5cc593e
Force-pushed from 573f333 to 358f3ac
This PR solves the problem by advancing the `FuturesUnordered` of inflight requests and processing responses synchronously on each op. Processing responses is what marks a peer as idle.
It's not 100% clear to me what the problem is that you're describing.
as far as I understood this, it introduces response buffering to free up capacity for outgoing requests.
I think, in theory, this makes sense, though the way it's implemented feels a bit obfuscated.
wouldn't we get the same if we drained in-progress requests first, before handling tx fetching + new incoming messages?
I'd prefer if we lift the drain call from the event handlers to the poll loop.
```diff
 pub(super) fetch_events_head: UnboundedMeteredReceiver<FetchEvent>,
 /// Handle for queueing [`FetchEvent`]s as a result of advancing inflight requests.
-pub fetch_events_tail: UnboundedMeteredSender<FetchEvent>,
+pub(super) fetch_events_tail: UnboundedMeteredSender<FetchEvent>,
```
there's no need for a sync primitive here if we're using this as a VecDeque, right?
a VecDeque is slightly heavier since it keeps track of its length and links the queue in two directions. I think channels don't exclusively need to be used for async communication
hmm, a VecDeque is a ring buffer; popping a value is just reading + shifting the head.
a channel has a bunch of sync overhead that we don't need because nothing is shared here
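A minimal sketch of this suggestion (hypothetical names, assuming the buffer is filled and drained by the same struct on the same task): a plain VecDeque gives the same FIFO behavior without a channel's synchronization.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the buffered fetch events.
struct FetchEvent;

struct FetchEventBuffer {
    /// A plain ring buffer: push_back/pop_front involve no atomics or wakers,
    /// unlike an unbounded channel, since nothing is shared across tasks here.
    events: VecDeque<FetchEvent>,
}

impl FetchEventBuffer {
    fn queue(&mut self, event: FetchEvent) {
        self.events.push_back(event);
    }

    fn next_event(&mut self) -> Option<FetchEvent> {
        self.events.pop_front()
    }
}
```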
then it doesn't address the bottleneck. when the node is running on more than one OS thread, sessions can make progress while the tx manager future is executing. this bottleneck can also be addressed by not being stuck so long in the tx manager future, since then not so many request attempts will be made (and not so many hashes will be buffered in the hashes-pending-fetch cache because peers are busy); see #6336.
from what I understood, the bottleneck can be attributed to the order in which poll currently processes things:
and this PR addresses this by draining (buffering) responses that are ready before fetching more txs? But we'd get the same if we called this at the top level, because right now this feels a bit obfuscated.
Though it's still a bit unclear to me why this is the right fix, and not changing the order to
agreed, this is done in #6590. I'd be fine with closing this PR to address the issue from a higher level in the linked PR.
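For reference, a rough sketch of the reordering being discussed (hypothetical types and method names, not the actual reth or #6590 code): drain completed requests once, at the top of the manager's poll loop, before handling new messages.

```rust
use std::task::{Context, Poll};

use futures::future::BoxFuture;
use futures::stream::{FuturesUnordered, StreamExt};

struct Manager {
    inflight: FuturesUnordered<BoxFuture<'static, Vec<u8>>>,
}

impl Manager {
    fn poll(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        // 1. drain responses that are already complete; processing them is
        //    what marks their peers as idle again
        while let Poll::Ready(Some(response)) = self.inflight.poll_next_unpin(cx) {
            self.on_response(response);
        }
        // 2. only then handle announcements and queue new requests, which can
        //    now go out to the peers just freed up (elided)
        Poll::Pending
    }

    fn on_response(&mut self, _response: Vec<u8>) {}
}
```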
By default, concurrency in the `TransactionFetcher` is not enabled on a 'per peer' level. Hence, the number of inflight requests is limited to the total number of p2p connections the node can have, 130 connections by default. This number is some orders of magnitude smaller than the capacities of other collections that implement `Stream` in the `TransactionsManager`. It follows that many hashes received in announcements will be buffered in the cache for hashes pending fetch. This is not necessarily because there is no idle peer at the network level, but rather none in the context of the `TransactionsManager`, i.e. the `FetchEvent` stream hasn't been polled yet but many events may be ready. The problem was identified while implementing #6590, and this PR addresses #6336 in part.

This PR solves the problem by advancing the `FuturesUnordered` of inflight requests and processing responses synchronously on each op. Processing responses is what marks a peer as idle. The operations are:

These operations attempt to send new requests to idle peers.
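As a minimal sketch of this approach (the types and method names here are hypothetical stand-ins, not reth's actual ones), an op first advances the inflight requests synchronously, so the requests it queues afterwards can go to peers that just became idle:

```rust
use std::task::{Context, Poll};

use futures::future::BoxFuture;
use futures::stream::{FuturesUnordered, StreamExt};

type TxHash = [u8; 32];

/// Hypothetical stand-in for a peer's GetPooledTransactions response.
struct PooledTxResponse;

struct TransactionFetcher {
    inflight_requests: FuturesUnordered<BoxFuture<'static, PooledTxResponse>>,
}

impl TransactionFetcher {
    /// Advance inflight requests synchronously, processing every response
    /// that is already complete. Processing a response marks the peer idle.
    fn drain_completed(&mut self, cx: &mut Context<'_>) {
        while let Poll::Ready(Some(response)) = self.inflight_requests.poll_next_unpin(cx) {
            self.process_response(response);
        }
    }

    /// Sketch of an op: drain first, then request, so new requests land on
    /// peers that just became idle instead of spilling into the pending cache.
    fn on_new_pooled_transaction_hashes(&mut self, hashes: Vec<TxHash>, cx: &mut Context<'_>) {
        self.drain_completed(cx);
        self.request_from_idle_peers(hashes);
    }

    fn process_response(&mut self, _response: PooledTxResponse) {
        // mark the responding peer idle, buffer fetched txns, emit a FetchEvent
    }

    fn request_from_idle_peers(&mut self, _hashes: Vec<TxHash>) {
        // send GetPooledTransactions to idle peers, buffer the rest (elided)
    }
}
```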
The alternative solution considered was increasing the concurrency parameter at the 'per peer' level, to more than 1 request per peer at a time. Although this would have solved the problem at hand as a side effect, it would have modified the rate at which `GetPooledTransactions` requests can be sent to a peer's session. Currently it's "send one request, get one response, send the next request". This is maybe something we want to do in the future, but with stronger intention. The presented solution was also preferred because it marks peers as idle in a more deterministic way (easier to reason about).
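For contrast, a sketch of what the rejected alternative would amount to (hypothetical names, not actual reth code): a per-peer inflight limit above 1, which changes the request/response rhythm per session.

```rust
use std::collections::HashMap;

type PeerId = u64; // stand-in for the real peer identifier

/// Today's behavior is effectively a limit of 1: "send one request, get one
/// response, send the next request". Raising it changes the rate at which
/// GetPooledTransactions requests hit a single peer's session.
const MAX_INFLIGHT_PER_PEER: usize = 2;

fn peer_has_capacity(inflight_per_peer: &HashMap<PeerId, usize>, peer: PeerId) -> bool {
    inflight_per_peer.get(&peer).copied().unwrap_or(0) < MAX_INFLIGHT_PER_PEER
}
```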