Identify limit on number of connected peer nodes and make it configurable #3753

Open
igorsyl opened this issue Jun 12, 2023 · 3 comments

Comments

@igorsyl
Contributor

igorsyl commented Jun 12, 2023

Is your feature request related to a problem? Please describe.
During the Blockchain Architecture meeting on June 9th, @wileyj, @obycode, and @igorsyl discussed the hard-coded limit on the number of peer nodes from which the stacks-node accepts inbound connections. This limit could be contributing to poor peer network connectivity.

Describe the solution you'd like
We discussed that the Stacks Foundation and Hiro could raise this limit to effectively run super-connected nodes. This task aims to identify the hard-coded value and make it configurable.

Describe alternatives you've considered
@wileyj is working on many initiatives to improve peer network connectivity.

Additional context
For all the context, please watch the recording of the Blockchain Architecture meeting on June 9, 2023.

@igorsyl igorsyl added the feature Brand new functionality. New pages, workflows, endpoints, etc. label Jun 12, 2023
@obycode
Contributor

obycode commented Jun 13, 2023

+1 to this. Allowing some nodes to be hubs, with a large number of connections, should really help strengthen the P2P network.

@obycode
Contributor

obycode commented Jun 13, 2023

Note from @kantai at the blockchain meeting - be careful not to send too large a neighbor set to peers.

@pavitthrap pavitthrap added the icebox Issues that are not being worked on label Jun 13, 2023
@jcnelson
Member

> This limit could be contributing to poor peer network connectivity.

Peer connectivity is not likely the culprit; I'd need to see some compelling evidence to believe this. It's probably somewhere in either the logic that selects peers to receive forwarded data, and/or in the logic that prunes the set of connected peers every so often to remove dead or stale peer connections. Some background:

First, the number of neighbors has a soft and hard limit, configurable within ConnectionOptions. The soft limit is 20, and the hard limit is 32. That's huge for a K-regular peer network. I'd be very surprised if the network has ever been partitioned through happenstance.

Second, the number of downstream peers is configured in ConnectionOptions as well -- its default is 128 as a soft limit and 256 as a hard limit. I don't think there's a config file option for this, but it's trivial to add. However, keep in mind that the p2p state machine only allocates at most 800 file descriptors for itself (most Linux boxes default to 1024 per process).
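
A minimal sketch of how these four limits might be expressed, assuming hypothetical field names (soft_num_neighbors, num_neighbors, soft_num_clients, num_clients, max_sockets) that are illustrative rather than confirmed against the actual ConnectionOptions struct:

```rust
// Hypothetical sketch only: field names are assumptions made to illustrate the
// limits quoted above; the real ConnectionOptions in stacks-node may differ.
struct ConnectionOptions {
    soft_num_neighbors: u64, // soft limit on outbound neighbors (20 by default, per above)
    num_neighbors: u64,      // hard limit on outbound neighbors (32)
    soft_num_clients: u64,   // soft limit on inbound/"downstream" peers (128)
    num_clients: u64,        // hard limit on inbound peers (256)
    max_sockets: usize,      // the p2p state machine's file-descriptor budget (~800)
}

impl Default for ConnectionOptions {
    fn default() -> Self {
        ConnectionOptions {
            soft_num_neighbors: 20,
            num_neighbors: 32,
            soft_num_clients: 128,
            num_clients: 256,
            max_sockets: 800,
        }
    }
}

// A "super-connected" hub could raise only the inbound limits while staying
// comfortably under the file-descriptor budget.
fn hub_options() -> ConnectionOptions {
    ConnectionOptions {
        soft_num_clients: 512,
        num_clients: 700,
        ..ConnectionOptions::default()
    }
}
```

Exposing those fields through the node's config file would then be the remaining work this issue asks for.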

Third, the number of neighbors to which the node forwards blocks and transactions is different from these values. The logic behind picking which neighbors will receive a message considers (a) whether or not the node is inbound or outbound, and (b) how often in the past 10 minutes the neighbor had sent a message that we had already received.

Regarding inbound/outbound selection, the node will send a block, microblock, or transaction to at most 8 outbound peers, and at most 16 inbound peers (both are currently hard-coded constants). Outbound peers are nodes that this node initiated the connection to; inbound peers are nodes that connected to this node. As you can see, this arrangement favors sending new network data to "downstream" peers, such as those behind NATs.

Regarding duplicate message frequency, the local peer uses the frequency of duplicate messages to calculate a probability distribution over the inbound and outbound nodes that will be sampled to determine which nodes will receive a message. The probability of being selected is inversely proportional to the number of duplicate messages in the last 10 minutes. In other words, the local peer prioritizes sending messages to nodes that often send it novel information. The intuition here is that nodes that send what this local peer perceives to be duplicate messages are located in more peripheral locations in the peer graph, so forwarding them the message (which isn't free -- it costs bandwidth) would have less of an impact in spreading the message around the peer network. On the whole, a public node would be more likely to receive novel data from miners and other public nodes that everyone else connects to, but would be less likely to receive novel data from NAT'ed nodes in users' homes (which often have no downstream nodes). So, a public node would prioritize sending new data it receives to miners (yay!) and to nodes that can forward the data to lots of other nodes.
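
As a hedged illustration of that selection rule, the sketch below weights candidate receivers inversely to their recent duplicate count and takes at most the fanout described above (8 outbound / 16 inbound). The names are made up, and taking the top-k by weight is a simplification of the actual sampling in the relay logic:

```rust
use std::collections::HashMap;

// Illustrative constants; the comment above describes 8 outbound and 16 inbound
// receivers as hard-coded today.
const MAX_OUTBOUND_RECEIVERS: usize = 8;
const MAX_INBOUND_RECEIVERS: usize = 16;

/// Weight each candidate peer inversely to the number of duplicate messages it
/// sent in the last 10 minutes, then take up to `max` peers by weight.
/// (The real implementation samples from the resulting distribution; taking the
/// top-k by weight is a simplification for illustration.)
fn pick_receivers(dup_counts: &HashMap<String, u64>, max: usize) -> Vec<String> {
    let mut weighted: Vec<(String, f64)> = dup_counts
        .iter()
        .map(|(peer, dups)| (peer.clone(), 1.0 / (1.0 + *dups as f64)))
        .collect();
    // Higher weight == fewer duplicates == more likely to carry novel data onward.
    weighted.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    weighted.into_iter().take(max).map(|(peer, _)| peer).collect()
}

fn main() {
    let mut outbound_dups = HashMap::new();
    outbound_dups.insert("miner-node".to_string(), 0u64);
    outbound_dups.insert("public-node-a".to_string(), 3);

    let mut inbound_dups = HashMap::new();
    inbound_dups.insert("nat-node-1".to_string(), 12u64); // mostly duplicates
    inbound_dups.insert("public-node-b".to_string(), 0);  // usually novel data

    let out = pick_receivers(&outbound_dups, MAX_OUTBOUND_RECEIVERS);
    let inb = pick_receivers(&inbound_dups, MAX_INBOUND_RECEIVERS);
    println!("forward to outbound: {:?}, inbound: {:?}", out, inb);
}
```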

While all of this is happening, the neighbor walk algorithm is constantly trying to build out as close to a complete view of the peer graph as it can by expanding its "frontier" set. To do so, the neighbor walk state machine constantly tries to maintain connections to a random sample of the peer graph. This works in two modules:

  • The neighbors.rs module, which walks the peer graph and tries to connect to as many nodes as it can find
  • The prune.rs module, which disconnects nodes if the soft or hard limits on the number of peers are exceeded (a rough sketch of this walk-then-prune cycle follows the list).
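
A very rough sketch of that walk-then-prune cycle, with made-up names that do not come from neighbors.rs or prune.rs:

```rust
// Illustrative only: one pass of the "walk then prune" cycle described above.
struct PeerSet {
    connected_peers: Vec<String>,
    soft_limit: usize,
    hard_limit: usize,
}

impl PeerSet {
    /// neighbors.rs role: walk the peer graph and connect to newly discovered peers.
    fn walk_step(&mut self, discovered: Vec<String>) {
        for peer in discovered {
            if !self.connected_peers.contains(&peer) {
                self.connected_peers.push(peer);
            }
        }
    }

    /// prune.rs role: if the limits are exceeded, cull peers back down.
    fn prune_step(&mut self) {
        if self.connected_peers.len() > self.hard_limit {
            // Heuristic culling would happen here (see the prune.rs notes below);
            // truncating is a stand-in for that logic.
            self.connected_peers.truncate(self.soft_limit);
        }
    }
}
```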

The prune.rs module could use some more testing -- it's only indirectly tested. Basically, when there are more peers connected than the maximum number of neighbors, this module will apply various heuristics to cull peers (a rough sketch follows the list). In particular:

  • It prioritizes culling peers that share the same IP address
  • It prioritizes culling new inbound peers over old inbound peers
  • It prioritizes culling peers with higher protocol error rates
  • It prioritizes culling peers that are in the same autonomous system (currently unused, but it's tested)
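
A rough sketch of how those culling priorities could be folded into a single ordering; the field and function names are assumptions, and the real prune.rs applies the heuristics separately rather than through one comparator:

```rust
use std::cmp::Ordering;

// Illustrative peer record; the fields mirror the culling heuristics listed above.
// None of these names are taken from prune.rs.
struct PeerInfo {
    shares_ip_with_another_peer: bool,
    is_inbound: bool,
    connected_since: u64, // unix timestamp; larger means newer
    protocol_error_rate: f64,
}

/// Order peers so that the ones we'd cull first sort to the front:
/// duplicate-IP peers, then newer inbound peers, then error-prone peers.
fn cull_order(a: &PeerInfo, b: &PeerInfo) -> Ordering {
    b.shares_ip_with_another_peer
        .cmp(&a.shares_ip_with_another_peer)
        .then(b.is_inbound.cmp(&a.is_inbound))
        .then(b.connected_since.cmp(&a.connected_since)) // newer connections cull first
        .then(
            b.protocol_error_rate
                .partial_cmp(&a.protocol_error_rate)
                .unwrap_or(Ordering::Equal),
        )
}

/// Cull from the front of the sorted list until we are back under the limit.
fn prune_to_limit(mut peers: Vec<PeerInfo>, max_peers: usize) -> Vec<PeerInfo> {
    peers.sort_by(cull_order);
    if peers.len() > max_peers {
        let excess = peers.len() - max_peers;
        peers.drain(0..excess);
    }
    peers
}
```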

As you can see, there are a lot of moving parts to the network stack that can impact the overall QoS. It would behoove the person working on this to empirically measure how well-connected the peer network is, and to characterize the "reach" of broadcasted messages in order to deduce which part of the system has the most substantial impact on how well-broadcasted a given message will be.
