Send more packets together with nice_agent_send_messages_nonblocking #2295

Closed
wants to merge 10 commits into from

Conversation

lminiero
Member

This is an attempt to increase the performance of Janus, especially when there's a lot of traffic to push.

First of all, while this patch works with the master version of libnice, it's actually meant to be used with this merge request I submitted a few days ago for some real benefit. In fact, the main motivation behind this work was to try and take advantage of sendmmsg to minimize the impact of system calls on the overall CPU usage: at the moment, libnice provides no support for sendmmsg under the hood, which is what I tried to implement in that contribution. Both efforts (the libnice patch and this PR) have to be considered experimental, as we haven't done any large-scale tests yet, and so there may be bugs.

In a nutshell, this patch changes the way we send packets on a PeerConnection. Janus by default sends one packet at a time using nice_agent_send: plugins queue packets to send and we do the same in the core for RTCP, then an event loop in the core reads one packet at a time, prepares it and sends it right away. This means that for each outgoing packet there is a dedicated system call invoked for delivery: from some code profiling we did recently and in the past, these system calls actually take up a lot of the runtime, and so can be considered some sort of bottleneck when you're trying to push the boundaries. This is where system calls like sendmmsg can help, as with a single system call you can send multiple messages instead, which is supposed to provide considerable benefits in terms of performance (something that seems confirmed by this tweet by @murillo128).

libnice does provide a method called nice_agent_send_messages_nonblocking to send multiple messages in a single libnice API call, but at the moment this simply means that, internally, the library iterates on all messages and sends them separately, which means you still get one system call per packet anyway. This is why I submitted the previously mentioned merge request, which basically modifies the behaviour of libnice to try and use sendmmsg instead (check the description there for more details).

Taking that into account, this patch refactors the Janus core to use nice_agent_send_messages_nonblocking instead of nice_agent_send for sending packets. More specifically, we keep a buffer of at most 10 packets (where 10 is just an arbitrary number that came from the tweet above) that we can send together, and any time we have a packet to send, we call a new function called janus_ice_send_or_store which does the following (a rough sketch follows the list below):

  1. it adds the packet to send to the per-PeerConnection buffer/array (unless the buffer is full, in which case we drop the packet; more on this later);
  2. if there's still room in the buffer/array, and we see there's another message in the outgoing queue, then it means we can send more messages together, and so we return (we'll send them later);
  3. if there are no messages in the outgoing queue, instead, or if we filled the buffer/array already, we send all the messages we stored so far (which could be 1, 10, or anything in between);
  4. sending messages means using, as we said, nice_agent_send_messages_nonblocking, which will tell us how many messages were successfully sent: it may be all or just some of them, and we adjust the buffer/array accordingly.
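
For reference, here's a rough sketch of what this buffer-and-flush logic boils down to; the names (sketch_component, pending_vectors, MAX_PENDING, etc.) are made up for illustration and don't match the actual code in this PR, which also takes care of stats and freeing the payloads:

#include <string.h>
#include <agent.h>	/* libnice */

#define MAX_PENDING 10	/* Arbitrary, as explained above */

typedef struct sketch_component {
	NiceAgent *agent;
	guint stream_id, component_id;
	GOutputVector pending_vectors[MAX_PENDING];		/* One buffer per stored packet */
	NiceOutputMessage pending_messages[MAX_PENDING];	/* pending_messages[i] wraps pending_vectors[i] */
	guint pending_num;
} sketch_component;

/* Returns TRUE if the packet was stored (and possibly sent), FALSE if it was dropped */
static gboolean sketch_send_or_store(sketch_component *c, gchar *buf, gsize len, gboolean more_in_queue) {
	if(c->pending_num == MAX_PENDING)
		return FALSE;	/* Buffer full (a previous flush failed): drop the packet, as per point 1 */
	/* 1. Add the packet to the per-PeerConnection buffer */
	c->pending_vectors[c->pending_num].buffer = buf;
	c->pending_vectors[c->pending_num].size = len;
	c->pending_messages[c->pending_num].buffers = &c->pending_vectors[c->pending_num];
	c->pending_messages[c->pending_num].n_buffers = 1;
	c->pending_num++;
	/* 2. More packets are queued and there's still room: wait and send them together */
	if(more_in_queue && c->pending_num < MAX_PENDING)
		return TRUE;
	/* 3.+4. Flush whatever we stored so far with a single libnice call */
	gint sent = nice_agent_send_messages_nonblocking(c->agent, c->stream_id, c->component_id,
		c->pending_messages, c->pending_num, NULL, NULL);
	if(sent > 0) {
		/* Some (or all) messages were sent: shift the unsent vectors to the head of the array;
		 * pending_messages[i] always points to pending_vectors[i], so moving the vectors is
		 * enough (freeing the sent payloads is elided here) */
		memmove(c->pending_vectors, c->pending_vectors + sent,
			(c->pending_num - sent) * sizeof(GOutputVector));
		c->pending_num -= sent;
	}
	return TRUE;
}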

A simple functional test seemed to confirm it works: even with a single EchoTest PeerConnection, I often sent a single message, but quite often also got to send 2 or 3 messages together instead. Wireshark confirmed there were no holes or duplicates, so at least in a sunny-day scenario it seems to be working as expected.

There need to be more tests to be sure it works as expected also when things are not that sunny. For instance, at the moment if nice_agent_send_messages_nonblocking returns 0 we assume we got an EAGAIN (libnice mentions something like this in the internals), which means we try again until we get either a success or an error: this may or may not be a good idea, and I'm not sure if it can cause the code to deadlock (can there be cases where it won't ever return anything but 0?). At the same time, I still haven't verified that the code works as expected when only some of the messages have been delivered, which will be important to check. Besides, we need to figure out the best policy to use in case nice_agent_send_messages_nonblocking returns a negative number instead: drop all the messages we stored? (which would break the stats, because we updated them already when queueing succeeded) Try again? (which can really cause deadlocks if we retry forever, because failures may be permanent) Leave the messages there and hope a new outgoing message later will fix it? (which means we should update the janus_ice_send_or_store code to accommodate that scenario).
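
To make the "retry on 0" part above concrete, this is roughly what that policy looks like; again just a sketch with a made-up helper name, using the same libnice types as the sketch above, not the actual patch:

/* Keep calling nice_agent_send_messages_nonblocking until it returns something other than 0 */
static gint sketch_send_retrying(NiceAgent *agent, guint stream_id, guint component_id,
		NiceOutputMessage *messages, guint n_messages) {
	gint sent = 0;
	GError *error = NULL;
	do {
		/* 0 is interpreted as an EAGAIN-like condition, so we simply try again:
		 * this is the part that could, in theory, spin forever */
		sent = nice_agent_send_messages_nonblocking(agent, stream_id, component_id,
			messages, n_messages, NULL, &error);
	} while(sent == 0);
	if(sent < 0) {
		/* Open question: drop everything, retry later, or keep the messages around? */
		g_clear_error(&error);
	}
	return sent;
}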

Looking forward to your feedback, as I'll assume you're all interested in improved performance. If you can test and provide accurate feedback (that goes beyond "it works/it doesn't work"), that would be very welcome. As I mentioned, we plan to make some tests of our own, but we have our hands full and I don't know when that'll happen.

@cb22
Contributor

cb22 commented Jul 31, 2020

@lminiero thank you for your work on this!

I've tested out this patch (along with the corresponding libnice patch) and everything works exactly the same as before, but it seems like it doesn't want to send more than 1 packet except on rare occasions. In my test, out of the 113034 logged "Sending x packets we stored so far" messages, only 2140 were > 1.

This was just a simple test, using the VideoRoom plugin and 6 publishers / subscribers (and event_loops = 8, if that's relevant). Are there perhaps any conditions that would make it more likely to want to send multiple packets together? I'd be happy to try and recreate them. This is also on a fast system and testing via localhost, so it could be that it's not really under enough load to actually queue up enough packets to send multiple in one go.

Also, given what you said about the buffer being on a per PeerConnection basis, it might have a much larger impact once #2211 lands? (which I've been meaning to get around to testing and providing feedback!)

@lminiero
Member Author

@cb22 thanks for giving it a spin! Yes, I'd say it's expected that you don't get many of those multiple sends: as I wrote in my description above, I only got a few of those myself when doing an EchoTest. I'd expect them to become much more frequent the moment load increases, because thread scheduling then makes it easier to find more than one packet in any outgoing queue; high bitrates may also increase the chances, I guess, simply because of the higher number of packets they'd push around. I personally would see this "adaptive" behaviour as a feature, if it turns out to work nicely, since in most cases sending one packet at a time is just fine. Besides, as anticipated in the description, trying to always send at least X packets per call would cause issues, since we'd have to wait for X packets to be in the buffer first, which may never happen (e.g., the source already sent everything it wanted while the buffer only has Y < X packets) or may happen slowly (e.g., low framerate video), which would in turn also increase delay.

The fact you're using event_loops may make the occurrence of multiple packets even rarer, I believe, since no matter how many PeerConnections you have, the number of threads remains limited, and so, probably, does the resulting context switching. You can try disabling event_loops to see if you get more multiple sends that way.

I think you're right that it will probably happen more frequently with the multistream branch, since the same PC might be used to send media coming from more than just one source, and so the outgoing queue would have many more packets per second than now on average. That said, I'd like to first assess its performance here, on master, and only subsequently merge it in multistream.

@lminiero
Member Author

lminiero commented Aug 5, 2020

FYI, Olivier merged my libnice merge request, so you can now test this PR using libnice master as well.

@sloweclair

sloweclair commented Sep 10, 2020

This seems to be working great. Before, we were seeing lots of flapping streams when the Janus outbound BW was about 130 Mbps and inbound BW was about 30 Mbps. Now I am not really sure when that will happen.
Is there anything I should be trying in particular?

@amnonbb
Contributor

amnonbb commented Sep 13, 2020

Can I merge it into multistream in my fork and test it?

@lminiero
Member Author

Pretty sure it won't merge cleanly: you'll get many conflicts that you'll have to solve yourself. But if you want to try, sure, it would be nice to know if/how it works in that context 🙂

@amnonbb
Contributor

amnonbb commented Sep 13, 2020

It was a nice try 😬
but the new function janus_ice_send_or_store uses janus_ice_component, which doesn't actually exist in multistream.
So I need a better understanding of what's really going on there 😬

@yin-zhang

yin-zhang commented Oct 13, 2020

I played with this version by creating some stress on the server (videoroom plugin, 10 publishers @ 3.5Mbps each). Found a problem:

nice_agent_send_messages_nonblocking() often returns -1, with error->code == G_IO_ERROR_WOULD_BLOCK.

In this case, should we simply treat it as sent = 0 and try again?

@atoppi
Member

atoppi commented Oct 13, 2020

@yin-zhang have you tried increasing the udp send buffer size (net.ipv4.udp_wmem_min) ?

@yin-zhang

Let me give it a try

@lminiero
Member Author

In this case, should we simply treat it as sent = 0 and try again?

If you read the PR description, it's one of the open questions we need to address. At the moment it means that we'll try again as soon as a new packet is queued, but it also means the queue fills up: in case of too many failures, the queue might become full, meaning that we can't add new messages to it.

@yin-zhang

After increasing the UDP send buffer to 64KB, the queue still fills up under heavy load (I'm using event_loops = 8, the load is ~800Mbps).

Re: the open questions. I ended up implementing the following strategy that seems to give decent performance.

if(sent < 0) {
	if(component->pending_messages_num < JANUS_MAX_PENDING_MESSAGES) {
		/* There is still room in the queue:
		 * hopefully a future packet will resolve it */
		return TRUE;
	} else {
		/* Discard half of the messages */
		sent = (JANUS_MAX_PENDING_MESSAGES >> 1);
	}
}
/* Update the indexes and clean resources if we sent some or all of the packets */
if(sent > 0) {
	....

@atoppi
Member

atoppi commented Oct 13, 2020

@yin-zhang I'd push the buffer size to megabytes

@atoppi
Member

atoppi commented Oct 13, 2020

Also bear in mind that net.ipv4.udp_wmem_min is not in bytes, but in pages (on my system 1 page = 4KB).

@yin-zhang

@atoppi I think net.ipv4.udp_wmem_min is in bytes (default = page size = 4096). net.ipv4.udp_mem is in pages.

After pushing net.ipv4.udp_wmem_min = 2097152, G_IO_ERROR_WOULD_BLOCK still occurs fairly frequently (the load is ~800Mbps).

@atoppi
Member

atoppi commented Oct 13, 2020

@yin-zhang you're right, sorry for the confusion.
Thanks for testing btw.

@yin-zhang

It's a little surprising that EWOULDBLOCK occurs so frequently even with a 2MB UDP socket buffer. I would expect UDP to be rather fast. 800Mbps of load is high, but the link + NIC aren't really saturated yet.

@atoppi
Member

atoppi commented Oct 14, 2020

Indeed, I would not expect this behavior, especially considering that we achieved higher throughput without this sendmmsg patch.
Have you checked the CPU load? Maybe it is not keeping up with the target rate.

@yin-zhang

After doing some gprof profiling, it seems that the performance bottleneck is elsewhere: nice_agent_send, along with srtp_protect/srtp_unprotect, only accounts for less than 10% of the total time. There are two main bottlenecks.

(1) Handles for subscribers of the same publisher are distributed across different event loops. As a result, whenever we receive a new packet from a publisher, it has to be forwarded to N different threads. The overhead of context switching becomes significant when we have to do this for every publisher's packet. In fact, one can see that when event_loops increases from 4 or 8 to 32 or 64, the CPU usage also increases significantly. To solve the problem, I did a quick and dirty hack: assigning all subscribers of the same publisher to this publisher's event loop. With this change alone, I'm able to significantly improve the scalability -- now I can easily handle 1Gbps with just 1.5-2 logical cores (on a 40-core/80-thread machine)

(2) janus_cleanup_nack_buffer is inefficient in that it has to delete elements from the hash table one by one. We are better off maintaining two hash tables, each covering at most nack_queue_ms: when we would have to create a third hash table, we destroy the oldest one completely. The cost is that whenever we need to look up a seq, we need to do lookups in two hash tables, but this is OK, because NACK is not a frequent operation (a rough sketch of this rotation scheme is shown below). I'm going to quickly implement this to see if the performance can be further improved. I'm hoping to get Janus to sustain 1Gbps with just a single core :-)
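
For context, this is more or less the rotation scheme I have in mind; just a sketch with made-up names (and no packet ownership handling), not actual Janus code:

#include <glib.h>

/* seq -> stored packet; two tables, each covering at most nack_queue_ms */
typedef struct sketch_nack_buffer {
	GHashTable *current, *previous;
	gint64 rotated_at;	/* When 'current' started being filled (monotonic usecs) */
	guint queue_ms;
} sketch_nack_buffer;

static void sketch_nack_store(sketch_nack_buffer *nb, guint16 seq, gpointer pkt) {
	gint64 now = g_get_monotonic_time();
	if(now - nb->rotated_at > (gint64)nb->queue_ms * 1000) {
		/* Rotate: throw away the whole oldest table at once,
		 * instead of removing its elements one by one */
		g_hash_table_remove_all(nb->previous);
		GHashTable *tmp = nb->previous;
		nb->previous = nb->current;
		nb->current = tmp;
		nb->rotated_at = now;
	}
	g_hash_table_insert(nb->current, GUINT_TO_POINTER(seq), pkt);
}

static gpointer sketch_nack_lookup(sketch_nack_buffer *nb, guint16 seq) {
	/* Two lookups in the worst case, but NACKs are infrequent */
	gpointer pkt = g_hash_table_lookup(nb->current, GUINT_TO_POINTER(seq));
	return pkt ? pkt : g_hash_table_lookup(nb->previous, GUINT_TO_POINTER(seq));
}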

@lminiero
Member Author

Interesting observations from the gprof results, thanks for sharing! Adding a quick note, though.

I did a quick and dirty hack: assigning all subscribers of the same publisher to this publisher's event loop

I don't see how this is doable: event loops are a core thing, while publishers/subscribers are concepts of a plugin, and core/plugins know nothing of each other. While I agree it may not be optimal, whatever fix you may be doing here is something that would violate this separation. I guess we might envision something like an opaque "purpose" property that one can pass when attaching a handle, so that the core can decide which loop to add it to, but I see several issues and constraints in how such a thing might work.

@lminiero
Member Author

We are better off maintaining two hash tables, each covering at most nack_queue_ms

This is problematic, because that value is dynamic and can (will) change during a session.

@atoppi
Member

atoppi commented Oct 15, 2020

@yin-zhang awesome work! Thank you!

  • So how did you solve the EWOULDBLOCK issue?
  • Can you share the graph of your perf analysis?

@atoppi
Member

atoppi commented Oct 15, 2020

@yin-zhang another question

now I can easily handle 1Gbps with just 1.5-2 logical cores (on a 40-core/80-thread machine)

what setting of event_loops let you achieve this throughput ?

@yin-zhang

yin-zhang commented Oct 15, 2020

@lminiero I did the following. In ice.h, ice.c, I implemented a simple function:

void janus_ice_update_event_loop(janus_ice_handle *handle, janus_ice_handle *master) {
	/* We expect event_loops > 0 */
	if (static_event_loops == 0) {
		return;
	}
	if (handle == NULL || master == NULL || master->mainctx == NULL || master->mainloop == NULL) {
		return;
	}
	if (handle->mainctx == master->mainctx && handle->mainloop == master->mainloop) {
		return;
	}

	/* copy mainctx and mainloop */
	handle->mainctx = master->mainctx;
	handle->mainloop = master->mainloop;

	/* create the new rtp_source */
	if(handle->rtp_source) {
		g_source_destroy(handle->rtp_source);
		g_source_unref(handle->rtp_source);
		handle->rtp_source = NULL;
	}
	handle->rtp_source = janus_ice_outgoing_traffic_create(handle, (GDestroyNotify)g_free);
	g_source_set_priority(handle->rtp_source, G_PRIORITY_DEFAULT);
	g_source_attach(handle->rtp_source, handle->mainctx);
}

Then in janus_videoroom.c, whenever we assign session->participant = subscriber, we call:

janus_ice_update_event_loop(
	(janus_ice_handle *)subscriber->session->handle->gateway_handle,
	(janus_ice_handle *)publisher->session->handle->gateway_handle);

@lminiero
Member Author

That is a VERY dirty hack, I'm surprised it just doesn't crash for you 😉

@yin-zhang

@yin-zhang another question

now I can easily handle 1Gbps with just 1.5-2 logical cores (on a 40-core/80-thread machine)

what setting of event_loops let you achieve this throughput ?

I used event_loops = 4

@yin-zhang

That is a VERY dirty hack, I'm surprised it just doesn't crash for you 😉

Yep, it's a hack 😉. I had to scratch my head to get around the layering constraints imposed by Janus... However, I think it should be safe -- the subscriber's handle shouldn't have any outgoing RTP packets before the subscriber is set up, etc. I also stress tested the server with 40 participants coming and going -- so far no crashes observed.

@rahmanash-dbz

Sure, I will watch out for the commit in this space 😃. Also, I guess this opaque stuff could also be extended to the Streaming plugin, where all viewers (thread->viewers) of a mountpoint would be assigned to the same event loop, so that there would be no need to switch contexts when relaying a packet, which would remarkably improve the Streaming plugin's performance.

@lminiero
Member Author

@yin-zhang @rahmanash-dbz please see the PR above, which is where I added the ability to manually specify the loop index to use when adding new handles. It's based on this PR, so it should include the whole sendmmsg changes as well.

@rahmanash-dbz

@lminiero thank you for your work on this. I will check this out and share my feedback.

And I have a suggestion that could make this logic easier and simpler to use: return the loop_index in the attach response.

One use case is to randomly choose an event_loop for a publisher (i.e., without specifying any index while attaching) and get back the index of the loop that was picked. Then this same index can be used for all the subscribers, which achieves the desired effect. This would allow the code to integrate easily into any manager service.

@lminiero
Member Author

And I have a suggestion that could make this logic easier and simpler to use: return the loop_index in the attach response.

I don't see the point. The automatic logic isn't "smart", it's just a blind round robin. As such, I think that the manual override I added is already enough.

@rahmanash-dbz

Hi Lorenzo, I didn't mean the automatic logic was smart. In a situation where a single Janus instance is shared by many manager services, keeping the event_loop selection in each of those manager instances can cause many publisher handles (and thus all their subscriber handles too) to map to a single event_loop, which may overload it with handles. At least the automatic selection can pick a different event_loop, regardless of how many services share the instance.

But this is a concern for the manager service and, like you said, the logic you provided is enough. I am already integrating this code and will deploy it soon. Will share the feedback here after monitoring for some time. Thanks

@lminiero
Member Author

lminiero commented Dec 4, 2020

@rahmanash-dbz any update on #2450? Can it be merged?

@rahmanash-dbz

@lminiero Sorry, it took time to integrate this code into the present architecture, amid all the other work. It will be deployed to production tomorrow, so under high load I can only report on the performance some 3 days from now, after gathering some stats.

But the commit was working without any issues in testing.

@lminiero
Member Author

I'll merge #2450 into this one, then. Still waiting for feedback on how this sendmmsg PR works, and how to improve it.

@rahmanash-dbz

rahmanash-dbz commented Dec 14, 2020

Hi Lorenzo, I tested the event_loop_mapping commit in high-load conditions. There was a sharp decrease in CPU usage, by about 10-15%.

One thing I noticed is the number of "discarding packets" logs (due to queueing) under high load. Setting MAX_PENDING_MESSAGES to a higher value mostly solves this issue.

My suggestion is that, if by some mechanism the value of MAX_PENDING_MESSAGES could be changed dynamically according to the "discarding packets" count, then this would be perfect (a purely hypothetical sketch of such a mechanism follows).
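
Just to illustrate what I mean (none of these names exist in Janus, and the pending buffer in this PR seems to be a fixed-size array, so this would also require switching to a heap-allocated one):

#include <glib.h>
#include <agent.h>	/* libnice, for NiceOutputMessage */

#define SKETCH_HARD_MAX 64

typedef struct sketch_pending_buf {
	NiceOutputMessage *messages;	/* Heap-allocated instead of a fixed-size array */
	guint capacity, num;
	guint discarded_since_check;	/* Incremented whenever a packet is dropped */
} sketch_pending_buf;

static void sketch_maybe_grow(sketch_pending_buf *b) {
	if(b->discarded_since_check > 0 && b->capacity < SKETCH_HARD_MAX) {
		/* We dropped packets recently: double the capacity, up to a hard cap
		 * (the matching array of GOutputVector buffers would need to grow too) */
		b->capacity = MIN(b->capacity * 2, SKETCH_HARD_MAX);
		b->messages = g_realloc(b->messages, b->capacity * sizeof(NiceOutputMessage));
	}
	b->discarded_since_check = 0;
}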

@@ -1669,6 +1709,9 @@ static void janus_ice_component_free(const janus_refcount *component_ref) {
janus_seq_list_free(&component->last_seqs_video[1]);
if(component->last_seqs_video[2])
janus_seq_list_free(&component->last_seqs_video[2]);
int i = 0;
for(i=0; i<JANUS_MAX_PENDING_MESSAGES; i++)
Contributor


nit: no need to initialize i twice.

Member Author


Makes sense 😄 The paranoid in me tends to initialize everything, these days, as I've had my share of issues caused by a lack of that.

@lminiero
Member Author

@rahmanash-dbz thanks for the feedback! I must have missed the notification about your message during the holidays.

Hi Lorenzo, I tested the event_loop_mapping commit in high-load conditions. There was a sharp decrease in CPU usage, by about 10-15%.

That's definitely good news!

One thing I noticed is the number of "discarding packets" logs (due to queueing) under high load. Setting MAX_PENDING_MESSAGES to a higher value mostly solves this issue.
My suggestion is that, if by some mechanism the value of MAX_PENDING_MESSAGES could be changed dynamically according to the "discarding packets" count, then this would be perfect

In this PR it will be static; any enhancement can come later. 10 may indeed be a low value, but it was just one that I thought made sense: if in your tests you found a number that seems to be more efficient (especially in terms of console log spam), please do let us know, so that we can change the default to something else.

@lminiero
Member Author

@rahmanash-dbz any feedback on the above? Thanks.

@@ -147,6 +147,17 @@ gboolean janus_is_opaqueid_in_api_enabled(void) {
return opaqueid_in_api;
}

/* Since #2295, we support sendmmsg to send multiple packets at the same time
* and optimize media delivery; this can disabled at startup, if needed */
Contributor


this can disabled at startup

be seems to be missing before disabled.

@lminiero
Member Author

From some tests we made internally, this didn't really seem to give the edge we hoped for: we didn't receive much feedback from others, so we don't know if it's because of how we performed the tests. Actually, it turned out there's another part of the code that should be rewritten because it's inefficient, so until that happens we won't be able to proceed. As a side note, this patch will probably make more sense once multistream lands, which is one more reason to hold on to this a while longer. I'll mark this as a draft for the time being.

@lminiero lminiero marked this pull request as draft April 16, 2021 10:27
@cppdev-1

@yin-zhang Could you provide a full patch for this?

@lminiero
Member Author

lminiero commented Feb 4, 2022

Closing as we won't merge this anytime soon. We may revisit the effort once multistream is merged.

@lminiero lminiero closed this Feb 4, 2022
@lminiero lminiero deleted the sendmmsg branch February 4, 2022 14:51