Send more packets together with nice_agent_send_messages_nonblocking #2295

Closed
wants to merge 10 commits into from

Conversation

lminiero
Member

This is an attempt to increase the performance of Janus, especially when there's a lot of traffic to push.

First of all, while this patch works with the master version of libnice, it's actually meant to be used with this merge request I submitted a few days ago for some real benefit. In fact, the main motivation behind this work was to try and take advantage of sendmmsg to minimize the impact of system calls on the overall CPU usage: at the moment, libnice provides no support for sendmmsg under the hood, which is what I tried to implement in that contribution. Both efforts (the libnice patch and this PR) have to be considered experimental, as we haven't done any large-scale tests yet, and so there may be bugs.

In a nutshell, this patch changes the way we send packets on a PeerConnection. Janus by default sends one packet at a time using nice_agent_send: plugins queue packets to send and we do the same in the core for RTCP, then an event loop in the core reads one packet at a time, prepares it and sends it right away. This means that for each outgoing packet there is a dedicated system call invoked for delivery: from some code profiling we did recently and in the past, these system calls actually take up a lot of the runtime, and so can be considered some sort of bottleneck when you're trying to push the boundaries. This is where system calls like sendmmsg can help, as with a single system call you can send multiple messages instead, which is supposed to provide considerable benefits in terms of performance (something that seems confirmed by this tweet by @murillo128).

libnice does provide a method called nice_agent_send_messages_nonblocking to send multiple messages in a single libnice API call, but at the moment this simply means that, internally, the library iterates on all messages and sends them separately, which means you still get one system call per packet anyway. This is why I submitted the previously mentioned merge request, which basically modifies the behaviour of libnice to try and use sendmmsg instead (check the description there for more details).

Taking that into account, this patch refactors the Janus core to use nice_agent_send_messages_nonblocking instead of nice_agent_send for sending packets. More specifically, we keep a buffer of at most 10 packets (where 10 is just an arbitrary number that came from the tweet above) that we can send together, and any time we have a packet to send, we call a new function called janus_ice_send_or_store which does the following (a rough sketch follows the list below):

  1. it adds the packet to send to the per-PeerConnection buffer/array (unless the buffer is full, in which case we drop the packet; more on this later);
  2. if there's still room in the buffer/array, and we see there's another message in the outgoing queue, then it means we can send more messages together, and so we return (we'll send them later);
  3. if there are no messages in the outgoing queue, instead, or if we filled the buffer/array already, we send all the messages we stored so far (which could be 1, 10, or anything in between);
  4. sending messages means using, as we said, nice_agent_send_messages_nonblocking, which will tell us how many messages were successfully sent: it may be all or just some of them, and we adjust the buffer/array accordingly.
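
For reference, here's a rough sketch of what this buffer-and-flush logic boils down to; the names (sketch_component, pending_vectors, MAX_PENDING, etc.) are made up for illustration and don't match the actual code in this PR, which also takes care of stats and freeing the payloads:

#include <string.h>
#include <agent.h>	/* libnice */

#define MAX_PENDING 10	/* Arbitrary, as explained above */

typedef struct sketch_component {
	NiceAgent *agent;
	guint stream_id, component_id;
	GOutputVector pending_vectors[MAX_PENDING];		/* One buffer per stored packet */
	NiceOutputMessage pending_messages[MAX_PENDING];	/* pending_messages[i] wraps pending_vectors[i] */
	guint pending_num;
} sketch_component;

/* Returns TRUE if the packet was stored (and possibly sent), FALSE if it was dropped */
static gboolean sketch_send_or_store(sketch_component *c, gchar *buf, gsize len, gboolean more_in_queue) {
	if(c->pending_num == MAX_PENDING)
		return FALSE;	/* Buffer full (a previous flush failed): drop the packet, as per point 1 */
	/* 1. Add the packet to the per-PeerConnection buffer */
	c->pending_vectors[c->pending_num].buffer = buf;
	c->pending_vectors[c->pending_num].size = len;
	c->pending_messages[c->pending_num].buffers = &c->pending_vectors[c->pending_num];
	c->pending_messages[c->pending_num].n_buffers = 1;
	c->pending_num++;
	/* 2. More packets are queued and there's still room: wait and send them together */
	if(more_in_queue && c->pending_num < MAX_PENDING)
		return TRUE;
	/* 3.+4. Flush whatever we stored so far with a single libnice call */
	gint sent = nice_agent_send_messages_nonblocking(c->agent, c->stream_id, c->component_id,
		c->pending_messages, c->pending_num, NULL, NULL);
	if(sent > 0) {
		/* Some (or all) messages were sent: shift the unsent vectors to the head of the array;
		 * pending_messages[i] always points to pending_vectors[i], so moving the vectors is
		 * enough (freeing the sent payloads is elided here) */
		memmove(c->pending_vectors, c->pending_vectors + sent,
			(c->pending_num - sent) * sizeof(GOutputVector));
		c->pending_num -= sent;
	}
	return TRUE;
}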

A simple functional test seemed to confirm it works: even with a single EchoTest PeerConnection, I often sent a single message, but quite often also got to send 2 or 3 messages together instead. Wireshark confirmed there were no holes or duplicates, so at least in a sunny-day scenario it seems to be working as expected.

There need to be more tests to be sure it works as expected also when things are not that sunny. For instance, at the moment if nice_agent_send_messages_nonblocking returns 0 we assume we got an EAGAIN (libnice mentions something like this in the internals), which means we try again until we get either a success or an error: this may or may not be a good idea, and I'm not sure if it can cause the code to deadlock (can there be cases where it won't ever return anything but 0?). At the same time, I still haven't verified that the code works as expected when only some of the messages have been delivered, which will be important to check. Besides, we need to figure out the best policy to use in case nice_agent_send_messages_nonblocking returns a negative number instead: drop all the messages we stored? (which would break the stats, because we updated them already when queueing succeeded) Try again? (which can really cause deadlocks if we retry forever, because failures may be permanent) Leave the messages there and hope a new outgoing message later will fix it? (which means we should update the janus_ice_send_or_store code to accommodate that scenario).
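
To make the "retry on 0" part above concrete, this is roughly what that policy looks like; again just a sketch with a made-up helper name, using the same libnice types as the sketch above, not the actual patch:

/* Keep calling nice_agent_send_messages_nonblocking until it returns something other than 0 */
static gint sketch_send_retrying(NiceAgent *agent, guint stream_id, guint component_id,
		NiceOutputMessage *messages, guint n_messages) {
	gint sent = 0;
	GError *error = NULL;
	do {
		/* 0 is interpreted as an EAGAIN-like condition, so we simply try again:
		 * this is the part that could, in theory, spin forever */
		sent = nice_agent_send_messages_nonblocking(agent, stream_id, component_id,
			messages, n_messages, NULL, &error);
	} while(sent == 0);
	if(sent < 0) {
		/* Open question: drop everything, retry later, or keep the messages around? */
		g_clear_error(&error);
	}
	return sent;
}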

Looking forward to your feedback, as I'll assume you're all interested in improved performance. If you can test and provide accurate feedback (that goes beyond "it works/it doesn't work"), that would be very welcome. As I mentioned, we plan to make some tests of our own, but we have our hands full and I don't know when that'll happen.

@cb22
Contributor

cb22 commented Jul 31, 2020

@lminiero thank you for your work on this!

I've tested out this patch (along with the corresponding libnice patch) and everything works exactly the same as before, but it seems like it doesn't want to send more than 1 packet except on rare occasions. In my test, out of the 113034 logged "Sending x packets we stored so far" messages, only 2140 were > 1.

This was just a simple test, using the VideoRoom plugin and 6 publishers / subscribers (and event_loops = 8, if that's relevant). Are there perhaps any conditions that would make it more likely to want to send multiple packets together? I'd be happy to try and recreate them. This is also on a fast system and testing via localhost, so it could be that it's not really under enough load to actually queue up enough packets to send multiple in one go.

Also, given what you said about the buffer being on a per PeerConnection basis, it might have a much larger impact once #2211 lands? (which I've been meaning to get around to testing and providing feedback!)

@lminiero
Member Author

@cb22 thanks for giving it a spin! Yes, I'd say it's expected that you don't get many of those multiple sends: as I wrote in my description above, I only got a few of those myself when doing an EchoTest. I'd expect them to become much more frequent the moment load increases, because thread scheduling then makes it easier to find more than one packet in any outgoing queue; high bitrates may also increase the chances, I guess, simply because of the higher number of packets they'd push around. I personally would see this "adaptive" behaviour as a feature, if it turns out to work nicely, since in most cases sending one packet at a time is just fine. Besides, as anticipated in the description, trying to always send at least X packets per call would cause issues, since we'd have to wait for X packets to be in the buffer first, which may never happen (e.g., the source already sent everything it wanted while the buffer only has Y < X packets) or may happen slowly (e.g., low framerate video), which would in turn also increase delay.

The fact you're using event_loops may make the occurrence of multiple packets even rarer, I believe, since no matter how many PeerConnections you have, the number of threads remains limited, and so, probably, does the resulting context switching. You can try disabling event_loops to see if you get more multiple sends that way.

I think you're right that it will probably happen more frequently with the multistream branch, since the same PC might be used to send media coming from more than just one source, and so the outgoing queue would have many more packets per second than now on average. That said, I'd like to first assess its performance here, on master, and only subsequently merge it in multistream.

@lminiero
Member Author

lminiero commented Aug 5, 2020

FYI, Olivier merged my libnice merge request, so you can now test this PR using libnice master as well.

@sloweclair

sloweclair commented Sep 10, 2020

This seems to be working great. Before, we were seeing lots of flapping streams when the Janus outbound BW was about 130 Mbps and inbound BW was about 30 Mbps. Now I am not really sure when that will happen.
Is there anything I should be trying in particular?

@amnonbb
Contributor

amnonbb commented Sep 13, 2020

Can I merge it into multistream in my fork and test it?

@lminiero
Member Author

Pretty sure it won't merge cleanly: you'll get many conflicts that you'll have to solve yourself. But if you want to try, sure, it would be nice to know if/how it works in that context 🙂

@amnonbb
Contributor

amnonbb commented Sep 13, 2020

It was a nice try 😬
but the new function janus_ice_send_or_store uses janus_ice_component, which doesn't actually exist in multistream.
So I need a better understanding of what's really going on there 😬

@yin-zhang

yin-zhang commented Oct 13, 2020

I played with this version by creating some stress on the server (videoroom plugin, 10 publishers @ 3.5Mbps each). Found a problem:

nice_agent_send_messages_nonblocking() often returns -1, with error->code == G_IO_ERROR_WOULD_BLOCK.

In this case, should we simply treat it as sent = 0 and try again?

@atoppi
Member

atoppi commented Oct 13, 2020

@yin-zhang have you tried increasing the udp send buffer size (net.ipv4.udp_wmem_min) ?

@yin-zhang

Let me give it a try

@lminiero
Member Author

In this case, should we simply treat it as sent = 0 and try again?

If you read the PR description, it's one of the open questions we need to address. At the moment it means that we'll try again as soon as a new packet is queued, but it also means the queue fills up: in case of too many failures, the queue might become full, meaning that we can't add new messages to it.

@yin-zhang

After increasing the UDP send buffer to 64KB, the queue still fills up under heavy load (I'm using event_loops = 8, the load is ~800Mbps).

Re: the open questions. I ended up implementing the following strategy that seems to give decent performance.

if(sent < 0) {
	if(component->pending_messages_num < JANUS_MAX_PENDING_MESSAGES) {
		/* There is still room in the queue:
		 * hopefully a future packet will resolve it */
		return TRUE;
	} else {
		/* Discard half of the messages */
		sent = (JANUS_MAX_PENDING_MESSAGES >> 1);
	}
}
/* Update the indexes and clean resources if we sent some or all of the packets */
if(sent > 0) {
	....

@atoppi
Member

atoppi commented Oct 13, 2020

@yin-zhang I'd push the buffer size to megabytes

@atoppi
Member

atoppi commented Oct 13, 2020

Also bear in mind that net.ipv4.udp_wmem_min is not in bytes, but in pages (on my system 1 page = 4KB).

@yin-zhang

@atoppi I think net.ipv4.udp_wmem_min is in bytes (default = page size = 4096). net.ipv4.udp_mem is in pages.

After pushing net.ipv4.udp_wmem_min = 2097152, G_IO_ERROR_WOULD_BLOCK still occurs fairly frequently (the load is ~800Mbps).

@atoppi
Member

atoppi commented Oct 13, 2020

@yin-zhang you're right, sorry for the confusion.
Thanks for testing btw.

@yin-zhang

It's a little surprising that EWOULDBLOCK occurs so frequently even with a 2MB UDP socket buffer. I would expect UDP to be rather fast. 800Mbps of load is high, but the link + NIC aren't really saturated yet.

@atoppi
Member

atoppi commented Oct 14, 2020

Indeed, I would not expect this behavior, especially considering that we achieved higher throughput without this sendmmsg patch.
Have you checked the CPU load? Maybe it is not keeping up with the target rate.

@yin-zhang

After doing some gprof profiling, it seems that the performance bottleneck is elsewhere: nice_agent_send, along with srtp_protect/srtp_unprotect, only accounts for less than 10% of the total time. There are two main bottlenecks.

(1) Handles for subscribers of the same publisher are distributed across different event loops. As a result, whenever we receive a new packet from a publisher, it has to be forwarded to N different threads. The overhead of context switching becomes significant when we have to do this for every publisher's packet. In fact, one can see that when event_loops increases from 4 or 8 to 32 or 64, the CPU usage also increases significantly. To solve the problem, I did a quick and dirty hack: assigning all subscribers of the same publisher to this publisher's event loop. With this change alone, I'm able to significantly improve the scalability -- now I can easily handle 1Gbps with just 1.5-2 logical cores (on a 40-core/80-thread machine)

(2) janus_cleanup_nack_buffer is inefficient in that it has to delete elements from the hash table one by one. We are better off maintaining two hash tables, each covering at most nack_queue_ms: when we would have to create a third hash table, we destroy the oldest one completely. The cost is that whenever we need to look up a seq, we need to do lookups in two hash tables, but this is OK, because NACK is not a frequent operation (a rough sketch of this rotation scheme is shown below). I'm going to quickly implement this to see if the performance can be further improved. I'm hoping to get Janus to sustain 1Gbps with just a single core :-)
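
For context, this is more or less the rotation scheme I have in mind; just a sketch with made-up names (and no packet ownership handling), not actual Janus code:

#include <glib.h>

/* seq -> stored packet; two tables, each covering at most nack_queue_ms */
typedef struct sketch_nack_buffer {
	GHashTable *current, *previous;
	gint64 rotated_at;	/* When 'current' started being filled (monotonic usecs) */
	guint queue_ms;
} sketch_nack_buffer;

static void sketch_nack_store(sketch_nack_buffer *nb, guint16 seq, gpointer pkt) {
	gint64 now = g_get_monotonic_time();
	if(now - nb->rotated_at > (gint64)nb->queue_ms * 1000) {
		/* Rotate: throw away the whole oldest table at once,
		 * instead of removing its elements one by one */
		g_hash_table_remove_all(nb->previous);
		GHashTable *tmp = nb->previous;
		nb->previous = nb->current;
		nb->current = tmp;
		nb->rotated_at = now;
	}
	g_hash_table_insert(nb->current, GUINT_TO_POINTER(seq), pkt);
}

static gpointer sketch_nack_lookup(sketch_nack_buffer *nb, guint16 seq) {
	/* Two lookups in the worst case, but NACKs are infrequent */
	gpointer pkt = g_hash_table_lookup(nb->current, GUINT_TO_POINTER(seq));
	return pkt ? pkt : g_hash_table_lookup(nb->previous, GUINT_TO_POINTER(seq));
}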

@lminiero
Member Author

Interesting observations from the gprof results, thanks for sharing! Adding a quick note, though.

I did a quick and dirty hack: assigning all subscribers of the same publisher to this publisher's event loop

I don't see how this is doable: event loops are a core thing, while publishers/subscribers are concepts of a plugin, and core/plugins know nothing of each other. While I agree it may not be optimal, whatever fix you may be doing here is something that would violate this separation. I guess we might envision something like an opaque "purpose" property that one can pass when attaching a handle, so that the core can decide which loop to add it to, but I see several issues and constraints in how such a thing might work.

@lminiero
Member Author

We are better off maintaining two hash tables, each covering at most nack_queue_ms

This is problematic, because that value is dynamic and can (will) change during a session.

@atoppi
Member

atoppi commented Oct 15, 2020

@yin-zhang awesome work! Thank you!

  • So how did you solve the EWOULDBLOCK issue?
  • Can you share the graph of your perf analysis?

@atoppi
Member

atoppi commented Oct 15, 2020

@yin-zhang another question

now I can easily handle 1Gbps with just 1.5-2 logical cores (on a 40-core/80-thread machine)

what setting of event_loops let you achieve this throughput ?

@yin-zhang

yin-zhang commented Oct 15, 2020

@lminiero I did the following. In ice.h, ice.c, I implemented a simple function:

void janus_ice_update_event_loop(janus_ice_handle *handle, janus_ice_handle *master) {
	/* We expect event_loops > 0 */
	if (static_event_loops == 0) {
		return;
	}
	if (handle == NULL || master == NULL || master->mainctx == NULL || master->mainloop == NULL) {
		return;
	}
	if (handle->mainctx == master->mainctx && handle->mainloop == master->mainloop) {
		return;
	}

	/* copy mainctx and mainloop */
	handle->mainctx = master->mainctx;
	handle->mainloop = master->mainloop;

	/* create the new rtp_source */
	if(handle->rtp_source) {
		g_source_destroy(handle->rtp_source);
		g_source_unref(handle->rtp_source);
		handle->rtp_source = NULL;
	}
	handle->rtp_source = janus_ice_outgoing_traffic_create(handle, (GDestroyNotify)g_free);
	g_source_set_priority(handle->rtp_source, G_PRIORITY_DEFAULT);
	g_source_attach(handle->rtp_source, handle->mainctx);
}

Then in janus_videoroom.c, whenever we assign session->participant = subscriber, we call:

janus_ice_update_event_loop(
	(janus_ice_handle *)subscriber->session->handle->gateway_handle,
	(janus_ice_handle *)publisher->session->handle->gateway_handle);

@lminiero
Member Author

That is a VERY dirty hack, I'm surprised it just doesn't crash for you 😉

@yin-zhang

@yin-zhang another question

now I can easily handle 1Gbps with just 1.5-2 logical cores (on a 40-core/80-thread machine)

what setting of event_loops let you achieve this throughput ?

I used event_loops = 4

@yin-zhang

That is a VERY dirty hack, I'm surprised it just doesn't crash for you 😉

Yep, it's a hack 😉. I had to scratch my head to get around the layering constraints imposed by Janus... However, I think it should be safe -- the subscriber's handle shouldn't have any outgoing RTP packets before the subscriber is set up, etc. I also stress tested the server with 40 participants coming and going -- so far no crashes observed.

@rahmanash-dbz

Sure, I will watch out for the commit in this space 😃. Also, I guess this opaque stuff could also be extended to the Streaming plugin, where all viewers (thread->viewers) of a mountpoint would be assigned to the same event loop, so that there would be no need to switch contexts when relaying a packet, which would remarkably improve the Streaming plugin's performance.

@lminiero
Member Author

@yin-zhang @rahmanash-dbz please see the PR above, which is where I added the ability to manually specify the loop index to use when adding new handles. It's based on this PR, so it should include the whole sendmmsg changes as well.

@rahmanash-dbz

@lminiero thank you for your work on this. I will check this out and share my feedback.

And I have a suggestion that could make this logic easier and simpler to use: return the loop_index in the attach response.

One use case is to randomly choose an event_loop for a publisher (i.e., without specifying any index while attaching) and get back the index of the loop that was picked. Then this same index can be used for all the subscribers, which achieves the desired effect. This would allow the code to integrate easily into any manager service.

@lminiero
Member Author

And I have a suggestion that could make this logic easier and simpler to use: return the loop_index in the attach response.

I don't see the point. The automatic logic isn't "smart", it's just a blind round robin. As such, I think that the manual override I added is already enough.

@rahmanash-dbz

Hi Lorenzo, I didn't mean the automatic logic was smart. In a situation where a single Janus instance is shared by many manager services, keeping the event_loop selection in each of those manager instances can cause many publisher handles (and thus all their subscriber handles too) to map to a single event_loop, which may overload it with handles. At least the automatic selection can pick a different event_loop, regardless of how many services share the instance.

But this is a concern for the manager service and, like you said, the logic you provided is enough. I am already integrating this code and will deploy it soon. Will share the feedback here after monitoring for some time. Thanks

@lminiero
Member Author

lminiero commented Dec 4, 2020

@rahmanash-dbz any update on #2450? Can it be merged?

@rahmanash-dbz

@lminiero Sorry, it took time to integrate this code into the present architecture, amid all the other work. It will be deployed to production tomorrow, so under high load I can only report on the performance some 3 days from now, after gathering some stats.

But the commit was working without any issues in testing.

@lminiero
Member Author

I'll merge #2450 into this one, then. Still waiting for feedback on how this sendmmsg PR works, and how to improve it.

@rahmanash-dbz

rahmanash-dbz commented Dec 14, 2020

Hi Lorenzo, I tested the event_loop_mapping commit in high-load conditions. There was a sharp decrease in CPU usage, by about 10-15%.

One thing I noticed is the number of "discarding packets" logs (due to queueing) under high load. Setting MAX_PENDING_MESSAGES to a higher value mostly solves this issue.

My suggestion is that, if by some mechanism the value of MAX_PENDING_MESSAGES could be changed dynamically according to the "discarding packets" count, then this would be perfect (a purely hypothetical sketch of such a mechanism follows).
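
Just to illustrate what I mean (none of these names exist in Janus, and the pending buffer in this PR seems to be a fixed-size array, so this would also require switching to a heap-allocated one):

#include <glib.h>
#include <agent.h>	/* libnice, for NiceOutputMessage */

#define SKETCH_HARD_MAX 64

typedef struct sketch_pending_buf {
	NiceOutputMessage *messages;	/* Heap-allocated instead of a fixed-size array */
	guint capacity, num;
	guint discarded_since_check;	/* Incremented whenever a packet is dropped */
} sketch_pending_buf;

static void sketch_maybe_grow(sketch_pending_buf *b) {
	if(b->discarded_since_check > 0 && b->capacity < SKETCH_HARD_MAX) {
		/* We dropped packets recently: double the capacity, up to a hard cap
		 * (the matching array of GOutputVector buffers would need to grow too) */
		b->capacity = MIN(b->capacity * 2, SKETCH_HARD_MAX);
		b->messages = g_realloc(b->messages, b->capacity * sizeof(NiceOutputMessage));
	}
	b->discarded_since_check = 0;
}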

@@ -1669,6 +1709,9 @@ static void janus_ice_component_free(const janus_refcount *component_ref) {
janus_seq_list_free(&component->last_seqs_video[1]);
if(component->last_seqs_video[2])
janus_seq_list_free(&component->last_seqs_video[2]);
int i = 0;
for(i=0; i<JANUS_MAX_PENDING_MESSAGES; i++)
Contributor


nit: no need to initialize i twice.

Member Author


Makes sense 😄 The paranoid in me tends to initialize everything, these days, as I've had my share of issues caused by a lack of that.

@lminiero
Member Author

@rahmanash-dbz thanks for the feedback! I must have missed the notification about your message during the holidays.

Hi Lorenzo, I tested the event_loop_mapping commit in high-load conditions. There was a sharp decrease in CPU usage, by about 10-15%.

That's definitely good news!

One thing I noticed is the number of "discarding packets" logs (due to queueing) under high load. Setting MAX_PENDING_MESSAGES to a higher value mostly solves this issue.
My suggestion is that, if by some mechanism the value of MAX_PENDING_MESSAGES could be changed dynamically according to the "discarding packets" count, then this would be perfect

In this PR it will be static; any enhancement can come later. 10 may indeed be a low value, but it was just one that I thought made sense: if in your tests you found a number that seems to be more efficient (especially in terms of console log spam), please do let us know, so that we can change the default to something else.

@lminiero
Member Author

@rahmanash-dbz any feedback on the above? Thanks.

@@ -147,6 +147,17 @@ gboolean janus_is_opaqueid_in_api_enabled(void) {
return opaqueid_in_api;
}

/* Since #2295, we support sendmmsg to send multiple packets at the same time
* and optimize media delivery; this can disabled at startup, if needed */
Contributor


this can disabled at startup

be seems to be missing before disabled.

@lminiero
Member Author

From some tests we made internally, this didn't really seem to give the edge we hoped for: we didn't receive much feedback from others, so we don't know if it's because of how we performed the tests. Actually, it turned out there's another part of the code that should be rewritten because it's inefficient, so until that happens we won't be able to proceed. As a side note, this patch will probably make more sense once multistream lands, which is one more reason to hold on to this a while longer. I'll mark this as a draft for the time being.

@lminiero lminiero marked this pull request as draft April 16, 2021 10:27
@cppdev-1

@yin-zhang Could you provide a full patch for this?

@lminiero
Member Author

lminiero commented Feb 4, 2022

Closing as we won't merge this anytime soon. We may revisit the effort once multistream is merged.

@lminiero lminiero closed this Feb 4, 2022
@lminiero lminiero deleted the sendmmsg branch February 4, 2022 14:51