Segmentation fault in multistream Videoroom on handle detach #2775

Closed · OxleyS opened this issue Sep 22, 2021 · 15 comments
Labels: bug

OxleyS (Contributor) commented Sep 22, 2021

This was on the multistream branch, commit 06c18221ba73aa609d2c7f88ffc3a06fddb2b52e.

This crash was triggered when users left the room. I don't know whether it was the publisher side or the subscriber side specifically that triggered it. Here's a GDB stack trace:

#0  0x0000000000437e23 in janus_plugin_session_is_alive ()
#1  0x000000000046060f in janus_plugin_push_event ()
#2  0x00007f6889021f39 in janus_videoroom_hangup_media_internal ()
   from /nix/store/d716vr4y9d21jn3hkh307pkkcyf098mv-janus-gateway/lib/janus/plugins/libjanus_videoroom.so
#3  0x00007f6889022810 in janus_videoroom_hangup_media ()
   from /nix/store/d716vr4y9d21jn3hkh307pkkcyf098mv-janus-gateway/lib/janus/plugins/libjanus_videoroom.so
#4  0x0000000000449cb7 in janus_ice_outgoing_traffic_handle ()
#5  0x000000000044cdc1 in janus_ice_outgoing_traffic_dispatch ()
#6  0x00007f688b14958b in g_main_context_dispatch ()
   from /nix/store/3jz43ya7j65mh53nj262rilsk1j2jb68-glib-2.68.3/lib/libglib-2.0.so.0
#7  0x00007f688b149838 in g_main_context_iterate.constprop ()
   from /nix/store/3jz43ya7j65mh53nj262rilsk1j2jb68-glib-2.68.3/lib/libglib-2.0.so.0
#8  0x00007f688b149b2b in g_main_loop_run ()
   from /nix/store/3jz43ya7j65mh53nj262rilsk1j2jb68-glib-2.68.3/lib/libglib-2.0.so.0
#9  0x000000000043fc41 in janus_ice_handle_thread ()
#10 0x00007f688b17326d in g_thread_proxy ()
   from /nix/store/3jz43ya7j65mh53nj262rilsk1j2jb68-glib-2.68.3/lib/libglib-2.0.so.0
#11 0x00007f688abc9d3e in start_thread ()
   from /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/libpthread.so.0
#12 0x00007f688aaf643f in clone () from /nix/store/9bh3986bpragfjmr32gay8p95k91q4gy-glibc-2.33-47/lib/libc.so.6

My first thought was that the stack trace implies a potential race with the Janus session's destruction, but my application logs don't indicate that we were trying to destroy the session at the time. All Janus handles are managed under the same Janus session in our application.
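
For context, the usual guard against exactly this kind of race is reference counting: pin the object before any deferred use of it, and let whoever drops the last reference free it. Below is a rough, self-contained sketch of that keep-alive idiom; the session struct and helpers are simplified stand-ins for illustration, not the real janus_refcount helpers from Janus' refcount.h.

/* Simplified stand-in for a refcounted Janus-style session. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct session {
    atomic_int ref;
} session;

/* Take a reference before stashing the pointer for deferred use. */
static session *session_grab(session *s) {
    atomic_fetch_add(&s->ref, 1);
    return s;
}

/* Drop a reference; whoever drops the last one frees the object,
 * so a concurrent teardown can't leave anyone with a dangling pointer. */
static void session_release(session *s) {
    if (atomic_fetch_sub(&s->ref, 1) == 1)
        free(s);
}

int main(void) {
    session *s = malloc(sizeof(session));
    atomic_init(&s->ref, 1);

    session *held = session_grab(s); /* e.g. an event-push path pins it */
    session_release(s);              /* owner tears down (user leaves)  */

    /* Still safe: the extra reference keeps the object alive. */
    printf("ref is now %d\n", atomic_load(&held->ref));
    session_release(held);           /* deferred user done; freed here  */
    return 0;
}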

Logs just before the crash seem pretty unassuming, but I'll post them here as well:

2021-09-21T22:36:46.189434194Z Detaching handle from JANUS VideoRoom plugin; 0x7f68800064d0 0x7f688000de70 0x7f68800064d0 0x7f6880007980
2021-09-21T22:36:46.189556764Z [janus.plugin.videoroom-0x7f688000de70] No WebRTC media anymore; 0x7f68800064d0 0x7f6880007980
2021-09-21T22:36:46.189575584Z Detaching handle from JANUS VideoRoom plugin; 0x7f688000e730 0x7f6880008070 0x7f688000e730 0x7f6880005f00
2021-09-21T22:36:46.189607845Z [janus.plugin.videoroom-0x7f6880008070] No WebRTC media anymore; 0x7f688000e730 0x7f6880005f00
2021-09-21T22:36:46.189769475Z [8708042490854508] WebRTC resources freed; 0x7f688000e730 0x7f6880006300
2021-09-21T22:36:46.189914835Z [8708042490854508] Handle and related resources freed; 0x7f688000e730 0x7f6880006300

lminiero (Member):

The trace doesn't say much (especially without symbols); a libasan trace would be better: https://janus.conf.meetecho.com/docs/debug.html

OxleyS (Contributor, Author) commented Sep 22, 2021

Gotcha, we'll get libasan in our build and post an update if the crash happens again.

OxleyS (Contributor, Author) commented Sep 24, 2021

Here are two libasan traces. The exact crash location differs, but it seems to be triggered by two clients (each with both publish and subscribe handles) disconnecting at roughly the same time:

Trace 1
Trace 2

lminiero (Member):

@OxleyS can you apply this tentative patch and see if it fixes the issue for you?

diff --git a/plugins/janus_videoroom.c b/plugins/janus_videoroom.c
index 3b4ed76a..f47f8f84 100644
--- a/plugins/janus_videoroom.c
+++ b/plugins/janus_videoroom.c
@@ -6945,6 +6945,7 @@ static void janus_videoroom_hangup_media_internal(gpointer session_data) {
 		/* Get rid of streams */
 		janus_mutex_lock(&participant->streams_mutex);
 		GList *subscribers = NULL;
+		GList *subscribers_streams = NULL;
 		GList *temp = participant->streams;
 		while(temp) {
 			janus_videoroom_publisher_stream *ps = (janus_videoroom_publisher_stream *)temp->data;
@@ -6959,8 +6960,11 @@ static void janus_videoroom_hangup_media_internal(gpointer session_data) {
 					janus_videoroom_subscriber_stream_remove(ss, ps, FALSE);
 					/* Take note of the subscriber, so that we can send an updated offer */
 					if(ss->type != JANUS_VIDEOROOM_MEDIA_DATA && g_list_find(subscribers, ss->subscriber) == NULL) {
+						janus_refcount_increase(&ss->subscriber->session->ref);
 						janus_refcount_increase(&ss->subscriber->ref);
 						subscribers = g_list_append(subscribers, ss->subscriber);
+						janus_refcount_increase(&ss->ref);
+						subscribers_streams = g_list_append(subscribers_streams, ss);
 					}
 				}
 			}
@@ -7017,11 +7021,13 @@ static void janus_videoroom_hangup_media_internal(gpointer session_data) {
 						gateway->notify_event(&janus_videoroom_plugin, session->handle, info);
 					}
 				}
+				janus_refcount_decrease(&subscriber->session->ref);
 				janus_refcount_decrease(&subscriber->ref);
 				temp = temp->next;
 			}
 		}
 		g_list_free(subscribers);
+		g_list_free_full(subscribers_streams, (GDestroyNotify)janus_videoroom_subscriber_stream_unref);
 		/* Free streams */
 		g_list_free(participant->streams);
 		participant->streams = NULL;
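
If it helps to read the diff: the idea is to take an extra reference on each affected subscriber's session, and on the subscriber streams themselves, before deferring work on them, and to release those references only after the updated offer has been handled, so that a concurrent teardown can't free them mid-loop. The stream references are then dropped in one go via the GLib idiom of passing an unref function as the GDestroyNotify to g_list_free_full, which calls it on every element before freeing the list itself. A minimal standalone illustration of that pattern (the obj type here is hypothetical, not the plugin's actual stream struct):

#include <glib.h>
#include <stdio.h>

/* Hypothetical refcounted object standing in for a subscriber stream. */
typedef struct {
    int ref;
    char name[16];
} obj;

static void obj_unref(gpointer data) {
    obj *o = data;
    if (--o->ref == 0) {
        printf("freeing %s\n", o->name);
        g_free(o);
    }
}

int main(void) {
    GList *pinned = NULL;
    int i;
    for (i = 0; i < 3; i++) {
        obj *o = g_new0(obj, 1);
        o->ref = 1;
        g_snprintf(o->name, sizeof(o->name), "stream-%d", i);
        o->ref++;                        /* the list takes its own ref */
        pinned = g_list_append(pinned, o);
        obj_unref(o);                    /* original owner drops its ref */
    }
    /* The objects are still alive thanks to the list's references;
     * this single call drops them all and frees the list cells too. */
    g_list_free_full(pinned, obj_unref);
    return 0;
}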

OxleyS (Contributor, Author) commented Sep 29, 2021

I tried applying the patch, but it doesn't seem to resolve the crashes. We've managed to put together a tool that can recreate the crash semi-consistently (I wish it were easier to share!), so I'm going to play around with that for a while and see if I can narrow down the conditions it crashes under.

atoppi added the bug label on Sep 29, 2021

atoppi (Member) commented Sep 29, 2021

@OxleyS are the stack traces identical to the previous reports?

atoppi (Member) commented Sep 29, 2021

The conditions for the crash are clear to us: a publisher and one of their subscribers leaving at the same time, so the publisher's hangup path ends up notifying subscriber resources that are concurrently being torn down.

lminiero (Member):

Having access to a tool that consistently replicates the issue would help, though. Relying on contributed dumps only goes so far, since we'd have no way to check whether our attempts actually fix things.

OxleyS (Contributor, Author) commented Sep 29, 2021

Yeah, the stack traces look identical. As for the tool, in its current form it's pretty heavily tied to our internal environment, but we can take a shot at porting it to something more shareable.

thatsmydoing (Contributor):

Here's a setup for reproduction: https://github.com/thatsmydoing/janus-gateway-2775. It's not exactly the same as what we used to get the previous stack traces, but it's close enough: I can at least trigger the crash on d93290f plus the provided patch.

It usually takes between 100 and 500 attempts to trigger, but sometimes longer.

atoppi (Member) commented Sep 30, 2021

Thanks @thatsmydoing, we managed to reproduce this bug (and many others too...).

atoppi (Member) commented Oct 6, 2021

With the attached patch, Janus successfully completed 3000+ iterations, whereas it usually crashes in fewer than 100 on my environment.

@OxleyS can you please give it a try to confirm that it fixes the issue?

As a side note, we observed some memory leaks. These should not be regressions, since they seem to happen on vanilla multistream too; they are probably triggered by the scenario implemented in the reproducer (thanks again). We'll try to address them in a separate effort once the patch is confirmed to work.

OxleyS (Contributor, Author) commented Oct 7, 2021

We've tried the attached patch and haven't gotten any crashes over 20k+ iterations, so I think this fixes it!

lminiero (Member) commented Oct 7, 2021

I guess we can close this then. Thanks for the feedback!

lminiero closed this as completed on Oct 7, 2021

atoppi (Member) commented Oct 7, 2021

The mentioned leaks have now been fixed in 624df49.
