Fix deadlock on mountpoint destroy during RTSP reconnect #2700

lionelnicolas · 2021-06-12T16:42:44Z

This fixes a deadlock that can occur when an RTSP mountpoint destroy is requested through the API while the mountpoint is reconnecting to the RTSP server.

When a destroy message is handled, mountpoints_mutex is locked (janus-gateway/plugins/janus_streaming.c, line 3777 in c621bb5):

janus_mutex_lock(&mountpoints_mutex);

Then the mountpoint is removed from the hash table (line 3798 in c621bb5):

g_hash_table_remove(mountpoints,

The removal triggers the janus_streaming_mountpoint_destroy() callback while mountpoints_mutex is still held.

janus_streaming_mountpoint_destroy() interrupts the poll, then joins the relay thread to wait for it to exit (line 1284 in c621bb5):

g_thread_join(mountpoint->thread);
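The "interrupt the poll, then join" teardown can be sketched in plain POSIX C as below. This is a minimal illustration, not the janus_streaming.c code (which uses GLib threads): it assumes a pipe as the wake-up mechanism, and every name (sketch_mountpoint, sketch_mountpoint_start, sketch_mountpoint_destroy) is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical, simplified stand-in for a mountpoint with a relay thread. */
typedef struct sketch_mountpoint {
    int wake_pipe[2];      /* writing to [1] interrupts poll() on [0] */
    atomic_int destroyed;  /* set to 1 when teardown starts (in this sketch) */
    pthread_t thread;
    int relay_exited;      /* set by the relay thread just before it returns */
} sketch_mountpoint;

static void *sketch_relay_thread(void *arg) {
    sketch_mountpoint *mp = arg;
    struct pollfd fds[1] = {{ .fd = mp->wake_pipe[0], .events = POLLIN }};
    while (!atomic_load(&mp->destroyed)) {
        /* A real relay would also poll its media sockets here. */
        int n = poll(fds, 1, 1000);
        if (n > 0 && (fds[0].revents & POLLIN))
            break; /* woken up by the destroy path */
    }
    mp->relay_exited = 1;
    return NULL;
}

int sketch_mountpoint_start(sketch_mountpoint *mp) {
    atomic_init(&mp->destroyed, 0);
    mp->relay_exited = 0;
    if (pipe(mp->wake_pipe) < 0)
        return -1;
    return pthread_create(&mp->thread, NULL, sketch_relay_thread, mp) == 0 ? 0 : -1;
}

/* Same order as described above: flag the teardown, interrupt the poll,
 * then join the relay thread and wait for it to exit. */
void sketch_mountpoint_destroy(sketch_mountpoint *mp) {
    atomic_store(&mp->destroyed, 1);
    char byte = 1;
    ssize_t ignored = write(mp->wake_pipe[1], &byte, 1);
    (void)ignored;
    pthread_join(mp->thread, NULL); /* blocks until the relay thread returns */
    close(mp->wake_pipe[0]);
    close(mp->wake_pipe[1]);
}
```

The join is the important part for this bug: it blocks for as long as the relay thread is busy, so anything the relay thread waits on must not be held by the joining thread.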

However, if at that same moment the mountpoint is trying to reconnect (line 7643 in c621bb5):

if(janus_streaming_rtsp_connect_to_server(mountpoint) < 0) {

and mountpoint->destroyed has not been set to 1 yet, the relay thread will also try to lock mountpoints_mutex (line 6485 in c621bb5):

janus_mutex_lock(&mountpoints_mutex);

This results in a deadlock when the lock is held by the thread calling join() on that relay thread: the relay thread waits for the mutex, while the mutex holder waits for the relay thread to exit.
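The lock/join cycle can be made visible with a small POSIX sketch. To keep the example from hanging, the relay side uses pthread_mutex_trylock() instead of a blocking lock and simply reports what it saw; EBUSY is exactly the situation in which a blocking janus_mutex_lock() would never return. All names here (sketch_mountpoints_mutex, sketch_reconnect_attempt, sketch_destroy_while_holding_lock) are hypothetical stand-ins, not Janus code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for mountpoints_mutex. */
static pthread_mutex_t sketch_mountpoints_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the relay thread's reconnect path. trylock is used only so
 * the sketch terminates; it records what a blocking lock would run into. */
static void *sketch_reconnect_attempt(void *arg) {
    int *result = arg;
    *result = pthread_mutex_trylock(&sketch_mountpoints_mutex);
    if (*result == 0)
        pthread_mutex_unlock(&sketch_mountpoints_mutex);
    return NULL;
}

/* Destroy path: hold the mutex (as the destroy handler does) while joining
 * the relay thread. Returns the result of the relay thread's lock attempt. */
int sketch_destroy_while_holding_lock(void) {
    int relay_result = -1;
    pthread_t relay;
    pthread_mutex_lock(&sketch_mountpoints_mutex);
    pthread_create(&relay, NULL, sketch_reconnect_attempt, &relay_result);
    /* With a blocking lock in sketch_reconnect_attempt, this join would never
     * return: the relay waits for the mutex, and we wait for the relay. */
    pthread_join(relay, NULL);
    pthread_mutex_unlock(&sketch_mountpoints_mutex);
    return relay_result;
}
```

Calling sketch_destroy_while_holding_lock() returns EBUSY, because the joining thread owns the mutex for the whole duration of the join.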

Since the value of mountpoint->destroyed may change while the relay thread waits for the lock, the fix uses g_mutex_trylock() so that the lock attempt can be abandoned once destroyed is eventually set to 1.
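A minimal sketch of that cancellable lock attempt, written with POSIX threads rather than GMutex so it stands alone: retry trylock with a short backoff, and give up as soon as destroyed becomes 1. The helper names, the 1 ms backoff, and the -1 return value are assumptions for illustration, not the merged Janus code.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical stand-ins for mountpoints_mutex and mountpoint->destroyed. */
static pthread_mutex_t sketch_mountpoints_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_int sketch_destroyed;

/* Acquire m unless *destroyed becomes 1 first.
 * Returns 0 with m held, or -1 (m not held) if destroyed was set. */
int sketch_lock_unless_destroyed(pthread_mutex_t *m, atomic_int *destroyed) {
    const struct timespec backoff = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1 ms */
    while (!atomic_load(destroyed)) {
        if (pthread_mutex_trylock(m) == 0)
            return 0;            /* lock acquired */
        nanosleep(&backoff, NULL); /* busy: back off, then re-check destroyed */
    }
    return -1;                   /* mountpoint is being destroyed: abort */
}

static void *sketch_relay_reconnect(void *arg) {
    int *result = arg;
    *result = sketch_lock_unless_destroyed(&sketch_mountpoints_mutex, &sketch_destroyed);
    if (*result == 0)
        pthread_mutex_unlock(&sketch_mountpoints_mutex); /* reconnect work would go here */
    return NULL;
}

/* Same scenario as the deadlock: the destroy path holds the mutex while
 * joining the relay thread. Setting destroyed lets the relay give up. */
int sketch_destroy_during_reconnect(void) {
    int relay_result = 0;
    pthread_t relay;
    atomic_store(&sketch_destroyed, 0);
    pthread_mutex_lock(&sketch_mountpoints_mutex);
    pthread_create(&relay, NULL, sketch_relay_reconnect, &relay_result);
    atomic_store(&sketch_destroyed, 1);
    pthread_join(relay, NULL); /* returns: the relay abandoned its lock attempt */
    pthread_mutex_unlock(&sketch_mountpoints_mutex);
    return relay_result;
}
```

The polling backoff trades a little latency for cancellability; the join in sketch_destroy_during_reconnect() completes even though the mutex is held throughout, and the relay side reports -1.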

I've added a janus_mutex_trylock() macro without a trailing semicolon, so that it can be used as the condition of an if or while statement (otherwise the compiler complains). For consistency, I've also removed the trailing semicolon from the other janus_mutex_* definitions.
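Why the trailing semicolon matters, and why every backend selected by the #ifdef needs the same shape, can be shown with a small wrapper header. The sketch_mutex_* names and the SKETCH_USE_GMUTEX switch are hypothetical; this is not Janus's mutex.h. Note that the two backends also disagree on return values: g_mutex_trylock() returns TRUE when the lock is taken, while pthread_mutex_trylock() returns 0, so the sketch normalizes both to "nonzero means acquired".

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

/* Hypothetical wrapper names; not Janus's actual mutex.h. */
#ifdef SKETCH_USE_GMUTEX
#include <glib.h>
typedef GMutex sketch_mutex;
#define sketch_mutex_lock(m)    g_mutex_lock(m)
#define sketch_mutex_unlock(m)  g_mutex_unlock(m)
/* g_mutex_trylock() returns TRUE when the lock was taken. */
#define sketch_mutex_trylock(m) (g_mutex_trylock(m) ? 1 : 0)
#else
typedef pthread_mutex_t sketch_mutex;
#define sketch_mutex_lock(m)    pthread_mutex_lock(m)
#define sketch_mutex_unlock(m)  pthread_mutex_unlock(m)
/* pthread_mutex_trylock() returns 0 when the lock was taken, so normalize
 * it to the same "1 = acquired" meaning as the GMutex branch. */
#define sketch_mutex_trylock(m) (pthread_mutex_trylock(m) == 0)
#endif

/* None of the definitions above end in ';', so they can appear inside an
 * expression. With "#define sketch_mutex_trylock(m) (...);" the while
 * condition below would contain a stray ';' and fail to compile. */
int sketch_count_failed_attempts(sketch_mutex *m, int max_attempts) {
    int failed = 0;
    while (failed < max_attempts && !sketch_mutex_trylock(m))
        failed++;
    return failed;
}
```

With the default pthread branch, a first call on an unlocked mutex returns 0 (acquired immediately), and a second call returns max_attempts because the mutex is still held.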

This is based on top of #2699, so I'll rebase this PR once that one is merged. I've opened two separate PRs because they fix two different issues.

Fixes: #2691

lminiero · 2021-06-14T10:16:03Z

The changes you made should be done on both definitions of the macros: as you've implemented it now, it only works if GMutex is used, and would trigger the compiler errors you mentioned if using pthread mutex instead.

That said, I'm not convinced by this trylock stuff. If the cause of the issue is janus_streaming_rtsp_connect_to_server trying to lock the mutex because destroyed was not 1 yet, wouldn't it be easier to move the destroyed check within the mutex? e.g.:

```c
janus_mutex_lock(&mountpoints_mutex);
if(g_atomic_int_get(&mp->destroyed)) {
	janus_mutex_unlock(&mountpoints_mutex);
	curl_easy_cleanup(curl);
	g_free(curldata->buffer);
	g_free(curldata);
	return -8;
}

/* Parse both video and audio first before proceed to setup as curldata will be reused */
```

Edit: probably not, since this would always perform the lock, which is what causes the issue now when destroyed is still 0... I'll have to think about this.

lionelnicolas · 2021-06-14T13:19:57Z

> The changes you made should be done on both definitions of the macros: as you've implemented it now, it only works if GMutex is used, and would trigger the compiler errors you mentioned if using pthread mutex instead.

Ok yeah I missed that #ifdef. Fixed.

> Edit: probably not, since this would always perform the lock, which is what causes the issue now when destroyed is still 0... I'll have to think about this.

Yes, exactly. Using trylock makes that lock attempt cancellable depending on the value of destroyed, in case the thread is stuck in that while loop for a long time. On a heavily loaded streaming plugin API, I noticed that acquiring this mutex can take some time (I have observed ~200 threads stuck waiting for that lock), which increases the probability of destroyed being modified during that wait.

With that trylock logic, I wasn't able to reproduce the deadlock, and I saw the "Destroying mountpoint while trying to reconnect, aborting" log appear multiple times, so this code path was indeed avoiding the deadlock.

lminiero · 2021-06-15T07:14:58Z

Makes sense, thanks for the explanation! I tried compiling with both GMutex and pthread mutex and saw no problems, so this is good to merge for me 👍

lionelnicolas force-pushed the bugfix/deadlock-destroy-during-rtsp-reconnect branch 2 times, most recently from da5ec21 to e60930f June 14, 2021 13:06

lionelnicolas added 3 commits June 14, 2021 22:37

- Remove unnecessary semicolon in janus mutex macros (ac170b6)
- Add janus_mutex_trylock macro (1bfdc8f)
- Fix deadlock when destroying RTSP mountpoint while reconnecting (a049a23)

lionelnicolas force-pushed the bugfix/deadlock-destroy-during-rtsp-reconnect branch from e60930f to a049a23 June 15, 2021 02:40

lminiero merged commit 1d50e06 into meetecho:master Jun 15, 2021

lionelnicolas deleted the bugfix/deadlock-destroy-during-rtsp-reconnect branch June 15, 2021 13:11
