Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock on mountpoint destroy during RTSP reconnect #2700

Conversation

lionelnicolas
Copy link
Contributor

@lionelnicolas lionelnicolas commented Jun 12, 2021

This fixes a deadlock which occurs under some conditions when an RTSP mountpoint destroy is requested from the API, while the mountpoint is reconnecting to the RTSP server.

When a destroy message is handled, the mountpoints_mutex is locked (

janus_mutex_lock(&mountpoints_mutex);
), then the mountpoint is removed from the hash table (
g_hash_table_remove(mountpoints,
). The removal triggers janus_streaming_mountpoint_destroy() callback.

janus_streaming_mountpoint_destroy() interrupt the poll then join the relay thread to wait for it to exit (

g_thread_join(mountpoint->thread);
).

But, if the mountpoint was trying to reconnect at the same moment (

if(janus_streaming_rtsp_connect_to_server(mountpoint) < 0) {
) and mountpoint->destroyed not set to 1 yet, then the relay thread will also try to lock mountpoints_mutex (
janus_mutex_lock(&mountpoints_mutex);
). This results in a deadlock if the lock is currently held by the caller of join() on that mountpoint thread.

As the value of mountpoint->destroyed could change while waiting for the lock, the solution was to use g_mutex_trylock() to be able to abort the lock attempt when destroyed is eventually set to 1.

I've added a janus_mutex_unlock() macro, with no trailing ; to be able to use the macro in if or while statement (otherwise the compiler was not happy). I've also removed the semicolon in other janus_mutex_* definitions for consistency.

This is based on top of #2699. So I'll rebase this PR once the other one is merged. I've chosen to open two separated PRs as they are fixing two different issues.

Fixes: #2691

@lminiero
Copy link
Member

lminiero commented Jun 14, 2021

The changes you made should be done on both definitions of the macros: how you've implemented it now, it only works if GMutex is used, and would trigger the compiler errors you mentioned if using pthread mutex instead.

That said, I'm not convinced by this trylock stuff. If the cause of the issue is janus_streaming_rtsp_connect_to_server trying to lock the mutex because destroyed was not 1 yet, wouldn't it be easier to move the destroyed check within the mutex? e.g.:

janus_mutex_lock(&mountpoints_mutex);
if(g_atomic_int_get(&mp->destroyed)) {
	janus_mutex_unlock(&mountpoints_mutex);
	curl_easy_cleanup(curl);
	g_free(curldata->buffer);
	g_free(curldata);
	return -8;
}

/* Parse both video and audio first before proceed to setup as curldata will be reused */

Edit: probably not, since this would always perform the lock, which is what causes the issue now when destroyed is still 0... I'll have to think about this.

@lionelnicolas lionelnicolas force-pushed the bugfix/deadlock-destroy-during-rtsp-reconnect branch 2 times, most recently from da5ec21 to e60930f Compare June 14, 2021 13:06
@lionelnicolas
Copy link
Contributor Author

The changes you made should be done on both definitions of the macros: how you've implemented it now, it only works if GMutex is used, and would trigger the compiler errors you mentioned if using pthread mutex instead.

Ok yeah I missed that #ifdef. Fixed.

Edit: probably not, since this would always perform the lock, which is what causes the issue now when destroyed is still 0... I'll have to think about this.

Yes exactly. Using trylock make that lock attempt cancellable depending on the destroyed value, if the thread is stuck in that while for a while. On a very loaded streaming plugin API, I realized that acquiring the lock on this mutex can take some time (I already observed ~ 200 threads stuck on waiting for that lock, which was increasing the probability of having the value of destroyed modified during that wait time.

Using that trylock logic, I wasn't able to reproduce the deadlock, and I saw Destroying mountpoint while trying to reconnect, aborting log appearing multiple times, so this code was avoiding the deadlock.

@lionelnicolas lionelnicolas force-pushed the bugfix/deadlock-destroy-during-rtsp-reconnect branch from e60930f to a049a23 Compare June 15, 2021 02:40
@lminiero
Copy link
Member

Makes sense, thanks for the explaination! I tried compiling both with GMutex and pthread mutex and so no problems, so this is good to merge for me 👍

@lminiero lminiero merged commit 1d50e06 into meetecho:master Jun 15, 2021
@lionelnicolas lionelnicolas deleted the bugfix/deadlock-destroy-during-rtsp-reconnect branch June 15, 2021 13:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

streaming plugin: deadlock when handling API messages
2 participants