
WebSocket transport deadlock #732

Closed
cedricfung opened this issue Jan 5, 2017 · 75 comments

Comments

@cedricfung

I'm using the Janus video room plugin to develop a new website, and have run into this issue several times: the WebSocket transport won't accept any new clients after some tests.

After tracing the code execution, I found it stuck at this code block:

JANUS_LOG(LOG_VERB, "[%s-%p] WebSocket connection accepted\n", log_prefix, wsi);
if(ws_client == NULL) {
	JANUS_LOG(LOG_ERR, "[%s-%p] Invalid WebSocket client instance...\n", log_prefix, wsi);
	return -1;
}
/* Clean the old sessions list, in case this pointer was used before */
janus_mutex_lock(&old_wss_mutex);

I have some old clients that exited without cleanup, e.g. without detaching their handles.

Is this expected to happen when clients don't clean up resources before disconnecting?

@lminiero
Member

lminiero commented Jan 5, 2017

We have an old pointers list in place because the object allocation for the WebSockets client is done by the library itself, and it tends to reuse old pointers for that. Since we use those pointers as opaque identifiers for the transport in place, we have to do some magic: specifically, making sure we don't handle send requests from the core on a transport we know has gone (e.g., a closed connection, whose pointer we added to the "old list"), and also making sure the "old list" is updated whenever we get a new connection (if a pointer previously used for a connection was in the "old list" and the same one is used for a new connection, we remove it from the "old list"). This is indeed a bit wacky, and is not an issue anymore in the reference counters branch, for instance, as in that case the pointer we use to identify transports is a container we create ourselves.

That said, I'm not sure what can cause that to deadlock. libwebsockets is single threaded, so if it's stuck there it indeed won't go on, but why it locks there is something we'll have to investigate. Probably a lock somewhere else that we don't unlock in some if/then/else branch? I'll have a look.
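
To make the above a bit more concrete, the bookkeeping is roughly along these lines (a simplified sketch using the old_wss_mutex from the excerpt above; the list and helper names here are illustrative, not the exact transport code):

/* Simplified sketch of the "old pointers" bookkeeping described above.
 * old_wss_mutex is the mutex from the excerpt in the first comment; the
 * list and helper names are illustrative, not the exact transport code. */
static GList *old_wss = NULL;        /* pointers of connections we know are gone */
static janus_mutex old_wss_mutex;

/* When a connection closes, remember its pointer */
static void janus_ws_remember_old_pointer(void *ws_client) {
	janus_mutex_lock(&old_wss_mutex);
	if(g_list_find(old_wss, ws_client) == NULL)
		old_wss = g_list_prepend(old_wss, ws_client);
	janus_mutex_unlock(&old_wss_mutex);
}

/* When a new connection is accepted, the library may reuse the same pointer:
 * drop it from the "old list" so it is considered a live transport again */
static void janus_ws_reclaim_pointer(void *ws_client) {
	janus_mutex_lock(&old_wss_mutex);
	old_wss = g_list_remove(old_wss, ws_client);
	janus_mutex_unlock(&old_wss_mutex);
}

/* Before handling a send request from the core, ignore transports that are gone */
static gboolean janus_ws_pointer_is_stale(void *transport) {
	janus_mutex_lock(&old_wss_mutex);
	gboolean stale = (g_list_find(old_wss, transport) != NULL);
	janus_mutex_unlock(&old_wss_mutex);
	return stale;
}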

@lminiero
Member

lminiero commented Jan 5, 2017

PS: if you find it easy enough to replicate, you can try enabling the locking debug in Janus and try again until it happens. This way, you should be able to see exactly where in the code the double locking happens for you.
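
For context, the locking debug is nothing fancy: it basically wraps the mutex operations so that every lock and unlock also prints where it happened, which makes the last lock without a matching unlock stand out in the log. A hypothetical illustration of the idea (not the actual Janus macros):

/* Hypothetical illustration of lock debugging (not the actual Janus macros):
 * wrap the mutex calls so every lock/unlock prints file, line and function,
 * making the last lock without a matching unlock visible in the log. */
extern volatile int lock_debug;

#define DBG_MUTEX_LOCK(m) do { \
	if(lock_debug) \
		g_print("[lock] %s:%d:%s (%p)\n", __FILE__, __LINE__, __FUNCTION__, (void *)(m)); \
	g_mutex_lock(m); \
} while(0)

#define DBG_MUTEX_UNLOCK(m) do { \
	if(lock_debug) \
		g_print("[unlock] %s:%d:%s (%p)\n", __FILE__, __LINE__, __FUNCTION__, (void *)(m)); \
	g_mutex_unlock(m); \
} while(0)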

@cedricfung
Author

Thanks. I just enabled the lock_debug, and will wait for this to happen again.

BTW: when will the new refcount branch be considered stable?

@lminiero
Member

lminiero commented Jan 5, 2017

As soon as it is! 😉
Joking aside, there are people testing it, and we try to do that whenever we can too. If you're interested in giving it a go in some of your application scenarios, it will definitely help move it forward faster!

@cedricfung
Author

Will try it soon. I also just noticed another issue: if I start a WebSocket connection, close it, and then open another connection soon after, a timeout message with the old session ID is sent to the new connection.

I'm not sure if this is a bug, because I can deal with it in client code.

@lminiero
Member

lminiero commented Jan 5, 2017

Yeah, that's something we experienced too; it happens when a new connection arrives as soon as the previous one ends and the two share the same "pointer". In that case, Janus might be a little too slow to notify the timeout, use the pointer it had, and so the event gets to the new connection instead of the old one. Again, something that won't happen with the reference counters branch.

To solve that issue in the short term, we might simply add a map in the transport, so that we always feed the core with a different pointer no matter what it points to, as it would be mapped to the actual transport structure.
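
A rough sketch of that indirection, just to illustrate the idea (the structure and function names are hypothetical, not the actual transport code):

/* Hypothetical sketch of the indirection described above: the core is always
 * handed a freshly allocated opaque handle, never the lws pointer itself, so
 * a reused lws pointer can't be mistaken for the previous connection. */
typedef struct ws_transport_handle {
	void *wsi;           /* the libwebsockets connection this handle maps to */
	gboolean gone;       /* set when the underlying connection is closed */
} ws_transport_handle;

static GHashTable *ws_handles = NULL;    /* handle -> ws_transport_handle */
static janus_mutex ws_handles_mutex;

static ws_transport_handle *ws_handle_new(void *wsi) {
	ws_transport_handle *h = g_malloc0(sizeof(ws_transport_handle));
	h->wsi = wsi;
	janus_mutex_lock(&ws_handles_mutex);
	if(ws_handles == NULL)
		ws_handles = g_hash_table_new(NULL, NULL);
	g_hash_table_insert(ws_handles, h, h);
	janus_mutex_unlock(&ws_handles_mutex);
	return h;    /* this pointer, not wsi, is what the Janus core sees */
}

static void ws_handle_closed(ws_transport_handle *h) {
	janus_mutex_lock(&ws_handles_mutex);
	h->gone = TRUE;      /* late events (e.g., timeouts) can then be dropped */
	janus_mutex_unlock(&ws_handles_mutex);
}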

@lminiero
Member

lminiero commented Feb 8, 2017

@Vecio any update on the lock debugging?

@cedricfung
Author

It hasn't occurred any more; I have been running the service for a month.

@lminiero
Member

lminiero commented Feb 9, 2017

Ok, closing the issue then. Feel free to reopen should this reappear.

@lminiero lminiero closed this as completed Feb 9, 2017
@shrhoads

shrhoads commented Feb 24, 2017

I don't have a button to reopen this issue, but I am having this issue with the latest master code. It happens about once a day: Janus doesn't crash, but it no longer accepts WebSocket connections. I'm using the latest libwebsockets; when I build, it says it's using the "new API".

Is there an alternative library I can use to work around this issue?

I'm going to enable lock_debug. I'm also only using the SIP plugin.

BTW, on GitHub, if a contributor (lminiero) closes an issue, only they can reopen it.

@lminiero
Member

BTW, on GitHub, if a contributor (lminiero) closes an issue, only they can reopen it.

@shrhoads didn't know about this, reopening then. If you can provide some lock_debug info or find a way to replicate this, it would definitely be helpful. My guess is that the libwebsockets context stops generating events for some reason, e.g., some problem in lws_service. You can set more verbose logging for WebSockets themselves here, which might give you more info when things stop working.

@lminiero lminiero reopened this Feb 24, 2017
@lminiero
Member

As a side note, you can try changing the LOG_HUGE here into something that's included in your debugging level (e.g., LOG_VERB if you're using level 5, or LOG_INFO for level 4). It will spam your logs much more, but, again, missing events after the lock would confirm that events are not flowing anymore.
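
To be clear, the change is a one-liner: the lws log lines are just forwarded to the Janus logger through a small callback, so it's a matter of bumping the level there. Something along these lines (the callback shown here is an approximation, not the verbatim code):

/* Approximation of the lws log forwarding callback (not the verbatim code):
 * bump LOG_HUGE to a level included in your configured debugging level. */
static void janus_websockets_log_emit_function(int level, const char *line) {
	/* was: JANUS_LOG(LOG_HUGE, "[libwebsockets] %s", line); */
	JANUS_LOG(LOG_VERB, "[libwebsockets] %s", line);   /* or LOG_INFO for level 4 */
}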

@cedricfung
Author

This issue occurred again for me; I will try to get some useful logs.

@cedricfung
Author

Recently I have been using the latest refcount branch, and this issue no longer occurs.

@zombiecong

Same issue with the latest master code: it happens when I build Janus on Mac, while it works well on Ubuntu.

@shrhoads

shrhoads commented Mar 9, 2017

It does happen on Ubuntu too, although it hasn't happened again for 1.5 weeks so far.

@shrhoads

shrhoads commented Mar 9, 2017

If you can readily reproduce it on Mac, can you add the logging etc. suggested above?

@zombiecong

zombiecong commented Mar 10, 2017

It happens on Mac every time, on every request; I tested both the old API and the new API of libwebsockets. The HTTP transport works well.

@shrhoads This is the lock_debug log: https://gist.github.com/zombiecong/e1417c8001b12c8478ed5658d4181d32

I sent a Janus "create" request and there was no response from the server. After a moment, I saw the session "timeout" log and the session was released.

BTW: I built some of the dependency libraries myself, not via Homebrew.

@lminiero
Member

Sorry, missed this notification, apparently... bridging #798 here as well, as it's the same issue (SIP or VideoRoom are not the issue).

@lminiero
Member

I don't think it's a locking issue in Janus, as you get to the Bye! with no problem, which I believe wouldn't happen if it were stuck somewhere. Maybe a libwebsockets/macOS specific issue? Try enabling a higher libwebsockets debugging level instead to see if it gives any insight: https://github.com/meetecho/janus-gateway/blob/master/conf/janus.transport.websockets.cfg.sample.in#L14
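
For reference, that's the ws_logging setting in janus.transport.websockets.cfg; assuming the .cfg format in use at the time, something like this enables everything (4095 is the full bitmask, as used later in this thread):

[general]
ws_logging = 4095    ; libwebsockets logging bitmask, 4095 = all levels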

@sslivins

I am also hitting the deadlock issue, running OpenSSL 1.0.1e and libwebsockets 2.2.0. I tried disabling secure WebSockets and just used an unencrypted WebSocket with TLS termination handled via nginx, but still no luck. I can reproduce this issue at least once per day. I will try some more logging in libwebsockets and post an update.

@atyenoria

Same issue with the latest refcount branch: libwebsockets 2.2.0, OpenSSL 1.0.1t, Debian jessie Docker container on Ubuntu 16.14. When this happens, HTTP requests stop working too, in addition to WebSocket requests.

@lminiero
Member

Wait, if HTTP requests fail too, then it is indeed a lock in Janus causing this. Can the others confirm? I was under the assumption that the HTTP API was still working when this happened.

@cedricfung
Author

I can confirm that the issue is also in the latest refcount branch.

@lminiero
Member

I won't be able to do any tests before the IETF meeting ends. If you guys can find ways to replicate this in the meanwhile, so that I can check for myself when I get back, that would help.

@sslivins

I was able to reproduce this with libwebsockets logging at the highest level (4095): https://www.dropbox.com/s/3wwqlhp0ksk4e4f/janus_deadlock.log

I have an external process that hits Janus over wss on localhost every minute; this is visible in the log by looking for the message: Message for Session 278: {"id": 0, "request": "watch"}

The first one appears at 20:54:32; there should have been another one at 20:55:32 and another one a minute after that. My watchdog thread kills Janus with an exit status of 99 if it doesn't see a new session request for 2 minutes, which was the case after the request at 20:54:32.

@lminiero
Member

lminiero commented Apr 10, 2017

I can't seem to replicate the issue. Please, if you have an easy way to consistently replicate it, provide the steps here so that I can check whether it happens for me as well. Logs with the locking debug enabled (either via configuration or via the Admin API) when the issue happens would also help me identify which lock is being held and not released, if any.

@lminiero
Member

Ah, the irony! I had managed to quickly replicate this with the tiny stresser tool I implemented, but as soon as I added the same lines I suggested to you, of course Janus stopped deadlocking... keeping it running to annoy Janus, hoping it will happen again soon 😉

@lminiero
Member

Finally got it to happen, and from a first look at the log it seems the cause there was the lws_callback_on_writable call in janus_websockets_send_message (after the last lock without a matching unlock, I can see send_message:Post-push-Pre-writable but not send_message:Post-writable).

I'll do another round just to check whether it happens at the same point, and then I'll evaluate.
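
For those following along, the suspicious path is roughly this pattern in janus_websockets_send_message (simplified, with the debug markers mentioned above; not the verbatim transport code):

/* Simplified sketch of the suspected path in janus_websockets_send_message,
 * with the debug markers mentioned above (not the verbatim transport code). */
janus_mutex_lock(&client->mutex);
/* ... */
g_async_queue_push(client->messages, payload);        /* queue the outgoing message */
JANUS_LOG(LOG_HUGE, "send_message:Post-push-Pre-writable\n");
lws_callback_on_writable(client->wsi);                 /* ask lws to schedule a WRITEABLE callback */
JANUS_LOG(LOG_HUGE, "send_message:Post-writable\n");   /* never printed when the hang occurs */
janus_mutex_unlock(&client->mutex);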

@lminiero
Member

lminiero commented May 1, 2017

Sorry for the delay, but it will take some time before I can look into this. Today is Labor Day (National holiday in Italy) and tomorrow I'll fly abroad for an event for a few days. Anyway, I did some more dumps of my own too, so I have material to investigate when I can.

@lminiero
Member

Just came back to the office after attending two different events, so sorry for the lack of feedback. I'll take a look at the logs between today and tomorrow to check whether I can get to the real source of the issue.

@lminiero
Member

Contacted the lws developers for info on what might cause lws_callback_on_writable never to return, and they said locking is only used if the library is compiled with an LWS_MAX_SMP value different from 1, the default.

I'm now trying to attach gdb to Janus as soon as it deadlocks, but of course I've been stressing it for more than an hour and nothing has happened yet... I'm not sure if I'm using a different library than when I could replicate the issue: now I'm using the version Fedora 25 ships in its repos, which apparently has LWS_MAX_SMP=1. One thing you guys might want to check is what LWS_MAX_SMP is set to in cmake in your build environment.

I'll keep you posted in case I manage to deadlock it again and get some gdb info.
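
One quick way to check: LWS_MAX_SMP ends up as a macro in the installed lws headers, so a tiny test program built against your libwebsockets will print the value Janus would see (a sketch, assuming pkg-config knows about your libwebsockets install):

/* Minimal check of the LWS_MAX_SMP value exposed by the installed headers.
 * Build with: gcc check_smp.c $(pkg-config --cflags libwebsockets) -o check_smp */
#include <stdio.h>
#include <libwebsockets.h>

int main(void) {
#ifdef LWS_MAX_SMP
	printf("LWS_MAX_SMP = %d\n", LWS_MAX_SMP);
#else
	printf("LWS_MAX_SMP is not defined by these headers\n");
#endif
	return 0;
}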

@connectarena

If we use libwebsockets with a single thread, won't that introduce extra delay? Anyway, we started using libwebsockets with a single thread, and it has been running without deadlocks for 36 hours.

@lminiero
Member

libwebsockets is conceived to be single threaded, so there shouldn't be any issue. If you think about it, node.js is single threaded as well, and there's usually no problem in basing HTTP/WS services on top of that. Considering the WS support in the Janus API is only needed for signalling purposes (no really intensive traffic), I don't expect that to be an issue.

Thanks for the feedback on the lack of deadlocks in that case!

@lminiero
Member

@Vecio @shrhoads @zombiecong @sslivins @atyenoria just pinging you guys as you contributed to the issue with feedback on this happening to you. As explained in a previous note, which addressed a comment from the lws developers, just building your library with LWS_MAX_SMP=1 in cmake should fix this, so please let me know if that's not the case.
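
For anyone landing here later, that just means passing the flag when configuring libwebsockets before building it, along these lines (the source checkout path and install step are just examples):

# Example: rebuild libwebsockets with the single-threaded setting
cd libwebsockets
mkdir -p build && cd build
cmake -DLWS_MAX_SMP=1 ..
make && sudo make install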

@sslivins

I will try this today, @lminiero, and let you know the results over the next 24h... thanks again for tracking this down.

@sslivins

BTW, the default value of LWS_MAX_SMP is 32 in libwebsockets 2.2.

@lminiero
Member

lminiero commented May 12, 2017

Weird, the lws developer told me the default was 1. My Fedora ships a 2.1 rpm in the repos, and I couldn't replicate the issue with that, which makes me think that one is using 1 (seems to be confirmed by looking at the cmake file).

@sslivins

sslivins commented May 12, 2017

Again, this is a build from source and I don't pass anything explicit to cmake... it also corresponds with their docs: https://libwebsockets.org/lws-api-doc-master/html/md_README_8coding.html

@sslivins

So, just to update things: I am still experiencing some deadlocks, but they are many, many orders of magnitude fewer. I still need to evaluate whether these have the same root cause or a different one; I'll get you more info on Monday.

@atyenoria

For me, it no longer happens. It seems to be solved.
[spec]
VideoRoom with the latest refcount commit (4daf434). libwebsockets built with cmake and LWS_MAX_SMP=1 (the default is 32).
All 20 Chrome tabs (3 PCs) publishing their own video and audio; I tried to disconnect and reconnect all of them at the same time about ten times. No crash and no WebSocket deadlock.

In the past, I can confirm this happened frequently.

@lminiero
Member

Something weird I just noticed... I wanted to add a check for LWS_MAX_SMP in the plugin code, so that if it's different from 1 we can print a warning, but when I print it out it's 32 for me! Which means that my Fedora-shipped version, which is 2.1.0, is still using the default value of 32, and yet I couldn't replicate the issue anymore? Maybe it's only an issue with 2.2 and LWS_MAX_SMP > 1, then?
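
The check itself would just be a compile-time guard at transport init, something like this (a sketch with a hypothetical helper name, not necessarily what will end up committed):

/* Sketch of the init-time warning described above: LWS_MAX_SMP comes from the
 * libwebsockets headers Janus is compiled against, so a value above 1 can be
 * flagged at startup. Helper name is hypothetical. */
static void janus_websockets_check_smp(void) {
#if defined(LWS_MAX_SMP) && (LWS_MAX_SMP > 1)
	JANUS_LOG(LOG_WARN, "libwebsockets was built with LWS_MAX_SMP=%d: if you hit "
		"WebSocket deadlocks, try rebuilding it with LWS_MAX_SMP=1\n", LWS_MAX_SMP);
#endif
}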

@idubinets

I faced the same issue. LWS_MAX_SMP is set to 32 by default. I rebuilt the library with LWS_MAX_SMP = 1.
Let's see how it works; I will let you know.

@IvRRimum

IvRRimum commented Jun 4, 2017

Adding to this discussion:

I am using WebSockets; I recompiled libwebsockets with LWS_MAX_SMP=1 and I still managed to get the deadlock.

But here's the weird behaviour of the deadlock: I am using 2 plugins in my Janus installation:

  1. audiobridge
  2. videoroom

When a deadlock happens in the audiobridge, I can still access and use the videoroom. I don't know if this is related, though. Also, I wanted to add that I am using the janus-gateway version from Dec 20, 2016.

I ran the LOCK/DEADLOCK script and got this: https://pastebin.com/zAa4iNLC

How to reproduce: Create 6 Chrome tabs and join a single room. Do the same on PC 2, and connect from a 3rd device (in my case a mobile app). Run an automation script that does: refresh -> join room -> unmute for 2 seconds -> refresh -> join...

After a while the audiobridge will deadlock.

EDIT: After further crashing, the first lines of the LOCK/UNLOCK script output change depending on where it deadlocked (next time it's not in my room_helper:radio_thread, but in list_rooms, etc.). So that probably means it deadlocks earlier and then tries to lock an already locked mutex again.

@lminiero
Member

lminiero commented Jun 4, 2017 via email

@IvRRimum

IvRRimum commented Jun 4, 2017

Hey @lminiero!

Yeah, makes sense. Created an issue here.

@shrhoads

Just reporting in on this: LWS_MAX_SMP=1 seems to have solved the issue for me.

@lminiero
Member

Closing then, thank you all for the feedback!
