
WebSocket transport deadlock #732

Closed
cedricfung opened this issue Jan 5, 2017 · 75 comments

Comments

@cedricfung

I'm using the Janus video room plugin to develop a new website, and have run into this issue several times: the WebSocket transport won't accept any new clients after some tests.

After tracing the code execution, I found it stuck at this code block:

JANUS_LOG(LOG_VERB, "[%s-%p] WebSocket connection accepted\n", log_prefix, wsi);
if(ws_client == NULL) {
	JANUS_LOG(LOG_ERR, "[%s-%p] Invalid WebSocket client instance...\n", log_prefix, wsi);
	return -1;
}
/* Clean the old sessions list, in case this pointer was used before */
janus_mutex_lock(&old_wss_mutex);

I have some old clients that exited without cleanup, e.g. without detaching their handles.

Is this expected to happen when clients don't clean up resources before disconnecting?

@lminiero
Member

lminiero commented Jan 5, 2017

We have an old pointers list in place because the object allocation for the WebSockets client is done by the library itself, and it tends to reuse old pointers for that. Since we use those pointers as opaque identifiers for the transport in place, we have to do some magic: specifically, making sure we don't handle send requests from the core on a transport we know has gone (e.g., a closed connection, whose pointer we added to the "old list"), and also making sure the "old list" is updated whenever we get a new connection (if a pointer previously used for a connection was in the "old list" and the same one is used for a new connection, we remove it from the "old list"). This is indeed a bit wacky, and is not an issue anymore in the reference counters branch, for instance, as in that case the pointer we use to identify transports is a container we create ourselves.

That said, I'm not sure what can cause that to deadlock. libwebsockets is single threaded, so if it's stuck there it indeed won't go on, but why it locks there is something we'll have to investigate. Probably a lock somewhere else that we don't unlock in some if/then/else branch? I'll have a look.
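
To make the above a bit more concrete, the bookkeeping is roughly along these lines (a simplified sketch using the old_wss_mutex from the excerpt above; the list and helper names here are illustrative, not the exact transport code):

/* Simplified sketch of the "old pointers" bookkeeping described above.
 * old_wss_mutex is the mutex from the excerpt in the first comment; the
 * list and helper names are illustrative, not the exact transport code. */
static GList *old_wss = NULL;        /* pointers of connections we know are gone */
static janus_mutex old_wss_mutex;

/* When a connection closes, remember its pointer */
static void janus_ws_remember_old_pointer(void *ws_client) {
	janus_mutex_lock(&old_wss_mutex);
	if(g_list_find(old_wss, ws_client) == NULL)
		old_wss = g_list_prepend(old_wss, ws_client);
	janus_mutex_unlock(&old_wss_mutex);
}

/* When a new connection is accepted, the library may reuse the same pointer:
 * drop it from the "old list" so it is considered a live transport again */
static void janus_ws_reclaim_pointer(void *ws_client) {
	janus_mutex_lock(&old_wss_mutex);
	old_wss = g_list_remove(old_wss, ws_client);
	janus_mutex_unlock(&old_wss_mutex);
}

/* Before handling a send request from the core, ignore transports that are gone */
static gboolean janus_ws_pointer_is_stale(void *transport) {
	janus_mutex_lock(&old_wss_mutex);
	gboolean stale = (g_list_find(old_wss, transport) != NULL);
	janus_mutex_unlock(&old_wss_mutex);
	return stale;
}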

@lminiero
Member

lminiero commented Jan 5, 2017

PS: if you find it easy enough to replicate, you can try enabling the locking debug in Janus and try again until it happens. This way, you should be able to see exactly where in the code the double locking happens for you.
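
For context, the locking debug is nothing fancy: it basically wraps the mutex operations so that every lock and unlock also prints where it happened, which makes the last lock without a matching unlock stand out in the log. A hypothetical illustration of the idea (not the actual Janus macros):

/* Hypothetical illustration of lock debugging (not the actual Janus macros):
 * wrap the mutex calls so every lock/unlock prints file, line and function,
 * making the last lock without a matching unlock visible in the log. */
extern volatile int lock_debug;

#define DBG_MUTEX_LOCK(m) do { \
	if(lock_debug) \
		g_print("[lock] %s:%d:%s (%p)\n", __FILE__, __LINE__, __FUNCTION__, (void *)(m)); \
	g_mutex_lock(m); \
} while(0)

#define DBG_MUTEX_UNLOCK(m) do { \
	if(lock_debug) \
		g_print("[unlock] %s:%d:%s (%p)\n", __FILE__, __LINE__, __FUNCTION__, (void *)(m)); \
	g_mutex_unlock(m); \
} while(0)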

@cedricfung
Author

Thanks. I just enabled the lock_debug, and will wait for this to happen again.

BTW: when will the new refcount branch be considered stable?

@lminiero
Member

lminiero commented Jan 5, 2017

As soon as it is! 😉
Joking aside, there are people testing it, and we try to do that whenever we can too. If you're interested in giving it a go in some of your application scenarios, it will definitely help move it forward faster!

@cedricfung
Author

Will try it soon. I also just noticed another issue: if I start a WebSocket connection, close it, and then open another connection soon after, a timeout message with the old session ID is sent to the new connection.

I'm not sure if this is a bug, because I can deal with it in client code.

@lminiero
Member

lminiero commented Jan 5, 2017

Yeah, that's something we experienced too; it happens when a new connection arrives as soon as the previous one ends and the two share the same "pointer". In that case, Janus might be a little too slow to notify the timeout, use the pointer it had, and so the event gets to the new connection instead of the old one. Again, something that won't happen with the reference counters branch.

To solve that issue in the short term, we might simply add a map in the transport, so that we always feed the core with a different pointer no matter what it points to, as it would be mapped to the actual transport structure.
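
A rough sketch of that indirection, just to illustrate the idea (the structure and function names are hypothetical, not the actual transport code):

/* Hypothetical sketch of the indirection described above: the core is always
 * handed a freshly allocated opaque handle, never the lws pointer itself, so
 * a reused lws pointer can't be mistaken for the previous connection. */
typedef struct ws_transport_handle {
	void *wsi;           /* the libwebsockets connection this handle maps to */
	gboolean gone;       /* set when the underlying connection is closed */
} ws_transport_handle;

static GHashTable *ws_handles = NULL;    /* handle -> ws_transport_handle */
static janus_mutex ws_handles_mutex;

static ws_transport_handle *ws_handle_new(void *wsi) {
	ws_transport_handle *h = g_malloc0(sizeof(ws_transport_handle));
	h->wsi = wsi;
	janus_mutex_lock(&ws_handles_mutex);
	if(ws_handles == NULL)
		ws_handles = g_hash_table_new(NULL, NULL);
	g_hash_table_insert(ws_handles, h, h);
	janus_mutex_unlock(&ws_handles_mutex);
	return h;    /* this pointer, not wsi, is what the Janus core sees */
}

static void ws_handle_closed(ws_transport_handle *h) {
	janus_mutex_lock(&ws_handles_mutex);
	h->gone = TRUE;      /* late events (e.g., timeouts) can then be dropped */
	janus_mutex_unlock(&ws_handles_mutex);
}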

@lminiero
Member

lminiero commented Feb 8, 2017

@Vecio any update on the lock debugging?

@cedricfung
Author

It hasn't occurred any more; I have been running the service for a month.

@lminiero
Member

lminiero commented Feb 9, 2017

Ok, closing the issue then. Feel free to reopen should this reappear.

@lminiero lminiero closed this as completed Feb 9, 2017
@shrhoads

shrhoads commented Feb 24, 2017

I don't have a button to reopen this issue, but I am having this issue with the latest master code. It happens about once a day: Janus doesn't crash, but it no longer accepts WebSocket connections. I'm using the latest libwebsockets; when I build, it says it's using the "new API".

Is there an alternative library I can use to work around this issue?

I'm going to enable lock_debug. I'm also only using the SIP plugin.

BTW, on GitHub, if a contributor (lminiero) closes an issue, only they can reopen it.

@lminiero
Member

BTW, on GitHub, if a contributor (lminiero) closes an issue, only they can reopen it.

@shrhoads didn't know about this, reopening then. If you can provide some lock_debug info or find a way to replicate this, it would definitely be helpful. My guess is that the libwebsockets context stops generating events for some reason, e.g., some problem in lws_service. You can set more verbose logging for WebSockets themselves here, which might give you more info when things stop working.

@lminiero lminiero reopened this Feb 24, 2017
@lminiero
Member

As a side note, you can try changing the LOG_HUGE here into something that's included in your debugging level (e.g., LOG_VERB if you're using level 5, or LOG_INFO for level 4). It will spam your logs much more, but, again, missing events after the lock would confirm that events are not flowing anymore.
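
To be clear, the change is a one-liner: the lws log lines are just forwarded to the Janus logger through a small callback, so it's a matter of bumping the level there. Something along these lines (the callback shown here is an approximation, not the verbatim code):

/* Approximation of the lws log forwarding callback (not the verbatim code):
 * bump LOG_HUGE to a level included in your configured debugging level. */
static void janus_websockets_log_emit_function(int level, const char *line) {
	/* was: JANUS_LOG(LOG_HUGE, "[libwebsockets] %s", line); */
	JANUS_LOG(LOG_VERB, "[libwebsockets] %s", line);   /* or LOG_INFO for level 4 */
}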

@cedricfung
Author

This issue occurred again for me; I will try to get some useful logs.

@cedricfung
Author

Recently I have been using the latest refcount branch, and this issue no longer occurs.

@zombiecong

Same issue with the latest master code: it happens when I build Janus on Mac, while it works well on Ubuntu.

@shrhoads

shrhoads commented Mar 9, 2017

It does happen on Ubuntu too, although it hasn't happened again for 1.5 weeks so far.

@shrhoads

shrhoads commented Mar 9, 2017

If you can readily reproduce it on Mac, can you add the logging etc. suggested above?

@zombiecong

zombiecong commented Mar 10, 2017

It happens on Mac every time, on every request; I tested both the old API and the new API of libwebsockets. The HTTP transport works well.

@shrhoads This is the lock_debug log: https://gist.github.com/zombiecong/e1417c8001b12c8478ed5658d4181d32

I sent a Janus "create" request and there was no response from the server. After a moment, I saw the session "timeout" log and the session was released.

BTW: I built some of the dependency libraries myself, not via Homebrew.

@lminiero
Member

Sorry, missed this notification, apparently... bridging #798 here as well, as it's the same issue (SIP or VideoRoom are not the issue).

@lminiero
Member

I don't think it's a locking issue in Janus, as you get to the Bye! with no problem, which I believe wouldn't happen if it were stuck somewhere. Maybe a libwebsockets/macOS specific issue? Try enabling a higher libwebsockets debugging level instead to see if it gives any insight: https://github.com/meetecho/janus-gateway/blob/master/conf/janus.transport.websockets.cfg.sample.in#L14
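
For reference, that's the ws_logging setting in janus.transport.websockets.cfg; assuming the .cfg format in use at the time, something like this enables everything (4095 is the full bitmask, as used later in this thread):

[general]
ws_logging = 4095    ; libwebsockets logging bitmask, 4095 = all levels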

@sslivins

I am also hitting the deadlock issue, running OpenSSL 1.0.1e and libwebsockets 2.2.0. I tried disabling secure WebSockets and just used an unencrypted WebSocket with TLS termination handled via nginx, but still no luck. I can reproduce this issue at least once per day. I will try some more logging in libwebsockets and post an update.

@atyenoria

Same issue with the latest refcount branch: libwebsockets 2.2.0, OpenSSL 1.0.1t, Debian jessie Docker container on Ubuntu 16.14. When this happens, HTTP requests stop working too, in addition to WebSocket requests.

@lminiero
Member

Wait, if HTTP requests fail too, then it is indeed a lock in Janus causing this. Can the others confirm? I was under the assumption that the HTTP API was still working when this happened.

@cedricfung
Author

I can confirm that the issue is also in the latest refcount branch.

@lminiero
Member

I won't be able to do any tests before the IETF meeting ends. If you guys can find ways to replicate this in the meanwhile, so that I can check for myself when I get back, that would help.

@sslivins

I was able to reproduce this with libwebsockets logging at the highest level (4095): https://www.dropbox.com/s/3wwqlhp0ksk4e4f/janus_deadlock.log

I have an external process that hits Janus over wss on localhost every minute; this is visible in the log by looking for the message: Message for Session 278: {"id": 0, "request": "watch"}

The first one appears at 20:54:32; there should have been another one at 20:55:32 and another one a minute after that. My watchdog thread kills Janus with an exit status of 99 if it doesn't see a new session request for 2 minutes, which was the case after the request at 20:54:32.

@lminiero
Member

lminiero commented Apr 10, 2017

I can't seem to replicate the issue. Please, if you have an easy way to consistently replicate it, provide the steps here so that I can check whether it happens for me as well. Logs with the locking debug enabled (either via configuration or via the Admin API) when the issue happens would also help me identify which lock is being held and not released, if any.

@lminiero
Member

Ah, the irony! I had managed to quickly replicate this with the tiny stresser tool I implemented, but as soon as I added the same lines I suggested to you, of course Janus stopped deadlocking... keeping it running to annoy Janus, hoping it will happen again soon 😉

@lminiero
Member

Finally got it to happen, and from a first look at the log it seems the cause there was the lws_callback_on_writable call in janus_websockets_send_message (after the last lock without a matching unlock, I can see send_message:Post-push-Pre-writable but not send_message:Post-writable).

I'll do another round just to check whether it happens at the same point, and then I'll evaluate.
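
For those following along, the suspicious path is roughly this pattern in janus_websockets_send_message (simplified, with the debug markers mentioned above; not the verbatim transport code):

/* Simplified sketch of the suspected path in janus_websockets_send_message,
 * with the debug markers mentioned above (not the verbatim transport code). */
janus_mutex_lock(&client->mutex);
/* ... */
g_async_queue_push(client->messages, payload);        /* queue the outgoing message */
JANUS_LOG(LOG_HUGE, "send_message:Post-push-Pre-writable\n");
lws_callback_on_writable(client->wsi);                 /* ask lws to schedule a WRITEABLE callback */
JANUS_LOG(LOG_HUGE, "send_message:Post-writable\n");   /* never printed when the hang occurs */
janus_mutex_unlock(&client->mutex);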

@lminiero
Member

lminiero commented May 1, 2017

Sorry for the delay, but it will take some time before I can look into this. Today is Labor Day (National holiday in Italy) and tomorrow I'll fly abroad for an event for a few days. Anyway, I did some more dumps of my own too, so I have material to investigate when I can.

@lminiero
Member

Just came back to the office after attending two different events, so sorry for the lack of feedback. I'll take a look at the logs between today and tomorrow to check whether I can get to the real source of the issue.

@lminiero
Member

Contacted the lws developers for info on what might cause lws_callback_on_writable never to return, and they said locking is only used if the library is compiled with an LWS_MAX_SMP value different from 1, the default.

I'm now trying to attach gdb to Janus as soon as it deadlocks, but of course I've been stressing it for more than an hour and nothing has happened yet... I'm not sure if I'm using a different library than when I could replicate the issue: now I'm using the version Fedora 25 ships in its repos, which apparently has LWS_MAX_SMP=1. One thing you guys might want to check is what LWS_MAX_SMP is set to in cmake in your build environment.

I'll keep you posted in case I manage to deadlock it again and get some gdb info.
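
One quick way to check: LWS_MAX_SMP ends up as a macro in the installed lws headers, so a tiny test program built against your libwebsockets will print the value Janus would see (a sketch, assuming pkg-config knows about your libwebsockets install):

/* Minimal check of the LWS_MAX_SMP value exposed by the installed headers.
 * Build with: gcc check_smp.c $(pkg-config --cflags libwebsockets) -o check_smp */
#include <stdio.h>
#include <libwebsockets.h>

int main(void) {
#ifdef LWS_MAX_SMP
	printf("LWS_MAX_SMP = %d\n", LWS_MAX_SMP);
#else
	printf("LWS_MAX_SMP is not defined by these headers\n");
#endif
	return 0;
}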

@connectarena

If we use libwebsockets with a single thread, won't that introduce extra delay? Anyway, we started using libwebsockets with a single thread, and it has been running without deadlocks for 36 hours.

@lminiero
Member

libwebsockets is conceived to be single threaded, so there shouldn't be any issue. If you think about it, node.js is single threaded as well, and there's usually no problem in basing HTTP/WS services on top of that. Considering the WS support in the Janus API is only needed for signalling purposes (no really intensive traffic), I don't expect that to be an issue.

Thanks for the feedback on the lack of deadlocks in that case!

@lminiero
Member

@Vecio @shrhoads @zombiecong @sslivins @atyenoria just pinging you guys as you contributed to the issue with feedback on this happening to you. As explained in a previous note, which addressed a comment from the lws developers, just building your library with LWS_MAX_SMP=1 in cmake should fix this, so please let me know if that's not the case.
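
For anyone landing here later, that just means passing the flag when configuring libwebsockets before building it, along these lines (the source checkout path and install step are just examples):

# Example: rebuild libwebsockets with the single-threaded setting
cd libwebsockets
mkdir -p build && cd build
cmake -DLWS_MAX_SMP=1 ..
make && sudo make install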

@sslivins

I will try this today, @lminiero, and let you know the results over the next 24h... thanks again for tracking this down.

@sslivins

BTW, the default value of LWS_MAX_SMP is 32 in libwebsockets 2.2.

@lminiero
Member

lminiero commented May 12, 2017

Weird, the lws developer told me the default was 1. My Fedora ships a 2.1 rpm in the repos, and I couldn't replicate the issue with that, which makes me think that one is using 1 (seems to be confirmed by looking at the cmake file).

@sslivins

sslivins commented May 12, 2017

Again, this is a build from source and I don't pass anything explicit to cmake... it also corresponds with their docs: https://libwebsockets.org/lws-api-doc-master/html/md_README_8coding.html

@sslivins

So, just to update things: I am still experiencing some deadlocks, but they are many, many orders of magnitude fewer. I still need to evaluate whether these have the same root cause or a different one; I'll get you more info on Monday.

@atyenoria

For me, it no longer happens. It seems to be solved.
[spec]
VideoRoom with the latest refcount commit (4daf434). libwebsockets built with cmake and LWS_MAX_SMP=1 (the default is 32).
All 20 Chrome tabs (3 PCs) publishing their own video and audio; I tried to disconnect and reconnect all of them at the same time about ten times. No crash and no WebSocket deadlock.

In the past, I can confirm this happened frequently.

@lminiero
Member

Something weird I just noticed... I wanted to add a check for LWS_MAX_SMP in the plugin code, so that if it's different from 1 we can print a warning, but when I print it out it's 32 for me! Which means that my Fedora-shipped version, which is 2.1.0, is still using the default value of 32, and yet I couldn't replicate the issue anymore? Maybe it's only an issue with 2.2 and LWS_MAX_SMP > 1, then?
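
The check itself would just be a compile-time guard at transport init, something like this (a sketch with a hypothetical helper name, not necessarily what will end up committed):

/* Sketch of the init-time warning described above: LWS_MAX_SMP comes from the
 * libwebsockets headers Janus is compiled against, so a value above 1 can be
 * flagged at startup. Helper name is hypothetical. */
static void janus_websockets_check_smp(void) {
#if defined(LWS_MAX_SMP) && (LWS_MAX_SMP > 1)
	JANUS_LOG(LOG_WARN, "libwebsockets was built with LWS_MAX_SMP=%d: if you hit "
		"WebSocket deadlocks, try rebuilding it with LWS_MAX_SMP=1\n", LWS_MAX_SMP);
#endif
}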

@idubinets

I faced the same issue. LWS_MAX_SMP is set to 32 by default. I rebuilt the library with LWS_MAX_SMP = 1.
Let's see how it works; I will let you know.

@IvRRimum

IvRRimum commented Jun 4, 2017

Adding to this discussion:

I am using WebSockets; I recompiled libwebsockets with LWS_MAX_SMP=1 and I still managed to get the deadlock.

But here's the weird behaviour of the deadlock: I am using 2 plugins in my Janus installation:

  1. audiobridge
  2. videoroom

When a deadlock happens in the audiobridge, I can still access and use the videoroom. I don't know if this is related, though. Also, I wanted to add that I am using the janus-gateway version from Dec 20, 2016.

I ran the LOCK/DEADLOCK script and got this: https://pastebin.com/zAa4iNLC

How to reproduce: Create 6 Chrome tabs and join a single room. Do the same on PC 2, and connect from a 3rd device (in my case a mobile app). Run an automation script that does: refresh -> join room -> unmute for 2 seconds -> refresh -> join...

After a while the audiobridge will deadlock.

EDIT: After further crashing, the first lines of the LOCK/UNLOCK script output change depending on where it deadlocked (next time it's not in my room_helper:radio_thread, but in list_rooms, etc.). So that probably means it deadlocks earlier and then tries to lock an already locked mutex again.

@lminiero
Member

lminiero commented Jun 4, 2017 via email

@IvRRimum

IvRRimum commented Jun 4, 2017

Hey @lminiero!

Yeah, makes sense. Created an issue here.

@shrhoads

Just reporting in on this: LWS_MAX_SMP=1 seems to have solved the issue for me.

@lminiero
Member

Closing then, thank you all for the feedback!
