Janus crash libwebsockets #1627

Closed
danflu opened this issue May 16, 2019 · 36 comments
@danflu

danflu commented May 16, 2019

Hi, Janus is crashing randomly when using websockets and videoroom plugin.
The stacktrace is provided below.

When I query the git HEAD version with the command:
git rev-parse HEAD

It returns:
ddde5e2

Any ideas? Thanks a lot!

`[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by /opt/janus/bin/irm-janus.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f5595dcca6e in rops_callback_on_writable_ws () from /usr/lib/libwebsockets.so.14
[Current thread is 1 (Thread 0x7f55bd8a2700 (LWP 8129))]
(gdb) bt


#0  0x00007f5595dcca6e in rops_callback_on_writable_ws () from /usr/lib/libwebsockets.so.14
#1  0x00007f5595dc4cc6 in lws_callback_on_writable () from /usr/lib/libwebsockets.so.14
#2  0x00007f55ac0ed13b in janus_websockets_send_message (transport=0x7f5576bc3120, request_id=<optimized out>, 
    admin=<optimized out>, message=0x7f5578016940) at transports/janus_websockets.c:904
#3  0x0000556814679e28 in janus_process_success (request=0x7f5576bc6f60, payload=0x7f5578016940) at janus.c:2247
#4  0x000055681467c59d in janus_process_incoming_request (request=0x7f5576bc6f60) at janus.c:1005
#5  0x0000556814681db8 in janus_transport_requests (data=<optimized out>) at janus.c:2649
#6  0x00007f55c30de3d5 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#7  0x00007f55c1bbf494 in start_thread (arg=0x7f55bd8a2700) at pthread_create.c:333
#8  0x00007f55c1901acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97`
@lminiero
Member

Try compiling with libasan, as that will give much more relevant info on the crash: https://janus.conf.meetecho.com/docs/debug

@danflu
Author

danflu commented May 17, 2019

Thanks. I will update Janus to the latest version and recompile with libasan.

@atoppi
Member

atoppi commented May 17, 2019

@atoppi
Member

atoppi commented May 20, 2019

@danflu can you test a Janus version that includes 96a586ae ?

@danflu
Author

danflu commented May 20, 2019

Sure. I will update today to the latest version available for download in github.
If the problem stops happening for, let's say, a week, then it's probably gone. I will keep providing feedback. Thanks a lot for your support.

@atoppi
Member

atoppi commented May 20, 2019

Also, are there any available Janus logs for the crashes?

@danflu
Author

danflu commented May 20, 2019

No... sorry. I will enable logging to file this time (level 4 by default)

@atoppi
Member

atoppi commented May 20, 2019

Level 5 would be better for debugging

@danflu
Author

danflu commented May 20, 2019

OK. It's 5 now. Let's see if the problem happens again... Thanks!

@lminiero
Member

That's clearly an issue with your plugin, or whatever the irm plugin is, not Janus... You're incorrectly using the SDP utils and so it crashes. Closing as not a Janus issue.

@danflu
Author

danflu commented May 21, 2019

Hi.
Yes, there was an issue (a double free of sdp_offer) that caused Janus to crash when the first client connected (because libasan aborted the process).
I fixed it by inspecting libasan output.
The issue I'm trying to reproduce is more subtle, and happens only after some time... Maybe it is related, maybe not...
If it happens again I will put janus log and libasan output...
Thanks...
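
For readers hitting the same kind of libasan abort, here is a minimal sketch of one common way to make releasing an SDP buffer idempotent. The `demo_session` type and both functions are hypothetical illustrations, not Janus or plugin code; the thread does not show how the actual fix was written.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a plugin's per-session state; not a Janus type. */
typedef struct demo_session {
    char *sdp_offer;   /* heap-allocated SDP text owned by the session */
} demo_session;

/* Store a private copy of the offer, releasing any previous one first. */
static int demo_session_set_offer(demo_session *s, const char *sdp) {
    char *copy = malloc(strlen(sdp) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, sdp);
    free(s->sdp_offer);      /* free(NULL) is a no-op */
    s->sdp_offer = copy;
    return 0;
}

/* Release the offer; safe to call more than once. */
static void demo_session_clear_offer(demo_session *s) {
    free(s->sdp_offer);
    s->sdp_offer = NULL;     /* a repeated call now frees NULL, which is harmless */
}
```

Resetting the pointer after `free` only helps when every code path goes through the same owner; if two structures each hold their own copy of the pointer, the ownership itself has to be untangled.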

@danflu
Author

danflu commented May 22, 2019

It is still happening.

version: 0b80a02

There is a situation where janus crashes (apparently when closing a ws connection)
Please... take a look at this:

https://pastebin.com/KaxjnLqP

@lminiero lminiero reopened this May 23, 2019
@lminiero
Member

Be that as it may, this still involves a proprietary plugin, and we have no idea what it does (it may be part of the issue). If you can replicate it with one of the stock plugins, happy to have a look.

@lminiero
Member

PS: mh, looks like in that specific case it crashed while sending a response to a destroy though... anyway, it crashes in rops_callback_on_writable_ws, which is deep within libwebsockets. Have you tried opening an issue on their github repo as well? Feel free to mention this issue here, so that we can track the status there.

@danflu
Author

danflu commented May 23, 2019

Hi Lorenzo, Thanks for your reply.
I created an issue in libwebsockets: warmcat/libwebsockets#1586
Let's see what they say... I guess it is some concurrent race condition that is not protected...

@lminiero
Member

@danflu thanks! I added some implementation considerations there, just in case it helps identifying the root cause of the issue.

@lminiero
Member

@danflu can you give the pull request above a try?

@danflu
Author

danflu commented May 24, 2019

@lminiero, Sure...! Will rebuild janus and test again...
Thanks!

@lminiero
Member

@danflu any update?

@danflu
Author

danflu commented May 27, 2019

Hi @lminiero... it is still happening in the same place.
I will perform some changes in my code to see if that stops happening...
It is surely related to a session being created and destroyed within a very small window of time (around 150 ms) AND a message being sent to the remote user. Maybe I can do something on my side...

@danflu
Author

danflu commented May 27, 2019

I analysed all my code very carefully... I could not spot any error...

I will try putting a sleep just inside the "plugin->destroy_session" notification. I noticed that the create session and destroy session notifications come from different threads. Maybe if they happen at nearly the same time the context is invalidated somehow... or there is just a mismatched retain/release across different threads... really not sure...

Detailed crash log:
https://pastebin.com/QwdU8LZm

@lminiero
Member

create_session will come from whatever thread generated the message, but destroy_session will always come from the same thread that generates setup_media, hangup_media, incoming_rtp, etc.

If the patch didn't help, please report that on the libwebsockets issue and provide them with the info they're asking for, as it means I don't know what (if anything) we're doing wrong.

@danflu
Author

danflu commented May 27, 2019

Ok. It's running with debug lws...

@danflu
Author

danflu commented May 27, 2019

The situation I think is happening (looking at it as a complete noob from the SKY):

Edit: Detailing a bit more...

  1. There is a message from lws (create session)
  2. Janus receives the message (1) from transport and session is created from that message (1)
  3. There is another message from lws (attach)
  4. Janus receives the message (3) from transport and a handle is attached to session
  5. There is an app level message from lws
  6. In the meantime there is a destroy session event (from lws) and the handle is destroyed.
  7. Janus receives the message (5) from transport but the session/handle is already destroyed and it crashes when it tries to reply to the sender...

Not sure about the ordering of the lws events...

This is the pattern I have recognized so far... maybe the web client is creating some kind of edge situation that the server is not prepared to handle.

Edit2: apparently it is not related... it crashed 4 times today and not all situations were explained by the above steps... the only thing that does not change is the stack trace.
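
The hazard described in steps 1 to 7 (a reply sent to a client that was destroyed in the meantime) can be sketched in isolation as follows. This is an illustration of the general guard only, under the assumption that the send path and the destroy path share one lock; `demo_client` and its functions are hypothetical and are not how Janus or libwebsockets track connections. As Edit2 notes, this sequence did not explain every crash.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-connection state shared between threads. */
typedef struct demo_client {
    pthread_mutex_t lock;
    bool destroyed;   /* set once the connection has been torn down */
    int sent;         /* counts replies that were actually handed off */
} demo_client;

static void demo_client_init(demo_client *c) {
    pthread_mutex_init(&c->lock, NULL);
    c->destroyed = false;
    c->sent = 0;
}

/* Destroy path: mark the client as gone under the lock. */
static void demo_client_destroy(demo_client *c) {
    pthread_mutex_lock(&c->lock);
    c->destroyed = true;
    pthread_mutex_unlock(&c->lock);
}

/* Send path: refuse to reply if the client was destroyed in the meantime. */
static bool demo_client_send(demo_client *c) {
    bool ok = false;
    pthread_mutex_lock(&c->lock);
    if (!c->destroyed) {
        c->sent++;    /* placeholder for queueing the real reply */
        ok = true;
    }
    pthread_mutex_unlock(&c->lock);
    return ok;
}
```

The flag only protects against late replies if the memory holding `demo_client` outlives every thread that may still call `demo_client_send`, which in practice usually means pairing it with reference counting.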

@lminiero
Member

@danflu I added another commit to #1638 to try and use lws_cancel_service as suggested by the libwebsockets developer, can you check that again?
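
For context, a rough sketch of the pattern this suggestion points at: threads other than the libwebsockets service thread only record that a connection wants to write and wake the service loop, and the service thread alone asks for the writable callback. In libwebsockets, as far as I understand its documentation, the wake-up is `lws_cancel_service` and the service thread then sees `LWS_CALLBACK_EVENT_WAIT_CANCELLED`, where it can call `lws_callback_on_writable`. The code below simulates that with plain counters and does not call libwebsockets; `demo_service` and its functions are hypothetical, and the actual change in #1638 is not shown in this thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_MAX_PENDING 16

/* Hypothetical queue shared by worker threads and the service thread. */
typedef struct demo_service {
    pthread_mutex_t lock;
    int pending[DEMO_MAX_PENDING];  /* ids of connections that want to write */
    size_t count;
    int wakeups;            /* stands in for waking the service loop */
    int writable_requests;  /* stands in for per-connection writable requests */
} demo_service;

static void demo_service_init(demo_service *s) {
    pthread_mutex_init(&s->lock, NULL);
    s->count = 0;
    s->wakeups = 0;
    s->writable_requests = 0;
}

/* Any thread: queue the request and wake the loop; never touch the connection. */
static int demo_request_write(demo_service *s, int conn_id) {
    int ok = -1;
    pthread_mutex_lock(&s->lock);
    if (s->count < DEMO_MAX_PENDING) {
        s->pending[s->count++] = conn_id;
        s->wakeups++;
        ok = 0;
    }
    pthread_mutex_unlock(&s->lock);
    return ok;
}

/* Service thread only: drain the queue and request the writable callbacks. */
static size_t demo_service_drain(demo_service *s) {
    pthread_mutex_lock(&s->lock);
    size_t n = s->count;
    s->writable_requests += (int)n;  /* one request per queued connection */
    s->count = 0;
    pthread_mutex_unlock(&s->lock);
    return n;
}
```

The point of the design is that the connection object is only ever used from one thread, so a connection closing on the service thread cannot race with a write request coming from a request-handling thread.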

@danflu
Author

danflu commented May 29, 2019

Sure! I will test and provide you feedback!
Thanks

@danflu
Author

danflu commented May 29, 2019

@lminiero to test your changes, should I roll back the changes pointed to by the link below? Since I made them the server has not crashed anymore (24 hours up until now).
warmcat/libwebsockets#1586 (comment)

@danflu
Author

danflu commented May 29, 2019

Maybe I can keep it, since it just prevents the server from crashing due to an assertion...

@lminiero
Member

I think it's a good idea to rollback those changes, just to see if we can avoid the crashes even without updating the library: not everyone will update their libwebsockets installation, so it would be good to know if we have a fix on our side as well.

@danflu
Author

danflu commented May 29, 2019

OK! Will rollback and test again with your changes.

@lminiero
Member

Thanks!

@lminiero
Member

@danflu any update on this? Do you feel we can merge this and consider it a win?

@danflu
Author

danflu commented May 31, 2019

Hi @lminiero no crashes in janus/libwebsockets so far, 48 hours up and running !!!
I'd prefer to wait for a longer time frame (a week or so) before setting off any fireworks, but it's very promising!
Thanks!!!

@lminiero
Member

lminiero commented Jun 3, 2019

Sounds good!
I'll merge then 👍

@lminiero
Member

lminiero commented Jun 3, 2019

Closing, thanks for the precious feedback and, most importantly, for your patience 😄

@lminiero lminiero closed this as completed Jun 3, 2019
@danflu
Author

danflu commented Jun 4, 2019

Thank you! Still up and running :-)
