[0.11.8] Janus Server Crash #2964

Closed
tostra opened this issue Apr 30, 2022 · 8 comments
Labels
legacy Related to Janus 0.x

Comments

tostra commented Apr 30, 2022

What version of Janus is this happening on?
All of them: 0.11.8 and 1.x (on your demo site).

Have you tested a more recent version of Janus too?
Yes, 1.x (on your demo site, today).

Was this working before?
no info.

Is there a gdb or libasan trace of the issue?
no idea

Additional context
Go to the Janus demo site (VideoRoom plugin), press START, then paste the snippet below into the Google Chrome console. Run it at least twice to crash the Janus server. It may not work on the first try; if so, refresh and try again. Sometimes a much smaller loop is enough, possibly because memory is not being cleared, but I have not investigated that much:


for (var i = 0; i < 555; i++) {
    // These assignments and newRemoteFeed() rely on the demo page's own script
    id = parseInt(11);
    mypvtid = parseInt(11);
    myroom = parseInt(11);
    newRemoteFeed(id, 11, 1234);
}

On the latest v0.11.8 I get these system errors:

....
[ERR] [plugins/janus_videoroom.c:janus_videoroom_access_room:2989] No such room (11)
[ERR] [janus.c:janus_transport_requests:3448] Got error 0 (Error creating thread: Resource temporarily unavailable) trying to push task in thread pool...
......
systemd[1]: janus.service: Main process exited, code=killed, status=11/SEGV
systemd[1]: janus.service: Failed with result 'signal'.
systemd[1]: janus.service: Scheduled restart job, restart counter is at 3.
systemd[1]: Stopped Janus WebRTC Server.
systemd[1]: Started Janus WebRTC Server.
.....

tostra added the legacy (Related to Janus 0.x) label Apr 30, 2022
tostra changed the title from "[0.11.8]" to "[0.11.8] Janus Server Crash" Apr 30, 2022
Member

lminiero commented May 3, 2022

That's a known thing, and is not caused by Janus but by not configuring limits properly as explained here:

https://janus.conf.meetecho.com/docs/FAQ.html#ulimit

In fact we checked the logs and I saw the Too many open files error, which means we forgot to update the limits on the demos server ourselves 🤭
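One way to confirm that a raised limit really applies to the running Janus process is to read that process's `/proc/<pid>/limits` file on Linux. The sketch below is not part of Janus: `parseOpenFilesLimit` is a hypothetical helper that extracts the soft and hard "Max open files" values from the text of that file, and the sample text is made up for illustration.

```javascript
// Hypothetical helper (not part of Janus): extract the "Max open files"
// soft and hard limits from the text of /proc/<pid>/limits on Linux.
function parseOpenFilesLimit(limitsText) {
  for (const line of limitsText.split('\n')) {
    // Expected line shape: "Max open files   <soft>   <hard>   files"
    const match = line.match(/^Max open files\s+(\S+)\s+(\S+)/);
    if (match) {
      const toNumber = (v) => (v === 'unlimited' ? Infinity : Number(v));
      return { soft: toNumber(match[1]), hard: toNumber(match[2]) };
    }
  }
  return null; // line not found, e.g. the text did not come from Linux procfs
}

// Made-up sample in the /proc/<pid>/limits layout, used instead of a live read.
const sample = [
  'Limit                     Soft Limit           Hard Limit           Units',
  'Max open files            1024                 524288               files',
].join('\n');

console.log(parseOpenFilesLimit(sample)); // { soft: 1024, hard: 524288 }
```

On a live server the text could come from `fs.readFileSync` on the Janus process's limits file; a soft limit that still reads 1024 after a configuration change would suggest the new setting is not reaching the service.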

Author

tostra commented May 3, 2022

I took a closer look, increased the Linux limit, and ran some tests. Yes, the command run in the browser quickly and completely exhausts the file descriptors (open files): it works until the limit is reached, and 100000+ file descriptors are easily used up and held by one browser, one user, one session. The server crash is a bonus.

I did not find anything to fix this issue after reading about the ulimit settings on Mongo and doing some more searching. I may be wrong, and I really hope that I am.

Looking forward to testing your demo site configuration once you make the changes.

If Janus does not crash, it returns this error once it is out of resources:

{
   "janus": "error",
   "session_id": 551241236603937,
   "transaction": "9lfmao43Y0IV",
   "error": {
      "code": 461,
      "reason": "Couldn't attach to plugin: error '-1'"
   }
}

Member

lminiero commented May 4, 2022

I suspect this is due to the thread pool we use internally to handle messages addressed to plugins. Both the HTTP and WebSockets transports use a single thread for their server functionality, and both then pass incoming requests to the core for processing; the core also has a single thread for processing most of them, with the exception of message, that is requests that are meant to be handled by a plugin. We delegate those to a separate thread because messages can be handled by plugins synchronously or asynchronously: when handled asynchronously, messages are dealt with right away, while synchronous messages can take longer, and could risk keeping the core thread busy for too long, thus keeping other pending requests waiting.

From what I can see, the problem may be that at startup we create a thread pool for that task with no limitation:

tasks = g_thread_pool_new(janus_transport_task, NULL, -1, FALSE, &error);

meaning the core is free to spawn new threads when there are many incoming requests to process. As explained in the g_thread_pool_new documentation, -1 does indeed mean "no limit".

Can you try changing that -1 to a more sensible value, like 100? I'd like to understand whether that helps keep resources more constrained under such heavy usage. If that works, we can make that a configurable property in janus.jcfg.
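The difference between an unbounded and a capped pool can be illustrated outside of GLib. The sketch below is a JavaScript analogy, not Janus code: `createTaskPool` and `demo` are hypothetical names, asynchronous tasks stand in for threads, and `-1` means "no cap" only to mirror the convention described above.

```javascript
// Analogy only (not Janus or GLib code): a task pool that runs at most
// maxWorkers async tasks concurrently and queues the rest; -1 means "no limit".
function createTaskPool(maxWorkers) {
  const queue = [];
  let running = 0;
  let peak = 0; // highest concurrency observed, for inspection

  function next() {
    while (queue.length > 0 && (maxWorkers === -1 || running < maxWorkers)) {
      const { task, resolve, reject } = queue.shift();
      running++;
      peak = Math.max(peak, running);
      Promise.resolve()
        .then(task)
        .then(resolve, reject)
        .finally(() => {
          running--;
          next(); // a worker slot is free: start the next queued task
        });
    }
  }

  return {
    push(task) {
      return new Promise((resolve, reject) => {
        queue.push({ task, resolve, reject });
        next();
      });
    },
    peak: () => peak,
  };
}

// Push 555 slow tasks, the same count as the loop in the report,
// and return the peak concurrency the pool reached.
async function demo(maxWorkers) {
  const pool = createTaskPool(maxWorkers);
  const slowTask = () => new Promise((done) => setTimeout(done, 5));
  await Promise.all(Array.from({ length: 555 }, () => pool.push(slowTask)));
  return pool.peak();
}

demo(100).then((peak) => console.log('capped peak:', peak));
demo(-1).then((peak) => console.log('unbounded peak:', peak));
```

With the cap, the extra work waits in the queue instead of each request claiming its own worker, which is the behaviour the suggested change from -1 to 100 is meant to test.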

I'd rather not add any shaping functionality for incoming traffic in Janus itself, instead. Janus is often used by companies with a server side component that controls an instance, where a single address may create a single session but many handles to orchestrate the users the service is managing, and adding a shaper there could severely impact the performance of Janus in controlled environments. Besides, my guess is that shapers to HTTP/WS traffic could be better implemented, and more easily, in a proxy component like nginx instead, that is before it reaches Janus in the first place.

I'm pretty sure there are also ways to integrate with frameworks like fail2ban, apiban or similar that may help in scenarios like the one you're replicating. Feedback would be welcome, of course.

Author

tostra commented May 4, 2022

I currently use nginx to forward traffic to Janus and to limit connections to the Janus WebSocket, which should be safer. However, once the WebSocket connection is established there is no way to do anything about a case like this one, unless the limit is applied directly in Janus (as long as you use the WebSockets transport provided by Janus). I have done this for another framework, but Janus is currently out of my league.

Anyway, I changed janus-gateway/janus.c and restarted the server. Still no luck, the error is the same.

Hundreds of these system log lines:

May  4 16:46:15 xxx janus[5610]: [ERR] [plugins/janus_videoroom.c:janus_videoroom_access_room:2989] No such room (11)
May  4 16:46:15 xxx janus[5610]: [ERR] [plugins/janus_videoroom.c:janus_videoroom_access_room:2989] No such room (11)
May  4 16:46:15 xxx janus[5610]: [ERR] [plugins/janus_videoroom.c:janus_videoroom_access_room:2989] No such room (11)
May  4 16:46:15 xxx janus[5610]: [ERR] [plugins/janus_videoroom.c:janus_videoroom_access_room:2989] No such room (11)

....
May  4 16:46:18 xxx janus[5610]: Creating new handle in session 7125092888592978: 5106599294213999; 0x7fc170019980 0x7fc1640dfa20
May  4 16:46:18 xxx janus[5610]: [ERR] [ice.c:janus_ice_handle_attach_plugin:1424] [5106599294213999] Got error 0 (Error creating thread: Resource temporarily unavailable) trying to launch the handle thread...
May  4 16:46:18 xxx janus[5610]: Detaching handle from JANUS VideoRoom plugin; 0x7fc1640dfa20 0x7fc164117cf0 0x7fc1640dfa20 0x7fc16411a3f0
May  4 16:46:18 xxx janus[5610]: [ERR] [janus.c:janus_process_incoming_request:1221] Couldn't attach to plugin 'janus.plugin.videoroom', error '-1'
.....
:::::Once the "attacker" unlocks/leaves::::
May  4 16:48:25 xxx janus[5610]: [1907397174046471] Handle and related resources freed; 0x7fc164013000 0x7fc170019980
May  4 16:48:25 xxx janus[5610]: [janus.plugin.videoroom-0x7fc164088da0] No WebRTC media anymore; 0x7fc164088dd0 0x7fc164088f20
May  4 16:48:25 xxx janus[5610]: Detaching handle from JANUS VideoRoom plugin; 0x7fc164088dd0 0x7fc164088da0 0x7fc164088dd0 0x7fc164088f20
.....

I see that a handle is attached for every request but is not detached fully, or at all; otherwise the resources would not build up. In the frontend, I attach the handle, and if I do not call detach, the resource is not freed until I disconnect.
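A client can release such handles itself when the plugin rejects a request, as happens with the "No such room" errors in the log. The sketch below assumes a janus.js-style plugin handle that exposes `detach()` and a plugin message that carries an `error` field on failure; `releaseOnError` and the stub handle are hypothetical names used only for illustration.

```javascript
// Hypothetical client-side helper: detach a plugin handle as soon as the
// plugin reports an error, so failed attempts do not keep holding resources.
// Assumes a janus.js-style handle with detach() and messages that carry an
// `error` field on failure.
function releaseOnError(handle, msg) {
  if (msg && msg.error) {
    handle.detach(); // ask for the handle to be released
    return true;
  }
  return false;
}

// Stub standing in for a real plugin handle, for illustration only.
function makeStubHandle() {
  return {
    detached: false,
    detach() {
      this.detached = true;
    },
  };
}

const stub = makeStubHandle();
releaseOnError(stub, { videoroom: 'event', error: 'No such room (11)' });
console.log(stub.detached); // true
```

In a janus.js client this check would typically sit in the handle's `onmessage` callback. It only helps with well-behaved clients, since a hostile client can simply skip the detach.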

Probably the solution for this would be to limit the number of handles one session user can have, since one handle uses quite a lot of server resources (open files especially), and why would you need hundreds of them?

But that is not the only issue, since I can also abuse the Janus .send() system the same way: once I connect to a handle, I can overwhelm the Janus server and crash it (since there are no limits).
As a solution for another WebSocket RTC server, I implemented a limit in Node.js: disconnect the user once there has been a burst of 100 messages from them.
I already looked into doing this with Janus, but I would need to learn some new languages and the Janus backend itself.

Is this the case?
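The burst limit described above can be sketched as a per-connection counter over a time window. This is not the reporter's actual implementation and not Janus code: `createBurstLimiter` is a hypothetical name, the 100-message limit follows the text, and the one-second window and injectable clock are assumptions made for the example.

```javascript
// Hypothetical per-connection burst limiter (not Janus code): allow at most
// `limit` messages per `windowMs`; the caller disconnects once it trips.
// The 100-message limit follows the text; the 1-second window is an assumption.
function createBurstLimiter(limit = 100, windowMs = 1000, now = Date.now) {
  let windowStart = now();
  let count = 0;

  return {
    // Returns true while the sender is within budget, false once it bursts.
    allow() {
      const t = now();
      if (t - windowStart >= windowMs) {
        windowStart = t; // start a fresh window
        count = 0;
      }
      count++;
      return count <= limit;
    },
  };
}

// Usage sketch with a frozen fake clock: replay the 555-message flood
// and record which message is the first to be refused.
const fakeTime = 0;
const limiter = createBurstLimiter(100, 1000, () => fakeTime);
let refusedAt = -1;
for (let i = 1; i <= 555; i++) {
  if (!limiter.allow()) {
    refusedAt = i;
    break;
  }
}
console.log(refusedAt); // 101
```

In a WebSocket server one limiter would be kept per connection, and the `false` result would be the point at which the server closes that connection.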

Member

lminiero commented May 4, 2022

> Anyways, I changed the janus-gateway/janus.c and restarted the server. Still no luck, error the same.

Did you recompile with a make install after the change? C is not JavaScript, you have to recompile or just changing the code will not do anything.

> Probably the solution for this would be to limit the number of handles one session user can have, since one handle uses pretty much server resources (open files especially) and why you need 100s of them

No, that's not going to happen, because as I said in my previous post it's not uncommon at all to have server-side controllers create a single session and multiple handles on behalf of users that talk to the server via a custom API, and I'm not going to cripple potentially major use cases. Janus was conceived to have a "raw" API that can be used to take advantage of the full potential right away, so that comes with the territory.

One more thing you can try (besides recompiling after the change I suggested) is set the event_loops property in janus.jcfg to a fixed value, e.g., the number of cores on the machine. We added this property to limit the number of threads independently of how many handles are created. Of course, this will only impact the threads-to-PeerConnection ratio: the value I mentioned in my previous post is related to the threads that are spawned to handle incoming messages, and so impact a different part of the core.
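The effect of a fixed number of event loops can be pictured with a small model. The sketch below is an illustration only, not Janus code: `createLoopPool` is a hypothetical name, arrays stand in for loop threads, and the round-robin assignment is an assumption made for the example rather than a description of how Janus distributes handles.

```javascript
// Illustration only (not Janus code): with a fixed number of event loops,
// the thread count stays constant however many handles are created.
// Round-robin assignment is an assumption made for this sketch.
function createLoopPool(loopCount) {
  const loops = Array.from({ length: loopCount }, () => []);
  let nextLoop = 0;

  return {
    // Assign a handle to the next loop and return that loop's index.
    attach(handleId) {
      const index = nextLoop;
      loops[index].push(handleId);
      nextLoop = (nextLoop + 1) % loopCount;
      return index;
    },
    threadCount: () => loops.length,
    handlesPerLoop: () => loops.map((loop) => loop.length),
  };
}

const pool = createLoopPool(4); // e.g. one loop per core on a 4-core machine
for (let id = 0; id < 555; id++) {
  pool.attach(id);
}
console.log(pool.threadCount(), pool.handlesPerLoop());
```

However many handles the flood creates, the number of loops stays at the configured value; what grows is the number of handles each loop has to serve.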

> I already looked, to do this with janus, but would need to learn some new languages and janus backside itself.
>
> Is this the case?

Janus is written in C so yes, proficiency in the language would be needed to contribute.

Author

tostra commented May 4, 2022

> One more thing you can try (besides recompiling after the change I suggested) is set the event_loops property in janus.jcfg to a fixed value, e.g., the number of cores on the machine.

> Can you try changing that -1 to a more sensible value, like 100? I'd like to understand if that indeed helps keeping resources more constrained under such heavy usage. If that works, we can make that a configurable property in janus.jcfg.

The event_loops parameter actually worked against the buildup of descriptors/open files per handle, and the other change worked against the server crash. CPU usage can still be targeted. I tested twice, with and without. UPDATE: after another test, both suggestions are needed for this to work.

So now I only need to work on a solution to disconnect a user who bursts requests at my CPU.

Solution I came up with:
https://github.com/tostra/janus-gateway-backend-request-limit

This is not a bug, but a Janus misconfiguration.

Seriously Lorenzo, Thanks!

Member

lminiero commented May 5, 2022

Thanks for the feedback and for testing! Since you needed both to keep your server in shape, I'll add a configuration property for the number of threads in the task pool: I'll leave the default to -1 to keep Janus working as it does today, but you'll be able to set it to a different value. I'll notify here when it's done.

Member

lminiero commented May 5, 2022

I've just added the property to both master and the 0.x branches, so considering that in conjunction with event_loops it should do what you need, I'll close the issue.

@lminiero lminiero closed this as completed May 5, 2022
mwalbeck pushed a commit to mwalbeck/docker-janus-gateway that referenced this issue May 25, 2022
This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [meetecho/janus-gateway](https://github.com/meetecho/janus-gateway) | patch | `v1.0.1` -> `v1.0.2` |

---

### Release Notes

<details>
<summary>meetecho/janus-gateway</summary>

### [`v1.0.2`](https://github.com/meetecho/janus-gateway/blob/HEAD/CHANGELOG.md#v102---2022-05-23)

[Compare Source](meetecho/janus-gateway@v1.0.1...v1.0.2)

-   Abort DTLS handshake if DTLSv1\_handle_timeout returns an error
-   Fixed rtx not being offered on Janus originated PeerConnections
-   Added configurable property to put a cap to task threads \[[Issue-2964](meetecho/janus-gateway#2964)]
-   Fixed build issue with libressl >= 3.5.0 (thanks [@ffontaine](https://github.com/ffontaine)!) \[[PR-2980](meetecho/janus-gateway#2980)]
-   Link to -lresolv explicitly when building websockets transport
-   Fixed RED parsing not returning blocks when only primary data is available
-   Fixed typo in stereo support in EchoTest plugin
-   Added support for dummy publishers in VideoRoom \[[PR-2958](meetecho/janus-gateway#2958)]
-   Added new VideoRoom request to combine subscribe and unsubscribe operations \[[PR-2962](meetecho/janus-gateway#2962)]
-   Fixed incorrect removal of owner/subscriptions mapping in VideoRoom plugin \[[Issue-2965](meetecho/janus-gateway#2965)]
-   Explicitly return list of IDs VideoRoom users are subscribed to for data \[[Issue-2967](meetecho/janus-gateway#2967)]
-   Fixed data port not being returned when creating Streaming mountpoints with the legacy API
-   Fix address size in Streaming plugin RTCP sendto call (thanks [@sjkummer](https://github.com/sjkummer)!) \[[PR-2976](meetecho/janus-gateway#2976)]
-   Added custom headers for SIP SUBSCRIBE requests (thanks [@oriol-c](https://github.com/oriol-c)!) \[[PR-2971](meetecho/janus-gateway#2971)]
-   Make SIP timer T1X64 configurable (thanks [@oriol-c](https://github.com/oriol-c)!) \[[PR-2972](meetecho/janus-gateway#2972)]
-   Disable IPv6 in WebSockets transport if binding to IPv4 address explicitly \[[Issue-2969](meetecho/janus-gateway#2969)]
-   Other smaller fixes and improvements (thanks to all who contributed pull requests and reported issues!)

</details>

---

### Configuration

📅 **Schedule**: At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, click this checkbox.

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://git.walbeck.it/walbeck-it/docker-janus-gateway/pulls/79
Co-authored-by: renovate-bot <[email protected]>
Co-committed-by: renovate-bot <[email protected]>
2 participants