Memory corruption with Janus running in Docker #2987

Closed
bjelovarduck opened this issue May 26, 2022 · 17 comments

Comments

@bjelovarduck

Recently I started experiencing memory corruption when running Janus in Docker. I tried all versions from 0.11.2 to 1.0.0.1.
I followed your instructions on how to debug Janus, ran it with Address Sanitizer, and posted the output to pastebin: https://pastebin.com/GtfhutH9

@atoppi
Member

atoppi commented May 26, 2022

Please provide a libasan stacktrace from a recent commit.

@bjelovarduck
Author

bjelovarduck commented May 26, 2022

@atoppi Sorry, I'm not sure what you mean by a recent commit.

@atoppi
Member

atoppi commented May 26, 2022

The stack trace you shared is from an unknown 0.11.8 janus version.
Build janus from either the master or 0.x branch, reproduce the issue and share the new trace with the specific janus commit hash in use.
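
For readers following along, one way to capture the commit hash requested here is sketched below. The helper name `janus_commit_hash` and the clone path are assumptions made for illustration; the branch names `master` and `0.x` are the ones mentioned in this thread.

```shell
# Hypothetical helper: print the commit hash of a Janus checkout.
# $1 is the path to a local clone of janus-gateway (assumed to exist).
janus_commit_hash() {
    git -C "$1" rev-parse HEAD
}

# Typical use (the clone path is an assumption):
#   git clone https://github.com/meetecho/janus-gateway.git
#   git -C janus-gateway checkout master   # or: 0.x
#   janus_commit_hash janus-gateway
```

Pasting the printed hash next to the trace ties the report to an exact revision rather than to a version number alone.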

@bjelovarduck
Author

Hm, maybe I am not reading the Janus versions correctly, but here is the link to the commit:
21a5fc9

Isn't that v0.11.8 from the master branch?

@lminiero
Member

> Isn't that v0.11.8 from the master branch?

No, 0.11.8 is almost 4 months old: https://github.com/meetecho/janus-gateway/blob/master/CHANGELOG.md#v0118---2022-02-11

The 0.x branch (the continuation of 0.11.8) is now at 0.12.3, while the master branch (multistream) is at 1.0.3. Please try either the latest 0.x branch, or the latest master branch.

@bjelovarduck
Author

I ran it using master (1.0.3), and here is the link to the logs: https://pastebin.com/vcVcMRvZ
The repro is a bit different (and ends in 2 issues), but let me first explain the setup.

I am running the publisher on the same physical computer as the Docker container running Janus. The publisher is trying to stream video, audio and data to the Janus videoroom plugin. The issue does not reproduce when the publisher is on a different computer or when Janus is running in a datacenter.

With the master build it took 2 runs to reproduce it:

  1. ISSUE 1: The first run failed with Janus returning an empty answer. After the publisher receives the empty answer, it terminates gracefully. I verified with Wireshark that the answer is empty.
  2. ISSUE 2: The second run ends with a memory read access failure.

There is one thread that suggests defining the PORT environment variable to avoid the empty answer, although in that case communication with Janus was over HTTP, not websockets. I'll try it to see if it works. However, it looks like the memory issue is still there.

I'll also try with the 0.12.3 build later and see how it behaves.

@bjelovarduck
Author

I tried with the latest 0.x (0.12.3) and the crash resembles the one I got with 0.11.8. Here is the link to the logs: https://pastebin.com/9A6bA8Q4

@atoppi
Member

atoppi commented May 31, 2022

Both crashes happened deep inside glib, while trying to allocate new memory in order to add an element to a linked list.
I've noticed you are using Docker on Windows; I'm wondering if that could be part of the issue?

@atoppi
Member

atoppi commented May 31, 2022

Could you please try setting the G_SLICE env var before launching janus?

G_SLICE=always-malloc /path/to/bin/janus

@bjelovarduck
Author

Yes, I am running Docker on Windows, with the Docker engine running on WSL2. It was working fine until about 4 weeks ago.
I will try your suggestion and see what happens.

It does sound like a realistic possibility that a memory allocation fails and the system fails after that.

@bjelovarduck
Author

bjelovarduck commented Jun 1, 2022

I tried running Docker on a computer with more memory and the issue still reproduces. Although I can't yet rule out an issue with Docker, running with the G_SLICE setting as you suggested revealed where the issue might be: https://pastebin.com/PMb2grNw
There is a double free at the following location:

    #0 0x7f0f06c217a8 in __interceptor_free (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xde7a8)
    #1 0x7f0f066fccd1 in nice_candidate_free ../agent/candidate.c:92

@atoppi
Member

atoppi commented Jun 3, 2022

Some stack traces are too short so we can't track the source of the first memory free.
Recompile janus adding the following to your current CFLAGS:

-O0 -fno-omit-frame-pointer -g -ggdb3

And start the process adding ASAN_OPTIONS:

ASAN_OPTIONS=fast_unwind_on_malloc=0 G_SLICE=always-malloc /path/to/bin/janus
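
The two steps above could be wrapped roughly as follows. This is only a sketch: the helper names `configure_janus_debug` and `run_janus_debug` are made up for illustration, and it assumes the existing CFLAGS already enable Address Sanitizer, as the earlier traces in this thread suggest.

```shell
# Flags quoted from the comment above, appended to whatever CFLAGS are in use.
# Assumption: the existing CFLAGS already enable Address Sanitizer.
JANUS_DEBUG_CFLAGS="-O0 -fno-omit-frame-pointer -g -ggdb3"

# Hypothetical helper: run from the Janus source tree to reconfigure the build.
# Extra arguments (e.g. --prefix=...) are passed through to ./configure.
configure_janus_debug() {
    ./configure CFLAGS="${CFLAGS:-} ${JANUS_DEBUG_CFLAGS}" "$@"
}

# Hypothetical helper: start a command with the suggested environment.
# $1 is the janus binary; any remaining arguments are passed through.
run_janus_debug() {
    ASAN_OPTIONS=fast_unwind_on_malloc=0 G_SLICE=always-malloc "$@"
}
```

For example, `run_janus_debug /path/to/bin/janus 2> asan.log` would keep the sanitizer report in a file that can be shared.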

@futr

futr commented Jun 4, 2022

If you are using the master branch of libnice, I seem to have experienced a similar crash.

The crash occurred when using commit ff9ee991 or later on the master branch of libnice.

After investigating, I thought that a bug may have been introduced in commit ff9ee991 of libnice.
I have filed a bug report.
https://gitlab.freedesktop.org/libnice/libnice/-/issues/164

I believe it has been fixed in the latest libnice master branch.

The crash no longer occurs in my environment.

Here is the output from Valgrind when Janus crashed in my environment
https://gist.github.com/futr/81fc45d96c45587929e2cf46689d14c2

==1512459== Invalid free() / delete / delete[] / realloc()
==1512459== at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1512459== by 0x487CAF5: nice_candidate_free (candidate.c:92)
==1512459== by 0x15B8F1: janus_ice_candidates_to_sdp (ice.c:3368)
==1512459== by 0x1DFA81: janus_sdp_merge (sdp.c:1562)
==1512459== by 0x18B997: janus_plugin_handle_sdp (janus.c:3952)
==1512459== by 0x186B34: janus_plugin_push_event (janus.c:3574)
==1512459== by 0x248AE810: janus_videoroom_handler (janus_videoroom.c:10777)
==1512459== by 0x4B76AD0: ???? (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==1512459== by 0x527E608: start_thread (pthread_create.c:477)
==1512459== by 0x53B8132: clone (clone.S:95)

@lminiero
Member

lminiero commented Jun 6, 2022

@futr thanks for the heads up on this and for providing a solution already, I wasn't aware of this issue!

@lminiero
Member

lminiero commented Jun 6, 2022

@bjelovarduck can you check with either the latest master of libnice (if the fix was indeed pushed) or with a previous version (e.g., 0.1.18) to see if you still have the issue?
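
A rough sketch of building a specific libnice version for such a check is shown below. The helper name `build_libnice`, the `libnice-src` directory, the `/usr` prefix and the use of `sudo` are all assumptions; inside a Docker image the install step usually runs as root without `sudo`.

```shell
# Hypothetical helper: build and install libnice at a given tag or commit.
# $1 is a git ref such as 0.1.18 (the version suggested above) or master.
build_libnice() {
    ref="$1"
    git clone https://gitlab.freedesktop.org/libnice/libnice.git libnice-src &&
    (
        cd libnice-src &&
        git checkout "$ref" &&
        meson setup build --prefix=/usr &&
        ninja -C build &&
        sudo ninja -C build install
    )
}

# Typical use:
#   build_libnice 0.1.18
```

Janus needs to be rebuilt or at least restarted afterwards so that it actually loads the newly installed library.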

@bjelovarduck
Author

bjelovarduck commented Jun 6, 2022

I checked, and it looks like that was the issue. With libnice version 0.1.18 there is no crash. With libnice version 0.1.19 there is a crash. The date of the change approximately corresponds to the date when I started to see the issue.
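
To check which libnice a given environment has, something like the sketch below could be used. The helper name `libnice_thread_status` and its output labels are made up for illustration, and it only encodes the two results reported in this thread; it says nothing about other versions.

```shell
# Hypothetical helper: classify a libnice version using only the results
# reported in this thread (0.1.18: no crash, 0.1.19: crash).
libnice_thread_status() {
    case "$1" in
        0.1.18) echo "reported-ok" ;;
        0.1.19) echo "reported-crash" ;;
        *)      echo "not-tested-in-thread" ;;
    esac
}

# Typical use, assuming pkg-config and the libnice development files are installed:
#   libnice_thread_status "$(pkg-config --modversion nice)"
```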

@lminiero
Member

lminiero commented Jun 8, 2022

Ack, thanks for confirming. I'll close then 👍

@lminiero lminiero closed this as completed Jun 8, 2022