After multiple model workers start working concurrently for the first time, requests will only be received by one of the workers. #3484

Open
PaulX1029 opened this issue Aug 19, 2024 · 3 comments

Comments

@PaulX1029

I am running a single server.controller that manages 3 model_worker processes, one per GPU. I opened 3 identical server.gradio_web_server instances and submitted the same question to each. On the first round, all 3 gradio_web_server instances stream output at the same time. Once all of that output has finished, I send requests to all three gradio_web_server instances simultaneously a second time, and only one model_worker does any work: only one gradio_web_server shows streaming output, and the GPU utilization shows only one GPU in use. Can anyone tell me what causes this?
Has anyone else run into the same problem?

@surak
Collaborator

surak commented Aug 19, 2024

I have noticed this too. There is a queue which should do a round-robin between the workers, but it’s not working. Thanks for the report.
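
For readers unfamiliar with the term, here is a minimal sketch of what round-robin dispatch between workers is expected to do. It is not FastChat's controller code: the class name `RoundRobinDispatcher`, the method `next_worker`, and the worker addresses are all hypothetical, chosen only to illustrate the behaviour described above.

```python
import itertools
import threading
from collections import Counter


class RoundRobinDispatcher:
    """Hypothetical dispatcher (not FastChat's): hands out workers in rotation."""

    def __init__(self, worker_addresses):
        if not worker_addresses:
            raise ValueError("at least one worker is required")
        # cycle() repeats the worker list endlessly in its original order.
        self._cycle = itertools.cycle(list(worker_addresses))
        # Requests may arrive from several threads at once, so guard the iterator.
        self._lock = threading.Lock()

    def next_worker(self):
        """Return the address of the worker that should take the next request."""
        with self._lock:
            return next(self._cycle)


# Placeholder addresses, one per GPU-bound worker (assumed for illustration).
workers = [
    "http://localhost:31000",
    "http://localhost:31001",
    "http://localhost:31002",
]
dispatcher = RoundRobinDispatcher(workers)

# Two rounds of three simultaneous requests, as in the report.
assignments = [dispatcher.next_worker() for _ in range(6)]
counts = Counter(assignments)
```

With a dispatcher like this, the second round of three requests would land on three different workers again, rather than all on one, which is the behaviour the report expects.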

@PaulX1029
Author

@surak Do you have a plan to fix this? Thanks.

@surak
Collaborator

surak commented Aug 23, 2024

It’s being worked on in #3490.
