Adding a server component for running multiple workers #1838

fozziethebeat · 2023-07-03T01:41:05Z

Why are these changes needed?

This adds a new server component that lets clients run multiple models on the same worker instance. With the new PeftModelAdapter and an eventual fix for huggingface/peft#430, this server component lets clients run multiple adapters that share the same base model weights while loading those weights only once.

As of right now this is not fully optimized: it still loads the base model weights once per configured model. Loading them only once is blocked on the Peft issue.
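
To illustrate the intended end state (base weights loaded once and shared by several adapters), here is a minimal sketch in plain Python. It does not use the Peft or FastChat APIs; `BaseWeights`, `AdapterModel`, `SharedBaseRegistry`, and the model names are hypothetical stand-ins chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BaseWeights:
    """Hypothetical stand-in for a large set of base model weights."""
    name: str


@dataclass
class AdapterModel:
    """A small adapter that references, rather than copies, the base weights."""
    adapter_name: str
    base: BaseWeights


@dataclass
class SharedBaseRegistry:
    """Loads each base model at most once and shares it across adapters."""
    load_count: int = 0
    _bases: dict = field(default_factory=dict)

    def _load_base(self, base_name: str) -> BaseWeights:
        # Only the first request for a given base pays the loading cost.
        if base_name not in self._bases:
            self.load_count += 1
            self._bases[base_name] = BaseWeights(base_name)
        return self._bases[base_name]

    def load_adapter(self, adapter_name: str, base_name: str) -> AdapterModel:
        return AdapterModel(adapter_name, self._load_base(base_name))


registry = SharedBaseRegistry()
a = registry.load_adapter("adapter-a", "pythia-base")  # hypothetical names
b = registry.load_adapter("adapter-b", "pythia-base")
```

In this sketch both adapters hold a reference to the same `BaseWeights` object, so `registry.load_count` stays at 1. The current PR state corresponds to the count growing with each configured model.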

Related issue number (if applicable)

Implements #1805 (maybe fixes?)

Checks

I've run format.sh to lint the changes in this PR.
I've included any doc changes needed.
I've made sure the relevant tests are passing (if applicable).

fozziethebeat · 2023-07-03T01:41:49Z

fastchat/serve/openai_api_server.py

@@ -135,15 +135,15 @@ async def check_length(request, prompt, max_tokens):
        response = await client.post(
            worker_addr + "/model_details",
            headers=headers,
-            json={},
+            json={"model": request.model},


Note: I found all the spots that the OpenAI server calls the worker and ensured it includes model. Are there other places that need this fix?

It seems you fixed all places. #1858 also applied some of your changes. Please rebase.
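
As a rough sketch of why the request body needs the model name once one process hosts several models: the worker side has to pick which model should answer. The names below (`FakeWorker`, `worker_map`, `handle_model_details`) and the shape of the returned dictionary are assumptions for illustration, not the code in this PR.

```python
from typing import Any, Dict


class FakeWorker:
    """Hypothetical stand-in for one model's worker."""

    def __init__(self, model_name: str, context_len: int) -> None:
        self.model_name = model_name
        self.context_len = context_len

    def model_details(self) -> Dict[str, Any]:
        # Assumed response shape, for illustration only.
        return {"context_length": self.context_len}


# One process, several workers, keyed by the model name clients send.
worker_map = {
    "adapter-a": FakeWorker("adapter-a", 2048),
    "adapter-b": FakeWorker("adapter-b", 4096),
}


def handle_model_details(payload: Dict[str, Any]) -> Dict[str, Any]:
    # With an empty payload ({}), there is no way to pick a worker,
    # which is why the request body needs the "model" key.
    model = payload.get("model")
    if model not in worker_map:
        raise KeyError(f"unknown or missing model: {model!r}")
    return worker_map[model].model_details()
```

A single-model worker can ignore the payload entirely, which is presumably why `json={}` worked before this change.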

fozziethebeat · 2023-07-03T01:42:56Z

fastchat/serve/model_multi_worker.py

+    return background_tasks
+
+
+# Note: for all the calls below, we make a hard assumption that the caller


I confirmed this works with two Pythia Peft models. The changes feel a little wonky and maybe fragile. Suggestions for an alternative strategy?

I think it is okay for now.

fozziethebeat · 2023-07-03T01:50:01Z

Please leave any review comments, especially on the broad strategy for doing this. I took the simplest strategy I could see.

I only marked this as draft to ensure it doesn't get merged too quickly, as I'm pretty sure there are hidden bugs.

fozziethebeat · 2023-07-03T01:52:46Z

fastchat/serve/openai_api_server.py

@@ -714,7 +714,7 @@ async def create_chat_completion(request: APIChatCompletionRequest):
    if error_check_ret is not None:
        return error_check_ret

-    gen_params = get_gen_params(
+    gen_params = await get_gen_params(


This tripped me up when manually testing. Seems this API route doesn't get used by many?

Yes. The default path does not use this API route. Some users contributed it for their own special use cases.
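
For context on why the missing `await` breaks this route, here is a small self-contained sketch. It assumes, as the added `await` implies, that `get_gen_params` is a coroutine function; `get_gen_params_sketch` and its return value are hypothetical stand-ins.

```python
import asyncio
import inspect


async def get_gen_params_sketch(model: str) -> dict:
    # Hypothetical stand-in; pretend this awaits a worker for model details.
    await asyncio.sleep(0)
    return {"model": model, "max_new_tokens": 16}


async def main() -> tuple:
    # Without await: a coroutine object, not the dict the caller expects.
    not_awaited = get_gen_params_sketch("adapter-a")
    is_coroutine = inspect.iscoroutine(not_awaited)
    not_awaited.close()  # avoid a "never awaited" RuntimeWarning

    # With await: the actual parameters.
    params = await get_gen_params_sketch("adapter-a")
    return is_coroutine, params


is_coroutine, params = asyncio.run(main())
```

Any later code that indexes into the un-awaited result would fail, since a coroutine object is not subscriptable. That matches a bug that only shows up when the route is actually exercised.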

fozziethebeat · 2023-07-05T10:07:29Z

Seems there are no major redesign comments, so I'm marking this as ready for review. I'll be testing this in a docker setup tomorrow to 100% verify it works.

Ying1123

Thanks for the contribution. I think the overall design looks good and we can use this as a starting point.

Could you follow our recent refactor in #1858 and try to reuse as much code as possible? For example, could you import ModelWorker or BaseModelWorker from model_worker.py?

We are still iterating on the interface design of the model worker. The current design may not work for your Peft models.
Feel free to propose any changes that can make it better for Peft models and reduce the redundant code.

Ying1123 · 2023-07-05T10:06:00Z

fastchat/serve/openai_api_server.py

@@ -135,15 +135,15 @@ async def check_length(request, prompt, max_tokens):
        response = await client.post(
            worker_addr + "/model_details",
            headers=headers,
-            json={},
+            json={"model": request.model},


It seems you fixed all places. #1858 also applied some of your changes. Please rebase.

Ying1123 · 2023-07-05T10:07:46Z

fastchat/serve/openai_api_server.py

@@ -714,7 +714,7 @@ async def create_chat_completion(request: APIChatCompletionRequest):
    if error_check_ret is not None:
        return error_check_ret

-    gen_params = get_gen_params(
+    gen_params = await get_gen_params(


Yes. The default path does not use this API route. Some users contributed it for their own special use cases.

Ying1123 · 2023-07-05T10:08:54Z

fastchat/serve/model_multi_worker.py

+    return background_tasks
+
+
+# Note: for all the calls below, we make a hard assumption that the caller


I think it is okay for now.

Ying1123 · 2023-07-05T10:10:37Z

fastchat/serve/model_multi_worker.py

+
+# Note: For now the semaphore locks access to all models managed by the worker.
+# This makes sense when all models are Peft models sharing the same underlying
+# base model weights.  It probably doesn't make sense in other scenarios.


I think it also makes sense in other scenarios.
This semaphore is used to limit concurrency and prevent OOM.
In other scenarios, if multiple workers share the same GPU, they can be put under the same semaphore.

Great, that was also my thinking. These ultimately all share the same queue.
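
A minimal sketch of the idea discussed above: one semaphore shared by every model in the process caps total concurrency, regardless of which model a request targets. The function names, the limit, and the sleep standing in for GPU work are illustrative assumptions, not the worker's actual code.

```python
import asyncio


async def run_demo(limit: int, requests_per_model: int) -> int:
    # One semaphore shared by every model in the process.
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def generate(model: str) -> None:
        # `model` only labels the request; all models share the same limit.
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for GPU work
            in_flight -= 1

    tasks = [
        generate(model)
        for model in ("adapter-a", "adapter-b")
        for _ in range(requests_per_model)
    ]
    await asyncio.gather(*tasks)
    return peak


peak = asyncio.run(run_demo(limit=2, requests_per_model=5))
```

Here ten requests across two models never run more than two at a time, which is the property that helps prevent OOM when the models share one GPU.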

fozziethebeat · 2023-07-05T11:06:34Z

Rebase done, will need to manually test tomorrow and verify nothing breaks

merrymercy · 2023-07-05T11:23:30Z

Sorry that we do not have CI set up yet.
If you want to test, you can contribute some unit tests under https://github.com/lm-sys/FastChat/tree/main/tests
Now we use these commands to test the OpenAI API server https://github.com/lm-sys/FastChat/blob/main/docs/commands/test_process.md#test-openai-api-server

fozziethebeat · 2023-07-06T00:36:53Z

With the fix I just added, tests pass!

Ying1123 · 2023-07-06T01:34:36Z

@fozziethebeat Could you use git rebase rather than merge? The history is messed up on GitHub: https://github.com/lm-sys/FastChat/pull/1838/files.

fozziethebeat · 2023-07-06T01:41:33Z

Yeah, I tried, and I think I screwed things up when I merged some changes made via the GitHub UI. I'm actually really bad at rebasing; what commands do you suggest?

Something like

git pull --rebase upstream main
git checkout multi_model_worker
git rebase main

?

Ying1123 · 2023-07-06T01:52:58Z

For the current case, I suggest copying the files you changed (only 3) and creating a new branch from main. Then you have two options:

Delete your local branch multi_model_worker, copy the new branch to multi_model_worker, then force push.
Create a new pull request and link back.

For your reference, when the history is not messed up, I normally use the following commands to rebase:

git fetch upstream main
git rebase upstream/main

Then follow the instructions displayed by git status to resolve conflicts. After fixing all conflicts, run:

git push --force

fozziethebeat · 2023-07-06T02:06:07Z

Yeah, looking at the git history, fixing this is too messy.

Here's the replacement: #1866

Adding a model multi-worker serve component that is dedicated to running multiple models on the same machine process

e289e98

fozziethebeat commented Jul 3, 2023

Ying1123 self-assigned this Jul 3, 2023

Ying1123 self-requested a review July 3, 2023 01:46

Ying1123 added the enhancement New feature or request label Jul 3, 2023

Adding documentation changes

ce7df04

fozziethebeat commented Jul 3, 2023

infwinston and others added 5 commits July 3, 2023 22:03

Add model support and fix bug (#1818)

6d06351

Fix Multi GPU for GPTQ quantized models (#1820)

5f5a9d7

Update MT bench and arena data (#1854)

7ad1d63

revise split threading logic to avoid thread stuck when data volume i… (#1857)

fcd1c63

Add a base class for model workers (#1858)

a10cb06

Ying1123 reviewed Jul 5, 2023

thelinuxkid and others added 7 commits July 5, 2023 02:24

Make vicuna 7b default in the docker example (#1846)

c4c6403

Update fastchat/serve/model_multi_worker.py

a1d52d8

Co-authored-by: Ying Sheng <[email protected]>

Update fastchat/serve/model_multi_worker.py

a503b3d

Co-authored-by: Ying Sheng <[email protected]>

Update fastchat/serve/model_multi_worker.py

074ef0a

Co-authored-by: Ying Sheng <[email protected]>

rearranging doc args

099300c

Add compute agreement (#1855)

b2f187c

Renaming the multi model worker

7f50ccd

fozziethebeat marked this pull request as ready for review July 5, 2023 10:07

Ying1123 reviewed Jul 5, 2023

fozziethebeat and others added 5 commits July 5, 2023 20:01

Adding a model multi-worker serve component that is dedicated to running multiple models on the same machine process

fe69429

Adding documentation changes

adf7421

Update fastchat/serve/model_multi_worker.py

d9f9485

Co-authored-by: Ying Sheng <[email protected]>

Update fastchat/serve/model_multi_worker.py

ef4059a

Co-authored-by: Ying Sheng <[email protected]>

Update fastchat/serve/model_multi_worker.py

9e74fe4

Co-authored-by: Ying Sheng <[email protected]>

fozziethebeat added 4 commits July 5, 2023 20:01

rearranging doc args

826add9

Renaming the multi model worker

8eab4a3

Merge branch 'multi_model_worker' of github.com:fozziethebeat/FastChat into multi_model_worker

15c7855

Using ModelWorker from model_worker.py to reduce code duplication

087a83c

Ensure stream interval is set by the ModelWorker class when used elsewhere

4eecf95

Ying1123 previously approved these changes Jul 6, 2023

fozziethebeat mentioned this pull request Jul 6, 2023

Adding a server component for running multiple models on a single model worker. #1866

Merged

3 tasks

fozziethebeat closed this Jul 6, 2023
