
Add docs for Model.create, update default values and fix per_worker concurrency #332

Merged: 11 commits into main on Oct 20, 2023

Conversation

@yunfeng-scale (Collaborator) commented Oct 18, 2023:

[Two screenshots attached to the PR description.]

@yunfeng-scale requested a review from a team on October 18, 2023 20:22
clients/python/llmengine/model.py: outdated review threads (resolved)

checkpoint_path (`Optional[str]`):
-    Path to the checkpoint for the LLM. For now we only support loading a tar file from AWS S3.
+    Path to the checkpoint for the LLM. Can be either a folder (preferred since there's no untar) or a tar file.
Member:

> Can be either a folder

I think we want to be precise and consistent around our guidance relating to remote files/directories. I would suggest cross-checking with the Files API and making sure that the explanation around this makes sense; e.g. we may want to clarify that this is meant to be some remote path that's accessible by the LLM Engine deployment.

> Can be either a folder (preferred since there's no untar)

Might be good to explain why skipping the untar is desirable. Is it cold start time?

Collaborator (author):

Checking the code, I don't think we only support creating endpoints from files created with the Files API.

Yes, skipping the untar is good for cold start time.
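For reference, here is a hedged usage sketch of the guidance being discussed: checkpoint_path is a remote path (for example an S3 location) that must be readable by the LLM Engine deployment, and a folder of weights is preferred over a tar archive because skipping the untar step helps cold start time. Parameter names follow the Model.create docs touched in this PR; the endpoint name, bucket, and resource values below are made up.

```python
from llmengine import Model

# Illustrative sketch only; names and values are placeholders.
response = Model.create(
    name="llama-2-7b-test",
    model="llama-2-7b",
    inference_framework_image_tag="latest",            # framework version tag
    checkpoint_path="s3://my-bucket/llama-2-7b/",      # folder of weights: no untar at startup
    # checkpoint_path="s3://my-bucket/llama-2-7b.tar", # a tar file is also supported
    num_shards=1,
    gpus=1,
    gpu_type="nvidia-ampere-a10",
    min_workers=1,
    max_workers=2,
    per_worker=10,
    labels={"team": "my-team", "product": "my-product"},
)
print(response)
```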

Diff excerpt from the per_worker docstring (existing guidance, followed by the newly proposed description):

    throughput requirements. 2. Determine a value for the maximum number of
    concurrent requests in the workload. Divide this number by ``max_workers``. Doing
    this ensures that the number of workers will "climb" to ``max_workers``.

    Number of Uvicorn workers per pod. Recommendation is set to 2.
Member:

Might want to hide the details of Uvicorn.

Also, I don't think this is correct? IIRC per_worker is basically a scaling-sensitivity parameter. cc @seanshi-scale

Collaborator (author):

Hmm, this actually decides both the number of workers and the HPA target concurrency: https://github.com/scaleapi/llm-engine/blob/main/model-engine/model_engine_server/infra/gateways/k8s_resource_parser.py#L65

I'll make some updates.

Collaborator (author):

I didn't understand why we want to set some ratio between the HPA target and per_worker, so I'm removing the ratio.
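To make the mechanics under discussion concrete, here is a hypothetical sketch (not the actual k8s_resource_parser.py code) of how per_worker could drive both a rough worker count and the HPA concurrency target, including the hidden 2.0 ratio that this PR removes:

```python
import math

# Hypothetical pre-PR behavior as described in this thread; not the real
# llm-engine implementation.
MAX_CONCURRENCY_TO_TARGET_CONCURRENCY_RATIO = 2.0  # constant removed by this PR


def hpa_target_concurrency(per_worker: int) -> float:
    # Concurrency level the Horizontal Pod Autoscaler tries to hold per pod.
    return per_worker / MAX_CONCURRENCY_TO_TARGET_CONCURRENCY_RATIO


def desired_workers(concurrent_requests: int, per_worker: int, max_workers: int) -> int:
    # Sizing heuristic implied by the old docstring: total concurrency divided
    # by per_worker, capped at max_workers.
    return min(max_workers, math.ceil(concurrent_requests / per_worker))


print(hpa_target_concurrency(10))  # 5.0: HPA scales on 5, not the requested 10
print(desired_workers(40, 10, 8))  # 4 workers for 40 concurrent requests
```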

clients/python/llmengine/model.py: five more outdated review threads (resolved)
@@ -2,9 +2,6 @@
 import re
 from typing import Union

-MAX_CONCURRENCY_TO_TARGET_CONCURRENCY_RATIO = 2.0
Member:

Wouldn't this technically change our autoscaling sensitivity? cc @seanshi-scale

Member:

Which might be fine; just want to be aware of production behavior changes.

Collaborator (author):

Yes, this would, but I think people are not aware of this ratio at all (e.g. users specify a per-worker concurrency target of 10 and HPA uses 5).
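To spell out the behavior change being flagged, using the numbers from the example above (illustrative arithmetic only):

```python
# With the 2.0 ratio in place, a user-specified per_worker of 10 produced an
# HPA target of 5 concurrent requests per pod, so the endpoint scaled up at
# half the requested concurrency. With the ratio removed, the HPA target
# equals per_worker, so scale-up triggers later (less sensitive autoscaling).
per_worker = 10

old_hpa_target = per_worker / 2.0    # 5.0  (before this PR)
new_hpa_target = float(per_worker)   # 10.0 (after this PR)

print(old_hpa_target, new_hpa_target)
```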

@yunfeng-scale changed the title from "Add docs for Model.create" to "Add docs for Model.create, update default values and fix per_worker concurrency" on Oct 19, 2023
@yixu34 (Member) left a comment:

One more high-level comment: after

clients/python/llmengine/model.py: review thread (resolved)
@yixu34 (Member) left a comment:

The only things I had left were around the examples:

  1. framework version tag
  2. labels

clients/python/llmengine/model.py: outdated review threads (resolved)
@yunfeng-scale enabled auto-merge (squash) on October 20, 2023 20:22
@yunfeng-scale merged commit 271156c into main on Oct 20, 2023
5 checks passed
@yunfeng-scale deleted the yunfeng-retry-cds branch on October 20, 2023 20:48