-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Celery autoscaler #378
Celery autoscaler #378
Conversation
tags.datadoghq.com/version: {{ $tag }} | ||
name: {{ $app }} | ||
spec: | ||
serviceName: "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{{ $app }}?
operator: Equal | ||
value: 'true' | ||
effect: NoSchedule | ||
serviceAccountName: ml-k8s-admin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't think this is defined ml-k8s-admin
. does {{ include "modelEngine.fullname" . }}
provide sufficient permissions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iiuc the serviceaccount that the endpoint builder and gateway use should have enough permission
serviceAccountName: ml-k8s-admin | ||
volumes: | ||
- configMap: | ||
name: ml-worker-config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{{ .Values.aws.configMap.name }}
?
requests: | ||
cpu: 1000m | ||
volumeMounts: | ||
- mountPath: /root/.aws/config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/opt?
|
||
SQS_SAMPLE_COUNT = 10 | ||
|
||
logger = make_logger("model_engine_server.core.celery_autoscaler") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use logger_name()
that you can import as well
return CELERY_AUTOSCALER_EXCLUDED_NAMESPACES | ||
except ModuleNotFoundError: | ||
return [] | ||
finally: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary
else: | ||
raise ValueError("broker_type not redis or sqs, how did we get here?") | ||
|
||
env = os.getenv("DD_ENV", "training") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to default to training here?
@@ -0,0 +1,635 @@ | |||
import asyncio as aio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this be moved to the celery/
folder?
{{- $env := .Values.context }} | ||
{{- $tag := .Values.tag }} | ||
{{- $service_identifier := .Values.serviceIdentifier }} | ||
{{- $num_shards := ternary 30 3 (eq .Values.context "prod") }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the 30 and 3 values feel like they should be configurable in the helm chart
operator: Equal | ||
value: 'true' | ||
effect: NoSchedule | ||
serviceAccountName: ml-k8s-admin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iiuc the serviceaccount that the endpoint builder and gateway use should have enough permission
|
||
for f in dataclasses.fields(CeleryAutoscalerParams): | ||
k = f.name | ||
# TODO: this technically can remain the same, since Celery deployments won't have the `channel` field and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can remove this comment probably? there is no NATS autoscaler anymore
{{- $tag := .Values.tag }} | ||
{{- $service_identifier := .Values.serviceIdentifier }} | ||
{{- $num_shards := ternary 30 3 (eq .Values.context "prod") }} | ||
{{- range $base_app := tuple "celery-autoscaler-elasticache" "celery-autoscaler-sqs" }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's probably fine to have this range but I think that we'd be able to choose which one based on whether we're having endpoints use redis or sqs at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's parameterize the cost tags.
metadata: | ||
labels: | ||
app: {{ $app }} | ||
team: infra |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tags shouldn't be hardcoded, they're internal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see other deployments. something like
{{- include "modelEngine.selectorLabels.celeryAutoScaler" . | nindent 4 }}
{{- include "modelEngine.labels" . | nindent 4 }}
tags.datadoghq.com/service: {{ include "modelEngine.celeryAutoScalerName" . }}
should
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the labels are hardcoded in some other deployments (e.g. searching team: infra
gives me this file... Do we want the product
tag for the autoscaler to change from common
to model-engine
? (If this is ok, I can reuse modelEngine.labels
as suggested 🙂)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i think that file should parametrized tags. don't see why balloon deployment needs a different product
tag
metadata: | ||
labels: | ||
app: {{ $app }} | ||
team: infra |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see other deployments. something like
{{- include "modelEngine.selectorLabels.celeryAutoScaler" . | nindent 4 }}
{{- include "modelEngine.labels" . | nindent 4 }}
tags.datadoghq.com/service: {{ include "modelEngine.celeryAutoScalerName" . }}
should
ad.datadoghq.com/main.logs: '[{"service": "{{ $app }}", "source": "python"}]' | ||
sidecar.istio.io/inject: "false" | ||
labels: | ||
app: {{ $app }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these tags should also be templatized
# num_shards is number of instances of the autoscaler | ||
celery_autoscaler: | ||
enabled: true | ||
message_broker: sqs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is already defined as the root-level "celeryBrokerType" (can see that at the bottom of this file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
had one nit about deduplicating info in the helm values but the rest looks good
Pull Request Summary
Move over the Celery autoscaler
Test Plan and Usage Guide
Installing a test deployment autoscaled an async endpoint from 0 to 1 pod