
Celery autoscaler #378

Merged — 14 commits into main, Nov 17, 2023
Conversation

squeakymouse (Contributor):

Pull Request Summary

Move over the Celery autoscaler

Test Plan and Usage Guide

Installing a test deployment autoscaled an async endpoint from 0 to 1 pod

@squeakymouse squeakymouse requested a review from a team November 15, 2023 02:40
tags.datadoghq.com/version: {{ $tag }}
name: {{ $app }}
spec:
serviceName: ""
Collaborator:

{{ $app }}?

operator: Equal
value: 'true'
effect: NoSchedule
serviceAccountName: ml-k8s-admin
Collaborator:

I don't think ml-k8s-admin is defined. Does {{ include "modelEngine.fullname" . }} provide sufficient permissions?

Member:

IIUC, the service account that the endpoint builder and gateway use should have enough permissions.

serviceAccountName: ml-k8s-admin
volumes:
- configMap:
name: ml-worker-config
Collaborator:

{{ .Values.aws.configMap.name }} ?

requests:
cpu: 1000m
volumeMounts:
- mountPath: /root/.aws/config
Contributor:

/opt?


SQS_SAMPLE_COUNT = 10

logger = make_logger("model_engine_server.core.celery_autoscaler")
Contributor:

nit: use logger_name() that you can import as well
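The nit suggests deriving the logger name from an importable logger_name() helper instead of hardcoding the dotted path. The real helper lives in this repo's logging utilities; the sketch below is a hypothetical stand-in showing the usual shape of such a helper (it infers the caller's module name), not the actual implementation:

```python
import inspect
import logging


def logger_name() -> str:
    # Hypothetical sketch: resolve the calling module's dotted name so
    # every module gets a consistently named logger without hardcoding.
    caller = inspect.stack()[1]
    module = inspect.getmodule(caller.frame)
    return module.__name__ if module is not None else "__main__"


def make_logger(name: str) -> logging.Logger:
    # Thin stand-in mirroring the make_logger() call already used in the PR.
    return logging.getLogger(name)


logger = make_logger(logger_name())
```

With this, the hardcoded string "model_engine_server.core.celery_autoscaler" becomes unnecessary, and a moved or renamed module keeps a correct logger name automatically.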

return CELERY_AUTOSCALER_EXCLUDED_NAMESPACES
except ModuleNotFoundError:
return []
finally:
Contributor:

nit: unnecessary

else:
raise ValueError("broker_type not redis or sqs, how did we get here?")

env = os.getenv("DD_ENV", "training")
Contributor:

do we want to default to training here?

@@ -0,0 +1,635 @@
import asyncio as aio
Contributor:

could this be moved to the celery/ folder?

{{- $env := .Values.context }}
{{- $tag := .Values.tag }}
{{- $service_identifier := .Values.serviceIdentifier }}
{{- $num_shards := ternary 30 3 (eq .Values.context "prod") }}
Member:

the 30 and 3 values feel like they should be configurable in the helm chart
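One hedged way to lift those constants into the chart (key names here are illustrative, not taken from the actual values file):

```yaml
# values.yaml sketch: expose the shard counts instead of hardcoding
# the ternary 30/3 in the template.
celery_autoscaler:
  num_shards:
    prod: 30
    default: 3
```

The template line would then read something like `{{- $num_shards := ternary .Values.celery_autoscaler.num_shards.prod .Values.celery_autoscaler.num_shards.default (eq .Values.context "prod") }}`.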



for f in dataclasses.fields(CeleryAutoscalerParams):
k = f.name
# TODO: this technically can remain the same, since Celery deployments won't have the `channel` field and
Member:

I guess we can probably remove this comment? There is no NATS autoscaler anymore.

{{- $tag := .Values.tag }}
{{- $service_identifier := .Values.serviceIdentifier }}
{{- $num_shards := ternary 30 3 (eq .Values.context "prod") }}
{{- range $base_app := tuple "celery-autoscaler-elasticache" "celery-autoscaler-sqs" }}
Member:

It's probably fine to have this range, but at some point we should be able to choose which one to render based on whether endpoints use redis or sqs.
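A sketch of that future selection, assuming the chart's root-level celeryBrokerType value (referenced elsewhere in this review) is in scope here:

```yaml
{{- /* Sketch: render only the flavor matching the configured broker,
     instead of ranging over both. Value and app names are assumptions. */}}
{{- $base_app := ternary "celery-autoscaler-sqs" "celery-autoscaler-elasticache" (eq .Values.celeryBrokerType "sqs") }}
```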

yixu34 (Member) left a comment:

Let's parameterize the cost tags.

metadata:
labels:
app: {{ $app }}
team: infra
Member:

These tags shouldn't be hardcoded, they're internal.

Collaborator:

see other deployments. something like

    {{- include "modelEngine.selectorLabels.celeryAutoScaler" . | nindent 4 }}
    {{- include "modelEngine.labels" . | nindent 4 }}
    tags.datadoghq.com/service: {{ include "modelEngine.celeryAutoScalerName" . }}

should

squeakymouse (Contributor, Author):

I think the labels are hardcoded in some other deployments (e.g. searching team: infra gives me this file). Do we want the product tag for the autoscaler to change from common to model-engine? (If this is ok, I can reuse modelEngine.labels as suggested 🙂)

Collaborator:

Yeah, I think that file should parametrize its tags too. I don't see why the balloon deployment needs a different product tag.


ad.datadoghq.com/main.logs: '[{"service": "{{ $app }}", "source": "python"}]'
sidecar.istio.io/inject: "false"
labels:
app: {{ $app }}
Collaborator:

these tags should also be templatized
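Concretely, the pod-template labels could reuse the chart helpers named earlier in this review (helper names are taken from that suggestion and should be treated as assumptions about this chart):

```yaml
labels:
  {{- include "modelEngine.selectorLabels.celeryAutoScaler" . | nindent 4 }}
  {{- include "modelEngine.labels" . | nindent 4 }}
  tags.datadoghq.com/service: {{ include "modelEngine.celeryAutoScalerName" . }}
  tags.datadoghq.com/version: {{ .Values.tag }}
```

This keeps the team/product cost tags in one helper instead of hardcoding internal values per deployment.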

# num_shards is number of instances of the autoscaler
celery_autoscaler:
enabled: true
message_broker: sqs
seanshi-scale (Member), Nov 17, 2023:

this is already defined as the root-level "celeryBrokerType" (can see that at the bottom of this file)
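A minimal sketch of the deduplication, assuming the template can read the root-level value directly:

```yaml
# values.yaml: keep a single source of truth for the broker.
celeryBrokerType: sqs

# celery_autoscaler then drops its own message_broker key, and the
# template reads the root-level value instead:
#   {{- $message_broker := .Values.celeryBrokerType }}
```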

seanshi-scale (Member) left a comment:

had one nit about deduplicating info in the helm values but the rest looks good

@squeakymouse squeakymouse enabled auto-merge (squash) November 17, 2023 21:04
@squeakymouse squeakymouse merged commit 4888ecf into main Nov 17, 2023
5 checks passed
@squeakymouse squeakymouse deleted the katiewu/celery-autoscaler branch November 17, 2023 21:23
@yunfeng-scale yunfeng-scale mentioned this pull request Mar 6, 2024