
Add MultiprocessingConcurrencyLimiter to gateway #399

Merged

squeakymouse merged 8 commits into main from katiewu/concurrency-limiting-gateway on Dec 7, 2023

Conversation

squeakymouse (Contributor) commented Dec 5, 2023

Pull Request Summary

- Add a concurrency limiter so the gateway returns 429s instead of 503s when overloaded

- Remove the Kubernetes liveness probe to prevent pods from restarting when they're under high load and the healthcheck fails (the liveness probe's purpose seems to be restarting pods in case of deadlocks, which doesn't seem relevant for us)

- Increase Uvicorn worker concurrency now that we have rate limiting (the latter should take precedence so that we don't return 503s)

Test Plan and Usage Guide

Load tested with https://github.com/rakyll/hey on a test deployment and saw 429s instead of 503s for concurrency up to 5000.

squeakymouse requested a review from a team December 5, 2023 01:16
    def __enter__(self):
        logger.debug("Entering concurrency limiter semaphore")
        if self.semaphore and not self.semaphore.acquire(block=self.blocking):
            logger.warning("Too many requests, returning 429")
Collaborator:

nit: can we log the number of requests instead of this message? i.e., the number of requests that exceed the concurrency limit

Member:

any requests over the concurrency limit will immediately be returned as 429s, so we'd always be logging whatever the value of MAX_CONCURRENCY is (which still seems good to log, though)
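
(For context on the thread above: a minimal sketch of what a semaphore-based limiter along these lines could look like. The constructor, __exit__, and logging setup are extrapolated from the __enter__ excerpt and are assumptions, not necessarily the PR's exact code.)

import logging
from multiprocessing import BoundedSemaphore
from typing import Optional

from fastapi import HTTPException

logger = logging.getLogger(__name__)


class MultiprocessingConcurrencyLimiter:
    def __init__(self, concurrency: Optional[int], fail_on_concurrency_limit: bool):
        # concurrency=None disables limiting entirely
        self.semaphore = BoundedSemaphore(value=concurrency) if concurrency else None
        # if we aren't failing fast, block until a slot frees up
        self.blocking = not fail_on_concurrency_limit

    def __enter__(self):
        logger.debug("Entering concurrency limiter semaphore")
        if self.semaphore and not self.semaphore.acquire(block=self.blocking):
            logger.warning("Too many requests, returning 429")
            raise HTTPException(status_code=429, detail="Too many requests")

    def __exit__(self, exc_type, exc_value, traceback):
        logger.debug("Exiting concurrency limiter semaphore")
        if self.semaphore:
            self.semaphore.release()

It would be used as `with concurrency_limiter:` around `call_next`, as in the middleware excerpt below.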

seanshi-scale (Member) left a comment:

Code seems good, but I would like more complete load test results (e.g. some stripped-down version of the load test docs Phil and I have written), since the change we're making is quite dependent on stuff outside of this code, and there are nontrivial interactions between this code, whatever istio setup we have, where the requests are coming from, etc.

E.g. one reason you might be seeing different behavior between running with just `up` locally and when deployed on a test deployment is that there may be request queues hidden somewhere in your test deployment (there's a bunch of other stuff in between your devbox and the gateway pods, like some istio layer for sure, and the istio-proxy pods), as opposed to when testing locally.

ian-scale (Collaborator):

@squeakymouse is this rate limiting per-user, or is it system-wide? I think something we could consider working towards in the future would be implementing per-user rate limits; I could foresee this being awkward if someone sends their first request ever at a time when system load is high and gets back a 429.

Actually, the best case might be instituting per-user rate limits and system rate limits. This way, no one user can monopolize all our throughput at any point in time. What do you think?

Also, I could be totally missing a line in the code and the per-user rate limits have already been added.
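
(Purely as an illustrative sketch of the per-user plus system-wide idea, not code from this PR; all names and limits below are hypothetical:)

from collections import defaultdict

MAX_GLOBAL = 1000   # hypothetical system-wide cap on in-flight requests
MAX_PER_USER = 50   # hypothetical per-user cap

global_inflight = 0
per_user_inflight = defaultdict(int)


def try_acquire(user_id: str) -> bool:
    """Admit a request only if both the global and per-user budgets allow it."""
    global global_inflight
    if global_inflight >= MAX_GLOBAL or per_user_inflight[user_id] >= MAX_PER_USER:
        return False  # caller should return a 429
    global_inflight += 1
    per_user_inflight[user_id] += 1
    return True


def release(user_id: str) -> None:
    global global_inflight
    global_inflight -= 1
    per_user_inflight[user_id] -= 1

This sketch is single-process and assumes the counters are only touched from one event loop; a real version would need the same cross-process story as the semaphore approach in the PR, plus a lock if threads are involved.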


class CustomMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        try:
            LoggerTagManager.set(LoggerTagKey.REQUEST_ID, str(uuid.uuid4()))
            with concurrency_limiter:
                return await call_next(request)
Collaborator:

What do people think about trying this out with just a specific route at first? Looking at the breakdown over the past week, get_/v1/async-tasks/_task_id is the most common route by far.

I'm not sure it makes sense to do a global limit, since we know some routes take more time than others.

Collaborator:

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request

> Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.

@squeakymouse From the docs, it looks like if our readiness probe route returns a 429, the pod would be marked as unready, which should result in 503s from istio again.

It is a little odd, since our experimentation doesn't seem to show that...

squeakymouse (Contributor, Author):

Hmm I think it shows up as the context deadline exceeded (Client.Timeout exceeded while awaiting headers) errors? 🤔

Does this mean I should try to exclude the healthcheck route from the concurrency limiting?

@@ -0,0 +1,35 @@
from multiprocessing import BoundedSemaphore
Collaborator:

I'm wondering if we should be using an async semaphore rather than a multiprocessing one, since FastAPI is using async.

This SO comment seems to suggest that we should be using the corresponding semaphore:

> use the correct type of semaphore for the form of concurrency being used
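
For illustration, an asyncio-based variant could look like the sketch below; the middleware class and the MAX_CONCURRENCY value are assumed names, not code from this PR. One relevant trade-off: an asyncio.Semaphore is scoped to a single event loop (one worker process), while a multiprocessing semaphore can be shared across forked workers.

import asyncio

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import Response

MAX_CONCURRENCY = 100  # hypothetical limit
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)


class AsyncConcurrencyLimitMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # locked() is True when no slots are free; fail fast with a 429
        if semaphore.locked():
            return Response(status_code=429, content="Too many requests")
        async with semaphore:
            return await call_next(request)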

saiatmakuri (Contributor):

curious if there was consideration of 3P libraries like slowapi, where rate limiting can be extended to redis, made time-bounded, keyed by user strategy, with easy customization of limits per route, etc.
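
(For reference, a minimal slowapi sketch; the route and limit string here are hypothetical, and slowapi requires the `request` argument in decorated endpoints:)

from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# key_func picks the rate-limit bucket; get_remote_address keys by client IP
limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)


@app.get("/v1/async-tasks/{task_id}")
@limiter.limit("100/minute")  # hypothetical per-client, per-route limit
async def get_async_task(request: Request, task_id: str):
    return {"task_id": task_id}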

squeakymouse (Contributor, Author) commented Dec 6, 2023

Edit: resolved in-person. Per-user rate limits and slowapi are out of scope for now.

song-william (Collaborator) left a comment:

Thanks!

@@ -91,6 +109,7 @@ def load_redis():
    get_or_create_aioredis_pool()


# these routes should match those exempt from the concurrency limiter in the middleware
song-william (Collaborator) commented Dec 7, 2023:

To better link the routes here with the middleware in code, we could define a shared list variable. Something like this could work:

health_routes = ["/healthcheck", "/healthz", "/readyz"]

def healthcheck() -> Response:
    """Returns 200 if the app is healthy."""
    return Response(status_code=200)

for endpoint in health_routes:
    app.get(endpoint)(healthcheck)

The code here can then refer to the health_routes variable.
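
A sketch of the corresponding middleware check, assuming health_routes is importable there:

if request.url.path in health_routes:
    return await call_next(request)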


class CustomMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        try:
            LoggerTagManager.set(LoggerTagKey.REQUEST_ID, str(uuid.uuid4()))
            if request.url.path in ["/healthcheck", "/healthz", "/readyz"]:
                return await call_next(request)
Collaborator:

nit: I would just add a comment here noting that we intentionally exclude health check routes from the concurrency limiter.

squeakymouse enabled auto-merge (squash) December 7, 2023 02:20
squeakymouse merged commit 69e07ff into main Dec 7, 2023
5 checks passed
squeakymouse deleted the katiewu/concurrency-limiting-gateway branch December 7, 2023 02:35
yunfeng-scale mentioned this pull request Mar 6, 2024