
Shuffle queues #122

Merged
merged 10 commits into main from shuffle-queues on Jul 15, 2024
Conversation

@nickstenning (Member) commented on Jul 12, 2024

This PR implements read/write/length support for shuffle-sharded multi-tenant queues built on top of a collection of Redis streams. It is intended to become the library used by all of our clients that interact with these data structures (api, autoscaler, director).

Writes

We calculate the possible streams (based on a shard key) in Go, and then pass those into a Lua script in Redis which does the following (see the Go sketch after this list):

  1. Checks whether it can update the queue "meta" key to reflect the current total number of streams.

    If the new value is greater than the current one, it can do this immediately. If the new value is less than the current one, it checks the lengths of the streams beyond the new, smaller set, and only updates the meta key if all of those streams are empty.

    For example: if we were previously writing to 16 streams (with indexes 0 to 15) and new requests arrive saying that we're writing to 8 streams (0 to 7), then we'll check that streams 8 to 15 are empty before updating the meta key.

  2. Finds the shortest stream out of those provided, or the first empty stream.

  3. XADDs the provided message to the selected stream.

  4. XADDs a short message to a notifications stream, which can be used to await new work when queues are not busy.

  5. Sets a timeout/expiry on all the keys of the queue.

  6. Returns the enqueued message ID.
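
The real write path runs these six steps inside a single Lua script, so they happen atomically. Purely as an illustration, here is the same sequence sketched in Go against a go-redis v9 client; the function name `writeSketch`, the meta-key layout, the notification payload, and the 24-hour expiry are assumptions, not the actual implementation.

```go
// writeSketch illustrates the six steps above. The real implementation runs
// them inside one Lua script (so they are atomic); this Go version is only a
// sketch, and key names and the meta-key layout are hypothetical.
package queue

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

func writeSketch(ctx context.Context, rdb *redis.Client, prefix string, streams []string, totalStreams int, payload map[string]interface{}) (string, error) {
	metaKey := prefix + ":meta"

	// 1. Record the current total number of streams in the meta key.
	//    (Shrinking the set would also require checking that the streams
	//    beyond the new set are empty; omitted here for brevity.)
	if err := rdb.HSet(ctx, metaKey, "streams", totalStreams).Err(); err != nil {
		return "", err
	}

	// 2. Pick the first empty stream if there is one, otherwise the shortest.
	target, shortest := "", int64(-1)
	for _, s := range streams {
		n, err := rdb.XLen(ctx, s).Result()
		if err != nil {
			return "", err
		}
		if n == 0 {
			target = s
			break
		}
		if shortest < 0 || n < shortest {
			target, shortest = s, n
		}
	}

	// 3. XADD the message to the selected stream.
	id, err := rdb.XAdd(ctx, &redis.XAddArgs{Stream: target, Values: payload}).Result()
	if err != nil {
		return "", err
	}

	// 4. XADD a tiny message to the notifications stream so blocked readers wake up.
	notif := prefix + ":notifications"
	if err := rdb.XAdd(ctx, &redis.XAddArgs{Stream: notif, MaxLen: 1, Values: map[string]interface{}{"n": "1"}}).Err(); err != nil {
		return "", err
	}

	// 5. Refresh the expiry on all keys belonging to the queue.
	for _, key := range append([]string{metaKey, notif}, streams...) {
		rdb.Expire(ctx, key, 24*time.Hour)
	}

	// 6. Return the enqueued message ID.
	return id, nil
}
```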

Reads

The read side of the sharded queue implementation is quite fiddly, primarily because we cannot easily block on a write to one of N streams while also guaranteeing that we never pick up more than one message at a time.

Infuriatingly, when the queue is "empty", then

XREADGROUP ... COUNT 1 BLOCK 1000 STREAMS key1 key2 [...] > > [...]

will do the right thing. It will return immediately after the first XADD on any of the monitored streams. The problem is that if there are items immediately available in more than one of the specified streams, one per stream will be claimed and returned.

This implementation works around this by adding yet another stream -- a "notifications" stream -- the sole purpose of which is to accelerate the next read by the client when the stream is mostly empty.

Every time we write to the queue we also put a tiny message on the notifications stream, which can then be used by clients to block on activity on the queue as a whole.

An individual blocking read, then, has three steps:

  1. a non-blocking pass through all the queues
  2. if nothing is found, block on the notifications queue
  3. if we get a message, do another non-blocking pass through all the queues

This is inherently racy, but it should work well enough for our purposes and doesn't involve hammering Redis with repeated scans for messages.
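
A minimal Go sketch of that three-step flow, assuming a go-redis v9 client. The real non-blocking pass is a Lua script that walks the streams from a coordinated offset; here it is approximated with a simple per-stream loop, and the names `tryReadOnce` and `blockingRead` (and the way the notification consumer group is handled) are hypothetical. A nil message with a nil error means "nothing available".

```go
package queue

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// tryReadOnce stands in for the real single, non-blocking pass over the streams.
func tryReadOnce(ctx context.Context, rdb *redis.Client, group, consumer string, streams []string) (*redis.XMessage, error) {
	for _, s := range streams {
		res, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
			Group:    group,
			Consumer: consumer,
			Streams:  []string{s, ">"},
			Count:    1,
			Block:    -1, // negative: omit BLOCK, i.e. a non-blocking read
		}).Result()
		if err == redis.Nil {
			continue // this stream has nothing new
		}
		if err != nil {
			return nil, err
		}
		if len(res) > 0 && len(res[0].Messages) > 0 {
			return &res[0].Messages[0], nil
		}
	}
	return nil, nil
}

func blockingRead(ctx context.Context, rdb *redis.Client, group, consumer, notif string, streams []string, block time.Duration) (*redis.XMessage, error) {
	// 1. A non-blocking pass through all the queue's streams.
	if msg, err := tryReadOnce(ctx, rdb, group, consumer, streams); msg != nil || err != nil {
		return msg, err
	}

	// 2. Nothing found: block on the notifications stream until something is
	//    written anywhere in the queue, or the timeout expires.
	_, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
		Group:    group,
		Consumer: consumer,
		Streams:  []string{notif, ">"},
		Count:    1,
		Block:    block,
	}).Result()
	if err == redis.Nil {
		return nil, nil // timed out: the queue really is quiet
	}
	if err != nil {
		return nil, err
	}

	// 3. We were notified: do another non-blocking pass.
	return tryReadOnce(ctx, rdb, group, consumer, streams)
}
```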

The read code also handles a number of other aspects of this problem:

  • for new queues, the creation of the requested consumer group is handled automatically (see the sketch after this list)
  • we will automatically read from the "default" stream if it exists and has items in it -- this will allow us to migrate everything to sharded queues without downtime
  • the consumer group provided is used when reading from the new streams, but when reading from the default stream we use the stream name as a consumer group name -- this will allow us to migrate away from having different consumer group names for each queue
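
For the first bullet, a sketch of what the automatic group creation might look like with go-redis v9; the function name `ensureGroup` and the error handling are illustrative, but the key details are that XGROUP CREATE with MKSTREAM creates both the stream and the group, and a BUSYGROUP reply simply means the group already exists.

```go
package queue

import (
	"context"
	"strings"

	"github.com/redis/go-redis/v9"
)

// ensureGroup creates the consumer group (and the stream, via MKSTREAM) if
// it does not exist yet, and treats "already exists" as success.
func ensureGroup(ctx context.Context, rdb *redis.Client, stream, group string) error {
	err := rdb.XGroupCreateMkStream(ctx, stream, group, "0").Err()
	if err != nil && !strings.Contains(err.Error(), "BUSYGROUP") {
		return err
	}
	return nil
}
```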

Len

We also expose a Len method which adds up the lengths of all the streams.
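
The PR does this in a single round trip with a Lua script (queue/len.lua); as a sketch only, the Go-level equivalent is just a sum of XLEN calls over the queue's streams (the helper name `queueLen` is hypothetical).

```go
package queue

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// queueLen sums the lengths of all the streams that make up the queue.
func queueLen(ctx context.Context, rdb *redis.Client, streams []string) (int64, error) {
	var total int64
	for _, s := range streams {
		n, err := rdb.XLen(ctx, s).Result()
		if err != nil {
			return 0, err
		}
		total += n
	}
	return total, nil
}
```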

Implements one half of a shuffle-sharded queue implementation.
Specifically: the write half.

This wraps up the logic for writing to a shuffle-sharded queue so that
clients can simply provide the relevant parameters for the queue, and
the details of which stream is written to are hidden.

This tests that the notification stream is working as intended.
@nickstenning marked this pull request as ready for review on July 15, 2024 at 09:38
@nickstenning force-pushed the shuffle-queues branch 3 times, most recently from 9ac2910 to 26c4a17, on July 15, 2024 at 10:31
@evilstreak (Member) left a comment:

Amazing! ✨

A couple of notes we discussed online:

  • There's a (very unlikely) race condition with notifications having a max length of 1: two directors can simultaneously fail a queue read and then do a blocking read on the notifications stream, and in the time between the two queue reads failing and the two notification reads making it to Redis, we enqueue two (or more) predictions into the queues. Now only one director will pick up the work, and the other will wait the whole blocking time (or until another prediction is enqueued). This seems super unlikely to happen, would be mitigated by increasing the notification stream length slightly, and is something we can probably instrument to find out whether it's actually a problem in reality (I predict not).
  • Starting at a random queue position with each read doesn't guarantee fair weighting. As a pathological case, if a bulk user is assigned the shard {1,2,3,4,5,6} and a realtime user is assigned the shard {7,8,9,10,11,12} (and the remaining 52 streams are empty), then 6/64 times a prediction from the realtime user gets processed, and 58/64 times a prediction from the bulk user gets processed. We discussed addressing this by keeping a counter with the metadata and incrementing it before every read, so that all reads from all consumers share a global state which is guaranteed to round-robin all queues, rather than starting at a random position.

@nickstenning (Member, Author) commented:

Thanks for spotting the second issue, as that is rather fatal to the fairness goal. I've pushed bc43b14 which I think addresses the issue.

@philandstuff (Contributor) commented:

@evilstreak for the first issue, we could potentially fix it by:

  • drop the cap on the notifications stream (so it can grow)
  • in the read Lua script, if we find the streams are empty, we return the current max ID of the notification stream
  • when we do XREADGROUP on the notification stream, we specify a min ID to guarantee we get only those notifications that arrived after we checked and found the streams empty

Discussed on gather with @nickstenning, we might not need to do this but it should fix things if we do need to.

@nickstenning force-pushed the shuffle-queues branch 2 times, most recently from d943e93 to 433ac39, on July 15, 2024 at 12:46
@nickstenning (Member, Author) commented:

> Discussed on gather with @nickstenning, we might not need to do this but it should fix things if we do need to.

We agreed that we'd save this as something we can experiment with if the race condition in question turns out to be something we're seeing for real. I've added `queue.pickup_delay_ms` telemetry to help us with this.

@philandstuff (Contributor) left a comment:

This looks great, thank you! A few minor comments. I can't say I fully understand the algorithm, but we can chat about that.

This adds a third script which calculates the total queue length by
summing the lengths of all the streams.
By analogy with most of the Redis client commands, this switches the
queue package to return a sentinel error (queue.Empty) when no messages
are available, rather than the somewhat cryptic "nil, nil".
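
A sketch of that convention, mirroring how go-redis exposes redis.Nil; only the name queue.Empty comes from the commit, and the error text and usage shown here are illustrative.

```go
package queue

import "errors"

// Empty is returned when no messages are currently available, by analogy
// with redis.Nil.
var Empty = errors.New("queue: no messages available")
```

Callers can then write `if errors.Is(err, queue.Empty)` instead of checking for a nil message alongside a nil error.
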
Dom pointed out that using a random (or, as in our case, "random")
offset for choosing where to start reading doesn't actually result in
fair reads.

Consider the contrived but quite possible case of a 64-stream queue in
which tenant A is allocated streams [0,4), tenant B is allocated streams
[4,8), and no other tenants are using the queue.

Any read which starts at offsets in the range [4,8) will fetch a message
for tenant B, but a read starting at any other offset (all 60 of them)
will fetch a message for tenant A.

This updates the read code to use a globally coordinated offset stored
in the "meta" key to ensure fairness across reads, still without
requiring consumers to maintain state. Any time a message is found, the
offset is updated to point to the *next* stream in the queue, and then
reads start at the recorded offset. This should ensure that the first
read round-robins through all queues that have messages in them.
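
Sketched in Go for clarity (in the PR the equivalent logic lives inside the read Lua script), with a hypothetical "offset" field on the meta hash and a readOne callback standing in for the per-stream read:

```go
package queue

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// readFair sketches the coordinated offset: reads start at the offset stored
// in the meta key, and whenever a message is found the offset is advanced to
// the next stream, so consumers collectively round-robin the queues.
func readFair(ctx context.Context, rdb *redis.Client, metaKey string, streams []string,
	readOne func(stream string) (found bool, err error)) error {

	// Start at the globally recorded offset (a missing field parses as 0).
	start, _ := rdb.HGet(ctx, metaKey, "offset").Int64()

	n := int64(len(streams))
	for i := int64(0); i < n; i++ {
		idx := (start + i) % n
		found, err := readOne(streams[idx])
		if err != nil {
			return err
		}
		if found {
			// Point the shared offset at the *next* stream for the next reader.
			return rdb.HSet(ctx, metaKey, "offset", (idx+1)%n).Err()
		}
	}
	return nil // nothing found anywhere
}
```
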
This records, at pickup, the time the message has spent in the queue on the current span as `queue.pickup_delay_ms`.
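
A sketch of how that attribute might be recorded with OpenTelemetry, exploiting the fact that Redis stream IDs begin with a millisecond timestamp; the function name `recordPickupDelay` is hypothetical and the PR's actual instrumentation may differ.

```go
package queue

import (
	"context"
	"strconv"
	"strings"
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordPickupDelay derives the enqueue time from the stream message ID
// (e.g. "1721048112345-0") and records the delay on the current span.
func recordPickupDelay(ctx context.Context, messageID string) {
	ms, err := strconv.ParseInt(strings.SplitN(messageID, "-", 2)[0], 10, 64)
	if err != nil {
		return
	}
	delay := time.Since(time.UnixMilli(ms)).Milliseconds()
	trace.SpanFromContext(ctx).SetAttributes(
		attribute.Int64("queue.pickup_delay_ms", delay),
	)
}
```
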
@nickstenning merged commit 462aa88 into main on Jul 15, 2024
2 checks passed
@nickstenning deleted the shuffle-queues branch on July 15, 2024 at 13:21