Allocation failures due to enormous (many TB) allocation attempts #48

Open
saethlin opened this issue Jun 3, 2024 · 0 comments
Labels
bug Something isn't working

Comments

saethlin commented Jun 3, 2024

We've been using ThreadedRodeo in production for a while to implement a global string interner. For the most part it works great, but recently, when we increased the concurrency of our workload from 10 CPUs to 64 CPUs, we started seeing sporadic allocation failures due to attempts to mmap between 2 TB and a few PB. Even when there wasn't a crash, virtual memory usage was absurdly high: commonly a 50 GB resident set came with 2 TB of virtual memory.

I believe the root cause is that lasso is missing some synchronization in its lock-free allocation strategy, or alternatively that this lock-free design is not viable at all. In the 0.7 release series, the memory for ThreadedRodeo strings is allocated with a lock-free strategy using a linked list of AtomicBucket. This is the responsible code:

```rust
// Check that we haven't exhausted our memory limit
self.allocate_memory(next_capacity)?;
// Set the capacity to twice of what it currently is to allow for fewer allocations as more strings are interned
self.set_bucket_capacity(next_capacity);
// Safety: `next_capacity` will never be zero
let capacity = unsafe { NonZeroUsize::new_unchecked(next_capacity) };
debug_assert_ne!(next_capacity, 0);
let mut bucket = AtomicBucket::with_capacity(capacity)?;
// Safety: The new bucket will have enough room for the string
let allocated_string = unsafe { bucket.push_slice(slice) };
self.buckets.push_front(bucket.into_ref());
Ok(allocated_string)
```

The ideal behavior of this data structure is for one thread at a time to get here, double the capacity, then push its newly allocated slab onto the linked list. But if many threads race to this point in the code, they will not realize that another thread is already allocating (that's the whole point of the lock-free design!), so each will double the capacity in sequence and allocate at that doubled size.
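One reading of "missing synchronization": nothing forces racing threads to agree on a single growth step. A minimal, hypothetical sketch (not lasso's actual API) of serializing growth with `compare_exchange`, so that a losing racer reuses the winner's new capacity instead of doubling it again:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let capacity = AtomicUsize::new(1024);
    // Simulate two racers that both observed capacity == 1024 and both
    // want to double it. Only one compare_exchange can succeed.
    let winner = capacity.compare_exchange(1024, 2048, Ordering::AcqRel, Ordering::Acquire);
    let loser = capacity.compare_exchange(1024, 2048, Ordering::AcqRel, Ordering::Acquire);
    assert_eq!(winner, Ok(1024)); // this racer doubles and allocates 2048
    assert_eq!(loser, Err(2048)); // this racer learns growth already happened
    // Capacity was doubled exactly once, not once per racing thread.
    assert_eq!(capacity.load(Ordering::Acquire), 2048);
}
```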

If 64 threads race on this code in perfect synchrony, each doubling the capacity the previous racer just installed, the last thread would attempt an allocation 2^64 times the original capacity, which exceeds the entire 64-bit address space.
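The failure arithmetic can be sketched as follows (not lasso's code, and the 1 MiB starting bucket is an assumption for illustration):

```rust
// If `n` racing threads each double the capacity the previous racer just
// installed, the requested size grows as start * 2^n.
fn main() {
    let start: u128 = 1 << 20; // assumed 1 MiB bucket before the race
    let mut next_capacity = start;
    for racers in 1..=64u32 {
        next_capacity *= 2; // each racer doubles, then tries to allocate
        if racers == 12 {
            // only 12 racers in, the request is already 4 GiB
            assert_eq!(next_capacity, 1u128 << 32);
        }
    }
    // 64 racers: 1 MiB * 2^64 = 2^84 bytes, beyond any 64-bit address space
    assert_eq!(next_capacity, 1u128 << 84);
}
```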

We've worked around this for now by forking lasso and back-porting some changes from 0.7.2 onto 0.6.0.

@saethlin saethlin added the bug Something isn't working label Jun 3, 2024