Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Probabilistic data structure for "current" known versions #190

Open
jeromegn opened this issue Apr 13, 2024 · 0 comments
Open

Probabilistic data structure for "current" known versions #190

jeromegn opened this issue Apr 13, 2024 · 0 comments

Comments

@jeromegn
Copy link
Member

Probabilistic data structures can help Corrosion sync faster by preventing 99% of lookups for "cleared" versions.

Right now, a DB lookup is required if a version is known because Corrosion does not know if there are any changes in the DB for this version. This is a fairly recent change where everything isn't stored in memory anymore.

People are probably most familiar with Bloom filters. They're pretty great but have shortcomings: they're usually immutable and if they are mutable then it's not possible to delete items from them.

The most interesting data structure for us might be the Cuckoo filter which is similar to a Bloom or XOR filter, but it allows deletion. There's a crate called scalable_cuckoo_filter which seems to fit the bill. It is serializable so we can store it in the DB and load it on boot.

Since these data structures will never return false negatives but may return false positives, we have to store information that's "ok" to be wrong about and a DB lookup won't hurt anything. If we stored which versions are cleared, then we may mistakenly tell a different node that a version has been cleared (false positive). However, if we store current versions only, then a negative definitely means the version was cleared or we don't have it (checked against BookedVersions#needed). If we get a false positive, that's ok because Corrosion can query the DB and the worst case scenario is finding nothing and returning that the version has been cleared.

There's a challenge in fully initializing this structure because the queries can be expensive. Possibly iterating through hundreds of millions of rows in __corro_bookkeeping and various clock tables. It would have to be done incrementally. We can't use the structure until it has been filled completely by both the database's state and incoming data as we're processing broadcasts and syncs. This is tricky, but not impossible:

  • Start loading data incrementally in the background (a few versions at a time to not hog the database resources)
  • Insert / delete versions as Corrosion is processing syncs and broadcasts
  • Wrap the filter in a structure that knows if it is complete
  • Periodically (?) save the serialized filter w/ the "last processed version" information
  • Only rely on the filter if it has been fully loaded, meanwhile fallback to lookups, it ain't that bad

In my testing, a cuckoo filter of ~16 million u64 uses roughly 30MB of memory. It's probably less than that, I was looking at total memory usage of a test binary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant