Implement cache deletes #1029

Closed · wants to merge 10 commits

Conversation

@iFrostizz (Contributor) commented Jun 7, 2024

Cache deletes in the program state so that we avoid cache misses on data deletion.
Also added a cache insertion when calling get.
The new cache tells the caller whether the data is present (Data(Vec<u8>)), has been deleted (Deleted), or is scheduled to be deleted when the cache is flushed (ToDelete).
There is a caveat, though: Vecs of ZSTs (zero-sized types) are completely disabled by borsh-rs. That's because they want to avoid a potential DoS, since huge vecs can easily be crafted from ZSTs (they are free memory-wise!) with something like vec![(); !0].
See near/borsh-rs#52
We need to make sure that we are not exposed to that, or that fuel metering correctly prevents such abuse.

Closes #867
Closes #868
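For illustration only (not part of this PR), a minimal sketch of why ZST vectors are a DoS concern: a Vec of zero-sized elements can claim an enormous length while using no heap memory at all, which is exactly why borsh-rs refuses such collections.

fn main() {
    // `()` is a zero-sized type: usize::MAX elements cost zero bytes of heap.
    let huge: Vec<()> = vec![(); usize::MAX];
    assert_eq!(std::mem::size_of::<()>(), 0);
    assert_eq!(huge.len(), usize::MAX);

    // borsh-rs disables (de)serialization of ZST collections by design
    // (see near/borsh-rs#52), precisely so a crafted length can't force
    // unbounded work on deserialization.
}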

@@ -257,7 +257,7 @@ pub fn state_keys(_attr: TokenStream, item: TokenStream) -> TokenStream {

// add default attributes
item_enum.attrs.push(parse_quote! {
#[derive(Clone, Copy, Debug, Eq, PartialEq, Hash)]
Contributor Author

a tab was off 😄

Collaborator

You can take this change and land it in a separate PR if you want it in earlier/immediately.

Contributor Author

It's fine, it only catches the eye when writing other aligned stuff just below.

Comment on lines +45 to +50
#[derive(Debug, PartialEq)]
pub enum CachedData {
    Data(Vec<u8>),
    Deleted,
    ToDelete,
}
Contributor Author

Here, Deleted means "we know that there is no value here". The name could be a bit unclear since it may suggest that some value was here before, while it just means that we checked the actual storage and there is no value.

Contributor Author

Also, having an entry at a key K in the cache means "we know the state of the value here". If the key is not in the cache, its state is unknown.
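To make that invariant concrete, here is a small illustrative sketch (the helper function is hypothetical; only the enum mirrors this PR) of what each lookup result tells us:

#[derive(Debug, PartialEq)]
pub enum CachedData {
    Data(Vec<u8>),
    Deleted,
    ToDelete,
}

/// Hypothetical helper: what a cache lookup result for key K tells us.
fn describe(entry: Option<&CachedData>) -> &'static str {
    match entry {
        Some(CachedData::Data(_)) => "known: this value is stored for the key",
        Some(CachedData::Deleted) => "known: storage was checked and holds no value",
        Some(CachedData::ToDelete) => "known: the value will be deleted when the cache is flushed",
        None => "unknown: nothing cached, the host must be queried",
    }
}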

Collaborator

🤔

Since we know that the empty Vec is not something that we store, would it make more sense to simply have:

#[derive(Debug, PartialEq)]
#[non_exhaustive] // should include this in case we decide to extend or change the behaviour here
pub enum CacheValue {
    Data(Vec<u8>),
    ToDelete,
}

Then you can match on the value as follows:

match cache_value {
    Data(bytes) if !bytes.is_empty() => borsh::from_slice(bytes),
    ToDelete => ...
}

Comment on lines +104 to +122
match self.cache.borrow_mut().entry(key) {
    Entry::Occupied(entry) => match entry.get() {
        CachedData::Data(data) => from_slice::<V>(data)
            .map_err(|_| StateError::Deserialization)
            .map(Some),
        CachedData::ToDelete | CachedData::Deleted => Ok(None),
    },
    Entry::Vacant(entry) => {
        if let Some(bytes) = Self::get_from_host(key)? {
            let value = from_slice::<V>(&bytes)
                .map_err(|_| StateError::Deserialization)
                .map(Some);
            entry.insert(CachedData::Data(bytes));
            value
        } else {
            entry.insert(CachedData::Deleted);
            Ok(None)
        }
    }
Contributor Author (@iFrostizz, Jun 10, 2024)

I'm not sure how to avoid the repetition here: in one branch the data is borrowed because it comes from the cache; in the other, it has been copied from the host, so it's owned.

Contributor Author

It works quite well though, because we need an owned value to insert into the cache.

Collaborator

Why can't you follow the same pattern as before but use a match instead?

let val_bytes = match cache.get(&key) {
    Some(CacheValue::Data(val)) => val,
    Some(CacheValue::ToDelete | CacheValue::Deleted) => {
        return Ok(None);
    }
    None => { /* same as the else from before */ }
};

Collaborator

Okay, this actually compiles

    pub fn get<V>(self, key: K) -> Result<Option<V>, Error>
    where
        V: BorshDeserialize,
    {
        #[link(wasm_import_module = "state")]
        extern "C" {
            #[link_name = "get_bytes"]
            fn get_bytes(ptr: *const u8, len: usize) -> HostPtr;
        }

        let mut cache = self.cache.borrow_mut();

        let cached_data = match cache.get(&key) {
            Some(cached_data) => cached_data,
            None => {
                let args_bytes = borsh::to_vec(&key).map_err(|_| StateError::Serialization)?;

                let ptr = unsafe { get_bytes(args_bytes.as_ptr(), args_bytes.len()) };

                if ptr.is_null() {
                    cache.insert(key, CachedData::Deleted);
                    return Ok(None);
                }

                cache.entry(key).or_insert(CachedData::Data(ptr.into()))
            }
        };

        match cached_data {
            CachedData::Data(data) => from_slice::<V>(data)
                .map_err(|_| StateError::Deserialization)
                .map(Some),
            CachedData::ToDelete | CachedData::Deleted => Ok(None),
        }
    }

Collaborator (@richardpringle) left a comment

Okay, sorry for taking so long with this review.

I want to check with @dboehm-avalabs to see if he also thinks it would make sense before you launch into it, but what I would like you to do is centralize the API first. What I mean by that is a single

#[link(wasm_import_module = "state")]
extern "C" {
    #[link_name = "put_bytes"]
    fn put_bytes(ptr: *const u8, len: usize);
}

That actually takes something like a Vec<(Vec<u8>, Vec<u8>)> as the serialized data. The first value in the tuple is the key, the second is the value. If the value is empty, that should signify a delete. I'm not 100% sure what the best way is to represent this in Go, maybe [][2][]byte or something along those lines?

You should be able to do that with only minor changes to the Rust (nothing to the actual cache).

Once we merge that code, we can come back and do the CachedData enum.

Off the top of my head, I think the cache would probably be HashMap<K, Option<Vec<u8>>>.
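As a rough illustration of that suggestion (the names and the op encoding are assumptions, not this PR's final API), a single host call could take the serialized (key, value) pairs, with an empty value meaning delete:

use borsh::to_vec;

// `StateError` stands in for the error type this PR already uses; the variant
// name matches the one in the diff above.
#[derive(Debug)]
pub enum StateError {
    Serialization,
}

#[link(wasm_import_module = "state")]
extern "C" {
    #[link_name = "put_bytes"]
    fn put_bytes(ptr: *const u8, len: usize);
}

/// Hypothetical flush: one host call carrying every pending put and delete,
/// serialized as (key, value) pairs where an empty value means "delete".
fn flush_ops(ops: Vec<(Vec<u8>, Vec<u8>)>) -> Result<(), StateError> {
    let serialized = to_vec(&ops).map_err(|_| StateError::Serialization)?;
    unsafe { put_bytes(serialized.as_ptr(), serialized.len()) };
    Ok(())
}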

}
}
return nil
})},
Collaborator

There should be a single API for delete and delete_many. It'll be easier to review the changes if you consolidate the API and fix the call signature on the Rust side before you start using it.

Collaborator

I would argue that it would probably even be better to have a single put operation such that put-ing an empty Vec/byte-slice is the same as delete.

I'm not 100% sure if that makes sense with how state should work though. I'm only assuming that storing an empty slice isn't valid.

Contributor Author

Opened #1057 in order to address #1029 (comment)

} else {
    entry.insert(CachedData::Deleted);
    Ok(None)
}
Collaborator

Same as the comment above, I don't think you need all the repetition. There's a lot of surface area for bugs here.

/// Returns an [Error] if the key cannot be serialized
/// or if the host fails to delete the key and the associated value
pub fn delete<T: BorshDeserialize>(self, key: K) -> Result<Option<T>, Error> {
fn get_from_host(key: K) -> Result<Option<Vec<u8>>, Error> {
Collaborator

I don't think you need this static method.

Comment on lines +199 to +200
CachedData::Data(value) => puts.push(PutArgs { key, value }),
CachedData::ToDelete => deletes.push(key),
Collaborator

Yeah, this tells me that a single API here would be better. I can't remember where I said it (could have been in this review or elsewhere), but I think that putting an empty Vec<u8> isn't valid. So you can actually eliminate all invalid state (minimizing the runtime checks) by making empty Vec puts the equivalent of delete.

@dboehm-avalabs, thoughts?
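A small sketch of that idea (hypothetical names): with "empty value == delete", the pending-writes map suggested earlier (HashMap<K, Option<Vec<u8>>>) flattens into a single op list, and the separate puts/deletes vectors disappear.

use std::collections::HashMap;

/// Hypothetical helper: turn pending writes into one op list,
/// encoding a pending delete (None) as an empty value.
fn collect_ops(pending: HashMap<Vec<u8>, Option<Vec<u8>>>) -> Vec<(Vec<u8>, Vec<u8>)> {
    pending
        .into_iter()
        .map(|(key, value)| (key, value.unwrap_or_default()))
        .collect()
}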

Comment on lines +204 to +212
if !puts.is_empty() {
    let serialized_args = borsh::to_vec(&puts).expect("failed to serialize");
    unsafe { put_many_bytes(serialized_args.as_ptr(), serialized_args.len()) };
}

if !deletes.is_empty() {
    let serialized_args = borsh::to_vec(&deletes.as_slice()).expect("failed to serialize");
    unsafe { delete_many_bytes(serialized_args.as_ptr(), serialized_args.len()) };
}
Collaborator

Haha, I just keep seeing good reasons to centralize the API into a single call. If there's only one call, you can do a simple if cache.is_empty() { return } for Drop.
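For illustration, a minimal sketch (assumed types and host call, not this PR's code) of the Drop early-return described above:

#[link(wasm_import_module = "state")]
extern "C" {
    #[link_name = "put_bytes"]
    fn put_bytes(ptr: *const u8, len: usize);
}

/// Hypothetical cache holding pending (key, value) ops; an empty value means delete.
struct Cache {
    ops: Vec<(Vec<u8>, Vec<u8>)>,
}

impl Drop for Cache {
    fn drop(&mut self) {
        // With a single host call, an untouched cache costs nothing on Drop.
        if self.ops.is_empty() {
            return;
        }
        let ops = std::mem::take(&mut self.ops);
        let serialized = borsh::to_vec(&ops).expect("failed to serialize");
        unsafe { put_bytes(serialized.as_ptr(), serialized.len()) };
    }
}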

Contributor Author (@iFrostizz)

Closing in favor of #1059

@iFrostizz iFrostizz closed this Jun 13, 2024