
Switch digest to cuckoo filters, to enable O(1) removal #413

Merged
merged 14 commits into httpwg:master
Mar 1, 2018

Conversation

yoavweiss
Contributor

Resolves #268

@yoavweiss
Contributor Author

cc @mnot @kazuho

@mnot
Member

mnot commented Nov 5, 2017

Hey Yoav,

Thanks; will take a look. Two immediate things:

  1. You're getting an error in the markdown; mapping values are not allowed in this context at line 80 column 25.

  2. I see you've added yourself as an author. That's generally the decision of the chair - @mcmanus in this case.

@yoavweiss
Contributor Author

You're getting an error in the markdown; mapping values are not allowed in this context at line 80 column 25.

Hopefully fixed. Is there a way to test it locally?

I see you've added yourself as an author. That's generally the decision of the chair - @mcmanus in this case.

Apologies for the noobness. Removed myself.

@mnot
Member

mnot commented Nov 6, 2017

See SUBMITTING.md for build info.

@kazuho
Contributor

kazuho commented Nov 7, 2017

@yoavweiss Thank you for working on the proposal.

  • Am I correct in assuming that changes other than the switch to Cuckoo filters and the introduction of SENDING_CACHE_DIGEST are unintentional? For example, I see the VALIDATORS flag of the CACHE_DIGEST frame being removed.
  • Do you have working code that implements Cuckoo filters? I am curious to see it working.
  • The concept of SENDING_CACHE_DIGEST makes sense to me. Maybe we might want to adjust the codepoints and the naming in relation to ACCEPT_CACHE_DIGEST.

@kazuho
Contributor

kazuho commented Nov 7, 2017

For example, I see the VALIDATORS flag of the CACHE_DIGEST frame being removed.

Oh, I now understand the intent of removing the flag.

The motive of the proposal is to build a digest without referring to every response object stored in the cache. That means it is not easy for the client to determine the freshness of the entries that are going to be included in the digest.

I am sympathetic to the idea, but I am afraid the approach may not work well with the current mechanism of HTTP/2 caching. My understanding is that browsers that exist today only consume a pushed response when they fail to find a freshly cached response in their cache. Otherwise, the pushed response never lands in the browser cache. Unless we change the behavior of browsers to respect a pushed response even when a freshly cached object already exists in the cache, there's a chance that servers would continually push responses that get ignored by the client (due to the existence of a freshly cached response in the browser cache with the same URL).

@yoavweiss Assuming that I correctly understand the motive of removing the distinction between a fresh digest and a stale digest, I would appreciate it if you could clarify your ideas on the problem.

@yoavweiss
Contributor Author

Thanks for reviewing, @kazuho! :)

My intent was to include all stored resources in the digest, regardless of them being stale or fresh. Entries are added to the digest when a resource is added to the cache and removed from the digest when a resource is removed.

The reason is that I think the distinction doesn't make much sense, and maintaining it adds a lot of complexity, basically forcing browsers to recreate the digest for every connection at O(N) cost.

Under this premise, what servers should do is the following (rough sketch after the list):

  • Push all the resources that are known not to be in the cache digest
  • Push 304 responses for resources that are in the cache digest, but are likely to be stale (short freshness lifetime, etc.)
  • Don't push resources that are in the cache digest and have a long-term freshness lifetime or are immutable.
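A rough sketch of that decision (not from the draft; `digest.contains()` and the `longLivedOrImmutable` flag are hypothetical names a server would derive from the received digest and its own response metadata):

```js
// Sketch of the server-side push decision under this premise.
function decidePush(digest, resource) {
  if (!digest.contains(resource.url, resource.etag)) {
    return 'push-full-response'; // not in the client's digest
  }
  if (resource.longLivedOrImmutable) {
    return 'skip';               // in the digest and still trustworthy
  }
  return 'push-304';             // in the digest, but likely stale: revalidate via push
}
```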

Does that make sense? I'm not sure I understand your reference to the push cache vs. the HTTP cache in your comment. In light of my explanation, is there still an issue there in your view?

@yoavweiss
Contributor Author

Do you have working code that implements Cuckoo filters? I am curious to see it working.

https://github.com/efficient/cuckoofilter is the reference implementation.

The concept of SENDING_CACHE_DIGEST makes sense to me. Maybe we might want to adjust the codepoints and the naming in relation to ACCEPT_CACHE_DIGEST.

Happy to change it. Do you have any specific changes in mind?

@kazuho
Contributor

kazuho commented Nov 7, 2017

@yoavweiss

My intent was to include all stored resources in the digest, regardless of them being stale or fresh. Entries are added to the digest when a resource is added to the cache and removed from the digest when a resource is removed.

The reason is that I think the distinction doesn't make much sense, and maintaining it adds a lot of complexity, basically forcing browsers to recreate the digest for every connection at O(N) cost.

Thank you for the explanation. I now understand the intent better.

I think that we need to consider two issues regarding the approach.

First is the fact that a browser cache may contain more stale responses than fresh ones. Below are the numbers of cached objects found in my Firefox cache (to be honest, the data is from 2016; I haven't been using Firefox in recent weeks and therefore cannot provide up-to-date numbers).

| host | fresh | stale | total |
| --- | ---: | ---: | ---: |
| *.facebook.com | 790 | 1,483 | 2,273 |
| *.google.com | 373 | 630 | 1,003 |

As you can see, large-scale websites tend to have more stale objects than fresh ones. In other words, including information about stale-cached objects roughly triples the size of the digest in this case. Since performance-sensitive resources (the ones we need to push) are likely to be stored fresh (they are the most likely to be marked as immutable, or near-immutable), transmitting only the digest of freshly cached responses makes sense.

Second is a configuration issue on the server side.

One strategy that can be employed by an H2 server (under the current draft) is to receive a digest of freshly cached resources only, compare the digest against the list of resources the browser should preload using only the URL, and push the missing resources to the client. It is possible for an H2 server to perform the comparison without actually fetching the resource (from origin or from cache), since only the URL is required for calculating the digest.

The proposal prevents such a strategy from being deployed, since it requires the ETag values to always be taken into consideration (should they be associated with the HTTP responses). In other words, servers would be required to load the response headers of the resources to determine whether they need to be pushed, which could be a huge performance degradation on some deployments.

Fortunately, servers could avoid the issue by not including ETags for resources that they may push. I think such a change in server-side configuration would be possible, but we need to make sure of that if we are to take the path (of removing the fresh vs. stale distinction).
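To make the URL-only strategy concrete, a minimal sketch assuming a hypothetical `digest.contains(url)` membership test: the server works from its preload list alone and never has to load each resource's headers to find an ETag.

```js
// Select the resources to push: those the client's digest does not cover.
function selectPushCandidates(digest, preloadUrls) {
  return preloadUrls.filter((url) => !digest.contains(url));
}
```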

I'm not sure I understand your reference to the push cache vs. the HTTP cache in your comment. In light of my explanation, is there still an issue there in your view?

Let me explain using an example.

Consider the following case:

  • client has https://example.com/style.css with ETag: 12345 and Expires: Nov 30 2017
  • on server-side, the resource has been updated to ETag: 67890

When receiving a new request from the client, the server cannot determine if the client has style.css in its cache. Therefore, style.css would be pushed.

The client, when observing link: </style.css>; rel=preload (or equivalent <link> tag), tries to load the resource. Since the fresh resource exists within the browser cache, that would be used. The pushed version is ignored and gets discarded (*).

This would be repeated every time until the cached object either becomes stale or gets removed from the cache.

My understanding is that the browser behavior (explained in *) is true for Firefox and also for Chrome. Am I wrong, or missing something?

Do you have working code that implements Cuckoo filters? I am curious to see it working.

https://github.com/efficient/cuckoofilter is the reference implementation.

Thank you for the link. I will try to use it.

OTOH, do you have some working code that can actually calculate the cache-digest value taking a list of URLs as an input (something like https://github.com/h2o/cache-digest.js)? I ask this because it would give us a better sense of what the actual size of the digest would be.

The concept of SENDING_CACHE_DIGEST makes sense to me. Maybe we might want to adjust the codepoints and the naming in relation to ACCEPT_CACHE_DIGEST.

Happy to change it. Do you have any specific changes in mind?

One way to proceed would be to split the discussion of SENDING_CACHE_DIGEST from Cuckoo filters into a separate issue or a PR. I do not have a strong opinion on the naming or the codepoints. What do you think? @mnot

@kazuho
Contributor

kazuho commented Nov 7, 2017

@yoavweiss Have you considered the approach of using a Cuckoo filter to generate a GCS?

I understand that you do not want to iterate through the browser cache when sending a cache digest. A per-host Cuckoo filter seems like a good solution to the issue.

OTOH, as I described in my previous comment, it seems that sending the hash directly has several issues.

That is why I am wondering if it would be viable to generate GCS from the per-host Cuckoo filter that would be maintained within the browser.

I can see three benefits in the approach, compared to sending the values of the Cuckoo filter directly:

  • the size of the digest will be smaller
  • we can keep the distinction between fresh vs. stale. Sending a digest of fresh resources only would result in even smaller digests. Retaining the distinction lowers the bar for deploying cache digests on the server side.
    • note: you can store the time when the cached object becomes stale in the data associated with the Cuckoo filter entry (assuming that you would have associated data to handle resize, as we discussed in Enabling O(1) removal from digest #268 (comment)). That information can be used when building the GCS to determine whether a particular object should go into a GCS of fresh resources or that of stale ones
  • less change to the browser push handling (no need to handle pushes of 304 or replace a freshly cached object when an object with the same URL is being pushed)

In case of *.facebook.com or *.google.com in the comment above, sending fresh-only digests using GCS would be about 1/3 the size of sending fresh & stale digests using Cuckoo filter.

The biggest cost of calculating a GCS from the Cuckoo filter would be the sort operation. But I think that cost could be negligible compared to the ECDH operation that we would be doing for every connection, considering that the number of entries we would need to sort would be small (e.g., up to 1,000 uint32_t entries), and that sorting algorithms faster than O(n log n) can be deployed (e.g., radix sort).

WDYT?

@yoavweiss
Contributor Author

Have you considered the approach of using a Cuckoo filter to generate a GCS?

So have a cuckoo filter digest and then put its fingerprints in a GCS? I have not considered that. Need to give it some thought...

At the same time, it's not clear to me how that would enable "stale" vs. "fresh" digests, or the handling of improperly cached resources (fresh resources that were replaced on the server).

@kazuho
Contributor

kazuho commented Nov 8, 2017

Have you considered the approach of using a Cuckoo filter to generate a GCS?

So have a cuckoo filter digest and then put its fingerprints in a GCS? I have not considered that. Need to give it some thought...

I would appreciate it if you could consider it. To me it seems worth giving some thought.

At the same time, it's not clear to me how that would enable "stale" vs. "fresh" digests, or the handling of improperly cached resources (fresh resources that were replaced on the server).

Under the approach proposed in this PR, the structure that stores the per-host digest would look like the following. `hashes` is required for resizing the filter (e.g., when doubling or halving `num_buckets`).

uintFF_t fingerprints[num_buckets]; // FF is the size of the fingerprint
uint32_t hashes[num_buckets];       // contains the 32-bit hash value of each entry in `fingerprints`

What I am suggesting is that you could change the structure to the following.

uintFF_t fingerprints[num_buckets]; // FF is the size of the fingerprint
struct {
  uint32_t hash;
  time_t becomes_stale_at;
} hashes_and_expire_times[num_buckets]; // per-entry hash and the moment it becomes stale

In addition to the hash value, each entry will contain the moment when the entry becomes stale. The moment can be calculated when the entry is added. For example, if the entry represents an HTTP response with `cache-control: max-age=V`, becomes_stale_at can be calculated as now + V. If the entry represents an immutable HTTP response, then becomes_stale_at should be set to a very large value (e.g., INT64_MAX, assuming that the underlying type of time_t is int64_t).
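A minimal sketch of that computation (in JavaScript, using milliseconds rather than time_t seconds; Expires and the other freshness rules are omitted for brevity, so this is only an illustration):

```js
// Compute the moment a cached response becomes stale, at insertion time.
function becomesStaleAt(responseHeaders, now = Date.now()) {
  const cc = responseHeaders['cache-control'] || '';
  if (/\bimmutable\b/.test(cc)) return Number.MAX_SAFE_INTEGER; // "very large value"
  const m = /max-age=(\d+)/.exec(cc);
  return m ? now + Number(m[1]) * 1000 : now; // now + V, or treat as already stale
}
```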

When building a GCS digest, you would do the following (rough sketch below):

  • step 1. prepare an empty list that would contain hashes of fresh responses
  • step 2. prepare an empty list that would contain hashes of stale responses
  • step 3. foreach entry in cuckoo_filter:
    • step 3-1. check if the entry is fresh or not, by checking the value of becomes_stale_at
    • step 3-2. if the entry is fresh, append hash of the entry to the list of the hashes of fresh responses
    • step 3-3. otherwise, append hash of the entry to the list of the hashes of stale responses
  • step 4. sort the list of hashes of the fresh responses, encode as GCS, and send
  • step 5. sort the list of hashes of the stale responses, encode as GCS, and send

You can skip the operations related to stale objects (i.e. step 2, 3-3, 5) if the server is unwilling to receive stale digests.
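A rough sketch of the steps above (not normative). Each filter entry is assumed to carry `{ hash, becomesStaleAt }` as suggested; `gcsEncode()` is a hypothetical Golomb-Rice encoder along the lines of cache-digest.js.

```js
// Split the cuckoo-filter entries into fresh/stale hash lists and encode them.
function buildGcsDigests(entries, now = Date.now()) {
  const fresh = [];                                         // step 1
  const stale = [];                                         // step 2
  for (const e of entries) {                                // step 3
    (e.becomesStaleAt > now ? fresh : stale).push(e.hash);  // steps 3-1..3-3
  }
  fresh.sort((a, b) => a - b);                              // step 4
  stale.sort((a, b) => a - b);                              // step 5
  return { fresh: gcsEncode(fresh), stale: gcsEncode(stale) };
}
```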

Whether the approach can be implemented depends on whether a client can determine the moment a response becomes stale. I anticipate that it is possible to determine that when you register the entry with the Cuckoo filter (which is when you receive the response from the server).

@sebdeckers

Do you have working code that implements Cuckoo filters? I am curious to see it working.

https://github.com/efficient/cuckoofilter is the reference implementation.

Thank you for the link. I will try to use it.

OTOH, do you have some working code that can actually calculate the cache-digest value taking a list of URLs as an input (something like https://github.com/h2o/cache-digest.js)? I ask this because it would give us a better sense of what the actual size of the digest would be.

@yoavweiss @kazuho I'm planning to attend the IETF 100 hackathon this weekend in Singapore. (First timer here. 🤗🔰) I'm happy to collaborate on a (Node.js?) implementation of this spec if either of you are around and interested. I'm fairly familiar with the current spec, having implemented it as a service worker and on the server.

@kazuho
Contributor

kazuho commented Nov 9, 2017

@sebdeckers

I'm planning to attend the IETF 100 hackathon this weekend in Singapore. (First timer here. 🤗🔰) I'm happy to collaborate on a (Node.js?) implementation of this spec if either of you are around and interested.

Wonderful! I'll be attending the hackathon on both days (i.e., Saturday and Sunday). I do not think I will have time to work on Cache Digests myself, but I would love to discuss your work on them with you (or help, if you need it).


@sebdeckers sebdeckers left a comment


Feedback based on WIP implementation of cuckoo filters for cache digest: https://gitlab.com/http2/cache-digest-koel

7. Let `h2` be the return value of {{hash}} with `fingerprint` and `N` as inputs, XORed with `h1`.
8. Let `h` be `h1`.
9. Let `position_start` be 40 + `h` * `f`.
10. Let `position_end` be `position_start` + `f` * `b`.


b is not defined

4. While `fingerprint-value` is 0 and `h` > `f`:
4.1. Let `fingerprint-value` be the `f` least significant bits of `hash-value`.
4.2. Let `hash-value` be the `h`-`f` most significant bits of `hash-value`.
4.3. `h` -= `f`


This code feels inconsistent with the writing style used throughout. Would suggest:

Subtract `f` from `h`.

`hash-value` can be computed using the following algorithm:

1. Let `hash-value` be the SHA-256 message digest {{RFC6234}} of `key`, expressed as an integer.
2. Return `hash-value` modulo N.

@sebdeckers sebdeckers Nov 12, 2017


This is difficult to do in JavaScript, where uint operations are typically still limited to 32 bits. The truncation in the previous proposal (step 4) is more compatible and, if I understand correctly, achieves the same objective. Can this be changed to something that does not require a 256-bit integer modulo?

Contributor


I wonder if the need for an integer modulo is due to an error in the specification.

The text in the PR states that N is a prime number smaller than 2**32. Could it be the case that N is something that should be defined as a power of two (2**n)?

If that is the case, the modulo operation can be implemented by using bitwise AND.
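An illustration of that point, assuming N were defined as a power of two (N = 2**n): only the low-order bits of the SHA-256 digest matter, so the modulo reduces to a bitwise AND on a 32-bit value and no 256-bit integer arithmetic is needed.

```js
const crypto = require('crypto');

// Return hash(key) mod N for a power-of-two N, using only the low 32 bits.
function hashKey(key, N) {
  const digest = crypto.createHash('sha256').update(key).digest();
  const low32 = digest.readUInt32BE(28); // least-significant 4 bytes of the digest
  return (low32 & (N - 1)) >>> 0;        // equals (full 256-bit digest) mod N when N = 2**n, n <= 32
}
```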

Contributor Author


OK, I'll truncate the hash before the modulo operation.

* `ETag`, an array of characters
* `validators`, a boolean
* `URL` a string corresponding to the Effective Request URI ({{RFC7230}}, Section 5.5) of a cached response {{RFC7234}}.
* `ETag` a string corresponding to the entity-tag {{RFC7232}} if a cached response {{RFC7234}} (if the ETag is available; otherwise, null).


From an implementor's perspective, it would help me to understand this if examples were provided.

Contributor Author


Examples for URL and ETag? Or something else?


ETag

Btw just noticed a typo on line 321: of a cached response

@@ -99,10 +102,9 @@ allows a stream to be cancelled by a client using a RST_STREAM frame in this sit
is still at least one round trip of potentially wasted capacity even then.

This specification defines a HTTP/2 frame type to allow clients to inform the server of their
-cache's contents using a Golomb-Rice Coded Set {{Rice}}. Servers can then use this to inform their
+cache's contents using a Cuckoo-fliter {{Cuckoo}} based digest. Servers can then use this to inform their


typo: filter


1. Let `f` be the number of bits per fingerprint, calculated as `P + 3`
2. Let `b` be the bucket size, defined as 4
3. Let `bytes` be `f`*`N`*`b`/8 rounded up to the nearest integer


Markdown escaping issue makes N italic and hides the * characters.

9. Let `position_start` be 40 + `h` * `f`.
10. Let `position_end` be `position_start` + `f` \* `b`.
11. While `position_start` < `position_end`:
7. Let `fingerprint-string` be the value of `fingerprint` in base 10, expressed as a string.


@yoavweiss Curious... May I ask why this change? I don't see any issues with it. Just don't understand what it means.

Contributor Author

@yoavweiss yoavweiss Nov 15, 2017


It defines a way to convert `fingerprint` into a string, so that we can apply {{hash}} to it.

@yoavweiss
Contributor Author

yoavweiss commented Nov 15, 2017

I've got an incomplete initial reference implementation at https://github.com/yoavweiss/cache-digests-cuckoo

It doesn't yet include removal and querying (that's what I'll be adding next), but I did run it on a list of ~3250 URLs (which I got out of my main profile's chrome://cache/) and it seems to be creating reasonably sized digests. One more advantage: the digests seem to be highly compressible when sparse.

Results so far:
Digest with 1021 entries (so room for ~4K URLs): 5621 bytes in memory, 5233 bytes gzipped (when filled with the 3250 URLs).
Digest with 2503 entries (so room for ~10K URLs): 13772 bytes in memory, 6879 bytes gzipped (same 3250 URLs).
Digest with 7919 entries (so room for ~31K URLs): 43560 bytes in memory, 9984 bytes gzipped (same 3250 URLs).

In practice, I think ~1000 entries is most probably enough, but it's good to know we can increase the digest size (to avoid having to recreate it) without a significant over-the-wire penalty.
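For reference, these in-memory sizes appear to be consistent with the draft's `f*N*b/8` size formula (with `f = P + 3 = 11` bits and `b = 4`) plus what looks like the 40-bit header:

```js
const f = 8 + 3, b = 4;                               // P = 8, 11-bit fingerprints, 4-slot buckets
const bytes = (N) => Math.ceil((f * N * b) / 8) + 5;  // +5 bytes for the 40-bit header
[bytes(1021), bytes(2503), bytes(7919)];              // [5621, 13772, 43560] -- matches the numbers above
```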

@yoavweiss
Contributor Author

yoavweiss commented Nov 16, 2017

OK, I now have a complete reference implementation and it seems to be working fine. It also exposed an issue with the initial algorithm, forcing table allocation to accommodate a power of 2 number of entries.

Latest results for 3250 URLs taken from my cache:

| Number of entries | Full capacity | Digest memory size (bytes) | Digest gzipped size (bytes) | Digest brotli size (bytes) |
| ---: | ---: | ---: | ---: | ---: |
| 1021 | 4084 | 5637 | 5248 | 5092 |
| 1109 | 4436 | 11269 | 6153 | 5675 |
| 2019 | 8076 | 11269 | 7031 | 6785 |
| 4027 | 16108 | 22533 | 8663 | 7586 |

One note: the 1021-entry table had 35 collisions, so it seems it's insufficient for that number of URLs, unless we're willing to absorb extra pushes for ~1% of the resources.

@kazuho
Contributor

kazuho commented Nov 16, 2017

@yoavweiss Interesting! It's good to know that we have numbers now.

What is the value of P (the false positive ratio) that you used?

@yoavweiss
Contributor Author

P=8 (so a 1/256 false-positive rate)

@yoavweiss
Contributor Author

yoavweiss commented Nov 17, 2017

Note that it may be possible to optimize these numbers further. One example is semi-sorting of the buckets, which the Cuckoo filter paper mentions and which I have not yet implemented. It adds some runtime complexity, but can reduce the fingerprint size per resource by a full bit, so it could have resulted in ~9% smaller digests in this case.

@sebdeckers

@yoavweiss Awesome! 🤩

Not being familiar with these data structures (despite reading the Wikipedia article 😅), why does the 1/256 probability (~4/1000) result in 35 collisions?

@yoavweiss
Contributor Author

The 35 collisions are on top of the false-positive rate, and represent resources that we failed to put into the table to begin with (due to both their buckets being full). That collision rate seems high compared to the results in the paper, so I need to dig further to see who's wrong...

@yoavweiss
Contributor Author

The collisions are now fixed. It was an algorithm issue, where the entry to be evicted was always the same one at the end of the bucket. I've changed that to be a random fingerprint from the bucket, which significantly improved things. The reference implementation is now collision-free almost up to the point where the digest is full.
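For reference, a minimal sketch (in JavaScript, not the reference implementation) of the insertion loop with the random-victim eviction described above. It assumes `numBuckets` is a power of two so the XOR-based alternate index round-trips, fingerprint `0` marks an empty slot, and `fpHash()` stands in for the draft's {{hash}} of the fingerprint.

```js
const BUCKET_SIZE = 4;   // b = 4, as in the draft
const MAX_KICKS = 500;   // eviction bound from the Cuckoo filter paper

function altIndex(i, fp, numBuckets, fpHash) {
  return (i ^ fpHash(fp)) & (numBuckets - 1); // involution when numBuckets is a power of two
}

function bucketPut(buckets, i, fp) {
  for (let s = 0; s < BUCKET_SIZE; s++) {
    if (buckets[i * BUCKET_SIZE + s] === 0) { buckets[i * BUCKET_SIZE + s] = fp; return true; }
  }
  return false; // bucket is full
}

function cuckooInsert(buckets, numBuckets, i1, fp, fpHash) {
  const i2 = altIndex(i1, fp, numBuckets, fpHash);
  if (bucketPut(buckets, i1, fp) || bucketPut(buckets, i2, fp)) return true;

  // Both candidate buckets are full: evict a *random* slot (not always the
  // last one) and relocate the victim to its alternate bucket.
  let i = Math.random() < 0.5 ? i1 : i2;
  for (let kick = 0; kick < MAX_KICKS; kick++) {
    const slot = Math.floor(Math.random() * BUCKET_SIZE);
    const victim = buckets[i * BUCKET_SIZE + slot];
    buckets[i * BUCKET_SIZE + slot] = fp;
    fp = victim;
    i = altIndex(i, fp, numBuckets, fpHash);
    if (bucketPut(buckets, i, fp)) return true;
  }
  return false; // table is effectively full
}
```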

@mnot mnot changed the title [Cache Digests] Switch digest to cuckoo filters, to enable O(1) removal Switch digest to cuckoo filters, to enable O(1) removal Dec 12, 2017
@kazuho kazuho mentioned this pull request Feb 28, 2018
@kazuho kazuho merged commit d02e5d1 into httpwg:master Mar 1, 2018