fix and document aggregate cluster behavior #13134
Sure, I will update the document. Please assign the issue to me. Thanks!
Also, how are EDS drops supposed to work in an aggregate cluster? They are supposed to be top-level parameters, but what happens if an aggregate cluster points to an EDS cluster? Or worse, what happens if the aggregate cluster points to multiple EDS clusters, each with their own drop config? Note that Envoy does not currently implement EDS drop support, but gRPC does, so we need to understand how this is supposed to work.
Similar question for the overprovisioning factor.
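For concreteness, here is a hedged sketch of the ambiguous case (cluster names are hypothetical): an aggregate cluster referencing two EDS clusters, each of whose `ClusterLoadAssignment` may carry its own top-level `drop_overloads` and `overprovisioning_factor`:

```yaml
# Hypothetical config illustrating the question above.
clusters:
- name: aggregate_cluster
  connect_timeout: 0.25s
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.aggregate
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
      clusters:
      - eds_cluster_a
      - eds_cluster_b
# Each EDS cluster's ClusterLoadAssignment may independently specify:
#
#   policy:
#     drop_overloads:
#     - category: throttle
#       drop_percentage: {numerator: 10, denominator: HUNDRED}
#     overprovisioning_factor: 140
#
# It is unclear which (if any) of these top-level parameters should apply
# when endpoints are picked through the aggregate cluster.
```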
gRPC is about to implement support for aggregate clusters, and I want to make sure we agree on the expected semantics, so that we don't end up making a decision that is incompatible with what Envoy does or may do in the future. I've put together the following doc proposing what I think the semantics should be: https://docs.google.com/document/d/17fc0rAktQpc5oMkerOLHSTnXwX1xWAqW7vZ6_LL47NY/edit# Already discussed with @yxue, @htuch, @ejona86, and @dfawley. I'd appreciate feedback from @mattklein123, @snowp, as well as any of the other @envoyproxy/api-shepherds that may be interested.
https://docs.google.com/document/d/17fc0rAktQpc5oMkerOLHSTnXwX1xWAqW7vZ6_LL47NY/edit# has now been approved as specifying the expected behavior of aggregate clusters. This issue should now cover the following:
See discussion in #17412 about the max_retries circuit breaker. In general, we do want circuit breakers to come from the underlying clusters instead of from the aggregate cluster, as per our previous discussions, but the max_retries circuit breaker will need to come from the aggregate cluster instead. I've updated the design doc mentioned above to reflect this.
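As a hedged sketch of what that split of configuration might look like (cluster names are hypothetical): the max_retries threshold would live on the aggregate cluster, while the other circuit-breaker thresholds stay on the underlying clusters:

```yaml
# Hypothetical illustration of the proposed split (see #17412).
clusters:
- name: aggregate_cluster
  lb_policy: CLUSTER_PROVIDED
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_retries: 3          # honored at the aggregate-cluster level
  cluster_type:
    name: envoy.clusters.aggregate
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
      clusters: [primary, failover]
- name: primary
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_requests: 1024      # all other thresholds come from the
      max_connections: 512    # underlying clusters
```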
@yanavlasov has just pointed out to me that we got this slightly wrong. Our previous discussions in this issue assumed that the core idea of an aggregate cluster was that it concatenated the priorities from the underlying clusters into a single list, so that it could use a single LB policy defined at the aggregate cluster layer to choose a priority from that combined list. This made sense as a way to reuse the normal logic for choosing a priority, since in Envoy priorities are normally handled by the LB policy. Therefore, we assumed that aggregate clusters always used the LB policy defined in the aggregate cluster rather than the ones defined in the underlying clusters.

However, Yan pointed out to me that aggregate clusters don't actually define the LB policy in the aggregate cluster; instead, the aggregate cluster uses a special cluster-provided LB policy that first chooses the underlying cluster and then delegates to the LB policy of the underlying cluster. This is actually the documented behavior, and it looks like the Envoy implementation is indeed doing a two-level pick. I think this means that we really do want basically all of the configuration to come from the underlying cluster, with the sole exception of the max_retries circuit breaker.

Note that the current Envoy implementation has the aggregate cluster pick the underlying cluster only when the LB pick is requested. However, since we also want features that are unrelated to the LB pick (e.g., circuit breaking, outlier detection, and EDS drops (see #31345)) to be configured at the underlying cluster level, Envoy will probably need to decouple this from the LB pick somehow. It may need a layer of indirection whereby it first asks the cluster for its configuration and then uses that configuration for all of the above features plus the LB pick.
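The two-level pick described above can be sketched roughly as follows. This is a toy model, not Envoy's actual API; all class and method names are invented for illustration, and the priority-failover and LB logic are deliberately simplified:

```python
# Toy model of an aggregate cluster's two-level pick (names invented).

class UnderlyingCluster:
    """An underlying cluster with its own hosts and its own LB policy
    (modeled here as simple round-robin)."""

    def __init__(self, name, hosts):
        self.name = name
        self.hosts = hosts
        self._next = 0

    def pick_host(self):
        # This cluster's own LB policy picks the host.
        host = self.hosts[self._next % len(self.hosts)]
        self._next += 1
        return host


class AggregateCluster:
    """Level 1: choose an underlying cluster (first one with hosts, in
    priority order). Level 2: delegate the host pick -- and, per the
    discussion above, ideally other per-cluster behavior too -- to that
    underlying cluster's own LB policy."""

    def __init__(self, clusters):
        self.clusters = clusters  # in priority order

    def pick_underlying_cluster(self):
        for cluster in self.clusters:
            if cluster.hosts:
                return cluster
        return None

    def pick_host(self):
        cluster = self.pick_underlying_cluster()
        if cluster is None:
            return None
        return cluster.pick_host()
```

The decoupling question in the last paragraph is visible here: everything hangs off `pick_host()`, whereas features like circuit breaking and outlier detection would need the chosen underlying cluster's configuration without performing an LB pick.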
A couple of other things to note from a conversation with @yanavlasov:
FYI, I've written a design for fixing the aggregate cluster behavior in gRPC in grpc/proposal#405. |
The aggregate cluster documentation describes how load balancing works with an aggregate cluster, but it doesn't describe how any of the other cluster behavior works.
Here are a few off-the-cuff examples (not a comprehensive list):
@yxue, can you please update the documentation to explain how these sorts of things are supposed to work? Ideally, the doc should cover the behavior for every feature configurable via the Cluster xDS resource.
Thanks!