Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GEP-1364 proposal #1383

Merged
merged 7 commits into from
Sep 26, 2022
Merged

Add GEP-1364 proposal #1383

merged 7 commits into from
Sep 26, 2022

Conversation

youngnick
Copy link
Contributor

What type of PR is this?
/kind gep

What this PR does / why we need it:
This PR adds an implementable proposal for GEP1364 (#1364).

It's a big GEP, for a very big change, that will definitely break conformance tests. But I think that it will set us up with clear rules for Conditions going forward.

It's quite opinionated, so I expect to need to defend my decisions here somewhat, don't be shy about objections.

Which issue(s) this PR fixes:
Updates #1364

Does this PR introduce a user-facing change?:

Status definitions have been updated across the API.

The release note will need something like the tl;dr in the GEP, once we all agree. 😄

Signed-off-by: Nick Young <[email protected]>
Signed-off-by: Nick Young <[email protected]>
@k8s-ci-robot k8s-ci-robot added kind/gep PRs related to Gateway Enhancement Proposal(GEP) cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 9, 2022
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 9, 2022
@shaneutt shaneutt self-requested a review September 9, 2022 15:52
site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
Copy link
Contributor

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a good direction

is no missing information, including but not limited to:
* Any mandatory references resolve to existing resources (examples here are the
Gateway's gatewayClass field, or the `parentRefs` field in Route resources)
* Any specified TLS secrets exist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the API supports partially-valid resources. I believe this is proposing a partially-valid resource would be Attached=false, but the config is actually 'active' (or, I would describe it, "attached").

To me this feels a bit odd. The old API sort of had this issue as well.

Thinking purely from a pedantic-correctness POV (which may not be ideal - sometimes simpler is better), I would expect there to be two conditions (I use bad names intentionally to avoid confusion): "ThisResourceWasApplied" - any part of the rule had some effect on the system, and "ThisRuleIsEntirelyValid" - no errors found in the rule at all.

Its possible my concern would be addressed with a different word than "Attached", rather than more conditions, as well

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is somewhat addressed below.. but I am not sure its accurate? It seems we are defining a very small set of "failures that do not become unattached", but there are a lot more partially-accepted resources I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guideline I've been basing the "does something become unattached?" on is "Does this produce at least some configuration in the underlying data plane?". So I guess it sounds like I need to be more explicit with that guideline, but that aside from that, we're largely in agreement?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to have "Is this attached" and "is this 100% valid" as two separate things.

Part of my confusion, I think, is I am not actually sure which fields result in "not attached". It seems most issues are partially accepted. Which is fine - Attached just becomes basically a pretty high level acknowledgement that the controller is handling the resource

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think I agree with @howardjohn's point here: "does this resource produce config" is distinct from "does this resource produce valid config". I'm sure some implementations don't allow invalid (e.g. duplicate route names or something) to be sent to the data plane, but it doesn't feel like that's a requirement for the Gateway API at this juncture.

Copy link
Contributor Author

@youngnick youngnick Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I wrote this pooly - by "Does this resource produce config" I meant to read "Is this resource valid enough to produce some config when combined with the rest of the resource set".

The thing I'm thinking of particularly here is that for HTTPRoute, invalid references still produce data path config because the match will respond with 500s. Whereas for TCPRoute etc, if you don't have a service to route to, there's no way you can build a forwarding path - there's nothing to forward to! So that won't produce any data path config at all, and so, in the current design, would cause the resource to fail to be attached.

I don't think anything should be allowed to produce invalid config in the underlying data plane. If so, that should mean that the resource becomes unattached.

I guess this means that we need some more clarity about what attached means. I've been assuming that if we consider a resource to be "attached" to another, that:

  • the resource is valid in itself
  • the resource will produce at least some valid config in the underlying data plane. (note that "return a 500 for this route" is valid config).

And so, that having a "valid" condition is not really useful because if something is not valid, it won't be attached either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for something like an HTTPRoute to define two paths, one of which is "valid enough" config, while the other is garbage?

If so, are there expectations or requirements from a conformance PoV ("always apply all or none") that we need to specify?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for something like an HTTPRoute to define two paths, one of which is "valid enough" config, while the other is garbage?

Most cases of an invalid configuration in one route rule (with others being completely valid within the same HTTPRoute) seem to be misconfigurable only in the context of a reference to another resource and fall into the ResolvedRefs bucket (which seems stated below and possibly agreed upon that Attached: true be set with ResolvedRefs: false and return HTTP code 500 for the invalid rule).

One such example not covered is if you were to use a HTTPRequestRedirectFilter and HTTPURLRewrite in the list of filters for an individual HTTPRouteRule (ref). At the moment it doesn't appear we have prescribed any status code or other error handling for this error. This seems to fall into the bucket of things that cannot produce data plane configuration.

Going by the statements below, if this rule were the only one present Attached: false should be set, since it cannot result in any configuration (with another more specific error condition). If present with other valid rules, Attached: true should be set (with another more specific error condition again). I can definitely see the argument for setting Attached: false however in the latter case, since stated above, the Attached condition means The resource is syntatically and semantically valid, and internally consistent.

It seems that the Valid: true condition mentioned above ends up being somewhat equivalent to having for example for HTTPRoute ResolvedRefs: true and no error conditions (and vice versa, Valid: false equivalent to ResolvedRefs: false or some more specific error condition). Maybe this framing helps a little with determining possible merits of having another condition?

If so, are there expectations or requirements from a conformance PoV ("always apply all or none") that we need to specify?

Just doing a quick pass, I don't think I saw any existing conformance tests that cover this kind of thing so I agree that would be good to add.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoiding "always apply all or none" is the reason this is tricky. There are mixed expectations about what happens if some config is present but bad, and I'm attempting to thread the needle here and make this as user-friendly as possible while not becoming really difficult to implement or understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a section on partial validity to hopefully make this clearer.

stuck with some sort of "will be ready in no longer than x seconds" standard
(which we'll also need to write conformance tests for, probably including a
stress test of adding a *lot* of records as well.) As an initial ballpark,
I propose 5 seconds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably won't work for cloud providers (at least GKE is slower, not sure about others).

I am also not sure it provides value. What do we lose if we don't define anything here?

If we consider something like Knative, they certainly aren't going to be able to just wait X seconds; I'd expect the same with other users

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if we can't provide any guarantees about Readiness, then it seems like Ready becomes no different than Attached...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We certainly can mandate Ready to be an extended condition, and say "If you supply a Ready condition, it has to work like this". Maybe doing that is a better place to start with Ready?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a feel for which implementations can and cannot support Ready properly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that cloud providers will likely have trouble implementing a Ready condition. Could we introduce a Reconciled condition that would be more broadly implementable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 any distributed (e.g. geo distributed) implementation has a hard time with this (think: config was pushed to 95% of the sites around the world, is it ready?)


Because this GEP is mainly concerned with updating the Conditions we are setting in
Gateway API resources' `status`, it's worth reviewing some important points about
Conditions. (Thi information is mainly taken from the [Typical status properties][typstatus]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing href for the link here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's all the way down at the bottom of the doc, with the other references. I thought it was better to have everything in the one place.

one entry of each item, using the `type` field as a key. (So, this is effectively
a map that looks like a list in YAML form).
* Each has a number of fields, the most important of which for this discussion
are `type`, `status`, `season`, and `observedGeneration`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
are `type`, `status`, `season`, and `observedGeneration`.
are `type`, `status`, `reason`, and `observedGeneration`.

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
is no missing information, including but not limited to:
* Any mandatory references resolve to existing resources (examples here are the
Gateway's gatewayClass field, or the `parentRefs` field in Route resources)
* Any specified TLS secrets exist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think I agree with @howardjohn's point here: "does this resource produce config" is distinct from "does this resource produce valid config". I'm sure some implementations don't allow invalid (e.g. duplicate route names or something) to be sent to the data plane, but it doesn't feel like that's a requirement for the Gateway API at this juncture.


Note that some classes of inter-resource reference failure do _not_ cause a resource
to become unattached (that is, to have the `Attached` condition set to `status: false`).
* Nonexistent Service backends - if the backend does not exist on a HTTPRoute that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This starts painting an observability problem IMO. Even if conditions are only meant to be human-readable, how does a human look at an HTTPRoute resource and know "oops I have a typo in my service backend"? Yes, 500s indicate something's wrong, but I would expect to find a condition pointing to the specific issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should allow room for error conditions in addition to attached in this case so the intformation can be conveyed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I guess I should make this clearer, that in this case, it's absolutely expected that there be some additional, negative-polarity error conditions that tell you that the backendRef doesn't exist. (This is a distinct case from the backend existing, but having zero endpoints, which would not generate a "backendRef doesn't exist" error, but would end up with the same behavior).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might also get a BackendRefEmpty condition...

Copy link
Member

@robscott robscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work on this @youngnick! This is a great start

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
field was when the controller last saw a resource.
* Conditions shoud describe the _current state_ of the resource at observation
time, which means that they should be an adjective (like `Ready`), or a past-tense
verb (like `Attached`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would Attaching ever be valid as a condition? Maybe just verb or adjective is sufficient here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sort of usage is specifically called out as not desirable in the API conventions, and I agree. Conditions are supposed to be level-triggered, not edge-triggered, so having a condition that says that something is happening is a bit weird. (Should Attaching transition to false once the attaching process is complete?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For posterity, can we cite some sources for that assertion? It would probably actually be good as a part of this status effort to document somewhere in the code that we're avoiding this due to conventions, and then point to those conventions, as this will help future contributors to more quickly avoid pursuing such things (or give them a path to who they need to convince in the greater kubernetes community).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost a direct quote from the API conventions as it is, should I put a link in here to the exact line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exact quote is as follows, I can just inline if it's better, but I was trying to avoid just pasting in the entire section:


Condition type names should describe the current observed state of the resource, rather than describing the current state transitions. This typically means that the name should be an adjective ("Ready", "OutOfDisk") or a past-tense verb ("Succeeded", "Failed") rather than a present-tense verb ("Deploying"). Intermediate states may be indicated by setting the status of the condition to Unknown.

For state transitions which take a long period of time (e.g. more than 1 minute), it is reasonable to treat the transition itself as an observed state. In these cases, the Condition (such as "Resizing") itself should not be transient, and should instead be signalled using the True/False/Unknown pattern. This allows other observers to determine the last update from the controller, whether successful or failed. In cases where the state transition is unable to complete and continued reconciliation is not feasible, the Reason and Message should be used to indicate that the transition failed.


I don't believe that we generally have any process that we want to handle in the way called out in the exception bullet point. Happy to be guided here on what you'd like to see.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a direct link to the language would be nice, but not a blocker.

really viable:
* All conditions must be negative polarity
* All conditions must be positive polarity
* Some conditions can be positive polarity, but most should be negative.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My own personal preference would be for conditions with a positive polarity to represent the normal healthy state and always be present, and for conditions with a negative polarity to represent very specific error states and to only be present when true. I think we should do everything we can to avoid the presence of double negatives like "Detached == False", but recognize that there are some error states that are best represented with a negative condition. I just don't think those negative conditions should be present when false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. As I mentioned above -- as long we document and check for it in conformance, this is the only guarantee that the conditions are meaningful programmatically, which is basically what we are looking for when we build for consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to have a mixed-polarity set of conditions, then I think we need to define:

  • which Conditions are positive polarity (Probably Attached, Programmed, Ready, and Valid if we do it)
  • which Conditions are positive polarity and should always be present (Probably Attached, Programmed, and Valid if we do it).
  • which Conditions are core (currently, everything other than Ready)
  • which Conditions are extended (only Ready).
  • which Conditions are negative (all the error conditions).

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
stuck with some sort of "will be ready in no longer than x seconds" standard
(which we'll also need to write conformance tests for, probably including a
stress test of adding a *lot* of records as well.) As an initial ballpark,
I propose 5 seconds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that cloud providers will likely have trouble implementing a Ready condition. Could we introduce a Reconciled condition that would be more broadly implementable?


For many implementations (certainly for Envoy-based ones), getting this information
correctly and avoiding races on applying it is surprisingly difficult. For this
reason, this GEP proposes we give the `Ready` condition a short window
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rather reserve Ready for "Ready at this moment" and instead use a different term to represent what we're actually describing - this config has been written to the data plane but we're not sure how long that will take to propagate. Previously Reconciled has been proposed to communicate this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll update this section to:

  • add Ready as an Extended support condition, that, if present, indicates that the data plane is fully configured. But because it's not always present, we'll need Reconciled to indicate what we have here - that everything is okay and the data plane will be ready soon.

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
* In general, resources should be considered `Attached` if their config will
generate some config in the underlying data plane.
* All Conditions will be positive polarity (`GoodState: true`, not `BadState: false`)
* All relevant Conditions for a resource must be added when it's observed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should state that the conditions covered in the core API must be present -- but there are likely other conditions (e.g. implementation specific), that are optional.

Copy link
Contributor Author

@youngnick youngnick Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, does this mean that all conditions that have Core conformance should always be present? Or does this mean that a set of positive-polarity summary endpoints we name should always be present?

The former could include any negative-polarity conditions we define as having Core conformance (which would clash with the "don't have double negatives" guidance we're looking for in other places.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've guessed "summary conditions" here - meaning just the positive polarity ones.

really viable:
* All conditions must be negative polarity
* All conditions must be positive polarity
* Some conditions can be positive polarity, but most should be negative.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. As I mentioned above -- as long we document and check for it in conformance, this is the only guarantee that the conditions are meaningful programmatically, which is basically what we are looking for when we build for consistency.

in the healthy case is much better rules out the first option, so we are left to
decide between enforcing that all conditions are positive, or that we have a mix.

Having an arbitrary mix will make doing machine-based extraction of information
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this is necessarily true -- it just needs to be explicitly documented for and tested. I think the previous attempt by K8s to force one polarity just simply didn't work out because it didn't line up with people's intuition. We should really try to keep things intuitive as much as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means that anything consuming the Conditions will need to have a hard-coded list of the positive polarity conditions. Not a deal breaker, but it's another obstacle in the way of building a generic condition-parsing library.

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved

Note that some classes of inter-resource reference failure do _not_ cause a resource
to become unattached (that is, to have the `Attached` condition set to `status: false`).
* Nonexistent Service backends - if the backend does not exist on a HTTPRoute that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should allow room for error conditions in addition to attached in this case so the intformation can be conveyed.

stuck with some sort of "will be ready in no longer than x seconds" standard
(which we'll also need to write conformance tests for, probably including a
stress test of adding a *lot* of records as well.) As an initial ballpark,
I propose 5 seconds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 any distributed (e.g. geo distributed) implementation has a hard time with this (think: config was pushed to 95% of the sites around the world, is it ready?)

@youngnick
Copy link
Contributor Author

Okay, to summarize the comments around condition naming and polarity:

We're going to end up with four possible positive polarity summary conditions:

  • ValidSpec or similar; there aren't any issues in the config of this resource on its own
  • Attached; there aren't any issues with the way this resources attaches to others
  • Programmed or similar; the resource has been processed and the data plane is in the process of being configured. It should be ready "soon", where "soon" is implementation-specific.
  • Ready; an optional, Extended condition that is only added when the implementation can guarantee that all required configuration has been accepted by the data plane and is running. Because it's Extended, it will have conformance tests to guarantee this behavior.

Other error conditions may be added that are negative polarity - the only thing relevant about these Conditions is if they are status: true. If they are not status: true, they should not be present.

Three or four summary, positive-polarity conditions seems on the high side to me, but if everyone feels that the additional clarity is worth the overhead of managing more Conditions.

I'm kind of inclined to say that anything we would use a ValidSpec summary condition for should probably be in the validation rules - whether they are in the OpenAPI definitions or the validating webhook doesn't really matter. I don't think the cases in which ValidSpec is True and Attached is false are going to be very common, so I don't really see the utility of that condition as much, personally. But if we really want it, we can do it.

@howardjohn
Copy link
Contributor

ValidSpec can have things not feasible to check in a webhook like temporary dependency issues (referenced service doesn't exist) or implementation specific issues (parent has FooGateway, which doesn't support the Bar field), so I think it holds weight.

If we do want to collapse we could have a single Valid field with reasons like NotValid, NotAttached, etc (naming needs work) possibly...?

Also re-iterating my comment on #1383 (comment) since I think I wasn't quite clear - I think making a Ready optional condition is actually harmful. It seems like the type of field only useful if its required, and I am not sure how feasible it is to require it.

@youngnick
Copy link
Contributor Author

youngnick commented Sep 14, 2022

I guess in my mind, ValidSpec is always a requirement for Attached, so there's less added value in having the two bits of information. For me, in the cases you give:

  • Service doesn't exist would cause a resource either to program 500s into the data plane and add a more specific condition for HTTPRoute (the already existing ResolvedRefs), or cause another Route type - that doesn't have an error code like HTTP - to become unattached.
  • The FooGateway not supporting the Bar field would cause the HTTPRoute to fail to attach, and so Attached would be false anyway, with Reason and Message to indicate the reason.

To put this another way, for me, ValidSpec is always a requirement for Attached, so the marginal utility of having both seems low. Invalid objects can't attach, and the reason they can't is that they're invalid.

Is it just the naming that's the problem, or is it the pivoting the status around the idea of attachment?

@howardjohn
Copy link
Contributor

I guess in my mind, ValidSpec is always a requirement for Attached, so there's less added value in having the two bits of information. For me, in the cases you give:

  • Service doesn't exist would cause a resource either to program 500s into the data plane and add a more specific condition for HTTPRoute (the already existing ResolvedRefs), or cause another Route type - that doesn't have an error code like HTTP - to become unattached.

I think I am confused. I thought ValidSpec meant it was 100% valid. So we would make ValidSpec=false if there is a unknown service. I thought the proposal was to have only the conditions listed above, so how would ResolvedRefs be used?

  • The FooGateway not supporting the Bar field would cause the HTTPRoute to fail to attach, and so Attached would be false anyway, with Reason and Message to indicate the reason.

Maybe.. unless the field is "fail open" and we just ignore it...? I am not sure we have a clear spec here on unsupported fields. It would be a bummer to add a filter and then suddenly have an outage because it wasn't supported.

@youngnick
Copy link
Contributor Author

youngnick commented Sep 14, 2022

I think I am confused. I thought ValidSpec meant it was 100% valid. So we would make ValidSpec=false if there is a unknown service. I thought the proposal was to have only the conditions listed above, so how would ResolvedRefs be used?

Ah, I guess I've been including ResolvedRefs in the bucket of "specific error conditions" that we haven't specified in this GEP. If I've implied that only those conditions listed are to be used, I've made a mistake and need to update.

@youngnick
Copy link
Contributor Author

To make my earlier response clearer (and I'm working on updating the text to include something like this), the idea behind Attached is that it's a summary condition that indicates whether or not there is a severe error that will prohibit any config at all from ending up in the data plane.

In the case you're talking about using Valid for, the intention is to have specific error conditions that will describe the error, and only be present when they are True. If there is an unknown Service in a backendRef, then the ResolvedRefs condition must be False (I'm leaving ResolvedRefs with its current polarity). That may also mean that the object becomes unattached - for resources that don't have a thing like sending a 500 error response like HTTPRoute. Basically, for HTTPRoute, because we can program an error response into the data plane, the set of things that will cause an unttachment is actually very small.

Signed-off-by: Nick Young <[email protected]>
site-src/geps/gep-1364.md Outdated Show resolved Hide resolved

The guidance about Conditions being added as soon as a controller sees a resource
is a bit unclear - as written in the conventions, it seems to imply that _all_
relevant conditions should always be added, even if their status has to be set to
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is reasonable for the controller to do this - as an indication the resource is in use by the controller.

It becomes problematic if a resource may be processed by multiple controllers - for example the discussion
of HttpRoute applying to both mesh and Gateway can only work if the type has a prefix (MeshReady).

In general it would be good to indicate how to handle resources where multiple controllers operate on. While
it's good to know which controller mess with the resource - maybe the controller should skip this step if the
resource is not ignored ( a mesh controller will ignore any route that don't target both mesh and gateway as example)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most cases, the Conditions slice is in a separate struct that's part of a slice that's namespaced by controller name - so that controllers should be able to update only their own Conditions. I'd expect that mesh controllers should have a distinct controller name string, so this method should continue working.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thockin has said that getting multiple controllers to cooperate on node health status was very subtle, so I'd be careful and maybe distill his experience if you need multiple controllers writing one status.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our practice of including controllerName for each distinct set of conditions should largely allow multiple controllers to cooperate on status for the same resource, but open to other potential improvements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do controllerName conditions roll up to Accepted?

Also, controllers probably need to be sure to use PATCH on the status sub-resource or do conditional PUTs and avoid extra writes if the conditions haven't changed.

Copy link
Contributor Author

@youngnick youngnick Sep 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some resources, all conditions are namespaced by controllerName. I agree we need to ensure that the rollup behavior is specified. I'm going to add a section about partial validity that should help.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For HTTPRoute as a specific example, the structure of RouteParentStatus appears well-suited to handle a separate list of conditions for each parentRef or controller, so a mesh or Service relationship for a route would have a separate set of conditions and Accepted status than those used for relaying the status of a concurrent parent relationship to a Gateway, with no need to collaborate on writing to a single status.

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
@youngnick
Copy link
Contributor Author

Okay, I think I've captured all the outstanding comments in that update. Please take a look, our timebox for this last pass of reviews is in just under 48 hours, so that I can start working on the implementation.

Copy link
Member

@shaneutt shaneutt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the importance with timing of this one, I don't see anything in this GEP that's problematic enough that we couldn't merge this and if all else fails sort something out in the follow-up PR. As such I'm approving, but I will leave the LGTM to someone else as this change has a strong impact.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: keithmattix, shaneutt, youngnick

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@robscott robscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @youngnick! These are great updates. A few nits and a question about naming but really like this direction.

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
site-src/geps/gep-1364.md Outdated Show resolved Hide resolved

* All the current Conditions that indicate that the resource is okay and ready
for processing witll be replaced with `Attached`, except for GatewayClass (since
it is the root of the resource tree, it will stay with `Accepted`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly about this, but any reason not just to use Accepted everywhere instead of Attached in most places and Accepted in this one place?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thought and am +1 to this, Accepted seems a little bit easier to understand for users IMO and can be consistently used across all resources.

of other specific negative-polarity error conditions.
* All relevant positive-polarity summary Conditions for a resource must be added
when it's observed.
For example, HTTPRoutes must always have `Accepted` and `ResolvedRefs`, regardless
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Accepted distinct from Attached here? Would Accepted ever be true if ResolvedRefs was false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't updated, but with the change to Accepted everywhere, this is now correct, thanks.

Comment on lines 141 to 142
* The summary conditions will be added into the CRD definitions
as default values for `status.Conditions`, with a status of `Unknown`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we set default values here? What happens when multiple controllers are implementing the same resource like a HTTPRoute? I would prefer relying on conformance tests to require controllers to set these conditions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was intending to make it that the non-controllerName Conditions will be present, but I'll leave this out for now.

Gateway API conditions will be positive for conditions that describe the happy
state of the object, which is currently `Attached` and `ResolvedRefs`, and will
also include the new `Programmed` condition, and the newly-Extended condition
`Ready`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this paragraph is missing a sentence that says that a separate set of Error conditions will exist that will only be set when they are true.


Positive polarity Conditions that describe the desirable state of the object must
always be set. These are currently `Attached`, `ResolvedRefs`, and `Programmed`.
Implementations that use `Ready` must also add it immediately, set to `false`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit: This sentence feels a bit too prescriptive, maybe "before programming the Route" instead of saying that it needs to be set to false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that seems like a good idea.

Comment on lines 264 to 271
That is, the proposal is to replace:
* `Scheduled` on Gateway
* `Detached` on Listener
* `Accepted` on Route

with `Attached` in all these locations.

GatewayClass will maintain the `Accepted` condition.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing it laid out this way makes me think that the smaller change would be to use Accepted everywhere since it would only represent a change to Gateway (and Listener within it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed. I've migrated the proposal to this.

Comment on lines 312 to 313
Note that having a correctly-defined set of resources that is empty does not make
these resources unattached.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. What would empty resources look like? If they're not unattached what are they?

Comment on lines 80 to 81
* `observedGeneration` is an optional field that sets what the `metadata.generation`
field was when the controller last saw a resource.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd had some conversations about making this required for conformant Gateway implementations. Probably out of scope for this GEP, just wanted to make sure it was intentionally excluded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I updated.

Copy link
Contributor

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not expert here things seem reasonable to me

* The resource is valid enough to produce some configuration in the underlying
data plane.

For Gateway, `Attached` also subsumes the functions of `Scheduled`: `Attached`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On some impls, provisioning takes about a minute. That may mean that some errors that are immediately known are masked. Or I suppose they aren't masked, but we need to wait the entire minute to know if its valid. The message field could say "all fields valid, waiting for deployment" perhaps, but that isn't in the machine parse-able section.

That may not warrant change, just a thought

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worded this carefully, so as not to imply that the provisioning needs to have completed, just that it can be started. The completion of infra provisioning should be handled by Programmed or Ready, or both.

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
[Typical status properties][typstatus] section of the guidelines.
5. Conditions should be applied to a resource the first time the controller sees
the resource. This seems to imply that _all conditions should be present on every
resource owned by a controller_, but the rest of the conventions don't make this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditions which appear and disappear from a resource can make status reporting UIs harder, so I'd recommend keeping the Condition set consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should have a set of permanent and reliable conditions that are always present, but I personally don't think it makes sense for error conditions to always be present, especially since that tends to lead to confusing double negative states.

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved

The guidance about Conditions being added as soon as a controller sees a resource
is a bit unclear - as written in the conventions, it seems to imply that _all_
relevant conditions should always be added, even if their status has to be set to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thockin has said that getting multiple controllers to cooperate on node health status was very subtle, so I'd be careful and maybe distill his experience if you need multiple controllers writing one status.

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
* In general, resources should be considered `Attached` if their config is valid
enough to generate some config in the underlying data plane.
* There will be a limited set of positive polarity summary conditions, and a number
of other specific negative-polarity error conditions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make UI display somewhat harder (green dots / checkmarks / exclamation points), but I realize it may align with human reading a bit more easily.

I don't have a strong opinion here, just want to highlight the cost elsewhere in the stack.

is no missing information, including but not limited to:
* Any mandatory references resolve to existing resources (examples here are the
Gateway's gatewayClass field, or the `parentRefs` field in Route resources)
* Any specified TLS secrets exist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for something like an HTTPRoute to define two paths, one of which is "valid enough" config, while the other is garbage?

If so, are there expectations or requirements from a conformance PoV ("always apply all or none") that we need to specify?


Note that some classes of inter-resource reference failure do _not_ cause a resource
to become unattached (that is, to have the `Attached` condition set to `status: false`).
* Nonexistent Service backends - if the backend does not exist on a HTTPRoute that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might also get a BackendRefEmpty condition...

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
is no missing information, including but not limited to:
* Any mandatory references resolve to existing resources (examples here are the
Gateway's gatewayClass field, or the `parentRefs` field in Route resources)
* Any specified TLS secrets exist
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for something like an HTTPRoute to define two paths, one of which is "valid enough" config, while the other is garbage?

Most cases of an invalid configuration in one route rule (with others being completely valid within the same HTTPRoute) seem to be misconfigurable only in the context of a reference to another resource and fall into the ResolvedRefs bucket (which seems stated below and possibly agreed upon that Attached: true be set with ResolvedRefs: false and return HTTP code 500 for the invalid rule).

One such example not covered is if you were to use a HTTPRequestRedirectFilter and HTTPURLRewrite in the list of filters for an individual HTTPRouteRule (ref). At the moment it doesn't appear we have prescribed any status code or other error handling for this error. This seems to fall into the bucket of things that cannot produce data plane configuration.

Going by the statements below, if this rule were the only one present Attached: false should be set, since it cannot result in any configuration (with another more specific error condition). If present with other valid rules, Attached: true should be set (with another more specific error condition again). I can definitely see the argument for setting Attached: false however in the latter case, since stated above, the Attached condition means The resource is syntatically and semantically valid, and internally consistent.

It seems that the Valid: true condition mentioned above ends up being somewhat equivalent to having for example for HTTPRoute ResolvedRefs: true and no error conditions (and vice versa, Valid: false equivalent to ResolvedRefs: false or some more specific error condition). Maybe this framing helps a little with determining possible merits of having another condition?

If so, are there expectations or requirements from a conformance PoV ("always apply all or none") that we need to specify?

Just doing a quick pass, I don't think I saw any existing conformance tests that cover this kind of thing so I agree that would be good to add.

site-src/geps/gep-1364.md Outdated Show resolved Hide resolved
@youngnick
Copy link
Contributor Author

Okay, updated again, this feels like it's getting closer.

Changelog this time:

  • Changed references to Attached to Accepted. I think @robscott and @skriss are right here, this is probably clearer.
  • Added a section on partial validity to explain this better.
  • Added the example @sunjayBhatia gave to the examples.

@robscott
Copy link
Member

Thanks @youngnick! This is a really well written GEP. Since previous generations of this PR have already been approved a couple times, I think we've got pretty clear consensus around this approach.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 26, 2022
@k8s-ci-robot k8s-ci-robot merged commit 7c52e02 into kubernetes-sigs:main Sep 26, 2022
Copy link
Contributor

@mikemorris mikemorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite late in doing a thorough review of this, but really happy with where it ended up - feels like a good step forwards in resolving some of the current ambiguity in status conditions. Thanks @youngnick!

Comment on lines +67 to +69
2. They are a listMapType, a list that is enforced by the apiserver to have only
one entry of each item, using the `type` field as a key. (So, this is effectively
a map that looks like a list in YAML form).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually hadn't realized the uniqueness on type constraint - this should probably be documented more clearly within the Gateway API spec, as it has some bearing on how "multiple error" cases may be expressed, and explains the structure of RouteParentStatus, which will be increasingly relevant if we start expecting resources like HTTPRoute to bind to multiple parentRefs (for GAMMA and N/S concurrent usage or route delegation).


The guidance about Conditions being added as soon as a controller sees a resource
is a bit unclear - as written in the conventions, it seems to imply that _all_
relevant conditions should always be added, even if their status has to be set to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For HTTPRoute as a specific example, the structure of RouteParentStatus appears well-suited to handle a separate list of conditions for each parentRef or controller, so a mesh or Service relationship for a route would have a separate set of conditions and Accepted status than those used for relaying the status of a concurrent parent relationship to a Gateway, with no need to collaborate on writing to a single status.

Comment on lines +285 to +288
This GEP proposes replacing all conditions that indicate syntactic and semantic
validity with one, `Accepted` condition type, with the exception of the
GatewayClass resource (because it's the root of the resource tree and doesn't
attach to anything).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the update to use Accepted instead of Attached, GatewayClass is no longer an exception.

Comment on lines +357 to +365
* HTTPRoute with one match with one backend that is a non-existent Service backend.
The `Accepted` Condition is true, the `ResolvedRefs` condition is false. `Accepted`
is true in this case because the data path must respond to requests that would be
sent to that backend with a 500 response.
* HTTPRoute with one match with two backends, one of which is a non-existent Service
backend. The `Accepted` Condition is true, the `ResolvedRefs` condition is false.
`Accepted` is true in this case because the data path must respond to a percentage
of the requests matching the rule corresponding to the weighting of the non-existent
backend (which would be fifty percent unless weights are applied).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For both of these cases, the ResolvedRefs RouteConditionReason should be BackendNotFound

Comment on lines +366 to +369
* HTTPRoute with one match with one backend that is in a different namespace, and
does _not_ have a ReferenceGrant permitting that access. The `Accepted` condition
is true, and the `ResolvedRefs` Condition is false. As before, `Accepted` is true
because in this case, the data path must be programmed with 500s for the match.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ResolvedRefs RouteConditionReason should be RefNotPermitted

Comment on lines +375 to +382
* HTTPRoute with one Custom supported filter added that is not supported by the
implementation. Our spec is currently unclear on what happens in this case, but
custom HTTP Filters require the use of the `ExtensionRef` filter type, and the
setting of the ExtensionRef field to the name, group, version, and kind of a
custom resource that describes the filter. If that custom resource is not supported,
it seems reasonable to say that this should be a reference failure, and be treated
like other reference failures (`Accepted` will be set to true, `ResolvedRefs` to
false, and traffic that would have matched the filter should receive a 500 error.)
Copy link
Contributor

@mikemorris mikemorris Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ResolvedRefs RouteConditionReason in this case should be InvalidKind, because of the GKV of the reference is invalid. We may want to consider adding a condition specifically for unsupported ExtensionRefs, or note that the reference with an invalid kind should be included in the Condition message field.

false, and traffic that would have matched the filter should receive a 500 error.)
* A HTTPRoute with one rule that specifies a HTTPRequestRedirect filter _and_ a
HTTPURLRewrite filter. `Accepted` must be false, because there's only one rule,
and this configuration for the rule is invalid (see [reference][httpreqredirect])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link appears broken due to a missing footnote reference.

@robscott
Copy link
Member

@mikemorris these are great bits of feedback, do you have time to create a follow up PR or issue to address any that are straightforward to fix? Concerned that otherwise we'll lose track of these.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/gep PRs related to Gateway Enhancement Proposal(GEP) lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet