
Add annotation to allow spreading VIPs across keepalived instances #53

Merged
2 commits merged into redhat-cop:master on Mar 3, 2021

Conversation

tommasopozzetti
Contributor

A common use case for keepalived in environments with higher load is to have multiple VIPs for HA configured in such a way that they are spread across the nodes. This configuration allows for maximum load balancing, for example through DNS round robin, while maintaining HA, as a failure of a node will still result in another node getting ownership of the VIP.

This PR adds support for an extra annotation, keepalived-operator.redhat-cop.io/spreadvips: "true", on services requesting keepalived-managed IPs, which instructs keepalived-operator to spread the VIPs of the service across all of the keepalived instances of the KeepalivedGroup. If the annotation is found, keepalived-operator adds extra configuration that sets a preferred owner for each VIP in the service, assigned round-robin among the keepalived pods. The PR also adds the ability to watch for pod changes, so that any change in the pods backing the DaemonSet is reflected in the generated configuration.

This PR aims to be 100% backwards compatible by making this functionality opt-in. Services that do not specify the new annotation will continue to behave as they always have.
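For illustration, a Service opting into this behavior might look like the sketch below (names, namespace, and the keepalivedgroup annotation value are illustrative; the spreadvips annotation is the only new part introduced by this PR):

    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-ingress
      annotations:
        # binds the service to a KeepalivedGroup (value is illustrative)
        keepalived-operator.redhat-cop.io/keepalivedgroup: keepalived-operator/keepalivedgroup-workers
        # new, opt-in: spread the VIPs across the keepalived pods
        keepalived-operator.redhat-cop.io/spreadvips: "true"
    spec:
      selector:
        app: nginx-ingress
      ports:
      - name: http
        port: 80
      externalIPs:
      - 172.20.20.1
      - 172.20.20.2
      - 172.20.20.3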

@raffaelespazzoli
Collaborator

I don't understand the intent of this PR. Would you be able to elaborate with an example or perhaps provide a small demo?
Also, I keep forgetting to do it, but I intend to deprecate this operator in favor of MetalLB. Do you think MetalLB would address your requirements?

@tommasopozzetti
Contributor Author

Hi @raffaelespazzoli, thanks for taking a look at this.
To answer your first question, let me give you an example of why I think this can be useful.
Let's assume I have a KeepalivedGroup targeting three nodes, Node A, Node B and Node C. Let's also assume I have a service backed by multiple pods that I want to expose, for example nginx-ingress-controller.
Let's say I specify the annotation to tie this service to that KeepalivedGroup and I specify three externalIPs:

externalIPs:
- 172.20.20.1
- 172.20.20.2
- 172.20.20.3

The current behavior of keepalived-operator is to assign all IPs to the same VRRP instance, which means they will all be owned by the same keepalived pod and therefore all belong to the same node, for example Node A.
This effectively makes it useless to use multiple IPs, because all ingress traffic towards that service will pass through Node A, even if the backing pods are spread across multiple nodes. Node A can therefore easily become a bottleneck if the amount of traffic is high enough.
A better configuration would be for each of those 3 externalIPs to be owned by a different node, so that, if the ingress traffic is balanced across the 3 IPs (for example via DNS round robin), the three nodes share the load, providing much better load balancing.
The new annotation introduced in this PR lets the user request this behavior for a given service: it assigns a preferred owner per VIP so that, as long as that owner is up, it wins over the others and owns the VIP. Owners are assigned in a round-robin fashion among the keepalived pods in the KeepalivedGroup, maximizing the spread and therefore the load balancing.
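For illustration only, the rendered keepalived configuration for one of these VIPs could look roughly like the sketch below (instance name, pod name, interface, and virtual_router_id are made up; the @<pod-name> conditional line is what marks the preferred owner, while the other pods fall back to keepalived's default BACKUP state):

    vrrp_instance service_172.20.20.1 {
        @keepalived-workers-abcde state MASTER
        interface ens3
        virtual_router_id 101
        virtual_ipaddress {
            172.20.20.1
        }
    }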

Unfortunately, MetalLB either requires BGP-based advertisement (which the underlying networking infrastructure must support, and that is not necessarily a common case), or it suffers from the same issue when using the layer-2 approach, as explained in their docs (they call it single-node bottlenecking). Therefore, I still believe keepalived is a valuable alternative that can provide additional benefits.

@tommasopozzetti
Contributor Author

This feature has also been requested in MetalLB in metallb/metallb#439, so it might eventually get implemented there, but for now there appears to be no simple way to achieve it.

@raffaelespazzoli
Collaborator

raffaelespazzoli commented Feb 28, 2021 via email

@tommasopozzetti
Contributor Author

@raffaelespazzoli the current implementation should already work for externalIPs, LoadBalancer IPs, or a mixture of the two.
If we make it the default, do you think it makes sense to still have an annotation to allow opting out of this behavior?
The downside of spreading VIPs is that each VIP now requires its own VRRP instance, which means the 255-instance limit (one virtual router ID per instance) could be reached faster. In cases where this spread is not necessary, users might prefer to have the possibility of opting out.
Let me know your thoughts and I can modify the PR to reflect the intended behavior.
One more note I forgot to mention: for this PR to work, the keepalived image needs to contain the bash binary, because it is used to pass the pod name from an environment variable to the keepalived process. We have it in the image we use, but I have not been able to verify whether it is present in the default registry.redhat.io/openshift4/ose-keepalived-ipfailover image.
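For context, the wiring could look something like the sketch below (this is an assumption about the mechanism, not necessarily exactly what the PR generates): the pod name is exposed via the downward API and a bash wrapper hands it to keepalived as its config id, so that the @<pod-name> conditional lines only match on the owning pod:

    containers:
    - name: keepalived
      image: example.com/keepalived:latest   # hypothetical image; must include bash
      env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      command:
      - /bin/bash
      - -c
      - exec keepalived --dont-fork --log-console --config-id "$POD_NAME"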

@raffaelespazzoli
Collaborator

@tommasopozzetti thanks for the clarifications. So would it be fair to say that I can achieve the same result as this PR by creating multiple externalIP Services in Kubernetes?

@tommasopozzetti
Contributor Author

@raffaelespazzoli You would get the same number of VRRP instances, but not necessarily the same result. If you create multiple Services with one externalIP each and the same selector, there is no guarantee that those VIPs will be owned by different nodes; they might all end up on the same node, creating the bottleneck. Also, even if they were perfectly spread out, if one node went down, its VIP would move to another node and remain there even after the original owner came back up (leaving one node taking twice the traffic and another none).
This PR instead generates configuration that selects a preferred owner per VIP, so the spread holds at all times. Multiple VIPs are guaranteed to be owned by different nodes (unless, of course, there are more VIPs than nodes), and if a node goes down its VIP does move to another node for HA, but it returns to its original owner as soon as that owner is available again.

@raffaelespazzoli
Collaborator

ok, let's keep this as it is. I'll try to do some tests in the next few days. In the meantime, can you add a paragraph to the docs about this functionality?

@tommasopozzetti
Contributor Author

@raffaelespazzoli I added some info about this feature to the docs. Let me know if you think it is clear enough.

@tommasopozzetti
Contributor Author

I rebased this PR onto master to include the capabilities from #54 and handle them correctly even with the spreadvips annotation.

@David-Igou
Contributor

@raffaelespazzoli I have tested these changes and they LGTM

@David-Igou merged commit da6d67b into redhat-cop:master on Mar 3, 2021
@baryluk

baryluk commented Dec 9, 2021

Correct me if I am wrong, but from analyzing the code and template, it looks like, if this option is enabled, the first IP of every service will end up being placed on the first keepalived (load balancer) pod/node.

      {{- $owner := index $root.KeepalivedPods (modulus $i (len $root.KeepalivedPods)) }}
      vrrp_instance {{ $namespacedNameForIP }} {
          @{{ $owner.ObjectMeta.Name }} state MASTER

That is not great.

Considering that keepalived's default preemption back to the master takes about 5 minutes, after a rolling restart one can end up with the first IP of ALL services (which is often the ONLY IP of the service) on the same node.

It would be good to use a hash of the service name as an offset to ensure more uniformity, or to apply some permutation of the indexes based on the service name (preferably using consistent hashing, to handle changes in the length of $root.KeepalivedPods gracefully).
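A minimal Go sketch of the hash-offset idea (illustrative only, not the operator's actual code): hash the service's namespaced name and use it to offset the round-robin starting index, so that different services start on different pods:

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // chooseOwner returns the preferred keepalived pod for the i-th VIP of a service.
    // The round-robin start is offset by a hash of the service's namespaced name,
    // so the first VIP of every service does not always land on the first pod.
    func chooseOwner(namespacedName string, vipIndex int, pods []string) string {
        h := fnv.New32a()
        h.Write([]byte(namespacedName))
        offset := int(h.Sum32() % uint32(len(pods)))
        return pods[(offset+vipIndex)%len(pods)]
    }

    func main() {
        pods := []string{"keepalived-0", "keepalived-1", "keepalived-2"}
        for i := 0; i < 3; i++ {
            fmt.Println(chooseOwner("default/nginx-ingress", i, pods))
        }
    }

As noted above, a consistent-hashing scheme would handle changes in the number of keepalived pods more gracefully than this simple modulus, since it would move only a small fraction of the preferred owners.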
