old critical alerts in icinga do not go away after upgrade of openshift #26

wahabz opened this issue May 15, 2020 · 4 comments

wahabz commented May 15, 2020

First of all, great product: Signalilo.
I recently set this up for our OpenShift clusters.

We had the following scenario:
For our OpenShift Cluster A, we had a bunch of critical alerts that showed up in Icinga.
Those alerts were never resolved (as in, from the OpenShift side).
We did an upgrade of our OpenShift cluster, and after that re-added the webhook config in Alertmanager.
So from Alertmanager's perspective it is now brand new, and the old alerts in Icinga were never resolved (they never got the resolved notification from Alertmanager via Signalilo).
Now in Icinga we have this OpenShift cluster set up as a host "Test Host", and although new alerts are coming in and being resolved, the old alerts from the previous version of OpenShift are still there.

I understand that there is a SIGNALILO_ICINGA_KEEP_FOR setting, but that only applies to OK and/or resolved alerts.

I think there should be a criterion such that if an alert is no longer firing from Alertmanager, and there are lingering critical services in Icinga which never received a resolved status, then those should be garbage collected as well.

wahabz changed the title from "critical alerts in icinga do not go away" to "old critical alerts in icinga do not go away after upgrade of openshift" on May 15, 2020

simu commented May 18, 2020

Thanks for the feedback.

We have considered different options for handling stale alerts in Icinga, but it's hard to implement a solution that's correct for arbitrary resend intervals in Alertmanager: since Signalilo does not keep any local state, it cannot really distinguish between a critical alert with a high repeat interval and a stale alert.

One possibility would be to adjust the Icinga checks to be active, with a recheck interval that's somehow derived from the repeat interval of the alert in Alertmanager. However, the value of the resend interval would have to be provided to Signalilo as an extra configuration value, as it's not available in the received alerts. (Side note: the endsAt field of the received alert could potentially be used; this needs further investigation.)
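
For reference on that side note: the Alertmanager webhook payload does carry `startsAt`/`endsAt` timestamps per alert, so a recheck interval could in principle be derived from `endsAt` along these lines. This is only a rough sketch; the `Alert` struct and `checkIntervalFromEndsAt` function are made up for illustration and are not part of Signalilo:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Alert mirrors the per-alert fields of the Alertmanager webhook payload
// that are relevant here (field names as documented by Alertmanager).
type Alert struct {
	Status   string            `json:"status"`
	Labels   map[string]string `json:"labels"`
	StartsAt time.Time         `json:"startsAt"`
	EndsAt   time.Time         `json:"endsAt"`
}

// checkIntervalFromEndsAt derives a candidate Icinga recheck interval from
// the alert's endsAt timestamp: if the alert is not re-sent before endsAt,
// it could be considered stale. A small safety margin is added for latency.
// This illustrates the side note above; it is not Signalilo behaviour.
func checkIntervalFromEndsAt(a Alert, now time.Time, fallback time.Duration) time.Duration {
	if a.EndsAt.IsZero() || !a.EndsAt.After(now) {
		return fallback
	}
	return a.EndsAt.Sub(now) + 30*time.Second
}

func main() {
	raw := []byte(`{"status":"firing","labels":{"alertname":"HighLoad"},
		"startsAt":"2020-05-18T10:00:00Z","endsAt":"2020-05-18T10:05:00Z"}`)
	var a Alert
	if err := json.Unmarshal(raw, &a); err != nil {
		panic(err)
	}
	now := time.Date(2020, 5, 18, 10, 1, 0, 0, time.UTC)
	fmt.Println(checkIntervalFromEndsAt(a, now, 5*time.Minute))
}
```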

In the meantime, what you can do to clean up stale alerts is to click "check now" in Icinga, which sets the alert status to OK (as dummy_state is set to 0 for Icinga checks created by Signalilo) and makes the check eligible for garbage collection according to the value of SIGNALILO_ICINGA_KEEP_FOR.
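
If there are many stale services, clicking through the UI is tedious; the same "check now" can also be triggered via the Icinga2 API's /v1/actions/reschedule-check action. A minimal sketch, assuming API credentials and a filter that you would have to adapt so it only matches the services managed by your Signalilo instance:

```go
package main

import (
	"bytes"
	"crypto/tls"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholders: adjust host, credentials and the filter expression to
	// your environment. The filter below simply targets all non-OK services;
	// narrow it down (e.g. via a custom variable) to the services created by
	// your Signalilo instance before running this against production.
	body, _ := json.Marshal(map[string]interface{}{
		"type":   "Service",
		"filter": "service.state != 0",
		"force":  true,
	})

	req, err := http.NewRequest("POST",
		"https://icinga.example.com:5665/v1/actions/reschedule-check",
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("apiuser", "apipassword")
	req.Header.Set("Accept", "application/json")
	req.Header.Set("Content-Type", "application/json")

	// The Icinga2 API often uses a self-signed certificate; skipping
	// verification is acceptable only for a quick manual cleanup.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}
```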

wahabz commented May 26, 2020

I got some time to review your comment regarding this.
So I was wondering if it is possible to garbage collect all the alerts in Icinga and only leave out the ones that are firing?
This way, say alerts A, B and C fire and Signalilo captures them and reports them in Icinga. Now, say after an hour or so alert A is still firing, whereas alerts B and C have stopped firing but the resolved notification was missed (could be that Alertmanager was upgraded, or something else). In this case, when Signalilo does garbage collection, it should first look at the firing alerts (A in this case) and then collect all the other alerts/services in Icinga that are not OK.

simu commented Jun 29, 2020

At the time Signalilo performs garbage collection, we do not know which alerts are firing, since we do not keep any local state about alerts in Signalilo. Therefore we cannot just look at the firing alerts and GC all alerts which are not firing anymore, as we simply don't have the information to determine which alerts are still firing when GC runs.

I'm leaning towards the solution of using the Alertmanager resend interval, provided to Signalilo as an additional configuration value with a reasonably high default, to create active Icinga2 services. Those services would be checked at roughly the same frequency as Alertmanager resends the alerts. Note that the check interval in Icinga should be a bit longer than the resend interval to allow for some network latency.

Since we already implement active checks for "heartbeat" alerts, this change should be doable.
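
A back-of-the-envelope sketch of that derivation. The environment variable name SIGNALILO_ALERTMANAGER_RESEND_INTERVAL, the 1.5x margin and the 5m floor are assumptions for illustration, not existing Signalilo options or defaults:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// checkIntervalForAlerts derives an Icinga check interval from the
// Alertmanager resend (repeat) interval: it keeps the same order of
// magnitude but adds headroom so a resent alert always arrives before
// the active check fires and flips the service back to OK.
// The 1.5x factor and 5m floor are illustrative choices only.
func checkIntervalForAlerts(resend time.Duration) time.Duration {
	interval := time.Duration(float64(resend) * 1.5)
	if interval < 5*time.Minute {
		interval = 5 * time.Minute
	}
	return interval
}

func main() {
	// Hypothetical configuration value; defaults to a reasonably high 1h,
	// mirroring the "reasonably high default" mentioned above.
	resend := time.Hour
	if v := os.Getenv("SIGNALILO_ALERTMANAGER_RESEND_INTERVAL"); v != "" {
		d, err := time.ParseDuration(v)
		if err != nil {
			fmt.Fprintln(os.Stderr, "invalid resend interval:", err)
			os.Exit(1)
		}
		resend = d
	}
	fmt.Printf("resend=%s -> icinga check_interval=%s\n",
		resend, checkIntervalForAlerts(resend))
}
```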

@Xavier-0965

We also have this problem sometimes, but it seems difficult to reproduce:
This morning we updated one OpenShift cluster (to version 4.10.35: Prometheus version 2.32.1, Alertmanager version 0.23.0).
I had an alert firing before the update, to try to reproduce the problem.
But after the update, that alert is correctly bound to one service in Icinga.
So I couldn't reproduce the problem, but I'm documenting it here in case it helps with the analysis.

What I have seen is that as soon as a firing alert is received, Signalilo computes a serviceName (see `func computeServiceName(...)`) using the UUID and the sorted labels of the alert. It then checks in Icinga whether that service exists.
If it does not exist, a new service is created in Icinga.
Otherwise the service is updated.

Maybe there are cases where the labels are changed?
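
To make that hypothesis concrete, here is an illustrative sketch of how a service name derived from the bridge UUID plus the sorted labels would behave; it is a simplified stand-in, not Signalilo's actual computeServiceName. A change to a single label value yields a different name, so the old non-OK service in Icinga would never be matched (or resolved) again:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// serviceNameFor illustrates the idea described above: build a stable
// string from the bridge UUID and the alert's labels in sorted key order,
// then hash it. This is a simplified stand-in, not Signalilo's actual
// computeServiceName implementation.
func serviceNameFor(uuid string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(uuid)
	for _, k := range keys {
		fmt.Fprintf(&b, ",%s=%s", k, labels[k])
	}
	sum := sha256.Sum256([]byte(b.String()))
	return fmt.Sprintf("%s:%x", labels["alertname"], sum[:8])
}

func main() {
	uuid := "6d358bca-xxxx" // placeholder bridge UUID
	a := map[string]string{"alertname": "HighLoad", "severity": "critical", "namespace": "app1"}
	b := map[string]string{"alertname": "HighLoad", "severity": "critical", "namespace": "app1-new"}

	// A single changed label value produces a different service name,
	// so a later resolved notification would target a service that does
	// not match the one created before the change.
	fmt.Println(serviceNameFor(uuid, a))
	fmt.Println(serviceNameFor(uuid, b))
}
```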
