Health indicators based on Service Level Objectives #21311

jkschneider · 2020-05-04T22:05:33Z

This feature adds a commonly requested capability: letting an application aggregate a set of metrics-based key performance indicators down to a health indicator.

I fully expect some changes, probably significant changes, based on feedback iterations on this, but want to offer this up early in the 2.4.0 release iteration so we have time to iterate and also dogfood any autoconfigured service level objectives.

Some indicators are known to be broadly applicable to a wide range of Java applications, and those could be autoconfigured. An example of a set of such indicators is defined here and autoconfigured by this pull request (JvmServiceLevelObjectives.MEMORY).

In many cases, users would like to configure a load balancer to avoid instances that are failing a key performance indicator by configuring an HTTP health check on the load balancer. In fact, some applications may already be doing this for the health indicators Spring Boot or users already provide. Example platform load balancer configurations that can be pointed to /actuator/health:

CloudFoundry health-check-http-endpoint
AWS ALB health checking
Kubernetes service health checking:

metadata:
  name: instance-reported-utilization
  annotations:
    service.beta.kubernetes.io/do-loadbalancer-healthcheck-port: "80"
    service.beta.kubernetes.io/do-loadbalancer-healthcheck-protocol: "http"
    service.beta.kubernetes.io/do-loadbalancer-healthcheck-path: "/actuator/health"

See micrometer-metrics/micrometer#2055 for more detail.

The `HealthMeterRegistry`

As of 1.6.0, Micrometer has a new registry implementation: micrometer-registry-health. This pull request adds an autoconfiguration for it to spring-boot-actuator-autoconfigure.

Any @Bean ServiceLevelObjective is configured onto the HealthMeterRegistry and bound as a Spring Boot HealthIndicator.

What it looks like in `/actuator/health`

About `ServiceLevelObjective`

Service level objectives broadly have the following capabilities:

Are defined as a single- or multi-indicator test against a set of time series registered to HealthMeterRegistry.
Can define the required MeterBinders that contain the measurements they need to determine availability.
Contain a filterable and transformable name and tag set, which are mapped to the Spring Boot bean name and the Health#details map, respectively.
Optionally contain a readable base unit that is mapped to health details.
Can pretty-print values and thresholds for human-readable interpretation of an SLO at some instant.
Can be defined to look back and aggregate over a time window in different ways.
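To make the capabilities above more concrete, here is a minimal, self-contained sketch of a single-indicator objective that carries a name, a base unit, a threshold test, and pretty-printed details. The class `SloSketch` and all of its members are hypothetical names made up for illustration; this is not Micrometer's `ServiceLevelObjective` API, and time-window aggregation is left out.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.DoubleSupplier;

// Hypothetical model of a single-indicator SLO. Not Micrometer's API.
class SloSketch {
    private final String name;               // would map to the health indicator bean name
    private final String baseUnit;           // readable unit surfaced in health details
    private final DoubleSupplier indicator;  // samples the current value as a ratio (0..1)
    private final double threshold;          // the objective is met while value < threshold

    SloSketch(String name, String baseUnit, DoubleSupplier indicator, double threshold) {
        this.name = name;
        this.baseUnit = baseUnit;
        this.indicator = indicator;
        this.threshold = threshold;
    }

    String name() {
        return name;
    }

    // The test itself: a single "less than" comparison against the threshold.
    boolean healthy() {
        return indicator.getAsDouble() < threshold;
    }

    // Pretty-printed value and threshold, shaped like a health details map.
    Map<String, String> details() {
        Map<String, String> details = new LinkedHashMap<>();
        details.put("value", percent(indicator.getAsDouble()));
        details.put("mustBe", "<" + percent(threshold));
        details.put("unit", baseUnit);
        return details;
    }

    // Formats a ratio as a percentage with at most two decimal places.
    private static String percent(double ratio) {
        DecimalFormat format = new DecimalFormat("0.##", DecimalFormatSymbols.getInstance(Locale.ROOT));
        return format.format(ratio * 100) + "%";
    }

    public static void main(String[] args) {
        SloSketch slo = new SloSketch("jvmMemoryConsumption",
                "maximum percent used in last 5 minutes", () -> 0.0909, 0.9);
        System.out.println(slo.name() + " healthy=" + slo.healthy() + " details=" + slo.details());
    }
}
```

A real implementation would read the indicator from meters registered on the registry rather than from a fixed supplier.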

API error ratio property-driven configuration

management.metrics.export.health.api-error-budgets.api.customer=0.01
management.metrics.export.health.api-error-budgets.admin=0.02

The above properties result in two service level objective health indicators called apiErrorRatioApiCustomer and apiErrorRatioAdmin, which check for a SERVER_ERROR outcome to total throughput ratio of less than 1% for requests to paths starting with /api/customer and 2% for requests to paths starting with /admin, respectively.
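The check described above boils down to a ratio compared against a budget. The following is a rough, self-contained sketch of that arithmetic, assuming requests are available as simple (path, outcome) pairs; `ApiErrorBudgetSketch`, `Request`, `errorRatio`, and `withinBudget` are hypothetical names for illustration and do not reflect how the pull request or Micrometer actually computes the ratio from timers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of an API error budget check. Not the PR's implementation.
class ApiErrorBudgetSketch {

    // One observed request: its path and its outcome (for example SUCCESS or SERVER_ERROR).
    static final class Request {
        final String path;
        final String outcome;

        Request(String path, String outcome) {
            this.path = path;
            this.outcome = outcome;
        }
    }

    // Ratio of SERVER_ERROR outcomes to total throughput for paths starting with the prefix.
    static double errorRatio(List<Request> requests, String pathPrefix) {
        long total = 0;
        long errors = 0;
        for (Request request : requests) {
            if (request.path.startsWith(pathPrefix)) {
                total++;
                if ("SERVER_ERROR".equals(request.outcome)) {
                    errors++;
                }
            }
        }
        // Assumption: no traffic means no errors, so the objective counts as met.
        return total == 0 ? 0.0 : (double) errors / total;
    }

    // The objective is met while the error ratio stays strictly below the budget.
    static boolean withinBudget(List<Request> requests, String pathPrefix, double budget) {
        return errorRatio(requests, pathPrefix) < budget;
    }

    public static void main(String[] args) {
        List<Request> requests = new ArrayList<>();
        for (int i = 0; i < 199; i++) {
            requests.add(new Request("/api/customer/" + i, "SUCCESS"));
        }
        requests.add(new Request("/api/customer/failing", "SERVER_ERROR"));
        System.out.println("customer API within 1% budget: "
                + withinBudget(requests, "/api/customer", 0.01));
    }
}
```

Treating an idle path as healthy is a design choice of this sketch; an implementation could equally report an unknown status when there is no traffic in the window.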

jkschneider · 2020-05-04T22:16:21Z

Open questions

We build health indicators with AbstractHealthIndicator(slo.getFailedMessage()). It's unclear to me if the failed message ever appears in /actuator/health response body output.

Some of the SLOs are a combination of two or more indicators. For example, in jvmTotalMemory, we set a relatively low threshold on GC overhead (20% of CPU time over the last 5 minutes) if there is 90% pool utilization as well. These composite SLOs are registered with the relatively new CompositeHealthContributor.fromMap(..) API. Unfortunately there is no way I can see to provide details and a failed message name on the composite. I'd like to add details and a failed message for each contributing health indicator and potentially a different one for what it means for a set of such indicators to fail together. @philwebb you may have suggestions? An example is included below of what I think might be nice (specifically the details directly underneath jvmTotalMemory)?

"jvmTotalMemory": {
  "status": "UP",
  "details": { 
     "someTag": "someValue"
  },
  "components": {
    "jvmGcOverhead": {
      "status": "UP",
      "details": {
        "value": "0.01%",
        "mustBe": "<20%",
        "unit": "percent CPU time spent"
      }
    },
    "jvmMemoryConsumption": {
      "status": "UP",
      "details": {
        "value": "9.09%",
        "mustBe": "<90%",
        "unit": "maximum percent used in last 5 minutes"
      }
    }
  }
}

philwebb · 2020-05-05T20:20:48Z

Thanks @jkschneider! I'll target this for 2.4.x so we remember to take a look as soon as the 2.3.0 release crunch is over.

bclozel · 2020-09-28T15:29:39Z

We haven't had a chance to take a look at this change, nor upgrade to Micrometer 1.6.
We're already quite late in the Milestone cycle and we don't think we'll have time to address this change properly.
We need to take a look at this change and its implications (including the new concepts introduced and the Health endpoint format).

mbhave · 2021-09-16T14:46:57Z

@snicoll and I discussed this today. There are a few things that came up:

Since we decided that the diskspace health indicator should ideally be something that can be configured in the monitoring system, this feels very much along those lines. If we decide to surface the SLOs as health indicators, we should align our strategy for diskspace accordingly. Even with the deprecation of the diskspace indicator, we could surface that information in health via the SLOs.
We are not sure if having a top-level component for every SLO is the best way to do this. Maybe having some sort of nested structure for the SLOs might be a better alternative.
From an API perspective, we could have an API to expose SLOs which we could use to create the composite rather than the current method which registers beans within a bean method.

Flagging for team-meeting so that we can discuss this on the next team call.

wilkinsona · 2021-09-17T16:17:43Z

We discussed this some more as a team today and our feeling is that we're not sure that we have a strong enough opinion to auto-configure SLOs as health indicators. We can see that it may make sense for some users but not for others. For example, in some cases, a proxy will already be aware of the error rate for requests that it routes to an application instance. In this case, exposing the information via a health endpoint that it will also be monitoring will be of minimal value, and may even be harmful depending on how things behave when the application's health changes. For users that do want to expose SLOs as health indicators, we could provide some classes that make it easier to do so.

Since this proposal was made, we've also introduced the concept of application state. It may be that some users want to configure things such that an unmet objective results in a change to the application state to indicate that it's no longer ready, for example. We could provide some helper classes that a user can configure to connect SLOs to application state.

We discussed possibly auto-configuring the HealthMeterRegistry, automatically adding any ServiceLevelObjective beans to it. We could auto-configure some ServiceLevelObjective beans such as JvmServiceLevelObjectives.MEMORY and OperatingSystemServiceLevelObjectives.DISK rather than hard-coding them as proposed here. This would align with our auto-configuring of Micrometer's various Jvm…Metrics classes.

Overall, our feeling was that we would stop short of anything that exposes the SLOs externally, instead auto-configuring the HealthMeterRegistry and supporting beans and making it easier for a user to then plug the SLOs into health or application state in a way that meets their specific needs.

@shakuzen @jonatan-ivanov Could we have your input here please? Are we right to be cautious and just give users the parts they need and leave them to join things together or is there some clearly established usage of HealthMeterRegistry and SLOs that means that we can proceed with confidence in a particular direction?

Upgrade Micrometer to 1.6.0-SNAPSHOT

c5b75a7

spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label May 4, 2020

jkschneider force-pushed the health-slos branch 3 times, most recently from 220c8ba to d907ba5 Compare May 5, 2020 13:26

philwebb added type: enhancement A general enhancement and removed status: waiting-for-triage An issue we've not yet triaged labels May 5, 2020

philwebb added this to the 2.4.x milestone May 5, 2020

Service level objective health indicators

7290f5f

jkschneider force-pushed the health-slos branch from d907ba5 to 7290f5f Compare May 5, 2020 21:36

snicoll added the for: team-attention An issue we'd like other members of the team to review label Sep 9, 2020

bclozel modified the milestones: 2.4.x, 2.x Sep 28, 2020

bclozel added status: blocked An issue that's blocked on an external project change and removed for: team-attention An issue we'd like other members of the team to review labels Sep 28, 2020

snicoll mentioned this pull request Sep 29, 2020

Upgrade to Micrometer 1.6.0 #23525

Closed

wilkinsona mentioned this pull request Aug 23, 2021

Provide a configuration property for setting the path used by auto-configured disk space metrics #27306

Closed

mbhave self-assigned this Aug 24, 2021

mbhave added for: team-meeting An issue we'd like to discuss as a team to make progress and removed status: blocked An issue that's blocked on an external project change labels Sep 16, 2021

wilkinsona added status: blocked An issue that's blocked on an external project change and removed for: team-meeting An issue we'd like to discuss as a team to make progress labels Sep 17, 2021

mbhave removed their assignment Sep 17, 2021

philwebb added status: pending-design-work Needs design work before any code can be developed and removed status: blocked An issue that's blocked on an external project change labels Sep 20, 2021

mjf1310 approved these changes Nov 3, 2021


philwebb force-pushed the main branch 3 times, most recently from 1ca278f to 902dd0b Compare November 19, 2021 20:17

philwebb modified the milestones: 2.x, 3.x Aug 19, 2022
