Metrics Server could not scrape node; "tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.124.4.238" node="fargate-ip-" error in log #1468

dkelim1 opened this issue Apr 11, 2024 · 8 comments
Labels
kind/bug · kind/support · triage/accepted

Comments

dkelim1 commented Apr 11, 2024

What happened:

Logs from the metrics-server pod show this error repeatedly:
E0410 22:04:01.247686 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.124.4.238:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.124.4.238" node="fargate-ip-10-124-4-238.ap-southeast-1.compute.internal"
E0410 22:04:16.201141 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.124.4.238:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.124.4.238" node="fargate-ip-10-124-4-238.ap-southeast-1.compute.internal"
E0410 22:04:31.201853 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.124.4.238:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.124.4.238" node="fargate-ip-10-124-4-238.ap-southeast-1.compute.internal"
E0410 22:04:46.277913 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.124.4.238:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.124.4.238" node="fargate-ip-10-124-4-238.ap-southeast-1.compute.internal"
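
The error means the kubelet's serving certificate lists only 127.0.0.1 as a subject alternative name, while Metrics Server connects to the node's InternalIP (10.124.4.238). As a quick check, you can confirm which addresses the node advertises; a hedged sketch using the node name from the logs above:

# Print the addresses metrics-server can choose from via
# --kubelet-preferred-address-types (standard kubectl, no extra tooling).
kubectl get node fargate-ip-10-124-4-238.ap-southeast-1.compute.internal \
  -o jsonpath='{.status.addresses}'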

What you expected to happen:
Metrics Server should be able to scrape the Fargate node it is running on.

Anything else we need to know?:

  1. Initially I was using the metrics server that came with the VPA chart, configured with the values below. Errors similar to the above appeared.
# metrics-server -- configuration options for the [metrics server Helm chart](https://github.com/kubernetes-sigs/metrics-server/tree/master/charts/metrics-server). See the projects [README.md](https://github.com/kubernetes-sigs/metrics-server/tree/master/charts/metrics-server#configuration) for all available options
metrics-server:
  # metrics-server.enabled -- Whether or not the metrics server Helm chart should be installed
  enabled: true
  # changed from the original value of false

  defaultArgs:
  - --cert-dir=/tmp
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s


  2. Later I switched to the metrics server installed directly from eks_blueprints_kubernetes_addons. Errors similar to the above appeared.
  enable_metrics_server = true
  metrics_server = {
    name          = "metrics-server"
    chart_version = "3.12.1"
    repository    = "https://kubernetes-sigs.github.io/metrics-server/"
    namespace     = "kube-system"
    values        = [templatefile("${path.module}/metrics-svr.yaml", {})]
  }
  
  3. I tried upgrading the metrics server from version 0.6.x to 0.7.x. Errors similar to the above appeared.
  4. I tried to bypass the certificate check by passing '--kubelet-insecure-tls':
defaultArgs:
  - --cert-dir=/tmp
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
  - --kubelet-insecure-tls

However, the following errors appeared:
E0410 22:13:28.928630 1 scraper.go:149] "Failed to scrape node" err="request failed, status: "403 Forbidden"" node="fargate-ip-10-124-4-186.ap-southeast-1.compute.internal"
E0410 22:13:43.827793 1 scraper.go:149] "Failed to scrape node" err="request failed, status: "403 Forbidden"" node="fargate-ip-10-124-4-186.ap-southeast-1.compute.internal"
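
A 403 means the request now reaches the kubelet but is rejected at the authorization step, so this is an authorization problem rather than a TLS one. As a hedged check (the service-account name and namespace assume the standard Helm chart install in kube-system; adjust if yours differ):

# Ask the API server whether the metrics-server service account may
# read kubelet resource metrics; "no" points at missing RBAC or a
# kubelet authorization mode that does not delegate to the API server.
kubectl auth can-i get nodes/metrics \
  --as=system:serviceaccount:kube-system:metrics-server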

Environment:

  • Kubernetes distribution: EKS Fargate
  • Server version: v1.27.11-eks-b9c9ed7
  • Metrics Server manifest:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-04-10T21:48:44Z"
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.1
    helm.sh/chart: metrics-server-3.12.1
  name: metrics-server
  namespace: kube-system
  resourceVersion: "1044967"
  uid: bbd89fdf-d933-4fd3-9bfa-2c8351bc9159

---
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-04-10T21:48:44Z"
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.1
    helm.sh/chart: metrics-server-3.12.1
  name: metrics-server
  namespace: kube-system
  resourceVersion: "1044976"
  uid: fe68eb2b-9ecf-4c57-996e-6836955f614c
spec:
  clusterIP: 172.20.20.200
  clusterIPs:
  - 172.20.20.200
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/name: metrics-server
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-04-10T21:48:44Z"
  generation: 3
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.1
    helm.sh/chart: metrics-server-3.12.1
  name: metrics-server
  namespace: kube-system
  resourceVersion: "1048455"
  uid: 51c7e198-d10b-4ec4-b96d-69e151de778b
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: metrics-server
      app.kubernetes.io/name: metrics-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: metrics-server
        app.kubernetes.io/name: metrics-server
    spec:
      containers:
      - args:
        - --secure-port=10250
        - --cert-dir=/tmp
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        - --kubelet-insecure-tls
        image: registry.k8s.io/metrics-server/metrics-server:v0.7.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: metrics-server
        ports:
        - containerPort: 10250
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: tmp
      dnsPolicy: ClusterFirst
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: metrics-server
      serviceAccountName: metrics-server
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: tmp
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-04-10T21:50:06Z"
    lastUpdateTime: "2024-04-10T21:50:06Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2024-04-10T21:48:44Z"
    lastUpdateTime: "2024-04-10T22:13:47Z"
    message: ReplicaSet "metrics-server-578bc9bf64" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-04-10T21:48:44Z"
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.1
    helm.sh/chart: metrics-server-3.12.1
  name: v1beta1.metrics.k8s.io
  resourceVersion: "1048453"
  uid: 84cc08c7-27bc-4a4e-a7b8-efcd7b428ea2
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
    port: 443
  version: v1beta1
  versionPriority: 100
status:
  conditions:

  • Kubelet config:
  • Metrics Server logs:
  • Status of Metrics API:

kubectl describe apiservice v1beta1.metrics.k8s.io
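
For reference, the condition worth checking in that output is Available; a hedged one-liner using standard kubectl:

# Available=False with reason FailedDiscoveryCheck usually means the
# aggregation layer cannot reach the metrics-server pod at all, which
# is a separate failure from the kubelet scrape errors above.
kubectl get apiservice v1beta1.metrics.k8s.io \
  -o jsonpath='{.status.conditions[?(@.type=="Available")]}'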

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 11, 2024

honarkhah commented Apr 17, 2024

Duplicate #1422
Related to aws/containers-roadmap#1798

dkelim1 commented Apr 17, 2024

Hi honarkhah,
Does that mean this is related to the above issue, with no fix or workaround until aws/containers-roadmap#1798 is fixed?
Or can we only integrate VPA or HPA with the OpenTelemetry Collector or Prometheus? Thanks.

logicalhan (Contributor) commented
/kind support
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 18, 2024
jaron360 commented
Hello,

We are having the same issue. The only workaround I have found is to run metrics-server on EC2 rather than Fargate. When running metrics-server on EC2, there are no issues or errors in the logs.

wang-xiaowu commented
Same issue on k3s:

# kubectl logs -n kube-system metrics-server-79f66dff9d-5sflh --tail 300 -f
Error from server: Get "https://10.1.4.13:10250/containerLogs/kube-system/metrics-server-79f66dff9d-5sflh/metrics-server?follow=true&tailLines=300": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.1.4.13
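
As a quick diagnostic (not a fix), you can inspect which names the kubelet's serving certificate actually covers; a hedged sketch, assuming port 10250 is reachable from where you run it and substituting your own node IP:

# Print the SANs of the kubelet serving certificate on a node. If the
# node IP is missing from the Subject Alternative Name list, TLS
# verification by metrics-server (and by kubectl logs) will fail.
echo | openssl s_client -connect 10.1.4.13:10250 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"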

babli-coditas commented
Facing the same on EC2. Any workaround?

fadulalla commented May 7, 2024

I had the same issue while I was upgrading my cluster, and the issue practically solved itself.

Tl;dr: check if you have any other clusters using similar fargate profiles, and if they're on different versions, upgrade them to match. I had dev and prod. Even though they're completely different clusters (with their own nodes and fargate profiles), they broke each other (and fixed each other). I'm still not quite sure what caused my issue, but I'm leaving my story below in case it helps someone else.

Long version:
I have two clusters, dev and prod, both on 1.25 using Fargate, with a working metrics-server (it had been working fine since 2020), and I needed to bring both to the latest EKS. Yesterday, I updated my dev cluster to 1.26 and then restarted all deployments. I noticed that after the update my nodes' kubelet remained on 1.25, even after several restarts, which was odd. I decided to update my addons, so I updated coredns to v1.9.3-eksbuild.11 and then also decided to update my metrics-server. That's when I noticed that the restarted metrics-server was unable to scrape itself. I spent two hours trying to understand why (which is how I found this issue). I thought maybe it was the version of the metrics-server, so I downgraded back to 3.8 (from 3.12), but it was still broken, unable to scrape itself.

It was odd, because the issue came out of nowhere; I had never had issues with the metrics-server on Fargate before, and it had always been able to scrape Fargate nodes. This was obviously preventing me from upgrading prod: I couldn't upgrade prod if the broken metrics-server was caused by the upgrade. I decided to see how the metrics-server was doing on Fargate on prod, and was surprised to see that it was also broken there! Baffling, because the two clusters and their fargate profiles are (should be?!) completely separate. I checked prod's version, and it was still 1.25 as expected. For some reason, all my fargate nodes on prod had restarted, and that's when my metrics-server problems started. I decided to go ahead and update prod to 1.26, and voila, metrics server suddenly started working on both clusters, dev and prod. I'm still not sure why...

I've now upgraded dev and prod to 1.29, and metrics-server is still working well.

MickeyShiue commented May 14, 2024

I'm using AWS EKS Fargate v1.29.

After I downgraded the metrics-server to v0.6.4, it worked normally:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
    rbac.authorization.k8s.io/aggregate-to-view: "true"
  name: system:aggregated-metrics-reader
rules:
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: system:metrics-server
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-server
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    k8s-app: metrics-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        k8s-app: metrics-server
    spec:
      containers:
      - args:
        - --cert-dir=/tmp
        - --secure-port=4443
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
        - --kubelet-use-node-status-port
        - --metric-resolution=15s
        image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
        name: metrics-server
        ports:
        - containerPort: 4443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 20
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
        volumeMounts:
        - mountPath: /tmp
          name: tmp-dir
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-cluster-critical
      serviceAccountName: metrics-server
      volumes:
      - emptyDir: {}
        name: tmp-dir
---
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    k8s-app: metrics-server
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100
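
To try the downgrade, save the manifest above to a file and apply it; the filename below is just an example:

# Apply the v0.6.4 manifest, then confirm the rollout and the metrics
# API (resource names match the manifest above).
kubectl apply -f metrics-server-v0.6.4.yaml
kubectl -n kube-system rollout status deploy/metrics-server
kubectl get apiservice v1beta1.metrics.k8s.io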

