
SR-IOV & Bond CNI Fails to start and terminate pod #1303

Open
itsalexjones opened this issue Jun 24, 2024 · 0 comments

itsalexjones commented Jun 24, 2024

Hi Everyone,

I have deployed the SR-IOV CNI via the SR-IOV Network Device Plugin (v3.7.0), and the bond CNI manually (from master, as the latest release is very old), and I am trying to create a bond interface from two VFs in the pod.
I have used examples from the bond-cni and sr-iov-cni documentation to do this, and have previously had single SR-IOV interfaces working correctly.

What happened:
When the pod is started, the event `Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "<snip>": plugin type ="multus" name="multus-cni-network" failed (add): [default/test-pod:sriov-network]: error adding container to network "sriov-network": cannot convert: no valid IP addresses` is logged, and the pod fails to start.

When the pod is terminated, the event `error killing pod: failed to "KillPodSandbox" for "<snip>" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"<snip>\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): delegateDel: error invoking DelegateDel - \"sriov\": error in getting result from DelNetwork: invalid version \"\": the version is empty / delegateDel: error invoking DelegateDel - \"sriov\": error in getting result from DelNetwork: invalid version \"\": the version is empty"` is logged, and the pod fails to be deleted.
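Possibly relevant: the bond NAD below sets an explicit `cniVersion`, but the two sriov NADs do not, which looks consistent with the `invalid version ""` error on delete. For comparison, this is what sriov-net1 would look like with a `cniVersion` added — the extra field is an untested guess on my part, not a confirmed fix:

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_PF_1
spec:
  config: '{
  "cniVersion": "0.3.1",
  "type": "sriov",
  "name": "sriov-network",
  "spoofchk": "off"
}'
```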

What you expected to happen:
All documentation suggests the pod should start with the four interfaces as configured.

How to reproduce it (as minimally and precisely as possible):
Deploy the following three NetworkAttachmentDefinitions (assume the resources are already created):

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_PF_1
spec:
  config: '{
  "type": "sriov",
  "name": "sriov-network",
  "spoofchk":"off"
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net2
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_PF_2
spec:
  config: '{
  "type": "sriov",
  "name": "sriov-network",
  "spoofchk":"off"
}'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: bond-net1
spec:
  config: '{
  "type": "bond",
  "cniVersion": "0.3.1",
  "name": "bond-net1",
  "mode": "active-backup",
  "failOverMac": 1,
  "linksInContainer": true,
  "miimon": "100",
  "mtu": 1500,
  "links": [
     {"name": "net1"},
     {"name": "net2"}
  ],
  "ipam": {
    "type": "host-local",
    "subnet": "10.72.0.0/16",
    "rangeStart": "10.72.61.192",
    "rangeEnd": "10.72.61.255"
  }
}'

and the following pod:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {"name": "sriov-net1", "interface": "net1"},
      {"name": "sriov-net2", "interface": "net2"},
      {"name": "bond-net1", "interface": "bond0"}
    ]'
spec:
  restartPolicy: Never
  containers:
  - name: bond-test
    image: alpine:latest
    command:
      - /bin/sh
      - "-c"
      - "sleep 60m"
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        intel.com/intel_sriov_PF_1: '1'
        intel.com/intel_sriov_PF_2: '1'
      limits:
        intel.com/intel_sriov_PF_1: '1'
        intel.com/intel_sriov_PF_2: '1'

Anything else we need to know?:
If you assign an address (a static address is fine) to the two SR-IOV interfaces, the pod is created correctly (although with two extra addresses on the bond slaves), but the pod still fails to terminate.
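For reference, one way to express the workaround above is to add a static IPAM section to each sriov NAD — the address here is illustrative only, and deletion still fails with this in place:

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_PF_1
spec:
  config: '{
  "type": "sriov",
  "name": "sriov-network",
  "spoofchk": "off",
  "ipam": {
    "type": "static",
    "addresses": [
      {"address": "10.72.61.201/16"}
    ]
  }
}'
```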

Environment:

  • Multus version (image path from 'docker images'): ghcr.io/k8snetworkplumbingwg/multus-cni:v3.8
  • Kubernetes version (use kubectl version): v1.29.5
  • Primary CNI for Kubernetes cluster: Calico
  • OS (e.g. from /etc/os-release): Debian 12
  • File of '/etc/cni/net.d/':
{
  "cniVersion": "0.4.0",
  "name": "multus-cni-network",
  "type": "multus",
  "capabilities": {
    "portMappings": true,
    "bandwidth": true
  },
  "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
  "delegates": [
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "datastore_type": "kubernetes",
          "nodename": "lqbkubedab-01",
          "type": "calico",
          "log_level": "info",
          "log_file_path": "/var/log/calico/cni/cni.log",
          "ipam": {
            "type": "calico-ipam",
            "assign_ipv4": "true"
          },
          "policy": {
            "type": "k8s"
          },
          "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        },
        {
          "type": "bandwidth",
          "capabilities": {
            "bandwidth": true
          }
        }
      ]
    }
  ]
}
  • File of '/etc/cni/multus/net.d'
  • NetworkAttachment info (use kubectl get net-attach-def -o yaml)
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"k8s.cni.cncf.io/v1","kind":"NetworkAttachmentDefinition","metadata":{"annotations":{},"name":"bond-net1","namespace":"default"},"spec":{"config":"{ \"type\": \"bond\", \"cniVersion\": \"0.3.1\", \"name\": \"bond-net1\", \"mode\": \"active-backup\", \"failOverMac\": 1, \"linksInContainer\": true, \"miimon\": \"100\", \"mtu\": 1500, \"links\": [ {\"name\": \"net1\"}, {\"name\": \"net2\"} ], \"ipam\": { \"type\": \"host-local\", \"subnet\": \"10.72.0.0/16\", \"rangeStart\": \"10.72.61.192\", \"rangeEnd\": \"10.72.61.255\" } }"}}
    creationTimestamp: "2024-06-24T13:30:31Z"
    generation: 2
    name: bond-net1
    namespace: default
    resourceVersion: "2206601"
    uid: 3eac8c19-8674-4c09-bdc8-b5b93246a972
  spec:
    config: '{ "type": "bond", "cniVersion": "0.3.1", "name": "bond-net1", "mode":
      "active-backup", "failOverMac": 1, "linksInContainer": true, "miimon": "100",
      "mtu": 1500, "links": [ {"name": "net1"}, {"name": "net2"} ], "ipam": { "type":
      "host-local", "subnet": "10.72.0.0/16", "rangeStart": "10.72.61.192", "rangeEnd":
      "10.72.61.255" } }'
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_PF_AXIA_1
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"k8s.cni.cncf.io/v1","kind":"NetworkAttachmentDefinition","metadata":{"annotations":{"k8s.v1.cni.cncf.io/resourceName":"intel.com/intel_sriov_PF_AXIA_1"},"name":"sriov-net1","namespace":"default"},"spec":{"config":"{ \"type\": \"sriov\", \"name\": \"sriov-network\", \"spoofchk\":\"off\" }"}}
    creationTimestamp: "2024-06-24T13:30:24Z"
    generation: 4
    name: sriov-net1
    namespace: default
    resourceVersion: "2211948"
    uid: 043986f3-5e8a-4861-b65d-31232c2b5c07
  spec:
    config: '{ "type": "sriov", "name": "sriov-network", "spoofchk":"off" }'
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_PF_AXIA_2
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"k8s.cni.cncf.io/v1","kind":"NetworkAttachmentDefinition","metadata":{"annotations":{"k8s.v1.cni.cncf.io/resourceName":"intel.com/intel_sriov_PF_AXIA_2"},"name":"sriov-net2","namespace":"default"},"spec":{"config":"{ \"type\": \"sriov\", \"name\": \"sriov-network\", \"spoofchk\":\"off\" }"}}
    creationTimestamp: "2024-06-24T13:30:27Z"
    generation: 4
    name: sriov-net2
    namespace: default
    resourceVersion: "2211955"
    uid: a25fb747-8ffb-4524-9339-0740e3514f69
  spec:
    config: '{ "type": "sriov", "name": "sriov-network", "spoofchk":"off" }'
kind: List
metadata:
  resourceVersion: ""
  • Target pod yaml info (with annotation, use kubectl get pod <podname> -o yaml)
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 9840e0391c6916c0182bd20d6cc1bcc71b3bceaee8e091daebd6a48a077dbca3
    cni.projectcalico.org/podIP: ""
    cni.projectcalico.org/podIPs: ""
    k8s.v1.cni.cncf.io/networks: '[ {"name": "sriov-net1", "interface": "net1" },
      {"name": "sriov-net2", "interface": "net2" }, {"name": "bond-net1", "interface":
      "bond0" }]'
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{"k8s.v1.cni.cncf.io/networks":"[ {\"name\": \"sriov-net1\", \"interface\": \"net1\" }, {\"name\": \"sriov-net2\", \"interface\": \"net2\" }, {\"name\": \"bond-net1\", \"interface\": \"bond0\" }]"},"name":"test-pod","namespace":"default"},"spec":{"containers":[{"command":["/bin/sh","-c","sleep 60m"],"image":"alpine:latest","imagePullPolicy":"IfNotPresent","name":"bond-test","resources":{"limits":{"intel.com/intel_sriov_PF_AXIA_1":"1","intel.com/intel_sriov_PF_AXIA_2":"1"},"requests":{"intel.com/intel_sriov_PF_AXIA_1":"1","intel.com/intel_sriov_PF_AXIA_2":"1"}}}],"restartPolicy":"Never"}}
  creationTimestamp: "2024-06-24T15:35:25Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2024-06-24T15:37:12Z"
  name: test-pod
  namespace: default
  resourceVersion: "2212053"
  uid: 23f48611-ed65-4de8-8617-8b0a91591c28
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - sleep 60m
    image: alpine:latest
    imagePullPolicy: IfNotPresent
    name: bond-test
    resources:
      limits:
        intel.com/intel_sriov_PF_AXIA_1: "1"
        intel.com/intel_sriov_PF_AXIA_2: "1"
      requests:
        intel.com/intel_sriov_PF_AXIA_1: "1"
        intel.com/intel_sriov_PF_AXIA_2: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-k2krr
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: lqbkubedab-01
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-k2krr
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-06-24T15:35:25Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-06-24T15:35:25Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-06-24T15:35:25Z"
    message: 'containers with unready status: [bond-test]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-06-24T15:35:25Z"
    message: 'containers with unready status: [bond-test]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-06-24T15:35:25Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: alpine:latest
    imageID: ""
    lastState: {}
    name: bond-test
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: ContainerCreating
  hostIP: 10.72.60.30
  hostIPs:
  - ip: 10.72.60.30
  phase: Pending
  qosClass: BestEffort
  startTime: "2024-06-24T15:35:25Z"
  • Other log outputs (if you use multus logging)