OCPBUGS-55622

pod deletion doesn't occur fast enough, resulting in the new pod's multus interface failing IPv6 duplicate address detection


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • Fix Version: 4.17.z
    • Affects Version: 4.14
    • Component: Networking / multus
    • Severity: Critical
    • Sprint: CNF Network Sprint 270
    • Release Note Type: Release Note Not Required

      This is a clone of issue OCPBUGS-55346. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-37212. The following is the description of the original issue:

      Description of problem:

      On pod deletion, cleanup intermittently takes too long, resulting in the replacement pod's multus interface failing IPv6 duplicate address detection (DAD).
      
      Sample reproduction:
      - Worker 14 begins to remove the pod at 14:21:14:
      
      Jul 17 14:21:14 worker14 kubenswrapper[9796]: I0717 14:21:14.904545    9796 kubelet.go:2441] "SyncLoop DELETE" source="api" pods=[NAMESPACE/POD]
      
      - Worker 19 begins to add the pod at 14:21:14:
      
      Jul 17 14:21:14 worker19 kubenswrapper[9438]: I0717 14:21:14.952931    9438 kubelet.go:2425] "SyncLoop ADD" source="api" pods=[NAMESPACE/POD]
      
      - Worker 19 tries adding the network to the pod at Jul 17 14:21:15:
      
      Jul 17 14:21:15 worker19 crio[9376]: time="2024-07-17 14:21:15.294568336Z" level=info msg="Adding pod NAMESPACE/POD to CNI network \"multus-cni-network\" (type=multus-shim)"
      
      - But the new interface fails IPv6 DAD at 14:21:17:
      
      Jul 17 14:21:17 worker19 kernel: IPv6: eth1: IPv6 duplicate address <IPv6_ADDRESS> used by <MAC> detected!
      
      - Worker 14 does not finish tearing down the original pod and the related netns until 14:21:37:
      
      Jul 17 14:21:37 worker14 crio[9601]: time="2024-07-17 14:21:37.789184337Z" level=info msg="Got pod network &{Name:<POD> Namespace:<NAMESPACE> ID:a36d6da2c26fb668b3d9a665544ae25629377656b180bd3db2b4e199c59f9793 UID:9b7db4ae-b0bc-4987-ac57-35d3c42afdb3 NetNS:/var/run/netns/9b37d0a3-61c9-4b57-b5ea-51e1964b58c0 Networks:[{Name:multus-cni-network Ifname:eth0}] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
      Jul 17 14:21:37 worker14 crio[9601]: time="2024-07-17 14:21:37.789403797Z" level=info msg="Deleting pod <POD> from CNI network \"multus-cni-network\" (type=multus-shim)"
      Jul 17 14:21:38 worker14 kubenswrapper[9796]: I0717 14:21:38.924580    9796 kubelet.go:2441] "SyncLoop DELETE" source="api" pods=[NAMESPACE/POD]
      Jul 17 14:21:38 worker14 kubenswrapper[9796]: I0717 14:21:38.936882    9796 kubelet.go:2435] "SyncLoop REMOVE" source="api" pods=[NAMESPACE/POD]
      
      This is a timing issue: the replacement pod tries to assign the IPv6 address before the original pod's network has been cleaned up.
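
      To confirm this ordering on a live reproduction, the same markers can be pulled from the journal on both nodes and interleaved by timestamp. The following is a minimal sketch, assuming SSH access as the core user; the node names, NAMESPACE/POD placeholder, and time window are taken from the logs above (oc debug node/<node> works equally well):

      #!/usr/bin/env bash
      # Sketch: collect the kubelet SyncLoop events, the CRI-O multus add/delete
      # messages, and the kernel DAD failures from both nodes, then sort by time.
      OLD_NODE=worker14          # node tearing down the original pod
      NEW_NODE=worker19          # node starting the replacement pod
      POD_REF="NAMESPACE/POD"    # placeholder for the real namespace/pod

      for node in "$OLD_NODE" "$NEW_NODE"; do
          ssh "core@${node}" \
              "sudo journalctl --since '2024-07-17 14:21:00' --until '2024-07-17 14:22:00' --no-pager" |
              grep -E "SyncLoop (ADD|DELETE|REMOVE).*${POD_REF}|multus-cni-network|IPv6 duplicate address"
      done | sort -k3,3    # field 3 of the default journalctl output is HH:MM:SS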
      
      

      Version-Release number of selected component (if applicable):

          4.14

      How reproducible:

          Intermittent, but can be reliably reproduced.

      Steps to Reproduce:

      - Delete a pod.
      - Wait for the pod to be rescheduled, then log on to the new worker node.
      - Determine the pod's network namespace ('IP' is a placeholder for the pod's IPv6 address; a consolidated sketch follows this list):
      - - $ for ns in $(ip netns | awk '{print $1}'); do ip netns exec $ns ip a | grep -iq 'IP'; if [ $? == 0 ]; then echo $ns; fi; done
      - Validate that eth1 is in the tentative+dadfailed state:
      - - $ ip netns exec <NS> ip a
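
      The checks above can be wrapped into a single script. This is a minimal sketch, assuming it runs as root on the worker hosting the replacement pod; the POD_IP argument stands in for the pod's (redacted) IPv6 address, and eth1 is the multus-attached interface from this report:

      #!/usr/bin/env bash
      # Sketch: locate the pod's network namespace and report whether eth1
      # failed IPv6 duplicate address detection. Run on the new worker node.
      POD_IP="${1:?usage: $0 <pod-ipv6-address>}"

      for ns in $(ip netns | awk '{print $1}'); do
          # Find the namespace that carries the pod's address...
          if ip netns exec "$ns" ip addr | grep -q "$POD_IP"; then
              echo "pod netns: $ns"
              # ...and check whether any IPv6 address on eth1 is flagged dadfailed.
              if ip netns exec "$ns" ip -6 addr show dev eth1 | grep -q dadfailed; then
                  echo "eth1 in $ns is tentative/dadfailed (issue reproduced)"
              else
                  echo "eth1 in $ns passed DAD"
              fi
          fi
      done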

      Actual results:

       6: eth1@if24: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 9000 qdisc noqueue state UNKNOWN
           link/ether 88:e9:a4:71:62:5c brd ff:ff:ff:ff:ff:ff
           inet6 IPv6_ADDRESS/64 scope global tentative dadfailed    <--- FAILED
              valid_lft forever preferred_lft forever
           inet6 fe80::88e9:a400:371:625c/64 scope link
              valid_lft forever preferred_lft forever

      Expected results:

       No IPv6 DAD failure.

      Additional info:

          Note: This was not seen on the impacted cluster until it was upgraded to 4.14, so this may be a regression or a newly introduced bug.

       

              Assignee: Peng Liu (pliurh)
              Reporter: OpenShift Prow Bot (openshift-crt-jira-prow)
              QA Contact: Weibin Liang