OpenShift Bugs / OCPBUGS-55797

openshift-oauth-apiserver unavailable because of unhealthy etcd health responses


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.16.z
    • Component/s: Etcd
    • Severity: Important

      Description of problem:

      While looking into https://issues.redhat.com/browse/OHSS-43592, it was observed that openshift-oauth-apiserver was having issues because of unhealthy readyz responses from etcd:

      I0506 14:04:21.542391       1 healthz.go:261] etcd,etcd-readiness check failed: readyz
      [-]etcd failed: etcd client connection not yet established
      [-]etcd-readiness failed: error getting data from etcd: context deadline exceeded 
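
      For anyone revisiting this, a rough way to see the same failing health checks from the oauth-apiserver side; the deployment name, label-free selector, and serving port below are assumptions for this hosted control plane and may differ:

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      # Grep the oauth-apiserver logs for failing etcd health checks.
      $ oc logs -n "$NS" deploy/openshift-oauth-apiserver --tail=500 | grep 'check failed'

      # Or hit its readyz endpoint directly through a port-forward.
      $ oc port-forward -n "$NS" deploy/openshift-oauth-apiserver 8443:8443 &
      $ curl -ks 'https://localhost:8443/readyz?verbose'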

       

      After some digging in etcd, we found that the etcd members were logging the following warning:

      {
        "level": "warn",
        "ts": "2025-05-06T15:26:14.148276Z",
        "caller": "embed/config_logging.go:160",
        "msg": "rejected connection",
        "remote-addr": "10.128.222.48:60882",
        "server-name": "etcd-2.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc",
        "ip-addresses": [],
        "dns-names": [
          "*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc",
          "*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc.cluster.local",
          "127.0.0.1",
          "::1"
        ],
        "error": "tls: \"10.128.222.48\" does not match any of DNSNames [\"*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc\" \"*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"
      }
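
      To confirm the SAN mismatch from the serving side, one option is to dump the certificate presented on the etcd-2 listener. This is only a sketch: it assumes openssl is available inside the etcd container, that the container is named `etcd`, and that 2379 is the client port.

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      # Print the Subject Alternative Names on the certificate served by etcd-2.
      $ oc exec -n "$NS" etcd-0 -c etcd -- sh -c "openssl s_client -connect etcd-2.etcd-discovery.${NS}.svc:2379 </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'"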

       

      When we looked at the etcd pod assigned the IP `10.128.242.130` (etcd-2), we observed the following:

      {
        "level": "warn",
        "ts": "2025-05-06T15:09:13.286196Z",
        "caller": "rafthttp/peer.go:267",
        "msg": "dropped internal Raft message since sending buffer is full (overloaded network)",
        "message-type": "MsgHeartbeat",
        "local-member-id": "23dece5d686fd4d7",
        "from": "23dece5d686fd4d7",
        "remote-peer-id": "7153809bb8659f8c",
        "remote-peer-name": "pipeline",
        "remote-peer-active": false
      } 

       

      {
        "level": "warn",
        "ts": "2025-05-06T15:09:57.852582Z",
        "caller": "rafthttp/probing_status.go:68",
        "msg": "prober detected unhealthy status",
        "round-tripper-name": "ROUND_TRIPPER_RAFT_MESSAGE",
        "remote-peer-id": "7153809bb8659f8c",
        "rtt": "1.281923382s",
        "error": "EOF"
      } 
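
      To cross-check what etcd itself thinks about member health, a rough sketch using etcdctl from inside one of the members; the container name and certificate paths are placeholders and need to match whatever this deployment actually mounts:

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      # Placeholder paths (inside the etcd container) for the CA and client cert/key.
      $ ETCD_CA=/path/to/etcd-ca.crt; ETCD_CERT=/path/to/etcd-client.crt; ETCD_KEY=/path/to/etcd-client.key

      $ oc exec -n "$NS" etcd-0 -c etcd -- etcdctl --endpoints https://localhost:2379 --cacert "$ETCD_CA" --cert "$ETCD_CERT" --key "$ETCD_KEY" endpoint status --cluster -w table
      $ oc exec -n "$NS" etcd-0 -c etcd -- etcdctl --endpoints https://localhost:2379 --cacert "$ETCD_CA" --cert "$ETCD_CERT" --key "$ETCD_KEY" endpoint health --cluster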

      To the best of our knowledge, etcd was not reporting itself as unhealthy. We would have expected the pods to go into a bad state, or our monitoring to alert us in this situation, but neither happened.

      Pods were all running, no restarts, and healthy:

      $ oc get pods -n ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02 -o wide | grep etcd
      
      etcd-0                                                     4/4     Running     0                13d     10.128.222.48    ip-10-0-170-114.eu-west-1.compute.internal   <none>           <none>
      etcd-1                                                     4/4     Running     1 (6d ago)       17d     10.128.1.213     ip-10-0-172-68.eu-west-1.compute.internal    <none>           <none>
      etcd-2                                                     4/4     Running     0                7d19h   10.128.242.130   ip-10-0-171-52.eu-west-1.compute.internal    <none>           <none> 
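
      Since the pods themselves reported Ready, it may also be worth capturing what the etcd container's readiness probe actually exercises; a probe that only checks the local member would not necessarily catch rejected peer connections. A sketch (the container name `etcd` is an assumption):

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      # Show the readiness probe configured for the etcd container in etcd-2.
      $ oc get pod etcd-2 -n "$NS" -o jsonpath='{.spec.containers[?(@.name=="etcd")].readinessProbe}{"\n"}'

      # And the pod conditions kubelet reported at the time.
      $ oc get pod etcd-2 -n "$NS" -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'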

       

      FYI: Endpoints behind the discovery service: 

      $ oc get ep -n ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02 etcd-discovery -o yaml
      ...
      subsets:
      - addresses:
        - hostname: etcd-1
          ip: 10.128.1.30
          nodeName: ip-10-0-172-68.eu-west-1.compute.internal
          targetRef:
            kind: Pod
            name: etcd-1
            namespace: ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02
            uid: dc6ee547-323f-4469-9ccb-d5ad56e7c94f
        - hostname: etcd-0
          ip: 10.128.222.48
          nodeName: ip-10-0-170-114.eu-west-1.compute.internal
          targetRef:
            kind: Pod
            name: etcd-0
            namespace: ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02
            uid: 29aafe4b-bb7c-44dc-b591-f8a153636f2d
        - hostname: etcd-2
          ip: 10.128.242.130
          nodeName: ip-10-0-171-52.eu-west-1.compute.internal
          targetRef:
            kind: Pod
            name: etcd-2
            namespace: ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02
            uid: 07f6c074-d461-4f86-98f8-35a410925485 
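
      Note that the endpoints object lists 10.128.1.30 for etcd-1 while the pod listing above shows 10.128.1.213; if both were captured at the same point in time, that difference may itself be worth chasing. A quick way to compare the discovery endpoints against live pod IPs if this recurs:

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      # Endpoint hostnames and IPs registered behind etcd-discovery.
      $ oc get ep etcd-discovery -n "$NS" -o jsonpath='{range .subsets[*].addresses[*]}{.hostname}{"\t"}{.ip}{"\n"}{end}'

      # Current pod IPs for the etcd members.
      $ oc get pods -n "$NS" -o wide | grep '^etcd-'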

      Version-Release number of selected component (if applicable):

      4.16.39

      How reproducible:

      Unknown    

      Steps to Reproduce:

      1. Unknown

      Actual results:

      etcd did not report itself as unhealthy and no alerts fired, while peer connections were being rejected and Raft messages dropped; openshift-oauth-apiserver remained unavailable until the components were manually restarted.

      Expected results:

      We would expect that when etcd is not responding as healthy on the `readyz` endpoint (or its TLS connections are failing), it would attempt to self-heal, or at least report itself as unhealthy.

      Additional info:

      The issue was fixed by issuing a rolling restart of etcd, openshift-apiserver, and openshift-oauth-apiserver.
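
      For reference, a sketch of those rolling restarts, assuming etcd runs as a StatefulSet and the two apiservers as Deployments in this namespace (verify the actual resource kinds and names with `oc get sts,deploy -n "$NS"` first):

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      $ oc rollout restart statefulset/etcd -n "$NS"
      $ oc rollout status statefulset/etcd -n "$NS"

      $ oc rollout restart deployment/openshift-apiserver deployment/openshift-oauth-apiserver -n "$NS"
      $ oc rollout status deployment/openshift-oauth-apiserver -n "$NS"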

              Dean West (dwest@redhat.com)
              Jim DAgostino (jimd.openshift)
              Ge Liu