OpenShift Bugs / OCPBUGS-55797

openshift-oauth-apiserver unavailable because of unhealthy etcd health responses


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.16.z
    • Component/s: Etcd
    • Severity: Important

      Description of problem:

      While looking into https://issues.redhat.com/browse/OHSS-43592, it was observed that openshift-oauth-apiserver was having issues because of unhealthy readyz responses from etcd:

      I0506 14:04:21.542391       1 healthz.go:261] etcd,etcd-readiness check failed: readyz
      [-]etcd failed: etcd client connection not yet established
      [-]etcd-readiness failed: error getting data from etcd: context deadline exceeded 
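
      For anyone revisiting this, a rough way to see the same failing health checks from the oauth-apiserver side; the deployment name, label-free selector, and serving port below are assumptions for this hosted control plane and may differ:

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      # Grep the oauth-apiserver logs for failing etcd health checks.
      $ oc logs -n "$NS" deploy/openshift-oauth-apiserver --tail=500 | grep 'check failed'

      # Or hit its readyz endpoint directly through a port-forward.
      $ oc port-forward -n "$NS" deploy/openshift-oauth-apiserver 8443:8443 &
      $ curl -ks 'https://localhost:8443/readyz?verbose'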

       

      After some digging in etcd, we found that the etcd members were logging the following warning:

      {
        "level": "warn",
        "ts": "2025-05-06T15:26:14.148276Z",
        "caller": "embed/config_logging.go:160",
        "msg": "rejected connection",
        "remote-addr": "10.128.222.48:60882",
        "server-name": "etcd-2.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc",
        "ip-addresses": [],
        "dns-names": [
          "*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc",
          "*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc.cluster.local",
          "127.0.0.1",
          "::1"
        ],
        "error": "tls: \"10.128.222.48\" does not match any of DNSNames [\"*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc\" \"*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"
      }
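
      To confirm the SAN mismatch from the serving side, one option is to dump the certificate presented on the etcd-2 listener. This is only a sketch: it assumes openssl is available inside the etcd container, that the container is named `etcd`, and that 2379 is the client port.

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      # Print the Subject Alternative Names on the certificate served by etcd-2.
      $ oc exec -n "$NS" etcd-0 -c etcd -- sh -c "openssl s_client -connect etcd-2.etcd-discovery.${NS}.svc:2379 </dev/null 2>/dev/null | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'"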

       

      When we looked at the etcd pod assigned the IP `10.128.242.130` (etcd-2), we observed the following:

      {
        "level": "warn",
        "ts": "2025-05-06T15:09:13.286196Z",
        "caller": "rafthttp/peer.go:267",
        "msg": "dropped internal Raft message since sending buffer is full (overloaded network)",
        "message-type": "MsgHeartbeat",
        "local-member-id": "23dece5d686fd4d7",
        "from": "23dece5d686fd4d7",
        "remote-peer-id": "7153809bb8659f8c",
        "remote-peer-name": "pipeline",
        "remote-peer-active": false
      } 

       

      {
        "level": "warn",
        "ts": "2025-05-06T15:09:57.852582Z",
        "caller": "rafthttp/probing_status.go:68",
        "msg": "prober detected unhealthy status",
        "round-tripper-name": "ROUND_TRIPPER_RAFT_MESSAGE",
        "remote-peer-id": "7153809bb8659f8c",
        "rtt": "1.281923382s",
        "error": "EOF"
      } 
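
      To cross-check what etcd itself thinks about member health, a rough sketch using etcdctl from inside one of the members; the container name and certificate paths are placeholders and need to match whatever this deployment actually mounts:

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      # Placeholder paths (inside the etcd container) for the CA and client cert/key.
      $ ETCD_CA=/path/to/etcd-ca.crt; ETCD_CERT=/path/to/etcd-client.crt; ETCD_KEY=/path/to/etcd-client.key

      $ oc exec -n "$NS" etcd-0 -c etcd -- etcdctl --endpoints https://localhost:2379 --cacert "$ETCD_CA" --cert "$ETCD_CERT" --key "$ETCD_KEY" endpoint status --cluster -w table
      $ oc exec -n "$NS" etcd-0 -c etcd -- etcdctl --endpoints https://localhost:2379 --cacert "$ETCD_CA" --cert "$ETCD_CERT" --key "$ETCD_KEY" endpoint health --cluster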

      To the best of our knowledge, etcd was not reporting itself as unhealthy. We would have expected the pods to go into a bad state, or our monitoring to alert us in this situation, but neither happened.

      Pods were all running, no restarts, and healthy:

      $ oc get pods -n ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02 -o wide | grep etcd
      
      etcd-0                                                     4/4     Running     0                13d     10.128.222.48    ip-10-0-170-114.eu-west-1.compute.internal   <none>           <none>
      etcd-1                                                     4/4     Running     1 (6d ago)       17d     10.128.1.213     ip-10-0-172-68.eu-west-1.compute.internal    <none>           <none>
      etcd-2                                                     4/4     Running     0                7d19h   10.128.242.130   ip-10-0-171-52.eu-west-1.compute.internal    <none>           <none> 
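
      Since the pods themselves reported Ready, it may also be worth capturing what the etcd container's readiness probe actually exercises; a probe that only checks the local member would not necessarily catch rejected peer connections. A sketch (the container name `etcd` is an assumption):

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      # Show the readiness probe configured for the etcd container in etcd-2.
      $ oc get pod etcd-2 -n "$NS" -o jsonpath='{.spec.containers[?(@.name=="etcd")].readinessProbe}{"\n"}'

      # And the pod conditions kubelet reported at the time.
      $ oc get pod etcd-2 -n "$NS" -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'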

       

      FYI: Endpoints behind the discovery service: 

      $ oc get ep -n ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02 etcd-discovery -o yaml
      ...
      subsets:
      - addresses:
        - hostname: etcd-1
          ip: 10.128.1.30
          nodeName: ip-10-0-172-68.eu-west-1.compute.internal
          targetRef:
            kind: Pod
            name: etcd-1
            namespace: ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02
            uid: dc6ee547-323f-4469-9ccb-d5ad56e7c94f
        - hostname: etcd-0
          ip: 10.128.222.48
          nodeName: ip-10-0-170-114.eu-west-1.compute.internal
          targetRef:
            kind: Pod
            name: etcd-0
            namespace: ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02
            uid: 29aafe4b-bb7c-44dc-b591-f8a153636f2d
        - hostname: etcd-2
          ip: 10.128.242.130
          nodeName: ip-10-0-171-52.eu-west-1.compute.internal
          targetRef:
            kind: Pod
            name: etcd-2
            namespace: ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02
            uid: 07f6c074-d461-4f86-98f8-35a410925485 
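
      Note that the endpoints object lists 10.128.1.30 for etcd-1 while the pod listing above shows 10.128.1.213; if both were captured at the same point in time, that difference may itself be worth chasing. A quick way to compare the discovery endpoints against live pod IPs if this recurs:

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      # Endpoint hostnames and IPs registered behind etcd-discovery.
      $ oc get ep etcd-discovery -n "$NS" -o jsonpath='{range .subsets[*].addresses[*]}{.hostname}{"\t"}{.ip}{"\n"}{end}'

      # Current pod IPs for the etcd members.
      $ oc get pods -n "$NS" -o wide | grep '^etcd-'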

      Version-Release number of selected component (if applicable):

      4.16.39

      How reproducible:

      Unknown    

      Steps to Reproduce:

      1. Unknown

      Actual results:

      etcd did not report itself as unhealthy and no alerts fired, while peer connections were being rejected and Raft messages dropped; openshift-oauth-apiserver remained unavailable until the components were manually restarted.

      Expected results:

      We would expect that when etcd is not responding as healthy on the `readyz` endpoint (or its TLS connections are failing), it would attempt to self-heal, or at least report itself as unhealthy.

      Additional info:

      The issue was fixed by issuing a rolling restart of etcd, openshift-apiserver, and openshift-oauth-apiserver.
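
      For reference, a sketch of those rolling restarts, assuming etcd runs as a StatefulSet and the two apiservers as Deployments in this namespace (verify the actual resource kinds and names with `oc get sts,deploy -n "$NS"` first):

      $ NS=ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02

      $ oc rollout restart statefulset/etcd -n "$NS"
      $ oc rollout status statefulset/etcd -n "$NS"

      $ oc rollout restart deployment/openshift-apiserver deployment/openshift-oauth-apiserver -n "$NS"
      $ oc rollout status deployment/openshift-oauth-apiserver -n "$NS"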

              Dean West (dwest@redhat.com)
              Jim DAgostino (jimd.openshift)
              Ge Liu