- Bug
- Resolution: Unresolved
- Normal
- None
- 4.16.z
- Important
- None
- False
Description of problem:
While looking into https://issues.redhat.com/browse/OHSS-43592 it was observed that openshift-oauth-apiserver was having issues because of unhealthy readyz responses from etcd:
I0506 14:04:21.542391 1 healthz.go:261] etcd,etcd-readiness check failed: readyz [-]etcd failed: etcd client connection not yet established [-]etcd-readiness failed: error getting data from etcd: context deadline exceeded
After some digging in etcd we found etcd members were logging the following warning:
{ "level": "warn", "ts": "2025-05-06T15:26:14.148276Z", "caller": "embed/config_logging.go:160", "msg": "rejected connection", "remote-addr": "10.128.222.48:60882", "server-name": "etcd-2.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc", "ip-addresses": [], "dns-names": [ "*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc", "*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc.cluster.local", "127.0.0.1", "::1" ], "error": "tls: \"10.128.222.48\" does not match any of DNSNames [\"*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc\" \"*.etcd-discovery.ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02.svc.cluster.local\" \"127.0.0.1\" \"::1\"]" }
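For context on the warning above, here is a minimal sketch of how a peer IP might be checked against wildcard DNS SANs. It assumes, based on our reading of the etcd transport code (worth verifying against the etcd source for the version in use), that the server reverse-resolves the remote IP and compares the resulting names with the certificate's DNS names. The function name and the PTR name in the usage example are hypothetical, and the forward-lookup path for non-wildcard names is left out.

```python
def host_matches_dns_names(ptr_names, dns_names):
    """Return True if any reverse-DNS name of the peer matches a cert DNS name.

    ptr_names: names returned by a reverse lookup of the peer IP (assumed input).
    dns_names: DNS SAN entries from the certificate, possibly with "*." wildcards.
    """
    # "*.example.svc" becomes the suffix ".example.svc"
    wildcard_suffixes = [d[1:] for d in dns_names if d.startswith("*.")]
    exact_names = [d for d in dns_names if not d.startswith("*.")]

    for name in ptr_names:
        name = name.rstrip(".")  # PTR records usually end with a trailing dot
        if name in exact_names:
            return True
        if any(name.endswith(suffix) for suffix in wildcard_suffixes):
            return True
    return False


NAMESPACE = "ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02"
DNS_NAMES = [
    f"*.etcd-discovery.{NAMESPACE}.svc",
    f"*.etcd-discovery.{NAMESPACE}.svc.cluster.local",
    "127.0.0.1",
    "::1",
]

# No PTR record for the peer IP: the check fails, as in the logged error.
print(host_matches_dns_names([], DNS_NAMES))

# Hypothetical PTR record for etcd-0: the wildcard matches.
print(host_matches_dns_names(
    [f"etcd-0.etcd-discovery.{NAMESPACE}.svc.cluster.local."], DNS_NAMES))
```

If this model is right, a missing or stale reverse-DNS record for `10.128.222.48` would be enough to produce the rejection, even with a valid certificate.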
When we looked at the etcd pod assigned the IP `10.128.242.130` (etcd-2), we observed the following:
{ "level": "warn", "ts": "2025-05-06T15:09:13.286196Z", "caller": "rafthttp/peer.go:267", "msg": "dropped internal Raft message since sending buffer is full (overloaded network)", "message-type": "MsgHeartbeat", "local-member-id": "23dece5d686fd4d7", "from": "23dece5d686fd4d7", "remote-peer-id": "7153809bb8659f8c", "remote-peer-name": "pipeline", "remote-peer-active": false }
{ "level": "warn", "ts": "2025-05-06T15:09:57.852582Z", "caller": "rafthttp/probing_status.go:68", "msg": "prober detected unhealthy status", "round-tripper-name": "ROUND_TRIPPER_RAFT_MESSAGE", "remote-peer-id": "7153809bb8659f8c", "rtt": "1.281923382s", "error": "EOF" }
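Since nothing alerted on this state, one stopgap could be to count these warnings in the etcd container logs. The following is only a sketch: it assumes one JSON object per log line (as in the excerpts above), and the function name and watched message list are our own choices, not an existing tool.

```python
import json
from collections import Counter

# Message prefixes taken from the warnings quoted in this report.
WATCHED = (
    "rejected connection",
    "prober detected unhealthy status",
    "dropped internal Raft message",
)


def count_warnings(lines):
    """Count watched warn-level messages in an iterable of JSON log lines."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if not isinstance(entry, dict) or entry.get("level") != "warn":
            continue
        msg = entry.get("msg", "")
        for pattern in WATCHED:
            if msg.startswith(pattern):
                counts[pattern] += 1
    return counts


sample = [
    '{"level": "warn", "msg": "rejected connection", "remote-addr": "10.128.222.48:60882"}',
    '{"level": "warn", "msg": "prober detected unhealthy status", "error": "EOF"}',
    '{"level": "info", "msg": "unrelated"}',
    'not json',
]
print(dict(count_warnings(sample)))
```

The input could come from `oc logs` piped into the script; a sustained non-zero count of any of these messages would be a candidate alert condition.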
To the best of our knowledge, etcd was not reporting itself as unhealthy. We would have expected the pods to be in a bad state, or our monitoring to alert us in this situation, but neither happened.
All pods were Running and 4/4 ready, with no recent restarts (etcd-1 last restarted 6 days earlier):
$ oc get pods -n ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02 -o wide | grep etcd
etcd-0   4/4   Running   0            13d     10.128.222.48    ip-10-0-170-114.eu-west-1.compute.internal   <none>   <none>
etcd-1   4/4   Running   1 (6d ago)   17d     10.128.1.213     ip-10-0-172-68.eu-west-1.compute.internal    <none>   <none>
etcd-2   4/4   Running   0            7d19h   10.128.242.130   ip-10-0-171-52.eu-west-1.compute.internal    <none>   <none>
FYI: Endpoints behind the discovery service:
$ oc get ep -n ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02 etcd-discovery -o yaml
...
subsets:
- addresses:
  - hostname: etcd-1
    ip: 10.128.1.30
    nodeName: ip-10-0-172-68.eu-west-1.compute.internal
    targetRef:
      kind: Pod
      name: etcd-1
      namespace: ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02
      uid: dc6ee547-323f-4469-9ccb-d5ad56e7c94f
  - hostname: etcd-0
    ip: 10.128.222.48
    nodeName: ip-10-0-170-114.eu-west-1.compute.internal
    targetRef:
      kind: Pod
      name: etcd-0
      namespace: ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02
      uid: 29aafe4b-bb7c-44dc-b591-f8a153636f2d
  - hostname: etcd-2
    ip: 10.128.242.130
    nodeName: ip-10-0-171-52.eu-west-1.compute.internal
    targetRef:
      kind: Pod
      name: etcd-2
      namespace: ocm-production-2c88au30mo251lu8nevohplhglb38s7v-eu-tst-02
      uid: 07f6c074-d461-4f86-98f8-35a410925485
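A quick cross-check that may help when triaging this kind of issue is comparing the pod IPs with the addresses published behind the discovery service. The sketch below hard-codes the values pasted in this report; the function name is hypothetical. With these values the etcd-1 entries differ, which may simply reflect the two commands having been captured at different times, so it is flagged here only as something to verify.

```python
def find_ip_mismatches(pod_ips, endpoint_ips):
    """Return {name: (pod_ip, endpoint_ip)} for every member whose IPs differ."""
    mismatches = {}
    for name in sorted(set(pod_ips) | set(endpoint_ips)):
        pod_ip = pod_ips.get(name)
        endpoint_ip = endpoint_ips.get(name)
        if pod_ip != endpoint_ip:
            mismatches[name] = (pod_ip, endpoint_ip)
    return mismatches


# Values copied from the `oc get pods` output in this report.
pod_ips = {
    "etcd-0": "10.128.222.48",
    "etcd-1": "10.128.1.213",
    "etcd-2": "10.128.242.130",
}

# Values copied from the `oc get ep` output in this report.
endpoint_ips = {
    "etcd-0": "10.128.222.48",
    "etcd-1": "10.128.1.30",
    "etcd-2": "10.128.242.130",
}

print(find_ip_mismatches(pod_ips, endpoint_ips))
# {'etcd-1': ('10.128.1.213', '10.128.1.30')}
```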
Version-Release number of selected component (if applicable):
4.16.39
How reproducible:
Unknown
Steps to Reproduce:
1. Unknown
Actual results:
Expected results:
When etcd is not reporting healthy on the `readyz` endpoint (or is having SSL/TLS issues), we would expect etcd to attempt to self-heal, or at least to report itself as unhealthy.
Additional info:
The issue was resolved by a rolling restart of etcd, openshift-apiserver, and openshift-oauth-apiserver.