- Feature Request
- Resolution: Unresolved
1. Proposed title of this feature request
Node drain reporting improvements
2. What is the nature and description of the request?
This is a request to improve observability around worker node drains and their status.
What we're looking for:
- a metric that reports node drain progress
- when a drain fails, the pods blocking it and the reason for each (e.g. pod abc, blocked by a PDB)
- status on NodePools that reflects node drain failures, the offending pods and the reason
- [optional] status on the CAPI Machine that reflects node drain failures, the offending pods and the reason
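To make the request concrete, here is a minimal sketch of the information such a metric and status could carry. It is an illustration only, not an existing hypershift-operator or CAPI API: the condition type `NodeDrainFailed`, the reason `PodsBlockingDrain`, the metric name `node_drain_blocked_pods`, and the `BlockedPod` type are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BlockedPod:
    """A pod that prevents a node drain from completing (hypothetical type)."""
    namespace: str
    name: str
    reason: str  # e.g. "PodDisruptionBudget"


def drain_condition(node: str, blocked: List[BlockedPod]) -> Dict[str, str]:
    """Build a status condition summarizing the drain of one node.

    The condition type and reasons are assumptions made for this sketch.
    """
    if not blocked:
        return {
            "type": "NodeDrainFailed",
            "status": "False",
            "reason": "AsExpected",
            "message": f"node {node}: no pods blocking drain",
        }
    # Sort so the message is stable across reconciliations.
    ordered = sorted(blocked, key=lambda p: (p.namespace, p.name))
    details = "; ".join(f"{p.namespace}/{p.name}: {p.reason}" for p in ordered)
    return {
        "type": "NodeDrainFailed",
        "status": "True",
        "reason": "PodsBlockingDrain",
        "message": f"node {node}: {len(blocked)} pod(s) blocking drain ({details})",
    }


def drain_metric_line(node: str, blocked: List[BlockedPod]) -> str:
    """Render a gauge sample in the Prometheus text exposition format.

    The metric name is hypothetical.
    """
    return f'node_drain_blocked_pods{{node="{node}"}} {len(blocked)}'


if __name__ == "__main__":
    pods = [BlockedPod("default", "abc", "PodDisruptionBudget")]
    print(drain_condition("worker-1", pods)["message"])
    print(drain_metric_line("worker-1", pods))
```

A gauge of blocked pods per node would let SRE alert on drains that stay blocked, while the condition message carries the pod and reason detail that the customer needs to act on.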
3. Why does the customer need this? (List the business requirements here)
Managed services (ROSA/ARO HCP) don't expose node drain failures to the customer; the failures are currently visible only in the CAPI logs. Managed services SRE needs a metric so that it can automatically notify customers when a node drain fails because of their workload, and a status on NodePools as a general UX improvement.
4. List any affected packages or components.
hypershift-operator, CAPI
- relates to: OCPSTRAT-1615 Enhanced Debuggability for HyperShift Cluster NodePool Failures (New)