This week we hit a bug in our production Kubernetes cluster. Below is the full analysis of the problem.
The problem
Restarting the network on a kubelet node left every Pod on that node in Ready: False, i.e. unable to serve traffic.
Our production Kubernetes version is 1.15.5. At first I was reading the latest 1.15 patch release, whose code structure is completely different, and I got nowhere for quite a while...
Node taints
A node normally carries no taints. Once it loses contact with the control plane, two taints come into play:
- node.kubernetes.io/not-ready: Node is not ready. This corresponds to the NodeCondition Ready being "False".
- node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".
When a node's taints are updated, the corresponding Node condition is updated along with them (here we only care about NodeReady):
taintKeyToNodeConditionMap = map[string]v1.NodeConditionType{
	schedulerapi.TaintNodeNotReady:           v1.NodeReady,
	schedulerapi.TaintNodeUnreachable:        v1.NodeReady,
	schedulerapi.TaintNodeNetworkUnavailable: v1.NodeNetworkUnavailable,
	schedulerapi.TaintNodeMemoryPressure:     v1.NodeMemoryPressure,
	schedulerapi.TaintNodeDiskPressure:       v1.NodeDiskPressure,
	schedulerapi.TaintNodePIDPressure:        v1.NodePIDPressure,
}

pkg/controller/nodelifecycle/node_lifecycle_controller.go:107
Put simply: node.kubernetes.io/not-ready goes together with the NodeReady condition being set to False, and node.kubernetes.io/unreachable goes together with NodeReady being set to Unknown. Both taints carry the NoExecute effect, meaning the Pods on the node are to be evicted immediately.
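For reference, the two NoExecute taint templates the controller applies are defined in the same file. This is reproduced from memory of the 1.15 source, so treat the exact declaration as approximate and check your own tree:

// NoExecute taint templates used by the node lifecycle controller
// (pkg/controller/nodelifecycle/node_lifecycle_controller.go).
var (
	UnreachableTaintTemplate = &v1.Taint{
		Key:    schedulerapi.TaintNodeUnreachable,
		Effect: v1.TaintEffectNoExecute,
	}
	NotReadyTaintTemplate = &v1.Taint{
		Key:    schedulerapi.TaintNodeNotReady,
		Effect: v1.TaintEffectNoExecute,
	}
)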
node lifecycle controller
The node lifecycle controller in the Controller Manager periodically checks the state of every node; the default interval is 5 seconds.
// Incorporate the results of node health signal pushed from kubelet to master.
go wait.Until(func() {
	if err := nc.monitorNodeHealth(); err != nil {
		klog.Errorf("Error monitoring node health: %v", err)
	}
}, nc.nodeMonitorPeriod, stopCh)
The lifecycle controller already watches node changes (as stored in etcd, via the apiserver) and mirrors their state into an in-memory cache; the timer above periodically scans that cache and acts on it.
The function's comment gives some key information:
// monitorNodeHealth verifies node health are constantly updated by kubelet, and
// if not, post "NodeReady==ConditionUnknown".
// For nodes who are not ready or not reachable for a long period of time.
// This function will taint them if TaintBasedEvictions feature was enabled.
// Otherwise, it would evict it directly.
func (nc *Controller) monitorNodeHealth() error {
	// We are listing nodes from local cache as we can tolerate some small delays
	// comparing to state from etcd and there is eventual consistency anyway.
To summarize briefly:
1) The lifecycle controller keeps an in-memory copy of the node data in etcd; that data is updated by the kubelet.
2) Normally the kubelet updates the node object periodically, so the controller passively picks up the node changes (think of it as a heartbeat).
3) If the kubelet has not reported node status for a while, the controller's periodic check marks the node as Unknown.
4) If the Unknown state lasts long enough, the controller taints the node and evicts the Pods on it.
So how stale does the node's last update have to be before it counts as Unknown? The lifecycle code has a comment about this:
// Controller will not proactively sync node health, but will monitor node
// health signal updated from kubelet. There are 2 kinds of node healthiness
// signals: NodeStatus and NodeLease. NodeLease signal is generated only when
// NodeLease feature is enabled. If it doesn't receive update for this amount
// of time, it will start posting "NodeReady==ConditionUnknown". The amount of
// time before which Controller start evicting pods is controlled via flag
// 'pod-eviction-timeout'.
// Note: be cautious when changing the constant, it must work with
// nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
// controller. The node health signal update frequency is the minimal of the
// two.
// There are several constraints:
// 1. nodeMonitorGracePeriod must be N times more than the node health signal
//    update frequency, where N means number of retries allowed for kubelet to
//    post node status/lease. It is pointless to make nodeMonitorGracePeriod
//    be less than the node health signal update frequency, since there will
//    only be fresh values from Kubelet at an interval of node health signal
//    update frequency. The constant must be less than podEvictionTimeout.
// 2. nodeMonitorGracePeriod can't be too large for user experience - larger
//    value takes longer for user to see up-to-date node health.
nodeMonitorGracePeriod time.Duration
This value defaults to 40 seconds.
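For reference, these are the timing knobs involved here, written out with their upstream 1.15 defaults. The identifiers below mirror the corresponding kubelet / kube-controller-manager settings; the values are the upstream defaults from memory (ours were unchanged), so verify your own flags:

import "time"

// Upstream 1.15 defaults for the intervals discussed above (illustrative only).
const (
	nodeStatusUpdateFrequency = 10 * time.Second // kubelet: how often it posts node status (the heartbeat)
	nodeMonitorPeriod         = 5 * time.Second  // controller: how often monitorNodeHealth runs
	nodeMonitorGracePeriod    = 40 * time.Second // controller: missing-heartbeat window before NodeReady becomes Unknown
	podEvictionTimeout        = 5 * time.Minute  // controller: eviction delay when TaintBasedEvictions is disabled
)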
When the network on the kubelet's node was restarted, the apiserver log showed the disconnection:
02:12:57.866468 10697 writers.go:163] apiserver was unable to write a JSON response: http2: stream closed
The timestamp is 02:12:57.
A while later, in the controller-manager log (node lifecycle controller), we see the controller change the node's status:
I0510 02:13:17.033201 384937 controller_utils.go:180] Recording status change NodeNotReady event message for node 10.42.118.62
The timestamp is 02:13:17, about 20 seconds after the apiserver disconnection at 02:12:57. The 40-second window is counted from the last heartbeat the controller had recorded for the node, which predates the disconnection log, so this is the scan in which the controller found the node had gone 40 seconds without a heartbeat.
When there has been no heartbeat for more than 40 seconds ("Kubelet stopped posting node status"), the tryUpdateNodeHealth method updates the node's conditions to Unknown (the snippet shows the NodeReady part):
if nc.now().After(nodeHealth.probeTimestamp.Add(gracePeriod)) {
	// NodeReady condition or lease was last set longer ago than gracePeriod, so
	// update it to Unknown (regardless of its current value) in the master.
	if currentReadyCondition == nil {
		// (node has never posted a Ready condition; not the case here)
	} else {
		klog.V(4).Infof("node %v hasn't been updated for %+v. Last ready condition is: %+v",
			node.Name, nc.now().Time.Sub(savedNodeHealth.probeTimestamp.Time), observedReadyCondition)
		if observedReadyCondition.Status != v1.ConditionUnknown {
			currentReadyCondition.Status = v1.ConditionUnknown
			currentReadyCondition.Reason = "NodeStatusUnknown"
			currentReadyCondition.Message = "Kubelet stopped posting node status."
			// LastProbeTime is the last time we heard from kubelet.
			currentReadyCondition.LastHeartbeatTime = observedReadyCondition.LastHeartbeatTime
			currentReadyCondition.LastTransitionTime = nc.now()
		}
	}
}
After this update, currentReadyCondition is Unknown, i.e. the node is considered lost.
Next, since the node transitioned from Ready: True to a non-True Ready status (Unknown here), the controller runs a round of handling for the lost node:
// Report node event.
if currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue {
	nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
	if err = nodeutil.MarkAllPodsNotReady(nc.kubeClient, node); err != nil {
		utilruntime.HandleError(fmt.Errorf("Unable to mark all pods NotReady on node %v: %v", node.Name, err))
	}
}
1) It records a NodeNotReady event.
2) It updates the Ready condition of every Pod on the node to False, which is why we see the logs below (a sketch of what MarkAllPodsNotReady does follows after them):
May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.033353 384937 event.go:258] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"10.42.118.62", UID:"f41af641-3f22-45a6-bff8-49945179eb78", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node 10.42.118.62 status is now: NodeNotReady
May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.033228 384937 controller_utils.go:124] Update ready status of pods on node [10.42.118.62]
May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.184202 384937 controller_utils.go:144] Updating ready status of pod app-api-smzdm-com-64f9fbd859-mrp6k to false
May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.188499 384937 controller_utils.go:144] Updating ready status of pod bannerservice-smzdm-com-58476c8f4d-ct5h4 to false
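For context, here is a simplified sketch of what nodeutil.MarkAllPodsNotReady boils down to. It is my own reconstruction against a 1.15-era client-go, not the verbatim source: list every Pod bound to the node and flip its Ready condition to False through the pods/status subresource.

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
)

// markAllPodsNotReady (sketch): set the Ready condition of every Pod on the
// given node to False. The real implementation also aggregates per-pod errors.
func markAllPodsNotReady(kubeClient clientset.Interface, node *v1.Node) error {
	pods, err := kubeClient.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node.Name,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		for j := range pod.Status.Conditions {
			if pod.Status.Conditions[j].Type == v1.PodReady {
				pod.Status.Conditions[j].Status = v1.ConditionFalse
				if _, err := kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod); err != nil {
					return err
				}
				break
			}
		}
	}
	return nil
}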
Five seconds after this round finishes, the next monitor round starts. By then the node's observed status is already Unknown, so the code takes a different branch and marks the node for tainting:
if observedReadyCondition.Status == v1.ConditionUnknown {
	if nc.useTaintBasedEvictions {
		// We want to update the taint straight away if Node is already tainted with the UnreachableTaint
		if taintutils.TaintExists(node.Spec.Taints, NotReadyTaintTemplate) {
			taintToAdd := *UnreachableTaintTemplate
			if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{NotReadyTaintTemplate}, node) {
				klog.Errorf("Failed to instantly swap UnreachableTaint to NotReadyTaint. Will try again in the next cycle.")
			}
		} else if nc.markNodeForTainting(node) {
			klog.V(2).Infof("Node %v is unresponsive as of %v. Adding it to the Taint queue.",
				node.Name,
				decisionTimestamp,
			)
		}
In other words, the node is queued to receive a NoExecute taint, with the intent of evicting its Pods immediately. Since the node's Ready condition is Unknown at this point, the taint that eventually gets applied should be node.kubernetes.io/unreachable:NoExecute (node.kubernetes.io/not-ready corresponds to Ready being False).
The corresponding log output matches this expectation:
May 10 02:13:22 10-19-143-199 hyperkube[384937]: I0510 02:13:22.509023 384937 node_lifecycle_controller.go:759] Node 10.42.118.62 is unresponsive as of 2020-05-10 02:13:22.508996764 +0800 CST m=+1658350.928182036. Adding it to the Taint queue.
May 10 02:13:22 10-19-143-199 hyperkube[384937]: I0510 02:13:22.509776 384937 node_lifecycle_controller.go:759] Node 10.42.163.43 is unresponsive as of 2020-05-10 02:13:22.509768722 +0800 CST m=+1658350.928953992. Adding it to the Taint queue.
The timestamp is 02:13:22, roughly 5 seconds after the previous monitor round at 02:13:17.
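A node sitting in the taint queue is then picked up by the controller's NoExecute tainting pass, which chooses the taint from the node's current Ready condition. Roughly, reconstructed from memory of the 1.15 doNoExecuteTaintingPass, so treat it as a sketch rather than the exact source:

// Inside the NoExecute tainting pass (simplified): pick which taint to apply
// based on the node's current Ready condition, then swap it onto the node.
_, condition := nodeutil.GetNodeCondition(&node.Status, v1.NodeReady)
taintToAdd := v1.Taint{}
oppositeTaint := v1.Taint{}
switch condition.Status {
case v1.ConditionFalse:
	taintToAdd = *NotReadyTaintTemplate
	oppositeTaint = *UnreachableTaintTemplate
case v1.ConditionUnknown:
	taintToAdd = *UnreachableTaintTemplate
	oppositeTaint = *NotReadyTaintTemplate
default:
	// Ready is True again; the real code returns here without tainting.
	return true, 0
}
nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{&oppositeTaint}, node)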
Within the next one or two monitor periods, the check runs again, sees that the kubelet has since refreshed the node's heartbeat (synced in passively), and therefore removes the taints:
May 10 02:13:27 10-19-143-199 hyperkube[384937]: I0510 02:13:27.511828 384937 node_lifecycle_controller.go:784] Node 10.42.163.43 is healthy again, removing all taints
May 10 02:13:32 10-19-143-199 hyperkube[384937]: I0510 02:13:32.523365 384937 node_lifecycle_controller.go:784] Node 10.42.118.62 is healthy again, removing all taints
The timestamps are 02:13:27 and 02:13:32. The code is:
if observedReadyCondition.Status == v1.ConditionTrue {
	if nc.useTaintBasedEvictions {
		removed, err := nc.markNodeAsReachable(node)
		if err != nil {
			klog.Errorf("Failed to remove taints from node %v. Will retry in next iteration.", node.Name)
		}
		if removed {
			klog.V(2).Infof("Node %s is healthy again, removing all taints", node.Name)
		}
	} else {
		if nc.cancelPodEviction(node) {
			klog.V(2).Infof("Node %s is ready again, cancelled pod eviction", node.Name)
		}
	}
}
1) markNodeAsReachable removes the taints from the node and prints the "healthy again" log line (a minimal sketch of what this amounts to follows below).
2) However, nothing sets the Pods on the node back to Ready: True.
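To illustrate point 1), here is a minimal client-go sketch of what removing the taints amounts to, assuming a 1.15-era client-go API; this is my own illustration, not the controller's actual markNodeAsReachable. The point is that only node.Spec.Taints is edited, and the Pods' conditions are never touched.

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
)

// removeLifecycleTaints (sketch): drop the two NoExecute lifecycle taints from
// a node. Nothing here (or in the real controller path) restores the Ready
// condition of the Pods running on that node.
func removeLifecycleTaints(kubeClient clientset.Interface, nodeName string) error {
	node, err := kubeClient.CoreV1().Nodes().Get(nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.Effect == v1.TaintEffectNoExecute &&
			(t.Key == "node.kubernetes.io/not-ready" || t.Key == "node.kubernetes.io/unreachable") {
			continue // drop the lifecycle controller's taints
		}
		kept = append(kept, t)
	}
	node.Spec.Taints = kept
	_, err = kubeClient.CoreV1().Nodes().Update(node)
	return err
}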
Further analysis
In principle, the node did carry a NoExecute taint for a short while before it was removed, so the NoExecute eviction machinery (the taint manager in the controller manager) should have seen that change and evicted the Pods.
If the Pods had actually been evicted and recreated, we would never have ended up with Pods stuck in Ready: False. So why were they not evicted?
I looked at the Pods' tolerations and found that they tolerate the NoExecute taints above for a period of time:
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
In other words, the Pods tolerate these taints for 5 minutes (these 300-second entries are the defaults injected by the DefaultTolerationSeconds admission plugin). The taint only existed for a few seconds before being removed, far less than the 300-second toleration, so the scheduled eviction never fired and the Pods stayed on the node with their Ready condition still False. That is the whole story.
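To make the toleration arithmetic concrete, here is a small standalone illustration (my own, not code from the taint manager) using the ToleratesTaint helper from k8s.io/api: given a Pod's tolerations and a NoExecute taint, it reports how long the taint must persist before the Pod would be evicted.

package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
)

// evictionDelay reports whether the Pod tolerates the given NoExecute taint
// and, if so, how long the taint must persist before the Pod is evicted.
// A nil delay with tolerated == true means the Pod is never evicted for it.
func evictionDelay(tolerations []v1.Toleration, taint *v1.Taint) (tolerated bool, delay *time.Duration) {
	for i := range tolerations {
		t := &tolerations[i]
		if t.ToleratesTaint(taint) {
			if t.TolerationSeconds == nil {
				return true, nil
			}
			d := time.Duration(*t.TolerationSeconds) * time.Second
			return true, &d
		}
	}
	return false, nil // not tolerated: evicted immediately
}

func main() {
	secs := int64(300)
	podTolerations := []v1.Toleration{{
		Key:               "node.kubernetes.io/unreachable",
		Operator:          v1.TolerationOpExists,
		Effect:            v1.TaintEffectNoExecute,
		TolerationSeconds: &secs,
	}}
	taint := &v1.Taint{Key: "node.kubernetes.io/unreachable", Effect: v1.TaintEffectNoExecute}

	ok, delay := evictionDelay(podTolerations, taint)
	fmt.Printf("tolerated=%v delay=%v\n", ok, *delay) // tolerated=true delay=5m0s
}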