This week we hit a bug in our production Kubernetes cluster. Below is the full analysis of the problem.
The problem
Restarting the network on a kubelet node left every Pod on that node in Ready: False, i.e. unable to serve traffic.
Our production Kubernetes version is 1.15.5. At first I was reading the latest 1.15 patch release, whose code structure is completely different, and I got nowhere for quite a while...
Node taints
A node normally carries no taints. Once it loses contact with the control plane, two taints come into play:
- node.kubernetes.io/not-ready: Node is not ready. This corresponds to the NodeCondition Ready being "False".
- node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".
When a node's taints are updated, the corresponding Node condition is updated along with them (here we only care about NodeReady):
taintKeyToNodeConditionMap = map[string]v1.NodeConditionType{
	schedulerapi.TaintNodeNotReady:           v1.NodeReady,
	schedulerapi.TaintNodeUnreachable:        v1.NodeReady,
	schedulerapi.TaintNodeNetworkUnavailable: v1.NodeNetworkUnavailable,
	schedulerapi.TaintNodeMemoryPressure:     v1.NodeMemoryPressure,
	schedulerapi.TaintNodeDiskPressure:       v1.NodeDiskPressure,
	schedulerapi.TaintNodePIDPressure:        v1.NodePIDPressure,
}

pkg/controller/nodelifecycle/node_lifecycle_controller.go:107
Put simply: node.kubernetes.io/not-ready goes together with the NodeReady condition being set to False, and node.kubernetes.io/unreachable goes together with NodeReady being set to Unknown. Both taints carry the NoExecute effect, meaning the Pods on the node are to be evicted immediately.
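For reference, the two NoExecute taint templates the controller applies are defined in the same file. This is reproduced from memory of the 1.15 source, so treat the exact declaration as approximate and check your own tree:

// NoExecute taint templates used by the node lifecycle controller
// (pkg/controller/nodelifecycle/node_lifecycle_controller.go).
var (
	UnreachableTaintTemplate = &v1.Taint{
		Key:    schedulerapi.TaintNodeUnreachable,
		Effect: v1.TaintEffectNoExecute,
	}
	NotReadyTaintTemplate = &v1.Taint{
		Key:    schedulerapi.TaintNodeNotReady,
		Effect: v1.TaintEffectNoExecute,
	}
)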
node lifecycle controller
The node lifecycle controller in the Controller Manager periodically checks the state of every node; the default interval is 5 seconds.
// Incorporate the results of node health signal pushed from kubelet to master.
go wait.Until(func() {
	if err := nc.monitorNodeHealth(); err != nil {
		klog.Errorf("Error monitoring node health: %v", err)
	}
}, nc.nodeMonitorPeriod, stopCh)
The lifecycle controller already watches node changes (as stored in etcd, via the apiserver) and mirrors their state into an in-memory cache; the timer above periodically scans that cache and acts on it.
The function's comment gives some key information:
// monitorNodeHealth verifies node health are constantly updated by kubelet, and
// if not, post "NodeReady==ConditionUnknown".
// For nodes who are not ready or not reachable for a long period of time.
// This function will taint them if TaintBasedEvictions feature was enabled.
// Otherwise, it would evict it directly.
func (nc *Controller) monitorNodeHealth() error {
	// We are listing nodes from local cache as we can tolerate some small delays
	// comparing to state from etcd and there is eventual consistency anyway.
To summarize briefly:
1) The lifecycle controller keeps an in-memory copy of the node data in etcd; that data is updated by the kubelet.
2) Normally the kubelet updates the node object periodically, so the controller passively picks up the node changes (think of it as a heartbeat).
3) If the kubelet has not reported node status for a while, the controller's periodic check marks the node as Unknown.
4) If the Unknown state lasts long enough, the controller taints the node and evicts the Pods on it.
So how stale does the node's last update have to be before it counts as Unknown? The lifecycle code has a comment about this:
// Controller will not proactively sync node health, but will monitor node
// health signal updated from kubelet. There are 2 kinds of node healthiness
// signals: NodeStatus and NodeLease. NodeLease signal is generated only when
// NodeLease feature is enabled. If it doesn't receive update for this amount
// of time, it will start posting "NodeReady==ConditionUnknown". The amount of
// time before which Controller start evicting pods is controlled via flag
// 'pod-eviction-timeout'.
// Note: be cautious when changing the constant, it must work with
// nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease
// controller. The node health signal update frequency is the minimal of the
// two.
// There are several constraints:
// 1. nodeMonitorGracePeriod must be N times more than the node health signal
//    update frequency, where N means number of retries allowed for kubelet to
//    post node status/lease. It is pointless to make nodeMonitorGracePeriod
//    be less than the node health signal update frequency, since there will
//    only be fresh values from Kubelet at an interval of node health signal
//    update frequency. The constant must be less than podEvictionTimeout.
// 2. nodeMonitorGracePeriod can't be too large for user experience - larger
//    value takes longer for user to see up-to-date node health.
nodeMonitorGracePeriod time.Duration
This value defaults to 40 seconds.
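For reference, these are the timing knobs involved here, written out with their upstream 1.15 defaults. The identifiers below mirror the corresponding kubelet / kube-controller-manager settings; the values are the upstream defaults from memory (ours were unchanged), so verify your own flags:

import "time"

// Upstream 1.15 defaults for the intervals discussed above (illustrative only).
const (
	nodeStatusUpdateFrequency = 10 * time.Second // kubelet: how often it posts node status (the heartbeat)
	nodeMonitorPeriod         = 5 * time.Second  // controller: how often monitorNodeHealth runs
	nodeMonitorGracePeriod    = 40 * time.Second // controller: missing-heartbeat window before NodeReady becomes Unknown
	podEvictionTimeout        = 5 * time.Minute  // controller: eviction delay when TaintBasedEvictions is disabled
)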
When the network on the kubelet's node was restarted, the apiserver log showed the disconnection:
02:12:57.866468 10697 writers.go:163] apiserver was unable to write a JSON response: http2: stream closed
The timestamp is 02:12:57.
A while later, in the controller-manager log (node lifecycle controller), we see the controller change the node's status:
I0510 02:13:17.033201 384937 controller_utils.go:180] Recording status change NodeNotReady event message for node 10.42.118.62
The timestamp is 02:13:17, about 20 seconds after the apiserver disconnection at 02:12:57. The 40-second window is counted from the last heartbeat the controller had recorded for the node, which predates the disconnection log, so this is the scan in which the controller found the node had gone 40 seconds without a heartbeat.
When there has been no heartbeat for more than 40 seconds ("Kubelet stopped posting node status"), the tryUpdateNodeHealth method updates the node's conditions to Unknown (the snippet shows the NodeReady part):
if nc.now().After(nodeHealth.probeTimestamp.Add(gracePeriod)) {
	// NodeReady condition or lease was last set longer ago than gracePeriod, so
	// update it to Unknown (regardless of its current value) in the master.
	if currentReadyCondition == nil {
		// (node has never posted a Ready condition; not the case here)
	} else {
		klog.V(4).Infof("node %v hasn't been updated for %+v. Last ready condition is: %+v",
			node.Name, nc.now().Time.Sub(savedNodeHealth.probeTimestamp.Time), observedReadyCondition)
		if observedReadyCondition.Status != v1.ConditionUnknown {
			currentReadyCondition.Status = v1.ConditionUnknown
			currentReadyCondition.Reason = "NodeStatusUnknown"
			currentReadyCondition.Message = "Kubelet stopped posting node status."
			// LastProbeTime is the last time we heard from kubelet.
			currentReadyCondition.LastHeartbeatTime = observedReadyCondition.LastHeartbeatTime
			currentReadyCondition.LastTransitionTime = nc.now()
		}
	}
}
After this update, currentReadyCondition is Unknown, i.e. the node is considered lost.
Next, since the node transitioned from Ready: True to a non-True Ready status (Unknown here), the controller runs a round of handling for the lost node:
// Report node event.
if currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue {
	nodeutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
	if err = nodeutil.MarkAllPodsNotReady(nc.kubeClient, node); err != nil {
		utilruntime.HandleError(fmt.Errorf("Unable to mark all pods NotReady on node %v: %v", node.Name, err))
	}
}
1) It records a NodeNotReady event.
2) It updates the Ready condition of every Pod on the node to False, which is why we see the logs below (a sketch of what MarkAllPodsNotReady does follows after them):
May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.033353 384937 event.go:258] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"10.42.118.62", UID:"f41af641-3f22-45a6-bff8-49945179eb78", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node 10.42.118.62 status is now: NodeNotReady
May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.033228 384937 controller_utils.go:124] Update ready status of pods on node [10.42.118.62]
May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.184202 384937 controller_utils.go:144] Updating ready status of pod app-api-smzdm-com-64f9fbd859-mrp6k to false
May 10 02:13:17 10-19-143-199 hyperkube[384937]: I0510 02:13:17.188499 384937 controller_utils.go:144] Updating ready status of pod bannerservice-smzdm-com-58476c8f4d-ct5h4 to false
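For context, here is a simplified sketch of what nodeutil.MarkAllPodsNotReady boils down to. It is my own reconstruction against a 1.15-era client-go, not the verbatim source: list every Pod bound to the node and flip its Ready condition to False through the pods/status subresource.

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
)

// markAllPodsNotReady (sketch): set the Ready condition of every Pod on the
// given node to False. The real implementation also aggregates per-pod errors.
func markAllPodsNotReady(kubeClient clientset.Interface, node *v1.Node) error {
	pods, err := kubeClient.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node.Name,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		for j := range pod.Status.Conditions {
			if pod.Status.Conditions[j].Type == v1.PodReady {
				pod.Status.Conditions[j].Status = v1.ConditionFalse
				if _, err := kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod); err != nil {
					return err
				}
				break
			}
		}
	}
	return nil
}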
Five seconds after this round finishes, the next monitor round starts. By then the node's observed status is already Unknown, so the code takes a different branch and marks the node for tainting:
if observedReadyCondition.Status == v1.ConditionUnknown {
	if nc.useTaintBasedEvictions {
		// We want to update the taint straight away if Node is already tainted with the UnreachableTaint
		if taintutils.TaintExists(node.Spec.Taints, NotReadyTaintTemplate) {
			taintToAdd := *UnreachableTaintTemplate
			if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{NotReadyTaintTemplate}, node) {
				klog.Errorf("Failed to instantly swap UnreachableTaint to NotReadyTaint. Will try again in the next cycle.")
			}
		} else if nc.markNodeForTainting(node) {
			klog.V(2).Infof("Node %v is unresponsive as of %v. Adding it to the Taint queue.",
				node.Name,
				decisionTimestamp,
			)
		}
In other words, the node is queued to receive a NoExecute taint, with the intent of evicting its Pods immediately. Since the node's Ready condition is Unknown at this point, the taint that eventually gets applied should be node.kubernetes.io/unreachable:NoExecute (node.kubernetes.io/not-ready corresponds to Ready being False).
The corresponding log output matches this expectation:
May 10 02:13:22 10-19-143-199 hyperkube[384937]: I0510 02:13:22.509023 384937 node_lifecycle_controller.go:759] Node 10.42.118.62 is unresponsive as of 2020-05-10 02:13:22.508996764 +0800 CST m=+1658350.928182036. Adding it to the Taint queue.
May 10 02:13:22 10-19-143-199 hyperkube[384937]: I0510 02:13:22.509776 384937 node_lifecycle_controller.go:759] Node 10.42.163.43 is unresponsive as of 2020-05-10 02:13:22.509768722 +0800 CST m=+1658350.928953992. Adding it to the Taint queue.
The timestamp is 02:13:22, roughly 5 seconds after the previous monitor round at 02:13:17.
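A node sitting in the taint queue is then picked up by the controller's NoExecute tainting pass, which chooses the taint from the node's current Ready condition. Roughly, reconstructed from memory of the 1.15 doNoExecuteTaintingPass, so treat it as a sketch rather than the exact source:

// Inside the NoExecute tainting pass (simplified): pick which taint to apply
// based on the node's current Ready condition, then swap it onto the node.
_, condition := nodeutil.GetNodeCondition(&node.Status, v1.NodeReady)
taintToAdd := v1.Taint{}
oppositeTaint := v1.Taint{}
switch condition.Status {
case v1.ConditionFalse:
	taintToAdd = *NotReadyTaintTemplate
	oppositeTaint = *UnreachableTaintTemplate
case v1.ConditionUnknown:
	taintToAdd = *UnreachableTaintTemplate
	oppositeTaint = *NotReadyTaintTemplate
default:
	// Ready is True again; the real code returns here without tainting.
	return true, 0
}
nodeutil.SwapNodeControllerTaint(nc.kubeClient, []*v1.Taint{&taintToAdd}, []*v1.Taint{&oppositeTaint}, node)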
Within the next one or two monitor periods, the check runs again, sees that the kubelet has since refreshed the node's heartbeat (synced in passively), and therefore removes the taints:
May 10 02:13:27 10-19-143-199 hyperkube[384937]: I0510 02:13:27.511828 384937 node_lifecycle_controller.go:784] Node 10.42.163.43 is healthy again, removing all taints
May 10 02:13:32 10-19-143-199 hyperkube[384937]: I0510 02:13:32.523365 384937 node_lifecycle_controller.go:784] Node 10.42.118.62 is healthy again, removing all taints
The timestamps are 02:13:27 and 02:13:32. The code is:
if observedReadyCondition.Status == v1.ConditionTrue {
	if nc.useTaintBasedEvictions {
		removed, err := nc.markNodeAsReachable(node)
		if err != nil {
			klog.Errorf("Failed to remove taints from node %v. Will retry in next iteration.", node.Name)
		}
		if removed {
			klog.V(2).Infof("Node %s is healthy again, removing all taints", node.Name)
		}
	} else {
		if nc.cancelPodEviction(node) {
			klog.V(2).Infof("Node %s is ready again, cancelled pod eviction", node.Name)
		}
	}
}
1) markNodeAsReachable removes the taints from the node and prints the "healthy again" log line (a minimal sketch of what this amounts to follows below).
2) However, nothing sets the Pods on the node back to Ready: True.
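To illustrate point 1), here is a minimal client-go sketch of what removing the taints amounts to, assuming a 1.15-era client-go API; this is my own illustration, not the controller's actual markNodeAsReachable. The point is that only node.Spec.Taints is edited, and the Pods' conditions are never touched.

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
)

// removeLifecycleTaints (sketch): drop the two NoExecute lifecycle taints from
// a node. Nothing here (or in the real controller path) restores the Ready
// condition of the Pods running on that node.
func removeLifecycleTaints(kubeClient clientset.Interface, nodeName string) error {
	node, err := kubeClient.CoreV1().Nodes().Get(nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := node.Spec.Taints[:0]
	for _, t := range node.Spec.Taints {
		if t.Effect == v1.TaintEffectNoExecute &&
			(t.Key == "node.kubernetes.io/not-ready" || t.Key == "node.kubernetes.io/unreachable") {
			continue // drop the lifecycle controller's taints
		}
		kept = append(kept, t)
	}
	node.Spec.Taints = kept
	_, err = kubeClient.CoreV1().Nodes().Update(node)
	return err
}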
Further analysis
In principle, the node did carry a NoExecute taint for a short while before it was removed, so the NoExecute eviction machinery (the taint manager in the controller manager) should have seen that change and evicted the Pods.
If the Pods had actually been evicted and recreated, we would never have ended up with Pods stuck in Ready: False. So why were they not evicted?
I looked at the Pods' tolerations and found that they tolerate the NoExecute taints above for a period of time:
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
In other words, the Pods tolerate these taints for 5 minutes (these 300-second entries are the defaults injected by the DefaultTolerationSeconds admission plugin). The taint only existed for a few seconds before being removed, far less than the 300-second toleration, so the scheduled eviction never fired and the Pods stayed on the node with their Ready condition still False. That is the whole story.
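To make the toleration arithmetic concrete, here is a small standalone illustration (my own, not code from the taint manager) using the ToleratesTaint helper from k8s.io/api: given a Pod's tolerations and a NoExecute taint, it reports how long the taint must persist before the Pod would be evicted.

package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
)

// evictionDelay reports whether the Pod tolerates the given NoExecute taint
// and, if so, how long the taint must persist before the Pod is evicted.
// A nil delay with tolerated == true means the Pod is never evicted for it.
func evictionDelay(tolerations []v1.Toleration, taint *v1.Taint) (tolerated bool, delay *time.Duration) {
	for i := range tolerations {
		t := &tolerations[i]
		if t.ToleratesTaint(taint) {
			if t.TolerationSeconds == nil {
				return true, nil
			}
			d := time.Duration(*t.TolerationSeconds) * time.Second
			return true, &d
		}
	}
	return false, nil // not tolerated: evicted immediately
}

func main() {
	secs := int64(300)
	podTolerations := []v1.Toleration{{
		Key:               "node.kubernetes.io/unreachable",
		Operator:          v1.TolerationOpExists,
		Effect:            v1.TaintEffectNoExecute,
		TolerationSeconds: &secs,
	}}
	taint := &v1.Taint{Key: "node.kubernetes.io/unreachable", Effect: v1.TaintEffectNoExecute}

	ok, delay := evictionDelay(podTolerations, taint)
	fmt.Printf("tolerated=%v delay=%v\n", ok, *delay) // tolerated=true delay=5m0s
}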