Pods Stuck in Terminating State
K8S version: 1.15.5, date: 2020-05-18
Symptoms
A ucloud VM suffered a disk I/O failure, so we powered off the faulty node as an emergency measure, without running kubectl delete node; the node went into NotReady state.
Afterwards we found that the pods of certain deployments were never rescheduled. All of these deployments ran Crontab workloads and used the Recreate rollout strategy (delete the old pods first, then start the new ones).
Analysis
After the node was forcibly powered off, K8S stopped receiving the kubelet's heartbeats. Once the eviction timeout expired (five minutes by default), the node controller batch-deleted the pods on that node, i.e., set their status to terminating (shutting down).
```go
func (nc *Controller) doEvictionPass() {
	...
	nodeUID, _ := value.UID.(string)
	remaining, err := nodeutil.DeletePods(nc.kubeClient, nc.recorder, value.Value, nodeUID, nc.daemonSetStore)
```
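In essence, the eviction pass lists the pods bound to the dead node and issues an ordinary DELETE for each. Below is a simplified sketch of the idea; the helper name is mine, this is not the actual nodeutil.DeletePods code, and it assumes a 1.15-era client-go where calls take no context argument. The key point: the API delete only sets deletionTimestamp, and final removal requires the node's kubelet to confirm termination, which a powered-off node never does.

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPodsOnNode is a simplified stand-in for nodeutil.DeletePods: list the
// pods scheduled onto the node, then delete each one. The delete call only
// sets deletionTimestamp; a dead kubelet never finishes the termination.
func evictPodsOnNode(clientset kubernetes.Interface, nodeName string) error {
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if err := clientset.CoreV1().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
			return err
		}
	}
	return nil
}
```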
At this point a developer tried rebuilding and redeploying the Deployment, but still no new pods were scheduled, and the old pods remained in terminating state.
We suspected the Deployment's Recreate rollout strategy. Terminating means "in the process of shutting down", and in that state the controller indeed should not create new pods; and because the node went down abnormally, the kubelet could never confirm the deletions, so the terminating state persisted indefinitely.
A deployment fundamentally works by driving multiple replicasets: it scales down the old replicaset and creates/scales up the new one.
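For concreteness, the kind of Deployment involved here can be sketched with client-go API types; the name, labels, and image below are hypothetical placeholders, not taken from the incident:

```go
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// cronWorkerDeployment builds a Recreate-strategy Deployment: the controller
// must see all old pods fully gone before it creates the new replicaset.
func cronWorkerDeployment() *appsv1.Deployment {
	labels := map[string]string{"app": "cron-worker"}
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "cron-worker"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(1),
			Strategy: appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType},
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "worker", Image: "example/cron-worker:latest"}},
				},
			},
		},
	}
}
```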
Looking at the deployment controller code, in the syncDeployment sync function:
```go
// syncDeployment will sync the deployment with the given key.
// This function is not meant to be invoked concurrently with the same key.
func (dc *DeploymentController) syncDeployment(key string) error {
```
It first queries the apiserver for all replicasets owned by the deployment:
```go
// List ReplicaSets owned by this Deployment, while reconciling ControllerRef
// through adoption/orphaning.
rsList, err := dc.getReplicaSetsForDeployment(d)
```
It then fetches all pods matching the deployment's label selector and, using the replicaset UID in each pod's ownerReferences, assigns the pods to their replicasets, producing an rs -> pod list mapping:
```yaml
ownerReferences:
- apiVersion: apps/v1
  blockOwnerDeletion: true
  controller: true
  kind: ReplicaSet
  name: go-abtestservice-smzdm-com-7697bd68c9
  uid: 3ea7021d-ef24-4994-8e73-2917bce065ce
```
```go
// getPodMapForDeployment returns the Pods managed by a Deployment.
//
// It returns a map from ReplicaSet UID to a list of Pods controlled by that RS,
// according to the Pod's ControllerRef.
func (dc *DeploymentController) getPodMapForDeployment(d *apps.Deployment, rsList []*apps.ReplicaSet) (map[types.UID]*v1.PodList, error) {
```
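The mapping itself is simple to picture; here is a minimal sketch of the idea (the helper name is mine, not the controller's code):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// groupPodsByReplicaSet mirrors the idea behind getPodMapForDeployment: bucket
// each pod under the UID of the ReplicaSet named in its controller ownerReference.
func groupPodsByReplicaSet(pods []corev1.Pod) map[types.UID][]corev1.Pod {
	m := map[types.UID][]corev1.Pod{}
	for i := range pods {
		ref := metav1.GetControllerOf(&pods[i])
		if ref != nil && ref.Kind == "ReplicaSet" {
			m[ref.UID] = append(m[ref.UID], pods[i])
		}
	}
	return m
}
```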
Finally, the deployment takes a different path depending on whether its strategy is RollingUpdate or Recreate; here the Recreate path is executed:
```go
// rolloutRecreate implements the logic for recreating a replica set.
func (dc *DeploymentController) rolloutRecreate(d *apps.Deployment, rsList []*apps.ReplicaSet, podMap map[types.UID]*v1.PodList) error {
```
There is a key short-circuit check here: if any old pod is still running, the new replicaset is not created:
```go
// oldPodsRunning returns whether there are old pods running or any of the old ReplicaSets thinks that it runs pods.
func oldPodsRunning(newRS *apps.ReplicaSet, oldRSs []*apps.ReplicaSet, podMap map[types.UID]*v1.PodList) bool {
	...
	for _, pod := range podList.Items {
		switch pod.Status.Phase {
		case v1.PodFailed, v1.PodSucceeded:
			// Don't count pods in terminal state.
			continue
		case v1.PodUnknown:
			// This happens in situation like when the node is temporarily disconnected from the cluster.
			// If we can't be sure that the pod is not running, we have to count it.
			return true
		default:
			// Pod is not in terminal phase.
			return true
		}
```
Our pods showed as terminating, but terminating is not actually a pod phase: it only means deletionTimestamp has been set, while Status.Phase keeps its previous value (Running here, possibly Unknown after the node was lost). Either way the switch returns true, meaning some old pod has not verifiably exited, so the deployment never created the new replicaset and the business pods were never rescheduled.
In other words, only the terminal phases Failed (exited with an error) and Succeeded (exited cleanly) count as truly gone; no other state guarantees the pod has stopped, so the Deployment refuses to take any further action.
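To make the behavior concrete, here is a self-contained sketch (the pod name is a hypothetical placeholder) reproducing the phase switch in isolation: a pod whose deletionTimestamp is set but whose phase is still Running is counted as running and blocks the Recreate rollout.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// countsAsRunning reproduces the phase switch from oldPodsRunning.
func countsAsRunning(pod corev1.Pod) bool {
	switch pod.Status.Phase {
	case corev1.PodFailed, corev1.PodSucceeded:
		return false // terminal phases: the pod has definitely exited
	default:
		return true // Running/Pending/Unknown: may still be running
	}
}

func main() {
	now := metav1.NewTime(time.Now())
	stuck := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "cron-worker-xxx",
			DeletionTimestamp: &now, // kubectl shows this pod as Terminating
		},
		Status: corev1.PodStatus{Phase: corev1.PodRunning}, // phase is unchanged
	}
	fmt.Println(countsAsRunning(stuck)) // true: blocks the Recreate rollout
}
```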
Solution
This problem does not affect deployments using the RollingUpdate strategy; only Recreate deployments are hit (statefulsets are similarly affected).
When a node has to be shut down in an emergency, it is best to also run kubectl delete node, or to force-remove the pod records from etcd:

```
kubectl delete pods <pod_name> --grace-period=0 --force -n <namespace>
```
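For reference, a client-go counterpart of that command might look like the sketch below (assuming a 1.15-era client-go, where Delete takes no context argument):

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod removes the pod object immediately (grace period 0),
// without waiting for the node's kubelet to confirm termination.
func forceDeletePod(clientset kubernetes.Interface, namespace, podName string) error {
	gracePeriod := int64(0)
	return clientset.CoreV1().Pods(namespace).Delete(podName, &metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriod,
	})
}
```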
Also, check whether Grafana monitors pods stuck in terminating, and add dashboards and alerts for that state.
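Since terminating is not a phase, one way to detect it is to scan for pods whose deletionTimestamp is older than some threshold. A minimal sketch (again assuming a 1.15-era client-go with no context arguments; the helper name is mine):

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listStuckTerminatingPods returns pods whose deletion was requested more than
// maxAge ago but which still exist, i.e., exactly the stuck state described above.
func listStuckTerminatingPods(clientset kubernetes.Interface, maxAge time.Duration) ([]string, error) {
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var stuck []string
	for _, pod := range pods.Items {
		if pod.DeletionTimestamp != nil && time.Since(pod.DeletionTimestamp.Time) > maxAge {
			stuck = append(stuck, fmt.Sprintf("%s/%s", pod.Namespace, pod.Name))
		}
	}
	return stuck, nil
}
```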
