Pods Stuck in Terminating State
K8S version: 1.15.5, date: 2020-05-18
Symptoms
A ucloud VM suffered a disk I/O failure, so we powered off the faulty node as an emergency measure, without running kubectl delete node; the node went into NotReady state.
Afterwards we found that the pods of certain deployments were never rescheduled. All of these deployments ran Crontab workloads and used the Recreate rollout strategy (delete the old pods first, then start the new ones).
Analysis
After the node was forcibly powered off, K8S stopped receiving the kubelet's heartbeats. Once the eviction timeout expired (five minutes by default), the node controller batch-deleted the pods on that node, i.e., set their status to terminating (shutting down).
```go
func (nc *Controller) doEvictionPass() {
	...
	nodeUID, _ := value.UID.(string)
	remaining, err := nodeutil.DeletePods(nc.kubeClient, nc.recorder, value.Value, nodeUID, nc.daemonSetStore)
```
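In essence, the eviction pass lists the pods bound to the dead node and issues an ordinary DELETE for each. Below is a simplified sketch of the idea; the helper name is mine, this is not the actual nodeutil.DeletePods code, and it assumes a 1.15-era client-go where calls take no context argument. The key point: the API delete only sets deletionTimestamp, and final removal requires the node's kubelet to confirm termination, which a powered-off node never does.

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPodsOnNode is a simplified stand-in for nodeutil.DeletePods: list the
// pods scheduled onto the node, then delete each one. The delete call only
// sets deletionTimestamp; a dead kubelet never finishes the termination.
func evictPodsOnNode(clientset kubernetes.Interface, nodeName string) error {
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if err := clientset.CoreV1().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
			return err
		}
	}
	return nil
}
```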
At this point a developer tried rebuilding and redeploying the Deployment, but still no new pods were scheduled, and the old pods remained in terminating state.
We suspected the Deployment's Recreate rollout strategy. Terminating means "in the process of shutting down", and in that state the controller indeed should not create new pods; and because the node went down abnormally, the kubelet could never confirm the deletions, so the terminating state persisted indefinitely.
A deployment fundamentally works by driving multiple replicasets: it scales down the old replicaset and creates/scales up the new one.
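For concreteness, the kind of Deployment involved here can be sketched with client-go API types; the name, labels, and image below are hypothetical placeholders, not taken from the incident:

```go
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// cronWorkerDeployment builds a Recreate-strategy Deployment: the controller
// must see all old pods fully gone before it creates the new replicaset.
func cronWorkerDeployment() *appsv1.Deployment {
	labels := map[string]string{"app": "cron-worker"}
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "cron-worker"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(1),
			Strategy: appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType},
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "worker", Image: "example/cron-worker:latest"}},
				},
			},
		},
	}
}
```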
Looking at the deployment controller code, in the syncDeployment sync function:
```go
// syncDeployment will sync the deployment with the given key.
// This function is not meant to be invoked concurrently with the same key.
func (dc *DeploymentController) syncDeployment(key string) error {
```
It first queries the apiserver for all replicasets owned by the deployment:
```go
// List ReplicaSets owned by this Deployment, while reconciling ControllerRef
// through adoption/orphaning.
rsList, err := dc.getReplicaSetsForDeployment(d)
```
It then fetches all pods matching the deployment's label selector and, using the replicaset UID in each pod's ownerReferences, assigns the pods to their replicasets, producing an rs -> pod list mapping:
```yaml
ownerReferences:
- apiVersion: apps/v1
  blockOwnerDeletion: true
  controller: true
  kind: ReplicaSet
  name: go-abtestservice-smzdm-com-7697bd68c9
  uid: 3ea7021d-ef24-4994-8e73-2917bce065ce
```
```go
// getPodMapForDeployment returns the Pods managed by a Deployment.
//
// It returns a map from ReplicaSet UID to a list of Pods controlled by that RS,
// according to the Pod's ControllerRef.
func (dc *DeploymentController) getPodMapForDeployment(d *apps.Deployment, rsList []*apps.ReplicaSet) (map[types.UID]*v1.PodList, error) {
```
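The mapping itself is simple to picture; here is a minimal sketch of the idea (the helper name is mine, not the controller's code):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// groupPodsByReplicaSet mirrors the idea behind getPodMapForDeployment: bucket
// each pod under the UID of the ReplicaSet named in its controller ownerReference.
func groupPodsByReplicaSet(pods []corev1.Pod) map[types.UID][]corev1.Pod {
	m := map[types.UID][]corev1.Pod{}
	for i := range pods {
		ref := metav1.GetControllerOf(&pods[i])
		if ref != nil && ref.Kind == "ReplicaSet" {
			m[ref.UID] = append(m[ref.UID], pods[i])
		}
	}
	return m
}
```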
Finally, the deployment takes a different path depending on whether its strategy is RollingUpdate or Recreate; here the Recreate path is executed:
```go
// rolloutRecreate implements the logic for recreating a replica set.
func (dc *DeploymentController) rolloutRecreate(d *apps.Deployment, rsList []*apps.ReplicaSet, podMap map[types.UID]*v1.PodList) error {
```
There is a key short-circuit check here: if any old pod is still running, the new replicaset is not created:
```go
// oldPodsRunning returns whether there are old pods running or any of the old ReplicaSets thinks that it runs pods.
func oldPodsRunning(newRS *apps.ReplicaSet, oldRSs []*apps.ReplicaSet, podMap map[types.UID]*v1.PodList) bool {
	...
	for _, pod := range podList.Items {
		switch pod.Status.Phase {
		case v1.PodFailed, v1.PodSucceeded:
			// Don't count pods in terminal state.
			continue
		case v1.PodUnknown:
			// This happens in situation like when the node is temporarily disconnected from the cluster.
			// If we can't be sure that the pod is not running, we have to count it.
			return true
		default:
			// Pod is not in terminal phase.
			return true
		}
```
Our pods showed as terminating, but terminating is not actually a pod phase: it only means deletionTimestamp has been set, while Status.Phase keeps its previous value (Running here, possibly Unknown after the node was lost). Either way the switch returns true, meaning some old pod has not verifiably exited, so the deployment never created the new replicaset and the business pods were never rescheduled.
In other words, only the terminal phases Failed (exited with an error) and Succeeded (exited cleanly) count as truly gone; no other state guarantees the pod has stopped, so the Deployment refuses to take any further action.
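To make the behavior concrete, here is a self-contained sketch (the pod name is a hypothetical placeholder) reproducing the phase switch in isolation: a pod whose deletionTimestamp is set but whose phase is still Running is counted as running and blocks the Recreate rollout.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// countsAsRunning reproduces the phase switch from oldPodsRunning.
func countsAsRunning(pod corev1.Pod) bool {
	switch pod.Status.Phase {
	case corev1.PodFailed, corev1.PodSucceeded:
		return false // terminal phases: the pod has definitely exited
	default:
		return true // Running/Pending/Unknown: may still be running
	}
}

func main() {
	now := metav1.NewTime(time.Now())
	stuck := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "cron-worker-xxx",
			DeletionTimestamp: &now, // kubectl shows this pod as Terminating
		},
		Status: corev1.PodStatus{Phase: corev1.PodRunning}, // phase is unchanged
	}
	fmt.Println(countsAsRunning(stuck)) // true: blocks the Recreate rollout
}
```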
Solution
This problem does not affect deployments using the RollingUpdate strategy; only Recreate deployments are hit (statefulsets are similarly affected).
When a node has to be shut down in an emergency, it is best to also run kubectl delete node, or to force-remove the pod records from etcd:

```
kubectl delete pods <pod_name> --grace-period=0 --force -n <namespace>
```
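For reference, a client-go counterpart of that command might look like the sketch below (assuming a 1.15-era client-go, where Delete takes no context argument):

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod removes the pod object immediately (grace period 0),
// without waiting for the node's kubelet to confirm termination.
func forceDeletePod(clientset kubernetes.Interface, namespace, podName string) error {
	gracePeriod := int64(0)
	return clientset.CoreV1().Pods(namespace).Delete(podName, &metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriod,
	})
}
```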
Also, check whether Grafana monitors pods stuck in terminating, and add dashboards and alerts for that state.
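Since terminating is not a phase, one way to detect it is to scan for pods whose deletionTimestamp is older than some threshold. A minimal sketch (again assuming a 1.15-era client-go with no context arguments; the helper name is mine):

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listStuckTerminatingPods returns pods whose deletion was requested more than
// maxAge ago but which still exist, i.e., exactly the stuck state described above.
func listStuckTerminatingPods(clientset kubernetes.Interface, maxAge time.Duration) ([]string, error) {
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var stuck []string
	for _, pod := range pods.Items {
		if pod.DeletionTimestamp != nil && time.Since(pod.DeletionTimestamp.Time) > maxAge {
			stuck = append(stuck, fmt.Sprintf("%s/%s", pod.Namespace, pod.Name))
		}
	}
	return stuck, nil
}
```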
