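Someone in a k8s chat group asked why a pod, once created and scheduled onto a particular node, stayed in ContainerCreating for a long time. I told him to check the logs, but he couldn't make anything of them. Later he added me as a contact and paid me to take a look.
Troubleshooting process
At first the problem showed up with a plain nginx pod deployed without any volume mounts, and describe really did not show anything useful. After that was fixed, pods using an NFS PVC could not come up.
Environment: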
$ kubectl get node -o wide
NAME     STATUS   ROLES    AGE      VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
master   Ready    master   2y129d   v1.18.3   xxx.xx.xx.9   <none>        CentOS Linux 7 (Core)   3.10.0-1160.24.1.el7.x86_64   docker://19.3.5
work01   Ready    <none>   2y129d   v1.18.3   xxx.xx.xx.1   <none>        CentOS Linux 7 (Core)   3.10.0-1160.24.1.el7.x86_64   docker://19.3.5
work02   Ready    <none>   2y129d   v1.18.3   xxx.xx.xx.3   <none>        CentOS Linux 7 (Core)   3.10.0-1160.24.1.el7.x86_64   docker://19.3.5
work03   Ready    <none>   2y129d   v1.18.3   xxx.xx.xx.4   <none>        CentOS Linux 7 (Core)   3.10.0-1160.24.1.el7.x86_64   docker://19.3.5
orphaned pod xxx found, but
The kubelet log was being flooded with the following message:
kubelet_volumes.go:154] orphaned pod "xxx" found, but volume paths are still present on disk : There were a total of 84 errors similar to this. Turn up verbosity to see them.
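This happens on versions before 1.20 or thereabouts (I don't remember exactly which): after a pod moves to another node or is deleted, some of its directories are left behind on the node under kubelet's --root-dir. With the default root directory these are the UID-named directories under /var/lib/kubelet/pods. You can run find on such a directory to check its contents, read its etc-hosts file to get the hostname, and then use kubectl get pod to see whether a pod with that name still exists. If it does not, the directory is a leftover and can be cleaned up by hand. As far as I remember, someone later submitted a PR so that kubelet cleans up these directories periodically.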
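The manual check described above could be sketched roughly as follows. This is only an illustration: it assumes the default root directory /var/lib/kubelet, that each pod directory contains an etc-hosts file written by kubelet, and that the last entry in that file carries the pod's hostname (which normally equals the pod name). The names `pod_hostname` and `list_pod_dirs` are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: print "<pod UID directory> <hostname from etc-hosts>" for every pod
# directory on this node, so each name can be checked against the API server.

# Assumption: kubelet runs with the default --root-dir (/var/lib/kubelet).
PODS_DIR="${PODS_DIR:-/var/lib/kubelet/pods}"

# Print the last field of the last non-comment, non-empty line of an etc-hosts
# file. Assumption: that line is the pod's own entry (no hostAliases appended).
pod_hostname() {
  awk 'NF && $1 !~ /^#/ { name = $NF } END { print name }' "$1"
}

# Walk the pod directories and print one tab-separated line per directory
# that still has an etc-hosts file.
list_pod_dirs() {
  for dir in "$PODS_DIR"/*/; do
    [ -f "${dir}etc-hosts" ] || continue
    printf '%s\t%s\n' "$(basename "$dir")" "$(pod_hostname "${dir}etc-hosts")"
  done
}

list_pod_dirs
```

For each name printed, `kubectl get pod -A | grep <name>` shows whether the pod still exists. Only directories whose pod is gone are candidates for manual removal, and it is worth checking with `mount | grep <uid>` that nothing is still mounted below a directory before deleting it.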
Failed to get system container stats
After working through the errors in the log above one by one, I saw the following:
summary_sys_containers.go:47] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "system.slice/docker.service": unknown container "/system.slice/docker.service"
Once the nginx pod could be scheduled and started, I found that pods with a PVC still could not run on that node. After waiting a while, describe showed:
$ kubectl -n content-dev get pod content-754c9964bc-8dbxw -o wide
NAME                       READY   STATUS              RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
content-754c9964bc-8dbxw   0/1     ContainerCreating   0          63s   <none>   work02   <none>           <none>

$ kubectl -n content-dev describe pod content-754c9964bc-8dbxw
...
Events:
  Type     Reason       Age   From     Message
  ----     ------       ----  ----     -------
  Warning  FailedMount  8s    kubelet  Unable to attach or mount volumes: unmounted volumes=[nfs], unattached volumes=[nfs default-token-vdjpz]: timed out waiting for the condition
Check the PVC used by the pod:
$ kubectl get deploy content -n content-dev -o yaml
...
      volumes:
      - name: nfs
        persistentVolumeClaim:
          claimName: datanfs-pvc

$ kubectl -n content-dev get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
datanfs-pvc   Bound    pvc-9865b525-2cb5-4a2a-b7a7-036ca9f524cf   10Gi       RWO            nfs-client     39d
$ kubectl -n content-dev describe pod content-754c9964bc-8dbxw
Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  4m18s (x4 over 11m)   kubelet  Unable to attach or mount volumes: unmounted volumes=[nfs], unattached volumes=[nfs default-token-vdjpz]: timed out waiting for the condition
  Warning  FailedMount  2m                    kubelet  Unable to attach or mount volumes: unmounted volumes=[nfs], unattached volumes=[default-token-vdjpz nfs]: timed out waiting for the condition
  Warning  FailedMount  11s                   kubelet  MountVolume.SetUp failed for volume "pvc-9865b525-2cb5-4a2a-b7a7-036ca9f524cf" : mount failed: signal: terminated
    Mounting command: systemd-run
    Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c61fd1a0-cb8f-4cc9-84c9-122a7d24cde6/volumes/kubernetes.io~nfs/pvc-9865b525-2cb5-4a2a-b7a7-036ca9f524cf --scope -- mount -t nfs xxx.xx.xx.50:/volume3/cloudxxx-pre/content-dev-datanfs-pvc-pvc-9865b525-2cb5-4a2a-b7a7-036ca9f524cf /var/lib/kubelet/pods/c61fd1a0-cb8f-4cc9-84c9-122a7d24cde6/volumes/kubernetes.io~nfs/pvc-9865b525-2cb5-4a2a-b7a7-036ca9f524cf
    Output: Running scope as unit run-164051.scope.
  Normal   Pulled       10s                   kubelet  Container image "xxx.xx.xx.215/content/content:dev-247" already present on machine
  Normal   Created      10s                   kubelet  Created container spring-boot
  Normal   Started      10s                   kubelet  Started container spring-boot
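After that, I went through the other existing PVs and PVCs and added mountOptions to all of them. The mount processes that are stuck on each node also need to be cleaned up.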
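To find mount processes that are still hanging on a node, something like the following might help. It is only a sketch: `filter_nfs_mounts` is a helper name invented here, the match patterns assume the processes show up either as `mount -t nfs ...` (as in the kubelet event above) or as the `mount.nfs` helper, and the column layout assumes a procps-style `ps`.

```shell
#!/usr/bin/env bash
# Sketch: list NFS mount commands that are still running on this node.

# Read `ps -eo pid,stat,etime,args` style lines on stdin and keep only those
# whose command (4th column) is an NFS mount.
filter_nfs_mounts() {
  awk '
    # The mount.nfs / mount.nfs4 helper spawned by mount(8).
    $4 ~ /(^|\/)mount\.nfs4?$/ { print; next }
    # A plain "mount -t nfs ..." or "mount -t nfs4 ..." command.
    $4 ~ /(^|\/)mount$/ && / -t nfs4?( |$)/ { print }
  '
}

# Columns: PID, process state, elapsed time, full command line.
# A long elapsed time suggests the mount is hung rather than in progress.
ps -eo pid,stat,etime,args 2>/dev/null | filter_nfs_mounts
```

After reviewing the list, the stuck PIDs can be terminated with `kill <pid>`. Processes shown in state `D` (uninterruptible sleep) may not react to signals until the NFS server responds again.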