zhangguanzhang's Blog

Troubleshooting a deploy/rs/sts Mismatch

2020/03/26

This post is a write-up of a remote troubleshooting session for a member of our chat group, told as a timeline.
It started with a slow apiserver: from the user's point of view, kubectl requests were slow, e.g.:

E0326 07:49:37.586690       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response 
from https://10.99.174.208:443/apis/metrics.k8s.io/v1beta1: Get https://10.99.174.208:443/apis/metrics.k8s.io/v1beta1: net/http:
request canceled (Client.Timeout exceeded while awaiting headers)

His machines were VMs. He had set up a bridge on the host and cut the VMs off the network for five minutes; some time later Prometheus fired the following alerts:

KubeDaemonSetRolloutStuck (3 active)
KubeDeploymentReplicasMismatch (13 active)
KubeStatefulSetGenerationMismatch (1 active)
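For context, the KubeDeploymentReplicasMismatch alert in the standard kubernetes-mixin rules is essentially a comparison of two kube-state-metrics series; the exact expression varies by mixin version, so treat this as a sketch:

```promql
# Sketch of the kubernetes-mixin rule: fires when the deployment spec
# asks for more replicas than its status reports as available
kube_deployment_spec_replicas{job="kube-state-metrics"}
  !=
kube_deployment_status_replicas_available{job="kube-state-metrics"}
```

In other words, the alert means the deployment's status is reporting fewer available replicas than its spec requests.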

He noticed that the count shown by the deployment didn't match the actual pods: the pod was in fact running, but get deploy showed 0/1:

$ kubectl -n ai get deployments.apps wise-ai-kafka-program 
NAME READY UP-TO-DATE AVAILABLE AGE
wise-ai-kafka-program 0/1 1 0 118d

In the group I had only looked at the timeouts he posted earlier. When the pods behind the svc selected by the v1beta1.metrics.k8s.io apiservice are broken, kube-apiserver on non-latest cluster versions retries them aggressively, which slows down kube-apiserver requests across the board. So I asked him to delete that apiservice first and check again; he tried, it still didn't help, and I remoted in to take a look myself.
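For reference, the apiservice involved is the aggregation entry that metrics-server registers. A typical manifest looks like this (the stock metrics-server defaults; names here are the usual ones, not copied from his cluster):

```yaml
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  service:
    name: metrics-server     # the svc whose pods must answer, or the apiserver keeps retrying
    namespace: kube-system
  group: metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100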

Here is part of the get deploy -o yaml output:

[root@master-1 ~]# kubectl -n ai get deployments.apps wise-ai-kafka-program -o yaml
status:
  conditions:
  - lastTransitionTime: "2019-11-29T08:43:39Z"
    lastUpdateTime: "2020-03-24T07:57:52Z"
    message: ReplicaSet "wise-ai-kafka-program-749b95dd49" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-03-26T07:35:44Z"
    lastUpdateTime: "2020-03-26T07:35:44Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  observedGeneration: 12
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
[root@master-1 ~]# kubectl -n ai get deployments.apps wise-ai-kafka-program
NAME READY UP-TO-DATE AVAILABLE AGE
wise-ai-kafka-program 0/1 1 0 118d

Searching on the reason and message keywords turned up no similar resolved issue for this scenario, so I suspected an ep leader-election problem (see my earlier post on a kube-controller bug) and tried deleting the ep:
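Background for that suspicion: in these versions kube-controller-manager takes its leader lock by writing an annotation on this very Endpoints object, roughly like the following (all values here are illustrative):

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: kube-controller-manager
  namespace: kube-system
  annotations:
    control-plane.alpha.kubernetes.io/leader: >
      {"holderIdentity":"master-1_<random-suffix>","leaseDurationSeconds":15,
       "acquireTime":"...","renewTime":"...","leaderTransitions":3}
```

Deleting the object forces a fresh election, which is why it briefly comes back empty and then repopulates a few seconds later.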

[root@master-1 ~]# kubectl -n kube-system get ep kube-controller-manager -o yaml > temp.yml
[root@master-1 ~]# kubectl -n kube-system delete ep kube-controller-manager
endpoints "kube-controller-manager" deleted
[root@master-1 ~]# kubectl -n kube-system get ep kube-controller-manager
NAME ENDPOINTS AGE
kube-controller-manager <none> 5s
[root@master-1 ~]# kubectl -n kube-system get ep kube-controller-manager
NAME ENDPOINTS AGE
kube-controller-manager <none> 7s
[root@master-1 ~]# kubectl -n kube-system get ep kube-controller-manager
NAME ENDPOINTS AGE
kube-controller-manager 10.10.100.101:10252,10.10.100.102:10252,10.10.100.103:10252 21s
[root@master-1 ~]# kubectl -n ai get deploy wise-ai-kafka-program
NAME READY UP-TO-DATE AVAILABLE AGE
wise-ai-kafka-program 0/1 1 0 118d

Still no luck, and all the related rs information looked normal:

[root@master-1 ~]# kubectl -n ai get rs wise-ai-kafka-program-749b95dd49 
NAME DESIRED CURRENT READY AGE
wise-ai-kafka-program-749b95dd49 1 1 0 2d4h
[root@master-1 ~]# kubectl -n ai describe rs wise-ai-kafka-program-749b95dd49
Name: wise-ai-kafka-program-749b95dd49
Namespace: ai
...
...
Annotations: deployment.kubernetes.io/desired-replicas: 1
deployment.kubernetes.io/max-replicas: 2
deployment.kubernetes.io/revision: 12
Controlled By: Deployment/wise-ai-kafka-program
Replicas: 1 current / 1 desired
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
...
...
Events: <none>

The node events also showed nothing abnormal, so I wondered whether the kubelet hosting this pod had failed to sync its status. First locate the node:

[root@master-1 ~]# kubectl -n ai get po wise-ai-kafka-program-749b95dd49-wtw44 -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
wise-ai-kafka-program-749b95dd49-wtw44 1/1 Running 0 2d4h 10.244.3.218 node-1.test.tjiptv.net <none> <none>

After ssh'ing to that node, the kubelet logs did show errors:

$ systemctl status -l kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Fri 2019-11-29 11:36:32 CST; 3 months 27 days ago
Docs: https://kubernetes.io/docs/
Main PID: 1858 (kubelet)
Tasks: 55
Memory: 139.8M
CGroup: /system.slice/kubelet.service
└─1858 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.1 --cgroup-driver=systemd

Mar 26 20:17:18 node-1.test.tjiptv.net kubelet[1858]: I0326 20:17:18.538836 1858 operation_generator.go:831] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/2b66fdc2-a25d-4ed4-b9fa-fd8ebdd8f071-flannel-token-bknj5" (OuterVolumeSpecName: "flannel-token-bknj5") pod "2b66fdc2-a25d-4ed4-b9fa-fd8ebdd8f071" (UID: "2b66fdc2-a25d-4ed4-b9fa-fd8ebdd8f071"). InnerVolumeSpecName "flannel-token-bknj5". PluginName "kubernetes.io/secret", VolumeGidValue ""
Mar 26 20:17:18 node-1.test.tjiptv.net kubelet[1858]: I0326 20:17:18.620666 1858 reconciler.go:301] Volume detached for volume "flannel-cfg" (UniqueName: "kubernetes.io/configmap/2b66fdc2-a25d-4ed4-b9fa-fd8ebdd8f071-flannel-cfg") on node "node-1.test.tjiptv.net" DevicePath ""
Mar 26 20:17:18 node-1.test.tjiptv.net kubelet[1858]: I0326 20:17:18.620706 1858 reconciler.go:301] Volume detached for volume "cni" (UniqueName: "kubernetes.io/host-path/2b66fdc2-a25d-4ed4-b9fa-fd8ebdd8f071-cni") on node "node-1.test.tjiptv.net" DevicePath ""
Mar 26 20:17:18 node-1.test.tjiptv.net kubelet[1858]: I0326 20:17:18.620727 1858 reconciler.go:301] Volume detached for volume "flannel-token-bknj5" (UniqueName: "kubernetes.io/secret/2b66fdc2-a25d-4ed4-b9fa-fd8ebdd8f071-flannel-token-bknj5") on node "node-1.test.tjiptv.net" DevicePath ""
Mar 26 20:17:19 node-1.test.tjiptv.net kubelet[1858]: I0326 20:17:19.623551 1858 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "flannel-token-bknj5" (UniqueName: "kubernetes.io/secret/0b7f5c79-5d3b-44ff-b8c6-349240cef26d-flannel-token-bknj5") pod "kube-flannel-ds-amd64-wrtpj" (UID: "0b7f5c79-5d3b-44ff-b8c6-349240cef26d")
Mar 26 20:17:19 node-1.test.tjiptv.net kubelet[1858]: I0326 20:17:19.623612 1858 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "cni" (UniqueName: "kubernetes.io/host-path/0b7f5c79-5d3b-44ff-b8c6-349240cef26d-cni") pod "kube-flannel-ds-amd64-wrtpj" (UID: "0b7f5c79-5d3b-44ff-b8c6-349240cef26d")
Mar 26 20:17:19 node-1.test.tjiptv.net kubelet[1858]: I0326 20:17:19.623679 1858 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "run" (UniqueName: "kubernetes.io/host-path/0b7f5c79-5d3b-44ff-b8c6-349240cef26d-run") pod "kube-flannel-ds-amd64-wrtpj" (UID: "0b7f5c79-5d3b-44ff-b8c6-349240cef26d")
Mar 26 20:17:19 node-1.test.tjiptv.net kubelet[1858]: I0326 20:17:19.623715 1858 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "flannel-cfg" (UniqueName: "kubernetes.io/configmap/0b7f5c79-5d3b-44ff-b8c6-349240cef26d-flannel-cfg") pod "kube-flannel-ds-amd64-wrtpj" (UID: "0b7f5c79-5d3b-44ff-b8c6-349240cef26d")
Mar 26 20:17:20 node-1.test.tjiptv.net kubelet[1858]: W0326 20:17:20.577613 1858 pod_container_deletor.go:75] Container "ac8e3ba8f86e49de2b01eee8fe24a96bce813972915adc807f42801b5acd42b1" not found in pod's containers
Mar 26 20:17:49 node-1.test.tjiptv.net kubelet[1858]: E0326 20:17:49.115530 1858 fsHandler.go:118] failed to collect filesystem stats - rootDiskErr: could not stat "/var/lib/docker/overlay2/5cf336d040522f2e67a80035ba2fe2ee1075abae973c6a75bbe9103b62ed9370/diff" to get inode usage: stat /var/lib/docker/overlay2/5cf336d040522f2e67a80035ba2fe2ee1075abae973c6a75bbe9103b62ed9370/diff: no such file or directory, extraDiskErr: could not stat "/var/lib/docker/containers/56ea8d5a37cb276c7b1e429b3c06ce980588d48543f5c42d081d9442fadbdb4c" to get inode usage: stat /var/lib/docker/containers/56ea8d5a37cb276c7b1e429b3c06ce980588d48543f5c42d081d9442fadbdb4c: no such file or directory

Errors indeed. So trigger a resync and clean up the stale container records:

systemctl restart kubelet
docker container prune -f

Back on the master, check again:

[root@master-1 ~]# kubectl -n ai get deployments.apps wise-ai-kafka-program 
NAME READY UP-TO-DATE AVAILABLE AGE
wise-ai-kafka-program 0/1 1 0 118d
[root@master-1 ~]# kubectl -n ai get deployments.apps wise-ai-kafka-program
NAME READY UP-TO-DATE AVAILABLE AGE
wise-ai-kafka-program 1/1 1 1 118d

After that, the other alerts cleared as well: restarting kubelet forced it to rebuild its container state and report the pod's status to the apiserver again, which brought the deployment's AVAILABLE count back in line.
