
Troubleshooting a production scheduling outage caused by a kube-controller-manager bug

2019/03/22

Our production k8s environments generally run very old versions. Today a colleague happened to log on to a new node and found its root partition full. After cleaning it up, one pod still would not get created. Once the k8s processes that had died because of the full filesystem were brought back up, the kubelet log kept repeating errors along the lines of "cannot remove /var/lib/kubelet/xxxxconfig/key ... resource busy". Describing the corresponding rc showed no Events at all:

$ kubectl describe rc rabbit3rc
Name:           rabbit3rc
Namespace:      default
Selector:       app=rabbitmq-cluster,node=rabbit3
Labels:         app=rabbitmq-cluster
                node=rabbit3
Annotations:    <none>
Replicas:       1 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       app=rabbitmq-cluster
                node=rabbit3
  Containers:
   rabbit3:
    Image:      cloud-base/rabbitmq-3.6.5:E3103-PUB-20181015-RC1
    Ports:      4369/TCP, 5672/TCP, 15672/TCP, 25672/TCP
    Limits:
      cpu:      16
      memory:   8Gi
    Requests:
      cpu:      400m
      memory:   500Mi
    Liveness:   exec [health_check.sh] delay=600s timeout=10s period=15s #success=1 #failure=3
    Environment:
      RABBITMQ_DEFAULT_USER:    xxx
      RABBITMQ_DEFAULT_PASS:    xxx
      RABBITMQ_ERLANG_COOKIE:   xxx
    Mounts:
      /etc/localtime from time (rw)
      /var/lib/rabbitmq from rabbitmqvar (rw)
  Volumes:
   time:
    Type:  HostPath (bare host directory volume)
    Path:  /etc/localtime
   rabbitmqvar:
    Type:  HostPath (bare host directory volume)
    Path:  /opt/cloud/rabbitmq
Events:  <none>
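
Before digging into the node, a quick sanity check is whether any pod exists for the rc at all (a sketch; the label selector is taken from the describe output above, and -o wide also shows which node a pod landed on):

# an empty result means the controller never even created the pod,
# which points at the control plane rather than at the node
kubectl get pods -l app=rabbitmq-cluster,node=rabbit3 -o wide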

On the corresponding node there was no rabbitmq container, only pause containers, and those were in the Dead state; there were many of them and they could not be deleted. After restarting docker the Dead containers were gone, but once everything was pulled back up, neither the rabbitmq container nor its pause container ever appeared. That pointed at a scheduling problem.
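
For reference, this is a typical way to enumerate and force-remove such containers (a sketch; in this case only the docker restart actually cleared them):

# list containers stuck in the Dead state
docker ps -a --filter status=dead --format '{{.ID}} {{.Names}} {{.Status}}'

# force-remove them; this often fails with "device or resource busy"
# when something still holds a mount under /var/lib/docker
docker ps -aq --filter status=dead | xargs -r docker rm -f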
The HA setup on this old-version cluster has always been murky. After working out which machine was the leader by checking process liveness, I inspected the kube-controller-manager on it with systemctl and found the following:

Failed to update lock: Operation cannot be fulfilled on endpoints "kube-controller-manager": the object has been modified; please apply your changes to the latest version and try again

The controllers on the other nodes all considered this machine the leader, yet it kept reporting that it could not update the election lock. So I copied the start parameters out of the systemd unit and relaunched the binary by hand with the log level raised to 8.
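
Something along these lines (a sketch; the binary path is an assumption, and the real flags come out of the unit file):

# dump the unit file to get the exact command line
systemctl cat kube-controller-manager.service

# stop the unit, then rerun the same command line by hand with higher
# verbosity: --v=8 logs every API request/response, including the
# leader-election traffic against the apiserver
systemctl stop kube-controller-manager.service
/usr/local/bin/kube-controller-manager <flags copied from the unit> --v=8

The verbose log made the election traffic visible: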

I0322 20:26:36.966931   34218 round_trippers.go:395] PUT https://100.68.24.2:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager
I0322 20:26:36.966938 34218 round_trippers.go:402] Request Headers:
I0322 20:26:36.966944 34218 round_trippers.go:405] Accept: application/vnd.kubernetes.protobuf, */*
I0322 20:26:36.966951 34218 round_trippers.go:405] Content-Type: application/vnd.kubernetes.protobuf
I0322 20:26:36.966956 34218 round_trippers.go:405] User-Agent: kube-controller-manager/v1.6.7+095136c3078cc (linux/amd64) kubernetes/095136c/leader-election
I0322 20:26:36.967726 34218 round_trippers.go:420] Response Status: 409 Conflict in 0 milliseconds
I0322 20:26:36.967738 34218 round_trippers.go:423] Response Headers:
I0322 20:26:36.967744 34218 round_trippers.go:426] Content-Type: application/vnd.kubernetes.protobuf
I0322 20:26:36.967749 34218 round_trippers.go:426] Content-Length: 259
I0322 20:26:36.967754 34218 round_trippers.go:426] Date: Fri, 22 Mar 2019 12:26:36 GMT
I0322 20:26:36.967888 34218 request.go:988] Response Body:
00000000 6b 38 73 00 0a 0c 0a 02 76 31 12 06 53 74 61 74 |k8s.....v1..Stat|
00000010 75 73 12 ea 01 0a 04 0a 00 12 00 12 07 46 61 69 |us...........Fai|
00000020 6c 75 72 65 1a a1 01 4f 70 65 72 61 74 69 6f 6e |lure...Operation|
00000030 20 63 61 6e 6e 6f 74 20 62 65 20 66 75 6c 66 69 | cannot be fulfi|
00000040 6c 6c 65 64 20 6f 6e 20 65 6e 64 70 6f 69 6e 74 |lled on endpoint|
00000050 73 20 22 6b 75 62 65 2d 63 6f 6e 74 72 6f 6c 6c |s "kube-controll|
00000060 65 72 2d 6d 61 6e 61 67 65 72 22 3a 20 74 68 65 |er-manager": the|
00000070 20 6f 62 6a 65 63 74 20 68 61 73 20 62 65 65 6e | object has been|
00000080 20 6d 6f 64 69 66 69 65 64 3b 20 70 6c 65 61 73 | modified; pleas|
00000090 65 20 61 70 70 6c 79 20 79 6f 75 72 20 63 68 61 |e apply your cha|
000000a0 6e 67 65 73 20 74 6f 20 74 68 65 20 6c 61 74 65 |nges to the late|
000000b0 73 74 20 76 65 72 73 69 6f 6e 20 61 6e 64 20 74 |st version and t|
000000c0 72 79 20 61 67 61 69 6e 22 08 43 6f 6e 66 6c 69 |ry again".Confli|
000000d0 63 74 2a 28 0a 17 6b 75 62 65 2d 63 6f 6e 74 72 |ct*(..kube-contr|
000000e0 6f 6c 6c 65 72 2d 6d 61 6e 61 67 65 72 12 00 1a |oller-manager...|
000000f0 09 65 6e 64 70 6f 69 6e 74 73 28 00 30 99 03 1a |.endpoints(.0...|
00000100 00 22 00 |.".|
E0322 20:26:36.967960 34218 leaderelection.go:263] Failed to update lock: Operation cannot be fulfilled on endpoints "kube-controller-manager": the object has been modified; please apply your changes to the latest version and try again
I0322 20:26:36.967971 34218 leaderelection.go:185] failed to acquire lease kube-system/kube-controller-manager
^C
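
The 409 Conflict is the apiserver's optimistic-concurrency check: an update carries the resourceVersion the client last read, and if the stored object has a different one the write is rejected. The same error is easy to reproduce by hand (a sketch):

# save the current object, then make its resourceVersion stale
kubectl -n kube-system get ep kube-controller-manager -o json > /tmp/ep.json
# edit resourceVersion in /tmp/ep.json down to any older value, then:
kubectl replace -f /tmp/ep.json
# Error from server (Conflict): Operation cannot be fulfilled on
# endpoints "kube-controller-manager": the object has been modified; ...

Normally a conflict clears on the next cycle because the election loop re-reads the record before each update attempt; here it never cleared, which suggests the controller kept reading or submitting a stale version of the object.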

I went to the official code repository intending to read how the election logic acquires the lock, but could not make sense of it at all, so I looked for someone who had already studied the election logic. Searching by keyword turned up a foreign article, http://gogosatellite.blogspot.com/2017/07/how-to-setup-high-availability.html, and one line of log output in it pointed me in the right direction:

I0607 11:04:32.485502   17291 leaderelection.go:248] lock is held by kuberm and has not yet expired
I0607 11:04:32.485506 17291 leaderelection.go:185] failed to acquire lease kube-system/kube-controller-manager
I0607 11:04:36.263032 17291 round_trippers.go:417] GET http://172.16.155.165:8080/api/v1/namespaces/kube-system/endpoints/kube-controller-manager 200 OK in 1 milliseconds
I0607 11:04:36.263122 17291 leaderelection.go:248] lock is held by kuberm and has not yet expired
I0607 11:04:36.263125 17291 leaderelection.go:185] failed to acquire lease kube-system/kube-controller-manager

My guess was that every kube-controller-manager talks to the apiserver and competes for this ep to acquire the lock, and further checking confirmed that the endpoint is indeed what leader election runs against:

$ kube-controller-manager --help |& grep leader
...
--leader-elect-resource-lock endpoints The type of resource object that is used for locking during leader election. Supported options are endpoints (default) and `configmaps`. (default "endpoints")
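
The election state itself lives in the control-plane.alpha.kubernetes.io/leader annotation on that endpoint, and can be pulled out directly (a sketch; dots in the annotation key have to be escaped in jsonpath):

# print only the leader-election record stored in the annotation
kubectl -n kube-system get ep kube-controller-manager \
  -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' | jq .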

Requesting this ep with kubectl showed the following, and it never changed at all; the holderIdentity field inside the annotation names the node of the current leader:

$ kubectl get --raw /api/v1/namespaces/kube-system/endpoints/kube-controller-manager | jq .
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-controller-manager",
    "namespace": "kube-system",
    "selfLink": "/api/v1/namespaces/kube-system/endpoints/kube-controller-manager",
    "uid": "87e9ff0a-388b-11e9-949b-0cda411d3f00",
    "resourceVersion": "36217274",
    "creationTimestamp": "2019-02-24T23:25:54Z",
    "annotations": {
      "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"xxxxx{nodename}xxxxxx\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2019-02-24T23:25:54Z\",\"renewTime\":\"2019-03-17T11:20:08Z\",\"leaderTransitions\":0}"
    }
  },
  "subsets": []
}
$ kubectl get --raw /api/v1/namespaces/kube-system/endpoints/kube-controller-manager | jq .
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-controller-manager",
    "namespace": "kube-system",
    "selfLink": "/api/v1/namespaces/kube-system/endpoints/kube-controller-manager",
    "uid": "87e9ff0a-388b-11e9-949b-0cda411d3f00",
    "resourceVersion": "36217274",
    "creationTimestamp": "2019-02-24T23:25:54Z",
    "annotations": {
      "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"Xxxxxxx-S02\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2019-02-24T23:25:54Z\",\"renewTime\":\"2019-03-17T11:20:08Z\",\"leaderTransitions\":0}"
    }
  },
  "subsets": []
}

On a fresh cluster I had deployed myself, the resourceVersion changed from moment to moment, whereas on the production cluster above it never moved at all:

[root@k8s-m1 Kubernetes-ansible]# kubectl get --raw /api/v1/namespaces/kube-system/endpoints/kube-controller-manager | jq .
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-controller-manager",
    "namespace": "kube-system",
    "selfLink": "/api/v1/namespaces/kube-system/endpoints/kube-controller-manager",
    "uid": "0915773e-4c4d-11e9-a0b8-fa163e4edb6a",
    "resourceVersion": "52752",
    "creationTimestamp": "2019-03-22T02:48:56Z",
    "annotations": {
      "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"k8s-m1_00dbe494-4c4d-11e9-a89f-fa163ed10d54\",\"leaseDurationSeconds\":15,\"acquireTime\":\"...\",\"renewTime\":\"...\",\"leaderTransitions\":1}"
    }
  }
}
[root@k8s-m1 Kubernetes-ansible]# kubectl get --raw /api/v1/namespaces/kube-system/endpoints/kube-controller-manager | jq .
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-controller-manager",
    "namespace": "kube-system",
    "selfLink": "/api/v1/namespaces/kube-system/endpoints/kube-controller-manager",
    "uid": "0915773e-4c4d-11e9-a0b8-fa163e4edb6a",
    "resourceVersion": "52772",
    "creationTimestamp": "2019-03-22T02:48:56Z",
    "annotations": {
      "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"k8s-m1_00dbe494-4c4d-11e9-a89f-fa163ed10d54\",\"leaseDurationSeconds\":15,\"acquireTime\":\"...\",\"renewTime\":\"...\",\"leaderTransitions\":1}"
    }
  }
}
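
A quick poll makes the contrast between the two clusters obvious; a healthy leader renews its lease every couple of seconds, so the value keeps climbing (a sketch):

# poll the lock object's resourceVersion; a frozen value means the
# leader's renew writes are not landing
while true; do
  kubectl -n kube-system get ep kube-controller-manager \
    -o jsonpath='{.metadata.resourceVersion}'
  echo
  sleep 2
done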

To confirm that holderIdentity really identifies the leader, I tried stopping kube-controller-manager to see whether the field would change. It really did change, and the ep showed up again with a new holder (note the creationTimestamp: the object is not actually recreated, only the annotation record is rewritten):

[root@k8s-m1 Kubernetes-ansible]# systemctl stop kube-controller-manager.service
[root@k8s-m1 Kubernetes-ansible]# kubectl get --raw /api/v1/namespaces/kube-system/endpoints/kube-controller-manager | jq .
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-controller-manager",
    "namespace": "kube-system",
    "selfLink": "/api/v1/namespaces/kube-system/endpoints/kube-controller-manager",
    "uid": "0915773e-4c4d-11e9-a0b8-fa163e4edb6a",
    "resourceVersion": "52819",
    "creationTimestamp": "2019-03-22T02:48:56Z",
    "annotations": {
      "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"k8s-m1_00dbe494-4c4d-11e9-a89f-fa163ed10d54\",\"leaseDurationSeconds\":15,\"acquireTime\":\"...\",\"renewTime\":\"...\",\"leaderTransitions\":1}"
    }
  }
}
[root@k8s-m1 Kubernetes-ansible]# kubectl get --raw /api/v1/namespaces/kube-system/endpoints/kube-controller-manager | jq .
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-controller-manager",
    "namespace": "kube-system",
    "selfLink": "/api/v1/namespaces/kube-system/endpoints/kube-controller-manager",
    "uid": "0915773e-4c4d-11e9-a0b8-fa163e4edb6a",
    "resourceVersion": "52819",
    "creationTimestamp": "2019-03-22T02:48:56Z",
    "annotations": {
      "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"k8s-m1_00dbe494-4c4d-11e9-a89f-fa163ed10d54\",\"leaseDurationSeconds\":15,\"acquireTime\":\"...\",\"renewTime\":\"...\",\"leaderTransitions\":1}"
    }
  }
}

That pretty much settled it: for some reason (possibly the full filesystem, possibly something else), the lock record was stuck. So on production I deleted the ep to kick out the leader record that could not be updated. The same node acquired leadership again, but this time the resourceVersion started refreshing and the controller log no longer reported errors:

$ kubectl -n kube-system delete ep kube-controller-manager
endpoints "kube-controller-manager" deleted
$ kubectl get --raw /api/v1/namespaces/kube-system/endpoints/kube-controller-manager | jq .
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-controller-manager",
    "namespace": "kube-system",
    "selfLink": "/api/v1/namespaces/kube-system/endpoints/kube-controller-manager",
    "uid": "dec669dd-4c9f-11e9-949b-0cda411d3f00",
    "resourceVersion": "37542637",
    "creationTimestamp": "2019-03-22T12:41:53Z",
    "annotations": {
      "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"Xxxxxxx-S02\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2019-03-22T12:41:53Z\",\"renewTime\":\"2019-03-22T12:41:53Z\",\"leaderTransitions\":0}"
    }
  },
  "subsets": []
}
$ kubectl get --raw /api/v1/namespaces/kube-system/endpoints/kube-controller-manager | jq .
{
  "kind": "Endpoints",
  "apiVersion": "v1",
  "metadata": {
    "name": "kube-controller-manager",
    "namespace": "kube-system",
    "selfLink": "/api/v1/namespaces/kube-system/endpoints/kube-controller-manager",
    "uid": "dec669dd-4c9f-11e9-949b-0cda411d3f00",
    "resourceVersion": "37542785",
    "creationTimestamp": "2019-03-22T12:41:53Z",
    "annotations": {
      "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"Xxxxxxx-S02\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2019-03-22T12:41:53Z\",\"renewTime\":\"2019-03-22T12:41:59Z\",\"leaderTransitions\":0}"
    }
  },
  "subsets": []
}

Describing the rc again showed the corresponding pod was up:

$ kubectl describe rc rabbit3rc
Name:           rabbit3rc
Namespace:      default
Selector:       app=rabbitmq-cluster,node=rabbit3
Labels:         app=rabbitmq-cluster
                node=rabbit3
Annotations:    <none>
Replicas:       1 current / 1 desired
Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed
...