zhangguanzhang's Blog

Troubleshooting a node with broken networking in a flannel cluster

2021/08/25

The fault

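The problem is unrelated to versions; the customer's node details appear in the troubleshooting steps below. One node had communication problems, while all the other nodes were fine.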

Troubleshooting

Routine information

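First, look at the flannel vxlan VTEP information. The customer's nodes have two NICs, but the default route goes through the NIC shown here, so the other NIC can be ignored. In the output below, the VtepMAC and public-ip values all look normal.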

$ kubectl get node -o yaml | grep -B4 public
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"76:21:69:41:de:fe"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.51
--
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"b6:61:5c:8d:d9:eb"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.52
--
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"1e:8c:3e:12:fc:0f"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.53
--
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"ba:fe:64:36:6e:a1"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.54
--
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"8e:c1:4d:18:e5:d6"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.55
--
annotations:
flannel.alpha.coreos.com/backend-data: '{"VtepMAC":"fe:95:e6:bf:a0:62"}'
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: "true"
flannel.alpha.coreos.com/public-ip: 10.25.1.56
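
With more than a handful of nodes, the annotations are easier to check as one line per node. The following is a minimal sketch, not part of the original investigation: `vtep_table` is a hypothetical helper name, and it assumes the annotation lines look exactly like the output above.

```shell
# Hypothetical helper: read flannel node annotations on stdin and
# print "<public-ip> <VtepMAC>" for each node.
vtep_table() {
  awk '
    /backend-data:/ {
      # Pull the MAC address out of {"VtepMAC":"xx:xx:xx:xx:xx:xx"}
      if (match($0, /[0-9a-fA-F][0-9a-fA-F](:[0-9a-fA-F][0-9a-fA-F])+/))
        mac = substr($0, RSTART, RLENGTH)
    }
    # The public-ip annotation follows backend-data for the same node
    /public-ip:/ { print $NF, mac; mac = "" }
  '
}
```

It can be fed with `kubectl get node -o yaml | vtep_table`. Each MAC can then be compared with the `ether` address of `flannel.1` on the corresponding node.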

The coredns pod IPs and how the pods are distributed across nodes:

$ kubectl -n kube-system get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-5757945748-cr67w 1/1 Running 0 19h 172.27.2.7 10.25.1.56 <none> <none>
coredns-5757945748-krwfd 1/1 Running 0 19h 172.27.1.4 10.25.1.55 <none> <none>
coredns-5757945748-zf4zm 1/1 Running 0 19h 172.27.3.7 10.25.1.54 <none> <none>

Troubleshooting

Try curling the coredns metrics port. Only 10.25.1.51 cannot communicate with the other nodes, so on that node the curl below hangs.

curl  172.27.1.4:9153

On the target machine 10.25.1.55, capture our curl packets on the flannel.1 interface:

$ tcpdump -nn -i flannel.1 host 172.27.1.4 and port 9153 -vv
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:07:46.203094 IP (tos 0x0, ttl 64, id 56025, offset 0, flags [DF], proto TCP (6), length 60)
172.27.0.0.57888 > 172.27.1.4.9153: Flags [S], cksum 0x6804 (correct), seq 879302783, win 28200, options [mss 1410,sackOK,TS val 56279718 ecr 0,nop,wscale 7], length 0
10:07:46.203173 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.27.1.4.9153 > 172.27.0.0.57888: Flags [S.], cksum 0x5969 (incorrect -> 0x163b), seq 4197245653, ack 879302784, win 27960, options [mss 1410,sackOK,TS val 431774697 ecr 56279718,nop,wscale 7], length 0
10:07:47.204797 IP (tos 0x0, ttl 64, id 56026, offset 0, flags [DF], proto TCP (6), length 60)
172.27.0.0.57888 > 172.27.1.4.9153: Flags [S], cksum 0x641a (correct), seq 879302783, win 28200, options [mss 1410,sackOK,TS val 56280720 ecr 0,nop,wscale 7], length 0
10:07:47.204880 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.27.1.4.9153 > 172.27.0.0.57888: Flags [S.], cksum 0x5969 (incorrect -> 0x1251), seq 4197245653, ack 879302784, win 27960, options [mss 1410,sackOK,TS val 431775699 ecr 56279718,nop,wscale 7], length 0

It looks like a reply was sent (172.27.1.4.9153 > 172.27.0.0.57888). On 10.25.1.51, the machine running curl, lsof -nPi :57888 confirmed that the port belongs to the hung curl process. Capture on 10.25.1.51 at the same time:

$ tcpdump -nn -i flannel.1 host 172.27.1.4 and port 9153 -vv
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:08:57.241129 IP (tos 0x0, ttl 64, id 34444, offset 0, flags [DF], proto TCP (6), length 60)
172.27.0.0.57966 > 172.27.1.4.9153: Flags [S], cksum 0x5969 (incorrect -> 0x2fb2), seq 276913734, win 28200, options [mss 1410,sackOK,TS val 56350922 ecr 0,nop,wscale 7], length 0
10:08:58.242423 IP (tos 0x0, ttl 64, id 34445, offset 0, flags [DF], proto TCP (6), length 60)
172.27.0.0.57966 > 172.27.1.4.9153: Flags [S], cksum 0x5969 (incorrect -> 0x2bc8), seq 276913734, win 28200, options [mss 1410,sackOK,TS val 56351924 ecr 0,nop,wscale 7], length 0
10:09:00.246423 IP (tos 0x0, ttl 64, id 34446, offset 0, flags [DF], proto TCP (6), length 60)
172.27.0.0.57966 > 172.27.1.4.9153: Flags [S], cksum 0x5969 (incorrect -> 0x23f4), seq 276913734, win 28200, options [mss 1410,sackOK,TS val 56353928 ecr 0,nop,wscale 7], length 0
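
The pattern in this capture (the same SYN retransmitted with growing gaps, and no SYN-ACK) can be checked mechanically. The following is a rough sketch: `handshake_summary` is a hypothetical helper, and it assumes tcpdump's text output format as shown above.

```shell
# Hypothetical helper: count SYN and SYN-ACK segments in tcpdump
# text output read from stdin.
handshake_summary() {
  awk '
    /Flags \[S\],/   { syn++ }      # pure SYN: client opening a connection
    /Flags \[S\.\],/ { synack++ }   # SYN-ACK: the reply from the server
    END { printf "syn=%d synack=%d\n", syn, synack }
  '
}
```

Piping a saved capture through it, for example `handshake_summary < capture.txt` (the file name is only an illustration), gives a nonzero `syn` with `synack=0` for a capture like the one above.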

No reply packets arrived. Next, capture flannel's port 8475 on eth1 (we changed the flannel port in our configuration):

Capture on the target machine 10.25.1.55:

$ tcpdump -nn -i eth1 host 10.25.1.51 and port 8475 -vvv
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:09:40.966705 IP (tos 0x0, ttl 64, id 50110, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.42770 > 10.25.1.55.8475: [no cksum] UDP, length 82
10:09:40.966869 IP (tos 0x0, ttl 64, id 46192, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
10:09:41.968322 IP (tos 0x0, ttl 64, id 50327, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.42770 > 10.25.1.55.8475: [no cksum] UDP, length 82
10:09:41.968440 IP (tos 0x0, ttl 64, id 46957, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
10:09:43.099646 IP (tos 0x0, ttl 64, id 47316, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
10:09:43.972322 IP (tos 0x0, ttl 64, id 51119, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.42770 > 10.25.1.55.8475: [no cksum] UDP, length 82
10:09:43.972454 IP (tos 0x0, ttl 64, id 47934, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.55.48472 > 10.25.1.51.8475: [no cksum] UDP, length 82
^C

Capture on the source machine 10.25.1.51:

$ tcpdump -nn -i eth1 host 10.25.1.55 and port 8475 -vvv
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
10:10:21.702308 IP (tos 0x0, ttl 64, id 6079, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.59558 > 10.25.1.55.8475: [no cksum] UDP, length 82
10:10:22.702441 IP (tos 0x0, ttl 64, id 6117, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.59558 > 10.25.1.55.8475: [no cksum] UDP, length 82
10:10:24.706444 IP (tos 0x0, ttl 64, id 7699, offset 0, flags [none], proto UDP (17), length 110)
10.25.1.51.59558 > 10.25.1.55.8475: [no cksum] UDP, length 82
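
The two eth1 captures are easier to compare once they are reduced to a packet count per direction. The following is a sketch under stated assumptions: `flow_counts` is a hypothetical helper, and it expects verbose (`-v` or higher) tcpdump text, where the address pair sits on its own line.

```shell
# Hypothetical helper: count packets per "src_ip > dst_ip" pair in
# verbose tcpdump -nn text output read from stdin.
flow_counts() {
  awk '
    $2 == ">" {
      src = $1; dst = $3
      sub(/\.[0-9]+$/, "", src)    # strip the source port
      sub(/\.[0-9]+:$/, "", dst)   # strip the destination port and colon
      n[src " > " dst]++
    }
    END { for (k in n) print n[k], k }
  ' | LC_ALL=C sort -k2
}
```

Run against the capture taken on 10.25.1.55, it lists both directions; run against the capture taken on 10.25.1.51, it lists only `10.25.1.51 > 10.25.1.55`.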

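No packets from 10.25.1.55 arrive at 10.25.1.51 at all. The flannel interface counters show that it has never received a single packet: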

$ ifconfig flannel.1
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 172.27.0.0 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::7421:69ff:fe41:defe prefixlen 64 scopeid 0x20<link>
ether 76:21:69:41:de:fe txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 28900 bytes 2113052 (2.0 MiB)
TX errors 0 dropped 8 overruns 0 carrier 0 collisions 0
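
The same "sends but never receives" signature can be read from the kernel's per-interface counters under `/sys/class/net`. The following is a minimal sketch: `one_way_ifaces` is a hypothetical helper, and the optional directory argument exists only so the function can be tried against a fake tree.

```shell
# Hypothetical helper: print interfaces that have transmitted packets
# but never received any.
# $1: sysfs net directory (defaults to /sys/class/net).
one_way_ifaces() {
  base=${1:-/sys/class/net}
  for dir in "$base"/*; do
    [ -r "$dir/statistics/rx_packets" ] || continue
    rx=$(cat "$dir/statistics/rx_packets")
    tx=$(cat "$dir/statistics/tx_packets")
    # RX == 0 with TX > 0 means traffic is flowing one way only
    if [ "$rx" -eq 0 ] && [ "$tx" -gt 0 ]; then
      echo "${dir##*/} rx=$rx tx=$tx"
    fi
  done
}
```

On a node in the state shown above, calling `one_way_ifaces` with no arguments would be expected to list `flannel.1`.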

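This shows that packets sent by 10.25.1.55 never reach 10.25.1.51. After the customer opened an east-west security group rule allowing UDP 8475 across the whole 10.25.1.0/24 segment, everything returned to normal.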

$ curl  172.27.1.4:9153
^C
$ curl 172.27.1.4:9153
404 page not found