zhangguanzhang's Blog

The flannel.1 IP cannot be pinged on some nodes

2025/09/23

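Troubleshooting a case in a customer environment where the flannel.1 IP could not be pinged on a few nodes

Background

The customer reported an alert from the monitoring agent in their environment: the local IP 10.187.12.0 could not be pinged.

Troubleshooting

Narrowing down the scope

The customer environment cannot be accessed remotely, so every check was a command sent to the customer to run. None of the flannel containers had restarted: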

$ docker ps -a | grep flanneld
cb0e35b00899 1e0b2bff6efb "/opt/bin/flanneld -…" 7 weeks ago Up 7 weeks k8s_kube-flannel_kube-flannel-ds-qbps5_kube-system_bffc6d17-0835-468d-bb90-2367c190c94f_0

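I asked the customer to ping from the alerting machines. The customer said the cni0 address was reachable and only 10.187.12.0 and 10.187.11.0 could not be pinged. After some back-and-forth I realized that each of these two IPs could not be pinged from its own host, where ping fails immediately with an error: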

$ ping 10.187.12.0
Do you want to ping broadcast? Then -b. If not, check your local firewall rules.

The IP information for flannel.1 and cni0 also looked fine:

$ ip a s flannel.1
9: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 4e:81:e2:84:ff:49 brd ff:ff:ff:ff:ff:ff
inet 10.187.12.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::4c81:e2ff:fe84:ff49/64 scope link
valid_lft forever preferred_lft forever
$ ip a s cni0
7: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether 02:e6:56:fb:18:3d brd ff:ff:ff:ff:ff:ff
inet 10.187.12.1/24 brd 10.187.12.255 scope global cni0
valid_lft forever preferred_lft forever
inet6 fe80::e6:56ff:fefb:183d/64 scope link
valid_lft forever preferred_lft forever

The kernel parameter also looked normal:

$ cat /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts
1

Time to look at the source code. First, find which package ping comes from:

$ rpm -qf `which ping`
iputils-20190709-5.ky10.aarch64

The iputils source has no special handling logic around this message:

// https://github.com/iputils/iputils/blob/master/ping/ping.c#L885-L895

    sock_setmark(rts, probe_fd);

    dst.sin_port = htons(1025);
    if (rts->nroute)
        dst.sin_addr.s_addr = rts->route[0];
    if (connect(probe_fd, (struct sockaddr *)&dst, sizeof(dst)) == -1) {
        if (errno == EACCES) {
            if (rts->broadcast_pings == 0)
                error(2, 0,
                      _("Do you want to ping broadcast? Then -b. If not, check your local firewall rules"));
            fprintf(stderr, _("WARNING: pinging broadcast address\n"));

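ping leaves this entirely to the kernel: it reports the error only when connect() fails with EACCES, which is what happens when the kernel treats the destination as a broadcast address. So look at the routes: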

$ ip route show
default via xxx dev enp4s0
....
10.187.12.0/24 dev cni0 proto kernel scope link src 10.187.12.1

The actual route lookup, however, shows the problem:

$ ip route get 10.187.12.0
broadcast 10.187.12.0 dev cni0 src 10.187.12.1 uid 58248
cache <local,brd>
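
The first word of the ip route get answer is what decides the outcome here. As a rough sketch, a monitoring check could classify that answer before blaming the network. route_kind is a hypothetical helper name (not part of iproute2), and the sketch assumes the ip route get output is piped in on stdin:

```shell
# route_kind: classify the first line of `ip route get <addr>` output read from stdin.
# Hypothetical helper for illustration; it only inspects text and changes nothing.
route_kind() {
    awk '
        NR == 1 {
            # A leading keyword means the kernel resolved the address to a special route type.
            if ($1 ~ /^(broadcast|local|multicast|unreachable|blackhole|prohibit)$/)
                print $1
            else
                print "unicast"   # ordinary routes start with the destination itself
        }
    '
}
```

For the output shown above, ip route get 10.187.12.0 | route_kind would print broadcast, which matches the ping error.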

So the problem is in routing. ip route show is really ip route show table main, so look at the local routing table, which is generated from the interface addresses:

$ ip route show table local | grep 10.187.12.0
broadcast 10.187.12.0 dev cni0 proto kernel scope link src 10.187.12.1
local 10.187.12.0 dev flannel.1 proto kernel scope host src 10.187.12.0
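
The broadcast entry in this output comes from cni0: Linux adds a broadcast route for the network address of a subnet, and the network address of 10.187.12.1/24 is exactly the address assigned to flannel.1. A small sketch of that arithmetic follows; network_addr is a hypothetical helper name and it handles IPv4 only:

```shell
# network_addr CIDR: print the network address of an IPv4 CIDR such as 10.187.12.1/24.
# Hypothetical helper for illustration; uses plain POSIX shell arithmetic.
network_addr() {
    ip=${1%/*}          # part before the slash
    prefix=${1#*/}      # prefix length after the slash
    old_ifs=$IFS
    IFS=.
    set -- $ip          # split the dotted quad into $1..$4
    IFS=$old_ifs
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    if [ "$prefix" -eq 0 ]; then
        mask=0
    else
        mask=$(( (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF ))
    fi
    net=$(( addr & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))"
}

network_addr 10.187.12.1/24   # cni0      -> 10.187.12.0
network_addr 10.187.12.0/32   # flannel.1 -> 10.187.12.0
```

Both calls print 10.187.12.0, so the same destination ends up in the local table twice, once as broadcast and once as local.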

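As suspected, ordering is the cause. The local routing table is generated in interface order. A careful reader will have noticed that the interface index in front of flannel.1 is 9 while cni0 is 7, which means either cni0 was created before flannel.1, or flannel.1 was deleted and then recreated when the flanneld container restarted.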
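
The pattern described here, a broadcast entry listed ahead of a local entry for the same address, can be spotted with a small filter. This is only a sketch: shadowed_local_addrs is a hypothetical name, and it assumes the output of ip route show table local is supplied on stdin in the order the kernel lists it.

```shell
# shadowed_local_addrs: print every address whose local entry appears after a
# broadcast entry for the same destination in `ip route show table local` output.
# Hypothetical helper for illustration; it reads stdin and changes nothing.
shadowed_local_addrs() {
    awk '
        $1 == "broadcast" { seen[$2] = 1 }             # remember broadcast destinations
        $1 == "local" && ($2 in seen) { print $2 }     # local entry listed after one of them
    '
}
```

On an affected node, ip route show table local | shadowed_local_addrs would be expected to print the flannel.1 address; entries such as local 127.0.0.0/8 do not match because their destination carries a prefix length.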

Reproduction

I tested in an internal k8s environment and reproduced it:

$ ip a s flannel.1
6: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 4a:53:ae:34:23:99 brd ff:ff:ff:ff:ff:ff
inet 10.187.2.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::4853:aeff:fe34:2399/64 scope link
valid_lft forever preferred_lft forever
$ ping 10.187.2.0
PING 10.187.2.0 (10.187.2.0) 56(84) bytes of data.
64 bytes from 10.187.2.0: icmp_seq=1 ttl=64 time=0.042 ms
^C
--- 10.187.2.0 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.042/0.042/0.042/0.000 ms

$ ip link delete flannel.1
$ docker ps -a | grep flanneld
46079b620076 reg.xxx.lan:5000/xxx/flannel "/opt/bin/flanneld -…" 8 days ago Up 8 days
$ docker restart 460
$ ping 10.187.2.0
Do you want to ping broadcast? Then -b. If not, check your local firewall rules.

Fix

Adding a route with a /32 mask does not help:

$ ip route add 10.187.2.0/32 dev flannel.1
$ ping 10.187.2.0
Do you want to ping broadcast? Then -b. If not, check your local firewall rules.

That is because the local table is matched first:

$ ip route show table local | grep 10.187.2.0
broadcast 10.187.2.0 dev cni0 proto kernel scope link src 10.187.2.1
local 10.187.2.0 dev flannel.1 proto kernel scope host src 10.187.2.0

After deleting the broadcast entry, ping works:

$ ip route delete broadcast 10.187.2.0 dev cni0 proto kernel scope link src 10.187.2.1
$ ip route show
...
10.185.0.0/16 dev docker0 proto kernel scope link src 10.185.0.1
10.187.0.0/24 via 10.187.0.0 dev flannel.1 onlink
10.187.1.0/24 via 10.187.1.0 dev flannel.1 onlink
10.187.2.0/24 dev cni0 proto kernel scope link src 10.187.2.1
$ ip route show table local | grep 10.187.2.0
local 10.187.2.0 dev flannel.1 proto kernel scope host src 10.187.2.0

$ ping 10.187.2.0
PING 10.187.2.0 (10.187.2.0) 56(84) bytes of data.
64 bytes from 10.187.2.0: icmp_seq=1 ttl=64 time=0.074 ms
^C
--- 10.187.2.0 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.074/0.074/0.074/0.000 ms
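
If the same cleanup is needed on several nodes, the delete command could be derived from the local table instead of typed by hand. The sketch below only prints the commands so they can be reviewed first. print_broadcast_deletes is a hypothetical name, and it assumes the output of ip route show table local is supplied on stdin:

```shell
# print_broadcast_deletes: for every destination that has both a broadcast and a
# local entry in `ip route show table local` output, print (but do not run) the
# `ip route delete` command for the broadcast entry.
# Hypothetical helper for illustration; review the printed commands before running them.
print_broadcast_deletes() {
    awk '
        { line[NR] = $0; kind[NR] = $1; dst[NR] = $2 }
        $1 == "local" { is_local[$2] = 1 }
        END {
            for (i = 1; i <= NR; i++)
                if (kind[i] == "broadcast" && (dst[i] in is_local))
                    print "ip route delete " line[i]
        }
    '
}
```

Fed the two local table lines from the reproduction above, it prints the same ip route delete command that was run manually.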

A cross-node test also works fine:

$ ping 10.187.0.1
PING 10.187.0.1 (10.187.0.1) 56(84) bytes of data.
64 bytes from 10.187.0.1: icmp_seq=1 ttl=64 time=0.365 ms
64 bytes from 10.187.0.1: icmp_seq=2 ttl=64 time=0.325 ms
^C
--- 10.187.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.325/0.345/0.365/0.020 ms

Conclusion

This detail does not affect the k8s overlay network, but the customer's monitoring alert still needed a clear root cause.
