zhangguanzhang's Blog

Kylin kernel 4.19.90-52.49 breaks flannel VXLAN cross-node connectivity

2025/09/10

Troubleshooting a case where a Kylin kernel broke flannel cross-node networking

Background

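A customer needed to scale out K8S environment A and provided two machines. After they were added, a colleague found that flannel cross-node traffic did not work on the new nodes. Packet captures showed that ip -s a s flannel.1 on the new nodes reported an Rx packet count of 0.
The customer believed it was a K8S problem; we believed the customer's network was not allowing UDP through. Both sides agreed to build a clean environment B and deploy K8S on it, and cross-node traffic failed there as well.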
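The Rx counter that ip -s a s flannel.1 reports is also exposed under sysfs, which makes it easy to record before and after a test. A minimal sketch, assuming the standard Linux /sys/class/net/<iface>/statistics/ layout; the helper name read_rx_packets and its sysfs_root parameter are my own, not part of flannel:

```python
import os


def read_rx_packets(iface, sysfs_root="/sys/class/net"):
    """Return the kernel's received-packet counter for iface as an int."""
    path = os.path.join(sysfs_root, iface, "statistics", "rx_packets")
    with open(path) as f:
        return int(f.read().strip())
```

Calling read_rx_packets("flannel.1") before and after sending traffic from another node shows whether any encapsulated packets were actually delivered to the VXLAN device.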

Troubleshooting

Environment

$ cat /etc/os-release
NAME="Kylin Linux Advanced Server"
VERSION="V10 (Lance)"
ID="kylin"
VERSION_ID="V10"
PRETTY_NAME="Kylin Linux Advanced Server V10 (Lance)"
ANSI_COLOR="0;31"

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 14
On-line CPU(s) list: 0-13
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 7
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 15
Model: 6
Model name: Hygon C86-3G 7390 32-core Processor
Stepping: 3
CPU MHz: 2699.998
BogoMIPS: 5399.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 896 KiB
L1i cache: 896 KiB
L2 cache: 7 MiB
L3 cache: 112 MiB
NUMA node0 CPU(s): 0-13
Vulnerability Itlb multihit: Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm rep_good
nopl cpuid extd_apicid tsc_known_freq pni cx16 x2apic aes hypervisor cmp_legacy 3dnowprefetch vmmcall

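How I got involved

The colleague on site uninstalled K8S and tested UDP again with the same result, so he concluded that UDP was being restricted. The customer scanned with nmap and concluded that UDP was not restricted. Our own test, an nc server on one machine and an nc client on the other, could not get through, but the customer considered that test method unreliable, so I wrote a test script for my colleague: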

# -*- coding: utf-8 -*-
from __future__ import print_function
import socket
import sys
import logging


if sys.version_info[0] == 2:
    input = raw_input

def usage():
    print("Usage:")
    print(" Server: python udp-test.py <local-port>")
    print(" Client: python udp-test.py <host:port>")
    sys.exit(1)

def parse_addr(s):
    if ":" in s:
        host, port = s.rsplit(":", 1)
        return host, int(port)
    else:
        return None, int(s)

def server(port):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    logging.info("Server listening on UDP *:%d", port)
    while True:
        data, addr = sock.recvfrom(4096)
        msg = data.decode('utf-8', 'ignore')
        logging.info("Received from %s: %r", addr, msg)
        sock.sendto(data, addr)  # echo the datagram back to the sender

def client(host, port):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    target = (host, port)
    logging.info("Client target UDP %s:%d (type message and press Enter)", host, port)
    while True:
        try:
            text = input(">>> ")
        except (EOFError, KeyboardInterrupt):
            print("\nBye")
            break
        if not text:
            continue
        sock.sendto(text.encode('utf-8'), target)
        sock.settimeout(2)
        try:
            data, _ = sock.recvfrom(4096)
            logging.info("Server echoed: %r", data.decode('utf-8', 'ignore'))
        except socket.timeout:
            # WARNING so the message is visible at the INFO log level set in main()
            logging.warning("No echo within 2s")

def main():
    if len(sys.argv) != 2:
        usage()
    addr_str = sys.argv[1]
    host, port = parse_addr(addr_str)

    LOG_FMT = "%(asctime)s [%(levelname)s] %(message)s"
    logging.basicConfig(level=logging.INFO, format=LOG_FMT)
    if host is None:
        # bare port number -> server mode
        server(port)
    else:
        # host:port -> client mode
        client(host, port)


if __name__ == "__main__":
    main()

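The script is simple: on the client you type a message and press Enter, the server sends whatever it receives back to the client, and the client prints the echo. My colleague ran it in the customer's environment and it failed, yet the customer captured with tcpdump and said the packets had been received. I checked how the customer had captured, and the method was indeed sound:

  • On machine B, start the capture first: tcpdump -nn -i eth0 port 8472 -w xxx.pcap
  • On machine A: echo "123" | nc -u <machine_B_ip> 8472

I had assumed the customer was capturing on the sending machine. Once it was clear that the customer's reasoning was correct, I logged in remotely to take a look.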

Reproducing with a packet capture

The target machine really does capture the packet:

$ tcpdump  -nn -i any port 8472 -vvv
dropped privs to tcpdump
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
14:43:39.267522 IP (tos 0x0, ttl 64, id 56033, offset 0, flags [DF], proto UDP (17), length 35)
10.xx.50.166.50492 > 10.xx.50.169.8472: [udp sum ok] OTV, [|OTV]
^C
1 packet captured
3 packets received by filter
0 packets dropped by kernel

The capture above was taken while sending UDP with nc; the one below was taken while sending with the script:

$ tcpdump -nn -i any port 8472 -vvv
dropped privs to tcpdump
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
14:48:58.540224 IP (tos 0x0, ttl 64, id 48511, offset 0, flags [DF], proto UDP (17), length 31)
10.xx.50.166.46543 > 10.xx.50.169.8472: [bad udp cksum 0x7a0d -> 0x4acd!] OTV, [|OTV]
^C
1 packet captured
2 packets received by filter
0 packets dropped by kernel
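For context on what tcpdump is flagging: the bracketed note means the checksum carried in the UDP header (0x7a0d) differs from the value tcpdump recomputed from the packet (0x4acd). The sketch below shows that computation for UDP over IPv4 as described in RFC 768: a 16-bit one's-complement sum over a pseudo-header, the UDP header and the payload. The function names are my own, and because the addresses in the capture are masked, the values from the capture cannot be reproduced here.

```python
import socket
import struct


def ones_complement_sum(data):
    """16-bit one's-complement sum of data, zero-padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return total


def udp_checksum(src_ip, dst_ip, src_port, dst_port, payload):
    """UDP checksum over the IPv4 pseudo-header, UDP header and payload."""
    length = 8 + len(payload)  # UDP header is 8 bytes
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, length))  # 17 = protocol UDP
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    csum = ~ones_complement_sum(pseudo + header + payload) & 0xFFFF
    return csum or 0xFFFF  # a computed 0 is transmitted as 0xFFFF
```

A receiver validates a datagram by summing the same bytes with the transmitted checksum in place; a correct packet sums to 0xFFFF.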

Resolution

The capture comparison above shows bad udp cksum, so I tried turning off checksum offload on the NIC:

$ ethtool --show-offload eth0 | grep checksum
rx-checksumming: on [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
$ ethtool --offload eth0 tx-checksum-ip-generic off
Actual changes:
tx-checksumming: off
tx-checksum-ip-generic: off
tcp-segmentation-offload: off
tx-tcp-segmentation: off [requested on]
tx-tcp-ecn-segmentation: off [requested on]
tx-tcp6-segmentation: off [requested on]

Testing again still failed, and it kept failing after I turned off the other offloads as well:

ethtool --offload eth0 tx off rx off

The offload settings matched those of environment A, and the driver was the same too:

$ readlink -f /sys/class/net/eth0/device/driver/module
/sys/module/virtio_net
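Comparing offload settings between two machines by eye is error-prone, so a small diff helper can do it instead. This is only a sketch: it assumes the "name: on|off [fixed]" line format that ethtool --show-offload prints, and parse_offload and diff_offload are hypothetical helper names.

```python
def parse_offload(text):
    """Parse `ethtool --show-offload` output into {feature: (state, fixed)}."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        parts = value.split()
        if not sep or not parts or parts[0] not in ("on", "off"):
            continue  # skips headers such as "Features for eth0:"
        features[name.strip()] = (parts[0], "[fixed]" in parts)
    return features


def diff_offload(a, b):
    """Return {feature: (value_in_a, value_in_b)} for features that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Saving the ethtool output from a node in each environment and passing both through parse_offload, then diff_offload, gives an empty dict when the settings are identical.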

There were no iptables rules at all. The only remaining difference was the kernel version:

  • Working environment: Linux 4.19.90-52.22.v2207.ky10.x86_64 #1 SMP Tue Mar 14 12:19:10 CST 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Broken environment: Linux 4.19.90-52.49.v2207.ky10.x86_64 #3 SMP Thu Jul 24 02:43:35 CST 2025 x86_64 x86_64 x86_64 GNU/Linux
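When many nodes are involved, the kernel releases can be compared programmatically rather than by reading uname output. A rough sketch, assuming release strings shaped like the two above; treating the "52.22" part as two numeric build fields is my own reading of the format, and kernel_key is a hypothetical helper name.

```python
import re

# Matches the leading "4.19.90-52.22" part of a release string.
KERNEL_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)")


def kernel_key(release):
    """Turn '4.19.90-52.22.v2207.ky10.x86_64' into a sortable tuple of ints."""
    m = KERNEL_RE.match(release)
    if m is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(x) for x in m.groups())
```

On a node, platform.release() from the standard library returns the same string as uname -r, so kernel_key(platform.release()) can be compared against the key of the known-good release.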

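I asked the customer whether the machines in environment B could be replaced with VMs running the same kernel as the working environment. The customer replied that the machines are provisioned automatically by their platform, so the kernel version cannot be kept consistent. With no other option, I checked whether the kernel in this environment had been upgraded: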

$ grep kernel /var/log/dnf*
...
/var/log/dnf.rpm.log:2025-08-07T01:34:25Z SUBDEBUG Installed: kernel-4.19.90-52.49.v2207.ky10.x86_64
/var/log/dnf.rpm.log:2025-08-07T01:34:26Z SUBDEBUG Upgrade: kernel-tools-4.19.90-52.49.v2207.ky10.x86_64
/var/log/dnf.rpm.log:2025-08-07T01:34:29Z SUBDEBUG Upgraded: kernel-tools-4.19.90-52.45.v2207.ky10.x86_64
/var/log/dnf.rpm.log:2025-08-07T01:34:30Z SUBDEBUG Upgraded: kernel-headers-4.19.90-52.45.v2207.ky10.x86_64
/var/log/dnf.rpm.log:2025-08-07T01:34:34Z SUBDEBUG Upgraded: kernel-tools-libs-4.19.90-52.45.v2207.ky10.x86_64
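To pull the kernel-related transactions out of the log on several machines, the same filter can be written as a small parser. This is a sketch that assumes the "<timestamp> SUBDEBUG <action>: <package>" line shape seen in the output above; kernel_events is a hypothetical helper name.

```python
import re

# Timestamp, action (Installed/Upgrade/Upgraded/...), and a kernel* package.
LOG_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\S+) SUBDEBUG (\w+): *(kernel\S*)")


def kernel_events(lines):
    """Return (timestamp, action, package) tuples for kernel packages."""
    events = []
    for line in lines:
        # search() rather than match() so a grep "file:" prefix is tolerated
        m = LOG_RE.search(line)
        if m:
            events.append(m.groups())
    return events
```

Feeding it the lines of /var/log/dnf.rpm.log yields one tuple per kernel package transaction, which makes it easy to see when a new kernel was installed.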

It had indeed been upgraded. Next, check whether the old version is still installed:

$ rpm -qa | grep kernel
kernel-modules-extra-4.19.90-52.45.v2207.ky10.x86_64
kernel-modules-extra-4.19.90-52.22.v2207.ky10.x86_64
kernel-modules-4.19.90-52.49.v2207.ky10.x86_64
kernel-core-4.19.90-52.22.v2207.ky10.x86_64
kernel-tools-libs-4.19.90-52.49.v2207.ky10.x86_64
kernel-modules-extra-4.19.90-52.49.v2207.ky10.x86_64
kernel-4.19.90-52.22.v2207.ky10.x86_64
kernel-tools-4.19.90-52.49.v2207.ky10.x86_64
kernel-core-4.19.90-52.45.v2207.ky10.x86_64
kernel-modules-4.19.90-52.45.v2207.ky10.x86_64
kernel-4.19.90-52.45.v2207.ky10.x86_64
kernel-modules-4.19.90-52.22.v2207.ky10.x86_64
kernel-headers-4.19.90-52.49.v2207.ky10.x86_64
kernel-4.19.90-52.49.v2207.ky10.x86_64
kernel-4.19.90-52.49.v2207.ky10.x86_64

Check the order of the entries in grub:

$ awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg
0 : Kylin Linux Advanced Server (4.19.90-52.49.v2207.ky10.x86_64) V10 (Lance)
1 : Kylin Linux Advanced Server (4.19.90-52.45.v2207.ky10.x86_64) V10 (Lance)
2 : Kylin Linux Advanced Server (4.19.90-52.22.v2207.ky10.x86_64) V10 (Lance)
3 : Kylin Linux Advanced Server (0-rescue-de06076a688a45bf9d1acd0bf45bb93e) V10 (Lance)
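The awk one-liner above splits each line on single quotes and numbers the top-level menuentry titles. The same logic can be sketched in Python, which also allows looking up the index of a specific kernel instead of reading it off the list. The function names are my own, and like the awk version this only counts lines that begin exactly with "menuentry '", so indented entries inside submenus are ignored.

```python
def list_menu_entries(cfg_text):
    """Return the titles of top-level menuentry lines, in file order."""
    entries = []
    for line in cfg_text.splitlines():
        fields = line.split("'")
        if len(fields) >= 2 and fields[0] == "menuentry ":
            entries.append(fields[1])
    return entries


def index_of_kernel(entries, release):
    """Return the index of the entry whose title contains (release), or None."""
    needle = "(%s)" % release
    for i, title in enumerate(entries):
        if needle in title:
            return i
    return None
```

With the contents of /etc/grub2.cfg as cfg_text, index_of_kernel(list_menu_entries(cfg_text), "4.19.90-52.22.v2207.ky10.x86_64") gives the number to pass to grub2-set-default.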

Switch to 52.22:

$ grub2-set-default 2
$ grub2-mkconfig -o /etc/grub2.cfg

I asked the customer whether the machine could be rebooted. It could, and after the reboot the test passed:

$ uname -a
Linux TKVMJT0240 4.19.90-52.22.v2207.ky10.x86_64 #1 SMP Tue Mar 14 12:19:10 CST

# machine 1 (any machine)
$ python udp-test.py 8472


# machine 2
$ python udp-test.py <machine1_ip>:8472
25-09-10 17:29:59,623 [INFO] Client target UDP xxx:8472 (type message and press Enter)
>>> 123
25-09-10 17:30:00,753 [INFO] Server echoed: '123'
>>> ^C
Bye

The other machines were fixed the same way and all work normally.
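Since the same check has to be repeated on every machine, a non-interactive variant of the client may be handy. The following is a sketch rather than part of the original script: udp_echo_probe is a hypothetical helper that sends one datagram to a running udp-test.py server (or any UDP echo service) and reports whether the same bytes came back within the timeout.

```python
import socket


def udp_echo_probe(host, port, payload=b"ping", timeout=2.0):
    """Send one datagram; return True if the same bytes are echoed back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.settimeout(timeout)
        sock.sendto(payload, (host, port))
        data, _ = sock.recvfrom(4096)
        return data == payload
    except socket.timeout:
        return False  # nothing came back within the timeout
    finally:
        sock.close()
```

With the server from the script listening on port 8472, udp_echo_probe("<machine1_ip>", 8472) can be called from each of the other nodes in a loop or from a cron job.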
