zhangguanzhang's Blog

salt-run takes a long time to return

2025/09/12

Troubleshooting a salt-run that takes a long time to return

Background

Every salt-run command was taking a long time to finish:

$ time salt-run jobs.active
[INFO ] Runner completed: 20250912111420062465_14113

real 0m23.432s
user 0m2.716s
sys 0m0.320s

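Investigation

salt-master runs in a container that has no ptrace permission, and I was worried that restarting the container would make the fault disappear. So I ran the command inside the container and, while it hung, checked from the host that there was exactly one salt-run process. The host has strace installed, so I attached strace from there: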

$ strace -p `ps -ef | grep -E 'salt-ru[n]' | awk '{print $2}' `
...
...

openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 11
fstat(11, {st_mode=S_IFREG|0644, st_size=898, ...}) = 0
lseek(11, 0, SEEK_SET) = 0
read(11, "#\n# hosts This file desc"..., 4096) = 898
read(11, "", 4096) = 0
close(11) = 0
newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=1516, ...}, 0) = 0
newfstatat(AT_FDCWD, "/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=211, ...}, 0) = 0
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 11
fstat(11, {st_mode=S_IFREG|0644, st_size=898, ...}) = 0
lseek(11, 0, SEEK_SET) = 0
read(11, "#\n# hosts This file desc"..., 4096) = 898
read(11, "", 4096) = 0
close(11) = 0
socket(PF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 11
setsockopt(11, SOL_IP, IP_RECVERR, [1], 4) = 0
connect(11, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.xx.xx.1")}, 16) = 0
poll([{fd=11, events=POLLOUT}], 1, 0) = 1 ([{fd=11, revents=POLLOUT}])
sendto(11, "e\366\1\0\0\1\0\0\0\0\0\0\7kubexxx\0\0\34\0\1", 25, MSG_NOSIGNAL, NULL, 0) = 25
poll([{fd=11, events=POLLIN}], 1, 5000


...

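I ran the command in the left window and quickly attached strace in the right one. As soon as the output stalled, I pressed Enter a few times to insert blank lines as a marker. Scrolling back up afterwards, the lines just above the blank lines show where it hung. From the output above, this is glibc's DNS resolution behavior:

  1. It first checks the hosts line in /etc/nsswitch.conf to determine the order in which files (/etc/hosts) and dns are consulted
  2. connect(11, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.xx.xx.1")}, 16) = 0 sets up the UDP socket towards the DNS server
  3. sendto(11, "e\366\1\0\0\1\0\0\0\0\0\0\7kubexxx\0\0\34\0\1", 25, MSG_NOSIGNAL, NULL, 0) = 25 is the DNS query itself (the trailing \0\34 is query type 28, i.e. AAAA), and the following poll waits up to 5000 ms for a reply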
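The query type is not obvious from the raw bytes in the sendto call. As a rough sketch, the payload copied from the strace output can be decoded by hand following the standard DNS message layout (12-byte header, then the question). The function name parse_dns_query and the small QTYPES table are made up for this illustration and are not part of any library:

```python
import struct

# Only the two record types relevant here; anything else is shown as a number.
QTYPES = {1: "A", 28: "AAAA"}


def parse_dns_query(payload: bytes) -> dict:
    """Decode the header and first question of a DNS query (no name compression)."""
    # Header: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (six 16-bit big-endian fields)
    ident, flags, qdcount, _an, _ns, _ar = struct.unpack("!6H", payload[:12])

    # Question name: a sequence of length-prefixed labels ending with a zero byte
    pos = 12
    labels = []
    while payload[pos] != 0:
        length = payload[pos]
        labels.append(payload[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    pos += 1  # skip the terminating zero byte

    qtype, qclass = struct.unpack("!2H", payload[pos:pos + 4])
    return {
        "id": ident,
        "rd": bool(flags & 0x0100),  # recursion desired bit
        "questions": qdcount,
        "name": ".".join(labels),
        "qtype": QTYPES.get(qtype, str(qtype)),
        "qclass": qclass,
    }


# Bytes exactly as strace printed them (octal escapes are valid in Python bytes literals)
query = parse_dns_query(b"e\366\1\0\0\1\0\0\0\0\0\0\7kubexxx\0\0\34\0\1")
print(query)
```

For the payload above, the name decodes to kubexxx and the type to AAAA, so the process is asking for an IPv6 address of the local hostname.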

I then ran it again and checked the connections in another window. The DNS lookup was indeed there:

$ ss -anuop | grep :53
ESTAB 0 0 10.xx.xx.215:36250 10.xx.xx.1:53 users:(("salt-run",pid=13752,fd=4))

Then I captured the traffic to look at the queries:

# scroll to the right
$ tcpdump -nn -i any port 53 -vvv
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
19:34:24.952563 IP (tos 0x0, ttl 64, id 50084, offset 0, flags [DF], proto UDP (17), length 53)
10.xx.xx.215.63760 > 10.xx.xx.1.53: [bad udp cksum 0xc324 -> 0xd5d2!] 64743+ AAAA? kubexxx. (25)
19:34:26.613645 IP (tos 0x0, ttl 64, id 50437, offset 0, flags [DF], proto UDP (17), length 53)
10.xx.xx.215.35564 > 10.xx.xx.1.53: [bad udp cksum 0xc324 -> 0x3614!] 2763+ AAAA? kubexxx. (25)

It turned out to be a query for the IPv6 (AAAA) DNS record of the hostname. After adding an entry to /etc/hosts, the problem went away:

$  echo "::1 $HOSTNAME" >> /etc/hosts
$ time salt-run jobs.active
[INFO ] Runner completed: 20250912113512377461_16938

real 0m3.265s
user 0m2.714s
sys 0m0.254s
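The same delay can be checked without running salt-run at all. The following is a minimal sketch that times one resolver call per address family. It assumes the slow path goes through the system resolver (Python's socket.getaddrinfo calls the libc function), which matches the strace output but is not confirmed from Salt's source. The helper names time_lookup and compare_families are made up for this example:

```python
import socket
import time


def time_lookup(host, family):
    """Return (seconds, result) for one getaddrinfo call; result is a list or the error."""
    start = time.monotonic()
    try:
        result = socket.getaddrinfo(host, None, family)
    except socket.gaierror as exc:
        # A failed lookup is still interesting: what matters is how long it took
        result = exc
    return time.monotonic() - start, result


def compare_families(host):
    """Print how long the IPv4 and the IPv6 lookup of host take."""
    for name, family in (("A", socket.AF_INET), ("AAAA", socket.AF_INET6)):
        elapsed, result = time_lookup(host, family)
        status = "failed" if isinstance(result, Exception) else "ok"
        print(f"{name:4} {host}: {elapsed:.2f}s ({status})")
```

Calling compare_families(socket.gethostname()) before and after editing /etc/hosts should show whether the AAAA lookup is the one that waits on the DNS server.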

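Solution

I searched for the keywords salt-run dns ipv6 to see whether anyone else had run into this, and found similar issues:

However, the replies say it was mostly seen on old versions, while the versions on my side are all quite recent: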

$ salt --versions-report
Salt Version:
Salt: 3006.15

Python Version:
Python: 3.10.13 (main, Sep 8 2025, 06:32:11) [GCC 10.3.1]

Dependency Versions:
cffi: 1.14.5
cherrypy: Not Installed
cryptography: 44.0.1
dateutil: 2.8.2
docker-py: 6.1.3
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.0.1
libgit2: Not Installed
looseversion: 1.3.0
M2Crypto: Not Installed
Mako: 1.2.2
msgpack: 1.0.5
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 24.1
pycparser: 2.20
pycrypto: 3.19.1
pycryptodome: Not Installed
pygit2: Not Installed
python-gnupg: Not Installed
PyYAML: 6.0.1
PyZMQ: 27.0.2
relenv: Not Installed
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.5

System Versions:
dist: openeuler 22.03 LTS-SP4
locale: utf-8
machine: x86_64
release: 4.12.14-120-default
system: Linux
version: openEuler 22.03 LTS-SP4

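This is glibc behavior: salt resolves the hostname. With the code freeze approaching, there was no time left to change the startup script inside the Docker image, so I worked around it at the code level first. I looked around and Python's socket library has nothing that consults only the /etc/hosts entries, so I wrote the following logic by hand: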

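import fcntl
import socket


def ensure_hosts_entry(ip_address, hostname):
    """Append "ip_address hostname" to /etc/hosts unless hostname is already listed."""
    with open("/etc/hosts", "r+") as f:
        try:
            # Take an exclusive lock (blocking mode) so concurrent writers are safe
            fcntl.flock(f, fcntl.LOCK_EX)
            lines = f.readlines()

            for line in lines:
                # Drop comments, then split on any whitespace (spaces or tabs);
                # splitting on ' ' alone leaves "\n" glued to the last name
                fields = line.split("#", 1)[0].split()
                if len(fields) < 2:
                    continue
                # fields[0] is the address, the rest are names. Any existing
                # entry for the hostname counts, whatever its address family.
                if hostname in fields[1:]:
                    return

            # Make sure the new entry starts on its own line
            if lines and not lines[-1].endswith("\n"):
                lines[-1] += "\n"
            lines.append(f"{ip_address} {hostname}\n")
            f.seek(0)
            f.truncate()
            f.write("".join(lines))
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


ensure_hosts_entry("::1", socket.gethostname())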
...
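Since the standard socket module offers no lookup that reads only the hosts file, a read-only check can be sketched along the same lines as the function above. The names lookup_in_hosts and has_ipv6_entry and the hosts_path parameter are made up for this illustration, and the parsing is deliberately simplified (whitespace-separated fields, "#" comments, case-insensitive names):

```python
import ipaddress


def lookup_in_hosts(hostname, hosts_path="/etc/hosts"):
    """Return the addresses that hosts_path maps to hostname, in file order."""
    wanted = hostname.lower()
    addresses = []
    with open(hosts_path) as f:
        for line in f:
            # Strip comments, then split on spaces or tabs
            fields = line.split("#", 1)[0].split()
            if len(fields) < 2:
                continue
            # fields[0] is the address, the remaining fields are names and aliases
            if wanted in (name.lower() for name in fields[1:]):
                addresses.append(fields[0])
    return addresses


def has_ipv6_entry(hostname, hosts_path="/etc/hosts"):
    """True if hostname has at least one IPv6 address in hosts_path."""
    for address in lookup_in_hosts(hostname, hosts_path):
        try:
            if ipaddress.ip_address(address).version == 6:
                return True
        except ValueError:
            # Not a parseable IP address; ignore the malformed line
            continue
    return False
```

A check such as has_ipv6_entry(socket.gethostname()) could then decide whether the ::1 entry is still missing, which is the case that sent the AAAA query to the DNS server here.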