zhangguanzhang's Blog

ansible hang in docker container

字数统计: 2.1k阅读时长: 10 min
2020/11/23

由来

这几天同事发现在 docker 容器里运行 ansible 命令很卡,发来了个命令叫我试试 ansible localhost -m setup -a 'filter=ansible_default_ipv4' 2>/dev/null |grep '\"address\"' |awk -F'\"' '{print $4}'

环境信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
$ cat /etc/os-release 
NAME="Kylin Linux Advanced Server"
VERSION="V10 (Tercel)"
ID="kylin"
VERSION_ID="V10"
PRETTY_NAME="Kylin Linux Advanced Server V10 (Tercel)"
ANSI_COLOR="0;31"

$ uname -a
Linux xxxxx 4.19.90-17.ky10.aarch64 #1 SMP Sun Jun 28 14:27:40 CST 2020 aarch64 aarch64 aarch64 GNU/Linux

$ docker info
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 1
Server Version: 18.09.9
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.19.90-17.ky10.aarch64
Operating System: Kylin Linux Advanced Server V10 (Tercel)
OSType: linux
Architecture: aarch64
CPUs: 64
Total Memory: 62.76GiB
Name: xxx
ID: 3ZQD:MZWN:FNR4:5HEE:F57N:3BLD:EP3T:LJT7:NWEJ:3TZ3:IEBD:KSHZ
Docker Root Dir: /data/kube/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
treg.yun.xxx.cn
reg.xxx.lan:5000
127.0.0.0/8
Registry Mirrors:
https://registry.docker-cn.com/
https://docker.mirrors.ustc.edu.cn/
Live Restore Enabled: false
Product License: Community Engine

容器里的 ansible 和 python 信息

1
2
3
4
5
6
7
8
9
$ ansible --version
ansible 2.8.6
config file = None
configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
executable location = /usr/local/bin/ansible
python version = 2.7.12 (default, Apr 15 2020, 17:07:12) [GCC 5.4.0 20160609]
$ python --version
Python 2.7.12

他说如果用麒麟的 rpm 包安装 docker 就没问题,用我们的拷贝二进制文件安装的 docker 起的容器里就不行。一开始是怀疑 setup 模块在收集某些信息的时候阻塞了,后面我试了下这样也会卡住 ansible localhost -m shell -a date

排查过程

带上了 -vvvv 发现卡在下面的输出这:

1
2
3
<127.0.0.1> PUT /root/.ansible/tmp/ansible-local-466216tkG5t/tmpML7hBj TO /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py
<127.0.0.1> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/ /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py && sleep 0'
<127.0.0.1> EXEC /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py && sleep 0'

搜了下可以 export ANSIBLE_DEBUG=True 打印更详细的日志,打印了下面的日志:

1
2
3
4
5
6
7
23918 1606128393.94311: ANSIBALLZ: Done creating module
23918 1606128393.94465: _low_level_execute_command(): starting
23918 1606128393.94493: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python && sleep 0'
23918 1606128393.94515: in local.exec_command()
23918 1606128393.94528: opening command with Popen()
23918 1606128393.94979: done running command with Popen()
23918 1606128393.95005: getting output with communicate()

看样子是子进程卡住了,主进程等子进程。因为容器的进程实际上也是在宿主机上的,宿主机上安装了 strace,查看下进程:

1
2
3
4
5
6
7
8
9
$ ps aux | grep ansible
root 3020177 2.7 0.0 129408 48960 pts/0 Sl+ 19:12 0:26 /usr/bin/python /usr/local/bin/ansible localhost -vvvvv -m shell -a ls
root 3020189 0.0 0.0 134912 50368 pts/0 S+ 19:12 0:00 /usr/bin/python /usr/local/bin/ansible localhost -vvvvv -m shell -a ls
root 3020216 0.0 0.0 2368 768 pts/0 S+ 19:12 0:00 /bin/sh -c /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py && sleep 0'
root 3020217 0.0 0.0 2368 768 pts/0 S+ 19:12 0:00 /bin/sh -c /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py && sleep 0
root 3020218 0.0 0.0 19776 14336 pts/0 S+ 19:12 0:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py
root 3020219 9.3 0.0 18688 10816 pts/0 t+ 19:12 1:27 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py
root 3050341 0.3 0.0 8832 4544 ? Ss 19:28 0:00 ssh: /root/.ansible/cp/5a0929d121 [mux]
root 3052724 0.0 0.0 214080 1536 pts/7 S+ 19:28 0:00 grep ansible

strace 下 3020219 :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
$ strace -p 3020219
close(76267956) = -1 EBADF (错误的文件描述符)
close(76267957) = -1 EBADF (错误的文件描述符)
close(76267958) = -1 EBADF (错误的文件描述符)
close(76267959) = -1 EBADF (错误的文件描述符)
close(76267960) = -1 EBADF (错误的文件描述符)
close(76267961) = -1 EBADF (错误的文件描述符)
close(76267962) = -1 EBADF (错误的文件描述符)
close(76267963) = -1 EBADF (错误的文件描述符)
close(76267964) = -1 EBADF (错误的文件描述符)
close(76267965) = -1 EBADF (错误的文件描述符)
close(76267966) = -1 EBADF (错误的文件描述符)
close(76267967) = -1 EBADF (错误的文件描述符)
close(76267968) = -1 EBADF (错误的文件描述符)
close(76267969) = -1 EBADF (错误的文件描述符)
close(76267970) = -1 EBADF (错误的文件描述符)
close(76267971) = -1 EBADF (错误的文件描述符)
close(76267972^C) = -1 EBADF (错误的文件描述符)
strace: Process 3020219 detached

终端一直刷上面的,看样子是文件描述符泄露,错误的文件描述符 英文就是 Bad file descriptor, 谷歌搜了下 in "docker" (Bad file descriptor) strace,找到了 Spawning PTY processes is many times slower on Docker 18.09 里几位大佬排查到是某些 os 下,很多地方都有设置 limit(就是那种我以为你做了,我就没做。结果你以为我做了,你就没做,结果大家都没做的感觉了),导致容器的 nofile 会不固定,经常性的太高。而很多语言的模块会遍历所有描述符,导致会卡。如果启动容器 nofile 设置低则没问题,下面还有个大佬给了个链接在 python 层面修复这个问题 python/cpython#11584 ,还有下面的解决办法,直接设置默认的 limit:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

# https://docs.docker.com/engine/reference/commandline/dockerd/#daemon-configuration-file
$ cat /etc/docker/daemon.json-ulimits
{
"default-ulimits": {
"nofile": {
"Name": "nofile",
"Hard": 1024,
"Soft": 1024
},
"nproc": {
"Name": "nproc",
"Soft": 65536,
"Hard": 65536
}
}
}

改配置怕影响其他容器,就决定从 python 层面改,看了下提交的 pr python/cpython@5626fff 是改的 subprocess.py ,在容器里查找下:

1
2
3
4
5
6
find / -type f -name subprocess.py 
/usr/lib/python2.7/subprocess.py
/usr/lib/python3.5/asyncio/subprocess.py
/usr/lib/python3.5/subprocess.py
/usr/local/lib/python2.7/dist-packages/future/moves/subprocess.py
/usr/local/lib/python2.7/dist-packages/gevent/subprocess.py

对比了下函数名 def _close_fds(self, but): ,确定是 /usr/lib/python2.7/subprocess.py,把 pr 的内容加进去后再执行:

1
2
3
4
5
6
7
8
9
10
11
$ time ansible localhost -m shell -a date
[WARNING]: No inventory was parsed, only implicit localhost is available


localhost | CHANGED | rc=0 >>
2020年 11月 23日 星期一 19:38:14 CST


real 0m3.088s
user 0m2.860s
sys 0m0.255s

最终还是选择了起容器的时候限制

1
--ulimit nofile=65536

上面的几位大佬给出了其他的解决方案,也可以在 containerd 配置文件里配置把 nofile 固定住,或者 docker daemon,或者应用的软件层面修复。国产有些系统的 /etc/systemd/system.conf DefaultLimitNOFILE 设置很大很大的一个数字也会导致 ansible 卡住。

原理分析

我们尝试从更底层去分析问题产生的原因,看下有没有更加优雅的解决问题方式。

代码走读

我们看下python2.7 subprocess.py 中,关闭fd描述符的相关代码 /usr/lib/python2.7/subprocess.py:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
# MAXFD定义
try:
MAXFD = os.sysconf("SC_OPEN_MAX")
except:
MAXFD = 256

class os(object):
................
# 此处伪代码,仅说明os相关功能
# 遍历给定的fd范围,调用close去关闭
................
@classmethod
def closerange(fd_low, fd_high):
for fd in range(fd_low, fd_high):
try:
os.close(fd)
except OSError:
pass


class Popen(object):
.................
# 此处为subprocess popen相关的代码,在命令执行结果后,会执行_close_fds,用于回收fd
.................
def _close_fds(self, but):
if hasattr(os, 'closerange'):
# 默认会走此分支,会尝试闭关除but,0,1,2以外至maxfd的所有fd
os.closerange(3, but)
os.closerange(but + 1, MAXFD)
else:
for i in xrange(3, MAXFD):
if i == but:
continue
try:
os.close(i)
except:
pass

说明:

  1. MAXFD: 此变量是由 SC_OPEN_MAX 决定的,实际就是系统的 nofile 限制大小,可用 ulimit -n 来查看
  2. 进程描述符中,0 代表标准输入,1 代表标准输出,2 代表标准输出,所以 closerange 是从3开始,关闭除了子进程错误栈描述符的所有 fd

读懂了这一块的逻辑,那么在麒麟V10下出现这种情况,而其他环境没问题的原因,只有一个,那就是系统的nofile值设置的太大了,我们用命令来测试下:

1
2
3
4
5
6
7
8
9
10
11
12
$ ulimit -n 
1073741816

$ python -c "import os;print(os.sysconf('SC_OPEN_MAX'))"
1073741816

# 尝试比较不同closerange的性能差异
$ python -c "import time,os;ns=time.time();os.closerange(3,65535);print(time.time() - ns)"
0.0180819034576

$ python -c "import time,os;ns=time.time();os.closerange(3,6553500);print(time.time() - ns)"
2.00029587746

结论: 可以看到python2.7自带的subprocess,在高nofile情况下,关闭fd的操作会非常耗时,可能一个操作就要耗时几十分钟

参考

CATALOG
  1. 1. 由来
    1. 1.1. 环境信息
    2. 1.2. 排查过程
    3. 1.3. 原理分析
    4. 1.4. 代码走读
  2. 2. 参考