由来 这几天同事发现在 docker 容器里运行 ansible 命令很卡,发来了个命令叫我试试 ansible localhost -m setup -a 'filter=ansible_default_ipv4' 2>/dev/null |grep '\"address\"' |awk -F'\"' '{print $4}'
环境信息 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 $ cat /etc/os-release NAME="Kylin Linux Advanced Server" VERSION="V10 (Tercel)" ID="kylin" VERSION_ID="V10" PRETTY_NAME="Kylin Linux Advanced Server V10 (Tercel)" ANSI_COLOR="0;31" $ uname -aLinux xxxxx 4.19.90-17.ky10.aarch64 #1 SMP Sun Jun 28 14:27:40 CST 2020 aarch64 aarch64 aarch64 GNU/Linux $ docker info Containers: 1 Running: 1 Paused: 0 Stopped: 0 Images: 1 Server Version: 18.09.9 Storage Driver: overlay2 Backing Filesystem: xfs Supports d_type: true Native Overlay Diff: true Logging Driver: json-file Cgroup Driver: cgroupfs Plugins: Volume: local Network: bridge host macvlan null overlay Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog Swarm: inactive Runtimes: runc Default Runtime: runc Init Binary: docker-init containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f init version: fec3683 Security Options: seccomp Profile: default Kernel Version: 4.19.90-17.ky10.aarch64 Operating System: Kylin Linux Advanced Server V10 (Tercel) OSType: linux Architecture: aarch64 CPUs: 64 Total Memory: 62.76GiB Name: xxx ID: 3ZQD:MZWN:FNR4:5HEE:F57N:3BLD:EP3T:LJT7:NWEJ:3TZ3:IEBD:KSHZ Docker Root Dir: /data/kube/docker Debug Mode (client): false Debug Mode (server): false Registry: https://index.docker.io/v1/ Labels: Experimental: false Insecure Registries: treg.yun.xxx.cn reg.xxx.lan:5000 127.0.0.0/8 Registry Mirrors: https://registry.docker-cn.com/ https://docker.mirrors.ustc.edu.cn/ Live Restore Enabled: false Product License: Community Engine
容器里的 ansible 和 python 信息
1 2 3 4 5 6 7 8 9 $ ansible --version ansible 2.8.6 config file = None configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules'] ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible executable location = /usr/local/bin/ansible python version = 2.7.12 (default, Apr 15 2020, 17:07:12) [GCC 5.4.0 20160609] $ python --version Python 2.7.12
他说如果用麒麟的 rpm 包安装 docker 就没问题,用我们的拷贝二进制文件安装的 docker 起的容器里就不行。一开始是怀疑 setup 模块在收集某些信息的时候阻塞了,后面我试了下这样也会卡住 ansible localhost -m shell -a date
。
排查过程 带上了 -vvvv
发现卡在下面的输出这:
1 2 3 <127.0.0.1> PUT /root/.ansible/tmp/ansible-local-466216tkG5t/tmpML7hBj TO /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py <127.0.0.1> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/ /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py && sleep 0' <127.0.0.1> EXEC /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606180831.93-276734338965603/AnsiballZ_command.py && sleep 0'
搜了下可以 export ANSIBLE_DEBUG=True
打印更详细的日志,打印了下面的日志:
1 2 3 4 5 6 7 23918 1606128393.94311: ANSIBALLZ: Done creating module 23918 1606128393.94465: _low_level_execute_command(): starting 23918 1606128393.94493: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python && sleep 0' 23918 1606128393.94515: in local.exec_command() 23918 1606128393.94528: opening command with Popen() 23918 1606128393.94979: done running command with Popen() 23918 1606128393.95005: getting output with communicate()
看样子是子进程卡住了,主进程等子进程。因为容器的进程实际上也是在宿主机上的,宿主机上安装了 strace,查看下进程:
1 2 3 4 5 6 7 8 9 $ ps aux | grep ansible root 3020177 2.7 0.0 129408 48960 pts/0 Sl+ 19:12 0:26 /usr/bin/python /usr/local/bin/ansible localhost -vvvvv -m shell -a ls root 3020189 0.0 0.0 134912 50368 pts/0 S+ 19:12 0:00 /usr/bin/python /usr/local/bin/ansible localhost -vvvvv -m shell -a ls root 3020216 0.0 0.0 2368 768 pts/0 S+ 19:12 0:00 /bin/sh -c /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py && sleep 0' root 3020217 0.0 0.0 2368 768 pts/0 S+ 19:12 0:00 /bin/sh -c /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py && sleep 0 root 3020218 0.0 0.0 19776 14336 pts/0 S+ 19:12 0:00 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py root 3020219 9.3 0.0 18688 10816 pts/0 t+ 19:12 1:27 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1606129970.28-271461867131881/AnsiballZ_command.py root 3050341 0.3 0.0 8832 4544 ? Ss 19:28 0:00 ssh: /root/.ansible/cp/5a0929d121 [mux] root 3052724 0.0 0.0 214080 1536 pts/7 S+ 19:28 0:00 grep ansible
strace 下 3020219 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 $ strace -p 3020219 close(76267956) = -1 EBADF (错误的文件描述符) close(76267957) = -1 EBADF (错误的文件描述符) close(76267958) = -1 EBADF (错误的文件描述符) close(76267959) = -1 EBADF (错误的文件描述符) close(76267960) = -1 EBADF (错误的文件描述符) close(76267961) = -1 EBADF (错误的文件描述符) close(76267962) = -1 EBADF (错误的文件描述符) close(76267963) = -1 EBADF (错误的文件描述符) close(76267964) = -1 EBADF (错误的文件描述符) close(76267965) = -1 EBADF (错误的文件描述符) close(76267966) = -1 EBADF (错误的文件描述符) close(76267967) = -1 EBADF (错误的文件描述符) close(76267968) = -1 EBADF (错误的文件描述符) close(76267969) = -1 EBADF (错误的文件描述符) close(76267970) = -1 EBADF (错误的文件描述符) close(76267971) = -1 EBADF (错误的文件描述符) close(76267972^C) = -1 EBADF (错误的文件描述符) strace: Process 3020219 detached
终端一直刷上面的,看样子是文件描述符泄露,错误的文件描述符
英文就是 Bad file descriptor
, 谷歌搜了下 in "docker" (Bad file descriptor) strace
,找到了 Spawning PTY processes is many times slower on Docker 18.09 里几位大佬排查到是某些 os 下,很多地方都有设置 limit(就是那种我以为你做了,我就没做。结果你以为我做了,你就没做,结果大家都没做的感觉了),导致容器的 nofile 会不固定,经常性的太高。而很多语言的模块会遍历所有描述符,导致会卡。如果启动容器 nofile 设置低则没问题,下面还有个大佬给了个链接在 python 层面修复这个问题 python/cpython#11584 ,还有下面的解决办法,直接设置默认的 limit:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 # https://docs.docker.com/engine/reference/commandline/dockerd/ $ cat /etc/docker/daemon.json-ulimits{ "default-ulimits": { "nofile": { "Name": "nofile", "Hard": 1024, "Soft": 1024 }, "nproc": { "Name": "nproc", "Soft": 65536, "Hard": 65536 } } }
改配置怕影响其他容器,就决定从 python 层面改,看了下提交的 pr python/cpython@5626fff 是改的 subprocess.py
,在容器里查找下:
1 2 3 4 5 6 find / -type f -name subprocess.py /usr/lib/python2.7/subprocess.py /usr/lib/python3.5/asyncio/subprocess.py /usr/lib/python3.5/subprocess.py /usr/local/lib/python2.7/dist-packages/future/moves/subprocess.py /usr/local/lib/python2.7/dist-packages/gevent/subprocess.py
对比了下函数名 def _close_fds(self, but):
,确定是 /usr/lib/python2.7/subprocess.py
,把 pr 的内容加进去后再执行:
1 2 3 4 5 6 7 8 9 10 11 $ time ansible localhost -m shell -a date [WARNING]: No inventory was parsed, only implicit localhost is available localhost | CHANGED | rc=0 >> 2020年 11月 23日 星期一 19:38:14 CST real 0m3.088s user 0m2.860s sys 0m0.255s
最终还是选择了起容器的时候限制
上面的几位大佬给出了其他的解决方案,也可以在 containerd 配置文件里配置把 nofile 固定住,或者 docker daemon,或者应用的软件层面修复。国产有些系统的 /etc/systemd/system.conf
DefaultLimitNOFILE
设置很大很大的一个数字也会导致 ansible 卡住。
原理分析 我们尝试从更底层去分析问题产生的原因,看下有没有更加优雅的解决问题方式。
代码走读 我们看下python2.7 subprocess.py
中,关闭fd描述符的相关代码 /usr/lib/python2.7/subprocess.py
:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 try : MAXFD = os.sysconf("SC_OPEN_MAX" ) except : MAXFD = 256 class os (object ): ................ ................ @classmethod def closerange (fd_low, fd_high ): for fd in range (fd_low, fd_high): try : os.close(fd) except OSError: pass class Popen (object ): ................. ................. def _close_fds (self, but ): if hasattr (os, 'closerange' ): os.closerange(3 , but) os.closerange(but + 1 , MAXFD) else : for i in xrange(3 , MAXFD): if i == but: continue try : os.close(i) except : pass
说明:
MAXFD
: 此变量是由 SC_OPEN_MAX
决定的,实际就是系统的 nofile 限制大小,可用 ulimit -n
来查看
进程描述符中,0 代表标准输入,1 代表标准输出,2 代表标准输出,所以 closerange 是从3开始,关闭除了子进程错误栈描述符的所有 fd
读懂了这一块的逻辑,那么在麒麟V10下出现这种情况,而其他环境没问题的原因,只有一个,那就是系统的nofile
值设置的太大了,我们用命令来测试下:
1 2 3 4 5 6 7 8 9 10 11 12 $ ulimit -n 1073741816 $ python -c "import os;print(os.sysconf('SC_OPEN_MAX'))" 1073741816 # 尝试比较不同closerange的性能差异 $ python -c "import time,os;ns=time.time();os.closerange(3,65535);print(time.time() - ns)" 0.0180819034576 $ python -c "import time,os;ns=time.time();os.closerange(3,6553500);print(time.time() - ns)" 2.00029587746
结论: 可以看到python2.7自带的subprocess,在高nofile情况下,关闭fd的操作会非常耗时,可能一个操作就要耗时几十分钟
参考