zhangguanzhang's Blog

docker 18.09.3 daemon broken layer: an inelegant fix

2022/02/10

A record of dealing with corrupted, unrepairable layers in a docker 18.09.3 daemon's storage. The fix is not elegant, but I found no better solution, so I am writing it down for reference.

Environment

After the machine rebooted, some pods failed to start.

$ docker info
Containers: 51
 Running: 27
 Paused: 0
 Stopped: 24
Images: 23
Server Version: 18.09.3
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e
runc version: 6635b4f0c6af3810594d2770f662f34ddc15b40d
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-693.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 15.51GiB
Name: hdzwvm000006238.novalocal
ID: AUFF:32CM:54KK:FA2F:M3GS:EI77:2VSQ:HH3T:2LXM:7AFG:WXAQ:IKSV
Docker Root Dir: /data/kube/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:

Troubleshooting

Initial investigation confirmed that some images were corrupted. Take the one below: `docker history --no-trunc` showed its rootfs is ubuntu, yet running it failed with:

$ docker run --rm -ti --entrypoint bash xxx.cn/base/xxxxxx-amd64:v2
standard_init_linux.go:207: exec user process caused "no such file or directory"

We had hit similar cases before, and a `docker rmi` followed by `docker load` always fixed them. This time, loading again by hand after `rmi` did not help, even though the md5sum of the offline image file matched the one shipped in the package.

$ md5sum ./images/xxxxxx-amd64-v2#release_zzzzzzz 
cd1cf11ac90d6df59a31460cb1624933 ./images/xxxxxx-amd64-v2#release_zzzzzzz

$ docker rmi xxx.cn/base/xxxxxx-amd64:v2
Untagged: xxx.cn/base/xxxxxx-amd64:v2
Deleted: sha256:fe7c32d1138c5215dba9fbfa4f675eff47f1a30605d9914fff34a5db00ad45f0
$ docker load -i xxxxxx-amd64-v2#release_zzzzzzz
Loaded image: xxx.cn/base/xxxxxx-amd64:v2
$ docker run --rm -ti --entrypoint bash xxx.cn/base/xxxxxx-amd64:v2
standard_init_linux.go:207: exec user process caused "no such file or directory"
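Checking the offline tarball's checksum before loading, as above, rules out a corrupt file early. A minimal sketch of that step (`verify_tarball` is a hypothetical helper name; the expected md5 would come from whatever manifest ships with your image package):

```shell
#!/bin/sh
# Hypothetical helper: compare an offline image tarball's md5 against a
# recorded value, so a corrupt file can be ruled out before docker load.
verify_tarball() {
  file="$1"
  expected="$2"
  actual=$(md5sum "$file" | awk '{print $1}')   # md5sum prints "<hash>  <file>"
  [ "$actual" = "$expected" ]
}

# Usage sketch (paths from this incident):
# verify_tarball './images/xxxxxx-amd64-v2#release_zzzzzzz' cd1cf11ac90d6df59a31460cb1624933 \
#   && docker load -i './images/xxxxxx-amd64-v2#release_zzzzzzz'
```

Here the checksum matched, which is exactly what pointed suspicion away from the file and toward the daemon's layer store.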

Digging further, we found the sangfor security (EDR) agent installed, and confirmed the machine had indeed been rebooted.

$ ps aux | grep san
root 1183 0.0 0.0 113184 1492 ? S Feb09 0:03 /bin/bash /sangfor/edr/agent/bin/eps_services_ctrl
root 5132 0.0 0.0 113436 1696 ? S Feb09 0:17 /bin/bash /sangfor/edr/agent/bin/abs_monitor
root 5164 0.0 0.0 48092 3392 ? S Feb09 0:04 /sangfor/edr/agent/bin/abs_deployer
root 5205 0.0 0.0 43036 1552 ? Ss Feb09 0:07 /sangfor/edr/agent/bin/edr_monitor
root 5378 0.0 0.0 194948 6260 ? Sl Feb09 0:04 /sangfor/edr/agent/bin/sfupdatemgr -p edr_monitor
root 5379 0.0 0.0 43360 3560 ? S Feb09 0:01 /sangfor/edr/agent/bin/ipc_proxy
root 5380 0.6 0.1 708028 29892 ? Sl Feb09 6:56 /sangfor/edr/agent/bin/edr_agent
root 5381 0.1 0.0 17060 1332 ? S< Feb09 1:58 /sangfor/edr/agent/bin/cpulimit --limit=50 --exe=edr_agent
root 5382 0.0 0.0 113568 1900 ? S Feb09 0:28 /bin/bash /sangfor/edr/agent/bin/asset_collection_cpulimit.sh
root 5383 0.0 0.0 128944 5444 ? Sl Feb09 0:27 /sangfor/edr/agent/bin/edr_sec_plan
root 5384 0.0 0.0 117656 8956 ? S Feb09 0:00 /sangfor/edr/agent/bin/lloader /sangfor/edr/agent/bin/../lmodules/isolate_area_tool.lua
root 5385 0.0 0.0 68916 3928 ? S Feb09 0:01 /sangfor/edr/agent/bin/lloader /sangfor/edr/agent/bin/../lmodules/isolate_area_main.lua
root 22594 0.0 0.0 112712 976 pts/2 S+ 11:37 0:00 grep --color=auto san
$ uptime -s
2022-02-09 17:19:09
You have new mail in /var/spool/mail/root
$ tail -n 40 /var/spool/mail/root
...
edr pid 5205
ls: cannot access /sangfor/edr/agent/bin/../packages/: No such file or directory

$ ll /etc/cron.d
total 12
-rw-r--r--. 1 root root 128 Aug 3 2017 0hourly
-rw-r--r-- 1 root root 60 Dec 10 2020 edr_agent
-rw-------. 1 root root 235 Apr 1 2020 sysstat
You have new mail in /var/spool/mail/root
$ cat edr_agent
* * * * * root /sangfor/edr/agent/bin/eps_services_check.sh
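Note that the cron entry above re-runs `eps_services_check.sh` every minute, which respawns the agent, so the watchdog has to go before the processes do. A sketch of that order of operations (`disable_watchdog` is a made-up helper; it only removes the cron file shown above):

```shell
#!/bin/sh
# Hypothetical: drop the EDR watchdog cron entry first, otherwise the
# agent processes respawn within a minute of being killed.
disable_watchdog() {
  cron_dir="$1"                 # normally /etc/cron.d
  rm -f "$cron_dir/edr_agent"
}

# Usage sketch:
# disable_watchdog /etc/cron.d
# pkill -f /sangfor/edr/agent/bin/
```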

The customer uninstalled it, but the problem persisted. A `docker save` then surfaced the real issue:

$ docker save -o test.tar xxx.cn/base/xxxxxx-amd64:v2 
Error response from daemon: open /data/kube/docker/overlay2/920a06a6d4eb64db0898234cd3a81b01115d6fcc2cfc50c5107e0205f7230318/diff/lib/x86_64-linux-gnu/ld-2.23.so: no such file or directory
$ docker inspect xxx.cn/base/xxxxxx-amd64:v2 | grep 920a0
"LowerDir": ...:/data/kube/docker/overlay2/920a06a6d4eb64db0898234cd3a81b01115d6fcc2cfc50c5107e0205f7230318/diff",

$ ls -l /data/kube/docker/overlay2/920a06a6d4eb64db0898234cd3a81b01115d6fcc2cfc50c5107e0205f7230318/diff/lib/x86_64-linux-gnu/ | head
total 10684
lrwxrwxrwx 1 root root 10 Feb 6 2019 ld-linux-x86-64.so.2 -> ld-2.23.so
lrwxrwxrwx 1 root root 15 Feb 7 2016 libacl.so.1 -> libacl.so.1.1.0
-rw-r--r-- 1 root root 31232 Feb 7 2016 libacl.so.1.1.0
-rw-r--r-- 1 root root 14992 Feb 6 2019 libanl-2.23.so
lrwxrwxrwx 1 root root 14 Feb 6 2019 libanl.so.1 -> libanl-2.23.so
lrwxrwxrwx 1 root root 20 May 29 2019 libapparmor.so.1 -> libapparmor.so.1.4.0
-rw-r--r-- 1 root root 64144 May 29 2019 libapparmor.so.1.4.0
lrwxrwxrwx 1 root root 16 Sep 9 2014 libattr.so.1 -> libattr.so.1.1.0
-rw-r--r-- 1 root root 18624 Sep 9 2014 libattr.so.1.1.0

Loading the same offline image file on another machine showed that this layer does contain ld-2.23.so:

$ ll b5f1b3d6665a476b9460532568499f2923c1621d710f6a1e20cf7f3e1a928e17/diff/lib/x86_64-linux-gnu/
total 10844
-rwxr-xr-x 1 root root 162632 Feb 6 2019 ld-2.23.so
lrwxrwxrwx 1 root root 10 Feb 6 2019 ld-linux-x86-64.so.2 -> ld-2.23.so
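On the broken machine the regular file ld-2.23.so is gone while the symlink ld-linux-x86-64.so.2 still points at it, so dangling symlinks are one cheap signal of this kind of layer damage. A sketch for scanning a layer's diff directory, assuming GNU find (`check_layer` is a made-up name):

```shell
#!/bin/sh
# Hypothetical check: print dangling symlinks inside an overlay2 layer's
# diff directory. A symlink whose target was deleted (as ld-2.23.so was
# here) shows up in the output.
check_layer() {
  dir="$1"
  # GNU find: -xtype l matches symlinks whose target does not exist
  find "$dir" -xtype l
}

# Usage sketch against the layer from the docker save error:
# check_layer /data/kube/docker/overlay2/920a06a6d4eb64db0898234cd3a81b01115d6fcc2cfc50c5107e0205f7230318/diff
```

This only catches missing symlink targets; files that vanish without a symlink pointing at them need a content comparison against a known-good copy of the layer.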

Finally, testing locally confirmed that when a layer in the daemon's store is corrupted, `rmi` followed by `load` does not re-extract it. Loading a genuinely new image prints per-layer progress, like this:

$ docker load -i netshoot#latest 
b2d5eeeaba3a: Loading layer [==================================================>] 5.88MB/5.88MB
681ff9ab4914: Loading layer [==================================================>] 301.4MB/301.4MB
0e91662a9cb3: Loading layer [==================================================>] 8.683MB/8.683MB
fdcdfe126cc0: Loading layer [==================================================>] 13.63MB/13.63MB
270c883ade5e: Loading layer [==================================================>] 45.31MB/45.31MB
06e19b7687c5: Loading layer [==================================================>] 14.54MB/14.54MB
def3433d213c: Loading layer [==================================================>] 4.566MB/4.566MB
5b6adb9801a8: Loading layer [==================================================>] 869.9kB/869.9kB
765e2d110fbc: Loading layer [==================================================>] 1.831MB/1.831MB
eead121d6964: Loading layer [==================================================>] 7.168kB/7.168kB
400127227d7a: Loading layer [==================================================>] 3.072kB/3.072kB
2b4f749a4a39: Loading layer [==================================================>] 6.571MB/6.571MB
Loaded image: netshoot:latest
$ docker load -i netshoot#latest
Loaded image: netshoot:latest

When the image's layers already exist, only a single `Loaded image` line is printed, and looking back, our earlier load after `rmi` indeed showed no layer progress. I skimmed the code but could not immediately tell how it decides a layer already exists, so for now we resolved it by deleting the overlay2 directory.
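The brute-force workaround can be sketched roughly as below. This is destructive and assumes every needed image exists as an offline tarball; the image/ metadata directory presumably has to go together with overlay2, since it holds references to the deleted layers (`purge_layer_store` is a hypothetical helper; the root path is this machine's Docker Root Dir):

```shell
#!/bin/sh
# Hypothetical, DESTRUCTIVE helper: wipe a docker root's layer store and
# image metadata so every image can be re-loaded from scratch. Assumes the
# daemon is stopped and all images exist as offline tarballs.
purge_layer_store() {
  root="$1"                     # e.g. /data/kube/docker (this machine's Docker Root Dir)
  rm -rf "$root/overlay2" "$root/image"
}

# Usage sketch:
# systemctl stop docker
# purge_layer_store /data/kube/docker
# systemctl start docker
# for f in ./images/*; do docker load -i "$f"; done
```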

