zhangguanzhang's Blog

Fixing docker's read unix @->/run/containerd/s/xxx read: connection reset by peer...

2021/09/16

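Background

To test how powering machines off affects the cluster, I shut down a few of them. Afterwards many pods were stuck in CrashLoopBackOff or RunContainerError, or never became ready.

Environment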

[root@CentOS76 ~]# docker info
Client:
Debug Mode: false

Server:
Containers: 404
Running: 258
Paused: 0
Stopped: 146
Images: 110
Server Version: 19.03.14
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: ea765aba0d05254012b0b9e595e995c09186427f
runc version: v1.0.0-0-g84113eef6fc2
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-1160.36.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 62.76GiB
Name: CentOS76
ID: BJ2X:EX7H:SCME:Q3AD:IP2M:IB2D:E4RL:XA4C:EOMQ:7S3F:DIA6:WQ2C
Docker Root Dir: /data/kube/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
reg.xxx.lan:5000
treg.yun.xxx.cn
127.0.0.0/8
Registry Mirrors:
https://registry.docker-cn.com/
https://docker.mirrors.ustc.edu.cn/
Live Restore Enabled: false
Product License: Community Engine

Troubleshooting

The logs showed the following:

RunContainerError: failed to start container "90353b19ae6c7209ba1785286c292f2362fa069b578f2e2731e93747c5ba1912": Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/90353b19ae6c7209ba1785286c292f2362fa069b578f2e2731e93747c5ba1912/log.json: no such file or directory): runc did not terminate sucessfully: unknown

There were also these log entries:

runc did not terminate sucessfully: runtime/cgo: pthread_create failed: Resource temporarily unavailable

container 9853a196008b92033a299e098d73d4268a76ce58faecfe40ca3411857d44a776: unknown error after kill: fork/exec /data/kube/bin/runc: resource temporarily unavailable: : unknown"

This looked like a resource limit being hit. I checked and the default kernel.pid_max was too small:

$ sysctl -n kernel.pid_max
32768
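
To judge whether kernel.pid_max is really the bottleneck, it helps to compare it with the number of threads currently alive on the host. A minimal sketch; the helper name pid_usage is made up for this example, and the commented usage assumes the procps version of ps:

```shell
# pid_usage CURRENT MAX: print what percentage of MAX is in use (integer, rounded down).
pid_usage() {
  echo $(( $1 * 100 / $2 ))
}

# Usage on a live host (left as a comment so the block only defines the helper):
#   threads=$(ps -eL --no-headers | wc -l)
#   pid_usage "$threads" "$(sysctl -n kernel.pid_max)"
```

A value close to 100 would point at pid_max; a low value suggests the limit being hit lives somewhere else.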

After that I adjusted the following parameters, one after another:

cat > /etc/security/limits.d/21-custom.conf<<EOF
* soft nproc 131072
* hard nproc 131072
* soft nofile 131072
* hard nofile 131072
root soft nproc 131072
root hard nproc 131072
root soft nofile 131072
root hard nofile 131072
EOF

sed -ri 's/^#(DefaultLimitCORE)=/\1=100000/' /etc/systemd/system.conf
sed -ri 's/^#(DefaultLimitNOFILE)=/\1=100000/' /etc/systemd/system.conf
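
One thing worth checking after edits like these: files under /etc/security/limits.d are applied by pam_limits to login sessions, whereas a service started by systemd takes its limits from its unit file and /etc/systemd/system.conf. Reading /proc/<pid>/limits shows what a running daemon actually received. A minimal sketch; the helper name filter_limits is made up for this example:

```shell
# filter_limits: read a /proc/<pid>/limits file on stdin and keep only the
# process-count and open-file rows.
filter_limits() {
  grep -E '^Max (processes|open files)'
}

# Usage on a live host (left as a comment so the block only defines the helper):
#   filter_limits < "/proc/$(pgrep -o dockerd)/limits"
```

If the rows printed for dockerd do not match what was written to limits.d, that is expected: the service never went through a PAM session.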

After a reboot the pods still had not recovered. Starting a container that was stuck in the Created state produced the following error:

[root@CentOS76 ~]# docker start 034f
Error response from daemon: read unix @->/run/containerd/s/2ac09cf054eb19b79336b25efe1aeeaf22bcf0d9559ca79b8459c3490cd6034f: read: connection reset by peer: unknown
Error: failed to start containers: 034f

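Running a container by hand failed with the error below; after the parameter changes, the error above was the more common one.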

$ docker run --rm nginx:1.19-alpine
docker: Error response from daemon: failed to start shim: fork/exec /usr/bin/containerd-shim: resource temporarily unavailable: unknown.

Following the call path, read unix @->/run/containerd/s points to a problem on the containerd side. The source code shows that if no containerd is running, docker launches one itself as a child process:

$ ps aux | grep '\scontainerd\s'
root 147580 2.4 0.1 10375568 104588 ? Ssl 17:06 3:15 containerd --config /var/run/docker/containerd/containerd.toml --log-level warn

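Our docker was installed from the official static binaries. With an rpm install the two are split apart: there is a separate containerd rpm with its own containerd.service. I wanted to read the containerd logs in our environment, but according to the source the child process's output is bound to docker's output, and its command-line arguments are fixed, so the log level cannot be switched to debug.

So I tried killing it and starting it by hand: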

kill -9 147580 && containerd --config /var/run/docker/containerd/containerd.toml --log-level debug

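In a second ssh session I saw that all pods had returned to normal. That suggested the docker started by systemd was running under some limit. I looked through dockerd's /proc directory and similar places, but it was nowhere near the open-file limit or anything like it: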

[root@CentOS76 ~]# pgrep dockerd
113233
[root@CentOS76 ~]# lsof -p 113233 | wc -l
956

Eventually I found the culprit: the Tasks: 2043 (limit: 2048) line below.

[root@CentOS76 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since 四 2021-09-16 16:53:21 CST; 4min 16s ago
Docs: http://docs.docker.io
Process: 113228 ExecStopPost=/bin/sh -c /sbin/iptables --wait -D INPUT -i cni0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
Process: 113225 ExecStopPost=/bin/sh -c /sbin/iptables --wait -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || : (code=exited, status=0/SUCCESS)
Process: 113236 ExecStartPost=/sbin/iptables --wait -I INPUT -i cni0 -j ACCEPT (code=exited, status=0/SUCCESS)
Process: 113234 ExecStartPost=/sbin/iptables --wait -I FORWARD -s 0.0.0.0/0 -j ACCEPT (code=exited, status=0/SUCCESS)
Process: 113231 ExecStartPre=/bin/bash -c test -d /var/run/docker.sock && rmdir /var/run/docker.sock || true (code=exited, status=0/SUCCESS)
Main PID: 113233 (dockerd)
Tasks: 2043 (limit: 2048)
Memory: 1.1G
CGroup: /system.slice/docker.service
├─ 89710 containerd-shim -namespace
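
The Tasks figure counts the threads in the unit's cgroup, and the containerd-shim processes listed under CGroup live there too, so a host with a few hundred running containers can plausibly use up 2048 tasks. To watch the remaining headroom, the line can be parsed from the status output. A minimal sketch; the helper name tasks_headroom is made up for this example:

```shell
# tasks_headroom: read `systemctl status` output on stdin and print
# (limit - current) for the "Tasks: N (limit: M)" line, if there is one.
tasks_headroom() {
  sed -n 's/^ *Tasks: \([0-9][0-9]*\) (limit: \([0-9][0-9]*\)).*/\1 \2/p' |
    while read -r current limit; do
      echo $(( limit - current ))
    done
}

# Usage on a live host (left as a comment so the block only defines the helper):
#   systemctl status docker | tasks_headroom
```

On systemd builds that expose them, systemctl show docker -p TasksCurrent -p TasksMax should report the same two numbers without any parsing; whether those properties are available depends on the systemd version.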

systemd's DefaultTasksMax here is 2048. I also compared against the official docker.service, which places no limit on tasks; our unit file was missing that setting:

$ systemctl cat docker
..
ExecReload=/bin/kill -s HUP $MAINPID
Restart=on-failure
RestartSec=5
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
Delegate=yes
KillMode=process

After adding it and restarting docker, everything was fine:

$ vi /etc/systemd/system/docker.service
# add this line to the [Service] section
TasksMax=infinity

$ systemctl daemon-reload && systemctl restart docker
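
An alternative to editing the unit file in place is a systemd drop-in, which survives the unit file being replaced later. This is not what I did above, only a sketch of the same fix. The helper name write_tasksmax_dropin and the file name 10-tasksmax.conf are made up for this example, and the optional second argument exists only so the helper can be tried against a scratch directory:

```shell
# write_tasksmax_dropin UNIT [ROOT]: create a drop-in that lifts the task limit for UNIT.
# ROOT is an optional prefix (empty by default, i.e. the real /etc).
write_tasksmax_dropin() {
  unit=$1
  root=${2:-}
  dir="$root/etc/systemd/system/$unit.d"
  mkdir -p "$dir" &&
    printf '[Service]\nTasksMax=infinity\n' > "$dir/10-tasksmax.conf"
}

# Usage as root (left as a comment so the block only defines the helper):
#   write_tasksmax_dropin docker.service &&
#     systemctl daemon-reload && systemctl restart docker
```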
