zhangguanzhang's Blog

Container network problems caused by veths not cleaned up on old Docker versions

2022/09/07

While helping a colleague debug a gitlab-runner problem, the root cause turned out to be a bug in an old Docker version: veth interfaces were not cleaned up when containers exited.

Background

My colleague's gitlab-ci.yml looked roughly like this:

include:
  - project: "xxx/ci-template"
    file: "/backend/common_mini.yml"

test:
  services:
    - name: minio/minio
      command: ["server","/data"]
      alias: minio
    - name: mysql:5.7.17
  variables:
    FILTER_COVER_PACKAGES: "grep -E 'impl'"
    MYSQL_DATABASE: "docmini"
    MINIO_UPODATE: "off"

The job runs on the Docker executor. He reported that the build container could not reach port 9000 on minio. Taking a quick look, I found that the service alias is actually implemented with docker run's `--link`, i.e., a record pointing at the service container's IP is added to the build container's hosts file; the official documentation for the `services` keyword describes it the same way.
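For context, `--link` boils down to a generated `/etc/hosts` line, and resolving the alias is just a hosts-file lookup. A minimal sketch using the entry this job would produce (the IP and names are taken from the `docker ps`/`inspect` output later in this post; the exact line format is illustrative):

```shell
# The --link mechanism: the build container's /etc/hosts gains a line
# mapping the alias to the service container's IP (format illustrative).
hosts_entry="172.25.0.2	minio de4647deead4 runner-s4f3kt7-project-15927-concurrent-0-eed4eb59d5dfa521-minio__minio-0"
# "Resolving" the alias is a plain lookup of the first field:
alias_ip=$(echo "$hosts_entry" | awk '$2 == "minio" {print $1}')
echo "$alias_ip"   # -> 172.25.0.2
```

This is why the alias only works between containers linked by the runner; it is not DNS and nothing is registered on the host.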

Troubleshooting

Symptoms

During the build, the build container 7a9ff9c8ee95 could not reach port 9000 on minio; even hitting the container IP directly, without the alias, failed:

$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7a9ff9c8ee95 83daaac121e6 "sh -c 'if [ -x /usr…" 41 seconds ago Up 40 seconds runner-s4f3kt7-project-15927-concurrent-0-eed4eb59d5dfa521-build-2
883faff17a2d e59a4655709b "/usr/bin/dumb-init …" 42 seconds ago Exited (0) 41 seconds ago runner-s4f3kt7-project-15927-concurrent-0-eed4eb59d5dfa521-predefined-1
2662fd58cad5 e59a4655709b "/usr/bin/dumb-init …" 43 seconds ago Exited (0) 42 seconds ago runner-s4f3kt7-project-15927-concurrent-0-eed4eb59d5dfa521-predefined-0
139e1a01b50f 9546ca122d3a "docker-entrypoint.s…" About a minute ago Up About a minute 3306/tcp runner-s4f3kt7-project-15927-concurrent-0-eed4eb59d5dfa521-mysql-1
de4647deead4 c15374551d3a "/usr/bin/docker-ent…" About a minute ago Up About a minute 9000/tcp runner-s4f3kt7-project-15927-concurrent-0-eed4eb59d5dfa521-minio__minio-0
$ docker inspect de46 | grep -i pid
"Pid": 23888,
"PidMode": "",
"PidsLimit": 0,
$ nsenter --net -t 23888 curl 172.25.0.2:9000
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied.</Message><Resource>/</Resource><RequestId>17127033A4D00006</RequestId><HostId>ec1fb8ef-f0f0-488e-a71e-da444933f2ed</HostId></Error>
$ docker exec -ti 7a9ff9c8ee95 curl 172.25.0.2:9000
curl: (7) Failed to connect to 172.25.0.2 port 9000: Connection refused

The iptables rules and forwarding parameters all looked fine, yet it turned out the host itself could not reach the port either; it was only reachable from inside the minio container's own network namespace:

$ curl 172.25.0.2:9000
curl: (7) Failed to connect to 172.25.0.2 port 9000: Connection refused
$ nsenter --net -t 23888 curl 172.25.0.2:9000
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied.</Message><Resource>/</Resource><RequestId>17127033A4D00006</RequestId><HostId>ec1fb8ef-f0f0-488e-a71e-da444933f2ed</HostId></Error>
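For reference, "iptables and the forwarding parameters looked fine" covered roughly the following checks, a sketch of the usual suspects when container traffic is dropped on the host (not an exhaustive list):

```shell
# All of these were normal on this host:
cat /proc/sys/net/ipv4/ip_forward     # must be 1 for container traffic to be routed
cat /proc/sys/net/bridge/bridge-nf-call-iptables 2>/dev/null \
    || echo "br_netfilter not loaded"  # whether bridged traffic passes through iptables
iptables -S FORWARD 2>/dev/null | head -n 5   # FORWARD policy and DOCKER chain jumps
```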

To rule out a problem with the minio service itself, I removed the containers above and tested with the official nginx image:

$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
$ docker run -d --name t1 --rm -p 81:80 nginx:alpine
91fa481376cbbbdf04dd7ed027048ad20f40eee18f4e7d916d9edba8da102412
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
91fa481376cb nginx:alpine "/docker-entrypoint.…" About a minute ago Up About a minute 0.0.0.0:81->80/tcp t1
$ docker inspect t1 | grep IPAddress
"IPAddress": "172.25.0.2",

Access still failed, and putting docker0 into promiscuous mode made no difference:

$ curl 172.25.0.2
curl: (7) Failed to connect to 172.25.0.2 port 80: Connection refused
$ ip link set docker0 promisc on
$ curl 172.25.0.2
curl: (7) Failed to connect to 172.25.0.2 port 80: Connection refused

$ curl localhost:81
curl: (56) Recv failure: Connection reset by peer

After removing the container, the veth devices looked wrong:

$ ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:87:52:b5 brd ff:ff:ff:ff:ff:ff
    inet 10.226.48.239/23 brd 10.226.49.255 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe87:52b5/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:c0:7b:8b:bf brd ff:ff:ff:ff:ff:ff
    inet 172.25.0.1/16 brd 172.25.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:c0ff:fe7b:8bbf/64 scope link
       valid_lft forever preferred_lft forever
9005: veth4a669ee@if9004: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether aa:ad:04:ea:b9:a1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::a8ad:4ff:feea:b9a1/64 scope link
       valid_lft forever preferred_lft forever
9007: vethe0ddac0@if9006: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether 02:9c:d5:3d:18:8f brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::9c:d5ff:fe3d:188f/64 scope link
       valid_lft forever preferred_lft forever
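The `@ifNNNN` suffix on each veth names its peer's interface index inside a (now gone) container namespace, so the suspicious devices can be picked out mechanically. A small parsing sketch over the output above, fed in as a canned string so it runs anywhere (on a live host you would pipe real `ip` output instead):

```shell
# List veth devices enslaved to docker0 (canned `ip` output from above).
ip_output='9005: veth4a669ee@if9004: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP
9007: vethe0ddac0@if9006: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP'
# Splitting on spaces, ":" and "@", field 2 is the device name.
veths=$(echo "$ip_output" | awk -F'[ :@]+' '/master docker0/ {print $2}')
echo "$veths"
```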

With no containers left at all, those veths should not exist. Installing bridge-utils confirmed they were indeed leftovers:

$ apt-get install -y bridge-utils
$ brctl show
bridge name	bridge id		STP enabled	interfaces
docker0		8000.0242c07b8bbf	no		veth4a669ee
							vethe0ddac0
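As a stop-gap, the orphans can also be deleted by hand: with zero containers running, every port still attached to docker0 is stale. A hedged sketch that dry-runs the deletion from the bridge's sysfs port list; drop the `echo` (and run as root) to actually delete:

```shell
# /sys/class/net/<bridge>/brif holds one entry per interface attached
# to the bridge -- the same list that `brctl show` prints.
bridge_ports_dir="/sys/class/net/docker0/brif"
for veth in $(ls "$bridge_ports_dir" 2>/dev/null); do
    echo ip link delete "$veth"   # dry run; remove `echo` to really delete
done
```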

Because Docker hands out container IPs in order from the start of the subnet, the theory can be verified by starting several containers: the first ones land on the IPs tied to the stale veths, so if the later containers are reachable, the uncleaned veths are confirmed as the cause:

$ docker run -d  --rm -p 81:80 nginx:alpine
a0952e42f6a0da9d1969b327e696022c2dea041061cee2fbf080134037c9c93b
$ docker run -d --rm -p 82:80 nginx:alpine
dcf7d4635f1379b321603760c71a94bf70b1b954bb5528d384f7f9d38d4ed005
$ docker run -d --rm -p 83:80 nginx:alpine
44d5611125bb773ecc24baf760d4ba19f90a33c0b5c802e6afbeee462f200df0
$ docker run -d --rm -p 84:80 nginx:alpine
145b99fea102f71bc99132f2ae5aa8401f890df5d10177964df3e3f4e3fd8281
$ curl localhost:82
curl: (56) Recv failure: Connection reset by peer
$ curl localhost:83
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

So the leftovers were indeed the cause. Checking `docker info`, the Docker version turned out to be quite old:

Server Version: 18.03.0-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfd04396dc68220d1cecbe686a6cc3aa5ce3667c
runc version: N/A (expected: 4fc53a81fb7c994640722ac585fa9ca548971871)
init version: N/A (expected: )
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-184-generic
Operating System: Ubuntu 16.04.6 LTS
OSType: linux
Architecture: x86_64

Searching for `docker bridge network veth not clean up` turns up many reports of the same problem; it is a bug in old Docker versions. Uninstalling and reinstalling with the official convenience script fixed it:

curl -fsSL "https://get.docker.com/" | bash -s -- --mirror Aliyun
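To catch this class of problem earlier, runner hosts could warn when the daemon is older than some cutoff. A sketch using `sort -V` for the comparison; note that 18.06 is an assumed cutoff, since this post only establishes that 18.03.0-ce is affected and a current release is not:

```shell
# version_lt A B: succeeds if version A sorts strictly before version B
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

server_version="18.03.0-ce"   # on a live host: $(docker version --format '{{.Server.Version}}')
cutoff="18.06"                # assumption, not a confirmed fix version
if version_lt "${server_version%%-*}" "$cutoff"; then
    echo "Docker $server_version may leak veths; consider upgrading"
fi
```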

