zhangguanzhang's Blog

Job for docker.service canceled

2021/07/05

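Symptoms

An internal script that installs docker reported that the docker installation had failed. Starting the service by hand then showed this odd behavior: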

$ systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: inactive (dead)
Docs: http://docs.docker.io
$ systemctl start docker
Job for docker.service canceled.

Yet service docker start was able to start it, which was baffling. Running dockerd in the foreground did not show any errors either.

$ service docker start
$ systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-07-05 14:41:30 CST; 15s ago
Docs: http://docs.docker.io
$ systemctl cat docker
# /etc/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.io

[Service]
Environment="PATH=/data/kube/bin:/bin:/sbin:/usr/bin:/usr/sbin"
ExecStart=/data/kube/bin/dockerd
ExecStartPost=/sbin/iptables --wait -I FORWARD -s 0.0.0.0/0 -j ACCEPT
ExecStopPost=/bin/sh -c '/sbin/iptables --wait -D FORWARD -s 0.0.0.0/0 -j ACCEPT &> /dev/null || :'
ExecStartPost=/sbin/iptables --wait -I INPUT -i cni0 -j ACCEPT
ExecStopPost=/bin/sh -c '/sbin/iptables --wait -D INPUT -i cni0 -j ACCEPT &> /dev/null || :'
ExecReload=/bin/kill -s HUP $MAINPID
Restart=on-failure
RestartSec=5
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
Delegate=yes
KillMode=process

[Install]
WantedBy=multi-user.target
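
To see at a glance which dependency or ordering directives a unit file like the one above declares explicitly, a small helper can filter for them. This is only a sketch: list_unit_deps is a hypothetical helper name, and the sample file just reuses the [Unit] and [Install] sections shown above. It inspects the file text only. Dependencies that systemd adds implicitly (for example the default ones on sysinit.target, unless DefaultDependencies=no is set) do not appear in the file and would have to be queried with something like systemctl show -p Requires,After docker.

```shell
#!/bin/sh
# list_unit_deps FILE (hypothetical helper): print the explicit dependency
# and ordering directives found in a systemd unit file, if any.
list_unit_deps() {
    grep -E '^(Requires|Requisite|Wants|BindsTo|PartOf|After|Before|Conflicts)=' "$1" || true
}

# Sample file built from the unit sections shown above.
unit=$(mktemp)
cat > "$unit" <<'EOF'
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.io

[Install]
WantedBy=multi-user.target
EOF

deps=$(list_unit_deps "$unit")
echo "explicit dependencies: ${deps:-none}"
rm -f "$unit"
```

On a real host the helper could be pointed at /etc/systemd/system/docker.service directly.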

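Troubleshooting

Judging from the systemctl cat docker output above, the unit declares no dependencies on other services; had docker been installed from the official rpm package, it would depend on containerd. First, though, a look at the failed units: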

$ systemctl --failed
UNIT LOAD ACTIVE SUB DESCRIPTION
● data.mount loaded failed failed /data

LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

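The original terminal output had scrolled away, so the output below was reproduced afterwards on another machine. systemctl start talks to dbus and the like, whereas service does not. Combined with the failed data mount above, the likely cause was that the system had only half booted and was sitting in emergency mode. A check confirmed it: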

$ systemctl status emergency 
● emergency.service - Emergency Shell
Loaded: loaded (/lib/systemd/system/emergency.service; static; vendor preset: enabled)
Active: active (running) since Mon 2021-07-05 17:18:28 CST; 5min ago
Docs: man:sulogin(8)
Process: 674 ExecStartPre=/bin/plymouth --wait quit (code=exited, status=0/SUCCESS)
Main PID: 675 (systemd-sulogin)
Tasks: 5 (limit: 4915)
Memory: 32.9M
CGroup: /system.slice/emergency.service
├─647 /sbin/sulogin
├─675 /lib/systemd/systemd-sulogin-shell emergency
├─676 /sbin/sulogin
├─677 bash
└─678 /sbin/sulogin

Jul 05 17:18:28 host100 systemd[1]: Started Emergency Shell.
Jul 05 17:18:28 host100 systemd[1]: emergency.service: Found left-over process 647 (sulogin) in control group while starting unit. Ignoring.
Jul 05 17:18:28 host100 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
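
The same check can be scripted so that an install script notices this state before trying to start anything. The sketch below relies on systemctl is-system-running, which reports maintenance when the rescue or emergency target is active. The classify_state helper and its labels are hypothetical names introduced for illustration, not part of the original install script.

```shell
#!/bin/sh
# classify_state STATE (hypothetical helper): map the output of
# `systemctl is-system-running` to a short hint.
classify_state() {
    case "$1" in
        running)     echo "ok" ;;
        maintenance) echo "rescue-or-emergency" ;;
        degraded)    echo "failed-units" ;;
        *)           echo "other" ;;
    esac
}

# Only query systemd when systemctl is present; the command exits non-zero
# for every state except "running", hence the `|| true`.
if command -v systemctl >/dev/null 2>&1; then
    state=$(systemctl is-system-running 2>/dev/null || true)
    echo "system state: ${state:-unknown} ($(classify_state "$state"))"
fi
```

A deployment script could abort early when the hint is rescue-or-emergency, rather than failing later with a canceled job.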

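A look at /etc/fstab then showed that defaults had been written as default, which is why the mount failed. After fixing it and rebooting, everything was fine.

I asked the tester what had happened. She said that after editing /etc/fstab she saw the emergency mode prompt asking for the root password at boot, and entered the root password to get in. Starting sshd failed, so she used service sshd start, then logged in over ssh to deploy docker, and after that we came in over ssh to investigate. On the CentOS versions I had worked with before, emergency mode did not seem to have networking.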
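
A typo like this can be caught before the next reboot. The following is a minimal sketch that scans the options column of an fstab-style file for the bare word default. The check_fstab_options helper name is hypothetical, and the sample entry (device /dev/sdb1, filesystem ext4) is made up for illustration; only the /data mount point and the default typo come from the incident above. Where the installed util-linux provides it, findmnt --verify performs a more thorough check.

```shell
#!/bin/sh
# check_fstab_options FILE (hypothetical helper): report entries whose
# options field (4th column) contains the bare word "default", a likely
# typo for "defaults".
check_fstab_options() {
    awk '
        /^[[:space:]]*#/ || NF < 4 { next }   # skip comments, blank and short lines
        {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "default")
                    printf "line %d: %s has option \"default\" (did you mean \"defaults\"?)\n", NR, $2
        }
    ' "$1"
}

# Demonstration on a temporary file with a made-up entry for /data.
sample=$(mktemp)
printf '%s\n' '/dev/sdb1 /data ext4 default 0 0' > "$sample"
check_fstab_options "$sample"
rm -f "$sample"
```

Running check_fstab_options /etc/fstab after any manual edit gives a quick sanity check without having to reboot.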
