zhangguanzhang's Blog

Offline-installed Docker vs. package-installed Docker: how containerd gets started

2025/10/23

A quick primer on how Docker startup relates to containerd.

Background

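Yesterday I handled an on-site case where docker would not start, so I will use the troubleshooting process as a short primer. The log from the failed docker start: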

$ journalctl -xe --no-pager -u docker
Oct 23 17:28:33 XXX251023S00P systemd[1]: docker.service: Unit entered failed state.
Oct 23 17:28:33 XXX251023S00P systemd[1]: docker.service: Failed with result 'exit-code'.
Oct 23 17:28:43 XXX251023S00P systemd[1]: docker.service: Service RestartSec=10s expired, scheduling restart.
Oct 23 17:28:43 XXX251023S00P systemd[1]: Stopped Docker Application Container Engine.
-- Subject: Unit docker.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has finished shutting down.
Oct 23 17:28:43 XXX251023S00P systemd[1]: Starting Docker Application Container Engine...
-- Subject: Unit docker.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has begun starting up.
Oct 23 17:28:43 XXX251023S00P dockerd[7355]: time="2025-10-23T17:28:43+08:00" level=info msg="SUSE:secrets :: enabled"
Oct 23 17:28:44 XXX251023S00P dockerd[7355]: time="2025-10-23T17:28:44.000689797+08:00" level=warning msg="The \"graph\" config file option is deprecated. Please use \"data-root\" instead."
Oct 23 17:28:44 XXX251023S00P dockerd[7355]: time="2025-10-23T17:28:44.064839553+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\". Reconnecting..." module=grpc
Oct 23 17:28:45 XXX251023S00P dockerd[7355]: time="2025-10-23T17:28:45.065163178+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\". Reconnecting..." module=grpc

Troubleshooting

The core error, near the end of the long log lines above, is Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused. To fix it, you first need to understand how docker and containerd are started.

docker and containerd under package management

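The docker daemon and containerd work together. When docker is installed through a package manager, there are two systemd service files:

  • containerd.service, provided by the containerd package
  • docker.service, provided by docker-ce

Taking the rpm packages as an example: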

$ rpm -qa | grep -P 'containerd'
containerd.io-1.6.33-3.1.el7.x86_64
$ rpm -ql containerd.io | grep -Ev '/(doc|licen|man)'
/etc/containerd
/etc/containerd/config.toml
/usr/bin/containerd
/usr/bin/containerd-shim
/usr/bin/containerd-shim-runc-v1
/usr/bin/containerd-shim-runc-v2
/usr/bin/ctr
/usr/bin/runc
/usr/lib/systemd/system/containerd.service

The packaged docker.service declares a dependency on containerd.service:

$ systemctl cat --no-pager  docker | grep containerd.service
After=network-online.target docker.socket firewalld.service containerd.service time-set.target
Wants=network-online.target containerd.service

Installing docker from binaries

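Our private deployments install docker offline, using binaries downloaded as described in the official docs at https://docs.docker.com/engine/install/binaries/ . Those docs do not say where to get the systemd service file. After I took over, I also found that nobody had created a containerd.service to manage containerd, yet docker ran fine. Looking at docker's child processes shows: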

$ systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since 三 2025-10-22 16:44:38 CST; 1 day 1h ago
Docs: http://docs.docker.io
Main PID: 16487 (dockerd)
Tasks: 167
Memory: 6.1G
CGroup: /system.slice/docker.service
...
├─16506 containerd --config /var/run/docker/containerd/containerd.toml --log-level warn
...

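This shows that dockerd itself must start the containerd process from an internal goroutine. In older versions the containerd binary may be named docker-containerd.
Working backwards through the source: since a goroutine launches the containerd process, it has to assemble the command line, and searching the source for --config finds: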

// https://github.com/moby/moby/blob/v19.03.15/libcontainerd/supervisor/remote_daemon.go#L165C1-L212C5
func (r *remote) startContainerd() error {
	pid, err := r.getContainerdPid()
	if err != nil {
		return err
	}

	if pid != -1 {
		r.daemonPid = pid
		logrus.WithField("pid", pid).
			Infof("libcontainerd: %s is still running", binaryName)
		return nil
	}

	configFile, err := r.getContainerdConfig()
	if err != nil {
		return err
	}

	args := []string{"--config", configFile}

	if r.Debug.Level != "" {
		args = append(args, "--log-level", r.Debug.Level)
	}

	cmd := exec.Command(binaryName, args...)
	// redirect containerd logs to docker logs
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = containerdSysProcAttr()
	// clear the NOTIFY_SOCKET from the env when starting containerd
	cmd.Env = nil
	for _, e := range os.Environ() {
		if !strings.HasPrefix(e, "NOTIFY_SOCKET") {
			cmd.Env = append(cmd.Env, e)
		}
	}
	if err := cmd.Start(); err != nil {
		return err
	}

	r.daemonWaitCh = make(chan struct{})
	go func() {
		// Reap our child when needed
		if err := cmd.Wait(); err != nil {
			r.logger.WithError(err).Errorf("containerd did not exit successfully")
		}
		close(r.daemonWaitCh)
	}()
	// ... (excerpt ends here; the rest of the function is omitted)

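Then trace the callers of startContainerd() backwards:

  • func (r *remote) monitorDaemon(ctx context.Context) { in the same file
  • func Start( in the same file
  • Because Start is capitalized (exported), it must be imported and called from another package. Searching for supervisor.Start finds: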
    // https://github.com/moby/moby/blob/v19.03.15/cmd/dockerd/daemon_unix.go#L152C1-L171
    func (cli *DaemonCli) initContainerD(ctx context.Context) (func(time.Duration) error, error) {
    	var waitForShutdown func(time.Duration) error
    	if cli.Config.ContainerdAddr == "" {
    		systemContainerdAddr, ok, err := systemContainerdRunning(honorXDG)
    		if err != nil {
    			return nil, errors.Wrap(err, "could not determine whether the system containerd is running")
    		}
    		if !ok {
    			logrus.Debug("Containerd not running, starting daemon managed containerd")
    			opts, err := cli.getContainerdDaemonOpts()
    			if err != nil {
    				return nil, errors.Wrap(err, "failed to generate containerd options")
    			}

    			r, err := supervisor.Start(ctx, filepath.Join(cli.Config.Root, "containerd"), filepath.Join(cli.Config.ExecRoot, "containerd"), opts...)
    			if err != nil {
    				return nil, errors.Wrap(err, "failed to start containerd")
    			}
    			logrus.Debug("Started daemon managed containerd")
    			cli.Config.ContainerdAddr = r.Address()
    			// ... (excerpt ends here; the rest of the function is omitted)

The logic above: systemContainerdRunning decides whether containerd is already running, and if it is not, supervisor.Start is called to start containerd. The implementation of systemContainerdRunning:

// https://github.com/moby/moby/blob/v19.03.15/cmd/dockerd/daemon.go#L691C1-L702C2
func systemContainerdRunning(honorXDG bool) (string, bool, error) {
	addr := containerddefaults.DefaultAddress
	if honorXDG {
		runtimeDir, err := homedir.GetRuntimeDir()
		if err != nil {
			return "", false, err
		}
		addr = filepath.Join(runtimeDir, "containerd", "containerd.sock")
	}
	_, err := os.Lstat(addr)
	return addr, err == nil, nil
}

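It decides whether containerd is running by checking whether the gRPC socket file /run/containerd/containerd.sock exists. In other words, if systemd has already started containerd, the docker daemon does not call supervisor.Start to launch containerd as a child process. Note that os.Lstat only checks that the file exists, not that anything is listening on it, so a leftover socket file alone is enough to make dockerd skip starting containerd and then fail to connect.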
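To make the difference between "the socket file exists" and "something is listening on it" concrete, here is a minimal sketch. It is not code from moby: socketExists mirrors the os.Lstat check quoted above, while socketAccepts and demoStaleSocket are hypothetical helpers I made up for illustration. The demo uses a temporary directory rather than /run/containerd, and simulates a daemon that died without cleaning up by closing a listener while keeping its socket file.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"time"
)

// socketExists mirrors the existence-only check: true if anything is at path.
func socketExists(path string) bool {
	_, err := os.Lstat(path)
	return err == nil
}

// socketAccepts (hypothetical helper) reports whether something is actually
// listening on the unix socket at path.
func socketAccepts(path string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("unix", path, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// demoStaleSocket creates a listener in a temp dir, closes it while leaving
// the socket file behind, and returns both check results for the leftover file.
func demoStaleSocket() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "stale-sock")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "containerd.sock")
	l, err := net.Listen("unix", path)
	if err != nil {
		return false, false, err
	}
	// Keep the file on Close to simulate a daemon that died without cleanup.
	l.(*net.UnixListener).SetUnlinkOnClose(false)
	if err := l.Close(); err != nil {
		return false, false, err
	}
	return socketExists(path), socketAccepts(path, time.Second), nil
}

func main() {
	exists, accepts, err := demoStaleSocket()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("file exists: %v, accepts connections: %v\n", exists, accepts)
}
```

On Linux the leftover file should pass the existence check while the dial is refused, which is the same combination that makes dockerd skip supervisor.Start and then log connection refused.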

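The on-site machine had docker installed from SUSE rpm packages. There is a containerd rpm package, but it turned out to ship no service file, so this setup relies on the child-process logic in the source above: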

$ rpm -ql containerd
/etc/containerd
/etc/containerd/config.toml
/usr/sbin/containerd
/usr/sbin/containerd-shim
/usr/sbin/docker-containerd
/usr/sbin/docker-containerd-shim
/usr/share/doc/packages/containerd
/usr/share/doc/packages/containerd/README
/usr/share/licenses/containerd
/usr/share/licenses/containerd/LICENSE

Sure enough, the sock file exists but there is no containerd process:

$ ls -l /run/containerd/
total 28
srw-rw---- 1 root root 0 Oct 23 15:20 containerd.sock
-rwxr-xr-x 1 root root 25651 Oct 23 15:26 events.log
$ ps -ef | grep container[d]

Deleting that file and restarting docker fixed the problem.
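If you want to script this cleanup instead of deleting the file by hand, it is safer to remove the socket only when nothing answers on it. The sketch below is my own illustration, not part of docker or containerd: removeIfStale and demoCleanup are hypothetical names, and treating only ECONNREFUSED as "stale" is an assumption chosen to avoid deleting the socket of a live daemon. The demo runs against sockets in a temporary directory.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"syscall"
	"time"
)

// removeIfStale (hypothetical helper) deletes the socket file at path only
// when a connection attempt is refused, i.e. the file exists but nothing is
// listening. It reports whether the file was removed.
func removeIfStale(path string, timeout time.Duration) (bool, error) {
	conn, err := net.DialTimeout("unix", path, timeout)
	if err == nil {
		conn.Close() // a live daemon answered; leave the socket alone
		return false, nil
	}
	if !errors.Is(err, syscall.ECONNREFUSED) {
		return false, nil // missing file, permission problem, etc.: do nothing
	}
	if err := os.Remove(path); err != nil {
		return false, err
	}
	return true, nil
}

// demoCleanup runs removeIfStale against a leftover socket and a live one,
// both created in a temporary directory.
func demoCleanup() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "sock-cleanup")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	// Leftover socket: listener closed, file kept.
	stale := filepath.Join(dir, "stale.sock")
	l, err := net.Listen("unix", stale)
	if err != nil {
		return false, false, err
	}
	l.(*net.UnixListener).SetUnlinkOnClose(false)
	l.Close()
	staleRemoved, err := removeIfStale(stale, time.Second)
	if err != nil {
		return false, false, err
	}

	// Live socket: listener still open, so the dial succeeds.
	live := filepath.Join(dir, "live.sock")
	l2, err := net.Listen("unix", live)
	if err != nil {
		return false, false, err
	}
	defer l2.Close()
	liveRemoved, err := removeIfStale(live, time.Second)
	return staleRemoved, liveRemoved, err
}

func main() {
	staleRemoved, liveRemoved, err := demoCleanup()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("stale removed: %v, live removed: %v\n", staleRemoved, liveRemoved)
}
```

In the on-site case the path would be /run/containerd/containerd.sock, the program would need to run as root, and docker would still have to be restarted afterwards.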
