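A quick primer on how docker startup relates to containerd
Background
Yesterday I handled an on-site issue where docker would not start, so I am using the troubleshooting process as a short primer. The log from the failed start: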
```
$ journalctl -xe --no-pager -u docker
Oct 23 17:28:33 XXX251023S00P systemd[1]: docker.service: Unit entered failed state.
Oct 23 17:28:33 XXX251023S00P systemd[1]: docker.service: Failed with result 'exit-code'.
Oct 23 17:28:43 XXX251023S00P systemd[1]: docker.service: Service RestartSec=10s expired, scheduling restart.
Oct 23 17:28:43 XXX251023S00P systemd[1]: Stopped Docker Application Container Engine.
-- Subject: Unit docker.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has finished shutting down.
Oct 23 17:28:43 XXX251023S00P systemd[1]: Starting Docker Application Container Engine...
-- Subject: Unit docker.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has begun starting up.
Oct 23 17:28:43 XXX251023S00P dockerd[7355]: time="2025-10-23T17:28:43+08:00" level=info msg="SUSE:secrets :: enabled"
Oct 23 17:28:44 XXX251023S00P dockerd[7355]: time="2025-10-23T17:28:44.000689797+08:00" level=warning msg="The \"graph\" config file option is deprecated. Please use \"data-root\" instead."
Oct 23 17:28:44 XXX251023S00P dockerd[7355]: time="2025-10-23T17:28:44.064839553+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\". Reconnecting..." module=grpc
Oct 23 17:28:45 XXX251023S00P dockerd[7355]: time="2025-10-23T17:28:45.065163178+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\". Reconnecting..." module=grpc
```
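Troubleshooting
Scroll right in the log above to see the core error: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused. To solve it, you first need to understand how docker and containerd start up.
docker and containerd installed via a package manager
The docker daemon and containerd work by talking to each other. If docker is installed through a package manager, there are two systemd service files:
- containerd.service, provided by the containerd package
- docker.service, provided by docker-ce
Taking the rpm packages as an example: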
```
$ rpm -qa | grep -P 'containerd'
containerd.io-1.6.33-3.1.el7.x86_64
$ rpm -ql containerd.io | grep -Ev '/(doc|licen|man)'
/etc/containerd
/etc/containerd/config.toml
/usr/bin/containerd
/usr/bin/containerd-shim
/usr/bin/containerd-shim-runc-v1
/usr/bin/containerd-shim-runc-v2
/usr/bin/ctr
/usr/bin/runc
/usr/lib/systemd/system/containerd.service
```
The packaged docker.service in turn declares a dependency on it:
```
$ systemctl cat --no-pager docker | grep containerd.service
After=network-online.target docker.socket firewalld.service containerd.service time-set.target
Wants=network-online.target containerd.service
```
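Installing docker from binaries
Our private deployments install docker offline: we download the binaries as described in the official docs at https://docs.docker.com/engine/install/binaries/, but those docs do not say where to get the systemd service files. After I took over, I also found that nobody had created a containerd.service to manage containerd, yet docker ran fine. Looking at docker's child processes shows why: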
```
$ systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since 三 2025-10-22 16:44:38 CST; 1 day 1h ago
     Docs: http://docs.docker.io
 Main PID: 16487 (dockerd)
    Tasks: 167
   Memory: 6.1G
   CGroup: /system.slice/docker.service
           ...
           ├─16506 containerd --config /var/run/docker/containerd/containerd.toml --log-level warn
           ...
```
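So the docker daemon must be starting the containerd process itself, from an internal goroutine. In older versions the containerd binary may be named docker-containerd.
Working backwards into the source: since the daemon spawns the containerd process, it has to assemble the command line somewhere, so searching the source for --config finds: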
```go
func (r *remote) startContainerd() error {
	pid, err := r.getContainerdPid()
	if err != nil {
		return err
	}

	if pid != -1 {
		r.daemonPid = pid
		logrus.WithField("pid", pid).
			Infof("libcontainerd: %s is still running", binaryName)
		return nil
	}

	configFile, err := r.getContainerdConfig()
	if err != nil {
		return err
	}

	args := []string{"--config", configFile}

	if r.Debug.Level != "" {
		args = append(args, "--log-level", r.Debug.Level)
	}

	cmd := exec.Command(binaryName, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = containerdSysProcAttr()
	cmd.Env = nil
	for _, e := range os.Environ() {
		if !strings.HasPrefix(e, "NOTIFY_SOCKET") {
			cmd.Env = append(cmd.Env, e)
		}
	}
	if err := cmd.Start(); err != nil {
		return err
	}

	r.daemonWaitCh = make(chan struct{})
	go func() {
		if err := cmd.Wait(); err != nil {
			r.logger.WithError(err).Errorf("containerd did not exit successfully")
		}
		close(r.daemonWaitCh)
	}()
	// (excerpt ends here; the rest of the function is omitted)
```
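To make the excerpt above easier to play with, here is a minimal standalone sketch of the same pattern: build the `--config` / `--log-level` argument list and hand the child a copy of the environment without `NOTIFY_SOCKET`. The helper names `envWithout` and `childCommand` are my own and do not exist in the docker source, and my guess is that the variable is stripped so the child cannot send systemd readiness notifications on dockerd's behalf. The command is only built, never started.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// envWithout returns a copy of env without the entries that start with prefix.
// Hypothetical helper: it mirrors the filtering loop in the excerpt above.
func envWithout(env []string, prefix string) []string {
	out := make([]string, 0, len(env))
	for _, e := range env {
		if !strings.HasPrefix(e, prefix) {
			out = append(out, e)
		}
	}
	return out
}

// childCommand builds (but does not start) a command shaped like the one
// dockerd assembles for its managed containerd. Hypothetical helper.
func childCommand(binary, configFile, logLevel string) *exec.Cmd {
	args := []string{"--config", configFile}
	if logLevel != "" {
		args = append(args, "--log-level", logLevel)
	}
	cmd := exec.Command(binary, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	// The child gets the parent's environment minus NOTIFY_SOCKET.
	cmd.Env = envWithout(os.Environ(), "NOTIFY_SOCKET")
	return cmd
}

func main() {
	cmd := childCommand("containerd", "/var/run/docker/containerd/containerd.toml", "warn")
	// Prints the assembled command line; nothing is executed.
	fmt.Println(strings.Join(cmd.Args, " "))
}
```

The printed command line has the same shape as the containerd entry in the `systemctl status docker` output earlier, which is how searching for `--config` led to this function.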
Then trace the call chain of startContainerd() backwards:
- func (r *remote) monitorDaemon(ctx context.Context) { in the same file
- func Start( in the same file
- Because Start is capitalized (exported), it must be called from another package that imports this one; searching for supervisor.Start finds:
```go
func (cli *DaemonCli) initContainerD(ctx context.Context) (func(time.Duration) error, error) {
	var waitForShutdown func(time.Duration) error
	if cli.Config.ContainerdAddr == "" {
		systemContainerdAddr, ok, err := systemContainerdRunning(honorXDG)
		if err != nil {
			return nil, errors.Wrap(err, "could not determine whether the system containerd is running")
		}
		if !ok {
			logrus.Debug("Containerd not running, starting daemon managed containerd")
			opts, err := cli.getContainerdDaemonOpts()
			if err != nil {
				return nil, errors.Wrap(err, "failed to generate containerd options")
			}

			r, err := supervisor.Start(ctx, filepath.Join(cli.Config.Root, "containerd"), filepath.Join(cli.Config.ExecRoot, "containerd"), opts...)
			if err != nil {
				return nil, errors.Wrap(err, "failed to start containerd")
			}
			logrus.Debug("Started daemon managed containerd")
			cli.Config.ContainerdAddr = r.Address()
```
The logic above: systemContainerdRunning decides whether containerd is already running, and if it is not, supervisor.Start is called to start containerd. Here is the implementation of systemContainerdRunning:
```go
func systemContainerdRunning(honorXDG bool) (string, bool, error) {
	addr := containerddefaults.DefaultAddress
	if honorXDG {
		runtimeDir, err := homedir.GetRuntimeDir()
		if err != nil {
			return "", false, err
		}
		addr = filepath.Join(runtimeDir, "containerd", "containerd.sock")
	}
	_, err := os.Lstat(addr)
	return addr, err == nil, nil
}
```
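It decides whether containerd is running by checking whether the gRPC socket file used to connect to containerd, /run/containerd/containerd.sock, exists. In other words, if systemd has already started containerd, the docker daemon does not call supervisor.Start to launch containerd as a child process. Note that the test is only an os.Lstat on the path: it does not check that any process is actually listening on the socket.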
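To see why an existence-only test can be fooled, here is a small sketch, assuming Linux, that leaves a socket file behind with no listener and then compares an `os.Lstat` check with an actual dial. The names `socketExists`, `socketAlive`, `leaveStaleSocket` and `demo` are hypothetical helpers of mine, not docker or containerd APIs.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"syscall"
	"time"
)

// socketExists is an existence-only test, in the spirit of the Lstat check above.
func socketExists(path string) bool {
	_, err := os.Lstat(path)
	return err == nil
}

// socketAlive reports whether something accepts connections on path.
// A refused connection or a missing file both count as "not alive".
func socketAlive(path string) (bool, error) {
	conn, err := net.DialTimeout("unix", path, time.Second)
	if err == nil {
		conn.Close()
		return true, nil
	}
	if errors.Is(err, syscall.ECONNREFUSED) || errors.Is(err, syscall.ENOENT) {
		return false, nil
	}
	return false, err
}

// leaveStaleSocket binds a unix socket at path, then closes the listener
// without unlinking the file, imitating a process that died uncleanly.
func leaveStaleSocket(path string) error {
	l, err := net.ListenUnix("unix", &net.UnixAddr{Name: path, Net: "unix"})
	if err != nil {
		return err
	}
	l.SetUnlinkOnClose(false)
	return l.Close()
}

// demo creates a stale socket in a temporary directory and returns
// what each of the two checks says about it.
func demo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "stale-sock")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "containerd.sock")
	if err := leaveStaleSocket(path); err != nil {
		return false, false, err
	}
	alive, err := socketAlive(path)
	return socketExists(path), alive, err
}

func main() {
	exists, alive, err := demo()
	fmt.Printf("exists=%v alive=%v err=%v\n", exists, alive, err)
}
```

With a leftover socket the existence check says yes while the dial is refused, which is the same combination as the connection refused warnings in the dockerd log.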
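The on-site machine had docker installed from the SUSE rpm packages. There is a containerd rpm package, but it turned out to contain no service file, so this install also takes the child-process path from the source above: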
```
$ rpm -ql containerd
/etc/containerd
/etc/containerd/config.toml
/usr/sbin/containerd
/usr/sbin/containerd-shim
/usr/sbin/docker-containerd
/usr/sbin/docker-containerd-shim
/usr/share/doc/packages/containerd
/usr/share/doc/packages/containerd/README
/usr/share/licenses/containerd
/usr/share/licenses/containerd/LICENSE
```
Sure enough, the sock file exists but there is no containerd process:
```
$ ls -l /run/containerd/
total 28
srw-rw---- 1 root root     0 Oct 23 15:20 containerd.sock
-rwxr-xr-x 1 root root 25651 Oct 23 15:26 events.log
$ ps -ef | grep container[d]
```
Deleting that file and restarting docker fixed the problem.
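If this kind of cleanup had to be scripted, one cautious approach would be to delete the socket file only when a dial is actually refused, so a live containerd is never touched. This is a sketch under my own assumptions, not what I ran on site (there I removed the file by hand), and `removeIfStale` is a hypothetical helper name.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"syscall"
	"time"
)

// removeIfStale deletes the unix socket file at path only if a connection
// attempt is refused. It returns true when the file was removed.
func removeIfStale(path string) (bool, error) {
	fi, err := os.Lstat(path)
	if err != nil {
		if errors.Is(err, os.ErrNotExist) {
			return false, nil // nothing to clean up
		}
		return false, err
	}
	if fi.Mode()&os.ModeSocket == 0 {
		return false, fmt.Errorf("%s is not a socket", path)
	}

	conn, err := net.DialTimeout("unix", path, time.Second)
	if err == nil {
		conn.Close()
		return false, nil // something is listening, leave it alone
	}
	if !errors.Is(err, syscall.ECONNREFUSED) {
		return false, err // unexpected failure, do not guess
	}
	if err := os.Remove(path); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: cleanup <socket-path>")
		os.Exit(2)
	}
	removed, err := removeIfStale(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("removed:", removed)
}
```

There is still a window between the dial and the remove in which a process could start listening, so this only makes sense while docker is stopped, followed by a restart of docker.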