zhangguanzhang's Blog

The right way to handle docker open /var/lib/docker/tmp/GetImageBlobXXX: no such file

2025/11/06

Fixing docker open /var/lib/docker/tmp/GetImageBlobXXX: no such file or directory.

Background

QA reported that the 04:33 package build failed. The relevant step reported this error:

$ docker run -d -p 48835:5000 --name daily-master-K8S_XC-2298 -v /xxx/images:/var/lib/registry harbor.xxx.cn/xxx-base/registry:2.6.1
Unable to find image 'harbor.xxx.cn/xxx-base/registry' locally
2.6.1: Pulling from xxx-base/registry
53478ce18e19: Pulling fs layer
907370c150a1: Pulling fs layer
ecd89ee27260: Pulling fs layer
e4d3e6950197: Pulling fs layer
a0c226b30c4f: Pulling fs layer
d4bda1830450: Pulling fs layer
f441bc34ec75: Pulling fs layer
877c19e43805: Pulling fs layer
docker: open /work/docker/tmp/GetImageBlob838250894: no such file or directory.
See 'docker run --help'.

Our docker is configured with a custom data-root (/work/docker). With the default data-root, the error would read open /var/lib/docker/tmp/GetImageBlobXXX instead.

Troubleshooting

Reproducing

Following the build log, I logged in to the build machine. Pulling an image by hand reproduced the error:

$ docker pull harbor.xxx.cn/xxxx-run/gosu:v1
v1: Pulling from xxxx-run/gosu
e9abf7e9593f: Pulling fs layer
open /work/docker/tmp/GetImageBlob159490514: no such file or directory

In Jenkins I confirmed that no build job was running on this machine, then marked it offline so it would not affect other builds.

Investigation

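The error message is straightforward. Rather than reinventing the wheel, start with a search engine; but every result I found just said that restarting docker fixes it. A bug that reproduced every time in a widely used open source project would have been fixed long ago, so this is unlikely to be a docker bug as such. In a Go project that returns its errors properly, the relevant logic can be found by searching the source directly. Searching the source for GetImageBlob finds: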

// https://github.com/moby/moby/blob/v26.1.4/distribution/pull_v2.go#L1070C1-L1072C2
func createDownloadFile() (*os.File, error) {
	return os.CreateTemp("", "GetImageBlob")
}

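With an empty dir argument, os.CreateTemp creates the file in the directory returned by os.TempDir, which on Unix is $TMPDIR if it is set and /tmp otherwise. The path in the error is under the data-root rather than /tmp, so something must be setting a tmp-related environment variable. First, check the process environment through /proc: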

$ ps -ef | grep docker[d]
root 23938 1 3 03:00 ? 00:19:00 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

$ xargs -0 -n1 < /proc/23938/environ
LANG=en_US.UTF-8
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
NOTIFY_SOCKET=/run/systemd/notify
LISTEN_PID=23938
LISTEN_FDS=1

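/proc confirms that the process was not started with any TMPDIR variable, so the only remaining possibility is that the process calls os.Setenv itself. Searching the source for TMPDIR finds the relevant logic: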

// https://github.com/moby/moby/blob/v26.1.4/daemon/daemon.go#L841-L856
// set up the tmpDir to use a canonical path
tmp, err := prepareTempDir(config.Root)
if err != nil {
	return nil, fmt.Errorf("Unable to get the TempDir under %s: %s", config.Root, err)
}
realTmp, err := fileutils.ReadSymlinkedDirectory(tmp)
if err != nil {
	return nil, fmt.Errorf("Unable to get the full path to the TempDir (%s): %s", tmp, err)
}
if isWindows {
	...
} else {
	os.Setenv("TMPDIR", realTmp)
}

That confirms how the path is built. Next, where is the error returned? Searching for callers of createDownloadFile finds:

// https://github.com/moby/moby/blob/v26.1.4/distribution/pull_v2.go#L169-L199
func (ld *layerDescriptor) Download(ctx context.Context, progressOutput progress.Output) (io.ReadCloser, int64, error) {
	log.G(ctx).Debugf("pulling blob %q", ld.digest)

	var (
		err    error
		offset int64
	)

	if ld.tmpFile == nil {
		ld.tmpFile, err = createDownloadFile()
		if err != nil {
			return nil, 0, xfer.DoNotRetry{Err: err}
		}
	} else {
		offset, err = ld.tmpFile.Seek(0, io.SeekEnd)
		if err != nil {
			log.G(ctx).Debugf("error seeking to end of download file: %v", err)
			offset = 0

			ld.tmpFile.Close()
			if err := os.Remove(ld.tmpFile.Name()); err != nil {
				log.G(ctx).Errorf("Failed to remove temp file: %s", ld.tmpFile.Name())
			}
			ld.tmpFile, err = createDownloadFile()
			if err != nil {
				return nil, 0, xfer.DoNotRetry{Err: err}
			}
		} else if offset != 0 {
			log.G(ctx).Debugf("attempting to resume download of %q from %d bytes", ld.digest, offset)
		}
	}
	// ...

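The error is returned somewhere around here, though the exact spot is not yet clear. Since this code path writes logs, check the docker daemon log to confirm: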

Nov 06 10:30:22 centos-xx dockerd[23938]: time="2025-11-06T10:30:22.290434484+08:00" level=error msg="Download failed after 1 attempts: open /work/docker/tmp/GetImageBlob1655640392: no such file or directory"
Nov 06 10:48:51 centos-xx dockerd[23938]: time="2025-11-06T10:48:51.619866730+08:00" level=error msg="Download failed after 1 attempts: open /work/docker/tmp/GetImageBlob1626967315: no such file or directory"

Searching the source for "Download failed after" finds:

// https://github.com/moby/moby/blob/v26.1.4/distribution/xfer/download.go#L274-L293
for {
	downloadReader, size, err = descriptor.Download(d.transfer.context(), progressOutput)
	if err == nil {
		break
	}

	// If an error was returned because the context
	// was cancelled, we shouldn't retry.
	select {
	case <-d.transfer.context().Done():
		d.err = err
		return
	default:
	}

	if _, isDNR := err.(DoNotRetry); isDNR || attempt >= ldm.maxDownloadAttempts {
		log.G(context.TODO()).Errorf("Download failed after %d attempts: %v", attempt, err)
		d.err = err
		return
	}
	// ...

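So the error is the one returned by Download above. Judging from the daemon log, it was not wrapped with extra context by errors.Wrap or anything similar, so it should be the syscall-level error returned from createDownloadFile().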

I considered writing a Go demo to reproduce it, but creating a temp file is a very common operation and the shell already ships a mktemp command, so I tried that first:

$ mktemp /work/docker/tmp/test1111
mktemp: too few X's in template ‘/work/docker/tmp/test1111’
$ mktemp /work/docker/tmp/testXXXX
mktemp: failed to create file via template ‘/work/docker/tmp/testXXXX’: No such file or directory
$ mktemp /tmp/testXXX
/tmp/testx1T

/work is a mounted filesystem, and a quick read/write test on it was fine. Then I noticed that the docker data-root had no tmp directory:

$ touch /work/docker/test1111
$ rm -f /work/docker/test1111
$ ls -l /work/docker/
total 4
drwx--x--x 4 root root 170 Nov 6 03:00 buildkit
drwx--x--- 2 root root 10 Nov 6 03:00 containers
-rw------- 1 root root 36 Nov 6 03:00 engine-id
drwx------ 3 root root 30 Nov 6 03:00 image
drwxr-x--- 3 root root 27 Nov 6 03:00 network
drwx--x--- 3 root root 52 Nov 6 03:00 overlay2
drwx------ 4 root root 44 Nov 6 03:00 plugins
drwx------ 2 root root 10 Nov 6 03:00 swarm
drwx-----x 2 root root 62 Nov 6 03:00 volumes

After creating the directory, the pull worked:

$ docker pull harbor.xxx.cn/xxxx-run/gosu:v1
v1: Pulling from xxxx-run/gosu
e9abf7e9593f: Pulling fs layer
open /work/docker/tmp/GetImageBlob1426925393: no such file or directory
$ mkdir -p /work/docker/tmp
$ docker pull harbor.xxx.cn/xxxx-run/gosu:v1
v1: Pulling from xxxx-run/gosu
e9abf7e9593f: Pull complete
Digest: sha256:06ff9bb691ce53498f7dda976e0028639fb320f71513f6a41b4dd6761e989e78
Status: Downloaded newer image for harbor.xxx.cn/xxxx-run/gosu:v1
harbor.xxx.cn/xxxx-run/gosu:v1

Root cause

My first idea was to create the directory after docker starts:

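# Files in the .d drop-in directory are not removed by later docker upgrades, so the logic persists
mkdir -p /etc/systemd/system/docker.service.d/
cat > /etc/systemd/system/docker.service.d/tmp.conf << EOF
[Service]
# -p keeps mkdir from failing when the directory already exists
ExecStartPost=/usr/bin/mkdir -p /work/docker/tmp
EOF

systemctl daemon-reload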

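On reflection, though, if restarting docker fixes the problem for other people, the docker daemon must create this directory on startup. A search of the source confirmed that it does: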

// https://github.com/moby/moby/blob/v26.1.4/daemon/daemon.go#L1420-L1442

// prepareTempDir prepares and returns the default directory to use
// for temporary files.
// If it doesn't exist, it is created. If it exists, its content is removed.
func prepareTempDir(rootDir string) (string, error) {
	var tmpDir string
	if tmpDir = os.Getenv("DOCKER_TMPDIR"); tmpDir == "" {
		tmpDir = filepath.Join(rootDir, "tmp")
		newName := tmpDir + "-old"
		if err := os.Rename(tmpDir, newName); err == nil {
			go func() {
				if err := os.RemoveAll(newName); err != nil {
					log.G(context.TODO()).Warnf("failed to delete old tmp directory: %s", newName)
				}
			}()
		} else if !os.IsNotExist(err) {
			log.G(context.TODO()).Warnf("failed to rename %s for background deletion: %s. Deleting synchronously", tmpDir, err)
			if err := os.RemoveAll(tmpDir); err != nil {
				log.G(context.TODO()).Warnf("failed to delete old tmp directory: %s", tmpDir)
			}
		}
	}
	return tmpDir, idtools.MkdirAllAndChown(tmpDir, 0o700, idtools.CurrentIdentity())
}

So something external must have deleted the tmp directory. Check the cron jobs:

$ crontab -l
0 3 * * * systemctl stop docker && mv /work/docker /work/docker-$(date +\%Y\%m\%d) && systemctl start docker
15 3 * * * rm -rf /work/docker-*
20 3 * * * find /work -maxdepth 1 -name '*dev*' -type d -ctime +1 | xargs rm -rf
23 3 * * * find /work -maxdepth 1 -name '*test*' -type d -ctime +1 | xargs rm -rf
25 3 * * * find /work -maxdepth 1 -name '*release*' -type d -ctime +1 | xargs rm -rf
27 3 * * * find /work -maxdepth 1 -name '*openxxx*' -type d -ctime +1 | xargs rm -rf
0 9-23/2 * * * python /data/epy/clean_dir.py /work
#0 9-23/2 * * * docker system prune -f
0 */2 * * * echo 3 > /proc/sys/vm/drop_caches

*/30 * * * * python /data/prepullimage/pull_image.py

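Nothing in the crontab runs around 04:33. I asked a colleague and learned that Jenkins has a scheduled job at 04:30 that cleans up files on the machine. After adjusting that job, the problem was solved.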
