zhangguanzhang's Blog

Troubleshooting a panic when running skopeo copy concurrently

2025/10/31

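Background

Our internal build that produces release packages works roughly like this:

  1. Start a registry container, say on random port 45678
  2. Use skopeo copy to sync the required images from the cache Harbor into the registry container
  3. Pack the registry's directory into image-xxx.tgz

Over the past few days packaging started to go wrong: the concurrent skopeo copy calls were failing, the errors were not handled, and the resulting image-xxx.tgz had the wrong size.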

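Investigation

Call chain analysis

The build log was full of panics. Because skopeo was invoked from multiple threads, the Go panic stack traces were interleaved and out of order. The build machines all run CentOS 7, which is no longer maintained, and the skopeo on them was installed through the package manager, so the version is quite old: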

$ skopeo --version
skopeo version 0.1.40

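Using that version, I went to the source to trace the call chain. skopeo uses the cobra library, and the relevant frame in the stack comes from cmd/skopeo/copy.go: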

/builddir/build/BUILD/skopeo-be6146b0a8471b02e776134119a2c37dfb70d414/cmd/skopeo/copy.go:159 +0x94b fp=0xc0005f3920 sp=0xc0005f3678 pc=0x559635b91b0b

The code:

import (
	"github.com/containers/image/v5/copy"
	....
)

// https://github.com/containers/skopeo/blob/v0.1.40/cmd/skopeo/copy.go#L159-L167
_, err = copy.Image(ctx, policyContext, destRef, srcRef, &copy.Options{
	RemoveSignatures:      opts.removeSignatures,
	SignBy:                opts.signByFingerprint,
	ReportWriter:          stdout,
	SourceCtx:             sourceCtx,
	DestinationCtx:        destinationCtx,
	ForceManifestMIMEType: manifestType,
	ImageListSelection:    imageListSelection,
})

Following the import leads to:

// https://github.com/containers/skopeo/blob/v0.1.40/vendor/github.com/containers/image/v5/copy/copy.go#L173
func Image(ctx context.Context, policyContext *signature.PolicyContext, destRef, srcRef types.ImageReference, options *Options) (copiedManifest []byte, retErr error) {

To make searching easier, I saved all of the stack output to a file (txt) and searched the interleaved traces for copy/copy.go:, which gives:

$ grep -Po 'vendor.*?/copy/copy.go:\d+' txt | sort -u
vendor/src/github.com/containers/image/v5/copy/copy.go:1337
vendor/src/github.com/containers/image/v5/copy/copy.go:258
vendor/src/github.com/containers/image/v5/copy/copy.go:578
vendor/src/github.com/containers/image/v5/copy/copy.go:740
vendor/src/github.com/containers/image/v5/copy/copy.go:755
vendor/src/github.com/containers/image/v5/copy/copy.go:765
vendor/src/github.com/containers/image/v5/copy/copy.go:766
vendor/src/github.com/containers/image/v5/copy/copy.go:770
vendor/src/github.com/containers/image/v5/copy/copy.go:771
vendor/src/github.com/containers/image/v5/copy/copy.go:860
vendor/src/github.com/containers/image/v5/copy/copy.go:948
vendor/src/github.com/containers/image/v5/copy/copy.go:949
vendor/src/github.com/containers/image/v5/pkg/blobinfocache/boltdb/boltdb.go0x105/builddir/build/BUILD/skopeo-be6146b0a8471b02e776134119a2c37dfb70d414/vendor/src/github.com/containers/image/v5/copy/copy.go:174
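For readers who prefer to do this in Go rather than with grep, here is a rough equivalent of the `grep -Po ... | sort -u` pipeline above. It is only a sketch: the function name uniqueFrames and the sample input are made up for illustration, and the regular expression is the same one used in the grep command.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// frameRe matches the same file:line fragments as the grep pattern above.
// "." does not match a newline by default, so matching stays line-local.
var frameRe = regexp.MustCompile(`vendor.*?/copy/copy\.go:\d+`)

// uniqueFrames (hypothetical helper) returns the distinct matches in
// lexicographic order, like `sort -u`.
func uniqueFrames(log string) []string {
	seen := map[string]bool{}
	var out []string
	for _, m := range frameRe.FindAllString(log, -1) {
		if !seen[m] {
			seen[m] = true
			out = append(out, m)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Made-up sample input standing in for the saved stack output.
	log := "goroutine 12 [running]:\n" +
		"\t/b/vendor/src/x/copy/copy.go:258 +0x1\n" +
		"\t/b/vendor/src/x/copy/copy.go:1337 +0x2\n" +
		"\t/b/vendor/src/x/copy/copy.go:258 +0x3\n"
	for _, f := range uniqueFrames(log) {
		fmt.Println(f)
	}
}
```

Because the sort is lexicographic, a line such as copy.go:1337 sorts before copy.go:258, which is also why the grep output above is not in numeric order.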

From the information above and the source, the call chain is:

  • copy/copy.go#L173: func Image(ctx context.Context,
  • copy/copy.go#L258: if copiedManifest, _, _, err = c.copyOneImage(
  • copy/copy.go#L473: func (c *copier) copyOneImage
  • copy/copy.go#L578: if err := ic.copyLayers(ctx);
  • copy/copy.go#L704: func (ic *imageCopier) copyLayers(ctx context.Context)
  • copy/copy.go:766: go copyLayerHelper(i, srcLayer, progressPool); copyLayerHelper is a closure, and the next entry is inside it
  • copy/copy.go:755: cld.destInfo, cld.diffID, cld.err = ic.copyLayer(ctx, srcLayer, pool)
  • copy/copy.go:948: func (ic *imageCopier) copyLayer(ctx, which then reaches the first line of its body
  • copy/copy.go:949: cachedDiffID := ic.c.blobInfoCache.UncompressedDigest(srcInfo.Digest)

UncompressedDigest is a method of an interface:

// https://github.com/containers/skopeo/blob/v0.1.40/vendor/github.com/containers/image/v5/types/types.go#L177-L198
type BlobInfoCache interface {

	UncompressedDigest(anyDigest digest.Digest) digest.Digest

	RecordDigestUncompressedPair(anyDigest digest.Digest, uncompressed digest.Digest)
	RecordKnownLocation(transport ImageTransport, scope BICTransportScope, digest digest.Digest, location BICLocationReference)
	CandidateLocations(transport ImageTransport, scope BICTransportScope, digest digest.Digest, canSubstitute bool) []BICReplacementCandidate
}

Searching for where the blobInfoCache field in ic.c.blobInfoCache is assigned turns up:

// https://github.com/containers/skopeo/blob/v0.1.40/vendor/github.com/containers/image/v5/copy/copy.go#L173-L233
blobInfoCache: blobinfocache.DefaultCache(options.DestinationCtx)

It turns out the cache is stored in boltdb:

// https://github.com/containers/skopeo/blob/v0.1.40/vendor/github.com/containers/image/v5/pkg/blobinfocache/default.go
func DefaultCache(sys *types.SystemContext) types.BlobInfoCache {
	dir, err := blobInfoCacheDir(sys, getRootlessUID())
	if err != nil {
		logrus.Debugf("Error determining a location for %s, using a memory-only cache", blobInfoCacheFilename)
		return memory.New()
	}
	path := filepath.Join(dir, blobInfoCacheFilename)
	if err := os.MkdirAll(dir, 0700); err != nil {
		logrus.Debugf("Error creating parent directories for %s, using a memory-only cache: %v", blobInfoCacheFilename, err)
		return memory.New()
	}

	logrus.Debugf("Using blob info cache at %s", path)
	return boltdb.New(path)
}

On the build machine where the builds were failing:

$ cd /var/lib/containers/cache
$ ls -l
total 25740
-rw------- 1 root root 43134976 Oct 28 12:31 blob-info-cache-v1.boltdb
$ ls -al
total 25740
drwx------ 2 root root 39 Aug 27 14:18 .
drwxr-xr-x 4 root root 35 Aug 27 14:18 ..
-rw------- 1 root root 43134976 Oct 28 12:31 blob-info-cache-v1.boltdb

This also matches the boltdb frames in the stack traces:

$ grep -Po 'blobinfocache/boltdb/boltdb.go:\d+' txt  | sort -u
blobinfocache/boltdb/boltdb.go:108
blobinfocache/boltdb/boltdb.go:112
blobinfocache/boltdb/boltdb.go:114
blobinfocache/boltdb/boltdb.go:119
blobinfocache/boltdb/boltdb.go:124
blobinfocache/boltdb/boltdb.go:146
blobinfocache/boltdb/boltdb.go:172
blobinfocache/boltdb/boltdb.go:174
blobinfocache/boltdb/boltdb.go:175
blobinfocache/boltdb/boltdb.go:3010
blobinfocache/boltdb/boltdb.go:54
blobinfocache/boltdb/boltdb.go:56
blobinfocache/boltdb/boltdb.go:58
blobinfocache/boltdb/boltdb.go:65
blobinfocache/boltdb/boltdb.go:66
blobinfocache/boltdb/boltdb.go:67
blobinfocache/boltdb/boltdb.go:84

I won't go through this part of the call chain in detail. The problem is clearly in boltdb, inside the uncompressedDigest method:

// https://github.com/containers/skopeo/blob/v0.1.40/vendor/github.com/containers/image/v5/pkg/blobinfocache/boltdb/boltdb.go#L145C1-L146C57
func (bdc *cache) uncompressedDigest(tx *bolt.Tx, anyDigest digest.Digest) digest.Digest {
	if b := tx.Bucket(uncompressedDigestBucket); b != nil {

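A failure at this point means the boltdb file is corrupted. boltdb file corruption is something Docker and etcd users run into as well, and searching the log for the keyword confirms it: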

$ grep invalid txt
panic: invalid page type: 6432: 10

Cause

To reproduce it, I wrote a small CLI that lists the buckets in a boltdb file:

package main

import (
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintf(os.Stderr, "Usage: %s <bolt-db-file>\n", os.Args[0])
		os.Exit(1)
	}
	filename := os.Args[1]

	db, err := bolt.Open(filename, 0600, &bolt.Options{ReadOnly: true}) // read-only: never modify the cache file
	if err != nil {
		log.Fatalf("failed to open %s: %v", filename, err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		fmt.Printf("Top-level buckets in %s:\n", filename)
		return tx.ForEach(func(name []byte, _ *bolt.Bucket) error { // visits every top-level bucket
			fmt.Printf("- %s\n", name)
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}

Copy it to the build machine and run it:

$ ./bbolt-tool blob-info-cache-v1.boltdb
Top-level buckets in blob-info-cache-v1.boltdb:
panic: invalid page type: 6432: 10

goroutine 1 [running]:
go.etcd.io/bbolt.(*Cursor).search(0xc0000a5cd8, {0x7ff6b8ff8130, 0x47, 0x47}, 0x0?)
/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/cursor.go:286 +0x279
go.etcd.io/bbolt.(*Cursor).seek(0xc0000a5cd8, {0x7ff6b8ff8130?, 0xc000080140?, 0xc0000ac040?})
/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/cursor.go:162 +0x2e
go.etcd.io/bbolt.(*Bucket).Bucket(0xc0000aa018, {0x7ff6b8ff8130, 0x47, 0x4e8c40?})
/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/bucket.go:97 +0xb6
main.main.func1.(*Tx).ForEach.2({0x7ff6b8ff8130, 0x47, 0x47}, {0xc0000a5dd8?, 0x1?, 0x1?})
/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/tx.go:158 +0x45
go.etcd.io/bbolt.(*Bucket).ForEach(0x51c268?, 0xc0000a5de8)
/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/bucket.go:591 +0x89
go.etcd.io/bbolt.(*Tx).ForEach(...)
/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/tx.go:157
main.main.func1(0xc0000aa000)
/root/code/golang/bbolt/main.go:26 +0x9d
go.etcd.io/bbolt.(*DB).View(0x7ffca6c567a8?, 0xc0000a5f00)
/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/db.go:939 +0x6c
main.main()
/root/code/golang/bbolt/main.go:24 +0x1f0


On a healthy build machine:

$ ./bbolt-tool blob-info-cache-v1.boltdb 
Top-level buckets in blob-info-cache-v1.boltdb:
- knownLocations

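Fix

Based on the DefaultCache function above, one option is to create the path as a regular file (presumably the cache directory, so that os.MkdirAll fails) and thereby force the in-memory cache. However, newer skopeo versions already default to an sqlite cache: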

// If the format changes in an incompatible way, increase the version number.
blobInfoCacheFilename = "blob-info-cache-v1.sqlite"
// systemBlobInfoCacheDir is the directory containing the blob info cache (in blobInfocacheFilename) for root-running processes.
systemBlobInfoCacheDir = "/var/lib/containers/cache"

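I switched to a newer skopeo and also added a prefix to the output of each concurrent job, so that stack traces won't be scrambled the next time something like this happens. I then found that older release branches still go through the old logic, so on the build machines I moved (mv) the boltdb file aside to avoid affecting those branches.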
