Troubleshooting a panic when running skopeo copy from multiple threads
Background

Our internal build pipeline produces release packages roughly as follows:
Start a registry container on a random port, say 45678
Use skopeo copy to sync the required images from the cached Harbor into the registry container
Pack the registry's data directory into image-xxx.tgz
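Over the past few days we noticed that packaging was broken: the concurrent skopeo copy calls were failing, the errors went unhandled, and the resulting image-xxx.tgz ended up with the wrong size.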
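The root of the silent failure was that errors from the concurrent copies were dropped. As a minimal sketch of how that step could collect failures instead, assuming the copies are driven from Go: `copyAll`, `skopeoCopy`, `errEmptyImage` and the host `harbor.example.internal` are hypothetical names for illustration, not the actual build script.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"sync"
)

// errEmptyImage is returned before skopeo is invoked at all.
var errEmptyImage = errors.New("empty image reference")

// copyAll runs copyOne for every image with at most `workers` copies in
// flight and returns one error per failed image, so nothing is dropped.
func copyAll(images []string, workers int, copyOne func(string) error) []error {
	if workers < 1 {
		workers = 1
	}
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	sem := make(chan struct{}, workers) // bounds the concurrency
	for _, img := range images {
		wg.Add(1)
		go func(img string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			if err := copyOne(img); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("copy %q: %w", img, err))
				mu.Unlock()
			}
		}(img)
	}
	wg.Wait()
	return errs
}

// skopeoCopy shells out to skopeo. Both registry hosts are placeholders.
func skopeoCopy(image string) error {
	if image == "" {
		return errEmptyImage
	}
	src := "docker://harbor.example.internal/" + image
	dst := "docker://127.0.0.1:45678/" + image
	out, err := exec.Command("skopeo", "copy", src, dst).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%v: %s", err, out)
	}
	return nil
}

func main() {
	// Images are taken from the command line; with none given this is a no-op.
	errs := copyAll(os.Args[1:], 4, skopeoCopy)
	for _, e := range errs {
		fmt.Fprintln(os.Stderr, e)
	}
	fmt.Printf("%d copy error(s)\n", len(errs))
}
```

A caller would refuse to build the tarball whenever the returned slice is non-empty, which would have surfaced this problem on the first failed build.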
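Troubleshooting

Call chain analysis

The packaging log was full of panics. Because skopeo was invoked from multiple threads, the Go panic stack traces were interleaved and out of order. The build machines all run CentOS 7, which is no longer maintained, and the skopeo on them was installed through the package manager, so the version is fairly old: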
```bash
$ skopeo --version
skopeo version 0.1.40
```
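With the version known, I went through the matching source to reconstruct the call chain. skopeo is built on the cobra library, and the stack contains this frame from cmd/skopeo/copy.go: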
```
/builddir/build/BUILD/skopeo-be6146b0a8471b02e776134119a2c37dfb70d414/cmd/skopeo/copy.go:159 +0x94b fp=0xc0005f3920 sp=0xc0005f3678 pc=0x559635b91b0b
```
The code at that location:
```go
import (
	"github.com/containers/image/v5/copy"
	....
)

_, err = copy.Image(ctx, policyContext, destRef, srcRef, &copy.Options{
	RemoveSignatures:      opts.removeSignatures,
	SignBy:                opts.signByFingerprint,
	ReportWriter:          stdout,
	SourceCtx:             sourceCtx,
	DestinationCtx:        destinationCtx,
	ForceManifestMIMEType: manifestType,
	ImageListSelection:    imageListSelection,
})
```
Following the import leads to:
```go
func Image(ctx context.Context, policyContext *signature.PolicyContext,
	destRef, srcRef types.ImageReference, options *Options) (copiedManifest []byte, retErr error) {
```
To make searching easier, I saved all of the stack output to a file named txt, then searched the jumbled text for copy/copy.go: and found:
```bash
$ grep -Po 'vendor.*?/copy/copy.go:\d+' txt | sort -u
vendor/src/github.com/containers/image/v5/copy/copy.go:1337
vendor/src/github.com/containers/image/v5/copy/copy.go:258
vendor/src/github.com/containers/image/v5/copy/copy.go:578
vendor/src/github.com/containers/image/v5/copy/copy.go:740
vendor/src/github.com/containers/image/v5/copy/copy.go:755
vendor/src/github.com/containers/image/v5/copy/copy.go:765
vendor/src/github.com/containers/image/v5/copy/copy.go:766
vendor/src/github.com/containers/image/v5/copy/copy.go:770
vendor/src/github.com/containers/image/v5/copy/copy.go:771
vendor/src/github.com/containers/image/v5/copy/copy.go:860
vendor/src/github.com/containers/image/v5/copy/copy.go:948
vendor/src/github.com/containers/image/v5/copy/copy.go:949
vendor/src/github.com/containers/image/v5/pkg/blobinfocache/boltdb/boltdb.go0x105/builddir/build/BUILD/skopeo-be6146b0a8471b02e776134119a2c37dfb70d414/vendor/src/github.com/containers/image/v5/copy/copy.go:174
```
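The same extraction can be sketched in Go, which makes it easy to sort the frames numerically rather than lexically as `sort -u` does. This is only an illustration: `uniqueFrames` and the sample text are hypothetical. Keep in mind that interleaved output can glue unrelated digits onto a line number, so outliers deserve suspicion.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

// uniqueFrames returns the distinct line numbers that follow "<file>:" in
// text, sorted numerically. It matches anywhere in a line, because
// interleaved panic output may glue unrelated text in front of a path.
func uniqueFrames(text, file string) []int {
	re := regexp.MustCompile(regexp.QuoteMeta(file) + `:(\d+)`)
	seen := map[int]bool{}
	var lines []int
	for _, m := range re.FindAllStringSubmatch(text, -1) {
		n, err := strconv.Atoi(m[1])
		if err != nil || seen[n] {
			continue
		}
		seen[n] = true
		lines = append(lines, n)
	}
	sort.Ints(lines)
	return lines
}

func main() {
	// A made-up sample imitating two stack traces written at the same time.
	sample := "vendor/src/x/copy/copy.go:949 +0x1\n" +
		"boltdb.go0x105/builddir/vendor/src/x/copy/copy.go:174\n" +
		"vendor/src/x/copy/copy.go:949 +0x1\n"
	for _, n := range uniqueFrames(sample, "copy/copy.go") {
		fmt.Printf("copy/copy.go:%d\n", n)
	}
}
```

Numeric ordering puts the frames in source order, which makes it quicker to walk through the file top to bottom when reconstructing the call chain.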
Combining this with the source, the call chain is:
copy/copy.go#L173: func Image(ctx context.Context,
copy/copy.go#L258: if copiedManifest, _, _, err = c.copyOneImage(
copy/copy.go#L473: func (c *copier) copyOneImage
copy/copy.go#L578: if err := ic.copyLayers(ctx);
copy/copy.go#L704: func (ic *imageCopier) copyLayers(ctx context.Context)
copy/copy.go:766: go copyLayerHelper(i, srcLayer, progressPool), where copyLayerHelper is a closure whose body contains the next frame
copy/copy.go:755: cld.destInfo, cld.diffID, cld.err = ic.copyLayer(ctx, srcLayer, pool)
copy/copy.go:948: func (ic *imageCopier) copyLayer(ctx, whose first statement is the next frame
copy/copy.go:949: cachedDiffID := ic.c.blobInfoCache.UncompressedDigest(srcInfo.Digest)
UncompressedDigest is an interface method:
```go
type BlobInfoCache interface {
	UncompressedDigest(anyDigest digest.Digest) digest.Digest

	RecordDigestUncompressedPair(anyDigest digest.Digest, uncompressed digest.Digest)

	RecordKnownLocation(transport ImageTransport, scope BICTransportScope, digest digest.Digest, location BICLocationReference)

	CandidateLocations(transport ImageTransport, scope BICTransportScope, digest digest.Digest, canSubstitute bool) []BICReplacementCandidate
}
```
Searching for where the blobInfoCache field used in ic.c.blobInfoCache is assigned turns up:
```go
blobInfoCache: blobinfocache.DefaultCache(options.DestinationCtx)
```
It turns out the cache is stored in boltdb:
```go
func DefaultCache(sys *types.SystemContext) types.BlobInfoCache {
	dir, err := blobInfoCacheDir(sys, getRootlessUID())
	if err != nil {
		logrus.Debugf("Error determining a location for %s, using a memory-only cache", blobInfoCacheFilename)
		return memory.New()
	}
	path := filepath.Join(dir, blobInfoCacheFilename)
	if err := os.MkdirAll(dir, 0700); err != nil {
		logrus.Debugf("Error creating parent directories for %s, using a memory-only cache: %v", blobInfoCacheFilename, err)
		return memory.New()
	}

	logrus.Debugf("Using blob info cache at %s", path)
	return boltdb.New(path)
}
```
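The second fallback in DefaultCache depends on os.MkdirAll failing, which happens when the directory path already exists as a regular file. The sketch below reproduces only that decision with the standard library; `cacheKind` and `demo` are hypothetical helpers, not part of containers/image.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// cacheKind mimics only the decision structure of DefaultCache: it reports
// which kind of cache would be chosen for the given cache directory.
func cacheKind(dir string) string {
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return "memory" // MkdirAll failed, e.g. dir exists as a regular file
	}
	return "boltdb:" + filepath.Join(dir, "blob-info-cache-v1.boltdb")
}

// demo evaluates cacheKind for a directory path blocked by a regular file
// and for a fresh directory path, both inside a throwaway temp directory.
func demo() (blocked, fresh string, err error) {
	tmp, err := os.MkdirTemp("", "bic-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(tmp)

	blockedDir := filepath.Join(tmp, "cache")
	if err := os.WriteFile(blockedDir, nil, 0o600); err != nil {
		return "", "", err
	}
	return cacheKind(blockedDir), cacheKind(filepath.Join(tmp, "ok")), nil
}

func main() {
	blocked, fresh, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("directory path is a regular file ->", blocked)
	fmt.Println("directory can be created         ->", fresh)
}
```

Note that it is the directory, not the database file, that has to be blocked: boltdb.New(path) is only reached after MkdirAll succeeds.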
On the build machine where the build failed:
```bash
$ cd /var/lib/containers/cache
$ ls -l
total 25740
-rw------- 1 root root 43134976 Oct 28 12:31 blob-info-cache-v1.boltdb
$ ls -al
total 25740
drwx------ 2 root root       39 Aug 27 14:18 .
drwxr-xr-x 4 root root       35 Aug 27 14:18 ..
-rw------- 1 root root 43134976 Oct 28 12:31 blob-info-cache-v1.boltdb
```
This matches the boltdb-related frames in the stack output as well:
```bash
$ grep -Po 'blobinfocache/boltdb/boltdb.go:\d+' txt | sort -u
blobinfocache/boltdb/boltdb.go:108
blobinfocache/boltdb/boltdb.go:112
blobinfocache/boltdb/boltdb.go:114
blobinfocache/boltdb/boltdb.go:119
blobinfocache/boltdb/boltdb.go:124
blobinfocache/boltdb/boltdb.go:146
blobinfocache/boltdb/boltdb.go:172
blobinfocache/boltdb/boltdb.go:174
blobinfocache/boltdb/boltdb.go:175
blobinfocache/boltdb/boltdb.go:3010
blobinfocache/boltdb/boltdb.go:54
blobinfocache/boltdb/boltdb.go:56
blobinfocache/boltdb/boltdb.go:58
blobinfocache/boltdb/boltdb.go:65
blobinfocache/boltdb/boltdb.go:66
blobinfocache/boltdb/boltdb.go:67
blobinfocache/boltdb/boltdb.go:84
```
I won't trace this part of the call chain in detail. The problem is clearly boltdb, inside the uncompressedDigest method:
```go
func (bdc *cache) uncompressedDigest(tx *bolt.Tx, anyDigest digest.Digest) digest.Digest {
	if b := tx.Bucket(uncompressedDigestBucket); b != nil {
```
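A failure at this point means the boltdb file is corrupted. Corrupted boltdb files are a known issue that docker and etcd users run into as well. Searching the log for the keyword confirms it: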
```bash
$ grep invalid txt
panic: invalid page type: 6432: 10
```
Cause

To reproduce the problem, I wrote a small CLI that lists the buckets in a boltdb file:
```go
package main

import (
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintf(os.Stderr, "Usage: %s <bolt-db-file>\n", os.Args[0])
		os.Exit(1)
	}
	filename := os.Args[1]

	db, err := bolt.Open(filename, 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatalf("failed to open %s: %v", filename, err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		fmt.Printf("Top-level buckets in %s:\n", filename)
		return tx.ForEach(func(name []byte, _ *bolt.Bucket) error {
			fmt.Printf("- %s\n", name)
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```
Copying the binary to the failing build machine and running it:
```bash
$ ./bbolt-tool blob-info-cache-v1.boltdb
Top-level buckets in blob-info-cache-v1.boltdb:
panic: invalid page type: 6432: 10

goroutine 1 [running]:
go.etcd.io/bbolt.(*Cursor).search(0xc0000a5cd8, {0x7ff6b8ff8130, 0x47, 0x47}, 0x0?)
	/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/cursor.go:286 +0x279
go.etcd.io/bbolt.(*Cursor).seek(0xc0000a5cd8, {0x7ff6b8ff8130?, 0xc000080140?, 0xc0000ac040?})
	/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/cursor.go:162 +0x2e
go.etcd.io/bbolt.(*Bucket).Bucket(0xc0000aa018, {0x7ff6b8ff8130, 0x47, 0x4e8c40?})
	/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/bucket.go:97 +0xb6
main.main.func1.(*Tx).ForEach.2({0x7ff6b8ff8130, 0x47, 0x47}, {0xc0000a5dd8?, 0x1?, 0x1?})
	/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/tx.go:158 +0x45
go.etcd.io/bbolt.(*Bucket).ForEach(0x51c268?, 0xc0000a5de8)
	/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/bucket.go:591 +0x89
go.etcd.io/bbolt.(*Tx).ForEach(...)
	/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/tx.go:157
main.main.func1(0xc0000aa000)
	/root/code/golang/bbolt/main.go:26 +0x9d
go.etcd.io/bbolt.(*DB).View(0x7ffca6c567a8?, 0xc0000a5f00)
	/root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/db.go:939 +0x6c
main.main()
	/root/code/golang/bbolt/main.go:24 +0x1f0
```
On a healthy build machine:
```bash
$ ./bbolt-tool blob-info-cache-v1.boltdb
Top-level buckets in blob-info-cache-v1.boltdb:
- knownLocations
```
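As the output from the failing machine shows, the corruption surfaces as a panic rather than as a returned error. If a check like this were to run as an automated health probe, one option is to convert the panic into an ordinary error with recover. The sketch below uses only the standard library: `checkDB` and `simulateCorrupt` are hypothetical names, and `simulateCorrupt` merely stands in for a bbolt traversal such as the db.View call in the tool above.

```go
package main

import "fmt"

// checkDB runs inspect and converts a panic raised inside it (on the same
// goroutine) into an ordinary error, so the caller can react to it.
func checkDB(inspect func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("database looks corrupted: %v", r)
		}
	}()
	return inspect()
}

// simulateCorrupt stands in for a bucket traversal that hits a bad page.
// A real probe would wrap the db.View call instead.
func simulateCorrupt() error {
	panic("invalid page type: 6432: 10")
}

func main() {
	if err := checkDB(simulateCorrupt); err != nil {
		// A real probe would exit non-zero or move the cache file aside here.
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("database is readable")
}
```

The recover only works because the traversal runs on the calling goroutine. A panic on any other goroutine would still terminate the process.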
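Solution

Based on the DefaultCache function above, one workaround would be to create the cache directory path as a regular file, so that os.MkdirAll fails and skopeo falls back to the in-memory cache. However, a look at newer skopeo versions shows that they already default to a sqlite cache: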
```go
blobInfoCacheFilename = "blob-info-cache-v1.sqlite"

systemBlobInfoCacheDir = "/var/lib/containers/cache"
```
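So we switched to a newer skopeo, and also added a prefix to the output of each concurrent task so that stack traces do not get jumbled the next time something similar happens. We then found that older release branches still go through the old logic, so on the build machines we also moved the boltdb file aside with mv to keep it from affecting those branches.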
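The output prefixing can be done in several ways. A minimal sketch, assuming the workers are launched from Go and each gets its own writer: `prefixWriter` and `renderPrefixed` are hypothetical names, not the code used in our pipeline. The writer buffers partial lines and emits only complete ones, under a lock shared by all writers that target the same output.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// prefixWriter emits only complete lines, each with a prefix. Use one
// prefixWriter per worker and share mu between all writers that target
// the same underlying output.
type prefixWriter struct {
	mu     *sync.Mutex
	out    io.Writer
	prefix string
	buf    []byte // pending partial line, owned by a single worker
}

func (w *prefixWriter) Write(p []byte) (int, error) {
	w.buf = append(w.buf, p...)
	for {
		i := bytes.IndexByte(w.buf, '\n')
		if i < 0 {
			return len(p), nil
		}
		w.mu.Lock()
		_, err := fmt.Fprintf(w.out, "%s%s", w.prefix, w.buf[:i+1])
		w.mu.Unlock()
		if err != nil {
			return 0, err
		}
		w.buf = w.buf[i+1:]
	}
}

// Flush writes a trailing partial line, if any, terminated by a newline.
func (w *prefixWriter) Flush() error {
	if len(w.buf) == 0 {
		return nil
	}
	w.mu.Lock()
	defer w.mu.Unlock()
	_, err := fmt.Fprintf(w.out, "%s%s\n", w.prefix, w.buf)
	w.buf = nil
	return err
}

// renderPrefixed feeds chunks through a prefixWriter and returns the result.
func renderPrefixed(prefix string, chunks ...string) string {
	var out bytes.Buffer
	w := &prefixWriter{mu: &sync.Mutex{}, out: &out, prefix: prefix}
	for _, c := range chunks {
		io.WriteString(w, c)
	}
	w.Flush()
	return out.String()
}

func main() {
	// Chunks arrive split at arbitrary points, as process output does.
	fmt.Print(renderPrefixed("[image-a] ", "pan", "ic: invalid page\ngoroutine 1", " [running]:\n"))
}
```

With os/exec, the same writer value can be assigned to both cmd.Stdout and cmd.Stderr of a worker, and Flush called after cmd.Wait returns. Interleaved lines from different workers then remain attributable to their source.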