# Debugging a panic from running skopeo copy in multiple threads
## Background

Our internal build-and-package job roughly does the following:
1. Start a registry container, say on a random port such as 45678.
2. Use skopeo copy to sync the required images from the caching harbor into that registry container.
3. Tar up the registry's data directory as image-xxx.tgz.
 
Over the past few days the packaging had been broken: errors from the concurrent skopeo copy calls were not handled, so the final image-xxx.tgz came out with the wrong size.
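The immediate lesson from the unhandled errors is that every concurrent copy needs to report its failure and abort the packaging step. Below is a minimal sketch of that idea, assuming the build script drives skopeo through os/exec; the harbor host, image list and registry port are made up for illustration:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"

	"golang.org/x/sync/errgroup"
)

func main() {
	// Hypothetical image list and registry port, for illustration only.
	images := []string{"library/nginx:1.25", "library/redis:7"}
	port := 45678

	g, ctx := errgroup.WithContext(context.Background())
	for _, img := range images {
		img := img // capture the loop variable (needed before Go 1.22)
		g.Go(func() error {
			src := "docker://harbor.example.local/" + img
			dst := fmt.Sprintf("docker://127.0.0.1:%d/%s", port, img)
			cmd := exec.CommandContext(ctx, "skopeo", "copy", "--dest-tls-verify=false", src, dst)
			out, err := cmd.CombinedOutput()
			if err != nil {
				// Propagate the failure instead of silently packaging a partial registry.
				return fmt.Errorf("copy %s failed: %v\n%s", img, err, out)
			}
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		// Any failed copy must fail the whole packaging step.
		panic(err)
	}
}
```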
## Investigation

### Call-chain analysis

The build log was full of panics, and because skopeo was invoked from multiple threads the lines of the Go panic stacks were interleaved and out of order. The build machines all run CentOS 7, which is no longer maintained, and the skopeo on them was installed from the distro packages, so it is a rather old version:
```
$ skopeo --version
skopeo version 0.1.40
```
Knowing the version, we can read the matching source to reconstruct the call chain. skopeo is built on the cobra library, and the stack frame points into cmd/skopeo/copy.go:
```
/builddir/build/BUILD/skopeo-be6146b0a8471b02e776134119a2c37dfb70d414/cmd/skopeo/copy.go:159 +0x94b fp=0xc0005f3920 sp=0xc0005f3678 pc=0x559635b91b0b
```
The code there:
```go
import (
	"github.com/containers/image/v5/copy"
	// ...
)

_, err = copy.Image(ctx, policyContext, destRef, srcRef, &copy.Options{
	RemoveSignatures:      opts.removeSignatures,
	SignBy:                opts.signByFingerprint,
	ReportWriter:          stdout,
	SourceCtx:             sourceCtx,
	DestinationCtx:        destinationCtx,
	ForceManifestMIMEType: manifestType,
	ImageListSelection:    imageListSelection,
})
```
Following that import leads to:
```go
func Image(ctx context.Context, policyContext *signature.PolicyContext, destRef, srcRef types.ImageReference, options *Options) (copiedManifest []byte, retErr error) {
```
To make searching easier, I saved all the stack output to a file and grepped the jumbled text for copy/copy.go references:
```
$ grep -Po 'vendor.*?/copy/copy.go:\d+' txt | sort -u
vendor/src/github.com/containers/image/v5/copy/copy.go:1337
vendor/src/github.com/containers/image/v5/copy/copy.go:258
vendor/src/github.com/containers/image/v5/copy/copy.go:578
vendor/src/github.com/containers/image/v5/copy/copy.go:740
vendor/src/github.com/containers/image/v5/copy/copy.go:755
vendor/src/github.com/containers/image/v5/copy/copy.go:765
vendor/src/github.com/containers/image/v5/copy/copy.go:766
vendor/src/github.com/containers/image/v5/copy/copy.go:770
vendor/src/github.com/containers/image/v5/copy/copy.go:771
vendor/src/github.com/containers/image/v5/copy/copy.go:860
vendor/src/github.com/containers/image/v5/copy/copy.go:948
vendor/src/github.com/containers/image/v5/copy/copy.go:949
vendor/src/github.com/containers/image/v5/pkg/blobinfocache/boltdb/boltdb.go0x105/builddir/build/BUILD/skopeo-be6146b0a8471b02e776134119a2c37dfb70d414/vendor/src/github.com/containers/image/v5/copy/copy.go:174
```
Combining that with the source, the call chain is:
- copy/copy.go:173: `func Image(ctx context.Context, ...)`
- copy/copy.go:258: `if copiedManifest, _, _, err = c.copyOneImage(...)`
- copy/copy.go:473: `func (c *copier) copyOneImage(...)`
- copy/copy.go:578: `if err := ic.copyLayers(ctx); ...`
- copy/copy.go:704: `func (ic *imageCopier) copyLayers(ctx context.Context)`
- copy/copy.go:766: `go copyLayerHelper(i, srcLayer, progressPool)`; copyLayerHelper is the closure declared a few lines above, and inside it
- copy/copy.go:755: `cld.destInfo, cld.diffID, cld.err = ic.copyLayer(ctx, srcLayer, pool)`
- copy/copy.go:948: `func (ic *imageCopier) copyLayer(ctx, ...)`, whose first statement is
- copy/copy.go:949: `cachedDiffID := ic.c.blobInfoCache.UncompressedDigest(srcInfo.Digest)`
UncompressedDigest is a method on an interface:
```go
type BlobInfoCache interface {
	UncompressedDigest(anyDigest digest.Digest) digest.Digest
	RecordDigestUncompressedPair(anyDigest digest.Digest, uncompressed digest.Digest)
	RecordKnownLocation(transport ImageTransport, scope BICTransportScope, digest digest.Digest, location BICLocationReference)
	CandidateLocations(transport ImageTransport, scope BICTransportScope, digest digest.Digest, canSubstitute bool) []BICReplacementCandidate
}
```
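As a rough mental model of what this cache is for (this is not the library's actual implementation): it maps a blob digest, possibly of a compressed layer, to the digest of its uncompressed form, so copyLayer can reuse a known DiffID instead of decompressing the layer again. Only the two digest-related methods are sketched here, using github.com/opencontainers/go-digest:

```go
package blobcache

import digest "github.com/opencontainers/go-digest"

// memoryBlobInfoCache is a toy, map-backed stand-in for the real cache,
// shown only to illustrate what UncompressedDigest is expected to answer.
type memoryBlobInfoCache struct {
	uncompressed map[digest.Digest]digest.Digest
}

// UncompressedDigest returns the known uncompressed digest for anyDigest,
// or "" if no such pair has ever been recorded.
func (m *memoryBlobInfoCache) UncompressedDigest(anyDigest digest.Digest) digest.Digest {
	return m.uncompressed[anyDigest]
}

// RecordDigestUncompressedPair remembers that anyDigest decompresses to uncompressed.
func (m *memoryBlobInfoCache) RecordDigestUncompressedPair(anyDigest, uncompressed digest.Digest) {
	m.uncompressed[anyDigest] = uncompressed
}
```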
Searching for where the blobInfoCache behind ic.c.blobInfoCache is assigned turns up:
```go
blobInfoCache: blobinfocache.DefaultCache(options.DestinationCtx),
```
which shows the cache is backed by boltdb:
```go
func DefaultCache(sys *types.SystemContext) types.BlobInfoCache {
	dir, err := blobInfoCacheDir(sys, getRootlessUID())
	if err != nil {
		logrus.Debugf("Error determining a location for %s, using a memory-only cache", blobInfoCacheFilename)
		return memory.New()
	}
	path := filepath.Join(dir, blobInfoCacheFilename)
	if err := os.MkdirAll(dir, 0700); err != nil {
		logrus.Debugf("Error creating parent directories for %s, using a memory-only cache: %v", blobInfoCacheFilename, err)
		return memory.New()
	}

	logrus.Debugf("Using blob info cache at %s", path)
	return boltdb.New(path)
}
```
On the build machine where builds were failing:
```
$ cd /var/lib/containers/cache
$ ls -l
total 25740
-rw------- 1 root root 43134976 Oct 28 12:31 blob-info-cache-v1.boltdb
$ ls -al
total 25740
drwx------ 2 root root       39 Aug 27 14:18 .
drwxr-xr-x 4 root root       35 Aug 27 14:18 ..
-rw------- 1 root root 43134976 Oct 28 12:31 blob-info-cache-v1.boltdb
```
This also matches the boltdb frames found in the stacks:
```
$ grep -Po 'blobinfocache/boltdb/boltdb.go:\d+' txt | sort -u
blobinfocache/boltdb/boltdb.go:108
blobinfocache/boltdb/boltdb.go:112
blobinfocache/boltdb/boltdb.go:114
blobinfocache/boltdb/boltdb.go:119
blobinfocache/boltdb/boltdb.go:124
blobinfocache/boltdb/boltdb.go:146
blobinfocache/boltdb/boltdb.go:172
blobinfocache/boltdb/boltdb.go:174
blobinfocache/boltdb/boltdb.go:175
blobinfocache/boltdb/boltdb.go:3010
blobinfocache/boltdb/boltdb.go:54
blobinfocache/boltdb/boltdb.go:56
blobinfocache/boltdb/boltdb.go:58
blobinfocache/boltdb/boltdb.go:65
blobinfocache/boltdb/boltdb.go:66
blobinfocache/boltdb/boltdb.go:67
blobinfocache/boltdb/boltdb.go:84
```
I won't trace this part of the call chain line by line; it is enough to see that the failure happens in the boltdb access inside UncompressedDigest:
```go
func (bdc *cache) UncompressedDigest(anyDigest digest.Digest) digest.Digest {
	// ... inside a bolt read transaction:
	if b := tx.Bucket(uncompressedDigestBucket); b != nil {
		// ...
```
A panic here means the boltdb file itself is corrupted. Corrupted boltdb files are a well-known failure mode (docker and etcd users run into it as well), and searching the log for the keyword confirms it:
```
$ grep invalid txt
panic: invalid page type: 6432: 10
```
## Cause

To reproduce it, I wrote a small CLI that opens the boltdb file and lists its buckets:
```go
package main

import (
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintf(os.Stderr, "Usage: %s <bolt-db-file>\n", os.Args[0])
		os.Exit(1)
	}
	filename := os.Args[1]

	db, err := bolt.Open(filename, 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatalf("failed to open %s: %v", filename, err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		fmt.Printf("Top-level buckets in %s:\n", filename)
		return tx.ForEach(func(name []byte, _ *bolt.Bucket) error {
			fmt.Printf("- %s\n", name)
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```
Copy it to the affected build machine and run it:
```
$ ./bbolt-tool blob-info-cache-v1.boltdb
Top-level buckets in blob-info-cache-v1.boltdb:
panic: invalid page type: 6432: 10

goroutine 1 [running]:
go.etcd.io/bbolt.(*Cursor).search(0xc0000a5cd8, {0x7ff6b8ff8130, 0x47, 0x47}, 0x0?)
        /root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/cursor.go:286 +0x279
go.etcd.io/bbolt.(*Cursor).seek(0xc0000a5cd8, {0x7ff6b8ff8130?, 0xc000080140?, 0xc0000ac040?})
        /root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/cursor.go:162 +0x2e
go.etcd.io/bbolt.(*Bucket).Bucket(0xc0000aa018, {0x7ff6b8ff8130, 0x47, 0x4e8c40?})
        /root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/bucket.go:97 +0xb6
main.main.func1.(*Tx).ForEach.2({0x7ff6b8ff8130, 0x47, 0x47}, {0xc0000a5dd8?, 0x1?, 0x1?})
        /root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/tx.go:158 +0x45
go.etcd.io/bbolt.(*Bucket).ForEach(0x51c268?, 0xc0000a5de8)
        /root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/bucket.go:591 +0x89
go.etcd.io/bbolt.(*Tx).ForEach(...)
        /root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/tx.go:157
main.main.func1(0xc0000aa000)
        /root/code/golang/bbolt/main.go:26 +0x9d
go.etcd.io/bbolt.(*DB).View(0x7ffca6c567a8?, 0xc0000a5f00)
        /root/go/pkg/mod/go.etcd.io/bbolt@v1.4.3/db.go:939 +0x6c
main.main()
        /root/code/golang/bbolt/main.go:24 +0x1f0
```
On a healthy build machine:
```
$ ./bbolt-tool blob-info-cache-v1.boltdb
Top-level buckets in blob-info-cache-v1.boltdb:
- knownLocations
```
## Solution

Judging from DefaultCache above, one workaround is to create the cache path as a regular file so that skopeo falls back to the memory-only cache. But a quick look at newer skopeo shows it already defaults to an sqlite-backed blob info cache:
```go
blobInfoCacheFilename = "blob-info-cache-v1.sqlite"

systemBlobInfoCacheDir = "/var/lib/containers/cache"
```
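For completeness, the memory-cache workaround mentioned above relies on nothing more than os.MkdirAll failing when the cache path already exists as a regular file, which is exactly the branch in DefaultCache that returns memory.New(). A tiny demo of that mechanism, using a throwaway path instead of the real /var/lib/containers/cache:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Stand-in for the real cache directory; do NOT run this against
	// /var/lib/containers/cache on a machine you care about.
	const dir = "/tmp/containers-cache-demo"

	os.Remove(dir)
	f, err := os.Create(dir) // create the path as a plain file
	if err != nil {
		panic(err)
	}
	f.Close()

	// MkdirAll now fails with "not a directory", so DefaultCache would
	// log a debug message and fall back to the memory-only cache.
	fmt.Println(os.MkdirAll(dir, 0700))
}
```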
In the end we switched to a newer skopeo, and also prefixed the output of each concurrent worker so that next time the stacks won't be jumbled beyond recognition. Builds on older branches still go through the old logic, so on the build machine we moved the corrupted boltdb file out of the way to keep it from affecting other branches.
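For the output prefixes, the idea is simply to buffer each worker's combined output and print it in one go with a per-image tag, so concurrent panics can no longer interleave in the build log. A sketch under the same assumption that the build script shells out to skopeo (image names and registries are again made up):

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"sync"
)

var printMu sync.Mutex

// runTagged runs a command, buffers its combined output, and prints it
// atomically with every line prefixed by a tag.
func runTagged(tag, name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()

	printMu.Lock()
	defer printMu.Unlock()
	sc := bufio.NewScanner(bytes.NewReader(out))
	for sc.Scan() {
		fmt.Printf("[%s] %s\n", tag, sc.Text())
	}
	return err
}

func main() {
	var wg sync.WaitGroup
	// Hypothetical image names, for illustration only.
	for _, img := range []string{"nginx", "redis"} {
		wg.Add(1)
		go func(img string) {
			defer wg.Done()
			_ = runTagged(img, "skopeo", "copy",
				"docker://harbor.example.local/library/"+img+":latest",
				"docker://127.0.0.1:45678/library/"+img+":latest")
		}(img)
	}
	wg.Wait()
}
```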