On-prem deployments: cri-dockerd logs "Pulling the image without credentials. Image: reg.xxx.lan:5000/xxx/pause:3.9"
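Background

In on-prem (private) deployments, every environment gets its own internal image registry. One day the pods in a customer environment could not start. The cause turned out to be that image GC had removed the pause image and cri-dockerd could no longer pull it, even though pulling it manually worked fine.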
Resolving it

I had run into this before but was too busy to dig in at the time. Today I had time to take a look.
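Environment

The problem does not depend on the cri-dockerd version. The service is deployed with systemd, following the official documentation: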
    $ systemctl cat --no-pager cri-dockerd
    ...
    [Service]
    Type=notify
    ExecStart=/data/kube/bin/cri-dockerd \
      --container-runtime-endpoint unix:///var/run/cri-dockerd.sock \
      --network-plugin=cni \
      --streaming-bind-addr=127.0.0.1 \
      --cni-bin-dir=/data/kube/bin/ \
      --pod-infra-container-image=reg.xxx.lan:5000/xxx/pause:3.9
    ...
The log shows:
    $ journalctl -xe -u cri-dockerd
    Apr 11 15:11:48 xxx cri-dockerd[5894]: level=info msg="Pulling the image without credentials. Image: reg.xxx.lan:5000/xxx/pause:3.9"
    Apr 11 15:12:14 xxx cri-dockerd[5894]: level=info msg="Pulling the image without credentials. Image: reg.xxx.lan:5000/xxx/pause:3.9"
    Apr 11 15:13:11 xxx cri-dockerd[5894]: level=info msg="Pulling the image without credentials. Image: reg.xxx.lan:5000/xxx/pause:3.9"
Reading the source

Searching for the log message leads to the following function:
    func ensureSandboxImageExists(client libdocker.DockerClientInterface, image string) error {
        _, err := client.InspectImageByRef(image)
        if err == nil {
            return nil
        }
        if !libdocker.IsImageNotFoundError(err) {
            return fmt.Errorf("failed to inspect sandbox image %q: %v", image, err)
        }

        repoToPull, _, _, err := utils.ParseImageName(image)
        if err != nil {
            return err
        }

        keyring := credentialprovider.NewDockerKeyring()
        creds, withCredentials := keyring.Lookup(repoToPull)
        if !withCredentials {
            logrus.Infof("Pulling the image without credentials. Image: %v", image)

            err := client.PullImage(image, dockerregistry.AuthConfig{}, dockertypes.ImagePullOptions{})
            if err != nil {
                return fmt.Errorf("failed pulling image %q: %v", image, err)
            }

            return nil
        }

        var pullErrs []error
        for _, currentCreds := range creds {
            authConfig := dockerregistry.AuthConfig(currentCreds)
            err := client.PullImage(image, authConfig, dockertypes.ImagePullOptions{})
            if err == nil {
                return nil
            }

            pullErrs = append(pullErrs, err)
        }

        return errors.NewAggregate(pullErrs)
    }
Following credentialprovider.NewDockerKeyring() down, the logic ends up under ./vendor/k8s.io/kubernetes/pkg/credentialprovider/:
    func init() {
        RegisterCredentialProvider(".dockercfg",
            &CachingDockerConfigProvider{
                Provider: &defaultDockerConfigProvider{},
                Lifetime: 5 * time.Minute,
            })
    }
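The CachingDockerConfigProvider above wraps another provider and caches its result, so the file is re-read only after the configured lifetime has passed. The file itself is read in: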
    func (d *defaultDockerConfigProvider) Provide(image string) DockerConfig {
        if cfg, err := ReadDockerConfigFile(); err == nil {
            return cfg
        } else if !os.IsNotExist(err) {
            klog.V(2).Infof("Docker config file not found: %v", err)
        }
        return DockerConfig{}
    }
A few jumps later we land in ./vendor/k8s.io/kubernetes/pkg/credentialprovider/config.go:
    var (
        preferredPathLock sync.Mutex
        preferredPath     = ""
        workingDirPath    = ""
        homeDirPath, _    = os.UserHomeDir()
        rootDirPath       = "/"
        homeJSONDirPath   = filepath.Join(homeDirPath, ".docker")
        rootJSONDirPath   = filepath.Join(rootDirPath, ".docker")

        configFileName     = ".dockercfg"
        configJSONFileName = "config.json"
    )
    ...
    func DefaultDockercfgPaths() []string {
        return []string{GetPreferredDockercfgPath(), workingDirPath, homeDirPath, rootDirPath}
    }

    func ReadDockercfgFile(searchPaths []string) (cfg DockerConfig, err error) {
        if len(searchPaths) == 0 {
            searchPaths = DefaultDockercfgPaths()
        }

        for _, configPath := range searchPaths {
        ...
The search-path logic looks fine as well: cri-dockerd runs as root, /root/.docker/config.json exists, and there is nothing unusual about its permissions.
Debugging

Pulling manually works, so the real question is why the process does not read /root/.docker/config.json. I downloaded the source and debugged it with dlv:
    $ systemctl stop cri-dockerd kubelet
    $ docker rmi -f reg.xxx.lan:5000/xxx/pause:3.9
    $ reboot # restore the state where the image has not been pulled, then debug
    $ dlv debug main.go -- --container-runtime-endpoint unix:///var/run/cri-dockerd.sock \
      --network-plugin=cni \
      --streaming-bind-addr=127.0.0.1 \
      --cni-bin-dir=/data/kube/bin/ \
      --pod-infra-container-image=reg.xxx.lan:5000/xxx/pause:3.9
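In that session execution never entered the if !withCredentials { branch, which was puzzling. A binary I built myself and swapped into the service did reproduce the problem, so I decided to look at it with dlv attach: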
    $ go build -gcflags="all=-N -l" -o cri-dockerd
    $ systemctl stop cri-dockerd
    $ \cp cri-dockerd /data/kube/bin/cri-dockerd
    $ systemctl start cri-dockerd
    $ dlv attach $(pgrep cri-dockerd)
After setting three breakpoints and running continue, the four search paths returned by DefaultDockerConfigJSONPaths() turned out to be wrong:
    (dlv) c
    > k8s.io/kubernetes/pkg/credentialprovider.ReadDockerConfigJSONFile() ./vendor/k8s.io/kubernetes/pkg/credentialprovider/config.go:138 (hits goroutine(677):1 total:1) (PC: 0x2285942)
       133: // if searchPaths is empty, the default paths are used.
       134: func ReadDockerConfigJSONFile(searchPaths []string) (cfg DockerConfig, err error) {
       135:     if len(searchPaths) == 0 {
       136:         searchPaths = DefaultDockerConfigJSONPaths()
       137:     }
    => 138:     for _, configPath := range searchPaths {
       139:         absDockerConfigFileLocation, err := filepath.Abs(filepath.Join(configPath, configJSONFileName))
       140:         if err != nil {
       141:             klog.Errorf("while trying to canonicalize %s: %v", configPath, err)
       142:             continue
       143:         }
    (dlv) p searchPaths
    []string len: 4, cap: 4, [
        "",
        "",
        ".docker",
        "/.docker",
    ]
homeJSONDirPath is wrong too:
    (dlv) p homeJSONDirPath
    ".docker"
In the code its value comes from:
    homeDirPath, _ = os.UserHomeDir()
    ...
    homeJSONDirPath = filepath.Join(homeDirPath, ".docker")
Let's call os.UserHomeDir() in the debugger and see:
    (dlv) call os.UserHomeDir()
    > k8s.io/kubernetes/pkg/credentialprovider.ReadDockerConfigJSONFile() ./vendor/k8s.io/kubernetes/pkg/credentialprovider/config.go:138 (PC: 0x2285942)
    Values returned:
        ~r0: ""
        ~r1: error(*errors.errorString) *{
            s: "$HOME is not defined",}
So the HOME variable is not set at all. Let's check the environment the process was started with via procfs:
    $ xargs -0 -n1 < /proc/$(pgrep cri-dockerd)/environ
    LANG=en_US.UTF-8
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
    NOTIFY_SOCKET=/run/systemd/notify
    LISTEN_PID=5894
    LISTEN_FDS=1
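The same check can be scripted. This is a minimal sketch, assuming Linux procfs, where the environ file is a list of NUL-separated KEY=VALUE entries. lookupEnv is a hypothetical helper, and the example reads /proc/self/environ so that it runs without knowing the cri-dockerd PID.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// lookupEnv scans a NUL-separated environment block (the /proc/<pid>/environ
// format) and reports whether key is present, along with its value.
func lookupEnv(environ []byte, key string) (string, bool) {
	prefix := []byte(key + "=")
	for _, entry := range bytes.Split(environ, []byte{0}) {
		if bytes.HasPrefix(entry, prefix) {
			return string(entry[len(prefix):]), true
		}
	}
	return "", false
}

func main() {
	// Replace "self" with the PID of the service to inspect that process instead.
	data, err := os.ReadFile("/proc/self/environ")
	if err != nil {
		fmt.Println("cannot read environ:", err)
		return
	}
	home, ok := lookupEnv(data, "HOME")
	fmt.Printf("HOME set: %v, value: %q\n", ok, home)
}
```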
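So systemd does not set $HOME for this service. It turns out that HOME=/root only appears in the environment once User is set in the unit. This also explains why the earlier dlv session, launched from a shell where HOME was already set, behaved normally.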
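Solutions

There are several ways to fix this:

- Set WorkingDirectory in the systemd unit file:
  - set it directly to /root, so that the relative search path .docker resolves to /root/.docker
  - or copy a config.json produced by docker login into the WorkingDirectory of the process, named config.json or .dockercfg
- Set User=root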
A PR with the fix has been submitted: Mirantis/cri-dockerd/pull/349