zhangguanzhang's Blog

Fixing cri-dockerd failing to pull the pause image from a registry that requires authentication

2024/04/11

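In an on-premises deployment, cri-dockerd logs: Pulling the image without credentials. Image: reg.xxx.lan:5000/xxx/pause:3.9

Background

In on-premises deployments, every environment gets an internal image registry. One day, pods in a customer environment could not start. It turned out that after image GC had removed the pause image, cri-dockerd could no longer pull it, even though pulling it manually worked fine.

Troubleshooting

I had run into this before but was too busy to dig in at the time. Today I had time to look into it.

Environment

The problem is independent of the cri-dockerd version. The service is deployed with systemd, following the official documentation: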

$ systemctl cat --no-pager cri-dockerd
...
[Service]
Type=notify
ExecStart=/data/kube/bin/cri-dockerd \
  --container-runtime-endpoint unix:///var/run/cri-dockerd.sock \
  --network-plugin=cni \
  --streaming-bind-addr=127.0.0.1 \
  --cni-bin-dir=/data/kube/bin/ \
  --pod-infra-container-image=reg.xxx.lan:5000/xxx/pause:3.9
...

The error messages are:

$ journalctl -xe -u cri-dockerd
Apr 11 15:11:48 xxx cri-dockerd[5894]: level=info msg="Pulling the image without credentials. Image: reg.xxx.lan:5000/xxx/pause:3.9"
Apr 11 15:12:14 xxx cri-dockerd[5894]: level=info msg="Pulling the image without credentials. Image: reg.xxx.lan:5000/xxx/pause:3.9"
Apr 11 15:13:11 xxx cri-dockerd[5894]: level=info msg="Pulling the image without credentials. Image: reg.xxx.lan:5000/xxx/pause:3.9"

Reading the source

Searching for the log message leads to the following function:

// https://github.com/Mirantis/cri-dockerd/blob/b138f5226ae901b99ea34d40ab1eaed1c26445a4/core/sandbox_helpers.go#L408-L448
func ensureSandboxImageExists(client libdocker.DockerClientInterface, image string) error {
	_, err := client.InspectImageByRef(image)
	if err == nil {
		return nil
	}
	if !libdocker.IsImageNotFoundError(err) {
		return fmt.Errorf("failed to inspect sandbox image %q: %v", image, err)
	}

	repoToPull, _, _, err := utils.ParseImageName(image)
	if err != nil {
		return err
	}

	keyring := credentialprovider.NewDockerKeyring()
	creds, withCredentials := keyring.Lookup(repoToPull)
	if !withCredentials {
		logrus.Infof("Pulling the image without credentials. Image: %v", image)

		err := client.PullImage(image, dockerregistry.AuthConfig{}, dockertypes.ImagePullOptions{})
		if err != nil {
			return fmt.Errorf("failed pulling image %q: %v", image, err)
		}

		return nil
	}

	var pullErrs []error
	for _, currentCreds := range creds {
		authConfig := dockerregistry.AuthConfig(currentCreds)
		err := client.PullImage(image, authConfig, dockertypes.ImagePullOptions{})
		// If there was no error, return success
		if err == nil {
			return nil
		}

		pullErrs = append(pullErrs, err)
	}

	return errors.NewAggregate(pullErrs)
}

Following credentialprovider.NewDockerKeyring() down the call chain, the logic ends up under ./vendor/k8s.io/kubernetes/pkg/credentialprovider/:

// https://github.com/Mirantis/cri-dockerd/blob/b138f5226ae901b99ea34d40ab1eaed1c26445a4/vendor/k8s.io/kubernetes/pkg/credentialprovider/provider.go#L46-L52
func init() {
	RegisterCredentialProvider(".dockercfg",
		&CachingDockerConfigProvider{
			Provider: &defaultDockerConfigProvider{},
			Lifetime: 5 * time.Minute,
		})
}

The CachingDockerConfigProvider above wraps another provider and caches its result, so the config file is re-read only after the configured lifetime expires. The code that actually reads the file is:

// https://github.com/Mirantis/cri-dockerd/blob/b138f5226ae901b99ea34d40ab1eaed1c26445a4/vendor/k8s.io/kubernetes/pkg/credentialprovider/provider.go#L77C1-L85C2
func (d *defaultDockerConfigProvider) Provide(image string) DockerConfig {
	// Read the standard Docker credentials from .dockercfg
	if cfg, err := ReadDockerConfigFile(); err == nil {
		return cfg
	} else if !os.IsNotExist(err) {
		klog.V(2).Infof("Docker config file not found: %v", err)
	}
	return DockerConfig{}
}

Following the calls further leads to this code in ./vendor/k8s.io/kubernetes/pkg/credentialprovider/config.go:

var (
	preferredPathLock sync.Mutex
	preferredPath     = ""
	workingDirPath    = ""
	homeDirPath, _    = os.UserHomeDir()
	rootDirPath       = "/"
	homeJSONDirPath   = filepath.Join(homeDirPath, ".docker")
	rootJSONDirPath   = filepath.Join(rootDirPath, ".docker")

	configFileName     = ".dockercfg"
	configJSONFileName = "config.json"
)

...

func DefaultDockercfgPaths() []string {
	return []string{GetPreferredDockercfgPath(), workingDirPath, homeDirPath, rootDirPath}
}

func ReadDockercfgFile(searchPaths []string) (cfg DockerConfig, err error) {
	if len(searchPaths) == 0 {
		searchPaths = DefaultDockercfgPaths()
	}

	for _, configPath := range searchPaths {
		...

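The search-path logic looks fine as well. cri-dockerd runs as root, /root/.docker/config.json contains the credentials, and no unusual permissions are involved.

Debugging

Pulling manually works, so the real question is why the process does not read /root/.docker/config.json. I downloaded the source and debugged it with dlv: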

$ systemctl stop cri-dockerd kubelet
$ docker rmi -f reg.xxx.lan:5000/xxx/pause:3.9
$ reboot
# Restore the state in which the image has not been pulled, then debug
$ dlv debug main.go -- --container-runtime-endpoint unix:///var/run/cri-dockerd.sock \
  --network-plugin=cni \
  --streaming-bind-addr=127.0.0.1 \
  --cni-bin-dir=/data/kube/bin/ \
  --pod-infra-container-image=reg.xxx.lan:5000/xxx/pause:3.9

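Under the debugger, execution never entered the if !withCredentials { branch, which was strange. I then built my own binary, swapped it in, and started it through systemd. The problem reproduced, so I decided to look at the running process with dlv attach:
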
$ go build  -gcflags="all=-N -l"  -o cri-dockerd
$ systemctl stop cri-dockerd
$ \cp cri-dockerd /data/kube/bin/cri-dockerd
$ systemctl start cri-dockerd
$ dlv attach $(pgrep cri-dockerd)

After setting three breakpoints and continuing, I found that the four search paths returned by DefaultDockerConfigJSONPaths() were wrong:

(dlv) c
> k8s.io/kubernetes/pkg/credentialprovider.ReadDockerConfigJSONFile() ./vendor/k8s.io/kubernetes/pkg/credentialprovider/config.go:138 (hits goroutine(677):1 total:1) (PC: 0x2285942)
133: // if searchPaths is empty, the default paths are used.
134: func ReadDockerConfigJSONFile(searchPaths []string) (cfg DockerConfig, err error) {
135: if len(searchPaths) == 0 {
136: searchPaths = DefaultDockerConfigJSONPaths()
137: }
=> 138: for _, configPath := range searchPaths {
139: absDockerConfigFileLocation, err := filepath.Abs(filepath.Join(configPath, configJSONFileName))
140: if err != nil {
141: klog.Errorf("while trying to canonicalize %s: %v", configPath, err)
142: continue
143: }
(dlv) p searchPaths
[]string len: 4, cap: 4, [
"",
"",
".docker",
"/.docker",
]

homeJSONDirPath turned out to be wrong as well:

(dlv) p homeJSONDirPath
".docker"

In the code, its value comes from:

homeDirPath, _    = os.UserHomeDir()
...
homeJSONDirPath = filepath.Join(homeDirPath, ".docker")

Calling os.UserHomeDir() from the debugger:

(dlv) call os.UserHomeDir()
> k8s.io/kubernetes/pkg/credentialprovider.ReadDockerConfigJSONFile() ./vendor/k8s.io/kubernetes/pkg/credentialprovider/config.go:138 (PC: 0x2285942)
Values returned:
~r0: ""
~r1: error(*errors.errorString) *{
s: "$HOME is not defined",}

Surprisingly, the HOME variable is not set. The environment the process was started with can be read from procfs:

$ xargs -0 -n1 < /proc/$(pgrep cri-dockerd)/environ
LANG=en_US.UTF-8
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
NOTIFY_SOCKET=/run/systemd/notify
LISTEN_PID=5894
LISTEN_FDS=1
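
The same check can be scripted. The following is a minimal Go sketch, not part of cri-dockerd, that parses the NUL-separated KEY=VALUE records of a /proc/<pid>/environ file and reports whether HOME is present. The helper name parseEnviron and the command-line handling are illustrative assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// parseEnviron splits the NUL-separated KEY=VALUE records found in
// /proc/<pid>/environ into a map. (Hypothetical helper for illustration.)
func parseEnviron(data []byte) map[string]string {
	env := make(map[string]string)
	for _, rec := range bytes.Split(data, []byte{0}) {
		if len(rec) == 0 {
			continue // skip the empty record after the trailing NUL
		}
		key, value, ok := bytes.Cut(rec, []byte("="))
		if !ok {
			continue // not a KEY=VALUE record
		}
		env[string(key)] = string(value)
	}
	return env
}

func main() {
	// Defaults to this process's own environment; pass a path such as
	// /proc/5894/environ to inspect another process.
	path := "/proc/self/environ"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}

	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	if home, ok := parseEnviron(data)["HOME"]; ok {
		fmt.Println("HOME =", home)
	} else {
		fmt.Println("HOME is not set")
	}
}
```

Run against the environment shown above, the lookup for HOME would come back empty, since only LANG, PATH, NOTIFY_SOCKET, LISTEN_PID and LISTEN_FDS are present.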

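So systemd does not set $HOME for this service. It turns out that HOME=/root only appears in the environment once User= is set in the unit. This also explains why the earlier dlv session started from a shell behaved normally: the shell already had HOME set.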
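
To make the failure mode concrete, here is a small Go sketch of how an unset HOME turns into the relative search directory seen in the debugger. dockerConfigDir is a hypothetical helper that imitates the filepath.Join call from config.go; it is not the vendored code itself.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dockerConfigDir imitates how homeJSONDirPath is derived: the home
// directory is joined with ".docker" without checking whether it is empty.
// (Hypothetical helper for illustration.)
func dockerConfigDir(home string) string {
	return filepath.Join(home, ".docker")
}

func main() {
	// Simulate the environment systemd gave the service: no HOME variable.
	os.Unsetenv("HOME")

	// On Linux this fails when HOME is unset and returns an empty string.
	home, err := os.UserHomeDir()
	fmt.Printf("home=%q err=%v\n", home, err)

	// With an empty home directory the result is a relative path.
	dir := dockerConfigDir(home)
	fmt.Printf("search dir=%q\n", dir)

	// filepath.Abs resolves the relative path against the current
	// working directory of the process.
	abs, err := filepath.Abs(filepath.Join(dir, "config.json"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("resolved to:", abs)
}
```

Because filepath.Join drops empty elements, an empty home directory yields the relative path .docker, which filepath.Abs then resolves against the working directory of the process rather than /root.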

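Solution

There are several ways to fix this:

  • Use WorkingDirectory in the systemd unit file:
    • set it directly to /root, or
    • copy a config.json produced by docker login into the process's WorkingDirectory, named either config.json or .dockercfg
  • Set User=root

I have submitted a PR with a fix: Mirantis/cri-dockerd/pull/349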
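
For comparison, a code-level guard is also conceivable. The sketch below is purely illustrative and is not claimed to be what the PR above does: homeDir is a hypothetical helper that falls back to the user database through os/user when $HOME is unset.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/user"
)

// homeDir returns $HOME when it is set; otherwise it falls back to the home
// directory recorded for the current user. (Hypothetical helper, shown only
// to illustrate a defensive lookup.)
func homeDir() (string, error) {
	if home, err := os.UserHomeDir(); err == nil && home != "" {
		return home, nil
	}

	u, err := user.Current()
	if err != nil {
		return "", err
	}
	if u.HomeDir == "" {
		return "", errors.New("home directory could not be determined")
	}
	return u.HomeDir, nil
}

func main() {
	home, err := homeDir()
	if err != nil {
		fmt.Fprintln(os.Stderr, "no home directory:", err)
		os.Exit(1)
	}
	fmt.Println("home directory:", home)
}
```

The point of such a guard is that an empty home directory is treated as an error rather than silently joined into a relative path.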
