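Background
In our environment some pods are special: they run on dedicated nodes, and when they hit OOM they take some system processes down with them. Over the past few days I worked on configuring kubelet's reserved resources to deal with this. Everything here is set through parameters in the kubelet configuration file; nothing goes into the systemd unit. The version is shown below. We do not target x86_64 only: we also support other architectures and deploy at customer sites, so to reduce maintenance every k8s component except flanneld and coredns is deployed as a binary.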
```console
$ kubectl version -o json
{
  "clientVersion": {
    "major": "1",
    "minor": "20",
    "gitVersion": "v1.20.6",
    "gitCommit": "8a62859e515889f07e3e3be6a1080413f17cf2c3",
    "gitTreeState": "clean",
    "buildDate": "2021-04-15T03:28:42Z",
    "goVersion": "go1.15.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "serverVersion": {
    "major": "1",
    "minor": "20",
    "gitVersion": "v1.20.6",
    "gitCommit": "8a62859e515889f07e3e3be6a1080413f17cf2c3",
    "gitTreeState": "clean",
    "buildDate": "2021-04-15T03:19:55Z",
    "goVersion": "go1.15.10",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}
```
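Before reading this article, I recommend opening these two official documents in your browser and skimming them first:
Explanation
The key setting is enforceNodeAllocatable. Its default value is ["pods"], which means pods can use essentially all of the resources on the node. But besides the pods, the node also runs kubelet, the three kube components, the container runtime engine, and the system processes managed by systemd. If one node runs out of resources and its pods get evicted, they may be rescheduled onto other nodes and take those down as well, causing a cascading failure. According to the original design in the official documentation, a node's resources are divided as follows, where Allocatable is the share available to pods: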
```
[Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved] - [Hard-Eviction-Threshold]
```
Rearranged:
```
[Node Capacity] = [Allocatable] + [Kube-Reserved] + [System-Reserved] + [Hard-Eviction-Threshold]
```
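As a rough worked example of the formula, the sketch below plugs in memory values in Mi. The numbers are illustrative assumptions (a 32Gi node, 256Mi kube-reserved, 1Gi system-reserved, a 200Mi hard eviction threshold), and `allocatable_mi` is a hypothetical helper name, not something kubelet provides.

```shell
#!/bin/sh
# Allocatable = Capacity - KubeReserved - SystemReserved - HardEvictionThreshold
# All values are in Mi and are illustrative assumptions, not measurements.

# allocatable_mi CAPACITY KUBE_RESERVED SYSTEM_RESERVED HARD_EVICTION
allocatable_mi() {
    echo $(( $1 - $2 - $3 - $4 ))
}

capacity_mi=$((32 * 1024))   # a 32Gi node
kube_reserved_mi=256
system_reserved_mi=1024      # 1Gi
hard_eviction_mi=200         # memory.available threshold

# prints "allocatable memory: 31288Mi"
echo "allocatable memory: $(allocatable_mi "$capacity_mi" "$kube_reserved_mi" "$system_reserved_mi" "$hard_eviction_mi")Mi"
```

On a real node the capacity reported by the kernel is a little below the nominal RAM size, so the actual figure will differ slightly.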
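Allocatable on a node is defined as the amount of compute resources available to pods. The scheduler does not over-subscribe Allocatable. CPU, memory and ephemeral-storage are currently supported. Hard-Eviction above has default values. Since the default of enforceNodeAllocatable only covers pods, we need to add the kube and system reservations: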
```yaml
enforceNodeAllocatable:
- pods
- kube-reserved
- system-reserved
```
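Attempts
After adding the two entries above, nothing took effect. I eventually worked it out by reading the reference for the related YAML settings and parts of the source code. There is in fact official documentation for this: the official docs and the original design document.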
```yaml
enforceNodeAllocatable:
- pods
- kube-reserved
- system-reserved
evictionHard:
  imagefs.available: "15%"
  memory.available: "200Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
kubeReserved:
{% if inventory_hostname in groups['kube_master'] %}
  cpu: 400m
  memory: 896Mi
{% else %}
  cpu: 100m
  memory: 256Mi
{% endif %}
  ephemeral-storage: 500Mi
systemReserved:
  memory: 1Gi
  cpu: 500m
  ephemeral-storage: 2Gi
```
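The idea for this template conditional comes from kubespray: defaults/main.yml and templates/kubelet-config.v1beta1.yaml.j2.
Everything in our environment runs as binaries, so the masters get a larger kube reservation. This configuration still did not take effect, though: the cgroup paths must be configured as well, as shown below: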
```yaml
enforceNodeAllocatable:
- pods
- kube-reserved
- system-reserved
evictionHard:
  imagefs.available: "15%"
  memory.available: "200Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
kubeReserved:
{% if inventory_hostname in groups['kube_master'] %}
  cpu: 400m
  memory: 896Mi
{% else %}
  cpu: 100m
  memory: 256Mi
{% endif %}
  ephemeral-storage: 500Mi

kubeReservedCgroup: /kube.slice
systemReserved:
  memory: 1Gi
  cpu: 500m
  ephemeral-storage: 2Gi
systemReservedCgroup: /system.slice
```
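The example values in the official documentation are two separate paths, but many articles out there copy from one another and nest them, as in kubeReservedCgroup: /system.slice/kube.slice.
Even with the configuration above, kubelet still fails to start and reports this error: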
```text
Failed to start ContainerManager Failed to enforce Kube Reserved Cgroup Limits on "/kube.slice": ["kubelet"] cgroup does not exist
```
In the end I tracked down the relevant source: the func (m *cgroupManagerImpl) Exists(name CgroupName) bool method in pkg/kubelet/cm/cgroup_manager_linux.go. We only need to care about the following cgroup controllers:
```go
allowlistControllers := sets.NewString("cpu", "cpuacct", "cpuset", "memory", "systemd", "pids")

if _, ok := m.subsystems.MountPoints["hugetlb"]; ok {
	allowlistControllers.Insert("hugetlb")
}
```
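To find out which of these controllers is missing the cgroup before kubelet complains, a small check along the following lines can help. It is only a sketch: `missing_cgroups` is a hypothetical helper, and it assumes a cgroup v1 layout with one directory per controller under /sys/fs/cgroup.

```shell
#!/bin/sh
# missing_cgroups ROOT NAME
# Prints every allowlisted controller that exists under ROOT but has no
# NAME cgroup directory. ROOT is normally /sys/fs/cgroup (cgroup v1 layout).
missing_cgroups() {
    root=$1
    name=$2
    for controller in cpu cpuacct cpuset memory systemd pids hugetlb; do
        # Skip controllers that are not present on this machine at all.
        [ -d "$root/$controller" ] || continue
        [ -d "$root/$controller/$name" ] || echo "$controller"
    done
}

# List the controllers where /kube.slice still has to be created.
missing_cgroups /sys/fs/cgroup kube.slice
```

An empty result means every present controller already has the cgroup. Taking the root directory as a parameter keeps the function easy to try against a scratch directory.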
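Most articles create these cgroups by hand, which I do not recommend. It is better to add an ExecStartPre to the kubelet service that runs a script to check for them and create them.
Final configuration
The kubelet service file, for reference: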
```ini
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=docker.service
Requires=docker.service

[Service]
WorkingDirectory={{ data_dir }}/kube/kubelet
ExecStartPre=/bin/bash {{ data_dir }}/kube/kubelet/kubelet-cg.sh
ExecStart={{ bin_dir }}/kubelet \
  --config={{ data_dir }}/kube/kubelet/kubelet-config.yaml \
  --root-dir={{ data_dir }}/kube/kubelet \
  --docker-root={{ data_dir }}/kube/docker \
  --cni-bin-dir={{ bin_dir }} \
  --cni-conf-dir=/etc/cni/net.d \
  --hostname-override={{ inventory_hostname }} \
  --kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
  --network-plugin=cni \
  --experimental-dockershim-root-directory={{ data_dir }}/kube/dockershim \
  --pod-infra-container-image=registry.aliyuncs.com/k8sxio/pause:3.5 \
  --register-node=true \
  --v=2 \
  --node-ip={{ inventory_hostname }}

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```
Our environment still uses the cgroupfs driver; if you use systemd you may have to work out the details yourself. Below are kubelet-config.yaml and kubelet-cg.sh:
```yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1

allowedUnsafeSysctls: []
address: {{ inventory_hostname }}
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 2m0s
    enabled: true
  x509:
    clientCAFile: {{ ca_dir }}/ca.pem
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 5m0s
    cacheUnauthorizedTTL: 30s
tlsCertFile: {{ ca_dir }}/kubelet.pem
tlsPrivateKeyFile: {{ ca_dir }}/kubelet-key.pem
cgroupDriver: cgroupfs
cgroupsPerQOS: true
clusterDNS:
- {{ CLUSTER_DNS_SVC_IP }}
clusterDomain: {{ CLUSTER_DNS_DOMAIN }}
configMapAndSecretChangeDetectionStrategy: Watch
containerLogMaxFiles: 5
containerLogMaxSize: 10Mi
contentType: application/vnd.kubernetes.protobuf
cpuCFSQuota: true
cpuCFSQuotaPeriod: 100ms
cpuManagerPolicy: none
cpuManagerReconcilePeriod: 10s
enableControllerAttachDetach: true
enableDebuggingHandlers: true
enableSystemLogHandler: true
enforceNodeAllocatable:
- pods
- kube-reserved
- system-reserved
eventBurst: 100
eventRecordQPS: 50
evictionHard:
  imagefs.available: "15%"
  memory.available: "200Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
evictionPressureTransitionPeriod: 5m0s
failSwapOn: true
fileCheckFrequency: 10s
hairpinMode: promiscuous-bridge
healthzPort: 10248
healthzBindAddress: {{ inventory_hostname }}
httpCheckFrequency: 0s
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
imageMinimumGCAge: 2m0s
iptablesDropBit: 15
iptablesMasqueradeBit: 14
kubeAPIBurst: 100
kubeAPIQPS: 50
kubeReserved:
{% if inventory_hostname in groups['kube_master'] %}
  cpu: 400m
  memory: 896Mi
{% else %}
  cpu: 100m
  memory: 256Mi
{% endif %}
  ephemeral-storage: 500Mi
kubeReservedCgroup: /kube.slice
systemReserved:
  memory: 1Gi
  cpu: 500m
  ephemeral-storage: 2Gi
systemReservedCgroup: /system.slice
makeIPTablesUtilChains: true
maxOpenFiles: 1000000
{% set nodeLen = groups['kube_node'] | length %}
{% if nodeLen == 1 %}
maxPods: 253
{% elif nodeLen < 3 %}
maxPods: 200
{% elif nodeLen >= 3 and nodeLen <=6 %}
maxPods: 150
{% else %}
maxPods: 110
{% endif %}
nodeLeaseDurationSeconds: 40
nodeStatusReportFrequency: 1m0s
nodeStatusUpdateFrequency: 10s
oomScoreAdj: -999
podPidsLimit: -1
port: 10250
readOnlyPort: 0
registryBurst: 20
registryPullQPS: 10
resolvConf: {% if ansible_distribution == "Ubuntu" and ansible_distribution_major_version|int > 16 %}/run/systemd/resolve/resolv.conf
{% else %}/etc/resolv.conf
{% endif %}
rotateCertificates: true
runtimeRequestTimeout: 2m0s
serializeImagePulls: true
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 20m0s
syncFrequency: 1m0s
volumeStatsAggPeriod: 1m0s
volumePluginDir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
tlsCipherSuites:
  - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
  - TLS_RSA_WITH_AES_128_GCM_SHA256
  - TLS_RSA_WITH_AES_256_GCM_SHA384
  - TLS_RSA_WITH_AES_128_CBC_SHA
  - TLS_RSA_WITH_AES_256_CBC_SHA
```
```bash
#!/bin/bash

function check_and_create(){
    local cg_controller=$1
    if mountpoint -q /sys/fs/cgroup/${cg_controller};then
        mkdir -p /sys/fs/cgroup/${cg_controller}/system.slice
        mkdir -p /sys/fs/cgroup/${cg_controller}/kube.slice
    fi
}

check_and_create cpu
check_and_create cpuacct
check_and_create cpuset
check_and_create memory
check_and_create systemd
check_and_create pids
check_and_create hugetlb
```
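On the pod count: I discussed it with some more experienced people. If maxPods is too large, components such as docker cannot actually cope, so there is no point in setting it very high. My template raises the pod count when there are few nodes, which is our internal test environment, and keeps the recommended 110 at customer sites.
Pitfalls
2021/08/23: Many of our internal machines have inconsistent specs, and the configuration above prevents some of them from starting. I had also misunderstood enforceNodeAllocatable. I thought it was the switch that turns the reservations on, but it actually decides which of these groups get cgroup limits applied. Once the reserved values are configured they are subtracted from the allocatable quota anyway; listing kube-reserved and system-reserved in enforceNodeAllocatable additionally enforces cgroup limits on the kube and system processes themselves, which is not recommended either. To undo it, change the relevant settings as follows: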
```yaml
#ExecStartPre=/bin/bash {{ data_dir }}/kube/kubelet/kubelet-cg.sh

enforceNodeAllocatable:
- pods
# - kube-reserved
# - system-reserved
```
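One way to see what enforcement does on a given node is to read the memory limit of each cgroup involved. The sketch below assumes cgroup v1 with the cgroupfs driver, where the pod cgroup is /kubepods (the name that shows up in the kernel log later in this article); `cgroup_mem_limit_mi` is a hypothetical helper.

```shell
#!/bin/sh
# cgroup_mem_limit_mi ROOT CGROUP
# Prints the cgroup v1 memory limit of CGROUP in Mi, or "n/a" if the cgroup
# or its limit file does not exist. ROOT is normally /sys/fs/cgroup/memory.
cgroup_mem_limit_mi() {
    limit_file="$1/$2/memory.limit_in_bytes"
    if [ -r "$limit_file" ]; then
        echo $(( $(cat "$limit_file") / 1024 / 1024 ))
    else
        echo "n/a"
    fi
}

# With only "pods" enforced, kube.slice and system.slice are expected to
# show no meaningful limit (a huge number or n/a).
for cg in kubepods kube.slice system.slice; do
    echo "$cg: $(cgroup_mem_limit_mi /sys/fs/cgroup/memory "$cg")"
done
```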
oom killer
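When the system runs short of memory, the kernel invokes the oom-killer to pick some processes to kill, so that memory can be reclaimed and the system can keep running as far as possible. Which process gets killed is decided by a scoring policy. Its inputs include the memory the process uses, the memory used by its page tables, and so on. The smaller oom_score_adj is, the lower the process scores and the harder it is to kill. The formula is roughly the one below. oom_score is nominally in [0, 1000] (with a positive oom_score_adj added it can exceed 1000, as the logs later in this article show), and oom_score_adj is in [-1000, 1000]. oom_score_adj is the knob left for us to tune: for example, to keep a process from being killed by the oom-killer, set its oom_score_adj to -1000.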
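```
oom_score = memory used / total memory * 1000   # not entirely accurate: CPU time and process lifetime also play a part
```
where
memory used includes: resident memory (RSS) + process page tables + swap
total memory is simply: total physical memory + swap space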
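The approximation can be checked against the kernel logs from the tests further down in this article. The sketch below counts only RSS, ignoring page tables and swap, adds oom_score_adj, and uses the /kubepods cgroup limit of 29763384kB from the second test's log as the total, assuming that limit was the same in every run. `approx_oom_score` is a hypothetical helper, and the kernel's real calculation is more involved.

```shell
#!/bin/sh
# approx_oom_score USED_KB TOTAL_KB OOM_SCORE_ADJ
# Integer approximation: used / total * 1000 + oom_score_adj
approx_oom_score() {
    echo $(( $1 * 1000 / $2 + $3 ))
}

# BestEffort process from the first test: anon-rss 3145848kB, adj 1000.
approx_oom_score 3145848 29763384 1000   # prints 1105, the log says "score 1105"

# BestEffort process from the last test: anon-rss 2097272kB, adj 1000.
approx_oom_score 2097272 29763384 1000   # prints 1070, the log says "score 1070"
```

Both results match the scores reported in the "Memory cgroup out of memory" lines, which suggests that for a cgroup-level OOM the total is the cgroup limit rather than the machine's memory.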
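k8s qosClass
When Kubernetes creates a Pod, it assigns one of the following three QoS classes:
- Guaranteed - cpu and memory limits must both be set, and the cpu request must equal the cpu limit; the same goes for memory. If only the cpu and memory limits are set, k8s sets matching requests for you.
- Burstable - does not meet the Guaranteed criteria, and at least one container in the Pod has a memory or CPU request. Whether the cpu or memory requests equal the limits does not matter.
- BestEffort - no container has any memory or CPU limit or request.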
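The three rules can be sketched as a small classifier for a single-container pod. This is a simplified illustration, not how Kubernetes implements it: `qos_class` is a hypothetical helper, it compares quantities as plain strings (so "1" and "1000m" would not count as equal), and it ignores multi-container pods.

```shell
#!/bin/sh
# qos_class CPU_REQUEST CPU_LIMIT MEM_REQUEST MEM_LIMIT
# Pass "" for a field that is not set.
qos_class() {
    cpu_req=$1; cpu_lim=$2; mem_req=$3; mem_lim=$4

    # Nothing set at all: BestEffort.
    if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
        echo BestEffort
        return
    fi

    # A missing request defaults to the limit.
    [ -n "$cpu_req" ] || cpu_req=$cpu_lim
    [ -n "$mem_req" ] || mem_req=$mem_lim

    # Both limits set and requests equal to limits: Guaranteed.
    if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
       [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
        echo Guaranteed
    else
        echo Burstable
    fi
}

qos_class "" 300m "" 13Gi   # limits only: Guaranteed
qos_class "" ""   10Mi ""   # memory request only: Burstable
qos_class "" ""   ""   ""   # nothing set: BestEffort
```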
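I checked, and none of our business pods currently have any limits configured, so they are all BestEffort. The command below shows the qosClass of the pods in a namespace: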
```shell
kubectl get pod -o yaml | grep qosClass
```
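Node OOM behavior and the oom_score_adj of each qosClass
According to the official documentation, node OOM behavior is as follows:
If the node experiences a system OOM (out of memory) event before the kubelet is able to reclaim memory, the node depends on the oom-killer to respond.
The kubelet sets an oom_score_adj value for each container based on the quality of service of the pod. The value is set when the container is created.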
| Quality of Service | oom_score_adj |
| --- | --- |
| Guaranteed | -997 |
| Burstable | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
| BestEffort | 1000 |
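If the kubelet is unable to reclaim memory before the node experiences a system OOM, the oom_killer calculates an oom_score based on the percentage of memory the container is using on the node, adds the oom_score_adj to get the container's effective oom_score, and then kills the container with the highest score.
The intended behavior is that containers with the lowest quality of service that consume the most memory relative to their scheduling request are killed first in order to reclaim memory.
Unlike pod eviction, if a Pod's container is OOM-killed, it may be restarted by the kubelet based on its RestartPolicy.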
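For more detail, see generateLinuxContainerConfig in pkg/kubelet/kuberuntime/kuberuntime_container_linux.go and GetContainerOOMScoreAdjust.
The core is oomScoreAdjust := 1000 - (1000 * container.Resources.Requests.Memory().Value())/memoryCapacity.
memoryCapacity is the machine's physical memory size, not what is left after subtracting the reservations. The lower bound keeps the result from approaching 0 when memoryRequest is close to the machine's memory, and the upper bound keeps oomScoreAdjust from reaching 1000, the BestEffort value. kubelet and docker usually set their own oom_score_adj to -999.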
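One conclusion follows: leaving aside Guaranteed pods and BestEffort pods (whose requests and limits are empty), the larger the memory request, the smaller the oom_score_adj. And the smaller the oom_score_adj, the less likely the process is to be killed by the oom-killer on OOM.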
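The Burstable formula from the table can be tried out with integer arithmetic. The sketch below follows the documented min/max bounds from the table; `burstable_oom_score_adj` is a hypothetical helper, and the machine memory is assumed to be exactly 32Gi, which a real node only approximates.

```shell
#!/bin/sh
# burstable_oom_score_adj MEMORY_REQUEST_BYTES MACHINE_MEMORY_BYTES
# min(max(2, 1000 - (1000 * request) / capacity), 999)
burstable_oom_score_adj() {
    adj=$(( 1000 - (1000 * $1) / $2 ))
    if [ "$adj" -lt 2 ]; then adj=2; fi       # lower bound
    if [ "$adj" -gt 999 ]; then adj=999; fi   # upper bound, stays below BestEffort
    echo "$adj"
}

machine=$((32 * 1024 * 1024 * 1024))   # assumed: exactly 32Gi

burstable_oom_score_adj $((10 * 1024 * 1024)) "$machine"          # 10Mi request  -> 999
burstable_oom_score_adj $((300 * 1024 * 1024)) "$machine"         # 300Mi request -> 991
burstable_oom_score_adj $((12 * 1024 * 1024 * 1024)) "$machine"   # 12Gi request  -> 625
```

A tiny request leaves a Burstable container almost as exposed as a BestEffort one, while a request of 12Gi on this machine brings the adjustment down to 625.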
Test
On a 32G machine that uses about 1G when idle, deploy three pods, one per QoS class, each allocating 12G of memory, and see which one gets killed first:
```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: stress-12-guaranteed
spec:
  nodeName: xx.xx.82.174
  containers:
  - name: ctr
    image: registry.aliyuncs.com/zhangguanzhang/stress-ng:0.13.03
    command:
    - sh
    - -c
    - |
      cp stress-ng /stress-12-Guaranteed
      exec /stress-12-Guaranteed --vm 4 --vm-bytes 12G
    resources:
      limits:
        memory: "13Gi"
        cpu: "300m"
---
apiVersion: v1
kind: Pod
metadata:
  name: stress-12-burstable
spec:
  nodeName: xx.xx.82.174
  containers:
  - name: ctr
    image: registry.aliyuncs.com/zhangguanzhang/stress-ng:0.13.03
    command:
    - sh
    - -c
    - |
      cp stress-ng /stress-12-Burstable
      exec /stress-12-Burstable --vm 4 --vm-bytes 12G
    resources:
      requests:
        memory: "10Mi"
---
apiVersion: v1
kind: Pod
metadata:
  name: stress-12-besteffort
spec:
  nodeName: xx.xx.82.174
  containers:
  - name: ctr
    image: registry.aliyuncs.com/zhangguanzhang/stress-ng:0.13.03
    command:
    - sh
    - -c
    - |
      cp stress-ng /stress-12-BestEffort
      exec /stress-12-BestEffort --vm 4 --vm-bytes 12G
---
```
After creating them, the system log confirms the expectation: the one killed by the oom-killer is indeed stress-12-besteffort.
```text
[80389.171797] Memory cgroup out of memory: Kill process 27672 (stress-12-BestE) score 1105 or sacrifice child
[80389.202470] Killed process 27672 (stress-12-BestE), UID 0, total-vm:3189272kB, anon-rss:3145848kB, file-rss:4kB, shmem-rss:8kB
[80391.538015] stress-12-BestE invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=1000
[80391.538021] stress-12-BestE cpuset=597912cbecd66eccbd66c62dfd354bf5497db8e27a5bb04672ee5a84f217fbff mems_allowed=0-3
[80391.538026] CPU: 6 PID: 27991 Comm: stress-12-BestE Kdump: loaded Tainted: G ------------ T 3.10.0-1127.el7.x86_64 #1
[80391.538028] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
[80391.538030] Call Trace:
[80391.538041] [<ffffffff8497ff85>] dump_stack+0x19/0x1b
[80391.538045] [<ffffffff8497a8a3>] dump_header+0x90/0x229
[80391.538051] [<ffffffff8449c4a8>] ? ep_poll_callback+0xf8/0x220
[80391.538057] [<ffffffff843c246e>] oom_kill_process+0x25e/0x3f0
[80391.538062] [<ffffffff84333a41>] ? cpuset_mems_allowed_intersects+0x21/0x30
[80391.538067] [<ffffffff84440ba6>] mem_cgroup_oom_synchronize+0x546/0x570
[80391.538071] [<ffffffff84440020>] ? mem_cgroup_charge_common+0xc0/0xc0
[80391.538075] [<ffffffff843c2d14>] pagefault_out_of_memory+0x14/0x90
[80391.538078] [<ffffffff84978db3>] mm_fault_error+0x6a/0x157
[80391.538082] [<ffffffff8498d8d1>] __do_page_fault+0x491/0x500
[80391.538086] [<ffffffff8498d975>] do_page_fault+0x35/0x90
[80391.538091] [<ffffffff84989778>] page_fault+0x28/0x30
```
Next, test the following case, where the memory sizes differ:
```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: stress-20-guaranteed
spec:
  nodeName: xx.xx.82.174
  containers:
  - name: ctr
    image: registry.aliyuncs.com/zhangguanzhang/stress-ng:0.13.03
    command:
    - sh
    - -c
    - |
      cp stress-ng /stress-20-Guaranteed
      exec /stress-20-Guaranteed --vm 4 --vm-bytes 20G
    resources:
      limits:
        memory: "24Gi"
        cpu: "4000m"

---
apiVersion: v1
kind: Pod
metadata:
  name: stress-8-burstable
spec:
  nodeName: xx.xx.82.174
  containers:
  - name: ctr
    image: registry.aliyuncs.com/zhangguanzhang/stress-ng:0.13.03
    command:
    - sh
    - -c
    - |
      cp stress-ng /stress-8-Burstable
      exec /stress-8-Burstable --vm 4 --vm-bytes 8G
    resources:
      requests:
        memory: "300Mi"
---
apiVersion: v1
kind: Pod
metadata:
  name: stress-4-besteffort
spec:
  nodeName: xx.xx.82.174
  containers:
  - name: ctr
    image: registry.aliyuncs.com/zhangguanzhang/stress-ng:0.13.03
    command:
    - sh
    - -c
    - |
      cp stress-ng /stress-4-BestEffort
      exec /stress-4-BestEffort --vm 2 --vm-bytes 4G
---
```
Checking the log, stress-20-Guara was killed.
```text
Sep 28 09:54:39 82-174-zhang kernel: [140244.941467] stress-20-Guara invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=1000
Sep 28 09:54:39 82-174-zhang kernel: [140244.941472] stress-20-Guara cpuset=4be3661456aa74304875c3a00646851baea21be3e0667e84dad7ae812d3d0169 mems_allowed=0-3
Sep 28 09:54:39 82-174-zhang kernel: [140244.941476] CPU: 8 PID: 10195 Comm: stress-20-Guara Kdump: loaded Tainted: G ------------ T 3.10.0-1127.el7.x86_64 #1
Sep 28 09:54:39 82-174-zhang kernel: [140244.941478] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
Sep 28 09:54:39 82-174-zhang kernel: [140244.941479] Call Trace:
Sep 28 09:54:39 82-174-zhang kernel: [140244.941492] [<ffffffff8497ff85>] dump_stack+0x19/0x1b
Sep 28 09:54:39 82-174-zhang kernel: [140244.941495] [<ffffffff8497a8a3>] dump_header+0x90/0x229
Sep 28 09:54:39 82-174-zhang kernel: [140244.941502] [<ffffffff8449c4a8>] ? ep_poll_callback+0xf8/0x220
Sep 28 09:54:39 82-174-zhang kernel: [140244.941508] [<ffffffff843c246e>] oom_kill_process+0x25e/0x3f0
Sep 28 09:54:39 82-174-zhang kernel: [140244.941512] [<ffffffff84333a41>] ? cpuset_mems_allowed_intersects+0x21/0x30
Sep 28 09:54:39 82-174-zhang kernel: [140244.941518] [<ffffffff84440ba6>] mem_cgroup_oom_synchronize+0x546/0x570
Sep 28 09:54:39 82-174-zhang kernel: [140244.941520] [<ffffffff84440020>] ? mem_cgroup_charge_common+0xc0/0xc0
Sep 28 09:54:39 82-174-zhang kernel: [140244.941523] [<ffffffff843c2d14>] pagefault_out_of_memory+0x14/0x90
Sep 28 09:54:39 82-174-zhang kernel: [140244.941525] [<ffffffff84978db3>] mm_fault_error+0x6a/0x157
Sep 28 09:54:39 82-174-zhang kernel: [140244.941529] [<ffffffff8498d8d1>] __do_page_fault+0x491/0x500
Sep 28 09:54:39 82-174-zhang kernel: [140244.941531] [<ffffffff8498d975>] do_page_fault+0x35/0x90
Sep 28 09:54:39 82-174-zhang kernel: [140244.941534] [<ffffffff84989778>] page_fault+0x28/0x30
Sep 28 09:54:39 82-174-zhang kernel: [140244.941538] Task in /kubepods/podff2a4320-67f9-4afc-b1a8-0aa39caa8904/4be3661456aa74304875c3a00646851baea21be3e0667e84dad7ae812d3d0169 killed as a result of limit of /kubepods
Sep 28 09:54:39 82-174-zhang kernel: [140244.941540] memory: usage 29763384kB, limit 29763384kB, failcnt 734923
Sep 28 09:54:39 82-174-zhang kernel: [140244.941542] memory+swap: usage 29763384kB, limit 9007199254740988kB, failcnt 0
Sep 28 09:54:39 82-174-zhang kernel: [140244.941543] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
```
In theory it should not have been killed. Let's check the oom_score_adj of its processes.
```console
$ ps aux | grep stress-20-Guar[a]
root 10146 0.0 0.0 43540 2468 ? Ss 09:54 0:00 /stress-20-Guaranteed --vm 4 --vm-bytes 20G
root 10181 0.0 0.0 43544 308 ? S 09:54 0:00 /stress-20-Guaranteed --vm 4 --vm-bytes 20G
root 10184 0.0 0.0 43544 272 ? S 09:54 0:00 /stress-20-Guaranteed --vm 4 --vm-bytes 20G
root 10186 0.0 0.0 43544 276 ? S 09:54 0:00 /stress-20-Guaranteed --vm 4 --vm-bytes 20G
root 10188 0.1 0.0 43544 348 ? S 09:54 0:00 /stress-20-Guaranteed --vm 4 --vm-bytes 20G
root 11024 99.7 15.9 5286424 5243196 ? R 09:56 1:23 /stress-20-Guaranteed --vm 4 --vm-bytes 20G
root 11491 98.6 15.9 5286424 5243164 ? R 09:57 0:23 /stress-20-Guaranteed --vm 4 --vm-bytes 20G
root 11539 103 15.9 5286424 5243168 ? R 09:57 0:16 /stress-20-Guaranteed --vm 4 --vm-bytes 20G
root 11610 101 15.9 5286424 5243192 ? R 09:57 0:09 /stress-20-Guaranteed --vm 4 --vm-bytes 20G
$ pstree -sp 10146
systemd(1)───dockerd(751)───containerd(838)───containerd-shim(10099)───stress-20-Guara(10146)─┬─stress-20-Guara(10181)───stress-20-Guara(11610)
                                                                                              ├─stress-20-Guara(10184)───stress-20-Guara(11491)
                                                                                              ├─stress-20-Guara(10186)───stress-20-Guara(11539)
                                                                                              └─stress-20-Guara(10188)───stress-20-Guara(11024)
$ cat /proc/11024/oom_score
1160
$ cat /proc/11024/oom_score_adj
1000
$ cat /proc/10146/oom_score
0
$ cat /proc/10146/oom_score_adj
-997
```
Start a container with docker run and look at its oom_score_adj:
```console
$ docker run -d --name test --oom-score-adj -998 nginx:alpine
$ docker exec test sh -c 'cat /proc/*/oom_score_adj'
-998
-998
...
```
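Finally I skimmed the stress-ng source and found that stress-ng sets the oom_score_adj of its child processes to 1000. A process inside a container can only raise oom_score_adj, not lower it (lowering it requires CAP_SYS_RESOURCE, which containers do not get by default). stress-ng probably did not take containers into account here, and I have filed an issue.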
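Rather than reading the files one PID at a time, the values for a whole process tree can be printed in one go. The sketch below uses a hypothetical `oom_info` helper that takes the proc root as its first argument, so it can be pointed at /proc on the node or at a copy of it.

```shell
#!/bin/sh
# oom_info PROC_ROOT PID...
# Prints "PID oom_score_adj oom_score" for every PID found under PROC_ROOT.
oom_info() {
    proc_root=$1
    shift
    for pid in "$@"; do
        dir="$proc_root/$pid"
        # Skip PIDs that have already exited or are not visible.
        [ -r "$dir/oom_score_adj" ] || continue
        echo "$pid $(cat "$dir/oom_score_adj") $(cat "$dir/oom_score")"
    done
}

# Inspect the current shell; on the node, pass the PIDs shown by pstree instead,
# e.g. oom_info /proc 10146 10181 11610
oom_info /proc $$
```

Run against the tree above, this would show -997 on the container's main process and 1000 on the stress-ng workers, which is the mismatch that gets the Guaranteed pod killed first.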
no-oom-adjust
After the issue was reported, the stress-ng author added a --no-oom-adjust option, so the test above can continue:
```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: stress-20-guaranteed
spec:
  nodeName: xx.xx.82.174
  containers:
  - name: ctr
    image: registry.aliyuncs.com/zhangguanzhang/stress-ng:temp
    command:
    - sh
    - -c
    - |
      cp stress-ng /stress-20-Guaranteed
      exec /stress-20-Guaranteed --vm 4 --vm-bytes 20G --no-oom-adjust
    resources:
      limits:
        memory: "22Gi"
        cpu: "4000m"
---
apiVersion: v1
kind: Pod
metadata:
  name: stress-8-burstable
spec:
  nodeName: xx.xx.82.174
  containers:
  - name: ctr
    image: registry.aliyuncs.com/zhangguanzhang/stress-ng:temp
    command:
    - sh
    - -c
    - |
      cp stress-ng /stress-8-Burstable
      exec /stress-8-Burstable --vm 4 --vm-bytes 8G --no-oom-adjust
    resources:
      requests:
        memory: "300Mi"
---
apiVersion: v1
kind: Pod
metadata:
  name: stress-4-besteffort
spec:
  nodeName: xx.xx.82.174
  containers:
  - name: ctr
    image: registry.aliyuncs.com/zhangguanzhang/stress-ng:temp
    command:
    - sh
    - -c
    - |
      cp stress-ng /stress-4-BestEffort
      exec /stress-4-BestEffort --vm 2 --vm-bytes 4G --no-oom-adjust
---
```
After applying these, the machine's log shows that the first one to be OOM-killed is the 4G pod:
```text
Oct 11 10:38:03 82-174-zhang kernel: [1266038.261910] Memory cgroup out of memory: Kill process 7017 (stress-4-BestEf) score 1070 or sacrifice child
Oct 11 10:38:03 82-174-zhang kernel: [1266038.267444] Killed process 7017 (stress-4-BestEf), UID 0, total-vm:2140696kB, anon-rss:2097272kB, file-rss:8kB, shmem-rss:4kB
```
References