Symptom
After the business pods were deployed, many of them kept crashing or never became ready.
Troubleshooting
kubectl describe pod showed the following in the pod status:

    Error: failed to start container "xxxxx": Error response from daemon: OCI runtime start failed: cannot start a container that has stopped: unknown
Log in to the node and check the container states:

    $ docker ps -a | grep 6cb
    6440a7e90bde  reg.xxx.lan:5000/xxx/xxxx-task  "start.sh"  28 seconds ago  Up 27 seconds               k8s_xxx-taskxxx-task-6cbbdbf8cd-pllz9_default_a277ca7c-6cb1-42ee-8a8a-9e257f056180_10
    08b915c1fd26  xxxxx                           "start.sh"  5 minutes ago   Exited (1) 5 minutes ago    xxxxx
    aca7d892e6cb  xxxxxx                          "start.sh"  18 minutes ago  Exited (137) 5 minutes ago  xxxx
    03bc60fe84cb  pause-amd64:3.1                 "/pause"    31 minutes ago  Up 31 minutes
Check the logs:

    $ docker logs 08b
    standard_init_linux.go:211: exec user process caused "too many open files in system"
Check the number of open files per process (the second column is the PID):

    $ lsof 2>/dev/null | awk '{print $2}' | sort | uniq -c | sort -n | tail -n 6
       2701 1964
       3081 20990
       3724 23675
       8035 704
      52849 1000
     111345 684
    $ ps -p 684
      PID TTY          TIME CMD
      684 ?        00:05:10 dockerd
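The lsof pipeline above counts output rows per PID. Those rows also include entries such as memory-mapped files and working directories, so the totals can overstate the number of descriptors a process really holds. As a cross-check, the sketch below counts the entries under `/proc/<pid>/fd`. The function name `count_fds` and its optional `PROC_ROOT` argument are made up for illustration, and reading another user's process usually needs root.

```shell
#!/usr/bin/env bash
# count_fds PID [PROC_ROOT]
# Prints the number of entries in PROC_ROOT/PID/fd (one entry per open descriptor).
# PROC_ROOT defaults to /proc; it is a parameter only so the function is easy to test.
count_fds() {
    local pid="$1" proc_root="${2:-/proc}"
    local dir="$proc_root/$pid/fd"
    [ -d "$dir" ] || { echo "no such process directory: $dir" >&2; return 1; }
    # ls -A lists every entry except . and .. (one per line when piped); wc -l counts them.
    ls -A "$dir" 2>/dev/null | wc -l | tr -d ' '
}

# Example: descriptors held by the current shell (only meaningful on Linux).
if [ -d "/proc/$$/fd" ]; then
    echo "PID $$ holds $(count_fds "$$") descriptors"
fi
```

For the case above, `count_fds 684` run as root on the node would give the descriptor count for dockerd.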
Compared with a healthy environment, the open-file counts looked normal, so the next thing to check was the kernel parameter:

    $ sysctl -n fs.file-max
    10240
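The message "too many open files in system" refers to the system-wide limit rather than a per-process ulimit, so it can help to compare the kernel's own counter with fs.file-max. `/proc/sys/fs/file-nr` holds three numbers: allocated file handles, allocated but unused handles, and the maximum. The sketch below turns such a line into a usage percentage; `file_usage_percent` is an illustrative name, not an existing tool.

```shell
#!/usr/bin/env bash
# file_usage_percent "ALLOCATED UNUSED MAX"
# Takes the content of /proc/sys/fs/file-nr and prints allocated/max
# as an integer percentage (rounded down).
file_usage_percent() {
    printf '%s\n' "$1" | awk '
        NF == 3 && $3 > 0 { printf "%d\n", int($1 * 100 / $3); ok = 1 }
        END { if (!ok) { print "invalid file-nr content" > "/dev/stderr"; exit 1 } }'
}

# Example: report the current usage (only where /proc is available).
if [ -r /proc/sys/fs/file-nr ]; then
    echo "system-wide file handle usage: $(file_usage_percent "$(cat /proc/sys/fs/file-nr)")%"
fi
```

A value close to 100 on a node with fs.file-max at 10240 would match the symptom seen here.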
Normally, on current kernel versions, this value is derived from a formula. For reference:

    n = (mempages * (PAGE_SIZE / 1024)) / 10;
    files_stat.max_files = max_t(unsigned long, n, NR_FILE);
    # PAGE_SIZE can be read with getconf PAGE_SIZE; it is usually 4096
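Since mempages * (PAGE_SIZE / 1024) is simply the memory size in KiB, the formula boils down to roughly one tenth of RAM in KiB, with NR_FILE as the lower bound. The sketch below estimates that value from MemTotal. It is only an approximation: the kernel works from the page count it sees at boot, so the real default may come out somewhat lower, and tuning tools or images may override it. `estimate_file_max` is a hypothetical helper name, and 8192 is assumed for NR_FILE based on the kernel headers.

```shell
#!/usr/bin/env bash
# estimate_file_max MEM_KB
# Applies n = MEM_KB / 10, then max(n, NR_FILE), mirroring the formula above.
NR_FILE=8192   # assumed value of NR_FILE from the kernel headers

estimate_file_max() {
    local mem_kb="$1"
    local n=$(( mem_kb / 10 ))
    if [ "$n" -lt "$NR_FILE" ]; then
        n="$NR_FILE"
    fi
    echo "$n"
}

# Example: estimate for this machine (only where /proc/meminfo is available).
if [ -r /proc/meminfo ]; then
    mem_kb="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
    echo "estimated default fs.file-max: $(estimate_file_max "$mem_kb")"
fi
```

A node reporting far less than this estimate, such as 10240 on a 64G machine, has most likely had the value set explicitly somewhere.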
The value depends on memory. The customer's machine has 64G of RAM, so simply set it to something larger and persist it:

    $ sysctl -w fs.file-max=6306821
    $ echo fs.file-max=6306821 >> /etc/sysctl.conf
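Appending with >> adds another line every time the command is re-run, and an older fs.file-max entry in the same file stays in place. A small sketch of an idempotent alternative follows. `persist_sysctl` is a hypothetical helper, not part of any package, and the file path in the usage comment is only an example.

```shell
#!/usr/bin/env bash
# persist_sysctl FILE KEY VALUE
# Rewrites FILE so that it contains exactly one "KEY = VALUE" line,
# leaving all other lines (including comments) untouched.
persist_sysctl() {
    local file="$1" key="$2" value="$3"
    local tmp
    touch "$file"
    tmp="$(mktemp)"
    # Drop existing assignments of KEY (with or without spaces around "="),
    # then append the new assignment at the end.
    awk -v k="$key" -F'=' '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' "$file" > "$tmp"
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Write back through the original file so its ownership and mode are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage (as root; example path):
#   persist_sysctl /etc/sysctl.d/99-file-max.conf fs.file-max 6306821
#   sysctl --system
```

Keeping the setting in its own file under /etc/sysctl.d also makes it easier to see later where the value came from.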
Some reference values from other machines:

    $ sysctl -n fs.file-max
    6306821
    $ free -g
                  total        used        free      shared  buff/cache   available
    Mem:             62          15          39           0           8          46
    Swap:             0           0           0
    $ free -h
                  total        used        free      shared  buff/cache   available
    Mem:            62G         15G         39G         18M        8.0G         47G
    Swap:            0B          0B          0B
    $ uname -r
    3.10.0-1160.el7.x86_64
    #----

    $ sysctl -n fs.file-max; free -h
    52706963
                  total        used        free      shared  buff/cache   available
    Mem:            32G        2.7G         11G        1.8M         17G         28G
    Swap:            0B          0B          0B
    $ uname -r
    3.10.0-1127.el7.x86_64
    #------

    $ sysctl -n fs.file-max; free -h
    2426696
                  total        used        free      shared  buff/cache   available
    Mem:            23G        1.1G        1.1G         64M         21G         21G
    Swap:          2.0G        1.0G        1.0G
    $ uname -r
    3.10.0-957.el7.x86_64
    # ----

    $ sysctl -n fs.file-max
    368322
    $ free -h
                  total        used        free      shared  buff/cache   available
    Mem:           3.7G        425M        510M        888K        2.8G        3.0G
    Swap:            0B          0B          0B
    $ uname -r
    3.10.0-1160.24.1.el7.x86_64
    #-----

    $ sysctl -n fs.file-max; free -h; uname -r
    176936
                  total        used        free      shared  buff/cache   available
    Mem:           1.8G        240M         84M        560K        1.5G        1.4G
    Swap:            0B          0B          0B
    3.10.0-1160.11.1.el7.x86_64

    $ sysctl -n fs.file-max; free -h; uname -r
    183890
                  total        used        free      shared  buff/cache   available
    Mem:           1.8G        433M         73M         16M        1.3G        1.2G
    Swap:            0B          0B          0B
    3.10.0-957.12.2.el7.x86_64

    $ sysctl -n fs.file-max; free -h; uname -r
    3232208
                  total        used        free      shared  buff/cache   available
    Mem:            31G        1.9G        266M        354M         28G         28G
    Swap:          8.0G         98M        7.9G
    3.10.0-957.el7.x86_64

    $ sysctl -n fs.file-max; free -h; uname -r
    381840
                  total        used        free      shared  buff/cache   available
    Mem:           3.7G        295M        1.1G        8.5M        2.3G        3.1G
    Swap:            0B          0B          0B
    3.10.0-1127.el7.x86_64

    # aliyunlinux, i.e. Alibaba's rebranded CentOS 7
    $ sysctl -n fs.file-max; free -h; uname -r
    2097152
                  total        used        free      shared  buff/cache   available
    Mem:            15G        1.6G        4.2G        7.0M        9.4G         13G
    Swap:            0B          0B          0B
    4.19.91-22.2.al7.x86_64

    $ sysctl -n fs.file-max; free -h; uname -r
    1048576
                  total        used        free      shared  buff/cache   available
    Mem:           7.6G        6.5G        163M         26M        973M        725M
    Swap:            0B          0B          0B
    3.10.0-957.21.3.el7.x86_64

    $ sysctl -n fs.file-max; free -h; uname -r
    3250616
                  total        used        free      shared  buff/cache   available
    Mem:            31G         21G        6.6G         43M        2.8G        9.0G
    Swap:          1.0G          0B        1.0G
    3.10.0-327.el7.x86_64
References: