zhangguanzhang's Blog

standard_init_linux.go:211: exec user process caused "too many open files in ...

2019/05/11

Symptoms

After the business pods were deployed, many of them kept crashing or never became ready.

Troubleshooting

Running describe on an affected pod showed the following status:

Error: failed to start container "xxxxx": Error response from daemon: OCI runtime start failed: cannot start a container that has stopped: unknown

Log in to that node and check the container state:

$ docker ps -a | grep 6cb
6440a7e90bde reg.xxx.lan:5000/xxx/xxxx-task "start.sh" 28 seconds ago Up 27 seconds k8s_xxx-taskxxx-task-6cbbdbf8cd-pllz9_default_a277ca7c-6cb1-42ee-8a8a-9e257f056180_10
08b915c1fd26 xxxxx "start.sh" 5 minutes ago Exited (1) 5 minutes ago xxxxx
aca7d892e6cb xxxxxx "start.sh" 18 minutes ago Exited (137) 5 minutes ago xxxx
03bc60fe84cb pause-amd64:3.1 "/pause" 31 minutes ago Up 31 minutes

Check the logs of an exited container:

$ docker logs 08b
standard_init_linux.go:211: exec user process caused "too many open files in system"

Check the number of open files per process (count in the first column, PID in the second):

$ lsof 2>/dev/null | awk '{print $2}' | sort | uniq -c | sort -n | tail -n 6
2701 1964
3081 20990
3724 23675
8035 704
52849 1000
111345 684
$ ps -p 684
PID TTY TIME CMD
684 ? 00:05:10 dockerd
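
As a cross-check on the lsof numbers above: lsof output can contain rows that are not file descriptors (memory-mapped files, the working directory, and so on), so counting its lines per PID may overstate the real figure. A minimal sketch, assuming a Linux /proc filesystem, that counts the entries under /proc/<pid>/fd instead. The function name count_fds and its optional proc_root argument are hypothetical, introduced only for this illustration.

```shell
# count_fds <pid> [proc_root]
# Print the number of open file descriptors of a process by counting
# the entries in <proc_root>/<pid>/fd (proc_root defaults to /proc).
# Prints 0 if the directory does not exist or is not readable.
count_fds() (
    pid="$1"
    proc_root="${2:-/proc}"
    # Each entry in the fd directory is one open file descriptor.
    ls "$proc_root/$pid/fd" 2>/dev/null | wc -l | tr -d ' '
)
```

For the dockerd process above this would be called as `count_fds 684`. Reading another user's /proc/<pid>/fd usually requires root.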

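Compared with a healthy environment, these open-file counts looked normal, so the next thing to check was the system-wide kernel parameter: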

$ sysctl -n fs.file-max
10240
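
The "in system" part of the error message points at the system-wide limit rather than a per-process one, so it can help to compare fs.file-max with the number of file handles the kernel has currently allocated. On Linux that is exposed in /proc/sys/fs/file-nr as three fields: allocated handles, allocated but unused handles, and the maximum. The sketch below only parses that format; the function name file_nr_usage and its optional path argument are hypothetical.

```shell
# file_nr_usage [path]
# Read one line in the /proc/sys/fs/file-nr format:
#   <allocated> <allocated-but-unused> <maximum>
# and print the allocated count, the maximum, and usage as a whole percent.
file_nr_usage() (
    src="${1:-/proc/sys/fs/file-nr}"
    allocated=""
    # Tolerate a missing trailing newline: accept the line if a field was read.
    read -r allocated unused max < "$src" || [ -n "$allocated" ] || exit 1
    echo "allocated=$allocated max=$max used=$(( allocated * 100 / max ))%"
)
```

Running `file_nr_usage` with no argument reads the live values. A usage figure close to 100% would be consistent with the error seen in the container log.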

Normally, on current kernel versions, the default for this value is computed from a formula; for reference:

n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = max_t(unsigned long, n, NR_FILE);
# PAGE_SIZE can be read with getconf PAGE_SIZE; it is usually 4096
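
The formula above can be worked through in shell arithmetic. This is only a sketch: mempages is the page count the kernel sees at boot, which is not exactly what free or /proc/meminfo reports later, so the result approximates the default rather than reproducing it. NR_FILE is taken as 8192, its value in the kernel headers; the function name estimate_file_max is hypothetical.

```shell
# Lower bound the kernel applies to fs.file-max (NR_FILE in the kernel headers).
NR_FILE=8192

# estimate_file_max <mempages> [page_size_bytes]
# Apply n = (mempages * (PAGE_SIZE / 1024)) / 10 and clamp to at least NR_FILE.
# page_size_bytes defaults to 4096.
estimate_file_max() (
    mempages="$1"
    page_size="${2:-4096}"
    n=$(( mempages * (page_size / 1024) / 10 ))
    if [ "$n" -lt "$NR_FILE" ]; then
        n="$NR_FILE"
    fi
    echo "$n"
)

# Example: 64 GiB of 4 KiB pages is 16777216 pages.
#   estimate_file_max 16777216 4096   prints 6710886
```

In other words the default is roughly one tenth of the memory size in KiB, or about 100 file handles per MiB of RAM.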

The value scales with memory. The customer's machine has 64G of RAM, so simply set it to something larger:

$ sysctl -w fs.file-max=6306821
$ echo fs.file-max=6306821 >> /etc/sysctl.conf
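
One caveat with appending via echo is that running it more than once leaves duplicate fs.file-max lines in the file. A minimal sketch of an idempotent alternative follows: it drops any existing assignment of the key and writes a single new one. The function name set_sysctl_conf is hypothetical, and the target file is whatever the caller passes in.

```shell
# set_sysctl_conf <key> <value> <file>
# Ensure <file> contains exactly one "key=value" line for <key>.
# Existing "key=..." or "key = ..." lines are removed; other lines,
# including comments, are kept unchanged.
set_sysctl_conf() (
    key="$1"
    value="$2"
    file="$3"
    touch "$file" || exit 1
    tmp=$(mktemp) || exit 1
    # Keep every line whose left-hand side (before "=") is not the key.
    awk -v k="$key" '{
        line = $0
        sub(/^[ \t]+/, "", line)          # ignore leading whitespace
        split(line, parts, "[ \t]*=")     # parts[1] is the name before "="
        if (parts[1] != k) print $0
    }' "$file" > "$tmp"
    echo "$key=$value" >> "$tmp"
    # Write back through the original path to keep its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
)
```

For example, `set_sysctl_conf fs.file-max 6306821 /etc/sysctl.conf` followed by `sysctl -p` would persist and load the same value as the two commands above.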

Some reference values

$ sysctl -n fs.file-max
6306821
$ free -g
total used free shared buff/cache available
Mem: 62 15 39 0 8 46
Swap: 0 0 0
$ free -h
total used free shared buff/cache available
Mem: 62G 15G 39G 18M 8.0G 47G
Swap: 0B 0B 0B
$ uname -r
3.10.0-1160.el7.x86_64
#----

$ sysctl -n fs.file-max; free -h
52706963
total used free shared buff/cache available
Mem: 32G 2.7G 11G 1.8M 17G 28G
Swap: 0B 0B 0B
$ uname -r
3.10.0-1127.el7.x86_64

#------

$ sysctl -n fs.file-max; free -h
2426696
total used free shared buff/cache available
Mem: 23G 1.1G 1.1G 64M 21G 21G
Swap: 2.0G 1.0G 1.0G
$ uname -r
3.10.0-957.el7.x86_64
# ----

$ sysctl -n fs.file-max
368322
$ free -h
total used free shared buff/cache available
Mem: 3.7G 425M 510M 888K 2.8G 3.0G
Swap: 0B 0B 0B
$ uname -r
3.10.0-1160.24.1.el7.x86_64

#-----
$ sysctl -n fs.file-max; free -h; uname -r
176936
total used free shared buff/cache available
Mem: 1.8G 240M 84M 560K 1.5G 1.4G
Swap: 0B 0B 0B
3.10.0-1160.11.1.el7.x86_64

$ sysctl -n fs.file-max; free -h; uname -r
183890
total used free shared buff/cache available
Mem: 1.8G 433M 73M 16M 1.3G 1.2G
Swap: 0B 0B 0B
3.10.0-957.12.2.el7.x86_64

$ sysctl -n fs.file-max; free -h; uname -r
3232208
total used free shared buff/cache available
Mem: 31G 1.9G 266M 354M 28G 28G
Swap: 8.0G 98M 7.9G
3.10.0-957.el7.x86_64

$ sysctl -n fs.file-max; free -h; uname -r
381840
total used free shared buff/cache available
Mem: 3.7G 295M 1.1G 8.5M 2.3G 3.1G
Swap: 0B 0B 0B
3.10.0-1127.el7.x86_64

# aliyunlinux, i.e. Alibaba's rebranded CentOS 7
$ sysctl -n fs.file-max; free -h; uname -r
2097152
total used free shared buff/cache available
Mem: 15G 1.6G 4.2G 7.0M 9.4G 13G
Swap: 0B 0B 0B
4.19.91-22.2.al7.x86_64

$ sysctl -n fs.file-max; free -h; uname -r
1048576
total used free shared buff/cache available
Mem: 7.6G 6.5G 163M 26M 973M 725M
Swap: 0B 0B 0B
3.10.0-957.21.3.el7.x86_64

$ sysctl -n fs.file-max; free -h; uname -r
3250616
total used free shared buff/cache available
Mem: 31G 21G 6.6G 43M 2.8G 9.0G
Swap: 1.0G 0B 1.0G
3.10.0-327.el7.x86_64
