zhangguanzhang's Blog

Troubleshooting intermittent segfaults in docker and containerd

Word count: 4.4k; reading time: 25 min
2022/04/22

A walkthrough of troubleshooting intermittent segmentation faults in docker and containerd on a customer's environment.

Background

I was pulled into this issue on 04/15. The symptom was that the customer's root partition had filled up; going in over a background tty showed the root directory full of core.$pid coredump files. After cleaning those up and restarting, many containers would not come back up, so they asked me to take a look.

Troubleshooting process

The customer's system is CentOS 7.9. Checking with systemctl status docker first showed that docker crashed after running for a while, so I started it in the foreground to debug:

$ dockerd --version
Docker version 19.03.14, build 5eb3275
$ systemctl stop kubelet docker
$ dockerd --debug

Each time dockerd exited the log was different: sometimes it was a segmentation fault, sometimes an error that it could not reach containerd via /var/run/docker/containerd/containerd.sock because that sock file did not exist. That error already shows that containerd itself was failing to start. From the source code we know that if containerd is not running, dockerd will os.Exec a containerd of its own.
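A quick way to confirm whether dockerd managed to bring up its child containerd is to look at the process tree and at the socket it should be serving (a minimal sketch; the socket path is the one from this environment):

$ pstree -ap $(pidof dockerd) | head           # containerd should appear as a child of dockerd
$ ls -l /var/run/docker/containerd/containerd.sock    # a missing sock means containerd never came up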

Our docker is installed from the official static binaries. With the official package-manager install, containerd is started by systemd; with the static-binary install, dockerd itself execs a containerd. So I found a node running the same version, looked up containerd's cmdline there (a quick way to do that is sketched below), and then started containerd manually in the foreground at debug log level.
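On the healthy reference node, containerd's full argument list can be read out of /proc (a sketch; the pgrep-based PID lookup is just one way to locate the containerd spawned by dockerd):

$ pid=$(pgrep -x containerd | head -1)
$ tr '\0' ' ' < /proc/$pid/cmdline; echo       # prints something like: containerd --config ... --log-level ...

With those arguments, containerd was started in the foreground on the broken node: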

$ containerd --config /var/run/docker/containerd/containerd.toml --log-level debug   
INFO[2022-04-15T14:14:01.755841221+08:00] starting containerd revision=ea765aba0d05254012b0b9e595e995c09186427f version=v1.3.9
DEBU[2022-04-15T14:14:01.755989919+08:00] changing OOM score to -500
INFO[2022-04-15T14:14:01.787587081+08:00] loading plugin "io.containerd.content.v1.content"... type=io.containerd.content.v1
INFO[2022-04-15T14:14:01.787837808+08:00] loading plugin "io.containerd.snapshotter.v1.btrfs"... type=io.containerd.snapshotter.v1
INFO[2022-04-15T14:14:01.790684367+08:00] skip loading plugin "io.containerd.snapshotter.v1.btrfs"... error="path /data/kube/docker/containerd/daemon/io.containerd.snapshotter.v1.btrfs (xfs) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2022-04-15T14:14:01.790779651+08:00] loading plugin "io.containerd.snapshotter.v1.devmapper"... type=io.containerd.snapshotter.v1
WARN[2022-04-15T14:14:01.790827944+08:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="devmapper not configured"
INFO[2022-04-15T14:14:01.790850382+08:00] loading plugin "io.containerd.snapshotter.v1.aufs"... type=io.containerd.snapshotter.v1
INFO[2022-04-15T14:14:01.793069978+08:00] skip loading plugin "io.containerd.snapshotter.v1.aufs"... error="modprobe aufs failed: \"modprobe: FATAL: Module aufs not found.\\n\": exit status 1: skip plugin" type=io.containerd.snapshotter.v1
INFO[2022-04-15T14:14:01.793133838+08:00] loading plugin "io.containerd.snapshotter.v1.native"... type=io.containerd.snapshotter.v1
INFO[2022-04-15T14:14:01.793229323+08:00] loading plugin "io.containerd.snapshotter.v1.overlayfs"... type=io.containerd.snapshotter.v1
INFO[2022-04-15T14:14:01.793392157+08:00] loading plugin "io.containerd.snapshotter.v1.zfs"... type=io.containerd.snapshotter.v1
INFO[2022-04-15T14:14:01.794994033+08:00] skip loading plugin "io.containerd.snapshotter.v1.zfs"... error="path /data/kube/docker/containerd/daemon/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2022-04-15T14:14:01.795033213+08:00] loading plugin "io.containerd.metadata.v1.bolt"... type=io.containerd.metadata.v1
WARN[2022-04-15T14:14:01.795068521+08:00] could not use snapshotter devmapper in metadata plugin error="devmapper not configured"
INFO[2022-04-15T14:14:01.795097127+08:00] metadata content store policy set policy=shared
INFO[2022-04-15T14:14:01.795925864+08:00] loading plugin "io.containerd.differ.v1.walking"... type=io.containerd.differ.v1
INFO[2022-04-15T14:14:01.795966063+08:00] loading plugin "io.containerd.gc.v1.scheduler"... type=io.containerd.gc.v1
INFO[2022-04-15T14:14:01.796083519+08:00] loading plugin "io.containerd.service.v1.containers-service"... type=io.containerd.service.v1
INFO[2022-04-15T14:14:01.796116546+08:00] loading plugin "io.containerd.service.v1.content-service"... type=io.containerd.service.v1
INFO[2022-04-15T14:14:01.796143806+08:00] loading plugin "io.containerd.service.v1.diff-service"... type=io.containerd.service.v1
INFO[2022-04-15T14:14:01.796174035+08:00] loading plugin "io.containerd.service.v1.images-service"... type=io.containerd.service.v1
INFO[2022-04-15T14:14:01.796201988+08:00] loading plugin "io.containerd.service.v1.leases-service"... type=io.containerd.service.v1
INFO[2022-04-15T14:14:01.796236554+08:00] loading plugin "io.containerd.service.v1.namespaces-service"... type=io.containerd.service.v1
INFO[2022-04-15T14:14:01.796274839+08:00] loading plugin "io.containerd.service.v1.snapshots-service"... type=io.containerd.service.v1
INFO[2022-04-15T14:14:01.796309507+08:00] loading plugin "io.containerd.runtime.v1.linux"... type=io.containerd.runtime.v1
DEBU[2022-04-15T14:14:01.796470242+08:00] loading tasks in namespace namespace=moby
ERRO[2022-04-15T14:14:01.796797091+08:00] connecting to shim error="dial unix \x00/containerd-shim/moby/427b5abebd744817fe9cf8c0aa2febadff17d5905e830d3236bb46fa58d6858b/shim.sock: connect: connection refused" id=427b5abebd744817fe9cf8c0aa2febadff17d5905e830d3236bb46fa58d6858b namespace=moby
WARN[2022-04-15T14:14:01.796844852+08:00] cleaning up after shim dead id=427b5abebd744817fe9cf8c0aa2febadff17d5905e830d3236bb46fa58d6858b namespace=moby
DEBU[2022-04-15T14:14:01.809292656+08:00] event published ns=moby topic=/tasks/exit type=containerd.events.TaskExit
DEBU[2022-04-15T14:14:01.810296940+08:00] event published ns=moby topic=/tasks/delete type=containerd.events.TaskDelete
ERRO[2022-04-15T14:14:01.810491037+08:00] connecting to shim error="dial unix /run/containerd/s/058682ed3ebcc6c9b8d37022b1d379d2d11dbf583467c8d834cc09b3d0c76fea: connect: connection refused" id=44d4a81a122079c684c2a45fcd412c8dc2eef3e4e3569ecae2332ac450b80076 namespace=moby
WARN[2022-04-15T14:14:01.810533770+08:00] cleaning up after shim dead id=44d4a81a122079c684c2a45fcd412c8dc2eef3e4e3569ecae2332ac450b80076 namespace=moby
DEBU[2022-04-15T14:14:01.823164929+08:00] event published ns=moby topic=/tasks/exit type=containerd.events.TaskExit
DEBU[2022-04-15T14:14:01.823795277+08:00] event published ns=moby topic=/tasks/delete type=containerd.events.TaskDelete
ERRO[2022-04-15T14:14:01.823953161+08:00] connecting to shim error="dial unix /run/containerd/s/96bd5e94f86fc8e5752989ee5f22f46924d7deea59c7b8ef11087186f81ca50b: connect: connection refused" id=52aa8ad1be01f3f947bcd2f03197771f1b965b0733f3d4026a58d77979618966 namespace=moby
WARN[2022-04-15T14:14:01.823995417+08:00] cleaning up after shim dead id=52aa8ad1be01f3f947bcd2f03197771f1b965b0733f3d4026a58d77979618966 namespace=moby
DEBU[2022-04-15T14:14:01.835741082+08:00] event published ns=moby topic=/tasks/exit type=containerd.events.TaskExit
DEBU[2022-04-15T14:14:01.836774408+08:00] event published ns=moby topic=/tasks/delete type=containerd.events.TaskDelete
ERRO[2022-04-15T14:14:01.836931317+08:00] connecting to shim error="dial unix \x00/containerd-shim/moby/6b1fb39b4c5ad46612b6755f652d608ef75b3374f719b08a67cac1b7f4ddf646/shim.sock: connect: connection refused" id=6b1fb39b4c5ad46612b6755f652d608ef75b3374f719b08a67cac1b7f4ddf646 namespace=moby
WARN[2022-04-15T14:14:01.836979896+08:00] cleaning up after shim dead id=6b1fb39b4c5ad46612b6755f652d608ef75b3374f719b08a67cac1b7f4ddf646 namespace=moby
DEBU[2022-04-15T14:14:01.849530967+08:00] event published ns=moby topic=/tasks/exit type=containerd.events.TaskExit
ERRO[2022-04-15T14:14:01.862000189+08:00] delete bundle error="rename /data/kube/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/6b1fb39b4c5ad46612b6755f652d608ef75b3374f719b08a67cac1b7f4ddf646 /data/kube/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/.6b1fb39b4c5ad46612b6755f652d608ef75b3374f719b08a67cac1b7f4ddf646: file exists"
DEBU[2022-04-15T14:14:01.862166378+08:00] event published ns=moby topic=/tasks/delete type=containerd.events.TaskDelete
ERRO[2022-04-15T14:14:01.862300707+08:00] connecting to shim error="dial unix \x00/containerd-shim/moby/b49a6d51e1a8997440bbcb6e9267a76d8ce2f24a684454890ae98600de364f64/shim.sock: connect: connection refused" id=b49a6d51e1a8997440bbcb6e9267a76d8ce2f24a684454890ae98600de364f64 namespace=moby
WARN[2022-04-15T14:14:01.862334378+08:00] cleaning up after shim dead id=b49a6d51e1a8997440bbcb6e9267a76d8ce2f24a684454890ae98600de364f64 namespace=moby
DEBU[2022-04-15T14:14:01.877502812+08:00] event published ns=moby topic=/tasks/exit type=containerd.events.TaskExit
DEBU[2022-04-15T14:14:01.877994394+08:00] event published ns=moby topic=/tasks/delete type=containerd.events.TaskDelete
ERRO[2022-04-15T14:14:01.878143864+08:00] connecting to shim error="dial unix /run/containerd/s/b538e3e9b252cc635009139b544114a233b3cc34ca48925588da77aa15cdf90a: connect: connection refused" id=b902f9e634100578e9ab38eeaf9a27224d843035ae33a368d65efc071393bd2f namespace=moby
WARN[2022-04-15T14:14:01.878183396+08:00] cleaning up after shim dead id=b902f9e634100578e9ab38eeaf9a27224d843035ae33a368d65efc071393bd2f namespace=moby
DEBU[2022-04-15T14:14:01.895566880+08:00] event published ns=moby topic=/tasks/exit type=containerd.events.TaskExit
DEBU[2022-04-15T14:14:01.896242084+08:00] event published ns=moby topic=/tasks/delete type=containerd.events.TaskDelete
ERRO[2022-04-15T14:14:01.896450640+08:00] connecting to shim error="dial unix /run/containerd/s/4d17ae4e022c51c38d6eff249e4a50d6eaceba757f043204ae68b9382271b0da: connect: connection refused" id=dc31940495385987c90e96c333014f1ac7dc7eddb5e3f45acb19877c79e22893 namespace=moby
WARN[2022-04-15T14:14:01.896496784+08:00] cleaning up after shim dead id=dc31940495385987c90e96c333014f1ac7dc7eddb5e3f45acb19877c79e22893 namespace=moby
DEBU[2022-04-15T14:14:01.904116791+08:00] garbage collected d=7.541562ms
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
0x40c05f
fatal error: bad lfnode address
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x45cf1f]

runtime stack:
runtime: unexpected return pc for runtime.lfnodeValidate called from 0x0
stack: frame={sp:0x7f5918d8f5d8, fp:0x7f5918d8f600} stack=[0x7f5918594148,0x7f5918d93d48)
00007f5918d8f4d8: 000000c000000900 01000000000003e8
00007f5918d8f4e8: 0000000000000004 000000000000001f
00007f5918d8f4f8: 000000000045cf1f <runtime.GoroutineProfile.func2+47> 0000000000000000
00007f5918d8f508: 0000000000000080 00000000013f1515
00007f5918d8f518: 00007f5918d8f560 000000000045d961 <runtime.fatalthrow.func1+97>
00007f5918d8f528: 000000c000000900 0000000000430dd4 <runtime.throw+116>
00007f5918d8f538: 00007f5918d8f5a8 0000000000000001
00007f5918d8f548: 00007f5918d8f5a8 0000000000430dd4 <runtime.throw+116>
00007f5918d8f558: 000000c000000900 00007f5918d8f598
00007f5918d8f568: 0000000000430fa9 <runtime.fatalthrow+89> 00007f5918d8f578
00007f5918d8f578: 000000000045d900 <runtime.fatalthrow.func1+0> 000000c000000900
00007f5918d8f588: 0000000000430dd4 <runtime.throw+116> 00007f5918d8f5a8
00007f5918d8f598: 00007f5918d8f5c8 0000000000430dd4 <runtime.throw+116>
00007f5918d8f5a8: 00007f5918d8f5b0 000000000045d870 <runtime.throw.func1+0>
00007f5918d8f5b8: 00000000013dce54 0000000000000012
00007f5918d8f5c8: 00007f5918d8f608 000000000040beba <runtime.lfnodeValidate+170>
00007f5918d8f5d8: <00000000013dce54 0000000000000012
00007f5918d8f5e8: 00000000004317a0 <runtime.recordForPanic+304> 000000000274594b
00007f5918d8f5f8: !0000000000000000 >0000000000000000
00007f5918d8f608: 00007f5918d8f648 0000000000000000
00007f5918d8f618: 00007f5918d8f650 00000000004317a0 <runtime.recordForPanic+304>
00007f5918d8f628: 000000000274594b 0000000000000004
00007f5918d8f638: 00007f5918d8f670 00000000004317a0 <runtime.recordForPanic+304>
00007f5918d8f648: 00007f5918d8f668 000000000043183d <runtime.printlock+109>
00007f5918d8f658: 00000000027445b0 000000c000074380
00007f5918d8f668: 00007f5918d8f698 000000000045d8a6 <runtime.throw.func1+54>
00007f5918d8f678: 0000000000431967 <runtime.gwrite+167> 0000000000000002
00007f5918d8f688: 000000000000002a 00000000014054d6
00007f5918d8f698: 00007f5918d8f6c8 0000000000430dad <runtime.throw+77>
00007f5918d8f6a8: 00007f5918d8f6b0 000000000045d870 <runtime.throw.func1+0>
00007f5918d8f6b8: 00000000014054d6 000000000000002a
00007f5918d8f6c8: 00007f5918d8f6f8 0000000000446a60 <runtime.sigpanic+1152>
00007f5918d8f6d8: 00000000014054d6 000000000000002a
00007f5918d8f6e8: 00000000013c9053 0000000000000001
00007f5918d8f6f8: 00007f5918d8f720
runtime.throw(0x13dce54, 0x12)
/usr/local/go/src/runtime/panic.go:774 +0x74
runtime: unexpected return pc for runtime.lfnodeValidate called from 0x0
stack: frame={sp:0x7f5918d8f5d8, fp:0x7f5918d8f600} stack=[0x7f5918594148,0x7f5918d93d48)
00007f5918d8f4d8: 000000c000000900 01000000000003e8
00007f5918d8f4e8: 0000000000000004 000000000000001f
00007f5918d8f4f8: 000000000045cf1f <runtime.GoroutineProfile.func2+47> 0000000000000000
00007f5918d8f508: 0000000000000080 00000000013f1515
00007f5918d8f518: 00007f5918d8f560 000000000045d961 <runtime.fatalthrow.func1+97>
00007f5918d8f528: 000000c000000900 0000000000430dd4 <runtime.throw+116>
00007f5918d8f538: 00007f5918d8f5a8 0000000000000001
00007f5918d8f548: 00007f5918d8f5a8 0000000000430dd4 <runtime.throw+116>
00007f5918d8f558: 000000c000000900 00007f5918d8f598
00007f5918d8f568: 0000000000430fa9 <runtime.fatalthrow+89> 00007f5918d8f578
00007f5918d8f578: 000000000045d900 <runtime.fatalthrow.func1+0> 000000c000000900
00007f5918d8f588: 0000000000430dd4 <runtime.throw+116> 00007f5918d8f5a8
00007f5918d8f598: 00007f5918d8f5c8 0000000000430dd4 <runtime.throw+116>
00007f5918d8f5a8: 00007f5918d8f5b0 000000000045d870 <runtime.throw.func1+0>
00007f5918d8f5b8: 00000000013dce54 0000000000000012
00007f5918d8f5c8: 00007f5918d8f608 000000000040beba <runtime.lfnodeValidate+170>
00007f5918d8f5d8: <00000000013dce54 0000000000000012
00007f5918d8f5e8: 00000000004317a0 <runtime.recordForPanic+304> 000000000274594b
00007f5918d8f5f8: !0000000000000000 >0000000000000000
00007f5918d8f608: 00007f5918d8f648 0000000000000000
00007f5918d8f618: 00007f5918d8f650 00000000004317a0 <runtime.recordForPanic+304>
00007f5918d8f628: 000000000274594b 0000000000000004
00007f5918d8f638: 00007f5918d8f670 00000000004317a0 <runtime.recordForPanic+304>
00007f5918d8f648: 00007f5918d8f668 000000000043183d <runtime.printlock+109>
00007f5918d8f658: 00000000027445b0 000000c000074380
00007f5918d8f668: 00007f5918d8f698 000000000045d8a6 <runtime.throw.func1+54>
00007f5918d8f678: 0000000000431967 <runtime.gwrite+167> 0000000000000002
00007f5918d8f688: 000000000000002a 00000000014054d6
00007f5918d8f698: 00007f5918d8f6c8 0000000000430dad <runtime.throw+77>
00007f5918d8f6a8: 00007f5918d8f6b0 000000000045d870 <runtime.throw.func1+0>
00007f5918d8f6b8: 00000000014054d6 000000000000002a
00007f5918d8f6c8: 00007f5918d8f6f8 0000000000446a60 <runtime.sigpanic+1152>
00007f5918d8f6d8: 00000000014054d6 000000000000002a
00007f5918d8f6e8: 00000000013c9053 0000000000000001
00007f5918d8f6f8: 00007f5918d8f720
runtime.lfnodeValidate(0x0)
/usr/local/go/src/runtime/lfstack.go:65 +0xaa

goroutine 1 [chan receive]:
github.com/containerd/containerd/vendor/github.com/containerd/go-runc.(*defaultMonitor).Wait(0x2744360, 0xc0002738c0, 0xc00035e4e0, 0x0, 0x0, 0x3d54454b434f)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/github.com/containerd/go-runc/monitor.go:74 +0x50
github.com/containerd/containerd/vendor/github.com/containerd/go-runc.cmdOutput(0xc0002738c0, 0xc0002ee301, 0x0, 0x0, 0x0, 0x0, 0x0)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/github.com/containerd/go-runc/runc.go:709 +0x14e
github.com/containerd/containerd/vendor/github.com/containerd/go-runc.(*Runc).runOrError(0xc00055ec80, 0xc0002738c0, 0xc0002cee40, 0xc00004ee40)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/github.com/containerd/go-runc/runc.go:689 +0x186
github.com/containerd/containerd/vendor/github.com/containerd/go-runc.(*Runc).Delete(0xc00055ec80, 0x1bcd9e0, 0xc0002cee40, 0xc000016796, 0x40, 0xc00062cd77, 0x40, 0xc00055ec80)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/github.com/containerd/go-runc/runc.go:302 +0x16a
github.com/containerd/containerd/runtime/v1/linux.(*Runtime).terminate(0xc000240960, 0x1bcd9e0, 0xc0002cee40, 0xc0002cea80, 0xc00010d5d1, 0x4, 0xc000016796, 0x40, 0x1, 0x1)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/runtime/v1/linux/runtime.go:473 +0xfc
github.com/containerd/containerd/runtime/v1/linux.(*Runtime).cleanupAfterDeadShim(0xc000240960, 0x1bcd9e0, 0xc0002cee40, 0xc0002cea80, 0xc00010d5d1, 0x4, 0xc000016796, 0x40, 0x1b938c0, 0xc00003caf0)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/runtime/v1/linux/runtime.go:432 +0x386
github.com/containerd/containerd/runtime/v1/linux.(*Runtime).loadTasks(0xc000240960, 0x1bcd960, 0xc000040098, 0xc00010d5d1, 0x4, 0x0, 0x0, 0xc00062e610, 0x439d71, 0xc000052500)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/runtime/v1/linux/runtime.go:362 +0xa28
github.com/containerd/containerd/runtime/v1/linux.(*Runtime).restoreTasks(0xc000240960, 0x1bcd960, 0xc000040098, 0x1a9f5c0, 0xc000532540, 0x0, 0x0, 0xc00062e898)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/runtime/v1/linux/runtime.go:298 +0x368
github.com/containerd/containerd/runtime/v1/linux.New(0xc00034ea80, 0xc000332c60, 0x2, 0x2, 0x1968f20)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/runtime/v1/linux/runtime.go:125 +0x3db
github.com/containerd/containerd/plugin.(*Registration).Init(0xc00009a1e0, 0xc00034ea80, 0x18c2f40)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/plugin/plugin.go:110 +0x3a
github.com/containerd/containerd/services/server.New(0x1bcd960, 0xc000040098, 0xc00055c480, 0x1, 0x1, 0xc0002079b0)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/services/server/server.go:167 +0xcaa
github.com/containerd/containerd/cmd/containerd/command.App.func1(0xc000558580, 0x0, 0xc000162880)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/cmd/containerd/command/main.go:177 +0x7fa
github.com/containerd/containerd/vendor/github.com/urfave/cli.HandleAction(0x1937d80, 0x1b71550, 0xc000558580, 0xc000558580, 0x0)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:523 +0xc0
github.com/containerd/containerd/vendor/github.com/urfave/cli.(*App).Run(0xc000536700, 0xc00003c050, 0x5, 0x5, 0x0, 0x0)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/github.com/urfave/cli/app.go:285 +0x5e1
main.main()
github.com/containerd/containerd/cmd/containerd/main.go:33 +0x51

goroutine 6 [syscall]:
os/signal.signal_recv(0x0)
/usr/local/go/src/runtime/sigqueue.go:147 +0x9e
os/signal.loop()
/usr/local/go/src/os/signal/signal_unix.go:23 +0x24
created by os/signal.init.0
/usr/local/go/src/os/signal/signal_unix.go:29 +0x43

goroutine 7 [chan receive]:
github.com/containerd/containerd/vendor/k8s.io/klog.(*loggingT).flushDaemon(0x2721260)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/k8s.io/klog/klog.go:1010 +0x8d
created by github.com/containerd/containerd/vendor/k8s.io/klog.init.0
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/k8s.io/klog/klog.go:411 +0xd8

goroutine 43 [select]:
github.com/containerd/containerd/cmd/containerd/command.handleSignals.func1(0xc000551380, 0xc000551320, 0x1bcd960, 0xc000040098, 0xc00054c300)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:44 +0xf2
created by github.com/containerd/containerd/cmd/containerd/command.handleSignals
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/cmd/containerd/command/main_unix.go:41 +0x8b

goroutine 11 [select]:
github.com/containerd/containerd/vendor/github.com/docker/go-events.(*Broadcaster).run(0xc00003c0f0)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:117 +0x1b3
created by github.com/containerd/containerd/vendor/github.com/docker/go-events.NewBroadcaster
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/github.com/docker/go-events/broadcast.go:39 +0x1b0

goroutine 114 [runnable]:
os/exec.(*Cmd).Start.func2(0xc0002738c0)
/usr/local/go/src/os/exec/exec.go:448 +0xc6
created by os/exec.(*Cmd).Start
/usr/local/go/src/os/exec/exec.go:447 +0x6d2

goroutine 47 [select]:
github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run(0xc0002408a0, 0x1bcd960, 0xc000040098)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:268 +0x1ce
created by github.com/containerd/containerd/gc/scheduler.init.0.func1
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/gc/scheduler/scheduler.go:132 +0x429

goroutine 115 [runnable]:
os/exec.(*Cmd).Wait(0xc0002738c0, 0x0, 0x0)
/usr/local/go/src/os/exec/exec.go:514 +0x127
github.com/containerd/containerd/vendor/github.com/containerd/go-runc.(*defaultMonitor).Start.func1(0xc0002738c0, 0xc00035e4e0)
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/github.com/containerd/go-runc/monitor.go:55 +0x31
created by github.com/containerd/containerd/vendor/github.com/containerd/go-runc.(*defaultMonitor).Start
/tmp/tmp.0JSku0IZFM/src/github.com/containerd/containerd/vendor/github.com/containerd/go-runc/monitor.go:53 +0xa7

The output above is just one of the intermittent failure modes; sometimes it dies with a segmentation fault instead, and coredumps also keep landing in the root directory:

$ ll /
total 145912
lrwxrwxrwx. 1 root root 7 Oct 13 2021 bin -> usr/bin
dr-xr-xr-x. 5 root root 4096 Apr 11 14:55 boot
-rw------- 1 root root 230711296 Apr 15 13:59 core.28650
-rw------- 1 root root 187424768 Apr 15 13:46 core.3977
drwxr-xr-x 12 xxx xxx 156 Oct 29 11:32 data
drwxr-xr-x 19 root root 3180 Apr 15 12:38 dev
drwxr-xr-x. 85 root root 8192 Apr 15 12:44 etc
drwxr-xr-x. 3 root root 17 Oct 28 13:30 home
lrwxrwxrwx. 1 root root 7 Oct 13 2021 lib -> usr/lib
lrwxrwxrwx. 1 root root 9 Oct 13 2021 lib64 -> usr/lib64
drwxr-xr-x. 3 root root 127 Oct 28 13:32 media
drwxr-xr-x. 2 root root 6 Apr 11 2018 mnt
drwxr-xr-x. 3 root root 24 Oct 28 13:38 opt
dr-xr-xr-x 253 root root 0 Apr 15 12:38 proc
dr-xr-x---. 7 root root 258 Apr 15 13:13 root
drwxr-xr-x 30 root root 960 Apr 15 14:11 run
lrwxrwxrwx. 1 root root 8 Oct 13 2021 sbin -> usr/sbin
drwxr-xr-x. 2 root root 6 Apr 11 2018 srv
dr-xr-xr-x 13 root root 0 Apr 15 13:12 sys
drwxrwxrwt. 14 root root 4096 Apr 15 14:28 tmp
drwxr-xr-x. 13 root root 155 Oct 13 2021 usr
drwxr-xr-x. 19 root root 267 Oct 13 2021 var

Comparing the binaries showed no corruption, and checking the process list showed no security software, so I turned to the system logs. In fact, during the troubleshooting above the ssh session would occasionally drop as well, and strace gave no useful leads either.
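For the binary comparison, checksumming against a known-good node running the same version is enough (a sketch; the host name and binary paths are assumptions, since this is a static-binary install):

$ sha256sum /usr/bin/dockerd /usr/bin/containerd /usr/bin/containerd-shim /usr/bin/runc
$ ssh good-node sha256sum /usr/bin/dockerd /usr/bin/containerd /usr/bin/containerd-shim /usr/bin/runc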

After filtering the noise out of the system logs, it turned out that bash was segfaulting too:

Apr 15 13:46:32 xxx supervisord: 2022-04-15 13:46:32,958 INFO exited: prometheus_00 (exit status 2; expected)
Apr 15 13:46:33 xxx supervisord: 2022-04-15 13:46:33,962 INFO spawned: 'prometheus_00' with pid 3751
Apr 15 13:46:35 xxx supervisord: 2022-04-15 13:46:35,379 INFO success: prometheus_00 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Apr 15 13:46:35 xxx dockerd: failed to start daemon: failed to dial "/run/containerd/containerd.sock": failed to dial "/run/containerd/containerd.sock": context deadline exceeded
Apr 15 13:46:35 xxx systemd: docker.service: main process exited, code=exited, status=1/FAILURE
Apr 15 13:46:35 xxx systemd: Unit docker.service entered failed state.
Apr 15 13:46:35 xxx systemd: docker.service failed.
Apr 15 13:46:40 xxx systemd: docker.service holdoff time over, scheduling restart.
Apr 15 13:46:40 xxx systemd: Starting Docker Application Container Engine...
Apr 15 13:46:40 xxx systemd: Started Docker Application Container Engine.
Apr 15 13:46:40 xxx systemd: docker.service: main process exited, code=killed, status=11/SEGV
Apr 15 13:46:40 xxx systemd: Unit docker.service entered failed state.
Apr 15 13:46:40 xxx systemd: docker.service failed.
Apr 15 13:46:42 xxx kernel: bash[4028]: segfault at fa8 ip 0000000000440c58 sp 00007fff8528e830 error 4 in bash[400000+de000]

Looking at just the segfault lines:

$ grep kernel /var/log/messages | grep -i segfault

Apr 15 11:17:31 xxx kernel: celery[13267]: segfault at 6de410 ip 000000000041b720 sp 00007ffccf1813c0 error 4 in python3.7[400000+293000]
Apr 15 11:20:35 xxx kernel: redis-server[21118]: segfault at 5dd030 ip 00000000005dd030 sp 00007ffc05cbe440 error 14 in redis-server[6dd000+1000]
Apr 15 11:20:35 xxx kernel: supervisord[20844]: segfault at 29 ip 000000000045c030 sp 00007ffd48aa29d0 error 4 in python3.7[400000+293000]
Apr 15 11:20:51 xxx kernel: gunicorn[23241]: segfault at 8 ip 00000000004272c8 sp 00007ffc0c4f34e0 error 4 in python3.7[400000+293000]
Apr 15 11:20:56 xxx kernel: celery[23193]: segfault at 10 ip 0000000000454a29 sp 00007ffd17acce78 error 4 in python3.7[400000+293000]
Apr 15 11:20:56 xxx kernel: python[23196]: segfault at 154c0ab0 ip 0000000000420eb0 sp 00007fff43b9fe08 error 6 in python3.7[400000+293000]
Apr 15 12:38:34 xxx kernel: init.ipv6-globa[1154]: segfault at 68 ip 000000000044f62f sp 00007ffc2cf307a8 error 6 in bash[400000+de000]
Apr 15 12:59:25 xxx kernel: bash[11112]: segfault at 18 ip 0000000000449dc9 sp 00007ffffbc2dac8 error 4 in bash[400000+de000]
Apr 15 13:00:58 xxx kernel: bash[14117]: segfault at ffffffff8d48ffff ip 0000000000440c36 sp 00007ffe9be29640 error 7 in bash[400000+de000]
Apr 15 13:01:19 xxx kernel: strace[14852]: segfault at 0 ip (null) sp 00007ffc836a2be8 error 14 in strace[400000+f7000]
Apr 15 13:10:44 xxx kernel: grep[2587]: segfault at a0d ip 000000000040c43f sp 00007ffcda056be8 error 4 in grep[400000+25000]
Apr 15 13:14:06 xxx kernel: strace[9484]: segfault at ffffffff89489abc ip 00000000004336c2 sp 00007ffcb217a758 error 7 in strace[400000+f7000]
Apr 15 13:27:57 xxx kernel: grep[32713]: segfault at 0 ip (null) sp 00007ffebf181d10 error 14 in grep[400000+25000]
Apr 15 13:30:22 xxx kernel: bash[5138]: segfault at 108 ip 000000000040cc32 sp 00007ffd55de2728 error 4 in bash[400000+de000]
Apr 15 13:35:56 xxx kernel: strace[15846]: segfault at 0 ip (null) sp 00007ffef0a931e0 error 14 in strace[400000+f7000]
Apr 15 13:36:27 xxx kernel: bash[16818]: segfault at 33173b0 ip 000000000046d584 sp 00007ffdb71ff5c8 error 6 in bash[400000+de000]
Apr 15 13:37:04 xxx kernel: bash[17993]: segfault at 46a0 ip 0000000000440c58 sp 00007fffd76cf1e0 error 4 in bash[400000+de000]
Apr 15 13:39:49 xxx kernel: bash[23124]: segfault at 5aa0 ip 0000000000440c58 sp 00007fff98271ac0 error 4 in bash[400000+de000]
Apr 15 13:46:42 xxx kernel: bash[4028]: segfault at fa8 ip 0000000000440c58 sp 00007fff8528e830 error 4 in bash[400000+de000]
Apr 15 13:59:04 xxx kernel: redis_exporter-[21520]: segfault at 43b8ba ip 000000000043b892 sp 000000c000057f48 error 7 in redis_exporter-1.23.1.linux-x86_64[400000+41f000]
Apr 15 14:12:12 xxx kernel: grepconf.sh[20714]: segfault at 0 ip (null) sp 00007ffe3009c158 error 14 in bash[400000+de000]
Apr 15 14:14:43 xxx kernel: bash[25463]: segfault at 313f4023 ip 00000000313f4023 sp 00007ffd25075948 error 14 in ISO8859-1.so[7f6335587000+2000]
Apr 15 14:22:32 xxx kernel: bash[7390]: segfault at 18 ip 0000000000449dc9 sp 00007ffc765d0558 error 4 in bash[400000+de000]
Apr 15 14:38:21 xxx kernel: prometheus[26806]: segfault at 440eb5 ip 0000000000421391 sp 000000c0002f1f38 error 7 in prometheus[400000+20ec000]
Apr 15 14:38:21 xxx kernel: prometheus[26801]: segfault at bffffffff8 ip 0000000000440ea3 sp 000000c000000000 error 6 in prometheus[400000+20ec000]

Memory capacity looked fine. Plenty of things can trigger a segmentation fault, the most common being out-of-bounds memory access, but the system logs do not point to per-process memory bugs (with this many different processes segfaulting, it is implausible that they all have broken code). rpm -V glibc showed no modified .so files either. So the preliminary suspicion was faulty RAM on the customer's host, and I asked the customer to migrate this machine.
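For reference, the "error N" field in those kernel segfault lines is the x86 page-fault error code, and the kernel's EDAC/MCE counters are a quick first check for bad RAM before a full memtest (a sketch; the bit meanings are standard x86, and the EDAC sysfs files only exist if the platform exposes them):

$ err=4                                         # value from a "segfault at ... error 4" line
$ (( err & 1 )) && echo protection-violation || echo page-not-present
$ (( err & 2 )) && echo write || echo read
$ (( err & 4 )) && echo user-mode || echo kernel-mode
$ (( err & 16 )) && echo instruction-fetch
$ dmesg | grep -iE 'edac|mce|machine check'     # corrected/uncorrected memory errors, if any
$ grep -H . /sys/devices/system/edac/mc/mc*/*_count 2>/dev/null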

Follow-up

2022/04/22: the customer reported that everything has been fine since the migration.

Coredump configuration reference

Some coredump settings, based on the reference article coredump配置、产生、分析以及分析示例 (coredump configuration, generation, analysis, and an analysis example):

$ cat /proc/sys/kernel/core_pattern 
core

Temporary change (takes effect immediately, not persistent across reboots):

echo "core-%e-%p-%t-%s" > /proc/sys/kernel/core_pattern 

Format specifiers:

%% - a single % character
%p - add the PID
%u - add the current UID
%g - add the current GID
%s - add the signal that caused the core dump
%t - add the UNIX time when the core was generated
%h - add the hostname
%e - add the executable filename
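Since this incident filled the root partition with core.$pid files, it also helps to point the pattern at a dedicated directory on a larger volume; the kernel will not create the directory, so it must exist first (a sketch; /data/coredump is an assumed path):

$ mkdir -p /data/coredump
$ echo "/data/coredump/core-%e-%p-%t-%s" > /proc/sys/kernel/core_pattern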

Persisting it via sysctl:

kernel.core_pattern=core-%e-%p-%t-%s
kernel.core_uses_pid=1
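To persist and apply those two lines, drop them into a file under /etc/sysctl.d/ (the file name here is an assumption) and reload:

cat > /etc/sysctl.d/99-coredump.conf <<'EOF'
kernel.core_pattern=core-%e-%p-%t-%s
kernel.core_uses_pid=1
EOF
sysctl --system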

Set the coredump file size limit via /etc/security/limits.d/*.conf:

* soft core 1024
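A quick way to verify the whole chain is to raise the limit in the current shell and force a test process to dump (a sketch; with the core-%e-%p-%t-%s pattern the file lands in the crashing process's working directory):

$ ulimit -c unlimited
$ sleep 300 &
$ kill -SEGV $!
$ ls core-sleep-*          # e.g. core-sleep-12345-1650000000-11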