zhangguanzhang's Blog

From a Home-Grown NodePort Whitelist to Borrowing Calico's Rules

Word count: 5.3k | Reading time: 28 min
2025/10/15

A look at how Calico implements a NodePort whitelist.

Background

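We ship on-premises deployments, and many customers care about security. They want NetworkPolicy-style whitelisting that restricts which sources may connect. However, even a minimal Calico deployment with the following configuration: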

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec: # https://docs.tigera.io/calico/latest/reference/installation/api#installationspec
  calicoNetwork: # https://docs.tigera.io/calico/latest/reference/installation/api#caliconetworkspec
    ipPools:
    - name: default-ipv4-ippool
      blockSize: 24 # prefix length of the Pod IP block allocated to each node; default is 26, I prefer 24 because it is easier to read
      cidr: 10.187.0.0/16
      encapsulation: VXLANCrossSubnet # https://docs.tigera.io/calico/latest/reference/installation/api#encapsulationtype
      natOutgoing: Enabled
      nodeSelector: all()
    nodeAddressAutodetectionV4: # https://docs.tigera.io/calico/latest/reference/installation/api#nodeaddressautodetection
      canReach: 223.5.5.5
  registry: m.daocloud.io/quay.io # https://docs.tigera.io/calico/latest/operations/image-options/alternate-registry
  flexVolumePath: None # None means the CSI-related components are not installed
  kubeletVolumePluginPath: None
---
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}

still deploys quite a few components:

$ kubectl -n calico-system get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
calico-kube-controllers 1/1 1 1 6d3h
calico-typha 1/1 1 1 6d3h
goldmane 1/1 1 1 6d1h
whisker 1/1 1 1 6d1h
$ kubectl -n calico-apiserver get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
calico-apiserver 2/2 2 2 6d1h

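Many customers' machines are also low-spec, so we use flannel. For network policy we built our own agent container that enforces whitelists with iptables rules. It also works with plain Docker outside Kubernetes. Starting from a certain release, some services had to be exposed through a NodePort so that external systems could upload backup files. Given the customers' security requirements, that NodePort also needed a source whitelist.
If you are not familiar with iptables, your first instinct may be to filter in the INPUT chain. That does not work, because NodePort traffic is matched and DNATed first, in the PREROUTING chain of the nat table. Take the following Service as an example: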

Name:                     my-service
Namespace: default
Labels: <none>
Annotations: <none>
Selector: nodePort=test
Type: NodePort
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.186.158.205
IPs: 10.186.158.205
Port: <unset> 80/TCP
TargetPort: 80/TCP
NodePort: <unset> 30008/TCP
Endpoints: 10.187.220.19:80
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>

The relevant rules reached from PREROUTING in the nat table are:

# entry point
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/my-service" -m tcp --dport 30008 -j KUBE-EXT-FXIYY6OHUSNBITIX
# NodePort traffic is first marked for masquerade (SNAT), then jumps to the Service's DNAT chain KUBE-SVC-FXIYY6OHUSNBITIX below
-A KUBE-EXT-FXIYY6OHUSNBITIX -m comment --comment "masquerade traffic for default/my-service external destinations" -j KUBE-MARK-MASQ
-A KUBE-EXT-FXIYY6OHUSNBITIX -j KUBE-SVC-FXIYY6OHUSNBITIX

# the Service's DNAT chain
-A KUBE-SVC-FXIYY6OHUSNBITIX -d 10.186.158.205/32 -p tcp -m comment --comment "default/my-service cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SVC-FXIYY6OHUSNBITIX -m comment --comment "default/my-service -> 10.187.220.19:80" -j KUBE-SEP-DPGHCWFA3YQKRCGQ

# the Service's endpoint chain
-A KUBE-SEP-DPGHCWFA3YQKRCGQ -s 10.187.220.19/32 -m comment --comment "default/my-service" -j KUBE-MARK-MASQ
-A KUBE-SEP-DPGHCWFA3YQKRCGQ -p tcp -m comment --comment "default/my-service" -m tcp -j DNAT --to-destination 10.187.220.19:80

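By the time the packet reaches the filter table, its destination IP and port have already been rewritten by DNAT. Because the new destination is a Pod IP, the packet is actually routed through FORWARD rather than INPUT. So the NodePort cannot be matched in INPUT, and the same applies to ports published with docker -p. That is why I originally did the filtering in the PREROUTING chain of the raw table.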
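To make the ordering concrete, here is a toy Python model, not real netfilter. The nat PREROUTING step rewrites the destination before any filter chain sees the packet, so a filter-side `--dport 30008` match never fires. The node IP and client IP are hypothetical; the Pod endpoint 10.187.220.19:80 comes from the Service above.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    dport: int

NODE_IP = "10.0.0.10"  # hypothetical node IP

# (node IP, NodePort) -> (Pod IP, targetPort), mirroring the KUBE-SEP-... DNAT rule
DNAT = {(NODE_IP, 30008): ("10.187.220.19", 80)}

def nat_prerouting(pkt: Packet) -> Packet:
    """Apply DNAT if the packet targets a known NodePort, else leave it unchanged."""
    target = DNAT.get((pkt.dst, pkt.dport))
    if target is None:
        return pkt
    return replace(pkt, dst=target[0], dport=target[1])

def routing_decision(pkt: Packet, local_ips: set) -> str:
    """After PREROUTING, local destinations go to INPUT, everything else to FORWARD."""
    return "INPUT" if pkt.dst in local_ips else "FORWARD"

def filter_sees_nodeport(pkt: Packet) -> bool:
    """What a '-p tcp --dport 30008' match in a filter chain would see."""
    return pkt.dport == 30008

incoming = Packet(src="192.0.2.7", dst=NODE_IP, dport=30008)
after_nat = nat_prerouting(incoming)
print(after_nat)                               # Packet(src='192.0.2.7', dst='10.187.220.19', dport=80)
print(routing_decision(after_nat, {NODE_IP}))  # FORWARD
print(filter_sees_nodeport(after_nat))         # False
```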

The problem with the rules

The design used two ipsets: one holding the protected port list and one holding the whitelisted IPs. The rules were:

$ iptables -t raw -S PREROUTING
# let return traffic of tracked connections through
-A PREROUTING -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
# drop if the source IP is not whitelisted but the destination port is in the port list
-A PREROUTING -m set ! --match-set whiteiplist src -m set --match-set whiteportlist dst -j DROP

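Testing showed no problems. Later, however, field engineers occasionally reported that on customer sites, services acting as clients to external servers timed out now and then. Packet captures showed that when this host connected to an external server, the server's reply was blocked:

  • this host, acting as the client, sends a SYN to the external server
  • the SYN-ACK from the external server back to this host is blocked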

The iptables counters show:

$ iptables -t raw -nvL PREROUTING
Chain PREROUTING (policy ACCEPT)
pkts bytes target prot opt in out source destination
162M 64G ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 ctstate RELATED,ESTABLISHED
20 1604 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 ! match-set whiteiplist src match-set whiteportlist dst

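So it was this raw-table rule doing the DROP. Investigation showed that, starting from a certain release, many more ports had been added to the port list, for example ranges like 49100-49500. We also use whiteportlist in the INPUT chain. These ports fall inside ip_local_port_range, so the drop happens whenever a client socket happens to pick one of them:

  • this host connects to an external server and the allocated local port happens to be in whiteportlist, e.g. 49123
  • the external server's reply has not been marked ESTABLISHED by conntrack at this point, so it falls through to the next rule
  • it then hits that next rule and is DROPped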
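Whether a host is exposed to this comes down to a range intersection between the protected port list and net.ipv4.ip_local_port_range. The following is a small Python check. It assumes the protected ports can be written as inclusive ranges. The example values are hypothetical apart from 49100-49500, which comes from the text.

```python
LOCAL_PORT_RANGE = "/proc/sys/net/ipv4/ip_local_port_range"

def parse_local_port_range(text):
    """Parse the sysctl contents, e.g. '32768\t60999\n' -> (32768, 60999)."""
    low, high = (int(x) for x in text.split())
    return low, high

def overlapping_ports(protected, local_range):
    """Intersect inclusive (start, end) port ranges with the ephemeral port range."""
    lo, hi = local_range
    hits = []
    for start, end in protected:
        s, e = max(start, lo), min(end, hi)
        if s <= e:
            hits.append((s, e))
    return hits

if __name__ == "__main__":
    # hypothetical whiteportlist contents, expressed as ranges
    protected = [(30008, 30008), (49100, 49500)]
    try:
        with open(LOCAL_PORT_RANGE) as f:
            local = parse_local_port_range(f.read())
    except FileNotFoundError:
        local = (32768, 60999)  # the usual Linux default
    for s, e in overlapping_ports(protected, local):
        print(f"ports {s}-{e} may be picked as client source ports")
```

This also explains why the NodePort itself never caused trouble. Kubernetes' default NodePort range 30000-32767 sits just below Linux's default ephemeral range 32768-60999, so only the extra high ports overlapped.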

Reproducing it with a small TCP program:

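import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# pin the client source port to 49123, a port that is in whiteportlist
s.bind(('0.0.0.0', 49123))
# hangs here: the SYN-ACK coming back to port 49123 is dropped in raw PREROUTING
s.connect(('39.156.70.37', 80))
s.send(b'GET / HTTP/1.1\r\nHost: www.baidu.com\r\n\r\n')
print(s.recv(1024))
s.close()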

39.156.70.37 is one of Baidu's IPs. The script hangs when run, and the rule's counters go up accordingly:

# reset the PREROUTING counters
$ iptables -t raw -Z PREROUTING
$ python test.py
^C
$ iptables -t raw -nvL PREROUTING

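So raw, as the name suggests, is simply too raw. Its PREROUTING hook runs before connection tracking. A reply arriving from the network therefore has not yet been associated with its connection when it reaches these rules, and the RELATED,ESTABLISHED exception does not apply to it.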
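The hook ordering can be read off the kernel's IPv4 hook priorities (enum nf_ip_hook_priorities in include/uapi/linux/netfilter_ipv4.h; lower values run first). This Python table restates only the PREROUTING-relevant constants. It shows that conntrack sits between raw and mangle, which is also why the same check works in mangle.

```python
# IPv4 PREROUTING hook priorities from enum nf_ip_hook_priorities (lower runs first)
PREROUTING_HOOKS = {
    "raw": -300,         # NF_IP_PRI_RAW
    "conntrack": -200,   # NF_IP_PRI_CONNTRACK: connection lookup, ctstate assigned
    "mangle": -150,      # NF_IP_PRI_MANGLE
    "nat (DNAT)": -100,  # NF_IP_PRI_NAT_DST
}

def traversal_order(hooks):
    """Names of the hooks in the order a packet visits them."""
    return [name for name, _ in sorted(hooks.items(), key=lambda kv: kv[1])]

def sees_conntrack_state(table, hooks=PREROUTING_HOOKS):
    """True if the table's hook runs after conntrack has classified the packet."""
    return hooks[table] > hooks["conntrack"]

print(traversal_order(PREROUTING_HOOKS))  # ['raw', 'conntrack', 'mangle', 'nat (DNAT)']
print(sees_conntrack_state("raw"), sees_conntrack_state("mangle"))  # False True
```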

calico

To see how Calico does it, I deployed Calico on a clean single-node Kubernetes cluster. The Calico version installed by the operator is:

$ docker images | grep calico/node
m.daocloud.io/quay.io/calico/node v3.30.3 ce9c4ac0f175 7 weeks ago 401MB

All rules examined in this post are from v3.30.3.

Preparation

First, deploy a NodePort Service:

apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: test-hostname
  labels:
    nodePort: test
spec:
  containers:
  - name: test
    image: m.daocloud.io/docker.io/library/nginx:alpine
---
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: NodePort
  selector:
    nodePort: test
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
    nodePort: 30008

Save the current state so it can be compared once the policy takes effect:

iptables -w -t raw -S > raw
iptables -w -t nat -S > nat
iptables -w -t mangle -S > mangle
iptables -w -S > filter
ipset list > ipset

GlobalNetworkPolicy

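Searching turned up the official docs page network-policy kubernetes-node-ports, which uses a GlobalNetworkPolicy. According to the docs, this Calico CRD has a broader scope than NetworkPolicy. It can control traffic at the host level, whereas NetworkPolicy only covers namespaces and Pods. Following the example in the docs, I wrote: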

apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: allow-cluster-nodeport-only
spec:
  # Normally this is paired with a catch-all policy. Whitelist: higher-priority (lower order) policies allow, the lowest-priority one denies.
  # Blacklist: the lowest-priority policy allows, higher-priority ones DROP.
  # Here I am only testing, so I wrote just the rules below
  order: 20
  preDNAT: true
  applyOnForward: true
  ingress:
  - action: Allow
    source:
      nets:
      - 10.xxx.41.110/32 # this node's IP
      - 10.xxx.195.118/32 # external test IP
      - 10.187.0.0/16 # Pod CIDR
  - action: Deny
    protocol: TCP
    destination:
      ports: [30008]
  selector: has(kubernetes.io/os)

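After applying it, no new rules appeared in any iptables table. The docs say the selector can select nodes, but in practice it did not work. Elsewhere the docs combine kind: HostEndpoint with the selector, so I enabled automatic HostEndpoint (hep) creation. The hep was created, but the policy still did not take effect: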

$ kubectl patch kubecontrollersconfigurations default \
--type=merge -p '{"spec": {"controllers": {"node":{"hostEndpoint":{"autoCreate": "Enabled"}}}}}'
$ kubectl get hep -l kubernetes.io/os
NAME CREATED AT
10.xxx.xx.170-auto-hep 2025-10-15T09:18:37

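After reading the selector docs, I changed it to selector: all() and it worked. Curling the NodePort from an external IP that is not in the whitelist above no longer gets through.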

Studying the rules

Dump the current rules:

iptables -w -t raw -S > raw2
iptables -w -t nat -S > nat2
iptables -w -t mangle -S > mangle2
iptables -w -S > filter2

New rules

Diff:

$ diff mangle*
10a11
> -N cali-failsafe-in
11a13
> -N cali-fh-any-interface-at-all
12a15
> -N cali-pi-_Ddz2TLFtYPs0Zt3iUZs
25a29,37
> -A cali-failsafe-in -p tcp -m comment --comment "cali:wWFQM43tJU7wwnFZ" -m multiport --dports 22 -j ACCEPT
> -A cali-failsafe-in -p udp -m comment --comment "cali:LwNV--R8MjeUYacw" -m multiport --dports 68 -j ACCEPT
> -A cali-failsafe-in -p tcp -m comment --comment "cali:QOO5NUOqOSS1_Iw0" -m multiport --dports 179 -j ACCEPT
> -A cali-failsafe-in -p tcp -m comment --comment "cali:cwZWoBSwVeIAZmVN" -m multiport --dports 2379 -j ACCEPT
> -A cali-failsafe-in -p tcp -m comment --comment "cali:7FbNXT91kugE_upR" -m multiport --dports 2380 -j ACCEPT
> -A cali-failsafe-in -p tcp -m comment --comment "cali:8Ftbkk2dRH2eEeq1" -m multiport --dports 5473 -j ACCEPT
> -A cali-failsafe-in -p tcp -m comment --comment "cali:-JoRSaAQZPJAegMo" -m multiport --dports 6443 -j ACCEPT
> -A cali-failsafe-in -p tcp -m comment --comment "cali:PUKij4Rn9njHfVTi" -m multiport --dports 6666 -j ACCEPT
> -A cali-failsafe-in -p tcp -m comment --comment "cali:vSprVE-4rient0wc" -m multiport --dports 6667 -j ACCEPT
34a47,64
> -A cali-fh-any-interface-at-all -m comment --comment "cali:CCbcqJXqEISzSqnH" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
> -A cali-fh-any-interface-at-all -m comment --comment "cali:mmvu-cTJXJ7YH9Lp" -m conntrack --ctstate INVALID -j DROP
> -A cali-fh-any-interface-at-all -m comment --comment "cali:NnqjZhu9yccY4C7-" -j cali-failsafe-in
> -A cali-fh-any-interface-at-all -m comment --comment "cali:AtciE88iDfq0ah2L" -j MARK --set-xmark 0x0/0x30000
> -A cali-fh-any-interface-at-all -m comment --comment "cali:BZMMxJKaVi8hIM9r" -m comment --comment "Start of tier default" -j MARK --set-xmark 0x0/0x20000
> -A cali-fh-any-interface-at-all -m comment --comment "cali:_hnIU4TYdSt--CFh" -m mark --mark 0x0/0x20000 -j cali-pi-_Ddz2TLFtYPs0Zt3iUZs
> -A cali-fh-any-interface-at-all -m comment --comment "cali:-n3Ama1WlBcv-Yv9" -m comment --comment "Return if policy accepted" -m mark --mark 0x10000/0x10000 -j RETURN
> -A cali-from-host-endpoint -m comment --comment "cali:0MLuqUx2SPsTwgBS" -g cali-fh-any-interface-at-all
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:5eFTXO3b0B-Tbiq8" -m comment --comment "Policy default.allow-cluster-nodeport-only ingress" -j MARK --set-xmark 0x0/0x180000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -s 10.xxx.41.110/32 -m comment --comment "cali:L5TwSbHWsELZIAEd" -j MARK --set-xmark 0x80000/0x80000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -s 10.xxx.195.118/32 -m comment --comment "cali:noUEAlswvbgG5j7d" -j MARK --set-xmark 0x80000/0x80000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -s 10.187.0.0/16 -m comment --comment "cali:TxEjJz-IsLiJzVDK" -j MARK --set-xmark 0x80000/0x80000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:SBosizM5mtjxTsOe" -m mark --mark 0x80000/0x80000 -j MARK --set-xmark 0x10000/0x10000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:5w4NEetZaXhF7wjm" -m mark --mark 0x10000/0x10000 -j NFLOG --nflog-prefix "API0|default.allow-cluster-nodeport-only" --nflog-group 1 --nflog-range 80
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:NMym66CfdBVWGhc6" -m mark --mark 0x10000/0x10000 -j RETURN
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -p tcp -m comment --comment "cali:HONlGpSGnitWLUh-" -m multiport --dports 30008 -j MARK --set-xmark 0x40000/0x40000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:9vcFh92OaOMP06xg" -m mark --mark 0x40000/0x40000 -j NFLOG --nflog-prefix "DPI1|default.allow-cluster-nodeport-only" --nflog-group 1 --nflog-range 80
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:URAeOCUsDbThanFp" -m mark --mark 0x40000/0x40000 -j DROP

The main addition is three chains:

  • cali-failsafe-in
  • cali-fh-any-interface-at-all
  • cali-pi-_Ddz2TLFtYPs0Zt3iUZs

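cali-failsafe-in is, as the name says, a failsafe. It first allows ports such as ssh, etcd and kube-apiserver, so that a misconfigured network policy cannot lock you out of the machines or the cluster. The port list is in the official failsafe docs. Every newly added filtering path jumps to this chain first.

Of the other two chains, cali-fh-any-interface-at-all hands off to cali-pi-_Ddz2TLFtYPs0Zt3iUZs, which does the actual processing: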

> -A cali-fh-any-interface-at-all -m comment --comment "cali:_hnIU4TYdSt--CFh" -m mark --mark 0x0/0x20000 -j cali-pi-_Ddz2TLFtYPs0Zt3iUZs

The lazy approach: reset the chain's counters and see which rules get hit:

$ iptables -t mangle -Z cali-pi-_Ddz2TLFtYPs0Zt3iUZs
$ iptables -t mangle -nvL cali-pi-_Ddz2TLFtYPs0Zt3iUZs
Chain cali-pi-_Ddz2TLFtYPs0Zt3iUZs (1 references)
pkts bytes target prot opt in out source destination
10 600 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:5eFTXO3b0B-Tbiq8 */ /* Policy default.allow-cluster-nodeport-only ingress */ MARK and 0xffe7ffff
0 0 MARK all -- * * 10.xxx.41.110 0.0.0.0/0 /* cali:L5TwSbHWsELZIAEd */ MARK or 0x80000
0 0 MARK all -- * * 10.xxx.195.118 0.0.0.0/0 /* cali:noUEAlswvbgG5j7d */ MARK or 0x80000
0 0 MARK all -- * * 10.187.0.0/16 0.0.0.0/0 /* cali:TxEjJz-IsLiJzVDK */ MARK or 0x80000
0 0 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:SBosizM5mtjxTsOe */ mark match 0x80000/0x80000 MARK or 0x10000
0 0 NFLOG all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:5w4NEetZaXhF7wjm */ mark match 0x10000/0x10000 nflog-prefix "API0|default.allow-cluster-nodeport-only" nflog-group 1 nflog-range 80
0 0 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:NMym66CfdBVWGhc6 */ mark match 0x10000/0x10000
0 0 MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:HONlGpSGnitWLUh- */ multiport dports 30008 MARK or 0x40000
0 0 NFLOG all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:9vcFh92OaOMP06xg */ mark match 0x40000/0x40000 nflog-prefix "DPI1|default.allow-cluster-nodeport-only" nflog-group 1 nflog-range 80
0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:URAeOCUsDbThanFp */ mark match 0x40000/0x40000
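The -nvL listing renders the same rules differently from -S. For example, `--set-xmark 0x0/0x180000` appears as `MARK and 0xffe7ffff`, and `--set-xmark 0x80000/0x80000` appears as `MARK or 0x80000`. The MARK target computes new = (old & ~mask) ^ value. This Python sketch reproduces only the two display forms seen here; the other forms (xor, set, xset) are deliberately not modeled.

```python
MASK32 = 0xFFFFFFFF

def set_xmark(mark, value, mask):
    """MARK --set-xmark value/mask: clear the bits in mask, then XOR in value."""
    return ((mark & ~mask) ^ value) & MASK32

def describe_xmark(value, mask):
    """How 'iptables -L' prints the two forms in the listing above; None otherwise."""
    if value == 0:
        return f"MARK and {~mask & MASK32:#x}"  # only clears bits
    if value == mask:
        return f"MARK or {value:#x}"            # only sets bits
    return None

print(describe_xmark(0x0, 0x180000))            # MARK and 0xffe7ffff
print(describe_xmark(0x80000, 0x80000))         # MARK or 0x80000
print(hex(set_xmark(0x190000, 0x0, 0x180000)))  # 0x10000
```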

Then curl the NodePort from an external host that is not in the whitelist, and check the counters again:

$ iptables -t mangle -nvL cali-pi-_Ddz2TLFtYPs0Zt3iUZs
Chain cali-pi-_Ddz2TLFtYPs0Zt3iUZs (1 references)
pkts bytes target prot opt in out source destination
44 2640 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:5eFTXO3b0B-Tbiq8 */ /* Policy default.allow-cluster-nodeport-only ingress */ MARK and 0xffe7ffff
0 0 MARK all -- * * 10.xxx.41.110 0.0.0.0/0 /* cali:L5TwSbHWsELZIAEd */ MARK or 0x80000
0 0 MARK all -- * * 10.xxx.195.118 0.0.0.0/0 /* cali:noUEAlswvbgG5j7d */ MARK or 0x80000
0 0 MARK all -- * * 10.187.0.0/16 0.0.0.0/0 /* cali:TxEjJz-IsLiJzVDK */ MARK or 0x80000
0 0 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:SBosizM5mtjxTsOe */ mark match 0x80000/0x80000 MARK or 0x10000
0 0 NFLOG all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:5w4NEetZaXhF7wjm */ mark match 0x10000/0x10000 nflog-prefix "API0|default.allow-cluster-nodeport-only" nflog-group 1 nflog-range 80
0 0 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:NMym66CfdBVWGhc6 */ mark match 0x10000/0x10000
2 120 MARK tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:HONlGpSGnitWLUh- */ multiport dports 30008 MARK or 0x40000
2 120 NFLOG all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:9vcFh92OaOMP06xg */ mark match 0x40000/0x40000 nflog-prefix "DPI1|default.allow-cluster-nodeport-only" nflog-group 1 nflog-range 80
2 120 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:URAeOCUsDbThanFp */ mark match 0x40000/0x40000

It is essentially using mark bits as condition flags. These are the rules that matter:

> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:5eFTXO3b0B-Tbiq8" -m comment --comment "Policy default.allow-cluster-nodeport-only ingress" -j MARK --set-xmark 0x0/0x180000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -s 10.xxx.41.110/32 -m comment --comment "cali:L5TwSbHWsELZIAEd" -j MARK --set-xmark 0x80000/0x80000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -s 10.xxx.195.118/32 -m comment --comment "cali:noUEAlswvbgG5j7d" -j MARK --set-xmark 0x80000/0x80000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -s 10.187.0.0/16 -m comment --comment "cali:TxEjJz-IsLiJzVDK" -j MARK --set-xmark 0x80000/0x80000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:SBosizM5mtjxTsOe" -m mark --mark 0x80000/0x80000 -j MARK --set-xmark 0x10000/0x10000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:5w4NEetZaXhF7wjm" -m mark --mark 0x10000/0x10000 -j NFLOG --nflog-prefix "API0|default.allow-cluster-nodeport-only" --nflog-group 1 --nflog-range 80
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:NMym66CfdBVWGhc6" -m mark --mark 0x10000/0x10000 -j RETURN
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -p tcp -m comment --comment "cali:HONlGpSGnitWLUh-" -m multiport --dports 30008 -j MARK --set-xmark 0x40000/0x40000
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:9vcFh92OaOMP06xg" -m mark --mark 0x40000/0x40000 -j NFLOG --nflog-prefix "DPI1|default.allow-cluster-nodeport-only" --nflog-group 1 --nflog-range 80
> -A cali-pi-_Ddz2TLFtYPs0Zt3iUZs -m comment --comment "cali:URAeOCUsDbThanFp" -m mark --mark 0x40000/0x40000 -j DROP
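  • whitelisted sources get the mark 0x80000/0x80000
  • -m mark --mark 0x80000/0x80000 -j MARK --set-xmark 0x10000/0x10000: packets carrying 0x80000 additionally get 0x10000. Read it in binary: both bits are now set
  • --mark 0x10000/0x10000 -j NFLOG: packets carrying 0x10000 are logged to NFLOG, which you can capture with tcpdump -i nflog:1. Combined with the previous rule, only packets that matched an allow rule are logged
  • --mark 0x10000/0x10000 -j RETURN: whitelisted packets count as accepted and go no further down this chain
  • -m multiport --dports 30008 -j MARK --set-xmark 0x40000/0x40000: traffic to the NodePort gets marked
  • --mark 0x40000/0x40000 -j NFLOG: logged, then processing continues
  • --mark 0x40000/0x40000 -j DROP: the packet is dropped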
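Put together, the chain behaves like a small state machine over mark bits. The Python sketch below replays it using the bit values Felix logs (acceptMark 0x10000, dropMark 0x40000, scratch0Mark 0x80000, scratch1Mark 0x100000). It uses hypothetical documentation IPs in place of the masked 10.xxx addresses, and leaves out the NFLOG rules because they do not affect the verdict.

```python
import ipaddress

ACCEPT_MARK = 0x10000   # acceptMark
DROP_MARK = 0x40000     # dropMark
SCRATCH0 = 0x80000      # scratch0Mark: "some allow rule matched"
SCRATCH1 = 0x100000     # scratch1Mark

# hypothetical stand-ins for the policy's source nets
ALLOW_NETS = [ipaddress.ip_network(n) for n in ("192.0.2.10/32", "10.187.0.0/16")]
DENY_TCP_PORTS = {30008}

def cali_pi(mark, src, proto, dport):
    """Walk the policy chain; return (verdict, mark) with verdict 'RETURN' or 'DROP'."""
    mark &= ~(SCRATCH0 | SCRATCH1)            # --set-xmark 0x0/0x180000
    addr = ipaddress.ip_address(src)
    if any(addr in net for net in ALLOW_NETS):
        mark |= SCRATCH0                      # one of the -s ... rules matched
    if mark & SCRATCH0:
        mark |= ACCEPT_MARK                   # promote to "accepted"
    if mark & ACCEPT_MARK:
        return "RETURN", mark                 # accepted: leave the chain
    if proto == "tcp" and dport in DENY_TCP_PORTS:
        mark |= DROP_MARK                     # Deny rule matched
    if mark & DROP_MARK:
        return "DROP", mark
    return "RETURN", mark                     # no rule decided: fall off the end

print(cali_pi(0, "192.0.2.10", "tcp", 30008))    # whitelisted source
print(cali_pi(0, "198.51.100.5", "tcp", 30008))  # not whitelisted, NodePort
```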

PREROUTING in the mangle table

The whole flow lives in the mangle table:

$ iptables -t mangle -S PREROUTING
-P PREROUTING ACCEPT
-A PREROUTING -m comment --comment "cali:6gwbT8clXdHdC1b1" -j cali-PREROUTING

$ iptables -t mangle -S cali-PREROUTING
-N cali-PREROUTING
-A cali-PREROUTING -m comment --comment "cali:6BJqBjBC7crtA-7-" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A cali-PREROUTING -m comment --comment "cali:KX7AGNd6rMcDUai6" -m mark --mark 0x10000/0x10000 -j ACCEPT
-A cali-PREROUTING -m comment --comment "cali:wNH7KsA3ILKJBsY9" -j cali-from-host-endpoint
-A cali-PREROUTING -m comment --comment "cali:Cg96MgVuoPm7UMRo" -m comment --comment "Host endpoint policy accepted packet." -m mark --mark 0x10000/0x10000 -j ACCEPT

Next comes cali-from-host-endpoint. Unless processing returns from it, the packet never reaches the following rule, "Host endpoint policy accepted packet." -m mark --mark 0x10000/0x10000 -j ACCEPT. The chain itself contains:

$ iptables -t mangle -S cali-from-host-endpoint
-N cali-from-host-endpoint
-A cali-from-host-endpoint -m comment --comment "cali:0MLuqUx2SPsTwgBS" -g cali-fh-any-interface-at-all

So it continues (via -g, a goto) into the chains added in the diff above. That completes the flow.
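The -g matters. A jump (-j) records a return point, while a goto (-g) does not. So when cali-fh-any-interface-at-all returns, execution resumes in cali-PREROUTING right after the jump to cali-from-host-endpoint, which is exactly the "Host endpoint policy accepted packet." rule. Below is a hypothetical toy interpreter that supports only jump, goto, RETURN, ACCEPT/DROP and mark, with the chains reduced to the rules discussed. The whole cali-pi-* policy chain is collapsed into one "mark if the source is whitelisted" rule.

```python
from typing import Callable, Dict, List, Tuple

Rule = Tuple[Callable[[dict], object], str, object]  # (match, kind, arg)
ACCEPT_MARK = 0x10000
WHITELIST = {"192.0.2.10"}  # hypothetical whitelisted source

def always(p):
    return True

CHAINS: Dict[str, List[Rule]] = {
    "PREROUTING": [(always, "jump", "cali-PREROUTING")],
    "cali-PREROUTING": [
        (lambda p: p["ct"] in ("RELATED", "ESTABLISHED"), "ACCEPT", None),
        (lambda p: p["mark"] & ACCEPT_MARK, "ACCEPT", None),
        (always, "jump", "cali-from-host-endpoint"),
        (lambda p: p["mark"] & ACCEPT_MARK, "ACCEPT", None),  # "Host endpoint policy accepted packet."
    ],
    "cali-from-host-endpoint": [(always, "goto", "cali-fh-any-interface-at-all")],
    "cali-fh-any-interface-at-all": [
        (lambda p: p["src"] in WHITELIST, "mark", ACCEPT_MARK),  # stands in for cali-pi-*
        (lambda p: p["mark"] & ACCEPT_MARK, "return", None),
    ],
}

def run(chains, chain, pkt, trace):
    """Evaluate starting from a built-in chain; falling off it means policy ACCEPT."""
    stack = []  # return points, pushed by "jump" only
    idx = 0
    while True:
        rules = chains[chain]
        if idx >= len(rules):           # end of chain acts as an implicit RETURN
            if not stack:
                return "ACCEPT"
            chain, idx = stack.pop()
            continue
        match, kind, arg = rules[idx]
        idx += 1
        if not match(pkt):
            continue
        trace.append(f"{chain}#{idx}:{kind}")
        if kind == "jump":
            stack.append((chain, idx))
            chain, idx = arg, 0
        elif kind == "goto":            # no return point recorded
            chain, idx = arg, 0
        elif kind == "return":
            if not stack:
                return "ACCEPT"
            chain, idx = stack.pop()
        elif kind == "mark":
            pkt["mark"] |= arg
        else:                           # "ACCEPT" or "DROP"
            return kind

trace = []
print(run(CHAINS, "PREROUTING", {"src": "192.0.2.10", "ct": "NEW", "mark": 0}, trace))
print(trace)
```

The second-to-last trace entry is the RETURN in cali-fh-any-interface-at-all. The next entry is rule 4 of cali-PREROUTING, not anything in cali-from-host-endpoint.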

Relevant source code

In Calico, the component responsible for the iptables rules is Felix.

Chain names

https://github.com/projectcalico/calico/blob/v3.30.3/felix/rules/rule_defs.go

mark

The mark values can be found in the source:

// https://github.com/projectcalico/calico/blob/v3.30.3/felix/dataplane/driver.go#L156-L164
log.WithFields(log.Fields{
	"acceptMark":          markAccept,
	"passMark":            markPass,
	"dropMark":            markDrop,
	"scratch0Mark":        markScratch0,
	"scratch1Mark":        markScratch1,
	"endpointMark":        markEndpointMark,
	"endpointMarkNonCali": markEndpointNonCaliEndpoint,
}).Info("Calculated iptables mark bits")

Checking the logs (line breaks added below for readability):

$ docker logs d96d | grep 'Calculated iptables mark bits'
2025-10-15 08:54:06.100 [INFO][85] felix/driver.go 164: Calculated iptables mark bits
acceptMark=0x10000
dropMark=0x40000
endpointMark=0xffe00000
endpointMarkNonCali=0x0
passMark=0x20000
scratch0Mark=0x80000
scratch1Mark=0x100000
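Decoding these values shows that every allocated bit falls in the upper 16 bits (0xffff0000). It also ties back to the rules above: 0x180000 in the first cali-pi rule is scratch0|scratch1, and 0x30000 cleared in cali-fh-any-interface-at-all is accept|pass. Below is a small Python helper; the log line is the one above, joined back onto a single line.

```python
import re

def parse_mark_log(line):
    """Extract key=0x... pairs from Felix's 'Calculated iptables mark bits' log line."""
    return {k: int(v, 16) for k, v in re.findall(r"(\w+)=(0x[0-9a-fA-F]+)", line)}

def bit_positions(value):
    """Indices of the set bits, least significant first."""
    return [i for i in range(32) if (value >> i) & 1]

LOG = ("acceptMark=0x10000 dropMark=0x40000 endpointMark=0xffe00000 "
       "endpointMarkNonCali=0x0 passMark=0x20000 scratch0Mark=0x80000 scratch1Mark=0x100000")

marks = parse_mark_log(LOG)
for name, value in sorted(marks.items(), key=lambda kv: kv[1]):
    print(f"{name:20} {value:#010x} bits {bit_positions(value)}")
```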

host ipset

Calico also keeps an ipset that stores the IPs of this host's interfaces:

Name: cali40this-host
Type: hash:ip
Revision: 4
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 456
References: 0
Number of entries: 7
Members:
127.0.0.1
169.254.20.10
10.187.220.0
10.185.0.1
10.xxx.xx.xxx # this host's IP
10.186.0.2

The relevant code is at:

https://github.com/projectcalico/calico/blob/v3.30.3/felix/daemon/daemon.go#L179
https://github.com/projectcalico/calico/blob/v3.30.3/felix/config/config_params.go#L1067

The related logs:

$ docker logs calico-node | grep int_dataplane.go
2025-10-15 08:54:06.562 [INFO][85] felix/int_dataplane.go 2063: Started internal iptables dataplane driver loop
2025-10-15 08:54:06.562 [INFO][85] felix/int_dataplane.go 2180: Will refresh IP sets on timer interval=1m30s
2025-10-15 08:54:06.562 [INFO][85] felix/int_dataplane.go 2180: Will refresh routes on timer interval=1m30s
2025-10-15 08:54:06.562 [INFO][85] felix/int_dataplane.go 2618: Started internal status report thread
2025-10-15 08:54:06.562 [INFO][85] felix/int_dataplane.go 2620: Process status reports disabled
2025-10-15 08:54:06.565 [INFO][85] felix/int_dataplane.go 1590: Linux interface state changed. ifIndex=1 ifaceName="lo" state="up"
2025-10-15 08:54:06.565 [INFO][85] felix/int_dataplane.go 2259: Received interface update msg=&intdataplane.ifaceStateUpdate{Name:"lo", State:"up", Index:1}
2025-10-15 08:54:06.565 [INFO][85] felix/int_dataplane.go 1634: Linux interface addrs changed. addrs=set.Set{127.0.0.0,127.0.0.1,::1,fe80::ecee:eeff:feee:eeee} ifaceName="lo"
2025-10-15 08:54:06.565 [INFO][85] felix/int_dataplane.go 1590: Linux interface state changed. ifIndex=2 ifaceName="ens192" state="up"
2025-10-15 08:54:06.565 [INFO][85] felix/int_dataplane.go 2286: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"lo", Addrs:set.Typed[string]{"127.0.0.0":set.v{}, "127.0.0.1":set.v{}, "::1":set.v{}, "fe80::ecee:eeff:feee:eeee":set.v{}}}

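In the source, the logic for collecting interface IPs lives in felix/ifacemonitor/iface_monitor.go. It mainly uses the Linux netlink interface to list interfaces and to listen for add, change and delete messages. The updates then reach OnUpdate: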

// https://github.com/projectcalico/calico/blob/v3.30.3/felix/dataplane/linux/hostip_mgr.go#L81-L103
func (m *hostIPManager) OnUpdate(msg interface{}) {
	switch msg := msg.(type) {
	case *ifaceAddrsUpdate:
		log.WithField("update", msg).Info("Interface addrs changed.")
		if m.nonHostIfacesRegexp.MatchString(msg.Name) {
			log.WithField("update", msg).Debug("Not a real host interface, ignoring.")
			return
		}
		if msg.Addrs != nil {
			m.hostIfaceToAddrs[msg.Name] = msg.Addrs
		} else {
			delete(m.hostIfaceToAddrs, msg.Name)
		}

		// Host ip update is a relative rare event. Flush entire ipsets to make it simple.
		metadata := ipsets.IPSetMetadata{
			Type:    ipsets.IPSetTypeHashIP,
			SetID:   m.hostIPSetID,
			MaxSize: m.maxSize,
		}
		m.ipsetsDataplane.AddOrReplaceIPSet(metadata, m.getCurrentMembers())
	}
}
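The update logic boils down to three steps: ignore non-host interfaces, keep a per-interface address map, and rebuild the whole set on every change. Below is a Python sketch of the same idea. The exclusion regex is hypothetical, borrowed from the interface list in the shell script at the end of this post rather than Felix's defaults. It uses re.search because Go's MatchString is unanchored.

```python
import re

class HostIPTracker:
    """Sketch of hostIPManager.OnUpdate: per-interface addresses, full rebuild on change."""

    def __init__(self, non_host_ifaces):
        self.non_host = re.compile(non_host_ifaces)
        self.iface_to_addrs = {}

    def on_addrs_update(self, name, addrs):
        """Return the full member set to flush into the ipset, or None if ignored."""
        if self.non_host.search(name):
            return None                      # not a real host interface
        if addrs is not None:
            self.iface_to_addrs[name] = set(addrs)
        else:
            self.iface_to_addrs.pop(name, None)
        return self.members()

    def members(self):
        out = set()
        for addrs in self.iface_to_addrs.values():
            out |= addrs
        return out

tracker = HostIPTracker(r"^(cali|tunl|vxlan|flannel|docker0|veth|cni0)")  # hypothetical
print(tracker.on_addrs_update("lo", {"127.0.0.1"}))
print(tracker.on_addrs_update("ens192", {"192.0.2.10"}))
```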

Design

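  • one ipset, whiteiplist, holds the whitelisted IPs; a match means ACCEPT
  • one ipset, whiteportlist, holds the protected ports; a packet that still matches here did not come from a whitelisted IP, so it gets mark 2
  • packets carrying mark 2 are DROPped

If something goes wrong, a hotfix can be applied by adding a rule somewhere that sets mark 1.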

mangle table

We only use flannel and do not need to cover all those cases, so we skip the marks entirely. Like Calico, we simply hook our own chain into PREROUTING of the mangle table:

# create the chain first
iptables --wait --table mangle --new test-PREROUTING

# insert a jump to it at the top of the mangle PREROUTING chain
iptables --wait --table mangle --insert PREROUTING --jump test-PREROUTING

# accept packets of already established connections
iptables --wait --table mangle --insert test-PREROUTING -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

# create a separate chain
iptables --wait --table mangle --new test-from-host-endpoint

# the matching happens in test-from-host-endpoint; to hotfix, INSERT rules in front of this jump
iptables --wait --table mangle --append test-PREROUTING -j test-from-host-endpoint
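An agent that re-runs these commands on every start would create duplicate jumps. The following is a minimal idempotent sketch in Python, assuming the same chain names. Each step first runs a check (`-nL CHAIN` for chain existence, `-C` for a rule), and only runs the action if the check exits non-zero. By default it does a dry run and only returns the command list.

```python
import subprocess

BASE = ["iptables", "-w", "-t", "mangle"]
CHAIN = "test-PREROUTING"
FROM_HEP = "test-from-host-endpoint"
CT_ACCEPT = ["-m", "conntrack", "--ctstate", "RELATED,ESTABLISHED", "-j", "ACCEPT"]

def plan():
    """(check_args, action_args) pairs, in the same order as the commands above."""
    return [
        (["-nL", CHAIN], ["-N", CHAIN]),
        (["-C", "PREROUTING", "-j", CHAIN], ["-I", "PREROUTING", "-j", CHAIN]),
        (["-C", CHAIN, *CT_ACCEPT], ["-I", CHAIN, *CT_ACCEPT]),
        (["-nL", FROM_HEP], ["-N", FROM_HEP]),
        (["-C", CHAIN, "-j", FROM_HEP], ["-A", CHAIN, "-j", FROM_HEP]),
    ]

def ensure_rules(dry_run=True):
    """Return the commands that were executed (or, in dry-run mode, would be)."""
    executed = []
    for check, action in plan():
        if not dry_run:
            present = subprocess.run(BASE + check, capture_output=True).returncode == 0
            if present:
                continue  # chain or rule already exists
        cmd = BASE + action
        if not dry_run:
            subprocess.run(cmd, check=True)
        executed.append(" ".join(cmd))
    return executed

if __name__ == "__main__":
    for line in ensure_rules(dry_run=True):
        print(line)
```

Running it with dry_run=False a second time executes nothing, because every check succeeds.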

test-from-host-endpoint chain

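Splitting it into two chains means we can later insert rules in front of the jump to test-from-host-endpoint. Examples are rules for this host's interface IPs, or a failsafe-in style ipset.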

# accept whitelisted source IPs directly
iptables --wait --table mangle --append test-from-host-endpoint -m set --match-set whiteiplist src -j ACCEPT

# drop non-whitelisted sources that target the protected ports
iptables --wait --table mangle --append test-from-host-endpoint -m set --match-set whiteportlist dst -j DROP
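To check this layout against the original failure, here is a Python model of test-PREROUTING plus test-from-host-endpoint. The ipset contents are hypothetical. ct_state=None stands for the raw-table situation, where no conntrack entry is attached to the incoming reply yet. In mangle, the SYN-ACK to our own connection is already ESTABLISHED.

```python
from dataclasses import dataclass
from typing import Optional

WHITE_IPS = {"192.0.2.10"}                         # hypothetical whiteiplist
WHITE_PORTS = {30008} | set(range(49100, 49501))   # hypothetical whiteportlist

@dataclass
class Pkt:
    src: str
    dport: int
    ct_state: Optional[str]  # None: no conntrack entry attached yet

def mangle_whitelist(p: Pkt) -> str:
    # test-PREROUTING: packets of known connections pass first
    if p.ct_state in ("RELATED", "ESTABLISHED"):
        return "ACCEPT"  # note: ACCEPT here only ends the mangle table
    # test-from-host-endpoint
    if p.src in WHITE_IPS:
        return "ACCEPT"
    if p.dport in WHITE_PORTS:
        return "DROP"
    return "ACCEPT"      # fell through; the PREROUTING policy is ACCEPT

print(mangle_whitelist(Pkt("39.156.70.37", 49123, "ESTABLISHED")))  # reply seen in mangle
print(mangle_whitelist(Pkt("39.156.70.37", 49123, None)))           # same reply as raw saw it
```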

In the INPUT chain of the filter table we also added logic similar to Calico's this-host set:

-A INPUT -j BASE-RULE

-A BASE-RULE -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A BASE-RULE -m set --match-set whiteiplist src -j ACCEPT
-A BASE-RULE -m set --match-set this-host src -j ACCEPT
-A BASE-RULE -m set --match-set whiteportlist dst -j DROP

Final result

It supports dual stack and collects interface IPs automatically. IPv6 handling is switched on or off based on ipv6list and the value of cat /proc/sys/net/ipv6/conf/all/disable_ipv6.

Name: this-host
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1000000
Size in memory: 568
References: 1
Number of entries: 3
Members:
127.0.0.0/24
10.xx.94.189
169.254.0.0/16

Name: whiteipv6list
Type: hash:net
Revision: 6
Header: family inet6 hashsize 1024 maxelem 1000000
Size in memory: 1608
References: 2
Number of entries: 4
Members:
2408:8656:22df:ff01::14:1620
2408:8656:22df:ff01::14:1621
::1
2408:8656:22df:ff01::14:1622

Name: this-host6
Type: hash:net
Revision: 6
Header: family inet6 hashsize 1024 maxelem 1000000
Size in memory: 1496
References: 1
Number of entries: 3
Members:
ee80:169:254:20::/64
2408:8656:22df:ff01::14:1620
::1

Name: whiteportlist
Type: bitmap:port
Revision: 3
Header: range 0-65535
Size in memory: 8296
References: 4
Number of entries: 230
Members:
...
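The post does not show the IPv6 switch itself. Below is a minimal Python sketch under an assumption: IPv6 sets and rules are managed only when an IPv6 whitelist is configured and /proc/sys/net/ipv6/conf/all/disable_ipv6 reads 0. The helper names are hypothetical, and a missing file is treated as "no IPv6 stack".

```python
from pathlib import Path

DISABLE_IPV6 = "/proc/sys/net/ipv6/conf/all/disable_ipv6"

def ipv6_enabled(ipv6_whitelist, disable_ipv6_value):
    """Hypothetical rule: need a non-empty v6 whitelist and disable_ipv6 == 0."""
    if disable_ipv6_value is None:  # file missing: kernel has no IPv6
        return False
    return bool(ipv6_whitelist) and disable_ipv6_value.strip() == "0"

def read_disable_ipv6(path=DISABLE_IPV6):
    """Return the sysctl contents, or None if the file does not exist."""
    try:
        return Path(path).read_text()
    except FileNotFoundError:
        return None

if __name__ == "__main__":
    print(ipv6_enabled(["2001:db8::1"], read_disable_ipv6()))
```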

Some shell snippets, for reference only:

function get_if_inet(){
    local iface=$1
    ip -4 -o addr show "$iface" | awk '{print $4}'
}

function this_host(){
    ipset create this-host hash:net maxelem 1000000 -exist
    ipset add this-host 127.0.0.1/24 -exist
    ipset add this-host 169.254.0.0/16 -exist

    # cni0 holds this node's Pod gateway IP (e.g. 10.187.220.1/24); widen the prefix to /16
    if [ -d /sys/devices/virtual/net/cni0/ ];then
        ipset add this-host $(get_if_inet cni0 | sed -E 's#/[0-9]+#/16#') -exist
    fi

    # global IPv4 addresses of real host NICs, skipping CNI and other virtual interfaces
    ip -o -4 addr show scope global | grep -Ev ':\s+(cali|tunl|vxlan|flannel|docker0|veth|wireguard|wg|cni0|kube|dummy)' | \
    awk -F'[ /]+' '{print $4}' | \
    while read ip;do
        ipset add this-host $ip -exist
    done
}

function get_if_inet6(){
    local iface=$1 ignore=$2 inet6
    inet6=$(ip -6 -o addr show "$iface" | awk '{print $4}')
    # optionally drop addresses matching the regex in $2
    if [ -n "$ignore" ];then
        inet6=$(echo "$inet6" | grep -Ev "$ignore")
    fi
    echo $inet6
}

function this_host6(){
    ipset create this-host6 hash:net maxelem 1000000 family inet6 -exist
    ipset add this-host6 ::1 -exist

    # cni0's global IPv6 address widened to /56; its fe80:: link-local address is skipped
    if [ -d /sys/devices/virtual/net/cni0/ ];then
        cni0_inet6=$(get_if_inet6 cni0 '^fe80:' | sed -E 's#/[0-9]+#/56#')
        if [ -n "$cni0_inet6" ];then
            ipset add this-host6 $cni0_inet6 -exist
        fi
    fi

    # global IPv6 addresses of real host NICs; link-local is filtered after extracting the address
    ip -o -6 addr show scope global | grep -Ev ':\s+(cali|tunl|vxlan|flannel|docker0|veth|wireguard|wg|cni0|kube|dummy)' | \
    awk -F'[ /]+' '{print $4}' | grep -Ev '^fe80:' | \
    while read ip;do
        ipset add this-host6 $ip -exist
    done
}