Deep Dive: kube-proxy in iptables mode

Posted by serena on March 26, 2020

Summary: in Kubernetes, a Service is an L4 (TCP/UDP/SCTP) load balancer that uses DNAT to redirect inbound traffic to backend pods. The redirection is performed by kube-proxy, which runs on every node. In this post, we will look at how kube-proxy handles traffic forwarding when it works in iptables mode.

3 modes

kube-proxy configures the routing rules by watching kube-apiserver for the addition and removal of Service and Endpoints objects. Currently, it supports three different operation modes:

userspace

In this mode, kube-proxy opens a port (randomly chosen) on the local node for each Service. Any connection to this “proxy port” is proxied to one of the Service’s backend Pods. It is rarely used nowadays, as it is slow and outdated.

iptables

Built on Netfilter. For each Service, it installs iptables rules which capture traffic to the Service’s clusterIP and port and redirect that traffic to one of the Service’s backend Pods. For each Endpoints object, it installs iptables rules which select a backend Pod. It is the default mode on most platforms.

IPVS

This mode calls the netlink interface to create IPVS rules and synchronizes them with Kubernetes Services and Endpoints periodically. It requires the Linux kernel to have the IPVS modules loaded.
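
The mode is selected with kube-proxy's --proxy-mode flag or the mode field of its configuration file. A quick check, as a sketch only (the ConfigMap name kube-proxy assumes a kubeadm-style deployment like the one used later in this post; adjust for your installer):

# check whether the IPVS kernel modules are loaded (needed for IPVS mode)
lsmod | grep -e ip_vs -e nf_conntrack

# inspect which mode kube-proxy is configured with; an empty value means the
# default (iptables on Linux) is in effect
kubectl -n kube-system get cm kube-proxy -o yaml | grep -w mode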

externalTrafficPolicy

This field denotes whether the Service routes external traffic to node-local or cluster-wide endpoints. There are two available options: Cluster (default) and Local.

Cluster

It is the default option, and has the following attributes:

  • packets sent to a Service of type=NodePort or type=LoadBalancer are SNAT’d to the node’s IP.
  • if an external packet is routed to a node that has no local endpoint, the proxy forwards it to a pod on another node. This adds an extra hop, which hurts latency.
  • Kubernetes balances across the whole cluster, i.e. the total number of pods is taken into account, so the overall load-spreading is good.

Local

  • only available for NodePort and LoadBalancer Services.
  • packets are not SNAT’d, for either inter-cluster or external traffic.
  • when a packet arrives at a node that has no local endpoint, it is dropped, which avoids the extra hop between nodes.
  • Kubernetes balances only within the node, i.e. the proxy distributes load only among the pods on the local node; it is up to the external load balancer to solve any imbalance between nodes.
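
externalTrafficPolicy is set in the Service manifest (an example appears in the NodePort practice below), or it can be patched onto an existing Service. A sketch, assuming a NodePort Service named echo-np already exists:

# show the current policy; empty output means the default, Cluster
kubectl get svc echo-np -o jsonpath='{.spec.externalTrafficPolicy}'

# switch the Service to node-local endpoints only
kubectl patch svc echo-np -p '{"spec":{"externalTrafficPolicy":"Local"}}'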

chains of iptables

Several iptables chains are programmed to do the filtering and NAT between pods and services when a Service or Endpoints object is created.

  • KUBE-SERVICES is the entry point for service packets. It matches the destination IP:port and dispatches the packet to the corresponding KUBE-SVC-* chain.
  • KUBE-SVC-* acts as a load balancer and distributes the packet to a KUBE-SEP-* chain. The number of KUBE-SEP-* chains equals the number of endpoints behind the service. The KUBE-SEP-* chain is chosen at random.
  • KUBE-SEP-* represents a Service EndPoint. It simply does DNAT, replacing the service IP:port with the pod’s endpoint IP:port.
  • KUBE-MARK-MASQ adds a Netfilter mark to packets destined for the service which originate outside the cluster’s network. Packets with this mark will be altered in a POSTROUTING rule to use source network address translation (SNAT) with the node’s IP address as their source IP address.
  • KUBE-MARK-DROP adds a Netfilter mark to packets which do not have destination NAT enabled by this point. These packets will be discarded in the KUBE-FIREWALL chain.
  • KUBE-FW-* is installed when a Service is deployed with type LoadBalancer. It matches the destination IP against the service’s load-balancer IP and dispatches the packet to the corresponding KUBE-SVC-* chain (externalTrafficPolicy: Cluster) or KUBE-XLB-* chain (externalTrafficPolicy: Local).
  • KUBE-NODEPORTS is installed when a Service is deployed with type NodePort or LoadBalancer, allowing external sources to access the service via the node port. It matches the node port and dispatches the packet to the corresponding KUBE-SVC-* chain (externalTrafficPolicy: Cluster) or KUBE-XLB-* chain (externalTrafficPolicy: Local).
  • KUBE-XLB-* works when externalTrafficPolicy is set to Local. With this chain programmed, a packet is dropped if the node has no local endpoint for the service.

Workflow of kube-proxy iptables
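
Putting the chains together, a packet destined for a service roughly follows the path below (a simplified sketch; KUBE-FW-* and KUBE-XLB-* only appear for LoadBalancer services or externalTrafficPolicy: Local):

PREROUTING / OUTPUT
  -> KUBE-SERVICES       match clusterIP:port, or fall through to KUBE-NODEPORTS for node ports
     -> KUBE-MARK-MASQ   mark for SNAT if the source is outside the pod network
     -> KUBE-SVC-*       pick one KUBE-SEP-* at random (statistic module)
        -> KUBE-SEP-*    DNAT to the endpoint podIP:port
  -> POSTROUTING         SNAT marked packets to a node address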

Practices

Now that we have the basic concepts of kube-proxy in mind, let’s look at them in practice.

Since LoadBalancer is not supported in this cluster, the practice will focus on NodePort and ClusterIP only.

environment

Our test cluster is deployed using cactus with Kubernetes v1.17.3, consisting of 4 VMs (1 master, 3 workers) created from the lab definition pod11. The network setup is shown below:

  • DNS domain: basic.k8s
  • PodIP CIDR: 10.244.0.0/16
  • ClusterIP CIDR: 10.96.0.0/12

The pod backends are managed by a deployment echo with 3 pod replicas.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: echo
  name: echo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
      - image: k8s.gcr.io/echoserver:1.4
        name: echoserver

To simulate in-cluster communication and for debugging purposes, we also set up a debug pod using the image nicolaka/netshoot:latest, which is widely used with kubectl-debug:

kubectl run -it --rm --restart=Never debug --image=nicolaka/netshoot:latest sh

The environment looks like:

serena@ubuntu:~$ virsh list
 Id    Name                           State
----------------------------------------------------
 216   basic_master01                 running
 217   basic_minion01                 running
 218   basic_minion02                 running
 219   basic_minion03                 running

serena@ubuntu:~$ kubectl get po -l app=echo -o wide
NAME                    READY   STATUS    RESTARTS   AGE    IP               NODE       NOMINATED NODE   READINESS GATES
debug                   1/1     Running   0          10h    10.244.193.194   minion02   <none>           <none>
echo-75548f949f-6hg97   1/1     Running   0          100s   10.244.122.1     minion03   <none>           <none>
echo-75548f949f-7r575   1/1     Running   0          100s   10.244.50.68     minion01   <none>           <none>
echo-75548f949f-h5h6w   1/1     Running   0          100s   10.244.193.193   minion02   <none>           <none>

serena@ubuntu:~$ kubectl get node -o wide
NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
master01   Ready    master   23h   v1.17.3   192.168.11.2   <none>        Ubuntu 18.04.4 LTS   4.15.0-88-generic   docker://18.9.7
minion01   Ready    <none>   22h   v1.17.3   192.168.11.3   <none>        Ubuntu 18.04.4 LTS   4.15.0-88-generic   docker://18.9.7
minion02   Ready    <none>   22h   v1.17.3   192.168.11.4   <none>        Ubuntu 18.04.4 LTS   4.15.0-88-generic   docker://18.9.7
minion03   Ready    <none>   22h   v1.17.3   192.168.11.5   <none>        Ubuntu 18.04.4 LTS   4.15.0-88-generic   docker://18.9.7

ClusterIP

In the following, we will discuss 5 types of ClusterIP services with different configurations:

  • normal: a minimal service definition, with available pod backends, not headless
  • session affinity
  • external ip
  • no endpoints
  • headless

normal service

A simple ClusterIP Service definition looks like:

apiVersion: v1
kind: Service
metadata:
  name: echo
spec:
  ports:
  - port: 6711
    targetPort: 8080
  selector:
    app: echo

Kubernetes creates a service called ‘echo’ and automatically creates 3 endpoints pointing to the 3 echo pods respectively:

serena@ubuntu:~$ kubectl get svc echo
NAME   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
echo   ClusterIP   10.98.124.225   <none>        6711/TCP   89s

serena@ubuntu:~$ kubectl get ep echo
NAME   ENDPOINTS                                                 AGE
echo   10.244.122.1:8080,10.244.193.193:8080,10.244.50.68:8080   107s

A cluster IP is a virtual IP; it is not bound to any network interface. Kubernetes registers DNS records associating the service name with it, and thanks to the search domains in a pod’s resolv.conf, the service can be resolved by 4 names: svc-name, svc-name.namespace, svc-name.namespace.svc and svc-name.namespace.svc.domain. In this case echo / echo.default / echo.default.svc / echo.default.svc.basic.k8s all resolve to the cluster IP of service ‘echo’. From the debug pod, we can see:

/ # nslookup echo
Server:		10.96.0.10
Address:	10.96.0.10:53

Name:   echo.default.svc.basic.k8s
Address: 10.98.124.225

/ # dig @10.96.0.10 echo +search +short
10.98.124.225

If the namespace is not specified, the DNS search path appends the namespace of the source pod; to access a service in another namespace, we have to specify the namespace explicitly. For instance, nslookup-ing the ‘echo’ service in the ‘extra’ namespace looks like:

serena@ubuntu:~$ kubectl get svc -n extra
NAME   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
echo   ClusterIP   10.106.152.64   <none>        6711/TCP   64m

/ # nslookup echo.extra
Server:         10.96.0.10
Address:        10.96.0.10#53

Name:   echo.extra.svc.basic.k8s
Address: 10.106.152.64

To make the service’s cluster IP accessible from the external network as well as from the pod network, kube-proxy creates several chains and rules:

cactus@master01:~$ sudo iptables -t nat -S | grep KUBE-SERVICES
-N KUBE-SERVICES
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
...

cactus@master01:~$ sudo iptables -t nat -L KUBE-SERVICES | grep echo
Chain KUBE-SERVICES (2 references)
target                     prot  opt  source          destination         
KUBE-MARK-MASQ  tcp  -- !10.244.0.0/16        10.98.124.225        /* default/echo: cluster IP */ tcp dpt:6711
KUBE-SVC-U52O5CQH2XXNVZ54  tcp  --  anywhere             10.98.124.225        /* default/echo: cluster IP */ tcp dpt:6711

cactus@master01:~$ sudo iptables -t nat -L KUBE-SVC-U52O5CQH2XXNVZ54
Chain KUBE-SVC-U52O5CQH2XXNVZ54 (1 references)
target     prot opt source               destination         
KUBE-SEP-EXCZZIFMC3FTGK26  all  --  anywhere             anywhere             statistic mode random probability 0.33333333349
KUBE-SEP-KRPRU4V5NQPJR2QF  all  --  anywhere             anywhere             statistic mode random probability 0.50000000000
KUBE-SEP-PYQWLFFOR4OGUSWB  all  --  anywhere             anywhere            

cactus@master01:~$ sudo iptables -t nat -L KUBE-SEP-EXCZZIFMC3FTGK26
Chain KUBE-SEP-EXCZZIFMC3FTGK26 (1 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  10.244.122.1         anywhere            
DNAT       tcp  --  anywhere             anywhere             tcp to:10.244.122.1:8080

cactus@master01:~$ sudo iptables -t nat -L KUBE-SEP-KRPRU4V5NQPJR2QF
Chain KUBE-SEP-KRPRU4V5NQPJR2QF (1 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  10.244.193.193       anywhere            
DNAT       tcp  --  anywhere             anywhere             tcp to:10.244.193.193:8080

cactus@master01:~$ sudo iptables -t nat -L KUBE-SEP-PYQWLFFOR4OGUSWB
Chain KUBE-SEP-PYQWLFFOR4OGUSWB (1 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  10.244.50.68         anywhere            
DNAT       tcp  --  anywhere             anywhere             tcp to:10.244.50.68:8080

KUBE-SERVICES

From the PREROUTING and OUTPUT chains we can see that all packets entering or leaving pods go through the KUBE-SERVICES chain first. In this case, inbound traffic to service echo (matched by destination IP 10.98.124.225 and port 6711) is processed by two rules in order:

1) if the packet does not originate from a pod (source not in 10.244.0.0/16), it passes through KUBE-MARK-MASQ, which marks it so that its source IP will be replaced with a node IP in the POSTROUTING chain;

2) the packet then flows into the chain KUBE-SVC-U52O5CQH2XXNVZ54.

As shown below, when the service is accessed externally (not from a pod), the source IP seen by the backend is replaced with an address belonging to the node rather than the real client:

cactus@master01:~$ curl --interface 192.168.11.2 -s 10.101.220.97:8711 | grep client
client_address=10.244.241.64

cactus@master01:~$ sudo conntrack -L -d 10.101.220.97
tcp      6 117 TIME_WAIT src=192.168.11.2 dst=10.101.220.97 sport=47153 dport=8711 src=10.244.122.1 dst=10.244.241.64 sport=8080 dport=24678 [ASSURED] mark=0 use=1

When the packet comes from an internal pod (the ‘debug’ pod), the source IP remains unchanged:

bash-5.0# curl -s 10.101.220.97:8711 | grep client
client_address=10.244.193.194

KUBE-SVC-*

In the chain KUBE-SVC-U52O5CQH2XXNVZ54, load balancing between pods is performed by the iptables ‘statistic’ module, which jumps to one of the KUBE-SEP-* chains at random according to the configured probability. With three endpoints, the first rule matches with probability 1/3, the second with probability 1/2 of the remaining traffic, and the last rule catches the rest, so each endpoint receives roughly one third of the connections.
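
The rules kube-proxy programs are roughly equivalent to the hand-written commands below. This is only a sketch: the chain names are shortened placeholders, and in a real cluster kube-proxy owns these chains (they would have to exist already, e.g. created with iptables -t nat -N):

# 3 endpoints: 1/3 of connections go to the first chain, 1/2 of the rest to the
# second, and everything remaining to the third
iptables -t nat -A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.33333 -j KUBE-SEP-A
iptables -t nat -A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.50000 -j KUBE-SEP-B
iptables -t nat -A KUBE-SVC-EXAMPLE -j KUBE-SEP-C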

KUBE-SEP-*

Each KUBE-SEP-* chain represents one endpoint, i.e. one pod. It includes two rules:

1) a packet whose source is the endpoint pod itself (the hairpin case, where a pod reaches its own service and is balanced back to itself) is passed to KUBE-MARK-MASQ so that it will be SNAT’d on the way back; 2) the packet is DNAT’d to the endpoint’s pod IP:port and then routed to the backend pod.
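
Written by hand, one endpoint chain would look roughly like this (again only a sketch with a placeholder chain name, mirroring the KUBE-SEP-EXCZZIFMC3FTGK26 listing above):

# hairpin case: the endpoint pod calls its own service and is balanced back to
# itself, so mark the packet for masquerade
iptables -t nat -A KUBE-SEP-A -s 10.244.122.1/32 -j KUBE-MARK-MASQ
# everything else: rewrite the destination to the endpoint's IP:port
iptables -t nat -A KUBE-SEP-A -p tcp -m tcp -j DNAT --to-destination 10.244.122.1:8080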

session affinity service

Kubernetes supports ClientIP-based session affinity. Session affinity makes sure that requests from the same client are always passed to the same backend server, which in Kubernetes means the same pod.

A session affinity service is defined as:

apiVersion: v1
kind: Service
metadata:
  name: echo-session
spec:
  sessionAffinity: ClientIP
  ports:
  - port: 6711
    targetPort: 8080
  selector:
    app: echo

After applying the manifest, the Service and Endpoints objects are created:

serena@ubuntu:~$ kubectl get svc echo-session
NAME           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
echo-session   ClusterIP   10.109.153.82   <none>        6711/TCP   16s

serena@ubuntu:~$ kubectl get ep echo-session
NAME           ENDPOINTS                                                 AGE
echo-session   10.244.122.1:8080,10.244.193.193:8080,10.244.50.68:8080   26s

The difference between a ‘normal’ ClusterIP service and one with session affinity is that, besides distributing first-time requests with the ‘statistic’ iptables module, kube-proxy also applies the ‘recent’ module so that returning clients are steered to the endpoint they used before. The affinity timeout is 10800s by default and can be changed via service.spec.sessionAffinityConfig.clientIP.timeoutSeconds. The iptables configuration looks like below:

cactus@master01:~$ sudo iptables -t nat  -L KUBE-SERVICES | grep echo-session
KUBE-MARK-MASQ  tcp  -- !10.244.0.0/16        10.109.153.82        /* default/echo-session: cluster IP */ tcp dpt:6711
KUBE-SVC-NPPZ32WH6LRMQBHN  tcp  --  anywhere             10.109.153.82        /* default/echo-session: cluster IP */ tcp dpt:6711

cactus@master01:~$ sudo iptables -t nat  -L KUBE-SVC-NPPZ32WH6LRMQBHN
Chain KUBE-SVC-NPPZ32WH6LRMQBHN (1 references)
target     prot opt source               destination         
KUBE-SEP-XHJZRVA3S6MYRZW7  all  --  anywhere             anywhere             recent: CHECK seconds: 10800 reap name: KUBE-SEP-XHJZRVA3S6MYRZW7 side: source mask: 255.255.255.255
KUBE-SEP-GCFXO5QV6K3OGUQA  all  --  anywhere             anywhere             recent: CHECK seconds: 10800 reap name: KUBE-SEP-GCFXO5QV6K3OGUQA side: source mask: 255.255.255.255
KUBE-SEP-PMGN3E23UUMYGBZ3  all  --  anywhere             anywhere             recent: CHECK seconds: 10800 reap name: KUBE-SEP-PMGN3E23UUMYGBZ3 side: source mask: 255.255.255.255
KUBE-SEP-XHJZRVA3S6MYRZW7  all  --  anywhere             anywhere             statistic mode random probability 0.33333333349
KUBE-SEP-GCFXO5QV6K3OGUQA  all  --  anywhere             anywhere             statistic mode random probability 0.50000000000
KUBE-SEP-PMGN3E23UUMYGBZ3  all  --  anywhere             anywhere            

cactus@master01:~$ sudo iptables -t nat  -L KUBE-SEP-XHJZRVA3S6MYRZW7
Chain KUBE-SEP-XHJZRVA3S6MYRZW7 (2 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  10.244.122.1         anywhere            
DNAT       tcp  --  anywhere             anywhere             recent: SET name: KUBE-SEP-XHJZRVA3S6MYRZW7 side: source mask: 255.255.255.255 tcp to:10.244.122.1:8080

cactus@master01:~$ sudo iptables -t nat  -L KUBE-SEP-GCFXO5QV6K3OGUQA
Chain KUBE-SEP-GCFXO5QV6K3OGUQA (2 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  10.244.193.193       anywhere            
DNAT       tcp  --  anywhere             anywhere             recent: SET name: KUBE-SEP-GCFXO5QV6K3OGUQA side: source mask: 255.255.255.255 tcp to:10.244.193.193:8080

cactus@master01:~$ sudo iptables -t nat  -L KUBE-SEP-PMGN3E23UUMYGBZ3
Chain KUBE-SEP-PMGN3E23UUMYGBZ3 (2 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  10.244.50.68         anywhere            
DNAT       tcp  --  anywhere             anywhere             recent: SET name: KUBE-SEP-PMGN3E23UUMYGBZ3 side: source mask: 255.255.255.255 tcp to:10.244.50.68:8080
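
For reference, the affinity timeout mentioned above can be tuned in the Service spec. A minimal sketch, extending the manifest above (the 3600s value is arbitrary, for illustration only):

spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600   # default is 10800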

external ip service

If there are external IPs that route to one or more cluster nodes, Kubernetes Services can be exposed on those externalIPs. Traffic that ingresses into the cluster with the external IP (as destination IP), on the Service port, will be routed to one of the Service endpoints.

In our environment, 172.0.11.33 is an IP configured on node master01 that can be reached externally. Treating it as an external IP, the service is defined as:

apiVersion: v1
kind: Service
metadata:
  name: echo-extip
spec:
  ports:
  - port: 8711
    targetPort: 8080
  selector:
    app: echo
  externalIPs:
  - 172.0.11.33

The resulting service and endpoints are:

serena@ubuntu:~$ kubectl get svc echo-extip
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
echo-extip   ClusterIP   10.105.77.232   172.0.11.33   8711/TCP   27m

serena@ubuntu:~$ kubectl get ep echo-extip
NAME         ENDPOINTS                                                 AGE
echo-extip   10.244.122.1:8080,10.244.193.193:8080,10.244.50.68:8080   27m

The corresponding rules in the KUBE-SERVICES chain are shown below. Since the KUBE-SVC-* and KUBE-SEP-* chains are very similar to those of a normal service, we will not show them from here on:

cactus@master01:~$ sudo iptables -t nat  -L KUBE-SERVICES | grep echo-extip
KUBE-MARK-MASQ  tcp  -- !10.244.0.0/16        10.105.77.232        /* default/echo-extip: cluster IP */ tcp dpt:8711
KUBE-SVC-DDLWINQFLEZGGMTH  tcp  --  anywhere             10.105.77.232        /* default/echo-extip: cluster IP */ tcp dpt:8711
KUBE-MARK-MASQ  tcp  --  anywhere             172.0.11.33          /* default/echo-extip: external IP */ tcp dpt:8711
KUBE-SVC-DDLWINQFLEZGGMTH  tcp  --  anywhere             172.0.11.33             /* default/echo-extip: external IP */ tcp dpt:8711 PHYSDEV match ! --physdev-is-in ADDRTYPE match src-type !LOCAL
KUBE-SVC-DDLWINQFLEZGGMTH  tcp  --  anywhere             172.0.11.33             /* default/echo-extip: external IP */ tcp dpt:8711 ADDRTYPE match dst-type LOCAL

The first two rules work just like a normal service: the packet is marked for SNAT and then enters the KUBE-SVC-* chain for load balancing.

The next 3 rules are unique to the external-IP case. They employ the physdev module to identify traffic arriving via a bridge device. The second rule from the bottom passes requests that come neither from a local process nor through a bridge port into the KUBE-SVC-* chain for load balancing, and the last rule handles traffic whose destination address is local to the node.
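
For readability, here are the same three external-IP rules in iptables-save form, reconstructed from the listing above (details of the live ruleset may differ slightly):

-A KUBE-SERVICES -d 172.0.11.33/32 -p tcp -m comment --comment "default/echo-extip: external IP" -m tcp --dport 8711 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 172.0.11.33/32 -p tcp -m comment --comment "default/echo-extip: external IP" -m tcp --dport 8711 -m physdev ! --physdev-is-in -m addrtype ! --src-type LOCAL -j KUBE-SVC-DDLWINQFLEZGGMTH
-A KUBE-SERVICES -d 172.0.11.33/32 -p tcp -m comment --comment "default/echo-extip: external IP" -m tcp --dport 8711 -m addrtype --dst-type LOCAL -j KUBE-SVC-DDLWINQFLEZGGMTH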

no endpoint service

A cluster IP is normally associated with backend pods via a selector. If no pod matches the service’s selector, the service is created with an empty Endpoints object, and in that case no iptables rules are created for routing traffic.

The non-matching selector case:

apiVersion: v1
kind: Service
metadata:
  name: echo-noep
spec:
  ports:
  - port: 8711
    targetPort: 8080
  selector:
    app: echo-noep

The created service, endpoints and iptables rules are:

serena@ubuntu:~$ kubectl get svc echo-noep
NAME        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
echo-noep   ClusterIP   10.96.42.131   <none>        8711/TCP   19m
serena@ubuntu:~$ kubectl get ep echo-noep
NAME        ENDPOINTS   AGE
echo-noep   <none>      19m

cactus@master01:~$ sudo iptables-save -t nat | grep echo-noep
cactus@master01:~$ 

And if a Service has no selector defined at all, the Endpoints object is not created:

apiVersion: v1
kind: Service
metadata:
  name: echo-nosel
spec:
  ports:
  - port: 8711
    targetPort: 8080

The service is created, but no Endpoints object appears:

serena@ubuntu:~$ kubectl get svc
NAME            TYPE        CLUSTER-IP       EXTERNAL-IP                 PORT(S)    AGE
echo-nosel      ClusterIP   10.109.105.198   <none>                      8711/TCP   21m
kubernetes      ClusterIP   10.96.0.1        <none>                      443/TCP    18h

serena@ubuntu:~$ kubectl get ep
NAME            ENDPOINTS                                                 AGE
kubernetes      10.0.11.2:6443                                            18h
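
A Service without a selector is typically paired with a manually created Endpoints object that points at backends Kubernetes does not manage itself. A minimal sketch (the backend address 192.0.2.42 is made up for illustration; the Endpoints object must have the same name as the Service):

apiVersion: v1
kind: Endpoints
metadata:
  name: echo-nosel
subsets:
- addresses:
  - ip: 192.0.2.42
  ports:
  - port: 8080

Once such an Endpoints object exists, kube-proxy programs the usual KUBE-SVC-*/KUBE-SEP-* chains for the service’s cluster IP.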

headless service

Sometimes you don’t need load-balancing and a single Service IP. In this case, you can create what are termed “headless” Services, by explicitly specifying “None” for the cluster IP (.spec.clusterIP).

The headless service definition of ‘echo’ is:

apiVersion: v1
kind: Service
metadata:
  name: echo-headless
spec:
  clusterIP: None
  ports:
  - port: 9711
    targetPort: 8080
  selector:
    app: echo

After kubectl apply, the service and endpoints controllers create the following objects:

serena@ubuntu:~$ kubectl get svc echo-headless
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
echo-headless   ClusterIP   None            <none>        9711/TCP   8s

serena@ubuntu:~$ kubectl get ep echo-headless
NAME            ENDPOINTS                                                 AGE
echo-headless   10.244.122.1:8080,10.244.193.193:8080,10.244.50.68:8080   10s

However, no iptables chains or rules are created for it. Checking the DNS records shows that the service name resolves directly to the pods’ IPs:

bash-5.0# dig echo-headless +search +short
10.244.50.68
10.244.193.193
10.244.122.1

NodePort

With a NodePort service, Kubernetes exposes the service on nodeIP:nodePort. Under the hood, Kubernetes creates a ClusterIP service, to which the NodePort service routes, and opens a specific port (the node port) on every node; any traffic sent to this port is forwarded to a destination pod.

The entry point of the iptables rules for NodePort traffic is ‘-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS’. This rule is the last one in the KUBE-SERVICES chain; it means that if an incoming packet is not matched by any of the preceding rules and its destination address is local, it is forwarded into the KUBE-NODEPORTS chain for further processing.

cactus@master01:~$ sudo iptables -t nat -L KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
.
.
.
KUBE-NODEPORTS  all  --  anywhere             anywhere             /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

In the following subsections, we will discuss 2 types of NodePort services:

  • default service (externalTrafficPolicy: Cluster)
  • externalTrafficPolicy: Local

default NodePort service

A NodePort service with no customization defaults to externalTrafficPolicy: Cluster. The manifest is:

apiVersion: v1
kind: Service
metadata:
  name: echo-np
spec:
  type: NodePort
  ports:
  - port: 8711
    targetPort: 8080
  selector:
    app: echo

The service and endpoints are created:

serena@ubuntu:~$ kubectl get svc echo-np
NAME      TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
echo-np   NodePort   10.107.142.56   <none>        8711:30398/TCP   84s
serena@ubuntu:~$ kubectl get ep echo-np
NAME      ENDPOINTS                                                 AGE
echo-np   10.244.122.1:8080,10.244.193.193:8080,10.244.50.68:8080   88s

As we can see above, port 30398 is assigned to the service ‘echo-np’. On each node, kube-proxy opens a listening socket on that port:

cactus@master01:~$ sudo netstat -alnp | grep 30398
tcp6       0      0 :::30398                :::*                    LISTEN      4406/kube-proxy     

cactus@master01:~$ ps -p 4406 -o args
COMMAND
/usr/local/bin/kube-proxy --config=/var/lib/kube-proxy/config.conf --hostname-override=master01

From the iptables perspective, two sets of rules are added, one to the KUBE-SERVICES chain and one to the KUBE-NODEPORTS chain:

cactus@master01:~$ sudo iptables -t nat -L KUBE-SERVICES | grep echo-np
KUBE-MARK-MASQ  tcp  -- !10.244.0.0/16        10.107.142.56        /* default/echo-np: cluster IP */ tcp dpt:8711
KUBE-SVC-2JCTU37Y3EPLWMDU  tcp  --  anywhere             10.107.142.56        /* default/echo-np: cluster IP */ tcp dpt:8711

cactus@master01:~$ sudo iptables -t nat -L KUBE-NODEPORTS
Chain KUBE-NODEPORTS (1 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  tcp  --  anywhere             anywhere             /* default/echo-np: */ tcp dpt:30398
KUBE-SVC-2JCTU37Y3EPLWMDU  tcp  --  anywhere             anywhere             /* default/echo-np: */ tcp dpt:30398

The rules in the KUBE-SERVICES chain handle access via the cluster IP; for details, refer to the ‘normal service’ section above.

The KUBE-NODEPORTS rules state that every packet arriving at port 30398 is first marked for SNAT and then enters the KUBE-SVC-* chain, which load-balances it to one of the pods.
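
To verify, hitting the node port from outside the cluster behaves like the earlier cluster-IP test: the backend sees a SNAT’d source address instead of the real client. A sketch (the expected result is described in the comment rather than captured from this cluster):

# from a machine outside the cluster; 30398 is the node port assigned above
curl -s 192.168.11.2:30398 | grep client
# expected: client_address shows an address owned by the node, not the real client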

externalTrafficPolicy: Local

Using “externalTrafficPolicy: Local” preserves the client source IP and drops packets arriving at a node that has no local endpoint. It is defined as:

apiVersion: v1
kind: Service
metadata:
  name: echo-local
spec:
  ports:
  - port: 8711
    targetPort: 8080
  selector:
    app: echo
  type: NodePort
  externalTrafficPolicy: Local

The service and endpoints are created:

serena@ubuntu:~$ kubectl get svc echo-local
NAME         TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
echo-local   NodePort   10.103.249.21   <none>        8711:30757/TCP   45s

serena@ubuntu:~$ kubectl get ep echo-local
NAME         ENDPOINTS                                                 AGE
echo-local   10.244.122.1:8080,10.244.193.193:8080,10.244.50.68:8080   53s

The KUBE-NODEPORTS chain differs between nodes that have local endpoints and nodes that do not. In our cluster, no ‘echo’ pod is scheduled on master01, so when we try to access the service via master01’s node IP, the packet is dropped by the rule ‘-A KUBE-XLB-WVR3FY6OQL4XX3DZ -m comment --comment "default/echo-local: has no local endpoints" -j KUBE-MARK-DROP’. On the contrary, when accessing the service via minion01’s node IP, the packet is forwarded directly to pod ‘echo-75548f949f-7r575’ (which is scheduled on minion01) by the rule ‘-A KUBE-XLB-WVR3FY6OQL4XX3DZ -m comment --comment "Balancing rule 0 for default/echo-local:" -j KUBE-SEP-FO5ZQ232FWC5GK4A’.

The chains and rules on a node that hosts an endpoint look like:

cactus@ubuntu:~$ sudo iptables -t nat -L KUBE-NODEPORTS
Chain KUBE-NODEPORTS (1 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  tcp  --  anywhere             anywhere             /* default/echo-np: */ tcp dpt:30398
KUBE-SVC-2JCTU37Y3EPLWMDU  tcp  --  anywhere             anywhere             /* default/echo-np: */ tcp dpt:30398
KUBE-MARK-MASQ  tcp  --  localhost/8          anywhere             /* default/echo-local: */ tcp dpt:30757
KUBE-XLB-WVR3FY6OQL4XX3DZ  tcp  --  anywhere             anywhere             /* default/echo-local: */ tcp dpt:30757

cactus@ubuntu:~$ sudo iptables -t nat -L KUBE-XLB-WVR3FY6OQL4XX3DZ
Chain KUBE-XLB-WVR3FY6OQL4XX3DZ (1 references)
target     prot opt source               destination         
KUBE-SVC-WVR3FY6OQL4XX3DZ  all  --  10.244.0.0/16        anywhere             /* Redirect pods trying to reach external loadbalancer VIP to clusterIP */
KUBE-MARK-MASQ  all  --  anywhere             anywhere             /* masquerade LOCAL traffic for default/echo-local: LB IP */ ADDRTYPE match src-type LOCAL
KUBE-SVC-WVR3FY6OQL4XX3DZ  all  --  anywhere             anywhere             /* route LOCAL traffic for default/echo-local: LB IP to service chain */ ADDRTYPE match src-type LOCAL
KUBE-SEP-FO5ZQ232FWC5GK4A  all  --  anywhere             anywhere             /* Balancing rule 0 for default/echo-local: */

On a node with no local endpoint:

cactus@master01:~$ sudo iptables -t nat -L KUBE-NODEPORTS
Chain KUBE-NODEPORTS (1 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  tcp  --  anywhere             anywhere             /* default/echo-np: */ tcp dpt:30398
KUBE-SVC-2JCTU37Y3EPLWMDU  tcp  --  anywhere             anywhere             /* default/echo-np: */ tcp dpt:30398
KUBE-MARK-MASQ  tcp  --  localhost/8          anywhere             /* default/echo-local: */ tcp dpt:30757
KUBE-XLB-WVR3FY6OQL4XX3DZ  tcp  --  anywhere             anywhere             /* default/echo-local: */ tcp dpt:30757

cactus@master01:~$ sudo iptables -t nat -L KUBE-XLB-WVR3FY6OQL4XX3DZ
Chain KUBE-XLB-WVR3FY6OQL4XX3DZ (1 references)
target     prot opt source               destination         
KUBE-SVC-WVR3FY6OQL4XX3DZ  all  --  10.244.0.0/16        anywhere             /* Redirect pods trying to reach external loadbalancer VIP to clusterIP */
KUBE-MARK-MASQ  all  --  anywhere             anywhere             /* masquerade LOCAL traffic for default/echo-local: LB IP */ ADDRTYPE match src-type LOCAL
KUBE-SVC-WVR3FY6OQL4XX3DZ  all  --  anywhere             anywhere             /* route LOCAL traffic for default/echo-local: LB IP to service chain */ ADDRTYPE match src-type LOCAL
KUBE-MARK-DROP  all  --  anywhere             anywhere             /* default/echo-local: has no local endpoints */

Access via master01’s node IP fails (the packet is dropped), while access via minion01’s works. And from minion01’s response we can see that the source IP is not SNAT’d to a node IP:

serena@ubuntu:~$ wget -qO - 192.168.11.2:30757 | grep client
^C

serena@ubuntu:~$ wget -qO - 192.168.11.3:30757 | grep client
client_address=192.168.11.1

Further topics to inspect:

  1. ClusterIP with external IP (a deep dive into the physdev module of iptables)
