When submitting Spark jobs to Kubernetes, every executor failed with UnknownHostException. This post records how the problem was tracked down.

1. Environment

Kubernetes version: 1.15.1
CentOS version: 7.6
Kernel: 3.10.0-1160.21.1.el7.x86_64

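2. Problem description

When a job is submitted to Kubernetes, the executors report the following error:

The hostname of the master node cannot be resolved.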

3. Troubleshooting

1. First, check whether the CoreDNS pods report any errors (they are running normally)

[root@kube-master ~]# kubectl get pod -n kube-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-76d4774d89-q86j9   1/1     Running   3          8d
calico-node-bfkf8                          1/1     Running   2          8d
calico-node-gcv9f                          1/1     Running   3          8d
calico-node-jqcx9                          1/1     Running   4          8d
coredns-7ff77c879f-bnlpb                   1/1     Running   0          4h31m
coredns-7ff77c879f-z4zrh                   1/1     Running   0          4h32m

2. Check the CoreDNS logs

[root@kube-master ~]# kubectl logs --tail 100 -f coredns-7ff77c879f-bnlpb -n kube-system
.:53
consul.local.:53
[INFO] plugin/reload: Running configuration MD5 = 48a9faf327db0890f6a2f5d9467b2307
CoreDNS-1.6.7
linux/amd64, go1.13.6, da7f65b
[INFO] Reloading
[INFO] plugin/reload: Running configuration MD5 = bd6502a98482c07c2ece18ef5fccf195
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/reload: Running configuration MD5 = 943b088ffd2550b14fa0f10495a317ed
[INFO] Reloading complete

The logs show nothing abnormal, so the investigation moves on to the next step.

3. Following the steps in the official documentation, create a simple pod and check whether DNS resolution works from inside it

# Create the pod
[root@kube-master ~]# kubectl create -f https://k8s.io/examples/admin/dns/busybox.yaml
pod/busybox created
 
# Test DNS resolution (it fails)
[root@kube-master ~]# kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    10.96.0.10
Address 1: 10.96.0.10
nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1

4. Check whether the DNS configuration inside the pod is correct; it lives in /etc/resolv.conf

# Verify the DNS configuration inside the pod
[root@kube-master ~]# kubectl exec busybox cat /etc/resolv.conf
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
[root@kube-master ~]# kubectl get svc -n kube-system
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   8d

The nameserver address in the pod matches the ClusterIP of the kube-dns Service that fronts the CoreDNS deployment, so the configuration is fine.
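
The comparison above was done by eye. A minimal sketch of the same check as a script is shown below; the function names first_nameserver and nameserver_matches are hypothetical helpers introduced here, not part of kubectl.

```shell
#!/usr/bin/env bash
# Print the first nameserver found in resolv.conf-style text read from stdin.
first_nameserver() {
  awk '$1 == "nameserver" { print $2; exit }'
}

# Return 0 when the pod's nameserver equals the Service ClusterIP.
# $1: contents of the pod's /etc/resolv.conf
# $2: ClusterIP of the kube-dns Service
nameserver_matches() {
  local ns
  ns=$(printf '%s\n' "$1" | first_nameserver)
  [ -n "$ns" ] && [ "$ns" = "$2" ]
}

# Example against a live cluster (not executed here):
#   conf=$(kubectl exec busybox -- cat /etc/resolv.conf)
#   ip=$(kubectl -n kube-system get svc kube-dns -o jsonpath='{.spec.clusterIP}')
#   nameserver_matches "$conf" "$ip" && echo "nameserver OK"
```

A match only shows that the pod is pointed at the right Service address; it says nothing about whether traffic to that address is actually delivered.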

5. Enable query logging in CoreDNS, repeat the lookup, and check the logs

# Edit the CoreDNS ConfigMap with kubectl edit
[root@kube-master ~]# kubectl -n kube-system edit configmap coredns
apiVersion: v1
data:
  Corefile: |
    .:53 {
        log     # add the log plugin
        errors
        health
        kubernetes cluster.as-gmbh.de in-addr.arpa ip6.arpa {
           pods verified
           upstream
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        ready :8181
        forward .  172.18.2.21
        cache 30
        loop
        reload
        loadbalance
    }
 
# Repeat the lookup
[root@kube-master ~]# kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    10.96.0.10
Address 1: 10.96.0.10
nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
 
# Check the CoreDNS pod logs
# (the lookup above still fails)
[root@kube-master es-cluster]# kubectl logs --tail 100 -f coredns-7ff77c879f-bnlpb  -n kube-system
.:53
consul.local.:53
[INFO] plugin/reload: Running configuration MD5 = 48a9faf327db0890f6a2f5d9467b2307
CoreDNS-1.6.7
linux/amd64, go1.13.6, da7f65b
[INFO] Reloading
[INFO] plugin/reload: Running configuration MD5 = bd6502a98482c07c2ece18ef5fccf195
[INFO] Reloading complete
[INFO] Reloading
[INFO] plugin/reload: Running configuration MD5 = 943b088ffd2550b14fa0f10495a317ed
[INFO] Reloading complete
[INFO] 127.0.0.1:35951 - 14041 "HINFO IN 5191543454773210781.3002096867822145647. udp 57 false 512" NXDOMAIN qr,rd,ra 132 0.042900779s

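After logging was enabled, only the last line was new. That HINFO query from 127.0.0.1 is the self-check sent by the loop plugin, not a query from the pod. Repeating the lookup produced no further log entries, which means the DNS queries from the pod never reached the CoreDNS pods.
The resolution path is pod -> kube-proxy -> CoreDNS Service IP -> CoreDNS, so kube-proxy became the next suspect.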
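
To make the "no new log entries" observation less dependent on eyeballing the output, the query log can be counted. The sketch below assumes the log plugin's default line format, in which each query appears as "TYPE IN name."; count_queries_for is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Count CoreDNS query-log lines (read from stdin) that mention the given name.
# $1: name to look for, e.g. kubernetes.default
count_queries_for() {
  # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
  grep -cF -- " IN $1" || true
}

# Example against a live cluster (not executed here):
#   kubectl logs --tail 100 coredns-7ff77c879f-bnlpb -n kube-system \
#     | count_queries_for kubernetes.default
```

A count of 0 on every CoreDNS pod right after a failed nslookup is consistent with the queries being lost before they reach CoreDNS.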

6. Inspect kube-proxy, which turns out to be reporting errors

[root@kube-master ~]# kubectl get pods -n kube-system | grep kube-proxy
kube-proxy-9p9g8                                1/1     Running   3          8d
kube-proxy-plvdj                                1/1     Running   4          8d
kube-proxy-tndvb                                1/1     Running   4          8d
[root@kube-master ~]# kubectl logs --tail 100 -f kube-proxy-9p9g8 -n kube-system
E0729 08:06:07.431488       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[10 0 9 100 0 0 0 0 0 0 0 0 0 0 0 0]
E0729 08:06:07.431550       1 proxier.go:1192] Failed to sync endpoint for service: 10.101.142.176:5601/TCP, err: parseIP Error ip=[10 0 9 100 0 0 0 0 0 0 0 0 0 0 0 0]
E0729 08:06:07.431701       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[10 0 9 100 0 0 0 0 0 0 0 0 0 0 0 0]
E0729 08:06:07.431738       1 proxier.go:1533] Failed to sync endpoint for service: 172.17.0.1:31795/TCP, err: parseIP Error ip=[10 0 9 100 0 0 0 0 0 0 0 0 0 0 0 0]
E0729 08:06:07.431868       1 proxier.go:1950] Failed to list IPVS destinations, error: parseIP Error ip=[10 0 9 100 0 0 0 0 0 0 0 0 0 0 0 0]

The kube-proxy pod logs contain many Error-level entries. The keywords IPVS and parseIP Error suggest that the failure happens when the IPVS code parses IP addresses.
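
The byte list in the error is easier to read as an address. The sketch below assumes the first four bytes of the 16-byte list are an IPv4 address followed by zero padding, which is how the values in these logs read; decode_parse_ip is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Turn the byte list from kube-proxy "parseIP Error ip=[...]" lines into
# dotted IPv4 notation, assuming the first four bytes hold the address.
# Reads log lines from stdin and prints each distinct address once.
decode_parse_ip() {
  sed -n 's/.*parseIP Error ip=\[\([0-9]*\) \([0-9]*\) \([0-9]*\) \([0-9]*\) .*/\1.\2.\3.\4/p' \
    | sort -u
}

# Example against a live cluster (not executed here):
#   kubectl logs --tail 100 kube-proxy-9p9g8 -n kube-system | decode_parse_ip
```

For the log lines above this yields 10.0.9.100, which can then be compared with the node and pod addresses in the cluster.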
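With no idea where to start, I turned to Google and the official site and found a lead.
Because the problem only appeared after upgrading to Kubernetes 1.18, I searched the Kubernetes GitHub issues and found someone who hit the same error after upgrading to 1.18. According to the maintainers' discussion in that issue, the likely cause is that newer Kubernetes uses a newer IPVS module that needs matching kernel support. I run CentOS with kernel 3.10, whose IPVS module is old and lacks what the newer Kubernetes IPVS code depends on.
Per the discussion in the issue, the fix is to upgrade the kernel.
Note: the issue is at https://github.com/kubernetes/kubernetes/issues/89520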
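
Since the fix has to be applied on every node, it helps to check which nodes still run an old kernel. The sketch below compares a `uname -r` string against a minimum major.minor version. The helper name kernel_at_least is hypothetical, and the 4.1 threshold in the example is an assumption for illustration only; the text above does not state the exact minimum version.

```shell
#!/usr/bin/env bash
# Return 0 when a kernel release string is at least MAJOR.MINOR.
# $1: release as printed by `uname -r`, e.g. 3.10.0-1160.21.1.el7.x86_64
# $2: minimum major version
# $3: minimum minor version
kernel_at_least() {
  local major minor
  major=${1%%.*}          # text before the first dot
  minor=${1#*.}           # drop "MAJOR."
  minor=${minor%%.*}      # keep text up to the next dot
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# Example on a node (not executed here; 4.1 is an assumed threshold):
#   kernel_at_least "$(uname -r)" 4 1 || echo "kernel may be too old for IPVS mode"
```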

4. Solution

1. Upgrade the kernel (on all Kubernetes nodes)

## Import the ELRepo public key
[root@kube-master ~]# rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
 
## Install the latest ELRepo release package
[root@kube-master ~]# yum install -y https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm
 
## List the available kernel packages
[root@kube-master ~]# yum list available --disablerepo='*' --enablerepo=elrepo-kernel
 
## Install one of the listed versions
[root@kube-master ~]# yum install -y kernel-lt-4.4.231-1.el7.elrepo --enablerepo=elrepo-kernel
## With this kernel installed, the reboot failed
 
## List the kernels available at boot
[root@kube-master ~]# cat /boot/grub2/grub.cfg | grep menuentry
if [ x"${feature_menuentry_id}" = xy ]; then
  menuentry_id_option="--id"
  menuentry_id_option=""
export menuentry_id_option
menuentry 'CentOS Linux (4.4.231-1.el7.elrepo.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-693.el7.x86_64-advanced-71db7cdf-7608-4daa-b954-436824de2f4c' {
menuentry 'CentOS Linux (3.10.0-1127.13.1.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-693.el7.x86_64-advanced-71db7cdf-7608-4daa-b954-436824de2f4c' {
menuentry 'CentOS Linux (3.10.0-693.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-693.el7.x86_64-advanced-71db7cdf-7608-4daa-b954-436824de2f4c' {
menuentry 'CentOS Linux (0-rescue-e85ea74cb6784a9394b320f95e538fcb) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-0-rescue-e85ea74cb6784a9394b320f95e538fcb-advanced-71db7cdf-7608-4daa-b954-436824de2f4c' {
 
 
## Set the new kernel as the default boot entry
[root@kube-master ~]# grub2-set-default "CentOS Linux (4.4.231-1.el7.elrepo.x86_64) 7 (Core)"
## Check the saved default entry
[root@kube-master ~]# grub2-editenv list
saved_entry=CentOS Linux (4.4.231-1.el7.elrepo.x86_64) 7 (Core)
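
The grep output above also includes script lines that merely contain the word menuentry. A small sketch that prints only the entry titles, ready to paste into grub2-set-default, is shown below; list_menuentries is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the title of every top-level menuentry in grub.cfg text read from stdin.
# Titles are the first single-quoted string on lines starting with "menuentry ".
list_menuentries() {
  awk -F"'" '/^menuentry / { print $2 }'
}

# Example on a node (not executed here):
#   list_menuentries < /boot/grub2/grub.cfg
```

Splitting on the single quote keeps the title intact even though it contains spaces and parentheses.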

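After upgrading the kernel this way, the system failed to reboot.
Update: after reviewing the listed versions, use the following command to upgrade to the latest kernel instead.
[root@kube-node1 ~]# yum --enablerepo=elrepo-kernel install kernel-ml
Why kernel-ml rather than kernel-lt (in my experience the mainline kernel was the more stable choice):

  ELRepo provides two kinds of Linux kernel packages, kernel-lt and kernel-ml. What is the difference between them? The kernel-ml packages are built from the sources of the mainline stable branch of the Linux Kernel Archives. The kernel configuration is based on the default RHEL-7 configuration, with additional features enabled as needed. The packages are intentionally named kernel-ml so that they do not conflict with the RHEL-7 kernels and can be installed and updated alongside the regular kernel. The kernel-lt packages are built from Linux Kernel Archives sources in the same way. The difference is that kernel-lt is based on a long-term support branch, while kernel-ml is based on the mainline stable branch.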
