Deploying a K8S cluster offline on three CentOS machines — our product runs in the customer's server room, which is an isolated LAN

Preface

Per the project requirements, the team has to deploy offline into the customer's server room, so the plan is to install the k8s cluster on internet-connected servers first and then package everything into a system image.

1. Install Docker and the offline images

Download the Docker offline installer and the image archive, e.g. docker-ce-18.09.0-3.el7.x86_64.rpm and kubernetes-images-v1.17.0.tar.gz.
1. Download docker-ce-18.09.0-3.el7.x86_64.rpm
On a computer with internet access, open the following URL in a browser; the package can then be downloaded locally:

https://download.docker.com/linux/centos/7/x86_64/stable/Packages/

Browse that page to find the file we need.

Next, download:
kubernetes-images-v1.17.0.tar.gz
On an internet-connected machine, open this URL directly in the browser:

https://dl.k8s.io/v1.17.0/kubernetes-server-linux-amd64.tar.gz
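If the internet-connected machine happens to have a shell (Linux/macOS), the same two downloads can be scripted instead of clicking through the browser; a small sketch using the URLs above:

# run on the internet-connected machine, then copy the files into the LAN
curl -LO https://download.docker.com/linux/centos/7/x86_64/stable/Packages/docker-ce-18.09.0-3.el7.x86_64.rpm
curl -LO https://dl.k8s.io/v1.17.0/kubernetes-server-linux-amd64.tar.gz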

Copy the downloaded package and image archive to the three servers and install Docker:

I created the following directory on the servers:

/home/docker

Install:

 rpm -ivh docker-ce-18.09.0-3.el7.x86_64.rpm

The install reported the following error:

  [root@zzmuap6zwdoqhqxb docker]# rpm -ivh docker-ce-18.09.0-3.el7.x86_64.rpm 
warning: docker-ce-18.09.0-3.el7.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID 621e9f35: NOKEY
error: Failed dependencies:
        container-selinux >= 2.9 is needed by docker-ce-3:18.09.0-3.el7.x86_64
        containerd.io is needed by docker-ce-3:18.09.0-3.el7.x86_64
        docker-ce-cli is needed by docker-ce-3:18.09.0-3.el7.x86_64

Solution

The error tells us the Docker CE install is missing several dependencies. To resolve it:

On a Mac, open the following links in a browser to grab the required rpm packages:

  1. Open the links in a browser.
  2. On each page, find the rpm you need. For example, on the container-selinux page, locate the file named like container-selinux-2.x.x-x.el7.noarch.rpm, click the file name, and hit "Download" in the dialog that appears.

These are the packages (and versions) needed to prepare the offline Docker CE 18.09.0 install for CentOS 7:

  • container-selinux: container-selinux-2.107-3.el7.noarch.rpm
  • containerd.io: containerd.io-1.2.13-3.1.el7.x86_64.rpm
  • docker-ce-cli: docker-ce-cli-18.09.0-3.el7.x86_64.rpm
  • docker-ce: docker-ce-18.09.0-3.el7.x86_64.rpm
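If another networked CentOS 7 machine is available, yum can also resolve and fetch these RPMs (and whatever else they depend on) in one go instead of hunting them down in a browser; a sketch, assuming the docker-ce repo is added on that machine (exact version strings may need adjusting):

yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yumdownloader --resolve --destdir=/home/docker docker-ce-18.09.0 docker-ce-cli-18.09.0 containerd.io container-selinux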
sudo rpm -ivh containerd.io-1.2.13-3.1.el7.x86_64.rpm container-selinux-2.119.1-1.c57a6f9.el7.noarch.rpm docker-ce-18.09.0-3.el7.x86_64.rpm docker-ce-cli-18.09.3-3.el7.x86_64.rpm 

That worked — on to the next steps:

  • Import the Kubernetes offline images
  docker load -i kubernetes-images-v1.17.0.tar.gz

Running that command threw an error 😀:

[root@zzmuap6zwdoqhqxb docker]# docker load -i kubernetes-server-linux-amd64.tar.gz 
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Fix:

The error means that when docker load ran, the Docker client couldn't connect to the Docker engine, so my guess is the Docker engine isn't running (or has stopped).

Check whether the Docker engine is running, and (re)start it, as follows:

  1. In a terminal, run the following to start the Docker engine:
sudo systemctl start docker

If Docker is already running, try restarting it:

sudo systemctl restart docker

If you still can't connect to the Docker daemon, review the setup and consider reinstalling Docker.

Then continue with our command:

docker load -i kubernetes-images-v1.17.0.tar.gz

The error changed O(∩_∩)O haha~

open /var/lib/docker/tmp/docker-import-481500431/kubernetes/json: no such file or directory

This error usually means the path is wrong or that the tar archive doesn't contain what docker load expects. (In hindsight that's exactly the problem: kubernetes-server-linux-amd64.tar.gz is a release tarball of binaries, not a `docker save` image archive — the loadable images are the kube-*.tar files under kubernetes/server/bin/; see the sketch after this checklist.) Check the following:

  1. Make sure the tar path is correct and the file exists. (Mine does, so that's not the issue.)
  2. Use tar to check that the archive is intact and contains the expected files:

    tar tf kubernetes-server-linux-amd64.tar.gz

    This lists all the files and directories inside the tarball.

    kubernetes/
    kubernetes/server/
    kubernetes/server/bin/
    kubernetes/server/bin/apiextensions-apiserver
    kubernetes/server/bin/kube-controller-manager.tar
    kubernetes/server/bin/mounter
    kubernetes/server/bin/kube-proxy.docker_tag
    kubernetes/server/bin/kube-controller-manager.docker_tag
    kubernetes/server/bin/kube-proxy.tar
    kubernetes/server/bin/kubectl
    kubernetes/server/bin/kube-scheduler.tar
    kubernetes/server/bin/kube-apiserver.docker_tag
    kubernetes/server/bin/kube-scheduler
    kubernetes/server/bin/kubeadm
    kubernetes/server/bin/kube-controller-manager
    kubernetes/server/bin/kube-scheduler.docker_tag
    kubernetes/server/bin/kubelet
    kubernetes/server/bin/kube-proxy
    kubernetes/server/bin/kube-apiserver.tar
    kubernetes/server/bin/kube-apiserver
    kubernetes/LICENSES
    kubernetes/kubernetes-src.tar.gz
    kubernetes/addons/
  3. Confirm your Docker version supports the docker load command. Check the version with:

    docker version
    Client:
    Version:           18.09.3
    API version:       1.39
    Go version:        go1.10.8
    Git commit:        774a1f4
    Built:             Thu Feb 28 06:33:21 2019
    OS/Arch:           linux/amd64
    Experimental:      false
    
    Server: Docker Engine - Community
    Engine:
     Version:          18.09.0
     API version:      1.39 (minimum version 1.12)
     Go version:       go1.10.4
     Git commit:       4d60db4
     Built:            Wed Nov  7 00:19:08 2018
     OS/Arch:          linux/amd64
     Experimental:     false

    The Docker client is version 18.09.3 while the Docker server is 18.09.0.

    A mismatch like that can in theory make some commands misbehave, but for a patch-level difference I wouldn't expect any real impact.
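As noted above, the quickest way past this particular error is to load the per-component image tarballs that ship inside the server tarball, rather than the tarball itself; a sketch:

tar -zxvf kubernetes-server-linux-amd64.tar.gz
for img in kubernetes/server/bin/*.tar; do docker load -i "$img"; done
docker images | grep kube   # kube-apiserver, kube-controller-manager, kube-scheduler, kube-proxy should now be listed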

If all of the above checks out, confirm the Docker daemon is actually running; start it and check its status with:

sudo systemctl start docker
sudo systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-04-25 13:48:58 CST; 16min ago
     Docs: https://docs.docker.com
 Main PID: 4440 (dockerd)
   Memory: 66.3M
   CGroup: /system.slice/docker.service
           ├─4440 /usr/bin/dockerd -H unix://
           └─4459 containerd --config /var/run/docker/containerd/containerd.toml --log-level info

Apr 25 13:48:58 zzmuap6zwdoqhqxb dockerd[4440]: time="2023-04-25T13:48:58.383158185+08:00" level=info msg="Graph migration to c...conds"
Apr 25 13:48:58 zzmuap6zwdoqhqxb dockerd[4440]: time="2023-04-25T13:48:58.383404121+08:00" level=warning msg="mountpoint for pi...found"
Apr 25 13:48:58 zzmuap6zwdoqhqxb dockerd[4440]: time="2023-04-25T13:48:58.383647067+08:00" level=info msg="Loading containers: start."
Apr 25 13:48:58 zzmuap6zwdoqhqxb dockerd[4440]: time="2023-04-25T13:48:58.392385683+08:00" level=warning msg="Running modprobe bridge...
Apr 25 13:48:58 zzmuap6zwdoqhqxb dockerd[4440]: time="2023-04-25T13:48:58.481289323+08:00" level=info msg="Default bridge (dock...dress"
Apr 25 13:48:58 zzmuap6zwdoqhqxb dockerd[4440]: time="2023-04-25T13:48:58.520565932+08:00" level=info msg="Loading containers: done."
Apr 25 13:48:58 zzmuap6zwdoqhqxb dockerd[4440]: time="2023-04-25T13:48:58.532050119+08:00" level=info msg="Docker daemon" commi...8.09.0
Apr 25 13:48:58 zzmuap6zwdoqhqxb dockerd[4440]: time="2023-04-25T13:48:58.532162365+08:00" level=info msg="Daemon has completed...ation"
Apr 25 13:48:58 zzmuap6zwdoqhqxb dockerd[4440]: time="2023-04-25T13:48:58.553613862+08:00" level=info msg="API listen on /var/r....sock"
Apr 25 13:48:58 zzmuap6zwdoqhqxb systemd[1]: Started Docker Application Container Engine.
Hint: Some lines were ellipsized, use -l to show in full.

If Docker is already running, try restarting it with:

sudo systemctl restart docker

After all that fiddling it still doesn't work ┭┮﹏┭┮

Maybe Docker doesn't have enough space to create its temporary files; the next idea was to raise the Docker daemon's storage limits so it can.

Here are the steps that were suggested for changing Docker's storage settings:

  1. Edit the /etc/docker/daemon.json file

    sudo vim /etc/docker/daemon.json
  2. Copy the following into the file:

    {
       "storage-driver": "devicemapper",
       "storage-opts": [
           "dm.basesize=20G",
           "dm.thinpooldev=/dev/mapper/docker-thinpool",
           "dm.use_deferred_deletion=true",
           "dm.use_deferred_removal=true"
       ]
    }
  3. Save the file and exit the editor.
  4. Restart the Docker service

    sudo systemctl restart docker

The restart failed:

sudo systemctl restart docker
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

If systemctl restart docker fails, there are a few possibilities:

  1. The Docker config file is wrong: edit /etc/docker/daemon.json and double-check its contents.
  2. Docker is already running: in some cases you need to stop the service first and then start it again.

    sudo systemctl stop docker
    sudo systemctl start docker
  3. The Docker service itself is failing: if it is starved of memory or CPU it may not restart. Use systemctl status to inspect the service and read the logs to find and fix the underlying error.

    systemctl status docker
    journalctl -xe
  4. Boot ordering problems: if you are rebooting the machine, check the boot targets and startup ordering so the Docker service starts correctly during boot.

My hunch is the config file is wrong:

vim /etc/docker/daemon.json
{
    "storage-driver": "devicemapper",
    "storage-opts": [
        "dm.basesize=20G",
        "dm.thinpooldev=/dev/mapper/docker-thinpool",
        "dm.use_deferred_deletion=true",
        "dm.use_deferred_removal=true"
    ]
}

The config above looks syntactically fine, so next check whether the disk behind the install directory is full:

Checking the disk usage of /var/lib/docker matters because all of Docker's data (containers, images, volumes, etc.) lives under that directory.

df -h /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2        36G  3.1G   31G  10% /

From this output, /var/lib/docker lives on the root filesystem, which is only 10% used, so running out of disk space is not why Docker fails to start; the cause has to be something else.

Run these two commands to get the details on the Docker service:

systemctl status docker.service
journalctl -xe
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Tue 2023-04-25 14:16:04 CST; 7min ago
     Docs: https://docs.docker.com
  Process: 5069 ExecStart=/usr/bin/dockerd -H unix:// (code=exited, status=1/FAILURE)
 Main PID: 5069 (code=exited, status=1/FAILURE)

Apr 25 14:16:02 zzmuap6zwdoqhqxb systemd[1]: Failed to start Docker Application Container Engine.
Apr 25 14:16:02 zzmuap6zwdoqhqxb systemd[1]: Unit docker.service entered failed state.
Apr 25 14:16:02 zzmuap6zwdoqhqxb systemd[1]: docker.service failed.
Apr 25 14:16:04 zzmuap6zwdoqhqxb systemd[1]: docker.service holdoff time over, scheduling restart.
Apr 25 14:16:04 zzmuap6zwdoqhqxb systemd[1]: Stopped Docker Application Container Engine.
Apr 25 14:16:04 zzmuap6zwdoqhqxb systemd[1]: start request repeated too quickly for docker.service
Apr 25 14:16:04 zzmuap6zwdoqhqxb systemd[1]: Failed to start Docker Application Container Engine.
Apr 25 14:16:04 zzmuap6zwdoqhqxb systemd[1]: Unit docker.service entered failed state.
Apr 25 14:16:04 zzmuap6zwdoqhqxb systemd[1]: docker.service failed.
Apr 25 14:16:56 zzmuap6zwdoqhqxb sshd[5083]: Failed password for root from 164.92.157.12 port 40670 ssh2
Apr 25 14:16:56 zzmuap6zwdoqhqxb sshd[5083]: Connection closed by 164.92.157.12 port 40670 [preauth]
Apr 25 14:17:35 zzmuap6zwdoqhqxb sshd[5086]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=164...
Apr 25 14:17:35 zzmuap6zwdoqhqxb sshd[5086]: pam_succeed_if(sshd:auth): requirement "uid >= 1000" not met by user "root"
Apr 25 14:17:37 zzmuap6zwdoqhqxb sshd[5086]: Failed password for root from 164.92.157.12 port 58892 ssh2
Apr 25 14:17:37 zzmuap6zwdoqhqxb sshd[5086]: Connection closed by 164.92.157.12 port 58892 [preauth]
Apr 25 14:18:10 zzmuap6zwdoqhqxb sshd[5105]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=218...
Apr 25 14:18:10 zzmuap6zwdoqhqxb sshd[5105]: pam_succeed_if(sshd:auth): requirement "uid >= 1000" not met by user "root"
Apr 25 14:18:12 zzmuap6zwdoqhqxb sshd[5105]: Failed password for root from 218.94.53.250 port 56682 ssh2
Apr 25 14:18:12 zzmuap6zwdoqhqxb sshd[5105]: Connection closed by 218.94.53.250 port 56682 [preauth]
Apr 25 14:18:17 zzmuap6zwdoqhqxb sshd[5107]: Invalid user oracle from 164.92.157.12 port 45276
Apr 25 14:18:17 zzmuap6zwdoqhqxb sshd[5107]: input_userauth_request: invalid user oracle [preauth]
Apr 25 14:18:17 zzmuap6zwdoqhqxb sshd[5107]: pam_unix(sshd:auth): check pass; user unknown
Apr 25 14:18:17 zzmuap6zwdoqhqxb sshd[5107]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=164...
Apr 25 14:18:19 zzmuap6zwdoqhqxb sshd[5107]: Failed password for invalid user oracle from 164.92.157.12 port 45276 ssh2
Apr 25 14:18:19 zzmuap6zwdoqhqxb sshd[5107]: Connection closed by 164.92.157.12 port 45276 [preauth]
Apr 25 14:18:58 zzmuap6zwdoqhqxb sshd[5109]: Invalid user oracle from 164.92.157.12 port 39130
Apr 25 14:18:58 zzmuap6zwdoqhqxb sshd[5109]: input_userauth_request: invalid user oracle [preauth]
Apr 25 14:18:58 zzmuap6zwdoqhqxb sshd[5109]: pam_unix(sshd:auth): check pass; user unknown
Apr 25 14:18:58 zzmuap6zwdoqhqxb sshd[5109]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=164...
Apr 25 14:18:59 zzmuap6zwdoqhqxb sshd[5109]: Failed password for invalid user oracle from 164.92.157.12 port 39130 ssh2
Apr 25 14:19:00 zzmuap6zwdoqhqxb sshd[5109]: Connection closed by 164.92.157.12 port 39130 [preauth]
Apr 25 14:19:37 zzmuap6zwdoqhqxb sshd[5111]: refused connect from 164.92.157.12 (164.92.157.12)
Apr 25 14:20:01 zzmuap6zwdoqhqxb CROND[5113]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Apr 25 14:20:01 zzmuap6zwdoqhqxb postfix/pickup[3637]: 8B7C6E0010: uid=0 from=
Apr 25 14:20:01 zzmuap6zwdoqhqxb postfix/cleanup[5118]: 8B7C6E0010: message-id=<20230425062001.8B7C6E0010@zzmuap6zwdoqhqxb.localdomain>
Apr 25 14:20:01 zzmuap6zwdoqhqxb postfix/qmgr[1167]: 8B7C6E0010: from=, size=721, nrcpt=1 (queue acti...
Apr 25 14:20:01 zzmuap6zwdoqhqxb postfix/local[5120]: 8B7C6E0010: to=, orig_to=, relay=local, d...
Apr 25 14:20:01 zzmuap6zwdoqhqxb postfix/qmgr[1167]: 8B7C6E0010: removed
Apr 25 14:20:19 zzmuap6zwdoqhqxb sshd[5121]: refused connect from 164.92.157.12 (164.92.157.12)
Apr 25 14:20:59 zzmuap6zwdoqhqxb sshd[5123]: refused connect from 164.92.157.12 (164.92.157.12)
Apr 25 14:21:41 zzmuap6zwdoqhqxb sshd[5124]: refused connect from 164.92.157.12 (164.92.157.12)
Apr 25 14:22:22 zzmuap6zwdoqhqxb sshd[5125]: refused connect from 164.92.157.12 (164.92.157.12)
Apr 25 14:23:02 zzmuap6zwdoqhqxb sshd[5126]: refused connect from 164.92.157.12 (164.92.157.12)
sudo dockerd
INFO[2023-04-25T14:28:27.769068507+08:00] parsed scheme: "unix"                         module=grpc
INFO[2023-04-25T14:28:27.769125008+08:00] scheme "unix" not registered, fallback to default scheme  module=grpc
INFO[2023-04-25T14:28:27.769162226+08:00] parsed scheme: "unix"                         module=grpc
INFO[2023-04-25T14:28:27.769169182+08:00] scheme "unix" not registered, fallback to default scheme  module=grpc
INFO[2023-04-25T14:28:27.769197511+08:00] ccResolverWrapper: sending new addresses to cc: [{unix:///run/containerd/containerd.sock 0  }]  module=grpc
INFO[2023-04-25T14:28:27.769239388+08:00] ClientConn switching balancer to "pick_first"  module=grpc
INFO[2023-04-25T14:28:27.769302514+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc420793780, CONNECTING  module=grpc
INFO[2023-04-25T14:28:27.769299151+08:00] ccResolverWrapper: sending new addresses to cc: [{unix:///run/containerd/containerd.sock 0  }]  module=grpc
INFO[2023-04-25T14:28:27.769327661+08:00] ClientConn switching balancer to "pick_first"  module=grpc
INFO[2023-04-25T14:28:27.770884453+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc420793780, READY  module=grpc
INFO[2023-04-25T14:28:27.770933748+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc4201d0070, CONNECTING  module=grpc
INFO[2023-04-25T14:28:27.771102764+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc4201d0070, READY  module=grpc
Error starting daemon: error initializing graphdriver: devicemapper: Non existing device docker-thinpool

Try checking from the command line whether the config file we wrote is valid:

sudo docker daemon --config-file /etc/docker/daemon.json --test

O(∩_∩)O haha~

Back to the default configuration: the dockerd error above ("Non existing device docker-thinpool") makes the real cause clear — my daemon.json points at a devicemapper thin-pool that was never created. So I deleted my /etc/docker/daemon.json and started Docker again:

systemctl start docker
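If a daemon.json is still wanted later (for example to move Docker's data onto a larger disk), a minimal, safer sketch that sticks with the default overlay2 driver instead of a devicemapper thin-pool could look like this — both keys are standard dockerd options, and the data-root path is just an example:

{
    "storage-driver": "overlay2",
    "data-root": "/home/docker/data"
}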

Then upload the packages to the other two servers, install Docker there and start it, following the same steps as above.

Extract the downloaded tarball:

tar -zxvf kubernetes-server-linux-amd64.tar.gz

Install the Kubernetes components

After extracting kubernetes-server-linux-amd64.tar.gz, the following steps are useful:

  1. Move the kubernetes binaries into /usr/local/bin:

    sudo mv kubernetes/server/bin/* /usr/local/bin/
  2. Set up shell completion for the kubectl client:

    echo "source <(kubectl completion bash)" >> ~/.bashrc

    Then run the following so the change takes effect immediately:

    source ~/.bashrc
  3. Start the kubelet service:

    sudo systemctl start kubelet
    [root@zzmuap6zwdoqhqxb docker]# sudo systemctl start kubelet
    Failed to start kubelet.service: Unit not found.
    This error usually means the kubelet systemd unit is missing. Assuming kubelet itself is installed, first try sudo systemctl daemon-reload to re-read systemd's unit files; if a new kubelet unit exists it will be picked up, then try sudo systemctl start kubelet again.

    If it still fails with "Unit not found" even after the daemon-reload, the kubelet service was never properly installed as a systemd unit, so it has to be installed,

    and here that has to be done offline:

    But the binary is already in place from the step above, so I first tried making sure it is executable:

    sudo chmod +x /usr/local/bin/kubelet

    Still no luck 😀

    If sudo systemctl start kubelet keeps reporting "Unit not found" even though the kubelet binary is installed, the kubelet systemd unit file is missing and has to be created by hand. Concretely:

    1. Open /lib/systemd/system/kubelet.service (create it if it doesn't exist) with your editor of choice, e.g. vi or nano:
    sudo vim /lib/systemd/system/kubelet.service
    2. Add the following content to kubelet.service:
    [Unit]
    Description=kubelet: The Kubernetes Node Agent
    Documentation=https://kubernetes.io/docs/home/
    
    [Service]
    ExecStart=/usr/local/bin/kubelet --config=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manifests --cgroup-driver=systemd
    Restart=always
    StartLimitInterval=0s
    RestartSec=10s
    
    [Install]
    WantedBy=multi-user.target

    That covers the basics of the kubelet systemd unit: ExecStart points at the kubelet binary and its config file (adjust both paths to your actual layout); Restart=always restarts kubelet automatically when it fails; and WantedBy=multi-user.target hooks the service into multi-user.target so it starts at boot.

    3. Save kubelet.service, exit the editor, and reload systemd's configuration:
    sudo systemctl daemon-reload
    4. Then start the kubelet service:
    sudo systemctl start kubelet

    kubelet should now start. If it still doesn't, check the detailed error messages in the systemd journal with sudo journalctl -u kubelet.

  4. Check the kubelet service status:

    systemctl status kubelet

    If it's running you'll see active (running).

    [root@zzmuap6zwdoqhqxb systemd]# systemctl status kubelet
    ● kubelet.service - kubelet: The Kubernetes Node Agent
       Loaded: loaded (/usr/lib/systemd/system/kubelet.service; disabled; vendor preset: disabled)
       Active: activating (auto-restart) (Result: exit-code) since Tue 2023-04-25 17:38:09 CST; 6s ago
         Docs: https://kubernetes.io/docs/home/
      Process: 7545 ExecStart=/usr/local/bin/kubelet --config=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manifests --cgroup-driver=systemd (code=exited, status=255)
     Main PID: 7545 (code=exited, status=255)
    Apr 25 17:38:09 zzmuap6zwdoqhqxb systemd[1]: Unit kubelet.service entered failed state.
    Apr 25 17:38:09 zzmuap6zwdoqhqxb syst

    From this, kubelet isn't starting cleanly and keeps auto-restarting after every failure, which usually means its config file is wrong or missing. Check whether /etc/kubernetes/kubelet.conf exists and has correct contents, and look at the kubelet logs for more detail on why it fails.

    Get the kubelet logs with:

    sudo journalctl -u kubelet
    -- Logs begin at Tue 2023-04-25 11:21:07 CST, end at Tue 2023-04-25 17:40:27 CST. --
    Apr 25 17:37:59 zzmuap6zwdoqhqxb systemd[1]: Started kubelet: The Kubernetes Node Agent.
    Apr 25 17:37:59 zzmuap6zwdoqhqxb kubelet[7534]: Flag --pod-manifest-path has been deprecated, This parameter should be set via the confi...
    Apr 25 17:37:59 zzmuap6zwdoqhqxb kubelet[7534]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config fi...
    Apr 25 17:37:59 zzmuap6zwdoqhqxb kubelet[7534]: F0425 17:37:59.171440    7534 server.go:198] failed to load Kubelet config file /etc/kub...
    Apr 25 17:37:59 zzmuap6zwdoqhqxb systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
    Apr 25 17:37:59 zzmuap6zwdoqhqxb systemd[1]: Unit kubelet.service entered failed state.
    Apr 25 17:37:59 zzmuap6zwdoqhqxb systemd[1]: kubelet.service failed.
    Apr 25 17:38:09 zzmuap6zwdoqhqxb systemd[1]: kubelet.service holdoff time over, scheduling restart.
    Apr 25 17:38:09 zzmuap6zwdoqhqxb systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
    Apr 25 17:38:09 zzmuap6zwdoqhqxb systemd[1]: Started kubelet: The Kubernetes Node Agent.
    Apr 25 17:38:09 zzmuap6zwdoqhqxb kubelet[7545]: Flag --pod-manifest-path has been deprecated, This parameter should be set via the confi...
    Apr 25 17:38:09 zzmuap6zwdoqhqxb kubelet[7545]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config fi...
    Apr 25 17:38:09 zzmuap6zwdoqhqxb kubelet[7545]: F0425 17:38:09.364173    7545 server.go:198] failed to load Kubelet config file /etc/kub...
    Apr 25 17:38:09 zzmuap6zwdoqhqxb systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
    Apr 25 17:38:09 zzmuap6zwdoqhqxb systemd[1]: Unit kubelet.service entered failed state.
    Apr 25 17:38:09 zzmuap6zwdoqhqxb systemd[1]: kubelet.service failed.
    Apr 25 17:38:19 zzmuap6zwdoqhqxb systemd[1]: kubelet.service holdoff time over, scheduling restart.
    Apr 25 17:38:19 zzmuap6zwdoqhqxb systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
    Apr 25 17:38:19 zzmuap6zwdoqhqxb systemd[1]: Started kubelet: The Kubernetes Node Agent.
    Apr 25 17:38:19 zzmuap6zwdoqhqxb kubelet[7558]: Flag --pod-manifest-path has been deprecated, This parameter should be set via the confi...
    Apr 25 17:38:19 zzmuap6zwdoqhqxb kubelet[7558]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config fi...
    Apr 25 17:38:19 zzmuap6zwdoqhqxb kubelet[7558]: F0425 17:38:19.607625    7558 server.go:198] failed to load Kubelet config file /etc/kub...
    Apr 25 17:38:19 zzmuap6zwdoqhqxb systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
    Apr 25 17:38:19 zzmuap6zwdoqhqxb systemd[1]: Unit kubelet.service entered failed state.
    Apr 25 17:38:19 zzmuap6zwdoqhqxb systemd[1]: kubelet.service failed.
    Apr 25 17:38:29 zzmuap6zwdoqhqxb systemd[1]: kubelet.service holdoff time over, scheduling restart.
    Apr 25 17:38:29 zzmuap6zwdoqhqxb systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
    Apr 25 17:38:29 zzmuap6zwdoqhqxb systemd[1]: Started kubelet: The Kubernetes Node Agent.
    Apr 25 17:38:29 zzmuap6zwdoqhqxb kubelet[7569]: Flag --pod-manifest-path has been deprecated, This parameter should be set via the confi...
    Apr 25 17:38:29 zzmuap6zwdoqhqxb kubelet[7569]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config fi...
    Apr 25 17:38:29 zzmuap6zwdoqhqxb kubelet[7569]: F0425 17:38:29.868205    7569 server.go:198] failed to load Kubelet config file /etc/kub...
    Apr 25 17:38:29 zzmuap6zwdoqhqxb systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
    Apr 25 17:38:29 zzmuap6zwdoqhqxb systemd[1]: Unit kubelet.service entered failed state.

    In the log, look for hints or error messages about why the kubelet service failed to start.

    Based on this log, kubelet starts, fails and exits over and over; the likely cause is an error in its configuration that stops kubelet from starting properly.

    Check the kubelet config file /etc/kubernetes/kubelet.conf and make sure its contents are right (the kubeconfig, the IP/hostname kubelet listens on, and so on). If the cluster had been installed with kubeadm, the kubelet config would have been generated automatically under /etc/kubernetes/, so first check whether the file exists:

    ls -l /etc/kubernetes/kubelet.conf

    If the kubelet config file doesn't exist, create it by hand with the right settings — the Kubernetes docs describe what a kubelet config should contain. Once it's in place, restart the kubelet service.

    If the problem persists, go back to the kubelet logs to see exactly what fails at startup.

    Since the kubelet config file is what's missing here, it can be regenerated, for example with:

    sudo kubeadm config print init-defaults --component-configs KubeletConfiguration > kubelet.conf

    The generated kubelet.conf contains kubelet's default configuration (a minimal hand-written alternative is sketched at the end of this section). Next, move the file into /etc/kubernetes/:

    sudo mv kubelet.conf /etc/kubernetes/
    sudo mv kubelet.conf /etc/kubernetes/
    mv: cannot move ‘kubelet.conf’ to ‘/etc/kubernetes/’: No such file or directory

    If moving kubelet.conf into /etc/kubernetes/ fails with "No such file or directory", the /etc/kubernetes directory itself doesn't exist yet. Create it first with sane permissions:

    sudo mkdir /etc/kubernetes
    sudo chmod 755 /etc/kubernetes

    Then move kubelet.conf into /etc/kubernetes:

    sudo mv kubelet.conf /etc/kubernetes/

    Note: if the kubeadm-generated kubelet.conf contains CA certificate material, make sure the ownership and permissions are correct:

    sudo chown root:root /etc/kubernetes/kubelet.conf
    sudo chmod 644 /etc/kubernetes/kubelet.conf
  5. Enable kubelet to start at boot:

    sudo systemctl enable kubelet
sudo systemctl enable kubelet
Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /usr/lib/systemd/system/kubelet.service.
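For reference, the --config flag used in the unit file above expects a KubeletConfiguration YAML (a different thing from a kubeconfig). A minimal hand-written sketch of what /etc/kubernetes/kubelet.conf could contain — the field values are illustrative, not taken from my cluster:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd                      # must match Docker's cgroup driver
staticPodPath: /etc/kubernetes/manifests   # replaces the deprecated --pod-manifest-path flag
failSwapOn: true                           # by default kubelet refuses to run with swap enabled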

======= Starting over ==========

Prepare three machines running CentOS 7.6 or later; set the hostnames and hosts entries, and disable the firewall and SELinux.

  1. Set the hostnames

    # set the hostname
    hostnamectl set-hostname node1  # node1 is the control-plane (master) node
    hostnamectl set-hostname node2  # node2 is a worker node
    hostnamectl set-hostname node3  # node3 is a worker node
# confirm the hostname has changed
hostnamectl status 
  2. Configure the hosts file
# add hosts entries on all three servers
cat >> /etc/hosts <<EOF
192.168.1.127 node1 
192.168.1.115 node2
192.168.1.147 node3  
EOF
  3. Disable the firewall
# disable the firewall on all three servers
systemctl stop firewalld
systemctl disable firewalld
  4. Disable SELinux
# disable SELinux on all three servers
sed -i 's/enforcing/disabled/' /etc/selinux/config  
setenforce 0
  5. Restart the network service
# restart the network service on all three servers
systemctl restart network
  6. Configure passwordless ssh login — step 1: generate an ssh key on node1:
ssh-keygen -t rsa

Keep pressing Enter; two files are generated: id_rsa (the private key) and id_rsa.pub (the public key).

  2. Copy the contents of id_rsa.pub (the public key) to the clipboard:
cat ~/.ssh/id_rsa.pub

Copy the content it prints.

  3. Log in to node2 and append the copied public key to the end of authorized_keys:
ssh node2
echo <paste the public key copied in step 2> > >> ~/.ssh/authorized_keys 
echo ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6k8Av2wDFXOr7avTM4ScrYla4Sb+/p3TSuN+YrQPORo4JL9RlCxUyGTeEPoRahzipZ4DxRauL/mghtc1huDRjkZJQJUS1Ebg96M3+L3BQ4RPqmp3g7Lv46XSIHGOPKDvX16o0kTsYiTlEFc9BZc1LyJMDzIaMSfmKnEYIwb4lzPb/VOWpq7SGLNK/WDmxIGkZNrfDmclE3S68YrE4iE06cKKZxgiDFkmyM7SeeiqFVtLaScSIOA1Hftc+M2r+yOLfUtMvJofs6wwG3dv+L7CN0uPVuPwd/uIES5bHR5qPDdk09hsxQCfqN2wj2SgTb2sQDLZQA3FjKYF43zgTxK/r root@node1 \>\> ~/.ssh/authorized_keys  
-bash: syntax error near unexpected token `>>'

Workaround:

echo ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6k8Av2wDFXOr7avTM4ScrYla4Sb+/p3TSuN+YrQPORo4JL9RlCxUyGTeEPoRahzipZ4DxRauL/mghtc1huDRjkZJQJUS1Ebg96M3+L3BQ4RPqmp3g7Lv46XSIHGOPKDvX16o0kTsYiTlEFc9BZc1LyJMDzIaMSfmKnEYIwb4lzPb/VOWpq7SGLNK/WDmxIGkZNrfDmclE3S68YrE4iE06cKKZxgiDFkmyM7SeeiqFVtLaScSIOA1Hftc+M2r+yOLfUtMvJofs6wwG3dv+L7CN0uPVuPwd/uIES5bHR5qPDdk09hsxQCfqN2wj2SgTb2sQDLZQA3FjKYF43zgTxK/r root@node1 \>\> ~/.ssh/authorized_keys  

Running the `> >> ~/.ssh/authorized_keys` line in steps 3 and 4 fails with `syntax error near unexpected token '>'`, because `> >>` is not valid redirection syntax. The fix suggested at the time was that '>' has special meaning in bash and needs escaping, so the corrected step became: 3. Log in to node2 and append the copied public key to authorized_keys:

bash
ssh node2  
echo <paste the public key copied in step 2> \>\> ~/.ssh/authorized_keys  

Note: > >> becomes \>\>

  4. Log in to node3 and append the copied public key to the end of authorized_keys:
bash 
ssh node3
echo <paste the public key copied in step 2> \>\> ~/.ssh/authorized_keys  

Likewise, > >> becomes \>\>. The explanation given was that '>' is the output-redirection operator, special in bash, so using it directly causes a syntax error and escaping it with '\' avoids the error. In reality, escaping the redirect turns it into plain text: echo then just prints the key instead of appending it to authorized_keys (which is exactly what shows up in the output further down), so this "fix" doesn't actually configure anything — see the corrected sketch at the end of this SSH section.

  4. Log in to node3 and append the copied public key to the end of authorized_keys:
bash
ssh node3
echo <paste the public key copied in step 2> > >> ~/.ssh/authorized_keys
  • Copy the public key to the other two machines:
    ssh-copy-id node2 
    ssh-copy-id node3
  • Test passwordless ssh login:
    bash 
    ssh node2 
    ssh node3

    Logging in to node2 still asks for a password, which means passwordless login isn't actually working on node2. A few possible causes: 1. The contents of node2's authorized_keys are malformed — log in to node2 and check that ~/.ssh/authorized_keys holds exactly the public key string from step 3, with no blank lines or stray whitespace.
    If the format is wrong, clear the file and redo step 3 so the key is appended correctly. 2. The permissions on node2's authorized_keys are wrong — they should be 644; run:

    bash
    chmod 644 ~/.ssh/authorized_keys
    3. The SSH service isn't running or is misconfigured — log in to node2 and run:
    bash
    systemctl status sshd

    Check that the SSH service is active; if not, start it:

    bash
    systemctl start sshd 

    If the config is at fault, review /etc/ssh/sshd_config and fix it using another node's config as a reference.

    4. Node clock skew — this was claimed to cause password prompts too (it rarely does for key-based auth, but the cluster needs synchronized clocks anyway); sync the time on all three nodes:
    bash
    ntpdate 0.centos.pool.ntp.org

    These are the usual reasons passwordless login still asks for a password. Work through them and make sure that: - node2's authorized_keys is correctly formatted, with permission 644

  • the SSH service is up with a correct configuration
  • the three nodes' clocks are in sync. With all of that in place, ssh node2 should log straight in without a password.
    1. Log in to node2:
    bash
    ssh node2   # still asks for a password at this point
    2. Check the contents of authorized_keys:
    bash
    cat ~/.ssh/authorized_keys
    3. The content should be the public key string generated on node1, one key per line, no blank lines or stray spaces. A correct file looks something like this:
    ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEAzFhfgaLx41Wmh2vO3h2rjvkRUtdflFcDSUaZyREeVig3FKSTTiFmORBZWTuSHOzgYsuIwOBm852994WkmI9VREtdem1MxiSMDb+WrheMTBxnWiwje0W+QbvEfWkhT+5h+TJp0IzSySbcMnVovVXLCL8z15MKoxf94CUz+GfRlWZ+RR5i5QVdWtHtc464fWXk8VYM1G13LKnVey66qjoqarUrKo1fMMidZXr9acWeGw/+B+TTdhiZPzInhVbQbXozki+Tt6cxvDmCw== root@node1

    If the content is malformed, or the authorized_keys file doesn't exist, fix it as follows: 1) delete the existing authorized_keys file:

    bash
    rm -f ~/.ssh/authorized_keys

    2) On node1, copy the public key to the clipboard again:

    bash
    cat ~/.ssh/id_rsa.pub

    3) Log in to node2 and append the copied public key to authorized_keys:

    bash 
    echo <paste the public key copied in step 2> \>\> ~/.ssh/authorized_keys

    4) Set authorized_keys to mode 644:

    bash 
    chmod 644 ~/.ssh/authorized_keys 

    5) Test again whether logging in to node2 from node1 still asks for a password. Check the authorized_keys content, format and permissions as above, then retest the ssh login; if it still prompts, continue with the remaining checks.
    6) Key-pair file permissions: on node1, id_rsa (private key) and id_rsa.pub (public key) should be 600 and 644 respectively; looser permissions break key-based login, so fix them:

    bash
    chmod 600 ~/.ssh/id_rsa
    chmod 644 ~/.ssh/id_rsa.pub
    7) On node1 and node2, make sure sshd is healthy (check systemctl status sshd) and that public key authentication is enabled:

    check that PubkeyAuthentication is set to yes in /etc/ssh/sshd_config

    8) Keep node1 and node2's system time in sync:
    bash
    ntpdate 0.centos.pool.ntp.org 
    [root@node2 ~]# echo ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCxHBwc4AoL6/qLGoCUXV0oR7di9PKIxem9CH5JgtwGZroqhUR+WjwPbbuft7pmh4h7o6Gig52VltuxzZajj6rGxA88IsDzDKL/Rzh/lbiK6OtNQSh4Uop4Ni8t40aFKF8/JXefgDp9SJ12APDMdiUgwEFsLx/CAAoYlAxoUVbVVyUkYAdmPR/yqPA4kQhHis9lkJJqZZdsk6AxjXBHhQLGJgZxylamLeLxfqyEvVv8BeSVlG0a0IR82kLveKBRb8i5TUbH/85zVCns8O75lr0E39AsY8Uez3P/zr352Ido56cfM2GO+7+6asf7q0XhBpQ2AwOGAFKzv1QX810PZ5ED root@node1 \>\> ~/.ssh/authorized_keys
    ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCxHBwc4AoL6/qLGoCUXV0oR7di9PKIxem9CH5JgtwGZroqhUR+WjwPbbuft7pmh4h7o6Gig52VltuxzZajj6rGxA88IsDzDKL/Rzh/lbiK6OtNQSh4Uop4Ni8t40aFKF8/JXefgDp9SJ12APDMdiUgwEFsLx/CAAoYlAxoUVbVVyUkYAdmPR/yqPA4kQhHis9lkJJqZZdsk6AxjXBHhQLGJgZxylamLeLxfqyEvVv8BeSVlG0a0IR82kLveKBRb8i5TUbH/85zVCns8O75lr0E39AsY8Uez3P/zr352Ido56cfM2GO+7+6asf7q0XhBpQ2AwOGAFKzv1QX810PZ5ED root@node1 >> /root/.ssh/authorized_keys

    Looking at that output, what actually happened is visible right there: because the redirect was escaped, echo printed the key (plus a literal ">> /root/.ssh/authorized_keys") to the terminal instead of appending it to the file, so passwordless login kept failing. The advice I got, though, was that the key should go into a regular user's ~/.ssh/authorized_keys rather than root's, so: 1. Create a regular user on node2, e.g. node2:

    bash
    adduser node2

    2. Switch to the node2 user

    bash
    su - node2

    3. Copy the contents of node1's id_rsa.pub (the public key) to the clipboard:

    bash
    cat ~/.ssh/id_rsa.pub 

    4. Append the copied public key to that regular user's ~/.ssh/authorized_keys:

    bash
    echo <paste the public key copied in step 3> >> ~/.ssh/authorized_keys 

    5. Set authorized_keys to mode 644:

    bash
    chmod 644 ~/.ssh/authorized_keys  

    6. From node1, log in to node2 as the node2 user:

    bash
    ssh node2@node2  
    # if it still asks for a password, recheck steps 3-5

    7. From node1's root account, log in to node2's regular user:

    bash 
    ssh node2@node2

    It should now log straight in, no password needed: both root and the regular user on node1 can reach node2's regular user without typing a password. Putting the passwordless login on a regular user rather than directly on root is the safer choice — system resources should follow the principle of least privilege.

    1. On node2, create the regular user node2 and generate a key pair for it:
    bash
    adduser node2 
    su - node2
    ssh-keygen -t rsa  # keep pressing Enter to generate the key pair
    2. Copy the node2 user's public key to the clipboard:
    bash 
    cat ~/.ssh/id_rsa.pub
    3. Log in to node1 and append the copied key to the node2 user's ~/.ssh/authorized_keys:
    bash
    echo <paste the public key copied in step 2> >> ~/.ssh/authorized_keys 
    4. Set ~/.ssh/authorized_keys on node1 to mode 644:
    bash
    chmod 644 ~/.ssh/authorized_keys
    5. Log in to node1 as node2's node2 user:
    bash
    ssh node1  
    # logs straight in, no password needed
    6. Log in to node2's node2 user from node1's root user:
    bash 
    ssh node2@node2
    # logs straight in, no password needed

    The sshd option in question is PubkeyAuthentication, which enables or disables SSH public key authentication. To add it to the config file: 1. Edit /etc/ssh/sshd_config:

    bash
    vim /etc/ssh/sshd_config
    2. Press Shift+G to jump to the end of the file. 3. Add the PubkeyAuthentication option, set to yes, to enable public key auth:
    PubkeyAuthentication yes
    4. Save and quit the editor. 5. Restart the SSH service so the change takes effect:
    bash 
    systemctl restart sshd

    The .ssh directory should be 700 and authorized_keys 600 (some guides say 755 and 644); either way, the last two must not be writable by group or others.
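For the record, here is a sketch of what the append step should look like so that >> actually redirects (quote the whole remote command so the redirect runs on the other side), or skip the manual editing entirely with ssh-copy-id, which also fixes permissions; run from node1 and repeat for node3:

cat ~/.ssh/id_rsa.pub | ssh root@node2 "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
# or simply
ssh-copy-id root@node2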

When setting up a highly available k8s cluster, the nodes usually need passwordless SSH between them for administration and control.

For a 3-node k8s cluster there are two common ways to set this up:

  1. Chain configuration: node1 can reach node2, node2 can reach node3, node3 can reach node1. Simple, but if the middle node fails it breaks connectivity to the others.
  2. Full mesh: every node (node1, node2, node3) can reach every other node without a password. More work to configure, but more resilient — losing one node doesn't affect the others. For a 3-node HA cluster I'd go with the full mesh. The steps:

    1. Generate an SSH key pair on node1 and append its public key to ~/.ssh/authorized_keys on node2 and node3

    2. Generate an SSH key pair on node2 and append its public key to ~/.ssh/authorized_keys on node1 and node3

    3. Generate an SSH key pair on node3 and append its public key to ~/.ssh/authorized_keys on node1 and node2. 4. Check that ~/.ssh/authorized_keys is mode 600 on every node. 5. Check /etc/ssh/sshd_config on every node for PubkeyAuthentication yes. 6. Restart sshd on every node: systemctl restart sshd. 7. From each node, test passwordless SSH to every other node. 8. (Optional) Disable password and keyboard-interactive authentication for better security. With that done, the three k8s nodes have full passwordless SSH between them, which lays the groundwork for the HA cluster deployment.

    In short: distribute the public keys across node1, node2 and node3 and get the permissions right.

K8S component versions:

For this cluster deployment, my recommended versions are:

  • kubelet & kubeadm: 1.22.4
  • kubectl: 1.22.4
  • kubernetes-cni: 0.8.7
  • chrony: 4.0+
  • coredns: 1.8.0
  • etcd: 3.4.13-0
  • pause: 3.4.1

What each component does:

  • kubelet & kubeadm: kubelet is the primary node agent on every cluster node; it watches the pods and containers on the node and makes sure they run as specified. kubeadm bootstraps a Kubernetes cluster quickly and easily.
  • kubectl: the Kubernetes command-line tool used to interact with the cluster.
  • kubernetes-cni: the CNI network plugins that give pods network connectivity so they can talk to each other.
  • chrony: the time-sync service that keeps the cluster nodes' clocks in sync, as k8s requires.
  • coredns: the Kubernetes DNS server, providing DNS inside the cluster.
  • etcd: the database that persists the cluster state.
  • pause: a tiny placeholder image; the pause container is each pod's infrastructure container and holds the pod's namespaces so the pod stays in Running state.
  • kubelet, kubeadm and kubectl deploy and manage the k8s cluster.
  • coredns, etcd and pause provide the cluster's basic infrastructure.
  • kubernetes-cni handles pod networking; chrony handles time synchronization.
  • Docker runs the containers and container images.

Synchronize time across the three machines:

Install chrony offline on node1.

The offline chrony install goes like this: 1. On node1, check for an existing time-sync service and remove it if present:

bash
yum remove chrony ntp  # remove chrony and NTP
  2. Look up which chrony version matches your k8s version (release notes / official site); the advice assumed k8s 1.20 with chrony 4.0. 3. On an internet-connected machine, download the matching chrony RPM:
wget https://repo.chrony.org/chrony-4.0-1.x86_64.rpm       

Or with a browser:

- open https://chrony.tuxfamily.org/download.html

chrony-4.3.tar.gz

  1. Extract the source tarball on node1:
bash
tar -xzf chrony-4.3.tar.gz
  2. Change into the extracted directory:
bash 
cd chrony-4.3
  3. Run the configure script:
bash
./configure
[root@node1 chrony-4.3]# ./configure
Configuring for  Linux-x86_64
Checking for gcc : No
Checking for clang : No
Checking for cc : No
error: no C compiler found

Since our servers are on an isolated LAN, the only option is to download in a browser on another machine and copy the files over.

  1. On a computer with internet access, open the download pages for gcc, gcc-c++ and make in a browser.

    gcc download page: https://app.slack.com/client/T054JM7292S/D054CP9NHF1

gcc-c++ download page: https://vault.centos.org/centos/8/AppStream/x86_64/os/Packages/gcc-8.5.0-4.el8_5.x86_64.rpm
make download page: http://ftp.gnu.org/gnu/make/ — on these pages you can find binary packages for your system and architecture, e.g.: gcc-8.3.0.tar.xz
gcc-8.3.0-2.el7.x86_64.rpm
make-4.2.1.tar.gz

  2. Click each link and the browser downloads the corresponding package; double-check that the file names are right.
bash
# extract, configure and build from the tarball
tar -xvf gcc-8.3.0.tar.xz 
cd gcc-8.3.0

./configure --prefix=/home/k8s/gcc-8.3.0 --enable-languages=c,c++

make 

make install

# or install directly from the rpm packages
rpm -ivh gcc-8.3.0-2.el7.x86_64.rpm 
rpm -ivh make-4.2.1-1.el7.x86_64.rpm
configure: error: in `/home/k8s/gcc-8.3.0':
configure: error: no acceptable C compiler found in $PATH
See `config.log' for more details.

I couldn't get those dependencies installed inside the LAN, so I tried downloading gcc 7.3 instead:

https://ftp.gnu.org/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.gz

./configure --prefix=/home/k8s/gcc-7.3.0 --enable-languages=c,c++ 

Set the environment variable:

export PATH=/home/k8s/gcc-7.3.0/bin:$PATH 
  1. Run source to reload the environment:

    source /etc/profile 
    [root@node1 gcc-7.3.0]# source /etc/profile 
    -bash: TMOUT: readonly variable

    Add the environment variable by hand instead:

    vim /etc/profile
    source /etc/profile 
./configure --prefix=/home/k8s/gcc-7.3.0 --enable-languages=c,c++

This script checks the build environment and generates a Makefile.

  1. Compile the source:
bash
make
  2. Install:
bash
make install 

chrony is installed to /usr/local/bin by default. 6. Create the chrony configuration file /etc/chrony.conf (sample below) and start the service.

server 0.centos.pool.ntp.org iburst 
server 1.centos.pool.ntp.org iburst  
server 2.centos.pool.ntp.org iburst 
server 3.centos.pool.ntp.org iburst

Then start chrony:

bash
systemctl start chronyd.service
  1. Check the chrony version and status to confirm the install and configuration succeeded.
  2. Verify the version and that it works:
bash
chronyc -v   # show the chrony version 
systemctl start chronyd   # start chrony
chronyc sources  # check the time-sync status
  3. Edit /etc/chrony.conf so the other nodes are allowed to sync from this machine:
allow 192.168.0.0/16   # the node subnet 
bindcmdaddress 0.0.0.0  # accept command connections on any address
  4. Start chrony and enable it at boot:
bash
systemctl enable chronyd  && systemctl start chronyd 

Now have the other nodes sync their time.

Working without internet is a real pain; for now I'm using an internet-connected environment to get K8S installed successfully first:

chrony is a service implementing the NTP protocol, used to keep the host's system time in sync. It periodically syncs from upstream NTP servers so the host keeps accurate time. To install chrony on CentOS:

Make sure the epel-release repo is installed:

yum install epel-release

  1. Install the chrony package:

    yum install chrony

  2. Start chronyd and enable it at boot:

    systemctl start chronyd
    systemctl enable chronyd

  3. Check that chrony is running:

    systemctl status chronyd

  4. Check that chrony is syncing time:

    chronyc sources

    This lists the NTP servers chrony is syncing from.

  5. Configure the NTP servers. The default config file is /etc/chrony.conf. Point it at your own NTP server, or use the CentOS pool:

    server 0.centos.pool.ntp.org iburst
    server 1.centos.pool.ntp.org iburst
    server 2.centos.pool.ntp.org iburst
    server 3.centos.pool.ntp.org iburst

  6. Restart chronyd to apply the change:

    systemctl restart chronyd

  7. Check that the host is syncing with the NTP servers:

    chronyc tracking

    With that, chrony is installed and configured; it will sync periodically and keep the host clock accurate.

Making node1 the NTP server and having node2 and node3 sync their time from node1 is a good setup. The steps:

  1. Install chrony on node1 and configure it as the NTP server:

    yum install chrony

    vim /etc/chrony.conf

    Add at the end of the file:

    allow 192.168.1.0/24  # allow the LAN to sync time from this node
    local stratum 10

  2. Start chronyd and enable it at boot:

    systemctl start chronyd
    systemctl enable chronyd

  3. Install chrony on node2 and node3 (both servers) and configure them to sync from node1:

    yum install chrony

    vim /etc/chrony.conf

    Edit the file so it contains:

    server node1

  4. Start chronyd on node2 and node3:

    systemctl start chronyd
    systemctl enable chronyd

  5. Check the configuration and sync status on all three servers:

    # node1
    systemctl status chronyd   # service status
    chronyc sources            # check whether clients are syncing
    # node2, node3
    systemctl status chronyd
    chronyc sources
    chronyc tracking           # check that node1 is being tracked

  6. Re-check chrony's sync status and clock accuracy whenever needed; restart the service if necessary:

    systemctl restart chronyd

    With this, node1 serves time as the NTP server and node2/node3 sync from it as clients.

Configure keepalived for VIP failover

keepalived provides a highly available VIP (virtual IP) that floats between nodes via the VRRP protocol, giving the cluster a stable entry point and failover. Configuration steps: 1. Install keepalived on node1, node2 and node3:

bash
yum install keepalived -y
  2. Configure keepalived by editing /etc/keepalived/keepalived.conf; on node1:
global_defs {
   router_id node1  # unique ID, must be different on each node in the cluster 
}

vrrp_instance VI_1 {
    state MASTER  # node1 is the MASTER 
    interface ens33 # the interface the VIP binds to
    virtual_router_id 51 # VRRP router ID, must be the same across the cluster
    priority 100   # node1 has the highest priority, so it becomes MASTER
    advert_int 1    # VRRP advertisement (heartbeat) interval
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.200.16  # the VIP address
    }
} 

On node2 and node3, configure it like this:

global_defs {
   router_id node2  # different from node1
}

vrrp_instance VI_1 {
    state BACKUP   # node2/node3 are BACKUP
    interface ens33
    virtual_router_id 51 
    priority 90     # lower priority than node1
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.200.16  
    }
}
  3. Start and enable keepalived on node1, node2 and node3:
bash
systemctl start keepalived && systemctl enable keepalived
  4. Test the VIP failover by pinging 192.168.200.16. With keepalived running, the VIP sits on node1 and answers ping. If node1's keepalived (or its interface) is shut down, the VIP automatically moves to node2 and ping keeps working; shut down node2's keepalived and the VIP moves to node3, still answering ping.
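To see which node currently owns the VIP, you can also check the interface directly (interface name and address as configured above):

ip addr show ens33 | grep 192.168.200.16   # prints the address only on the node that currently holds the VIP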
Created symlink from /etc/systemd/system/multi-user.target.wants/keepalived.service to /usr/lib/systemd/system/keepalived.service.

This message means a symlink for keepalived.service was created: from /etc/systemd/system/multi-user.target.wants/keepalived.service to /usr/lib/systemd/system/keepalived.service. On Linux, keepalived is defined and controlled by the keepalived.service unit, which by default lives under /usr/lib/systemd/system/. Enabling the service creates that symlink under multi-user.target.wants/, which tells systemd to start keepalived automatically when the system reaches multi-user.target. In other words, keepalived on this node is now configured to start at boot. To start it now, just run:

bash
systemctl start keepalived.service 

Check its status with:

bash 
systemctl status keepalived.service

To enable keepalived at boot, run:

bash
systemctl enable keepalived.service

In short, the message tells us keepalived on that node is correctly set to start at boot; all that's left is to start the service and check that it's running.

Install kubectl 1.22.4

  1. Install the kubectl prerequisites on node1, node2 and node3
bash
# refresh the yum package index 
yum update -y

# install dependencies
yum install -y yum-utils device-mapper-persistent-data lvm2 curl socat
  2. Add the Aliyun Kubernetes repo
bash
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64/
enabled=1
gpgcheck=1
repo_gpgcheck=1 
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
EOF
  3. Install kubectl
bash
yum install -y kubectl-1.22.4
  4. Check the installed kubectl version
bash
kubectl version --client

The output should be:

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b69590d82caaae87caf6234447777b506ca95fab2", GitTreeState:"clean", BuildDate:"2021-11-16T19:29:40Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

Confirm the Client Version is 1.22.4, which means kubectl is installed correctly.

  5. Install on the other nodes (optional): repeat steps 1-4 on node2, node3, etc., so every node has the same kubectl 1.22.4 for managing the Kubernetes cluster.
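Since the real target machines are offline, the same packages can be downloaded once on a networked machine (with the Aliyun repo above configured) and carried into the LAN; a sketch, with an illustrative download directory:

# on the internet-connected machine
yum install -y --downloadonly --downloaddir=/home/k8s/rpms kubelet-1.22.4 kubeadm-1.22.4 kubectl-1.22.4
# copy /home/k8s/rpms to each node, then on each node:
rpm -ivh /home/k8s/rpms/*.rpm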

Install kubernetes-cni 0.8.7

To install kubernetes-cni 0.8.7 on the control-plane node node1 and the worker nodes node2 and node3:

  1. Download kubernetes-cni 0.8.7 on node1, node2 and node3
bash
wget https://github.com/containernetworking/plugins/releases/download/v0.8.7/cni-plugins-linux-amd64-v0.8.7.tgz

Domestic mirrors can speed the download up. For example — Aliyun mirror:

bash
wget https://kubernetes.oss-cn-hangzhou.aliyuncs.com/cni-plugins/v0.8.7/cni-plugins-linux-amd64-v0.8.7.tgz

- USTC mirror:

bash
wget https://mirrors.ustc.edu.cn/cni/cni-plugins-linux-amd64-v0.8.7.tgz

- Huawei Cloud mirror:

bash 
wget https://mirrors.huaweicloud.com/kubernetes/cni-plugins/v0.8.7/cni-plugins-linux-amd64-v0.8.7.tgz

- NetEase mirror:

bash
wget https://mirrors.163.com/cni/cni-plugins-linux-amd64-v0.8.7.tgz
  2. On node1, node2 and node3, install the plugins into /opt/cni/bin. The tarball holds the plugin binaries at its top level (there is no cni-plugins-linux-amd64-v0.8.7 directory inside), so create the target directory and extract straight into it:
bash
sudo mkdir -p /opt/cni/bin
sudo tar -xzvf cni-plugins-linux-amd64-v0.8.7.tgz -C /opt/cni/bin
  3. On node1, node2 and node3, create the CNI network config directory and file
bash
sudo mkdir -p /etc/cni/net.d
cat >/etc/cni/net.d/10-calico.conflist <<EOF
{
    "name": "k8s-pod-network", 
    "cniVersion": "0.3.0", 
    "plugins": [
        {
            "type": "calico", 
            "etcd_endpoints": "https://127.0.0.1:2379",
            "log_level": "info", 
            "datastore_type": "kubernetes",
            "nodename": "node1", 
            "ipam": {
                "type": "calico-ipam"
            },
            "policy": {
                "type": "k8s"
           },
            "kubernetes": {
                "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
            }
        }, 
        {
            "type": "portmap", 
            "snat": true, 
            "capabilities": {"portMappings": true}
        }
    ]
}
EOF

On node2 and node3, change the nodename value to the matching node name. 4. Set permissions on the CNI network config file:

bash
chmod 644 /etc/cni/net.d/10-calico.conflist 
  5. Restart kubelet so the CNI plugins are picked up; on node1, node2 and node3 run:
bash
systemctl restart kubelet
  6. Verify the installation; on any node run:
bash 
ls /opt/cni/bin

The output should list the CNI plugin binaries (bridge, host-local, portmap, and so on), which confirms the plugins are in place. That completes installing the kubernetes-cni 0.8.7 network plugins on the control-plane and worker nodes.

kubelet & kubeadm:1.22.4

Steps to install kubelet and kubeadm 1.22.4 on the three servers (node1, node2, node3):

  1. Add the Kubernetes yum repo:
bash
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://mirrors.aliyun.com/kubernetes/yum/repos/kubernetes-el7-x86_64/
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://mirrors.aliyun.com/kubernetes/yum/doc/yum-key.gpg https://mirrors.aliyun.com/kubernetes/yum/doc/rpm-package-key.gpg
EOF
  2. Install kubelet and kubeadm (kubectl was installed above):
bash 
sudo yum install -y kubelet-1.22.4 kubeadm-1.22.4 
  3. Enable and start kubelet:
bash
sudo systemctl enable --now kubelet
  4. Configure the required kernel parameters: edit /etc/sysctl.d/k8s.conf (a sysctl file, not a kubelet config) with vim and add:
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
net.ipv4.tcp_keepalive_time        =  600
  5. Reload sysctl so the settings take effect
bash 
sudo sysctl --system 
  6. Turn off swap, since Kubernetes does not support running with swap enabled
bash
sudo swapoff -a
  7. Make it permanent by commenting out the swap entry in /etc/fstab (there is no "swapoff=yes" kernel parameter, so the GRUB_CMDLINE_LINUX route suggested to me doesn't work):
bash
sudo sed -i '/ swap / s/^/#/' /etc/fstab
  8. Reboot the server to apply everything
bash
sudo reboot
  9. Install Docker 19.03 or newer

    10. Repeat the steps above on node1, node2 and node3. At this point kubelet, kubeadm and kubectl are installed on all three servers and everything is ready for initializing the Kubernetes cluster.

    In this cluster Docker is used as the container runtime: kubelet manages the pod containers through Docker, so Docker is a required component of the deployment. The install procedure is described at https://yeasy.gitbook.io/docker_practice/install. In short, installing Docker CE on CentOS goes like this:

    1. Add the Docker yum repo
    bash
    yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
    2. Install Docker CE:
    bash
    yum install docker-ce docker-ce-cli containerd.io

    Or pin a specific version:

    bash
    yum install docker-ce- docker-ce-cli- containerd.io
    3. Start Docker and enable it at boot:
    bash
    systemctl start docker 
    systemctl enable docker
    4. Configure a Docker registry mirror (optional)
    5. Add your user to the docker group or give it sudo rights (optional)

      If the user isn't in the docker group, every Docker command needs sudo to get root, which is a hassle, so it's worth adding the user to the docker group: 1. Create the docker group:

      bash
      groupadd docker
      2. Add your user to the docker group (here it's root):
      bash
      usermod -aG docker root

      Replace the username with your actual user.

      3. Log out, log back in, then run id <username> to confirm the user is now in the docker group.
      [root@zzmuap6zwdoqhqxb ~]# groupadd docker
      groupadd: group 'docker' already exists

      That's because the Docker install script already created the docker group, so you only need to add your user to the existing group. The commands:

      Replace the user in the command with your actual login user.

      bash
      # run on node1
      usermod -aG docker root
      # run on node2
      usermod -aG docker root
      # run on node3
      usermod -aG docker root

      Replace your_username with your actual user; after running this your user is in the docker group. Then you need to: 1. Log out of the current session:

      bash 
      exit
      2. Log back in.
      3. Test that your user can run Docker commands normally:
      bash 
      docker run hello-world
      [root@zzmuap6zwdoqhqxb ~]# docker run hello-world
      Unable to find image 'hello-world:latest' locally
      latest: Pulling from library/hello-world
      2db29710123e: Pull complete 
      Digest: sha256:4e83453afed1b4fa1a3500525091dbfca6ce1e66903fd4c01ff015dbcb1ba33e
      Status: Downloaded newer image for hello-world:latest
      
      Hello from Docker!
      This message shows that your installation appears to be working correctly.
      
      To generate this message, Docker took the following steps:
      1. The Docker client contacted the Docker daemon.
      2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
        (amd64)
      3. The Docker daemon created a new container from that image which runs the
        executable that produces the output you are currently reading.
      4. The Docker daemon streamed that output to the Docker client, which sent it
        to your terminal.
      
      To try something more ambitious, you can run an Ubuntu container with:
      $ docker run -it ubuntu bash
      
      Share images, automate workflows, and more with a free Docker ID:
      https://hub.docker.com/
      
      For more examples and ideas, visit:
      https://docs.docker.com/get-started/
      

      From this output Docker is installed and working correctly: the "Hello from Docker!" message shows that images can be pulled and containers run.

      It also confirms the user now has permission to use Docker and can manage it without sudo.

    These steps have to be repeated on node1, node2 and node3; only once Docker is installed and running on all three can the Kubernetes cluster be initialized. In short, after installing kubelet, kubeadm and kubectl you still need to:

    1. Install Docker 19.03 or newer on every node
    2. Start Docker on every node and enable it at boot
    3. Verify Docker works on every node; only then can you
    4. run kubeadm init on node1 to initialize the Kubernetes cluster.

    Next, run kubeadm init on node1 to initialize the cluster control plane, then run kubeadm join on node2 and node3 to add them to the cluster.

  • Run kubeadm init on node1

    On node1, run kubeadm init to initialize the Kubernetes control plane.

Assuming node1's IP address is 192.168.0.1, the command is:

bash
kubeadm init --apiserver-advertise-address=192.168.0.1 --pod-network-cidr=10.244.0.0/16
# my node1 IP address is 192.168.1.127
kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16
[root@node1 ~]# kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16
I0427 08:33:33.663710   31403 version.go:255] remote version is much newer: v1.27.1; falling back to: stable-1.22
[init] Using Kubernetes version: v1.22.17
[preflight] Running pre-flight checks
        [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 23.0.4. Latest validated version: 20.10
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
To see the stack t

The output shows kubeadm init hit some errors that have to be fixed before re-running the initialization. There are two:

  1. The Docker version is not on the validated list. Kubernetes 1.22 validates Docker versions roughly in the 17.03~20.10 range, so Docker has to be moved (here: downgraded) into that range.

    
    [root@node1 ~]# docker version
    Client: Docker Engine - Community
    Version:           23.0.4
    API version:       1.42
    Go version:        go1.19.8
    Git commit:        f480fb1
    Built:             Fri Apr 14 10:36:38 2023
    OS/Arch:           linux/amd64
    Context:           default

Server: Docker Engine - Community
Engine:
Version: 23.0.4
API version: 1.42 (minimum version 1.12)
Go version: go1.19.8
Git commit: cbce331
Built: Fri Apr 14 10:34:14 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.20
GitCommit: 2806fc1057397dbaeefbea0e4e17bddfbd388f38
runc:
Version: 1.1.5
GitCommit: v1.1.5-0-gf19387a
docker-init:
Version: 0.19.0
GitCommit: de40ad0
[root@node1 ~]#


2. Swap is enabled, which breaks Kubernetes; it has to be disabled, temporarily or permanently. So the fix is:

   1. Move Docker to a version in the 17.03~20.10 range (the Docker install guide covers this).

      1) Stop the Docker service:
  bash
  systemctl stop docker
  ```

  ````
  [root@node1 ~]# systemctl stop docker
  Warning: Stopping docker.service, but it can still be activated by:
    docker.socket
  ````

  The warning shows that stopping docker.service alone is not enough. Docker actually consists of two systemd units: docker.service, the engine daemon,
  and docker.socket, the daemon's socket unit used for socket activation. To stop Docker completely, stop both:

  ```
  bash
  systemctl stop docker.service docker.socket
  ```

  This stops the Docker daemon and its socket, i.e. the whole Docker service.

  2) Optional: before uninstalling the current Docker version, it's strongly recommended to take backups — back up Docker's data:

  ```
  bash
  cp -rpf /var/lib/docker /var/lib/docker.bak
  ```

  - and back up the existing Docker configuration:

  ```
  bash 
  cp -rpf /etc/docker /etc/docker.bak
  ```

  ````
  The currently installed Docker is 23.0.4, which is outside the range Kubernetes 1.22 supports (17.03~20.10),
  so the current version has to be uninstalled and a supported version installed instead.
  Uninstall steps:
  bash
  # 1. Remove the Docker packages
  yum remove docker \
                    docker-client \
                    docker-client-latest \
                    docker-common \
                    docker-latest \
                    docker-latest-logrotate \
                    docker-logrotate \
                    docker-engine

  # 2. Remove Docker's remaining files from the host  
  rm -rf /var/lib/docker
  rm -rf /var/run/docker.sock
  Installing Docker 19.03 then goes like this:
  bash 
  # 1. Add the Docker repo
  yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

  # 2. Install Docker CE 19.03  
  yum install docker-ce-19.03.15 docker-ce-cli-19.03.15 containerd.io-1.4.7

  # 3. Start Docker CE 
  systemctl start docker

  # 4. Enable Docker to start at boot 
  systemctl enable docker
  After the Docker downgrade, re-run kubeadm init to initialize the Kubernetes cluster
  ````
  2. Disable swap:
bash
# temporarily 
swapoff -a 

# permanently: comment out the swap partition line in /etc/fstab
sed -i 's/^\/dev\/mapper\/centos-swap/#&/' /etc/fstab
  3. Re-run kubeadm init to initialize the Kubernetes control plane:
bash
kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16

If the output contains "Your Kubernetes control-plane has initialized successfully!", the control plane is up!

About the two flags:

--apiserver-advertise-address is the IP address the other nodes use to talk to the control plane.

--pod-network-cidr is the CIDR range for the pod network; if you plan to use Flannel as the network plugin, 10.244.0.0/16 is the recommended value. On success, the output ends with something like:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.0.1:6443 --token abcdef.1234567890abcdef \
    --discovery-token-ca-cert-hash sha256:1234..cdef 
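If that join command scrolls away or the token expires (tokens are valid for 24 hours by default) before node2 and node3 are ready, a fresh join command can be printed on node1 at any time; a sketch:

kubeadm token create --print-join-command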

Per that output, the next steps are:

  1. Configure kubectl, running the three suggested commands as a regular user

    1) Log in to node1 as a regular user rather than root
    2) Run the three commands:
    bash
    # create the .kube directory
    mkdir -p $HOME/.kube 
    
    # copy the cluster config into .kube
    sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config  
    
    # fix ownership 
    sudo chown $(id -u):$(id -g) $HOME/.kube/config
    3) Test kubectl with kubectl get nodes; sample output:
    NAME     STATUS     ROLES                  AGE   VERSION
    node1   NotReady   control-plane,master   11m   v1.23.0

    node1 shows NotReady because no pod network plugin has been deployed yet. 4) Deploy a pod network plugin, for example Flannel:

    bash
    kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
    5) Run kubectl get nodes again; node1 should now show Ready.
  2. Deploy the pod network plugin with kubectl apply -f [podnetwork].yaml; Calico or Flannel are both fine choices.
  3. Run kubeadm join on node2 and node3 to add them to the cluster.
  4. Install any other plugins or components (optional).
  • Run kubeadm join on node2 and node3 to add them to the cluster

    [root@node1 ~]# kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16
    I0427 08:52:28.284600     583 version.go:255] remote version is much newer: v1.27.1; falling back to: stable-1.22
    [init] Using Kubernetes version: v1.22.17
    [preflight] Running pre-flight checks
          [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 23.0.4. Latest validated version: 20.10
    error execution phase preflight: [preflight] Some fatal errors occurred:
          [ERROR Swap]: running with swap on is not supported. Please disable swap
    [preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
    To see the stack trace
    [root@node1 ~]# yum remove docker \
    >                   docker-client \
    >                   docker-client-latest \
    >                   docker-common \
    >                   docker-latest \
    >                   docker-latest-logrotate \
    >                   docker-logrotate \
    >                   docker-engine
    Loaded plugins: fastestmirror, langpacks
    Repository base is listed more than once in the configuration
    Repository updates is listed more than once in the configuration
    Repository extras is listed more than once in the configuration
    Repository centosplus is listed more than once in the configuration
    No Match for argument: docker
    No Match for argument: docker-client
    No Match for argument: docker-client-latest
    No Match for argument: docker-common
    No Match for argument: docker-latest
    No Match for argument: docker-latest-logrotate
    No Match for argument: docker-logrotate
    No Match for argument: docker-engine
    No Packages marked for removal
    [root@node1 ~]# 

    The output shows yum remove found no matching packages — those are the legacy Docker package names, whereas this machine has the Docker CE packages (docker-ce, docker-ce-cli, containerd.io), or Docker was installed some other way (e.g. a script). In that case, remove Docker's files and directories by hand:

    bash
    # remove the docker data directory
    rm -rf /var/lib/docker
    
    # remove the docker.service file
    rm /usr/lib/systemd/system/docker.service  
    
    # remove the docker.socket file
    rm /usr/lib/systemd/system/docker.socket  
    
    # remove containerd's data directory
    rm -rf /var/lib/containerd
    
    # remove the containerd.service file
    rm /usr/lib/systemd/system/containerd.service

    The uninstall that finally worked

  1. Uninstall Docker 23.0.4:
bash 
# 1. remove the packages
yum remove docker-ce docker-ce-cli containerd.io 

# 2. remove the files and directories
rm -rf /var/lib/docker 
rm /usr/lib/systemd/system/docker.service 
rm /usr/lib/systemd/system/docker.socket  
rm -rf /var/lib/containerd 
rm /usr/lib/systemd/system/containerd.service
[root@node1 ~]# yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
Loaded plugins: fastestmirror, langpacks
Repository base is listed more than once in the configuration
Repository updates is listed more than once in the configuration
Repository extras is listed more than once in the configuration
Repository centosplus is listed more than once in the configuration
adding repo from: https://download.docker.com/linux/centos/docker-ce.repo
grabbing file https://download.docker.com/linux/centos/docker-ce.repo to /etc/yum.repos.d/docker-ce.repo
repo saved to /etc/yum.repos.d/docker-ce.repo
[root@node1 ~]# yum install docker-ce-19.03.15 docker-ce-cli-19.03.15 containerd.io-1.4.7
Loaded plugins: fastestmirror, langpacks
Repository base is listed more than once in the configuration
Repository updates is listed more than once in the configuration
Repository extras is listed more than once in the configuration
Repository centosplus is listed more than once in the configuration
Loading mirror speeds from cached hostfile
 * base: mirrors.ustc.edu.cn
 * epel: mirrors.tuna.tsinghua.edu.cn
 * extras: mirrors.ustc.edu.cn
 * updates: mirrors.ustc.edu.cn
No package containerd.io-1.4.7 available.
Resolving Dependencies
--> Running transaction check
---> Package docker-ce.x86_64 3:19.03.15-3.el7 will be installed
--> Processing Dependency: containerd.io >= 1.2.2-3 for package: 3:docker-ce-19.03.15-3.el7.x86_64
---> Package docker-ce-cli.x86_64 1:19.03.15-3.el7 will be installed
--> Running transaction check
---> Package containerd.io.x86_64 0:1.6.20-3.1.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==============================================================================================================================
 Package                       Arch                   Version                          Repository                        Size
==============================================================================================================================
Installing:
 docker-ce                     x86_64                 3:19.03.15-3.el7                 docker-ce-stable                  24 M
 docker-ce-cli                 x86_64                 1:19.03.15-3.el7                 docker-ce-stable                  38 M
Installing for dependencies:
 containerd.io                 x86_64                 1.6.20-3.1.el7                   docker-ce-stable                  34 M

Transaction Summary
==============================================================================================================================
Install  2 Packages (+1 Dependent package)

Total download size: 96 M
Installed size: 389 M
Is this ok [y/d/N]: y
Downloading packages:
(1/3): docker-ce-19.03.15-3.el7.x86_64.rpm                                                             |  24 MB  00:00:02     
(2/3): docker-ce-cli-19.03.15-3.el7.x86_64.rpm                                                         |  38 MB  00:00:02     
(3/3): containerd.io-1.6.20-3.1.el7.x86_64.rpm                                                         |  34 MB  00:00:06     
------------------------------------------------------------------------------------------------------------------------------
Total                                                                                          16 MB/s |  96 MB  00:00:06
Running transaction check
Running transaction test

Transaction check error:
  file /usr/libexec/docker/cli-plugins/docker-buildx from install of docker-ce-cli-1:19.03.15-3.el7.x86_64 conflicts with file from package docker-buildx-plugin-0:0.10.4-1.el7.x86_64

Error Summary
-------------

[root@node1 ~]# docker version
-bash: /usr/bin/docker: No such file or directory
[root@node1 ~]# docker version
-bash: /usr/bin/docker: No such file or directory
[root@node1 ~]# yum install docker-ce-19.03.15 docker-ce-cli-19.03.15 containerd.io-1.4.7
Loaded plugins: fastestmirror, langpacks
Repository base is listed more than once in the configuration
Repository updates is listed more than once in the configuration
Repository extras is listed more than once in the configuration
Repository centosplus is listed more than once in the configuration
Loading mirror speeds from cached hostfile
 * base: mirrors.aliyun.com
 * epel: mirrors.bfsu.edu.cn
 * extras: mirrors.aliyun.com
 * updates: mirrors.aliyun.com
No package containerd.io-1.4.7 available.
Resolving Dependencies
--> Running transaction check
---> Package docker-ce.x86_64 3:19.03.15-3.el7 will be installed
--> Processing Dependency: containerd.io >= 1.2.2-3 for package: 3:docker-ce-19.03.15-3.el7.x86_64
---> Package docker-ce-cli.x86_64 1:19.03.15-3.el7 will be installed
--> Running transaction check
---> Package containerd.io.x86_64 0:1.6.20-3.1.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==============================================================================================================================
 Package                       Arch                   Version                          Repository                        Size
==============================================================================================================================
Installing:
 docker-ce                     x86_64                 3:19.03.15-3.el7                 docker-ce-stable                  24 M
 docker-ce-cli                 x86_64                 1:19.03.15-3.el7                 docker-ce-stable                  38 M
Installing for dependencies:
 containerd.io                 x86_64                 1.6.20-3.1.el7                   docker-ce-stable                  34 M

Transaction Summary
==============================================================================================================================
Install  2 Packages (+1 Dependent package)

Total size: 96 M
Installed size: 389 M
Is this ok [y/d/N]: y
Downloading packages:
Running transaction check
Running transaction test

Transaction check error:
  file /usr/libexec/docker/cli-plugins/docker-buildx from install of docker-ce-cli-1:19.03.15-3.el7.x86_64 conflicts with file from package docker-buildx-plugin-0:0.10.4-1.el7.x86_64

Error Summary
-------------

From the output we can see that installing Docker 19.03 ran into an error. The key error is:

Transaction check error:
 file /usr/libexec/docker/cli-plugins/docker-buildx from install of docker-ce-cli-1:19.03.15-3.el7.x86_64 conflicts with file from package docker-buildx-plugin-0:0.10.4-1.el7.x86_64

This means the docker-buildx-plugin package is already installed on node1 and its files conflict with those in docker-ce-cli. We therefore need to remove docker-buildx-plugin first and then reinstall Docker. The steps are as follows:

  1. Uninstall docker-buildx-plugin:
bash
yum remove docker-buildx-plugin
  2. Clean up the files and directories left behind by the old Docker version:
bash 
rm -rf /var/lib/docker 
rm /usr/lib/systemd/system/docker.service  
rm /usr/lib/systemd/system/docker.socket
rm -rf /var/lib/containerd
rm /usr/lib/systemd/system/containerd.service 
  3. Install Docker 19.03:
bash
yum install -y docker-ce-19.03.15 docker-ce-cli-19.03.15 containerd.io
  4. Start Docker and enable it at boot:
bash
systemctl start docker 
systemctl enable docker
  5. Verify that Docker works:
bash 
docker version
docker run hello-world
  6. Re-run kubeadm init to initialize the Kubernetes control plane:
bash
kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16

If the output shows "Kubernetes control plane has initialized successfully", the control-plane initialization succeeded.

Before re-running kubeadm init, a few preparation steps are needed:

  1. Stop the kubelet service (kubeadm itself is a one-shot command, not a systemd service, so only kubelet needs stopping):
bash
systemctl stop kubelet
  2. Clean up the Kubernetes components and data already on the node:
bash
kubeadm reset 
rm -rf /etc/cni/net.d
rm -rf /var/lib/etcd 
rm -rf /var/lib/kubelet 
rm -rf /var/lib/dockershim
  3. Restart the Docker service:
bash
systemctl restart docker
  4. Restart the kubelet service:
bash
systemctl restart kubelet
  5. Re-initialize the Kubernetes control-plane node with kubeadm init, using the following component versions:

    - kubelet & kubeadm: 1.22.4
    - kubectl: 1.22.4
    - kubernetes-cni: 0.8.7
    - chrony: 4.0+
    - coredns: 1.8.0
    - etcd: 3.4.13-0
    - pause: 3.4.1
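
If these versions still need to be installed on the node, they can be pinned explicitly with yum (a sketch; it assumes a Kubernetes yum repository has already been configured on the machine, which is not shown here):

bash
# install the pinned kubeadm/kubelet/kubectl versions
yum install -y kubelet-1.22.4 kubeadm-1.22.4 kubectl-1.22.4 --disableexcludes=kubernetes

# start kubelet now and at boot; it will crash-loop until kubeadm init runs, which is expected
systemctl enable --now kubelet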

bash
kubeadm init \
  --apiserver-advertise-address=192.168.1.127 \
  --image-repository registry.aliyuncs.com/google_containers \
  --kubernetes-version v1.22.4 \
  --service-cidr=10.96.0.0/12 \
  --pod-network-cidr=10.244.0.0/16

Where:

--apiserver-advertise-address: the address the API server advertises; here it is node1's internal IP.

--image-repository: the image registry to pull control-plane images from; here the Alibaba Cloud mirror.

--kubernetes-version: the Kubernetes version to install; here v1.22.4.

--service-cidr: the Service virtual network range; here 10.96.0.0/12.

--pod-network-cidr: the Pod virtual network range; here 10.244.0.0/16.

6. Following the kubeadm init output, set up the kubeconfig file and install the kube-proxy, DNS and other add-on components:

bash
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
  7. Verify that the control-plane node initialized successfully:
bash
kubectl get nodes
kubectl get pods --all-namespaces

If all nodes are Ready and all Pods are Running, the Kubernetes cluster was initialized successfully.

In fact, for a first taste of Kubernetes, the default kubeadm init parameters are enough to bring up a single-node cluster. The simplest possible initialization command is:

bash
kubeadm init

This command will:

  1. Deploy the Kubernetes control plane using the latest stable version known to kubeadm
  2. Generate self-signed certificates for the cluster
  3. Install the default network plugin (if one is bundled)
  4. Deploy core add-ons such as CoreDNS and kube-proxy
  5. Generate the admin.conf kubeconfig file locally
  6. Print cluster information and instructions for the next steps

So if you just want to try out and learn Kubernetes quickly, the simple command above is enough. Once you understand the individual components better, you can decide whether to set parameters explicitly:
    - version: --kubernetes-version
    - CIDRs: --service-cidr, --pod-network-cidr
    - image registry: --image-repository
    - advertise address: --apiserver-advertise-address
    - and so on
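
To see what those defaults actually are before committing to them, kubeadm can print its default init configuration (a quick, read-only check):

bash
# print the InitConfiguration/ClusterConfiguration defaults kubeadm would use
kubeadm config print init-defaults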
[root@node1 ~]# kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16
I0427 09:30:32.251099    3800 version.go:255] remote version is much newer: v1.27.1; falling back to: stable-1.22
[init] Using Kubernetes version: v1.22.17
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
To see the stack trace of this error execute with --v=5 or higher

From the init output we can see that the preflight checks found swap enabled on node1, which makes initialization fail. Kubernetes does not support running with swap on, so it must be disabled before initializing the node. To disable swap:

1. Temporarily disable swap:

bash
swapoff -a  # temporarily disable all swap partitions

2. Permanently disable swap by editing /etc/fstab and commenting out the swap line:

bash
vim /etc/fstab

# /dev/mapper/centos-swap swap                    swap    defaults        0 0

- Reboot the node for the change to take effect:

bash
reboot

3. Before initializing the Kubernetes node, confirm that swap is disabled:

bash 
free -m  # check whether any swap is still active
              total        used        free      shared  buff/cache   available
Mem:           1878         164         165           0        1548        1641
Swap:             0           0           0           

Swap: 0 0 0 in the output means swap is disabled.

In /etc/fstab, comment lines start with #, so to comment out the swap entry the line should become:

# /etc/fstab
# Created by anaconda on Mon Sep  4 16:13:12 2017  
#  
# Accessible filesystems, by reference, are maintained under '/dev/disk'  
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=d3b7747c-23f4-47f5-9b90-eeb6faa412ea /                       ext4    defaults        1 1  
#UUID=959ab3ee-5bea-44c2-a908-c577a79e5bb9 swap                    swap    defaults        0 0

Prefixing the swap line with # is all that is needed. After saving /etc/fstab, run:

bash
swapoff -a  # temporarily disable swap
reboot      # reboot so the change takes effect

After the reboot, check with free -m; if swap shows 0, it has been permanently disabled.
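
If you would rather not edit the file by hand, a one-liner can comment out every active swap entry (a sketch; check /etc/fstab afterwards to make sure only the intended line was touched):

bash
swapoff -a
# comment out any non-comment fstab line that has a "swap" field
sed -ri 's/^([^#].*\sswap\s.*)$/#\1/' /etc/fstab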

4. Re-run kubeadm init to initialize the control-plane node:

bash
kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16

If the initialization no longer reports any swap-related error, that check has passed; continue following the kubeadm init output to finish building the cluster.

[root@node1 ~]# kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16
I0427 09:52:09.941900    8401 version.go:255] remote version is much newer: v1.27.1; falling back to: stable-1.22
[init] Using Kubernetes version: v1.22.17
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'

After disabling swap, running kubeadm init again pulled the required images and passed the preflight checks, so node1 can now be initialized properly. According to the output, kubeadm is downloading the images needed to bootstrap the cluster, which can take a few minutes depending on network speed. After the download, kubeadm automatically continues with certificate generation, control-plane installation, and the kube-proxy and DNS add-ons, printing progress as it goes. Based on that output you then need to do the following:

  1. kubeadm generates a kubeconfig file for local access to the cluster; move it into place and fix its ownership:
bash
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
  2. Apply a Pod network add-on; a common choice with kubeadm is flannel, which can be installed with:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
  3. Check that the node and the Pod network are ready:
bash
kubectl get nodes 
kubectl get pods --all-namespaces

If all nodes become Ready and all Pods are Running, the Kubernetes cluster is installed successfully.

  4. The components installed by default include the CoreDNS DNS add-on, kube-proxy, and so on; you can inspect each resource with kubectl.
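
Since this deployment uses three machines, the other two nodes join the cluster with the kubeadm join command printed at the end of kubeadm init; if that output was lost, the command can be regenerated on the control-plane node (a small sketch):

bash
# prints a ready-to-run "kubeadm join <ip>:6443 --token ... --discovery-token-ca-cert-hash ..." command
kubeadm token create --print-join-command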
[root@node1 ~]# kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16
I0427 09:52:09.941900    8401 version.go:255] remote version is much newer: v1.27.1; falling back to: stable-1.22
[init] Using Kubernetes version: v1.22.17
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'

error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.22.17: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers), error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.22.17: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-scheduler:v1.22.17: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers), error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-proxy:v1.22.17: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/pause:3.5: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/etcd:3.5.0-0: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/coredns/coredns:v1.8.4: output: Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
To see the stack trace of this error execute with --v=5 or higher
[root@node1 ~]# 

From the output we can see that pulling the images required for initialization failed, so the preflight check aborted. By default kubeadm pulls these images from k8s.gcr.io, the Google Container Registry, which cannot be reached from this network. There are two ways around it: 1. Wait until the network can reach k8s.gcr.io again and re-run the init command. 2. Use the --image-repository flag to pull from another registry, for example the Alibaba Cloud mirror, instead of gcr.io. The init command then becomes:

bash
kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16  --image-repository registry.cn-hangzhou.aliyuncs.com/google_containers

This pulls the required images from the Alibaba Cloud mirror registry instead.
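
The images can also be pre-pulled before running init, which surfaces registry problems early and makes the init itself faster (a sketch using the same mirror and the version reported in the log):

bash
kubeadm config images pull \
  --image-repository registry.cn-hangzhou.aliyuncs.com/google_containers \
  --kubernetes-version v1.22.17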

[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.

        Unfortunately, an error has occurred:
                timed out waiting for the condition

        This error is likely caused by:
                - The kubelet is not running
                - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

        If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
                - 'systemctl status kubelet'
                - 'journalctl -xeu kubelet'

        Additionally, a control plane component may have crashed or exited when started by the container runtime.
        To troubleshoot, list all containers using your preferred container runtimes CLI.

        Here is one example how you may list all Kubernetes containers running in docker:
                - 'docker ps -a | grep kube | grep -v pause'
                Once you have found the failing container, you can inspect its logs with:
                - 'docker logs CONTAINERID'

error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher
[root@node1 ~]# 

From the output we can see that initialization timed out while waiting for the kubelet to bring up the control-plane components. The kubelet process is not starting or not running properly: kubeadm polls the kubelet health-check API to confirm it is alive, the calls kept failing, and init eventually gave up. To fix this we first have to get the kubelet running and answering requests. Several things are worth checking:

  1. Check whether the kubelet process is running with systemctl status kubelet; if it is not, start it with systemctl start kubelet.
  2. Check the kubelet log for errors after it starts with journalctl -xeu kubelet, and troubleshoot based on what it reports.
  3. Check the kubelet configuration file /var/lib/kubelet/config.yaml and make sure nothing in it is misconfigured.
  4. Make sure the machine can reach the required image registry; the kubelet has to pull images to do its work.
  5. Network problems can also keep the kubelet from responding; check the node's network configuration.
  6. Insufficient resources can also affect the kubelet; check the node's resource usage.
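
One more check worth adding here, as an assumption rather than something shown in the log: kubeadm 1.22 configures the kubelet for the systemd cgroup driver by default, and if Docker is still using cgroupfs the kubelet exits right after starting. A sketch of how to compare the two and, if needed, switch Docker to systemd (it assumes /etc/docker/daemon.json does not already contain other settings):

bash
# show Docker's current cgroup driver
docker info --format '{{.CgroupDriver}}'

# show what the kubelet is configured to use (written by kubeadm)
grep cgroupDriver /var/lib/kubelet/config.yaml

# if they differ, switch Docker to the systemd driver and restart both services
cat <<'EOF' > /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
systemctl restart docker kubelet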
[root@node1 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: activating (auto-restart) (Result: exit-code) since Thu 2023-04-27 10:11:36 CST; 5s ago
     Docs: https://kubernetes.io/docs/
  Process: 14527 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=1/FAILURE)
 Main PID: 14527 (code=exited, status=1/FAILURE)

Apr 27 10:11:36 node1 systemd[1]: kubelet.service: main process exited, code=exited, status=1/FAILURE
Apr 27 10:11:36 node1 systemd[1]: Unit kubelet.service entered failed state.
Apr 27 10:11:36 node1 systemd[1]: kubelet.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

From the kubelet service status we can see that the kubelet process exits almost immediately after starting and the unit goes into the failed state, so the kubelet is hitting an error at startup. To find it, check the kubelet log with:

bash
journalctl -xeu kubelet 

This prints the kubelet's detailed log, including any startup errors; use it to find the cause.
Common causes are:

  1. Image pull failures: network problems or registry authentication failures
  2. Configuration errors: wrong kubelet startup flags or a bad configuration file
  3. Insufficient resources: not enough memory, CPU or disk space for the kubelet to start
  4. Dependency failures: problems in services the kubelet depends on, such as docker
  5. Permission problems: the kubelet lacking the file permissions it needs

    [root@node1 ~]# journalctl -xeu kubelet
    Apr 27 10:13:49 node1 kubelet[15854]: Flag --network-plugin has been deprecated, will be 
    Apr 27 10:13:49 node1 kubelet[15854]: Flag --network-plugin has been deprecated, will be 
    Apr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.599506   15854 server.go:440] "KubelApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.599861   15854 server.go:868] "ClienApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.602320   15854 certificate_store.go:Apr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.603585   15854 dynamic_cafile_contenApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.681532   15854 server.go:687] "--cgrApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.681835   15854 container_manager_linApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.681944   15854 container_manager_linApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.681969   15854 topology_manager.go:1Apr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.681982   15854 container_manager_linApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.682027   15854 state_mem.go:36] "IniApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.682099   15854 kubelet.go:314] "UsinApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.682135   15854 client.go:78] "ConnecApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.682150   15854 client.go:97] "Start 
    Apr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.690804   15854 docker_service.go:566Apr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.690831   15854 docker_service.go:242Apr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.694427   15854 cni.go:204] "Error vaApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.694465   15854 cni.go:239] "Unable tApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.702505   15854 cni.go:204] "Error vaApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.702529   15854 cni.go:239] "Unable tApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.702591   15854 docker_service.go:257Apr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.706184   15854 cni.go:204] "Error vaApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.706210   15854 cni.go:239] "Unable tApr 27 10:13:49 node1 kubelet[15854]: I0427 10:13:49.712293   15854 docker_service.go:264Apr 27 10:13:49 node1 kubelet[15854]: E0427 10:13:49.712331   15854 server.go:294] "FaileApr 27 10:13:49 node1 systemd[1]: kubelet.service: main process exited, code=exited, statApr 27 10:13:49 node1 systemd[1]: Unit kubelet.service entered failed state.
    Apr 27 10:13:49 node1 systemd[1]: kubelet.service failed.
    lines 1345-1373/1373 (END)

    From the log we can see that the kubelet fails to start because of a CNI network plugin error. There are two relevant error lines:

    I0427 10:13:49.694427   15854 cni.go:204] "Error validating network plugin kernel parameters: "
    I0427 10:13:49.694465   15854 cni.go:239] "Unable to update cni config - Error validating network plugin kernel parameters: " 

    This means the CNI plugin's configuration or binaries are broken, so the kubelet cannot set up container networking and therefore fails to start.

    Solution:

    1. Check that the CNI plugin configuration (bridge by default) is correct. The configuration files normally live under /etc/cni/net.d/.

      To verify the CNI configuration, proceed as follows:

      1. Change into the CNI configuration directory /etc/cni/net.d/:
      bash
      cd /etc/cni/net.d/
      2. Check whether a CNI network configuration file exists (for example 10-bridge.conf):
      bash 
      ls
      # a file such as 10-bridge.conf should be listed
      3. Check that the content of 10-bridge.conf is correct; it should look like:
      json
      {
        "cniVersion": "0.4.0",
        "name": "bridge",
        "type": "bridge",
        "bridge": "cnio0",
        "isGateway": true,
        "ipMasq": true,
        "ipam": {
            "type": "host-local",
            "subnet": "10.244.0.0/16",
            "routes": [
                { "dst": "0.0.0.0/0"  }
            ]
        }
      }

      vim 10-bridge.conf :

      {
        "name": "k8s-pod-network",
        "cniVersion": "0.3.0",
        "plugins": [
            {
                "type": "calico",
                "etcd_endpoints": "https://127.0.0.1:2379",
                "log_level": "info",
                "datastore_type": "kubernetes",
                "nodename": "node1",
                "ipam": {
                    "type": "calico-ipam"
                },
                "policy": {
                    "type": "k8s"
               },
                "kubernetes": {
                    "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
                }
            },
            {
                "type": "portmap",
                "snat": true,
                "capabilities": {"portMappings": true}
            }
        ]
      }

      Judging from this configuration file, Calico is being used as the CNI plugin.
      That means the bridge plugin configuration has been replaced, and the kubelet startup failure is probably related to Calico. What needs to be done:

      1. Confirm that all Calico components are installed correctly and running, including calicoctl, calico-node and calico-kube-controllers. Check against the Calico installation documentation.

        Steps to confirm the Calico components are installed correctly:

        First, check whether the calicoctl CLI tool is installed on the server:

        bash
        which calicoctl

        If a path is returned, calicoctl is installed, for example:

        /usr/local/bin/calicoctl
        2. Install the calicoctl CLI tool:
        bash
        curl -O -L  https://github.com/projectcalico/calicoctl/releases/download/v3.19.2/calicoctl
        sudo install calicoctl /usr/local/bin

        The USTC mirror also provides Calico downloads, so calicoctl can be fetched from there instead:

        bash
        curl -O -L  https://mirrors.ustc.edu.cn/calico/v3.19.2/calicoctl
        chmod +x calicoctl
        sudo mv calicoctl /usr/local/bin
        3. Configure calicoctl to access the Kubernetes API:
        bash
        mkdir -p /etc/calico 
        cat <<EOF > /etc/calico/calicoctl.cfg
        apiVersion: projectcalico.org/v3
        kind: CalicoAPIConfig
        metadata:
        spec:
          etcdEndpoints: "https://127.0.0.1:2379"
          kubeconfig: "/etc/cni/net.d/calico-kubeconfig" 
        EOF
        4. Check that the calico-node service is running:
        bash 
        systemctl status calico-node
        # calico-node.service should be active (running)
        5. Check that the calico-kube-controllers deployment is healthy:
        bash
        kubectl -n kube-system get deployment calico-kube-controllers
        # the deployment should exist and have available replicas 
        6. Use calicoctl node status to check that the Calico node reports ready:
        bash
        calicoctl node status
        # the node status should be ready  
        7. Check that the NetworkPolicy CRD is installed:
        bash
        kubectl get crds networkpolicies.crd.projectcalico.org
        # the NetworkPolicy CRD should be returned
        8. Check that all Calico Pods are running:
        bash
        kubectl -n kube-system get pods -l k8s-app=calico-node
        kubectl -n kube-system get pods -l k8s-app=calico-kube-controllers 
        # all Pods should be in the Running state
      2. Check that the Calico CNI configuration shown above is correct. Key items:

        - etcd_endpoints: must point at a reachable etcd
        - kubeconfig: the referenced file must exist and be valid
        - nodename: must match this machine's node name
        - the remaining parameters can stay at their defaults or be set as needed

      3. Make sure the firewall is not blocking the ports Calico needs; Calico uses quite a few ports and a firewall can easily break it.

      4. Restart the docker, kubelet and calico-node services so they load the corrected Calico CNI configuration:
      bash
      systemctl restart docker kubelet calico-node
      5. Re-initialize Kubernetes and check that the Calico Pod network comes up:
      bash
      kubeadm init 
      kubectl get pods -n kube-system 
      # the calico-node Pods should be running normally
      6. Check that NetworkPolicy rules take effect on the node, confirming Calico's network-policy feature works.
      7. Confirm the configuration file permissions are 0644:
      bash
      stat 10-bridge.conf
      File: 10-bridge.conf
      Size: 255             Blocks: 8          IO Block: 4096   regular file
      Device: 801h/2049d      Inode: 393377      Links: 1    
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      8. Check that the cnio0 bridge exists:
      bash
      ip addr show cnio0
      # the cnio0 bridge interface should be shown
      9. Optional: if a third-party CNI plugin is in use, check its configuration and deployment against its own documentation. Once the configuration is correct and the CNI plugin works, restart the kubelet service and re-initialize Kubernetes; that should clear the kubelet startup failure caused by the CNI error.
    2. Confirm the CNI plugin binaries are executable. They normally live under /opt/cni/bin/.
    3. If a third-party CNI plugin is used, confirm it is installed correctly and usable.
    4. Restart the docker service; the CNI network setup depends on docker, and a restart ensures docker is in a clean state.
    5. Restart the kubelet service; once the error is fixed, restarting the kubelet makes it load the corrected CNI configuration.
    bash
    systemctl restart docker 
    systemctl daemon-reload
    systemctl restart kubelet
    6. Then initialize the Kubernetes control plane again:
    bash 
    kubeadm init

    If the initialization succeeds, the CNI network error is resolved and both the kubelet and Kubernetes are working again.

    [root@node1 net.d]# kubeadm init \
    >   --apiserver-advertise-address=192.168.1.127 \
    >   --image-repository registry.aliyuncs.com/google_containers \
    >   --kubernetes-version v1.22.4 \
    >   --service-cidr=10.96.0.0/12 \
    >   --pod-network-cidr=10.244.0.0/16
    [init] Using Kubernetes version: v1.22.4
    [preflight] Running pre-flight checks
    error execution phase preflight: [preflight] Some fatal errors occurred:
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
    [preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
    To see the stack trace of this error execute with --v=5 or higher

    This happens because static Pod manifests from a previous attempt already exist on the node, so the kubeadm init preflight check fails. Solutions:

    1. Back up and remove the existing static Pod manifests:

    bash 
    mkdir -p /etc/kubernetes/manifests.bak
    mv /etc/kubernetes/manifests/* /etc/kubernetes/manifests.bak/

    Then re-run kubeadm init.

    2. Skip these checks by passing --ignore-preflight-errors with the exact error names shown in the output:

    bash
    kubeadm init --ignore-preflight-errors=FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml,FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml,FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml,FileAvailable--etc-kubernetes-manifests-etcd.yaml

    3. If you want to keep the existing configuration, restore the backed-up manifests afterwards; note that kubeadm's --config flag expects a kubeadm configuration file, not a manifests directory, so it cannot simply be pointed at the backup.

    [root@node1 net.d]# kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16  --image-repository registry.cn-hangzhou.aliyuncs.com/google_containers
    I0427 11:20:24.406920   23590 version.go:255] remote version is much newer: v1.27.1; falling back to: stable-1.22
    [init] Using Kubernetes version: v1.22.17
    [preflight] Running pre-flight checks
    error execution phase preflight: [preflight] Some fatal errors occurred:
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
    [preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
    To see the stack trace of this error execute with --v=5 or higher

    The cause is the same as above: leftover static Pod manifests in /etc/kubernetes/manifests make the preflight check fail. Back them up and remove them (or pass the corresponding --ignore-preflight-errors names) and run kubeadm init again.

    kubeadm init \
     --apiserver-advertise-address=192.168.1.127 \
     --image-repository registry.aliyuncs.com/google_containers \
     --kubernetes-version v1.22.4 \
     --service-cidr=10.96.0.0/12 \
     --pod-network-cidr=10.244.0.0/16
    Running pre-flight checks
    error execution phase preflight: [preflight] Some fatal errors occurred:
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
    [preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
    To see the stack trace of this error execute with --v=5 or higher
    [root@node1 net.d]# kubeadm init --apiserver-advertise-address=192.168.1.127 --pod-network-cidr=10.244.0.0/16  --image-repository registry.cn-hangzhou.aliyuncs.com/google_containers
    I0427 11:20:24.406920   23590 version.go:255] remote version is much newer: v1.27.1; falling back to: stable-1.22
    [init] Using Kubernetes version: v1.22.17
    [preflight] Running pre-flight checks
    error execution phase preflight: [preflight] Some fatal errors occurred:
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
           [ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
    [preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
    To see the stack trace of this error execute with --v=5 or higher
    [root@node1 net.d]# rm -f /etc/kubernetes/manifests/*
    [root@node1 net.d]# kubeadm init \
    >   --apiserver-advertise-address=192.168.1.127 \
    >   --image-repository registry.aliyuncs.com/google_containers \
    >   --kubernetes-version v1.22.4 \
    >   --service-cidr=10.96.0.0/12 \
    >   --pod-network-cidr=10.244.0.0/16
    [init] Using Kubernetes version: v1.22.4
    [preflight] Running pre-flight checks
    [preflight] Pulling images required for setting up a Kubernetes cluster
    [preflight] This might take a minute or two, depending on the speed of your internet connection
    [preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
    
    [certs] Using certificateDir folder "/etc/kubernetes/pki"
    [certs] Using existing ca certificate authority
    [certs] Using existing apiserver certificate and key on disk
    [certs] Using existing apiserver-kubelet-client certificate and key on disk
    [certs] Using existing front-proxy-ca certificate authority
    [certs] Using existing front-proxy-client certificate and key on disk
    [certs] Using existing etcd/ca certificate authority
    [certs] Using existing etcd/server certificate and key on disk
    [certs] Using existing etcd/peer certificate and key on disk
    [certs] Using existing etcd/healthcheck-client certificate and key on disk
    [certs] Using existing apiserver-etcd-client certificate and key on disk
    [certs] Using the existing "sa" key
    [kubeconfig] Using kubeconfig folder "/etc/kubernetes"
    [kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/admin.conf"
    [kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/kubelet.conf"
    [kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/controller-manager.conf"
    [kubeconfig] Using existing kubeconfig file: "/etc/kubernetes/scheduler.conf"
    [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
    [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
    [kubelet-start] Starting the kubelet
    [control-plane] Using manifest folder "/etc/kubernetes/manifests"
    [control-plane] Creating static Pod manifest for "kube-apiserver"
    [control-plane] Creating static Pod manifest for "kube-controller-manager"
    [control-plane] Creating static Pod manifest for "kube-scheduler"
    [etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
    [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
    [kubelet-check] Initial timeout of 40s passed.
    [kubelet-check] It seems like the kubelet isn't running or healthy.
    [kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
    [kubelet-check] It seems like the kubelet isn't running or healthy.
    [kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
    [kubelet-check] It seems like the kubelet isn't running or healthy.
    [kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
    [kubelet-check] It seems like the kubelet isn't running or healthy.
    [kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
    [kubelet-check] It seems like the kubelet isn't running or healthy.
    [kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
    
           Unfortunately, an error has occurred:
                   timed out waiting for the condition
    
           This error is likely caused by:
                   - The kubelet is not running
                   - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
    
           If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
                   - 'systemctl status kubelet'
                   - 'journalctl -xeu kubelet'
    
           Additionally, a control plane component may have crashed or exited when started by the container runtime.
           To troubleshoot, list all containers using your preferred container runtimes CLI.
           Here is one example how you may list all Kubernetes containers running in docker:
                   - 'docker ps -a | grep kube | grep -v pause'
                   Once you have found the failing container, you can inspect its logs with:
                   - 'docker logs CONTAINERID'
    
    error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
    To see the stack trace of this error execute with --v=5 or higher

    This error means the kubelet service did not start properly, so the control-plane components could not run as static Pods. Solutions:

    1. Check whether the kubelet service is running:
    bash
    systemctl status kubelet

    If it is not running, start it:

    bash
    systemctl start kubelet
    2. Check the kubelet log for error messages:
    bash
    journalctl -xeu kubelet 

    and fix whatever the log points to.

    3. Check the docker service and list the Kubernetes-related containers:

    bash 
    docker ps -a | grep kube | grep -v pause

    Find the failing container and inspect its logs:

    bash
    docker logs CONTAINERID

    and fix the cause reported there.

    4. If the kubelet and the containers look fine but initialization still fails, re-run it with:

    bash
    rm -rf /etc/kubernetes/pki 
    kubeadm init ...

    This removes the Kubernetes certificates and re-initializes the cluster. Kubernetes depends on its underlying services, the kubelet and docker; only when they run properly can the control-plane components and workloads work.
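
    A somewhat safer way to get back to a clean slate than deleting directories piecemeal is kubeadm's own reset command (a sketch; kubeadm reset prints what it removes and what it leaves behind):

    bash
    # tear down whatever the previous kubeadm init left behind
    kubeadm reset -f   # removes the static Pod manifests, certificates and local etcd data
    # kubeadm reset does NOT clean CNI configuration or iptables rules; do that separately if needed
    rm -rf /etc/cni/net.d
    systemctl restart docker kubelet

    # then run kubeadm init again with the desired flags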

  6. According to the log, the kubelet failed to start because the CNI network plugin is not installed correctly. To fix it:

    1. Remove the current CNI configuration:
    bash
    rm -f /etc/cni/net.d/*
    2. Reinstall the CNI plugin, using Flannel as the example:
    bash 
    curl -sSL "https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml" | kubectl apply -f -
    3. Restart the kubelet service:
    bash
    systemctl daemon-reload
    systemctl restart kubelet
    4. Check that the kubelet and the CNI plugin Pods are running:
    bash
    kubectl get pods -n kube-system 
    5. If there are still problems, check the kubelet and CNI plugin logs again and keep troubleshooting.

      If piping through kubectl is not possible, the manifest can also be fetched directly on the node:

      1. Download the Flannel CNI manifest kube-flannel.yml on the node:
      bash
      wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
      2. Install the Flannel CNI plugin:
      bash
      kubectl apply -f kube-flannel.yml
      3. Check that the Flannel Pods are running:
      bash
      kubectl get pods -n kube-system -l app=flannel 
      4. Restart the kubelet service:
      bash 
      systemctl restart kubelet
[root@node1 net.d]# wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
--2023-04-27 12:15:25--  https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 0.0.0.0, ::
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|0.0.0.0|:443... failed: Connection refused.
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|::|:443... failed: Connection refused.

In a browser, the kube-flannel.yml file can be downloaded directly from this URL: https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

Steps:

  1. Open the URL in a browser.
  2. The page shows the YAML content directly; use the browser's "Save as" or download function.
  3. Save the file as kube-flannel.yml, with the file type set to "All Files".
  4. Copy the downloaded kube-flannel.yml to the node via a USB drive, a network share, or similar.
  5. On the node, install Flannel:
bash
kubectl apply -f kube-flannel.yml
  6. Check that the Flannel Pods are running:
bash 
kubectl get pods -n kube-system -l app=flannel
  7. Restart the kubelet service:
bash
systemctl restart kubelet
[root@node1 net.d]# kubectl apply -f kube-flannel.yml
The connection to the server localhost:8080 was refused - did you specify the right host or port?
[root@node1 net.d]# 

This error means the node cannot reach the Kubernetes API server, so the Flannel resources cannot be created. It usually happens because:

  1. the Kubernetes API server is not running; or
  2. the kubectl configuration is wrong, so kubectl cannot connect to the API server.

    How to fix it:

    1. Check whether the API server is reachable:
bash
kubectl get nodes

If no node information comes back, the API server is not running; re-initialize the Kubernetes cluster first.

  2. Check the kubectl configuration, normally in ~/.kube/config:
bash 
cat ~/.kube/config

Confirm that server and certificate-authority-data are correct; if not, regenerate the kubeconfig files:

bash 
rm -rf ~/.kube
kubeadm init phase kubeconfig all
  3. If the configuration is correct, check whether the cluster network add-on, for example Flannel, is healthy:
bash
kubectl get pods -n kube-system -l app=flannel 

If Flannel is not deployed or not running, install Flannel first, then restart the kubelet and try again. 4. Once kubectl can connect to and operate the cluster, install Flannel again:

bash
kubectl apply -f kube-flannel.yml
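
The localhost:8080 message usually just means kubectl has no kubeconfig to use. As a quick check (a sketch, assuming /etc/kubernetes/admin.conf was written by kubeadm), point kubectl at the admin kubeconfig explicitly:

bash
# one-off: use the admin kubeconfig in this shell
export KUBECONFIG=/etc/kubernetes/admin.conf
kubectl get nodes

# or set it up permanently for the current user
mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config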

Re-initialize the cluster:

kubeadm init \
  --apiserver-advertise-address=192.168.1.127 \
  --image-repository registry.aliyuncs.com/google_containers \
  --kubernetes-version v1.22.4 \
  --service-cidr=10.96.0.0/12 \
  --pod-network-cidr=10.244.0.0/16 \
  --ignore-preflight-errors=FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml,FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml,FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml,FileAvailable--etc-kubernetes-manifests-etcd.yaml
[root@node1 ~]# kubeadm init \
>   --apiserver-advertise-address=192.168.1.127 \
>   --image-repository registry.aliyuncs.com/google_containers \
>   --kubernetes-version v1.22.4 \
>   --service-cidr=10.96.0.0/12 \
>   --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.22.4
[preflight] Running pre-flight checks
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
        [ERROR FileAvailable--etc-kubernetes-manifests-etcd.yaml]: /etc/kubernetes/manifests/etcd.yaml already exists
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
To see the stack trace of this error execute with --v=5 or higher
[root@node1 ~]#   --ignore-preflight-errors=FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml,FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml,FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml,FileAvailable--etc-kubernetes-manifests-etcd.yaml
-bash: --ignore-preflight-errors=FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml,FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml,FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml,FileAvailable--etc-kubernetes-manifests-etcd.yaml: command not found
kubeadm init \
  --apiserver-advertise-address=192.168.1.127 \
  --image-repository registry.aliyuncs.com/google_containers \
  --kubernetes-version v1.22.4 \
  --service-cidr=10.96.0.0/12 \
  --pod-network-cidr=10.244.0.0/16 \
  --ignore-preflight-errors=FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml,FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml,FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml,FileAvailable--etc-kubernetes-manifests-etcd.yaml
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.

        Unfortunately, an error has occurred:
                timed out waiting for the condition

        This error is likely caused by:
                - The kubelet is not running
                - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

        If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
                - 'systemctl status kubelet'
                - 'journalctl -xeu kubelet'

        Additionally, a control plane component may have crashed or exited when started by the container runtime.
        To troubleshoot, list all containers using your preferred container runtimes CLI.
        Here is one example how you may list all Kubernetes containers running in docker:
                - 'docker ps -a | grep kube | grep -v pause'
                Once you have found the failing container, you can inspect its logs with:
                - 'docker logs CONTAINERID'

error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher

According to the error message, the cluster initialization failed because:

  1. the kubelet service is not running;
  2. the kubelet service is misconfigured and therefore unhealthy; or
  3. some control-plane components failed to start in the container runtime.

    Solutions:

    1. Check the kubelet service status and log:
bash
systemctl status kubelet 
journalctl -xeu kubelet

Fix the kubelet configuration or service problem reported in the log, then restart the kubelet.

  2. Check whether any Kubernetes control-plane container failed to start:
bash 
docker ps -a | grep kube | grep -v pause

Find the failing container and inspect its logs:

bash
docker logs CONTAINER_ID

Fix whatever the log reports, for example missing images or wrong flags.

  3. Re-initialize the Kubernetes cluster and check component status:
bash
kubeadm reset -f
kubeadm init ...

kubectl get componentstatuses
kubectl get pods -n kube-system 
  4. If some components still fail to start, dig further into the corresponding Pod logs.
Apr 27 12:45:44 node1 systemd[1]: kubelet.service: main process exited, code=exited, status=1/FAILURE
Apr 27 12:45:44 node1 systemd[1]: Unit kubelet.service entered failed state.
Apr 27 12:45:44 node1 systemd[1]: kubelet.service failed.
lines 1381-1409/1409 (END)

According to the kubelet service log, the kubelet process exited and the service went into the failed state. This is usually caused by:

  1. an error in the kubelet configuration file, so the kubelet fails to start;
  2. a failure to connect to docker or another container runtime, making the kubelet exit; or
  3. a broken kubelet static Pod configuration.

    How to fix it:

    1. Check whether the kubelet configuration file /var/lib/kubelet/kubeadm-flags.env is correct:

      cat /var/lib/kubelet/kubeadm-flags.env

      This file holds kubelet startup flags; the variables involved are:
      - KUBELET_KUBEADM_ARGS: kubelet flags set by kubeadm during init
      - KUBELET_SYSTEM_PODS_ARGS: system Pod flags
      - KUBELET_NETWORK_ARGS: network flags
      - KUBELET_AUTHZ_ARGS: authorization flags
      - KUBELET_CERTIFICATE_ARGS: certificate flags

      If any value is empty or obviously wrong, regenerate the kubelet configuration with:

      bash
      rm -rf /var/lib/kubelet/kubeadm-flags.env 
      # the kubelet-start phase rewrites kubeadm-flags.env and config.yaml and (re)starts the kubelet
      kubeadm init phase kubelet-start

      My configuration looked like this:

      [root@node1 ~]# cat /var/lib/kubelet/kubeadm-flags.env
      KUBELET_KUBEADM_ARGS="--network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.5"

      A fuller set of kubelet variables would look like the following (with kubeadm 1.22, kubeadm-flags.env itself normally contains only KUBELET_KUBEADM_ARGS; the other variables come from the kubelet's systemd drop-in in older setups):

      KUBELET_KUBEADM_ARGS="--network-plugin=cni --pod-infra-container-image=registry.aliyuncs.com/google_containers/pause:3.5" 
      KUBELET_SYSTEM_PODS_ARGS="--pod-manifest-path=/etc/kubernetes/manifests "
      KUBELET_NETWORK_ARGS="--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
      KUBELET_AUTHZ_ARGS="--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
      KUBELET_CERTIFICATE_ARGS="--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki\
                               --kubeconfig=/var/lib/kubelet/kubeconfig --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf"

      Parameter notes:
      - KUBELET_KUBEADM_ARGS: kubelet flags set by kubeadm, here the CNI network plugin and the pause image
      - KUBELET_SYSTEM_PODS_ARGS: system Pod flags, here the static Pod manifest path
      - KUBELET_NETWORK_ARGS: network flags, here the CNI configuration and binary directories
      - KUBELET_AUTHZ_ARGS: authorization flags, here Webhook mode and the CA certificate
      - KUBELET_CERTIFICATE_ARGS: certificate flags, here the certificate paths and kubeconfig locations

      After backing up the broken file, this content can replace /var/lib/kubelet/kubeadm-flags.env; then restart the kubelet so it takes effect. Once the kubelet runs normally again, the Kubernetes cluster can be re-initialized.

  2. Check the docker service status and start it if needed:
bash
systemctl status docker
systemctl start docker
  3. Check the kubelet configuration file /var/lib/kubelet/config.yaml; if it is wrong, back it up and remove it:
bash
mv /var/lib/kubelet/config.yaml /var/lib/kubelet/config.yaml.bak

Then restart the kubelet:

bash 
systemctl restart kubelet
  4. Once the kubelet has restarted, check its log output; if everything looks normal, re-initialize the Kubernetes cluster:
bash
kubeadm reset -f 
kubeadm init ...
journalctl -xeu kubelet
Apr 27 12:57:54 node1 kubelet[10814]: I0427 12:57:54.474288   10814 cni.go:239] "Unable tApr 27 12:57:54 node1 kubelet[10814]: I0427 12:57:54.474391   10814 docker_service.go:257Apr 27 12:57:54 node1 kubelet[10814]: I0427 12:57:54.474506   10814 cni.go:239] "Unable tApr 27 12:57:54 node1 kubelet[10814]: I0427 12:57:54.484922   10814 docker_service.go:264Apr 27 12:57:54 node1 systemd[1]: kubelet.service: main process exited, code=exited, statApr 27 12:57:54 node1 kubelet[10814]: E0427 12:57:54.484964   10814 server.go:294] "FaileApr 27 12:57:54 node1 systemd[1]: Unit kubelet.service entered failed state.
Apr 27 12:57:54 node1 systemd[1]: kubelet.service failed.
Apr 27 12:57:54 node1 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
-- Subject: Unit kubelet.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit kubelet.service has finished shutting down.
lines 1386-1414/1414 (END)

According to the kubelet service log, the kubelet process exited again and the service is back in the failed state.

The log points at two likely causes:

  1. the kubelet cannot set up the CNI network, either because the CNI configuration is wrong or because the CNI plugin binaries are missing;
  2. the kubelet cannot connect to the Docker service, which may mean Docker is not running.
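
Both possibilities can be checked quickly before digging further (a sketch; the paths are the defaults used elsewhere in this post):

bash
# is Docker up?
systemctl is-active docker

# what CNI configuration does the kubelet see?
ls -l /etc/cni/net.d/

# is the flannel CNI binary present?
ls -l /opt/cni/bin/ | grep -i flannel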
[root@node1 net.d]# pwd
/etc/cni/net.d
[root@node1 net.d]# cat kube-flannel.yml 
---
kind: Namespace
apiVersion: v1
metadata:
  name: kube-flannel
  labels:
    k8s-app: flannel
    pod-security.kubernetes.io/enforce: privileged
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  labels:
    k8s-app: flannel
  name: flannel
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - networking.k8s.io
  resources:
  - clustercidrs
  verbs:
  - list
  - watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  labels:
    k8s-app: flannel
  name: flannel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
subjects:
- kind: ServiceAccount
  name: flannel
  namespace: kube-flannel
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-app: flannel
  name: flannel
  namespace: kube-flannel
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-flannel
  labels:
    tier: node
    k8s-app: flannel
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-flannel-ds
  namespace: kube-flannel
  labels:
    tier: node
    app: flannel
    k8s-app: flannel
spec:
  selector:
    matchLabels:
      app: flannel
  template:
    metadata:
      labels:
        tier: node
        app: flannel
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      hostNetwork: true
      priorityClassName: system-node-critical
      tolerations:
      - operator: Exists
        effect: NoSchedule
      serviceAccountName: flannel
      initContainers:
      - name: install-cni-plugin
        image: docker.io/flannel/flannel-cni-plugin:v1.1.2
       #image: docker.io/rancher/mirrored-flannelcni-flannel-cni-plugin:v1.1.2
        command:
        - cp
        args:
        - -f
        - /flannel
        - /opt/cni/bin/flannel
        volumeMounts:
        - name: cni-plugin
          mountPath: /opt/cni/bin
      - name: install-cni
        image: docker.io/flannel/flannel:v0.21.4
       #image: docker.io/rancher/mirrored-flannelcni-flannel:v0.21.4
        command:
        - cp
        args:
        - -f
        - /etc/kube-flannel/cni-conf.json
        - /etc/cni/net.d/10-flannel.conflist
        volumeMounts:
        - name: cni
          mountPath: /etc/cni/net.d
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
      containers:
      - name: kube-flannel
        image: docker.io/flannel/flannel:v0.21.4
       #image: docker.io/rancher/mirrored-flannelcni-flannel:v0.21.4
        command:
        - /opt/bin/flanneld
        args:
        - --ip-masq
        - --kube-subnet-mgr
        resources:
          requests:
            cpu: "100m"
            memory: "50Mi"
        securityContext:
          privileged: false
          capabilities:
            add: ["NET_ADMIN", "NET_RAW"]
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: EVENT_QUEUE_DEPTH
          value: "5000"
        volumeMounts:
        - name: run
          mountPath: /run/flannel
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
        - name: xtables-lock
          mountPath: /run/xtables.lock
      volumes:
      - name: run
        hostPath:
          path: /run/flannel
      - name: cni-plugin
        hostPath:
          path: /opt/cni/bin
      - name: cni
        hostPath:
          path: /etc/cni/net.d
      - name: flannel-cfg
        configMap:
          name: kube-flannel-cfg
      - name: xtables-lock
        hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
[root@node1 net.d]# 

Checking the content of kube-flannel.yml, the main configuration is:

  1. the flannel backend type is vxlan;
  2. the network range is 10.244.0.0/16;
  3. in cni-conf.json, cbr0 is the Pod network interface and its type is flannel;
  4. the flannel CNI binary is expected at /opt/cni/bin/flannel;
  5. the flanneld flag --ip-masq enables SNAT.

    The configuration file itself looks correct.

  6. Next, check whether the flannel binary actually exists under /opt/cni/bin:
[root@node1 net.d]# cd /opt/cni/bin
[root@node1 bin]# ll
total 49832
-rwxr-xr-x 1 root root 2782728 Jan 19 05:09 bandwidth
-rwxr-xr-x 1 root root 3104192 Jan 19 05:09 bridge
drwxrwxr-x 2 root root    4096 Apr 26 18:44 cni-plugins-linux-amd64-v0.8.7
-rwxr-xr-x 1 root root 7607056 Jan 19 05:09 dhcp
-rwxr-xr-x 1 root root 2863024 Jan 19 05:09 dummy
-rwxr-xr-x 1 root root 3165352 Jan 19 05:09 firewall
-rwxr-xr-x 1 root root 2775224 Jan 19 05:09 host-device
-rwxr-xr-x 1 root root 2332792 Jan 19 05:09 host-local
-rwxr-xr-x 1 root root 2871792 Jan 19 05:09 ipvlan
-rwxr-xr-x 1 root root 2396976 Jan 19 05:09 loopback
-rwxr-xr-x 1 root root 2893624 Jan 19 05:09 macvlan
-rwxr-xr-x 1 root root 2689440 Jan 19 05:09 portmap
-rwxr-xr-x 1 root root 3000032 Jan 19 05:09 ptp
-rwxr-xr-x 1 root root 2542400 Jan 19 05:09 sbr
-rwxr-xr-x 1 root root 2074072 Jan 19 05:09 static
-rwxr-xr-x 1 root root 2456920 Jan 19 05:09 tuning
-rwxr-xr-x 1 root root 2867512 Jan 19 05:09 vlan
-rwxr-xr-x 1 root root 2566424 Jan 19 05:09 vrf
[root@node1 bin]# 

Looking at the /opt/cni/bin listing, the flannel CNI binary is indeed missing.

Download the flannel binary and place it under /opt/cni/bin:

bash
wget https://github.com/coreos/flannel/releases/download/v0.21.4/flannel-v0.21.4-linux-amd64.zip
unzip flannel-v0.21.4-linux-amd64.zip
cp flanneld /opt/cni/bin/flannel 

After downloading, extract the flanneld binary and place it under /opt/cni/bin. The release is available here:

https://github.com/flannel-io/flannel/releases/tag/v0.21.4/

flannel-v0.21.4-linux-amd64.tar.gz

  1. Extract flannel-v0.21.4-linux-amd64.tar.gz:
bash
tar -xvf flannel-v0.21.4-linux-amd64.tar.gz
  2. Enter the extracted directory and copy the flanneld binary into the CNI directory:
bash
cd flannel-v0.21.4-linux-amd64
cp flanneld /opt/cni/bin/flannel
  3. Reload systemd, restart the kubelet and check its log:
bash
systemctl daemon-reload 
systemctl restart kubelet
journalctl -u kubelet
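
Note that flanneld from the release tarball is the flannel daemon; the small CNI plugin binary the kubelet looks for at /opt/cni/bin/flannel is normally copied there by the install-cni-plugin init container in the kube-flannel.yml shown above. If that DaemonSet cannot run yet, one hedged workaround (a sketch, assuming the flannel-cni-plugin:v1.1.2 image referenced in the manifest can be pulled or was loaded offline) is to copy the binary straight out of the image:

bash
# create a stopped container from the CNI plugin image and copy /flannel out of it
docker pull docker.io/flannel/flannel-cni-plugin:v1.1.2   # skip if the image was loaded offline
docker create --name flannel-cni-tmp docker.io/flannel/flannel-cni-plugin:v1.1.2
docker cp flannel-cni-tmp:/flannel /opt/cni/bin/flannel
docker rm flannel-cni-tmp
chmod +x /opt/cni/bin/flannel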
[root@node1 flannel]# journalctl -u kubelet
-- Logs begin at Thu 2023-04-27 09:47:40 CST, end at Thu 2023-04-27 13:31:41 CST. --
Apr 27 09:47:50 node1 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Apr 27 09:47:50 node1 kubelet[1317]: E0427 09:47:50.544554    1317 server.go:206] "FailedApr 27 09:47:50 node1 systemd[1]: kubelet.service: main process exited, code=exited, statApr 27 09:47:50 node1 systemd[1]: Unit kubelet.service entered failed state.
Apr 27 09:47:50 node1 systemd[1]: kubelet.service failed.
Apr 27 09:48:00 node1 systemd[1]: kubelet.service holdoff time over, scheduling restart.
Apr 27 09:48:00 node1 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Apr 27 09:48:00 node1 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Apr 27 09:48:00 node1 kubelet[7218]: E0427 09:48:00.851842    7218 server.go:206] "FailedApr 27 09:48:00 node1 systemd[1]: kubelet.service: main process exited, code=exited, statApr 27 09:48:00 node1 systemd[1]: Unit kubelet.service entered failed state.
Apr 27 09:48:00 node1 systemd[1]: kubelet.service failed.
Apr 27 09:48:11 node1 systemd[1]: kubelet.service holdoff time over, scheduling restart.
Apr 27 09:48:11 node1 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Apr 27 09:48:11 node1 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Apr 27 09:48:11 node1 kubelet[8097]: E0427 09:48:11.093840    8097 server.go:206] "FailedApr 27 09:48:11 node1 systemd[1]: kubelet.service: main process exited, code=exited, statApr 27 09:48:11 node1 systemd[1]: Unit kubelet.service entered failed state.
Apr 27 09:48:11 node1 systemd[1]: kubelet.service failed.
Apr 27 09:48:21 node1 systemd[1]: kubelet.service holdoff time over, scheduling restart.
Apr 27 09:48:21 node1 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Apr 27 09:48:21 node1 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Apr 27 09:48:21 node1 kubelet[8111]: E0427 09:48:21.352944    8111 server.go:206] "FailedApr 27 09:48:21 node1 systemd[1]: kubelet.service: main process exited, code=exited, statApr 27 09:48:21 node1 systemd[1]: Unit kubelet.service entered failed state.
Apr 27 09:48:21 node1 systemd[1]: kubelet.service failed.
Apr 27 09:48:31 node1 systemd[1]: kubelet.service holdoff time over, scheduling restart.
Apr 27 09:48:31 node1 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
lines 1-29

According to the log, the kubelet fails to start because it cannot find the configuration file /var/lib/kubelet/config.yaml. The most likely cause is that the file was removed during the earlier cleanup; it is normally written by kubeadm init (the kubelet-start phase), so running kubeadm init again on this node regenerates it.

Install coredns:1.8.0, etcd:3.4.13-0 and pause:3.4.1.

Finally, the three machines are captured as system images and synced to the servers in the customer's LAN.
