Building a Kubeflow 1.3 Machine Learning Platform on Rancher Kubernetes 1.17.17

This guide assumes the machines have NVIDIA GPUs and that a recent driver is already installed.

Install Docker

The installation steps follow [1].

yum -y install yum-utils && 
yum-config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo && 
yum install -y https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.4.3-3.1.el7.x86_64.rpm && 
yum install docker-ce -y && 
systemctl --now enable docker

Install nvidia-docker2

The installation steps follow [2].

CentOS 7:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo && 
yum clean expire-cache && 
yum install -y nvidia-docker2 && 
systemctl restart docker

Ubuntu:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

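Verify the installation:

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

To avoid passing --gpus every time, make the NVIDIA runtime the default in the Docker configuration file /etc/docker/daemon.json (restart Docker afterwards so the change takes effect):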

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts":["native.cgroupdriver=systemd"],
    "default-runtime": "nvidia"
}
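
If /etc/docker/daemon.json already holds other settings, they should be kept when the NVIDIA runtime is added. The following is a minimal Python sketch of that merge, not part of the original procedure; the function name with_nvidia_default_runtime and the "log-driver" entry in the demo are hypothetical.

```python
import json

# Settings taken from the daemon.json shown above.
NVIDIA_RUNTIME = {"path": "nvidia-container-runtime", "runtimeArgs": []}
CGROUP_OPT = "native.cgroupdriver=systemd"


def with_nvidia_default_runtime(config):
    """Return a copy of a daemon.json dict with the NVIDIA runtime as default.

    Keys that are already present (for example a log driver) are preserved.
    """
    merged = dict(config)
    runtimes = dict(merged.get("runtimes", {}))
    runtimes["nvidia"] = dict(NVIDIA_RUNTIME)
    merged["runtimes"] = runtimes
    exec_opts = list(merged.get("exec-opts", []))
    if CGROUP_OPT not in exec_opts:
        exec_opts.append(CGROUP_OPT)
    merged["exec-opts"] = exec_opts
    merged["default-runtime"] = "nvidia"
    return merged


# Demo with a hypothetical pre-existing setting.
existing = {"log-driver": "json-file"}
print(json.dumps(with_nvidia_default_runtime(existing), indent=4))
```

To use it on a real host, load the existing file with json.load, pass the dict through the function, and write the result back before restarting Docker.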

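Install Kubernetes

Besides kubeadm, there are many other ways to install Kubernetes, such as microk8s and kind. The problem with microk8s is that all of its images are hosted on gcr.io (Google Container Registry), which cannot be pulled without a proxy that gets past the firewall, and it offers no option for pointing at a mirror registry, so it is rather painful to use. kind is Kubernetes IN Docker, a simulated pseudo-cluster; it is easy to deploy, but it is out of scope here.

Rancher is a better choice: it offers an easy-to-use UI and friendly interaction, and because it runs in Docker containers, wiping everything and starting over does not cause unpredictable surprises.

As for the Kubernetes version, Kubeflow 1.3 was tested on Kubernetes 1.17 [3], so we also use Kubernetes 1.17 to avoid unnecessary trouble.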

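Before going further, adjust a few system settings, for example:

# Disable SELinux enforcement (switch to permissive mode)

sudo setenforce 0

# Disable swap

sudo swapoff -a

Also, the clocks on all machines, including the time zone, must agree.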
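
Both commands above only last until the next reboot, so it can help to re-check the swap state before continuing. The sketch below is an illustration, not part of the original procedure: it parses text in the format of /proc/swaps (one header line followed by one line per active swap device). The function name active_swap_devices and the sample text are hypothetical.

```python
def active_swap_devices(proc_swaps_text):
    """Return the device names listed in /proc/swaps-style text.

    The first non-empty line is the column header; every following line
    describes one active swap area, with the device path in column one.
    """
    lines = [line for line in proc_swaps_text.splitlines() if line.strip()]
    return [line.split()[0] for line in lines[1:]]


# Hypothetical sample of what /proc/swaps looks like with swap still enabled.
sample = (
    "Filename\tType\tSize\tUsed\tPriority\n"
    "/dev/sda2 partition 8388604 0 -2\n"
)
print(active_swap_devices(sample))
```

On a node, pass the content of /proc/swaps to the function; an empty list means swap is off.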

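Terminology
The main terms used in this part are:

  • Rancher Server: manages and provisions Kubernetes clusters. You interact with downstream Kubernetes clusters through the Rancher Server UI.
  • RKE (Rancher Kubernetes Engine): a certified Kubernetes distribution with its own CLI tool for creating and managing Kubernetes clusters. When you create a cluster in the Rancher UI, Rancher calls RKE to provision the Rancher-launched Kubernetes cluster.
  • kubectl: the Kubernetes command-line tool.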

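Install Rancher

Rancher must expose its service over https, so the best solution is to register a proper (sub)domain and obtain a certificate from a trusted CA. If that process is too much trouble, you can generate a self-signed certificate instead (it causes a fair amount of trouble later on, but all of it can be worked around). [4]

Generate a self-signed certificate
This step follows [5].

One-step script for generating a self-signed SSL certificate: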

#!/bin/bash -e

help ()
{
    echo  ' ================================================================ '
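    echo  ' --ssl-domain: primary domain for the SSL certificate; defaults to www.rancher.local if omitted, and can be ignored if the service is accessed by IP;'
    echo  ' --ssl-trusted-ip: SSL certificates normally only trust requests made by domain name; if the server must also be reached by IP, add the IPs as certificate extensions, separated by commas;'
    echo  ' --ssl-trusted-domain: to allow access through additional domains, add them as extension domains (SSL_TRUSTED_DOMAIN), separated by commas;'
    echo  ' --ssl-size: SSL key size in bits, default 2048;'
    echo  ' --ssl-cn: country code (2-letter code), default CN;'
    echo  ' Usage example:'
    echo  ' ./create_self-signed-cert.sh --ssl-domain=www.test.com --ssl-trusted-domain=www.test2.com \ '
    echo  ' --ssl-trusted-ip=1.1.1.1,2.2.2.2,3.3.3.3 --ssl-size=2048 --ssl-date=3650'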
    echo  ' ================================================================'
}

case "$1" in
    -h|--help) help; exit;;
esac

if [[ $1 == '' ]];then
    help;
    exit;
fi

CMDOPTS="$*"
for OPTS in $CMDOPTS;
do
    key=$(echo ${OPTS} | awk -F"=" '{print $1}' )
    value=$(echo ${OPTS} | awk -F"=" '{print $2}' )
    case "$key" in
        --ssl-domain) SSL_DOMAIN=$value ;;
        --ssl-trusted-ip) SSL_TRUSTED_IP=$value ;;
        --ssl-trusted-domain) SSL_TRUSTED_DOMAIN=$value ;;
        --ssl-size) SSL_SIZE=$value ;;
        --ssl-date) SSL_DATE=$value ;;
        --ca-date) CA_DATE=$value ;;
        --ssl-cn) CN=$value ;;
    esac
done

# CA settings
CA_DATE=${CA_DATE:-3650}
CA_KEY=${CA_KEY:-cakey.pem}
CA_CERT=${CA_CERT:-cacerts.pem}
CA_DOMAIN=cattle-ca

# SSL settings
SSL_CONFIG=${SSL_CONFIG:-$PWD/openssl.cnf}
SSL_DOMAIN=${SSL_DOMAIN:-'www.rancher.local'}
SSL_DATE=${SSL_DATE:-3650}
SSL_SIZE=${SSL_SIZE:-2048}

## Country code (2-letter code), default CN
CN=${CN:-CN}

SSL_KEY=$SSL_DOMAIN.key
SSL_CSR=$SSL_DOMAIN.csr
SSL_CERT=$SSL_DOMAIN.crt

echo -e "\033[32m ---------------------------- \033[0m"
echo -e "\033[32m       | Generate SSL Cert |       \033[0m"
echo -e "\033[32m ---------------------------- \033[0m"

if [[ -e ./${CA_KEY} ]]; then
    echo -e "\033[32m ====> 1. Found an existing CA private key; backing up ${CA_KEY} as ${CA_KEY}-bak, then regenerating \033[0m"
    mv ${CA_KEY} "${CA_KEY}"-bak
    openssl genrsa -out ${CA_KEY} ${SSL_SIZE}
else
    echo -e "\033[32m ====> 1. Generating a new CA private key ${CA_KEY} \033[0m"
    openssl genrsa -out ${CA_KEY} ${SSL_SIZE}
fi

if [[ -e ./${CA_CERT} ]]; then
    echo -e "\033[32m ====> 2. Found an existing CA certificate; backing up ${CA_CERT} as ${CA_CERT}-bak, then regenerating \033[0m"
    mv ${CA_CERT} "${CA_CERT}"-bak
    openssl req -x509 -sha256 -new -nodes -key ${CA_KEY} -days ${CA_DATE} -out ${CA_CERT} -subj "/C=${CN}/CN=${CA_DOMAIN}"
else
    echo -e "\033[32m ====> 2. Generating a new CA certificate ${CA_CERT} \033[0m"
    openssl req -x509 -sha256 -new -nodes -key ${CA_KEY} -days ${CA_DATE} -out ${CA_CERT} -subj "/C=${CN}/CN=${CA_DOMAIN}"
fi

echo -e "\033[32m ====> 3. Generating the OpenSSL config file ${SSL_CONFIG} \033[0m"
cat > ${SSL_CONFIG} <<EOM
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
EOM

if [[ -n ${SSL_TRUSTED_IP} || -n ${SSL_TRUSTED_DOMAIN} ]]; then
    cat >> ${SSL_CONFIG} <<EOM
subjectAltName = @alt_names
[alt_names]
EOM
    IFS=","
    dns=(${SSL_TRUSTED_DOMAIN})
    dns+=(${SSL_DOMAIN})
    for i in "${!dns[@]}"; do
      echo DNS.$((i+1)) = ${dns[$i]} >> ${SSL_CONFIG}
    done

    if [[ -n ${SSL_TRUSTED_IP} ]]; then
        ip=(${SSL_TRUSTED_IP})
        for i in "${!ip[@]}"; do
          echo IP.$((i+1)) = ${ip[$i]} >> ${SSL_CONFIG}
        done
    fi
fi

echo -e "\033[32m ====> 4. Generating the service SSL key ${SSL_KEY} \033[0m"
openssl genrsa -out ${SSL_KEY} ${SSL_SIZE}

echo -e "\033[32m ====> 5. Generating the service SSL CSR ${SSL_CSR} \033[0m"
openssl req -sha256 -new -key ${SSL_KEY} -out ${SSL_CSR} -subj "/C=${CN}/CN=${SSL_DOMAIN}" -config ${SSL_CONFIG}

echo -e "\033[32m ====> 6. Generating the service SSL certificate ${SSL_CERT} \033[0m"
openssl x509 -sha256 -req -in ${SSL_CSR} -CA ${CA_CERT} \
    -CAkey ${CA_KEY} -CAcreateserial -out ${SSL_CERT} \
    -days ${SSL_DATE} -extensions v3_req \
    -extfile ${SSL_CONFIG}

echo -e "\033[32m ====> 7. Certificates created \033[0m"
echo
echo -e "\033[32m ====> 8. Printing the results in YAML format \033[0m"
echo "----------------------------------------------------------"
echo "ca_key: |"
cat $CA_KEY | sed 's/^/  /'
echo
echo "ca_cert: |"
cat $CA_CERT | sed 's/^/  /'
echo
echo "ssl_key: |"
cat $SSL_KEY | sed 's/^/  /'
echo
echo "ssl_csr: |"
cat $SSL_CSR | sed 's/^/  /'
echo
echo "ssl_cert: |"
cat $SSL_CERT | sed 's/^/  /'
echo

echo -e "\033[32m ====> 9. Appending the CA certificate to the cert file \033[0m"
cat ${CA_CERT} >> ${SSL_CERT}
echo "ssl_cert: |"
cat $SSL_CERT | sed 's/^/  /'
echo

echo -e "\033[32m ====> 10. Copying the service certificate to tls.key and tls.crt \033[0m"
echo "cp ${SSL_DOMAIN}.key tls.key"
cp ${SSL_DOMAIN}.key tls.key
echo "cp ${SSL_DOMAIN}.crt tls.crt"
cp ${SSL_DOMAIN}.crt tls.crt

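Copy the code above and save it as create_self-signed-cert.sh, or any other file name you like.

Script parameters:

--ssl-domain: primary domain for the SSL certificate; defaults to www.rancher.local if omitted, and can be ignored if the service is accessed by IP;
--ssl-trusted-ip: SSL certificates normally only trust requests made by domain name; if the server must also be reached by IP, add the IPs as certificate extensions, separated by commas;
--ssl-trusted-domain: to allow access through additional domains, add them as extension domains (TRUSTED_DOMAIN), separated by commas;
--ssl-size: SSL key size in bits, default 2048;
--ssl-cn: country code (2-letter code), default CN;
--ssl-date: validity of the service certificate in days, default 3650;
Usage example:
./create_self-signed-cert.sh --ssl-domain=www.test.com --ssl-trusted-domain=www.test2.com \
--ssl-trusted-ip=1.1.1.1,2.2.2.2,3.3.3.3 --ssl-size=2048 --ssl-date=3650

For example: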

mkdir sslcert
cd sslcert
chmod +x create_self-signed-cert.sh
./create_self-signed-cert.sh --ssl-domain=ml.rancher.kna.cn
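
Which names end up in the certificate depends on the options passed. The Python sketch below mirrors, approximately, the part of the script that writes the [alt_names] section into openssl.cnf: the section is only written when --ssl-trusted-ip or --ssl-trusted-domain is given, and the primary domain is listed after the trusted domains. The function name alt_names_section is hypothetical, and unlike the shell version it drops empty items between commas.

```python
def alt_names_section(ssl_domain, trusted_domains="", trusted_ips=""):
    """Return the lines the script would append to openssl.cnf.

    An empty list means no subjectAltName is written at all, which is what
    happens when only --ssl-domain is passed.
    """
    if not trusted_ips and not trusted_domains:
        return []
    lines = ["subjectAltName = @alt_names", "[alt_names]"]
    # Trusted domains come first, the primary domain is appended last.
    dns_names = [d for d in trusted_domains.split(",") if d] + [ssl_domain]
    lines += [f"DNS.{i} = {name}" for i, name in enumerate(dns_names, start=1)]
    ips = [ip for ip in trusted_ips.split(",") if ip]
    lines += [f"IP.{i} = {ip}" for i, ip in enumerate(ips, start=1)]
    return lines


# Same values as the usage example in the script's help text.
for line in alt_names_section("www.test.com", "www.test2.com", "1.1.1.1,2.2.2.2,3.3.3.3"):
    print(line)
```

With only --ssl-domain, as in the example command above, the function returns an empty list: the certificate then carries the domain in its Common Name only.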

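Install Rancher

Here we use the [single-node installation] [6]: Rancher runs from a Docker image, and Rancher is then used to set up Kubernetes.

We do not use the [high-availability installation] [7], which installs Rancher on top of an existing Kubernetes cluster.

See [6] for a detailed description of the parameters below.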

docker run -d --privileged --restart=unless-stopped \
  -p 80:80 -p 443:443 \
  -v /path/to/sslcert/tls.crt:/etc/rancher/ssl/cert.pem \
  -v /path/to/sslcert/tls.key:/etc/rancher/ssl/key.pem \
  -v /path/to/sslcert/cacerts.pem:/etc/rancher/ssl/cacerts.pem \
  -v /path/to/sslcert:/container/certs \
  -v /path/to/rancher:/var/lib/rancher \
  -e SSL_CERT_DIR="/container/certs" \
  -v /data/var/log/rancher/auditlog:/var/log/auditlog \
  -e AUDIT_LEVEL=1 \
  rancher/rancher:v2.5.8

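Configure the service

Once that is done, wait a moment; if all goes well, ports 80 and 443 become reachable.

For access from within the cluster, add a reverse proxy layer that exposes these two ports through huge01; the proxy also needs the certificate configured and https enabled. Then configure hosts on every node (edit /etc/hosts).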

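Create the Kubernetes cluster

Open Rancher in a browser to reach the set-password page.

  • If you forget the password, you can reset it:
$ docker exec -ti <container_id> reset-password
  • After setting the initial password, enter the management UI and add a custom cluster.

Set the cluster name and choose the Kubernetes version. For the network provider, Flannel works (others may work too, but were not tried). Leave everything else at the defaults.

Next comes the cluster options page. Following the instructions there, run the docker container command on the other machines to add them as nodes.

Each host can run multiple roles. Every cluster needs at least one Etcd role, one Control role, and one Worker role.

The best practice is to place the Etcd and Control roles on a dedicated, otherwise idle machine.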
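
The role requirement can be checked on paper before running the registration command on each host. This is a small illustrative sketch, not a Rancher API: the role names etcd, controlplane, and worker are assumed to match the role flags of the registration command, and the node names are hypothetical.

```python
REQUIRED_ROLES = {"etcd", "controlplane", "worker"}


def missing_roles(node_roles):
    """Return the required roles that no node in the plan provides.

    node_roles maps a node name to the set of roles planned for it.
    """
    assigned = set()
    for roles in node_roles.values():
        assigned |= set(roles)
    return REQUIRED_ROLES - assigned


# Hypothetical plan: one dedicated etcd/control node and two workers.
plan = {
    "node-a": {"etcd", "controlplane"},
    "node-b": {"worker"},
    "node-c": {"worker"},
}
print(missing_roles(plan) or "all required roles are covered")
```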

  • Add nodes

Wait a while; if all goes well, click Nodes to see the nodes that were added.

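Set up kubectl

kubectl may fail with:

Unable to connect to the server: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0

This error comes from the Go version kubectl was built with: Go 1.15 and later reject certificates that carry only a Common Name and no SAN, and the certificate script above writes SANs only when --ssl-trusted-ip or --ssl-trusted-domain is passed.

To work around it, set an environment variable (add it to ~/.bashrc or a similar file):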

$ export GODEBUG=x509ignoreCN=0

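Download the kubectl binary:

Download location 1: http://mirror.cnrancher.com/

Download location 2:

curl -LO https://dl.k8s.io/release/v1.17.17/bin/linux/amd64/kubectl

Adjust the version number in the URL as needed.

Add execute permission:

chmod +x kubectl

Symlink it into a directory on $PATH, for example:

sudo ln -s $(pwd)/kubectl /usr/bin/kubectl

On the cluster page, click Kubeconfig File.

On the master node, create the directory and file:

mkdir -p ~/.kube
vim ~/.kube/config

Paste in the content shown in the window that opened in the browser, so that kubectl knows how to find the cluster.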

Set up GPU support

Save the following as nvidia-device-plugin.yml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.9.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

Then run

$ kubectl apply -f nvidia-device-plugin.yml

You can track the creation status of the pods with:

$ kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-74kv8       1/1     Running            0          2d4h
nvidia-device-plugin-daemonset-75845       1/1     Running            0          2d4h
nvidia-device-plugin-daemonset-8nlsp       1/1     Running            0          2d4h
nvidia-device-plugin-daemonset-rnq8w       1/1     Running            0          2d4h

One pod is created per machine.
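
A running device-plugin pod does not by itself show that a node advertises its GPUs. One way to check, sketched here under assumptions, is to read the nvidia.com/gpu entry in each node's allocatable resources from the JSON printed by kubectl get nodes -o json. The function name gpu_counts and the sample node names are hypothetical.

```python
def gpu_counts(node_list):
    """Map each node name to the number of allocatable NVIDIA GPUs.

    node_list is the parsed JSON of `kubectl get nodes -o json`.
    Nodes without an nvidia.com/gpu entry are reported as 0.
    """
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        counts[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return counts


# Hypothetical, heavily trimmed sample of the kubectl JSON output.
sample = {
    "items": [
        {"metadata": {"name": "gpu-node"}, "status": {"allocatable": {"nvidia.com/gpu": "2"}}},
        {"metadata": {"name": "cpu-node"}, "status": {"allocatable": {"cpu": "8"}}},
    ]
}
print(gpu_counts(sample))
```

A GPU machine that reports 0 here usually deserves a look at the device-plugin pod logs and at the Docker default runtime configured earlier.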

Set up storage

The simplest option is local storage:

$ kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml

By default, data is stored under /opt/local-path-provisioner. To change that, follow https://github.com/rancher/local-path-provisioner: clone the project with git clone https://github.com/rancher/local-path-provisioner.git --depth 1, edit deploy/local-path-storage.yaml, then run

$ kubectl apply -f deploy/local-path-storage.yaml

You can check the status with

$ kubectl -n local-path-storage get pod

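Set up domain-to-IP mapping

A drawback of the self-signed certificate setup is that its custom domain cannot be resolved inside containers through hosts entries.

You may run into

ERROR: https://rancher.my.org/ping is not accessible (Could not resolve host: rancher.my.org)

To fix this, set up a DNS server in the environment with the correct domain-to-IP records, and point every node's nameserver at it.

Alternatively, use HostAliases to patch the key workloads (such as cattle-cluster-agent and cattle-node-agent) [8]: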

kubectl -n cattle-system patch deployments cattle-cluster-agent --patch '{
    "spec": {
        "template": {
            "spec": {
                "hostAliases": [
                    {
                      "hostnames":
                      [
                        "ml.r***a.cn"
                      ],
                      "ip": "10.1***3.17"
                    }
                ]
            }
        }
    }
}'

kubectl -n cattle-system patch daemonsets cattle-node-agent --patch '{
 "spec": {
     "template": {
         "spec": {
             "hostAliases": [
                 {
                    "hostnames":
                      [
                        "ml.r***a.cn"
                      ],
                    "ip": "10.1***3.17"
                 }
             ]
         }
     }
 }
}'
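
Both patches above share the same JSON body and differ only in the target workload. The following is a minimal sketch of generating that body from a hostname and an IP, so the two commands cannot drift apart; the function name host_alias_patch and the example values rancher.example.com and 192.0.2.10 are placeholders, not values from this setup.

```python
import json
import shlex


def host_alias_patch(hostnames, ip):
    """Build the strategic-merge patch body that adds one hostAliases entry."""
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "hostAliases": [{"hostnames": list(hostnames), "ip": ip}]
                }
            }
        }
    }
    return json.dumps(patch)


# Placeholder values; substitute the real Rancher domain and IP.
body = host_alias_patch(["rancher.example.com"], "192.0.2.10")
for kind, name in (("deployments", "cattle-cluster-agent"), ("daemonsets", "cattle-node-agent")):
    # Print the kubectl commands instead of running them.
    print(f"kubectl -n cattle-system patch {kind} {name} --patch {shlex.quote(body)}")
```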

Afterwards, track status and progress with:

$ kubectl get pods -n cattle-system
NAME                                    READY   STATUS    RESTARTS   AGE
cattle-cluster-agent-84f4d9f7cc-xkcrq   1/1     Running   0          3h58m
cattle-node-agent-fdc5z                 1/1     Running   0          4h41m
cattle-node-agent-jlpnl                 1/1     Running   0          4h40m
kube-api-auth-xww7h                     1/1     Running   0          2d

Set up Istio

  • Click the Default project
  • Click Resources -> Istio, keep the defaults, and choose Enable

You can use the command

$ kubectl get pods -n istio-system
NAME                                      READY   STATUS      RESTARTS   AGE
authservice-0                             1/1     Running     0          4h32m
cluster-local-gateway-66bcf8bc5d-rltpj    1/1     Running     0          4h31m
istio-citadel-66864ff6b8-znrjw            1/1     Running     0          3h12m
istio-galley-5bd9bf8b9c-8b9x6             1/1     Running     0          3h12m
istio-ingressgateway-85b49c758f-4khs7     1/1     Running     0          4h31m
istio-pilot-674bdcbbf9-8dpc8              2/2     Running     1          3h12m
istio-policy-6d9f4577db-mhxnz             2/2     Running     1          3h12m
istio-security-post-install-1.5.9-jfbkk   0/1     Completed   4          3h12m
istio-sidecar-injector-9bcfb645-vm54x     1/1     Running     0          3h12m
istio-telemetry-664b6dfd44-bhr2c          2/2     Running     6          3h12m
istio-tracing-cc6c8c677-7mrnl             1/1     Running     0          3h12m
istiod-5ff6cdbbcd-4vnhf                   1/1     Running     0          4h31m
kiali-79c4c46468-bpl7l                    1/1     Running     0          3h12m

to track status and progress.

Set extra kube-controller arguments

Rancher has no default certificate signer. If Kubeflow is installed as-is, the cache-server pod runs into the error

Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-data istio-envoy istio-podinfo kubeflow-pipelines-cache-token-7pwl7 webhook-tls-certs istiod-ca-cert]: timed out waiting for the condition

[9]. The cause is that cert-manager lacks Issuer permission, so two arguments have to be added before installing, as follows:

  • On the Global page, click Upgrade
  • Choose Edit as YAML

Under the kube-controller field, add the following 3 lines, indented one level below kube-controller [10]:

extra_args:
  cluster-signing-cert-file: "/etc/kubernetes/ssl/kube-ca.pem"
  cluster-signing-key-file: "/etc/kubernetes/ssl/kube-ca-key.pem"

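Install Kubeflow

Kubeflow is demanding to install, and most of the steps so far were groundwork for it. The officially documented methods assume unrestricted internet access; for various reasons [11] they are hard to carry out behind the firewall, so after some experimentation the following approach was chosen.

https://github.com/shikanon/kubeflow-manifests

This is a long-maintained set of Kubeflow installation files that uses mirrors reachable from China. Regardless of what README.md says, as long as the steps above went fine [12], clone it,

git clone https://github.com/shikanon/kubeflow-manifests.git --depth 1

and simply run python install.py from inside the cloned directory.

If the installation output shows no errors, monitor the status and progress of the pods with: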

$ kubectl get pods -A
NAMESPACE                   NAME                                                        READY   STATUS      RESTARTS   AGE
auth                        dex-6686f66f9b-sbbxf                                        1/1     Running     0          4h55m
cattle-prometheus           exporter-kube-state-cluster-monitoring-5dd6d5c9fd-7rwq2     1/1     Running     0          3h35m
cattle-prometheus           exporter-node-cluster-monitoring-7wvw2                      1/1     Running     0          3h35m
cattle-prometheus           exporter-node-cluster-monitoring-vhrk5                      1/1     Running     0          3h35m
cattle-prometheus           grafana-cluster-monitoring-75c5cd5995-ssr2d                 2/2     Running     0          3h35m
cattle-prometheus           prometheus-cluster-monitoring-0                             5/5     Running     1          3h31m
cattle-prometheus           prometheus-operator-monitoring-operator-f9b9567b-p6h4l      1/1     Running     0          3h35m
cattle-system               cattle-cluster-agent-84f4d9f7cc-xkcrq                       1/1     Running     0          4h14m
cattle-system               cattle-node-agent-fdc5z                                     1/1     Running     0          4h57m
cattle-system               cattle-node-agent-jlpnl                                     1/1     Running     0          4h56m
cattle-system               kube-api-auth-xww7h                                         1/1     Running     0          2d1h
cert-manager                cert-manager-9d5774b59-8hp2f                                1/1     Running     0          4h55m
cert-manager                cert-manager-cainjector-67c8c5c665-896r8                    1/1     Running     0          4h55m
cert-manager                cert-manager-webhook-75dc9757bd-zcxjp                       1/1     Running     0          4h55m
ingress-nginx               nginx-ingress-controller-9jzpx                              1/1     Running     0          2d1h
ingress-nginx               nginx-ingress-controller-bscxv                              1/1     Running     0          2d1h
istio-system                authservice-0                                               1/1     Running     0          4h55m
istio-system                cluster-local-gateway-66bcf8bc5d-rltpj                      1/1     Running     0          4h54m
istio-system                istio-citadel-66864ff6b8-znrjw                              1/1     Running     0          3h35m
istio-system                istio-galley-5bd9bf8b9c-8b9x6                               1/1     Running     0          3h35m
istio-system                istio-ingressgateway-85b49c758f-4khs7                       1/1     Running     0          4h53m
istio-system                istio-pilot-674bdcbbf9-8dpc8                                2/2     Running     1          3h35m
istio-system                istio-policy-6d9f4577db-mhxnz                               2/2     Running     1          3h35m
istio-system                istio-security-post-install-1.5.9-jfbkk                     0/1     Completed   4          3h35m
istio-system                istio-sidecar-injector-9bcfb645-vm54x                       1/1     Running     0          3h35m
istio-system                istio-telemetry-664b6dfd44-bhr2c                            2/2     Running     6          3h35m
istio-system                istio-tracing-cc6c8c677-7mrnl                               1/1     Running     0          3h35m
istio-system                istiod-5ff6cdbbcd-4vnhf                                     1/1     Running     0          4h53m
istio-system                kiali-79c4c46468-bpl7l                                      1/1     Running     0          3h35m
knative-eventing            broker-controller-5c84984b97-4shtv                          1/1     Running     0          4h55m
knative-eventing            eventing-controller-54bfbd5446-4pknn                        1/1     Running     0          4h55m
knative-eventing            eventing-webhook-58f56d9cf4-2mrdn                           1/1     Running     0          4h55m
knative-eventing            imc-controller-769896c7db-8gzmc                             1/1     Running     0          4h55m
knative-eventing            imc-dispatcher-86954fb4cd-l9l98                             1/1     Running     0          4h55m
knative-serving             activator-75696c8c9-786pn                                   1/1     Running     0          4h55m
knative-serving             autoscaler-6764f9b5c5-q9pd9                                 1/1     Running     0          4h55m
knative-serving             controller-598fd8bfd7-8ng4j                                 1/1     Running     0          4h55m
knative-serving             istio-webhook-785bb58cc6-xlnnk                              1/1     Running     0          4h55m
knative-serving             networking-istio-77fbcfcf9b-p7wr7                           1/1     Running     0          4h55m
knative-serving             webhook-865f54cf5f-rn7qq                                    1/1     Running     0          4h55m
kube-system                 coredns-6b84d75d99-2f5p4                                    1/1     Running     0          2d1h
kube-system                 coredns-6b84d75d99-8j8zs                                    1/1     Running     0          48m
kube-system                 coredns-autoscaler-5c4b6999d9-qq87l                         1/1     Running     0          48m
kube-system                 kube-flannel-kjmlb                                          2/2     Running     0          2d1h
kube-system                 kube-flannel-vnhm7                                          2/2     Running     0          2d1h
kube-system                 metrics-server-7579449c57-2jqld                             1/1     Running     0          2d1h
kube-system                 rke-coredns-addon-deploy-job-drwtd                          0/1     Completed   0          49m
kubeflow-user-example-com   ml-pipeline-ui-artifact-6d7ffcc4b6-dzfsk                    2/2     Running     0          4h52m
kubeflow-user-example-com   ml-pipeline-visualizationserver-84d577b989-7bhbm            2/2     Running     0          4h52m
kubeflow                    admission-webhook-deployment-54cf94d964-8qsh2               1/1     Running     0          4h51m
kubeflow                    cache-deployer-deployment-65cd55d4d9-d6dzd                  2/2     Running     11         4h50m
kubeflow                    cache-server-f85c69486-rgzq6                                2/2     Running     6          4h50m
kubeflow                    centraldashboard-7b7676d8bd-w5jw6                           1/1     Running     0          4h53m
kubeflow                    jupyter-web-app-deployment-66f74586d9-pnv2r                 1/1     Running     0          169m
kubeflow                    katib-controller-5467f8fdc8-rcc78                           1/1     Running     0          4h50m
kubeflow                    katib-db-manager-646695754f-v2x82                           1/1     Running     0          4h53m
kubeflow                    katib-mysql-5bb5bd9957-7zzct                                1/1     Running     0          4h53m
kubeflow                    katib-ui-55fd4bd6f9-mnqg7                                   1/1     Running     0          4h53m
kubeflow                    kfserving-controller-manager-0                              2/2     Running     0          4h53m
kubeflow                    kubeflow-pipelines-profile-controller-5698bf57cf-9t99q      1/1     Running     0          169m
kubeflow                    kubeflow-pipelines-profile-controller-5698bf57cf-wmbwc      1/1     Running     0          4h53m
kubeflow                    metacontroller-0                                            1/1     Running     0          4h53m
kubeflow                    metadata-envoy-deployment-76d65977f7-5bjkk                  1/1     Running     0          4h53m
kubeflow                    metadata-grpc-deployment-697d9c6c67-5xbq6                   2/2     Running     1          4h53m
kubeflow                    metadata-writer-58cdd57678-ns6kq                            2/2     Running     0          4h53m
kubeflow                    minio-6d6784db95-82wtz                                      2/2     Running     0          169m
kubeflow                    ml-pipeline-85fc99f899-mn65n                                2/2     Running     3          4h53m
kubeflow                    ml-pipeline-persistenceagent-65cb9594c7-gt8bw               2/2     Running     0          4h53m
kubeflow                    ml-pipeline-scheduledworkflow-7f8d8dfc69-spq9b              2/2     Running     0          4h53m
kubeflow                    ml-pipeline-ui-5c765cc7bd-hks2f                             2/2     Running     0          4h53m
kubeflow                    ml-pipeline-viewer-crd-5b8df7f458-x62wv                     2/2     Running     1          4h53m
kubeflow                    ml-pipeline-visualizationserver-56c5ff68d5-ndltc            2/2     Running     0          4h53m
kubeflow                    mpi-operator-789f88879-l5d7l                                1/1     Running     0          4h53m
kubeflow                    mxnet-operator-7fff864957-5gcv2                             1/1     Running     0          4h53m
kubeflow                    mysql-56b554ff66-k2qsl                                      2/2     Running     0          168m
kubeflow                    notebook-controller-deployment-74d9584477-jvvcc             1/1     Running     0          4h53m
kubeflow                    profiles-deployment-67b4666796-zjrpw                        2/2     Running     0          4h53m
kubeflow                    pytorch-operator-fd86f7694-tgh8l                            2/2     Running     0          4h53m
kubeflow                    tensorboard-controller-controller-manager-fd6bcffb4-lkz2g   3/3     Running     1          4h53m
kubeflow                    tensorboards-web-app-deployment-78d7b8b658-qg8kc            1/1     Running     0          4h53m
kubeflow                    tf-job-operator-7bc5cf4cc7-689d9                            1/1     Running     0          4h53m
kubeflow                    volumes-web-app-deployment-68fcfc9775-mw7gz                 1/1     Running     0          4h53m
kubeflow                    workflow-controller-5449754fb4-pc79x                        2/2     Running     1          169m
kubeflow                    xgboost-operator-deployment-5c7bfd57cc-54hjq                2/2     Running     1          4h53m
local-path-storage          local-path-provisioner-5bd6f65fdf-j575f                     1/1     Running     0          2d

If all pods reach the Running or Completed state, the deployment succeeded. If some pods stay stuck in creation for a long time, run python install.py again.
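
With this many pods, scanning the table by eye is error-prone. The sketch below is an illustration rather than part of the installer: it filters the text output of kubectl get pods -A down to pods that are neither Completed nor Running with all containers ready. The function name unready_pods is hypothetical, and the ContainerCreating line in the sample is invented for the demo.

```python
def unready_pods(output):
    """Return (namespace, name, status) for pods that still need attention.

    output is the plain text printed by `kubectl get pods -A`.
    A pod is fine when it is Completed, or Running with READY n/n.
    """
    problems = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        namespace, name, ready, status = fields[:4]
        if status == "Completed":
            continue
        have, _, want = ready.partition("/")
        if status != "Running" or have != want:
            problems.append((namespace, name, status))
    return problems


sample = """\
NAMESPACE      NAME                                      READY   STATUS              RESTARTS   AGE
kubeflow       cache-server-f85c69486-rgzq6              2/2     Running             6          4h50m
kubeflow       minio-6d6784db95-82wtz                    0/2     ContainerCreating   0          1m
istio-system   istio-security-post-install-1.5.9-jfbkk   0/1     Completed           4          3h35m
"""
for pod in unready_pods(sample):
    print(pod)
```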

To uninstall, run

$ kubectl delete -f manifest1.3

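Problems you may run into
The pods may not be visible in the Rancher UI

This is because their namespaces are not assigned to any project. Click the cluster name, then Namespaces, to see them; move them to the Default project or to a new project, and the namespaces and pod details appear under that project.

The default username/password is wrong / how to add new users and namespaces

You can edit the patch/auth.yaml file; under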

staticPasswords:
- email: "admin@example.com"
  # hash string is "password"
  hash: "$2y$12$X.oNHMsIfRSq35eRfiTYV.dPIYlWyPDRRc1.JVp0f3c.YqqJNW4uK"
  username: "admin"
  userID: "08a8684b-db88-4b73-90a9-3cd1661f5466"
- email: myname@abc.cn
  hash: $2b$10$.zSuIlx1bl9PCyigEtebhuWG/PAhZlZoyokPdGObiE7jRUHUcQ0qW
  username: myname
  userID: 08a8684b-db88-4b73-90a9-3cd1661f5466

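you can add users. Passwords are stored as bcrypt hashes, which can be generated with the https://passwordhashing.com/BCrypt tool.

Note: every user corresponds to a namespace, so when you add a user you also need to add a matching namespace (a Profile), a few lines further down in the same file: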

---
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: kubeflow-user-example-com
spec:
  owner:
    kind: User
    name: admin@example.com
---
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: myname # this becomes the namespace
spec:
  owner:
    kind: User
    name: myname@abc.cn # the login email; must match the email in staticPasswords
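
When several users are added, the Profile blocks can be generated rather than copied by hand, which keeps the owner email identical to the login email. This is a minimal sketch; the function name profile_manifest is hypothetical and the output simply reproduces the Profile layout shown above.

```python
def profile_manifest(namespace, email):
    """Return one Kubeflow Profile document for the given user.

    namespace becomes the user's namespace; email must equal the email
    used in the staticPasswords entry for that user.
    """
    return (
        "---\n"
        "apiVersion: kubeflow.org/v1beta1\n"
        "kind: Profile\n"
        "metadata:\n"
        f"  name: {namespace}\n"
        "spec:\n"
        "  owner:\n"
        "    kind: User\n"
        f"    name: {email}\n"
    )


# Example user taken from the staticPasswords block above.
print(profile_manifest("myname", "myname@abc.cn"), end="")
```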

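Finally, apply the change and restart:

  • Apply
kubectl apply -f patch/auth.yaml
  • Restart
kubectl rollout restart deployment dex -n auth

Alternatively, edit the ConfigMap directly with the command below; it is applied automatically on save.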

kubectl edit configmap dex -n auth

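Expose the Kubeflow service

Once the Kubeflow services are up, you can use

$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

to temporarily forward port 80 of the in-cluster gateway to local port 8080 and access it over http at localhost.

The password can be changed by editing patch/auth.yaml; the default username is admin@example.com and the default password is password. [13]

Generate a password hash with [14]

python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

or with the online tool at https://passwordhashing.com/BCrypt.

To expose the service beyond localhost through NodePort / LoadBalancer / Ingress, you must use https [15]. Otherwise the home page opens, but notebooks, ssh, and so on cannot connect.

A simple, workable option is NodePort.

Find the SSL port

Use the following command [16]: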

$ kubectl -n istio-system get service istio-ingressgateway
NAME                   TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)                                                                      AGE
istio-ingressgateway   NodePort   10.43.121.108   <none>        15021:31066/TCP,80:30000/TCP,443:32692/TCP,31400:32293/TCP,15443:31428/TCP   45h

As shown, port 443 is mapped to port 32692 on the host, but nothing is serving on 443 yet. The next steps generate and enable a certificate to bring port 443 up.
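
Reading the NodePort out of the PORT(S) column can also be done programmatically. Here is a small sketch that parses the column text from the output above; the function name node_port is hypothetical.

```python
def node_port(ports_field, service_port):
    """Return the NodePort mapped to service_port, or None if absent.

    ports_field is the PORT(S) column of `kubectl get service`, a
    comma-separated list of entries such as "443:32692/TCP".
    """
    for entry in ports_field.split(","):
        mapping, _, _protocol = entry.partition("/")
        port, _, node = mapping.partition(":")
        if port == str(service_port) and node:
            return int(node)
    return None


# PORT(S) value copied from the istio-ingressgateway output above.
ports = "15021:31066/TCP,80:30000/TCP,443:32692/TCP,31400:32293/TCP,15443:31428/TCP"
print(node_port(ports, 443))
```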

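Generate a self-signed certificate
Use the script from earlier to generate a new SSL certificate for Kubeflow.

./create_self-signed-cert.sh --ssl-domain=kub**a.cn

A CA-signed certificate, if you have one, can be used directly as well.

Store the certificate as a TLS secret

kubectl create --namespace istio-system secret tls kf-tls-cert --key /data/gang/kfcerts/kub***a.cn.key --cert /data/gang/kfcerts/kub***a.cn.crt

Configure the Knative cluster to use the custom domain

kubectl edit cm config-domain --namespace knative-serving

Under data, add a key named after the domain [17]; the value can stay empty unless you have special needs, for example: kub***a.cn: ""

Configure Kubeflow to use this certificate
Edit the file manifest1.3/016-istio-1-9-0-kubeflow-istio-resources-base.yaml

In the kubeflow-gateway at the end of the file, add [18]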

- hosts:
  - '*'
  port:
    name: https
    number: 443
    protocol: HTTPS
  tls:
    mode: SIMPLE
    credentialName: kf-tls-cert

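hosts can be set to the domain bound to the certificate just generated (only requests for that domain are accepted), or to * to accept other domains as well.

Apply and restart

kubectl apply -f manifest1.3/016-istio-1-9-0-kubeflow-istio-resources-base.yaml
kubectl rollout restart deploy istio-ingressgateway -n istio-system

Now other machines can reach the Kubeflow service over https with

curl https://10.1***2:32692 -k

Because the certificate is self-signed, the -k flag is used to skip certificate verification.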
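
The same check can be made from Python. The sketch below, using only the standard library, shows two options: skipping verification (the equivalent of curl -k) or trusting the generated CA file instead. The function names make_context and fetch_status are hypothetical, and verifying against the CA only works when the host name or IP used in the URL is covered by the certificate.

```python
import ssl
import urllib.request


def make_context(ca_file=None):
    """Build an SSL context for talking to the self-signed endpoint.

    With ca_file (for example the cacerts.pem produced by the script),
    the server certificate is verified against that CA. Without it,
    verification is switched off, like `curl -k`.
    """
    if ca_file:
        return ssl.create_default_context(cafile=ca_file)
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before CERT_NONE
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_status(url, ca_file=None):
    """Return the HTTP status code of a GET request to url."""
    with urllib.request.urlopen(url, context=make_context(ca_file), timeout=10) as response:
        return response.status
```

Calling fetch_status with the node address and NodePort from the previous step performs the request; no request is made until the function is called.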

Wipe and start over
Use this script with care: it deletes leftover files, services, network configuration, containers, and more! [19]

#!/bin/bash

KUBE_SVC='
kubelet
kube-scheduler
kube-proxy
kube-controller-manager
kube-apiserver
'

for kube_svc in ${KUBE_SVC};
do
  # Stop the service
  if [[ `systemctl is-active ${kube_svc}` == 'active' ]]; then
    systemctl stop ${kube_svc}
  fi
  # Disable the service at boot
  if [[ `systemctl is-enabled ${kube_svc}` == 'enabled' ]]; then
    systemctl disable ${kube_svc}
  fi
done

# Stop all containers
docker stop $(docker ps -aq)

# Remove all containers
docker rm -f $(docker ps -qa)

# Remove all container volumes
docker volume rm $(docker volume ls -q)

# Unmount kubelet and rancher mount points
for mount in $(mount | grep tmpfs | grep '/var/lib/kubelet' | awk '{ print $3 }') /var/lib/kubelet /var/lib/rancher;
do
  umount $mount;
done

# Back up directories
mv /etc/kubernetes /etc/kubernetes-bak-$(date +"%Y%m%d%H%M")
mv /var/lib/etcd /var/lib/etcd-bak-$(date +"%Y%m%d%H%M")
mv /var/lib/rancher /var/lib/rancher-bak-$(date +"%Y%m%d%H%M")
mv /opt/rke /opt/rke-bak-$(date +"%Y%m%d%H%M")
rm -rf ~/.kube/
rm -rf /etc/kubernetes/
rm -rf /etc/systemd/system/kubelet.service.d
rm -rf /etc/systemd/system/kubelet.service
rm -rf /usr/bin/kube*
rm -rf /etc/cni
rm -rf /opt/cni
rm -rf /var/lib/etcd
rm -rf /var/etcd

# Remove leftover paths
rm -rf /etc/ceph \
    /etc/cni \
    /opt/cni \
    /run/secrets/kubernetes.io \
    /run/calico \
    /run/flannel \
    /var/lib/calico \
    /var/lib/cni \
    /var/lib/kubelet \
    /var/log/containers \
    /var/log/kube-audit \
    /var/log/pods \
    /var/run/calico \
    /usr/libexec/kubernetes

# Clean up network interfaces
no_del_net_inter='
lo
docker0
eth
ens
bond
'

network_interface=`ls /sys/class/net`

for net_inter in $network_interface;
do
  if ! echo "${no_del_net_inter}" | grep -qE ${net_inter:0:3}; then
    ip link delete $net_inter
  fi
done

# Kill leftover processes
port_list='
80
443
6443
2376
2379
2380
8472
9099
10250
10254
'

for port in $port_list;
do
  pid=`netstat -atlnup | grep $port | awk '{print $7}' | awk -F '/' '{print $1}' | grep -v - | sort -rnk2 | uniq`
  if [[ -n $pid ]]; then
    kill -9 $pid
  fi
done

kube_pid=`ps -ef | grep -v grep | grep kube | awk '{print $2}'`

if [[ -n $kube_pid ]]; then
  kill -9 $kube_pid
fi

# Flush the iptables tables
## Note: if the node has custom iptables rules, run the following commands with caution
sudo iptables --flush
sudo iptables --flush --table nat
sudo iptables --flush --table filter
sudo iptables --table nat --delete-chain
sudo iptables --table filter --delete-chain
systemctl restart docker
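
Since the script deletes network interfaces, it can be worth previewing which ones would go before running it. The sketch below approximates the interface loop of the script: an interface is kept when the first three characters of its name occur somewhere in the keep list. It treats the prefix as literal text, whereas the script passes it to grep as a regular expression, so names with special characters may behave differently. The function name interfaces_to_delete and the sample interface names are hypothetical.

```python
KEEP = ["lo", "docker0", "eth", "ens", "bond"]


def interfaces_to_delete(interfaces, keep=KEEP):
    """Return the interfaces the cleanup loop would remove.

    Mirrors the script's check: the first three characters of each
    interface name are searched for inside the keep list.
    """
    keep_text = "\n".join(keep)
    return [name for name in interfaces if name[:3] not in keep_text]


# Hypothetical interface list, as `ls /sys/class/net` might print it.
sample = ["lo", "eth0", "docker0", "flannel.1", "cni0", "bond0"]
print(interfaces_to_delete(sample))
```

On a node, the real list can be obtained with os.listdir("/sys/class/net"). Interfaces whose names start with something other than the five kept prefixes, for example a physical NIC with a different naming scheme, would show up in the result and be deleted by the script.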

The volume directories specified when creating the Rancher container can also be deleted as appropriate:

rm -rf /data/var/log/rancher/auditlog
rm -rf /path/to/rancher

References
[1] https://docs.docker.com/engine/install/centos/#install-using-the-repository

[2] https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#id2

[3] https://github.com/kubeflow/manifests#prerequisites

[4] https://docs.rancher.cn/docs/rancher2.5/installation/install-rancher-on-k8s/_index#3-选择您的-ssl-选项

[5] https://docs.rancher.cn/docs/rancher2.5/installation/resources/advanced/self-signed-ssl/_index

[6] https://docs.rancher.cn/docs/rancher2.5/installation/other-installation-methods/single-node-docker/advanced/_index

[7] https://docs.rancher.cn/docs/rancher2.5/installation/install-rancher-on-k8s/_index

[8] https://docs.rancher.cn/docs/rancher2/faq/install/_index/#error-httpsranchermyorgping-is-not-accessible-could-not-resolve-host-ranchermyorg

[9] https://github.com/shikanon/kubeflow-manifests/issues/20#issuecomment-843014942

[10] https://github.com/cockroachdb/cockroach/issues/28075#issuecomment-420497277

[11] Reason 1: some images are hosted on gcr.io and cannot be pulled. Reason 2: some images are temporary versions (tagged with a sha256 digest) and cannot be saved and imported. Reason 3: most of the officially recommended channels are tightly coupled to cloud providers, and the remaining methods all require a proxy past the firewall.

[12] "Went fine" means that after running kubectl get pods -A, all pods are in the Running or Completed state.

[13] https://github.com/shikanon/kubeflow-manifests

[14] https://github.com/kubeflow/manifests#change-default-user-password

[15] https://github.com/kubeflow/manifests#nodeport--loadbalancer--ingress

[16] https://istio.io/latest/zh/docs/tasks/traffic-management/ingress/ingress-control/#determining-the-ingress-ip-and-ports

[17] https://knative.dev/docs/serving/using-a-tls-cert/#before-you-begin

[18] https://knative.dev/docs/serving/using-a-tls-cert/#manually-adding-a-tls-certificate

[19] https://docs.rancher.cn/docs/rancher2/cluster-admin/cleaning-cluster-nodes/_index