Preface
Server OS: Ubuntu 20.04
Server IPs: 192.168.1.109, 192.168.1.110
Kubernetes version: 1.20.1
Docker version: 18.09.8
Kubeflow version: 1.3.1
Requirements
Older Kubernetes releases may not be compatible with the latest Kubeflow version; check the compatibility matrix for yourself at: https://www.kubeflow.org/docs/started/k8s/overview/#minimum-system-requirements
Installing Kubernetes
1. Disable the firewall, SELinux, and swap (run on both the master and the node machines)
a. Disable the firewall
sudo ufw disable
b. Disable SELinux (optional; stock Ubuntu ships AppArmor rather than SELinux, so this step is often unnecessary)
sudo apt install selinux-utils
Temporarily:
sudo setenforce 0 # temporary
Permanently:
sudo sed -i 's/enforcing/disabled/' /etc/selinux/config
c. Disable swap
sudo sed -i 's/^.*swap/#&/g' /etc/fstab
sudo swapoff -a # temporary
2. Pass bridged IPv4 traffic to iptables (run on both the master and the node machines)
Run the following:
sudo tee /etc/sysctl.d/k8s.conf <<-'EOF'
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward=1
EOF
Apply the settings:
sudo sysctl --system
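If sysctl complains that the net.bridge.bridge-nf-call-* keys do not exist, the br_netfilter kernel module is probably not loaded yet; the following optional commands (an extra step, not part of the original guide) load it now and on every boot:
sudo modprobe br_netfilter
echo br_netfilter | sudo tee /etc/modules-load.d/k8s.conf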
3. Time synchronization (run on both the master and the node machines)
sudo apt install ntpdate -y
sudo ntpdate time.windows.com
4. Install Docker (run on both the master and the node machines)
Remove old versions:
sudo apt-get remove docker docker-engine docker-ce docker.io # remove old versions
Update sources.list (back up your existing sources.list before overwriting it):
sudo tee /etc/apt/sources.list <<-'EOF'
deb http://mirrors.aliyun.com/ubuntu/ focal main
deb-src http://mirrors.aliyun.com/ubuntu/ focal main
deb http://mirrors.aliyun.com/ubuntu/ focal-updates main
deb-src http://mirrors.aliyun.com/ubuntu/ focal-updates main
deb http://mirrors.aliyun.com/ubuntu/ focal universe
deb-src http://mirrors.aliyun.com/ubuntu/ focal universe
deb http://mirrors.aliyun.com/ubuntu/ focal-updates universe
deb-src http://mirrors.aliyun.com/ubuntu/ focal-updates universe
deb http://mirrors.aliyun.com/ubuntu/ focal-security main
deb-src http://mirrors.aliyun.com/ubuntu/ focal-security main
deb http://mirrors.aliyun.com/ubuntu/ focal-security universe
deb-src http://mirrors.aliyun.com/ubuntu/ focal-security universe
EOF
Install Docker (run these in order):
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
apt-cache madison docker-ce # list the available docker-ce versions and pick the one you need (18.09.8 in this guide)
sudo apt install docker-ce=<VERSION> # replace <VERSION> with the exact string shown by apt-cache madison, or drop "=<VERSION>" to install the latest
Add your user to the docker group
sudo groupadd docker
sudo gpasswd -a $USER docker
newgrp docker
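As an optional sanity check, confirm that your user can now reach the Docker daemon without sudo:
docker run --rm hello-world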
Create the HTTP/HTTPS proxy configuration
sudo vim /etc/profile
# add:
export http_proxy=http://192.168.1.110:1080
export https_proxy=http://192.168.1.110:1080
export no_proxy="127.0.0.1,192.168.1.0/24,localhost,10.96.0.0/12,10.244.0.0/16"
# to undo later, run:
$ unset http_proxy
$ unset https_proxy
# create the docker.service drop-in directory
$ sudo mkdir -p /etc/systemd/system/docker.service.d
$ sudo vim /etc/systemd/system/docker.service.d/http-proxy.conf
Configuration contents (proxy-addr is the proxy IP or hostname, proxy-port is the proxy port; NO_PROXY is a comma-separated list of private registries or hosts that should bypass the proxy):
[Service]
Environment="HTTP_PROXY=http://proxy-addr:proxy-port" #代理服务器地址
Environment="HTTPS_PROXY=http://proxy-addr:proxy-port" #proxy-addr也是http开头,而不是https,否则会报错
Environment="NO_PROXY=localhost,127.0.0.0/8,docker-registry.example.com,.corp" #哪些不需要代理
Check the result:
$ sudo systemctl daemon-reload
$ sudo systemctl show --property=Environment docker
Environment=HTTP_PROXY=http://proxy-addr:proxy-port/ HTTPS_PROXY=http://proxy-addr:proxy-port/ NO_PROXY=localhost,127.0.0.0/8,docker-registry.example.com,.corp
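The proxy drop-in only takes effect after the Docker daemon is restarted, so restart it once the output above looks correct:
sudo systemctl restart docker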
Install nvidia-docker2:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
Two problems may come up:
gpg: no valid OpenPGP data found (from: curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -)
E: Unable to locate package nvidia-docker2 (reported at the apt-get install step when the repository list written by curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list could not actually be downloaded)
Both mean nvidia.github.io could not be reached; add its IP addresses to the hosts file:
sudo vim /etc/hosts
Add the following entries:
185.199.108.153 nvidia.github.io
185.199.109.153 nvidia.github.io
185.199.110.153 nvidia.github.io
185.199.111.153 nvidia.github.io
Create /etc/docker/daemon.json and add the following content:
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json <<-'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "exec-opts": ["native.cgroupdriver=systemd"],
  "registry-mirrors": ["http://hub-mirror.c.163.com"],
  "default-runtime": "nvidia"
}
EOF
Restart Docker and enable it at boot:
sudo systemctl daemon-reload
sudo systemctl restart docker
sudo systemctl enable docker
Verify the installation:
docker run --rm nvidia/cuda:11.0-base nvidia-smi
Disable the dnsmasq DNS service by commenting out the dns line:
sudo vim /etc/NetworkManager/NetworkManager.conf # open the config file
[main]
plugins=ifupdown,keyfile,ofono
#dns=dnsmasq # comment out this line
Save, close, and restart the service:
sudo systemctl restart network-manager
5. Install kubeadm, kubelet and kubectl (run on both the master and the node machines)
sudo apt-get update && sudo apt-get install -y apt-transport-https
curl https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | sudo apt-key add -
Create /etc/apt/sources.list.d/kubernetes.list and add its content with the following command:
sudo tee /etc/apt/sources.list.d/kubernetes.list <<-'EOF'
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
EOF
sudo apt-get update
sudo apt-get install -y kubelet=1.20.1-00 kubeadm=1.20.1-00 kubectl=1.20.1-00
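Optionally, hold the three packages so a routine apt upgrade cannot move the cluster onto a newer, possibly incompatible version:
sudo apt-mark hold kubelet kubeadm kubectl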
Note: all of the steps so far must be run on both the master and the node machines; from here on, the master and node steps differ slightly.
Create and initialize the cluster (master node only):
sudo kubeadm init --kubernetes-version v1.20.1 --apiserver-advertise-address=192.168.1.109 --pod-network-cidr=10.244.0.0/16 --image-repository registry.cn-hangzhou.aliyuncs.com/google_containers --upload-certs |tee kubeadm-init.log
You can watch the NIC throughput in real time with ifstat while the images are pulled. Once the init completes, set up kubectl as prompted, for example:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
sudo scp /run/systemd/resolve/resolv.conf openoker@192.168.1.110:/run/systemd/resolve/resolv.conf
The following command is used to add node machines; it is printed when the master is initialized. Save the one generated for your own cluster, for example:
kubeadm join 192.168.1.109:6443 --token y07gsn.pd62oi5wbujgfnxs \
    --discovery-token-ca-cert-hash sha256:efc9349a6b0a5a2d2384ab5b568b9f35b3041f80d3474f5b9c631d4268feadb6
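If this output is lost, a fresh join command (with a new token) can be printed on the master at any time:
kubeadm token create --print-join-command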
If it fails, reset and re-initialize until it succeeds:
sudo kubeadm reset
sudo kubeadm init --kubernetes-version v1.20.1 --apiserver-advertise-address=192.168.1.109 --pod-network-cidr=10.244.0.0/16 --image-repository registry.cn-hangzhou.aliyuncs.com/google_containers --upload-certs | tee kubeadm-init.log
Enable kubelet at boot:
sudo systemctl enable kubelet
Start the kubelet service:
sudo systemctl start kubelet
Install the flannel network plugin:
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
kubectl apply -f kube-flannel.yml
# restart the kube-proxy pods
kubectl get pod -n kube-system | grep kube-proxy |awk '{system("kubectl delete pod "$1" -n kube-system")}'
Check the status:
kubectl get pod -n kube-system
Sometimes the two coreDNS pods fail to start.
Run sudo vim /run/flannel/subnet.env and add the following content, then restart the coredns pods as shown after the file listing:
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
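After saving the file, delete the stuck coredns pods so they are recreated with the flannel subnet available (k8s-app=kube-dns is the label coredns carries in a default kubeadm cluster):
kubectl -n kube-system delete pod -l k8s-app=kube-dns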
Allow the master to accept workloads as well:
kubectl get nodes
kubectl taint node xxx-nodename node-role.kubernetes.io/master- # use the master as a worker node as well
kubectl taint node xxx-nodename node-role.kubernetes.io/master="":NoSchedule # restore the master to master-only
flannel was chosen here because an earlier test environment used the calico plugin and several pods never came up after Kubeflow was installed, presumably due to networking issues. If you do need calico, it is installed as follows.
Install the calico network plugin:
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
Check the status:
kubectl get pod -n kube-system
If the plugin pods fail to start, reboot the machine and then run the following in order:
sudo ufw disable
setenforce 0
sudo swapoff -a
sudo systemctl restart docker
sudo systemctl restart kubelet.service
Adding Node machines
Configure the node environment as follows.
Copy /etc/kubernetes/admin.conf from the master to the same directory on the node, using this remote-copy command:
sudo scp /etc/kubernetes/admin.conf openoker@192.168.1.110:/etc/kubernetes
Note: run this command on the master; openoker@192.168.1.110 is the node's user and IP, so replace it with your own.
Then, on the node, run in order:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
sudo kubeadm join 192.168.1.109:6443 --token y07gsn.pd62oi5wbujgfnxs \
    --discovery-token-ca-cert-hash sha256:efc9349a6b0a5a2d2384ab5b568b9f35b3041f80d3474f5b9c631d4268feadb6
Note: remember that swap must be disabled on the node as well.
Install the network plugin on the node the same way as on the master.
- Important additional configuration:
On the master, edit the Kubernetes static pod manifests under /etc/kubernetes/manifests/.
Edit kube-apiserver.yaml and add the following flags:
# needed on versions below v1.21.x; fixes the configmap "istio-ca-root-cert" not found error
- --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
- --service-account-issuer=kubernetes.default.svc
# needed after upgrading to 1.20.x or later; fixes NFS dynamically provisioned volumes being stuck in Pending
- --feature-gates=RemoveSelfLink=false
Edit kube-controller-manager.yaml and kube-scheduler.yaml and comment out the line - --port=0.
Restart kubelet: sudo systemctl restart kubelet
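kubelet recreates the static pods from the edited manifests; as an optional check, confirm they come back up:
kubectl -n kube-system get pods | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler'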
The Kubernetes cluster deployment is now complete.
Installing Kubeflow
Set up the GPU
Save the following as nvidia-device-plugin.yml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.9.0
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
Then run:
$ kubectl apply -f nvidia-device-plugin.yml
You can track the pods' creation status with the following command:
$ kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-74kv8 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-75845 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-8nlsp 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-rnq8w 1/1 Running 0 2d4h
One pod is created for each machine in the cluster.
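Once the plugin pods are running, every GPU node should advertise an nvidia.com/gpu resource; as a quick check (replace <node-name> with one of your node names):
kubectl describe node <node-name> | grep nvidia.com/gpu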
Set up NFS and create PVs
Before installing Kubeflow, set up NFS and create the PVs. Components such as mysql and katib request PVCs during the Kubeflow installation; if the PVs are not created in advance, those pods will sit in Pending.
Pick one machine to act as the NFS server:
sudo apt install nfs-kernel-server
# create the shared directories
sudo mkdir -p /data/nfs-kubeflow
cd /data/nfs-kubeflow
sudo mkdir v1 v2 v3 v4 v5
sudo chmod -R 777 /data/nfs-kubeflow
# configure the NFS exports
sudo vim /etc/exports
Add the following line:
/data/nfs-kubeflow *(insecure,rw,no_root_squash,no_all_squash,sync)
# apply the configuration
sudo exportfs -r
# restart the rpcbind / NFS services
sudo service nfs-kernel-server restart
# check the exported directories
sudo showmount -e
# test the mount from a node
sudo apt-get install nfs-common
sudo mount -t nfs -o nolock,nfsvers=4,rsize=2048,wsize=2048,timeo=15 192.168.1.109:/data/nfs-kubeflow /data/nfs-kubeflow
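A simple read/write test from the node confirms the mount is usable (the test file name here is arbitrary):
touch /data/nfs-kubeflow/nfs-write-test
ls -l /data/nfs-kubeflow/
rm /data/nfs-kubeflow/nfs-write-test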
If the NFS mount hangs (and even kill -9 does not help), repeatedly running "sudo umount -f /data/nfs-kubeflow/" usually lets you break out.
Create pv.yaml with the following content; remember to replace path and server with your own values:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv001
  labels:
    name: pv001
spec:
  nfs:
    path: /data/nfs-kubeflow/v1
    server: 192.168.1.109
  accessModes: ["ReadWriteMany","ReadWriteOnce"]
  capacity:
    storage: 20Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv002
  labels:
    name: pv002
spec:
  nfs:
    path: /data/nfs-kubeflow/v2
    server: 192.168.1.109
  accessModes: ["ReadWriteMany","ReadWriteOnce"]
  capacity:
    storage: 20Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv003
  labels:
    name: pv003
spec:
  nfs:
    path: /data/nfs-kubeflow/v3
    server: 192.168.1.109
  accessModes: ["ReadWriteMany","ReadWriteOnce"]
  capacity:
    storage: 30Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv004
  labels:
    name: pv004
spec:
  nfs:
    path: /data/nfs-kubeflow/v4
    server: 192.168.1.109
  accessModes: ["ReadWriteMany","ReadWriteOnce"]
  capacity:
    storage: 50Gi
Run kubectl apply -f pv.yaml to create the PVs.
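Then confirm the PVs exist and show as Available:
kubectl get pv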
Install Kubeflow
wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.3.1.zip
unzip v1.3.1.zip
wget https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
mv kustomize_3.2.0_linux_amd64 kustomize
chmod +x kustomize
sudo mv kustomize /usr/bin/
cd manifests-1.3.1/
# install cert-manager
kustomize build common/cert-manager/cert-manager/base | kubectl apply -f -
kustomize build common/cert-manager/kubeflow-issuer/base | kubectl apply -f -
# install istio
kustomize build common/istio-1-9/istio-crds/base | kubectl apply -f -
kustomize build common/istio-1-9/istio-namespace/base | kubectl apply -f -
kustomize build common/istio-1-9/istio-install/base | kubectl apply -f -
# install dex
kustomize build common/dex/overlays/istio | kubectl apply -f -
# install oidc
kustomize build common/oidc-authservice/base | kubectl apply -f -
# install knative-serving
kubectl apply -f https://github.com/knative/serving/releases/download/v0.17.1/serving-crds.yaml
kustomize build common/knative/knative-serving/base | kubectl apply -f -
kustomize build common/istio-1-9/cluster-local-gateway/base | kubectl apply -f -
kustomize build common/knative/knative-eventing/base | kubectl apply -f -
# create the kubeflow namespace
kustomize build common/kubeflow-namespace/base | kubectl apply -f -
# create kubeflow-roles
kustomize build common/kubeflow-roles/base | kubectl apply -f -
# create the istio resources
kustomize build common/istio-1-9/kubeflow-istio-resources/base | kubectl apply -f -
# install pipeline
kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user | kubectl apply -f -
# install kfserving
kustomize build apps/kfserving/upstream/overlays/kubeflow | kubectl apply -f -
# install katib
kustomize build apps/katib/upstream/installs/katib-with-kubeflow | kubectl apply -f -
# install the kubeflow dashboard
kustomize build apps/centraldashboard/upstream/overlays/istio | kubectl apply -f -
# install admission-webhook
kustomize build apps/admission-webhook/upstream/overlays/cert-manager | kubectl apply -f -
# install the notebook controller
kustomize build apps/jupyter/notebook-controller/upstream/overlays/kubeflow | kubectl apply -f -
# install jupyter
kustomize build apps/jupyter/jupyter-web-app/upstream/overlays/istio | kubectl apply -f -
# install kfam
kustomize build apps/profiles/upstream/overlays/kubeflow | kubectl apply -f -
# install the volumes web app
kustomize build apps/volumes-web-app/upstream/overlays/istio | kubectl apply -f -
# install tensorboard
kustomize build apps/tensorboard/tensorboards-web-app/upstream/overlays/istio | kubectl apply -f -
kustomize build apps/tensorboard/tensorboard-controller/upstream/overlays/kubeflow | kubectl apply -f -
# install the training operators
kustomize build apps/tf-training/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/pytorch-job/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/mpi-job/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/mxnet-job/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/xgboost-job/upstream/overlays/kubeflow | kubectl apply -f -
# create the user namespace
kustomize build common/user-namespace/base | kubectl apply -f -
Wait until all pods are in the Running state.
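The following commands are handy for watching the rollout; kubeflow, istio-system, cert-manager, auth, knative-serving and kubeflow-user-example-com are the namespaces created by the steps above:
kubectl get pods -n kubeflow -w
kubectl get pods -n istio-system
kubectl get pods -n cert-manager
kubectl get pods -n auth
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow-user-example-com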
Direct access (optional): kubectl port-forward svc/istio-ingressgateway --address 0.0.0.0 -n istio-system 8080:80
To expose a fixed NodePort 30000 instead, create kubeflow-ui-nodeport.yaml:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: istio-ingressgateway
    install.operator.istio.io/owning-resource: unknown
    istio: ingressgateway
    istio.io/rev: default
    operator.istio.io/component: IngressGateways
    release: istio
  name: istio-ingressgateway
  namespace: istio-system
spec:
  ports:
  - name: status-port
    port: 15021
    protocol: TCP
    targetPort: 15021
  - name: http2
    port: 80
    protocol: TCP
    targetPort: 8080
    nodePort: 30000
  - name: https
    port: 443
    protocol: TCP
    targetPort: 8443
  - name: tcp
    port: 31400
    protocol: TCP
    targetPort: 31400
  - name: tls
    port: 15443
    protocol: TCP
    targetPort: 15443
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
  type: NodePort
Run kubectl apply -f kubeflow-ui-nodeport.yaml
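Then confirm the service exposes port 80 on NodePort 30000:
kubectl -n istio-system get svc istio-ingressgateway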
This completes the Kubeflow 1.3 deployment. The dashboard is reachable at http://ip:30000; the default account is user@example.com with password 12341234.
Set up a StorageClass
By default, creating a notebook server fails because no default StorageClass has been configured.
Create storage-nfs.yaml, remembering to replace the NFS_SERVER and NFS_PATH values:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nfs-client-provisioner
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: nfs-client-provisioner
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: nfs-client-provisioner
  template:
    metadata:
      labels:
        app: nfs-client-provisioner
    spec:
      serviceAccount: nfs-client-provisioner
      containers:
      - name: nfs-client-provisioner
        image: quay.io/external_storage/nfs-client-provisioner:latest
        volumeMounts:
        - name: nfs-client-root
          mountPath: /persistentvolumes
        env:
        - name: PROVISIONER_NAME
          value: nfs-client
        - name: NFS_SERVER
          value: 192.168.1.109
        - name: NFS_PATH
          value: /data/nfs-kubeflow/v5
      volumes:
      - name: nfs-client-root
        nfs:
          server: 192.168.1.109
          path: /data/nfs-kubeflow/v5
Create storage-rbac.yaml