Preface
Server OS: Ubuntu 20.04
Server IPs: 192.168.1.109, 192.168.1.110
Kubernetes version: 1.20.1
Docker version: 18.09.8
Kubeflow version: 1.3.1
Requirements
Older Kubernetes versions may not be compatible with the latest Kubeflow release; check the compatibility notes yourself at: https://www.kubeflow.org/docs/started/k8s/overview/#minimum-system-requirements
Install Kubernetes
1. Disable the firewall, SELinux, and swap (run on both master and node)
a. Disable the firewall
sudo ufw disable
b. Disable SELinux (optional)
sudo apt install selinux-utils
Temporarily:
setenforce 0 # temporary
Permanently:
sudo sed -i 's/enforcing/disabled/' /etc/selinux/config
c. Disable swap
sudo sed -i 's/^.*swap/#&/g' /etc/fstab
sudo swapoff -a # temporary
2. Pass bridged IPv4 traffic to iptables (run on both master and node)
Commands:
sudo tee /etc/sysctl.d/k8s.conf <<-'EOF'
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward=1
EOF
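If the bridge sysctl keys cannot be applied, the br_netfilter kernel module is probably not loaded; loading it first is the usual fix (a step taken from the upstream Kubernetes install docs, added here for completeness):
sudo modprobe br_netfilter
echo br_netfilter | sudo tee /etc/modules-load.d/k8s.conf # make it persistent across reboots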
Apply the settings:
sudo sysctl --system
3. Time synchronization (run on both master and node)
sudo apt install ntpdate -y
sudo ntpdate time.windows.com
4. Install Docker (run on both master and node)
Remove old versions:
sudo apt-get remove docker docker-engine docker-ce docker.io # remove old versions
Update sources.list (tip: back up your existing sources.list before overwriting it):
sudo tee /etc/apt/sources.list <<-'EOF'
deb http://mirrors.aliyun.com/ubuntu/ focal main
deb-src http://mirrors.aliyun.com/ubuntu/ focal main
deb http://mirrors.aliyun.com/ubuntu/ focal-updates main
deb-src http://mirrors.aliyun.com/ubuntu/ focal-updates main
deb http://mirrors.aliyun.com/ubuntu/ focal universe
deb-src http://mirrors.aliyun.com/ubuntu/ focal universe
deb http://mirrors.aliyun.com/ubuntu/ focal-updates universe
deb-src http://mirrors.aliyun.com/ubuntu/ focal-updates universe
deb http://mirrors.aliyun.com/ubuntu/ focal-security main
deb-src http://mirrors.aliyun.com/ubuntu/ focal-security main
deb http://mirrors.aliyun.com/ubuntu/ focal-security universe
deb-src http://mirrors.aliyun.com/ubuntu/ focal-security universe
EOF
Install Docker (run the following in order):
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
apt-cache madison docker-ce # list the available docker-ce versions and pick the one you need (18.09.8 here)
# to pin a specific version rather than the latest, append the version string from apt-cache madison, e.g. sudo apt install docker-ce=<VERSION_STRING>
sudo apt install docker-ce
Add your user to the docker group:
sudo groupadd docker
sudo gpasswd -a $USER docker
newgrp docker
Create the HTTP/HTTPS proxy configuration (optional, for hosts that cannot reach external registries directly)
sudo vim /etc/profile
# add the following
export http_proxy=http://192.168.1.110:1080
export https_proxy=http://192.168.1.110:1080
export no_proxy="127.0.0.1,192.168.1.0/24,localhost,10.96.0.0/12,10.244.0.0/16"
# apply with: source /etc/profile
# to remove the proxy later
$ unset http_proxy
$ unset https_proxy
# create the docker.service drop-in directory
$ sudo mkdir -p /etc/systemd/system/docker.service.d
$ sudo vim /etc/systemd/system/docker.service.d/http-proxy.conf
Configuration content (proxy-addr is the proxy IP or domain, proxy-port is the proxy port; NO_PROXY lists the private registry domains or IPs that should not go through the proxy, separated by commas):
[Service]
Environment="HTTP_PROXY=http://proxy-addr:proxy-port" #代理服务器地址
Environment="HTTPS_PROXY=http://proxy-addr:proxy-port" #proxy-addr也是http开头,而不是https,否则会报错
Environment="NO_PROXY=localhost,127.0.0.0/8,docker-registry.example.com,.corp" #哪些不需要代理
查看配置结果
$ sudo systemctl daemon-reload
$ sudo systemctl show --property=Environment docker
Environment=HTTP_PROXY=http://proxy-addr:proxy-port/ HTTPS_PROXY=http://proxy-addr:proxy-port/ NO_PROXY=localhost,127.0.0.0/8,docker-registry.example.com,.corp
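Note that the proxy drop-in only takes effect once Docker itself is restarted; daemon-reload re-reads the unit files but does not restart the running daemon:
sudo systemctl restart docker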
Install nvidia-docker2:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
Two problems may occur:
gpg: no valid OpenPGP data found (from: curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -)
E: Unable to locate package nvidia-docker2 (from: curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list)
Both are fixed by adding the nvidia.github.io IP addresses to the hosts file:
sudo vim /etc/hosts
Add the following entries:
185.199.108.153 nvidia.github.io
185.199.109.153 nvidia.github.io
185.199.110.153 nvidia.github.io
185.199.111.153 nvidia.github.io
Create /etc/docker/daemon.json and add the following content; the commands are:
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json <<-'EOF'
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
},
"exec-opts": ["native.cgroupdriver=systemd"],
"registry-mirrors": ["http://hub-mirror.c.163.com"],
"default-runtime": "nvidia"
}
EOF
Restart Docker and enable it at boot:
sudo systemctl daemon-reload
sudo systemctl restart docker
sudo systemctl enable docker
Verify the installation:
docker run --rm nvidia/cuda:11.0-base nvidia-smi
Disable the dnsmasq DNS service by commenting out the dns line:
sudo vim /etc/NetworkManager/NetworkManager.conf # open the config file
[main]
plugins=ifupdown,keyfile,ofono
#dns=dnsmasq # comment out this line
Save, close, and restart the service:
sudo systemctl restart network-manager
5. Install kubeadm, kubelet, and kubectl (run on both master and node)
sudo apt-get update && sudo apt-get install -y apt-transport-https
curl https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | sudo apt-key add -
Create /etc/apt/sources.list.d/kubernetes.list and add the repository entry, as follows:
sudo tee /etc/apt/sources.list.d/kubernetes.list <<-'EOF'
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
EOF
sudo apt-get update
sudo apt-get install -y kubelet=1.20.1-00 kubeadm=1.20.1-00 kubectl=1.20.1-00
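Optionally, hold the three packages so a routine apt upgrade cannot move the cluster to an incompatible version (standard practice from the upstream install guide, not part of the original steps):
sudo apt-mark hold kubelet kubeadm kubectl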
Note: all of the steps above run on both master and node; from here on the master and node steps differ slightly.
Create and initialize the cluster (master only):
sudo kubeadm init --kubernetes-version v1.20.1 --apiserver-advertise-address=192.168.1.109 --pod-network-cidr=10.244.0.0/16 --image-repository registry.cn-hangzhou.aliyuncs.com/google_containers --upload-certs |tee kubeadm-init.log
You can watch the NIC throughput in real time with ifstat while the images download. Once the init completes, configure kubectl as the output instructs, e.g.:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
sudo scp /run/systemd/resolve/resolv.conf openoker@192.168.1.110:/run/systemd/resolve/resolv.conf
The following join command is printed at the end of the master init output and is used to add node machines later; save the one generated for your own cluster, e.g.:
kubeadm join 192.168.1.109:6443 --token y07gsn.pd62oi5wbujgfnxs \
    --discovery-token-ca-cert-hash sha256:efc9349a6b0a5a2d2384ab5b568b9f35b3041f80d3474f5b9c631d4268feadb6
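If you lose this output or the token expires (tokens are valid for 24 hours by default), a fresh join command can be printed on the master at any time:
sudo kubeadm token create --print-join-command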
If the init fails, reset and re-initialize until it succeeds:
sudo kubeadm reset
sudo kubeadm init --kubernetes-version v1.20.1 --apiserver-advertise-address=192.168.1.109 --pod-network-cidr=10.244.0.0/16 --image-repository registry.cn-hangzhou.aliyuncs.com/google_containers --upload-certs |tee kubeadm-init.log
Enable kubelet at boot:
sudo systemctl enable kubelet
Start the kubelet service:
sudo systemctl start kubelet
Install the flannel network plugin:
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
kubectl apply -f kube-flannel.yml
#restart the kube-proxy pods
kubectl get pod -n kube-system | grep kube-proxy |awk '{system("kubectl delete pod "$1" -n kube-system")}'
Check status:
kubectl get pod -n kube-system
Sometimes the two coreDNS containers fail to start.
Run sudo vim /run/flannel/subnet.env and add the following:
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
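After restoring subnet.env, deleting the stuck coreDNS pods lets them be recreated with the correct network configuration (k8s-app=kube-dns is the label coreDNS carries in kube-system):
kubectl -n kube-system delete pod -l k8s-app=kube-dns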
Allow the master to accept scheduled workloads as well:
kubectl get nodes
kubectl taint node xxx-nodename node-role.kubernetes.io/master- # use the master as a regular node too
kubectl taint node xxx-nodename node-role.kubernetes.io/master="":NoSchedule # restore the master to master-only
Flannel is used here because calico was tried in an earlier test environment and several pods never came up after installing Kubeflow, presumably due to networking issues. If you still want calico, it can be installed as follows.
Install the calico network plugin:
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
Check status:
kubectl get pod -n kube-system
If the plugin pods fail to start, reboot the machine and then run the following in order:
sudo ufw disable
setenforce 0
sudo swapoff -a
sudo systemctl restart docker
sudo systemctl restart kubelet.service
Add node nodes
Configure the node environment as follows.
Copy /etc/kubernetes/admin.conf from the master to the same directory on the node; run the remote copy command:
sudo scp /etc/kubernetes/admin.conf openoker@192.168.1.110:/etc/kubernetes
Note: run this command on the master; openoker@192.168.1.110 is the node user and IP, replace it with your own.
Then run the following in order on the node:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubeadm join 192.168.1.109:6443 --token y07gsn.pd62oi5wbujgfnxs \
    --discovery-token-ca-cert-hash sha256:efc9349a6b0a5a2d2384ab5b568b9f35b3041f80d3474f5b9c631d4268feadb6
Note: remember to disable swap on the node first.
Install the network plugin on the node the same way as on the master.
- Important extra configuration:
On the master, edit the Kubernetes manifests under /etc/kubernetes/manifests/.
In kube-apiserver.yaml, add the following flags:
# needed on versions below v1.21.x to fix the configmap "istio-ca-root-cert" not found error
- --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
- --service-account-issuer=kubernetes.default.svc
# needed on 1.20.x and later to fix NFS dynamic provisioning stuck in Pending
- --feature-gates=RemoveSelfLink=false
In kube-controller-manager.yaml and kube-scheduler.yaml, comment out the - --port=0 line.
Restart kubelet: sudo systemctl restart kubelet
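After kubelet restarts, it is worth confirming that the static control-plane pods come back up before moving on:
kubectl get pods -n kube-system
kubectl get nodes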
The Kubernetes cluster deployment is now complete.
Install Kubeflow
Set up the GPU
Save the following as nvidia-device-plugin.yml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nvidia-device-plugin-daemonset
namespace: kube-system
spec:
selector:
matchLabels:
name: nvidia-device-plugin-ds
updateStrategy:
type: RollingUpdate
template:
metadata:
# This annotation is deprecated. Kept here for backward compatibility
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
annotations:
scheduler.alpha.kubernetes.io/critical-pod: ""
labels:
name: nvidia-device-plugin-ds
spec:
tolerations:
# This toleration is deprecated. Kept here for backward compatibility
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
- key: CriticalAddonsOnly
operator: Exists
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/master
effect: NoSchedule
# Mark this pod as a critical add-on; when enabled, the critical add-on
# scheduler reserves resources for critical add-on pods so that they can
# be rescheduled after a failure.
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
priorityClassName: "system-node-critical"
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.9.0
name: nvidia-device-plugin-ctr
args: ["--fail-on-init-error=false"]
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
volumeMounts:
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
volumes:
- name: device-plugin
hostPath:
path: /var/lib/kubelet/device-plugins
Then run:
$ kubectl apply -f nvidia-device-plugin.yml
You can track the pod creation status with the following command:
$ kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-74kv8 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-75845 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-8nlsp 1/1 Running 0 2d4h
nvidia-device-plugin-daemonset-rnq8w 1/1 Running 0 2d4h
One pod is created for each machine in the cluster.
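To confirm the GPUs are now visible to the scheduler, check that the GPU nodes advertise the nvidia.com/gpu resource in their capacity:
kubectl describe node | grep -i "nvidia.com/gpu"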
Set up NFS and create PVs
Before installing Kubeflow, set up NFS and create PVs. Components such as mysql and katib need PVCs during the Kubeflow install; if the PVs are not created in advance, those pods stay in Pending.
Pick one machine as the NFS server.
sudo apt install nfs-kernel-server
#create the shared directories
sudo mkdir -p /data/nfs-kubeflow
cd /data/nfs-kubeflow
sudo mkdir v1
sudo mkdir v2
sudo mkdir v3
sudo mkdir v4
sudo mkdir v5
sudo chmod -R 777 /data/nfs-kubeflow
#configure the NFS exports
sudo vim /etc/exports
Add the following line:
/data/nfs-kubeflow *(insecure,rw,no_root_squash,no_all_squash,sync)
#apply the export configuration
sudo exportfs -r
#restart the rpcbind and NFS services
sudo service nfs-kernel-server restart
#check the exports
sudo showmount -e
# test from a node
sudo apt-get install nfs-common
sudo mkdir -p /data/nfs-kubeflow # the mount point must exist on the node
sudo mount -t nfs -o nolock,nfsvers=4,rsize=2048,wsize=2048,timeo=15 192.168.1.109:/data/nfs-kubeflow /data/nfs-kubeflow
If the NFS mount hangs (even kill -9 cannot stop the stuck process), repeatedly running "sudo umount -f /data/nfs-kubeflow/" usually releases it.
Create pv.yaml with the following content; replace path and server with your own values.
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv001
labels:
name: pv001
spec:
nfs:
path: /data/nfs-kubeflow/v1
server: 192.168.1.109
accessModes: ["ReadWriteMany","ReadWriteOnce"]
capacity:
storage: 20Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv002
labels:
name: pv002
spec:
nfs:
path: /data/nfs-kubeflow/v2
server: 192.168.1.109
accessModes: ["ReadWriteMany","ReadWriteOnce"]
capacity:
storage: 20Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv003
labels:
name: pv003
spec:
nfs:
path: /data/nfs-kubeflow/v3
server: 192.168.1.109
accessModes: ["ReadWriteMany","ReadWriteOnce"]
capacity:
storage: 30Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv004
labels:
name: pv004
spec:
nfs:
path: /data/nfs-kubeflow/v4
server: 192.168.1.109
accessModes: ["ReadWriteMany","ReadWriteOnce"]
capacity:
storage: 50Gi
Run kubectl apply -f pv.yaml to create the PVs.
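Confirm the PVs were registered and show an Available status:
kubectl get pv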
Install Kubeflow
wget https://github.com/kubeflow/manifests/archive/refs/tags/v1.3.1.zip
unzip v1.3.1.zip
wget https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
mv kustomize_3.2.0_linux_amd64 kustomize
chmod +x kustomize
sudo mv kustomize /usr/bin/
cd manifests-1.3.1/
#install cert-manager
kustomize build common/cert-manager/cert-manager/base | kubectl apply -f -
kustomize build common/cert-manager/kubeflow-issuer/base | kubectl apply -f -
#install istio
kustomize build common/istio-1-9/istio-crds/base | kubectl apply -f -
kustomize build common/istio-1-9/istio-namespace/base | kubectl apply -f -
kustomize build common/istio-1-9/istio-install/base | kubectl apply -f -
#install dex
kustomize build common/dex/overlays/istio | kubectl apply -f -
#install oidc authservice
kustomize build common/oidc-authservice/base | kubectl apply -f -
#install knative-serving
kubectl apply -f https://github.com/knative/serving/releases/download/v0.17.1/serving-crds.yaml
kustomize build common/knative/knative-serving/base | kubectl apply -f -
kustomize build common/istio-1-9/cluster-local-gateway/base | kubectl apply -f -
kustomize build common/knative/knative-eventing/base | kubectl apply -f -
#create the kubeflow namespace
kustomize build common/kubeflow-namespace/base | kubectl apply -f -
#create kubeflow-roles
kustomize build common/kubeflow-roles/base | kubectl apply -f -
#create the istio resources
kustomize build common/istio-1-9/kubeflow-istio-resources/base | kubectl apply -f -
#install pipeline
kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user | kubectl apply -f -
#install kfserving
kustomize build apps/kfserving/upstream/overlays/kubeflow | kubectl apply -f -
#install katib
kustomize build apps/katib/upstream/installs/katib-with-kubeflow | kubectl apply -f -
#install the kubeflow dashboard
kustomize build apps/centraldashboard/upstream/overlays/istio | kubectl apply -f -
#install admission-webhook
kustomize build apps/admission-webhook/upstream/overlays/cert-manager | kubectl apply -f -
#install the notebook controller
kustomize build apps/jupyter/notebook-controller/upstream/overlays/kubeflow | kubectl apply -f -
#install the jupyter web app
kustomize build apps/jupyter/jupyter-web-app/upstream/overlays/istio | kubectl apply -f -
#install kfam (profiles)
kustomize build apps/profiles/upstream/overlays/kubeflow | kubectl apply -f -
#install the volumes web app
kustomize build apps/volumes-web-app/upstream/overlays/istio | kubectl apply -f -
#install tensorboard
kustomize build apps/tensorboard/tensorboards-web-app/upstream/overlays/istio | kubectl apply -f -
kustomize build apps/tensorboard/tensorboard-controller/upstream/overlays/kubeflow | kubectl apply -f -
#install the various training operators
kustomize build apps/tf-training/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/pytorch-job/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/mpi-job/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/mxnet-job/upstream/overlays/kubeflow | kubectl apply -f -
kustomize build apps/xgboost-job/upstream/overlays/kubeflow | kubectl apply -f -
#create the user namespace
kustomize build common/user-namespace/base | kubectl apply -f -
Wait until all pods are in the Running state.
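The pods are spread across several namespaces; a quick way to check them all (these are the namespaces created by the Kubeflow 1.3 manifests):
kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com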
Direct access (optional): kubectl port-forward svc/istio-ingressgateway --address 0.0.0.0 -n istio-system 8080:80
To expose the UI on a fixed NodePort 30000 instead, create kubeflow-ui-nodeport.yaml:
apiVersion: v1
kind: Service
metadata:
labels:
app: istio-ingressgateway
install.operator.istio.io/owning-resource: unknown
istio: ingressgateway
istio.io/rev: default
operator.istio.io/component: IngressGateways
release: istio
name: istio-ingressgateway
namespace: istio-system
spec:
ports:
- name: status-port
port: 15021
protocol: TCP
targetPort: 15021
- name: http2
port: 80
protocol: TCP
targetPort: 8080
nodePort: 30000
- name: https
port: 443
protocol: TCP
targetPort: 8443
- name: tcp
port: 31400
protocol: TCP
targetPort: 31400
- name: tls
port: 15443
protocol: TCP
targetPort: 15443
selector:
app: istio-ingressgateway
istio: ingressgateway
type: NodePort
Run kubectl apply -f kubeflow-ui-nodeport.yaml
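Check that the service is now of type NodePort and that port 80 maps to 30000:
kubectl -n istio-system get svc istio-ingressgateway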
The Kubeflow 1.3 deployment is now complete and can be reached at http://ip:30000. The default account is user@example.com with password 12341234.
Set up a StorageClass
By default, creating a notebook server fails because no default StorageClass is configured.
Create storage-nfs.yaml; replace the NFS_SERVER and NFS_PATH values with your own.
apiVersion: v1
kind: ServiceAccount
metadata:
name: nfs-client-provisioner
---
kind: Deployment
apiVersion: apps/v1
metadata:
name: nfs-client-provisioner
spec:
replicas: 1
strategy:
type: Recreate
selector:
matchLabels:
app: nfs-client-provisioner
template:
metadata:
labels:
app: nfs-client-provisioner
spec:
serviceAccount: nfs-client-provisioner
containers:
- name: nfs-client-provisioner
image: quay.io/external_storage/nfs-client-provisioner:latest
volumeMounts:
- name: nfs-client-root
mountPath: /persistentvolumes
env:
- name: PROVISIONER_NAME
value: nfs-client
- name: NFS_SERVER
value: 192.168.1.109
- name: NFS_PATH
value: /data/nfs-kubeflow/v5
volumes:
- name: nfs-client-root
nfs:
server: 192.168.1.109
path: /data/nfs-kubeflow/v5
Create storage-rbac.yaml:
kind: ServiceAccount
apiVersion: v1
metadata:
name: nfs-client-provisioner
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: nfs-client-provisioner-runner
rules:
- apiGroups: [""]
resources: ["persistentvolumes"]
verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["create", "update", "patch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: run-nfs-client-provisioner
subjects:
- kind: ServiceAccount
name: nfs-client-provisioner
namespace: default #替换成你要部署NFS Provisioner的 Namespace
roleRef:
kind: ClusterRole
name: nfs-client-provisioner-runner
apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: leader-locking-nfs-client-provisioner
rules:
- apiGroups: [""]
resources: ["endpoints"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: leader-locking-nfs-client-provisioner
subjects:
- kind: ServiceAccount
name: nfs-client-provisioner
namespace: default #替换成你要部署NFS Provisioner的 Namespace
roleRef:
kind: Role
name: leader-locking-nfs-client-provisioner
apiGroup: rbac.authorization.k8s.io
Create storage-class.yaml:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-storage
  annotations:
    storageclass.kubernetes.io/is-default-class: "true" # make this the default StorageClass
provisioner: nfs-client # dynamic provisioner name; must match the PROVISIONER_NAME value set above
parameters:
  archiveOnDelete: "true" # "false" deletes the data when the PVC is deleted; "true" keeps it
allowVolumeExpansion: true # allow PVCs to be expanded after creation
Run the following in order:
kubectl apply -f storage-nfs.yaml
kubectl apply -f storage-rbac.yaml
kubectl apply -f storage-class.yaml
kubectl patch storageclass nfs-storage -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
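Verify that nfs-storage is now marked as the default StorageClass (it should show "(default)" next to its name):
kubectl get storageclass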
Set up HTTPS access
If the service is exposed beyond localhost via NodePort / LoadBalancer / Ingress, HTTPS must be used; otherwise the main page opens but notebooks, ssh connections, etc. fail. Create the self-signed SSL certificate script create_self-signed-cert.sh:
#!/bin/bash -e
help ()
{
echo ' ================================================================ '
echo ' --ssl-domain: the main domain for the SSL certificate; defaults to www.rancher.local if not set; can be ignored if the service is accessed by IP;'
echo ' --ssl-trusted-ip: an SSL certificate normally only trusts domain-based requests; to access the server by IP, add extended IPs here, separated by commas;'
echo ' --ssl-trusted-domain: to allow access via additional domains, add extended domains (SSL_TRUSTED_DOMAIN), separated by commas;'
echo ' --ssl-size: SSL key size, default 2048;'
echo ' --ssl-cn: country code (2-letter code), default CN;'
echo ' usage example:'
echo ' ./create_self-signed-cert.sh --ssl-domain=www.test.com --ssl-trusted-domain=www.test2.com \ '
echo ' --ssl-trusted-ip=1.1.1.1,2.2.2.2,3.3.3.3 --ssl-size=2048 --ssl-date=3650'
echo ' ================================================================'
}
case "$1" in
-h|--help) help; exit;;
esac
if [[ $1 == '' ]];then
help;
exit;
fi
CMDOPTS="$*"
for OPTS in $CMDOPTS;
do
key=$(echo ${OPTS} | awk -F"=" '{print $1}' )
value=$(echo ${OPTS} | awk -F"=" '{print $2}' )
case "$key" in
--ssl-domain) SSL_DOMAIN=$value ;;
--ssl-trusted-ip) SSL_TRUSTED_IP=$value ;;
--ssl-trusted-domain) SSL_TRUSTED_DOMAIN=$value ;;
--ssl-size) SSL_SIZE=$value ;;
--ssl-date) SSL_DATE=$value ;;
--ca-date) CA_DATE=$value ;;
--ssl-cn) CN=$value ;;
esac
done
# CA-related settings
CA_DATE=${CA_DATE:-3650}
CA_KEY=${CA_KEY:-cakey.pem}
CA_CERT=${CA_CERT:-cacerts.pem}
CA_DOMAIN=cattle-ca
# SSL-related settings
SSL_CONFIG=${SSL_CONFIG:-$PWD/openssl.cnf}
SSL_DOMAIN=${SSL_DOMAIN:-'www.rancher.local'}
SSL_DATE=${SSL_DATE:-3650}
SSL_SIZE=${SSL_SIZE:-2048}
## country code (2-letter code), default CN;
CN=${CN:-CN}
SSL_KEY=$SSL_DOMAIN.key
SSL_CSR=$SSL_DOMAIN.csr
SSL_CERT=$SSL_DOMAIN.crt
echo -e "\033[32m ---------------------------- \033[0m"
echo -e "\033[32m | 生成 SSL Cert | \033[0m"
echo -e "\033[32m ---------------------------- \033[0m"
if [[ -e ./${CA_KEY} ]]; then
echo -e "\033[32m ====> 1. 发现已存在CA私钥,备份"${CA_KEY}"为"${CA_KEY}"-bak,然后重新创建 \033[0m"
mv ${CA_KEY} "${CA_KEY}"-bak
openssl genrsa -out ${CA_KEY} ${SSL_SIZE}
else
echo -e "\033[32m ====> 1. 生成新的CA私钥 ${CA_KEY} \033[0m"
openssl genrsa -out ${CA_KEY} ${SSL_SIZE}
fi
if [[ -e ./${CA_CERT} ]]; then
echo -e "\033[32m ====> 2. 发现已存在CA证书,先备份"${CA_CERT}"为"${CA_CERT}"-bak,然后重新创建 \033[0m"
mv ${CA_CERT} "${CA_CERT}"-bak
openssl req -x509 -sha256 -new -nodes -key ${CA_KEY} -days ${CA_DATE} -out ${CA_CERT} -subj "/C=${CN}/CN=${CA_DOMAIN}"
else
echo -e "\033[32m ====> 2. 生成新的CA证书 ${CA_CERT} \033[0m"
openssl req -x509 -sha256 -new -nodes -key ${CA_KEY} -days ${CA_DATE} -out ${CA_CERT} -subj "/C=${CN}/CN=${CA_DOMAIN}"
fi
echo -e "\033[32m ====> 3. 生成Openssl配置文件 ${SSL_CONFIG} \033[0m"
cat > ${SSL_CONFIG} <<EOM
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
EOM
if [[ -n ${SSL_TRUSTED_IP} || -n ${SSL_TRUSTED_DOMAIN} ]]; then
cat >> ${SSL_CONFIG} <<EOM
subjectAltName = @alt_names
[alt_names]
EOM
IFS=","
dns=(${SSL_TRUSTED_DOMAIN})
dns+=(${SSL_DOMAIN})
for i in "${!dns[@]}"; do
echo DNS.$((i+1)) = ${dns[$i]} >> ${SSL_CONFIG}
done
if [[ -n ${SSL_TRUSTED_IP} ]]; then
ip=(${SSL_TRUSTED_IP})
for i in "${!ip[@]}"; do
echo IP.$((i+1)) = ${ip[$i]} >> ${SSL_CONFIG}
done
fi
fi
echo -e "\033[32m ====> 4. 生成服务SSL KEY ${SSL_KEY} \033[0m"
openssl genrsa -out ${SSL_KEY} ${SSL_SIZE}
echo -e "\033[32m ====> 5. 生成服务SSL CSR ${SSL_CSR} \033[0m"
openssl req -sha256 -new -key ${SSL_KEY} -out ${SSL_CSR} -subj "/C=${CN}/CN=${SSL_DOMAIN}" -config ${SSL_CONFIG}
echo -e "\033[32m ====> 6. 生成服务SSL CERT ${SSL_CERT} \033[0m"
openssl x509 -sha256 -req -in ${SSL_CSR} -CA ${CA_CERT} \
-CAkey ${CA_KEY} -CAcreateserial -out ${SSL_CERT} \
-days ${SSL_DATE} -extensions v3_req \
-extfile ${SSL_CONFIG}
echo -e "\033[32m ====> 7. 证书制作完成 \033[0m"
echo
echo -e "\033[32m ====> 8. 以YAML格式输出结果 \033[0m"
echo "----------------------------------------------------------"
echo "ca_key: |"
cat $CA_KEY | sed 's/^/ /'
echo
echo "ca_cert: |"
cat $CA_CERT | sed 's/^/ /'
echo
echo "ssl_key: |"
cat $SSL_KEY | sed 's/^/ /'
echo
echo "ssl_csr: |"
cat $SSL_CSR | sed 's/^/ /'
echo
echo "ssl_cert: |"
cat $SSL_CERT | sed 's/^/ /'
echo
echo -e "\033[32m ====> 9. 附加CA证书到Cert文件 \033[0m"
cat ${CA_CERT} >> ${SSL_CERT}
echo "ssl_cert: |"
cat $SSL_CERT | sed 's/^/ /'
echo
echo -e "\033[32m ====> 10. 重命名服务证书 \033[0m"
echo "cp ${SSL_DOMAIN}.key tls.key"
cp ${SSL_DOMAIN}.key tls.key
echo "cp ${SSL_DOMAIN}.crt tls.crt"
cp ${SSL_DOMAIN}.crt tls.crt
chmod +x create_self-signed-cert.sh
./create_self-signed-cert.sh --ssl-domain=kubeflow.cn
kubectl create --namespace istio-system secret tls kf-tls-cert --key /root/ssl/kubeflow.cn.key --cert /root/ssl/kubeflow.cn.crt
kubectl edit cm config-domain --namespace knative-serving
#under data, add: kubeflow.cn: ""
Create kubeflow-https.yaml:
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
name: kubeflow-gateway
namespace: kubeflow
spec:
selector:
istio: ingressgateway
servers:
- hosts:
- '*'
port:
name: http
number: 80
protocol: HTTP
- hosts:
- '*'
port:
name: https
number: 443
protocol: HTTPS
tls:
mode: SIMPLE
credentialName: kf-tls-cert
Run kubectl apply -f kubeflow-https.yaml
Run kubectl -n istio-system get service istio-ingressgateway to find the HTTPS NodePort; here it is 32143.
Kubeflow can then be accessed over HTTPS at https://ip:32143.
Multi-user isolation
After Kubeflow is installed, only the user@example.com account can log in by default. The steps below add more accounts; once set up, each account gets its own namespace and profile.
First create a Profile for each user. The YAML below creates two profiles whose namespaces are test1 and test2.
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
name: test1
spec:
owner:
kind: User
name: test1@example.com
---
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
name: test2
spec:
owner:
kind: User
name: test2@example.com
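Save the two Profile manifests above to a file and apply it (the file name profiles.yaml is just an example):
kubectl apply -f profiles.yaml
kubectl get profiles # the profile controller creates the test1 and test2 namespaces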
Run kubectl get configmap dex -n auth -o jsonpath='{.data.config.yaml}' > dex-yaml.yaml to dump the dex user configuration into dex-yaml.yaml. The user@example.com account was created at deployment time; edit the file and add entries for test1 and test2.
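The entries to add go under the staticPasswords list in dex-yaml.yaml and mirror the existing user@example.com entry. The hash field is a bcrypt hash of the password; the values below are placeholders (one way to generate a hash is: htpasswd -bnBC 10 "" <password> | tr -d ':\n'), e.g.:
staticPasswords:
- email: user@example.com # existing entry, keep as is
  hash: ...
  username: user
  userID: ...
- email: test1@example.com # new entry
  hash: <bcrypt hash of the password for test1>
  username: test1
  userID: test1
- email: test2@example.com # new entry
  hash: <bcrypt hash of the password for test2>
  username: test2
  userID: test2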
Run kubectl create configmap dex --from-file=config.yaml=dex-yaml.yaml -n auth --dry-run=client -o yaml | kubectl apply -f - to apply the new users.
Run kubectl rollout restart deployment dex -n auth to restart dex; the multi-user setup is then complete.
Using minio (optional)
# get the accesskey and secretkey
kubectl get secret mlpipeline-minio-artifact --namespace=kubeflow -o yaml
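The accesskey and secretkey values in the secret are base64-encoded; decode them before logging in to the minio console (the key names accesskey/secretkey are those used by the default Kubeflow pipelines install):
kubectl get secret mlpipeline-minio-artifact -n kubeflow -o jsonpath='{.data.accesskey}' | base64 -d; echo
kubectl get secret mlpipeline-minio-artifact -n kubeflow -o jsonpath='{.data.secretkey}' | base64 -d; echo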
kubectl port-forward -n kubeflow svc/minio-service --address 0.0.0.0 9000:9000
# open port 9000 in a browser
Install Harbor (optional)
While using Kubeflow we need a private image registry to hold the images produced along the way. Here Harbor is deployed as that registry (a plain docker registry also works). Do this on one machine only.
- Install docker-compose:
wget https://github.com/docker/compose/releases/download/1.29.2/docker-compose-Linux-x86_64
mv docker-compose-Linux-x86_64 docker-compose
chmod +x docker-compose
mv docker-compose /usr/bin/
- Verify the installation with docker-compose version.
wget https://github.com/goharbor/harbor/releases/download/v1.9.2/harbor-offline-installer-v1.9.2.tgz
tar -zxvf harbor-offline-installer-v1.9.2.tgz
cd harbor
vim harbor.yml
./install.sh
- Set hostname and port in harbor.yml before running install.sh
Open http://43.128.14.116/ to reach the Harbor UI; here 43.128.14.116 is the server's public address and 172.19.0.14 its internal address. The default account is admin with password Harbor12345. After logging in, create a public project named kubeflow to use as the private repository.
- Edit the Docker config file /etc/docker/daemon.json, add the following, then restart Docker:
{ "insecure-registries": ["172.19.0.14:86"] }
docker login -u admin -p Harbor12345 172.19.0.14:86
Harbor is now set up; restart it with docker-compose up -d when needed.
- Pushing and pulling images to/from Harbor
#1. tag the image
docker tag {image}:{tag} {harbor-address}:{port}/{harbor-project}/{custom-image-name}:{custom-tag}
#e.g.: docker tag vmware/harbor-adminserver:v1.1.0 172.19.0.14:86/test/harbor-adminserver:v1.1.0
#2. push to Harbor
docker push {harbor-address}:{port}/{harbor-project}/{custom-image-name}:{custom-tag}
#e.g.: docker push 172.19.0.14:86/test/harbor-adminserver:v1.1.0
#3. pull to the local machine
docker pull 172.19.0.14:86/test/harbor-adminserver:v1.1.0
Common commands
- Delete pods in Evicted state
kubectl get pods -nkubeflow | grep Evicted | awk '{print $1}' | xargs kubectl delete pod -nkubeflow
Common issues
- Problems connecting to Ubuntu 20.04 over ssh
Run sudo vim /etc/ssh/sshd_config and add the following at the end:
KexAlgorithms curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group14-sha1
sudo systemctl restart sshd
- no available network
Docker fails to start after installation with the following error:
Error starting daemon:Error initializing network controller: list bridge addresses failed: no available network
Run the following commands:
sudo ip link add name docker0 type bridge
sudo ip addr add dev docker0 172.17.42.1/16
Restarting Docker then resolves it.
- Error: unable to recognize "STDIN": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
Option 1: roll Kubernetes back to a version below v1.22
Option 2: replace apiextensions.k8s.io/v1beta1 with apiextensions.k8s.io/v1
- Error: error: unable to recognize "STDIN": no matches for kind "CompositeController" in version "metacontroller.k8s.io/v1alpha1"
In apps/pipeline/upstream/env/platform-agnostic-multi-user/kustomization.yaml, replace metacontroller.k8s.io/v1alpha1 with kustomize.config.k8s.io/v1beta1
- Error: failed to set bridge addr: "cni0" already has an IP address different from 10.244.1.1/24
Run on all nodes:
ifconfig cni0 down
ip link delete cni0
- Error: error: unable to upgrade connection: container not found
The IP of the worker node hosting the container cannot be found; adjust the kubelet configuration on the master and worker nodes by running: sudo vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
Add the following line:
Environment="KUBELET_EXTRA_ARGS=--node-ip=192.168.1.110"
Note: replace "192.168.1.110" with the node's own network address.
Restart kubelet by running:
sudo systemctl daemon-reload && sudo systemctl restart kubelet.service
- Error: MountVolume.SetUp failed for volume "mlpipeline-minio-artifact" : secret "mlpipeline-minio-artifact" not found
kubectl get secret mlpipeline-minio-artifact --namespace=kubeflow -o yaml | sed 's/namespace: kubeflow/namespace: kubeflow-user-example-com/' | kubectl create -f -