Containerized Ceph: a breeze to deploy, a nightmare to operate~
Rook is an open-source cloud-native storage orchestrator. It provides a platform, a framework, and support for a variety of storage solutions, enabling them to integrate natively with cloud-native environments.
Rook turns storage software into self-managing, self-scaling, and self-healing storage services. It does this by automating deployment, bootstrapping, configuration, provisioning, scaling, upgrades, migration, disaster recovery, monitoring, and resource management, building on the capabilities that the underlying cloud-native container management, scheduling, and orchestration platform provides.
Rook integrates deeply into cloud-native environments through their extension points, offering a seamless experience for scheduling, lifecycle management, resource management, security, monitoring, and more. For details on the current status of the storage solutions Rook supports, see the project overview in the Rook repository. Ceph, in particular, is already well supported.
Rook architecture diagram
Intuitively, Rook is a storage adaptation layer, a framework: upward it serves Kubernetes storage needs, downward it uniformly adapts to and manages the underlying storage software.
Rook currently supports deploying several kinds of storage clusters, mainly:
- Ceph, a highly scalable distributed storage solution for block storage, object storage, and shared file systems, with years of production deployments behind it.
- NFS, which lets remote hosts mount file systems over the network and interact with them as if they were mounted locally.
- Cassandra, a highly available NoSQL database with lightning-fast performance, tunable data consistency, and massive scalability.
Each of these storage systems has its own Kubernetes-based Operator, so they can run "all in K8S" and be truly cloud native.
為什么要使用Rook
云原生是大勢所趨,應用以容器化作為交付標準越來成為事實標準,存儲類的應用也不例外。基礎設施圍繞基于K8S的“云”操作系統來建設,逐漸在技術圈內達成了共識。使用Rook進行存儲管控, 可以解決以下問題:
- 本身有基于K8S的云原生基礎設施,可以直接接入存儲管理,實現統一化
- 能夠快速部署一套云原生存儲集群
- 平臺化管理云原生存儲集群,包括存儲的擴容、升級、監控、災難恢復等全生命周期管理
Environment
Test environment:
- Kubernetes: v1.19.9
- Docker: 20.10.11
- Rook: release-1.4
The Kubernetes environment can be deployed with minikube or kubeadm. Here I used kainstall; I highly recommend my friend @lework's kainstall, a shell wrapper around kubeadm that builds a production-grade Kubernetes cluster with a single script.
[root@k8s-master-node1 ~]# kubectl get nodes -o wide
NAME               STATUS   ROLES    AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION          CONTAINER-RUNTIME
k8s-master-node1   Ready    master   244d   v1.19.9   172.16.8.80     <none>        CentOS Linux 7 (Core)   3.10.0-862.el7.x86_64   docker://20.10.5
k8s-master-node2   Ready    master   244d   v1.19.9   172.16.8.81     <none>        CentOS Linux 7 (Core)   3.10.0-862.el7.x86_64   docker://20.10.5
k8s-master-node3   Ready    master   244d   v1.19.9   172.16.8.82     <none>        CentOS Linux 7 (Core)   3.10.0-862.el7.x86_64   docker://20.10.5
k8s-worker-node1   Ready    worker   244d   v1.19.9   172.16.8.83     <none>        CentOS Linux 7 (Core)   3.10.0-862.el7.x86_64   docker://20.10.5
k8s-worker-node2   Ready    worker   21h    v1.19.9   172.16.49.210   <none>        CentOS Linux 7 (Core)   3.10.0-957.el7.x86_64   docker://20.10.11
k8s-worker-node3   Ready    worker   21h    v1.19.9   172.16.49.211   <none>        CentOS Linux 7 (Core)   3.10.0-957.el7.x86_64   docker://20.10.11
k8s-worker-node4   Ready    worker   21h    v1.19.9   172.16.49.212   <none>        CentOS Linux 7 (Core)   3.10.0-957.el7.x86_64   docker://20.10.11
Note:
Each of k8s-worker-node{2,3,4} has a spare vdb data disk.
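Rook will only turn a disk into an OSD if it is completely raw (no partition table, no filesystem), so it's worth checking the vdb devices up front. A quick sketch using standard tools, run on each of k8s-worker-node{2,3,4}:
lsblk -f /dev/vdb        # the FSTYPE column must be empty for Rook to consume the disk
# wipefs -a /dev/vdb     # destructive: clears leftover signatures if the disk was used before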
Deploy Rook and the Ceph cluster
Fetch the desired release of Rook from GitHub:
[root@k8s-master-node1 /opt]# git clone -b release-1.4 https://github.com/rook/rook.git
Cloning into 'rook'...
remote: Enumerating objects: 91504, done.
remote: Counting objects: 100% (389/389), done.
remote: Compressing objects: 100% (237/237), done.
remote: Total 91504 (delta 176), reused 327 (delta 144), pack-reused 91115
Receiving objects: 100% (91504/91504), 45.43 MiB | 4.41 MiB/s, done.
Resolving deltas: 100% (63525/63525), done.
Enter the ceph directory in the rook repo and deploy Rook plus the Ceph cluster:
cd rook/cluster/examples/kubernetes/ceph
kubectl create -f common.yaml -f operator.yaml
kubectl create -f cluster.yaml
Notes:
- common.yaml mainly contains the RBAC rules and the CRD definitions
- operator.yaml is the Deployment for rook-ceph-operator
- cluster.yaml instantiates the cephclusters.ceph.rook.io CRD, i.e. it deploys a complete Ceph cluster
- With no customization, the cluster starts 3 mons by default, and any node holding a spare raw disk gets that disk automatically initialized as an OSD (by default you need at least 3 nodes, each with at least one spare disk)
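OSD preparation can take a few minutes. While the operator works, it helps to watch the pods come up and, if anything stalls, follow the operator log (plain kubectl; the rook-ceph namespace is created by common.yaml):
kubectl -n rook-ceph get pods -w                         # watch pods reach Running/Completed
kubectl -n rook-ceph logs deploy/rook-ceph-operator -f   # operator log, useful when OSDs don't appear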
Once the deployment finishes, you can check the pod status in the rook-ceph namespace with kubectl:
[root@k8s-master-node1 /tmp/rook/cluster/examples/kubernetes/ceph]# kubectl get pods -n rook-ceph -o wide
NAME                                                         READY   STATUS      RESTARTS   AGE     IP               NODE               NOMINATED NODE   READINESS GATES
csi-cephfsplugin-4q4j5                                       3/3     Running     0          21h     172.16.49.211    k8s-worker-node3   <none>           <none>
csi-cephfsplugin-c8sdw                                       3/3     Running     0          21h     172.16.8.82      k8s-master-node3   <none>           <none>
csi-cephfsplugin-provisioner-56d8446896-4c68j                6/6     Running     0          21h     10.244.91.69     k8s-worker-node4   <none>           <none>
csi-cephfsplugin-provisioner-56d8446896-rq2j2                6/6     Running     0          21h     10.244.219.71    k8s-worker-node2   <none>           <none>
csi-cephfsplugin-r9cqt                                       3/3     Running     0          21h     172.16.8.80      k8s-master-node1   <none>           <none>
csi-cephfsplugin-sdxm5                                       3/3     Running     0          21h     172.16.49.212    k8s-worker-node4   <none>           <none>
csi-cephfsplugin-sntm4                                       3/3     Running     0          21h     172.16.49.210    k8s-worker-node2   <none>           <none>
csi-cephfsplugin-stkg4                                       3/3     Running     0          21h     172.16.8.83      k8s-worker-node1   <none>           <none>
csi-cephfsplugin-v88d6                                       3/3     Running     0          21h     172.16.8.81      k8s-master-node2   <none>           <none>
csi-rbdplugin-bnmhp                                          3/3     Running     0          21h     172.16.8.82      k8s-master-node3   <none>           <none>
csi-rbdplugin-grw9c                                          3/3     Running     0          21h     172.16.8.80      k8s-master-node1   <none>           <none>
csi-rbdplugin-p47n6                                          3/3     Running     0          21h     172.16.49.210    k8s-worker-node2   <none>           <none>
csi-rbdplugin-provisioner-569c75558-4hw9d                    6/6     Running     0          21h     10.244.198.197   k8s-worker-node3   <none>           <none>
csi-rbdplugin-provisioner-569c75558-62ds8                    6/6     Running     0          21h     10.244.219.70    k8s-worker-node2   <none>           <none>
csi-rbdplugin-s56gp                                          3/3     Running     0          21h     172.16.49.211    k8s-worker-node3   <none>           <none>
csi-rbdplugin-vhjv7                                          3/3     Running     0          21h     172.16.49.212    k8s-worker-node4   <none>           <none>
csi-rbdplugin-xg48n                                          3/3     Running     0          21h     172.16.8.81      k8s-master-node2   <none>           <none>
csi-rbdplugin-zb6b9                                          3/3     Running     0          21h     172.16.8.83      k8s-worker-node1   <none>           <none>
rook-ceph-crashcollector-k8s-worker-node2-bbd9587f9-hvq92    1/1     Running     0          21h     10.244.219.74    k8s-worker-node2   <none>           <none>
rook-ceph-crashcollector-k8s-worker-node3-65bb549b8b-8z4q2   1/1     Running     0          21h     10.244.198.202   k8s-worker-node3   <none>           <none>
rook-ceph-crashcollector-k8s-worker-node4-8457f67c97-29wgn   1/1     Running     0          21h     10.244.91.72     k8s-worker-node4   <none>           <none>
rook-ceph-mgr-a-749575fc54-dtbpw                             1/1     Running     0          21h     10.244.198.198   k8s-worker-node3   <none>           <none>
rook-ceph-mon-a-59f6565594-nxlbv                             1/1     Running     0          21h     10.244.198.196   k8s-worker-node3   <none>           <none>
rook-ceph-mon-b-688948c479-j7hcj                             1/1     Running     0          21h     10.244.91.68     k8s-worker-node4   <none>           <none>
rook-ceph-mon-c-7b7c6fffd7-h5hk6                             1/1     Running     0          21h     10.244.219.69    k8s-worker-node2   <none>           <none>
rook-ceph-operator-864f5d5868-gsww8                          1/1     Running     0          22h     10.244.91.65     k8s-worker-node4   <none>           <none>
rook-ceph-osd-0-6b74867f6b-2qwnv                             1/1     Running     0          21h     10.244.219.73    k8s-worker-node2   <none>           <none>
rook-ceph-osd-1-65596bf48-6lxxv                              1/1     Running     0          21h     10.244.91.71     k8s-worker-node4   <none>           <none>
rook-ceph-osd-2-5bc6788b7f-z2rzv                             1/1     Running     0          21h     10.244.198.201   k8s-worker-node3   <none>           <none>
rook-ceph-osd-prepare-k8s-master-node1-4kxg8                 0/1     Completed   0          3h12m   10.244.236.163   k8s-master-node1   <none>           <none>
rook-ceph-osd-prepare-k8s-master-node2-tztm9                 0/1     Completed   0          3h12m   10.244.237.101   k8s-master-node2   <none>           <none>
rook-ceph-osd-prepare-k8s-master-node3-768v5                 0/1     Completed   0          3h12m   10.244.113.222   k8s-master-node3   <none>           <none>
rook-ceph-osd-prepare-k8s-worker-node1-dlljc                 0/1     Completed   0          3h12m   10.244.50.240    k8s-worker-node1   <none>           <none>
rook-ceph-osd-prepare-k8s-worker-node2-qszkt                 0/1     Completed   0          3h12m   10.244.219.79    k8s-worker-node2   <none>           <none>
rook-ceph-osd-prepare-k8s-worker-node3-krxqc                 0/1     Completed   0          3h12m   10.244.198.210   k8s-worker-node3   <none>           <none>
rook-ceph-osd-prepare-k8s-worker-node4-l77ds                 0/1     Completed   0          3h12m   10.244.91.78     k8s-worker-node4   <none>           <none>
rook-ceph-tools-5949d6759-lbj74                              1/1     Running     0          21h     10.244.50.234    k8s-worker-node1   <none>           <none>
rook-discover-cjpxh                                          1/1     Running     0          22h     10.244.198.193   k8s-worker-node3   <none>           <none>
rook-discover-lw96w                                          1/1     Running     0          22h     10.244.91.66     k8s-worker-node4   <none>           <none>
rook-discover-m7jzr                                          1/1     Running     0          22h     10.244.236.157   k8s-master-node1   <none>           <none>
rook-discover-mbqtx                                          1/1     Running     0          22h     10.244.237.95    k8s-master-node2   <none>           <none>
rook-discover-r4m6h                                          1/1     Running     0          22h     10.244.50.232    k8s-worker-node1   <none>           <none>
rook-discover-xwml2                                          1/1     Running     0          22h     10.244.113.216   k8s-master-node3   <none>           <none>
rook-discover-xzw2z                                          1/1     Running     0          22h     10.244.219.66    k8s-worker-node2   <none>           <none>
Note:
1. A successful deployment includes both the Rook components and the ceph-csi components (the rbd and cephfs plugins are both deployed)
Ceph Dashboard
The Ceph mgr ships a Dashboard module. Through this panel we can view the cluster state: overall health, the status of the mgr, osd, and other Ceph daemons, pool and PG status, daemon logs, and so on. Below is the default configuration in cluster.yaml:
dashboard:
  enabled: true
  # serve the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy)
  # urlPrefix: /ceph-dashboard
  # serve the dashboard at the given port.
  # port: 8443
  # serve the dashboard using SSL
  ssl: true
Notes:
- The dashboard module is enabled by default; inside Ceph, the status of the various mgr modules can be checked with ceph mgr module ls
- The default access path is /; a URL prefix can be set via urlPrefix
- SSL is enabled by default, and the port is 8443
After Rook is deployed successfully, the following Services are available:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# kubectl get service -n rook-ceph
NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
rook-ceph-mgr             ClusterIP   10.96.227.178   <none>        9283/TCP            3d16h
rook-ceph-mgr-dashboard   ClusterIP   10.96.159.108   <none>        8443/TCP            3d16h
rook-ceph-mon-a           ClusterIP   10.96.31.17     <none>        6789/TCP,3300/TCP   3d17h
rook-ceph-mon-b           ClusterIP   10.96.176.163   <none>        6789/TCP,3300/TCP   3d17h
rook-ceph-mon-c           ClusterIP   10.96.146.28    <none>        6789/TCP,3300/TCP   3d17h
The rook-ceph-mgr Service exposes Prometheus-format metrics, while rook-ceph-mgr-dashboard is the Ceph dashboard service. From inside the cluster it can be reached via the DNS name https://rook-ceph-mgr-dashboard.rook-ceph:8443 or the ClusterIP https://10.96.159.108:8443. Usually, though, the dashboard needs to be reached from an external browser, which means exposing it via an Ingress or a NodePort-type Service. Rook has thoughtfully prepared the relevant Service manifests for us:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# ll dashboard-*
-rw-r--r-- 1 root root 363 Nov 30 14:10 dashboard-external-https.yaml
-rw-r--r-- 1 root root 362 Nov 30 14:10 dashboard-external-http.yaml
-rw-r--r-- 1 root root 839 Nov 30 14:10 dashboard-ingress-https.yaml
-rw-r--r-- 1 root root 365 Nov 30 14:10 dashboard-loadbalancer.yaml
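Before committing to any of these, a temporary port-forward against the built-in dashboard Service also works for a quick one-off look (a sketch; it runs until you Ctrl-C it):
kubectl -n rook-ceph port-forward service/rook-ceph-mgr-dashboard 8443:8443
# then browse to https://localhost:8443 on the machine running kubectl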
Here we pick the NodePort-type Service:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# cat dashboard-external-https.yaml
apiVersion: v1
kind: Service
metadata:
  name: rook-ceph-mgr-dashboard-external-https
  namespace: rook-ceph
  labels:
    app: rook-ceph-mgr
    rook_cluster: rook-ceph
spec:
  ports:
    - name: dashboard
      port: 8443
      protocol: TCP
      targetPort: 8443
  selector:
    app: rook-ceph-mgr
    rook_cluster: rook-ceph
  sessionAffinity: None
  type: NodePort
After creating it, the Service is visible as shown below; 49096 is the external NodePort. The Ceph dashboard can now be reached from a browser at https://<NodeIP>:49096.
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# kubectl get svc -n rook-ceph | grep dash
rook-ceph-mgr-dashboard                  ClusterIP   10.96.159.108   <none>   8443/TCP         3d17h
rook-ceph-mgr-dashboard-external-https   NodePort    10.96.83.5      <none>   8443:49096/TCP   2m

Caveats:
- The certificate is self-signed, so the browser needs to be told to trust it manually
- The dashboard differs slightly between Ceph versions; this environment runs ceph version 15.2.8 octopus (stable)
The default username is admin, and the password lives in the rook-ceph-dashboard-password Secret in the rook-ceph namespace. It can be recovered in plain text like this:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode && echo
xxxxxxx   ## your password ##
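If you'd rather set your own password, the dashboard module can do so from the toolbox (introduced in the next section). Note that the exact syntax varies by Ceph release; in Octopus the password is read from a file (a sketch, /tmp/pass being a throwaway file):
echo -n 'MyNewPassw0rd' > /tmp/pass
ceph dashboard ac-user-set-password admin -i /tmp/pass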
After logging in successfully, the dashboard overview looks like this:

Rook Toolbox
To verify the cluster is healthy, we can use the Rook toolbox to run ceph -s and inspect the overall cluster state.
The Rook toolbox is a container with common tools for debugging and testing Rook; deploy it from the provided toolbox.yaml:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# kubectl apply -f toolbox.yaml
deployment.apps/rook-ceph-tools created
Once it's up, drop into the toolbox pod with the following command, from where the Ceph cluster can be operated:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
The toolbox base image is CentOS 8, so extra tools can be installed directly with yum or rpm:
[root@rook-ceph-tools-5949d6759-256c5 /]# cat /etc/redhat-release
CentOS Linux release 8.3.2011
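For example, assuming the pod has network access to the CentOS repos, pulling in a couple of extra utilities is a one-liner:
yum install -y bind-utils net-tools   # hypothetical extras; any CentOS 8 package works the same way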
Tip:
The following alias makes it convenient to jump into the toolbox environment:
alias ceph-ops='kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath="{.items[0].metadata.name}") -- bash'
Check cluster status
[root@rook-ceph-tools-5949d6759-256c5 /]# ceph -s
  cluster:
    id:     a0540409-d822-48e0-869b-273936597f2d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 23h)
    mgr: a(active, since 4h)
    osd: 3 osds: 3 up (since 23h), 3 in (since 23h)

  data:
    pools:   2 pools, 33 pgs
    objects: 17 objects, 21 MiB
    usage:   3.1 GiB used, 297 GiB / 300 GiB avail
    pgs:     33 active+clean
Check cluster topology
As expected, each spare disk was initialized as an OSD:
[root@rook-ceph-tools-5949d6759-256c5 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                  STATUS  REWEIGHT  PRI-AFF
-1         0.29306  root default
-3         0.09769      host k8s-worker-node2
 0    hdd  0.09769          osd.0                  up   1.00000  1.00000
-7         0.09769      host k8s-worker-node3
 2    hdd  0.09769          osd.2                  up   1.00000  1.00000
-5         0.09769      host k8s-worker-node4
 1    hdd  0.09769          osd.1                  up   1.00000  1.00000
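A few more read-only commands worth running from the toolbox while getting to know the cluster (all standard Ceph CLI):
ceph health detail   # expands any warnings behind the HEALTH_* summary
ceph df              # cluster-wide and per-pool capacity usage
ceph osd status      # one-line status per OSD
ceph mon stat        # monitor quorum summary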
For more on Ceph architecture and operations, see the Ceph docs (https://docs.ceph.com/en/pacific/).
Deploy StorageClasses
The Ceph-CSI driver deployed by Rook already includes both the rbdplugin and the cephfsplugin.
RBD block storage
RBD is block storage: intuitively, it hands the consumer (here, a pod) a disk to mount. In Kubernetes it is not suitable for simultaneous multi-client read-write. In a StatefulSet, volumeClaimTemplates will create an independent PV (an RBD volume) for each pod (see the StatefulSet sketch at the end of this section).
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/rbd]# cat storageclass.yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool   ## name of the RADOS pool ##
  namespace: rook-ceph
spec:
  failureDomain: host   ## host-level failure domain ##
  replicated:           ## replication rather than erasure coding ##
    size: 3
    # Disallow setting pool with replica 1, this could lead to data loss without recovery.
    # Make sure you're *ABSOLUTELY CERTAIN* that is what you want
    requireSafeReplicaSize: true
    # gives a hint (%) to Ceph in terms of expected consumption of the total cluster capacity of a given pool
    # for more info: https://docs.ceph.com/docs/master/rados/operations/placement-groups/#specifying-expected-pool-size
    #targetSizeRatio: .5
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  # clusterID is the namespace where the rook cluster is running
  # If you change this namespace, also change the namespace below where the secret namespaces are defined
  clusterID: rook-ceph
  # If you want to use erasure coded pool with RBD, you need to create
  # two pools. one erasure coded and one replicated.
  # You need to specify the replicated pool here in the `pool` parameter, it is
  # used for the metadata of the images.
  # The erasure coded pool must be set as the `dataPool` parameter below.
  #dataPool: ec-data-pool
  pool: replicapool   ## RADOS pool to use ##
  # RBD image format. Defaults to "2".
  imageFormat: "2"
  # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
  imageFeatures: layering
  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  # Specify the filesystem type of the volume. If not specified, csi-provisioner
  # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
  # in hyperconverged settings where the volume is mounted on the same node as the osds.
  csi.storage.k8s.io/fstype: ext4   ## filesystem used on the RBD volume ##
  # uncomment the following to use rbd-nbd as mounter on supported nodes
  # **IMPORTANT**: If you are using rbd-nbd as the mounter, during upgrade you will be hit a ceph-csi
  # issue that causes the mount to be disconnected. You will need to follow special upgrade steps
  # to restart your application pods. Therefore, this option is not recommended.
  #mounter: rbd-nbd
allowVolumeExpansion: true
reclaimPolicy: Delete
Notes:
spec.replicated.size is the pool's replica count; a value of 3 means three replicas are kept. See here for a fuller explanation.
Create the StorageClass:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/rbd]# kubectl apply -f storageclass.yaml
cephblockpool.ceph.rook.io/replicapool created
storageclass.storage.k8s.io/rook-ceph-block created
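Once applied, you can confirm from the toolbox that the pool exists with the expected replica count (standard Ceph CLI; replicapool is the pool defined above):
ceph osd pool ls                      # replicapool should appear in the list
ceph osd pool get replicapool size    # expected output: size: 3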
Create a PVC and a pod that consumes it:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/rbd]# cat pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: csirbd-demo-pod
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
        readOnly: false
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/rbd]# cat pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-block
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/rbd]# kubectl apply -f pvc.yaml -f pod.yaml
persistentvolumeclaim/rbd-pvc created
pod/csirbd-demo-pod created
Check the PVC and PV status:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/rbd]# kubectl get pvc
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
rbd-pvc   Bound    pvc-a38b140d-cff8-4bfb-9fa6-141b207fe5f4   1Gi        RWO            rook-ceph-block   44s
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/rbd]# kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS      REASON   AGE
pvc-a38b140d-cff8-4bfb-9fa6-141b207fe5f4   1Gi        RWO            Delete           Bound    default/rbd-pvc   rook-ceph-block            45s
Enter the pod and verify the PV mount; as shown below, the RBD volume is visible inside the pod:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/rbd]# kubectl get pods
NAME              READY   STATUS    RESTARTS   AGE
csirbd-demo-pod   1/1     Running   0          87s
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/rbd]# kubectl exec -it csirbd-demo-pod -- bash
root@csirbd-demo-pod:/# df -Th
Filesystem              Type     Size  Used Avail Use% Mounted on
overlay                 overlay   47G  9.5G   38G  21% /
tmpfs                   tmpfs     64M     0   64M   0% /dev
tmpfs                   tmpfs    1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/mapper/centos-root xfs       47G  9.5G   38G  21% /etc/hosts
shm                     tmpfs     64M     0   64M   0% /dev/shm
/dev/rbd0               ext4     976M  2.6M  958M   1% /var/lib/www/html
tmpfs                   tmpfs    1.9G   12K  1.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                   tmpfs    1.9G     0  1.9G   0% /proc/acpi
tmpfs                   tmpfs    1.9G     0  1.9G   0% /proc/scsi
tmpfs                   tmpfs    1.9G     0  1.9G   0% /sys/firmware
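As mentioned at the top of this section, a StatefulSet's volumeClaimTemplates will mint one independent RBD volume per replica. A minimal sketch against the rook-ceph-block StorageClass created above (the web StatefulSet and its names are hypothetical):
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: nginx
          image: nginx
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:   # yields one 1Gi RBD PVC per pod: data-web-0, data-web-1, data-web-2
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 1Gi
EOF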
CephFS file storage
CephFS is file storage: intuitively, it's like mounting a remote directory (similar to NFS), and multiple clients can read and write it simultaneously.
By default Rook only deploys the mon/mgr/osd components; CephFS additionally needs the mds component, deployed via the following CRD resource:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# cat filesystem.yaml | grep -v "#"
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:   ## metadata pool ##
    replicated:
      size: 3
      requireSafeReplicaSize: true
    parameters:
      compression_mode: none
  dataPools:      ## data pools ##
    - failureDomain: host
      replicated:
        size: 3
        requireSafeReplicaSize: true
      parameters:
        compression_mode: none
  preservePoolsOnDelete: true
  metadataServer:   ## active/standby: 1 active, 1 standby ##
    activeCount: 1
    activeStandby: true
    placement:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - rook-ceph-mds
            topologyKey: kubernetes.io/hostname
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - rook-ceph-mds
            topologyKey: topology.kubernetes.io/zone
    annotations:
    labels:
    resources:
Notes:
- This creates a CephFS filesystem named myfs, defining a replicated metadata pool and a replicated data pool, each with 3 replicas
- Two mds daemons are created, one active and one standby
- The affinity settings can be used to schedule the mds pods onto specific nodes
Create the CephFS filesystem, then enter the toolbox and verify its status:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# kubectl apply -f filesystem.yaml
The cluster status now shows the mds component:
[root@rook-ceph-tools-5949d6759-256c5 /]# ceph -s
  cluster:
    id:     a0540409-d822-48e0-869b-273936597f2d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 35m)
    mgr: a(active, since 54m)
    mds: myfs:1 {0=myfs-a=up:active} 1 up:standby-replay   ## mds status: 1 active, 1 standby ##
    osd: 3 osds: 3 up (since 24h), 3 in (since 24h)

  data:
    pools:   4 pools, 97 pgs
    objects: 41 objects, 21 MiB
    usage:   3.1 GiB used, 297 GiB / 300 GiB avail
    pgs:     97 active+clean

  io:
    client: 853 B/s rd, 1 op/s rd, 0 op/s wr
Inspect the filesystem details:
[root@rook-ceph-tools-5949d6759-256c5 /]# ceph fs status --format=json | jq .
{
  "clients": [
    {
      "clients": 0,
      "fs": "myfs"
    }
  ],
  "mds_version": "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)",
  "mdsmap": [
    {
      "dns": 10,
      "inos": 13,
      "name": "myfs-a",
      "rank": 0,
      "rate": 0,
      "state": "active"
    },
    {
      "dns": 5,
      "events": 0,
      "inos": 5,
      "name": "myfs-b",
      "rank": 0,
      "state": "standby-replay"
    }
  ],
  "pools": [
    {
      "avail": 100898840576,
      "id": 5,
      "name": "myfs-metadata",
      "type": "metadata",
      "used": 1572864
    },
    {
      "avail": 100898840576,
      "id": 6,
      "name": "myfs-data0",
      "type": "data",
      "used": 0
    }
  ]
}
Notes:
- The metadata pool is named myfs-metadata
- The data pool is named myfs-data0
Create a CephFS-type StorageClass:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/cephfs]# cat storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  # clusterID is the namespace where operator is deployed.
  clusterID: rook-ceph
  # CephFS filesystem name into which the volume shall be created
  fsName: myfs
  # Ceph pool into which the volume shall be created
  # Required for provisionVolume: "true"
  pool: myfs-data0
  # Root path of an existing CephFS volume
  # Required for provisionVolume: "false"
  # rootPath: /absolute/path
  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.  ## credentials used by the CSI driver ##
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  # (optional) The driver can use either ceph-fuse (fuse) or ceph kernel client (kernel)
  # If omitted, default volume mounter will be used - this is determined by probing for ceph-fuse
  # or by setting the default mounter explicitly via --volumemounter command-line argument.
  # mounter: kernel
reclaimPolicy: Delete   ## reclaim policy ##
allowVolumeExpansion: true
mountOptions:   ## custom mount options ##
  # uncomment the following line for debugging
  #- debug
Notes:
- fsName is the name of the CephFS filesystem; per the deployment above, myfs
- pool is the name of the data pool; per the deployment above, myfs-data0
Check that the StorageClass was created successfully:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/cephfs]# kubectl get sc
NAME              PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
rook-ceph-block   rook-ceph.rbd.csi.ceph.com      Delete          Immediate           true                   19h
rook-cephfs       rook-ceph.cephfs.csi.ceph.com   Delete          Immediate           true                   12h
Create a PVC and a pod, and mount the PVC into the pod:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/cephfs]# cat pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: csicephfs-demo-pod
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: cephfs-pvc
        readOnly: false
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/cephfs]# cat pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-cephfs
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/cephfs]# kubectl apply -f pvc.yaml -f pod.yaml
persistentvolumeclaim/cephfs-pvc created
pod/csicephfs-demo-pod created
If all is well, the pod should be Running, as shown:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/cephfs]# kubectl get pod
NAME                 READY   STATUS    RESTARTS   AGE
csicephfs-demo-pod   1/1     Running   0          23m   ## up and running ##
csirbd-demo-pod      1/1     Running   0          25h
Enter the pod and inspect the mount:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph/csi/cephfs]# kubectl exec -it csicephfs-demo-pod -- df -Th
Filesystem                  Type     Size  Used Avail Use% Mounted on
overlay                     overlay   10G  6.0G  4.1G  60% /
tmpfs                       tmpfs     64M     0   64M   0% /dev
tmpfs                       tmpfs    1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/mapper/vg_root-lv_root xfs       10G  6.0G  4.1G  60% /etc/hosts
shm                         tmpfs     64M     0   64M   0% /dev/shm
10.96.176.163:6789,10.96.146.28:6789,10.96.31.17:6789:/volumes/csi/csi-vol-e45c556b-528d-11ec-97cf-222a3fe2a760/69824d96-cdeb-4602-a770-1c4422db4a34 ceph 1.0G 0 1.0G 0% /var/lib/www/html
tmpfs                       tmpfs    1.9G   12K  1.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                       tmpfs    1.9G     0  1.9G   0% /proc/acpi
tmpfs                       tmpfs    1.9G     0  1.9G   0% /proc/scsi
tmpfs                       tmpfs    1.9G     0  1.9G   0% /sys/firmware
As shown above, the 1G PV is mounted at the expected directory, and the mounted filesystem type is ceph, as expected.
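Note that the demo PVC above requested ReadWriteOnce; to actually exercise CephFS's multi-writer capability, request ReadWriteMany and share the claim across pods, for example via a multi-replica Deployment. A minimal sketch, with cephfs-pvc-rwx as a hypothetical name:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc-rwx
spec:
  accessModes:
    - ReadWriteMany        # pods on different nodes may mount it read-write simultaneously
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-cephfs
EOF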
S3 object storage
Rook can deploy RGW instances directly into the current Ceph environment, or connect to an existing external Ceph cluster; see the docs for details.
Below is the CephObjectStore CRD resource provided by Rook; create it directly with kubectl apply.
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# cat object.yaml | grep -v "#"
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
  namespace: rook-ceph
spec:
  metadataPool:             ## index/metadata pool configuration ##
    failureDomain: host     ## host-level failure domain ##
    replicated:             ## replication strategy ##
      size: 3               ## replica count ##
      requireSafeReplicaSize: true
    parameters:
      compression_mode: none
  dataPool:                 ## data pool configuration ##
    failureDomain: host
    replicated:
      size: 3
      requireSafeReplicaSize: true
    parameters:
      compression_mode: none
  preservePoolsOnDelete: false
  gateway:
    type: s3                ## gateway type is S3 ##
    sslCertificateRef:
    port: 80                ## rgw instance port ##
    instances: 1            ## number of instances ##
    placement:
    annotations:
    labels:
    resources:
  healthCheck:
    bucket:
      disabled: false
      interval: 60s
    livenessProbe:
      disabled: false
Once created, the following Service appears; hit its ClusterIP and port to verify the S3 endpoint. A healthy service returns the response below.
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# kubectl -n rook-ceph get svc -l app=rook-ceph-rgw
NAME                     TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
rook-ceph-rgw-my-store   ClusterIP   10.96.57.91   <none>        80/TCP    174m
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# curl 10.96.57.91
<?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>anonymous</ID><DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyBucketsResult>
At this point the Ceph RGW-based S3 object storage service is up and running. In the traditional model, users and their key pairs can be created with the radosgw-admin tool or the admin API; see https://docs.ceph.com/en/pacific/radosgw/admin/ for details.
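For example, an S3 user can be created from the toolbox, since radosgw-admin ships in the toolbox image (the uid and display name here are hypothetical):
radosgw-admin user create --uid=demo-user --display-name="Demo User"   # prints the generated access_key/secret_key as JSON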
With object storage configured, the next step is to create a bucket that clients can read and write objects in. Buckets can be provisioned through a StorageClass, following the same pattern used for block and file storage. First, define a storage class that allows object clients to create buckets; it specifies the object storage backend, the bucket retention policy, and other properties the administrator needs.
Below is an S3-type StorageClass; create it with kubectl apply.
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# cat storageclass-bucket-retain.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-retain-bucket
provisioner: rook-ceph.ceph.rook.io/bucket
# set the reclaim policy to retain the bucket when its OBC is deleted
reclaimPolicy: Retain   ## reclaim policy ##
parameters:
  objectStoreName: my-store   # port 80 assumed
  objectStoreNamespace: rook-ceph
  region: us-east-1           ## the default is fine ##
  # To accommodate brownfield cases reference the existing bucket name here instead
  # of in the ObjectBucketClaim (OBC). In this case the provisioner will grant
  # access to the bucket by creating a new user, attaching it to the bucket, and
  # providing the credentials via a Secret in the namespace of the requesting OBC.
  #bucketName:
Notes:
- The reclaim policy is Retain: the bucket persists rather than being cleaned up automatically when its claimant's lifecycle ends
- objectStoreName names the CephObjectStore CRD created earlier
To consume object storage, a client creates an ObjectBucketClaim (OBC, analogous to a PVC) CRD resource. Once it succeeds, a Secret with the corresponding name appears in the current namespace, containing the key pair the user needs (access_key and secret_key).
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# cat <<EOF | kubectl apply -f -
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: ceph-bucket
spec:
  generateBucketName: ceph-bucket-demo
  storageClassName: rook-ceph-retain-bucket
EOF
Caution:
- storageClassName points at the StorageClass created above
- generateBucketName sets the prefix of the generated bucket name
The key pair can be pulled out directly like this:
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# kubectl -n default get secret ceph-bucket -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d
xxxxxxxxxxxxxx   ######## your access_key here ######
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# kubectl -n default get secret ceph-bucket -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d
xxxxxxxxxxxxxx   ######## your secret_key here #########
A ConfigMap with the same name is also generated in this namespace, containing the bucket and endpoint information.
[root@k8s-master-node1 /opt/rook/cluster/examples/kubernetes/ceph]# kubectl -n default get cm ceph-bucket -o jsonpath='{.data}' | jq .
{
  "BUCKET_HOST": "rook-ceph-rgw-my-store.rook-ceph.svc",
  "BUCKET_NAME": "ceph-bucket-demo-aaa69329-40db-467b-81d6-dd4f6585ebfa",   ## matches the expected prefix ##
  "BUCKET_PORT": "80",
  "BUCKET_REGION": "us-east-1",
  "BUCKET_SUBREGION": ""
}
Caution:
- Bucket details can be inspected with radosgw-admin
- A full S3 client walkthrough is beyond the scope of this post, but a quick smoke test is sketched below
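A minimal client-side smoke test, assuming awscli is installed and the RGW Service is reachable from where you run it (e.g. inside the cluster, or via kubectl port-forward):
export AWS_ACCESS_KEY_ID=$(kubectl -n default get secret ceph-bucket -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d)
export AWS_SECRET_ACCESS_KEY=$(kubectl -n default get secret ceph-bucket -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d)
BUCKET_HOST=$(kubectl -n default get cm ceph-bucket -o jsonpath='{.data.BUCKET_HOST}')
BUCKET_NAME=$(kubectl -n default get cm ceph-bucket -o jsonpath='{.data.BUCKET_NAME}')
echo hello > /tmp/hello.txt
aws --endpoint-url "http://${BUCKET_HOST}" s3 cp /tmp/hello.txt "s3://${BUCKET_NAME}/"   # upload an object
aws --endpoint-url "http://${BUCKET_HOST}" s3 ls "s3://${BUCKET_NAME}/"                  # list it back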
Wrap-up
The above is a simple record of my first hands-on experience with Rook; here are some personal impressions.
- Rook really does deploy fast: one pass and all the components are Running. But that is the ideal case; environments differ, and a single step going wrong can stretch the whole troubleshooting cycle considerably. (I hit assorted small problems during the walkthrough above.)
- For someone who doesn't know Ceph, deploying a cluster from scratch is very costly, so a one-click tool (or rather, approach) like this is genuinely convenient. But if fast deployment is all you're after, you get a demo or validation environment at best, still a long way from production. "A breeze to deploy, a nightmare to operate": until I genuinely understand Rook's internal logic and architecture, I would not rashly adopt this stack just to be cloud native. As storage administrators we should have this instinct: storage is the most critical piece, and the value of the data is what matters most. As the saying goes, "compute comes and goes; storage, once gone, never comes back." Compute tasks (systems) can be retried and restarted, but a storage system cannot be casually retried or restarted: one bad operation can cause irrecoverable damage, namely data loss.
- Rook itself brings a management plane and agents, and on top of that the Ceph-CSI components needed for PV/PVC, so the pile of non-storage components is already large. That significantly raises long-term management and maintenance costs. And containerizing the storage backend itself (Ceph, in this case) may make things even worse for the storage system: Ceph is already far more complex than an ordinary application, with many components of its own (MON, MGR, OSD, RGW, MDS, and so on), and containerization adds a runtime layer on top, a further burden for day-2 operations and troubleshooting. Whether containerizing Ceph is even necessary has been hotly debated in the community; see Why you might want packages not containers for Ceph deployments (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/TTTYKRVWJOR7LOQ3UCQAZQR32R7YADVY/)
- Troubleshooting used to mean searching fixed log files for clues. After going cloud native, everything logs to stdout, and without a unified logging platform this becomes another obstacle. For Ceph administrators used to locating problems in system-level logs, finding things after containerization feels awkward.
- Any customization or tuning of a Ceph cluster deployed through Rook has to be expressed within Rook's conventions and constraints, and some things Rook may not support yet. So Rook adds a new learning curve for Ceph administrators, and may even require customizing Rook itself to meet production needs.
In summary, my personal take: if your Kubernetes cluster runs on a public cloud, just use the PV/PVC offerings from your cloud vendor. For self-built clusters with persistent-storage needs, if there is a dedicated storage team, hand the storage solution to them and let them choose the integration approach, easing the load on the Kubernetes SREs. Let professionals do professional work.
Cloud native is the trend, and building applications (stateless or stateful) around it is an unstoppable tide. Rook still has a long way to go; the revolution is not yet complete, and comrades must keep at it.