Host Failure

Test Environment

  • Cluster size: 4 host machines
  • Number of disks: 24 (= 6 disks per host * 4 hosts)
  • Kubernetes version: 1.10.5
  • Ceph version: 12.2.3
  • OpenStack-Helm commit: 25e50a34c6
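
The Kubernetes and Ceph versions listed above can be checked directly on a running cluster; these are standard kubectl and Ceph CLI commands, shown here only as a convenience (output omitted, as it was not captured for this test):

$ kubectl version --short
(mon-pod):/# ceph versions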

Case: A host machine running ceph-mon is rebooted

Symptom:

After the node voyager3 is rebooted, its status changes to NotReady.

$ kubectl get nodes
NAME       STATUS     ROLES     AGE       VERSION
voyager1   Ready      master    6d        v1.10.5
voyager2   Ready      <none>    6d        v1.10.5
voyager3   NotReady   <none>    6d        v1.10.5
voyager4   Ready      <none>    6d        v1.10.5

Ceph status shows that the ceph-mon running on voyager3 is out of quorum, and the six OSDs hosted on voyager3 are down; i.e., 18 of the 24 OSDs are up.

(mon-pod):/# ceph -s
  cluster:
    id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
    health: HEALTH_WARN
            6 osds down
            1 host (6 osds) down
            Degraded data redundancy: 195/624 objects degraded (31.250%), 8 pgs degraded
            too few PGs per OSD (17 < min 30)
            mon voyager1 is low on available space
            1/3 mons down, quorum voyager1,voyager2

  services:
    mon: 3 daemons, quorum voyager1,voyager2, out of quorum: voyager3
    mgr: voyager1(active), standbys: voyager3
    mds: cephfs-1/1/1 up  {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
    osd: 24 osds: 18 up, 24 in
    rgw: 2 daemons active

  data:
    pools:   18 pools, 182 pgs
    objects: 208 objects, 3359 bytes
    usage:   2630 MB used, 44675 GB / 44678 GB avail
    pgs:     195/624 objects degraded (31.250%)
             126 active+undersized
             48  active+clean
             8   active+undersized+degraded
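
To confirm which OSDs are down and that they all belong to voyager3, the OSD tree can be inspected from the monitor pod. This is a supplementary check (output omitted); in Luminous, ceph osd tree also accepts a state filter such as down:

(mon-pod):/# ceph osd tree
(mon-pod):/# ceph osd tree down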

Recovery:

After the node is up again, the status of voyager3 returns to Ready and the Ceph pods on it are restarted automatically. Ceph status shows that the monitor running on voyager3 is back in quorum and all 24 OSDs are up and in.

$ kubectl get nodes
NAME       STATUS    ROLES     AGE       VERSION
voyager1   Ready     master    6d        v1.10.5
voyager2   Ready     <none>    6d        v1.10.5
voyager3   Ready     <none>    6d        v1.10.5
voyager4   Ready     <none>    6d        v1.10.5
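
To verify that the Ceph pods on voyager3 were restarted rather than left in an error state, the pods scheduled on that node can be listed. The ceph namespace below is the OpenStack-Helm default and may differ in other deployments (output omitted):

$ kubectl get pods -n ceph -o wide | grep voyager3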
(mon-pod):/# ceph -s
  cluster:
    id:     9d4d8c61-cf87-4129-9cef-8fbf301210ad
    health: HEALTH_WARN
            too few PGs per OSD (22 < min 30)
            mon voyager1 is low on available space

  services:
    mon: 3 daemons, quorum voyager1,voyager2,voyager3
    mgr: voyager1(active), standbys: voyager3
    mds: cephfs-1/1/1 up  {0=mds-ceph-mds-65bb45dffc-cslr6=up:active}, 1 up:standby
    osd: 24 osds: 24 up, 24 in
    rgw: 2 daemons active

  data:
    pools:   18 pools, 182 pgs
    objects: 208 objects, 3359 bytes
    usage:   2635 MB used, 44675 GB / 44678 GB avail
    pgs:     182 active+clean
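
For an explicit confirmation that voyager3 has rejoined the monitor quorum, the quorum status can also be queried directly from the monitor pod (output omitted, as it was not captured during this test):

(mon-pod):/# ceph mon stat
(mon-pod):/# ceph quorum_status --format json-pretty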