From b712584e533e874c021a085f85ae753f52d994d1 Mon Sep 17 00:00:00 2001
From: Clark Boylan <clark.boylan@gmail.com>
Date: Fri, 14 Apr 2017 10:52:36 -0700
Subject: [PATCH] Add AFS maintenance docs

This adds documentation on how to perform maintenance on the AFS
cluster with no service outages.

Change-Id: Idf9ab67603a1c5e8ac062458f3d17399d807e3a8
---
 doc/source/afs.rst | 72 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/doc/source/afs.rst b/doc/source/afs.rst
index abf6f2923e..bf8e726227 100644
--- a/doc/source/afs.rst
+++ b/doc/source/afs.rst
@@ -425,3 +425,75 @@ place for Apache on these hosts.  This avoids management overheads of
 a completely new service deployment such as Squid or a caching docker
 registry daemon.
 
+No-Outage Server Maintenance
+----------------------------
+
+afsdb0X.openstack.org
+~~~~~~~~~~~~~~~~~~~~~
+
+We have redundant AFS DB servers. You can take one down without causing
+a service outage as long as the other remains up. To do this safely::
+
+  root@afsdb01:~# bos shutdown afsdb01.openstack.org -wait -localauth
+  root@afsdb01:~# bos status afsdb01.openstack.org -localauth
+  Instance ptserver, temporarily disabled, currently shutdown.
+  Instance vlserver, temporarily disabled, currently shutdown.
+
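+You can confirm that the remaining DB server is still serving by
+running the same status command against it::
+
+  root@afsdb01:~# bos status afsdb02.openstack.org -localauth
+  Instance ptserver, currently running normally.
+  Instance vlserver, currently running normally.
+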
+Then perform your maintenance on afsdb01. When done, a reboot will
+restart the bos service automatically, or you can restart it manually
+by starting the openafs-fileserver service::
+
+  root@afsdb01:~# service openafs-fileserver start
+
+Finally check that the service is back up and running::
+
+  root@afsdb01:~# bos status afsdb01.openstack.org -localauth
+  Instance ptserver, currently running normally.
+  Instance vlserver, currently running normally.
+
+Now you can repeat the process against afsdb02.
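+
+The sequence is identical apart from the host name, for example::
+
+  root@afsdb02:~# bos shutdown afsdb02.openstack.org -wait -localauth
+  root@afsdb02:~# bos status afsdb02.openstack.org -localauth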
+
+afs0X.openstack.org
+~~~~~~~~~~~~~~~~~~~
+
+Taking down the actual fileservers is slightly more complicated but
+works similarly. Essentially, before taking a fileserver down we need
+to make sure either that nothing needs the RW volumes it hosts, or
+that those RW volumes have been moved to another fileserver.
+
+To ensure nothing needs the RW volumes, you can hold the various file
+locks on the hosts that publish to AFS and/or remove the cron entries
+that perform vos releases or volume writes.
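+
+For example, you can hold a lock for the duration of the maintenance
+with flock; any job using the same lock file will block until the
+command exits. The host name and lock file path below are placeholders,
+substitute the ones used by the publishing job you want to pause::
+
+  root@mirror-update:~# flock /var/run/example-update.lock sleep infinity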
+
+If instead you need to move the RW volume, the first step is to check
+where the volumes live::
+
+  root@afsdb01:~# vos listvldb -localauth
+  VLDB entries for all servers
+
+  mirror
+      RWrite: 536870934     ROnly: 536870935
+      number of sites -> 3
+         server afs01.dfw.openstack.org partition /vicepa RW Site
+         server afs01.dfw.openstack.org partition /vicepa RO Site
+         server afs01.ord.openstack.org partition /vicepa RO Site
+
+We can see that if we want to allow writes to the mirror volume while
+afs01.dfw.openstack.org is down, we will have to move the RW volume to
+one of the other servers::
+
+  root@afsdb01:~# screen # use screen as this may take quite some time.
+  root@afsdb01:~# vos move -id mirror -toserver afs01.ord.openstack.org -topartition vicepa -fromserver afs01.dfw.openstack.org -frompartition vicepa -localauth
+
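+You can watch the move complete by listing just that volume's VLDB
+entry and waiting for the RW site to show the new server::
+
+  root@afsdb01:~# vos listvldb -name mirror -localauth
+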
+When the move is done it is safe to take down afs01.dfw.openstack.org
+while still allowing writes to the mirror volume. We use the same
+process as for the DB servers::
+
+  root@afsdb01:~# bos shutdown afs01.dfw.openstack.org -wait -localauth
+  root@afsdb01:~# bos status afs01.dfw.openstack.org -localauth
+  Auxiliary status is: file server shut down.
+
+Perform maintenance, then restart the fileserver as before (the
+openafs-fileserver service is started on the fileserver itself) and
+check the status again::
+
+  root@afs01:~# service openafs-fileserver start
+  root@afsdb01:~# bos status afs01.dfw.openstack.org -localauth
+  Auxiliary status is: file server running.