Update zuul restart documentation

It was recently pointed out that our restart process for zuul is a bit stale. Document the new modern process that deals with ansible playbooks and docker containers.

Change-Id: I52812e87ed73e6ed538f94a86c1b62ce3de57c37
2021-10-20 09:49:56 -07:00 · 2021-10-20 09:49:56 -07:00 · 7eff5b5af2
commit 7eff5b5af2
parent 2c1a449a42
1 changed files with 55 additions and 39 deletions
--- a/doc/source/zuul.rst
+++ b/doc/source/zuul.rst
@@ -108,7 +108,7 @@ Scheduler
 ---------

 The Zuul Scheduler and gear are all co-located on a single host,
-referred to by the ``zuul.openstack.org`` CNAME in DNS.
+referred to by the ``zuul.opendev.org`` CNAME in DNS.

 Zuul is stateless, so the server does not need backing up. However
 zuul talks through git and ssh so you will need to manually check ssh
@@ -127,44 +127,6 @@ a MySQL database via the SQL Reporter plugin. The database for that is a
 Rackspace Cloud DB and is configured in the ``mysql`` entry of the
 ``zuul_connection_secrets`` entry for the ``zuul-scheduler`` group.

-Restarting the Scheduler
------------------------
-
-Zuul Scheduler restarts are disruptive, so non-emergency restarts should
-always be scheduled for quieter times of the day, week and cycle. To be as
-courteous to developers as possible, just prior to a restart the `Zuul
-Status Page`_ should be checked to see the status of the gate. If there is a
-series of changes nearly merged, wait until that has been completed.
-
-Since Zuul is stateless, some work needs to be done to save and then
-re-enqueue patches when restarts are done. To accomplish this, start by
-running `zuul-changes.py
-<https://opendev.org/zuul/zuul/src/branch/master/tools/zuul-changes.py>`_
-to save the check and gate queues::
-
-  python /opt/zuul/tools/zuul-changes.py http://zuul.openstack.org \
-    check >check.sh
-  python /opt/zuul/tools/zuul-changes.py http://zuul.openstack.org \
-    gate >gate.sh
-
-These check.sh and gate.sh scripts will be used after the restart to
-re-enqueue the changes.
-
-Now use `service zuul stop` to stop zuul and then run ps to make sure
-the process has actually stopped, it may take several seconds for it to
-finally go away.
-
-Once you're ready, use `service zuul start` to start zuul again.
-
-To re-enqueue saved jobs, first run the gate.sh script and then check.sh to
-re-enqueue the changes from before the restart::
-
-  ./gate.sh
-  ./check.sh
-
-You may watch the `Zuul Status Page`_ to confirm that changes are
-returning to the queues.
-
 Executors
 ---------

@@ -194,6 +156,60 @@ Zuul Web is stateless so is safe to restart, however restarting it will result
 in a loss of connection for anyone watching a live-stream of a console log
 when the restart happens.

+Restarting Zuul Services
+------------------------
+
+Currently the safest way to restart the Zuul scheduler is to restart all
+services at the same time. If the scheduler is restarted but the executors
+are not, then the executors and scheduler can get out of sync with each
+other. Restarting Zuul Web or a single executor on its own should continue
+to be safe as noted above, but the full restart described here should
+generally be preferred.
+
+Zuul Scheduler restarts are disruptive, so non-emergency restarts should
+always be scheduled for quieter times of the day, week and cycle. We should
+attempt to be courteous and avoid restarts when project teams are cutting
+releases or have other important changes that are about to land.
+
+Since Zuul is stateless, some work needs to be done to save and then
+re-enqueue patches when restarts are done. To accomplish this, start by
+running the zuul-changes script to save the check and gate queues::
+
+  root@zuul02# ~root/zuul-changes.py https://zuul.opendev.org >queues-$(date +%Y%m%d).sh
+
+The generated script will be executed once Zuul is up and running again to
+restore the previous queue contents.
+
+Before restarting all Zuul services, consider whether you also want to
+update all of the Zuul docker images. This is useful when restarting Zuul to
+pick up a bug fix from the Zuul codebase. To do this run the zuul_pull.yaml
+playbook from bridge::
+
+  root@bridge# ansible-playbook -f 20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_pull.yaml
+
+Once ready to restart all Zuul services, run the zuul_restart.yaml playbook
+from bridge::
+
+  root@bridge# ansible-playbook -f 20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_restart.yaml
+
+Once this playbook is done running, the services will have been restarted, but
+the Zuul system still needs to load its configs before it is ready to do work.
+The `root <https://zuul.opendev.org/>`_ of the Zuul dashboard will show you
+loaded tenants. Once all tenants show up on this page it is safe to proceed
+with re-enqueuing changes to pipelines with the script we generated earlier.
+Note that the OpenStack tenant takes the most time to load, so once it shows
+up in the dashboard you should be ready to go. You can double check
+this by loading the OpenStack Zuul `status
+<https://zuul.opendev.org/t/openstack/status>`_ and ensuring it doesn't report
+an error.
+
+To re-enqueue, execute the previously generated script::
+
+  root@zuul02# bash queues-$(date +%Y%m%d).sh
+
+When this has completed you are done with the Zuul restart. Consider logging
+the restart, and any image update, with statusbot in IRC.
+
 Secrets
 -------
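
The restart sequence added in this change can be summarised as a printable checklist. The sketch below is not part of the commit. It only prints the documented commands in order and executes nothing. The function name `print_restart_checklist` and the `QUEUE_FILE` variable are hypothetical, and it assumes the re-enqueue step runs on the same host (`zuul02`) where the queue script was saved. Computing the file name once also sidesteps a possible pitfall: `queues-$(date +%Y%m%d).sh` is re-evaluated at re-enqueue time, so it would name a different file if the date changed during the restart.

```shell
#!/bin/sh
# Print the documented Zuul restart steps as a checklist. Nothing is executed.
# QUEUE_FILE is a hypothetical override; by default the name is computed once
# so the save step and the re-enqueue step always refer to the same file.
print_restart_checklist() {
    queue_file="${QUEUE_FILE:-queues-$(date +%Y%m%d).sh}"
    playbooks=/home/zuul/src/opendev.org/opendev/system-config/playbooks

    # 1. Save the current pipeline contents on the scheduler host.
    printf 'root@zuul02# ~root/zuul-changes.py https://zuul.opendev.org >%s\n' "$queue_file"
    # 2. Optionally pull updated docker images, then restart all services.
    printf 'root@bridge# ansible-playbook -f 20 %s/zuul_pull.yaml\n' "$playbooks"
    printf 'root@bridge# ansible-playbook -f 20 %s/zuul_restart.yaml\n' "$playbooks"
    # 3. Wait for configs to load before re-enqueuing.
    printf '# wait until every tenant is listed at https://zuul.opendev.org/\n'
    # 4. Restore the saved queues (host assumed to be where the file was saved).
    printf 'root@zuul02# bash %s\n' "$queue_file"
}

print_restart_checklist
```

Setting `QUEUE_FILE=queues-20211020.sh` before calling the function pins the name explicitly, which is useful when the checklist is printed on a different day than the queues were saved.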