Update zuul restart documentation

It was recently pointed out that our restart process for zuul is a bit
stale. Document the current process, which uses Ansible playbooks and
Docker containers.

Change-Id: I52812e87ed73e6ed538f94a86c1b62ce3de57c37
Clark Boylan 2021-10-20 09:49:56 -07:00
parent 2c1a449a42
commit 7eff5b5af2


@@ -108,7 +108,7 @@ Scheduler
---------
The Zuul Scheduler and gear are all co-located on a single host,
referred to by the ``zuul.openstack.org`` CNAME in DNS.
referred to by the ``zuul.opendev.org`` CNAME in DNS.
Zuul is stateless, so the server does not need backing up. However
zuul talks through git and ssh so you will need to manually check ssh
@@ -127,44 +127,6 @@ a MySQL database via the SQL Reporter plugin. The database for that is a
Rackspace Cloud DB and is configured in the ``mysql`` entry of the
``zuul_connection_secrets`` entry for the ``zuul-scheduler`` group.
Restarting the Scheduler
------------------------
Zuul Scheduler restarts are disruptive, so non-emergency restarts should
always be scheduled for quieter times of the day, week and cycle. To be as
courteous to developers as possible, just prior to a restart the `Zuul
Status Page`_ should be checked to see the status of the gate. If there is a
series of changes nearly merged, wait until that has been completed.
Since Zuul is stateless, some work needs to be done to save and then
re-enqueue patches when restarts are done. To accomplish this, start by
running `zuul-changes.py
<https://opendev.org/zuul/zuul/src/branch/master/tools/zuul-changes.py>`_
to save the check and gate queues::
python /opt/zuul/tools/zuul-changes.py http://zuul.openstack.org \
check >check.sh
python /opt/zuul/tools/zuul-changes.py http://zuul.openstack.org \
gate >gate.sh
These check.sh and gate.sh scripts will be used after the restart to
re-enqueue the changes.
Now use `service zuul stop` to stop zuul and then run ps to make sure
the process has actually stopped, it may take several seconds for it to
finally go away.
Once you're ready, use `service zuul start` to start zuul again.
To re-enqueue saved jobs, first run the gate.sh script and then check.sh to
re-enqueue the changes from before the restart::
./gate.sh
./check.sh
You may watch the `Zuul Status Page`_ to confirm that changes are
returning to the queues.
Executors
---------
@@ -194,6 +156,60 @@ Zuul Web is stateless so is safe to restart, however restarting it will result
in a loss of connection for anyone watching a live-stream of a console log
when the restart happens.
Restarting Zuul Services
------------------------
Currently the safest way to restart the Zuul scheduler is to restart all
services at the same time. If the scheduler is restarted but the executors
are not, the executors and scheduler can get out of sync with each other.
Restarting Zuul Web or a single executor on its own remains safe, as noted
above, but the full restart described here should generally be preferred.
Zuul Scheduler restarts are disruptive, so non-emergency restarts should
always be scheduled for quieter times of the day, week and cycle. We should
attempt to be courteous and avoid restarts when project teams are cutting
releases or have other important changes that are about to land.
Since Zuul is stateless, some work needs to be done to save and then
re-enqueue patches when restarts are done. To accomplish this, start by
running the zuul-changes script to save the check and gate queues::
root@zuul02# ~root/zuul-changes.py https://zuul.opendev.org >queues-$(date +%Y%m%d).sh
The generated ``queues-YYYYMMDD.sh`` script will be executed once Zuul is up
and running again to restore the previous queue contents.
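As a rough illustration of what the saved script contains, the sketch below
walks a Zuul status document and emits one enqueue command per change. It is
not the actual ``zuul-changes.py``: the JSON layout (``pipelines``,
``change_queues``, ``heads``, ``id``, ``project``) and the
``zuul-client enqueue`` command form are assumptions made for the example, and
``enqueue_commands`` is a hypothetical helper name.

```python
import shlex


def enqueue_commands(tenant, status):
    """Build one re-enqueue command line per change in a status document.

    ASSUMPTION: `status` is shaped like
    {"pipelines": [{"name": ..., "change_queues": [{"heads": [[item, ...]]}]}]}
    where each item carries "id" ("<change>,<patchset>") and "project".
    """
    commands = []
    for pipeline in status.get("pipelines", []):
        for queue in pipeline.get("change_queues", []):
            for head in queue.get("heads", []):
                for item in head:
                    change_id = item.get("id")
                    # Skip items that are not change-based (no "change,patchset" id).
                    if not change_id or "," not in change_id:
                        continue
                    # ASSUMPTION: the command form below is illustrative only.
                    args = [
                        "zuul-client", "enqueue",
                        "--tenant", tenant,
                        "--pipeline", pipeline["name"],
                        "--project", item["project"],
                        "--change", change_id,
                    ]
                    commands.append(" ".join(shlex.quote(a) for a in args))
    return commands
```

Writing the returned lines to a file, one per line, gives a shell script that
can be replayed after the restart.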
Before restarting all Zuul services, consider updating all of the Zuul Docker
images. This is useful when the restart is meant to pick up a bug fix from
the Zuul codebase. To do this, run the ``zuul_pull.yaml`` playbook from
bridge::
root@bridge# ansible-playbook -f 20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_pull.yaml
Once you are ready to restart all Zuul services, run the
``zuul_restart.yaml`` playbook from bridge::
root@bridge# ansible-playbook -f 20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_restart.yaml
Once this playbook has finished, the services will have been restarted, but
the Zuul system still needs to load its configs before it is ready to do work.
The `root <https://zuul.opendev.org/>`_ of the Zuul dashboard will show you
the loaded tenants. Once all tenants show up on this page it is safe to
proceed with re-enqueuing changes to pipelines with the script we generated
earlier. Note that the OpenStack tenant takes the most time to load. If you
wait for it to show up in the dashboard you should be ready to go. You can
double check this by loading the OpenStack Zuul `status
<https://zuul.opendev.org/t/openstack/status>`_ page and ensuring it doesn't
report an error.
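The tenant check can also be scripted. The sketch below compares the tenants
you expect against a parsed tenant listing and reports which ones have not
loaded yet. It assumes the listing is a JSON array of objects with a ``name``
key (for example, as fetched from the dashboard's API); that shape, the
expected tenant names, and the helper name ``tenants_not_loaded`` are
assumptions for illustration only.

```python
def tenants_not_loaded(expected, tenants_json):
    """Return the expected tenant names missing from a tenant listing.

    ASSUMPTION: `tenants_json` is a list of dicts, each with a "name" key.
    """
    loaded = {tenant.get("name") for tenant in tenants_json}
    return sorted(set(expected) - loaded)
```

An empty result means every expected tenant is present, which corresponds to
the point at which re-enqueuing is safe.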
To re-enqueue, execute the previously generated script::
root@zuul02# bash queues-$(date +%Y%m%d).sh
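Note that ``$(date +%Y%m%d)`` is evaluated again when this command runs, so a
restart that spans midnight would look for a file name that was never written.
One way around this, sketched below, is to pick the newest saved script by the
date embedded in its name. The helper name ``latest_queue_script`` is
hypothetical, and the sketch only assumes the ``queues-YYYYMMDD.sh`` naming
used above.

```python
import re

# Matches the queues-YYYYMMDD.sh names produced by the save step.
QUEUE_SCRIPT = re.compile(r"^queues-(\d{8})\.sh$")


def latest_queue_script(filenames):
    """Return the most recently dated queues-YYYYMMDD.sh name, or None."""
    dated = []
    for name in filenames:
        match = QUEUE_SCRIPT.match(name)
        if match:
            # YYYYMMDD strings sort chronologically as plain text.
            dated.append((match.group(1), name))
    if not dated:
        return None
    return max(dated)[1]
```

For example, passing the output of ``os.listdir(".")`` returns the name of the
script to run with ``bash``.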
When this has completed you are done with the Zuul restart. Consider logging
the restart and any image update with statusbot in IRC.
Secrets
-------