Update zuul restart documentation

It was recently pointed out that our restart process for zuul is a bit
stale. Document the new modern process that deals with ansible playbooks
and docker containers.

Change-Id: I52812e87ed73e6ed538f94a86c1b62ce3de57c37

parent 2c1a449a42
commit 7eff5b5af2

@@ -108,7 +108,7 @@ Scheduler
 ---------
 
 The Zuul Scheduler and gear are all co-located on a single host,
-referred to by the ``zuul.openstack.org`` CNAME in DNS.
+referred to by the ``zuul.opendev.org`` CNAME in DNS.
 
 Zuul is stateless, so the server does not need backing up. However
 zuul talks through git and ssh so you will need to manually check ssh
@@ -127,44 +127,6 @@ a MySQL database via the SQL Reporter plugin. The database for that is a
 Rackspace Cloud DB and is configured in the ``mysql`` entry of the
 ``zuul_connection_secrets`` entry for the ``zuul-scheduler`` group.
 
-Restarting the Scheduler
-------------------------
-
-Zuul Scheduler restarts are disruptive, so non-emergency restarts should
-always be scheduled for quieter times of the day, week and cycle. To be as
-courteous to developers as possible, just prior to a restart the `Zuul
-Status Page`_ should be checked to see the status of the gate. If there is a
-series of changes nearly merged, wait until that has been completed.
-
-Since Zuul is stateless, some work needs to be done to save and then
-re-enqueue patches when restarts are done. To accomplish this, start by
-running `zuul-changes.py
-<https://opendev.org/zuul/zuul/src/branch/master/tools/zuul-changes.py>`_
-to save the check and gate queues::
-
-  python /opt/zuul/tools/zuul-changes.py http://zuul.openstack.org \
-      check >check.sh
-  python /opt/zuul/tools/zuul-changes.py http://zuul.openstack.org \
-      gate >gate.sh
-
-These check.sh and gate.sh scripts will be used after the restart to
-re-enqueue the changes.
-
-Now use `service zuul stop` to stop zuul and then run ps to make sure
-the process has actually stopped, it may take several seconds for it to
-finally go away.
-
-Once you're ready, use `service zuul start` to start zuul again.
-
-To re-enqueue saved jobs, first run the gate.sh script and then check.sh to
-re-enqueue the changes from before the restart::
-
-  ./gate.sh
-  ./check.sh
-
-You may watch the `Zuul Status Page`_ to confirm that changes are
-returning to the queues.
-
 Executors
 ---------
 
@@ -194,6 +156,60 @@ Zuul Web is stateless so is safe to restart, however restarting it will result
 in a loss of connection for anyone watching a live-stream of a console log
 when the restart happens.
 
+Restarting Zuul Services
+------------------------
+
+Currently the safest way to restart the Zuul scheduler is to restart all
+services at the same time. The reason for this is that if the scheduler is
+restarted but executors are not, then the executors and scheduler can get out
+of sync with each other. Note that restarting Zuul Web or a single executor
+should continue to be safe as noted above, but this process should generally
+be preferred.
+
+Zuul Scheduler restarts are disruptive, so non-emergency restarts should
+always be scheduled for quieter times of the day, week and cycle. We should
+attempt to be courteous and avoid restarts when project teams are cutting
+releases or have other important changes that are about to land.
+
+Since Zuul is stateless, some work needs to be done to save and then
+re-enqueue patches when restarts are done. To accomplish this, start by
+running the zuul-changes script to save the check and gate queues::
+
+  root@zuul02# ~root/zuul-changes.py https://zuul.opendev.org >queues-$(date +%Y%m%d).sh
+
+The generated script will be executed when Zuul is up and running again to
+restore the previous queue contents.
+
+One other thing to consider before restarting all Zuul services is that you
+may want to update all of the Zuul docker images. This can be useful if
+restarting Zuul to correct a bug that was fixed in the Zuul codebase. To do
+this, run the zuul_pull.yaml playbook from bridge::
+
+  root@bridge# ansible-playbook -f 20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_pull.yaml
+
+Once ready to restart all Zuul services, run the zuul_restart.yaml playbook
+from bridge::
+
+  root@bridge# ansible-playbook -f 20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_restart.yaml
+
+Once this playbook has finished, the services will have been restarted, but
+the Zuul system still needs to load its configs before it is ready to do work.
+The `root <https://zuul.opendev.org/>`_ of the Zuul dashboard will show you
+loaded tenants. Once all tenants show up on this page it is safe to proceed
+with re-enqueuing changes to pipelines with the script we generated earlier.
+Note that the OpenStack tenant takes the most time. If you wait for it to
+show up in the dashboard you should be ready to go. You can double check
+this by loading the OpenStack Zuul `status
+<https://zuul.opendev.org/t/openstack/status>`_ and ensuring it doesn't report
+an error.
+
+To re-enqueue, execute the previously generated script::
+
+  root@zuul02# bash queues-$(date +%Y%m%d).sh
+
+When this has completed you are done with the Zuul restart. Consider logging
+the restart and update with statusbot in IRC.
+
 Secrets
 -------
 
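
The documented procedure relies on a zuul-changes script to capture the queues, but this change does not show how that script works. As a rough illustration only, the Python sketch below walks a status document and lists the live changes per pipeline. The JSON layout (``pipelines``, ``change_queues``, ``heads``, and the ``live``, ``id`` and ``project`` keys) and the sample data are assumptions made for this example and are not taken from this change; the name ``live_changes`` is hypothetical.

```python
from typing import Any, Dict, Iterator, Tuple

# Hypothetical sample data; the layout is an assumption about the status JSON.
SAMPLE_STATUS: Dict[str, Any] = {
    "pipelines": [
        {"name": "gate", "change_queues": [{"heads": [[
            {"id": "700001,2", "project": "opendev/system-config", "live": True},
        ]]}]},
        {"name": "check", "change_queues": [{"heads": [[
            # Non-live entries (dependencies shown for context) are skipped.
            {"id": "700001,2", "project": "opendev/system-config", "live": False},
            {"id": "700002,1", "project": "opendev/system-config", "live": True},
        ]]}]},
    ]
}


def live_changes(status: Dict[str, Any]) -> Iterator[Tuple[str, str, str]]:
    """Yield (pipeline, project, change id) for each live change."""
    for pipeline in status.get("pipelines", []):
        for queue in pipeline.get("change_queues", []):
            for head in queue.get("heads", []):
                for change in head:
                    # Treat a missing "live" key as live (an assumption).
                    if not change.get("live", True):
                        continue
                    change_id = change.get("id")
                    if not change_id:
                        continue
                    yield pipeline["name"], change["project"], change_id


if __name__ == "__main__":
    for pipeline_name, project, change_id in live_changes(SAMPLE_STATUS):
        print(pipeline_name, project, change_id)
```

A real dump script would turn each tuple into a re-enqueue command. That step is left out here because the exact command line depends on the deployed Zuul version.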
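
The new text asks the operator to wait until all tenants show up on the dashboard before re-enqueuing. A minimal sketch of automating that wait follows. It assumes the dashboard serves a JSON tenant listing at ``/api/tenants`` as a list of objects with a ``name`` key. That endpoint shape, the function names, and the tenant list you pass in are assumptions for illustration and are not part of this change.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Iterable, List, Optional


def missing_tenants(listing: Iterable[Dict[str, Any]],
                    expected: Iterable[str]) -> List[str]:
    """Return the expected tenant names that are absent from the listing."""
    loaded = {entry.get("name") for entry in listing}
    return [name for name in expected if name not in loaded]


def fetch_tenant_listing(base_url: str, timeout: float = 30.0) -> Any:
    """Fetch the tenant listing; the /api/tenants path is an assumption."""
    url = base_url.rstrip("/") + "/api/tenants"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def wait_for_tenants(base_url: str,
                     expected: List[str],
                     poll_seconds: float = 30.0,
                     fetch: Callable[[str], Any] = fetch_tenant_listing,
                     sleep: Callable[[float], None] = time.sleep,
                     max_polls: Optional[int] = None) -> None:
    """Block until every expected tenant appears in the listing."""
    polls = 0
    while True:
        try:
            missing = missing_tenants(fetch(base_url), expected)
        except (OSError, ValueError):
            # The web service may still be starting; treat as "not ready yet".
            missing = list(expected)
        if not missing:
            return
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("tenants still missing: " + ", ".join(missing))
        sleep(poll_seconds)


# Example call (not executed here):
# wait_for_tenants("https://zuul.opendev.org", ["openstack"])
```

Since the OpenStack tenant is described as the slowest to load, waiting on that single name mirrors the manual check. Passing ``fetch`` and ``sleep`` as parameters keeps the loop testable without network access.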