From 7eff5b5af24e9d742ca36689fd2996b53016e4af Mon Sep 17 00:00:00 2001
From: Clark Boylan
Date: Wed, 20 Oct 2021 09:49:56 -0700
Subject: [PATCH] Update zuul restart documentation

It was recently pointed out that our restart process for zuul is a bit
stale. Document the new modern process that deals with ansible playbooks
and docker containers.

Change-Id: I52812e87ed73e6ed538f94a86c1b62ce3de57c37
---
 doc/source/zuul.rst | 94 ++++++++++++++++++++++++++-------------------
 1 file changed, 55 insertions(+), 39 deletions(-)

diff --git a/doc/source/zuul.rst b/doc/source/zuul.rst
index 375f6ecc8e..643360d49c 100644
--- a/doc/source/zuul.rst
+++ b/doc/source/zuul.rst
@@ -108,7 +108,7 @@ Scheduler
 ---------
 
 The Zuul Scheduler and gear are all co-located on a single host,
-referred to by the ``zuul.openstack.org`` CNAME in DNS.
+referred to by the ``zuul.opendev.org`` CNAME in DNS.
 
 Zuul is stateless, so the server does not need backing up. However
 zuul talks through git and ssh so you will need to manually check ssh
@@ -127,44 +127,6 @@ a MySQL database via the SQL Reporter plugin. The database for that is
 a Rackspace Cloud DB and is configured in the ``mysql`` entry of the
 ``zuul_connection_secrets`` entry for the ``zuul-scheduler`` group.
 
-Restarting the Scheduler
-------------------------
-
-Zuul Scheduler restarts are disruptive, so non-emergency restarts should
-always be scheduled for quieter times of the day, week and cycle. To be as
-courteous to developers as possible, just prior to a restart the `Zuul
-Status Page`_ should be checked to see the status of the gate. If there is a
-series of changes nearly merged, wait until that has been completed.
-
-Since Zuul is stateless, some work needs to be done to save and then
-re-enqueue patches when restarts are done. To accomplish this, start by
-running `zuul-changes.py
-`_
-to save the check and gate queues::
-
-  python /opt/zuul/tools/zuul-changes.py http://zuul.openstack.org \
-      check >check.sh
-  python /opt/zuul/tools/zuul-changes.py http://zuul.openstack.org \
-      gate >gate.sh
-
-These check.sh and gate.sh scripts will be used after the restart to
-re-enqueue the changes.
-
-Now use `service zuul stop` to stop zuul and then run ps to make sure
-the process has actually stopped, it may take several seconds for it to
-finally go away.
-
-Once you're ready, use `service zuul start` to start zuul again.
-
-To re-enqueue saved jobs, first run the gate.sh script and then check.sh to
-re-enqueue the changes from before the restart::
-
-  ./gate.sh
-  ./check.sh
-
-You may watch the `Zuul Status Page`_ to confirm that changes are
-returning to the queues.
-
 Executors
 ---------
 
@@ -194,6 +156,60 @@ Zuul Web is stateless so is safe to restart, however restarting it will
 result in a loss of connection for anyone watching a live-stream of a
 console log when the restart happens.
 
+Restarting Zuul Services
+------------------------
+
+Currently the safest way to restart the Zuul scheduler is to restart all
+services at the same time. The reason for this is that if the scheduler is
+restarted but executors are not then the executors and scheduler can get out
+of sync with each other. Note that restarting zuul web or a single executor
+should continue to be safe as noted above, but this process should generally
+be preferred.
+
+Zuul Scheduler restarts are disruptive, so non-emergency restarts should
+always be scheduled for quieter times of the day, week and cycle. We should
+attempt to be courteous and avoid restarts when project teams are cutting
+releases or have other important changes that are about to land.
+
+Since Zuul is stateless, some work needs to be done to save and then
+re-enqueue patches when restarts are done. To accomplish this, start by
+running the zuul-changes script to save the check and gate queues::
+
+  root@zuul02# ~root/zuul-changes.py https://zuul.opendev.org >queues-$(date +%Y%m%d).sh
+
+This script will be executed when Zuul is up and running again to restore
+the previous queue contents.
+
+One other thing to consider before restarting all zuul services is you may
+want to update all of the zuul docker images. This can be useful if restarting
+Zuul to correct a bug that was fixed in the Zuul codebase. To do this run
+the zuul_pull.yaml playbook from bridge::
+
+  root@bridge# ansible-playbook -f 20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_pull.yaml
+
+Once ready to restart all Zuul services you will want to run the
+zuul_restart.yaml playbook from bridge to do this::
+
+  root@bridge# ansible-playbook -f 20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_restart.yaml
+
+Once this playbook is done running the services will have been restarted, but
+the Zuul system still needs to load its configs before it is ready to do work.
+The `root `_ of the Zuul dashboard will show you
+loaded tenants. Once all tenants show up on this page it is safe to proceed
+with re-enqueuing changes to pipelines with the script we generated earlier.
+Note that the OpenStack tenant takes the most time. If you wait for it to
+show up in the dashboard you should be ready to go. You can double check
+this by loading the OpenStack Zuul `status
+`_ and ensuring it doesn't report
+an error.
+
+To re-enqueue, execute the previously generated script::
+
+  root@zuul02# bash queues-$(date +%Y%m%d).sh
+
+When this has completed you are done with the Zuul restart. Consider logging
+the restart and update with statusbot in IRC.
+
 Secrets
 -------
 
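The save and re-enqueue commands in the new "Restarting Zuul Services" section each evaluate ``$(date +%Y%m%d)`` on their own, so a restart that runs across midnight would make the second command look for a file that was never written. A minimal sketch of one way to avoid that, assuming both steps run in the same root shell session on the scheduler host. ``QUEUE_FILE``, ``save_queues`` and ``restore_queues`` are illustrative names, not part of the documented procedure:

```shell
# Illustrative sketch only: QUEUE_FILE, save_queues and restore_queues are
# hypothetical names, not part of the documented restart procedure.

# Fix the file name once so both steps agree even if the date changes.
QUEUE_FILE="queues-$(date +%Y%m%d).sh"

# Run before the restart: save the current queue contents.
save_queues() {
    ~root/zuul-changes.py https://zuul.opendev.org >"$QUEUE_FILE"
}

# Run after all tenants have loaded: re-enqueue the saved changes.
restore_queues() {
    bash "$QUEUE_FILE"
}
```

Call ``save_queues`` before running the restart playbook and ``restore_queues`` once the dashboard lists every tenant. If the shell session is lost in between, the saved file name has to be looked up by hand.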