We must have missed this; I noticed when it didn't run on the gate job
for I949c40e9046008d4f442b322a267ce0c967a99dc
Change-Id: I62c5c0f262d9bd53580367dc9f1ad00fe7b6f6f2
We still have some Ubuntu Xenial servers, so cap the max usable pip
and setuptools versions in their venvs like we already do for
Bionic, in order to avoid broken installations. Switch our
conditionals from release name comparisons to version numbers in
order to more cleanly support ranges. Also make sure the borg run
test is triggered by changes to the create-venv role.
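A hedged sketch of what the version-based conditional can look like in
Ansible (variable names and exact caps are illustrative, not
necessarily what the create-venv role uses):

    # Illustrative caps; the real role presumably varies these per release.
    - name: Cap pip and setuptools in venvs on older Ubuntu
      set_fact:
        create_venv_pip: 'pip<21'
        create_venv_setuptools: 'setuptools<51'
      when:
        - ansible_distribution == 'Ubuntu'
        - ansible_distribution_version is version('20.04', '<')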
Change-Id: I5dd064c37786c47099bf2da66b907facb517c92a
Many of our tests are actually running with a timeout of 3600, likely
from a combination of bumping timeouts after failures and copy-pasting
jobs.
We are seeing frequent timeouts of other jobs without this,
particularly on OVH GRA1. Let's bump the base timeout to 3600 to
account for this. The only job that overrides this now is gitea, which
runs for 4800 due to its long import process.
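In Zuul terms this is just the timeout attribute on the base job; a
minimal sketch (job name illustrative):

    - job:
        name: system-config-run-base
        timeout: 3600  # seconds; gitea overrides this with 4800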
Change-Id: I762f0f7c7a53a456d9269530c9ae5a9c85903c9c
Keeping the testing nodes at the other end of the namespace separates
them from production hosts. This one isn't really at risk of
referencing itself in testing like many others are, but move it anyway
for consistency.
Change-Id: I2130829a5f913f8c7ecd8b8dfd0a11da3ce245a9
Similar to Id98768e29a06cebaf645eb75b39e4dc5adb8830d, move the
certificate variables to the group definition file, so that we don't
have to duplicate handlers or definitions for the testing host.
Change-Id: I6650f5621a4969582f40700232a596d84e2b4a06
Currently we define the letsencrypt certs for each host in its
individual host variables.
With recent work we have a trusted CA and SAN names set up in our
testing environment, introducing the possibility that we could
accidentally reference the production host during testing (both have
valid certs, as far as the testing hosts are concerned).
To avoid this, we can use our naming scheme to move our testing hosts
to "99" and avoid collision with the production hosts. As a bonus,
this really makes you think more about your group/host split to get
things right and keep the environment as abstract as possible.
One example of this is that with letsencrypt certificates defined in
host vars, testing and production need to use the same hostname to get
the right certificates created. Really, this should be group-level
information so it applies equally to host01 and host99. To cover
"hostXX.opendev.org" as a SAN we can include the inventory_hostname in
the group variables.
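A hedged sketch of the resulting group-level definition (the variable
name and layout are assumed here and may differ from what the
letsencrypt role actually expects):

    # group_vars for the static group; applies to host01 and host99 alike
    letsencrypt_certs:
      static-main:
        - static.opendev.org
        - '{{ inventory_hostname }}'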
This updates one of the more tricky hosts, static, as a proof of
concept. We rename the handlers to be generic, and update the testing
targets.
Change-Id: Id98768e29a06cebaf645eb75b39e4dc5adb8830d
I've seen a couple of jobs time out on this for no apparent reason.
Loading all the repos just seems to take a long time. Looking at the
logs [1], runs taking 55m - 1h are not terribly uncommon, depending on
the cloud. Increase the timeout on this by 20 minutes to give it
enough headroom over an hour.
[1] https://zuul.opendev.org/t/openstack/builds?job_name=system-config-run-gitea&project=opendev%2Fsystem-config
Change-Id: I51080820bae35ac615a3b8b7ee1b8890e0df8410
This is the first step in running our servers on jammy. It will help
us boot new servers on jammy, as well as jammy replacements for
existing bionic servers.
Change-Id: If2e8a683c32eca639c35768acecf4f72ce470d7d
This reverts commit 21c6dc02b5b3069e4c9410416aeae804b2afbb5c.
Everything appears to be working with Ansible 2.9, which does seem to
suggest that reverting this will result in jobs timing out again. We
will monitor this, and I76ba278d1ffecbd00886531b4554d7aed21c43df is a
potential fix for this.
Change-Id: Id741d037040bde050abefa4ad7888ea508b484f6
We've been seeing ansible post-run playbook timeouts in our infra-prod
jobs. The only major thing that has changed recently is the default
update to ansible 5 for these jobs. Force them back to 2.9 to see if the
problem goes away.
Albin Vass has noted that there are possible glibc + debian bullseye +
ansible 5 problems that may be causing this. If we determine 2.9 is
happy then this is the likely cause.
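A minimal sketch of the pin, via Zuul's per-job Ansible version
selection (job name illustrative):

    - job:
        name: infra-prod-base
        # pin back to 2.9 while we investigate the Ansible 5 timeouts
        ansible-version: 2.9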
Change-Id: Ibd40e15756077d1c64dba933ec0dff6dc0aac374
If the production playbook times out, we don't get any logs collected
with the run. By moving the log collection into a post-run step, we
should always get something copied to help us diagnose what is going
wrong.
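Zuul runs post-run playbooks even when the run phase fails or times
out, which is what makes this work; a rough sketch (playbook paths
illustrative):

    - job:
        name: infra-prod-playbook
        run: playbooks/zuul/run-production-playbook.yaml
        # executes even if the run playbook fails or times out
        post-run: playbooks/zuul/collect-production-logs.yaml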
Change-Id: I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3
These files got moved around and refactored to better support testing of
the Gerrit 3.5 to 3.6 upgrade path. Make sure we trigger the test jobs
when these files are updated.
Change-Id: I5a520e8a8a7c794a761279d4fb98c23e5d25f0ad
haproxy only logs to /dev/log; this means all our access logs get
mixed into syslog. This makes it impossible to pick out anything in
syslog that might be interesting (and vice versa: you have to filter
things out if analysing just the haproxy logs).
It seems like the standard way to deal with this is to have rsyslogd
listen on a separate socket, and then point haproxy to that. So this
configures rsyslogd to create /var/run/dev/log and maps that into the
container as /dev/log (i.e. we don't have to reconfigure the container
at all).
We then capture this socket's logs to /var/log/haproxy.log, and install
rotation for it.
Additionally we collect this log from our tests.
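On the container side this is just a bind mount in the compose file; a
minimal sketch (service details illustrative):

    services:
      haproxy:
        volumes:
          # expose the host-side rsyslog socket as the container's /dev/log
          - /var/run/dev/log:/dev/log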
Change-Id: I32948793df7fd9b990c948730349b24361a8f307
Move the paste testing server to paste99 to distinguish it in testing
from the actual production paste service. Since we have certificates
set up now, we can directly test against "paste99.opendev.org",
removing the insecure flags to various calls.
Change-Id: Ifd5e270604102806736dffa86dff2bf8b23799c5
When we migrated this to ansible I missed that we didn't bring across
the storage-aggregation.conf file.
This has had the unfortunate effect of regressing the xFilesFactor set
for every newly created graphite stat since the migration. This
setting is a ratio (a 0-1 float) of how much of a "bucket" needs to be
non-null for the value to be kept when rolling up. We want this to be
zero due to the sporadic nature of our data (see the original change
I5f416e798e7abedfde776c9571b6fc8cea5f3a33).
This only affected newly created statistics, as graphite doesn't
modify this setting once it creates the whisper file. This probably
helped us overlook the problem for so long: longer-existing stats were
operating correctly, but newer ones were dropping data when zoomed out.
Restore this setting, and double-check it in testinfra for the future.
For simplicity, and to get this back to the prior state, I will
manually update the on-disk .wsp files to match when this change
applies.
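The restored file presumably boils down to something like the
following (aggregationMethod shown with graphite's default; the real
file may carry more patterns):

    [default]
    pattern = .*
    # keep rolled-up values even when most datapoints in a bucket are null
    xFilesFactor = 0
    aggregationMethod = average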
Change-Id: I57873403c4ca9783b1851ba83bfba038f4b90715
This adds upgrade testing from our current Gerrit version (3.5) to the
likely future version of our next upgrade (3.6).
To do so we have to refactor the gerrit testing, because the 3.5 to 3.6
upgrade requires we run a command against 3.5. The previous upgrade
system assumed the old version could be left alone and jumped straight
into the upgrade, only testing the end state. Now we have split up the
gerrit bootstrapping and gerrit testing so that normal gerrit testing
and upgrade testing can run these tasks at different points in the
gerrit deployment process.
Now the upgrade tests use the bootstrapping playbook to create users,
projects, and changes on the old version of gerrit before running the
copy-approvals command. Then after the upgrade we run the test assertion
portion of the job.
Change-Id: Id58b27e6f717f794a8ef7a048eec7fbb3bc52af6
This adds Gerrit 3.6 image build jobs as well as CI testing for this
version of Gerrit. Once we've got images that build and function
generally we'll reenable the upgrade job and work through that.
Change-Id: I494a21911a2279228e57ff8d2b731b06a1573438
This removes our Gerrit 3.4 image builds as well as testing. We should
land this once enough time has passed since the 3.5 upgrade that we
are unlikely to revert it.
Depends-On: https://review.opendev.org/c/openstack/project-config/+/847057
Change-Id: Iefa7cc1157311f0239794b15bea7c93f0c625a93
We've upgraded to Gerrit 3.5 so now need to wait for the 3.5 image to
promote rather than the 3.4 image when deploying Gerrit.
Change-Id: Ic3a4d578aea955aeee51f4cac7f4c95de931a94b
3.4.5 is a fairly minor update. Some bugs are fixed and jgit is updated.
3.4.5 release notes:
https://www.gerritcodereview.com/3.4.html#345
3.5.2 is a bigger update and, importantly, adds support for upgrading
to 3.6.0 later. There is a new copy-approvals command that must be run
offline on 3.5.2 before upgrading to 3.6.0. This apparently copies
approvals in the notedb into a format that 3.6.0 can handle. The
release notes indicate it may take some time to run. We don't need to
run it now, though; instead we need to make note of it when we prepare
for the 3.6.0 upgrade.
3.5.2 release notes:
https://www.gerritcodereview.com/3.5.html#352
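For reference, the offline invocation is presumably along these lines
(site path illustrative):

    # must be run with gerrit stopped, before starting 3.6
    java -jar gerrit.war copy-approvals -d /var/gerrit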
For now don't overthink things and instead just get up to date with our
images.
Change-Id: I837c2cbb09e9a4ff934973f6fc115142d459ae0f
The status.openstack.org server is offline now that it no longer
hosts any working services. Remove all configuration for it in
preparation for retiring related Git repositories.
Also roll some related cleanup into this for the already retired
puppet-kibana module.
Change-Id: I3cfcc129983e3641dfbe55d5ecc208c554e97de4
I think this was overlooked in the removal of the ELK stack with
I5f7f73affe7b97c74680d182e68eb4bfebbe23e1; the repo is now retired.
Change-Id: I87bfe7be61f20a7c05c500af4e82b787d9c37a8c
Now that we've cleaned up the old unused images we can look forward to
new Python. Add Python 3.10 base images based on Bullseye.
As part of this process we update the default var values in our
Dockerfiles to set Bullseye and Python 3.10 as our defaults, as these
should be valid for some time. We also tidy up some yaml anchor names
so that copy-and-pasting for future image versions is easier to do
with text replacement.
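A hedged sketch of the Dockerfile defaults (ARG names and base image
illustrative, not necessarily our actual images):

    # defaults that new image variants override as needed
    ARG PYTHON_VERSION=3.10
    ARG DEBIAN_VERSION=bullseye
    FROM docker.io/library/python:${PYTHON_VERSION}-slim-${DEBIAN_VERSION}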
Change-Id: I4943a9178334c4bdf10ee5601e39004d6783b34c
Everything is running on 3.8 or newer, which should allow us to remove
the 3.7 images. This reduces the total set before we add python3.10
images and acts as good cleanup.
Change-Id: I2cc02fd681485f35a1b0bf1c089a12a4c5438df3