Now that we have a mostly working 3.8 image it is time to test the
upgrade from 3.7 (what we run in prod) to 3.8 (what we will eventually
run in prod).
Change-Id: Ied8bae6b80cff79668a293ae2f30498abbf6839d
We've been running Gerrit 3.7 for some time now and seem very unlikely
to revert at this point. Clean up the Gerrit 3.6 image builds as we
don't need them anymore.
This change also comments out the 3.6 -> 3.8 upgrade job. Followup
changes will add 3.8 image builds and test the 3.7 -> 3.8 upgrade
process.
Depends-On: https://review.opendev.org/c/openstack/project-config/+/881595
Change-Id: I759b34e48dcede7ffaa66c83da01b81c4fed4b4f
The tsig_key value is a shared secret between the hidden-primary and
secondary servers to facilitate secure zone transfers. Thus we should
store it once in the common "adns" group, rather than duplicating it
in the adns-primary and adns-secondary groups.
Change-Id: I600f1ecdfc06bda79b6a4ce77253f489ad515fa5
There were two problems with our gerrit upgrade config diff checking.
The first is that we were comparing using command exit codes after
piping diff to tee without setting pipefail. This meant that even if
the diff failed we got an exit code of 0 from tee and everything passed
happily.
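
For illustration, a minimal sketch of the failure mode (the file names
and contents here are stand-ins, not the paths the job actually uses):

    import pathlib
    import subprocess

    # Two config files that differ, standing in for the pre/post upgrade config.
    pathlib.Path("old.config").write_text("[gitweb]\n  type = gitweb\n")
    pathlib.Path("new.config").write_text("[gitweb]\n  type = \"gitweb\"\n")

    # Without pipefail the pipeline's exit status is tee's, hiding diff's failure.
    broken = subprocess.run(
        ["bash", "-c", "diff old.config new.config | tee config.diff"])
    print("without pipefail:", broken.returncode)  # 0 despite the delta

    # With pipefail set, diff's non-zero exit status propagates through the pipe.
    fixed = subprocess.run(
        ["bash", "-c",
         "set -o pipefail; diff old.config new.config | tee config.diff"])
    print("with pipefail:", fixed.returncode)      # 1, so the check fails as intended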
Second, we were not checking the config diff state prior to the Gerrit
upgrade. If the old gerrit version also updated the config on disk we
wouldn't get a diff
after upgrading to the new version. I think that is ultimately what
broke for us here because the 3.6 and 3.7 config diffs are empty, but
they differ from what we initially write to disk. As for explaining why
this might happen I can only assume some update to 3.6 made the changes
we saw after we had deployed 3.6.
As a result of checking things more thoroughly we need to update our
config to remove any delta. This removes some extraneous quoting around
gitweb config entries to do that.
Change-Id: I9c7e73d5f64546fb57a21a249b29e2aca9229ac7
This switches us to running the services against the etherpad group. We
also define vars in a group_vars file rather than a host-specific
file. This allows us to switch testing over to etherpad99 to decouple it
from our production hostnames.
A followup change will add a new etherpad production server that will be
deployed alongside the existing one. This refactor makes that a bit
simpler.
Change-Id: I838ad31eb74a3abfd02bbfa77c9c2d007d57a3d4
This updates our base config to 3.7. This should only be merged as
part of the update process described at
https://etherpad.opendev.org/p/gerrit-upgrade-3.7
Change-Id: I9a1fc4a9f35ed0f60b9899cb9d08aa81995e640b
Adding these two labels to the test project gives us better coverage
of the copyCondition and submit-requirement setup of projects.
Review-Priority is voting, so has a blocking s-r. Backport-Candidate
is an example of a non-voting (or "trigger" as gerrit calls it) label,
which should show up in a different part of the UI.
This should help us better evaluate any changes in this area,
particularly UI changes via our test screenshots.
Change-Id: Ib42fa61d9805d1946da7307f0969cdaf6a937514
Firstly, my understanding of "adns" is that it's short for
authoritative-dns; i.e. things related to our main non-recursive DNS
servers for the zones we manage. The "a" is useful to distinguish
this from any sort of other dns services we might run for CI, etc.
The way we do this is with a "hidden" server that applies updates from
config management, which then notifies secondary public servers which
do a zone transfer from the primary. They're all "authoritative" in
the sense they're not for general recursive queries.
As mentioned in Ibd8063e92ad7ff9ee683dcc7dfcc115a0b19dcaa, we
currently have 3 groups:
adns : the hidden primary bind server
ns : the secondary public authoritative servers
dns : both of the above
This proposes a refactor into the following 3 groups:
adns-primary : hidden primary bind server
adns-secondary : the secondary public authoritative servers
adns : both of the above
This is meant to be a no-op; I just feel like this makes it a bit
clearer as to the "lay of the land" with these servers. It will need
some consideration of the hiera variables on bridge if we merge.
Change-Id: I9ffef52f27bd23ceeec07fe0f45f9fee08b5559a
After reading through the Gerrit code I'm beginning to think that not
setting a function has it default to MaxWithBlock and this doesn't get
rejected like an explicit setting of MaxWithBlock. This means we may
not be properly exercising our new submit requirement rule for Verified.
Switch it to NoBlock explicitly to exercise the submit requirements.
Change-Id: I3296ba650c0c58326499604f1117916a990f0cf1
The last iteration of this donor environment was taken down at the
end of 2022, let's proceed with final config removal for it.
Change-Id: Icfa9a681f052f69d96fd76c6038a6cd8784d9d8d
We haven't used the Packethost donor environment in a very long
time, go ahead and clean up lingering references to it in our
configuration.
Change-Id: I870f667d10cc38de3ee16be333665ccd9fe396b9
The mirror in our Limestone Networks donor environment is now
unreachable, but we ceased using this region years ago due to
persistent networking trouble and the admin hasn't been around for
roughly as long, so it's probably time to go ahead and say goodbye
to it.
Change-Id: Ibad440a3e9e5c210c70c14a34bcfec1fb24e07ce
This is done for a number of reasons. First it will allow us to update
the python version used in the images as we can have a 3.10 builder and
base images (but not a 3.10 openjdk:11 image). Second it will allow us
to easily switch to openjdk 17 by simply updating the package we install
and some paths for the jdk location.
The goal here is to have more control over the images so that we can do
things like change python and java versions when we want to.
Depends-On: https://review.opendev.org/c/opendev/jeepyb/+/870873
Change-Id: I7ea2658caf71336d582c01be17a91759e9ac2043
This updates the gerrit upgrade testing job to upgrade from 3.6 to 3.7.
This upgrade requires an offline reindex which is new for us since we've
been on Gerrit 3.x. In order to support this offline reindex requirement
the gerrit role is modified to trigger an offline reindex in the role's
start tasks if the flag to do so is set. I expect this will really only
be used in testing, but it allows us to reuse almost everything else in
testing and in production, which is nice.
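
For reference, an offline reindex amounts to running Gerrit's reindex
program against the site directory before the service starts; a rough
sketch of what the start task does (the image name and site path here
are illustrative assumptions, not the exact values the role uses):

    import subprocess

    def offline_reindex(site="/var/gerrit", image="opendevorg/gerrit:3.7"):
        # Run Gerrit's offline reindex against the site dir before the
        # service container is started.
        subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{site}:/var/gerrit",
             image,
             "java", "-jar", "/var/gerrit/bin/gerrit.war",
             "reindex", "-d", "/var/gerrit"],
            check=True)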
Change-Id: Ibe68176970394cbe71c3126ff3fe7a1b0601b09a
Now that we've dropped gerrit 3.5 we can convert label functions to
submit requirements. This is required for Gerrit 3.7 but is optional
under 3.6. Eventually we'll need to do this for all of our custom labels
prior to the 3.7 upgrade.
Change-Id: I4dda45040842de76c12f36b5b6d6b5948d82077a
If this flag is set, the logs are copied into the published job.
There's no need to save an encrypted copy of the same thing.
Change-Id: I32ac5e0ac4d2307f2e1df88c5e2ccbe2fd381839
If infra_prod_playbook_collect_log is set, then we copy and publish
the playbook log in the job results.
Currently we skip renaming the log file on bridge in this case,
meaning that we don't keep logs of old runs on bridge. Also, there is
a bug in the bit that resets the timestamp on the logfile (so it is
timestamped by the time it started, not ended): it isn't checking
this flag, so we end up with a bunch of zero-length files in this
case.
I guess the thinking here was that since the log is published, there's
no need to keep it on bridge as well.
The abstract case here is really only instantiated for
manage-projects, which is the only job we publish the log for. Today
we wanted an older log, but it had already been purged from object
storage.
It seems worth keeping this on-disk as well as publishing it. Remove
the checks around the rename/cleanup. This will also fix the bug of
zero-sized files being created, because the renamed file will be there
now.
Change-Id: Ic5ab52797fef880ae3ec3d92c071ef802e63b778
These dummy variables were for the nodepool.yaml template during
testing, but are no longer referenced. Clean them up.
Change-Id: I717ab8f9b980b363fdddaa28e76cd269b1e4d876
This is just enough to get the cloud-launcher working on the new
Linaro cloud. It's a bit of a manual setup, and much newer hardware,
so trying to do things in small steps.
Change-Id: Ibd451e80bbc6ba6526ba9470ac48b99a981c1a8d
This should only be landed as part of our upgrade process. This change
will not upgrade Gerrit properly on its own.
Note, we keep Gerrit 3.5 image builds and 3.5 -> 3.6 upgrade jobs in
place until we are certain we won't roll back. Once we've crossed that
threshold we can drop 3.5 image builds, add 3.7 image builds, and update
the upgrade testing to perform a 3.6 -> 3.7 upgrade.
Change-Id: I40c4f96cc40edc5caeb32a1af80069ef784967fd
On the old bridge node we had some unmanaged venvs with a very old,
now unmaintained RAX DNS API interaction tool.
Adding the RDNS entries is fairly straightforward, and this small
tool is mostly a copy of some of the bits for our dns api backup tool.
It really just comes down to getting a token and making a post request
with the name/ip addresses.
When the cloud the node is launched in is identified as RAX, this will
automatically add the PTR records for the IPv4 & IPv6 addresses. It also
has an entrypoint to be called manually.
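
To sketch the core of that flow (the identity call is the standard
Rackspace apiKeyCredentials exchange; the rdns endpoint and payload
field names are my best-guess rendering of the Cloud DNS PTR API, so
treat them as assumptions rather than the tool's exact code):

    import requests

    IDENTITY_URL = "https://identity.api.rackspacecloud.com/v2.0/tokens"

    def get_token(username, api_key):
        # Exchange the account's API key for a token to authenticate DNS calls.
        resp = requests.post(IDENTITY_URL, json={
            "auth": {"RAX-KSKEY:apiKeyCredentials": {
                "username": username, "apiKey": api_key}}})
        resp.raise_for_status()
        return resp.json()["access"]["token"]["id"]

    def add_ptr(token, account_id, server_href, fqdn, address):
        # POST a PTR record pointing the address back at the server's FQDN.
        url = f"https://dns.api.rackspacecloud.com/v1.0/{account_id}/rdns"
        payload = {
            "recordsList": {"records": [
                {"type": "PTR", "name": fqdn, "data": address, "ttl": 3600}]},
            "link": {"rel": "cloudServersOpenStack", "href": server_href},
        }
        resp = requests.post(
            url, json=payload, headers={"X-Auth-Token": token})
        resp.raise_for_status()
        return resp.json()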
This is added and hacked in, along with a config file for the
appropriate account (I have added these details on bridge).
I've left the update of openstack.org DNS entries as a manual
procedure. Although they could be set automatically with small
updates to the tool (just a different POST) -- details like CNAMEs,
etc. and the relatively few servers we start in the RAX managed DNS
domains mean I think it's easier to just do this manually via the web ui.
The output comment is updated.
Change-Id: I8a42afdd00be2595ca73819610757ce5d4435d0a
For some reason, this was in our original lists.openstack.org Exim
configuration when we first imported it to Puppet so many years ago.
Somehow it's survived and multiplied its way into other configs as
well. Time to finally let it go.
Change-Id: I23470c10ae0324954cb2afda929c86e7ad34663e
The dependent change allows us to also post to mastodon. Configure
this to point to fosstodon where we have an opendevinfra account.
Change-Id: Iafa8074a439315f3db74b6372c1c3181a159a474
Depends-On: https://review.opendev.org/c/opendev/statusbot/+/864586
This should now be a largely functional deployment of mailman 3. There
are still some bits that need testing but we'll use followup changes to
force failure and hold nodes.
This deployment of mailman3 uses upstream docker container images. We
currently hack up uids and gids to accommodate that. We also hack up the
settings file and bind mount it over the upstream file in order to use
host networking. We override the hyperkitty index type to xapian. All
list domains are hosted in a single installation and we use native
vhosting to handle that.
We'll deploy this to a new server and migrate one mailing list domain at
a time. This will allow us to start with lists.opendev.org and test
things like dmarc settings before expanding to the remaining lists.
A migration script is also included, which has seen extensive
testing on held nodes for importing copies of the production data
sets.
Change-Id: Ic9bf5cfaf0b87c100a6ce003a6645010a7b50358
In thinking harder about the bootstrap process, it struck me that the
"bastion" group we have is two separate ideas that become a bit
confusing because they share a name.
We have the testing and production paths that need to find a single
bridge node so they can run their nested Ansible. We've recently
merged changes to the setup playbooks to not hard-code the bridge node
and they now use groups["bastion"][0] to find the bastion host -- but
this group is actually orthogonal to the group of the same name
defined in inventory/service/groups.yaml.
The testing and production paths are running on the executor, and, as
mentioned, need to know the bridge node to log into. For the testing
path this is happening via the group created in the job definition
from zuul.d/system-config-run.yaml. For the production jobs, this
group is populated via the add-bastion-host role which dynamically
adds the bridge host and group.
Only the *nested* Ansible running on the bastion host reads
s-c:inventory/service/groups.yaml. None of the nested-ansible
playbooks need to target only the currently active bastion host. For
example, we can define as many bridge nodes as we like in the
inventory and run service-bridge.yaml against them. It won't matter
because the production jobs know the host that is the currently active
bridge as described above.
So, instead of using the same group name in two contexts, rename the
testing/production group "prod_bastion". groups["prod_bastion"][0]
will be the host that the testing/production jobs use as the bastion
host -- references are updated in this change (i.e. the two places
this group is defined -- the group name in the system-config-run jobs,
and add-bastion-host for production).
We then can return the "bastion" group match to bridge*.opendev.org in
inventory/service/groups.yaml.
This fixes a bootstrapping problem -- if you launch, say,
bridge03.opendev.org the launch node script will now apply the
base.yaml playbook against it, and correctly apply all variables from
the "bastion" group which now matches this new host. This is what we
want to ensure, e.g. the zuul user and keys are correctly populated.
The other thing we can do here is change the testing path
"prod_bastion" hostname to "bridge99.opendev.org". By doing this we
ensure we're not hard-coding for the production bridge host in any way
(since if both testing and production are called bridge01.opendev.org
we can hide problems). This is a big advantage when we want to rotate
the production bridge host, as we can be certain there's no hidden
dependencies.
Change-Id: I137ab824b9a09ccb067b8d5f0bb2896192291883
This playbook can use the add-bastion-host role to add the bastion
host. This is one less place the bridge name is hard-coded.
Change-Id: I5ad7f6f1ac9bdf9af59b835d8fd466c3ca276639
This switches the bridge name to bridge01.opendev.org.
The testing path is updated along with some final references still in
testinfra.
The production jobs are updated in add-bastion-host, and will have the
correct setup on the new host after the dependent change.
Everything else is abstracted behind the "bastion" group; the entry is
changed here which will make all the relevant playbooks run on the new
host.
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/862551
Change-Id: I21df81e45a57f1a4aa5bc290e9884e6dc9b4ca13
Similar to I84acaa917187db092a302519c14bc94a6a87c2c0, this is a
follow-on to I286796ebd71173019a627f8fe8d9a25d0bfc575a.
At this point, there is no "bastion" group for the executor Ansible to
use.
The idea here is to reduce the number of places we're directly
referencing bridge.openstack.org. We could move this into a job
variable, but that's the same as defining it here. KISS and reference
it directly here (since it's in a role and used multiple times, it's
still better than hardcoding in multiple places).
Change-Id: If6dbcb34e25e3eb721cd2892b8adb84344289882
In I286796ebd71173019a627f8fe8d9a25d0bfc575a we abstracted adding the
bastion host into this role. However, when running on the executor
this role doesn't see playbooks/roles; the roles should be in
playbooks/zuul/roles as they are then siblings to the playbooks
running the production jobs (zuul/run-production-playbook[-post].yaml)
Change-Id: I84acaa917187db092a302519c14bc94a6a87c2c0
Following-on from Iffb462371939989b03e5d6ac6c5df63aa7708513, instead
of directly referring to a hostname when adding the bastion host to
the inventory for the production playbooks, this finds it from the
first element of the "bastion" group.
As we do this twice for the run and post playbooks, abstract it into a
role.
The host value is currently "bridge.openstack.org" -- as is the
existing hard-coding -- thus this is intended to be a no-op change.
It is setting the foundation to make replacing the bastion host a
simpler process in the future.
Change-Id: I286796ebd71173019a627f8fe8d9a25d0bfc575a
The prior change Iffb462371939989b03e5d6ac6c5df63aa7708513 added the
"bastion" group for system-config-run-* jobs, and the dependent change
here adds the bridge host to the "bastion" group when it is
dynamically added in opendev/base-jobs.
This playbook can thus refer to the bastion group, rather than having
to hardcode the hostname.
This should have no effect in production as it all still refers to the
existing bridge.openstack.org; but will make it easier to switch in
the (near) future.
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/861026
Change-Id: Icc52d2544afc1faf519a036cda94a3cae10448ee
This replaces hard-coding of the host "bridge.openstack.org" with
hard-coding of the first (and only) host in the group "bastion".
The idea here is that we can, as much as possible, simply switch one
place to an alternative hostname for the bastion such as
"bridge.opendev.org" when we upgrade. This is just the testing path,
for now; a follow-on will modify the production path (which doesn't
really get speculatively tested)
This needs to be defined in two places:
1) We need to define this in the run jobs for Zuul to use in the
playbooks/zuul/run-*.yaml playbooks, as it sets up and collects
logs from the testing bastion host.
2) The nested Ansible run will then use the inventory in
inventory/service/groups.yaml
Various other places are updated to use this abstracted group as the
bastion host.
Variables are moved into the bastion group (which only has one host --
the actual bastion host) which means we only have to update the group
mapping to the new host.
This is intended to be a no-op change; all the jobs should work the
same, but just using the new abstractions.
Change-Id: Iffb462371939989b03e5d6ac6c5df63aa7708513
As a short history diversion, at one point we tried building
diskimage-builder based images for upload to our control-plane
(instead of using upstream generic cloud images). This didn't really
work because the long-lived production servers led to leaking images
and nodepool wasn't really meant to deal with this lifecycle.
Before this the only thing that needed credentials for the
control-plane clouds was bridge.
Id1161bca8f23129202599dba299c288a6aa29212 reworked things to have a
control-plane-clouds group which would have access to the credential
variables.
So at this point we added
zuul/templates/group_vars/control-plane-clouds.yaml.j2 with stub
variables for testing.
However, we also have the same cloud: variable with stub variables in
zuul/templates/host_vars/bridge.openstack.org.yaml.j2. This is
overriding the version from control-plane-clouds because it is more
specific (host variable). Over time this has skewed from the
control-plane-clouds definition, but I think we have not noticed
because we are not updating the control-plane clouds on the non-bridge
(nodepool) nodes any more.
This is a long way of saying remove the bridge-specific definitions,
and just keep the stub variables in the control-plane-clouds group.
Change-Id: I6c1bfe7fdca27d6e34d9691099b0e1c6d30bb967
The idea with this role is to install the root key from the on-disk
RSA secret. However, when this play runs against localhost it doesn't
match the host-variable defined root_rsa_key.
This is being run nested -- the executor Ansible task has forked the
Ansible we have installed on the bridge which is now installing this.
"connection: local" does what we want here -- it makes ansible assume
bridge.openstack.org is 127.0.0.1 -- which it is -- and avoids us
having to worry about the bootstrap ssh-ing back to itself.
This is a fixup for Iebaeed5028050d890ab541818f405978afd60124
Change-Id: I4cdcc373d1b7b6fa542a78c9f84067c79352d2f6
In discussion of other changes, I realised that the bridge bootstrap
job is running via zuul/run-production-playbook.yaml. This means it
uses the Ansible installed on bridge to run against itself -- which
isn't much of a bootstrap.
What should happen is that the bootstrap-bridge.yaml playbook, which
sets up ansible and keys on the bridge node, should run directly from
the executor against the bridge node.
To achieve this we reparent the job to opendev-infra-prod-setup-keys,
which sets up the executor to be able to log into the bridge node. We
then add the host dynamically and run the bootstrap-bridge.yaml
playbook against it.
This is similar to the gate testing path, where bootstrap-bridge.yaml
is run from the executor against the ephemeral bridge testing node
before the nested-Ansible is used.
The root key deployment is updated to use the nested Ansible directly,
so that it can read the variable from the on-host secrets.
Change-Id: Iebaeed5028050d890ab541818f405978afd60124