90 Commits

Author SHA1 Message Date
James E. Blair
99342db824 Add a standalone zuul db server
Change-Id: Ibb260f820dbc1d9d6ca523ff3903134612cb003e
2024-04-04 12:25:23 -07:00
Jeremy Stanley
f477e35561 Upgrade to Keycloak 23.0
This includes a switch from the "legacy" style Wildfly-based image
to a new setup using Quarkus.

Because Keycloak maintainers consider H2 databases as a test/dev
only option, there are no good migration and upgrade paths short of
export/import data. Go ahead and change our deployment model to rely
on a proper RDBMS, run locally from a container on the same server.

Change-Id: I01f8045563e9f6db6168b92c5a868b8095c0d97b
2024-02-06 05:33:37 +00:00
Clark Boylan
47d2e07d94 Trigger gerrit image promotion when the gerrit image jobs update
We often need to update gerrit image build details that only live in the
job specification, for example tag or branch versions of gerrit and
related repos. When we do this without also making a noop update to our
Dockerfiles, the promotion job doesn't run for these images because the
implicit file match in the promotion (deploy) pipeline never happens.

Fix this by explicitly matching the job config file in our jobs so that
when we update the gerrit jobs we also run the gerrit image promotion
jobs.

We also ensure the system-config-run-review and
infra-prod-service-review jobs are triggered when the docker image jobs
update. This ensures we actually test the resulting images and then
perform potentially necessary deployment actions before they are pulled
into use.

Change-Id: Id0c51818cd1e01bd16a79ab0c0f9172e844376b8
2023-11-29 10:19:02 -08:00
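The fix described above amounts to adding the job configuration file itself to the job's `files` matcher, so editing the job definition triggers the promotion. A minimal Zuul sketch (the job name and file paths are illustrative, not the actual system-config layout):

```yaml
- job:
    name: system-config-promote-image-gerrit
    files:
      # Image content changes:
      - docker/gerrit/.*
      # The job config file itself, so that changing a tag or branch
      # version in the job definition also triggers promotion:
      - zuul.d/docker-images/gerrit.yaml
```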
Jeremy Stanley
73f0a5336a Merge production and test node mailman configs
Now that the Mailman v3 migration is complete, we no longer need any
divergence between the lists01 (production) and lists99 (test node)
host vars, so put everything into the group vars file instead.

Change-Id: If92943694e95ef261fbd254eff65a51d8d3f7ce5
2023-10-30 19:26:03 +00:00
Zuul
bd3fd30462 Merge "Remove the old mailing list server" 2023-10-20 23:04:26 +00:00
Jeremy Stanley
cab53d10ac Remove the old mailing list server
Clean up references to lists.openstack.org other than as a virtual
host on the new lists01.opendev.org Mailman v3 server. Update a few
stale references to the old openstack-infra mailing list (and
accompanying stale references to the OpenStack Foundation and
OpenStack Infra team). Update our mailing list service documentation
to reflect the new system rather than the old one. Once this change
merges, we can create an archival image of the old server and delete
it (as well as removing it from our emergency skip list for
Ansible).

Side note, the lists.openstack.org server will be 11.5 years old on
November 1, created 2012-05-01 21:14:53 UTC. Farewell, old friend!

Change-Id: I54eddbaaddc7c88bdea8a1dbc88f27108c223239
2023-10-20 18:10:08 +00:00
Clark Boylan
944b78154d Fix the relevant files lists for lists3 jobs
Fix the infra-prod-service-lists3 job to trigger when we update the
mailman3.yaml group vars file. In addition, we make a noop reorganization
change to the mailman3 group file to collect the exim vars together, which
ensures that this change itself triggers the lists3 job as expected.

In system-config-run-lists3 we update that job to be triggered when we
update the docker images for mailman. We don't bother testing this now
as that would be masked off by the update to the mailman3 groups file.
But in the future when we do mailman3 image updates we'll be looking for
this job to run.

Change-Id: I994b0a79bf46f525dd9e059719f5a08c9c390b8c
2023-10-15 19:52:01 -07:00
Jeremy Stanley
a6ab3543fc Move Airship and Kata lists to Mailman 3
This uncomments the list additions for the lists.airshipit.org and
lists.katacontainers.io sites on the new mailman server, removing
the configuration for them from the lists.opendev.org server and, in
the case of the latter, removing all our configuration management
for the server as it was the only site hosted there.

Change-Id: Ic1c735469583e922313797f709182f960e691efc
2023-09-14 12:08:34 +00:00
Zuul
53391950e1 Merge "Run bootstrap-bridge with empty nodeset" 2023-09-04 12:53:13 +00:00
Jeremy Stanley
c9c8febd84 Trigger mm3 deployment when containers change
Add the docker/mailman tree to the infra-prod-service-lists3 job so
that we deploy new versions whenever we make changes to the
container images.

Change-Id: Ife5e878b1f81c2879c2959fe6d4de22fe841583b
2023-08-25 16:35:46 +00:00
Clark Boylan
9fdbed9c27 Run bootstrap-bridge with empty nodeset
We are currently using the default nodeset on the
infra-prod-bootstrap-bridge job which results in us waiting for a node
that we end up ignoring. As far as I can tell this job runs against
localhost and the add_host bridge entry. It ignores the default test
node from the nodeset.

Speed up job execution and reduce node waste by setting an empty nodeset
on the job.

Change-Id: I8c3ffda60b92a8655989579335a49423fbdd18a2
2023-08-17 09:59:09 -07:00
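In Zuul job configuration, a job that only runs against localhost (plus hosts added dynamically via `add_host`) can declare an empty nodeset so no test node is ever requested. A sketch of the change described above:

```yaml
- job:
    name: infra-prod-bootstrap-bridge
    nodeset:
      nodes: []
```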
Clark Boylan
2875f64ed3 Run gitea and static tests when updating Apache UA filters
The gitea and static services both deploy apache using our UA filters.
We were not testing these services when updating UA filters. This change
fixes that, giving us some basic sanity checking that UA filter updates
are functional, as well as quicker deployments of these filters.

Change-Id: Icbe6558bb47946299a43905e2f64522576bad939
2023-05-02 10:18:41 -07:00
Zuul
4a101da52a Merge "Refactor adns variables" 2023-04-13 02:31:48 +00:00
Clark Boylan
ed1c7c94a3 Make etherpad configuration more generic for multiple hosts
This switches us to running the services against the etherpad group. We
also define vars in a group_vars file rather than a host specific
file. This allows us to switch testing over to etherpad99 to decouple it
from our production hostnames.

A followup change will add a new etherpad production server that will be
deployed alongside the existing one. This refactor makes that a bit
simpler.

Change-Id: I838ad31eb74a3abfd02bbfa77c9c2d007d57a3d4
2023-04-05 08:36:27 -07:00
Ian Wienand
be992b3bb6 infra-prod: run job against linaro
We have access to manage the linaro cloud, but we don't want to
completely own the host as it has been configured with kolla-ansible;
so we don't want to take over things like name resolution, iptables
rules, docker installation, etc.

But we would like to manage some parts of it, like rolling out our
root users, some cron jobs, etc.  While we could just log in and do
these things, it doesn't feel very openinfra.

This allows us to have a group "unmanaged" that skips the base jobs.
The base playbook is updated to skip these hosts.

For now, we add a cloud-linaro prod job that just does nothing so we
can validate the whole thing.  When it's working, I plan to add a few
things as discussed above.

Change-Id: Ie8de70cbac7ffb9d727a06a349c3d2a3b3aa0b40
2023-03-15 12:00:25 +11:00
Ian Wienand
b0d27692de Refactor adns variables
Firstly, my understanding of "adns" is that it's short for
authoritative-dns; i.e. things related to our main non-recursive DNS
servers for the zones we manage.  The "a" is useful to distinguish
this from any sort of other dns services we might run for CI, etc.

The way we do this is with a "hidden" server that applies updates from
config management, which then notifies secondary public servers which
do a zone transfer from the primary.  They're all "authoritative" in
the sense they're not for general recursive queries.

As mentioned in Ibd8063e92ad7ff9ee683dcc7dfcc115a0b19dcaa, we
currently have 3 groups

 adns : the hidden primary bind server
 ns : the secondary public authoritative servers
 dns : both of the above

This proposes a refactor into the following 3 groups

 adns-primary : hidden primary bind server
 adns-secondary : the secondary public authoritative servers
 adns : both of the above

This is meant to be a no-op; I just feel like this makes it a bit
clearer as to the "lay of the land" with these servers.  It will need
some considering of the hiera variables on bridge if we merge.

Change-Id: I9ffef52f27bd23ceeec07fe0f45f9fee08b5559a
2023-03-10 09:36:01 +11:00
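The before/after split above could be expressed as nested inventory groups, roughly like this (plain Ansible YAML inventory syntax for illustration; system-config uses its own group-matching plugin, and the hostnames are illustrative):

```yaml
all:
  children:
    adns:
      children:
        adns-primary:        # the hidden primary bind server
          hosts:
            adns1.opendev.org:
        adns-secondary:      # the public authoritative secondaries
          hosts:
            ns1.opendev.org:
            ns2.opendev.org:
```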
Ian Wienand
edb16542b1 Remove unused adns1/ns* host_vars files
These files are empty, so remove them to avoid any confusion.

Change-Id: I7f7f87a1058f5fb189b395a8f3ab6e7465940faf
2023-03-09 14:59:41 +11:00
Clark Boylan
a5dd619e24 Fix infra-prod-service-review file matchers
We changed review01.openstack.org to review02.openstack.org in the host
var file matchers for this job thinking that was the issue previously.
Unfortunately the actual file is review02.opendev.org. Update the
matcher again to actually trigger the job.

We also make a small edit to the gerrit role's README to ensure we
trigger the job when this change lands.

Change-Id: I1f235d0ddbb2d7f400ea2e99ffabdf5db35671a1
2023-03-03 11:47:02 -08:00
Clark Boylan
7b1b911e49 Trigger infra-prod-service-review when review02 hostvars update
We didn't update this job's file matchers when review01 was replaced
with review02. That caused us to miss triggering this job when review02
hostvars updated. Fix that, which should also cause this job to run since
we update the job.

Change-Id: I8b58ee26084681242b9881651d6eeab9ff8d5ad2
2023-02-17 10:11:53 -08:00
Jeremy Stanley
fa22fa726a Also bootstrap bridge any time inventory changes
We need the infra-prod-bootstrap-bridge job to add SSH host keys
from our Ansible inventory to the /etc/ssh_known_hosts on the
bridge. When adding a new server to the inventory, any added host
keys should be deployed. Make sure this happens.

Change-Id: I422f80fc033cfe8e20d6d30b0fe23f82800c4cea
2022-11-29 20:48:23 +00:00
Zuul
b7b2157133 Merge "Add a mailman3 list server" 2022-11-22 18:00:30 +00:00
Clark Boylan
c1c91886b4 Add a mailman3 list server
This should now be a largely functional deployment of mailman 3. There
are still some bits that need testing but we'll use followup changes to
force failure and hold nodes.

This deployment of mailman3 uses upstream docker container images. We
currently hack up uids and gids to accommodate that. We also hack up the
settings file and bind mount it over the upstream file in order to use
host networking. We override the hyperkitty index type to xapian. All
list domains are hosted in a single installation and we use native
vhosting to handle that.

We'll deploy this to a new server and migrate one mailing list domain at
a time. This will allow us to start with lists.opendev.org and test
things like dmarc settings before expanding to the remaining lists.

A migration script is also included, which has seen extensive
testing on held nodes for importing copies of the production data
sets.

Change-Id: Ic9bf5cfaf0b87c100a6ce003a6645010a7b50358
2022-11-11 23:20:19 +00:00
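The container arrangement described above (upstream images, host networking, a bind-mounted settings override) might look roughly like this in docker-compose terms; the image names, tags, and paths here are illustrative assumptions, not the actual system-config files:

```yaml
services:
  mailman-core:
    image: maxking/mailman-core:0.4    # upstream image; tag illustrative
    network_mode: host                 # override the image's default networking
  mailman-web:
    image: maxking/mailman-web:0.4
    network_mode: host
    volumes:
      # Bind-mount our settings over the file shipped in the image:
      - /etc/mailman-compose/settings.py:/opt/mailman-web-data/settings.py:ro
```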
Ian Wienand
51611845d4 Convert production playbooks to bastion host group
Following-on from Iffb462371939989b03e5d6ac6c5df63aa7708513, instead
of directly referring to a hostname when adding the bastion host to
the inventory for the production playbooks, this finds it from the
first element of the "bastion" group.

As we do this twice for the run and post playbooks, abstract it into a
role.

The host value is currently "bridge.openstack.org" -- as is the
existing hard-coding -- thus this is intended to be a no-op change.
It is setting the foundation to make replacing the bastion host a
simpler process in the future.

Change-Id: I286796ebd71173019a627f8fe8d9a25d0bfc575a
2022-10-20 09:49:10 +11:00
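Abstracted into a role, the lookup of the bastion host from the group might be sketched like this (the role path, task name, and target group are illustrative):

```yaml
# roles/add-bastion-host/tasks/main.yaml (illustrative path)
# Resolve the bastion from the group rather than hard-coding
# "bridge.openstack.org".
- name: Add bastion host to the running inventory
  add_host:
    name: "{{ groups['bastion'][0] }}"
    groups: prod_bastion      # group that later plays target
```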
Ian Wienand
d4c46ecdef Abstract name of bastion host for testing path
This replaces hard-coding of the host "bridge.openstack.org" with
hard-coding of the first (and only) host in the group "bastion".

The idea here is that we can, as much as possible, simply switch one
place to an alternative hostname for the bastion such as
"bridge.opendev.org" when we upgrade.  This is just the testing path,
for now; a follow-on will modify the production path (which doesn't
really get speculatively tested)

This needs to be defined in two places:

 1) We need to define this in the run jobs for Zuul to use in the
    playbooks/zuul/run-*.yaml playbooks, as it sets up and collects
    logs from the testing bastion host.

 2) The nested Ansible run will then use inventory
    inventory/service/groups.yaml

Various other places are updated to use this abstracted group as the
bastion host.

Variables are moved into the bastion group (which only has one host --
the actual bastion host) which means we only have to update the group
mapping to the new host.

This is intended to be a no-op change; all the jobs should work the
same, but just using the new abstractions.

Change-Id: Iffb462371939989b03e5d6ac6c5df63aa7708513
2022-10-20 09:00:43 +11:00
Ian Wienand
8efaf8da93 infra-prod-bootstrap-bridge: fix typo in playbook name
Introduced with Iebaeed5028050d890ab541818f405978afd60124

Change-Id: I2e06221d03589dc6bcb5fb060b439e35e3d604dc
2022-10-19 11:10:21 +11:00
Ian Wienand
77ebe6e0b7 infra-prod-bootstrap-bridge: run directly on bridge
In discussion of other changes, I realised that the bridge bootstrap
job is running via zuul/run-production-playbook.yaml.  This means it
uses the Ansible installed on bridge to run against itself -- which
isn't much of a bootstrap.

What should happen is that the bootstrap-bridge.yaml playbook, which
sets up ansible and keys on the bridge node, should run directly from
the executor against the bridge node.

To achieve this we reparent the job to opendev-infra-prod-setup-keys,
which sets up the executor to be able to log into the bridge node.  We
then add the host dynamically and run the bootstrap-bridge.yaml
playbook against it.

This is similar to the gate testing path, where bootstrap-bridge.yaml
is run from the executor against the ephemeral bridge testing node
before the nested Ansible is used.

The root key deployment is updated to use the nested Ansible directly,
so that it can read the variable from the on-host secrets.

Change-Id: Iebaeed5028050d890ab541818f405978afd60124
2022-10-15 10:39:53 +11:00
James E. Blair
c661fb0972 Add Jaeger tracing server
Change-Id: I1aa68b1d5f99364fa09776301894b922ed169a3a
2022-09-15 19:21:33 -07:00
Ian Wienand
5ba37ced60 paste: move certificate to group variable
Similar to Id98768e29a06cebaf645eb75b39e4dc5adb8830d, move the
certificate variables to the group definition file, so that we don't
have to duplicate handlers or definitions for the testing host.

Change-Id: I6650f5621a4969582f40700232a596d84e2b4a06
2022-08-05 08:18:55 +10:00
Ian Wienand
e70c1e581c static: move certs to group, update testing name to static99
Currently we define the letsencrypt certs for each host in its
individual host variables.

With recent work we have a trusted CA and SAN names setup in
our testing environment; introducing the possibility that we could
accidentally reference the production host during testing (both have
valid certs, as far as the testing hosts are concerned).

To avoid this, we can use our naming scheme to move our testing hosts
to "99" and avoid collision with the production hosts.  As a bonus,
this really makes you think more about your group/host split to get
things right and keep the environment as abstract as possible.

One example of this is that with letsencrypt certificates defined in
host vars, testing and production need to use the same hostname to get
the right certificates created.  Really, this should be group-level
information so it applies equally to host01 and host99.  To cover
"hostXX.opendev.org" as a SAN we can include the inventory_hostname in
the group variables.

This updates one of the more tricky hosts, static, as a proof of
concept.  We rename the handlers to be generic, and update the testing
targets.

Change-Id: Id98768e29a06cebaf645eb75b39e4dc5adb8830d
2022-08-05 08:18:55 +10:00
Ian Wienand
376648bfdc Revert "Force ansible 2.9 on infra-prod jobs"
This reverts commit 21c6dc02b5b3069e4c9410416aeae804b2afbb5c.

Everything appears to be working with Ansible 2.9, which does seem to
suggest that reverting this will result in jobs timing out again. We will
monitor this, and I76ba278d1ffecbd00886531b4554d7aed21c43df is a
potential fix for this.

Change-Id: Id741d037040bde050abefa4ad7888ea508b484f6
2022-07-17 09:07:20 +10:00
Clark Boylan
21c6dc02b5 Force ansible 2.9 on infra-prod jobs
We've been seeing ansible post-run playbook timeouts in our infra-prod
jobs. The only major thing that has changed recently is the default
update to ansible 5 for these jobs. Force them back to 2.9 to see if the
problem goes away.

Albin Vass has noted that there are possibly glibc + debian bullseye +
ansible 5 problems that may be causing this. If we determine 2.9 is
happy then this is the likely cause.

Change-Id: Ibd40e15756077d1c64dba933ec0dff6dc0aac374
2022-07-15 09:06:11 -07:00
Ian Wienand
21efe11eed production-playbook logs : move to post-run step
If the production playbook times out, we don't get any logs collected
with the run.  By moving the log collection into a post-run step, we
should always get something copied to help us diagnose what is going
wrong.

Change-Id: I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3
2022-07-15 07:58:23 +10:00
Clark Boylan
e2442eeaf0 Don't run infra-prod-run-refstack on all group var updates
This was running on all group var updates but we only need to run it
when refstack group vars update. Change the file requirements to match
the refstack.yaml group file to address this.

Change-Id: Id5ed4b65c1ed6566696fea9a33db27e9318af1a6
2022-03-04 15:30:47 -08:00
James E. Blair
3f8acefbe1 Run zuul-web on zuul01 and add to load balancer
Change-Id: Ia8b10338fa3a1876993404276e0759f4b10d6b54
2022-03-04 13:11:09 -08:00
Ian Wienand
3f6cd427d7 encrypt-logs: turn on for all prod playbooks
We have validated that the log encryption/export path is working, so
turn it on for all prod jobs.

Change-Id: Ic04d5b6e716dffedc925cb799e3630027183d890
2022-02-24 09:57:55 +11:00
Ian Wienand
7b22badf6a run-production-playbook: return encrypted logs
Based on the changes in I5b9f9dd53eb896bb542652e8175c570877842584,
enable returning encrypted log artifacts for the codesearch production
job, as an initial test.

Change-Id: I9bd4ed0880596968000b1f153c31df849cd7fa8d
2022-02-16 16:39:46 +11:00
James E. Blair
2a9553ef25 Add Zuul load balancer
This adds a load balancer for zuul-web and fingergw.

Change-Id: Id5aa01151f64f3c85e1532ad66999ef9471c5896
2022-02-10 13:24:42 -08:00
Ian Wienand
73a9acc7ad Rename install-ansible to bootstrap-bridge
This used to be called "bridge", but was then renamed with
Ia7c8dd0e32b2c4aaa674061037be5ab66d9a3581 to install-ansible to be
clearer.

It is true that this is installing Ansible, but as part of our
reworking for parallel jobs this is the also the synchronisation point
where we should be deploying the system-config code to run for the
buildset.

Thus naming this "bootstrap-bridge" should hopefully be clearer again
about what's going on.

I've added a note to the job calling out its difference from the
infra-prod-service-bridge job, to hopefully avoid some of the
initial confusion.

Change-Id: I4db1c883f237de5986edb4dc4c64860390cc8e22
2021-12-07 16:24:53 +11:00
James E. Blair
e79dbbe6bb Add a keycloak server
This adds a keycloak server so we can start experimenting with it.

It's based on the docker-compose file Matthieu made for Zuul
(see https://review.opendev.org/819745 )

We should be able to configure a realm and federate with openstackid
and other providers as described in the opendev auth spec.  However,
I am unable to test federation with openstackid due its inability to
configure an oauth app at "localhost".  Therefore, we will need an
actual deployed system to test it.  This should allow us to do so.

It will also allow us to connect realms to the newly available
Zuul admin API on opendev.

It should be possible to configure the realm the way we want, then
export its configuration into a JSON file and then have our playbooks
or the docker-compose file import it.  That would allow us to drive
change to the configuration of the system through code review.  Because
of the above limitation with openstackid, I think we should regard the
current implementation as experimental.  Once we have a realm
configuration that we like (which we will create using the GUI), we
can choose to either continue to maintain the config with the GUI and
appropriate file backups, or switch to a gitops model based on an
export.

My understanding is that all the data (realms configuration and session)
are kept in an H2 database.  This is probably sufficient for now, and even
for production use with Zuul, but we should probably switch to mariadb before
any heavy (eg gerrit, etc) production use.

This is a partial implementation of https://docs.opendev.org/opendev/infra-specs/latest/specs/central-auth.html

We can re-deploy with a new domain when it exists.

Change-Id: I2e069b1b220dbd3e0a5754ac094c2b296c141753
Co-Authored-By: Matthieu Huin <mhuin@redhat.com>
2021-12-03 14:17:23 -08:00
Ian Wienand
d0467bfc98 Refactor infra-prod jobs for parallel running
Refactor the infra-prod jobs to specify dependencies so they can run
in parallel.

Change-Id: I8f6150ec2f696933c93560c11fed0fd16b11bf65
2021-11-18 10:31:11 +11:00
Clark Boylan
cf91bc0971 Remove the gerrit group in favor of the review group
Having two groups here was confusing. We seem to use the review group
for most ansible stuff so we prefer that one. We move contents of the
gerrit group_vars into the review group_vars and then clean up the use
of the old group vars file.

Change-Id: I7fa7467f703f5cec075e8e60472868c60ac031f7
2021-10-12 09:48:53 -07:00
Clark Boylan
76baae4e3f Replace testing group vars with host vars for review02
Previously we had a test specific group vars file for the review Ansible
group. This provided junk secrets to our test installations of Gerrit
then we relied on the review02.opendev.org production host vars file to
set values that are public.

Unfortunately, this meant we were using the production heapLimit value
which is far too large for our test instances, leading to the occasional
failure:

  There is insufficient memory for the Java Runtime Environment to continue.
  Native memory allocation (mmap) failed to map 9596567552 bytes for committing reserved memory.

We cannot set the heapLimit in the group var file because the hostvar
file overrides those values. To fix this we need to replace the test
specific group var contents with a test specific host var file instead.
To avoid repeating ourselves we also create a new review.yaml group_vars
file to capture common settings between testing and prod. Note we should
look at combining this new file with the gerrit.yaml group_vars.

On the testing side of things we set the heapLimit to 6GB, we change the
serverid value to prevent any unexpected notedb confusion, and we remove
replication config.

Change-Id: Id8ec5cae967cc38acf79ecf18d3a0faac3a9c4b3
2021-10-12 09:48:45 -07:00
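The underlying rule is Ansible's variable precedence: host_vars always override group_vars for the same variable, so a test override placed in group_vars can never beat a production host_vars file. A sketch of the resulting layout (paths, variable names, and values are all illustrative):

```yaml
# group_vars/review.yaml -- settings shared by test and production
gerrit_vhost_name: review.opendev.org

# host_vars/review02.opendev.org.yaml -- production only
gerrit_heap_limit: 96g

# host_vars/review99.opendev.org.yaml -- test node only
gerrit_heap_limit: 6g     # small enough for a CI test node
gerrit_serverid: 00000000-0000-0000-0000-000000000000  # avoid notedb confusion
```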
Clark Boylan
e47dccdc34 Upgrade Gerrit to 3.3
This bumps the gerrit image up to our 3.3 image. Followup changes will
shift upgrade testing to test 3.3 to 3.4 upgrades, clean up no longer
needed 3.2 images, and start building 3.4 images.

Change-Id: Id0f544846946d4c50737a54ceb909a0a686a594e
2021-10-07 11:54:46 -07:00
Kendall Nelson
62e30e52de Setting Up Ansible For ptgbot
Heavily taken from statusbot, but removed wiki and twitter defaults.

Change-Id: I7b1958dbe37e5d25b8fde746235c88a4d6763ffd
2021-10-06 15:39:25 +11:00
Zuul
dac6ae68b9 Merge "Restrict generic inventory matchers to inventory/base" 2021-08-24 19:30:51 +00:00
Monty Taylor
92a68b3f78 Restrict generic inventory matchers to inventory/base
We have a subdir in inventory called base that includes the shared
files that we don't have a good way to distinguish between services.
Limit the file matchers to inventory/base so that we don't trigger
all of the services anytime a single service's host_vars changes.

Change-Id: I3f461b4ab56ec55beca29e123186b36513803a44
2021-08-21 12:12:33 -05:00
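The tightened matcher amounts to anchoring the regex at `inventory/base` instead of all of `inventory`; a sketch (the job name and playbook path are illustrative):

```yaml
- job:
    name: infra-prod-service-example   # illustrative
    files:
      # Before: inventory/.* -- also matched every other service's
      # host_vars, triggering this job on unrelated changes.
      - inventory/base/.*
      - playbooks/service-example.yaml
```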
Clark Boylan
d3837a7d95 Run service-eavesdrop after promoting the matrix eavesdrop bot
This order is important to ensure we update the matrix eavesdrop bot
when expected and not later in the day when the daily runs happen.

Change-Id: If8e3f9f34e30cdeb7765e6665d1fb19b339454a3
2021-08-20 10:57:26 -07:00
Clark Boylan
652ea73013 Stop requiring puppet things for afs, eavesdrop, and nodepool
These services are all managed with ansible now and don't need to be
triggered when puppet updates.

Change-Id: Ie32b788263724ad9a5ca88a6406290309ec8c87a
2021-08-17 15:58:17 -07:00
Clark Boylan
ce5d207dbb Run remote-puppet-else daily instead of hourly
Update the file matchers to actually match the current set of puppet
things. This ensures the deploy job runs when we want it and we can catch
up daily instead of hourly.

Previously a number of the matchers didn't actually match the puppet
things because the path prefix was wrong or words were in different
orders in the dir names.

Change-Id: I3510da81d942cf6fb7da998b8a73b0a566ea7411
2021-08-17 15:54:38 -07:00
Zuul
04fac27ea8 Merge "Run matrix-gerritbot on eavesdrop" 2021-08-02 17:00:12 +00:00