system-config

Author	SHA1	Message	Date
Clark Boylan	a5dd619e24	Fix infra-prod-service-review file matchers We changed review01.openstack.org to review02.openstack.org in the host var file matchers for this job thinking that was the issue previously. Unfortunately the actual file is review02.opendev.org. Update the matcher again to actually trigger the job. We also make a small edit to the gerrit role's README to ensure we trigger the job when this change lands. Change-Id: I1f235d0ddbb2d7f400ea2e99ffabdf5db35671a1	2023-03-03 11:47:02 -08:00
Clark Boylan	7b1b911e49	Trigger infra-prod-service-review when review02 hostvars update We didn't update this job's file matchers when review01 was replaced with review02. That caused us to miss triggering this job when review02 hostvars updated. Fix that which should also cause this job to run since we update the job. Change-Id: I8b58ee26084681242b9881651d6eeab9ff8d5ad2	2023-02-17 10:11:53 -08:00
Jeremy Stanley	fa22fa726a	Also bootstrap bridge any time inventory changes We need the infra-prod-bootstrap-bridge job to add SSH host keys from our Ansible inventory to the /etc/ssh_known_hosts on the bridge. When adding a new server to the inventory, any added host keys should be deployed. Make sure this happens. Change-Id: I422f80fc033cfe8e20d6d30b0fe23f82800c4cea	2022-11-29 20:48:23 +00:00
Zuul	b7b2157133	Merge "Add a mailman3 list server"	2022-11-22 18:00:30 +00:00
Clark Boylan	c1c91886b4	Add a mailman3 list server This should now be a largely functional deployment of mailman 3. There are still some bits that need testing but we'll use followup changes to force failure and hold nodes. This deployment of mailman3 uses upstream docker container images. We currently hack up uids and gids to accomodate that. We also hack up the settings file and bind mount it over the upstream file in order to use host networking. We override the hyperkitty index type to xapian. All list domains are hosted in a single installation and we use native vhosting to handle that. We'll deploy this to a new server and migrate one mailing list domain at a time. This will allow us to start with lists.opendev.org and test things like dmarc settings before expanding to the remaining lists. A migration script is also included, which has seen extensive testing on held nodes for importing copies of the production data sets. Change-Id: Ic9bf5cfaf0b87c100a6ce003a6645010a7b50358	2022-11-11 23:20:19 +00:00
Ian Wienand	51611845d4	Convert production playbooks to bastion host group Following-on from Iffb462371939989b03e5d6ac6c5df63aa7708513, instead of directly referring to a hostname when adding the bastion host to the inventory for the production playbooks, this finds it from the first element of the "bastion" group. As we do this twice for the run and post playbooks, abstract it into a role. The host value is currently "bridge.openstack.org" -- as is the existing hard-coding -- thus this is intended to be a no-op change. It is setting the foundation to make replacing the bastion host a simpler process in the future. Change-Id: I286796ebd71173019a627f8fe8d9a25d0bfc575a	2022-10-20 09:49:10 +11:00
Ian Wienand	d4c46ecdef	Abstract name of bastion host for testing path This replaces hard-coding of the host "bridge.openstack.org" with hard-coding of the first (and only) host in the group "bastion". The idea here is that we can, as much as possible, simply switch one place to an alternative hostname for the bastion such as "bridge.opendev.org" when we upgrade. This is just the testing path, for now; a follow-on will modify the production path (which doesn't really get speculatively tested) This needs to be defined in two places : 1) We need to define this in the run jobs for Zuul to use in the playbooks/zuul/run-*.yaml playbooks, as it sets up and collects logs from the testing bastion host. 2) The nested Ansible run will then use inventory inventory/service/groups.yaml Various other places are updated to use this abstracted group as the bastion host. Variables are moved into the bastion group (which only has one host -- the actual bastion host) which means we only have to update the group mapping to the new host. This is intended to be a no-op change; all the jobs should work the same, but just using the new abstractions. Change-Id: Iffb462371939989b03e5d6ac6c5df63aa7708513	2022-10-20 09:00:43 +11:00
Ian Wienand	8efaf8da93	infra-prod-bootstrap-bridge: fix typo in playbook name Introduced with Iebaeed5028050d890ab541818f405978afd60124 Change-Id: I2e06221d03589dc6bcb5fb060b439e35e3d604dc	2022-10-19 11:10:21 +11:00
Ian Wienand	77ebe6e0b7	infra-prod-bootstrap-bridge: run directly on bridge In discussion of other changes, I realised that the bridge bootstrap job is running via zuul/run-production-playbook.yaml. This means it uses the Ansible installed on bridge to run against itself -- which isn't much of a bootstrap. What should happen is that the bootstrap-bridge.yaml playbook, which sets up ansible and keys on the bridge node, should run directly from the executor against the bridge node. To achieve this we reparent the job to opendev-infra-prod-setup-keys, which sets up the executor to be able to log into the bridge node. We then add the host dynamically and run the bootstrap-bridge.yaml playbook against it. This is similar to the gate testing path; where bootstrap-bridge.yaml is run from the exeuctor against the ephemeral bridge testing node before the nested-Ansible is used. The root key deployment is updated to use the nested Ansible directly, so that it can read the variable from the on-host secrets. Change-Id: Iebaeed5028050d890ab541818f405978afd60124	2022-10-15 10:39:53 +11:00
James E. Blair	c661fb0972	Add Jaeger tracing server Change-Id: I1aa68b1d5f99364fa09776301894b922ed169a3a	2022-09-15 19:21:33 -07:00
Ian Wienand	5ba37ced60	paste: move certificate to group variable Similar to Id98768e29a06cebaf645eb75b39e4dc5adb8830d, move the certificate variables to the group definition file, so that we don't have to duplicate handlers or definitions for the testing host. Change-Id: I6650f5621a4969582f40700232a596d84e2b4a06	2022-08-05 08:18:55 +10:00
Ian Wienand	e70c1e581c	static: move certs to group, update testing name to static99 Currently we define the letsencrypt certs for each host in its individual host variables. With recent work we have a trusted CA and SAN names setup in our testing environment; introducing the possibility that we could accidentally reference the production host during testing (both have valid certs, as far as the testing hosts are concerned). To avoid this, we can use our naming scheme to move our testing hosts to "99" and avoid collision with the production hosts. As a bonus, this really makes you think more about your group/host split to get things right and keep the environment as abstract as possible. One example of this is that with letsencrypt certificates defined in host vars, testing and production need to use the same hostname to get the right certificates created. Really, this should be group-level information so it applies equally to host01 and host99. To cover "hostXX.opendev.org" as a SAN we can include the inventory_hostname in the group variables. This updates one of the more tricky hosts, static, as a proof of concept. We rename the handlers to be generic, and update the testing targets. Change-Id: Id98768e29a06cebaf645eb75b39e4dc5adb8830d	2022-08-05 08:18:55 +10:00
Ian Wienand	376648bfdc	Revert "Force ansible 2.9 on infra-prod jobs" This reverts commit 21c6dc02b5b3069e4c9410416aeae804b2afbb5c. Everything appears to be working with Ansible 2.9, which does seem to sugguest reverting this will result in jobs timing out again. We will monitor this, and I76ba278d1ffecbd00886531b4554d7aed21c43df is a potential fix for this. Change-Id: Id741d037040bde050abefa4ad7888ea508b484f6	2022-07-17 09:07:20 +10:00
Clark Boylan	21c6dc02b5	Force ansible 2.9 on infra-prod jobs We've been seeing ansible post-run playbook timeouts in our infra-prod jobs. The only major thing that has changed recently is the default update to ansible 5 for these jobs. Force them back to 2.9 to see if the problem goes away. Albin Vass has noted that there are possibly glibc + debian bullseye + ansible 5 problems that may be causing this. If we determine 2.9 is happy then this is the likely cause. Change-Id: Ibd40e15756077d1c64dba933ec0dff6dc0aac374	2022-07-15 09:06:11 -07:00
Ian Wienand	21efe11eed	production-playbook logs : move to post-run step If the production playbook times out, we don't get any logs collected with the run. By moving the log collection into a post-run step, we should always get something copied to help us diagnose what is going wrong. Change-Id: I3e99b80e442db0cc87f8e8c9728b7697a5e4d1d3	2022-07-15 07:58:23 +10:00
Clark Boylan	e2442eeaf0	Don't run infra-prod-run-refstack on all group var updates This was running on all group var updates but we only need to run it when refstack group vars update. Change the file requirements to match the refstack.yaml group file to address this. Change-Id: Id5ed4b65c1ed6566696fea9a33db27e9318af1a6	2022-03-04 15:30:47 -08:00
James E. Blair	3f8acefbe1	Run zuul-web on zuul01 and add to load balancer Change-Id: Ia8b10338fa3a1876993404276e0759f4b10d6b54	2022-03-04 13:11:09 -08:00
Ian Wienand	3f6cd427d7	encrypt-logs: turn on for all prod playbooks We have validated that the log encryption/export path is working, so turn it on for all prod jobs. Change-Id: Ic04d5b6e716dffedc925cb799e3630027183d890	2022-02-24 09:57:55 +11:00
Ian Wienand	7b22badf6a	run-production-playbook: return encrypted logs Based on the changes in I5b9f9dd53eb896bb542652e8175c570877842584, enable returning encrypted log artifacts for the codesearch production job, as an initial test. Change-Id: I9bd4ed0880596968000b1f153c31df849cd7fa8d	2022-02-16 16:39:46 +11:00
James E. Blair	2a9553ef25	Add Zuul load balancer This adds a load balancer for zuul-web and fingergw. Change-Id: Id5aa01151f64f3c85e1532ad66999ef9471c5896	2022-02-10 13:24:42 -08:00
Ian Wienand	73a9acc7ad	Rename install-ansible to bootstrap-bridge This used to be called "bridge", but was then renamed with Ia7c8dd0e32b2c4aaa674061037be5ab66d9a3581 to install-ansible to be clearer. It is true that this is installing Ansible, but as part of our reworking for parallel jobs this is the also the synchronisation point where we should be deploying the system-config code to run for the buildset. Thus naming this "boostrap-bridge" should hopefully be clearer again about what's going on. I've added a note to the job calling out it's difference to the infra-prod-service-bridge job to hopefully also avoid some of the inital confusion. Change-Id: I4db1c883f237de5986edb4dc4c64860390cc8e22	2021-12-07 16:24:53 +11:00
James E. Blair	e79dbbe6bb	Add a keycloak server This adds a keycloak server so we can start experimenting with it. It's based on the docker-compose file Matthieu made for Zuul (see https://review.opendev.org/819745 ) We should be able to configure a realm and federate with openstackid and other providers as described in the opendev auth spec. However, I am unable to test federation with openstackid due its inability to configure an oauth app at "localhost". Therefore, we will need an actual deployed system to test it. This should allow us to do so. It will also allow use to connect realms to the newly available Zuul admin api on opendev. It should be possible to configure the realm the way we want, then export its configuration into a JSON file and then have our playbooks or the docker-compose file import it. That would allow us to drive change to the configuration of the system through code review. Because of the above limitation with openstackid, I think we should regard the current implementation as experimental. Once we have a realm configuration that we like (which we will create using the GUI), we can chose to either continue to maintain the config with the GUI and appropriate file backups, or switch to a gitops model based on an export. My understanding is that all the data (realms configuration and session) are kept in an H2 database. This is probably sufficient for now and even production use with Zuul, but we should probably switch to mariadb before any heavy (eg gerrit, etc) production use. This is a partial implementation of https://docs.opendev.org/opendev/infra-specs/latest/specs/central-auth.html We can re-deploy with a new domain when it exists. Change-Id: I2e069b1b220dbd3e0a5754ac094c2b296c141753 Co-Authored-By: Matthieu Huin <mhuin@redhat.com>	2021-12-03 14:17:23 -08:00
Ian Wienand	d0467bfc98	Refactor infra-prod jobs for parallel running Refactor the infra-prod jobs to specify dependencies so they can run in parallel. Change-Id: I8f6150ec2f696933c93560c11fed0fd16b11bf65	2021-11-18 10:31:11 +11:00
Clark Boylan	cf91bc0971	Remove the gerrit group in favor of the review group Having two groups here was confusing. We seem to use the review group for most ansible stuff so we prefer that one. We move contents of the gerrit group_vars into the review group_vars and then clean up the use of the old group vars file. Change-Id: I7fa7467f703f5cec075e8e60472868c60ac031f7	2021-10-12 09:48:53 -07:00
Clark Boylan	76baae4e3f	Replace testing group vars with host vars for review02 Previously we had a test specific group vars file for the review Ansible group. This provided junk secrets to our test installations of Gerrit then we relied on the review02.opendev.org production host vars file to set values that are public. Unfortunately, this meant we were using the production heapLimit value which is far too large for our test instances leading to the occasionaly failure: There is insufficient memory for the Java Runtime Environment to continue. Native memory allocation (mmap) failed to map 9596567552 bytes for committing reserved memory. We cannot set the heapLimit in the group var file because the hostvar file overrides those values. To fix this we need to replace the test specific group var contents with a test specific host var file instead. To avoid repeating ourselves we also create a new review.yaml group_vars file to capture common settings between testing and prod. Note we should look at combining this new file with the gerrit.yaml group_vars. On the testing side of things we set the heapLimit to 6GB, we change the serverid value to prevent any unexpected notedb confusion, and we remove replication config. Change-Id: Id8ec5cae967cc38acf79ecf18d3a0faac3a9c4b3	2021-10-12 09:48:45 -07:00
Clark Boylan	e47dccdc34	Upgrade Gerrit to 3.3 This bumps the gerrit image up to our 3.3 image. Followup changes will shift upgrade testing to test 3.3 to 3.4 upgrades, clean up no longer needed 3.2 images, and start building 3.4 images. Change-Id: Id0f544846946d4c50737a54ceb909a0a686a594e	2021-10-07 11:54:46 -07:00
Kendall Nelson	62e30e52de	Setting Up Ansible For ptgbot Heavily taken from statusbot, but removed wiki and twitter defaults. Change-Id: I7b1958dbe37e5d25b8fde746235c88a4d6763ffd	2021-10-06 15:39:25 +11:00
Zuul	dac6ae68b9	Merge "Restrict generic inventory matchers to inventory/base"	2021-08-24 19:30:51 +00:00
Monty Taylor	92a68b3f78	Restrict generic inventory matchers to inventory/base We have a subdir in inventory called base that includes the shared files that we don't have a good way to distinguish between services. Limit the file matchers to inventory/base so that we don't trigger all of the services anytime a single service's host_vars changes. Change-Id: I3f461b4ab56ec55beca29e123186b36513803a44	2021-08-21 12:12:33 -05:00
Clark Boylan	d3837a7d95	Run service-eavesdrop after promoting the matrix eavesdrop bot This order is important to ensure we update the matrix eavesdrop bot when expected and not later in the day when the daily runs happen. Change-Id: If8e3f9f34e30cdeb7765e6665d1fb19b339454a3	2021-08-20 10:57:26 -07:00
Clark Boylan	652ea73013	Stop requiring puppet things for afs, eavesdrop, and nodepool These services are all managed with ansible now and don't need to be triggered when puppet updates. Change-Id: Ie32b788263724ad9a5ca88a6406290309ec8c87a	2021-08-17 15:58:17 -07:00
Clark Boylan	ce5d207dbb	Run remote-puppet-else daily instead of hourly Update the file matchers to actually match the current set of puppet things. This ensure the deploy job runs when we want it and we can catch up daily instead of hourly. Previously a number of the matchers didn't actually match the puppet things because the path prefix was wrong or works were in different orders for the dir names. Change-Id: I3510da81d942cf6fb7da998b8a73b0a566ea7411	2021-08-17 15:54:38 -07:00
Zuul	04fac27ea8	Merge "Run matrix-gerritbot on eavesdrop"	2021-08-02 17:00:12 +00:00
Zuul	af5fcdcb13	Merge "Run matrix-eavesdrop on eavesdrop"	2021-08-02 17:00:09 +00:00
Tristan Cacqueray	c4b0a8950d	Run matrix-gerritbot on eavesdrop Thin runs the new gerritbot-matrix bot on the eavesdrop server. Change-Id: Ic11ca46aa4da61d5b80a8996ad900fdf83ab70dc	2021-07-30 09:16:42 -05:00
James E. Blair	82c966e6da	Run matrix-eavesdrop on eavesdrop Thin runs the new matrix-eavesdrop bot on the eavesdrop server. It will write logs out to the limnoria logs directory, which is mounted inside the container. Change-Id: I867eec692f63099b295a37a028ee096c24109a2e	2021-07-28 18:34:58 -05:00
Ian Wienand	916c1d3dc8	Add paste service The paste service needs an upgrade; since others have created a lodgeit container it seems worth us keeping the service going if only to maintain the historical corpus of pastes. This adds the ansible to deploy lodgeit and a sibling mariadb container. I have imported a dump of the old data as a test. The dump is ~4gb and imported it takes up about double that; certainly nothing we need to be too concerned over. The server will be more than capable of running the db container alongside the lodgeit instance. This should have no effect on production until we decide to switch DNS. Change-Id: I284864217aa49d664ddc3ebdc800383b2d7e00e3	2021-07-07 15:12:04 +10:00
James E. Blair	d8779f4fb0	Update Zuul job semaphore usage The semaphore property is deprecated; replaced with semaphores. Change-Id: Idc459e60b846a270d7d9a9eb3737391c4ce2dd17	2021-06-24 13:20:29 -07:00
Ian Wienand	f304c1a161	Update eavesdrop deploy job This was missed when adding the statusbot/ircbot containers Change-Id: I198da471b8a0dd648a8e9f1bfe41988561a745f8	2021-06-11 23:23:20 +10:00
Zuul	be4f67f23e	Merge "Add infra-prod-service-lists job"	2021-05-19 19:16:32 +00:00
Clark Boylan	8d9975be67	Double the default number of ansible forks We run these ansible jobs serially which means we don't gain much by forcing ansible to use a small number of forks. Double the default for our infra prod job fork count from 5 to 10 to see if this speeds up our deploy jobs. Note some jobs override this value to either add more forks or fewer when necessary. These are left as is. Change-Id: I6fded724cb9c8654153bcc5937eae7203326076e	2021-05-14 12:14:15 -07:00
Clark Boylan	c743b7e484	Clean up zuul01 from inventory This cleans up zuul01 as it should no longer be used at this point. We also make the inventory groups a bit more clear that all zuul servers are under the opendev.org domain now. Depends-On: https://review.opendev.org/c/opendev/zone-opendev.org/+/790483 Change-Id: I7885fe60028fbd87688f3ae920a24bce4d1a3acd	2021-05-13 06:58:36 -07:00
Clark Boylan	533594d959	Add zuul02 to inventory This zuul02 instance will replace zuul01. There are a few items to coordinate when doing an actual switch so we haven't removed zuul01 from inventory here. In particular we need to update gearman server config values in the zuul cluster and we need to save queues, shutdown zuul01, then start zuul02's scheduler and restore queues there. I believe landing this change is safe as we don't appear to start zuul on new instances by default. Reviewers should double check this. Depends-On: https://review.opendev.org/c/opendev/zone-opendev.org/+/791039 Change-Id: I524b456e494124d8293fbe8e1468de40f3800772	2021-05-13 06:58:30 -07:00
Clark Boylan	caedb11d3d	Add infra-prod-service-lists job This job is not added in the parent so that we can manually run playbooks after the parent lands. Once we are happy with the results from the new service-lists.yaml playbook we can land this change and have zuul automatically apply it when necessary. Change-Id: I38de8b98af9fb08fa5b9b8849d65470cbd7b3fdc	2021-05-11 08:40:06 -07:00
Clark Boylan	6e7c07411b	Bump the infra-prod-manage-projects job timeout Bump this timeout for a couple of reasons. First we've seen the job timeout at least once in the last month. This seems to be due to gitea portions of the job running slowly. Second we're planning some large scale updates to the openstack acls and a longer timeout should help us get those in in larger batches. We can consider trimming this back again after these updates are done if gitea doesn't continue to give us trouble. Change-Id: Ib61849b4c73a1b3fa2a0bbe90ace29fb23849449	2021-04-16 14:10:34 -07:00
Zuul	a7be740183	Merge "Fix up openafs-client job matching"	2021-04-12 22:43:13 +00:00
Ian Wienand	9f11fc5c75	Remove references to review-dev With our increased ability to test in the gate, there's not much use for review-dev any more. Remove references. Change-Id: I97e9865e0b655cd157acf9ffa7d067b150e6fc72	2021-03-24 11:40:31 +11:00
Zuul	77b1c14a9a	Merge "Use upstream jitsi-meet web image"	2021-03-17 00:22:50 +00:00
Ian Wienand	c1aff2ed38	kerberos-kdc: role to manage Kerberos KDC servers This adds a role and related testing to manage our Kerberos KDC servers, intended to replace the puppet modules currently performing this task. This role automates realm creation, initial setup, key material distribution and replica host configuration. None of this is intended to run on the production servers which are already setup with an active database, and the role should be effectively idempotent in production. Note that this does not yet switch the production servers into the new groups; this can be done in a separate step under controlled conditions and with related upgrades of the host OS to Focal. Change-Id: I60b40897486b29beafc76025790c501b5055313d	2021-03-17 08:30:52 +11:00
James E. Blair	b768325480	Use upstream jitsi-meet web image This has our change to open etherpad on join, so we should no longer need to run a fork of the web server. Switch to the upstream container image and stop building our own. Change-Id: I3e8da211c78b6486a3dcbd362ae7eb03cc9f5a48	2021-03-09 12:35:46 -08:00

1 2

73 Commits