49 Commits

Author SHA1 Message Date
Clark Boylan
76baae4e3f Replace testing group vars with host vars for review02
Previously we had a test specific group vars file for the review Ansible
group. This provided junk secrets to our test installations of Gerrit;
we then relied on the review02.opendev.org production host vars file to
set values that are public.

Unfortunately, this meant we were using the production heapLimit value
which is far too large for our test instances, leading to the occasional
failure:

  There is insufficient memory for the Java Runtime Environment to continue.
  Native memory allocation (mmap) failed to map 9596567552 bytes for committing reserved memory.

We cannot set the heapLimit in the group var file because the hostvar
file overrides those values. To fix this we need to replace the test
specific group var contents with a test specific host var file instead.
To avoid repeating ourselves we also create a new review.yaml group_vars
file to capture common settings between testing and prod. Note we should
look at combining this new file with the gerrit.yaml group_vars.

On the testing side of things we set the heapLimit to 6GB, we change the
serverid value to prevent any unexpected notedb confusion, and we remove
replication config.
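
A minimal sketch of the layering described above (all paths, values, and
variable names are illustrative, not the real system-config contents):

```yaml
# group_vars apply to every member of the review group, but host_vars
# override group_vars -- which is why the production heapLimit leaked
# into test instances.
#
# inventory/service/group_vars/review.yaml (shared settings)
gerrit_heap_limit: 48g        # hypothetical production value

# hypothetical test-only host_vars file for the test Gerrit host
# inventory/test/host_vars/review99.opendev.org.yaml
gerrit_heap_limit: 6g         # small enough for a test VM
gerrit_serverid: 00000000-0000-0000-0000-000000000000  # avoid notedb confusion
```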

Change-Id: Id8ec5cae967cc38acf79ecf18d3a0faac3a9c4b3
2021-10-12 09:48:45 -07:00
Clark Boylan
e47dccdc34 Upgrade Gerrit to 3.3
This bumps the gerrit image up to our 3.3 image. Followup changes will
shift upgrade testing to test 3.3 to 3.4 upgrades, clean up no longer
needed 3.2 images, and start building 3.4 images.

Change-Id: Id0f544846946d4c50737a54ceb909a0a686a594e
2021-10-07 11:54:46 -07:00
Kendall Nelson
62e30e52de Setting Up Ansible For ptgbot
Heavily taken from statusbot, but removed wiki and twitter defaults.

Change-Id: I7b1958dbe37e5d25b8fde746235c88a4d6763ffd
2021-10-06 15:39:25 +11:00
Zuul
dac6ae68b9 Merge "Restrict generic inventory matchers to inventory/base" 2021-08-24 19:30:51 +00:00
Monty Taylor
92a68b3f78 Restrict generic inventory matchers to inventory/base
We have a subdir in inventory called base that includes shared
files we don't have a good way to attribute to individual services.
Limit the file matchers to inventory/base so that we don't trigger
all of the services anytime a single service's host_vars changes.
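
The effect can be sketched as a Zuul file matcher (job name illustrative):

```yaml
- job:
    name: infra-prod-base
    files:
      # Only shared inventory files trigger this job; a change to a
      # single service's host_vars no longer matches.
      - inventory/base/.*
```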

Change-Id: I3f461b4ab56ec55beca29e123186b36513803a44
2021-08-21 12:12:33 -05:00
Clark Boylan
d3837a7d95 Run service-eavesdrop after promoting the matrix eavesdrop bot
This order is important to ensure we update the matrix eavesdrop bot
when expected and not later in the day when the daily runs happen.

Change-Id: If8e3f9f34e30cdeb7765e6665d1fb19b339454a3
2021-08-20 10:57:26 -07:00
Clark Boylan
652ea73013 Stop requiring puppet things for afs, eavesdrop, and nodepool
These services are all managed with ansible now and don't need to be
triggered when puppet updates.

Change-Id: Ie32b788263724ad9a5ca88a6406290309ec8c87a
2021-08-17 15:58:17 -07:00
Clark Boylan
ce5d207dbb Run remote-puppet-else daily instead of hourly
Update the file matchers to actually match the current set of puppet
things. This ensures the deploy job runs when we want it to, and we can catch
up daily instead of hourly.

Previously a number of the matchers didn't actually match the puppet
things because the path prefix was wrong or words were in different
orders for the dir names.

Change-Id: I3510da81d942cf6fb7da998b8a73b0a566ea7411
2021-08-17 15:54:38 -07:00
Zuul
04fac27ea8 Merge "Run matrix-gerritbot on eavesdrop" 2021-08-02 17:00:12 +00:00
Zuul
af5fcdcb13 Merge "Run matrix-eavesdrop on eavesdrop" 2021-08-02 17:00:09 +00:00
Tristan Cacqueray
c4b0a8950d Run matrix-gerritbot on eavesdrop
This runs the new gerritbot-matrix bot on the eavesdrop server.

Change-Id: Ic11ca46aa4da61d5b80a8996ad900fdf83ab70dc
2021-07-30 09:16:42 -05:00
James E. Blair
82c966e6da Run matrix-eavesdrop on eavesdrop
This runs the new matrix-eavesdrop bot on the eavesdrop server.

It will write logs out to the limnoria logs directory, which is mounted
inside the container.

Change-Id: I867eec692f63099b295a37a028ee096c24109a2e
2021-07-28 18:34:58 -05:00
Ian Wienand
916c1d3dc8 Add paste service
The paste service needs an upgrade; since others have created a
lodgeit container it seems worth us keeping the service going if only
to maintain the historical corpus of pastes.

This adds the ansible to deploy lodgeit and a sibling mariadb
container.  I have imported a dump of the old data as a test.  The
dump is ~4GB and, once imported, takes up about double that; certainly
nothing we need to be too concerned about.  The server will be more
than capable of running the db container alongside the lodgeit
instance.

This should have no effect on production until we decide to switch
DNS.

Change-Id: I284864217aa49d664ddc3ebdc800383b2d7e00e3
2021-07-07 15:12:04 +10:00
James E. Blair
d8779f4fb0 Update Zuul job semaphore usage
The semaphore property is deprecated; replaced with semaphores.
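
The change is mechanical; a before/after sketch (job and semaphore names
are made up):

```yaml
# Deprecated singular form:
- job:
    name: example-deploy
    semaphore: deploy-lock

# Current list form:
- job:
    name: example-deploy
    semaphores:
      - deploy-lock
```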

Change-Id: Idc459e60b846a270d7d9a9eb3737391c4ce2dd17
2021-06-24 13:20:29 -07:00
Ian Wienand
f304c1a161 Update eavesdrop deploy job
This was missed when adding the statusbot/ircbot containers

Change-Id: I198da471b8a0dd648a8e9f1bfe41988561a745f8
2021-06-11 23:23:20 +10:00
Zuul
be4f67f23e Merge "Add infra-prod-service-lists job" 2021-05-19 19:16:32 +00:00
Clark Boylan
8d9975be67 Double the default number of ansible forks
We run these ansible jobs serially, so there is little benefit to
forcing ansible to use a small number of forks. Double the default for
our infra prod job fork count from 5 to 10 to see if this speeds up our
deploy jobs.

Note some jobs override this value to either add more forks or fewer
when necessary. These are left as is.
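
As a sketch, the override might be plumbed through as a job variable (the
variable name and mechanism here are assumptions, not the real jobs):

```yaml
# Hypothetical job variable controlling the ansible-playbook fork count;
# individual jobs can override it where more or fewer forks make sense.
- job:
    name: infra-prod-playbook
    vars:
      infra_prod_ansible_forks: 10
```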

Change-Id: I6fded724cb9c8654153bcc5937eae7203326076e
2021-05-14 12:14:15 -07:00
Clark Boylan
c743b7e484 Clean up zuul01 from inventory
This cleans up zuul01 as it should no longer be used at this point. We
also make the inventory groups a bit more clear that all zuul servers
are under the opendev.org domain now.

Depends-On: https://review.opendev.org/c/opendev/zone-opendev.org/+/790483
Change-Id: I7885fe60028fbd87688f3ae920a24bce4d1a3acd
2021-05-13 06:58:36 -07:00
Clark Boylan
533594d959 Add zuul02 to inventory
This zuul02 instance will replace zuul01. There are a few items to
coordinate when doing an actual switch so we haven't removed zuul01 from
inventory here. In particular we need to update gearman server config
values in the zuul cluster and we need to save queues, shutdown zuul01,
then start zuul02's scheduler and restore queues there.

I believe landing this change is safe as we don't appear to start zuul
on new instances by default. Reviewers should double check this.

Depends-On: https://review.opendev.org/c/opendev/zone-opendev.org/+/791039
Change-Id: I524b456e494124d8293fbe8e1468de40f3800772
2021-05-13 06:58:30 -07:00
Clark Boylan
caedb11d3d Add infra-prod-service-lists job
This job is not added in the parent so that we can manually run
playbooks after the parent lands. Once we are happy with the results
from the new service-lists.yaml playbook we can land this change and
have zuul automatically apply it when necessary.

Change-Id: I38de8b98af9fb08fa5b9b8849d65470cbd7b3fdc
2021-05-11 08:40:06 -07:00
Clark Boylan
6e7c07411b Bump the infra-prod-manage-projects job timeout
Bump this timeout for a couple of reasons. First we've seen the job
timeout at least once in the last month. This seems to be due to gitea
portions of the job running slowly.

Second, we're planning some large-scale updates to the openstack acls, and
a longer timeout should help us land those in larger batches. We can
consider trimming this back again after these updates are done if gitea
doesn't continue to give us trouble.

Change-Id: Ib61849b4c73a1b3fa2a0bbe90ace29fb23849449
2021-04-16 14:10:34 -07:00
Zuul
a7be740183 Merge "Fix up openafs-client job matching" 2021-04-12 22:43:13 +00:00
Ian Wienand
9f11fc5c75 Remove references to review-dev
With our increased ability to test in the gate, there's not much use
for review-dev any more.  Remove references.

Change-Id: I97e9865e0b655cd157acf9ffa7d067b150e6fc72
2021-03-24 11:40:31 +11:00
Zuul
77b1c14a9a Merge "Use upstream jitsi-meet web image" 2021-03-17 00:22:50 +00:00
Ian Wienand
c1aff2ed38 kerberos-kdc: role to manage Kerberos KDC servers
This adds a role and related testing to manage our Kerberos KDC
servers, intended to replace the puppet modules currently performing
this task.

This role automates realm creation, initial setup, key material
distribution and replica host configuration.  None of this is intended
to run on the production servers which are already setup with an
active database, and the role should be effectively idempotent in
production.

Note that this does not yet switch the production servers into the new
groups; this can be done in a separate step under controlled
conditions and with related upgrades of the host OS to Focal.

Change-Id: I60b40897486b29beafc76025790c501b5055313d
2021-03-17 08:30:52 +11:00
James E. Blair
b768325480 Use upstream jitsi-meet web image
This has our change to open etherpad on join, so we should no longer need
to run a fork of the web server.  Switch to the upstream container image
and stop building our own.

Change-Id: I3e8da211c78b6486a3dcbd362ae7eb03cc9f5a48
2021-03-09 12:35:46 -08:00
Ian Wienand
c0144eab68 Fix up openafs-client job matching
The jobs should have file matchers for "roles/openafs-client" (not
playbooks/).  Fix this.

Add the openafs/kerberos role matchers to Zuul as well, as it uses
them on the executors.

Change-Id: I66fd7792d6b533362606291e1bfc01dfa2a2e05b
2021-03-03 13:41:56 +11:00
Ian Wienand
39ffc685d6 backups: remove all bup
All hosts are now running their backups via borg to servers in
vexxhost and rax.ord.

For reference, the servers being backed up at this time are:

 borg-ask01
 borg-ethercalc02
 borg-etherpad01
 borg-gitea01
 borg-lists
 borg-review-dev01
 borg-review01
 borg-storyboard01
 borg-translate01
 borg-wiki-update-test
 borg-zuul01

This removes the old bup backup hosts, the no-longer used ansible
roles for the bup backup server and client roles, and any remaining
bup related configuration.

For simplicity, we will remove any remaining bup cron jobs on the
above servers manually after this merges.

Change-Id: I32554ca857a81ae8a250ce082421a7ede460ea3c
2021-02-16 16:00:28 +11:00
Ian Wienand
533e6d43fa refstack: fix typo in role matcher
Change-Id: I61929708be87a28669606ac38abf478afd70fc51
2021-02-11 10:37:31 +11:00
Ian Wienand
78167396bf refstack: add production image and deployment jobs
Change-Id: I017a32ee374f0473525c9941c41b26c2a43bf2c8
2021-02-10 07:11:22 +11:00
Ian Wienand
7683fa11b3 openafs-server : add ansible roles for OpenAFS servers
This starts at migrating OpenAFS server setup to Ansible.

Firstly we split up the groups and explicitly name hosts, as we will
be migrating each one step-by-step.  We split out 1.8 hosts into a new
afs-1.8 group; the first host is afs01.ord.openstack.org which already
has openafs 1.8 installed manually.

An openafs-server role is introduced that does the same setup as the
extant puppet.

The AFS job is renamed to infra-prod-afs as the puppet component will
eventually disappear.  Otherwise it runs in the same way, but also
runs the openafs-server role for the 1.8 servers.

Once this is merged, we can run it against afs01.ord.openstack.org to
ensure it works and is idempotent.  We can then take on upgrading the
other file servers, and work further on the database servers.

Change-Id: I7998af43961999412f58a78214f4b5387713d30e
2021-01-19 08:08:33 +11:00
Clark Boylan
2ccabb17d0 Fix the infra-prod-service-review image dependency
We were still soft depending on gerrit 2.13 image builds but those jobs
don't exist anymore. Update this to 3.2 image builds which is what we
are doing now.

Change-Id: I0f5c02d69c199a605278318b93162437b7539547
2020-11-22 12:25:08 -08:00
Ian Wienand
368466730c Migrate codesearch site to container
The hound project has undergone a small re-birth and moved to

 https://github.com/hound-search/hound

which has broken our deployment.  We've talked about leaving
codesearch up to gitea, but it's not quite there yet.  There seems to
be no point working on the puppet now.

This builds a container that runs houndd.  It's an opendev specific
container; the config is pulled from project-config directly.

There's some custom scripts that drive things.  Some points for
reviewers:

 - update-hound-config.sh uses "create-hound-config" (which is in
   jeepyb for historical reasons) to generate the config file.  It
   grabs the latest projects.yaml from project-config and exits with a
   return code to indicate if things changed.

 - when the container starts, it runs update-hound-config.sh to
   populate the initial config.  There is a testing environment flag
   and small config so it doesn't have to clone the entire opendev for
   functional testing.

 - it runs under supervisord so we can restart the daemon when
   projects are updated.  Unlike earlier versions that didn't start
   listening till indexing was done, this version now puts up a "Hound
is not ready yet" message while it is working, so we can drop
   all the magic we were doing to probe if hound is listening via
   netstat and making Apache redirect to a status page.

 - resync-hound.sh is run from an external cron job daily, and does
   this update and restart check.  Since it only reloads if changes
   are made, this should be relatively rare anyway.

 - There is a PR to monitor the config file
   (https://github.com/hound-search/hound/pull/357) which would mean
the restart is unnecessary.  This would be good in the near term and we
   could remove the cron job.

 - playbooks/roles/codesearch is unexciting and deploys the container,
   certificates and an apache proxy back to localhost:6080 where hound
   is listening.
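
The supervisord arrangement might look roughly like this (program name,
paths, and flags are assumptions, not taken from the actual container):

```
[program:houndd]
command=/usr/local/bin/houndd -conf /hound/config.json
autorestart=true
```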

I've combined removal of the old puppet bits here as the "-codesearch"
namespace was already being used.

Change-Id: I8c773b5ea6b87e8f7dfd8db2556626f7b2500473
2020-11-20 07:41:12 +11:00
Ian Wienand
f50490481b reprepro: run deploy job on role changes
This was missed so the deploy job doesn't run automatically on changes
to the reprepro role.

Change-Id: I8e40926bd5bbb2836c63704ba02a43e45c12b4c4
2020-10-27 16:29:48 +11:00
Ian Wienand
dd50f6b732 borg : match install-borg role to run deploy job
This was forgotten in the original addition.

Change-Id: I0725b99e938b993e68106f6ba1b2704e9413e902
2020-10-12 13:06:10 +11:00
Zuul
083e8b43ea Merge "Add borg-backup roles" 2020-10-01 07:36:47 +00:00
Zuul
0cf20e0756 Merge "Run our etherpad prod deploy job when docker updates" 2020-07-27 20:33:24 +00:00
James E. Blair
b9f7f5506f Use infra-prod-base in infra-prod jobs
This uses a new base job which handles pushing the git repos on to
bridge since that must now happen in a trusted playbook.

Depends-On: https://review.opendev.org/742934
Change-Id: Ie6d0668f83af801c0c0e920b676f2f49e19c59f6
2020-07-24 09:04:50 -07:00
Ian Wienand
028d655375 Add borg-backup roles
This adds roles to implement backup with borg [1].

Our current tool "bup" has no Python 3 support and is not packaged for
Ubuntu Focal.  This means it is effectively end-of-life.  borg fits
our model of servers backing themselves up to a central location, is
well documented and seems well supported.  It also has the clarkb seal
of approval :)

As mentioned, borg works in the same manner as bup by doing an
efficient back up over ssh to a remote server.  The core of these
roles are the same as the bup based ones; in terms of creating a
separate user for each host and deploying keys and ssh config.

This chooses to install borg in a virtualenv on /opt.  This was chosen
for a number of reasons; firstly reading the history of borg there
have been incompatible updates (although they provide a tool to update
repository formats); it seems important that we both pin the version
we are using and keep clients and server in sync.  Since we have a
heterogeneous distribution collection we don't want to rely on the
packaged tools which may differ.  I don't feel like this is a great
application for a container; we actually don't want it that isolated
from the base system, because its goal is to read and copy it offsite
with as little chance of things going wrong as possible.
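
A hedged sketch of that virtualenv install as an Ansible task (the version
pin and paths are illustrative, not the real role contents):

```yaml
- name: Install pinned borg into a virtualenv under /opt
  pip:
    name: borgbackup==1.1.15   # pin so clients and server stay in sync
    virtualenv: /opt/borg
    virtualenv_command: python3 -m venv
```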

Borg has a lot of support for encrypting the data at rest in various
ways.  However, that introduces the possibility we could lose both the
key and the backup data.  Really the only thing stopping this is key
management, and if we want to go down this path we can do it as a
follow-on.

The remote end server is configured via ssh command rules to run in
append-only mode.  This means a misbehaving client can't delete its
old backups.  In theory we can prune backups on the server side --
something we could not do with bup.  The documentation has been
updated but is vague on this part; I think we should get some hosts in
operation, see how the de-duplication is working out and then decide
how we want to manage things long term.
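
The append-only restriction is typically enforced on the server via an
authorized_keys forced command; a sketch (repo path and key are
placeholders):

```
command="borg serve --append-only --restrict-to-path /opt/backups/borg-review01",restrict ssh-ed25519 AAAA... borg-review01
```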

Testing is added; a focal and bionic host both run a full backup of
themselves to the backup server.  Pretty cool, the logs are in
/var/log/borg-backup-<host>.log.

No hosts are currently in the borg groups, so this can be applied
without affecting production.  I'd suggest the next steps are to bring
up a borg-based backup server and put a few hosts into this.  After
running for a while, we can add all hosts, and then deprecate the
current bup-based backup server in vexxhost and replace that with a
borg-based one; giving us dual offsite backups.

[1] https://borgbackup.readthedocs.io/en/stable/

Change-Id: I2a125f2fac11d8e3a3279eb7fa7adb33a3acaa4e
2020-07-21 17:36:50 +10:00
Clark Boylan
4ebff6f9b2 Run our etherpad prod deploy job when docker updates
We want to pick up changes to our docker setup in production. Without
this we don't get the infra-prod-service-etherpad job running when we
update the etherpad docker image.

Change-Id: I25aee457b7c0547fc11439301054bb5aef799476
2020-07-17 13:20:48 -07:00
Ian Wienand
185797a0e5 Graphite container deployment
This deploys graphite from the upstream container.

We override the statsd configuration to have it listen on ipv6.
Similarly we override the nginx config to listen on ipv6, enable ssl,
forward port 80 to 443, and block the /admin page (we don't use it).

For production we will just want to put some cinder storage in
/opt/graphite/storage on the production host and figure out how to
migrate the old stats.  There is also a bit of cleanup that will follow,
because we half-converted grafana01.opendev.org -- so everything can't
be in the same group till that is gone.

Testing has been added to push some stats and ensure they are seen.

Change-Id: Ie843b3d90a72564ef90805f820c8abc61a71017d
2020-07-03 07:17:28 +10:00
Ian Wienand
b146181174 Grafana container deployment
This uses the Grafana container created with
Iddfafe852166fe95b3e433420e2e2a4a6380fc64 to run the
grafana.opendev.org service.

We retain the old model of an Apache reverse-proxy; it's well tested
and understood, it's much easier than trying to map all the SSL
termination/renewal/etc. into the Grafana container and we don't have
to convince ourselves the container is safe to be directly web-facing.

Otherwise this is a fairly straight forward deployment of the
container.  As before, it uses the graph configuration kept in
project-config which is loaded in with grafyaml, which is included in
the container.

One nice advantage is that it makes it quite easy to develop graphs
locally, using the container which can talk to the public graphite
instance.  The documentation has been updated with a reference on how
to do this.

Change-Id: I0cc76d29b6911aecfebc71e5fdfe7cf4fcd071a4
2020-07-03 07:17:22 +10:00
Monty Taylor
83ced7f6e6 Split inventory into multiple dirs and move hostvars
Make inventory/service for service-specific things, including the
groups.yaml group definitions, and inventory/base for hostvars
related to the base system, including the list of hosts.

Move the existing host_vars into inventory/service, since most of
them are likely service-specific. Move group_vars/all.yaml into
base/group_vars as almost all of it is related to base things,
with the exception of the gerrit public key.

A followup patch will move host-specific values into equivalent
files in inventory/base.

This should let us override hostvars in gate jobs. It should also
allow us to do better file matchers - and to be able to organize
our playbooks more if we want to.
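
The resulting layout, roughly (file names illustrative):

```
inventory/
  base/                  # base-system things
    hosts.yaml           # list of hosts
    group_vars/all.yaml  # base-related variables
  service/               # service-specific things
    groups.yaml          # group definitions
    host_vars/           # per-host service variables
```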

Depends-On: https://review.opendev.org/731583
Change-Id: Iddf57b5be47c2e9de16b83a1bc83bee25db995cf
2020-06-04 07:44:36 -05:00
Monty Taylor
f27c170d01 Rename service-letsencrypt to just letsencrypt
This isn't a service, it's a meta thing that we run for different
hosts at different times.

Change-Id: Ib65665c98afb3ddb94b15346931be88a4b1757d8
2020-06-04 07:44:36 -05:00
Monty Taylor
d93a661ae4 Run iptables in service playbooks instead of base
It's the only part of base that's important to run when we run a
service. Run it in the service playbooks and get rid of the
dependency on infra-prod-base.

Continue running it in base so that new nodes are brought up
with iptables in place.

Bump the timeout for the mirror job, because the iptables addition
seems to have just bumped it over the edge.
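
A minimal sketch of the pattern (group and role names illustrative):

```yaml
# service-mirror.yaml: run iptables directly in the service playbook
# instead of depending on infra-prod-base.
- hosts: mirror
  roles:
    - iptables
    - mirror
```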

Change-Id: I4608216f7a59cfa96d3bdb191edd9bc7bb9cca39
2020-06-04 07:44:22 -05:00
Monty Taylor
e8716e742e Move base roles into a base subdir
If we move these into a subdir, it reduces the number of things
we have to write file matchers for.

Stop running disable-puppet-agent in base. We run it in run-puppet
which should be fine.

Change-Id: Ia16adb96b11d25a097490882c4c59a50a0b7b23d
2020-05-27 16:28:37 -05:00
Ian Wienand
45201f3d66 Remove puppet mirror support
Remove the separate "mirror_opendev" group and rename it to just
"mirror".  Update various parts to reflect that change.

We no longer deploy any mirror hosts with puppet, remove the various
configuration files.

Depends-On: https://review.opendev.org/728345
Change-Id: Ia982fe9cb4357447989664f033df976b528aaf84
2020-05-16 10:14:25 +10:00
Monty Taylor
6a53ffa3ae Run accessbot less frequently
We already run accessbot in project-config when the accessbot
script changes. We don't need to run it whenever any of the puppet
or other config on eavesdrop runs, not do we need to run it
hourly. Just run it nightly and on changes to the actual
accessbot config.

Change-Id: Idd47f7c96f677fd1e1c8da3be262a52a70646acd
2020-05-08 08:15:14 -05:00
Clark Boylan
cfc83807b7 Organize zuul jobs in zuul.d/ dir
Our .zuul.yaml file has grown quite large. Try to make this more
manageable by splitting it into zuul.d/ directory with jobs organized by
function.

Change-Id: I0739eb1e2bc64dcacebf92e25503f67302f7c882
2020-05-07 17:30:48 -05:00