We currently duplicate the KDC settings across all our Kerberos
clients. Add the clients to a "kerberos-client" group instead and set
the variables once in a group file.
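As a sketch of the idea (the file location and variable names here
are illustrative, not the exact ones used):

    # inventory/group_vars/kerberos-client.yaml
    kerberos_realm: OPENSTACK.ORG
    kerberos_kdcs:
      - kdc01.openstack.org
      - kdc02.openstack.org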
Change-Id: I25ed5f8c68065060205dfbb634c6558488003a38
The PUBLIC_URL value is quoted, which results in quotes ending up in
our config and breaking the etherpad base url setting in config.js. We
remove the quotes as they are not necessary.
We also remove the /p/ suffix from ETHERPAD_URL_BASE, as it causes the
proxying to send extra /p/ path components to etherpad, which results
in problems.
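For illustration, the corrected env settings look something like this
(values are placeholders standing in for our real hostnames):

    # unquoted value, and no trailing /p/ on the etherpad base
    PUBLIC_URL=https://meetpad.opendev.org
    ETHERPAD_URL_BASE=https://etherpad.opendev.org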
Note these fixes appear to be necessary but are not sufficient to have
working meetpad proxying of etherpad. We also need to fix the nginx
meet.conf proxy settings to send valid Host headers. A followup change
will attempt to address that.
Change-Id: I0f59339a33267468ad5481858507a43cefa0021d
We unforked our jitsi web container and discovered that etherpad doc
embedding was broken. In the process of debugging this, the jitsi meet
services on meetpad were restarted, which pulled in newer configs that
expect ENABLE_XMPP_WEBSOCKET to be enabled by default. Unfortunately
this wasn't quite working for us. Explicitly disabling it seems to
make audio and video calling work again, but doc sharing isn't even
attempted now.
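The explicit override amounts to a single env var for the containers,
along these lines (a sketch, using docker-jitsi-meet's 0/1 boolean
convention):

    ENABLE_XMPP_WEBSOCKET=0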
Let's get this fix in as audio and video are important then we'll keep
debugging the etherpad doc sharing problem.
https://github.com/jitsi/docker-jitsi-meet/issues/902 has details from
others that hit this problem.
Note that part of the issue here seems to be that nginx is using the
default configs in the container found at /default and not the configs
we bind mount at /config. This at least seems to be why the proxying for
etherpad documents is broken.
Change-Id: I03fa9d331e6825b3b953a3573c0dd43c7be478a4
This adds a role and related testing to manage our Kerberos KDC
servers, intended to replace the puppet modules currently performing
this task.
This role automates realm creation, initial setup, key material
distribution and replica host configuration. None of this is intended
to run on the production servers, which are already set up with an
active database, and the role should be effectively idempotent in
production.
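As a sketch, the role is consumed along these lines (the group and
role names here are illustrative, not the exact ones added):

    - hosts: kdc-primary
      roles:
        - kdc-primary

    - hosts: kdc-replicas
      roles:
        - kdc-replica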
Note that this does not yet switch the production servers into the new
groups; this can be done in a separate step under controlled
conditions and with related upgrades of the host OS to Focal.
Change-Id: I60b40897486b29beafc76025790c501b5055313d
There is some correlation between running the manage-projects playbook
and our gitea having fits. The bulk of the work done here is in trying
to update the descriptions of all projects. There isn't a good way to
check whether a description is already set first, so we just try and
ignore errors. This creates potentially thousands of operations all at
once and could be why things are sad.
We move these operations under the always update flag which is not set
on normal runs. If we really need to converge to a good updated state we
can manually run the playbook/role with always update set.
We also don't set a limit on the number of ThreadPoolExecutor workers,
which then defaults to 5 * NumProcs. It could be that tuning this down
would make gitea happier.
One other thought is that we may not be using requests sessions
properly for connection reuse. In particular, the requests docs note
that you need to set stream to False or read the response content to
return a connection back to the pool for reuse. We might look into
this for further improvements.
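If we chase both of those ideas, the shape of the change would be
roughly the following sketch (not the actual manage-projects code; the
endpoint, data and worker count are placeholders):

    import concurrent.futures
    import requests

    GITEA_URL = 'https://gitea01.opendev.org:3000'  # placeholder
    session = requests.Session()  # shared session -> shared pool

    def set_description(project, description):
        resp = session.patch(
            '%s/api/v1/repos/%s' % (GITEA_URL, project),
            json={'description': description})
        # Reading the content (with stream left at the default False)
        # returns the connection to the pool for reuse.
        _ = resp.content
        return resp.status_code

    projects = {'example/project': 'An example description'}
    # Cap the worker count explicitly instead of taking the
    # 5 * NumProcs default.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as ex:
        results = ex.map(lambda kv: set_description(*kv),
                         projects.items())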
Change-Id: I6e6fb1eb08303e9da7e38cf493d1871364340000
This got copied from another command that also had this typo.
Also, don't bother backing up the on-disk backups, as we back up
directly via the stream dumps.
Change-Id: Ie200a29eec2b1a0725a8872ab548bcb0f26980e6
Zookeeper supports a number of "4 letter" commands [0] which are useful
for debugging and general diagnostics. By default only srvr is enabled,
but we want to add stat and dump to see details on server and client
connection statuses.
We do this via the 4lw.commands.whitelist configuration option [1] and
not the docker image env vars because we're mounting a zoo.cfg in
already.
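The resulting zoo.cfg line is simply:

    4lw.commands.whitelist=srvr,stat,dump

Once in place, the commands can be exercised against a member with
e.g. "echo stat | nc localhost 2181".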
[0] https://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_4lw
[1] https://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_clusterOptions
Change-Id: I24ea9b37cd5766c9d393106e8eab34623cad1624
The production server is trying to send itself to
refstack01.openstack.org, causing cross-site scripting issues. In
production, use the CNAME, but use the FQDN for testing.
Fix up job file matchers while here.
Change-Id: I18a5067ee25c59c5eaa17b7c2d9bd5a942a9173d
The previous refstack server had 'api' in the endpoint
addresses of API calls. Let's try to set it in the new
instance as well to keep the same interface.
Also, fix the typo in the testinfra host match and in
the test name.
Change-Id: I7319990144396b3a753678975a09b0add3ac4465
This has our change to open etherpad on join, so we should no longer need
to run a fork of the web server. Switch to the upstream container image
and stop building our own.
Change-Id: I3e8da211c78b6486a3dcbd362ae7eb03cc9f5a48
These are new focal replacement servers. Because this is the last set
of replacements for the executors, we also clean up the testing of the
old servers in the system-config-run-zuul job and the inventory group
checker job.
Change-Id: I111d42c9dfd6488ef69ff1a7f76062a73d1f37bf
The path for the get-pip.py script for Python 3.5 has been changed by
this upstream commit [1].
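Our understanding of the new layout is that the versioned script now
lives under a per-version prefix, i.e.:

    https://bootstrap.pypa.io/pip/3.5/get-pip.py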
[1] 2360f025eb
Change-Id: Ie13a6597c23c0a376f9feba2aed664e1129c5b60
We have identified an issue with stevedore < 3.3.0 where the
cloud-launcher, running under ansible, makes stevedore hash a /tmp
path into the name of the entry-point cache file it creates, causing a
never-ending expansion of cache files.
This appears to be fixed by [1] which is available in 3.3.0. Ensure
we install this on bridge. For good measure, add a ".disable" file as
we don't really need caches here.
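In sketch form this amounts to (module args are illustrative, and the
cache path is our understanding of stevedore's default location):

    - name: Install stevedore with the entry-point cache fix
      pip:
        name: stevedore>=3.3.0

    - name: Disable stevedore entry-point caching
      file:
        path: /root/.cache/python-entrypoints/.disable
        state: touch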
There are currently 491,089 leaked files, so I didn't think it wise to
delete these in an ansible loop as it would probably time out the job.
We can do this manually once we stop creating them :)
[1] d7cfadbb7d
Change-Id: If5773613f953f64941a1d8cc779e893e0b2dd516
This server has been replaced by ze01.opendev.org running Focal. Let's
remove the old ze01.openstack.org from the inventory so that we can
delete the server. We will follow this up with a rotation of new focal
servers being put in place.
This also renames the xenial executor in testing to ze12.openstack.org
as that will be the last one to be rotated out in production. We will
remove it from testing at that point as well.
We also remove a completely unused zuul-executor-opendev.yaml group_vars
file to avoid confusion.
Change-Id: Ida9c9a5a11578d32a6de2434a41b5d3c54fb7e0c
We are in the process of upgrading the AFS servers to focal. As
explained by auristor (extracted from IRC below) we need 3 servers to
actually perform HA with the ubik protocol:
  the ubik quorum is defined by the list of voting primary ip
  addresses as specified in the ubik service's CellServDB file. The
  server with the lowest ip address gets 1.5 votes and the others 1
  vote. To win election requires greater than 50% of the votes. In a
  two server configuration there are a total of 2.5 votes to cast.
  1.5 > 2.5/2 so afsdb02.openstack.org always wins regardless of what
  afsdb01.openstack.org says. And afsdb01.openstack.org can never win
  because 1 < 2.5/2. by adding a third ubik server to the quorum, the
  total votes cast are 3.5 and it always requires the vote of two
  servers to elect a winner ... if afsdb03 is added with the highest
  ip address, then either afsdb01 or afsdb02 can be elected
Add a third server which is a focal host and related configuration.
Change-Id: I59e562dd56d6cbabd2560e4205b3bd36045d48c2
We update the docker-compose config for zuul-executor to improve its
shutdown handling. In particular we want to support zuul-executor
graceful, which pauses the server and then exits with rc 0 when all
builds complete. To do this we switch restart: always to restart:
on-failure. With the always setting, docker simply restarts
zuul-executor after a graceful stop.
We also remove the SIGHUP stop signal and its long timeout. Zuul
executor does not seem to catch SIGHUP for anything anymore, so this
was only there for old behavior and can be cleaned up.
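In sketch form, the service entry becomes (service name illustrative):

    services:
      executor:
        # was "always", which restarted the executor even after a
        # clean graceful stop
        restart: on-failure
        # stop_signal: SIGHUP and its long stop grace period are
        # removed; the default SIGTERM handling is sufficient now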
Change-Id: I5211b91025ce5a13648f3648db3b42d357ecd590
This is a focal replacement for ze01.openstack.org. Cleanup for
ze01.openstack.org will happen in a followup when we are happy with the
results of running zuul-executor on focal.
Change-Id: If1fef88e2f4778c6e6fbae6b4a5e7621694b64c5
This file is now removed (I0cbcd4694a4796573fe48383756be03597d2da0f);
get rid of this to avoid any confusion.
Change-Id: I837d1fccbfa2461eb1315eac54c2a017fcb86511
This syslog configuration is what sends any logs with a program name
of "docker-<foo>" to /var/log/containers/foo.log. However, at the 98-
level the rules run after the default 50- rules, so we're seeing the
logs copied to both syslog and /var/log/containers. Since this
configuration contains a "stop" command, moving it earlier, before the
default rules, means the docker logs will no longer be duplicated.
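A rough sketch of the mechanism (not our exact rules; the renamed
priority is illustrative):

    # /etc/rsyslog.d/49-docker.conf, sorting before 50-default.conf
    $template ContainerLog,"/var/log/containers/%programname:8:$%.log"
    if $programname startswith 'docker-' then -?ContainerLog
    & stop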
Change-Id: I0cbcd4694a4796573fe48383756be03597d2da0f
As described inline, ensure that minimal facts for the backup servers
are loaded before running the backup roles on hosts, so they can read
the ansible_ssh_host_key_ed25519_public fact for each backup server
and ensure it is accepted.
Update the other comments slightly as well.
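In sketch form, the idea is a small play run ahead of the backup plays
(group name illustrative):

    - hosts: borg-backup-servers
      name: Gather minimal facts for the backup servers
      gather_facts: true
      gather_subset: "!all"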
Change-Id: I1f207ca0770d58f61a89f9ade0bd26cebc982c62
I introduced this typo with I500062c1c52c74a567621df9aaa716de804ffae7.
Luckily Ibb63f19817782c25a5929781b0f6342fe4c82cf0 has alerted us to
this problem.
Change-Id: I02bf2f4fa1041642a719100e9591bf5cd1a0bf49
So we can stop/pull/start, move the pull tasks to their own files
and add a playbook that invokes them.
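Roughly, the new playbook just reuses the split-out tasks, along these
lines (all names illustrative):

    # playbooks/zuul_pull.yaml
    - hosts: zuul-executor
      tasks:
        - name: Pull current container images
          include_role:
            name: zuul-executor
            tasks_from: pull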
Change-Id: I4f351c1d28e5e4606e0a778e545a3a805525ac71
This includes a fix for I216528a76307189d8d87bd2fcfeff95c6ceb53cc.
Now that it's released, we can be a bit more explicit about why we
added the workaround.
Change-Id: Ibaf1850549b5e7ec3622418b650bc5e59a289ab6
We have seen some poor performance from gitea which may be related to
manage-projects updates. Start a dstat service which logs to a csv
file on our system-config-run job hosts in order to collect
performance information from our services in pre-merge testing. This
will include gitea and should help us evaluate service upgrades and
other changes from a performance perspective before they hit
production.
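The service boils down to running something like the following (flags
are illustrative; --output is dstat's csv writer):

    dstat -tcmndrylpg --output /var/log/dstat-csv.log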
Change-Id: I7bdaab0a0aeb9e1c00fcfcca3d114ae13a76ccc9
All hosts are now running their backups via borg to servers in
vexxhost and rax.ord.
For reference, the servers being backed up at this time are:
borg-ask01
borg-ethercalc02
borg-etherpad01
borg-gitea01
borg-lists
borg-review-dev01
borg-review01
borg-storyboard01
borg-translate01
borg-wiki-update-test
borg-zuul01
This removes the old bup backup hosts, the no-longer used ansible
roles for the bup backup server and client roles, and any remaining
bup related configuration.
For simplicity, we will remove any remaining bup cron jobs on the
above servers manually after this merges.
Change-Id: I32554ca857a81ae8a250ce082421a7ede460ea3c
This sets a global BORG_UNDER_CRON=1 environment variable for
production hosts and makes the borg-backup script send an email if any
part of the backup job appears to fail (gating on the variable avoids
spamming ourselves if we're testing backups, etc).
We should ideally never get this email, but if we do it's something we
want to investigate quickly. There's nothing worse than thinking
backups are working when they aren't.
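Schematically, the failure path in the script looks like this
(illustrative excerpt, not the verbatim code):

    # only notify when running under cron, i.e. in production
    if [ "${BORG_UNDER_CRON:-0}" = "1" ]; then
        echo "borg backup failed on $(hostname)" | \
            mail -s "borg backup failed on $(hostname)" root
    fi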
Change-Id: Ibb63f19817782c25a5929781b0f6342fe4c82cf0