This should now be a largely functional deployment of Mailman 3. Some
bits still need testing, but we'll use follow-up changes to force
failure and hold nodes.
This deployment of mailman3 uses upstream docker container images. We
currently hack up uids and gids to accommodate that. We also hack up the
settings file and bind mount it over the upstream file in order to use
host networking. We override the hyperkitty index type to xapian. All
list domains are hosted in a single installation, and we use native
vhosting to handle that.
We'll deploy this to a new server and migrate one mailing list domain at
a time. This will allow us to start with lists.opendev.org and test
things like dmarc settings before expanding to the remaining lists.
A migration script is also included, which has seen extensive
testing on held nodes for importing copies of the production data
sets.
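As a rough illustration of that container setup (the image name, paths,
and mount points below are assumptions, not necessarily what the role
actually uses):

```yaml
# Illustrative compose fragment only; the real deployment is driven by
# Ansible around the upstream mailman3 docker images.
services:
  mailman-web:
    image: maxking/mailman-web        # upstream image; tag/pinning assumed
    network_mode: host                # host networking instead of the bridge
    volumes:
      # Bind mount our settings over the image's copy so we can adjust
      # things like the hyperkitty search index backend (xapian).
      - /etc/mailman3/settings.py:/opt/mailman-web/settings.py:ro
```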
Change-Id: Ic9bf5cfaf0b87c100a6ce003a6645010a7b50358
The Facebook mirror has been out of sync for some days, so I'm
proposing we use the Rackspace one. This reverts [1], as it seems to be
accepting rsync connections properly again.
[1] https://review.opendev.org/c/opendev/system-config/+/824829
Change-Id: Ic0076191157be8947f62ce18d5dd37f1f0ac3337
We still have some Ubuntu Xenial servers, so cap the max usable pip
and setuptools versions in their venvs like we already do for
Bionic, in order to avoid broken installations. Switch our
conditionals from release name comparisons to version numbers in
order to more cleanly support ranges. Also make sure the borg run
test is triggered by changes to the create-venv role.
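A sketch of the version-number conditional (the variable name and pin
values here are illustrative; the real caps live in the create-venv
role):

```yaml
# Illustrative only: cap pip/setuptools in venvs on releases whose
# Python is too old for current versions (Xenial/Bionic).
- name: Cap pip and setuptools on older releases
  set_fact:
    create_venv_pins: "pip<22 setuptools<60"   # assumed caps for py3.5/3.6
  when: ansible_distribution_version is version('18.04', '<=')
```

Comparing version numbers rather than release names means a single
range check covers both Xenial and Bionic.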
Change-Id: I5dd064c37786c47099bf2da66b907facb517c92a
This is the latest 1.1.18 release, and from the changelog there
doesn't seem to be anything important we need to take into account
since 1.1.14.
As a note, the 1.2 series has been released, but updating to it will
require much more thought.
Change-Id: I949c40e9046008d4f442b322a267ce0c967a99dc
As noted inline, make a create-venv role that brings in appropriate
versions on Bionic.
This was noticed because pip is trying to install borgbackup with
setuptools_scm 7.0.5, which doesn't support Python 3.6. We use this
new role to create the venv correctly.
Change-Id: I81fd268a9354685496a75e33a6f038a32b686352
This is a follow-on to Ica63860f3221e99ca0a2aa2636d573fc134447bb to
make what's happening with the various exit points clearer.
Also sneak in an explanation of the weird arg input for clarity.
Change-Id: Ib059f1de465430d6e6f674b6649817105b7ef9a0
Currently we discard the exit code of the acme.sh call and swallow any
possible errors. Although they are logged, this means the Ansible calls
won't fail and you'll have to debug much later on why you didn't get a
certificate as expected.
Capture the failure of the call and log it better. Note that acme.sh
returns "2" when skipping renewal because the current certificates are
still valid. After [1], acme.sh returns "3" when it exits with a TXT
entry requiring validation; anything else is an error on the request
path. Valid issuance returns "0", and anything else is an error.
While we're here, make sure we always output the end stamp by putting
it in an exit trap.
[1] 2d4ea720eb
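A minimal shell sketch of that exit-code handling (the function name is
illustrative; the real logic lives in the role's driver script):

```shell
#!/bin/bash
# Illustrative only: maps acme.sh exit codes as described above.
classify_acme_exit() {
    case "$1" in
        0) echo "issued" ;;              # certificate issued successfully
        2) echo "skipped" ;;             # current certificate still valid
        3) echo "txt-pending" ;;         # TXT entry requires validation
        *) echo "error"; return 1 ;;     # anything else is a request failure
    esac
}

# Always emit the end stamp, regardless of which path exits.
trap 'echo "--- acme run end ---"' EXIT

classify_acme_exit 2
```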
Change-Id: Ica63860f3221e99ca0a2aa2636d573fc134447bb
Similar to Id98768e29a06cebaf645eb75b39e4dc5adb8830d, move the
certificate variables to the group definition file, so that we don't
have to duplicate handlers or definitions for the testing host.
Change-Id: I6650f5621a4969582f40700232a596d84e2b4a06
Currently we define the letsencrypt certs for each host in its
individual host variables.
With recent work we have a trusted CA and SAN names set up in our
testing environment, introducing the possibility that we could
accidentally reference the production host during testing (both have
valid certs, as far as the testing hosts are concerned).
To avoid this, we can use our naming scheme to move our testing hosts
to "99" and avoid collision with the production hosts. As a bonus,
this really makes you think more about your group/host split to get
things right and keep the environment as abstract as possible.
One example of this is that with letsencrypt certificates defined in
host vars, testing and production need to use the same hostname to get
the right certificates created. Really, this should be group-level
information so it applies equally to host01 and host99. To cover
"hostXX.opendev.org" as a SAN we can include the inventory_hostname in
the group variables.
This updates one of the more tricky hosts, static, as a proof of
concept. We rename the handlers to be generic, and update the testing
targets.
Change-Id: Id98768e29a06cebaf645eb75b39e4dc5adb8830d
The letsencrypt_certs variable defined here in the "static" group file
is overwritten by the host variable. It is not doing anything (and we
don't have a logs.openstack.org any more, as it is all in object
storage), so remove it.
Change-Id: I6910d6652c558c94d71b1609d1194b654bc5b42d
Jammy nodes appear to lack the /etc/apt/sources.list.d dir by default.
Ensure it exists in the install-docker role before we attempt to
install a deb repo config to that directory for docker packages.
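A task along these lines (standard Ansible `file` module arguments)
ensures the directory exists first:

```yaml
# Create the apt sources directory before installing the docker repo file.
- name: Ensure /etc/apt/sources.list.d exists
  file:
    path: /etc/apt/sources.list.d
    state: directory
    owner: root
    group: root
    mode: '0755'
```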
Change-Id: I859d31ed116607ffa3d8db5bfd0b805d72dd70c0
This is the first step in running our servers on jammy. It will help
us boot new servers, and replacements for bionic servers, on jammy.
Change-Id: If2e8a683c32eca639c35768acecf4f72ce470d7d
The most recent version of the grafana-oss:latest container seems to be
a beta version with some issues, or maybe we need to adapt our
deployment. Until we do this, pin the container to the latest known
working version.
Change-Id: Id50bf3121f3009f36f0f9961cf5211053410a576
The earlier problems identified with using mod_substitute have been
narrowed down to the new PEP 691 JSON simple API responses from
Warehouse, which are returned as a single line of data. The
currently largest known project index response we've been diagnosing
this problem with is only 1524169 characters in length, but there
are undoubtedly others and they will only continue to grow with
time. The main index is also already over the new 5m limit we set
(nearly double it), and while we don't currently process it with
mod_substitute, we shouldn't make it harder to do so if we need to
later.
Change-Id: Ib32acd48e5166780841695784c55793d014b3580
Reflect changes to mirror vhost configs immediately in their running
Apache services by notifying a new reload handler.
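A sketch of the notify/handler pairing (task and handler names are
illustrative):

```yaml
# Illustrative: re-render the vhost and reload Apache only on change.
- name: Install mirror vhost config
  template:
    src: mirror.vhost.j2
    dest: /etc/apache2/sites-available/mirror.conf
  notify: mirror apache2 reload

# In handlers:
- name: mirror apache2 reload
  service:
    name: apache2
    state: reloaded
```

A reload (rather than restart) applies the new vhost without dropping
in-flight connections.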
Change-Id: Ib3c9560781116f94b0fdfc56dfa5df3a1af74113
We've been getting the following error for some pages we're proxying
today:
AH01328: Line too long, URI /pypi/simple/grpcio/,
While we suspect PyPI or its Fastly CDN may have served some unusual
contents for the affected package indices, the content gets cached
and then mod_substitute trips over the result because it (as of
2.3.15) enforces a maximum line length of one megabyte:
https://bz.apache.org/bugzilla/show_bug.cgi?id=56176
Override that default to "5m" per the example in Apache's
documentation:
https://httpd.apache.org/docs/2.4/mod/mod_substitute.html
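Per that documentation, the override is a single directive; the
location shown here is illustrative:

```apache
# Raise mod_substitute's default 1 MB line-length cap so very long
# single-line index responses can still be rewritten.
<Location "/pypi/">
    SubstituteMaxLineLength 5m
</Location>
```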
Change-Id: I5351f0465287f695fb2f1957062182fd3bf6c226
kernel.org has been rejecting rsync attempts with an over-capacity
message for several days now. Switch to the facebook mirror which
seems to be working for 8-stream.
Change-Id: I98de9dd827a3c78a023b677da854089593d5a454
haproxy only logs to /dev/log; this means all our access logs get
mixed into syslog. This makes it impossible to pick out anything
interesting in syslog (and vice versa: you have to filter things out
when analysing just the haproxy logs).
It seems like the standard way to deal with this is to have rsyslogd
listen on a separate socket, and then point haproxy to that. So this
configures rsyslogd to create /var/run/dev/log and maps that into the
container as /dev/log (i.e. don't have to reconfigure the container at
all).
We then capture this socket's logs to /var/log/haproxy.log, and
install rotation for it.
Additionally we collect this log from our tests.
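The rsyslogd side can be sketched as a drop-in like this (the file name
and filter details are assumptions):

```text
# /etc/rsyslog.d/49-haproxy.conf (sketch)
# Extra socket, bind-mounted into the haproxy container as /dev/log.
$AddUnixListenSocket /var/run/dev/log

# Divert haproxy's messages to their own file and stop processing so
# they stay out of the main syslog.
:programname, startswith, "haproxy" /var/log/haproxy.log
& stop
```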
Change-Id: I32948793df7fd9b990c948730349b24361a8f307
Move the paste testing server to paste99 to distinguish it in testing
from the actual production paste service. Since we have certificates
set up now, we can test directly against "paste99.opendev.org",
removing the insecure flags from various calls.
Change-Id: Ifd5e270604102806736dffa86dff2bf8b23799c5
To make testing more like production, copy the OpenDev CA into the
haproxy container configuration directory during Zuul runs. We then
update the testing configuration to use SSL checking like production
does with this cert.
Change-Id: I1292bc1aa4948c8120dada0f0fd7dfc7ca619afd
Some of our testing makes use of secure communication between testing
nodes; e.g. testing a load-balancer pass-through. Other parts
"loop-back" but require flags like "curl --insecure" because the
self-signed certificates aren't trusted.
To make testing more realistic, create a CA that is distributed and
trusted by all testing nodes early in the Zuul playbook. This then
allows us to sign local certificates created by the letsencrypt
playbooks with this trusted CA and have realistic peer-to-peer secure
communications.
The other thing this does is rework the letsencrypt self-signed cert
path to correctly set up SAN records for the host. This also improves
the "realism" of our testing environment. This is so realistic that
it requires fixing the gitea playbook :). The Apache service proxying
gitea currently has to override in testing to "localhost" because that
is all the old certificate covered; we can now just proxy to the
hostname directly for testing and production.
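The CA and signing flow can be sketched with openssl directly (file
names and subjects here are illustrative, not what the playbooks
actually use):

```shell
#!/bin/sh
# Sketch: create a throwaway CA, then sign a host cert with a SAN.
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca.key -out ca.crt \
    -subj "/CN=OpenDev Test CA" -days 1

# CSR for the test host.
openssl req -newkey rsa:2048 -nodes -keyout host.key -out host.csr \
    -subj "/CN=host99.opendev.org"

# SAN extension so clients validating by hostname are satisfied.
printf 'subjectAltName=DNS:host99.opendev.org\n' > san.ext

# Sign the host cert with the CA.
openssl x509 -req -in host.csr -CA ca.crt -CAkey ca.key \
    -CAcreateserial -out host.crt -days 1 -extfile san.ext
```

Any node that trusts ca.crt will then accept host.crt for
host99.opendev.org without insecure flags.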
Change-Id: I3d49a7b683462a076263127018ec6a0f16735c94
A missed detail of the HTTPS config migration,
/usr/lib/mailman/Mailman/Defaults.py explicitly sets this:
PUBLIC_ARCHIVE_URL = 'http://%(hostname)s/pipermail/%(listname)s/'
Override that setting to https:// so that the archive URL embedded
in E-mail headers will no longer unnecessarily rely on our Apache
redirect. Once merged and deployed, fix_url.py will need to be rerun
for all the lists on both servers in order for this update to take
effect.
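The override itself is a single line in mm_cfg.py, mirroring the
Defaults.py entry quoted above:

```python
# mm_cfg.py: embed HTTPS archive URLs in list mail headers.
PUBLIC_ARCHIVE_URL = 'https://%(hostname)s/pipermail/%(listname)s/'
```

Mailman expands the pattern per-list with the site hostname and list
name.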
Change-Id: Ie4a6e04a2ef0de1db7336a2607059a2ad42665c2
openEuler 20.03 LTS SP2 went out of date in May 2022, and the newest
LTS version is 22.03 LTS, which will be maintained until March 2024.
This patch adds the 22.03-LTS mirror.
Change-Id: I2eb72de4eee22a7a8739320ead8376c999993928
For the past six months, all our mailing list sites have supported
HTTPS without incident. The main downside to the current
implementation is that Mailman itself writes some URLs with an
explicit scheme, causing people submitting forms from pages served
over HTTPS to get warnings because the forms are posting to plain
HTTP URLs for the same site. In order to correct this, we need to
tell Mailman to put https:// instead of http:// into these, but
doing so essentially eliminates any reason for us to continue
serving content over plain HTTP anyway.
Configure the default URL scheme of all our Mailman sites to use
HTTPS now, and set up permanent redirects from HTTP to HTTPS, per
the examples in the project's documentation:
https://wiki.list.org/DOC/4.27%20Securing%20Mailman%27s%20web%20GUI%20by%20using%20Secure%20HTTP-SSL%20%28HTTPS%29
Also update our testinfra functions to validate the blanket
redirects and perform all other testing over HTTPS.
Once this merges, the fix_url script will need to be run manually
against all lists for the current sites, as noted in that document.
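Following that document, the change boils down to a Mailman default
plus a blanket Apache redirect (the vhost below is illustrative):

```apache
# mm_cfg.py gains: DEFAULT_URL_PATTERN = 'https://%s/cgi-bin/mailman/'
# and Apache permanently redirects all plain-HTTP requests:
<VirtualHost *:80>
    ServerName lists.example.org
    Redirect permanent / https://lists.example.org/
</VirtualHost>
```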
Change-Id: I366bc915685fb47ef723f29d16211a2550e02e34
When we migrated this to Ansible, I missed that we didn't bring across
the storage-aggregation.conf file.
This has had the unfortunate effect of regressing the xFilesFactor set
for every newly created graphite stat since the migration. This
setting is a percentage (0-1 float) of how much of a "bucket" needs to
be non-null to keep the value when rolling up changes. We want this
to be zero due to the sporadic nature of data (see the original change
I5f416e798e7abedfde776c9571b6fc8cea5f3a33).
This only affected newly created statistics, as graphite doesn't
modify this setting once it creates the whisper file. This probably
helped us overlook this for so long, as longer-existing stats were
operating correctly, but newer ones were dropping data when zoomed out.
Restore this setting, and double-check it in testinfra for the future.
For simplicity and to get this back to the prior state I will manually
update the on-disk .wsp files to this when this change applies.
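The restored file amounts to something like this (the aggregation
method shown is an assumption; the key point is the zero xFilesFactor):

```ini
# storage-aggregation.conf: keep rolled-up values even when a bucket is
# mostly null, since our stats arrive sporadically.
[default]
pattern = .*
xFilesFactor = 0
aggregationMethod = average
```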
Change-Id: I57873403c4ca9783b1851ba83bfba038f4b90715
We previously auto updated nodepool builders but not launchers when new
container images were present. This created confusion over what versions
of nodepool opendev is running. Use the same behavior for both services
now and auto restart them both.
There is a small chance that we can pull in an update that breaks things
so we run serially to avoid the most egregious instances of this
scenario.
Change-Id: Ifc3ca375553527f9a72e4bb1bdb617523a3f269e
This is a new config option for Gerrit 3.5. While it defaults to true we
set it explicitly to true to avoid any changes in behavior should that
default change eventually with newer Gerrit. They note this is expensive
to calculate, but our users rely on it and it hasn't caused us problems
yet. We can always explicitly disable it in the future if that becomes
necessary.
Change-Id: Idc002810de2d848af043978894ef9dc194ac5b6a
The zuul cli command is deprecated and creates a warning when it is
being used, replace it with zuul-admin.
Change-Id: Ifcc891f5da6f16824a65dc8dbf560b5d4c6ee9fc
Add the released Fedora 36 to the mirror. Traditionally we have kept
two releases (prior and current) around, but depending on what is
broken we often drop the prior release early if it is not worth
fixing; this is what happened with F34. So this adds 36 and leaves
35, for now.
Change-Id: I9864666be0a6e32edc730b736f81d8883411bcb2
This updates the gerrit configuration to deploy 3.5 in production.
For details of the upgrade process see:
https://etherpad.opendev.org/p/gerrit-upgrade-3.5
Change-Id: I50c9c444ef9f798c97e5ba3dd426cc4d1f9446c1
As part of the Gerrit 3.5 upgrade we are also upgrading the reviewdb
to the latest mariadb LTS. This should be merged after the update
process; see
https://etherpad.opendev.org/p/gerrit-upgrade-3.5
Change-Id: Ie30c84eeb003ee86a7a66e0c1c5fd7f95ddf3f5f
Previously the merger's docker-compose restart value was set to
"always". This caused the merger to restart immediately after being
asked to gracefully stop, so our check for the merger stopping:
docker-compose ps -q | xargs docker wait
never saw it as stopped.
Make the mergers match executors and restart only on failure. This
should allow us to gracefully stop the mergers with intention and detect
they are stopped for maintenance purposes.
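The resulting policy change is small (the service name here is
assumed):

```yaml
# docker-compose.yaml: restart on crashes, but stay down after a
# deliberate graceful stop.
services:
  merger:
    restart: on-failure
```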
Change-Id: Ia8d12fbf6a45e4ca85174ccafd18b5d2351c26c1
This handles rolling the mergers and executors, but not yet
the schedulers.
Also, it does the executors in complete batches of 6, but could be
improved to stop 6 and then do each of the next as the first ones
complete.
Change-Id: I2dca104194c2f129b68dcef7721d7d08cb987c46