The current loop here uses the ansible_host value of the ZK servers,
which we have set to the IPv4 address in the inventory.
nb03 is constantly dropping out of ZK; for the record, the logs show:
2021-04-21 05:56:15,151 WARNING kazoo.client: Connection dropped: socket connection error: Connection reset by peer
2021-04-21 05:56:15,151 WARNING kazoo.client: Transition to CONNECTING
2021-04-21 05:56:15,151 INFO kazoo.client: Zookeeper connection lost
2021-04-21 05:56:15,152 INFO kazoo.client: Connecting to 23.253.90.246(23.253.90.246):2281, use_ssl: True
2021-04-21 05:56:15,176 INFO kazoo.client: Zookeeper connection established, state: CONNECTED
and this happens every few minutes. This cloud does IPv4 behind a NAT
and it seems very likely this is related.
So the primary motivation here is to see if using IPv6 clears this up,
giving us some datapoints. I also think that our other nodepool
hosts should all be fine to use ZK over IPv6. However, in the gate we
may have cases where hosts don't have IPv6 addresses, so this looks
for the v6 address and, if not found, falls back to the current
ansible_host behaviour.
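For illustration only, the fallback described above can be sketched in
Python. This is not the template change itself; the variable name
public_v6 and the sample addresses are assumptions made for the example.

```python
def zk_address(hostvars):
    """Pick the address a ZK client should use for one host.

    Prefer the IPv6 address when the inventory records one, otherwise
    fall back to ansible_host (the IPv4 address in our inventory).
    """
    # "public_v6" is a hypothetical variable name used for this sketch.
    v6 = hostvars.get("public_v6")
    if v6:
        return v6
    return hostvars["ansible_host"]


# Sample host variables using documentation-only addresses.
with_v6 = {"ansible_host": "203.0.113.10", "public_v6": "2001:db8::10"}
without_v6 = {"ansible_host": "203.0.113.11"}

print(zk_address(with_v6))
print(zk_address(without_v6))
```

An empty or missing v6 value is treated the same way, so gate hosts
without IPv6 keep the current behaviour.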
Change-Id: Ifde86ddd632662f36bcbe2a0dc99660f06b01ac3
This reverts commit 05021f11a29a0213c5aecddf8e7b907b7834214a.
This switches Zuul and Nodepool to use Zookeeper TLS. The ZK
cluster is already listening on both ports.
Change-Id: I03d28fb75610fbf5221eeee28699e4bd6f1157ea
We use Ansible's to_nice_yaml output filter when writing Ansible
data structures to yaml. This has a default indent of 4, but we humans
usually write yaml with an indent of 2. Make the generated yaml more
similar to what we humans write and set the indent to 2.
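In a template this amounts to passing indent=2 to the filter, i.e.
to_nice_yaml(indent=2). To show what the indent width changes, here is
a small hand-rolled emitter in Python. It is only a sketch and not the
code Ansible uses; it handles nested dicts and lists of scalars, and
the sample data is made up.

```python
def to_yaml(data, indent=2, level=0):
    """Render nested dicts and lists of scalars as block-style yaml text."""
    pad = " " * (indent * level)
    lines = []
    if isinstance(data, dict):
        for key, value in data.items():
            if isinstance(value, (dict, list)):
                # Nested containers go on following lines, one level deeper.
                lines.append(f"{pad}{key}:")
                lines.append(to_yaml(value, indent, level + 1))
            else:
                lines.append(f"{pad}{key}: {value}")
    elif isinstance(data, list):
        for item in data:
            lines.append(f"{pad}- {item}")
    else:
        lines.append(f"{pad}{data}")
    return "\n".join(lines)


sample = {"outer": {"inner": 1, "items": ["a", "b"]}}
print(to_yaml(sample, indent=4))  # the filter's default width
print(to_yaml(sample, indent=2))  # what we usually write by hand
```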
Change-Id: I3dc41b54e1b6480d7085261bc37c419009ef5ba7
To prepare for switching to TLS, set up TLS certs for Zookeeper and
all of Nodepool and Zuul, but do not have them connect over TLS yet.
We have observed problems with Kazoo using TLS in production. This
will let us run the ZK quorum using TLS internally, and have Zuul
and Nodepool connect over plaintext while also exposing the TLS
client port so that we can perform some more production tests.
Change-Id: If93b27f5b55be42be1cf6ee23258127fab5ce9ea
This reverts commit 29825ac18b58145f007f64b2998357445b8fdd91.
We observed this issue in production:
https://github.com/python-zk/kazoo/issues/587
Revert until we find a fix.
Change-Id: Ib7b8e3b06770a83b39458d09d2b1e655bd94bd22
This creates TLS certs for Zookeeper, uses them inside the ZK
quorum, and configures Nodepool and Zuul to use them as well.
A full system restart of all ZK-related components will be required
after merging this patch.
Change-Id: I0cb96a989f3d2c7e0563ce8899f2a5945ea225b3
This avoids the conflict with the zuul user (1000) on the test
nodes. The executor will continue to use the default username
of 'zuul' as the ansible_user in the inventory.
This change also touches the zk and nodepool deployment to use
variables for the usernames and uids to make changes like this
easier. No changes are intended there.
Change-Id: Ib8cef6b7889b23ddc65a07bcba29c21a36e3dcb5
Our zk config is a little too brittle. Let's just use the inventory
vars instead of detected network facts.
Change-Id: I288990edf587bc8394c9473388a858f46efb0691
We have a logging config to log to /var/log/nodepool but we weren't
using it. Start using it.
Add logging config to nodepool-builder
We should log nodepool builder to /var/log/nodepool too.
Change-Id: I6e7196dc12e8c1bfc54274432b94cf53629bdf3d
Rather than running a local zookeeper, just run a real zookeeper.
Also, get rid of nb01-test and just use nb04 - what could possibly
go wrong?
Dynamically write zookeeper host information to nodepool.yaml
So that we can run an actual zk using the new zk role on hosts in
ansible inventory, we need to write out the ip addresses of the
hosts that we build in zuul. This means having the info baked in
to the file in project-config isn't going to work.
We can do this in prod too, it shouldn't hurt anything.
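As a rough sketch of the idea (not the role's actual template), the
server list can be derived from inventory data like this. It assumes
the zookeeper-servers entries in nodepool.yaml take host and port
keys, uses the standard ZK client port 2181, and the host names and
addresses are invented for the example.

```python
def zookeeper_servers(hostvars, group, port=2181):
    """Build the zookeeper-servers entries for the hosts in a group.

    hostvars maps inventory host names to their variables; group is the
    list of host names that run ZK.
    """
    return [
        {"host": hostvars[name]["ansible_host"], "port": port}
        for name in group
    ]


# Hypothetical inventory data with documentation-only addresses.
hostvars = {
    "zk01.example.org": {"ansible_host": "203.0.113.21"},
    "zk02.example.org": {"ansible_host": "203.0.113.22"},
}
group = ["zk01.example.org", "zk02.example.org"]

for entry in zookeeper_servers(hostvars, group):
    print(entry)
```

Because the list is computed from whatever hosts exist at deploy time,
the same logic works for nodes built in zuul and for prod.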
Increase timeout for run-service-nodepool
We need to fix the playbook, but we'll do that once puppet is
gone.
Change-Id: Ib01d461ae2c5cec3c31ec5105a41b1a99ff9d84a
We use project-config for gerrit, gitea and nodepool config. That's
cool, because we can clone that from zuul too and make sure that each
prod run we're doing runs with the contents of the patch in question.
Introduce a flag file that can be touched in /home/zuulcd that will
block zuul from running prod playbooks. By default, if the file is
there, zuul will wait for an hour before giving up.
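The waiting behaviour can be sketched in Python as below. This is an
illustration, not the playbook logic in this change; the polling
interval and the flag file name in the usage comment are assumptions.

```python
import os
import time


def wait_for_flag_removed(path, timeout=3600.0, interval=10.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Block while the flag file exists, for at most timeout seconds.

    Returns True once the file is gone (safe to run the playbooks) and
    False if it is still present when the timeout expires.
    """
    deadline = clock() + timeout
    while os.path.exists(path):
        if clock() >= deadline:
            return False
        sleep(interval)
    return True


# Usage sketch; "example-flag" is a hypothetical file name.
# if not wait_for_flag_removed("/home/zuulcd/example-flag"):
#     raise SystemExit("flag file still present after an hour, giving up")
```

The sleep and clock parameters only exist so the loop can be exercised
without actually waiting an hour.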
Rename zuulcd to zuul
To better align prod and test, name the zuul user zuul.
Change-Id: I83c38c9c430218059579f3763e02d6b9f40c7b89
This adds a simple role to install Zookeeper.
Add an option to nodepool-base to use this role to install Zookeeper.
Use this in the nodepool-builder gate testing where we are just
validating that the nodepool-builder container starts and is ready to
accept connections. It needs a zookeeper to talk to, even though it
is not going to do anything.
Change-Id: I4ae89a51e454be4ee53ad4e04407162aaa8d9f9a
This is a start at ansible-deployed nodepool environments.
We rename the minimal-nodepool element to nodepool-base-legacy, and
keep running that for the old nodes.
The groups are updated so that only the .openstack.org hosts will run
puppet. Essentially they should remain unchanged.
We start a nodepool-base element that will replace the current
puppet-<openstackci|nodepool> deployment parts. For step one, this
grabs project-config and links in the elements and config file.
A testing host is added for gate testing which should trigger these
roles. This will build into a full deployment test of the builder
container.
Change-Id: If0eb9f02763535bf200062c51a8a0f8793b1e1aa
Depends-On: https://review.opendev.org/#/c/710700/