It looks like purging every 6 hours is too infrequent; that is enough
time for the disk to fill up when we're busy. Instead, purge old
snapshots every 2 hours, which looks like it should give us plenty of
headroom with our current usage pattern.
Change-Id: Ieb92d052e633e9326c41367442f036cc333c40f2
We were using a loop index, which meant that for our cluster size of
three we would always assign server.1 through server.3. Unfortunately,
as we replace servers we may add nodes with a myid value >3, which
breaks when we try to assign server IDs in this way.
Fix it by using the calculation for myid in the peer listing.
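
A minimal sketch of the difference, in Python rather than the actual
templates. The myid rule here (digits taken from the hostname) and the
hostnames are assumptions for illustration only; 2888 and 3888 are the
conventional ZooKeeper peer and election ports.

```python
import re


def myid_for(hostname):
    """Hypothetical myid rule: the digits in the first hostname label."""
    match = re.search(r"(\d+)", hostname.split(".")[0])
    if match is None:
        raise ValueError(f"no numeric suffix in {hostname!r}")
    return int(match.group(1))


def peer_listing_by_index(hostnames):
    # Old behaviour: a loop index always yields server.1..server.N,
    # regardless of the myid each node actually has.
    return [
        f"server.{i}={host}:2888:3888"
        for i, host in enumerate(hostnames, start=1)
    ]


def peer_listing(hostnames):
    # New behaviour: reuse the myid calculation so that each server.N
    # entry matches the myid of the node it names.
    return [
        f"server.{myid_for(host)}={host}:2888:3888"
        for host in hostnames
    ]
```

With a quorum of zk02, zk04 and zk05, the index-based listing would
name them server.1 through server.3, while the myid-based listing
names them server.2, server.4 and server.5.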
Change-Id: Icf770c75cf3a84420116f47ad691d9f06191fb65
This adds a program, zookeeper-statsd, which monitors ZooKeeper
metrics and reports them to statsd. It also adds a container to
run that program, runs the container on each of the ZooKeeper
quorum members, updates the graphite host to allow statsd traffic
from quorum members, and updates the 4-letter-word whitelist to
allow the mntr command (which is used to gather metrics) to be
issued.
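
The general idea can be sketched as follows. This is not the
zookeeper-statsd source; the function names and the metric prefix are
assumptions. It relies only on mntr returning tab-separated key/value
lines and on the statsd gauge format "name:value|g".

```python
import socket


def fetch_mntr(host, port=2181):
    """Send the mntr four-letter command and return the raw reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


def parse_mntr(text):
    """Keep only the numeric metrics from a mntr reply."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue
        try:
            metrics[key] = int(value)
        except ValueError:
            try:
                metrics[key] = float(value)
            except ValueError:
                # Non-numeric values such as zk_version are skipped.
                continue
    return metrics


def to_statsd_gauges(metrics, prefix):
    """Render each metric as a statsd gauge line."""
    return [f"{prefix}.{k}:{v}|g" for k, v in sorted(metrics.items())]


def send_gauges(lines, host, port=8125):
    """Send gauge lines to statsd over UDP, one datagram per metric."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
```

A polling loop would call fetch_mntr for the local server, pass the
result through parse_mntr and to_statsd_gauges, and hand the lines to
send_gauges on a fixed interval.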
Change-Id: I298f0b13a05cc615d8496edd4622438507fc5423
ZooKeeper supports a number of "4 letter" commands [0] which are useful
for debugging and general diagnostics. By default only srvr is enabled,
but we want to add stat and dump to see details on server and client
connection statuses.
We do this via the 4lw.commands.whitelist configuration option [1]
rather than the docker image env vars because we are already mounting
in a zoo.cfg.
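
As an illustration of the resulting setting, the helper below (a
hypothetical name, not part of this change) renders the zoo.cfg line
for a given list of commands, using the comma-separated form the
ZooKeeper documentation shows for this option.

```python
def whitelist_line(commands):
    """Render the zoo.cfg line that enables the given 4lw commands."""
    return "4lw.commands.whitelist=" + ", ".join(commands)


# Commands named in this change: the default srvr plus stat and dump.
print(whitelist_line(["srvr", "stat", "dump"]))
```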
[0] https://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_4lw
[1] https://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_clusterOptions
Change-Id: I24ea9b37cd5766c9d393106e8eab34623cad1624
To prepare for switching to TLS, set up TLS certs for ZooKeeper and
all of Nodepool and Zuul, but do not have them connect over TLS yet.
We have observed problems with Kazoo using TLS in production. This
will let us run the ZK quorum using TLS internally, and have Zuul
and Nodepool connect over plaintext while also exposing the TLS
client port so that we can perform some more production tests.
Change-Id: If93b27f5b55be42be1cf6ee23258127fab5ce9ea
This reverts commit 29825ac18b58145f007f64b2998357445b8fdd91.
We observed this issue in production:
https://github.com/python-zk/kazoo/issues/587
Revert until we find a fix.
Change-Id: Ib7b8e3b06770a83b39458d09d2b1e655bd94bd22
This creates TLS certs for ZooKeeper, uses them inside the ZK
quorum, and configures Nodepool and Zuul to use them as well.
A full system restart of all ZK-related components will be required
after merging this patch.
Change-Id: I0cb96a989f3d2c7e0563ce8899f2a5945ea225b3
Migration plan:
* add zk* to emergency
* copy data files on each node to a safe place for DR backup
* make a json data backup: zk-shell localhost:2181 --run-once 'mirror / json://!tmp!zookeeper-backup.json/'
* manually run a modified playbook to set up the docker infra without starting containers
* rolling restart; for each node:
  * stop zk
  * split data and log files and move them to new locations
  * remove zk packages
  * start zk containers
* remove from emergency; land this change.
Change-Id: Ic06c9cf9604402aa8eb4bb79238021c14c5d9563