15 Commits

Author SHA1 Message Date
Zuul
52f864727e Merge "Skip purged borg backups during backup pruning" 2024-11-25 18:44:02 +00:00
Clark Boylan
cb8226128e Skip purged borg backups during backup pruning
We recently updated our borg backup system to support explicit purging
of backups. Once purged there are no longer any backups to prune, so we
need to skip any repos in this state. Update the pruning script to do
so.

Change-Id: Ib23a72c949c05855176fe54637f406b7797b37ae
2024-11-25 09:17:57 -08:00
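A minimal sketch of the skip logic this change describes, assuming a "purged" marker file and a `/opt/backups/borg-*` layout; both names are illustrative placeholders, not the role's actual paths.

```shell
#!/bin/bash
# Hypothetical sketch of the prune loop's purge check. The "purged" marker
# file and the borg-* directory layout are assumptions for illustration.
is_purged() {
    # A repo counts as purged if a marker file is present in its directory.
    [ -f "$1/purged" ]
}

prune_repos() {
    # Walk each backup repo under the given base, skipping purged ones.
    local base="$1" repo
    for repo in "$base"/borg-*; do
        if is_purged "$repo"; then
            echo "skip $(basename "$repo")"
            continue
        fi
        echo "prune $(basename "$repo")"
    done
}
```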
Clark Boylan
3d9793927c Update backup verifier to handle purged repos
The backup verifier currently emails us warnings about inconsistent
backups in purged backup locations. This is expected because the
backups have been removed/purged. Update the verifier to simply log and
skip over these cases.

Change-Id: I0dd0b464e64dd4795d75e71ec4218d851eb9f742
2024-11-12 08:00:56 -08:00
Ian Wienand
8361ab701c
backups: add retirement and purge lists
This adds a retirement and purge list to the borg management role.

The idea here is that when a backed-up host is shut down, we add its
backup user to the retired list.  On the next ansible run the user
will be disabled on the backup-server and the backup repo marked as
retired.  On the next prune, we will trim the backup to only the last
run to save space.  This gives us a grace period to restore if we
should need to.

When we are sure we don't want the data, we can put it in the purge
list, and the backup repo is removed on the next ansible run (hosts
can go straight into this if we want).  This allows us to have a
review process/history before we purge data.

To test, we create a fake "borg-retired" user on the backup-server,
and give it a simple backup.  This is marked as retired, which is
reflected in the testinfra run of the prune script.  Similarly a
"borg-purge" user is created, and we ensure it's backup dir is
removed.

Documentation is updated.

Change-Id: I5dff0a9d35b11a1f021048a12ecddce952c0c13c
2024-11-08 22:30:49 +11:00
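The retired/purged handling above could be sketched roughly as follows; the list file names ("retired-list", "purge-list"), the per-user repo layout, and the "retired" marker are all illustrative assumptions, not the role's actual implementation.

```shell
#!/bin/bash
# Hedged sketch of classifying a backup user against retirement/purge lists.
# File names, marker, and actions here are assumptions for illustration.
apply_backup_state() {
    local user="$1" base="$2"
    if grep -qxF "$user" "$base/purge-list" 2>/dev/null; then
        # Purged: the backup repo is removed entirely.
        rm -rf "${base:?}/$user"
        echo "purged $user"
    elif grep -qxF "$user" "$base/retired-list" 2>/dev/null; then
        # Retired: mark the repo; the next prune trims it to the last run.
        touch "$base/$user/retired"
        echo "retired $user"
    else
        echo "active $user"
    fi
}
```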
Ian Wienand
476b225fca
borg-backup-server: build borg users betterer
This looks wrong, in hindsight I'm not really sure how it works.
Ansible 6 seems to barf on it.  Make this one evaluated statement.

Change-Id: I7f73bf723af1086fc4473e76614ce30ca14f3d74
2022-11-23 08:26:28 +11:00
Ian Wienand
b9d98cca21 borg-backup: skip .checkpoint archives
We may see an archive with ".checkpoint" on the end, as described in
[1]; the short version is that borg writes a checkpoint every 30
minutes, and one may remain if a long backup is interrupted.  Skip
these when making the list of archives to prune.

We noticed this on wiki-test; for clarity the list of archives looks
like

...
 wiki-upgrade-test-filesystem-2021-02-16T02:56:09.checkpoint Tue, 2021-02-16 02:56:11 [c444a0765e5791f3f68f08624d1efd80bf8a3ebc96bb225f08e4013befa2b460]
 wiki-upgrade-test-filesystem-2021-02-16T17:45:04 Tue, 2021-02-16 17:45:06 [b901b55ac3bf9abecba024caebad5ba7cd1a966e3f00b366f6cff45feba7bdff]
 wiki-upgrade-test-mysql-2021-02-16T18:35:09 Tue, 2021-02-16 18:35:11 [1d38cd3b4b1b3927b543e4ccc6c794cd3a513a70979ff025bbf303e1fe5e490f]
 wiki-upgrade-test-filesystem-2021-02-17T17:45:05 Wed, 2021-02-17 17:45:07 [f665e275c0014a21b82efaece5d36525a4ce6cb423253d5bd0b1323b230fa53a]
...

[1] https://borgbackup.readthedocs.io/en/stable/faq.html#if-a-backup-stops-mid-way-does-the-already-backed-up-data-stay-there

Change-Id: Ia33f46305ef8f541efb7c7150d4bb2e977b01d46
2021-11-03 12:39:10 +11:00
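The filtering described above might look like this: given `borg list` style lines (as in the excerpt), drop any archive whose name (the first column) ends in `.checkpoint` before building the prune list. The function name is illustrative.

```shell
#!/bin/bash
# Print only archive names (first column of "borg list" output) that do not
# end in ".checkpoint", so checkpoint archives never enter the prune list.
list_prunable_archives() {
    awk '$1 !~ /\.checkpoint$/ { print $1 }'
}
```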
Ian Wienand
fff85f029c borg-backup-server: wait for lock in verify
We have seen a case where the weekly verification run conflicted with
an in-progress backup run.  Make the verification step wait for up to
an hour for the lock to allow backups to complete.

Change-Id: Id87dd090c7cd652695ab0c4aa73477cf0d72c28d
2021-10-06 10:34:13 +11:00
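`--lock-wait SECONDS` is a standard borg option; the verification step can use it to wait for an in-progress backup's repository lock instead of failing immediately. The repository path below is an illustrative placeholder.

```shell
# Wait up to an hour (3600s) for the repository lock before verifying;
# the repository path here is a placeholder, not the server's real layout.
borg check --lock-wait 3600 /opt/backups/borg-myhost/backup
```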
Ian Wienand
3d63b3b8a4 borg-backup-server: log prune output to file
This saves prune output to a log file automatically.  Add a bit more
info on the process too.

Change-Id: I2607ddbc313dfebc122609af78bb5eed63906f6b
2021-08-04 14:47:50 +10:00
Ian Wienand
86ed1d74dd borg-backup-server: set SHELL for verification script
In today's weird corner-case issue: when running under cron,
SHELL=/bin/sh ... which doesn't really matter (this script is run
under #!/bin/bash) *except* that "sudo -s" obeys SHELL, and
consequently the in-line script here fails under cron but not when
run interactively.  Just set SHELL=/bin/bash for consistency.

Change-Id: Ic8584b90fea8382f7a7d294b98a0a3689bfc981b
2021-03-23 14:53:56 +11:00
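An illustrative crontab fragment (the script path is a placeholder): setting SHELL at the top of the crontab pins what `sudo -s` invokes, matching the script's own `#!/bin/bash`.

```shell
# crontab fragment: cron defaults SHELL to /bin/sh; pin it to bash so that
# "sudo -s" inside the verification script behaves as it does interactively.
SHELL=/bin/bash
0 2 * * 0 /usr/local/bin/verify-borg-backups
```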
Ian Wienand
e5a2354451 borg-backup-server: fix verification run
&>> is a bashism and not supported by sh, which cron runs the jobs
under.  Use >> instead.

Change-Id: I8e67f466887070fb1dedc403c53227c3ce1b2f1d
2021-03-17 15:09:57 +11:00
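The portable form can be demonstrated: bash's `cmd &>> file` appends both streams, but under plain sh the equivalent is `cmd >> file 2>&1`. A small helper to illustrate (the function name is illustrative):

```shell
#!/bin/bash
# sh-compatible replacement for bash's "&>>": append both stdout and stderr
# to a log file with ">> file 2>&1" (order matters: "2>&1" must come after
# the ">>" so stderr follows stdout into the file).
log_both() {
    local logfile="$1"; shift
    "$@" >> "$logfile" 2>&1
}
```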
Ian Wienand
ece90fb7f7 borg-backup-server: make sure to append verification logs
We don't want to overwrite every run, but rather append to the log
file.

Change-Id: I304caedecbf6a9552f314636ca82a543ef16a8b6
2021-02-15 14:45:03 +11:00
Ian Wienand
0d01d941b1 borg-backup-server: run a weekly backup verification
This checks the backup archives and alerts us if anything seems wrong.
This will take a few hours, so we run once a week.

Change-Id: I832c0d29a37df94d4bf2704c59bb3f8d855c3cc8
2021-02-11 00:43:16 +00:00
Ian Wienand
62801d8a93 borg-backup-server: volume space monitor
Due to backups running in append-only mode, we do not have a way to
safely prune backups automatically.  To reduce the likelihood that we
forget about backups and end up with failing jobs, add a cron job to
send an email to infra-root if the backup partition goes over 90%
usage.  At this point a manual prune should be run
(I9559bb8aeeef06b95fb9e172a2c5bfb5be5b480e).

Change-Id: I250d84c4a9f707e63fef6f70cfdcc1fb7807d3a7
2021-02-09 11:31:02 +11:00
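A minimal sketch of the 90% threshold check; the partition path and the mail step are placeholders (echo stands in for the email to infra-root).

```shell
#!/bin/bash
# Hedged sketch of the usage alert. In the real job the percentage would
# come from something like: df --output=pcent /opt/backups | tail -1
check_backup_usage() {
    local pct="${1%\%}"   # strip a trailing % if present
    if [ "$pct" -gt 90 ]; then
        # Placeholder for mailing infra-root.
        echo "WARNING: backup volume at ${pct}% - run a manual prune"
    fi
}
```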
Ian Wienand
4f0bfa6d9d borg-backup-server: add script for pruning borg backups
This adds a script that performs a manual pruning of backup
directories.

Change-Id: I9559bb8aeeef06b95fb9e172a2c5bfb5be5b480e
2021-02-09 11:29:46 +11:00
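Per-repo, the manual prune boils down to a `borg prune` invocation; `--keep-daily`, `--keep-weekly`, and `--keep-monthly` are real borg options, while the retention counts and path here are illustrative assumptions, not the script's actual policy.

```shell
# Illustrative retention policy; the actual script's keep rules may differ.
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
    /opt/backups/borg-myhost/backup
```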
Ian Wienand
028d655375 Add borg-backup roles
This adds roles to implement backup with borg [1].

Our current tool "bup" has no Python 3 support and is not packaged for
Ubuntu Focal.  This means it is effectively end-of-life.  borg fits
our model of servers backing themselves up to a central location, is
well documented and seems well supported.  It also has the clarkb seal
of approval :)

As mentioned, borg works in the same manner as bup, doing an
efficient backup over ssh to a remote server.  The core of these
roles is the same as the bup-based ones: creating a separate user
for each host and deploying keys and ssh config.

This chooses to install borg in a virtualenv under /opt.  This was
chosen for a number of reasons: firstly, borg's history includes
incompatible updates (although they provide a tool to update
repository formats), so it seems important that we both pin the
version we are using and keep clients and server in sync.  Since we
have a heterogeneous distribution collection we don't want to rely on
the packaged tools, which may differ.  I don't feel like this is a
great application for a container; we actually don't want it that
isolated from the base system, because its goal is to read and copy
data offsite with as little chance of things going wrong as possible.

Borg has a lot of support for encrypting the data at rest in various
ways.  However, that introduces the possibility we could lose both the
key and the backup data.  Really the only thing stopping this is key
management, and if we want to go down this path we can do it as a
follow-on.

The remote end server is configured via ssh command rules to run in
append-only mode.  This means a misbehaving client can't delete its
old backups.  In theory we can prune backups on the server side --
something we could not do with bup.  The documentation has been
updated but is vague on this part; I think we should get some hosts in
operation, see how the de-duplication is working out, and then decide
how we want to manage things long term.

Testing is added; a focal and bionic host both run a full backup of
themselves to the backup server.  Pretty cool, the logs are in
/var/log/borg-backup-<host>.log.

No hosts are currently in the borg groups, so this can be applied
without affecting production.  I'd suggest the next steps are to bring
up a borg-based backup server and put a few hosts into this.  After
running for a while, we can add all hosts, and then deprecate the
current bup-based backup server in vexxhost and replace that with a
borg-based one; giving us dual offsite backups.

[1] https://borgbackup.readthedocs.io/en/stable/

Change-Id: I2a125f2fac11d8e3a3279eb7fa7adb33a3acaa4e
2020-07-21 17:36:50 +10:00
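The ssh command rules mentioned above are typically an authorized_keys forced command; `borg serve --append-only --restrict-to-path` are real borg options, while the path, key material, and comment below are placeholders.

```shell
# ~borg-myhost/.ssh/authorized_keys on the backup server (one line):
# the forced command confines the client to append-only "borg serve",
# so a compromised or misbehaving client cannot delete old backups.
command="borg serve --append-only --restrict-to-path /opt/backups/borg-myhost",restrict ssh-ed25519 AAAA... borg-myhost
```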