15 Commits

Author SHA1 Message Date
Zuul
52f864727e Merge "Skip purged borg backups during backup pruning" 2024-11-25 18:44:02 +00:00
Clark Boylan
cb8226128e Skip purged borg backups during backup pruning
We recently updated our borg backup system to support explicit purging
of backups. Once purged there are no longer any backups to prune, so we
need to skip any repos in this state. Update the pruning script to do
so.

Change-Id: Ib23a72c949c05855176fe54637f406b7797b37ae
2024-11-25 09:17:57 -08:00
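A minimal sketch of the skip logic this change describes, assuming a "purged" marker file and a `/opt/backups/borg-*` layout; both names are illustrative placeholders, not the role's actual paths.

```shell
#!/bin/bash
# Hypothetical sketch of the prune loop's purge check. The "purged" marker
# file and the borg-* directory layout are assumptions for illustration.
is_purged() {
    # A repo counts as purged if a marker file is present in its directory.
    [ -f "$1/purged" ]
}

prune_repos() {
    # Walk each backup repo under the given base, skipping purged ones.
    local base="$1" repo
    for repo in "$base"/borg-*; do
        if is_purged "$repo"; then
            echo "skip $(basename "$repo")"
            continue
        fi
        echo "prune $(basename "$repo")"
    done
}
```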
Clark Boylan
3d9793927c Update backup verifier to handle purged repos
The backup verifier currently emails us warnings about inconsistent
backups in purged backup locations. This is expected because the
backups have been removed/purged. Update the verifier to simply log and
skip over these cases.

Change-Id: I0dd0b464e64dd4795d75e71ec4218d851eb9f742
2024-11-12 08:00:56 -08:00
Ian Wienand
8361ab701c
backups: add retirement and purge lists
This adds a retirement and purge list to the borg management role.

The idea here is that when a backed-up host is shut down, we add its
backup user to the retired list.  On the next ansible run the user
will be disabled on the backup-server and the backup repo marked as
retired.  On the next prune, we will trim the backup to only the last
run to save space.  This gives us a grace period to restore if we
should need to.

When we are sure we don't want the data, we can put it in the purge
list, and the backup repo is removed on the next ansible run (hosts
can go straight into this if we want).  This allows us to have a
review process/history before we purge data.

To test, we create a fake "borg-retired" user on the backup-server,
and give it a simple backup.  This is marked as retired, which is
reflected in the testinfra run of the prune script.  Similarly a
"borg-purge" user is created, and we ensure it's backup dir is
removed.

Documentation is updated.

Change-Id: I5dff0a9d35b11a1f021048a12ecddce952c0c13c
2024-11-08 22:30:49 +11:00
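The retired/purged handling above could be sketched roughly as follows; the list file names ("retired-list", "purge-list"), the per-user repo layout, and the "retired" marker are all illustrative assumptions, not the role's actual implementation.

```shell
#!/bin/bash
# Hedged sketch of classifying a backup user against retirement/purge lists.
# File names, marker, and actions here are assumptions for illustration.
apply_backup_state() {
    local user="$1" base="$2"
    if grep -qxF "$user" "$base/purge-list" 2>/dev/null; then
        # Purged: the backup repo is removed entirely.
        rm -rf "${base:?}/$user"
        echo "purged $user"
    elif grep -qxF "$user" "$base/retired-list" 2>/dev/null; then
        # Retired: mark the repo; the next prune trims it to the last run.
        touch "$base/$user/retired"
        echo "retired $user"
    else
        echo "active $user"
    fi
}
```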
Ian Wienand
476b225fca
borg-backup-server: build borg users betterer
This looks wrong, in hindsight I'm not really sure how it works.
Ansible 6 seems to barf on it.  Make this one evaluated statement.

Change-Id: I7f73bf723af1086fc4473e76614ce30ca14f3d74
2022-11-23 08:26:28 +11:00
Ian Wienand
b9d98cca21 borg-backup: skip .checkpoint archives
We may see an archive with ".checkpoint" on the end, as described in
[1]; the short version is that borg writes a checkpoint every 30
minutes, and one may remain if a long backup is interrupted.  Skip
these when making the list of archives to prune.

We noticed this on wiki-test; for clarity the list of archives looks
like

...
 wiki-upgrade-test-filesystem-2021-02-16T02:56:09.checkpoint Tue, 2021-02-16 02:56:11 [c444a0765e5791f3f68f08624d1efd80bf8a3ebc96bb225f08e4013befa2b460]
 wiki-upgrade-test-filesystem-2021-02-16T17:45:04 Tue, 2021-02-16 17:45:06 [b901b55ac3bf9abecba024caebad5ba7cd1a966e3f00b366f6cff45feba7bdff]
 wiki-upgrade-test-mysql-2021-02-16T18:35:09 Tue, 2021-02-16 18:35:11 [1d38cd3b4b1b3927b543e4ccc6c794cd3a513a70979ff025bbf303e1fe5e490f]
 wiki-upgrade-test-filesystem-2021-02-17T17:45:05 Wed, 2021-02-17 17:45:07 [f665e275c0014a21b82efaece5d36525a4ce6cb423253d5bd0b1323b230fa53a]
...

[1] https://borgbackup.readthedocs.io/en/stable/faq.html#if-a-backup-stops-mid-way-does-the-already-backed-up-data-stay-there

Change-Id: Ia33f46305ef8f541efb7c7150d4bb2e977b01d46
2021-11-03 12:39:10 +11:00
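The filtering described above might look like this: given `borg list` style lines (as in the excerpt), drop any archive whose name (the first column) ends in `.checkpoint` before building the prune list. The function name is illustrative.

```shell
#!/bin/bash
# Print only archive names (first column of "borg list" output) that do not
# end in ".checkpoint", so checkpoint archives never enter the prune list.
list_prunable_archives() {
    awk '$1 !~ /\.checkpoint$/ { print $1 }'
}
```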
Ian Wienand
fff85f029c borg-backup-server: wait for lock in verify
We have seen a case where the weekly verification run conflicted with
an in-progress backup run.  Make the verification step wait for up to
an hour for the lock to allow backups to complete.

Change-Id: Id87dd090c7cd652695ab0c4aa73477cf0d72c28d
2021-10-06 10:34:13 +11:00
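`--lock-wait SECONDS` is a standard borg option; the verification step can use it to wait for an in-progress backup's repository lock instead of failing immediately. The repository path below is an illustrative placeholder.

```shell
# Wait up to an hour (3600s) for the repository lock before verifying;
# the repository path here is a placeholder, not the server's real layout.
borg check --lock-wait 3600 /opt/backups/borg-myhost/backup
```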
Ian Wienand
3d63b3b8a4 borg-backup-server: log prune output to file
This saves prune output to a log file automatically.  Add a bit more
info on the process too.

Change-Id: I2607ddbc313dfebc122609af78bb5eed63906f6b
2021-08-04 14:47:50 +10:00
Ian Wienand
86ed1d74dd borg-backup-server: set SHELL for verification script
In today's weird corner-case issue: when running under cron,
SHELL=/bin/sh ... which doesn't really matter (this script is run
under #!/bin/bash) *except* that "sudo -s" obeys SHELL, and
consequently the in-line script here fails under cron but not when
run interactively.  Just set SHELL=/bin/bash for consistency.

Change-Id: Ic8584b90fea8382f7a7d294b98a0a3689bfc981b
2021-03-23 14:53:56 +11:00
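An illustrative crontab fragment (the script path is a placeholder): setting SHELL at the top of the crontab pins what `sudo -s` invokes, matching the script's own `#!/bin/bash`.

```shell
# crontab fragment: cron defaults SHELL to /bin/sh; pin it to bash so that
# "sudo -s" inside the verification script behaves as it does interactively.
SHELL=/bin/bash
0 2 * * 0 /usr/local/bin/verify-borg-backups
```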
Ian Wienand
e5a2354451 borg-backup-server: fix verification run
&>> is a bashism and not supported by sh, which cron runs the jobs
under.  Use >> instead.

Change-Id: I8e67f466887070fb1dedc403c53227c3ce1b2f1d
2021-03-17 15:09:57 +11:00
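The portable form can be demonstrated: bash's `cmd &>> file` appends both streams, but under plain sh the equivalent is `cmd >> file 2>&1`. A small helper to illustrate (the function name is illustrative):

```shell
#!/bin/bash
# sh-compatible replacement for bash's "&>>": append both stdout and stderr
# to a log file with ">> file 2>&1" (order matters: "2>&1" must come after
# the ">>" so stderr follows stdout into the file).
log_both() {
    local logfile="$1"; shift
    "$@" >> "$logfile" 2>&1
}
```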
Ian Wienand
ece90fb7f7 borg-backup-server: make sure to append verification logs
We don't want to overwrite every run, but rather append to the log
file.

Change-Id: I304caedecbf6a9552f314636ca82a543ef16a8b6
2021-02-15 14:45:03 +11:00
Ian Wienand
0d01d941b1 borg-backup-server: run a weekly backup verification
This checks the backup archives and alerts us if anything seems wrong.
This will take a few hours, so we run once a week.

Change-Id: I832c0d29a37df94d4bf2704c59bb3f8d855c3cc8
2021-02-11 00:43:16 +00:00
Ian Wienand
62801d8a93 borg-backup-server: volume space monitor
Due to backups running in append-only mode, we do not have a way to
safely prune backups automatically.  To reduce the likelihood that we
forget about backups and end up with failing jobs, add a cron job to
send an email to infra-root if the backup partition goes over 90%
usage.  At this point a manual prune should be run
(I9559bb8aeeef06b95fb9e172a2c5bfb5be5b480e).

Change-Id: I250d84c4a9f707e63fef6f70cfdcc1fb7807d3a7
2021-02-09 11:31:02 +11:00
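A minimal sketch of the 90% threshold check; the partition path and the mail step are placeholders (echo stands in for the email to infra-root).

```shell
#!/bin/bash
# Hedged sketch of the usage alert. In the real job the percentage would
# come from something like: df --output=pcent /opt/backups | tail -1
check_backup_usage() {
    local pct="${1%\%}"   # strip a trailing % if present
    if [ "$pct" -gt 90 ]; then
        # Placeholder for mailing infra-root.
        echo "WARNING: backup volume at ${pct}% - run a manual prune"
    fi
}
```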
Ian Wienand
4f0bfa6d9d borg-backup-server: add script for pruning borg backups
This adds a script that performs a manual pruning of backup
directories.

Change-Id: I9559bb8aeeef06b95fb9e172a2c5bfb5be5b480e
2021-02-09 11:29:46 +11:00
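Per-repo, the manual prune boils down to a `borg prune` invocation; `--keep-daily`, `--keep-weekly`, and `--keep-monthly` are real borg options, while the retention counts and path here are illustrative assumptions, not the script's actual policy.

```shell
# Illustrative retention policy; the actual script's keep rules may differ.
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
    /opt/backups/borg-myhost/backup
```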
Ian Wienand
028d655375 Add borg-backup roles
This adds roles to implement backup with borg [1].

Our current tool "bup" has no Python 3 support and is not packaged for
Ubuntu Focal.  This means it is effectively end-of-life.  borg fits
our model of servers backing themselves up to a central location, is
well documented and seems well supported.  It also has the clarkb seal
of approval :)

As mentioned, borg works in the same manner as bup, doing an
efficient backup over ssh to a remote server.  The core of these
roles is the same as the bup-based ones: creating a separate user
for each host and deploying keys and ssh config.

This chooses to install borg in a virtualenv under /opt.  This was
chosen for a number of reasons: firstly, borg's history includes
incompatible updates (although they provide a tool to update
repository formats), so it seems important that we both pin the
version we are using and keep clients and server in sync.  Since we
have a heterogeneous distribution collection we don't want to rely on
the packaged tools, which may differ.  I don't feel like this is a
great application for a container; we actually don't want it that
isolated from the base system, because its goal is to read and copy
data offsite with as little chance of things going wrong as possible.

Borg has a lot of support for encrypting the data at rest in various
ways.  However, that introduces the possibility we could lose both the
key and the backup data.  Really the only thing stopping this is key
management, and if we want to go down this path we can do it as a
follow-on.

The remote end server is configured via ssh command rules to run in
append-only mode.  This means a misbehaving client can't delete its
old backups.  In theory we can prune backups on the server side --
something we could not do with bup.  The documentation has been
updated but is vague on this part; I think we should get some hosts in
operation, see how the de-duplication is working out, and then decide
how we want to manage things long term.

Testing is added; a focal and bionic host both run a full backup of
themselves to the backup server.  Pretty cool, the logs are in
/var/log/borg-backup-<host>.log.

No hosts are currently in the borg groups, so this can be applied
without affecting production.  I'd suggest the next steps are to bring
up a borg-based backup server and put a few hosts into this.  After
running for a while, we can add all hosts, and then deprecate the
current bup-based backup server in vexxhost and replace that with a
borg-based one; giving us dual offsite backups.

[1] https://borgbackup.readthedocs.io/en/stable/

Change-Id: I2a125f2fac11d8e3a3279eb7fa7adb33a3acaa4e
2020-07-21 17:36:50 +10:00
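The ssh command rules mentioned above are typically an authorized_keys forced command; `borg serve --append-only --restrict-to-path` are real borg options, while the path, key material, and comment below are placeholders.

```shell
# ~borg-myhost/.ssh/authorized_keys on the backup server (one line):
# the forced command confines the client to append-only "borg serve",
# so a compromised or misbehaving client cannot delete old backups.
command="borg serve --append-only --restrict-to-path /opt/backups/borg-myhost",restrict ssh-ed25519 AAAA... borg-myhost
```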