We constantly hit timeouts when releasing our mirror volumes, which
leave volumes locked or transactions stuck; recovering from this
requires significant manual intervention. This has been discussed
multiple times, but this short exchange from #openafs probably sums
it up best:
Sep 11 13:32:35 <auristor> The timeout problem is due to the fact
that UV_ReleaseVolume performs multiple RPCs. vos acquires a token
from the cache manager when it starts. it has no method of acquiring
a new token if it expired during an RPC. Therefore, if the token did
expire the remaining RPCs are performed unauthenticated. Without
appropriate permissions the cleanup of the volservers, writing the
updating VL entry will fail.
Sep 11 13:33:59 <auristor> A frequent solution is to deploy a remctld
service which has access to issue vos commands as -localauth and then
use remctld ACLs to restrict the identities of the processes that are
permitted to request the volume release.
Sep 11 14:37:28 <kaduk> Yeah, the -localauth tokens are pretty key
for long-running stuff, at the moment.
Indeed, remctl [1] has been written to be the Kerberos-based remote
control wrapper for AFS. However, it is complex to set up, uses a lot
of Perl, and is unlikely to be familiar to many people (keeping the
pool of people who can help us administer it small). Getting it wrong
seems to be a pretty good vector for remote exploits. It does not
seem to be a good fit.
However, we can take a simpler approach. We can use Ansible to set up
our AFS server to allow a particular key to run a release script that
wraps "vos release -localauth" for us. With this in place, we can
update the scripts that run on mirror-update to ssh to the AFS server
and call this script, rather than calling "vos release" directly.
This change implements basic support for the remote script. A new key
will be generated on mirror-update.opendev.org and it will be allowed
to run the vos_release.sh script remotely; the script filters the
command so that it can only run "vos release -localauth".
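
As a rough illustration of the filtering idea (a minimal sketch, not
the vos_release.sh added by this change): assuming the key is pinned
in authorized_keys with a "command=" option, sshd places whatever the
caller asked to run in SSH_ORIGINAL_COMMAND, and the wrapper can
refuse anything that is not a plain release request. The function
name extract_volume, the accepted request form "vos release <volume>"
and the volume-name character set are all assumptions made for this
example.

```shell
#!/bin/sh
# Hypothetical sketch of a forced-command wrapper; NOT the actual
# vos_release.sh from this change. Assumes it is installed through an
# authorized_keys "command=" option, so sshd passes the caller's
# requested command line in SSH_ORIGINAL_COMMAND.

# Print the volume name if the request is exactly "vos release <volume>";
# otherwise print an error to stderr and return non-zero.
extract_volume() {
    request="$1"
    # Strip the expected prefix; if nothing was stripped, it was missing.
    volume="${request#vos release }"
    if [ "$volume" = "$request" ]; then
        echo "vos_release: rejected request: $request" >&2
        return 1
    fi
    # Assumption: volume names start with a letter or digit and contain
    # only letters, digits, dots, underscores and dashes. This also
    # rejects extra options, spaces and shell metacharacters.
    case "$volume" in
        "" | [!A-Za-z0-9]* | *[!A-Za-z0-9._-]*)
            echo "vos_release: rejected volume name: $volume" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$volume"
}

# Only act when invoked through sshd with a requested command.
if [ -n "${SSH_ORIGINAL_COMMAND:-}" ]; then
    volume=$(extract_volume "$SSH_ORIGINAL_COMMAND") || exit 1
    # -localauth uses the server's own key, so there is no token to
    # expire part-way through a long release.
    exec vos release "$volume" -localauth
fi
```
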
After we have tested this, we can start using it in scripts. I think
time will tell if we need locking or other features; this seems like
the KISS place to start.
[1] https://www.eyrie.org/~eagle/software/remctl/remctl.html
Change-Id: I6c96f89c6f113362e6085febca70d58176f678e7