37 Commits

Author SHA1 Message Date
Antony Messerli
ad1a4bc9ef Force filesystem type on swift format
It currently seems to think that /dev/vmvg00/disk1
is used for btrfs, so force this operation to
ensure it's changed to xfs.

Change-Id: I0bcc9723fb33b557315422c3259a7ba2b75ceff6
2018-09-05 13:13:22 -05:00
Jesse Pretorius
868a559840 MNAIO: Only run systemd daemon_reload when necessary
When the VMs are Ubuntu Trusty, this task causes total failure.
We should only try to do the daemon_reload if the system being
used supports it.

Change-Id: I557856045a7735c8f351df6350f777caae526b10
2018-09-04 19:23:42 +01:00
Jesse Pretorius
4e9c1c5fd8 MNAIO: Make the galera image prep more robust
Unfortunately guestfish may error out silently (no return code
of 1), making hunting down the error a bit obscure. To combat
this we add a bunch of stdout output to the script, and look
for that final step to validate success. To make this work, we
need to copy the script over and execute it with the command
module, because the script module puts everything into stderr.
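
A minimal sketch of the "look for the final step" check described above, written in Python rather than as the Ansible task itself. The marker text and function names are hypothetical; the commit does not state what the script prints.

```python
import subprocess

# Hypothetical marker: the real script's final message is not given here.
SUCCESS_MARKER = "image prep complete"


def prep_succeeded(stdout: str, marker: str = SUCCESS_MARKER) -> bool:
    """Return True only if the last non-empty stdout line is the marker."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == marker


def run_prep_script(path: str) -> bool:
    """Run the script and judge success by its output, not only its exit code."""
    result = subprocess.run([path], capture_output=True, text=True)
    return result.returncode == 0 and prep_succeeded(result.stdout)
```

Checking the last line rather than any line means a script that dies part-way through cannot pass by accident.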

Change-Id: I8e514ceb2462870721745c9445ec149864a45f4d
2018-09-04 19:19:50 +01:00
Jesse Pretorius
e7387a6baa MNAIO: Make galera startup cleanly when using images
In an ideal state, if the galera containers are shut down
cleanly, they will leave behind a gvwstate.dat file on each
node which provides the cluster member details so that it
can automatically start up again without intervention.

However, when imaging the MNAIO systems we only interact
with the hosts, so the galera containers sometimes do not
shut down cleanly.

To cater for this, we inspect the disk images for the
primary component, then build the gvwstate.dat file for
the other galera containers. With those put back into the
image, when the VM's start, the cluster forms immediately.
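
As an illustration only, the rebuild step could look like the Python sketch below. It assumes the gvwstate.dat layout shown in the Galera documentation referenced in this commit (a `my_uuid:` line followed by a `#vwbeg` ... `#vwend` view block). The function name and the placeholder UUIDs are hypothetical, and the actual patch may build the file differently.

```python
def gvwstate_for_node(primary_state: str, node_uuid: str) -> str:
    """Reuse the primary's view block, swapping in this node's own UUID."""
    out = []
    for line in primary_state.splitlines():
        if line.startswith("my_uuid:"):
            line = f"my_uuid: {node_uuid}"
        out.append(line)
    return "\n".join(out) + "\n"


# Placeholder content; real files carry full UUIDs and a real view_id.
primary = (
    "my_uuid: uuid-a\n"
    "#vwbeg\n"
    "view_id: 3 uuid-a 2\n"
    "bootstrap: 0\n"
    "member: uuid-a 0\n"
    "member: uuid-b 0\n"
    "#vwend\n"
)
print(gvwstate_for_node(primary, "uuid-b"))
```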

References:
http://galeracluster.com/documentation-webpages/pcrecovery.html
http://galeracluster.com/documentation-webpages/restartingcluster.html

Change-Id: Icfe067607baefd661147f3c22ce846f06fff7c60
2018-09-01 19:20:38 +00:00
Jesse Pretorius
0e8423d536 MNAIO: Make post-install storage provisioning idempotent
The swift and cinder hosts do not use containers for services,
so there is no need to do the current process of shrinking the
volumes. Instead, we ensure that the lxc & machines mounts are
removed, with their respective logical volumes.

When setting up the swift logical volumes, we do not need to
create the mount point directories, because the mount task will
do that for us. As such, we remove that task.

Change-Id: Ibbe6d0fede6b6965415e421161354e311708d113
2018-08-29 17:12:43 +01:00
Jesse Pretorius
7f39e408e3 MNAIO: Inject the host ssh public key into the image
To allow a downloaded set of file-backed images to be used on
another host, the new host's public ssh key needs to be injected
into the VM disks so that ansible is able to connect to it and
complete the rest of the preparation.

Change-Id: I6b9b5efb88283417c15f74f40cfb91943bb8774d
2018-08-29 13:34:42 +01:00
Jesse Pretorius
934a3c2651 MNAIO: Default vm_use_snapshot in group_vars
Rather than have to default it in tasks all over the
place, we default it in group_vars. The default is to
enable the feature if file-backed VMs are used.

However, if there are no base images available, the
set_fact task disables it. If a user wishes to force
it not to be used, then an extra-var override is still
usable.
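
The precedence described above can be sketched in Python as follows. This is not the group_vars or set_fact code itself; the function and parameter names are hypothetical, with `override` standing in for an extra-var.

```python
from typing import Optional


def resolve_vm_use_snapshot(
    disk_mode: str,
    base_images_available: bool,
    override: Optional[bool] = None,
) -> bool:
    """Work out whether snapshots are used, highest precedence first."""
    # An explicit user override (extra-var) always wins.
    if override is not None:
        return override
    # Default: enabled only for file-backed VMs, and only when
    # base images actually exist to snapshot from.
    return disk_mode == "file" and base_images_available
```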

Change-Id: I5c916244a02a44da831d2a0fefd8e8aafae829b2
2018-08-29 13:23:40 +01:00
Jesse Pretorius
42189e272f MNAIO: Use discard option for all mount points
Using the discard option for all mount points ensures
that the deletes actually release the blocks on the disk.
This ensures that SSD performance is optimised and that
file-backed images are kept as small as possible.

Change-Id: I648cbaca56d75e355cf6c8af01e2e3ad20dfc398
2018-08-23 18:57:02 +01:00
Jesse Pretorius
5ce798b360 MNAIO: Use images subdirectory for VM images
Instead of putting the images in the root of the disk,
we use a subdirectory. This prevents silly mistakes
from happening.

Change-Id: I19d22b7e72de88736db410a771ec22664c641c94
2018-08-21 18:30:28 +01:00
Jesse Pretorius
9488e76bf1 MNAIO: Ensure vm_use_snapshot is defined
When not using a file-backed backing store, the variable is not
defined and results in an error to that effect.

Change-Id: I3142a5960bc4521f79bbdfe32b0e7a0f71742b7d
2018-08-17 10:27:26 +01:00
Jesse Pretorius
993bac94f5 MNAIO: Extend image saving to include manifest
In order to more successfully reproduce an environment using
saved images, we include the VM XML definition files and the
output from 'pip freeze'. We capture the list of files, their
checksums and the SHA for the git repo into a json manifest
file.
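
A rough Python sketch of such a manifest, for illustration. The commit does not say which checksum algorithm or JSON keys are used, so SHA-256 and the `git_sha` / `files` keys are assumptions, as are the function names.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large disk images are not read into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(image_dir: Path, git_sha: str) -> str:
    """Return a JSON manifest of file names, checksums and the repo SHA."""
    files = {
        p.name: sha256_of(p)
        for p in sorted(image_dir.iterdir())
        if p.is_file()
    }
    return json.dumps({"git_sha": git_sha, "files": files}, indent=2, sort_keys=True)
```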

Change-Id: Ia0bf74d509b4acb10b0dd832a4cfe1bb2afb2503
2018-08-15 19:25:12 +01:00
Jesse Pretorius
484059205a MNAIO: Enable using a data disk for file-backed VMs
In order to make use of a data disk, we enable the 'file'
implementation of default_vm_disk_mode to use a data disk
much like the 'lvm' implementation.

To simplify changing from the default_vm_disk_mode of lvm
to file and back again, the setup-host playbook will remove
any previous implementation and replace it. This is useful
when doing testing for these different modes because it
does not require cleaning up by hand.

This patch also fixes the implementation of the virt
storage pool. Currently the tasks only execute if
'virt_data_volume.pools is not defined', but it is always
defined so the tasks never execute. We now ensure that
for both backing stores the 'default' storage pool is
defined, started and set to auto start (as three tasks
because the virt_pool module sucks really bad and can only
do one thing at a time).

The pool implementation for the 'file' backed VMs uses
the largest data disk it can find and creates the /data
mount for it. To cater for a different configuration, we
ensure that all references to the disk files use the path
that is configured in the pool, rather than assuming the
path.

Change-Id: If7e7e37df4d7c0ebe9d003e5b5b97811d41eff22
2018-08-15 16:58:50 +01:00
Jesse Pretorius
4a48a6874d Optimise vm_disk_mode conditionals
There is already a default in group_vars/all, so we do not need
to provide a default in every conditional.

Also, we move several LVM data volume tasks into a block given
they have a common set of conditions.

Change-Id: Iff0fafefda2bc5dc1596b7198b779f5da763086c
2018-08-15 08:49:23 +00:00
Antony Messerli
ef560373e3 Undefine existing VM configurations during rebuild
Allows for configuration changes to be redeployed on a
rebuild where previously it didn't attempt to update
the VMs configuration.

Change-Id: If14dbdfe7ba3e69a50127fa724ad3f2a8ed58c1a
2018-07-23 13:59:27 -05:00
Corey Wright
bdeeb39e42 mnaio: Correct LVM terminology in task names
Change-Id: I23f1a245f30f45dc66b6f2ae0ff2ee5aab147dd8
Signed-off-by: Corey Wright <corey.wright@rackspace.com>
2018-07-16 21:15:39 -05:00
Corey Wright
f21bc66671 mnaio: Only resize Swift & Cinder machines00 LV when using nspawn
Commit 875fa96f / change-id Ief0040f6 unintentionally tries to enlarge
the "machines00" LV when LXC is the default container technology which
fails due to the Debian automated installation having assigned all the
space within the associated "vmvg00" VG.

As the intention of the aforementioned commit was to apply when
systemd-nspawn was used, codify that explicitly in a `when:` condition
on the problematic Ansible task.

Change-Id: I56ec1290d71d0d09db447e347d7d55432d9b81c6
Signed-off-by: Corey Wright <corey.wright@rackspace.com>
Closes-Bug: #1781823
2018-07-15 15:46:56 -05:00
Antony Messerli
875fa96fb8 Make space for Cinder/Swift nodes if using nspawn
Currently, if CONTAINER_TECH=nspawn is used, Cinder and Swift
are unable to create volumes as space is fully allocated for
the machines volume.

This shrinks machine00 mount to 8192 to make space for Cinder
and Swift volumes when using nspawn for the container tech.

Change-Id: Ief0040f638f0d3570557ac76fd5e0a8aee80df8d
2018-07-11 15:54:09 -05:00
Dave Wilde
482e845d92 Improve multi-node AIO robustness
In order to improve the readability and robustness of the mnaio feature
I have replaced the shell out to virsh tasks to use the virt module
where available.  I have also created a vm-status play that will
hopefully help resolve SSH failures into the VMs.  This play utilizes
the block/rescue/handler pattern to attempt to restart the VM once if
it fails the initial SSH check.  Hopefully this will reduce the SSH
failures due to a stuck VM.  This adds a new variable called
vm_ssh_timeout which allows the deployer an easy place to override the
default timeout.  The python-lxml package is needed for the virt module.
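
The restart-once behaviour described above, sketched in Python rather than as the Ansible block/rescue itself. The two callables are hypothetical stand-ins for the SSH check and the VM restart.

```python
from typing import Callable


def ensure_vm_reachable(
    check_ssh: Callable[[], bool],
    restart_vm: Callable[[], None],
) -> bool:
    """Check SSH; on failure restart the VM once and check again."""
    if check_ssh():
        return True
    # Rescue path: a single restart, then one final check.
    restart_vm()
    return check_ssh()
```

Limiting the rescue to one restart keeps a genuinely broken VM from looping forever.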

Change-Id: I027556b71a8c26d08a56b4ffa56b2eeaf1cbabe9
2018-06-29 10:12:16 -05:00
Jesse Pretorius
0cd5c1704f MNAIO: Check capabilities only once
The capabilities check is done on the host, so it only needs
to be executed once, not once for every VM on the host. This
patch eliminates the duplicated checking.

Change-Id: I2bc7ebbe699e5ace82c1bcbdfd8e917661054fef
2018-06-26 11:54:16 +01:00
Jesse Pretorius
329aa472f2 MNAIO: Enable saving and re-using file-based backing images
Being able to save the images and re-use them on other hosts
is extremely useful to cut down deployment time. This patch
allows an MNAIO setup to be setup using a file-based backing
store, then have those saved and re-used on the same host or
on other hosts.

Change-Id: I491d04fb94352e37312891a9b9bd58093fdd00cf
2018-06-26 11:54:16 +01:00
Jesse Pretorius
bc2ced27c2 MNAIO: Ensure a consistent and readable style
This patch implements the following style changes:

1. The 'environment' argument is placed in the same
   location for all plays, making sure it's easier
   to find.
2. The play tags are located in the same place, also
   making sure they're easier to find.
3. The line breaks between tasks and plays are set
   to be consistently 1 between tasks and 2 between
   plays.
4. Given that there are no roles being used, the use
   of pre/post tasks is converted to only using tasks.

Change-Id: I2e22c8360d65256b8e44ca1e310e0668a651196d
2018-06-26 11:54:16 +01:00
Zuul
fade8fa287 Merge "Increase VM SSH timeouts" 2018-06-15 21:09:04 +00:00
Jesse Pretorius
80b4efab8a MNAIO: Use file instead of command for disk image clean-up
Rather than use the command module, we use the file module
which is idempotent. We also remove the conditional for the
clean up so that it's easier to switch between back-ends
and still have the clean-up happen.

Change-Id: I5d77bbd236d1cb7d45bc4dda4206475b5663b1c0
2018-06-15 16:48:22 +01:00
Dave Wilde
232be485cb Increase VM SSH timeouts
Occasionally the VM check times out; this relaxes the checks a bit to
decrease failures.

Change-Id: Ic327efb7a86b20bc3f97f3ced2fe3fc54b93d347
2018-06-15 10:35:14 -05:00
Bjoern Teipel
d625057b7a Adding mnaio qemu file backends
mnaio uses LVM as the default VM disk backend, but that
requires a lot of available disk space.
An alternative option, the qcow2 file-based backend,
is added to take advantage of thin provisioning.
This backend can be enabled by setting the override
`default_vm_disk_mode` to `file`.

Change-Id: Iaf97fef601f656901b6913eaafb9a6c28d4b7ba6
2018-03-23 21:28:38 -05:00
Shannon Mitchell
4cbb0d8b98 Fix SSH connection issues on openstack-ansible-ops mnaio builds
We normally see ssh connection issues during the lxc container setup
portion of OSA builds. Most people end up tweaking the ansible ssh
pipeline and retry settings, or nerfing the build by lowering the ansible
forks, to work around it. This is an old issue that we normally fix more
permanently in our physical environments by setting the ssh maxsessions
and maxstartups. On the mnaio builds I have been working around this by
stopping the build before deployment and making the changes in a script.

Change-Id: I54c223e1fb9edf6947bc7f76ff689bad22456420
Closes-Bug: 1752914
2018-03-02 09:54:34 -06:00
Antony Messerli
33c3fbdfe7 Stop any running VMs when re-deploying
Stops running VMs when doing a deploy to ensure
the VMs start fresh and can reload their config
during deploy.

Also removes LVs to force a redeploy of the VMs.

Change-Id: I7992e25f4e0e103ae66487f2e88a99ca962a9355
2018-02-07 13:19:04 -06:00
Antony Messerli
4536c0a9ef Abort run if VMs aren't up before timeout
Occasionally the VM install will exceed the timeout
if it doesn't fire correctly.  Instead of treating
the host as down and continuing on with the others,
fail early.

Change-Id: I543d8e354a5357f7059fe82497edb9b7e3a22097
2017-11-02 11:49:20 -05:00
Matt Thompson
ce29ea23d1 Updates for Trusty VMs
Currently, attempting to use Trusty (14.04) VMs causes VMs to not
provision correctly due to a grub-install error.  With respect to this
specific issue, this commit updates vm.preseed.j2 by removing some
grub-installer options which were not present before the ansible
rewrite.

Secondly, with that change in place, VMs do not come online on their
10.0.236 addresses as something is overwriting
/etc/network/interfaces, which wipes out the source of the
/etc/network/interfaces.d directory.  Bug [1] seems to indicate this
is in fact an issue and has been resolved, however attempts at using
this preseed option (netcfg/target_network_config) were not successful.
As a workaround, we simply chattr +i the interfaces file in
vm-post-install-script.sh.j2, and then remove the attr in
deploy-vms.yml when the instance is up and accessible.

[1] https://bugs.launchpad.net/ubuntu/+source/netcfg/+bug/1361902

Change-Id: I12d0c5108d1df0ab02b69d1b8cdb271a02999602
2017-09-26 08:52:38 -04:00
Matt Thompson
28684e6c6e Add missing features to multi-node-aio
The multi-node-aio update that moved the provisioning from bash to
ansible dropped a few features that we use for gating purposes.  This
commit re-adds the following:

1. The ability to drop iptables rules to do port redirection from the
   host to private IPs.  This is controlled by CONFIG_PREROUTING and
   the ansible variable mnaio_host_iptables_prerouting_ports.
2. /etc/hosts on the physical node is now updated w/ the hostname and
   IP of each VM so we can access VMs by name.

NOTE: With #1, we redirect to the VM's DHCP address, and not its
      management address.  The latter seemed to be the desired address
      but didn't work, which is why we've resorted to DHCP.  If using
      this address is incorrect please note so we can investigate
      further.

Change-Id: Ib194c314280f2474a2e4dac6d0feba44b1ee696f
2017-09-13 11:47:25 -04:00
Matt Thompson
815ac51249 Wait for guest capabilities to appear
Deploying the multi-node-aio from master on a machine running Ubuntu
14.04 fails frequently as libvirt doesn't think it has the hvm
OS type.  I was able to manually run "virsh capabilities" shortly after
libvirt was installed and sure enough it didn't list any guest
capabilities.  Subsequent runs of "virsh capabilities" then returned
the <guest> XML element w/ <os_type>hvm</os_type> defined.

This commit simply adds a task that checks "virsh capabilities",
retrying up to 6 times if the <guest> element is not present.  From my
limited testing this seems sufficient to ensure that the domains are
defined and created successfully.
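
The check can be sketched in Python as below. This is an illustration, not the Ansible task from the patch. The retry count of 6 comes from the text above; the delay and the `fetch` callable (a stand-in for running "virsh capabilities") are assumptions.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable


def has_hvm_guest(capabilities_xml: str) -> bool:
    """True if any <guest> element declares the hvm OS type."""
    root = ET.fromstring(capabilities_xml)
    return any(g.findtext("os_type") == "hvm" for g in root.iter("guest"))


def wait_for_guest(
    fetch: Callable[[], str],
    retries: int = 6,
    delay: float = 5.0,  # assumed value; not stated in the commit
) -> bool:
    """Poll the capabilities output until a guest appears or retries run out."""
    for attempt in range(retries):
        if has_hvm_guest(fetch()):
            return True
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```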

Lastly, we add a task to create /etc/libvirt/storage which is expected
to exist, but doesn't on a 14.04 deployment.

Change-Id: I158987270b71d3781e91d819fdcb02da736f3c1d
2017-09-06 15:42:51 -04:00
Kevin Carter
c678e83275 General improvements
* added ip alias for interfaces

* Update settings and improve vm performance

* Change the VG name in VMs. The VG name was changed so that the volume
  which is being used by VMs can be mounted on a physical host, and not
  conflict with standard volume group naming. This is useful when a VM
  is DOA and a deployer wants to dissect the instance.

Change-Id: If4d10165fe08f82400772ca88f8490b01bad5cf8
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2017-08-14 10:35:12 -05:00
Kevin Carter
369f68832e Add environment options and re-flow the README.rst
Change-Id: I7a2640856045e36043de8508f9421fbd8a593591
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2017-08-01 09:13:15 -05:00
Kevin Carter
cfc76ded4a Convert vars in files to host_vars
This change allows the MNAIO to really be used as a stand alone kick
system which has the potential to be developed into a stand alone
project. At the very least this change improves playbook performance
by scoping variables.

The inventory has been converted into a typical Ansible inventory and
the "servers" used in the MNAIO are now simply host_vars
which will trigger specific VM builds when instructed to do so. This
gives the MNAIO the ability to serve as a stand alone kick system which
could be used for physical hosts as well as MNAIO testing all through
the same basic set of playbooks. Should a deployer want to use this with
physical servers they'd need to do nothing more than define their basic
inventory and the location of the infrastructure needed to PXE boot
their machines.

Change-Id: I6c47e02ecfbe8ee7533e77b11041785db485a1a9
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2017-07-31 23:31:13 -05:00
Kevin Carter
9abaeba8c8 move deploy node to infra1 and make a LB node
Sadly the log node does not have enough ram to run a full ansible run.
Ansible 2.x requires more ram than one would expect, especially when the
inventory gets large. This change moves the deploy node to infra1 as it
will already have the ram needed to run the playbooks. Additionally the
container storage for infra nodes was too small which forces builds into
error. The default storage for VMs has been set to 90GiB each and the
preseed will create a logical volume for VMs mounted at /var/lib/lxc.

While the limited ram works well for the VMs and within a running
deployment of OSA, ansible-playbook is subject to crash like so:

  An exception occurred during task execution. To see the full traceback,
  use -vvv. The error was: OSError: [Errno 12] Cannot allocate memory
  fatal: [infra1_cinder_api_container-b38b47ea]: FAILED! =>
  {"failed": true, "msg": "Unexpected failure during module execution.", "stdout": ""}

So infra nodes have had the memory constraint raised to 8GiB

Change-Id: I7175ea92f663dfef5966532cfc0b4beaadb9eb03
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2017-07-29 15:52:36 -05:00
Kevin Carter
c3800224b0 move drive layout to deploy-vms and fix deploy-osa tags/tasks
Change-Id: Iaee4c3683d798320099dec77286e6fac7a10bee8
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2017-07-28 14:49:27 -05:00
Kevin Carter
a94f0a9026 Combine our two multi-node-aio processes into one
The original mnaio was built using a lot of bash and was tailored
specifically for ubuntu 14.04. The new mnaio was built using a mix of
bash and ansible and was tailored specifically for ubuntu 16.04. This
patch takes the two code bases and combines the best things from each
method and wraps it up into a single code path all written using ansible
playbooks and basic variables.

While the underlying system has changed, the bash environment variable
syntax for overrides remains the same. This allows users to continue
with what has become their normal work-flow while leveraging the new
structure and capabilities.

High level overview:
  * The general performance of the VMs running within the MNAIO will now
    be a lot better. Before, the VMs were built within QCOW2 containers;
    while this was flexible and portable, it was slower. The new
    capabilities will use RAW logical volumes and native IO.
  * New repo management starts with preseeds and allows the user to pin
    to specific repositories without having to worry about flipping them
    post build.
  * CPU overhead will be a lot less. The old VM system used an
    unreasonable number of processors per VM which directly translated
    to sockets. The new system will use cores and a single socket
    allowing for generally better VM performance with a lot less
    overhead and resource contention on the host.
  * Memory consumption has been greatly reduced. Each VM is now
    following the memory restrictions we'd find in the gate, as a MAX.
    Most of the VMs are using 1 - 2 GiB of RAM which should be more than
    enough for our purposes.

Overall the deployment process is simpler and more flexible and will
work on both trusty and xenial out of the box with the hope to bring
centos7 and suse into the fold some time in the future.

Change-Id: Idc8924452c481b08fd3b9362efa32d10d1b8f707
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2017-07-28 15:35:23 +00:00