
Using Ansible to update images
==============================

This is a new approach to updating an in-place TripleO cloud with new
images. We have chosen Ansible as it allows fine-grained control of
the workflow without requiring one to write any idempotent bash or
python. There are components that are bash or python scripts, and we are
working hard not to replace the whole of TripleO with Ansible, but just
the pieces that make updates more complicated than they need to be.

In general, this update process works in the following manner:

* Gather inventory and facts about the deployed cloud from Heat and Nova
* Quiesce the cloud by shutting down all OpenStack services on
  appropriate nodes
* Rebuild nodes via Nova using the requested image IDs
* Disable os-collect-config polling of Heat
* Push metadata from Heat to rebuilt nodes using Ansible and manually
  trigger os-collect-config
* Start OpenStack services

Installing Ansible
------------------

Please see the `ansible` element in `tripleo-image-elements`.

The following patches are required for operation:

* Add nova metadata for group (openstack/tripleo-heat-templates) -
  https://review.openstack.org/#/c/113358/2 - This heat template update
  labels instances such that the ansible tools can group the instances
  into groups to facilitate the updates.
* Element to restore ssh keys from /mnt/state
  (openstack/tripleo-image-elements) -
  https://review.openstack.org/#/c/114360/ - This includes a new image
  element, named restore-ssh-host-keys, which is intended to restore host
  keys preserved by the ansible scripts after a reboot.

To make things simpler, you may want to add tripleo-ansible to /opt/stack
on the seed and/or undercloud. We include elements/tripleo-ansible,
which can be included in seed and undercloud image builds to allow the
tripleo-ansible tools to be automatically deployed for use.

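For illustration, a minimal sketch of including the element in an image
build with diskimage-builder might look like the following; the output
image name and the rest of the element list are placeholders, and this
assumes tripleo-ansible is checked out under /opt/stack:

::

    export ELEMENTS_PATH=$ELEMENTS_PATH:/opt/stack/tripleo-ansible/elements
    disk-image-create -o my-image <existing elements> tripleo-ansible
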
Executing Scripts and Playbooks
-------------------------------

All Ansible playbooks and scripts have been written to be run directly
from the tripleo-ansible folder.

An ``ansible.cfg`` file is provided. Note that if you have a system-wide
(/etc/ansible/ansible.cfg) or user-specific (~/.ansible.cfg) Ansible
config file, Ansible will not utilize the provided configuration file.

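If you do have one of those files and still want the repository's
settings to apply, a simple workaround is the ANSIBLE_CONFIG environment
variable, which takes precedence over the other configuration file
locations:

::

    ANSIBLE_CONFIG=./ansible.cfg ansible-playbook ...
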
Pre-flight check
----------------

A playbook exists that can be used to check the controllers prior to the
execution of the main playbook in order to quickly identify any issues in
advance.

All controller nodes must be in a healthy state (ACTIVE) for the pre-flight
checks to pass. We **CANNOT** proceed with an update if a controller node is
down.

::

    ansible-playbook -vvvv -M library/cloud -i plugins/inventory/heat.py -u heat-admin playbooks/pre-flight_check.yml

Running the updates
-------------------

You will want to set your environment variables to the appropriate
values for the following: OS_AUTH_URL, OS_USERNAME, OS_PASSWORD, and
OS_TENANT_NAME.

::

    source /root/stackrc

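As an optional sanity check, you can confirm the variables are set
before proceeding:

::

    env | grep -E 'OS_(AUTH_URL|USERNAME|PASSWORD|TENANT_NAME)'
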
Your new images will need to be uploaded to glance, such that an instance
can be booted from them, and the image ID will need to be provided to
the playbook as an argument.

You can obtain the IDs with the `glance image-list` command, and then
pass them into ansible as arguments.

::

    glance image-list

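If you want to script this step, one possible approach is to capture an
ID into a shell variable; the image name matched here is only an example
and should be replaced with the name your image was uploaded under:

::

    CONTROLLER_IMAGE_ID=$(glance image-list | awk '/my-controller-image/ {print $2; exit}')
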
It may be possible to infer the image IDs using the script
"populate_image_vars". It will try to determine the latest image for
each image class and set it as a group variable in inventory.

::

    scripts/populate_image_vars

After it runs, inspect `plugins/inventory/group_vars` and, if the data
is what you expect, you can omit the image IDs from the ansible command
line below.

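The group variables are written as plain YAML. A hypothetical example of
what you might see when inspecting them (the file name and image ID here
are illustrative):

::

    $ cat plugins/inventory/group_vars/nova-compute
    nova_compute_rebuild_image_id: 1ae9fe6e-c0cc-4f62-8e2b-1d382b20fdcb
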
You will now want to utilize the image ID values observed in the previous
step, and execute the ansible-playbook command with the appropriate values
substituted into place. The current variables for passing the image IDs
in are nova_compute_rebuild_image_id and controller_rebuild_image_id,
which are passed into the chained playbook.

::

    ansible-playbook -vvvv -u heat-admin -i plugins/inventory/heat.py -e nova_compute_rebuild_image_id=1ae9fe6e-c0cc-4f62-8e2b-1d382b20fdcb -e controller_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc -e swift_storage_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc -e vsa_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc playbooks/update_cloud.yml

If you have set the image IDs in group vars:

::

    ansible-playbook -vvvv -u heat-admin -i plugins/inventory/heat.py playbooks/update_cloud.yml

Below, we break down the above command so you can see what each part does:

* -vvvv - Make Ansible very verbose.
* -u heat-admin - Utilize the heat-admin user to connect to the remote
  machine.
* -i plugins/inventory/heat.py - Sets the inventory plugin.
* -e nova_compute_rebuild_image_id=1ae9fe6e-c0cc-4f62-8e2b-1d382b20fdcb -
  Sets the compute node image ID.
* -e controller_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc -
  Sets the controller node image ID.
* -e swift_storage_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc -
  Sets the swift storage node image ID.
* -e vsa_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc - Sets the
  vsa node image ID.
* playbooks/update_cloud.yml - The path to the ansible playbook that will
  be executed.

Upon successful completion, ansible will print a summary report:

::

    PLAY RECAP ********************************************************************
    192.0.2.24                 : ok=18   changed=9    unreachable=0    failed=0
    192.0.2.25                 : ok=19   changed=9    unreachable=0    failed=0
    192.0.2.26                 : ok=18   changed=8    unreachable=0    failed=0

Additionally:

As ansible utilizes SSH, you may encounter ssh host key errors if an IP
address has been re-used. The fact that SSH keys aren't preserved is a
defect that is being addressed. In order to avoid problems while this
defect is being fixed, you will want to set the environment variable
ANSIBLE_HOST_KEY_CHECKING=False, as in the example below.

::

    ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -vvvv -M library/cloud -i plugins/inventory/heat.py -e controller_rebuild_image_id=4bee1a0a-2670-48e4-a3a4-17da6be795cb -e nova_compute_rebuild_image_id=bd20e098-0753-4dc8-8dba-2f739c01ee65 -u heat-admin playbooks/update_cloud.yml

Python, the language that ansible is written in, buffers IO output by
default. This can be observed as long pauses between sudden bursts of
log entries covering multiple steps, particularly when executed by
Jenkins. This behavior can be disabled by setting the environment
variable PYTHONUNBUFFERED=1, as in the example below.

::

    PYTHONUNBUFFERED=1 ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -vvvv -M library/cloud -i plugins/inventory/heat.py -e controller_rebuild_image_id=4bee1a0a-2670-48e4-a3a4-17da6be795cb -e nova_compute_rebuild_image_id=bd20e098-0753-4dc8-8dba-2f739c01ee65 -u heat-admin playbooks/update_cloud.yml

For more information about Ansible, please refer to the documentation at
http://docs.ansible.com/

Failure Handling
----------------

Ansible has tunable options to abort the execution of a playbook upon
encountering a failure.

The max_fail_percentage parameter allows users to define what percentage
of nodes can fail before the playbook stops executing. This setting is
pre-defined in the playbook file playbooks/update_cloud.yml. The default
value is zero, which causes the playbook to abort execution if any node
fails. You can read about this option at:
http://docs.ansible.com/playbooks_delegation.html#maximum-failure-percentage

Additionally, it should be noted that the any_errors_fatal variable, when
set to a value of True, will result in ansible aborting upon encountering
any failures. This variable can be set by adding '-e any_errors_fatal=True'
to the command line, as shown below.

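For example, appending it to the earlier command line:

::

    ansible-playbook -vvvv -u heat-admin -i plugins/inventory/heat.py -e any_errors_fatal=True playbooks/update_cloud.yml
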
Additional Options
------------------

The plugins/inventory/group_vars/all file has the following options in
order to tune the behavior of the playbook execution. These options can
be enabled by defining the variable that they represent on the ansible
command line, or by uncommenting the appropriate line in the
plugins/inventory/group_vars/all file. A combined command-line example
follows the list.

* force_rebuild - This option overrides the logic that prevents an
  instance from being rebuilt if the pre-existing image ID matches the ID
  being deployed. This may be useful for the purposes of testing.
  Example command line addition: -e force_rebuild=True
* wait_for_hostkey - This option causes the playbook to wait for the
  SSH host keys to be restored. This option should only be used if
  the restore-ssh-host-keys element is built into the new image.
* single_controller - This option is for when a single controller node is
  receiving an upgrade. It alters the logic so that mysql checks operate
  as if the mysql database cluster is being maintained online by other
  controller nodes during the upgrade. *IF* you are looking at this option
  due to an error indicating "Node appears to be the last node in a
  cluster", then consult Troubleshooting.rst.
* ssh_timeout - This value, which defaults to 900 seconds, is the maximum
  amount of time that the post-rebuild ssh connection test will wait
  before proceeding.
* pre_hook_command - This, when set to a command, such as /bin/date,
  will execute that command on the host where the playbook is run
  before starting any jobs.
* post_hook_command - Similar to the pre_hook_command variable, when
  defined, will execute upon the completion of the upgrade job.
* online_upgrade - This setting tells the script to attempt an online
  upgrade of the node. At present this is only known to work on compute
  nodes.

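As a hypothetical combined example (the option values here are purely
illustrative):

::

    ansible-playbook -vvvv -u heat-admin -i plugins/inventory/heat.py -e force_rebuild=True -e ssh_timeout=1800 -e pre_hook_command=/bin/date playbooks/update_cloud.yml
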
Online Upgrade
--------------

When an upgrade *does not* require a kernel update, the Online Upgrade
feature can be utilized to upgrade compute nodes while leaving their
virtual machines in a running state. The result is a short, one to two
minute loss of network connectivity for the virtual machines as
os-refresh-config stops and restarts the key services that provide it.

This operation is performed by uploading the new image to the /tmp folder
on the node, syncing the file contents over while preserving key files,
and then restarting services. This is only known to work on compute nodes.

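A sketch of enabling it on the command line (whether you also need to
limit the run to compute nodes depends on your inventory; the group name
and image ID below are assumptions for illustration):

::

    ansible-playbook -vvvv -u heat-admin -i plugins/inventory/heat.py -e online_upgrade=True -e nova_compute_rebuild_image_id=bd20e098-0753-4dc8-8dba-2f739c01ee65 --limit nova-compute playbooks/update_cloud.yml
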
Nova Powercontrol
-----------------

A module named nova_powercontrol has been included which is intended to
utilize nova for all instance power control operations. This utility
module also records the previous state of each instance and has a special
flag which allows the user to resume or restart, upon the completion of
the upgrade, all virtual machines that were powered off/suspended, if the
module was utilized to shut down the instances.

To use it, execute the following command from the tripleo-ansible folder:

::

    bash scripts/retrieve_oc_vars

The script will then inform you of a file you need to source into your
current user environment; it will contain the overcloud API credentials
under modified variable names which the playbook knows how to utilize.

::

    source /root/oc-stackrc-tripleo-ansible

Now that the environment variables are present, add the following to the
ansible-playbook command line for the playbooks to utilize the
nova_powercontrol module:

::

    -e use_nova_powercontrol=True

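A hypothetical complete invocation with the module enabled:

::

    ansible-playbook -vvvv -u heat-admin -i plugins/inventory/heat.py -e use_nova_powercontrol=True playbooks/update_cloud.yml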