
Using Ansible to update images
This is a new approach to updating an in-place TripleO cloud with new images. We chose Ansible because it allows fine-grained control of the workflow without requiring anyone to write idempotent bash or python. Some components remain bash or python scripts; the goal is not to replace the whole of TripleO with Ansible, only the pieces that make updates more complicated than they need to be.
In general this update process works in the following manner:
- Gather inventory and facts about the deployed cloud from Heat and Nova
- Quiesce the cloud by shutting down all OpenStack services on appropriate nodes
- Nova-Rebuild nodes using requested image ids
- Disable os-collect-config polling of Heat
- Push Metadata from Heat to rebuilt nodes using Ansible and manually trigger os-collect-config
- Start OpenStack services
Installing Ansible
Please see the ansible element in tripleo-image-elements
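If you would rather not build Ansible into your images via that element, installing it directly on the machine that will run the playbooks (typically the seed or undercloud) is usually sufficient. A minimal sketch, assuming pip is available on that host:
sudo pip install ansible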
The following patches are required for operation:
- Add nova metadata for group (openstack/tripleo-heat-templates) - https://review.openstack.org/#/c/113358/2 - This heat template update labels instances so that the ansible tools can group them to facilitate the updates.
- Element to restore ssh keys from /mnt/state (openstack/tripleo-image-elements) - https://review.openstack.org/#/c/114360/ - This adds a new image element, named restore-ssh-host-keys, which is intended to restore host keys preserved by the ansible scripts after a reboot.
To make things simpler, you may want to add tripleo-ansible to /opt/stack on the seed and/or undercloud. The elements/tripleo-ansible element can be included in seed and undercloud image builds so that the tripleo-ansible tools are deployed automatically.
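One way to do this is sketched below with a placeholder repository URL, since the correct source may differ in your environment:
git clone <tripleo-ansible-repository-url> /opt/stack/tripleo-ansible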
Pre-flight check
A playbook exists that can be used to check the controllers before the main playbook is executed, in order to quickly identify any issues in advance.
ansible-playbook -vvvv -M library/cloud -i plugins/inventory/heat.py -u heat-admin playbooks/pre-flight_check.yml
Running the updates
You will want to set your environment variables to the appropriate values for the following: OS_AUTH_URL, OS_USERNAME, OS_PASSWORD, and OS_TENANT_NAME
source /root/stackrc
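If you do not have a stackrc file available, the variables can be exported directly; the values below are placeholders and must be replaced with ones matching your deployment:
export OS_AUTH_URL=http://192.0.2.1:5000/v2.0
export OS_USERNAME=admin
export OS_PASSWORD=<your-password>
export OS_TENANT_NAME=admin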
Your new images will need to be uploaded to glance, such that an instance can be booted from them, and the image ID will need to be provided to the playbook as an argument.
You can obtain the ID with the glance image-list command, and then set them to be passed into ansible as arguments.
glance image-list
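For convenience, you may wish to capture the IDs into shell variables and reference those when building the ansible-playbook command later; the IDs below are illustrative:
CONTROLLER_IMAGE_ID=2432dd37-a072-463d-ab86-0861bb5f36cc
NOVA_COMPUTE_IMAGE_ID=1ae9fe6e-c0cc-4f62-8e2b-1d382b20fdcb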
It may be possible to infer the image IDs using the script "populate_image_vars". It will try to determine the latest image for each image class and set it as a group variable in inventory.
scripts/populate_image_vars
After it runs, inspect plugins/inventory/group_vars and if the data is what you expect, you can omit the image ids from the ansible command line below.
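For example, a quick way to review what the script generated (the exact file layout under group_vars may differ):
cat plugins/inventory/group_vars/*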
You will now want to take the image ID values observed in the previous step and execute the ansible-playbook command with the appropriate values substituted into place. The variables currently used to pass the image IDs are nova_compute_rebuild_image_id and controller_rebuild_image_id, which are passed into the chained playbook.
ansible-playbook -vvvv -u heat-admin -i plugins/inventory/heat.py -e nova_compute_rebuild_image_id=1ae9fe6e-c0cc-4f62-8e2b-1d382b20fdcb -e controller_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc -e controllermgmt_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc -e swift_storage_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc -e vsa_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc playbooks/update_cloud.yml
If you have set the image ids in group vars:
ansible-playbook -vvvv -u heat-admin -i plugins/inventory/heat.py playbooks/update_cloud.yml
Below, we break down the above command so you can see what each part does:
- -vvvv - Make Ansible very verbose.
- -u heat-admin - Utilize the heat-admin user to connect to the remote machine.
- -i plugins/inventory/heat.py - Sets the inventory plugin.
- -e nova_compute_rebuild_image_id=1ae9fe6e-c0cc-4f62-8e2b-1d382b20fdcb - Sets the compute node image ID.
- -e controller_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc - Sets the controller node image ID.
- -e controllermgmt_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc - Sets the controllerMgmt node image ID.
- -e swift_storage_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc - Sets the swift storage node image ID.
- -e vsa_rebuild_image_id=2432dd37-a072-463d-ab86-0861bb5f36cc - Sets the vsa node image ID.
- playbooks/update_cloud.yml - The path and file name of the ansible playbook that will be executed.
Upon a successful completion, ansible will print a summary report:
PLAY RECAP ********************************************************************
192.0.2.24 : ok=18 changed=9 unreachable=0 failed=0
192.0.2.25 : ok=19 changed=9 unreachable=0 failed=0
192.0.2.26 : ok=18 changed=8 unreachable=0 failed=0
Additionally:
As ansible utilizes SSH, you may encounter ssh host key errors if an IP address has been re-used. The fact that SSH host keys are not preserved is a defect that is being addressed. To avoid problems while this defect is being fixed, set the environment variable ANSIBLE_HOST_KEY_CHECKING=False, as in the example below.
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -vvvv -M library/cloud -i plugins/inventory/heat.py -e controller_rebuild_image_id=4bee1a0a-2670-48e4-a3a4-17da6be795cb -e nova_compute_rebuild_image_id=bd20e098-0753-4dc8-8dba-2f739c01ee65 -u heat-admin playbooks/update_cloud.yml
Python, the language that ansible is written in, buffers IO output by default. This can be observed as long pauses followed by sudden bursts of log entries, particularly when the playbook is executed by Jenkins. This behavior can be disabled by setting the environment variable PYTHONUNBUFFERED=1, as in the example below.
PYTHONUNBUFFERED=1 ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -vvvv -M library/cloud -i plugins/inventory/heat.py -e controller_rebuild_image_id=4bee1a0a-2670-48e4-a3a4-17da6be795cb -e nova_compute_rebuild_image_id=bd20e098-0753-4dc8-8dba-2f739c01ee65 -u heat-admin playbooks/update_cloud.yml
For more information about Ansible, please refer to the documentation at http://docs.ansible.com/
Failure Handling
Ansible has tunable options to abort the execution of a playbook upon encountering a failure.
The max_fail_percentage parameter allows users to define what percentage of nodes can fail before the playbook stops executing. This setting is pre-defined in the playbook file playbooks/update_cloud.yml. The default value is zero, which causes the playbook to abort execution if any node fails. You can read about this option at: http://docs.ansible.com/playbooks_delegation.html#maximum-failure-percentage
Additionally, the any_errors_fatal variable, when set to True, will cause ansible to abort upon encountering any failure. This variable can be set by adding '-e any_errors_fatal=True' to the command line.
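For example, assuming the image ids have already been set in group vars, a run that aborts on any failure could look like:
ansible-playbook -vvvv -u heat-admin -i plugins/inventory/heat.py -e any_errors_fatal=True playbooks/update_cloud.yml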
Additional Options
The plugins/inventory/group_vars/all file has the following options for tuning the behavior of the playbook execution. These options can be enabled by defining the variable that they represent on the ansible command line, or by uncommenting the appropriate line in the plugins/inventory/group_vars/all file.
- force_rebuild - This option overrides the logic that prevents an instance from being rebuilt if the pre-existing image id matches the id being deployed. This may be useful for testing purposes. Example command line addition: -e force_rebuild=True
- wait_for_hostkey - This option causes the playbook to wait for the SSH host keys to be restored. This option should only be used if the restore-ssh-host-keys element is built into the new image.
- single_controller - This option is for when a single controller node is receiving an upgrade. It alters the logic so that mysql checks operate as if the mysql database cluster is being maintained online by other controller nodes during the upgrade. If you are looking at this option due to an error indicating "Node appears to be the last node in a cluster", then consult Troubleshooting.rst.
- ssh_timeout - This value, which defaults to 900 seconds, is the maximum amount of time that the post-rebuild ssh connection test will wait before proceeding.
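As an illustration, several of these options can be combined on a single command line; the values shown are examples only and assume the image ids are already set in group vars:
ansible-playbook -vvvv -u heat-admin -i plugins/inventory/heat.py -e force_rebuild=True -e wait_for_hostkey=True -e ssh_timeout=1200 playbooks/update_cloud.yml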