
Retrying failed actions
In some cases, steps may fail because some components are not yet ready for use due to initialization times, which can vary based on hardware and workload volume. If this occurs, two options allow a user to re-attempt or resume playbook execution.
- Solutions:
- The ansible-playbook command option --start-at-task="TASK NAME" allows resumption of a playbook at a specific task when used with the -l (limit) option.
- The ansible-playbook command option --step prompts the user to confirm each task before Ansible executes it.
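For example, to resume a run at a particular task on a single host (an illustrative invocation; the inventory file, host limit, and playbook names are placeholders for your own):
ansible-playbook -i hosts -l controller0 --start-at-task="TASK NAME" playbook.yml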
A node goes to ERROR state during rebuild
This can happen from time to time due to network errors or temporary overload of the undercloud.
- Symptoms:
- After the error, nova list shows the node in the ERROR state
- Solution:
Verify hardware is in working order.
Verify that approximately 20% of the disk space is free on the Ironic server node.
Get the image ID of the machine with `nova show`:
nova show $node_id
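The image ID appears in the image row of the table that nova show prints; a simple way to narrow the output (an illustrative filter only):
nova show $node_id | grep image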
Rebuild manually:
nova rebuild --preserve-ephemeral $node_id $image_id
A node times out after rebuild
While rare, there is the possibility that something unexpected happened and the host has failed to reboot as expected from a rebuild.
- Symptoms:
- Error Message: msg: Timeout waiting for the server to come up.. Please check manually
- Solution:
- Follow the steps detailed above in "A node goes to ERROR state during rebuild"
MySQL CLI configuration file missing
Should the post-rebuild restart fail, the possibility exists that the MySQL CLI configuration file is missing.
- Symptoms:
Attempts to access the MySQL CLI command return an error:
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)
- Solution:
Verify that the MySQL CLI config file stored on the state drive is present and has content. You can do this by executing the command below to display its contents in your terminal:
sudo cat /mnt/state/root/metadata.my.cnf
If the file is empty, run the command below, which retrieves the current metadata and updates the config files on disk:
sudo os-collect-config --force --one --command=os-apply-config
Verify that the MySQL CLI config file is present in the root user directory by executing the following command:
sudo cat /root/.my.cnf
If that file does not exist or is empty, two options exist.
Add the following to your MySQL CLI command line:
--defaults-extra-file=/mnt/state/root/metadata.my.cnf
Alternatively, copy the configuration from the state drive:
sudo cp -f /mnt/state/root/metadata.my.cnf /root/.my.cnf
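If the first option is used, the full invocation looks like the following (an illustrative example; note that --defaults-extra-file must be the first option passed to the mysql client):
sudo mysql --defaults-extra-file=/mnt/state/root/metadata.my.cnf -e "SELECT 1"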
MySQL fails to start upon retrying update
If the update was aborted or failed during the Update sequence before a single MySQL controller was operational, MySQL will fail to start upon retrying.
- Symptoms:
Update is being re-attempted.
The following error messages have been observed:
- msg: Starting MySQL (Percona XtraDB Cluster) database server: mysqld . . . . The server quit without updating PID file (/var/run/mysqld/mysqld.pid)
- stderr: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)
- FATAL: all hosts have already failed -- aborting
Update automatically aborts.
- WARNING:
- The command /etc/init.d/mysql bootstrap-pxc, which is mentioned below, should only ever be executed when an entire MySQL cluster is down, and then only on the last node to have been shut down. Running this command on multiple nodes will cause the MySQL cluster to enter a split-brain scenario, effectively breaking the cluster and resulting in unpredictable behavior.
- Solution:
Use nova list to determine the IP of the controllerMgmt node, then ssh into it:
ssh heat-admin@$IP
Verify MySQL is down by running the mysql client as root. It should fail:
sudo mysql -e "SELECT 1"
Attempt to restart MySQL in case another cluster node is online. This should fail in this error state; however, if it succeeds, your cluster should again be operational and the next step can be skipped:
sudo /etc/init.d/mysql start
Start MySQL back up in single node bootstrap mode:
sudo /etc/init.d/mysql bootstrap-pxc
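Once bootstrapped, a quick sanity check can confirm the node is up (an illustrative query; expect wsrep_cluster_status to report Primary and wsrep_cluster_size to report 1 until the other nodes rejoin):
sudo mysql -e "SHOW STATUS LIKE 'wsrep_cluster_%'"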
MySQL/Percona/Galera is out of sync
OpenStack is configured to store all of its state in a multi-node synchronous replication Percona XtraDB Cluster database, which uses Galera for replication. This database must be in sync and have the full complement of servers before updates can be performed safely.
- Symptoms:
- Update fails with errors about Galera and/or MySQL being "Out of Sync"
- Solution:
Use nova list to determine the IP of the controllerMgmt node, then SSH to it:
ssh heat-admin@$IP
Verify replication is out of sync:
sudo mysql -e "SHOW STATUS like 'wsrep_%'"
Stop mysql:
sudo /etc/init.d/mysql stop
Verify it is down by running the mysql client as root. It should fail:
sudo mysql -e "SELECT 1"
Start controllerMgmt0 MySQL back up in single node bootstrap mode:
sudo /etc/init.d/mysql bootstrap-pxc
On the remaining controller nodes observed to be having issues, obtain the IP address via nova list and log in to each:
ssh heat-admin@$IP
Verify replication is out of sync:
sudo mysql -e "SHOW STATUS like 'wsrep_%'"
Stop mysql:
sudo /etc/init.d/mysql stop
Verify it is down by running the mysql client as root. It should fail:
sudo mysql -e "SELECT 1"
Start MySQL back up so it attempts to connect to controllerMgmt0:
sudo /etc/init.d/mysql start
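Once started, you can check that the node has rejoined and synced (an illustrative query; the node may briefly report Joining or Donor/Desynced while state transfer completes):
sudo mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"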
If restarting MySQL fails, then the database is almost certainly out of sync and the MySQL error logs, located at /var/log/mysql/error.log, will need to be consulted. In this case, never attempt to restart MySQL with sudo /etc/init.d/mysql bootstrap-pxc, as it will bootstrap the host as a single-node cluster, thus worsening what already appears to be a split-brain scenario.
MysQL "Node appears to be the last node in a cluster" error
This error occurs when one of the controller nodes does not have MySQL running. The playbook has detected that the current node is the last running node, although based on sequence it should not be the last node. As a result the error is thrown and update aborted.
- Symptoms:
- Update Failed with error message "Galera Replication - Node appears to be the last node in a cluster - cannot safely proceed unless overridden via single_controller setting - See README.rst"
- Actions:
- Run the pre-flight_check.yml playbook. It will attempt to restart MySQL on each node in the "Ensuring MySQL is running -" step. If that step succeeds, you should be able to re-run the playbook without encountering the "Node appears to be last node in a cluster" error.
- If pre-flight_check fails to restart MySQL, you will need to consult the MySQL logs (/var/log/mysql/error.log) to determine why the other nodes are not restarting.
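To run the pre-flight check (an illustrative invocation; the inventory file name is a placeholder for your own):
ansible-playbook -i hosts pre-flight_check.yml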
SSH Connectivity is lost
Ansible uses SSH to communicate with remote nodes. In heavily loaded, single host virtualized environments, SSH can lose connectivity. It should be noted that similar issues in a physical environment may indicate issues in the underlying network infrastructure.
- Symptoms:
Ansible update attempt fails.
Error output:
fatal: [192.0.2.25] => SSH encountered an unknown error. The output was: OpenSSH_6.6.1, OpenSSL 1.0.1i-dev xx XXX xxxx debug1: Reading configuration data /etc/ssh/ssh_config debug1: /etc/ssh/ssh_config line 19: Applying options for * debug1: auto-mux: Trying existing master debug2: fd 3 setting O_NONBLOCK mux_client_hello_exchange: write packet: Broken pipe FATAL: all hosts have already failed -- aborting
- Solution:
You will generally be able to re-run the playbook and complete the upgrade, unless SSH connectivity is lost while all MySQL nodes are down. (See 'MySQL fails to start upon retrying update' to correct this issue.)
Early Ubuntu Trusty kernel versions have known issues with KVM which will severely impact SSH connectivity to instances. Test hosts should have a minimum kernel version of 3.13.0-36-generic. The update steps, as root, are:
apt-get update
apt-get dist-upgrade
reboot
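To confirm the running kernel version before and after the upgrade:
uname -r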
If this issue is repeatedly encountered on a physical environment, the network infrastructure should be inspected for errors.
Error messages similar to the one noted in the Symptoms may occur with long-running processes, such as database creation/upgrade steps. In these cases, partial program execution log output will generally be visible immediately before the broken pipe message.
Should this be the case, Ansible and OpenSSH may need to have their configuration files tuned to meet the needs of the environment.
Consult the Ansible example configuration file for the available connection settings (ssh_args, timeout, and possibly pipelining):
https://github.com/ansible/ansible/blob/release1.7.0/examples/ansible.cfg
As Ansible uses OpenSSH, please reference the ssh_config manual, in particular the ServerAliveInterval and ServerAliveCountMax options.
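As a starting point, the relevant settings live in ansible.cfg; a minimal sketch, with illustrative values that should be tuned to suit the environment:
[defaults]
timeout = 30

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ServerAliveInterval=30 -o ServerAliveCountMax=5
pipelining = True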
Postfix fails to reload
Occasionally the postfix mail transfer agent will fail to reload because it is not running when the system expects it to be running.
- Symptoms:
- Step in /var/log/upstart/os-collect-config.log shows that 'service postfix reload' failed.
- Solution:
Start postfix:
sudo service postfix start
Apache2 fails to start
Apache2 requires some self-signed SSL certificates to be put in place that may not have been configured yet due to earlier failures in the setup process.
- Error Message:
- failed: [192.0.2.25] => (item=apache2) => {"failed": true, "item": "apache2"}
- msg: start: Job failed to start
- Symptoms:
- apache2 service fails to start
- /etc/ssl/certs/ssl-cert-snakeoil.pem is missing or empty
- Solution:
Re-run os-collect-config to reassert the SSL certificates:
sudo os-collect-config --force --one
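Afterwards, verify the certificate now exists and start the service (using the certificate path from the symptoms above):
ls -l /etc/ssl/certs/ssl-cert-snakeoil.pem
sudo service apache2 start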
RabbitMQ still running when restart is attempted
There are certain system states that cause RabbitMQ to fail to die on normal kill signals.
- Symptoms:
- Attempts to start rabbitmq fail because it is already running
- Solution:
- Find any processes running as rabbitmq on the box, and kill them, forcibly if need be.
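A minimal sketch of that cleanup, assuming the leftover processes run as the rabbitmq user and that the service should be started again afterwards:
ps -u rabbitmq -o pid,cmd
sudo pkill -u rabbitmq
If any processes remain:
sudo pkill -9 -u rabbitmq
sudo service rabbitmq-server start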
Instance reported with status == "SHUTOFF" and task_state == "powering on"
If nova attempts to restart an instance when the compute node is not ready, it is possible that nova could enter a confused state where it thinks that an instance is starting when in fact the compute node is doing nothing.
- Symptoms:
- Command nova list --all-tenants reports instance(s) with STATUS == "SHUTOFF" and task_state == "powering on".
- Instance cannot be pinged.
- No instance appears to be running on the compute node.
- Nova hangs upon retrieving logs or returns old logs from the previous boot.
- Console session cannot be established.
- Solution:
- On a controller logged in as root, after executing `source stackrc`:
- Execute nova list --all-tenants to obtain instance ID(s)
- Execute nova show <instance-id> on each suspected ID to identify suspected compute nodes.
- Log into the suspected compute node(s) and execute: os-collect-config --force --one
- Return to the controller node that you were logged into previously and, using the instance IDs obtained earlier, take the following steps.
- Execute nova reset-state --active <instance-id>
- Execute nova stop <instance-id>
- Execute nova start <instance-id>
- Once the above steps have been taken in order, you should see the instance status return to ACTIVE and the instance become accessible via the network.
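In command form, the sequence looks like the following (a condensed restatement of the steps above; <instance-id> is taken from the nova output):
nova list --all-tenants
nova show <instance-id>
Then, on the suspected compute node:
os-collect-config --force --one
Back on the controller:
nova reset-state --active <instance-id>
nova stop <instance-id>
nova start <instance-id>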
State drive /mnt is not mounted
In the rare event that something goes wrong between the state drive being unmounted and the rebuild command being triggered, the /mnt volume on the instance that was being operated on at that time will be left unmounted.
In such a state, pre-flight checks will fail attempting to start MySQL and RabbitMQ.
- Error Messages:
Pre-flight check returns an error similar to:
failed: [192.0.2.24] => {"changed": true, "cmd": "rabbitmqctl -n rabbit@$(hostname) status" stderr: Error: unable to connect to node 'rabbit@overcloud-controller0-vahypr34iy2x': nodedown
Attempting to manually start MySQL or RabbitMQ return:
start: Job failed to start
Upgrade execution returns with an error indicating:
TASK: [fail msg="Galera Replication - Node appears to be the last node in a cluster - cannot safely proceed unless overriden via single_controller setting - See README.rst"] ***
- Symptoms:
- Execution of the df command does not show a volume mounted as /mnt.
- Unable to manually start services.
- Solution:
Execute os-collect-config, which will re-mount the state drive. This command may fail without additional intervention; however, it should mount the state drive, which is all that is needed to proceed to the next step:
sudo os-collect-config --force --one
At this point, the /mnt volume should be visible in the output of the df command.
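For example:
df -h | grep /mnt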
Start MySQL by executing:
sudo /etc/init.d/mysql start
If MySQL fails to start, and it has been verified that MySQL is not running on any controller nodes, then you will need to identify the last node that MySQL was stopped on and consult the section "MySQL fails to start upon retrying update" for guidance on restarting the cluster.
Start RabbitMQ by executing:
sudo service rabbitmq-server start
If rabbitmq-server fails to start, then the cluster may be down. If this is the case, then the last node to be stopped will need to be identified and started before attempting to restart RabbitMQ on this node.
At this point, re-execute the pre-flight check, and proceed with the upgrade.
VMs may not shut down properly during upgrade
During the upgrade process, VMs on compute nodes are shut down gracefully. If the VMs do not shut down, this can cause the upgrade to stop.
- Error Messages:
A playbook run ends with a message similar to:
failed: [10.23.210.31] => {"failed": true} msg: The ephemeral storage of this system failed to be cleaned up properly and processes or files are still in use. The previous ansible play should have information to help troubleshoot this issue.
The output of the playbook run prior to this message contains a process listing and a listing of open files.
- Symptoms:
The state drive on the compute node, /mnt, is still in use and cannot be unmounted. You can confirm this by executing:
lsof -n | grep /mnt
VMs are running on the node. To see which VMs are running, run:
virsh list
If virsh list fails, you may need to restart libvirt-bin or libvirtd depending on which process you are running. Do so by running:
service libvirt-bin restart
or
service libvirtd restart
- Solution:
- Manual intervention is required. You will need to determine why the VMs did not shut down properly, and resolve the issue.
- Unresponsive VMs can be forcibly shut down using virsh destroy <id>. Note that this can corrupt filesystems on the VM.
- Resume the playbook run once the VMs have been shut down.
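For example, to force off a stuck VM and confirm it is gone (an illustrative sequence; <id> is taken from the virsh list output):
virsh list --all
virsh destroy <id>
virsh list --all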
Instances are inaccessible via network
After a restart, it is possible that a virtual machine is unreachable because Open vSwitch was not ready to provide the virtual machine's networking.
- Symptom:
- After a restart, instances won't ping.
- Solution:
- Log into a controller node and execute source /root/stackrc
- Stop all virtual machines on the affected compute node using nova hypervisor-servers <hostname> and nova stop <id>
- Log into the undercloud node and execute source /root/stackrc
- Obtain a list of nodes by executing nova list
- Execute nova stop <id> for the affected compute node.
- Once the compute node has stopped, execute nova start <id> to reboot the compute node.
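In command form (a condensed restatement of the steps above; the IDs are taken from the nova output):
On the controller node:
source /root/stackrc
nova hypervisor-servers <hostname>
nova stop <vm-id>
On the undercloud node:
source /root/stackrc
nova list
nova stop <compute-node-id>
nova start <compute-node-id>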
Online Upgrade fails with a message saying glanceclient is not found
- Symptoms:
- Online upgrade has been attempted; however, the playbook execution failed when attempting to download the new image from Glance, reporting that glanceclient was not found.
- Solution:
- If you are attempting to execute the Ansible playbook on the seed or undercloud node, source the Ansible virtual environment by executing source /opt/stack/venvs/ansible/bin/activate
- Once the Ansible virtual environment has been sourced, execute sudo pip install python-glanceclient on the node you are attempting to execute Ansible from.
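In command form, on the node you run Ansible from (paths taken from the steps above):
source /opt/stack/venvs/ansible/bin/activate
sudo pip install python-glanceclient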
Online Upgrade of compute node failed
In the event that an online upgrade of a compute node somehow failed, the node can be recovered utilizing a traditional rebuild.
- Symptoms:
- Online upgrade was performed.
- Compute node cannot be logged into, or is otherwise in a non-working state.
- Solution:
- From the undercloud:
- Execute source /root/stackrc
- Identify the instance ID of the broken compute node via nova list
- Execute the command nova stop <instance-id> to stop the instance.
- Return to the host that you ran the upgrade from and re-run the playbook without the "-e online_upgrade=True" option.
- Additionally, you may need to utilize the "-e force_rebuild=True" option to force the instance to rebuild.
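A rebuild-based recovery run might then look like the following (an illustrative invocation; the inventory and playbook names are placeholders for your own):
ansible-playbook -i hosts playbook.yml -e force_rebuild=True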