diff --git a/doc/source/specs/index.rst b/doc/source/specs/index.rst index 0a76827e7b..1b6e994565 100644 --- a/doc/source/specs/index.rst +++ b/doc/source/specs/index.rst @@ -6,5 +6,6 @@ Contents: .. toctree:: :maxdepth: 2 + osh-lma-stack.rst specifications.rst template.rst diff --git a/doc/source/specs/osh-lma-stack.rst b/doc/source/specs/osh-lma-stack.rst new file mode 100644 index 0000000000..4a2a983338 --- /dev/null +++ b/doc/source/specs/osh-lma-stack.rst @@ -0,0 +1,219 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +.. + +===================================== +OSH Logging, Monitoring, and Alerting +===================================== + +Blueprints: +1. osh-monitoring_ +2. osh-logging-framework_ + +.. _osh-monitoring: https://blueprints.launchpad.net/openstack-helm/+spec/osh-monitoring +.. _osh-logging-framework: https://blueprints.launchpad.net/openstack-helm/+spec/openstack-logging-framework + + +Problem Description +=================== + +OpenStack-Helm currently lacks a centralized mechanism for providing insight +into the performance of the OpenStack services and infrastructure components. +The log formats of the different components in OpenStack-Helm vary, which makes +identifying causes for issues difficult across services. To support operational +readiness by default, OpenStack-Helm should include components for logging +events in a common format, monitoring metrics at all levels, alerting and alarms +for those metrics, and visualization tools for querying the logs and metrics in +a single pane view. + + +Platform Requirements +===================== + +Logging Requirements +-------------------- + +The requirements for a logging platform include: + +1. All services in OpenStack-Helm log to stdout and stderr by default +2. Log collection daemon runs on each node to forward logs to storage +3. Proper directories mounted to retrieve logs from the node +4. Ability to apply custom metadata and uniform format to logs +5. Time-series database for logs collected +6. Backed by highly available storage +7. Configurable log rotation mechanism +8. Ability to perform custom queries against stored logs +9. Single pane visualization capabilities + +Monitoring Requirements +----------------------- + +The requirements for a monitoring platform include: + +1. Time-series database for collected metrics +2. Backed by highly available storage +3. Common method to configure all monitoring targets +4. Single pane visualization capabilities +5. Ability to perform custom queries against metrics collected +6. Alerting capabilities to notify operators when thresholds exceeded + + +Use Cases +========= + +Logging Use Cases +----------------- + +Example uses for centralized logging include: + +1. Record compute instance behavior across nodes and services +2. Record OpenStack service behavior and status +3. Find all backtraces for a tenant id's uuid +4. Identify issues with infrastructure components, such as RabbitMQ, mariadb, etc +5. Identify issues with Kubernetes components, such as: etcd, CNI, scheduler, etc +6. Organizational auditing needs +7. Visualize logged events to determine if an event is recurring or an outlier +8. Find all logged events that match a pattern (service, pod, behavior, etc) + +Monitoring Use Cases +-------------------- + +Example OpenStack-Helm metrics requiring monitoring include: + +1. Host utilization: memory usage, CPU usage, disk I/O, network I/O, etc +2. Kubernetes metrics: pod status, replica availability, job status, etc +3. Ceph metrics: total pool usage, latency, health, etc +4. OpenStack metrics: tenants, networks, flavors, floating IPs, quotas, etc +5. Proactive monitoring of stack traces across all deployed infrastructure + +Examples of how these metrics can be used include: + +1. Add or remove nodes depending on utilization +2. Trigger alerts when desired replicas fall below required number +3. Trigger alerts when services become unavailable or unresponsive +4. Identify etcd performance that could lead to cluster instability +5. Visualize performance to identify trends in traffic or utilization over time + +Proposed Change +=============== + +Logging +------- + +Fluentd, Elasticsearch, and Kibana meet OpenStack-Helm's logging requirements +for capture, storage and visualization of logged events. Fluentd runs as a +daemonset on each node and mounts the /var/lib/docker/containers directory. +The Docker container runtime engine directs events posted to stdout and stderr +to this directory on the host. Fluentd should then declare the contents of +that directory as an input stream, and use the fluent-plugin-elasticsearch +plugin to apply the Logstash format to the logs. Fluentd will also use the +fluentd-plugin-kubernetes-metadata plugin to write Kubernetes metadata to the +log record. Fluentd will then forward the results to Elasticsearch, which +indexes the logs in a logstash-* index by default. The resulting logs can then +be queried directly through Elasticsearch, or they can be viewed via Kibana. +Kibana offers a dashboard that can create custom views on logged events, and +Kibana integrates well with Elasticsearch by default. + +The proposal includes the following: + +1. Helm chart for Fluentd +2. Helm chart for Elasticsearch +3. Helm chart for Kibana + +All three charts must include sensible configuration values to make the +logging platform usable by default. These include: proper input configurations +for Fluentd, proper metadata and formats applied to the logs via Fluentd, +sensible indexes created for Elasticsearch, and proper configuration values for +Kibana to query the Elasticsearch indexes previously created. + +Monitoring +---------- + +Prometheus and Grafana meet OpenStack-Helm's monitoring requirements. The +Prometheus monitoring tool provides the ability to scrape targets for metrics +over HTTP, and it stores these metrics in Prometheus's time-series database. +The monitoring targets can be discovered via static configuration in Prometheus +or through service discovery. Prometheus includes a querying language that +provides meaningful queries against the metrics gathered and supports the +creation of rules to measure these metrics against for alerting purposes. It +also supports a wide range of Prometheus exporters for existing services, +including Ceph and OpenStack. Grafana supports Prometheus as a data source, and +provides the ability to view the metrics gathered by Prometheus in a single pane +dashboard. Grafana can be bootstrapped with dashboards for each target scraped, +or the dashboards can be added via Grafana's web interface directly. To meet +OpenStack-Helm's alerting needs, Alertmanager can be used to interface with +Prometheus and send alerts based on Prometheus rule evaluations. + +The proposal includes the following: + +1. Helm chart for Prometheus +2. Helm chart for Alertmanager +3. Helm chart for Grafana +4. Helm charts for any appropriate Prometheus exporters + +All charts must include sensible configuration values to make the monitoring +platform usable by default. These include: static Prometheus configurations +for the included exporters, static dashboards for Grafana mounted via configMaps +and configurations for Alertmanager out of the box. + +Security Impact +--------------- + +All services running within the platform should be subject to the +security practices applied to the other OpenStack-Helm charts. + +Performance Impact +------------------ + +To minimize the performance impacts, the following should be considered: + +1. Sane defaults for log retention and rotation policies +2. Identify opportunities for improving Prometheus's operation over time +3. Elasticsearch configured to prevent memory swapping to disk +4. Elasticsearch configured in a highly available manner with sane defaults + + +Implementation +============== + +Assignee(s) +----------- + +Primary assignees: + srwilker (Steve Wilkerson) + portdirect (Pete Birley) + lr699s (Larry Rensing) + + +Work Items +---------- + +1. Fluentd chart +2. Elasticsearch chart +3. Kibana chart +4. Prometheus chart +5. Alertmanager chart +6. Grafana chart +7. Charts for exporters: kube-state-metrics, ceph-exporter, openstack-exporter? + +All charts should follow design approaches applied to all other OpenStack-Helm +charts, including the use of helm-toolkit. + +All charts require valid and sensible default values to provide operational +value out of the box. + +Testing +======= +Testing should include Helm tests for each of the included charts as well as an +integration test in the gate. + + +Documentation Impact +==================== +Documentation should be included for each of the included charts as well as +documentation detailing the requirements for a usable monitoring platform, +preferably with sane default values out of the box.