42 Commits

Author SHA1 Message Date
Kevin Carter
814622cc6c
Improve logstash and elasticsearch performance
The logstash and elasticsearch performance can be improved by using
async index options, pulling back the refresh interval, and by not
fingerprinting every document.

* Async translog allows elasticsearch to run fsync in the background
  instead of blocking.
* The refresh interval will now be 5x the number of replicas with a cap
  of 30. This integer is representative of the seconds between index
  refresh calls, which greatly lowers the load generated across the
  cluster.
* All documents were fingerprinted before writing to the cluster. This
  was a costly operation as elasticsearch will do a forward lookup on all
  documents with a preset ID, resulting in 100s, if not 1000s, of extra
  reads. The purpose of the fingerprint function is to limit repeated
  writes, so to keep some of this functionality the fingerprint function
  is now only added to documents with messages.
* G1 garbage collection is now enabled by default when the heap size is
  > 6GiB. Early versions of elasticsearch did not recommend this setting;
  however, it has since stabilized in recent releases.
* JVM options have been moved into the elasticsearch and logstash roles
  allowing these tasks to trigger service restarts when changes are made.
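
The refresh interval rule described above can be sketched roughly as
follows. This is an illustration only, not the role's actual template:
the function names are hypothetical, and the commit does not say how a
replica count of zero is handled, so the sketch applies the rule
literally. The setting keys `index.refresh_interval` and
`index.translog.durability` are standard elasticsearch index settings.

```python
def refresh_interval_seconds(replica_count: int, multiplier: int = 5, cap: int = 30) -> int:
    """Return the refresh interval in seconds: 5x the replicas, capped at 30.

    Hypothetical helper; the multiplier and cap come from the commit message.
    """
    if replica_count < 0:
        raise ValueError("replica_count must be non-negative")
    return min(replica_count * multiplier, cap)


def index_settings(replica_count: int) -> dict:
    """Build an index settings mapping reflecting the tuning described above."""
    return {
        "index.number_of_replicas": replica_count,
        # Seconds between index refresh calls.
        "index.refresh_interval": f"{refresh_interval_seconds(replica_count)}s",
        # Let fsync of the translog happen in the background.
        "index.translog.durability": "async",
    }


if __name__ == "__main__":
    for replicas in (1, 2, 6, 10):
        print(replicas, index_settings(replicas))
```

With one replica the interval works out to 5 seconds, and any count of
six or more replicas lands on the 30 second cap.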

Change-Id: I805129b207ad4db182ae6e59b6ec78eb3e246b54
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-09-21 21:47:07 -05:00
Kevin Carter
0d4a4a92c7
Converge the logstash pipelines and enhance memory backed queues
The multi-logstash pipeline setup, while amazingly fast, was crashing
and causing index errors when under high load for a long period of time.
Because of the crashing behavior and the fact that the folks from
Elastic describe multi-pipeline queues as "beta" at this time, the
logstash pipelines have been converted back into a single pipeline.

The memory backed queue options are now limited by a ram disk (tmpfs),
which ensures that a burst within the queue does not cause OOM issues
and keeps the deployment highly performant while limiting memory usage
at the same time. Memory backed queues will be enabled when the
underlying system is using "rotational" media as detected by ansible
facts. This will ensure a fast and consistent experience across all
deployment types.

Pipeline/ml/template/dashboard setup has been added to the beat
configurations which will ensure beats are properly configured even
when running in an isolated deployment and outside of normal operations
where beats are generally configured on the first data node.

Change-Id: Ie3c775f98b14f71bcbed05db9cb1c5aa46d9c436
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-09-16 23:44:58 -05:00
Kevin Carter
1c56b7f034
Add option block to ensure apache2 is enabled correctly
The apache2 monitoring process requires a couple of interactions to deploy
successfully. This change will ensure that if the apache2 monitoring
fails, in any way, it does not block the deployment.

Change-Id: Ibe35197a1c65f4abe9e4870c07ee15f37f9a58ab
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-08-29 15:39:08 -05:00
Kevin Carter
e4c84aa28d
Add Redhat to the ELK deployment capabilities
Change-Id: Id34e046a546f8d0878843596f53e400165e37c6e
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-08-13 18:59:57 -05:00
Kevin Carter
8db0238749 Move most of the variables into the roles
Change-Id: I82a48c554c164c7166c1a0d4e3192332af5024fb
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-08-13 03:20:33 +00:00
Zuul
a0780fb582 Merge "Further tune the playbooks, configs, and thread pool" 2018-07-26 20:37:01 +00:00
Kevin Carter
f69d391325 Further tune the playbooks, configs, and thread pool
* Implements G1 GC optionally. The variable `elastic_g1gc_enabled` has
  been added with a default of false. If this option is set to true and
  the system has more than 4GiB of RAM, G1GC will be enabled.
* Adds new thread options
* Better constraints for coordination nodes
* Interface recover speed has been limited
* Buffer size is now set correctly
* Serialize elk deployment so that upgrades are non-impacting
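
The G1GC toggle described in the first bullet can be sketched as below.
This is a minimal illustration under stated assumptions, not the
playbook's actual logic: only the variable name `elastic_g1gc_enabled`
and the 4GiB threshold come from the commit message, while the function
names and the way RAM is passed in are hypothetical. `-XX:+UseG1GC` is
the standard JVM flag for selecting the G1 collector.

```python
GIB = 1024 ** 3


def g1gc_active(elastic_g1gc_enabled: bool, total_ram_bytes: int, threshold_gib: int = 4) -> bool:
    """Return True only when the option is set and RAM exceeds the threshold."""
    return elastic_g1gc_enabled and total_ram_bytes > threshold_gib * GIB


def gc_jvm_options(elastic_g1gc_enabled: bool, total_ram_bytes: int) -> list:
    """Return the extra JVM options to add; empty when G1GC stays off."""
    if g1gc_active(elastic_g1gc_enabled, total_ram_bytes):
        return ["-XX:+UseG1GC"]
    return []


if __name__ == "__main__":
    print(gc_jvm_options(True, 8 * GIB))   # option set, enough RAM
    print(gc_jvm_options(True, 4 * GIB))   # exactly 4GiB is not "more than"
    print(gc_jvm_options(False, 8 * GIB))  # default of false
```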

Change-Id: I89224eeaf4ed29c3bb1d7f8010b69503dbc74e11
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-07-26 18:43:13 +00:00
Kevin Carter
7b2e56885b Add arcsight ingestion into logstash
Logstash is able to handle arcsight events; this PR enables that
capability.

Change-Id: Id220c671cc5d7cb7ee33fb53e2ae4185d579fc2a
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-07-26 13:09:53 -05:00
Jonathan Rosser
39e9905d00 Allow mounting of shared filesystems for index backup/restore
Change-Id: I6590bd0b7560fe42bd82d1a8aa7932a45f067ca5
2018-07-25 17:01:32 +01:00
Zuul
72d1de3888 Merge "update default kibana elastic timeout" 2018-07-24 21:37:55 +00:00
Victor Palma
08a5f02a78 update default kibana elastic timeout
* set the default elasticsearch request timeout to 60 seconds

Change-Id: Ieac2c96315bbbcfe7cc2d2bff42d2ee15f23fb0b
2018-07-24 13:09:25 -05:00
Kevin Carter
f7aad4832f Update retention policy weighting
This change adds retention policy weighting based on experience with the
indexes in production in large scale clouds.

Change-Id: I0d09d4cfc68f70fe790170d5d54f1585616c5524
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-07-24 09:46:42 -05:00
Kevin Carter
0ab9d82545 Move heartbeat from utility_all to kibana
The heartbeat probe was making an assumption that the deployment will
always be an OSA one by using the group "utility_all" as a deployment
target. This change moves heartbeat to the first three kibana
nodes by default, which corrects the previous assumption.

Change-Id: Ic1b90eb94dd20dc2273542333de47bfd690af1dd
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-07-20 16:39:10 -05:00
Zuul
f59c4a76e0 Merge "Tidy Heartbeat service names" 2018-07-17 03:28:53 +00:00
Kevin Carter
7a32b5c9a9 Add additional ES cluster tuning
The following options will reduce cluster pressure and generally
improve search performance.

Change-Id: I1619680db1fd595503f0845b182d6f6ce4c59f3c
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-07-16 22:52:40 +00:00
Kevin Carter
b6a9a6fc7a Add dynamic retention policies to curator
The curator retention policies will now query the storage nodes within
a given deployment and set a suitable index retention policy based on
the total amount of storage each index is assumed to produce every day.
To ensure we're minimizing the storage required and optimizing search
performance several actions are now being taken:

* Indexes will be shrunk after a quarter of their retention time.
* Indexes will be deleted should they exceed the retention time.
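
The two lifecycle rules above can be sketched as a small decision
function. This is an illustration only, assuming index age and
retention are both measured in days; the function name and return
values are hypothetical and do not reflect curator's own action file
format.

```python
def index_action(age_days: float, retention_days: float) -> str:
    """Decide what should happen to an index of a given age.

    Hypothetical helper mirroring the two rules in the commit message:
    delete once the retention time is exceeded, shrink once a quarter
    of the retention time has passed, otherwise leave the index alone.
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    if age_days > retention_days:
        return "delete"
    if age_days > retention_days / 4:
        return "shrink"
    return "keep"


if __name__ == "__main__":
    retention = 28
    for age in (0, 7, 8, 28, 29):
        print(age, index_action(age, retention))
```

The delete check runs first so that an index past its retention time is
never reported as a shrink candidate.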

Change-Id: I8bf548620b5404d25deaadba8fda93452ef64fa0
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-07-12 17:03:40 +00:00
Jonathan Rosser
eb893f1776 Tidy Heartbeat service names
Remove spaces in service names, and don't duplicate the protocol as
heartbeat includes these into the monitor.scheme and monitor.id
fields by default.

Change-Id: If7633dd5ca23c22eff37a8b7140fff4bf0911432
2018-07-10 09:39:33 +01:00
Kevin Carter
91dbd09353
Tune vars to better support an isolated deployment
Change-Id: I93d33bed42976d20919f887ef8096b212a6559a2
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-07-09 23:47:40 -05:00
Zuul
09c412e8b0 Merge "Remove the unused port 35357" 2018-06-27 00:59:43 +00:00
Zuul
a9e2d93ec2 Merge "Add kibana custom dashboard" 2018-06-26 05:30:59 +00:00
Guilherme Steinmuller Pimentel
fde2f649bf Add kibana custom dashboard
These files provide an alternative for those who want their
custom dashboards on kibana. The playbook setupKibanaDashboards.yml
installs elasticdump and uses it to dump into kibana's index a simple
dashboard that collects logs from filebeat.

Change-Id: Ibb3407b1f19eac5f7cda753e00c3bc6f3ff16da7
2018-06-26 05:10:09 +00:00
ZhijunWei
38f6164556 Remove the unused port 35357
Now that the v2.0 API has been removed, we don't have a reason to
include deployment instructions for two separate applications on
different ports.

Change-Id: I0c8451207afec77c9a8071ca8035337ffd0ac9f0
2018-06-23 00:07:51 -04:00
Kevin Carter
57756eefe2 Add kafka output plugin to logstash
This change will allow a deployer to directly ship data from
logstash into Kafka.

Change-Id: I5de0caf270c8ced8111ac099cb91a70814f80259
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-06-21 13:09:28 +00:00
Jonathan Rosser
e3eb653b37 Add apm-server to loadbalancer
Change-Id: I7442296d0ff984839e7f63ffcf82a77db722b72e
2018-06-18 14:24:56 +00:00
Kevin Carter
778002714c Add upgrade task options
To ensure users can upgrade packages the variable
`"{{ elk_package_state | default('present') }}"` has been added
to all package installs.

Change-Id: I0238d9e1ed991cb1480bd924f2d5a09687890da3
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-06-14 19:30:29 -05:00
Kevin Carter
bc2937d9c9
Use elasticsearch coordinator nodes as smart LBs
Elasticsearch can be used as a smart load balancer for all traffic
which will remove the requirement for a VIP and move the cluster to a
mesh topology. All of the Kibana nodes will now run elasticsearch as
coordinators.

* Kibana will now connect to elasticsearch on localhost.
* All of the beats have been set up to use the new mesh topology.
* jvm memory management has been updated to reflect the additional
  services.

More on node assignments can be found here:
* https://www.elastic.co/guide/en/elasticsearch/reference/6.2/modules-node.html#modules-node

* The readme has been updated to reflect these changes.

Change-Id: I769e0251072f5dbde56fcce7753236d37d5c3b19
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-06-13 23:37:48 -05:00
Dave Wilde
23ac2aa985 Add logstash filters
This adds the ability to include logstash log parsing filters for
various openstack and service logs.  These filters are disabled by
default and can be enabled by toggling the deploy_logstash_filters
variable.

Change-Id: I5c46f78f232d3fb604283ae623cd3975a8346c7c
2018-06-07 22:13:48 -05:00
Jonathan Rosser
eb73dd6e66 Point metricbeat rabbitmq collector to existing rabbitmq endpoint
Change-Id: I9511a1da1a031b4b05bbbb108386cd5b56fd96e9
2018-06-05 12:56:29 +01:00
Jonathan Rosser
62f9508df2 Point metricbeat haproxy collector to existing stats endpoint
Change-Id: I36e86746a851d48501bce7f91910761a08d20196
2018-06-05 12:56:28 +01:00
Jonathan Rosser
b2a66c9a18 Convert ELK repo location to a variable so it can be overridden
Change-Id: I9e5b78960c891aae7f4e94317647668a77b08c58
2018-05-16 12:34:01 +01:00
Zuul
8ce8da08c8 Merge "update heartbeat vars to check response" 2018-05-12 19:55:34 +00:00
Kevin Carter
422b13fd86
update heartbeat vars to check response
Several API services return 300 to indicate they are up; this change
adds the ability to check for that.

Change-Id: Ic85f6cff3bc225b29ae0e3e8fbd19eceece00441
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-05-11 16:10:23 -05:00
Kevin Carter
846a90d025 Tune down the collection intervals and default retention policy
At present we're collecting too much info by default. We're seeing
+500GB on a <50 node environment in just two weeks. While we don't expect
the data set to grow much larger given the use of curator, this change
lowers the default collection intervals of the various beats and updates
the retention / detection policies so we're not storing too much
information.

To correct a unicode problem with py2 the host index loops have been
updated.

Curator has also been updated to run every day.

Change-Id: Ic202eb19806d1b805fa314d3d8bde05b286740e0
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-05-11 16:04:14 +00:00
Kevin Carter
0c41b0fd70
Add curator and dynamic shard counts
Curator has been added to automatically maintain the cluster with
sensible defaults when it pertains to data retention.

The index counts have been modified such that they're determined by the
size of the initial cluster. While these shard counts can be modified
post deployment by reindexing the data, it's not something being done at
this time.

Depends-On: https://review.openstack.org/c/565807
Change-Id: I249d715ae5241ab57c4117b14377e4d07cb6e984
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-05-02 12:11:30 -05:00
Kevin Carter
4e0c30ed16 Add documentation and tooling for legacy environments
A deployer may want to run these tools within a legacy environment
(running Ansible <2.4) but will find the deployment of these
playbooks impossible due to the use of new-ish task syntax,
roles, and modules. This change gives deployers options when running
within legacy environments by providing everything needed to deploy
these playbooks using embedded ansible.

Change-Id: Ic99b93017129321b2eb8b773a77f7fa478cc8dc7
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-05-01 19:06:38 -05:00
Kevin Carter
49f63cabae
cleanup heartbeat config
Change-Id: Iea30a4187e93fce252c603d4e188b2e475672b32
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-04-27 17:08:10 -05:00
Kevin Carter
ac286b0ac3
Update rollback plan and configs
* Added options for the rollback plan so that if a rollback is executed
  all beat packages will be removed.

* Additional updates to streamline elk and fix container bind mounts and
  the use of group information for metric and heartbeat information.

* Readme information has been fixed

Change-Id: Icd070259db5b19d289d10033b1f055125f56e18c
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-04-26 16:41:51 -05:00
Kevin Carter
903b995d32 Add APM Server
The Elastic Stack has the ability to get application performance metrics
using the built-in APM server. This change implements the APM server in
an existing ELK environment.

Change-Id: Ie6f533b81cfdb0c6a4ba2f33fd3b9f0a3e49a1fc
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-04-16 08:36:48 -05:00
Kevin Carter
390314e18b
Add variables to connect ELK and Grafana
With the option to deploy grafana the following changes allow a user to
automatically connect ELK and Grafana.

Change-Id: Ic8e64a31d860940c6863f46ce558908d5ef8f8e7
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-04-13 23:08:31 -05:00
Kevin Carter
969a30c6c7
Add grafana
This change introduces grafana into the stack which gives us a great
way to visualize the data. The grafana role from cloudalchemy is being
used for the bulk of the deployment.

Because the grafana deployment playbook is now standalone the mentions
of grafana in the other ops directories have been removed.

Change-Id: I23e1c96cd1fda7ece9b86a69f9f0326913de714d
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-04-13 10:31:34 -05:00
Kevin Carter
17fb37f075
Update elk 6.x playbooks
Most of the changes in this PR are for style and to adapt the playbooks
so that the system can operate on a multi-node cloud.

Functional change includes the removal of mainline Java 8 in favor of
OpenJDK 8.

A site playbook was added to allow an operator to just run everything.

Old tools that no longer function within the stack have been removed.

Packetbeat was added to the install list
Auditbeat was added to the install list

All of the config files have been updated for the recent ElasticStack
6.x changes.

Change-Id: I01200ad4772ff200b9c5c93f8f121145dfb88170
Signed-off-by: Kevin Carter <kevin.carter@rackspace.com>
2018-04-11 03:11:44 -05:00
Per Abildgaard Toft
48e2b8e998 Updated version of ELK stack for openstack ansible
This addition is an update of the current elk_metrics which will install Elasticsearch, Logstash and Kibana 6.x.
It also includes a configuration guide for haproxy endpoints

Change-Id: Iac4dec6d17bc75433e5fe672f3b9781536b8e619
2018-03-06 14:21:23 +00:00