Besides aggregate metrics gathered from machines and cloud resources, we make extensive use of our log data for monitoring. For this purpose we introduced Yelp's ElastAlert into our monitoring setup some time ago. It queries our ELK cluster and sends notifications to the teams in charge.
Sometimes ElastAlert returns unexpected results, and sometimes it does not alert although one would expect it to. In this article I want to answer some of the common questions we have asked ourselves while rolling out ElastAlert:
ElastAlert runs all rules in a loop on a fixed schedule; in most setups this means it loops over all rules once per minute. As events are usually not indexed in real time, it adds a buffer_time (15 minutes by default) to the time range in which it searches for events. On each run ElastAlert filters the returned events by the rule's timeframe and stores the result in memory. If the number of matching events already exceeds the rule's threshold (num_events for frequency rules), it alerts immediately. If subsequent runs return new events it has not stored so far (deduplicated by document id), ElastAlert adds them, and once the threshold is reached it alerts. This works like a sliding window over events. It tolerates late-indexed events caused by pipeline congestion, which is especially likely in error situations when event volume increases.
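The mechanics above can be illustrated with a minimal frequency rule. This is a sketch: the rule name, index pattern, filter, and addresses are hypothetical placeholders, while the option names follow the ElastAlert rule schema.

```yaml
# Hypothetical frequency rule: alert once 50 matching events
# fall into a 10-minute sliding window.
name: checkout-error-spike        # placeholder rule name
type: frequency
index: logstash-*                 # placeholder index pattern
num_events: 50                    # threshold of matching events
timeframe:
  minutes: 10                     # sliding window the events must fall into
buffer_time:
  minutes: 15                     # query this far back to pick up late-indexed events
filter:
- query:
    query_string:
      query: "level:ERROR AND service:checkout"
alert:
- email
email:
- "team@example.com"
```

With run_every at its one-minute default, each run re-queries the last buffer_time and deduplicates by document id, so late-indexed events still count towards num_events.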
Enabling use_count_query for frequency rules might sound like a good way to lower the load on the system (count queries are usually faster than real searches, as no documents have to be transported). However, this disables event filtering and storage on the ElastAlert side entirely. The rule is still executed every run period (once per minute), and the counted query hits are stored. As the count query returns no timestamps, ElastAlert uses the execution time to determine whether a rule has reached its threshold. If events are not indexed in real time, this easily leads to events being missed for alerting. To compensate, set the rule's query_delay parameter to at least one minute. Below you find a simplified view of how normal rules and rules with use_count_query are handled, and why in most cases it is better not to use_count_query.
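As a sketch of the workaround, the same hypothetical rule with use_count_query and a query_delay might look like this (option names follow the ElastAlert rule schema; index and filter values are placeholders):

```yaml
name: checkout-error-count        # placeholder rule name
type: frequency
index: logstash-*
num_events: 50
timeframe:
  minutes: 10
use_count_query: true             # use the _count API instead of fetching documents
doc_type: _doc                    # use_count_query requires doc_type on older versions
query_delay:
  minutes: 1                      # wait so late-indexed events are included in the count
filter:
- query:
    query_string:
      query: "level:ERROR AND service:checkout"
```

Note the trade-off: the count is cheap, but hits are bucketed by execution time rather than by event timestamp.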
Each time a rule is reloaded or ElastAlert is restarted, the complete in-memory buffer is discarded. This resets the threshold counter, and the sliding window starts over. For rules whose timeframe is smaller than buffer_time (so, with default settings, under 15 minutes), all events leading to a potential alert will be re-read into ElastAlert. If the timeframe is greater than 15 minutes, this is no longer the case; for that the scan_entire_timeframe property is available. However, for frequency rules with use_count_query, ElastAlert would have to set the timestamp of all matching events to the startup or reload time of the rule. They would then count towards alerting for a whole timeframe, which could lead to many false alerts, because ElastAlert's assumption about the time window differs from what you see when checking the events manually.
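For rules where the timeframe exceeds buffer_time, the relevant fragment might look like this (a sketch, not a complete rule):

```yaml
type: frequency
timeframe:
  hours: 1                  # larger than the default 15-minute buffer_time
scan_entire_timeframe: true # on startup/reload, query the whole hour, not just buffer_time
```

As described above, combining this with use_count_query is risky, since re-read hits are stamped with the reload time.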
If your rule only reached its threshold after running multiple queries within the timeframe, you can add the older events to the alert by using the attach_related property of your rule. If that creates unreadable or overly long alerts, you can also send only the interesting fields by using include: ["x", "y", "z"].
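As a sketch, the two properties combine like this (the field names x, y, and z are the placeholders from the text above):

```yaml
type: frequency
attach_related: true        # attach the earlier related events to the alert body
include: ["x", "y", "z"]    # only these fields are shipped with each event
```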
When managing lots of alerts from different sources, it can be tricky to find the sweet spot between a very specific index pattern (like one just for this type of event) and a broader one that also includes other events. In our experience it is better to specify a very broad index pattern in the alert rule and let Elasticsearch figure out which indices to search through. This allows administrative changes later, like introducing new rollup patterns or splitting/merging indices.
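In rule terms, this simply means preferring something like the following (the pattern is a placeholder):

```yaml
index: logstash-*     # broad pattern; Elasticsearch resolves the concrete indices
```

over hard-coding individual index names that may later be split, merged, or renamed.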
To minimize startup and reload issues, either accept that you may lose some matches on ElastAlert restarts, or avoid using a timeframe larger than buffer_time (15 minutes). Use elastalert-test-rule to test your rule before deploying it, and ensure special characters in strings are properly encoded; otherwise small errors in the query are easily missed and the rule never alerts. And use query_key to minimize the number of queries sent to Elasticsearch, for instance by using it to monitor whether a certain service has returned more than a common threshold of any HTTP error status code.
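A sketch of such a query_key rule, with hypothetical field names and a placeholder filter:

```yaml
name: http-errors-per-service     # placeholder rule name
type: frequency
index: logstash-*
num_events: 100                   # common threshold, counted per query_key value
timeframe:
  minutes: 5
query_key: service                # one query; hits are counted separately per service
filter:
- query:
    query_string:
      query: "http_status:[500 TO 599]"
```

One rule with query_key replaces one rule per service, so ElastAlert sends a single query per run instead of many.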