
Logs alerting

In this end-to-end example we will set up an alert that fires when the number of error logs from any workload crosses a certain threshold.

Defining the query

We will construct a query that uses the count() function to get the number of error logs in the defined time window.

SELECT    count()   as log_count,
          workload  as workload,
          namespace as namespace
FROM      groundcover.logs
WHERE     $__timeFilter(timestamp) 
          AND level = 'error'
GROUP     BY workload, namespace

groundcover always stores log levels as lower-case values, e.g. 'error', 'info'.

The GROUP BY clause generates the labels that will be attached to the alert when it fires.

Running the query returns a list of workloads and the count of error logs for each. Note the time range above the query, which can be adjusted to fit your use case.
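
If you later manage this alert as code (for example via Grafana's file provisioning or the Terraform provider), the query becomes one data node of the rule. The sketch below follows Grafana's file-provisioning format and is illustrative only: the datasource UID is a placeholder, and the model field names (such as rawSql) depend on the ClickHouse datasource plugin in use.

# Sketch of the query node inside a file-provisioned Grafana alert rule.
# The datasource UID is a placeholder and the model field names (e.g. rawSql)
# depend on the ClickHouse datasource plugin, so treat them as illustrative.
data:
  - refId: A
    relativeTimeRange:
      from: 3600                        # look back one hour, in seconds
      to: 0
    datasourceUid: "clickhouse-uid"     # placeholder - use your datasource UID
    model:
      rawSql: |
        SELECT    count()   as log_count,
                  workload  as workload,
                  namespace as namespace
        FROM      groundcover.logs
        WHERE     $__timeFilter(timestamp)
                  AND level = 'error'
        GROUP     BY workload, namespace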

Defining the alert condition

Now that we have our data, we need to set an alert condition to determine when our SLO should be considered breached. In our case, we consider any number of error logs a breach of the SLO. We will use the Threshold expression with 0 as the threshold value, indicating that any workload with more than 0 error logs counts as a breach.

Note the Firing status for all of the returned results: each of them has more than 0 error logs in the last hour, breaching our SLO condition.
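
Expressed as code, the Threshold expression is a second node that reads the query result and compares each series against 0. The fragment below is a simplified sketch modeled on the YAML Grafana produces when exporting alert rules; exact expression fields may differ between Grafana versions.

# Sketch of the Threshold expression node, referencing the query node (refId A).
# Simplified from Grafana's exported alert-rule YAML; fields may vary by version.
data:
  - refId: B
    datasourceUid: __expr__             # Grafana server-side expressions
    model:
      type: threshold
      expression: A                     # evaluate the result of the query node
      conditions:
        - evaluator:
            type: gt                    # "is above"
            params: [0]                 # more than 0 error logs counts as a breach
condition: B                            # the rule fires based on this expression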

Defining the evaluation behavior

The next step is instructing Grafana on how we want this alert to be evaluated:

  1. Evaluation group: How often we want the rule to be evaluated

  2. Pending period: How long we allow the SLO to be breached before firing an alert

For example, if we choose an evaluation group of 1m and a pending period of 3m, the alert condition is checked every minute, but an alert fires only if the breach persists for 3 consecutive minutes.
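
In a provisioned rule group, these two settings map to the group's interval and the rule's for field. The fragment below simply mirrors the 1m / 3m example above; the group and rule names are illustrative.

# Evaluation behavior in a provisioned rule group: the group interval is the
# evaluation frequency and `for` is the pending period. Names are illustrative.
groups:
  - name: error-log-alerts              # evaluation group (illustrative name)
    interval: 1m                        # check the alert condition every minute
    rules:
      - title: Error logs per workload
        for: 3m                         # breach must persist for 3 consecutive minutes
        # condition and data nodes omitted; see the query and threshold sketches above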

To give a concrete example, let's look at two different series of evaluations:

              1m          2m          3m          4m          5m          Result
Example 1     BREACHED    BREACHED    OK          BREACHED    BREACHED    OK
Example 2     OK          BREACHED    BREACHED    BREACHED    BREACHED    FIRING

Even though both examples have the same number of evaluations that breached the SLO, only the second one fires an alert. This is because only in the second example was the SLO breached continuously for the full pending period of 3 consecutive minutes.

Defining labels and notifications

The next step is to add any extra labels to the fired alert; these can be used when deciding how to handle it. For example, labels such as team and severity could determine which contact point should be used.

In the notifications section, we can either use the assigned labels to route the alert or select a contact point directly.
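
As code, these labels sit directly on the rule, and Grafana's notification policies (or a directly chosen contact point) can route on them. The label values and the annotation below are illustrative; the workload and namespace labels referenced in the template come from the query's GROUP BY.

# Extra labels attached to the firing alert; notification policies can match on
# them to pick a contact point. Values are illustrative.
labels:
  team: payments
  severity: critical
annotations:
  summary: "{{ $labels.workload }} in {{ $labels.namespace }} is emitting error logs"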