Logs alerting
In this end-to-end example we will set up an alert that triggers when the number of error logs from any workload crosses a certain threshold.
We will construct a query that uses the count() operator to get the number of error logs in the defined time window.
groundcover always saves log levels as lowercase values, e.g. 'error', 'info'.
The GROUP BY operator will generate the labels that will be attached to the alert when it fires.
Running the query returns a list of workloads and the count of error logs for each. Note the time range at the top of the query, which can be adjusted to fit your use case.
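To make the shape of the result concrete, here is a minimal Python sketch of what a count() with GROUP BY over error logs computes. It is not groundcover's query syntax: the record layout and the field names `workload` and `level` are assumptions chosen for illustration, and the log records are made up.

```python
from collections import Counter

# Hypothetical log records; the field names "workload" and "level"
# are assumptions for this sketch, not groundcover's actual schema.
logs = [
    {"workload": "checkout", "level": "error"},
    {"workload": "checkout", "level": "info"},
    {"workload": "checkout", "level": "error"},
    {"workload": "payments", "level": "error"},
    {"workload": "frontend", "level": "info"},
]


def count_error_logs(records):
    """Keep only error-level records, then count them per workload.

    The comparison uses the lowercase value 'error', since log levels
    are stored lowercase.
    """
    return Counter(r["workload"] for r in records if r["level"] == "error")


error_counts = count_error_logs(logs)
print(dict(error_counts))  # {'checkout': 2, 'payments': 1}
```

Each key of the result plays the role of a label produced by the grouping: one entry per workload, which is what lets an alert identify the workload it fired for.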
Now that we have our data, we need to set an alert condition that determines when our SLO should be considered breached. In our case, we consider any number of error logs a breach of the SLO.
We will use the Threshold expression with 0 as the threshold value, indicating that any workload that has more than 0 error logs should count as a breach.
Note the firing status for all of the returned results: each of them has more than 0 error logs in the last hour, breaching our SLO condition.
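The threshold check itself is a strict "greater than" comparison applied to each returned series. The sketch below illustrates that logic in plain Python; the function name `breaching` and the sample counts are hypothetical, and the zero-count workload is included only to show what gets filtered out.

```python
def breaching(counts, threshold=0):
    """Return the workloads whose count is strictly above the threshold."""
    return {workload: n for workload, n in counts.items() if n > threshold}


# Hypothetical per-workload error counts.
counts = {"checkout": 2, "payments": 1, "frontend": 0}

print(breaching(counts))               # {'checkout': 2, 'payments': 1}
print(breaching(counts, threshold=1))  # {'checkout': 2}
```

With a threshold of 0, a single error log is enough for a workload to count as breaching; raising the threshold makes the condition more tolerant.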
The next step is instructing Grafana on how we want this alert to be evaluated:
Evaluation group: How often we want the rule to be evaluated
Pending period: How long we allow the SLO to be breached before an alert fires
For example, if we choose an evaluation group of 1m and a pending period of 3m, the alert condition is checked every minute, but an alert fires only if the breach persists for 3 consecutive minutes.
To give a concrete example, let's look at two different series of evaluations:
1m | 2m | 3m | 4m | 5m | Result |
---|---|---|---|---|---|
BREACHED | BREACHED | OK | BREACHED | BREACHED | OK |
OK | BREACHED | BREACHED | BREACHED | BREACHED | FIRING |
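Even though both examples have the same number of evaluations that breached the SLO, only the second one fires an alert. In the second series the breach starts at 2m and is still ongoing at 5m, so it has lasted for the full pending period of 3 consecutive minutes. In the first series the OK evaluation at 3m interrupts the breach, so neither streak lasts long enough.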
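The two series above can be replayed with a small simulation. This is a simplified model of the pending-period behavior described here, not Grafana's implementation; the function name `alert_fires` and its parameters are assumptions for the sketch.

```python
def alert_fires(evaluations, interval_min=1, pending_min=3):
    """Return True if an alert would fire for this series of evaluations.

    evaluations: list of booleans, True meaning the condition was breached.
    A breach streak starts at the first breached evaluation and is reset
    by any OK evaluation. The alert fires once the streak has lasted at
    least pending_min minutes.
    """
    breach_start = None
    for i, breached in enumerate(evaluations):
        now = (i + 1) * interval_min  # evaluation time in minutes
        if not breached:
            breach_start = None  # streak interrupted
            continue
        if breach_start is None:
            breach_start = now  # streak begins
        if now - breach_start >= pending_min:
            return True
    return False


BREACHED, OK = True, False
first = [BREACHED, BREACHED, OK, BREACHED, BREACHED]
second = [OK, BREACHED, BREACHED, BREACHED, BREACHED]

print(alert_fires(first))   # False
print(alert_fires(second))  # True
```

In the second series the streak begins at 2m and reaches the 3-minute pending period at 5m, which is when the alert fires.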
The next step is to add any extra labels to the fired alert, which can be used when deciding how to handle it. For example, labels such as team and severity could be used to decide which contact point should be used.
In the notifications part, we can choose either to use the assigned labels to route the alert, or to select a contact point directly.
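Label-based routing can be pictured as matching the alert's labels against an ordered list of rules. The sketch below is only an illustration of that idea, not Grafana's notification policy engine; the route table, the contact point names, and the function `pick_contact_point` are all hypothetical.

```python
# Hypothetical routing table: (required labels, contact point).
# The first rule whose labels all match the alert wins.
ROUTES = [
    ({"team": "payments", "severity": "critical"}, "payments-oncall"),
    ({"severity": "critical"}, "default-oncall"),
]
DEFAULT_CONTACT_POINT = "email-digest"  # hypothetical fallback


def pick_contact_point(labels):
    """Return the contact point of the first route matching the labels."""
    for matchers, contact_point in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return contact_point
    return DEFAULT_CONTACT_POINT


print(pick_contact_point({"team": "payments", "severity": "critical"}))  # payments-oncall
print(pick_contact_point({"team": "search", "severity": "critical"}))    # default-oncall
print(pick_contact_point({"team": "search", "severity": "warning"}))     # email-digest
```

Selecting a contact point directly skips this matching step: every firing of the rule goes to the one contact point chosen on the rule.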