SQL Based Monitors

Some use cases involve complex queries and conditions for triggering a monitor that go beyond the built-in query logic provided in the groundcover logs page query language.

An example of such a use case is comparing a set of logs to the same logs from a past period. This is not something regularly available in log search, but it can definitely be worth alerting on: if the number of errors for a group of logs changes dramatically compared to the previous week, that could be an event to alert on and investigate.

For such use cases you can harness the powerful ClickHouse SQL language to create an SQL-based monitor within groundcover.

ClickHouse within groundcover

Whether you use groundcover as a managed InCloud deployment or deploy it directly on your cluster, log and trace telemetry data is stored in a ClickHouse database.

You can directly query this data using SQL statements and create powerful monitors.

To create and test your SQL queries, use the Grafana Explore page within the groundcover app.

Select the ClickHouse@groundcover datasource with the SQL Editor option to start crafting your SQL queries.

Start with show tables; to see all the available tables for your queries: logs and traces are popular choices (table names are case sensitive).
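As a quick sketch, the first commands in the SQL Editor might look like this (DESCRIBE TABLE is standard ClickHouse for inspecting a table's columns):

```sql
-- List all tables available in the groundcover ClickHouse database
-- (table names are case sensitive)
show tables;

-- Inspect the columns of the logs table before writing queries against it
DESCRIBE TABLE logs;
```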

Query best practices

While testing your queries, always use LIMIT to restrict your results to a small set of data.

SELECT * FROM logs LIMIT 10;

To apply the Grafana timeframe to your queries, make sure to add the following conditions:

Logs: WHERE $__timeFilter(timestamp)

Traces: WHERE $__timeFilter(start_timestamp)
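For example, a minimal logs query that respects the Grafana time picker could look like the sketch below ($__timeFilter is expanded by the Grafana ClickHouse datasource; the selected columns are taken from the examples on this page):

```sql
-- Return a small sample of logs within the Grafana dashboard timeframe
SELECT timestamp, workload, level
FROM logs
WHERE $__timeFilter(timestamp)
LIMIT 10;
```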

Note: When querying logs with SQL, it's crucial to use efficient filters to prevent timeouts and improve performance. Primary filters such as cluster, workload, namespace, and env significantly speed up queries; always include them when writing your queries.
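A hedged sketch of such a filtered query is shown below; the cluster, namespace, and workload values are hypothetical placeholders for your own environment:

```sql
-- Narrow by primary filters first so ClickHouse can prune data efficiently
SELECT count() AS error_count
FROM logs
WHERE $__timeFilter(timestamp)
  AND cluster = 'prod-cluster'       -- hypothetical cluster name
  AND namespace = 'checkout'         -- hypothetical namespace
  AND workload = 'payment-service'   -- hypothetical workload
  AND level = 'error';
```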

Filtering on attributes and tags

Traces and logs have rich context that is normally stored in dedicated columns in JSON format. Accessing this context to filter and retrieve values is a common need when querying the data.

To get to the relevant context item, either in the attributes or the tags, use the following syntax:

WHERE string_attributes['host_name'] = 'my.host'

WHERE string_tags['cloud.name'] = 'aws'

WHERE float_attributes['headline_count'] = 4

WHERE float_tags['container.size'] = 22.4

To use the float context, ensure that the relevant attributes or tags are indeed numeric. To do that, check the relevant log in JSON format and verify that the referenced field is not wrapped in quotes (for example, headline_count in the screenshot below).
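Putting the map lookups together, a sketch of a full filter using the attribute names from the examples above might look like this:

```sql
-- Combine string and numeric attribute lookups in one filter
SELECT timestamp, workload
FROM logs
WHERE $__timeFilter(timestamp)
  AND string_attributes['host_name'] = 'my.host'
  AND float_attributes['headline_count'] > 3
LIMIT 10;
```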

SQL Query structure for a monitor

To use an SQL query for a monitor, make sure the query returns exactly one numeric field - this is the monitored field on which the threshold is placed.

The query can also contain any number of "group by" fields that are passed to the monitor as context labels.

Here is an example of an SQL query that can be used for a monitor:

with engineStatusLastWeek as (
  select string_attributes['tenantID'] tenantID, string_attributes['env'] env, max(float_attributes['engineStatus.numCylinders']) cylinders
  from logs
  where timestamp >= now() - interval 7 day
    and workload = 'engine-processing'
    and string_attributes['tenantID'] != ''
  group by tenantID, env
),
engineStatusNow as (
  select string_attributes['tenantID'] tenantID, string_attributes['env'] env, min(float_attributes['engineStatus.numCylinders']) cylinders
  from logs
  where timestamp >= now() - interval 10 minute
    and workload = 'engine-processing'
    and string_attributes['tenantID'] != ''
  group by tenantID, env
)
select n.tenantID, n.env, n.cylinders/lw.cylinders AS threshold
from engineStatusNow n
left join engineStatusLastWeek lw using (tenantID, env)
where n.cylinders/lw.cylinders <= 0.5

In this query the threshold field is the ratio between a value measured over the last 10 minutes and the same value measured over the last week.

tenantID and env are the group by labels that are passed to the monitor as context labels.


Here is another query example, which checks the percentage of errors in a set of logs. A single numeric value is calculated and grouped by cluster, namespace, and workload:

SELECT cluster, namespace, workload,
    round(100.0 * countIf(level = 'error') / nullIf(count(), 0), 2) AS error_ratio_pct
FROM "groundcover"."logs"
WHERE timestamp >= now() - interval '10 minute'
  AND namespace IN ('refurbished', 'interface')
GROUP BY cluster, namespace, workload

Applying the SQL query as a monitor

Applying an SQL query can only be done in YAML mode. You can use the following YAML template to add your query:

title: "[SQL] Monitor name"
display:
  header: Monitor description
severity: S2
measurementType: event
model:
  queries:
    - name: threshold_input_query
      expression: "[YOUR SQL QUERY GOES HERE]"
      datasourceType: clickhouse
      queryType: instant
  thresholds:
    - name: threshold_1
      inputName: threshold_input_query
      operator: lt
      values:
        - 0.5
annotations:
  [Workflow Name]: enabled
executionErrorState: OK
noDataState: OK
evaluationInterval:
  interval: 3m
  pendingFor: 2m
isPaused: false
  1. Give your monitor a name and a description

  2. Paste your SQL query in the expression field

  3. Set the threshold value and the relevant operator - in this example it is "lower than" 0.5 (< 0.5)

  4. Set your workflow name in the annotations section

  5. Set the check interval and the pending time

  6. Save the monitor
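To illustrate the steps above, here is a filled-in sketch of the template using the error-ratio query from earlier. The monitor name, description, workflow name (slack-alerts), and the "greater than 5" threshold are all illustrative placeholders to adapt to your own setup:

```yaml
title: "[SQL] High error ratio per workload"
display:
  header: Alerts when more than 5% of a workload's logs are errors
severity: S2
measurementType: event
model:
  queries:
    - name: threshold_input_query
      expression: "SELECT cluster, namespace, workload, round(100.0 * countIf(level = 'error') / nullIf(count(), 0), 2) AS error_ratio_pct FROM logs WHERE timestamp >= now() - interval '10 minute' GROUP BY cluster, namespace, workload"
      datasourceType: clickhouse
      queryType: instant
  thresholds:
    - name: threshold_1
      inputName: threshold_input_query
      operator: gt
      values:
        - 5
annotations:
  slack-alerts: enabled
executionErrorState: OK
noDataState: OK
evaluationInterval:
  interval: 3m
  pendingFor: 2m
isPaused: false
```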

It is helpful to indicate in the monitor name that the monitor is SQL based. For example, add an [SQL] prefix or suffix to the monitor name, as shown in the template.
