Monitor YAML structure

While we strongly suggest building monitors using our Wizard or Catalog, groundcover supports building and editing your Monitors using YAML. If you choose to do so, the following will provide you the necessary definitions.

Monitor fields explained

In this section, you'll find a breakdown of the key fields used to define and configure Monitors within the groundcover platform. Each field plays a critical role in how a Monitor behaves, what data it tracks, and how it responds to specific conditions. Understanding these fields will help you set up effective Monitors to track performance, detect issues, and provide timely alerts.

Below is a detailed explanation of each field, along with examples to illustrate their usage, ensuring your team can manage and respond to incidents efficiently.

Field

Explanation

Example

Title

A string that defines the human-readable name of the Monitor. The title is what you will see in the list of all existing Monitors in the Monitors section.

Description

Additional information about the Monitor.

Severity

When triggered, this will show the severity level of the Monitor's issue. You can set any severity you want here.

s1 for Critical

s2 for High

s3 for Medium

s4 for Low

Header

This is the header of the generated issues from the Monitor.

A short string describing the condition that is being monitored. You can also use this as a pattern using labels from you query.

“HTTP API Error {{ alert.labels.return_code}}”

ResourceHeaderLabels

A list of labels that help you identify the resources that are related to the Monitor. This appear as a secondary header in all Issues tables across the platform.

["span_name", "kind"] for monitors on protocol issues.

ContextHeaderLabels

A list of contextual labels that help you identify the location of the issue. This appears as a subset of the Issue’s labels, and is displayed on all Issues tables across the platform.

["cluster", "namespace", "pod_name"]

Labels

A set of pre-defined labels that were set to Issues related to the selected Monitor. Labels can be static, or dynamic using a Monitor's query results.

team: sre_team

ExecutionErrorState

Defines the actions that take place when a Monitor encounters query execution errors.

Valid options are Alerting, OK and Error.

When Alerting is set, query execution errors will result in a firing issue.
When Error is set, query execution errors will result in an error state.
When OK is set, query execution errors will do neither of the above. This is the default setting

NoDataState

This defines what happens when queries in the Monitor return empty datasets.

Valid options are: NoData , Alerting, OK

When NoData is set, issue instances state will be: No Data.
When OK is set, issues instance state will be Pending. The will change to Alerting once the pending period of the monitor ends. This is the dafault setting

Interval

Defines how frequently the Monitor evaluates the conditions. Common intervals could be 1m, 5m, etc.

PendingFor

Defines the period of consecutive intervals where threshold condition must be met to trigger the alert.

Trigger

Defines the condition under which the Monitor fires. This is the definition of threshold for the Monitor, with op - operator and value .

op: gt, value: 5

Model

Describes the queries, thresholds and data processing of the Monitor. It can have the following fields:

Queries: List of one or more queries to run, this can be either SQL over ClickHouse, PromQL over VictoriaMetrics, SqlPipeline. Each query will have a name for reference in the monitor.

Thresholds: This is the threshold of your Monitor, a threshold has a name, inputName for data input, operator one of gt , lt , within_range, outside_range and array of values which are the threshold values.

measurementType

Describe how will we present issues of this Monitor. Some Monitors count events, and some a state. And we will display them differently in our dashboards.

state - Will present issues in line chart.
event - Will present issues in bar chart, counting events.

Monitor YAML Examples

Traces Based Monitors

MySQL Query Errors Monitor

title: MySQL Query Errors Monitor
display:
  header: MySQL Error {{ alert.labels.statusCode }}
  description: This monitor detects MySQL Query errors.
  resourceHeaderLabels:
    - span_name
    - role
  contextHeaderLabels:
    - cluster
    - namespace
    - workload
severity: S3
measurementType: event
model:
  queries:
    - name: threshold_input_query
      dataType: traces
      sqlPipeline:
        selectors:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
            alias: bucket_timestamp
          - key: statusCode
            origin: root
            type: string
            alias: statusCode
          - key: span_name
            origin: root
            type: string
            alias: span_name
          - key: cluster
            origin: root
            type: string
            alias: cluster
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: role
            origin: root
            type: string
            alias: role
          - key: workload
            origin: root
            type: string
            alias: workload
          - key: "*"
            origin: root
            type: string
            processors:
              - op: count
            alias: logs_total
        groupBy:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
          - key: statusCode
            origin: root
            type: string
            alias: statusCode
          - key: span_name
            origin: root
            type: string
            alias: span_name
          - key: cluster
            origin: root
            type: string
            alias: cluster
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: role
            origin: root
            type: string
            alias: role
          - key: workload
            origin: root
            type: string
            alias: workload
        orderBy:
          - selector:
              key: bucket_timestamp
              origin: root
              type: string
            direction: ASC
        limit:
        filters:
          operator: and
          conditions:
            - filters:
                - op: match
                  value: mysql
              key: eventType
              origin: root
              type: string
            - filters:
                - op: match
                  value: error
              key: status
              origin: root
              type: string
            - filters:
                - op: match
                  value: eBPF
              key: source
              origin: root
              type: string
      instantRollup: 1 minutes
  thresholds:
    - name: threshold_1
      inputName: threshold_input_query
      operator: gt
      values:
        - 0
executionErrorState: OK
noDataState: OK
evaluationInterval:
  interval: 1m
  pendingFor: 0s
labels:
  team: infra

gRPC API Errors Monitor

title: gRPC API Errors Monitor
display:
  header: gRPC API Error {{ alert.labels.statusCode }}
  description: This monitor detects gRPC API errors by identifying responses with a non-zero status code.
  resourceHeaderLabels:
    - span_name
    - role
  contextHeaderLabels:
    - cluster
    - namespace
    - workload
severity: S3
measurementType: event
model:
  queries:
    - name: threshold_input_query
      dataType: traces
      sqlPipeline:
        selectors:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
            alias: bucket_timestamp
          - key: statusCode
            origin: root
            type: string
            alias: statusCode
          - key: span_name
            origin: root
            type: string
            alias: span_name
          - key: cluster
            origin: root
            type: string
            alias: cluster
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: role
            origin: root
            type: string
            alias: role
          - key: workload
            origin: root
            type: string
            alias: workload
          - key: "*"
            origin: root
            type: string
            processors:
              - op: count
            alias: logs_total
        groupBy:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
          - key: statusCode
            origin: root
            type: string
            alias: statusCode
          - key: span_name
            origin: root
            type: string
            alias: span_name
          - key: cluster
            origin: root
            type: string
            alias: cluster
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: role
            origin: root
            type: string
            alias: role
          - key: workload
            origin: root
            type: string
            alias: workload
        orderBy:
          - selector:
              key: bucket_timestamp
              origin: root
              type: string
            direction: ASC
        limit:
        filters:
          operator: and
          conditions:
            - filters:
                - op: match
                  value: grpc
              key: eventType
              origin: root
              type: string
            - filters:
                - op: ne
                  value: "0"
              key: statusCode
              origin: root
              type: string
            - filters:
                - op: match
                  value: error
              key: status
              origin: root
              type: string
            - filters:
                - op: match
                  value: eBPF
              key: source
              origin: root
              type: string
      instantRollup: 1 minutes
  thresholds:
    - name: threshold_1
      inputName: threshold_input_query
      operator: gt
      values:
        - 0
executionErrorState: OK
noDataState: OK
evaluationInterval:
  interval: 1m
  pendingFor: 0s

Log Based Monitors

High Error Log Rate Monitor

title: High Error Log Rate Monitor
severity: S4
display:
  header: High Log Error Rate
  description: This monitor will trigger an alert when we have a rate of error logs.
  resourceHeaderLabels:
    - workload
  contextHeaderLabels:
    - cluster
    - namespace
evaluationInterval:
  interval: 1m
  pendingFor: 0s
model:
  queries:
    - name: threshold_input_query
      dataType: logs
      sqlPipeline:
        selectors:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
            alias: bucket_timestamp
          - key: workload
            origin: root
            type: string
            alias: workload
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: cluster
            origin: root
            type: string
            alias: cluster
          - key: "*"
            origin: root
            type: string
            processors:
              - op: count
            alias: logs_total
        groupBy:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
          - key: workload
            origin: root
            type: string
            alias: workload
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: cluster
            origin: root
            type: string
            alias: cluster
        orderBy:
          - selector:
              key: bucket_timestamp
              origin: root
              type: string
            direction: ASC
        limit:
        filters:
          conditions:
            - filters:
                - op: match
                  value: error
              key: level
              origin: root
              type: string
          operator: and
      instantRollup: 1 minutes
  thresholds:
    - name: threshold_1
      inputName: threshold_input_query
      operator: gt
      values:
        - 150
noDataState: OK
measurementType: event

Last updated 5 months ago