Monitor YAML structure
While we strongly suggest building Monitors using our Wizard or Catalog, groundcover also supports building and editing Monitors in YAML. If you choose to do so, the definitions below describe the available fields.
Monitor fields explained
In this section, you'll find a breakdown of the key fields used to define and configure Monitors within the groundcover platform. Each field plays a critical role in how a Monitor behaves, what data it tracks, and how it responds to specific conditions. Understanding these fields will help you set up effective Monitors to track performance, detect issues, and provide timely alerts.
Below is a detailed explanation of each field, along with examples to illustrate their usage, ensuring your team can manage and respond to incidents efficiently.
Title
A string that defines the human-readable name of the Monitor. The title is what you will see in the list of all existing Monitors in the Monitors section.
Description
Additional information about the Monitor.
Severity
The severity level shown on issues that this Monitor triggers. You can choose any of the following levels:
s1 for Critical
s2 for High
s3 for Medium
s4 for Low
Header
This is the header of the issues generated by the Monitor.
A short string describing the condition that is being monitored. You can also use it as a pattern that includes labels from your query, for example:
HTTP API Error {{ alert.labels.return_code }}
ResourceHeaderLabels
A list of labels that help you identify the resources that are related to the Monitor. This appears as a secondary header in all Issues tables across the platform.
For example, ["span_name", "kind"] for monitors on protocol issues.
ContextHeaderLabels
A list of contextual labels that help you identify the location of the issue. This appears as a subset of the Issue’s labels, and is displayed on all Issues tables across the platform.
["cluster", "namespace", "pod_name"]
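As a rough sketch, the identification and display fields described so far could be combined as follows. The nesting under display and the uppercase severity value follow the example Monitors later on this page; the title, description, and label names are illustrative only.

```yaml
# Illustrative fragment only: the names and values here are made up.
title: HTTP API Errors Monitor
severity: S2                  # High
display:
  # The label is filled in from the query results of each issue.
  header: HTTP API Error {{ alert.labels.return_code }}
  description: Detects HTTP API calls that return an error code.
  resourceHeaderLabels:       # secondary header in Issues tables
    - span_name
    - kind
  contextHeaderLabels:        # where the issue is located
    - cluster
    - namespace
    - pod_name
```

Any label referenced in the header or in the label lists presumably needs to be returned by the Monitor's query, as the selectors in the full examples do.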
Labels
A set of predefined labels that are attached to Issues generated by the Monitor. Labels can be static, or dynamic using the Monitor's query results.
team: sre_team
ExecutionErrorState
Defines the actions that take place when a Monitor encounters query execution errors.
Valid options are Alerting, OK and Error.
When Alerting is set, query execution errors will result in a firing issue.
When Error is set, query execution errors will result in an error state.
When OK is set, query execution errors will do neither of the above. This is the default setting.
NoDataState
This defines what happens when queries in the Monitor return empty datasets.
Valid options are NoData, Alerting and OK.
When NoData is set, the issue instance state will be No Data.
When OK is set, the issue instance state will be Pending. It will change to Alerting once the pending period of the Monitor ends. This is the default setting.
Interval
Defines how frequently the Monitor evaluates the conditions. Common intervals could be 1m, 5m, etc.
PendingFor
Defines the period, spanning consecutive evaluation intervals, during which the threshold condition must be met before the alert is triggered.
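A minimal sketch of how Interval and PendingFor fit together, assuming both are nested under evaluationInterval as in the example Monitors on this page. The durations are illustrative.

```yaml
# Fragment only: evaluate every 5 minutes and require the condition
# to keep holding for 10 minutes before the alert fires.
evaluationInterval:
  interval: 5m
  pendingFor: 10m
```

The example Monitors use pendingFor: 0s, which appears to trigger the alert as soon as an evaluation meets the threshold.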
Trigger
Defines the condition under which the Monitor fires. This is the definition of the threshold for the Monitor, with op (the operator) and value. For example:
op: gt, value: 5
Model
Describes the queries, thresholds and data processing of the Monitor. It can have the following fields:
Queries: A list of one or more queries to run. A query can be SQL over ClickHouse, PromQL over VictoriaMetrics, or SqlPipeline. Each query has a name for reference in the Monitor.
Thresholds: The thresholds of your Monitor. A threshold has a name, an inputName for its data input, an operator (one of gt, lt, within_range, outside_range), and an array of values, which are the threshold values.
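As a sketch of the threshold definition described above, the fragment below shows one single-value threshold and one range threshold. The field names follow the example Monitors on this page. The use of two entries in values as the lower and upper bounds of within_range is an assumption, as is whether the bounds are inclusive.

```yaml
# Fragment of the model section. The threshold names are illustrative.
thresholds:
  - name: too_many_errors
    inputName: threshold_input_query   # refers to a query by its name
    operator: gt
    values:
      - 150
  - name: errors_in_band
    inputName: threshold_input_query
    operator: within_range
    values:                            # assumed to be lower and upper bound
      - 10
      - 100
```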
measurementType
Describes how issues of this Monitor are presented. Some Monitors count events and some track a state, and the two are displayed differently in our dashboards.
state - Presents issues in a line chart.
event - Presents issues in a bar chart, counting events.
Monitor YAML Examples
Traces Based Monitors
MySQL Query Errors Monitor
title: MySQL Query Errors Monitor
display:
  header: MySQL Error {{ alert.labels.statusCode }}
  description: This monitor detects MySQL Query errors.
  resourceHeaderLabels:
    - span_name
    - role
  contextHeaderLabels:
    - cluster
    - namespace
    - workload
severity: S3
measurementType: event
model:
  queries:
    - name: threshold_input_query
      dataType: traces
      sqlPipeline:
        selectors:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
            alias: bucket_timestamp
          - key: statusCode
            origin: root
            type: string
            alias: statusCode
          - key: span_name
            origin: root
            type: string
            alias: span_name
          - key: cluster
            origin: root
            type: string
            alias: cluster
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: role
            origin: root
            type: string
            alias: role
          - key: workload
            origin: root
            type: string
            alias: workload
          - key: "*"
            origin: root
            type: string
            processors:
              - op: count
            alias: logs_total
        groupBy:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
          - key: statusCode
            origin: root
            type: string
            alias: statusCode
          - key: span_name
            origin: root
            type: string
            alias: span_name
          - key: cluster
            origin: root
            type: string
            alias: cluster
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: role
            origin: root
            type: string
            alias: role
          - key: workload
            origin: root
            type: string
            alias: workload
        orderBy:
          - selector:
              key: bucket_timestamp
              origin: root
              type: string
            direction: ASC
        limit:
        filters:
          operator: and
          conditions:
            - filters:
                - op: match
                  value: mysql
              key: eventType
              origin: root
              type: string
            - filters:
                - op: match
                  value: error
              key: status
              origin: root
              type: string
            - filters:
                - op: match
                  value: eBPF
              key: source
              origin: root
              type: string
      instantRollup: 1 minutes
  thresholds:
    - name: threshold_1
      inputName: threshold_input_query
      operator: gt
      values:
        - 0
executionErrorState: OK
noDataState: OK
evaluationInterval:
  interval: 1m
  pendingFor: 0s
labels:
  team: infra
gRPC API Errors Monitor
title: gRPC API Errors Monitor
display:
  header: gRPC API Error {{ alert.labels.statusCode }}
  description: This monitor detects gRPC API errors by identifying responses with a non-zero status code.
  resourceHeaderLabels:
    - span_name
    - role
  contextHeaderLabels:
    - cluster
    - namespace
    - workload
severity: S3
measurementType: event
model:
  queries:
    - name: threshold_input_query
      dataType: traces
      sqlPipeline:
        selectors:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
            alias: bucket_timestamp
          - key: statusCode
            origin: root
            type: string
            alias: statusCode
          - key: span_name
            origin: root
            type: string
            alias: span_name
          - key: cluster
            origin: root
            type: string
            alias: cluster
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: role
            origin: root
            type: string
            alias: role
          - key: workload
            origin: root
            type: string
            alias: workload
          - key: "*"
            origin: root
            type: string
            processors:
              - op: count
            alias: logs_total
        groupBy:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
          - key: statusCode
            origin: root
            type: string
            alias: statusCode
          - key: span_name
            origin: root
            type: string
            alias: span_name
          - key: cluster
            origin: root
            type: string
            alias: cluster
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: role
            origin: root
            type: string
            alias: role
          - key: workload
            origin: root
            type: string
            alias: workload
        orderBy:
          - selector:
              key: bucket_timestamp
              origin: root
              type: string
            direction: ASC
        limit:
        filters:
          operator: and
          conditions:
            - filters:
                - op: match
                  value: grpc
              key: eventType
              origin: root
              type: string
            - filters:
                - op: ne
                  value: "0"
              key: statusCode
              origin: root
              type: string
            - filters:
                - op: match
                  value: error
              key: status
              origin: root
              type: string
            - filters:
                - op: match
                  value: eBPF
              key: source
              origin: root
              type: string
      instantRollup: 1 minutes
  thresholds:
    - name: threshold_1
      inputName: threshold_input_query
      operator: gt
      values:
        - 0
executionErrorState: OK
noDataState: OK
evaluationInterval:
  interval: 1m
  pendingFor: 0s
Log Based Monitors
High Error Log Rate Monitor
title: High Error Log Rate Monitor
severity: S4
display:
  header: High Log Error Rate
  description: This monitor will trigger an alert when we have a rate of error logs.
  resourceHeaderLabels:
    - workload
  contextHeaderLabels:
    - cluster
    - namespace
evaluationInterval:
  interval: 1m
  pendingFor: 0s
model:
  queries:
    - name: threshold_input_query
      dataType: logs
      sqlPipeline:
        selectors:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
            alias: bucket_timestamp
          - key: workload
            origin: root
            type: string
            alias: workload
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: cluster
            origin: root
            type: string
            alias: cluster
          - key: "*"
            origin: root
            type: string
            processors:
              - op: count
            alias: logs_total
        groupBy:
          - key: _time
            origin: root
            type: string
            processors:
              - op: toStartOfInterval
                args:
                  - 1 minutes
          - key: workload
            origin: root
            type: string
            alias: workload
          - key: namespace
            origin: root
            type: string
            alias: namespace
          - key: cluster
            origin: root
            type: string
            alias: cluster
        orderBy:
          - selector:
              key: bucket_timestamp
              origin: root
              type: string
            direction: ASC
        limit:
        filters:
          conditions:
            - filters:
                - op: match
                  value: error
              key: level
              origin: root
              type: string
          operator: and
      instantRollup: 1 minutes
  thresholds:
    - name: threshold_1
      inputName: threshold_input_query
      operator: gt
      values:
        - 150
noDataState: OK
measurementType: event