Monitor Examples

Monitors with a single PromQL query

Workload High API Error Rate Monitor

Purpose: This monitor alerts when the API error rate of a workload exceeds 5%, which indicates potential issues affecting user experience. It calculates the error rate from the groundcover_workload_issue_counter and groundcover_workload_total_counter metrics (see: Metrics & Labels), grouped by clusterId, namespace, and workload_name. The query runs as an instant query (the default behavior), and its result is clamped to the 0-100 range before being compared against the 5% threshold.

title: Workload High API Error Rate Monitor
severity: S2
display:
  header: Workload High API Error Rate Monitor
  resourceHeaderLabels: 
    - workload
    - pod_name
  contextHeaderLabels: 
    - cluster
    - namespace
    - workload
  description: This Monitor fires when the workload's APIs are failing to handle a significant proportion of requests, which can negatively impact user experience and trust.

labels:
  team: backend_team

annotations:
  runbook: https://docs.groundcover.com/api_error_rate_runbook

executionErrorState: Alerting
noDataState: NoData

evaluationInterval:
  interval: 1m
  pendingFor: 1m
      
model:
  queries:
    - name: api_error_rate_query
      expression: |
        clamp((sum by (clusterId, namespace, workload_name) (increase(groundcover_workload_issue_counter{role="server"})) / sum by (clusterId, namespace, workload_name) (increase(groundcover_workload_total_counter{role="server"}))) * 100, 0, 100)
      datasourceType: prometheus
      queryType: instant
  thresholds:
    - name: api_error_rate_threshold
      inputName: api_error_rate_query
      operator: gt
      values: 
        - 5
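
The arithmetic behind the expression and threshold above can be sketched in Python. This is an illustration only, not groundcover's implementation: the function names and the sample numbers are hypothetical, and treating a NaN result (no traffic) as "not breaching" is an assumption of the sketch.

```python
def error_rate_percent(issue_increase: float, total_increase: float) -> float:
    """Error percentage clamped to [0, 100], like clamp(..., 0, 100) in the query."""
    if total_increase == 0:
        # Floating-point division: 0/0 is NaN, x/0 is +Inf (which clamps to 100).
        return float("nan") if issue_increase == 0 else 100.0
    return min(max(issue_increase * 100.0 / total_increase, 0.0), 100.0)


def breaches(rate: float, threshold: float = 5.0) -> bool:
    """Mirror of `operator: gt` with `values: [5]`. NaN compares as False (assumption)."""
    return rate > threshold


# Hypothetical increases per (clusterId, namespace, workload_name) group:
# (issue counter increase, total counter increase)
samples = {
    ("prod", "shop", "checkout"): (12.0, 200.0),  # 6%: above the threshold
    ("prod", "shop", "catalog"): (5.0, 100.0),    # exactly 5%: gt is strict, no alert
    ("prod", "shop", "search"): (0.0, 0.0),       # no traffic: NaN, no alert
}

rates = {key: error_rate_percent(*value) for key, value in samples.items()}
alerts = [key for key, rate in rates.items() if breaches(rate)]
print(alerts)
```

Because the operator is gt rather than a greater-or-equal comparison, a workload sitting at exactly 5% does not breach the threshold in this sketch.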

Monitors with a single ClickHouse query

Workload Pods Crashed Monitor

Purpose: This monitor triggers an alert when a container in a pod crashes. It queries the events table in ClickHouse for container_crash events from the last minute and alerts immediately, with no pending period (pendingFor: 0s), because even a single crash warrants an alert. The threshold is "greater than 0", so any crash is flagged.

title: Workload Pods Crashed Monitor
severity: S1
display:
  header: Pod Crashed
  resourceHeaderLabels:
    - pod_name
    - container
    - exit_code
    - reason
  contextHeaderLabels:
    - cluster
    - namespace
    - workload
  description: This Monitor fires when a pod has crashed, leading to potential application instability.

executionErrorState: Alerting
noDataState: NoData

evaluationInterval:
  interval: 1m
  pendingFor: 0s

model:
  queries:
    - name: pods_crashed_query
      expression: |
        SELECT
          count(),
          cluster,
          entity_workload AS workload,
          entity_namespace AS namespace,
          string_attributes['podName'] AS pod_name,
          entity_name AS container,
          toString(float_attributes['exitCode']) AS exit_code,
          string_attributes['reason'] AS reason,
          env_name
        FROM groundcover.events
        WHERE timestamp > now() - INTERVAL '1 minutes'
          AND type = 'container_crash'
        GROUP BY
          cluster,
          entity_workload,
          entity_namespace,
          string_attributes['podName'],
          container,
          float_attributes['exitCode'],
          string_attributes['reason'],
          env_name
      datasourceType: clickhouse
  thresholds:
    - name: pods_crashed_threshold
      inputName: pods_crashed_query
      operator: gt
      values: 
        - 0
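
The grouping and threshold logic of this monitor can be sketched in Python to show why a threshold of 0 flags every crash. This is an illustration under stated assumptions, not how groundcover evaluates the query: the event rows, the "other_event" type, and the helper names are hypothetical, and env_name is left out of the grouping key for brevity.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

NOW = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # fixed "now" for the example

# Columns used as the grouping key, named after the aliases in the query.
GROUP_KEYS = ("cluster", "workload", "namespace", "pod_name", "container", "exit_code", "reason")


def make_event(event_type, age_seconds, pod_name):
    """Build a hypothetical event row that occurred `age_seconds` before NOW."""
    return {
        "type": event_type, "timestamp": NOW - timedelta(seconds=age_seconds),
        "cluster": "prod", "workload": "checkout", "namespace": "shop",
        "pod_name": pod_name, "container": "api", "exit_code": "137", "reason": "OOMKilled",
    }


events = [
    make_event("container_crash", 20, "checkout-abc"),   # inside the 1-minute window
    make_event("container_crash", 40, "checkout-abc"),   # same group, also inside
    make_event("container_crash", 300, "checkout-xyz"),  # too old, filtered out
    make_event("other_event", 10, "checkout-abc"),       # wrong type, filtered out
]


def crash_counts(rows, now, window=timedelta(minutes=1)):
    """Count container_crash rows newer than now - window, per grouping key."""
    counts = Counter()
    for row in rows:
        if row["type"] == "container_crash" and row["timestamp"] > now - window:
            counts[tuple(row[k] for k in GROUP_KEYS)] += 1
    return counts


def firing(counts, threshold=0):
    """Groups whose count satisfies `operator: gt` against the threshold."""
    return [key for key, n in counts.items() if n > threshold]


counts = crash_counts(events, NOW)
print(firing(counts))
```

Every group that appears in the result has a count of at least 1, so with a threshold of 0 each returned group breaches it.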

HTTP 5xx Errors Monitor

Purpose: This monitor alerts when HTTP 5xx errors (e.g., HTTP 504) are detected, which can indicate server-side issues. It queries the traces table in ClickHouse, where groundcover stores spans. The header of this monitor is dynamic: it displays the value of the return_code label from the query results.

title: HTTP 5xx Errors Monitor
severity: S3
display:
  header: HTTP API Error {{ .labels.return_code }}
  resourceHeaderLabels:
    - clustered_resource
    - kind
  contextHeaderLabels:
    - cluster
    - namespace
    - workload
  description: This Monitor fires when HTTP 5xx errors are detected, indicating potential server-side issues.
  
executionErrorState: Alerting
noDataState: NoData

evaluationInterval:
  interval: 1m
  pendingFor: 1m

model:
  queries:
    - name: http_5xx_query
      expression: |
        SELECT
          clustered_resource,
          subtype AS method,
          cluster,
          namespace,
          workload,
          kind,
          return_code,
          count() AS count
        FROM traces
        WHERE start_timestamp > now() - INTERVAL '1 minutes'
          AND protocol_type = 'http'
          AND source = 'eBPF'
          AND status != 'ok'
          AND toInt16(return_code) >= 500
        GROUP BY
          return_code,
          clustered_resource,
          subtype,
          cluster,
          namespace,
          workload,
          kind
      datasourceType: clickhouse
  thresholds:
    - name: http_5xx_threshold
      inputName: http_5xx_query
      operator: gt
      values: 
        - 0
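
The dynamic header in this monitor can be illustrated with a small Python stand-in for the template substitution. This is a sketch only: it uses a regular expression rather than the real template engine, the function name and the sample row are hypothetical, and rendering a missing label as an empty string is an assumption.

```python
import re

# Matches placeholders of the form {{ .labels.<name> }}, as used in the header above.
_PLACEHOLDER = re.compile(r"\{\{\s*\.labels\.(\w+)\s*\}\}")


def render_header(template: str, labels: dict) -> str:
    """Replace each {{ .labels.<name> }} with the matching label value."""
    def substitute(match):
        # Assumption: a label missing from the row renders as an empty string.
        return str(labels.get(match.group(1), ""))
    return _PLACEHOLDER.sub(substitute, template)


# Hypothetical result row from the query, one per group.
row = {"return_code": "504", "workload": "checkout", "namespace": "shop", "cluster": "prod"}

print(render_header("HTTP API Error {{ .labels.return_code }}", row))  # HTTP API Error 504
```

Because return_code is part of the GROUP BY clause, each distinct status code yields its own row and therefore its own header text.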
