Purpose: This monitor alerts when the API error rate for a workload exceeds 5%, indicating potential issues impacting user experience. The monitor uses the groundcover_workload_issue_counter and groundcover_workload_total_counter (see: Metrics & Labels) metrics to calculate the error rate, grouped by clusterId, namespace, and workload_name. The query evaluates the error percentage using an instant query (default behavior), with a threshold set at 5%.
title: Workload High API Error Rate Monitor
severity: S2
display:
  header: Workload High API Error Rate Monitor
  resourceHeaderLabels:
    - workload
    - pod_name
  contextHeaderLabels:
    - cluster
    - namespace
    - workload
  description: This Monitor fires when the workload's APIs are failing to handle a significant proportion of requests, which can negatively impact user experience and trust.
labels:
  team: backend_team
annotations:
  runbook: https://docs.groundcover.com/api_error_rate_runbook
executionErrorState: Alerting
noDataState: NoData
evaluationInterval:
  interval: 1m
  pendingFor: 1m
model:
  queries:
    - name: api_error_rate_query
      expression: |
        clamp((sum by (clusterId, namespace, workload_name) (increase(groundcover_workload_issue_counter{role="server"})) / sum by (clusterId, namespace, workload_name) (increase(groundcover_workload_total_counter{role="server"}))) * 100, 0, 100)
      datasourceType: prometheus
      queryType: instant
  thresholds:
    - name: api_error_rate_threshold
      inputName: api_error_rate_query
      operator: gt
      values:
        - 5
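To make the arithmetic behind the expression and threshold concrete, here is a minimal Python sketch of the same calculation for a single (clusterId, namespace, workload_name) group. The function names are hypothetical and exist only for illustration; how the real query engine handles a zero denominator is not covered by this page, so the sketch simply returns `None` in that case.

```python
from typing import Optional


def error_rate_percent(issue_increase: float, total_increase: float) -> Optional[float]:
    """Error percentage clamped to [0, 100], mirroring clamp(... * 100, 0, 100)."""
    if total_increase == 0:
        # Assumption: treat "no requests" as "no value" rather than guessing
        # what the query engine would return.
        return None
    percent = issue_increase * 100 / total_increase
    return max(0.0, min(100.0, percent))


def breaches_threshold(rate: Optional[float], threshold: float = 5.0) -> bool:
    """operator: gt means strictly greater than the threshold value."""
    return rate is not None and rate > threshold


# 6 failed requests out of 100 gives 6%, which is above the 5% threshold.
print(breaches_threshold(error_rate_percent(6, 100)))
# Exactly 5% does not breach, because the operator is gt rather than gte.
print(breaches_threshold(error_rate_percent(5, 100)))
```

Note that the monitor also sets pendingFor: 1m, so a single breaching evaluation is not necessarily enough to raise an alert.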
Monitors with a single ClickHouse Query
Workload Pods Crashed Monitor
Purpose: This monitor triggers an alert when a container in a pod crashes. It uses ClickHouse’s events table to detect container crashes immediately, with no pending period (pendingFor: 0s), as even a single crash warrants an alert. The threshold is set at 0 to ensure any crash is flagged.
title: Workload Pods Crashed Monitor
severity: S1
display:
  header: Pod Crashed
  resourceHeaderLabels:
    - pod_name
    - container
    - exit_code
    - reason
  contextHeaderLabels:
    - cluster
    - namespace
    - workload
  description: This Monitor fires when a pod has crashed, leading to potential application instability.
executionErrorState: Alerting
noDataState: NoData
evaluationInterval:
  interval: 1m
  pendingFor: 0s
model:
  queries:
    - name: pods_crashed_query
      expression: |
        select count(), cluster, entity_workload AS workload, entity_namespace AS namespace,
          string_attributes['podName'] as pod_name, entity_name as container,
          toString(float_attributes['exitCode']) as exit_code,
          string_attributes['reason'] as reason, env_name
        FROM groundcover.events
        where timestamp > now() - interval '1 minutes' AND type = 'container_crash'
        group by cluster, entity_workload, entity_namespace, string_attributes['podName'],
          container, float_attributes['exitCode'], string_attributes['reason'], env_name
      datasourceType: clickhouse
  thresholds:
    - name: pods_crashed_threshold
      inputName: pods_crashed_query
      operator: gt
      values:
        - 0
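The effect of pendingFor: 0s compared with a non-zero pending period can be illustrated with a small Python sketch. It assumes the common "condition must hold continuously for the pending period" semantics; the exact evaluation logic is not described on this page, and the function name is hypothetical.

```python
from datetime import timedelta
from typing import Optional, Sequence


def first_firing_evaluation(
    breaches: Sequence[bool],
    interval: timedelta,
    pending_for: timedelta,
) -> Optional[int]:
    """Return the index of the first evaluation at which an alert would fire.

    `breaches[i]` says whether the threshold was exceeded at evaluation i.
    Assumption: the alert fires once the threshold has been exceeded
    continuously for at least `pending_for`.
    """
    breach_started: Optional[int] = None
    for i, breached in enumerate(breaches):
        if not breached:
            breach_started = None  # the streak is broken, start over
            continue
        if breach_started is None:
            breach_started = i
        if (i - breach_started) * interval >= pending_for:
            return i
    return None


one_minute = timedelta(minutes=1)
crash_seen = [False, True, True]

# With pendingFor: 0s the first breaching evaluation fires immediately.
print(first_firing_evaluation(crash_seen, one_minute, timedelta(0)))
# With pendingFor: 1m the breach has to persist for one more evaluation.
print(first_firing_evaluation(crash_seen, one_minute, one_minute))
```

Under this assumption, a crash that appears in only one evaluation window would be missed with a 1m pending period, which is why the crash monitor uses 0s.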
HTTP 5xx Errors Monitor
Purpose: This monitor alerts when HTTP 5xx errors (e.g., HTTP 504) are detected, which can indicate server-side issues. It uses the traces ClickHouse table, which stores spans from groundcover. The header for this monitor is dynamic, displaying the label value (return_code) from the query results.
title: HTTP 5xx Errors Monitor
severity: S3
display:
  header: HTTP API Error {{ .labels.return_code }}
  resourceHeaderLabels:
    - clustered_resource
    - kind
  contextHeaderLabels:
    - cluster
    - namespace
    - workload
  description: This Monitor fires when HTTP 5xx errors are detected, indicating potential server-side issues.
executionErrorState: Alerting
noDataState: NoData
evaluationInterval:
  interval: 1m
  pendingFor: 1m
model:
  queries:
    - name: http_5xx_query
      expression: |
        select clustered_resource, subtype as method, cluster, namespace, workload, kind,
          return_code, count() as count
        from traces
        where start_timestamp > now() - interval '1 minutes'
          and protocol_type = 'http' and source = 'eBPF'
          and status != 'ok' and toInt16(return_code) >= 500
        group by return_code, clustered_resource, subtype, cluster, namespace, workload, kind
      datasourceType: clickhouse
  thresholds:
    - name: http_5xx_threshold
      inputName: http_5xx_query
      operator: gt
      values:
        - 0
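As a rough illustration of how the dynamic header resolves, the Python sketch below substitutes `{{ .labels.<name> }}` placeholders with label values from one query result row. This is a simplified stand-in, not groundcover's actual template engine; the `render_header` helper and the choice to render a missing label as an empty string are assumptions made for the example.

```python
import re
from typing import Mapping

# Matches placeholders of the form {{ .labels.some_name }}.
_PLACEHOLDER = re.compile(r"\{\{\s*\.labels\.(\w+)\s*\}\}")


def render_header(template: str, labels: Mapping[str, str]) -> str:
    """Replace each {{ .labels.X }} placeholder with the value of label X.

    Assumption: a label missing from the row renders as an empty string.
    """
    return _PLACEHOLDER.sub(lambda m: labels.get(m.group(1), ""), template)


# One row returned by the query, reduced to the labels relevant here.
row_labels = {"return_code": "504", "workload": "checkout"}
print(render_header("HTTP API Error {{ .labels.return_code }}", row_labels))
```

Because the query groups by return_code, each distinct status code produces its own row and therefore its own header, for example one issue titled with 502 and another with 504.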