Metrics & Labels

Infrastructure Metrics & Labels

Container CPU and Memory

Labels

type clusterId region namespace node_name workload_name pod_name container_name container_image

Metrics

| Name | Description | Unit | Type |
| --- | --- | --- | --- |
| groundcover_container_m_cpu_usage_seconds_total | cumulative CPU time consumed | mCPU * seconds | Counter |
| groundcover_container_cpu_request_m_cpu | K8s container CPU request | mCPU | Gauge |
| groundcover_container_cpu_limit_m_cpu | K8s container CPU limit | mCPU | Gauge |
| groundcover_container_mem_working_set_bytes | current memory working set | Bytes | Gauge |
| groundcover_container_memory_request_bytes | K8s container memory request | Bytes | Gauge |
| groundcover_container_memory_limit_bytes | K8s container memory limit | Bytes | Gauge |
| groundcover_container_cpu_delay_seconds | K8s container CPU delay | Seconds | Counter |
| groundcover_container_disk_delay_seconds | K8s container disk delay | Seconds | Counter |
| groundcover_container_cpu_throttled_seconds_total | K8s container total CPU throttling | Seconds | Counter |
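Because groundcover_container_m_cpu_usage_seconds_total is a counter measured in mCPU * seconds, the difference between two samples divided by the elapsed time gives the average mCPU in use over that window. That value can then be compared with groundcover_container_cpu_limit_m_cpu. The sketch below illustrates the arithmetic only; the function names and sample values are hypothetical, and it assumes the counter did not reset between the two samples.

```python
def average_m_cpu(counter_start: float, counter_end: float, elapsed_seconds: float) -> float:
    """Average mCPU used between two samples of a cumulative mCPU*seconds counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if counter_end < counter_start:
        # A counter that goes backwards usually means it was reset (assumption).
        raise ValueError("counter decreased between samples")
    return (counter_end - counter_start) / elapsed_seconds


def utilization_ratio(used_m_cpu: float, limit_m_cpu: float) -> float:
    """Fraction of the CPU limit in use; both arguments are in mCPU."""
    if limit_m_cpu <= 0:
        raise ValueError("limit_m_cpu must be positive")
    return used_m_cpu / limit_m_cpu


# Illustrative values: two counter samples taken 60 seconds apart.
used = average_m_cpu(120_000.0, 135_000.0, 60.0)
# Compare against a hypothetical 500 mCPU limit.
ratio = utilization_ratio(used, 500.0)
```

The same delta-over-time approach applies to the other Counter metrics on this page, such as the throttling and delay counters.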

Node CPU, Memory and Disk

Labels

type clusterId region node_name

Metrics

| Name | Description | Unit | Type |
| --- | --- | --- | --- |
| groundcover_node_allocatable_cpum_cpu | amount of allocatable CPU in the current node | mCPU | Gauge |
| groundcover_node_allocatable_mem_bytes | amount of allocatable memory in the current node | Bytes | Gauge |
| groundcover_node_mem_used_percent | percent of used memory in current node | 0-100 | Gauge |
| groundcover_node_used_disk_space | current used disk space in current node | Bytes | Gauge |
| groundcover_node_free_disk_space | amount of free disk space in current node | Bytes | Gauge |
| groundcover_node_total_disk_space | amount of total disk space in current node | Bytes | Gauge |
| groundcover_node_used_percent_disk_space | percent of used disk space in current node | 0-100 | Gauge |

Storage Usage

Labels

type clusterId region name namespace

Metrics

| Name | Description | Unit | Type |
| --- | --- | --- | --- |
| groundcover_pvc_usage_bytes | PVC usage | Bytes | Gauge |
| groundcover_pvc_capacity_bytes | PVC capacity | Bytes | Gauge |
| groundcover_pvc_available_bytes | PVC available | Bytes | Gauge |
| groundcover_pvc_usage_percent | percent of used PVC storage | 0-100 | Gauge |
| groundcover_pvc_read_bytes_total | total amount of bytes read by the workload from the PVC | Bytes | Counter |
| groundcover_pvc_write_bytes_total | total amount of bytes written by the workload to the PVC | Bytes | Counter |
| groundcover_pvc_reads_total | total amount of read operations done by the workload from the PVC | MiB/s | Counter |
| groundcover_pvc_writes_total | total amount of write operations done by the workload to the PVC | MiB/s | Counter |
| groundcover_pvc_read_latency | latency of read operation by the workload from the PVC | ms / μs | Summary |
| groundcover_pvc_write_latency | latency of write operation by the workload to the PVC | ms / μs | Summary |

Network Usage

Labels

clusterId workload_name namespace container_name remote_service_name remote_namespace remote_is_external availability_zone region remote_availability_zone remote_region is_cross_az protocol role server_port encryption transport_protocol is_loopback

Notes:

  • is_loopback and remote_is_external are special labels. is_loopback indicates that the remote service is the same service as the recording side (loopback); remote_is_external indicates that the remote service resides in an external network, e.g. a managed service outside of the cluster (external).

    • In both cases the remote_service_name and the remote_namespace labels will be empty

  • is_cross_az means the traffic was sent and/or received between two different availability zones. This is a helpful flag to quickly identify this special kind of communication.

    • The actual zones are detailed in the availability_zone and remote_availability_zone labels
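The notes above can be sketched as a small classifier over a label set. This is an illustration only: the function names are hypothetical, and it assumes the flag labels carry the string values "true" and "false", which this page does not specify.

```python
from typing import Mapping, Optional, Tuple


def _flag(labels: Mapping[str, str], name: str) -> bool:
    # Assumption: boolean labels are exported as the strings "true" / "false".
    return str(labels.get(name, "")).lower() == "true"


def classify_remote(labels: Mapping[str, str]) -> str:
    """Return 'loopback', 'external' or 'in-cluster' for a network metric label set."""
    if _flag(labels, "is_loopback"):
        return "loopback"
    if _flag(labels, "remote_is_external"):
        return "external"
    return "in-cluster"


def cross_az_pair(labels: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (local zone, remote zone) when is_cross_az is set, otherwise None."""
    if not _flag(labels, "is_cross_az"):
        return None
    return (
        labels.get("availability_zone", ""),
        labels.get("remote_availability_zone", ""),
    )


# Hypothetical label set for traffic to a managed service outside the cluster.
sample = {
    "workload_name": "checkout",
    "remote_is_external": "true",
    "is_loopback": "false",
    "remote_service_name": "",
    "remote_namespace": "",
    "is_cross_az": "true",
    "availability_zone": "zone-a",
    "remote_availability_zone": "zone-b",
}
kind = classify_remote(sample)
zones = cross_az_pair(sample)
```

For loopback and external traffic, remote_service_name and remote_namespace are empty, so the flags are what identify the remote side.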

Metrics

| Name | Description | Unit | Type |
| --- | --- | --- | --- |
| groundcover_network_rx_bytes_total | Bytes received by the workload | Bytes | Counter |
| groundcover_network_tx_bytes_total | Bytes sent by the workload | Bytes | Counter |
| groundcover_network_connections_opened_total | Connections opened by the workload | Number | Counter |
| groundcover_network_connections_closed_total | Connections closed by the workload | Number | Counter |

Application Metrics & Labels

| Label name | Description | Relevant types |
| --- | --- | --- |
| clusterId | Name identifier of the K8s cluster | All |
| region | Cloud provider region name | All |
| namespace | K8s namespace | All |
| workload_name | K8s workload (or service) name | All |
| pod_name | K8s pod name | All |
| container_name | K8s container name | All |
| container_image | K8s container image name | All |
| remote_namespace | Remote K8s namespace (other side of the communication) | All |
| remote_service_name | Remote K8s service name (other side of the communication) | All |
| remote_container_name | Remote K8s container name (other side of the communication) | All |
| type | The protocol in use (HTTP, gRPC, Kafka, DNS, etc.) | All |
| sub_type | The sub type of the protocol (GET, POST, etc.) | All |
| role | Role in the communication (client or server) | All |
| clustered_resource_name | The clustered name of the resource, depends on the protocol | All |
| status_code | "ok", "error" or "unset" | All |
| server | The server workload/name | All |
| client | The client workload/name | All |
| server_namespace | The server namespace | All |
| client_namespace | The client namespace | All |
| server_is_external | Indicates whether the server is external | All |
| client_is_external | Indicates whether the client is external | All |
| is_encrypted | Indicates whether the communication is encrypted | All |
| is_cross_az | Indicates whether the communication is cross availability zone | All |
| clustered_path | HTTP / gRPC aggregated resource path (e.g. /metrics/*) | http, grpc |
| method | HTTP / gRPC method (e.g. GET) | http, grpc |
| response_status_code | Return status code of an HTTP / gRPC request (e.g. 200 in HTTP) | http, grpc |
| dialect | SQL dialect (MySQL or PostgreSQL) | mysql, postgresql |
| response_status | Return status code of a SQL query (e.g. 42P01 for undefined table) | mysql, postgresql |
| client_type | Kafka client type (Fetcher / Producer) | kafka |
| topic | Kafka topic name | kafka |
| partition | Kafka partition identifier | kafka |
| error_code | Kafka return status code | kafka |
| query_type | Type of DNS query (e.g. AAAA) | dns |
| response_return_code | Return status code of a DNS resolution request (e.g. Name Error) | dns |
| exit_code | K8s container termination exit code | container_state, container_crash |
| state | K8s container current state (Running, Waiting or Terminated) | container_state |
| state_reason | K8s container state transition reason (e.g. CrashLoopBackOff or OOMKilled) | container_state |
| crash_reason | K8s container crash reason (e.g. Error, OOMKilled) | container_crash |
| pvc_name | K8s PVC name | storage |

Summary-based metrics have an additional quantile label, representing the percentile. Available values: ["0.5", "0.95", "0.99"].
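As an illustration of how the quantile label combines with the other labels, the sketch below builds a label selector string for a Summary metric. It assumes the metrics are queried with a Prometheus-style selector syntax, which this page does not state; the selector helper and the workload name "checkout" are hypothetical.

```python
def _escape(value: str) -> str:
    # Escape backslashes and double quotes so the value is safe inside "...".
    return value.replace("\\", "\\\\").replace('"', '\\"')


def selector(metric: str, **labels: str) -> str:
    """Build a Prometheus-style selector such as metric{label="value"}."""
    matchers = ",".join(
        '{}="{}"'.format(name, _escape(value))
        for name, value in sorted(labels.items())
    )
    return metric + "{" + matchers + "}" if matchers else metric


# 95th percentile latency of a hypothetical workload named "checkout".
p95 = selector(
    "groundcover_workload_latency_seconds",
    quantile="0.95",
    workload_name="checkout",
)
```

Labels are sorted by name so that the same inputs always produce the same selector string.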

We also use a set of internal labels which are not relevant in most use-cases. Find them interesting? Let us know over Slack!

issue_id entity_id resource_id query_id aggregation_id parent_entity_id perspective_entity_id perspective_entity_is_external perspective_entity_issue_id perspective_entity_name perspective_entity_namespace perspective_entity_resource_id

Golden Signals (Errors & Issues)

In the lists below, we describe error and issue counters. Every issue flagged by the platform is an error, but not every error is flagged as an issue.

Resource metrics

| Name | Description | Unit | Type |
| --- | --- | --- | --- |
| groundcover_resource_total_counter | total amount of resource requests | Number | Counter |
| groundcover_resource_error_counter | total amount of requests with error status codes | Number | Counter |
| groundcover_resource_issue_counter | total amount of requests which were flagged as issues | Number | Counter |
| groundcover_resource_success_counter | total amount of resource requests with OK status codes | Number | Counter |
| groundcover_resource_latency_seconds | resource latency | Seconds | Summary |
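Since every issue is an error but not every error is an issue, the counter increases over any given window should satisfy issues <= errors <= total. The sketch below shows that check together with an error ratio computed from groundcover_resource_error_counter and groundcover_resource_total_counter. The function names and values are hypothetical, and the inputs are assumed to be counter increases measured over the same time window.

```python
def error_ratio(total: float, errors: float) -> float:
    """Share of requests that ended with an error status, between 0.0 and 1.0."""
    if total == 0:
        return 0.0  # no traffic in the window, so there is no error share to report
    return errors / total


def counts_are_consistent(total: float, errors: float, issues: float) -> bool:
    """Check the ordering implied by 'every issue is an error'."""
    return 0 <= issues <= errors <= total


# Illustrative increases over one window: 200 requests, 10 errors, 4 issues.
ratio = error_ratio(200, 10)
consistent = counts_are_consistent(200, 10, 4)
```

The same functions apply to the workload-level counters, which follow the same naming pattern.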

Workload metrics

| Name | Description | Unit | Type |
| --- | --- | --- | --- |
| groundcover_workload_total_counter | total amount of requests handled by the workload | Number | Counter |
| groundcover_workload_error_counter | total amount of requests handled by the workload with error status codes | Number | Counter |
| groundcover_workload_issue_counter | total amount of requests handled by the workload which were flagged as issues | Number | Counter |
| groundcover_workload_success_counter | total amount of requests handled by the workload with OK status codes | Number | Counter |
| groundcover_workload_latency_seconds | resource latency across all of the workload APIs | Seconds | Summary |

Kafka specific metrics

| Name | Description | Unit | Type |
| --- | --- | --- | --- |
| groundcover_client_offset | client last message offset (for producer the last offset produced, for consumer the last requested offset) | | Gauge |
| groundcover_workload_client_offset | client last message offset (for producer the last offset produced, for consumer the last requested offset), aggregated by workload | | Gauge |
| groundcover_calc_lagged_messages | current lag in messages | Number | Gauge |
| groundcover_workload_calc_lagged_messages | current lag in messages, aggregated by workload | Number | Gauge |
| groundcover_calc_lag_seconds | current lag in time | Seconds | Gauge |
| groundcover_workload_calc_lag_seconds | current lag in time, aggregated by workload | Seconds | Gauge |
