groundcover's unique pricing model is the first to decouple data volumes from the cost of owning and operating the solution. For example, subscribing to our Enterprise plan costs $30 per node/host per month.
Overall, the cost of owning and operating groundcover is based on two factors:
The number of nodes (hosts) you are running in the environment you are monitoring
The costs of hosting groundcover's backend in your environment
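As a rough worked example based on the numbers above: monitoring 20 nodes on the Enterprise plan comes to 20 × $30 = $600 per month, plus whatever it costs to host groundcover's backend in your environment.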
Check out our TCO calculator to simulate your total cost of ownership for groundcover.
Definitely. As you deploy groundcover, each cluster is automatically assigned the unique name it holds inside your cloud environment. You can browse and select all your clusters in one place through our UI.
groundcover has been tested and validated on the most common K8s distributions. See full list in the Requirements section.
groundcover supports the most common protocols in most K8s production environments out-of-the-box. See full list here.
groundcover's kernel-level eBPF sensor automatically collects your logs, application metrics (such as latency, throughput, error rate and much more), infrastructure metrics (such as deployment updates, container crashes etc.), traces, and Kubernetes events. You can control which data is left out of the automatic collection using data obfuscation.
groundcover stores all the data it collects inside your environment, using the state-of-the-art storage services of ClickHouse and Victoria Metrics, with the option to offload data to object storage such as S3 for long-term retention. See our Architecture section for more details.
groundcover stores the data it collects in-cluster, inside your environment without ever leaving the cluster to be stored anywhere else.
Our SaaS UI experience stores only information related to the account, user access and general K8s metadata used for governance (like the number of nodes per cluster, the name given to the cluster etc.).
All the information served to the UI experience is encrypted all the way to the in-cluster data sources. groundcover has no access to your collected data, which is accessible only to an authenticated user from your organization. groundcover does collect telemetry information (opt-out is of course possible) which includes metrics about the performance of the deployment (e.g. resource consumption metrics) and logs reported from the groundcover components running in the cluster.
All telemetry information is anonymized, and contains no data related to your environment.
In any case, groundcover is SOC 2 and ISO 27001 compliant and follows security best practices.
If you used your business email to create your groundcover account, you can invite your team to your workspace by clicking on the purple "Invite" button on the upper menu. This will open a pop-up where you can enter the emails of the people you want to invite. You also have an option to copy and share your private link.
Note: The Admin of the account (i.e. the person that created it) can also invite users outside of your email domain. Non-admin users can only invite users that share the same email domain. If you used a private email, you can only share the link to your workspace by clicking the "Share" button on the top bar.
Read more about invites in our installation guide.
groundcover's CLI tool is currently open source, alongside other projects like Murre and Caretta. We're working on releasing more parts of our solution as open source very soon. Stay tuned on our GitHub page!
groundcover's sensor uses eBPF, which means it can only be deployed on a Kubernetes cluster that is running on a Linux system.
Installing using the CLI command is currently only supported on Linux and Mac.
You can install using the Helm command from any operating system.
Once installed, accessing the groundcover platform is possible from any web browser, on any operating system.
The following architectures are fully supported for all groundcover workloads:
x86
ARM
groundcover is a full stack, cloud-native observability platform, developed to break all industry paradigms - from making instrumentation a thing of the past, to decoupling cost from data volumes.
The groundcover platform consolidates all your traces, metrics, logs, and Kubernetes events into a single pane of glass, allowing you to identify issues faster than ever before and conduct granular investigations for quick remediation and long-term prevention.
Our pricing is not impacted by the volume of data generated by the environments you monitor, so you can dare to start monitoring environments that had been blind spots until now - such as your Dev and Staging clusters. This, in turn, gives you visibility into all your environments, making it much more likely that you'll identify issues in the early stages of development rather than in your live product.
groundcover introduces game-changing concepts to observability:
eBPF (extended Berkeley Packet Filter) is a groundbreaking technology that has significantly impacted the Linux kernel, offering a new way to safely and efficiently extend its capabilities.
By powering our sensor with eBPF, groundcover unlocks unprecedented granularity on your cloud environment, while also practically eliminating the need for human involvement in the installation and deployment process. Our unique sensor collects data directly from the Linux kernel with near-zero impact on CPU and memory.
Advantages of our eBPF sensor:
Zero instrumentation: groundcover's eBPF sensor gathers granular observability data without the need for integrating an SDK or changing your applications' code in any way. This enables all your logs, metrics, traces, and other observability data to flow automatically into the platform. In minutes, you gain full visibility into application and infrastructure health, performance, resource usage, and more.
Minimal resources footprint: groundcover's sensor is installed on a dedicated node in each monitored cluster, operating separately from the applications it is monitoring. Without interfering with the application's primary functions, the groundcover platform operates with near-zero impact on your resources, maintaining the applications' performance and avoiding unexpected overhead on the infrastructure.
A new level of insight granularity: With direct access to the Linux kernel, our eBPF sensor enables the collection of data straight from the source. This guarantees that the data is clean, unaltered, and precise. It also offers access to unique insights into your application and infrastructure, such as the ability to view the full payloads of traces or analyze network performance over time.
The one-of-a-kind architecture on which groundcover was built eliminates all requirements to stream your logs, metrics, traces, and other monitoring data outside of your environment and into a third-party's cloud. By leveraging integrations with best-of-breed technologies, including ClickHouse and VictoriaMetrics, all your observability data is stored locally, with the option of having it fully managed by groundcover.
Advantages of our BYOC architecture:
By separating the data plane from the control plane, you get the advantages of a SaaS solution, without its security and privacy challenges.
With multiple deployment models available, you also get to choose the level of security and privacy your organization needs, up to the highest standards (FedRAMP-level).
Automated deployment, maintenance & resource optimization with our inCloud Managed deployment option.
This concept is unique to groundcover, and takes a while to grasp. Read about our BYOC architecture in more detail in this dedicated section.
Learn about groundcover inCloud Managed (currently available only on a paid plan), which enables you to deploy groundcover's control plane inside your own environment and delegate the entire setup and management of the groundcover platform.
Enabled by our unique BYOC architecture, groundcover's vision is to revolutionize the industry by offering a pricing model that is unheard of anywhere else. Our fully transparent pricing model is based only on the number of nodes being monitored and the costs of hosting the groundcover backend in your environment. The volume of logs, metrics, traces, and all other observability data doesn't affect your cost. This results in savings of 60-90% compared to SaaS platforms.
In addition, none of our subscription tiers ever limit your access to features and capabilities.
Advantages of our nodes-based pricing model:
Cost is predictable and transparent, becoming an enabler of growth and expansion.
The ability to deploy groundcover in data-intensive environments enables the monitoring of Dev and Staging clusters, which promotes early identification of issues.
No cardinality or retention limits
Read our latest customer stories to learn how organizations of varying sizes dramatically reduce their observability costs by migrating to groundcover:
groundcover applies a stream processing approach to collect and control the continuous flow of data to gain immediate insights, detect anomalies, and respond to changing conditions. Unlike batch processing, where data is collected over a period and then analyzed, stream processing analyzes the data as it flows through the system.
Our platform uses a distributed stream processing engine that enables it to ingest huge amounts of data (such as logs, traces and Kubernetes events) in real time. It also processes all that data and instantly generates complex insights (such as metrics and context) based on it.
As a result, the volume of raw data stored dramatically decreases which, in turn, further reduces the overall cost of observability.
Our log management is designed for high scalability and rapid query performance, enabling quick and efficient log analysis from all your environments. Each log is enriched with actionable context and correlated with relevant metrics and traces, providing a comprehensive view for fast troubleshooting.
The groundcover platform provides cloud-native infrastructure monitoring, enabling automatic collection and real-time monitoring of infrastructure health and efficiency.
Gain end-to-end observability into your applications' performance, identify and resolve issues instantly, all with zero code changes.
Real User Monitoring (RUM) extends groundcover’s observability platform to the client side, providing visibility into actual user interactions and front-end performance. It tracks key aspects of your web application as experienced by real users, then correlates them with backend metrics, logs, and traces for a full-stack view of your system.
Traces are a powerful observability pillar, providing granular insights into microservice interactions. Traditionally, they were hard to implement, requiring coordination of multiple teams and constant code changes, making this critical aspect very challenging to maintain.
groundcover's eBPF sensor disrupts the famous tradeoff, empowering developers to gain full visibility into their applications, effortlessly and without any code changes.
The platform supports two kinds of traces:
These traces are automatically generated for every supported service in your stack. They are available out-of-the-box and within seconds of installation. These traces always include critical information such as:
All services that took part in the interaction (both client and server)
Accessed resource
Full payloads, including:
All headers
All query parameters
All bodies - for both the request and response
These can be ingested into the platform, allowing you to leverage existing instrumentation to create a single pane of glass for all of your traces.
Traces are stored in groundcover's ClickHouse deployment, ensuring top-notch performance on every scale.
For more details about ingesting 3rd party traces, see the integrations page.
groundcover further disrupts the customary traces experience by reinventing the concept of sampling. This innovation differs between the different types of traces:
These are generated by using 100% of the data, always processing every request being made, on every scale. However, the groundcover platform utilizes smart sampling to only store a fraction of the traces, while still generating an accurate picture. In general, sampling is performed according to these rules:
Requests with unusually high or low latencies, measured per resource
Requests which returned an error response (e.g 500 status code for HTTP)
"Normal" requests which form the baseline for each resource
Lastly, stream processing is utilized to make the sampling decisions on the node itself, without having to send or save any redundant traces.
Certain aspects of our sampling algorithm are configurable - read more here.
These traces are always ingested fully - meaning, no sampling is applied to traces which have been generated elsewhere and ingested by the platform.
When integrating 3rd-party traces, it is often wise to configure some sampling mechanism according to the specific use case.
Each trace is enriched with additional information to give as much context as possible for the service which generated the trace. This includes:
Container information - image, environment variables, pod name
Logs generated by the service around the time of the trace
Golden Signals of the resource around the time of the trace
Kubernetes events relevant to the service
CPU and Memory utilization of the service and the node it is scheduled on
One of the advantages of ingesting 3rd-party traces is the ability to leverage their distributed tracing feature. groundcover natively displays the full trace for ingested traces in the Traces page.
Trace Attributes enable advanced filtering and search capabilities. groundcover supports attributes across all trace types. This encompasses a diverse range of protocols such as HTTP, MongoDB, PostgreSQL, and others, as well as varied sources including eBPF or manual instrumentations (for example, OpenTelemetry).
groundcover enriches your original traces and generates meaningful metadata as key-value pairs. This metadata includes critical information, such as protocol type, http.path, db.statement, and similar attributes, aligning with OTel conventions. Furthermore, groundcover seamlessly incorporates this metadata from spans received through supported manual instrumentations. For an in-depth understanding of attributes in OTel, please refer to the OTel Attributes Documentation (external link to the OpenTelemetry website).
Each attribute can be effortlessly integrated into your filters and search queries. You can add them directly from the trace side-panel with a simple click or input them manually into the search bar.
Example: To filter all HTTP traces that contain the path "/products", the query would be formatted as: @http.path:"/products". For a comprehensive guide on the query syntax, see the Syntax table below.
Trace Tags enable advanced filtering and search capabilities. groundcover supports tags across all trace types. This encompasses a diverse range of protocols such as HTTP, MongoDB, PostgreSQL, and others, as well as varied sources including eBPF or manual instrumentations (for example, OpenTelemetry).
Tags are powerful metadata components, structured as key-value pairs. They offer insightful information about the resource generating the span, like container.image.name, host.name and more.
Tags include metadata enriched by our sensor, plus additional metadata if provided by manual instrumentations (such as OpenTelemetry traces). Utilizing these Tags enhances the understanding and context of your traces, allowing for more comprehensive analysis and easier filtering by the relevant information.
Each tag can be effortlessly integrated into your filters and search queries. You can add them directly from the trace side-panel with a simple click or input them manually into the search bar.
Example: To filter all traces from mysql containers, the query would be formatted as: container.image.name:mysql. For a comprehensive guide on the query syntax, see the Syntax table below.
The Trace Explorer integrates dynamic filters and a versatile search functionality, to enhance your trace data analysis. You can filter out traces using specific criteria, including trace-status, workload, namespace and more, as well as limit your search to a specific time range.
Learn more about how to use our search syntaxes
groundcover natively supports setting up traces pipelines using Vector transforms. This allows for full flexibility in the processing and manipulation of traces being collected - parsing additional patterns by regex, renaming attributes, and more.
Learn more about how to configure traces pipelines
groundcover allows full control over the retention of your traces. Read here to learn more.
Tracing can be customized in several ways:
Gain end-to-end observability into your applications' performance, identify and resolve issues instantly - all with zero code changes.
The groundcover platform collects data all across your stack using the power of eBPF instrumentation. Our proprietary eBPF sensor is installed in seconds and provides 100% coverage into application metrics and traces with zero code changes or configurations.
Resolve faster - By seamlessly correlating traces with application metrics, logs, and infrastructure events, groundcover’s APM enables you to detect and resolve root issues faster.
Improve user experience - Optimize your application performance and resource utilization faster than ever before, avoid downtimes and make poor end-user experience a thing of the past.
Our revolutionary eBPF sensor, Flora, is deployed as a DaemonSet in your Kubernetes cluster. This approach allows us to inspect every packet that each service is sending or receiving, achieving 100% coverage. No sampling rates or relying on statistical luck - all requests and responses are observed.
This approach would not be feasible without a resource-efficient eBPF-powered sensor. eBPF not only extends the ability to pinpoint issues - it does so with much less overhead than any other method. eBPF can be used to analyze traffic originating from every programming language and SDK - even for encrypted connections!
Click here for a full list of supported technologies
After being collected by our eBPF code, the traffic is then classified according to its protocol - which is identified directly from the underlying traffic, or the library from which it originated. Connections are reconstructed, and we can generate transactions - HTTP requests and responses, SQL queries and responses etc.
To provide as much context as possible, each transaction is enriched with extensive metadata. Examples include the pods that took part in the transaction (both client and server), the nodes on which these pods are scheduled, and the state of the container at the time of the request.
It is important to emphasize the impressive granularity level with which this process takes place - every single transaction observed is fully enriched. This allows us to perform more advanced aggregations.
After being enriched with as much context as possible, the transactions are grouped together into meaningful aggregations. These could be defined by the workloads involved, the protocols detected and the resources that were accessed in the operations. These aggregations will mostly come into play when displaying golden signals.
After collecting the data, contextualizing it and putting it together in meaningful aggregations - we can now create metrics and traces to provide meaningful insights into the services' behaviors.
Learn how groundcover's application metrics work:
Learn how groundcover's application traces work:
Stream, store, and query your logs at any scale, for a fixed cost.
Our Log Management solution is built for high scale and fast query performance so you can analyze logs quickly and effectively from all your cloud environments.
Gain context - Each log is enriched with actionable context and correlated with relevant metrics and traces in one single view, so you can find what you're looking for and troubleshoot faster.
Centralize to maximize - The groundcover platform can act as a limitless, centralized log management hub. Your subscription costs are completely unaffected by the amount of logs you choose to store or query. It's entirely up to you to decide.
groundcover ensures a seamless log collection experience with our proprietary eBPF sensor, which automatically collects and aggregates all logs in all formats - including JSON, plain text, NGINX logs, and more. All this without any configuration needed.
This sensor is deployed as a DaemonSet, running a single pod on each node within your Kubernetes cluster. This configuration enables the groundcover platform to automatically collect logs from all of your pods, across all namespaces in your cluster. This means that once you've installed groundcover, no further action is needed on your part for log collection. The logs collected by each sensor instance are then channeled to the OTel Collector.
Acting as the central processing hub, the OTel Collector is a vendor-agnostic tool that receives logs from the various sensor pods. It processes, enriches, and forwards the data into groundcover's ClickHouse database, where all log data from your cluster is securely stored.
Logs Attributes enable advanced filtering capabilities and are currently supported for the following formats:
JSON
Common Log Format (CLF) - like those from NGINX and Kong
logfmt
groundcover automatically detects the format of these logs, extracting key:value pairs from the original log records as Attributes.
Each attribute can be added to your filters and search queries.
Example: filtering a log in a supported format with a field of a request path "/status" will look as follows: @request.path:"/status". Syntax can be found here.
groundcover offers the flexibility to craft tailored collection filtering rules. You can choose to set up filters and collect only the logs that are essential for your analysis, avoiding unnecessary data noise. For guidance on configuring your filters, explore our Customize Logs Collection section.
You also have the option to define the retention period for your logs in the ClickHouse database. By default, logs are retained for 3 days. To adjust this period to your preferences, visit our Customize Retention section for instructions.
Once logs are collected and ingested, they are available within the groundcover platform in the Log Explorer, which is designed for quick searches and seamless exploration of your logs data. Using the Log Explorer you can troubleshoot and explore your logs with advanced search capabilities and filters, all within a clear and fast interface.
The Log Explorer integrates dynamic filters and a versatile search functionality that enables you to quickly and easily identify the right data. You can filter out logs by selecting one or multiple criteria, including log-level, workload, namespace and more, and can limit your search to a specific time range.
Learn more about how to use our search syntaxes
groundcover natively supports setting up log pipelines using Vector transforms. This allows for full flexibility in the processing and manipulation of logs being collected - parsing additional patterns by regex, renaming attributes, and more.
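As a rough illustration (not taken from the groundcover docs), a standard Vector remap transform that redacts email addresses from a log field could look like the sketch below; the field name, input name, and the way this file is wired into your groundcover deployment are assumptions, so consult the pipelines documentation for the exact configuration keys.

```bash
# Hypothetical sketch: a generic Vector "remap" transform (VRL) that redacts
# email addresses from a log field. Field and input names are placeholders.
cat > log-pipeline-transform.yaml <<'EOF'
transforms:
  redact_emails:
    type: remap
    inputs: ["logs"]            # placeholder input name
    source: |
      .content = replace(string!(.content), r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+', "[REDACTED]")
EOF
```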
groundcover’s eBPF sensor uses state-of-the-art kernel features to provide full coverage at low overhead. In order to do so it requires certain kernel features which are listed below.
Version v5.3 or higher (anything since 2020).
You can check if your kernel has CO:RE support by manually looking for the BTF file:
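On most distributions with CO:RE support, the kernel exposes its BTF type information at /sys/kernel/btf/vmlinux, so a quick check looks like this:

```bash
# Check for the kernel's BTF type information (present when CO:RE is supported)
ls -l /sys/kernel/btf/vmlinux
```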
If the file exists, congratulations! Your kernel supports CO:RE.
groundcover supports any K8s version from v1.21.
For the installation to complete successfully, permissions to deploy the following objects are required:
StatefulSet
Deployment
ConfigMap
Secret
PVC
groundcover's portal pod sends HTTP requests to the cloud platform app.groundcover.com on port 443.
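A quick way to verify this outbound connectivity (assuming curl is available wherever you run the check, e.g. a node or a debug pod):

```bash
# Verify that app.groundcover.com is reachable over HTTPS (port 443)
curl -sI --max-time 10 https://app.groundcover.com
```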
To ensure a seamless experience with groundcover, it's important to confirm that your environment meets the necessary requirements. Please review the detailed requirements for Kubernetes, our eBPF sensor, and the necessary hardware and resources to guarantee optimal performance.
groundcover supports a wide range of Kubernetes versions and distributions, including popular platforms like EKS, AKS, and GKE.
Our state-of-the-art eBPF sensor leverages advanced kernel features to deliver comprehensive monitoring with minimal overhead, requiring specific Linux kernel versions, permissions, and CO:RE support.
groundcover fully supports both x86 and ARM processors, ensuring compatibility across diverse environments.
groundcover operates ClickHouse to support many of its core features. This requires suitable resources given to the deployment, which needs to be considered when deploying the platform.
Monitor front-end applications and connect them to your backend — all inside your cloud.
Capture real end-user experiences directly from their browsers and unify these insights with your backend observability data.
Real User Monitoring (RUM) extends groundcover’s observability platform to the client side, providing visibility into actual user interactions and front-end performance. It tracks key aspects of your web application as experienced by real users, then correlates them with backend metrics, logs, and traces for a full-stack view of your system.
Understand user experience - capture every interaction, page load, and performance metric from the end-user perspective to pinpoint front-end issues in real time.
Resolve issues faster - seamlessly tie front-end events to backend traces and logs in one platform, enabling end-to-end troubleshooting of user journeys.
Privacy first - groundcover’s Bring Your Own Cloud (BYOC) model ensures all RUM data stays in your own cloud environment. Sensitive user data never leaves your infrastructure, ensuring privacy and compliance without sacrificing insight.
groundcover RUM collects a wide range of data from users’ browsers through a lightweight JavaScript SDK. Once integrated into your web application, the SDK automatically gathers and sends the following telemetry from each user session to the groundcover platform:
Network requests: Every HTTP request initiated by the browser (such as API calls) is captured as a trace. Each client-side request can be linked with its corresponding server-side trace, giving you a complete picture of the request from the user’s click to the backend response.
Front-end logs: Client-side log messages (e.g., console.log outputs, warnings, and errors) are collected and forwarded to groundcover's log management. This ensures that browser logs are stored alongside your application's server logs for unified analysis.
Exceptions: Uncaught JavaScript exceptions and errors are automatically captured with full stack traces and contextual data (browser type, URL, etc.). These front-end errors become part of the data groundcover monitors, letting you quickly identify and debug issues in the user's environment.
Performance metrics (Core Web Vitals): Key performance indicators such as page load time, along with Core Web Vitals like Largest Contentful Paint, Interaction to Next Paint, and Cumulative Layout Shift, are measured for each page view. groundcover RUM records these metrics to help you track real-world performance and detect slowdowns affecting users.
User interactions: RUM tracks user interactions such as clicks, keydown, and navigation events. By recording which elements users interact with and when, groundcover helps you reconstruct user flows and understand the sequence of actions leading up to any issue or performance problem.
Custom events: You can instrument your application to send custom events via the RUM SDK. This allows you to capture domain-specific actions or business events (for example, a checkout completion or a specific UI gesture) with associated metadata, providing deeper insight into user behavior beyond automatic captures.
All collected data is streamed securely to your groundcover deployment. Because groundcover runs in your environment, RUM data (including potentially sensitive details from user sessions) is stored in the observability backend within your own cloud. From there, it is aggregated and indexed just like other telemetry, ready to be searched and analyzed in the groundcover UI.
One of the core advantages of groundcover RUM is its native integration with backend observability data. Every front-end trace, log, or event captured via RUM is contextualized alongside server-side data:
Trace correlation: Client-side traces (from browser network requests) are automatically correlated with server-side traces captured by groundcover’s eBPF-based instrumentation. This means when a user triggers an API call, you can see the complete distributed trace that spans the browser and the backend services, all in one view.
Unified logging: Front-end log entries and error reports are ingested into the same backend as your server-side logs. In the groundcover Log Explorer, you can filter and search across logs from both client and server, using common fields (like timestamp, session ID, or trace ID) to connect events.
End-to-end troubleshooting: With full-stack data in one platform, you can pivot easily between a user’s session replay, the front-end events, and the backend metrics/traces involved. This end-to-end context significantly reduces the time to isolate whether an issue originated in the frontend (browser/UI) or the backend (services/infrastructure), helping teams pinpoint problems faster across the entire stack.
By bridging the gap between the user’s browser and your cloud infrastructure, groundcover’s RUM capability ensures that no part of the user journey is invisible to your monitoring. This holistic view is critical for optimizing user experience and rapidly resolving issues that span multiple layers of your application.
Once RUM data is collected, it becomes available in the groundcover platform via the Sessions Explorer — a dedicated view for inspecting and troubleshooting user sessions. The Sessions Explorer allows you to navigate through user journeys and understand how your users experience your application.
Clicking on any session opens the Session View, where you can inspect a full timeline of the user’s experience. This view shows every key event captured during the session - including clicks, navigations, network requests, logs, custom events, and errors.
Each event is displayed in sequence with full context like timestamps, URLs, and stack traces. The Session View helps you understand exactly what the user did and what the system reported at each step, making it easier to trace issues and user flows.
Get complete visibility into your cloud infrastructure performance at any scale, easily access all your metrics in one place and optimize infrastructure efficiency.
The groundcover platform offers infrastructure monitoring capabilities that were built for cloud-native environments. It enables you to track the health and efficiency of your infrastructure instantly, with an effortless deployment process.
Troubleshoot efficiently - acting as a centralized hub for all your infrastructure, application and customer metrics allows you to query, correlate and troubleshoot your cloud environments using real time data and insight on your entire stack.
groundcover's proprietary eBPF sensor leverages all its innovative powers to collect comprehensive data across your cloud environments without the burden of performance overhead. This data is sourced from various Kubernetes components, including kube-system workloads, cluster information via the Kubernetes API, and the applications' interactions with the Kubernetes infrastructure. This level of detailed collection at the kernel level enables the ability to provide actionable insights into the health of your Kubernetes clusters, which are indispensable for troubleshooting existing issues and taking proactive steps to future-proof your cloud environments.
Monitoring a cluster involves tracking resources that are critical to the performance and stability of the entire system. Monitoring these essential metrics is crucial for maintaining a healthy Kubernetes cluster:
CPU consumption: It's essential to track the CPU resources being utilized against the total capacity to prevent workloads from failing due to insufficient CPU availability.
Memory utilization: Keeping an eye on the remaining memory resources ensures that your cluster doesn't encounter disruptions due to memory shortages.
Disk space allocation: For Kubernetes clusters running stateful applications or requiring persistent storage for data, such as etcd databases, tracking the available disk space is crucial to avert potential storage deficiencies.
Network usage: Visualize traffic rates and connections being established and closed on a service-to-service level of granularity, and easily pinpoint cross availability zone communication to investigate misconfigurations and surging costs.
Available Labels
type
clusterId
region
namespace
node_name
workload_name
pod_name
container_name
container_image
Available Metrics
Available Labels
type
clusterId
region
node_name
Available Metrics
Available Labels
type
clusterId
region
name
namespace
Available Metrics
Available Labels
clusterId
workload_name
namespace
container_name
remote_service_name
remote_namespace
remote_is_external
availability_zone
region
remote_availability_zone
remote_region
is_cross_az
protocol
role
server_port
encryption
transport_protocol
is_loopback
Notes:
is_loopback and remote_is_external are special labels that indicate the remote service is either the same service as the recording side (loopback) or resides in an external network, e.g. a managed service outside of the cluster (external). In both cases the remote_service_name and remote_namespace labels will be empty.
is_cross_az means the traffic was sent and/or received between two different availability zones. This is a helpful flag to quickly identify this special kind of communication. The actual zones are detailed in the availability_zone and remote_availability_zone labels.
Available Metrics
groundcover may work on many other Linux kernels, but we may simply not have had a chance to test them yet. Can't find yours in the list?
Loading eBPF code requires running privileged containers. While this might seem unusual, there's nothing to worry about - eBPF programs are verified by the kernel before they are loaded, making them safe to run.
Our sensor uses eBPF's CO:RE (Compile Once, Run Everywhere) feature in order to support the vast variety of Linux kernels and distributions detailed above. This feature requires the kernel to be compiled with BTF information (enabled using the CONFIG_DEBUG_INFO_BTF=y kernel compilation flag), which is the case for most common distributions nowadays.
If your system does not fit into any of the above, our eBPF sensor will unfortunately not be able to run in your environment. However, this does not mean groundcover won't collect any data - you will still be able to use much of the platform together with other data sources.
groundcover may work on many other K8s flavors, but we may simply not have had a chance to test them yet. Can't find yours in the list?
DaemonSet (with privileged containers for loading our eBPF code)
To learn more about groundcover's architecture and components, visit our Architecture section.
This unique architecture keeps the data inside the cluster and fetches it on demand, keeping the data encrypted all the way, without the need to open the cluster to incoming traffic via ingresses.
This capability is only available to organizations subscribed to one of our paid plans.
➡️ Check out our to your platform.
groundcover will work out-of-the-box on all protocols, encryption libraries and runtimes below - generating metrics and traces with zero code changes.
We're growing our coverage all the time. Can't find what you're looking for?
Store it all, without breaking a sweat - store any metrics volume without worrying about cardinality or retention limits. Your costs remain unaffected by the granularity of metrics you store or query.
You also have the option to define the retention period for your metrics in the VictoriaMetrics database. By default, metrics are retained for 7 days, but you can adjust this period to your preferences.
Beyond collecting data, groundcover's methodology involves a strategic layer of data enrichment that seeks to correlate Kubernetes metrics with application performance indicators. This correlation is crucial for creating a transparent image of the Kubernetes ecosystem. It enables a deep understanding of how Kubernetes interacts with applications, identifying issues across the interconnected environment. By monitoring Kubernetes not as an isolated platform but as an integral part of the application infrastructure, groundcover ensures that the monitoring strategy aligns with your dynamic and complex cloud operations.
Debian: 11+
RedHat Enterprise Linux: 8.2+
Ubuntu: 20.10+
CentOS: 7.3+
Fedora: 31+
BottlerocketOS: 1.10+
Amazon Linux: All off the shelf AMIs
Google COS: All off the shelf AMIs
Azure Linux: All off the shelf AMIs
Talos: 1.7.3+
EKS: supported
AKS: supported
GKE: supported
OKE: supported
OpenShift: supported
Rancher: supported
Self-managed: supported
minikube: supported
kind: supported
Rancher Desktop: supported
k0s: supported
k3s: supported
k3d: supported
microk8s: supported
AWS Fargate: not supported
Docker-desktop: not supported
HTTP: supported
gRPC: supported
MySQL: supported
PostgreSQL: supported
Redis: supported
DNS: supported
Kafka: supported
MongoDB: supported (v3.6+)
AMQP: supported (AMQP 0-9-1)
GraphQL: supported
AWS S3: supported
AWS SQS: supported
crypto/tls (golang): supported
OpenSSL (c, c++, Python): supported
NodeJS: supported
JavaSSL: supported (Java 11+; requires enabling the groundcover Java agent)
groundcover_container_cpu_usage_rate_millis: CPU usage in mCPU
groundcover_container_cpu_request_m_cpu: K8s container CPU request (mCPU)
groundcover_container_cpu_limit_m_cpu: K8s container CPU limit (mCPU)
groundcover_container_memory_working_set_bytes: current memory working set (B)
groundcover_container_memory_rss_bytes: current memory RSS (B)
groundcover_container_memory_request_bytes: K8s container memory request (B)
groundcover_container_memory_limit_bytes: K8s container memory limit (B)
groundcover_container_cpu_delay_seconds: K8s container CPU delay accounting in seconds
groundcover_container_disk_delay_seconds: K8s container disk delay accounting in seconds
groundcover_container_cpu_throttled_seconds_total: K8s container total CPU throttling in seconds
groundcover_node_allocatable_cpum_cpu: amount of allocatable CPU in the current node (mCPU)
groundcover_node_allocatable_mem_bytes: amount of allocatable memory in the current node (B)
groundcover_node_mem_used_percent: percent of used memory in the current node (0-100)
groundcover_node_used_disk_space: current used disk space in the current node (B)
groundcover_node_free_disk_space: amount of free disk space in the current node (B)
groundcover_node_total_disk_space: amount of total disk space in the current node (B)
groundcover_node_used_percent_disk_space: percent of used disk space in the current node (0-100)
groundcover_pvc_usage_bytes: PVC used bytes (B)
groundcover_pvc_capacity_bytes: PVC capacity bytes (B)
groundcover_pvc_available_bytes: PVC available bytes (B)
groundcover_pvc_usage_percent: percent of used PVC storage (0-100)
groundcover_network_rx_bytes_total: bytes received by the workload (B)
groundcover_network_tx_bytes_total: bytes sent by the workload (B)
groundcover_network_connections_opened_total: connections opened by the workload
groundcover_network_connections_closed_total: connections closed by the workload
groundcover_network_connections_opened_failed_total: connection attempts failed per workload (including refused connections)
groundcover_network_connections_opened_refused_total: connection attempts refused per workload
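As an illustration only, the network metrics and labels above can be queried through a Prometheus-compatible API such as the one exposed by the in-cluster VictoriaMetrics deployment; the service address below is a placeholder (8428 is the VictoriaMetrics single-node default), so adjust it to your own deployment.

```bash
# Illustrative only: per-workload cross-AZ receive rates over the last 5 minutes
curl -s 'http://<victoria-metrics-service>:8428/api/v1/query' \
  --data-urlencode 'query=sum by (workload_name, availability_zone, remote_availability_zone) (rate(groundcover_network_rx_bytes_total{is_cross_az="true"}[5m]))'
```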
The following section is only relevant for inCluster deployments, where ClickHouse is deployed inside each monitored cluster. For inCloud Managed, groundcover takes care of database deployment and management, scaling a central deployment of ClickHouse according to your data usage.
ClickHouse is one of the two core databases used by groundcover. It is responsible for logs, traces, events, monitors, dashboards and many more features available in the platform.
As such, it requires suitable resources to operate without issues. The default resources in the groundcover chart were chosen according to the ClickHouse docs (here and here) as well as our extensive experience running it across all types of plans and deployments.
We don't recommend decreasing those resources as it might cause ClickHouse to become unstable and degrade system performance.
The default values for ClickHouse resources are defined in the groundcover Helm chart.
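One way to review those defaults locally is to dump the chart's values; this is a minimal sketch that assumes the Helm repository and chart names from the installation example, which may differ in your setup.

```bash
# Hypothetical sketch: inspect the chart's default values and review the
# ClickHouse resource requests/limits before changing anything
helm show values groundcover/groundcover | less
```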
Get up and running in minutes
Before installing groundcover, please make sure your cluster meets the requirements.
The first thing you need to do to start using groundcover is sign up using your email address (no credit card required for the free tier account). Signing up is only possible using a computer and will not be possible using a mobile phone or tablet. It is highly recommended you use your corporate email address, as it will make it easier to use other features such as inviting your colleagues to your workspace. However, signing up using Gmail, Outlook or any other public domains is also possible.
After signing up, you can install groundcover using any one of these methods:
When signing in to groundcover for the first time, the platform automatically detects your organization based on the domain you used to sign in. If your organization already has existing workspaces available, the workspace selection screen will be displayed, where you can choose which of the existing workspaces you would like to join, or if you want to create a new workspace.
Available workspaces will be displayed only if either of the following applies:
You have been invited to join existing workspaces and haven't joined them yet
Someone has previously created a workspace that has auto-join enabled for the email domain that you used to sign in (applicable for corporate email domains only)
Click the Join button next to the desired workspace
You will be added as a user to that workspace with the user privileges that were assigned by default or those that were assigned to you specifically when the invite was sent.
You will automatically be redirected to that workspace.
Click the Create a new workspace button
Specify a workspace name
Choose whether to enable auto-join (those settings can be changed later)
Click continue
Copy the command line displayed on the screen using the Copy Command button (Note: This is not your API key)
Open your CLI
Paste the command line in your CLI (Note: Make sure your kubectl context is pointing to the desired cluster)
Coverage policy covers all nodes excluding control plane and Fargate nodes. See details here.
The installation screen will open automatically and will let you know on-screen when the installation has completed.
Within 10 minutes after installation completes, all of your cluster's data will appear in your workspace.
Workspace owners and admins can allow teammates that log in with the same email domain as them to join the Workspace they created automatically, without an admin approval. This capability is called "Auto-join". It is disabled by default, but can be switched on during the workspace set up process, or any time in the workspace settings.
If you logged in with a public email domain (Gmail, Yahoo, Proton, etc.) and are creating a new Workspace, you will not be able to switch on Auto-join for that Workspace.
Use groundcover CLI to automate the installation process. The main advantages of using this installation method are:
Auto-detection of cluster incompatibility issues
Tolerations setup automation
Tuning of resources according to cluster size
Support for passing Helm overrides
Automated detection of new versions and upgrade suggestions
Read more here.
Coverage policy covers all nodes excluding control plane and Fargate nodes. See details here.
Deploying groundcover using the CLI
groundcover can be installed using the official helm chart.
If you’re interested in installing the helm chart using a CI/CD solution, such as ArgoCD, make sure you read our CI/CD installation section as well.
Coverage policy covers all nodes excluding control plane and Fargate nodes. See details here.
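For orientation, a Helm-based install typically looks like the sketch below; the repository URL and values file contents are placeholders, so copy the exact command and values from the in-app installation screen or the Helm installation docs.

```bash
# Hypothetical sketch - replace the placeholders with the values shown in-app
helm repo add groundcover <GROUNDCOVER_HELM_REPO_URL>
helm repo update
helm upgrade --install groundcover groundcover/groundcover \
  --namespace groundcover --create-namespace \
  -f values.yaml   # values (API key, cluster name, etc.) provided during onboarding
```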
Set up your first alert
Set up your first dashboard
Invite your colleagues
Installing groundcover on additional clusters
groundcover can monitor multiple clusters, and new clusters can be added at any given time. You can add new clusters using the UI.
Click on the cluster picker in the top right corner and then on the + Add Cluster option.
Note: Our free plan limits the use of groundcover to only one cluster. Check out our Team and Enterprise plans to install on an unlimited number of clusters.
To update the groundcover agent to the latest version:
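For Helm-based installs, the update usually follows the standard Helm upgrade flow sketched below (release and chart names are assumptions based on the install example above); CLI-based installs can rely on the CLI's automated detection of new versions and upgrade suggestions.

```bash
# Hypothetical sketch for Helm-based installs
helm repo update
helm upgrade groundcover groundcover/groundcover \
  --namespace groundcover --reuse-values
```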
Learn how to create and configure monitors using the Wizard, Monitor Catalog, or Import options. The following guide will help you set up queries, thresholds, and alert routing for effective monitoring.
Creating new monitors is currently supported using our web interface only.
In the Monitors section (left navigation bar), navigate to the Issues page or the Monitor List page to create a new Monitor. Click on the blue “Create Monitor” button and select one of the following options from the dropdown:
The Monitor Wizard is a guided, user-friendly approach to creating and configuring monitors tailored to your observability needs. By breaking down the process into simple steps, it ensures consistency and accuracy.
Set up the basic information for the monitor.
Monitor Title (Required):
Add a title for the monitor. The title will appear in notifications and in the Monitor List page.
Description (Optional):
Add a description for your monitor. The description will appear when viewing monitor details; you can also use it in your alerts.
Select the data source, build the query and define thresholds for the monitor.
If you're unfamiliar with query building in groundcover, refer to the Query Builder section for full details on the different components.
Data Source (Required):
Select the type of data (Metrics, Infra Metrics, Logs, or Traces).
Query Functionality:
Choose how to process the data (e.g., average, count).
Add group-by clauses if applicable.
Time Window (Required):
Specify the period over which data is aggregated.
Example: “Over the last 5 minutes.”
Threshold Conditions (Required):
Define when the monitor triggers. You can use:
Greater Than - Trigger when the value exceeds X.
Lower Than - Trigger when the value falls below X.
Within Range - Trigger when the value is between X and Y.
Outside Range - Trigger when the value is not between X and Y.
Example: “Trigger if disk space usage is greater than 10%.”
Visualization Type (Optional):
Preview data using a Stacked Bar or Line Chart for better clarity. This is only meant to help you visualize the data while building the monitor.
Customize how the Monitor’s Issues will appear. This section also includes a live preview of the way it will appear in the Issues page.
Ensure that the labels you wish to use dynamically (e.g., span_name, workload) are defined in the query configuration step (Section 2: Query).
Issue Header (required):
Define a name for issues that this Monitor will raise. It's useful to use labels that can include information from the query.
For example, adding {{ alert.labels.statusCode }} to the header will inject the status code into the name of the issue - this becomes especially useful when one Monitor raises multiple issues and you want to quickly understand their content without having to open each one.
Severity (required):
Use severity to categorize alerts by importance.
Select a severity level (S1-S4).
Resource Labels (optional):
The labels here should give your team context on what the subject of the issue is.
Examples:
span_name for an API-based monitor
pod_name for a Pod Crash monitor
Context Labels (optional):
The labels here should give your team context on where the issue happened.
Examples:
cluster
namespace
Organize and categorize monitors; you can use these labels to route issues using advanced workflows.
Labels (optional):
Add key-value pairs for metadata.
Define how often the monitor evaluates its conditions.
Evaluation Interval (Required):
Specify how often the monitor evaluates the query.
Example: “Evaluate every 1 minute.”
Pending Period (Required):
This ensures that transient conditions do not trigger alerts, reducing false positives. For example, setting this to 10 minutes ensures the condition must persist for at least 10 minutes before firing.
Example: “Wait for 10 minutes before alerting.”
Set up how issues from this monitor will be routed.
Select Workflow (Optional):
Route alerts to existing workflows only; this means that other workflows will not process them. Use this to send alerts for a critical application to Slack or PagerDuty.
No Routing (Optional):
This means that any workflow (without filters) will process the issue.
Whenever possible, use our carefully crafted monitors from the Monitor Catalog. This will save you time, ensure the Monitors are built effectively, and help you align your alerting strategy with best practices. If you can't find one that perfectly matches your needs, use them as your starting point and edit their properties to customize them to your needs.
Give the Monitor a clear, short name, that describes its function at a high level.
“Workload High API Error Rate”
“Workload Pods High Memory”
The title will appear in the monitors page table and be accessible in workflows and alerts.
Choose a clear name for the Issue header, offering a bit more details and going into a more specific description of the monitor name. A Header is a specific property of an issue, so you can add templated dynamic values here. For example, you can use dynamic label values in the header name.
“HTTP API Error {{ alert.labels.status_code }}”
“Workload {{ alert.labels.workload }} Pod Restart”
“{{ alert.labels.customer }} APIs High Latency”
If you do choose to use templated dynamic values, make sure they exist as monitor query labels.
We recommend using up to 3 ResourceHeaderLabels. The labels here should give your team the context of what is the subject of the issue.
span_name, pod_name
ResourceHeaderLabels appear as a secondary header in Issues tables across the platform.
We recommend using up to 3 ContextHeaderLabels. The labels here should give your team the context of where the issue happened.
cluster, namespace, workload
ContextHeaderLabels appear on Issues tables across the platform, next to your issues.
This is an advanced feature, please use it with caution.
Here you can add multiple Monitors using an array of Monitors that follows the Monitor YAML structure.
Click on "Create Monitors" to create them.
The groundcover platform generates 100% of its metrics from the actual data. There are no sample rates or complex interpolations to make up for partial coverage. Our measurements represent the real, complete flow of data in your environment.
Stream processing allows us to construct the majority of the metrics on the very node where the raw transactions are recorded. This means the raw data is turned into numbers the moment it becomes possible - removing the need for storing or sending it elsewhere.
Metrics are stored in groundcover's victoria-metrics deployment, ensuring top-notch performance on every scale.
In the world of excessive data, it's important to have a rule of thumb for knowing where to start looking. For application metrics, we rely on our golden signals.
The following metrics are generated for each resource being aggregated:
Requests per second (RPS)
Errors rate
Latencies (p50 and p95)
The golden signals are then displayed in two important ways: Workload and Resource aggregations.
See below for the full list of generated workload and resource golden metrics.
Resource aggregations are highly granular metrics, providing insights into individual APIs.
Workload aggregations are designed to show an overview of each service, enabling a higher level inspection. These are constructed using all of the resources recorded for each service.
groundcover allows full control over the retention of your metrics. Learn more here.
Below you will find the full list of our APM metrics, as well as the labels we export for each. These labels are designed with high granularity in mind for maximal insight depth. All of the metrics listed are available out of the box after installing groundcover, without any further setup.
We fully support the ingestion of custom metrics to further expand the visibility into your environment.
We also allow for building custom dashboards, enabling full freedom in deciding how to display your metrics - building on groundcover's metrics below plus every custom metric ingested.
clusterId: Name identifier of the K8s cluster
region: Cloud provider region name
namespace: K8s namespace
workload_name: K8s workload (or service) name
pod_name: K8s pod name
container_name: K8s container name
container_image: K8s container image name
remote_namespace: Remote K8s namespace (other side of the communication)
remote_service_name: Remote K8s service name (other side of the communication)
remote_container_name: Remote K8s container name (other side of the communication)
type: The protocol in use (HTTP, gRPC, Kafka, DNS etc.)
role: Role in the communication (client or server)
clustered_path: HTTP / gRPC aggregated resource path (e.g. /metrics/*) (applies to: http, grpc)
method: HTTP / gRPC method (e.g. GET) (applies to: http, grpc)
response_status_code: Return status code of an HTTP / gRPC request (e.g. 200 in HTTP) (applies to: http, grpc)
dialect: SQL dialect (MySQL or PostgreSQL) (applies to: mysql, postgresql)
response_status: Return status code of a SQL query (e.g. 42P01 for undefined table) (applies to: mysql, postgresql)
client_type: Kafka client type (Fetcher / Producer) (applies to: kafka)
topic: Kafka topic name (applies to: kafka)
partition: Kafka partition identifier (applies to: kafka)
error_code: Kafka return status code (applies to: kafka)
query_type: type of DNS query (e.g. AAAA) (applies to: dns)
response_return_code: Return status code of a DNS resolution request (e.g. Name Error) (applies to: dns)
method_name, method_class_name: Method code for the operation (applies to: amqp)
response_method_name, response_method_class_name: Method code for the operation's response (applies to: amqp)
exit_code: K8s container termination exit code (applies to: container_state, container_crash)
state: K8s container current state (Running, Waiting or Terminated) (applies to: container_state)
state_reason: K8s container state transition reason (e.g. CrashLoopBackOff or OOMKilled) (applies to: container_state)
crash_reason: K8s container crash reason (e.g. Error, OOMKilled) (applies to: container_crash)
pvc_name: K8s PVC name (applies to: storage)
Summary-based metrics have an additional quantile label, representing the percentile. Available values: ["0.5", "0.95", "0.99"].
groundcover uses a set of internal labels which are not relevant in most use-cases. Find them interesting? Let us know over Slack!
issue_id
entity_id
resource_id
query_id
aggregation_id
parent_entity_id
perspective_entity_id
perspective_entity_is_external
perspective_entity_issue_id
perspective_entity_name
perspective_entity_namespace
perspective_entity_resource_id
In the lists below, we describe error and issue counters. Every issue flagged by groundcover is an error; but not every error is flagged as an issue.
groundcover_resource_total_counter
total number of resource requests
groundcover_resource_error_counter
total number of requests with error status codes
groundcover_resource_issue_counter
total number of requests flagged as issues
groundcover_resource_success_counter
total number of resource requests with OK status codes
groundcover_resource_latency_seconds
resource latency [sec]
groundcover_workload_total_counter
total number of requests handled by the workload
groundcover_workload_error_counter
total number of requests handled by the workload with error status codes
groundcover_workload_issue_counter
total number of requests handled by the workload flagged as issues
groundcover_workload_success_counter
total number of requests handled by the workload with OK status codes
groundcover_workload_latency_seconds
resource latency across all of the workload APIs [sec]
groundcover_pvc_read_bytes_total
total bytes read by the workload from the PVC
groundcover_pvc_write_bytes_total
total bytes written by the workload to the PVC
groundcover_pvc_reads_total
total number of read operations performed by the workload on the PVC
groundcover_pvc_writes_total
total number of write operations performed by the workload on the PVC
groundcover_pvc_read_latency
latency of read operation by the workload from the PVC, in microseconds
groundcover_pvc_write_latency
latency of write operation by the workload to the PVC, in microseconds
groundcover_client_offset
client last message offset (for producer the last offset produced, for consumer the last requested offset)
groundcover_workload_client_offset
client last message offset (for producer the last offset produced, for consumer the last requested offset), aggregated by workload
groundcover_calc_lagged_messages
current lag in messages
groundcover_workload_calc_lagged_messages
current lag in messages, aggregated by workload
groundcover_calc_lag_seconds
current lag in time [sec]
groundcover_workload_calc_lag_seconds
current lag in time, aggregated by workload [sec]
Linux hosts sensor
Supported architectures: AMD64 and ARM64
For the following providers, we will fetch the machine metadata from the provider's API.
Infrastructure Host metrics: CPU/Memory/Disk usage
Logs
Natively from docker containers running on the machine
JournalD (requires configuration)
Static log files on the machine (requires configuration)
Traces
Natively from docker containers running on the machine
APM metrics and insights from the traces
Installation currently requires running a script on the machine.
The script will pull the latest sensor version and install it as a service named groundcover-sensor (requires elevated privileges)
Where:
{apiKey} - Your unique backend token; you can retrieve it with groundcover auth print-api-key
{inCloud_Site} - Your backend ingress address (your inCloud public ingestion endpoint)
{selected_Env} - The environment that will group these machines in the cluster dropdown at the top right of the UI (we recommend using a separate environment for non-Kubernetes deployments)
The sensor supports overriding its default configuration (similarly to the Kubernetes sensor); in this case, the overrides must be written to a file on disk.
The file is located at /etc/opt/groundcover/overrides.yaml. After writing it, restart the sensor service with systemctl restart groundcover-sensor.
Example - override Docker max log line size:
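For illustration, here is a hedged sketch of what such an override file might contain; the key names below are assumptions, so confirm the exact schema with the sensor documentation before using them.

```yaml
# /etc/opt/groundcover/overrides.yaml - key names are assumptions.
logsCollector:
  docker:
    maxLogLineSize: 65536   # bytes; raise to keep longer log lines intact
```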
Once installed, we recommend following these steps to help you quickly gain the most out of groundcover's unique observability platform.
The "Home page" of the groundcover app is our Workloads page. From here, you can get a- service-centric view,
Invites lets you share your workspaces with your colleagues in just a couple of clicks. You can find the "Invite Members" option at the bottom of the left navigation bar. Type in the email addresses of the teammates you want to invite, and set their user permissions (Admin, Editor, Read Only), then click "Send Invites".
groundcover’s Real User Monitoring (RUM) SDK allows you to capture front-end performance data, user events, and errors from your web applications.
This guide will walk you through installing the SDK, initializing it, identifying users, sending custom events, capturing exceptions, and configuring optional settings.
At your app’s entry point:
Link RUM data to specific users:
Instrument key user interactions:
Manually track caught errors:
You can customize SDK behavior (event sampling, data masking, enabled events). The following properties are customizable:
You can pass the values by calling the init function:
Or via the updateConfig function:
Explore and select pre-built Monitors from the catalog to quickly set up observability for your environment. Customize and deploy Monitors in just a few clicks.
The Monitor Catalog is a library of pre-built templates for efficiently creating new Monitors. Browse and select one or more Monitors to quickly add them to your environment with a single click. The Catalog groups Monitors into "Packs", based on different use cases.
You can select as many monitors as you wish, and add them all in one click. Select a complete pack or multiple Monitors from different packs, then click "Create Monitor". All Monitors will be automatically created. You can always edit them later.
View and analyze monitor issues with detailed timelines, metadata, and context to quickly identify and resolve problems in your environment.
The Issues page provides a detailed view of active and resolved issues triggered by Monitors. This page helps users investigate, analyze, and resolve problems in their environment by visualizing issue trends and providing in-depth context through an issue drawer.
Clicking on an issue in the Issues List opens the Issue drawer, which provides an in-depth view of the Monitor and its triggered issue. You can also navigate if possible to related entities like workload, node, pod, etc.
Displays metadata about the issue, including:
Monitor Name: Name of the Monitor responsible for the issue, including a link to it.
Description: Explains what the Monitor tracks and why it triggered.
Severity: Shows the assigned severity level (e.g., S3).
Labels: Lists contextual labels like cluster
, namespace
, and workload
.
Creation Time: Shows when the issue started firing.
Displays the Kubernetes events related to the selected issue within the timeframe selected in the Time Picker dropdown (upper right of the issue drawer).
When creating a Monitor using a traces query, the Traces tab will display the matching traces generated within the timeframe selected in the Time Picker dropdown (upper right of the issue drawer). Click on "View in Traces" to navigate to the Traces section with all relevant filters automatically applied.
When creating a Monitor using a log query, the Logs tab will display the matching logs generated within the timeframe selected in the Time Picker dropdown (upper right of the issue drawer). Click on "View in Logs" to navigate to the Logs section with all relevant filters automatically applied.
A focused visualization of the interactions between workloads related to the selected issue.
Manage and create silences to suppress Monitor notifications during maintenance or specific periods, reducing noise and focusing on critical issues.
The Silences page lists all Silences you and your team created for your Monitors. In this section, you can also create and manage your Silence rules, to suppress notifications and Issues noise, for a specified period of time. Silences are a great way to reduce alert fatigue, which can lead to missing important issues, and help focus on the most critical issues during specific operational scenarios such as scheduled maintenances.
Follow these simple steps to create a new Silence.
Specify the timeframe for the silence rule. Note that the starting point doesn't have to be now, and can also be any time in the future.
Below the From / Until boxes, you'll see a Silence summary, showing its approximate length (rounded down to full days) and starting date.
Define the criteria for Monitors or Issues to be silenced.
Click Add Matcher to specify match conditions (e.g., cluster, namespace, span_name).
Combine multiple matchers for more granular control.
Example: Silence all Monitors in the "demo" namespace.
Preview the issues currently affected by the Silence rule, based on any defined Matchers. This list contains only actively firing Issues.
Tip: Use this preview to see the list of impacted issues and adjust your Matchers before finalizing the Silence.
Add notes or context for the Silence rule. These comments help you and other users understand the purpose of the rule.
The following guides will help you setup and import your alerts from Grafana:
In this section, you'll find a breakdown of the key fields used to define and configure Monitors within the groundcover platform. Each field plays a critical role in how a Monitor behaves, what data it tracks, and how it responds to specific conditions. Understanding these fields will help you set up effective Monitors to track performance, detect issues, and provide timely alerts.
Below is a detailed explanation of each field, along with examples to illustrate their usage, ensuring your team can manage and respond to incidents efficiently.
View, filter, and manage all monitors in one place, and quickly identify issues or create new monitors.
The Monitor List is the central hub for managing and monitoring all active and configured Monitors. It provides a clear, filterable table view of your Monitors, with their current status and key details, such as creation date, severity, and live issues. Use this page to review Monitor performance, identify issues, and take appropriate action.
Displays the following columns:
Name: Title of the monitor.
Creation Date: When the monitor was created.
Live Issues: Number of live issues currently firing.
Status: Is the Monitor "Firing" (alerts active) or "Normal" (no alerts).
Tip: Click on a Monitor name to view its detailed configuration and performance metrics.
Use filters to narrow down monitors by:
Severity: S1, S2, S3, or custom severity levels.
Status: Alerting or Normal.
Silenced: Exclude silenced monitors.
Tip: Toggle multiple filters to refine your view.
Quickly locate monitors by typing a name, status, category, or other keywords.
Located at the top-right corner, use these to focus on monitors for specific clusters or environments.
Note: The Linux host sensor is currently available exclusively to Enterprise users. Check out our subscription plans for more information.
We currently support running on eBPF-enabled Linux machines (see the requirements for more details).
A highly impactful advantage of leveraging eBPF in our proprietary sensor is that it enables visibility into the full request and response payloads - including headers! This allows you to quickly understand issues and provides rich context.
groundcover makes it easy to visualize your data, using our intuitive Query Builder as a guide or using your own queries.
You can set up alerts using our native Monitors, which you can configure using groundcover data and custom metrics. You can also choose from our Monitor Catalog, which contains multiple pre-built Monitors that cover the most common use cases and needs.
This capability is only available to organizations subscribed to our Enterprise plan.
Start capturing RUM data by installing the RUM SDK in your web app.
You can also create a single Monitor from the Catalog. When hovering over a Monitor, a "Wizard" button will appear. Clicking on it will direct you to the Monitor Wizard, where you can review and edit the Monitor before creation.
groundcover enables access to an embedded Grafana within the groundcover platform's interface. This enables you to easily import and continue using your existing Grafana dashboards and alerts.
While we strongly suggest building Monitors using our Monitor Wizard or Monitor Catalog, groundcover also supports building and editing Monitors in YAML. If you choose to do so, the following provides the necessary definitions.
You can create a new Monitor by clicking on Create Monitor, then choosing between the different options: Monitor Wizard, Monitor Catalog, or Import. For further guidance, see the sections below.
Install using our UI
Install using a CLI
Install using Helm
Use Argo CD to deploy
AWS
✅
GCP
✅
Azure
✅
Linode
✅
Title
A string that defines the human-readable name of the Monitor. The title is what you will see in the list of all existing Monitors in the Monitors section.
Description
Additional information about the Monitor.
Severity
When triggered, this will show the severity level of the Monitor's issue. You can set any severity you want here.
s1
for Critical
s2
for High
s3
for Medium
s4
for Low
Header
This is the header of the generated issues from the Monitor.
A short string describing the condition that is being monitored. You can also use this as a pattern using labels from your query.
“HTTP API Error {{ alert.labels.return_code}}”
ResourceHeaderLabels
A list of labels that help you identify the resources related to the Monitor. These appear as a secondary header in all Issues tables across the platform.
["span_name", "kind"]
for monitors on protocol issues.
ContextHeaderLabels
A list of contextual labels that help you identify the location of the issue. This appears as a subset of the Issue’s labels, and is displayed on all Issues tables across the platform.
["cluster", "namespace", "pod_name"]
Labels
A set of pre-defined labels attached to Issues generated by the selected Monitor. Labels can be static, or dynamic using the Monitor's query results.
team: sre_team
ExecutionErrorState
Defines the actions that take place when a Monitor encounters query execution errors.
Valid options are Alerting, OK and Error.
When Alerting is set, query execution errors will result in a firing issue.
When Error is set, query execution errors will result in an error state.
When OK is set, query execution errors will do neither of the above. This is the default setting.
NoDataState
This defines what happens when queries in the Monitor return empty datasets.
Valid options are: NoData, Alerting, and OK.
When NoData is set, the issue instances' state will be No Data.
When OK is set, the issue instances' state will be Pending. The state will change to Alerting once the pending period of the Monitor ends. This is the default setting.
Interval
Defines how frequently the Monitor evaluates its conditions. Common intervals are 1m, 5m, etc.
PendingFor
Defines the number of consecutive intervals during which the threshold condition must be met before the alert is triggered.
Trigger
Defines the condition under which the Monitor fires. This is the threshold definition for the Monitor, with op (the operator) and value.
op: gt, value: 5
Model
Describes the queries, thresholds and data processing of the Monitor. It can have the following fields:
Queries: a list of one or more queries to run. Each query can be SQL over ClickHouse, PromQL over VictoriaMetrics, or a SqlPipeline, and has a name for reference within the Monitor.
Thresholds: the thresholds of your Monitor. Each threshold has a name, an inputName for its data input, an operator (one of gt, lt, within_range, outside_range), and an array of values which are the threshold values.
measurementType
Describes how issues of this Monitor are presented. Some Monitors count events while others track a state, and they are displayed differently in dashboards.
state - presents issues as a line chart.
event - presents issues as a bar chart, counting events.
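To make the fields above concrete, here is a hedged sketch of how they might come together in a single Monitor definition. The field names follow the definitions in this section, but the exact casing, nesting, and query shape are assumptions; use a Monitor exported from the platform as the authoritative reference.

```yaml
# Illustrative Monitor definition - field casing, nesting and the query are assumptions.
title: HTTP API Error Rate
description: Fires when a workload returns too many HTTP error responses
severity: s2
header: "HTTP API Error {{ alert.labels.return_code }}"
resourceHeaderLabels: ["span_name", "kind"]
contextHeaderLabels: ["cluster", "namespace", "pod_name"]
labels:
  team: sre_team
executionErrorState: OK
noDataState: OK
interval: 1m
pendingFor: 5m
trigger:
  op: gt
  value: 5
model:
  queries:
    # PromQL over VictoriaMetrics; the expression is an example only.
    - name: error_rate
      expr: sum(rate(groundcover_workload_error_counter[5m])) by (workload_name)
  thresholds:
    - name: too_many_errors
      inputName: error_rate
      operator: gt
      values: [5]
measurementType: event
```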
groundcover enables access to an embedded Grafana, within the groundcover platform's interface. This enables you to easily import and continue using your existing Grafana dashboards and alerts.
The following guides will help you setup and import your visualizations from Grafana:
This workflow is triggered by an issue, with a filter of alertname: Workload Pods Crashed Monitor. This means only issues created by the Monitor named "Workload Pods Crashed Monitor" will trigger the workflow. In this example we use a Slack message action that uses labels from the issue, as sketched below.
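A hedged sketch of such a workflow, assuming the trigger/filter/action layout described in the Workflows section; the field names and the Slack provider configuration shown here are illustrative, so consult Workflow Examples for the precise syntax.

```yaml
# Illustrative workflow - field names and provider config are assumptions.
workflow:
  id: workload-pods-crashed-slack
  triggers:
    - type: issue
      filters:
        - key: alertname
          value: Workload Pods Crashed Monitor
  actions:
    - name: notify-slack
      provider:
        type: slack
        config: "{{ providers.slack_webhook }}"
      with:
        message: >-
          Pods crashed for workload {{ alert.labels.workload }}
          in namespace {{ alert.labels.namespace }} (status: {{ alert.status }})
```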
In some cases, you may want to avoid sending resolved alerts to your integrations—this prevents incidents from being automatically marked as “resolved” in tools like PagerDuty.
To achieve this, you can add a condition to your action that ensures only firing alerts are sent. Here’s an example of how to configure it in your workflow:
This configuration uses an if condition to check that the alert's status is firing before executing the PagerDuty action, as in the hedged sketch below.
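A hedged sketch of an action guarded by such a condition; the if expression syntax and the PagerDuty provider fields are assumptions, so adapt them to the schema shown in Workflow Examples.

```yaml
# Illustrative action - only runs while the alert is firing.
actions:
  - name: page-oncall
    if: "{{ alert.status }} == 'firing'"   # skip resolved/suppressed alerts
    provider:
      type: pagerduty
      config: "{{ providers.pagerduty }}"
    with:
      title: "{{ alert.alertname }}"
      description: "{{ alert.labels.workload }} in {{ alert.labels.namespace }}"
```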
This workflow is triggered by an issue and uses the slack_webhook
integration to send a Slack message formatted with Block Kit. For more details, see Slack Block Kit.
See Jira Webhook Integration for setting up the integration
Get your Issue Type ID from your Jira, see: https://confluence.atlassian.com/jirasoftwarecloud/finding-the-issue-type-id-in-jira-cloud-1333825937.html
Get your project id from you Jira, see: https://confluence.atlassian.com/jirakb/how-to-get-project-id-from-the-jira-user-interface-827341414.html
Replace <issue_id> and <project_id> in this workflow with your own values, and replace <integration_name> with the name of the integration you created.
Exposing Data Sources for Managed inCloud Setup
groundcover inCloud Managed supports integration with customer owned Grafana by exposing Prometheus and ClickHouse data sources.
Different steps are required for On-Prem deployments, contact us for additional info.
The groundcover tenant API KEY is required for configuring the data source connection.
The key can be obtained by running: groundcover auth get-datasources-api-key
For this example we will use the key
API-KEY-VALUE
Configure the Grafana Prometheus data source by following these steps while logged in as a Grafana Admin.
Connections > Data Sources > + Add new data source
Pick Prometheus
Name: groundcover-prometheus
Prometheus server URL: https://ds.groundcover.com/datasources/prometheus
Custom HTTP Headers > Add Header
Header: apikey
Value: API-KEY-VALUE
Performance
Prometheus type: Prometheus
Prometheus version: > 2.50.x
Click "Save & test"
"Successfully queried the Prometheus API" means the integration was configured correctly.
Configure the Grafana ClickHouse data source by following these steps while logged in as a Grafana Admin.
Connections > Data Sources > + Add new data source
Click "Save & test"
"Data source is working" means the integration was configured correctly.
Workflows are YAML-based configurations designed to automate the response and add context to your issues. Each workflow consists of triggers, steps, and actions, which define when and how a workflow is executed and what tasks are performed.
After uploading your workflow, it can be executed based on the trigger type. Currently groundcover supports triggering workflows based on issues only.
Automatically activates a workflow when an issue is identified by a monitoring source. You can filter issues based on their source or other attributes.
Define a series of data-fetching or computation tasks that help enrich the workflow. Steps are optional but can be useful for adding context to your issues.
Actions specify what happens when a workflow is triggered. Actions can include notifications, data enrichment, or automation tasks. In groundcover, actions typically interface with external systems (like sending a Slack message).
The following guide explains how to build dashboards within the groundcover platform using our fully integrated Grafana interface. To learn how you can create dashboards using Grafana Terraform, follow this guide.
A dashboard is a great tool for visually tracking, analyzing, and displaying key performance metrics, which enable you to monitor the health of your infrastructure and applications.
1️⃣ Go to the Dashboards tab in the groundcover app, and click New and then New Dashboard.
2️⃣ Create your first panel by clicking Add a new panel.
3️⃣ In the New panel view, go to the Query tab.
4️⃣ Choose your data source by clicking -- Grafana -- in the data source selector. You will see the metrics collected from each of your clusters as a Prometheus data source called Prometheus@<cluster-name>.
5️⃣ Create your first Query in the PromQL query interface.
Learn more about Grafana panels and PromQL queries to improve your skills. For any help in creating your custom dashboard don't hesitate to join our Slack support channel.
Tips:
Learn more about the supported metrics you can use to build dashboards in the Infrastructure Metrics section under Infrastructure Monitoring and Application Metrics page.
Learn how to build custom dashboards using groundcover
groundcover’s dashboards are designed to personalize your data visualization and maximize the value of your existing data. Dashboards are perfect for creating investigation flows for critical monitors, displaying the data you care about in a way that suits you and your team, and crafting insights from the data on groundcover.
Easily create a new Dashboard using our guide.
Multi-Mode Query Bar: The Query Bar is central to dashboards and supports multiple modes fully integrated with native pages and Monitors. Currently, the modes include Metrics, Infra Metrics, Logs, and Traces. Learn more in the Query Builder section.
Variables: Built-in variables allow you to filter data quickly based on a predefined list crafted by groundcover.
Widget Types: Two widget types are currently supported:
Chart Widget: Displays data visually.
Textual Widget: Adds context to your dashboards.
Display Types: Five display types are supported for data visualization:
Time Series, Table, Stat, Top List, and Pie. Read more in the Widget Types section.
Alerts in groundcover leverage a fully integrated Grafana interface. To learn how you can create alerts using Grafana Terraform, follow this guide.
Setting up an alert in groundcover involves defining conditions based on data collected in the platform, such as metrics, traces, logs, or Kubernetes events. This guide will walk you through the process of creating an alert based on metrics. More guides will follow to include all different types of data.
Log in to groundcover and navigate to the Alerts section by clicking on it on the left navigation menu.
Once in the Alerts section, click on Alerting in the inner menu on the left.
If you can't see the inner menu, click on the 3 bars next to "Home" in the upper left corner.
Click on Alert Rules
Then click on the blue "+ New alert rule" button in the upper right.
Type a name for your alert. It's recommended to use a name that will make it easy for you to understand its function later.
Select the data source:
ClickHouse: For alerts based on your traces, logs, and Kubernetes events.
Prometheus: For alerts based on metrics (includes APM metrics, infrastructure metrics, and custom metrics from your environment)
Click on "Select metric"
Note: Make sure you are in "Builder" view (see screenshot) to see this option.
Click on "Metrics explorer"
Start typing the name of the metric you want this alert to be based on. Note that the Metrics explorer will start displaying matches as you type, so you can find your metric even if you don't remember its exact name. You can also check out our list of Metrics & Labels.
Once you see your metric in the list, click on "Select" in that row.
Note: You can click on "Run queries" to see the results of this query.
In the Reduce section, open the "Function" dropdown menu and choose the type of value you want to use.
Min - the lowest value
Max - the highest value
Mean - the average of the values
Sum - the sum of all values
Count - the number of values in the result
Last - the last value
In the Threshold section, type a value and choose whether you want the alert to fire when the query result is above or below that value. You can also select a range of values.
Click on "+ New folder" and type a name for the folder in which this rule will be stored. You can choose any name, but it's recommended to use a name that will make it easy for you to find the relevant evaluation groups, should you want to use them again in future alerts.
Click on "+ New evaluation group" and type a name for this evaluation group. The same recommendation applies here too.
In the Evaluation interval textbox, type how often the rule should be evaluated to see if it matches the conditions set in Step 3. Then, click "Create". Note: For the Evaluation interval, use the format (number)(unit), where units are:
s = seconds
m = minutes
h = hours
d = days
w = weeks
In the Pending period box, type how often you want the alert to match the conditions before it fires.
Evaluation interval = how often do you want to check if the alert should fire
Pending period = how long do you want this to be true before it fires
As an example, you can define the alert to fire only if the Mean percentage of memory used by a node is above 90% in the past 2 minutes (Pending period = 2m) and you want to check if that's true every 30 seconds (Evaluation interval = 30s).
If you already have a contact point set up, simply select it from the dropdown menu at the bottom of the "Configure labels and notifications" section. If not, click on the blue "View or create contact points" link, which will open a new tab.
Click on the blue "Add contact point" button
This will get you to the Contact points screen. Then:
Type a name for the contact point
From the dropdown menu, choose which system you want to use to push the alert to.
The information required to push the alert will change based on the system you select. Follow the on-screen instructions (for example, if email is selected, you'll need to enter the email address(es) for that contact).
Click "Save contact point"
You can now close this tab to go back to the alert rule screen.
Next to the link you clicked to create this new contact point, you'll find a dropdown menu, where you can select the contact point you just created.
Under "Add annotations", you have two free text boxes that give you the option to add any information that can be useful to you and/or the recipient(s) of this alert, such as a summary that reminds you of the alert's functionality or purpose, or next step instructions when this alert fires.
Once everything is ready, click the blue "Save rule and exit" button at the upper right of the screen, which will bring you back to the Alert rules screen. You will now be able to see your alert, its status - Normal (green), Pending (yellow), or Firing (red) - and its Evaluation interval (blue).
Log in to your groundcover account and navigate to the dashboard that you want to create an alert from.
Locate the Grafana panel that you want to create an alert from, click on the panel's header, and select Edit.
Click on the alert tab as seen in the image below. Select the Manage alerts option from the dropdown menu.
Click on the New Alert Rule button.
Note:
only time series panels support alert creation.
An alert is derived from three parts, configured in the screen you are navigated to:
Expression - the query that defines the alert input itself
Reduction - the value to be derived from the aforementioned expression
Threshold - the value to measure the reduction output against to decide whether an alert should be triggered
Verify the expression value and enter reduction and threshold values in line with your alerting expectations
Select a folder - if needed, you can navigate to the Dashboards tab in the left nav and create a new folder
Select an evaluation group, or type text to create a new group as shown below
Click "Save and Exit" at the top right of the screen to create the alert
Ensure your notification is configured to have alerts sent to end users. See "Configuring Slack Contact Point" section below if needed.
Note:
Make sure to test the alert to ensure that it is working as expected. You can do this by triggering the conditions that you defined and verifying that the alert is sent to the specified notification channels.
The Query Builder in the platform's Explore and Monitors sections helps you craft and visualize queries on top of your data - Metrics, Infra Metrics, Logs, and Traces.
Metrics – Work with all your available metrics. Great for advanced use cases and custom metrics.
Infra Metrics – Use expert-built, predefined queries for common infrastructure scenarios. Ideal if you’re not sure which metric to pick or just want a quick start.
Logs – Query and visualize Logs data.
Traces – Query and visualize Traces, similar to logs.
RUM - Query and visualize RUM events.
When you select the Metrics or Infra Metrics modes, you’ll work with something akin to Prometheus queries - but simplified.
groundcover supports a wide variety of metrics - Application and Infrastructure metrics are automatically generated using our eBPF sensor, and custom metrics can be ingested natively.
Metric Selector:
Search and choose a metric.
View associated labels and metadata (for groundcover’s built-in metrics).
If the chosen built-in metric’s type is known, the Query Builder automatically applies the best-suited function to streamline your workflow.
Infra Metrics Mode:
Select from ready-made queries grouped into categories (e.g., Container CPU, Node Disk).
Perfect if you’re unsure which metric to choose. Just pick a category, and you’re set.
Filters Bar (Metrics/Infra Metrics):
Filter by label key/value pairs.
Use - to exclude values.
All filters are ANDed together, but multiple values for the same key form an OR condition.
Type a key followed by : (e.g. cluster:) to list its values.
Use patterns (wildcards, partial matches) to refine results.
Aggregation Function Selector:
sum: Adds up all values.
avg: Calculates the average value.
max: Finds the maximum value.
min: Finds the minimum value.
count: Counts how many data points there are.
no aggregation: Leaves data un-aggregated.
Aggregation Labels Selector:
Select one or more labels to group your results by.
Limit Selector:
Show top or bottom results based on:
Max: Highest values.
Min: Lowest values.
Mean: Highest/lowest average values.
Median: Highest/lowest median values.
Last: Highest/lowest most recent values.
Visualization Type:
Time-series: View data over time (time range set by the time-picker).
Table: See instant snapshot data.
Time & Rollup Notes:
The time-picker defines the time range for your query.
Advanced Query
Switching to Advanced Query mode allows you to view and modify the PromQL query generated by the Query Builder. This mode provides full flexibility for advanced users. However, changes made in the editor are not reflected back in the Query Builder. The editor is ideal for making manual adjustments that are beyond the capabilities offered in Query Builder mode.
Selecting or deselecting Clusters and Environments in the Backend Picker won't affect the metrics displayed.
To enhance your data analysis in Metrics mode, groundcover supports the use of Formulas. Formulas allow you to perform arithmetic operations and apply functions to your metrics queries.
Using Formulas:
Assign Query Symbols: each metric query is automatically assigned a letter (A, B, C, etc.).
Construct Formulas: Combine these letters using operators and functions to create expressions.
Supported Operators:
Addition: +
Subtraction: -
Multiplication: *
Division: /
Modulo: %
Exponentiation: **
Parentheses: ()
for grouping
Example:
To calculate the CPU usage percentage where A is the used CPU metric and B is the total CPU capacity, the formula would be (A / B) * 100.
Filters Bar (Logs/Traces):
Same label-based filtering as Metrics.
Free Text Search (Logs only): Search for any substring.
Exclude terms by prefixing them with -. Use * as a wildcard.
Measurement Selection (Logs/Traces):
Count: Count total logs/traces.
Count (unique): Count distinct values of a chosen field.
Avg/Sum/Max/Min: For numeric fields, perform calculations.
Percentiles (P99/P95/P50/P10): Show the value at a specific percentile.
Group By:
Group results by fields (e.g., k8s.namespace, service.name) to break down the data by categories.
Rollup (Logs/Traces):
Choose time buckets (like 1m, 5m) for aggregation. This helps smooth out spikes or show trends over chosen intervals.
Limit:
Limit and sort table data to display only the most relevant rows.
Visualization Type (Logs/Traces):
Time-series
Table
Stat
Top List
Pie
Here are a few examples to help you understand how to build and visualize queries using the Query Builder:
Query the top 5 workloads with the highest average container CPU usage in the namespaces demo-ng and opentelemetry-demo.
Quickly find the average memory usage of workloads in the namespace demo-ng. Instead of crafting a complex query, we simply selected Container Memory > Usage Amount.
Query the P99 duration of all HTTP traces across the platform. This query is broad, with no filters applied for clusters, namespaces, or specific services.
Narrow it down: Query the P99 duration of HTTP traces, but this time only for outbound traces from a specific workload.
Visualize the distribution of log counts per log level in the cluster demo. This provides a quick snapshot of the log severity levels.
The RUM Query Builder allows you to query RUM data collected by the groundcover SDK.
It includes the same components as the Logs/Traces builder (Measurement Selection, Group By, Rollup, Limit, Visualization Type), plus an additional selector for Event Type.
The following event types and their properties are available:
Page Loads
page_load_time
Page Load Time
Numeric
page_url
Page URL
String
Errors
error_fingerprint
Error Fingerprint
String
error_message
Error Message
String
error_type
Error Type
String
Custom Events
custom_event_name
Event Name
String
Interactions
dom_event_target.text
Target Text
String
dom_event_selector
Target Selector
String
dom_event_target.id
Target ID
String
dom_event_target.className
Target Class Name
String
Performance
performance_metric_name
Metric Name
String
performance_metric_value
Metric Value
Numeric
Navigations
page_url
URL
String
session_id
service.name
location.path
location.title
user.email
user.organization
user.name
browser.version
browser.name
browser.type
browser.platform
Create a visualization of Errors over time grouped by error_message in the demo service:
Quickly understand your data with groundcover
groundcover insights give you a clear snapshot of notable events in your data. Currently, the platform supports Error Anomalies, with more insight types on the way.
Error Anomalies instantly highlight workloads, containers, or environments experiencing unusual spikes in Error or Critical logs. These anomalies are detected using statistical algorithms, continuously refined through user feedback for accuracy.
Each insight displays the error/critical log trends of the specific entity (e.g., workload). Clicking the insight automatically applies relevant filters, letting you quickly investigate and resolve the root cause.
Creating new workflows is currently supported through the app only. Browse to the Settings page, then to the Workflows screen, and click the "Create Workflow" button:
Clicking the button will open a text editor where you can add your workflow definition in YAML format. See Workflows for definitions and Workflow Examples for examples.
When creating a new workflow that uses an action requiring an integration (for example, Slack), make sure you've added the integration prior to creating the workflow. See: Workflow Integrations
Upon successful workflow creation it will be active immediately, and a new workflow record will appear in the underlying table.
For each existing workflow, the following fields are shown:
Name: your defined workflow name
Description: if you've added a description of the workflow.
Creator: the workflow creator's email
Creation Date: the workflow creation date
Last Execution Time: timestamp of the last workflow execution; this field depends on the workflow trigger type.
Last Execution Status: the last execution status, failure or success.
A guide on how to enable CRD-based scraping targets
By default, the VM operator will identify the Prometheus CRDs (ServiceMonitor, PodMonitor, PrometheusRule and Probe) that are already deployed and will scrape them automatically.
In case you want to deploy a test monitor object, here is an example using PodMonitor
Create the following my-test-podmonitor.yaml
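For reference, a minimal PodMonitor manifest could look like the following; the label selector and port name are placeholders you should adapt to your own pods.

```yaml
# my-test-podmonitor.yaml - selector and port name are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-test-podmonitor
spec:
  selector:
    matchLabels:
      app: my-app            # must match the labels on your pods
  podMetricsEndpoints:
    - port: metrics          # name of the container port exposing metrics
      path: /metrics
```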
Deploy it, for example with kubectl apply -f my-test-podmonitor.yaml
The vmagent will reload its configuration and start scraping the target; metrics should appear in groundcover's Grafana shortly after.
Quickly understand what requires your attention and drive your investigations
The issues page is a useful place to start a troubleshooting or investigation flow from. It gathers together all active issues found in your Kubernetes environment.
HTTP / gRPC Failures
Capturing failed HTTP / gRPC calls with Response Status Codes of:
5XX
— Internal Server Error
429
— Too Many Requests
MySQL / PostgreSQL Failures
Capturing failed SQL statement executions with Response Errors Codes such as:
1146
— No Such Table
1040
— Too Many Connections
1064
— Syntax Error
Redis Failures
Capturing any Error reported by the Redis serialization protocol (RESP), such as:
ERR unknown command
Container Restarts
Capturing all container restart events across the cluster, with Exit Codes such as:
0
— Completed
137
— OOMKilled
Deployment Failures
Capturing events such as:
MinimumReplicasUnavailable
— Deployment does not have minimum availability
Issues are auto-detected and aggregated - each issue representing many identical repeating incidents. Aggregation helps cut through the noise quickly and reach insights such as when a new type of issue first appeared and when it was last seen.
Issues are grouped by:
Type (HTTP, gRPC, Container Restart, etc..)
Status Code / Error Code (e.g. HTTP 500, gRPC 13)
Workload name
Namespace
The smart aggregation mechanism will also identify query parameters, remove them, and group the stripped queries / API URIs into patterns. This allows users to easily identify and isolate the root cause of a problem.
Each issue is assigned a velocity graph showing its behavior over time (like when it was first seen) and a live counter of its number of incidents.
By clicking on an issue, users can access the specific traces captured around the relevant issue. Each trace is related to the exact resource that was used (e.g. raw API URI, or SQL query), its latency and Status Code / Error Code.
Further clicking on a selected captured trace allows the user to investigate the root cause of the issue with the entire payload (body and headers) of the request and response, information about the participating container, the application logs around the incident's time, and the full context of the metrics around the incident.
Description of the alert fields you can use in your workflows
Role-Based Access Control (RBAC) in groundcover gives you a flexible way to manage who can access certain features and data in the platform. By defining both default roles and policies, you ensure each team member only sees and does what their level of access permits. This approach strengthens security and simplifies onboarding, allowing administrators to confidently grant or limit access.
Policies are the foundational elements of groundcover’s RBAC. Each policy defines:
A permission level – which actions the user can perform (Admin, Editor, or Viewer-like capabilities).
A data scope – which clusters, environments, or namespaces the user can see.
By assigning one or more policies to a user, you can precisely control both what they can do and where they can do it.
groundcover provides three default policies to simplify common use cases:
Default Admin Policy
Permission: Admin
Data Scope: Full (no restrictions)
Behavior: Unlimited access to groundcover features and configurations.
Default Editor Policy
Permission: Editor
Data Scope: Full (no restrictions)
Behavior: Full creative/editing capabilities on observability data, but no user or system management.
Default Viewer Policy
Permission: Viewer
Data Scope: Full (no restrictions)
Behavior: Read-only access to all data in groundcover.
These default policies allow you to quickly onboard new users with typical Admin/Editor/Viewer capabilities. However, you can also create custom policies with narrower data scopes, if needed.
A policy’s data scope can be defined in two modes: Simple or Advanced.
Simple Mode
Uses AND logic across the specified conditions.
Applies the same scope to all entity types (e.g., logs, traces, events, workloads).
Example: “Cluster = Dev AND Environment = QA”, restricting all logs, traces, events, etc. to the Dev cluster and QA environment.
Advanced Mode
Lets you define different scopes for each data entity (logs, traces, events, workloads, etc.).
Each scope can use OR logic among conditions, allowing more fine-grained control.
Example:
Logs: “Cluster = Dev OR Prod”
Traces: “Namespace = abc123”
Events: “Environment = Staging OR Prod”
When creating or editing a policy, you select permission (Admin, Editor, or Viewer) and a data scope mode (Simple or Advanced).
A user can be associated with multiple policies. When that occurs:
Permission Merging
The user’s final permission level is the highest among all assigned policies.
Example: If one policy grants Editor and another grants Viewer, the user is effectively an Editor overall.
Data Scope Merging
Data scopes merge via OR logic, broadening the user’s overall data access.
Example: Policy A => “Cluster = A”, Policy B => “Environment = B”, so the final scope is “Cluster A OR Environment B”.
Metrics Exception
For metrics data only, groundcover uses a single policy’s scope (not a combination). This prevents creating an overly broad metrics view when multiple policies are assigned.
By combining multiple policies, you can support sophisticated permission setups—for example, granting Editor capabilities in certain clusters while restricting a user to Viewer in others. The user’s final access reflects the highest permission among their assigned policies and the union (OR) of scopes for all data types except metrics.
In summary:
Policies define both permission (Admin, Editor, or Viewer) and data scope (clusters, environments, namespaces).
Default Policies (Admin, Editor, Viewer) provide no data restrictions, suitable for quick onboarding.
Custom Policies allow more granular restrictions, specifying exactly which entities a user can see or modify.
Multiple Policies can co-exist, merging permission levels and data scopes (with a special rule for metrics).
This flexible system gives you robust control over observability data in groundcover, ensuring each user has precisely the access they need.
Note: Only users with Write or Admin permissions can create and edit dashboards.
Navigate to the Dashboard List and click on the Create New Dashboard button.
Provide an indicative name for your dashboard and, optionally, a description.
Create a new widget
Choose a Widget Type
Select a Widget Mode
Build your query
Choose a Display Type
Save the widget
Optional:
Add variables
Apply variable(s) to the widget
Widgets can be added by clicking on the Create New Widget button.
Widgets are the main building blocks of dashboards. groundcover supports the following widget types:
Chart Widget: Visualize your data through various display types.
Textual Widget: Add context to your dashboard, such as headers or instructions for issue investigations.
Since selecting a Textual Widget is the last step for this type of widget, the rest of this guide is relevant only to Chart Widgets.
Metrics: Work with all your available metrics for advanced use cases and custom metrics.
Infra Metrics: Use expert-built, predefined queries for common infrastructure scenarios. Ideal for quick starts.
Logs: Query and visualize log data.
Traces: Query and visualize trace data similar to logs.
Once the Widget Mode is selected, build your query for the visualization.
Variables dynamically filter your entire dashboard or specific widgets with just one click. They consist of a key-value pair that you define once and reuse across multiple widgets.
Our predefined variables cover most use cases, but if you’re missing an important one, let us know. Advanced variables are also on our roadmap.
Click on Add Variable.
Select the variable key and values from the predefined list.
Optionally, rename the variable or use the default name, then click Create.
Once created, select the values to apply to this variable.
Variables can be referenced in the Filter Bar of the Widget Creation Modal using their name.
Create a variable (for example, select Clusters from the predefined list, and name it 'clusters').
While creating or editing a Chart Widget, add a reference to the variable in the filter bar using a dollar sign (for example, $clusters).
The data will automatically be filtered by the variable's key with the selected values. If all values are selected, the filter will be followed by an asterisk (for example, cluster:*).
To help you slice and dice your data, you can use our dynamic filters (left panel) and/or our powerful querying capabilities:
Query Builder - Supports key:value pairs, as well as free text search. The Query Builder works in tandem with our filters.
Advanced Query - Currently available only for our Logs section, it enables more complex queries, including nested condition support and explicit use of a variety of operators.
To further focus your results, you can also restrict the results to specific time windows using the time picker on the upper right of the screen.
The Query Builder is the default search option wherever search is available. It supports advanced autocomplete of keys and values, and a discovery mode across the values in your data that helps users learn the data model.
The following syntaxes are available for you to use in Query Builder:
Filters are very easy to add and remove, using the filters menu on the left bar. You can combine filters with the Query Builder, and filters applied using the left menu will also be added to the Query Builder in text format.
Select / deselect a single filter - click on the checkbox on the left of the filter. (You can also deselect a filter by clicking the 'x' next to the text format of the filter on the search bar).
Deselect all but one filter (within a filter category, such as 'Level' or 'Format') - hover over the filter you want to leave on, then click on "ONLY".
You can switch between filters you want to leave on by hovering on another filter and clicking "ONLY" again.
To turn all other filters in that filter category back on, hover over the filter again and click "ALL".
Clear all filters within a filters category - click on the funnel icon next to the category name.
Clear all filters currently applied - click on the funnel icon next to the number of results.
Advanced Query is currently available only in the Logs section.
Filters are not available in Advanced Query mode.
The following syntaxes are available for you to use in Advanced Query:
Find all logs with level 'error' or 'warning', in 'json' or 'logfmt' format, where the status code is 500 or 503, the request path contains '/api/v1/', and exclude logs where the user agent is 'vmagent' or 'curl':
Find logs where the bytes transferred are greater than 10000, the request method is POST, the host is not '10.1.11.65', and the namespace is 'production' or 'staging':
Find logs from pods starting with 'backend-' in 'cluster-prod', where the level is 'error', the status code is not 200 or 204, and the request protocol is 'HTTP/2.0':
Find logs where the 'user_agent' field is empty or does not exist, the request path starts with '/admin', and the status code is greater than 400:
Find logs in 'json' format from hosts starting with 'ip-10-1-', where the level is 'unknown', the container name contains 'redis', excluding logs with bytes transferred equal to 0:
Find logs where the time is '18/Sep/2024:07:25:46 +0000', the request method is GET, the status code is less than 200 or greater than 299, and the host is '10.1.11.65':
Find logs where the level is 'info', the format is 'clf', the namespace is 'production', the pod name contains 'web', and exclude logs where the user agent is 'vmagent':
Find logs where the container name does not exist, the cluster is 'cluster-prod', the request path starts with '/internal', and the request protocol is 'HTTP/1.1':
Find logs where the bytes transferred are greater than 5000, the request method is PUT or DELETE, the status code is 403 or 404, and the host is not '10.1.11.65':
Find logs where the format is 'unknown', the level is not 'error', the user agent is 'curl', and the pod name starts with 'test-':
By default, the search bar will be displayed in Query Builder mode. Use the button on the right of the search bar to switch back and forth between the Query Builder and Advanced Query.
Let groundcover automatically scrape your custom metrics
The following helm override enables custom metrics scraping
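A minimal sketch of such an override, assuming the chart exposes the scraper under a custom-metrics key with an enabled flag; the key names are assumptions, so check your chart's values for the exact path.

```yaml
# custom-values.yaml - key names are assumptions, verify against your chart.
custom-metrics:
  enabled: true
```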
Scrape your custom metrics using groundcover CLI (using default scrape jobs):
Either create a new custom-values.yaml or edit your existing groundcover values.yaml
Ensure that the Kubernetes resources that contain your Prometheus exporters have been deployed with the following annotations to enable scraping
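For example, the widely used prometheus.io annotations on the pod template usually look like the following; the port and path values are placeholders for your own exporter.

```yaml
# Pod template metadata - port and path are placeholders for your exporter.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9100"
    prometheus.io/path: "/metrics"
```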
By default, the following scrape jobs are deployed when enabling custom-metrics:
In case you're interested in disabling autodiscovery scrape jobs, provide the below override
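A hedged sketch of what such an override could look like; the exact flag name is an assumption, so confirm it against your chart's values.

```yaml
# Key names are assumptions - confirm against your chart's values.
custom-metrics:
  autodiscovery:
    enabled: false
```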
Disabling custom-metrics scrape jobs allows you to scale the custom-metrics deployment horizontally.
In case you're interested in deploying custom scrape jobs, create or add the following override
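Custom jobs use the standard Prometheus scrape-job syntax (vmagent-compatible) under an extraScrapeConfigs section; the job name and target below are placeholders, and the exact override path is an assumption.

```yaml
# Illustrative custom scrape job - replace the job name and target with your own.
custom-metrics:
  extraScrapeConfigs:
    - job_name: my-exporter
      scrape_interval: 30s
      static_configs:
        - targets: ["my-exporter.my-namespace.svc.cluster.local:9100"]
```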
In order to safeguard groundcover's performance, there are default limitations on metrics ingestion in place.
In order to increase metrics resolution, you can implement the following overrides
Increasing cardinality parameters will increase memory/CPU consumption and might cause OOMKills or CPU throttling.
Please use with caution and increase the custom metrics agent / metrics server resources accordingly.
In case you wish to increase metrics server / custom metrics resources, use the following overrides:
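A hedged sketch of what such resource overrides might look like; the component value paths and the sizes themselves are assumptions to adapt to your chart version and load.

```yaml
# Value paths and sizes are assumptions - tune to your chart version and load.
custom-metrics:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      memory: 2Gi
victoria-metrics-single:
  server:
    resources:
      requests:
        cpu: "1"
        memory: 4Gi
      limits:
        memory: 8Gi
```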
groundcover has a set of example dashboards in the Dashboards by groundcover folder which can get you started. These dashboards are read-only, but you can see the PromQL query behind each panel by right-clicking the panel and selecting Explore.
Enable scraping and victoria-metrics-operator
Other than supporting the standard Prometheus CRDs, the VictoriaMetrics operator has its own proprietary CRDs that can be used; more about them can be found in the VictoriaMetrics operator documentation.
This capability is only available to organizations subscribed to our Enterprise plan.
If you're unfamiliar with query building in groundcover, refer to the Query Builder section for full details on the different components.
groundcover can scrape your custom metrics by deploying a metrics scraper (vmagent by VictoriaMetrics) that will automatically scrape Prometheus targets.
vmagent is fully compatible with the Prometheus scrape job syntax; more details can be found in the vmagent documentation.
Starting November 25th, 2024, the kubernetes-pods scrape job is the only scrape job enabled out of the box when activating custom metrics scraping. You can add back the legacy scrape jobs under the extraScrapeConfigs section, as described above.
labels
Map of key:value pairs derived from the Monitor definition.
{ "workload": "frontend",
"namespace": "prod" }
status
Current status of the alert
firing - Active alert indicating an ongoing issue.
resolved - The issue has been resolved, and the alert is no longer active.
suppressed - Alert is suppressed.
pending - No Data or insufficient data to determine the alert state.
lastReceived
Timestamp when the alert was last received
This alert timestamp
firingStartTime
Start time of the firing alert
First timestamp of the current firing state.
source
Sources generating the alert
grafana
fingerprint
Unique fingerprint of the alert, this is a hash of the labels
02f5568d4c4b5b7f
alertname
Name of the monitor
Workload Pods Crashed Monitor
trigger
Trigger condition of the workflow
alert / manual / interval
Choose a Y-axis unit from the predefined list.
Select a visualization type: Stacked Bar or Line Chart.
Metrics
Infra Metrics
Logs
Traces
Define columns based on data fields or metrics.
Choose a Y-axis unit from the predefined list.
Metrics
Infra Metrics
Logs
Traces
Select a Y-axis unit from the predefined list.
Metrics
Infra Metrics
Logs
Traces
Choose a ranking metric and sort order.
Logs
Traces
Select a data source and aggregation method.
Logs
Traces
key:value
Filters: Use golden filters to narrow down your search. Note: Multiple filters for the same key act as 'OR' conditions, whereas multiple filters for different keys act as 'AND' conditions.
level:error
Logs Traces K8s Events API Catalog
@key:value
Attributes: Search within the content of attributes. Note: Multiple filters for the same key act as 'OR' conditions, whereas multiple filters for different keys act as 'AND' conditions.
@transaction.id:123
Logs Traces
term
Free text (exact match): Search for single-word terms.
Tip: Expand your search results by using wildcards.
term
Logs K8s Events
" "
Phrase Search (case-insensitive): Enclose terms within double quotes to find results containing the exact phrase.
"search term"
Logs K8s Events
*
Wildcard: Search for partial matches. Note: Wildcards are enabled in all searches except phrase search, where they will be treated as an asterisk character.
key:val*
@key:val*
te*
Logs Traces K8s Events API Catalog
-
Exclude: Specify terms or filters to omit from your search; applies to each distinct search.
-key:value
-@key:value
-term
-"search term"
Logs Traces K8s Events API Catalog
*:""
Holistic Attribute Search: Search for a particular value across all attributes
*:"error"
Logs Traces
key:value
Filters: Use golden filters to narrow down your search. Note: Multiple filters for the same key act as 'OR' conditions, whereas multiple filters for different keys act as 'AND' conditions.
level:error
Logs
@key:value
Attributes: Search within the content of attributes. Note: Multiple filters for the same key act as 'OR' conditions, whereas multiple filters for different keys act as 'AND' conditions.
@transaction.id:123
Logs
term
Free text (exact match): Search for single-word terms.
Tip: Expand your search results by using wildcards.
term
Logs
" "
Phrase Search (case-insensitive): Enclose terms within double quotes to find results containing the exact phrase.
"search term"
Logs
~
Wildcard: Search for partial matches. Note: Wildcards must be added before the search term or value, and will always be treated as a partial match search.
key:~val
@key:~val
~term
~"search phrase"
Logs
NOT
!
Exclude: Specify terms or filters to omit from your search; applies to each distinct search.
!key:value
NOT @key:value
NOT term
!"search term"
Logs
key:""
Identify cases where key does not exist or is empty
pid:""
Logs
key:=#
key:>#
key:<#
Search for key:value pairs where the value is equal to, greater than, or smaller than a specified number.
threadPriority:>5
Logs
key:(val1 or val2)
Search for key:value pairs using a list of values.
level:(error or info)
Logs
query1 or query2
Use OR operator to display matches on either queries
level:error or format:json
Logs
query1 and query2
Use AND operator to display matches on both queries
level:error and format:json
Logs
"Search term prefix"*
Exact phrase prefix search
"Error 1064 (42"*
Logs
Enjoy a richer custom dashboards experience
kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects. groundcover comes with a built-in deployment of kube-state-metrics, and exposes its metrics within our embedded Grafana in two potential ways described below.
groundcover's built-in deployment of kube-state-metrics comes with a minimal configuration, keeping only a pre-selected set of the metrics which are used in our native screens. This approach has the benefit of limiting the potentially huge cardinality of KSM metrics, while also allowing us to enrich the selected metrics with extra groundcover labels.
KSM metrics generated in this method will be prefixed with groundcover_
in order to separate them from other KSM deployments.
For example, the metric kube_pod_status_phase will appear in groundcover as groundcover_kube_pod_status_phase.
If you are interested in having the full scope of available KSM metrics, it's possible to configure groundcover's KSM deployment for the task.
This is achieved following these steps:
Configuring the KSM deployment for full collection
Turning on custom metrics scraping
Setting up a scrape job to fetch the KSM metrics
groundcover uses a minimal set of KSM collectors. In order to configure our KSM deployment for full collection, use the following configuration:
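A hedged sketch of such a configuration, assuming the kube-state-metrics sub-chart options are exposed under a kube-state-metrics key; the nesting and the collector list are assumptions, so extend the list to the resources you need and confirm against the sub-chart values.

```yaml
# Nesting and option names are assumptions - confirm against the sub-chart values.
kube-state-metrics:
  collectors:
    - pods
    - deployments
    - daemonsets
    - statefulsets
    - replicasets
    - jobs
    - cronjobs
    - nodes
    - namespaces
    - services
    - persistentvolumeclaims
    - horizontalpodautoscalers
```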
groundcover supports a flag to turn on automatic collection of custom metrics. If you haven't enabled custom metrics yet, do so using the steps in the link above.
Add the following override to add a scrape job which will scrape the KSM deployment, making the metrics available in groundcover's embedded Grafana.
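A hedged sketch of such a scrape job; the KSM service name, namespace, and port are assumptions you should adjust to your installation.

```yaml
# Service name, namespace and port below are assumptions - adjust to your install.
custom-metrics:
  extraScrapeConfigs:
    - job_name: kube-state-metrics-full
      static_configs:
        - targets: ["groundcover-kube-state-metrics.groundcover.svc.cluster.local:8080"]
```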
Enhance your ability to easily monitor a group of clusters
This capability is available only to Enterprise users. Learn more about our paid plans.
Labeling clusters in your cloud-native environments is very helpful for efficient resource management and observability. By assigning an environment label in groundcover, you can categorize and identify clusters based on any specific criteria that you find helpful. For example, you can choose to label your clusters by environment type (development, staging, production, etc.), or by region (EU, US, etc.).
To add an environment label to your cluster, edit your cluster's existing values.yaml and add the following line:
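For example (the exact key name may differ in your chart version; production is only an illustrative value):

```yaml
global:
  env: production   # assumed key; use any label value that fits your grouping, e.g. staging, eu, us
```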
Once defined and added, these labels will be available for you to select in the cluster and environment drop down menu ("Cluster Picker").
Learn how to backup and restore metrics into groundcover metrics storage
groundcover uses VictoriaMetrics as its underlying metrics storage solution. As such, groundcover integrates seamlessly with VictoriaMetrics vmbackup and vmrestore tools.
port-forward groundcover's VictoriaMetrics service object
Run the vmbackup
utility, in this example we'll set the destination to an AWS S3 bucket, but more providers are supported
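A sketch of these two steps, assuming the default VictoriaMetrics port (8428) and a service name that may differ in your deployment; vmbackup must run with access to the VictoriaMetrics data volume:

```bash
# Expose the VictoriaMetrics snapshot API locally (service name is an assumption)
kubectl port-forward -n groundcover svc/groundcover-victoria-metrics-single-server 8428:8428

# Create a snapshot and upload it to an S3 bucket (bucket and path are placeholders)
vmbackup \
  -snapshot.createURL=http://localhost:8428/snapshot/create \
  -storageDataPath=/victoria-metrics-data \
  -dst=s3://<bucket>/<backup-path>
```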
vmbackup automatically uses incremental backup strategy if the destination contains an existing backup
Scale down VictoriaMetrics statefulSet (VictoriaMetrics must be offline during restorations)
Get the VictoriaMetrics PVC name
Create the following Kubernetes Job manifest vm-restore.yaml
Make sure you replace {VICTORIA METRICS PVC NAME} with the fetched pvc name
Deploy the job and wait for completion
Once completed, scale up groundcover's VictoriaMetrics instance
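A sketch of the restore flow described above; the statefulSet name, image tag and bucket path are placeholders you should adjust to your deployment:

```bash
# Take VictoriaMetrics offline and find its PVC
kubectl scale statefulset -n groundcover groundcover-victoria-metrics-single-server --replicas=0
kubectl get pvc -n groundcover
```

```yaml
# vm-restore.yaml - restores the backup into the VictoriaMetrics PVC
apiVersion: batch/v1
kind: Job
metadata:
  name: vm-restore
  namespace: groundcover
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: vmrestore
          image: victoriametrics/vmrestore:latest
          args:
            - -src=s3://<bucket>/<backup-path>
            - -storageDataPath=/victoria-metrics-data
          volumeMounts:
            - name: vmstorage
              mountPath: /victoria-metrics-data
      volumes:
        - name: vmstorage
          persistentVolumeClaim:
            claimName: "{VICTORIA METRICS PVC NAME}"
```

```bash
# Deploy the job, wait for it to finish, then bring VictoriaMetrics back up
kubectl apply -f vm-restore.yaml
kubectl wait --for=condition=complete job/vm-restore -n groundcover --timeout=10m
kubectl scale statefulset -n groundcover groundcover-victoria-metrics-single-server --replicas=1
```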
groundcover generates Infrastructure metrics out of the box, covering common use cases for monitoring the health of your services. This is done without relying on any existing services such as cadvisor or node-exporter.
However, certain use cases such as importing existing dashboards and alerts require additional metrics, which can be scraped using our custom metrics integration.
cadvisor metrics can be automatically scraped into groundcover using the following configuration.
In order to limit cardinality, the configuration below only scrapes the container_cpu_usage_seconds_total
and container_memory_working_set_bytes
metrics.
Editing the regex
part will control which metrics are being scraped into groundcover.
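A sketch of such a configuration; the surrounding values keys are assumptions, while the scrape job itself is a standard Prometheus kubelet/cadvisor scrape_config:

```yaml
custom-metrics:                          # assumed values key
  extraScrapeConfigs:                    # assumed key for additional scrape jobs
    - job_name: cadvisor
      scheme: https
      tls_config:
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
        - role: node
      relabel_configs:
        - target_label: __metrics_path__
          replacement: /metrics/cadvisor
      metric_relabel_configs:
        - source_labels: [__name__]
          regex: container_cpu_usage_seconds_total|container_memory_working_set_bytes
          action: keep
```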
type
clusterId
region
namespace
node_name
workload_name
pod_name
container_name
container_image
groundcover_container_cpu_usage_rate_millis
CPU usage in mCPU
mCPU
groundcover_container_cpu_request_m_cpu
K8s container CPU request
mCPU
groundcover_container_cpu_limit_m_cpu
K8s container CPU limit
mCPU
groundcover_container_memory_working_set_bytes
current memory working set
Bytes
groundcover_container_memory_rss_bytes
current memory RSS
Bytes
groundcover_container_memory_request_bytes
K8s container memory request
Bytes
groundcover_container_memory_limit_bytes
K8s container memory limit
Bytes
groundcover_container_cpu_delay_seconds
K8s container CPU delay
Seconds
groundcover_container_disk_delay_seconds
K8s container disk delay
Seconds
groundcover_container_cpu_throttled_seconds_total
K8s container total CPU throttling
Seconds
type
clusterId
region
node_name
groundcover_node_allocatable_cpum_cpu
amount of allocatable CPU in the current node
mCPU
groundcover_node_allocatable_mem_bytes
amount of allocatable memory in the current node
Bytes
groundcover_node_mem_used_percent
percent of used memory in current node
0-100
groundcover_node_used_disk_space
current used disk space in current node
Bytes
groundcover_node_free_disk_space
amount of free disk space in current node
Bytes
groundcover_node_total_disk_space
amount of total disk space in current node
Bytes
groundcover_node_used_percent_disk_space
percent of used disk space in current node
0-100
type
clusterId
region
name
namespace
groundcover_pvc_usage_bytes
PVC usage
Bytes
groundcover_pvc_capacity_bytes
PVC capacity
Bytes
groundcover_pvc_available_bytes
PVC available
Bytes
groundcover_pvc_usage_percent
percent of used PVC storage
0-100
groundcover_pvc_read_bytes_total
total amount of bytes read by the workload from the PVC
Bytes
groundcover_pvc_write_bytes_total
total amount of bytes written by the workload to the PVC
Bytes
groundcover_pvc_reads_total
total amount of read operations done by the workload from the PVC
Number
groundcover_pvc_writes_total
total amount of write operations done by the workload to the PVC
Number
groundcover_pvc_read_latency
latency of read operation by the workload from the PVC
Seconds
groundcover_pvc_write_latency
latency of write operation by the workload to the PVC
Seconds
clusterId workload_name
namespace
container_name
remote_service_name
remote_namespace
remote_is_external
availability_zone
region
remote_availability_zone
remote_region
is_cross_az
protocol
role
server_port
encryption
transport_protocol
is_loopback
Notes:
is_loopback
and remote_is_external
are special labels that indicate the remote service is either the same service as the recording side (loopback) or resides in an external network, e.g. a managed service outside of the cluster (external).
In both cases the remote_service_name
and the remote_namespace
labels will be empty
is_cross_az
means the traffic was sent and/or received between two different availability zones. This is a helpful flag to quickly identify this special kind of communication.
The actual zones are detailed in the availability_zone
and remote_availability_zone
labels
groundcover_network_rx_bytes_total
Bytes received by the workload
Bytes
groundcover_network_tx_bytes_total
Bytes sent by the workload
Bytes
groundcover_network_connections_opened_total
Connections opened by the workload
Number
groundcover_network_connections_closed_total
Connections closed by the workload
Number
groundcover_network_connections_opened_failed_total
Connections attempts failed per workload (including refused connections)
Number
groundcover_network_connections_refused_failed_total
Connections attempts refused per workload
Number
clusterId
Name identifier of the K8s cluster
All
region
Cloud provider region name
All
namespace
K8s namespace
All
workload_name
K8s workload (or service) name
All
pod_name
K8s pod name
All
container_name
K8s container name
All
container_image
K8s container image name
All
remote_namespace
Remote K8s namespace (other side of the communication)
All
remote_service_name
Remote K8s service name (other side of the communication)
All
remote_container_name
Remote K8s container name (other side of the communication)
All
type
The protocol in use (HTTP, gRPC, Kafka, DNS etc.)
All
sub_type
The sub type of the protocol (GET, POST, etc)
All
role
Role in the communication (client or server)
All
clustered_resource_name
The clustered name of the resource, depends on the protocol
All
status_code
"ok", "error" or "unset"
All
server
The server workload/name
All
client
The client workload/name
All
server_namespace
The server namespace
All
client_namespace
The client namespace
All
server_is_external
Indicate whether the server is external
All
client_is_external
Indicate whether the client is external
All
is_encrypted
Indicate whether the communication is encrypted
All
is_cross_az
Indicate whether the communication is cross availability zone
All
clustered_path
HTTP / gRPC aggregated resource path (e.g. /metrics/*)
http, grpc
method
HTTP / gRPC method (e.g GET)
http, grpc
response_status_code
Return status code of an HTTP / gRPC request (e.g. 200 in HTTP)
http, grpc
dialect
SQL dialect (MySQL or PostgreSQL)
mysql, postgresql
response_status
Return status code of a SQL query (e.g 42P01 for undefined table)
mysql, postgresql
client_type
Kafka client type (Fetcher / Producer)
kafka
topic
Kafka topic name
kafka
partition
Kafka partition identifier
kafka
error_code
Kafka return status code
kafka
query_type
type of DNS query (e.g. AAAA)
dns
response_return_code
Return status code of a DNS resolution request (e.g. Name Error)
dns
exit_code
K8s container termination exit code
container_state, container_crash
state
K8s container current state (Running, Waiting or Terminated)
container_state
state_reason
K8s container state transition reason (e.g CrashLoopBackOff or OOMKilled)
container_state
crash_reason
K8s container crash reason (e.g Error, OOMKilled)
container_crash
pvc_name
K8s PVC name
storage
Summary based metrics have an additional quantile label, representing the percentile. Available values: ["0.5", "0.95", "0.99"].
We also use a set of internal labels which are not relevant in most use-cases. Find them interesting? Let us know over Slack!
issue_id
entity_id
resource_id
query_id
aggregation_id
parent_entity_id
perspective_entity_id
perspective_entity_is_external
perspective_entity_issue_id
perspective_entity_name
perspective_entity_namespace
perspective_entity_resource_id
In the lists below, we describe error and issue counters. Every issue flagged by the platform is an error; but not every error is flagged as an issue.
groundcover_resource_total_counter
total amount of resource requests
Number
groundcover_resource_error_counter
total amount of requests with error status codes
Number
groundcover_resource_issue_counter
total amount of requests which were flagged as issues
Number
groundcover_resource_success_counter
total amount of resource requests with OK status codes
Number
groundcover_resource_latency_seconds
resource latency
Seconds
groundcover_workload_total_counter
total amount of requests handled by the workload
Number
groundcover_workload_error_counter
total amount of requests handled by the workload with error status codes
Number
groundcover_workload_issue_counter
total amount of requests handled by the workload which were flagged as issues
Number
groundcover_workload_success_counter
total amount of requests handled by the workload with OK status codes
Number
groundcover_workload_latency_seconds
resource latency across all of the workload APIs
Seconds
groundcover_client_offset
client last message offset (for producer the last offset produced, for consumer the last requested offset)
groundcover_workload_client_offset
client last message offset (for producer the last offset produced, for consumer the last requested offset), aggregated by workload
groundcover_calc_lagged_messages
current lag in messages
Number
groundcover_workload_calc_lagged_messages
current lag in messages, aggregated by workload
Number
groundcover_calc_lag_seconds
current lag in time
Seconds
groundcover_workload_calc_lag_seconds
current lag in time, aggregated by workload
Seconds
groundcover supports the configuration of logs and traces pipelines, to further process and customize the data being collected, using Vector transforms. This enables full flexibility to manipulate the data as it flows into the platform.
See this page for more information about how Vector is being used in the groundcover platform's architecture.
groundcover uses Vector as an aggregator and transformer deployed into each monitored environment. It is an open-source, highly performant service, capable of supporting many manipulations on the data flowing into groundcover's backend.
Pipelines are configured using Vector transforms, where each transform defines one step in the pipeline. There are many types of transforms, and all of them can be natively used within the groundcover deployment to achieve full flexibility.
The most common transform is the remap
transform - allowing to write arbitrary logic using Vector's VRL syntax. There are many pre-defined functions to parse, filter and enrich data, and we recommend experimenting with it to fit your needs.
For testing out VRL before deployment we recommend the VRL playground.
groundcover's deployment supports adding a list of transforms for logs and traces independently. These steps will be automatically appended to the default pipeline, eliminating the need to understand the inner workings of groundcover's setup. Instead, you only need to configure the steps you wish to execute, and after redeploying groundcover you will see them take effect immediately.
Each step requires two attributes:
name: must be unique across all pipelines
transform: the transform itself, passed as-is to Vector.
The following is a template for a logs pipeline with two remap stages:
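A minimal sketch, assuming a logsPipeline values key (the actual key in your chart may differ); each step carries a unique name and a Vector transform passed as-is:

```yaml
logsPipeline:
  - name: parse-json-content
    transform:
      type: remap
      source: |
        # Try to parse the log body as JSON and keep the "app" field as an attribute
        parsed, err = parse_json(.content)
        if err == null {
          .string_attributes.app = to_string(parsed.app) ?? "unknown"
        }
  - name: tag-environment
    transform:
      type: remap
      source: |
        # Add a static attribute to every log (illustrative)
        .string_attributes.env = "dev"
```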
The following is a template for a traces pipeline with one filter stage:
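A minimal sketch, assuming a tracesPipeline values key and the .protocol_type field name; adjust both to your schema:

```yaml
tracesPipeline:
  - name: drop-dns-traces
    transform:
      type: filter
      condition: |
        # Keep everything except DNS traces
        string!(.protocol_type) != "dns"
```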
Logs to Events pipelines allow creating custom events from incoming logs. Unlike the logs and traces pipelines, they do not affect the original logs, and are meant to create parallel, distinguished events for future analytics.
The following is a template for a custom event pipeline with a filter stage and an extraction step.
The inputs
fields below will connect the events pipeline with the default incoming logs pipelines.
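A hedged sketch of such a pipeline; the values key, the inputs name and the event attribute are assumptions meant only to illustrate the structure:

```yaml
eventsPipeline:                      # assumed values key
  - name: filter-checkout-logs
    inputs:
      - logs                         # assumed name of the default incoming logs pipeline
    transform:
      type: filter
      condition: |
        string!(.workload) == "checkout"
  - name: extract-checkout-event
    transform:
      type: remap
      source: |
        .string_attributes.event_type = "checkout_log"   # illustrative attribute
```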
Below are the essentials relevant to writing remap transforms in groundcover. Extended information can be found in Vector's documentation.
We support using all types of Vector transforms as pipeline steps.
For testing VRL before deployment we recommend the VRL playground.
When processing Vector events, field names need to be prefixed by .
, a single period. For example, the content
field in a log, representing the body of the log, is accessible using .content
.
Specifically in groundcover, attributes parsed from logs or associated with traces will be stored under the string_attributes
for string values, and under float_attributes
for numerical values. Accessing attributes is possible by adding additional .
as needed. For example, a JSON log that looks like this:
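For illustration (the field names are hypothetical):

```json
{"level": "info", "user": "alice", "duration_ms": 128, "msg": "request served"}
```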
Will be translated into an event with the following attributes:
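A sketch of the resulting event shape, following the attribute placement described above:

```json
{
  "content": "request served",
  "string_attributes": { "level": "info", "user": "alice" },
  "float_attributes": { "duration_ms": 128 }
}
```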
Each of Vector's built-in functions can be either fallible or infallible. Fallible functions can throw an error when called, and require error handling, whereas infallible functions will never throw an error.
When writing Vector transforms in VRL it's important to use error handling where needed. Below are the two ways error handling in Vector is possible - see more on these docs.
VRL code without proper error handling will throw an error during compilation, resulting in error logs in the Vector deployment.
Let's take a look at the following code.
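A minimal VRL sketch using assignment-style error handling:

```
parsed, err = parse_json(.content)
if err != null {
  log("failed to parse log content as JSON", level: "debug")
}
```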
The code above can either succeed in parsing the json, or fail in parsing it. The err
variable will contain indication of the result status, and we can proceed accordingly.
Let's take a look at this slightly different version of the code above:
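A sketch of the abort-on-error variant:

```
parsed = parse_json!(.content)
```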
This time there's no error handling around, but !
was added after the function call.
This method of error handling is called abort on error
- it will fail the transform entirely if the function returns an error, and proceed normally otherwise.
Both methods above are valid VRL for handling errors, and you must choose one or the other when handling fallible functions. However, they carry one big difference in terms of pipelines in groundcover:
Transforms which use option #1 (error handling) will not stop the pipeline in case of error - the following steps will continue to execute normally. This is useful when writing optional enrichment steps that could potentially fail with no issue.
Transforms which use option #2 (abort) will stop the pipeline in case of error - the event will not proceed to the other steps. This is mostly useful for mandatory steps which can't fail no matter what.
The default behavior above can be changed using the drop_on_error flag. When this flag is set to false
, errors encountered will never stop the pipeline - both for method #1 and for method #2.
This is useful for writing simpler code with less explicit error handling, as can be seen in this log pipeline example.
groundcover recommends using as many "strong" filters as possible, like time filters, workload and namespaces filters, log level filters, etc.
These will help make free text and attribute searches much faster and more efficient.
The $__timeFilter condition enforces time range limits on the query, based on the time window selected for the query.
When querying logs in the platform it's important to distinguish between two types of queries:
Instant queries - will return a single value for each group. For example, counting the number of error logs per workload.
When to use: Threshold-based alerting or when you only need the most recent value
Range queries - will return a series of values over time. For example, counting the number of logs per workload in 5-minute buckets.
When to use: Plotting trends over time
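For example, a minimal instant query counting error logs; the table and column names follow the Logs table fields documented in this guide:

```sql
SELECT count() AS error_logs
FROM logs
WHERE $__timeFilter(timestamp)
  AND level = 'error'
```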
The query uses the count()
operator to get the number of error logs in the defined time window.
groundcover always saves log levels as lower-cased values, e.g: 'error'
, 'info'
.
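A sketch of such a query (the logs table name follows the other examples in this guide):

```sql
SELECT count()
FROM logs
WHERE $__timeFilter(timestamp)
  AND workload = 'kafkajs-events-consumer'
  AND content LIKE '%Connection timeout%'
```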
The query uses the count()
operator to get the number of logs generated by the kafkajs-events-consumer
workload, which contain the phrase Connection timeout.
Using formatted logs allows groundcover to automatically extract attributes from the log, which can then be used in alerts and dashboards.
For example, let's look at the following json-formatted log:
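For illustration, assume a log like the following (attribute names are hypothetical):

```json
{"time": "2024-05-15T10:04:11Z", "level": "info", "http.req.method": "GET", "http.req.path": "/api/users"}
```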
The following query uses the string_attributes
column to query the "http.req.method"
attribute and filter for GET
requests:
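A sketch of the query (logs table name assumed, as in the other examples):

```sql
SELECT count() AS get_requests
FROM logs
WHERE $__timeFilter(timestamp)
  AND string_attributes['http.req.method'] = 'GET'
```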
Make sure to select the Time Series
query type when using range queries
The following query will plot the count of logs grouped by a specific attribute extracted from the logs. It will arrange the counts into 5-minute buckets, showing trend over time.
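A sketch of such a range query, grouping by the illustrative http.req.method attribute:

```sql
SELECT
  toStartOfInterval(timestamp, INTERVAL 5 MINUTE) AS time,
  string_attributes['http.req.method'] AS method,
  count() AS logs_count
FROM logs
WHERE $__timeFilter(timestamp)
GROUP BY time, method
ORDER BY time
```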
We strongly advise reading the intro guide to working with remap transforms in order to fully understand the functionalities of writing pipelines steps.
The following example will filter out all HTTP traces that include the /health
URI.
Note that the filter
transform works by setting up an allow condition - meaning, events which fail the condition will be dropped.
The filter below implements this logic:
If the event isn't an HTTP event, allow it
If the event is an HTTP event, and the resource name doesn't contain "/health", allow it
If the event is an HTTP event AND it has "/health" in the resource name, drop it
We are using the abort
error handling below when calling the string
function. If the protocol type or resource name aren't valid strings, we drop the event.
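A sketch of the filter step; the trace field names (.protocol_type, .resource_name) are assumptions about the trace schema:

```yaml
- name: drop-health-checks
  transform:
    type: filter
    condition: |
      string!(.protocol_type) != "http" || !contains(string!(.resource_name), "/health")
```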
The following example will obfuscate response payloads from a specific server. This can be useful when you want to completely redact responses that contain sensitive data, such as secrets managed by an external server.
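A hedged sketch; the server name and the payload field (.response_body) are assumptions about the trace schema:

```yaml
- name: redact-vault-responses
  transform:
    type: remap
    source: |
      server, err = string(.server_name)
      if err == null && server == "secrets-vault" {
        .response_body = "[REDACTED]"
      }
```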
We strongly advise reading the intro guide to working with remap transforms in order to fully understand the functionalities of writing pipelines steps.
Attributes parsed from logs or traces can be accessed under the .string_attributes
or .float_attributes
maps - see here for more information.
The following example attempts to match the contents of a log line with a given regex pattern, extracting named groups if successful. We recommend using named groups in the regex pattern for best experience, automatically creating named attributes which will appear in the system.
For example, this transform will create new timestamp
and pid
fields if they are successfully extracted from the content.
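A sketch of such a transform; the regex and the "<timestamp> [<pid>] <message>" layout are illustrative:

```yaml
- name: parse-custom-format
  transform:
    type: remap
    drop_on_abort: false
    source: |
      if .format == "unknown" {
        parsed = parse_regex!(.content, r'^(?P<timestamp>\S+) \[(?P<pid>\d+)\] (?P<message>.*)$')
        .string_attributes.timestamp = parsed.timestamp
        .float_attributes.pid = to_float!(parsed.pid)
        .format = "custom-format"
      }
```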
Note that we are only performing the parsing if the format
attribute equals "unknown" - otherwise it means groundcover has already parsed the log format and extracted the fields beforehand.
The custom-format
value is up to you, and will appear in the UI under the format
filter.
For more regex documentation see this page. Vector natively supports parsing many known formats - it's always worth checking if the format is already natively supported!
We are using the drop_on_abort
attribute to instruct vector to keep forwarding the event down the pipeline when encountering errors. For more information see this section.
The following example attempts to rename an attribute called oldName
to newName
.
If it does not exist, no changes are made.
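A minimal sketch of the rename:

```yaml
- name: rename-old-attribute
  transform:
    type: remap
    source: |
      if exists(.string_attributes.oldName) {
        .string_attributes.newName = del(.string_attributes.oldName)
      }
```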
groundcover’s pipelines can be used to protect sensitive data in your Logs and Traces using Vector's redact function. Mask or remove sensitive information while preserving the usefulness of your data.
groundcover’s pipelines can be used to protect sensitive data in your Logs and Traces. With Vector's redact
function, you can mask or remove sensitive information while preserving the usefulness of your observability data.
We highly recommend using Vector's built-in function redact
for Logs/Traces obfuscation. This powerful function allows you to configure simple yet effective redaction rules to protect sensitive information in your logs and traces.
With redact
, you can:
Mask or remove sensitive data from strings, arrays, or objects
Replace text matching specified patterns (like regex) with a placeholder, custom text, or a hash (SHA-2 or SHA-3)
Please refer to the redact
function's documentation for more details.
On this page, we'll explore how to leverage the redact
function and VRL's capabilities to obfuscate PII in Logs and Traces. At the end of this page, you'll find a handy list of regex patterns to save you time and effort.
Trace obfuscation can also be configured directly in the sensor. Can't find what you're looking for? Let us know over Slack.
In the examples below, we redact both the log contents (.content
) and any attributes derived from the structured logs (.string_attributes
).
Obfuscate credit card numbers from Logs
In this example, we'll obfuscate Visa credit card numbers from logs using the Visa credit card regex pattern from the library. By not specifying a redactor type, the redact
function will default to full redaction, replacing detected numbers with the string “[REDACTED].”
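A sketch of the step, reusing the Visa pattern from the library below and redacting both the log content and the string attributes:

```yaml
- name: redact-visa-cards
  transform:
    type: remap
    source: |
      visa = r'(?:4\d{3}){4}|4\d{7}\d{8}|4\d{12}(?:\d{3})?'
      .content = redact(.content, filters: [visa])
      .string_attributes = redact(.string_attributes, filters: [visa])
```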
Here's an example of how Logs appear before and after obfuscation:
Hash US SSNs in Logs
In this example we'll hash all US Social Security Numbers hidden in logs. We'll pass the sha2
parameter to the redactor
to hash the sensitive values.
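A sketch of the step, using the SSN pattern from the library below:

```yaml
- name: hash-us-ssn
  transform:
    type: remap
    source: |
      ssn = r'\b\d{3}-\d{2}-\d{4}\b'
      .content = redact(.content, filters: [ssn], redactor: "sha2")
      .string_attributes = redact(.string_attributes, filters: [ssn], redactor: "sha2")
```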
Here's how logs appear before and after obfuscation:
Obfuscate IPs with Two Stages from Logs
This example demonstrates how to obfuscate IP addresses in logs using a two-stage approach:
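The original split into stages is not reproduced here; as one possible layout, the sketch below uses a first step for IPv4 addresses and a second step for IPv6 addresses:

```yaml
- name: redact-ipv4
  transform:
    type: remap
    source: |
      ipv4 = r'\b(?:\d{1,3}\.){3}\d{1,3}\b'
      .content = redact(.content, filters: [ipv4])
      .string_attributes = redact(.string_attributes, filters: [ipv4])
- name: redact-ipv6
  transform:
    type: remap
    source: |
      ipv6 = r'([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}'
      .content = redact(.content, filters: [ipv6])
      .string_attributes = redact(.string_attributes, filters: [ipv6])
```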
Here's an example of how logs appear before and after this obfuscation:
Credit Card Scanners
Maestro Card (16 digits)
5[0678]\d{14}|5[0678]\d{7}\d{8}
Discover Card (16 digits)
(?:6(?:011|5\d{2}))\d{12}|6(?:011|5\d{2})\d{7}\d{8}
Diners Club (14 digits)
3(?:0[0-5]|[68]\d)\d{11}|3(?:0[0-5]|[68]\d)\d{4}\d{4}\d{2}|3(?:0[0-5]|[68]\d)\d{7}\d{6}
American Express (15 digits)
3[47]\d{2}\d{6}\d{5}|3[47]\d{2}\d{4}\d{4}\d{3}|3[47]\d{7}\d{6}|3[47]\d{13}
JCB Card (16 digits)
(?:2131|1800|35\d{3})\d{11}|(?:2131|1800|35\d{3})\d{7}\d{8}
MasterCard (16 digits)
(?:5[1-5]\d{2}){4}|5[1-5]\d{7}\d{8}|5[1-5]\d{14}
Visa Card (16 or 19 digits)
(?:4\d{3}){4}|4\d{7}\d{8}|4\d{12}(?:\d{3})?
API Key and Token Scanners
AWS Access Key ID and Secret Access Key
AKIA[0-9A-Z]{16}|(?:[A-Za-z0-9/+=]{40})
Google API Key and OAuth Access Token
AIza[0-9A-Za-z-]{35}|ya29.[0-9A-Za-z-]{35,}
Mailchimp API Key
(?i)(?:[a-f0-9]{32}-us\d{1,2})
Social Media Tokens (Facebook, Slack, Twitter, Instagram, LinkedIn)
EAACEdEose0cBA[0-9A-Za-z]{0,}|xox[baprs]-[0-9]{12}-[0-9]{12}-[a-zA-Z0-9]{24}|T[a-zA-Z0-9]{8}/B[a-zA-Z0-9]{8}/[a-zA-Z0-9]{24}|AAAAA[0-9A-Za-z]{35}|IGQVJW[0-9A-Za-z]{16,}|[A-Z0-9]{16}
Azure Personal Access Token
eyJ[0-9A-Za-z-_=]+
Azure SQL Connection String
Server=tcp:[A-Za-z0-9-.]+,1433;Database=[A-Za-z0-9-]+;User Id=[A-Za-z0-9-]+;Password=[A-Za-z0-9-]+;Encrypt=true;
Azure Subscription Key
Ocp-Apim-Subscription-Key: [A-Za-z0-9]{32}
GitHub Access Token and Refresh Token
ghp_[0-9A-Za-z]{36}
Shopify Access Token and Shared Secret
shpat_[A-Za-z0-9]{32}
Okta API Token
00[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}
JSON Web Token (JWT)
eyJ[0-9A-Za-z-*=]+\.[0-9A-Za-z-*=]+\.[A-Za-z0-9-_=]+
RSA Private Key
-----BEGIN RSA PRIVATE KEY-----[\s\S]+-----END RSA PRIVATE KEY-----
PGP Private Key
-----BEGIN PGP PRIVATE KEY BLOCK-----[\s\S]+-----END PGP PRIVATE KEY BLOCK-----
GitLab Token
glpat-[A-Za-z0-9-]{20}
Amazon Marketplace Web Services Auth Token
amzn\.mws\.[a-zA-Z0-9]{64}
Bearer Token
Bearer [A-Za-z0-9-*=]+\.[A-Za-z0-9-*=]+\.[A-Za-z0-9-_=]+
JIRA API Token
jira\.api\.token\.[A-Za-z0-9-_]+
Other Scanners
Standard Email Address
[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
Standard IBAN Code
[A-Z]{2}\d{2}[A-Z0-9]{1,30}
Standard MAC Address
([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})
IPv4 Address
\b(?:\d{1,3}\.){3}\d{1,3}\b
IPv6 Address
([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}
HTTP(S) URL
https?:\/\/(?:www\.)?[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,6}(\/[a-zA-Z0-9\&%_\.\/~\-]*)?
HTTP Basic Authentication Header
Basic\s[a-zA-Z0-9=:_-]+
HTTP Cookie
Set-Cookie:\s[a-zA-Z0-9_\-]+=\S+
US Passport Number
[a-zA-Z]\d{9}
US Vehicle Identification Number (VIN)
[A-HJ-NPR-Z0-9]{17}
UK National Insurance Number
[A-CEGHJ-PR-TW-Za-ceghj-pr-tw-z]\d{2}\d{6}[A-Za-z]{1}
Canadian Social Insurance Number (SIN)
\d{3}-\d{3}-\d{3}
US Social Security Number (SSN)
\b\d{3}-\d{2}-\d{4}\b
We strongly advise reading the intro guide to working with remap transforms in order to fully understand the functionalities of writing pipelines steps.
The generated events will currently only be available by querying the ClickHouse database directly. Contact us over Slack for additional information.
Attributes parsed from logs or traces can be accessed under the .string_attributes
or .float_attributes
maps - see here for more information.
The following example demonstrates transformation of a log in a specific format to an event, while applying additional filtering and extraction logic.
In this example, we want to create events for when a user consistently fails to log in to a system. We base it on logs with this specific format:
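An illustrative log line in that format (the exact layout is hypothetical):

```
2024-05-15 10:04:11 WARN login failed for user=alice attempt=5
```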
This pipeline will create events with the type multiple_login_failures
for each time a user fails to log in for the 5th time or more. It will store the username in .string_attributes
and the attempt number in .float_attributes
.
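A hedged sketch of such a pipeline, matching the illustrative log layout above; the values key, the inputs name and the event_type attribute are assumptions:

```yaml
eventsPipeline:                          # assumed values key
  - name: filter-login-failures
    inputs:
      - logs                             # assumed name of the default incoming logs pipeline
    transform:
      type: filter
      condition: |
        contains(string!(.content), "login failed")
  - name: extract-login-failure-event
    transform:
      type: remap
      source: |
        parsed = parse_regex!(.content, r'user=(?P<user>\S+) attempt=(?P<attempt>\d+)')
        attempt = to_float!(parsed.attempt)
        if attempt >= 5.0 {
          .string_attributes.event_type = "multiple_login_failures"
          .string_attributes.user = parsed.user
          .float_attributes.attempt = attempt
        } else {
          abort
        }
```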
Quickly see how real users experience your app and catch front-end issues early.
The RUM page in groundcover lets you monitor real user activity on your web application. It highlights key performance metrics (like web-vitals and page load times), error rates, and user behavior.
This helps you understand what real users are facing – for example, identifying slow page load times, spotting frequent errors, or seeing which pages are most popular – so you know where to focus your attention.
The Real User Monitoring summary page has three main sections:
Summary cards,
Trend charts (Exceptions over Time, Sessions over Time) and a list of visited pages,
A detailed sessions table
At the top of the RUM page, you’ll see five summary cards showing key user experience stats. These cards give a quick health check of your front-end performance:
Avg INP (input delay): The average delay after a user interaction (like clicking a button) before the page responds. A lower number means the app feels snappy; a high number indicates users might be waiting too long for feedback.
Avg CLS (layout shift): The average Cumulative Layout Shift score, measuring how much the page content jumps around during load.
Page Load Time: The average time it takes for pages to fully load for users, measured in milliseconds. This reflects your site’s loading speed.
Error Rate: The percentage of user sessions that encountered an error (for example, JavaScript exceptions).
User Count: The number of unique user sessions in the selected time frame. This shows how many distinct users (or sessions) have been recorded.
At the bottom of the RUM page, you’ll find a detailed sessions table listing individual user sessions. Each row in this table represents one user’s session on your app, with key details to help you trace their experience. Important columns include:
Date/Time: When the session occurred (start time or timestamp of the session).
User (Email/ID): The user’s identifier, which could be an email, username, or “Anonymous” if the user isn’t logged in. This helps identify the session’s user if needed.
Errors: How many errors occurred during that session (e.g. count of exceptions). If this number is greater than 0, it means the user encountered problems.
Pages: The number of pages the user visited in that session. A higher page count might indicate a longer session or a user navigating through many parts of the app.
Duration: How long the session lasted (for example, “07:36” means 7 minutes 36 seconds). This shows if the user spent a long time (possibly struggling) or left quickly.
Browser: The browser used in the session (often shown by an icon or name like Chrome, Safari, etc.). This can reveal if an issue is browser-specific (e.g. all errors happening on one browser).
Device: The type of device (e.g. desktop, mobile) indicated by an icon. This helps you see if mobile users vs. desktop users have different experiences.
groundcover supports a rich set of features for log management, from collection to analysis. In addition, it fully supports defining alerts and dashboards based on a variety of attributes in your logs. This guide will explore how to get started querying your logs in our embedded Grafana.
groundcover uses ClickHouse as its database for storing logs. When building log based alerts or dashboards in our embedded Grafana, the ClickHouse
datasource needs to be selected in order to query the logs stored.
See this page for examples you can get started with!
ClickHouse supports standard SQL syntax, which can be used to query the table storing your logs.
For example, the following query will return the count of logs in the selected time range:
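For instance (the logs table name and the $__timeFilter macro usage follow the examples in this guide):

```sql
SELECT count()
FROM logs
WHERE $__timeFilter(timestamp)
```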
Below is a list of the most commonly used fields in the Logs table, which should serve the majority of the use cases for alerting.
Can't find what you're looking for? Let us know over Slack!
timestamp
DateTime64
content
String
content attribute of the log if it exists, entire log body otherwise
cluster
String
workload
String
namespace
String
k8s only
pod_name
String
k8s only
node_name
String
k8s only
level
String
lower-cased, e.g: 'info', 'error', 'fatal'...
format
String
'json', 'logfmt'...
env
String
string_attributes
Map(String,String)
String attributes extracted from formatted logs; empty for unformatted logs
float_attributes
Map(String, Float64)
Numeric attributes extracted from formatted logs; empty for unformatted logs
Our helm chart provides access to many values passed into the chart via the standard values.yaml interface. The deployment can be configured based on most constraints and controls Kubernetes allows around pod assignments to nodes.
groundcover supports most K8s primitives out there, among which:
nodeAffinity
- allows you to constrain which nodes your Pod can be scheduled on based on node labels. It is more expressive than nodeSelector and lets you specify soft rules.
Address potential bandwidth limitations with Amazon ECR Public when using clusters outside of the AWS network.
Argo CD’s multi-environment support ensures that groundcover can be deployed consistently across various Kubernetes clusters.
Learn how to install groundcover on multiple clusters while using a single, centralized instance of each of these databases.
Deploying groundcover using an API Key Secret ensures that only authorized entities can access the API's functionalities.
Allows you to use groundcover in secured environments without relying on outbound connections except for authentication purposes.
Allows the pod to use the host's networking stack for all communication, which means that the pod will use the same IP address as the host.
Configure custom log transformations in groundcover using OpenTelemetry Transformation Language (OTTL). Tailor your logs with structured pipelines for parsing, filtering, and enriching data before ingestion.
groundcover supports the configuration of log pipelines using OpenTelemetry Transformation Language (OTTL) to process and customize your logs. With OTTL, you gain full flexibility to transform data as it flows into the platform.
groundcover uses OTTL to enrich and shape log data inside your monitored environments. OTTL pipelines give you a structured way to parse, filter, and modify logs before ingestion.
Each pipeline is made up of transformation steps—each step defines a specific operation (like parsing JSON, extracting key-value pairs, or modifying attributes). You can configure these transformations directly in your groundcover deployment.
To test your logic before going live, we recommend using our Parsing Playground (click the top right corner when viewing a specific log).
To define an OTTL pipeline, make sure to include the following fields:
statements
– List of transformations to apply.
conditions
– Logic for when the rule should trigger.
errorMode
– How to handle errors (e.g., skip, fail).
logicOperator
– Used when you define multiple conditions.
Transformations are automatically appended to the default pipeline, so there’s no need to replace anything. Just define your rules, add them to your values file, redeploy, and you’re done.
Each rule must have a unique ruleName
.
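A hedged sketch of a rule combining these fields; the condition paths and the errorMode values may differ in your setup, and the statements use common OTTL functions:

```yaml
- ruleName: parse-checkout-json
  conditions:
    - attributes["workload"] == "checkout"
    - attributes["format"] == "json"
  logicOperator: AND
  errorMode: ignore
  statements:
    - merge_maps(attributes, ParseJSON(body), "upsert")
    - set(attributes["team"], "payments")
```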
Use conditions
to apply transformations only when specific attributes match. This ensures your pipeline runs efficiently and only on relevant logs.
Common fields you can use:
workload
– Name of the service or app.
container_name
– Container where the log originated.
level
– Log severity (e.g., info, error).
format
– Log format (e.g., JSON, CLF, unknown).
Some commonly used functions in groundcover:
ExtractGrokPatterns
ParseJSON
Replace_pattern
Delete_key
ToLowerCase
Concat
ParseKeyValue
In the following example we will be controlling the replica count of a Kubernetes deployment. This common example can be extended and used to query any type of metrics, and we suggest experimenting with it to fit your own use cases.
KEDA requires access to the ds-api-key so it can add it to the Prometheus requests. We will set up two Kubernetes objects for this purpose - a Secret object and a ClusterTriggerAuthentication object.
We will be using a simple static Secret object, but any type can be used, including external secrets. Make sure to replace the <ds-api-key>
placeholder with the api key previously created.
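A sketch of such a Secret; it stores both the header name used by groundcover's datasources API (apikey) and the key value, and is created in the namespace where KEDA is installed (assumed to be keda), since ClusterTriggerAuthentication reads secrets from KEDA's own namespace:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: groundcover-ds-api-key
  namespace: keda          # assumed KEDA installation namespace
type: Opaque
stringData:
  header-name: apikey
  header-value: <ds-api-key>
```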
Now that we have the secret object defined, we need to tell KEDA how to use it. groundcover requires the ds-api-key to appear as an additional header in the requests to our datasources, and that's exactly what we will define below.
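A sketch of the ClusterTriggerAuthentication, mapping the secret into KEDA's custom header authentication parameters:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ClusterTriggerAuthentication
metadata:
  name: groundcover-prometheus-auth
spec:
  secretTargetRef:
    - parameter: customAuthHeader
      name: groundcover-ds-api-key
      key: header-name
    - parameter: customAuthValue
      name: groundcover-ds-api-key
      key: header-value
```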
Now that we have set up KEDA to properly authenticate with groundcover, we can define our first auto-scaling behavior.
We will be defining a Prometheus
trigger, querying the groundcover_kube_deployment_status_replicas_ready
which indicates the number of ready replicas for our deployment. Note that we apply the deployment="example"
filter, to query the metric specifically for our example deployment.
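A sketch of the ScaledObject; the threshold and replica bounds are illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaler
  namespace: default
spec:
  scaleTargetRef:
    name: example
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://ds.groundcover.com/datasources/prometheus
        query: groundcover_kube_deployment_status_replicas_ready{deployment="example"}
        threshold: "5"
        authModes: "custom"
      authenticationRef:
        name: groundcover-prometheus-auth
        kind: ClusterTriggerAuthentication
```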
groundcover provides a robust user interface that allows you to view and analyze all your observability data from inside the platform. However, there may be cases in which you need to query the data from outside our platform using API communication.
Our proprietary eBPF sensor automatically captures granular observability data, which is stored via our integrations with two best-of-breed technologies. VictoriaMetrics for metrics storage, and ClickHouse for storage of logs, traces, and Kubernetes events.
Run the following command in your CLI, and select tenant:
groundcover auth get-datasources-api-key
Example for querying ClickHouse database using POST HTTP Request:
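A sketch of such a request; the ClickHouse endpoint path is assumed to mirror the Prometheus endpoint shown later on this page:

```bash
curl -X POST "https://ds.groundcover.com/datasources/clickhouse" \
  -H "X-ClickHouse-Key: ${API_KEY}" \
  -d "SELECT count() FROM traces WHERE start_timestamp > now() - interval '15 minutes'"
```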
X-ClickHouse-Key
(header): API Key you retrieved from the groundcover CLI. Replace ${API_KEY}
with your actual API key, or set API_KEY
as env parameter.
SELECT count() FROM traces WHERE start_timestamp > now() - interval '15 minutes'
(data): The SQL query to execute. This query counts the number of traces where the start_timestamp
is within the last 15 minutes.
apikey
(header): API Key you retrieved from the groundcover CLI. Replace ${API_KEY}
with your actual API key, or set API_KEY
as env parameter.
query
(data): The promql query to execute. In this case, it calculates the sum of the rate of groundcover_resource_total_counter
with the type
set to http
.
start
(data): The start timestamp for the query range in Unix time (seconds since epoch). Example: 1715760000
.
end
(data): The end timestamp for the query range in Unix time (seconds since epoch). Example: 1715763600
.
This guide will explore how to get started querying your metrics in our embedded Grafana.
All types of metrics stored in groundcover are available using a standard Prometheus
datasource, exposed by Victoria Metrics. Querying metrics is possible using standard PromQL
syntax.
For example, let's query the groundcover_container_cpu_usage_rate_millis
metric, which is one of our core infrastructure metrics:
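For example, averaging the metric per workload (the namespace filter is illustrative):

```
avg by (workload_name) (groundcover_container_cpu_usage_rate_millis{namespace="my-namespace"})
```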
groundcover supports many custom configurations so you can fit it to the way your deployment works and to your exact needs.
nodeSelector
- a simple form of node selection constraint. You can add the nodeSelector
field to your Pod specification and specify the node labels you want the target node to have.
groundcover supports most K8s primitives out there. Can't find what you need?
groundcover can be used as a Prometheus datasource for KEDA. Any metric stored in groundcover can be queried and used to automatically make decisions about scaling your infrastructure or deployments.
We will use the datasources API exposed by groundcover to query data stored in the platform from outside the UI. In our case we will be using the Prometheus
endpoint:
https://ds.groundcover.com/datasources/prometheus
Querying the groundcover datasources APIs requires setting up a dedicated apikey for the authentication. If you haven't done this already, please do so before proceeding.
We will be referring to this apikey below as ds-api-key
.
Read more about our architecture.
Learn more about the ClickHouse query language.
Example for querying the VictoriaMetrics database using the API:
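A sketch of a query_range request using the parameters described above; the /api/v1/query_range suffix follows the standard Prometheus HTTP API and is an assumption here, as is the step parameter:

```bash
curl "https://ds.groundcover.com/datasources/prometheus/api/v1/query_range" \
  -H "apikey: ${API_KEY}" \
  --data-urlencode 'query=sum(rate(groundcover_resource_total_counter{type="http"}[5m]))' \
  --data-urlencode 'start=1715760000' \
  --data-urlencode 'end=1715763600' \
  --data-urlencode 'step=60'
```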
Learn more about the PromQL syntax.
Learn more about the VictoriaMetrics HTTP API.
groundcover supports a wide variety of metrics - infrastructure and application metrics are automatically generated using our eBPF magic, and custom metrics can be ingested natively.
groundcover uses VictoriaMetrics as its database for storing metrics. When building metric-based alerts or dashboards in our embedded Grafana, the Prometheus
datasource needs to be selected in order to query the metrics.
The full lists of available out-of-the-box metrics can be found in our infrastructure and application metrics pages.
Learn more about PromQL and Grafana dashboards to improve your skills. For any help in creating your custom dashboard don't hesitate to reach out over Slack.
groundcover has a set of example dashboards in the Dashboards by groundcover folder which can get you started. These dashboards are read-only, but you can see the PromQL query behind each panel by right-clicking the panel and selecting Explore.
The Amazon Elastic Container Registry (ECR) Public Registry pull bandwidth limit for clusters outside of AWS is set to 500 GB per day. This means that, collectively, all container image pulls from ECR Public by all clusters located outside the AWS network cannot exceed 500 GB of data transfer per day.
It is crucial for users who operate Kubernetes clusters or other container orchestrators outside of AWS and use ECR Public as a container image source to be aware of this bandwidth limitation. If the cumulative data transfer of container image pulls from ECR Public exceeds the daily limit, further pulls may be denied until the bandwidth usage falls back within the allowed threshold.
To address potential bandwidth limitations with Amazon ECR Public when using clusters outside of the AWS network, one viable solution is to override the container registry to utilize an alternative registry, such as Quay.io. By redirecting the container image pulls to Quay.io, users can leverage the bandwidth allowance and performance capabilities of Quay.io to complement or replace ECR Public for image retrieval.
By default, groundcover agents are running in K8S default network model in which every Pod
in a cluster gets its own unique cluster-wide IP address. This means you do not need to explicitly create links between Pods
and you almost never need to deal with mapping container ports to host ports.
In cases where the CNI has a limited number of IP addresses, we can set our agents to host network mode, which ensures that the IP addresses of the pods do not get allocated from the cluster's IP address range. This mode allows the pod to use the host's networking stack for all communication, which means that the pod will use the same IP address as the host.
Either create a new custom-values.yaml or edit your existing groundcover values.yaml
Deploying groundcover using an API Key Secret ensures that only authorized entities can access the API's functionalities. It is relatively straightforward compared to other authentication mechanisms, which can be beneficial for rapid deployment and integration.
You can inject the API Key using a custom secret by following these steps:
Either manually, or using a secret manager, create a secret in the following structure
Create/Update helm overrides file, with the following override
Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes. Argo CD aligns with the GitOps principles, ensuring that the deployment of groundcover is always in sync with the predefined configurations in your Git repository. This means that any changes made to the deployment configurations are automatically applied to the cluster, streamlining updates and ensuring that all instances of groundcover are consistent across different environments.
Argo CD’s multi-environment support ensures that groundcover can be deployed consistently across various Kubernetes clusters, whether they are designated for development, testing, or production.
To deploy groundcover through Argo CD, use the following steps.
The steps below require a user with admin permissions
groundcover requires setting up several secrets in the installation namespace prior to creating the ArgoCD application. For that reason we will start by creating the groundcover namespace:
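For example:

```bash
kubectl create namespace groundcover
```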
In the following steps you will create the following Kubernetes secret objects:
API Key secret
ClickHouse password secret
Start by fetching the API key associated with your workspace using the following CLI command:
Create a secret in the groundcover
namespace using the following snippet:
Create the spec file using the following snippet:
Make sure to replace the <apikey>
value below with the value fetched in the previous step
Apply the spec file from above:
Start by generating a random password for ClickHouse. For example using openssl rand
:
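For example, generating a random 32-character hex string:

```bash
openssl rand -hex 16
```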
openssl
is just one way to do it - you can use any random string you wish
Create the spec file using the following snippet:
Make sure to replace the <password>
value below with the result of the previous step
Apply the spec file from above:
Make sure to set the following values in the manifest:
<project-name> - to match your environment
<targetRevision> - set the deployment version. Either a specific groundcover chart version, or use ">= 1.0.0"
for auto upgrades.
You can use the following commands to fetch the latest chart version:
helm repo update
helm search repo groundcover/groundcover
The <cluster-name>
value in the values
part in the manifest can be any name you wish to assign to the cluster where the platform is installed. In multi-cluster installations, make sure to change it according to the cluster being installed.
After creating the manifest above the groundcover deployments will start spinning up in the namespace. When all pods are running you can access the platform at app.groundcover.com
to explore the new data from the environment.
If you encounter any issues in the installation let us know over Slack.
Customize data collection by filtering K8s entities
By default, groundcover traces all namespaces and workloads in your cluster, but sometimes you want to block ones that are irrelevant to your needs, or alternatively only allow something very specific.
groundcover allows you to add traces filtering rules on specific workloads and namespaces by creating a custom values.yaml
file.
The rules are a list of regex patterns, each with a matching type (allow / block), that represent the entities you want groundcover to filter.
The following overrides need to be added to your values.yaml
configuration file. After updating the file use the installing & updating instructions to apply the changes.
There are two ways to define retention in groundcover:
Simple - each type of data has a global retention period
Advanced - data is retained based on various criteria such as cluster, log level, namespace, etc.
For metrics, only the simple strategy is supported.
For managed inCloud deployments, the values will need to be set by the groundcover team. Please contact us to perform any retention changes.
To customize the retention on the groundcover platform, either create a new custom-values.yaml
or edit your existing values.yaml
with the overrides defined below and redeploy groundcover.
Retention value format is: {amount}[h(ours), d(ays), w(eeks), y(ears)]
.
For example: 4h
, 30d
, 6w
, 1y
The most common and simple way to configure retention in groundcover. Based solely on data type, without exceptions.
Below is an example configuration for setting data retention values:
Traces - 24 hours
Metrics - 7 days
Logs - 3 days
Events - 7 days
Traces - 48 hours
Metrics - 30 days
Logs - 30 days
Events - 7 days
groundcover allows you to customize retention policies for your data to better manage storage and compliance requirements. You can define specific retention periods for logs, traces, and events based on various criteria such as cluster, log level, namespace and more.
The custom_retention_overrides
list allows you to define specific retention periods for data based on conditions. Each override has a retention
field and a conditions
field.
Retention: Specifies the duration for which the data should be retained.
Conditions: Specify the criteria that the retention policy applies to. When multiple conditions are set, they are connected by an AND
condition, meaning all conditions must be met.
The retention
field under each data type (traces, logs, events) specifies the default retention period for that data type, meaning, anything that doesn't match the custom conditions set.
If a default retention value is not set for a certain datatype, groundcover will apply its own default retention from the default list described above.
In instances of overlapping overrides, the override with the shorter retention interval will be used.
The configuration below implements the following logic:
Traces
Traces with labels cluster: prod
, namespace: app
will be retained for 7d
Traces with label env: staging
will be retained for 14d
Other traces will be retained for 24h
Logs
Logs with labels cluster: prod
, level: info
will be retained for 20d
Logs with labels cluster: prod
, level: error
will be retained for 30d
Other logs will be retained for 3d
Events
Events with labels cluster: dev
, type: Warning
will be retained for 15d
Other events will be retained for 15d
Logs
cluster
source
env
env_type
workload
namespace
level
Traces
cluster
source
env
env_type
protocol_type
Events
cluster
source
env_name
entity_workload
entity_namespace
type
As any application monitoring system, the data collected by groundcover is by nature sensitive and contains payloads of full requests and queries. Raw traces can go a long way in a troubleshooting process, but you can choose to obfuscate their payload.
By default groundcover does not obfuscate payloads. However, it will obfuscate sensitive HTTP and gRPC headers - see below for more information.
Obfuscation is granularly defined separately for each protocol, using the following names:
httphandler
grpchandler
redishandler
sqlhandler
This applies both for MySQL and PostgreSQL
mongodbhandler
amqphandler
Data obfuscation can be configured in two ways: key-value and unstructured.
This method will automatically identify key-value structures such as JSON and query params, and for those it will perform obfuscation based on a defined set of keys.
The configuration consists of the following fields:
enabled
- turns this obfuscator on and off. Default: false
mode
- What should be done with values matching the specified keys. Possible modes are:
KeepSpecificValues
- Obfuscate all values except for keys specified in specificKeys
ObfuscateSpecificValues
- Keep all values and obfuscate only values for keys specified in specificKeys
caseSensitive
- are the keys case sensitive. Default: False
specificKeys
- a list of comma separated strings. Example:
If mode is not specified, the default behavior of this obfuscator is to obfuscate all keys, equivalent to:
mode: KeepSpecificValues
specificKeys: []
Obfuscation for nested JSON structures is based on the inner keys within the nested JSON objects. An example can be found at Obfuscation Examples
Below is an example of using the key-value configuration with different settings:
This method will obfuscate "free text" without any predefined rules. It is meant as a way to make sure all data is obfuscated regardless of its contents.
The configuration consists of the following fields:
Enabled
- Turns this obfuscator on and off. Default: false
Below is an example of turning on the unstructured obfuscator:
It's perfectly fine to use both the key-value and unstructured obfuscators together! When this is set, the key-value method will be executed first, and only if the structure isn't key-value, it will proceed to the unstructured method.
For example, let's look at a configuration for turning both obfuscators on:
JSON, {"key": "value"}
{"key": "?"}
JSON with array, {"key": [1,2,3]}
{"key": ["?", "?", "?"]}
JSON with nested keys, {"root": {"sub": {"key": "value"}}}
{"root": {"sub": {"key": "?"}}}
key=value map, key=value
key=?
Plain text, plain text
p**** ****
Truncated data: if data has been truncated, it will not be obfuscated and will show scrubbed
as the data. You can change the truncation size limits if you need to.
Want to change your data truncation size limits? Contact us on slack.
After you prepared your desired values.yaml
, apply them using:
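A hedged sketch of the command; the chart repository URL and the api-key value key are assumptions, so follow the Using Helm page for the exact flags:

```bash
helm repo add groundcover https://helm.groundcover.com   # repository URL is an assumption
helm repo update
helm upgrade --install groundcover groundcover/groundcover \
  --namespace groundcover \
  -f values.yaml \
  --set global.groundcover_token=<api-key>               # hypothetical key for the api-key
```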
more on getting api-key, see: Using Helm
groundcover will obfuscate sensitive HTTP and gRPC headers by default so that they are not shown in traces. This behavior is customizable using the same key value config as above.
The default values for the headers obfuscation are:
According to the HTTP RFC, headers are case insensitive by nature. Because of that, the headers obfuscation will always be case insensitive and can't be configured otherwise.
This feature is turned-off by default, and can be turned on by following the steps below.
groundcover offers eBPF-based tracing for all frameworks and protocols, including encrypted traffic. This is done by attaching probes to key points in popular encryption libraries, such as LibSSL.
There is one exception to the out-of-the-box coverage with eBPF - Java. The Java runtime uses encryption libraries which are written in pure Java, providing a challenge to trace with the machine-level primitives that eBPF offers.
groundcover's approach to bridge this gap is by using a more traditional approach to Java observability - using a Java agent. This agent runs alongside your applications, tracing key functions in the Java encryption libraries to provide visibility into the APIs being handled.
Non-SSL traffic in Java works out of the box, similar to other frameworks
groundcover's sensor comes pre-packed with the Java agent binaries. When detecting Java processes running alongside it, the sensor will use an injection method to dynamically execute the groundcover Java agent in the detected process.
No configuration changes are needed, and new processes will be monitored automatically.
groundcover's core approach to tracing relies on eBPF, which has built-in safety guarantees that make sure our tracing can not affect the services being monitored in any significant way.
Java agents are somewhat different in that regard - they run alongside your code, and have more potential to interfere with the standard operation of the process. For this reason, our development cycle for the Java agent is extremely strict, and includes testing with many common Java use cases. The agent has and continues to run safely in a large number of customer environments, providing the high standard of frictionless coverage groundcover provides.
We recommend testing the deployment on lower environments (e.g dev, staging) before moving on to production environments.
Use the following configuration values to turn on the Java agent deployment:
Note: On-premise deployment is available only for users subscribed to our Enterprise plan.
groundcover on-premise installation allows you to use groundcover in secured environments without relying on outbound connections except for authentication purposes (Auth0).
In this mode, groundcover installation includes 3 additional components:
router
- the frontend microservice
grafana
postgresql
- db backend for the router and grafana microservices
Upon subscribing to Enterprise plan, you should receive a groundcover-enterprise-key
Kubernetes Secret Object.
Create a groundcover namespace (if not created by now)
kubectl create ns groundcover
Load the image pull secret into the namespace
kubectl create -f groundcover-enterprise-key.yml --namespace=groundcover
Create a database on an existing PostgreSQL
Either manually or using a secret manager, create a secret in the following structure:
Create/Update helm overrides file, with the following override:
By default, when you install groundcover on several clusters, each cluster will contain its own independent set of traces, metrics and logs databases (the groundcover backend). The following guide will walk you through how to install groundcover on multiple clusters while using a single, centralized instance of each of these databases.
An ingress controller - as this setup requires cross-cluster communication
Accessible hostnames from outside the cluster that will be used by the deployed ingresses for the following components (see groundcover backend):
metrics-ingester
opentelemetry-collector
In this installation mode, we will deploy the following components separately:
groundcover backend - containing the databases, installation connector and the ingresses.
groundcover agent - containing groundcover's data collection and aggregation agent.
Create the following backend-values.yaml
file and fill the required values accordingly
Run the following installation command:
Obtain the addresses that are used by the created ingresses, using:
Now, make sure the ingresses you've just created are accessible from the clusters you intend to deploy the groundcover agent on.
Create the following agent-values.yaml
, and fill the required values accordingly
Run the following installation command:
In case you're interested in using groundcover's ClickHouse datasource in your own Grafana, follow these steps:
Add the following override to the backend values
Fetch the ClickHouse password for the config (either in the env or from the injected secret)
Install the ClickHouse plugin for Grafana
Create a new Grafana datasource and fill the required fields, make sure Server port, Skip TLS Verify
are matching your ingress configuration
groundcover allows you to control the way data is being collected and treated across your cluster in many ways.
By default, groundcover traces all namespaces and workloads in your cluster, but sometimes you want to block ones that are irrelevant to your needs.
Strike the right balance between covering sufficient data for analysis and data storage.
Raw traces can go a long way in a troubleshooting process, but you can choose to obfuscate their payload.
Customize groundcover storage volumes for logs, metrics and traces.
There are different methods that can allow you to filter the collection of logs based on your needs and/or only when issues related to these logs are identified.
In certain situations it could be useful to disable the tracing of specific protocols. This modularity is natively supported by our sensor.
In environments where the volume of transactions or operations per second is very high, resource tuning is essential to ensure that the cluster performs optimally.
Customize groundcover storage volumes for logs, metrics and traces
Either create a new custom-values.yaml or edit your existing groundcover values.yaml agent:
Warning! this will require to re-install groundcover, existing groundcover information will be lost.
In high-throughput clusters, it is common practice to optimize the allocation and usage of computing resources. In such environments, where the volume of transactions or operations per second is very high, resource tuning is essential to ensure that the cluster performs optimally.
The following is an example of how to use that practice in a large cluster. Further tweaks may be required for different clusters.
Custom k8s logs filtering / storing
By default, groundcover stores logs from all namespaces and workloads in your cluster. However there are multiple ways to modify this behavior.
groundcover allows you to add logs filtering rules using LogQL syntax by creating a custom values.yaml
file.
The available labels to filter are: namespace, workload, pod, level, container.
Example of filtering out all logs coming from namespace demo
with level info
:
{namespace="demo",level="info"}
In addition, we enable the use of the optional log stream pipeline in order to filter the log lines.
Example of filtering out all logs coming from container my-container which contain the word fifo or handler:
{container="my-container"} |~ "fifo|handler"
Rules are applied sequentially and independently. Therefore, rules which are meant to specify multiple values of the same label should be written as one rule with multiple options, and not many rules with one option each.
For example, a rule to drop logs from all namespaces except prod and dev should be written as:
{namespace!="prod", namespace!="dev"}
groundcover collects kubelet logs on Kubernetes clusters and docker logs on host machines. You can customize this behavior through additional configuration options.
groundcover can collect logs from specific files on your host machine. You can define paths to monitor and add custom labels to the collected logs.
This enables merging multiple log lines into a single log block. A new block is defined using a pre-defined firstLineRegex, which should match the line prefix.
A block is terminated when one of the following conditions is met:
A new block is matched
Timeout has occurred (optional config, default is 3 seconds)
Max number of lines-per-block is reached (optional config, default is 1024 lines)
The configuration also holds workload and namespace fields, which can be set to .* in order to use wildcard logic. An additional optional container field can be added.
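A minimal sketch of such a rule is shown below. The wrapper key (multilineParsers) is an illustrative assumption; firstLineRegex, namespace, workload and container are the fields described above:

```yaml
# Hypothetical wrapper key -- consult your values.yaml for the exact location.
multilineParsers:
  - firstLineRegex: '^\d{4}-\d{2}-\d{2}'   # a new block starts at lines beginning with a date
    namespace: ".*"                        # wildcard: apply in every namespace
    workload: ".*"                         # wildcard: apply to every workload
    container: my-container                # optional, hypothetical container name
```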
A rule like this will merge multi-line exception logs into a single log block.
To add a new grok format, you need to specify a pattern and a ruleName which categorizes the parsed logs as a specific sub-format. Additionally, namespaces, workloads, and containers can be used as filters to determine where the patterns should be applied.
Each attribute parsed will be automatically appended to the attributes of the log, making it searchable and filterable in the platform.
We strongly advise applying namespaces, workloads and containers filters to make the matching as tight as possible, reducing unneeded CPU overhead during parsing.
This example adds a custom Grok rule for parsing postgresql logs:
PostgreSQL error log:
2023-12-25 19:31:10.042 GMT [130] FATAL: terminating connection due to unexpected postmaster exit
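A hedged sketch of what such a rule could look like is shown below. The wrapper key (grokPatterns) and the filter values are illustrative assumptions; pattern, ruleName, namespaces and workloads are the fields described above, and the pattern uses standard Grok classes to match the log line shown:

```yaml
# Hypothetical wrapper key -- consult your values.yaml for the exact location.
grokPatterns:
  - ruleName: postgresql                  # categorizes matching logs as the "postgresql" sub-format
    pattern: '%{TIMESTAMP_ISO8601:timestamp} %{WORD:timezone} \[%{NUMBER:pid}\] %{WORD:level}:\s+%{GREEDYDATA:message}'
    namespaces: ["db"]                    # hypothetical namespace filter
    workloads: ["postgresql"]             # hypothetical workload filter
```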
This feature enables removing ANSI color codes from the log body.
Will be stripped into:
Use this customization carefully, as it might heavily affect the performance of the groundcover sensors.
During log parsing, groundcover generates two attributes named content and body:
body - contains the full log line
content - contains the message field of structured logs (from the msg/message attribute), or the full log line for unstructured logs
In the platform UI the attribute displayed is content, while body is available in the DB.
Formatted log with message:
{"time": "Jun 09 2023 15:28:14", "severity": "info", "msg": "Hello World"}
Unformatted log:
[Jun 09 2023 15:28:14], Hello World
The following values contain the default truncation size for body and content respectively:
More info on LogQL syntax can be found in the official LogQL documentation.
For more on obtaining an API key, see the API key instructions.
groundcover supports providing custom patterns to parse logs with unique formats that don’t conform to standard types.
This page addresses the sampling of eBPF traces generated by the groundcover sensors. For sampling of traces generated by external libraries, see OpenTelemetry & DataDog. Can't find what you're looking for? Let us know over Slack!
groundcover utilizes smart sampling in order to only store the traces which we believe are the most important for monitoring your environment. However, some more advanced use cases might require adjusting the sampling strategy, making sure you get the exact coverage you need.
This page details the way in which sampling can be configured.
Some cases might require 100% visibility into traces of specific services or APIs. groundcover allows forcing sampling of transactions by these methods:
HTTP/gRPC requests which include the header below will be force sampled.
Pods which have the following value as either a label and/or annotation will have all traces sampled:
The sampling mechanism can be rate limited to reduce the total amount of traces produced. Note that traces marked as issues will not be affected by this configuration.
Rate is specified by number of traces allowed per 100 milliseconds, using the following values override:
Ingest and visualize OTEL data with groundcover
groundcover supports ingesting different data types in the OTEL format, displaying them natively in the platform. For more information see the following subpages:
For more customization options see Traces & Logs, Metrics
The configuration below is meant to set up OTLP ingestion from a single service in an environment with a standard installation of groundcover.
Apply the following environment variables to the instrumented service to redirect all signals to groundcover's ingestion endpoint inside your cluster.
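A hedged sketch using the standard OpenTelemetry SDK environment variables is shown below; the endpoint address and port are assumptions (the conventional OTLP/HTTP port is used), so substitute your actual Sensor service endpoint:

```yaml
# Kubernetes Deployment env snippet -- endpoint value is an assumption, verify against your setup.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://{GROUNDCOVER_SENSOR_ENDPOINT}:4318"   # assumed OTLP/HTTP port
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"
  # Note: metrics may need to target the custom-metrics endpoint -- see the version note below.
```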
For versions before 1.8.245, it is required to turn on custom-metrics to enable the OTLP metrics endpoint. See docs here.
To generate your API key, go to https://id.atlassian.com/manage-profile/security/api-tokens, create a new API key, and copy it.
Get your host and projectKey from your board's URL. For example, for https://example.atlassian.net/jira/software/projects/rnd/list:
host: https://example.atlassian.net
projectKey: rnd
In groundcover, Go to Settings → Integrations.
Click on Webhook Integration
Add webhook integration:
Select your name
Add the following URL, {host}/rest/api/2/issue?projectKey={projectKey}, based on your host from step 2. For example: https://example.atlassian.net/rest/api/2/issue?projectKey=rnd
Using Basic Authentication
User: enter your Jira email
Password: API key from step 1.
Save
groundcover allows you to add custom labels and annotations from Kubernetes pods and Docker containers to the traces and logs generated by those pods or containers.
This feature is available in groundcover sensor versions ≥ 1.9.127
To apply custom labels and annotations collection, you need to update your groundcover deployment values.yaml file.
To collect a pod label named app.kubernetes.io/name and an annotation named app.kubernetes.io/other-name, you will need to add the following configuration:
To collect all pod labels and/or annotations, use the following configuration:
We prefix the labels/annotations with the following (according to pod/docker):
k8s.pod.label
k8s.pod.annotation
docker.container.label
docker.container.annotation
Once you've set up the sensor configuration, you will be able to view and search for specific labels and annotations like any other attribute.
For example, to search for all traces that have the label app.kubernetes.io/part-of = groundcover, you can enter the following in the search bar:
k8s.pod.label.app.kubernetes.io/part-of:groundcover
In PagerDuty, go to your desired service, then click on the Integrations tab.
Click on "Add an integration" or "+ Add another integration"
Select "Events API V2" Integration, and click on Add.
Copy the Integration Key.
In groundcover, Go to Settings → Integrations.
Click on PagerDuty Integration
Select a name and paste your routing key
Save.
In groundcover, Go to Settings → Integrations.
Click on Webhook Integration
Fill in a Webhook name
Fill Webhook details
Select an HTTP method: GET / POST / PUT / DELETE
Enter your URL
Optional: add authentication headers. You can add either basic auth credentials (user and password) or an API key, but not both.
Optional: Add custom headers, by adding key and value pairs
Save.
groundcover Workflows support using integrations with external systems to enable notifications.
Set up Integrations in Settings → Integrations Page.
Only admins in groundcover can add integrations.
In Opsgenie, Go to Settings → Integrations.
Run a search and select “API”.
On the next screen, enter a name for the integration. (We will use it later when adding the integration to groundcover)
Optional: Select a team in Assignee team if you want a specific team to receive alerts from the integration.
Select Continue. The integration is saved at this point.
Select Turn on integration.
If you're using Opsgenie's Free or Essentials plan, you can add this integration from your team dashboard only. The Integrations page under Settings is not available in your plan.
Make sure you have enabled "Allow Create and Update Access"
In Opsgenie, Go to Settings → Integrations.
Click on your selected integration from the list.
Copy API Key from the Integration settings panel.
Make sure Status is ON
In groundcover, Go to Settings → Integrations.
Click on Opsgenie integration
Fill the form:
Integration name in groundcover
Original name of the integration in Opsgenie
API Key
Save
The groundcover platform was built to be an all-in-one observability solution for cloud-native environments. It was built to ingest any data source directly into groundcover's in-cloud backend using any protocol supported by OpenTelemetry or Prometheus.
It supports any log stream from sources like Fluentd, Fluentbit, Logstash, or CloudWatch logs. Metrics can be ingested via Prometheus remote-write or directly from agents like StatsD and Telegraf. Traces can be ingested from any application instrumented with OpenTelemetry or Datadog’s SDK.
Learn how to ingest OTEL traces & logs with groundcover
groundcover fully supports the ingestion of traces and logs in the OpenTelemetry format, displaying them natively in our UI.
OTLP traces and logs generated from Kubernetes pods can be ingested directly by our DaemonSet Sensor. Ingestion is supported by changing the exporter endpoint to the Sensor Service Endpoint, which will also enrich the received spans and logs with Kubernetes metadata.
Ingestion is supported for both OTLP/HTTP and OTLP/gRPC
Apply the environment variables below to your services in order to make them ship data to groundcover's ingestion endpoint.
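As a hedged illustration, the standard OpenTelemetry SDK environment variables below would redirect a service's OTLP export to the sensor; the ports shown are the conventional OTLP ones and are an assumption, so verify them against your Sensor service definition:

```yaml
# OTLP/HTTP variant -- switch to port 4317 and protocol "grpc" for OTLP/gRPC (ports are assumptions).
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://{GROUNDCOVER_SENSOR_ENDPOINT}:4318"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "http/protobuf"
```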
groundcover will automatically enrich traces and logs ingested with Kubernetes metadata, in order to provide as much context as possible.
groundcover will replace the service.name attribute, indicating the name of the service, with the name of the Kubernetes Deployment that owns the pod. Keep this in mind when looking for your traces and logs in the system!
Pod Level attributes:
k8s.namespace.name - the namespace of the pod
k8s.node.name - the node the pod is scheduled on
k8s.pod.name - the name of the pod
k8s.pod.uid - the UID of the pod
k8s.pod.ip - the IP address of the pod at the time of the trace
k8s.cluster.name - the Kubernetes cluster name
Container level attributes:
If the container.id tag is provided with the container ID reported by the Container Runtime, the following tags will also be enriched:
container.name - the name of the container
container.image.name - the name of the container image
container.image.tag - the tag of the container image
Starting from version 1.8.216, groundcover will enrich container level attributes for pods with a single container, without the need for providing the container.id tag.
Starting from version 1.8.216, the recommended method to ship traces & logs from an OpenTelemetry Collector is the same as for other deployments - directly to the groundcover sensor endpoint.
Ingestion is supported for both OTLP/HTTP and OTLP/gRPC
groundcover exposes an OpenTelemetry interface as part of our inCloud Managed endpoints, which can be used to ingest data in all standard OTLP protocols for workloads which are not running alongside sensors.
These endpoints require authentication using an {apikey}, which can be fetched with the groundcover CLI using the following command:
groundcover auth print-api-key
Both gRPC and HTTPS are supported.
Ingestion is supported for both OTLP/HTTP and OTLP/gRPC
While some instrumentation libraries allow sampling of traces, it can be convenient to sample a ratio of the incoming traces directly in groundcover.
groundcover sampling does not take into account sampling done in earlier stages (e.g. SDKs or collectors). It's recommended to choose a single point for sampling.
To configure sampling, the relevant values can be used:
The samplingRatio field is a fraction in the range 0-1. For example, 0.1 means 10% of the incoming traces will be sampled and stored in groundcover.
As of December 1st, 2024, the default sampling rate is 5%.
Use the values below to disable sampling and ingest 100% of the incoming traces.
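As a rough sketch (only the samplingRatio field name comes from this page; its exact location within the values file is an assumption), disabling sampling would look like:

```yaml
# samplingRatio is a fraction between 0 and 1; 1.0 ingests 100% of incoming traces.
# Its parent key is not shown here -- place it where your chart expects it.
samplingRatio: 1.0
```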
Use the instructions here to locate the endpoint for the Sensor service, referenced below as {GROUNDCOVER_SENSOR_ENDPOINT}.
groundcover follows the OpenTelemetry semantic conventions when naming the attributes.
Using an earlier version? Upgrade your installation, or let us know over Slack.
Use the instructions here to locate the endpoint for the Sensor service, referenced below as {GROUNDCOVER_SENSOR_ENDPOINT}.
This feature is only supported for inCloud Managed installations as part of our Enterprise offering. See our plans for more details.
Use the instructions here to locate the endpoint, referenced below as {GROUNDCOVER_MANAGED_OPENTELEMETRY_ENDPOINT}.
The list of supported authentication methods can be found here.
Learn how to ingest OTEL metrics with groundcover
groundcover fully supports the ingestion of metrics in the OpenTelemetry format, displaying them as custom metrics in our platform and allowing you to build custom dashboards and set up alerts.
OTLP metrics generated from Kubernetes pods can be ingested directly by our Custom Metrics deployment. Ingestion is supported by changing the exporter endpoint to the Custom Metrics Service Endpoint, sending data directly to our Victoria Metrics backend.
Use the instructions here to locate the endpoint for the Custom Metrics service, referenced below as {GROUNDCOVER_CUSTOM_METRICS_ENDPOINT}.
Change your deployment's environment variables to point to the endpoint found above.
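A hedged sketch using the standard OpenTelemetry SDK metrics variables; whether the endpoint expects an additional path suffix is an assumption, so adjust as needed:

```yaml
env:
  - name: OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
    value: "http://{GROUNDCOVER_CUSTOM_METRICS_ENDPOINT}"   # may require a /v1/metrics suffix depending on the SDK
  - name: OTEL_EXPORTER_OTLP_METRICS_PROTOCOL
    value: "http/protobuf"                                  # only HTTP with protobuf is supported (see note below)
```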
For versions before 1.8.245, it is required to turn on custom-metrics to enable the OTLP metrics endpoint. See docs here.
The method above only supports the HTTP protocol with protobuf payloads. If your SDK is sending data in other protocols or formats, it will not be ingested.
Traces
All your traces are sourced out-of-the-box
Logs
All your logs are sourced out-of-the-box
Metrics
All your metrics are built out-of-the-box
Ingest your data from fluentd
Import your logs from logstash
Import all your metrics from StatsD
Ingest your metrics directly from telegraf
Integrate with OpenTelemetry
Scrape your Prometheus custom metrics
Ingest your data from Amazon CloudWatch
Automatically ingest Datadog traces & metrics
Integrate Istio distributed tracing and custom metrics
Ingest data from Google Cloud Monitoring
Import all your logs from fluentbit
A Monitor defines a set of rules and conditions that track the state of your system. When a monitor's conditions are met, it triggers an issue that is displayed in the platform and can be used for alerting through your workflows and integrations.
Creating a new Monitor is easy; see the linked guide.
Make sure you have the groundcover CLI installed, version 0.10.13 or later.
Generate service account token
Service Account Tokens are only shown once, so make sure you keep them somewhere safe; running the command again will generate a new service account token.
Only groundcover tenant admins can generate Service Account Tokens
Make sure you have Terraform installed.
Use the official Grafana Terraform provider with the following attributes
Create a directory for the terraform assets
Create a main.tf file within the directory that contains the terraform provider configuration mentioned in step 2.
Create the following dashboards.tf file. This example declares a new Golden Signals folder, and within it a Workload Golden Signals dashboard that will be created.
Add the workloadgoldensignals.json file to the directory as well (a sketch of both files follows below).
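A hedged sketch of what the two files might contain; the Grafana URL and token are placeholders, the resource names are illustrative, and the provider and resource types are those of the official Grafana Terraform provider:

```hcl
# main.tf -- provider configuration
terraform {
  required_providers {
    grafana = {
      source = "grafana/grafana"
    }
  }
}

provider "grafana" {
  url  = "https://your-grafana.example.com"   # placeholder: your groundcover Grafana URL
  auth = "<service-account-token>"            # placeholder: the service account token generated earlier
}

# dashboards.tf -- a folder and a dashboard sourced from the local JSON file
resource "grafana_folder" "golden_signals" {
  title = "Golden Signals"
}

resource "grafana_dashboard" "workload_golden_signals" {
  folder      = grafana_folder.golden_signals.id
  config_json = file("${path.module}/workloadgoldensignals.json")
}
```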
Run terraform init to initialize the terraform context.
Run terraform plan; you should see a long output describing the assets that are going to be created. The last line should state:
Plan: 2 to add, 0 to change, 0 to destroy.
Run terraform apply to execute the changes; you should now see a new folder in your Grafana dashboards screen with the newly created dashboard.
Run terraform destroy to revert the changes.
Here is a short video demonstrating the process.
You can read more about what you can achieve with the Grafana Terraform provider in the official docs.
ClickHouse datasource for working with logs
Prometheus datasource for working with metrics
Prometheus datasource for a metric
In this end-to-end example, we will set up an alert which triggers if the number of error logs from any workload crosses a certain threshold.
We will construct a query that uses the count() operator to get the number of error logs in the defined time window.
groundcover always saves log levels as lower-cased values, e.g. 'error', 'info'.
The GROUP BY operator will generate the labels that will be attached as part of the alert when it fires.
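As a rough sketch (the table and column names below are assumptions based on the labels mentioned above, not the exact groundcover schema), such a query would look something like:

```sql
-- Count error logs per workload over the selected time window.
SELECT
    workload,
    count() AS error_logs
FROM logs
WHERE level = 'error'                          -- log levels are stored lower-cased
  AND timestamp >= now() - INTERVAL 1 HOUR     -- matches the time range selected in Grafana
GROUP BY workload                              -- workload becomes a label on the fired alert
ORDER BY error_logs DESC
```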
Running the query returns a list of workloads and the count of error logs. Note the time range at the top of the query, which can be changed according to the use case.
Now that we have our data, we need to set an alert condition to determine when our SLO should be considered breached. In our case, we consider any amount of error logs as breaching the SLO.
We will use the Threshold expression with 0 as the threshold value, indicating that any workload that has more than 0 error logs should count as a breach.
Note the firing status for all of the returned results - all of these have more than 0 error logs in the last hour, breaching our SLO condition.
The next step is instructing Grafana on how we want this alert to be evaluated:
Evaluation group: How often we want the rule to be evaluated
Pending period: How long do we allow the SLO to be breached before firing an alert
For example, if we choose an evaluation group of 1m and a pending period of 3m, we are defining that the alert condition should be checked for breach every 1 minute, but an alert should only fire if the breach is ongoing for 3 consecutive minutes.
To give a concrete example, let's look at two different series of evaluations:

| | Eval 1 | Eval 2 | Eval 3 | Eval 4 | Eval 5 | Result |
|---|---|---|---|---|---|---|
| Example 1 | BREACHED | BREACHED | OK | BREACHED | BREACHED | OK |
| Example 2 | OK | BREACHED | BREACHED | BREACHED | BREACHED | FIRING |

Even though both examples have the same amount of evaluations that breached the SLO, only the second one is firing an alert. This is because the SLO was breached for more than the allowed pending period of 3 consecutive minutes.
The next step is to add any extra labels to the fired alert, which can be used when deciding how to handle the firing of the alert. For example, labels such as team and severity could be used to decide which contact point should be used.
In the notifications part, we can choose to either use the assigned labels to route the alert, or select a contact point directly.
A groundcover sensor must be running on every node for that node to be monitored. By default, the sensor is deployed across all installed clusters, with the exception of control-plane and Fargate nodes.
When installing groundcover using the CLI, detected taints will be displayed, along with a prompt to add the appropriate tolerations.
The following configuration values will add tolerations allowing our sensor to run on all nodes:
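A hedged sketch (the parent key is an assumption; the toleration itself is standard Kubernetes and tolerates every taint):

```yaml
agent:                    # assumed parent key -- match your chart's sensor/agent section
  tolerations:
    - operator: Exists    # tolerate all taints so the sensor can run on every node
```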
To prevent the sensor from starting on control-plane nodes, and from attempting to start on Fargate nodes, use nodeAffinity rules based on node labels:
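A hedged sketch (parent key assumed); the node labels used are the standard Kubernetes control-plane label and the EKS Fargate compute-type label:

```yaml
agent:                    # assumed parent key
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist        # skip control-plane nodes
              - key: eks.amazonaws.com/compute-type
                operator: NotIn               # skip EKS Fargate nodes
                values: ["fargate"]
```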
A PriorityClass with a high priority (lower than the default node-critical and cluster-critical priority classes) and a preemption policy can be used for the sensor, allowing it to evict lower-priority pods:
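A hedged sketch of such a PriorityClass; the name and value are illustrative, with the value set high but below the built-in system-node-critical and system-cluster-critical classes:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: groundcover-sensor-priority   # hypothetical name; reference it via the sensor's priorityClassName
value: 900000000                      # high, but below the system-* critical classes
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "High priority for the groundcover sensor so it can preempt lower-priority pods."
```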
Exceptions can be set by overriding the above values.
By default, groundcover automatically collects traces and metrics for all supported protocols, in order to generate the most comprehensive picture possible. However, in certain situations it could be useful to disable the tracing of specific protocols. This modularity is natively supported by our sensor, and this section will describe how to do so.
In order to stop the collection of a specific protocol - say, HTTP - add the following lines to the groundcover deployment values file:
In order to disable multiple protocols, simply add another env variable. If we want to disable both HTTP and Redis, for example:
The list of supported protocols:
HTTP
HTTP2 (will also disable gRPC)
REDIS
DNS
POSTGRESQL
MYSQL
KAFKA
MONGODB
To send notifications to Slack, follow these steps to generate a webhook URL for your workspace:
Go to the Slack webhook page: visit https://my.slack.com/services/new/incoming-webhook to create a new incoming webhook. Make sure you select the correct workspace in the top right corner.
Select a channel: once you're on the page, select the Slack channel where you want the notifications to be sent. You can also create a new channel by clicking “create a new channel”.
Create the Webhook: Click Add Incoming Webhook Integration. A webhook URL will be generated.
Copy the Webhook URL: After the webhook is created, copy the webhook URL. This URL will be used to configure your groundcover workflow.
Once you have the Slack webhook URL, you can configure it in your workflow to send notifications.
Go to settings page, and then go to the integrations page.
Click on the “Slack Webhook” card:
In the window, fill in the name and the URL you've created; the name will be used later when setting up workflows:
The groundcover platform was built to be an all-in-one observability solution for cloud-native environments. It was built to ingest any data source directly into groundcover's in-cloud backend using any protocol supported by OpenTelemetry or Prometheus.
It supports any log stream from sources like Fluentd, Fluentbit, Logstash, or CloudWatch logs. Metrics can be ingested via Prometheus remote-write or directly from agents like StatsD and Telegraf. Traces can be ingested from any application instrumented with OpenTelemetry or Datadog’s SDK.
groundcover's embedded Grafana dashboards capability can offer you visibility on these hundreds of sources by simply enabling the scraping of custom metrics and importing the relevant Grafana visualizations into groundcover, from a pool of over 7,000 ready-made dashboards.
Below are some of the most popular examples, but the process is exactly the same for any tool:
Set up alerts to be pushed to Slack
Get your alerts directly to your inbox
Get your alerts on PagerDuty
Centralize your groundcover alerts on Opsgenie
Easily share alerts with your team via Microsoft Teams
Easily share alerts with your team via Webex Teams
Publish your data to external Grafana sources
Publish alerts via kafka
HTTP - Fully supported
gRPC - Excluding encrypted gRPC in Java services
DNS - Fully supported
MySQL
PostgreSQL
Redis
Kafka
MongoDB
RabbitMQ
Amazon SQS
Amazon S3
GraphQL
Harbor
ActiveMQ
Aerospike
Cassandra
CloudFlare
Consul
CoreDNS
etcd
Haproxy
JMeter
k6
Loki
Nginx
Pi-hole
Postfix
RabbitMQ
Redpanda
SNMP
Solr
Tomcat
Traefik
Varnish
Vertica
Zabbix