
Working with Metrics: Datadog vs groundcover

If you're coming from Datadog and starting to work with groundcover, one of the first things you'll notice is that metrics work differently. Not just the query syntax, but the underlying model, how aggregations behave, and what "rate" actually means. This article explains how to think about these differences so you can translate your mental model—not just your queries.

groundcover stores metrics in Prometheus format and uses MetricsQL (a PromQL-compatible query language) for querying. This means the concepts, operators, and query patterns differ from Datadog's approach.

Metric name normalization

When metrics are ingested into groundcover, their names are normalized to be Prometheus-compatible. Prometheus metric names must match the regex [a-zA-Z_:][a-zA-Z0-9_:]*, which means:

  • Dots (.) are replaced with underscores (_)

  • Hyphens (-) are replaced with underscores (_)

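  • Names must start with a letter or underscore (the regex also permits a leading colon, but colons are conventionally reserved for recording rules)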

| Datadog metric name | Prometheus metric name |
| --- | --- |
| system.cpu.user | system_cpu_user |
| http.server.requests | http_server_requests |
| my-app.request-count | my_app_request_count |
| aws.ec2.cpuutilization | aws_ec2_cpuutilization |

When querying metrics that originated from Datadog or other sources with dot-notation, remember to use underscores in your Prometheus queries.
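The renaming rules above can be sketched in a few lines of Python. This is only an illustration, not groundcover's actual implementation: the helper name `normalize_metric_name` is hypothetical, and the sketch assumes that every character outside the allowed set becomes an underscore and that a name starting with a digit gets an underscore prefix.

```python
import re

# Characters outside the Prometheus metric-name alphabet.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")


def normalize_metric_name(name: str) -> str:
    """Hypothetical helper: map a dotted/hyphenated name to a Prometheus-style name."""
    normalized = _INVALID_CHARS.sub("_", name)
    # Assumption: a name that starts with a digit gets an underscore prefix.
    if normalized and normalized[0].isdigit():
        normalized = "_" + normalized
    return normalized


for datadog_name in ("system.cpu.user", "my-app.request-count"):
    print(datadog_name, "->", normalize_metric_name(datadog_name))
```

Running a Datadog metric name through a helper like this before searching for it is a quick way to guess the name to query.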

Counters vs gauges: same names, different behaviors

Both systems have counters and gauges, but they behave differently in practice.

Counters

A counter is a monotonically increasing value—think "total requests served" or "bytes transmitted." The value only goes up (or resets to zero on restart).

In Datadog, counters are often submitted as already-computed rates or deltas. When you send a counter metric via DogStatsD, the agent may compute the difference between submissions for you. The count type in Datadog represents "events per flush interval."

In Prometheus, counters are raw cumulative values. If your counter reads 1000 at t=0 and 1050 at t=60s, you have the raw numbers. To get a rate, you explicitly apply rate() or increase().

This means that a Datadog query showing sum:my.counter{*} might already be showing a rate, while a Prometheus query showing my_counter is showing the raw cumulative value.
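To make the two models concrete, here is a small Python sketch that represents the same events both ways. The numbers and the 10-second interval are made up for illustration, and the counter is assumed to start at zero.

```python
from itertools import accumulate

FLUSH_INTERVAL_S = 10  # assumed flush/scrape interval, in seconds

# Datadog-style view: events counted per flush interval.
per_interval_counts = [10, 20, 20]

# Prometheus-style view: the same events as a running total.
cumulative = list(accumulate(per_interval_counts))  # [10, 30, 50]

# Going back: differences between consecutive cumulative samples.
# Starts from 0 because the counter is assumed to begin at zero.
deltas = [b - a for a, b in zip([0] + cumulative, cumulative)]

# A per-second rate is a delta divided by the interval length.
rates = [d / FLUSH_INTERVAL_S for d in deltas]

print(cumulative)  # [10, 30, 50]
print(deltas)      # [10, 20, 20]
print(rates)       # [1.0, 2.0, 2.0]
```

Graphing `cumulative` directly gives an ever-rising line, which is why a raw Prometheus counter rarely belongs on a dashboard without rate() or increase().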

Gauges

Gauges work more similarly between the two systems—they represent instantaneous values like CPU usage or memory consumption. However, aggregation behavior still differs.

Translating common Datadog operators

Here's how to think about the most common Datadog functions and their Prometheus/MetricsQL equivalents.

When writing queries in groundcover dashboards, you can use the built-in variables $range and $interval to dynamically adapt to the selected time range and step size. For example, rate(my_counter[$range]) uses the dashboard's current time range, and rate(my_counter[$interval]) uses the current step interval.

as_rate()

In Datadog: .as_rate() converts a count metric to a per-second rate. It's applied as a modifier to metrics that were submitted as counts.

In Prometheus/MetricsQL: rate() computes per-second rate from raw counter values. It handles counter resets automatically and expects a range vector.

| Datadog | Prometheus/MetricsQL |
| --- | --- |
| my.counter{*}.as_rate() | rate(my_counter[$interval]) |
| sum:my.requests{*}.as_rate() | sum(rate(my_requests_total[$interval])) |

count_nonzero() and count_not_null()

Datadog's count_nonzero() and count_not_null() count the number of non-zero or non-null data points across a time period.

In Prometheus/MetricsQL, use count_over_time() to count how many raw data points (scrapes) occurred for a specific time series within a lookback window.

Syntax: count_over_time(metric_name[duration])

Example: count_over_time(up[1h]) counts how many times the up metric was scraped in the last hour. If your scrape interval is 15 seconds, this returns approximately 240.

| Datadog | Prometheus/MetricsQL |
| --- | --- |
| count_nonzero(my.metric{*}) | count_over_time((my_metric > 0)[1h]) |
| count_not_null(my.metric{*}) | count_over_time(my_metric[1h]) |

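Note that Prometheus's count() (without _over_time) is different: it counts the number of time series in a result set, not the number of samples. If the Datadog query applies count_nonzero() to a grouped query in order to count how many tag values report non-zero data at each point, count(my_metric != 0) is the closer match.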
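The difference between counting samples and counting series can be sketched in Python. This is a simplified model rather than the real query engine: the series names and the two helper functions are illustrative, and the window is treated as open on the left and closed on the right.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def count_over_time(samples: List[Sample], now: float, window_s: float) -> int:
    """Count samples of ONE series with now - window_s < timestamp <= now."""
    return sum(1 for ts, _ in samples if now - window_s < ts <= now)


def count_series(series: Dict[str, List[Sample]]) -> int:
    """Rough analogue of count(): how many series are in the result set."""
    return len(series)


# Two hypothetical targets scraped every 15 s for one hour.
scrapes = [(float(ts), 1.0) for ts in range(15, 3601, 15)]
series = {'up{instance="a"}': scrapes, 'up{instance="b"}': scrapes}

print(count_over_time(scrapes, now=3600.0, window_s=3600.0))  # 240
print(count_series(series))                                   # 2
```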

sum(), avg(), min(), max()

These aggregation functions are conceptually similar but apply differently.

In Datadog, aggregations often happen across both time and series simultaneously. avg:my.metric{*} averages across all series.

In Prometheus, you separate time aggregation from series aggregation:

| Datadog | Prometheus/MetricsQL |
| --- | --- |
| avg:my.metric{*} | avg(my_metric) |
| avg:my.metric{*}.rollup(avg, 60) | avg(avg_over_time(my_metric[1m])) |
| sum:my.metric{*} by {host} | sum by (host) (my_metric) |
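A short Python sketch shows why keeping the two aggregation steps separate matters. The host names and values are invented; the point is only that averaging per series first and then across series can give a different answer than pooling every raw point.

```python
from statistics import mean

# Hypothetical raw samples from the last minute for two series.
samples_by_series = {
    "host-a": [1.0, 3.0],
    "host-b": [10.0, 20.0, 30.0],
}

# Step 1, time aggregation: like avg_over_time(my_metric[1m]) per series.
per_series_avg = {name: mean(vals) for name, vals in samples_by_series.items()}

# Step 2, series aggregation: like avg(...) across the per-series results.
avg_of_avgs = mean(per_series_avg.values())

# Pooling every raw point instead weights series by their sample count.
pooled_avg = mean(v for vals in samples_by_series.values() for v in vals)

print(per_series_avg)  # {'host-a': 2.0, 'host-b': 20.0}
print(avg_of_avgs)     # 11.0
print(pooled_avg)      # 12.8
```

When two systems disagree on an average, checking which of these two orders each one uses is often a productive first step.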

rollup()

Datadog's .rollup() controls time aggregation explicitly. In Prometheus, you use *_over_time() functions:

| Datadog | Prometheus/MetricsQL |
| --- | --- |
| .rollup(avg, 300) | avg_over_time(metric[5m]) |
| .rollup(max, 300) | max_over_time(metric[5m]) |
| .rollup(sum, 300) | sum_over_time(metric[5m]) |
| .rollup(count, 300) | count_over_time(metric[5m]) |

as_count()

Datadog's .as_count() converts a rate metric back to a count representation. In Prometheus, use increase() to get the total increase over a time window.

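For example, if a counter http_requests_total goes from 1000 to 1500 over 5 minutes, increase(http_requests_total[5m]) returns 500, the total number of new requests in that window. Vanilla Prometheus extrapolates to the window boundaries, so it can return a slightly different, non-integer value; MetricsQL does not extrapolate.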

| Datadog | Prometheus/MetricsQL |
| --- | --- |
| my.counter{*}.as_count() | increase(my_counter[$interval]) |

top() and bottom()

| Datadog | Prometheus/MetricsQL |
| --- | --- |
| top(my.metric{*}, 10, 'mean', 'desc') | topk(10, my_metric) |
| bottom(my.metric{*}, 5, 'mean', 'asc') | bottomk(5, my_metric) |

MetricsQL also provides topk_last(), topk_avg(), and similar variants that may better match Datadog's behavior for specific aggregation methods.
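The reason the variants can matter: plain topk() ranks series separately at every evaluation step, while a 'mean' ranking picks the winners once for the whole window. The following Python sketch contrasts the two ideas with invented data; it models the ranking logic only, not the query engine.

```python
from statistics import mean

# Hypothetical series: one value per evaluation step.
series = {
    "svc-a": [5.0, 5.0, 5.0],
    "svc-b": [1.0, 1.0, 9.0],
    "svc-c": [4.0, 6.0, 2.0],
}


def topk_at_step(data, k, step):
    """Top k series by value at a single step (per-step ranking)."""
    return sorted(data, key=lambda name: data[name][step], reverse=True)[:k]


def topk_by_mean(data, k):
    """Top k series ranked once by their mean over the whole window."""
    return sorted(data, key=lambda name: mean(data[name]), reverse=True)[:k]


print([topk_at_step(series, 1, s) for s in range(3)])  # [['svc-a'], ['svc-c'], ['svc-b']]
print(topk_by_mean(series, 1))                         # ['svc-a']
```

With per-step ranking a graph can show more than k distinct series over time, which is usually the first visible difference from a Datadog top() widget.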

fill()

Datadog's fill() handles missing data points. In Prometheus, missing data typically appears as gaps. MetricsQL provides the default binary operator and the keep_last_value() function to handle this:

| Datadog | Prometheus/MetricsQL |
| --- | --- |
| .fill(zero) | my_metric default 0 |
| .fill(last) | keep_last_value(my_metric) (MetricsQL) |

Troubleshooting data discrepancies

When comparing Datadog and groundcover side-by-side during a migration, you'll likely see numbers that don't match perfectly. Here's how to diagnose common issues.

Step size and resolution

The problem: Your Datadog dashboard shows 1000 requests/sec but groundcover shows 950.

Why it happens: Datadog and Prometheus may use different step sizes (the interval between data points). A 10-second step captures more detail than a 60-second step, and aggregations over these intervals produce different results.

How to fix it:

In groundcover dashboards, explicitly control the step using the step parameter in your query or by adjusting the time range.

Compare like-for-like by:

  1. Using the same time range in both systems

  2. Explicitly setting rollup/step to match

  3. Understanding that Datadog auto-adjusts rollup based on time range while Prometheus requires explicit ranges
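As a rough illustration of how step size changes what a graph shows, the Python sketch below averages the same raw samples into 10-second and 60-second buckets. The sample values and the 5-second spacing are assumptions chosen to make the arithmetic easy to follow.

```python
from statistics import mean

SAMPLE_SPACING_S = 5  # assumed spacing between raw samples

# One minute of hypothetical requests/sec readings with a brief spike.
raw = [900.0] * 11 + [2100.0]


def rollup_avg(values, step_s):
    """Average consecutive samples into buckets of step_s seconds."""
    per_bucket = step_s // SAMPLE_SPACING_S
    return [mean(values[i:i + per_bucket]) for i in range(0, len(values), per_bucket)]


print(rollup_avg(raw, 10))  # [900.0, 900.0, 900.0, 900.0, 900.0, 1500.0]
print(rollup_avg(raw, 60))  # [1000.0]
```

The same data produces a peak of 1500 at a 10-second step and a flat 1000 at a 60-second step, so neither number is wrong; they answer different questions.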

Missing filters

The problem: Datadog shows 500 errors/min but groundcover shows 2000.

Why it happens: Datadog might have implicit filters from dashboard template variables, saved views, or default scopes that aren't obvious. Tags in Datadog might map to different label names in groundcover.

How to fix it:

  1. Check for template variables in the Datadog dashboard that apply filters

  2. Review the full Datadog query including any .filter() calls

  3. Map Datadog tags to groundcover labels—common mappings include:

    • host → node_name or instance

    • env → env

    • service → workload

    • kube_namespace → namespace
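One way to apply such a mapping consistently is a small translation helper. The Python sketch below is hypothetical: the function name and the input format are assumptions, the mapping is copied from the list above, and "host" is mapped to node_name even though instance may be the right label in some setups.

```python
# Mapping taken from the list above; "host" could also map to "instance".
TAG_TO_LABEL = {
    "host": "node_name",
    "env": "env",
    "service": "workload",
    "kube_namespace": "namespace",
}


def scope_to_matchers(scope: str) -> str:
    """Turn a Datadog scope like 'env:prod,service:api' into label matchers."""
    matchers = []
    for pair in scope.split(","):
        tag, _, value = pair.strip().partition(":")
        label = TAG_TO_LABEL.get(tag, tag)  # unmapped tags keep their name
        matchers.append(f'{label}="{value}"')
    return "{" + ", ".join(matchers) + "}"


print("my_metric" + scope_to_matchers("env:prod,service:checkout,host:node-1"))
# my_metric{env="prod", workload="checkout", node_name="node-1"}
```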

Environment and cluster scope

The problem: Numbers are dramatically different between the two systems.

Why it happens: You might be looking at different environments or clusters. Datadog scopes might include production + staging, while groundcover might be filtering to a single cluster.

How to fix it:

Always explicitly filter by environment and cluster in both systems, so that each query states the scope it covers instead of relying on defaults.

Check the groundcover query builder's environment and cluster filters—they persist across sessions and might be limiting your view.

Autofill and null handling

The problem: Datadog shows a smooth line while groundcover shows gaps.

Why it happens: Datadog's .fill(last) or .fill(zero) substitutes a value for missing data points. Prometheus shows gaps where no data exists.

How to fix it:

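In MetricsQL (groundcover's metrics language), use my_metric default 0 to fill gaps with zeros, or keep_last_value(my_metric) to carry the last seen value forward.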

Be aware that filling with zeros can distort averages—a gap might mean "no data" not "zero value."
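The effect on averages is easy to see with a toy example. The Python sketch below uses an invented series in which None marks a step with no data; the two helpers only imitate the idea behind zero-fill and last-value fill and are not the actual implementations.

```python
from statistics import mean

# Hypothetical series where None marks a step with no data.
series = [10.0, None, None, 10.0]


def fill_zero(values):
    """Like .fill(zero) or `my_metric default 0`: gaps become 0."""
    return [0.0 if v is None else v for v in values]


def fill_last(values):
    """Like .fill(last) or keep_last_value(): repeat the previous value.

    Gaps before the first real sample stay None in this sketch.
    """
    filled, last = [], None
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled


print(mean(v for v in series if v is not None))  # 10.0 (gaps ignored)
print(mean(fill_zero(series)))                   # 5.0
print(mean(fill_last(series)))                   # 10.0
```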

Counter resets

The problem: You see sudden spikes or dips in rate calculations.

Why it happens: When a pod restarts, counters reset to zero. Prometheus's rate() handles this, but if your range selector is too short, you might capture artifacts. Datadog's pre-aggregation might smooth these over differently.

How to fix it:

Use a range selector at least 4x your scrape interval. With a 15-second scrape interval, for example, that means rate(my_counter[1m]) or a longer window.

Consider using increase() for total counts over a period rather than instantaneous rates.
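The reset handling can be sketched in Python as follows. This is a simplified model: it treats any drop in value as a restart from zero and ignores the boundary extrapolation that Prometheus applies, and the sample values are invented. It also hints at why a wider window helps, since a window needs at least two samples to yield any increase and a missed scrape near a restart leaves little to work with.

```python
def increase(values):
    """Total increase across cumulative samples, treating any drop as a reset."""
    total = 0.0
    for prev, curr in zip(values, values[1:]):
        # After a reset the counter restarts from zero, so the new value
        # itself is the increase since the restart.
        total += curr - prev if curr >= prev else curr
    return total


def rate(values, window_s):
    """Per-second rate: increase divided by the window length."""
    return increase(values) / window_s


# Hypothetical counter scraped every 15 s; the pod restarts after 130.
samples = [100.0, 110.0, 120.0, 130.0, 5.0, 15.0]

print(increase(samples))           # 45.0
print(samples[-1] - samples[0])    # -85.0 (naive last-minus-first)
print(rate(samples, 75))           # 0.6
```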

Metric name differences

The problem: The metric exists in Datadog but doesn't seem to exist in groundcover.

Why it happens: Metrics might have different names. groundcover uses the groundcover_ prefix for its built-in metrics and follows Prometheus naming conventions (snake_case with units as suffixes).

Common mappings:

| Datadog | groundcover |
| --- | --- |
| system.cpu.user | groundcover_container_cpu_usage_rate_millis |
| kubernetes.cpu.usage.total | groundcover_container_cpu_usage_rate_millis |
| system.mem.used | groundcover_node_mem_used_bytes |
| kubernetes.memory.usage | groundcover_container_mem_used_bytes |
| kube_pod_status_phase | groundcover_kube_pod_status_phase |

For a full list of available metrics, see the Metrics & Labels reference.

Thinking in PromQL vs Datadog query language

Beyond syntax translation, there's a mental model shift worth making.

Datadog queries are often imperative: "Take this metric, filter it, aggregate it, roll it up." You describe a sequence of operations.

Prometheus queries are more declarative: "Give me the 5-minute rate of this counter, summed by service." You describe the result you want.

This shows up in how you build complex queries:

Datadog style:

Prometheus style:

Both compute error rate, but the Prometheus version reads more like a mathematical formula. The rate calculation and aggregation happen together, not as sequential transformations.

What to do when migration tools don't get it right

groundcover's migration tools handle most query translations automatically. But complex queries or custom metrics might need manual adjustment.

When a migrated monitor or dashboard doesn't match expected values:

  1. Start simple: Query the raw metric without aggregations in both systems

  2. Check cardinality: Ensure the same number of series exist in both systems

  3. Match time ranges: Use identical absolute time ranges, not relative ones

  4. Add aggregations incrementally: Build up the query step-by-step, comparing at each stage

  5. Account for timing: Datadog and groundcover might not have collected data at exactly the same moments—allow for small variations
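When comparing values at each stage of that checklist, it can help to decide up front how much variation counts as "small". The Python sketch below uses a relative tolerance; the 5% default and the function name are assumptions for illustration, not a groundcover recommendation.

```python
import math


def within_tolerance(datadog_value, groundcover_value, rel_tol=0.05):
    """True when the two readings differ by at most rel_tol of the larger one."""
    return math.isclose(datadog_value, groundcover_value, rel_tol=rel_tol)


print(within_tolerance(1000.0, 990.0))  # True
print(within_tolerance(500.0, 2000.0))  # False
```

A mismatch well outside the tolerance usually points to a filter, scope, or aggregation difference rather than collection timing.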
