# Spans to Metrics

#### Overview

Transform your trace data into queryable metrics for long-term monitoring, alerting, and cost-effective analysis. Spans-to-metrics extracts numerical data from span attributes and converts it into time-series metrics that can be visualized, alerted on, and retained at a fraction of the cost of raw traces.

{% hint style="warning" %}
**Sampling matters.** For **eBPF-captured traces**, the pipeline runs **after sampling**, so metrics generated from eBPF spans reflect only sampled data, not total traffic. For **ingested traces**, the pipeline runs **before sampling**, so metrics reflect the full, unsampled dataset. Keep this distinction in mind when interpreting counts and sums.
{% endhint %}

#### Why Use Spans to Metrics?

Traces are perfect for debugging specific requests, but they become expensive and unwieldy at scale. Metrics, on the other hand, are:

* **Cost-effective** - Store aggregated data instead of every span
* **Fast to query** - Optimized for time-series analysis
* **Perfect for alerting** - Track trends and thresholds over time

**The Transformation**

Think of it like converting request traces into spreadsheet rows:

**A Span (request):**

```
POST /api/orders — duration: 120ms, status: Ok, workload: order-service
```

**Becomes Metrics (structured data):**

| Timestamp | Metric Name                 | Value | Labels                                         |
| --------- | --------------------------- | ----- | ---------------------------------------------- |
| `[now]`   | `order_requests_total`      | `1`   | `workload:order-service, endpoint:/api/orders` |
| `[now]`   | `order_request_duration_ms` | `120` | `workload:order-service, endpoint:/api/orders` |

You're essentially turning trace data into countable, measurable data points.
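
In rule form, the transformation above might look like the following minimal sketch (names are illustrative; groundcover appends the operation suffix, so this emits `order_requests_count` and `order_request_duration_ns_sum`, with duration in nanoseconds as in the examples later on this page):

```yaml
ottlRules:
  - ruleName: s2m-order-requests
    conditions:
      - workload == "order-service"
    statements:
      # Labels shared by both metrics
      - set(s2m["workload"], workload)
      - set(s2m["endpoint"], span_name)
      # One count metric and one duration metric per matching span
      - span_to_metric_count("order_requests", s2m)
      - span_to_metric_sum("order_request_duration_ns", s2m, Double(duration))
```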

#### When to Use Spans to Metrics

Spans to metrics doesn't replace traces; it complements them. Use it for these scenarios:

**1. Monitoring Latency Distribution**

Track response time trends to understand system performance.

**Use cases:**

* Average, minimum, and maximum response times per endpoint
* Service latency degradation detection
* Comparing latency across workloads or protocols

**Example:**

Create a metric `request_duration_ns` with sum, min, max, and count, then calculate average latency:

```
rate(request_duration_ns_sum[5m]) / rate(request_duration_ns_count[5m])
```

**2. Tracking Error Ratios**

Derive error rates by comparing error span counts to total span counts.

**Use cases:**

* Error ratio per service or endpoint
* Detecting degradation trends over time
* Protocol-level error comparison (HTTP vs gRPC)

**Example:**

Create separate count metrics for all spans and error spans, then compute the ratio:

```
rate(span_errors_total[5m]) / rate(span_requests_total[5m])
```

**3. Enrichment-Based Metrics**

Generate metrics from data you've extracted or enriched in earlier pipeline rules — such as parsed response fields, custom attributes, or header values.

**Use cases:**

* Metrics derived from JSON response body fields (e.g. `cache["order_total"]`)
* Counting spans by custom attributes set in transform rules
* Aggregating values extracted from request/response headers

**Example:**

After a transform rule parses `order_total` from the response body into `cache`, create a sum metric:

```yaml
- 'span_to_metric_sum("order_total_amount", s2m, Double(cache["order_total"]))'
```
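
A fuller sketch showing both steps in one rule (the `response_body` attribute name is hypothetical, and this assumes OTTL's `ParseJSON` converter is available in the pipeline):

```yaml
ottlRules:
  - ruleName: s2m-order-total
    conditions:
      - workload == "order-service"
      - attributes["response_body"] != nil
    conditionLogicOperator: "and"
    statements:
      # Parse the JSON response body into the cache map (hypothetical attribute name)
      - set(cache, ParseJSON(attributes["response_body"]))
      # Label and sum the extracted value
      - set(s2m["workload"], workload)
      - span_to_metric_sum("order_total_amount", s2m, Double(cache["order_total"]))
```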

**4. Counting Ingested (OTel) Request Volume**

For **ingested traces only** (not eBPF), you can count absolute request volume because the pipeline runs before sampling.

**Use cases:**

* Total API request counts per endpoint
* Service-to-service call frequency
* Request volume per protocol type

{% hint style="warning" %}
Absolute counts from **eBPF spans** reflect sampled data only. Use them for relative comparisons and trend analysis, not for exact volume measurement.
{% endhint %}
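
For example, a minimal rule counting ingested HTTP requests per endpoint (field and metric names follow the conventions used elsewhere on this page):

```yaml
ottlRules:
  - ruleName: s2m-otel-request-volume
    conditions:
      - protocol_type == "http"
    statements:
      - set(s2m["endpoint"], span_name)
      - set(s2m["workload"], workload)
      # Emits otel_requests_count (the count suffix is added automatically)
      - span_to_metric_count("otel_requests", s2m)
```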

#### How Spans-to-Metrics Works

Spans-to-metrics uses the special `s2m` map to define metric labels, together with `span_to_metric_*` functions that specify which metrics to create and how values are aggregated.

**Available Operations**

groundcover supports four metric aggregation operations:

| Function               | Description                   | Use Case                                |
| ---------------------- | ----------------------------- | --------------------------------------- |
| `span_to_metric_count` | Count spans matching criteria | Request counts, event occurrences       |
| `span_to_metric_sum`   | Sum extracted values          | Total duration, total payload size      |
| `span_to_metric_max`   | Maximum value observed        | Peak response time, largest payload     |
| `span_to_metric_min`   | Minimum value observed        | Fastest response time, smallest payload |

{% hint style="info" %}
groundcover automatically appends the operation type as a suffix to each generated metric name (e.g., `_sum`, `_min`, `_max`, `_count`).
{% endhint %}

**Basic Structure**

```yaml
ottlRules:
  - ruleName: s2m-example
    conditions:
      - workload == "my-service"
    statements:
      # 1. Define metric labels in s2m map
      - set(s2m["label_name"], attributes["field"])
      
      # 2. Create metrics with aggregations
      - span_to_metric_count("metric_name", s2m)
      - span_to_metric_sum("metric_name", s2m, Double(attributes["value"]))
```

#### Best Practices

1. **Choose meaningful metric names** - Use descriptive names that indicate what's being measured
   * Good: `http_requests_total`, `order_processing_duration`
   * Bad: `metric1`, `counter`
2. **Use appropriate labels** - Add dimensions that help you slice and dice the data
   * Common labels: `workload`, `endpoint`, `protocol_type`, `status`, `namespace`
   * Avoid high-cardinality labels (unique trace IDs, user IDs, span IDs)
3. **Use conditions to scope rules** - Only generate metrics from relevant spans to minimize overhead (see the sketch after this list)
4. **Combine operations** - Use count, sum, min, and max together for comprehensive insights
   * Count requests + sum duration = average latency
   * Min/max provide performance bounds
5. **Use type conversion** - Always convert values to `Double()` for sum/min/max operations
   * `Double(attributes["duration"])` not `attributes["duration"]`
6. **Prefer ratios over absolute counts for eBPF** - Since eBPF spans are sampled, ratios (e.g. error rate) are more reliable than raw counts
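
A minimal sketch of a well-scoped rule with low-cardinality labels (the `production` namespace is illustrative):

```yaml
ottlRules:
  - ruleName: s2m-scoped-example
    conditions:
      - namespace == "production"   # only process spans you actually need
    statements:
      - set(s2m["workload"], workload)   # bounded set of values
      - set(s2m["status"], status)       # bounded set of values
      # Avoid labels such as trace or user IDs: every unique value creates a new time series
      - span_to_metric_count("prod_spans", s2m)
```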

#### Viewing Your Metrics

After creating spans-to-metrics rules:

1. Metrics appear in [**Metrics Explorer**](https://app.groundcover.com/explore/data-explorer) within minutes
2. Use PromQL to query your custom metrics
3. Create [dashboards](https://app.groundcover.com/dashboards) to visualize trends
4. Set up [monitors](https://app.groundcover.com/monitors) for alerting on thresholds

#### Common Use Cases

**Tracking Request Duration**

Monitor response times with min, max, and sum. Each sampled span's duration is measured accurately, so these latency metrics stay representative even on sampled eBPF data.

```yaml
ottlRules:
  - ruleName: s2m-request-duration
    conditions:
      - workload == "api-gateway"
      - protocol_type == "http"
    conditionLogicOperator: "and"
    statements:
      - set(s2m["endpoint"], span_name)
      - set(s2m["workload"], workload)
      - span_to_metric_sum("request_duration_ns", s2m, Double(duration))
      - span_to_metric_max("request_duration_ns", s2m, Double(duration))
      - span_to_metric_min("request_duration_ns", s2m, Double(duration))
      - span_to_metric_count("request_duration_ns", s2m)
```

**Output metrics:**

```
request_duration_ns_sum{endpoint="GET /api/users", workload="api-gateway"} = 165000000
request_duration_ns_max{endpoint="GET /api/users", workload="api-gateway"} = 120000000
request_duration_ns_min{endpoint="GET /api/users", workload="api-gateway"} = 45000000
request_duration_ns_count{endpoint="GET /api/users", workload="api-gateway"} = 2
```

💡 **What it does:** Tracks latency distribution per endpoint. Calculate average latency with `rate(request_duration_ns_sum[5m]) / rate(request_duration_ns_count[5m])`.

**Tracking Error Ratios**

Compare error spans to total spans for reliable error rate monitoring.

```yaml
ottlRules:
  - ruleName: s2m-all-requests
    conditions:
      - protocol_type == "http"
    statements:
      - set(s2m["workload"], workload)
      - set(s2m["namespace"], namespace)
      - span_to_metric_count("http_spans_total", s2m)

  - ruleName: s2m-error-requests
    conditions:
      - protocol_type == "http"
      - status == "Error"
    conditionLogicOperator: "and"
    statements:
      - set(s2m["workload"], workload)
      - set(s2m["namespace"], namespace)
      - span_to_metric_count("http_spans_errors", s2m)
```

💡 **What it does:** Creates two count metrics. Divide their rates to get the error rate, as shown below. This ratio is reliable even on sampled eBPF data.
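
A sample error-rate query (the 5-minute window is illustrative; the `_count` suffix follows the naming convention shown earlier):

```
sum by (workload, namespace) (rate(http_spans_errors_count[5m]))
/
sum by (workload, namespace) (rate(http_spans_total_count[5m]))
```

Aggregating with `sum by (...)` keeps identical label sets on both sides, so the division matches series one-to-one.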

**Monitoring Payload Size**

Track request and response body sizes from span attributes.

```yaml
ottlRules:
  - ruleName: s2m-payload-size
    conditions:
      - workload == "api-gateway"
      - attributes["response_size"] != nil
    conditionLogicOperator: "and"
    statements:
      - set(s2m["endpoint"], span_name)
      - set(s2m["workload"], workload)
      - span_to_metric_sum("response_bytes", s2m, Double(attributes["response_size"]))
      - span_to_metric_max("response_bytes", s2m, Double(attributes["response_size"]))
      - span_to_metric_min("response_bytes", s2m, Double(attributes["response_size"]))
      - span_to_metric_count("response_bytes", s2m)
```

💡 **What it does:** Tracks payload size distribution. Min/max values are accurate regardless of sampling.
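
To chart the average response size, divide the sum rate by the count rate (the 5-minute window is illustrative):

```
rate(response_bytes_sum[5m]) / rate(response_bytes_count[5m])
```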

#### Key Functions

**s2m Map**

The `s2m` map stores the labels (dimensions) for your metrics.

```yaml
- set(s2m["label_name"], "value")
- set(s2m["endpoint"], span_name)
- set(s2m["workload"], workload)
```

**span\_to\_metric\_count**

Counts the number of spans matching the rule.

**Syntax:**

```yaml
- span_to_metric_count("metric_name", s2m)
```

**Use for:** Request counts, event occurrences, error counts

**span\_to\_metric\_sum**

Sums numerical values from spans.

**Syntax:**

```yaml
- span_to_metric_sum("metric_name", s2m, Double(attributes["value"]))
```

**Use for:** Total duration, total bytes, cumulative values

**span\_to\_metric\_max**

Tracks the maximum value observed.

**Syntax:**

```yaml
- span_to_metric_max("metric_name", s2m, Double(attributes["value"]))
```

**Use for:** Peak response times, largest payloads

**span\_to\_metric\_min**

Tracks the minimum value observed.

**Syntax:**

```yaml
- span_to_metric_min("metric_name", s2m, Double(attributes["value"]))
```

**Use for:** Fastest response times, smallest payloads
