Metrics
The framework exposes metrics through Quarkus and Micrometer, giving step-level visibility into throughput, latency, and failures.
Built-in Metrics
Typical metrics you can expect to expose:
- Execution duration per step
- Success and failure counts
- End-to-end pipeline latency
- Throughput and backpressure signals
- Error rates by step and error type
Micrometer Integration
Micrometer is the default metrics façade. You can export to Prometheus or other backends supported by Quarkus.
quarkus.micrometer.export.prometheus.enabled=true
quarkus.micrometer.export.prometheus.path=/q/metricsDashboards
Pair metrics with Grafana dashboards that show:
- Step latency percentiles (p95/p99)
- Throughput per step
- Error rate by step
- Pipeline end-to-end latency
LGTM Metrics Pipeline
LGTM Dev Services ship an OTLP collector and Prometheus. Grafana's built-in dashboards read from the Prometheus datasource, so Prometheus scraping must be enabled even if OTLP export is configured. For OTLP-first dashboards, you need a Grafana datasource that reads OTLP metrics storage (for example Mimir) instead of Prometheus.
Parallelism and Backpressure
TPF emits additional metrics and span attributes to showcase parallelism and buffer pressure:
Metrics (OTel/Micrometer):
tpf.step.inflight(gauge): in-flight items per step (tpf.step.classattribute)tpf.step.buffer.queued(gauge): queued items in the backpressure buffer (tpf.step.classattribute)tpf.step.buffer.capacity(gauge): configured backpressure buffer capacity per step (tpf.step.classattribute)tpf.step.parent(attribute): parent step class for plugin steps (same astpf.step.classfor regular steps)tpf.pipeline.max_concurrency(gauge): configured max concurrency for the pipeline runtpf.item.produced(counter): items produced at the configured item boundarytpf.item.consumed(counter): items consumed at the configured item boundarytpf.slo.rpc.server.*(counters): SLO-ready totals for RPC server reliability and latency (gRPC + REST)tpf.slo.rpc.client.*(counters): SLO-ready totals for RPC client reliability and latency (gRPC + REST)tpf.slo.item.throughput.*(counters): SLO-ready totals for item throughput per run
Prometheus exports these as *_items because the unit is set to items.
Note: tpf.step.* metrics represent step executions (not domain items). Use the tpf.item.* counters when you want throughput for a specific domain type.
Note: New Relic dimensional metrics treat tpf.slo.item.throughput.* as event-counted counters, so SLOs should use COUNT (not SUM) over metricName = 'tpf.slo.item.throughput.total|good'.
Aspect position note: AFTER_STEP observes the output of each step. This captures every boundary except the very first input boundary (before the pipeline starts). Conversely, BEFORE_STEP captures every boundary except the final output boundary (after the pipeline completes). Use two aspects if you need complete boundary coverage.
Run-level span attributes (on tpf.pipeline.run):
tpf.parallel.max_in_flighttpf.parallel.avg_in_flight
These are designed for batch-style pipelines where parallelism should be inspected while the pipeline is running.
Tip: gauges report the instantaneous value, so after a run finishes they will return to 0. When querying, use a max over time window to surface the peak:
max(tpf_step_inflight_items) by (tpf_step_class)
max(tpf_step_buffer_queued_items) by (tpf_step_class)Custom Metrics
Use Micrometer to add counters and timers inside your services:
@Inject
MeterRegistry registry;
Timer timer = registry.timer("payment.processing.duration");
Counter success = registry.counter("payment.processing.success");
return timer.recordCallable(() -> processPayment(record));Design Tips
- Prefer low-cardinality labels
- Track user-visible latency
- Align metrics with SLIs/SLOs
- Measure queue depth if you use streaming steps