Metrics

The framework exposes metrics through Quarkus and Micrometer, giving step-level visibility into throughput, latency, and failures.

Built-in Metrics

Typical metrics you can expect to expose:

  1. Execution duration per step
  2. Success and failure counts
  3. End-to-end pipeline latency
  4. Throughput and backpressure signals
  5. Error rates by step and error type

Micrometer Integration

Micrometer is the default metrics façade. You can export to Prometheus or other backends supported by Quarkus.

```properties
quarkus.micrometer.export.prometheus.enabled=true
quarkus.micrometer.export.prometheus.path=/q/metrics
```

Dashboards

Pair metrics with Grafana dashboards that show:

  1. Step latency percentiles (p95/p99)
  2. Throughput per step
  3. Error rate by step
  4. Pipeline end-to-end latency
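As a starting point, the latency and error-rate panels above can be driven by queries like the following. The metric names (`tpf_step_duration_seconds`, `tpf_step_failure_total`, `tpf_step_execution_total`) are illustrative assumptions; substitute whatever your Micrometer-to-Prometheus export actually produces:

```text
histogram_quantile(0.95, sum(rate(tpf_step_duration_seconds_bucket[5m])) by (le, tpf_step_class))
sum(rate(tpf_step_failure_total[5m])) by (tpf_step_class)
  / sum(rate(tpf_step_execution_total[5m])) by (tpf_step_class)
```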

Execution Channels and Signals

Queue-async operations involve three distinct channels that should be monitored separately:

| Channel | What it means | Core signals |
| --- | --- | --- |
| Worker/dispatcher control plane | Orchestration coordination and progress | Queue depth, worker lag, lease conflicts, stale commits, sweeper recoveries |
| Execution DLQ | Terminal execution failures | DLQ publish count, provider queue depth, oldest message age |
| Item Reject Sink | Item-level recover-and-continue business rejects | tpf.step.reject.total, provider queue depth (when durable), reject fingerprint concentration |

Operational interpretation:

  1. High worker lag or stale/lease contention points to orchestration pressure or dependency latency.
  2. Execution DLQ growth points to systemic execution failures that require execution-level triage.
  3. Item reject growth often indicates data-quality/business-rule drift and should route to business remediation and selective re-drive.
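These interpretations translate naturally into alert rules. A minimal Prometheus alerting sketch, assuming hypothetical exported names for DLQ publishes and worker lag (adjust to your provider's actual metrics and thresholds):

```yaml
groups:
  - name: tpf-queue-async
    rules:
      - alert: ExecutionDlqGrowth
        # Any sustained DLQ publishing is a systemic execution failure signal.
        expr: sum(rate(tpf_execution_dlq_publish_total[10m])) > 0
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Execution DLQ is receiving terminal failures; triage at the execution level."
      - alert: WorkerLagHigh
        expr: max(tpf_worker_lag_seconds) > 300
        for: 15m
        labels:
          severity: warn
        annotations:
          summary: "Sustained worker lag; check orchestration pressure or dependency latency."
```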

LGTM Metrics Pipeline

LGTM Dev Services ship an OTLP collector and Prometheus. Grafana's built-in dashboards read from the Prometheus datasource, so Prometheus scraping must be enabled even if OTLP export is configured. For OTLP-first dashboards, you need a Grafana datasource that reads OTLP metrics storage (for example Mimir) instead of Prometheus.

Parallelism and Backpressure

TPF emits additional metrics and span attributes that surface parallelism and buffer pressure:

Metrics (OTel/Micrometer):

  • tpf.step.inflight (gauge): in-flight items per step (tpf.step.class attribute)
  • tpf.step.buffer.queued (gauge): queued items in the backpressure buffer (tpf.step.class attribute)
  • tpf.step.buffer.capacity (gauge): configured backpressure buffer capacity per step (tpf.step.class attribute)
  • tpf.step.parent (attribute): parent step class for plugin steps (same as tpf.step.class for regular steps)
  • tpf.pipeline.max_concurrency (gauge): configured max concurrency for the pipeline run
  • tpf.item.produced (counter): items produced at the configured item boundary
  • tpf.item.consumed (counter): items consumed at the configured item boundary
  • tpf.slo.rpc.server.* (counters): SLO-ready totals for RPC server reliability and latency (gRPC + REST)
  • tpf.slo.rpc.client.* (counters): SLO-ready totals for RPC client reliability and latency (gRPC + REST)
  • tpf.slo.item.throughput.* (counters): SLO-ready totals for item throughput per run

Prometheus exports these as *_items because the unit is set to items.

Note: tpf.step.* metrics represent step executions (not domain items). Use the tpf.item.* counters when you want throughput for a specific domain type.
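For example, with the *_items naming above, per-run item throughput can be queried in Prometheus with rate() over the exported counters. The exact exported names follow the usual Micrometer-to-Prometheus renaming and may differ in your setup:

```text
sum(rate(tpf_item_produced_items_total[5m]))
sum(rate(tpf_item_consumed_items_total[5m]))
```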

Note: New Relic dimensional metrics treat tpf.slo.item.throughput.* as event-counted counters, so SLOs should use COUNT (not SUM) over metricName = 'tpf.slo.item.throughput.total|good'.
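A hedged NRQL sketch of that good/total SLI, assuming the metrics arrive under these exact names (verify in your account before wiring an SLO):

```text
FROM Metric
SELECT count(tpf.slo.item.throughput.good) / count(tpf.slo.item.throughput.total)
SINCE 1 hour ago
```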

Aspect position note: AFTER_STEP observes the output of each step. This captures every boundary except the very first input boundary (before the pipeline starts). Conversely, BEFORE_STEP captures every boundary except the final output boundary (after the pipeline completes). Use two aspects if you need complete boundary coverage.

Run-level span attributes (on tpf.pipeline.run):

  • tpf.parallel.max_in_flight
  • tpf.parallel.avg_in_flight

These are designed for batch-style pipelines where parallelism should be inspected while the pipeline is running.

Tip: gauges report the instantaneous value, so after a run finishes they will return to 0. When querying, use a max over time window to surface the peak:

```text
max(max_over_time(tpf_step_inflight_items[1h])) by (tpf_step_class)
max(max_over_time(tpf_step_buffer_queued_items[1h])) by (tpf_step_class)
```

Custom Metrics

Use Micrometer to add counters and timers inside your services:

```java
@Inject
MeterRegistry registry;

public PaymentResult handle(PaymentRecord record) throws Exception {
    Timer timer = registry.timer("payment.processing.duration");
    Counter success = registry.counter("payment.processing.success");

    // recordCallable rethrows any exception, so the counter only
    // increments on successful completion.
    PaymentResult result = timer.recordCallable(() -> processPayment(record));
    success.increment();
    return result;
}
```
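The same pattern can be exercised outside Quarkus with a plain SimpleMeterRegistry. The sketch below is illustrative: the PaymentMetrics class and its trivial processing body are hypothetical stand-ins, not part of TPF:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class PaymentMetrics {
    private final Timer timer;
    private final Counter success;

    public PaymentMetrics(MeterRegistry registry) {
        this.timer = registry.timer("payment.processing.duration");
        this.success = registry.counter("payment.processing.success");
    }

    // Times the work and counts only successful completions.
    public String process(String record) {
        String result = timer.record(() -> record.trim()); // stand-in for real processing
        success.increment();
        return result;
    }

    public double successCount() {
        return success.count();
    }
}
```

Because SimpleMeterRegistry keeps everything in memory, the same setup works in unit tests without a metrics backend.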

Orchestrator Queue-Async Signals

For QUEUE_ASYNC, include control-plane metrics in addition to step metrics. Treat these as required operational signals for GA readiness:

  1. lease claim conflicts (OCC contention),
  2. stale commit rejections,
  3. retry scheduling rate and retry-saturation ratio,
  4. due-sweeper recovery count (persisted-before-dispatch gap recovery),
  5. execution DLQ publish count and backlog depth,
  6. item reject sink publish count and backlog depth,
  7. queue depth and worker lag.

Use these to separate dependency outages (high retries, low success) from coordination issues (high stale/lease conflicts).
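One way to make that split concrete, assuming hypothetical exported names for the control-plane counters: the first expression rising while the second stays flat suggests a dependency outage; the second rising points at coordination contention:

```text
sum(rate(tpf_retry_scheduled_total[5m])) / sum(rate(tpf_step_success_total[5m]))
sum(rate(tpf_lease_conflict_total[5m])) + sum(rate(tpf_stale_commit_total[5m]))
```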

Implementation note:

  1. TPF core already emits step/pipeline telemetry.
  2. Control-plane metrics may be emitted by provider integration or surrounding platform telemetry (queue, datastore, worker runtime).
  3. Keep metric names stable per environment even if data comes from different backends.
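Provider-side signals can be bridged into the same Micrometer registry so names stay stable regardless of which backend supplies the data. A minimal sketch for a queue-depth gauge (the metric name tpf.worker.queue.depth is illustrative, not a TPF-defined name):

```java
import io.micrometer.core.instrument.MeterRegistry;
import java.util.Queue;

public class ControlPlaneGauges {
    // Registers a gauge that tracks the live size of the worker queue.
    // Micrometer holds a weak reference to the state object, so keep the
    // returned queue reference alive for as long as the gauge should report.
    public static <T> Queue<T> registerQueueDepth(MeterRegistry registry, Queue<T> queue) {
        return registry.gauge("tpf.worker.queue.depth", queue, Queue::size);
    }
}
```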

Step-level reject signal:

  • tpf.step.reject.total (counter): rejected step items published to item reject sinks.

Backlog signal note:

  1. TPF emits tpf.step.reject.total for reject throughput.
  2. Backlog depth is provider-native:
     • use SQS queue depth/age for durable sinks (provider=sqs),
     • use retained-size logs/metrics for the in-memory sink (provider=memory).

Design Tips

  1. Prefer low-cardinality labels
  2. Track user-visible latency
  3. Align metrics with SLIs/SLOs
  4. Measure queue depth if you use streaming steps