Metrics
The framework exposes metrics through Quarkus and Micrometer, giving step-level visibility into throughput, latency, and failures.
Built-in Metrics
Typical metrics you can expect to expose:
- Execution duration per step
- Success and failure counts
- End-to-end pipeline latency
- Throughput and backpressure signals
- Error rates by step and error type
Micrometer Integration
Micrometer is the default metrics façade. You can export to Prometheus or other backends supported by Quarkus.
quarkus.micrometer.export.prometheus.enabled=true
quarkus.micrometer.export.prometheus.path=/q/metrics
Dashboards
Pair metrics with Grafana dashboards that show:
- Step latency percentiles (p95/p99)
- Throughput per step
- Error rate by step
- Pipeline end-to-end latency
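The latency and error-rate panels above can be expressed in PromQL. The queries below are a sketch: they assume the step timer is exported as a Prometheus histogram named tpf_step_duration_seconds and that step metrics carry a tpf_step_class label (adjust both to the names your deployment actually exports):

```promql
# p95 step latency per step class, over a 5m rate window
histogram_quantile(0.95,
  sum by (le, tpf_step_class) (rate(tpf_step_duration_seconds_bucket[5m])))

# error rate per step class (assumes an outcome label on a step counter)
sum by (tpf_step_class) (rate(tpf_step_total{outcome="failure"}[5m]))
  / sum by (tpf_step_class) (rate(tpf_step_total[5m]))
```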
Execution Channels and Signals
Queue-async operations involve three distinct channels that should be monitored separately:
| Channel | What it means | Core signals |
|---|---|---|
| Worker/dispatcher control plane | orchestration coordination and progress | queue depth, worker lag, lease conflicts, stale commits, sweeper recoveries |
| Execution DLQ | terminal execution failures | DLQ publish count, provider queue depth, oldest message age |
| Item Reject Sink | item-level recover-and-continue business rejects | tpf.step.reject.total, provider queue depth (when durable), reject fingerprint concentration |
Operational interpretation:
- High worker lag or stale/lease contention points to orchestration pressure or dependency latency.
- Execution DLQ growth points to systemic execution failures that require execution-level triage.
- Item reject growth often indicates data-quality/business-rule drift and should route to business remediation and selective re-drive.
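The interpretation rules above can be sketched as a small triage-routing function. This is illustrative only: the thresholds, class names, and owner categories are assumptions, not part of TPF.

```java
// Sketch: route the three channels' signals to a triage owner.
// Thresholds and names are illustrative, not part of TPF.
public class ChannelTriage {

    enum Owner { ORCHESTRATION_ONCALL, EXECUTION_ONCALL, BUSINESS_REMEDIATION, NONE }

    static Owner route(double workerLagSeconds, double dlqGrowthPerMin, double rejectGrowthPerMin) {
        if (workerLagSeconds > 60)  return Owner.ORCHESTRATION_ONCALL;  // coordination pressure
        if (dlqGrowthPerMin > 0)    return Owner.EXECUTION_ONCALL;      // systemic execution failures
        if (rejectGrowthPerMin > 0) return Owner.BUSINESS_REMEDIATION;  // data-quality drift
        return Owner.NONE;
    }

    public static void main(String[] args) {
        System.out.println(route(120, 0, 0)); // high worker lag -> ORCHESTRATION_ONCALL
        System.out.println(route(5, 3, 0));   // DLQ growth -> EXECUTION_ONCALL
        System.out.println(route(5, 0, 7));   // reject growth -> BUSINESS_REMEDIATION
    }
}
```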
LGTM Metrics Pipeline
LGTM Dev Services ship an OTLP collector and Prometheus. Grafana's built-in dashboards read from the Prometheus datasource, so Prometheus scraping must be enabled even if OTLP export is configured. For OTLP-first dashboards, you need a Grafana datasource that reads OTLP metrics storage (for example Mimir) instead of Prometheus.
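A minimal configuration sketch for dual export, assuming the quarkus-micrometer-registry-prometheus and quarkus-opentelemetry extensions are on the classpath (property names may vary across Quarkus versions, and the collector endpoint shown is the Dev Services default, not a TPF setting):

```properties
# Keep Prometheus scraping enabled so Grafana's built-in dashboards keep working
quarkus.micrometer.export.prometheus.enabled=true
# Send OTLP telemetry to the local collector
quarkus.otel.exporter.otlp.endpoint=http://localhost:4317
```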
Parallelism and Backpressure
TPF emits additional metrics and span attributes to showcase parallelism and buffer pressure:
Metrics (OTel/Micrometer):
- tpf.step.inflight (gauge): in-flight items per step (tpf.step.class attribute)
- tpf.step.buffer.queued (gauge): queued items in the backpressure buffer (tpf.step.class attribute)
- tpf.step.buffer.capacity (gauge): configured backpressure buffer capacity per step (tpf.step.class attribute)
- tpf.step.parent (attribute): parent step class for plugin steps (same as tpf.step.class for regular steps)
- tpf.pipeline.max_concurrency (gauge): configured max concurrency for the pipeline run
- tpf.item.produced (counter): items produced at the configured item boundary
- tpf.item.consumed (counter): items consumed at the configured item boundary
- tpf.slo.rpc.server.* (counters): SLO-ready totals for RPC server reliability and latency (gRPC + REST)
- tpf.slo.rpc.client.* (counters): SLO-ready totals for RPC client reliability and latency (gRPC + REST)
- tpf.slo.item.throughput.* (counters): SLO-ready totals for item throughput per run
Prometheus exports these as *_items because the unit is set to items.
Note: tpf.step.* metrics represent step executions (not domain items). Use the tpf.item.* counters when you want throughput for a specific domain type.
Note: New Relic dimensional metrics treat tpf.slo.item.throughput.* as event-counted counters, so SLOs should use COUNT (not SUM) over metricName = 'tpf.slo.item.throughput.total|good'.
Aspect position note: AFTER_STEP observes the output of each step. This captures every boundary except the very first input boundary (before the pipeline starts). Conversely, BEFORE_STEP captures every boundary except the final output boundary (after the pipeline completes). Use two aspects if you need complete boundary coverage.
Run-level span attributes (on tpf.pipeline.run):
- tpf.parallel.max_in_flight
- tpf.parallel.avg_in_flight
These are designed for batch-style pipelines where parallelism should be inspected while the pipeline is running.
Tip: gauges report the instantaneous value, so after a run finishes they return to 0. When querying, apply max_over_time over a range that covers the run to surface the peak (the examples use a 1h window):
max by (tpf_step_class) (max_over_time(tpf_step_inflight_items[1h]))
max by (tpf_step_class) (max_over_time(tpf_step_buffer_queued_items[1h]))
Custom Metrics
Use Micrometer to add counters and timers inside your services:
@Inject
MeterRegistry registry;

Timer timer = registry.timer("payment.processing.duration");
Counter success = registry.counter("payment.processing.success");
// recordCallable times the call and rethrows any checked exception
var result = timer.recordCallable(() -> processPayment(record));
success.increment();
return result;
Orchestrator Queue-Async Signals
For QUEUE_ASYNC, include control-plane metrics in addition to step metrics. Treat these as required operational signals for GA readiness:
- lease claim conflicts (OCC contention),
- stale commit rejections,
- retry scheduling rate and retry-saturation ratio,
- due-sweeper recovery count (persisted-before-dispatch gap recovery),
- execution DLQ publish count and backlog depth,
- item reject sink publish count and backlog depth,
- queue depth and worker lag.
Use these to separate dependency outages (high retries, low success) from coordination issues (high stale/lease conflicts).
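One concrete way to apply this separation is to derive ratios from the counters above. The sketch below is illustrative, assuming you sample counter deltas over a window yourself; the method and variable names are not TPF APIs.

```java
// Sketch: separating dependency outages from coordination issues
// using counter deltas sampled over a window. Names are illustrative.
public class QueueAsyncHealth {

    // Retries scheduled per successful commit: high values with low
    // lease contention suggest a dependency outage.
    static double retrySaturation(long retriesScheduled, long successes) {
        return successes == 0 ? Double.POSITIVE_INFINITY
                              : (double) retriesScheduled / successes;
    }

    // Share of lease claims that hit OCC conflicts: high values
    // point at coordination pressure, not dependencies.
    static double leaseConflictShare(long leaseConflicts, long leaseAttempts) {
        return leaseAttempts == 0 ? 0.0 : (double) leaseConflicts / leaseAttempts;
    }

    public static void main(String[] args) {
        System.out.println(retrySaturation(500, 10));    // dependency-outage signature
        System.out.println(leaseConflictShare(40, 100)); // coordination-issue signature
    }
}
```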
Implementation note:
- TPF core already emits step/pipeline telemetry.
- Control-plane metrics may be emitted by provider integration or surrounding platform telemetry (queue, datastore, worker runtime).
- Keep metric names stable per environment even if data comes from different backends.
Step-level reject signal:
tpf.step.reject.total (counter): rejected step items published to item reject sinks.
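For dashboards, the reject counter can be rated per step. This query is a sketch assuming the Prometheus-exported name is tpf_step_reject_total with a tpf_step_class label (the exported name may carry an _items suffix depending on the configured unit):

```promql
sum by (tpf_step_class) (rate(tpf_step_reject_total[5m]))
```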
Backlog signal note:
- TPF emits tpf.step.reject.total for reject throughput.
- Backlog depth is provider-native:
  - use SQS queue depth/age for durable sinks (provider=sqs),
  - use retained-size logs/metrics for the in-memory sink (provider=memory).
Design Tips
- Prefer low-cardinality labels
- Track user-visible latency
- Align metrics with SLIs/SLOs
- Measure queue depth if you use streaming steps