Alerting

Alerts should be actionable, low-noise, and tied to user impact.

Batch-style pipelines behave differently from request/response APIs. Prefer run-based and item-based alerts over wall-clock throughput, which is dragged down by the idle time between runs.
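To illustrate why wall-clock throughput misleads for batch workloads, here is a minimal sketch that compares it with throughput measured only while a run is active. The `Run` record and both function names are hypothetical and exist only for this example; they are not part of any pipeline API.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One pipeline run (hypothetical shape, for illustration only)."""
    started_s: float   # run start, in seconds
    finished_s: float  # run end, in seconds
    items: int         # items processed during the run


def wall_clock_throughput(runs, window_s):
    """Items per second over the whole window, idle time included."""
    return sum(r.items for r in runs) / window_s


def run_throughput(runs):
    """Items per second counted only while a run was active."""
    active_s = sum(r.finished_s - r.started_s for r in runs)
    return sum(r.items for r in runs) / active_s if active_s else 0.0


# Two 100-second runs inside a one-hour window.
runs = [Run(0, 100, 5000), Run(3000, 3100, 5000)]
print(wall_clock_throughput(runs, 3600))  # under 3 items/s, dominated by idle time
print(run_throughput(runs))               # 50.0 items/s while running
```

A threshold on the first number would fire during every quiet period, while the second number only moves when the runs themselves slow down.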

Principles

Alert on symptoms, not every error
Use severity levels consistently
Include context in the alert payload
Separate SLO alerts from operational alerts
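As a rough sketch of how these principles could look in practice, the following defines an alert record with a fixed set of severity levels, an explicit SLO/operational flag, and a free-form context map. All names here (`Alert`, `Severity`, the `kind` values, and the context keys) are assumptions made for the example, not an existing schema.

```python
import json
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    """A closed set of levels keeps severity usage consistent."""
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class Alert:
    name: str
    severity: Severity
    kind: str            # "slo" or "operational" (hypothetical values)
    summary: str         # the user-visible symptom, not the raw error
    context: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the alert so the receiver gets the context with it."""
        return json.dumps(
            {
                "name": self.name,
                "severity": self.severity.value,
                "kind": self.kind,
                "summary": self.summary,
                "context": self.context,
            },
            sort_keys=True,
        )


alert = Alert(
    name="item-latency-above-slo",
    severity=Severity.WARNING,
    kind="slo",
    summary="Item latency is above 2x baseline",
    # Hypothetical context keys; include whatever helps the responder act.
    context={"run_id": "run-123", "step": "enrich", "window": "10m"},
)
print(alert.to_json())
```

Routing on `kind` is one way to keep SLO alerts and operational alerts on separate channels.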

Dashboards

Pair alerts with dashboards that show step latency, item throughput (while running), and error rates.

Common Alerts

Run failure rate above threshold (orchestrator)
Step error rate above SLO (gRPC server spans)
Item latency above SLO (run average or per-step)
Backpressure rising (buffer queue depth stays high)
Orchestrator runtime failure or restart loops

Practical Defaults

Start with:

Run failure rate > 1% over 1 day (warning)
Item avg latency > 2x baseline for 10 minutes (warning)
Buffer queue depth stays high for 5 minutes (warning)
DLQ growth sustained for 5 minutes (critical)

When using New Relic, derive these from tpf.pipeline.run spans and tpf.step.* metrics.
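The "for N minutes" conditions in the defaults above amount to checking that a predicate held for every sample in a trailing window. A minimal sketch of that check follows, assuming samples arrive as `(timestamp_seconds, value)` pairs in ascending order. The function names and sample data are hypothetical, and the sketch does not use any New Relic API.

```python
def run_failure_rate(failed, total):
    """Fraction of runs that failed; 0.0 when there were no runs."""
    return failed / total if total else 0.0


def sustained(samples, predicate, duration_s):
    """True if predicate held for every sample in the trailing window.

    Returns False when the history is shorter than duration_s, so a
    single bad sample cannot trigger the alert on its own.
    """
    if not samples:
        return False
    cutoff = samples[-1][0] - duration_s
    if samples[0][0] > cutoff:
        return False  # not enough history to cover the window
    return all(predicate(v) for ts, v in samples if ts >= cutoff)


def growing(samples, duration_s):
    """True if the value rose between every pair of consecutive samples."""
    deltas = [(t2, v2 - v1) for (_, v1), (t2, v2) in zip(samples, samples[1:])]
    return sustained(deltas, lambda d: d > 0, duration_s)


BASELINE_MS = 100.0  # assumed baseline latency, for illustration only

# One sample per minute: 10 minutes of latency, 6 minutes of DLQ depth.
latency = [(t, 250.0) for t in range(0, 601, 60)]
dlq_depth = [(t, t // 60) for t in range(0, 361, 60)]

if run_failure_rate(failed=2, total=100) > 0.01:
    print("warning: run failure rate above 1%")
if sustained(latency, lambda ms: ms > 2 * BASELINE_MS, 600):
    print("warning: item latency above 2x baseline for 10 minutes")
if growing(dlq_depth, 300):
    print("critical: DLQ growth sustained for 5 minutes")
```

Requiring full window coverage trades a slower first alert for fewer false positives right after a restart or a gap in data.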
