Alerting
Alerts should be actionable, low-noise, and tied to user impact.
Batch-style pipelines behave differently from request/response APIs: they sit idle between runs, so throughput measured over wall-clock time is diluted by idle periods. Prefer run-based and item-based alerts over wall-clock throughput.
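To illustrate why idle time distorts wall-clock throughput, here is a minimal sketch in Python. The `Run` record and its field names are hypothetical, chosen only for this example; they are not part of any pipeline API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Run:
    """Hypothetical record of one pipeline run (field names are assumptions)."""
    started_at: float      # epoch seconds
    finished_at: float     # epoch seconds
    items_processed: int


def wall_clock_throughput(runs: Sequence[Run], window_start: float, window_end: float) -> float:
    """Items per second over the whole window, idle time included."""
    total_items = sum(r.items_processed for r in runs)
    return total_items / (window_end - window_start)


def running_throughput(runs: Sequence[Run]) -> Optional[float]:
    """Items per second counted only while a run was active."""
    busy_seconds = sum(r.finished_at - r.started_at for r in runs)
    if busy_seconds <= 0:
        return None  # no run activity: nothing to measure, so nothing to alert on
    return sum(r.items_processed for r in runs) / busy_seconds


# Two one-minute runs, an hour apart, inside a two-hour window.
runs = [Run(0.0, 60.0, 600), Run(3600.0, 3660.0, 600)]
idle_diluted = wall_clock_throughput(runs, 0.0, 7200.0)  # about 0.17 items/s
while_running = running_throughput(runs)                 # 10.0 items/s
```

Both runs processed items at the same healthy rate, yet the wall-clock figure looks like a collapse in throughput. An alert keyed to the second number stays quiet while the pipeline is idle.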
Principles
- Alert on symptoms, not every error
- Use severity levels consistently
- Include context in the alert payload
- Separate SLO alerts from operational alerts
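One possible shape for an alert payload that follows these principles is sketched below. The class, field names, and context keys (`run_id`, `step`, `dashboard`) are assumptions made for illustration, not a schema defined by this project.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class Severity(str, Enum):
    """A fixed severity vocabulary, so every alert uses the same levels."""
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class Alert:
    name: str
    severity: Severity
    kind: str       # "slo" or "operational", kept separate for routing
    summary: str    # the user-visible symptom, not the raw error
    context: dict = field(default_factory=dict)  # what the responder needs to act

    def to_json(self) -> str:
        payload = asdict(self)
        payload["severity"] = self.severity.value
        return json.dumps(payload, sort_keys=True)


alert = Alert(
    name="dlq_growth",
    severity=Severity.CRITICAL,
    kind="operational",
    summary="DLQ has grown for 5 minutes",
    context={"run_id": "run-123", "step": "enrich", "dashboard": "pipeline-overview"},
)
encoded = alert.to_json()
```

Keeping `kind` as an explicit field is one way to route SLO alerts and operational alerts to different channels without parsing alert names.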
Dashboards
Pair alerts with dashboards that show step latency, item throughput (while running), and error rates.
Common Alerts
- Run failure rate above threshold (orchestrator)
- Step error rate above SLO (gRPC server spans)
- Item latency above SLO (run average or per-step)
- Backpressure rising (buffer queued stays high)
- Orchestrator runtime failure or restart loops
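Conditions such as "buffer queued stays high" fire only when the breach is sustained, not on a single spike. A minimal sketch of that check follows; the function name, the sample format, and the threshold of 800 are illustrative assumptions, since what counts as "high" depends on your buffer capacity.

```python
from typing import Sequence, Tuple


def sustained_above(
    samples: Sequence[Tuple[float, float]],  # (timestamp_seconds, value), oldest first
    threshold: float,
    duration_s: float,
) -> bool:
    """True if the most recent samples have stayed above `threshold`
    continuously for at least `duration_s` seconds."""
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # start of the current unbroken breach
        else:
            breach_start = None    # any dip resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration_s


# One sample per minute; threshold and values are made up for the example.
steady = [(t * 60.0, 900.0) for t in range(6)]
spiky = [(0.0, 900.0), (60.0, 900.0), (120.0, 100.0),
         (180.0, 900.0), (240.0, 900.0), (300.0, 900.0)]

fires_on_steady = sustained_above(steady, threshold=800.0, duration_s=300.0)
fires_on_spiky = sustained_above(spiky, threshold=800.0, duration_s=300.0)
```

The queue that drained once within the window does not fire, which keeps the alert low-noise when the buffer recovers on its own.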
Practical Defaults
Start with:
- Run failure rate > 1% over 1 day (warning)
- Item avg latency > 2x baseline for 10 minutes (warning)
- Buffer queued stays high for 5 minutes (warning)
- DLQ growth sustained for 5 minutes (critical)
When using New Relic, derive these from tpf.pipeline.run spans and tpf.step.* metrics.
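As a rough sketch, the first two defaults above can be written as plain threshold checks. The function and argument names are hypothetical, and the code assumes the caller has already aggregated the inputs over the stated windows (runs over 1 day, latency over 10 minutes); it does not query New Relic or any other backend.

```python
from typing import List, Tuple

RUN_FAILURE_RATE_LIMIT = 0.01   # > 1% over 1 day
LATENCY_BASELINE_FACTOR = 2.0   # > 2x baseline for 10 minutes


def run_failure_rate(failed_runs: int, total_runs: int) -> float:
    """Fraction of failed runs; 0.0 when no runs happened (idle is not a failure)."""
    return failed_runs / total_runs if total_runs else 0.0


def evaluate_defaults(
    failed_runs: int,
    total_runs: int,
    avg_item_latency_ms: float,
    baseline_latency_ms: float,
) -> List[Tuple[str, str]]:
    """Return (alert_name, severity) pairs for every default that is breached."""
    alerts = []
    if run_failure_rate(failed_runs, total_runs) > RUN_FAILURE_RATE_LIMIT:
        alerts.append(("run_failure_rate", "warning"))
    if avg_item_latency_ms > LATENCY_BASELINE_FACTOR * baseline_latency_ms:
        alerts.append(("item_avg_latency", "warning"))
    return alerts


breached = evaluate_defaults(3, 200, 450.0, 200.0)  # 1.5% failures, 2.25x baseline
healthy = evaluate_defaults(1, 200, 300.0, 200.0)   # 0.5% failures, 1.5x baseline
idle = evaluate_defaults(0, 0, 0.0, 200.0)          # no runs in the window
```

Treating a window with no runs as healthy is a deliberate choice here, so an idle pipeline does not page anyone.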