Alerting
Alerts should be actionable, low-noise, and tied to user impact.
Batch-style pipelines behave differently from request/response APIs: prefer run-based and item-based alerts over wall-clock throughput, which dips misleadingly across idle periods.
Principles
- Alert on symptoms, not every error
- Use severity levels consistently
- Include context in the alert payload
- Separate SLO alerts from operational alerts
Dashboards
Pair alerts with dashboards that show step latency, item throughput (while running), and error rates.
Common Alerts
- Run failure rate above threshold (orchestrator)
- Step error rate above SLO (gRPC server spans)
- Item latency above SLO (run average or per-step)
- Backpressure rising (buffer queued stays high)
- Orchestrator runtime failure or restart loops
Practical Defaults
Start with:
- Run failure rate > 1% over 1 day (warning)
- Item avg latency > 2x baseline for 10 minutes (warning)
- Buffer queued stays high for 5 minutes (warning)
- Execution DLQ backlog growth sustained for 5 minutes (critical, provider queue-depth metric)
- Item reject sink backlog growth sustained for 5 minutes (critical, provider queue-depth metric; in-memory sink uses retained-size logs instead of a backlog gauge)
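The starter thresholds above can be sketched as a simple evaluation function. This is a minimal illustration, not framework code: the field names on RunWindow are assumptions standing in for whatever your metrics backend exposes.

```python
from dataclasses import dataclass

@dataclass
class RunWindow:
    """Aggregated pipeline metrics over one evaluation window (field names are illustrative)."""
    runs_total: int
    runs_failed: int
    item_avg_latency_s: float
    item_baseline_latency_s: float

def evaluate_defaults(w: RunWindow) -> list[str]:
    """Apply the practical-default thresholds; returns triggered alert labels."""
    alerts = []
    # Run failure rate > 1% over the window -> warning
    if w.runs_total and w.runs_failed / w.runs_total > 0.01:
        alerts.append("warning: run failure rate > 1%")
    # Item average latency > 2x baseline -> warning
    if w.item_avg_latency_s > 2 * w.item_baseline_latency_s:
        alerts.append("warning: item latency > 2x baseline")
    return alerts
```

The duration conditions (10 minutes, 5 minutes) belong in the alerting system's "for" window, not in the evaluation itself.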
Queue-async additions:
- Due-sweeper recoveries stop while due backlog rises (critical)
- Lease conflict/stale-commit rate spikes above baseline (warning)
- Retry-saturation exceeds threshold (warning/critical by tenant tier)
- Queue age/lag exceeds execution SLO budget (critical)
When using New Relic, derive these from tpf.pipeline.run spans, tpf.step.* metrics (for example tpf.step.reject.total), and provider-native queue-depth metrics for DLQ/reject backlog.
Suggested starter thresholds:
- Queue oldest-message age > 2x target execution SLO for 10 minutes (critical).
- Retry-saturation ratio > 0.2 for 15 minutes (warning), > 0.4 (critical).
- Sweeper recoveries = 0 while due backlog grows for 5 minutes (critical).
- Lease/stale conflict rate > 3x 7-day baseline for 10 minutes (warning).
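As a sketch of the retry-saturation thresholds above: the ratio is the fraction of all attempts that are retries, classified against the 0.2/0.4 cut-offs. The counter names are assumptions about what your tpf.step.* metrics provide.

```python
def retry_saturation(retry_attempts: int, total_attempts: int) -> float:
    """Fraction of attempts that are retries (0.0 when no attempts occurred)."""
    return retry_attempts / total_attempts if total_attempts else 0.0

def classify_saturation(ratio: float) -> str:
    """Map a retry-saturation ratio onto the starter thresholds above."""
    if ratio > 0.4:
        return "critical"
    if ratio > 0.2:
        return "warning"
    return "ok"
```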
What Alerts Mean Operationally
Use channel-specific interpretation so incidents route to the right team.
Execution DLQ Backlog Growth (Critical)
Operational meaning:
- Terminal execution failures are accumulating faster than triage/re-drive.
- Queue-async control plane may be healthy, but execution outcomes are failing systemically.
Business meaning:
- End-to-end workflows are not completing.
- Customer-visible outcomes can be delayed, missing, or inconsistent until replay.
Immediate operator actions:
- Identify dominant terminal causes (FAILED vs DLQ) and affected contracts/steps.
- Validate downstream idempotency before any bulk re-drive.
- Re-drive in bounded batches and watch duplicate suppression and retry saturation.
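The bounded re-drive step can be sketched as below. Both hooks (redrive_batch, is_idempotent_safe) are hypothetical placeholders for operator tooling, not a real provider API.

```python
def redrive_in_batches(message_ids, redrive_batch, is_idempotent_safe, batch_size=100):
    """Re-drive DLQ messages in bounded batches.

    Filters out messages that are not safe to replay, then hands the rest to
    `redrive_batch` in chunks of `batch_size` so operators can pause between
    batches and watch duplicate suppression / retry saturation.
    Returns the number of messages re-driven.
    """
    ids = [m for m in message_ids if is_idempotent_safe(m)]
    sent = 0
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        redrive_batch(batch)
        sent += len(batch)
    return sent
```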
Item Reject Sink Backlog Growth (Critical)
Operational meaning:
- Recover-and-continue paths are active, but rejected items are not being drained.
- Step execution may still be healthy while reject-handling capacity is insufficient.
Business meaning:
- Main workflows can complete, but rejected records accumulate unresolved business exceptions.
- SLA risk is often on data completeness/quality rather than total platform availability.
Immediate operator actions:
- Segment by reject fingerprint/error class to isolate top failure cohorts.
- Coordinate data/business remediation for dominant reject reasons.
- Re-drive only corrected cohorts and track repeat-reject ratio.
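Segmenting by reject fingerprint can be as simple as a grouped count. This sketch assumes rejected items carry fingerprint and error_class fields; those names are illustrative, not part of the framework.

```python
from collections import Counter

def top_reject_cohorts(rejects, limit=3):
    """Group rejected items by (fingerprint, error_class) and return the
    largest cohorts, so remediation targets the dominant reject reasons first."""
    counts = Counter((r["fingerprint"], r["error_class"]) for r in rejects)
    return counts.most_common(limit)
```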
Worker Lag / Queue Age Breach (Critical)
Operational meaning:
- Dispatch and processing are behind incoming workload or dependency latency budget.
- Recovery paths (retry/sweeper) may amplify lag if left unchecked.
Business meaning:
- End-user latency and completion-time SLOs are at risk.
- Time-sensitive workflows may miss windows even without outright failure.
Immediate operator actions:
- Check dependency latency/error spikes and retry amplification signals.
- Scale workers or reduce ingest pressure temporarily.
- Verify sweeper activity and lease conflict levels during catch-up.
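The queue-age breach condition from the starter thresholds can be expressed as a one-line check; a sketch, assuming epoch-second timestamps for enqueue time:

```python
import time

def queue_age_breached(oldest_enqueued_at: float, slo_budget_s: float, now=None) -> bool:
    """True when the oldest queued message's age exceeds 2x the execution SLO
    budget (the starter threshold above). Timestamps are epoch seconds."""
    now = time.time() if now is None else now
    return (now - oldest_enqueued_at) > 2 * slo_budget_s
```

In practice the "for 10 minutes" part of the threshold is enforced by the alerting system's evaluation window rather than in code.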