Alerting

Alerts should be actionable, low-noise, and tied to user impact.

Batch-style pipelines behave differently from request/response APIs. Prefer run-based and item-based alerts to wall-clock throughput alerts, because throughput measured across idle periods looks like an outage even when nothing is wrong.

Principles

  1. Alert on symptoms, not every error
  2. Use severity levels consistently
  3. Include context in the alert payload
  4. Separate SLO alerts from operational alerts
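
Principles 2 to 4 can be made concrete with a small payload type. The following is a minimal sketch, not the framework's own schema: the `Alert` class, the `Severity` enum, the `kind` field, and the context keys are all hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum


class Severity(Enum):
    """A fixed set of levels keeps severity usage consistent across alerts."""
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class Alert:
    name: str
    severity: Severity
    kind: str  # "slo" or "operational", so the two families can route separately
    context: dict = field(default_factory=dict)  # e.g. pipeline, run id, step

    def to_json(self) -> str:
        """Serialize the alert, including its context, for a notification channel."""
        payload = asdict(self)
        payload["severity"] = self.severity.value  # enums are not JSON-serializable
        return json.dumps(payload, sort_keys=True)


alert = Alert(
    name="run_failure_rate",
    severity=Severity.WARNING,
    kind="operational",
    context={"pipeline": "orders", "run_id": "r-1"},  # hypothetical values
)
print(alert.to_json())
```

Carrying the run and step identifiers in the payload lets the responder jump straight to the affected run instead of searching dashboards first.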

Dashboards

Pair alerts with dashboards that show step latency, item throughput (while running), and error rates.

Common Alerts

  1. Run failure rate above threshold (orchestrator)
  2. Step error rate above SLO (gRPC server spans)
  3. Item latency above SLO (run average or per-step)
  4. Backpressure rising (buffer queued stays high)
  5. Orchestrator runtime failure or restart loops

Practical Defaults

Start with:

  1. Run failure rate > 1% over 1 day (warning)
  2. Item avg latency > 2x baseline for 10 minutes (warning)
  3. Buffer queued stays high for 5 minutes (warning)
  4. Execution DLQ backlog growth sustained for 5 minutes (critical, provider queue-depth metric)
  5. Item reject sink backlog growth sustained for 5 minutes (critical, provider queue-depth metric; in-memory sink uses retained-size logs instead of a backlog gauge)
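
The first two defaults can be sketched as plain functions over recorded runs and latency samples. This is an illustrative sketch under assumptions: the `Run` record, both function names, and the sample format are hypothetical, and in practice these conditions would usually be expressed in the monitoring provider's own query language.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Run:
    finished_at: datetime
    failed: bool


def run_failure_rate(runs, now, window=timedelta(days=1)):
    """Fraction of failed runs among runs that finished inside the window.

    Returns None when no runs finished in the window, so an idle pipeline
    produces no signal instead of a false alert.
    """
    recent = [r for r in runs if now - window <= r.finished_at <= now]
    if not recent:
        return None
    return sum(r.failed for r in recent) / len(recent)


def latency_breach_sustained(samples, baseline, now,
                             factor=2.0, duration=timedelta(minutes=10)):
    """True when every sample in the last `duration` exceeds factor * baseline.

    samples: list of (timestamp, average item latency) pairs.
    """
    window_start = now - duration
    # Require history reaching back to the window start, so that a single
    # fresh sample cannot fire the alert on its own.
    if not any(t <= window_start for t, _ in samples):
        return False
    recent = [v for t, v in samples if window_start <= t <= now]
    return bool(recent) and all(v > factor * baseline for v in recent)
```

A caller would raise the warning when `run_failure_rate` returns a value above 0.01, or when `latency_breach_sustained` returns True.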

Queue-async additions:

  1. Due-sweeper recoveries stop while due backlog rises (critical)
  2. Lease conflict/stale-commit rate spikes above baseline (warning)
  3. Retry-saturation exceeds threshold (warning/critical by tenant tier)
  4. Queue age/lag exceeds execution SLO budget (critical)
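
The first addition combines two signals, which makes it easy to get wrong. A minimal sketch of the check, assuming the caller supplies chronological samples that cover the alert window (the function name and sample shape are hypothetical):

```python
def sweeper_stalled(samples):
    """Detect a stalled due-sweeper.

    samples: chronological list of (recoveries_in_interval, due_backlog)
    pairs covering the alert window.

    Returns True only when no recoveries happened in any interval AND the
    due backlog is higher at the end of the window than at the start.
    """
    if len(samples) < 2:
        return False  # not enough data to tell whether the backlog is rising
    no_recoveries = all(recovered == 0 for recovered, _ in samples)
    backlog_rose = samples[-1][1] > samples[0][1]
    return no_recoveries and backlog_rose
```

Zero recoveries alone is not alarming, because an empty due backlog gives the sweeper nothing to recover. The alert only makes sense when both conditions hold together.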

When using New Relic, derive these from tpf.pipeline.run spans, tpf.step.* metrics (for example tpf.step.reject.total), and provider-native queue-depth metrics for DLQ/reject backlog.

Suggested starter thresholds:

  1. Queue oldest-message age > 2x target execution SLO for 10 minutes (critical).
  2. Retry-saturation ratio > 0.2 for 15 minutes (warning), > 0.4 (critical).
  3. Sweeper recoveries = 0 while due backlog grows for 5 minutes (critical).
  4. Lease/stale conflict rate > 3x 7-day baseline for 10 minutes (warning).
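
The numeric thresholds above map to simple classifiers. The sketch below is illustrative: the function names are hypothetical, the inputs are assumed to be already aggregated by the monitoring backend, and the "for N minutes" durations are assumed to be enforced by whatever evaluates these functions over time.

```python
from typing import Optional


def retry_saturation_severity(ratio: float) -> Optional[str]:
    """Map a retry-saturation ratio to a severity using the starter thresholds."""
    if ratio > 0.4:
        return "critical"
    if ratio > 0.2:
        return "warning"
    return None


def queue_age_severity(oldest_age_s: float, execution_slo_s: float) -> Optional[str]:
    """Critical when the oldest message is older than 2x the execution SLO."""
    return "critical" if oldest_age_s > 2 * execution_slo_s else None


def lease_conflict_severity(rate: float, baseline_7d: float) -> Optional[str]:
    """Warning when the conflict rate exceeds 3x its 7-day baseline."""
    if baseline_7d <= 0:
        return None  # no meaningful baseline yet, so no comparison is possible
    return "warning" if rate > 3 * baseline_7d else None
```

Checking the critical threshold before the warning threshold ensures a ratio of 0.5 reports as critical rather than stopping at warning.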

What Alerts Mean Operationally

Use channel-specific interpretation so incidents route to the right team.

Execution DLQ Backlog Growth (Critical)

Operational meaning:

  1. Terminal execution failures are accumulating faster than triage/re-drive.
  2. Queue-async control plane may be healthy, but execution outcomes are failing systemically.

Business meaning:

  1. End-to-end workflows are not completing.
  2. Customer-visible outcomes can be delayed, missing, or inconsistent until replay.

Immediate operator actions:

  1. Identify dominant terminal causes (FAILED vs DLQ) and affected contracts/steps.
  2. Validate downstream idempotency before any bulk re-drive.
  3. Re-drive in bounded batches and watch duplicate suppression and retry saturation.
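
Step 3 can be sketched as a loop that submits fixed-size batches and stops as soon as a batch fails too often. Everything named here is an assumption for illustration: `submit` stands in for whatever re-drive mechanism the deployment provides and is assumed to return the number of items in the batch that failed again, and the default batch size and failure ratio are placeholders, not recommended values.

```python
from itertools import islice


def bounded_batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def redrive(items, submit, batch_size=100, max_failure_ratio=0.2):
    """Re-drive items batch by batch.

    Returns (items_submitted, completed). Stops early, with completed=False,
    when the failure ratio of a batch exceeds max_failure_ratio.
    """
    submitted = 0
    for batch in bounded_batches(items, batch_size):
        failed_again = submit(batch)
        submitted += len(batch)
        if failed_again / len(batch) > max_failure_ratio:
            return submitted, False  # stop before amplifying the failure
    return submitted, True
```

Stopping after the first bad batch limits the blast radius when the underlying cause is not actually fixed, and leaves the remaining items in the DLQ for a later attempt.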

Item Reject Sink Backlog Growth (Critical)

Operational meaning:

  1. Recover-and-continue paths are active, but rejected items are not being drained.
  2. Step execution may still be healthy while reject-handling capacity is insufficient.

Business meaning:

  1. Main workflows can complete, but rejected records accumulate as unresolved business exceptions.
  2. SLA risk is often on data completeness/quality rather than total platform availability.

Immediate operator actions:

  1. Segment by reject fingerprint/error class to isolate top failure cohorts.
  2. Coordinate data/business remediation for dominant reject reasons.
  3. Re-drive only corrected cohorts and track repeat-reject ratio.
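
Steps 1 and 3 reduce to a grouping and a ratio. A minimal sketch, assuming each reject record is a dict with a `fingerprint` key and each item has a stable id (both are hypothetical shapes, not the sink's actual format):

```python
from collections import Counter


def top_reject_cohorts(rejects, n=3):
    """Return the n most frequent reject fingerprints as (fingerprint, count) pairs."""
    return Counter(r["fingerprint"] for r in rejects).most_common(n)


def repeat_reject_ratio(redriven_ids, rejected_again_ids):
    """Fraction of re-driven items that were rejected again."""
    redriven = set(redriven_ids)
    if not redriven:
        return 0.0
    return len(redriven & set(rejected_again_ids)) / len(redriven)


rejects = [
    {"id": 1, "fingerprint": "missing-field"},
    {"id": 2, "fingerprint": "missing-field"},
    {"id": 3, "fingerprint": "missing-field"},
    {"id": 4, "fingerprint": "bad-currency"},
    {"id": 5, "fingerprint": "bad-currency"},
    {"id": 6, "fingerprint": "timeout"},
]
print(top_reject_cohorts(rejects, n=2))
print(repeat_reject_ratio([1, 2, 3], [2]))
```

A repeat-reject ratio that stays high after a re-drive suggests the cohort was not actually corrected and further re-drives should wait.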

Worker Lag / Queue Age Breach (Critical)

Operational meaning:

  1. Dispatch and processing are falling behind the incoming workload, or dependency latency is exceeding its budget.
  2. Recovery paths (retry/sweeper) may amplify lag if left unchecked.

Business meaning:

  1. End-user latency and completion-time SLOs are at risk.
  2. Time-sensitive workflows may miss windows even without outright failure.

Immediate operator actions:

  1. Check dependency latency/error spikes and retry amplification signals.
  2. Scale workers or reduce ingest pressure temporarily.
  3. Verify sweeper activity and lease conflict levels during catch-up.