
Error Handling and Recovery

This guide is for operations triage and recovery. It focuses on runtime failure channels and background execution recovery behaviour.

The step-level Item Reject Sink is intentionally a business processing path, not an execution failure channel. For developer implementation guidance, see Item Reject Sink.

Failure Channels (Operational View)

| Channel | Scope | Trigger | Primary operational signal |
| --- | --- | --- | --- |
| Checkpoint Publication Backlog | pre-execution handoff | checkpoint publication cannot admit work into downstream orchestration quickly enough | publication lag/backlog and handoff latency |
| Item Reject Sink | individual items/streams | step-level recover-and-continue path | reject sink throughput/backlog trends |
| Execution DLQ | full async execution | terminal orchestration failure | execution DLQ backlog growth |

Triage rules:

  1. Rising item rejects with stable execution success usually indicate data quality or business-rule drift.
  2. A growing execution DLQ indicates control-plane, dependency, or systemic execution failure.
  3. A rising checkpoint publication backlog indicates throughput or admission pressure before downstream execution has started.
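
The triage rules above can be sketched as a small classifier. This is only an illustration: `Signals`, its field names, and `classify` are hypothetical and not part of the framework, and how a trend is judged to be "rising" or "stable" is left to your own monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical trend flags derived from your own dashboards."""
    item_rejects_rising: bool
    execution_success_stable: bool
    execution_dlq_growing: bool
    publication_backlog_rising: bool


def classify(s: Signals) -> list[str]:
    """Return one finding per triage rule that matches the observed signals."""
    findings = []
    if s.publication_backlog_rising:
        findings.append("throughput or admission pressure before downstream execution")
    if s.execution_dlq_growing:
        findings.append("control-plane, dependency, or systemic execution failure")
    if s.item_rejects_rising and s.execution_success_stable:
        findings.append("data quality or business-rule drift")
    return findings
```

More than one rule can match during the same incident, which is why the sketch returns a list rather than a single label.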

Execution DLQ Configuration (Background Execution)

```properties
pipeline.orchestrator.mode=QUEUE_ASYNC
pipeline.orchestrator.dlq-provider=sqs
pipeline.orchestrator.dlq-url=https://sqs.eu-west-1.amazonaws.com/123456789012/tpf-dlq
```

QUEUE_ASYNC is the mode name used by pipeline.orchestrator.mode for background execution.

Execution DLQ applies to terminal execution failures only. A DLQ is a dead-letter channel for failed executions that need investigation or replay. It does not replace item-level rejection flows.

Execution DLQ Envelope (Terminal Details)

Execution DLQ entries include standard fields for triage across REST, gRPC, local, and function-style execution:

  • execution fields: tenantId, executionId, executionKey, transitionKey
  • correlation/resource fields: correlationId, resourceType, resourceName
  • runtime identity fields: transport, platform, terminalStatus, createdAtEpochMs
  • failure fields: terminalReason, errorCode, errorMessage, retryable, retriesObserved
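
For triage tooling, the fields above can be modelled as a typed record. The sketch below assumes that entries are JSON objects whose keys match the field names listed above. The value types, the `ExecutionDlqEnvelope` class, and `parse_envelope` are illustrative assumptions, not a framework API.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ExecutionDlqEnvelope:
    # Field names mirror the list above; the types are assumptions.
    tenantId: str
    executionId: str
    executionKey: str
    transitionKey: str
    correlationId: str
    resourceType: str
    resourceName: str
    transport: str
    platform: str
    terminalStatus: str
    createdAtEpochMs: int
    terminalReason: str
    errorCode: str
    errorMessage: str
    retryable: bool
    retriesObserved: int


def parse_envelope(raw: str) -> ExecutionDlqEnvelope:
    """Parse a JSON-encoded entry, ignoring keys this sketch does not know."""
    data = json.loads(raw)
    known = {f.name for f in fields(ExecutionDlqEnvelope)}
    missing = known - set(data)
    if missing:
        raise ValueError(f"envelope is missing fields: {sorted(missing)}")
    return ExecutionDlqEnvelope(**{name: data[name] for name in known})
```

Failing loudly on a missing field is a deliberate choice in this sketch, so that an incomplete entry is noticed during triage instead of silently defaulting.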

Terminal reason mapping:

| terminalReason | Meaning | First action |
| --- | --- | --- |
| retry_exhausted | retryable failure class reached terminal state after exhausting its retry budget; includes zero-retry configurations (maxRetries = 0) | Stabilise dependency/path, then re-drive bounded batches |
| non_retryable | non-retryable failure class (for example NonRetryableException) | Correct payload/contract issue before replay |
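
When a DLQ holds many entries, it can help to group them by terminalReason and attach the first action from the mapping above. The following is a minimal sketch: `triage_plan` and the fallback text for unrecognised reasons are assumptions made for illustration.

```python
from collections import Counter

# First actions copied from the terminal reason mapping.
FIRST_ACTION = {
    "retry_exhausted": "Stabilise dependency/path, then re-drive bounded batches",
    "non_retryable": "Correct payload/contract issue before replay",
}

# Assumed fallback: unrecognised reasons are escalated rather than guessed at.
UNKNOWN_ACTION = "Unrecognised terminalReason: escalate for manual triage"


def triage_plan(terminal_reasons):
    """Map each observed terminalReason to (entry count, first action)."""
    counts = Counter(terminal_reasons)
    return {
        reason: (count, FIRST_ACTION.get(reason, UNKNOWN_ACTION))
        for reason, count in counts.items()
    }
```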

Background Execution Crash Matrix

| Crash point | Behaviour after restart/recovery | Duplicate risk | Required safeguard |
| --- | --- | --- | --- |
| Before transition state commit | Work is redelivered and re-executed from the last stored version | High | Idempotent operator boundary (executionId:stepIndex:attempt) |
| After state commit, before next enqueue | Transition is stored, but next dispatch can stall until due sweeper re-dispatches | Low | Due-execution sweeper + stored state |
| During retry scheduling | Retry can replay from the last stored version if scheduling details were not committed | Medium | Persist retry intent (attempt, nextDue) before enqueue |
| After external side effect, before commit | Side effect may repeat on replay because effect and commit are not one transaction | High | Downstream dedupe keyed by transition identity |
| Worker dies while lease held | Lease expires and another worker can claim execution | Low | Short lease window + conditional lease claim (OCC) |
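
The safeguard for the last crash point, a conditional lease claim, can be sketched with an in-memory store. `Lease`, `LeaseStore`, and `try_claim` are hypothetical names, and the version counter stands in for whatever conditional-write primitive the real state store offers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: Optional[str] = None
    expires_at_ms: int = 0
    version: int = 0


class LeaseStore:
    """In-memory stand-in for a store that supports conditional writes."""

    def __init__(self) -> None:
        self._leases: dict[str, Lease] = {}

    def read(self, execution_id: str) -> Lease:
        lease = self._leases.setdefault(execution_id, Lease())
        # Return a copy so callers cannot mutate stored state.
        return Lease(lease.owner, lease.expires_at_ms, lease.version)

    def try_claim(self, execution_id: str, worker: str, expected_version: int,
                  now_ms: int, lease_ms: int) -> bool:
        current = self._leases.setdefault(execution_id, Lease())
        if current.version != expected_version:
            return False  # optimistic check failed: someone else wrote first
        if current.owner is not None and current.expires_at_ms > now_ms:
            return False  # lease is still held and has not expired
        self._leases[execution_id] = Lease(worker, now_ms + lease_ms, current.version + 1)
        return True
```

A short `lease_ms` keeps the stall after a worker crash small, while the version check prevents two workers that read the same expired lease from both claiming it.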

Semantics summary:

  1. committed execution state transitions are exactly-once,
  2. operator invocation and dispatch are at-least-once,
  3. replay is deterministic for control-plane state, not for non-idempotent external systems.

Retry and Idempotency Defaults

```properties
pipeline.defaults.retry-limit=5
pipeline.defaults.retry-wait-ms=1000
pipeline.defaults.max-backoff=30000
pipeline.defaults.jitter=true
```

Use NonRetryableException to fail fast for non-transient failures.
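
To build intuition for how these defaults interact, here is a minimal sketch. It assumes that the wait doubles per retry, that max-backoff is a cap in milliseconds, that jitter draws uniformly between zero and the capped delay, and that retry-limit counts retries after the first attempt. None of these details is confirmed by this guide. The `NonRetryableException` class below is a local stand-in for the framework exception, and `backoff_ms` and `run_with_retries` are hypothetical helpers.

```python
import random
import time


class NonRetryableException(Exception):
    """Local stand-in for the framework exception of the same name."""


def backoff_ms(attempt, wait_ms=1000, max_backoff_ms=30000, jitter=True, rng=random):
    """Delay before retry number `attempt` (1-based), assuming doubling with a cap."""
    delay = min(wait_ms * 2 ** (attempt - 1), max_backoff_ms)
    return rng.uniform(0, delay) if jitter else delay


def run_with_retries(operation, retry_limit=5, sleep=lambda ms: time.sleep(ms / 1000)):
    """Call `operation`, retrying transient errors up to `retry_limit` times."""
    retries = 0
    while True:
        try:
            return operation()
        except NonRetryableException:
            raise  # fail fast: retrying a non-transient failure cannot help
        except Exception:
            if retries >= retry_limit:
                raise  # retry budget exhausted
            retries += 1
            sleep(backoff_ms(retries))
```

Under these assumptions and with jitter disabled, the first five retries wait 1000, 2000, 4000, 8000, and 16000 ms, so the 30000 ms cap would only take effect from a sixth retry onward.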

For at-least-once boundaries (queue delivery, operator invocation, re-drive), enforce idempotency with stable transition identity (executionId:stepIndex:attempt). Idempotency means retries can happen without duplicating the business effect.
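
A minimal sketch of such an idempotent boundary follows. `transition_identity` builds the key in the executionId:stepIndex:attempt form given above. `IdempotentEffect` is a hypothetical in-memory helper. A real deployment would need a durable store and an atomic check-and-record, which this sketch does not provide.

```python
def transition_identity(execution_id: str, step_index: int, attempt: int) -> str:
    """Build the stable key in the form executionId:stepIndex:attempt."""
    return f"{execution_id}:{step_index}:{attempt}"


class IdempotentEffect:
    """Apply a side effect at most once per transition identity (in-memory sketch)."""

    def __init__(self, effect):
        self._effect = effect
        self._results = {}

    def apply(self, key, payload):
        if key in self._results:
            # Duplicate delivery of the same transition: reuse the stored result.
            return self._results[key]
        result = self._effect(payload)
        self._results[key] = result
        return result
```

Because the key includes the attempt number, this guards against redelivery of the same attempt. A new attempt produces a new key and runs the effect again.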

Operations Runbook

  1. Classify incident scope first: checkpoint publication backlog, item reject trend, or execution DLQ growth.
  2. For checkpoint publication incidents, inspect the following before treating the incident as a downstream execution failure: publication lag; handoff latency; duplicate suppression (records intentionally skipped because a checkpoint handoff key was already seen); and delivery failure logs (publication log events emitted when downstream admission fails).
  3. Keep in mind that checkpoint publication rejects and downstream admission failures occur before downstream execution admission.
  4. Consequently, they are not execution DLQ events, and they do not use Item Reject Sink by default.
  5. For item reject incidents, check fingerprint concentration and dominant error classes; route to business-data remediation and selective re-drive.
  6. Treat item reject re-drive as application-owned: default reject envelopes are metadata-only, so replay payload reconstruction is not provided by framework runtime.
  7. For execution DLQ incidents, triage terminal execution causes (FAILED vs DLQ) and validate idempotency before replay.
  8. If due executions stall, verify sweeper health and dispatcher lag.
  9. Re-drive in bounded batches and monitor duplicate suppression plus retry saturation (retry attempts approaching the configured retry limit).
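
The bounded re-drive in the final step can be sketched as follows. `bounded_batches` and `redrive` are hypothetical helpers, the `replay` callback is assumed to return True when an entry is accepted, and the batch size and failure threshold are placeholder values, not recommendations.

```python
from itertools import islice


def bounded_batches(entries, batch_size):
    """Yield consecutive lists of at most `batch_size` entries."""
    iterator = iter(entries)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def redrive(entries, replay, batch_size=50, max_failure_ratio=0.2):
    """Replay batch by batch; stop after the first batch that fails too often.

    Returns (entries attempted, whether the whole re-drive completed).
    """
    attempted = 0
    for batch in bounded_batches(entries, batch_size):
        failures = sum(0 if replay(entry) else 1 for entry in batch)
        attempted += len(batch)
        if failures / len(batch) > max_failure_ratio:
            return attempted, False  # halt and investigate before continuing
    return attempted, True
```

Stopping on a failing batch keeps a bad replay from pushing the full backlog back into the execution DLQ while the underlying cause is still present.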