Error Handling and Recovery

This guide is for operations triage and recovery. It focuses on runtime failure channels and background execution recovery behaviour.

The step-level Item Reject Sink is intentionally a business processing path, not an execution failure channel. Developer implementation guidance lives in Item Reject Sink.

Failure Channels (Operational View)

| Channel | Scope | Trigger | Primary operational signal |
| --- | --- | --- | --- |
| Checkpoint Publication Backlog | pre-execution handoff | checkpoint publication cannot admit work into downstream orchestration quickly enough | publication lag/backlog and handoff latency |
| Item Reject Sink | individual items/streams | step-level recover-and-continue path | reject sink throughput/backlog trends |
| Execution DLQ | full async execution | terminal orchestration failure | execution DLQ backlog growth |

Triage rules:

An increase in item rejects with stable execution success usually indicates data quality or business-rule drift.
A growing execution DLQ indicates control-plane, dependency, or systemic execution failure.
A rising checkpoint publication backlog indicates throughput or admission pressure before downstream execution has started.
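
The triage rules above can be sketched as a small classifier. This is a minimal illustration, not framework API: `TriageSketch`, `Finding`, and the four boolean inputs are hypothetical names, and each flag stands for a trend an operator has already read off a dashboard.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical triage helper; all names here are illustrative assumptions. */
final class TriageSketch {

    enum Finding { ADMISSION_PRESSURE, SYSTEMIC_EXECUTION_FAILURE, DATA_QUALITY_OR_RULE_DRIFT }

    /**
     * Maps observed trends to the findings named in the triage rules.
     * More than one finding can apply to the same incident.
     */
    static List<Finding> classify(boolean rejectsRising,
                                  boolean executionSuccessStable,
                                  boolean dlqGrowing,
                                  boolean publicationBacklogRising) {
        List<Finding> findings = new ArrayList<>();
        if (publicationBacklogRising) {
            // Pressure before downstream execution has started.
            findings.add(Finding.ADMISSION_PRESSURE);
        }
        if (dlqGrowing) {
            // Control-plane, dependency, or systemic execution failure.
            findings.add(Finding.SYSTEMIC_EXECUTION_FAILURE);
        }
        if (rejectsRising && executionSuccessStable) {
            // Rejects rise while executions still succeed.
            findings.add(Finding.DATA_QUALITY_OR_RULE_DRIFT);
        }
        return findings;
    }
}
```

Returning a list rather than a single value reflects that the channels are independent: a dependency outage and a data-quality drift can coincide.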

Execution DLQ Configuration (Background Execution)

properties

pipeline.orchestrator.mode=QUEUE_ASYNC
pipeline.orchestrator.dlq-provider=sqs
pipeline.orchestrator.dlq-url=https://sqs.eu-west-1.amazonaws.com/123456789012/tpf-dlq

QUEUE_ASYNC is the mode name used by pipeline.orchestrator.mode for background execution.

Execution DLQ applies to terminal execution failures only. A DLQ is a dead-letter channel for failed executions that need investigation or replay. It does not replace item-level rejection flows.

Execution DLQ Envelope (Terminal Details)

Execution DLQ entries include standard fields for triage across REST, gRPC, local, and function-style execution:

execution fields: tenantId, executionId, executionKey, transitionKey
correlation/resource fields: correlationId, resourceType, resourceName
runtime identity fields: transport, platform, terminalStatus, createdAtEpochMs
failure fields: terminalReason, errorCode, errorMessage, retryable, retriesObserved

Terminal reason mapping:

| `terminalReason` | Meaning | First action |
| --- | --- | --- |
| `retry_exhausted` | retryable failure class reached terminal state after exhausting its retry budget; this includes zero-retry configurations (`maxRetries = 0`) | Stabilise dependency/path, then re-drive bounded batches |
| `non_retryable` | non-retryable failure class (for example `NonRetryableException`) | Correct payload/contract issue before replay |
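
The mapping above could be applied in tooling roughly as follows. This is a sketch under assumptions: `TerminalReasonSketch` and `FirstAction` are hypothetical names, and the fallback for unrecognised values is an illustrative choice rather than documented behaviour.

```java
import java.util.Locale;

/** Hypothetical helper that turns a DLQ entry's terminalReason into a first action. */
final class TerminalReasonSketch {

    enum FirstAction {
        STABILISE_THEN_REDRIVE_BOUNDED,   // retry_exhausted
        FIX_PAYLOAD_BEFORE_REPLAY,        // non_retryable
        INVESTIGATE_UNKNOWN               // assumption: anything else needs a human
    }

    static FirstAction firstAction(String terminalReason) {
        if (terminalReason == null) {
            return FirstAction.INVESTIGATE_UNKNOWN;
        }
        switch (terminalReason.trim().toLowerCase(Locale.ROOT)) {
            case "retry_exhausted":
                return FirstAction.STABILISE_THEN_REDRIVE_BOUNDED;
            case "non_retryable":
                return FirstAction.FIX_PAYLOAD_BEFORE_REPLAY;
            default:
                return FirstAction.INVESTIGATE_UNKNOWN;
        }
    }
}
```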

Background Execution Crash Matrix

| Crash point | Behaviour after restart/recovery | Duplicate risk | Required safeguard |
| --- | --- | --- | --- |
| Before transition state commit | Work is redelivered and re-executed from the last stored version | High | Idempotent operator boundary (`executionId:stepIndex:attempt`) |
| After state commit, before next enqueue | Transition is stored, but next dispatch can stall until due sweeper re-dispatches | Low | Due-execution sweeper + stored state |
| During retry scheduling | Retry can replay from the last stored version if scheduling details were not committed | Medium | Persist retry intent (`attempt`, `nextDue`) before enqueue |
| After external side effect, before commit | Side effect may repeat on replay because effect and commit are not one transaction | High | Downstream dedupe keyed by transition identity |
| Worker dies while lease held | Lease expires and another worker can claim execution | Low | Short lease window + conditional lease claim (OCC) |
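
The "conditional lease claim (OCC)" safeguard in the last row can be illustrated with a compare-and-swap on the stored lease. This is a minimal in-memory sketch: `LeaseSketch` and its fields are hypothetical, and the `ConcurrentHashMap` stands in for whatever state store the orchestrator actually uses, where the equivalent would be a conditional write on a version column or attribute.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical in-memory lease table; a real store would use a conditional write. */
final class LeaseSketch {

    /** Immutable lease row. A new instance is written on every successful claim. */
    static final class Lease {
        final String owner;
        final long expiresAtMs;
        final long version;

        Lease(String owner, long expiresAtMs, long version) {
            this.owner = owner;
            this.expiresAtMs = expiresAtMs;
            this.version = version;
        }
    }

    private final ConcurrentMap<String, Lease> store = new ConcurrentHashMap<>();

    /** Returns true only if this worker won the claim for the execution. */
    boolean tryClaim(String executionId, String worker, long nowMs, long leaseMs) {
        Lease current = store.get(executionId);
        if (current == null) {
            // First claim: succeeds only if no other worker inserted a lease meanwhile.
            return store.putIfAbsent(executionId, new Lease(worker, nowMs + leaseMs, 1L)) == null;
        }
        if (current.expiresAtMs > nowMs) {
            return false; // lease is still held by a live (or recently dead) worker
        }
        Lease next = new Lease(worker, nowMs + leaseMs, current.version + 1);
        // Conditional write: succeeds only if the stored lease is still the one we read.
        return store.replace(executionId, current, next);
    }
}
```

A shorter `leaseMs` reduces the stall after a worker dies, at the cost of more frequent renewals for healthy workers.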

Semantics summary:

committed execution state transitions are exactly-once,
operator invocation and dispatch are at-least-once,
replay is deterministic for control-plane state, not for non-idempotent external systems.

Retry and Idempotency Defaults

properties

pipeline.defaults.retry-limit=5
pipeline.defaults.retry-wait-ms=1000
pipeline.defaults.max-backoff=30000
pipeline.defaults.jitter=true

Use NonRetryableException to fail fast for non-transient failures.
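
To make the retry defaults above concrete, the sketch below computes a wait time per retry. It rests on assumptions the configuration does not confirm: that backoff doubles from `retry-wait-ms` on each retry, that `max-backoff` is in milliseconds, and that `jitter=true` means a uniformly random wait up to the capped value. `BackoffSketch` and its methods are hypothetical names.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical backoff calculator using the default values shown above. */
final class BackoffSketch {

    static final int RETRY_LIMIT = 5;          // pipeline.defaults.retry-limit
    static final long RETRY_WAIT_MS = 1000L;   // pipeline.defaults.retry-wait-ms
    static final long MAX_BACKOFF_MS = 30000L; // pipeline.defaults.max-backoff (unit assumed)

    /** Capped wait before retry number 'attempt' (1-based), assuming doubling. */
    static long cappedDelayMs(int attempt) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        int shift = Math.min(attempt - 1, 30); // bound the shift so the value cannot overflow
        return Math.min(RETRY_WAIT_MS << shift, MAX_BACKOFF_MS);
    }

    /** Jittered wait: uniform in [0, cappedDelayMs(attempt)] (assumed jitter strategy). */
    static long jitteredDelayMs(int attempt) {
        return ThreadLocalRandom.current().nextLong(cappedDelayMs(attempt) + 1);
    }

    /** True while the retry budget is not yet exhausted. */
    static boolean retryAllowed(int retriesObserved) {
        return retriesObserved < RETRY_LIMIT;
    }
}
```

Under these assumptions the un-jittered waits for the five retries are 1000, 2000, 4000, 8000, and 16000 ms, so the 30000 ms cap would only take effect if the retry limit were raised.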

For at-least-once boundaries (queue delivery, operator invocation, re-drive), enforce idempotency with stable transition identity (executionId:stepIndex:attempt). Idempotency means retries can happen without duplicating the business effect.
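
One way to enforce this at an operator boundary is to record each transition identity before applying its effect and to skip identities already seen. The sketch below is illustrative only: `IdempotencySketch` is a hypothetical class, and the in-memory set stands in for a durable dedupe store, which a real deployment would need so that suppression survives restarts.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical dedupe guard keyed by transition identity. */
final class IdempotencySketch {

    // Assumption: in-memory only; a production guard would persist these keys.
    private final Set<String> applied = ConcurrentHashMap.newKeySet();

    /** Builds the executionId:stepIndex:attempt key described in this guide. */
    static String transitionKey(String executionId, int stepIndex, int attempt) {
        return executionId + ":" + stepIndex + ":" + attempt;
    }

    /**
     * Runs the effect at most once per key.
     * Returns false when the call was suppressed as a duplicate.
     */
    boolean applyOnce(String key, Runnable effect) {
        if (!applied.add(key)) {
            return false; // same transition was already applied
        }
        try {
            effect.run();
            return true;
        } catch (RuntimeException e) {
            // Release the key so a later redelivery can try again.
            applied.remove(key);
            throw e;
        }
    }
}
```

Releasing the key on failure keeps the guard from masking a failed effect as an already-applied one; whether that is right depends on whether the effect can partially succeed.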

Operations Runbook

Classify incident scope first: item reject trend vs execution DLQ growth.
For checkpoint publication incidents, inspect publication lag, handoff latency, duplicate suppression (records intentionally skipped because a checkpoint handoff key was already seen), and delivery failure logs (publication log events emitted when downstream admission fails) before treating the incident as downstream execution failure.
Checkpoint publication rejects and downstream admission failures occur before downstream execution admission.
They are not execution DLQ events, and they do not use Item Reject Sink by default.
For item reject incidents, check fingerprint concentration and dominant error classes; route to business-data remediation and selective re-drive.
Treat item reject re-drive as application-owned: default reject envelopes are metadata-only, so replay payload reconstruction is not provided by framework runtime.
For execution DLQ incidents, triage terminal execution causes (FAILED vs DLQ) and validate idempotency before replay.
If due executions stall, verify sweeper health and dispatcher lag.
Re-drive in bounded batches and monitor duplicate suppression plus retry saturation (retry attempts approaching the configured retry limit).
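
Bounded re-drive can be as simple as splitting the DLQ entries into fixed-size batches and checking the signals above between batches. The helper below is a hypothetical sketch; `RedriveSketch`, the batch size, and the choice to copy each batch are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a re-drive backlog into bounded batches. */
final class RedriveSketch {

    /** Splits entries into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> boundedBatches(List<T> entries, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < entries.size(); start += batchSize) {
            int end = Math.min(start + batchSize, entries.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(entries.subList(start, end)));
        }
        return batches;
    }
}
```

An operator loop would submit one batch, wait for it to settle, and continue only if duplicate suppression and retry saturation stay within acceptable bounds.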
