In-flight Probe

The in-flight probe is a guardrail that detects when a pipeline run is heading toward resource exhaustion. It behaves like a liveness probe: it monitors health signals and, when a failure mode is detected, it signals termination rather than allowing the run to continue unchecked.

The termination mechanism is separate: the probe triggers it, and the runtime enforces it by aborting the run.

These probes are intentionally off by default. They are safety controls, not general-purpose tuning knobs.

Retry Amplification Guard

This guard detects sustained in-flight growth across the entire pipeline. It is designed for expansion-heavy pipelines where the upstream production rate can exceed downstream capacity (for example, CSV readers feeding a slow third-party API).

What It Measures

The guard samples the global in-flight count at a fixed interval (derived from the window), computes the slope over the window, and triggers if that slope exceeds a threshold for a configured number of consecutive samples.

"In-flight items" means the number of items currently being processed or buffered, summed across all steps. This is the same signal reported by the tpf.step.inflight metric, aggregated for the pipeline.

In practical terms, it looks for runaway in-flight growth that persists, not momentary spikes.
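The measurement described above can be sketched as follows. This is a minimal illustration, not the TPF implementation: the class name InflightSlopeWindow is hypothetical, and the slope is assumed to be the simple difference between the oldest and newest sample in the window divided by the elapsed time. The actual runtime may use a different estimator.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class InflightSlopeWindow:
    """Hypothetical helper: keeps (timestamp, inflight) samples for one window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.samples: Deque[Tuple[float, int]] = deque()

    def add(self, timestamp: float, inflight: int) -> None:
        """Record a sample and drop any sample older than the window."""
        self.samples.append((timestamp, inflight))
        while timestamp - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()

    def slope(self) -> Optional[float]:
        """Items/sec between the oldest and newest sample, or None if undefined."""
        if len(self.samples) < 2:
            return None
        (t0, n0), (t1, n1) = self.samples[0], self.samples[-1]
        if t1 == t0:
            return None
        return (n1 - n0) / (t1 - t0)


# Assumed sampling: one sample every 10 seconds over a 30-second window.
window = InflightSlopeWindow(window_seconds=30.0)
for t, inflight in [(0.0, 100), (10.0, 130), (20.0, 160), (30.0, 190)]:
    window.add(t, inflight)

print(window.slope())  # 3.0 items/sec
```

Here the in-flight count grows by 90 items over 30 seconds, so the sketch reports a slope of 3.0 items/sec.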

When It Triggers

The guard triggers when:

text

inflight_slope > inflight_slope_threshold
for sustain-samples consecutive samples

Where:

inflight_slope is measured in items/sec
sustain-samples is the number of consecutive samples required
the evaluation window controls the smoothing period
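The trigger condition above amounts to a consecutive-sample counter. The sketch below assumes that any sample at or below the threshold resets the count; the class name SustainedSlopeTrigger is hypothetical and is not part of TPF.

```python
class SustainedSlopeTrigger:
    """Hypothetical sketch: fires after N consecutive samples above the threshold."""

    def __init__(self, slope_threshold: float, sustain_samples: int) -> None:
        self.slope_threshold = slope_threshold
        self.sustain_samples = sustain_samples
        self.consecutive = 0

    def observe(self, slope: float) -> bool:
        """Feed one slope sample (items/sec); return True if the guard should trigger."""
        if slope > self.slope_threshold:
            self.consecutive += 1
        else:
            # Assumption: one sample at or below the threshold resets the streak.
            self.consecutive = 0
        return self.consecutive >= self.sustain_samples


trigger = SustainedSlopeTrigger(slope_threshold=1.0, sustain_samples=3)
results = [trigger.observe(s) for s in [2.0, 0.5, 1.5, 1.2, 3.0]]
print(results)  # [False, False, False, False, True]
```

The first sample exceeds the threshold but is followed by a dip, so the streak restarts; only the last three samples in a row trip the guard.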

Configuration

properties

pipeline.kill-switch.retry-amplification.enabled=true
pipeline.kill-switch.retry-amplification.window=PT30S
pipeline.kill-switch.retry-amplification.inflight-slope-threshold=1.0
pipeline.kill-switch.retry-amplification.sustain-samples=3
pipeline.kill-switch.retry-amplification.mode=fail-fast

Telemetry

When triggered, the run span records:

tpf.kill_switch.triggered=true
tpf.kill_switch.reason=retry_amplification
tpf.kill_switch.step=global
tpf.kill_switch.inflight_slope
tpf.kill_switch.inflight_slope_threshold
tpf.kill_switch.sustain_samples

Metric:

tpf.pipeline.kill_switch.triggered is incremented

How it compares to Kubernetes liveness

The intent is similar to a Kubernetes liveness probe: the probe detects an unhealthy condition, and a separate termination mechanism restarts or aborts the workload. In TPF, the probe detects sustained in-flight growth and signals the runtime to abort the current run before resources are exhausted.

Tuning Guidance

Expansion steps (1→N) tend to create slow, steady in-flight growth. Tune for those:

Start with window=PT30S and sustain-samples=3
Lower the slope threshold until it trips at the point you consider unhealthy (for example, if the in-flight count grows by 1000 items every 5 minutes, the slope is 1000 / 300 s ≈ 3.33 items/sec)

Fast bursts can create short spikes; the sustain-samples requirement is the primary protection against false positives.
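To turn an observed growth pattern into a candidate threshold, the arithmetic from the example above can be written out. The function name is hypothetical and the candidate thresholds are illustrative values only.

```python
def slope_items_per_sec(inflight_growth: int, period_seconds: float) -> float:
    """Average in-flight growth rate over a period, in items/sec."""
    return inflight_growth / period_seconds


# Growth of 1000 items every 5 minutes, as in the example above.
observed = slope_items_per_sec(1000, 5 * 60)
print(round(observed, 2))  # 3.33

# A threshold trips only if the observed slope is strictly greater than it.
for threshold in (1.0, 3.0, 5.0):
    print(threshold, observed > threshold)
```

With this growth rate, thresholds of 1.0 and 3.0 would be exceeded, while 5.0 would not.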

Demo-Friendly Settings

For a short demo that should trigger quickly when in-flight growth starts (with the guard enabled as shown under Configuration):

properties

pipeline.kill-switch.retry-amplification.window=PT30S
pipeline.kill-switch.retry-amplification.inflight-slope-threshold=1.0
pipeline.kill-switch.retry-amplification.sustain-samples=3
pipeline.kill-switch.retry-amplification.mode=fail-fast

Fail-Fast vs Log-Only

fail-fast: throws a runtime exception and aborts the run
log-only: records telemetry but continues execution

Use log-only in staging if you want to validate thresholds before enforcing them.
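The difference between the two modes can be sketched as below. The exception type KillSwitchTriggered and the function handle_trigger are hypothetical names used only for illustration; the attribute keys are the ones listed under Telemetry.

```python
import logging

logger = logging.getLogger("kill_switch")


class KillSwitchTriggered(RuntimeError):
    """Hypothetical exception standing in for the runtime exception that aborts a run."""


def handle_trigger(mode: str, attributes: dict) -> None:
    """Record the trigger, then abort only when the mode is fail-fast."""
    # Both modes record telemetry; logging stands in for span attributes here.
    logger.warning("kill switch triggered: %s", attributes)
    if mode == "fail-fast":
        raise KillSwitchTriggered(attributes["tpf.kill_switch.reason"])
    if mode != "log-only":
        raise ValueError(f"unknown mode: {mode}")


attrs = {
    "tpf.kill_switch.triggered": True,
    "tpf.kill_switch.reason": "retry_amplification",
    "tpf.kill_switch.step": "global",
}

handle_trigger("log-only", attrs)  # logs a warning and returns; execution continues

try:
    handle_trigger("fail-fast", attrs)
except KillSwitchTriggered as exc:
    print(f"run aborted: {exc}")  # run aborted: retry_amplification
```

Running log-only first in staging shows how often the guard would have fired without interrupting any run.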
