In-flight Probe
The in-flight probe is a guardrail that detects when a pipeline run is heading toward resource exhaustion. It behaves like a liveness probe: it monitors health signals and, when a failure mode is detected, it signals termination rather than allowing the run to continue unchecked.
The termination mechanism is separate: the probe triggers it, and the runtime enforces it by aborting the run.
These probes are intentionally off by default. They are safety controls, not general-purpose tuning knobs.
Retry Amplification Guard
This guard detects sustained in-flight growth across the entire pipeline. It is designed for expansion-heavy pipelines where upstream pace can exceed downstream capacity (for example, CSV readers feeding a slow third-party API).
What It Measures
The guard samples global inflight counts at a fixed interval (derived from the window), computes the slope over the window, and triggers if that slope exceeds a threshold for a configured number of consecutive samples.
In-flight items means the number of items currently being processed or buffered across all steps (summed globally). This is the same signal you see in the tpf.step.inflight metric (aggregated for the pipeline).
In practical terms: it looks for runaway inflight growth that persists, not momentary spikes.
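The sampling-and-slope mechanism described above can be sketched as follows. This is a minimal illustration, not TPF's actual implementation; the class name `InflightGuard`, its constructor parameters, and the `record` method are all hypothetical:

```python
from collections import deque

class InflightGuard:
    """Illustrative sketch of the retry-amplification guard's sampling logic."""

    def __init__(self, window_seconds=30.0, samples_per_window=6,
                 slope_threshold=1.0, sustain_samples=3):
        # Sampling interval derived from the window (callers sample at this cadence).
        self.interval = window_seconds / samples_per_window
        self.slope_threshold = slope_threshold
        self.sustain_samples = sustain_samples
        self.samples = deque(maxlen=samples_per_window)  # (timestamp, inflight) pairs
        self.consecutive = 0

    def record(self, timestamp, inflight):
        """Record one global inflight sample; return True if the guard should trigger."""
        self.samples.append((timestamp, inflight))
        if len(self.samples) < 2:
            return False
        # Slope over the window: least-squares fit of inflight vs. time, in items/sec.
        n = len(self.samples)
        ts = [t for t, _ in self.samples]
        ys = [y for _, y in self.samples]
        mt, my = sum(ts) / n, sum(ys) / n
        denom = sum((t - mt) ** 2 for t in ts)
        if denom == 0:
            return False
        slope = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / denom
        # Only sustained growth counts: require N consecutive over-threshold samples,
        # so a momentary spike resets the counter instead of tripping the guard.
        if slope > self.slope_threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0
        return self.consecutive >= self.sustain_samples
```

Feeding this sketch a steadily growing inflight count trips it once the slope has stayed over the threshold for the required number of consecutive samples; a flat or briefly spiking signal never does.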
When It Triggers
The guard triggers when:
inflight_slope > inflight_slope_threshold
for sustain-samples consecutive samples.
Where:
- inflight_slope is measured in items/sec
- sustain-samples is the number of consecutive samples required
- the evaluation window controls the smoothing period
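The trigger condition itself reduces to a small predicate. A sketch (the function name `should_trigger` is illustrative, not TPF's API):

```python
def should_trigger(slopes, threshold, sustain):
    """True if the last `sustain` slope samples all exceed `threshold`.

    `slopes` is the per-sample series of inflight slopes (items/sec).
    """
    return len(slopes) >= sustain and all(s > threshold for s in slopes[-sustain:])
```

For example, with a threshold of 1.0 and sustain of 3, the series [0.5, 1.2, 1.3, 1.4] triggers, while [1.2, 0.0, 1.3] does not, because the dip resets the consecutive run.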
Configuration
pipeline.kill-switch.retry-amplification.enabled=true
pipeline.kill-switch.retry-amplification.window=PT30S
pipeline.kill-switch.retry-amplification.inflight-slope-threshold=1.0
pipeline.kill-switch.retry-amplification.sustain-samples=3
pipeline.kill-switch.retry-amplification.mode=fail-fast
Telemetry
When triggered, the run span records:
- tpf.kill_switch.triggered=true
- tpf.kill_switch.reason=retry_amplification
- tpf.kill_switch.step=global
- tpf.kill_switch.inflight_slope
- tpf.kill_switch.inflight_slope_threshold
- tpf.kill_switch.sustain_samples
Metric:
tpf.pipeline.kill_switch.triggered increments
How It Compares to Kubernetes Liveness
The intent is similar to a Kubernetes liveness probe: the probe detects an unhealthy condition, and a separate termination mechanism restarts or aborts the workload. In TPF, the probe detects sustained in-flight growth and signals the runtime to abort the current run before resources are exhausted.
Tuning Guidance
Expansion steps (1→N) tend to create slow, steady inflight growth. Tune for those:
- Start with window=PT30S and sustain-samples=3
- Lower the slope threshold until it trips at the point you consider unhealthy (for example, if inflight grows by +1000 every 5 minutes, the slope is ~3.33 items/sec)
Fast bursts can create short spikes; the sustain-samples requirement is the primary protection against false positives.
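The threshold arithmetic in the guidance above can be checked directly. The numbers here are the hypothetical ones from the example, not measured values:

```python
# If inflight grows by +1000 items every 5 minutes, the observed slope is:
growth = 1000            # items added over the period
period = 5 * 60          # seconds in 5 minutes
slope = growth / period  # items/sec
print(round(slope, 2))   # → 3.33

# To trip at that point, set the threshold just below the unhealthy slope.
# 3.0 is an illustrative choice, not a recommended default:
threshold = 3.0
assert slope > threshold
```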
Demo-Friendly Settings
For a short demo that should trigger quickly when inflight growth starts:
pipeline.kill-switch.retry-amplification.window=PT30S
pipeline.kill-switch.retry-amplification.inflight-slope-threshold=1.0
pipeline.kill-switch.retry-amplification.sustain-samples=3
pipeline.kill-switch.retry-amplification.mode=fail-fast
Fail-Fast vs Log-Only
- fail-fast: throws a runtime exception and aborts the run
- log-only: records telemetry but continues execution
Use log-only in staging if you want to validate thresholds before enforcing them.
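The two modes can be sketched as a small dispatch. This is illustrative only; `KillSwitchTriggered` and `on_trigger` are stand-in names, not TPF's actual exception or API:

```python
import logging

log = logging.getLogger("tpf.kill_switch")

class KillSwitchTriggered(RuntimeError):
    """Stand-in for the runtime exception raised in fail-fast mode."""

def on_trigger(mode, slope, threshold):
    """Enforce the configured mode when the guard fires."""
    details = f"retry_amplification: slope={slope:.2f} > threshold={threshold:.2f}"
    if mode == "fail-fast":
        # Abort the run: the runtime catches this and terminates the pipeline.
        raise KillSwitchTriggered(details)
    # log-only: record the event and keep executing, so thresholds can be
    # validated in staging before they are enforced.
    log.warning("kill switch would trigger (%s); continuing in log-only mode", details)
```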