Roadmap: Checkpoint Pipelines vs FTGO (Pessimist's Notebook)
This guide captures the ongoing architectural exploration of checkpoint-style pipelines as an alternative to FTGO's saga-first model. It is written for the engineer who just entered the meeting room: quick context, core principles, and the risks we are explicitly tracking. The goal is to be intentionally pessimistic: list what can go wrong, what we believe is already covered, and what still needs design work.
TL;DR (why this exists)
- We want a better-than-FTGO architecture that avoids rollbacks and avoids hiding domain decisions in ops.
- We treat a pipeline as a checkpoint: when it finishes, the state is valid and stable.
- We prefer sync, reactive piping between pipelines (backpressure preserved) over async fan-out.
- We treat failures as operational errors; business unhappy paths are modeled as explicit pipelines.
- We assume a single tech stack and a single leadership/business, so "autonomy for its own sake" is not a goal.
Closure Scope Boundary (v2.1)
- TPFGo closure is HA-decoupled for this cycle.
- Merge-blocking scope is SYNC-path business/workflow correctness plus transport-contract parity.
- Queue/HA delivery (QUEUE_ASYNC, durable providers) is a separate epic and tracked as a compatibility dependency.
- Protobuf-over-HTTP parity is enforced as canonical metadata/semantic parity in this cycle (not streaming-shape parity).
Working Model (current stance)
- Pipeline = checkpoint: a pipeline produces a stable, valid state. No status fields, no updates, no rollbacks.
- Steps own persistence: 1 step = 1 table/entity type.
- Success vs failure only: success flows data; failures are operational and go to an error sink.
- Business unhappy paths: modeled as explicit pipelines, not as exceptions.
- Sync piping: pipelines can be chained with reactive (non-blocking) request/response.
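The "pipeline = checkpoint" and chaining stance above can be sketched with a small step abstraction. This is a minimal illustration under stated assumptions, not the TPF API: `Step`, `then`, and the `Draft`/`Validated`/`Published` types are hypothetical names, and plain synchronous calls stand in for the reactive request/response the runtime would use.

```java
/** Hypothetical step abstraction: one input type in, one new immutable type out. */
interface Step<I, O> {
    O apply(I input);

    /** Chains this step with the next one; the result is itself a Step (a composite step). */
    default <R> Step<I, R> then(Step<? super O, ? extends R> next) {
        return input -> next.apply(this.apply(input));
    }
}

final class StepCompositionSketch {
    // Illustrative immutable types: each step emits a new type instead of updating a status field.
    record Draft(String text) {}
    record Validated(String text) {}
    record Published(String text, int length) {}

    static final Step<Draft, Validated> VALIDATE = d -> {
        if (d.text().isBlank()) {
            // Operational failure: surfaces as an exception, not as a domain status.
            throw new IllegalArgumentException("empty draft");
        }
        return new Validated(d.text().trim());
    };

    static final Step<Validated, Published> PUBLISH =
        v -> new Published(v.text(), v.text().length());

    // A pipeline is a composite step with clear input and output types.
    static final Step<Draft, Published> PIPELINE = VALIDATE.then(PUBLISH);

    public static void main(String[] args) {
        System.out.println(PIPELINE.apply(new Draft("  hello  ")));
    }
}
```

Because `then` returns another `Step`, two pipelines chain the same way two steps do, which is the property the working model relies on.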
Architectural Principles (expanded)
- Shipability over modularity: The primary boundary is how we ship/release. The orchestrator and step runtimes exist to make deployment safe and observable. Code modularity is useful, but secondary.
- Immutable checkpoints, no in-place updates: Status fields imply updates and mutable state. We avoid them. Each step persists a new, immutable type and hands it forward.
- Pipelines are composite steps: A pipeline is a higher-order step with clear input and output types. This lets us chain pipelines as a workflow without pretending they are a monolith.
- Operational failures are not domain logic: A failure (exception) is operational. Domain unhappy paths are modeled as explicit pipelines.
- Backpressure is a first-class contract: If pipelines are chained, demand must flow end-to-end. We avoid unbounded buffering and treat backpressure as part of the contract.
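One way to picture "no unbounded buffering" is a handoff buffer whose capacity is explicit and whose overflow is a visible signal rather than silent growth. The sketch below is an assumption for illustration only: `BoundedHandoffSketch` is a hypothetical class, and it models admission with a plain bounded queue rather than the demand signaling (`request(n)`) a reactive runtime would use.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical bounded handoff buffer: capacity is explicit and overflow is observable. */
final class BoundedHandoffSketch<T> {
    private final BlockingQueue<T> buffer;
    private long rejected; // number of items refused because the buffer was full

    BoundedHandoffSketch(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false instead of buffering without bound; the caller must slow down or fail. */
    synchronized boolean offer(T item) {
        boolean accepted = buffer.offer(item);
        if (!accepted) {
            rejected++;
        }
        return accepted;
    }

    /** Downstream pulls one item when it has capacity; returns null if nothing is waiting. */
    T poll() {
        return buffer.poll();
    }

    synchronized long rejectedCount() {
        return rejected;
    }

    public static void main(String[] args) {
        BoundedHandoffSketch<String> handoff = new BoundedHandoffSketch<>(2);
        handoff.offer("a");
        handoff.offer("b");
        // The third offer is refused because capacity is 2 and nothing has been polled yet.
        System.out.println("accepted third item: " + handoff.offer("c"));
        System.out.println("rejected so far: " + handoff.rejectedCount());
    }
}
```

The rejection counter is the kind of failure signature that can be exported as a metric, so pressure shows up in operations instead of as memory growth.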
Visuals (how it works)
1) Checkpoint pipeline (single pipeline, immutable steps)
2) Pipeline-to-pipeline piping (sync, reactive)
3) Failure handling separation
4) Workflow fan-out (pipeline as workflow)
5) Observers vs mid-step taps
6) Lifecycle evolution (early vs mature)
Example (CreateOrder, checkpoint model)
Input DTO: OrderRequest
Output DTO: ReadyOrder
Steps (each step persists its own type):
- OrderRequestProcess -> OrderRequest + LineItem
- OrderCreate -> InitialOrder
- OrderReady -> ReadyOrder
Outcome: if the pipeline completes, ReadyOrder is valid and stable. No rollback. If a failure occurs, it goes to the error sink; optional ops remediation pipelines can be attached for automation.
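The CreateOrder example can be sketched as three functions over immutable records. The type names come from the example above; the field names, the placeholder order-id scheme, and the step bodies are assumptions, and persistence is only indicated in comments.

```java
import java.util.List;

/** Illustrative only: fields and step bodies are assumptions, not the project's code. */
final class CreateOrderSketch {
    record LineItem(String sku, int quantity) {}
    record OrderRequest(String customerId, List<LineItem> items) {}
    record InitialOrder(String orderId, OrderRequest request) {}
    record ReadyOrder(String orderId, int totalQuantity) {}

    // Step 1 (OrderRequestProcess): would persist OrderRequest and its LineItems.
    static OrderRequest processRequest(OrderRequest raw) {
        // Defensive copy so the checkpointed value cannot be mutated by the caller later.
        return new OrderRequest(raw.customerId(), List.copyOf(raw.items()));
    }

    // Step 2 (OrderCreate): would persist InitialOrder as a new row, never an update.
    static InitialOrder createOrder(OrderRequest request) {
        String orderId = "order-" + request.customerId(); // placeholder id scheme
        return new InitialOrder(orderId, request);
    }

    // Step 3 (OrderReady): would persist ReadyOrder, the pipeline's checkpoint output.
    static ReadyOrder markReady(InitialOrder order) {
        int total = order.request().items().stream().mapToInt(LineItem::quantity).sum();
        return new ReadyOrder(order.orderId(), total);
    }

    /** The pipeline: if this returns, ReadyOrder is the stable checkpoint. */
    static ReadyOrder run(OrderRequest input) {
        return markReady(createOrder(processRequest(input)));
    }

    public static void main(String[] args) {
        OrderRequest request = new OrderRequest("c1",
            List.of(new LineItem("pizza", 2), new LineItem("cola", 1)));
        System.out.println(run(request));
    }
}
```

No type carries a status field: progress is expressed by which type exists, which is what makes rollback unnecessary in this model.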
Implemented Milestones (History)
This roadmap started before the recent implementation stack. The following FTGo-related milestones are already implemented:
- 2026-02-12 to 2026-02-13 (early groundwork):
  - TraceEnvelope streaming lineage tests landed in runtime.
  - Search multi-e2e and cardinality stabilization work was merged.
  - TPFGo epic-next and search trace-envelope lineage branches were merged.
- 2026-03-02 (runtime core milestones):
  - Split/merge lineage stabilization and deterministic-id hardening landed in runtime adapters.
  - FUNCTION mode parity across streaming handler shapes was implemented (FUNCTION/LAMBDA path).
- 2026-03-03 (reference lane depth milestone):
  - Search fan-out/fan-in reference lane was expanded with richer aggregation depth and additional reliability/lineage assertions.
- Current state:
  - Canonical checkout contracts now cover the full 01->08 chain with strict handoff validation (including negative mismatch fixtures).
  - SYNC full-chain canonical execution proof is implemented via deterministic thin services and merge/replay lineage assertions.
  - TPFGo SYNC gate workflow runs framework verify + canonical checkout E2E + parity tests + docs build.
TPFGo Closure Board (Epic Termination Criteria)
This board is the authoritative closure tracker for the FTGo epic. No new parallel track should be opened while any merge-blocking item below remains open.
Status legend: TODO, IN_PROGRESS, DONE
Merge-blocking items
- Connector idempotency and dedup policy closure
  Status: DONE
  Done means:
  - Default duplicate-handling policy is documented and versioned.
  - Retry key derivation contract is explicit and deterministic.
  - Unit/integration tests cover duplicate/replay scenarios at connector boundaries.
- Connector backpressure and buffering policy closure
  Status: DONE
  Done means:
  - End-to-end demand signaling model is documented.
  - Buffer capacity/overflow behavior is explicit and tested.
  - Failure signatures under pressure are observable and documented.
- Cross-pipeline handoff contract build-time enforcement
  Status: DONE
  Done means:
  - Output-to-input contract checks fail fast at build time for incompatible handoffs.
  - Diagnostics include pipeline/step context and expected vs actual contract details.
  - Coverage includes version drift and mapper/payload mismatch cases.
- Canonical full FTGo flow implemented end-to-end
  Status: DONE
  Done means:
  - Full checkout -> validation -> restaurant acceptance -> preparation -> dispatch -> delivery -> payment flow exists in TPF.
  - Explicit failure/compensation pipelines are implemented for terminal failure checkpoints.
  - End-to-end test validates business correctness and deterministic lineage continuity.
- Transport/platform parity gate (REST, gRPC, FUNCTION + Protobuf-over-HTTP semantic row)
  Status: DONE
  Done means:
  - Equivalent supported-shape behavior is validated across all required paths.
  - Unsupported shapes fail with explicit, consistent diagnostics.
  - Parity matrix tests are green and required in CI, including the semantic parity row for Protobuf-over-HTTP.
- Partial-progress and recovery behavior closure
  Status: DONE
  Done means:
  - Partial-progress scenarios are explicitly classified and tested.
  - Replay/remediation workflow is codified and verifiable.
  - Parking/retry exhaustion behavior is operationally diagnosable.
- Docs/runbooks aligned to shipped behavior
  Status: DONE
  Done means:
  - Build/development/operations/evolve docs agree on current capabilities.
  - Troubleshooting guidance maps concrete failure signatures to actions.
  - No planned-but-unimplemented capability is documented as available.
- Single merge-blocking CI gate for epic acceptance
  Status: DONE
  Done means:
  - CI includes framework verify + canonical FTGo E2E + parity matrix.
  - CI gate is wired through the "CI — TPFGo SYNC Gate" workflow.
  - Gate is required for merges affecting FTGo-critical modules.
  - Failure output is actionable for on-call and contributors.
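The handoff-contract item on this board (fail fast on incompatible output-to-input types, with pipeline/step context in the diagnostic) can be illustrated with a small check. This is a sketch under assumptions: `Handoff`, `check`, and `requireCompatible` are hypothetical names, the real build-time checker is not shown in this document, and mapper compatibility and version drift are out of scope here.

```java
import java.util.Optional;

/** Hypothetical handoff declaration and check; not the project's build-time checker. */
final class HandoffContractSketch {
    record Handoff(String fromPipeline, String fromStep, Class<?> outputType,
                   String toPipeline, Class<?> expectedInputType) {}

    /** Returns a diagnostic when the published output cannot be used as the subscriber input. */
    static Optional<String> check(Handoff h) {
        if (h.expectedInputType().isAssignableFrom(h.outputType())) {
            return Optional.empty();
        }
        return Optional.of(String.format(
            "Incompatible handoff %s/%s -> %s: expected %s but found %s",
            h.fromPipeline(), h.fromStep(), h.toPipeline(),
            h.expectedInputType().getName(), h.outputType().getName()));
    }

    /** Fail-fast entry point, for example called from a build plugin. */
    static void requireCompatible(Handoff h) {
        check(h).ifPresent(message -> {
            throw new IllegalStateException(message);
        });
    }

    public static void main(String[] args) {
        Handoff bad = new Handoff("create-order", "OrderReady", String.class,
                                  "deliver-order", Number.class);
        // Prints the diagnostic with pipeline/step context and expected vs actual types.
        System.out.println(check(bad).orElse("compatible"));
    }
}
```

Returning the diagnostic as a value keeps the check testable on its own, while `requireCompatible` gives the build a single place to fail.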
Status Snapshot (2026-03-10, branch codex/tpfgo-final-pr2-gates-docs)
Evidence used for the DONE statuses above:
- ./mvnw -f framework/pom.xml verify
- ./mvnw -f examples/checkout/pom.xml -pl common,create-order-orchestrator-svc,deliver-order-orchestrator-svc -am -Dtest=CanonicalFtgoSyncFlowTest,CreateToDeliverBridgeE2ETest,CreateToDeliverGrpcBridgeE2ETest,DeliverForwardBridgeE2ETest -Dsurefire.failIfNoSpecifiedTests=false test
- ./scripts/ci/bootstrap-local-repo-prereqs.sh framework
- ./mvnw -f framework/pom.xml -pl deployment -Dtest=OrchestratorRestResourceRendererTest,OrchestratorGrpcRendererTest,OrchestratorFunctionHandlerRendererTest test
- ./mvnw -f framework/pom.xml -pl runtime -Dtest=FunctionTransportBridgeTest,UnaryFunctionTransportBridgeTest,FunctionTransportAdaptersTest,HttpRemoteFunctionInvokeAdapterTest,RestExceptionMapperTest,ProtobufHttpStatusMapperTest,ObserverTapContractValidatorTest,CheckoutCanonicalFlowContractTest,CheckoutCanonicalFlowContractNegativeTest test
- npm --prefix docs ci && npm --prefix docs run build
Execution rule
- Prioritize closure of the eight merge-blocking items above.
- Treat additional exploratory work as non-blocking unless it directly closes one of these items.
- Epic is considered complete only when all eight items are DONE.
Pain-Point Matrix
Status legend: RESOLVED, DECIDED, PROPOSED, PARTIAL, OPEN
1) Checkpoint invariants
   - Problem: What makes a pipeline output "valid"?
   - Stance: The pipeline process itself guarantees invariants; no status fields, no validators needed.
   - Status: RESOLVED
2) Failure classification
   - Problem: Distinguish business unhappy paths vs operational failures.
   - Stance: TPF uses a single failure channel; exceptions are operational and go to the error sink. Business unhappy paths are separate pipelines.
   - Status: DECIDED
3) Partial progress across pipelines
   - Problem: Pipeline A completes, Pipeline B fails.
   - Stance: Treated as an ops failure; A's checkpoint remains valid. Optional ops pipelines may handle remediation. Blocking for cross-pipeline sync composition until an ops remediation pattern is defined.
   - Action: define an ops remediation pipeline pattern for partial-progress recovery.
   - Status: OPEN
4) Idempotency / duplicate handoff
   - Problem: Checkpoint publication retries can duplicate downstream processing.
   - Stance: Orchestrator-owned checkpoint publication preserves incoming idempotency metadata when present and otherwise derives a deterministic handoff key from declared checkpoint key fields. The public model does not expose publication-local dedupe modes.
   - Status: RESOLVED
5) Traceability / lineage
   - Problem: Track the lineage of items through steps and pipelines.
   - Stance: "Russian doll" tracing is implemented in current supported runtime paths via TraceEnvelope, including deterministic split/merge lineage behavior and parity-oriented test coverage.
   - Status: RESOLVED
6) Type compatibility between pipelines
   - Problem: Pipeline B should not depend on Pipeline A internals.
   - Stance: Checkpoint publication/subscription declarations validate published output type, subscriber ingress type, and mapper compatibility at build time. Cross-pipeline handoff remains explicit through checkpoint boundary contracts instead of hidden pipeline internals.
   - Status: RESOLVED
7) Backpressure across pipelines
   - Problem: Piping should preserve backpressure end-to-end.
   - Stance: Reliable checkpoint publication inherits orchestrator queue-async admission behavior instead of exposing publication-local backpressure policies. Live subscribe/tap remains a weaker observation surface and does not define reliable handoff semantics.
   - Status: RESOLVED
8) Branching outputs (multi-out steps)
   - Problem: A step may need to emit different output types based on business decisions.
   - Stance: Fan-out/fan-in behavior is implemented in runtime paths and reference lanes, but formal branch policy (primary vs aux, required vs optional) remains design work.
   - Status: PARTIAL
9) Observers and mid-step taps
   - Problem: Optional features (e.g., marketing) may want to observe outputs that are not stable checkpoints.
   - Stance: Distinguish checkpoint observers (stable) from mid-step taps (weak guarantees); allow explicit opt-in.
   - Status: PROPOSED
10) Decision points as checkpoints
   - Problem: Adding a new decision step can introduce a new step type or complex branching inside a pipeline.
   - Stance: Prefer ending the pipeline at a decision and spawning one pipeline per outcome. Over time, checkpoints should remain relatively stable even as steps grow.
   - Status: PROPOSED
11) Remote subscription trigger
   - Problem: Pipeline-to-pipeline chaining currently relies on external triggers (CLI/HTTP).
   - Stance: Add a streaming trigger to the orchestrator (subscribe/ingest) with backpressure and buffering.
   - Status: PROPOSED
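The idempotency stance in the matrix (keep the incoming key when present, otherwise derive a deterministic key from declared checkpoint key fields) could look roughly like the sketch below. The class name, method signature, and the choice of SHA-256 with a zero-byte separator are assumptions for illustration; the document does not specify the actual derivation.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Optional;

/** Hypothetical handoff-key derivation; the real contract is defined by the runtime. */
final class HandoffKeySketch {
    static String handoffKey(Optional<String> incomingKey,
                             String checkpointType,
                             List<String> keyFieldValues) {
        if (incomingKey.isPresent()) {
            return incomingKey.get(); // preserve upstream idempotency metadata as-is
        }
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(checkpointType.getBytes(StandardCharsets.UTF_8));
            for (String value : keyFieldValues) {
                // Separator byte so ["ab", "c"] and ["a", "bc"] hash differently.
                digest.update((byte) 0);
                digest.update(value.getBytes(StandardCharsets.UTF_8));
            }
            // 64 lowercase hex characters, left-padded with zeros.
            return String.format("%064x", new BigInteger(1, digest.digest()));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    public static void main(String[] args) {
        String key = handoffKey(Optional.empty(), "ReadyOrder", List.of("order-1", "c1"));
        System.out.println(key);
    }
}
```

Because the key depends only on the checkpoint type and its declared key fields, a retried publication of the same checkpoint yields the same key, which lets the receiver drop the duplicate.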
Additional Risks (forward-looking)
- Cross-pipeline atomicity illusion: sync chaining can look atomic while still being partial. (See Pain-point matrix #3)
- Schema drift: handoff DTO versioning can break compatibility without strict rules. (See Pain-point matrix #6; Addressed in Near-Term Design Work: Build-Time Checks)
- Temporal coupling: downstream slowness collapses upstream throughput. (See Pain-point matrix #7)
- Hotspot steps: a single heavy step can dominate latency and throughput.
- Backpressure deadlocks: mismatched demand signaling can stall a chain. (See Pain-point matrix #7)
- Implicit retries: checkpoint publication retries can trigger duplicate side effects. (See Pain-point matrix #4)
- Observability blind spots: reference-based tracing needs reliable lookup. (See Pain-point matrix #5; Addressed in Near-Term Design Work: TraceEnvelope)
- Fan-out/fan-in complexity: ordering and timeout handling become tricky. (See Pain-point matrix #8)
- Distributed time assumptions: ordering based on timestamps becomes ambiguous.
- Policy leakage into ops: domain obligations can get pushed into SLOs if not modeled. (See Pain-point matrix #2)
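For the observability risk, it may help to see what inline ("Russian doll") lineage looks like compared with reference-based lookup. The record below is a hypothetical stand-in, not the runtime's TraceEnvelope: each step wraps the previous envelope, so the full path can be read without a lookup store. A reference-based variant would hold only a parent id and depend on the lookup being reliable.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical nested trace record; not the runtime's TraceEnvelope. */
final class LineageSketch {
    record Envelope(String step, String itemId, Envelope parent) {
        /** Wraps this envelope in a new one for the next step. */
        Envelope wrap(String nextStep, String nextItemId) {
            return new Envelope(nextStep, nextItemId, this);
        }

        /** Walks from this envelope back to the origin and returns the path origin-first. */
        List<String> lineage() {
            List<String> path = new ArrayList<>();
            for (Envelope e = this; e != null; e = e.parent()) {
                path.add(0, e.step() + ":" + e.itemId());
            }
            return path;
        }
    }

    public static void main(String[] args) {
        Envelope ready = new Envelope("OrderRequestProcess", "req-1", null)
            .wrap("OrderCreate", "order-1")
            .wrap("OrderReady", "ready-1");
        System.out.println(ready.lineage());
    }
}
```

The trade-off is payload size: inline nesting grows with every step, which is one reason the inline-vs-reference question is still open.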
Near-Term Design Work
- Error Sink: define a runtime error sink interface with a default StdErrSink and optional gRPC/REST sink service.
- Checkpoint Publication Contract: define topic publication/subscription rules, queue-async-only support, and orchestrator-owned handoff behavior. (Pain-point #3, #4, #7)
- Build-Time Checks: existing operator/mapping compatibility checks are extended to explicit checkpoint publication/subscription contracts.
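As a starting point for the Error Sink item, the interface could be as small as the sketch below. Only the names `ErrorSink` and `StdErrSink` are suggested by the list above; `StepFailure`, its fields, and the `runStep` helper are assumptions, and the optional gRPC/REST sink is not shown.

```java
import java.util.Optional;
import java.util.function.Function;

/** Sketch of an error sink contract; shapes and field names are assumptions. */
final class ErrorSinkSketch {
    /** What the sink receives: where the failure happened, on which input, and why. */
    record StepFailure(String pipeline, String step, Object input, Throwable cause) {}

    interface ErrorSink {
        void accept(StepFailure failure);
    }

    /** Default sink: writes one line per failure to standard error. */
    static final class StdErrSink implements ErrorSink {
        @Override
        public void accept(StepFailure f) {
            System.err.printf("[%s/%s] %s (input=%s)%n",
                f.pipeline(), f.step(), f.cause(), f.input());
        }
    }

    /** Runs a step body; on exception, routes the failure to the sink and emits nothing. */
    static <I, O> Optional<O> runStep(String pipeline, String step, I input,
                                      Function<I, O> body, ErrorSink sink) {
        try {
            return Optional.ofNullable(body.apply(input));
        } catch (RuntimeException e) {
            sink.accept(new StepFailure(pipeline, step, input, e));
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<Integer> result = ErrorSinkSketch.<String, Integer>runStep(
            "create-order", "parse-quantity", "not-a-number",
            s -> Integer.parseInt(s), new StdErrSink());
        System.out.println("emitted: " + result.isPresent());
    }
}
```

Keeping the sink a single-method interface means a test can pass a collecting lambda, and a remote sink can be added later without changing step code.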
Open Questions
- How should the tracing store be configured (inline vs reference)?
- Should the pipeline definition expose a formal "checkpoint contract"?
- How should multi-out decisions be modeled (discriminated envelopes vs explicit pipelines)?
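To make the last question concrete, here is what the "discriminated envelope" option might look like with a sealed type, next to the routing that would hand each outcome to its own pipeline. The outcome types, their fields, and the pipeline names are hypothetical; restaurant acceptance is used only because it appears in the canonical flow.

```java
/** Sketch of a discriminated envelope for a decision point; all names are hypothetical. */
final class DecisionOutcomeSketch {
    /** The set of outcomes is closed and visible in the type. */
    sealed interface RestaurantDecision permits Accepted, Rejected {}

    record Accepted(String orderId, int prepMinutes) implements RestaurantDecision {}

    record Rejected(String orderId, String reason) implements RestaurantDecision {}

    /** Picks the follow-up pipeline per outcome; one explicit pipeline per branch. */
    static String nextPipeline(RestaurantDecision decision) {
        if (decision instanceof Accepted a) {
            return "prepare-order:" + a.orderId();
        }
        if (decision instanceof Rejected r) {
            return "refund-order:" + r.orderId();
        }
        // Not reachable while the sealed hierarchy has only the two cases above.
        throw new IllegalStateException("unknown decision: " + decision);
    }

    public static void main(String[] args) {
        System.out.println(nextPipeline(new Accepted("o1", 20)));
        System.out.println(nextPipeline(new Rejected("o2", "kitchen closed")));
    }
}
```

The two options are not exclusive: the envelope can mark the checkpoint where the pipeline ends, and the router then starts one explicit pipeline per outcome, so the rejection is modeled as a business path rather than an exception.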
Intended Outcome
A pragmatic, pessimistic architecture that improves on FTGO by keeping strong, explicit checkpoint semantics, minimal operational ambiguity, and a clear separation of business vs ops concerns.