Self-Hosted Deployment Recipe
This page describes the current production-ish self-host shape for the durable coordinator. It is not a deployment stack or a managed service contract. It is a recipe for operators who want to run a compute-first coordinator with durable stores, explicit worker boundaries, and known manual procedures.
The runnable starting point remains examples/restaurant-approval/self-host. That example proves the same control-plane, release, await, result, and failure/DLQ paths in one local process. The containerized HA reference in examples/restaurant-approval/self-host/container runs the same flow with a coordinator container, REST worker container, and LocalStack-backed DynamoDB/SQS/S3-compatible services.
examples/csv-payments/self-host/container is the advanced container reference. It adds stream input, app persistence, a REST transition worker, and a grouped pipeline-runtime-svc gRPC step runtime on top of the same durable coordinator pattern. The default lane uses SQS to stay within the LocalStack-backed AWS-shaped substrate; TPF_CSV_AWAIT_TRANSPORT=kafka runs the same self-host topology with Kafka await completions.
Deployment Shapes
One Process
Use this shape for local adoption and demos.
One packaged application process enables:
- generic control-plane API,
- release admin API,
- file release registry and artifact store,
- local in-process transition worker,
- memory/event/log providers by default.
This is the fastest way to prove the model, but it is not crash-surviving HA. If the process dies, in-memory execution state is gone.
Coordinator And Worker Processes
Use this shape when you want a real control-plane/data-plane boundary.
The coordinator process enables the control-plane and admin APIs, owns execution/await state, and selects a remote transition worker through configured REST or gRPC target properties. Worker processes host the same pipeline release code and expose the default-disabled worker endpoint or service.
REST worker selection example:
pipeline.orchestrator.worker.rest.base-url=http://restaurant-worker:8181
pipeline.orchestrator.worker.rest.shared-secret-ref=env:TPF_WORKER_SECRET
Worker process example:
pipeline.orchestrator.worker.rest.server-enabled=true
pipeline.orchestrator.worker.rest.shared-secret-ref=env:TPF_WORKER_SECRET
gRPC follows the same model with pipeline.orchestrator.worker.grpc.endpoint, grpc.server-enabled, and the matching shared secret or secret ref.
Durable Coordinator Baseline
Use this shape for production-style recovery tests and self-host pilots.
The coordinator persists execution and await state in DynamoDB-style tables, dispatches work through SQS-style queues, and publishes terminal execution failures to a durable DLQ. Workers can still be local, REST, gRPC, or SQS request/reply workers, but a separated REST or gRPC worker is the clearer operational boundary.
This is the TPF-owned HA path. Current FUNCTION builds are serverless invocation artifacts; they do not own durable execution records, await units, leases, DLQ/re-drive, or release pinning inside the function runtime. An all-serverless durable coordinator would require a different architecture backed by durable services such as DynamoDB, SQS, and EventBridge-style scheduling.
Minimum coordinator configuration:
pipeline.orchestrator.mode=QUEUE_ASYNC
pipeline.orchestrator.strict-startup=true
pipeline.orchestrator.idempotency-policy=CLIENT_KEY_REQUIRED
pipeline.orchestrator.execution-ttl-days=7
pipeline.orchestrator.lease-ms=30000
pipeline.orchestrator.max-retries=3
pipeline.orchestrator.retry-delay=PT10S
pipeline.orchestrator.retry-multiplier=2.0
pipeline.orchestrator.sweep-interval=PT30S
pipeline.orchestrator.sweep-limit=100
pipeline.orchestrator.state-provider=dynamo
pipeline.orchestrator.dispatcher-provider=sqs
pipeline.orchestrator.dlq-provider=sqs
pipeline.orchestrator.queue-url=https://sqs.eu-west-1.amazonaws.com/123456789012/tpf-work
pipeline.orchestrator.dlq-url=https://sqs.eu-west-1.amazonaws.com/123456789012/tpf-execution-dlq
pipeline.orchestrator.dynamo.execution-table=tpf_execution
pipeline.orchestrator.dynamo.execution-key-table=tpf_execution_key
pipeline.orchestrator.dynamo.await-interaction-table=tpf_await_interaction
pipeline.orchestrator.dynamo.await-interaction-key-table=tpf_await_interaction_key
pipeline.orchestrator.dynamo.release-table=tpf_release_registry
pipeline.orchestrator.dynamo.worker-table=tpf_worker_registry
pipeline.orchestrator.dynamo.region=eu-west-1
pipeline.orchestrator.sqs.region=eu-west-1
pipeline.orchestrator.control-plane.enabled=true
pipeline.orchestrator.control-plane.admin-token-ref=env:TPF_CONTROL_PLANE_TOKEN
pipeline.orchestrator.control-plane.require-remote-worker=true
pipeline.orchestrator.admin.enabled=true
pipeline.orchestrator.admin.admin-token-ref=env:TPF_ADMIN_TOKEN
pipeline.orchestrator.releases.registry.provider=dynamo
pipeline.orchestrator.releases.storage.provider=s3
pipeline.orchestrator.releases.storage.s3.bucket=tpf-release-artifacts
pipeline.orchestrator.releases.storage.s3.prefix=tpf/releases
pipeline.orchestrator.releases.storage.s3.region=eu-west-1
pipeline.orchestrator.worker.lifecycle.provider=dynamo
pipeline.orchestrator.worker.lifecycle.stale-after=PT2M
The await interaction table must provide these ALL-projected GSIs:
- await-interaction-by-unit
- await-interaction-pending-by-tenant
- await-interaction-pending-by-assignee
- await-interaction-pending-by-group
- await-interaction-pending-by-step
- await-interaction-pending-by-deadline
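A pre-flight check for these indexes could look like the sketch below. It assumes the input is a DynamoDB DescribeTable response (for example, the dict returned by boto3's describe_table); the function name is hypothetical and not part of TPF.

```python
# The six GSI names come from this page; everything else is illustrative.
REQUIRED_GSIS = (
    "await-interaction-by-unit",
    "await-interaction-pending-by-tenant",
    "await-interaction-pending-by-assignee",
    "await-interaction-pending-by-group",
    "await-interaction-pending-by-step",
    "await-interaction-pending-by-deadline",
)


def missing_or_misprojected_gsis(describe_table_response):
    """Return {index_name: problem} for each required GSI that is absent
    or whose projection type is not ALL."""
    table = describe_table_response.get("Table", {})
    found = {
        gsi["IndexName"]: gsi.get("Projection", {}).get("ProjectionType")
        for gsi in table.get("GlobalSecondaryIndexes", [])
    }
    problems = {}
    for name in REQUIRED_GSIS:
        if name not in found:
            problems[name] = "missing"
        elif found[name] != "ALL":
            problems[name] = f"projection is {found[name]}, expected ALL"
    return problems
```

An empty result means the table satisfies the index requirement; anything else is worth fixing before the coordinator starts with strict-startup=true.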
The SQS request/reply worker protocol has one additional v1 constraint: pipeline.orchestrator.worker.sqs.response-queue-url must be dedicated to a single coordinator instance or shard. Response demultiplexing is not implemented, so a shared response queue can deliver a worker response to the wrong coordinator process.
The Dynamo release registry stores immutable release records plus append-only activation events. Active release lookup reads the latest activation event for the tenant and pipeline; it does not update a mutable active pointer. The Dynamo worker lifecycle registry follows the same rule with append-only registration, heartbeat, and drain events. Existing execution and await Dynamo stores still use conditional updates for leases and state transitions until that storage model is redesigned.
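The append-only lookup rule can be illustrated with a minimal sketch. The event field names (tenantId, pipelineId, releaseVersion, activatedAt) are assumptions made for this example, not the registry's actual item schema.

```python
def active_release(activation_events, tenant_id, pipeline_id):
    """Return the release version named by the latest activation event
    for the tenant and pipeline, or None if none was ever activated.

    Events are never mutated or deleted; "active" is derived on read.
    activatedAt is assumed to be a sortable value such as epoch millis.
    """
    matching = [
        e for e in activation_events
        if e["tenantId"] == tenant_id and e["pipelineId"] == pipeline_id
    ]
    if not matching:
        return None
    return max(matching, key=lambda e: e["activatedAt"])["releaseVersion"]
```

Under this model, rolling back is just another activation event for an older release version; no pointer is rewritten.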
For one-process local development, use pipeline.orchestrator.releases.storage.provider=local with pipeline.orchestrator.releases.storage.root=/var/lib/tpf/releases.
For multi-coordinator self-host deployments, choose the artifact backing system by artifact form:
| Artifact form | Recommended backing system |
|---|---|
| Container worker image, including Jib output | OCI registry such as ECR, GHCR, JFrog, Harbor, or Docker registry. Reference by digest in pipeline-release.json; do not copy image layers into the coordinator artifact store. |
| JVM JAR | Maven/JFrog/Nexus when it is a published JVM artifact, or S3-compatible blob storage when the coordinator must manage the blob directly. |
| Native binary | OCI generic artifact, local managed filesystem for single-host, or S3-compatible blob storage for shared self-host access. |
| Lambda ZIP | S3 or S3-compatible object storage. |
| Lambda container image | OCI registry, usually ECR on AWS. |
| External endpoint | Existing deployment/service discovery; the descriptor pins endpoint identity and digest/provenance metadata where available. |
The S3-compatible provider is therefore a shared blob-store option, not the default artifact repository strategy. AWS S3, MinIO, and LocalStack-style endpoints fit this model; use endpoint-override and path-style-access=true for non-AWS S3-compatible stores.
The restaurant container reference demonstrates this baseline locally:
./examples/restaurant-approval/self-host/container/run-container-ha-demo.sh --ci
It uses LocalStack to create the required DynamoDB tables, SQS work/DLQ queues, and S3-compatible release artifact bucket. Treat that as local verification of the topology, not production AWS provisioning.
The CSV Payments container reference demonstrates the same baseline with broker-backed await completions and the example persistence path:
./examples/csv-payments/self-host/container/run-container-ha-demo.sh --ci
The SQS lane is the default AWS-shaped proof. The Kafka lane proves the same await abstraction against a second provider:
TPF_CSV_AWAIT_TRANSPORT=kafka ./examples/csv-payments/self-host/container/run-container-ha-demo.sh --ci
CSV await item continuations use the same bounded transition-worker seam as normal queue-async work. The worker executes each item continuation segment up to the aggregate boundary, while generated step clients target the runtime and persistence containers.
Startup Checklist
Before accepting work:
- Build the pipeline artifact and confirm it contains META-INF/pipeline/pipeline-contract.json.
- Start durable substrates first: execution tables, await tables and indexes, work queue, DLQ queue, and any worker protocol queues.
- Start worker processes with the matching pipeline code and worker protocol secret.
- Start the coordinator with strict-startup=true.
- Produce a pipeline-release.json that pins the built artifacts by digest.
- Register and activate the release for the tenant and pipeline.
- Register or heartbeat at least one worker for the active contract/release identity.
- Submit one canary execution and verify status, pending await interaction, completion, and result.
The current coordinator does not dynamically load registered code. Release registration validates the release descriptor and, for local executable artifacts, validates and stores the artifact in the configured release artifact store. Container images remain in the OCI registry; the coordinator records their immutable reference and uses worker capability checks to verify that deployed workers host the active release. Workers must already host matching code. Worker availability checks verify the active contract/release identity before hosted execution submission, artifact id/digest when both sides provide it, and a matching HEALTHY worker lifecycle record. Stale, draining, and unavailable workers reject new hosted submissions with 503.
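The availability rule described above can be sketched as a small admission function. The class and field names are hypothetical; only the checks themselves (contract/release identity, digest when both sides provide it, HEALTHY lifecycle state) are taken from this page.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ReleaseIdentity:
    contract_version: str
    release_version: str
    artifact_digest: Optional[str] = None  # optional on either side


@dataclass(frozen=True)
class WorkerRecord:
    identity: ReleaseIdentity
    state: str  # HEALTHY, STALE, DRAINING, or UNAVAILABLE


def can_admit(active: ReleaseIdentity, workers: Iterable[WorkerRecord]) -> bool:
    """True if at least one worker may host a new submission for the
    active release. False corresponds to the 503 rejection."""
    for w in workers:
        if w.state != "HEALTHY":
            continue
        if (w.identity.contract_version, w.identity.release_version) != (
            active.contract_version, active.release_version
        ):
            continue
        # The digest is compared only when both sides provide one.
        if (active.artifact_digest and w.identity.artifact_digest
                and active.artifact_digest != w.identity.artifact_digest):
            continue
        return True
    return False
```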
pipeline.orchestrator.control-plane.require-remote-worker=true is recommended for separated self-host deployments. It prevents a coordinator process from silently falling back to the local in-process worker when no REST, gRPC, or SQS worker target is configured. Leave it disabled for the one-process local demo. See Runtime Boundaries And Performance.
Operator Runbooks
Register And Activate
- Register the release descriptor with POST /tpf/admin/tenants/{tenantId}/pipelines/{pipelineId}/releases/register.
- Read the returned releaseVersion.
- Activate with POST /tpf/admin/tenants/{tenantId}/pipelines/{pipelineId}/releases/{releaseVersion}/activate.
- Confirm GET /active returns the expected release, contract version, and artifact identity.
- Register or heartbeat a worker with POST /tpf/admin/tenants/{tenantId}/pipelines/{pipelineId}/workers/register.
- Keep workers for that release healthy before submitting executions.
Activation affects new executions only. Existing executions remain pinned to the contract/release identity stored on their execution record.
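For scripting the runbook, the admin paths can be assembled with a small helper such as the sketch below. The helper name and the returned dictionary keys are hypothetical; the path templates are the ones listed in the steps above. Request bodies and authentication headers are deliberately left out because this page does not specify them.

```python
from urllib.parse import quote

ADMIN_PREFIX = "/tpf/admin/tenants/{tenant}/pipelines/{pipeline}"


def admin_paths(tenant_id, pipeline_id, release_version=None):
    """Build the register/activate/worker-register paths for one pipeline.

    Path segments are percent-encoded so ids containing '/' or spaces
    cannot change the route.
    """
    base = ADMIN_PREFIX.format(
        tenant=quote(tenant_id, safe=""),
        pipeline=quote(pipeline_id, safe=""),
    )
    paths = {
        "register_release": f"{base}/releases/register",
        "register_worker": f"{base}/workers/register",
    }
    if release_version is not None:
        version = quote(str(release_version), safe="")
        paths["activate_release"] = f"{base}/releases/{version}/activate"
    return paths
```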
Submit And Complete Await
- Submit through POST /tpf/control-plane/tenants/{tenantId}/executions.
- Poll GET /executions/{executionId}.
- If the execution parks on await, query GET /interactions/pending.
- Complete with POST /interactions/complete.
- Poll terminal status and then read GET /executions/{executionId}/result.
The restaurant self-host client demonstrates this flow through run-self-hosted-demo.sh --ci.
The containerized HA reference demonstrates the same flow through self-host/container/run-container-ha-demo.sh --ci.
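The polling steps in this flow can be wrapped in a bounded loop like the following sketch. It is transport-agnostic: fetch_status is a caller-supplied callable (hypothetical) that performs the status request and returns the status string, and the set of terminal status names is passed in rather than assumed.

```python
import time


def wait_for_terminal(fetch_status, terminal_statuses, interval_s=2.0,
                      timeout_s=300.0, sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns a status in terminal_statuses.

    Raises TimeoutError if no terminal status is seen within timeout_s.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal_statuses:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"execution still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

A script would call this once after submission, complete any pending interaction when the execution parks on await, and call it again before reading the result.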
Incident Triage
Use the restaurant incident demo as the current failure-visibility proof:
./examples/restaurant-approval/self-host/run-self-hosted-incident.sh --ci
For the containerized HA reference:
./examples/restaurant-approval/self-host/container/run-container-ha-incident.sh --ci
For real incidents:
- Check execution status and read errorCode, errorMessage, attempt, and stepIndex.
- Confirm whether the execution is retrying, failed, or terminally DLQ'd.
- Inspect coordinator logs or the configured execution DLQ for the matching execution id.
- Correct the downstream dependency or input path that caused the failure.
- Confirm downstream idempotency before any re-drive.
- Re-drive a terminal execution with POST /tpf/admin/tenants/{tenantId}/executions/{executionId}/redrive.
Re-drive reads the durable execution record and re-enqueues the original execution id. The DLQ message is evidence for triage and alerting; it is not consumed as the replay source. FAILED execution re-drive is opt-in (allowFailed=true) because those failures may not have exhausted the DLQ path.
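The re-drive semantics can be modelled in a few lines. This is an in-memory sketch, not the coordinator's code: a dict stands in for the durable execution store, a list for the work queue, and the status names "DLQ" and "FAILED" are hypothetical labels for the terminal states described above.

```python
def redrive(execution_store, work_queue, execution_id, allow_failed=False):
    """Re-enqueue an execution from its durable record.

    The DLQ is never read here: the execution record is the replay
    source, and the original execution id is reused.
    """
    record = execution_store[execution_id]
    status = record["status"]
    if status == "FAILED" and not allow_failed:
        # FAILED re-drive is opt-in, mirroring allowFailed=true.
        raise PermissionError("FAILED executions require allow_failed=True")
    if status not in ("DLQ", "FAILED"):
        raise ValueError(f"execution in status {status!r} is not re-drivable")
    work_queue.append(execution_id)
    return execution_id
```

Because the same execution id is re-enqueued, downstream steps see the replay as a retry of known work, which is why downstream idempotency should be confirmed first.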
Manual Upgrade And Drain
The current safe upgrade procedure is conservative:
- Deploy workers that host the new release version.
- Register or heartbeat the new workers for the new release.
- Register and activate the new release.
- Submit canary executions and verify worker availability and results.
- Mark old workers draining when they should stop accepting new hosted submissions.
- Leave old workers running until executions pinned to the old release drain.
- Stop old workers only after status queries show no active executions for the old release.
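The final step's stop condition can be expressed as a simple predicate over status query results. The record fields (releaseVersion, status) and the idea of passing the active status names explicitly are assumptions for this sketch.

```python
def safe_to_stop_old_workers(executions, old_release, active_statuses):
    """True when no execution pinned to old_release is still active.

    executions: iterable of dicts from status queries (hypothetical shape).
    active_statuses: the status names that count as non-terminal.
    """
    return not any(
        e["releaseVersion"] == old_release and e["status"] in active_statuses
        for e in executions
    )
```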
The lifecycle model is intentionally small. It records HEALTHY, STALE, DRAINING, and UNAVAILABLE views for submit admission. It does not autoscale workers, choose among worker pools, manage Kubernetes deployments, or move in-flight executions between workers.
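One way to picture how such a view could be derived from append-only registration, heartbeat, and drain events is sketched below. The mapping from events to states is an assumption made for illustration, not the registry's documented algorithm; the two-minute default mirrors stale-after=PT2M from the baseline configuration.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=2)  # mirrors worker.lifecycle.stale-after=PT2M


def lifecycle_view(events, now, stale_after=STALE_AFTER):
    """Derive a worker's admission view from its event history.

    events: list of (timestamp, kind) tuples, kind being "register",
    "heartbeat", or "drain" (hypothetical event shape).
    """
    if not events:
        return "UNAVAILABLE"  # assumption: never-seen workers are unavailable
    latest_ts, latest_kind = max(events, key=lambda e: e[0])
    if latest_kind == "drain":
        return "DRAINING"
    if now - latest_ts > stale_after:
        return "STALE"
    return "HEALTHY"
```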
What To Monitor
At minimum:
- work queue depth and oldest message age,
- execution status distribution: running, waiting external, retrying, failed, DLQ,
- lease takeover and sweeper activity,
- await pending count and oldest pending deadline,
- DLQ publication count and repeated failure fingerprints,
- worker protocol latency and transport failures,
- release activation events and active release id per tenant/pipeline,
- worker lifecycle records by tenant, pipeline, release, and state.
For metric names and observability surfaces, use the operations guides for Metrics, Replay & Live Topology, and the Operator Playbook.
Current Limits
This recipe intentionally does not include:
- Kubernetes manifests or Docker Compose files,
- dynamic JAR loading in the coordinator,
- append-only execution/await state storage,
- bulk DLQ-message consumers or automated replay campaigns,
- worker autoscaling, fleet routing, or deployment orchestration,
- production tenancy, RBAC, or org/principal management.
The file-backed release registry is suitable for local/dev and single-coordinator self-host pilots. Use the Dynamo release registry for multi-coordinator release metadata. Use the S3-compatible release artifact store only for artifacts that should be coordinator-managed blobs; use OCI or Maven-style repositories for artifacts that already have native repository semantics.