PRD Status

Project: Smart Gate Trigger Service

Branch: ralph/smart-gate-trigger

Scope: Self-hosted decision/orchestration service that decides when and how often to pulse a gate-open MQTT command (gate/cmd) based on phone location, so the gate is already open before an FSD vehicle reaches it. Runs as a single Docker/docker-compose service on a home-lab VM. Source of truth: tasks/prd-smart-gate-trigger-service.md (blessed by the 4-pass critic loop). Completion gate for every story: lint + typecheck + pytest green on GitHub Actions runners against a mock MQTT broker; no physical hardware.

Story progress

✅ Complete (passes == true): 21
⏳ Remaining: 0
📋 Total stories: 21

A story flips to complete only when the Ralph wrapper sees lint + typecheck + pytest green on GitHub Actions against a mock MQTT broker — no physical hardware.

All stories

ID	Pri	Title	Status	Notes
US-001	1	Project scaffold and GitHub Actions CI pipeline	✅ done	Combines PRD US-001 scaffold + US-013 CI portion. Container build is deferred to US-018.
US-002	2	Typed configuration model with validation and secret-by-reference	✅ done	PRD US-001.
US-003	3	Persistence layer with migrations and append-only audit	✅ done	PRD US-015, moved early because US-012 (config API) and US-014 (recovery) depend on it.
US-004	4	MQTT client wrapper with TLS, auth, reconnect, and retain=false	✅ done	PRD US-002.
US-005	5	Location ingest with freshness and accuracy guards	✅ done	PRD US-003.
US-006	6	Geometry module: zone membership, distance, and ground speed	✅ done	PRD US-004.
US-007	7	Per-entity state machine	✅ done	PRD US-005.
US-008	8	Singleton publish guard via flock	✅ done	Extracted from PRD US-016 (singleton portion); the coordinator (US-009) depends on it.
US-009	9	Pulse coordinator and actuator publisher	✅ done	PRD US-006.
US-010	10	Prometheus metrics and liveness/readiness probes	✅ done	PRD US-016 (metrics + probes portion); placed before caps/alerting/API which reference its
US-011	11	Safety caps and structured audit logging	✅ done	PRD US-007.
US-012	12	Runtime configuration and health API	✅ done	PRD US-009. Depends on persistence (US-003), config model (US-002), probes (US-010), and c
US-013	13	Alerting via Home Assistant notification topic	✅ done	PRD US-008. Placed after the API (US-012) because it references the API auth-failure metri
US-014	14	State recovery on restart	✅ done	PRD US-010.
US-015	15	Degraded-operation behavior	✅ done	PRD US-011.
US-016	16	End-to-end simulation/replay harness (CI completion gate)	✅ done	PRD US-012. The synthetic fixtures are the primary behavioral completion gate.
US-017	17	Home Assistant configuration artifacts and drift check	✅ done	PRD US-014.
US-018	18	Containerization (Dockerfile, digest-pinned, non-root, read-only rootfs)	✅ done	PRD US-013 container portion. SBOM/cosign signing are out of scope per owner decision.
US-019	19	Hardened docker-compose deployment	✅ done	PRD US-018, retargeted from Kubernetes manifests to docker-compose.
US-020	20	Failure-modes catalog, runbooks, and threat model docs	✅ done	PRD US-017.
US-021	21	Progress/docs site buildable for Cloudflare Pages	✅ done	Mirrors the nodewright Cloudflare progress-tracking pattern. The live publish hook + deplo

Acceptance criteria

Each story's full acceptance criteria, as recorded in prd.json, follow.

✅ US-001 — Project scaffold and GitHub Actions CI pipeline

As a developer, I need a Python project skeleton and a CI pipeline so every later story is gated on green CI.

Python package layout with a dependency/lock file and lint + type-check tooling configured (e.g., ruff + mypy)
pytest configured with at least one trivial passing test to prove the suite runs
GitHub Actions workflow runs lint, typecheck, and pytest on push and pull_request and fails the job if any step fails
The workflow runs on the self-hosted Proxmox runners (runs-on: [self-hosted, linux, x64, ci]), NOT GitHub-hosted ubuntu-latest, and bootstraps its own tooling in-job (e.g. a make install / venv step) rather than assuming a clean image
A mock/in-process MQTT broker fixture is available to the test suite (no external broker)
README documents local run and how to execute the test suite
Lint passes
Typecheck passes
Tests pass in CI (GitHub Actions)

✅ US-002 — Typed configuration model with validation and secret-by-reference

As a developer, I need a validated, typed config model so all tunables live in one place and unsafe configs are rejected.

Pydantic config model with canonical SI units: zone radii (arm/trigger/home) in meters, gate_center in decimal degrees, pulse_interval, auto_close_window (observed minimum), max_pulses_per_session, max_session_duration, min_trigger_speed_mps (default ~3.6 m/s ≈ 8 mph), location_freshness_seconds, max_gps_accuracy_meters, gate_open_time_seconds, max_approach_speed_mps, system_latency_seconds, tracked entities, MQTT connection + topics, quiet-hours window with an explicit IANA timezone
Config loads from file with environment-variable overrides; invalid config fails fast with a clear error
Secret-typed fields (MQTT credentials, API tokens, TLS material) are supplied by reference from env (.env / Docker secret), never inlined or persisted, and a redaction helper masks them
Rejects config where pulse_interval is not strictly less than auto_close_window
Emits a warning when configured trigger radius < (gate_open_time_seconds + system_latency_seconds) * max_approach_speed_mps (FR-17)
Emits a warning when home radius <= max_gps_accuracy_meters (FR-19)
Tests cover valid load, env override, fail-fast on invalid, both warnings, the pulse_interval rejection, and secret redaction
Lint passes
Typecheck passes
Tests pass in CI
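
The validation rules above can be sketched as follows. This is a minimal illustration, not the project's model: it uses a plain stdlib dataclass instead of Pydantic to stay dependency-free, covers only a subset of fields, and the names `GateConfig`, `trigger_radius`, and `home_radius` are hypothetical.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    """Hypothetical subset of the config model; all values in SI units."""

    pulse_interval: float           # seconds between gate-open pulses
    auto_close_window: float        # observed minimum auto-close time, seconds
    trigger_radius: float           # meters (field name is an assumption)
    home_radius: float              # meters (field name is an assumption)
    gate_open_time_seconds: float
    system_latency_seconds: float
    max_approach_speed_mps: float
    max_gps_accuracy_meters: float

    def __post_init__(self) -> None:
        # Hard rejection: a re-pulse must always land before the gate auto-closes.
        if not self.pulse_interval < self.auto_close_window:
            raise ValueError(
                "pulse_interval must be strictly less than auto_close_window"
            )
        # FR-17: distance covered while the gate opens, at worst-case speed.
        min_radius = (
            self.gate_open_time_seconds + self.system_latency_seconds
        ) * self.max_approach_speed_mps
        if self.trigger_radius < min_radius:
            warnings.warn(
                f"trigger radius {self.trigger_radius} m is below {min_radius} m (FR-17)"
            )
        # FR-19: a home zone no larger than the GPS error cannot be trusted.
        if self.home_radius <= self.max_gps_accuracy_meters:
            warnings.warn("home radius <= max_gps_accuracy_meters (FR-19)")
```

With a 15 s gate-open time, 5 s latency, and a 15 m/s approach speed, any trigger radius under 300 m would produce the FR-17 warning in this sketch.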

✅ US-003 — Persistence layer with migrations and append-only audit

As a developer, I need a SQLite-backed repository with migrations so config, latches, and audit rows persist behind a swappable interface.

SQLite-backed persistence with a versioned migration framework; forward migrations applied automatically on startup
A repository interface abstracts storage so callers depend on the interface, not SQLite specifics
Persists runtime-tunable config, the per-entity arrived/armed latch, and the Tier-C cap-halt marker
Audit rows are append-only: the repository interface exposes no update or delete path for audit rows (FR-28)
Tests cover migration apply on a fresh DB, repository read/write round-trips, and that audit rows cannot be updated/deleted via the interface
Lint passes
Typecheck passes
Tests pass in CI
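
One way to picture the append-only audit interface is the sketch below. It is an assumption-laden illustration: the names `AuditLog` and `SqliteAuditLog`, the single-column schema, and the inline `CREATE TABLE` (standing in for a real versioned migration) are all hypothetical. The SQLite triggers are an extra defence-in-depth idea that the acceptance criteria do not require; the criteria only demand that the interface expose no update or delete path.

```python
import sqlite3
from typing import Protocol


class AuditLog(Protocol):
    """Callers depend on this interface; it has no update or delete method."""

    def append(self, event: str) -> int: ...

    def read_all(self) -> list: ...


class SqliteAuditLog:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        # Stand-in for a versioned forward migration applied at startup.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS audit ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, event TEXT NOT NULL)"
        )
        # Optional hardening (assumption): reject UPDATE/DELETE even via raw SQL.
        for op in ("UPDATE", "DELETE"):
            conn.execute(
                f"CREATE TRIGGER IF NOT EXISTS audit_no_{op.lower()} "
                f"BEFORE {op} ON audit BEGIN "
                "SELECT RAISE(ABORT, 'audit is append-only'); END"
            )
        conn.commit()

    def append(self, event: str) -> int:
        cur = self._conn.execute("INSERT INTO audit (event) VALUES (?)", (event,))
        self._conn.commit()
        return cur.lastrowid

    def read_all(self) -> list:
        rows = self._conn.execute("SELECT id, event FROM audit ORDER BY id")
        return rows.fetchall()
```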

✅ US-004 — MQTT client wrapper with TLS, auth, reconnect, and retain=false

As a developer, I need a resilient, authenticated MQTT client so the service survives broker restarts and cannot leak or replay commands.

A single asyncio-capable MQTT client library is chosen and pinned to an exact hashed version in the lock file
Connect/subscribe/publish helpers with automatic reconnect and backoff
Publishes to gate/cmd force retain=false; a test asserts no retained actuator command can be emitted (FR-1)
TLS (validating the broker CA) and per-client authentication are required by default; an explicit opt-out flag is allowed only for the mock-broker CI path and emits a startup warning (FR-25)
Credentials are read from injected secret env, never from the config blob
Tests cover: connect against mock broker, reconnect after broker drop, refusal to connect on TLS-verify failure (unless CI opt-out), and retain=false enforcement
Lint passes
Typecheck passes
Tests pass in CI

✅ US-005 — Location ingest with freshness and accuracy guards

As the system, I must reject unreliable location fixes so stale or drifting GPS cannot cause bad decisions.

Parses incoming MQTT location messages (gate/location/) into a normalized fix (entity, lat, lon, gps_accuracy, timestamp, optional speed)
Rejects fixes older than location_freshness_seconds (dead/sleeping phone)
Rejects fixes with accuracy worse than max_gps_accuracy_meters (GPS drift)
Drops malformed/partial payloads without crashing
Tests cover fresh+accurate (accept), stale (reject), inaccurate (reject), and malformed (drop)
Lint passes
Typecheck passes
Tests pass in CI
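
The accept/reject/drop rules above might look roughly like this. The JSON payload shape (keys `lat`, `lon`, `gps_accuracy`, `timestamp`, `speed`, with an epoch-seconds timestamp) is an assumption based on the normalized fix fields, and `parse_fix` is a hypothetical name; the entity is passed in by the caller rather than derived from the topic.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fix:
    entity: str
    lat: float
    lon: float
    gps_accuracy: float        # meters
    timestamp: float           # epoch seconds (assumed encoding)
    speed: Optional[float] = None


def parse_fix(entity, payload, *, now, location_freshness_seconds,
              max_gps_accuracy_meters) -> Optional[Fix]:
    """Return a normalized Fix, or None if the payload must be rejected."""
    try:
        data = json.loads(payload)
        fix = Fix(
            entity=entity,
            lat=float(data["lat"]),
            lon=float(data["lon"]),
            gps_accuracy=float(data["gps_accuracy"]),
            timestamp=float(data["timestamp"]),
            speed=None if data.get("speed") is None else float(data["speed"]),
        )
    except (ValueError, KeyError, TypeError, AttributeError):
        return None  # malformed or partial payload: drop, never crash
    if not (-90 <= fix.lat <= 90 and -180 <= fix.lon <= 180):
        return None  # out-of-range (or NaN) coordinates
    if not (now - fix.timestamp <= location_freshness_seconds):
        return None  # stale: dead or sleeping phone
    if not (0 <= fix.gps_accuracy <= max_gps_accuracy_meters):
        return None  # inaccurate: GPS drift
    return fix
```

The comparisons are written in the "not (x <= limit)" form so that a NaN value fails every guard instead of slipping through.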

✅ US-006 — Geometry module: zone membership, distance, and ground speed

As the system, I need geometry primitives so I can decide which zone an entity is in and how fast it is moving.

Haversine distance and circular zone membership (arm/trigger/home) computed from raw coordinates
Ground speed computed from consecutive fixes; an explicit speed attribute is used directly when present; speed requires >= 2 fresh consecutive fixes, otherwise reported as unknown (not zero/guessed)
The is-point-in-zone check is behind an interface so polygon/corridor implementations can be added later without changing callers
Range-rate is NOT computed in v1 (deferred to future garage intent)
Tests use known coordinate fixtures with expected zone membership, distance, and speed/unknown results
Lint passes
Typecheck passes
Tests pass in CI
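
A compact sketch of these primitives follows. The names `haversine_m`, `Zone`, `CircleZone`, and `ground_speed_mps` are hypothetical, the Earth radius is the usual spherical approximation, and the caller is assumed to have already filtered out stale fixes.

```python
import math
from dataclasses import dataclass
from typing import Optional, Protocol

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


class Zone(Protocol):
    """Interface so polygon or corridor zones can be added without changing callers."""

    def contains(self, lat: float, lon: float) -> bool: ...


@dataclass(frozen=True)
class CircleZone:
    center_lat: float
    center_lon: float
    radius_m: float

    def contains(self, lat: float, lon: float) -> bool:
        return haversine_m(self.center_lat, self.center_lon, lat, lon) <= self.radius_m


def ground_speed_mps(prev, curr, explicit_speed: Optional[float] = None) -> Optional[float]:
    """prev and curr are (lat, lon, timestamp_s) tuples; prev may be None.

    Returns None (unknown) rather than zero when speed cannot be derived.
    """
    if explicit_speed is not None:
        return explicit_speed  # an explicit speed attribute is used directly
    if prev is None or curr[2] <= prev[2]:
        return None  # fewer than two usable consecutive fixes
    return haversine_m(prev[0], prev[1], curr[0], curr[1]) / (curr[2] - prev[2])
```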

✅ US-007 — Per-entity state machine

As the system, I need a per-entity state machine so each phone independently progresses IDLE->ARMED->TRIGGERING->ARRIVED/STALE and re-arms correctly.

Implements IDLE, ARMED, TRIGGERING, ARRIVED, STALE with the transitions in PRD section 4
min_trigger_speed_mps gates only the initial entry to TRIGGERING; unknown speed must not enter TRIGGERING (fail-safe)
Trigger-zone latch: once TRIGGERING, stays until ARRIVED, trigger-zone exit, or fix expiry, ignoring momentary speed dips
STALE withdrawal: no fresh fix for an entity (age > location_freshness_seconds) stops its pulse contribution; arrived/armed latches behave per PRD
Re-arm only after the entity exits the arm zone
Tests assert: normal arrival, drive-past (no home, bounded), arrive-and-park (stop, no re-fire), pedestrian/low-speed (no trigger), slow/stop-in-zone after triggering (stays TRIGGERING), first in-zone fix with unknown speed (no trigger), and silence-while-TRIGGERING (enters STALE, stops)
Lint passes
Typecheck passes
Tests pass in CI
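
The transition rules listed above can be approximated by a pure step function. This is only a sketch: the authoritative transitions live in PRD section 4, which is not reproduced here, so the handling of stale fixes in ARRIVED, the fallback to ARMED on trigger-zone exit, and the assumption that zones are nested (home inside trigger inside arm) are all guesses made for illustration. `step` and its parameters are hypothetical names.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "IDLE"
    ARMED = "ARMED"
    TRIGGERING = "TRIGGERING"
    ARRIVED = "ARRIVED"
    STALE = "STALE"


def step(state: State, *, fresh: bool = True, in_arm: bool = False,
         in_trigger: bool = False, in_home: bool = False,
         speed: Optional[float] = None,
         min_trigger_speed_mps: float = 3.6) -> State:
    """Compute the next state for one entity from one observation."""
    if not fresh:
        # Assumption: the ARRIVED latch survives silence; everything else goes STALE.
        return State.ARRIVED if state is State.ARRIVED else State.STALE
    if state is State.ARRIVED:
        # Re-arm only after the entity has left the arm zone.
        return State.ARRIVED if in_arm else State.IDLE
    if in_home:
        return State.ARRIVED
    if not in_arm:
        return State.IDLE
    if state is State.TRIGGERING:
        # Latch: speed dips are ignored while still inside the trigger zone.
        return State.TRIGGERING if in_trigger else State.ARMED
    # The speed gate applies only to the initial entry; unknown speed fails safe.
    if in_trigger and speed is not None and speed >= min_trigger_speed_mps:
        return State.TRIGGERING
    return State.ARMED
```

Keeping the function pure (no clocks, no I/O) makes scenarios such as arrive-and-park or silence-while-TRIGGERING straightforward to express as sequences of calls.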

✅ US-008 — Singleton publish guard via flock

As the operator, I need at most one process able to publish gate/cmd so a duplicate instance cannot double-pulse the gate.

Before publishing to gate/cmd, a process acquires an exclusive flock advisory lock on /singleton.lock (FR-20)
The lock is held continuously for the lifetime of the publish-capable coordinator on a single open file descriptor (not acquired/released per pulse), and auto-releases on fd close, process exit, or SIGKILL so a successor acquires it immediately (no heartbeat/lease)
The lockfile lives on a LOCAL data volume; non-local or synced filesystems (NFS/SMB/9p/FUSE) are unsupported for the lock and are rejected or warned at startup
A non-owner must refuse to publish to gate/cmd until it acquires the lock
Exposes the singleton ownership state for metrics/health (sgt_singleton_owner source)
Tests assert two concurrently-started coordinators yield exactly one publisher (single stream, not doubled), and that killing the lock holder (SIGKILL / fd close) lets the waiter take over
Lint passes
Typecheck passes
Tests pass in CI
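
The lock behaviour described above maps fairly directly onto `fcntl.flock`. The sketch below is illustrative only: `SingletonLock` is a hypothetical name, the lock path is supplied by the caller, and the startup check for non-local filesystems is omitted. It relies on POSIX `flock` semantics, so it is Linux/Unix-only.

```python
import fcntl
import os
from typing import Optional


class SingletonLock:
    """Holds an exclusive advisory lock on one fd for the owner's lifetime."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._fd: Optional[int] = None

    @property
    def is_owner(self) -> bool:
        # Could feed a gauge such as sgt_singleton_owner.
        return self._fd is not None

    def try_acquire(self) -> bool:
        """Non-blocking attempt; a non-owner can call this again later."""
        if self._fd is not None:
            return True
        fd = os.open(self._path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)  # another open file description holds the lock
            return False
        self._fd = fd  # keep the fd open: closing it releases the lock
        return True

    def release(self) -> None:
        if self._fd is not None:
            os.close(self._fd)  # the kernel drops the lock on close or process exit
            self._fd = None
```

Because the kernel releases the lock when the descriptor closes, including after SIGKILL, no heartbeat or lease is needed for a successor to take over.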

✅ US-009 — Pulse coordinator and actuator publisher

As the system, I must keep the gate open across the approach by re-pulsing under the auto-close timer, deduped across both phones.

Keeps the gate open while any entity is TRIGGERING with a non-expired fix; re-pulses every pulse_interval
pulse_interval is enforced strictly less than auto_close_window (rejected at config validation)
Two entities triggering simultaneously produce a single deduped command stream, not a doubled rate (FR-8)
Publishes pulse to the configured gate/cmd topic with retain=false, only when this process holds the singleton lock (US-008)
Pulsing stops when no entity is triggering
Tests assert pulse cadence under the auto-close window, dedupe across entities, and stop-when-none-triggering
Lint passes
Typecheck passes
Tests pass in CI
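
The cadence and dedupe rules can be illustrated with a small, clock-free scheduler. `PulseCoordinator` and `tick` are hypothetical names, and the choice to pulse immediately when a session starts is an assumption; the actual MQTT publish to gate/cmd is left to the caller.

```python
from typing import Optional, Set


class PulseCoordinator:
    """Decides when a single, shared pulse stream should emit."""

    def __init__(self, pulse_interval: float) -> None:
        self._interval = pulse_interval
        self._last_pulse: Optional[float] = None

    def tick(self, now: float, triggering: Set[str], is_owner: bool) -> bool:
        """Return True when the caller should publish one pulse now."""
        if not triggering or not is_owner:
            # No entity is triggering, or this process lacks the singleton lock.
            self._last_pulse = None
            return False
        if self._last_pulse is None or now - self._last_pulse >= self._interval:
            # One pulse regardless of how many entities are triggering.
            self._last_pulse = now
            return True
        return False
```

Because the decision depends only on whether the triggering set is non-empty, two phones arriving together yield the same cadence as one.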

✅ US-010 — Prometheus metrics and liveness/readiness probes

As the operator, I need a metrics registry, a /metrics endpoint, and livez/readyz probes so the service is observable and a Docker healthcheck can target it.

A metrics registry and a Prometheus /metrics endpoint expose the metrics whose sources exist so far (e.g., sgt_build_info, sgt_mqtt_connected, sgt_location_fixes_total, sgt_entity_state, sgt_pulses_emitted_total, sgt_decision_latency_seconds, sgt_singleton_owner) with the labels in PRD section 11.1; later stories register their own metrics on this registry
livez reflects only that the process/event loop is up (never depends on broker/persistence/lock)
readyz is ready only when broker connected AND persistence reachable AND migrations applied AND singleton lock held (FR-21)
A non-owner instance still serves livez, readyz, /metrics, and /v1/health while it retries lock acquisition — lock contention must not block startup or look like a dead process
Tests assert /metrics exposes the expected metric names, and readyz goes not-ready when the broker mock is down, persistence is unreachable, or the lock is not held, then recovers
Lint passes
Typecheck passes
Tests pass in CI
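
The difference between the two probes can be shown in a few lines. This is a sketch with hypothetical names (`Dependencies`, `livez`, `readyz`); the HTTP status codes 200 and 503 are an assumption, since the criteria only specify ready versus not-ready.

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass(frozen=True)
class Dependencies:
    broker_connected: bool
    persistence_reachable: bool
    migrations_applied: bool
    singleton_lock_held: bool


def livez() -> int:
    # Liveness never consults the broker, persistence, or the lock.
    return 200


def readyz(deps: Dependencies) -> Tuple[int, List[str]]:
    """Ready only when every dependency check passes (FR-21)."""
    failing = [f.name for f in fields(deps) if not getattr(deps, f.name)]
    return (503 if failing else 200), failing
```

Returning the list of failing checks alongside the status makes a not-ready response easier to diagnose from a curl call.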

✅ US-011 — Safety caps and structured audit logging

As the operator, I need hard limits and a full decision trail so the system can never runaway-pulse and I can answer why the gate opened.

Enforces max_pulses_per_session and max_session_duration; loop halts when either is hit
A cap-halt persists a Tier-C halted_session marker (FR-24), increments sgt_cap_halts_total, and is not silently resumed on restart (clears only on re-arm or operator reset)
Every decision and pulse emits a structured log entry: ts, entity, state_from, state_to, lat, lon, gps_accuracy, speed (or unknown), zone, reason, session_id, pulse_seq, quiet_hours
Pulses within the quiet-hours window are flagged, evaluated in the configured IANA timezone (DST-aware)
Each audit event is appended via the append-only repository (US-003) and also written to stdout (FR-28); the stdout copy requires a documented Docker log-retention/rotation config (e.g., json-file max-size/max-file) and is scoped as a pragmatic single-home repudiation control, not tamper-proof against host-root/Docker-daemon compromise
Tests assert caps halt the loop, the persisted marker is set, audit entries contain the required fields and reason codes, and quiet-hours flagging uses the configured timezone
Lint passes
Typecheck passes
Tests pass in CI
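
The two hard caps could be enforced by a small per-session guard like the one below. It is a sketch under assumptions: `Session`, `allow_pulse`, and the reason strings are hypothetical, and persisting the Tier-C marker, incrementing sgt_cap_halts_total, and writing the audit row are left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    started_at: float               # seconds, same clock as `now`
    max_pulses_per_session: int
    max_session_duration: float     # seconds
    pulses: int = 0
    halted_reason: Optional[str] = None

    def allow_pulse(self, now: float) -> bool:
        """Return True and count the pulse, or halt the session and return False."""
        if self.halted_reason is not None:
            return False  # a halt is sticky until re-arm or operator reset
        if now - self.started_at >= self.max_session_duration:
            self.halted_reason = "max_session_duration"
        elif self.pulses >= self.max_pulses_per_session:
            self.halted_reason = "max_pulses_per_session"
        if self.halted_reason is not None:
            return False
        self.pulses += 1
        return True
```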

✅ US-012 — Runtime configuration and health API

As the operator (and a future C2 site), I want an authenticated REST API to read/update config at runtime and check health.

Versioned endpoints GET/PUT/PATCH /v1/config, GET /v1/health, POST /v1/session/reset; only /livez, /readyz, /metrics are unversioned
Every endpoint requires authentication (bearer token or mTLS, credential from injected secret env); unauthenticated requests get 401; mutating endpoints are authorized separately from reads (read-only token on a mutating endpoint gets 403)
GET /v1/config returns effective config with all secret-typed fields redacted (FR-27)
Config fields are classified hot-mutable / external-artifact-coupled (returns requires_ha_sync) / connection-level (controlled reconnect); the response indicates which class applied and whether a reconnect/HA sync happened
Every config mutation writes an audit record (source, field-level before->after diff, reconnect flag) (FR-23); invalid updates are rejected and do not partially apply
GET /v1/health reports MQTT connectivity, per-entity last-fix age and state (including STALE), last-pulse timestamp, persistence reachability, singleton-lock ownership, and session/cap state; POST /v1/session/reset clears a Tier-C cap-halt
The API binds only to a controlled interface (loopback/host LAN), never published to the public internet (FR-26)
Tests cover read, valid update, invalid update, health output, 401 unauthenticated, 403 read-only-on-mutate, and secret redaction
Lint passes
Typecheck passes
Tests pass in CI
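
The 401 versus 403 split can be sketched as a single decision function. The two-token scheme (one read-only token, one read-write token) and the name `authorize` are assumptions for illustration; the criteria also allow mTLS, which is not shown.

```python
import hmac
from typing import Optional

MUTATING_METHODS = {"PUT", "PATCH", "POST"}


def _matches(presented: str, expected: str) -> bool:
    # Constant-time comparison to avoid leaking token contents via timing.
    return hmac.compare_digest(presented.encode(), expected.encode())


def authorize(method: str, presented: Optional[str], *,
              read_token: str, write_token: str) -> int:
    """Return the HTTP status the request should receive before handling."""
    if presented is None:
        return 401  # unauthenticated
    can_write = _matches(presented, write_token)
    can_read = can_write or _matches(presented, read_token)
    if not can_read:
        return 401  # unknown credential
    if method.upper() in MUTATING_METHODS and not can_write:
        return 403  # authenticated, but read-only on a mutating endpoint
    return 200
```

In this sketch the tokens would come from injected secret env, never from the config blob.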

✅ US-013 — Alerting via Home Assistant notification topic

As the operator, I want deduped, runbook-linked alerts so I notice failures and anomalies without alert storms.

Publishes alerts to a configurable MQTT notify topic that HA can route to phones
Alerts fire on: cap reached, actuator publish failure, MQTT broker connection lost, persistence-write failure, quiet-hours pulse, and an API auth-failure spike (rate of sgt_api_auth_failures_total over a configurable threshold)
Each alert payload carries severity (page|notify), a stable alert_name, the reason code, and a runbook field pointing at the section 13 runbook
Dedupe: emit on transition into the failing state then at most once per configurable alert_repeat_interval (default 300s); a resolved event fires on recovery
Tests assert an alert is emitted for each trigger condition with required fields, a repeat within the interval is suppressed, and a resolve fires on recovery
Lint passes
Typecheck passes
Tests pass in CI
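
The dedupe rule (fire on transition, repeat at most once per interval, resolve on recovery) can be expressed as a small tracker. `AlertDeduper`, `observe`, and the returned event strings are hypothetical; building the full payload with severity, reason code, and runbook field is left out.

```python
from typing import Dict, Optional


class AlertDeduper:
    def __init__(self, alert_repeat_interval: float = 300.0) -> None:
        self._interval = alert_repeat_interval
        self._last_emit: Dict[str, float] = {}  # alert_name -> last emit time

    def observe(self, alert_name: str, failing: bool, now: float) -> Optional[str]:
        """Return "firing", "resolved", or None when nothing should be sent."""
        last = self._last_emit.get(alert_name)
        if failing:
            if last is None or now - last >= self._interval:
                self._last_emit[alert_name] = now
                return "firing"
            return None  # repeat inside the interval is suppressed
        if last is not None:
            del self._last_emit[alert_name]
            return "resolved"  # recovery fires exactly once
        return None
```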

✅ US-014 — State recovery on restart

As the system, I must behave safely if Home Assistant or the service restarts mid-session.

On startup, state is re-derived from the latest valid fixes; no in-flight pulse loop resumes blindly
The arrived/armed latch is persisted (US-003) so a restart while parked at home does not re-trigger
If no fresh location is available on restart, the system stays safe with no pulse (STALE unless a persisted ARRIVED latch applies)
A latch write failure does not crash the service: the in-memory latch holds, the failure is alerted, and on restart the system falls back to fix-derived state (geometry decides ARRIVED), never to re-pulsing a parked car (FR-22)
A persisted cap-halt marker is replayed on restart and cleared only by re-arm or POST /v1/session/reset
Tests simulate restart in each state (IDLE, ARMED, TRIGGERING, ARRIVED, STALE, cap-halted) including restart-after-latch-write-failure and restart-while-cap-halted
Lint passes
Typecheck passes
Tests pass in CI

✅ US-015 — Degraded-operation behavior

As the system, I must fail safe when dependencies are unavailable.

MQTT broker unreachable -> retry with backoff, stay IDLE, flip readiness not-ready, alert, never crash (broker-down must not fail liveness)
Location stream silent -> TRIGGERING contribution expires after location_freshness_seconds, entity enters STALE, no pulses from stale data; surfaced as a metric and (after grace) an alert
Actuator publish failure -> logged and alerted, retried within the pulse_interval budget, loop not wedged
Persistence unavailable / migration failure on boot -> readiness fails and alerts rather than running against an unmigrated schema
Data volume not writable by the service UID/GID (non-root + read-only rootfs) is detected at startup with a clear, actionable error rather than a crash loop
Duplicate instance -> the non-owner refuses to publish until it holds the lock; a test asserts a single deduped stream
Internet outage does not affect the local arrival flow (no cloud calls on the critical path)
Tests cover each degraded scenario
Lint passes
Typecheck passes
Tests pass in CI

✅ US-016 — End-to-end simulation/replay harness (CI completion gate)

As a developer, I need to replay GPS traces end-to-end so behavior is validated without hardware and serves as the integration gate.

Harness feeds timestamped fix sequences through ingest -> geometry -> state machine -> coordinator and records emitted pulses
Synthetic scenario fixtures with expected outcomes ship in the repo and are the CI gate (no external data): (a) normal arrival opens before arrival and stops after home; (b) drive-past opens then stops with no runaway; (c) two phones arriving together produce a single deduped stream; (d) phone battery dies mid-approach -> stale fixes ignored, safe; (e) dog walk through zone -> no trigger; (f) enter-then-leave quickly -> bounded pulses
An offline calibration mode replays recorded real GPS traces and grid-searches zone radii and min_trigger_speed_mps; it runs only when trace fixtures are present and is NOT a CI pass/fail gate
All synthetic scenario assertions pass in CI
Lint passes
Typecheck passes
Tests pass in CI

✅ US-017 — Home Assistant configuration artifacts and drift check

As the operator, I need the HA zones, high-accuracy automation, and broker ACL defined so the service receives timely location and the broker is scoped.

Documented HA zone definitions for arm/trigger/home, generated from or mechanically checked against the same gate_center/radius values used by the service config; drift is a release-blocking validation failure
Runtime API changes to gate_center/arm_radius either regenerate/validate the HA artifact or are staged/rejected with requires_ha_sync (consistent with US-012)
Documented HA automation that enables high-accuracy mode on arm-zone entry and disables it on exit
Documented MQTT publishing of per-entity location to gate/location/ and the broker ACL scoping this client to its topics (FR-25)
A validation test confirms the documented topic/payload shape matches what the ingest module (US-005) expects
Lint passes
Typecheck passes
Tests pass in CI

✅ US-018 — Containerization (Dockerfile, digest-pinned, non-root, read-only rootfs)

As a developer, I need a hardened container image so the service can deploy to the home-lab VM with verifiable provenance.

Dockerfile builds a runnable image from a base image pinned by digest (@sha256:...), not a floating tag (FR-31)
The image runs as a non-root UID with a read-only root filesystem (only the data volume writable)
The dependency lock file pins every dependency with a cryptographic hash (FR-31)
CI builds the image successfully
README documents the container run and configuration
Lint passes
Typecheck passes
Tests pass in CI

✅ US-019 — Hardened docker-compose deployment

As the operator, I need the compose deployment to enforce least privilege so a compromised container cannot reach the host, other containers, or the public internet.

compose service runs as non-root, read_only root filesystem (data volume + optional tmpfs writable), cap_drop [ALL], security_opt no-new-privileges:true, no privileged mode, and no mount of the Docker socket or host paths (FR-29)
API and /metrics ports are not published to 0.0.0.0/public internet; they bind to loopback or the internal compose network; egress on the critical path is limited to the broker and DNS; the host-firewall posture is documented (FR-30)
Credentials come from an env_file or Docker secret referenced by the compose file, never as literals or baked into the image; rotation is documented (FR-27)
The data volume is mounted writable by the service's non-root UID/GID (documented) so the read-only-rootfs container can write SQLite and the lockfile
restart: unless-stopped, a healthcheck targeting /readyz (status only; no autoheal restart-on-unhealthy in v1), and a log-retention/rotation policy (e.g., json-file max-size/max-file) are set
A CI check (docker compose config parse and/or hadolint) asserts non-root, read_only, cap_drop ALL, no-new-privileges, no docker.sock/host bind mount, the log-retention policy is present, and that API/metrics ports are NOT published on a wildcard bind (rejects bare port mappings and 0.0.0.0 / :: ; requires absent ports or an explicit non-wildcard host IP); it fails the workflow if any is missing
Lint passes
Typecheck passes
Tests pass in CI
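
The wildcard-bind rule in the CI check could be approximated by a helper like this one. It is a sketch, not the project's check: `is_wildcard_publish` is a hypothetical name, it handles only Compose short-syntax port strings (long-syntax mappings and the other hardening assertions are out of scope), and it fails closed on anything it cannot parse.

```python
import ipaddress


def is_wildcard_publish(mapping: str) -> bool:
    """True when a short-syntax port mapping would bind on all interfaces."""
    spec = mapping.split("/", 1)[0]  # drop an optional "/tcp" or "/udp" suffix
    try:
        if spec.startswith("["):
            # Bracketed IPv6 host, e.g. "[::1]:9090:9090".
            host_ip = spec[1:spec.index("]")]
        else:
            parts = spec.split(":")
            if len(parts) < 3:
                return True  # bare "8080" or "8080:8080" carries no host IP
            host_ip = ":".join(parts[:-2])
        # 0.0.0.0 and :: are the "unspecified" (wildcard) addresses.
        return ipaddress.ip_address(host_ip).is_unspecified
    except ValueError:
        return True  # unparseable host part: treat as a violation
```

A CI job could apply this to every entry under the service's ports key and fail the workflow if any call returns True.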

✅ US-020 — Failure-modes catalog, runbooks, and threat model docs

As the on-call homeowner, I need every failure mode and operator action documented so a 2 a.m. page is actionable.

docs/runbooks.md ships and covers every row of the section 12 failure-modes catalog; each runbook has precondition (the literal docker compose / curl /readyz / curl /v1/health / metric / log signal), actions, expected end state, and rollback
Includes a runbook for a data-volume permission/UID mismatch (container cannot write SQLite/lockfile): how to detect it and chown/fix the volume
Every alert emitted by US-013 carries a runbook field whose value resolves to an existing runbook section; a test asserts no alert references a missing runbook anchor
The cap-halt / POST /v1/session/reset runbook documents how to verify the car is actually home before clearing a halt
docs/security.md ships the STRIDE summary, the container-compromise blast-radius statement, and a security-incident runbook (spoofed location / unexplained quiet-hours pulse -> rotate credentials, review append-only audit)
Lint passes
Typecheck passes
Tests pass in CI (the alert->runbook link check)

✅ US-021 — Progress/docs site buildable for Cloudflare Pages

As the operator, I want a static docs/progress site that renders the PRD and live story progress so I can track the Ralph loop, deployable to Cloudflare Pages.

A static site builds from the repo (e.g., the PRD plus a progress view derived from prd.json / progress.txt / metrics.json) with a single documented build command (mirrors the nodewright build-tracker generator pattern: a generate-pages script + mkdocs)
The build journal renders each Ralph loop/iteration as an EXPLAINED entry, not just a pass/fail table row: story id + title + one-line goal, a plain-language 'what this loop did' narrative (sourced from progress.txt + the commit message body + the story description), the CI outcome with the specific failing lane and why, the attempt number, and what changed (commit link). The explained build journal is the primary view; the loop-stats table is secondary
An overview page summarizes the build at a glance: stories complete vs remaining, current loop, first-pass CI rate, and a link to the live PRD
The build runs in CI and fails the job if the site does not build
A Cloudflare Pages deploy configuration is present (project/build output dir documented); deploy credentials are referenced from secrets, never committed
README documents how the Ralph loop publishes progress after each iteration (the publish hook) and how the site is deployed
Lint passes
Typecheck passes
Tests pass in CI (site build check)