PRD Status
Project: Smart Gate Trigger Service
Branch: ralph/smart-gate-trigger
Scope: Self-hosted decision/orchestration service that decides when and how often to pulse a gate-open MQTT command (gate/cmd) based on phone location, so the gate is already open before an FSD vehicle reaches it. Runs as a single Docker/docker-compose service on a home-lab VM. Source of truth: tasks/prd-smart-gate-trigger-service.md (blessed by the 4-pass critic loop). Completion gate for every story: lint + typecheck + pytest green on GitHub Actions runners against a mock MQTT broker; no physical hardware.
Story progress
- ✅ Complete (passes == true): 0
- ⏳ Remaining: 21
- 📋 Total stories: 21
A story flips to complete only when the Ralph wrapper sees lint + typecheck + pytest green on GitHub Actions against a mock MQTT broker — no physical hardware.
All stories
| ID | Pri | Title | Status | Notes |
|---|---|---|---|---|
| US-001 | 1 | Project scaffold and GitHub Actions CI pipeline | ⏳ pending | Combines PRD US-001 scaffold + US-013 CI portion. Container build is deferred to US-018. |
| US-002 | 2 | Typed configuration model with validation and secret-by-reference | ⏳ pending | PRD US-001. |
| US-003 | 3 | Persistence layer with migrations and append-only audit | ⏳ pending | PRD US-015, moved early because US-012 (config API) and US-014 (recovery) depend on it. |
| US-004 | 4 | MQTT client wrapper with TLS, auth, reconnect, and retain=false | ⏳ pending | PRD US-002. |
| US-005 | 5 | Location ingest with freshness and accuracy guards | ⏳ pending | PRD US-003. |
| US-006 | 6 | Geometry module: zone membership, distance, and ground speed | ⏳ pending | PRD US-004. |
| US-007 | 7 | Per-entity state machine | ⏳ pending | PRD US-005. |
| US-008 | 8 | Singleton publish guard via flock | ⏳ pending | Extracted from PRD US-016 (singleton portion); the coordinator (US-009) depends on it. |
| US-009 | 9 | Pulse coordinator and actuator publisher | ⏳ pending | PRD US-006. |
| US-010 | 10 | Prometheus metrics and liveness/readiness probes | ⏳ pending | PRD US-016 (metrics + probes portion); placed before caps/alerting/API which reference its |
| US-011 | 11 | Safety caps and structured audit logging | ⏳ pending | PRD US-007. |
| US-012 | 12 | Runtime configuration and health API | ⏳ pending | PRD US-009. Depends on persistence (US-003), config model (US-002), probes (US-010), and c |
| US-013 | 13 | Alerting via Home Assistant notification topic | ⏳ pending | PRD US-008. Placed after the API (US-012) because it references the API auth-failure metri |
| US-014 | 14 | State recovery on restart | ⏳ pending | PRD US-010. |
| US-015 | 15 | Degraded-operation behavior | ⏳ pending | PRD US-011. |
| US-016 | 16 | End-to-end simulation/replay harness (CI completion gate) | ⏳ pending | PRD US-012. The synthetic fixtures are the primary behavioral completion gate. |
| US-017 | 17 | Home Assistant configuration artifacts and drift check | ⏳ pending | PRD US-014. |
| US-018 | 18 | Containerization (Dockerfile, digest-pinned, non-root, read-only rootfs) | ⏳ pending | PRD US-013 container portion. SBOM/cosign signing are out of scope per owner decision. |
| US-019 | 19 | Hardened docker-compose deployment | ⏳ pending | PRD US-018, retargeted from Kubernetes manifests to docker-compose. |
| US-020 | 20 | Failure-modes catalog, runbooks, and threat model docs | ⏳ pending | PRD US-017. |
| US-021 | 21 | Progress/docs site buildable for Cloudflare Pages | ⏳ pending | Mirrors the nodewright Cloudflare progress-tracking pattern. The live publish hook + deplo |
Acceptance criteria
Expand a story to see its full acceptance criteria from prd.json.
⏳ US-001 — Project scaffold and GitHub Actions CI pipeline
As a developer, I need a Python project skeleton and a CI pipeline so every later story is gated on green CI.
- Python package layout with a dependency/lock file and lint + type-check tooling configured (e.g., ruff + mypy)
- pytest configured with at least one trivial passing test to prove the suite runs
- GitHub Actions workflow runs lint, typecheck, and pytest on push and pull_request and fails the job if any step fails
- A mock/in-process MQTT broker fixture is available to the test suite (no external broker)
- README documents local run and how to execute the test suite
- Lint passes
- Typecheck passes
- Tests pass in CI (GitHub Actions)
⏳ US-002 — Typed configuration model with validation and secret-by-reference
As a developer, I need a validated, typed config model so all tunables live in one place and unsafe configs are rejected.
- Pydantic config model with canonical SI units: zone radii (arm/trigger/home) in meters, gate_center in decimal degrees, pulse_interval, auto_close_window (observed minimum), max_pulses_per_session, max_session_duration, min_trigger_speed_mps (default ~3.6 = 8 mph), location_freshness_seconds, max_gps_accuracy_meters, gate_open_time_seconds, max_approach_speed_mps, system_latency_seconds, tracked entities, MQTT connection + topics, quiet-hours window with an explicit IANA timezone
- Config loads from file with environment-variable overrides; invalid config fails fast with a clear error
- Secret-typed fields (MQTT credentials, API tokens, TLS material) are supplied by reference from env (.env / Docker secret), never inlined or persisted, and a redaction helper masks them
- Rejects config where pulse_interval is not strictly less than auto_close_window
- Emits a warning when configured trigger radius < (gate_open_time_seconds + system_latency_seconds) * max_approach_speed_mps (FR-17)
- Emits a warning when home radius <= max_gps_accuracy_meters (FR-19)
- Tests cover valid load, env override, fail-fast on invalid, both warnings, the pulse_interval rejection, and secret redaction
- Lint passes
- Typecheck passes
- Tests pass in CI
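
The validation rules above can be sketched without any third-party dependency. This is a minimal illustration using a stdlib dataclass as a stand-in for the Pydantic model the story calls for; the field names `trigger_radius_m` and `home_radius_m` are hypothetical, and only a subset of the tunables is shown.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    """Subset of the US-002 tunables, all in SI units (sketch only)."""

    pulse_interval: float            # seconds
    auto_close_window: float         # seconds, observed minimum
    trigger_radius_m: float          # hypothetical field name
    home_radius_m: float             # hypothetical field name
    max_gps_accuracy_meters: float
    gate_open_time_seconds: float
    system_latency_seconds: float
    max_approach_speed_mps: float

    def __post_init__(self) -> None:
        # Hard rejection: a re-pulse must land before the gate auto-closes.
        if not self.pulse_interval < self.auto_close_window:
            raise ValueError(
                "pulse_interval must be strictly less than auto_close_window"
            )
        # FR-17: distance covered while the gate opens at worst-case speed.
        min_radius = (
            self.gate_open_time_seconds + self.system_latency_seconds
        ) * self.max_approach_speed_mps
        if self.trigger_radius_m < min_radius:
            warnings.warn(
                f"trigger radius {self.trigger_radius_m} m is below the "
                f"{min_radius} m needed to open the gate in time (FR-17)"
            )
        # FR-19: a home zone no larger than GPS error cannot be trusted.
        if self.home_radius_m <= self.max_gps_accuracy_meters:
            warnings.warn(
                "home radius is not larger than max_gps_accuracy_meters (FR-19)"
            )
```

With a 12 s open time, 3 s latency, and a 15 m/s approach speed, any trigger radius below 225 m would produce the FR-17 warning.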
⏳ US-003 — Persistence layer with migrations and append-only audit
As a developer, I need a SQLite-backed repository with migrations so config, latches, and audit rows persist behind a swappable interface.
- SQLite-backed persistence with a versioned migration framework; forward migrations applied automatically on startup
- A repository interface abstracts storage so callers depend on the interface, not SQLite specifics
- Persists runtime-tunable config, the per-entity arrived/armed latch, and the Tier-C cap-halt marker
- Audit rows are append-only: the repository interface exposes no update or delete path for audit rows (FR-28)
- Tests cover migration apply on a fresh DB, repository read/write round-trips, and that audit rows cannot be updated/deleted via the interface
- Lint passes
- Typecheck passes
- Tests pass in CI
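
One way to make the append-only property concrete is an interface with no update or delete method, backed by SQLite. The sketch below assumes a hypothetical three-column `audit` table and omits the migration framework entirely. The triggers are an extra safeguard that goes beyond what the story requires.

```python
import sqlite3
from typing import List, Protocol, Tuple

AuditRow = Tuple[int, float, str, str]


class AuditLog(Protocol):
    """Callers depend on this interface; it has no update or delete path."""

    def append(self, ts: float, entity: str, reason: str) -> int: ...
    def rows(self) -> List[AuditRow]: ...


class SqliteAuditLog:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        # Hypothetical schema. The triggers reject UPDATE/DELETE even if
        # someone bypasses the interface and issues raw SQL.
        self._conn.executescript(
            """
            CREATE TABLE IF NOT EXISTS audit (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                ts REAL NOT NULL,
                entity TEXT NOT NULL,
                reason TEXT NOT NULL
            );
            CREATE TRIGGER IF NOT EXISTS audit_no_update
            BEFORE UPDATE ON audit
            BEGIN SELECT RAISE(ABORT, 'audit rows are append-only'); END;
            CREATE TRIGGER IF NOT EXISTS audit_no_delete
            BEFORE DELETE ON audit
            BEGIN SELECT RAISE(ABORT, 'audit rows are append-only'); END;
            """
        )

    def append(self, ts: float, entity: str, reason: str) -> int:
        cur = self._conn.execute(
            "INSERT INTO audit (ts, entity, reason) VALUES (?, ?, ?)",
            (ts, entity, reason),
        )
        self._conn.commit()
        return int(cur.lastrowid)

    def rows(self) -> List[AuditRow]:
        return self._conn.execute(
            "SELECT id, ts, entity, reason FROM audit ORDER BY id"
        ).fetchall()
```

A test for the story could assert both that the class exposes no `update`/`delete` attribute and that a raw `DELETE` raises `sqlite3.Error`.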
⏳ US-004 — MQTT client wrapper with TLS, auth, reconnect, and retain=false
As a developer, I need a resilient, authenticated MQTT client so the service survives broker restarts and cannot leak or replay commands.
- A single asyncio-capable MQTT client library is chosen and pinned to an exact hashed version in the lock file
- Connect/subscribe/publish helpers with automatic reconnect and backoff
- Publishes to gate/cmd force retain=false; a test asserts no retained actuator command can be emitted (FR-1)
- TLS (validating the broker CA) and per-client authentication are required by default; an explicit opt-out flag is allowed only for the mock-broker CI path and emits a startup warning (FR-25)
- Credentials are read from injected secret env, never from the config blob
- Tests cover: connect against mock broker, reconnect after broker drop, refusal to connect on TLS-verify failure (unless CI opt-out), and retain=false enforcement
- Lint passes
- Typecheck passes
- Tests pass in CI
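
The retain=false rule and the reconnect backoff can be illustrated independently of whichever MQTT library is eventually chosen. In this sketch `Publisher` is a hypothetical protocol standing in for the real client, and the backoff parameters are illustrative defaults rather than values from the PRD.

```python
from typing import Iterator, Protocol


class Publisher(Protocol):
    """Hypothetical minimal surface of the underlying MQTT client."""

    def publish(self, topic: str, payload: bytes, retain: bool) -> None: ...


class ActuatorPublisher:
    """Wrapper that makes a retained actuator command impossible."""

    def __init__(self, inner: Publisher, actuator_topic: str = "gate/cmd") -> None:
        self._inner = inner
        self._actuator_topic = actuator_topic

    def publish(self, topic: str, payload: bytes, retain: bool = False) -> None:
        if topic == self._actuator_topic:
            # FR-1: force retain=False regardless of what the caller asked for.
            retain = False
        self._inner.publish(topic, payload, retain)


def backoff_delays(base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield exponentially growing reconnect delays, capped at `cap` seconds."""
    delay = base
    while True:
        yield delay
        delay = min(cap, delay * 2)
```

Forcing the flag inside the wrapper, instead of trusting each call site, is what lets a single test assert that no retained command can be emitted.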
⏳ US-005 — Location ingest with freshness and accuracy guards
As the system, I must reject unreliable location fixes so stale or drifting GPS cannot cause bad decisions.
- Parses incoming MQTT location messages (gate/location/) into a normalized fix (entity, lat, lon, gps_accuracy, timestamp, optional speed)
- Rejects fixes older than location_freshness_seconds (dead/sleeping phone)
- Rejects fixes with accuracy worse than max_gps_accuracy_meters (GPS drift)
- Drops malformed/partial payloads without crashing
- Tests cover fresh+accurate (accept), stale (reject), inaccurate (reject), and malformed (drop)
- Lint passes
- Typecheck passes
- Tests pass in CI
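
The three guards above (malformed, stale, inaccurate) can be sketched as a single parse function that returns `None` for anything it rejects. The JSON payload shape and the epoch-seconds timestamp are assumptions here; the story only names the normalized fields. The entity is passed in by the caller, which would derive it from the topic.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fix:
    entity: str
    lat: float
    lon: float
    gps_accuracy: float          # meters
    timestamp: float             # assumed: Unix epoch seconds
    speed: Optional[float] = None


def parse_fix(
    entity: str,
    payload: bytes,
    now: float,
    location_freshness_seconds: float,
    max_gps_accuracy_meters: float,
) -> Optional[Fix]:
    """Return a normalized Fix, or None if the message must be ignored."""
    try:
        data = json.loads(payload)
        fix = Fix(
            entity=entity,
            lat=float(data["lat"]),
            lon=float(data["lon"]),
            gps_accuracy=float(data["gps_accuracy"]),
            timestamp=float(data["timestamp"]),
            speed=float(data["speed"]) if data.get("speed") is not None else None,
        )
    except (ValueError, KeyError, TypeError):
        return None  # malformed or partial payload: drop, never crash
    if now - fix.timestamp > location_freshness_seconds:
        return None  # stale: dead or sleeping phone
    if fix.gps_accuracy > max_gps_accuracy_meters:
        return None  # inaccurate: GPS drift
    return fix
```

A production version would likely return a rejection reason instead of `None` so the audit log and metrics can distinguish stale from inaccurate fixes.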
⏳ US-006 — Geometry module: zone membership, distance, and ground speed
As the system, I need geometry primitives so I can decide which zone an entity is in and how fast it is moving.
- Haversine distance and circular zone membership (arm/trigger/home) computed from raw coordinates
- Ground speed computed from consecutive fixes; an explicit speed attribute is used directly when present; speed requires >= 2 fresh consecutive fixes, otherwise reported as unknown (not zero/guessed)
- The is-point-in-zone check is behind an interface so polygon/corridor implementations can be added later without changing callers
- Range-rate is NOT computed in v1 (deferred to future garage intent)
- Tests use known coordinate fixtures with expected zone membership, distance, and speed/unknown results
- Lint passes
- Typecheck passes
- Tests pass in CI
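
The geometry primitives are small enough to sketch in full. The example below is an illustration, not the project's module: the `Zone` protocol, `CircleZone`, and the `(lat, lon, timestamp)` tuple shape are hypothetical names, and the caller is assumed to have already filtered out stale fixes.

```python
import math
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two decimal-degree points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


class Zone(Protocol):
    """Interface so polygon/corridor zones can be added without changing callers."""

    def contains(self, lat: float, lon: float) -> bool: ...


@dataclass(frozen=True)
class CircleZone:
    center_lat: float
    center_lon: float
    radius_m: float

    def contains(self, lat: float, lon: float) -> bool:
        return haversine_m(self.center_lat, self.center_lon, lat, lon) <= self.radius_m


TimedPoint = Tuple[float, float, float]  # (lat, lon, timestamp in seconds)


def ground_speed_mps(
    prev: Optional[TimedPoint],
    curr: TimedPoint,
    reported_speed: Optional[float] = None,
) -> Optional[float]:
    """Speed in m/s, or None when it is unknown (never zero or guessed)."""
    if reported_speed is not None:
        return reported_speed  # an explicit speed attribute is used directly
    if prev is None:
        return None  # fewer than two consecutive fixes
    dt = curr[2] - prev[2]
    if dt <= 0:
        return None  # out-of-order or duplicate timestamps
    return haversine_m(prev[0], prev[1], curr[0], curr[1]) / dt
```

One degree of longitude at the equator is about 111.19 km under this radius, which makes a convenient fixture for the distance test.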
⏳ US-007 — Per-entity state machine
As the system, I need a per-entity state machine so each phone independently progresses IDLE->ARMED->TRIGGERING->ARRIVED/STALE and re-arms correctly.
- Implements IDLE, ARMED, TRIGGERING, ARRIVED, STALE with the transitions in PRD section 4
- min_trigger_speed_mps gates only the initial entry to TRIGGERING; unknown speed must not enter TRIGGERING (fail-safe)
- Trigger-zone latch: once TRIGGERING, stays until ARRIVED, trigger-zone exit, or fix expiry, ignoring momentary speed dips
- STALE withdrawal: no fresh fix for an entity (age > location_freshness_seconds) stops its pulse contribution; arrived/armed latches behave per PRD
- Re-arm only after the entity exits the arm zone
- Tests assert: normal arrival, drive-past (no home, bounded), arrive-and-park (stop, no re-fire), pedestrian/low-speed (no trigger), slow/stop-in-zone after triggering (stays TRIGGERING), first in-zone fix with unknown speed (no trigger), and silence-while-TRIGGERING (enters STALE, stops)
- Lint passes
- Typecheck passes
- Tests pass in CI
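
The criteria above can be read as a pure transition function. The sketch below is one possible interpretation, not the PRD section 4 transition table, which is not reproduced on this page. It assumes a hypothetical `Band` enum giving the innermost zone containing the fix, and it assumes that leaving the trigger zone while still inside the arm zone drops the entity back to ARMED.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "IDLE"
    ARMED = "ARMED"
    TRIGGERING = "TRIGGERING"
    ARRIVED = "ARRIVED"
    STALE = "STALE"


class Band(Enum):
    """Hypothetical: innermost zone that contains the current fix."""

    OUTSIDE = "outside"
    ARM = "arm"
    TRIGGER = "trigger"
    HOME = "home"


def next_state(
    state: State,
    band: Band,
    speed_mps: Optional[float],
    fresh: bool,
    min_trigger_speed_mps: float = 3.6,
) -> State:
    if not fresh:
        # No fresh fix: stop contributing pulses; an ARRIVED latch is kept.
        return State.ARRIVED if state is State.ARRIVED else State.STALE
    if band is Band.OUTSIDE:
        return State.IDLE  # leaving the arm zone is the only re-arm path
    if state is State.ARRIVED or band is Band.HOME:
        return State.ARRIVED  # latched until the entity exits the arm zone
    if band is Band.ARM:
        return State.ARMED  # assumption: trigger-zone exit falls back to ARMED
    # band is TRIGGER from here on.
    if state is State.TRIGGERING:
        return State.TRIGGERING  # latch: momentary speed dips are ignored
    if speed_mps is not None and speed_mps >= min_trigger_speed_mps:
        return State.TRIGGERING  # the speed gate applies to initial entry only
    return State.ARMED  # too slow or unknown speed: fail safe, no trigger
```

Keeping the function pure (no clock, no I/O) makes each scenario in the test list a one-line assertion.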
⏳ US-008 — Singleton publish guard via flock
As the operator, I need at most one process able to publish gate/cmd so a duplicate instance cannot double-pulse the gate.
- Before publishing to gate/cmd, a process acquires an exclusive flock advisory lock on /singleton.lock (FR-20)
- The lock is held continuously for the lifetime of the publish-capable coordinator on a single open file descriptor (not acquired/released per pulse), and auto-releases on fd close, process exit, or SIGKILL so a successor acquires it immediately (no heartbeat/lease)
- The lockfile lives on a LOCAL data volume; non-local or synced filesystems (NFS/SMB/9p/FUSE) are unsupported for the lock and are rejected or warned at startup
- A non-owner must refuse to publish to gate/cmd until it acquires the lock
- Exposes the singleton ownership state for metrics/health (sgt_singleton_owner source)
- Tests assert two concurrently-started coordinators yield exactly one publisher (single stream, not doubled), and that killing the lock holder (SIGKILL / fd close) lets the waiter take over
- Lint passes
- Typecheck passes
- Tests pass in CI
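
A minimal sketch of the lock holder, using the stdlib `fcntl` module (Unix only). The class name and the caller-supplied `path` are hypothetical; the filesystem check for NFS/SMB/9p/FUSE is not shown. The lock lives on one file descriptor that stays open, so the kernel releases it whenever that descriptor goes away.

```python
import fcntl
import os
from typing import Optional


class SingletonLock:
    """Holds an exclusive flock on one fd for the life of the coordinator."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._fd: Optional[int] = None

    @property
    def is_owner(self) -> bool:
        return self._fd is not None

    def try_acquire(self) -> bool:
        """Non-blocking attempt; returns True if this instance owns the lock."""
        if self._fd is not None:
            return True
        fd = os.open(self._path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)  # someone else holds it; retry later
            return False
        self._fd = fd  # keep the fd open: closing it would release the lock
        return True

    def close(self) -> None:
        """Release by closing the fd (also happens on exit or SIGKILL)."""
        if self._fd is not None:
            os.close(self._fd)
            self._fd = None
```

A non-owner would call `try_acquire()` on a timer and refuse to publish while `is_owner` is false, which keeps startup and the health endpoints unblocked.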
⏳ US-009 — Pulse coordinator and actuator publisher
As the system, I must keep the gate open across the approach by re-pulsing under the auto-close timer, deduped across both phones.
- Keeps the gate open while any entity is TRIGGERING with a non-expired fix; re-pulses every pulse_interval
- pulse_interval is enforced strictly less than auto_close_window (rejected at config validation)
- Two entities triggering simultaneously produce a single deduped command stream, not a doubled rate (FR-8)
- Publishes pulse to the configured gate/cmd topic with retain=false, only when this process holds the singleton lock (US-008)
- Pulsing stops when no entity is triggering
- Tests assert pulse cadence under the auto-close window, dedupe across entities, and stop-when-none-triggering
- Lint passes
- Typecheck passes
- Tests pass in CI
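
Deduplication falls out naturally if the coordinator tracks the set of triggering entities and keeps a single pulse timer for the gate. The following sketch is one way to express that, with a hypothetical `tick` method that the caller invokes with the current time; it returns whether a pulse should be published now.

```python
from typing import Optional, Set


class PulseCoordinator:
    """One pulse timer for the gate, however many entities are triggering."""

    def __init__(self, pulse_interval: float) -> None:
        self.pulse_interval = pulse_interval
        self._triggering: Set[str] = set()
        self._last_pulse: Optional[float] = None

    def set_triggering(self, entity: str, triggering: bool) -> None:
        if triggering:
            self._triggering.add(entity)
        else:
            self._triggering.discard(entity)

    def tick(self, now: float, is_owner: bool) -> bool:
        """Return True when a gate/cmd pulse is due at time `now`."""
        if not self._triggering:
            self._last_pulse = None  # next session pulses immediately
            return False
        if not is_owner:
            return False  # only the singleton lock holder may publish
        if self._last_pulse is None or now - self._last_pulse >= self.pulse_interval:
            self._last_pulse = now
            return True
        return False
```

With two phones triggering and a 5 s interval, ticking once per second from t=0 to t=10 yields pulses at 0, 5, and 10, the same cadence as a single phone.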
⏳ US-010 — Prometheus metrics and liveness/readiness probes
As the operator, I need a metrics registry, a /metrics endpoint, and livez/readyz probes so the service is observable and a Docker healthcheck can target it.
- A metrics registry and a Prometheus /metrics endpoint expose the metrics whose sources exist so far (e.g., sgt_build_info, sgt_mqtt_connected, sgt_location_fixes_total, sgt_entity_state, sgt_pulses_emitted_total, sgt_decision_latency_seconds, sgt_singleton_owner) with the labels in PRD section 11.1; later stories register their own metrics on this registry
- livez reflects only that the process/event loop is up (never depends on broker/persistence/lock)
- readyz is ready only when broker connected AND persistence reachable AND migrations applied AND singleton lock held (FR-21)
- A non-owner instance still serves livez, readyz, /metrics, and /v1/health while it retries lock acquisition — lock contention must not block startup or look like a dead process
- Tests assert /metrics exposes the expected metric names, and readyz goes not-ready when the broker mock is down, persistence is unreachable, or the lock is not held, then recovers
- Lint passes
- Typecheck passes
- Tests pass in CI
⏳ US-011 — Safety caps and structured audit logging
As the operator, I need hard limits and a full decision trail so the system can never runaway-pulse and I can answer why the gate opened.
- Enforces max_pulses_per_session and max_session_duration; loop halts when either is hit
- A cap-halt persists a Tier-C halted_session marker (FR-24), increments sgt_cap_halts_total, and is not silently resumed on restart (clears only on re-arm or operator reset)
- Every decision and pulse emits a structured log entry: ts, entity, state_from, state_to, lat, lon, gps_accuracy, speed (or unknown), zone, reason, session_id, pulse_seq, quiet_hours
- Pulses within the quiet-hours window are flagged, evaluated in the configured IANA timezone (DST-aware)
- Each audit event is appended via the append-only repository (US-003) and also written to stdout (FR-28); the stdout copy requires a documented Docker log-retention/rotation config (e.g., json-file max-size/max-file) and is scoped as a pragmatic single-home repudiation control, not tamper-proof against host-root/Docker-daemon compromise
- Tests assert caps halt the loop, the persisted marker is set, audit entries contain the required fields and reason codes, and quiet-hours flagging uses the configured timezone
- Lint passes
- Typecheck passes
- Tests pass in CI
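
The cap check and the quiet-hours flag are both small pure computations. The sketch below uses hypothetical names (`SessionCaps`, `in_quiet_hours`) and leaves out persistence of the halted marker and the metric increment. The `tz` argument would be a `zoneinfo.ZoneInfo` built from the configured IANA name in practice, which is what makes the check DST-aware.

```python
from dataclasses import dataclass
from datetime import datetime, time, tzinfo


@dataclass
class SessionCaps:
    max_pulses_per_session: int
    max_session_duration: float  # seconds
    started_at: float            # session start, seconds
    pulses: int = 0
    halted: bool = False

    def allow_pulse(self, now: float) -> bool:
        """Count and allow a pulse, or halt the session when a cap is hit."""
        if self.halted:
            return False
        over_count = self.pulses >= self.max_pulses_per_session
        over_time = now - self.started_at >= self.max_session_duration
        if over_count or over_time:
            self.halted = True  # caller persists the marker and alerts
            return False
        self.pulses += 1
        return True


def in_quiet_hours(ts: datetime, start: time, end: time, tz: tzinfo) -> bool:
    """True if timezone-aware `ts` falls in [start, end) local time in `tz`."""
    local = ts.astimezone(tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end  # window wraps past midnight
```

Once `halted` is set it stays set, matching the rule that a cap-halt is never silently resumed.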
⏳ US-012 — Runtime configuration and health API
As the operator (and a future C2 site), I want an authenticated REST API to read/update config at runtime and check health.
- Versioned endpoints GET/PUT/PATCH /v1/config, GET /v1/health, POST /v1/session/reset; only /livez, /readyz, /metrics are unversioned
- Every endpoint requires authentication (bearer token or mTLS, credential from injected secret env); unauthenticated requests get 401; mutating endpoints are authorized separately from reads (read-only token on a mutating endpoint gets 403)
- GET /v1/config returns effective config with all secret-typed fields redacted (FR-27)
- Config fields are classified hot-mutable / external-artifact-coupled (returns requires_ha_sync) / connection-level (controlled reconnect); the response indicates which class applied and whether a reconnect/HA sync happened
- Every config mutation writes an audit record (source, field-level before->after diff, reconnect flag) (FR-23); invalid updates are rejected and do not partially apply
- GET /v1/health reports MQTT connectivity, per-entity last-fix age and state (incl STALE), last-pulse ts, persistence reachability, singleton-lock ownership, and session/cap state; POST /v1/session/reset clears a Tier-C cap-halt
- The API binds only to a controlled interface (loopback/host LAN), never published to the public internet (FR-26)
- Tests cover read, valid update, invalid update, health output, 401 unauthenticated, 403 read-only-on-mutate, and secret redaction
- Lint passes
- Typecheck passes
- Tests pass in CI
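
The 401/403 split can be sketched as a framework-independent decision function. This assumes a hypothetical two-token scheme (one read-only token, one read-write token); the story allows bearer tokens or mTLS and does not prescribe this shape.

```python
import hmac
from typing import Optional

MUTATING_METHODS = {"PUT", "PATCH", "POST", "DELETE"}


def authorize(
    method: str,
    presented_token: Optional[str],
    read_token: str,
    write_token: str,
) -> int:
    """Return the HTTP status the API should answer with: 200, 401, or 403."""
    if presented_token is None:
        return 401  # no credential at all
    presented = presented_token.encode()
    # Constant-time comparisons avoid leaking token contents through timing.
    is_write = hmac.compare_digest(presented, write_token.encode())
    is_read = hmac.compare_digest(presented, read_token.encode())
    if not (is_write or is_read):
        return 401  # unknown credential
    if method.upper() in MUTATING_METHODS and not is_write:
        return 403  # authenticated, but not authorized to mutate
    return 200
```

For example, a read-only token on `PATCH /v1/config` yields 403, while the same token on `GET /v1/config` yields 200.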
⏳ US-013 — Alerting via Home Assistant notification topic
As the operator, I want deduped, runbook-linked alerts so I notice failures and anomalies without alert storms.
- Publishes alerts to a configurable MQTT notify topic that HA can route to phones
- Alerts fire on: cap reached, actuator publish failure, MQTT broker connection lost, persistence-write failure, quiet-hours pulse, and an API auth-failure spike (rate of sgt_api_auth_failures_total over a configurable threshold)
- Each alert payload carries severity (page|notify), a stable alert_name, the reason code, and a runbook field pointing at the section 13 runbook
- Dedupe: emit on transition into the failing state then at most once per configurable alert_repeat_interval (default 300s); a resolved event fires on recovery
- Tests assert an alert is emitted for each trigger condition with required fields, a repeat within the interval is suppressed, and a resolve fires on recovery
- Lint passes
- Typecheck passes
- Tests pass in CI
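
The dedupe rule (fire on transition, repeat at most once per interval, resolve on recovery) can be sketched as a small stateful helper. `AlertDeduper` and the `"firing"` / `"resolved"` return values are hypothetical names for illustration; building the actual payload with severity and runbook fields is left to the caller.

```python
from typing import Dict, Optional


class AlertDeduper:
    """Decides whether an observation of an alert condition should emit."""

    def __init__(self, alert_repeat_interval: float = 300.0) -> None:
        self.alert_repeat_interval = alert_repeat_interval
        self._last_emit: Dict[str, float] = {}  # alert_name -> last emit time

    def observe(self, alert_name: str, failing: bool, now: float) -> Optional[str]:
        """Return "firing", "resolved", or None when nothing should be sent."""
        last = self._last_emit.get(alert_name)
        if failing:
            if last is None or now - last >= self.alert_repeat_interval:
                self._last_emit[alert_name] = now
                return "firing"
            return None  # repeat inside the interval is suppressed
        if last is not None:
            del self._last_emit[alert_name]
            return "resolved"  # recovery after a firing alert
        return None
```

Keying the state by the stable `alert_name` means independent conditions never suppress each other.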
⏳ US-014 — State recovery on restart
As the system, I must behave safely if Home Assistant or the service restarts mid-session.
- On startup, state is re-derived from the latest valid fixes; no in-flight pulse loop resumes blindly
- The arrived/armed latch is persisted (US-003) so a restart while parked at home does not re-trigger
- If no fresh location is available on restart, the system stays safe with no pulse (STALE unless a persisted ARRIVED latch applies)
- A latch write failure does not crash the service: the in-memory latch holds, the failure is alerted, and on restart the system falls back to fix-derived state (geometry decides ARRIVED), never to re-pulsing a parked car (FR-22)
- A persisted cap-halt marker is replayed on restart and cleared only by re-arm or POST /v1/session/reset
- Tests simulate restart in each state (IDLE, ARMED, TRIGGERING, ARRIVED, STALE, cap-halted) including restart-after-latch-write-failure and restart-while-cap-halted
- Lint passes
- Typecheck passes
- Tests pass in CI
⏳ US-015 — Degraded-operation behavior
As the system, I must fail safe when dependencies are unavailable.
- MQTT broker unreachable -> retry with backoff, stay IDLE, flip readiness not-ready, alert, never crash (broker-down must not fail liveness)
- Location stream silent -> TRIGGERING contribution expires after location_freshness_seconds, entity enters STALE, no pulses from stale data; surfaced as a metric and (after grace) an alert
- Actuator publish failure -> logged and alerted, retried within the pulse_interval budget, loop not wedged
- Persistence unavailable / migration failure on boot -> readiness fails and alerts rather than running against an unmigrated schema
- Data volume not writable by the service UID/GID (non-root + read-only rootfs) is detected at startup with a clear, actionable error rather than a crash loop
- Duplicate instance -> the non-owner refuses to publish until it holds the lock; a test asserts a single deduped stream
- Internet outage does not affect the local arrival flow (no cloud calls on the critical path)
- Tests cover each degraded scenario
- Lint passes
- Typecheck passes
- Tests pass in CI
⏳ US-016 — End-to-end simulation/replay harness (CI completion gate)
As a developer, I need to replay GPS traces end-to-end so behavior is validated without hardware and serves as the integration gate.
- Harness feeds timestamped fix sequences through ingest -> geometry -> state machine -> coordinator and records emitted pulses
- Synthetic scenario fixtures with expected outcomes ship in the repo and are the CI gate (no external data): (a) normal arrival opens before arrival and stops after home; (b) drive-past opens then stops with no runaway; (c) two phones arriving together produce a single deduped stream; (d) phone battery dies mid-approach -> stale fixes ignored, safe; (e) dog walk through zone -> no trigger; (f) enter-then-leave quickly -> bounded pulses
- An offline calibration mode replays recorded real GPS traces and grid-searches zone radii and min_trigger_speed_mps; it runs only when trace fixtures are present and is NOT a CI pass/fail gate
- All synthetic scenario assertions pass in CI
- Lint passes
- Typecheck passes
- Tests pass in CI
⏳ US-017 — Home Assistant configuration artifacts and drift check
As the operator, I need the HA zones, high-accuracy automation, and broker ACL defined so the service receives timely location and the broker is scoped.
- Documented HA zone definitions for arm/trigger/home, generated from or mechanically checked against the same gate_center/radius values used by the service config; drift is a release-blocking validation failure
- Runtime API changes to gate_center/arm_radius either regenerate/validate the HA artifact or are staged/rejected with requires_ha_sync (consistent with US-012)
- Documented HA automation that enables high-accuracy mode on arm-zone entry and disables it on exit
- Documented MQTT publishing of per-entity location to gate/location/ and the broker ACL scoping this client to its topics (FR-25)
- A validation test confirms the documented topic/payload shape matches what the ingest module (US-005) expects
- Lint passes
- Typecheck passes
- Tests pass in CI
⏳ US-018 — Containerization (Dockerfile, digest-pinned, non-root, read-only rootfs)
As a developer, I need a hardened container image so the service can deploy to the home-lab VM with verifiable provenance.
- Dockerfile builds a runnable image from a base image pinned by digest (@sha256:...), not a floating tag (FR-31)
- The image runs as a non-root UID with a read-only root filesystem (only the data volume writable)
- The dependency lock file pins every dependency with a cryptographic hash (FR-31)
- CI builds the image successfully
- README documents the container run and configuration
- Lint passes
- Typecheck passes
- Tests pass in CI
⏳ US-019 — Hardened docker-compose deployment
As the operator, I need the compose deployment to enforce least privilege so a compromised container cannot reach the host, other containers, or the public internet.
- compose service runs as non-root, read_only root filesystem (data volume + optional tmpfs writable), cap_drop [ALL], security_opt no-new-privileges:true, no privileged mode, and no mount of the Docker socket or host paths (FR-29)
- API and /metrics ports are not published to 0.0.0.0/public internet; they bind to loopback or the internal compose network; egress on the critical path is limited to the broker and DNS; the host-firewall posture is documented (FR-30)
- Credentials come from an env_file or Docker secret referenced by the compose file, never as literals or baked into the image; rotation is documented (FR-27)
- The data volume is mounted writable by the service's non-root UID/GID (documented) so the read-only-rootfs container can write SQLite and the lockfile
- restart: unless-stopped, a healthcheck targeting /readyz (status only; no autoheal restart-on-unhealthy in v1), and a log-retention/rotation policy (e.g., json-file max-size/max-file) are set
- A CI check (docker compose config parse and/or hadolint) asserts non-root, read_only, cap_drop ALL, no-new-privileges, no docker.sock/host bind mount, the log-retention policy is present, and that API/metrics ports are NOT published on a wildcard bind (rejects bare port mappings and 0.0.0.0 / :: ; requires absent ports or an explicit non-wildcard host IP); it fails the workflow if any is missing
- Lint passes
- Typecheck passes
- Tests pass in CI
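
The CI check described above could operate on a single service mapping taken from the parsed compose file. The sketch below is an assumption about how such a check might look: `hardening_violations` is a hypothetical helper, loading the YAML is left to the caller, and the docker.sock test is a simple substring match that does not detect arbitrary host bind mounts.

```python
from typing import Any, Dict, List

WILDCARD_HOSTS = {"", "0.0.0.0", "::"}


def _host_ip(port: Any) -> str:
    """Extract the host IP from a compose port entry ("" means none given)."""
    if isinstance(port, dict):  # long syntax
        return str(port.get("host_ip", ""))
    spec = str(port).split("/", 1)[0]  # drop any "/tcp" or "/udp" suffix
    if spec.startswith("["):  # bracketed IPv6, e.g. "[::1]:8080:8080"
        return spec[1:].partition("]")[0]
    parts = spec.split(":")
    return parts[0] if len(parts) == 3 else ""  # "ip:host:container" only


def hardening_violations(service: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the check passes."""
    problems: List[str] = []
    if str(service.get("user", "root")).split(":")[0] in ("root", "0"):
        problems.append("service runs as root")
    if service.get("read_only") is not True:
        problems.append("read_only is not true")
    if "ALL" not in service.get("cap_drop", []):
        problems.append("cap_drop does not include ALL")
    if "no-new-privileges:true" not in service.get("security_opt", []):
        problems.append("no-new-privileges is not set")
    if service.get("privileged"):
        problems.append("privileged mode is enabled")
    if any("docker.sock" in str(v) for v in service.get("volumes", [])):
        problems.append("docker.sock is mounted")
    if "logging" not in service:
        problems.append("no log-retention policy")
    for port in service.get("ports", []):
        if _host_ip(port) in WILDCARD_HOSTS:
            problems.append(f"port published on a wildcard bind: {port}")
    return problems
```

A bare mapping such as `"8080:8080"` is flagged because Docker binds it on all interfaces by default, whereas `"127.0.0.1:8080:8080"` passes.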
⏳ US-020 — Failure-modes catalog, runbooks, and threat model docs
As the on-call homeowner, I need every failure mode and operator action documented so a 2 a.m. page is actionable.
- docs/runbooks.md ships and covers every row of the section 12 failure-modes catalog; each runbook has precondition (the literal docker compose / curl /readyz / curl /v1/health / metric / log signal), actions, expected end state, and rollback
- Includes a runbook for a data-volume permission/UID mismatch (container cannot write SQLite/lockfile): how to detect it and chown/fix the volume
- Every alert emitted by US-013 carries a runbook field whose value resolves to an existing runbook section; a test asserts no alert references a missing runbook anchor
- The cap-halt / POST /v1/session/reset runbook documents how to verify the car is actually home before clearing a halt
- docs/security.md ships the STRIDE summary, the container-compromise blast-radius statement, and a security-incident runbook (spoofed location / unexplained quiet-hours pulse -> rotate credentials, review append-only audit)
- Lint passes
- Typecheck passes
- Tests pass in CI (the alert->runbook link check)
⏳ US-021 — Progress/docs site buildable for Cloudflare Pages
As the operator, I want a static docs/progress site that renders the PRD and live story progress so I can track the Ralph loop, deployable to Cloudflare Pages.
- A static site builds from the repo (e.g., the PRD plus a progress view derived from prd.json / progress.txt / metrics.json) with a single documented build command (mirrors the nodewright build-tracker generator pattern: a generate-pages script + mkdocs)
- The build journal renders each Ralph loop/iteration as an EXPLAINED entry, not just a pass/fail table row: story id + title + one-line goal, a plain-language 'what this loop did' narrative (sourced from progress.txt + the commit message body + the story description), the CI outcome with the specific failing lane and why, the attempt number, and what changed (commit link). The explained build journal is the primary view; the loop-stats table is secondary
- An overview page summarizes the build at a glance: stories complete vs remaining, current loop, first-pass CI rate, and a link to the live PRD
- The build runs in CI and fails the job if the site does not build
- A Cloudflare Pages deploy configuration is present (project/build output dir documented); deploy credentials are referenced from secrets, never committed
- README documents how the Ralph loop publishes progress after each iteration (the publish hook) and how the site is deployed
- Lint passes
- Typecheck passes
- Tests pass in CI (site build check)