PRD Status

Project: Smart Gate Trigger Service

Branch: ralph/smart-gate-trigger

Scope: Self-hosted decision/orchestration service that decides when and how often to pulse a gate-open MQTT command (gate/cmd) based on phone location, so the gate is already open before an FSD vehicle reaches it. Runs as a single Docker/docker-compose service on a home-lab VM. Source of truth: tasks/prd-smart-gate-trigger-service.md (blessed by the 4-pass critic loop). Completion gate for every story: lint + typecheck + pytest green on GitHub Actions runners against a mock MQTT broker; no physical hardware.

Story progress

  • ✅ Complete (passes == true): 0
  • ⏳ Remaining: 21
  • 📋 Total stories: 21

A story flips to complete only when the Ralph wrapper sees lint + typecheck + pytest green on GitHub Actions against a mock MQTT broker — no physical hardware.

All stories

| ID | Pri | Title | Status | Notes |
|---|---|---|---|---|
| US-001 | 1 | Project scaffold and GitHub Actions CI pipeline | ⏳ pending | Combines PRD US-001 scaffold + US-013 CI portion. Container build is deferred to US-018. |
| US-002 | 2 | Typed configuration model with validation and secret-by-reference | ⏳ pending | PRD US-001. |
| US-003 | 3 | Persistence layer with migrations and append-only audit | ⏳ pending | PRD US-015, moved early because US-012 (config API) and US-014 (recovery) depend on it. |
| US-004 | 4 | MQTT client wrapper with TLS, auth, reconnect, and retain=false | ⏳ pending | PRD US-002. |
| US-005 | 5 | Location ingest with freshness and accuracy guards | ⏳ pending | PRD US-003. |
| US-006 | 6 | Geometry module: zone membership, distance, and ground speed | ⏳ pending | PRD US-004. |
| US-007 | 7 | Per-entity state machine | ⏳ pending | PRD US-005. |
| US-008 | 8 | Singleton publish guard via flock | ⏳ pending | Extracted from PRD US-016 (singleton portion); the coordinator (US-009) depends on it. |
| US-009 | 9 | Pulse coordinator and actuator publisher | ⏳ pending | PRD US-006. |
| US-010 | 10 | Prometheus metrics and liveness/readiness probes | ⏳ pending | PRD US-016 (metrics + probes portion); placed before caps/alerting/API which reference its |
| US-011 | 11 | Safety caps and structured audit logging | ⏳ pending | PRD US-007. |
| US-012 | 12 | Runtime configuration and health API | ⏳ pending | PRD US-009. Depends on persistence (US-003), config model (US-002), probes (US-010), and c |
| US-013 | 13 | Alerting via Home Assistant notification topic | ⏳ pending | PRD US-008. Placed after the API (US-012) because it references the API auth-failure metri |
| US-014 | 14 | State recovery on restart | ⏳ pending | PRD US-010. |
| US-015 | 15 | Degraded-operation behavior | ⏳ pending | PRD US-011. |
| US-016 | 16 | End-to-end simulation/replay harness (CI completion gate) | ⏳ pending | PRD US-012. The synthetic fixtures are the primary behavioral completion gate. |
| US-017 | 17 | Home Assistant configuration artifacts and drift check | ⏳ pending | PRD US-014. |
| US-018 | 18 | Containerization (Dockerfile, digest-pinned, non-root, read-only rootfs) | ⏳ pending | PRD US-013 container portion. SBOM/cosign signing are out of scope per owner decision. |
| US-019 | 19 | Hardened docker-compose deployment | ⏳ pending | PRD US-018, retargeted from Kubernetes manifests to docker-compose. |
| US-020 | 20 | Failure-modes catalog, runbooks, and threat model docs | ⏳ pending | PRD US-017. |
| US-021 | 21 | Progress/docs site buildable for Cloudflare Pages | ⏳ pending | Mirrors the nodewright Cloudflare progress-tracking pattern. The live publish hook + deplo |

Acceptance criteria

Expand a story to see its full acceptance criteria from prd.json.

⏳ US-001 — Project scaffold and GitHub Actions CI pipeline

As a developer, I need a Python project skeleton and a CI pipeline so every later story is gated on green CI.

  • Python package layout with a dependency/lock file and lint + type-check tooling configured (e.g., ruff + mypy)
  • pytest configured with at least one trivial passing test to prove the suite runs
  • GitHub Actions workflow runs lint, typecheck, and pytest on push and pull_request and fails the job if any step fails
  • A mock/in-process MQTT broker fixture is available to the test suite (no external broker)
  • README documents local run and how to execute the test suite
  • Lint passes
  • Typecheck passes
  • Tests pass in CI (GitHub Actions)
⏳ US-002 — Typed configuration model with validation and secret-by-reference

As a developer, I need a validated, typed config model so all tunables live in one place and unsafe configs are rejected.

  • Pydantic config model with canonical SI units: zone radii (arm/trigger/home) in meters, gate_center in decimal degrees, pulse_interval, auto_close_window (observed minimum), max_pulses_per_session, max_session_duration, min_trigger_speed_mps (default ~3.6 m/s ≈ 8 mph), location_freshness_seconds, max_gps_accuracy_meters, gate_open_time_seconds, max_approach_speed_mps, system_latency_seconds, tracked entities, MQTT connection + topics, quiet-hours window with an explicit IANA timezone
  • Config loads from file with environment-variable overrides; invalid config fails fast with a clear error
  • Secret-typed fields (MQTT credentials, API tokens, TLS material) are supplied by reference from env (.env / Docker secret), never inlined or persisted, and a redaction helper masks them
  • Rejects config where pulse_interval is not strictly less than auto_close_window
  • Emits a warning when configured trigger radius < (gate_open_time_seconds + system_latency_seconds) * max_approach_speed_mps (FR-17)
  • Emits a warning when home radius <= max_gps_accuracy_meters (FR-19)
  • Tests cover valid load, env override, fail-fast on invalid, both warnings, the pulse_interval rejection, and secret redaction
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
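The three cross-field rules above (the pulse_interval rejection, FR-17, and FR-19) can be sketched as follows. This is a minimal illustration that uses a plain stdlib dataclass as a stand-in for the Pydantic model the story calls for; the class name and the `trigger_radius_m` / `home_radius_m` field names are hypothetical.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class GateTimingConfig:
    """Hypothetical subset of the config; all values are in SI units."""

    pulse_interval: float           # seconds between re-pulses
    auto_close_window: float        # seconds, observed minimum
    trigger_radius_m: float         # hypothetical field name
    home_radius_m: float            # hypothetical field name
    max_gps_accuracy_meters: float
    gate_open_time_seconds: float
    system_latency_seconds: float
    max_approach_speed_mps: float

    def __post_init__(self) -> None:
        # Hard rejection: a re-pulse must land before the gate auto-closes.
        if not self.pulse_interval < self.auto_close_window:
            raise ValueError(
                "pulse_interval must be strictly less than auto_close_window"
            )
        # FR-17: distance covered while the gate opens, at worst-case speed.
        min_radius = (
            self.gate_open_time_seconds + self.system_latency_seconds
        ) * self.max_approach_speed_mps
        if self.trigger_radius_m < min_radius:
            warnings.warn(
                f"trigger radius {self.trigger_radius_m} m is below {min_radius} m"
            )
        # FR-19: a home zone no larger than the GPS error bound is unreliable.
        if self.home_radius_m <= self.max_gps_accuracy_meters:
            warnings.warn("home radius is not larger than max_gps_accuracy_meters")
```

With a 15 s gate-open time, 5 s of latency, and a 15 m/s approach speed, any trigger radius below 300 m would produce the FR-17 warning in this sketch.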
⏳ US-003 — Persistence layer with migrations and append-only audit

As a developer, I need a SQLite-backed repository with migrations so config, latches, and audit rows persist behind a swappable interface.

  • SQLite-backed persistence with a versioned migration framework; forward migrations applied automatically on startup
  • A repository interface abstracts storage so callers depend on the interface, not SQLite specifics
  • Persists runtime-tunable config, the per-entity arrived/armed latch, and the Tier-C cap-halt marker
  • Audit rows are append-only: the repository interface exposes no update or delete path for audit rows (FR-28)
  • Tests cover migration apply on a fresh DB, repository read/write round-trips, and that audit rows cannot be updated/deleted via the interface
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
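One way to read the append-only requirement (FR-28) is an interface that simply has no update or delete method. The sketch below assumes a single hypothetical `audit` table with a free-text `event` column; the SQLite triggers are an extra safeguard that this sketch adds, not something the PRD specifies.

```python
import sqlite3
from typing import List, Protocol, Tuple


class AuditLog(Protocol):
    """Callers depend on this interface; it offers append and read only."""

    def append(self, event: str) -> int: ...

    def read_all(self) -> List[Tuple[int, str]]: ...


class SqliteAuditLog:
    """Hypothetical SQLite-backed implementation of AuditLog."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS audit ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, event TEXT NOT NULL)"
        )
        # Assumption: block UPDATE/DELETE at the database level as well.
        for op in ("UPDATE", "DELETE"):
            conn.execute(
                f"CREATE TRIGGER IF NOT EXISTS audit_no_{op.lower()} "
                f"BEFORE {op} ON audit "
                "BEGIN SELECT RAISE(ABORT, 'audit rows are append-only'); END"
            )
        conn.commit()

    def append(self, event: str) -> int:
        cur = self._conn.execute("INSERT INTO audit (event) VALUES (?)", (event,))
        self._conn.commit()
        return int(cur.lastrowid)

    def read_all(self) -> List[Tuple[int, str]]:
        return list(self._conn.execute("SELECT id, event FROM audit ORDER BY id"))
```

Because callers only ever see `AuditLog`, a later storage backend could replace SQLite without changing them.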
⏳ US-004 — MQTT client wrapper with TLS, auth, reconnect, and retain=false

As a developer, I need a resilient, authenticated MQTT client so the service survives broker restarts and cannot leak or replay commands.

  • A single asyncio-capable MQTT client library is chosen and pinned to an exact hashed version in the lock file
  • Connect/subscribe/publish helpers with automatic reconnect and backoff
  • Publishes to gate/cmd force retain=false; a test asserts no retained actuator command can be emitted (FR-1)
  • TLS (validating the broker CA) and per-client authentication are required by default; an explicit opt-out flag is allowed only for the mock-broker CI path and emits a startup warning (FR-25)
  • Credentials are read from injected secret env, never from the config blob
  • Tests cover: connect against mock broker, reconnect after broker drop, refusal to connect on TLS-verify failure (unless CI opt-out), and retain=false enforcement
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
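The retain=false rule (FR-1) can be enforced in the wrapper rather than trusted to callers. The sketch below is library-agnostic: `Transport` is a hypothetical stand-in for whichever MQTT client is chosen, and its `publish` signature is an assumption, not a real library API.

```python
from typing import Protocol

ACTUATOR_TOPIC = "gate/cmd"


class Transport(Protocol):
    """Hypothetical minimal surface of the underlying MQTT client."""

    def publish(self, topic: str, payload: bytes, retain: bool) -> None: ...


class SafePublisher:
    """Wrapper that never lets a retained actuator command through."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def publish(self, topic: str, payload: bytes, retain: bool = False) -> None:
        if topic == ACTUATOR_TOPIC:
            # Forced, whatever the caller asked for: a retained open command
            # would be replayed to the gate on every reconnect.
            retain = False
        self._transport.publish(topic, payload, retain)
```

A test can pass a recording fake as the transport and assert that every message on gate/cmd arrives with `retain` false.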
⏳ US-005 — Location ingest with freshness and accuracy guards

As the system, I must reject unreliable location fixes so stale or drifting GPS cannot cause bad decisions.

  • Parses incoming MQTT location messages (gate/location/) into a normalized fix (entity, lat, lon, gps_accuracy, timestamp, optional speed)
  • Rejects fixes older than location_freshness_seconds (dead/sleeping phone)
  • Rejects fixes with accuracy worse than max_gps_accuracy_meters (GPS drift)
  • Drops malformed/partial payloads without crashing
  • Tests cover fresh+accurate (accept), stale (reject), inaccurate (reject), and malformed (drop)
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
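The guards above amount to "parse, then reject on age or accuracy". The sketch below assumes a JSON payload whose keys match the normalized field names and a timestamp in epoch seconds; the actual payload shape is whatever the Home Assistant artifacts in US-017 document.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fix:
    entity: str
    lat: float
    lon: float
    gps_accuracy: float
    timestamp: float  # assumed: epoch seconds
    speed: Optional[float] = None


def parse_fix(
    entity: str,
    payload: bytes,
    now: float,
    location_freshness_seconds: float,
    max_gps_accuracy_meters: float,
) -> Optional[Fix]:
    """Return a normalized fix, or None when it must be dropped or rejected."""
    try:
        data = json.loads(payload)
        speed = data.get("speed") if isinstance(data, dict) else None
        fix = Fix(
            entity=entity,
            lat=float(data["lat"]),
            lon=float(data["lon"]),
            gps_accuracy=float(data["gps_accuracy"]),
            timestamp=float(data["timestamp"]),
            speed=float(speed) if speed is not None else None,
        )
    except (ValueError, KeyError, TypeError):
        return None  # malformed or partial payload: drop without crashing
    if now - fix.timestamp > location_freshness_seconds:
        return None  # stale: dead or sleeping phone
    if fix.gps_accuracy > max_gps_accuracy_meters:
        return None  # inaccurate: GPS drift
    return fix
```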
⏳ US-006 — Geometry module: zone membership, distance, and ground speed

As the system, I need geometry primitives so I can decide which zone an entity is in and how fast it is moving.

  • Haversine distance and circular zone membership (arm/trigger/home) computed from raw coordinates
  • Ground speed computed from consecutive fixes; an explicit speed attribute is used directly when present; speed requires >= 2 fresh consecutive fixes, otherwise reported as unknown (not zero/guessed)
  • The is-point-in-zone check is behind an interface so polygon/corridor implementations can be added later without changing callers
  • Range-rate is NOT computed in v1 (deferred to future garage intent)
  • Tests use known coordinate fixtures with expected zone membership, distance, and speed/unknown results
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
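The geometry primitives are small enough to sketch in full. The function names and the `(lat, lon, timestamp)` tuple shape below are hypothetical; the mean Earth radius of 6,371 km is the usual spherical approximation.

```python
import math
from typing import Optional, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

Point = Tuple[float, float, float]  # (lat_deg, lon_deg, timestamp_s), hypothetical


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def in_circular_zone(lat: float, lon: float, center: Tuple[float, float], radius_m: float) -> bool:
    return haversine_m(lat, lon, center[0], center[1]) <= radius_m


def ground_speed_mps(
    prev: Optional[Point], curr: Point, reported_speed: Optional[float] = None
) -> Optional[float]:
    """Speed in m/s, or None (unknown) when it cannot be derived."""
    if reported_speed is not None:
        return reported_speed  # an explicit speed attribute is used directly
    if prev is None:
        return None  # fewer than two consecutive fixes: unknown, never zero
    dt = curr[2] - prev[2]
    if dt <= 0:
        return None  # out-of-order or duplicate timestamps: unknown
    return haversine_m(prev[0], prev[1], curr[0], curr[1]) / dt
```

Callers are expected to pass only fixes that already passed the freshness guard, so "two fresh consecutive fixes" is a precondition here rather than something this sketch re-checks.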
⏳ US-007 — Per-entity state machine

As the system, I need a per-entity state machine so each phone independently progresses IDLE->ARMED->TRIGGERING->ARRIVED/STALE and re-arms correctly.

  • Implements IDLE, ARMED, TRIGGERING, ARRIVED, STALE with the transitions in PRD section 4
  • min_trigger_speed_mps gates only the initial entry to TRIGGERING; unknown speed must not enter TRIGGERING (fail-safe)
  • Trigger-zone latch: once TRIGGERING, stays until ARRIVED, trigger-zone exit, or fix expiry, ignoring momentary speed dips
  • STALE withdrawal: no fresh fix for an entity (age > location_freshness_seconds) stops its pulse contribution; arrived/armed latches behave per PRD
  • Re-arm only after the entity exits the arm zone
  • Tests assert: normal arrival, drive-past (no home, bounded), arrive-and-park (stop, no re-fire), pedestrian/low-speed (no trigger), slow/stop-in-zone after triggering (stays TRIGGERING), first in-zone fix with unknown speed (no trigger), and silence-while-TRIGGERING (enters STALE, stops)
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
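The speed gate and the trigger-zone latch are the subtle part of this story, so the sketch below isolates just that decision. It is not the full state machine from PRD section 4: the function name and its boolean inputs are hypothetical, and the other transitions (arming, re-arm on arm-zone exit) are deliberately left out.

```python
from typing import Optional


def is_triggering(
    was_triggering: bool,
    *,
    fresh: bool,
    in_trigger_zone: bool,
    arrived: bool,
    speed_mps: Optional[float],
    min_trigger_speed_mps: float,
) -> bool:
    """Decide whether an entity is in TRIGGERING after the latest evaluation."""
    # Exits: fix expiry, arrival at home, or leaving the trigger zone.
    if not fresh or arrived or not in_trigger_zone:
        return False
    # Latch: once triggering, a momentary speed dip does not end the session.
    if was_triggering:
        return True
    # Initial entry only: speed must be known and fast enough (fail-safe).
    return speed_mps is not None and speed_mps >= min_trigger_speed_mps
```

A car that slows to a crawl inside the trigger zone stays latched in this sketch, while a first in-zone fix with unknown speed never starts a session.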
⏳ US-008 — Singleton publish guard via flock

As the operator, I need at most one process able to publish gate/cmd so a duplicate instance cannot double-pulse the gate.

  • Before publishing to gate/cmd, a process acquires an exclusive flock advisory lock on /singleton.lock (FR-20)
  • The lock is held continuously for the lifetime of the publish-capable coordinator on a single open file descriptor (not acquired/released per pulse), and auto-releases on fd close, process exit, or SIGKILL so a successor acquires it immediately (no heartbeat/lease)
  • The lockfile lives on a LOCAL data volume; non-local or synced filesystems (NFS/SMB/9p/FUSE) are unsupported for the lock and are rejected or warned at startup
  • A non-owner must refuse to publish to gate/cmd until it acquires the lock
  • Exposes the singleton ownership state for metrics/health (sgt_singleton_owner source)
  • Tests assert two concurrently-started coordinators yield exactly one publisher (single stream, not doubled), and that killing the lock holder (SIGKILL / fd close) lets the waiter take over
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
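The lock semantics above map directly onto `fcntl.flock` with a long-lived file descriptor. The sketch below is Unix-only and the `SingletonLock` class is hypothetical; it omits the local-filesystem check the story also requires.

```python
import fcntl
import os
from typing import Optional


class SingletonLock:
    """Holds an exclusive advisory flock on one fd for as long as it is owned."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._fd: Optional[int] = None

    @property
    def is_owner(self) -> bool:
        return self._fd is not None

    def try_acquire(self) -> bool:
        """Non-blocking attempt; returns False while another holder exists."""
        if self._fd is not None:
            return True
        fd = os.open(self._path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)  # contended: keep retrying later, never publish
            return False
        self._fd = fd  # kept open for the coordinator's whole lifetime
        return True

    def release(self) -> None:
        # Closing the fd drops the lock; the kernel does the same on exit/SIGKILL.
        if self._fd is not None:
            os.close(self._fd)
            self._fd = None
```

Because the kernel releases the lock when the descriptor closes, a waiter that polls `try_acquire()` takes over without any heartbeat or lease logic.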
⏳ US-009 — Pulse coordinator and actuator publisher

As the system, I must keep the gate open across the approach by re-pulsing under the auto-close timer, deduped across both phones.

  • Keeps the gate open while any entity is TRIGGERING with a non-expired fix; re-pulses every pulse_interval
  • pulse_interval is enforced strictly less than auto_close_window (rejected at config validation)
  • Two entities triggering simultaneously produce a single deduped command stream, not a doubled rate (FR-8)
  • Publishes pulse to the configured gate/cmd topic with retain=false, only when this process holds the singleton lock (US-008)
  • Pulsing stops when no entity is triggering
  • Tests assert pulse cadence under the auto-close window, dedupe across entities, and stop-when-none-triggering
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
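Dedupe across entities falls out naturally if the cadence is driven by "is anyone triggering" rather than by per-entity timers. The sketch below assumes a hypothetical `tick` method called periodically by the event loop; it returns whether a pulse should be published now and leaves the actual MQTT publish to the caller.

```python
from typing import Optional, Set


class PulseCoordinator:
    """One shared cadence for all entities, so two phones never double the rate."""

    def __init__(self, pulse_interval: float) -> None:
        self._interval = pulse_interval
        self._last_pulse: Optional[float] = None

    def tick(self, now: float, triggering_entities: Set[str], holds_lock: bool) -> bool:
        """Return True when a pulse should be published at time `now`."""
        if not triggering_entities or not holds_lock:
            # Nobody is approaching, or this process is not the publisher.
            self._last_pulse = None
            return False
        if self._last_pulse is None or now - self._last_pulse >= self._interval:
            self._last_pulse = now
            return True
        return False
```

Ticking once per second for 30 seconds with a 10-second interval yields pulses at t = 0, 10, 20, and 30 whether one or two entities are in the set.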
⏳ US-010 — Prometheus metrics and liveness/readiness probes

As the operator, I need a metrics registry, a /metrics endpoint, and livez/readyz probes so the service is observable and a Docker healthcheck can target it.

  • A metrics registry and a Prometheus /metrics endpoint expose the metrics whose sources exist so far (e.g., sgt_build_info, sgt_mqtt_connected, sgt_location_fixes_total, sgt_entity_state, sgt_pulses_emitted_total, sgt_decision_latency_seconds, sgt_singleton_owner) with the labels in PRD section 11.1; later stories register their own metrics on this registry
  • livez reflects only that the process/event loop is up (never depends on broker/persistence/lock)
  • readyz is ready only when broker connected AND persistence reachable AND migrations applied AND singleton lock held (FR-21)
  • A non-owner instance still serves livez, readyz, /metrics, and /v1/health while it retries lock acquisition — lock contention must not block startup or look like a dead process
  • Tests assert /metrics exposes the expected metric names, and readyz goes not-ready when the broker mock is down, persistence is unreachable, or the lock is not held, then recovers
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
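The difference between the two probes is easiest to see side by side. In the sketch below the `Dependencies` dataclass and the 200/503 status codes are assumptions; the readiness conjunction itself is FR-21 as stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependencies:
    """Hypothetical snapshot of the readiness inputs."""

    broker_connected: bool
    persistence_reachable: bool
    migrations_applied: bool
    singleton_lock_held: bool


def livez_status(event_loop_running: bool) -> int:
    # Liveness never looks at broker, persistence, or lock state.
    return 200 if event_loop_running else 503


def readyz_status(deps: Dependencies) -> int:
    ready = (
        deps.broker_connected
        and deps.persistence_reachable
        and deps.migrations_applied
        and deps.singleton_lock_held
    )
    return 200 if ready else 503
```

A non-owner instance therefore reports live but not ready, which is what lets it keep serving probes while it retries the lock.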
⏳ US-011 — Safety caps and structured audit logging

As the operator, I need hard limits and a full decision trail so the system can never pulse in a runaway loop and I can answer why the gate opened.

  • Enforces max_pulses_per_session and max_session_duration; loop halts when either is hit
  • A cap-halt persists a Tier-C halted_session marker (FR-24), increments sgt_cap_halts_total, and is not silently resumed on restart (clears only on re-arm or operator reset)
  • Every decision and pulse emits a structured log entry: ts, entity, state_from, state_to, lat, lon, gps_accuracy, speed (or unknown), zone, reason, session_id, pulse_seq, quiet_hours
  • Pulses within the quiet-hours window are flagged, evaluated in the configured IANA timezone (DST-aware)
  • Each audit event is appended via the append-only repository (US-003) and also written to stdout (FR-28); the stdout copy requires a documented Docker log-retention/rotation config (e.g., json-file max-size/max-file) and is scoped as a pragmatic single-home repudiation control, not tamper-proof against host-root/Docker-daemon compromise
  • Tests assert caps halt the loop, the persisted marker is set, audit entries contain the required fields and reason codes, and quiet-hours flagging uses the configured timezone
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
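The two caps can be checked at a single choke point in front of every pulse. The `SessionCaps` class below is a hypothetical in-memory sketch; persisting the Tier-C marker and incrementing sgt_cap_halts_total would hang off the point where `halted` flips to true.

```python
class SessionCaps:
    """Tracks one pulse session and halts it when either cap is reached."""

    def __init__(
        self,
        max_pulses_per_session: int,
        max_session_duration: float,
        started_at: float,
    ) -> None:
        self._max_pulses = max_pulses_per_session
        self._max_duration = max_session_duration
        self._started_at = started_at
        self.pulses = 0
        self.halted = False

    def allow_pulse(self, now: float) -> bool:
        """Return True and count the pulse, or halt the session and return False."""
        if self.halted:
            return False  # stays halted until re-arm or operator reset
        over_count = self.pulses >= self._max_pulses
        over_time = now - self._started_at >= self._max_duration
        if over_count or over_time:
            self.halted = True  # persist marker / bump metric here
            return False
        self.pulses += 1
        return True
```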
⏳ US-012 — Runtime configuration and health API

As the operator (and a future C2 site), I want an authenticated REST API to read/update config at runtime and check health.

  • Versioned endpoints GET/PUT/PATCH /v1/config, GET /v1/health, POST /v1/session/reset; only /livez, /readyz, /metrics are unversioned
  • Every endpoint requires authentication (bearer token or mTLS, credential from injected secret env); unauthenticated requests get 401; mutating endpoints are authorized separately from reads (read-only token on a mutating endpoint gets 403)
  • GET /v1/config returns effective config with all secret-typed fields redacted (FR-27)
  • Config fields are classified hot-mutable / external-artifact-coupled (returns requires_ha_sync) / connection-level (controlled reconnect); the response indicates which class applied and whether a reconnect/HA sync happened
  • Every config mutation writes an audit record (source, field-level before->after diff, reconnect flag) (FR-23); invalid updates are rejected and do not partially apply
  • GET /v1/health reports MQTT connectivity, per-entity last-fix age and state (including STALE), last-pulse timestamp, persistence reachability, singleton-lock ownership, and session/cap state; POST /v1/session/reset clears a Tier-C cap-halt
  • The API binds only to a controlled interface (loopback/host LAN), never published to the public internet (FR-26)
  • Tests cover read, valid update, invalid update, health output, 401 unauthenticated, 403 read-only-on-mutate, and secret redaction
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
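The 401/403 split can be expressed as one small decision function, independent of the web framework. The sketch below assumes two bearer tokens (one read-only, one read-write) loaded from the injected secret env; the function name and the two-token scheme are hypothetical.

```python
import hmac
from typing import Optional

MUTATING_METHODS = {"PUT", "PATCH", "POST"}


def _same(a: str, b: str) -> bool:
    # Constant-time comparison to avoid leaking token contents via timing.
    return hmac.compare_digest(a.encode("utf-8"), b.encode("utf-8"))


def authorize(
    method: str, bearer_token: Optional[str], read_token: str, write_token: str
) -> int:
    """Return the HTTP status the request should receive before handling."""
    if bearer_token is None:
        return 401
    can_write = _same(bearer_token, write_token)
    can_read = can_write or _same(bearer_token, read_token)
    if not can_read:
        return 401  # unknown credential
    if method.upper() in MUTATING_METHODS and not can_write:
        return 403  # authenticated, but not authorized to mutate
    return 200
```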
⏳ US-013 — Alerting via Home Assistant notification topic

As the operator, I want deduped, runbook-linked alerts so I notice failures and anomalies without alert storms.

  • Publishes alerts to a configurable MQTT notify topic that HA can route to phones
  • Alerts fire on: cap reached, actuator publish failure, MQTT broker connection lost, persistence-write failure, quiet-hours pulse, and an API auth-failure spike (rate of sgt_api_auth_failures_total over a configurable threshold)
  • Each alert payload carries severity (page|notify), a stable alert_name, the reason code, and a runbook field pointing at the section 13 runbook
  • Dedupe: emit on transition into the failing state then at most once per configurable alert_repeat_interval (default 300s); a resolved event fires on recovery
  • Tests assert an alert is emitted for each trigger condition with required fields, a repeat within the interval is suppressed, and a resolve fires on recovery
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
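The dedupe rule (fire on entry, repeat at most once per alert_repeat_interval, resolve on recovery) can be sketched as a small tracker keyed by alert_name. The `AlertDeduper` class and the "firing" / "resolved" return values are hypothetical; building and publishing the payload is left to the caller.

```python
from typing import Dict, Optional


class AlertDeduper:
    """Decides whether an observation should produce an alert event."""

    def __init__(self, alert_repeat_interval: float = 300.0) -> None:
        self._interval = alert_repeat_interval
        self._last_emit: Dict[str, float] = {}  # alert_name -> last firing time

    def observe(self, alert_name: str, failing: bool, now: float) -> Optional[str]:
        """Return "firing", "resolved", or None when nothing should be sent."""
        last = self._last_emit.get(alert_name)
        if failing:
            if last is None or now - last >= self._interval:
                self._last_emit[alert_name] = now
                return "firing"
            return None  # repeat inside the interval is suppressed
        if last is not None:
            del self._last_emit[alert_name]
            return "resolved"  # recovery after a firing alert
        return None
```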
⏳ US-014 — State recovery on restart

As the system, I must behave safely if Home Assistant or the service restarts mid-session.

  • On startup, state is re-derived from the latest valid fixes; no in-flight pulse loop resumes blindly
  • The arrived/armed latch is persisted (US-003) so a restart while parked at home does not re-trigger
  • If no fresh location is available on restart, the system stays safe with no pulse (STALE unless a persisted ARRIVED latch applies)
  • A latch write failure does not crash the service: the in-memory latch holds, the failure is alerted, and on restart the system falls back to fix-derived state (geometry decides ARRIVED), never to re-pulsing a parked car (FR-22)
  • A persisted cap-halt marker is replayed on restart and cleared only by re-arm or POST /v1/session/reset
  • Tests simulate restart in each state (IDLE, ARMED, TRIGGERING, ARRIVED, STALE, cap-halted) including restart-after-latch-write-failure and restart-while-cap-halted
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
⏳ US-015 — Degraded-operation behavior

As the system, I must fail safe when dependencies are unavailable.

  • MQTT broker unreachable -> retry with backoff, stay IDLE, flip readiness not-ready, alert, never crash (broker-down must not fail liveness)
  • Location stream silent -> TRIGGERING contribution expires after location_freshness_seconds, entity enters STALE, no pulses from stale data; surfaced as a metric and (after grace) an alert
  • Actuator publish failure -> logged and alerted, retried within the pulse_interval budget, loop not wedged
  • Persistence unavailable / migration failure on boot -> readiness fails and alerts rather than running against an unmigrated schema
  • Data volume not writable by the service UID/GID (non-root + read-only rootfs) is detected at startup with a clear, actionable error rather than a crash loop
  • Duplicate instance -> the non-owner refuses to publish until it holds the lock; a test asserts a single deduped stream
  • Internet outage does not affect the local arrival flow (no cloud calls on the critical path)
  • Tests cover each degraded scenario
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
⏳ US-016 — End-to-end simulation/replay harness (CI completion gate)

As a developer, I need to replay GPS traces end-to-end so behavior is validated without hardware and serves as the integration gate.

  • Harness feeds timestamped fix sequences through ingest -> geometry -> state machine -> coordinator and records emitted pulses
  • Synthetic scenario fixtures with expected outcomes ship in the repo and are the CI gate (no external data): (a) normal arrival opens before arrival and stops after home; (b) drive-past opens then stops with no runaway; (c) two phones arriving together produce a single deduped stream; (d) phone battery dies mid-approach -> stale fixes ignored, safe; (e) dog walk through zone -> no trigger; (f) enter-then-leave quickly -> bounded pulses
  • An offline calibration mode replays recorded real GPS traces and grid-searches zone radii and min_trigger_speed_mps; it runs only when trace fixtures are present and is NOT a CI pass/fail gate
  • All synthetic scenario assertions pass in CI
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
⏳ US-017 — Home Assistant configuration artifacts and drift check

As the operator, I need the HA zones, high-accuracy automation, and broker ACL defined so the service receives timely location and the broker is scoped.

  • Documented HA zone definitions for arm/trigger/home, generated from or mechanically checked against the same gate_center/radius values used by the service config; drift is a release-blocking validation failure
  • Runtime API changes to gate_center/arm_radius either regenerate/validate the HA artifact or are staged/rejected with requires_ha_sync (consistent with US-012)
  • Documented HA automation that enables high-accuracy mode on arm-zone entry and disables it on exit
  • Documented MQTT publishing of per-entity location to gate/location/ and the broker ACL scoping this client to its topics (FR-25)
  • A validation test confirms the documented topic/payload shape matches what the ingest module (US-005) expects
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
⏳ US-018 — Containerization (Dockerfile, digest-pinned, non-root, read-only rootfs)

As a developer, I need a hardened container image so the service can deploy to the home-lab VM with verifiable provenance.

  • Dockerfile builds a runnable image from a base image pinned by digest (@sha256:...), not a floating tag (FR-31)
  • The image runs as a non-root UID with a read-only root filesystem (only the data volume writable)
  • The dependency lock file pins every dependency with a cryptographic hash (FR-31)
  • CI builds the image successfully
  • README documents the container run and configuration
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
⏳ US-019 — Hardened docker-compose deployment

As the operator, I need the compose deployment to enforce least privilege so a compromised container cannot reach the host, other containers, or the public internet.

  • compose service runs as non-root, read_only root filesystem (data volume + optional tmpfs writable), cap_drop [ALL], security_opt no-new-privileges:true, no privileged mode, and no mount of the Docker socket or host paths (FR-29)
  • API and /metrics ports are not published to 0.0.0.0/public internet; they bind to loopback or the internal compose network; egress on the critical path is limited to the broker and DNS; the host-firewall posture is documented (FR-30)
  • Credentials come from an env_file or Docker secret referenced by the compose file, never as literals or baked into the image; rotation is documented (FR-27)
  • The data volume is mounted writable by the service's non-root UID/GID (documented) so the read-only-rootfs container can write SQLite and the lockfile
  • restart: unless-stopped, a healthcheck targeting /readyz (status only; no autoheal restart-on-unhealthy in v1), and a log-retention/rotation policy (e.g., json-file max-size/max-file) are set
  • A CI check (docker compose config parse and/or hadolint) asserts non-root, read_only, cap_drop ALL, no-new-privileges, no docker.sock/host bind mount, the log-retention policy is present, and that API/metrics ports are NOT published on a wildcard bind (rejects bare port mappings and 0.0.0.0 / :: ; requires absent ports or an explicit non-wildcard host IP); it fails the workflow if any is missing
  • Lint passes
  • Typecheck passes
  • Tests pass in CI
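The CI hardening check could be a short script over the parsed compose service. The sketch below assumes the service has already been loaded into a dict and that ports use the short string syntax as written in the compose file (the long mapping syntax is not handled); both function names are hypothetical.

```python
from typing import Any, Dict, List

WILDCARD_HOSTS = {"", "0.0.0.0", "::"}


def is_wildcard_publish(mapping: str) -> bool:
    """True when a short-syntax port mapping binds on all interfaces."""
    spec = mapping.split("/", 1)[0]  # drop an optional /tcp or /udp suffix
    if spec.startswith("["):
        host = spec[1 : spec.index("]")]  # bracketed IPv6 host, e.g. [::1]
    else:
        parts = spec.split(":")
        if len(parts) < 3:
            return True  # bare "8080" or "8080:8080" has no host IP
        host = parts[0]
    return host in WILDCARD_HOSTS


def hardening_violations(service: Dict[str, Any]) -> List[str]:
    """List the hardening requirements a compose service fails to meet."""
    problems: List[str] = []
    if service.get("read_only") is not True:
        problems.append("read_only")
    if service.get("cap_drop") != ["ALL"]:
        problems.append("cap_drop")
    if "no-new-privileges:true" not in service.get("security_opt", []):
        problems.append("no-new-privileges")
    if service.get("privileged"):
        problems.append("privileged")
    for port in service.get("ports", []):
        if is_wildcard_publish(str(port)):
            problems.append(f"wildcard port {port}")
    return problems
```

A CI step would fail the workflow whenever `hardening_violations` returns a non-empty list; the user, volume, and log-retention checks from the criteria above would be added in the same style.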
⏳ US-020 — Failure-modes catalog, runbooks, and threat model docs

As the on-call homeowner, I need every failure mode and operator action documented so a 2 a.m. page is actionable.

  • docs/runbooks.md ships and covers every row of the section 12 failure-modes catalog; each runbook has precondition (the literal docker compose / curl /readyz / curl /v1/health / metric / log signal), actions, expected end state, and rollback
  • Includes a runbook for a data-volume permission/UID mismatch (container cannot write SQLite/lockfile): how to detect it and chown/fix the volume
  • Every alert emitted by US-013 carries a runbook field whose value resolves to an existing runbook section; a test asserts no alert references a missing runbook anchor
  • The cap-halt / POST /v1/session/reset runbook documents how to verify the car is actually home before clearing a halt
  • docs/security.md ships the STRIDE summary, the container-compromise blast-radius statement, and a security-incident runbook (spoofed location / unexplained quiet-hours pulse -> rotate credentials, review append-only audit)
  • Lint passes
  • Typecheck passes
  • Tests pass in CI (the alert->runbook link check)
⏳ US-021 — Progress/docs site buildable for Cloudflare Pages

As the operator, I want a static docs/progress site that renders the PRD and live story progress so I can track the Ralph loop, deployable to Cloudflare Pages.

  • A static site builds from the repo (e.g., the PRD plus a progress view derived from prd.json / progress.txt / metrics.json) with a single documented build command (mirrors the nodewright build-tracker generator pattern: a generate-pages script + mkdocs)
  • The build journal renders each Ralph loop/iteration as an EXPLAINED entry, not just a pass/fail table row: story id + title + one-line goal, a plain-language 'what this loop did' narrative (sourced from progress.txt + the commit message body + the story description), the CI outcome with the specific failing lane and why, the attempt number, and what changed (commit link). The explained build journal is the primary view; the loop-stats table is secondary
  • An overview page summarizes the build at a glance: stories complete vs remaining, current loop, first-pass CI rate, and a link to the live PRD
  • The build runs in CI and fails the job if the site does not build
  • A Cloudflare Pages deploy configuration is present (project/build output dir documented); deploy credentials are referenced from secrets, never committed
  • README documents how the Ralph loop publishes progress after each iteration (the publish hook) and how the site is deployed
  • Lint passes
  • Typecheck passes
  • Tests pass in CI (site build check)