Critic Log¶

Before the Ralph loop touched a line of code, the PRD ran the standardized 4-pass critic loop: a skeptical Staff Engineer, an SRE/on-call lead, a Security & Compliance auditor, and an independent cross-model (Codex) review. Each persona surfaced findings, applied fixes in place, and committed before the next pass. This page summarizes that audit; the full text lives at tasks/prd-smart-gate-trigger-service.critic-log.md in the repo.

Status: BLESSED 2026-05-25 19:31 UTC

Passes¶

Pass	Persona	When	Major	Minor	Nit
1	Staff Engineer	2026-05-25 19:15 UTC	5	6	3
2	SRE / On-Call Lead	2026-05-25 19:19 UTC	7	3	3
3	Security & Compliance Auditor	2026-05-25 19:25 UTC	5	4	4
4	Codex (gpt-5.5, xhigh)	2026-05-25 19:31 UTC	4	3	0
—	Post-revision codex review (gpt-5.5, xhigh)	2026-05-25 20:09 UTC	3	2	0
	Total		24	18	10

Cumulative findings across all passes: 24 Major + 18 Minor + 10 Nit.

Severity classification:

Major — PRD is wrong, dangerous, or unshippable as written. Auto-applied.
Minor — PRD is correct but unclear/redundant/misordered. Applied when unambiguous.
Nit — author-judgment item. Logged, not auto-applied.

Findings by pass¶

Pass 1 — Staff Engineer · 5M / 6m / 3 nit

Findings¶

[Major] Speed-source contradiction. §4 diagram payload and US-003 normalized fix list no speed, but US-004 said "GPS speed used if present," and the state machine triggers on speed. With only positions, speed is derived from consecutive fixes, so the first in-zone fix (cold start / gap) has no speed — behavior was unspecified. Proposed change: make speed an explicit optional ingest field, require ≥2 fresh fixes to compute speed, and define unknown-speed ⇒ no TRIGGERING (fail-safe). Section: §4, US-003, US-004, US-005, FR-5.
[Major] Speed-drop hysteresis unspecified. TRIGGERING requires speed ≥ min_trigger_speed; a car slowing/stopping in the trigger zone (traffic, mailbox, gate not yet open) would drop out, letting the gate auto-close in front of an FSD vehicle that "will not wait" — the project's whole reason for existing. Proposed change: latch TRIGGERING until ARRIVED or trigger-zone exit, ignoring momentary dips. Section: §4, US-005, FR-5.
[Major] Unmeasurable success metric. "Gate is fully open before the vehicle reaches it" cannot be measured — the actuator is a black box with no status topic (§1); the audit log records pulse times, not gate-open times. Proposed change: reframe to the measurable proxy "first qualifying pulse emitted ≥ gate_open_time_seconds before home-zone entry," with a note that physical confirmation needs a field observer. Section: §9.
[Major] Accuracy gate vs. home radius collision — the named "hard problem." max_gps_accuracy_meters and the ~40 m home radius interact, but the PRD never says how. If the accuracy gate is ≥ the home radius, no fix can ever be accepted and placed inside home, so stop-when-home/ARRIVED is unreachable. Proposed change: add config-validation warning when home radius ≤ max_gps_accuracy_meters. Section: US-001, FR-19 (new), §3 hard-problem.
[Major] US-012 completion gate is unsatisfiable in CI. It required replaying "recorded real GPS traces" AND "all scenario assertions pass in CI," but real traces are captured during field testing and are absent at build time, so the story can never be marked complete by Ralph. Proposed change: synthetic fixtures = CI gate; recorded-trace replay + grid-search = offline calibration tool, not a CI pass/fail gate. Section: US-012.
[Minor] Range-rate built but unused in v1. US-004 required computing range-rate, yet §3 and FR-5 state v1 is speed-only and range-rate is "reserved for future garage intent" — building an unused path contradicts the established decision. Proposed change: remove range-rate from US-004 v1 scope; mark deferred. Section: US-004.
[Minor] auto_close_window is a 20–30 s range, but the safety-critical pulse_interval < auto_close_window invariant needs the conservative minimum or the gate closes between pulses. Proposed change: require auto_close_window be set to the observed minimum (≈20 s); reject config where pulse_interval is not strictly less. Section: §1, US-001, US-006/FR-6.
[Minor] Quiet-hours timezone/DST unspecified. Bonney Lake is America/Los_Angeles with DST; a naive or UTC window mis-flags the "2 a.m." case half the year. Proposed change: add explicit IANA timezone to the quiet-hours config and evaluate DST-aware. Section: US-001, US-007.
[Minor] US-009 "applies changes without restart" too broad. Connection-level fields (MQTT broker/topics, tracked-entity set) cannot be silently hot-swapped like zone radii. Proposed change: classify each field hot-mutable vs connection-level (reconnect/reset), and report which path an update took. Section: US-009.
[Minor] Hand-wavy MQTT dependency. §8/US-002 say "an asyncio-capable client" without naming/pinning one. Proposed change: require the chosen client be pinned to an exact version in the lock file. Section: US-002.
[Minor] Success metrics lack sample size / source. "≥99% of real arrivals" over an undefined "field-test period" and "100% of the time" are unfalsifiable as written. Proposed change: define a field-test window (≥30 logged arrivals across both drivers) and name the audit log as the source for each metric. Section: §9.

Changes applied¶

§1: added the conservative-minimum rule for auto_close_window and the strict pulse_interval < relationship.
§4 diagram: added optional [speed?] to the ingest payload.
§4 state machine: rewrote the TRIGGERING bullet for the unknown-speed fail-safe and the trigger-zone latch (hysteresis).
US-001: added quiet-hours IANA timezone to the config model; added derived-warning for home radius ≤ max_gps_accuracy_meters; added explicit reject for pulse_interval ≥ auto_close_window.
US-002: added AC requiring a single MQTT client pinned to an exact version.
US-003: added optional speed to the normalized fix with a note v1 does not depend on it.
US-004: retitled (dropped range-rate), rewrote the speed AC (≥2 fresh fixes, optional explicit speed, unknown-when-single-fix), and marked range-rate deferred.
US-005: added unknown-speed fail-safe + trigger-zone latch to the transitions AC and added the slow/stop-in-zone and unknown-speed test cases.
US-007: tied quiet-hours flagging to the configured IANA timezone (DST-aware).
US-009: classified config fields hot-mutable vs connection-level and required the response to indicate the path.
US-012: split synthetic (CI gate) from recorded-real-trace replay + grid-search (offline calibration, non-gating).
§9: added a no-gate-feedback caveat and reframed all six metrics to audit-log-derivable proxies with a defined field-test window.
FR-5: added known-speed requirement, unknown-speed no-trigger, and the trigger-zone latch.
FR-19 (new): home-radius vs accuracy warning.

Outstanding¶

[Nit] US-015 specifies a "backend-agnostic repository layer" with a "swappable" backend and a "versioned migration framework" for a single-home store that holds a config blob and two latch booleans. This is premature abstraction — a plugin point unlikely to ever be plugged. Left unapplied because it is an explicit owner decision in §3; flagging that a plain SQLite module with a small init/upgrade step would suffice for v1.
[Nit] gate_open_time_seconds is unmeasured at build time (§10) yet feeds the safety-critical FR-17 min-radius derivation. The PRD should state the conservative default/behavior when it is unknown (e.g., assume a large value so the warning fires until measured). Author judgment; left unapplied.
[Nit] "Designed to back a future C2 site" framing (Goals, §3, US-009) drives API versioning and persistence shape now for a UI that is explicitly out of scope. Versioning is cheap so it stays, but this is speculative scope influence worth a second look.

Pass 2 — SRE / On-Call Lead · 7M / 3m / 3 nit

Findings¶

[Major] No singleton guarantee — replicas: 2 or a RollingUpdate overlap runs two independent pulse coordinators with in-process-only dedupe, doubling the pulse rate and forking the arrived/armed latch. The whole "dedupe across entities" guarantee (FR-8) is per-process. Proposed change: declare single-active-instance (replicas:1 + Recreate) plus a persistence-backed singleton marker; a non-owner refuses to publish to gate/cmd. Section: §4, FR-20 (new), US-016 (new), §14.
[Major] Zero Prometheus metrics. The service runs in K8s but emits nothing scrapable — at 3 a.m. you cannot tell if it is pulsing, when the last fix landed, or whether the broker is connected without curling an ad-hoc /health. A controller that fails silently is undermonitored. Proposed change: add /metrics with an enumerated metric set (counters/gauges + decision-latency + lead-time histograms). Section: §11.1 (new), FR-21 (new), US-016 (new).
[Major] No failure-modes catalog. US-011 lists degraded scenarios as prose with no detection signal / auto-response / alert / runbook per failure. Proposed change: add a §12.5-style Failure Modes Catalog table (12 rows) keyed to metrics and runbooks. Section: §12 (new).
[Major] Cap-halt is an undocumented manual-resume (NeedsOperator) state. US-007 says "loop halts" but never says what un-halts it; worse, a max_session_duration halt during a slow approach drops the gate in front of the FSD car the project exists to serve, with no resume contract and no runbook. Proposed change: define a Tier-C halted_session marker cleared only by re-arm or POST /v1/session/reset; restart replays the halt; add runbook §13.4. Section: US-007, US-010, FR-24 (new), §13.4, §14.
[Major] No runbooks for any operator-action state. Every alert in US-008 paged a human with no documented action. Proposed change: ship docs/runbooks.md (§13) with one entry per failure (precondition signal / actions / end state / rollback) and a test asserting every alert's runbook field resolves to a real section. Section: §13 (new), US-017 (new), US-008.
[Major] SLIs were prose, not expressions. §9 (already reframed to audit-log proxies by Pass 1) gave no measurable computation, and the safety-critical fix→pulse latency that feeds FR-17's radius derivation was unmeasured. Proposed change: define each SLI as a PromQL/audit-query expression and add sgt_decision_latency_seconds + sgt_lead_time_seconds histograms. Section: §11.5 (new).
[Major] Persistence-down behavior unspecified. US-010 depends on the persisted latch but US-011 never said what happens when SQLite is locked / disk full / migration fails — risk of crash-loop or re-pulsing a parked car. Proposed change: write failure holds in-memory latch + alerts + does not crash; restart falls back to fix-derived state; migration failure fails readiness not liveness. Section: US-010, US-011, FR-22 (new), §12 rows 7–8, §13.6.
[Minor] Alert noise / no suppression. US-008 fired one alert per condition with no dedupe — a flapping broker or wedged actuator pages per retry, and no alert carried severity or a runbook link or an expected rate. Proposed change: transition-based emit + alert_repeat_interval suppression + resolved events; severity (page/notify) and runbook fields; per-alert weekly rate in §12. Section: US-008.
[Minor] No config-change audit. PUT/PATCH /config mutated safety-critical settings (trigger radius, caps) with no record of who/when/diff — "why is the radius 50 m?" was unanswerable. Proposed change: audit every mutation with source + before→after diff + reconnect flag. Section: US-009, FR-23 (new), §11.3.
[Minor] No liveness/readiness distinction. GET /health existed but wasn't wired as probes, and "healthy" was undefined — a broker-down readiness signal mapped to liveness would crash-loop the pod pointlessly. Proposed change: livez (process only) vs readyz (broker+persistence+migrations); degraded deps flip readiness, never liveness. Section: US-009, US-016, §11.4, FR-21.
[Nit] Metric naming follows Prometheus base-unit convention (_total counters, _seconds histograms, sgt_ prefix); left as a documented convention in §11.1 rather than a separate finding.

Changes applied¶

§4: added a metrics/probes line to the service box and a "Single active instance" note (replicas:1 + Recreate).
US-007: cap-halt now persists a Tier-C halted_session marker, increments sgt_cap_halts_total, and is not silently resumed.
US-008: added MQTT-disconnect/persistence-failure alerts; severity/alert_name/reason/runbook payload fields; transition-based emit + alert_repeat_interval dedupe + resolved; page-vs-notify classification.
US-009: added per-mutation config audit (source + diff + reconnect) and expanded /health with last-pulse, persistence reachability, session/cap state, and live-vs-ready distinction.
US-010: added latch-write-failure fallback (fix-derived state, no crash, no re-pulse) and the cap-halt Tier-C resume-marker contract; added restart-after-write-failure and restart-while-cap-halted tests.
US-011: added persistence-unavailable and duplicate-instance degraded scenarios; clarified broker-down flips readiness not liveness; actuator retry within pulse budget.
US-016 (new): Prometheus /metrics, livez/readyz, and the single-instance guard with marker lease semantics.
US-017 (new): failure-modes catalog + runbooks doc deliverable with an alert→runbook link test.
FR-20..FR-24 (new): singleton; metrics+probes; persistence-failure safety; config-change audit; cap-halt resume contract.
§11 (new): Prometheus metrics table, structured-log fields, audit events, health/probes, SLI expressions.
§12 (new): Failure Modes Catalog (12 rows, §12.5 format).
§13 (new): ten runbooks (precondition signal / actions / end state / rollback).
§14 (new): recoverability tiering table (Tier A pulse / Tier B restart re-derive / Tier C cap-halt) and singleton/lease semantics.

Outstanding¶

[Nit] The singleton marker's lease takeover gap (default 30 s) can exceed auto_close_window (~20 s), so a crash-takeover could let the gate close once mid-approach. Documented in §14.2 with the mitigation (Recreate = no overlap; treat the lease as a misconfig guard, not the steady path). Left as a calibration note rather than forcing a specific lease value, since the safe value depends on the empirically-measured auto_close_window.
[Nit] Metric naming convention logged only (no edit beyond adopting it in §11.1).

Pass 3 — Security & Compliance Auditor · 5M / 4m / 4 nit

Threat model summary¶

This service is the decision plane of a physical access control: it publishes the command that opens an occupied home's driveway gate, and it subscribes to the real-time location of two named residents. Both the command stream and the location stream are security-sensitive, and both ride a single MQTT broker co-located with Home Assistant on a shared home LAN. The dominant threats are (S) spoofing a location fix or (T) publishing gate/cmd pulse directly to force the gate open, (I) passively reading the location topics to track residents / infer an empty house, and (E) driving the unauthenticated config API to shrink zones, zero min_trigger_speed, disable caps, or clear a cap-halt. (R) Repudiation is real because the audit log — the answer to "why did it open at 2 a.m.?" — lives in the same SQLite file the operator pod controls. As written before this pass, the PRD specified no broker TLS/auth/ACLs, no API authentication, no secret-handling rules, no pod-security/RBAC/NetworkPolicy posture, no supply-chain provenance, and no threat model / blast-radius statement — i.e., every privileged effect was reachable by any device on the LAN. This pass closes those gaps; the gate command itself is already replay-safe (Tier A, §14), so post-mitigation the marginal harm of pod compromise is bounded to "open the gate / read location," with no lateral movement.

Findings¶

[Major] MQTT broker has no transport security or authentication. The service is MQTT-only for all I/O (§3, §4); an anonymous/cleartext broker (Mosquitto's default) lets any LAN/IoT device spoof a gate/location fix to open the gate, publish gate/cmd pulse directly (bypassing the whole singleton story), or sniff residents' real-time location. STRIDE category: S/T/I. Proposed change: require broker TLS + per-client credentials + topic-scoped ACLs; credentials from a Secret. Section: US-002, FR-25, §15.1.
[Major] REST config/reset API is unauthenticated. PUT/PATCH /config and POST /v1/session/reset mutate a physical-security control (zones, min_trigger_speed, caps, cap-halt) with no authN/authZ; the audit source field even hints at a token subject that was never specified. STRIDE category: E/T. Proposed change: require auth on every endpoint, separate mutate-vs-read authorization, 401 on unauthenticated, bind to a controlled non-public interface. Section: US-009, FR-26, §15.1.
[Major] GET /config and the persisted config store leak/retain plaintext secrets. The config model holds "MQTT connection" (credentials) and GET /config "returns current effective config" — that returns the broker password and any API token, and the SQLite config store persists them in cleartext. STRIDE category: I. Proposed change: secrets by reference only (Secret/External Secrets), never inlined/persisted, redacted from GET /config, audit, and logs; rotatable without image rebuild. Section: US-001, US-009, FR-27, §15.1.
[Major] No tamper-evidence on the audit log. The audit trail (the §9/§11.3 answer to "why did the gate open / who changed the radius?") lives in the same SQLite file the operator pod can write, so a pod compromise can erase it. STRIDE category: R. Proposed change: append-only repository (no update/delete for audit rows) + mirror every audit event to an out-of-pod structured-log sink (separate failure domain). Section: US-007, FR-28, §15.1.
[Major] No documented blast radius / threat model. The PRD introduced ingest, coordinator, actuator publisher, API, metrics, and persistence components with no STRIDE pass and no operator-pod-compromise statement, so the blast radius was undeclared. STRIDE category: E (governance). Proposed change: add §15 with a per-component STRIDE table and the required "if compromised, attacker can X / cannot Y / mitigations Z" statement. Section: §15 (new), US-017.
[Minor] No pod-security / RBAC posture. A K8s-deployed controller with no declared PodSecurity standard, ServiceAccount scope, or token mount could run over-permissioned. The singleton is SQLite-backed (US-016), so the service needs zero K8s RBAC — that should be stated and enforced. STRIDE category: E. Proposed change: restricted PSA, non-root/read-only-rootfs/drop-ALL/seccomp, dedicated SA with no bindings + automountServiceAccountToken: false. Section: US-018 (new), FR-29, §15.3.
[Minor] No NetworkPolicy. New namespace with no default-deny; egress to the public internet was implicitly allowed, which also weakens the §3 "local-first, no cloud on the critical path" guarantee. STRIDE category: I/E. Proposed change: default-deny ingress+egress with enumerated allows (egress: broker + DNS; ingress: scraper to /metrics, operator/C2 to API); deny public-internet egress. Section: US-018, FR-30, §15.2.
[Minor] No supply-chain provenance. US-013 ships a Dockerfile/image that controls a physical gate, with no base-image digest pin, no hashed dependency lock, no SBOM, no signing/verification. STRIDE category: T (supply chain). Proposed change: digest-pinned base, hash-pinned lock, SBOM (SPDX/CycloneDX) + cosign-signed image with a documented verification step. Section: US-013, FR-31.
[Minor] FIPS boundary undocumented. Project does not declare FIPS support, but the crypto-bearing paths (MQTT TLS, API token/TLS verification, image-signature verification) were never identified. STRIDE category: n/a (compliance hygiene). Proposed change: explicit "FIPS out of scope for v1; these three are the crypto paths if it is ever required." Section: §15.5.
[Nit] Unexplained quiet-hours pulse is a spoofing tell. A 2 a.m. pulse with no known device transitioning the zones is the cheapest available indicator of a forged-fix attack; folded into the §13.11 runbook and §12 row 13 rather than a standalone finding (true defense is broker ACLs, not anomaly detection).
[Nit] Broker ACL ownership crosses the HA boundary. The broker is operated alongside Home Assistant (§3), so the FR-25 ACL spec is partly an HA/infra-side deliverable; noted in FR-25/US-018 but the broker config itself is outside this service's code.

Changes applied¶

US-001: added a no-plaintext-secrets AC (secrets by reference, write-only, redacted).
US-002: added required broker TLS + per-client auth (credentials from Secret) with a CI mock-broker opt-out, plus a TLS-verification-failure test.
US-007: added tamper-evident audit AC (append-only + out-of-pod log mirror, FR-28).
US-008: added the API auth-failure-spike alert (suspected unauthorized access).
US-009: added API authN + mutate-vs-read authZ (FR-26), GET /config secret redaction (FR-27), and auth/redaction tests.
US-013: added digest-pinned base image, hash-pinned lock, SBOM + signed-image provenance (FR-31).
US-017: added an AC that §15 (STRIDE + blast radius) and the §13.11 security-incident runbook ship.
US-018 (new): hardened K8s manifests — restricted PSA, zero-RBAC SA + automountServiceAccountToken: false, default-deny NetworkPolicy, Secret-referenced credentials, with a CI manifest-policy check.
§6: added FR-25 (broker TLS/auth/ACL), FR-26 (API auth), FR-27 (secret handling), FR-28 (audit tamper-evidence), FR-29 (PodSecurity + zero RBAC), FR-30 (NetworkPolicy), FR-31 (supply-chain provenance).
§11.1: added sgt_api_auth_failures_total.
§12: added row 13 (unauthorized access / spoofed location → §13.11).
§13: added runbook §13.11 (suspected unauthorized access / spoofed location).
§15 (new): Security & Threat Model — per-component STRIDE table, operator-pod-compromise blast-radius statement, privileged-operations review (none unjustified), security monitoring, FIPS-out-of-scope statement.

Outstanding¶

[Nit] The unexplained-quiet-hours-pulse spoofing tell and the cross-boundary broker-ACL ownership note are documented in §13.11 / FR-25 / US-018 rather than carried as standalone findings; the broker's own ACL/TLS config is an HA/infra deliverable outside this service's codebase, so it is referenced but not implemented by these stories.
[Nit] v1 deliberately keeps the API and broker inside the cluster boundary (FR-26) and defers the future C2 web UI's externally-reachable auth model (OIDC/mTLS at an ingress) to the C2 project; the token/mTLS scheme chosen here must be forward-compatible with it, but that is not built in v1.

Pass 4 — Codex (gpt-5.5, xhigh) · 4M / 3m / 0 nit

Cross-pass observations¶

Passes 1-3 caught the obvious product, operability, and security holes: speed hysteresis, measurable success metrics, singleton need, probes/metrics/runbooks, broker/API auth, secret handling, pod hardening, and supply-chain provenance. They still converged on several contract bugs: stale-location prose contradicted the state machine, the singleton design invented a Kubernetes-adjacent heartbeat instead of using the Kubernetes Lease API, the public config/API schema had hidden units and route drift, and MQTT/HA integration contracts were assumed rather than specified.

Findings¶

[Major] [Cross-pass blind spot] Pass 2 missed: "location stream silent -> no pulses" contradicted the TRIGGERING latch; if a phone died while TRIGGERING, the entity could keep contributing until a cap instead of failing safe. Proposed change: add explicit STALE state/contribution expiry when the latest accepted fix exceeds location_freshness_seconds, and test phone silence while TRIGGERING. Section: §4, US-005, US-011, FR-5/FR-6/FR-9, §11-§12.
[Major] [API design] Pass 2 and Pass 3 missed: the SQLite singleton heartbeat is a non-Kubernetes-native lock inside a Kubernetes deployment, and the "zero RBAC" defense was driving a worse design. Proposed change: replace the persistence-backed marker with a pre-created coordination.k8s.io/v1 Lease, scoped RBAC (get/update/patch on that single Lease), Lease-aware readiness, and NetworkPolicy egress to the local Kubernetes API. Section: US-016, US-018, FR-20/FR-29/FR-30, §14, §15.
[Major] [API design] The long-lived config/API schema used unitless speed fields (min_trigger_speed, max_approach_speed) while prose used mph and formulas needed m/s; fixing that after v1 would require avoidable conversion and compatibility work. Proposed change: make public config units canonical SI and rename the speed knobs to min_trigger_speed_mps and max_approach_speed_mps. Section: §3, US-001, US-005, US-012, FR-5/FR-17, §13.
[Major] [Integration assumption] MQTT command retention was unspecified even though a retained pulse on gate/cmd would replay an old open command when the controller reconnects. Proposed change: require actuator publishes to use retain=false and add wrapper/publisher tests that a retained command cannot be emitted. Section: §1, US-002, US-006, FR-1, §15.
[Minor] [Internal inconsistency] The PRD claimed a versioned API but acceptance criteria and runbooks mixed unversioned GET /config/GET /health with POST /v1/session/reset; health also reads state but was omitted from the auth list. Proposed change: standardize operator API paths on /v1/config, /v1/health, /v1/session/reset, with only /livez, /readyz, and /metrics intentionally unversioned. Section: US-009, US-017, FR-26/FR-27, §11, §13, §15.
[Minor] [Integration assumption] HA arm-zone automation and service geometry can drift, especially after runtime config updates; the prior passes treated "documented HA zones" as enough. Proposed change: require generated/mechanically checked HA zones and classify gate center/arm radius updates as external-artifact-coupled (requires_ha_sync unless the HA artifact is regenerated/validated). Section: US-009, US-014.
[Minor] [Prose/spec mismatch] The Non-Goals still referenced "speed + range-rate heuristics" after Pass 1 removed range-rate from v1. Proposed change: align the sentence with the speed-only v1 trigger. Section: §7.

Changes applied¶

Added STALE state semantics, stale TRIGGERING withdrawal, state/metric/test expectations, and failure-mode/runbook updates.
Replaced the SQLite singleton marker with a pre-created Kubernetes Lease, minimal resource-scoped RBAC, Lease readiness behavior, NetworkPolicy allowance, and revised threat-model blast-radius language.
Renamed public speed config fields to SI-unit names and updated formulas, tests, runbooks, and threat-model examples.
Required MQTT actuator commands to use retain=false and added acceptance-test coverage for retained-command prevention.
Standardized API paths on /v1/config and /v1/health while leaving probes/metrics unversioned.
Added HA zone source-of-truth validation and runtime requires_ha_sync handling for HA-coupled geometry changes.
Removed the leftover range-rate reference from the v1 Non-Goals prose.

Outstanding¶

None.

Post-revision codex review (gpt-5.5, xhigh) · 3M / 2m / 0 nit

Findings¶

[Major] The flock singleton was under-specified for Docker. The PRD said a process must hold a lock before publishing, but did not explicitly require a continuous lifetime lock on an open file descriptor, leaving room for a per-pulse acquire/release implementation where two coordinators could alternate pulses. It also did not state the local-filesystem dependency; flock on NFS/SMB/9p/FUSE/remote-synced volumes is not a safe single-writer primitive. Applied to: §4, US-016, FR-20, FR-29, §8, §11.4, §14.2, §15.3.
[Major] The Docker healthcheck/readiness contract needed compose-specific guardrails. readyz depending on the singleton lock is correct, but a non-owner must still start HTTP/probe endpoints and retry lock acquisition without blocking startup, otherwise lock contention looks like a dead process. Also, Docker Compose healthchecks are status signals, not automatic restarts; autoheal-style restart automation against readyz would reintroduce restart churn on dependency failures. Applied to: US-016, §8, §11.4, §14.2.
[Major] The audit-trim is acceptable only as a scoped container-compromise defense, not as durable tamper-proof evidence. Append-only SQLite plus Docker stdout is a pragmatic single-home repudiation control if the container cannot access Docker logs, but it needs explicit log-retention sizing and a clear caveat that host-root/Docker-daemon compromise or log-flood rotation can still erase the second copy. Applied to: US-007, FR-28, US-018, §15.1, §15.2.
[Minor] FR-30 network exposure needed a verifiable Docker boundary check. "Not public" is otherwise partly prose; the compose check now rejects bare port mappings, wildcard IPv4/IPv6 binds (0.0.0.0, ::), and requires absent ports or an explicit non-wildcard host IP plus documented firewall posture. Applied to: US-009, US-018, FR-26, FR-30.
[Minor] The Docker model introduced an operational failure not present in the Kubernetes PRD: a non-root, read-only-rootfs container can fail at startup if the SQLite/lock volume is not writable by the service UID/GID. Applied to: US-011, US-018, §12 row 7, §13.6, FR-29.
[Info] Searched the revised PRD for Kubernetes/Lease/RBAC/NetworkPolicy/control-plane stragglers. Remaining references are deliberate non-goals/footer history or Docker-secrets terminology, not active requirements.

Changes applied¶

Clarified flock semantics: lock must be held continuously for the publish-capable coordinator lifetime, auto-releases on fd close/process exit/SIGKILL, and non-local/synced filesystems are unsupported.
Required the singleton wait path to keep livez, readyz, /metrics, and /v1/health responsive while a non-owner retries lock acquisition.
Clarified compose healthcheck semantics: /readyz is for health status; no v1 autoheal/restart-on-unhealthy controller, and any future restart automation must use /livez.
Added Docker log-retention and repudiation-scope language for the stdout audit copy.
Tightened FR-30/US-018 to make port exposure checkable in CI.
Added data-volume writability/UID mismatch to degraded behavior, failure catalog, runbook, and hardening requirements.

Outstanding¶

prd.json was intentionally not modified. It still matches the Docker rewrite at a coarse story/order level, but it now lacks the new PRD requirements for continuous lifetime flock semantics, local-only data volume / non-local FS prohibition, SIGKILL/fd-close takeover tests, responsive probes while waiting for the lock, Docker log-retention settings, explicit host-port wildcard rejection, and data-volume UID/GID repair runbook coverage. Derived-story candidates affected: US-008, US-010, US-011, US-015, US-019, and US-020.

Full audit log: tasks/prd-smart-gate-trigger-service.critic-log.md.