
Rectify: Trace Identity Contract#776

Merged
Trecek merged 14 commits into integration from
recover-crashed-sessions-test-code-dev-shm-pollution-produce/771
Apr 13, 2026


@Trecek Trecek commented Apr 12, 2026

Summary

The trace file lifecycle had no identity contract. The filename encodes a PID, but PIDs
recycle, /dev/shm is writable by any process, and there is no boot-epoch anchor. The
30-second age heuristic was the sole gate before recover_crashed_sessions emitted a
subtype=crashed row, so it was trivially fooled by test artifacts, stale pre-reboot
files, and alien processes whose PIDs happen to match.

Part A establishes the enrollment sidecar contract: a per-session JSON file written
atomically at start_linux_tracing time, containing the identity triple
(boot_id, pid, starttime_ticks). Recovery now validates this sidecar before classifying
any trace file as a crash. LinuxTracingHandle.stop() deletes both files on clean exit so
recovery only ever sees genuine crashes.
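The atomic sidecar write in Part A could look roughly like the sketch below. It assumes a temp-file-plus-os.replace strategy in the same directory; the helper name write_enrollment_atomic and the exact JSON layout are illustrative assumptions, not the merged code. Only the filename pattern and the three field names come from this PR.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class TraceEnrollmentRecord:
    """Identity triple that ties a trace file to one specific process."""

    boot_id: str
    pid: int
    starttime_ticks: int


def write_enrollment_atomic(tmpfs: Path, record: TraceEnrollmentRecord) -> Path:
    """Hypothetical helper: write the sidecar so readers never see a partial file."""
    target = tmpfs / f"autoskillit_enrollment_{record.pid}.json"
    # Create the temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=tmpfs, prefix=".enrollment_", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(asdict(record), fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, target)  # atomic rename over the final name
    except BaseException:
        # Best-effort cleanup if anything failed before the rename completed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return target
```

Writing to a temp name first means a crash during the write leaves no file under the final sidecar name, so recovery sees either no sidecar or a complete one.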

Part B covers test isolation enforcement (__post_init__ guard, fixing four non-isolated
test call sites, and the shared isolated_tracing_config conftest fixture).

Architecture Impact

State Lifecycle Diagram

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 65, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;

    %% ── CONSTRUCTION PATH ── %%
    subgraph Construction ["● LinuxTracingConfig Construction (settings.py)"]
        direction TB

        CFG_FIELDS["● INIT_ONLY Fields<br/>━━━━━━━━━━<br/>enabled: bool = True<br/>proc_interval: float = 5.0<br/>log_dir: str = ''<br/>tmpfs_path: str = '/dev/shm'"]

        GUARD["● __post_init__ TEST GUARD<br/>━━━━━━━━━━<br/>PYTEST_CURRENT_TEST set?<br/>AND tmpfs_path == '/dev/shm'?<br/>AND caller frame in /tests/?<br/>→ raise RuntimeError"]

        PROD_OK["Production Path OK<br/>━━━━━━━━━━<br/>No PYTEST_CURRENT_TEST<br/>→ /dev/shm allowed"]

        LIBRARY_OK["Library Factory Path OK<br/>━━━━━━━━━━<br/>default_factory / from_dynaconf<br/>caller not in /tests/<br/>→ guard bypassed"]
    end

    %% ── TEST ISOLATION LAYER ── %%
    subgraph TestIsolation ["● Test Isolation Contracts (conftest.py)"]
        direction LR

        ISOLATED_FIX["● isolated_tracing_config<br/>━━━━━━━━━━<br/>tmpfs_path = tmp_path/shm<br/>proc_interval = 0.05s<br/>mkdir on setup"]

        TOOL_CTX_FIX["tool_ctx fixture<br/>━━━━━━━━━━<br/>log_dir → tmp_path/session_logs<br/>tmpfs_path → tmp_path/shm<br/>both redirected"]

        ISO_HOME["_isolated_home (autouse)<br/>━━━━━━━━━━<br/>Path.home() → tmp_path<br/>blocks real config.yaml load"]
    end

    %% ── TRACING RUNTIME ── %%
    subgraph TracingRuntime ["● linux_tracing.py — Runtime State"]
        direction TB

        ENABLED_GATE["enabled gate<br/>━━━━━━━━━━<br/>if not enabled → return None<br/>FAIL-FAST"]

        TMPFS_CHECK["tmpfs_path.is_dir()<br/>━━━━━━━━━━<br/>dir present → open JSONL<br/>absent → silent skip"]

        HANDLE_MUTABLE["LinuxTracingHandle MUTABLE<br/>━━━━━━━━━━<br/>_trace_file (open → None)<br/>_trace_path (path → None)<br/>_enrollment_path (path → None)<br/>_monitor_cancel_scope (live → None)"]

        SNAPSHOTS_APPEND["_snapshots APPEND_ONLY<br/>━━━━━━━━━━<br/>grows during proc_monitor loop<br/>never shrunk mid-session<br/>read by stop() for flush"]
    end

    %% ── SESSION LOG WRITE PATH ── %%
    subgraph SessionLog ["● session_log.py — Write Path"]
        direction TB

        FLUSH["flush_session_log()<br/>━━━━━━━━━━<br/>sole write entry point<br/>called: normal end + crash recovery"]

        SESSION_ARTIFACTS["WRITE_ONCE Artifacts<br/>━━━━━━━━━━<br/>summary.json<br/>proc_trace.jsonl<br/>anomalies.jsonl"]

        RECOVERY_ARTIFACTS["WRITE_ONCE Recovery Files<br/>━━━━━━━━━━<br/>token_usage.json (next boot)<br/>step_timing.json (next boot)<br/>audit_log.json (next boot)"]

        SESSIONS_IDX["sessions.jsonl APPEND_ONLY<br/>━━━━━━━━━━<br/>global audit index<br/>append mode — never read<br/>by live logic (only retention)"]

        TELEMETRY_FENCE["FENCE: .telemetry_cleared_at<br/>━━━━━━━━━━<br/>write_telemetry_clear_marker()<br/>read only on next server boot<br/>excludes pre-clear sessions"]
    end

    %% ── CRASH RECOVERY GATE CHAIN ── %%
    subgraph Recovery ["● session_log.py — Crash Recovery Gate Chain"]
        direction TB

        TIMING_GATE["Gate 0: 30s Freshness<br/>━━━━━━━━━━<br/>mtime < now - 30s?<br/>skip if too fresh (active session)"]

        ENROLL_GATE["Gate 1: Enrollment Sidecar<br/>━━━━━━━━━━<br/>autoskillit_enrollment_{pid}.json<br/>absent → skip (alien file)"]

        BOOT_GATE["Gate 2: boot_id Match<br/>━━━━━━━━━━<br/>enrollment.boot_id == current_boot_id?<br/>mismatch → DELETE both files (stale)"]

        PID_GATE["Gate 3: PID Liveness + Reuse<br/>━━━━━━━━━━<br/>psutil.pid_exists(pid)?<br/>starttime_ticks == enrolled_ticks?<br/>same ticks → alive, skip<br/>different ticks → recycled, RECOVER"]

        RECOVER_FLUSH["recover: flush_session_log()<br/>━━━━━━━━━━<br/>subtype='crashed', exit_code=-1<br/>termination_reason='CRASHED'<br/>then unlink trace + enrollment"]
    end

    %% ── CONNECTIONS ── %%

    %% Construction flow
    CFG_FIELDS --> GUARD
    GUARD -->|"PYTEST_CURRENT_TEST set<br/>+ caller in /tests/<br/>+ tmpfs == /dev/shm"| ERR_RAISE(["RuntimeError raised"])
    GUARD -->|"no PYTEST env"| PROD_OK
    GUARD -->|"caller not in /tests/"| LIBRARY_OK

    %% Test isolation feeds construction safely
    ISOLATED_FIX -->|"tmpfs_path = tmp_path/shm"| CFG_FIELDS
    TOOL_CTX_FIX -->|"both paths redirected"| CFG_FIELDS
    ISO_HOME -->|"blocks real config"| LIBRARY_OK

    %% Config fields flow into runtime
    PROD_OK --> ENABLED_GATE
    LIBRARY_OK --> ENABLED_GATE
    ENABLED_GATE -->|"enabled=True"| TMPFS_CHECK
    ENABLED_GATE -->|"enabled=False"| SKIP(["return None"])
    TMPFS_CHECK -->|"dir exists"| HANDLE_MUTABLE
    TMPFS_CHECK -->|"dir absent"| HANDLE_MUTABLE
    HANDLE_MUTABLE --> SNAPSHOTS_APPEND

    %% Runtime feeds session log
    SNAPSHOTS_APPEND -->|"stop() called"| FLUSH
    FLUSH --> SESSION_ARTIFACTS
    FLUSH --> RECOVERY_ARTIFACTS
    FLUSH --> SESSIONS_IDX
    FLUSH --> TELEMETRY_FENCE

    %% Recovery gate chain
    TIMING_GATE -->|"stale enough"| ENROLL_GATE
    ENROLL_GATE -->|"sidecar present"| BOOT_GATE
    BOOT_GATE -->|"boot_id matches"| PID_GATE
    PID_GATE -->|"PID gone or recycled"| RECOVER_FLUSH
    RECOVER_FLUSH -->|"calls"| FLUSH

    %% CLASS ASSIGNMENTS %%
    class CFG_FIELDS stateNode;
    class GUARD,ENROLL_GATE,BOOT_GATE,PID_GATE,TIMING_GATE detector;
    class PROD_OK,LIBRARY_OK phase;
    class ISOLATED_FIX,TOOL_CTX_FIX,ISO_HOME newComponent;
    class ENABLED_GATE,TMPFS_CHECK handler;
    class HANDLE_MUTABLE,SNAPSHOTS_APPEND stateNode;
    class FLUSH handler;
    class SESSION_ARTIFACTS,RECOVERY_ARTIFACTS,SESSIONS_IDX,TELEMETRY_FENCE output;
    class RECOVER_FLUSH phase;
    class ERR_RAISE,SKIP terminal;
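The boot_id and starttime_ticks values checked by Gates 2 and 3 come from the kernel. A minimal sketch, assuming the standard Linux locations /proc/sys/kernel/random/boot_id and field 22 of /proc/<pid>/stat. The names read_boot_id and read_starttime_ticks appear in this PR, but the bodies below are illustrative guesses and parse_starttime_ticks is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def parse_starttime_ticks(stat_line: str) -> Optional[int]:
    """Extract field 22 (starttime, in clock ticks since boot) from /proc/<pid>/stat text."""
    # Field 2 (comm) is parenthesised and may itself contain spaces or ')',
    # so split on the LAST ')' rather than on whitespace.
    _, sep, rest = stat_line.rpartition(")")
    if not sep:
        return None
    fields = rest.split()
    # `rest` starts at field 3 (state), so field 22 is at index 22 - 3 = 19.
    try:
        return int(fields[19])
    except (IndexError, ValueError):
        return None


def read_starttime_ticks(pid: int) -> Optional[int]:
    """Return the process start time in ticks, or None if /proc/<pid>/stat is unreadable."""
    try:
        return parse_starttime_ticks(Path(f"/proc/{pid}/stat").read_text())
    except OSError:
        return None


def read_boot_id() -> Optional[str]:
    """Return the kernel's per-boot random UUID, or None when unavailable."""
    try:
        return Path("/proc/sys/kernel/random/boot_id").read_text().strip()
    except OSError:
        return None
```

Because starttime is fixed for the life of a process, a (pid, starttime_ticks) pair distinguishes the enrolled process from a later one that reuses the same PID within a boot.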

Process Flow Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;

    %% TERMINALS %%
    START([START])
    BLOCKED([BLOCKED — RuntimeError])
    NONE([returns None])
    HANDLE([returns LinuxTracingHandle])
    COMPLETE([COMPLETE])

    %% ─── PHASE 1: CONFIG GUARD ─── %%
    subgraph GuardPhase ["● LinuxTracingConfig — Test Isolation Guard  (settings.py)"]
        direction TB
        N_config["● LinuxTracingConfig()<br/>━━━━━━━━━━<br/>dataclass __init__ called"]
        N_guard{"tmpfs_path == '/dev/shm'<br/>━━━━━━━━━━<br/>default value?"}
        N_pytest{"PYTEST_CURRENT_TEST<br/>━━━━━━━━━━<br/>env var set?"}
        N_frame{"inspect caller frame<br/>━━━━━━━━━━<br/>f_back.f_back<br/>co_filename in /tests/?"}
        N_pass["guard passes<br/>━━━━━━━━━━<br/>normal construction"]
    end

    %% ─── PHASE 2: TRACING START ─── %%
    subgraph StartPhase ["● start_linux_tracing()  (linux_tracing.py)"]
        direction TB
        N_start["● start_linux_tracing(pid, config, tg)<br/>━━━━━━━━━━<br/>entry: open session tracing"]
        N_gate1{"LINUX_TRACING_AVAILABLE<br/>━━━━━━━━━━<br/>& config.enabled?"}
        N_gate2{"tg (task group)<br/>━━━━━━━━━━<br/>provided?"}
        N_open["open tmpfs trace file<br/>━━━━━━━━━━<br/>config.tmpfs_path /autoskillit_trace_{pid}.jsonl<br/>buffering=1 (line-buffered)"]
        N_enroll["write enrollment sidecar (atomic)<br/>━━━━━━━━━━<br/>config.tmpfs_path /autoskillit_enrollment_{pid}.json<br/>TraceEnrollmentRecord: pid+boot_id+starttime_ticks"]
        N_spawn["tg.start_soon(_run_monitor)<br/>━━━━━━━━━━<br/>async monitor task in cancel scope"]
    end

    %% ─── PHASE 3: MONITORING LOOP ─── %%
    subgraph MonitorPhase ["● proc_monitor() + _run_monitor()  (linux_tracing.py)"]
        direction TB
        N_snap["read_proc_snapshot(pid)<br/>━━━━━━━━━━<br/>psutil.Process.oneshot()<br/>+ /proc/{pid}/status|oom_score|wchan"]
        N_gone{"snapshot is None?<br/>━━━━━━━━━━<br/>process exited"}
        N_clock{"captured_at ≤<br/>last_captured_at?<br/>━━━━━━━━━━<br/>NTP / WSL2 clock jump"}
        N_advance["advance by 1 µs<br/>━━━━━━━━━━<br/>maintain monotonic<br/>captured_at invariant"]
        N_append["append to handle._snapshots<br/>━━━━━━━━━━<br/>in-memory list"]
        N_write["write JSON line to tmpfs file<br/>━━━━━━━━━━<br/>crash-resilient streaming"]
        N_file_err{"OSError on write?"}
        N_degrade["close trace file<br/>━━━━━━━━━━<br/>degrade to in-memory only"]
        N_sleep["anyio.sleep(config.proc_interval)<br/>━━━━━━━━━━<br/>default 5.0s → loop"]
    end

    %% ─── PHASE 4: SESSION END + LOG FLUSH ─── %%
    subgraph FlushPhase ["● handle.stop() + flush_session_log()  (linux_tracing.py / session_log.py)"]
        direction TB
        N_stop["handle.stop()<br/>━━━━━━━━━━<br/>cancel CancelScope<br/>flush+close trace file<br/>unlink tmpfs files"]
        N_flush["● flush_session_log()<br/>━━━━━━━━━━<br/>called by headless.py<br/>after session completes"]
        N_proc_trace["write proc_trace.jsonl<br/>━━━━━━━━━━<br/>session_dir/proc_trace.jsonl"]
        N_anomaly["detect_anomalies()<br/>━━━━━━━━━━<br/>post-hoc over snapshots"]
        N_summary["write summary.json<br/>━━━━━━━━━━<br/>+ anomalies.jsonl (if any)"]
        N_index["append to sessions.jsonl<br/>━━━━━━━━━━<br/>log_root/sessions.jsonl<br/>global append-only index"]
        N_retention["_enforce_retention()<br/>━━━━━━━━━━<br/>trim to max 500 session dirs<br/>rewrite sessions.jsonl"]
    end

    %% ─── PHASE 5: CRASH RECOVERY ─── %%
    subgraph RecoveryPhase ["● recover_crashed_sessions()  (session_log.py)"]
        direction TB
        R_entry["● recover_crashed_sessions(tmpfs_path)<br/>━━━━━━━━━━<br/>called at server startup<br/>glob autoskillit_trace_*.jsonl"]
        R_age{"file age < 30s?<br/>━━━━━━━━━━<br/>may be active session"}
        R_sidecar{"enrollment sidecar<br/>━━━━━━━━━━<br/>autoskillit_enrollment_{pid}.json<br/>exists & parses?"}
        R_bootid{"boot_id matches<br/>━━━━━━━━━━<br/>current boot?"}
        R_pid{"psutil.pid_exists(pid)?"}
        R_ticks{"starttime_ticks<br/>━━━━━━━━━━<br/>unchanged?<br/>(same process)"}
        R_read["read snapshot lines<br/>━━━━━━━━━━<br/>from trace file"]
        R_flush["flush_session_log()<br/>━━━━━━━━━━<br/>subtype='crashed', exit_code=-1<br/>termination_reason='CRASHED'"]
        R_cleanup["unlink trace file<br/>━━━━━━━━━━<br/>+ enrollment sidecar"]
    end

    %% ─── PHASE 6: TEST ISOLATION FIXTURE ─── %%
    subgraph FixturePhase ["● isolated_tracing_config fixture  (tests/execution/conftest.py)"]
        direction TB
        F_fixture["● isolated_tracing_config(tmp_path)<br/>━━━━━━━━━━<br/>pytest fixture"]
        F_return["LinuxTracingConfig(<br/>━━━━━━━━━━<br/>tmpfs_path=str(tmp_path/shm)<br/>proc_interval=0.05)"]
    end

    %% ─── FLOW: CONFIG GUARD ─── %%
    START --> N_config
    N_config --> N_guard
    N_guard -->|"yes (default path)"| N_pytest
    N_guard -->|"no (custom path)"| N_pass
    N_pytest -->|"yes — test context"| N_frame
    N_pytest -->|"no — production"| N_pass
    N_frame -->|"yes — direct test caller"| BLOCKED
    N_frame -->|"no — library machinery"| N_pass

    %% ─── FLOW: TRACING START ─── %%
    N_pass --> N_start
    N_start --> N_gate1
    N_gate1 -->|"no"| NONE
    N_gate1 -->|"yes"| N_gate2
    N_gate2 -->|"no"| NONE
    N_gate2 -->|"yes"| N_open
    N_open --> N_enroll
    N_enroll --> N_spawn
    N_spawn --> HANDLE

    %% ─── FLOW: MONITORING LOOP ─── %%
    HANDLE --> N_snap
    N_snap --> N_gone
    N_gone -->|"yes"| N_stop
    N_gone -->|"no"| N_clock
    N_clock -->|"yes (stepped back)"| N_advance
    N_clock -->|"no"| N_append
    N_advance --> N_append
    N_append --> N_write
    N_write --> N_file_err
    N_file_err -->|"yes"| N_degrade
    N_file_err -->|"no"| N_sleep
    N_degrade --> N_sleep
    N_sleep -->|"loop"| N_snap

    %% ─── FLOW: SESSION END + FLUSH ─── %%
    N_stop --> N_flush
    N_flush --> N_proc_trace
    N_proc_trace --> N_anomaly
    N_anomaly --> N_summary
    N_summary --> N_index
    N_index --> N_retention
    N_retention --> COMPLETE

    %% ─── FLOW: CRASH RECOVERY ─── %%
    R_entry --> R_age
    R_age -->|"yes — skip"| R_entry
    R_age -->|"no — old enough"| R_sidecar
    R_sidecar -->|"missing — alien file, skip"| R_entry
    R_sidecar -->|"found"| R_bootid
    R_bootid -->|"mismatch — stale pre-reboot<br/>unlink both files"| R_entry
    R_bootid -->|"match"| R_pid
    R_pid -->|"no — process gone"| R_read
    R_pid -->|"yes — alive"| R_ticks
    R_ticks -->|"unchanged — still running, skip"| R_entry
    R_ticks -->|"changed — PID recycled, treat as crash"| R_read
    R_read --> R_flush
    R_flush --> R_cleanup
    R_cleanup --> R_entry

    %% ─── FLOW: TEST ISOLATION ─── %%
    F_fixture --> F_return
    F_return -->|"passed to tests as LinuxTracingConfig"| N_start

    %% CLASS ASSIGNMENTS %%
    class START,BLOCKED,NONE,HANDLE,COMPLETE terminal;
    class N_config,N_pass,N_start,N_open,N_enroll,N_spawn handler;
    class N_guard,N_pytest,N_frame,N_gate1,N_gate2,N_gone,N_clock,N_file_err stateNode;
    class N_snap,N_advance,N_append,N_write,N_degrade,N_sleep phase;
    class N_stop,N_flush,N_proc_trace,N_anomaly,N_summary,N_index,N_retention handler;
    class R_entry,R_read,R_flush,R_cleanup handler;
    class R_age,R_sidecar,R_bootid,R_pid,R_ticks detector;
    class F_fixture,F_return newComponent;
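The clock-jump branch in the monitoring loop (N_clock → N_advance) keeps captured_at strictly increasing. A minimal sketch, assuming timestamps are datetime objects; the diagram does not say how they are represented, and monotonic_captured_at is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def monotonic_captured_at(candidate: datetime, last: Optional[datetime]) -> datetime:
    """Return `candidate`, or `last` + 1 microsecond if the wall clock did not advance."""
    if last is not None and candidate <= last:
        # Clock stepped backwards (NTP adjustment, WSL2 resume): nudge forward instead.
        return last + timedelta(microseconds=1)
    return candidate


# Usage: feed each new reading through the clamp before appending the snapshot.
last_seen: Optional[datetime] = None
for reading in (
    datetime(2026, 4, 12, 12, 0, 5, tzinfo=timezone.utc),
    datetime(2026, 4, 12, 12, 0, 0, tzinfo=timezone.utc),  # stepped back 5 s
):
    last_seen = monotonic_captured_at(reading, last_seen)
```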

Error/Resilience Diagram

%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 55, 'curve': 'basis'}}}%%
flowchart TB
    %% CLASS DEFINITIONS %%
    classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff;
    classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff;
    classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff;
    classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff;
    classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff;
    classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff;
    classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000;

    CONSTRUCT_START(["LinuxTracingConfig()"])

    subgraph PostInitGate ["● CONSTRUCTION GUARD — LinuxTracingConfig.__post_init__"]
        CHECK_SHM{"tmpfs_path<br/>== /dev/shm?"}
        CHECK_PYTEST{"PYTEST_CURRENT_TEST<br/>env set?"}
        CHECK_FRAME{"caller frame<br/>inside /tests/?"}
        GUARD_PASS["Config accepted<br/>━━━━━━━━━━<br/>proceeds to use"]
    end

    subgraph TestIsolation ["● TEST ISOLATION ESCAPE HATCH"]
        FIXTURE["● isolated_tracing_config<br/>━━━━━━━━━━<br/>pytest fixture<br/>tests/execution/conftest.py"]
        SAFE_CONFIG["LinuxTracingConfig<br/>━━━━━━━━━━<br/>tmpfs_path=str(tmp_path/shm)<br/>proc_interval=0.05"]
    end

    subgraph CrashRecovery ["● CRASH RECOVERY GATES — recover_crashed_sessions"]
        SCAN["Scan tmpfs<br/>━━━━━━━━━━<br/>autoskillit_trace_*.jsonl"]
        AGE_GATE{"mtime age<br/>< 30s?"}
        ENROLLMENT_GATE{"enrollment<br/>sidecar exists?"}
        BOOT_GATE{"boot_id<br/>matches current?"}
        PID_GATE{"PID alive +<br/>same starttime_ticks?"}
        READ_SNAPS["Read snapshots<br/>━━━━━━━━━━<br/>parse JSONL lines<br/>OSError → skip file"]
    end

    subgraph Finalization ["● CRASH FINALIZATION — flush_session_log"]
        FLUSH["flush_session_log<br/>━━━━━━━━━━<br/>subtype=crashed<br/>termination=CRASHED"]
        FLUSH_ERR{"Exception<br/>during flush?"}
        UNLINK["Unlink trace +<br/>enrollment files<br/>━━━━━━━━━━<br/>missing_ok=True"]
    end

    T_RUNTIME_ERR(["RuntimeError<br/>dev/shm rejected<br/>in test context"])
    T_PASS(["SAFE<br/>config constructed"])
    T_SKIP(["SKIPPED<br/>file left in place"])
    T_STALE_DEL(["STALE DELETED<br/>pre-reboot file removed"])
    T_RECOVERED(["RECOVERED<br/>session finalized<br/>count += 1"])
    T_FIXTURE_SAFE(["TEST SAFE<br/>isolated tmp_path/shm"])

    %% Construction gate flow
    CONSTRUCT_START --> CHECK_SHM
    CHECK_SHM -->|"no — safe path"| GUARD_PASS
    CHECK_SHM -->|"yes"| CHECK_PYTEST
    CHECK_PYTEST -->|"not set — production"| GUARD_PASS
    CHECK_PYTEST -->|"set"| CHECK_FRAME
    CHECK_FRAME -->|"library machinery<br/>(AutomationConfig etc.)"| GUARD_PASS
    CHECK_FRAME -->|"direct test code"| T_RUNTIME_ERR
    GUARD_PASS --> T_PASS

    %% Test isolation path (recommended escape hatch)
    FIXTURE --> SAFE_CONFIG --> T_FIXTURE_SAFE

    %% Crash recovery gates
    SCAN --> AGE_GATE
    AGE_GATE -->|"too recent<br/>(may be active)"| T_SKIP
    AGE_GATE -->|"old enough"| ENROLLMENT_GATE
    ENROLLMENT_GATE -->|"missing — alien/test file"| T_SKIP
    ENROLLMENT_GATE -->|"present"| BOOT_GATE
    BOOT_GATE -->|"mismatch<br/>→ unlink both"| T_STALE_DEL
    BOOT_GATE -->|"matches"| PID_GATE
    PID_GATE -->|"alive, same ticks<br/>(process running)"| T_SKIP
    PID_GATE -->|"gone or PID recycled"| READ_SNAPS
    READ_SNAPS --> FLUSH

    %% Finalization
    FLUSH --> FLUSH_ERR
    FLUSH_ERR -->|"yes — log debug,<br/>continue loop"| T_SKIP
    FLUSH_ERR -->|"no"| UNLINK
    UNLINK --> T_RECOVERED

    %% CLASS ASSIGNMENTS
    class CONSTRUCT_START terminal;
    class CHECK_SHM,CHECK_PYTEST,CHECK_FRAME,AGE_GATE,ENROLLMENT_GATE,BOOT_GATE,PID_GATE,FLUSH_ERR detector;
    class GUARD_PASS,READ_SNAPS handler;
    class SCAN phase;
    class FIXTURE,SAFE_CONFIG newComponent;
    class FLUSH,UNLINK output;
    class T_RUNTIME_ERR,T_PASS,T_SKIP,T_STALE_DEL,T_RECOVERED,T_FIXTURE_SAFE terminal;
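The gate ordering in the diagrams can be condensed into one pure decision function. This is a sketch under stated assumptions: Verdict, Enrollment, and classify_trace are hypothetical names, and the inputs (file age, parsed sidecar, liveness, current ticks) are assumed to be gathered by the caller. The real recover_crashed_sessions interleaves these checks with file I/O.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    SKIP = "skip"                  # leave both files in place
    DELETE_STALE = "delete_stale"  # pre-reboot leftovers: unlink both files
    RECOVER = "recover"            # finalize the session as subtype='crashed'


@dataclass(frozen=True)
class Enrollment:
    boot_id: str
    pid: int
    starttime_ticks: int


def classify_trace(
    age_seconds: float,
    enrollment: Optional[Enrollment],
    current_boot_id: str,
    pid_alive: bool,
    current_ticks: Optional[int],
) -> Verdict:
    """Apply Gates 0-3 in order; earlier gates short-circuit later ones."""
    if age_seconds < 30:                      # Gate 0: may belong to an active session
        return Verdict.SKIP
    if enrollment is None:                    # Gate 1: no sidecar, alien or test file
        return Verdict.SKIP
    if enrollment.boot_id != current_boot_id:  # Gate 2: written before a reboot
        return Verdict.DELETE_STALE
    # Gate 3: skip only when the ticks are positively known and match the enrollment.
    if pid_alive and current_ticks is not None and current_ticks == enrollment.starttime_ticks:
        return Verdict.SKIP
    return Verdict.RECOVER
```

Keeping the decision free of I/O makes the ordering easy to test: a missing sidecar or a boot_id mismatch is decided before PID liveness is ever consulted.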

Closes #771

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/remediation-20260412-142552-496727/.autoskillit/temp/rectify/rectify_trace_identity_contract_2026-04-12_180100_part_a.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

| Step | uncached | output | cache_read | cache_write | count | time |
|---|---|---|---|---|---|---|
| group | 2.7k | 68.8k | 3.1M | 200.8k | 1 | 22m 49s |
| plan | 3.2k | 139.0k | 4.9M | 383.9k | 6 | 48m 3s |
| verify | 3.2k | 81.1k | 4.5M | 293.1k | 6 | 24m 41s |
| implement | 4.9k | 156.8k | 20.6M | 442.5k | 7 | 56m 14s |
| fix | 818 | 43.5k | 4.2M | 208.9k | 4 | 27m 56s |
| Total | 14.9k | 489.2k | 37.3M | 1.5M | | 2h 59m |

@Trecek Trecek left a comment

AutoSkillit PR Review — Verdict: changes_requested

import inspect
import os

if self.tmpfs_path != "/dev/shm" or not os.environ.get("PYTEST_CURRENT_TEST"):

[warning] defense: The guard walks exactly two frames back (frame.f_back.f_back) to find the test caller. If LinuxTracingConfig is constructed via a helper factory, dataclasses.replace, a classmethod, or a one-liner wrapper, the test-file frame will be further than two levels up and the guard will silently not fire. The depth assumption is fragile and not documented in the error message.

Investigated — this is intentional. The two-level frame walk is deliberately scoped to direct test callers only (commit 9e07dc5, comment on lines 200-202 documents this). Factory paths, from_dynaconf, and AutomationConfig default_factory are intentionally excluded: they construct LinuxTracingConfig with production defaults and should not be blocked. The 'fragility' is the feature: indirect paths bypass the guard by design.

# (e.g. AutomationConfig default_factory, from_dynaconf). We inspect the call
# frame two levels up: __post_init__ → __init__ (generated) → actual caller.
frame = inspect.currentframe()
init_frame = frame.f_back if frame is not None else None

[warning] arch: The call-stack depth heuristic is fragile. The /tests/ check fires only if the caller is exactly two frames up. Indirect construction paths (fixtures that call helpers, default_factory in AutomationConfig, from_dynaconf) would bypass this guard silently, giving a false sense of safety while still writing to real /dev/shm.

Investigated — this is intentional. The commit message 'fix: scope post_init guard to direct test callers only' (9e07dc5) explicitly defines the scope. Fixtures calling helpers and from_dynaconf bypassing the guard is the desired behavior: only direct test-code instantiation of LinuxTracingConfig(tmpfs_path='/dev/shm') should be blocked. Category: false_positive_intentional_pattern.

def _write_old_trace(tmpfs: Path, filename: str, content: str) -> Path:
"""Write a trace file and backdate its mtime to 60 seconds ago."""
import time
"""Write a trace file (backdated 60s) and its enrollment sidecar.

[critical] tests: The _write_old_trace docstring is malformed. The opening """ closes prematurely, leaving "The enrollment sidecar uses..." and "The PID embedded in the filename..." as bare statements outside any string literal, causing a SyntaxError at import time. Wrap the full docstring text inside a single triple-quoted string.

Investigated — this is a stale comment. python3 -m py_compile tests/execution/test_session_log.py succeeds (Syntax OK). The docstring at lines 393-398 is properly closed with triple-quotes. The malformed state observed in the diff was corrected in a later commit on this branch before the review was submitted.

assert enrollment.exists(), "Enrollment sidecar for alive PID must not be deleted"


def test_recover_crashed_sessions_skips_file_without_enrollment(tmp_path):

[warning] tests: Hardcoded PID 99997 — if that PID is alive on the test host, psutil.pid_exists(99997) returns True and Gate 3 may short-circuit before Gate 1 (missing enrollment) is tested. Use a guaranteed-dead PID (e.g. beyond Linux PID_MAX) or mock psutil.pid_exists.

Investigated — this is intentional. Gate 1 (enrollment sidecar existence, session_log.py lines 341-346) runs BEFORE Gate 3 (PID liveness, lines 355-360). This test creates a trace for PID 99997 with NO enrollment sidecar. Gate 1 fires: enrollment is None → continue. Gate 3 is never reached. An alive PID 99997 cannot cause the test to fail.

lambda: "current-boot-id",
)
tmpfs = tmp_path / "shm"
tmpfs.mkdir()

[warning] tests: Same concern for hardcoded PID 99996 in test_recover_crashed_sessions_skips_wrong_boot_id. If alive, Gate 3 short-circuits before Gate 2 (boot_id check) is exercised. Use a guaranteed-dead PID or mock the liveness check.

Investigated — this is intentional. Gate 2 (boot_id mismatch, session_log.py lines 348-353) runs BEFORE Gate 3 (PID liveness). This test creates an enrollment sidecar with boot_id='stale-boot-id' while the mock returns 'current-boot-id'. Gate 2 fires: mismatch → unlink + continue. Gate 3 is never reached. An alive PID 99996 cannot cause the test to fail.

@Trecek Trecek left a comment

AutoSkillit review found 14 blocking issues. See inline comments.

@Trecek Trecek enabled auto-merge April 12, 2026 23:58
Trecek and others added 14 commits April 12, 2026 17:00

Tests 1.1-1.3 (session_log): recover_crashed_sessions must skip trace
files for alive PIDs, files without enrollment sidecars, and files whose
enrollment boot_id doesn't match the current boot.

Tests 1.4-1.5 (linux_tracing): start_linux_tracing must write an
enrollment sidecar atomically; stop() must unlink both trace and sidecar.

All 5 tests fail until the implementation is added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Introduces a per-session enrollment sidecar
(autoskillit_enrollment_{pid}.json) written atomically at
start_linux_tracing time, containing the identity triple
(boot_id, pid, starttime_ticks) that anchors the trace file to the
specific process that created it.

Key changes:
- linux_tracing.py: Add TraceEnrollmentRecord frozen dataclass,
  _write_enrollment_atomic and _read_enrollment helpers; extend
  start_linux_tracing with session_id/kitchen_id/order_id keyword
  params; write sidecar immediately after opening trace file; update
  LinuxTracingHandle.stop() to unlink both trace and sidecar on clean
  exit so recovery only ever sees genuine crashes.
- session_log.py: Add three-gate identity chain to
  recover_crashed_sessions: (1) enrollment sidecar must exist,
  (2) boot_id must match current boot, (3) PID must be dead or
  starttime_ticks must differ. Delete both files after recovery.
- test_session_log.py: Update _write_old_trace to write companion
  enrollment sidecars so existing recovery tests pass the new gates.
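The clean-exit cleanup in stop() described in the list above could be sketched as follows. TracingHandleSketch is a hypothetical stand-in for LinuxTracingHandle: it models only the file handling and omits the monitor cancel scope and snapshot list that the real handle carries.

```python
from pathlib import Path
from typing import IO, Optional


class TracingHandleSketch:
    """Hypothetical, reduced model of a tracing handle that owns two tmpfs files."""

    def __init__(self, trace_path: Path, enrollment_path: Path) -> None:
        self._trace_path: Optional[Path] = trace_path
        self._enrollment_path: Optional[Path] = enrollment_path
        # buffering=1 selects line buffering, matching the streaming JSONL writes.
        self._trace_file: Optional[IO[str]] = open(trace_path, "w", buffering=1)

    def stop(self) -> None:
        """Flush and close the trace file, then unlink trace and sidecar."""
        if self._trace_file is not None:
            self._trace_file.flush()
            self._trace_file.close()
            self._trace_file = None
        for attr in ("_trace_path", "_enrollment_path"):
            path = getattr(self, attr)
            if path is not None:
                path.unlink(missing_ok=True)  # tolerate a file that is already gone
                setattr(self, attr, None)     # makes a second stop() a no-op
```

Because a clean stop() removes both files, anything recovery finds at startup was left by a process that never reached stop().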

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test_streaming_writes_each_snapshot_as_jsonl: save trace path and flush
before calling stop(), which now deletes both trace and enrollment files
on clean exit. Read file content into variable before stop() runs.

test_current_json_write_sites_match_allowlist: update session_log.py
allowlist line numbers (206→213, 219→226, 222→229) shifted by the
enrollment sidecar code added above the existing atomic_write calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ayloads

Lines 219→226 and 222→229 in session_log.py shifted by the enrollment sidecar
code; the hardcoded list_sites check in the same convention test needed the same
update as _LEGACY_JSON_WRITES.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…lated_tracing_config fixture

Tests assert:
- __post_init__ raises RuntimeError when tmpfs_path == /dev/shm and PYTEST_CURRENT_TEST is set
- Custom tmpfs path does not raise in test env
- /dev/shm is allowed outside pytest (production path)
- isolated_tracing_config fixture returns non-/dev/shm temp dir

All four tests fail before implementation (no guard, no fixture yet).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…/shm in test env

When PYTEST_CURRENT_TEST is set and tmpfs_path is the production default /dev/shm,
construction raises RuntimeError with a diagnostic message pointing to the correct
fix. Zero overhead in production — env var is never set outside pytest.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nfig call sites

Five test call sites that previously wrote to the real /dev/shm are now
explicitly isolated: four in test_linux_tracing.py and one in
test_session_log_integration.py. Each gains a tmp_path fixture param
where it lacked one, and passes tmpfs_path=str(tmp_path) to the constructor.

These tests would have raised RuntimeError at construction after the
__post_init__ guard was added; this commit restores them to passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Provides a canonical, pre-isolated LinuxTracingConfig for all tracing tests.
The fixture creates a shm/ subdir under tmp_path and returns a config pointing
to it — never to the real /dev/shm. New tests should use this fixture instead
of constructing LinuxTracingConfig manually.
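A sketch of what such a fixture might look like. The LinuxTracingConfig below is a stand-in dataclass with the four fields shown in the diagrams, and make_isolated_tracing_config is a hypothetical factory; the pytest wiring is shown only as a comment so the block runs without pytest installed.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LinuxTracingConfig:
    """Stand-in for the real settings class (fields taken from the PR diagrams)."""

    enabled: bool = True
    proc_interval: float = 5.0
    log_dir: str = ""
    tmpfs_path: str = "/dev/shm"


def make_isolated_tracing_config(tmp_path: Path) -> LinuxTracingConfig:
    """Build a config whose tmpfs_path lives under the test's private tmp dir."""
    shm = tmp_path / "shm"
    shm.mkdir(parents=True, exist_ok=True)
    # A short proc_interval keeps tests fast; 0.05 is the value quoted in this PR.
    return LinuxTracingConfig(tmpfs_path=str(shm), proc_interval=0.05)


# In tests/execution/conftest.py this would presumably be exposed as:
#
#     @pytest.fixture
#     def isolated_tracing_config(tmp_path):
#         return make_isolated_tracing_config(tmp_path)
```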

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The guard fired too broadly — any AutomationConfig() or from_dynaconf()
call in tests triggered it even when no actual /dev/shm writes would
occur. Use frame inspection to fire only when the immediate caller
(two frames up: past __post_init__ and the generated __init__) is test
code (/tests/ in filename). Library machinery (AutomationConfig
default_factory, from_dynaconf) resolves to <string> or src/, so it
passes through.
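The two-frame walk can be illustrated with a reduced dataclass. This is a sketch, not the code in settings.py: TracingConfigSketch has only one field, and the error text is invented. It follows the first guard line and the frame variables quoted in the review threads.

```python
import inspect
import os
from dataclasses import dataclass


@dataclass
class TracingConfigSketch:
    """Hypothetical one-field model of the guarded config class."""

    tmpfs_path: str = "/dev/shm"

    def __post_init__(self) -> None:
        # Cheap early return: custom path, or not running under pytest.
        if self.tmpfs_path != "/dev/shm" or not os.environ.get("PYTEST_CURRENT_TEST"):
            return
        # __post_init__ -> generated __init__ -> actual caller (two frames up).
        frame = inspect.currentframe()
        init_frame = frame.f_back if frame is not None else None
        caller = init_frame.f_back if init_frame is not None else None
        try:
            filename = caller.f_code.co_filename if caller is not None else ""
            if "/tests/" in filename.replace(os.sep, "/"):
                raise RuntimeError(
                    "tmpfs_path='/dev/shm' constructed directly from test code; "
                    "pass an isolated tmp_path instead"  # illustrative message
                )
        finally:
            # Drop frame references promptly so they cannot form reference cycles.
            del frame, init_frame, caller
```

Only a constructor call whose immediate caller lives under /tests/ is rejected; a factory in library code adds a frame whose filename is outside /tests/, so it passes, which is the scoping this commit describes.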

Also set tool_ctx.config.linux_tracing.tmpfs_path to an isolated
tmp_path for proper test isolation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…le filenames

Gate 3 previously skipped recovery when read_starttime_ticks() returned None
(unreadable /proc/<pid>/stat — a transient state after process exit), causing
genuinely crashed sessions to be missed. Fix: only skip when ticks are positively
known and match the enrollment record.

Also change the pid parse-failure fallback from 0 to -1: PID 0 exists on Linux
(swapper), making psutil.pid_exists(0) return True and silently short-circuiting
Gate 3. PID -1 is guaranteed non-existent.
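The parse-failure fallback can be sketched as a small filename parser. The regular expression assumes the autoskillit_trace_{pid}.jsonl naming shown in the diagrams; pid_from_trace_filename is a hypothetical name.

```python
import re

# Assumed naming convention: autoskillit_trace_<decimal pid>.jsonl
_TRACE_NAME = re.compile(r"^autoskillit_trace_(\d+)\.jsonl$")


def pid_from_trace_filename(name: str) -> int:
    """Return the PID embedded in a trace filename, or -1 when it cannot be parsed."""
    match = _TRACE_NAME.match(name)
    # -1 rather than 0: PID 0 is reported as existing on Linux, so a 0 fallback
    # would make an unparseable filename look like a live process.
    return int(match.group(1)) if match else -1
```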

Addresses reviewer comments 3070341630 (critical) and 3070341628.

The function was imported across module boundaries from linux_tracing.py
into session_log.py with an underscore prefix signalling "private". Since
it is legitimately shared between sister modules in the execution package,
rename it to read_enrollment (no underscore) — consistent with the other
public helpers (read_boot_id, read_starttime_ticks) in the same import block.
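For orientation, a reader with this shape would satisfy the "exists & parses?" check in the recovery flow. The body is an assumption: only the name read_enrollment and the three record fields come from this PR, and the choice to return None on any read or parse failure is inferred from Gate 1's "enrollment is None" wording.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class TraceEnrollmentRecord:
    boot_id: str
    pid: int
    starttime_ticks: int


def read_enrollment(path: Path) -> Optional[TraceEnrollmentRecord]:
    """Return the parsed sidecar, or None if it is missing, unreadable, or malformed."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        return TraceEnrollmentRecord(
            boot_id=str(data["boot_id"]),
            pid=int(data["pid"]),
            starttime_ticks=int(data["starttime_ticks"]),
        )
    except (OSError, ValueError, KeyError, TypeError):
        # OSError: missing/unreadable file; ValueError: bad JSON or non-numeric field;
        # KeyError/TypeError: wrong shape (missing key, or top level is not an object).
        return None
```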

Addresses reviewer comment 3070341627.

… locals in settings.py

The deferred `import inspect` and `import os` inside `__post_init__` executed on
every LinuxTracingConfig instantiation (the early-return check comes after them).
Moving them to module level surfaces the dependency explicitly and avoids repeated
import overhead.

Also delete the frame locals (`del frame, init_frame, caller`) after the guard
block to prevent reference cycles in non-CPython runtimes.

Addresses reviewer comments 3070341619 and 3070341622.

…linux_tracing.py

1. Add clarifying comment to stop() explaining why _trace_path is unconditionally
   unlinked: crash-recovery only reads files left by processes that never called
   stop(), so clean sessions correctly clean up their own file.

2. Add logger.warning() when the enrollment sidecar write fails with OSError.
   Previously silently swallowed; the missing sidecar causes Gate 1 to skip
   recovery for that session with no operator-visible diagnostic.
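The warning path in item 2 might be sketched like this. try_write_enrollment and the logger name are hypothetical, and the write here is deliberately a plain write_text rather than the atomic write the PR uses, to keep the focus on the error handling.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("tracing_sketch")  # hypothetical logger name


def try_write_enrollment(path: Path, payload: dict) -> bool:
    """Write the sidecar; on OSError, log a warning instead of failing tracing startup."""
    try:
        path.write_text(json.dumps(payload), encoding="utf-8")
    except OSError as exc:
        # Without the sidecar, Gate 1 will skip this session during crash recovery,
        # so surface the failure where an operator can see it.
        logger.warning("enrollment sidecar write failed for %s: %s", path, exc)
        return False
    return True
```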

Addresses reviewer comments 3070341625 and 3070341626.

…tests

In test_start_linux_tracing_writes_enrollment_sidecar and
test_stop_unlinks_trace_and_enrollment, handle.stop() was called before
proc.kill(). If stop() raises, proc.kill() never executes, leaking the
sleep 2 subprocess. Wrap stop() in try/finally to guarantee proc.kill()
always runs.

Addresses reviewer comments 3070341631 and 3070341632.
@Trecek Trecek force-pushed the recover-crashed-sessions-test-code-dev-shm-pollution-produce/771 branch from b0888f9 to 5247553 April 13, 2026 00:01
@Trecek Trecek added this pull request to the merge queue Apr 13, 2026
Merged via the queue into integration with commit f89ed37 Apr 13, 2026
2 checks passed
@Trecek Trecek deleted the recover-crashed-sessions-test-code-dev-shm-pollution-produce/771 branch April 13, 2026 00:09