Trecek
left a comment
AutoSkillit PR Review — Verdict: changes_requested
import inspect
import os

if self.tmpfs_path != "/dev/shm" or not os.environ.get("PYTEST_CURRENT_TEST"):
[warning] defense: The guard walks exactly two frames back (frame.f_back.f_back) to find the test caller. If LinuxTracingConfig is constructed via a helper factory, dataclasses.replace, a classmethod, or a one-liner wrapper, the test-file frame will be further than two levels up and the guard will silently not fire. The depth assumption is fragile and not documented in the error message.
Investigated — this is intentional. The two-level frame walk is deliberately scoped to direct test callers only (commit 9e07dc5, comment on lines 200-202 documents this). Factory paths, from_dynaconf, and AutomationConfig default_factory are intentionally excluded: they construct LinuxTracingConfig with production defaults and should not be blocked. The 'fragility' is the feature: indirect paths bypass the guard by design.
# (e.g. AutomationConfig default_factory, from_dynaconf). We inspect the call
# frame two levels up: __post_init__ → __init__ (generated) → actual caller.
frame = inspect.currentframe()
init_frame = frame.f_back if frame is not None else None
[warning] arch: The call-stack depth heuristic is fragile. The /tests/ check fires only if the caller is exactly two frames up. Indirect construction paths (fixtures that call helpers, default_factory in AutomationConfig, from_dynaconf) would bypass this guard silently, giving a false sense of safety while still writing to real /dev/shm.
Investigated: this is intentional. The commit message 'fix: scope post_init guard to direct test callers only' (9e07dc5) explicitly defines the scope. Fixtures calling helpers and from_dynaconf bypassing the guard is the desired behavior; only direct test-code instantiation of LinuxTracingConfig(tmpfs_path='/dev/shm') should be blocked. Category: false_positive_intentional_pattern.
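The scoping discussed in this thread can be illustrated with a minimal sketch. It assumes a stand-in dataclass, `TracingConfigSketch`, rather than the real `LinuxTracingConfig`, and the error text and the `construct_from` demo helper are invented for illustration; only the early return, the two-frame walk, and the `/tests/` filename check come from the discussion above.

```python
import inspect
import os
from dataclasses import dataclass


@dataclass
class TracingConfigSketch:
    """Hypothetical stand-in for LinuxTracingConfig (illustration only)."""

    tmpfs_path: str = "/dev/shm"

    def __post_init__(self) -> None:
        # Cheap early return: custom path, or not running under pytest.
        if self.tmpfs_path != "/dev/shm" or not os.environ.get("PYTEST_CURRENT_TEST"):
            return
        # Walk two frames: __post_init__ -> generated __init__ -> actual caller.
        frame = inspect.currentframe()
        init_frame = frame.f_back if frame is not None else None
        caller = init_frame.f_back if init_frame is not None else None
        try:
            filename = caller.f_code.co_filename if caller is not None else ""
            if "/tests/" in filename.replace(os.sep, "/"):
                # Assumed wording; the real diagnostic message is not shown here.
                raise RuntimeError(
                    "tmpfs_path='/dev/shm' constructed directly from test code; "
                    "pass an isolated tmpfs_path instead"
                )
        finally:
            # Drop frame references so they cannot form reference cycles.
            del frame, init_frame, caller


def construct_from(filename: str, source: str) -> None:
    """Run `source` as if it lived in `filename` (demo helper, not project code)."""
    namespace = {"TracingConfigSketch": TracingConfigSketch}
    exec(compile(source, filename, "exec"), namespace)
```

With `PYTEST_CURRENT_TEST` set, `construct_from("/repo/tests/test_x.py", "TracingConfigSketch()")` raises, while the same constructor call made inside a factory function whose code lives under `src/` passes, because the frame two levels up is then the factory and not the test file.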
def _write_old_trace(tmpfs: Path, filename: str, content: str) -> Path:
    """Write a trace file and backdate its mtime to 60 seconds ago."""
    import time
    """Write a trace file (backdated 60s) and its enrollment sidecar.
[critical] tests: The _write_old_trace docstring is malformed. The opening """ closes prematurely, leaving The enrollment sidecar uses... and The PID embedded in the filename... as bare statements outside any string literal, causing a SyntaxError at import time. Wrap the full docstring text inside a single triple-quoted string.
Investigated — this is a stale comment. python3 -m py_compile tests/execution/test_session_log.py succeeds (Syntax OK). The docstring at lines 393-398 is properly closed with triple-quotes. The malformed state observed in the diff was corrected in a later commit on this branch before the review was submitted.
assert enrollment.exists(), "Enrollment sidecar for alive PID must not be deleted"


def test_recover_crashed_sessions_skips_file_without_enrollment(tmp_path):
[warning] tests: Hardcoded PID 99997 — if that PID is alive on the test host, psutil.pid_exists(99997) returns True and Gate 3 may short-circuit before Gate 1 (missing enrollment) is tested. Use a guaranteed-dead PID (e.g. beyond Linux PID_MAX) or mock psutil.pid_exists.
Investigated — this is intentional. Gate 1 (enrollment sidecar existence, session_log.py lines 341-346) runs BEFORE Gate 3 (PID liveness, lines 355-360). This test creates a trace for PID 99997 with NO enrollment sidecar. Gate 1 fires: enrollment is None → continue. Gate 3 is never reached. An alive PID 99997 cannot cause the test to fail.
lambda: "current-boot-id",
)
tmpfs = tmp_path / "shm"
tmpfs.mkdir()
[warning] tests: Same concern for hardcoded PID 99996 in test_recover_crashed_sessions_skips_wrong_boot_id. If alive, Gate 3 short-circuits before Gate 2 (boot_id check) is exercised. Use a guaranteed-dead PID or mock the liveness check.
Investigated — this is intentional. Gate 2 (boot_id mismatch, session_log.py lines 348-353) runs BEFORE Gate 3 (PID liveness). This test creates an enrollment sidecar with boot_id='stale-boot-id' while the mock returns 'current-boot-id'. Gate 2 fires: mismatch → unlink + continue. Gate 3 is never reached. An alive PID 99996 cannot cause the test to fail.
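The ordering argument in the two replies above can be sketched as a pure decision function. The names `EnrollmentSketch` and `classify_trace` are hypothetical, and the function returns an action label instead of touching files; the gate order, and the rule that Gate 3 skips only when the ticks are positively known and match, follow the descriptions in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EnrollmentSketch:
    """Hypothetical stand-in for the enrollment sidecar's identity triple."""

    boot_id: str
    pid: int
    starttime_ticks: int


def classify_trace(
    enrollment: Optional[EnrollmentSketch],
    current_boot_id: str,
    pid_alive: bool,
    live_ticks: Optional[int],
) -> str:
    """Return 'skip', 'delete_stale', or 'recover' for one aged trace file."""
    # Gate 1: no sidecar means the file is not ours; leave it alone.
    if enrollment is None:
        return "skip"
    # Gate 2: a sidecar from an earlier boot is stale; both files get removed.
    if enrollment.boot_id != current_boot_id:
        return "delete_stale"
    # Gate 3: skip only when the same process is positively known to be alive.
    if pid_alive and live_ticks is not None and live_ticks == enrollment.starttime_ticks:
        return "skip"
    # Process gone, PID recycled, or ticks unreadable: treat as a crash.
    return "recover"
```

Because Gates 1 and 2 return before `pid_alive` is consulted, a live PID cannot change the outcome of a missing-sidecar or wrong-boot case in this sketch.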
Trecek
left a comment
AutoSkillit review found 14 blocking issues. See inline comments.
Tests 1.1-1.3 (session_log): recover_crashed_sessions must skip trace files for alive PIDs, files without enrollment sidecars, and files whose enrollment boot_id doesn't match the current boot.
Tests 1.4-1.5 (linux_tracing): start_linux_tracing must write an enrollment sidecar atomically; stop() must unlink both trace and sidecar.
All 5 tests fail until the implementation is added.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces a per-session enrollment sidecar
(autoskillit_enrollment_{pid}.json) written atomically at
start_linux_tracing time, containing the identity triple
(boot_id, pid, starttime_ticks) that anchors the trace file to the
specific process that created it.
Key changes:
- linux_tracing.py: Add TraceEnrollmentRecord frozen dataclass,
_write_enrollment_atomic and _read_enrollment helpers; extend
start_linux_tracing with session_id/kitchen_id/order_id keyword
params; write sidecar immediately after opening trace file; update
LinuxTracingHandle.stop() to unlink both trace and sidecar on clean
exit so recovery only ever sees genuine crashes.
- session_log.py: Add three-gate identity chain to
recover_crashed_sessions: (1) enrollment sidecar must exist,
(2) boot_id must match current boot, (3) PID must be dead or
starttime_ticks must differ. Delete both files after recovery.
- test_session_log.py: Update _write_old_trace to write companion
enrollment sidecars so existing recovery tests pass the new gates.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
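A minimal sketch of the sidecar contract described in this commit message follows. It assumes that "atomic" means a temp file in the same directory followed by `os.replace`, and it stores only the identity triple; the real `TraceEnrollmentRecord` may carry more fields (for example the session, kitchen, and order identifiers). The names `EnrollmentRecordSketch`, `write_enrollment_atomic`, and `read_enrollment` below are illustrative, not the project's helpers.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class EnrollmentRecordSketch:
    """Hypothetical identity triple anchoring a trace file to one process."""

    boot_id: str
    pid: int
    starttime_ticks: int


def write_enrollment_atomic(tmpfs: Path, record: EnrollmentRecordSketch) -> Path:
    """Write via temp file + os.replace so readers never see a partial file."""
    target = tmpfs / f"autoskillit_enrollment_{record.pid}.json"
    fd, tmp_name = tempfile.mkstemp(dir=tmpfs, prefix=".enrollment_", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(asdict(record), fh)
        os.replace(tmp_name, target)  # atomic rename within one filesystem
    except BaseException:
        # Best-effort cleanup of the temp file on any failure.
        try:
            os.unlink(tmp_name)
        except OSError:
            pass
        raise
    return target


def read_enrollment(tmpfs: Path, pid: int) -> Optional[EnrollmentRecordSketch]:
    """Return the parsed record, or None when the sidecar is missing or malformed."""
    path = tmpfs / f"autoskillit_enrollment_{pid}.json"
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        return EnrollmentRecordSketch(
            boot_id=str(data["boot_id"]),
            pid=int(data["pid"]),
            starttime_ticks=int(data["starttime_ticks"]),
        )
    except (OSError, ValueError, KeyError, TypeError):
        return None
```

Returning `None` for both a missing and an unparseable sidecar keeps the caller's first gate simple: anything that is not a valid enrollment is treated as a file that does not belong to the tracer.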
test_streaming_writes_each_snapshot_as_jsonl: save trace path and flush before calling stop(), which now deletes both trace and enrollment files on clean exit. Read file content into variable before stop() runs.
test_current_json_write_sites_match_allowlist: update session_log.py allowlist line numbers (206→213, 219→226, 222→229) shifted by the enrollment sidecar code added above the existing atomic_write calls.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ayloads
Lines 219→226 and 222→229 in session_log.py shifted by the enrollment sidecar code; the hardcoded list_sites check in the same convention test needed the same update as _LEGACY_JSON_WRITES.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lated_tracing_config fixture
Tests assert:
- __post_init__ raises RuntimeError when tmpfs_path == /dev/shm and PYTEST_CURRENT_TEST is set
- Custom tmpfs path does not raise in test env
- /dev/shm is allowed outside pytest (production path)
- isolated_tracing_config fixture returns non-/dev/shm temp dir
All four tests fail before implementation (no guard, no fixture yet).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/shm in test env
When PYTEST_CURRENT_TEST is set and tmpfs_path is the production default /dev/shm, construction raises RuntimeError with a diagnostic message pointing to the correct fix. Zero overhead in production: the env var is never set outside pytest.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nfig call sites
Five test call sites that previously wrote to the real /dev/shm are now explicitly isolated: four in test_linux_tracing.py and one in test_session_log_integration.py. Each gains a tmp_path fixture param where it lacked one, and passes tmpfs_path=str(tmp_path) to the constructor. These tests would have raised RuntimeError at construction after the __post_init__ guard was added; this commit restores them to passing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Provides a canonical, pre-isolated LinuxTracingConfig for all tracing tests. The fixture creates a shm/ subdir under tmp_path and returns a config pointing to it — never to the real /dev/shm. New tests should use this fixture instead of constructing LinuxTracingConfig manually. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
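The fixture described above might look roughly like the sketch below. It uses a hypothetical `TracingConfigStub` in place of `LinuxTracingConfig` and a plain builder function so the example runs without pytest; in a conftest the builder body would sit under `@pytest.fixture` and receive pytest's `tmp_path`. The `0.05` interval is taken from the diagrams in this PR; everything else is an assumption.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TracingConfigStub:
    """Hypothetical stand-in for LinuxTracingConfig (illustration only)."""

    tmpfs_path: str = "/dev/shm"
    proc_interval: float = 5.0


def make_isolated_tracing_config(tmp_path: Path) -> TracingConfigStub:
    """Build a config whose tmpfs lives under tmp_path, never the real /dev/shm."""
    shm = tmp_path / "shm"
    shm.mkdir()  # created on setup so tracing finds an existing directory
    return TracingConfigStub(tmpfs_path=str(shm), proc_interval=0.05)


# Demo: a temporary directory plays the role of pytest's tmp_path.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = make_isolated_tracing_config(Path(tmp))
    demo_dir_existed = Path(demo_config.tmpfs_path).is_dir()
```

Centralizing the construction in one place means new tests get an isolated path by default instead of each test remembering to pass `tmpfs_path` by hand.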
The guard fired too broadly — any AutomationConfig() or from_dynaconf() call in tests triggered it even when no actual /dev/shm writes would occur. Use frame inspection to fire only when the immediate caller (two frames up: past __post_init__ and the generated __init__) is test code (/tests/ in filename). Library machinery (AutomationConfig default_factory, from_dynaconf) resolves to <string> or src/, so it passes through. Also set tool_ctx.config.linux_tracing.tmpfs_path to an isolated tmp_path for proper test isolation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…le filenames
Gate 3 previously skipped recovery when read_starttime_ticks() returned None (unreadable /proc/<pid>/stat, a transient state after process exit), causing genuinely crashed sessions to be missed. Fix: only skip when ticks are positively known and match the enrollment record.
Also change the pid parse-failure fallback from 0 to -1: PID 0 exists on Linux (swapper), making psutil.pid_exists(0) return True and silently short-circuiting Gate 3. PID -1 is guaranteed non-existent.
Addresses reviewer comments 3070341630 (critical) and 3070341628.
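The two parsing concerns in this commit can be sketched as small pure functions. Both names are hypothetical: `pid_from_trace_name` shows the `-1` fallback for unparseable filenames, and `parse_starttime_ticks` shows one way to extract the start time (field 22 of a `/proc/<pid>/stat` line) while returning `None` when the line cannot be parsed. How the project's `read_starttime_ticks` actually parses the file is not shown in this PR, so treat this as an assumption.

```python
import re
from typing import Optional

# Filename pattern taken from the trace naming used in this PR.
_TRACE_NAME = re.compile(r"^autoskillit_trace_(\d+)\.jsonl$")


def pid_from_trace_name(filename: str) -> int:
    """Extract the PID from a trace filename; -1 signals 'unparseable'."""
    match = _TRACE_NAME.match(filename)
    return int(match.group(1)) if match else -1


def parse_starttime_ticks(stat_line: str) -> Optional[int]:
    """Return field 22 (starttime) of a /proc/<pid>/stat line, or None."""
    # comm (field 2) may contain spaces or parentheses, so split after the last ')'.
    _, sep, rest = stat_line.rpartition(")")
    if not sep:
        return None
    fields = rest.split()
    # `fields` starts at field 3 (state), so starttime sits at index 22 - 3 = 19.
    try:
        return int(fields[19])
    except (IndexError, ValueError):
        return None
```

A `None` result here means "unknown", which under the fixed Gate 3 no longer counts as evidence that the original process is still running.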
The function was imported across module boundaries from linux_tracing.py into session_log.py with an underscore prefix signalling "private". Since it is legitimately shared between sister modules in the execution package, rename it to read_enrollment (no underscore) — consistent with the other public helpers (read_boot_id, read_starttime_ticks) in the same import block. Addresses reviewer comment 3070341627.
… locals in settings.py
The deferred `import inspect` and `import os` inside `__post_init__` executed on every LinuxTracingConfig instantiation (the early-return check comes after them). Moving them to module level surfaces the dependency explicitly and avoids repeated import overhead.
Also delete the frame locals (`del frame, init_frame, caller`) after the guard block to prevent reference cycles in non-CPython runtimes.
Addresses reviewer comments 3070341619 and 3070341622.
…linux_tracing.py
1. Add clarifying comment to stop() explaining why _trace_path is unconditionally unlinked: crash-recovery only reads files left by processes that never called stop(), so clean sessions correctly clean up their own file.
2. Add logger.warning() when the enrollment sidecar write fails with OSError. Previously silently swallowed; the missing sidecar causes Gate 1 to skip recovery for that session with no operator-visible diagnostic.
Addresses reviewer comments 3070341625 and 3070341626.
…tests
In test_start_linux_tracing_writes_enrollment_sidecar and test_stop_unlinks_trace_and_enrollment, handle.stop() was called before proc.kill(). If stop() raises, proc.kill() never executes, leaking the sleep 2 subprocess. Wrap stop() in try/finally to guarantee proc.kill() always runs.
Addresses reviewer comments 3070341631 and 3070341632.
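The cleanup pattern this commit describes can be sketched as follows. The helper names are hypothetical, and a Python child process stands in for the `sleep 2` subprocess so the example does not depend on a POSIX `sleep` binary.

```python
import subprocess
import sys
from typing import Callable


def stop_then_kill(stop: Callable[[], None], proc: subprocess.Popen) -> None:
    """Call stop(); kill and reap proc even if stop() raises."""
    try:
        stop()
    finally:
        proc.kill()
        proc.wait()  # reap the child so no zombie process is left behind


def spawn_sleeper() -> subprocess.Popen:
    """Start a short-lived child, standing in for the `sleep 2` subprocess."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(2)"])
```

If `stop()` raises, the exception still propagates to the test, but only after the child has been killed, so a failing assertion no longer leaks a process.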
b0888f9 to 5247553
Summary
The trace file lifecycle had no identity contract. The filename encodes a PID, but PIDs recycle, `/dev/shm` is writable by any process, and there is no boot-epoch anchor. The 30-second age heuristic was the sole gate before `recover_crashed_sessions` emitted a `subtype=crashed` row, making it trivially fooled by test artifacts, stale pre-reboot files, and alien processes whose PIDs happen to match.
Part A establishes the enrollment sidecar contract: a per-session JSON file written atomically at `start_linux_tracing` time containing the identity triple `(boot_id, pid, starttime_ticks)`. Recovery now validates this sidecar before classifying any trace file as a crash. `LinuxTracingHandle.stop()` deletes both files on clean exit so recovery only ever sees genuine crashes.
Part B covers test isolation enforcement (`__post_init__` guard, fixing four non-isolated test call sites, and the shared `isolated_tracing_config` conftest fixture).
Architecture Impact
State Lifecycle Diagram
%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 65, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef cli fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000; classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; %% ── CONSTRUCTION PATH ── %% subgraph Construction ["● LinuxTracingConfig Construction (settings.py)"] direction TB CFG_FIELDS["● INIT_ONLY Fields<br/>━━━━━━━━━━<br/>enabled: bool = True<br/>proc_interval: float = 5.0<br/>log_dir: str = ''<br/>tmpfs_path: str = '/dev/shm'"] GUARD["● __post_init__ TEST GUARD<br/>━━━━━━━━━━<br/>PYTEST_CURRENT_TEST set?<br/>AND tmpfs_path == '/dev/shm'?<br/>AND caller frame in /tests/?<br/>→ raise RuntimeError"] PROD_OK["Production Path OK<br/>━━━━━━━━━━<br/>No PYTEST_CURRENT_TEST<br/>→ /dev/shm allowed"] LIBRARY_OK["Library Factory Path OK<br/>━━━━━━━━━━<br/>default_factory / from_dynaconf<br/>caller not in /tests/<br/>→ guard bypassed"] end %% ── TEST ISOLATION LAYER ── %% subgraph TestIsolation ["● Test Isolation Contracts (conftest.py)"] direction LR ISOLATED_FIX["● isolated_tracing_config<br/>━━━━━━━━━━<br/>tmpfs_path = tmp_path/shm<br/>proc_interval = 0.05s<br/>mkdir on setup"] TOOL_CTX_FIX["tool_ctx fixture<br/>━━━━━━━━━━<br/>log_dir → tmp_path/session_logs<br/>tmpfs_path → tmp_path/shm<br/>both redirected"] ISO_HOME["_isolated_home (autouse)<br/>━━━━━━━━━━<br/>Path.home() → tmp_path<br/>blocks real config.yaml load"] end %% ── TRACING RUNTIME ── %% 
subgraph TracingRuntime ["● linux_tracing.py — Runtime State"] direction TB ENABLED_GATE["enabled gate<br/>━━━━━━━━━━<br/>if not enabled → return None<br/>FAIL-FAST"] TMPFS_CHECK["tmpfs_path.is_dir()<br/>━━━━━━━━━━<br/>dir present → open JSONL<br/>absent → silent skip"] HANDLE_MUTABLE["LinuxTracingHandle MUTABLE<br/>━━━━━━━━━━<br/>_trace_file (open → None)<br/>_trace_path (path → None)<br/>_enrollment_path (path → None)<br/>_monitor_cancel_scope (live → None)"] SNAPSHOTS_APPEND["_snapshots APPEND_ONLY<br/>━━━━━━━━━━<br/>grows during proc_monitor loop<br/>never shrunk mid-session<br/>read by stop() for flush"] end %% ── SESSION LOG WRITE PATH ── %% subgraph SessionLog ["● session_log.py — Write Path"] direction TB FLUSH["flush_session_log()<br/>━━━━━━━━━━<br/>sole write entry point<br/>called: normal end + crash recovery"] SESSION_ARTIFACTS["WRITE_ONCE Artifacts<br/>━━━━━━━━━━<br/>summary.json<br/>proc_trace.jsonl<br/>anomalies.jsonl"] RECOVERY_ARTIFACTS["WRITE_ONCE Recovery Files<br/>━━━━━━━━━━<br/>token_usage.json (next boot)<br/>step_timing.json (next boot)<br/>audit_log.json (next boot)"] SESSIONS_IDX["sessions.jsonl APPEND_ONLY<br/>━━━━━━━━━━<br/>global audit index<br/>append mode — never read<br/>by live logic (only retention)"] TELEMETRY_FENCE["FENCE: .telemetry_cleared_at<br/>━━━━━━━━━━<br/>write_telemetry_clear_marker()<br/>read only on next server boot<br/>excludes pre-clear sessions"] end %% ── CRASH RECOVERY GATE CHAIN ── %% subgraph Recovery ["● session_log.py — Crash Recovery Gate Chain"] direction TB TIMING_GATE["Gate 0: 30s Freshness<br/>━━━━━━━━━━<br/>mtime < now - 30s?<br/>skip if too fresh (active session)"] ENROLL_GATE["Gate 1: Enrollment Sidecar<br/>━━━━━━━━━━<br/>autoskillit_enrollment_{pid}.json<br/>absent → skip (alien file)"] BOOT_GATE["Gate 2: boot_id Match<br/>━━━━━━━━━━<br/>enrollment.boot_id == current_boot_id?<br/>mismatch → DELETE both files (stale)"] PID_GATE["Gate 3: PID Liveness + 
Reuse<br/>━━━━━━━━━━<br/>psutil.pid_exists(pid)?<br/>starttime_ticks == enrolled_ticks?<br/>same ticks → alive, skip<br/>different ticks → recycled, RECOVER"] RECOVER_FLUSH["recover: flush_session_log()<br/>━━━━━━━━━━<br/>subtype='crashed', exit_code=-1<br/>termination_reason='CRASHED'<br/>then unlink trace + enrollment"] end %% ── CONNECTIONS ── %% %% Construction flow CFG_FIELDS --> GUARD GUARD -->|"PYTEST_CURRENT_TEST set<br/>+ caller in /tests/<br/>+ tmpfs == /dev/shm"| ERR_RAISE(["RuntimeError raised"]) GUARD -->|"no PYTEST env"| PROD_OK GUARD -->|"caller not in /tests/"| LIBRARY_OK %% Test isolation feeds construction safely ISOLATED_FIX -->|"tmpfs_path = tmp_path/shm"| CFG_FIELDS TOOL_CTX_FIX -->|"both paths redirected"| CFG_FIELDS ISO_HOME -->|"blocks real config"| LIBRARY_OK %% Config fields flow into runtime PROD_OK --> ENABLED_GATE LIBRARY_OK --> ENABLED_GATE ENABLED_GATE -->|"enabled=True"| TMPFS_CHECK ENABLED_GATE -->|"enabled=False"| SKIP(["return None"]) TMPFS_CHECK -->|"dir exists"| HANDLE_MUTABLE TMPFS_CHECK -->|"dir absent"| HANDLE_MUTABLE HANDLE_MUTABLE --> SNAPSHOTS_APPEND %% Runtime feeds session log SNAPSHOTS_APPEND -->|"stop() called"| FLUSH FLUSH --> SESSION_ARTIFACTS FLUSH --> RECOVERY_ARTIFACTS FLUSH --> SESSIONS_IDX FLUSH --> TELEMETRY_FENCE %% Recovery gate chain TIMING_GATE -->|"stale enough"| ENROLL_GATE ENROLL_GATE -->|"sidecar present"| BOOT_GATE BOOT_GATE -->|"boot_id matches"| PID_GATE PID_GATE -->|"PID recycled"| RECOVER_FLUSH RECOVER_FLUSH -->|"calls"| FLUSH %% CLASS ASSIGNMENTS %% class CFG_FIELDS stateNode; class GUARD,ENROLL_GATE,BOOT_GATE,PID_GATE,TIMING_GATE detector; class PROD_OK,LIBRARY_OK phase; class ISOLATED_FIX,TOOL_CTX_FIX,ISO_HOME newComponent; class ENABLED_GATE,TMPFS_CHECK handler; class HANDLE_MUTABLE,SNAPSHOTS_APPEND stateNode; class FLUSH handler; class SESSION_ARTIFACTS,RECOVERY_ARTIFACTS,SESSIONS_IDX,TELEMETRY_FENCE output; class RECOVER_FLUSH phase; class ERR_RAISE,SKIP terminal;
Process Flow Diagram
%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 50, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; %% TERMINALS %% START([START]) BLOCKED([BLOCKED — RuntimeError]) NONE([returns None]) HANDLE([returns LinuxTracingHandle]) COMPLETE([COMPLETE]) %% ─── PHASE 1: CONFIG GUARD ─── %% subgraph GuardPhase ["● LinuxTracingConfig — Test Isolation Guard (settings.py)"] direction TB N_config["● LinuxTracingConfig()<br/>━━━━━━━━━━<br/>dataclass __init__ called"] N_guard{"tmpfs_path == '/dev/shm'<br/>━━━━━━━━━━<br/>default value?"} N_pytest{"PYTEST_CURRENT_TEST<br/>━━━━━━━━━━<br/>env var set?"} N_frame{"inspect caller frame<br/>━━━━━━━━━━<br/>f_back.f_back<br/>co_filename in /tests/?"} N_pass["guard passes<br/>━━━━━━━━━━<br/>normal construction"] end %% ─── PHASE 2: TRACING START ─── %% subgraph StartPhase ["● start_linux_tracing() (linux_tracing.py)"] direction TB N_start["● start_linux_tracing(pid, config, tg)<br/>━━━━━━━━━━<br/>entry: open session tracing"] N_gate1{"LINUX_TRACING_AVAILABLE<br/>━━━━━━━━━━<br/>& config.enabled?"} N_gate2{"tg (task group)<br/>━━━━━━━━━━<br/>provided?"} N_open["open tmpfs trace file<br/>━━━━━━━━━━<br/>config.tmpfs_path /autoskillit_trace_{pid}.jsonl<br/>buffering=1 (line-buffered)"] N_enroll["write enrollment sidecar (atomic)<br/>━━━━━━━━━━<br/>config.tmpfs_path /autoskillit_enrollment_{pid}.json<br/>TraceEnrollmentRecord: pid+boot_id+starttime_ticks"] 
N_spawn["tg.start_soon(_run_monitor)<br/>━━━━━━━━━━<br/>async monitor task in cancel scope"] end %% ─── PHASE 3: MONITORING LOOP ─── %% subgraph MonitorPhase ["● proc_monitor() + _run_monitor() (linux_tracing.py)"] direction TB N_snap["read_proc_snapshot(pid)<br/>━━━━━━━━━━<br/>psutil.Process.oneshot()<br/>+ /proc/{pid}/status|oom_score|wchan"] N_gone{"snapshot is None?<br/>━━━━━━━━━━<br/>process exited"} N_clock{"captured_at ≤<br/>last_captured_at?<br/>━━━━━━━━━━<br/>NTP / WSL2 clock jump"} N_advance["advance by 1 µs<br/>━━━━━━━━━━<br/>maintain monotonic<br/>captured_at invariant"] N_append["append to handle._snapshots<br/>━━━━━━━━━━<br/>in-memory list"] N_write["write JSON line to tmpfs file<br/>━━━━━━━━━━<br/>crash-resilient streaming"] N_file_err{"OSError on write?"} N_degrade["close trace file<br/>━━━━━━━━━━<br/>degrade to in-memory only"] N_sleep["anyio.sleep(config.proc_interval)<br/>━━━━━━━━━━<br/>default 5.0s → loop"] end %% ─── PHASE 4: SESSION END + LOG FLUSH ─── %% subgraph FlushPhase ["● handle.stop() + flush_session_log() (linux_tracing.py / session_log.py)"] direction TB N_stop["handle.stop()<br/>━━━━━━━━━━<br/>cancel CancelScope<br/>flush+close trace file<br/>unlink tmpfs files"] N_flush["● flush_session_log()<br/>━━━━━━━━━━<br/>called by headless.py<br/>after session completes"] N_proc_trace["write proc_trace.jsonl<br/>━━━━━━━━━━<br/>session_dir/proc_trace.jsonl"] N_anomaly["detect_anomalies()<br/>━━━━━━━━━━<br/>post-hoc over snapshots"] N_summary["write summary.json<br/>━━━━━━━━━━<br/>+ anomalies.jsonl (if any)"] N_index["append to sessions.jsonl<br/>━━━━━━━━━━<br/>log_root/sessions.jsonl<br/>global append-only index"] N_retention["_enforce_retention()<br/>━━━━━━━━━━<br/>trim to max 500 session dirs<br/>rewrite sessions.jsonl"] end %% ─── PHASE 5: CRASH RECOVERY ─── %% subgraph RecoveryPhase ["● recover_crashed_sessions() (session_log.py)"] direction TB R_entry["● recover_crashed_sessions(tmpfs_path)<br/>━━━━━━━━━━<br/>called at server 
startup<br/>glob autoskillit_trace_*.jsonl"] R_age{"file age < 30s?<br/>━━━━━━━━━━<br/>may be active session"} R_sidecar{"enrollment sidecar<br/>━━━━━━━━━━<br/>autoskillit_enrollment_{pid}.json<br/>exists & parses?"} R_bootid{"boot_id matches<br/>━━━━━━━━━━<br/>current boot?"} R_pid{"psutil.pid_exists(pid)?"} R_ticks{"starttime_ticks<br/>━━━━━━━━━━<br/>unchanged?<br/>(same process)"} R_read["read snapshot lines<br/>━━━━━━━━━━<br/>from trace file"] R_flush["flush_session_log()<br/>━━━━━━━━━━<br/>subtype='crashed', exit_code=-1<br/>termination_reason='CRASHED'"] R_cleanup["unlink trace file<br/>━━━━━━━━━━<br/>+ enrollment sidecar"] end %% ─── PHASE 6: TEST ISOLATION FIXTURE ─── %% subgraph FixturePhase ["● isolated_tracing_config fixture (tests/execution/conftest.py)"] direction TB F_fixture["● isolated_tracing_config(tmp_path)<br/>━━━━━━━━━━<br/>pytest fixture"] F_return["LinuxTracingConfig(<br/>━━━━━━━━━━<br/>tmpfs_path=str(tmp_path/shm)<br/>proc_interval=0.05)"] end %% ─── FLOW: CONFIG GUARD ─── %% START --> N_config N_config --> N_guard N_guard -->|"yes (default path)"| N_pytest N_guard -->|"no (custom path)"| N_pass N_pytest -->|"yes — test context"| N_frame N_pytest -->|"no — production"| N_pass N_frame -->|"yes — direct test caller"| BLOCKED N_frame -->|"no — library machinery"| N_pass %% ─── FLOW: TRACING START ─── %% N_pass --> N_start N_start --> N_gate1 N_gate1 -->|"no"| NONE N_gate1 -->|"yes"| N_gate2 N_gate2 -->|"no"| NONE N_gate2 -->|"yes"| N_open N_open --> N_enroll N_enroll --> N_spawn N_spawn --> HANDLE %% ─── FLOW: MONITORING LOOP ─── %% HANDLE --> N_snap N_snap --> N_gone N_gone -->|"yes"| N_stop N_gone -->|"no"| N_clock N_clock -->|"yes (stepped back)"| N_advance N_clock -->|"no"| N_append N_advance --> N_append N_append --> N_write N_write --> N_file_err N_file_err -->|"yes"| N_degrade N_file_err -->|"no"| N_sleep N_degrade --> N_sleep N_sleep -->|"loop"| N_snap %% ─── FLOW: SESSION END + FLUSH ─── %% N_stop --> N_flush N_flush --> N_proc_trace 
N_proc_trace --> N_anomaly N_anomaly --> N_summary N_summary --> N_index N_index --> N_retention N_retention --> COMPLETE %% ─── FLOW: CRASH RECOVERY ─── %% R_entry --> R_age R_age -->|"yes — skip"| R_entry R_age -->|"no — old enough"| R_sidecar R_sidecar -->|"missing — alien file, skip"| R_entry R_sidecar -->|"found"| R_bootid R_bootid -->|"mismatch — stale pre-reboot<br/>unlink both files"| R_entry R_bootid -->|"match"| R_pid R_pid -->|"no — process gone"| R_read R_pid -->|"yes — alive"| R_ticks R_ticks -->|"unchanged — still running, skip"| R_entry R_ticks -->|"changed — PID recycled, treat as crash"| R_read R_read --> R_flush R_flush --> R_cleanup R_cleanup --> R_entry %% ─── FLOW: TEST ISOLATION ─── %% F_fixture --> F_return F_return -->|"passed to tests as LinuxTracingConfig"| N_start %% CLASS ASSIGNMENTS %% class START,BLOCKED,NONE,HANDLE,COMPLETE terminal; class N_config,N_pass,N_start,N_open,N_enroll,N_spawn handler; class N_guard,N_pytest,N_frame,N_gate1,N_gate2,N_gone,N_clock,N_file_err stateNode; class N_snap,N_advance,N_append,N_write,N_degrade,N_sleep phase; class N_stop,N_flush,N_proc_trace,N_anomaly,N_summary,N_index,N_retention handler; class R_entry,R_read,R_flush,R_cleanup handler; class R_age,R_sidecar,R_bootid,R_pid,R_ticks detector; class F_fixture,F_return newComponent;
Error/Resilience Diagram
%%{init: {'flowchart': {'nodeSpacing': 40, 'rankSpacing': 55, 'curve': 'basis'}}}%% flowchart TB %% CLASS DEFINITIONS %% classDef terminal fill:#1a237e,stroke:#7986cb,stroke-width:2px,color:#fff; classDef stateNode fill:#004d40,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef handler fill:#e65100,stroke:#ffb74d,stroke-width:2px,color:#fff; classDef phase fill:#6a1b9a,stroke:#ba68c8,stroke-width:2px,color:#fff; classDef newComponent fill:#2e7d32,stroke:#81c784,stroke-width:2px,color:#fff; classDef output fill:#00695c,stroke:#4db6ac,stroke-width:2px,color:#fff; classDef detector fill:#b71c1c,stroke:#ef5350,stroke-width:2px,color:#fff; classDef gap fill:#ff6f00,stroke:#ffa726,stroke-width:2px,color:#000; CONSTRUCT_START(["LinuxTracingConfig()"]) subgraph PostInitGate ["● CONSTRUCTION GUARD — LinuxTracingConfig.__post_init__"] CHECK_SHM{"tmpfs_path<br/>== /dev/shm?"} CHECK_PYTEST{"PYTEST_CURRENT_TEST<br/>env set?"} CHECK_FRAME{"caller frame<br/>inside /tests/?"} GUARD_PASS["Config accepted<br/>━━━━━━━━━━<br/>proceeds to use"] end subgraph TestIsolation ["● TEST ISOLATION ESCAPE HATCH"] FIXTURE["● isolated_tracing_config<br/>━━━━━━━━━━<br/>pytest fixture<br/>tests/execution/conftest.py"] SAFE_CONFIG["LinuxTracingConfig<br/>━━━━━━━━━━<br/>tmpfs_path=str(tmp_path/shm)<br/>proc_interval=0.05"] end subgraph CrashRecovery ["● CRASH RECOVERY GATES — recover_crashed_sessions"] SCAN["Scan tmpfs<br/>━━━━━━━━━━<br/>autoskillit_trace_*.jsonl"] AGE_GATE{"mtime age<br/>< 30s?"} ENROLLMENT_GATE{"enrollment<br/>sidecar exists?"} BOOT_GATE{"boot_id<br/>matches current?"} PID_GATE{"PID alive +<br/>same starttime_ticks?"} READ_SNAPS["Read snapshots<br/>━━━━━━━━━━<br/>parse JSONL lines<br/>OSError → skip file"] end subgraph Finalization ["● CRASH FINALIZATION — flush_session_log"] FLUSH["flush_session_log<br/>━━━━━━━━━━<br/>subtype=crashed<br/>termination=CRASHED"] FLUSH_ERR{"Exception<br/>during flush?"} UNLINK["Unlink trace +<br/>enrollment files<br/>━━━━━━━━━━<br/>missing_ok=True"] 
end T_RUNTIME_ERR(["RuntimeError<br/>dev/shm rejected<br/>in test context"]) T_PASS(["SAFE<br/>config constructed"]) T_SKIP(["SKIPPED<br/>file left in place"]) T_STALE_DEL(["STALE DELETED<br/>pre-reboot file removed"]) T_RECOVERED(["RECOVERED<br/>session finalized<br/>count += 1"]) T_FIXTURE_SAFE(["TEST SAFE<br/>isolated tmp_path/shm"]) %% Construction gate flow CONSTRUCT_START --> CHECK_SHM CHECK_SHM -->|"no — safe path"| GUARD_PASS CHECK_SHM -->|"yes"| CHECK_PYTEST CHECK_PYTEST -->|"not set — production"| GUARD_PASS CHECK_PYTEST -->|"set"| CHECK_FRAME CHECK_FRAME -->|"library machinery<br/>(AutomationConfig etc.)"| GUARD_PASS CHECK_FRAME -->|"direct test code"| T_RUNTIME_ERR GUARD_PASS --> T_PASS %% Test isolation path (recommended escape hatch) FIXTURE --> SAFE_CONFIG --> T_FIXTURE_SAFE %% Crash recovery gates SCAN --> AGE_GATE AGE_GATE -->|"too recent<br/>(may be active)"| T_SKIP AGE_GATE -->|"old enough"| ENROLLMENT_GATE ENROLLMENT_GATE -->|"missing — alien/test file"| T_SKIP ENROLLMENT_GATE -->|"present"| BOOT_GATE BOOT_GATE -->|"mismatch<br/>→ unlink both"| T_STALE_DEL BOOT_GATE -->|"matches"| PID_GATE PID_GATE -->|"alive, same ticks<br/>(process running)"| T_SKIP PID_GATE -->|"gone or PID recycled"| READ_SNAPS READ_SNAPS --> FLUSH %% Finalization FLUSH --> FLUSH_ERR FLUSH_ERR -->|"yes — log debug,<br/>continue loop"| T_SKIP FLUSH_ERR -->|"no"| UNLINK UNLINK --> T_RECOVERED %% CLASS ASSIGNMENTS class CONSTRUCT_START terminal; class CHECK_SHM,CHECK_PYTEST,CHECK_FRAME,AGE_GATE,ENROLLMENT_GATE,BOOT_GATE,PID_GATE,FLUSH_ERR detector; class GUARD_PASS,READ_SNAPS handler; class SCAN phase; class FIXTURE,SAFE_CONFIG newComponent; class FLUSH,UNLINK output; class T_RUNTIME_ERR,T_PASS,T_SKIP,T_STALE_DEL,T_RECOVERED,T_FIXTURE_SAFE terminal;
Closes #771
Implementation Plan
Plan file:
/home/talon/projects/autoskillit-runs/remediation-20260412-142552-496727/.autoskillit/temp/rectify/rectify_trace_identity_contract_2026-04-12_180100_part_a.md
🤖 Generated with Claude Code via AutoSkillit
Token Usage Summary