DCE: Update refactor plan for per-file state and immutable results

cristianoc · cristianoc · commit 093ef5939891 · 2025-12-07T04:42:15.000+01:00
Key design principles added:
- Separate per-file input (keyed by filename) from project-wide analysis
- Analysis results are immutable - returned by solver, not mutated
- Enable incremental updates by replacing one file's data

Updated tasks to emphasize:
- Per-file state with merge functions for project-wide view
- Solver returns AnalysisResult.t instead of mutating input
- Task 3 marked partial (its input/output mixing is deferred to Task 8)
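
As a rough illustration of these principles, per-file data keyed by filename can be merged, solved, and incrementally updated as below. This is a toy sketch, not the reanalyze implementation: plain names stand in for positions, and `file_data`, `merged`, and the one-line solver are simplified assumptions.

```ocaml
(* Toy model: all types and names here are hypothetical simplifications. *)
module StringMap = Map.Make (String)
module StringSet = Set.Make (String)

(* Per-file source data: what one file declares and what it references. *)
type file_data = { decls : string list; refs : string list }

(* Project-wide merged view, recomputed from the per-file map. *)
type merged = { all_decls : StringSet.t; all_refs : StringSet.t }

let merge_file_data (per_file : file_data StringMap.t) : merged =
  StringMap.fold
    (fun _file fd acc ->
      {
        all_decls = StringSet.union acc.all_decls (StringSet.of_list fd.decls);
        all_refs = StringSet.union acc.all_refs (StringSet.of_list fd.refs);
      })
    per_file
    { all_decls = StringSet.empty; all_refs = StringSet.empty }

(* Pure toy solver: a declaration is dead if nothing references it.
   The real solver also handles annotations and recursive references. *)
let solve_deadness (m : merged) : StringSet.t =
  StringSet.diff m.all_decls m.all_refs

(* Incremental update: replace one file's entry, re-merge, re-solve. *)
let update_file per_file ~changed_file ~new_data =
  let per_file = StringMap.add changed_file new_data per_file in
  (per_file, solve_deadness (merge_file_data per_file))

(* Usage: B.res initially references only A.f, then also A.g. *)
let per_file =
  StringMap.empty
  |> StringMap.add "A.res" { decls = [ "A.f"; "A.g" ]; refs = [] }
  |> StringMap.add "B.res" { decls = [ "B.main" ]; refs = [ "A.f" ] }

let dead_before = solve_deadness (merge_file_data per_file)

let per_file', dead_after =
  update_file per_file ~changed_file:"B.res"
    ~new_data:{ decls = [ "B.main" ]; refs = [ "A.f"; "A.g" ] }
```

Only `B.res` is reprocessed in the update. The entry for `A.res` is reused as-is, and the original `per_file` map is left untouched because the map is persistent.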
diff --git a/analysis/reanalyze/DEADCODE_REFACTOR_PLAN.md b/analysis/reanalyze/DEADCODE_REFACTOR_PLAN.md
@@ -5,6 +5,7 @@
 - Global mutable state is eliminated
 - Side effects (logging, file I/O) live at the edges
 - Processing files in different orders gives the same results
+- **Incremental analysis is possible** - can reprocess one file without redoing everything
 
 **Why?** The current architecture makes:
 - Incremental/reactive analysis impossible (can't reprocess one file)
@@ -14,6 +15,50 @@
 
 ---
 
+## Key Design Principles
+
+### 1. Separate per-file input from project-wide analysis
+
+**Per-file source data** (can be incrementally updated):
+- Source annotations (`@dead`, `@live`, `@genType` from AST)
+- Declarations defined in that file
+- References made from that file
+- Keyed by filename so we can replace one file's data
+
+**Project-wide analysis** (computed from merged per-file data):
+- Deadness solver operates on merged view of all files
+- Results are **immutable** - returned as data, not mutated
+
+### 2. Analysis results are immutable
+
+The solver should:
+- Take source data as **read-only input**
+- Return results as **new immutable data**
+- Never mutate input state during analysis
+
+```ocaml
+(* WRONG - current design mutates state during analysis *)
+let resolveRecursiveRefs ~state ... =
+  ...
+  AnnotationState.annotate_dead state decl.pos  (* mutation! *)
+
+(* RIGHT - return results as data *)
+let solve_deadness ~source_annotations ~decls ~refs =
+  ... compute ...
+  { dead_positions; issues; annotations_to_write }  (* return, don't mutate *)
+```
+
+### 3. Enable incremental updates
+
+When file F changes:
+1. Replace `per_file_data[F]` with new data from re-processing F
+2. Re-merge into project-wide view
+3. Re-run solver (returns new results)
+
+This requires per-file data to be **keyed by filename**.
+
+---
+
 ## Current Problems (What We're Fixing)
 
 ### P1: Global "current file" context
@@ -39,7 +84,9 @@
 - `DeadType.TypeDependencies.delayedItems` - deferred type deps
 - `ProcessDeadAnnotations.positionsAnnotated` - annotation tracking
 
-**Impact**: Order-dependent. Processing files in different orders can give different results because queue processing happens at arbitrary times.
+**Additional problem**: `positionsAnnotated` mixes **input** (source annotations from AST) with **output** (positions the solver determines are dead). The solver mutates this during analysis, violating purity.
+
+**Impact**: Order-dependent. Processing files in different orders can give different results because queue processing happens at arbitrary times. Mixing input/output prevents incremental analysis.
 
 ### P4: Global configuration reads
 **Problem**: Analysis code directly reads `!Common.Cli.debug`, `RunConfig.runConfig.transitive`, etc. scattered throughout. Can't run analysis with different configs without mutating globals.
@@ -65,63 +112,79 @@
 ## End State
 
 ```ocaml
-(* Configuration: all inputs as immutable data *)
+(* Configuration: immutable *)
 type config = {
-  run : RunConfig.t;          (* transitive, suppress lists, etc. *)
+  run : RunConfig.t;
   debug : bool;
   write_annotations : bool;
   live_names : string list;
   live_paths : string list;
   exclude_paths : string list;
 }
 
-(* Per-file analysis state - everything needed to analyze one file *)
-type file_state = {
+(* Per-file source data - extracted from one file's AST *)
+type file_data = {
   source_path : string;
   module_name : Name.t;
   is_interface : bool;
-  annotations : annotation_state;
-  (* ... other per-file state *)
-}
-
-(* Project-level analysis state - accumulated across all files *)
-type project_state = {
-  decls : decl PosHash.t;
-  value_refs : PosSet.t PosHash.t;
-  type_refs : PosSet.t PosHash.t;
-  file_refs : FileSet.t FileHash.t;
-  optional_args : optional_args_state;
-  exceptions : exception_state;
-  (* ... *)
+  source_annotations : annotated_as PosHash.t;  (* @dead/@live/@genType in source *)
+  decls : decl list;                            (* declarations defined here *)
+  value_refs : (pos * pos) list;                (* references made from here *)
+  type_refs : (pos * pos) list;
+  file_refs : string list;                      (* files this file depends on *)
 }
 
-(* Pure analysis function *)
-val analyze_file : config -> file_state -> project_state -> Cmt_format.cmt_infos -> project_state
+(* Per-file data keyed by filename - enables incremental updates *)
+type per_file_state = file_data StringMap.t
 
-(* Pure deadness solver *)
-val solve_deadness : config -> project_state -> analysis_result
+(* Project-wide merged view - computed from per_file_state *)
+type merged_state = {
+  all_annotations : annotated_as PosHash.t;     (* merged from all files *)
+  all_decls : decl PosHash.t;                   (* merged from all files *)
+  all_value_refs : PosSet.t PosHash.t;          (* merged from all files *)
+  all_type_refs : PosSet.t PosHash.t;
+  all_file_refs : FileSet.t StringMap.t;
+}
 
+(* Analysis results - IMMUTABLE, returned by solver *)
 type analysis_result = {
-  dead_decls : decl list;
-  issues : Common.issue list;
+  dead_positions : PosSet.t;
+  issues : issue list;
   annotations_to_write : (string * line_annotation list) list;
 }
 
-(* Side effects at the edge *)
+(* Pure: extract data from one file *)
+val process_file : config -> Cmt_format.cmt_infos -> file_data
+
+(* Pure: merge per-file data into project-wide view *)
+val merge_file_data : per_file_state -> merged_state
+
+(* Pure: solve deadness - takes READ-ONLY input, returns IMMUTABLE result *)
+val solve_deadness : config -> merged_state -> analysis_result
+
+(* Orchestration with side effects at edges *)
 let run_analysis ~config ~cmt_files =
-  (* Pure: analyze all files *)
-  let project_state = 
+  (* Pure: process each file independently *)
+  let per_file = 
     cmt_files 
-    |> List.fold_left (fun state file -> 
-         analyze_file config (file_state_for file) state (load_cmt file)
-       ) empty_project_state
+    |> List.map (fun path -> (path, process_file config (load_cmt path)))
+    |> StringMap.of_list
   in
-  (* Pure: solve deadness *)
-  let result = solve_deadness config project_state in
+  (* Pure: merge into project-wide view *)
+  let merged = merge_file_data per_file in
+  (* Pure: solve deadness - NO MUTATION *)
+  let result = solve_deadness config merged in
   (* Impure: report results *)
   result.issues |> List.iter report_issue;
   if config.write_annotations then 
-    result.annotations_to_write |> List.iter write_annotations_to_file
+    result.annotations_to_write |> List.iter write_to_file
+
+(* Incremental update when file F changes *)
+let update_file ~config ~per_file ~changed_file =
+  let new_file_data = process_file config (load_cmt changed_file) in
+  let per_file = StringMap.add changed_file new_file_data per_file in
+  let merged = merge_file_data per_file in
+  (per_file, solve_deadness config merged)  (* also return the updated map so the next update starts from current data *)
 ```
 
 ---
@@ -173,36 +236,60 @@ Each task should:
 **Value**: Removes hidden global state. Makes annotation tracking testable.
 
 **Changes**:
-- [ ] Change `ProcessDeadAnnotations` functions to take/return explicit `state` instead of mutating `positionsAnnotated` ref
-- [ ] Thread `annotation_state` through `DeadCode.processCmt`
-- [ ] Delete the global `positionsAnnotated`
+- [x] Create `AnnotationState.t` module with explicit state type and accessor functions
+- [x] Change `ProcessDeadAnnotations` functions to take explicit `~state:AnnotationState.t`
+- [x] Thread `annotation_state` through `DeadCode.processCmt` and `Reanalyze.loadCmtFile`
+- [x] Update `declIsDead`, `doReportDead`, `resolveRecursiveRefs`, `reportDead` to use explicit state
+- [x] Update `DeadOptionalArgs.check` to take explicit state
+- [x] Delete the global `positionsAnnotated`
+
+**Status**: Partially complete ⚠️
+
+**Known limitation**: Current implementation still mixes concerns:
+- Source annotations (from `@dead`/`@live`/`@genType` in files) - INPUT
+- Analysis results (positions solver determined are dead) - OUTPUT
+
+The solver currently **mutates** `AnnotationState` via `annotate_dead` during `resolveRecursiveRefs`.
+This violates the principle that analysis results should be immutable and returned.
+
+**TODO** (in later task):
+- [ ] Separate `SourceAnnotations.t` (per-file, read-only input) from analysis results
+- [ ] Make `SourceAnnotations` keyed by filename for incremental updates
+- [ ] Solver should return dead positions as part of `analysis_result`, not mutate state
 
 **Test**: Process two files "simultaneously" (two separate state values) - should not interfere.
 
 **Estimated effort**: Small (well-scoped module)
 
 ### Task 4: Localize analysis tables (P2) - Part 1: Declarations
 
-**Value**: First step toward incremental analysis. Can analyze a subset of files with isolated state.
+**Value**: First step toward incremental analysis. Per-file declaration data enables replacing one file's contributions.
 
 **Changes**:
-- [ ] Change `DeadCommon.addDeclaration_` and friends to take `decl_state : decl PosHash.t` parameter
-- [ ] Thread through `DeadCode.processCmt` - allocate fresh state, pass through, return updated state
-- [ ] Accumulate per-file states in `Reanalyze.processCmtFiles`
+- [ ] Create `FileDecls.t` type for per-file declarations (keyed by filename)
+- [ ] `process_file` returns declarations for that file only
+- [ ] Store as `file_decls : decl list StringMap.t` (per-file, keyed by filename)
+- [ ] Create `merge_decls : file_decls -> decl PosHash.t` for project-wide view
 - [ ] Delete global `DeadCommon.decls`
 
+**Incremental benefit**: When file F changes, just replace `file_decls[F]` and re-merge.
+
 **Test**: Analyze files with separate decl tables - should not interfere.
 
 **Estimated effort**: Medium (core data structure, many call sites)
 
 ### Task 5: Localize analysis tables (P2) - Part 2: References
 
-**Value**: Completes the localization of analysis state.
+**Value**: Completes per-file reference tracking for incremental analysis.
 
 **Changes**:
-- [ ] Same pattern as Task 4 but for `ValueReferences.table` and `TypeReferences.table`
-- [ ] Thread explicit `value_refs` and `type_refs` parameters
-- [ ] Delete global reference tables
+- [ ] Create `FileRefs.t` for per-file references (keyed by filename)
+- [ ] `process_file` returns references made from that file
+- [ ] Store as `file_value_refs : (pos * pos) list StringMap.t`
+- [ ] Create `merge_refs` for project-wide view
+- [ ] Delete global `ValueReferences.table` and `TypeReferences.table`
+
+**Incremental benefit**: When file F changes, replace `file_value_refs[F]` (and the corresponding per-file type references) and re-merge.
 
 **Test**: Same as Task 4.
 
@@ -213,40 +300,63 @@ Each task should:
 **Value**: Removes order dependence. Makes analysis deterministic.
 
 **Changes**:
-- [ ] `DeadOptionalArgs`: Thread explicit `state` with `delayed_items` and `function_refs`, delete global refs
-- [ ] `DeadException`: Thread explicit `state` with `delayed_items` and `declarations`, delete global refs
-- [ ] `DeadType.TypeDependencies`: Thread explicit `type_deps_state`, delete global ref
-- [ ] Update `forceDelayedItems` calls to operate on explicit state
+- [ ] `DeadOptionalArgs`: Return delayed items from file processing, merge later
+- [ ] `DeadException`: Return delayed items from file processing, merge later
+- [ ] `DeadType.TypeDependencies`: Return delayed items from file processing, merge later
+- [ ] `forceDelayedItems` operates on merged delayed items (pure function)
+- [ ] Delete global refs
+
+**Key insight**: Delayed items should be **returned** from file processing, not accumulated in globals.
+This makes them per-file and enables incremental updates.
 
 **Test**: Process files in different orders - delayed items should be processed consistently.
 
 **Estimated effort**: Medium (3 modules, each similar to Task 3)
 
 ### Task 7: Localize file/module tracking (P2 + P3)
 
-**Value**: Removes last major global state. Makes cross-file analysis explicit.
+**Value**: Per-file dependency tracking enables incremental dependency graph updates.
 
 **Changes**:
-- [ ] `FileReferences`: Replace global `table` with explicit `file_refs_state` parameter
-- [ ] `DeadModules`: Replace global `table` with explicit `module_state` parameter  
-- [ ] Thread both through analysis pipeline
-- [ ] `iterFilesFromRootsToLeaves`: take explicit state, return ordered file list (pure)
+- [ ] `FileReferences`: Store per-file as `file_deps : string list StringMap.t`
+- [ ] Create `merge_file_refs` for project-wide dependency graph
+- [ ] `DeadModules`: Track per-file module usage, merge for project-wide view
+- [ ] `iterFilesFromRootsToLeaves`: pure function on merged file refs, returns ordered list
+
+**Incremental benefit**: When file F changes, update `file_deps[F]` and re-merge graph.
 
 **Test**: Build file reference graph in isolation, verify topological ordering is correct.
 
 **Estimated effort**: Medium (cross-file logic, but well-contained)
 
-### Task 8: Separate analysis from reporting (P5)
+### Task 8: Separate analysis from reporting (P5) - Immutable Results
 
-**Value**: Core analysis is now pure. Can get results as data. Can test without I/O.
+**Value**: Solver returns immutable results. No mutation during analysis. Pure function.
 
 **Changes**:
-- [ ] `DeadCommon.reportDead`: Return `issue list` instead of calling `Log_.warning`
+- [ ] Create `AnalysisResult.t` type with `dead_positions`, `issues`, `annotations_to_write`
+- [ ] `solve_deadness`: Return `AnalysisResult.t` instead of mutating state
+- [ ] Remove `AnnotationState.annotate_dead` call from `resolveRecursiveRefs`
+- [ ] Dead positions are part of returned result, not mutated into input state
 - [ ] `Decl.report`: Return `issue` instead of logging
 - [ ] Remove all `Log_.warning`, `Log_.item`, `EmitJson` calls from `Dead*.ml` modules
-- [ ] `Reanalyze.runAnalysis`: Call pure analysis, then separately report issues
+- [ ] `Reanalyze.runAnalysis`: Call pure solver, then separately report from result
+
+**Key principle**: The solver takes **read-only** merged state and returns **new immutable** results.
+No mutation of input state during analysis.
 
-**Test**: Run analysis, capture result list, verify no I/O side effects occurred.
+```ocaml
+(* Before - WRONG *)
+let solve ~state = 
+  ... AnnotationState.annotate_dead state pos ...  (* mutates input! *)
+
+(* After - RIGHT *)
+let solve ~merged_state =
+  let dead_positions = ... compute ... in
+  { dead_positions; issues; annotations_to_write }  (* return new data *)
+```
+
+**Test**: Run analysis, capture result, verify input state unchanged.
 
 **Estimated effort**: Medium (many logging call sites, but mechanical)
 
@@ -296,17 +406,23 @@ Each task should:
 ## Execution Strategy
 
 **Completed**: Task 1 ✅, Task 2 ✅, Task 10 ✅
+**Partially complete**: Task 3 ⚠️ (state explicit but still mixes input/output)
 
-**Remaining order**: 3 → 4 → 5 → 6 → 7 → 8 → 9 → 11 (test)
+**Remaining order**: 4 → 5 → 6 → 7 → 8 → 9 → 11 (test)
 
 **Why this order?**
 - Tasks 1-2 remove implicit dependencies (file context, config) - ✅ DONE
-- Tasks 3-7 localize global state - can be done incrementally now that inputs are explicit
-- Tasks 8-9 separate pure/impure - can only do this once state is local
+- Task 3 makes annotation tracking explicit - ⚠️ PARTIAL (needs input/output separation in Task 8)
+- Tasks 4-7 make state **per-file** for incremental updates
+- Task 8 makes solver **pure** with immutable results (also fixes Task 3's input/output mixing)
+- Task 9 separates annotation computation from file writing
 - Task 10 verifies no global config reads remain - ✅ DONE
-- Task 11 validates everything
+- Task 11 validates everything including incremental updates
 
-**Alternative**: Could do 3-7 in any order (they're mostly independent).
+**Key architectural milestones**:
+1. **After Task 7**: All state is per-file, keyed by filename
+2. **After Task 8**: Solver is pure, returns immutable results
+3. **After Task 11**: Incremental updates verified working
 
 **Time estimate**: 
 - Best case (everything goes smoothly): 2-3 days
@@ -331,12 +447,20 @@ After all tasks:
 ✅ **Pure analysis function**
 - Can call analysis and get results as data
 - No side effects (logging, file I/O) during analysis
+- **Solver returns immutable results** - no mutation of input state
+
+✅ **Per-file state enables incremental updates**
+- All per-file data (annotations, decls, refs) keyed by filename
+- Can replace one file's data: `per_file_state[F] = new_data`
+- Re-merge and re-solve without reprocessing other files
 
-✅ **Incremental analysis possible**
-- Can create empty state and analyze just one file
-- Can update state with new file without reanalyzing everything
+✅ **Clear separation of input vs output**
+- Source annotations (from AST) are **read-only input**
+- Analysis results (dead positions, issues) are **immutable output**
+- Solver takes input, returns output - no mixing
 
 ✅ **Testable**
 - Can test analysis without mocking I/O
 - Can test with different configs without mutating globals
 - Can test with isolated state
+- Can verify solver doesn't mutate its input
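
The last check, verifying that the solver does not mutate its input, could be sketched as follows. This is a minimal illustration under stated assumptions: string keys stand in for positions, a plain `Hashtbl` stands in for `PosHash`, and `snapshot`, `input_unchanged_by_solver`, and the toy `solve_deadness` are hypothetical helpers rather than code from the repository.

```ocaml
(* Hypothetical stand-ins: string keys for positions, Hashtbl for PosHash. *)
type annotated_as = Live | Dead

type analysis_result = { dead_positions : string list }

(* Sorted list of bindings, used to compare table contents before/after. *)
let snapshot tbl =
  Hashtbl.fold (fun k v acc -> (k, v) :: acc) tbl [] |> List.sort compare

(* Toy solver: reads the annotation table but never writes to it. *)
let solve_deadness ~source_annotations ~decls ~refs =
  let is_dead d =
    match Hashtbl.find_opt source_annotations d with
    | Some Live -> false
    | Some Dead -> true
    | None -> not (List.mem d refs)
  in
  { dead_positions = List.filter is_dead decls }

(* Run the solver, then check that the input table is unchanged. *)
let input_unchanged_by_solver () =
  let source_annotations = Hashtbl.create 8 in
  Hashtbl.replace source_annotations "A.keep" Live;
  let before = snapshot source_annotations in
  let result =
    solve_deadness ~source_annotations
      ~decls:[ "A.keep"; "A.unused"; "A.used" ]
      ~refs:[ "A.used" ]
  in
  (result, snapshot source_annotations = before)
```

A solver that called something like `annotate_dead` on its input would add a binding for `"A.unused"` and make the second component `false`, so the same check would flag the current mutating design.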