Pull request overview
This PR consolidates and expands “quality” tooling across RAG + conversation evaluation by adding a unified quality-gate command/workflow, extending persona-drift telemetry export and calibration utilities, and updating docs/fixtures/message-examples accordingly.
Changes:
- Add unified `quality-gate` CLI (and GitHub Actions workflow) covering message-example lint, retrieval fixtures (optional), and conversation fixtures (mock + baseline regression).
- Extend conversation session exports with persona-drift config/anchor + persist per-turn drift trace; add drift calibration command.
- Add retrieval fixture hard-gates (`--min-recall`/`--min-mrr`), trend rendering, and update RAG docs + message examples to the `[USER]`/`[ASSISTANT]` format.
Reviewed changes
Copilot reviewed 47 out of 47 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| web_app.py | Adds quality block to saved session payload for downstream calibration/analysis. |
| tests/test_conversation_quality_eval.py | Adds tests for loading session drift turns and calibration report generation + fixture schema checks. |
| tests/test_context_manager.py | Simplifies history test string formatting. |
| tests/fixtures/conversation_fixtures_negative.json | Adds negative conversation fixture pack for leakage/OOC checks. |
| tests/fixtures/conversation_fixtures_hard.json | Adds hard conversation fixture pack for drift/style-break scenarios. |
| scripts/rag/push_rag_data.py | Formatting-only adjustments in option building. |
| scripts/rag/migrate_collection_embedding.py | Formatting-only adjustments for fingerprint keys/client construction. |
| scripts/rag/manage_collections_core_types.py | Extends fixture eval options with optional min_recall/min_mrr. |
| scripts/rag/manage_collections_core_evaluation.py | Adds CSV history loader helper and exports it. |
| scripts/rag/manage_collections_core.py | Formatting-only adjustment to exported name aggregation. |
| scripts/rag/manage_collections_commands_eval.py | Adds --min-recall/--min-mrr gating and a show-retrieval-trends command. |
| scripts/rag/manage_collections_commands_collections.py | Formatting-only adjustment to metadata filtering comprehension. |
| scripts/rag/lint_message_examples.py | Minor message string formatting + indentation fix in a violation append block. |
| scripts/rag/analyze_rag_text_enrichment.py | Formatting-only adjustment to category threshold boolean. |
| scripts/rag/analyze_rag_coverage.py | Formatting-only adjustments; adds blank line. |
| scripts/quality_gate.py | New unified quality-gate CLI orchestrating lint + retrieval + conversation checks. |
| scripts/conversation/evaluate_quality.py | Adds session-based persona drift calibration tooling and associated CLI command. |
| scripts/conversation/capture_baselines.py | New CLI to capture mock-mode baseline artifacts for conversation fixtures. |
| rag_data/shodan_message_examples.txt | Updates message-example formatting to [USER]/[ASSISTANT] style with blank-line separation. |
| rag_data/leonardo_da_vinci_message_examples.txt | Updates message-example formatting to [USER]/[ASSISTANT] style with blank-line separation. |
| rag_data/legacy/shodan_message_examples.txt | Adds legacy copy of the prior “User:/Persona:” formatted examples. |
| rag_data/legacy/leonardo_da_vinci_message_examples.txt | Adds legacy copy of the prior “User:/Persona:” formatted examples. |
| pyproject.toml | Registers quality-gate and capture-conversation-baselines as project scripts. |
| docs/rag_management/LINTING.md | Updates quality-gate integration instructions. |
| docs/rag_management/CATEGORY_CONFIG.md | Clarifies that category assignment is determined at analysis time, not push time. |
| docs/rag_management/05_WORKFLOWS_TROUBLESHOOTING.md | Updates recommended workflow and module-style invocation examples. |
| docs/rag_management/03_MANAGE_COLLECTIONS.md | Documents new gating options and retrieval trend display command. |
| docs/rag_management/02_PUSH_RAG_DATA.md | Updates module invocation and adds pre-push checklist/coverage note. |
| docs/rag_management/01_ANALYZE_RAG_TEXT.md | Updates examples to module-style invocation. |
| docs/rag_management/00_README.md | Adds module-invocation guidance and links to the quality gate doc. |
| docs/quality_gate.md | New end-to-end quality gate documentation. |
| docs/future_work/UI_REFINEMENTS.md | New UI backlog doc separated from general refinements. |
| docs/future_work/REFINEMENTS.md | Updates completed items and points UI work to the new UI backlog doc. |
| docs/future_work/COPILOT_COMPACT_REFERENCE.md | Updates implemented-state reference with new quality/conversation tooling. |
| docs/context_management/07_IMPLEMENTATION.md | Updates implementation notes to include multi-query, compression, and drift scoring. |
| docs/context_management/03_CONFIGURATION.md | Updates config docs for new keys and notes on defaults. |
| docs/context_management/02_HOW_IT_WORKS.md | Updates retrieval flow description for multi-query/compression/drift telemetry. |
| docs/context_management/01_QUICKSTART.md | Updates quickstart config snippets and naming. |
| docs/context_management/00_README.md | Updates feature bullets for new retrieval/drift capabilities. |
| docs/configs/01_CONFIG_V2.md | Documents conversation_quality and other new v2 config keys. |
| docs/configs/00_README.md | Notes config tracking behavior and lack of example config file. |
| docs/RAG_SCRIPTS_GUIDE.md | Updates preferred invocation and adds a canonical workflow section. |
| core/conversation_retrieval_orchestration_mixin.py | Formatting-only change to mes_trace construction when MES retrieval is disabled. |
| core/conversation_response_mixin.py | Persists persona drift trace in exported state; clears/imports trace. |
| core/conversation_manager.py | Initializes persona_drift_trace deque alongside drift history. |
| README.md | Updates RAG workflow docs to module invocation + adds optional quality gates before push. |
| .github/workflows/quality_gate.yml | Adds CI workflow running ruff, pytest, baseline capture, and quality gate (skip retrieval). |
Comments suppressed due to low confidence (1)
docs/rag_management/03_MANAGE_COLLECTIONS.md:404
`show-retrieval-trends` is already documented earlier in this file, and this later section repeats the same command/option text (and duplicates the section number). Please remove one of the sections (or replace the second with a short cross-link) to avoid documentation drift.
### 10) `show-retrieval-trends`
Display a compact trend table from a retrieval eval history CSV, showing Recall@k and MRR per run with deltas vs the previous row.
```bash
uv run python -m scripts.rag.manage_collections show-retrieval-trends \
  --history-csv logs/retrieval_eval/history.csv
```

Limit to the most recent N rows:

```bash
uv run python -m scripts.rag.manage_collections show-retrieval-trends \
  --history-csv logs/retrieval_eval/history.csv \
  --last-n 10
```

Options:

- `--history-csv` (default: `logs/retrieval_eval/history.csv`)
- `--last-n` (default: show all rows)

The table prints columns: `#`, `Date`, `Fixture`, `Mode`, `k`, `Recall@k`, `MRR`, `dRecall`, `dMRR`.
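As a rough illustration of the "deltas vs the previous row" behaviour described for the trend table, the sketch below computes per-run deltas from a history CSV. It is not the project's loader: the function name `trend_rows` and the CSV column names `recall_at_k` and `mrr` are assumptions made for this example, and the sample rows are invented.

```python
import csv
import io


def trend_rows(csv_text, last_n=None):
    """Parse history CSV text and attach deltas vs the previous row.

    Column names ("recall_at_k", "mrr") are hypothetical; the real
    history file may use different headers.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    out = []
    prev = None  # (recall, mrr) of the previous row, if any
    for index, row in enumerate(rows, start=1):
        recall = float(row["recall_at_k"])
        mrr = float(row["mrr"])
        out.append(
            {
                "#": index,
                "recall": recall,
                "mrr": mrr,
                # The first row has nothing to compare against.
                "d_recall": None if prev is None else round(recall - prev[0], 4),
                "d_mrr": None if prev is None else round(mrr - prev[1], 4),
            }
        )
        prev = (recall, mrr)
    if last_n is None:
        return out
    # Deltas are computed before truncating, so the first displayed row
    # still shows its change relative to the hidden row before it.
    return out[-last_n:] if last_n > 0 else []


# Invented sample data, for illustration only.
SAMPLE = "date,recall_at_k,mrr\n2024-01-01,0.50,0.40\n2024-01-02,0.75,0.35\n"
for entry in trend_rows(SAMPLE):
    print(entry)
```

Computing deltas before applying the row limit is one possible design choice; the actual command may instead compute them only over the displayed rows.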
| Before using baseline comparison, capture canonical baselines in mock mode: | ||
| ```bash | ||
| uv run python -m scripts.conversation.capture_baselines capture-conversation-baselines |
This baseline-capture invocation is incorrect for the current `scripts/conversation/capture_baselines.py` module: `python -m scripts.conversation.capture_baselines` already runs the click command, so passing `capture-conversation-baselines` as an extra argument will cause a click usage error. Update the docs to use either `uv run python -m scripts.conversation.capture_baselines` or the `[project.scripts]` entry point `uv run capture-conversation-baselines`.
| uv run python -m scripts.conversation.capture_baselines capture-conversation-baselines | |
| uv run capture-conversation-baselines |
| uv run ruff format . | ||
| uv run python -m scripts.conversation.evaluate_quality evaluate-conversation-fixtures | ||
| uv run python -m scripts.conversation.evaluate_quality calibrate-persona-drift | ||
| uv run python -m scripts.conversation.capture_baselines capture-conversation-baselines |
The operational command shown here is likely incorrect: `uv run python -m scripts.conversation.capture_baselines` already runs the click command, so adding `capture-conversation-baselines` as an extra argument will fail. Prefer either `uv run python -m scripts.conversation.capture_baselines` or `uv run capture-conversation-baselines` (from `[project.scripts]`).
| uv run python -m scripts.conversation.capture_baselines capture-conversation-baselines | |
| uv run python -m scripts.conversation.capture_baselines |
| - `--min-recall` (exit non-zero if Recall@k falls below this threshold) | ||
| - `--min-mrr` (exit non-zero if MRR falls below this threshold) |
The `--min-recall` / `--min-mrr` options are listed twice, which makes it unclear whether they differ. Please dedupe these bullet points (keep a single description per flag).
| - `--min-recall` (exit non-zero if Recall@k falls below this threshold) | |
| - `--min-mrr` (exit non-zero if MRR falls below this threshold) |
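To make the intended semantics of the two flags concrete, here is a minimal sketch of threshold gating on Recall@k and MRR. All function names are hypothetical, and the metric definitions (one expected document id per query, hit-rate style Recall@k) are assumptions for this example; the project's evaluation code may define them differently.

```python
def recall_at_k(results, expected, k):
    """Fraction of queries whose expected id appears in the top-k results."""
    hits = sum(1 for ranked, target in zip(results, expected) if target in ranked[:k])
    return hits / len(expected)


def mean_reciprocal_rank(results, expected):
    """Average of 1/rank of the expected id (0 when it is not retrieved)."""
    total = 0.0
    for ranked, target in zip(results, expected):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(expected)


def threshold_failures(recall, mrr, min_recall=None, min_mrr=None):
    """Return one message per violated threshold; an unset threshold never fails."""
    failures = []
    if min_recall is not None and recall < min_recall:
        failures.append(f"Recall@k {recall:.3f} is below --min-recall {min_recall:.3f}")
    if min_mrr is not None and mrr < min_mrr:
        failures.append(f"MRR {mrr:.3f} is below --min-mrr {min_mrr:.3f}")
    return failures


# Two toy queries: the first finds its target at rank 2, the second misses.
ranked_results = [["a", "b"], ["c", "d"]]
expected_ids = ["b", "x"]
recall = recall_at_k(ranked_results, expected_ids, k=2)
mrr = mean_reciprocal_rank(ranked_results, expected_ids)
print(threshold_failures(recall, mrr, min_recall=0.6, min_mrr=0.2))
```

A command built on this would exit non-zero whenever the returned list is non-empty, which matches the "exit non-zero if ... falls below this threshold" wording of both bullets.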
| try: | ||
| options = FixtureEvalOptions( | ||
| fixture_file=fixture_file, | ||
| k=None, | ||
| retrieval_mode="similarity", | ||
| persist_directory=None, | ||
| embedding_model=None, | ||
| embedding_device=None, | ||
| show_failures=False, | ||
| min_recall=min_recall, | ||
| min_mrr=min_mrr, | ||
| ) | ||
| run = _execute_fixture_evaluation(options) | ||
| except click.ClickException as exc: | ||
| return "fail", str(exc.format_message()) | ||
| except Exception as exc: | ||
| return "skip", f"Retrieval evaluation unavailable: {exc}" | ||
The retrieval step treats any non-`ClickException` as "skip" (`except Exception: return 'skip' ...`). This can mask real bugs (e.g., parsing errors, regressions inside evaluation) and produce a false-green gate. Consider only skipping for known "environment missing" failure modes (e.g., Chroma/embedding initialization errors) and failing (or re-raising) for unexpected exceptions, possibly with a stack trace in the output.
| 4. `04_OLD_PREPARE_RAG.md` | ||
| 5. `05_WORKFLOWS_TROUBLESHOOTING.md` | ||
| See also `docs/QUALITY_GATE.md` for the unified quality gate reference. |
`docs/QUALITY_GATE.md` is referenced here, but the file added in this PR is `docs/quality_gate.md` (different casing). On case-sensitive filesystems (Linux CI), this link will be broken; update the reference (or rename the file) so the path matches exactly.
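A casing mismatch like this is easy to miss on Windows or default macOS volumes, where the wrong path still resolves. The sketch below shows one possible way to check a referenced path with exact casing on any platform by comparing each path component against the real directory listing. The helper name `exists_with_exact_case` is hypothetical and not part of the PR.

```python
import os


def exists_with_exact_case(root, relative_path):
    """Return True only if every component of relative_path exists under root
    with exactly the given casing, even on case-insensitive filesystems."""
    current = root
    for part in relative_path.replace("\\", "/").split("/"):
        # os.listdir reports names as stored on disk, so a plain membership
        # test is case-sensitive regardless of how the filesystem resolves paths.
        if not os.path.isdir(current) or part not in os.listdir(current):
            return False
        current = os.path.join(current, part)
    return True
```

Run against the repository root, `exists_with_exact_case(".", "docs/QUALITY_GATE.md")` would be expected to return `False` while the `docs/quality_gate.md` spelling returns `True`, assuming the file is stored under the lowercase name as the comment states.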
| The unified quality gate (`docs/QUALITY_GATE.md`) runs RAG lint, retrieval fixtures, and | ||
| conversation fixture evaluation in one command. See that document for the full guide. |
This section links to `docs/QUALITY_GATE.md`, but the actual document in this PR is `docs/quality_gate.md`. Please align the casing so the link works in case-sensitive environments such as Linux CI.
| ``` | ||
| This runs linting, coverage scoring, and retrieval benchmarks together. | ||
| The gate runs message-example linting, conversation fixture evaluation, and (optionally) retrieval fixture evaluation in sequence, printing a PASS/WARN/FAIL table and exiting non-zero if any step fails. See `docs/QUALITY_GATE.md` for full usage. |
The doc points to `docs/QUALITY_GATE.md`, but the quality gate doc added in this PR is `docs/quality_gate.md`. This mismatch will break links on case-sensitive filesystems; update the reference or rename the target file to match.
| The gate runs message-example linting, conversation fixture evaluation, and (optionally) retrieval fixture evaluation in sequence, printing a PASS/WARN/FAIL table and exiting non-zero if any step fails. See `docs/QUALITY_GATE.md` for full usage. | |
| The gate runs message-example linting, conversation fixture evaluation, and (optionally) retrieval fixture evaluation in sequence, printing a PASS/WARN/FAIL table and exiting non-zero if any step fails. See `docs/quality_gate.md` for full usage. |
ruff was installed locally only as a global uv tool and was not declared in `pyproject.toml`. GitHub Actions runs `uv sync --dev`, which installs only declared dependencies, so ruff was absent on CI runners, causing 'Failed to spawn: ruff / No such file or directory'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ruff EXE001 requires files with shebang lines to have the executable bit set. On Windows this bit is never set automatically, but the CI runner is Linux where ruff enforces it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@copilot apply changes based on the comments in this thread
…on handler Agent-Logs-Url: https://github.com/ossirytk/light-chat/sessions/1563bc6c-3872-436c-b454-40f2b5e71c59 Co-authored-by: ossirytk <136164501+ossirytk@users.noreply.github.com>
Applied all changes from the review thread in commit
- `docs/quality_gate.md:105` - incorrect baseline-capture command → `uv run capture-conversation-baselines`
- `docs/future_work/COPILOT_COMPACT_REFERENCE.md` - incorrect commands with extra subcommand arg → `uv run capture-conversation-baselines`
- `docs/rag_management/03_MANAGE_COLLECTIONS.md:177-178` - dedupe `--min-recall`/`--min-mrr` bullet points
- `scripts/quality_gate.py:89-106` - narrow broad `except Exception` to specific env-missing exceptions (`OSError`, `ImportError`, `ModuleNotFoundError`, `RuntimeError`)
- `docs/rag_management/00_README.md:32,41` - case mismatch `QUALITY_GATE.md` → `quality_gate.md`
- `docs/rag_management/LINTING.md:275` - case mismatch `QUALITY_GATE.md` → `quality_gate.md`