Update docs and some rag work#22

Merged
ossirytk merged 4 commits into main from vibing-rag-quality-work
Mar 28, 2026

ossirytk (Owner) commented Mar 28, 2026

  • Fix docs/quality_gate.md:105 - incorrect baseline-capture command → uv run capture-conversation-baselines
  • Fix docs/future_work/COPILOT_COMPACT_REFERENCE.md - incorrect commands with extra subcommand arg → uv run capture-conversation-baselines
  • Fix docs/rag_management/03_MANAGE_COLLECTIONS.md:177-178 - dedupe --min-recall/--min-mrr bullet points
  • Fix scripts/quality_gate.py:89-106 - narrow broad except Exception to specific env-missing exceptions (OSError, ImportError, ModuleNotFoundError, RuntimeError)
  • Fix docs/rag_management/00_README.md:32,41 - case mismatch QUALITY_GATE.md → quality_gate.md
  • Fix docs/rag_management/LINTING.md:275 - case mismatch QUALITY_GATE.md → quality_gate.md

Copilot AI review requested due to automatic review settings March 28, 2026 05:09

Copilot AI left a comment

Pull request overview

This PR consolidates and expands “quality” tooling across RAG + conversation evaluation by adding a unified quality-gate command/workflow, extending persona-drift telemetry export and calibration utilities, and updating docs/fixtures/message-examples accordingly.

Changes:

  • Add unified quality-gate CLI (and GitHub Actions workflow) covering message-example lint, retrieval fixtures (optional), and conversation fixtures (mock + baseline regression).
  • Extend conversation session exports with persona-drift config/anchor + persist per-turn drift trace; add drift calibration command.
  • Add retrieval fixture hard-gates (--min-recall/--min-mrr), trend rendering, and update RAG docs + message examples to the [USER]/[ASSISTANT] format.
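
The hard gates in the last bullet can be pictured with a small sketch. It assumes one common definition of the metrics (Recall@k as the per-query fraction of expected IDs found in the top k results, MRR as the mean reciprocal rank of the first relevant hit); the project's own evaluator may compute them differently. The names `recall_at_k`, `reciprocal_rank`, and `gate` are hypothetical and do not come from `scripts/rag`.

```python
from __future__ import annotations


def recall_at_k(expected: set[str], ranked: list[str], k: int) -> float:
    """Fraction of expected IDs that appear in the top-k results."""
    if not expected:
        return 0.0
    return len(expected & set(ranked[:k])) / len(expected)


def reciprocal_rank(expected: set[str], ranked: list[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


def gate(cases, k, min_recall=None, min_mrr=None):
    """Return (exit_code, recall, mrr); exit_code is 1 when a threshold is missed."""
    recall = sum(recall_at_k(e, r, k) for e, r in cases) / len(cases)
    mrr = sum(reciprocal_rank(e, r) for e, r in cases) / len(cases)
    failed = (min_recall is not None and recall < min_recall) or (
        min_mrr is not None and mrr < min_mrr
    )
    return (1 if failed else 0), recall, mrr


# Hypothetical fixture data: (expected IDs, ranked retrieval result).
CASES = [
    ({"a", "b"}, ["a", "x", "b"]),  # recall@2 = 0.5, reciprocal rank = 1.0
    ({"c"}, ["x", "c", "y"]),       # recall@2 = 1.0, reciprocal rank = 0.5
]

code, recall, mrr = gate(CASES, k=2, min_recall=0.8, min_mrr=0.7)
print(code, recall, mrr)
```

In this sketch both thresholds are optional, so leaving them unset keeps the evaluation informational rather than blocking.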

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `web_app.py` | Adds quality block to saved session payload for downstream calibration/analysis. |
| `tests/test_conversation_quality_eval.py` | Adds tests for loading session drift turns and calibration report generation + fixture schema checks. |
| `tests/test_context_manager.py` | Simplifies history test string formatting. |
| `tests/fixtures/conversation_fixtures_negative.json` | Adds negative conversation fixture pack for leakage/OOC checks. |
| `tests/fixtures/conversation_fixtures_hard.json` | Adds hard conversation fixture pack for drift/style-break scenarios. |
| `scripts/rag/push_rag_data.py` | Formatting-only adjustments in option building. |
| `scripts/rag/migrate_collection_embedding.py` | Formatting-only adjustments for fingerprint keys/client construction. |
| `scripts/rag/manage_collections_core_types.py` | Extends fixture eval options with optional min_recall/min_mrr. |
| `scripts/rag/manage_collections_core_evaluation.py` | Adds CSV history loader helper and exports it. |
| `scripts/rag/manage_collections_core.py` | Formatting-only adjustment to exported name aggregation. |
| `scripts/rag/manage_collections_commands_eval.py` | Adds --min-recall/--min-mrr gating and a show-retrieval-trends command. |
| `scripts/rag/manage_collections_commands_collections.py` | Formatting-only adjustment to metadata filtering comprehension. |
| `scripts/rag/lint_message_examples.py` | Minor message string formatting + indentation fix in a violation append block. |
| `scripts/rag/analyze_rag_text_enrichment.py` | Formatting-only adjustment to category threshold boolean. |
| `scripts/rag/analyze_rag_coverage.py` | Formatting-only adjustments; adds blank line. |
| `scripts/quality_gate.py` | New unified quality-gate CLI orchestrating lint + retrieval + conversation checks. |
| `scripts/conversation/evaluate_quality.py` | Adds session-based persona drift calibration tooling and associated CLI command. |
| `scripts/conversation/capture_baselines.py` | New CLI to capture mock-mode baseline artifacts for conversation fixtures. |
| `rag_data/shodan_message_examples.txt` | Updates message-example formatting to [USER]/[ASSISTANT] style with blank-line separation. |
| `rag_data/leonardo_da_vinci_message_examples.txt` | Updates message-example formatting to [USER]/[ASSISTANT] style with blank-line separation. |
| `rag_data/legacy/shodan_message_examples.txt` | Adds legacy copy of the prior “User:/Persona:” formatted examples. |
| `rag_data/legacy/leonardo_da_vinci_message_examples.txt` | Adds legacy copy of the prior “User:/Persona:” formatted examples. |
| `pyproject.toml` | Registers quality-gate and capture-conversation-baselines as project scripts. |
| `docs/rag_management/LINTING.md` | Updates quality-gate integration instructions. |
| `docs/rag_management/CATEGORY_CONFIG.md` | Clarifies that category assignment is determined at analysis time, not push time. |
| `docs/rag_management/05_WORKFLOWS_TROUBLESHOOTING.md` | Updates recommended workflow and module-style invocation examples. |
| `docs/rag_management/03_MANAGE_COLLECTIONS.md` | Documents new gating options and retrieval trend display command. |
| `docs/rag_management/02_PUSH_RAG_DATA.md` | Updates module invocation and adds pre-push checklist/coverage note. |
| `docs/rag_management/01_ANALYZE_RAG_TEXT.md` | Updates examples to module-style invocation. |
| `docs/rag_management/00_README.md` | Adds module-invocation guidance and links to the quality gate doc. |
| `docs/quality_gate.md` | New end-to-end quality gate documentation. |
| `docs/future_work/UI_REFINEMENTS.md` | New UI backlog doc separated from general refinements. |
| `docs/future_work/REFINEMENTS.md` | Updates completed items and points UI work to the new UI backlog doc. |
| `docs/future_work/COPILOT_COMPACT_REFERENCE.md` | Updates implemented-state reference with new quality/conversation tooling. |
| `docs/context_management/07_IMPLEMENTATION.md` | Updates implementation notes to include multi-query, compression, and drift scoring. |
| `docs/context_management/03_CONFIGURATION.md` | Updates config docs for new keys and notes on defaults. |
| `docs/context_management/02_HOW_IT_WORKS.md` | Updates retrieval flow description for multi-query/compression/drift telemetry. |
| `docs/context_management/01_QUICKSTART.md` | Updates quickstart config snippets and naming. |
| `docs/context_management/00_README.md` | Updates feature bullets for new retrieval/drift capabilities. |
| `docs/configs/01_CONFIG_V2.md` | Documents conversation_quality and other new v2 config keys. |
| `docs/configs/00_README.md` | Notes config tracking behavior and lack of example config file. |
| `docs/RAG_SCRIPTS_GUIDE.md` | Updates preferred invocation and adds a canonical workflow section. |
| `core/conversation_retrieval_orchestration_mixin.py` | Formatting-only change to mes_trace construction when MES retrieval is disabled. |
| `core/conversation_response_mixin.py` | Persists persona drift trace in exported state; clears/imports trace. |
| `core/conversation_manager.py` | Initializes persona_drift_trace deque alongside drift history. |
| `README.md` | Updates RAG workflow docs to module invocation + adds optional quality gates before push. |
| `.github/workflows/quality_gate.yml` | Adds CI workflow running ruff, pytest, baseline capture, and quality gate (skip retrieval). |
Comments suppressed due to low confidence (1)

docs/rag_management/03_MANAGE_COLLECTIONS.md:404

  • show-retrieval-trends is documented earlier in this file already, and this later section repeats the same command/option text (and duplicates the section number). Please remove one of the sections (or replace the second with a short cross-link) to avoid documentation drift.
### 10) `show-retrieval-trends`

Display a compact trend table from a retrieval eval history CSV, showing Recall@k and MRR per run with deltas vs the previous row.

```bash
uv run python -m scripts.rag.manage_collections show-retrieval-trends \
  --history-csv logs/retrieval_eval/history.csv
```

Limit to the most recent N rows:

```bash
uv run python -m scripts.rag.manage_collections show-retrieval-trends \
  --history-csv logs/retrieval_eval/history.csv \
  --last-n 10
```

Options:

  • --history-csv (default: logs/retrieval_eval/history.csv)
  • --last-n (default: show all rows)

The table prints columns: #, Date, Fixture, Mode, k, Recall@k, MRR, dRecall, dMRR.
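
As a rough illustration of the "deltas vs the previous row" idea, the sketch below reads a history CSV and attaches `dRecall`/`dMRR` style deltas to each run. The column names (`date`, `recall`, `mrr`, and so on) and the sample rows are assumptions made for the example; the real history file and the `show-retrieval-trends` implementation may differ.

```python
import csv
import io

# Hypothetical history CSV; the real file's column names may differ.
HISTORY = """date,fixture,mode,k,recall,mrr
2026-03-26,core,similarity,5,0.70,0.55
2026-03-27,core,similarity,5,0.75,0.60
2026-03-28,core,similarity,5,0.72,0.64
"""


def trend_rows(handle, last_n=None):
    """Return one dict per run with deltas against the previous row."""
    out = []
    prev = None
    for row in csv.DictReader(handle):
        recall, mrr = float(row["recall"]), float(row["mrr"])
        out.append({
            "date": row["date"],
            "recall": recall,
            "mrr": mrr,
            # The first row has no predecessor, so its deltas are None.
            "d_recall": None if prev is None else round(recall - prev[0], 4),
            "d_mrr": None if prev is None else round(mrr - prev[1], 4),
        })
        prev = (recall, mrr)
    # Deltas are computed over the full history before trimming, so the
    # first displayed row still shows its change from the run before it.
    return out[-last_n:] if last_n else out


for entry in trend_rows(io.StringIO(HISTORY), last_n=2):
    print(entry)
```

Computing deltas before applying the row limit is a design choice of this sketch, not a documented behaviour of the command.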

</details>

Comment thread docs/quality_gate.md Outdated
Before using baseline comparison, capture canonical baselines in mock mode:

```bash
uv run python -m scripts.conversation.capture_baselines capture-conversation-baselines
```

Copilot AI Mar 28, 2026

This baseline-capture invocation is incorrect for the current scripts/conversation/capture_baselines.py module: python -m scripts.conversation.capture_baselines already runs the click command, so passing capture-conversation-baselines as an extra argument will cause a click usage error. Update the docs to either use uv run python -m scripts.conversation.capture_baselines or the [project.scripts] entrypoint uv run capture-conversation-baselines.
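
The failure mode described in this comment is generic to single-command CLIs: when a module's entry point is the command itself rather than a command group, a trailing command name is parsed as an unexpected positional argument. The sketch below reproduces that with the standard library's `argparse` as a stand-in for click. The real module uses click and its options are not shown in this thread, so `--output-dir` is a hypothetical flag.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # A single-command CLI: options only, no subcommands and no positionals.
    parser = argparse.ArgumentParser(prog="capture_baselines")
    parser.add_argument("--output-dir", default="baselines")  # hypothetical option
    return parser


def try_parse(argv):
    """Return the exit code a shell would see for the given arguments."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:  # argparse exits with status 2 on usage errors
        return exc.code
    return 0


print(try_parse([]))                                  # no extra argument
print(try_parse(["capture-conversation-baselines"]))  # stray command name
```

The second call reports a usage error because nothing in the parser is declared to consume the extra word, which mirrors why the documented command needed to drop it.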

Suggested change
uv run python -m scripts.conversation.capture_baselines capture-conversation-baselines
uv run capture-conversation-baselines

Copilot uses AI. Check for mistakes.
uv run ruff format .
uv run python -m scripts.conversation.evaluate_quality evaluate-conversation-fixtures
uv run python -m scripts.conversation.evaluate_quality calibrate-persona-drift
uv run python -m scripts.conversation.capture_baselines capture-conversation-baselines

Copilot AI Mar 28, 2026

The operational command shown here is likely incorrect: uv run python -m scripts.conversation.capture_baselines already runs the click command, so adding capture-conversation-baselines as an extra argument will fail. Prefer either uv run python -m scripts.conversation.capture_baselines or uv run capture-conversation-baselines (from [project.scripts]).

Suggested change
uv run python -m scripts.conversation.capture_baselines capture-conversation-baselines
uv run python -m scripts.conversation.capture_baselines

Comment on lines +177 to +178
- `--min-recall` (exit non-zero if Recall@k falls below this threshold)
- `--min-mrr` (exit non-zero if MRR falls below this threshold)

Copilot AI Mar 28, 2026

The --min-recall / --min-mrr options are listed twice, which makes it unclear whether they differ. Please dedupe these bullet points (keep a single description per flag).

Suggested change
- `--min-recall` (exit non-zero if Recall@k falls below this threshold)
- `--min-mrr` (exit non-zero if MRR falls below this threshold)

Comment thread scripts/quality_gate.py
Comment on lines +89 to +106
try:
    options = FixtureEvalOptions(
        fixture_file=fixture_file,
        k=None,
        retrieval_mode="similarity",
        persist_directory=None,
        embedding_model=None,
        embedding_device=None,
        show_failures=False,
        min_recall=min_recall,
        min_mrr=min_mrr,
    )
    run = _execute_fixture_evaluation(options)
except click.ClickException as exc:
    return "fail", str(exc.format_message())
except Exception as exc:
    return "skip", f"Retrieval evaluation unavailable: {exc}"

Copilot AI Mar 28, 2026

The retrieval step treats any non-ClickException as "skip" (except Exception: return 'skip' ...). This can mask real bugs (e.g., parsing errors, regressions inside evaluation) and produce a false-green gate. Consider only skipping for known “environment missing” failure modes (e.g., Chroma/embedding initialization errors) and failing (or re-raising) for unexpected exceptions, possibly with a stack trace in the output.
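
One way to picture the narrower handling this comment asks for is sketched below. It is not the code from `scripts/quality_gate.py`: `GateFailure` is a hypothetical stand-in for `click.ClickException` so the example runs without third-party packages, and `run_retrieval_step` is an illustrative name. The exception tuple follows the types named in the PR description.

```python
class GateFailure(Exception):
    """Hypothetical stand-in for click.ClickException (a real gate failure)."""


# Exception types treated as "environment not available" in this sketch.
# ModuleNotFoundError is a subclass of ImportError, so ImportError covers it.
ENV_MISSING = (OSError, ImportError, RuntimeError)


def run_retrieval_step(evaluate):
    """Map the outcome of `evaluate()` to a (status, detail) pair."""
    try:
        evaluate()
    except GateFailure as exc:
        return "fail", str(exc)
    except ENV_MISSING as exc:
        return "skip", f"Retrieval evaluation unavailable: {exc}"
    # Anything else (ValueError, KeyError, ...) propagates to the caller,
    # so an unexpected bug cannot turn into a silent skip.
    return "pass", ""


def missing_backend():
    raise ImportError("vector store package is not installed")


def threshold_missed():
    raise GateFailure("Recall@k below threshold")


def parsing_bug():
    raise ValueError("malformed fixture")


print(run_retrieval_step(missing_backend))
print(run_retrieval_step(threshold_missed))
```

`OSError` and `RuntimeError` are still fairly wide nets (for example, `NotImplementedError` is a `RuntimeError` subclass), so a project-specific exception raised at initialization time would be a tighter alternative if one is available.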

Comment thread docs/rag_management/00_README.md Outdated
4. `04_OLD_PREPARE_RAG.md`
5. `05_WORKFLOWS_TROUBLESHOOTING.md`

See also `docs/QUALITY_GATE.md` for the unified quality gate reference.

Copilot AI Mar 28, 2026

docs/QUALITY_GATE.md is referenced here, but the file added in this PR is docs/quality_gate.md (different casing). On case-sensitive filesystems (Linux CI), this link will be broken; update the reference (or rename the file) so the path matches exactly.
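
A casing mismatch like this is easy to miss on a case-insensitive filesystem, where the wrongly cased path still opens. The sketch below shows one possible check that gives the same answer on any filesystem by comparing each path component against the actual directory listing. The helper name `exact_case_exists` is hypothetical and not part of this repository.

```python
import os
import tempfile


def exact_case_exists(root: str, relative_path: str) -> bool:
    """True only if every path component matches a directory entry exactly.

    Comparing against os.listdir() output avoids relying on os.path.exists(),
    which accepts a wrongly cased path on case-insensitive filesystems.
    """
    current = root
    for part in relative_path.split("/"):
        if not os.path.isdir(current) or part not in os.listdir(current):
            return False
        current = os.path.join(current, part)
    return True


with tempfile.TemporaryDirectory() as repo:
    os.mkdir(os.path.join(repo, "docs"))
    open(os.path.join(repo, "docs", "quality_gate.md"), "w").close()

    ok = exact_case_exists(repo, "docs/quality_gate.md")
    broken = exact_case_exists(repo, "docs/QUALITY_GATE.md")
    print(ok, broken)
```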

Comment thread docs/rag_management/00_README.md Outdated
Comment on lines +41 to +42
The unified quality gate (`docs/QUALITY_GATE.md`) runs RAG lint, retrieval fixtures, and
conversation fixture evaluation in one command. See that document for the full guide.

Copilot AI Mar 28, 2026

This section links to docs/QUALITY_GATE.md, but the actual document in this PR is docs/quality_gate.md. Please align the casing so the link works on Linux/macOS (case-sensitive) environments.

Comment thread docs/rag_management/LINTING.md Outdated
```

This runs linting, coverage scoring, and retrieval benchmarks together.
The gate runs message-example linting, conversation fixture evaluation, and (optionally) retrieval fixture evaluation in sequence, printing a PASS/WARN/FAIL table and exiting non-zero if any step fails. See `docs/QUALITY_GATE.md` for full usage.

Copilot AI Mar 28, 2026

The doc points to docs/QUALITY_GATE.md, but the quality gate doc added in this PR is docs/quality_gate.md. This mismatch will break links on case-sensitive filesystems; update the reference or rename the target file to match.

Suggested change
The gate runs message-example linting, conversation fixture evaluation, and (optionally) retrieval fixture evaluation in sequence, printing a PASS/WARN/FAIL table and exiting non-zero if any step fails. See `docs/QUALITY_GATE.md` for full usage.
The gate runs message-example linting, conversation fixture evaluation, and (optionally) retrieval fixture evaluation in sequence, printing a PASS/WARN/FAIL table and exiting non-zero if any step fails. See `docs/quality_gate.md` for full usage.
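
The behaviour this sentence describes (run steps in sequence, print a status table, exit non-zero if any step fails) might look roughly like the sketch below. The step names, the detail strings, and the `summarize` helper are all hypothetical; the actual table layout and status handling live in `scripts/quality_gate.py` and are not shown in this thread.

```python
# Hypothetical step results as (name, status, detail) tuples.
RESULTS = [
    ("message-example lint", "pass", ""),
    ("conversation fixtures", "warn", "illustrative warning text"),
    ("retrieval fixtures", "skip", "Retrieval evaluation unavailable"),
]


def summarize(results):
    """Print a status table and return the process exit code."""
    width = max(len(name) for name, _, _ in results)
    for name, status, detail in results:
        print(f"{name.ljust(width)}  {status.upper():<4}  {detail}")
    # Only an explicit failure blocks the gate; warnings and skips do not.
    return 1 if any(status == "fail" for _, status, _ in results) else 0


exit_code = summarize(RESULTS)
print("exit code:", exit_code)
```

A CLI wrapper would typically pass the returned value to `sys.exit` so that CI picks up the failure.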

ossirytk and others added 2 commits March 28, 2026 07:15
ruff was only installed globally as a uv tool locally, but was not
declared in pyproject.toml. GitHub Actions runs uv sync --dev which
only installs declared dependencies, so ruff was absent on CI runners
causing 'Failed to spawn: ruff / No such file or directory'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ruff EXE001 requires files with shebang lines to have the executable
bit set. On Windows this bit is never set automatically, but the CI
runner is Linux where ruff enforces it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
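
The second commit message concerns files that start with a shebang but lack the executable bit. The sketch below approximates that condition so it can be checked locally on a POSIX system before pushing. It is an illustration only, not ruff's implementation of EXE001, and `shebang_without_exec_bit` is a hypothetical helper name.

```python
import os
import stat
import tempfile


def shebang_without_exec_bit(path: str) -> bool:
    """True if the file starts with '#!' but has no executable bit set."""
    with open(path, "rb") as handle:
        has_shebang = handle.read(2) == b"#!"
    mode = os.stat(path).st_mode
    is_executable = bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
    return has_shebang and not is_executable


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "tool.py")
    with open(script, "w") as handle:
        handle.write("#!/usr/bin/env python3\nprint('hi')\n")

    os.chmod(script, 0o644)  # readable, not executable
    flagged_before = shebang_without_exec_bit(script)

    os.chmod(script, 0o755)  # executable bit set
    flagged_after = shebang_without_exec_bit(script)

    print(flagged_before, flagged_after)
```

When working from Windows, one common way to record the bit in the repository is `git update-index --chmod=+x <file>`; removing the shebang line is another option if the file is never executed directly.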
ossirytk (Owner, Author)

@copilot apply changes based on the comments in this thread

Copilot AI (Contributor) commented Mar 28, 2026

> @copilot apply changes based on the comments in this thread

Applied all changes from the review thread in commit 1d958f2:

  • docs/quality_gate.md — fixed baseline-capture command to uv run capture-conversation-baselines
  • docs/future_work/COPILOT_COMPACT_REFERENCE.md — fixed the same incorrect command in both the prose (line 36) and the code block (line 227)
  • docs/rag_management/03_MANAGE_COLLECTIONS.md — removed duplicate --min-recall/--min-mrr bullet points
  • scripts/quality_gate.py — narrowed the broad except Exception to (OSError, ImportError, ModuleNotFoundError, RuntimeError) so only known environment-missing failures produce a skip; unexpected exceptions propagate normally
  • docs/rag_management/00_README.md and docs/rag_management/LINTING.md — fixed case-mismatch references QUALITY_GATE.md → quality_gate.md

@ossirytk ossirytk merged commit 1c876f0 into main Mar 28, 2026
2 checks passed
@ossirytk ossirytk deleted the vibing-rag-quality-work branch March 28, 2026 05:42
3 participants