Harden deployment readiness #9

Open
majdabd wants to merge 34 commits into main from deployment-readiness-audit

@majdabd majdabd commented Jun 19, 2026

Collaborator

No description provided.

Copilot AI review requested due to automatic review settings June 19, 2026 13:06

Copilot AI left a comment


Pull request overview

This PR hardens TrialMatchAI’s deployment/readiness story by centralizing Elasticsearch configuration, adding preflight validation, tightening JSON extraction from model outputs, and introducing a containerized (Docker Compose) runtime path while removing committed secrets/generated artifacts.

Changes:

  • Added preflight checks + CLI entrypoints (healthcheck/run/bootstrap/index) and wired them into setup/CI.
  • Centralized Elasticsearch indexer config (utils/Indexer/es_config.py) with env overrides and safer cert path handling.
  • Removed committed secrets/artifacts (Elasticsearch certs/config, Parser input/output samples, tracked IDs) and added secret scanning + ignore rules.

Reviewed changes

Copilot reviewed 136 out of 159 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
utils/Preprocessor/utils.py Fixes primary completion date extraction and removes unused NER overlap variables.
utils/Indexer/index_trials.py Switches to shared ES config helpers for consistent client/config creation.
utils/Indexer/index_criteria.py Switches to shared ES config helpers and updates default index name.
utils/Indexer/es_config.py New shared ES config loader/client builder with env overrides and CA path resolution.
utils/Indexer/config.json Removes real password and updates CA cert path to new layout.
utils/finetuning/finetune_instruct/evaluate_gemma2.py Makes incorrect-prediction output persist to disk and tidies imports.
utils/DataLoader/nct_ids.txt Removes tracked generated ID file.
tests/test_settings.py Removes manual sys.path manipulation for imports.
tests/test_search_queries.py Removes manual sys.path manipulation for imports.
tests/test_schemas.py Removes manual sys.path manipulation for imports.
tests/test_preflight_and_indexer.py Adds unit tests covering preflight checks and indexer config env overrides.
tests/test_logging.py Removes manual sys.path manipulation for imports.
tests/test_file_utils_pytest.py Removes unused import.
tests/test_deployment_readiness.py Adds tests for config resolution/env overrides + prompt/JSON extraction behavior.
source/Parser/output/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.gner.json Removes committed generated Parser output artifact.
source/Parser/output/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.biomedner.json Removes committed generated Parser output artifact.
source/Parser/output/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.gner.json Removes committed generated Parser output artifact.
source/Parser/output/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.biomedner.json Removes committed generated Parser output artifact.
source/Parser/output/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.gner.json Removes committed generated Parser output artifact.
source/Parser/output/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.biomedner.json Removes committed generated Parser output artifact.
source/Parser/output/6db4a18d11e3899c21b4cc11489cf3c8b457a40273c48ecd39ab4377.PubTator.biomedner.json Removes committed generated Parser output artifact.
source/Parser/output/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.gner.json Removes committed generated Parser output artifact.
source/Parser/output/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.biomedner.json Removes committed generated Parser output artifact.
source/Parser/output/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.gner.json Removes committed generated Parser output artifact.
source/Parser/output/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.biomedner.json Removes committed generated Parser output artifact.
source/Parser/output/611019227d9ea65a714df4e8c4498bfc13e21c67879025c3b03d8637.PubTator.biomedner.json Removes committed generated Parser output artifact.
source/Parser/output/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.gner.json Removes committed generated Parser output artifact.
source/Parser/output/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.biomedner.json Removes committed generated Parser output artifact.
source/Parser/output/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.gner.json Removes committed generated Parser output artifact.
source/Parser/output/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.biomedner.json Removes committed generated Parser output artifact.
source/Parser/output/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.gner.json Removes committed generated Parser output artifact.
source/Parser/output/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.biomedner.json Removes committed generated Parser output artifact.
source/Parser/input/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.gner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.biomedner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.gner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.biomedner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.gner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.biomedner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/6db4a18d11e3899c21b4cc11489cf3c8b457a40273c48ecd39ab4377.PubTator.gner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/6db4a18d11e3899c21b4cc11489cf3c8b457a40273c48ecd39ab4377.PubTator.biomedner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.gner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.biomedner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.gner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.biomedner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/611019227d9ea65a714df4e8c4498bfc13e21c67879025c3b03d8637.PubTator.gner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/611019227d9ea65a714df4e8c4498bfc13e21c67879025c3b03d8637.PubTator.biomedner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.gner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.biomedner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.gner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.biomedner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.gner.PubTator Removes committed generated Parser input artifact.
source/Parser/input/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.biomedner.PubTator Removes committed generated Parser input artifact.
source/Parser/biomedner_init.py Renames locals and removes unused variables to reduce lint noise.
source/Parser/biomedner_engine.py Removes stray print(os.getcwd()) debug output.
source/Matcher/utils/json_utils.py Adds balanced-brace JSON extraction helper for model output parsing.
source/Matcher/services/preflight.py Adds preflight validation for paths, ES reachability, models, and optional vLLM requirements.
source/Matcher/services/elasticsearch_service.py Adds helper to build an ES client from config + certs.
source/Matcher/services/biomedner_service.py Gates BioMedNER auto-start behind services.auto_start.
source/Matcher/pipeline/phenopacket_processor.py Switches LLM JSON parsing to balanced-brace extraction helper.
source/Matcher/pipeline/cot_reasoning.py Removes consent injection text and uses robust JSON extraction + error output persistence.
source/Matcher/pipeline/cot_reasoning_vllm.py Removes consent injection text and uses robust JSON extraction + error output persistence.
source/Matcher/models/llm/vllm_loader.py Disables remote code by default and supports explicit trust/revision configuration.
source/Matcher/models/llm/llm_reranker.py Adds revision + trust_remote_code plumbing to reranker loading.
source/Matcher/models/llm/llm_loader.py Adds revision + trust_remote_code plumbing to base model loading.
source/Matcher/models/embedding/text_embedder.py Adds revision + trust_remote_code plumbing to embedder loading.
source/Matcher/main.py Adds preflight execution and better exit codes; defers BioMedNER import until after readiness checks.
source/Matcher/config/settings.py Expands env overrides, adds new settings fields, and refactors nested override handling.
source/Matcher/config/config.json Updates default paths, disables auto-start defaults, standardizes index names, adds revision/trust flags.
source/Matcher/config/config_loader.py Adds robust config path resolution + repo-root .env loading + path normalization.
source/Matcher/cli/run.py New CLI entrypoint for running the batch pipeline.
source/Matcher/cli/index_data.py New CLI entrypoint that runs the indexing script from repo root.
source/Matcher/cli/healthcheck.py Switches to preflight checks + unified ES client builder; adds optional index requirement.
source/Matcher/cli/bootstrap_data.py New CLI entrypoint that runs the bootstrap script from repo root.
setup.sh Uses packaged CLI commands (via uv run when available) for bootstrap and indexing.
scripts/start_es.sh Supports both docker compose and legacy docker-compose.
scripts/scan_secrets.py Adds tracked-file secret scanning for CI/local checks.
scripts/bootstrap_data.sh Adds checksum verification and safer archive extraction helpers.
requirements.txt Restructures dependencies and documents extras (gpu/llm/training).
README.md Updates deployment guidance, security posture, CLI usage, and checks.
pyproject.toml Updates Python requirement, dependencies/extras, script entrypoints, and package data.
Makefile Adds targets for GPU sync, auditing, and running new CLIs.
elasticsearch/tmp-config/roles.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/role_mapping.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/log4j2.properties Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/log4j2.file.properties Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/jvm.options Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/elasticsearch.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/elasticsearch-plugins.example.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/docker-compose.yml Adjusts cert volume mount to allow setup container to write certs.
elasticsearch/config/es03/roles.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/role_mapping.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/log4j2.properties Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/jvm.options Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/es03.key Removes tracked TLS private key material.
elasticsearch/config/es03/es03.crt Removes tracked TLS cert material.
elasticsearch/config/es03/elasticsearch.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/elasticsearch-plugins.example.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/ca.crt Removes tracked CA cert material.
elasticsearch/config/es02/roles.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/role_mapping.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/log4j2.properties Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/jvm.options Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/es02.key Removes tracked TLS private key material.
elasticsearch/config/es02/es02.crt Removes tracked TLS cert material.
elasticsearch/config/es02/elasticsearch.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/elasticsearch-plugins.example.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/ca.crt Removes tracked CA cert material.
elasticsearch/config/es01/roles.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/role_mapping.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/log4j2.properties Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/jvm.options Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/es01.key Removes tracked TLS private key material.
elasticsearch/config/es01/es01.crt Removes tracked TLS cert material.
elasticsearch/config/es01/elasticsearch.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/elasticsearch-plugins.example.yml Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/ca.crt Removes tracked CA cert material.
elasticsearch/certs/instances.yml Removes tracked certutil instance configuration.
elasticsearch/certs/es03/es03.key Removes tracked TLS private key material.
elasticsearch/certs/es03/es03.crt Removes tracked TLS cert material.
elasticsearch/certs/es02/es02.key Removes tracked TLS private key material.
elasticsearch/certs/es02/es02.crt Removes tracked TLS cert material.
elasticsearch/certs/es01/es01.key Removes tracked TLS private key material.
elasticsearch/certs/es01/es01.crt Removes tracked TLS cert material.
elasticsearch/certs/ca/ca.key Removes tracked CA private key material.
elasticsearch/certs/ca/ca.crt Removes tracked CA cert material.
elasticsearch/certs/ca.crt Removes tracked CA cert material.
elasticsearch/apptainer-run-es.sh Adds checks to fail fast if TLS certs are missing for Apptainer ES startup.
elasticsearch/.env.example Adds example env template for local ES stack.
elasticsearch/.env Removes committed real credentials.
Dockerfile Adds a container build for TrialMatchAI worker using uv sync + GPU extra.
docker-compose.yml Adds root-level Compose stack for local ES + worker (+ optional Kibana).
.gitignore Expands ignore rules to prevent committing env/certs/parser outputs/indexing state.
.github/workflows/ci.yml Updates CI to include linting, secret scan, dependency audit, compose validation, and docker build.
.env.example Adds root env template for runtime configuration.
.dockerignore Prevents shipping secrets/artifacts into Docker build context.


Comment on lines +28 to +33
    if char == "{":
        depth += 1
    elif char == "}":
        depth -= 1
        if depth == 0:
            return json.loads(text[start : index + 1])
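
The excerpt above is the core of a balanced-brace scan. As a rough, self-contained sketch of that idea: the function name `extract_first_json_object`, the string/escape tracking, and the `None` fallback are illustrative assumptions and may differ from what `json_utils.py` actually does.

```python
import json
from typing import Any, Optional


def extract_first_json_object(text: str) -> Optional[Any]:
    """Return the first balanced {...} object in text, or None if none parses.

    Hypothetical helper: braces inside JSON string literals are skipped so
    that a value such as "}" does not end the scan early.
    """
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for index in range(start, len(text)):
        char = text[index]
        if in_string:
            # Inside a string literal: only track escapes and the closing quote.
            if escaped:
                escaped = False
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
            continue
        if char == '"':
            in_string = True
        elif char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start : index + 1])
                except json.JSONDecodeError:
                    return None
    return None
```

Calling it on `'Sure: {"a": "}", "b": {"c": 1}} done'` should hand back the parsed dictionary rather than stopping at the brace inside the string value.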
Comment thread source/Matcher/services/preflight.py Outdated
Comment on lines +75 to +77
    issues.append(
        f"Elasticsearch is not reachable at {config['elasticsearch']['host']}."
    )
Comment thread docker-compose.yml Outdated
Comment on lines +126 to +129
ELASTICSEARCH_HOSTS: https://elasticsearch:9200
ELASTICSEARCH_USERNAME: kibana_system
ELASTICSEARCH_PASSWORD: ${KIBANA_PASSWORD:-}
ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES: config/certs/ca/ca.crt
majdabd and others added 26 commits June 21, 2026 16:12
Safety net before the audit-driven refactor (see REFACTOR_PLAN.md).

- CI: add ml-extras-smoke job that runs `uv sync --extra entity`, imports the
  heavy libs (torch/transformers/gliner/gliner2) and the six local-model
  modules the default job never exercises, so a broken import or API drift
  is caught instead of shipping silently.
- CI: keep the installed-smoke wheel path absolute ("$PWD"/dist) so $WHEEL
  survives the cd into /tmp.
- tests: characterization tests pinning the current (buggy) score_trial
  behavior, plus a strict xfail encoding the PR1 contract (a Violated
  exclusion must hard-disqualify).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Previously score_trial averaged the inclusion ratio with the exclusion
ratio, so a trial whose exclusion criteria were Violated could still
outrank an eligible trial (audit finding C1). The classification lists
also counted impossible labels (inclusion can never be "Not Violated",
exclusion can never be "Met").

score_trial now:
- returns DISQUALIFIED_SCORE (-1.0) if any exclusion is "Violated", so it
  ranks strictly below every eligible trial;
- otherwise scores eligible trials in [0, 1] by the fraction of decided
  inclusion criteria (Met or Not Met) that are Met.

The prompts already constrain the label sets correctly, so only the scorer
changed; the duplicated prompt text is addressed in PR7. Contract tests
(disqualification, met-fraction, ranking order) replace the prior xfail.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
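
The scoring contract described in the commit message above can be sketched roughly as follows. Only `DISQUALIFIED_SCORE = -1.0` and the label names come from the message; the function signature, the `"Unknown"` label, and the 0.0 fallback when no inclusion criterion is decided are assumptions made for illustration.

```python
from typing import Iterable, List

DISQUALIFIED_SCORE = -1.0  # value stated in the commit message


def score_trial(inclusion_labels: Iterable[str], exclusion_labels: Iterable[str]) -> float:
    """Hypothetical scorer: hard-disqualify on any Violated exclusion."""
    if any(label == "Violated" for label in exclusion_labels):
        return DISQUALIFIED_SCORE
    # Only decided inclusion criteria (Met / Not Met) enter the fraction.
    decided: List[str] = [l for l in inclusion_labels if l in ("Met", "Not Met")]
    if not decided:
        return 0.0  # assumption: nothing decided scores 0.0
    return sum(l == "Met" for l in decided) / len(decided)


eligible = score_trial(["Met", "Not Met", "Unknown"], ["Not Violated"])
disqualified = score_trial(["Met", "Met"], ["Violated"])
```

Because every eligible score lies in [0, 1], a disqualified trial at -1.0 sorts below all of them without any special-casing in the ranker.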
The reranker re-derived device/dtype/attention setup independently of
llm_loader and got several wrong. Extract the shared logic into
models/llm/_common.py and rebuild both loaders on it.

Reranker fixes (audit C2 + highs/mediums):
- device_map pinned to the resolved device instead of "auto", so the
  model's first layer lives where inputs are moved (fixes multi-GPU
  "tensors on different devices" crash);
- left padding + pad token via configure_decoder_tokenizer, so
  logits[:, -1, :] reads the last real token, not a PAD position;
- FlashAttention-2 -> SDPA fallback instead of hardcoding flash_attn;
- compute dtype defaults to bf16/fp16 by capability instead of fp16;
- device accepts int or "auto" (no silent coercion of non-ints to GPU 0);
- dropped the ThreadPoolExecutor + model_lock that serialized anyway.

llm_loader now reuses the same helpers. Stub-based unit tests cover the
tokenizer/device/dtype/attention logic without the llm extra.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
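
Why left padding matters for `logits[:, -1, :]` can be shown without any model. The toy sketch below pads token-id lists on either side and reads the last position of each row; `PAD` and `pad_batch` are hypothetical names, not the project's `configure_decoder_tokenizer`.

```python
from typing import List

PAD = 0  # hypothetical pad token id


def pad_batch(sequences: List[List[int]], side: str) -> List[List[int]]:
    """Pad every sequence to the batch's max length on the given side."""
    width = max(len(seq) for seq in sequences)
    if side == "left":
        return [[PAD] * (width - len(seq)) + list(seq) for seq in sequences]
    return [list(seq) + [PAD] * (width - len(seq)) for seq in sequences]


batch = [[11, 12, 13], [21]]
# Equivalent of indexing position -1 for every row in the batch.
last_with_right_padding = [row[-1] for row in pad_batch(batch, "right")]
last_with_left_padding = [row[-1] for row in pad_batch(batch, "left")]
```

With right padding the shorter row ends in `PAD`, so a last-position read scores a padding slot; with left padding every row ends in its final real token.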
Re-grepped each target to confirm zero live references (only auto-generated
egg-info/SOURCES.txt and the now-removed test/whitelist referenced them):

- utils/evaluation.py: orphaned TREC eval, no caller, no entry point.
- models/embedding/query_embedder.py, sentence_embedder.py: vestigial
  TextEmbedder subclasses nothing constructs.
- preprocessing/regex/ tree: regex resource files with no Python consumer
  (leftovers from the deleted src/Matcher); also drop the package-data
  globs in pyproject and the "preprocessing" entry in the config_loader
  resource whitelist.
- matching/phenopacket_processor.py (+ its test): superseded by the
  canonical interop/importers/phenopacket.py + interop/narrative.py path.
  This deletion also resolves the dead ClinicalSummarizer, the ontology
  label bug, the always-true temporal guards, the truncate typo, and the
  duplicate phenopacket pipeline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Verified-safe in-file dead code (re-grepped each before removal):
- recognizers.with_schema_threshold (+ now-unused dataclasses.replace import)
- types.EntityAnnotation.to_index_entity
- utils/retry.py (with_retries) + its test — only the test used it; production
  retries go through tenacity
- interop EvidenceSpan model, Provenance.raw_text_span field, and the
  PatientProfile.all_facts helper — none populated, read, or referenced
  (Provenance keeps extra="allow", so dropping the field is safe)
- narrative.render_patient_narrative: dead style="audit" branch and the unused
  style parameter (only caller passed "rag")
- annotator.annotate_texts_in_parallel: dead retries/delay params (accepted
  then immediately del'd; no caller supplies them)
- criteria_retrieval.rerank_criteria: unused `queries` parameter (body keys off
  criterion["query"]); updated the call site

Legacy shim removed:
- CompatibilityEntityAnnotator was an empty SchemaEntityAnnotator subclass;
  build_entity_annotator now returns SchemaEntityAnnotator directly, and the
  export/test were repointed.

Deferred: max_text_score -> PR6 (folded into the create_query rework); dead
config settings (cot/LLM_reranker/TokenizerSettings/entity_extraction.threshold)
-> PR9, since they are entangled with config.json + the env-override map and
carry config-validation risk that does not belong in a deletion PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The join compared each child table's raw person_id against the patient's
*sanitized* profile id via astype(str). Two real failures:
  1. a null person_id in a child table promotes the column to float64, so
     person_id 1 serialized as "1.0"; because "1.0" != "1", the join dropped every
     condition/measurement/drug/procedure/observation/note for that patient;
  2. a person_id needing sanitization ("pat 01" -> "pat-01") never matched
     the raw child value.

Now join on the raw person_id via _normalize_join_id (handles float promotion
and string/int mismatch) and pre-group each child table once with
_group_by_person, so each patient is an O(1) lookup instead of a full astype
scan per table (removes the O(P*R) N+1). _concept_lookup also switches off
iterrows. Profile ids are still sanitized for display/provenance.

Two regression tests (null-person_id float promotion; sanitized id) fail on
the old code and pass now.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
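
A rough, dependency-free sketch of the join fix described above. The names echo `_normalize_join_id` and `_group_by_person` from the commit message, but the bodies are assumptions: the real helpers operate on pandas tables and may handle more cases.

```python
import math
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping, Optional


def normalize_join_id(value: Any) -> Optional[str]:
    """Map 1, 1.0 and "1" to the same key; treat None/NaN as missing."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None
        if value.is_integer():
            return str(int(value))  # undo float promotion: 1.0 -> "1"
    return str(value).strip()


def group_by_person(rows: Iterable[Mapping[str, Any]]) -> Dict[str, List[Mapping[str, Any]]]:
    """Group child rows once so each patient lookup is a dictionary access."""
    grouped: Dict[str, List[Mapping[str, Any]]] = defaultdict(list)
    for row in rows:
        key = normalize_join_id(row.get("person_id"))
        if key is not None:
            grouped[key].append(row)
    return dict(grouped)


conditions = [
    {"person_id": 1.0, "concept": "asthma"},
    {"person_id": float("nan"), "concept": "orphan row"},
    {"person_id": 2.0, "concept": "diabetes"},
]
by_person = group_by_person(conditions)
```

Here `by_person[normalize_join_id(1)]` finds the asthma row even though the child column was promoted to float by the null entry.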
1. prepare_trial_document now writes detailed_description and official_title.
   The backend weights both in TRIAL_TEXT_WEIGHTS (1.5 / 1.0) but they were
   never indexed, so the weights were dead and trials were under-indexed.
   (Requires rebuilding the LanceDB trial index to take effect.)

2. trial_ranker.load_trial_data only loads NCT-named files. It previously
   globbed every *.json in the output folder, scoring run sidecars
   (keywords/patient_profile/first_level_scores/rag_output) as bogus 0.0 trials.

3. _scan_rows fallback now applies the nct_id WHERE filter
   (table.search().where(...)). Previously, when FTS and vector search both
   returned nothing, it scanned an unfiltered head slice that could exclude the
   requested trials. Verified table.search().where() against lancedb 0.25.3.

4. create_query drops the 7 keys the backend never reads (age/sex/
   overall_status/pre_selected_nct_ids/vector_score_threshold/max_text_score/
   search_mode) and the misleading age=0-vs-None contract; search_trials already
   passes those filters to the backend directly. Removes the deferred
   max_text_score param.

Regression tests: scan-fallback nct filter, NCT-only loading (+ sidecars
skipped), minimal create_query contract, and prepared-doc field preservation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
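
Item 2 above (loading only NCT-named files) might look something like the sketch below. It assumes trial outputs are named `NCT` plus eight digits with a `.json` suffix; `list_trial_files` is a hypothetical name rather than the actual `load_trial_data` implementation.

```python
import re
from pathlib import Path
from typing import List

# Assumption: trial files are named like NCT01234567.json.
NCT_FILE = re.compile(r"^NCT\d{8}\.json$")


def list_trial_files(folder: Path) -> List[Path]:
    """Return only NCT-named JSON files, skipping run sidecars."""
    return sorted(p for p in folder.glob("*.json") if NCT_FILE.match(p.name))
```

Sidecars such as `keywords.json` or `rag_output.json` fail the name check and are never scored as trials.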
Behavior-preserving consolidation of copy-pasted logic:
- models/embedding.build_embedder(config): single embedder factory replacing
  the identical TextEmbedder(TextEmbedderConfig(...)) block in main.py and the
  index/build-concepts/update-registry CLIs.
- matching/retrieval/synonyms.disease_synonyms(): one disease-synonym extractor
  that ClinicalTrialSearch and SecondStageRetriever now delegate to.
- utils/text.flatten_text(): single whitespace-normalizing flatten used by both
  registry.preparation and search.lancedb_backend (the two _flatten_text copies).
  Behavior-preserving for preparation since _preprocess_text re-normalizes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
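
For reference, a whitespace-normalizing flatten of the kind mentioned above is usually a one-liner. This is a guess at the shape of `utils/text.flatten_text()`, assuming it takes a single string; the real helper may also accept other input types.

```python
def flatten_text(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return " ".join(text.split())


flattened = flatten_text("  Stage IV\n\tmelanoma   BRAF ")
```

Applying it twice gives the same result as applying it once, which is consistent with the commit's note that a later re-normalization step leaves behavior unchanged.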
The HF (eligibility_reasoning) and vLLM (eligibility_reasoning_vllm) CoT
processors duplicated the ~90-line prompt template, _load_trial_data,
_save_outputs, and the worklist/length-bucketing orchestration verbatim — the
largest duplication in the repo and a drift risk for the scoring contract.

New matching/eligibility_base.BaseTrialProcessor holds the shared prompt, trial
I/O, output persistence, and process_trials skeleton (parameterized by the
_token_length and _progress_desc hooks). Each backend now subclasses it and
implements only __init__ and _process_batch (plus vLLM's LoRA/token-length
helpers). Prompt output is byte-identical; behavior preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
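
The base-class extraction described above follows the template-method pattern. The sketch below is illustrative only: the class and hook names mirror the commit message, but the prompt text, the record fields (`nct_id`, `criteria`), the batch size, and `EchoProcessor` are invented stand-ins.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseTrialProcessor(ABC):
    """Shared skeleton; a backend supplies only the batch call."""

    def build_prompt(self, patient: str, trial: Dict[str, str]) -> str:
        # Placeholder prompt; the real shared template is far longer.
        return f"Patient: {patient}\nCriteria: {trial['criteria']}"

    def _token_length(self, prompt: str) -> int:
        # Hook: backends may override with their tokenizer's length.
        return len(prompt.split())

    @abstractmethod
    def _process_batch(self, prompts: List[str]) -> List[str]:
        """Run one batch of prompts through the backend."""

    def process_trials(
        self, patient: str, trials: List[Dict[str, str]], batch_size: int = 2
    ) -> Dict[str, str]:
        items = [(t["nct_id"], self.build_prompt(patient, t)) for t in trials]
        # Length bucketing: sort so each batch holds similar-length prompts.
        items.sort(key=lambda item: self._token_length(item[1]))
        results: Dict[str, str] = {}
        for i in range(0, len(items), batch_size):
            chunk = items[i : i + batch_size]
            outputs = self._process_batch([prompt for _, prompt in chunk])
            for (nct_id, _), output in zip(chunk, outputs):
                results[nct_id] = output
        return results


class EchoProcessor(BaseTrialProcessor):
    """Toy backend standing in for a real inference engine."""

    def _process_batch(self, prompts: List[str]) -> List[str]:
        return [f"{len(p.split())} tokens" for p in prompts]


outputs = EchoProcessor().process_trials(
    "p",
    [
        {"nct_id": "NCT00000001", "criteria": "a b c"},
        {"nct_id": "NCT00000002", "criteria": "a"},
    ],
)
```

Keeping the prompt and orchestration in one place means the two backends cannot drift apart on the scoring contract.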
…records

- main_pipeline only loads (and half()s) the HuggingFace CoT model when
  cot_backend != "vllm". Under the vLLM backend, run_rag_processing loads its own
  engine and ignores the HF model, so loading it wasted GPU memory and load time
  (and could OOM alongside the vLLM engine).
- _rank_trial_rows / _rank_criteria_rows skip rebuilding the trial/criteria
  record when the candidate row read from the index already has the derived
  fields (search_text); raw in-memory docs still get built. Avoids recomputing
  search_text/search_vector for every candidate on every query.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- logging_config: configure the root handler exactly once instead of building a
  new StreamHandler on every setup_logging() call (nearly all modules call it at
  import). Per-logger request-id ContextFilter behavior is unchanged.
- import_patient: build and pass an embedder to the entity annotator so concept
  linking can use semantic search instead of silently degrading to lexical-only
  (still degrades gracefully to no-entities when ML extras are absent).
- pyproject: drop 5 dependencies never imported in src (regex, python-dotenv,
  rich, bioregistry, rapidfuzz); regenerated uv.lock. numpy/pyarrow kept as real
  transitive runtime needs of pandas/lancedb.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
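
The "configure the root handler exactly once" change in the first bullet above can be sketched with a module-level guard. The flag name, format string, and signature are assumptions; the project's `setup_logging()` also attaches a request-id filter that is omitted here.

```python
import logging

_ROOT_CONFIGURED = False  # hypothetical module-level guard


def setup_logging(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a named logger, attaching the root handler only on first use."""
    global _ROOT_CONFIGURED
    if not _ROOT_CONFIGURED:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        root = logging.getLogger()
        root.addHandler(handler)
        root.setLevel(level)
        _ROOT_CONFIGURED = True
    return logging.getLogger(name)
```

Repeated calls from different modules then reuse the same handler, so each record is emitted once instead of once per import.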
Lets users train their own models and plug them straight into the pipeline,
instead of only relying on the vanilla checkpoints. The integration path already
existed via config (entity_extraction.model_name, model.reranker_adapter_path,
model.cot_adapter_path); this adds the missing training side, modernized for the
current architecture (GLiNER for NER, LoRA for reranker/CoT).

New trialmatchai.finetuning package (heavy deps imported lazily):
- config.FinetuneConfig: shared LoRA SFT hyper-parameters.
- data: JSONL loaders + prompt builders that REUSE the runtime prompts
  (LLMReranker.create_messages, chat templates) so train == inference; plus a
  char-span -> GLiNER token-span converter.
- _sft.run_sft: LoRA SFT loop with prompt-masked labels (loss on completion).
- cot/reranker/ner: thin task fine-tuners producing a LoRA adapter (cot,
  reranker) or a GLiNER checkpoint (ner).
- cli: `trialmatchai-finetune {cot,reranker,ner}` console command.

pyproject: new `finetune` optional extra (torch/transformers/peft/accelerate/
datasets/gliner[2]/bitsandbytes) + the console entry point. LLMReranker
.create_messages is now a @staticmethod so finetuning reuses it without loading
a model. CI imports the finetuning modules and smoke-tests the new CLI.
Tests cover data conversion, prompt reuse, and CLI parsing (CPU-only).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
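
"Prompt-masked labels (loss on completion)" in the `_sft.run_sft` bullet above usually means giving prompt positions the label -100, which PyTorch's cross-entropy loss ignores by default. A minimal sketch with a hypothetical helper name, independent of the actual training loop:

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_masked_labels(
    prompt_ids: Sequence[int], completion_ids: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and completion; mask prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


input_ids, labels = build_masked_labels([5, 6, 7], [8, 9])
```

Only the completion tokens contribute to the loss, so the adapter learns to produce the answer rather than to reproduce the prompt.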
- README: add an architecture/"how it works" overview, a "bring your own
  models" section (custom checkpoints/adapters via config), the new
  trialmatchai-finetune command, an extras table, and a navigable layout.
- docs/finetuning.md: data formats, commands, and plug-back-in steps for the
  NER (GLiNER), reranker (LoRA Yes/No), and CoT (LoRA SFT) fine-tuners.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The audit-driven refactor (PR0-PR9), fine-tuning integration, and README/docs
updates are complete and merged; the tracking plan is no longer needed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eted)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
It exercises heavy third-party deps (torch/gliner) on a Linux runner that we
don't gate releases on; keep it as a signal without failing the workflow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Research-driven corrections to the fine-tuning integration:

- NER now uses the real GLiNER2 stack (GLiNER2Trainer / TrainingConfig /
  InputExample with entities={label: [surface forms]}), not the gliner-v1
  token-classification API. entity_descriptions are back-filled from the entity
  schema so a fine-tuned model keeps the runtime label semantics. Saves a LoRA
  adapter (output_dir/final) or full checkpoint (output_dir/best); both load via
  entity_extraction.model_name. CLI exposes encoder-lr/task-lr/lora/schema-path.
- data: NER converter emits GLiNER2 surface-form schema (accepts {entities},
  char spans, or native {input,output}) instead of token spans.
- _sft: add prepare_model_for_kbit_training + gradient checkpointing + a paged
  8-bit optimizer + cosine schedule — required for stable, memory-efficient
  4-bit (QLoRA) training of the reranker and CoT adapters.
- docs/finetuning.md updated to the GLiNER2 data format and flags.

Sources: GLiNER2 training + LoRA tutorials (fastino-ai/GLiNER2); QLoRA/PEFT
fine-tuning best practices.
Per direction: every LLM runs on vLLM (no HuggingFace backend), and fine-tuned
LoRA adapters are served everywhere via vLLM's LoRARequest.

Removed the HuggingFace LLM path entirely:
- delete matching/eligibility_reasoning.py (HF CoT processor),
  models/llm/llm_loader.py, models/llm/_common.py, and tests/test_llm_common.py.
- run_rag_processing always builds a vLLM engine; main_pipeline no longer loads
  or half()s an HF model. Drop the cot_backend config switch (settings, config.json,
  env map) and make preflight's vLLM/GPU check unconditional under require_models.

Reranker now runs on vLLM:
- rewrite LLMReranker to score (patient, criterion) pairs by constraining
  generation to the Yes/No tokens (allowed_token_ids) and reading their logprobs;
  a configured reranker_adapter_path is served via LoRARequest. create_messages
  is a @staticmethod (reused by the fine-tuner without loading a model).
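
Once generation is constrained to the Yes/No tokens, the two returned logprobs can be turned into a relevance score. The sketch below shows one plausible normalization (a softmax over the pair); whether `LLMReranker` normalizes this way is an assumption, and the function name is hypothetical.

```python
import math


def yes_probability(logprob_yes: float, logprob_no: float) -> float:
    """Renormalize two token logprobs into P(Yes) over the {Yes, No} pair."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logprob_yes, logprob_no)
    exp_yes = math.exp(logprob_yes - peak)
    exp_no = math.exp(logprob_no - peak)
    return exp_yes / (exp_yes + exp_no)


score = yes_probability(math.log(0.6), math.log(0.2))
```

With raw probabilities of 0.6 for Yes and 0.2 for No, the renormalized score is 0.75, and equal logprobs give 0.5.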

Adapters end to end:
- CoT and reranker both load their LoRA adapters through vLLM (cot_adapter_path /
  reranker_adapter_path); NER LoRA loads via GLiNER2.load_adapter.
- new `trialmatchai-finetune merge` to merge a LoRA adapter into the base model
  for users who prefer a standalone checkpoint over base+adapter.

The embedder stays on transformers (it is an embedding model, not a generative
LLM). Docs/README updated; CI import smoke adjusted to the vLLM modules.
majdabd and others added 7 commits June 22, 2026 11:26
Reintegrates domain knowledge dropped in the migration, folded into the current
architecture rather than restored verbatim.

Criteria chunking (registry/criteria_chunking.py):
- replaces the generic line-splitter in normalization.split_eligibility_criteria
  with a single pass that understands multi-level enumeration hierarchies
  (1, 1.2, 1.2.3, (a), roman), varied inclusion/exclusion headers (incl. inline
  "Exclusion Criteria: 1. ... 2. ..."), parenthetical protection, decimal/
  abbreviation split-exceptions, and continuation-line joining. 8 new tests.
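
A rough sketch of two of the behaviors listed above: inline header detection and a numeric enumerator split that leaves decimals such as "1.5" intact. The names `HEADER`, `ENUM`, and `split_inline_criteria` are hypothetical, and the sketch deliberately omits the (a)/roman enumerators, parenthetical protection, and continuation-line joining that registry/criteria_chunking.py is described as handling.

```python
import re

# Section headers such as "Inclusion Criteria:" or "exclusion criteria:".
HEADER = re.compile(r"\b(inclusion|exclusion)\s+criteria\s*:", re.IGNORECASE)

# An enumerator like "1." or "1.2)" that starts a token and is followed by
# whitespace and then text. "1.5 mg" does not match: no whitespace after the dot.
ENUM = re.compile(r"(?<!\S)\d+(?:\.\d+)*[.)]\s+(?=[A-Za-z(])")


def split_inline_criteria(text):
    """Return (section, criterion) pairs from header-delimited, enumerated text."""
    results = []
    headers = list(HEADER.finditer(text))
    for i, header in enumerate(headers):
        section = header.group(1).lower()
        # A section body runs until the next header or the end of the text.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():end]
        for item in ENUM.split(body):
            item = item.strip()
            if item:
                results.append((section, item))
    return results
```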

Genetic-variant recognizer (entities/recognizers.py):
- restore the curated variant pattern table as entities/resources/
  variant_patterns.tsv (HGVS mutations, fusions, chromosome arms, ...).
- RegexVariantRecognizer matches these deterministically (e.g. p.V600E,
  c.1799T>A) and CompositeRecognizer runs it alongside the GLiNER model,
  merging by confidence/length so precise variant spans the model would miss
  are still captured. On by default (entity_extraction.variant_regex). 5 tests.
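
A small sketch of deterministic variant matching and confidence/length merging. The two patterns, the `Span` type, the fixed regex score of 1.0, and the function names are illustrative assumptions; they are not the contents of variant_patterns.tsv or the actual CompositeRecognizer logic.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    score: float


# Two illustrative HGVS-style patterns; the curated table covers far more.
VARIANT_PATTERNS = [
    re.compile(r"\bp\.[A-Z][0-9]+[A-Z]\b"),       # protein change, e.g. p.V600E
    re.compile(r"\bc\.[0-9]+[ACGT]>[ACGT]\b"),    # coding substitution, e.g. c.1799T>A
]


def regex_variants(text):
    """Return one Span per regex hit, scored 1.0 (an assumed convention)."""
    spans = []
    for pattern in VARIANT_PATTERNS:
        for m in pattern.finditer(text):
            spans.append(Span(m.start(), m.end(), "variant", 1.0))
    return spans


def merge_spans(spans):
    """Keep non-overlapping spans, preferring higher score, then longer length."""
    ranked = sorted(spans, key=lambda s: (-s.score, -(s.end - s.start), s.start))
    kept = []
    for span in ranked:
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)
```

Passing the regex spans together with the model's spans to `merge_spans` lets a precise regex hit displace a shorter, lower-confidence overlapping span.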

Packaging: variant_patterns.tsv added to package-data; both load correctly from
the built wheel.
Replace the arXiv reference with an explicit "please cite" message pointing to
the Nature Communications paper (with a BibTeX entry); keep the Zenodo DOI as the
software archive.
…ctions

Adds the multi-channel first-level query planner and the eligibility-constraint
extraction/evaluation subsystem, and corrects the bugs found in review:

Constraints
- exclusion criteria that the patient cleanly passes are now rewarded instead of
  scored neutral (relation "absent": exclusion -> not-violated, inclusion -> neutral)
- biomarker evaluation is wild-type-safe (a negative/wild-type patient no longer
  satisfies a "present"/"mutated" requirement)
- age regexes require an explicit year unit or age cue (so "100 to 200 mg" is no
  longer parsed as an age); lab values without a comparator are skipped rather
  than silently defaulting to ge (>=)
- unknown_is_neutral config is now honored (penalizes unconfirmable inclusions)
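
To illustrate the age-regex fix, here is a sketch in which a numeric range counts as an age only when it carries a year unit or follows an age cue. The patterns and the name `parse_age_range` are assumptions made for this example, not the project's actual expressions.

```python
import re

# A range followed by an explicit year unit: "18 to 65 years", "18-65 yrs".
UNIT_RANGE = re.compile(
    r"\b(\d{1,3})\s*(?:to|-)\s*(\d{1,3})\s*(?:years?|yrs?)\b", re.IGNORECASE
)

# A range preceded by an age cue: "Aged 18-75", "ages 18 to 65".
CUE_RANGE = re.compile(
    r"\bage[sd]?\b\D{0,15}?(\d{1,3})\s*(?:to|-)\s*(\d{1,3})\b", re.IGNORECASE
)


def parse_age_range(text):
    """Return (low, high) in years, or None when no unit or cue is present."""
    for pattern in (UNIT_RANGE, CUE_RANGE):
        m = pattern.search(text)
        if m:
            low, high = int(m.group(1)), int(m.group(2))
            if low <= high:
                return low, high
    return None
```

The word boundary before "age" keeps words such as "dosage" from acting as a cue.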

First-level planner
- hard_filters config is actually applied (and [] truly disables filters);
  age parsing degrades gracefully instead of aborting the search; a guard logs
  when all channels return nothing; disease_synonyms output is deterministic

Other
- restore the entity_extraction.variant_regex knob broken by extra="forbid"
- remove the vestigial GLiNER-v1 backend

Regression tests added for each fix.
Adds geographic location as an opt-in first-level hard filter, done site-aware
and recall-safe rather than as a blunt cut.

- interop: PatientProfile gains a Location (country/state/city); the FHIR
  importer populates it from Patient.address.
- matching/retrieval/location.py: country-level, match-ANY-site filtering. A
  trial is kept when the patient's country is unknown, when the trial has no
  indexed site countries, or when any of its sites is in the patient's country —
  so trials with unknown locations are never dropped.
- run_first_level_search applies it only when "location" is in
  search.first_level.hard_filters (default keeps age/sex/overall_status, so the
  behavior is unchanged unless opted in).
- settings: hard_filters Literal accepts "location".
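
The recall-safe rule above can be sketched as a single predicate. The function name `keep_trial` and the case-insensitive string comparison are assumptions; matching/retrieval/location.py may normalize countries differently.

```python
def keep_trial(patient_country, site_countries):
    """Country-level, match-any-site filter that never drops on missing data."""
    country = (patient_country or "").strip().lower()
    if not country:
        return True  # patient location unknown: keep the trial
    known = {c.strip().lower() for c in (site_countries or []) if c and c.strip()}
    if not known:
        return True  # trial has no indexed site countries: keep the trial
    return country in known  # keep when any site is in the patient's country
```

Only the last branch can exclude a trial, and it applies only when both sides have known countries.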

Tests cover recall-safety, the end-to-end hard-filter on/off behavior, and FHIR
address extraction. README updated.
The OMOP importer now resolves a patient's location via the LOCATION table
(PERSON.location_id -> LOCATION), filling country (country_source_value, or
country_concept_id via concept lookup), state, and city. NaN-safe field
extraction; degrades to no location when the table or link is absent.

This extends the optional country-level location hard filter to OMOP patients
(previously FHIR-only). Test added; README updated.
Addresses the correctness/coverage gaps found in adversarial review so the
importer is trustworthy on real Epic/Cerner R4 output, not just Synthea.

Correctness (incl. one patient-safety bug):
- honor clinicalStatus/verificationStatus/status: resolved/inactive/refuted
  conditions are now NEGATED (no longer matched as active); entered-in-error,
  cancelled, and not-done resources are dropped (recorded in `unsupported`).
  Completed/stopped medications stay un-negated (real prior exposure).
- CodeableConcept: extract ALL codings, ordering known vocabularies
  (SNOMED/LOINC/RxNorm/ICD) first, so concept linking uses the standard code
  instead of a proprietary one; secondary codings retained.
- parse_date handles partial FHIR dates (YYYY, YYYY-MM) -> age no longer lost.
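
As an illustration of the partial-date handling, the sketch below accepts YYYY, YYYY-MM, and full dates. Defaulting a missing month or day to 1 is an assumption of this example, and `parse_partial_date` and `age_in_years` are hypothetical names rather than the importer's own functions.

```python
from datetime import date


def parse_partial_date(value):
    """Parse YYYY, YYYY-MM, or YYYY-MM-DD (optionally followed by a time part)."""
    if not value:
        return None
    # Drop any time component, then split the date into its components.
    parts = value.strip().split("T", 1)[0].split("-")
    try:
        year = int(parts[0])
        month = int(parts[1]) if len(parts) > 1 else 1  # assumed default
        day = int(parts[2]) if len(parts) > 2 else 1    # assumed default
        return date(year, month, day)
    except ValueError:
        return None


def age_in_years(birth, today):
    """Whole years elapsed between birth and today."""
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - before_birthday
```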

Coverage:
- medications via medicationReference / contained Medication / R5 wrapper.
- Observation.component panels, interpretation flags, and value[x] variants
  (comparator, Range, Ratio, CodeableConcept, string/bool/int/datetime).
- onset/effective/performed Period + recordedDate temporality; cleaner note and
  dosageInstruction text; broader genomic-observation detection.

Robustness:
- NDJSON loader tolerates malformed lines (skips + logs) instead of aborting.
- reference resolution handles urn:uuid, absolute URLs, and contained refs;
  orphan resources attributed once, with a logged mapping-failure reason.
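
A tolerant NDJSON loader along the lines described above might look like the following. The generator name `iter_ndjson` and the log messages are assumptions; skipping non-object lines is an extra choice made for this sketch.

```python
import io
import json
import logging

logger = logging.getLogger(__name__)


def iter_ndjson(lines):
    """Yield one dict per valid NDJSON line, skipping and logging bad ones."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored silently
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            logger.warning("Skipping malformed NDJSON line %d: %s", number, exc)
            continue
        if isinstance(record, dict):
            yield record
        else:
            logger.warning("Skipping non-object NDJSON line %d", number)


# Usage: any iterable of lines works, including an open file handle.
sample = io.StringIO('{"resourceType": "Patient"}\n{broken\n{"resourceType": "Condition"}\n')
resource_types = [r["resourceType"] for r in iter_ndjson(sample)]
```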

9 regression tests added.
…g-model eligibility fixes

- Add CPU-capable Transformers reranker (TransformersReranker) and eligibility
  processor (BatchTrialProcessorTransformers) so the pipeline runs without vLLM
- Add no_think config option (rag.no_think) to disable <think> blocks on reasoning
  models like Qwen3; passes enable_thinking=False to apply_chat_template and falls
  back to /no_think prefix for plain-prompt paths
- Strip <think>...</think> blocks from model output before JSON extraction so
  eligibility responses from thinking models are parsed correctly
- Separate reranker_model_path from base_model so a lightweight model can handle
  second-level reranking while a larger model handles RAG eligibility reasoning
- Add hf/hashing backends to EmbedderSettings, RagSettings.enabled/backend,
  LLMRerankerSettings.enabled, and ConceptLinkerSettings to settings schema
- Finetuning: multi-GPU support, LoRA config improvements, data pipeline hardening,
  NER and reranker trainer updates, extended CLI flags
- Registry preparation: hardened FHIR/OMOP importer, improved field normalisation
- LanceDB backend: expose candidate_limit, add FTS fallback path
- Preflight: extended checks for concept index, embedding model, and registry
- Update default entity model to fastino/gliner2-base-v1 across config and settings
- Add tests for embedding, deployment readiness, finetuning, LanceDB backend,
  patient runtime loading, preflight, and registry updater
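
The think-block stripping can be sketched as below, assuming the model wraps its reasoning in literal <think>...</think> tags. `strip_think` and `extract_json` are hypothetical names, and dropping an unterminated block is an extra assumption of this sketch rather than documented behavior.

```python
import json
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
OPEN_THINK = re.compile(r"<think>.*\Z", re.DOTALL | re.IGNORECASE)


def strip_think(text):
    """Remove complete <think> blocks, then any unterminated trailing one."""
    cleaned = THINK_BLOCK.sub("", text)
    cleaned = OPEN_THINK.sub("", cleaned)  # assumption: truncated reasoning is dropped
    return cleaned.strip()


def extract_json(text):
    """Return the first JSON object found after stripping think blocks, else None."""
    cleaned = strip_think(text)
    decoder = json.JSONDecoder()
    for i, ch in enumerate(cleaned):
        if ch == "{":
            try:
                obj, _ = decoder.raw_decode(cleaned, i)
                return obj
            except json.JSONDecodeError:
                continue  # not valid JSON from this brace; try the next one
    return None
```

Stripping first matters because braces inside the reasoning text could otherwise be mistaken for the start of the answer.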

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>