Add RULER Long Context Evaluation Support #33
Conversation
Pull request overview
This PR integrates the RULER (Rule-based Long-context Understanding Evaluation) benchmark from lm-evaluation-harness to evaluate language models on long context understanding (4K-128K tokens) across 13 distinct subtasks and 6 sequence lengths.
Key Changes:
- Added comprehensive RULER task support with 78 pre-registered task combinations (13 subtasks × 6 lengths; see the sketch below)
- Implemented dedicated launcher scripts (`ruler.sh`, `ruler-vllm.sh`) and a summarization tool (`summary_ruler.sh`)
- Enhanced the template to handle local model paths, RULER dependencies, and sequence-length-specific configurations
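As a rough illustration of that registration grid, a minimal sketch (the `RulerConfig` stub mirrors the class added in `evals/evals.py`, but its constructor, the registry dict, and the exact sequence-length grid are assumptions; only 4096 and 8192 are confirmed in the discussion below):

```python
from dataclasses import dataclass

@dataclass
class RulerConfig:
    # Stand-in for the class added in evals/evals.py; real fields may differ.
    subtask: str
    sequence_length: int

# The 13 RULER subtasks named in this PR.
RULER_SUBTASKS = [
    "niah_single_1", "niah_single_2", "niah_single_3",
    "niah_multikey_1", "niah_multikey_2", "niah_multikey_3",
    "niah_multivalue", "niah_multiquery",
    "ruler_vt", "ruler_cwe", "ruler_fwe",
    "ruler_qa_hotpot", "ruler_qa_squad",
]
# Assumed 4K-128K grid of 6 lengths.
SEQUENCE_LENGTHS = [4096, 8192, 16384, 32768, 65536, 131072]

TASKS = {
    f"ruler_{subtask}_{seq_len}": RulerConfig(subtask, seq_len)
    for subtask in RULER_SUBTASKS
    for seq_len in SEQUENCE_LENGTHS
}
assert len(TASKS) == 78  # 13 subtasks x 6 lengths
```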
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `evals/evals.py` | Added `RulerConfig` class and registered 78 RULER task combinations with custom result parsing |
| `evals/harnesses.py` | Added `metadata` parameter to `LMEvalHarness` for passing sequence length to RULER tasks |
| `templates/lm_eval_harness.sh` | Enhanced with local model detection, RULER dependency installation, and forced `max_model_len`/`max_length` settings |
| `summary_ruler.sh` | New script for RULER-specific result summarization with sequence length comparison tables |
| `summary.sh` | Added vLLM filename prefix support for existing benchmarks |
| `ruler.sh` | New launcher script for RULER with HuggingFace backend |
| `ruler-vllm.sh` | New launcher script for RULER with vLLM backend |
| `test_ruler_files.sh` | Diagnostic script for troubleshooting RULER result files |
| `cpt.sh`, `cpt-vllm.sh` | Changed `python` to `python3` for consistency |
| `RULER_README.md` | Complete documentation with usage examples |
| `RULER_QUICK_REFERENCE.md` | Quick command reference guide |
| `README.md` | Updated with RULER quick start section |
```
MODEL_BACKEND="hf-auto"
MODEL_ARGS="pretrained=${MODEL_LOCAL:-${MODEL_ID}},device_map=auto,dtype=bfloat16,trust_remote_code=True,attn_implementation=sdpa"
```

Duplicate MODEL_BACKEND and MODEL_ARGS assignment. Lines 394-395 appear after the conditional RULER configuration (lines 368-393), causing the earlier RULER-specific settings to be overwritten. Remove these duplicate lines to preserve the RULER configuration.
```
def get_results_custom(self, json_data):
    # Get score for this specific task or average across all subtasks
    if "results" not in json_data:
        return 0.0

    # Collect relevant task results
    ruler_scores = []
    for task_name, task_results in json_data["results"].items():
        # Match tasks that contain our sequence length
        if f"{self.sequence_length}" in task_name or task_name.startswith("ruler"):
            # Try different result keys that might be present
            if "acc,none" in task_results:
                ruler_scores.append(task_results["acc,none"])
            elif "acc" in task_results:
                ruler_scores.append(task_results["acc"])
            elif "exact_match,none" in task_results:
                ruler_scores.append(task_results["exact_match,none"])
            elif "exact_match" in task_results:
                ruler_scores.append(task_results["exact_match"])

    if ruler_scores:
        return sum(ruler_scores) / len(ruler_scores)
    return 0.0
```
The result parsing logic doesn't check for RULER's unique metric format ',none' which is mentioned in the PR description. According to the PR metadata, RULER results use sequence length in the key name (e.g., '4096,none'). The current logic may incorrectly match tasks or miss the correct metric key. Add explicit checking for '{self.sequence_length},none' format before falling back to generic keys.
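A minimal sketch of that suggestion, written as a standalone helper for clarity (the helper name is hypothetical; the key order follows the comment above):

```python
def extract_ruler_score(task_results: dict, sequence_length: int):
    """Prefer RULER's '<seqlen>,none' key, then fall back to generic keys.

    Hypothetical helper illustrating the review suggestion; not the PR's code.
    """
    candidate_keys = (
        f"{sequence_length},none",  # e.g. "4096,none", RULER's metric format
        "acc,none",
        "acc",
        "exact_match,none",
        "exact_match",
    )
    for key in candidate_keys:
        if key in task_results:
            return task_results[key]
    return None

# Example: the sequence-length key wins over the generic fallbacks.
print(extract_ruler_score({"4096,none": 0.91, "acc,none": 0.5}, 4096))  # 0.91
```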
```
# Match tasks that contain our sequence length
if f"{self.sequence_length}" in task_name or task_name.startswith("ruler"):
```

The string matching logic is too broad and could lead to incorrect matches. For example, sequence length '4096' would match task names containing '40961' or '4096' in any position. Use more precise matching such as checking if the task name equals the expected subtask or contains the sequence length as a discrete component.

Suggested change:

```
# Match tasks for the correct subtask and/or sequence length
if self.subtask:
    # Only match the exact subtask name
    match = (task_name == self.subtask)
else:
    # Match task names that are exactly 'ruler_{sequence_length}'
    # or end with '_{sequence_length}'
    match = (
        task_name == f"ruler_{self.sequence_length}"
        or task_name.endswith(f"_{self.sequence_length}")
    )
if match:
```
Duplicate MODEL_BACKEND and MODEL_ARGS assignment. Lines 359-360 appear after the conditional RULER configuration (lines 329-358), causing the earlier RULER-specific settings to be overwritten.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
jonabur left a comment
Thanks for getting this in! I have a few requests for changes but it looks pretty close!
```
# output/v2/<model-name>/ruler_<subtask>_<sequence_length>.json
```

## Interpreting Results
I don't think this is really a reliable way to interpret, can you remove this part?
Sure, taking those away. Cleaning also some other unnecessary stuff from the READMEs.
1. Reduce batch size: `--batch_size 1`
2. Use vLLM backend with lower utilization: `--model_args "gpu_memory_utilization=0.85"`
3. Increase GPU allocation: `--gres gpu:mi250:8`
4. For very long sequences, consider gradient checkpointing if available

### RULER Tasks Not Found
```
# 2. Monitor to ensure it works
python watch.py --once

# 3. Once confirmed, run all evaluations (78 jobs)
```
If you run this after step 1, won't the result be cached with the limit-10 output, so it won't rerun with the full data? You probably need to clean up the result -- or do you store results differently depending on the limit?
```
for task_name, task_results in json_data["results"].items():
    # Match tasks that contain our sequence length
    if f"{self.sequence_length}" in task_name or task_name.startswith("ruler"):
        # Try different result keys that might be present
```
Do you really need all of these?
Is it worth adding a script / operational convenience mode to run only the selected subtasks that Maria thought were most informative?
Perhaps, or just write the exact command in the instructions to run those. It would anyway be a short script to run `sh ruler-vllm.sh <model> --subtasks "niah_multikey_2, niah_multikey_3, niah_multivalue, ruler_qa_squad, ruler_qa_hotpot, ruler_cwe" --sequence-lengths "4096, 8192 ...."` with model and sequence lengths as arguments.
```
# RULER results use sequence length in the key name: "<seqlen>,none"
# Try multiple possible key formats
local result=$(cat "$file" | jq -r "
```
does this really use all these different versions in the various subtasks?
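One way to check is to dump the metric keys that actually appear in the result files; a small, hypothetical diagnostic in the spirit of `test_ruler_files.sh`:

```python
import json
import sys
from pathlib import Path

# Usage (hypothetical): python dump_keys.py output/v2/<model>/ruler_*.json
# Prints every metric key per task, so it is easy to see which of the
# candidate formats ("<seqlen>,none", "acc,none", ...) really occur.
for arg in sys.argv[1:]:
    data = json.loads(Path(arg).read_text())
    for task_name, task_results in data.get("results", {}).items():
        print(f"{arg}\t{task_name}\t{sorted(task_results)}")
```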
```
MODEL_SAFE="${MODEL_ID//\//-}"
MODEL_LOCAL="/project/hf_cache/models/${MODEL_SAFE}"
PREFETCH_LOCAL_DIR="/project/hf_cache/models/${MODEL_SAFE}"
```
The current code was not working correctly with local files (e.g. converted Megatron checkpoints) that are not in the HF cache.
@dzautner you are more familiar with this file than I am at this point, PTAL at these changes.
What is this for?
Btw, are you aware of the `watch.py --hist` mode? It might give you some of what you're trying to do here.
- Switch to new LUMI AI Factory container (lumi-multitorch) with vLLM 0.12
- Update cache paths from /project/hf_cache to /project/cache/huggingface
- Simplify model path remapping logic
- Add TRANSFORMERS_VERSION='4.57.1' for container compatibility
- Preserve RULER-specific features:
  - MAX_SEQ_LENGTH environment variable handling
  - RULER dependencies installation (wonderwords, nltk)
  - TASK_METADATA_FLAG for RULER tasks
  - max_model_len override logic for vLLM backend
- Improve output directory validation in main.py
…-vllm.sh
Enables running the RULER benchmark on SFT/instruct models by forwarding these flags to main.py. Supports both CLI flags and env vars (same pattern as cpt-vllm.sh).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add '--' separator before the task name so argparse doesn't greedily consume it as the value for --fewshot_as_multiturn (which uses nargs='?').
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
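For context, a self-contained demonstration of that argparse behaviour (the `const=True` declaration is an assumption about how the flag is defined):

```python
import argparse

parser = argparse.ArgumentParser()
# nargs="?" lets the flag take an optional value; a bare flag yields const.
parser.add_argument("--fewshot_as_multiturn", nargs="?", const=True, default=False)
parser.add_argument("task")

# Without "--", argparse consumes the task name as the flag's value and then
# fails with "the following arguments are required: task".
# With "--", option parsing stops and the task is read as the positional:
args = parser.parse_args(["--fewshot_as_multiturn", "--", "ruler_niah_single_1"])
print(args.fewshot_as_multiturn, args.task)  # True ruler_niah_single_1
```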
Add RULER Long Context Evaluation Support
Summary
Integrates the RULER (Rule-based Long-context Understanding Evaluation) benchmark from `lm-evaluation-harness` into the evaluation framework. RULER evaluates language models on their ability to retrieve and use information from long contexts (4K to 128K tokens).

Key Features
Granular Task Control
- `niah_single_1-3`, `niah_multikey_1-3`, `niah_multivalue`, `niah_multiquery`
- `ruler_vt`
- `ruler_cwe`, `ruler_fwe`
- `ruler_qa_hotpot`, `ruler_qa_squad`

Helper Scripts
- `ruler.sh` - Run RULER with HuggingFace backend
- `ruler-vllm.sh` - Run RULER with vLLM backend (faster for large models)
- `summary_ruler.sh` - Dedicated summary script with sequence length comparison tables

Documentation
- `RULER_README.md` - Complete guide with examples
- `RULER_QUICK_REFERENCE.md` - Quick command reference

Changes

New Files
- `ruler.sh` - RULER evaluation launcher (HF backend)
- `ruler-vllm.sh` - RULER evaluation launcher (vLLM backend)
- `summary_ruler.sh` - RULER-specific results summarization
- `RULER_README.md`, `RULER_QUICK_REFERENCE.md` - Documentation

Modified Files
- `evals/evals.py`: `RulerConfig` class with support for subtask and sequence length parameters; custom result parsing for RULER's metric key format (`<seqlen>,none`)
- `evals/harnesses.py`: `metadata` parameter to `LMEvalHarness` for passing sequence length to tasks (see the sketch below); `MAX_SEQ_LENGTH` setting (only for RULER tasks)
- `templates/lm_eval_harness.sh`: RULER dependency installation (`wonderwords`, `nltk`); `max_model_len`/`max_length` matching for RULER sequence lengths
- `summary.sh`: vLLM filename prefix support (`vllm_*.json`); RULER-specific summarization lives in a separate script (`summary_ruler.sh`)
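A rough sketch of how the `metadata` parameter could carry the sequence length into the harness (the `LMEvalHarness` stub and the `max_seq_lengths` key are assumptions based on this PR and on how lm-evaluation-harness's RULER tasks are configured through task metadata):

```python
import json
from dataclasses import dataclass

@dataclass
class LMEvalHarness:
    # Stub mirroring the new 'metadata' parameter; the real class lives in
    # evals/harnesses.py and takes more arguments.
    tasks: list
    metadata: str = ""

def ruler_harness(subtask: str, sequence_length: int) -> LMEvalHarness:
    # "max_seq_lengths" is assumed here as the metadata key RULER reads.
    meta = json.dumps({"max_seq_lengths": [sequence_length]})
    return LMEvalHarness(tasks=[subtask], metadata=meta)

print(ruler_harness("niah_single_1", 4096))
```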
Usage Examples
See `RULER_QUICK_REFERENCE.md` for the exact commands:
- Run a single RULER task
- Run multiple subtasks at one sequence length
- Run the full RULER suite
- Summarize results
Bug Fixes
Testing
Tested on LUMI HPC with:
Breaking Changes
None. All changes are additive and backward compatible with existing evaluation tasks.
Related Links: