Add RULER Long Context Evaluation Support #33

Open
luomajouni wants to merge 35 commits into master from feature/lm-eval-ruler

Conversation

@luomajouni

Add RULER Long Context Evaluation Support

Summary

Integrates RULER (Rule-based Long-context Understanding Evaluation) benchmark from lm-evaluation-harness into the evaluation framework. RULER evaluates language models on their ability to retrieve and use information from long contexts (4K to 128K tokens).

Key Features

Granular Task Control

  • Support for all 13 RULER subtasks:
    • NIAH (Needle in a Haystack): niah_single_1-3, niah_multikey_1-3, niah_multivalue, niah_multiquery
    • Variable Tracking: ruler_vt
    • Word Extraction: ruler_cwe, ruler_fwe
    • Question Answering: ruler_qa_hotpot, ruler_qa_squad
  • Support for 6 sequence lengths: 4096, 8192, 16384, 32768, 65536, 131072
  • Run individual subtask-length combinations or batches via helper scripts
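
As a quick sanity check on the 78-task matrix, the combinations described above can be enumerated in a short Python sketch (the subtask names are taken from the list above; the actual registration code in evals/evals.py may name things differently):

```python
# Sketch only: enumerate the 13 RULER subtasks x 6 sequence lengths
# described in this PR. Names come from the PR text, not from the
# actual evals.py registration code.
SUBTASKS = [
    "niah_single_1", "niah_single_2", "niah_single_3",
    "niah_multikey_1", "niah_multikey_2", "niah_multikey_3",
    "niah_multivalue", "niah_multiquery",
    "ruler_vt", "ruler_cwe", "ruler_fwe",
    "ruler_qa_hotpot", "ruler_qa_squad",
]
SEQUENCE_LENGTHS = [4096, 8192, 16384, 32768, 65536, 131072]

combinations = [(task, seqlen) for task in SUBTASKS for seqlen in SEQUENCE_LENGTHS]
print(len(combinations))  # 13 subtasks x 6 lengths = 78
```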

Helper Scripts

  • ruler.sh - Run RULER with HuggingFace backend
  • ruler-vllm.sh - Run RULER with vLLM backend (faster for large models)
  • summary_ruler.sh - Dedicated summary script with sequence length comparison tables

Documentation

  • RULER_README.md - Complete guide with examples
  • RULER_QUICK_REFERENCE.md - Quick command reference

Changes

New Files

  • ruler.sh - RULER evaluation launcher (HF backend)
  • ruler-vllm.sh - RULER evaluation launcher (vLLM backend)
  • summary_ruler.sh - RULER-specific results summarization
  • RULER_README.md, RULER_QUICK_REFERENCE.md - Documentation

Modified Files

evals/evals.py

  • Added RulerConfig class with support for subtask and sequence length parameters
  • Registered 78 RULER task combinations (13 subtasks × 6 lengths)
  • Custom result parsing for RULER's unique metric format (<seqlen>,none)
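
For illustration, the "<seqlen>,none" handling might look like the following sketch; the helper name and the fallback key order are assumptions, not the actual evals.py code:

```python
# Hypothetical helper: read a RULER metric whose key embeds the sequence
# length (e.g. "4096,none"), falling back to generic lm-eval keys.
def parse_ruler_score(task_results, sequence_length):
    for key in (f"{sequence_length},none", "acc,none", "exact_match,none"):
        if key in task_results:
            return float(task_results[key])
    return None

print(parse_ruler_score({"4096,none": 0.95}, 4096))  # 0.95
```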

evals/harnesses.py

  • Added metadata parameter to LMEvalHarness for passing sequence length to tasks
  • Conditional MAX_SEQ_LENGTH setting (only for RULER tasks)
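
A minimal sketch of what that conditional wiring could look like, assuming the harness exports MAX_SEQ_LENGTH only for RULER-style tasks (the class shape and metadata keys are illustrative, not the actual harnesses.py API):

```python
import os

class LMEvalHarnessSketch:
    """Illustrative stand-in for LMEvalHarness with a metadata parameter."""

    def __init__(self, task, metadata=None):
        self.task = task
        self.metadata = metadata or {}

    def build_env(self):
        env = dict(os.environ)
        # Only RULER tasks need the sequence length exported
        if self.task.startswith(("ruler", "niah")) and "max_seq_length" in self.metadata:
            env["MAX_SEQ_LENGTH"] = str(self.metadata["max_seq_length"])
        return env

h = LMEvalHarnessSketch("niah_single_1", metadata={"max_seq_length": 4096})
print(h.build_env()["MAX_SEQ_LENGTH"])  # 4096
```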

templates/lm_eval_harness.sh

  • Local model path detection and container path conversion
  • Automatic RULER dependency installation (wonderwords, nltk)
  • Forced max_model_len/max_length matching for RULER sequence lengths
  • Improved output path handling for container environments
  • RULER metadata flag generation with proper JSON escaping
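
The metadata flag generation could be sketched as follows; the MAX_SEQ_LENGTH and TASK_METADATA_FLAG names mirror this PR, but the exact JSON shape lm-evaluation-harness expects is an assumption here:

```shell
# Sketch: build a JSON-escaped --metadata flag for a RULER run.
MAX_SEQ_LENGTH="${MAX_SEQ_LENGTH:-4096}"
TASK_METADATA_FLAG="--metadata {\"max_seq_lengths\":[${MAX_SEQ_LENGTH}]}"
echo "$TASK_METADATA_FLAG"
```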

summary.sh

  • Added vLLM filename prefix support (vllm_*.json)
  • Removed RULER-specific logic (delegated to summary_ruler.sh)
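
The vllm_ prefix handling can be illustrated with a small sketch (the directory layout and filenames are assumptions for illustration, not the actual summary.sh code):

```shell
# Sketch: accept both <task>.json and vllm_<task>.json result files
# and recover the task name either way.
RESULTS_DIR="${1:-output/v2/Llama-3.1-8B}"
for f in "$RESULTS_DIR"/*.json; do
    [ -e "$f" ] || continue        # skip when the glob matches nothing
    base=$(basename "$f" .json)
    task="${base#vllm_}"           # strip the optional vllm_ prefix
    echo "$task"
done
```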

Usage Examples

Run single RULER task:

sh ruler-vllm.sh meta-llama/Llama-3.1-8B \
  --subtasks "niah_single_1" \
  --sequence-lengths "4096"

Run multiple subtasks at one sequence length:

sh ruler-vllm.sh meta-llama/Llama-3.1-8B \
  --subtasks "niah_single_1,niah_single_2,niah_single_3" \
  --sequence-lengths "32768"

Run full RULER suite:

sh ruler-vllm.sh meta-llama/Llama-3.1-8B \
  --subtasks "all" \
  --sequence-lengths "4096,8192,16384,32768,65536,131072"

Summarize results:

sh summary_ruler.sh output/v2/Llama-3.1-8B

Bug Fixes

  • Fixed path conversion for local models in container environments
  • Fixed unbound variable errors with proper initialization
  • Fixed vLLM result file detection in summary script
  • Fixed RULER metric parsing for sequence-length-specific keys

Testing

Tested on LUMI HPC with:

  • Local models (converted checkpoints)
  • Both HF and vLLM backends
  • Multiple sequence lengths (4K and 8K)
  • All 13 RULER subtasks

Breaking Changes

None. All changes are additive and backward compatible with existing evaluation tasks.


Related Links:


Copilot AI left a comment


Pull request overview

This PR integrates the RULER (Rule-based Long-context Understanding Evaluation) benchmark from lm-evaluation-harness to evaluate language models on long context understanding (4K-128K tokens) across 13 distinct subtasks and 6 sequence lengths.

Key Changes:

  • Added comprehensive RULER task support with 78 pre-registered task combinations (13 subtasks × 6 lengths)
  • Implemented dedicated launcher scripts (ruler.sh, ruler-vllm.sh) and summarization tool (summary_ruler.sh)
  • Enhanced template to handle local model paths, RULER dependencies, and sequence-length-specific configurations

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
evals/evals.py Added RulerConfig class and registered 78 RULER task combinations with custom result parsing
evals/harnesses.py Added metadata parameter to LMEvalHarness for passing sequence length to RULER tasks
templates/lm_eval_harness.sh Enhanced with local model detection, RULER dependency installation, and forced max_model_len/max_length settings
summary_ruler.sh New script for RULER-specific result summarization with sequence length comparison tables
summary.sh Added vLLM filename prefix support for existing benchmarks
ruler.sh New launcher script for RULER with HuggingFace backend
ruler-vllm.sh New launcher script for RULER with vLLM backend
test_ruler_files.sh Diagnostic script for troubleshooting RULER result files
cpt.sh, cpt-vllm.sh Changed python to python3 for consistency
RULER_README.md Complete documentation with usage examples
RULER_QUICK_REFERENCE.md Quick command reference guide
README.md Updated with RULER quick start section


Comment thread templates/lm_eval_harness.sh Outdated
Comment on lines 394 to 395
MODEL_BACKEND="hf-auto"
MODEL_ARGS="pretrained=${MODEL_LOCAL:-${MODEL_ID}},device_map=auto,dtype=bfloat16,trust_remote_code=True,attn_implementation=sdpa"

Copilot AI Nov 24, 2025


Duplicate MODEL_BACKEND and MODEL_ARGS assignment. Lines 394-395 appear after the conditional RULER configuration (lines 368-393), causing the earlier RULER-specific settings to be overwritten. Remove these duplicate lines to preserve the RULER configuration.

Suggested change
MODEL_BACKEND="hf-auto"
MODEL_ARGS="pretrained=${MODEL_LOCAL:-${MODEL_ID}},device_map=auto,dtype=bfloat16,trust_remote_code=True,attn_implementation=sdpa"

Comment thread evals/evals.py
Comment on lines +188 to +210
def get_results_custom(self, json_data):
    # Get score for this specific task or average across all subtasks
    if "results" not in json_data:
        return 0.0

    # Collect relevant task results
    ruler_scores = []
    for task_name, task_results in json_data["results"].items():
        # Match tasks that contain our sequence length
        if f"{self.sequence_length}" in task_name or task_name.startswith("ruler"):
            # Try different result keys that might be present
            if "acc,none" in task_results:
                ruler_scores.append(task_results["acc,none"])
            elif "acc" in task_results:
                ruler_scores.append(task_results["acc"])
            elif "exact_match,none" in task_results:
                ruler_scores.append(task_results["exact_match,none"])
            elif "exact_match" in task_results:
                ruler_scores.append(task_results["exact_match"])

    if ruler_scores:
        return sum(ruler_scores) / len(ruler_scores)
    return 0.0

Copilot AI Nov 24, 2025


The result parsing logic doesn't check for RULER's unique metric format ',none' which is mentioned in the PR description. According to the PR metadata, RULER results use sequence length in the key name (e.g., '4096,none'). The current logic may incorrectly match tasks or miss the correct metric key. Add explicit checking for '{self.sequence_length},none' format before falling back to generic keys.

Comment thread evals/evals.py
Comment on lines +196 to +197
# Match tasks that contain our sequence length
if f"{self.sequence_length}" in task_name or task_name.startswith("ruler"):

Copilot AI Nov 24, 2025


The string matching logic is too broad and could lead to incorrect matches. For example, sequence length '4096' would match task names containing '40961' or '4096' in any position. Use more precise matching such as checking if the task name equals the expected subtask or contains the sequence length as a discrete component.

Suggested change
        # Match tasks that contain our sequence length
        if f"{self.sequence_length}" in task_name or task_name.startswith("ruler"):
        # Match tasks for the correct subtask and/or sequence length
        if self.subtask:
            # Only match the exact subtask name
            match = (task_name == self.subtask)
        else:
            # Match task names that are exactly 'ruler_{sequence_length}'
            # or end with '_{sequence_length}'
            match = (
                task_name == f"ruler_{self.sequence_length}"
                or task_name.endswith(f"_{self.sequence_length}")
            )
        if match:
Duplicate MODEL_BACKEND and MODEL_ARGS assignment. Lines 359-360 appear after the conditional RULER configuration (lines 329-358), causing the earlier RULER-specific settings to be overwritten.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

@jonabur left a comment


Thanks for getting this in! I have a few requests for changes but it looks pretty close!

Comment thread RULER_README.md Outdated
# output/v2/<model-name>/ruler_<subtask>_<sequence_length>.json
```

## Interpreting Results
Contributor


I don't think this is really a reliable way to interpret, can you remove this part?

Author


Sure, taking those out. I'm also cleaning some other unnecessary content from the READMEs.

Comment thread RULER_README.md Outdated
1. Reduce batch size: `--batch_size 1`
2. Use vLLM backend with lower utilization: `--model_args "gpu_memory_utilization=0.85"`
3. Increase GPU allocation: `--gres gpu:mi250:8`
4. For very long sequences, consider gradient checkpointing if available
Contributor


Is this supported?

Comment thread RULER_README.md Outdated
3. Increase GPU allocation: `--gres gpu:mi250:8`
4. For very long sequences, consider gradient checkpointing if available

### RULER Tasks Not Found
Contributor


probably unnecessary

Author


Agree

Comment thread RULER_README.md
# 2. Monitor to ensure it works
python watch.py --once

# 3. Once confirmed, run all evaluations (78 jobs)
Contributor


If you run this after step 1, won't the result be cached with the limit-10 output, so it won't rerun with the full data? You probably need to clean up the result first, or do you store results differently depending on the limit?

Comment thread evals/evals.py
for task_name, task_results in json_data["results"].items():
# Match tasks that contain our sequence length
if f"{self.sequence_length}" in task_name or task_name.startswith("ruler"):
# Try different result keys that might be present
Contributor


Do you really need all of these?

Comment thread ruler.sh
Contributor


Is it worth adding a script / operational convenience mode to do only the selected subtasks that maria thought were most informative?

Author


Perhaps, or just write the exact command in the instructions. It would anyway be a short script that runs sh ruler-vllm.sh <model> --subtasks "niah_multikey_2,niah_multikey_3,niah_multivalue,ruler_qa_squad,ruler_qa_hotpot,ruler_cwe" --sequence-lengths "4096,8192,..." with the model and sequence lengths as arguments.

Comment thread summary_ruler.sh Outdated

# RULER results use sequence length in the key name: "<seqlen>,none"
# Try multiple possible key formats
local result=$(cat "$file" | jq -r "
Contributor


does this really use all these different versions in the various subtasks?

Comment thread templates/lm_eval_harness.sh Outdated

MODEL_SAFE="${MODEL_ID//\//-}"
MODEL_LOCAL="/project/hf_cache/models/${MODEL_SAFE}"
PREFETCH_LOCAL_DIR="/project/hf_cache/models/${MODEL_SAFE}"
Contributor


why is this needed?

Author


The current code was not working correctly with local model files (e.g. converted Megatron checkpoints) that are not in the HF cache.

Contributor


@dzautner you are more familiar with this file than I am at this point PTAL at these changes.

Comment thread test_ruler_files.sh
Contributor


What is this for?

By the way, are you aware of watch.py's --hist mode? It might give you some of what you're trying to do here.

Luoma and others added 8 commits November 24, 2025 16:09
- Switch to new LUMI AI Factory container (lumi-multitorch) with vLLM 0.12
- Update cache paths from /project/hf_cache to /project/cache/huggingface
- Simplify model path remapping logic
- Add TRANSFORMERS_VERSION='4.57.1' for container compatibility
- Preserve RULER-specific features:
  - MAX_SEQ_LENGTH environment variable handling
  - RULER dependencies installation (wonderwords, nltk)
  - TASK_METADATA_FLAG for RULER tasks
  - max_model_len override logic for vLLM backend
- Improve output directory validation in main.py
…-vllm.sh

Enables running RULER benchmark on SFT/instruct models by forwarding
these flags to main.py. Supports both CLI flags and env vars (same
pattern as cpt-vllm.sh).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add '--' separator before the task name so argparse doesn't greedily
consume it as the value for --fewshot_as_multiturn (which uses nargs='?').

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>