Add RULER Long Context Evaluation Support #33

Open
luomajouni wants to merge 35 commits into master from feature/lm-eval-ruler

Conversation

@luomajouni

Add RULER Long Context Evaluation Support

Summary

Integrates RULER (Rule-based Long-context Understanding Evaluation) benchmark from lm-evaluation-harness into the evaluation framework. RULER evaluates language models on their ability to retrieve and use information from long contexts (4K to 128K tokens).

Key Features

Granular Task Control

  • Support for all 13 RULER subtasks:
    • NIAH (Needle in a Haystack): niah_single_1-3, niah_multikey_1-3, niah_multivalue, niah_multiquery
    • Variable Tracking: ruler_vt
    • Word Extraction: ruler_cwe, ruler_fwe
    • Question Answering: ruler_qa_hotpot, ruler_qa_squad
  • Support for 6 sequence lengths: 4096, 8192, 16384, 32768, 65536, 131072
  • Run individual subtask-length combinations or batches via helper scripts
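
As a quick sanity check on the 78-task matrix, the combinations described above can be enumerated in a short Python sketch (the subtask names are taken from the list above; the actual registration code in evals/evals.py may name things differently):

```python
# Sketch only: enumerate the 13 RULER subtasks x 6 sequence lengths
# described in this PR. Names come from the PR text, not from the
# actual evals.py registration code.
SUBTASKS = [
    "niah_single_1", "niah_single_2", "niah_single_3",
    "niah_multikey_1", "niah_multikey_2", "niah_multikey_3",
    "niah_multivalue", "niah_multiquery",
    "ruler_vt", "ruler_cwe", "ruler_fwe",
    "ruler_qa_hotpot", "ruler_qa_squad",
]
SEQUENCE_LENGTHS = [4096, 8192, 16384, 32768, 65536, 131072]

combinations = [(task, seqlen) for task in SUBTASKS for seqlen in SEQUENCE_LENGTHS]
print(len(combinations))  # 13 subtasks x 6 lengths = 78
```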

Helper Scripts

  • ruler.sh - Run RULER with HuggingFace backend
  • ruler-vllm.sh - Run RULER with vLLM backend (faster for large models)
  • summary_ruler.sh - Dedicated summary script with sequence length comparison tables

Documentation

  • RULER_README.md - Complete guide with examples
  • RULER_QUICK_REFERENCE.md - Quick command reference

Changes

New Files

  • ruler.sh - RULER evaluation launcher (HF backend)
  • ruler-vllm.sh - RULER evaluation launcher (vLLM backend)
  • summary_ruler.sh - RULER-specific results summarization
  • RULER_README.md, RULER_QUICK_REFERENCE.md - Documentation

Modified Files

evals/evals.py

  • Added RulerConfig class with support for subtask and sequence length parameters
  • Registered 78 RULER task combinations (13 subtasks × 6 lengths)
  • Custom result parsing for RULER's unique metric format (<seqlen>,none)
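
For illustration, the "<seqlen>,none" handling might look like the following sketch; the helper name and the fallback key order are assumptions, not the actual evals.py code:

```python
# Hypothetical helper: read a RULER metric whose key embeds the sequence
# length (e.g. "4096,none"), falling back to generic lm-eval keys.
def parse_ruler_score(task_results, sequence_length):
    for key in (f"{sequence_length},none", "acc,none", "exact_match,none"):
        if key in task_results:
            return float(task_results[key])
    return None

print(parse_ruler_score({"4096,none": 0.95}, 4096))  # 0.95
```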

evals/harnesses.py

  • Added metadata parameter to LMEvalHarness for passing sequence length to tasks
  • Conditional MAX_SEQ_LENGTH setting (only for RULER tasks)
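
A minimal sketch of what that conditional wiring could look like, assuming the harness exports MAX_SEQ_LENGTH only for RULER-style tasks (the class shape and metadata keys are illustrative, not the actual harnesses.py API):

```python
import os

class LMEvalHarnessSketch:
    """Illustrative stand-in for LMEvalHarness with a metadata parameter."""

    def __init__(self, task, metadata=None):
        self.task = task
        self.metadata = metadata or {}

    def build_env(self):
        env = dict(os.environ)
        # Only RULER tasks need the sequence length exported
        if self.task.startswith(("ruler", "niah")) and "max_seq_length" in self.metadata:
            env["MAX_SEQ_LENGTH"] = str(self.metadata["max_seq_length"])
        return env

h = LMEvalHarnessSketch("niah_single_1", metadata={"max_seq_length": 4096})
print(h.build_env()["MAX_SEQ_LENGTH"])  # 4096
```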

templates/lm_eval_harness.sh

  • Local model path detection and container path conversion
  • Automatic RULER dependency installation (wonderwords, nltk)
  • Forced max_model_len/max_length matching for RULER sequence lengths
  • Improved output path handling for container environments
  • RULER metadata flag generation with proper JSON escaping
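
The metadata flag generation could be sketched as follows; the MAX_SEQ_LENGTH and TASK_METADATA_FLAG names mirror this PR, but the exact JSON shape lm-evaluation-harness expects is an assumption here:

```shell
# Sketch: build a JSON-escaped --metadata flag for a RULER run.
MAX_SEQ_LENGTH="${MAX_SEQ_LENGTH:-4096}"
TASK_METADATA_FLAG="--metadata {\"max_seq_lengths\":[${MAX_SEQ_LENGTH}]}"
echo "$TASK_METADATA_FLAG"
```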

summary.sh

  • Added vLLM filename prefix support (vllm_*.json)
  • Removed RULER-specific logic (delegated to summary_ruler.sh)
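
The vllm_ prefix handling can be illustrated with a small sketch (the directory layout and filenames are assumptions for illustration, not the actual summary.sh code):

```shell
# Sketch: accept both <task>.json and vllm_<task>.json result files
# and recover the task name either way.
RESULTS_DIR="${1:-output/v2/Llama-3.1-8B}"
for f in "$RESULTS_DIR"/*.json; do
    [ -e "$f" ] || continue        # skip when the glob matches nothing
    base=$(basename "$f" .json)
    task="${base#vllm_}"           # strip the optional vllm_ prefix
    echo "$task"
done
```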

Usage Examples

Run single RULER task:

sh ruler-vllm.sh meta-llama/Llama-3.1-8B \
  --subtasks "niah_single_1" \
  --sequence-lengths "4096"

Run multiple subtasks at one sequence length:

sh ruler-vllm.sh meta-llama/Llama-3.1-8B \
  --subtasks "niah_single_1,niah_single_2,niah_single_3" \
  --sequence-lengths "32768"

Run full RULER suite:

sh ruler-vllm.sh meta-llama/Llama-3.1-8B \
  --subtasks "all" \
  --sequence-lengths "4096,8192,16384,32768,65536,131072"

Summarize results:

sh summary_ruler.sh output/v2/Llama-3.1-8B

Bug Fixes

  • Fixed path conversion for local models in container environments
  • Fixed unbound variable errors with proper initialization
  • Fixed vLLM result file detection in summary script
  • Fixed RULER metric parsing for sequence-length-specific keys

Testing

Tested on LUMI HPC with:

  • Local models (converted checkpoints)
  • Both HF and vLLM backends
  • Multiple sequence lengths (4K and 8K)
  • All 13 RULER subtasks

Breaking Changes

None. All changes are additive and backward compatible with existing evaluation tasks.


Related Links:


Copilot AI left a comment


Pull request overview

This PR integrates the RULER (Rule-based Long-context Understanding Evaluation) benchmark from lm-evaluation-harness to evaluate language models on long context understanding (4K-128K tokens) across 13 distinct subtasks and 6 sequence lengths.

Key Changes:

  • Added comprehensive RULER task support with 78 pre-registered task combinations (13 subtasks × 6 lengths)
  • Implemented dedicated launcher scripts (ruler.sh, ruler-vllm.sh) and summarization tool (summary_ruler.sh)
  • Enhanced template to handle local model paths, RULER dependencies, and sequence-length-specific configurations

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
evals/evals.py Added RulerConfig class and registered 78 RULER task combinations with custom result parsing
evals/harnesses.py Added metadata parameter to LMEvalHarness for passing sequence length to RULER tasks
templates/lm_eval_harness.sh Enhanced with local model detection, RULER dependency installation, and forced max_model_len/max_length settings
summary_ruler.sh New script for RULER-specific result summarization with sequence length comparison tables
summary.sh Added vLLM filename prefix support for existing benchmarks
ruler.sh New launcher script for RULER with HuggingFace backend
ruler-vllm.sh New launcher script for RULER with vLLM backend
test_ruler_files.sh Diagnostic script for troubleshooting RULER result files
cpt.sh, cpt-vllm.sh Changed python to python3 for consistency
RULER_README.md Complete documentation with usage examples
RULER_QUICK_REFERENCE.md Quick command reference guide
README.md Updated with RULER quick start section


Comment thread templates/lm_eval_harness.sh Outdated
Comment on lines 394 to 395
MODEL_BACKEND="hf-auto"
MODEL_ARGS="pretrained=${MODEL_LOCAL:-${MODEL_ID}},device_map=auto,dtype=bfloat16,trust_remote_code=True,attn_implementation=sdpa"

Copilot AI Nov 24, 2025


Duplicate MODEL_BACKEND and MODEL_ARGS assignment. Lines 394-395 appear after the conditional RULER configuration (lines 368-393), causing the earlier RULER-specific settings to be overwritten. Remove these duplicate lines to preserve the RULER configuration.

Suggested change
MODEL_BACKEND="hf-auto"
MODEL_ARGS="pretrained=${MODEL_LOCAL:-${MODEL_ID}},device_map=auto,dtype=bfloat16,trust_remote_code=True,attn_implementation=sdpa"

Comment thread evals/evals.py
Comment on lines +188 to +210
def get_results_custom(self, json_data):
    # Get score for this specific task or average across all subtasks
    if "results" not in json_data:
        return 0.0

    # Collect relevant task results
    ruler_scores = []
    for task_name, task_results in json_data["results"].items():
        # Match tasks that contain our sequence length
        if f"{self.sequence_length}" in task_name or task_name.startswith("ruler"):
            # Try different result keys that might be present
            if "acc,none" in task_results:
                ruler_scores.append(task_results["acc,none"])
            elif "acc" in task_results:
                ruler_scores.append(task_results["acc"])
            elif "exact_match,none" in task_results:
                ruler_scores.append(task_results["exact_match,none"])
            elif "exact_match" in task_results:
                ruler_scores.append(task_results["exact_match"])

    if ruler_scores:
        return sum(ruler_scores) / len(ruler_scores)
    return 0.0

Copilot AI Nov 24, 2025


The result parsing logic doesn't check for RULER's unique metric format ',none' which is mentioned in the PR description. According to the PR metadata, RULER results use sequence length in the key name (e.g., '4096,none'). The current logic may incorrectly match tasks or miss the correct metric key. Add explicit checking for '{self.sequence_length},none' format before falling back to generic keys.

Comment thread evals/evals.py
Comment on lines +196 to +197
# Match tasks that contain our sequence length
if f"{self.sequence_length}" in task_name or task_name.startswith("ruler"):

Copilot AI Nov 24, 2025


The string matching logic is too broad and could lead to incorrect matches. For example, sequence length '4096' would match task names containing '40961' or '4096' in any position. Use more precise matching such as checking if the task name equals the expected subtask or contains the sequence length as a discrete component.

Suggested change
        # Match tasks that contain our sequence length
        if f"{self.sequence_length}" in task_name or task_name.startswith("ruler"):
        # Match tasks for the correct subtask and/or sequence length
        if self.subtask:
            # Only match the exact subtask name
            match = (task_name == self.subtask)
        else:
            # Match task names that are exactly 'ruler_{sequence_length}'
            # or end with '_{sequence_length}'
            match = (
                task_name == f"ruler_{self.sequence_length}"
                or task_name.endswith(f"_{self.sequence_length}")
            )
        if match:
Duplicate MODEL_BACKEND and MODEL_ARGS assignment. Lines 359-360 appear after the conditional RULER configuration (lines 329-358), causing the earlier RULER-specific settings to be overwritten.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

@jonabur left a comment


Thanks for getting this in! I have a few requests for changes but it looks pretty close!

Comment thread RULER_README.md Outdated
# output/v2/<model-name>/ruler_<subtask>_<sequence_length>.json
```

## Interpreting Results
Contributor


I don't think this is really a reliable way to interpret, can you remove this part?

Author


Sure, taking those out. I'm also cleaning some other unnecessary content from the READMEs.

Comment thread RULER_README.md Outdated
1. Reduce batch size: `--batch_size 1`
2. Use vLLM backend with lower utilization: `--model_args "gpu_memory_utilization=0.85"`
3. Increase GPU allocation: `--gres gpu:mi250:8`
4. For very long sequences, consider gradient checkpointing if available
Contributor


Is this supported?

Comment thread RULER_README.md Outdated
3. Increase GPU allocation: `--gres gpu:mi250:8`
4. For very long sequences, consider gradient checkpointing if available

### RULER Tasks Not Found
Contributor


probably unnecessary

Author


Agree

Comment thread RULER_README.md
# 2. Monitor to ensure it works
python watch.py --once

# 3. Once confirmed, run all evaluations (78 jobs)
Contributor


If you run this after step 1, won't the result be cached with the limit-10 output, so it won't rerun with the full data? You probably need to clean up the result first, or do you store results differently depending on the limit?

Comment thread evals/evals.py
for task_name, task_results in json_data["results"].items():
# Match tasks that contain our sequence length
if f"{self.sequence_length}" in task_name or task_name.startswith("ruler"):
# Try different result keys that might be present
Contributor


Do you really need all of these?

Comment thread ruler.sh
Contributor


Is it worth adding a script / operational convenience mode to do only the selected subtasks that maria thought were most informative?

Author


Perhaps, or just write the exact command in the instructions. It would anyway be a short script that runs sh ruler-vllm.sh <model> --subtasks "niah_multikey_2,niah_multikey_3,niah_multivalue,ruler_qa_squad,ruler_qa_hotpot,ruler_cwe" --sequence-lengths "4096,8192,..." with the model and sequence lengths as arguments.

Comment thread summary_ruler.sh Outdated

# RULER results use sequence length in the key name: "<seqlen>,none"
# Try multiple possible key formats
local result=$(cat "$file" | jq -r "
Contributor


does this really use all these different versions in the various subtasks?

Comment thread templates/lm_eval_harness.sh Outdated

MODEL_SAFE="${MODEL_ID//\//-}"
MODEL_LOCAL="/project/hf_cache/models/${MODEL_SAFE}"
PREFETCH_LOCAL_DIR="/project/hf_cache/models/${MODEL_SAFE}"
Contributor


why is this needed?

Author


The current code was not working correctly with local model files (e.g. converted Megatron checkpoints) that are not in the HF cache.

Contributor


@dzautner you are more familiar with this file than I am at this point PTAL at these changes.

Comment thread test_ruler_files.sh
Contributor


What is this for?

By the way, are you aware of watch.py's --hist mode? It might give you some of what you're trying to do here.

Luoma and others added 8 commits November 24, 2025 16:09
- Switch to new LUMI AI Factory container (lumi-multitorch) with vLLM 0.12
- Update cache paths from /project/hf_cache to /project/cache/huggingface
- Simplify model path remapping logic
- Add TRANSFORMERS_VERSION='4.57.1' for container compatibility
- Preserve RULER-specific features:
  - MAX_SEQ_LENGTH environment variable handling
  - RULER dependencies installation (wonderwords, nltk)
  - TASK_METADATA_FLAG for RULER tasks
  - max_model_len override logic for vLLM backend
- Improve output directory validation in main.py
…-vllm.sh

Enables running RULER benchmark on SFT/instruct models by forwarding
these flags to main.py. Supports both CLI flags and env vars (same
pattern as cpt-vllm.sh).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add '--' separator before the task name so argparse doesn't greedily
consume it as the value for --fewshot_as_multiturn (which uses nargs='?').

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>