Macrodata Refiner

Refiner is an open-source engine for turning raw, unstructured, and multimodal data into high-quality datasets for large model training.

It replaces the brittle scripts and stitched-together data tooling that teams still use for training data work, while offering much better support for multimodal data, robotics workflows, and model-based processing.

It also plugs into the Macrodata platform, which gives you visibility into what is happening to your data while pipelines run: job and shard lifecycle, logs, metrics, manifests, and pipeline behavior. The same code can run locally for development and then scale out through Macrodata's elastic serverless cloud.

Quickstart

Install:

pip install macrodata-refiner

Create a Macrodata API key:

https://macrodata.co/settings/api-keys

Log in:

macrodata login

Cloud example

Launch a robotics pipeline on Macrodata Cloud.

This requires a valid API key.

import refiner as mdr

(
    mdr.read_lerobot("hf://datasets/macrodata/aloha_static_battery_ep005_009")
    .map(
        mdr.robotics.motion_trim(
            threshold=0.001,
            pad_frames=5,
        )
    )
    .write_lerobot("hf://buckets/macrodata/test_bucket/aloha_motion")
    .launch_cloud(
        name="motion_trim",
        num_workers=4,
    )
)

Need cloud GPUs? See Launchers for the GPU-specific cloud options.
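
To give a feel for what the `threshold` and `pad_frames` parameters in the cloud example control, here is a minimal pure-Python sketch of motion-based trimming in general. It is not Refiner's implementation. The helper name `motion_trim_sketch`, the choice of "largest per-joint change between consecutive frames" as the motion measure, and the padding behavior are all assumptions made for illustration.

```python
from typing import List, Sequence


def motion_trim_sketch(
    states: Sequence[Sequence[float]],
    threshold: float = 0.001,
    pad_frames: int = 5,
) -> List[Sequence[float]]:
    """Drop idle frames at the start and end of an episode (hypothetical helper)."""
    # Motion per step: largest absolute change of any value between two
    # consecutive frames. ASSUMPTION: this metric is illustrative only.
    deltas = [
        max(abs(b - a) for a, b in zip(prev, cur))
        for prev, cur in zip(states, states[1:])
    ]
    moving = [i for i, d in enumerate(deltas) if d > threshold]
    if not moving:
        # No step exceeds the threshold: leave the episode unchanged.
        return list(states)
    # deltas[i] describes the step from frame i to frame i + 1, so the last
    # moving step ends at frame moving[-1] + 1. Keep pad_frames on each side.
    start = max(moving[0] - pad_frames, 0)
    end = min(moving[-1] + 1 + pad_frames, len(states) - 1)
    return list(states[start : end + 1])


# Usage: 10 idle frames, 5 frames of motion, 10 idle frames.
episode = [[0.0]] * 10 + [[0.1 * i] for i in range(1, 6)] + [[0.5]] * 10
trimmed = motion_trim_sketch(episode, threshold=0.001, pad_frames=2)
```

In this sketch a smaller `threshold` treats more frames as motion, and a larger `pad_frames` keeps more idle context around the moving segment.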

Local example

Launch a local pipeline:

import refiner as mdr

def add_preview(row):
    return row.update(
        preview=" ".join(row["text"].split()[:20]),
    )

(
    mdr.read_jsonl("input/*.jsonl")
    .filter(mdr.col("lang") == "en")
    .with_columns(
        text=mdr.col("text").str.strip(),
        text_len=mdr.col("text").str.len(),
    )
    .map(add_preview)
    .write_parquet("s3://my-bucket/english-cleanup/")
    .launch_local(
        name="english-cleanup",
        num_workers=2,
    )
)
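
For readers new to the pipeline style, the row-level effect of the local example can be approximated in plain Python. This is a rough sketch for intuition, not how Refiner executes the pipeline. The function name `english_cleanup_sketch` is hypothetical. It assumes that both `with_columns` expressions read the incoming `text` value (so `text_len` counts the unstripped text) and that `str.len()` counts characters. Neither assumption is confirmed by the text above.

```python
from typing import Dict, Iterable, Iterator


def english_cleanup_sketch(
    rows: Iterable[Dict[str, str]],
) -> Iterator[Dict[str, object]]:
    """Plain-Python approximation of the local example (hypothetical helper)."""
    for row in rows:
        # .filter(mdr.col("lang") == "en")
        if row.get("lang") != "en":
            continue
        original = row["text"]
        out: Dict[str, object] = dict(row)
        # .with_columns(text=..., text_len=...)
        out["text"] = original.strip()
        # ASSUMPTION: length is taken from the incoming, unstripped text.
        out["text_len"] = len(original)
        # .map(add_preview): the first 20 whitespace-separated tokens.
        out["preview"] = " ".join(original.split()[:20])
        yield out


# Usage with two in-memory rows instead of JSONL files.
rows = [
    {"lang": "en", "text": "  hello   world  "},
    {"lang": "fr", "text": "bonjour"},
]
cleaned = list(english_cleanup_sketch(rows))
```

The real pipeline additionally handles reading the input files, sharding across workers, and writing Parquet output, none of which is modeled here.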

pip install gives you:

the Python package as refiner
the CLI as macrodata

Batteries included

training-data-first pipeline primitives instead of generic ETL abstractions
multimodal processing, with robotics support today
many built-in readers, transforms, sinks, and lifecycle/runtime machinery, so you do not have to rebuild the same scaffolding in scripts
access to any storage backend supported by fsspec (S3, GCS, Hugging Face, etc.)
local execution for development and elastic cloud execution for large runs
built-in observability through the Macrodata platform, so you can inspect how your data is changing instead of debugging blindly after the fact

Docs

Getting started:

Core concepts:

Modalities and platform:

Community

join the Macrodata Discord: https://discord.gg/S8kZtmBR2x
