macrodata-labs/refiner

Macrodata Refiner

Refiner is an open-source engine for turning raw, unstructured, and multimodal data into high-quality datasets for large model training.

It replaces the brittle scripts and stitched-together data tooling that teams still use for training data work, while offering much better support for multimodal data, robotics workflows, and model-based processing.

It also plugs into the Macrodata platform, which gives you visibility into what is happening to your data while pipelines run: job and shard lifecycle, logs, metrics, manifests, and pipeline behavior. The same code can run locally for development and then scale out through Macrodata's elastic serverless cloud.

Quickstart

Install:

pip install macrodata-refiner

Create a Macrodata API key:

Log in:

macrodata login

Cloud example

Launch a robotics pipeline on Macrodata Cloud.

This requires a valid API key.

import refiner as mdr

(
    mdr.read_lerobot("hf://datasets/macrodata/aloha_static_battery_ep005_009")
    .map(
        mdr.robotics.motion_trim(
            threshold=0.001,
            pad_frames=5,
        )
    )
    .write_lerobot("hf://buckets/macrodata/test_bucket/aloha_motion")
    .launch_cloud(
        name="motion_trim",
        num_workers=4,
    )
)

Need cloud GPUs? See Launchers for the GPU-specific cloud options.
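
To make the `threshold` and `pad_frames` parameters in the cloud example more concrete, here is a rough standalone sketch of what a motion trim could do: drop the static frames at the start and end of an episode and keep a few frames of padding around the motion. This is an illustration only, not Refiner's implementation. The function name `motion_trim_indices`, the per-frame joint-position input, and the max-absolute-delta motion test are all assumptions made for this example.

```python
from typing import Sequence


def motion_trim_indices(
    positions: Sequence[Sequence[float]],
    threshold: float = 0.001,
    pad_frames: int = 5,
) -> range:
    """Return the range of frame indices to keep (hypothetical helper).

    A frame counts as "moving" when any joint changed by more than
    `threshold` relative to the previous frame. The kept range spans the
    first to the last moving frame, widened by `pad_frames` on each side
    and clamped to the episode bounds.
    """
    moving = []
    for i in range(1, len(positions)):
        # Largest per-joint change between consecutive frames.
        delta = max(
            (abs(a - b) for a, b in zip(positions[i], positions[i - 1])),
            default=0.0,
        )
        if delta > threshold:
            moving.append(i)

    if not moving:
        # No motion detected: nothing to keep under this definition.
        return range(0, 0)

    start = max(moving[0] - pad_frames, 0)
    stop = min(moving[-1] + pad_frames + 1, len(positions))
    return range(start, stop)


# Ten static frames, three frames of motion, ten static frames.
frames = [[0.0]] * 10 + [[0.1], [0.2], [0.3]] + [[0.3]] * 10
kept = motion_trim_indices(frames, threshold=0.001, pad_frames=2)
print(kept.start, kept.stop)
```

A larger `pad_frames` keeps more context around the motion; a larger `threshold` treats small jitter as stillness.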

Local example

Launch a local pipeline:

import refiner as mdr

def add_preview(row):
    return row.update(
        preview=" ".join(row["text"].split()[:20]),
    )

(
    mdr.read_jsonl("input/*.jsonl")
    .filter(mdr.col("lang") == "en")
    .with_columns(
        text=mdr.col("text").str.strip(),
        text_len=mdr.col("text").str.len(),
    )
    .map(add_preview)
    .write_parquet("s3://my-bucket/english-cleanup/")
    .launch_local(
        name="english-cleanup",
        num_workers=2,
    )
)
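
The `preview` expression inside `add_preview` can be tried without Refiner. The sketch below applies the same string logic to plain Python strings; the helper name `preview` and its `max_words` parameter are made up for this example and are not part of the Refiner API.

```python
def preview(text: str, max_words: int = 20) -> str:
    """Join the first `max_words` whitespace-separated tokens with single spaces."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading/trailing whitespace, so the result is normalized.
    return " ".join(text.split()[:max_words])


sample = "  Refiner   turns raw\ndata into datasets  "
print(preview(sample, max_words=3))
```

Because `split()` collapses runs of whitespace, the preview is also a whitespace-normalized version of the text, not just a truncated one.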

pip install gives you:

  • the Python package as refiner
  • the CLI as macrodata

Batteries included

  • training-data-first pipeline primitives instead of generic ETL abstractions
  • multimodal processing, with robotics support today
  • many built-in readers, transforms, sinks, and lifecycle/runtime machinery, so you do not have to rebuild the same scaffolding in scripts
  • access to any storage backend supported by fsspec (S3, GCS, Hugging Face, etc.)
  • local execution for development and elastic cloud execution for large runs
  • built-in observability through the Macrodata platform, so you can inspect how your data is changing instead of debugging blindly after the fact

Docs

Getting started:

Core concepts:

Modalities and platform:

Community

About

Refiner by Macrodata Labs, a data processing framework for large-scale machine learning datasets
