Add RLDS writer by guipenedo · Pull Request #149 · macrodata-labs/refiner

Guilherme Penedo (guipenedo) · 2026-05-21T23:37:25Z

Adds a standalone RLDS-style TFRecord writer for robotics episode rows.

What changed:
- add RefinerPipeline.write_rlds(...)
- add RldsSink that writes one TFRecord Example per RoboticsRow episode
- add optional tensorflow extra
- add dependency-free feature mapping test plus TensorFlow integration test when TensorFlow is installed

Local verification:
- uv run ruff format --check src/refiner/pipeline/sinks/rlds.py src/refiner/pipeline/sinks/__init__.py src/refiner/pipeline/pipeline.py tests/pipeline/test_rlds_sink.py
- uv run ruff check src/refiner/pipeline/sinks/rlds.py src/refiner/pipeline/sinks/__init__.py src/refiner/pipeline/pipeline.py tests/pipeline/test_rlds_sink.py
- uv run ty check src/refiner/pipeline/sinks/rlds.py src/refiner/pipeline/sinks/__init__.py src/refiner/pipeline/pipeline.py tests/pipeline/test_rlds_sink.py
- uv run pytest tests/pipeline/test_rlds_sink.py tests/robotics/test_robotics_row.py -q -> 24 passed, 1 skipped; skipped test requires TensorFlow

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: efc08fad18

chatgpt-codex-connector · 2026-05-21T23:40:30Z

+        writer = tf.io.TFRecordWriter(
+            self.output.abs_path(self._relpath(shard_id)),
+            options=options,


Use DataFolder I/O path for TFRecord writes

RldsSink opens TFRecord files with tf.io.TFRecordWriter(self.output.abs_path(...)) instead of going through DataFolder.open(...). This bypasses the filesystem abstraction used by other sinks (e.g., JSONL/Parquet), so write_rlds will fail or ignore configured filesystem backends/credentials for non-local outputs (such as s3://... or custom fs), and it also skips DataFolder's auto-mkdir behavior for missing parent directories. In practice, pipelines that successfully write other formats to remote folders can break when switched to write_rlds.
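
One way to route TFRecord bytes through a filesystem abstraction is to frame the records by hand and write them to whatever binary file object the abstraction returns. The sketch below is a minimal, uncompressed illustration using only the standard library. The helper name `write_tfrecord` and the use of `io.BytesIO` as a stand-in for a handle from `DataFolder.open(...)` are assumptions, not code from this PR, and the compression `options` passed in the diff are not handled.

```python
import io
import struct
from typing import BinaryIO, Iterable


def _crc32c_table() -> list[int]:
    # Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _crc32c_table()


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    # TFRecord stores a rotated and offset CRC rather than the raw value.
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_tfrecord(fh: BinaryIO, payloads: Iterable[bytes]) -> int:
    """Write uncompressed TFRecord frames to any binary file object."""
    count = 0
    for payload in payloads:
        header = struct.pack("<Q", len(payload))  # uint64 payload length
        fh.write(header)
        fh.write(struct.pack("<I", masked_crc32c(header)))
        fh.write(payload)
        fh.write(struct.pack("<I", masked_crc32c(payload)))
        count += 1
    return count


# io.BytesIO stands in for a handle from the filesystem abstraction.
buffer = io.BytesIO()
written = write_tfrecord(buffer, [b"episode-0", b"episode-1"])
```

In a real sink each payload would be a serialized `tf.train.Example`. The trade-off is that the sink then owns the framing and any compression, rather than delegating them to `tf.io.TFRecordWriter`.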


gemini-code-assist

Code Review

This pull request introduces the RldsSink and a corresponding write_rlds pipeline method to enable exporting robotics data into RLDS-style TFRecord formats. The implementation includes logic for mapping RoboticsRow attributes to TensorFlow features and adds tensorflow as an optional dependency. Review feedback highlights several areas for improvement: expanding the feature set to include all observations rather than a hardcoded subset, optimizing performance by avoiding expensive tolist() calls on large arrays, and improving the efficiency of PyArrow to NumPy conversions.

gemini-code-assist · 2026-05-21T23:42:23Z

+        _add_array_features(tf, features, "steps/action", row.actions)
+        _add_array_features(tf, features, "steps/observation/state", row.states)
+        _add_array_features(tf, features, "steps/timestamp", row.timestamps)


The current implementation only writes a hardcoded subset of robotics features (action, state, timestamp). It misses all other observations and videos that might be present in the RoboticsRow. To support full RLDS datasets, you should iterate over row.observations() and include all available features in the steps/observation/ namespace.
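
A rough sketch of that iteration follows. It assumes `row.observations()` yields a mapping from observation name to array-like values; the PR does not show its return shape, so the plain dict and the helper name `observation_feature_names` are hypothetical.

```python
from typing import Any, Mapping


def observation_feature_names(
    observations: Mapping[str, Any],
    prefix: str = "steps/observation/",
) -> dict[str, Any]:
    """Map each present observation to a namespaced feature key."""
    features: dict[str, Any] = {}
    for name, values in observations.items():
        if values is None:
            continue  # Skip missing arrays, as _add_array_features does.
        features[prefix + name] = values
    return features


# Hypothetical episode with one missing observation.
example = observation_feature_names(
    {"state": [[0.0, 1.0]], "gripper": [[0.5]], "wrist_image": None}
)
```

Video or image observations would likely need a bytes feature rather than a numeric one, which this sketch does not cover.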

gemini-code-assist · 2026-05-21T23:42:24Z

+    if values is None:
+        return
+    array = _array(values)
+    features[name] = _float_feature(tf, array.reshape(-1).astype(float).tolist())


This line has several performance and flexibility issues:

- tolist() on a large numpy array is slow and memory-intensive. tf.train.FloatList accepts any iterable, so you can pass the numpy array directly.
- Forcing all array features to float prevents the use of integer-based observations or actions (e.g., discrete actions or pixel-based states).
- astype(float) defaults to float64, while FloatList stores float32. It's better to cast to np.float32 explicitly if floating point is intended.

Suggested change

-    features[name] = _float_feature(tf, array.reshape(-1).astype(float).tolist())
+    features[name] = _float_feature(tf, array.reshape(-1).astype(np.float32)) if array.dtype.kind == 'f' else _int_feature(tf, array.reshape(-1).astype(np.int64).tolist())
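
To make the float32 point concrete, here is a small standard-library sketch of what happens when an integer value is pushed through 4-byte float storage, plus a dispatch on the dtype kind character that `array.dtype.kind` would supply. Both helper names are hypothetical and not part of the PR.

```python
import struct


def as_float32(value: float) -> float:
    """Round-trip a Python float through 4-byte IEEE 754 storage."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def feature_list_for_kind(kind: str) -> str:
    """Pick a tf.train.Feature field from a NumPy dtype.kind character."""
    if kind == "f":
        return "float_list"
    if kind in ("i", "u", "b"):
        # Note: uint64 values above the int64 maximum would still overflow.
        return "int64_list"
    raise TypeError(f"unsupported dtype kind: {kind!r}")


# 2**24 + 1 is exact as int64 and float64, but not representable in float32.
discrete_action = 2**24 + 1
stored = as_float32(float(discrete_action))
```

Here `stored` no longer equals `discrete_action`, which is why integer-valued features are safer in an `int64_list`.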

gemini-code-assist · 2026-05-21T23:42:24Z

+def _array(values: Any) -> np.ndarray:
+    if isinstance(values, pa.ChunkedArray | pa.Array):
+        return np.asarray(values.to_pylist())
+    return np.asarray(values)


Converting PyArrow arrays to numpy via to_pylist() is inefficient. You should use to_numpy() where possible for better performance.

Suggested change

-def _array(values: Any) -> np.ndarray:
-    if isinstance(values, pa.ChunkedArray | pa.Array):
-        return np.asarray(values.to_pylist())
-    return np.asarray(values)
+def _array(values: Any) -> np.ndarray:
+    if isinstance(values, pa.ChunkedArray | pa.Array):
+        try:
+            return values.to_numpy()
+        except (pa.ArrowInvalid, pa.ArrowNotImplementedError):
+            return np.asarray(values.to_pylist())
+    return np.asarray(values)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 068cdfaba7

chatgpt-codex-connector · 2026-05-22T10:19:39Z

+        return self._tf
+
+    def write_shard_block(self, shard_id: str, block: Block) -> int:
+        rows = block if not isinstance(block, Tabular) else list(block)


Stream tabular rows when writing TFRecord shards

The new write_shard_block path eagerly converts Tabular blocks via list(block) before writing, which materializes every row view in memory at once. On large shards this can significantly increase peak memory and cause OOMs or slowdowns even though TFRecord emission is sequential. Iterating the Tabular directly (or in batches) avoids the extra allocation and keeps sink memory bounded.
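
The difference between the two approaches can be seen in a small standard-library sketch. The names `make_rows` and `write_rows` are hypothetical, and the generator only stands in for a Tabular block, on the assumption that iterating one yields row views lazily.

```python
from typing import Callable, Iterable, Iterator


def make_rows(n: int, log: list[str]) -> Iterator[dict]:
    """Yield rows lazily, recording when each one is materialized."""
    for i in range(n):
        log.append(f"make {i}")
        yield {"episode": i}


def write_rows(rows: Iterable[dict], write: Callable[[dict], None]) -> int:
    """Write rows one at a time without building an intermediate list."""
    count = 0
    for row in rows:
        write(row)
        count += 1
    return count


# Streaming: each row is created, written, and then released.
streamed_log: list[str] = []
write_rows(
    make_rows(2, streamed_log),
    lambda r: streamed_log.append(f"write {r['episode']}"),
)

# Eager: list(...) materializes every row before the first write.
eager_log: list[str] = []
write_rows(
    list(make_rows(2, eager_log)),
    lambda r: eager_log.append(f"write {r['episode']}"),
)
```

In the streamed log the make and write entries alternate, while in the eager log every make entry precedes the first write, which is the peak-memory pattern the comment describes.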


chatgpt-codex-connector · 2026-05-22T10:19:39Z

            )
        )

+    def write_rlds(


Update docs for new write_rlds pipeline API

This commit introduces a new user-facing feature (RefinerPipeline.write_rlds) but does not update docs/, which violates the repository rule in AGENTS.md (“Any new feature, execution block, architectural change, or user-visible behavior change must include corresponding doc updates in the same change set”). Without docs, users miss required usage/dependency details for the new sink and the published API surface becomes inconsistent with project documentation.


chatgpt-codex-connector Bot reviewed May 21, 2026

gemini-code-assist Bot reviewed May 21, 2026

Add RLDS writer

068cdfa

Guilherme Penedo (guipenedo) force-pushed the codex/rlds-writer-clean branch from efc08fa to 068cdfa on May 22, 2026 10:14

chatgpt-codex-connector Bot reviewed May 22, 2026

Guilherme Penedo (guipenedo) wants to merge 1 commit into main from codex/rlds-writer-clean
