Skip to content

Update Datapipes API#1468

Open
coreyjadams wants to merge 1 commit intoNVIDIA:2.0.0-rcfrom
coreyjadams:datapipes-api-rc
Open

Update Datapipes API#1468
coreyjadams wants to merge 1 commit intoNVIDIA:2.0.0-rcfrom
coreyjadams:datapipes-api-rc

Conversation

@coreyjadams
Copy link
Collaborator

@coreyjadams coreyjadams commented Mar 2, 2026

PhysicsNeMo Pull Request

This PR updates the Datapipes API docs. The idea here was to break down the datapipes API into several pages that are smaller and more consumable. I separated the "legacy" datapipes (v1 format) and the composable datapipes (v2 format). The main intro page gives the high level summary and there are details on the subsequent pages.

Hard to review until it's rendered, but hard to render without the docs build ...

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@coreyjadams coreyjadams marked this pull request as ready for review March 2, 2026 15:55
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

Added comprehensive documentation for the PhysicsNeMo datapipes module, including new RST files for benchmarks, CAE, climate, GNN, readers, transforms, and the main datapipes page. The documentation includes usage examples, architecture diagrams, and API references.

Key changes:

  • Reorganized datapipes documentation into dedicated subdirectory
  • Added detailed docs for v2.0 redesigned datapipes architecture with GPU-first philosophy
  • Documented legacy v1 datapipes for backward compatibility
  • Included quick start guides and custom extension examples

Issues found:

  • Multiple spelling errors in main documentation files that need correction
  • Missing newline at end of one file

Important Files Changed

Filename Overview
docs/api/datapipes/physicsnemo.datapipes.cae.rst added CAE datapipes docs; has typo on line 18 and missing newline at EOF
docs/api/datapipes/physicsnemo.datapipes.rst added main datapipes documentation; contains multiple typos (lines 14, 35, 49, 156, 159)
docs/api/datapipes/physicsnemo.datapipes.transforms.rst added transforms documentation; has typo on line 25

Last reviewed commit: 0897f6d

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 files reviewed, 8 comments

Edit Code Review Agent Settings | Greptile


The DoMINO DataPipe reads the DrivearML dataset, and other datasets, for
the DoMINO model for external aerodynamics. The expected format of inputs can
be acheived using PhysicsNeMo-Curator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'acheived' is misspelled

Suggested change
be acheived using PhysicsNeMo-Curator.
be achieved using PhysicsNeMo-Curator.


.. automodule:: physicsnemo.datapipes.cae.transolver_datapipe
:members:
:show-inheritance: No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing newline at end of file

Suggested change
:show-inheritance:
:show-inheritance:

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

are described in the climate, cae, gnn, and
benchmark subsections.

In PhysicsNeMo v2.0, the datapipes API has been redesign from scratch to focus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'redesign' should be 'redesigned' (past tense)

Suggested change
In PhysicsNeMo v2.0, the datapipes API has been redesign from scratch to focus
In PhysicsNeMo v2.0, the datapipes API has been redesigned from scratch to focus

- Composability and Extensibility: We aim to provide a tool kit and examples that
lets you build what you need yourself, easily, if it's not here.
- Datapipes as configuration: Changing a pipeline shouldn't require source code
modification; the registry system in PhysicsNeMo datapipes enables hydra instantition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'instantition' is misspelled

Suggested change
modification; the registry system in PhysicsNeMo datapipes enables hydra instantition
modification; the registry system in PhysicsNeMo datapipes enables hydra instantiation


At the highest level, `physicsnemo.datapipes.DataLoader` has a similar API and
model as `pytorch.utils.data.DataLoader`, enabling a drop-in replacement in many
cases. Under the hood, physicsnemo follows a very differnt computation orchestration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'differnt' is misspelled

Suggested change
cases. Under the hood, physicsnemo follows a very differnt computation orchestration.
cases. Under the hood, physicsnemo follows a very different computation orchestration.

^^^^^^^^^

Combining a set of tensordict objects into a batch of data can, at times,
be complex. Some dataset, like graph datasets, require special care. For
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'dataset' should be plural to match context

Suggested change
be complex. Some dataset, like graph datasets, require special care. For
Some datasets, like graph datasets, require special care. For

be complex. Some dataset, like graph datasets, require special care. For
this reason, PhysicsNeMo datapipes offers custom collation functions
as well as an interface to write your own collator. If the dataset you are
trying to collate can not be accomodated here, please open an issue on github.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'accomodated' is misspelled

Suggested change
trying to collate can not be accomodated here, please open an issue on github.
trying to collate can not be accommodated here, please open an issue on github.

By default, transforms accept and return `tensordict` objects: this is not,
strictly, a requirement that must be enforced.
If you implement custom transforms that return different data types, downstream
transforms should expect that data type. On example of this, found in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'On example' should be 'One example'

Suggested change
transforms should expect that data type. On example of this, found in the
transforms should expect that data type. One example of this, found in the

@coreyjadams coreyjadams changed the title Trying again with datapipes check in Update Datapipes API Mar 2, 2026
Copy link
Collaborator

@ktangsali ktangsali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, have a few minor comments.

===================

The Benchmark Datapipes are targeted v1 datapipes for specific datasets. These
are largely maintained but not actively developed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to highlight here that the datapipes here are generating the data "on the fly". i.e., they are not loading anything from file system and hence are useful for some quick model development and testing. I think the individual docstrings call this out, but might be a good idea to highlight it here.

The VortexSheddingDataset processes flow field data around bluff bodies,
capturing vortex shedding patterns and flow structures for graph-based learning.
The VortexSheddingDataset is used in the VortexShedding CFD examples.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call out why these are "datasets" and not "datapipes"?


Readers handle IO exclusively - it is highly encouraged, if you are building a
a custom datapipe, to implement transforms as separate operations. This will
enable GPU computations and composable, extensible pipelines.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we saying that the dataset operations are better if they are done after the data is already on GPU? I think having these ops happen async would be beneficial from perf standpoint right?


The input to a `transform` is **mutable** by default, and so the order of transformations matters.

In general, transforms are transactional: take input in, manipulate it, return input out, and almost never update state. Transforms should be device-agnostic,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: output out?

Comment on lines +39 to +45
.. BEGIN transforms-intro
.. Suggested outline:
.. - All transforms implement the :class:`~physicsnemo.datapipes.transforms.base.Transform` ABC
.. - Designed for GPU tensors (device transfer happens before transforms in Dataset)
.. - Composable: pass a list to Dataset or use Compose explicitly
.. - Transforms with internal tensor state are auto-moved via .to(device)
.. END transforms-intro
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this left over?

)

# 1. Choose a Reader for your data format
reader = HDF5Reader(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use some real data here? might make it easy for users to follow. Or something that we prepare (e.g. here: https://github.com/NVIDIA/physicsnemo/blob/main/examples/cfd/navier_stokes_rnn/navier_stokes_rnn.py). that will help clarify how the h5 should be structured to be able to use this reader.

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants