PRANA: Policy for Robotic Action via Neural Architecture
by Siddarth Dayasagar and Arpit Gandhi
PRANA is a vision-action policy for autonomous robot manipulation, evolving from deterministic action chunking (v1) to flow matching with correlated noise (v2/v3).
VIDEO PRESENTATION: https://canva.link/5y0ik0mh8dsv7i6
DATASET: https://huggingface.co/datasets/Siddarth09/PRANA
TASK: Pick a screwdriver and place it in the box
Teleoperation demo video: `lerobot.teleop.mp4`
| Version | Action Head | Vision Backbone | Key Innovation | Status |
|---|---|---|---|---|
| v1 | L1 regression | ViT-Tiny (timm) | Self-attention action chunking | Baseline |
| v2 | Flow matching | ViT-Tiny (timm) | Cross-attention + correlated noise | Deployed |
| v3 | Flow matching | DINOv2 ViT-S/14 | Self-supervised spatial features | Experimental |
- Python 3.12+
- CUDA GPU (tested on RTX 5060 Laptop, 8GB VRAM)
- SO-101 robot arm with LeRobot firmware
python -m venv lerobot_env
source lerobot_env/bin/activate
pip install lerobot==0.5.1
pip install timm wandb av

For Blackwell GPUs (RTX 50xx):

pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu128 --force-reinstall

Use the leader arm to teleoperate the SO-101 follower:
lerobot-teleoperate \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.cameras='{
"table": {"type": "intelrealsense", "serial_number_or_name": "103422071945", "width": 640, "height": 480, "fps": 30},
"wrist": {"type": "opencv", "index_or_path": "/dev/video4", "width": 640, "height": 480, "fps": 30}
}' \
--teleop.type=so101_leader \
--teleop.port=/dev/ttyACM1 \
    --display_data=true

Record demonstration episodes:

lerobot-record \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.cameras='{
"table": {"type": "intelrealsense", "serial_number_or_name": "103422071945", "width": 640, "height": 480, "fps": 30},
"wrist": {"type": "opencv", "index_or_path": "/dev/video4", "width": 640, "height": 480, "fps": 30}
}' \
--teleop.type=so101_leader \
--teleop.port=/dev/ttyACM1 \
--dataset.repo_id=Siddarth09/PRANA \
--dataset.single_task="Pick the screwdriver and place it in the box" \
--dataset.num_episodes=20 \
--dataset.episode_time_s=25 \
--dataset.reset_time_s=10 \
--dataset.fps=30 \
--display_data=true \
--dataset.push_to_hub=false
- Place all cameras at their designated locations
- Get comfortable with the task via teleoperation first
- Start the recorder and wait for "Episode # recording" audio
- Complete the entire task within each episode
- Watch the wrist camera in the Rerun viewer for better control
We collected 123 episodes total (83 initial + 40 focused on grasp/release).
Deterministic action chunking transformer (baseline).
All tokens (vision + state + language + action queries) are concatenated and processed through a single self-attention transformer. Predicts action chunks via L1 regression.
| Component | Details |
|---|---|
| Vision | ViT-Tiny (unfrozen) |
| Transformer | 4-layer encoder, self-attention only |
| Action head | L1 regression |
| Language | 256K vocab embedding (unused — receives zeros) |
| Chunk size | 50 |
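
For orientation, a minimal PyTorch sketch of this design (token concatenation into a single self-attention stack, L1-regressed action chunk). The layer sizes, the 6-DoF state/action dimensions, and the module names are assumptions, and the language tokens are omitted here since they receive zeros:

```python
import torch
import torch.nn as nn

D, CHUNK = 192, 50                                   # ViT-Tiny embed dim, action chunk size

class PranaV1Sketch(nn.Module):
    def __init__(self, state_dim=6, action_dim=6):   # 6-DoF SO-101 assumed
        super().__init__()
        self.state_proj = nn.Linear(state_dim, D)
        self.action_queries = nn.Parameter(torch.randn(CHUNK, D) * 0.02)   # learned queries
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)          # self-attention only
        self.action_head = nn.Linear(D, action_dim)

    def forward(self, vision_tokens, state):
        # vision_tokens: (B, N, D) from ViT-Tiny over both cameras; state: (B, state_dim)
        B = state.shape[0]
        tokens = torch.cat([vision_tokens,
                            self.state_proj(state)[:, None],
                            self.action_queries.expand(B, -1, -1)], dim=1)
        out = self.encoder(tokens)                   # every token attends to every other token
        return self.action_head(out[:, -CHUNK:])    # last CHUNK tokens -> (B, 50, action_dim)

# Training minimizes torch.nn.functional.l1_loss(policy(vision, state), demo_chunk).
```

Train v1: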
python3 prana/train_v1.py \
--dataset.repo_id=Siddarth09/PRANA \
--dataset.video_backend=pyav \
--dataset.image_transforms.enable=false \
--policy.type=prana_v1 \
--policy.device=cuda \
--policy.camera_order='["observation.images.table","observation.images.wrist"]' \
--batch_size=1 \
--num_workers=0 \
--steps=85000 \
--policy.push_to_hub=false \
--output_dir=outputs/train/prana \
    --wandb.enable=true

Deploy and evaluate v1 on the robot:

python3 prana/deploy_v1.py \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.cameras='{"table": {"type": "intelrealsense", "serial_number_or_name": "103422071945", "width": 640, "height": 480, "fps": 30}, "wrist": {"type": "opencv", "index_or_path": "/dev/video4", "width": 640, "height": 480, "fps": 30}}' \
--display_data=true \
--dataset.repo_id=Siddarth09/eval_prana_pick_place \
--dataset.num_episodes=5 \
--dataset.single_task="Pick the screwdriver and place it in the box" \
--dataset.push_to_hub=false \
--policy.path=outputs/train/prana/checkpoints/last/pretrained_model \
    --policy.device=cuda

Limitations of v1:

- Self-attention bottleneck: action queries pollute each other before seeing visual context
- No positional embeddings on action queries
- Deterministic output — learns average of demos, fails with multimodal actions
- Unfrozen ViT overfits with limited data
- Language encoder wastes 262MB of parameters
Flow matching with correlated noise and cross-attention decoder. Recommended model.
Context path: Frozen ViT-Tiny encodes table + wrist images with camera-ID embeddings. Concatenated with state token and processed through a 4-layer self-attention context encoder.
Flow matching path: Correlated noise ε ~ N(0, βΣ+(1-β)I) is sampled from the empirical action covariance, combined with target actions at timestep t, then projected through the action encoder with positional embeddings.
Decoder: Noisy action tokens cross-attend to visual context (7-8 layers), producing velocity predictions. Loss = MSE(v_pred, ε - a_target).
Inference: 10 Euler denoising steps from correlated noise → actions. Rolling inpainting (execute 40, save 10) for smooth transitions.
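
A hedged sketch of the v2 training step under these definitions. Here `L_sigma` is assumed to be the Cholesky factor of the empirical action covariance Σ over flattened chunks, and the `model(x_t, t, context)` signature is illustrative rather than the actual API:

```python
import torch
import torch.nn.functional as F

def sample_correlated_noise(L_sigma, batch, chunk, act_dim, beta=0.5):
    """Draw eps ~ N(0, beta*Sigma + (1-beta)*I).

    L_sigma: (chunk*act_dim, chunk*act_dim) lower-triangular Cholesky factor of the
    empirical covariance Sigma (how fit_noise.py stores it is an assumption).
    """
    z1 = torch.randn(batch, chunk * act_dim)
    z2 = torch.randn(batch, chunk * act_dim)
    eps = beta ** 0.5 * (z1 @ L_sigma.T) + (1 - beta) ** 0.5 * z2
    return eps.view(batch, chunk, act_dim)

def flow_matching_step(model, context, actions, L_sigma, beta=0.5):
    """One training step's loss: predict the velocity that carries noise to actions."""
    B, chunk, act_dim = actions.shape
    eps = sample_correlated_noise(L_sigma, B, chunk, act_dim, beta)
    t = torch.rand(B, 1, 1)                    # flow-matching time, broadcast over the chunk
    x_t = t * eps + (1 - t) * actions          # noisy action tokens fed to the decoder
    v_pred = model(x_t, t, context)            # decoder cross-attends to the visual context
    return F.mse_loss(v_pred, eps - actions)   # Loss = MSE(v_pred, eps - a_target)
```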
| Component | Details |
|---|---|
| Vision | ViT-Tiny (frozen) + camera ID embed |
| Context encoder | 4 self-attention layers |
| Action decoder | 7-8 cross-attention layers |
| Action head | Flow matching (10 denoise steps) |
| Noise | Correlated, β=0.5 |
| Chunk size | 50 |
| Hidden dim | 256 |
| Trainable params | ~11M |
python3 prana_v2/train_v2.py \
--dataset.repo_id=Siddarth09/PRANA \
--dataset.video_backend=pyav \
--dataset.image_transforms.enable=false \
--policy.type=prana_v2 \
--policy.device=cuda \
--policy.camera_order='["observation.images.table","observation.images.wrist"]' \
--batch_size=8 \
--num_workers=0 \
--steps=100000 \
--policy.push_to_hub=false \
--output_dir=outputs/train/prana_v2 \
    --wandb.enable=true

After training, fit the correlated noise sampler:

python3 prana_v2/fit_noise.py \
--dataset Siddarth09/PRANA \
--checkpoint outputs/train/prana_v2/checkpoints/last/pretrained_model
⚠️ Critical step. Without fitting the noise sampler, the robot will be jittery. This computes the action covariance matrix and patches the Cholesky factor into the checkpoint.
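Conceptually, the script does something like the following (a sketch under assumptions; the real fit_noise.py reads the LeRobot dataset and writes the factor into the checkpoint):

```python
import torch

def fit_action_covariance(action_chunks, jitter=1e-4):
    """Estimate the empirical covariance over flattened action chunks and factor it.

    action_chunks: (num_chunks, chunk_len, act_dim) tensor of demonstration actions.
    Returns the lower-triangular L with L @ L.T ~= Sigma; the small diagonal jitter
    keeps the Cholesky factorization numerically stable.
    """
    flat = action_chunks.reshape(action_chunks.shape[0], -1)
    centered = flat - flat.mean(dim=0, keepdim=True)
    sigma = centered.T @ centered / (flat.shape[0] - 1)
    sigma = sigma + jitter * torch.eye(sigma.shape[0])
    return torch.linalg.cholesky(sigma)

# The factor is then written into the checkpoint so deploy_v2.py can sample
# correlated starting noise; where exactly it is stored is an assumption here.
```

Deploy v2 on the robot: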
python3 prana_v2/deploy_v2.py \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.cameras='{"table": {"type": "intelrealsense", "serial_number_or_name": "103422071945", "width": 640, "height": 480, "fps": 30}, "wrist": {"type": "opencv", "index_or_path": "/dev/video4", "width": 640, "height": 480, "fps": 30}}' \
--display_data=false \
--dataset.fps=10 \
--dataset.repo_id=Siddarth09/eval_prana_v2 \
--dataset.num_episodes=5 \
--dataset.single_task="Pick the screwdriver and place it in the box" \
--dataset.push_to_hub=false \
--policy.path=outputs/train/prana_v2/checkpoints/last/pretrained_model \
    --policy.device=cuda

Deployment tips:

| Tip | Why |
|---|---|
| `--dataset.fps=10` | Matches camera throughput, prevents frame drops |
| `--display_data=false` | Saves CPU for inference |
| Denoise steps = 5 | Faster inference without retraining (edit config) |
| Fit noise sampler | Reduces jitter significantly |

Inference timing:
| Operation | Time |
|---|---|
| Chunk prediction (10 denoise steps) | ~22ms |
| Queued action (from buffer) | ~0.5ms |
| Camera capture (2 cameras) | ~100ms |
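
The split between chunk prediction and queued actions comes from the rolling inpainting scheme above (execute 40, save 10). A sketch of one plausible implementation; the exact overlap handling in PRANA is an assumption:

```python
import torch

EXECUTE, OVERLAP, STEPS = 40, 10, 10      # run 40 actions per 50-step chunk, save 10

def rolling_chunk(model, context, sample_noise, saved_tail=None):
    """Denoise one chunk, clamping the overlap region to the previous chunk's tail."""
    x = sample_noise()                                # (50, act_dim) correlated noise
    dt = 1.0 / STEPS
    for i in range(STEPS):
        if saved_tail is not None:
            x[:OVERLAP] = saved_tail                  # inpaint: keep the overlap fixed
        t = torch.full((1,), 1.0 - i * dt)
        x = x - dt * model(x, t, context)             # Euler step along the predicted velocity
    if saved_tail is not None:
        x[:OVERLAP] = saved_tail
    return x[:EXECUTE], x[EXECUTE:].detach().clone()  # actions to execute, tail for next chunk
```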
Flow matching with DINOv2 self-supervised vision backbone.
Identical to v2 except the vision backbone — uses DINOv2 ViT-S/14 which provides self-supervised spatial features (384d, 256 tokens per camera) instead of ImageNet-supervised ViT-Tiny (192d, 197 tokens).
| What changed | v2 | v3 |
|---|---|---|
| Vision backbone | ViT-Tiny (timm) | DINOv2 ViT-S/14 (torch.hub) |
| Embed dimension | 192 | 384 (2x) |
| Tokens per camera | 197 | 256 |
| Total visual tokens | 394 | 512 |
| Backbone params | 5.7M | 22M |
| Trainable params | ~11M | ~30M |
Note: DINOv2 requires ~1.5GB more VRAM. Batch size 8 uses ~5.5GB on RTX 5060.
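For reference, a hedged sketch of pulling DINOv2 patch tokens via torch.hub; the preprocessing and the projection of 384-d tokens into the policy's 256-d hidden space are assumptions:

```python
import torch

# DINOv2 ViT-S/14 gives 384-d patch tokens; a 224x224 crop yields 16x16 = 256 tokens.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval().requires_grad_(False)                 # kept frozen, as in v2/v3

images = torch.randn(2, 3, 224, 224)                  # ImageNet-normalized camera crops in practice
with torch.no_grad():
    feats = backbone.forward_features(images)
patch_tokens = feats["x_norm_patchtokens"]            # (2, 256, 384)

# v3 would then project these into the policy's 256-d hidden space before the
# context encoder, e.g. nn.Linear(384, 256) (the projection is an assumption).
```

Train v3: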
python3 prana_v3/train_v3.py \
--dataset.repo_id=Siddarth09/PRANA \
--dataset.video_backend=pyav \
--dataset.image_transforms.enable=false \
--policy.type=prana_v3 \
--policy.device=cuda \
--policy.camera_order='["observation.images.table","observation.images.wrist"]' \
--batch_size=8 \
--num_workers=0 \
--steps=100000 \
--policy.push_to_hub=false \
--output_dir=outputs/train/prana_v3 \
    --wandb.enable=true

Deploy v3 on the robot:

python3 prana_v3/deploy_v3.py \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.cameras='{"table": {"type": "intelrealsense", "serial_number_or_name": "103422071945", "width": 640, "height": 480, "fps": 30}, "wrist": {"type": "opencv", "index_or_path": "/dev/video4", "width": 640, "height": 480, "fps": 30}}' \
--display_data=false \
--dataset.fps=10 \
--dataset.repo_id=Siddarth09/eval_prana_v3 \
--dataset.num_episodes=5 \
--dataset.single_task="Pick the screwdriver and place it in the box" \
--dataset.push_to_hub=false \
--policy.path=outputs/train/prana_v3/checkpoints/080000/pretrained_model \
    --policy.device=cuda

The v2 encoder auto-detects backbone type, so you can experiment by changing one line in configuration_prana.py:
# ViT-Tiny (default, 5.7M, 197 tokens/cam)
vision_backbone: str = "vit_tiny_patch16_224"
# DINOv2 via timm (22M, 256 tokens/cam)
vision_backbone: str = "vit_small_patch14_dinov2"
# ConvNeXt CNN (28M, 49 tokens/cam)
vision_backbone: str = "convnext_tiny"
# ViT-Small (22M, 197 tokens/cam)
vision_backbone: str = "vit_small_patch16_224"Standard imitation learning predicts actions via regression (L1/MSE), which learns the average of demonstrations. Flow matching instead learns the velocity field that transforms noise into valid trajectories:
- Training: Interpolate noise ε and target actions a at time t: x_t = t·ε + (1-t)·a. Predict velocity v_t. Loss = MSE(v_t, ε - a).
- Inference: Start from correlated noise and take 10 Euler steps: x_{t-dt} = x_t - dt·v_t. The result is approximately a valid action trajectory.
Correlated noise (from the empirical action covariance) makes denoising easier: the starting noise already has the temporal smoothness and joint coordination of real trajectories.
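
A toy numeric check of these two equations (the shapes and the exact velocity field are illustrative):

```python
import torch

# If the model predicted the exact velocity v = eps - a at every step, 10 Euler
# steps from x_1 = eps would land exactly on the action chunk a.
torch.manual_seed(0)
a = torch.randn(50, 6)            # a target action chunk (50 steps x 6 joints, illustrative)
eps = torch.randn(50, 6)          # starting noise (correlated noise in the real policy)

x, dt = eps.clone(), 1.0 / 10
for _ in range(10):
    v = eps - a                   # exact velocity for the path x_t = t*eps + (1-t)*a
    x = x - dt * v                # Euler step: x_{t-dt} = x_t - dt * v_t
print(torch.allclose(x, a, atol=1e-5))   # True: the integration maps noise back to actions
```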
For reference, the full v1 evaluation recording command as a single lerobot-record invocation:

lerobot-record --robot.type=so101_follower --robot.port=/dev/ttyACM0 --robot.cameras='{
  "table": {
    "type": "intelrealsense",
    "serial_number_or_name": "103422071945",
    "width": 640,
    "height": 480,
    "fps": 30
  },
  "wrist": {
    "type": "opencv",
    "index_or_path": "/dev/video4",
    "width": 640,
    "height": 480,
    "fps": 30
  }
}' --teleop.type=so101_leader --teleop.port=/dev/ttyACM1 --display_data=true --dataset.repo_id=Siddarth09/eval_prana_pick_place --dataset.num_episodes=5 --dataset.single_task="Pick the screwdriver and place it in the box" --dataset.push_to_hub=false --policy.path=outputs/train/prana/checkpoints/last/pretrained_model --policy.device=cuda

