---
license: apache-2.0
language:
- en
base_model:
- robotics-diffusion-transformer/rdt-1b
pipeline_tag: robotics
library_name: transformers
tags:
- RDT
- rdt
- RDT 2
- Vision-Language-Action
- Bimanual
- Manipulation
- Zero-shot
- UMI
- Flowmatching
- Diffusion
- Action Expert
---


# RDT2-FM: Flow-Matching Action Expert for RDT 2

RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective.
By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
This repository specifically provides the action expert component of RDT2-FM.

[**Home**](https://rdt-robotics.github.io/rdt2/) - [**GitHub**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)

---

## Table of contents

* [Highlights](#highlights)
* [Model details](#model-details)
* [Hardware & software requirements](#hardware--software-requirements)
* [Quickstart (inference)](#quickstart-inference)
* [Precision settings](#precision-settings)
* [Intended uses & limitations](#intended-uses--limitations)
* [Troubleshooting](#troubleshooting)
* [Changelog](#changelog)
* [Citation](#citation)
* [Contact](#contact)

---

## Highlights

* **Low-latency control**: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
* **Zero-shot cross-embodiment**: Designed to work with any bimanual platform (e.g., **UR5e**, **Franka FR3**) after proper calibration.
* **Scales with RDT2-VQ**: Pairs with the VLM backbone (**[RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)**) trained on **10k+ hours** and **100+ scenes** of UMI manipulation.

---

## Model details

### Architecture

* **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
* **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
* **Observation**: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics.
* **Instruction**: Short imperative text, recommended format **“Verb + Object.”** (e.g., “Pick up the apple.”).
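
The recommended instruction format can be enforced before a command reaches the model. The helper below is a minimal sketch in plain Python; `format_instruction` is a hypothetical name and is not part of the RDT2 codebase. It only normalizes whitespace, capitalization, and the trailing period, and does not check that the text is actually a verb followed by an object.

```python
def format_instruction(text: str) -> str:
    """Normalize a command to the "Verb + Object." style.

    Hypothetical helper (not an RDT2 API): collapses whitespace, capitalizes
    the first letter, and ends the sentence with exactly one period.
    """
    cleaned = " ".join(text.split())   # collapse repeated whitespace
    cleaned = cleaned.rstrip(".!? ")   # drop any existing end punctuation
    if not cleaned:
        raise ValueError("instruction must not be empty")
    return cleaned[0].upper() + cleaned[1:] + "."


print(format_instruction("pick up the apple"))
```

Calling it on text that is already well formed returns the text unchanged, so it is safe to apply to every instruction.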

### Action representation (UMI bimanual, per 24-step chunk)

* 20-D per step = right arm (10) + left arm (10). Each arm's 10 values are:

  * pos (x,y,z): 3
  * rot (6D rotation): 6
  * gripper width: 1
* Output tensor shape: **(T=24, D=20)**, relative deltas, `float32`.
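
To make the layout concrete, the sketch below splits a `(24, 20)` chunk into per-arm position, rotation, and gripper width, and decodes the 6D rotation with a Gram-Schmidt step. It rests on assumptions that this card does not spell out: within each arm the values are ordered position, rotation, gripper (only the gripper index 9 is confirmed by the quickstart), and the six rotation values are the first two columns of the rotation matrix stored one after the other. `rot6d_to_matrix` and `split_action_chunk` are hypothetical names, so check the conventions against the RDT2 repository before relying on them.

```python
import numpy as np


def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Map (..., 6) rotation features to (..., 3, 3) matrices via Gram-Schmidt.

    ASSUMPTION: the 6 values are the first two matrix columns, back to back.
    """
    a1, a2 = rot6d[..., :3], rot6d[..., 3:]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    a2 = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1  # remove the b1 component
    b2 = a2 / np.linalg.norm(a2, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)                                    # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=-1)                   # b1, b2, b3 become columns


def split_action_chunk(chunk: np.ndarray) -> dict:
    """Split a (T, 20) chunk into per-arm position, rotation matrix, and gripper width."""
    if chunk.ndim != 2 or chunk.shape[1] != 20:
        raise ValueError(f"expected shape (T, 20), got {chunk.shape}")
    arms = {}
    for idx, name in enumerate(("right", "left")):  # right arm first, as listed above
        arm = chunk[:, idx * 10:(idx + 1) * 10]
        arms[name] = {
            "pos": arm[:, 0:3],                    # ASSUMED order: pos, rot, gripper
            "rot": rot6d_to_matrix(arm[:, 3:9]),
            "gripper": arm[:, 9],
        }
    return arms


# Dummy chunk: zero translation, identity rotation, zero gripper width.
chunk = np.zeros((24, 20), dtype=np.float32)
chunk[:, [3, 7, 13, 17]] = 1.0  # 6D identity (1,0,0, 0,1,0) for each arm
parts = split_action_chunk(chunk)
print(parts["right"]["rot"].shape)
```

The identity rotation is used for the dummy data because it decodes to the same matrix under either a row or a column convention.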

---

## Hardware & software requirements

Approximate **single-GPU** requirements:

| Mode                      |     RAM |    VRAM | Example GPU             |
| ------------------------- | ------: | ------: | ----------------------- |
| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090                |
| Fine-tuning FM head       |       – | ~ 16 GB | RTX 4090                |

> For **deployment on real robots**, follow your platform’s **end-effector + camera** choices and perform **[hardware setup & calibration](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration)** (camera stand/pose, flange, etc.) before running closed-loop policies.

**Tested OS**: Ubuntu 24.04.

---

## Quickstart (inference)

```python
# Run from the root directory of the RDT2 GitHub repo: https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration
import numpy as np
import torch
import yaml

from models.rdt_inferencer import RDTInferencer


with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # TODO: set `normalizer_path` to the location of your downloaded normalizer.
    # Download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
    # Use RDT2-VQ as the VLM backbone.
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",
    device="cuda:0",
    dtype=torch.bfloat16,
)

# Language instruction. We suggest the "Verb + object." format:
# capitalized first letter and a trailing period.
instruction = "Pick up the apple."

result = model.step(
    observations={
        'images': {
            # 'exterior_rs': np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            'left_stereo': ...,   # left-arm RGB image: np.ndarray of shape (384, 384, 3), dtype=np.uint8
            'right_stereo': ...,  # right-arm RGB image: np.ndarray of shape (384, 384, 3), dtype=np.uint8
        },
        # The current state is not used yet, so pass zeros.
        # The input is kept so the interface stays the same for future fine-tuning.
        'state': np.zeros(model_config["common"]["state_dim"]).astype(np.float32),
    },
    instruction=instruction,
)


# Relative action chunk: np.ndarray of shape (24, 20), dtype=np.float32,
# in the same format as RDT2-VQ.
action_chunk = result.detach().cpu().numpy()

# Rescale gripper width from [0, 0.088] to [0, 0.1].
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
```

> For guides on **installation and fine-tuning**, please refer to the official [GitHub repository](https://github.com/thu-ml/RDT2).
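
Malformed camera frames are easy to miss until the policy misbehaves, so it can help to validate the inputs before calling `model.step`. The following is a minimal sketch that checks both wrist images against the `(384, 384, 3)` `uint8` format and packs them under the `left_stereo` / `right_stereo` keys used in the quickstart. `build_observations` is a hypothetical helper, not part of the RDT2 repository, and the `state_dim=20` in the usage lines is a placeholder; the real value should be read from `model_config["common"]["state_dim"]`.

```python
import numpy as np

IMAGE_SHAPE = (384, 384, 3)  # per-camera RGB input size stated in this card


def build_observations(left_rgb: np.ndarray, right_rgb: np.ndarray, state_dim: int) -> dict:
    """Validate both wrist images and pack them in the quickstart's observation layout."""
    for name, img in (("left_stereo", left_rgb), ("right_stereo", right_rgb)):
        if img.shape != IMAGE_SHAPE or img.dtype != np.uint8:
            raise ValueError(
                f"{name}: expected {IMAGE_SHAPE} uint8, got {img.shape} {img.dtype}"
            )
    return {
        "images": {"left_stereo": left_rgb, "right_stereo": right_rgb},
        # The state is not used yet, so it is zero-filled as in the quickstart.
        "state": np.zeros(state_dim, dtype=np.float32),
    }


# Stand-in frames; on a robot these would come from the wrist cameras.
frame = np.zeros(IMAGE_SHAPE, dtype=np.uint8)
obs = build_observations(frame, frame.copy(), state_dim=20)  # placeholder state_dim
print(obs["state"].shape)
```

The helper deliberately raises instead of resizing or casting, since silent conversion could hide a camera or pipeline misconfiguration.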

---

## Precision settings

* **RDT2-FM (action expert)**: `bfloat16` for training and inference.
* **RDT2-VQ (VLM backbone)**: `bfloat16` by default (following Qwen2.5-VL practice).

---

## Intended uses & limitations

**Intended uses**

* Research in **robot manipulation** and **VLA modeling**.
* Low-latency, short-horizon control on bimanual systems following **hardware calibration** steps.

**Limitations**

* Performance depends on **calibration quality**, camera placement, and correct normalization.
* Shifts in dataset or action statistics can degrade behavior; verify action bounds and reconstruction when adapting to new data.
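
One way to verify bounds is a cheap sanity check on every predicted chunk before it is sent to the robot. The sketch below is an illustration, not a procedure prescribed by the RDT2 authors: `check_action_chunk` and `max_step_translation` are hypothetical names, the threshold in the usage lines is an arbitrary example value, and positions are assumed to occupy the first three entries of each arm's 10 values, in whatever unit the normalizer produces.

```python
import numpy as np


def check_action_chunk(chunk: np.ndarray, max_step_translation: float) -> list:
    """Return a list of human-readable problems; an empty list means the chunk passed."""
    if chunk.shape != (24, 20):
        return [f"unexpected shape {chunk.shape}"]
    problems = []
    if not np.all(np.isfinite(chunk)):
        problems.append("non-finite values")
    for idx, name in enumerate(("right", "left")):
        pos = chunk[:, idx * 10:idx * 10 + 3]  # ASSUMED: position comes first in each arm
        largest = float(np.max(np.linalg.norm(pos, axis=1)))
        if largest > max_step_translation:
            problems.append(f"{name} arm translation {largest:.3f} exceeds {max_step_translation}")
    return problems


# Example with an arbitrary threshold; tune it for your own robot and units.
good = np.zeros((24, 20), dtype=np.float32)
bad = good.copy()
bad[3, 0] = 1.0  # implausibly large translation for the right arm
print(check_action_chunk(good, max_step_translation=0.05))
print(check_action_chunk(bad, max_step_translation=0.05))
```

A chunk that fails the check is better dropped, with the robot held in place, than clipped and executed.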

**Safety & responsible use**

* Always test with **hardware limits** engaged (reduced speed, gravity compensation, E-stop within reach).

---

## Troubleshooting

| Symptom                            | Likely cause                    | Suggested fix                                                          |
| ---------------------------------- | ------------------------------- | ---------------------------------------------------------------------- |
| Drifting / unstable gripper widths | Scale mismatch                  | Apply **LinearNormalizer**; rescale widths ([0,0.088] → [0,0.1]).      |
| Poor instruction following         | Prompt format / backbone config | Use **“Verb + Object.”**; ensure the backbone is loaded on the same device. |

---

## Changelog

* **2025-09**: Initial release of **RDT2-FM** on Hugging Face.

---

## Citation

```bibtex
@software{rdt2,
    title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
    author={RDT Team},
    url={https://github.com/thu-ml/RDT2},
    month={September},
    year={2025}
}
```

---

## Contact

* Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
* Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)