Update README.md

README.md +182 -2

@@ -15,7 +15,187 @@ tags:
 - Manipulation
 - Zero-shot
 - UMI
-- Flow matching
 - Diffusion
 - Action Expert
----

 - Manipulation
 - Zero-shot
 - UMI
+- Flowmatching
 - Diffusion
 - Action Expert
+---
+# RDT2-FM: Flow-Matching Action Expert for RDT 2
+RDT2-FM conditions on a vision-language backbone ([RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)) and predicts short-horizon **relative action chunks** using an action expert that combines an improved RDT architecture with a flow-matching objective.
+The **flow-matching** objective lets RDT2-FM deliver **lower inference latency** while preserving strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
+Concretely, this repository contains the **action expert** for RDT2-FM.
+[**Home**](https://rdt-robotics.github.io/rdt2/) - [**GitHub**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
+---
+## Table of contents
+* [Highlights](#highlights)
+* [Model details](#model-details)
+* [Hardware & software requirements](#hardware--software-requirements)
+* [Quickstart (inference)](#quickstart-inference)
+* [Precision settings](#precision-settings)
+* [Intended uses & limitations](#intended-uses--limitations)
+* [Troubleshooting](#troubleshooting)
+* [Changelog](#changelog)
+* [Citation](#citation)
+* [Contact](#contact)
+---
+## Highlights
+* **Low-latency control**: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
+* **Zero-shot cross-embodiment**: Designed to work with any bimanual platform (e.g., **UR5e**, **Franka FR3**) after proper calibration.
+* **Scales with RDT2-VQ**: Pairs with the VLM backbone (**[RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)**) trained on **10k+ hours** and **100+ scenes** of UMI manipulation.
+---
+## Model details
+### Architecture
+* **Backbone**: Vision-language backbone such as **RDT2-VQ** (Qwen2.5-VL-7B based).
+* **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
+* **Observation**: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics.
+* **Instruction**: Short imperative text, recommended format **“Verb + Object.”** (e.g., “Pick up the apple.”).
+### Action representation (UMI bimanual, per 24-step chunk)
+* 20-D per step = right arm (10) + left arm (10); each arm contributes:
+  * pos (x,y,z): 3
+  * rot (6D rotation): 6
+  * gripper width: 1
+* Output tensor shape: **(T=24, D=20)**, relative deltas, `float32`.
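The layout above can be unpacked with plain NumPy slicing. The following is a minimal sketch, assuming the right arm occupies columns 0-9 and the left arm columns 10-19, and that each arm is ordered as position, 6D rotation, then gripper width. `split_action_chunk` is a hypothetical helper for illustration, not part of the RDT2 codebase.

```python
import numpy as np

ARM_DIM = 10  # 3 (pos) + 6 (6D rotation) + 1 (gripper width)


def split_action_chunk(chunk: np.ndarray) -> dict:
    """Split a (T, 20) relative action chunk into per-arm components.

    Assumed column order (hypothetical helper): right arm first, then left arm.
    """
    if chunk.ndim != 2 or chunk.shape[1] != 2 * ARM_DIM:
        raise ValueError(f"expected shape (T, 20), got {chunk.shape}")
    arms = {}
    for idx, name in enumerate(("right", "left")):
        arm = chunk[:, idx * ARM_DIM:(idx + 1) * ARM_DIM]
        arms[name] = {
            "pos": arm[:, 0:3],    # x, y, z deltas
            "rot6d": arm[:, 3:9],  # 6D rotation representation
            "gripper": arm[:, 9],  # gripper width
        }
    return arms


# Dummy chunk with the documented shape (T=24, D=20).
demo = np.arange(24 * 20, dtype=np.float32).reshape(24, 20)
parts = split_action_chunk(demo)
print(parts["right"]["pos"].shape, parts["left"]["rot6d"].shape)
```

The slices are views into the original array, so writing to them modifies the chunk in place; copy first if the raw output must be kept.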
+---
+## Hardware & software requirements
+Approximate **single-GPU** requirements:
+| Mode                      |     RAM |    VRAM | Example GPU             |
+| ------------------------- | ------: | ------: | ----------------------- |
+| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090                |
+| Fine-tuning FM head       |       – | ~ 16 GB | RTX 4090                |
+> For **deployment on real robots**, follow your platform’s **end-effector + camera** choices and perform **[hardware setup & calibration](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration)** (camera stand/pose, flange, etc.) before running closed-loop policies.
+**Tested OS**: Ubuntu 24.04.
+---
+## Quickstart (inference)
+```python
+# Run from the root directory of the RDT2 GitHub repo: https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration
+import numpy as np
+import torch
+import yaml
+from models.rdt_inferencer import RDTInferencer
+
+with open("configs/rdt/post_train.yaml", "r") as f:
+    model_config = yaml.safe_load(f)
+
+model = RDTInferencer(
+    config=model_config,
+    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
+    # TODO: set `normalizer_path` to the location of your downloaded normalizer
+    # download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
+    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
+    # use RDT2-VQ as the VLM backbone
+    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",
+    device="cuda:0",
+    dtype=torch.bfloat16,
+)
+
+# Language instruction. Recommended format: "Verb + Object." with a
+# capitalized first letter and a trailing period.
+instruction = "Pick up the apple."
+
+result = model.step(
+    observations={
+        "images": {
+            # "exterior_rs": np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
+            "left_stereo": ...,  # left arm RGB image, np.ndarray of shape (384, 384, 3), dtype=np.uint8
+            "right_stereo": ...,  # right arm RGB image, np.ndarray of shape (384, 384, 3), dtype=np.uint8
+        },
+        # The current state is passed as zeros for now; the input
+        # interface is kept for future fine-tuning.
+        "state": np.zeros(model_config["common"]["state_dim"], dtype=np.float32),
+    },
+    instruction=instruction,
+)
+
+# Relative action chunk, np.ndarray of shape (24, 20) with dtype=np.float32,
+# in the same format as RDT2-VQ
+action_chunk = result.detach().cpu().numpy()
+
+# Rescale gripper width from [0, 0.088] to [0, 0.1]
+for robot_idx in range(2):
+    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
+```
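Before calling `model.step`, it can help to check the inputs against the shapes and dtypes stated in the comments above. The sketch below assumes only the observation keys shown in the Quickstart; `make_observation` is a hypothetical helper, and the `state_dim=20` value in the usage lines is a placeholder rather than the value from `configs/rdt/post_train.yaml`.

```python
import numpy as np

IMAGE_SHAPE = (384, 384, 3)  # per-camera RGB input size from the Quickstart


def make_observation(left_rgb: np.ndarray, right_rgb: np.ndarray, state_dim: int) -> dict:
    """Validate both wrist images and build the observation dict (hypothetical helper)."""
    for name, img in (("left_stereo", left_rgb), ("right_stereo", right_rgb)):
        if not isinstance(img, np.ndarray):
            raise ValueError(f"{name}: expected np.ndarray, got {type(img).__name__}")
        if img.shape != IMAGE_SHAPE or img.dtype != np.uint8:
            raise ValueError(
                f"{name}: expected uint8 array of shape {IMAGE_SHAPE}, "
                f"got {img.dtype} {img.shape}"
            )
    return {
        "images": {"left_stereo": left_rgb, "right_stereo": right_rgb},
        # Zero state, mirroring the Quickstart.
        "state": np.zeros(state_dim, dtype=np.float32),
    }


# Usage with blank frames; state_dim=20 is a placeholder value.
blank = np.zeros(IMAGE_SHAPE, dtype=np.uint8)
obs = make_observation(blank, blank, state_dim=20)
print(sorted(obs["images"]), obs["state"].shape)
```

Failing early with a clear message is usually easier to debug than a shape error raised deep inside the vision encoder.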
+> For **installation and fine-tuning instructions**, please refer to the official [GitHub repository](https://github.com/thu-ml/RDT2).
+---
+## Precision settings
+* **RDT2-FM (action expert)**: `bfloat16` for training and inference.
+* **RDT2-VQ (VLM backbone)**: `bfloat16` by default (following Qwen2.5-VL practice).
+---
+## Intended uses & limitations
+**Intended uses**
+* Research in **robot manipulation** and **VLA modeling**.
+* Low-latency, short-horizon control on bimanual systems following **hardware calibration** steps.
+**Limitations**
+* Performance depends on **calibration quality**, camera placement, and correct normalization.
+* Shifts in dataset or action statistics can degrade behavior; verify bounds and reconstruction when adapting.
+**Safety & responsible use**
+* Always test with **hardware limits** engaged (reduced speed, gravity compensation, E-stop within reach).
+---
+## Troubleshooting
+| Symptom                            | Likely cause                    | Suggested fix                                                          |
+| ---------------------------------- | ------------------------------- | ---------------------------------------------------------------------- |
+| Drifting / unstable gripper widths | Scale mismatch                  | Apply **LinearNormalizer**; rescale widths ([0,0.088] → [0,0.1]).      |
+| Poor instruction following         | Prompt format / backbone config | Use “**Verb + Object.**”; ensure backbone is loaded on same device.    |
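The gripper-width rescaling from the first row can be written as a small vectorized function. This is a sketch under the assumption that gripper widths sit at columns 9 and 19 of the (24, 20) chunk, as in the Quickstart loop; `rescale_gripper_widths` is a hypothetical name, and it covers only the width rescaling, not the normalizer.

```python
import numpy as np

GRIPPER_COLUMNS = [9, 19]  # gripper width index for each of the two arms


def rescale_gripper_widths(chunk, src_max=0.088, dst_max=0.1):
    """Return a copy of the chunk with gripper widths mapped from [0, src_max] to [0, dst_max]."""
    out = np.array(chunk, dtype=np.float32, copy=True)
    out[:, GRIPPER_COLUMNS] *= dst_max / src_max
    return out


# Dummy chunk with fully open grippers in the source range.
chunk = np.zeros((24, 20), dtype=np.float32)
chunk[:, GRIPPER_COLUMNS] = 0.088
rescaled = rescale_gripper_widths(chunk)
print(rescaled[0, GRIPPER_COLUMNS])
```

Returning a copy leaves the raw model output unchanged, which makes it harder to apply the rescaling twice by accident.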
+---
+## Changelog
+* **2025-09**: Initial release of **RDT2-FM** on Hugging Face.
+---
+## Citation
+```bibtex
+@misc{rdt2_2025,
+  title  = {RDT 2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
+  author = {RDT Robotics Team},
+  year   = {2025},
+  url    = {https://rdt-robotics.github.io/rdt2/}
+}
+```
+---
+## Contact
+* Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
+* Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
+* Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)