Commit d37823f · verified · 1 Parent(s): a6fe3f3

Update README.md

Files changed (1): README.md +182 -2
README.md CHANGED
@@ -15,7 +15,187 @@ tags:
  - Manipulation
  - Zero-shot
  - UMI
- - Flow matching
+ - Flowmatching
  - Diffusion
  - Action Expert
- ---
+ ---
+
+
+ # RDT2-FM: Flow-Matching Action Expert for RDT 2
+
+ RDT2-FM conditions on a vision-language backbone ([RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)) and predicts short-horizon **relative action chunks** with an action expert built on an improved RDT architecture.
+ Using a **flow-matching** objective, RDT2-FM delivers **lower inference latency** while preserving strong instruction following and cross-embodiment generalization on UMI-style bimanual setups.
+ Concretely, this repository contains the **action expert** for RDT2-FM.
+
+ [**Home**](https://rdt-robotics.github.io/rdt2/) - [**Github**](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file) - [**Discord**](https://discord.gg/vsZS3zmf9A)
+
+ ---
+
+ ## Table of contents
+
+ * [Highlights](#highlights)
+ * [Model details](#model-details)
+ * [Hardware & software requirements](#hardware--software-requirements)
+ * [Quickstart (inference)](#quickstart-inference)
+ * [Precision settings](#precision-settings)
+ * [Intended uses & limitations](#intended-uses--limitations)
+ * [Troubleshooting](#troubleshooting)
+ * [Changelog](#changelog)
+ * [Citation](#citation)
+ * [Contact](#contact)
+
+ ---
+
+ ## Highlights
+
+ * **Low-latency control**: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
+ * **Zero-shot cross-embodiment**: Designed to work with bimanual platforms (e.g., **UR5e**, **Franka FR3**) after proper calibration.
+ * **Scales with RDT2-VQ**: Pairs with the VLM backbone (**[RDT2-VQ](https://huggingface.co/robotics-diffusion-transformer/RDT2-VQ)**) trained on **10k+ hours** and **100+ scenes** of UMI manipulation data.
+
+ ---
+
+ ## Model details
+
+ ### Architecture
+
+ * **Backbone**: Vision-language backbone such as **RDT2-VQ** (based on Qwen2.5-VL-7B).
+ * **Action head**: **Flow-Matching (FM)** expert mapping observations + instruction → continuous actions.
+ * **Observation**: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics. See the preprocessing sketch below.
+ * **Instruction**: Short imperative text, recommended format **“Verb + Object.”** (e.g., “Pick up the apple.”).
+
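+ To make the expected observation format concrete, here is a minimal, hypothetical preprocessing sketch. The helper name and the use of Pillow are assumptions for illustration, not part of the official API:
+
+ ```python
+ import numpy as np
+ from PIL import Image
+
+ def prepare_observation(rgb: np.ndarray) -> np.ndarray:
+     """Resize an arbitrary HxWx3 uint8 RGB frame to the 384x384 input described above.
+     Hypothetical helper; adapt to your camera pipeline."""
+     assert rgb.ndim == 3 and rgb.shape[2] == 3 and rgb.dtype == np.uint8
+     resized = Image.fromarray(rgb).resize((384, 384), Image.BILINEAR)
+     return np.asarray(resized, dtype=np.uint8)  # (384, 384, 3), uint8
+
+ # Instruction in the recommended "Verb + Object." format
+ instruction = "Pick up the apple."
+ ```
+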
+ ### Action representation (UMI bimanual, per 24-step chunk)
+
+ * 20-D per step = right (10) + left (10):
+   * pos (x, y, z): 3
+   * rot (6D rotation): 6
+   * gripper width: 1
+ * Output tensor shape: **(T=24, D=20)**, relative deltas, `float32` (see the parsing sketch below).
+
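+ The layout above implies fixed index ranges within each 20-D step. A minimal sketch, assuming the right arm occupies indices 0-9 and the left arm indices 10-19 (the helper name is illustrative, not part of the official API):
+
+ ```python
+ import numpy as np
+
+ def split_action_step(step: np.ndarray) -> dict:
+     """Split one 20-D action step into per-arm pos / 6D-rot / gripper components."""
+     assert step.shape == (20,)
+     arms = {}
+     for name, off in (("right", 0), ("left", 10)):  # assumed arm order: right first
+         arms[name] = {
+             "pos": step[off:off + 3],        # relative (x, y, z)
+             "rot6d": step[off + 3:off + 9],  # 6D rotation representation
+             "gripper": step[off + 9],        # gripper width
+         }
+     return arms
+
+ chunk = np.zeros((24, 20), dtype=np.float32)  # placeholder (T=24, D=20) chunk
+ right_pos = split_action_step(chunk[0])["right"]["pos"]
+ ```
+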
+ ---
+
+ ## Hardware & software requirements
+
+ Approximate **single-GPU** requirements:
+
+ | Mode                      |     RAM |    VRAM | Example GPU |
+ | ------------------------- | ------: | ------: | ----------- |
+ | Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090    |
+ | Fine-tuning FM head       |       – | ~ 16 GB | RTX 4090    |
+
+ > For **deployment on real robots**, follow your platform’s **end-effector + camera** choices and perform **[hardware setup & calibration](https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration)** (camera stand/pose, flange, etc.) before running closed-loop policies.
+
+ **Tested OS**: Ubuntu 24.04.
+
+ ---
+
+ ## Quickstart (inference)
+
+ ```python
+ # Run from the root directory of the RDT2 GitHub repo:
+ # https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration
+ import numpy as np
+ import torch
+ import yaml
+
+ from models.rdt_inferencer import RDTInferencer
+
+
+ with open("configs/rdt/post_train.yaml", "r") as f:
+     model_config = yaml.safe_load(f)
+
+ model = RDTInferencer(
+     config=model_config,
+     pretrained_path="robotics-diffusion-transformer/RDT2-FM",
+     # TODO: modify `normalizer_path` to your own downloaded normalizer path,
+     # downloaded from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
+     normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
+     pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",  # use RDT2-VQ as the VLM backbone
+     device="cuda:0",
+     dtype=torch.bfloat16,
+ )
+
+ result = model.step(
+     observations={
+         'images': {
+             # 'exterior_rs': np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
+             'left_stereo': ...,   # left-arm RGB image: np.ndarray of shape (384, 384, 3), dtype=np.uint8
+             'right_stereo': ...,  # right-arm RGB image: np.ndarray of shape (384, 384, 3), dtype=np.uint8
+         },
+         # Pass a zero current state for now; the input interface is
+         # preserved for future fine-tuning.
+         'state': np.zeros(model_config["common"]["state_dim"]).astype(np.float32),
+     },
+     # Language instruction; suggested format is "Verb + Object." with a
+     # capitalized first letter and a trailing period.
+     instruction="Pick up the apple.",
+ )
+
+ # Relative action chunk: np.ndarray of shape (24, 20), dtype=np.float32,
+ # in the same format as RDT2-VQ.
+ action_chunk = result.detach().cpu().numpy()
+
+ # Rescale gripper width from [0, 0.088] to [0, 0.1].
+ for robot_idx in range(2):
+     action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
+ ```
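+
+ Each per-arm rotation in the chunk is a 6D continuous representation (per the action-representation section above). If you need full rotation matrices downstream, the standard Gram-Schmidt reconstruction applies; this is a minimal sketch, assuming the common "first two columns of the rotation matrix" convention (verify against the repo's normalizer before use):
+
+ ```python
+ import numpy as np
+
+ def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
+     """Recover a 3x3 rotation matrix from a 6D rotation via Gram-Schmidt.
+     Column convention is an assumption; check it matches the codebase."""
+     a1, a2 = rot6d[:3], rot6d[3:]
+     b1 = a1 / np.linalg.norm(a1)            # first basis vector
+     a2 = a2 - np.dot(b1, a2) * b1           # remove the b1 component
+     b2 = a2 / np.linalg.norm(a2)            # second basis vector
+     b3 = np.cross(b1, b2)                   # third basis vector
+     return np.stack([b1, b2, b3], axis=-1)  # columns are b1, b2, b3
+ ```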
+
+ > For **installation and fine-tuning instructions**, please refer to the official [GitHub repository](https://github.com/thu-ml/RDT2).
+
+ ---
+
+ ## Precision settings
+
+ * **RDT2-FM (action expert)**: `bfloat16` for training and inference.
+ * **RDT2-VQ (VLM backbone)**: `bfloat16` by default (following Qwen2.5-VL practice). A dtype-selection sketch follows below.
+
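+ Not every GPU supports `bfloat16`. As a hedged convenience sketch (assuming a `float32` fallback is acceptable for your accuracy/latency budget), the dtype can be chosen at startup:
+
+ ```python
+ import torch
+
+ # Prefer bfloat16 where the hardware supports it; otherwise fall back to float32.
+ if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
+     dtype = torch.bfloat16
+ else:
+     dtype = torch.float32
+ ```
+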
+ ---
+
+ ## Intended uses & limitations
+
+ **Intended uses**
+
+ * Research in **robot manipulation** and **VLA modeling**.
+ * Low-latency, short-horizon control on bimanual systems, after completing the **hardware calibration** steps.
+
+ **Limitations**
+
+ * Performance depends on **calibration quality**, camera placement, and correct normalization.
+ * Shifts in dataset or action statistics can degrade behavior; verify action bounds and normalizer reconstruction when adapting.
+
+ **Safety & responsible use**
+
+ * Always test with **hardware limits** engaged (reduced speed, gravity compensation, E-stop within reach).
+
+ ---
+
+ ## Troubleshooting
+
+ | Symptom | Likely cause | Suggested fix |
+ | ------- | ------------ | ------------- |
+ | Drifting / unstable gripper widths | Scale mismatch | Apply the **LinearNormalizer**; rescale widths ([0, 0.088] → [0, 0.1]); see the check below. |
+ | Poor instruction following | Prompt format / backbone config | Use “**Verb + Object.**”; ensure the backbone is loaded on the same device. |
+
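+ For the first symptom, a quick sanity check on the rescaled chunk can help localize the problem. A minimal sketch (the expected width range comes from the rescaling step in the quickstart; the helper name is illustrative):
+
+ ```python
+ import numpy as np
+
+ def check_gripper_widths(action_chunk: np.ndarray, low: float = 0.0, high: float = 0.1) -> None:
+     """Warn if rescaled gripper widths in a (24, 20) chunk leave the expected range."""
+     widths = action_chunk[:, [9, 19]]  # right- and left-arm gripper-width columns
+     if widths.min() < low or widths.max() > high:
+         print(f"Gripper widths outside [{low}, {high}]: "
+               f"min={widths.min():.4f}, max={widths.max():.4f} (check normalizer/rescaling)")
+ ```
+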
+ ---
+
+ ## Changelog
+
+ * **2025-09**: Initial release of **RDT2-FM** on Hugging Face.
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @misc{rdt2_2025,
+   title  = {RDT 2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
+   author = {RDT Robotics Team},
+   year   = {2025},
+   url    = {https://rdt-robotics.github.io/rdt2/}
+ }
+ ```
+
+ ---
+
+ ## Contact
+
+ * Project page: [https://rdt-robotics.github.io/rdt2/](https://rdt-robotics.github.io/rdt2/)
+ * Organization: [https://huggingface.co/robotics-diffusion-transformer](https://huggingface.co/robotics-diffusion-transformer)
+ * Discord: [https://discord.gg/vsZS3zmf9A](https://discord.gg/vsZS3zmf9A)