---

pipeline_tag: robotics
library_name: transformers
license: cc-by-nc-sa-4.0
tags:
  - vision-language-model
  - manipulation
  - robotics
---


<div align="center">
  <video src="https://cdn-uploads.huggingface.co/production/uploads/678123194248fde89e4fc9bf/_cbIWKHPzffRxIpfmqdFG.mp4" controls autoplay muted playsinline loop width="720"></video>
  <p><em>🏁 Best viewed with sound on</em></p>
</div>


# F1: A Vision Language Action Model Bridging<br>Understanding and Generation to Actions
[![Paper](https://img.shields.io/badge/Paper-arXiv-red.svg)](https://arxiv.org/abs/2509.06951)
[![Code](https://img.shields.io/badge/GitHub-Code-800820?logo=github)](https://github.com/InternRobotics/F1-VLA)
[![Website](https://img.shields.io/badge/Website-Pages-blue.svg)](https://aopolin-lv.github.io/F1-VLA)



## 🚀 Key Innovations

- **🧠 Predictive Inverse Dynamics**: Visual foresight generation for planning-based control
- **🏗️ Mixture-of-Transformer**: Three specialized experts (Understanding, Generation, Action); a toy sketch of this layout follows below
- **📈 Three-Stage Training**: Progressive alignment, pretraining, and adaptation
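
To make these ideas concrete, the toy PyTorch sketch below wires three expert transformers into a predict-then-act loop: the understanding expert fuses observation and instruction tokens, the generation expert produces visual-foresight tokens, and the action expert performs inverse dynamics over both. Every module size, name, and wiring choice here is an illustrative assumption, not the released F1 architecture.

```python
import torch
import torch.nn as nn

class ToyF1(nn.Module):
    """Illustrative-only sketch of a three-expert predict-then-act layout."""

    def __init__(self, dim=256, action_dim=7):
        super().__init__()
        make_expert = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.understanding = make_expert()  # fuses vision + language tokens
        self.generation = make_expert()     # predicts visual-foresight tokens
        self.action = make_expert()         # inverse dynamics over context + foresight
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, obs_tokens, lang_tokens):
        # 1) Understanding: joint context over observation and instruction.
        ctx = self.understanding(torch.cat([obs_tokens, lang_tokens], dim=1))
        # 2) Generation: foresight tokens standing in for the future observation.
        foresight = self.generation(ctx)
        # 3) Action: infer the action connecting the current state to the
        #    predicted future state (the "predictive inverse dynamics" step).
        fused = self.action(torch.cat([ctx, foresight], dim=1))
        return self.action_head(fused.mean(dim=1))

model = ToyF1()
action = model(torch.randn(1, 64, 256), torch.randn(1, 16, 256))
print(action.shape)  # torch.Size([1, 7])
```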

## 🤖 Real-World Robot Experiments



<div style="display: flex; flex-direction: column; align-items: center; gap: 10px;">
  <!-- First row -->
  <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
    <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/arx_v2_long.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/arx_v1_dyna.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/franka_v1_sweep.mp4" type="video/mp4">
    </video>
  </div>
  <!-- Second row -->
  <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
    <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v2_handover.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v3_tea.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v1_flower.mp4" type="video/mp4">
    </video>
  </div>
  <p><em>Diverse manipulation tasks across multiple robot platforms, including pick-and-place, handover, and complex object manipulation.</em></p>
</div>



## 📊 Performance Summary

| Task | Platform | F1 Success Rate | π0 Success Rate | Improvement |
|:--------:|:------------:|:------------------:|:------------:|:---------------:|
| Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
| Adaptation | Franka | 66.7% | 53.3% | +13.4% |
| Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
| Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |

## Usage
Please refer to our official repository, [F1-VLA](https://github.com/InternRobotics/F1-VLA), for installation, fine-tuning, and inference instructions.
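
As a quick start, the snippet below shows one way to fetch the released weights with `huggingface_hub`; the `repo_id` is an assumed placeholder for this model card, and the actual policy construction and inference code should be taken from the F1-VLA repository.

```python
# Minimal sketch: download the checkpoint from the Hugging Face Hub.
# NOTE: the repo_id below is an assumed placeholder; replace it with the
# id of this model card. Policy loading/inference follow the F1-VLA repo.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(repo_id="InternRobotics/F1-VLA")  # assumed repo id
print(f"Checkpoint files are in: {ckpt_dir}")
```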

## 📚 Citation

If you find our work helpful, please cite:

```bibtex
@article{f1_vla_2025,
  title={F1: A Vision Language Action Model Bridging Understanding and Generation to Actions},
  author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Jiangmiao Pang},
  journal={arXiv preprint arXiv:2509.06951},
  year={2025},
  url={https://arxiv.org/abs/2509.06951}
}
```

## License
This work is released under the [CC BY-NC-SA 4.0](LICENSE) license.

## Acknowledgements
This repository builds on [LeRobot](https://github.com/huggingface/lerobot), [Any4LeRobot](https://github.com/Tavish9/any4lerobot/), and [VAR](https://github.com/FoundationVision/VAR).