rrui committed on
Commit 739f9c4 · verified · 1 Parent(s): c425043

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -34,8 +34,17 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  assets/comfyui.png filter=lfs diff=lfs merge=lfs -text
37
  assets/flux-text.png filter=lfs diff=lfs merge=lfs -text
38
  assets/gradio_1.png filter=lfs diff=lfs merge=lfs -text
39
  assets/gradio_2.png filter=lfs diff=lfs merge=lfs -text
40
  assets/method.png filter=lfs diff=lfs merge=lfs -text
41
  assets/method_result.png filter=lfs diff=lfs merge=lfs -text
 
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  assets/comfyui.png filter=lfs diff=lfs merge=lfs -text
37
+ assets/comfyui2.png filter=lfs diff=lfs merge=lfs -text
38
  assets/flux-text.png filter=lfs diff=lfs merge=lfs -text
39
  assets/gradio_1.png filter=lfs diff=lfs merge=lfs -text
40
  assets/gradio_2.png filter=lfs diff=lfs merge=lfs -text
41
  assets/method.png filter=lfs diff=lfs merge=lfs -text
42
  assets/method_result.png filter=lfs diff=lfs merge=lfs -text
43
+ assets/new_img1.png filter=lfs diff=lfs merge=lfs -text
44
+ assets/new_img2.png filter=lfs diff=lfs merge=lfs -text
45
+ assets/ori_img1.png filter=lfs diff=lfs merge=lfs -text
46
+ assets/ori_img2.png filter=lfs diff=lfs merge=lfs -text
47
+ assets/video1.gif filter=lfs diff=lfs merge=lfs -text
48
+ assets/video2.gif filter=lfs diff=lfs merge=lfs -text
49
+ assets/video_end1.png filter=lfs diff=lfs merge=lfs -text
50
+ assets/video_end2.png filter=lfs diff=lfs merge=lfs -text
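Each added line in this hunk marks one asset as a Git LFS-tracked path. As a minimal sketch (the helper name `lfs_attribute_line` is hypothetical and is not part of Git LFS or huggingface_hub), the nine new entries can be generated like this:

```python
# Attributes that Git LFS attaches to a tracked path in .gitattributes.
LFS_ATTRIBUTES = "filter=lfs diff=lfs merge=lfs -text"


def lfs_attribute_line(path: str) -> str:
    """Return the .gitattributes entry that routes `path` through Git LFS.

    Hypothetical helper for illustration; it does no escaping, so it only
    suits simple paths without spaces.
    """
    return f"{path} {LFS_ATTRIBUTES}"


# Assets added in this commit, taken from the hunk above.
new_assets = [
    "assets/comfyui2.png",
    "assets/new_img1.png",
    "assets/new_img2.png",
    "assets/ori_img1.png",
    "assets/ori_img2.png",
    "assets/video1.gif",
    "assets/video2.gif",
    "assets/video_end1.png",
    "assets/video_end2.png",
]

lines = [lfs_attribute_line(p) for p in new_assets]
print("\n".join(lines))
```

Running `git lfs track "<path>"` in a working tree writes an entry of the same form, so the sketch only mirrors what the tooling already produces.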
README.md CHANGED
@@ -1,316 +1,3 @@
1
- ---
2
- license: mit
3
- ---
4
-
5
- # Implementation of FLUX-Text
6
-
7
- FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing
8
-
9
- <a href='https://amap-ml.github.io/FLUX-text/'><img src='https://img.shields.io/badge/Project-Page-green'></a>
10
- <a href='https://arxiv.org/abs/2505.03329'><img src='https://img.shields.io/badge/Technique-Report-red'></a>
11
- <a href="https://huggingface.co/GD-ML/FLUX-Text/"><img src="https://img.shields.io/badge/🤗_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"></a>
12
- <!-- <a ><img src="https://img.shields.io/badge/🤗_HuggingFace-Model-ffbd45.svg" alt="HuggingFace"></a> -->
13
-
14
- > *[Rui Lan](https://scholar.google.com/citations?user=zwVlWXwAAAAJ&hl=zh-CN), [Yancheng Bai](https://scholar.google.com/citations?hl=zh-CN&user=Ilx8WNkAAAAJ&view_op=list_works&sortby=pubdate), [Xu Duan](https://scholar.google.com/citations?hl=zh-CN&user=EEUiFbwAAAAJ), [Mingxing Li](https://scholar.google.com/citations?hl=zh-CN&user=-pfkprkAAAAJ), [Lei Sun](https://allylei.github.io), [Xiangxiang Chu](https://scholar.google.com/citations?hl=zh-CN&user=jn21pUsAAAAJ&view_op=list_works&sortby=pubdate)*
15
- > <br>
16
- > Alibaba Group
17
-
18
- <img src='assets/flux-text.png'>
19
-
20
- ## 📖 Overview
21
- * **Motivation:** Scene text editing is a challenging task that aims to modify or add text in images while maintaining the fidelity of the newly generated text and its visual coherence with the background. The main challenge is editing multi-line text across diverse text attributes (e.g., fonts, sizes, and styles), languages (e.g., English, Chinese), and visual scenarios (e.g., posters, advertising, gaming).
22
- * **Contribution:** We propose FLUX-Text, a novel text editing framework for editing multi-line text in complex visual scenes. By incorporating a lightweight Condition Injection LoRA module, a regional text perceptual loss, and a two-stage training strategy, we achieve significant improvements on both Chinese and English benchmarks.
23
- <img src='assets/method.png'>
24
-
25
- ## News
26
-
27
- - **2025-07-16**: 🔥 Updated the ComfyUI node. We have decoupled the FLUX-Text node to support the use of more basic nodes. Because node computation differs in ComfyUI, set min_length to 512 in the [code](https://github.com/comfyanonymous/ComfyUI/blob/master/comfy/text_encoders/flux.py#L12) if you need more consistent results.
28
-
29
- <div align="center">
30
- <table>
31
- <tr>
32
- <td><img src="assets/comfyui2.png" alt="workflow/FLUX-Text-Basic-Workflow.json" width="400"/></td>
33
- </tr>
34
- <tr>
35
- <td align="center">workflow/FLUX-Text-Basic-Workflow.json</td>
36
- </tr>
37
- </table>
38
- </div>
39
-
40
- - **2025-07-13**: 🔥 The training code has been updated. The code now supports multi-scale training.
41
-
42
- - **2025-07-13**: 🔥 Updated the low-VRAM version of the Gradio demo, which currently requires 25GB of VRAM to run. We look forward to more efficient, lower-memory solutions from the community.
43
-
44
- - **2025-07-08**: 🔥 A ComfyUI node is now supported! You can build a workflow based on FLUX-Text for editing posters. It is well worth setting up a workflow that automatically enhances the service information and service scope shown in product images. Using the first and last frames also enables the creation of video data with text effects. Thanks to the [community work](https://github.com/AMAP-ML/FluxText/issues/4), FLUX-Text has been run on 8GB of VRAM.
45
-
46
- <div align="center">
47
- <table>
48
- <tr>
49
- <td><img src="assets/comfyui.png" alt="workflow/FLUX-Text-Workflow.json" width="400"/></td>
50
- </tr>
51
- <tr>
52
- <td align="center">workflow/FLUX-Text-Workflow.json</td>
53
- </tr>
54
- </table>
55
- </div>
56
-
57
- <div align="center">
58
- <table>
59
- <tr>
60
- <td><img src="assets/ori_img1.png" alt="assets/ori_img1.png" width="200"/></td>
61
- <td><img src="assets/new_img1.png" alt="assets/new_img1.png" width="200"/></td>
62
- <td><img src="assets/ori_img2.png" alt="assets/ori_img2.png" width="200"/></td>
63
- <td><img src="assets/new_img2.png" alt="assets/new_img2.png" width="200"/></td>
64
- </tr>
65
- <tr>
66
- <td align="center">original image</td>
67
- <td align="center">edited image</td>
68
- <td align="center">original image</td>
69
- <td align="center">edited image</td>
70
- </tr>
71
- </table>
72
- </div>
73
-
74
- <div align="center">
75
- <table>
76
- <tr>
77
- <td><img src="assets/video_end1.png" alt="assets/video_end1.png" width="400"/></td>
78
- <td><img src="assets/video1.gif" alt="assets/video1.gif" width="400"/></td>
79
- </tr>
80
- <tr>
81
- <td><img src="assets/video_end2.png" alt="assets/video_end2.png" width="400"/></td>
82
- <td><img src="assets/video2.gif" alt="assets/video2.gif" width="400"/></td>
83
- </tr>
84
- <tr>
85
- <td align="center">last frame</td>
86
- <td align="center">video</td>
87
- </tr>
88
- </table>
89
- </div>
90
-
91
- - **2025-07-04**: 🔥 We have released a Gradio demo! You can now try out FLUX-Text.
92
-
93
- <div align="center">
94
- <table>
95
- <tr>
96
- <td><img src="assets/gradio_1.png" alt="Example 1" width="400"/></td>
97
- <td><img src="assets/gradio_2.png" alt="Example 2" width="400"/></td>
98
- </tr>
99
- <tr>
100
- <td align="center">Example 1</td>
101
- <td align="center">Example 2</td>
102
- </tr>
103
- </table>
104
- </div>
105
-
106
- - **2025-07-03**: 🔥 We have released our [pre-trained checkpoints](https://huggingface.co/GD-ML/FLUX-Text/) on Hugging Face! You can now try out FLUX-Text with the official weights.
107
-
108
- - **2025-06-26**: ⭐️ Inference and evaluation code are released. Once we have ensured that everything is functioning correctly, the new model will be merged into this repository.
109
-
110
- ## Todo List
111
- 1. - [x] Inference code
112
- 2. - [x] Pre-trained weights
113
- 3. - [x] Gradio demo
114
- 4. - [x] ComfyUI
115
- 5. - [x] Training code
116
-
117
- ## 🛠️ Installation
118
-
119
- We recommend using Python 3.10 and PyTorch with CUDA support. To set up the environment:
120
-
121
- ```bash
122
- # Create a new conda environment
123
- conda create -n flux_text python=3.10
124
- conda activate flux_text
125
-
126
- # Install other dependencies
127
- pip install -r requirements.txt
128
- pip install flash_attn --no-build-isolation
129
- pip install Pillow==9.5.0
130
- ```
131
-
132
- ## 🤗 Model Introduction
133
-
134
- FLUX-Text is an open-source version of the scene text editing model. FLUX-Text can be used for editing posters, emotions, and more. The table below displays the list of text editing models we currently offer, along with their foundational information.
135
-
136
- <table style="border-collapse: collapse; width: 100%;">
137
- <tr>
138
- <th style="text-align: center;">Model Name</th>
139
- <th style="text-align: center;">Image Resolution</th>
140
- <th style="text-align: center;">Memory Usage</th>
141
- <th style="text-align: center;">English Sen.Acc</th>
142
- <th style="text-align: center;">Chinese Sen.Acc</th>
143
- <th style="text-align: center;">Download Link</th>
144
- </tr>
145
- <tr>
146
- <th style="text-align: center;">FLUX-Text-512</th>
147
- <th style="text-align: center;">512*512</th>
148
- <th style="text-align: center;">34G</th>
149
- <th style="text-align: center;">0.8419</th>
150
- <th style="text-align: center;">0.7132</th>
151
- <th style="text-align: center;"><a href="https://huggingface.co/GD-ML/FLUX-Text/tree/main/model_512">🤗 HuggingFace</a></th>
152
- </tr>
153
- <tr>
154
- <th style="text-align: center;">FLUX-Text</th>
155
- <th style="text-align: center;">Multi Resolution</th>
156
- <th style="text-align: center;">34G for (512*512)</th>
157
- <th style="text-align: center;">0.8228</th>
158
- <th style="text-align: center;">0.7161</th>
159
- <th style="text-align: center;"><a href="https://huggingface.co/GD-ML/FLUX-Text/tree/main/model_multisize">🤗 HuggingFace</a></th>
160
- </tr>
161
- </table>
162
-
163
- ## 🔥 ComfyUI
164
-
165
- <details close>
166
- <summary> Installing via GitHub </summary>
167
-
168
- First, install and set up [ComfyUI](https://github.com/comfyanonymous/ComfyUI), and then follow these steps:
169
-
170
- 1. **Clone FLUXText Repository**:
171
- ```shell
172
- git clone https://github.com/AMAP-ML/FluxText.git
173
- ```
174
-
175
- 2. **Install FluxText**:
176
- ```shell
177
- cd FluxText && pip install -r requirements.txt
178
- ```
179
-
180
- 3. **Integrate FluxText Comfy Nodes with ComfyUI**:
181
- - **Symbolic Link (Recommended)**:
182
- ```shell
183
- ln -s $(pwd)/ComfyUI-fluxtext path/to/ComfyUI/custom_nodes/
184
- ```
185
- - **Copy Directory**:
186
- ```shell
187
- cp -r ComfyUI-fluxtext path/to/ComfyUI/custom_nodes/
188
- ```
189
-
190
- </details>
191
-
192
- ## 🔥 Quick Start
193
-
194
- Here's a basic example of using FLUX-Text:
195
-
196
- ```python
197
- import numpy as np
198
- from PIL import Image
199
- import torch
200
- import yaml
201
-
202
- from src.flux.condition import Condition
203
- from src.flux.generate_fill import generate_fill
204
- from src.train.model import OminiModelFIll
205
- from safetensors.torch import load_file
206
-
207
- config_path = ""
208
- lora_path = ""
209
- with open(config_path, "r") as f:
210
-     config = yaml.safe_load(f)
211
- model = OminiModelFIll(
212
- flux_pipe_id=config["flux_path"],
213
- lora_config=config["train"]["lora_config"],
214
- device=f"cuda",
215
- dtype=getattr(torch, config["dtype"]),
216
- optimizer_config=config["train"]["optimizer"],
217
- model_config=config.get("model", {}),
218
- gradient_checkpointing=True,
219
- byt5_encoder_config=None,
220
- )
221
-
222
- state_dict = load_file(lora_path)
223
- state_dict_new = {x.replace('lora_A', 'lora_A.default').replace('lora_B', 'lora_B.default').replace('transformer.', ''): v for x, v in state_dict.items()}
224
- model.transformer.load_state_dict(state_dict_new, strict=False)
225
- pipe = model.flux_pipe
226
-
227
- prompt = "lepto college of education, the written materials on the picture: LESOTHO , COLLEGE OF , RE BONA LESELI LESEL , EDUCATION ."
228
- hint = Image.open("assets/hint.png").resize((512, 512)).convert('RGB')
229
- img = Image.open("assets/hint_imgs.jpg").resize((512, 512))
230
- condition_img = Image.open("assets/hint_imgs_word.png").resize((512, 512)).convert('RGB')
231
- hint = np.array(hint) / 255
232
- condition_img = np.array(condition_img)
233
- condition_img = (255 - condition_img) / 255
234
- condition_img = [condition_img, hint, img]
235
- position_delta = [0, 0]
236
- condition = Condition(
237
- condition_type='word_fill',
238
- condition=condition_img,
239
- position_delta=position_delta,
240
- )
241
- generator = torch.Generator(device="cuda")
242
- res = generate_fill(
243
- pipe,
244
- prompt=prompt,
245
- conditions=[condition],
246
- height=512,
247
- width=512,
248
- generator=generator,
249
- model_config=config.get("model", {}),
250
- default_lora=True,
251
- )
252
- res.images[0].save('flux_fill.png')
253
- ```
254
-
255
- ## 🤗 Gradio
256
-
257
- You can upload a glyph image and a mask image to edit a text region, or use `manual edit` to obtain the glyph image and mask image.
258
-
259
- First, download the model weights and config from [Hugging Face](https://huggingface.co/GD-ML/FLUX-Text).
260
-
261
- ```bash
262
- python app.py --model_path xx.safetensors --config_path config.yaml
263
- ```
264
-
265
- ## 💪🏻 Training
266
-
267
- 1. Download the training dataset [**AnyWord-3M**](https://modelscope.cn/datasets/iic/AnyWord-3M/summary) from ModelScope, unzip all \*.zip files in each subfolder, then open each *\*.json* file and set `data_root` to the path of your own *imgs* folder for each sub-dataset.
268
-
269
- 2. Download the ODM weights from [Hugging Face](https://huggingface.co/GD-ML/FLUX-Text/blob/main/epoch_100.pt).
270
-
271
- 3. (Optional) Download the pretrained weights from [Hugging Face](https://huggingface.co/GD-ML/FLUX-Text).
272
-
273
- 4. Run the training scripts. With 48GB of VRAM, you can train at 512×512 resolution with a batch size of 2.
274
-
275
- ```bash
276
- bash train/script/train_word.sh
277
- ```
278
-
279
-
280
- ## 📊 Evaluation
281
-
282
- For the [AnyText-benchmark](https://modelscope.cn/datasets/iic/AnyText-benchmark/summary), set **config_path**, **model_path**, **json_path**, and **output_dir** in `eval/gen_imgs_anytext.sh`, then generate the text editing results.
283
-
284
- ```bash
285
- bash eval/gen_imgs_anytext.sh
286
- ```
287
-
288
- For `Sen.ACC, NED, FID and LPIPS` evaluation, use the scripts in the `eval` folder.
289
-
290
- ```bash
291
- bash eval/eval_ocr.sh
292
- bash eval/eval_fid.sh
293
- bash eval/eval_lpips.sh
294
- ```
295
-
296
- ## 📈 Results
297
-
298
- <img src='assets/method_result.png'>
299
-
300
- ## 🌹 Acknowledgement
301
-
302
- Our work is primarily based on [OminiControl](https://github.com/Yuanshi9815/OminiControl), [AnyText](https://github.com/tyxsspa/AnyText), [Open-Sora](https://github.com/hpcaitech/Open-Sora), and [Phantom](https://github.com/Phantom-video/Phantom). We are sincerely grateful for their excellent work.
303
-
304
- ## 📚 Citation
305
-
306
- If you find our paper and code helpful for your research, please consider starring our repository ⭐ and citing our work ✏️.
307
- ```bibtex
308
- @misc{lan2025fluxtext,
309
- title={FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing},
310
- author={Rui Lan and Yancheng Bai and Xu Duan and Mingxing Li and Lei Sun and Xiangxiang Chu},
311
- year={2025},
312
- eprint={2505.03329},
313
- archivePrefix={arXiv},
314
- primaryClass={cs.CV}
315
- }
316
- ```
 
1
+ ---
2
+ license: mit
3
+ ---
assets/comfyui2.png ADDED

Git LFS Details

  • SHA256: bf568344d296aec9e5f2672a3fbf98adaddfe9c872bf21e7695e4b48a85655fd
  • Pointer size: 132 Bytes
  • Size of remote file: 1.96 MB
assets/new_img1.png ADDED

Git LFS Details

  • SHA256: 56255c96c300c2f7da95234e1ad5ef10194fef2608cbb79e475747edd73b833b
  • Pointer size: 131 Bytes
  • Size of remote file: 612 kB
assets/new_img2.png ADDED

Git LFS Details

  • SHA256: 94fd932e5fe902898f26773d60bf4ce582e2eeb2efe2893c305a6b3d6fc11b5c
  • Pointer size: 131 Bytes
  • Size of remote file: 427 kB
assets/ori_img1.png ADDED

Git LFS Details

  • SHA256: 49978502989564fe8ad02321a173bd96144ce3c11db6ec54e3f315a12848b9d1
  • Pointer size: 131 Bytes
  • Size of remote file: 710 kB
assets/ori_img2.png ADDED

Git LFS Details

  • SHA256: 05d55516b09063dfea12e20a1b3f4e6c888bd26c8d66688c3f4eb69b62c5c21c
  • Pointer size: 132 Bytes
  • Size of remote file: 1.65 MB
assets/video1.gif ADDED

Git LFS Details

  • SHA256: 4205b11706d91ed4824ebd11806232e5531be9f8b3ecb9208ce3e6ad019c8a81
  • Pointer size: 132 Bytes
  • Size of remote file: 2.68 MB
assets/video2.gif ADDED

Git LFS Details

  • SHA256: 173a0842f1f3e0710651014744f573183a656895d31a33b23c8813a1dac6c8fc
  • Pointer size: 132 Bytes
  • Size of remote file: 1.18 MB
assets/video_end1.png ADDED

Git LFS Details

  • SHA256: 0007264768ba24078ca823f3e735467d23350f774f8f6e6b33ff5ff35ee4bbb5
  • Pointer size: 131 Bytes
  • Size of remote file: 954 kB
assets/video_end2.png ADDED

Git LFS Details

  • SHA256: 6de39544a6fd00d27c772774d925dde2559b44dba8723feff220a166a96c0692
  • Pointer size: 131 Bytes
  • Size of remote file: 695 kB
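
The "Pointer size" values in the listings above refer to the small text file that Git stores in place of each asset. The sketch below assumes the Git LFS v1 pointer layout (a `version` line, an `oid sha256:` line, and a `size` line). The function name `build_lfs_pointer` is hypothetical, and the byte counts passed in are illustrative stand-ins, since the page shows only rounded sizes.

```python
def build_lfs_pointer(sha256_hex: str, size_bytes: int) -> bytes:
    """Build the contents of a Git LFS v1 pointer file (illustrative helper)."""
    if len(sha256_hex) != 64:
        raise ValueError("expected a 64-character hex SHA256 digest")
    text = (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha256_hex}\n"
        f"size {size_bytes}\n"
    )
    return text.encode("utf-8")


# Digests are copied from the listings above; the exact byte counts are
# assumptions chosen only to have the right number of digits.
large = build_lfs_pointer(
    "bf568344d296aec9e5f2672a3fbf98adaddfe9c872bf21e7695e4b48a85655fd",
    1_960_000,  # assumed size for a file listed as 1.96 MB (7 digits)
)
small = build_lfs_pointer(
    "56255c96c300c2f7da95234e1ad5ef10194fef2608cbb79e475747edd73b833b",
    612_000,  # assumed size for a file listed as 612 kB (6 digits)
)

print(len(large), len(small))  # 132 131
```

Under this layout the pointer length is 125 bytes plus the number of digits in the file size, which is consistent with the listings: files of 1 MB or more show 132 bytes and the smaller ones show 131 bytes.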