zR committed
Commit · a190ef4
1 Parent(s): 33b90ca
test
- README.md +41 -23
- README_zh.md +32 -16
    	
        README.md
    CHANGED
    
    | @@ -37,33 +37,34 @@ The table below provides a list of the video generation models we currently offe | |
| 37 | 
             
                <th style="text-align: center;">CogVideoX-5B (Current Repository)</th>
         | 
| 38 | 
             
              </tr>
         | 
| 39 | 
             
              <tr>
         | 
| 40 | 
            -
                <td style="text-align: center;">Model  | 
| 41 | 
            -
                <td style="text-align: center;"> | 
| 42 | 
            -
                <td style="text-align: center;">A larger model  | 
| 43 | 
             
              </tr>
         | 
| 44 | 
             
              <tr>
         | 
| 45 | 
             
                <td style="text-align: center;">Inference Precision</td>
         | 
| 46 | 
            -
                <td style="text-align: center;">FP16, FP32 | 
| 47 | 
            -
                <td style="text-align: center;">BF16, FP32 | 
| 48 | 
             
              </tr>
         | 
| 49 | 
             
              <tr>
         | 
| 50 | 
            -
                <td style="text-align: center;">Inference Speed<br>( | 
| 51 | 
            -
                <td style="text-align: center;">FP16: ~90 s</td>
         | 
| 52 | 
            -
                <td style="text-align: center;">BF16: ~200 s</td>
         | 
| 53 | 
             
              </tr>
         | 
| 54 | 
             
              <tr>
         | 
| 55 | 
            -
                <td style="text-align: center;">Single GPU  | 
| 56 | 
            -
                <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>12GB  | 
| 57 | 
            -
                <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>21GB  | 
| 58 | 
             
              </tr>
         | 
| 59 | 
             
              <tr>
         | 
| 60 | 
            -
                <td style="text-align: center;">Multi-GPU Inference Memory  | 
| 61 | 
            -
                <td  | 
|  | |
| 62 | 
             
              </tr>
         | 
| 63 | 
             
              <tr>
         | 
| 64 | 
            -
                <td style="text-align: center;">Fine- | 
| 65 | 
            -
                <td style="text-align: center;">47 GB (bs=1, LORA)<br> | 
| 66 | 
            -
                <td style="text-align: center;">63 GB (bs=1, LORA)<br> | 
| 67 | 
             
              </tr>
         | 
| 68 | 
             
              <tr>
         | 
| 69 | 
             
                <td style="text-align: center;">Prompt Language</td>
         | 
| @@ -79,15 +80,33 @@ The table below provides a list of the video generation models we currently offe | |
| 79 | 
             
              </tr>
         | 
| 80 | 
             
              <tr>
         | 
| 81 | 
             
                <td style="text-align: center;">Frame Rate</td>
         | 
| 82 | 
            -
                <td colspan="2" style="text-align: center;">8 frames | 
| 83 | 
             
              </tr>
         | 
| 84 | 
             
              <tr>
         | 
| 85 | 
             
                <td style="text-align: center;">Video Resolution</td>
         | 
| 86 | 
             
                <td colspan="2" style="text-align: center;">720 x 480, does not support other resolutions (including fine-tuning)</td>
         | 
| 87 | 
             
              </tr>
         | 
|  | |
|  | |
|  | |
|  | |
|  | |
| 88 | 
             
            </table>
         | 
| 89 |  | 
| 90 | 
            -
            ** | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 91 |  | 
| 92 | 
             
            ## Quick Start 🤗
         | 
| 93 |  | 
| @@ -137,8 +156,6 @@ video = pipe( | |
| 137 | 
             
            export_to_video(video, "output.mp4", fps=8)
         | 
| 138 | 
             
            ```
         | 
| 139 |  | 
| 140 | 
            -
            **Using a single A100 GPU, generating a video with the above configuration takes approximately 200 seconds**
         | 
| 141 | 
            -
             | 
| 142 | 
             
            If the generated video appears “all green” and is not viewable in the default macOS player, this is expected (due to
         | 
| 143 | 
             
            an issue with how OpenCV saves videos). Simply use a different player to view the video.
         | 
| 144 |  | 
| @@ -160,8 +177,9 @@ This model is released under the [CogVideoX LICENSE](LICENSE). | |
| 160 |  | 
| 161 | 
             
            ```
         | 
| 162 | 
             
            @article{yang2024cogvideox,
         | 
| 163 | 
            -
             | 
| 164 | 
            -
             | 
| 165 | 
            -
             | 
|  | |
| 166 | 
             
            }
         | 
| 167 | 
             
            ```
         | 
|  | |
| 37 | 
             
                <th style="text-align: center;">CogVideoX-5B (Current Repository)</th>
         | 
| 38 | 
             
              </tr>
         | 
| 39 | 
             
              <tr>
         | 
| 40 | 
            +
                <td style="text-align: center;">Model Introduction</td>
         | 
| 41 | 
            +
                <td style="text-align: center;">An entry-level model with good compatibility. Low cost for running and secondary development.</td>
         | 
| 42 | 
            +
                <td style="text-align: center;">A larger model with higher video generation quality and better visual effects.</td>
         | 
| 43 | 
             
              </tr>
         | 
| 44 | 
             
              <tr>
         | 
| 45 | 
             
                <td style="text-align: center;">Inference Precision</td>
         | 
| 46 | 
            +
                <td style="text-align: center;">FP16, FP32<br><b>Does NOT support BF16</b> </td>
         | 
| 47 | 
            +
                <td style="text-align: center;">BF16, FP32<br><b>Does NOT support FP16</b> </td>
         | 
| 48 | 
             
              </tr>
         | 
| 49 | 
             
              <tr>
         | 
| 50 | 
            +
                <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
         | 
| 51 | 
            +
                <td style="text-align: center;">FP16: ~90* s</td>
         | 
| 52 | 
            +
                <td style="text-align: center;">BF16: ~200* s</td>
         | 
| 53 | 
             
              </tr>
         | 
| 54 | 
             
              <tr>
         | 
| 55 | 
            +
                <td style="text-align: center;">Single GPU Memory Consumption</td>
         | 
| 56 | 
            +
                <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>12GB* using diffusers</b><br></td>
         | 
| 57 | 
            +
                <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>21GB* using diffusers</b><br></td>
         | 
| 58 | 
             
              </tr>
         | 
| 59 | 
             
              <tr>
         | 
| 60 | 
            +
                <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
         | 
| 61 | 
            +
                <td style="text-align: center;"><b>10GB* using diffusers</b><br></td>
         | 
| 62 | 
            +
                <td style="text-align: center;"><b>15GB* using diffusers</b><br></td>
         | 
| 63 | 
             
              </tr>
         | 
| 64 | 
             
              <tr>
         | 
| 65 | 
            +
                <td style="text-align: center;">Fine-Tuning Memory Consumption (Per GPU)</td>
         | 
| 66 | 
            +
                <td style="text-align: center;">47 GB (bs=1, LORA)<br>61 GB (bs=2, LORA)<br>62GB (bs=1, SFT)</td>
         | 
| 67 | 
            +
                <td style="text-align: center;">63 GB (bs=1, LORA)<br>80 GB (bs=2, LORA)<br>75GB (bs=1, SFT)<br></td>
         | 
| 68 | 
             
              </tr>
         | 
| 69 | 
             
              <tr>
         | 
| 70 | 
             
                <td style="text-align: center;">Prompt Language</td>
         | 
|  | |
| 80 | 
             
              </tr>
         | 
| 81 | 
             
              <tr>
         | 
| 82 | 
             
                <td style="text-align: center;">Frame Rate</td>
         | 
| 83 | 
            +
                <td colspan="2" style="text-align: center;">8 frames per second</td>
         | 
| 84 | 
             
              </tr>
         | 
| 85 | 
             
              <tr>
         | 
| 86 | 
             
                <td style="text-align: center;">Video Resolution</td>
         | 
| 87 | 
             
                <td colspan="2" style="text-align: center;">720 x 480, does not support other resolutions (including fine-tuning)</td>
         | 
| 88 | 
             
              </tr>
         | 
| 89 | 
            +
              <tr>
         | 
| 90 | 
            +
                <td style="text-align: center;">Positional Encoding</td>
         | 
| 91 | 
            +
                <td style="text-align: center;">3d_sincos_pos_embed</td>
         | 
| 92 | 
            +
                <td style="text-align: center;">3d_rope_pos_embed<br></td>
         | 
| 93 | 
            +
              </tr>
         | 
| 94 | 
             
            </table>
         | 
| 95 |  | 
| 96 | 
            +
            **Data Explanation**
         | 
| 97 | 
            +
             | 
| 98 | 
            +
            + When testing with the diffusers library, the `enable_model_cpu_offload()` and `pipe.vae.enable_tiling()` options were
         | 
| 99 | 
            +
              enabled. Memory usage was measured only on **NVIDIA A100 / H100** devices, but the configuration should generally work on
         | 
| 100 | 
            +
              all devices with the **NVIDIA Ampere architecture** or newer. Disabling these optimizations significantly increases memory
         | 
| 101 | 
            +
              usage, with peak usage approximately 3 times the values shown in the table.
         | 
| 102 | 
            +
            + For multi-GPU inference, `enable_model_cpu_offload()` must be disabled.
         | 
| 103 | 
            +
            + Inference speed tests used the above memory optimization options. Without these optimizations, inference speed
         | 
| 104 | 
            +
              increases by around 10%.
         | 
| 105 | 
            +
            + The model supports only English input. Prompts in other languages can be translated into English while they are
         | 
| 106 | 
            +
              being refined with a large language model.
         | 
| 107 | 
            +
             | 
| 108 | 
            +
            + **Note** Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of the SAT-version
         | 
| 109 | 
            +
              models. Feel free to visit our GitHub for more information.
         | 
| 110 |  | 
| 111 | 
             
            ## Quick Start 🤗
         | 
| 112 |  | 
|  | |
| 156 | 
             
            export_to_video(video, "output.mp4", fps=8)
         | 
| 157 | 
             
            ```
         | 
| 158 |  | 
|  | |
|  | |
| 159 | 
             
            If the generated video appears “all green” and is not viewable in the default macOS player, this is expected (due to
         | 
| 160 | 
             
            an issue with how OpenCV saves videos). Simply use a different player to view the video.
         | 
| 161 |  | 
|  | |
| 177 |  | 
| 178 | 
             
            ```
         | 
| 179 | 
             
            @article{yang2024cogvideox,
         | 
| 180 | 
            +
              title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
         | 
| 181 | 
            +
              author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
         | 
| 182 | 
            +
              journal={arXiv preprint arXiv:2408.06072},
         | 
| 183 | 
            +
              year={2024}
         | 
| 184 | 
             
            }
         | 
| 185 | 
             
            ```
         | 
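The Data Explanation notes added in this commit can be read as a small configuration rule: enable both memory optimizations on a single GPU, and skip `enable_model_cpu_offload()` for multi-GPU inference. A minimal sketch of that rule follows. The helper name `apply_memory_optimizations` and the `multi_gpu` flag are hypothetical and not part of diffusers; only the two method calls named in the README are used, and `pipe` is assumed to be an already loaded CogVideoX diffusers pipeline. Keeping VAE tiling enabled in multi-GPU mode is also an assumption, since the notes only say that CPU offload must be disabled there.

```python
def apply_memory_optimizations(pipe, multi_gpu=False):
    """Apply the memory options described in the README notes.

    `pipe` is assumed to be a loaded diffusers pipeline exposing
    `enable_model_cpu_offload()` and `vae.enable_tiling()`.
    `multi_gpu` is a hypothetical flag introduced for this sketch.
    Returns the names of the options that were applied.
    """
    applied = []
    if not multi_gpu:
        # Per the notes, CPU offload must be disabled for multi-GPU inference,
        # so it is only enabled in the single-GPU case.
        pipe.enable_model_cpu_offload()
        applied.append("enable_model_cpu_offload")
    # Assumption: VAE tiling is kept in both modes, because the notes
    # restrict only CPU offload for multi-GPU inference.
    pipe.vae.enable_tiling()
    applied.append("vae.enable_tiling")
    return applied
```

Calling `apply_memory_optimizations(pipe)` before running the pipeline would correspond to the configuration under which the starred table values were measured.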
    	
        README_zh.md
    CHANGED
    
    | @@ -29,22 +29,23 @@ CogVideoX是 [清影](https://chatglm.cn/video) 同源的开源版本视频生 | |
| 29 | 
             
              </tr>
         | 
| 30 | 
             
              <tr>
         | 
| 31 | 
             
                <td style="text-align: center;">Inference Precision</td>
         | 
| 32 | 
            -
                <td style="text-align: center;">FP16, FP32 | 
| 33 | 
            -
                <td style="text-align: center;">BF16, FP32 | 
| 34 | 
             
              </tr>
         | 
| 35 | 
             
              <tr>
         | 
| 36 | 
            -
                <td style="text-align: center;">Inference Speed<br>( | 
| 37 | 
            -
                <td style="text-align: center;">FP16: ~90 s</td>
         | 
| 38 | 
            -
                <td style="text-align: center;">BF16: ~200 s</td>
         | 
| 39 | 
             
              </tr>
         | 
| 40 | 
             
              <tr>
         | 
| 41 | 
            -
                <td style="text-align: center;">Single GPU | 
| 42 | 
            -
                <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>12GB  | 
| 43 | 
            -
                <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br>21GB  | 
| 44 | 
             
              </tr>
         | 
| 45 | 
             
              <tr>
         | 
| 46 | 
             
                <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
         | 
| 47 | 
            -
                <td  | 
|  | |
| 48 | 
             
              </tr>
         | 
| 49 | 
             
              <tr>
         | 
| 50 | 
             
                <td style="text-align: center;">Fine-Tuning Memory Consumption (Per GPU)</td>
         | 
| @@ -61,7 +62,7 @@ CogVideoX是 [清影](https://chatglm.cn/video) 同源的开源版本视频生 | |
| 61 | 
             
              </tr>
         | 
| 62 | 
             
              <tr>
         | 
| 63 | 
             
                <td style="text-align: center;">Video Length</td>
         | 
| 64 | 
            -
                <td colspan="2" style="text-align: center;">6  | 
| 65 | 
             
              </tr>
         | 
| 66 | 
             
              <tr>
         | 
| 67 | 
             
                <td style="text-align: center;">Frame Rate</td>
         | 
| @@ -71,9 +72,25 @@ CogVideoX是 [清影](https://chatglm.cn/video) 同源的开源版本视频生 | |
| 71 | 
             
                <td style="text-align: center;">Video Resolution</td>
         | 
| 72 | 
             
                <td colspan="2" style="text-align: center;">720 x 480, does not support other resolutions (including fine-tuning)</td>
         | 
| 73 | 
             
              </tr>
         | 
|  | |
|  | |
|  | |
|  | |
|  | |
| 74 | 
             
            </table>
         | 
| 75 |  | 
| 76 | 
            -
             | 
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
|  | |
| 77 |  | 
| 78 | 
             
            ## Quick Start 🤗
         | 
| 79 |  | 
| @@ -122,8 +139,6 @@ video = pipe( | |
| 122 | 
             
            export_to_video(video, "output.mp4", fps=8)
         | 
| 123 | 
             
            ```
         | 
| 124 |  | 
| 125 | 
            -
            **Using a single A100 GPU, generating a video with the above configuration takes approximately 200 seconds**.
         | 
| 126 | 
            -
             | 
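| 127 | 
             
            If the generated video appears "all green" and cannot be viewed in the default macOS player, this is expected (caused by an issue with how OpenCV saves videos). Simply use a different player to view it.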
         | 
| 128 |  | 
| 129 | 
             
            ## Deep Dive
         | 
| @@ -144,8 +159,9 @@ export_to_video(video, "output.mp4", fps=8) | |
| 144 |  | 
| 145 | 
             
            ```
         | 
| 146 | 
             
            @article{yang2024cogvideox,
         | 
| 147 | 
            -
             | 
| 148 | 
            -
             | 
| 149 | 
            -
             | 
|  | |
| 150 | 
             
            }
         | 
| 151 | 
             
            ```
         | 
|  | |
| 29 | 
             
              </tr>
         | 
| 30 | 
             
              <tr>
         | 
| 31 | 
             
                <td style="text-align: center;">Inference Precision</td>
         | 
| 32 | 
            +
                <td style="text-align: center;">FP16, FP32<br><b>Does NOT support BF16</b> </td>
         | 
| 33 | 
            +
                <td style="text-align: center;">BF16, FP32<br><b>Does NOT support FP16</b> </td>
         | 
| 34 | 
             
              </tr>
         | 
| 35 | 
             
              <tr>
         | 
| 36 | 
            +
                <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
         | 
| 37 | 
            +
                <td style="text-align: center;">FP16: ~90* s</td>
         | 
| 38 | 
            +
                <td style="text-align: center;">BF16: ~200* s</td>
         | 
| 39 | 
             
              </tr>
         | 
| 40 | 
             
              <tr>
         | 
| 41 | 
            +
                <td style="text-align: center;">Single GPU Memory Consumption<br></td>
         | 
| 42 | 
            +
                <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>12GB* using diffusers</b><br></td>
         | 
| 43 | 
            +
                <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>21GB* using diffusers</b><br></td>
         | 
| 44 | 
             
              </tr>
         | 
| 45 | 
             
              <tr>
         | 
| 46 | 
             
                <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
         | 
| 47 | 
            +
                <td style="text-align: center;"><b>10GB* using diffusers</b><br></td>
         | 
| 48 | 
            +
                <td style="text-align: center;"><b>15GB* using diffusers</b><br></td>
         | 
| 49 | 
             
              </tr>
         | 
| 50 | 
             
              <tr>
         | 
| 51 | 
             
                <td style="text-align: center;">Fine-Tuning Memory Consumption (Per GPU)</td>
         | 
|  | |
| 62 | 
             
              </tr>
         | 
| 63 | 
             
              <tr>
         | 
| 64 | 
             
                <td style="text-align: center;">Video Length</td>
         | 
| 65 | 
            +
                <td colspan="2" style="text-align: center;">6 seconds</td>
         | 
| 66 | 
             
              </tr>
         | 
| 67 | 
             
              <tr>
         | 
| 68 | 
             
                <td style="text-align: center;">Frame Rate</td>
         | 
|  | |
| 72 | 
             
                <td style="text-align: center;">Video Resolution</td>
         | 
| 73 | 
             
                <td colspan="2" style="text-align: center;">720 x 480, does not support other resolutions (including fine-tuning)</td>
         | 
| 74 | 
             
              </tr>
         | 
| 75 | 
            +
                <tr>
         | 
| 76 | 
            +
                <td style="text-align: center;">Positional Encoding</td>
         | 
| 77 | 
            +
                <td style="text-align: center;">3d_sincos_pos_embed</td>
         | 
| 78 | 
            +
                <td style="text-align: center;">3d_rope_pos_embed<br></td>
         | 
| 79 | 
            +
              </tr>
         | 
| 80 | 
             
            </table>
         | 
| 81 |  | 
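| 82 | 
            +
            **Data Explanation**
         | 
| 83 | 
            +
             | 
| 84 | 
            +
            + When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()` optimization
         | 
| 85 | 
            +
              were enabled. Actual memory usage was not measured on devices other than **NVIDIA A100 / H100**, but the configuration should
         | 
| 86 | 
            +
              generally work on all devices with the **NVIDIA Ampere architecture** or newer. Disabling these optimizations multiplies memory usage, with peak usage approximately 3 times the values shown in the table.
         | 
| 87 | 
            +
            + For multi-GPU inference, the `enable_model_cpu_offload()` optimization must be disabled.
         | 
| 88 | 
            +
            + Inference speed tests also used the memory optimizations above. Without them, inference speed increases by about 10%.
         | 
| 89 | 
            +
            + The model supports only English input. Prompts in other languages can be translated into English while they are being refined with a large language model.
         | 
| 90 | 
            +
             | 
| 91 | 
            +
            **Note**
         | 
| 92 | 
            +
             | 
| 93 | 
            +
            + Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of the SAT-version models. Feel free to visit our GitHub for more information.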
         | 
| 94 |  | 
| 95 | 
             
            ## Quick Start 🤗
         | 
| 96 |  | 
|  | |
| 139 | 
             
            export_to_video(video, "output.mp4", fps=8)
         | 
| 140 | 
             
            ```
         | 
| 141 |  | 
|  | |
|  | |
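| 142 | 
             
            If the generated video appears "all green" and cannot be viewed in the default macOS player, this is expected (caused by an issue with how OpenCV saves videos). Simply use a different player to view it.
         | 
| 143 |  | 
| 144 | 
             
            ## Deep Dive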
         | 
|  | |
| 159 |  | 
| 160 | 
             
            ```
         | 
| 161 | 
             
            @article{yang2024cogvideox,
         | 
| 162 | 
            +
              title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
         | 
| 163 | 
            +
              author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
         | 
| 164 | 
            +
              journal={arXiv preprint arXiv:2408.06072},
         | 
| 165 | 
            +
              year={2024}
         | 
| 166 | 
             
            }
         | 
| 167 | 
             
            ```
         | 
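The precision and memory rows of the updated table, together with the note that peak usage is roughly 3 times higher without the optimizations, lend themselves to a small lookup. The sketch below is illustrative only: the function and constant names are hypothetical, the figures are copied from the diffusers single-GPU row of the table, and the key `CogVideoX-2B` for the entry-level column is an assumption, since that column's header falls outside the hunks shown in this diff.

```python
# Figures copied from the table in this diff
# (diffusers, single GPU, memory optimizations enabled).
MODEL_TABLE = {
    # Assumed name for the entry-level column; its header is outside the diff hunks.
    "CogVideoX-2B": {"precisions": ("FP16", "FP32"), "single_gpu_gb": 12},
    "CogVideoX-5B": {"precisions": ("BF16", "FP32"), "single_gpu_gb": 21},
}

# The notes state that peak usage is approximately 3x the table value
# when the memory optimizations are disabled.
UNOPTIMIZED_FACTOR = 3


def check_precision(model, precision):
    """Return the normalized precision name, or raise if the table does not list it."""
    normalized = precision.upper()
    if normalized not in MODEL_TABLE[model]["precisions"]:
        raise ValueError(f"{model} does not support {normalized}")
    return normalized


def estimated_peak_gb(model, optimizations_enabled=True):
    """Rough single-GPU peak memory in GB, based only on the table and the 3x note."""
    base = MODEL_TABLE[model]["single_gpu_gb"]
    return base if optimizations_enabled else base * UNOPTIMIZED_FACTOR
```

For example, `estimated_peak_gb("CogVideoX-5B", optimizations_enabled=False)` evaluates to 63, that is, 3 x 21 GB, while `check_precision("CogVideoX-5B", "FP16")` raises a `ValueError` because the table marks FP16 as unsupported for that model.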
