Spaces: jbilcke-hf/VideoModelStudio

Julian Bilcke committed on Mar 6

Commit

14d083b

1 Parent(s): 2a0dc25

making our code more robust

Files changed (11)

docs/finetrainers/documentation_models_cogvideox.md +50 -0
docs/finetrainers/documentation_models_hunyuan_video.md +7 -141
docs/finetrainers/documentation_models_ltx_video.md +7 -161
docs/finetrainers/documentation_models_wan.md +10 -3
vms/config.py +56 -10
vms/services/trainer.py +296 -82
vms/tabs/train_tab.py +88 -11
vms/ui/video_trainer_ui.py +37 -5
vms/utils/__init__.py +7 -1
vms/utils/finetrainers_utils.py +83 -13
vms/utils/gpu_detector.py +59 -0

docs/finetrainers/documentation_models_cogvideox.md ADDED

@@ -0,0 +1,50 @@

+# CogVideoX
+## Training
+For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
+Examples available:
+- [PIKA crush effect](../../examples/training/sft/cogvideox/crush_smol_lora/)
+To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):
+```bash
+chmod +x ./examples/training/sft/cogvideox/crush_smol_lora/train.sh
+./examples/training/sft/cogvideox/crush_smol_lora/train.sh
+```
+On Windows, you will have to modify the script to a compatible format to run it. [TODO(aryan): improve instructions for Windows]
+## Supported checkpoints
+CogVideoX has multiple checkpoints as one can note [here](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce). The following checkpoints were tested with `finetrainers` and are known to be working:
+* [THUDM/CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b)
+* [THUDM/CogVideoX-5B](https://huggingface.co/THUDM/CogVideoX-5B)
+* [THUDM/CogVideoX1.5-5B](https://huggingface.co/THUDM/CogVideoX1.5-5B)
+## Inference
+Assuming your LoRA is saved and pushed to the HF Hub, and named `my-awesome-name/my-awesome-lora`, we can now use the finetuned model for inference:
+```diff
+import torch
+from diffusers import CogVideoXPipeline
+from diffusers.utils import export_to_video
+pipe = CogVideoXPipeline.from_pretrained(
+    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
+).to("cuda")
++ pipe.load_lora_weights("my-awesome-name/my-awesome-lora", adapter_name="cogvideox-lora")
++ pipe.set_adapters(["cogvideox-lora"], [0.75])
+video = pipe("<my-awesome-prompt>").frames[0]
+export_to_video(video, "output.mp4")
+```
+You can refer to the following guides to know more about the model pipeline and performing LoRA inference in `diffusers`:
+* [CogVideoX in Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox)
+* [Load LoRAs for inference](https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference)
+* [Merge LoRAs](https://huggingface.co/docs/diffusers/main/en/using-diffusers/merge_loras)

docs/finetrainers/documentation_models_hunyuan_video.md CHANGED

@@ -4,151 +4,17 @@
 For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
-```bash
-#!/bin/bash
-export WANDB_MODE="offline"
-export NCCL_P2P_DISABLE=1
-export TORCH_NCCL_ENABLE_MONITORING=0
-export FINETRAINERS_LOG_LEVEL=DEBUG
-GPU_IDS="0,1"
-DATA_ROOT="/path/to/dataset"
-CAPTION_COLUMN="prompts.txt"
-VIDEO_COLUMN="videos.txt"
-OUTPUT_DIR="/path/to/models/hunyuan-video/"
-ID_TOKEN="afkx"
-# Model arguments
-model_cmd="--model_name hunyuan_video \
-  --pretrained_model_name_or_path hunyuanvideo-community/HunyuanVideo"
-# Dataset arguments
-dataset_cmd="--data_root $DATA_ROOT \
-  --video_column $VIDEO_COLUMN \
-  --caption_column $CAPTION_COLUMN \
-  --id_token $ID_TOKEN \
-  --video_resolution_buckets 17x512x768 49x512x768 61x512x768 \
-  --caption_dropout_p 0.05"
-# Dataloader arguments
-dataloader_cmd="--dataloader_num_workers 0"
-# Diffusion arguments
-diffusion_cmd=""
-# Training arguments
-training_cmd="--training_type lora \
-  --seed 42 \
-  --batch_size 1 \
-  --train_steps 500 \
-  --rank 128 \
-  --lora_alpha 128 \
-  --target_modules to_q to_k to_v to_out.0 \
-  --gradient_accumulation_steps 1 \
-  --gradient_checkpointing \
-  --checkpointing_steps 500 \
-  --checkpointing_limit 2 \
-  --enable_slicing \
-  --enable_tiling"
-# Optimizer arguments
-optimizer_cmd="--optimizer adamw \
-  --lr 2e-5 \
-  --lr_scheduler constant_with_warmup \
-  --lr_warmup_steps 100 \
-  --lr_num_cycles 1 \
-  --beta1 0.9 \
-  --beta2 0.95 \
-  --weight_decay 1e-4 \
-  --epsilon 1e-8 \
-  --max_grad_norm 1.0"
-# Miscellaneous arguments
-miscellaneous_cmd="--tracker_name finetrainers-hunyuan-video \
-  --output_dir $OUTPUT_DIR \
-  --nccl_timeout 1800 \
-  --report_to wandb"
-cmd="accelerate launch --config_file accelerate_configs/uncompiled_8.yaml --gpu_ids $GPU_IDS train.py \
-  $model_cmd \
-  $dataset_cmd \
-  $dataloader_cmd \
-  $diffusion_cmd \
-  $training_cmd \
-  $optimizer_cmd \
-  $miscellaneous_cmd"
-echo "Running command: $cmd"
-eval $cmd
-echo -ne "-------------------- Finished executing script --------------------\n\n"
-```
-## Memory Usage
-### LoRA
-> [!NOTE]
->
-> The below measurements are done in `torch.bfloat16` precision. Memory usage can further be reduce by passing `--layerwise_upcasting_modules transformer` to the training script. This will cast the model weights to `torch.float8_e4m3fn` or `torch.float8_e5m2`, which halves the memory requirement for model weights. Computation is performed in the dtype set by `--transformer_dtype` (which defaults to `bf16`).
-LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolutions, **without precomputation**:
-```
-Training configuration: {
-    "trainable parameters": 163577856,
-    "total samples": 69,
-    "train epochs": 1,
-    "train steps": 10,
-    "batches per device": 1,
-    "total batches observed per epoch": 69,
-    "train batch size": 1,
-    "gradient accumulation steps": 1
-}
-```
-| stage                   | memory_allocated | max_memory_reserved |
-|:-----------------------:|:----------------:|:-------------------:|
-| before training start   | 38.889           | 39.020              |
-| before validation start | 39.747           | 56.266              |
-| after validation end    | 39.748           | 58.385              |
-| after epoch 1           | 39.748           | 40.910              |
-| after training end      | 25.288           | 40.910              |
-Note: requires about `59` GB of VRAM when validation is performed.
-LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolutions, **with precomputation**:
-```
-Training configuration: {
-    "trainable parameters": 163577856,
-    "total samples": 1,
-    "train epochs": 10,
-    "train steps": 10,
-    "batches per device": 1,
-    "total batches observed per epoch": 1,
-    "train batch size": 1,
-    "gradient accumulation steps": 1
-}
 ```
-| stage                         | memory_allocated | max_memory_reserved |
-|:-----------------------------:|:----------------:|:-------------------:|
-| after precomputing conditions | 14.232           | 14.461              |
-| after precomputing latents    | 14.717           | 17.244              |
-| before training start         | 24.195           | 26.039              |
-| after epoch 1                 | 24.83            | 42.387              |
-| before validation start       | 24.842           | 42.387              |
-| after validation end          | 39.558           | 46.947              |
-| after training end            | 24.842           | 41.039              |
-Note: requires about `47` GB of VRAM with validation. If validation is not performed, the memory usage is reduced to about `42` GB.
-### Full finetuning
-Current, full finetuning is not supported for HunyuanVideo. It goes out of memory (OOM) for `49x512x768` resolutions.
 ## Inference

 For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
+Examples available:
+- [PIKA Dissolve effect](../../examples/training/sft/hunyuan_video/modal_labs_dissolve/)
+To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):
+```bash
+chmod +x ./examples/training/sft/hunyuan_video/modal_labs_dissolve/train.sh
+./examples/training/sft/hunyuan_video/modal_labs_dissolve/train.sh
 ```
+On Windows, you will have to modify the script to a compatible format to run it. [TODO(aryan): improve instructions for Windows]
 ## Inference

docs/finetrainers/documentation_models_ltx_video.md CHANGED

@@ -4,171 +4,17 @@
 For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
-```bash
-#!/bin/bash
-export WANDB_MODE="offline"
-export NCCL_P2P_DISABLE=1
-export TORCH_NCCL_ENABLE_MONITORING=0
-export FINETRAINERS_LOG_LEVEL=DEBUG
-GPU_IDS="0,1"
-DATA_ROOT="/path/to/dataset"
-CAPTION_COLUMN="prompts.txt"
-VIDEO_COLUMN="videos.txt"
-OUTPUT_DIR="/path/to/models/ltx-video/"
-ID_TOKEN="BW_STYLE"
-# Model arguments
-model_cmd="--model_name ltx_video \
-  --pretrained_model_name_or_path Lightricks/LTX-Video"
-# Dataset arguments
-dataset_cmd="--data_root $DATA_ROOT \
-  --video_column $VIDEO_COLUMN \
-  --caption_column $CAPTION_COLUMN \
-  --id_token $ID_TOKEN \
-  --video_resolution_buckets 49x512x768 \
-  --caption_dropout_p 0.05"
-# Dataloader arguments
-dataloader_cmd="--dataloader_num_workers 0"
-# Diffusion arguments
-diffusion_cmd="--flow_weighting_scheme logit_normal"
-# Training arguments
-training_cmd="--training_type lora \
-  --seed 42 \
-  --batch_size 1 \
-  --train_steps 3000 \
-  --rank 128 \
-  --lora_alpha 128 \
-  --target_modules to_q to_k to_v to_out.0 \
-  --gradient_accumulation_steps 4 \
-  --gradient_checkpointing \
-  --checkpointing_steps 500 \
-  --checkpointing_limit 2 \
-  --enable_slicing \
-  --enable_tiling"
-# Optimizer arguments
-optimizer_cmd="--optimizer adamw \
-  --lr 3e-5 \
-  --lr_scheduler constant_with_warmup \
-  --lr_warmup_steps 100 \
-  --lr_num_cycles 1 \
-  --beta1 0.9 \
-  --beta2 0.95 \
-  --weight_decay 1e-4 \
-  --epsilon 1e-8 \
-  --max_grad_norm 1.0"
-# Miscellaneous arguments
-miscellaneous_cmd="--tracker_name finetrainers-ltxv \
-  --output_dir $OUTPUT_DIR \
-  --nccl_timeout 1800 \
-  --report_to wandb"
-cmd="accelerate launch --config_file accelerate_configs/uncompiled_2.yaml --gpu_ids $GPU_IDS train.py \
-  $model_cmd \
-  $dataset_cmd \
-  $dataloader_cmd \
-  $diffusion_cmd \
-  $training_cmd \
-  $optimizer_cmd \
-  $miscellaneous_cmd"
-echo "Running command: $cmd"
-eval $cmd
-echo -ne "-------------------- Finished executing script --------------------\n\n"
-```
-## Memory Usage
-### LoRA
-> [!NOTE]
->
-> The below measurements are done in `torch.bfloat16` precision. Memory usage can further be reduce by passing `--layerwise_upcasting_modules transformer` to the training script. This will cast the model weights to `torch.float8_e4m3fn` or `torch.float8_e5m2`, which halves the memory requirement for model weights. Computation is performed in the dtype set by `--transformer_dtype` (which defaults to `bf16`).
-LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolution, **without precomputation**:
-```
-Training configuration: {
-    "trainable parameters": 117440512,
-    "total samples": 69,
-    "train epochs": 1,
-    "train steps": 10,
-    "batches per device": 1,
-    "total batches observed per epoch": 69,
-    "train batch size": 1,
-    "gradient accumulation steps": 1
-}
-```
-| stage                   | memory_allocated | max_memory_reserved |
-|:-----------------------:|:----------------:|:-------------------:|
-| before training start   | 13.486           | 13.879              |
-| before validation start | 14.146           | 17.623              |
-| after validation end    | 14.146           | 17.623              |
-| after epoch 1           | 14.146           | 17.623              |
-| after training end      | 4.461            | 17.623              |
-Note: requires about `18` GB of VRAM without precomputation.
-LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolution, **with precomputation**:
-```
-Training configuration: {
-    "trainable parameters": 117440512,
-    "total samples": 1,
-    "train epochs": 10,
-    "train steps": 10,
-    "batches per device": 1,
-    "total batches observed per epoch": 1,
-    "train batch size": 1,
-    "gradient accumulation steps": 1
-}
-```
-| stage                         | memory_allocated | max_memory_reserved |
-|:-----------------------------:|:----------------:|:-------------------:|
-| after precomputing conditions | 8.88             | 8.920               |
-| after precomputing latents    | 9.684            | 11.613              |
-| before training start         | 3.809            | 10.010              |
-| after epoch 1                 | 4.26             | 10.916              |
-| before validation start       | 4.26             | 10.916              |
-| after validation end          | 13.924           | 17.262              |
-| after training end            | 4.26             | 14.314              |
-Note: requires about `17.5` GB of VRAM with precomputation. If validation is not performed, the memory usage is reduced to `11` GB.
-### Full Finetuning
-```
-Training configuration: {
-    "trainable parameters": 1923385472,
-    "total samples": 1,
-    "train epochs": 10,
-    "train steps": 10,
-    "batches per device": 1,
-    "total batches observed per epoch": 1,
-    "train batch size": 1,
-    "gradient accumulation steps": 1
-}
 ```
-| stage                         | memory_allocated | max_memory_reserved |
-|:-----------------------------:|:----------------:|:-------------------:|
-| after precomputing conditions | 8.89             | 8.937               |
-| after precomputing latents    | 9.701            | 11.615              |
-| before training start         | 3.583            | 4.025               |
-| after epoch 1                 | 10.769           | 20.357              |
-| before validation start       | 10.769           | 20.357              |
-| after validation end          | 10.769           | 28.332              |
-| after training end            | 10.769           | 12.904              |
 ## Inference

 For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
+Examples available:
+- [PIKA crush effect](../../examples/training/sft/ltx_video/crush_smol_lora/)
+To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):
+```bash
+chmod +x ./examples/training/sft/ltx_video/crush_smol_lora/train.sh
+./examples/training/sft/ltx_video/crush_smol_lora/train.sh
 ```
+On Windows, you will have to modify the script to a compatible format to run it. [TODO(aryan): improve instructions for Windows]
 ## Inference

docs/finetrainers/documentation_models_wan.md CHANGED

@@ -4,11 +4,18 @@
 For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
-See [this](../../examples/training/sft/wan/crush_smol_lora/) example training script for training Wan with Pika Effects Crush.
-## Memory Usage
-TODO
 ## Inference

 For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
+Examples available:
+- [PIKA crush effect](../../examples/training/sft/wan/crush_smol_lora/)
+- [3DGS dissolve](../../examples/training/sft/wan/3dgs_dissolve/)
+To run an example, run the following from the root directory of the repository (assuming you have installed the requirements and are using Linux/WSL):
+```bash
+chmod +x ./examples/training/sft/wan/crush_smol_lora/train.sh
+./examples/training/sft/wan/crush_smol_lora/train.sh
+```
+On Windows, you will have to modify the script to a compatible format to run it. [TODO(aryan): improve instructions for Windows]
 ## Inference

vms/config.py CHANGED

@@ -2,6 +2,8 @@ import os
 from dataclasses import dataclass, field
 from typing import Dict, Any, Optional, List, Tuple
 from pathlib import Path
 def parse_bool_env(env_value: Optional[str]) -> bool:
     """Parse environment variable string to boolean
@@ -71,7 +73,16 @@ TRAINING_TYPES = {
 DEFAULT_SEED = 42
-DEFAULT_NB_TRAINING_STEPS = 1000
 DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS = 200
@@ -87,6 +98,23 @@ DEFAULT_BATCH_SIZE = 1
 DEFAULT_LEARNING_RATE = 3e-5
 # it is best to use resolutions that are powers of 8
 # The resolution should be divisible by 32
 # so we cannot use 1080, 540 etc as they are not divisible by 32
@@ -183,7 +211,10 @@ TRAINING_PRESETS = {
         "learning_rate": 2e-5,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": SMALL_TRAINING_BUCKETS,
-        "flow_weighting_scheme": "none"
     },
     "LTX-Video (normal)": {
         "model_type": "ltx_video",
@@ -195,7 +226,10 @@ TRAINING_PRESETS = {
         "learning_rate": DEFAULT_LEARNING_RATE,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": SMALL_TRAINING_BUCKETS,
-        "flow_weighting_scheme": "logit_normal"
     },
     "LTX-Video (16:9, HQ)": {
         "model_type": "ltx_video",
@@ -207,7 +241,10 @@ TRAINING_PRESETS = {
         "learning_rate": DEFAULT_LEARNING_RATE,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": MEDIUM_19_9_RATIO_BUCKETS,
-        "flow_weighting_scheme": "logit_normal"
     },
     "LTX-Video (Full Finetune)": {
         "model_type": "ltx_video",
@@ -217,7 +254,10 @@ TRAINING_PRESETS = {
         "learning_rate": DEFAULT_LEARNING_RATE,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": SMALL_TRAINING_BUCKETS,
-        "flow_weighting_scheme": "logit_normal"
     },
     "Wan-2.1-T2V (normal)": {
         "model_type": "wan",
@@ -229,7 +269,10 @@ TRAINING_PRESETS = {
         "learning_rate": 5e-5,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": SMALL_TRAINING_BUCKETS,
-        "flow_weighting_scheme": "logit_normal"
     },
     "Wan-2.1-T2V (HQ)": {
         "model_type": "wan",
@@ -241,7 +284,10 @@ TRAINING_PRESETS = {
         "learning_rate": DEFAULT_LEARNING_RATE,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": MEDIUM_19_9_RATIO_BUCKETS,
-        "flow_weighting_scheme": "logit_normal"
     }
 }
@@ -287,7 +333,7 @@ class TrainingConfig:
     seed: int = DEFAULT_SEED
     mixed_precision: str = "bf16"
     batch_size: int = 1
-    train_step: int = DEFAULT_NB_TRAINING_STEPS
     lora_rank: int = DEFAULT_LORA_RANK
     lora_alpha: int = DEFAULT_LORA_ALPHA
     target_modules: List[str] = field(default_factory=lambda: ["to_q", "to_k", "to_v", "to_out.0"])
@@ -301,10 +347,10 @@ class TrainingConfig:
     # Optimizer arguments
     optimizer: str = "adamw"
-    lr: float = 3e-5
     scale_lr: bool = False
     lr_scheduler: str = "constant_with_warmup"
-    lr_warmup_steps: int = 100
     lr_num_cycles: int = 1
     lr_power: float = 1.0
     beta1: float = 0.9

 from dataclasses import dataclass, field
 from typing import Dict, Any, Optional, List, Tuple
 from pathlib import Path
+import torch
+import math
 def parse_bool_env(env_value: Optional[str]) -> bool:
     """Parse environment variable string to boolean
 DEFAULT_SEED = 42
+DEFAULT_REMOVE_COMMON_LLM_CAPTION_PREFIXES = True
+DEFAULT_DATASET_TYPE = "video"
+DEFAULT_TRAINING_TYPE = "lora"
+DEFAULT_RESHAPE_MODE = "bicubic"
+DEFAULT_MIXED_PRECISION = "bf16"
 DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS = 200
 DEFAULT_LEARNING_RATE = 3e-5
+# GPU SETTINGS
+DEFAULT_NUM_GPUS = 1
+DEFAULT_MAX_GPUS = min(8, torch.cuda.device_count() if torch.cuda.is_available() else 1)
+DEFAULT_PRECOMPUTATION_ITEMS = 512
+DEFAULT_NB_TRAINING_STEPS = 1000
+# For this value, it is recommended to use about 20 to 40% of the number of training steps
+DEFAULT_NB_LR_WARMUP_STEPS = math.ceil(0.20 * DEFAULT_NB_TRAINING_STEPS)  # 20% of training steps
+# For validation
+DEFAULT_VALIDATION_NB_STEPS = 50
+DEFAULT_VALIDATION_HEIGHT = 512
+DEFAULT_VALIDATION_WIDTH = 768
+DEFAULT_VALIDATION_NB_FRAMES = 49
+DEFAULT_VALIDATION_FRAMERATE = 8
 # it is best to use resolutions that are powers of 8
 # The resolution should be divisible by 32
 # so we cannot use 1080, 540 etc as they are not divisible by 32
         "learning_rate": 2e-5,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": SMALL_TRAINING_BUCKETS,
+        "flow_weighting_scheme": "none",
+        "num_gpus": DEFAULT_NUM_GPUS,
+        "precomputation_items": DEFAULT_PRECOMPUTATION_ITEMS,
+        "lr_warmup_steps": DEFAULT_NB_LR_WARMUP_STEPS,
     },
     "LTX-Video (normal)": {
         "model_type": "ltx_video",
         "learning_rate": DEFAULT_LEARNING_RATE,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": SMALL_TRAINING_BUCKETS,
+        "flow_weighting_scheme": "none",
+        "num_gpus": DEFAULT_NUM_GPUS,
+        "precomputation_items": DEFAULT_PRECOMPUTATION_ITEMS,
+        "lr_warmup_steps": DEFAULT_NB_LR_WARMUP_STEPS,
     },
     "LTX-Video (16:9, HQ)": {
         "model_type": "ltx_video",
         "learning_rate": DEFAULT_LEARNING_RATE,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": MEDIUM_19_9_RATIO_BUCKETS,
+        "flow_weighting_scheme": "logit_normal",
+        "num_gpus": DEFAULT_NUM_GPUS,
+        "precomputation_items": DEFAULT_PRECOMPUTATION_ITEMS,
+        "lr_warmup_steps": DEFAULT_NB_LR_WARMUP_STEPS,
     },
     "LTX-Video (Full Finetune)": {
         "model_type": "ltx_video",
         "learning_rate": DEFAULT_LEARNING_RATE,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": SMALL_TRAINING_BUCKETS,
+        "flow_weighting_scheme": "logit_normal",
+        "num_gpus": DEFAULT_NUM_GPUS,
+        "precomputation_items": DEFAULT_PRECOMPUTATION_ITEMS,
+        "lr_warmup_steps": DEFAULT_NB_LR_WARMUP_STEPS,
     },
     "Wan-2.1-T2V (normal)": {
         "model_type": "wan",
         "learning_rate": 5e-5,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": SMALL_TRAINING_BUCKETS,
+        "flow_weighting_scheme": "logit_normal",
+        "num_gpus": DEFAULT_NUM_GPUS,
+        "precomputation_items": DEFAULT_PRECOMPUTATION_ITEMS,
+        "lr_warmup_steps": DEFAULT_NB_LR_WARMUP_STEPS,
     },
     "Wan-2.1-T2V (HQ)": {
         "model_type": "wan",
         "learning_rate": DEFAULT_LEARNING_RATE,
         "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
         "training_buckets": MEDIUM_19_9_RATIO_BUCKETS,
+        "flow_weighting_scheme": "logit_normal",
+        "num_gpus": DEFAULT_NUM_GPUS,
+        "precomputation_items": DEFAULT_PRECOMPUTATION_ITEMS,
+        "lr_warmup_steps": DEFAULT_NB_LR_WARMUP_STEPS,
     }
 }
     seed: int = DEFAULT_SEED
     mixed_precision: str = "bf16"
     batch_size: int = 1
+    train_steps: int = DEFAULT_NB_TRAINING_STEPS
     lora_rank: int = DEFAULT_LORA_RANK
     lora_alpha: int = DEFAULT_LORA_ALPHA
     target_modules: List[str] = field(default_factory=lambda: ["to_q", "to_k", "to_v", "to_out.0"])
     # Optimizer arguments
     optimizer: str = "adamw"
+    lr: float = DEFAULT_LEARNING_RATE
     scale_lr: bool = False
     lr_scheduler: str = "constant_with_warmup"
+    lr_warmup_steps: int = DEFAULT_NB_LR_WARMUP_STEPS
     lr_num_cycles: int = 1
     lr_power: float = 1.0
     beta1: float = 0.9
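
Note on the new constants in `vms/config.py`: the warmup default is now derived from the step count (`math.ceil(0.20 * DEFAULT_NB_TRAINING_STEPS)`), and the GPU ceiling is capped at 8 with a fallback of 1 when CUDA is unavailable. The sketch below restates both rules as standalone functions so they can be checked without `torch`. The names `warmup_steps`, `max_gpus`, and the `detected` parameter are hypothetical illustrations and are not part of the commit.

```python
import math

# Values copied from the diff above.
DEFAULT_NB_TRAINING_STEPS = 1000
WARMUP_FRACTION = 0.20  # the diff's comment recommends roughly 20% to 40%
GPU_HARD_CAP = 8


def warmup_steps(train_steps: int, fraction: float = WARMUP_FRACTION) -> int:
    """Hypothetical helper: warmup steps as a fraction of training steps, rounded up."""
    if train_steps < 0:
        raise ValueError("train_steps must be non-negative")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return math.ceil(fraction * train_steps)


def max_gpus(detected: int, hard_cap: int = GPU_HARD_CAP) -> int:
    """Hypothetical helper mirroring DEFAULT_MAX_GPUS.

    `detected` stands in for torch.cuda.device_count(); zero (no CUDA device)
    falls back to 1, and the result never exceeds `hard_cap`.
    """
    return min(hard_cap, detected if detected > 0 else 1)


if __name__ == "__main__":
    print(warmup_steps(DEFAULT_NB_TRAINING_STEPS))
    print(max_gpus(0), max_gpus(4), max_gpus(16))
```

One consequence worth noting: `DEFAULT_NB_LR_WARMUP_STEPS` is computed once from the default step count, so a run that overrides `train_steps` keeps the warmup value derived from 1000 steps unless `lr_warmup_steps` is also set explicitly.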

vms/services/trainer.py CHANGED

@@ -28,9 +28,26 @@ from ..config import (
     DEFAULT_BATCH_SIZE, DEFAULT_CAPTION_DROPOUT_P,
     DEFAULT_LEARNING_RATE,
     DEFAULT_LORA_RANK, DEFAULT_LORA_ALPHA,
-    DEFAULT_LORA_RANK_STR, DEFAULT_LORA_ALPHA_STR
 )
-from ..utils import make_archive, parse_training_log, is_image_file, is_video_file, prepare_finetrainers_dataset, copy_files_to_training_dir
 logger = logging.getLogger(__name__)
@@ -107,18 +124,89 @@ class TrainingService:
     def save_ui_state(self, values: Dict[str, Any]) -> None:
-        """Save current UI state to file"""
         ui_state_file = OUTPUT_PATH / "ui_state.json"
         try:
             with open(ui_state_file, 'w') as f:
-                json.dump(values, f, indent=2)
-            logger.debug(f"UI state saved: {values}")
         except Exception as e:
             logger.error(f"Error saving UI state: {str(e)}")
-    # Additional fix for the load_ui_state method in trainer.py to clean up old values
     def load_ui_state(self) -> Dict[str, Any]:
-        """Load saved UI state"""
         ui_state_file = OUTPUT_PATH / "ui_state.json"
         default_state = {
             "model_type": list(MODEL_TYPES.keys())[0],
@@ -129,7 +217,10 @@ class TrainingService:
             "batch_size": DEFAULT_BATCH_SIZE,
             "learning_rate": DEFAULT_LEARNING_RATE,
             "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
-            "training_preset": list(TRAINING_PRESETS.keys())[0]
         }
         if not ui_state_file.exists():
@@ -149,7 +240,13 @@ class TrainingService:
                     logger.warning("UI state file is empty or contains only whitespace, using default values")
                     return default_state
-                saved_state = json.loads(file_content)
                 # Clean up model type if it contains " (LoRA)" suffix
                 if "model_type" in saved_state and " (LoRA)" in saved_state["model_type"]:
@@ -158,17 +255,36 @@ class TrainingService:
                 # Convert numeric values to appropriate types
                 if "train_steps" in saved_state:
-                    saved_state["train_steps"] = int(saved_state["train_steps"])
                 if "batch_size" in saved_state:
-                    saved_state["batch_size"] = int(saved_state["batch_size"])
                 if "learning_rate" in saved_state:
-                    saved_state["learning_rate"] = float(saved_state["learning_rate"])
                 if "save_iterations" in saved_state:
-                    saved_state["save_iterations"] = int(saved_state["save_iterations"])
                 # Make sure we have all keys (in case structure changed)
                 merged_state = default_state.copy()
-                merged_state.update(saved_state)
                 # Validate model_type is in available choices
                 if merged_state["model_type"] not in MODEL_TYPES:
@@ -203,67 +319,80 @@ class TrainingService:
                     merged_state["training_preset"] = default_state["training_preset"]
                     logger.warning(f"Invalid training preset in saved state, using default")
                 return merged_state
-        except json.JSONDecodeError as e:
-            logger.error(f"Error parsing UI state JSON: {str(e)}")
-            return default_state
         except Exception as e:
             logger.error(f"Error loading UI state: {str(e)}")
             return default_state
     def ensure_valid_ui_state_file(self):
         """Ensure UI state file exists and is valid JSON"""
         ui_state_file = OUTPUT_PATH / "ui_state.json"
         if not ui_state_file.exists():
-            # Create a new file with default values
             logger.info("Creating new UI state file with default values")
-            default_state = {
-                "model_type": list(MODEL_TYPES.keys())[0],
-                "training_type": list(TRAINING_TYPES.keys())[0],
-                "lora_rank": DEFAULT_LORA_RANK_STR,
-                "lora_alpha": DEFAULT_LORA_ALPHA_STR,
-                "train_steps": DEFAULT_NB_TRAINING_STEPS,
-                "batch_size": DEFAULT_BATCH_SIZE,
-                "learning_rate": DEFAULT_LEARNING_RATE,
-                "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
-                "training_preset": list(TRAINING_PRESETS.keys())[0]
-            }
             self.save_ui_state(default_state)
             return
         # Check if file is valid JSON
         try:
             with open(ui_state_file, 'r') as f:
                 file_content = f.read().strip()
                 if not file_content:
-                    raise ValueError("Empty file")
-                json.loads(file_content)
-            logger.debug("UI state file validation successful")
         except Exception as e:
-            logger.warning(f"Invalid UI state file detected: {str(e)}. Creating new one with defaults.")
-            # Backup the invalid file
-            backup_file = ui_state_file.with_suffix('.json.bak')
-            try:
-                shutil.copy2(ui_state_file, backup_file)
-                logger.info(f"Backed up invalid UI state file to {backup_file}")
-            except Exception as backup_error:
-                logger.error(f"Failed to backup invalid UI state file: {str(backup_error)}")
-            # Create a new file with default values
-            default_state = {
-                "model_type": list(MODEL_TYPES.keys())[0],
-                "training_type": list(TRAINING_TYPES.keys())[0],
-                "lora_rank": DEFAULT_LORA_RANK_STR,
-                "lora_alpha": DEFAULT_LORA_ALPHA_STR,
-                "train_steps": DEFAULT_NB_TRAINING_STEPS,
-                "batch_size": DEFAULT_BATCH_SIZE,
-                "learning_rate": DEFAULT_LEARNING_RATE,
-                "save_iterations": DEFAULT_NB_TRAINING_STEPS,
-                "training_preset": list(TRAINING_PRESETS.keys())[0]
-            }
-            self.save_ui_state(default_state)
     # Modify save_session to also store the UI state at training start
     def save_session(self, params: Dict) -> None:
         """Save training session parameters"""
@@ -412,8 +541,12 @@ class TrainingService:
         save_iterations: int,
         repo_id: str,
         preset_name: str,
-        training_type: str = "lora",
         resume_from_checkpoint: Optional[str] = None,
     ) -> Tuple[str, str]:
         """Start training with finetrainers"""
@@ -431,6 +564,10 @@ class TrainingService:
         log_prefix = "Resuming" if is_resuming else "Initializing"
         logger.info(f"{log_prefix} training with model_type={model_type}, training_type={training_type}")
         try:
             # Get absolute paths - FIXED to look in project root instead of within vms directory
             current_dir = Path(__file__).parent.parent.parent.absolute()  # Go up to project root
@@ -459,6 +596,10 @@ class TrainingService:
             logger.info("Current working directory: %s", current_dir)
             logger.info("Training script path: %s", train_script)
             logger.info("Training data path: %s", TRAINING_PATH)
             videos_file, prompts_file = prepare_finetrainers_dataset()
             if videos_file is None or prompts_file is None:
@@ -474,32 +615,45 @@ class TrainingService:
                 logger.error(error_msg)
                 return error_msg, "No training data available"
             # Get preset configuration
             preset = TRAINING_PRESETS[preset_name]
             training_buckets = preset["training_buckets"]
             flow_weighting_scheme = preset.get("flow_weighting_scheme", "none")
             preset_training_type = preset.get("training_type", "lora")
             # Create a proper dataset configuration JSON file
             dataset_config_file = OUTPUT_PATH / "dataset_config.json"
-            # Determine appropriate ID token based on model type
-            id_token = None
-            if model_type == "hunyuan_video":
-                id_token = "afkx"
-            elif model_type == "ltx_video":
-                id_token = "BW_STYLE"
-            # Wan doesn't use an ID token by default, so leave it as None
             dataset_config = {
                 "datasets": [
                     {
                         "data_root": str(TRAINING_PATH),
-                        "dataset_type": "video",
                         "id_token": id_token,
                         "video_resolution_buckets": [[f, h, w] for f, h, w in training_buckets],
-                        "reshape_mode": "bicubic",
-                        "remove_common_llm_caption_prefixes": True
                     }
                 ]
             }
@@ -552,6 +706,16 @@ class TrainingService:
                 logger.error(error_msg)
                 return error_msg, "Unsupported model"
             # Update with UI parameters
             config.train_steps = int(train_steps)
             config.batch_size = int(batch_size)
@@ -560,7 +724,19 @@ class TrainingService:
             config.training_type = training_type
             config.flow_weighting_scheme = flow_weighting_scheme
-            # CRITICAL FIX: Update the dataset_config to point to the JSON file, not the directory
             config.data_root = str(dataset_config_file)
             # Update LoRA parameters if using LoRA training type
@@ -574,7 +750,7 @@ class TrainingService:
                 self.append_log(f"Resuming from checkpoint: {resume_from_checkpoint}")
             # Common settings for both models
-            config.mixed_precision = "bf16"
             config.seed = DEFAULT_SEED
             config.gradient_checkpointing = True
             config.enable_slicing = True
@@ -598,7 +774,7 @@ class TrainingService:
                 torchrun_args = [
                     "torchrun",
                     "--standalone",
-                    "--nproc_per_node=1",
                     "--nnodes=1",
                     "--rdzv_backend=c10d",
                     "--rdzv_endpoint=localhost:0",
@@ -623,11 +799,29 @@ class TrainingService:
                 launch_args = torchrun_args
             else:
                 # For other models, use accelerate launch as before
                 # Configure accelerate parameters
                 accelerate_args = [
                     "accelerate", "launch",
                     "--mixed_precision=bf16",
-                    "--num_processes=1",
                     "--num_machines=1",
                     "--dynamo_backend=no",
                     str(train_script)
@@ -647,7 +841,11 @@ class TrainingService:
             env["WANDB_MODE"] = "offline"
             env["HF_API_TOKEN"] = HF_API_TOKEN
             env["FINETRAINERS_LOG_LEVEL"] = "DEBUG"  # Added for better debugging
             # Start the training process
             process = subprocess.Popen(
                 launch_args + config_args,
@@ -675,6 +873,9 @@ class TrainingService:
                 "batch_size": batch_size,
                 "learning_rate": learning_rate,
                 "save_iterations": save_iterations,
                 "repo_id": repo_id,
                 "start_time": datetime.now().isoformat()
             })
@@ -699,6 +900,10 @@ class TrainingService:
             self.append_log(success_msg)
             logger.info(success_msg)
             return success_msg, self.get_logs()
         except Exception as e:
@@ -1064,19 +1269,28 @@ class TrainingService:
                     if output:
                         # Remove decode() since output is already a string due to universal_newlines=True
                         line = output.strip()
                         if is_error:
-                            #self.append_log(f"ERROR: {line}")
                             #logger.error(line)
-                            #logger.info(line)
-                            self.append_log(line)
-                        else:
-                            self.append_log(line)
-                            # Parse metrics only from stdout
-                            metrics = parse_training_log(line)
-                            if metrics:
-                                status = self.get_status()
-                                status.update(metrics)
-                                self.save_status(**status)
                         return True
                 return False

     DEFAULT_BATCH_SIZE, DEFAULT_CAPTION_DROPOUT_P,
     DEFAULT_LEARNING_RATE,
     DEFAULT_LORA_RANK, DEFAULT_LORA_ALPHA,
+    DEFAULT_LORA_RANK_STR, DEFAULT_LORA_ALPHA_STR,
+    DEFAULT_SEED, DEFAULT_RESHAPE_MODE,
+    DEFAULT_REMOVE_COMMON_LLM_CAPTION_PREFIXES,
+    DEFAULT_DATASET_TYPE, DEFAULT_PROMPT_PREFIX,
+    DEFAULT_MIXED_PRECISION, DEFAULT_TRAINING_TYPE,
+    DEFAULT_NUM_GPUS,
+    DEFAULT_MAX_GPUS,
+    DEFAULT_PRECOMPUTATION_ITEMS,
+    DEFAULT_NB_TRAINING_STEPS,
+    DEFAULT_NB_LR_WARMUP_STEPS
+)
+from ..utils import (
+    get_available_gpu_count,
+    make_archive,
+    parse_training_log,
+    is_image_file,
+    is_video_file,
+    prepare_finetrainers_dataset,
+    copy_files_to_training_dir
 )
 logger = logging.getLogger(__name__)
     def save_ui_state(self, values: Dict[str, Any]) -> None:
+        """Save current UI state to file with validation"""
         ui_state_file = OUTPUT_PATH / "ui_state.json"
+        # Validate values before saving
+        validated_values = {}
+        default_state = {
+            "model_type": list(MODEL_TYPES.keys())[0],
+            "training_type": list(TRAINING_TYPES.keys())[0],
+            "lora_rank": DEFAULT_LORA_RANK_STR,
+            "lora_alpha": DEFAULT_LORA_ALPHA_STR,
+            "train_steps": DEFAULT_NB_TRAINING_STEPS,
+            "batch_size": DEFAULT_BATCH_SIZE,
+            "learning_rate": DEFAULT_LEARNING_RATE,
+            "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
+            "training_preset": list(TRAINING_PRESETS.keys())[0],
+            "num_gpus": DEFAULT_NUM_GPUS,
+            "precomputation_items": DEFAULT_PRECOMPUTATION_ITEMS,
+            "lr_warmup_steps": DEFAULT_NB_LR_WARMUP_STEPS
+        }
+        # Copy default values first
+        validated_values = default_state.copy()
+        # Update with provided values, converting types as needed
+        for key, value in values.items():
+            if key in default_state:
+                if key == "train_steps":
+                    try:
+                        validated_values[key] = int(value)
+                    except (ValueError, TypeError):
+                        validated_values[key] = default_state[key]
+                elif key == "batch_size":
+                    try:
+                        validated_values[key] = int(value)
+                    except (ValueError, TypeError):
+                        validated_values[key] = default_state[key]
+                elif key == "learning_rate":
+                    try:
+                        validated_values[key] = float(value)
+                    except (ValueError, TypeError):
+                        validated_values[key] = default_state[key]
+                elif key == "save_iterations":
+                    try:
+                        validated_values[key] = int(value)
+                    except (ValueError, TypeError):
+                        validated_values[key] = default_state[key]
+                elif key == "lora_rank" and value not in ["16", "32", "64", "128", "256", "512", "1024"]:
+                    validated_values[key] = default_state[key]
+                elif key == "lora_alpha" and value not in ["16", "32", "64", "128", "256", "512", "1024"]:
+                    validated_values[key] = default_state[key]
+                else:
+                    validated_values[key] = value
         try:
+            # First verify we can serialize to JSON
+            json_data = json.dumps(validated_values, indent=2)
+            # Write to the file
             with open(ui_state_file, 'w') as f:
+                f.write(json_data)
+            logger.debug(f"UI state saved successfully")
         except Exception as e:
             logger.error(f"Error saving UI state: {str(e)}")
+    def _backup_and_recreate_ui_state(self, ui_state_file, default_state):
+        """Backup the corrupted UI state file and create a new one with defaults"""
+        try:
+            # Create a backup with timestamp
+            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+            backup_file = ui_state_file.with_suffix(f'.json.bak_{timestamp}')
+            # Copy the corrupted file
+            shutil.copy2(ui_state_file, backup_file)
+            logger.info(f"Backed up corrupted UI state file to {backup_file}")
+        except Exception as backup_error:
+            logger.error(f"Failed to backup corrupted UI state file: {str(backup_error)}")
+        # Create a new file with default values
+        self.save_ui_state(default_state)
+        logger.info("Created new UI state file with default values after error")
     def load_ui_state(self) -> Dict[str, Any]:
+        """Load saved UI state with robust error handling"""
         ui_state_file = OUTPUT_PATH / "ui_state.json"
         default_state = {
             "model_type": list(MODEL_TYPES.keys())[0],
             "batch_size": DEFAULT_BATCH_SIZE,
             "learning_rate": DEFAULT_LEARNING_RATE,
             "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
+            "training_preset": list(TRAINING_PRESETS.keys())[0],
+            "num_gpus": DEFAULT_NUM_GPUS,
+            "precomputation_items": DEFAULT_PRECOMPUTATION_ITEMS,
+            "lr_warmup_steps": DEFAULT_NB_LR_WARMUP_STEPS
         }
         if not ui_state_file.exists():
                     logger.warning("UI state file is empty or contains only whitespace, using default values")
                     return default_state
+                try:
+                    saved_state = json.loads(file_content)
+                except json.JSONDecodeError as e:
+                    logger.error(f"Error parsing UI state JSON: {str(e)}")
+                    # Instead of showing the error, recreate the file with defaults
+                    self._backup_and_recreate_ui_state(ui_state_file, default_state)
+                    return default_state
                 # Clean up model type if it contains " (LoRA)" suffix
                 if "model_type" in saved_state and " (LoRA)" in saved_state["model_type"]:
                 # Convert numeric values to appropriate types
                 if "train_steps" in saved_state:
+                    try:
+                        saved_state["train_steps"] = int(saved_state["train_steps"])
+                    except (ValueError, TypeError):
+                        saved_state["train_steps"] = default_state["train_steps"]
+                        logger.warning("Invalid train_steps value, using default")
                 if "batch_size" in saved_state:
+                    try:
+                        saved_state["batch_size"] = int(saved_state["batch_size"])
+                    except (ValueError, TypeError):
+                        saved_state["batch_size"] = default_state["batch_size"]
+                        logger.warning("Invalid batch_size value, using default")
                 if "learning_rate" in saved_state:
+                    try:
+                        saved_state["learning_rate"] = float(saved_state["learning_rate"])
+                    except (ValueError, TypeError):
+                        saved_state["learning_rate"] = default_state["learning_rate"]
+                        logger.warning("Invalid learning_rate value, using default")
                 if "save_iterations" in saved_state:
+                    try:
+                        saved_state["save_iterations"] = int(saved_state["save_iterations"])
+                    except (ValueError, TypeError):
+                        saved_state["save_iterations"] = default_state["save_iterations"]
+                        logger.warning("Invalid save_iterations value, using default")
                 # Make sure we have all keys (in case structure changed)
                 merged_state = default_state.copy()
+                merged_state.update({k: v for k, v in saved_state.items() if v is not None})
                 # Validate model_type is in available choices
                 if merged_state["model_type"] not in MODEL_TYPES:
                     merged_state["training_preset"] = default_state["training_preset"]
                     logger.warning(f"Invalid training preset in saved state, using default")
+                # Validate lora_rank is in allowed values
+                if merged_state.get("lora_rank") not in ["16", "32", "64", "128", "256", "512", "1024"]:
+                    merged_state["lora_rank"] = default_state["lora_rank"]
+                    logger.warning(f"Invalid lora_rank in saved state, using default")
+                # Validate lora_alpha is in allowed values
+                if merged_state.get("lora_alpha") not in ["16", "32", "64", "128", "256", "512", "1024"]:
+                    merged_state["lora_alpha"] = default_state["lora_alpha"]
+                    logger.warning(f"Invalid lora_alpha in saved state, using default")
                 return merged_state
         except Exception as e:
             logger.error(f"Error loading UI state: {str(e)}")
+            # If anything goes wrong, backup and recreate
+            self._backup_and_recreate_ui_state(ui_state_file, default_state)
             return default_state
     def ensure_valid_ui_state_file(self):
         """Ensure UI state file exists and is valid JSON"""
         ui_state_file = OUTPUT_PATH / "ui_state.json"
+        # Default state with all required values
+        default_state = {
+            "model_type": list(MODEL_TYPES.keys())[0],
+            "training_type": list(TRAINING_TYPES.keys())[0],
+            "lora_rank": DEFAULT_LORA_RANK_STR,
+            "lora_alpha": DEFAULT_LORA_ALPHA_STR,
+            "train_steps": DEFAULT_NB_TRAINING_STEPS,
+            "batch_size": DEFAULT_BATCH_SIZE,
+            "learning_rate": DEFAULT_LEARNING_RATE,
+            "save_iterations": DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS,
+            "training_preset": list(TRAINING_PRESETS.keys())[0],
+            "num_gpus": DEFAULT_NUM_GPUS,
+            "precomputation_items": DEFAULT_PRECOMPUTATION_ITEMS,
+            "lr_warmup_steps": DEFAULT_NB_LR_WARMUP_STEPS
+        }
+        # If file doesn't exist, create it with default values
         if not ui_state_file.exists():
             logger.info("Creating new UI state file with default values")
             self.save_ui_state(default_state)
             return
         # Check if file is valid JSON
         try:
+            # First check if the file is empty
+            file_size = ui_state_file.stat().st_size
+            if file_size == 0:
+                logger.warning("UI state file exists but is empty, recreating with default values")
+                self.save_ui_state(default_state)
+                return
             with open(ui_state_file, 'r') as f:
                 file_content = f.read().strip()
                 if not file_content:
+                    logger.warning("UI state file is empty or contains only whitespace, recreating with default values")
+                    self.save_ui_state(default_state)
+                    return
+                # Try to parse the JSON content
+                try:
+                    saved_state = json.loads(file_content)
+                    logger.debug("UI state file validation successful")
+                except json.JSONDecodeError as e:
+                    # JSON parsing failed, backup and recreate
+                    logger.error(f"Error parsing UI state JSON: {str(e)}")
+                    self._backup_and_recreate_ui_state(ui_state_file, default_state)
+                    return
         except Exception as e:
+            # Any other error (file access, etc)
+            logger.error(f"Error checking UI state file: {str(e)}")
+            self._backup_and_recreate_ui_state(ui_state_file, default_state)
+            return
     # Modify save_session to also store the UI state at training start
     def save_session(self, params: Dict) -> None:
         """Save training session parameters"""
         save_iterations: int,
         repo_id: str,
         preset_name: str,
+        training_type: str = DEFAULT_TRAINING_TYPE,
         resume_from_checkpoint: Optional[str] = None,
+        num_gpus: int = DEFAULT_NUM_GPUS,
+        precomputation_items: int = DEFAULT_PRECOMPUTATION_ITEMS,
+        lr_warmup_steps: int = DEFAULT_NB_LR_WARMUP_STEPS,
+        progress: Optional[gr.Progress] = None,
     ) -> Tuple[str, str]:
         """Start training with finetrainers"""
         log_prefix = "Resuming" if is_resuming else "Initializing"
         logger.info(f"{log_prefix} training with model_type={model_type}, training_type={training_type}")
+        # Update progress if available
+        if progress:
+            progress(0.15, desc="Setting up training configuration")
         try:
             # Get absolute paths - FIXED to look in project root instead of within vms directory
             current_dir = Path(__file__).parent.parent.parent.absolute()  # Go up to project root
             logger.info("Current working directory: %s", current_dir)
             logger.info("Training script path: %s", train_script)
             logger.info("Training data path: %s", TRAINING_PATH)
+            # Update progress
+            if progress:
+                progress(0.2, desc="Preparing training dataset")
             videos_file, prompts_file = prepare_finetrainers_dataset()
             if videos_file is None or prompts_file is None:
                 logger.error(error_msg)
                 return error_msg, "No training data available"
+            # Update progress
+            if progress:
+                progress(0.25, desc="Creating dataset configuration")
             # Get preset configuration
             preset = TRAINING_PRESETS[preset_name]
             training_buckets = preset["training_buckets"]
             flow_weighting_scheme = preset.get("flow_weighting_scheme", "none")
             preset_training_type = preset.get("training_type", "lora")
+            # Get the custom prompt prefix from the tabs
+            custom_prompt_prefix = None
+            if hasattr(self.app, 'tabs') and 'caption_tab' in self.app.tabs:
+                if hasattr(self.app.tabs['caption_tab'], 'components') and 'custom_prompt_prefix' in self.app.tabs['caption_tab'].components:
+                    # Get the value and clean it
+                    prefix = self.app.tabs['caption_tab'].components['custom_prompt_prefix'].value
+                    if prefix:
+                        # Clean the prefix - remove trailing comma, space or comma+space
+                        custom_prompt_prefix = prefix.rstrip(', ')
             # Create a proper dataset configuration JSON file
             dataset_config_file = OUTPUT_PATH / "dataset_config.json"
+            # Determine appropriate ID token based on model type and custom prefix
+            id_token = custom_prompt_prefix  # Use custom prefix as the primary id_token
+            # Only use default ID tokens if no custom prefix is provided
+            if not id_token:
+                id_token = DEFAULT_PROMPT_PREFIX
             dataset_config = {
                 "datasets": [
                     {
                         "data_root": str(TRAINING_PATH),
+                        "dataset_type": DEFAULT_DATASET_TYPE,
                         "id_token": id_token,
                         "video_resolution_buckets": [[f, h, w] for f, h, w in training_buckets],
+                        "reshape_mode": DEFAULT_RESHAPE_MODE,
+                        "remove_common_llm_caption_prefixes": DEFAULT_REMOVE_COMMON_LLM_CAPTION_PREFIXES,
                     }
                 ]
             }
                 logger.error(error_msg)
                 return error_msg, "Unsupported model"
+            # Create validation dataset if needed
+            validation_file = None
+            #if enable_validation:  # Add a parameter to control this
+            #    validation_file = create_validation_config()
+            #    if validation_file:
+            #        config_args.extend([
+            #            "--validation_dataset_file", str(validation_file),
+            #            "--validation_steps", "500"  # Set this to a suitable value
+            #        ])
             # Update with UI parameters
             config.train_steps = int(train_steps)
             config.batch_size = int(batch_size)
             config.training_type = training_type
             config.flow_weighting_scheme = flow_weighting_scheme
+            config.lr_warmup_steps = int(lr_warmup_steps)
+            config_args.extend([
+                "--precomputation_items", str(precomputation_items)
+            ])
+            # Update the NUM_GPUS variable and CUDA_VISIBLE_DEVICES
+            num_gpus = min(num_gpus, get_available_gpu_count())
+            if num_gpus <= 0:
+                num_gpus = 1
+            # Generate CUDA_VISIBLE_DEVICES string
+            visible_devices = ",".join([str(i) for i in range(num_gpus)])
             config.data_root = str(dataset_config_file)
             # Update LoRA parameters if using LoRA training type
                 self.append_log(f"Resuming from checkpoint: {resume_from_checkpoint}")
             # Common settings for both models
+            config.mixed_precision = DEFAULT_MIXED_PRECISION
             config.seed = DEFAULT_SEED
             config.gradient_checkpointing = True
             config.enable_slicing = True
                 torchrun_args = [
                     "torchrun",
                     "--standalone",
+                    "--nproc_per_node=" + str(num_gpus),
                     "--nnodes=1",
                     "--rdzv_backend=c10d",
                     "--rdzv_endpoint=localhost:0",
                 launch_args = torchrun_args
             else:
                 # For other models, use accelerate launch as before
+                # Determine the appropriate accelerate config file based on num_gpus
+                accelerate_config = None
+                if num_gpus == 1:
+                    accelerate_config = "accelerate_configs/uncompiled_1.yaml"
+                elif num_gpus == 2:
+                    accelerate_config = "accelerate_configs/uncompiled_2.yaml"
+                elif num_gpus == 4:
+                    accelerate_config = "accelerate_configs/uncompiled_4.yaml"
+                elif num_gpus == 8:
+                    accelerate_config = "accelerate_configs/uncompiled_8.yaml"
+                else:
+                    # Default to 1 GPU config if no matching config is found
+                    accelerate_config = "accelerate_configs/uncompiled_1.yaml"
+                    num_gpus = 1
+                    visible_devices = "0"
                 # Configure accelerate parameters
                 accelerate_args = [
                     "accelerate", "launch",
+                    "--config_file", accelerate_config,
+                    "--gpu_ids", visible_devices,
                     "--mixed_precision=bf16",
+                    "--num_processes=" + str(num_gpus),
                     "--num_machines=1",
                     "--dynamo_backend=no",
                     str(train_script)
             env["WANDB_MODE"] = "offline"
             env["HF_API_TOKEN"] = HF_API_TOKEN
             env["FINETRAINERS_LOG_LEVEL"] = "DEBUG"  # Added for better debugging
+            env["CUDA_VISIBLE_DEVICES"] = visible_devices
+            if progress:
+                progress(0.9, desc="Launching training process")
             # Start the training process
             process = subprocess.Popen(
                 launch_args + config_args,
                 "batch_size": batch_size,
                 "learning_rate": learning_rate,
                 "save_iterations": save_iterations,
+                "num_gpus": num_gpus,
+                "precomputation_items": precomputation_items,
+                "lr_warmup_steps": lr_warmup_steps,
                 "repo_id": repo_id,
                 "start_time": datetime.now().isoformat()
             })
             self.append_log(success_msg)
             logger.info(success_msg)
+            # Final progress update - now we'll track it through the log monitor
+            if progress:
+                progress(1.0, desc="Training started successfully")
             return success_msg, self.get_logs()
         except Exception as e:
                     if output:
                         # Remove decode() since output is already a string due to universal_newlines=True
                         line = output.strip()
+                        self.append_log(line)
                         if is_error:
                             #logger.error(line)
+                            pass
+                        # Parse metrics only from stdout
+                        metrics = parse_training_log(line)
+                        if metrics:
+                            status = self.get_status()
+                            status.update(metrics)
+                            self.save_status(**status)
+                            # Extract total_steps and current_step for progress tracking
+                            if 'step' in metrics:
+                                current_step = metrics['step']
+                            if 'total_steps' in status:
+                                total_steps = status['total_steps']
+                            # Update progress bar if available and total_steps is known
+                            if progress_obj and total_steps > 0:
+                                progress_value = min(0.99, current_step / total_steps)
+                                progress_obj(progress_value, desc=f"Training: step {current_step}/{total_steps}")
                         return True
                 return False
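
Review note on the UI state validation in `save_ui_state`: the per-key `try`/`except` branches all follow one pattern (coerce, fall back to the default on failure), so they could be table-driven. The sketch below is only an illustration of that idea, not the code in this commit. `validate_ui_state`, `COERCERS`, and the numeric values in `DEFAULTS` are hypothetical; the real defaults come from `vms/config.py` and are not shown in this diff.

```python
from typing import Any, Callable, Dict

# Allowed string values for lora_rank / lora_alpha, as listed in the diff.
ALLOWED_LORA_VALUES = ["16", "32", "64", "128", "256", "512", "1024"]

# Hypothetical defaults for illustration only (the real ones live in vms/config.py).
DEFAULTS: Dict[str, Any] = {
    "train_steps": 500,
    "batch_size": 1,
    "learning_rate": 3e-5,
    "save_iterations": 200,
    "lora_rank": "128",
    "lora_alpha": "128",
}

# One coercion function per numeric key, replacing the repeated try/except branches.
COERCERS: Dict[str, Callable[[Any], Any]] = {
    "train_steps": int,
    "batch_size": int,
    "learning_rate": float,
    "save_iterations": int,
}


def validate_ui_state(values: Dict[str, Any], defaults: Dict[str, Any] = DEFAULTS) -> Dict[str, Any]:
    """Return a copy of defaults updated with the valid entries of values."""
    validated = dict(defaults)
    for key, value in values.items():
        if key not in defaults:
            continue  # unknown keys are dropped, as in the diff
        if key in COERCERS:
            try:
                validated[key] = COERCERS[key](value)
            except (ValueError, TypeError):
                validated[key] = defaults[key]
        elif key in ("lora_rank", "lora_alpha") and value not in ALLOWED_LORA_VALUES:
            validated[key] = defaults[key]
        else:
            validated[key] = value
    return validated
```

The same helper could in principle be shared by `load_ui_state`, which repeats the coercion logic, although the load path also logs a warning per invalid key that this sketch omits.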

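Review note on the GPU selection added to `start_training`: the requested count is clamped to the number of available GPUs, forced to at least 1, and, on the `accelerate` path only, reset to 1 when no matching config file exists (so a request for 3 GPUs silently trains on one). A minimal sketch of that logic as a pure function follows. `select_launch_settings` and `ACCELERATE_CONFIGS` are hypothetical names; the config file paths are the ones in the diff, and the `available_gpus` argument stands in for the result of `get_available_gpu_count()`.

```python
from typing import Dict, Tuple

# Config files referenced in the diff, keyed by GPU count.
ACCELERATE_CONFIGS: Dict[int, str] = {
    1: "accelerate_configs/uncompiled_1.yaml",
    2: "accelerate_configs/uncompiled_2.yaml",
    4: "accelerate_configs/uncompiled_4.yaml",
    8: "accelerate_configs/uncompiled_8.yaml",
}


def select_launch_settings(requested_gpus: int, available_gpus: int) -> Tuple[int, str, str]:
    """Return (num_gpus, CUDA_VISIBLE_DEVICES string, accelerate config path)."""
    # Never use more GPUs than the machine reports, and never fewer than one.
    num_gpus = min(requested_gpus, available_gpus)
    if num_gpus <= 0:
        num_gpus = 1
    # Counts without a matching config fall back to the single-GPU config.
    if num_gpus not in ACCELERATE_CONFIGS:
        num_gpus = 1
    visible_devices = ",".join(str(i) for i in range(num_gpus))
    return num_gpus, visible_devices, ACCELERATE_CONFIGS[num_gpus]
```

Isolating the decision this way would make the fallback easy to unit test and to surface in the UI, for example by logging a warning when the effective count differs from the requested one.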
vms/tabs/train_tab.py CHANGED

@@ -15,7 +15,13 @@ from ..config import (
     DEFAULT_BATCH_SIZE, DEFAULT_CAPTION_DROPOUT_P,
     DEFAULT_LEARNING_RATE,
     DEFAULT_LORA_RANK, DEFAULT_LORA_ALPHA,
-    DEFAULT_LORA_RANK_STR, DEFAULT_LORA_ALPHA_STR
 )
 logger = logging.getLogger(__name__)
@@ -106,7 +112,30 @@ class TrainTab(BaseTab):
                             precision=0,
                             info="Model will be saved periodically after these many steps"
                         )
                 with gr.Column():
                     with gr.Row():
                         # Check for existing checkpoints to determine button text
@@ -218,7 +247,27 @@ class TrainTab(BaseTab):
                 self.components["lora_params_row"]
             ]
         )
         # Training parameters change events
         self.components["lora_rank"].change(
             fn=lambda v: self.app.update_ui_state(lora_rank=v),
@@ -274,7 +323,10 @@ class TrainTab(BaseTab):
                 self.components["learning_rate"],
                 self.components["save_iterations"],
                 self.components["preset_info"],
-                self.components["lora_params_row"]
             ]
         )
@@ -332,7 +384,7 @@ class TrainTab(BaseTab):
             outputs=[self.components["status_box"]]
         )
-    def handle_training_start(self, preset, model_type, training_type, *args):
         """Handle training start with proper log parser reset and checkpoint detection"""
         # Safely reset log parser if it exists
         if hasattr(self.app, 'log_parser') and self.app.log_parser is not None:
@@ -341,6 +393,9 @@ class TrainTab(BaseTab):
             logger.warning("Log parser not initialized, creating a new one")
             from ..utils import TrainingLogParser
             self.app.log_parser = TrainingLogParser()
         # Check for latest checkpoint
         checkpoints = list(OUTPUT_PATH.glob("checkpoint-*"))
@@ -351,6 +406,9 @@ class TrainTab(BaseTab):
             latest_checkpoint = max(checkpoints, key=os.path.getmtime)
             resume_from = str(latest_checkpoint)
             logger.info(f"Found checkpoint at {resume_from}, will resume training")
         # Convert model_type display name to internal name
         model_internal_type = MODEL_TYPES.get(model_type)
@@ -366,19 +424,32 @@ class TrainTab(BaseTab):
             logger.error(f"Invalid training type: {training_type}")
             return f"Error: Invalid training type '{training_type}'", "Training type not recognized"
         # Start training (it will automatically use the checkpoint if provided)
         try:
             return self.app.trainer.start_training(
-                model_internal_type,  # Use internal model type
-                *args,
                 preset_name=preset,
-                training_type=training_internal_type,  # Pass the internal training type
-                resume_from_checkpoint=resume_from
             )
         except Exception as e:
             logger.exception("Error starting training")
             return f"Error starting training: {str(e)}", f"Exception: {str(e)}\n\nCheck the logs for more details."
     def get_model_info(self, model_type: str, training_type: str) -> str:
         """Get information about the selected model type and training method"""
         if model_type == "HunyuanVideo":
@@ -518,6 +589,9 @@ class TrainTab(BaseTab):
         batch_size_val = current_state.get("batch_size") if current_state.get("batch_size") != preset.get("batch_size", DEFAULT_BATCH_SIZE) else preset.get("batch_size", DEFAULT_BATCH_SIZE)
         learning_rate_val = current_state.get("learning_rate") if current_state.get("learning_rate") != preset.get("learning_rate", DEFAULT_LEARNING_RATE) else preset.get("learning_rate", DEFAULT_LEARNING_RATE)
         save_iterations_val = current_state.get("save_iterations") if current_state.get("save_iterations") != preset.get("save_iterations", DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS) else preset.get("save_iterations", DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS)
         # Return values in the same order as the output components
         return (
@@ -530,7 +604,10 @@ class TrainTab(BaseTab):
             learning_rate_val,
             save_iterations_val,
             info_text,
-            gr.Row(visible=show_lora_params)
         )
     def get_latest_status_message_and_logs(self) -> Tuple[str, str, str]:

     DEFAULT_BATCH_SIZE, DEFAULT_CAPTION_DROPOUT_P,
     DEFAULT_LEARNING_RATE,
     DEFAULT_LORA_RANK, DEFAULT_LORA_ALPHA,
+    DEFAULT_LORA_RANK_STR, DEFAULT_LORA_ALPHA_STR,
+    DEFAULT_SEED,
+    DEFAULT_NUM_GPUS,
+    DEFAULT_MAX_GPUS,
+    DEFAULT_PRECOMPUTATION_ITEMS,
+    DEFAULT_NB_TRAINING_STEPS,
+    DEFAULT_NB_LR_WARMUP_STEPS,
 )
 logger = logging.getLogger(__name__)
                             precision=0,
                             info="Model will be saved periodically after these many steps"
                         )
+                    with gr.Row():
+                        self.components["num_gpus"] = gr.Slider(
+                            label="Number of GPUs to use",
+                            value=DEFAULT_NUM_GPUS,
+                            minimum=1,
+                            maximum=DEFAULT_MAX_GPUS,
+                            step=1,
+                            info="Number of GPUs to use for training"
+                        )
+                        self.components["precomputation_items"] = gr.Number(
+                            label="Precomputation Items",
+                            value=DEFAULT_PRECOMPUTATION_ITEMS,
+                            minimum=1,
+                            precision=0,
+                            info="Should be more or less the number of total items (ex: 200 videos), divided by the number of GPUs"
+                        )
+                    with gr.Row():
+                        self.components["lr_warmup_steps"] = gr.Number(
+                            label="Learning Rate Warmup Steps",
+                            value=DEFAULT_NB_LR_WARMUP_STEPS,
+                            minimum=0,
+                            precision=0,
+                            info="Number of warmup steps (typically 20-40% of total training steps)"
+                        )
                 with gr.Column():
                     with gr.Row():
                         # Check for existing checkpoints to determine button text
                 self.components["lora_params_row"]
             ]
         )
+        # GPU count, precomputation items, and warmup steps change events
+        self.components["num_gpus"].change(
+            fn=lambda v: self.app.update_ui_state(num_gpus=v),
+            inputs=[self.components["num_gpus"]],
+            outputs=[]
+        )
+        self.components["precomputation_items"].change(
+            fn=lambda v: self.app.update_ui_state(precomputation_items=v),
+            inputs=[self.components["precomputation_items"]],
+            outputs=[]
+        )
+        self.components["lr_warmup_steps"].change(
+            fn=lambda v: self.app.update_ui_state(lr_warmup_steps=v),
+            inputs=[self.components["lr_warmup_steps"]],
+            outputs=[]
+        )
         # Training parameters change events
         self.components["lora_rank"].change(
             fn=lambda v: self.app.update_ui_state(lora_rank=v),
                 self.components["learning_rate"],
                 self.components["save_iterations"],
                 self.components["preset_info"],
+                self.components["lora_params_row"],
+                self.components["num_gpus"],
+                self.components["precomputation_items"],
+                self.components["lr_warmup_steps"]
             ]
         )
             outputs=[self.components["status_box"]]
         )
+    def handle_training_start(self, preset, model_type, training_type, *args, progress=gr.Progress()):
         """Handle training start with proper log parser reset and checkpoint detection"""
         # Safely reset log parser if it exists
         if hasattr(self.app, 'log_parser') and self.app.log_parser is not None:
             logger.warning("Log parser not initialized, creating a new one")
             from ..utils import TrainingLogParser
             self.app.log_parser = TrainingLogParser()
+        # Initialize progress
+        progress(0, desc="Initializing training")
         # Check for latest checkpoint
         checkpoints = list(OUTPUT_PATH.glob("checkpoint-*"))
             latest_checkpoint = max(checkpoints, key=os.path.getmtime)
             resume_from = str(latest_checkpoint)
             logger.info(f"Found checkpoint at {resume_from}, will resume training")
+            progress(0.05, desc=f"Resuming from checkpoint {Path(resume_from).name}")
+        else:
+            progress(0.05, desc="Starting new training run")
         # Convert model_type display name to internal name
         model_internal_type = MODEL_TYPES.get(model_type)
             logger.error(f"Invalid training type: {training_type}")
             return f"Error: Invalid training type '{training_type}'", "Training type not recognized"
+        # Progress update
+        progress(0.1, desc="Preparing dataset")
         # Start training (it will automatically use the checkpoint if provided)
         try:
             return self.app.trainer.start_training(
+                model_internal_type,
+                lora_rank,
+                lora_alpha,
+                train_steps,
+                batch_size,
+                learning_rate,
+                save_iterations,
+                repo_id,
                 preset_name=preset,
+                training_type=training_internal_type,
+                resume_from_checkpoint=resume_from,
+                num_gpus=num_gpus,
+                precomputation_items=precomputation_items,
+                lr_warmup_steps=lr_warmup_steps,
+                progress=progress
             )
         except Exception as e:
             logger.exception("Error starting training")
             return f"Error starting training: {str(e)}", f"Exception: {str(e)}\n\nCheck the logs for more details."
     def get_model_info(self, model_type: str, training_type: str) -> str:
         """Get information about the selected model type and training method"""
         if model_type == "HunyuanVideo":
         batch_size_val = current_state.get("batch_size") if current_state.get("batch_size") != preset.get("batch_size", DEFAULT_BATCH_SIZE) else preset.get("batch_size", DEFAULT_BATCH_SIZE)
         learning_rate_val = current_state.get("learning_rate") if current_state.get("learning_rate") != preset.get("learning_rate", DEFAULT_LEARNING_RATE) else preset.get("learning_rate", DEFAULT_LEARNING_RATE)
         save_iterations_val = current_state.get("save_iterations") if current_state.get("save_iterations") != preset.get("save_iterations", DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS) else preset.get("save_iterations", DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS)
+        num_gpus_val = current_state.get("num_gpus") if current_state.get("num_gpus") != preset.get("num_gpus", DEFAULT_NUM_GPUS) else preset.get("num_gpus", DEFAULT_NUM_GPUS)
+        precomputation_items_val = current_state.get("precomputation_items") if current_state.get("precomputation_items") != preset.get("precomputation_items", DEFAULT_PRECOMPUTATION_ITEMS) else preset.get("precomputation_items", DEFAULT_PRECOMPUTATION_ITEMS)
+        lr_warmup_steps_val = current_state.get("lr_warmup_steps") if current_state.get("lr_warmup_steps") != preset.get("lr_warmup_steps", DEFAULT_NB_LR_WARMUP_STEPS) else preset.get("lr_warmup_steps", DEFAULT_NB_LR_WARMUP_STEPS)
         # Return values in the same order as the output components
         return (
             learning_rate_val,
             save_iterations_val,
             info_text,
+            gr.Row(visible=show_lora_params),
+            num_gpus_val,
+            precomputation_items_val,
+            lr_warmup_steps_val
         )
     def get_latest_status_message_and_logs(self) -> Tuple[str, str, str]:

vms/ui/video_trainer_ui.py CHANGED

@@ -14,9 +14,20 @@ from ..config import (
     DEFAULT_BATCH_SIZE, DEFAULT_CAPTION_DROPOUT_P,
     DEFAULT_LEARNING_RATE,
     DEFAULT_LORA_RANK, DEFAULT_LORA_ALPHA,
-    DEFAULT_LORA_RANK_STR, DEFAULT_LORA_ALPHA_STR
 )
-from ..utils import count_media_files, format_media_title, TrainingLogParser
 from ..tabs import ImportTab, SplitTab, CaptionTab, TrainTab, ManageTab
 logger = logging.getLogger(__name__)
@@ -101,7 +112,10 @@ class VideoTrainerUI:
                     self.tabs["train_tab"].components["batch_size"],
                     self.tabs["train_tab"].components["learning_rate"],
                     self.tabs["train_tab"].components["save_iterations"],
-                    self.tabs["train_tab"].components["current_task_box"]  # Add new component
                 ]
             )
@@ -273,11 +287,26 @@ class VideoTrainerUI:
         # Rest of the function remains unchanged
         lora_rank_val = ui_state.get("lora_rank", DEFAULT_LORA_RANK_STR)
         lora_alpha_val = ui_state.get("lora_alpha", DEFAULT_LORA_ALPHA_STR)
-        train_steps_val = int(ui_state.get("train_steps", DEFAULT_NB_TRAINING_STEPS))
         batch_size_val = int(ui_state.get("batch_size", DEFAULT_BATCH_SIZE))
         learning_rate_val = float(ui_state.get("learning_rate", DEFAULT_LEARNING_RATE))
         save_iterations_val = int(ui_state.get("save_iterations", DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS))
         # Initial current task value
         current_task_val = ""
         if hasattr(self, 'log_parser') and self.log_parser:
@@ -299,7 +328,10 @@ class VideoTrainerUI:
             batch_size_val,
             learning_rate_val,
             save_iterations_val,
-            current_task_val  # Add current task value
         )
     def initialize_ui_from_state(self):

     DEFAULT_BATCH_SIZE, DEFAULT_CAPTION_DROPOUT_P,
     DEFAULT_LEARNING_RATE,
     DEFAULT_LORA_RANK, DEFAULT_LORA_ALPHA,
+    DEFAULT_LORA_RANK_STR, DEFAULT_LORA_ALPHA_STR,
+    DEFAULT_SEED,
+    DEFAULT_NUM_GPUS,
+    DEFAULT_MAX_GPUS,
+    DEFAULT_PRECOMPUTATION_ITEMS,
+    DEFAULT_NB_TRAINING_STEPS,
+    DEFAULT_NB_LR_WARMUP_STEPS
+)
+from ..utils import (
+    get_recommended_precomputation_items,
+    count_media_files,
+    format_media_title,
+    TrainingLogParser
 )
 from ..tabs import ImportTab, SplitTab, CaptionTab, TrainTab, ManageTab
 logger = logging.getLogger(__name__)
                     self.tabs["train_tab"].components["batch_size"],
                     self.tabs["train_tab"].components["learning_rate"],
                     self.tabs["train_tab"].components["save_iterations"],
+                    self.tabs["train_tab"].components["current_task_box"],
+                    self.tabs["train_tab"].components["num_gpus"],
+                    self.tabs["train_tab"].components["precomputation_items"],
+                    self.tabs["train_tab"].components["lr_warmup_steps"]
                 ]
             )
         # Rest of the function remains unchanged
         lora_rank_val = ui_state.get("lora_rank", DEFAULT_LORA_RANK_STR)
         lora_alpha_val = ui_state.get("lora_alpha", DEFAULT_LORA_ALPHA_STR)
         batch_size_val = int(ui_state.get("batch_size", DEFAULT_BATCH_SIZE))
         learning_rate_val = float(ui_state.get("learning_rate", DEFAULT_LEARNING_RATE))
         save_iterations_val = int(ui_state.get("save_iterations", DEFAULT_SAVE_CHECKPOINT_EVERY_N_STEPS))
+        # Update for new UI components
+        num_gpus_val = int(ui_state.get("num_gpus", DEFAULT_NUM_GPUS))
+        # Calculate recommended precomputation items based on video count
+        video_count = len(list(TRAINING_VIDEOS_PATH.glob('*.mp4')))
+        recommended_precomputation = get_recommended_precomputation_items(video_count, num_gpus_val)
+        precomputation_items_val = int(ui_state.get("precomputation_items", recommended_precomputation))
+        # Ensure warmup steps are not more than training steps
+        train_steps_val = int(ui_state.get("train_steps", DEFAULT_NB_TRAINING_STEPS))
+        default_warmup = min(DEFAULT_NB_LR_WARMUP_STEPS, int(train_steps_val * 0.2))
+        lr_warmup_steps_val = int(ui_state.get("lr_warmup_steps", default_warmup))
+        # Ensure warmup steps <= training steps
+        lr_warmup_steps_val = min(lr_warmup_steps_val, train_steps_val)
         # Initial current task value
         current_task_val = ""
         if hasattr(self, 'log_parser') and self.log_parser:
             batch_size_val,
             learning_rate_val,
             save_iterations_val,
+            current_task_val,
+            num_gpus_val,
+            precomputation_items_val,
+            lr_warmup_steps_val
         )
     def initialize_ui_from_state(self):
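
The warmup handling in the hunk above can be sketched in isolation. This is a minimal illustration, not the project's code: `resolve_warmup_steps` and `ASSUMED_DEFAULT_WARMUP_STEPS` are hypothetical names, and the value 100 is an assumption, since the real `DEFAULT_NB_LR_WARMUP_STEPS` is defined in `vms/config.py` and is not shown here.

```python
from typing import Optional

# Hypothetical value; the real DEFAULT_NB_LR_WARMUP_STEPS lives in vms/config.py
ASSUMED_DEFAULT_WARMUP_STEPS = 100


def resolve_warmup_steps(train_steps: int, saved_warmup: Optional[int] = None) -> int:
    """Mirror the diff's logic: default to at most 20% of training steps, then clamp."""
    default_warmup = min(ASSUMED_DEFAULT_WARMUP_STEPS, int(train_steps * 0.2))
    warmup = default_warmup if saved_warmup is None else int(saved_warmup)
    # Warmup can never exceed the total number of training steps
    return min(warmup, train_steps)


print(resolve_warmup_steps(1000))                 # 100
print(resolve_warmup_steps(200))                  # 40
print(resolve_warmup_steps(50, saved_warmup=80))  # 50
```

The last call shows why the final clamp matters: a warmup value saved from an earlier, longer run is cut down to the new number of training steps.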

vms/utils/__init__.py CHANGED

@@ -8,6 +8,8 @@ from .finetrainers_utils import prepare_finetrainers_dataset, copy_files_to_trai
 from . import webdataset_handler
 __all__ = [
     'validate_model_repo',
     'make_archive',
@@ -33,5 +35,9 @@ __all__ = [
     'prepare_finetrainers_dataset',
     'copy_files_to_training_dir',
-    'webdataset_handler'
 ]

 from . import webdataset_handler
+from .gpu_detector import get_available_gpu_count, get_gpu_info, get_recommended_precomputation_items
 __all__ = [
     'validate_model_repo',
     'make_archive',
     'prepare_finetrainers_dataset',
     'copy_files_to_training_dir',
+    'webdataset_handler',
+    'get_available_gpu_count',
+    'get_gpu_info',
+    'get_recommended_precomputation_items'
 ]

vms/utils/finetrainers_utils.py CHANGED

@@ -4,15 +4,22 @@ import logging
 import shutil
 from typing import Any, Optional, Dict, List, Union, Tuple
-from ..config import STORAGE_PATH, TRAINING_PATH, STAGING_PATH, TRAINING_VIDEOS_PATH, MODEL_PATH, OUTPUT_PATH, HF_API_TOKEN, MODEL_TYPES
 from .utils import get_video_fps, extract_scene_info, make_archive, is_image_file, is_video_file
 logger = logging.getLogger(__name__)
 def prepare_finetrainers_dataset() -> Tuple[Path, Path]:
-    """make sure we have a Finetrainers-compatible dataset structure
-    Checks that we have:
         training/
         ├── prompt.txt       # All captions, one per line
         ├── videos.txt       # All video paths, one per line
@@ -30,14 +37,15 @@ def prepare_finetrainers_dataset() -> Tuple[Path, Path]:
     # Clear existing training lists
     for f in TRAINING_PATH.glob("*"):
         if f.is_file():
-            if f.name in ["videos.txt", "prompts.txt"]:
                 f.unlink()
     videos_file = TRAINING_PATH / "videos.txt"
-    prompts_file = TRAINING_PATH / "prompts.txt"  # Note: Changed from prompt.txt to prompts.txt to match our config
     media_files = []
     captions = []
     # Process all video files from the videos subdirectory
     for idx, file in enumerate(sorted(TRAINING_VIDEOS_PATH.glob("*.mp4"))):
         caption_file = file.with_suffix('.txt')
@@ -50,19 +58,16 @@ def prepare_finetrainers_dataset() -> Tuple[Path, Path]:
             relative_path = f"videos/{file.name}"
             media_files.append(relative_path)
             captions.append(caption)
-            # Clean up the caption file since it's now in prompts.txt
-            # EDIT well you know what, let's keep it, otherwise running the function
-            # twice might cause some errors
-            # caption_file.unlink()
     # Write files if we have content
     if media_files and captions:
         videos_file.write_text('\n'.join(media_files))
         prompts_file.write_text('\n'.join(captions))
     else:
-        raise ValueError("No valid video/caption pairs found in training directory")
     # Verify file contents
     with open(videos_file) as vf:
         video_lines = [l.strip() for l in vf.readlines() if l.strip()]
@@ -70,7 +75,8 @@ def prepare_finetrainers_dataset() -> Tuple[Path, Path]:
         prompt_lines = [l.strip() for l in pf.readlines() if l.strip()]
     if len(video_lines) != len(prompt_lines):
-        raise ValueError(f"Mismatch in generated files: {len(video_lines)} videos vs {len(prompt_lines)} prompts")
     return videos_file, prompts_file
@@ -137,3 +143,67 @@ def copy_files_to_training_dir(prompt_prefix: str) -> int:
     gr.Info(f"Successfully generated the training dataset ({nb_copied_pairs} pairs)")
     return nb_copied_pairs

 import shutil
 from typing import Any, Optional, Dict, List, Union, Tuple
+from ..config import (
+    STORAGE_PATH, TRAINING_PATH, STAGING_PATH, TRAINING_VIDEOS_PATH, MODEL_PATH, OUTPUT_PATH, HF_API_TOKEN, MODEL_TYPES,
+    DEFAULT_VALIDATION_NB_STEPS,
+    DEFAULT_VALIDATION_HEIGHT,
+    DEFAULT_VALIDATION_WIDTH,
+    DEFAULT_VALIDATION_NB_FRAMES,
+    DEFAULT_VALIDATION_FRAMERATE
+)
 from .utils import get_video_fps, extract_scene_info, make_archive, is_image_file, is_video_file
 logger = logging.getLogger(__name__)
 def prepare_finetrainers_dataset() -> Tuple[Optional[Path], Optional[Path]]:
+    """Prepare a Finetrainers-compatible dataset structure
+    Creates:
         training/
         ├── prompts.txt      # All captions, one per line
         ├── videos.txt       # All video paths, one per line
     # Clear existing training lists
     for f in TRAINING_PATH.glob("*"):
         if f.is_file():
+            if f.name in ["videos.txt", "prompts.txt", "prompt.txt"]:
                 f.unlink()
     videos_file = TRAINING_PATH / "videos.txt"
+    prompts_file = TRAINING_PATH / "prompts.txt"  # Finetrainers can use either prompts.txt or prompt.txt
     media_files = []
     captions = []
     # Process all video files from the videos subdirectory
     for idx, file in enumerate(sorted(TRAINING_VIDEOS_PATH.glob("*.mp4"))):
         caption_file = file.with_suffix('.txt')
             relative_path = f"videos/{file.name}"
             media_files.append(relative_path)
             captions.append(caption)
     # Write files if we have content
     if media_files and captions:
         videos_file.write_text('\n'.join(media_files))
         prompts_file.write_text('\n'.join(captions))
+        logger.info(f"Created dataset with {len(media_files)} video/caption pairs")
     else:
+        logger.warning("No valid video/caption pairs found in training directory")
+        return None, None
     # Verify file contents
     with open(videos_file) as vf:
         video_lines = [l.strip() for l in vf.readlines() if l.strip()]
         prompt_lines = [l.strip() for l in pf.readlines() if l.strip()]
     if len(video_lines) != len(prompt_lines):
+        logger.error(f"Mismatch in generated files: {len(video_lines)} videos vs {len(prompt_lines)} prompts")
+        return None, None
     return videos_file, prompts_file
     gr.Info(f"Successfully generated the training dataset ({nb_copied_pairs} pairs)")
     return nb_copied_pairs
+# Build a small validation config from a subset of the training videos
+def create_validation_config() -> Optional[Path]:
+    """Create a validation configuration JSON file for Finetrainers
+    Creates a validation dataset file with a subset of the training data
+    Returns:
+        Path to the validation JSON file, or None if no training files exist
+    """
+    # Ensure training dataset exists
+    if not TRAINING_VIDEOS_PATH.exists() or not any(TRAINING_VIDEOS_PATH.glob("*.mp4")):
+        logger.warning("No training videos found for validation")
+        return None
+    # Get a subset of the training videos (up to 4) for validation
+    training_videos = list(TRAINING_VIDEOS_PATH.glob("*.mp4"))
+    validation_videos = training_videos[:4]
+    if not validation_videos:
+        logger.warning("No validation videos selected")
+        return None
+    # Create validation data entries
+    validation_data = {"data": []}
+    for video_path in validation_videos:
+        # Get caption from matching text file
+        caption_path = video_path.with_suffix('.txt')
+        if not caption_path.exists():
+            logger.warning(f"Missing caption for {video_path}, skipping for validation")
+            continue
+        caption = caption_path.read_text().strip()
+        # Build the entry from default validation settings (the video itself is not probed)
+        try:
+            # Use the most common default resolution and settings
+            data_entry = {
+                "caption": caption,
+                "image_path": "",  # No input image for text-to-video
+                "video_path": str(video_path),
+                "num_inference_steps": DEFAULT_VALIDATION_NB_STEPS,
+                "height": DEFAULT_VALIDATION_HEIGHT,
+                "width": DEFAULT_VALIDATION_WIDTH,
+                "num_frames": DEFAULT_VALIDATION_NB_FRAMES,
+                "frame_rate": DEFAULT_VALIDATION_FRAMERATE
+            }
+            validation_data["data"].append(data_entry)
+        except Exception as e:
+            logger.warning(f"Error adding validation entry for {video_path}: {e}")
+    if not validation_data["data"]:
+        logger.warning("No valid validation entries created")
+        return None
+    # Write validation config to file
+    validation_file = OUTPUT_PATH / "validation_config.json"
+    with open(validation_file, 'w') as f:
+        json.dump(validation_data, f, indent=2)
+    logger.info(f"Created validation config with {len(validation_data['data'])} entries")
+    return validation_file
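
To make the shape of the generated file concrete, here is a small standalone sketch that writes and reads back one entry with the same keys as `create_validation_config()`. The `ASSUMED_*` constants, the `build_validation_entry` helper, the sample path, and the caption are all hypothetical. The real `DEFAULT_VALIDATION_*` values come from `vms/config.py` and are not visible in this diff.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-ins for the DEFAULT_VALIDATION_* constants in vms/config.py
ASSUMED_STEPS, ASSUMED_HEIGHT, ASSUMED_WIDTH = 30, 512, 768
ASSUMED_FRAMES, ASSUMED_FRAMERATE = 49, 25


def build_validation_entry(video_path: Path, caption: str) -> dict:
    """Build one entry using the same keys the diff writes for each video."""
    return {
        "caption": caption,
        "image_path": "",  # No input image for text-to-video
        "video_path": str(video_path),
        "num_inference_steps": ASSUMED_STEPS,
        "height": ASSUMED_HEIGHT,
        "width": ASSUMED_WIDTH,
        "num_frames": ASSUMED_FRAMES,
        "frame_rate": ASSUMED_FRAMERATE,
    }


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "validation_config.json"
    payload = {"data": [build_validation_entry(Path("videos/clip_0001.mp4"), "a cat runs")]}
    config_path.write_text(json.dumps(payload, indent=2))
    loaded = json.loads(config_path.read_text())

print(sorted(loaded["data"][0]))
```

Every entry shares the same default resolution and frame settings, so the file describes how validation samples are generated rather than the properties of the source videos.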

vms/utils/gpu_detector.py ADDED

	@@ -0,0 +1,59 @@

+import torch
+import logging
+logger = logging.getLogger(__name__)
+def get_available_gpu_count():
+    """Get the number of available GPUs on the system.
+    Returns:
+        int: Number of available GPUs, or 0 if no GPUs are available
+    """
+    try:
+        if torch.cuda.is_available():
+            return torch.cuda.device_count()
+        else:
+            return 0
+    except Exception as e:
+        logger.warning(f"Error detecting GPUs: {e}")
+        return 0
+def get_gpu_info():
+    """Get information about available GPUs.
+    Returns:
+        list: List of dictionaries with GPU information
+    """
+    gpu_info = []
+    try:
+        if torch.cuda.is_available():
+            for i in range(torch.cuda.device_count()):
+                gpu = {
+                    'index': i,
+                    'name': torch.cuda.get_device_name(i),
+                    'memory_total': torch.cuda.get_device_properties(i).total_memory
+                }
+                gpu_info.append(gpu)
+    except Exception as e:
+        logger.warning(f"Error getting GPU details: {e}")
+    return gpu_info
+def get_recommended_precomputation_items(num_videos, num_gpus):
+    """Calculate recommended precomputation items.
+    Args:
+        num_videos (int): Number of videos in dataset
+        num_gpus (int): Number of GPUs to use
+    Returns:
+        int: Recommended precomputation items value
+    """
+    if num_gpus <= 0:
+        num_gpus = 1
+    # Calculate items per GPU, but ensure it's at least 1
+    items_per_gpu = max(1, num_videos // num_gpus)
+    # Limit to a maximum of 512
+    return min(512, items_per_gpu)
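
As a quick illustration of the heuristic above, the following standalone sketch copies the arithmetic of `get_recommended_precomputation_items` (without the `torch` import) and runs it on a few made-up inputs. The dataset sizes and GPU counts are assumptions chosen only to exercise each branch.

```python
def get_recommended_precomputation_items(num_videos: int, num_gpus: int) -> int:
    """Standalone copy of the heuristic from gpu_detector.py, for illustration."""
    if num_gpus <= 0:
        num_gpus = 1
    # Items per GPU, never less than 1
    items_per_gpu = max(1, num_videos // num_gpus)
    # Capped at 512
    return min(512, items_per_gpu)


# Hypothetical (num_videos, num_gpus) pairs
examples = [(200, 4), (3, 8), (5000, 2), (200, 0)]
results = [get_recommended_precomputation_items(v, g) for v, g in examples]
print(results)  # [50, 1, 512, 200]
```

The second pair hits the lower bound (fewer videos than GPUs), the third hits the 512 cap, and the last shows a non-positive GPU count being treated as a single GPU.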