Dataset missing data?

#13
by deraynger - opened

Hi, and thanks a lot for your great work and for open sourcing it! I'm new to all of this and have been extensively testing JoyCaption with my custom code. My next step is to learn and try the training as you did, with custom tweaks, just so I can understand how it all works. I noticed all URLs are empty, so I guess I can't do any training unless I recreate all of your dataset examples.
Is there any way to get the data you used for training?

Keep up the great work, looking forward to your future version/projects!!!

Oh, one more thing: you mentioned you trained it on a single GPU? Which GPU was that? If I understood correctly, Beta One was trained in 48 hours?

I noticed all URLs are empty

Yes, sorry, I haven't finished annotating all the data I used or fully "released" the dataset yet. Once I get around to that, all of the publicly available images should be filled out with their URLs.

Oh, one more thing: you mentioned you trained it on a single GPU? Which GPU was that? If I understood correctly, Beta One was trained in 48 hours?

Yes, JoyCaption's training is lightweight, all things considered. For the full run on Beta One I rented a cloud GPU just to speed things along. IIRC it was a GH200, because they were discounted at the time; it's roughly equivalent to a 600W H100. Total run time was 65 hours, processing 24M training samples in total. Mini batch size was 4. Peak GPU memory usage was about 100GB.
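
For a rough sense of scale, here's a quick back-of-envelope derived only from the numbers quoted above (nothing here is measured independently):

```python
# Back-of-envelope figures from the run described above:
# 65 hours, 24M training samples, mini batch of 4.
total_samples = 24_000_000
total_hours = 65
mini_batch = 4

samples_per_sec = total_samples / (total_hours * 3600)
mini_batch_steps = total_samples // mini_batch

print(f"~{samples_per_sec:.0f} samples/sec")              # ~103 samples/sec
print(f"{mini_batch_steps:,} mini-batch steps of size 4")  # 6,000,000 steps
```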

But, for reference, the early versions of JoyCaption were trained on a 4090 (24GB), and I do development on a 96GB GPU. The later versions have longer examples in the dataset, which is why the GPU memory use has gone up. If you keep the examples capped at 256 tokens and drop the mini batch size to 1, it fits easily in 24GB, mainly because it's only a LoRA on the Llama model, which saves a ton of memory.
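
To make that concrete, here's a minimal sketch of that kind of setup, not JoyCaption's actual training code: it assumes a LLaVA-style Hugging Face model and PEFT for the LoRA adapters, and the base checkpoint, rank, and target modules are placeholder guesses on my part.

```python
# A minimal sketch (NOT JoyCaption's training code): LoRA on the Llama language
# model only, examples capped at 256 tokens, mini batch of 1 to fit in ~24GB.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # placeholder base model, not JoyCaption's

model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Restrict LoRA to the language model's attention projections via a regex,
# leaving the vision tower and projector frozen. Rank/alpha are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|o_proj)",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# The memory-saving knobs mentioned above: short examples and a tiny mini batch.
MAX_TOKENS = 256   # cap example length
MICRO_BATCH = 1    # mini batch size that fits on a 24GB card
model.gradient_checkpointing_enable()  # extra headroom at the cost of speed
```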
