Why SD VAE in 2025?
I'm curious: why did you spend compute training an SD-VAE-based model in 2025?
EQ-VAE trains faster, DC-AE gives higher resolution for a smaller latent, and the Flux AE gives higher quality.
This is for research purposes. We start with SD-VAE for a fair comparison on ImageNet as well.
Thanks for the reply, I didn't expect it! I'd love to have a more detailed chat about it, if you're interested.
I understand comparisons and baselines are very important, but in the nicest possible framing, it's training a model with one hand tied behind your back! The problems with SD-VAE are much better understood now than three years ago. Lots of research has shown that downstream models learn faster when a) natural image priors survive the dimensionality reduction step, and b) the latent manifold is well behaved.
There's far more work to be done there: preserving scale, rotation, mirroring, and translation in the reduced manifold (like EQ-VAE), while also using the strengths of residual encoding (like DC-AE).
Another "downstream thing to optimize for" is tokenisation for transformers, like this:
https://github.com/facebookresearch/SSDD
However, IMHO there's an oversight in the community: the "discovery" of the pixel -> latent space mapping has thus far mostly been done with relatively small models. Sticking with the SD-VAE example, OpenAI addressed this with:
https://github.com/openai/consistencydecoder
But the consistency decoder had the training objective priorities backwards: it kept the same deeply flawed "latent API" of SD-VAE and attempted to patch it in the decoder, which ends up producing a very clever general reconstruction model that accepts a sort of "noisy input" in the form of an SD-VAE latent.
I have a hunch we need to do the exact opposite: train an enormous model, a genius in semantic understanding and high perceptual reconstruction quality, but strongly regularized to a very well-behaved equivariant latent space. Then distill it to a smaller model for all downstream model training and inference. Find the patterns separately from utilizing the knowledge! Like this:
https://huggingface.co/lightx2v/Autoencoders
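For concreteness, here's a minimal sketch of the distillation step I have in mind, in PyTorch. `big_teacher` and `small_student` are hypothetical autoencoder modules of my own naming (not from the linked repo), each exposing `encode`/`decode`; the point is just that the student is trained to reproduce the frozen teacher's latents and reconstructions.

```python
import torch
import torch.nn.functional as F

# Hypothetical modules: a frozen, very large teacher AE with a well-regularised
# equivariant latent, and a small student AE intended for downstream use.
# Both are assumed to expose .encode(x) -> latent and .decode(z) -> image.
def distill_step(big_teacher, small_student, optimizer, images):
    big_teacher.eval()
    with torch.no_grad():
        z_teacher = big_teacher.encode(images)      # the target latent "API"
        x_teacher = big_teacher.decode(z_teacher)   # the teacher's reconstruction

    z_student = small_student.encode(images)
    x_student = small_student.decode(z_student)

    # The student must speak the teacher's latent language exactly, and match
    # its reconstructions (L1 here is a placeholder for a proper perceptual loss).
    loss = F.mse_loss(z_student, z_teacher) + F.l1_loss(x_student, x_teacher)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The student then becomes the cheap, frozen front end for every downstream diffusion or transformer model, while the expensive "pattern finding" only ever happens once, in the teacher.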
Another piece of low-hanging fruit that I haven't seen any project take up yet is switching to a perceptually uniform colourspace before the AE/VAE. Basically, RGB, HSV, LAB, etc. all imply the "wrong" colour difference formula for human perception. It ends up with every paper reporting PSNR against the wrong "measuring stick", simply because everyone else uses the same wrong measuring stick!
https://bottosson.github.io/posts/oklab/
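For reference, this is roughly all the preprocessing change amounts to. The matrices are the ones published in the OKLab post, and the snippet assumes the input is already linear (not gamma-encoded) sRGB in [0, 1]:

```python
import numpy as np

def linear_srgb_to_oklab(rgb):
    """rgb: (..., 3) array of *linear* sRGB in [0, 1]. Returns OKLab (L, a, b)."""
    # Linear sRGB -> LMS cone response (matrix from bottosson.github.io/posts/oklab)
    m1 = np.array([[0.4122214708, 0.5363325363, 0.0514459929],
                   [0.2119034982, 0.6806995451, 0.1073969566],
                   [0.0883024619, 0.2817188376, 0.6299787005]])
    # Cube-root non-linearity approximates the perceptual lightness response
    lms = np.cbrt(rgb @ m1.T)
    # LMS' -> OKLab
    m2 = np.array([[0.2104542553,  0.7936177850, -0.0040720468],
                   [1.9779984951, -2.4285922050,  0.4505937099],
                   [0.0259040371,  0.7827717662, -0.8086757660]])
    return lms @ m2.T

# e.g. convert every training patch before it ever reaches the encoder
patch = np.random.rand(256, 256, 3)          # stand-in for a linear-light patch
oklab_patch = linear_srgb_to_oklab(patch)
```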
Finally, I also think there is huge potential in baking more semantics into the latent space in the form of layering and depth, going far beyond just an alpha channel. There have been enormous strides in foundation vision models for monocular depth estimation, as well as image matting. With your resources, every training image could be decomposed into meaningful depth layers, or even just foreground and background, with excellent matting around hair, fur, translucency, etc.; the latent representation would be semantically far richer. One well-designed representation could work across vector graphic alpha as well as natural images.
It does need careful attention though: naively adding an alpha channel to any colourspace is illogical, because 0% alpha black and 0% alpha white are the same 'singularity' on the manifold. I haven't yet found a satisfying solution to that without invoking complexity like inverse rendering or some really nasty maths.
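To make the singularity concrete, a toy illustration with straight (unpremultiplied) alpha: the moment alpha hits zero, wildly different RGB values composite to exactly the same image, so a pixel-space reconstruction loss has nothing with which to pull them apart in the latent.

```python
import numpy as np

def over(fg_rgb, alpha, bg_rgb):
    """Standard 'over' compositing with straight (unpremultiplied) alpha."""
    return alpha * fg_rgb + (1.0 - alpha) * bg_rgb

black = np.array([0.0, 0.0, 0.0])
white = np.array([1.0, 1.0, 1.0])
bg    = np.array([0.3, 0.6, 0.9])   # any background at all

# At 0% alpha the foreground colour is unobservable: both composite identically,
# so '0% alpha black' and '0% alpha white' collapse to one point perceptually
# while staying far apart numerically in RGBA space.
print(over(black, 0.0, bg))   # [0.3 0.6 0.9]
print(over(white, 0.0, bg))   # [0.3 0.6 0.9]
```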
Anyway, I'm working on a paper on this exact thing, so it's a bit of a brain dump, a sneak preview!
Looks like Qwen Image Layers has implemented the layer idea!
https://huggingface.co/Qwen/Qwen-Image-Layered
And Adobe has released an intrinsic decomposition model:
https://aleafy.github.io/vrgbx/
Still, my Christmas wish list to π is an open source VAE/AE with priors suitable for the real world:
- Start with a SOTA VAE like the SSDD variant with 16x spatial reduction so that 4K pipelines become tractable, and you can aim to surpass Nano Banana Pro with your diffuser from day 1 on a reasonable inference budget.
- Build on their non-adversarial perceptual distance work using the work of Eero Simoncelli and his team. Papers like "Differentiating image representations in terms of their local geometry", and the follow-up repo https://github.com/ohayonguy/information-estimation-metric, show a much more sophisticated mechanism than the one used in "Efficient Halftoning via Deep Reinforcement Learning", which achieved its boost simply by contrast-weighting SSIM to better align with human perception. It's the way forward.
- Choose a very modest patch size for training the VAE, e.g. 128 or 256, so that no global information leaks into your latent. Be incredibly careful with edge loss to avoid tiling artefacts. Be equally careful with convolutions to avoid checkerboard artefacts and the weird texture noise that Nano Banana Pro has. It's 2025.
- Be bold. Require a depth map in the image channels, with excellent foreground/background separation using SOTA image matting to augment the depth map on every image in the training dataset, so both alpha and geometry are baked into the latent space. Yes, this means all downstream tasks have to do monocular depth estimation and image de-matting (or vector art de-layering, anime segmentation, etc.), but this is worthwhile compute to spend for real-world understanding: you get geometry in every downstream task, and some limited semantic segmentation for free.
- Use the OKLab colour space and a good mix of SDR and HDR content. Aggressively filter out low-quality training data, and aggressively AI-upscale training data to 4K, then downscale to boost sharpness before extracting patches for VAE training. Don't let the gamma-space (255 + 0) / 2 = 128 nonsense appear anywhere in your pipeline (averaged in linear light and re-encoded, it comes out closer to 187).
- Train the VAE with equivariant regularisation across rotation, scaling, mirroring, and flipping (a rough sketch follows this list). Bake real-world priors and geometry into a latent space that downstream transformers will eat up through their positional encoding! It might even help to encode depth as a position, i.e. 3D RoPE, so the network sees voxels instead of a grid.
- Avoid strange statistical mean shifting/scaling like the WAN VAE, otherwise downstream models will hook onto it and use it to cheat the timestep encoding. It probably needs even smarter regularisation so the VAE latent mean is 0 and the std/variance is 1 across every channel (also covered in the sketch below).
- Don't waste any compute training any more diffusion models until your VAE alone shows Apple can leapfrog Adobe, Qwen, Wan, Google, Meta, Alibaba, and everyone else from a principled approach, not by playing catch up!
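As promised above, a rough sketch of the equivariance and latent-statistics regularisation I mean, in PyTorch. `encoder` is a stand-in for any VAE encoder producing a (B, C, h, w) latent; the single horizontal flip is just the cheapest member of the rotation/scaling/mirroring family, and the exact weighting is a guess:

```python
import torch
import torch.nn.functional as F

def equivariance_and_moment_loss(encoder, images, lambda_moment=0.1):
    """encoder(x) -> latent of shape (B, C, h, w). A sketch, not a recipe."""
    z = encoder(images)

    # One cheap spatial transform per step: here a horizontal flip. Encoding the
    # flipped image should give the flipped latent (equivariance), so real-world
    # geometry survives the dimensionality reduction.
    z_of_flipped = encoder(torch.flip(images, dims=[-1]))
    flipped_z    = torch.flip(z, dims=[-1])
    eq_loss = F.mse_loss(z_of_flipped, flipped_z)

    # Keep every latent channel near zero mean and unit variance, so downstream
    # models can't latch onto per-channel statistics as a timestep side-channel.
    mean = z.mean(dim=(0, 2, 3))
    var  = z.var(dim=(0, 2, 3))
    moment_loss = (mean ** 2).mean() + ((var - 1.0) ** 2).mean()

    return eq_loss + lambda_moment * moment_loss
```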
Merry Christmas! π
