How are VAEs and GANs different from each other?
Starting from my own understanding, and scoped to image generation, I'm well aware of the major architectural differences:
A GAN's generator samples from a relatively low-dimensional random variable and produces an image. Then the discriminator takes that image and predicts whether or not it belongs to a target distribution. Once trained, I can generate a variety of images just by sampling the initial random variable and forwarding it through the generator.
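For concreteness, here is a minimal PyTorch sketch of that pipeline. The `LATENT_DIM`, `IMG_DIM`, and MLP layer sizes are illustrative assumptions (a flattened MNIST-like setup), not a recommended architecture:

```python
import torch
import torch.nn as nn

LATENT_DIM = 64    # assumed size of the random input vector
IMG_DIM = 28 * 28  # assumed flattened image size (e.g. MNIST)

# Generator: latent vector -> image
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, IMG_DIM),
    nn.Tanh(),  # pixel values in [-1, 1]
)

# Discriminator: image -> probability it came from the target distribution
discriminator = nn.Sequential(
    nn.Linear(IMG_DIM, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),
)

# Sampling (what you'd do after training): draw z, forward through the generator
z = torch.randn(16, LATENT_DIM)        # 16 samples from the latent prior
fake_images = generator(z)             # shape: (16, 784)
realness = discriminator(fake_images)  # shape: (16, 1)
```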
A VAE's encoder takes an image from a target distribution and compresses it into a low-dimensional latent space. Then the decoder's job is to take that latent representation and reproduce the original image. Once the network is trained, I can generate latent representations of various images, and interpolate between them before forwarding through the decoder, which produces new images.
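A matching sketch under the same illustrative assumptions; note that in practice the encoder outputs the mean and variance of a distribution over the latent code rather than a single point, so the sketch includes the standard reparameterization trick:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 16
IMG_DIM = 28 * 28

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(IMG_DIM, 256)
        self.mu = nn.Linear(256, LATENT_DIM)      # mean of q(z|x)
        self.logvar = nn.Linear(256, LATENT_DIM)  # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, IMG_DIM), nn.Sigmoid(),
        )

    def encode(self, x):
        h = F.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps, so gradients flow through mu and logvar
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar

# Interpolation (after training): encode two images, blend their latent means
vae = VAE()
x1, x2 = torch.rand(1, IMG_DIM), torch.rand(1, IMG_DIM)  # stand-ins for real images
z1, _ = vae.encode(x1)
z2, _ = vae.encode(x2)
for alpha in (0.0, 0.5, 1.0):
    z = (1 - alpha) * z1 + alpha * z2
    new_image = vae.dec(z)  # shape: (1, 784)
```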
What I'm more interested in is the consequences of those architectural differences. Why would I choose one approach over the other, and what explains the difference? (For example, if GANs typically produce better-quality images, any ideas why that is so? Is it true in all cases or just some?)
GANs generally produce better photo-realistic images but can be difficult to work with: the generator and discriminator optimize opposing objectives, and that adversarial game is prone to instability, mode collapse, and non-convergence. Conversely, VAEs are easier to train because they minimize a single well-defined objective (reconstruction plus a KL regularizer), but they don't usually give the best results: a pixel-wise reconstruction loss effectively averages over the plausible outputs for a given latent code, which tends to produce blurry images, whereas an adversarial loss penalizes anything the discriminator can distinguish from real data. I recommend picking VAEs if you don't have a lot of time to experiment with GANs and photorealism isn't paramount. There are exceptions, such as Google's VQ-VAE-2, which can compete with GANs on image quality and realism. There are also hybrids such as VAE-GAN and VQ-VAE-GAN. As a note, GANs and VAEs are not specific to images and can be used for other data types and structures.
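To make that contrast concrete, here is a minimal sketch of the two objectives in PyTorch. The binary-cross-entropy formulations below follow the original VAE and GAN papers in spirit; exact reductions, weightings, and loss variants differ in practice:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Pixel-wise reconstruction term: averages over plausible outputs,
    # which is one common explanation for blurry VAE samples.
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL term keeps q(z|x) close to the N(0, I) prior so sampling works.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # one stable objective, minimized directly

def gan_losses(d_real, d_fake):
    # Two networks optimize opposing objectives; balancing them is
    # what makes GAN training fragile (mode collapse, oscillation).
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    return d_loss, g_loss
```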