Part 2: Understanding GANs — The Engine Behind Virtual Try-On
Introduction
In Part 1, I surveyed over fifteen virtual try-on models and selected PASTA-GAN++ as the foundation for META FIT. But every one of those models — from the earliest CAGAN to the latest PASTA-GAN++ — shares a common foundation: Generative Adversarial Networks.
Before diving into the specific architecture of our try-on engine, it is worth understanding GANs thoroughly. They were the dominant paradigm for image generation from 2014 through roughly 2021, and grasping how they work is essential for understanding why virtual try-on systems are designed the way they are.
This article is a deep dive into GAN technology — the theory, the milestones, the challenges, and ultimately, how GANs connect to the virtual try-on problem.
What Is a GAN?
Generative Adversarial Networks were introduced by Ian Goodfellow and colleagues in a 2014 paper that fundamentally changed the trajectory of generative modeling. The core idea is elegant: pit two neural networks against each other in a game, and let their competition drive both toward excellence.
The Two Players
A GAN consists of two networks trained simultaneously:
Generator (G): Takes a random noise vector z (sampled from a simple distribution, typically Gaussian) and produces a synthetic image G(z). The generator’s objective is to produce outputs that are indistinguishable from real images. It never sees real data directly — it only receives feedback through the discriminator’s judgments.
Discriminator (D): Takes an image — either a real image from the training set or a fake image from the generator — and outputs the probability that the input is real. The discriminator’s objective is to correctly classify real images as real and generated images as fake.
The Minimax Game
The training process is formalized as a minimax optimization:
min_G max_D V(D, G) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 - D(G(z)))]
In plain terms: the discriminator tries to maximize its ability to distinguish real from fake (maximizing V), while the generator tries to minimize the discriminator’s success (minimizing V). They pull in opposite directions, and this tension is what drives learning.
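The alternating updates can be sketched in a few lines of PyTorch. This is a toy setup, not any real architecture: the networks are tiny made-up MLPs, the "real" data is a shifted Gaussian, and the generator step uses the common non-saturating variant rather than the literal minimax loss.

```python
import torch
import torch.nn as nn

# Toy GAN: both players are small MLPs with illustrative sizes.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, 2) + 3.0   # stand-in "real" data distribution
z = torch.randn(32, 8)            # noise vector fed to the generator

# Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
fake = G(z).detach()              # detach: no generator gradients on this step
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step (non-saturating form): push D(G(z)) toward 1.
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The `detach()` call is the crux of the alternation: when the discriminator trains, the generator's parameters must be frozen out of the gradient computation, and vice versa.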
The Counterfeiter and the Detective
A useful analogy: imagine a counterfeiter (the generator) trying to produce fake currency, and a detective (the discriminator) trying to identify counterfeits. At first, the counterfeiter’s fakes are crude and easily detected. But each time the detective catches a fake, the counterfeiter learns what gave it away and improves. As the counterfeiter gets better, the detective must become more discerning. This arms race continues, and over time, the counterfeiter’s output becomes increasingly convincing.
At theoretical convergence (a Nash equilibrium), the generator produces images so realistic that the discriminator can do no better than random guessing — outputting 0.5 for every input. In practice, true equilibrium is rarely reached, but the adversarial dynamic can still drive the generator to produce high-quality images.
Why GANs Were Revolutionary
To appreciate the impact of GANs, it helps to understand what image generation looked like before 2014.
Variational Autoencoders (VAEs), introduced around the same time, could generate images but tended to produce blurry outputs. They optimize a reconstruction loss that averages over possibilities, which smooths out fine details.
Explicit density models required defining the full probability distribution of images — computationally intractable for high-resolution data.
GANs bypassed these limitations entirely. By framing generation as a game rather than an explicit density estimation problem, they could produce sharp, detailed images. The discriminator provided a learned, adaptive loss function that pushed the generator toward photorealism in ways that fixed loss functions (like mean squared error) could not.
Key Milestones in GAN Research
The original GAN paper opened a floodgate of research. Here are the milestones most relevant to understanding virtual try-on:
DCGAN (2015) — Deep Convolutional GAN stabilized training by establishing architectural guidelines: use strided convolutions instead of pooling, apply batch normalization in both networks, remove fully connected layers in deeper architectures. These guidelines became standard practice and made GANs reliably trainable for the first time.
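The DCGAN guidelines translate into code roughly as follows. This is an illustrative PyTorch sketch (channel counts are made up, and it targets 32x32 output rather than the paper's 64x64): strided transposed convolutions do the upsampling, batch normalization follows each convolution, and there are no fully connected layers — the noise enters as a 1x1 spatial map.

```python
import torch
import torch.nn as nn

# DCGAN-style generator: strided (transposed) convolutions instead of
# pooling/upsampling, batch norm throughout, no fully connected layers.
gen = nn.Sequential(
    nn.ConvTranspose2d(100, 128, 4, 1, 0), nn.BatchNorm2d(128), nn.ReLU(),  # -> 4x4
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # -> 8x8
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),     # -> 16x16
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                          # -> 32x32
)

z = torch.randn(2, 100, 1, 1)  # noise reshaped to a 1x1 spatial map
img = gen(z)                   # (2, 3, 32, 32), values in [-1, 1] via Tanh
```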
Conditional GAN / cGAN (2014) — By feeding class labels or other conditioning information to both the generator and discriminator, cGANs could generate images of specific categories. This idea of conditioned generation is fundamental to virtual try-on, where the output must be conditioned on both a person image and a garment image.
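Conditioning in its simplest form is just concatenation, sketched below with a one-hot class label (sizes are illustrative). VTON models apply the same idea spatially, concatenating person and garment feature maps along the channel axis instead of appending a label vector.

```python
import torch

# cGAN conditioning: append the condition (a one-hot class label here)
# to the generator's noise vector so the output depends on it.
n_classes, z_dim, batch = 10, 64, 4
z = torch.randn(batch, z_dim)
labels = torch.randint(0, n_classes, (batch,))
one_hot = torch.nn.functional.one_hot(labels, n_classes).float()

g_input = torch.cat([z, one_hot], dim=1)  # generator sees noise + condition
```

The discriminator receives the same condition alongside the image, so it can penalize outputs that are realistic but mismatched with the conditioning information.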
pix2pix (2016) — Perhaps the most directly relevant milestone for VTON. pix2pix demonstrated paired image-to-image translation: given an input image (edges, sketches, segmentation maps), produce a corresponding output image (photos, rendered scenes). The architecture — a U-Net generator with skip connections and a PatchGAN discriminator — became the template for virtually all subsequent image translation work, including virtual try-on.
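The PatchGAN half of that template is easy to sketch: instead of a single real/fake score per image, the discriminator outputs a grid of scores, each judging one receptive-field patch, which pushes the generator toward sharp local texture. Layer sizes below are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# PatchGAN-style discriminator: a fully convolutional network whose
# output is a grid of per-patch real/fake logits, not a single score.
patch_d = nn.Sequential(
    nn.Conv2d(6, 64, 4, 2, 1), nn.LeakyReLU(0.2),    # input: condition + image, concatenated
    nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, 1, 1),                      # one logit per patch
)

cond = torch.randn(1, 3, 64, 64)   # e.g. a segmentation map
out = torch.randn(1, 3, 64, 64)    # e.g. the generated photo
scores = patch_d(torch.cat([cond, out], dim=1))  # grid of patch judgments
```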
CycleGAN (2017) — Extended image translation to unpaired data using cycle consistency loss. The key insight: if you translate an image from domain A to domain B and then back to domain A, you should recover the original. This enabled style transfer, domain adaptation, and other tasks without requiring matched pairs — critical for VTON methods that cannot easily obtain paired training data.
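The cycle consistency idea fits in one line of loss code. In the sketch below the two generators are stand-in identity functions just to keep it runnable; in a real CycleGAN they are full image-translation networks and this term is added to the adversarial losses.

```python
import torch

# Cycle consistency: translate A -> B -> A and penalize the L1 distance
# from the original. Identity lambdas stand in for the two generators.
G_ab = lambda x: x   # placeholder for the A -> B generator
G_ba = lambda x: x   # placeholder for the B -> A generator

real_a = torch.randn(2, 3, 32, 32)
cycle_loss = torch.nn.functional.l1_loss(G_ba(G_ab(real_a)), real_a)
```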
ProGAN (2017) — Introduced progressive growing: start training at low resolution (4x4) and gradually add layers for higher resolutions. This stabilized high-resolution image generation and paved the way for photorealistic outputs.
StyleGAN (2018-2020) — Nvidia’s style-based generator architecture produced photorealistic face images at 1024x1024 resolution. StyleGAN demonstrated that GANs could generate images virtually indistinguishable from photographs, at least within specific domains. The style-based architecture also offered fine-grained control over generated image attributes.
Each of these advances contributed techniques and insights that virtual try-on researchers adapted for their specific problem.
GAN Training Challenges
Despite their impressive results, GANs are notoriously difficult to train. Several fundamental challenges arise from the adversarial training dynamic.
Mode Collapse
The generator discovers a small set of outputs that consistently fool the discriminator and stops exploring the rest of the data distribution. The result: limited variety in generated images. For example, a face-generating GAN might produce only a few distinct faces, ignoring the diversity of the training data. In a VTON context, mode collapse might manifest as the generator producing the same generic garment appearance regardless of the input.
Training Instability
The generator and discriminator must remain in approximate balance throughout training. If the discriminator becomes too strong, it rejects everything the generator produces, and the generator’s gradients vanish — it receives no useful learning signal. If the discriminator is too weak, it accepts everything, and the generator has no incentive to improve. Maintaining this balance requires careful hyperparameter tuning and often manual monitoring.
Evaluation Difficulty
Unlike supervised learning, where accuracy or loss on a validation set provides a clear quality signal, GAN outputs lack a straightforward evaluation metric. The two most common metrics are:
- FID (Fréchet Inception Distance): Measures the distance between the distribution of generated images and real images in a feature space defined by an Inception network. Lower is better. Widely used but sensitive to sample size and not always correlated with perceptual quality.
- IS (Inception Score): Measures both quality (each image should be classifiable) and diversity (the set of images should cover many classes). Useful but limited — it does not directly compare against the real data distribution.
Neither metric perfectly captures what humans perceive as “good” image quality, making GAN evaluation partly subjective.
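FID itself is just the Fréchet distance between two Gaussians fitted to feature sets: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*sqrtm(S1 S2)). The sketch below (assuming NumPy and SciPy are available) accepts any feature arrays; in real use the features come from a specific Inception network layer, which this sketch does not include.

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two feature sets.

    In practice the (n_samples, dim) features come from an Inception
    network; any arrays work here, enough to show the formula.
    """
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):      # numerical noise can produce complex parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(size=(500, 8))
b = rng.normal(size=(500, 8))
same = fid(a, a)   # identical sets -> distance ~0
diff = fid(a, b)   # small positive value from sampling noise
```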
Mitigation Techniques
The research community developed several techniques to address these challenges:
- Wasserstein loss (WGAN): Replaces the original GAN loss with the Wasserstein distance, providing more stable gradients and a loss that correlates with image quality.
- Spectral normalization: Constrains the discriminator’s Lipschitz constant to prevent it from becoming too powerful too quickly.
- Gradient penalty (WGAN-GP): Enforces the Lipschitz constraint through a penalty term on the gradient norm, further stabilizing training.
- Progressive growing: Starts at low resolution and gradually increases, allowing both networks to learn coarse structure before fine details.
These techniques collectively made GAN training more reliable, though it remained more art than science compared to standard supervised learning.
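Of these, the gradient penalty is the easiest to show concretely. The sketch below (a toy critic with made-up sizes) evaluates the critic on random interpolations between real and fake samples and penalizes deviations of the gradient norm from 1, which softly enforces the Lipschitz constraint.

```python
import torch

# WGAN-GP gradient penalty on interpolated samples. The critic here is
# a stand-in MLP; in WGAN it outputs an unbounded score, not a probability.
critic = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                             torch.nn.Linear(32, 1))

real = torch.randn(8, 16)
fake = torch.randn(8, 16)
eps = torch.rand(8, 1)  # per-sample interpolation weight
mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)

# Gradient of the critic's output with respect to the interpolated input.
grad = torch.autograd.grad(critic(mixed).sum(), mixed, create_graph=True)[0]
gp = ((grad.norm(2, dim=1) - 1) ** 2).mean()  # add to the critic loss,
                                              # scaled by lambda (10 in the paper)
```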
From General GANs to Virtual Try-On
With the fundamentals established, let us connect GANs to the specific problem of virtual try-on.
At its core, VTON is an image-to-image translation problem: given a person image and a garment image, produce an output image showing that person wearing that garment. This is why pix2pix-style conditional GANs form the foundation of nearly every VTON model.
However, virtual try-on introduces challenges that generic image translation does not face. Several key adaptations are required:
Geometric Warping
A garment image from a product catalog is typically photographed flat or on a mannequin. To make it look natural on a person, the garment must be spatially deformed — stretched, compressed, rotated — to match the person’s body shape and pose. Early methods used Thin Plate Spline (TPS) transformations with learned control points. Later methods, including PF-AFN (which we will examine in Part 3), use appearance flow — learning a dense pixel-level flow field that maps each pixel in the garment image to its target position on the person.
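The sampling mechanics of appearance flow can be shown in miniature with PyTorch's `grid_sample`. In PF-AFN the flow field is predicted by a network; here it is a fixed identity grid plus a small hand-made horizontal shift, purely to illustrate how a dense flow moves garment pixels.

```python
import torch
import torch.nn.functional as F

# Appearance flow in miniature: for every output pixel, the grid stores
# the (x, y) location in the garment image to sample from, in [-1, 1].
garment = torch.rand(1, 3, 16, 16)

ys, xs = torch.meshgrid(torch.linspace(-1, 1, 16),
                        torch.linspace(-1, 1, 16), indexing="ij")
grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)  # identity sampling grid
grid = grid + torch.tensor([0.1, 0.0])             # toy flow: sample slightly right

warped = F.grid_sample(garment, grid, align_corners=True)
```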
Human Parsing
The system needs to understand which regions of the person image correspond to which body parts and clothing items. Human parsing (semantic segmentation of the human body) provides this understanding, producing a map that labels regions as face, hair, upper body, lower body, arms, shoes, and so forth. This map guides the generator in knowing where to place the new garment and what to preserve from the original image.
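How such a map guides generation can be sketched as a simple masked composite: regions labeled as the old garment are replaced with generator output, and everything else is kept from the original photo. The label id and arrays below are entirely hypothetical, chosen only to make the sketch runnable.

```python
import numpy as np

# Parsing-guided compositing with made-up data. Real parsing maps come
# from a human-parsing network and use a dataset-specific label scheme.
UPPER_CLOTHES = 5                              # hypothetical label id
parsing = np.zeros((8, 8), dtype=np.int64)
parsing[2:6, 2:6] = UPPER_CLOTHES              # toy "shirt" region

person = np.full((8, 8, 3), 0.2, dtype=np.float32)     # original photo
generated = np.full((8, 8, 3), 0.9, dtype=np.float32)  # generator output

mask = (parsing == UPPER_CLOTHES)[..., None]   # (H, W, 1) boolean mask
composite = np.where(mask, generated, person)  # new garment in, rest preserved
```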
Pose Conditioning
Body pose varies dramatically between images. A person might be standing straight, turning sideways, raising an arm, or crossing their legs. The try-on system must account for these variations. Pose estimation (typically using OpenPose or similar frameworks) provides skeletal keypoints that encode the person’s pose, and these keypoints are fed to the generator as additional conditioning information.
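A common way to feed keypoints to a convolutional generator is to render each joint as a Gaussian heatmap and stack the maps as extra input channels. The sketch below uses made-up coordinates and sizes; the sigma value and per-joint-channel layout are conventions, not values from this project.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, size=32, sigma=2.0):
    """Render (x, y) keypoints as one Gaussian heatmap per joint."""
    ys, xs = np.mgrid[0:size, 0:size]
    maps = []
    for (x, y) in keypoints:
        maps.append(np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)))
    return np.stack(maps)  # (num_joints, H, W), ready to concat as channels

heatmaps = keypoints_to_heatmaps([(10, 12), (20, 8)])  # two toy joints
```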
The Discriminator’s Role in VTON
In a virtual try-on GAN, the discriminator must judge something more specific than “does this look like a real photo.” It must evaluate: does the output look like a real person naturally wearing real clothes? This means the discriminator implicitly learns about garment draping, body proportions, fabric behavior, and the subtle visual cues that distinguish a well-fitting garment from an awkward composite.
What the Generator Must Accomplish
The generator in a VTON system has a remarkably complex task. It must simultaneously:
- Warp the garment to match the person’s body shape and pose
- Blend the garment naturally with the person’s body, handling occlusions (arms in front of the torso, for example)
- Preserve garment details — patterns, textures, logos, stitching
- Maintain body proportions and identity — the person should still look like themselves
Achieving all four simultaneously is what makes virtual try-on one of the most demanding applications of conditional image generation.
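These competing objectives typically show up as a weighted sum of loss terms: an adversarial term, a pixel-level L1 reconstruction term, and a perceptual term computed on deep features. The sketch below is schematic: the weights are illustrative, a random convolution stands in for the usual VGG feature extractor, and the discriminator score is faked.

```python
import torch
import torch.nn.functional as F

# Schematic VTON generator objective. Weights and the feature extractor
# are stand-ins; papers typically use VGG features and tuned weights.
feat = torch.nn.Conv2d(3, 8, 3, padding=1)   # stand-in for a VGG feature layer

output = torch.rand(1, 3, 32, 32, requires_grad=True)  # generator result
target = torch.rand(1, 3, 32, 32)                      # ground-truth try-on image
d_fake = torch.sigmoid(torch.randn(1, 1))              # stand-in discriminator score

adv = -torch.log(d_fake + 1e-8).mean()        # fool the discriminator
l1 = F.l1_loss(output, target)                # pixel-level reconstruction
perc = F.l1_loss(feat(output), feat(target))  # feature-level (perceptual) similarity

g_loss = 1.0 * adv + 10.0 * l1 + 10.0 * perc  # illustrative weighting
```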
The GAN Era and What Came After
From 2014 through approximately 2021, GANs were the definitive approach for high-quality image generation. Beyond virtual try-on, they powered face generation (and DeepFakes), image super-resolution (SRGAN, ESRGAN), artistic style transfer, image inpainting, and countless other applications.
Then the landscape shifted. Diffusion models — DDPM (2020), DALL-E 2 (2022), Stable Diffusion (2022), Midjourney — demonstrated that an entirely different approach could match and exceed GAN quality:
- More stable training: No adversarial dynamics means no mode collapse and no delicate balancing act between two networks.
- Better mode coverage: Diffusion models tend to capture the full diversity of the training distribution more faithfully.
- Higher quality at scale: With sufficient compute, diffusion models produce stunning results across a wider range of domains.
- But slower inference: The iterative denoising process requires multiple forward passes (often 20-50 steps), making generation slower than a single GAN forward pass.
META FIT was built during the GAN era. The technical choices throughout this series reflect what was state-of-the-art at the time of development. This is not a limitation — it is historical context. Understanding the GAN-based approach provides a foundation for appreciating both its achievements and the directions that modern diffusion-based methods have since opened.
Whether diffusion models can outperform GANs specifically for virtual try-on — where geometric warping and garment-specific constraints are critical — remains an active area of research that we will revisit in the final installment of this series.
What Comes Next
Now that we understand how GANs work and why they are suited for virtual try-on, the next part takes us inside the actual implementation — the PF-AFN model, its custom CUDA correlation kernels, and the two-stage pipeline that transforms a person photo and a garment image into a virtual try-on result.
META FIT Series:
- Part 1: From Photo Booths to Virtual Try-On
- Part 2: Understanding GANs — The Engine Behind Virtual Try-On (You are here)
- Part 3: Inside PF-AFN — The Try-On Engine in Code
- Part 4: Pose Estimation, Body Measurement, and 3D Reconstruction
- Part 5: Results, Failure Modes, and the Path to Modern Image Generation