Part 1: From Photo Booths to Virtual Try-On — The 20-Year Quest

Introduction

Some ideas refuse to go away. They sit in the back of your mind for years, quietly waiting for technology to catch up. This project is one of those ideas.

Roughly twenty years ago — before smartphones were ubiquitous, before deep learning existed as a practical tool, before anyone outside academia had heard the term “generative AI” — I had a simple thought: what if you could see how clothes look on you without physically putting them on?

That question has followed me through two decades of technological change. The answer, it turns out, required a convergence of computer vision, generative adversarial networks, human pose estimation, and a great deal of persistence.

This is the story of META FIT — and this first installment covers the vision that started it all and the comprehensive research survey that shaped the technical direction.


The Origin: A Photo Booth for Fashion

If you have ever been to a Japanese arcade, you are likely familiar with purikura — photo booth machines where you step inside, pose for a camera, and receive decorated prints of yourself. They have been a cultural fixture in Japan for decades.

My original concept was essentially purikura for fashion. Imagine a kiosk installed in a clothing store: you stand in front of a camera, select garments from a catalog, and the screen shows you wearing those clothes in real time. No changing room required. No undressing. Just instant visual feedback on how a garment looks on your body.

At the time, the technology to make this work simply did not exist. Image processing was limited. Real-time body segmentation was a research problem, not a product feature. The idea went into a mental filing cabinet.

But as the years passed, the landscape shifted dramatically. Smartphone cameras reached high resolution. Deep learning frameworks matured. And in 2014, Ian Goodfellow introduced Generative Adversarial Networks — a breakthrough that would eventually make virtual try-on possible.

The hardware kiosk concept evolved into something far more accessible: a smartphone app. The core question, however, never changed: can we let people see how clothes look and fit on their body without physically trying them on?


The Business Problem: Why This Matters

Virtual try-on is not just a technical curiosity — it addresses a significant economic problem in the apparel industry.

Return rates in online fashion retail are staggering. Studies consistently report that 30-40% of clothing purchased online is returned, a rate far higher than most other e-commerce categories. The primary reason is straightforward: customers cannot assess fit, silhouette, or draping on their specific body type before purchasing.

This creates a lose-lose situation:

  • For consumers: Returns are inconvenient. Repackaging, shipping back, waiting for refunds — it erodes the convenience that online shopping is supposed to provide.
  • For retailers: Each return incurs shipping costs, warehouse processing, potential markdowns on returned items, and environmental waste. Some estimates put the cost of apparel returns at over $100 billion annually in the US alone.

A reliable virtual try-on system that accurately shows how a garment looks on a specific person’s body could dramatically reduce these return rates. Even a modest improvement — say, reducing returns from 35% to 25% — would represent billions of dollars in savings across the industry.
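The savings claim is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below uses only illustrative assumed figures (the revenue base and the per-return cost ratio are assumptions, not sourced industry data):

```python
# Back-of-the-envelope estimate of return-cost savings.
# All input figures are illustrative assumptions, not sourced data.

def annual_return_savings(online_revenue: float,
                          return_rate_before: float,
                          return_rate_after: float,
                          cost_per_return_ratio: float = 0.2) -> float:
    """Estimate yearly savings if virtual try-on lowers the return rate.

    cost_per_return_ratio: fraction of an item's price lost per return
    (shipping, handling, markdowns) -- an assumed figure.
    """
    returns_avoided = online_revenue * (return_rate_before - return_rate_after)
    return returns_avoided * cost_per_return_ratio

# Example: $100B in online apparel sales, returns dropping from 35% to 25%,
# each return costing ~20% of the item's value.
savings = annual_return_savings(100e9, 0.35, 0.25, 0.20)
print(f"Estimated savings: ${savings / 1e9:.1f}B per year")  # $2.0B per year
```

Even under these conservative assumptions, a ten-point reduction in return rate translates to billions of dollars annually, which is the scale the paragraph above refers to.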

This is the business case that motivated META FIT. The technical challenge was formidable, but the potential impact was clear.


VTON Research Survey: Mapping the Landscape

Before writing a single line of code, I conducted a comprehensive survey of Virtual Try-On (VTON) research. The field had been active since approximately 2017, and by the time I began this project, there were over fifteen distinct approaches worth evaluating.

I assessed each model across several dimensions: image quality, body coverage (upper body vs. full body), pose flexibility, data requirements (paired vs. unpaired), and practical feasibility for implementation.

2D Upper-Body Methods

The earliest VTON research focused on upper-body garment transfer — replacing a person’s top with a target garment. These methods established the foundational techniques that later work built upon.

  • CAGAN (2017): Conditional Analogy GAN — swaps garments using a triplet of person image, current garment, and target garment. Conceptually simple, but struggles with patterned fabrics and fine details.
  • VITON (2017): Introduced mask-based garment warping. A significant step forward: only requires the person image and the target garment (no image of the current garment needed).
  • CP-VTON (2018): Improved upon VITON with a Geometric Matching Module (GMM) for parameter-based warping. Better preservation of textures, logos, and garment structure.
  • WUTON (2019): End-to-end adversarial training using both "try-on" and "try-back" images as supervision. Achieved more accurate reproduction of textures and embroidery.
  • MG-VTON (2019): Extended virtual try-on to diverse poses. Previous methods generally required fixed, front-facing poses — this work relaxed that constraint.

These models demonstrated that GAN-based virtual try-on was feasible, but each came with significant limitations. Most could only handle frontal poses, upper-body garments, and relatively simple patterns.
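The mechanics shared by these pipelines — warp the flat garment image toward the person's pose, then composite it through a segmentation mask — can be sketched in a few lines. This is a minimal illustration, not any specific model: the affine matrix is a hypothetical stand-in for the learned transformation (CP-VTON's GMM predicts a thin-plate spline from pose and body-shape features), and real systems generate the mask rather than taking it as input.

```python
import numpy as np

def warp_affine(garment: np.ndarray, M: np.ndarray, out_shape) -> np.ndarray:
    """Nearest-neighbour inverse warp of `garment` by a 2x3 affine matrix M.

    Stand-in for the learned geometric warp (e.g. the TPS predicted by
    CP-VTON's GMM); real models infer the transform from pose/shape cues.
    """
    H, W = out_shape
    ys, xs = np.mgrid[0:H, 0:W]
    ones = np.ones_like(xs)
    # Map each output pixel back into garment coordinates.
    coords = np.stack([xs, ys, ones], axis=-1) @ M.T          # (H, W, 2)
    gx = np.clip(coords[..., 0].round().astype(int), 0, garment.shape[1] - 1)
    gy = np.clip(coords[..., 1].round().astype(int), 0, garment.shape[0] - 1)
    return garment[gy, gx]

def composite(person: np.ndarray, warped: np.ndarray, mask: np.ndarray):
    """Paste the warped garment onto the person image where mask == 1."""
    return np.where(mask[..., None].astype(bool), warped, person)

# Toy example: 4x4 images, identity warp, garment masked into the upper-left.
person = np.zeros((4, 4, 3), dtype=np.uint8)          # "person" pixels
garment = np.full((4, 4, 3), 255, dtype=np.uint8)     # "garment" pixels
M = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])      # identity: no warp
mask = np.zeros((4, 4)); mask[:2, :2] = 1
result = composite(person, warp_affine(garment, M, (4, 4)), mask)
```

The hard part — and where the research effort actually went — is predicting the warp and the mask well enough that textures, logos, and seams survive the transformation.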

2D Full-Body Methods

The next generation of research expanded the scope from upper-body garments to full outfits, addressing a much harder problem: simultaneously handling tops, bottoms, and their interaction with the full body.

  • SwapNet (2019): Weakly-supervised approach with separate Warping and Texturing modules. Notably, does not require paired training data, reducing data collection burden.
  • M2E-TON (2020): Model-to-Everyone — transfers garments directly between person images without requiring separate product photos. This significantly reduces the data pipeline complexity.
  • O-VITON (2020): Simultaneous multi-item try-on (top and bottom together) from unpaired data. Addressed a key limitation of single-garment methods.
  • PASTA-GAN++ (2022): Patch-based style transfer for full-body try-on. Handles tops, bottoms, and full outfits in separate modes with strong detail preservation. This was selected as the primary model for META FIT.

3D Approaches

I also explored 3D reconstruction methods, most notably PIFu (Pixel-Aligned Implicit Function), which reconstructs a 3D mesh of a person from a single 2D photo. The appeal was clear: with a 3D body model, garment simulation could theoretically produce highly accurate try-on results with proper physics-based draping.

However, 3D garment simulation introduced substantial additional complexity — cloth physics, collision detection, texture mapping onto deformable surfaces — that was beyond the scope of the initial prototype. I noted 3D reconstruction as a promising future direction and focused on 2D approaches for the first version.

Datasets

Three datasets appeared repeatedly across the literature and formed the basis for training and evaluation:

  • VITON Dataset: The standard benchmark for paired upper-body try-on. Contains person images paired with corresponding garment images.
  • DeepFashion / DeepFashion2: Large-scale datasets with diverse poses, garment categories, and annotations. Essential for training models that generalize beyond frontal poses.
  • Fashionpedia: Fine-grained attribute annotations (sleeve length, neckline type, fabric pattern) useful for conditioning models on specific garment properties.

Why PASTA-GAN++ Was Selected

After evaluating the full landscape, PASTA-GAN++ emerged as the strongest candidate for META FIT’s core engine. The reasoning was as follows:

Full-body coverage. Unlike the earlier upper-body methods (VITON, CP-VTON), PASTA-GAN++ handles the entire body. For a practical virtual try-on application, showing only a new top while ignoring bottoms and shoes would be insufficient.

Multi-garment modes. The model supports separate modes for tops, bottoms, and full outfits. This flexibility allows users to try on individual pieces or complete looks.

Patch-based detail preservation. The patch-based style transfer approach proved effective at preserving fine garment details — patterns, textures, seams — that global approaches tended to blur or distort.
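The patch idea itself is simple to illustrate. The sketch below assumes nothing about PASTA-GAN++'s actual architecture; it only shows the basic operation of splitting an image into fixed-size local patches, which lets a model transfer style region by region instead of through one global transformation that tends to blur fine texture:

```python
import numpy as np

def to_patches(img: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping p x p patches.

    Returns an array of shape (H//p * W//p, p, p, C). Operating on local
    patches preserves fine detail (patterns, seams) that a single global
    style transformation would smear across the whole garment.
    """
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image dims must be divisible by p"
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p, p, C)

img = np.arange(8 * 8 * 3).reshape(8, 8, 3)
patches = to_patches(img, 4)
print(patches.shape)  # (4, 4, 4, 3): four 4x4 patches, 3 channels each
```

In an actual patch-based pipeline, each patch would then be normalized, matched to a body region via pose keypoints, and fed through the generator; the reshape above is only the entry point.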

Unpaired data compatibility. PASTA-GAN++ can train with unpaired data, meaning it does not require every person image to be paired with every garment image. This dramatically reduces the data collection requirements for fine-tuning on custom datasets.

No model was perfect. PASTA-GAN++ still struggled with certain edge cases — unusual poses, complex layering, accessories — but it represented the best balance of capability and feasibility available at the time.


What Comes Next

This survey established the technical foundation for the entire project. Understanding the strengths and limitations of each approach was essential for making informed architectural decisions.

In the next part, we will dive deep into the engine behind all these models — Generative Adversarial Networks — and understand why GANs were the dominant paradigm for image generation before diffusion models emerged.

