Part 3: Inside PF-AFN — The Try-On Engine in Code

Introduction

In Part 2, we covered the theory behind GANs and how they connect to the virtual try-on problem. Now it is time to open the hood and look at the actual code.

PF-AFN (Parser-Free Appearance Flow Network) is the core try-on engine that powers META FIT’s garment transfer pipeline. This article is a code-level walkthrough of its architecture — the Feature Pyramid encoder, the CUDA-accelerated correlation kernels, the coarse-to-fine optical flow estimation, and the ResUnet generator that composites everything into a final image.

This is the most technically dense installment in the series. If you have read through GANs in Part 2 and want to understand precisely how a virtual try-on model works at the implementation level, this is where that understanding takes shape.


The Two-Stage Pipeline

PF-AFN operates in two distinct stages. The full inference pipeline, extracted from test.py, looks like this:

# Stage 1: Warp the garment to match the person's body
warped_cloth, last_flow = warp_model(real_image, clothes)
# The garment's edge mask is warped with the same flow field
warped_edge = F.grid_sample(edge, last_flow.permute(0, 2, 3, 1))

# Stage 2: Generate the composite image
gen_output = gen_model(torch.cat([real_image, warped_cloth, warped_edge], 1))
p_rendered, m_composite = torch.split(gen_output, [3, 1], 1)
p_rendered = torch.tanh(p_rendered)
m_composite = torch.sigmoid(m_composite)

# Final composition: blend warped garment with rendered output
p_tryon = warped_cloth * m_composite + p_rendered * (1 - m_composite)

Stage 1 (AFWM) answers the question: where should each pixel of the garment go on the person’s body? It produces a spatially warped version of the garment that conforms to the person’s shape and pose.

Stage 2 (ResUnet Generator) answers the question: how do we blend this warped garment with the person to produce a natural-looking result? It generates both a rendered person image and a composite mask that controls the blending.

The final line is the composition formula: where the mask is close to 1, the warped garment appears directly; where it is close to 0, the network’s rendered output takes over. This handles skin, hair, background, and the subtle transitions between garment and body.
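The composition formula can be exercised on its own with toy tensors (a standalone sketch; shapes and value ranges mirror the pipeline above):

```python
import torch

# Illustrative shapes: batch of 1, 256 x 192 images
warped_cloth = torch.rand(1, 3, 256, 192) * 2 - 1  # in [-1, 1], like tanh output
p_rendered = torch.rand(1, 3, 256, 192) * 2 - 1    # in [-1, 1], like tanh output
m_composite = torch.rand(1, 1, 256, 192)           # in [0, 1], like sigmoid output

# Per-pixel convex combination; the 1-channel mask broadcasts over RGB
p_tryon = warped_cloth * m_composite + p_rendered * (1 - m_composite)
print(tuple(p_tryon.shape))  # (1, 3, 256, 192)
```

Because the mask is in [0, 1], the result is a convex combination of the two images, so the output stays in the same [-1, 1] range as its inputs.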


Stage 1: The AFWM Architecture

The Appearance Flow Warping Module (defined in afwm.py) is itself composed of three submodules: a FeatureEncoder, a RefinePyramid, and an AFlowNet. Together, they estimate a dense optical flow field that maps every pixel in the garment image to its target position on the person.

FeatureEncoder

The encoder is a straightforward convolutional feature extractor that produces multi-scale representations:

class FeatureEncoder(nn.Module):
    def __init__(self, in_channels, chns=[64, 128, 256, 256, 256]):
        super().__init__()
        self.encoders = nn.ModuleList()
        for out_channels in chns:
            self.encoders.append(
                nn.Sequential(
                    nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1),
                    nn.BatchNorm2d(out_channels),
                    nn.ReLU(inplace=True)
                )
            )
            in_channels = out_channels

    def forward(self, x):
        # Collect the feature map at every scale for the pyramid
        features = []
        for encoder in self.encoders:
            x = encoder(x)
            features.append(x)
        return features

Five convolutional layers with stride=2 progressively downsample the input, producing feature maps at five different resolutions. The channel dimensions increase as [64, 128, 256, 256, 256], capturing increasingly abstract spatial information at each level.

Both the person image and the garment image pass through separate FeatureEncoders, producing two sets of five-level feature pyramids.
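The downsampling arithmetic can be verified with a quick standalone sketch of the same layer stack (input size 256 x 192 assumed, per the data pipeline later in this article):

```python
import torch
import torch.nn as nn

# Minimal re-creation of the five-level encoder described above
chns = [64, 128, 256, 256, 256]
layers, in_c = [], 3
for out_c in chns:
    layers.append(nn.Sequential(
        nn.Conv2d(in_c, out_c, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_c),
        nn.ReLU(inplace=True)))
    in_c = out_c

x = torch.randn(1, 3, 256, 192)
for i, layer in enumerate(layers):
    x = layer(x)
    print(f"level {i}: {tuple(x.shape)}")
# Each stride-2 conv halves H and W: 128x96, 64x48, 32x24, 16x12, 8x6
```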

RefinePyramid (Feature Pyramid Network)

The raw encoder features are then refined through lateral connections, similar to a Feature Pyramid Network (FPN):

class RefinePyramid(nn.Module):
    def __init__(self, chns=[64, 128, 256, 256, 256], fpn_dim=256):
        super().__init__()
        self.lateral = nn.ModuleList()
        self.smooth = nn.ModuleList()
        for in_chns in chns:
            # 1x1 laterals project every level to a common width (fpn_dim),
            # so upsampled coarser features can be added directly
            self.lateral.append(nn.Conv2d(in_chns, fpn_dim, 1))
            self.smooth.append(nn.Conv2d(fpn_dim, fpn_dim, 3, padding=1))

The refinement proceeds top-down: each level's encoder features are projected to a common channel width by a 1x1 lateral convolution, the coarser (higher-level) pyramid feature is upsampled and added to it, and a 3x3 smoothing convolution merges the result. This ensures that each pyramid level carries both local detail and global context, which is essential for accurate flow estimation at every scale.
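The top-down combination can be sketched on its own (a minimal standalone example, assuming three pyramid levels that already share one channel width):

```python
import torch
import torch.nn.functional as F

# Three pyramid levels, finest first, coarsest last, all 256 channels
feats = [torch.randn(1, 256, 32, 24),
         torch.randn(1, 256, 16, 12),
         torch.randn(1, 256, 8, 6)]

refined = [feats[-1]]                          # start from the coarsest map
for f in reversed(feats[:-1]):
    up = F.interpolate(refined[-1], size=f.shape[2:], mode="nearest")
    refined.append(f + up)                     # add upsampled coarser features
refined = refined[::-1]                        # back to finest-first order

print([tuple(r.shape) for r in refined])
```

Every output level keeps its input resolution but now mixes in context propagated down from coarser levels.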

AFlowNet: The Core Innovation

The Appearance Flow Network is where PF-AFN’s key contribution lives. It estimates the optical flow in a coarse-to-fine manner across all five pyramid levels:

class AFlowNet(nn.Module):
    def forward(self, x, x_warps, x_conds):
        # Simplified sketch: x is the garment image; x_warps / x_conds are
        # the garment / person feature pyramids, ordered coarsest to finest
        flow = None
        for i in range(self.num_levels):
            x_warp, x_cond = x_warps[i], x_conds[i]

            if flow is not None:
                # 5. (from the previous iteration) Upsample the coarser flow,
                #    doubling its magnitude to match the finer resolution,
                #    and pre-warp the garment features with it
                flow = 2 * F.interpolate(flow, scale_factor=2, mode='bilinear')
                x_warp = F.grid_sample(x_warp, flow.permute(0, 2, 3, 1))

            # 1. Compute correlation between person and garment features
            corr = self.correlation(x_cond, x_warp)

            # 2. Concatenate correlation, person features, and current flow
            combined = torch.cat([corr, x_cond], 1) if flow is None else \
                       torch.cat([corr, x_cond, flow], 1)

            # 3. Predict a flow residual through convolutional layers
            flow_residual = self.flow_conv[i](combined)

            # 4. Add the residual to the upsampled flow from the coarser level
            flow = flow_residual if flow is None else flow + flow_residual

        # Warp the full-resolution garment image with the finest flow
        warped_cloth = F.grid_sample(x, flow.permute(0, 2, 3, 1))
        return warped_cloth, flow

The logic proceeds from the coarsest level (smallest spatial resolution) to the finest:

  1. At each level, a correlation map is computed between the person features and the (possibly already warped) garment features. This measures spatial similarity: for each position in the person feature map, how similar is the corresponding garment feature at each nearby position?

  2. The correlation map is concatenated with the person features and the current flow estimate (upsampled from the previous coarser level).

  3. A stack of convolutional layers predicts a flow residual — a correction to the current flow estimate.

  4. The residual is added to the upsampled flow, refining the alignment.

  5. The updated flow is used to warp the garment features at the next (finer) level, preparing them for the next iteration.

This coarse-to-fine strategy is critical. The coarsest level handles large-scale alignment (getting the garment roughly in position), while finer levels progressively correct local details (sleeve placement, neckline alignment, hemline curvature). Without this hierarchy, the model would need to solve global and local alignment simultaneously — a much harder optimization problem.
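The warping operation that drives steps 4 and 5 rests on F.grid_sample. A minimal, self-contained sketch of warping a feature map with a dense flow field (the helper name and the normalized-coordinate convention are illustrative, not taken from afwm.py):

```python
import torch
import torch.nn.functional as F

def warp_with_flow(feat, flow):
    """Warp a feature map by a dense flow given in normalized [-1, 1] offsets."""
    n, _, h, w = feat.shape
    # Identity sampling grid in normalized coordinates
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    # Add the flow offsets: (n, 2, h, w) -> (n, h, w, 2)
    return F.grid_sample(feat, grid + flow.permute(0, 2, 3, 1),
                         align_corners=True)

# Sanity check: a zero flow is the identity warp
feat = torch.randn(1, 256, 16, 12)
zero_flow = torch.zeros(1, 2, 16, 12)
assert torch.allclose(warp_with_flow(feat, zero_flow), feat, atol=1e-5)
```

Each entry of the flow field moves the sampling location for one output pixel; bilinear interpolation inside grid_sample keeps the operation differentiable, which is what lets the flow be learned end to end.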


CUDA Correlation Kernels

The correlation computation at the heart of AFlowNet is performance-critical. Computing the similarity between every spatial position in two feature maps across a local neighborhood is an O(H x W x K x K x C) operation per pyramid level. PF-AFN accelerates this with custom CUDA kernels implemented via the CuPy library:

kernel_Correlation_rearrange = '''
    extern "C" __global__ void kernel(...) {
        // Rearranges feature maps into a padded layout
        // optimized for subsequent correlation computation.
        // Maps from (batch, channel, height, width) to
        // a contiguous memory layout with spatial padding.
    }
'''

kernel_Correlation_updateOutput = '''
    extern "C" __global__ void kernel(...) {
        // For each spatial position in the first feature map,
        // computes dot products over channels with positions
        // in a local neighborhood of the second feature map.
        // The neighborhood is defined by the kernel size parameter.
        // Output: a 3D correlation volume of shape
        //   (batch, kernel_size * kernel_size, height, width)
    }
'''

kernel_Correlation_updateGradFirst = '''...'''
kernel_Correlation_updateGradSecond = '''...'''

Four kernels handle the forward and backward passes: rearranging memory for cache-friendly access, computing the correlation volume, and propagating gradients back to both input feature maps.

The correlation output can be interpreted intuitively: for each position in the person feature map, it produces a similarity score for every position within a local window of the garment feature map. High values indicate strong feature matches, which guide the flow estimation toward correct alignment.

This CUDA implementation was one of the significant infrastructure requirements for the project. It requires an NVIDIA GPU and the CuPy library (which provides Python bindings for custom CUDA code). A pure Python implementation would be orders of magnitude slower and impractical for the iterative flow refinement across five pyramid levels.
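For reference, the quantity these kernels compute can be expressed (far more slowly) in pure PyTorch by padding and shifting the second feature map. This is a naive sketch for intuition, not the repo's implementation:

```python
import torch
import torch.nn.functional as F

def correlation_reference(feat1, feat2, d=1):
    """Naive correlation: dot product over channels between feat1 at (y, x)
    and feat2 at every offset in a (2d+1) x (2d+1) neighborhood."""
    n, c, h, w = feat1.shape
    k = 2 * d + 1
    feat2_pad = F.pad(feat2, (d, d, d, d))
    out = []
    for dy in range(k):
        for dx in range(k):
            shifted = feat2_pad[:, :, dy:dy + h, dx:dx + w]
            out.append((feat1 * shifted).sum(dim=1))  # channel dot product
    return torch.stack(out, dim=1)  # (n, k*k, h, w)

corr = correlation_reference(torch.randn(2, 64, 16, 12),
                             torch.randn(2, 64, 16, 12))
print(tuple(corr.shape))  # (2, 9, 16, 12)
```

The center channel (zero offset) is simply the per-position dot product of the two feature maps; surrounding channels measure how well each garment feature matches slightly shifted person positions.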


Stage 2: The ResUnet Generator

Once the garment is warped, the second stage composites it onto the person. The generator architecture is a U-Net enhanced with residual blocks:

ResUnetGenerator(input_nc=7, output_nc=4, num_downs=5, ngf=64)

Input (7 channels): The person image (3 channels), the warped garment (3 channels), and the warped edge map (1 channel) are concatenated along the channel dimension.

Output (4 channels): A rendered person image (3 channels) and a composite mask (1 channel).

The U-Net structure uses five downsampling levels with channel progression 64 -> 128 -> 256 -> 512 -> 512, followed by symmetric upsampling with skip connections from the encoder. Each level includes ResidualBlocks that stabilize training and improve gradient flow:

class ResidualBlock(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(in_features, in_features, 3),
            nn.InstanceNorm2d(in_features),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            nn.Conv2d(in_features, in_features, 3),
            nn.InstanceNorm2d(in_features)
        )

    def forward(self, x):
        return x + self.block(x)

Two design choices are worth noting. First, ReflectionPad2d is used instead of zero padding to avoid border artifacts — the padded values mirror the edge pixels rather than introducing artificial zeros. Second, InstanceNorm (rather than BatchNorm) normalizes each sample independently, which is standard practice in image generation tasks where batch statistics can introduce unwanted correlations between generated images.
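The padding difference is easy to see on a tiny tensor (a standalone illustration):

```python
import torch
import torch.nn as nn

# A 3x3 "image" with distinct values, to make the padding visible
x = torch.arange(9, dtype=torch.float32).reshape(1, 1, 3, 3)

reflected = nn.ReflectionPad2d(1)(x)  # border mirrors interior pixels
zeroed = nn.ZeroPad2d(1)(x)           # border is artificial zeros

# The top-left corner of the reflected output mirrors x[1, 1],
# while zero padding puts a hard 0 against the edge
print(reflected[0, 0, 0, 0].item(), zeroed[0, 0, 0, 0].item())  # 4.0 0.0
```

Convolving over that hard zero border is what produces the dark fringe artifacts that reflection padding avoids.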

The Composite Mask

The fourth output channel, passed through sigmoid, produces the composite mask m_composite with values between 0 and 1. This mask is the key to natural-looking results:

  • Where m_composite is near 1: The warped garment appears directly. These are garment regions where the texture, pattern, and color should be preserved exactly as in the original product image.
  • Where m_composite is near 0: The generator’s rendered output takes over. These are skin areas, background, hair, and other non-garment regions.
  • At boundaries: The mask produces smooth gradients, creating natural transitions between garment and body. This soft blending is what prevents the “pasted on” look that simpler overlay methods produce.

Data Pipeline and Infrastructure

Input Specifications

From aligned_dataset_test.py, the data pipeline expects:

Input           Specification
Image size      256 x 192 pixels
Person image    Full-body photo (front-facing)
Clothes image   Flat-lay or mannequin garment photo
Edge map        Garment outline (preprocessed)
Normalization   Scale to [-1, 1] range
Transform       Scale to width 192, resize to [256, 192]
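The resize-and-normalize step can be sketched in plain PyTorch (the repo itself builds this from torchvision-style transforms; the function name here is illustrative):

```python
import torch
import torch.nn.functional as F

def preprocess(img_uint8):
    """Resize a (3, H, W) uint8 image to 256 x 192 and scale to [-1, 1]."""
    x = img_uint8.float().unsqueeze(0)  # (1, 3, H, W)
    x = F.interpolate(x, size=(256, 192), mode="bilinear", align_corners=False)
    return x / 127.5 - 1.0              # [0, 255] -> [-1, 1]

img = torch.randint(0, 256, (3, 512, 384), dtype=torch.uint8)
out = preprocess(img)
print(tuple(out.shape))  # (1, 3, 256, 192)
```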

Docker Environment

The Dockerfile reveals the infrastructure requirements:

FROM nvcr.io/nvidia/pytorch:18.04-py3
# NVIDIA NGC container with CUDA support

RUN pip install torch==1.2.0 cupy==6.0.0
# PyTorch 1.2.0 with CUDA 9.2
# CuPy 6.0.0 for custom CUDA correlation kernels

This is a heavyweight environment. The NVIDIA NGC base container alone is several gigabytes, and the GPU memory requirements are substantial. Running this on consumer hardware is possible but constrained — inference on a single image pair requires significant GPU memory, and throughput is limited compared to what a consumer-facing application would demand.


From PF-AFN to PASTA-GAN++

PF-AFN demonstrated a critical insight: by using learned appearance flow, the system could warp garments without requiring explicit human parsing as input (hence “Parser-Free”). However, for META FIT’s production pipeline, we moved to PASTA-GAN++, which reintroduces parsing information for improved accuracy.

PASTA-GAN++ integrates three models into a single pipeline:

Component        Model File                       Purpose
GAN weights      network-snapshot-004408.pkl      Core try-on generation
Pose estimation  body_pose_model.pth (OpenPose)   Skeletal keypoint detection
Human parsing    inference.pth (Graphonomy)       Body region segmentation

The trade-off is clear: PF-AFN’s parser-free approach is simpler but less accurate; PASTA-GAN++ adds complexity (three models instead of one) but achieves better results, particularly for full-body try-on. It supports three modes — full (top and bottom together), upper, and lower — and was trained on the DeepFashion dataset rather than VITON, giving it broader coverage of garment types and body poses.


What Comes Next

With the core try-on engine examined in detail, the next part explores the supporting systems that make virtual try-on practical: OpenPose for pose estimation, Graphonomy for human parsing, the custom body measurement system, and an exploration of PiFu for 3D reconstruction.

