Course 6 • Lesson 2

Diffusion Mathematics & Latent Space

Unpack the rigorous mathematical formulations undergirding Denoising Diffusion Probabilistic Models (DDPM) and Latent Diffusion (Stable Diffusion). Master VAE reduction, U-Net attention blocks, and Classifier-Free Guidance equations.

DDPM Markov Chains • Reparameterization Trick • VAE Latents • Cross-Attention (CLIP) • CFG Math Playground

1. The Forward Process: Systematically Adding Noise

Diffusion models work by defining a forward process that slowly destroys the structure of an image by injecting Gaussian noise in $T$ discrete steps. This forms a parameterized Markov Chain:

q(x_t | x_{t-1}) = N(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)

Where \beta_t is the noise variance added at step t, set by a variance schedule that typically increases from $10^{-4}$ to $0.02$.

Applying this step by step to reach a given step t is computationally inefficient. By defining \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{i=1}^t \alpha_i, and using the Reparameterization Trick, we can sample x_t at any arbitrary step t directly from x_0 in closed form:

The Closed-Form Forward Diffusion Formula

x_t = \sqrt{\bar{\alpha}_t} • x_0 + \sqrt{1 - \bar{\alpha}_t} • \epsilon

Where \epsilon \sim N(0, I) represents standard independent identically distributed Gaussian noise, and x_0 is the clean input image.
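As a minimal sketch of the closed-form formula above, the following PyTorch snippet precomputes \bar{\alpha}_t and noises a batch in a single call. The linear spacing of \beta_t, the helper names `make_alpha_bars` and `q_sample`, and the 0-based timestep indexing are assumptions made for illustration, not part of the lesson's reference code.

```python
import torch

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed)."""
    betas = torch.linspace(beta_start, beta_end, num_steps)
    return torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, alpha_bars, noise=None):
    """Draw x_t ~ q(x_t | x_0) in closed form.

    x0: clean batch [B, C, H, W]; t: integer timesteps [B], 0-based.
    """
    if noise is None:
        noise = torch.randn_like(x0)
    # Reshape to [B, 1, 1, 1] so the per-sample scalars broadcast over C, H, W
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

alpha_bars = make_alpha_bars()
x0 = torch.ones(2, 3, 8, 8)          # placeholder "images"
t = torch.tensor([0, 999])           # first and last step of the schedule
xt = q_sample(x0, t, alpha_bars)
print(xt.shape, alpha_bars[0].item(), alpha_bars[-1].item())
```

With this schedule \bar{\alpha}_t starts very close to 1 (almost no noise) and decays to nearly 0 at the last step, so x_T is close to pure Gaussian noise.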

2. The Reverse Process & Denoising Loss

The goal of the neural network (specifically a U-Net architecture) is to predict the noise vector \epsilon injected at step t, allowing us to subtract it and reconstruct the clean image.
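To make "subtract it" concrete: rearranging the closed-form forward formula for x_0 and substituting the network's prediction for the true noise gives an estimate of the clean image. The hat notation \hat{x}_0 is introduced here only for illustration.

```latex
\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
```

If the prediction were exact, \epsilon_\theta(x_t, t) = \epsilon, this would recover x_0 exactly. In practice the estimate is poor at large t and is refined over many steps.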

We optimize the network parameters \theta using a simplified loss function derived from the Variational Lower Bound (ELBO):

Simplified MSE Diffusion Loss

L_simple(\theta) = E_{t, x_0, \epsilon} [ || \epsilon - \epsilon_\theta(x_t, t) ||^2 ]

Where \epsilon_\theta(x_t, t) is the neural network's prediction of the noise contained in the noisy sample x_t at step t.
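The loss above can be sketched as a single training step. Everything in this block is a toy stand-in under stated assumptions: `TinyNoisePredictor` is a hypothetical two-layer conv net used in place of a real U-Net, the data is random, and the schedule is the linear one from Section 1. Note that `mse_loss` averages over elements, which matches the squared norm only up to a constant factor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNoisePredictor(nn.Module):
    """Hypothetical stand-in for the U-Net: takes x_t and t, returns a noise estimate."""
    def __init__(self, channels=3, hidden=16, num_steps=1000):
        super().__init__()
        self.time_embed = nn.Embedding(num_steps, hidden)
        self.conv_in = nn.Conv2d(channels, hidden, 3, padding=1)
        self.conv_out = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x_t, t):
        # Add a learned per-timestep bias to every spatial location
        h = self.conv_in(x_t) + self.time_embed(t)[:, :, None, None]
        return self.conv_out(F.silu(h))

def diffusion_loss(model, x0, alpha_bars):
    """One Monte Carlo estimate of L_simple for a batch x0."""
    t = torch.randint(0, alpha_bars.shape[0], (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # closed-form forward step
    return F.mse_loss(model(x_t, t), eps)

torch.manual_seed(0)
alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
model = TinyNoisePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x0 = torch.randn(4, 3, 16, 16)   # random tensors standing in for images
loss = diffusion_loss(model, x0, alpha_bars)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Sampling t uniformly and a fresh \epsilon for each batch is what turns the expectation over (t, x_0, \epsilon) into a stochastic gradient estimate.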

3. Latent Diffusion: Shrinking the Domain

Running diffusion processes directly on pixel-space images (e.g. $512 \times 512$ resolution) is computationally expensive. Latent Diffusion Models (LDMs) address this by separating perceptual compression, handled by an autoencoder, from generative modeling, handled by the diffusion model in the compressed space.

Variational Autoencoder (VAE)

A high-resolution image $x \in R^{H \times W \times 3}$ is encoded into a low-dimensional latent $z = E(x) \in R^{h \times w \times c}$ by a VAE encoder, typically downsampling each spatial dimension by a factor of 8 (e.g., $512 \times 512$ pixels become $64 \times 64$ latents). The diffusion process runs exclusively in this compressed space.

Decoder Reconstruction

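Once the reverse loop finishes denoising the latent tensor $z_0$, the VAE decoder maps it back into standard high-resolution pixel space: $\tilde{x} = D(z_0)$. For a $512 \times 512$ image, the U-Net operates on $64\times$ fewer spatial positions ($64 \times 64$ instead of $512 \times 512$), which is where most of the compute savings come from.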
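The size reduction is easy to check with plain arithmetic. This sketch assumes a downsampling factor of 8 and 4 latent channels, matching the sampling code later in this lesson; `latent_shape` is a hypothetical helper.

```python
def latent_shape(height, width, factor=8, latent_channels=4):
    """Shape (h, w, c) of the VAE latent for an image of the given size."""
    return (height // factor, width // factor, latent_channels)

H, W = 512, 512
h, w, c = latent_shape(H, W)

pixel_values = H * W * 3                      # values in the RGB image
latent_values = h * w * c                     # values in the latent tensor
spatial_ratio = (H * W) / (h * w)             # reduction in spatial positions
value_ratio = pixel_values / latent_values    # reduction in total values

print((h, w, c))        # (64, 64, 4)
print(spatial_ratio)    # 64.0
print(value_ratio)      # 48.0
```

The 64x figure counts spatial positions. Counting channels as well, the latent holds 48x fewer values than the image.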

4. Text Conditioning & Cross-Attention Math

To direct the generation using prompts, Stable Diffusion introduces cross-attention blocks within the U-Net. A pre-trained CLIP text encoder outputs word token embeddings $y$. Inside the U-Net intermediate layers:

Attention(Q, K, V) = softmax( (Q • K^T) / \sqrt{d} ) • V

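Query (Q): Q = W_Q • z_t

Projected from the flattened spatial features of the noisy latent z_t, so each spatial location issues one query.

Key (K): K = W_K • y

Projected from the CLIP text token embeddings. Each key is scored against the queries to measure how relevant a token is to a location.

Value (V): V = W_V • y

Also projected from the text embeddings. The attention weights mix these values to determine what text information each spatial location receives.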
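The Q/K/V roles above can be sketched as a single-head cross-attention layer. This is a simplified illustration, not the Stable Diffusion implementation: real blocks use multiple heads, an output projection, and residual connections. The feature width of 320 and the class name `CrossAttention` are assumptions; the text shape [1, 77, 768] follows the CLIP dimensions quoted in the challenges section.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: image features query text tokens."""
    def __init__(self, latent_dim, text_dim, attn_dim):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, attn_dim, bias=False)  # W_Q
        self.to_k = nn.Linear(text_dim, attn_dim, bias=False)    # W_K
        self.to_v = nn.Linear(text_dim, attn_dim, bias=False)    # W_V
        self.scale = 1.0 / math.sqrt(attn_dim)

    def forward(self, z, y):
        # z: [B, N, latent_dim] flattened spatial features
        # y: [B, L, text_dim] text token embeddings
        q, k, v = self.to_q(z), self.to_k(y), self.to_v(y)
        # One row of weights per spatial location, one column per text token
        weights = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        return weights @ v, weights

torch.manual_seed(0)
attn = CrossAttention(latent_dim=320, text_dim=768, attn_dim=64)
z = torch.randn(1, 64 * 64, 320)   # 64x64 latent grid, flattened
y = torch.randn(1, 77, 768)        # 77 text tokens
out, weights = attn(z, y)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1, so every spatial location receives a convex combination of the text value vectors.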

Classifier-Free Guidance (CFG)

To strengthen adherence to the prompt, we compute two noise predictions: one conditioned on the prompt $c$ and one unconditioned (with the empty prompt $\emptyset$). We then combine them linearly using a guidance scale $s$, which extrapolates past the conditional prediction whenever $s > 1$:

\tilde{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \emptyset) + s • ( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \emptyset) )

If $s = 1$, we get standard conditioning. If $s > 1$, the difference between conditional and unconditional noise is amplified, forcing the model to heavily emphasize the prompt at the expense of variety.
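A small numeric check makes the effect of $s$ visible. The two-element tensors below are made-up stand-ins for noise predictions, and `classifier_free_guidance` is a hypothetical helper name.

```python
import torch

def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = torch.tensor([0.0, 1.0])
eps_cond = torch.tensor([1.0, 3.0])

for s in (0.0, 1.0, 7.5):
    guided = classifier_free_guidance(eps_uncond, eps_cond, s)
    print(s, guided.tolist())
# 0.0 [0.0, 1.0]    -> purely unconditional
# 1.0 [1.0, 3.0]    -> purely conditional
# 7.5 [7.5, 16.0]   -> pushed well past the conditional prediction
```

At $s = 0$ the prompt is ignored entirely, which is why guidance scales used in practice start at 1.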

PyTorch Walkthrough: Custom DDIM/DDPM Reverse Sampling Loop

The Python code below outlines a custom Stable Diffusion sampling loop with a deterministic DDIM update, Classifier-Free Guidance (CFG), and VAE decoding. It assumes diffusers-style `unet` and `vae` objects whose outputs expose a `.sample` tensor:

import torch
from tqdm.auto import tqdm

@torch.no_grad()
def sample_stable_diffusion(
    unet,
    vae,
    text_embeddings,  # Precomputed CLIP embeddings, stacked as [uncond, cond]
    height=512,
    width=512,
    num_inference_steps=20,
    guidance_scale=7.5,
    generator=None,   # If given, must live on the same device as `device`
    device="cuda",
):
    # 1. Initialize random latents: the VAE downsamples by 8, so 512x512 -> 64x64
    latents_shape = (1, 4, height // 8, width // 8)
    latents = torch.randn(latents_shape, generator=generator, device=device)

    # 2. Timestep schedule: pick num_inference_steps evenly spaced steps out of
    # the 1000 training steps, from most noisy (999) down to clean (0)
    timesteps = torch.linspace(999, 0, num_inference_steps, device=device).round().long()

    # Cumulative products alpha_bar_t of the training noise schedule.
    # Assumption: the Stable Diffusion v1 "scaled linear" betas (0.00085 to 0.012).
    # In practice, read these from the scheduler that matches your checkpoint.
    betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000, device=device) ** 2
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    for i, t in enumerate(tqdm(timesteps)):
        # Duplicate the latents so one forward pass yields both the
        # unconditional and the conditional prediction
        latent_model_input = torch.cat([latents] * 2)

        # 3. UNet noise prediction from the noisy latents, timestep t, and the
        # CLIP text conditioning (diffusers-style models return an output
        # object whose .sample attribute holds the tensor)
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

        # Split in the same order as text_embeddings: [uncond, cond]
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)

        # 4. Classifier-Free Guidance (CFG): amplify the prompt direction by scale s
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

        # 5. Deterministic DDIM step (eta = 0): estimate the clean latent, then
        # re-noise it to the noise level of the next timestep in the schedule
        alpha_bar_t = alphas_cumprod[t]
        if i < len(timesteps) - 1:
            alpha_bar_prev = alphas_cumprod[timesteps[i + 1]]
        else:
            alpha_bar_prev = torch.ones((), device=device)  # final step: no noise left

        pred_original_latent = (latents - (1 - alpha_bar_t).sqrt() * noise_pred) / alpha_bar_t.sqrt()
        dir_pointing_to_latent = (1 - alpha_bar_prev).sqrt() * noise_pred
        latents = alpha_bar_prev.sqrt() * pred_original_latent + dir_pointing_to_latent

    # 6. Decode latents back to pixel space with the VAE decoder.
    # Undo the latent scaling factor (0.18215 for Stable Diffusion v1) first.
    latents = latents / 0.18215
    decoded_image = vae.decode(latents).sample  # Dimensions: [1, 3, height, width]

    # Map pixel values from [-1, 1] to [0, 1]
    image = (decoded_image / 2 + 0.5).clamp(0, 1)
    return image.cpu().numpy()
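One way to sanity-check the DDIM update without loading any model is to feed it the true noise: if the noise prediction is exact, one deterministic step must land exactly on the forward-process sample at the next noise level. The sketch below assumes the linear schedule from Section 1 and uses float64 to keep rounding error small; `ddim_step` is a hypothetical helper that mirrors step 5 of the loop above.

```python
import torch

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) to the previous noise level."""
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
    return alpha_bar_prev.sqrt() * x0_pred + (1 - alpha_bar_prev).sqrt() * eps_pred

torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 1000, dtype=torch.float64)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# 20 inference steps spread evenly over the 1000 training steps
timesteps = torch.linspace(999, 0, 20).round().long()

x0 = torch.randn(1, 4, 8, 8, dtype=torch.float64)   # stand-in clean latent
eps = torch.randn_like(x0)                          # the "true" noise
t, t_prev = timesteps[0], timesteps[1]

x_t = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps
x_prev = ddim_step(x_t, eps, alpha_bars[t], alpha_bars[t_prev])

# Forward-process sample at t_prev built from the same x0 and eps
expected = alpha_bars[t_prev].sqrt() * x0 + (1 - alpha_bars[t_prev]).sqrt() * eps
print(torch.allclose(x_prev, expected, atol=1e-6))
```

With a learned model the prediction is only approximate, so the trajectory drifts from this ideal path. That is why more steps generally improve sample quality.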

Latent Diffusion Step-by-Step Denoising Playground

Configure parameters and witness how pure random normal distribution latents resolve into final conceptual visual forms.

[Interactive playground omitted. Controls: CLIP text prompt, CFG scale s (default 7.5, range 1.0 to 15.0), denoising steps (default 20), and sampler algorithm. The display is a grid of 64 latent cells, z_t[0] through z_t[63], that updates from step t=20 (pure noise) to step t=0 (decoded image). Active resolution: 512x512 (z = 64x64x4).]

Stable Diffusion Mathematics Challenges

Complete these tasks in your own workspace repository. Check each one off once it runs successfully in your development environment.

Task 1: Math Implementation of Forward Reparameterization
Write the PyTorch code representing x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon. Calculate scheduler constants for any step $t$.
Task 2: Classifier-Free Guidance (CFG) Chunking
Write a function that accepts chunked double model inputs (conditioned and unconditioned), passes them through a UNet, and implements: uncond + scale * (cond - uncond).
Task 3: DDIM Accelerated Step Scheduler
Implement a deterministic DDIM sampler loop spanning exactly 20 steps, and assert that each step maps to the correct timestep in the standard 1000-step training schedule.
Task 4: Cross-Attention UNet Conditioning
Integrate HuggingFace's CLIP text encoder to process prompts into token dimensions of size [Batch, 77, 768]. Assert this hidden state tensor is successfully mapped as key/value matrices in cross-attention blocks.
