Unpack the rigorous mathematical formulations undergirding Denoising Diffusion Probabilistic Models (DDPM) and Latent Diffusion (Stable Diffusion). Master VAE reduction, U-Net attention blocks, and Classifier-Free Guidance equations.
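Diffusion models define a forward process that gradually destroys the structure of an image by injecting Gaussian noise over $T$ discrete steps. This forms a Markov chain parameterized by the noise variances $\beta_t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$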
Where $\beta_t$ is the value of the variance schedule at step $t$, typically increased linearly from $10^{-4}$ to $0.02$.
Computing $x_t$ by iterating step by step is inefficient. Defining $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, the reparameterization trick lets us sample the noised image $x_t$ at any step $t$ directly from $x_0$ in closed form:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$
The Closed-Form Forward Diffusion Formula
Where $\epsilon \sim \mathcal{N}(0, I)$ is standard independent, identically distributed Gaussian noise, and $x_0$ is the clean input image.
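As a sanity check on the closed form, the sketch below applies the one-step Markov update repeatedly and compares the resulting signal and noise scales with $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1 - \bar{\alpha}_t}$. It is written in plain Python and assumes a linear schedule from $10^{-4}$ to $0.02$ over 1000 steps; the function names are illustrative and do not come from any library.

```python
import math


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1 .. beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def closed_form_coefficients(betas, t):
    """Return (sqrt(alpha_bar_t), sqrt(1 - alpha_bar_t)) for 1-indexed step t."""
    alpha_bar = 1.0
    for beta in betas[:t]:
        alpha_bar *= 1.0 - beta
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)


def iterated_coefficients(betas, t):
    """Track the signal scale and noise scale by applying the Markov step t times."""
    signal_scale, noise_variance = 1.0, 0.0
    for beta in betas[:t]:
        # x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * fresh_noise
        signal_scale *= math.sqrt(1.0 - beta)
        noise_variance = (1.0 - beta) * noise_variance + beta
    return signal_scale, math.sqrt(noise_variance)


betas = linear_beta_schedule()
for t in (1, 500, 1000):
    print(t, closed_form_coefficients(betas, t), iterated_coefficients(betas, t))
```

Both functions agree up to floating-point error. At $t = 1000$ the signal scale is close to zero, so $x_T$ is almost pure noise.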
The goal of the neural network (specifically a U-Net architecture) is to predict the noise vector $\epsilon$ injected at step $t$, allowing us to subtract it and reconstruct the clean image.
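We optimize the network parameters $\theta$ using a simplified loss function derived from the variational lower bound (ELBO):

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$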
Simplified MSE Diffusion Loss
Where $\epsilon_\theta(x_t, t)$ is the network's prediction of the noise that was added to produce $x_t$ at step $t$.
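A minimal sketch of one Monte Carlo sample of this loss, in plain Python on a flattened toy "image". The callable `predict_noise` is a hypothetical stand-in for the U-Net, and the three-step `alpha_bars` list is made up for illustration. The sketch averages the squared error over elements, which differs from the squared norm only by a constant factor.

```python
import math
import random


def simple_diffusion_loss(predict_noise, x0, alpha_bars, rng):
    """One sample of the simplified loss for a single flattened image x0.

    predict_noise(x_t, t) is a hypothetical stand-in for the U-Net.
    alpha_bars[t] holds the cumulative product alpha_bar for step index t.
    """
    t = rng.randrange(len(alpha_bars))        # uniformly random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # target noise
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]  # closed-form noising
    eps_hat = predict_noise(x_t, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)


# Toy setup: a three-step schedule and a four-pixel "image"
alpha_bars = [0.9, 0.5, 0.1]
x0 = [0.2, -0.4, 0.9, 0.0]


def zero_predictor(x_t, t):
    return [0.0 for _ in x_t]  # always predicts "no noise"


def oracle_predictor(x_t, t):
    # Recovers the noise exactly, which is only possible because it peeks at x0
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [(v - signal * x) / noise for v, x in zip(x_t, x0)]


print(simple_diffusion_loss(zero_predictor, x0, alpha_bars, random.Random(0)))
print(simple_diffusion_loss(oracle_predictor, x0, alpha_bars, random.Random(0)))
```

The oracle drives the loss to (numerically) zero, while the zero predictor is left with the average squared noise. Training moves a real network from the second situation toward the first without access to $x_0$.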
Running diffusion directly on pixel-space images (e.g. $512 \times 512$ resolution) is very expensive. Latent Diffusion Models (LDMs) address this by separating perceptual compression, handled by an autoencoder, from generative learning, handled by the diffusion model.
A high-resolution image $x \in \mathbb{R}^{H \times W \times 3}$ is encoded by a VAE encoder into a low-dimensional latent $z = E(x) \in \mathbb{R}^{h \times w \times c}$, typically downsampling each spatial dimension by a factor of 8 (e.g., $512 \times 512$ pixels become $64 \times 64$ latents). The diffusion process runs exclusively in this compressed space.
Once the reverse loop finishes denoising the latent tensor $z_0$, the VAE decoder maps it back into high-resolution pixel space: $\tilde{x} = D(z_0)$. Because each spatial dimension shrinks by a factor of 8, the U-Net processes $8 \times 8 = 64\times$ fewer spatial positions per denoising step.
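The compression arithmetic can be worked through in a few lines. This sketch assumes a downsampling factor of 8 and 4 latent channels (the channel count used by the sampling code later in this lesson); the helper names are illustrative.

```python
def latent_shape(height, width, factor=8, latent_channels=4):
    """Shape (channels, h, w) of the latent for an image of the given size."""
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (latent_channels, height // factor, width // factor)


def compression_ratios(height, width, factor=8, latent_channels=4):
    """Return (spatial position ratio, total element ratio) of pixels to latents."""
    c, h, w = latent_shape(height, width, factor, latent_channels)
    spatial = (height * width) / (h * w)
    elements = (3 * height * width) / (c * h * w)
    return spatial, elements


print(latent_shape(512, 512))        # (4, 64, 64)
print(compression_ratios(512, 512))  # (64.0, 48.0)
```

The number of spatial positions drops by 64, while the total number of tensor elements drops by 48 because the latent has 4 channels instead of 3.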
To direct the generation using prompts, Stable Diffusion introduces cross-attention blocks within the U-Net. A pre-trained CLIP text encoder outputs word token embeddings $y$. Inside the U-Net intermediate layers:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V$$

$Q = W_Q \cdot z_t$: derived from a projection of the current spatial latent features.
$K = W_K \cdot y$: derived from the CLIP text token embeddings of the prompt.
$V = W_V \cdot y$: carries the text content that is mixed into each spatial position according to the attention weights.
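To make the shapes concrete, here is a small sketch of cross-attention on plain Python lists. It assumes a row-vector convention (each row is one latent position or one text token, so the projections are written as `latents @ w_q` rather than $W_Q \cdot z_t$), a single attention head, and identity projection matrices chosen purely for illustration.

```python
import math


def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latents, text, w_q, w_k, w_v):
    """latents: [N, d_latent], text: [M, d_text] -> (output [N, d], weights [N, M])."""
    q = matmul(latents, w_q)  # queries come from the spatial latents
    k = matmul(text, w_k)     # keys come from the text tokens
    v = matmul(text, w_v)     # values come from the text tokens
    d = len(q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]  # one distribution over tokens per position
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
latents = [[1.0, 0.0], [0.0, 1.0]]           # N = 2 spatial positions
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # M = 3 text tokens

output, weights = cross_attention(latents, text, identity, identity, identity)
print(weights)
print(output)
```

Each row of `weights` sums to 1 and says how strongly one spatial position attends to each text token. The output keeps one row per spatial position, so the block can be dropped into the U-Net without changing the latent's spatial layout.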
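To strengthen adherence to the prompt, we compute two noise predictions: one conditioned on the prompt $c$ and one unconditional (with the empty prompt $\emptyset$). We combine them linearly using a guidance scale $s$:

$$\hat{\epsilon} = \epsilon_\theta(z_t, t, \emptyset) + s \cdot \left(\epsilon_\theta(z_t, t, c) - \epsilon_\theta(z_t, t, \emptyset)\right)$$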
If $s = 1$, we get standard conditioning. If $s > 1$, the difference between conditional and unconditional noise is amplified, forcing the model to heavily emphasize the prompt at the expense of variety.
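A tiny numeric illustration of the guidance formula, using plain Python lists in place of noise tensors. The function name and the example values are made up for this sketch.

```python
def classifier_free_guidance(noise_uncond, noise_cond, scale):
    """Combine two noise predictions elementwise: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


uncond = [0.0, 1.0, -2.0]
cond = [1.0, 1.0, 0.0]

print(classifier_free_guidance(uncond, cond, 0.0))  # [0.0, 1.0, -2.0], the unconditional prediction
print(classifier_free_guidance(uncond, cond, 1.0))  # [1.0, 1.0, 0.0], the conditional prediction
print(classifier_free_guidance(uncond, cond, 7.5))  # [7.5, 1.0, 13.0], extrapolated past the conditional one
```

Elements where the two predictions agree (the middle value here) are unaffected by the scale. Only the directions in which the prompt changes the prediction are amplified.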
Below is a Python sketch of a custom Stable Diffusion sampling loop that combines Classifier-Free Guidance (CFG), a deterministic DDIM update, and VAE decoding:
import torch
from tqdm.auto import tqdm


@torch.no_grad()
def sample_stable_diffusion(
    unet,
    vae,
    text_embeddings,  # Precomputed CLIP embeddings, stacked as [uncond, cond]
    height=512,
    width=512,
    num_inference_steps=20,
    guidance_scale=7.5,
    generator=None,
):
    device = text_embeddings.device

    # 1. Initialize random latents: 64x64 represents 512x512 downsampled by the VAE factor of 8.
    #    A generator, if given, must live on the same device.
    latents_shape = (1, 4, height // 8, width // 8)
    latents = torch.randn(latents_shape, generator=generator, device=device)

    # 2. Timestep schedule: 1000 training steps subsampled to num_inference_steps, noisiest first
    timesteps = torch.linspace(999, 0, num_inference_steps, device=device).round().long()

    # Cumulative products alpha_bar_t of the linear beta schedule described earlier.
    # (Usually retrieved from a scheduler instance; it must match the schedule the U-Net was trained with.)
    betas = torch.linspace(1e-4, 0.02, 1000, device=device)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    for i, t in enumerate(tqdm(timesteps)):
        # Double the inputs to run the unconditional and conditional predictions in one batch
        latent_model_input = torch.cat([latents] * 2)

        # 3. U-Net noise prediction from the noisy latents, timestep t, and CLIP text conditioning.
        #    Assumption: diffusers-style models wrap the tensor in an output object with a
        #    `.sample` field; a model that returns a plain tensor is used as-is.
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)
        noise_pred = getattr(noise_pred, "sample", noise_pred)

        # Split into the two predictions (same order as text_embeddings)
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)

        # 4. Classifier-Free Guidance: amplify the prompt direction by guidance_scale
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

        # 5. Deterministic DDIM step (eta = 0): move to the next, less noisy timestep
        alpha_bar_t = alphas_cumprod[t]
        if i < len(timesteps) - 1:
            alpha_bar_prev = alphas_cumprod[timesteps[i + 1]]
        else:
            alpha_bar_prev = torch.tensor(1.0, device=device)  # final step targets the clean latent

        pred_original_latent = (latents - (1 - alpha_bar_t).sqrt() * noise_pred) / alpha_bar_t.sqrt()
        dir_pointing_to_latent = (1 - alpha_bar_prev).sqrt() * noise_pred
        latents = alpha_bar_prev.sqrt() * pred_original_latent + dir_pointing_to_latent

    # 6. Decode the latents back into pixel space with the VAE decoder,
    #    first undoing the latent scaling factor (0.18215)
    latents = latents / 0.18215
    decoded_image = vae.decode(latents)
    decoded_image = getattr(decoded_image, "sample", decoded_image)  # [1, 3, height, width]

    # Map pixel values from [-1, 1] to the [0, 1] range
    image = (decoded_image / 2 + 0.5).clamp(0, 1)
    return image.cpu().numpy()
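The reverse update in step 5 of the loop can be checked in isolation. The sketch below applies the same deterministic DDIM formula to a single scalar and confirms that, given a perfect noise prediction, it lands exactly on the closed-form sample at the earlier timestep. The $\bar{\alpha}$ values are arbitrary illustrative numbers, not taken from a real schedule.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) for a single scalar value."""
    pred_x0 = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    return math.sqrt(alpha_bar_prev) * pred_x0 + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


# Build x_t from a known clean value and known noise using the closed-form forward formula
x0, eps = 0.3, -1.2
alpha_bar_t, alpha_bar_prev = 0.2, 0.6
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

# With a perfect noise prediction, the step reproduces the closed-form sample at the earlier step
x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
expected = math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * eps
print(x_prev, expected)

# Setting alpha_bar_prev = 1 (the final step) returns the clean value itself
print(ddim_step(x_t, eps, alpha_bar_t, 1.0), x0)
```

In practice the noise prediction is imperfect, which is why sampling takes many small steps rather than a single jump to $\bar{\alpha} = 1$.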
Complete the following exercises in your own development environment and confirm that each one runs successfully.
Write PyTorch code implementing $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, and compute the scheduler constants for any step $t$.
Write a function that takes a doubled input batch (unconditional and conditional), passes it through a U-Net, and implements: uncond + scale * (cond - uncond).
Implement a deterministic DDIM sampler loop spanning exactly 20 steps, asserting that each step maps to the correct timestep index in the standard 1000-step schedule.
Integrate Hugging Face's CLIP text encoder to encode prompts into hidden states of shape [Batch, 77, 768]. Assert that this hidden-state tensor is projected into the key and value matrices of the cross-attention blocks.