" MicromOne: Diffusion Models Explained Line by Line



1. Setup and Noise Schedule

n_steps = 512  # Total number of diffusion steps (T)
beta = linspace(start, end, n_steps)
  • n_steps (T): total number of steps in the diffusion process

  • beta: controls how much noise is added at each step

This is a linear noise schedule: the per-step noise variance increases linearly from start to end, so later steps add progressively more noise.

Derived Variables

alpha = 1. - beta
alpha_bar = cumprod(alpha, axis=0)
  • alpha: amount of signal preserved at each step

  • alpha_bar: cumulative product → how much of the original image remains after t steps

sqrt_alpha_bar = sqrt(alpha_bar)
sqrt_one_minus_alpha_bar = sqrt(1. - alpha_bar)

These are used in the reparameterization trick:

[
x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon
]
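The schedule and the reparameterization trick can be sketched end to end. The beta endpoints (1e-4 and 0.02) and the q_sample name are assumptions for illustration, not taken from the original code:

```python
import torch

n_steps = 512
# Assumed endpoints; common DDPM defaults are beta in [1e-4, 0.02].
beta = torch.linspace(1e-4, 0.02, n_steps)

alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)      # fraction of x_0 surviving after t steps

sqrt_alpha_bar = torch.sqrt(alpha_bar)
sqrt_one_minus_alpha_bar = torch.sqrt(1.0 - alpha_bar)

def q_sample(x0, t, noise):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    # Jumps straight from x_0 to x_t in one shot, no step-by-step loop needed.
    return sqrt_alpha_bar[t] * x0 + sqrt_one_minus_alpha_bar[t] * noise

# The signal fraction shrinks toward 0 as t grows.
print(float(sqrt_alpha_bar[0]), float(sqrt_alpha_bar[-1]))
```

With these endpoints the signal fraction starts near 1 and decays to almost nothing by the last step, which is why the final x_T is close to pure Gaussian noise.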

Model and Optimizer

model = UNet()
optimizer = Adam(model.parameters(), lr=0.001)
  • UNet: predicts the noise given a noisy image and timestep

  • Adam: optimizer used to train the model

2. Training (Forward Process)

for batch, _ in dataloader:

Loop over real images.

Batch size

bs = batch.shape[0]

Sampling random timesteps

t = torch.randint(0, T, (bs,)).long()

Very important:

  • Each image uses a different timestep

  • The model learns to handle all noise levels

Generate noise

noise = torch.randn_like(batch, device=device)
  • Gaussian noise (ε)

Add noise to images

x_noisy = (
    sqrt_alpha_bar[t].view(bs, 1, 1, 1) * batch +
    sqrt_one_minus_alpha_bar[t].view(bs, 1, 1, 1) * noise
)

This is the forward diffusion step.

  • Early timesteps → image is mostly clean

  • Late timesteps → image becomes almost pure noise
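The broadcasting trick with .view(bs, 1, 1, 1) is easy to miss: each image in the batch has its own timestep, so its own pair of scalar coefficients, and .view reshapes them so they broadcast over the whole image. A minimal sketch (beta endpoints and the random stand-in batch are assumptions):

```python
import torch

torch.manual_seed(0)

n_steps = 512
beta = torch.linspace(1e-4, 0.02, n_steps)   # assumed endpoints
alpha_bar = torch.cumprod(1.0 - beta, dim=0)
sqrt_alpha_bar = alpha_bar.sqrt()
sqrt_one_minus_alpha_bar = (1.0 - alpha_bar).sqrt()

bs = 4
batch = torch.randn(bs, 3, 8, 8)             # stand-in for real images
t = torch.randint(0, n_steps, (bs,))          # one timestep per image
noise = torch.randn_like(batch)

# sqrt_alpha_bar[t] has shape (bs,); .view(bs, 1, 1, 1) makes it
# broadcast over the (channel, height, width) dimensions of each image.
x_noisy = (
    sqrt_alpha_bar[t].view(bs, 1, 1, 1) * batch
    + sqrt_one_minus_alpha_bar[t].view(bs, 1, 1, 1) * noise
)
print(x_noisy.shape)
```

Without the .view, PyTorch would try to broadcast a (bs,) tensor against the last dimension of the image and fail.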

Predict the noise

noise_pred = model(x_noisy, t)

The model learns:

“Given a noisy image and timestep, what noise was added?”

Loss function

loss = F.mse_loss(noise, noise_pred)

We minimize:

[
\| \epsilon - \epsilon_\theta(x_t, t) \|^2
]

Optimization

optimizer.zero_grad()
loss.backward()
optimizer.step()

Reset gradients, backpropagate, and update the model weights. optimizer.zero_grad() is needed every iteration because PyTorch accumulates gradients by default; without it, each batch's gradients would pile on top of the previous ones.
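Putting the whole training step together, here is a runnable sketch. TinyNoisePredictor is a hypothetical stand-in for the UNet (a real one would condition on t via timestep embeddings), and the beta endpoints are assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for the UNet; a real model would use t, this one ignores it.
class TinyNoisePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x, t):
        return self.conv(x)

n_steps = 512
beta = torch.linspace(1e-4, 0.02, n_steps)   # assumed endpoints
alpha_bar = torch.cumprod(1.0 - beta, dim=0)

model = TinyNoisePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

batch = torch.randn(8, 3, 16, 16)             # fake image batch
bs = batch.shape[0]
t = torch.randint(0, n_steps, (bs,))
noise = torch.randn_like(batch)
x_noisy = (
    alpha_bar[t].sqrt().view(bs, 1, 1, 1) * batch
    + (1.0 - alpha_bar[t]).sqrt().view(bs, 1, 1, 1) * noise
)

optimizer.zero_grad()                          # reset stale gradients each iteration
loss = F.mse_loss(model(x_noisy, t), noise)    # predict the added noise
loss.backward()
optimizer.step()
print(float(loss))
```

One such step per batch, repeated over many epochs, is the entire training procedure.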

3. Image Generation (Reverse Process / Inference)

Now we generate new images from pure noise.

Precomputations

sqrt_one_minus_alpha_bar = sqrt(1. - alpha_bar)
alpha_bar_t_minus_1 = F.pad(alpha_bar[:-1], (1, 0), value=1.0)
  • Shifted version of alpha_bar

  • Needed for reverse formulas

posterior_variance = (
    beta * (1.0 - alpha_bar_t_minus_1) / (1.0 - alpha_bar)
)

This corresponds to:

[
\sigma_t^2 = \beta_t \, \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}
]

This controls how much randomness is added back at each generation step.
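A quick way to see why no noise is added at the final step: with the padding above, alpha_bar_{t-1} at t = 0 is defined as 1, so the posterior variance at t = 0 is exactly zero. A sketch (beta endpoints assumed):

```python
import torch
import torch.nn.functional as F

n_steps = 512
beta = torch.linspace(1e-4, 0.02, n_steps)   # assumed endpoints
alpha_bar = torch.cumprod(1.0 - beta, dim=0)

# Shift alpha_bar right by one step; the t = 0 slot is padded with 1.0.
alpha_bar_t_minus_1 = F.pad(alpha_bar[:-1], (1, 0), value=1.0)
posterior_variance = beta * (1.0 - alpha_bar_t_minus_1) / (1.0 - alpha_bar)

# sigma_0^2 = beta_0 * (1 - 1) / (1 - alpha_bar_0) = 0
print(float(posterior_variance[0]))
```

So the last denoising step is fully deterministic by construction, which matches the `if ts > 0` guard in the sampling loop below.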

Initialization

bs = 8
x = randn((bs, 3, IMG_SIZE, IMG_SIZE))

Start from pure noise.

Reverse loop

for ts in reversed(range(T)):

Go backward in time, from t = T − 1 down to 0.

Add noise (except final step)

noise = randn_like(x) if ts > 0 else 0

Important detail:

  • No noise at the final step

  • Otherwise the image would degrade

Time tensor

t = full((bs,), ts).long()

Denoising step

x = (
    sqrt_one_over_alpha[t].view(bs, 1, 1, 1) *
    (
        x - beta[t].view(bs, 1, 1, 1) /
        sqrt_one_minus_alpha_bar[t].view(bs, 1, 1, 1) *
        model(x, t)
    )
    + sqrt(posterior_variance[t].view(bs, 1, 1, 1)) * noise
)

What happens here?

Each step:

  1. The model predicts the noise

  2. That noise is removed

  3. A small amount of controlled noise is added back

Missing definition (important)

The denoising step above also uses sqrt_one_over_alpha, which must be precomputed alongside the other schedule variables:

sqrt_one_over_alpha = sqrt(1.0 / alpha)
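The full reverse loop can be sketched end to end. The model below is a dummy that predicts zeros, so this only exercises the sampler's arithmetic, not real image generation; the beta endpoints and tiny image size are assumptions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T = 512
IMG_SIZE = 8                                   # tiny, just for the sketch
beta = torch.linspace(1e-4, 0.02, T)           # assumed endpoints
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)
sqrt_one_minus_alpha_bar = (1.0 - alpha_bar).sqrt()
sqrt_one_over_alpha = (1.0 / alpha).sqrt()
alpha_bar_t_minus_1 = F.pad(alpha_bar[:-1], (1, 0), value=1.0)
posterior_variance = beta * (1.0 - alpha_bar_t_minus_1) / (1.0 - alpha_bar)

# Dummy stand-in for the trained UNet: always predicts zero noise.
def model(x, t):
    return torch.zeros_like(x)

bs = 2
x = torch.randn(bs, 3, IMG_SIZE, IMG_SIZE)     # start from pure noise
for ts in reversed(range(T)):
    # No fresh noise on the final step (posterior variance is 0 there anyway).
    noise = torch.randn_like(x) if ts > 0 else torch.zeros_like(x)
    t = torch.full((bs,), ts).long()
    x = (
        sqrt_one_over_alpha[t].view(bs, 1, 1, 1)
        * (x - beta[t].view(bs, 1, 1, 1)
             / sqrt_one_minus_alpha_bar[t].view(bs, 1, 1, 1)
             * model(x, t))
        + posterior_variance[t].view(bs, 1, 1, 1).sqrt() * noise
    )
print(x.shape)
```

Swapping the dummy model for a trained UNet turns this scaffold into an actual DDPM sampler.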

Final Output

generated_image = torch.clamp(x, -1, 1)

Clamp values to a valid image range.
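If the training images were scaled to [-1, 1] (a common convention, assumed here since the source does not show the preprocessing), the clamped output can be mapped back to [0, 1] before saving:

```python
import torch

torch.manual_seed(0)

x = torch.randn(8, 3, 32, 32) * 2              # stand-in for the final sample
generated_image = torch.clamp(x, -1, 1)        # clip to the training range

# Map [-1, 1] back to [0, 1] for image-saving utilities.
as_unit_range = (generated_image + 1) / 2
print(float(as_unit_range.min()), float(as_unit_range.max()))
```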

Intuition Recap

Training

  • Add noise to images

  • Train model to predict that noise

Generation

  • Start from noise

  • Remove noise step by step

  • Obtain a clean image