MicromOne: To make Generative Adversarial Networks (GANs) work well in practice



To make Generative Adversarial Networks (GANs) work well in practice, choosing the right architecture is crucial.

If you're working on a simple task—like generating 28×28 pixel handwritten digits from the MNIST dataset—you can often use a fully connected architecture. In this setup, each layer interacts through matrix multiplications only. There’s no convolution, no recurrence—just dense layers stacked together.

The key design rule? Both the generator and the discriminator must include at least one hidden layer. This ensures they have the universal approximator property, meaning that, given enough hidden units, they can represent any probability distribution.
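To make this concrete, here is a minimal NumPy sketch of such a pair of fully connected networks. The layer sizes (100-dimensional noise, 128 hidden units) and the 0.02 weight scale are illustrative assumptions, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes for an MNIST-style setup: 100-dim noise -> 784 pixels.
NOISE_DIM, HIDDEN, IMG_DIM = 100, 128, 28 * 28

def leaky_relu(x, alpha=0.2):
    # Small negative slope keeps gradients flowing for negative inputs.
    return np.where(x > 0, x, alpha * x)

# Generator: one hidden layer, tanh output in [-1, 1].
G_W1 = rng.normal(0, 0.02, (NOISE_DIM, HIDDEN))
G_W2 = rng.normal(0, 0.02, (HIDDEN, IMG_DIM))

def generator(z):
    h = leaky_relu(z @ G_W1)
    return np.tanh(h @ G_W2)

# Discriminator: one hidden layer, sigmoid output (a probability).
D_W1 = rng.normal(0, 0.02, (IMG_DIM, HIDDEN))
D_W2 = rng.normal(0, 0.02, (HIDDEN, 1))

def discriminator(x):
    h = leaky_relu(x @ D_W1)
    return 1.0 / (1.0 + np.exp(-(h @ D_W2)))

z = rng.normal(size=(16, NOISE_DIM))
fake = generator(z)       # shape (16, 784), values in [-1, 1]
p = discriminator(fake)   # shape (16, 1), values in (0, 1)
```

Note that each network interacts with its input through matrix multiplications only—exactly the dense-layers-plus-nonlinearities structure described above.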

Choosing Activation Functions

For hidden layers, many activation functions can work. However, Leaky ReLU is especially popular. The reason is simple: it helps gradients flow through the network more reliably.

Gradient flow is important in any neural network—but it’s absolutely critical in GANs. Why? Because the generator can only learn through the gradients it receives from the discriminator. If gradients vanish, learning stalls.
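The difference is easy to see in the activations' derivatives. A plain ReLU passes zero gradient for any negative input, while Leaky ReLU keeps a small slope (here 0.2, a common but arbitrary choice):

```python
# ReLU's gradient is exactly zero on the negative side: a unit stuck there
# passes nothing back to the generator.
def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# Leaky ReLU keeps a small slope (alpha) on the negative side, so some
# gradient always flows through.
def leaky_relu_grad(x, alpha=0.2):
    return 1.0 if x > 0 else alpha
```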

For the generator’s output layer, a common choice is the hyperbolic tangent (tanh) activation function. This means your training data should be scaled to the range [-1, 1].
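For 8-bit images, that rescaling is a one-liner. A sketch (helper names are mine):

```python
def scale_to_tanh_range(pixel):
    # Map raw pixel intensities [0, 255] to [-1, 1], matching tanh's range.
    return pixel / 127.5 - 1.0

def unscale(value):
    # Inverse mapping, e.g. for visualizing generated samples.
    return (value + 1.0) * 127.5
```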

For the discriminator, the output must represent a probability. To enforce this, we typically use a sigmoid activation function at the output layer.

Training GANs: Two Optimizers, Two Losses

GANs differ from most machine learning models because they require training two networks simultaneously:

  • One optimizer minimizes the discriminator’s loss.

  • Another optimizer minimizes the generator’s loss.

A widely used optimizer is Adam, a design choice popularized in the DCGAN architecture.
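The key point is that each network gets its own optimizer instance with its own state. Here is a minimal hand-rolled Adam sketch to make that explicit (in practice you would use a framework's built-in Adam; the learning rate 2e-4 and beta1 = 0.5 are the DCGAN settings):

```python
import numpy as np

class Adam:
    """Minimal Adam update (sketch). Each network gets its OWN instance,
    so the two optimizers keep independent moment estimates."""
    def __init__(self, lr=2e-4, beta1=0.5, beta2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.t = 0.0, 0.0, 0

    def step(self, param, grad):
        # Standard Adam: update biased moments, bias-correct, then step.
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return param - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# One optimizer per network, each minimizing its own loss.
opt_D = Adam()
opt_G = Adam()
```

In a training loop you would alternate: step `opt_D` on the discriminator's loss, then step `opt_G` on the generator's loss, never mixing their state.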

Designing the Discriminator Loss

The discriminator’s goal is straightforward:

  • Output values close to 1 for real data.

  • Output values close to 0 for fake (generated) data.

This is essentially a binary classification problem. Therefore, the correct loss function is sigmoid cross-entropy, just like in standard classifiers.

One common mistake is implementing cross-entropy incorrectly. You should always use the numerically stable version computed directly from the logits (the values before the sigmoid). Using probabilities after the sigmoid can cause numerical instability, especially when outputs are very close to 0 or 1.
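The standard stable formulation rewrites the loss so that `exp()` is only ever applied to a non-positive argument. A minimal sketch:

```python
import math

def sigmoid_cross_entropy_with_logits(logit, label):
    # Stable form of -label*log(sigmoid(x)) - (1-label)*log(1 - sigmoid(x)):
    #   max(x, 0) - x*label + log(1 + exp(-|x|))
    # exp() only ever sees a non-positive argument, so it cannot overflow,
    # and log1p avoids taking log of a value rounded to exactly 0 or 1.
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))
```

At an extreme logit like -1000 with label 1, the naive `log(sigmoid(x))` form would round the sigmoid to 0 and take log(0); the stable form returns the correct finite loss.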

Label Smoothing Trick

A GAN-specific trick is to slightly soften the real labels. Instead of labeling real examples as 1, use something like 0.9. Keep fake labels at 0.

This technique—called label smoothing—prevents the discriminator from becoming overly confident and improves generalization.
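A quick sketch of why this one-sided smoothing discourages overconfidence: with a real label of 0.9, the cross-entropy is minimized when the discriminator outputs 0.9 (logit = ln 9), not when it saturates toward 1:

```python
import math

def bce_with_logits(x, z):
    # Numerically stable sigmoid cross-entropy from logits.
    return max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))

# With a smoothed real label of 0.9, the loss bottoms out at output 0.9...
optimal_logit = math.log(9.0)  # sigmoid(ln 9) = 0.9
loss_at_optimum = bce_with_logits(optimal_logit, 0.9)

# ...while an overconfident discriminator (output ~ 0.99995) pays MORE loss.
loss_when_overconfident = bce_with_logits(10.0, 0.9)
```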

Designing the Generator Loss

For the generator, we also use cross-entropy—but with flipped labels. In other words, the generator tries to make the discriminator classify fake data as real.

Some implementations use the negative of the discriminator’s loss as the generator’s loss. While intuitive, this approach doesn’t work well in practice: when the discriminator is performing well, the generator’s loss saturates and its gradients vanish—leaving the generator with no meaningful signal to learn from.

Instead, the better approach is to let the generator minimize cross-entropy with flipped labels. This works because:

  • The derivative of cross-entropy remains non-zero unless the loss is fully minimized.

  • The losing player always receives gradient feedback.

  • Training remains stable and effective.
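The gradient comparison can be worked out directly. With respect to the discriminator's logit x on a fake sample, the "negative discriminator loss" objective log(1 - D(G(z))) has gradient -sigmoid(x), which vanishes as the discriminator grows confident, while the flipped-label objective -log(D(G(z))) has gradient sigmoid(x) - 1, which stays large. A sketch (the logit value -6 is an illustrative assumption):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Suppose the discriminator confidently rejects a fake: D(G(z)) ~ 0.0025.
d_logit = -6.0

# Gradient of the "maximize discriminator loss" version, log(1 - D(G(z))),
# w.r.t. the logit: -sigmoid(x). It vanishes as D gets confident.
saturating_grad = -sigmoid(d_logit)

# Gradient of the flipped-label version, -log(D(G(z))), w.r.t. the logit:
# sigmoid(x) - 1. It stays near -1, so the generator keeps learning.
flipped_grad = sigmoid(d_logit) - 1.0
```

Exactly as the bullets above claim: the losing player still receives strong gradient feedback under the flipped-label loss.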

Final Thoughts

To summarize best practices for simple GAN architectures:

  • Use fully connected layers for simple datasets like MNIST.

  • Ensure both networks have at least one hidden layer.

  • Prefer Leaky ReLU for hidden activations.

  • Use tanh for the generator output (and scale data to [-1, 1]).

  • Use sigmoid for the discriminator output.

  • Apply numerically stable cross-entropy computed from logits.

  • Consider label smoothing for real samples.

  • Train generator and discriminator with separate optimizers (Adam is a strong choice).

  • Let both networks minimize cross-entropy—don’t define the generator’s loss as the negative of the discriminator’s.

Getting these architectural and optimization details right can make the difference between a GAN that fails to train and one that produces convincing results.