What is mode collapse?
Intuition: In any game, you look ahead a few moves to anticipate your opponent and prepare your next move accordingly. In Unrolled GAN, we give the generator the opportunity to unroll k steps of how the discriminator may optimize itself. Then we update the generator using backpropagation with the cost calculated at the final, k-th step. This lookahead discourages the generator from exploiting local optima that the discriminator can easily counteract; otherwise, the model may oscillate and even become unstable. Unrolled GAN lowers the chance that the generator overfits to a specific discriminator, which lessens mode collapse and improves stability.
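A minimal PyTorch-style sketch of this lookahead is shown below. All names here (G, D, real_batch, z) are illustrative, and the discriminator is assumed to output probabilities; note that the full Unrolled GAN also backpropagates through the k discriminator updates themselves, which this simplified sketch omits.

```python
import copy
import torch
import torch.nn.functional as F

def unrolled_generator_loss(G, D, real_batch, z, k=5, d_lr=1e-3):
    """Evaluate the generator against a discriminator unrolled k steps ahead."""
    D_surrogate = copy.deepcopy(D)                      # lookahead copy of the discriminator
    d_opt = torch.optim.SGD(D_surrogate.parameters(), lr=d_lr)

    fake = G(z).detach()                                # fixed fakes used for the lookahead
    for _ in range(k):                                  # k simulated discriminator updates
        real_scores = D_surrogate(real_batch)
        fake_scores = D_surrogate(fake)
        d_loss = (F.binary_cross_entropy(real_scores, torch.ones_like(real_scores))
                  + F.binary_cross_entropy(fake_scores, torch.zeros_like(fake_scores)))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

    # The generator cost is computed against the unrolled discriminator,
    # so G is discouraged from exploiting only the current D.
    scores = D_surrogate(G(z))
    return F.binary_cross_entropy(scores, torch.ones_like(scores))
```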
This article is part of our series on GANs. Since mode collapse is common, we spend some time exploring Unrolled GAN to see how mode collapse may be addressed.
Discriminator training
In GAN, we compute the cost function and use backpropagation to fit the model parameters of the discriminator D and the generator G.
We redraw the diagram below to emphasize the model parameters θ. The red arrows show how we backpropagate the cost function f to fit the model parameters.
Here are the cost function and the gradient descent update (we use simple gradient descent for the purpose of illustration).
In the diagram below, we add the SGD (gradient descent formula) to explicitly define how the discriminator parameters are calculated.
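The original figures are not reproduced here; as a textbook reconstruction (not the article's exact diagram), the GAN value function f and the plain gradient updates for the discriminator and generator parameters look like this:

```latex
f(\theta_D, \theta_G) = \mathbb{E}_{x \sim p_{data}}\big[\log D(x)\big]
                      + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

\theta_D \leftarrow \theta_D + \eta \,\nabla_{\theta_D} f(\theta_D, \theta_G), \qquad
\theta_G \leftarrow \theta_G - \eta \,\nabla_{\theta_G} f(\theta_D, \theta_G)
```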
In Unrolled GAN, we train the discriminator exactly the same way as GAN.
Checking alternative optimization algorithms
The concept behind GAN optimization is a min-max game, which often means that during training we are unable to find a local Nash equilibrium, i.e., the networks fail to converge. Some articles in the literature have proposed the use of Simultaneous Gradient Descent, leading to more stable training and improved convergence even on GAN architectures that are known to be hard to train.
How to Fight Mode Collapse in GANs
What is all this fuss about Generative Adversarial Networks (GANs)? What did this new “invention” really accomplish? Which challenges did it solve, and what are its current limitations? We will answer these questions simply and concisely in this article.
GAN is a relatively new machine learning technique invented by Ian Goodfellow in 2014. In GANs, instead of one neural network, two neural networks compete against each other in a two-player game. During the game, one network (known as the Generator) uses knowledge of the training data distribution to try to generate fake samples of the data, while the other network (known as the Discriminator) is fed both fake samples and real samples and tries to classify each incoming sample correctly as real or fake. As the game progresses, both networks become better at their tasks: the generator becomes better at generating fake data that looks real, while the discriminator becomes better at telling the difference between a fake and a real sample. In the end, the data produced by the generator looks authentic even to human observers.
There is a famous analogy used to explain the roles of the Generator and the Discriminator: the counterfeiters and the police. The counterfeiters try to fake money in a way that cannot be distinguished from real money. At the same time, when the police succeed in detecting counterfeit money, that pushes the counterfeiters to become even better at creating counterfeits that cannot be detected, and the chase goes on! In this game, however, we hope that the police win :D.
What is going on with my GAN?
Generative Adversarial Networks are a novel class of deep generative models that have recently gained a lot of attention. I’ve covered them in the past (Tabular synthetic data — Part 1 and Tabular synthetic data — Part 2), in very general terms, and with particular attention to their application to synthesizing tabular data. But today, the focus will be a bit different: in a series of articles, I’ll be covering the challenges you can find while training GANs, the most common solutions, and future directions in the literature. This review was inspired by an amazing article about GAN challenges, solutions, and future directions — I strongly advise you to take a deeper look.
Generative Adversarial Networks
Generative models have been widely used in recent years in a broad and varied number of real applications. Generative models can be defined as models that perform density estimation: a model distribution is learned to approximate the real data distribution.
This brings some challenges, as research has shown that maximum likelihood is not a good option: it leads to overgeneralized and implausible samples.
Generative Adversarial Nets (GANs) can solve this by introducing a Discriminator network that is able to discriminate between original data samples and samples generated by the model.
They have a wide scope of application, as they are able to learn implicitly over images, audio, and other data that are challenging to model with an explicit likelihood.
The challenges
GANs can be very helpful and pretty disruptive in some areas of application, but, as with everything, it’s a trade-off between their benefits and the challenges we easily find while working with them. We can break down GAN challenges into 3 main problems:
- Mode collapse
- Non-convergence and instability
- High sensitivity to hyperparameters and evaluation metrics
Exploring new network architectures
Better design of GAN model architectures is definitely one valid option. In fact, several GANs in the literature result from exploring new architectures to solve particular data challenges. For example, CGAN is a conditional version of the first proposed GAN architecture that undoubtedly leads to better results when synthesizing data. VAE-GAN, on the other hand, follows an encoder-decoder architecture that leverages learned representations to better measure similarities in the data space, which results in improved visual fidelity. Finally, Memory GAN follows a memory architecture that can alleviate two of the main issues related to unsupervised learning: the ability of Generators to correctly learn the representation of the training samples, and the ability of Discriminators to better memorize already seen generated samples.
In summary, as far as architecture re-engineering is concerned, the research positions the solutions as follows:
- Conditional generation
- Generative-discriminative network pair
- Joint architectures leveraging encoders
- Improved Discriminator architectures
- Exploration of Memory networks
The solutions
With the challenges covered, it’s time to look at the solutions that have been proposed and most widely applied to GANs.
As mentioned before, although there are many challenges related to GAN training, a lot of the research targets solutions to the mode collapse and non-convergence issues. The image below depicts an interesting taxonomy of solutions to GAN challenges, which gives us a pretty good idea of the options available in the literature.
Below, we cover the three main techniques to improve GAN training and overall results.
Why mode collapse? 🔥
GANs can sometimes suffer from the limitation of generating samples that are poorly representative of the population. For example, after training a GAN on the MNIST dataset, it may happen that our Generator is unable to generate any digit other than 0. This condition is called mode collapse.
The main drawback relates to the fact that GANs are unable to cover the whole data distribution due to their objective function. Some experiments have shown that even for a bi-modal distribution, GANs tend to produce a good fit to the principal mode only, struggling to generalize. In summary, mode collapse is a consequence of poor generalization and can be classified into two different types:
- Most of the modes from the input data are absent from the generated data
- Only a subset of particular modes is learned by the Generator.
The causes of mode collapse can vary, from an ill-suited objective function to the impact of the chosen GAN architecture given the data under analysis. But fear not: there are options to address this, and many efforts have been dedicated to this particular challenge.
The future of GANs
Congratulations! In a nutshell, you’ve learned about the most common challenges found when working with GANs, along with some of the most commonly proposed solutions in the literature! From this review, it’s possible to understand that, although there are many challenges to be solved when working with GANs, they are without a doubt one of the most important findings in Machine Learning in recent years.
Hopefully, this review inspires you to start digging into these amazing algorithms and exploring new applications!
GAN — Ways to improve GAN performance
GAN models can suffer badly in the following areas compared to other deep networks:
- Non-convergence: the models do not converge and, worse, they become unstable.
- Mode collapse: the generator produces only a limited variety of modes.
- Slow training: the gradient used to train the generator vanishes.
As part of the GAN series, this article looks into ways to improve GANs. In particular:
- Change the cost function for a better optimization goal.
- Add additional penalties to the cost function to enforce constraints.
- Avoid overconfidence and overfitting.
- Better ways of optimizing the model.
- Add labels.
But be aware that this is a dynamic topic as research remains highly active.
Feature Matching
The generator tries to find the best image to fool the discriminator. The “best” image keeps changing as both networks counteract their opponent. However, the optimization can turn too greedy and fall into a never-ending cat-and-mouse game. This is one of the scenarios in which the model does not converge and mode collapse occurs.
Feature matching changes the cost function for the generator to minimize the statistical difference between the features of the real images and the generated images. Often, we measure the L2-distance between the means of their feature vectors. Therefore, feature matching expands the goal from beating the opponent to matching features in real images. Here is the new objective function:
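(Reconstructed from the “Improved Techniques for Training GANs” paper, since the original figure is not reproduced here:)

```latex
\Big\lVert \, \mathbb{E}_{x \sim p_{data}} f(x) \;-\; \mathbb{E}_{z \sim p_z} f(G(z)) \, \Big\rVert_2^2
```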
where f(x) is the feature vector extracted in an intermediate layer of the discriminator.
The means of the real image features are computed per minibatch and therefore fluctuate on every batch. This is good news for mitigating mode collapse: it introduces randomness that makes it harder for the discriminator to overfit.
Feature matching is effective when the GAN model is unstable during training.
Minibatch discrimination
When the mode collapses, all created images look similar. To mitigate the problem, we feed real images and generated images into the discriminator separately in different batches and compute the similarity of an image x with the images in the same batch. We append the similarity o(x) to one of the dense layers in the discriminator to classify whether the image is real or generated.
If the mode starts to collapse, the similarity of the generated images increases. The discriminator can use this score to detect generated images and penalize the generator when the mode is collapsing.
The similarity o(xi) between the image xi and the other images in the same batch is computed by a transformation matrix T. The equations are a little hard to trace, but the concept is pretty simple; feel free to skip to the next section if you want.
In the figure above, xi is the input image and xj are the rest of the images in the same batch. We use a transformation matrix T to transform the features of xi into Mi, which is a B×C matrix.
We derive the similarity c(xi, xj) between image i and j using the L1-norm and the following equation.
The similarity o(xi) between image xi and the rest of images in the batch is
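The original equations are not reproduced here; reconstructed from the “Improved Techniques for Training GANs” paper, with f(xi) denoting the feature vector of xi and b indexing the B rows of Mi, they read:

```latex
M_i = T\, f(x_i) \in \mathbb{R}^{B \times C}

c_b(x_i, x_j) = \exp\!\big(-\lVert M_{i,b} - M_{j,b} \rVert_{L_1}\big)

o(x_i)_b = \sum_{j=1}^{n} c_b(x_i, x_j), \qquad
o(x_i) = \big[\, o(x_i)_1, \ldots, o(x_i)_B \,\big]
```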
Here is the recap:
To quote the paper “Improved Techniques for Training GANs”:
Minibatch discrimination allows us to generate visually appealing samples very quickly, and in this regard it is superior to feature matching.
One-sided label smoothing
Deep networks may suffer from overconfidence. For example, a network may use very few features to classify an object. To mitigate the problem, deep learning uses regularization and dropout to avoid overconfidence.
In GAN, if the discriminator depends on a small set of features to detect real images, the generator may produce only these features to exploit the discriminator. The optimization may turn too greedy and produce no long-term benefit. In GAN, overconfidence hurts badly. To avoid the problem, we penalize the discriminator when the prediction for any real image goes beyond 0.9 (D(real image) > 0.9). This is done by setting our target label value to 0.9 instead of 1.0. Here is the pseudo code:
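The original snippet is not reproduced here; a minimal PyTorch-style sketch (D, real_images, and fake_images are placeholders, and D is assumed to output probabilities) could look like this:

```python
import torch
import torch.nn.functional as F

# One-sided label smoothing: real images get a soft target of 0.9,
# while fake images keep the hard target of 0.0 (hence "one-sided").
p_real = D(real_images)            # discriminator outputs in (0, 1)
p_fake = D(fake_images.detach())

d_loss = (F.binary_cross_entropy(p_real, 0.9 * torch.ones_like(p_real))
          + F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake)))
```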
Non-convergence and instability
The fact that GANs are composed of two networks, each with its own loss function, makes them inherently unstable. Diving a bit deeper into the problem, the Generator (G) loss can lead to GAN instability, which can be the cause of the vanishing gradient problem when the Discriminator (D) can easily distinguish between real and fake samples.
In the GAN architecture, D tries to minimize a cross-entropy while G tries to maximize it. When D’s confidence is high and it starts to reject the samples produced by G, G’s gradient vanishes.
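For reference, the standard (minimax) losses look as follows; when a confident discriminator pushes D(G(z)) toward 0, the generator term saturates, which is why the non-saturating alternative is often used instead:

```latex
\mathcal{L}_D = -\,\mathbb{E}_{x \sim p_{data}}\big[\log D(x)\big]
              - \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

\mathcal{L}_G^{\text{minimax}} = \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big],
\qquad
\mathcal{L}_G^{\text{non-saturating}} = -\,\mathbb{E}_{z \sim p_z}\big[\log D(G(z))\big]
```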
This instability might relate to the hypothesized existence of local equilibria in the non-convex game we target when training GANs, as proposed in an article about GAN convergence and stability. There are some options already proposed in the literature to mitigate this problem, such as reversing the target used to construct the cross-entropy cost or applying a gradient penalty to avoid local equilibria.
Historical averaging
In historical averaging, we keep track of the model parameters of the last t models. Alternatively, if we need to keep a long sequence of models, we update a running average of the model parameters.
We add the L2 cost below to the cost function to penalize a model that differs from the historical average.
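The penalty term (whose figure is not reproduced here) is the squared distance between the current parameters θ and their historical average:

```latex
\Big\lVert \, \theta - \frac{1}{t} \sum_{i=1}^{t} \theta[i] \, \Big\rVert^2
```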
For GANs with a non-convex objective function, historical averaging may stop models from circling around the equilibrium point and act as a damping force that helps the model converge.
Experience replay
The model optimization can be too greedy in defeating what the generator is currently generating. To address this problem, experience replay maintains the most recently generated images from past optimization iterations. Instead of fitting the discriminator with the currently generated images only, we also feed it recently generated images. Hence, the discriminator will not be overfitted to a particular time instance of the generator.
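A minimal sketch of such a replay pool is shown below; the class name and capacity are illustrative, and G, z, and batch_size are placeholders for your own generator and training loop.

```python
import random
import torch

class ReplayBuffer:
    """Keeps a pool of previously generated images for the discriminator."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.images = []

    def push(self, fake_batch):
        for img in fake_batch.detach():
            if len(self.images) < self.capacity:
                self.images.append(img)
            else:
                # Overwrite a random old entry once the pool is full.
                self.images[random.randrange(self.capacity)] = img

    def sample(self, n):
        return torch.stack(random.sample(self.images, min(n, len(self.images))))

# During discriminator training:
#   buffer.push(G(z))
#   fakes_for_d = torch.cat([G(z).detach(), buffer.sample(batch_size // 2)])
```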
Using labels (CGAN)
Many datasets come with labels for the object type of their samples. Training a GAN is already hard, so any extra help in guiding the training can improve the performance a lot. Adding the label as part of the latent space z helps the GAN training. Below is the data flow used in CGAN to take advantage of the labels in the samples.
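As a complement to that data flow, here is a minimal sketch of a conditional generator in which the class label is embedded and concatenated with z; all layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """CGAN-style generator: the label embedding is concatenated with z."""
    def __init__(self, z_dim=100, n_classes=10, embed_dim=10, img_dim=28 * 28):
        super().__init__()
        self.label_embed = nn.Embedding(n_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(z_dim + embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),                       # output scaled to [-1, 1]
        )

    def forward(self, z, labels):
        y = self.label_embed(labels)         # (batch, embed_dim)
        return self.net(torch.cat([z, y], dim=1))
```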
Cost functions
Do cost functions matter? They must, otherwise all those research efforts would be a waste. But if you have heard about a 2017 Google Brain paper, you will definitely have doubts. Still, pushing image quality is a top priority, and we will likely see researchers trying different cost functions before we have a definite answer on their merits.
The following figure lists the cost functions for some common GAN models.
We decided not to detail these cost functions in this article. Here are the articles that cover some common cost functions in detail: WGAN/WGAN-GP, EBGAN/BEGAN, LSGAN, RGAN and RaGAN. At the end of this article, we list an article that studies all these cost functions in more detail. Since the cost function is a major research area in GANs, we encourage you to read that article later.
Here are some FID scores (the lower the better) on a few datasets. This is one reference point, but be warned that it is still too early to draw conclusions about which cost functions perform best. Indeed, no single cost function performs best across all datasets yet.
(MM GAN uses the GAN cost function from the original paper. NS GAN uses the alternative cost function from the same paper that addresses vanishing gradients.)
But no model performs well without good hyperparameters, and tuning GANs takes time. Be patient with hyperparameter optimization before randomly testing different cost functions. Some researchers have suggested that tuning the hyperparameters may reap a better return than changing the cost function. A carefully tuned learning rate may mitigate some serious GAN problems like mode collapse. Specifically, lower the learning rate and redo the training when mode collapse happens.
We can also experiment with different learning rates for the generator and the discriminator. For example, the following graph uses a learning rate of 0.0003 for the discriminator and 0.0001 for the generator in WGAN-GP training.
Implementation tips
- Scale the image pixel value between -1 and 1. Use tanh as the output layer for the generator.
- Experiment with sampling z from Gaussian distributions.
- Batch normalization often stabilizes training.
- Use PixelShuffle and transpose convolution for upsampling.
- Avoid max pooling for downsampling. Use convolution stride.
- Adam optimizer usually works better than other methods.
- Add noise to the real and generated images before feeding them into the discriminator.
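Here is a tiny sketch of two of these tips (pixel scaling for a tanh generator, and adding noise before feeding the discriminator); the noise level is illustrative:

```python
import torch

def preprocess(images_uint8):
    """Scale pixel values from [0, 255] to [-1, 1] to match a tanh generator output."""
    return images_uint8.float() / 127.5 - 1.0

def add_instance_noise(images, sigma=0.1):
    """Add small Gaussian noise to both real and generated images before the discriminator."""
    return images + sigma * torch.randn_like(images)
```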
The dynamics of GAN models are not yet well understood, so some of these tips are just suggestions and your mileage may vary. For example, the LSGAN paper reports that RMSProp gave more stable training in their experiments. This is somewhat rare but demonstrates the challenge of making generic recommendations.
The discriminator and the generator are constantly competing with each other. Be prepared for the cost function value to go up and down. Don’t stop the training prematurely even if the cost seems to trend up. Monitor the results visually to verify the progress of the training.
Virtual batch normalization (VBN)
Batch normalization (BN) has become a de facto standard in many deep network designs. The mean and the variance of BN are derived from the current minibatch. However, this creates a dependency between samples: the generated images are not independent of each other.
This is reflected in experiments where the generated images within the same batch show a color tint.
Originally, we sample z from a random distribution, which gives us independent samples. However, the bias created by batch normalization overwhelms the randomness of z.
Virtual batch normalization (VBN) samples a reference batch before the training. In the forward pass, we can use this preselected reference batch to compute the normalization parameters (μ and σ) for BN. However, we would overfit the model to this reference batch since we use the same batch over the whole training. To mitigate that, we can combine the reference batch with the current batch to compute the normalization parameters.
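A minimal sketch of this combined-statistics variant is below; note that a full VBN layer also has learnable scale and shift parameters and weights the current example explicitly, which this simplified version omits.

```python
import torch

def virtual_batch_norm(x, ref_batch, eps=1e-5):
    """Normalize x using statistics from a fixed reference batch combined with x,
    so each sample depends less on the other samples in its own minibatch."""
    combined = torch.cat([ref_batch, x], dim=0)
    mean = combined.mean(dim=0, keepdim=True)
    var = combined.var(dim=0, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)
```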
Random seeds
The random seeds used to initialize the model parameters impact the performance of a GAN. As shown below, the FID scores measuring GAN performance vary over 50 individual training runs. But the range is relatively small, and the remaining differences can likely be handled in later fine-tuning.
A Google Brain paper indicates that LSGAN occasionally fails or collapses on some datasets, and training then needs to be restarted with another random seed.
Introducing new loss functions
As the model parameters oscillate a lot and can vary in ways that never converge, some have decided to explore new loss functions to help GANs reach a better optimum. In fact, several researchers have pointed out that selecting the right loss function can effectively tackle training instability. Improvements in loss functions fall into two categories: the proposal of a new probability distance or divergence, which can address the mode collapse problem by stabilizing GAN training, as observed in WGAN; or the introduction of regularization or a gradient penalty, as observed in WGAN-GP, which improved the training stability of the previously proposed WGAN.
Batch normalization
DCGAN strongly recommends adding BN to the network design, and the use of BN has become general practice in many deep network models. However, there are exceptions. The following figure demonstrates the impact of BN on different datasets; the y-axis is the FID score (the lower the better). As suggested by the WGAN-GP paper, BN should be turned off when the gradient penalty is used. We suggest readers check the cost function used and the corresponding FID performance with BN, and verify the setting with experiments.
Limitations
GANs have been responsible for remarkable achievements in generating realistic samples of data that have never been seen before, in applications ranging from generating images of people who do not exist (check the paper here), through generating text (check the paper here), to generating fake audio that is indistinguishable from real audio (check the paper here), and even generating authentic-sounding music (check the paper here)! However, there are still some limitations that hinder GAN development. Such limitations include: i) model training instability, ii) difficulty in evaluating their performance, and iii) the mode collapse problem. The purpose of this article is to cover some recent research that proposed a multi-generator setup to mitigate some of these limitations, such as mode collapse.
Mode collapse
Mode collapse in GANs refers to the problem of missing some of the modes of the multi-modal data the model was trained on. Simply speaking, in the case of a GAN trained on a dataset consisting of digits from 0 to 9, a GAN suffering from mode collapse would fail to generate some of the digits (e.g., it generates only 0 to 7, or every digit except 5, etc.). The image below shows an example of training two GANs: the first row shows a normal GAN learning to successfully generate 10 modes (10 digits), while the second row shows a GAN suffering from mode collapse, generating only one mode.
Mode collapse examples are not limited to that. In an example with animal classes (cat or dog), the generator may learn to generate images of both cats and dogs, with cats of many colors and features but only limited colors and features for dogs (e.g., only white poodles).
Why does mode collapse occur? There are different hypotheses in the literature, yet our understanding of it is still lacking. However, one obvious reading of what happens during mode collapse is that the generator fails to model the distribution of the training data well enough.
Solutions from the literature
So, having explained mode collapse, it is time to present some techniques to mitigate this problem. One of the most promising directions in the last couple of years has been using multiple generators, so we are going to discuss this direction in detail.
AdaGAN: Inspired by boosting techniques, Tolstikhin et al. (2017) elevated that technique to train a collection of generators instead of a single one. The training is done sequentially by adding a new generator to the mixture model. A classifier is trained to separate the original images from fake images generated by the mixture model, and the classifier weights are then used to reweight the training set. The reweighting is mainly done to ignore images of modes the mixture is already confident about (modes/classes whose images are already generated well), and a new GAN is then trained after this reweighting, making it less probable for this GAN to miss the modes that the mixture model previously missed.
This work, however, has two main limitations:
- Computationally expensive (multiple GANs).
- Built on the assumption that a single GAN generator can generate good enough images of some classes, which can be untrue for competitive and diverse datasets such as ImageNet, where some GANs tend to generate unidentifiable objects.
MAD-GAN: Ghosh et al. (2018) proposed a GAN setup with multiple generators and one discriminator. In addition, they changed the usual role of the discriminator so that it is not only required to detect whether the provided sample is real or fake, but also to detect which generator was responsible for creating the fake sample (had it decided the sample was fake). Indirectly, the only way for the discriminator to tell which generator generated a particular sample is when there is a recognizable difference between samples from different generators. Hence, this setup encourages different generators to synthesize identifiably different samples, which pushes for diversified modes across generators. Check the illustrative figure below, where each row corresponds to images created by a different generator; we can clearly see that MAD-GAN helps different generators capture different modes.
Naturally, having more generators adds computational complexity to the structure. To counteract that, this paper lets different generators share their initial layer parameters. This is very helpful in the case of homogeneous data, where all classes have similar low-frequency features (for images such as faces or animals), hence avoiding redundant computation in the earlier layers. In the case of a dataset with diverse contents (e.g., persons, scenery), it is advisable to be careful when sharing parameters among generators, as the early layers of each generator may still need to capture quite diverse low-frequency features corresponding to the dataset’s diversity.
Finally, visualizing some of the results of the paper below, increasing the number of generators succeeded in capturing all modes of the training data (orange for the density estimate of the training data, blue for the generated data points).
MGAN: Hoang et al. (2018) proposed a GAN setup with multiple coexisting generators and one discriminator, which is similar to MAD-GAN. What differs from previous works is that there is a direct objective function between the different generators that pushes for more diversity among them. The discriminator still does its usual job of detecting whether the sample it is fed is fake or real, while a classifier is used to detect which generator created the fake sample. It is interesting to mention that parameters are shared among the different generators, as well as between the discriminator and the classifier. These components constitute a Mixture Generative Adversarial Network (MGAN). An overview of the MGAN architecture can be visualized below.
The objective function that the generators are trained to maximize is the Jensen-Shannon Divergence (JSD) between them. Maximizing the JSD between generators helps each generator concentrate on different data modes, which mitigates mode collapse. The authors hypothesize that a combination of distributions learned by different generators is better than a single distribution learned by one generator and can model the real data distribution more closely. Also, similar to MAD-GAN, MGAN shares parameters: the generators share all layers’ parameters except the input layer (the opposite of MAD-GAN), and the discriminator and classifier share all layers except the classification (output) layer. This strategy allows the generators to compute similar high-level features, while the differing low-level computation is handled by each generator’s own input layer (hint: a generator is a deconvolutional network, so the usual intuition about early versus late layers in a convolutional classifier is reversed); a similar intuition motivates sharing the parameters between the discriminator and the classifier.
The authors also present an interesting proof showing that, assuming the generators, the discriminator, and the classifier have enough capacity, the outcome of the GAN minimax game minimizes the JSD between the real distribution and the distribution produced by the combination of generators, while maximizing the JSD between the distributions produced by different generators (check the paper for the full proof).
As for results, looking at the graph below, where the rows show different methods trying to model the real distribution in red (a normal GAN in the first row and MGAN in the last row), it can be clearly seen that MGAN captures the true distribution of the data faster than the other methods and does not suffer from the mode collapse problem, unlike a normal GAN.
Since MAD-GAN and MGAN share similar concepts, I am adding my opinion about both together. Both have concrete concepts and I found both papers very interesting to read. However, I found a general assumption in MGAN that can be inefficient: regarding parameter sharing, it is true that forcing the generators to share all layers’ parameters except the input layer is beneficial, but this can be inefficient in tasks where high-level features are more desired than low-level features (e.g., background information such as scenery, weather, etc., which was highlighted in the MAD-GAN paper).
This is my first blog post, so I would really appreciate feedback and advice 😊. Also, follow me for more articles about recent hot ML papers, and tell me what other topics interest you; I will definitely keep them in mind for the next article!
Spectral Normalization
Spectral normalization is a weight normalization technique that stabilizes the training of the discriminator. It controls the Lipschitz constant of the discriminator to mitigate the exploding gradient problem and the mode collapse problem. The concept is based heavily on math, but conceptually it restricts the weight change in each iteration and keeps the discriminator from depending on a small set of features to distinguish images. This approach is computationally light compared with WGAN-GP and achieves the good mode coverage that eludes many GAN methods.
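In PyTorch, spectral normalization can be applied by wrapping each weight layer of the discriminator; a small illustrative example (layer sizes are arbitrary and assume 32×32 RGB inputs):

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# A discriminator with spectral normalization applied to every weight matrix.
discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),    # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),  # 16x16 -> 8x8
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 8 * 8, 1)),
)
```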
Multiple GANs
Mode collapse may not be all bad. The image quality often improves when mode collapses. In fact, we may collect the best model for each mode and use them to recreate different modes of images.
Balance between discriminator & generator
The discriminator and generator are always in a tug of war to undercut each other. Mode collapse and diminishing gradients are often explained as an imbalance between the discriminator and the generator. We can improve GAN training by turning our attention to balancing the loss between the generator and the discriminator. Unfortunately, the solution seems elusive. We can maintain a static ratio between the number of gradient descent iterations on the discriminator and on the generator. Even though this seems appealing, many doubt its benefit. Often, we maintain a one-to-one ratio, but some researchers also test a ratio of 5 discriminator iterations per generator update. Balancing both networks with dynamic mechanisms has also been proposed, but only in recent years has it gained some traction.
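A static ratio is just a scheduling choice in the training loop; in the sketch below, discriminator_step and generator_step are placeholders for whatever losses and optimizer steps you use.

```python
n_critic = 5   # discriminator updates per generator update (a commonly tested ratio)

for step, (real_batch, _) in enumerate(dataloader):
    d_loss = discriminator_step(D, G, real_batch)   # update D every iteration
    if step % n_critic == 0:
        g_loss = generator_step(D, G)               # update G once every n_critic iterations
```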
On the other hand, some researchers challenge the feasibility and desirability of balancing these networks. A well-trained discriminator gives quality feedback to the generator anyway, and it is not easy to train the generator to always catch up with the discriminator. Instead, we may turn our attention to finding a cost function that does not have a close-to-zero gradient when the generator is not performing well.
Nevertheless, issues remain. Many cost functions have been proposed, and the debate over which is best continues.
Discriminator & generator network capacity
The discriminator model is usually more complex than the generator (more filters and more layers), and a good discriminator gives quality information. In many GAN applications, we may run into bottlenecks where increasing generator capacity shows no quality improvement. Until we identify those bottlenecks and resolve them, increasing generator capacity does not seem to be a priority for many practitioners.
What about hyperparameters and evaluation?
No cost function will work without the selection of good hyperparameters, and GANs are no exception; they are even more sensitive to the choice of network hyperparameters. The right selection of hyperparameters can be tedious and time-consuming, and so far the majority of the effort has gone into topics such as mode collapse or GANs’ struggles to converge.
No cost function will work without the selection of good hyperparameters!
Moreover, GANs lack meaningful measures to evaluate the quality of their output. Since their creation, GANs have been widely used in a variety of application areas, from supervised representation learning, semi-supervised learning, inpainting, and denoising, to synthetic data creation. This breadth of applications brings a lot of heterogeneity, which makes it harder to define how we should evaluate the quality of these networks. Because there are no robust or consistent metrics, in particular for image generation, it is difficult to evaluate which GAN algorithms outperform others. A series of evaluation methods have been proposed in the literature to overcome this challenge — you can find interesting details about GAN evaluation metrics in this article.
BigGAN
BigGAN was published in 2018 with the goal of pulling together the practices that generated the best images at the time. In this section, we will study some of those practices not yet covered.
Larger batch size
Increasing the batch size can produce a significant drop in FID, as shown above. With a bigger batch size, more modes are covered, providing better gradients for both networks to learn from. Yet BigGAN reports that, while the model reaches better performance in fewer iterations, it becomes unstable and may even collapse afterward. So save the model constantly.
Truncation Trick
Low-probability-density regions in the latent space z may not have enough training data to be learned accurately. So when generating images, we can avoid those regions to improve image quality at the cost of variation: the quality of the images increases, but the generated images have lower variance in style. There are different techniques to truncate the input latent space z. The general principle is that when values fall outside a range, they are resampled or squeezed into the higher-probability region.
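A minimal sketch of the resampling variant is below; the threshold value is illustrative, and a lower threshold trades more variety for quality.

```python
import torch

def truncated_z(batch_size, z_dim, threshold=1.0):
    """Sample z from a standard normal, resampling any entry whose magnitude
    exceeds the threshold (the resampling version of the truncation trick)."""
    z = torch.randn(batch_size, z_dim)
    while True:
        mask = z.abs() > threshold
        if not mask.any():
            return z
        z[mask] = torch.randn(int(mask.sum()))
```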
Increase model capacity
During tuning, consider increasing the capacity of the model, in particular for layers with high spatial resolution. Many models show improvement when the capacity is doubled relative to what was traditional at the time. But don’t do it too early, before you have validated the model design and implementation.
Moving averages of Generator weights
The weights used by the generator for sampling are computed as an exponential moving average of the generator’s weights during training.
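A minimal sketch of such an update, where G_ema is a separate copy of the generator used only for sampling and the decay value is illustrative:

```python
import torch

def update_ema_generator(G, G_ema, decay=0.999):
    """Keep an exponential moving average of the generator weights in G_ema."""
    with torch.no_grad():
        for p, p_ema in zip(G.parameters(), G_ema.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)
```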
Orthogonal regularization
The conditioning of the weight matrix is a heavily studied topic. Conditioning measures how sensitive a function’s output is to changes in its input, and it has a large impact on training stability. A matrix Q is orthogonal if
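(the standard definition, reconstructed here since the original formula is not reproduced):

```latex
Q^{\mathsf{T}} Q = Q\, Q^{\mathsf{T}} = I
```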
If we multiply x with an orthogonal matrix, the changes in x will not be magnified. This behavior is very desirable for maintaining numerical stability.
Along with its other properties, maintaining the orthogonality of the weight matrix can be appealing in deep learning. We can add an orthogonal regularization term to encourage this property during training. It penalizes the system if Q deviates from being an orthogonal matrix.
Nevertheless, this is known to be too limiting, and therefore BigGAN uses a modified term:
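For reference, the two regularizers (as given in the BigGAN paper) are, for a weight matrix W, the all-ones matrix 1, the Frobenius norm, and a coefficient β:

```latex
R_\beta(W) = \beta \,\big\lVert W^{\mathsf{T}} W - I \big\rVert_F^2
\quad \text{(original form)}

R_\beta(W) = \beta \,\big\lVert W^{\mathsf{T}} W \odot (\mathbf{1} - I) \big\rVert_F^2
\quad \text{(BigGAN's modification, which leaves the diagonal unconstrained)}
```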
The orthogonal regularization also allows the truncation trick to be more successful across different models.
Orthogonal weight initialization
The model weights are initialized as random orthogonal matrices.
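In PyTorch this is a one-liner per layer; a small illustrative helper:

```python
import torch.nn as nn

def init_orthogonal(module):
    """Initialize the weight matrices of linear and convolutional layers orthogonally."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.orthogonal_(module.weight)

# Usage (generator is a placeholder for your own model):
# generator.apply(init_orthogonal)
```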
Skip-z connection
In the vanilla GAN, the latent factor z is input to the first layer only. With skip-z connections, direct skip connections from the latent factor z are added to multiple layers of the generator rather than just the first layer.
Further readings
In this article, we do not detail improvements made through the cost function. This is an important topic, and we recommend readers read the article below: