# Introduction

In this coursework we present two variants of the GAN architecture, DCGAN and CGAN, applied to the MNIST dataset, and evaluate performance metrics across various optimisation techniques. The MNIST dataset contains 60,000 training images and 10,000 testing images of size 28x28, spread across ten classes representing the ten handwritten digits.

Generative Adversarial Networks are systems of models which learn to output data similar to the training data. A trained GAN takes noise as input and produces an output with the same dimensions and relevant features as the samples it has been trained with.

GANs employ two neural networks, a *discriminator* and a *generator*, which compete in a zero-sum game. The task of the *discriminator* is to distinguish generated images from real images, while the task of the *generator* is to produce realistic images which are able to fool the discriminator.

Training a shallow GAN with no convolutional layers poses problems such as mode collapse and unbalanced G-D losses, which lead to low quality image output.

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/generic_gan_mode_collapse.pdf}
\caption{Vanilla GAN mode collapse}
\label{fig:mode_collapse}
\end{center}
\end{figure}

Mode collapse occurs with our naive *vanilla GAN* (Appendix-\ref{fig:vanilla_gan}) implementation after 200,000 batches. The generated images observed during mode collapse can be seen in figure \ref{fig:mode_collapse}. The output of the generator represents only a few of the classes present in the training data. When mode collapse is reached, the loss of the generator stops improving, as shown in figure \ref{fig:vanilla_loss}. We observe that the discriminator loss tends to zero as the discriminator learns to classify the generated 1's as fake, while the generator is stuck producing 1's and hence is not able to improve.

A significant improvement to this vanilla architecture is Deep Convolutional Generative Adversarial Networks (DCGAN).

It is possible to artificially balance the number of steps between G and D backpropagation; however, we think that with a solid GAN structure this step is not needed. Updating D more frequently than G resulted in additional cases of mode collapse due to the vanishing gradient issue. Updating G more frequently did not prove beneficial either, as the discriminator did not learn how to distinguish real samples from fake samples quickly enough.

For these reasons the following sections will not present any artificial balancing of G-D training steps, opting for a standard single step update for both discriminator and generator.

# DCGAN

## DCGAN Architecture description

DCGAN exploits convolutional stride to perform downsampling and transposed convolution to perform upsampling.

We use batch normalization at the output of each convolutional layer (except for the output layer of the generator and the input layer of the discriminator). The activation functions of the intermediate layers are `ReLU` (for the generator) and `LeakyReLU` with slope 0.2 (for the discriminator). The activation functions used for the output are `tanh` for the generator and `sigmoid` for the discriminator. The output of each convolutional layer in the discriminator passes through dropout before feeding the next layer. This gave a significant improvement in performance, and we estimated an optimal dropout rate of 0.25. The optimizer used for training is `Adam(learning_rate=0.002, beta=0.5)`.

The main architecture used can be observed in figure \ref{fig:dcganarc}.
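
To make the downsampling and upsampling steps concrete, the following is a minimal sketch of how the spatial size of a feature map changes through strided and transposed convolutions. It assumes stride 2, "same" padding and two layers on each side, none of which is stated in the text; the function names are purely illustrative.

```python
import math


def conv_same_out(size: int, stride: int) -> int:
    """Spatial size after a strided convolution with 'same' padding (assumed)."""
    return math.ceil(size / stride)


def conv_transpose_same_out(size: int, stride: int) -> int:
    """Spatial size after a transposed convolution with 'same' padding (assumed)."""
    return size * stride


def trace_sizes(size: int, stride: int, n_layers: int, step) -> list:
    """Apply `step` n_layers times and record every intermediate size."""
    sizes = [size]
    for _ in range(n_layers):
        size = step(size, stride)
        sizes.append(size)
    return sizes


# Discriminator: a 28x28 MNIST image is downsampled by the stride.
discriminator_sizes = trace_sizes(28, 2, 2, conv_same_out)
# Generator: an assumed 7x7 seed feature map is upsampled back to 28x28.
generator_sizes = trace_sizes(7, 2, 2, conv_transpose_same_out)

print("discriminator:", discriminator_sizes)
print("generator:", generator_sizes)
```

Under these assumptions the generator mirrors the discriminator, so the two networks exchange images of matching size without any pooling or interpolation layers.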

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/DCGAN_arch.pdf}
\caption{DCGAN Architecture}
\label{fig:dcganarc}
\end{center}
\end{figure}

## Tests on MNIST

We evaluate three different GAN architectures, varying the size of convolutional layers in the generator, while retaining the structure presented in figure \ref{fig:dcganarc}:

* Shallow: Conv128-Conv64
* Medium: Conv256-Conv128
* Deep: Conv512-Conv256

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/med_dcgan_ex.png}
\includegraphics[width=24em]{fig/med_dcgan.png}
\caption{Medium DCGAN}
\label{fig:dcmed}
\end{center}
\end{figure}

We observed that deeper architectures reach an equilibrium of G-D losses more easily. Our medium depth DCGAN achieves very good performance, balancing both binary cross entropy losses at approximately 0.9 after 5,000 batches, reaching equilibrium quicker and with less oscillation than the deepest DCGAN tested.

As DCGAN is trained with no labels, the generator's primary objective is to output images that fool the discriminator, but it does not intrinsically separate the classes from one another. Therefore we sometimes observe oddly shaped fused digits which may temporarily be labeled real by the discriminator. This issue is solved by training the network for more batches or introducing a deeper architecture, as can be deduced from a qualitative comparison between figures \ref{fig:dcmed}, \ref{fig:dcshort} and \ref{fig:dclong}.

Applying Virtual Batch Normalization to our Medium DCGAN does not provide observable changes in G-D balancing, but reduces within-batch correlation. Although it is difficult to qualitatively assess the improvements, figure \ref{fig:vbn_dc} shows the results of introducing this technique.
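
The reason Virtual Batch Normalization reduces within-batch correlation can be illustrated with a small NumPy sketch. This is a simplified version that normalises with the statistics of a fixed reference batch only (the formulation in [@improved] also includes the current example in the statistics); the function names, feature size and random data are assumptions made for illustration.

```python
import numpy as np


def reference_statistics(reference_batch, eps=1e-5):
    """Per-feature mean and standard deviation of a fixed reference batch."""
    mean = reference_batch.mean(axis=0)
    std = np.sqrt(reference_batch.var(axis=0) + eps)
    return mean, std


def virtual_batch_norm(x, mean, std, gamma=1.0, beta=0.0):
    """Normalise x with reference statistics instead of its own batch statistics."""
    return gamma * (x - mean) / std + beta


rng = np.random.default_rng(0)
# Hypothetical reference batch, chosen once and kept fixed during training.
reference = rng.normal(loc=3.0, scale=2.0, size=(128, 16))
mean, std = reference_statistics(reference)

batch = rng.normal(loc=3.0, scale=2.0, size=(128, 16))
out_full = virtual_batch_norm(batch, mean, std)
out_single = virtual_batch_norm(batch[:1], mean, std)

# The output for a sample is the same whether or not it shares a batch with others.
print(np.allclose(out_full[:1], out_single))
```

With ordinary batch normalisation the last comparison would generally fail, because each output depends on the other samples in the same batch.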

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/vbn_dc.pdf}
\caption{DCGAN Virtual Batch Normalization}
\label{fig:vbn_dc}
\end{center}
\end{figure}

We evaluated the effect of different dropout rates (results in appendix figures \ref{fig:dcdrop1_1}, \ref{fig:dcdrop1_2}, \ref{fig:dcdrop2_1}, \ref{fig:dcdrop2_2}) and concluded that the optimisation of the dropout hyper-parameter is essential for maximising performance. A high dropout rate results in DCGAN producing only artifacts that do not match any specific class, due to the generator performing better than the discriminator. Conversely, a low dropout rate leads to an initial stabilisation of G-D losses, but ultimately results in instability in the form of oscillation when training for a large number of batches.

While training the different proposed DCGAN architectures, we did not observe mode collapse, indicating that DCGAN is less prone to collapse than our *vanilla GAN*.

# CGAN

## CGAN Architecture description

CGAN is a conditional version of a GAN which utilises labeled data. Unlike DCGAN, CGAN is trained with explicitly provided labels, which allow CGAN to associate features with specific labels. This has the intrinsic advantage of allowing us to specify the label of generated data. The baseline CGAN which we evaluate is shown in figure \ref{fig:cganarc}. The baseline CGAN architecture consists of a series of blocks, each containing a dense layer, a LeakyReLU layer (slope=0.2) and a Batch Normalisation layer. The baseline discriminator uses Dense layers, each followed by LeakyReLU (slope=0.2) and a Dropout layer. The optimizer used for training is `Adam(learning_rate=0.002, beta=0.5)`.

The Convolutional CGAN analysed follows a structure similar to DCGAN and is presented in figure \ref{fig:cdcganarc}.
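
One common way of providing the label to a conditional generator is to concatenate a one-hot encoding of the class with the noise vector. The sketch below illustrates that idea only; the text does not state how the label is combined with the noise in our models, and the latent dimension of 100 and the function names are assumptions.

```python
import numpy as np

NUM_CLASSES = 10   # ten MNIST digits
LATENT_DIM = 100   # assumed size of the noise vector


def one_hot(labels, num_classes=NUM_CLASSES):
    """Encode integer class labels as one-hot row vectors."""
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


def generator_input(noise, labels):
    """Concatenate noise and one-hot labels into a single conditional input."""
    return np.concatenate([noise, one_hot(labels)], axis=1)


rng = np.random.default_rng(0)
labels = np.arange(NUM_CLASSES)  # request one sample of each digit
noise = rng.normal(size=(NUM_CLASSES, LATENT_DIM)).astype(np.float32)
g_in = generator_input(noise, labels)

print(g_in.shape)
```

Because the label is part of the input, the digit to be generated can be chosen at sampling time by changing `labels` while keeping the noise fixed.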

We evaluate permutations of the architecture involving:

* Shallow CGAN - 1 Dense-LeakyReLU-BN block
* Deep CGAN - 5 Dense-LeakyReLU-BN blocks
* Deep Convolutional GAN - DCGAN + conditional label input
* One-Sided Label Smoothing (LS)
* Various Dropout (DO) - Use 0.1, 0.3 and 0.5
* Virtual Batch Normalisation (VBN) - Normalisation based on one batch (BN) [@improved]

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/CGAN_arch.pdf}
\caption{CGAN Architecture}
\label{fig:cganarc}
\end{center}
\end{figure}

## Tests on MNIST

When comparing the three levels of depth for the architectures, we notice significant differences in the balancing of G-D losses. In a shallow architecture we notice a high oscillation of the generator loss (figure \ref{fig:cshort}), with the generator being overpowered by the discriminator. Despite this we do not experience any issues with vanishing gradient, hence no mode collapse is reached. Similarly, with a deep architecture the discriminator still overpowers the generator, and an equilibrium between the two losses is not achieved. The image quality in both cases is not high: even after 20,000 batches some pictures appear slightly blurry (figure \ref{fig:clong}).

The best compromise is reached for 3 Dense-LeakyReLU-BN blocks, as shown in figure \ref{fig:cmed}. The G-D losses are well balanced and their value goes below 1, suggesting that the GAN is approaching the theoretical Nash equilibrium, in which the discriminator outputs 0.5 for every sample. The image quality is better than in the two examples reported earlier, suggesting that this Medium-depth architecture is the best compromise.
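
As a point of reference for reading the loss curves, the sketch below evaluates the binary cross-entropy at the theoretical equilibrium, where the discriminator outputs 0.5 for every sample. It assumes that both losses are mean binary cross-entropies; the function name is illustrative.

```python
import numpy as np


def binary_cross_entropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy between target labels and predicted probabilities."""
    p = np.clip(predictions, eps, 1.0 - eps)
    return float(np.mean(-(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))))


batch_size = 128
d_out = np.full(batch_size, 0.5)  # discriminator cannot tell real from fake

d_loss_real = binary_cross_entropy(np.ones(batch_size), d_out)   # real samples, target 1
d_loss_fake = binary_cross_entropy(np.zeros(batch_size), d_out)  # fake samples, target 0
g_loss = binary_cross_entropy(np.ones(batch_size), d_out)        # generator wants D to say "real"

# All three equal ln 2, roughly 0.693.
print(d_loss_real, d_loss_fake, g_loss, np.log(2))
```

Balanced losses slightly above this value are therefore consistent with a discriminator that is only marginally better than chance.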

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/med_cgan_ex.png}
\includegraphics[width=24em]{fig/med_cgan.png}
\caption{Medium CGAN}
\label{fig:cmed}
\end{center}
\end{figure}

The three dropout rates attempted do not affect the performance significantly: as we can see in figures \ref{fig:cg_drop1_1} (0.1), \ref{fig:cmed} (0.3) and \ref{fig:cg_drop2_1} (0.5), both image quality and G-D losses are comparable.

The biggest improvement in performance is obtained through one-sided label smoothing, shifting the true labels from 1 to 0.9 to reinforce discriminator behaviour. Using 0.1 instead of zero for the fake labels does not improve performance, as the discriminator loses incentive to do better (generator behaviour is reinforced). Performance results for one-sided label smoothing with true labels = 0.9 are shown in figure \ref{fig:smooth}.

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/smoothing_ex.png}
\includegraphics[width=24em]{fig/smoothing.png}
\caption{One sided label smoothing}
\label{fig:smooth}
\end{center}
\end{figure}

Virtual Batch Normalization does not affect performance significantly. Applying this technique to both of the CGAN architectures used keeps G-D losses mostly unchanged. The biggest change we expect to see is a lower correlation between images in the same batch. This aspect will mostly affect performance when training a classifier with the generated images from CGAN, as we will obtain more diverse images. Training with a larger batch size would show more significant results, but since we set this parameter to 128 the issue of within-batch correlation is limited.

Convolutional CGAN did not achieve better results than our baseline approach for the architecture analyzed, although we believe that it is possible to achieve better performance by finer tuning of the Convolutional CGAN parameters.
Figure \ref{fig:cdcloss} shows a very high oscillation of the generator loss, hence the image quality varies a lot at each training step. Attempting LS on this architecture achieved a similar outcome when compared to the non-convolutional counterpart.

# Inception Score

Inception score is calculated as introduced by Tim Salimans et al. [@improved]. However, as we are evaluating MNIST, we use LeNet-5 [@lenet] as the basis of the inception score. We use the logits extracted from LeNet:

$$ \textrm{IS}(x) = \exp\left( \mathbb{E}_x \left[ \textrm{KL}\left( p(y\mid x) \,\|\, p(y) \right) \right] \right) $$

We further report the classification accuracy as found with LeNet. For consistency, the inception scores were calculated by training the LeNet classifier under the same conditions across all experiments (100 epochs with SGD optimizer, learning rate = 0.001).

\begin{table}[H]
\begin{tabular}{llll}
 & Accuracy & IS & GAN Tr. Time \\ \hline
Shallow CGAN & 0.645 & 3.57 & 8:14 \\
Medium CGAN & 0.715 & 3.79 & 10:23 \\
Deep CGAN & 0.739 & 3.85 & 16:27 \\
Convolutional CGAN & 0.737 & 4 & 25:27 \\
Medium CGAN+LS & 0.749 & 3.643 & 10:42 \\
Convolutional CGAN+LS & 0.601 & 2.494 & 27:36 \\
Medium CGAN DO=0.1 & 0.761 & 3.836 & 10:36 \\
Medium CGAN DO=0.5 & 0.725 & 3.677 & 10:36 \\
Medium CGAN+VBN & 0.735 & 3.82 & 19:38 \\
Medium CGAN+VBN+LS & 0.763 & 3.91 & 19:43 \\
*MNIST original & 0.9846 & 9.685 & N/A \\ \hline
\end{tabular}
\end{table}

## Discussion

### Architecture

We observe increased accuracy as we increase the depth of the GAN architecture, at the cost of training time. There appear to be diminishing returns with the deeper networks, and larger improvements are achievable with specific optimisation techniques. Despite the initial considerations about G-D losses for the Convolutional CGAN, there seems to be an improvement in inception score and test accuracy with respect to the other analysed cases.
One sided label smoothing, however, did not improve this performance any further, suggesting that reinforcing discriminator behaviour does not benefit the system in this case.

### One Sided Label Smoothing

One sided label smoothing involves relaxing our confidence in the labels in our data. Tim Salimans et al. [@improved] show that smoothing of the positive labels reduces the vulnerability of the neural network to adversarial examples. For our baseline (Medium CGAN) we observe an improvement in classification accuracy (0.715 to 0.749), although the Inception score decreases slightly (3.79 to 3.643); when combined with VBN, both metrics improve (0.735 to 0.763 and 3.82 to 3.91).

### Virtual Batch Normalisation

Virtual Batch Normalisation is a further optimisation technique proposed by Tim Salimans et al. [@improved]. Virtual batch normalisation is a modification to the batch normalisation layer, which performs normalisation based on statistics from a reference batch. We observe that VBN improved the classification accuracy and the Inception score. TODO EXPLAIN WHY

### Dropout

Dropout does not affect the performance of the non-convolutional CGAN architecture as much as it does in DCGAN: the quality of the images produced and the G-D losses both remain almost unchanged. Ultimately, judging from the inception scores, it is preferable to use a low dropout rate (in our case 0.1 achieves the best results).

# Re-training the handwritten digit classifier

## Results

In this section we analyze the effect of retraining the classification network using a mix of real and generated data, highlighting the benefits of injecting generated samples into the original training set to boost testing accuracy.

As shown in figure \ref{fig:mix1}, we performed two experiments for performance evaluation:

* Keeping the same number of training samples while changing only the ratio of real to generated data (55,000 samples in total).
* Keeping the whole training set from MNIST and adding generated samples from CGAN.
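
The two experiments above can be sketched as two ways of assembling a training set. The following is a minimal NumPy illustration using small placeholder arrays rather than MNIST or CGAN output; the function names, array sizes and the 30% fraction used in the example call are assumptions for demonstration.

```python
import numpy as np


def mix_fixed_total(real_x, real_y, gen_x, gen_y, gen_fraction, total, rng):
    """Experiment 1: fixed training set size, varying share of generated data."""
    n_gen = int(round(gen_fraction * total))
    n_real = total - n_gen
    real_idx = rng.choice(len(real_x), size=n_real, replace=False)
    gen_idx = rng.choice(len(gen_x), size=n_gen, replace=False)
    x = np.concatenate([real_x[real_idx], gen_x[gen_idx]])
    y = np.concatenate([real_y[real_idx], gen_y[gen_idx]])
    order = rng.permutation(total)  # shuffle real and generated samples together
    return x[order], y[order]


def add_generated(real_x, real_y, gen_x, gen_y, n_gen, rng):
    """Experiment 2: keep every real sample and append n_gen generated ones."""
    gen_idx = rng.choice(len(gen_x), size=n_gen, replace=False)
    x = np.concatenate([real_x, gen_x[gen_idx]])
    y = np.concatenate([real_y, gen_y[gen_idx]])
    order = rng.permutation(len(x))
    return x[order], y[order]


rng = np.random.default_rng(0)
# Placeholder data: real images are all zeros, generated images are all ones.
real_x = np.zeros((550, 28, 28), dtype=np.float32)
gen_x = np.ones((550, 28, 28), dtype=np.float32)
real_y = rng.integers(0, 10, size=550)
gen_y = rng.integers(0, 10, size=550)

mixed_x, mixed_y = mix_fixed_total(real_x, real_y, gen_x, gen_y, 0.3, 550, rng)
added_x, added_y = add_generated(real_x, real_y, gen_x, gen_y, 200, rng)

print(mixed_x.shape, int(mixed_x[:, 0, 0].sum()), added_x.shape)
```

Sampling without replacement keeps each real or generated image from appearing twice in the same training set, which makes the reported ratio of real to generated data exact.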

\begin{figure}
\begin{center}
\includegraphics[width=12em]{fig/mix_zoom.png}
\includegraphics[width=12em]{fig/added_generated_data.png}
\caption{Mix data, left unchanged samples number, right added samples}
\label{fig:mix1}
\end{center}
\end{figure}

Both experiments show that the optimal amount of generated data to boost testing accuracy on the original MNIST dataset is around 30%, as in both cases we observe an increase in accuracy of around 0.3%. In the absence of original data the testing accuracy drops significantly, to around 20% in both cases.

## Adapted Training Strategy

For this section we use 550 samples from MNIST (55 samples per class). Training the classifier poses major challenges, since the number of samples available for training is relatively small.

Training for 100 epochs, as in the previous section, is clearly not enough: the MNIST test set accuracy reached in this case is only 62%, while training for 300 epochs we can reach up to 88%. The learning curve in figure \ref{fig:few_real} suggests we cannot achieve much better with this very small amount of data, since the validation accuracy plateaus while the training accuracy almost reaches 100%.

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/train_few_real.png}
\caption{Training with few real samples}
\label{fig:few_real}
\end{center}
\end{figure}

We conduct one experiment, feeding the test set to a LeNet trained exclusively on data generated from our CGAN. Training for the first 5 epochs gives good results (figure \ref{fig:fake_only}) when compared to the learning curve obtained when training the network with only the few real samples. This indicates that we can use the generated data to train the first steps of the network (initial weights) and then apply the real samples for 300 epochs to obtain a finer tuning.
As observed in figure \ref{fig:few_init}, the first steps of retraining show oscillation, since the fine tuning adapts to the newly fed data. The maximum accuracy reached before the validation curve plateaus is 88.6%, indicating that this strategy was somewhat successful at improving testing accuracy.

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/initialization.png}
\caption{Retraining with initialization from generated samples}
\label{fig:few_init}
\end{center}
\end{figure}

We try to improve the results obtained earlier by retraining LeNet with mixed data: few real samples and plenty of generated samples (160,000); the learning curve is shown in figure \ref{fig:training_mixed}. The peak accuracy reached is 91%. We then remove the generated samples and apply fine tuning using only the real samples. After 300 more epochs (figure \ref{fig:training_mixed}) the test accuracy is boosted to 92%, making this technique the most successful attempt at improvement while using a limited amount of data from the MNIST dataset.

\begin{figure}
\begin{center}
\includegraphics[width=12em]{fig/training_mixed.png}
\includegraphics[width=12em]{fig/fine_tuning.png}
\caption{Retraining; Mixed initialization left, fine tuning right}
\label{fig:training_mixed}
\end{center}
\end{figure}

Examples of classification failures are displayed in figure \ref{fig:retrain_fail}. The results shown indicate that the network we trained is performing quite well, as most of the testing images that were misclassified (mainly nines and fours) show ambiguities.

\newpage

# Bonus Questions

## Relation to PCA

Similarly to GANs, PCA can be used to formulate **generative** models of a system. While GANs are trained neural networks, PCA is a definite statistical procedure which performs an orthogonal transformation of the data.
Both attempt to identify the most important or *variant* features of the data (which we may then use to generate new data), but PCA by itself is only able to extract linearly related features. In a purely linear system, a GAN would converge to PCA. In a more complicated system, we would need to identify relevant kernels in order to extract relevant features with PCA, while a GAN is able to leverage dense and convolutional neural network layers which may be trained to perform relevant transformations.

## Data representation

TODO EXPLAIN WHAT WE HAVE DONE HERE

\begin{figure}
\centering
\subfloat[][]{\includegraphics[width=.2\textwidth]{fig/pca-mnist.png}}\quad
\subfloat[][]{\includegraphics[width=.2\textwidth]{fig/tsne-mnist.png}}\\
\subfloat[][]{\includegraphics[width=.2\textwidth]{fig/pca-cgan.png}}\quad
\subfloat[][]{\includegraphics[width=.2\textwidth]{fig/tsne-cgan.png}}
\caption{Visualisations: a) MNIST|PCA b) MNIST|TSNE c) CGAN-gen|PCA d) CGAN-gen|TSNE}
\label{fig:features}
\end{figure}

\begin{figure}
\centering
\subfloat[][]{\includegraphics[width=.22\textwidth]{fig/pr-mnist.png}}\quad
\subfloat[][]{\includegraphics[width=.22\textwidth]{fig/pr-cgan.png}}
\caption{Precision-Recall Curves: a) MNIST b) CGAN output}
\label{fig:rocpr}
\end{figure}

## Factoring in classification loss into GAN

Classification accuracy and Inception score can be factored into the GAN to attempt to produce more realistic images. Shane Barratt and Rishi Sharma are able to indirectly optimise the inception score to over 900, and note that directly optimising for maximised Inception score produces adversarial examples [@inception-note].
Nevertheless, a pretrained static classifier may be added to the GAN model, and its loss incorporated into the loss of the GAN:

$$ L_{\textrm{total}} = \alpha L_{\textrm{LeNet}} + \beta L_{\textrm{generator}} $$

# References

\newpage

# Appendix

## DCGAN-Appendix

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/vanilla_gan_arc.pdf}
\caption{Vanilla GAN Architecture}
\label{fig:vanilla_gan}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/generic_gan_loss.png}
\caption{Shallow GAN D-G Loss}
\label{fig:vanilla_loss}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/short_dcgan_ex.png}
\includegraphics[width=24em]{fig/short_dcgan.png}
\caption{Shallow DCGAN}
\label{fig:dcshort}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/long_dcgan_ex.png}
\includegraphics[width=24em]{fig/long_dcgan.png}
\caption{Deep DCGAN}
\label{fig:dclong}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/dcgan_dropout01_gd.png}
\caption{DCGAN Dropout 0.1 G-D Losses}
\label{fig:dcdrop1_1}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/dcgan_dropout01.png}
\caption{DCGAN Dropout 0.1 Generated Images}
\label{fig:dcdrop1_2}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/dcgan_dropout05_gd.png}
\caption{DCGAN Dropout 0.5 G-D Losses}
\label{fig:dcdrop2_1}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/dcgan_dropout05.png}
\caption{DCGAN Dropout 0.5 Generated Images}
\label{fig:dcdrop2_2}
\end{center}
\end{figure}

## CGAN-Appendix

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/CDCGAN_arch.pdf}
\caption{Deep Convolutional CGAN Architecture}
\label{fig:cdcganarc}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/short_cgan_ex.png}
\includegraphics[width=24em]{fig/short_cgan.png}
\caption{Shallow CGAN}
\label{fig:cshort}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/long_cgan_ex.png}
\includegraphics[width=24em]{fig/long_cgan.png}
\caption{Deep CGAN}
\label{fig:clong}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/cgan_dropout01.png}
\caption{CGAN Dropout 0.1 G-D Losses}
\label{fig:cg_drop1_1}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/cgan_dropout01_ex.png}
\caption{CGAN Dropout 0.1 Generated Images}
\label{fig:cg_drop1_2}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/cgan_dropout05.png}
\caption{CGAN Dropout 0.5 G-D Losses}
\label{fig:cg_drop2_1}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/cgan_dropout05_ex.png}
\caption{CGAN Dropout 0.5 Generated Images}
\label{fig:cg_drop2_2}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=12em]{fig/good_ex.png}
\includegraphics[width=12em]{fig/bad_ex.png}
\includegraphics[width=24em]{fig/cdcgan.png}
\caption{Convolutional CGAN+LS}
\label{fig:cdcloss}
\end{center}
\end{figure}

## Retrain-Appendix

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/fake_only.png}
\caption{Retraining with generated samples only}
\label{fig:fake_only}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=12em]{fig/retrain_fail.png}
\caption{Retraining failures}
\label{fig:retrain_fail}
\end{center}
\end{figure}