From ff79fdd9acd1849e69d2beda574f1e9b0e2cce22 Mon Sep 17 00:00:00 2001 From: nunzip Date: Thu, 14 Mar 2019 22:34:36 +0000 Subject: Revision overall content --- report/paper.md | 66 +++++++++++++++++++++++++++++---------------------------- 1 file changed, 34 insertions(+), 32 deletions(-) diff --git a/report/paper.md b/report/paper.md index 9aaa278..098523a 100644 --- a/report/paper.md +++ b/report/paper.md @@ -2,7 +2,7 @@ In this coursework we present two variants of the GAN architecture - DCGAN and CGAN, applied to the MNIST dataset and evaluate performance metrics across various optimisations techniques. The MNIST dataset contains 60,000 training images and 10,000 testing images of size 28x28, spread across ten classes representing the ten handwritten digits. -Generative Adversarial Networks present a system of models which learn to output data, similar to training data. A trained GAN takes noise as an input and is able to provide an output with the same dimensions and relevant features as the samples it has been trained with. +Generative Adversarial Networks present a system of models which learn to output data similar to training data. A trained GAN takes noise as an input and is able to provide an output with the same dimensions and relevant features as the samples it has been trained with. GANs employ two neural networks - a *discriminator* and a *generator* which contest in a min-max game. The task of the *discriminator* is to distinguish generated images from real images, while the task of the generator is to produce realistic images which are able to fool the discriminator. @@ -16,7 +16,7 @@ Training a shallow GAN with no convolutional layers poses problems such as mode \end{center} \end{figure} -Some of the main challanges faced when training a GAN are: **mode collapse**, **low quality** of images and **mismatch** between generator and discriminator loss. 
Mode collapse is achieved with our naive *vanilla GAN* (Appendix-\ref{fig:vanilla_gan}) implementation after 200,000 batches. The generated images observed during a mode collapse can be seen on figure \ref{fig:mode_collapse}. The output of the generator only represents few of the labels originally fed. When mode collapse is reached loss function of the generator stops improving as shown in figure \ref{fig:vanilla_loss}. We observe, the discriminator loss tends to zero as the discriminator learns to assume and classify the fake 1s, while the generator is stuck producing 1 and hence not able to improve. +Some of the main challenges faced when training a GAN are: **mode collapse**, **low quality** of images and **mismatch** between generator and discriminator loss. Mode collapse is achieved with our naive *vanilla GAN* (Appendix-\ref{fig:vanilla_gan}) implementation after 200,000 batches. The generated images observed during a mode collapse can be seen in figure \ref{fig:mode_collapse}. The output of the generator represents only a few of the labels originally fed. When mode collapse is reached, the loss function of the generator stops improving, as shown in figure \ref{fig:vanilla_loss}. We observe that the discriminator loss tends to zero as the discriminator learns to assume and classify the fake 1s, while the generator is stuck producing 1 and hence is not able to improve. A significant improvement to this vanilla architecture is Deep Convolutional Generative Adversarial Networks (DCGAN). @@ -59,9 +59,9 @@ We evaluate three different GAN architectures, varying the size of convolutional \end{figure} We observed that the deep architectures result in a more easily achievable equilibria of G-D losses. -Our medium depth DCGAN achieves very good performance, balancing both binary cross entropy losses at approximately 0.9 after 5,000 batches, reaching equilibrium quicker and with less oscillation that the Deepest DCGAN tested. 
+Our medium depth DCGAN achieves very good performance (figure \ref{fig:dcmed}), balancing both binary cross entropy losses at approximately 0.9 after 5,000 batches, reaching equilibrium quicker and with less oscillation than the Deepest DCGAN tested (figure \ref{fig:dclong}). -As DCGAN is trained with no labels, the generator primary objective is to output images that fool the discriminator, but does not intrinsically separate the classes form one another. Therefore we sometimes observe oddly shape fused digits which may temporarily full be labeled real by the discriminator. This issue is solved by training the network for more batches or introducing a deeper architecture, as it can be deducted from a qualitative comparison +As DCGAN is trained with no labels, the generator's primary objective is to output images that fool the discriminator, but it does not intrinsically separate the classes from one another. Therefore we sometimes observe oddly shaped digits which may temporarily be labeled as real by the discriminator. This issue is solved by training the network for more batches or introducing a deeper architecture, as can be deduced from a qualitative comparison between figures \ref{fig:dcmed}, \ref{fig:dcshort} and \ref{fig:dclong}. Applying Virtual Batch Normalization our Medium DCGAN does not provide observable changes in G-D losses, but reduces within-batch correlation. Although it is difficult to qualitatively assess the improvements, figure \ref{fig:vbn_dc} shows results of the introduction of this technique. @@ -85,19 +85,20 @@ but no mode collapse was observed even with the shallow model. ## CGAN Architecture description -CGAN is a conditional version of a GAN which utilises labeled data. Unlike DCGAN, CGAN is trained with explicitly provided labels which allow CGAN to associate features with specific labels. This has the intrinsic advantage of allowing us to specify the label of generated data. 
The baseline CGAN which we evaluate is visible in figure \ref{fig:cganarc}. The baseline CGAN architecture presents a series blocks each contained a dense layer, `LeakyReLu` layer (slope=0.2) and a Batch Normalisation layer. The baseline discriminator uses Dense layers, followed by `LeakyReLu` (slope=0.2) and a Droupout layer. +CGAN is a conditional version of a GAN which utilises labeled data. Unlike DCGAN, CGAN is trained with explicitly provided labels which allow CGAN to associate features with specific classes. The baseline CGAN which we evaluate is visible in figure \ref{fig:cganarc}. The baseline CGAN architecture presents a series of blocks, each containing a dense layer, a `LeakyReLu` layer (`slope=0.2`) and a Batch Normalisation layer. The baseline discriminator uses Dense layers, followed by `LeakyReLu` (`slope=0.2`) and a Dropout layer. The optimizer used for training is `Adam` (`learning_rate=0.002`, `beta=0.5`). The Convolutional CGAN (CDCGAN) analysed follows the structure presented in the relevant Appendix section. 
It uses TODO ADD BRIEF DESCRIPTION We evaluate permutations of the architecture involving: -* Shallow CGAN - 1 `Dense-LeakyReLu-BN` block -* Deep CGAN - 5 `Dense-LeakyReLu-BN` -* Deep Convolutional GAN - DCGAN + conditional label input +* Shallow CGAN - 1 `Dense-LeakyReLu` block +* Medium CGAN - 3 `Dense-LeakyReLu` blocks +* Deep CGAN - 5 `Dense-LeakyReLu` blocks +* Deep Convolutional CGAN (CDCGAN) * One-Sided Label Smoothing (LS) -* Various Dropout (DO)- Use 0.1, 0.3 and 0.5 -* Virtual Batch Normalisation (VBN)- Normalisation based on one batch(BN) [@improved] +* Various Dropout (DO): 0.1, 0.3 and 0.5 +* Virtual Batch Normalisation (VBN) - Normalisation based on one batch (BN) [@improved] \begin{figure} \begin{center} @@ -109,11 +110,11 @@ We evaluate permutations of the architecture involving: ## Tests on MNIST -When comparing the three levels of depth for the architectures it is possible to notice significant differences for the G-D losses balancing. In +When comparing the three levels of depth for the baseline architecture it is possible to notice significant differences in the balancing of G-D losses. In a shallow architecture we notice a high oscillation of the generator loss (figure \ref{fig:cshort}), which is being overpowered by the discriminator. Despite this we don't experience any issues with vanishing gradient, hence no mode collapse is reached. -Similarly, with a deep architecture the discriminator still overpowers the generator, and an equilibrium between the two losses is not achieved. The image quality in both cases is not really high: we can see that even after 20,000 batches the some pictures appear to be slightly blurry (figure \ref{fig:clong}). -The best compromise is reached for 3 Dense-LeakyReLu-BN blocks as shown in figure \ref{fig:cmed}. It is possible to observe that G-D losses are perfectly balanced, and their value goes below 1. 
+Similarly, with a deep architecture the discriminator still overpowers the generator, and an equilibrium between the two losses is not achieved. The image quality in both cases is not really high: we can see that even after 20,000 batches some pictures appear to be slightly blurry (figure \ref{fig:clong}). +The best compromise is reached for `3 Dense-LeakyReLu` blocks as shown in figure \ref{fig:cmed}. It is possible to observe that G-D losses are perfectly balanced, and their value goes below 1. The image quality is better than the two examples reported earlier, proving that this Medium-depth architecture is the best compromise. \begin{figure} @@ -125,12 +126,12 @@ The image quality is better than the two examples reported earlier, proving that \end{center} \end{figure} -The three levels of dropout rates attempted do not affect the performance significantly, and as we can see in figures \ref{fig:cg_drop1_1} (0.1), \ref{fig:cmed}(0.3) and \ref{fig:cg_drop2_1}(0.5), both +Unlike DCGAN, the three levels of dropout rate attempted do not affect the performance significantly, and as we can see in figures \ref{fig:cg_drop1_1} (0.1), \ref{fig:cmed}(0.3) and \ref{fig:cg_drop2_1}(0.5), both image quality and G-D losses are comparable. The biggest improvement in performance is obtained through one-sided label smoothing, shifting the true labels form 1 to 0.9 to reinforce discriminator behaviour. Using 0.1 instead of zero for the fake labels does not improve performance, as the discriminator loses incentive to do better (generator behaviour is reinforced). -Performance results for one-sided labels smoothing with true labels = 0.9 are shown in figure \ref{fig:smooth}. +Performance results for one-sided labels smoothing with `true_labels = 0.9` are shown in figure \ref{fig:smooth}. \begin{figure} \begin{center} @@ -161,8 +162,8 @@ the same classes, indicating that mode collapse still did not occur. \end{figure} The best performing architecture was CDCGAN. 
It is difficult to assess any potential improvement at this stage, since the samples produced -between 8,000 and 13,000 batches are indistinguishable from the ones of the MNIST dataset (as it can be seen in figure \ref{fig:cdc}, middle). Training CDCGAN for more than -15,000 batches is however not beneficial, as the discriminator will keep improving, leading the generator to oscillate and produce bad samples as shown in the reported example. +between 8,000 and 13,000 batches are almost indistinguishable from the ones of the MNIST dataset (as it can be seen in figure \ref{fig:cdc}, middle). Training CDCGAN for more than +15,000 batches is however not beneficial, as the discriminator will keep improving, leading the generator loss to increase and produce bad samples as shown in the reported example. We find a good balance for 12,000 batches. \begin{figure} @@ -177,9 +178,9 @@ We find a good balance for 12,000 batches. Oscillation on the generator loss is noticeable in figure \ref{fig:cdcloss} due to the discriminator loss approaching zero. One possible adjustment to tackle this issue was balancing G-D training steps, opting for G/D=3, allowing the generator to gain some advantage over the discriminator. This -technique allowed to smooth oscillation while producing images of similar quality. A quantitative performance assessment will be performed in the following section. +technique allowed to smooth oscillation while producing images of similar quality. Using G/D=6 dampens oscillation almost completely leading to the vanishing discriminator's gradient issue. Mode collapse occurs in this specific case as shown on -figure \ref{fig:cdccollapse}. Checking the embeddings extracted from a pretrained LeNet classifier (figure \ref{fig:clustcollapse})we observe low diversity between features of each class, that +figure \ref{fig:cdccollapse}. 
Checking the embeddings extracted from a pretrained LeNet classifier (figure \ref{fig:clustcollapse}) we observe low diversity between features of each class, that tend to collapse to very small regions. \begin{figure} @@ -206,7 +207,7 @@ tend to collapse to very small regions. Virtual Batch Normalization on this architecture was not attempted as it significantly increased the training time (about twice more). Introducing one-sided label smoothing produced very similar results (figure \ref{fig:cdcsmooth}), hence a quantitative performance assessment will need to -be performed in the next section through the introduction of Inception Scores. +be performed in the next section to determine which performs better (through Inception Scores). # Inception Score @@ -241,11 +242,11 @@ Medium CGAN+VBN+LS & 0.763 & 3.91 & 19:43 \\ ### Architecture -We observe increased accruacy as we increase the depth of the GAN arhitecture at the cost of the training time. There appears to be diminishing returns with the deeper networks, and larger improvements are achievable with specific optimisation techniques. CDCGAN achieves improved performance in comparison to the other cases analysed as we expected from the results obtained in the previous section, since the samples produced are almost identical to the ones of the original MNIST dataset. +We observe increased accuracy as we increase the depth of the GAN architecture at the cost of training time. There appear to be diminishing returns with the deeper networks, and larger improvements are achievable with specific optimisation techniques. CDCGAN achieves improved performance in comparison to the other cases analysed as we expected from the results obtained in the previous section, since the samples produced are almost identical to the ones of the original MNIST dataset. ### One Side Label Smoothing -One sided label smoothing involves relaxing our confidence on the labels in our data. Tim Salimans et. al. 
[@improved] show smoothing of the positive labels reduces the vulnerability of the neural network to adversarial examples. We observe significant improvements to the Inception score and classification accuracy in the case of our baseline (Medium CGAN). This technique however did not improve the performance of the CDCGAN any further, suggesting that reinforcing discriminator behaviour does not benefit the system in this case. +One-sided label smoothing involves relaxing our confidence in the labels in our data. Tim Salimans et al. [@improved] show that smoothing of the positive labels reduces the vulnerability of the neural network to adversarial examples. We observe significant improvements to the Inception score and classification accuracy in the case of our baseline (Medium CGAN). This technique, however, did not improve the performance of CDCGAN any further, suggesting that reinforcing discriminator behaviour does not benefit the system in this case. ### Virtual Batch Normalisation @@ -253,12 +254,12 @@ Virtual Batch Normalisation is a further optimisation technique proposed by Tim ### Dropout -The effect of dropout for the non-convolutional CGAN architecture does not affect performance as much as in DCGAN, nor does it seem to affect the quality of images produced, together with the G-D loss remain almost unchanged. Ultimately, judging from the inception scores, it is preferable to use a low dropout rate (in our case 0.1 seems to be the dropout rate that achieves the best results). +Despite the difficulties in judging differences between G-D losses and image quality, the dropout rate seems to have a noticeable effect on accuracy and inception score, with a variation of 3.6% between our best and worst dropout cases. Ultimately, judging from the measurements, it is preferable to use a low dropout rate (0.1 seems to be the one that achieves the best results). 
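
The comparisons above rely on the inception score, so a small worked sketch may help. It assumes the standard definition, $\exp(\mathbb{E}_x[\mathrm{KL}(p(y|x)\,\|\,p(y))])$, applied to the softmax outputs of a pretrained classifier on generated samples. The function name `inception_score` and the `eps` constant are illustrative and not taken from the report's code.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Score from an (n_samples, n_classes) array of softmax outputs.

    Assumption: each row of `probs` is the class distribution p(y|x) that a
    pretrained classifier assigns to one generated image.
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Marginal class distribution p(y), averaged over all generated samples.
    marginal = probs.mean(axis=0, keepdims=True)
    # Per-sample KL divergence between p(y|x) and p(y); eps avoids log(0).
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Two limiting cases on ten classes:
confident_and_diverse = np.eye(10)          # one confident sample per class
uninformative = np.full((10, 10), 0.1)      # classifier cannot tell classes apart
score_high = inception_score(confident_and_diverse)
score_low = inception_score(uninformative)
```

Under this definition the score on ten classes ranges from 1 (uninformative or collapsed output) to 10 (confident predictions spread evenly over all digits), which is one way to read the values reported in the table.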
### G-D Balancing on CDCGAN -Despite achieving lower oscillation on the losses, using G/D=3 to incentivize generator training did not improve the performance of CDCGAN as it is observed from -the inception score and testing accuracy. We obtain in fact 5% less testing precision, meaning that using this technique in our architecture produces on +Despite achieving lower loss oscillation, using G/D=3 to incentivize generator training did not improve the performance of CDCGAN, as can be observed from +the inception score and testing accuracy. In fact, we obtain 5% lower test accuracy, meaning that using this technique in our architecture produces on average lower quality images when compared to our standard CDCGAN. # Re-training the handwritten digit classifier @@ -272,8 +273,8 @@ injecting generated samples in the original training set to boost testing accura As observed in figure \ref{fig:mix1} we performed two experiments for performance evaluation: -* Keeping the same number of training samples while just changing the amount of real to generated data (55,000 samples in total). -* Keeping the whole training set from MNIST and adding generated samples from CGAN. +* Keeping the same number of training samples while just changing the ratio of real to generated data (55,000 samples in total). +* Keeping the whole training set from MNIST and adding generated samples from CDCGAN. \begin{figure} \begin{center} @@ -288,7 +289,7 @@ Both experiments show that training the classification network with the injectio ## Adapted Training Strategy -For this section we will use 550 samples from MNIST (55 samples per class). Training the classifier yields major challenges, since the amount of samples available for training is relatively small. +For this section we will use 550 samples from MNIST (55 samples per class). Training the classifier poses major challenges, since the number of samples available is relatively small. 
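
A class-balanced subset like the 550-sample one described above could be drawn as in the following minimal sketch. It is an assumption about the procedure, not the report's actual code: the helper name `sample_per_class`, the fixed seed, and the synthetic stand-in labels are all hypothetical.

```python
import numpy as np

def sample_per_class(labels, per_class=55, seed=0):
    """Return shuffled indices selecting `per_class` examples of each label."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    picked = []
    for c in np.unique(labels):
        # Indices of all examples belonging to class c.
        idx = np.flatnonzero(labels == c)
        # Draw without replacement so no example is selected twice.
        picked.append(rng.choice(idx, size=per_class, replace=False))
    picked = np.concatenate(picked)
    rng.shuffle(picked)  # avoid feeding the classifier class-sorted batches
    return picked

# Synthetic stand-in for the MNIST label vector (100 examples per digit).
labels = np.repeat(np.arange(10), 100)
subset = sample_per_class(labels)
```

With real data, the returned indices would be applied to both the image and label arrays so that each digit contributes exactly 55 training samples.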
Training for 100 epochs, similarly to the previous section, is clearly not enough. The MNIST test set accuracy reached in this case is only 62%, while training for 300 epochs we can reach up to 88%. The learning curve in figure \ref{fig:few_real} suggests @@ -296,12 +297,13 @@ we cannot achieve much better with this very small amount of data, since the val We conduct one experiment, feeding the test set to a LeNet trained exclusively on data generated from our CGAN. It is noticeable that training for the first 20 epochs gives good results before reaching a plateau (figure \ref{fig:fake_only}) when compared to the learning curve obtained when training the network with only the few real samples. This -indicates that we can use the generated data to train the first steps of the network (initial weights) and apply the real sample for 300 epochs to obtain -a finer tuning. As observed in figure \ref{fig:few_init} the first steps of retraining will show oscillation, since the fine tuning will try and adapt to the newly fed data. The maximum accuracy reached before the validation curve plateaus is 88.6%, indicating that this strategy proved to be somewhat successful at +indicates that we can use the generated data to train the first steps of the network (initialize weights) and train only +with the real samples for 300 epochs to perform fine tuning. +As observed in figure \ref{fig:few_init} the first steps of retraining will show oscillation, since the fine tuning will try and adapt to the newly fed data. The maximum accuracy reached before the validation curve plateaus is 88.6%, indicating that this strategy proved to be somewhat successful at improving testing accuracy. We try to improve the results obtained earlier by retraining LeNet with mixed data: few real samples and plenty of generated samples (160,000) -(learning curve show in figure \ref{fig:training_mixed}. The peak accuracy reached is 91%. 
We then try to remove the generated +(learning curve shown in figure \ref{fig:training_mixed}). The peak accuracy reached is 91%. We then try to remove the generated samples to apply fine tuning, using only the real samples. After 300 more epochs (figure \ref{fig:training_mixed}) the test accuracy is boosted to 92%, making this technique the most successful attempt of improvement while using a limited amount of data from MNIST dataset. @@ -315,7 +317,7 @@ boosted to 92%, making this technique the most successful attempt of improvement \end{figure} Examples of misclassification are displayed in figure \ref{fig:retrain_fail}. It is visible from a cross comparison between these results and the precision-recall -curve displayed in figure \ref{fig:pr-retrain} that the network we trained performs really well for most of the digits, but the low confidence on the digit $8$ lowers +curve displayed in figure \ref{fig:pr-retrain} that the network we trained performs really well for most of the digits, but the low confidence on digit $8$ lowers the overall performance. \begin{figure} -- cgit v1.2.3-54-g00ecf