From 303a485df8b0fe58078521a144e1f4270742cc79 Mon Sep 17 00:00:00 2001
From: nunzip
Date: Fri, 15 Mar 2019 23:03:09 +0000
Subject: Fix grammar

---
 report/paper.md | 47 +++++++++++++++++++++++------------------------
 1 file changed, 23 insertions(+), 24 deletions(-)

(limited to 'report/paper.md')

diff --git a/report/paper.md b/report/paper.md
index 0ed78df..945ffb4 100644
--- a/report/paper.md
+++ b/report/paper.md
@@ -1,12 +1,12 @@
# Introduction

-In this coursework we present two variants of the GAN architecture - DCGAN and CGAN, applied to the MNIST dataset and evaluate performance metrics across various optimizations techniques. The MNIST dataset contains 60,000 training images and 10,000 testing images of size 28x28, spread across ten classes representing the ten handwritten digits.
+In this coursework we present two variants of the GAN architecture - DCGAN and CGAN - applied to the MNIST dataset, and evaluate performance metrics across various optimisation techniques. The MNIST dataset contains 60,000 training images and 10,000 testing images of size 28x28, spread across ten classes representing the ten handwritten digits.

Generative Adversarial Networks represent a system of models characterised by their ability to output data similar to training data. A trained GAN takes noise as an input and is able to provide an output with the same dimensions and relevant features as the samples it has been trained with.

GANs employ two neural networks - a *discriminator* and a *generator* which contest in a min-max game. The task of the *discriminator* is to distinguish generated images from real images, while the task of the *generator* is to produce realistic images which are able to fool the *discriminator*.

-Training a shallow GAN with no convolutional layers exposes problems such as **mode collapse**, and unbalanced *generator-discriminator* losses which lead to **diminishing gradients** and **low quality image output*.
+Training a shallow GAN with no convolutional layers exposes problems such as **mode collapse**, and unbalanced *generator-discriminator* losses which lead to **diminishing gradients** and **low quality image output**.

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/generic_gan_loss.png}
\caption{Shallow GAN D-G Losses}
\label{fig:vanilla_loss}
\end{center}
\end{figure}

@@ -16,7 +16,7 @@ Training a shallow GAN with no convolutional layers exposes problems such as **m
-Mode collapse is achieved with our naive *vanilla GAN* (Appendix-\ref{fig:vanilla_gan}) implementation after 200,000 batches. The generated images observed during a mode collapse can be seen in figure \ref{fig:mode_collapse}. We observe that the output of the generator only represents few of the labels originally fed. When mode collapse is reached the loss function of the generator stops improving as shown in figure \ref{fig:vanilla_loss}. We observe the discriminator loss tends to zero as the discriminator learns to assume and classify the fake one's, while the generator is stuck and hence not able to improve.
+Mode collapse is reached with our naive *vanilla GAN* (Appendix-\ref{fig:vanilla_gan}) implementation after 200,000 batches. The generated images observed during a mode collapse can be seen in figure \ref{fig:mode_collapse}. We observe that the output of the generator represents only a few of the labels originally fed. When mode collapse is reached the loss function of the generator stops improving as shown in figure \ref{fig:vanilla_loss}. We observe that the discriminator loss tends to zero as the discriminator learns to identify and classify the fake ones, while the generator is stuck and hence not able to improve.

A marked improvement to the vanilla architecture is Deep Convolutional Generative Adversarial Networks (DCGAN).

@@ -27,8 +27,8 @@ A marked improvement to the vanilla architecture is Deep Convolutional Generativ
DCGAN exploits convolutional stride to perform downsampling and transposed convolutions to perform upsampling, in contrast to the fully connected layers in a vanilla GAN. The tested implementation uses batch normalization at the output of each convolutional layer (exceptions being the output layer of the generator and the input layer of the discriminator). The activation functions of the intermediate layers are `ReLU` (for generator) and `LeakyReLU` with slope 0.2 (for discriminator).
-The activation functions used for the output are `tanh` for the generator and `sigmoid` for the discriminator. The convolutional layers' output in the discriminator uses dropout before feeding into the next layers. We noticed a significant improvement in performance, and meassured a well performing dropout rate of 0.25.
-The optimizer used for training is `Adam(learning_rate=0.002, beta=0.5)`.
+The activation functions used for the output are `tanh` for the generator and `sigmoid` for the discriminator. The convolutional layers' output in the discriminator uses dropout before feeding into the next layers. We noticed a significant improvement in performance, and measured a well-performing dropout rate of 0.25.
+The optimiser used for training is the Adam optimiser [@adam] with a learning rate of $0.002$ and $\beta=0.5$.

The base architecture used can be observed in figure \ref{fig:dcganarc}.

@@ -42,7 +42,7 @@ The base architecture used can be observed in figure \ref{fig:dcganarc}.

## Tests on MNIST

-We evaluate three variants the DCGAN architecture, varying the size of convolutional layers in the generator, while retaining the structure presented in figure \ref{fig:dcganarc}:
+We evaluate three variants of the DCGAN architecture, varying the size of convolutional layers in the generator, while retaining the structure presented in figure \ref{fig:dcganarc}:

* Shallow: `Conv128-Conv64`
* Medium: `Conv256-Conv128`
* Deep: `Conv512-Conv256`

@@ -76,12 +76,12 @@ but no mode collapse was observed even with the shallow model.

## CGAN Architecture description

-CGAN is a conditional version of a GAN which utilises labeled data. Unlike DCGAN, CGAN is trained with explicitly provided labels which allow CGAN to associate features with specific classes. The baseline CGAN architecture we evaluate is visible in figure \ref{fig:cganarc}. The generator's architecture presents a series of blocks, each containing a dense layer, `LeakyReLU` layer (`slope=0.2`) and a Batch Normalization layer. The baseline discriminator uses Dense layers, followed by `LeakyReLU` (`slope=0.2`) and a Droupout layer. For training we used the Adam optimizer [@adam] with a learning rate of $0.002$ and $\beta=0.5$.
+CGAN is a conditional version of a GAN which utilises labeled data. Unlike DCGAN, CGAN is trained with explicitly provided labels which allow CGAN to associate features with specific classes. The baseline CGAN architecture we evaluate is visible in figure \ref{fig:cganarc}. The generator's architecture presents a series of blocks, each containing a dense layer, `LeakyReLU` layer (`slope=0.2`) and a Batch Normalization layer. The baseline discriminator uses Dense layers, followed by `LeakyReLU` (`slope=0.2`) and a Dropout layer. For training we used the Adam optimiser [@adam] with a learning rate of $0.002$ and $\beta=0.5$.

We also evaluate a Deep Convolutional version of CGAN (cDCGAN), the architecture of which can be found in the Appendix. It uses transpose convolutions with a stride of two to perform upscaling followed by convolutional blocks with singular stride. We find that a kernel size of three by three worked well for all four convolutional blocks which include a Batch Normalization and an Activation layer (`ReLU` for generator and `LeakyReLU` for discriminator). The architecture assessed in this paper uses multiplying layers between the label embedding and the output `ReLU` blocks, as we found that it was more robust compared to the addition of the label embedding via concatenation. Label embedding is performed with a `Dense`, `tanh` and `Upsampling` block, both in the discriminator and the generator, creating a $64\times 28\times 28$ input for the multiplication layers. The output activation layers for generator and discriminator are respectively `tanh` and `sigmoid`.

-The list of the architecture we evaluate in this report:
+The architecture variations we evaluate in this report are:

* Shallow CGAN - $1\times$ `Dense-LeakyReLU` blocks
* Medium CGAN - $3\times$ `Dense-LeakyReLU` blocks

@@ -116,7 +116,7 @@ The image quality is better than the two examples reported earlier, proving that
\end{center}
\end{figure}

-Unlike DCGAN, the three levels of dropout rate attempted do not affect the performance significantly, and as we can see in figures \ref{fig:cg_drop1_1} (0.1), \ref{fig:cmed}(0.3) and \ref{fig:cg_drop2_1}(0.5), both
+Unlike DCGAN, the three levels of dropout rate attempted do not affect the performance significantly, and as we can see in figures \ref{fig:cg_drop1_1}(0.1), \ref{fig:cmed}(0.3) and \ref{fig:cg_drop2_1}(0.5), both
image quality and G-D losses are comparable.
The biggest improvement in performance is obtained through one-sided label smoothing, shifting the true labels from 1 to 0.9.

@@ -133,10 +133,9 @@ Using 0.1 instead of zero for the fake labels does not improve performance, as t
Virtual Batch Normalization provides results that are difficult to qualitatively assess when compared to the ones obtained through the baseline. VBN application does not significantly affect the G-D curves.
-. We expect it to affect
-performance most when training a classifier with the generated images from CGAN, as we
+We expect it to show its impact most when training a classifier with the generated images from CGAN, as we
will generate more robust output samples. Training with a larger batch size
-may result in even more difficult changes to observe, but since we ran for a batch_size of 128 we see definite effects when looking results when performing quantitative measurements.
+may make changes even more difficult to observe, but since we ran with a batch_size of 128 we see clear effects when performing quantitative measurements.

Similarly to DCGAN, changing the G-D steps did not lead to good quality results as can be seen in figure \ref{fig:cbalance}, in which we tried to train with D/G=15 for 10,000 batches, trying to initialize good discriminator weights, to then revert to a D/G=1, aiming to balance the losses of the two networks.

@@ -168,9 +167,9 @@ We find a good balance for 12,000 batches.
\end{figure}

Oscillation on the generator loss is noticeable in figure \ref{fig:cdcloss} due to the discriminator loss approaching zero. One possible
-adjustment to tackle this issue was balancing G-D training steps, is using unbalanced proportion training steps, such as $G/D=3$, allowing the generator to gain some advantage over the discriminator. This
+adjustment to tackle this issue is using unbalanced proportions in the training steps, such as $G/D=3$, allowing the generator to gain some advantage over the discriminator. This
technique allowed us to smooth the oscillation while producing images of similar quality.
-Using $G/D=6$ dampens oscillation almost completely leading to the vanishing discriminator's gradient issue. Mode collapse occurs in this specific case as shown on
+Using $G/D=6$ dampens oscillation almost completely, leading to the vanishing discriminator gradient issue. Mode collapse occurs in this specific case as shown in
figure \ref{fig:cdccollapse}. Checking the PCA embeddings extracted from a pretrained LeNet classifier (figure \ref{fig:clustcollapse}) we observe low diversity between features of each class, which
tend to collapse to very small regions.

@@ -197,18 +196,18 @@ tend to collapse to very small regions.

Virtual Batch Normalization on this architecture was not attempted as it significantly increased the training time (roughly doubling it).

-Introducing one-sided label smoothing produced very similar results (figure \ref{fig:cdcsmooth}), hence a quantitative performance assessment using Inception score is due in the and presented next section.
+Introducing one-sided label smoothing produced very similar results (figure \ref{fig:cdcsmooth}), hence a quantitative performance assessment using the Inception Score is presented in the next section.

# Inception Score

-Inception score is calculated as introduced by Tim Salimans et. al [@improved], used to evaluate the CIFAR-10 dataset. However as we are evaluating MNIST, we use LeNet-5 [@lenet] as the basis of the Inception score, instead of original Inception network.
+Inception Score is calculated as introduced by Tim Salimans et al. [@improved], where it was used to evaluate the CIFAR-10 dataset. However, as we are evaluating MNIST, we use LeNet-5 [@lenet] as the basis of the Inception Score instead of the original Inception network.
To calculate the score we use the logits extracted from LeNet:

$$ \textrm{IS}(x) = \exp(\mathbb{E}_x \left( \textrm{KL} ( p(y\mid x) \| p(y) ) \right) ) $$

We further report the classification accuracy as found with LeNet. For coherence purposes the Inception Scores were
-calculated training the LeNet classifier under the same conditions across all experiments (100 epochs and gradient descent with a learning rate of 0.001).
+calculated by training the LeNet classifier under the same conditions across all experiments (100 epochs and a stochastic gradient descent optimiser with a learning rate of 0.001).

\begin{table}[H]
\begin{tabular}{llll}

@@ -241,7 +240,7 @@ One sided label smoothing involves relaxing our confidence on data labels. Tim S

### Virtual Batch Normalization

-Virtual Batch Normalization is a further optimisation technique proposed by Tim Salimans et. al. [@improved]. Virtual batch normalization is a modification to the batch normalization layer, which performs normalization based on statistics from a reference batch. VBN ipmrvoes the dependency of the output on the other inputs from the same minibatch [@improved]. We observe that VBN improved the classification accuracy and the Inception Score.
+Virtual Batch Normalization is a further optimisation technique proposed by Tim Salimans et al. [@improved]. Virtual batch normalization is a modification to the batch normalization layer, which performs normalization based on statistics from a reference batch. VBN reduces the dependency of each output on the other inputs from the same minibatch [@improved]. We observe that VBN improved the classification accuracy and the Inception Score.

### Dropout

@@ -250,8 +249,8 @@ Dropout appears to have a noticeable effect on accuracy and Inception Score, wit

### G-D Balancing on cDCGAN

Despite achieving lower loss oscillation, using *G/D=3* to incentivize generator training did not improve the performance of cDCGAN as measured by
-the Inception Score and testing accuracy. We obtain 5% less test accuracy, meaning that using this technique in our architecture produces on
-lower quality images on average when compared to our standard cDCGAN.
+the Inception Score and testing accuracy. We obtain 5% less test accuracy, meaning that using this technique in our architecture produces on average
+lower quality images when compared to our standard cDCGAN.

# Re-training the handwritten digit classifier

@@ -286,15 +285,15 @@ Training for 100 epochs, similarly to the previous section, is clearly not enoug
is only 62%, while training for 300 epochs we can reach up to 88%. The learning curve in figure \ref{fig:few_real} suggests we cannot achieve much better with this very small amount of data, since the validation accuracy plateaus, while the training accuracy almost reaches 100%.

-We conduct one experiment, feeding the test set to a LeNet trained exclusively on data generated from our CGAN. It is noticeable that training
-for the first 20 epochs gives good results before reaching a plateau (figure \ref{fig:fake_only}) when compared to the learning curve obtained when training the network with only the few real samples. This
+We conduct one experiment, feeding the test set to a LeNet classifier trained exclusively on data generated from our cDCGAN. It is noticeable that training
+for the first 20 epochs gives good results, before reaching a plateau (figure \ref{fig:fake_only}), when compared to the learning curve obtained when training with only the few real samples. This
indicates that we can use the generated data to train the first steps of the network (initialize weights) and train only with the real samples for 300 epochs to perform fine tuning.
As observed in figure \ref{fig:few_init} the first steps of retraining will show oscillation, since the fine tuning will try to adapt to the newly fed data. The maximum accuracy reached before the validation curve plateaus is 88.6%, indicating that this strategy proved to be somewhat successful at improving testing accuracy.

We try to improve the results obtained earlier by retraining LeNet with mixed data: a few real samples and plenty of generated samples (160,000)
-(learning curve show in figure \ref{fig:training_mixed}). The peak accuracy reached is 91%. We then try to remove the generated
+(learning curve shown in figure \ref{fig:training_mixed}). The peak accuracy reached is 91%. We then try to remove the generated
samples to apply fine tuning, using only the real samples. After 300 more epochs (figure \ref{fig:training_mixed}) the test accuracy is
boosted to 92%, making this technique the most successful attempt of improvement while using a limited amount of data from the MNIST dataset.
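As an illustrative aside to the retraining strategy described in the hunk above (pretraining on a mix of a few real samples and many generated samples, then fine-tuning on the real samples alone), the sketch below shows how such a two-stage schedule could look in `tf.keras`. It is a sketch under assumptions, not the repository's code: the array names (`x_real`, `y_real`, `x_gen`, `y_gen`, `x_test`, `y_test`), the LeNet builder, the optimiser settings and the batch size are hypothetical placeholders; only the epoch counts echo the figures quoted in the text.

```python
# Hypothetical sketch of the mixed-data pretraining + real-only fine-tuning
# schedule described above. Not the repository's code.
import numpy as np
import tensorflow as tf


def build_lenet(num_classes: int = 10) -> tf.keras.Model:
    """A small LeNet-style CNN for 28x28 grayscale digits (assumed layout)."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(6, 5, padding="same", activation="relu",
                               input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(16, 5, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="relu"),
        tf.keras.layers.Dense(84, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])


def train_mixed_then_finetune(x_real, y_real, x_gen, y_gen, x_test, y_test):
    model = build_lenet()
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Stage 1: pretrain on the few real samples mixed with the generated set.
    x_mix = np.concatenate([x_real, x_gen])
    y_mix = np.concatenate([y_real, y_gen])
    model.fit(x_mix, y_mix, epochs=100, batch_size=128, shuffle=True,
              validation_data=(x_test, y_test))

    # Stage 2: drop the generated samples and fine-tune on real data only.
    model.fit(x_real, y_real, epochs=300, batch_size=128,
              validation_data=(x_test, y_test))
    return model
```

Shuffling the concatenated set in the first stage keeps the scarce real samples spread across batches rather than concentrated in a few of them.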
@@ -308,7 +307,7 @@ boosted to 92%, making this technique the most successful attempt of improvement \end{figure} Examples of misclassification are displayed in figure \ref{fig:retrain_fail}. It is visible from a cross comparison between these results and the precision-recall -curve displayed in figure \ref{fig:pr-retrain} that the network performs well for most of the digits, but has is brought down by the relatively low precision for the digit 8, lowering the micro-average'd precision. +curve displayed in figure \ref{fig:pr-retrain} that the network performs well for most of the digits, but is brought down by the relatively low precision for the digit 8, lowering the micro-average precision. \begin{figure} \begin{center} -- cgit v1.2.3-54-g00ecf
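For reference, the LeNet-based Inception Score defined in the patched text, $\textrm{IS}(x) = \exp(\mathbb{E}_x(\textrm{KL}(p(y\mid x) \| p(y))))$, can be computed from a matrix of class probabilities as in the minimal sketch below. It assumes the probabilities are softmax outputs of a pretrained LeNet applied to generated images; the function name and the clipping constant are illustrative choices, not the repository's implementation.

```python
# Hypothetical sketch of the LeNet-based Inception Score defined in the text:
# IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ). Not the repository's code.
import numpy as np


def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """probs: array of shape (N, num_classes); each row is a softmax output p(y|x)."""
    probs = np.clip(probs, eps, 1.0)
    p_y = probs.mean(axis=0, keepdims=True)                     # marginal p(y) over the generated sample
    kl = np.sum(probs * (np.log(probs) - np.log(p_y)), axis=1)  # KL(p(y|x) || p(y)) per image
    return float(np.exp(kl.mean()))                             # exponentiate the expectation


# Shape-only usage example with random probabilities:
# probs = np.random.dirichlet(np.ones(10), size=5000)
# print(inception_score(probs))
```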