
# Introduction 

In this coursework we present two variants of the GAN architecture, DCGAN and CGAN, applied to the MNIST dataset, and evaluate performance metrics across various optimisation techniques. The MNIST dataset contains 60,000 training images and 10,000 testing images of size 28x28, spread across ten classes representing the ten handwritten digits.

## GAN

Generative Adversarial Networks are systems of models which learn to output data similar to the training data. A trained GAN takes noise as input and produces an output with the same dimensions and relevant features as the samples it was trained on.

GANs employ two neural networks, a *discriminator* and a *generator*, which compete in a zero-sum game. The task of the *discriminator* is to distinguish generated images from real images, while the task of the *generator* is to produce realistic images which are able to fool the discriminator.
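
For reference, this zero-sum game is commonly written as the minimax objective of the original GAN formulation. The symbols $D$, $G$, $p_{\textrm{data}}$ and $p_z$ are notation assumed here and are not defined elsewhere in this report:

$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\textrm{data}}} \left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z} \left[ \log \left( 1 - D(G(z)) \right) \right] $$

Here $D(x)$ is the discriminator's estimated probability that $x$ is a real image, and $G(z)$ is the image generated from the noise vector $z$.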

Training a shallow GAN with no convolutional layers poses problems such as mode collapse and unbalanced G-D losses, which lead to low-quality image output.

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/generic_gan_mode_collapse.pdf}
\caption{Vanilla GAN mode collapse}
\label{fig:mode_collapse}
\end{center}
\end{figure}


Our naive *vanilla GAN* implementation (Appendix, figure \ref{fig:vanilla_gan}) reaches mode collapse after 200,000 epochs. The images generated during mode collapse can be seen in figure \ref{fig:mode_collapse}: the output of the generator represents only a few of the classes it was originally fed. Once mode collapse is reached, the loss function of the generator stops improving, as shown in figure \ref{fig:vanilla_loss}. We observe that the discriminator loss tends to zero as the discriminator learns to classify the fake 1s, while the generator is stuck producing 1s and is hence unable to improve.

A significant improvement to this vanilla architecture is Deep Convolutional Generative Adversarial Networks (DCGAN).

# DCGAN

## DCGAN Architecture description

DCGAN exploits convolutional stride to perform downsampling and transposed convolution to perform upsampling. 
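
As a rough illustration of how strides change the spatial size, the sketch below traces a 28x28 MNIST image through two stride-2 convolutions and back through two stride-2 transposed convolutions. It assumes "same" padding as used by common deep learning frameworks, and two layers per network; neither the padding mode, the layer count nor the function names are stated in this report.

```python
import math

def conv_same_out(size, stride):
    """Spatial output size of a strided convolution with 'same' padding."""
    return math.ceil(size / stride)

def conv_transpose_same_out(size, stride):
    """Spatial output size of a strided transposed convolution with 'same' padding."""
    return size * stride

# Discriminator side: each stride-2 convolution halves the feature map.
discriminator_sizes = [28]
for _ in range(2):
    discriminator_sizes.append(conv_same_out(discriminator_sizes[-1], stride=2))

# Generator side: each stride-2 transposed convolution doubles the feature map.
generator_sizes = [7]
for _ in range(2):
    generator_sizes.append(conv_transpose_same_out(generator_sizes[-1], stride=2))

print(discriminator_sizes)  # [28, 14, 7]
print(generator_sizes)      # [7, 14, 28]
```

Under these assumptions the generator can start from a 7x7 feature map and reach the 28x28 MNIST resolution without any explicit pooling or resizing layer.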

We use batch normalization at the output of each convolutional layer (with the exception of the output layer of the generator 
and the input layer of the discriminator). The activation functions of the intermediate layers are `ReLU` (for the generator) and `LeakyReLU` with slope 0.2 (for the discriminator).
The activation functions used for the output are `tanh` for the generator and `sigmoid` for the discriminator. The output of each convolutional layer in the discriminator passes through dropout before feeding the next layer. We noticed a significant improvement in performance with dropout, and estimated an optimal dropout rate of 0.25.
The optimizer used for training is `Adam(learning_rate=0.002, beta=0.5)`.

The main architecture used can be observed in figure \ref{fig:dcganarc}.

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/DCGAN_arch.pdf}
\caption{DCGAN Architecture}
\label{fig:dcganarc}
\end{center}
\end{figure}

## Tests on MNIST

We evaluate three different GAN architectures, varying the size of convolutional layers in the generator, while retaining the structure presented in figure \ref{fig:dcganarc}: 

* Shallow: Conv128-Conv64
* Medium: Conv256-Conv128
* Deep: Conv512-Conv256

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/med_dcgan_ex.pdf}
\includegraphics[width=24em]{fig/med_dcgan.png}
\caption{Medium DCGAN}
\label{fig:dcmed}
\end{center}
\end{figure}

We observed that deep architectures make an equilibrium of the G-D losses easier to achieve.
Our medium-depth DCGAN achieves very good performance, balancing both binary cross-entropy losses at approximately 0.9 after 5,000 epochs, reaching equilibrium quicker and with less oscillation than the deepest DCGAN tested.

As DCGAN is trained with no labels, the generator's primary objective is to output images that fool the discriminator, but it does not intrinsically separate the classes from one another. Therefore we sometimes observe oddly shaped fused digits, which may temporarily be labeled real by the discriminator. This issue is solved by training the network for more epochs or introducing a deeper architecture, as can be deduced from a qualitative comparison
between figures \ref{fig:dcmed}, \ref{fig:dcshort} and \ref{fig:dclong}.

Applying Virtual Batch Normalization to our Medium DCGAN does not produce observable changes in G-D balancing, but reduces within-batch correlation. Although it is difficult to assess the improvements qualitatively, figure \ref{fig:vbn_dc} shows the results of introducing this technique.
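
The idea behind Virtual Batch Normalization can be sketched as follows: activations are normalised with statistics taken from a fixed reference batch rather than from the current mini-batch, so the output for one sample no longer depends on the other samples in its batch. This is a simplified sketch and not the implementation used for our experiments; the version described in [@improved] also includes the current example in the statistics, and the function names and toy data are hypothetical.

```python
import math

def batch_stats(batch):
    """Per-feature mean and variance of a batch (list of equal-length rows)."""
    n, dims = len(batch), len(batch[0])
    mean = [sum(row[d] for row in batch) / n for d in range(dims)]
    var = [sum((row[d] - mean[d]) ** 2 for row in batch) / n for d in range(dims)]
    return mean, var

def virtual_batch_norm(batch, ref_mean, ref_var, eps=1e-5):
    """Normalise a batch using the fixed reference statistics only (simplified)."""
    return [
        [(row[d] - ref_mean[d]) / math.sqrt(ref_var[d] + eps) for d in range(len(row))]
        for row in batch
    ]

# The reference batch is chosen once and kept fixed for the whole training run.
reference_batch = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]
ref_mean, ref_var = batch_stats(reference_batch)

# Any later mini-batch is normalised with the same reference statistics.
normalised = virtual_batch_norm([[2.0, 4.0], [4.0, 2.0]], ref_mean, ref_var)
```

Because `ref_mean` and `ref_var` never change, a given input row is always mapped to the same normalised row, whichever mini-batch it appears in.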

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/vbn_dc.pdf}
\caption{DCGAN Virtual Batch Normalization}
\label{fig:vbn_dc}
\end{center}
\end{figure}

We evaluated the effect of different dropout rates (results in appendix figures \ref{fig:dcdrop1_1}, \ref{fig:dcdrop1_2}, \ref{fig:dcdrop2_1}, \ref{fig:dcdrop2_2}) and concluded that optimising
the dropout hyper-parameter is essential for maximising performance. A high dropout rate results in DCGAN producing only artifacts that do not match any specific class, due to the generator performing better than the discriminator. Conversely, a low dropout rate leads to an initial stabilisation of G-D losses, but ultimately results in instability in the form of oscillation when training for a large number of epochs.

While training the different proposed DCGAN architectures, we did not observe mode collapse, indicating that DCGAN is less prone to collapse than our *vanilla GAN*.

# CGAN

## CGAN Architecture description

CGAN is a conditional version of a GAN which utilises labeled data. Unlike DCGAN, CGAN is trained with explicitly provided labels, which allow CGAN to associate features with specific labels. This has the intrinsic advantage of allowing us to specify the label of generated data. The baseline CGAN which we evaluate is shown in figure \ref{fig:cganarc}. The baseline generator architecture consists of a series of blocks, each containing a dense layer, a ReLU layer and a Batch Normalisation layer. The baseline discriminator uses Dense layers, each followed by a ReLU and a Dropout layer.

We evaluate permutations of the architecture involving:

* Shallow CGAN - 1 Dense-ReLU-BN block
* Deep CGAN - 5 Dense-ReLU-BN blocks
* Deep Convolutional GAN - DCGAN + conditional label input
* Label Smoothing (One Sided) - Truth labels set to 0 and $1-\alpha$ (0.9)
* Various Dropout - Dropout rates of 0.1 and 0.5
* Virtual Batch Normalisation - Normalisation based on one batch [@improved]
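
The one-sided label smoothing listed above can be sketched in a few lines: only the targets for real samples are moved from 1 to $1-\alpha$, while the targets for generated samples stay at 0. The helper names below are hypothetical and the loss is a plain binary cross-entropy written out by hand, so this is an illustration of the idea rather than our training code.

```python
import math

def discriminator_targets(n_real, n_fake, alpha=0.1):
    """One-sided smoothing: real targets become 1 - alpha, fake targets stay 0."""
    return [1.0 - alpha] * n_real + [0.0] * n_fake

def binary_cross_entropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy between targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)

targets = discriminator_targets(n_real=2, n_fake=2)

# An over-confident discriminator (0.999 on real samples) is penalised more
# than one that outputs the smoothed value 0.9 on real samples.
overconfident = binary_cross_entropy(targets, [0.999, 0.999, 0.001, 0.001])
calibrated = binary_cross_entropy(targets, [0.9, 0.9, 0.001, 0.001])
```

With smoothed targets the loss is lowest when the discriminator outputs about 0.9 for real samples, which discourages extreme confidence on the real class.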

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/CGAN_arch.pdf}
\caption{CGAN Architecture}
\label{fig:cganarc}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/CDCGAN_arch.pdf}
\caption{Deep Convolutional CGAN Architecture}
\label{fig:cdcganarc}
\end{center}
\end{figure}

## Tests on MNIST 

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/med_cgan_ex.pdf}
\includegraphics[width=24em]{fig/med_cgan.png}
\caption{Medium CGAN}
\label{fig:cmed}
\end{center}
\end{figure}

### Inception Score

The Inception score is calculated as introduced by Tim Salimans et al. [@improved]. However, as we are evaluating MNIST, we use LeNet as the basis of the inception score.
We use the logits extracted from LeNet:

$$ \textrm{IS}(x) = \exp(\mathbb{E}_x \left( \textrm{KL} ( p(y\mid x) \| p(y) ) \right) ) $$
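
A minimal sketch of the formula above, assuming the classifier returns one row of logits per generated image. The function names are hypothetical, and the toy inputs stand in for LeNet outputs, which are not reproduced here.

```python
import math

def softmax(logits):
    """Convert one row of classifier logits into class probabilities p(y|x)."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    mean_kl = 0.0
    for p in probs:
        mean_kl += sum(
            pj * (math.log(pj + eps) - math.log(marginal[j] + eps))
            for j, pj in enumerate(p)
        ) / n
    return math.exp(mean_kl)

# Toy extremes for a ten-class problem such as MNIST.
uniform = [[0.1] * 10 for _ in range(20)]  # classifier is never confident
one_hot = [[1.0 if j == i % 10 else 0.0 for j in range(10)] for i in range(20)]

score_uniform = inception_score(uniform)  # close to 1, the minimum
score_one_hot = inception_score(one_hot)  # close to 10, the number of classes
```

The two toy cases bracket the possible range: the score approaches 1 when every prediction equals the marginal, and approaches the number of classes when predictions are confident and evenly spread across classes.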

### Classifier Architecture Used

\begin{table}[]
\begin{tabular}{llll}
                      & Accuracy & Inception Sc. & GAN Tr. Time \\ \hline
Shallow CGAN          & 0.645    & 3.57          & 8:14         \\
Medium CGAN           & 0.715    & 3.79          & 10:23        \\
Deep CGAN             & 0.739    & 3.85          & 16:27        \\
Convolutional CGAN    & 0.737    & 4             & 25:27        \\
Medium CGAN+LS        & 0.749    & 3.643         & 10:42        \\
Convolutional CGAN+LS & 0.601    & 2.494         & 27:36        \\
Medium CGAN DO=0.1    & 0.761    & 3.836         & 10:36        \\
Medium CGAN DO=0.5    & 0.725    & 3.677         & 10:36        \\
Medium CGAN+VBN       & 0.745    & 4.02          & 10:38        \\
Medium CGAN+VBN+LS    & 0.783    & 4.31          & 10:38        \\
*MNIST original       & 0.9846   & 9.685         & N/A          \\ \hline
\end{tabular}
\end{table}

## Discussion

### Architecture

### One Side Label Smoothing

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/smoothing_ex.pdf}
\includegraphics[width=24em]{fig/smoothing.png}
\caption{One sided label smoothing}
\label{fig:smooth}
\end{center}
\end{figure}


### Virtual Batch Normalisation


### Dropout
Dropout does not affect the performance of the non-convolutional CGAN architecture as much as it does in DCGAN, nor does it seem to affect the quality of the images produced; the G-D losses also remain almost unchanged. Results are presented in figures \ref{fig:cg_drop1_1}, \ref{fig:cg_drop1_2}, \ref{fig:cg_drop2_1}, \ref{fig:cg_drop2_2}.




# Re-training the handwritten digit classifier

## Results

In this section we analyze the effect of retraining the classification network using a mix of real and generated data, highlighting the benefits of 
injecting generated samples into the original training set to boost testing accuracy.

As shown in figure \ref{fig:mix1}, we performed two experiments for performance evaluation: 

* Keeping the same number of training samples while changing only the ratio of real to generated data (55,000 samples in total).
* Keeping the whole training set from MNIST and adding generated samples from CGAN.
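
The two experiments differ only in how the training set is assembled, which the following sketch makes explicit. The function names, the fixed seed and the small placeholder datasets are assumptions for illustration; in our setting the real samples come from MNIST and the generated ones from CGAN.

```python
import random

def mix_fixed_total(real, generated, generated_fraction, total, seed=0):
    """Experiment 1: constant training set size, varying share of generated data."""
    rng = random.Random(seed)
    n_generated = round(total * generated_fraction)
    mixed = rng.sample(real, total - n_generated) + rng.sample(generated, n_generated)
    rng.shuffle(mixed)
    return mixed

def add_generated(real, generated, n_extra, seed=0):
    """Experiment 2: keep every real sample and append generated ones."""
    rng = random.Random(seed)
    mixed = list(real) + rng.sample(generated, n_extra)
    rng.shuffle(mixed)
    return mixed

# Placeholder samples; each entry is a (source, index) pair.
real = [("real", i) for i in range(100)]
generated = [("generated", i) for i in range(100)]

fixed_size = mix_fixed_total(real, generated, generated_fraction=0.3, total=50)
augmented = add_generated(real, generated, n_extra=20)
```

In the first function the real samples are displaced by generated ones, while in the second the real training set is kept whole, so any change in accuracy there is due to the added data alone.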

\begin{figure}
\begin{center}
\includegraphics[width=12em]{fig/mix_zoom.png}
\includegraphics[width=12em]{fig/added_generated_data.png}
\caption{Mixed data: left, constant number of samples; right, added generated samples}
\label{fig:mix1}
\end{center}
\end{figure}

Both experiments show that the optimal amount of generated data to boost testing accuracy on the original MNIST dataset is around 30%, as in both cases we observe
an increase in accuracy of around 0.3%. In the absence of original data, the testing accuracy drops significantly, to around 20% in both cases.

## Adapted Training Strategy

For this section we use 550 samples from MNIST (55 samples per class). Training the classifier 
poses major challenges, since the number of samples available for training is relatively small.

Training for 100 epochs, as in the previous section, is clearly not enough. The MNIST test set accuracy reached in this case
is only 62%, while training for 300 epochs reaches up to 88%. The learning curve in figure \ref{fig:few_real} suggests that
we cannot do much better with this very small amount of data, since the validation accuracy flattens while the training accuracy 
almost reaches 100%.

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/train_few_real.png}
\caption{Training with few real samples}
\label{fig:few_real}
\end{center}
\end{figure}

We conduct one experiment, feeding the test set to an L2-Net trained exclusively on data generated by our CGAN. Training 
for the first 5 epochs gives noticeably good results (figure \ref{fig:fake_only}) compared to the learning curve obtained when training the network with only the few real samples. This
indicates that we can use the generated data to train the first steps of the network (initial weights) and then apply the real samples for 300 epochs to 
fine-tune it. As observed in figure \ref{fig:few_init}, the first steps of retraining show oscillation, since the fine-tuning tries to adapt to the newly fed data. The maximum accuracy reached before the validation curve plateaus is 88.6%, indicating that this strategy was somewhat successful at 
improving testing accuracy. 

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/initialization.png}
\caption{Retraining with initialization from generated samples}
\label{fig:few_init}
\end{center}
\end{figure}


We try to improve on the results obtained earlier by retraining L2-Net with mixed data: few real samples and plenty of generated samples (160,000);
the learning curve is shown in figure \ref{fig:training_mixed}. The peak accuracy reached is 91%. We then remove the generated 
samples and apply fine-tuning using only the real samples. After 300 more epochs (figure \ref{fig:training_mixed}) the test accuracy is 
boosted to 92%, making this technique the most successful attempt at improvement while using a limited amount of data from the MNIST dataset.

\begin{figure}
\begin{center}
\includegraphics[width=12em]{fig/training_mixed.png}
\includegraphics[width=12em]{fig/fine_tuning.png}
\caption{Retraining; Mixed initialization left, fine tuning right}
\label{fig:training_mixed}
\end{center}
\end{figure}

Examples of classification failures are displayed in figure \ref{fig:retrain_fail}. The results shown indicate that the network we trained is actually performing quite well,
as most of the misclassified testing images (mainly nines and fours) are ambiguous.

# Bonus

## Relation to PCA

Similarly to GANs, PCA can be used to formulate **generative** models of a system. While GANs are trained neural networks, PCA is a fixed statistical procedure which performs orthogonal transformations of the data. While both attempt to identify the most important or *variant* features of the data (which we may then use to generate new data), PCA by itself is only able to extract linearly related features. In a purely linear system, a GAN would converge to PCA. In a more complicated system, we would need to identify relevant kernels in order to extract relevant features with PCA, while a GAN is able to leverage dense and convolutional neural network layers which may be trained to perform relevant transformations.
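
One way to make the comparison concrete, in notation that is assumed here rather than taken from our experiments, is to write PCA generation as an affine map from a low-dimensional code $z$:

$$ \hat{x} = \mu + W_k z, \qquad z \in \mathbb{R}^k $$

where $\mu$ is the mean of the training data and the columns of $W_k$ are the $k$ leading principal components. A GAN generator replaces this affine map with a trained non-linear function, $\hat{x} = G(z)$.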


\begin{figure}
   \centering
   \subfloat[][]{\includegraphics[width=.2\textwidth]{fig/pca-mnist.png}}\quad
   \subfloat[][]{\includegraphics[width=.2\textwidth]{fig/tsne-mnist.png}}\\
   \subfloat[][]{\includegraphics[width=.2\textwidth]{fig/pca-cgan.png}}\quad
   \subfloat[][]{\includegraphics[width=.2\textwidth]{fig/tsne-cgan.png}}
   \caption{PCA and t-SNE visualisations. PCA: a) MNIST, c) CGAN; t-SNE: b) MNIST, d) CGAN}
   \label{fig:features}
\end{figure}


\begin{figure}
   \centering
   \subfloat[][]{\includegraphics[width=.22\textwidth]{fig/pr-mnist.png}}\quad
   \subfloat[][]{\includegraphics[width=.22\textwidth]{fig/pr-cgan.png}}
   \caption{Precision-Recall curves: a) MNIST, b) CGAN output}
   \label{fig:rocpr}
\end{figure}

## Factoring in classification loss into GAN

Classification accuracy and Inception score can be factored into the GAN in an attempt to produce more realistic images. Shane Barratt and Rishi Sharma are able to indirectly optimise the inception score to over 900, and note that directly optimising for a maximised Inception score produces adversarial examples [@inception-note]. 
Nevertheless, a pretrained static classifier may be added to the GAN model, and its loss incorporated into the loss of the GAN.

$$ L_{\textrm{total}} = \alpha L_{2-\textrm{LeNet}} + \beta L_{\textrm{generator}} $$


# References

<div id="refs"></div>

\newpage

# Appendix 

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/vanilla_gan_arc.pdf}
\caption{Vanilla GAN Architecture}
\label{fig:vanilla_gan}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/generic_gan_loss.png}
\caption{Shallow GAN D-G Loss}
\label{fig:vanilla_loss}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/short_dcgan_ex.pdf}
\includegraphics[width=24em]{fig/short_dcgan.png}
\caption{Shallow DCGAN}
\label{fig:dcshort}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/long_dcgan_ex.pdf}
\includegraphics[width=24em]{fig/long_dcgan.png}
\caption{Deep DCGAN}
\label{fig:dclong}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/dcgan_dropout01_gd.png}
\caption{DCGAN Dropout 0.1 G-D Losses}
\label{fig:dcdrop1_1}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=14em]{fig/dcgan_dropout01.png}
\caption{DCGAN Dropout 0.1 Generated Images}
\label{fig:dcdrop1_2}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/dcgan_dropout05_gd.png}
\caption{DCGAN Dropout 0.5 G-D Losses}
\label{fig:dcdrop2_1}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=14em]{fig/dcgan_dropout05.png}
\caption{DCGAN Dropout 0.5 Generated Images}
\label{fig:dcdrop2_2}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/short_cgan_ex.pdf}
\includegraphics[width=24em]{fig/short_cgan.png}
\caption{Shallow CGAN}
\label{fig:cshort}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/long_cgan_ex.pdf}
\includegraphics[width=24em]{fig/long_cgan.png}
\caption{Deep CGAN}
\label{fig:clong}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/cgan_dropout01.png}
\caption{CGAN Dropout 0.1 G-D Losses}
\label{fig:cg_drop1_1}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=14em]{fig/cgan_dropout01_ex.png}
\caption{CGAN Dropout 0.1 Generated Images}
\label{fig:cg_drop1_2}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/cgan_dropout05.png}
\caption{CGAN Dropout 0.5 G-D Losses}
\label{fig:cg_drop2_1}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=14em]{fig/cgan_dropout05_ex.png}
\caption{CGAN Dropout 0.5 Generated Images}
\label{fig:cg_drop2_2}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/fake_only.png}
\caption{Retraining with generated samples only}
\label{fig:fake_only}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=12em]{fig/retrain_fail.png}
\caption{Retraining failures}
\label{fig:retrain_fail}
\end{center}
\end{figure}