# Introduction 

In this coursework we present two variants of the GAN architecture, DCGAN and CGAN, applied to the MNIST dataset, and evaluate performance metrics across various optimisation techniques. The MNIST dataset contains 60,000 training images and 10,000 testing images of size 28x28, spread across ten classes representing the ten handwritten digits.

Generative Adversarial Networks represent a system of models characterised by their ability to output data similar to training data. A trained GAN takes noise as an input and is able to provide an output with the same dimensions and relevant features as the samples it has been trained with.

GANs employ two neural networks, a *discriminator* and a *generator*, which compete in a min-max game. The task of the *discriminator* is to distinguish generated images from real images, while the task of the *generator* is to produce realistic images capable of fooling the *discriminator*.
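
Formally, denoting the generator $G$ and the discriminator $D$, the two networks optimise the standard min-max objective:

$$ \min_G \max_D \; \mathbb{E}_{x \sim p_{\textrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right] $$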

Training a shallow GAN with no convolutional layers exposes problems such as **mode collapse** and unbalanced *generator-discriminator* losses, which lead to **diminishing gradients** and **low quality image output**.

\begin{figure}
\begin{center}
\includegraphics[width=16em]{fig/generic_gan_mode_collapse.pdf}
\caption{Vanilla GAN mode collapse}
\label{fig:mode_collapse}
\end{center}
\end{figure}

Our naive *vanilla GAN* implementation (Appendix, figure \ref{fig:vanilla_gan}) reaches mode collapse after 200,000 batches. The generated images observed during mode collapse can be seen in figure \ref{fig:mode_collapse}. We observe that the output of the generator represents only a few of the classes it was originally fed. Once mode collapse is reached, the loss function of the generator stops improving, as shown in figure \ref{fig:vanilla_loss}. The discriminator loss tends to zero as the discriminator learns to reliably classify the fake images, while the generator is stuck and hence unable to improve.

A marked improvement on the vanilla architecture is the Deep Convolutional Generative Adversarial Network (DCGAN).

# DCGAN

## DCGAN Architecture description

DCGAN exploits strided convolutions to perform downsampling and transposed convolutions to perform upsampling, in contrast to the fully connected layers of a vanilla GAN.

The tested implementation uses batch normalization at the output of each convolutional layer (the exceptions being the output layer of the generator and the input layer of the discriminator). The activation functions of the intermediate layers are `ReLU` (for the generator) and `LeakyReLU` with slope 0.2 (for the discriminator).
The output activation functions are `tanh` for the generator and `sigmoid` for the discriminator. In the discriminator, the output of each convolutional layer passes through dropout before feeding into the next layer; this noticeably improved performance, with a dropout rate of 0.25 performing best.
The optimiser used for training is Adam [@adam] with a learning rate of $0.002$ and $\beta_1=0.5$.

The base architecture used can be observed in figure \ref{fig:dcganarc}.

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/DCGAN_arch.pdf}
\caption{DCGAN Architecture}
\label{fig:dcganarc}
\end{center}
\end{figure}
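
As an illustration, a minimal sketch of the medium variant in Keras follows. The layer sizes match the description above, while the function names and minor details (kernel sizes, padding) are our assumptions rather than the exact implementation.

```python
# Sketch of the medium DCGAN (Conv256-Conv128 generator) described above.
from tensorflow.keras import layers, models, optimizers

def build_generator(latent_dim=100):
    return models.Sequential([
        layers.Dense(256 * 7 * 7, input_dim=latent_dim),
        layers.Reshape((7, 7, 256)),
        layers.Conv2DTranspose(256, 3, strides=2, padding='same'),  # 7x7 -> 14x14
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.Conv2DTranspose(128, 3, strides=2, padding='same'),  # 14x14 -> 28x28
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.Conv2D(1, 3, padding='same', activation='tanh'),     # output image
    ])

def build_discriminator(dropout=0.25):
    m = models.Sequential([
        layers.Conv2D(128, 3, strides=2, padding='same',
                      input_shape=(28, 28, 1)),                     # no BN on input layer
        layers.LeakyReLU(0.2),
        layers.Dropout(dropout),
        layers.Conv2D(256, 3, strides=2, padding='same'),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Dropout(dropout),
        layers.Flatten(),
        layers.Dense(1, activation='sigmoid'),
    ])
    m.compile(loss='binary_crossentropy',
              optimizer=optimizers.Adam(learning_rate=0.002, beta_1=0.5))
    return m
```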

## Tests on MNIST

We evaluate three variants of the DCGAN architecture, varying the size of convolutional layers in the generator, while retaining the structure presented in figure \ref{fig:dcganarc}: 

* Shallow: `Conv128-Conv64`
* Medium: `Conv256-Conv128`
* Deep: `Conv512-Conv256`

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/med_dcgan_ex.png}
\includegraphics[width=24em]{fig/med_dcgan.png}
\caption{Medium DCGAN}
\label{fig:dcmed}
\end{center}
\end{figure}

We observed that the deeper architectures reach an equilibrium of G-D losses more easily.
Our medium-depth DCGAN achieves very good performance (figure \ref{fig:dcmed}), balancing both binary cross-entropy losses at approximately 0.9 after 5,000 batches, reaching equilibrium quicker and with less oscillation than the deepest DCGAN tested (figure \ref{fig:dclong}).

As DCGAN is trained with no labels, the generator's primary objective is to output images that fool the discriminator; it does not intrinsically separate the classes from one another. We therefore sometimes observe oddly shaped digits which may temporarily be labeled as real by the discriminator. This issue is alleviated by training the network for more batches or introducing a deeper architecture, as can be deduced from a qualitative comparison
between figures \ref{fig:dcmed}, \ref{fig:dcshort} and \ref{fig:dclong}.

Applying Virtual Batch Normalization to the medium DCGAN does not produce observable changes in the G-D losses. Although it is difficult to assess the improvements qualitatively, figure \ref{fig:vbn_dc} shows the results of introducing this technique.

We evaluated the effect of different dropout rates (results in Appendix figures \ref{fig:dcdrop1_1}, \ref{fig:dcdrop1_2}, \ref{fig:dcdrop2_1}, \ref{fig:dcdrop2_2}) and concluded that tuning the dropout hyper-parameter is essential for maximising performance. A high dropout rate results in DCGAN producing only artifacts that do not match any specific class, as the generator outperforms the discriminator. Conversely, a low dropout rate leads to an initial stabilization of the G-D losses, but ultimately results in instability in the form of oscillation when training for a large number of batches.

Trying different parameters for artificial G-D balancing in the training stage did not yield any significant benefits, leading exclusively to the generation of more artifacts (figure \ref{fig:baldc}). We also attempted to increase the number of D training steps with respect to G, but no mode collapse was observed, even with the shallow model.
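
For reference, a sketch of the balancing loop we refer to, assuming compiled Keras models `generator`, `discriminator` and a stacked `gan`; all names and the sampling details are ours.

```python
# Artificial G-D balancing: train D for `d_steps` batches per G update
# (d_steps > 1 favours the discriminator, the case discussed above).
import numpy as np

def train(gan, generator, discriminator, x_train,
          batches=20000, batch_size=128, d_steps=3, latent_dim=100):
    for _ in range(batches):
        for _ in range(d_steps):
            # real and fake batches for the discriminator
            idx = np.random.randint(0, x_train.shape[0], batch_size)
            real = x_train[idx]
            noise = np.random.normal(0, 1, (batch_size, latent_dim))
            fake = generator.predict(noise, verbose=0)
            discriminator.train_on_batch(real, np.ones((batch_size, 1)))
            discriminator.train_on_batch(fake, np.zeros((batch_size, 1)))
        # one generator update through the stacked model
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        gan.train_on_batch(noise, np.ones((batch_size, 1)))
```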

# CGAN

## CGAN Architecture description

CGAN is a conditional version of a GAN which utilises labeled data. Unlike DCGAN, CGAN is trained with explicitly provided labels, which allow it to associate features with specific classes. The baseline CGAN architecture we evaluate is visible in figure \ref{fig:cganarc}. The generator's architecture presents a series of blocks, each containing a dense layer, a `LeakyReLU` layer (`slope=0.2`) and a Batch Normalization layer. The baseline discriminator uses dense layers, followed by `LeakyReLU` (`slope=0.2`) and a Dropout layer. For training we used the Adam optimiser [@adam] with a learning rate of $0.002$ and $\beta_1=0.5$.

We also evaluate a deep convolutional version of CGAN (cDCGAN), the architecture of which can be found in the Appendix. It uses transposed convolutions with a stride of two to perform upscaling, followed by convolutional blocks with unitary stride. We found that a kernel size of $3\times 3$ worked well for all four convolutional blocks, each of which includes a Batch Normalization and an Activation layer (`ReLU` for the generator and `LeakyReLU` for the discriminator). The architecture assessed in this paper uses multiplication layers between the label embedding and the outputs of the `ReLU` blocks, as we found this more robust than adding the label embedding via concatenation. Label embedding is performed with a `Dense`, `tanh` and `Upsampling` block, in both the discriminator and the generator, creating a $64\times 28\times 28$ input for the multiplication layers. The output activation layers of the generator and discriminator are respectively `tanh` and `sigmoid`.
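
A sketch of this label-embedding path follows (Keras functional API assumed); beyond the $64\times 28\times 28$ output stated above, the intermediate sizes and names are our assumptions.

```python
# Label branch: Dense -> tanh -> Upsampling, producing a 28x28 map with
# 64 channels that multiplies the convolutional features element-wise.
from tensorflow.keras import Input, layers

def label_branch(num_classes=10):
    label = Input(shape=(num_classes,))                  # one-hot label (assumption)
    e = layers.Dense(7 * 7 * 64, activation='tanh')(label)
    e = layers.Reshape((7, 7, 64))(e)
    e = layers.UpSampling2D(size=4)(e)                   # 28x28 spatial, 64 channels
    return label, e

# Inside the generator/discriminator, the embedding is multiplied with
# the ReLU block output rather than concatenated:
#   features = layers.Multiply()([relu_block_output, embedding_map])
```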

We evaluate the following variations of this architecture in this report:

* Shallow CGAN - $1\times$ `Dense-LeakyReLU` blocks
* Medium CGAN - $3\times$ `Dense-LeakyReLU` blocks
* Deep CGAN - $5\times$ `Dense-LeakyReLU` blocks
* Deep Convolutional CGAN (cDCGAN)
* One-Sided Label Smoothing (LS)
* Various Dropout (DO): 0.1, 0.3 and 0.5
* Virtual Batch Normalization (VBN) - normalization based on a single reference batch [@improved]

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/CGAN_arch.pdf}
\caption{CGAN Architecture}
\label{fig:cganarc}
\end{center}
\end{figure}

## Tests on MNIST 

When comparing the three levels of depth of the baseline architecture, it is possible to notice significant differences in G-D loss balancing. In
the shallow architecture we notice high oscillation of the generator loss (figure \ref{fig:cshort}), which is overpowered by the discriminator. Despite this, for the dense CGAN we did not experience issues with vanishing gradients, and did not reach mode collapse.
Similarly, with the deep architecture the discriminator still overpowers the generator, and an equilibrium between the two losses is not achieved. The image quality in both cases is not particularly high: even after 20,000 batches some pictures appear slightly blurry (figure \ref{fig:clong}).
The best compromise is reached with three `Dense-LeakyReLU` blocks, as shown in figure \ref{fig:cmed}: the G-D losses are well balanced, settling below 1, and the image quality is better than in the two cases reported earlier.

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/med_cgan_ex.png}
\includegraphics[width=24em]{fig/med_cgan.png}
\caption{Medium CGAN}
\label{fig:cmed}
\end{center}
\end{figure}

Unlike for DCGAN, the three dropout rates attempted do not affect performance significantly; as we can see in figures \ref{fig:cg_drop1_1} (0.1), \ref{fig:cmed} (0.3) and \ref{fig:cg_drop2_1} (0.5), both
image quality and G-D losses are comparable.

The biggest improvement in performance is obtained through one-sided label smoothing, shifting the true labels from 1 to 0.9.
Using 0.1 instead of zero for the fake labels does not improve performance, as the discriminator loses its incentive to do better (the generator's behaviour is reinforced) [@improved]. Performance results for one-sided label smoothing with `true_labels = 0.9` are shown in figure \ref{fig:smooth}.
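
A minimal sketch of the discriminator update under one-sided label smoothing follows; the function name and batching are ours.

```python
# One-sided label smoothing: real targets become 0.9, fake targets stay 0.
import numpy as np

def discriminator_step(discriminator, real_images, fake_images, smooth=0.9):
    n = real_images.shape[0]
    loss_real = discriminator.train_on_batch(real_images, np.full((n, 1), smooth))
    loss_fake = discriminator.train_on_batch(fake_images, np.zeros((n, 1)))
    return loss_real, loss_fake
```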

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/smoothing_ex.png}
\caption{One-sided label smoothing}
\label{fig:smooth}
\end{center}
\end{figure}

Virtual Batch Normalization produces results that are difficult to assess qualitatively against those obtained with the baseline, and its application does not significantly affect the G-D curves. We expect its impact to show most when training a classifier with the images generated by CGAN, as the output samples should be more robust. Training with a larger batch size might make the changes even harder to observe; with our batch size of 128, the effects only become clear under quantitative measurement.

Similarly to DCGAN, changing the ratio of G-D training steps did not lead to good quality results, as can be seen in figure \ref{fig:cbalance}: we trained with $D/G=15$ for 10,000 batches to initialize good discriminator weights, then reverted to $D/G=1$, aiming to balance the losses of the two networks.
Even for the shallow network, where we initially expected mode collapse, we found diversity between the samples produced for the same class, showing the contrary.

\begin{figure}
\begin{center}
\includegraphics[width=8em]{fig/bal1.png}
\includegraphics[width=8em]{fig/bal2.png}
\includegraphics[width=8em]{fig/bal3.png}
\caption{CGAN G-D balancing results}
\label{fig:cbalance}
\end{center}
\end{figure}

The best performing architecture was cDCGAN. It is difficult to assess any potential improvement at this stage, since the samples produced
between 8,000 and 13,000 batches are almost indistinguishable from those of the MNIST dataset (as can be seen in figure \ref{fig:cdc}, middle). Training cDCGAN for more than
15,000 batches is however not beneficial, as the discriminator keeps improving, causing the generator loss to increase and bad samples to be produced, as shown in the reported example.
We find a good balance at 12,000 batches.

\begin{figure}
\begin{center}
\includegraphics[width=8em]{fig/cdc1.png}
\includegraphics[width=8em]{fig/cdc2.png}
\includegraphics[width=8em]{fig/cdc3.png}
\caption{cDCGAN outputs; 1000 batches - 12000 batches - 20000 batches}
\label{fig:cdc}
\end{center}
\end{figure}

Oscillation in the generator loss is noticeable in figure \ref{fig:cdcloss}, due to the discriminator loss approaching zero. One possible
adjustment to tackle this issue is using unbalanced proportions in the training steps, such as $G/D=3$, allowing the generator to gain some advantage over the discriminator. This
technique allowed us to smooth the oscillation while producing images of similar quality.
Using $G/D=6$ dampens the oscillation almost completely, but leads to a vanishing discriminator gradient. Mode collapse occurs in this specific case, as shown in
figure \ref{fig:cdccollapse}. Checking the PCA embeddings extracted from a pretrained LeNet classifier (figure \ref{fig:clustcollapse}), we observe low diversity between the features of each class, which
tend to collapse into very small regions.

\begin{figure}
\begin{center}
\includegraphics[width=8em]{fig/cdcloss1.png}
\includegraphics[width=8em]{fig/cdcloss2.png}
\includegraphics[width=8em]{fig/cdcloss3.png}
\caption{cDCGAN G-D loss; Left $G/D=1$; Middle $G/D=3$; Right $G/D=6$}
\label{fig:cdcloss}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=8em]{fig/cdc_collapse.png}
\includegraphics[width=8em]{fig/cdc_collapse.png}
\includegraphics[width=8em]{fig/cdc_collapse.png}
\caption{cDCGAN G/D=6 mode collapse}
\label{fig:cdccollapse}
\end{center}
\end{figure}


Virtual Batch Normalization was not attempted on this architecture, as it roughly doubled the training time.
Introducing one-sided label smoothing produced very similar results (figure \ref{fig:cdcsmooth}), hence a quantitative performance assessment using the Inception Score is required; it is presented in the next section.

# Inception Score

The Inception Score is calculated as introduced by Salimans et al. [@improved], where it was used to evaluate the CIFAR-10 dataset. However, as we are evaluating MNIST, we use LeNet-5 [@lenet] as the basis of the Inception Score, instead of the original Inception network.

To calculate the score we use the class probabilities $p(y\mid x)$ obtained from LeNet:

$$ \textrm{IS}(x) = \exp(\mathbb{E}_x \left( \textrm{KL} ( p(y\mid x) \| p(y) ) \right) ) $$

We also report the classification accuracy as measured with LeNet. For consistency, the Inception Scores were
calculated by training the LeNet classifier under the same conditions across all experiments (100 epochs, stochastic gradient descent with a learning rate of 0.001).
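
A sketch of the score computation follows, directly implementing the formula above; `p_yx` is an $(N, 10)$ array of LeNet softmax outputs over $N$ generated images (function and variable names are ours).

```python
# Inception Score from predicted class probabilities.
import numpy as np

def inception_score(p_yx, eps=1e-16):
    p_y = np.mean(p_yx, axis=0, keepdims=True)             # marginal p(y)
    kl = p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))   # KL(p(y|x) || p(y))
    return float(np.exp(np.mean(np.sum(kl, axis=1))))      # exp of expected KL
```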

\begin{table}[H]
\begin{tabular}{llll}
Model                 & Accuracy & IS            & GAN Tr. Time \\ \hline
Shallow CGAN          & 0.645    & 3.57          & 8:14         \\
Medium CGAN           & 0.715    & 3.79          & 10:23        \\
Deep CGAN             & 0.739    & 3.85          & 16:27        \\
\textbf{cDCGAN}       & \textbf{0.899}	 & \textbf{7.41}          & 1:05:27      \\
Medium CGAN+LS        & 0.749    & 3.643         & 10:42        \\
cDCGAN+LS             & 0.846    & 6.63          & 1:12:39      \\
cDCGAN-G/D=3          & 0.849    & 6.59          & 48:11        \\
cDCGAN-G/D=6          & 0.801    & 6.06          & 36:05        \\
Medium CGAN DO=0.1    & 0.761    & 3.836         & 10:36        \\
Medium CGAN DO=0.5    & 0.725    & 3.677         & 10:36        \\
Medium CGAN+VBN       & 0.735    & 3.82          & 19:38        \\
Medium CGAN+VBN+LS    & 0.763    & 3.91          & 19:43        \\
*MNIST original       & 0.9846   & 9.685         & N/A          \\ \hline
\end{tabular}
\end{table}

## Discussion

### Architecture

We observe increased accuracy as we increase the depth of the GAN architecture, at the cost of training time. There appear to be diminishing returns with the deeper networks, and larger improvements are achievable with specific optimisation techniques. cDCGAN achieves improved performance in comparison to the other networks, as expected from the qualitative observations, where we found the samples produced to be almost indistinguishable from those of the original MNIST dataset.

### One Side Label Smoothing

One-sided label smoothing involves relaxing our confidence in the data labels. Salimans et al. [@improved] show that smoothing the positive labels reduces the vulnerability of the neural network to adversarial examples. We observe significant improvements to the Inception Score and classification accuracy in the case of our baseline (medium CGAN). This technique did not, however, improve the performance of cDCGAN any further, suggesting that the shifted discriminator target does not benefit the system in this case.

### Virtual Batch Normalization

Virtual Batch Normalization is a further optimisation technique proposed by Salimans et al. [@improved]. It is a modification of the batch normalization layer which performs normalization based on the statistics of a fixed reference batch, reducing the dependency of each output on the other inputs in the same minibatch [@improved]. We observe that VBN improved both the classification accuracy and the Inception Score.
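
A simplified sketch of the idea follows; the full formulation in [@improved] also folds the current example into the statistics, whereas this minimal version uses the reference batch alone.

```python
# Virtual batch normalization: statistics come from a fixed reference
# batch chosen once before training, so an example's output no longer
# depends on the rest of its minibatch.
import numpy as np

class VirtualBatchNorm:
    def __init__(self, reference_batch, eps=1e-5):
        self.mean = reference_batch.mean(axis=0)
        self.var = reference_batch.var(axis=0)
        self.eps = eps

    def __call__(self, x):
        return (x - self.mean) / np.sqrt(self.var + self.eps)
```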

### Dropout

Dropout appears to have a noticeable effect on accuracy and Inception Score, with a variation of 3.6% between our best and worst dropout cases. The measurements indicate that a low dropout rate is preferable (0.1 achieves the best results).

### G-D Balancing on cDCGAN

Despite achieving lower loss oscillation, using $G/D=3$ to incentivize generator training did not improve the performance of cDCGAN as measured by
the Inception Score and test accuracy. We obtain 5% lower test accuracy, meaning that this technique produces, on average,
lower quality images than our standard cDCGAN.

# Re-training the handwritten digit classifier

*In the following section the generated data we use will be exclusively produced by our cDCGAN architecture.*

## Results

In this section we analyze the effect of retraining the classification network using a mix of real and generated data, highlighting the benefits of
injecting generated samples into the original training set to boost testing accuracy.

As shown in figure \ref{fig:mix1} we performed two experiments for performance evaluation: 

* Using the same number of training samples while only changing the ratio of real to generated data (55,000 samples in total).
* Using the whole training set from MNIST and adding generated samples from cDCGAN.

\begin{figure}
\begin{center}
\includegraphics[width=12em]{fig/mix_zoom.png}
\includegraphics[width=12em]{fig/added_generated_data.png}
\caption{Mixed data; left: unchanged number of samples, right: added samples}
\label{fig:mix1}
\end{center}
\end{figure}

Both experiments show that training the classification network with injected generated data (between 40% and 90%) causes, on average, a small increase in accuracy of up to 0.2%. In the absence of original data, the testing accuracy drops significantly, to around 40% in both cases.
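
A sketch of how the fixed-size mixed training set of the first experiment can be assembled follows; the function name, argument defaults and shuffling details are our assumptions.

```python
# Build a 55,000-sample training set with a given generated-data ratio.
import numpy as np

def mix_datasets(x_real, y_real, x_gen, y_gen, gen_ratio=0.4, total=55000):
    n_gen = int(total * gen_ratio)
    n_real = total - n_gen
    x = np.concatenate([x_real[:n_real], x_gen[:n_gen]])
    y = np.concatenate([y_real[:n_real], y_gen[:n_gen]])
    idx = np.random.permutation(total)      # shuffle real and generated together
    return x[idx], y[idx]
```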

## Adapted Training Strategy

For this section we use 550 samples from MNIST (55 samples per class). Training the classifier poses major challenges, since the number of available samples is relatively small.

Training for 100 epochs, as in the previous section, is clearly not enough: the MNIST test set accuracy reached in this case
is only 62%, while training for 300 epochs reaches up to 88%. The learning curve in figure \ref{fig:few_real} suggests
we cannot achieve much better with this very small amount of data, since the validation accuracy plateaus while the training accuracy almost reaches 100%.

We conduct one experiment feeding the test set to a LeNet classifier trained exclusively on data generated by our cDCGAN. Compared to the learning curve obtained when training with only the few real samples, training
for the first 20 epochs gives good results before reaching a plateau (figure \ref{fig:fake_only}). This
indicates that we can use the generated data for the first steps of training (to initialize the weights) and then train only
with the real samples for 300 epochs of fine tuning.
As observed in figure \ref{fig:few_init}, the first steps of retraining show oscillation, as fine tuning adapts to the newly fed data. The maximum accuracy reached before the validation curve plateaus is 88.6%, indicating that this strategy is somewhat successful at
improving testing accuracy.

We try to improve on these results by retraining LeNet with mixed data: the few real samples plus plenty of generated samples (160,000;
learning curve shown in figure \ref{fig:training_mixed}). The peak accuracy reached is 91%. We then remove the generated
samples and fine-tune using only the real samples. After 300 more epochs (figure \ref{fig:training_mixed}) the test accuracy is
boosted to 92%, making this technique the most successful attempt at improvement while using a limited amount of data from the MNIST dataset.
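
A sketch of this two-stage strategy follows; the function name is ours, and the stage-1 epoch count is an assumption (the text fixes only the 300 fine-tuning epochs).

```python
# Stage 1: initialize weights on the mixed (mostly generated) set.
# Stage 2: fine-tune on the 550 real MNIST samples alone.
def pretrain_then_finetune(lenet, x_mixed, y_mixed, x_real, y_real):
    lenet.fit(x_mixed, y_mixed, epochs=100, batch_size=128)   # initialization
    lenet.fit(x_real, y_real, epochs=300, batch_size=128)     # fine tuning
    return lenet
```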

\begin{figure}
\begin{center}
\includegraphics[width=12em]{fig/training_mixed.png}
\includegraphics[width=12em]{fig/fine_tuning.png}
\caption{Retraining; Mixed initialization left, fine tuning right}
\label{fig:training_mixed}
\end{center}
\end{figure}

Examples of misclassification are displayed in figure \ref{fig:retrain_fail}. Cross-comparing these results with the precision-recall
curve in figure \ref{fig:pr-retrain} shows that the network performs well for most digits, but is brought down by the relatively low precision for the digit 8, which lowers the micro-average precision.

\begin{figure}
\begin{center}
\includegraphics[width=24em]{fig/pr-retrain.png}
\caption{Retraining; Precision-Recall Curve}
\label{fig:pr-retrain}
\end{center}
\end{figure}


# Bonus Questions

## Relation to PCA

Similarly to GANs, PCA can be used to formulate **generative** models of a system. While GANs are trained neural networks, PCA is a deterministic statistical procedure which performs orthogonal transformations of the data. Both attempt to identify the most important or most *variant* features of the data (which we may then use to generate new data), but PCA by itself is only able to extract linearly related features. In a purely linear system, a GAN would effectively converge to PCA. In a more complicated system, we would need to identify relevant kernels in order to extract relevant features with PCA, while a GAN is able to leverage dense and convolutional layers which may be trained to perform the relevant transformations.

## Data representation

Using the classifier pre-trained on real training examples, we extract embeddings of 10,000 randomly sampled real
test examples and 10,000 randomly sampled synthetic examples from each of CGAN and cDCGAN, across the different classes.
We obtain both PCA and t-SNE representations of our data in two dimensions, shown in figure \ref{fig:features}.
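
A sketch of the projection step follows, assuming scikit-learn and LeNet penultimate-layer features of shape $(N, D)$ in `embeddings` (names are ours).

```python
# Two-dimensional PCA and t-SNE projections of classifier embeddings.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_embeddings(embeddings):
    pca_2d = PCA(n_components=2).fit_transform(embeddings)
    tsne_2d = TSNE(n_components=2).fit_transform(embeddings)
    return pca_2d, tsne_2d
```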

It is observable that the network which achieved a good Inception Score (cDCGAN) produces embeddings very similar
to those obtained from the original MNIST dataset, further strengthening our hypothesis about the performance of this
specific model. With the plain CGAN, on the other hand, we notice higher correlation between the two represented features
for the different classes, meaning that good data separation was not achieved. This is probably due to the additional blur
our simple CGAN model produces around the images.

We present the precision-recall curves for MNIST against those of the dense CGAN and the convolutional CGAN (figure \ref{fig:rocpr}). While the superior performance of the convolutional GAN is evident, it is interesting to note that the precision curves are similar in shape, specifically for the digits 8 and 9. For both architectures 9 is the worst digit on average, but at higher recall a small proportion of extremely poor 8's lowers that digit to the poorest precision.

## Factoring in classification loss into GAN

Classification accuracy and Inception Score can be factored into the GAN in an attempt to produce more realistic images. Shane Barratt and Rishi Sharma were able to indirectly optimise the Inception Score to over 900, and note that directly optimising for a maximised Inception Score produces adversarial examples [@inception-note].
Nevertheless, a pre-trained static classifier may be added to the GAN model, and its loss incorporated into the loss of the GAN:

$$ L_{\textrm{total}} = \alpha L_{\textrm{LeNet}} + \beta L_{\textrm{generator}} $$
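
A sketch of this combined objective follows, assuming a frozen pre-trained LeNet providing logits for the generated images; $\alpha$ and $\beta$ are free weighting hyper-parameters, and all names here are ours.

```python
# Weighted sum of the frozen LeNet classification loss and the standard
# generator loss against the discriminator output.
import tensorflow as tf

def total_generator_loss(lenet_logits, target_labels, d_output,
                         alpha=0.3, beta=1.0):
    l_lenet = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            target_labels, lenet_logits, from_logits=True))
    l_generator = tf.reduce_mean(
        tf.keras.losses.binary_crossentropy(
            tf.ones_like(d_output), d_output))
    return alpha * l_lenet + beta * l_generator
```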

\begin{figure}
   \centering
   \subfloat[][]{\includegraphics[width=.2\textwidth]{fig/pca-mnist.png}}\quad
   \subfloat[][]{\includegraphics[width=.2\textwidth]{fig/tsne-mnist.png}}\\
   \subfloat[][]{\includegraphics[width=.2\textwidth]{fig/pca-cgan.png}}\quad
   \subfloat[][]{\includegraphics[width=.2\textwidth]{fig/tsne-cgan.png}}\\
   \subfloat[][]{\includegraphics[width=.2\textwidth]{fig/pca-cdc.png}}\quad
   \subfloat[][]{\includegraphics[width=.2\textwidth]{fig/tsne-cdc.png}}
   \caption{Visualisations: a) MNIST PCA; b) MNIST t-SNE; c) CGAN-gen PCA; d) CGAN-gen t-SNE; e) cDCGAN-gen PCA; f) cDCGAN-gen t-SNE}
   \label{fig:features}
\end{figure}

\begin{figure}
   \centering
   \subfloat[][]{\includegraphics[width=.22\textwidth]{fig/pr-mnist.png}}\quad
   \subfloat[][]{\includegraphics[width=.22\textwidth]{fig/pr-cgan.png}}\\
   \subfloat[][]{\includegraphics[width=.22\textwidth]{fig/pr-cdc.png}}
   \caption{Precision-Recall Curves: a) MNIST; b) CGAN output; c) cDCGAN output}
   \label{fig:rocpr}
\end{figure}

# References

<div id="refs"></div>

# Appendix 

## DCGAN-Appendix

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/vanilla_gan_arc.pdf}
\caption{Vanilla GAN Architecture}
\label{fig:vanilla_gan}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/generic_gan_loss.png}
\caption{Shallow GAN D-G Loss}
\label{fig:vanilla_loss}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/short_dcgan_ex.png}
\includegraphics[width=24em]{fig/short_dcgan.png}
\caption{Shallow DCGAN}
\label{fig:dcshort}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/long_dcgan_ex.png}
\includegraphics[width=24em]{fig/long_dcgan.png}
\caption{Deep DCGAN}
\label{fig:dclong}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/vbn_dc.pdf}
\caption{DCGAN Virtual Batch Normalization}
\label{fig:vbn_dc}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/dcgan_dropout01_gd.png}
\caption{DCGAN Dropout 0.1 G-D Losses}
\label{fig:dcdrop1_1}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/dcgan_dropout01.png}
\caption{DCGAN Dropout 0.1 Generated Images}
\label{fig:dcdrop1_2}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/dcgan_dropout05_gd.png}
\caption{DCGAN Dropout 0.5 G-D Losses}
\label{fig:dcdrop2_1}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/dcgan_dropout05.png}
\caption{DCGAN Dropout 0.5 Generated Images}
\label{fig:dcdrop2_2}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=12em]{fig/bal4.png}
\caption{DCGAN Balancing G-D; D/G=3}
\label{fig:baldc}
\end{center}
\end{figure}

## CGAN-Appendix

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/short_cgan_ex.png}
\includegraphics[width=24em]{fig/short_cgan.png}
\caption{Shallow CGAN}
\label{fig:cshort}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/long_cgan_ex.png}
\includegraphics[width=24em]{fig/long_cgan.png}
\caption{Deep CGAN}
\label{fig:clong}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/cgan_dropout01.png}
\caption{CGAN Dropout 0.1 G-D Losses}
\label{fig:cg_drop1_1}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/cgan_dropout01_ex.png}
\caption{CGAN Dropout 0.1 Generated Images}
\label{fig:cg_drop1_2}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/cgan_dropout05.png}
\caption{CGAN Dropout 0.5 G-D Losses}
\label{fig:cg_drop2_1}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/cgan_dropout05_ex.png}
\caption{CGAN Dropout 0.5 Generated Images}
\label{fig:cg_drop2_2}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=18em]{fig/clustcollapse.png}
\caption{cDCGAN $G/D=6$ PCA Embeddings through LeNet (10000 samples per class)}
\label{fig:clustcollapse}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=8em]{fig/cdcsmooth.png}
\caption{cDCGAN+LS outputs 12000 batches}
\label{fig:cdcsmooth}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/smoothing.png}
\caption{CGAN+LS G-D Losses}
\label{fig:smoothgd}
\end{center}
\end{figure}

## cDCGAN Alternative Architecture

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/cdcgen.pdf}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/cdcdesc.pdf}
\end{center}
\end{figure}

## Retrain-Appendix

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/train_few_real.png}
\caption{Training with 550 samples from MNIST only}
\label{fig:few_real}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/fake_only.png}
\caption{Retraining with generated samples only}
\label{fig:fake_only}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=24em]{fig/initialization.png}
\caption{Retraining with initialization from generated samples}
\label{fig:few_init}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=12em]{fig/retrain_fail.png}
\caption{Retraining failures}
\label{fig:retrain_fail}
\end{center}
\end{figure}