author    Vasil Zlatanov <v@skozl.com>    2019-03-22 23:10:59 +0000
committer Vasil Zlatanov <v@skozl.com>    2019-03-22 23:10:59 +0000
commit    0e9a762665554b0ff0d13f088366fd09a25a69a2 (patch)
tree      177220a4aaa35e2bcac1b6adc7f13d8ff14c2d73
parent    5245b5ec835175d390b78f5b201e7ca5fdaaab9b (diff)
Final report (HEAD, master)
-rw-r--r--  report/makefile    2
-rw-r--r--  report/paper.md    82
-rw-r--r--  report/template.latex    6
3 files changed, 52 insertions, 38 deletions
diff --git a/report/makefile b/report/makefile
index ec6b6e8..fcf357b 100644
--- a/report/makefile
+++ b/report/makefile
@@ -12,7 +12,7 @@ FLAGS_PDF = --template=template.latex
pdf:
mkdir -p $(OUTPUT)
- pandoc -o $(OUTPUT)/cw_interim_vz215.pdf $(FLAGS) $(FLAGS_PDF) $(FILES)
+ pandoc -o $(OUTPUT)/cw_deep_learning_vz215.pdf $(FLAGS) $(FLAGS_PDF) $(FILES)
clean:
rm build/*
diff --git a/report/paper.md b/report/paper.md
index 3e55d23..f6f6e5a 100644
--- a/report/paper.md
+++ b/report/paper.md
@@ -10,7 +10,7 @@ The goal is to train a network, which given a patch is able to produce a descrip
# Baseline Model
-The baseline model provided in the given IPython notebook, approaches the problem by using two networks for the task.
+The baseline model approaches the problem with two networks: one for denoising the patches and one for producing descriptors.
## TPU Training
@@ -21,30 +21,30 @@ Direct comparison of performance with GPU is made difficult by the fact that Goo
\begin{table}[h!]
\begin{center}
\begin{tabular}{lrrrr} \hline
-Batch Size & CPU & GPU & TPU & \\ \hline
-64 & 85s & 41s & 43s & \\
-512 & 84s & 40s & 19s & \\
-1024 & 83s & 40s & 10s & \\ \hline
+Batch Size & CPU & GPU & TPU & \\ \hline
+64 & 125s & 41s & 43s & \\
+512 & 124s & 40s & 19s & \\
+1024 & 123s & 40s & 10s & \\
+4096 & 123s & - & 10s & \\ \hline
\end{tabular}
\label{tpu-time}
-\caption{Epoch Training Time on Accelerators}
+\caption{Denoise Epoch Training Time on Accelerators}
\end{center}
\end{table}
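+
+The exact accelerator setup is not visible in this diff; the sketch below shows one way a Keras model may be placed on a TPU using the TensorFlow 2 distribution API. The original notebook likely used the older Colab TPU workflow, and `build_denoise_model`, `noisy_patches` and `clean_patches` are illustrative placeholders.
+
+```python
+import tensorflow as tf
+
+try:
+    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()  # detects a Colab TPU
+    tf.config.experimental_connect_to_cluster(resolver)
+    tf.tpu.experimental.initialize_tpu_system(resolver)
+    strategy = tf.distribute.TPUStrategy(resolver)                  # TF >= 2.3
+except ValueError:
+    strategy = tf.distribute.get_strategy()                         # fall back to CPU/GPU
+
+with strategy.scope():
+    model = build_denoise_model()                                   # placeholder constructor
+    model.compile(optimizer='adam', loss='mae')
+
+# large batches (1024-4096) are needed to keep the eight TPU cores utilised
+model.fit(noisy_patches, clean_patches, batch_size=4096, epochs=20)
+```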
### U-Net Denoising
-A shallow version of the U-Net network [@unet] is used to denoise the noisy patches. The shallow U-Net network has the same output size as the input size, is fed a noisy image and has loss computed as the mean average Euclidean distance (L1 loss) with a clean reference patch. This effectively teaches the U-Net autoencoder to perform a denoising operation on the input images.
-Efficient training can performed with TPU acceleration, a batch size of 4096 and the Adam optimizer with learning rate of 0.001 and is shown on figure \ref{fig:denoise}. Training and validation was performed with all available data.
-The network is able to achieve a mean average error of 5.3 after 19 epochs. With gradient descent we observed a loss of 5.5 after the same number of epochs. There is no observed evidence of overfitting with the shallow net, something which may be expected with a such a shallow network. An example of denoising as performed by the network is visible in figure \ref{fig:den3}.
-Quick experimentation with a deeper version of U-Net shows it is possible to achieve validation loss of below 5.0 after training for 10 epochs, and a equivalent to the shallow loss of 5.3 is achievable after only 3 epochs.
+A shallow version of the U-Net network [@unet] is used to denoise the noisy patches. The shallow U-Net has the same output size as its input size, is fed a noisy image and has its loss computed as the mean absolute error ($L1$ loss) against a clean reference patch. This effectively teaches the U-Net autoencoder to perform a denoising operation on the input images.
+Efficient training can be performed with TPU acceleration, a batch size of 4096 and the Adam optimizer with a learning rate of 0.001, as shown in Appendix figure \ref{fig:denoise}. Training and validation were performed with all available data.
+The network is able to achieve a mean absolute error of 5.3 after 19 epochs. With gradient descent we observed a loss of 5.5 after the same number of epochs. There is no observed evidence of overfitting, which is to be expected with such a shallow network. An example of denoising performed by the network is visible in Appendix figure \ref{fig:den3}.
-![Denoise example - 20th epoch\label{fig:den3}](fig/denoised.png){width=15em height=10em}
-Visualisation of denoising as seen in figure \ref{fig:den3} demonstrates the impressive performance of the denoising model. We are able use a deeper U-Net containing 3 downsampling and 3 upsampling layers, as per the architecture of Olaf Ronneberger et.al [@unet] to achieve 4.5 mean average error loss. We find that at this loss the network converges and we therefore focus our attention to the descriptor model.
+Visualisation of denoising as seen in Appendix figure \ref{fig:den3} demonstrates the impressive performance of the denoising model. We are able to use a deeper U-Net containing 3 downsampling and 3 upsampling layers, shown in Appendix figure \ref{fig:unet}.
+Experimentation shows that with the deeper version of U-Net it is possible to achieve a validation loss below 5.0 after training for 10 epochs, and a loss equivalent to the shallow network's 5.3 is achievable after only 3 epochs. We are ultimately able to achieve a mean absolute error of 4.5. We find that at this loss the network converges, and that minor improvements in the denoising loss do not significantly affect the final evaluation metrics; we therefore focus our attention on the descriptor model.
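+
+A minimal sketch of this kind of shallow U-Net style denoiser with an $L1$ objective is shown below; the layer widths and the 32x32 single-channel patch shape are assumptions rather than the exact architecture used.
+
+```python
+from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, concatenate
+from keras.models import Model
+from keras.optimizers import Adam
+
+inp = Input(shape=(32, 32, 1))
+c1 = Conv2D(16, 3, padding='same', activation='relu')(inp)
+p1 = MaxPooling2D()(c1)                          # single downsampling step
+c2 = Conv2D(32, 3, padding='same', activation='relu')(p1)
+u1 = UpSampling2D()(c2)                          # single upsampling step
+m1 = concatenate([u1, c1])                       # U-Net style skip connection
+out = Conv2D(1, 3, padding='same', activation='linear')(m1)
+
+denoiser = Model(inp, out)
+# L1 loss between the denoised output and the clean reference patch
+denoiser.compile(optimizer=Adam(lr=0.001), loss='mean_absolute_error')
+```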
### L2 Net
-The network used to output the 128 dimension descriptors is a L2-network with triplet loss as defined by CVPR 17 [@l2net]. L2-Net was specifically designed for descriptor output of patches and is a very suitable choice for this task. Training the L2-Net with denoised as per the baseline yields training curves shown in \ref{fig:descriptor}. L2-Net is a robust architecture which has been developed with the HPatches dataset and may be hard to improve upon. The architecture is visualised in fig \ref{fig:l2arch}.
+The network used to output the 128-dimension descriptors is an L2-Net as defined in the CVPR 17 paper [@l2net]. L2-Net was specifically designed to output descriptors for patches and is a very suitable choice for this task. Training the L2-Net with denoised patches as per the baseline yields the training curves shown in figure \ref{fig:descriptor}. The architecture is visualised in figure \ref{fig:l2arch}. L2-Net is a robust architecture which was developed with the HPatches dataset and may be hard to improve upon. We therefore focus the remainder of the paper on improving the triplet loss training technique.
### Triplet Loss
@@ -58,7 +58,7 @@ Where N is the number of triplets in the batch. $D_{a,x}$ is the Euclidean dista
# Hard Triplet Mining
-Training the baseline for a sufficient number of epochs results in loss convergence, that is the validation loss stops improving. The baseline generator creates random valid triplets which are fed to the Siamese network. This results in the network being given too many *easy* triplets, which for a well trained baseline results in a low loss but do not provide a good opportunity for the model to improve on the hard triplets.
+Training the baseline for a sufficient number of epochs results in loss convergence under the static learning rate. The baseline generator creates random valid triplets which are fed to the Siamese network. This results in the network being given too many *easy* triplets, which for a well trained baseline produce a low loss but do not provide a good opportunity for the model to improve on the hard triplets.
One way to boost performance is to run the dataset of patches through a single descriptor model, record the descriptors and generate a subset of hard triplets with which the baseline can be trained. This results in a higher immediate loss and allows the network to optimise the loss for those hard triplets. This can be done once or multiple times; it can also be done on every epoch, or ideally on every batch in the form of *online mining*.
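+
+As an illustration, a single offline mining pass may look as follows; `descriptor_model`, `patches` and `labels` are assumed to exist already, and picking the single hardest positive and negative per sampled anchor is one possible selection strategy.
+
+```python
+import numpy as np
+
+def mine_hard_triplets(descriptor_model, patches, labels, n_triplets=10000):
+    d = descriptor_model.predict(patches)                 # (N, 128) descriptors
+    anchors = np.random.choice(len(patches), n_triplets)
+    triplets = []
+    for a in anchors:
+        pos = np.where(labels == labels[a])[0]            # same-label patches (incl. anchor)
+        neg = np.where(labels != labels[a])[0]
+        # hardest positive: same-label patch furthest from the anchor
+        p = pos[np.argmax(np.linalg.norm(d[pos] - d[a], axis=1))]
+        # hardest negative: different-label patch closest to the anchor
+        n = neg[np.argmin(np.linalg.norm(d[neg] - d[a], axis=1))]
+        triplets.append((a, p, n))
+    return np.array(triplets)
+```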
@@ -70,7 +70,7 @@ Instead of using a three parallel Siamese models, online batch mining uses a sin
We then have a few options for determining our loss. We define batch losses as introduced by Alexander Hermans et al. [@defense], who applied the batch all and batch hard losses to person re-identification. The **batch all** triplet loss is computed as the average of all valid triplets in the distance matrix. This is in a way equivalent to the loss found by the Siamese network in the baseline, but is able to compute the loss as the average of a vastly larger number of triplets.
-Perhaps the more interesting approach is the **batch hard** strategy of computing the loss using the hardest triplets for each anchor. We define the anchor's as each and every input patch which can form a valid triplet, i.e. has a positive with the samle label. For every anchor we find largest positive distance and the smallest negative distance in the batch. The loss for the batch is the mean of the hardest triplets for each anchor.
+Perhaps the more interesting approach is the **batch hard** strategy of computing the loss using the hardest triplets for each anchor. We define the anchors as each and every input patch which can form a valid triplet, i.e. has a positive with the same label. For every anchor we find the largest positive distance and the smallest negative distance in the batch. The loss for the batch is the mean of the hardest triplets for each anchor.
\begin{align}\label{eq:loss_bh}
\loss{BH}(\theta; X) = \overbrace{\sum\limits_{i=1}^{S} \sum\limits_{a=1}^{K}}^{\textnormal{all anchors}}
@@ -82,13 +82,13 @@ Where $S$ is the number of sequences in the batch, $K$ is the number of images i
## Symmetric Batch Formation
-Batch loss presents a series of problems when applied on the HPatches dataset. Implementations of batch triplet loss often use randomly sampled batches. For a dataset like MNIST which has only 10 classes, this is not a problem as is it is very unlikely to have no valid triplets in a batch. In the HPatches dataset, the majority of the image sequences have over 1000 patches, meaning that the probability of having no valid triplets is very high. In situations where there are no valid triplets loss is meaningless and therefore a random batch sequence is unfeasible. Our first experimentation with batch loss failed due to this and required additional work.
+Batch loss presents a series of problems when applied to the HPatches dataset. Implementations of batch triplet loss often use randomly sampled batches. For a dataset like MNIST, which has only 10 classes, this is not a problem as it is very unlikely to have no valid triplets in a batch. In the HPatches dataset the majority of the image sequences have over 1000 patches, meaning that the probability of having no valid triplets is very high. In situations where there are no valid triplets the loss is meaningless, and therefore a randomly sampled batch is infeasible. Our first experimentation with batch loss failed due to this.
-We therefore implemented batches of size $SK$ formed with $S$ number patch sequences containing $K \geq 2$ patches. The anchor positive permutation's are therefor $(K-1)$ possible positives for each anchor, and $(S-1)K)$ negatives for each pair. With a guaranteed total number of $K^2(K-1)(S-1)$ triplets. This has the added benefit of allowing the positive and negative distances masks to be precomputed based on the $S$ and $K$ as the patches are ordered. It should be noted that the difficulty of the batch is highly reliant both $SK$ and $K$. The larger $K$ the more likely it is to have a harder the furthest anchor-positive pair, and the bigger the batch size $SK$ the more likely it is to find a close negative.
+We therefore implemented batches of size $SK$ formed from $S$ patch sequences each containing $K \geq 2$ patches. The anchor-positive permutations then give $(K-1)$ possible positives for each anchor and $(S-1)K$ negatives for each pair, with a guaranteed total of $K^2(K-1)(S-1)$ triplets. This has the added benefit of allowing the positive and negative distance masks to be precomputed from $S$ and $K$, as the patches are ordered. It should be noted that the difficulty of the batch is highly reliant on both $SK$ and $K$: the larger $K$, the more likely it is to have a hard furthest anchor-positive pair, and the bigger the batch size $SK$, the more likely it is to find a close negative.
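+
+The logic of the precomputed masks and the batch hard loss over such an ordered $SK$ batch can be sketched as follows (written with numpy for clarity; the training loss itself would be built from the equivalent `keras.backend` tensor operations).
+
+```python
+import numpy as np
+
+def batch_hard_loss(descriptors, S, K, margin=1.0):
+    labels = np.repeat(np.arange(S), K)                 # patches are ordered by sequence
+    pos_mask = labels[:, None] == labels[None, :]       # same sequence
+    np.fill_diagonal(pos_mask, False)                   # an anchor is not its own positive
+    neg_mask = labels[:, None] != labels[None, :]       # different sequence
+
+    # pairwise Euclidean distance matrix between all SK descriptors
+    diff = descriptors[:, None, :] - descriptors[None, :, :]
+    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
+
+    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)   # furthest positive per anchor
+    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)    # closest negative per anchor
+    return np.maximum(margin + hardest_pos - hardest_neg, 0).mean()
+```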
## Collapse
-Even after the implementation of symmetric batch formation, we were unable to train the descriptor model without having the loss stagnating at the margin, that is $\loss{BH} \approx \alpha$. Upon observation of the descriptors it was evident that the descriptor model producing descriptors with all dimensions approaching zero, regardless of the input. A naive solution we initially attempted was to apply $L2$ normalisation to the descriptors, but that does not avoid collapse as the network learns to output the same descriptor regardless and this further unnecessarily limits the descriptors to the unit hypersphere.
+Even after the implementation of symmetric batch formation, we were unable to train the descriptor model without the loss stagnating at the margin, that is $\loss{BH} \approx \alpha$. Upon observation of the descriptors it was evident that the descriptor model was producing descriptors with all dimensions approaching zero, regardless of the input. A naive solution we initially attempted was to apply $L2$ normalisation to the descriptors, but that does not avoid collapse, as the network learns to output the same descriptor regardless of input, and it further limits the descriptors to the unit hypersphere unnecessarily.
Avoiding collapse of all the descriptors to a single point proved to be a hard task. Upon experimentation, we find that descriptors tend to collapse for large batch sizes with a large $K$. Initial experiments, which eagerly used large batch sizes in order to utilise TPU acceleration, would ubiquitously collapse to the erroneous local minimum of identical descriptors. Avoiding collapse is ultimately solved with a low learning rate and small/easy batches. This significantly limits the advantages of batch hard loss, and makes it extremely slow and hard to train.
@@ -98,11 +98,11 @@ We further find that square Euclidean distance: $D\left(\nnfn(x_i), \nnfn(x_j)\r
We find that for large batch sizes with $K > 3$ avoiding collapse when training the network with randomly initialised weights is extremely difficult for the HPatches dataset, even for low learning rates ($10^{-5}$). We further find that the Adam optimizer collapses even quicker than SGD, reaching collapse in a few epochs. We were eventually able to successfully train small batches with $SK < 32$ by initialising with weights from the baseline model.
-While doing so, we observed that the risk of collapse disappears if we are able to reach a $\loss{BH} < \alpha$. After the loss has dipped under the margin we find that we may increase the learning rate as the collapsed state has a higher loss and as such the optimizer has no incentive to push the network into collapse.
+While doing so, we observed that the risk of collapse disappears if we are able to reach $\loss{BH} < \alpha$. After the loss has dropped under the margin we find that we may increase the learning rate, as the collapsed state has a higher loss than the current one and the optimizer therefore has no incentive to push the network into collapse.
Eventual training with a batch size of 1024 with $K=8$ (the maximum $K$ for HPatches is 16, as not all sequences have more than 16 patches) was achieved by progressively increasing the batch size, starting with weights from the baseline model. We take care to increase the batch size only so much that the loss stays consistently below $\alpha$. This training technique allows the use of the Adam optimizer and higher learning rates and can be done automatically. We call this method *stepped batch hard*, as to our knowledge this technique has not been described in the literature previously.
-Stepping was performed with batch sizes of 32, 64, 128, 256, 512 and 1024. With $K$ starting at 2 and eventually reaching 16. We performed training with the Adam optimizer with a learning rate of $2 \times 10^{-5}$. The learning curve can be seen in figure \ref{stepped} with the jumps in batch size.
+Stepping was performed with batch sizes of 32, 64, 128, 256, 512 and 1024, with $K$ starting at 2 and eventually reaching 8. We performed training with the Adam optimizer with a learning rate of $2 \times 10^{-4}$. The learning curve can be seen in figure \ref{stepped}, with the jumps in batch size visible.
![Stepped Batch Training \label{stepped}](fig/stepped_batch.pdf){width=20em height=10em}
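+
+A sketch of the stepped schedule is given below; `descriptor_model`, `make_generator` and `batch_hard_keras_loss` are illustrative placeholders for the actual pipeline, and the margin value is an assumption.
+
+```python
+from keras.optimizers import Adam
+
+ALPHA = 1.0                                            # assumed margin
+schedule = [(16, 2), (32, 2), (64, 2), (64, 4), (128, 4), (128, 8)]   # (S, K): SK = 32 ... 1024
+
+descriptor_model.compile(optimizer=Adam(lr=2e-4), loss=batch_hard_keras_loss)
+for S, K in schedule:
+    gen = make_generator(S, K)                         # yields ordered S*K batches
+    hist = descriptor_model.fit_generator(gen, steps_per_epoch=100, epochs=5)
+    # only step up to a larger, harder batch while the loss stays safely below the margin;
+    # otherwise stop stepping to avoid pushing the network towards collapse
+    if hist.history['loss'][-1] >= ALPHA:
+        break
+```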
@@ -110,28 +110,28 @@ Stepping was performed with batch sizes of 32, 64, 128, 256, 512 and 1024. With
One of the properties of triplet loss as implemented with the margin based hinge loss function $\left[m + \bullet\right]_+$ is that it avoids optimisation based on already submarginal "easy" triplets, which have their loss set to zero. However it is possible to replace the loss function such that it still produces a small amount of loss for these correct triplets, so that they can be pulled closer together. We use the `keras.backend.softplus` function, which implements an exponentially decaying loss below zero instead of a hard cut-off. A. Hermans et al. [@defense] refer to this as the soft margin formulation and implement it as $\ln(1 + \exp(\bullet))$, which is identical to Keras' softplus function.
-It is important to note that the point of collapse when using soft margin formulation with $\alpha = 0$ is approximately $0.69$, as it per our training strategy is desirable to progressively increase batch size, such that it the loss stays beyond point of collapse.
+It is important to note that the point of collapse when using the soft margin formulation with $\alpha = 0$ is approximately $0.69$ (i.e. $\ln 2$, the softplus value at zero); as per our training strategy, it is desirable to progressively increase the batch size such that the loss stays below this point of collapse.
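+
+The two formulations differ only in the outer function applied to the difference between the anchor-positive and anchor-negative distances, for example:
+
+```python
+import keras.backend as K
+
+def hard_margin_term(x, margin=1.0):
+    # x = d(anchor, positive) - d(anchor, negative)
+    return K.maximum(margin + x, 0.0)      # [m + x]_+ : easy triplets contribute zero loss
+
+def soft_margin_term(x):
+    return K.softplus(x)                   # ln(1 + exp(x)): small loss even when x < 0
+
+# softplus(0) = ln(2) ~= 0.693, which is why the collapsed state (all distances
+# equal, x = 0) sits at a loss of roughly 0.69.
+```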
## Feature Augmentation
-We implement feature augmentation through random flips and rotations using numpy's `np.flip` and `np.rot90` functions. Nevertheless, we make a case that random feature augmentation is detrimental for the HPatches dataset as patch sequences. An example sequence on figure \ref{augment} shows patches carry similar positional appearance across the same sequence, which is nullified by random flips and rotations. Experimentally we observe a near doubling of the loss, which reduces maximum batch hard size which fits below the collapse threshold and ultimately results in lower performance.
+We implement feature augmentation through random flips and rotations using numpy's `np.flip` and `np.rot90` functions. Nevertheless, we make the case that random feature augmentation is detrimental for the HPatches dataset. An example sequence in figure \ref{augment} shows that patches carry valuable positional information across the same sequence, which is destroyed by random flips and rotations. Experimentally we observe a near doubling of the loss, which reduces the maximum batch hard size that fits below the collapse threshold and ultimately results in lower performance.
We presume that this may be solved by flipping and rotating anchor-positive pairs in the same way within the batch, such that positional similarity between anchor and positive is preserved (as sketched below), but we did not have the opportunity to test this idea.
![Non-Augmented Sequence\label{augment}](fig/augmentation.png){width=17em height=10em}
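+
+A sketch of such sequence-consistent augmentation is given below (untested); `batch` is assumed to be an ordered $SK$ array of square patches.
+
+```python
+import numpy as np
+
+def augment_sequence_consistent(batch, S, K):
+    patches = batch.reshape(S, K, *batch.shape[1:]).copy()   # group patches by sequence
+    for s in range(S):
+        k_rot = np.random.randint(4)                         # 0-3 quarter turns
+        patches[s] = np.rot90(patches[s], k=k_rot, axes=(1, 2))
+        if np.random.rand() < 0.5:
+            patches[s] = np.flip(patches[s], axis=2)         # horizontal flip
+    return patches.reshape(batch.shape)                      # same transform for anchor and positives
+```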
# Experimental Results
-The baseline is trained until convergence with default SGD optimizer with a learning rate of 0.1. There is potential performance gains to made by implementing learning rate decay. Soft margin loss is tested with $\alpha = 0$. Batch hard training is performed with the Adam optimizer with a learning rate of $2 \times 10^{-4}$ in steps of 64, 128, 386 and 1024 with $K$ stepped through 2, 4, 6 and 8. Increasing batch size to 2048 or using $K=16$ results in training loss higher than collapse loss.
+The baseline is trained until convergence with the default SGD optimizer with a learning rate of 0.1. There may be potential performance gains to be made by implementing learning rate decay and training for longer, but we focus our attention on improvements through extended hard triplet mining. Soft margin loss is tested with $\alpha = 0$. Batch hard training is performed with the Adam optimizer with a learning rate of $2 \times 10^{-4}$ in steps of 64, 128, 386 and 1024, with $K$ stepped through 2, 4, 6 and 8. Increasing the batch size to 2048 or using $K=16$ results in a training loss higher than the collapse loss.
\begin{table}[h!]
\begin{center}
\begin{tabular}{lrrr} \hline
-Training Method & Verification & Matching & Retrieval \\ \hline
-Baseline & 0.821 & 0.249 & 0.544 \\
-Soft Baseline & 0.853 & 0.261 & 0.558 \\
-Batch Hard 128 & 0.857 & 0.312 & 0.618 \\
-Batch Hard 1024 & 0.863 & 0.342 & 0.625 \\
-Soft Batch Hard 1024 & 0.873 & 0.356 & 0.645 \\ \hline
+Training Method & Verif. & Match. & Retriev. \\ \hline
+Baseline & 0.821 & 0.249 & 0.544 \\
+Soft Baseline + Deep U-Net & 0.853 & 0.261 & 0.558 \\
+Batch Hard 128 + Deep U & 0.857 & 0.312 & 0.618 \\
+Batch Hard 1024 + Deep U & 0.863 & 0.342 & 0.625 \\
+Soft B. Hard 1024 + Deep U & 0.873 & 0.356 & 0.645 \\ \hline
\end{tabular}
\label{results}
\caption{mAP for various training techniques}
@@ -140,21 +140,31 @@ Soft Batch Hard 1024 & 0.873 & 0.356 & 0.645 \\ \hline
# Visualisation
-We may leverage visualisation of the descriptors feature to identify if descriptors sequences are seperated as we expect them to be. Figure \ref{tsne} visualises the descriptor embeddings with t-SNE in 2 dimensions for as single batch with size 2048 containing 128 sequences with 16 patches each as trained with batch hard. For a collapsed network all descriptors appear clustered at the same point, while for a well training network we see them seperate appart.
+We may leverage visualisation of the descriptor features to identify whether descriptor sequences are separated as we expect them to be. Figure \ref{tsne} visualises the descriptor embeddings with t-SNE in 2 dimensions for a single batch of size 2048, containing 128 sequences with 16 patches each, as trained with batch hard. For a collapsed network all descriptors appear clustered at the same point, while for a well trained network we see them separate into distinct clusters. Visualisation during the training of the network is useful for detecting a collapsing network, as the collapse is not immediately evident when directly observing the loss.
![2D Descriptor Visualisation with t-SNE (S=128;K=16)\label{tsne}](fig/tsne.pdf){width=18em height=12em}
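+
+The visualisation itself can be produced with scikit-learn's t-SNE implementation, for example as below; `descriptors` and `labels` are assumed to hold one $SK$ batch of 128-dimensional descriptors and their sequence indices.
+
+```python
+import matplotlib.pyplot as plt
+from sklearn.manifold import TSNE
+
+emb = TSNE(n_components=2, perplexity=30).fit_transform(descriptors)   # (S*K, 2)
+plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=3, cmap='tab20')
+plt.title('Descriptor embeddings (a collapsed network shows a single blob)')
+plt.savefig('fig/tsne.pdf')
+```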
+
+# Conclusion
+
+We find that we are able to use batch hard loss to significantly improve performance over the baseline training method, specifically with regard to matching and retrieval, the more interesting metrics which distinguish triplet loss from contemporary classification and verification losses. We find that batch hard training strategies may be difficult to use and introduce *stepped batch training*, a method of training both soft and hard margin triplet losses which reduces the risk of single point collapse.
+
+# References
+
+<div id="refs"></div>
+
# Appendix
-![Sequence from the HPatches dataset\label{sequence}](fig/sequence.png){width=15em height=10em}
-![U-Net Training with TPU\label{fig:denoise}](fig/denoise.pdf){width=20em height=15em}
+![Denoise example - 20th epoch\label{fig:den3}](fig/denoised.png){width=20em height=15em}
-![L2-Net\label{fig:descriptor}](fig/descriptor.pdf){width=20em height=15em}
+![Sequence from the HPatches dataset\label{sequence}](fig/sequence.png){width=20em height=15em}
+![Deeper U-Net Architecture\label{fig:unet}](fig/unet.pdf){width=20em height=15em}
-![L2-Net Architecture\label{fig:l2arch} \[@l2net\]](fig/l2net.png){width=15em height=15em}
-\newpage
+![U-Net Training with TPU\label{fig:denoise}](fig/denoise.pdf){width=20em height=12em}
-# References
+![L2-Net\label{fig:descriptor}](fig/descriptor.pdf){width=20em height=15em}
+
+![L2-Net Architecture\label{fig:l2arch} \[@l2net\]](fig/l2net.png){width=15em height=25em}
diff --git a/report/template.latex b/report/template.latex
index 232f8b3..8d27aef 100644
--- a/report/template.latex
+++ b/report/template.latex
@@ -47,7 +47,7 @@
%%%%%%%%% ABSTRACT
\begin{abstract}
- Abstract - In this paper we investigate triplet loss based methods for verification, matching and retrieval of the HPatches dataset. We explore a system of two models, one for denoising of the patches and one for generation of descriptors. We are able to show that a model trained with online hard patch mining, while more difficult, is able to achieve improved performance over a classical Siamese triplet network.
+ Abstract - In this paper we investigate triplet loss based methods for verification, matching and retrieval on the HPatches dataset. We explore a system of two models: one for denoising of the patches and one for generation of descriptors. We show that a model utilising online hard mining, while more difficult to train, is able to achieve improved performance over a classical Siamese triplet network.
\end{abstract}
\providecommand{\tightlist}{%
@@ -77,6 +77,10 @@ $endif$
\newcommand{\nnfn}{f_\theta}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert} % Thanks http://tex.stackexchange.com/a/107190
+\makeatletter
+\def\fps@figure{h}
+\makeatother
+
$body$
$if(natbib)$