author Vasil Zlatanov <v@skozl.com> 2019-03-22 17:43:54 +0000
committer Vasil Zlatanov <v@skozl.com> 2019-03-22 17:43:54 +0000
commit 5245b5ec835175d390b78f5b201e7ca5fdaaab9b
tree a6cbe85e05a1f10dc56e431382e05e54e1fc4dcf
parent f38dcc474ce7b97f37b5a226ad66043199402b0f
Pre final
-rw-r--r-- report/paper.md 33
-rw-r--r-- report/template.latex 2
2 files changed, 18 insertions, 17 deletions
diff --git a/report/paper.md b/report/paper.md
index fffecc2..3e55d23 100644
--- a/report/paper.md
+++ b/report/paper.md
@@ -1,8 +1,6 @@
# Problem Definition
-This coursework's goal is to develop an image representation of patches from the `HPatches` dataset for the purpose of matching and retrieval, where images with the same features (and classes) are nearby in the reduced Euclidean space of the descriptors, while dissimilar ones are far apart. The dataset contains patches sampled from image sequences taken from the same scenes. For each image sequence there is a reference image, and two more files `eX.png` and `hX.png` containing corresponding patches from the images in the sequence with altered illumination (`i_X`) or viewpoint (`v_X`). Furthemore corresponding patches are extracted with geometric noise, easy `e_X` have a small amount of jitter while `h_X` patches have a larger amount [@patches]. The patches as processed by our networks are monochrome 32 by 32 images. An series eights images from the same sequence are shown on figure \ref{sequence}.
-
-![Sequence from the HPatches dataset\label{sequence}](fig/sequence.png){width=20em height=12em}
+This coursework's goal is to develop an image representation of patches from the `HPatches` dataset for the purpose of matching and retrieval, where images with the same features (and classes) are nearby in the reduced Euclidean space of the descriptors, while dissimilar ones are far apart. The dataset contains patches sampled from image sequences taken from the same scenes. For each image sequence there is a reference image, and two more files `eX.png` and `hX.png` containing corresponding patches from the images in the sequence with altered illumination (`i_X`) or viewpoint (`v_X`). Furthermore, corresponding patches are extracted with geometric noise: easy `e_X` patches have a small amount of jitter, while hard `h_X` patches have a larger amount [@patches]. The patches as processed by our networks are monochrome 32 by 32 images. A series of eight images from the same sequence is shown in figure \ref{sequence}.
The goal is to train a network which, given a patch, is able to produce a descriptor vector with a dimension of 128. The descriptors are evaluated based on their performance across three tasks:
@@ -16,7 +14,7 @@ The baseline model provided in the given IPython notebook, approaches the proble
## TPU Training
-Training with Google's proprietary TPU's is used for this paper as it can be done with some a few modifications of the baseline. Specifically it is necessary to rewrite the `Upsampling2D` layer using a `Lambda` layer, replace python's `keras` with `tensorflow.keras` and replace the sequence generators with those in the `tensorflow` library. Furthemore the TPU's do not support the placeholder op necessary for the implementation of the baseline siamese triplet network, but is compatible with the online batch (hard) triplet loss implemented and presented later in this paper.
+Training with Google's proprietary TPUs is used for this paper as it can be done with a few modifications of the baseline. Specifically, it is necessary to rewrite the `UpSampling2D` layer using a `Lambda` layer, replace Python's `keras` with `tensorflow.keras` and replace the sequence generators with those in the `tensorflow` library. Furthermore, the TPUs do not support the placeholder op necessary for the implementation of the baseline Siamese triplet network, but they are compatible with the online batch (hard) triplet loss implemented and presented later in this paper.
Direct comparison of performance with GPU is made difficult by the fact that Google's TPUs show the biggest performance increase for large batch sizes, which do not fit in GPU memory. Furthermore, TPUs incur extra time for model compilation and copying of the weights. Nevertheless, approximate performance times for the denoise network are presented in table \ref{tpu-time}.
@@ -38,9 +36,9 @@ Batch Size & CPU & GPU & TPU & \\ \hline
A shallow version of the U-Net network [@unet] is used to denoise the noisy patches. The shallow U-Net network has the same output size as the input size, is fed a noisy image and has its loss computed as the mean absolute error (L1 loss) with respect to a clean reference patch. This effectively teaches the U-Net autoencoder to perform a denoising operation on the input images.
Efficient training can be performed with TPU acceleration, a batch size of 4096 and the Adam optimizer with a learning rate of 0.001, as shown in figure \ref{fig:denoise}. Training and validation were performed with all available data.
The network is able to achieve a mean absolute error of 5.3 after 19 epochs. With gradient descent we observed a loss of 5.5 after the same number of epochs. There is no observed evidence of overfitting with the shallow net, something which may be expected with such a shallow network. An example of denoising as performed by the network is visible in figure \ref{fig:den3}.
-Quick experimentation with a deeper version of U-Net shows it is possible to achieve validation loss of below 5.0 after training for 10 epochs, and a equivalent to the shallow loss of 5.3 is achievable aftery only 3 epochs.
+Quick experimentation with a deeper version of U-Net shows it is possible to achieve a validation loss below 5.0 after training for 10 epochs, and a loss equivalent to the shallow network's 5.3 is achievable after only 3 epochs.
-![Denoise example - 20th epoch\label{fig:den3}](fig/denoised.png){width=20em height=15em}
+![Denoise example - 20th epoch\label{fig:den3}](fig/denoised.png){width=15em height=10em}
Visualisation of denoising as seen in figure \ref{fig:den3} demonstrates the impressive performance of the denoising model. We are able to use a deeper U-Net containing 3 downsampling and 3 upsampling layers, as per the architecture of Ronneberger et al. [@unet], to achieve a mean absolute error loss of 4.5. We find that at this loss the network converges and we therefore focus our attention on the descriptor model.
@@ -60,7 +58,7 @@ Where N is the number of triplets in the batch. $D_{a,x}$ is the Euclidean dista
# Hard Triplet Mining
-Training the baseline for a sufficient number of epochs results in loss convergence, that is the validation loss stops improving. The baseline generator creates random valid triplets which are fed to the siamese network. This results in the network being given too many *easy* triplets, which for a well trained baseline results in a low loss but do not provide a good opportunity for the model to improve on the hard triplets.
+Training the baseline for a sufficient number of epochs results in loss convergence, that is, the validation loss stops improving. The baseline generator creates random valid triplets which are fed to the Siamese network. This results in the network being given too many *easy* triplets, which for a well-trained baseline result in a low loss but do not provide a good opportunity for the model to improve on the hard triplets.
One way to boost performance is to run the dataset of patches through a single descriptor model, record the descriptors and generate a subset of hard triplets with which the baseline can be trained. This will result in a higher immediate loss and will allow the network to optimise the loss for those hard triplets. This can be done once or multiple times. It can also be done on every epoch, or ideally on every batch in the form of *online mining*.
@@ -68,9 +66,9 @@ One way to boost performance is to run the dataset of patches through a singular
Online mining was first introduced by Google in their FaceNet paper [@facenet] and further evaluated by the Szkocka Research Group in their HardNet paper [@hardnet], which utilises the padded L2-Net network as implemented in the baseline.
-Instead of using a three parallel siamese models, online batch mining uses a single model which is given a batch of $n$ patches and produces $n$ descriptors. The single model has the added advantage of being directly TPU compatible in our Keras implementation. The loss function calculates the pairwise Euclidean distance between the descriptors creating a $n^2$ matrix. We mask this matrix with the labels of the patches (based on the sequence they were extracted from), to find all valid posive and negative distances.
+Instead of using three parallel Siamese models, online batch mining uses a single model which is given a batch of $n$ patches and produces $n$ descriptors. The single model has the added advantage of being directly TPU compatible in our Keras implementation. The loss function calculates the pairwise Euclidean distance between the descriptors, creating an $n \times n$ matrix. We mask this matrix with the labels of the patches (based on the sequence they were extracted from) to find all valid positive and negative distances.
-We then have then have a few options for determining our loss. We define batch losses as introduced by Alexander Hermans et. al. [@defense] who applied the batch all and batch hard losses to person re-identification. The **batch all** triplet is computed as the average of all valid triplets in the distance matrix. This is in a way equivalent to loss as found by the siamese network in the baseline, but is able to compute the loss as the average of a vastly larger number of triplets.
+We then have a few options for determining our loss. We define batch losses as introduced by Alexander Hermans et al. [@defense], who applied the batch all and batch hard losses to person re-identification. The **batch all** triplet loss is computed as the average over all valid triplets in the distance matrix. This is in a way equivalent to the loss as found by the Siamese network in the baseline, but is able to compute the loss as the average of a vastly larger number of triplets.
Perhaps the more interesting approach is the **batch hard** strategy of computing the loss using the hardest triplets for each anchor. We define the anchors as each and every input patch which can form a valid triplet, i.e. has a positive with the same label. For every anchor we find the largest positive distance and the smallest negative distance in the batch. The loss for the batch is the mean of the hardest triplets for each anchor.
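The batch hard strategy described above can be sketched in plain NumPy. This is a minimal illustration and not the Keras implementation used for the experiments: the function names, the hinge form of the loss and the `margin` value of 1.0 are assumptions made for the example.

```python
import numpy as np


def pairwise_distances(descriptors):
    """Euclidean distance between every pair of rows, shape (n, n)."""
    diff = descriptors[:, None, :] - descriptors[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))


def batch_hard_triplet_loss(descriptors, labels, margin=1.0):
    """Mean hinge loss over the hardest triplet of each valid anchor.

    `margin` is an assumed hyperparameter for this sketch.
    """
    dist = pairwise_distances(descriptors)
    same = labels[:, None] == labels[None, :]
    # A positive shares the anchor's label but is not the anchor itself.
    pos_mask = same & ~np.eye(len(labels), dtype=bool)
    neg_mask = ~same

    # Hardest positive: furthest patch with the same label.
    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    # Hardest negative: closest patch with a different label.
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)

    # Only anchors with at least one positive and one negative count.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    if not valid.any():
        return 0.0
    losses = np.maximum(hardest_pos[valid] - hardest_neg[valid] + margin, 0.0)
    return float(losses.mean())


# Four one-dimensional descriptors from two sequences.
descriptors = np.array([[0.0], [1.0], [3.0], [7.0]])
labels = np.array([0, 0, 1, 1])
loss = batch_hard_triplet_loss(descriptors, labels)
```

In this toy batch only the anchor at 3.0 violates the margin: its furthest positive is at distance 4 and its closest negative at distance 2, so it alone contributes to the mean.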
@@ -84,9 +82,9 @@ Where $S$ is the number of sequences in the batch, $K$ is the number of images i
## Symmetric Batch Formation
-Batch loss presents a series of problems when applied on the HPatches dataset. Implementations of batch triplet loss often use randomly sampled batches. For a dataset like MNIST which has only 10 classes, this is not a problem as is it is very unlikely to have no valid triplets in a batch. In the HPatches dataset, the majority of the image sequences have over 1000 patches, meaning that the probability of having no valid triplets is very high. In situations where there are no valid triplets loss is meaningless andtherefore a random batch sequence is unfeasible. Our first experimentation with batch loss failed due to this and required additional work.
+Batch loss presents a series of problems when applied to the HPatches dataset. Implementations of batch triplet loss often use randomly sampled batches. For a dataset like MNIST which has only 10 classes, this is not a problem as it is very unlikely to have no valid triplets in a batch. In the HPatches dataset, the majority of the image sequences have over 1000 patches, meaning that the probability of having no valid triplets is very high. In situations where there are no valid triplets, the loss is meaningless, and therefore a random batch sequence is infeasible. Our first experimentation with batch loss failed due to this and required additional work.
-We therefore implemented batches of size $SK$ formed with $S$ number patch sequences containg $K \geq 2$ patches. The anchor positive permutation's are therefor $(K-1)$ possible positives for each anchor, and $(S-1)K)$ negatives for each pair. With a guaranteed total number of $K^2(K-1)(S-1)$ triplets. This has the added benefit of allowing the positive and negative distances masks to be precomputed based on the $S$ and $K$ as the patches are ordered. It should be noted that the difficulty of the batch is highly reliant both $SK$ and $K$. The larger $K$ the more likely it is to have a harder the furthest anchor-positive pair, and the bigger the batch size $SK$ the more likely it is to find a close negative.
+We therefore implemented batches of size $SK$ formed from $S$ patch sequences, each containing $K \geq 2$ patches. There are therefore $(K-1)$ possible positives for each anchor and $(S-1)K$ negatives for each anchor-positive pair, giving a guaranteed total of $SK^2(K-1)(S-1)$ triplets per batch. This has the added benefit of allowing the positive and negative distance masks to be precomputed based on $S$ and $K$, as the patches are ordered. It should be noted that the difficulty of the batch is highly reliant on both $SK$ and $K$. The larger $K$, the more likely it is that the furthest anchor-positive pair is harder, and the bigger the batch size $SK$, the more likely it is to find a close negative.
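Because the patches in a symmetric batch are ordered by sequence, the masks depend only on $S$ and $K$. The following NumPy sketch shows one way to precompute them and to count the triplets by enumeration. The function names and the example values of `S` and `K` are illustrative assumptions, not taken from the project code.

```python
import numpy as np


def symmetric_batch_masks(S, K):
    """Positive and negative masks for a batch of S sequences with K patches each.

    Assumes the batch is ordered so that patches s*K .. s*K + K - 1
    all come from sequence s.
    """
    labels = np.repeat(np.arange(S), K)
    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(S * K, dtype=bool)  # same sequence, not the anchor itself
    neg_mask = ~same                              # any patch from another sequence
    return pos_mask, neg_mask


def count_triplets(pos_mask, neg_mask):
    """Each anchor contributes (number of positives) * (number of negatives)."""
    return int((pos_mask.sum(axis=1) * neg_mask.sum(axis=1)).sum())


S, K = 4, 3  # small example values
pos_mask, neg_mask = symmetric_batch_masks(S, K)
n_triplets = count_triplets(pos_mask, neg_mask)
```

Every anchor sees $K-1$ positives and $(S-1)K$ negatives, so the count over all $SK$ anchors is the product $SK \cdot (K-1) \cdot (S-1)K$, which the enumeration reproduces.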
## Collapse
@@ -106,7 +104,7 @@ Eventual training with batch size of of 1028 with $K=8$ (the maximum K for HPatc
Stepping was performed with batch sizes of 32, 64, 128, 256, 512 and 1024, with $K$ starting at 2 and eventually reaching 16. We performed training with the Adam optimizer with a learning rate of $2 \times 10^{-5}$. The learning curve can be seen in figure \ref{stepped} with the jumps in batch size.
-![Stepped Batch Training \label{stepped}](fig/augmentation.png){width=20em height=12em}
+![Stepped Batch Training \label{stepped}](fig/stepped_batch.pdf){width=20em height=10em}
## Soft Margin Loss
@@ -116,13 +114,14 @@ It is important to note that the point of collapse when using soft margin formul
## Feature Augmentation
-We implement feature augmentation through random flips and rotations using numpy's `np.flip` and `np.rot90` functions. Nevertheless, we make a case that feature augmentation is detrimental for the HPatches dataset as patch sequences. An example sequence on figure \ref{augment} shows patches carry similar positional appearance across the same sequence, which is nullified by random flips and rotations. Experimentally we observe a near doubling of the loss, which reduces maximum batch hard size which fits below the collapse threshold and ultimately results in lower performance.
+We implement feature augmentation through random flips and rotations using NumPy's `np.flip` and `np.rot90` functions. Nevertheless, we make a case that random feature augmentation is detrimental for the HPatches dataset, as it is organised in patch sequences. An example sequence in figure \ref{augment} shows that patches carry a similar positional appearance across the same sequence, which is nullified by random flips and rotations. Experimentally we observe a near doubling of the loss, which reduces the maximum batch hard size that fits below the collapse threshold and ultimately results in lower performance.
+We presume that this may be solved by flipping and rotating anchor-positive pairs in the same way within the batch, such that positional similarity between anchor and positive is preserved, but we did not have the opportunity to test this idea.
-![Non-Augmented Sequence\label{augment}](fig/augmentation.png){width=20em height=12em}
+![Non-Augmented Sequence\label{augment}](fig/augmentation.png){width=17em height=10em}
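The untested idea of transforming anchors and positives in the same way could look roughly like the sketch below. It is a hypothetical illustration only: the function name `augment_per_sequence` and the `(S, K, H, W)` batch layout are assumptions, and no result in this report was obtained with it.

```python
import numpy as np


def augment_per_sequence(batch, rng):
    """Apply one random rotation and flip per sequence, shared by all its patches.

    batch: array of shape (S, K, H, W) with square patches (H == W).
    rng:   a numpy.random.Generator.
    """
    out = np.empty_like(batch)
    for s in range(batch.shape[0]):
        k = int(rng.integers(0, 4))   # number of 90 degree rotations
        flip = rng.random() < 0.5     # whether to mirror horizontally
        patches = np.rot90(batch[s], k=k, axes=(1, 2))
        if flip:
            patches = np.flip(patches, axis=2)
        out[s] = patches
    return out


rng = np.random.default_rng(0)
# Toy batch: 3 sequences of 4 identical 32 by 32 patches each.
base = rng.random((3, 1, 32, 32))
batch = np.repeat(base, 4, axis=1)
augmented = augment_per_sequence(batch, rng)
```

Since every patch of a sequence receives the same transform, patches that were aligned before augmentation remain aligned afterwards, while different sequences are still transformed independently.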
# Experimental Results
-** Give more specific learning rates, and results ! ***
+The baseline is trained until convergence with the default SGD optimizer with a learning rate of 0.1. There are potential performance gains to be made by implementing learning rate decay. Soft margin loss is tested with $\alpha = 0$. Batch hard training is performed with the Adam optimizer with a learning rate of $2 \times 10^{-4}$ in steps of 64, 128, 386 and 1024 with $K$ stepped through 2, 4, 6 and 8. Increasing batch size to 2048 or using $K=16$ results in a training loss higher than the collapse loss.
\begin{table}[h!]
\begin{center}
@@ -143,10 +142,12 @@ Soft Batch Hard 1024 & 0.873 & 0.356 & 0.645 \\ \hline
We may leverage visualisation of the descriptor features to identify whether descriptor sequences are separated as we expect them to be. Figure \ref{tsne} visualises the descriptor embeddings with t-SNE in 2 dimensions for a single batch of size 2048 containing 128 sequences with 16 patches each, as trained with batch hard. For a collapsed network all descriptors appear clustered at the same point, while for a well-trained network we see them separate.
-![2D Descriptor Visualisation with t-SNE (S=128;K=16)\label{tsne}](fig/tsne.pdf){width=20em height=15em}
+![2D Descriptor Visualisation with t-SNE (S=128;K=16)\label{tsne}](fig/tsne.pdf){width=18em height=12em}
# Appendix
+![Sequence from the HPatches dataset\label{sequence}](fig/sequence.png){width=15em height=10em}
+
![U-Net Training with TPU\label{fig:denoise}](fig/denoise.pdf){width=20em height=15em}
![L2-Net\label{fig:descriptor}](fig/descriptor.pdf){width=20em height=15em}
diff --git a/report/template.latex b/report/template.latex
index da5bdd2..232f8b3 100644
--- a/report/template.latex
+++ b/report/template.latex
@@ -47,7 +47,7 @@
%%%%%%%%% ABSTRACT
\begin{abstract}
- Abstract - ...
+ Abstract - In this paper we investigate triplet loss based methods for verification, matching and retrieval of the HPatches dataset. We explore a system of two models, one for denoising of the patches and one for generation of descriptors. We are able to show that a model trained with online hard patch mining, while more difficult, is able to achieve improved performance over a classical Siamese triplet network.
\end{abstract}
\providecommand{\tightlist}{%