author    Vasil Zlatanov <v@skozl.com>    2019-03-21 20:00:10 +0000
committer Vasil Zlatanov <v@skozl.com>    2019-03-21 20:00:10 +0000
commit    817b2024e87c586b13208df23636c4b7955a7c87 (patch)
tree      c6c0d487366ae3131ec79c9fdca8a8fa4278a0d4
parent    8353c32a764ae1be8683530ecb600446bf171c42 (diff)
Not bad
-rw-r--r--  report/paper.md  |  23
1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/report/paper.md b/report/paper.md
index dd86649..688208b 100644
--- a/report/paper.md
+++ b/report/paper.md
@@ -1,6 +1,8 @@
# Problem Definition
-This coursework's goal is to develop an image representation of patches from the `HPatches` dataset for the purpose of matching and retrieval, where images with the same features (and classes) are nearby in the reduced Euclidean space, while dissimilar ones are far away. The dataset contains patches sampled from image sequences taken from the same scenes. For each image sequence there is a reference image, and two more files `eX.png` and `hX.png` containing corresponding patches from the images in the sequence with altered illumination (`i_X`) or viewpoint (`v_X`). Furthermore, corresponding patches are extracted with geometric noise: easy `e_X` patches have a small amount of jitter while hard `h_X` patches have a larger amount [@patches]. The patches as processed by our networks are monochrome 32 by 32 images.
+This coursework's goal is to develop an image representation of patches from the `HPatches` dataset for the purpose of matching and retrieval, where images with the same features (and classes) are nearby in the reduced Euclidean space, while dissimilar ones are far away. The dataset contains patches sampled from image sequences taken from the same scenes. For each image sequence there is a reference image, and two more files `eX.png` and `hX.png` containing corresponding patches from the images in the sequence with altered illumination (`i_X`) or viewpoint (`v_X`). Furthermore, corresponding patches are extracted with geometric noise: easy `e_X` patches have a small amount of jitter while hard `h_X` patches have a larger amount [@patches]. The patches as processed by our networks are monochrome 32 by 32 images. A series of eight images from the same sequence is shown in figure \ref{sequence}.
+
+![Sequence from the HPatches dataset\label{sequence}](fig/sequence.png){width=20em height=15em}
The goal is to train a network which, given a patch, is able to produce a descriptor vector with a dimension of 128. The descriptors are evaluated based on their performance across three tasks:
@@ -31,13 +33,17 @@ Batch Size & CPU & GPU & TPU & \\ \hline
\end{center}
\end{table}
-### Shallow U-Net
+### U-Net Denoising
A shallow version of the U-Net network is used to denoise the noisy patches. The shallow U-Net has the same output size as its input size, is fed a noisy image, and has its loss computed as the Euclidean distance to a clean reference patch. This effectively teaches the U-Net autoencoder to perform a denoising operation on the input images.
Efficient training can be performed with TPU acceleration, a batch size of 4096 and the Adam optimizer with a learning rate of 0.001, as shown in figure \ref{fig:denoise}. Training and validation were performed with **all** available data.
The network is able to achieve a mean absolute error of 5.3 after 19 epochs. With gradient descent we observed a loss of 5.5 after the same number of epochs. There is no observed evidence of overfitting with the shallow net, something which may be expected with such a shallow network. An example of denoising as performed by the network is visible in figure \ref{fig:den3}.
Quick experimentation with a deeper version of U-Net shows it is possible to achieve a validation loss below 5.0 after training for 10 epochs, and a loss equivalent to the shallow network's 5.3 is achievable after only 3 epochs.
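
A minimal Keras sketch of such a shallow U-Net denoiser is given below; the layer widths and the single down/up stage are illustrative assumptions, not the exact architecture used.

```python
from tensorflow.keras import layers, models, optimizers

def shallow_unet(input_shape=(32, 32, 1)):
    inp = layers.Input(shape=input_shape)
    # Encoder: a single downsampling stage
    c1 = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
    p1 = layers.MaxPooling2D()(c1)
    # Bottleneck
    b = layers.Conv2D(64, 3, activation="relu", padding="same")(p1)
    # Decoder: upsample and concatenate the skip connection
    u1 = layers.concatenate([layers.UpSampling2D()(b), c1])
    c2 = layers.Conv2D(32, 3, activation="relu", padding="same")(u1)
    out = layers.Conv2D(1, 3, padding="same")(c2)
    model = models.Model(inp, out)
    # Mean absolute error against the clean reference patch,
    # Adam with the stated 0.001 learning rate
    model.compile(optimizer=optimizers.Adam(1e-3), loss="mae")
    return model

# model = shallow_unet()
# model.fit(noisy_patches, clean_patches, batch_size=4096, epochs=19)
```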
+![Denoise example - 20th epoch\label{fig:den3}](fig/denoised.png){width=20em height=15em}
+
+**Talk about max performance**
+
### L2 Net
The network used to output the 128-dimension descriptors is an L2-Net with triplet loss as defined in CVPR 17 [@l2net]. L2-Net was designed specifically for descriptor output of patches and is a very suitable choice for this task. Training of the L2-Net can be done on the noisy images, but experimentation showed it is beneficial to use denoised images. Training the L2-Net with denoised images yields the training curves shown in figure \ref{fig:descriptor}. L2-Net is a robust architecture which has been developed with the HPatches dataset and may be hard to improve upon. The architecture is visualised in figure \ref{fig:l2arch}.
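
For reference, the triplet loss takes the standard hinge form over an anchor descriptor $f(a)$, a positive $f(p)$ from the same sequence and a negative $f(n)$ from a different one, with margin $m$ (our notation, not that of [@l2net]):

$$ \mathcal{L}_{\text{triplet}} = \max\left(0,\, m + \lVert f(a) - f(p) \rVert_2 - \lVert f(a) - f(n) \rVert_2\right) $$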
@@ -70,7 +76,7 @@ One way to boost performance is to run the dataset of patches through a singular
Online mining is introduced by Google in their FaceNet paper [@facenet] and further evaluated by the Szkocka Research group in their HardNet paper [@hardnet], which utilises the padded L2-Net network as implemented in the baseline.
-Instead of using three parallel siamese models, online batch mining as evaluated in this paper uses a single model which is given a batch of $n$ patches and produces $n$ descriptors. The loss function calculates the pairwise Euclidean distance between each pair of descriptors, creating an $n \times n$ matrix. We mask this matrix with the labels of the patches (based on the sequence they were extracted from) to find all valid positive and negative distances.
+Instead of using three parallel siamese models, online batch mining as evaluated in this paper uses a single model which is given a batch of $n$ patches and produces $n$ descriptors. The single model has the added advantage of being directly TPU compatible. The loss function calculates the pairwise Euclidean distance between each pair of descriptors, creating an $n \times n$ matrix. We mask this matrix with the labels of the patches (based on the sequence they were extracted from) to find all valid positive and negative distances.
We then have a few options for determining our loss. We define batch losses as introduced by Alexander Hermans et al. [@defense], who applied the batch all and batch hard losses to person re-identification. The **batch all** triplet loss is computed as the average over all valid triplets in the distance matrix. This is in a way equivalent to the loss found by the siamese network in the baseline, but it is able to compute the loss as the average of a vastly larger number of triplets.
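
A minimal numpy sketch of the masking and of the two batch losses follows; the helper is an illustrative assumption (it expects every label to appear at least twice in the batch), not our tensor-based training code.

```python
import numpy as np

def batch_losses(descriptors, labels, margin=1.0):
    """descriptors: (n, d) array; labels: (n,) integer array of sequence ids."""
    n = len(labels)
    # Pairwise Euclidean distances, shape (n, n)
    dist = np.linalg.norm(descriptors[:, None, :] - descriptors[None, :, :], axis=-1)
    # Masks of valid positive/negative pairs derived from the sequence labels
    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(n, dtype=bool)  # same sequence, excluding self
    neg_mask = ~same
    # Batch all: average hinge loss over every valid (anchor, pos, neg) triplet
    triplets = dist[:, :, None] - dist[:, None, :] + margin  # d(a,p) - d(a,n) + m
    valid = pos_mask[:, :, None] & neg_mask[:, None, :]
    batch_all = np.maximum(triplets[valid], 0).mean()
    # Batch hard: hardest positive and hardest negative per anchor
    hard_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    hard_neg = np.where(neg_mask, dist, np.inf).min(axis=1)
    batch_hard = np.maximum(hard_pos - hard_neg + margin, 0).mean()
    return batch_all, batch_hard
```

The soft margin variants discussed below replace the hinge $\max(x, 0)$ with the softplus $\ln(1 + e^{x})$ [@defense].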
@@ -116,7 +122,9 @@ It is important to note that the point of collapse when using soft margin formul
## Feature Augmentation
-show bad examples here
+We implement feature augmentation through random flips and rotations using numpy's `np.flip` and `np.rot90` functions. Nevertheless, we make a case that feature augmentation is detrimental for the HPatches dataset: as the example sequence in figure \ref{augment} shows, patches carry similar positional appearance across the same sequence, which is nullified by random flips and rotations. Experimentally we observe a near doubling of the loss, which reduces the maximum batch hard size that fits below the collapse threshold and ultimately results in lower performance.
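
For concreteness, the augmentation we argue against amounts to a simple per-patch transform; this sketch is illustrative and the function name is ours.

```python
import numpy as np

def augment_patch(patch, rng=np.random):
    # Random flip along each axis with probability 0.5
    if rng.rand() < 0.5:
        patch = np.flip(patch, axis=0)
    if rng.rand() < 0.5:
        patch = np.flip(patch, axis=1)
    # Random rotation by 0, 90, 180 or 270 degrees
    return np.rot90(patch, k=rng.randint(4))
```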
+
+![Non-Augmented Sequence\label{augment}](fig/augmentation.png){width=20em height=15em}
# Experimental Results
@@ -124,8 +132,8 @@ show bad examples here
\begin{center}
\begin{tabular}{lrrr} \hline
Training Method & Verification & Matching & Retrieval \\ \hline
-Baseline & 0.873 & 0.356 & 0.645 \\
-Soft baseline & 0.873 & 0.356 & 0.645 \\
+Baseline & 0.813 & 0.317 & 0.544 \\
+Soft Baseline & 0.853 & 0.324 & 0.558 \\
Batch Hard 128 & 0.873 & 0.356 & 0.645 \\
Batch Hard 1024 & 0.873 & 0.356 & 0.645 \\
Soft Batch Hard 1024 & 0.873 & 0.356 & 0.645 \\ \hline
@@ -137,7 +145,7 @@ Soft Batch Hard 1024 & 0.873 & 0.356 & 0.645 \\ \hline
# Visualisation
-![2D Descriptor Visualisation with TSNE (S=128;K=16)](fig/tsne.pdf){width=20em height=15em}
+![2D Descriptor Visualisation with t-SNE (S=128;K=16)](fig/tsne.pdf){width=20em height=15em}
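
The projection above can be reproduced with scikit-learn's t-SNE implementation; this sketch is an assumption about tooling, not our exact plotting code.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(descriptors, labels, path="fig/tsne.pdf"):
    # Project the 128-D descriptors to 2-D and colour points by sequence id
    embedded = TSNE(n_components=2).fit_transform(descriptors)
    plt.figure(figsize=(6, 4.5))
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=2, cmap="tab20")
    plt.savefig(path)
```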
# Appendix
@@ -145,7 +153,6 @@ Soft Batch Hard 1024 & 0.873 & 0.356 & 0.645 \\ \hline
![L2-Net training curves\label{fig:descriptor}](fig/descriptor.pdf){width=20em height=15em}
-![Denoise example - 20th epoch\label{fig:den3}](fig/denoised.png){width=20em height=15em}
![L2-Net Architecture\label{fig:l2arch} \[@l2net\]](fig/l2net.png){width=15em height=15em}