author    Vasil Zlatanov <v@skozl.com>  2019-03-21 18:27:47 +0000
committer Vasil Zlatanov <v@skozl.com>  2019-03-21 18:27:47 +0000
commit    09dedd5ffc5577917e0c5135750d4aed49ffc557 (patch)
tree      ae5ec9b7c1850a6de732deec59a911babb619fbc
parent    a283b7fb373851d42113f2a44203ffeb8777f039 (diff)
Add tsne
-rw-r--r--  report/paper.md  37
1 file changed, 27 insertions(+), 10 deletions(-)
diff --git a/report/paper.md b/report/paper.md
index 2829952..a720128 100644
--- a/report/paper.md
+++ b/report/paper.md
@@ -1,11 +1,11 @@
-# Problem Definion
+# Problem Definition
This coursework's goal is to develop an image representation of patches from the `HPatches` dataset for the purpose of matching and retrieval, where images with the same features (and classes) are nearby in the reduced Euclidean space, while dissimilar ones are far away. The dataset contains patches sampled from image sequences taken from the same scenes. For each image sequence there is a reference image, and two more files `eX.png` and `hX.png` containing corresponding patches from the images in the sequence with altered illumination (`i_X`) or viewpoint (`v_X`). Furthermore, corresponding patches are extracted with geometric noise: easy `e_X` patches have a small amount of jitter while `h_X` patches have a larger amount [@patches]. The patches as processed by our networks are monochrome 32 by 32 images.
The goal is to train a network which, given a patch, is able to produce a descriptor vector with a dimension of 128. The descriptors are evaluated based on their performance across three tasks:
* Retrieval: Use a given image's descriptor to find similar images in a large gallery
-* Matching: Use a given image's descriptor to find similar in a small gallery with difficult distractors
+* Matching: Use a given image's descriptor to find similar images in a small gallery with difficult distractors
* Verification: Given two images, use the descriptors to determine their similarity
# Baseline Model
@@ -34,13 +34,13 @@ Batch Size & CPU & GPU & TPU & \\ \hline
### Shallow U-Net
A shallow version of the U-Net network is used to denoise the noisy patches. The shallow U-Net network has the same output size as the input size, is fed a noisy image, and has its loss computed as the Euclidean distance to a clean reference patch. This effectively teaches the U-Net autoencoder to perform a denoising operation on the input images.
-Efficient training can performed with TPU acceleartion, a batch size of 4096 and the Adam optimizer with learning rate of 0.001 and is shown on figure \ref{fig:denoise}. Training and validation was performed with **all** available data.
+Efficient training can be performed with TPU acceleration, a batch size of 4096 and the Adam optimizer with a learning rate of 0.001, as shown in figure \ref{fig:denoise}. Training and validation were performed with **all** available data.
The network is able to achieve a mean average error of 5.3 after 19 epochs. With gradient descent we observed a loss of 5.5 after the same number of epochs. There is no observed evidence of overfitting with the shallow net, something which may be expected with such a shallow network. An example of denoising as performed by the network is visible in figure \ref{fig:den3}.
Quick experimentation with a deeper version of U-Net shows it is possible to achieve a validation loss below 5.0 after training for 10 epochs, and a loss equivalent to the shallow network's 5.3 is achievable after only 3 epochs.
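
To make the setup concrete, below is a minimal Keras sketch of a shallow U-Net-style denoiser of the kind described above; the filter counts and the single skip level are illustrative assumptions rather than the exact architecture used.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def shallow_unet(input_shape=(32, 32, 1)):
    """Shallow U-Net-style denoiser: one down/up level with a skip connection.
    Filter counts are illustrative assumptions, not the report's exact values."""
    inp = layers.Input(shape=input_shape)
    # Encoder
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D(2)(c1)
    # Bottleneck
    b = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    # Decoder with a skip connection back to the encoder features
    u1 = layers.UpSampling2D(2)(b)
    m1 = layers.Concatenate()([u1, c1])
    c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(m1)
    out = layers.Conv2D(1, 3, padding="same")(c2)
    return Model(inp, out)

model = shallow_unet()
# Pixel-wise reconstruction loss against the clean reference patch. MAE is used
# here to match the error metric quoted in the text; the Euclidean-distance loss
# described above would correspond to "mse" instead.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mae")
# model.fit(noisy_patches, clean_patches, batch_size=4096, epochs=19)
```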
### L2 Net
-The network used to output the 128 dimension descritors is a L2-network with triplet loss as defined in CVPR 17 [@l2net]. L2-Net was specifically for descriptor output of patches and is a very suitable choice for this task. Training of the L2-Net can be done on the noisy images, but expirimentation showed it is beneficial to use denoised images. Training the L2-Net with denoised yields training curves shown in \ref{fig:descriptor}. L2-Net is a robust architecture which has been developed with the HPatches dataset and may be hard to improve upon. The architecture is visualised in fig \ref{fig:l2arch}.
+The network used to output the 128-dimension descriptors is an L2-network with triplet loss as defined in CVPR 17 [@l2net]. L2-Net was designed specifically for descriptor output of patches and is a very suitable choice for this task. Training of the L2-Net can be done on the noisy images, but experimentation showed it is beneficial to use denoised images. Training the L2-Net with denoised images yields the training curves shown in \ref{fig:descriptor}. L2-Net is a robust architecture which has been developed with the HPatches dataset and may be hard to improve upon. The architecture is visualised in figure \ref{fig:l2arch}.
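
For illustration, a rough sketch of an L2-Net-like descriptor network follows; the exact layer configuration is given in [@l2net], so the convolutional stack below is an assumption, with only the 32 by 32 monochrome input and the 128-dimensional L2-normalised output taken from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def descriptor_net(input_shape=(32, 32, 1)):
    """Fully convolutional descriptor network (L2-Net-like approximation).
    The 128-D output is L2-normalised so descriptors live on the unit sphere."""
    inp = layers.Input(shape=input_shape)
    x = inp
    # Strided conv blocks progressively reduce 32x32 -> 8x8 (filter counts assumed)
    for filters, stride in [(32, 1), (32, 1), (64, 2), (64, 1), (128, 2)]:
        x = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    # A final 8x8 convolution collapses the spatial dimensions into 128 values
    x = layers.Conv2D(128, 8, padding="valid")(x)
    x = layers.Flatten()(x)
    out = layers.Lambda(lambda d: tf.math.l2_normalize(d, axis=1))(x)
    return Model(inp, out)
```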
### Triplet Loss
@@ -54,7 +54,7 @@ Where N is the number of triplets in the batch.
There is an intrinsic problem that occurs when the loss approaches 0: training becomes more difficult as we are throwing away loss information, which prevents the network from progressing significantly past that point. Solutions may involve increasing the margin $\alpha$ or adopting a non-linear loss which is able to avoid the loss truncation, similar to the L2-Net paper.
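
As a concrete reference for the hinge formulation and the truncation issue above, a minimal sketch of the margin triplet loss on batches of descriptors (the variable names and the margin value are ours):

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin (hinge) triplet loss [m + d(a,p) - d(a,n)]_+ averaged over the batch.
    Inputs are L2-normalised descriptors of shape (N, 128); the margin is illustrative."""
    d_ap = tf.reduce_sum(tf.square(anchor - positive), axis=1)
    d_an = tf.reduce_sum(tf.square(anchor - negative), axis=1)
    # Easy triplets (d_an > d_ap + margin) are clipped to zero loss, which is
    # exactly the truncation issue discussed in the text.
    return tf.reduce_mean(tf.maximum(d_ap - d_an + margin, 0.0))
```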
-# Peformance & Evaluation
+# Performance & Evaluation
Training speed was found to be greatly improvable by utilising Google's dedicated TPU and increasing the batch size. With the increase in batch size, it becomes beneficial to increase the learning rate. In particular, we found that increasing the batch size to 4096 allows an increase in learning rate by a factor of 10 over the baseline, which offered around a tenfold training-time speedup, together with faster convergence of the loss for the denoise U-Net.
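
A sketch of what this setup can look like with TensorFlow's TPU distribution APIs is shown below; the TF 2.x calls, the assumed baseline learning rate and the placeholder model are our own, with only the batch size of 4096 and the tenfold learning-rate increase taken from the text.

```python
import tensorflow as tf

# Assumed Colab-style TPU runtime; resolver setup follows the standard TF 2.x recipe.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

base_lr = 1e-3            # assumed baseline learning rate
lr = base_lr * 10         # 10x increase enabled by the larger batch
batch_size = 4096

with strategy.scope():
    # Any of the models sketched earlier (denoiser or descriptor network) can be
    # built here; model construction and compilation happen inside the scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               input_shape=(32, 32, 1)),
        tf.keras.layers.Conv2D(1, 3, padding="same"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss="mae")
# model.fit(noisy_patches, clean_patches, batch_size=batch_size, epochs=19)
```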
@@ -86,9 +86,9 @@ Where $S$ is the number of sequences in the batch, $K$ is the number of images i
## Symmetric Batch Formation
-Batch loss presents a series of problems when applied on the HPatches dataset. Implementations of batch triplet loss often use randomly sampled batches. For a dataset like MNIST which has only 10 classes, this is not a problem as there is it is very unlikely to have no valid triplets. In the HPatches dataset, image sequences tend to have over 1000 patches, meaning that the probability of having no valid triplets is very high. In these situations loss is meaningliss and hence a random batch sequence is unfeasable. The first test of batch loss failed due to this and required an alternatet solution.
+Batch loss presents a series of problems when applied to the HPatches dataset. Implementations of batch triplet loss often use randomly sampled batches. For a dataset like MNIST, which has only 10 classes, this is not a problem as it is very unlikely to have no valid triplets. In the HPatches dataset, image sequences tend to have over 1000 patches, meaning that the probability of having no valid triplets in a random batch is very high. In these situations the loss is meaningless and hence a random batch sequence is unfeasible. The first test of batch loss failed due to this and required an alternate solution.
-We therefore implemented batches of size $SK$ formed with $S$ number patch sequences containg $K \geq 2$ patches. The anchor positive permutation's are therefor $(K-1)$ possible positives for each anchor, and $(S-1)K)$ negatives for each pair. With a guarranteed total number of $K^2(K-1)(S-1)$ triplets. This has the added benefit of allowing the positive and negative distances masks to be precomputed based on the $S$ and $K$ as the patches are ordered. It should be noted that the difficulty of the batch is highly reliant both $SK$ and $K$. The larger $K$ the more likely it is to have a harder the furthest anchor-postive pair, and the bigger the batch size $SK$ the more likely it is to find a close negative.
+We therefore implemented batches of size $SK$, formed from $S$ patch sequences each containing $K \geq 2$ patches. The anchor-positive permutations therefore give $(K-1)$ possible positives for each anchor and $(S-1)K$ negatives for each anchor-positive pair, for a guaranteed total of $SK \cdot (K-1) \cdot (S-1)K = S(S-1)K^2(K-1)$ triplets. This has the added benefit of allowing the positive and negative distance masks to be precomputed based on $S$ and $K$, as the patches are ordered. It should be noted that the difficulty of the batch is highly reliant on both $SK$ and $K$: the larger $K$, the more likely the furthest anchor-positive pair is to be hard, and the bigger the batch size $SK$, the more likely it is to find a close negative.
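
Because the batch is ordered as $S$ sequences of $K$ patches, the positive and negative masks can be precomputed once from the implied labels; a small sketch (a helper of our own, not the report's code):

```python
import numpy as np

def batch_masks(S, K):
    """Boolean masks for an ordered batch of S sequences x K patches.
    positive[i, j] is True when i and j are distinct patches of the same sequence;
    negative[i, j] is True when they come from different sequences."""
    labels = np.repeat(np.arange(S), K)               # [0,0,...,1,1,...,S-1,...]
    same_seq = labels[:, None] == labels[None, :]
    positive = same_seq & ~np.eye(S * K, dtype=bool)  # exclude the anchor itself
    negative = ~same_seq
    return positive, negative

# e.g. S=2 sequences with K=2 patches each -> a pair of 4x4 masks
pos_mask, neg_mask = batch_masks(2, 2)
```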
## Collapse
@@ -105,13 +105,30 @@ We find that while collapse is hard to avoid for large batch with $K > 3$, the r
## Soft Margin Loss
-One of properties of triplet loss as implemented with the margin based hinge loss function $\left[m + \bullet\right]_+$ is that it avoids optimisation based on already submarginal "easy" triplet's which have their loss set to zero. However it is possible to replace the loss function such that it still produces a small amount of loss for these correct triplets, such that they can be pulled closer together. We use the `keras.backend.softplus` function which implements an exponential loss below zero instead of a hard cut off. A. Hermans et. al. [@defense] refer to this as the soft margin formulation and implement it as $\ln(1 + \exp(\bullet))$ which is identical to the `keras`'s softplus fuction.
+One of the properties of triplet loss as implemented with the margin-based hinge loss function $\left[m + \bullet\right]_+$ is that it avoids optimisation based on already submarginal "easy" triplets, which have their loss set to zero. However, it is possible to replace the loss function such that it still produces a small amount of loss for these correct triplets, so that they can be pulled closer together. We use the `keras.backend.softplus` function, which implements an exponential loss below zero instead of a hard cut-off. A. Hermans et al. [@defense] refer to this as the soft margin formulation and implement it as $\ln(1 + \exp(\bullet))$, which is identical to Keras's `softplus` function.
-It is important to note that the point of collapse when using soft margin formulation with $\alpha = 0$ is approximately $0.69$, as it per our training strategy is desirable to progressively increase batch size, such that it the loss stays beyong point of collapse.
+It is important to note that the point of collapse when using the soft margin formulation with $\alpha = 0$ is approximately $0.69$; as per our training strategy it is desirable to progressively increase the batch size such that the loss stays beyond the point of collapse.
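
A minimal sketch of the soft margin formulation follows; the reduction over the batch is an assumption, while the softplus form matches the $\ln(1 + \exp(\bullet))$ expression above.

```python
import tensorflow as tf

def soft_margin_triplet_loss(d_ap, d_an):
    """Soft margin triplet loss ln(1 + exp(d_ap - d_an)), i.e. softplus instead of
    the hinge [m + .]_+ (tf.math.softplus matches keras.backend.softplus).
    At collapse d_ap == d_an, giving ln(2) ~= 0.69 per triplet, the value quoted
    in the text."""
    return tf.reduce_mean(tf.math.softplus(d_ap - d_an))
```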
+
+# Experimental Results
+
+\begin{table}[]
+\begin{center}
+\begin{tabular}{lrrr} \hline
+Training Method & Verification & Matching & Retrieval \\ \hline
+Baseline & 0.873 & 0.356 & 0.645 \\
+Soft baseline & 0.873 & 0.356 & 0.645 \\
+Batch Hard 128 & 0.873 & 0.356 & 0.645 \\
+Batch Hard 1024 & 0.873 & 0.356 & 0.645 \\
+Soft Batch Hard 1024 & 0.873 & 0.356 & 0.645 \\ \hline
+\end{tabular}
+\caption{mAP for various training techniques}
+\label{results}
+\end{center}
+\end{table}
# Visualisation
-Descripotrs can be visualised with TSNE
+![2D Descriptor Visualisation with TSNE](fig/tsne.pdf){width=20em height=15em}
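
A hedged sketch of how such a 2-D embedding can be produced from the 128-dimensional descriptors with scikit-learn's t-SNE; the input arrays, perplexity and colouring are placeholders of our own choosing.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: (N, 128) descriptors from the descriptor network and integer
# sequence ids used only for colouring the scatter plot.
descriptors = np.load("descriptors.npy")   # hypothetical file
labels = np.load("labels.npy")             # hypothetical file

# Project the 128-D descriptors down to 2-D for visualisation.
embedded = TSNE(n_components=2, perplexity=30).fit_transform(descriptors)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=2, cmap="tab20")
plt.savefig("fig/tsne.pdf")
```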
# Appendix