very good progress

author: Vasil Zlatanov <v@skozl.com> 2019-03-21 21:05:06 +0000
committer: Vasil Zlatanov <v@skozl.com> 2019-03-21 21:05:06 +0000
commit: a6e407b86854cc336684723d251746d0fd1f28e8 (patch)
tree: e4e35faa36cb6b859d0fddd1e3b8b71ce26c1413
parent: 817b2024e87c586b13208df23636c4b7955a7c87 (diff)
download: e3-deep-a6e407b86854cc336684723d251746d0fd1f28e8.tar.gz
e3-deep-a6e407b86854cc336684723d251746d0fd1f28e8.tar.bz2
e3-deep-a6e407b86854cc336684723d251746d0fd1f28e8.zip
1 files changed, 22 insertions, 26 deletions
diff --git a/report/paper.md b/report/paper.md
index 688208b..3208e88 100644
--- a/report/paper.md
+++ b/report/paper.md
@@ -1,6 +1,6 @@
 # Problem Definition
 
-This coursework's goal is to develop an image representation of patches from the `HPatches` dataset for the purpose of matching and retrieval, where images with the same features (and classes) are nearby in the reduced Euclidean space, while dissimilar ones are far away. The dataset contains patches sampled from image sequences taken from the same scenes. For each image sequence there is a reference image, and two more files `eX.png` and `hX.png` containing corresponding patches from the images in the sequence with altered illumination (`i_X`) or viewpoint (`v_X`). Furthemore corresponding patches are extracted with geometric noise, easy `e_X` have a small amount of jitter while `h_X` patches have a larger amount [@patches]. The patches as processed by our networks are monochrome 32 by 32 images. An series eights images from the same sequence are shown on figure \ref{sequence}.
+This coursework's goal is to develop an image representation of patches from the `HPatches` dataset for the purpose of matching and retrieval, where images with the same features (and classes) are nearby in the reduced Euclidean space of the descriptors, while dissimilar ones are far apart. The dataset contains patches sampled from image sequences taken from the same scenes. For each image sequence there is a reference image, and two more files `eX.png` and `hX.png` containing corresponding patches from the images in the sequence with altered illumination (`i_X`) or viewpoint (`v_X`). Furthermore, corresponding patches are extracted with geometric noise: easy `e_X` patches have a small amount of jitter while hard `h_X` patches have a larger amount [@patches]. The patches as processed by our networks are monochrome 32 by 32 images. A series of eight images from the same sequence is shown in figure \ref{sequence}.
 
 ![Sequence from the HPatches dataset\label{sequence}](fig/sequence.png){width=20em height=15em}
 
@@ -16,11 +16,11 @@ The baseline model provided in the given IPython notebook, approaches the proble
 
 ## TPU Training
 
-Training with Google's proprietary TPU's is used for this project as it can be done with some a few modifications of the baseline. Specifically it is necessary to rewrite the `Upsampling2D` layer using a `Lambda` layer, replace python's `keras` with `tensorflow.keras` and replace the sequence generators with those in the `tensorflow` library. Furthemore the TPU's do not support the placeholder op necessary for the implementation of the baseline siamese triplet network, but is compatible with the online batch (hard) triplet loss implemented and presented in this paper.
+Training with Google's proprietary TPUs is used for this paper as it can be done with a few modifications of the baseline. Specifically, it is necessary to rewrite the `UpSampling2D` layer using a `Lambda` layer, replace Python's `keras` with `tensorflow.keras` and replace the sequence generators with those in the `tensorflow` library. Furthermore, the TPUs do not support the placeholder op necessary for the implementation of the baseline siamese triplet network, but are compatible with the online batch (hard) triplet loss implemented and presented later in this paper.
 
 Direct comparison of performance with GPU is made difficult by the fact that Google's TPU's show the biggest performance increase for large batch sizes, which do not fit in GPU memory. Furthemore TPU's incur an extra model compilation time and copying of the weights. Nevertheless approximate performance times for the denoise network are presented in table \ref{tpu-time}.
 
-\begin{table}[]
+\begin{table}[h!]
 \begin{center}
 \begin{tabular}{lrrrr} \hline
 Batch Size & CPU & GPU & TPU &  \\ \hline
@@ -35,8 +35,8 @@ Batch Size & CPU & GPU & TPU &  \\ \hline
 
 ### U-Net Denoising
 
-A shallow version of the U-Net network is used to denoise the noisy patches. The shallow U-Net network has the same output size as the input size, is fed a noisy image and has loss computed as the euclidean distance with a clean reference patch. This effectively teaches the U-Net autoencoder to perform a denoising operation on the input images.
-Efficient training can performed with TPU acceleration, a batch size of 4096 and the Adam optimizer with learning rate of 0.001 and is shown on figure \ref{fig:denoise}. Training and validation was performed with **all** available data.
+A shallow version of the U-Net network is used to denoise the noisy patches. The shallow U-Net has the same output size as input size, is fed a noisy image, and has its loss computed as the mean absolute error (L1 loss) against a clean reference patch. This effectively teaches the U-Net autoencoder to perform a denoising operation on the input images.
+Efficient training can be performed with TPU acceleration, a batch size of 4096 and the Adam optimizer with a learning rate of 0.001, and is shown in figure \ref{fig:denoise}. Training and validation were performed with all available data.
 The network is able to achieve a mean average error of 5.3 after 19 epochs. With gradient descent we observed a loss of 5.5 after the same number of epochs. There is no observed evidence of overfitting with the shallow net, something which may be expected with a such a shallow network. An example of denoising as performed by the network is visible in figure \ref{fig:den3}.
 Quick experimentation with a deeper version of U-Net shows it is possible to achieve validation loss of below 5.0 after training for 10 epochs, and a equivalent to the shallow loss of 5.3 is achievable aftery only 3 epochs.
 
@@ -46,37 +46,29 @@ Quick experimentation with a deeper version of U-Net shows it is possible to ach
 
 ### L2 Net
 
-The network used to output the 128 dimension descriptors is a L2-network with triplet loss as defined in CVPR 17 [@l2net]. L2-Net was specifically for descriptor output of patches and is a very suitable choice for this task.  Training of the L2-Net can be done on the noisy images, but experimentation showed it is beneficial to use denoised images. Training the L2-Net with denoised yields training curves shown in \ref{fig:descriptor}. L2-Net is a robust architecture which has been developed with the HPatches dataset and may be hard to improve upon. The architecture is visualised in fig \ref{fig:l2arch}.
+The network used to output the 128-dimensional descriptors is an L2-Net with triplet loss, as defined in the CVPR 2017 paper [@l2net]. L2-Net was specifically designed for descriptor output of patches and is a very suitable choice for this task. Training the L2-Net with denoised patches, as per the baseline, yields the training curves shown in figure \ref{fig:descriptor}. L2-Net is a robust architecture which has been developed with the HPatches dataset and may be hard to improve upon. The architecture is visualised in figure \ref{fig:l2arch}.
 
 ### Triplet Loss
 
-The loss used for Siamese L2-Net:
+The loss used for the siamese L2-Net as implemented in the baseline can be formulated as:
 
 \begin{equation}\label{eq:loss_trip}
   \loss{tri}(\theta) = \frac{1}{N} \sum\limits_{\substack{a,p,n \\ y_a = y_p \neq y_n}} \left[ D_{a,p} - D_{a,n} + \alpha \right]_+.
 \end{equation}
 
-Where N is the number of triplets in the batch.
-
-There is an intrinsic problem that occurs when loss approaches 0, training becomes more difficult as we are throwing away loss data which prevents the network from progressing significantly past that point. Solutions may involve increase the margin $\alpha$ or adopting a non linear loss which is able to avoid the loss truncation similar to the L2-Net paper.
-
-# Performance & Evaluation
-
-Training speed was found to be greatly improvable by utilising Google's dedicated TPU and increasing batch size. With the increase in batch size, it becomes beneficial to increase learning rate. Particularly we found an increase of batch size to 4096 to allow an increase in learning rate of a factor of 10 over the baseline which offered around 10 training time speedup, together with faster convergence of the loss for the denoise U-Net.
-
-The baseline is evaluated against the three metrics is: Retrieval - 81.3%, Matching - 31.7%, Verification - 54.4%
+Here $N$ is the number of triplets in the batch, $D_{a,x}$ is the Euclidean distance between the descriptors of the anchor $a$ and patch $x$, and $\alpha$ is the margin.
 
 # Hard Triplet Mining
 
-Training the baseline for a sufficient number of epochs results in loss convergance, that is the validation (and training) loss stop improving. The baseline generator creates random valid triplets which are fed to the siamese network. This results in the network being given too many *easy* triplets, which for a well trained baseline result in a low loss but do not incentivise the model to improve. 
+Training the baseline for a sufficient number of epochs results in loss convergence, that is, the validation loss stops improving. The baseline generator creates random valid triplets which are fed to the siamese network. This results in the network being given too many *easy* triplets, which for a well-trained baseline result in a low loss but do not provide a good opportunity for the model to improve on the hard triplets.
 
-One way to boost performance is to run the dataset of patches through a singular decriptor model, record the descriptors and generate a subset of hard triplets with which the baseline can be trained. This will result in higher immediate loss and will allow the network to optimise the loss for thos hard triplets. This can be done once or multiple times. It can also be done on every epoch, or ideally on every batch under the form of *online mining*.
+One way to boost performance is to run the dataset of patches through a single descriptor model, record the descriptors and generate a subset of hard triplets with which the baseline can be trained. This will result in a higher immediate loss and will allow the network to optimise the loss for those hard triplets. This can be done once or multiple times. It can also be done on every epoch, or ideally on every batch in the form of *online mining*.
 
 ## Online Mining
 
-Online mining is introduced by Google in their Facenet paper [@facenet] and further evaluated by the Szkocka Research group in their Hardnet paper [@hardnet] which utilises the padded L2Net network as implemented in baseline.
+Online mining was first introduced by Google in their Facenet paper [@facenet] and further evaluated by the Szkocka Research group in their Hardnet paper [@hardnet], which utilises the padded L2-Net network as implemented in the baseline.
 
-Instead of using a three parallel siamese models online batch mining as evaluated in this paper uses a single model which is given a batch of $n$ patches and produces $n$ descriptors. The single model has the added advantage of being directly TPU compatible. The loss function calculates the pairwise Euclidean distance between each descriptors creating a $n^2$ matrix. We mask this matrix with the labels of the patches (based on the sequence they were extracted from), to find all valid posive and negative distances.
+Instead of using three parallel siamese models, online batch mining uses a single model which is given a batch of $n$ patches and produces $n$ descriptors. The single model has the added advantage of being directly TPU compatible in our Keras implementation. The loss function calculates the pairwise Euclidean distance between the descriptors, creating an $n \times n$ matrix. We mask this matrix with the labels of the patches (based on the sequence they were extracted from) to find all valid positive and negative distances.
 
 We then have then have a few options for determining our loss. We define batch losses as introduced by Alexander Hermans et. al. [@defense] who applied the batch all and batch hard losses to person re-identification. The **batch all** triplet is computed as the average of all valid triplets in the distance matrix. This is in a way equivalent to loss as found by the siamese network in the baseline, but is able to compute the loss as the average of a vastly larger number of triplets. 
 
@@ -92,25 +84,27 @@ Where $S$ is the number of sequences in the batch, $K$ is the number of images i
 
 ## Symmetric Batch Formation
 
-Batch loss presents a series of problems when applied on the HPatches dataset. Implementations of batch triplet loss often use randomly sampled batches. For a dataset like MNIST which has only 10 classes, this is not a problem as there is it is very unlikely to have no valid triplets. In the HPatches dataset, image sequences tend to have over 1000 patches, meaning that the probability of having no valid triplets is very high. In these situations loss is meaningless and hence a random batch sequence is unfeasible. The first test of batch loss failed due to this and required an alternate solution.
+Batch loss presents a series of problems when applied to the HPatches dataset. Implementations of batch triplet loss often use randomly sampled batches. For a dataset like MNIST which has only 10 classes, this is not a problem as it is very unlikely to have no valid triplets in a batch. In the HPatches dataset, the majority of the image sequences have over 1000 patches, meaning that the probability of having no valid triplets is very high. In situations where there are no valid triplets the loss is meaningless, and therefore random batch sampling is infeasible. Our first experiments with batch loss failed due to this and required additional work.
 
-We therefore implemented  batches of size $SK$ formed with $S$ number patch sequences containg $K \geq 2$ patches. The anchor positive permutation's are therefor $(K-1)$ possible positives for each anchor, and $(S-1)K)$ negatives for each pair. With a guaranteed total number of $K^2(K-1)(S-1)$ triplets. This has the added benefit of allowing the positive and negative distances masks to be precomputed based on the $S$ and $K$ as the patches are ordered. It should be noted that the difficulty of the batch is highly reliant both $SK$ and $K$. The larger $K$ the more likely it is to have a harder the furthest anchor-positive pair, and the bigger the batch size $SK$ the more likely it is to find a close negative.
+We therefore implemented batches of size $SK$ formed from $S$ patch sequences, each contributing $K \geq 2$ patches. There are therefore $(K-1)$ possible positives for each anchor and $(S-1)K$ negatives for each anchor-positive pair, giving a guaranteed total of $SK^2(K-1)(S-1)$ triplets per batch. This has the added benefit of allowing the positive and negative distance masks to be precomputed based on $S$ and $K$, as the patches are ordered. It should be noted that the difficulty of the batch is highly reliant on both $SK$ and $K$. The larger $K$, the more likely it is that the furthest anchor-positive pair is harder, and the bigger the batch size $SK$, the more likely it is to find a close negative.
 
 ## Collapse
 
-Even after the implementation of symmetric batch formation, we were unable to train the descriptor model without having the loss becoming stuck at the margin, that is $\loss{BH} = \alpha$. Upon observation of the descriptors it was evident that the descriptor model producing descriptors with all dimensions approaching zero, regardless of the input. A naive solution is to apply $L2$ normalisation to the descriptors, but that does not avoid collapse as the network learns to output the same descriptor every time and this further unnecessarily limits the descriptors to the unit hypersphere.
+Even after the implementation of symmetric batch formation, we were unable to train the descriptor model without the loss stagnating at the margin, that is $\loss{BH} \approx \alpha$. Upon observation of the descriptors it was evident that the descriptor model was producing descriptors with all dimensions approaching zero, regardless of the input. A naive solution we initially attempted was to apply $L2$ normalisation to the descriptors, but that does not avoid collapse, as the network learns to output the same descriptor regardless of the input, and it further unnecessarily limits the descriptors to the unit hypersphere.
 
-Avoiding collapse of all the descriptors to a single point proved to be a hard task. Upon experimentation, we find that descriptors tend to collapse for large batch sizes with a large $K$. Initial experiments, which eagerly used large batch sizes in order to utilise TPU acceleration would ubiquitously collapse to the erroneous local minima of identical descriptors. Avoiding collapse is ultimately solved with low learning rate and small/easy batch. This significantly limits the advantages of batch hard loss, and makes it extremely slow and hard to train.
+Avoiding collapse of all the descriptors to a single point proved to be a hard task. Upon experimentation, we find that descriptors tend to collapse for large batch sizes with a large $K$. Initial experiments, which eagerly used large batch sizes in order to utilise TPU acceleration, would invariably collapse to the erroneous local minimum of identical descriptors. Avoiding collapse is ultimately achieved with a low learning rate and small/easy batches. This significantly limits the advantages of batch hard loss, and makes it extremely slow and hard to train.
 
 We further find that square Euclidean distance: $D\left(\nnfn(x_i), \nnfn(x_j)\right) = \norm{\nnfn(x_i) - \nnfn(x_j)}_2^2$, while cheaper to compute, is much more prone to collapse (in fact we did not successfuly train a network with square Euclidean distance). A. Hermans et. al. make a similar observation about squared Euclidean distance on the MARS dataset [@defense].
 
 ## Stepped hard batch training
 
-We find that for large batch sizes with $K > 3$ avoiding collapse when training the network with randomly initialised weights is extremely difficult for the HPatches dataset even for low learning rates ($10^-5$). We further find that the Adam optimizer collapses even quicker than SGD, reaching collapse in a few epoch. We were eventually able to successfuly train small batches with $SK < 32$ by weights from the baseline model.
+We find that for large batch sizes with $K > 3$, avoiding collapse when training the network with randomly initialised weights is extremely difficult for the HPatches dataset, even for low learning rates ($10^{-5}$). We further find that the Adam optimizer collapses even quicker than SGD, reaching collapse in a few epochs. We were eventually able to successfully train small batches with $SK < 32$ by initialising with weights from the baseline model.
 
 While doing so, we observed that the risk of collapse disappears if we are able to reach a $\loss{BH} < \alpha$. After the loss has dipped under the margin we find that we may increase the learning rate as the collapsed state has a higher loss and as such the optimizer has no incentive to push the network into collapse. 
 
-Eventual training with batch size of of 1024 with $K=8$ was achieved by progressively increasing the batch size, starting with weights from the baseline model. We take care to increase the batch size only so much, such that the loss is consitently below $\alpha$. This training technique allows the use of the Adam optimizer and higher learning rates and can be done automatically. We call this method *stepped batch hard*, as we do not find previous documentation of this technique. 
+Eventual training with a batch size of 2048 with $K=16$ (the maximum $K$ for HPatches, as not all sequences have more than 16 patches) was achieved by progressively increasing the batch size, starting with weights from the baseline model. We take care to increase the batch size only so much that the loss is consistently below $\alpha$. This training technique allows the use of the Adam optimizer and higher learning rates and can be done automatically. We call this method *stepped batch hard*, as to our knowledge this technique has not been described in the literature previously.
+
+Stepping was performed with batch sizes of 32, 64, 128, 256, 512, 1024 and finally 2048, with $K$ starting at 2 and eventually reaching 16. We performed training with the Adam optimizer with a learning rate of $2 \times 10^{-5}$.
 
 ->> Graph of stepped batch hard
 
@@ -128,6 +122,8 @@ We implement feature augmentation through random flips and rotations using numpy
 
 # Experimental Results
 
+** Give more specific learning rates, and results ! ***
+
 \begin{table}[]
 \begin{center}
 \begin{tabular}{lrrr} \hline
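
The batch hard loss over an ordered $S \times K$ batch, as described in the patched text, can be sketched in NumPy roughly as follows. This is a minimal illustration and not the report's Keras/TPU implementation: the function names, the default margin of 1.0 and the choice to average over anchors are all assumptions made here. It precomputes the positive and negative masks from $S$ and $K$ alone, takes the furthest positive and closest negative for each anchor, and uses the non-squared Euclidean distance.

```python
import numpy as np


def triplet_masks(S, K):
    """Boolean masks of valid pairs for a batch ordered as K patches per sequence."""
    labels = np.repeat(np.arange(S), K)            # [0,0,..,1,1,..] sequence ids
    same = labels[:, None] == labels[None, :]
    positive = same & ~np.eye(S * K, dtype=bool)   # same sequence, different patch
    negative = ~same                               # different sequence
    return positive, negative


def count_triplets(S, K):
    """Number of valid (anchor, positive, negative) triplets in the batch."""
    positive, negative = triplet_masks(S, K)
    return int((positive.sum(axis=1) * negative.sum(axis=1)).sum())


def pairwise_distances(descriptors):
    """Non-squared Euclidean distance between every pair of descriptors."""
    diff = descriptors[:, None, :] - descriptors[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def batch_hard_loss(descriptors, S, K, margin=1.0):
    """Mean over anchors of [furthest positive - closest negative + margin]_+.

    The margin default and the mean reduction are assumptions of this sketch.
    """
    positive, negative = triplet_masks(S, K)
    dist = pairwise_distances(descriptors)
    hardest_pos = np.where(positive, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(negative, dist, np.inf).min(axis=1)
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```

Feeding identical descriptors for every patch, for example `np.zeros((S * K, 128))`, makes every distance zero, so the loss equals the margin. This is the collapsed state the text describes.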
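
The stepped batch hard procedure could be automated along the following lines. This is a hedged sketch only: the patch gives the batch sizes (32 up to 2048) and the end points of $K$ (2 and 16), but not how $K$ grows between steps or how "consistently below $\alpha$" was judged. The doubling of $K$, the `patience` window and both function names are therefore assumptions introduced for illustration.

```python
def stepped_schedule(batch_sizes, k_start=2, k_max=16):
    """Pair each batch size with a K that doubles per step, capped at k_max.

    Doubling K is an assumption; the source only states the start and end values.
    """
    schedule = []
    k = k_start
    for batch_size in batch_sizes:
        schedule.append((batch_size, k))
        k = min(2 * k, k_max)
    return schedule


def should_step(recent_losses, margin, patience=3):
    """True when the last `patience` recorded losses are all below the margin.

    `patience` is a hypothetical knob for 'consistently below alpha'.
    """
    if len(recent_losses) < patience:
        return False
    return all(loss < margin for loss in recent_losses[-patience:])
```

A training loop would start from the baseline weights, train at the current `(batch_size, K)` pair, and move to the next pair only once `should_step` returns true, so the loss stays below the margin and the collapsed state remains unattractive to the optimizer.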