 report/paper.md | 17 +++++++++++++---
 1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/report/paper.md b/report/paper.md
index a720128..dd86649 100644
--- a/report/paper.md
+++ b/report/paper.md
@@ -5,7 +5,7 @@ This coursework's goal is to develop an image representation of patches from the
The goal is to train a network which, given a patch, produces a descriptor vector of dimension 128. The descriptors are evaluated based on their performance across three tasks (a minimal retrieval sketch follows the task list):
* Retrieval: Use a given image's descriptor to find similar images in a large gallery
-* Matching: Use a given image's descriptor to find similar in a small gallery with difficult dis tractors
+* Matching: Use a given image's descriptor to find similar images in a small gallery containing difficult distractors
* Verification: Given two images, use the descriptors to determine their similarity
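All three tasks reduce to comparing descriptor distances. As a rough illustration only (this is not part of the coursework code, and the array names are placeholders), the sketch below ranks a gallery by L2 distance to a query descriptor:

```python
import numpy as np

def retrieve(query_desc, gallery_descs, top_k=5):
    """Rank gallery patches by L2 distance to a query descriptor.

    query_desc:    (128,) descriptor of the query patch
    gallery_descs: (N, 128) array of gallery descriptors
    Returns the indices of the top_k closest gallery entries.
    """
    dists = np.linalg.norm(gallery_descs - query_desc, axis=1)
    return np.argsort(dists)[:top_k]
```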
# Baseline Model
@@ -98,10 +98,15 @@ Avoiding collapse of all the descriptors to a single point proved to be a hard t
We further find that squared Euclidean distance: $D\left(\nnfn(x_i), \nnfn(x_j)\right) = \norm{\nnfn(x_i) - \nnfn(x_j)}_2^2$, while cheaper to compute, is much more prone to collapse (in fact we did not successfully train a network with squared Euclidean distance). A. Hermans et al. make a similar observation about squared Euclidean distance on the MARS dataset [@defense].
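For reference, a minimal PyTorch-style sketch of the two distance choices discussed above (illustrative only, not the training code used for this work); the `eps` added before the square root is a common guard against the unbounded gradient of $\sqrt{\cdot}$ at zero:

```python
import torch

def pairwise_distances(x, squared=False, eps=1e-8):
    """Pairwise distances between rows of x (a batch of descriptors).

    x: (B, 128) tensor of descriptors.
    squared=True  -> squared Euclidean distance ||a - b||_2^2
    squared=False -> Euclidean distance ||a - b||_2 (sqrt guarded by eps)
    """
    dot = x @ x.t()                      # (B, B) inner products
    sq_norms = dot.diagonal()            # ||x_i||^2 for each row
    d2 = sq_norms.unsqueeze(0) + sq_norms.unsqueeze(1) - 2 * dot
    d2 = d2.clamp(min=0.0)               # clip small negatives from rounding
    return d2 if squared else torch.sqrt(d2 + eps)
```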
-## Progressive hard batch mining
+## Stepped hard batch training
-We find that while collapse is hard to avoid for large batch with $K > 3$, the risk of collapse disappears if we are able to reach a $\loss{BH} < \alpha$, and are hence able to use a more aggressive learning rate. We were able to train the model up to larger batches, by starting with a baseline trained model and progressively increasing the batch size, training the network while maintaing the loss below the margin to avoid collapse.
+We find that for large batch sizes with $K > 3$, avoiding collapse when training the network from randomly initialised weights is extremely difficult on the HPatches dataset, even for low learning rates ($10^{-5}$). We further find that the Adam optimizer collapses even quicker than SGD, reaching collapse within a few epochs. We were eventually able to successfully train small batches with $SK < 32$ by initialising with weights from the baseline model.
+While doing so, we observed that the risk of collapse disappears once we reach $\loss{BH} < \alpha$. After the loss has dipped below the margin, we find that we may increase the learning rate, since the collapsed state has a higher loss and the optimizer therefore has no incentive to push the network into collapse.
+
+Training with a batch size of 1024 with $K=8$ was eventually achieved by progressively increasing the batch size, starting from the baseline model's weights. We take care to increase the batch size only as far as the loss stays consistently below $\alpha$. This training technique allows the use of the Adam optimizer and higher learning rates, and can be automated (a sketch of the schedule is given below). We call this method *stepped batch hard*, as we do not find previous documentation of this technique.
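A minimal sketch of this schedule, assuming hypothetical `make_loader` and `train_one_epoch` helpers; the batch sizes and epoch cap shown are placeholders rather than the exact values used in our runs:

```python
def stepped_batch_hard(model, make_loader, train_one_epoch, alpha,
                       batch_sizes=(32, 64, 128, 256, 512, 1024),
                       max_epochs_per_step=50):
    """Grow the batch size only once the batch-hard loss sits below alpha."""
    for batch_size in batch_sizes:
        loader = make_loader(batch_size)           # yields S*K patches per batch
        for _ in range(max_epochs_per_step):
            loss = train_one_epoch(model, loader)  # mean batch-hard loss for the epoch
            if loss < alpha:
                break                              # below the margin: safe to step up
        else:
            return model                           # never dipped below alpha: stop growing
    return model
```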
+
+->> Graph of stepped batch hard
## Soft Margin Loss
@@ -109,6 +114,10 @@ One of properties of triplet loss as implemented with the margin based hinge los
It is important to note that the point of collapse when using the soft margin formulation with $\alpha = 0$ is approximately $0.69$; as per our training strategy, it is desirable to progressively increase the batch size such that the loss stays below the point of collapse.
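This value can be checked directly, assuming the softplus form $\ln(1 + e^{x})$ of the soft margin loss as in [@defense]: at collapse all descriptors coincide, both distances vanish, and each term of the loss reduces to

$$\ln\left(1 + e^{D(\nnfn(a),\,\nnfn(p)) - D(\nnfn(a),\,\nnfn(n))}\right) = \ln\left(1 + e^{0}\right) = \ln 2 \approx 0.693.$$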
+## Feature Augmentation
+
+show bad examples here
+
# Experimental Results
\begin{table}[]
@@ -128,7 +137,7 @@ Soft Batch Hard 1024 & 0.873 & 0.356 & 0.645 \\ \hline
# Visualisation
-![2D Descriptor Visualisation with TSNE](fig/tsne.pdf){width=20em height=15em}
+![2D Descriptor Visualisation with TSNE (S=128;K=16)](fig/tsne.pdf){width=20em height=15em}
# Appendix