Batch loss presents a series of problems when applied to the HPatches dataset. Implementations of batch triplet loss often use randomly sampled batches. For a dataset like MNIST, which has only 10 classes, this is not a problem, as it is very unlikely for a batch to contain no valid triplets. In the HPatches dataset, image sequences tend to have over 1000 patches, meaning that the probability of a randomly sampled batch containing no valid triplets is very high. In these situations the loss is meaningless, and random batch sampling is therefore unfeasible. Our first test of batch loss failed for this reason and required an alternative solution.
We therefore form batches of size $SK$ from $S$ patch sequences containing $K \geq 2$ patches each. Each anchor then has $(K-1)$ possible positives and $(S-1)K$ possible negatives, guaranteeing a total of $SK(K-1) \times (S-1)K$ valid triplets. This has the added benefit that, because the patches are ordered, the positive and negative distance masks can be precomputed from $S$ and $K$ alone, as sketched below. It should be noted that the difficulty of the batch is highly reliant on both $SK$ and $K$: the larger $K$, the harder the furthest anchor-positive pair is likely to be, and the bigger the batch size $SK$, the more likely it is to find a close negative.
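
The mask precomputation and batch hard selection can be summarised as follows. This is a minimal NumPy sketch, assuming the batch is ordered sequence by sequence (patches $iK$ to $iK + K - 1$ belong to sequence $i$); the function names are illustrative rather than our exact implementation.

```python
import numpy as np

def build_masks(S, K):
    """Precompute positive/negative masks for a batch of S sequences x K patches,
    assuming the batch is ordered sequence by sequence."""
    labels = np.repeat(np.arange(S), K)             # sequence id of each patch
    same_seq = labels[:, None] == labels[None, :]   # (SK, SK): same-sequence pairs
    positive_mask = same_seq & ~np.eye(S * K, dtype=bool)  # same sequence, excluding self
    negative_mask = ~same_seq                               # any patch from another sequence
    return positive_mask, negative_mask

def batch_hard_loss(dist, positive_mask, negative_mask, margin):
    """Batch hard triplet loss from a precomputed (SK, SK) distance matrix:
    for each anchor take the furthest positive and the closest negative."""
    hardest_pos = np.where(positive_mask, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(negative_mask, dist, np.inf).min(axis=1)
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```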
## Collapse
Even after implementing symmetric batch formation, we were unable to train the descriptor model without the loss becoming stuck at the margin, that is $\loss{BH} = \alpha$. Upon inspection of the descriptors it was evident that the model was producing descriptors with all dimensions approaching zero, regardless of the input. A naive solution is to apply $L2$ normalisation to the descriptors, but this does not avoid collapse, as the network simply learns to output the same descriptor every time, and it further unnecessarily restricts the descriptors to the unit hypersphere.
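
For illustration, such a normalisation can be appended as a final layer. The toy descriptor head below is hypothetical (the patch and descriptor sizes are arbitrary); only the final $L2$ normalisation step reflects the text.

```python
from keras.layers import Input, Flatten, Dense, Lambda
from keras.models import Model
import keras.backend as K

# Toy descriptor head ending in an L2-normalisation layer: every descriptor is
# forced onto the unit hypersphere, which bounds distances but does not stop
# all descriptors collapsing onto the same point.
patch = Input(shape=(32, 32, 1))        # patch size is illustrative
x = Flatten()(patch)
x = Dense(128)(x)                       # raw 128-D descriptor
descriptor = Lambda(lambda d: K.l2_normalize(d, axis=1))(x)
model = Model(patch, descriptor)
```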
Avoiding collapse of all the descriptors to a single point proved to be a hard task. Upon experimentation, we find that descriptors tend to collapse for large batch sizes with a large $K$. Initial experiments, which eagerly used large batch sizes in order to utilise TPU acceleration, invariably collapsed to the erroneous local minimum of identical descriptors. Collapse is ultimately avoided by using a low learning rate and small, easy batches, but this significantly limits the advantages of batch hard loss and makes the network extremely slow and hard to train.

We further find that squared Euclidean distance, $D\left(\nnfn(x_i), \nnfn(x_j)\right) = \norm{\nnfn(x_i) - \nnfn(x_j)}_2^2$, while cheaper to compute, is much more prone to collapse (in fact we did not successfully train a network with squared Euclidean distance). A. Hermans et al. make a similar observation about squared Euclidean distance on the MARS dataset [@defense].
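
The two distance variants differ only in a final square root. A minimal NumPy sketch of the pairwise distance matrix follows; the function name and the small epsilon (which keeps the gradient of the square root finite when the same computation is expressed in Keras) are our own illustrative choices.

```python
import numpy as np

def pairwise_distances(desc, squared=False, eps=1e-12):
    """All pairwise (squared) Euclidean distances between (SK, D) descriptors.

    The squared form skips the square root and is cheaper, but in our
    experiments it is far more prone to collapse."""
    dot = desc @ desc.T
    sq_norms = np.diag(dot)
    d2 = sq_norms[:, None] - 2.0 * dot + sq_norms[None, :]
    d2 = np.maximum(d2, 0.0)            # clamp small negatives from rounding error
    return d2 if squared else np.sqrt(d2 + eps)
```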
## Progressive hard batch mining
We find that while collapse is hard to avoid for large batches with $K > 3$, the risk of collapse disappears once we are able to reach $\loss{BH} < \alpha$, after which a more aggressive learning rate can be used. We were therefore able to train the model with larger batches by starting from a baseline trained model and progressively increasing the batch size, keeping the loss below the margin throughout training to avoid collapse. A sketch of this schedule follows.
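
The sketch below only captures the idea of advancing to a harder batch shape once the loss sits safely below the margin; the `train_epoch` callable, the example $(S, K)$ schedule and the $0.9\alpha$ safety factor are illustrative assumptions rather than our exact settings.

```python
def progressive_training(train_epoch, schedule, margin):
    """Progressively increase batch difficulty.

    train_epoch(S, K) is assumed to train for one epoch with batches of
    S sequences x K patches and return the mean batch-hard loss. We only
    advance to the next (larger, harder) batch shape once the loss is
    comfortably below the margin, so the descriptors never collapse.
    """
    for S, K in schedule:                # e.g. [(4, 2), (8, 2), (16, 4), (32, 8)]
        loss = float("inf")
        while loss > 0.9 * margin:       # 0.9 is an illustrative safety factor
            loss = train_epoch(S, K)
```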
## Soft Margin Loss
One property of triplet loss as implemented with the margin-based hinge loss function $\left[m + \bullet\right]_+$ is that it avoids optimising on already submarginal "easy" triplets, which have their loss set to zero. However, it is possible to replace the hinge with a function that still produces a small amount of loss for these correct triplets, so that they can be pulled closer together. We use the `keras.backend.softplus` function, which applies an exponentially decaying loss below zero instead of a hard cut-off. A. Hermans et al. [@defense] refer to this as the soft margin formulation and implement it as $\ln(1 + \exp(\bullet))$, which is identical to `keras`'s softplus function.

It is important to note that the point of collapse when using the soft margin formulation with $\alpha = 0$ is approximately $0.69$ (i.e. $\ln 2$), so per our training strategy it is desirable to progressively increase the batch size such that the loss stays below this point of collapse.
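
A minimal NumPy sketch of the soft margin variant, reusing the hypothetical mask and distance helpers above (in the actual Keras loss this role is played by `keras.backend.softplus`):

```python
import numpy as np

def batch_hard_soft_margin_loss(dist, positive_mask, negative_mask):
    """Soft margin batch hard loss: the hinge [m + .]_+ is replaced by
    softplus, ln(1 + exp(.)), so easy triplets still contribute a small pull.
    With identical descriptors the argument is 0 and the loss is ln 2 ~ 0.69,
    the collapse point noted above."""
    hardest_pos = np.where(positive_mask, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(negative_mask, dist, np.inf).min(axis=1)
    return np.log1p(np.exp(hardest_pos - hardest_neg)).mean()
```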
# Visualisation