From 40bff320aacdda8d577a7ebd9a74aeae45523ed8 Mon Sep 17 00:00:00 2001
From: Vasil Zlatanov
Date: Tue, 20 Nov 2018 12:26:20 +0000
Subject: Add references (and minor grammar)

---
 report/paper.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/report/paper.md b/report/paper.md
index e67b3d0..5e20702 100755
--- a/report/paper.md
+++ b/report/paper.md
@@ -94,11 +94,11 @@ PCA &Fast PCA\\
 \label{tab:eigen}
 \end{table}

-It can be proven that the eigenvalues obtained are mathematically the same,
+It can be proven that the eigenvalues obtained are mathematically the same [@lecture-notes],
 and the there is a relation between the eigenvectors obtained:

 Computing the eigenvectors **u\textsubscript{i}** for the DxD matrix AA\textsuperscript{T}
-we obtain a very large matrix. The computation process can get very expensive when D>>N.
+we obtain a very large matrix. The computation process can get very expensive when $D \gg N$.

 For such reason we compute the eigenvectors **v\textsubscript{i}** of the NxN
 matrix A\textsuperscript{T}A. From the computation it follows that $A\textsuperscript{T}A\boldsymbol{v\textsubscript{i}} = \lambda \textsubscript{i}\boldsymbol{v\textsubscript{i}}$.
@@ -285,10 +285,10 @@ of the projected samples: $W\textsuperscript{T}\textsubscript{pca} = arg\underse
 = arg\underset{W}max\frac{|W\textsuperscript{T}W\textsuperscript{T}
 \textsubscript{pca}S\textsubscript{B}W\textsubscript{pca}W|}{|W\textsuperscript{T}W\textsuperscript{T}\textsubscript{pca}S\textsubscript{W}W\textsubscript{pca}W|}$.

-Anyways performing PCA followed by LDA carries a loss of discriminative information. Such problem can
-be avoided by a linear combination of the two. In the following section we will use a 1-dimensional
-subspace *e*. The cost functions associated with PCA and LDA (with $\epsilon$ being a very small number)
-are H\textsubscript{pca}(*e*)=
+However, performing PCA followed by LDA carries a loss of discriminative information. This problem can
+be avoided through a linear combination of the two [@pca-lda]. In the following section we will use a
+1-dimensional subspace *e*. The cost functions associated with PCA and LDA (with $\epsilon$ being a very
+small number) are H\textsubscript{pca}(*e*)=
 <*e*, S\textsubscript{e}> and $H\textsubscript{lda}(e)=\frac{<e, S\textsubscript{B}e>}
 {<e, S\textsubscript{W}e>}= \frac{<e, S\textsubscript{B}e>}{<e, S\textsubscript{W}e> + \epsilon}$.

@@ -403,13 +403,13 @@ Since each model in the ensemble outputs its own predicted labels, we need to de

 ### Majority Voting

-In simple majority voting we the committee label is the most popular label outputted by all the models. This can be achieved by binning all labels produced by the ensemble of models and classifying the test case as the class with the most bins.
+In simple majority voting the committee label is the most popular label output by the models. This can be achieved by binning all labels produced by the ensemble and classifying the test case as the class with the most bins.

-This technique does is not bias towards statistically better models and values all models in the ensemble equally. It is useful when models have similar accuracies and our not specialised in classifying in their classification. 
+This technique is not biased towards statistically better models and values all models in the ensemble equally. It is useful when models have similar accuracies and are not specialised in classifying particular classes.

 ### Confidence Weighted Averaging

-Given that the model can output confidence about the label it is able to predict, we can factor the confidence of the model towards the final output of the committee machine. For instance, if a specialised model says with 95% confidence the label for the test case is "A", and two other models only classify it as "B" with 40% confidence, we would be inclined to trust the first model and classify the result as "A".
+Given that each model can output a confidence for the label it predicts, we can factor that confidence into the final output of the committee machine. For instance, if a specialised model says with 95% confidence the label for the test case is "A", and two other models only classify it as "B" with 40% confidence, we would be inclined to trust the first model and classify the result as "A".

 This technique is reliant on the model producing a confidence score for the label(s) it guesses. For K-Nearest neighbours where $K > 1$ we may produce a confidence based on the proportion of the K nearest neighbours which are the same class. For instance if $K = 5$ and 3 out of the 5 nearest neighbours are of class "C" and the other two are class "B" and "D", then we may say that the predictions are classes C, B and D, with confidence of 60%, 20% and 20% respectively.

@@ -438,7 +438,7 @@ Feature space randomisations involves randomising the features which are analyse
 \begin{figure}
 \begin{center}
 \includegraphics[width=19em]{fig/random-ensemble.pdf}
-\caption{Ensemble size effect on accraucy with 160 eigenvalues (mc=90,mr=70)}
+\caption{Ensemble size effect on accuracy with 160 eigenvalues ($m_c=90$,$m_r=70$)}
 \label{fig:random-e}
 \end{center}
 \end{figure}
@@ -472,7 +472,7 @@ Combining bagging and feature space randomization we are able to achieve higher
 \begin{figure}
 \begin{center}
 \includegraphics[width=19em]{fig/ensemble-cm.pdf}
-\caption{Ensemble confusion matrix}
+\caption{Ensemble confusion matrix (pre-committee)}
 \label{fig:ens-cm}
 \end{center}
 \end{figure}
-- 
cgit v1.2.3-54-g00ecf
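
The two combination rules touched by this patch (simple majority voting and confidence-weighted combination with K-NN-derived confidences) can be illustrated with a minimal Python sketch. This is not part of the patch or of the paper's actual implementation; the function names, the per-label summation used for the confidence-weighted rule, and the example inputs are assumptions made here for clarity.

```python
# Illustrative sketch only -- not taken from the repository's code.
from collections import Counter

def majority_vote(labels):
    """Simple majority voting: the committee label is the most common
    label predicted by the models in the ensemble."""
    return Counter(labels).most_common(1)[0][0]

def knn_confidence(neighbour_labels):
    """Confidence for a K-NN model (K > 1): the proportion of the K nearest
    neighbours in each class, e.g. C,C,C,B,D -> C:0.6, B:0.2, D:0.2."""
    k = len(neighbour_labels)
    return {label: count / k for label, count in Counter(neighbour_labels).items()}

def confidence_weighted_vote(predictions):
    """Confidence-weighted combination: each model contributes (label, confidence);
    here the confidences are summed per label and the largest total wins."""
    totals = Counter()
    for label, confidence in predictions:
        totals[label] += confidence
    return totals.most_common(1)[0][0]

# Worked example from the text: one specialised model is 95% sure of "A",
# two other models say "B" with 40% confidence each.
votes = [("A", 0.95), ("B", 0.40), ("B", 0.40)]
print(majority_vote([label for label, _ in votes]))   # -> B
print(confidence_weighted_vote(votes))                # -> A (0.95 > 0.40 + 0.40)
print(knn_confidence(["C", "C", "C", "B", "D"]))      # -> {'C': 0.6, 'B': 0.2, 'D': 0.2}
```

On the worked example from the text, majority voting returns "B" while the confidence-weighted rule returns "A", which is precisely the case the confidence-weighted scheme is meant to handle.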