# Question 2, Generative and Discriminative Subspace Learning

To combine the generative (PCA) and discriminative (LDA) methods it is possible to perform LDA in a generative subspace created by PCA. In order to maximize class separation and minimize the distance between elements of the same class it is necessary to maximize the function $J(W)$ (the generalized Rayleigh quotient):

$$ J(W) = \frac{W^TS_BW}{W^TS_WW} $$

Here $S_B$ is the between-class scatter matrix, $S_W$ the within-class scatter matrix and $W$ the set of projection vectors, with $\mu_c$ the mean of class $c$ and $\overline{x}$ the global mean:

$$ S_B = \sum_{c}(\mu_c - \overline{x})(\mu_c - \overline{x})^T \qquad S_W = \sum_{c}\sum_{i\in c}(x_i - \mu_c)(x_i - \mu_c)^T $$

NEED TO REFERENCE THIS WHOLE SECTION

It can be proven that, when $S_W$ is non-singular, the optimum is

$$ W_{opt} = \arg\max_{W}\frac{|W^TS_BW|}{|W^TS_WW|} = S_W^{-1}(\mu_1 - \mu_2), $$

where the closed form on the right holds for the two-class case.
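As a concrete illustration of the closed-form two-class solution above, the following is a minimal NumPy sketch; the function name `fisher_direction`, the toy sample matrices `X1`/`X2` and the use of a least-squares solve instead of an explicit inverse are illustrative assumptions, not part of the report's code.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Illustrative sketch: two-class Fisher direction w = Sw^{-1} (mu1 - mu2).

    X1, X2: sample matrices of shape (n1, d) and (n2, d), one row per sample.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: summed outer products of the centred samples.
    Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    # Solve Sw w = (mu1 - mu2) rather than forming the inverse explicitly.
    w, *_ = np.linalg.lstsq(Sw, mu1 - mu2, rcond=None)
    return w / np.linalg.norm(w)
```

Using a least-squares solve keeps the sketch usable even when $S_W$ is poorly conditioned, which anticipates the singularity problem discussed next.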
However, $S_W$ is often singular, since its rank is at most $N - c$ and usually $N$ is smaller than $D$. In such cases the Fisherfaces approach can be used. The optimal solution then lies in $W^T_{opt} = W^T_{lda}W^T_{pca}$, where $W_{pca}$ is chosen to maximize the determinant of the total scatter matrix of the projected samples:

$$ W_{pca} = \arg\max_{W}|W^TS_TW| \qquad W_{lda} = \arg\max_{W}\frac{|W^TW^T_{pca}S_BW_{pca}W|}{|W^TW^T_{pca}S_WW_{pca}W|} $$

Performing PCA followed by LDA, however, carries a loss of discriminative information. Such a problem can be avoided through a linear combination of the two criteria. In what follows we use a one-dimensional subspace spanned by a unit vector $e$. The cost functions associated with PCA and LDA (with $\epsilon$ a very small constant) are

$$ H_{pca}(e) = \langle e, Se \rangle \qquad H_{lda}(e) = \frac{\langle e, S_Be \rangle}{\langle e, (S_W + \epsilon I)e \rangle} = \frac{\langle e, S_Be \rangle}{\langle e, S_We \rangle + \epsilon} $$

Through linear interpolation, for $0 \leq t \leq 1$:

$$ F_t(e) = \frac{1-t}{2}H_{pca}(e) + \frac{t}{2}H_{lda}(e) = \frac{1-t}{2}\langle e, Se \rangle + \frac{t}{2}\frac{\langle e, S_Be \rangle}{\langle e, S_We \rangle + \epsilon} $$

The objective is to find a unit vector $e_t$ in $\mathbf{R}^n$ (with $n$ being the number of samples) such that $e_t = \arg\max_{e}F_t(e)$.

We can model the constrained optimization problem, under the constraint $||e||^2 = 1$, with the Lagrangian $L(e, \lambda) = F_t(e) + \lambda(||e||^2 - 1)$.

To find the optimum we take the derivative with respect to $e$ and equate it to zero:

$$ \frac{\partial L(e, \lambda)}{\partial e} = \frac{\partial F_t(e)}{\partial e} + \frac{\partial \lambda(||e||^2 - 1)}{\partial e} = 0 $$

With

$$ \nabla F_t(e) = (1-t)Se + \frac{t}{\langle e, S_We \rangle + \epsilon}S_Be - t\frac{\langle e, S_Be \rangle}{(\langle e, S_We \rangle + \epsilon)^2}S_We, $$

the goal is therefore to find $e$ such that $\nabla F_t(e) = \lambda e$, i.e. to make $\nabla F_t(e)$ parallel to $e$. Since the scatter matrices are positive semi-definite, $\langle \nabla F_t(e), e \rangle \geq 0$, which means that $\lambda$ must be non-negative. In that case, normalizing both sides gives $\frac{\nabla F_t(e)}{||\nabla F_t(e)||} = e$.

We can therefore define $T(e) = \frac{\alpha e + \nabla F_t(e)}{||\alpha e + \nabla F_t(e)||}$, adding a positive multiple $\alpha e$ of $e$ to prevent $\lambda$ from vanishing.

It is then possible to solve the optimization problem with a gradient-based iterative procedure, setting $e_{n+1} = T(e_n)$ and updating after each step until convergence.
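The fixed-point iteration $e_{n+1} = T(e_n)$ can be summarized in a short NumPy sketch, assuming the scatter matrices `S`, `Sb` and `Sw` have already been computed from the training data; the function name, the `alpha`, `eps` and stopping-rule choices, and the random initialization are illustrative assumptions, not taken from the report.

```python
import numpy as np

def interpolated_direction(S, Sb, Sw, t, eps=1e-6, alpha=1.0,
                           n_iter=500, tol=1e-8, seed=0):
    """Illustrative sketch: unit vector e that (locally) maximizes F_t(e)
    via the fixed-point update e <- T(e) = (alpha*e + grad F_t(e)) / ||.||."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(S.shape[0])
    e /= np.linalg.norm(e)

    def grad_F(e):
        # grad F_t(e) = (1-t) S e + t/(<e,Sw e>+eps) Sb e
        #               - t <e,Sb e>/(<e,Sw e>+eps)^2 Sw e
        d = e @ Sw @ e + eps
        return ((1 - t) * (S @ e)
                + t * (Sb @ e) / d
                - t * (e @ Sb @ e) / d ** 2 * (Sw @ e))

    for _ in range(n_iter):
        v = alpha * e + grad_F(e)       # T(e) before normalization
        e_new = v / np.linalg.norm(v)   # project back onto the unit sphere
        if np.linalg.norm(e_new - e) < tol:
            return e_new
        e = e_new
    return e
```

With $t = 0$ the fixed point is a leading eigenvector of $S$ (a PCA direction), while $t = 1$ targets the regularized LDA direction, matching the interpolation between the two cost functions described above.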
# Question 3, LDA Ensemble for Face Recognition, PCA-LDA