# Question 2, Generative and Discriminative Subspace Learning

To combine both methods it is possible to perform LDA in a generative subspace created by PCA. In order to
maximize class separation and minimize the distance between elements of the same class, it is necessary to
maximize the generalized Rayleigh quotient

$$ J(W) = \frac{W^{T}S_{B}W}{W^{T}S_{W}W} $$

where $S_{B}$ is the between-class scatter matrix, $S_{W}$ the within-class scatter matrix and $W$ the set of
projection vectors; $\mu_{c}$ denotes the mean of class $c$.

NEED TO REFERENCE THIS WHOLE SECTION

It can be proven that, when $S_{W}$ is non-singular, the optimal projection (here for the two-class case) is

$$ W_{opt} = \arg\max_{W}\frac{|W^{T}S_{B}W|}{|W^{T}S_{W}W|} = S_{W}^{-1}(\mu_{1} - \mu_{2}). $$
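As a minimal illustrative sketch (not the report's actual code), the NumPy snippet below builds the two scatter matrices from a labelled data matrix and maximizes $J(W)$ by solving the generalized eigenproblem $S_{B}w = \lambda S_{W}w$. The function name `fisher_directions` and the one-sample-per-row layout are our own assumptions, and the snippet presumes $S_{W}$ is invertible, which, as discussed next, often fails for raw face images.

```python
import numpy as np

def fisher_directions(X, y, n_components=1):
    """Maximize J(W) = |W^T S_B W| / |W^T S_W W| via the generalized
    eigenproblem S_B w = lambda S_W w (assumes S_W is invertible).
    X: (N, D) data matrix, one sample per row; y: (N,) class labels."""
    D = X.shape[1]
    x_bar = X.mean(axis=0)                       # overall mean
    S_B = np.zeros((D, D))                       # between-class scatter
    S_W = np.zeros((D, D))                       # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        d = (mu_c - x_bar)[:, None]
        S_B += d @ d.T
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
    # Eigenvectors of S_W^{-1} S_B, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:n_components]].real

# Two-class special case (equivalent up to scale):
# w_opt = np.linalg.solve(S_W, mu_1 - mu_2)
```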
However, $S_{W}$ is often singular, since its rank is at most $N - c$ and the number of samples $N$ is
usually smaller than the dimensionality $D$. In such cases it is possible to use Fisherfaces. The optimal
solution to this problem is given by $W^{T}_{opt} = W^{T}_{lda}W^{T}_{pca}$, where $W_{pca}$ is chosen to
maximize the determinant of the total scatter matrix of the projected samples,

$$ W_{pca} = \arg\max_{W}|W^{T}S_{T}W|, $$

and

$$ W_{lda} = \arg\max_{W}\frac{|W^{T}W^{T}_{pca}S_{B}W_{pca}W|}{|W^{T}W^{T}_{pca}S_{W}W_{pca}W|}. $$

However, performing PCA followed by LDA discards some discriminative information. This problem can be
avoided by optimizing a linear combination of the two criteria. In what follows we consider a
one-dimensional subspace spanned by a unit vector $e$. The cost functions associated with PCA and LDA
(with $\epsilon$ being a very small constant that prevents the denominator from vanishing) are

$$ H_{pca}(e) = \langle e, Se \rangle \quad \mathrm{and} \quad H_{lda}(e) = \frac{\langle e, S_{B}e \rangle}{\langle e, S_{W}e \rangle + \epsilon}. $$

Through linear interpolation, for $0 \leq t \leq 1$:

$$ F_{t}(e) = \frac{1-t}{2}H_{pca}(e) + \frac{t}{2}H_{lda}(e) = \frac{1-t}{2}\langle e, Se \rangle + \frac{t}{2}\frac{\langle e, S_{B}e \rangle}{\langle e, S_{W}e \rangle + \epsilon}. $$

The objective is to find a unit vector $e_{t}$ in $\mathbb{R}^{n}$ (with $n$ being the number of samples)
such that $e_{t} = \arg\max_{e} F_{t}(e)$.

We can model the optimization problem, under the constraint $\lVert e \rVert^{2} = 1$, with the Lagrangian

$$ L(e, \lambda) = F_{t}(e) + \lambda(\lVert e \rVert^{2} - 1). $$

Taking the derivative with respect to $e$ and equating it to zero gives

$$ \frac{\partial L(e, \lambda)}{\partial e} = \frac{\partial F_{t}(e)}{\partial e} + \frac{\partial \lambda(\lVert e \rVert^{2} - 1)}{\partial e} = 0. $$

With

$$ \nabla F_{t}(e) = (1-t)Se + \frac{t}{\langle e, S_{W}e \rangle + \epsilon}S_{B}e - t\frac{\langle e, S_{B}e \rangle}{(\langle e, S_{W}e \rangle + \epsilon)^{2}}S_{W}e, $$

the goal is therefore to find $\nabla F_{t}(e) = \lambda e$, i.e. to make $\nabla F_{t}(e)$ parallel to $e$
(the constant factors are absorbed into $\lambda$). Since $S$, $S_{B}$ and $S_{W}$ are positive
semi-definite, $\langle \nabla F_{t}(e), e \rangle \geq 0$, which means that $\lambda$ must be
non-negative. For $\lambda > 0$, normalizing both sides we obtain

$$ \frac{\nabla F_{t}(e)}{\lVert \nabla F_{t}(e) \rVert} = e. $$

We can then define the mapping

$$ T(e) = \frac{\alpha e + \nabla F_{t}(e)}{\lVert \alpha e + \nabla F_{t}(e) \rVert}, $$

where a positive multiple of $e$, $\alpha e$, is added to prevent $\lambda$ from vanishing. The
optimization problem can then be solved iteratively with the fixed-point (gradient-ascent) update
$e_{n+1} = T(e_{n})$, updating $e$ after each step until convergence.
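The sketch below is one possible NumPy implementation of this fixed-point iteration, intended only as an illustration of the update $e_{n+1} = T(e_{n})$ rather than as the report's actual code. The function names (`grad_F_t`, `combined_direction`) and default parameters (`alpha`, `eps`, `n_iter`, `tol`) are our own assumptions, and the scatter matrices `S`, `S_B` and `S_W` are assumed to be precomputed (for example as in the earlier snippet).

```python
import numpy as np

def grad_F_t(e, S, S_B, S_W, t, eps=1e-8):
    """Gradient of F_t(e) = (1-t)/2 <e,Se> + t/2 <e,S_B e> / (<e,S_W e> + eps)."""
    d = e @ S_W @ e + eps                        # regularized within-class term
    return ((1 - t) * S @ e
            + (t / d) * S_B @ e
            - t * (e @ S_B @ e) / d**2 * S_W @ e)

def combined_direction(S, S_B, S_W, t, alpha=1.0, n_iter=500, tol=1e-9, seed=0):
    """Fixed-point iteration e_{n+1} = T(e_n), with
    T(e) = (alpha*e + grad F_t(e)) / ||alpha*e + grad F_t(e)||."""
    rng = np.random.default_rng(seed)
    e = rng.standard_normal(S.shape[0])
    e /= np.linalg.norm(e)                       # random unit starting vector
    for _ in range(n_iter):
        v = alpha * e + grad_F_t(e, S, S_B, S_W, t)
        e_new = v / np.linalg.norm(v)
        if np.linalg.norm(e_new - e) < tol:      # converged: grad F_t(e) parallel to e
            break
        e = e_new
    return e
```

For $t = 0$ the update reduces to a shifted power iteration on $S$ and recovers the leading PCA direction, while $t = 1$ optimizes only the (regularized) LDA criterion; intermediate values of $t$ trade the two objectives off.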
# Question 3, LDA Ensemble for Face Recognition, PCA-LDA