# Question 1, Eigenfaces

## Partition and Standard PCA

The data is partitioned so that the same number of samples is randomly selected for each class. In this way, each training vector space is generated from the same number of elements, while the test data is taken from the remaining samples. Testing accuracy with respect to the data partition indicates that the maximum accuracy is obtained when using 90% of the data for training. Despite this, we use a 70% training split as our standard: it allows us to show more than one example of success and failure for each class when classifying the test data, and a 90% training split would make the results heavily dependent on the seed chosen.

After partitioning the data into training and testing sets, PCA is applied. The covariance matrix S, of dimension 2576x2576 (features x features), has 2576 eigenvalues and eigenvectors, but the number of non-zero eigenvalues (and corresponding eigenvectors) is at most the number of training samples minus one. This can be observed in figure \ref{fig:logeig} as a sudden drop in the eigenvalues after the 363rd.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/eigenvalues.pdf}
\caption{Log plot of all eigenvalues}
\label{fig:logeig}
\end{center}
\end{figure}

The mean image is calculated by averaging the features of the training data. Changing the randomisation seed gives very similar values, since the vast majority of the training faces used for averaging remain the same. Two mean faces obtained with different split seeds can be observed in figure \ref{fig:mean_face}.

\begin{figure}
\begin{center}
\includegraphics[width=8em]{fig/mean_face.pdf}
\includegraphics[width=8em]{fig/mean2.pdf}
\caption{Mean Faces}
\label{fig:mean_face}
\end{center}
\end{figure}

To perform face recognition we choose the best M eigenvectors, those associated with the largest eigenvalues. We tried different values of M and found an optimal point at M=99 with 57% accuracy; beyond this value the accuracy starts to flatten.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/accuracy.pdf}
\caption{NN Recognition Accuracy varying M}
\label{fig:accuracy}
\end{center}
\end{figure}

## Low dimensional computation of eigenspace

Performing the low-dimensional computation of the eigenspace for PCA, we obtain the same accuracy results as with the high-dimensional computation used previously. A comparison between the eigenvalues produced by the two computation techniques shows that the difference is very small; it is due to floating-point rounding in the `np.linalg.eigh` routine when calculating the eigenvalues and eigenvectors of the matrices $A^{T}A$ ($N \times N$) and $AA^{T}$ ($D \times D$). The ten largest eigenvalues obtained with each method are shown in table \ref{tab:eigen}.

\begin{table}[ht]
\centering
\begin{tabular}[t]{cc}
PCA &Fast PCA\\
2.9755E+05 &2.9828E+05\\
1.4873E+05 &1.4856E+05\\
1.2286E+05 &1.2259E+05\\
7.5084E+04 &7.4950E+04\\
6.2575E+04 &6.2428E+04\\
4.7024E+04 &4.6921E+04\\
3.7118E+04 &3.7030E+04\\
3.2101E+04 &3.2046E+04\\
2.7871E+04 &2.7814E+04\\
2.4396E+04 &2.4339E+04\\
\end{tabular}
\caption{Comparison of eigenvalues obtained with the two computation methods}
\label{tab:eigen}
\end{table}
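As an illustration of how the two sets of eigenvalues in table \ref{tab:eigen} can be obtained, the following minimal sketch compares the two eigendecomposition routes. It assumes the mean-subtracted training faces are stored as the columns of a $D \times N$ matrix `A`; the function name and the normalisation by $N$ are illustrative choices, not the exact code used for the report.

```python
import numpy as np

def pca_eigenvalues(A):
    """Compare the two eigendecomposition routes for PCA.

    A is the D x N matrix of mean-subtracted training faces
    (one column per image).  Returns the eigenvalues of the
    D x D covariance matrix and of the N x N surrogate matrix,
    both sorted in descending order.
    """
    N = A.shape[1]

    # Standard PCA: eigendecomposition of the D x D covariance matrix.
    w_big = np.linalg.eigvalsh((A @ A.T) / N)[::-1]

    # Low-dimensional ("fast") PCA: eigendecomposition of the N x N matrix.
    w_small = np.linalg.eigvalsh((A.T @ A) / N)[::-1]

    return w_big, w_small

# The ten largest eigenvalues from both routes agree up to floating-point
# rounding, as in table \ref{tab:eigen}:
#   w_big, w_small = pca_eigenvalues(A)
#   print(np.c_[w_big[:10], w_small[:10]])
```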
It can be proven that the eigenvalues obtained are mathematically the same, and that there is a relation between the eigenvectors obtained. Computing the eigenvectors $\boldsymbol{u}_i$ of the $D \times D$ matrix $AA^{T}$ means working with a very large matrix, and the process becomes very expensive when $D \gg N$. For this reason we instead compute the eigenvectors $\boldsymbol{v}_i$ of the $N \times N$ matrix $A^{T}A$, for which $A^{T}A\boldsymbol{v}_i = \lambda_i\boldsymbol{v}_i$. Multiplying both sides by $A$ we obtain:

$$ AA^{T}A\boldsymbol{v}_i = \lambda_i A\boldsymbol{v}_i \rightarrow SA\boldsymbol{v}_i = \lambda_i A\boldsymbol{v}_i $$

We know that $S\boldsymbol{u}_i = \lambda_i\boldsymbol{u}_i$. It follows that $AA^{T}$ and $A^{T}A$ share the same non-zero eigenvalues, and that their eigenvectors are related by $\boldsymbol{u}_i = A\boldsymbol{v}_i$ (up to normalisation of $A\boldsymbol{v}_i$).

This shows that we effectively lose no information by calculating the eigenvectors for PCA with the second method. Its main advantages are speed (the two methods take on average 3.4s and 0.11s respectively) and computational complexity (the eigenvectors found with the first method are extracted from a significantly bigger matrix). The only drawback is that with method 1 the eigenfaces are obtained directly from the covariance matrix, whereas method 2 requires the additional projection step $\boldsymbol{u}_i = A\boldsymbol{v}_i$.

# Question 1, Application of eigenfaces

## Image Reconstruction

Using the fast PCA computation, face reconstruction is then performed. The quality of the reconstruction depends on the number of eigenvectors used. Two faces, from classes 21 and 2 respectively, are reconstructed with M=10, M=100, M=200 and M=300, as shown in figures \ref{fig:face160rec} and \ref{fig:face10rec}; the last picture in each figure is the original face.

![Reconstructed Face C21\label{fig:face160rec}](fig/face160rec.pdf)

![Reconstructed Face C2\label{fig:face10rec}](fig/face10rec.pdf)

It is already observable that the improvement in reconstruction is marginal for M=200 and M=300, so choosing M close to 100 is good enough for this purpose. Indeed, looking at the variance ratio of the principal components, the contribution of components beyond the first 100 is very low, so a much larger number of components would be required to improve reconstruction quality further. With M=100 we effectively retain 97% of the variance of the initial training data for reconstruction. Refer to figure \ref{fig:eigvariance} for the data variance associated with each of the M eigenvalues.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/variance.pdf}
\caption{Data variance carried by each of M eigenvalues}
\label{fig:eigvariance}
\end{center}
\end{figure}

## Classification

The classification methods analysed for face recognition are Nearest Neighbor and an alternative method based on reconstruction error. Nearest Neighbor projects the test data onto the generated subspace, finds the closest element to the projected test image and assigns the class of that neighbor. The recognition accuracy of NN classification can be observed in figure \ref{fig:accuracy}.
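For reference, a minimal sketch of this nearest-neighbour step is given below. It assumes the training and test images are stored as columns of $D \times N$ and $D \times T$ matrices, that `W` holds the chosen M eigenfaces as columns and `mean` is the mean face; all names are illustrative rather than the report's actual code.

```python
import numpy as np

def pca_nn_classify(train, train_labels, test, W, mean):
    """1-NN classification in the PCA subspace.

    train: D x N training images (columns), test: D x T test images,
    W: D x M matrix whose columns are the chosen eigenfaces,
    mean: length-D mean face.  Returns one predicted label per test image.
    """
    train_labels = np.asarray(train_labels)

    # Project both sets onto the M-dimensional eigenface subspace.
    train_proj = W.T @ (train - mean[:, None])   # M x N
    test_proj = W.T @ (test - mean[:, None])     # M x T

    # Euclidean distance between every test and training projection,
    # then take the label of the closest training image.
    dists = np.linalg.norm(
        test_proj[:, None, :] - train_proj[:, :, None], axis=0)   # N x T
    return train_labels[np.argmin(dists, axis=0)]
```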
A confusion matrix showing success and failure cases for Nearest Neighbor classification can be observed in figure \ref{fig:cm}.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/pcacm.pdf}
\caption{Confusion Matrix NN, M=99}
\label{fig:cm}
\end{center}
\end{figure}

Two examples of the outcome of Nearest Neighbor classification are presented in figures \ref{fig:nn_fail} and \ref{fig:nn_succ}, respectively an example of classification failure and an example of successful classification.

\begin{figure}
\begin{center}
\includegraphics[width=7em]{fig/face2.pdf}
\includegraphics[width=7em]{fig/face5.pdf}
\caption{Failure case for NN. Test face left. NN right}
\label{fig:nn_fail}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=7em]{fig/success1.pdf}
\includegraphics[width=7em]{fig/success1t.pdf}
\caption{Success case for NN. Test face left. NN right}
\label{fig:nn_succ}
\end{center}
\end{figure}

It is also possible to use an NN classifier based on majority voting, where recognition relies on the K closest neighbors of the projected test image. However, this method showed its best recognition accuracy for PCA at K=1, as can be observed in figure \ref{fig:k-diff}.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/kneighbors_diffk.pdf}
\caption{NN recognition accuracy varying K. Split: 80-20}
\label{fig:k-diff}
\end{center}
\end{figure}

The process for the alternative method is somewhat similar to LDA: a separate subspace is generated for each class. These subspaces are then used to reconstruct the test image, and the class whose subspace yields the minimum reconstruction error is assigned. The alternative method shows better overall performance (see figure \ref{fig:altacc}), with a peak accuracy of 69% for M=5. The number of non-zero eigenvectors that can be used is in this case at most the number of training samples per class minus one, since the same number of eigenvectors is used for each class subspace.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/alternative_accuracy.pdf}
\caption{Accuracy of Alternative Method varying M}
\label{fig:altacc}
\end{center}
\end{figure}

A confusion matrix showing success and failure cases for the alternative method can be observed in figure \ref{fig:cm-alt}.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/altcm.pdf}
\caption{Confusion Matrix for alternative method, M=5}
\label{fig:cm-alt}
\end{center}
\end{figure}

Similarly to the NN case, we present two cases, respectively failure (figure \ref{fig:altfail}) and success (figure \ref{fig:altsucc}).

\begin{figure}
\begin{center}
\includegraphics[width=7em]{fig/FO.JPG}
\includegraphics[width=7em]{fig/FR.JPG}
\includegraphics[width=7em]{fig/FL.JPG}
\caption{Alternative method failure. Respectively test image, reconstructed image, class assigned}
\label{fig:altfail}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=7em]{fig/SO.JPG}
\includegraphics[width=7em]{fig/SR.JPG}
\includegraphics[width=7em]{fig/SL.JPG}
\caption{Alternative method success. Respectively test image, reconstructed image, class assigned}
\label{fig:altsucc}
\end{center}
\end{figure}

From the failure and success cases analysed, the factors that affect recognition the most appear to be glasses, hair, sex and the brightness of the picture.
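For reference, a minimal sketch of the alternative (reconstruction-error) classifier described above is given below. The dictionary layout of the per-class models and the function name are illustrative assumptions, not the report's actual implementation.

```python
import numpy as np

def alternative_classify(test_img, class_models, M):
    """Classify one test image by minimum per-class reconstruction error.

    class_models maps each class label to (mean_c, W_c), where mean_c is
    the class mean face and the columns of W_c are the class eigenvectors
    sorted by decreasing eigenvalue (at most samples-per-class - 1 of
    them are non-zero).
    """
    best_label, best_err = None, np.inf
    for label, (mean_c, W_c) in class_models.items():
        W = W_c[:, :M]                            # keep the top M eigenvectors
        proj = W.T @ (test_img - mean_c)          # project onto class subspace
        recon = mean_c + W @ proj                 # reconstruct the test image
        err = np.linalg.norm(test_img - recon)    # reconstruction error
        if err < best_err:
            best_label, best_err = label, err
    return best_label
```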
# Question 2, Generative and Discriminative Subspace Learning

As mentioned in the introduction, PCA is a generative method that performs dimensionality reduction while keeping most of the information from the initial training data; it is very good for reconstruction and allows very fast computation. LDA is instead a discriminative method that works in a high-dimensional space. It achieves very high classification accuracy, with the trade-off of being slightly slower than PCA and not as good for face reconstruction. To combine both methods it is possible to perform LDA in a generative subspace created by PCA.

In order to maximise class separation and minimise the distance between elements of the same class, it is necessary to maximise the function $J(W)$ (the generalised Rayleigh quotient):

$$ J(W) = \frac{W^{T}S_{B}W}{W^{T}S_{W}W} $$

with $S_B$ the between-class scatter matrix, $S_W$ the within-class scatter matrix and $W$ the set of projection vectors; $\mu_{c}$ denotes the mean of class $c$ and $\overline{x}$ the global mean.

$$ S_{B} = \sum\limits_{c}(\mu_{c} - \overline{x})(\mu_{c} - \overline{x})^{T} $$

$$ S_{W} = \sum\limits_{c}\sum\limits_{i\in c}(x_{i} - \mu_{c})(x_{i} - \mu_{c})^{T} $$

To maximise $J(W)$ we differentiate with respect to $W$ and equate to zero:

$$ \frac{d}{dW}J(W) = \frac{d}{dW}\left(\frac{W^{T}S_{B}W}{W^{T}S_{W}W}\right) = 0 $$

$$ (W^{T}S_{W}W)\frac{d(W^{T}S_{B}W)}{dW} - (W^{T}S_{B}W)\frac{d(W^{T}S_{W}W)}{dW} = 0 $$

$$ (W^{T}S_{W}W)\,2S_{B}W - (W^{T}S_{B}W)\,2S_{W}W = 0 $$

$$ S_{B}W - JS_{W}W = 0 $$

Multiplying by the inverse of $S_W$ we obtain:

$$ S_{W}^{-1}S_{B}W - JW = 0 $$

which is an eigenvalue problem for $S_{W}^{-1}S_{B}$. In the two-class case it follows that:

$$ W_{\mathrm{opt}} = \arg\max_{W}\frac{|W^{T}S_{B}W|}{|W^{T}S_{W}W|} = S_{W}^{-1}(\mu_{1} - \mu_{2}) $$

However $S_W$ is often singular, since its rank is at most $N-c$ and usually $N$ is smaller than $D$. In this case it is possible to use Fisherfaces. The optimal solution lies in $W_{\mathrm{opt}}^{T} = W_{\mathrm{lda}}^{T}W_{\mathrm{pca}}^{T}$, where $W_{\mathrm{pca}}$ is chosen to maximise the determinant of the total scatter matrix of the projected samples:

$$ W_{\mathrm{pca}} = \arg\max_{W}|W^{T}S_{T}W| $$

and

$$ W_{\mathrm{lda}} = \arg\max_{W}\frac{|W^{T}W_{\mathrm{pca}}^{T}S_{B}W_{\mathrm{pca}}W|}{|W^{T}W_{\mathrm{pca}}^{T}S_{W}W_{\mathrm{pca}}W|} $$

This result indicates that the optimal discriminant vectors can be derived in the reduced feature space of dimension $M_{\mathrm{pca}} \leq N-c$ obtained through PCA, applying FLD to further reduce the dimension to $M_{\mathrm{lda}} \leq c-1$.
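A minimal sketch of the Fisherfaces idea follows, assuming the samples are stored as columns of a $D \times N$ matrix and that a PCA projection `W_pca` (with $M_{\mathrm{pca}} \leq N-c$ columns) has already been computed; the function names and the use of a plain matrix inverse are illustrative simplifications.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return S_B and S_W as defined above; X is D x N (columns are samples)."""
    labels = np.asarray(labels)
    global_mean = X.mean(axis=1, keepdims=True)
    D = X.shape[0]
    S_B = np.zeros((D, D))
    S_W = np.zeros((D, D))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mu_c = Xc.mean(axis=1, keepdims=True)
        S_B += (mu_c - global_mean) @ (mu_c - global_mean).T
        S_W += (Xc - mu_c) @ (Xc - mu_c).T
    return S_B, S_W

def fisherfaces(X, labels, W_pca, M_lda):
    """LDA in the PCA subspace; returns the combined D x M_lda projection."""
    # In the M_pca-dimensional subspace (M_pca <= N - c) S_W is non-singular.
    Xp = W_pca.T @ X
    S_B, S_W = scatter_matrices(Xp, labels)
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1][:M_lda]
    W_lda = eigvecs[:, order].real
    return W_pca @ W_lda    # projecting with this matrix gives W_lda^T W_pca^T x
```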
In conclusion, this combined method is theoretically better than either PCA or LDA alone: the Fisherfaces approach has lower computational complexity and requires less time than LDA on the full feature space, and it improves recognition performance with respect to both PCA and LDA.

# Question 3, LDA Ensemble for Face Recognition, PCA-LDA

In this section we perform PCA-LDA recognition with NN classification. Varying the values of M_pca and M_lda we obtain the average recognition accuracies reported in figure \ref{fig:ldapca_acc}. A peak accuracy of 93% can be observed for M_pca=115, M_lda=41; however, accuracies above 90% are obtained for M_pca values between 90 and 130 and M_lda values between 30 and 50. Recognition accuracy is significantly higher than with PCA, while the run time is roughly the same, varying between 0.11s (low M_pca) and 0.19s (high M_pca).

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/ldapca3dacc.pdf}
\caption{PCA-LDA NN Recognition Accuracy varying hyper-parameters}
\label{fig:ldapca_acc}
\end{center}
\end{figure}

The scatter matrices obtained, $S_B$ (between-class scatter matrix) and $S_W$ (within-class scatter matrix), have ranks of at most c-1 (51) and N-c (at most 312 for our standard 70-30 split) respectively. In the PCA-reduced space the rank of $S_W$ equals M_pca for M_pca $\leq$ N-c.

Testing with M_lda=50 and M_pca=115 gives 92.9% accuracy. The results of this test can be observed in the confusion matrix shown in figure \ref{fig:ldapca_cm}.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/cmldapca.pdf}
\caption{PCA-LDA NN Recognition Confusion Matrix Mlda=50, Mpca=115}
\label{fig:ldapca_cm}
\end{center}
\end{figure}

Two recognition examples are reported: success in figure \ref{fig:succ_ldapca} and failure in figure \ref{fig:fail_ldapca}.

\begin{figure}
\begin{center}
\includegraphics[width=7em]{fig/ldapcaf2.pdf}
\includegraphics[width=7em]{fig/ldapcaf1.pdf}
\caption{Failure case for PCA-LDA. Test face left. NN right}
\label{fig:fail_ldapca}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=7em]{fig/ldapcas1.pdf}
\includegraphics[width=7em]{fig/ldapcas2.pdf}
\caption{Success case for PCA-LDA. Test face left. NN right}
\label{fig:succ_ldapca}
\end{center}
\end{figure}

The PCA-LDA method thus achieves a much higher recognition accuracy than PCA. The improved separation between classes and the reduced within-class distance that make these results possible can be observed in figure \ref{fig:subspaces}, in which the first three features of the subspaces obtained are plotted.

\begin{figure}
\begin{center}
\includegraphics[width=12em]{fig/SubspaceQ1.pdf}
\includegraphics[width=12em]{fig/SubspaceQL1.pdf}
\caption{Generated Subspaces (3 features). PCA on the left. PCA-LDA on the right}
\label{fig:subspaces}
\end{center}
\end{figure}
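The hyper-parameter sweep of figure \ref{fig:ldapca_acc} can be reproduced, for example, with a scikit-learn pipeline. Our own implementation uses plain NumPy; the sketch below is an equivalent alternative, and `X_train`, `y_train` and the grid ranges are illustrative assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# X_train: (n_samples, 2576) image row-vectors, y_train: identity labels.
pipe = Pipeline([
    ("pca", PCA()),
    ("lda", LinearDiscriminantAnalysis()),
    ("nn", KNeighborsClassifier(n_neighbors=1)),
])

param_grid = {
    "pca__n_components": list(range(90, 131, 5)),   # M_pca
    "lda__n_components": list(range(30, 51, 5)),    # M_lda (at most c - 1)
}

search = GridSearchCV(pipe, param_grid, cv=3)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```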
# Question 3, LDA Ensemble for Face Recognition, PCA-LDA Ensemble

So far we have established a combined PCA-LDA model which achieves good recognition accuracy while maintaining relatively low execution times, and we have looked at varying its hyper-parameters.

## Committee Machine Design

Since each model in the ensemble outputs its own predicted labels, we need to define a strategy for combining the predictions so that we obtain a combined response that is better than that of an individual model. For this project we consider two committee machine designs.

### Majority Voting

In simple majority voting the committee label is the most popular label output by the models. This can be achieved by binning all labels produced by the ensemble of models and classifying the test case as the class with the largest bin. This technique is not biased towards statistically better models and values all models in the ensemble equally. It is useful when the models have similar accuracies and are not specialised in particular classes.

### Confidence Weighted Averaging

Given that a model can output a confidence for the label it predicts, we can factor this confidence into the final output of the committee machine. For instance, if a specialised model says with 95% confidence that the label for the test case is "A", and two other models only classify it as "B" with 40% confidence, we would be inclined to trust the first model and classify the result as "A". This technique relies on the model producing a confidence score for the label(s) it guesses. For K-nearest neighbours with $K > 1$ we may produce a confidence based on the proportion of the K nearest neighbours which are of the same class. For instance, if $K = 5$ and 3 out of the 5 nearest neighbours are of class "C" while the other two are of classes "B" and "D", then we may say that the predictions are classes C, B and D, with confidences of 60%, 20% and 20% respectively.

In our testing we elected to use a committee machine employing majority voting, as we identified that a nearest-neighbour strategy with only **one** neighbour ($K=1$) performed best.

## Data Randomisation (Bagging)

The first strategy we may use in ensemble learning is randomisation of the data while keeping the model fixed. Bagging is performed by generating each dataset for the ensemble by randomly sampling the training data with replacement. We chose to perform bagging independently for each face class, so that the training-testing split ratio is maintained with and without bagging.

![Ensemble size effect on accuracy with bagging\label{fig:bagging-e}](fig/bagging.pdf)

## Feature Space Randomisation

Feature space randomisation involves randomising the features which are analysed by the model. In the case of PCA-LDA this can be achieved by randomising the eigenvectors used in the PCA step. For instance, instead of choosing the 160 most variant eigenfaces, we may choose the 90 eigenvectors with the largest variance and pick 70 of the remaining non-zero eigenvectors at random.

![Ensemble size effect on accuracy with 160 eigenvalues (m_c=90, m_r=70)\label{fig:random-e}](fig/random-ensemble.pdf)

In figure \ref{fig:random-e} we can see the effect of the ensemble size when using the biggest 90 eigenvalues and 70 random eigenvalues. We noticed that an ensemble size of around 27 is the point where the accuracy plateaus. We will use this number when performing an exhaustive search for the optimal randomness parameter.

### Optimal randomness hyper-parameter

The randomness hyper-parameter for feature space randomisation can be defined as the number of features we choose to randomise. For instance, in figure \ref{fig:random-e} we chose 70 out of 160 eigenvalues to be random. We could choose to use more than 70 random eigenvalues, thereby increasing the randomness; conversely, we could decrease the randomness parameter, randomising fewer of the eigenvectors. The optimal number of constant and random eigenvectors to use is therefore an interesting question.
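Before turning to that search, a minimal sketch of the randomisation and voting steps described above is given below. It assumes the non-zero eigenvectors are stored as columns sorted by decreasing eigenvalue and that class labels are non-negative integers; the function names are illustrative.

```python
import numpy as np

def random_feature_subspace(eigvecs, m_c, m_r, rng):
    """Keep the m_c leading eigenvectors and add m_r random ones.

    eigvecs: D x K matrix of non-zero eigenvectors sorted by decreasing
    eigenvalue.  Returns a D x (m_c + m_r) projection matrix.
    """
    constant = eigvecs[:, :m_c]
    pool = np.arange(m_c, eigvecs.shape[1])
    random_pick = rng.choice(pool, size=m_r, replace=False)
    return np.hstack([constant, eigvecs[:, random_pick]])

def majority_vote(predictions):
    """Combine ensemble predictions by majority voting.

    predictions: (n_models, n_test) array of non-negative integer labels.
    """
    return np.array([
        np.bincount(predictions[:, i]).argmax()
        for i in range(predictions.shape[1])
    ])

# Example: build the subspace used in figure \ref{fig:random-e}.
# rng = np.random.default_rng(seed=0)
# W = random_feature_subspace(eigvecs, m_c=90, m_r=70, rng=rng)
```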
![Optimal M and Randomness Hyperparameter\label{fig:opti-rand}](fig/vaskplot1.pdf)

![Optimal M and Randomness Hyperparameter\label{fig:opti-rand2}](fig/vaskplot3.pdf)

The optimal randomness found through an exhaustive search, as seen in figure \ref{fig:opti-rand}, peaks at 95 randomised eigenvalues out of 155 total eigenvalues, i.e. 60 static and 95 random eigenvalues. The value of $M_{\textrm{lda}}$ in the figures is the maximum of 51.

## Comparison

Combining bagging and feature space randomisation we are able to achieve a higher test accuracy than with an individual model.

### Ensemble Confusion Matrix

![Ensemble confusion matrix\label{fig:ens-cm}](fig/ensemble-cm.pdf)

# References