# Question 1, Eigenfaces

## Partition and Standard PCA

The data is partitioned so that the same number of samples is randomly selected from each class. In this way, each training vector space is generated from the same number of elements. The test data is taken from the remaining samples. Measuring accuracy as a function of the data partition indicates that the maximum accuracy is obtained when 90% of the data is used for training. Despite this result, we use a 70% training split as our standard, since it allows us to present more than one example of success and failure for each class when classifying the test data. \begin{figure} \begin{center} \includegraphics[width=20em]{fig/partition.pdf} \caption{NN Recognition Accuracies for different data partitions} \end{center} \end{figure} After partitioning the data into training and testing sets, PCA is applied. The covariance matrix S, of dimension 2576x2576 (features x features), has 2576 eigenvalues and eigenvectors. However, the number of non-zero eigenvalues obtained is only equal to the number of training samples minus one. This can be observed in the graph below as a sudden drop in the eigenvalues after the 363rd. \begin{figure} \begin{center} \includegraphics[width=20em]{fig/eigenvalues.pdf} \caption{Log plot of all eigenvalues} \end{center} \end{figure} The mean image is calculated by averaging the features of the training data. Changing the randomization seed gives very similar values, since the vast majority of the training faces used for averaging are the same. The mean face for our standard seed can be observed below. \begin{figure} \begin{center} \includegraphics[width=8em]{fig/mean_face.pdf} \caption{Mean Face} \end{center} \end{figure} To perform face recognition we choose the M eigenvectors associated with the largest eigenvalues. We tried different values of M and found an optimal point at M=99, with 57% accuracy.
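The standard PCA step described above can be sketched in NumPy. The data here is random and only stands in for the training faces; the shapes (D=2576 features, N=364 training samples) are the only values taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in training matrix: D = 2576 pixel features, N = 364 training images.
D, N = 2576, 364
X = rng.random((D, N))

# Subtract the mean face, then form the D x D covariance matrix S.
mean_face = X.mean(axis=1, keepdims=True)
A = X - mean_face
S = (A @ A.T) / N

# eigh returns eigenvalues in ascending order; flip to descending.
eigvals = np.linalg.eigh(S)[0][::-1]

# After mean subtraction, at most N - 1 eigenvalues are non-zero:
# the sudden drop visible in the log plot after the 363rd.
nonzero = int(np.sum(eigvals > 1e-8 * eigvals[0]))
print(nonzero)
```

The count of numerically non-zero eigenvalues equals N-1 = 363, matching the rank argument in the text.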
After this value the accuracy starts to flatten. \begin{figure} \begin{center} \includegraphics[width=20em]{fig/accuracy.pdf} \caption{NN Recognition Accuracy varying M} \end{center} \end{figure}

## Low-dimensional computation of the eigenspace

Performing the low-dimensional computation of the eigenspace for PCA, we obtain the same accuracy results as with the high-dimensional computation used previously. A comparison between the eigenvalues produced by the two computation techniques shows that the difference is very small, due to rounding in the `np.linalg.eigh` function when calculating the eigenvalues and eigenvectors of the matrices $A^TA$ (NxN) and $AA^T$ (DxD). The ten largest eigenvalues obtained with each method are shown in the table below. \begin{table}[ht] \centering \begin{tabular}[t]{cc} PCA &Fast PCA\\ 2.9755E+05 &2.9828E+05\\ 1.4873E+05 &1.4856E+05\\ 1.2286E+05 &1.2259E+05\\ 7.5084E+04 &7.4950E+04\\ 6.2575E+04 &6.2428E+04\\ 4.7024E+04 &4.6921E+04\\ 3.7118E+04 &3.7030E+04\\ 3.2101E+04 &3.2046E+04\\ 2.7871E+04 &2.7814E+04\\ 2.4396E+04 &2.4339E+04\\ \end{tabular} \caption{Comparison of eigenvalues obtained with the two computation methods} \end{table} It can be proven that the eigenvalues obtained are mathematically the same, and that there is a relation between the eigenvectors obtained. Computing the eigenvectors $\boldsymbol{u}_i$ of the DxD matrix $AA^T$ involves a very large matrix, and the computation can become very expensive when D>>N. For this reason we compute the eigenvectors $\boldsymbol{v}_i$ of the NxN matrix $A^TA$, for which $A^TA\boldsymbol{v}_i = \lambda_i\boldsymbol{v}_i$.
Multiplying both sides by A we obtain: $$ AA^TA\boldsymbol{v}_i = \lambda_i A\boldsymbol{v}_i \rightarrow SA\boldsymbol{v}_i = \lambda_i A\boldsymbol{v}_i $$ We know that $S\boldsymbol{u}_i = \lambda_i\boldsymbol{u}_i$. It follows that $AA^T$ and $A^TA$ have the same non-zero eigenvalues, and that their eigenvectors are related by $\boldsymbol{u}_i = A\boldsymbol{v}_i$ (up to normalization). Notice that we effectively lose no information by calculating the eigenvectors for PCA with the second method. Its main advantages are speed (the two methods take on average 3.4s and 0.14s respectively) and computational complexity, since the eigenvectors found with the first method are extracted from a significantly larger matrix. The only drawback is that with method 1 the eigenfaces are generated directly from the covariance matrix, whereas method 2 requires an additional projection step.

# Question 1, Application of eigenfaces

## Image Reconstruction

Using the fast PCA method, face reconstruction is then performed. The quality of the reconstruction depends on the number of eigenvectors used. The results of varying M can be observed in the pictures below: two faces, from classes 21 and 2 respectively, are reconstructed with M=10, M=100, M=200 and M=300. The last picture is the original face. ![Reconstructed Face C21](fig/face160rec.pdf) ![Reconstructed Face C2](fig/face10rec.pdf) It is already observable that the improvement in reconstruction is marginal for M=200 and M=300. For this reason, choosing M close to 100 is good enough for this purpose.
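The fast PCA route and the reconstruction it enables can be sketched as follows. This is a minimal NumPy example on random stand-in data; the sizes are illustrative, not the coursework dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, M = 400, 50, 10  # illustrative sizes, with D >> N as in the text

X = rng.random((D, N))
mean = X.mean(axis=1, keepdims=True)
A = X - mean

# Fast PCA: eigenvectors v_i of the small N x N matrix A^T A,
# mapped to eigenfaces u_i = A v_i and normalised to unit length.
vals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(vals)[::-1][:M]
U = A @ V[:, order]
U /= np.linalg.norm(U, axis=0)

# The non-zero eigenvalues of A A^T and A^T A agree (the fast-PCA claim).
full_vals = np.linalg.eigh(A @ A.T)[0][::-1][:M]
small_vals = np.sort(vals)[::-1][:M]
print(np.allclose(full_vals, small_vals))

# Reconstruct a face from its first M PCA coefficients.
x = X[:, [0]]
w = U.T @ (x - mean)      # project onto the eigenface subspace
x_rec = mean + U @ w      # quality improves as M grows

err = np.linalg.norm(x - x_rec)
```

The projection step is the extra work method 2 requires, but the reconstruction error it yields is never worse than keeping the mean face alone.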
Observing the variance ratio of the principal components confirms this: the contribution of components beyond the 100th is very low, so a much larger number of components would be required to further improve reconstruction quality. With M=100 we effectively use 97% of the information from our initial training data for reconstruction. \begin{figure} \begin{center} \includegraphics[width=20em]{fig/variance.pdf} \caption{Data variance carried by each of M eigenvalues} \end{center} \end{figure}

## Classification

The classification methods analysed for face recognition are Nearest Neighbor and an alternative method based on reconstruction error. Nearest Neighbor projects the test data onto the generated subspace, finds the closest element to the projected test image, and assigns the class of that neighbor. The recognition accuracy of NN classification can be observed in Figure 4. A confusion matrix showing success and failure cases for Nearest Neighbor classification can be observed below: \begin{figure} \begin{center} \includegraphics[width=20em]{fig/pcacm.pdf} \caption{Confusion Matrix NN, M=99} \end{center} \end{figure} Two examples of the outcome of Nearest Neighbor classification are presented below: one example of classification failure and one of successful classification. \begin{figure} \begin{center} \includegraphics[width=7em]{fig/face2.pdf} \includegraphics[width=7em]{fig/face5.pdf} \caption{Failure case for NN. Test face left. NN right} \end{center} \end{figure} \begin{figure} \begin{center} \includegraphics[width=7em]{fig/success1.pdf} \includegraphics[width=7em]{fig/success1t.pdf} \caption{Success case for NN. Test face left. NN right} \end{center} \end{figure} It is also possible to use an NN classification that takes majority voting into account: with this method, recognition is based on the K closest neighbors of the projected test image.
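The NN classification step (project, then find the closest training sample) can be sketched on synthetic data. The class templates and noise level below are invented for illustration and do not come from the coursework dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 3 classes, samples clustered around a class template.
D, n_classes, per_class = 100, 3, 5
templates = rng.random((D, n_classes))
X_train = np.hstack([templates[:, [c]] + 0.05 * rng.standard_normal((D, per_class))
                     for c in range(n_classes)])
y_train = np.repeat(np.arange(n_classes), per_class)

# PCA subspace from the training data (fast, low-dimensional route).
mean = X_train.mean(axis=1, keepdims=True)
A = X_train - mean
vals, V = np.linalg.eigh(A.T @ A)
U = A @ V[:, np.argsort(vals)[::-1][:5]]
U /= np.linalg.norm(U, axis=0)
W_train = U.T @ A  # projected training set

def nn_classify(x):
    """Project a test face and return the class of its nearest neighbour."""
    w = U.T @ (x - mean)
    return int(y_train[np.argmin(np.linalg.norm(W_train - w, axis=0))])

# A noisy sample drawn near the class-1 template.
x_test = templates[:, [1]] + 0.05 * rng.standard_normal((D, 1))
pred = nn_classify(x_test)
```

K-nearest-neighbor voting replaces the single `argmin` with a majority vote over the K smallest distances.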
However, this method showed its best recognition accuracy for PCA with K=1, as can be observed from the graph below. \begin{figure} \begin{center} \includegraphics[width=20em]{fig/kneighbors_diffk.pdf} \caption{NN recognition accuracy varying K. Split: 80-20} \end{center} \end{figure} The alternative method is somewhat similar to LDA: a different subspace is generated for each class. These subspaces are then used to reconstruct the test image, and the class of the subspace that yields the minimum reconstruction error is assigned. The alternative method shows better overall performance, with a peak accuracy of 69% for M=5. The maximum number of non-zero eigenvectors that can be used is in this case at most the number of training samples per class minus one, since the same number of eigenvectors is used for each generated class subspace. \begin{figure} \begin{center} \includegraphics[width=20em]{fig/alternative_accuracy.pdf} \caption{Accuracy of Alternative Method varying M} \end{center} \end{figure} A confusion matrix showing success and failure cases for alternative method classification can be observed below: \begin{figure} \begin{center} \includegraphics[width=20em]{fig/altcm.pdf} \caption{Confusion Matrix for alternative method, M=5} \end{center} \end{figure} Similarly to the NN case, we present two cases: a failure and a success. The pictures on the right show the reconstructed images. \begin{figure} \begin{center} \includegraphics[width=7em]{fig/FO.JPG} \includegraphics[width=7em]{fig/FR.JPG} \includegraphics[width=7em]{fig/FL.JPG} \caption{Alternative method failure. Respectively test image, reconstructed image, class assigned} \end{center} \end{figure} \begin{figure} \begin{center} \includegraphics[width=7em]{fig/SO.JPG} \includegraphics[width=7em]{fig/SR.JPG} \includegraphics[width=7em]{fig/SL.JPG} \caption{Alternative method success.
Respectively test image, reconstructed image, class assigned} \end{center} \end{figure} From the failure and success cases analysed, it is noticeable that the parameters affecting recognition the most are glasses, hair, sex and picture brightness.

# Question 2, Generative and Discriminative Subspace Learning

As mentioned in the introduction, PCA is a generative method that performs dimensionality reduction while keeping most of the information from the initial training data. It is a very good method for reconstruction and allows very fast computation. LDA is instead a discriminative method that operates in the original high-dimensional space. It achieves very high classification accuracy, with the tradeoff of being slightly slower than PCA and not as good for face reconstruction. To combine both methods, it is possible to perform LDA in a generative subspace created by PCA. In order to maximize class separation and minimize the distance between elements of the same class, it is necessary to maximize the function J(W) (the generalized Rayleigh quotient): $J(W) = \frac{W^T S_B W}{W^T S_W W}$ with $S_B$ the between-class scatter matrix, $S_W$ the within-class scatter matrix, W the set of projection vectors, and $\mu_c$ the mean of class c.
$$ S_B = \sum\limits_{c}(\mu_c - \overline{x})(\mu_c - \overline{x})^T $$ $$ S_W = \sum\limits_{c}\sum\limits_{i\in c}(x_i - \mu_c)(x_i - \mu_c)^T $$ To maximize J(W) we differentiate with respect to W and equate to zero: $$ \frac{d}{dW}J(W) = \frac{d}{dW}\left(\frac{W^T S_B W}{W^T S_W W}\right) = 0 $$ $$ (W^T S_W W)\frac{d(W^T S_B W)}{dW} - (W^T S_B W)\frac{d(W^T S_W W)}{dW} = 0 $$ $$ (W^T S_W W)\,2S_B W - (W^T S_B W)\,2S_W W = 0 $$ Dividing through by $W^T S_W W$ gives $$ S_B W - J S_W W = 0 $$ Multiplying by the inverse of $S_W$ we obtain: $$ S_W^{-1} S_B W - J W = 0 $$ which is an eigenvalue problem for $S_W^{-1} S_B$. In the two-class case it follows that: $$ W_{opt} = \arg\underset{W}{\max}\left|\frac{W^T S_B W}{W^T S_W W}\right| = S_W^{-1}(\mu_1 - \mu_2) $$ However, $S_W$ is often singular, since the rank of $S_W$ is at most N-c and usually N is smaller than D. In such a case it is possible to use Fisherfaces.
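The derivation above can be checked numerically: for two classes, the leading eigenvector of $S_W^{-1}S_B$ coincides, up to scale, with the closed form $S_W^{-1}(\mu_1 - \mu_2)$. The data below is synthetic and serves only to illustrate the algebra:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic classes in D = 5 dimensions, n samples each.
D, n = 5, 50
shift = np.array([2.0, 0, 0, 0, 0])
X1 = rng.standard_normal((n, D)) + shift
X2 = rng.standard_normal((n, D)) - shift
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
xbar = np.vstack([X1, X2]).mean(axis=0)

# Scatter matrices as defined in the text.
S_B = np.outer(mu1 - xbar, mu1 - xbar) + np.outer(mu2 - xbar, mu2 - xbar)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

# Maximiser of J(W): the leading eigenvector of S_W^{-1} S_B.
vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w_eig = np.real(vecs[:, np.argmax(np.real(vals))])

# Two-class closed form: w proportional to S_W^{-1}(mu1 - mu2).
w_closed = np.linalg.inv(S_W) @ (mu1 - mu2)

# The two directions coincide up to scale: the cosine of their angle is 1.
cos = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
```

With equal class sizes, $S_B$ has rank one along $\mu_1 - \mu_2$, which is why the eigenproblem collapses to the closed-form direction.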
The optimal solution to this problem lies in $W^T_{opt} = W^T_{lda} W^T_{pca}$, where $W_{pca}$ is chosen to maximize the determinant of the total scatter matrix of the projected samples: $$ W_{pca} = \arg\underset{W}{\max}|W^T S_T W| $$ and $$ W_{lda} = \arg\underset{W}{\max}\frac{|W^T W_{pca}^T S_B W_{pca} W|}{|W^T W_{pca}^T S_W W_{pca} W|} $$ This result indicates that the optimal discriminant vectors can be derived by first obtaining a reduced feature space of dimension $M_{pca}$ ($\leq$ N-c) through PCA, and then applying FLD to reduce the dimension to $M_{lda}$ ($\leq$ c-1). In conclusion, this method is theoretically better than LDA and PCA alone: the Fisherfaces method requires less computational complexity and less time than LDA, and improves recognition performance with respect to both PCA and LDA.

# Question 3, LDA Ensemble for Face Recognition, PCA-LDA

# Question 3, LDA Ensemble for Face Recognition, PCA-LDA Ensemble

# References