# Question 1, Eigenfaces

The data is partitioned by randomly selecting the same number of samples from each class, so that every class contributes the same number of elements to the training set; the test data is taken from the remaining samples. Measuring accuracy as a function of the partition indicates that the maximum accuracy is obtained when 90% of the data is used for training. Despite this result, we use 80% of the data for training as our standard split, since it leaves enough test samples to show more than one example of success and failure per class when classifying the test data.

\begin{center} \includegraphics[width=20em]{fig/partition.pdf} \end{center}

After partitioning the data into training and testing sets, PCA is applied. The covariance matrix S, of dimension 2576x2576 (features x features), has 2576 eigenvalues and eigenvectors, but the number of non-zero eigenvalues is only equal to the number of training samples minus one. This can be observed in the graph below as a sudden drop in the eigenvalues after the 415th.

\begin{center} \includegraphics[width=20em]{fig/eigenvalues.pdf} \end{center}

The mean image is calculated by averaging the features of the training data. Changing the randomization seed gives very similar values, since the vast majority of the training faces used for averaging remain the same. The mean face for our standard seed can be observed below.

\begin{center} \includegraphics[width=8em]{fig/mean_face.pdf} \end{center}

To perform face recognition we keep the M eigenvectors associated with the largest eigenvalues. Trying different values of M, we found an optimal point at M=42 with accuracy=66.3%; beyond this value the accuracy flattens, with occasional points at which it decreases. Physically, each retained eigenvector is an eigenface capturing a direction of large variance in the training faces (such as illumination and coarse facial structure), while the eigenvectors associated with the smallest eigenvalues mostly encode fine detail and noise that contribute little to recognition.

\begin{center} \includegraphics[width=20em]{fig/accuracy.pdf} \end{center}

# Question 1, Application of eigenfaces

Performing the low-dimensional computation of the eigenspace for PCA, we obtain the same accuracy results as with the high-dimensional computation used previously. A comparison between the eigenvalues produced by the two computation techniques shows that the difference is very small and is due to numerical rounding in np.linalg.eigh when computing the eigenvalues and eigenvectors of the matrices A\textsuperscript{T}A (NxN) and AA\textsuperscript{T} (DxD). The ten largest eigenvalues obtained with each method are shown in the table below.

\begin{table}[ht]
\centering
\begin{tabular}[t]{cc}
PCA &Fast PCA\\
2.9755E+05 &2.9828E+05\\
1.4873E+05 &1.4856E+05\\
1.2286E+05 &1.2259E+05\\
7.5084E+04 &7.4950E+04\\
6.2575E+04 &6.2428E+04\\
4.7024E+04 &4.6921E+04\\
3.7118E+04 &3.7030E+04\\
3.2101E+04 &3.2046E+04\\
2.7871E+04 &2.7814E+04\\
2.4396E+04 &2.4339E+04\\
\end{tabular}
\caption{Comparison of eigenvalues obtained with the two computation methods}
\end{table}

It can be proven that the non-zero eigenvalues obtained are mathematically the same and that there is a simple relation between the eigenvectors. Computing the eigenvectors **u\textsubscript{i}** of the DxD matrix AA\textsuperscript{T} requires the eigendecomposition of a very large matrix, which becomes very expensive when D>>N. For this reason we instead compute the eigenvectors **v\textsubscript{i}** of the NxN matrix A\textsuperscript{T}A, as sketched below.
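A minimal NumPy sketch of this low-dimensional computation is shown below; it assumes the mean-subtracted training faces are stored column-wise in a D x N matrix `A`, and the function and variable names (`fast_pca`, `U`, `V`) are illustrative rather than taken from our implementation:

```python
import numpy as np

def fast_pca(A, M):
    """Low-dimensional PCA sketch.

    A : D x N array, columns are mean-subtracted training faces.
    M : number of eigenfaces to keep (at most N - 1 non-zero ones exist).
    """
    # Eigendecomposition of the small N x N matrix A^T A instead of the
    # large D x D matrix A A^T (the usual 1/N factor only rescales the
    # eigenvalues, so it is omitted here).
    eigvals, V = np.linalg.eigh(A.T @ A)
    order = np.argsort(eigvals)[::-1]      # eigh returns ascending order
    eigvals, V = eigvals[order], V[:, order]
    # Map each low-dimensional eigenvector v_i back to image space,
    # u_i = A v_i, and normalise so every eigenface has unit length.
    U = A @ V[:, :M]
    U /= np.linalg.norm(U, axis=0)
    return eigvals[:M], U
```

A face is then represented by the coefficients of its projection onto the columns of `U` after subtracting the mean face, which is all the later recognition steps need.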
The relation between the two sets of eigenvectors used above follows directly. By definition of the eigenvectors of A\textsuperscript{T}A,
$$ A^{T}A\boldsymbol{v}_{i} = \lambda_{i}\boldsymbol{v}_{i} $$
Multiplying both sides by A we obtain:
$$ AA^{T}A\boldsymbol{v}_{i} = \lambda_{i}A\boldsymbol{v}_{i} \rightarrow SA\boldsymbol{v}_{i} = \lambda_{i}A\boldsymbol{v}_{i} $$
Since, by definition,
$$ S\boldsymbol{u}_{i} = \lambda_{i}\boldsymbol{u}_{i} $$
it follows that AA\textsuperscript{T} and A\textsuperscript{T}A share the same non-zero eigenvalues and that their eigenvectors are related by
$$ \boldsymbol{u}_{i} = A\boldsymbol{v}_{i} $$

Using the fast PCA computation, face reconstruction is then performed. The quality of the reconstruction depends on the number of eigenvectors kept. Two faces, from classes 21 and 2 respectively, are reconstructed below with M=10, M=100, M=200 and M=300; the last picture in each row is the original face.

![Reconstructed Face C21](fig/face160rec.pdf)

![Reconstructed Face C2](fig/face10rec.pdf)

The improvement in reconstruction is already marginal between M=200 and M=300, so choosing M close to 100 is good enough for this purpose. Indeed, looking at the explained-variance ratio of the principal components, the contribution of components beyond the 100th is very low, and a much larger number of additional components would be needed to noticeably improve reconstruction quality.

\begin{center} \includegraphics[width=20em]{fig/variance.pdf} \end{center}

The classification methods analysed for face recognition are **Nearest Neighbor** and an **alternative method** based on reconstruction error. Nearest Neighbor projects a test face onto the M-dimensional eigenspace and assigns it the label of the closest projected training face; the accuracy graph in Question 1 was produced with this classifier (K=1). The alternative method builds a separate PCA subspace for each class and labels a test face with the class whose subspace reconstructs it with the smallest error. Both classifiers are sketched at the end of this section.

A confusion matrix showing success and failure cases for Nearest Neighbor classification can be observed below:

\begin{center} \includegraphics[width=20em]{fig/pcacm.pdf} %![Confusion Matrix NN, K=1](fig/pcacm.pdf) \end{center}

An example of failed classification is a test face from class 2, wrongly labeled as class 5:

\begin{center} \includegraphics[width=20em]{fig/failure_2_5.pdf} %![Class 2 (left) labeled as class 5 (right)](fig/failure_2_5.pdf) \end{center}

The alternative method shows better overall performance, with a peak accuracy of 73% for M=3. In this case the number of non-zero eigenvectors that can be used is at most the number of training samples per class minus one, since the same number of eigenvectors is used for each generated class subspace.

\begin{center} \includegraphics[width=20em]{fig/alternative_accuracy.pdf} %![Accuracy of Alternative Method varying M](fig/alternative_accuracy.pdf) \end{center}

A confusion matrix showing success and failure cases for the alternative method can be observed below:

\begin{center} \includegraphics[width=20em]{fig/altcm.pdf} %![Confusion Matrix alternative method, M=3](fig/altcm.pdf) \end{center}

It can be observed that even with this more accurate classifier, the same face from class 2 is again mislabeled as class 5.
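For reference, here is a minimal sketch of the two classifiers discussed above, assuming the eigenfaces `U`, the mean face and the projected training data come from the PCA step shown earlier; the function names (`classify_nn`, `classify_reconstruction`) and the `class_subspaces` structure are illustrative, not taken from our implementation:

```python
import numpy as np

def classify_nn(x, W_train, y_train, U, mean_face):
    """Nearest Neighbor (K=1) in the M-dimensional eigenspace.

    W_train : N x M array of projected training faces.
    y_train : array of their class labels.
    """
    w = (x - mean_face) @ U                       # project the test face
    distances = np.linalg.norm(W_train - w, axis=1)
    return y_train[np.argmin(distances)]

def classify_reconstruction(x, class_subspaces):
    """Alternative method: pick the class whose subspace reconstructs
    the test face with the smallest error.

    class_subspaces : dict label -> (class_mean, U_c), U_c of shape D x M_c.
    """
    best_label, best_err = None, np.inf
    for label, (class_mean, U_c) in class_subspaces.items():
        w = (x - class_mean) @ U_c                # project onto the class subspace
        x_rec = class_mean + U_c @ w              # reconstruct from the projection
        err = np.linalg.norm(x - x_rec)
        if err < best_err:
            best_label, best_err = label, err
    return best_label
```

In this sketch the per-class subspaces used by the alternative method are assumed to be obtained by running the same fast PCA separately on the training samples of each class.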
A further classification failure, a face from class 6 labeled as class 7, can be observed below:

\begin{center} \includegraphics[width=20em]{fig/failure_6_7.pdf} %![Class 6 (left) labeled as class 7 (right)](fig/failure_6_7.pdf) \end{center}

# Question 2, Generative and Discriminative Subspace Learning

We maximize the function J(W) (Fisher's criterion):
$$ J(W) = \frac{W^{T}S_{B}W}{W^{T}S_{W}W} \textrm{ or } J(W) = \frac{W^{T}S_{B}W}{W^{T}S_{t}W} $$
where S\textsubscript{B} is the between-class scatter matrix, S\textsubscript{W} the within-class scatter matrix, S\textsubscript{t} the total scatter matrix and W the set of projection vectors. Here $\mu_{c}$ denotes the mean vector of class c and $\overline{x}$ the overall mean of the training data:
$$ S_{B} = \sum\limits_{c}(\mu_{c} - \overline{x})(\mu_{c} - \overline{x})^{T} $$
$$ S_{W} = \sum\limits_{c}\sum\limits_{i\in c}(x_{i} - \mu_{c})(x_{i} - \mu_{c})^{T} $$
To maximize J(W) we differentiate with respect to W and equate to zero:
$$ \frac{d}{dW}J(W) = \frac{d}{dW}\left(\frac{W^{T}S_{B}W}{W^{T}S_{W}W}\right) = 0 $$
$$ (W^{T}S_{W}W)\frac{d(W^{T}S_{B}W)}{dW} - (W^{T}S_{B}W)\frac{d(W^{T}S_{W}W)}{dW} = 0 $$
$$ (W^{T}S_{W}W)\,2S_{B}W - (W^{T}S_{B}W)\,2S_{W}W = 0 $$
Dividing by $2\,W^{T}S_{W}W$ and writing $J = J(W)$:
$$ S_{B}W - JS_{W}W = 0 $$
Multiplying by the inverse of S\textsubscript{W} we obtain an eigenvalue problem for $S_{W}^{-1}S_{B}$:
$$ S_{W}^{-1}S_{B}W - JW = 0 $$
In the two-class case it follows that:
$$ W^{*} = \underset{W}{\arg\max}\left(\frac{W^{T}S_{B}W}{W^{T}S_{W}W}\right) = S_{W}^{-1}(\mu_{1} - \mu_{2}) $$
By the isomorphic mapping
$$ W = PX $$
where P are the eigenvectors generated through PCA, we can substitute for W in the J(W) expression, obtaining:
$$ J(W) = \frac{X^{T}P^{T}S_{B}PX}{X^{T}P^{T}S_{t}PX} $$
We can rewrite this expression substituting:
$$ P^{T}S_{B}P = \widetilde{S}_{B} \textrm{ and } P^{T}S_{t}P = \widetilde{S}_{t} $$
$$ J(W) = \widetilde{J}(X) = \frac{X^{T}\widetilde{S}_{B}X}{X^{T}\widetilde{S}_{t}X} $$
$\widetilde{S}_{B}$ and $\widetilde{S}_{t}$ are respectively positive semi-definite and positive definite, so $\widetilde{J}(X)$ acts like Fisher's criterion, but in the PCA-transformed space. This method does not result in any loss of discriminative information with respect to Fisher's criterion.

*Proof:* The set of optimal discriminant vectors for LDA can be found directly in R\textsuperscript{n}, but this computation is difficult because the dimension is very high, and S\textsubscript{t} is singular whenever the number of training samples is smaller than the feature dimension, as is the case here. Fortunately, the problem can be reduced to a lower-dimensional space. Let $b_{1}, \dots, b_{n}$ be the eigenvectors of S\textsubscript{t}; the first M = *rank*(S\textsubscript{t}) of them correspond to positive eigenvalues, while the remaining eigenvectors $b_{M+1}, \dots, b_{n}$ span the null space of S\textsubscript{t}. Let $\Phi_{t}$ denote the subspace spanned by $b_{1}, \dots, b_{M}$ and $\Phi_{t}^{\perp}$ its orthogonal complement, the null space of S\textsubscript{t}.
**Theorem:** *For any arbitrary $\varphi \in R^{n}$, $\varphi$ can be written as $\varphi = X + \xi$, where $X \in \Phi_{t}$ and $\xi \in \Phi_{t}^{\perp}$, and $J(\varphi) = J(X)$.*

According to the theorem, the optimal discriminant vectors can be found after reducing the dimension of the problem, without any loss of information with respect to Fisher's criterion. The Fisherfaces method is effective because it requires less computation time than regular LDA and achieves lower error rates than the Eigenfaces method; it thus combines the strengths of discriminative and reconstructive approaches.

# Question 3, LDA Ensemble for Face Recognition, PCA-LDA

# Question 3, LDA Ensemble for Face Recognition, PCA-LDA Ensemble

# References