# Question 1, Eigenfaces 

The data is partitioned by randomly selecting the same number of
samples from each class for training, so that each training
subspace is generated from the same number of elements. The test
data is taken from the remaining samples. Measuring accuracy with
respect to the data partition indicates that the maximum accuracy
is obtained when 90% of the data is used for training. Despite this
result, we use a 70% training split as our standard, since it leaves
enough test samples to show more than one example of success and
failure for each class.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/partition.pdf}
\caption{NN Recognition Accuracies for different data partitions}
\end{center}
\end{figure}
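
The snippet below is a minimal sketch of such a per-class partition, assuming the images are stored as columns of a D x N array `data` with a matching `labels` array of class indices (both names are illustrative, not taken from our actual code).

```python
import numpy as np

# Minimal sketch of the per-class partition; `data` (D x N) and `labels` (N,)
# are illustrative names, not the variables used in our actual code.
def partition(data, labels, train_fraction=0.7, seed=0):
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n_train = int(round(train_fraction * len(idx)))
        train_idx.extend(idx[:n_train])   # same number of samples per class
        test_idx.extend(idx[n_train:])    # remaining samples become test data
    return (data[:, train_idx], labels[train_idx],
            data[:, test_idx], labels[test_idx])
```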

After partitioning the data into training and testing sets,
PCA is applied. The covariance matrix S, of dimension
2576x2576 (features x features), has 2576 eigenvalues and
eigenvectors, but the number of non-zero eigenvalues is only
equal to the number of training samples minus one. This can be
observed in the graph below as a sudden drop in the eigenvalues
after the 363rd.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/eigenvalues.pdf}
\caption{Log plot of all eigenvalues}
\end{center}
\end{figure}
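
A minimal sketch of this computation is shown below, assuming the training images are stacked as columns of a D x N matrix `train_data` (an illustrative name); only around N-1 of the returned eigenvalues come out numerically non-zero.

```python
import numpy as np

# Minimal sketch: `train_data` is an illustrative D x N matrix
# (D = 2576 features, N training images stacked as columns).
def covariance_eigenvalues(train_data):
    mean_face = train_data.mean(axis=1, keepdims=True)
    A = train_data - mean_face            # centred data, D x N
    S = (A @ A.T) / A.shape[1]            # covariance matrix, D x D
    eigenvalues = np.linalg.eigvalsh(S)   # returned in ascending order
    return eigenvalues[::-1]              # largest first

# Only about N-1 eigenvalues are numerically non-zero:
# print(np.sum(covariance_eigenvalues(train_data) > 1e-8))
```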

The mean image is calculated by averaging the features of the
training data. Changing the randomization seed gives very similar
values, since the vast majority of the training faces used for
averaging remain the same. The mean face for our standard seed
can be observed below.

\begin{figure}
\begin{center}
\includegraphics[width=8em]{fig/mean_face.pdf}
\caption{Mean Face}
\end{center}
\end{figure}

To perform face recognition we choose the M eigenvectors
associated with the largest eigenvalues. We tried different
values of M and found an optimal point at M=99, with an
accuracy of 57%. Beyond this value the accuracy starts
to flatten.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/accuracy.pdf}
\caption{NN Recognition Accuracy varying M}
\end{center}
\end{figure}

# Question 1, Application of eigenfaces

Performing the low-dimensional computation of the eigenspace for
PCA, we obtain the same accuracy results as with the
high-dimensional computation previously used. A comparison between
the eigenvalues obtained with the two computation techniques shows
that the difference is very small (due to numerical rounding in the
np.linalg.eigh function when calculating the eigenvalues
and eigenvectors of the matrices A\textsuperscript{T}A (NxN) and AA\textsuperscript{T}
(DxD)).

The ten largest eigenvalues obtained with each method
are shown in the table below.

\begin{table}[ht]
\centering
\begin{tabular}[t]{cc}
PCA &Fast PCA\\
2.9755E+05 &2.9828E+05\\
1.4873E+05 &1.4856E+05\\
1.2286E+05 &1.2259E+05\\
7.5084E+04 &7.4950E+04\\
6.2575E+04 &6.2428E+04\\
4.7024E+04 &4.6921E+04\\
3.7118E+04 &3.7030E+04\\
3.2101E+04 &3.2046E+04\\
2.7871E+04 &2.7814E+04\\
2.4396E+04 &2.4339E+04\\
\end{tabular}
\caption{Comparison of the eigenvalues obtained with the two computation methods}
\end{table}

It can be proven that the non-zero eigenvalues obtained are mathematically the same,
and that there is a relationship between the eigenvectors obtained: 

Computing the eigenvectors **u\textsubscript{i}** of the DxD matrix AA\textsuperscript{T} 
requires decomposing a very large matrix, and the computation becomes very expensive when D>>N.

For this reason we instead compute the eigenvectors **v\textsubscript{i}** of the NxN
matrix A\textsuperscript{T}A, which by definition satisfy $A\textsuperscript{T}A\boldsymbol{v\textsubscript{i}} = \lambda \textsubscript{i}\boldsymbol{v\textsubscript{i}}$.

Multiplying both sides by A we obtain: 

$$ AA\textsuperscript{T}A\boldsymbol{v\textsubscript{i}} = \lambda \textsubscript{i}A\boldsymbol{v\textsubscript{i}} \rightarrow SA\boldsymbol{v\textsubscript{i}} = \lambda \textsubscript{i}A\boldsymbol{v\textsubscript{i}} $$

We know that $S\boldsymbol{u\textsubscript{i}} = \lambda \textsubscript{i}\boldsymbol{u\textsubscript{i}}$.
 
From here it follows that AA\textsuperscript{T} and A\textsuperscript{T}A have the same non-zero eigenvalues, and that their eigenvectors are related by $\boldsymbol{u\textsubscript{i}} = A\boldsymbol{v\textsubscript{i}}$ (up to normalization).

We therefore effectively lose no information by calculating the PCA eigenvectors
with the second method. Its main advantages are speed (the two methods take
respectively 3.4s and 0.14s) and computational complexity (the eigenvectors found
with the first method are extracted from a significantly bigger matrix).

The only drawback is that with method 1 the eigenfaces are generated directly through 
the covariance matrix, whereas method 2 requires an additional projection step.
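
A minimal sketch of the low-dimensional computation is given below, assuming A is the D x N matrix of mean-centred training faces (an illustrative name); the D-dimensional eigenfaces are recovered as u\textsubscript{i} = Av\textsubscript{i} and then normalized.

```python
import numpy as np

# Sketch of the fast (low-dimensional) PCA; `A` is an illustrative name
# for the D x N matrix of mean-centred training faces.
def fast_pca(A, M):
    N = A.shape[1]
    eigvals, V = np.linalg.eigh(A.T @ A / N)   # small N x N eigenproblem
    order = np.argsort(eigvals)[::-1][:M]      # keep the M largest
    eigvals, V = eigvals[order], V[:, order]
    U = A @ V                                  # recover eigenfaces: u_i = A v_i
    U /= np.linalg.norm(U, axis=0)             # normalize columns
    return eigvals, U
```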

Using the fast PCA computation, face reconstruction is then performed.
The quality of the reconstruction depends on the number of eigenvectors used.
The results of varying M can be observed in the pictures below: two faces, from
classes 21 and 2 respectively, are reconstructed with M=10, M=100, M=200 and M=300.
The last picture is the original face.

![Reconstructed Face C21](fig/face160rec.pdf)
![Reconstructed Face C2](fig/face10rec.pdf)

It is already observable that the improvement in reconstruction is marginal for M=200
and M=300, so choosing M close to 100 is good enough for this purpose. Looking at the
variance ratio of the principal components, the contribution of components beyond the
100th is very low, so a much larger number of additional components would be required
to noticeably improve reconstruction quality. With M=100 we effectively use 97% of the
information (variance) of our initial training data for reconstruction.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/variance.pdf}
\caption{Data variance carried by each of M eigenvalues}
\end{center}
\end{figure}
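
A minimal sketch of the reconstruction step is shown below, reusing the illustrative `mean_face` and eigenface matrix `U` from the earlier snippets: the test face is projected onto the first M eigenfaces and then mapped back to image space.

```python
import numpy as np

# Sketch of reconstruction with the first M principal components; `U` (D x M_max)
# and `mean_face` (D x 1) are the illustrative eigenfaces and mean from above.
def reconstruct(face, U, mean_face, M):
    phi = face.reshape(-1, 1) - mean_face   # centre the face
    weights = U[:, :M].T @ phi              # project onto the first M eigenfaces
    return mean_face + U[:, :M] @ weights   # back-project to image space
```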

The classification methods analysed for face recognition are **Nearest Neighbor** and
an **alternative method** based on reconstruction error. In the Nearest Neighbor
approach, both training and test faces are projected onto the M-dimensional eigenspace,
and each test face is assigned the label of the closest training projection. In the
alternative method, a separate eigenspace is built for each class; each test face is
reconstructed in every class-subspace and assigned the label of the class giving the
smallest reconstruction error. The accuracy of the Nearest Neighbor classifier as a
function of M was shown in the earlier accuracy graph.
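
The sketch below illustrates the two classifiers, under the assumption that the eigenspace projections and the per-class subspaces and means have already been computed (all names are illustrative).

```python
import numpy as np

# Illustrative sketch of the two classifiers; variable names are assumptions,
# not the ones used in our actual code.
def nn_classify(train_proj, train_labels, test_proj):
    # Nearest Neighbour: label of the closest training projection (Euclidean).
    dists = np.linalg.norm(train_proj - test_proj, axis=1)
    return train_labels[np.argmin(dists)]

def reconstruction_error_classify(test_face, class_subspaces, class_means):
    # Alternative method: reconstruct the test face in each class-subspace and
    # return the index of the class with the smallest reconstruction error.
    errors = []
    for U_c, mu_c in zip(class_subspaces, class_means):
        phi = test_face - mu_c
        rec = mu_c + U_c @ (U_c.T @ phi)
        errors.append(np.linalg.norm(test_face - rec))
    return int(np.argmin(errors))
```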

A confusion matrix showing success and failure cases for Nearest Neighbor classification
can be observed below:

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/pcacm.pdf}
\caption{Confusion Matrix NN, M=99}
\end{center}
\end{figure}

Two examples of the outcome of Nearest Neighbor classification are presented below:
one failure case and one success case.

\begin{figure}
\begin{center}
\includegraphics[width=7em]{fig/face2.pdf}
\includegraphics[width=7em]{fig/face5.pdf}
\caption{Failure case for NN. Test face left. NN right}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=7em]{fig/success1.pdf}
\includegraphics[width=7em]{fig/success1t.pdf}
\caption{Success case for NN. Test face left. NN right}
\end{center}
\end{figure}

The alternative method shows better overall performance, with a peak accuracy of 69%
at M=5. In this case M can be at most the number of training samples per class minus
one, since that is the number of non-zero eigenvectors available for each generated
class-subspace.

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/alternative_accuracy.pdf}
\caption{Accuracy of Alternative Method varying M}
\end{center}
\end{figure}
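
A minimal sketch of how the class-subspaces could be built is shown below (illustrative names again); each class contributes at most N\textsubscript{c} - 1 non-zero eigenvectors, matching the limit on M discussed above.

```python
import numpy as np

# Sketch of building one eigenspace per class for the alternative method;
# `train_data` (D x N) and `train_labels` (N,) are illustrative names.
def class_subspaces(train_data, train_labels, M):
    subspaces, means = [], []
    for c in np.unique(train_labels):
        Xc = train_data[:, train_labels == c]
        mu_c = Xc.mean(axis=1, keepdims=True)
        Ac = Xc - mu_c
        eigvals, V = np.linalg.eigh(Ac.T @ Ac)   # small per-class problem
        order = np.argsort(eigvals)[::-1][:min(M, Ac.shape[1] - 1)]
        Uc = Ac @ V[:, order]                    # per-class eigenfaces
        Uc /= np.linalg.norm(Uc, axis=0)
        subspaces.append(Uc)
        means.append(mu_c)
    return subspaces, means
```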

A confusion matrix showing success and failure cases for the alternative method
classification can be observed below:

\begin{figure}
\begin{center}
\includegraphics[width=20em]{fig/altcm.pdf}
\caption{Confusion Matrix for alternative method, M=5}
\end{center}
\end{figure}

Similarly to the NN case, we present two cases, respectively failure and success.
The pictures on the right show the reconstructed images.

\begin{figure}
\begin{center}
\includegraphics[width=7em]{fig/FO.JPG}
\includegraphics[width=7em]{fig/FL.JPG}
\includegraphics[width=7em]{fig/FR.JPG}
\caption{Alternative method failure. Respectively test image, labeled class, reconstructed image}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=7em]{fig/SO.JPG}
\includegraphics[width=7em]{fig/SL.JPG}
\includegraphics[width=7em]{fig/SR.JPG}
\caption{Alternative method success. Respectively test image, labeled class, reconstructed image}
\end{center}
\end{figure}


# Question 2, Generative and Discriminative Subspace Learning

We want to maximize the function J(W) (Fisher's Criterion): 

$$ J(W) = \frac{W\textsuperscript{T}S\textsubscript{B}W}{W\textsuperscript{T}S\textsubscript{W}W}\textrm{  or  } J(W) = \frac{W\textsuperscript{T}S\textsubscript{B}W}{W\textsuperscript{T}S\textsubscript{t}W} $$ 

Here S\textsubscript{B} is the between-class scatter matrix, S\textsubscript{W} the 
within-class scatter matrix, S\textsubscript{t} the total scatter matrix and W the set 
of projection vectors. $\mu\textsubscript{c}$ denotes the mean vector of class c and 
$\overline{x}$ the overall mean:

$$ S\textsubscript{B} = \sum\limits_{c}(\mu\textsubscript{c} - \overline{x})(\mu\textsubscript{c} - \overline{x})\textsuperscript{T} $$
$$ S\textsubscript{W} = \sum\limits_{c}\sum\limits_{i\in c}(x\textsubscript{i} - \mu\textsubscript{c})(x\textsubscript{i} - \mu\textsubscript{c})\textsuperscript{T} $$
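
As a minimal sketch, the two scatter matrices can be computed directly from these definitions; `data` (D x N) and `labels` are illustrative names, not taken from our actual code.

```python
import numpy as np

# Sketch of the scatter matrices defined above; `data` (D x N) and
# `labels` (N,) are illustrative names.
def scatter_matrices(data, labels):
    D = data.shape[0]
    overall_mean = data.mean(axis=1, keepdims=True)
    Sb = np.zeros((D, D))
    Sw = np.zeros((D, D))
    for c in np.unique(labels):
        Xc = data[:, labels == c]
        mu_c = Xc.mean(axis=1, keepdims=True)
        Sb += (mu_c - overall_mean) @ (mu_c - overall_mean).T   # between-class term
        Sw += (Xc - mu_c) @ (Xc - mu_c).T                       # within-class term
    return Sb, Sw
```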

To maximize J(W) we differentiate with respect to W and equate to zero:

$$ \frac{d}{dW}J(W) = \frac{d}{dW}(\frac{W\textsuperscript{T}S\textsubscript{B}W}{W\textsuperscript{T}S\textsubscript{W}W}) = 0 $$
$$ (W\textsuperscript{T}S\textsubscript{W}W)\frac{d(W\textsuperscript{T}S\textsubscript{B}W)}{dW} -  (W\textsuperscript{T}S\textsubscript{B}W)\frac{d(W\textsuperscript{T}S\textsubscript{W}W)}{dW} = 0 $$
$$ (W\textsuperscript{T}S\textsubscript{W}W)2S\textsubscript{B}W - (W\textsuperscript{T}S\textsubscript{B}W)2S\textsubscript{W}W = 0 $$ 

Dividing by $2W\textsuperscript{T}S\textsubscript{W}W$ and recognizing $J = \frac{W\textsuperscript{T}S\textsubscript{B}W}{W\textsuperscript{T}S\textsubscript{W}W}$ gives:

$$ S\textsubscript{B}W - JS\textsubscript{W}W = 0 $$

Multiplying by the inverse of S\textsubscript{W} we obtain:
$$ S\textsubscript{W}\textsuperscript{-1}S\textsubscript{B}W - JW = 0 $$

From here, for the two-class case, it follows that:

$$ W\textsuperscript{*} = arg\underset{W}max(\frac{W\textsuperscript{T}S\textsubscript{B}W}{W\textsuperscript{T}S\textsubscript{W}W}) = S\textsubscript{W}\textsuperscript{-1}(\mu\textsubscript{1} - \mu\textsubscript{2}) $$

We now apply the isomorphic mapping $W = PX$, where P is the matrix of
eigenvectors generated through PCA.

We can substitute for W in the J(W) expression, obtaining:

$$ J(W) = \frac{X\textsuperscript{T}P\textsuperscript{T}S\textsubscript{B}PX}{X\textsuperscript{T}P\textsuperscript{T}S\textsubscript{t}PX} $$

We can rewrite such expression substituting for:

$$ P\textsuperscript{T}S\textsubscript{B}P = \widetilde{S}\textsubscript{B} \textrm{  and  } P\textsuperscript{T}S\textsubscript{t}P = \widetilde{S}\textsubscript{t} $$
$$ J(W) = \widetilde{J}(X) = \frac{X\textsuperscript{T}\widetilde{S}\textsubscript{B}X}{X\textsuperscript{T}\widetilde{S}\textsubscript{t}X} $$

$\widetilde{S}\textsubscript{B} \textrm{  and  } \widetilde{S}\textsubscript{t}$ 
are respectively positive semi-definite and positive definite, so $\widetilde{J}(X)$ 
acts as Fisher's criterion, but in the PCA-transformed space. This method 
does not result in any loss of information.
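
A minimal sketch of maximizing $\widetilde{J}(X)$ in the PCA-transformed space is given below, assuming $\widetilde{S}\textsubscript{B}$ and $\widetilde{S}\textsubscript{t}$ have already been projected with P (`Sb_tilde` and `St_tilde` are illustrative names); scipy's generalized symmetric eigensolver applies here because $\widetilde{S}\textsubscript{t}$ is positive definite.

```python
import numpy as np
from scipy.linalg import eigh

# Sketch of maximising Fisher's criterion in PCA space; `Sb_tilde` and
# `St_tilde` are illustrative names for P^T S_B P and P^T S_t P.
def fisher_directions(Sb_tilde, St_tilde, n_components):
    # Generalised symmetric eigenproblem Sb_tilde x = J St_tilde x
    # (valid since St_tilde is positive definite in the PCA space).
    eigvals, X = eigh(Sb_tilde, St_tilde)
    order = np.argsort(eigvals)[::-1][:n_components]
    return X[:, order]   # columns are the optimal discriminant directions
```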

*Proof:*

The set of optimal discriminant vectors for LDA could in principle be found 
directly in R\textsuperscript{n}, but this computation is difficult because the 
dimension is very high and S\textsubscript{t} is singular when the number of 
training samples is smaller than the dimension. Fortunately, the problem can 
be reduced to a lower dimensional space.

Suppose **b\textsubscript{1}**, ..., **b\textsubscript{n}** are the eigenvectors of 
S\textsubscript{t}, ordered by decreasing eigenvalue. The first M = *rank*(S\textsubscript{t}) 
eigenvalues are positive, while the remaining eigenvectors, from the (M+1)th to the nth, 
span the null space of S\textsubscript{t}.

**Theorem:**
*$$ \textrm{For any arbitrary } \varphi \in R\textsuperscript{n}, \varphi 
\textrm{ can be written as } \varphi = X + \xi, $$ 
$$ \textrm{ where } X \in \phi_t \textrm{ and } \xi \in 
\phi_t^{\perp}\textrm{, and } J(\varphi)=J(X) \textrm{ holds.} $$*

According to the theorem, it is possible to find optimal discriminant
vectors, reducing the problem dimension without any loss of information
with respect to Fisher’s criterion.

The Fisherfaces method is effective because it requires less computation 
time than regular LDA. It also achieves lower error rates than the 
Eigenfaces method. Thus, it combines the strengths of discriminative 
and reconstructive approaches. 

# Question 3, LDA Ensemble for Face Recognition, PCA-LDA 

# Question 3, LDA Ensemble for Face Recognition, PCA-LDA Ensemble 

# References