report/paper.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178

# Codebooks

## K-means codebook 

A common technique for codebook generation involves utilising K-means clustering on a sample of the
image descriptors. In this way descriptors may be mapped to *visual* words which lend themselves to
binning and therefore the creation of bag-of-words histograms for the use of classification.

In this courseworok 100-thousand random SIFT descriptors (size=128) of the Caltech_101 dataset are used to build the K-means visual vocabulary.

Both training and testing use 15 randomly selected images from the 10 available classess.

## Vocabulary size 

The number of clusters or the number of centroids determines the vocabulary size when creating the codebook with the K-means the method. Each descriptor is mapped to the nearest centroid, and each descriptor belonging to that cluster is mapped to the same *visual word*. This allows similar descriptors to be mapped to the same word, allowing for comparison through bag-of-words techniques.

## Bag-of-words histogram quantisation of descriptor vectors

An example histograms for training and testing images is shown on figure \ref{fig:histo_tr}, computed with a vocubulary size of 100. The histograms of the same class appear to have comparable magnitudes for their respective keywords, demonstrating they had a similar number of descriptors which mapped to each of the clusters. The effect of the vocubalary size (as determined by the number of K-means centroids) on the classificaiton accuracy is shown in figure \ref{fig:km_vocsize}. A small vocabulary size tends to misrepresent the information contained in the different patches, resulting in poor classification accuracy. Conversly a large vocabulary size (many K-mean centroids), may display overfitting. In our tests, we observe a plateau after a cluster count of 60 on figure \ref{fig:km_vocsize}.

\begin{figure}[H]
\begin{center}
\includegraphics[width=12em]{fig/kmeans_vocsize.pdf}
\includegraphics[width=12em]{fig/time_kmeans.pdf}
\caption{Effect of vocabulary size; classification error (left) and time (right)}
\label{fig:km_vocsize}
\end{center}
\end{figure}


The time complexity of quantisation with a K-means codebooks is $O(DNK)$, where N is the number of entities to be clustered (descriptors), D is the dimension (of the descriptors) and K is the cluster count [@km-complexity]. As the computation time is high, the tests we use a subsample of descriptors to compute the centroids (a random selection of 100 thousand descriptors). An alternative method we tried is applying PCA to the descriptors vectors to improve time performance. However, the descriptor dimension of 128 is relatiely small and as such we found PCA to be unnecessary.

K-means is a process that converges to local optima and heavily depends on the initialization values of the centroids.
Initializing K-means is an expensive process, based on sequential attempts of centroids placement. Running for multiple instances significantly affects the computation process, leading to a linear increase in execution time. We did not observe increase in accuracy with K-means estimator size larger than one, and therefore present results accuracy and execution time results with a single K-Mean estimator.

\begin{figure}[H]
\begin{center}
\includegraphics[width=12em]{fig/trainhist.pdf}
\includegraphics[width=12em]{fig/testhist.pdf}
\caption{Bag-of-words histograms; Training (left), Testing (right)}
\label{fig:histo_tr}
\end{center}
\end{figure}

# RF classifier 

We use a random forest classifier to label images based on the bag-of-words histograms. Random forests are an ensemble of randomly generated decision trees. Random forest classifier performance depends on the ensemble size, tree depth, randomness and weak learner used.

## Hyperparameters tuning

Figure \ref{fig:km-tree-param} shows the effect of tree depth and number of trees, when classifying a bag-of-words created by K-means with 100 cluster centers.
Optimal values for tree depth and number of trees were found to be respectively 5 and 100 as shown in figure \ref{fig:km-tree-param}. Running for multiple seeds instances shows an average accuracy of 80% for these two parameters, peaking at 84% in very specific cases.
We expect a large tree depth to lead into overfitting. However for the data analysed it is only possible to observe a plateau in classification performance.

\begin{figure}[H]
\begin{center}
\includegraphics[width=12em]{fig/error_depth_kmean100.pdf}
\includegraphics[width=12em]{fig/trees_kmean.pdf}
\caption{K-means Classification error varying tree depth (left) and forest size (right)}
\label{fig:km-tree-param}
\end{center}
\end{figure}

Random forests will select a random number of features on which to apply a weak learner (such as axis aligned split) and then chose the best feature of the sampled ones to perform the split on, based on a given criteria (our results use the *Gini index*). The fewer features that are compared for each split the quicker the trees are built and the more random they are. Therefore the randomness parameter can be considered the number of features used when making splits. We evaluate accuracy given different randomness when using a K-means vocabulary in figure \ref{fig:kmeanrandom}. The results in the figure \ref{fig:kmeanrandom} use a forest size of 100 as we infered that this is the estimatator count for which performance gains tend to plateau (when selecting $\sqrt{n}$ random features).
This parameter also affects correlation between trees. We expect in fact trees to be more correlated when using a large number of features for splits.

\begin{figure}[H]
\begin{center}
\includegraphics[width=12em]{fig/new_kmean_random.pdf}
\includegraphics[width=12em]{fig/p3_rand.pdf}
\caption{Classification error for different number of random features; K-means left, RF-codebooks right}
\label{fig:kmeanrandom}
\end{center}
\end{figure}

Changing the randomness parameter had no significant effect on execution time. This may be attributed to increased required tree depth to purify the training set.

## Weak Learner comparison

In figure \ref{fig:2pt} it is possible to notice an improvement in recognition accuracy by 2%,
with the two pixels test, achieving better results than the axis-aligned counterpart. The two-pixels
test theoretically brings a slight deacrease in time performance due to complexity, since it adds one dimension to the computation. It is difficult to measure in our case since it should be less than a second.

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/2pixels_kmean.pdf}
\caption{K-means classification accuracy changing the type of weak learners}
\label{fig:2pt}
\end{center}
\end{figure}

Figure \ref{fig:km_cm} shows a confusion matrix for K-means+RF CLassifier with 256 centroids, a forest size of 100 and trees depth of 5. The reported accuracy for this case is 82%. Figure \ref{fig:km_succ} reports examples of failure and success cases obtained from this test, with the top performing classes being *trilobite* and *windsor_chair*. *Water_lilly* was the one that on average performed worst.

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/e100k256d5_cm.pdf}
\caption{Confusion Matrix: K=256, ClassifierForestSize=100, Depth=5}
\label{fig:km_cm}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=8em]{fig/success_km.pdf}
\includegraphics[width=8em]{fig/fail_km.pdf}
\caption{K-means + RF Classifier: Success (left); Failure (right)}
\label{fig:km_succ}
\end{center}
\end{figure}

# RF codebook

An alternative to codebook creation via K-means involves using an ensemble of totally random trees. We code each decriptor according to which leaf of each tree in the ensemble it is sorted. This effectively performs and unsupervised transformation of our dataset to a high-dimensional spare representation. The dimension of the vocabulary size is determined by the number of leaves in each random tree and the ensemble size. From comparing execution times of K-means in figure \ref{fig:km_vocsize} and the RF codebook in \ref{fig:p3_voc} we observe considerable speed gains from utilising the RF codebook. This may be attributed to the reduce complexity of RF Codebook creation, 
which is $O(\sqrt{D} N \log K)$ compared to $O(DNK)$ for K-means. Codebook mapping given a created vocabulary is also quicker than K-means, $O(\log K)$ (assuming a balanced tree) vs $O(KD)$.

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/256t1_e200D5_cm.pdf}
\caption{Confusion Matrix: CodeBookForestSize=256; ClassifierForestSize=200; Depth=5}
\label{fig:p3_cm}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=8em]{fig/success_3.pdf}
\includegraphics[width=8em]{fig/fail_3.pdf}
\caption{RF Codebooks + RF Classifier: Success (left); Failure (right)}
\label{fig:p3_succ}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=12em]{fig/error_depth_p3.pdf}
\includegraphics[width=12em]{fig/trees_p3.pdf}
\caption{RF-codebooks Classification error varying trees depth (left) and numbers of trees (right)}
\label{fig:p3_trees}
\end{center}
\end{figure}

\begin{figure}[H]
\begin{center}
\includegraphics[width=12em]{fig/p3_vocsize.pdf}
\includegraphics[width=12em]{fig/p3_time.pdf}
\caption{RF-codebooks Effect of vocabulary size; classification error (left) and time (right)}
\label{fig:p3_voc}
\end{center}
\end{figure}

# Comparison of methods and conclusions

Overall we observe slightly higher accuracy when using K-means codebooks compared to RF codebook at the expense of a higher execution time for training and testing.

As discussed in section I, due to the initialization process for optimal centroids placements, K-means can result unpreferable for large 
descriptors' sizes (in absence of methods for dimensionality reduction),
and in many cases the increase in training time would not justify the minimum increase in classification performance. 

For Caltech_101 RF-codebook seems to be the most suitable method to perform RF-classification.

It is observable that for the particular dataset we are analysing the class *water_lilly*
is the one that gets misclassified the most, both in k-means and RF-codebook (refer to figures \ref{fig:km_cm} and \ref{fig:p3_cm}). This means that the features obtained
from this class do not guarantee very discriminative splits, hence the first splits in the trees
will prioritize features taken from other classes.

# Appendix 

\begin{figure}[H]
\begin{center}
\includegraphics[width=14em]{fig/p3_colormap.pdf}
\caption{Varying leaves and estimators: effect on accuracy}
\label{fig:p3_colormap}
\end{center}
\end{figure}

# References