diff options
-rw-r--r-- | report/paper.md | 18 |
1 files changed, 9 insertions, 9 deletions
diff --git a/report/paper.md b/report/paper.md index 19d039a..65755ce 100644 --- a/report/paper.md +++ b/report/paper.md @@ -4,9 +4,9 @@ A common technique for codebook generation involves utilising K-means clustering image descriptors. In this way descriptors may be mapped to *visual* words which lend themselves to binning and therefore the creation of bag-of-words histograms for the use of classification. -In this courseworok 100-thousand random SIFT descriptors (with 128 dimenions) of the `Caltech_101` dataset are used to build the K-means visual vocabulary. +In this coursework 100-thousand random SIFT descriptors (with 128 dimenions) of the `Caltech_101` dataset are used to build the K-means visual vocabulary. -Both training and testing use 15 randomly selected images from the 10 available classess. +Both training and testing use 15 randomly selected images from the 10 available classes. ## Vocabulary size @@ -14,7 +14,7 @@ The number of clusters or the number of centroids determines the vocabulary size ## Bag-of-words histogram of descriptor vectors -An example of histograms for training and testing images is shown on figure \ref{fig:histo_tr}, computed with a vocubulary size of 100. The histograms of the same class appear to have comparable magnitudes for their respective keywords, demonstrating they have a similar number of descriptors which map to each of the clusters. The effect of the vocubalary size (as determined by the number of K-means centroids) on the classificaiton accuracy is shown in figure \ref{fig:km_vocsize}. We find that a small vocabulary size tends to misrepresent the information contained in the different patches, resulting in poor classification accuracy. Conversly a large vocabulary size (many K-mean centroids), may display overfitting. In our tests, we begin to observe a plateau after a cluster count of 60 on figure \ref{fig:km_vocsize}. The proccess of partitioning the input space into K distinct clusters is a form of **vector quantization**. +An example of histograms for training and testing images is shown on figure \ref{fig:histo_tr}, computed with a vocabulary size of 100. The histograms of the same class appear to have comparable magnitudes for their respective keywords, demonstrating they have a similar number of descriptors which map to each of the clusters. The effect of the vocabulary size (as determined by the number of K-means centroids) on the classification accuracy is shown in figure \ref{fig:km_vocsize}. We find that a small vocabulary size tends to misrepresent the information contained in the different patches, resulting in poor classification accuracy. Conversely a large vocabulary size (many K-mean centroids), may display overfitting. In our tests, we begin to observe a plateau after a cluster count of 60 on figure \ref{fig:km_vocsize}. The process of partitioning the input space into K distinct clusters is a form of **vector quantization**. \begin{figure} \begin{center} @@ -26,7 +26,7 @@ An example of histograms for training and testing images is shown on figure \ref \end{figure} -The time complexity of quantization with a K-means codebooks is $O(DNK)$, where N is the number of entities to be clustered (descriptors), D is the dimension (of the descriptors) and K is the cluster count [@km-complexity]. As the computation time is high, the tests use a subsample of descriptors to compute the centroids (a random selection of 100 thousand descriptors). An alternative method we tried is applying PCA to the descriptors vectors to improve time performance. However, the descriptor dimension of 128 is relatiely small and as such we found PCA to be unnecessary. +The time complexity of quantization with a K-means codebooks is $O(DNK)$, where N is the number of entities to be clustered (descriptors), D is the dimension (of the descriptors) and K is the cluster count [@km-complexity]. As the computation time is high, the tests use a subsample of descriptors to compute the centroids (a random selection of 100 thousand descriptors). An alternative method we tried is applying PCA to the descriptors vectors to improve time performance. However, the descriptor dimension of 128 is relatively small and as such we found PCA to be unnecessary. K-means is a process that converges to local optima and heavily depends on the initialization values of the centroids[@kmean]. Initializing K-means is an expensive process, based on sequential attempts of centroids placement. Running for multiple instances significantly affects the computation process, leading to a linear increase in execution time. We did not observe increase in accuracy with more than one K-means clusters initializations, and therefore present results for accuracy and execution time with a single K-Means initialization. @@ -59,7 +59,7 @@ We expect a large tree depth to lead into overfitting. However for the data anal \end{center} \end{figure} -Random forests will select a random number of features on which to apply a weak learner (such as axis aligned split) and then choose the best feature of the sampled ones to perform the split on, based on a given criteria (our results use the *Gini index*). The fewer features that are compared for each split the quicker the trees are built and the more random they are. Therefore the randomness parameter can be considered as the number of features used when making splits. We evaluate accuracy given different randomness when using a K-means vocabulary of size 100 in figure \ref{fig:kmeanrandom}. The results in the figure \ref{fig:kmeanrandom} also use a forest size of 100 as we infered that this is the estimatator count for which performance gains tend to plateau (when selecting $\sqrt{n}$ random features). +Random forests will select a random number of features on which to apply a weak learner (such as axis aligned split) and then choose the best feature of the sampled ones to perform the split on, based on a given criteria (our results use the *Gini index*). The fewer features that are compared for each split the quicker the trees are built and the more random they are. Therefore the randomness parameter can be considered as the number of features used when making splits. We evaluate accuracy given different randomness when using a K-means vocabulary of size 100 in figure \ref{fig:kmeanrandom}. The results in the figure \ref{fig:kmeanrandom} also use a forest size of 100 as we inferred that this is the estimator count for which performance gains tend to plateau (when selecting $\sqrt{n}$ random features). This parameter also affects correlation between trees. We expect trees to be more correlated when using a large number of features for splits. \begin{figure} @@ -73,7 +73,7 @@ This parameter also affects correlation between trees. We expect trees to be mor Changing the randomness parameter had no significant effect on execution time. This may be attributed to the increased required tree depth to purify the training set. -Effects of vocabulary size on accuracy and time performance are shown in section I, figure \ref{fig:km_vocsize}. Time increases linearly with vocabulary size. Optimal number of cluster centers was found to be around 100, giving a good tradeoff between time and accuracy performance. As shown in figure \ref{fig:km_vocsize} the classification error in fact does no plateau completely, despite experiencing a significant decrease in gradient. +Effects of vocabulary size on accuracy and time performance are shown in section I, figure \ref{fig:km_vocsize}. Time increases linearly with vocabulary size. Optimal number of cluster centers was found to be around 100, giving a good trade-off between time and accuracy performance. As shown in figure \ref{fig:km_vocsize} the classification error in fact does no plateau completely, despite experiencing a significant decrease in gradient. ## Weak Learner comparison @@ -101,10 +101,10 @@ Figure \ref{fig:km_cm} shows a confusion matrix for RF Classification on K-means # RF codebook -An alternative to codebook creation via K-means involves using an ensemble of totally random trees. We code each decriptor according to which leaf of each tree in the ensemble it is sorted. This effectively performs an unsupervised quantization of our descriptors. The vocabulary size is determined by the number of leaves in each random tree multiplied by the ensemble size. The returned leafe node ID's are then binned into a histogram to create a bag-of-words, much like the one created using K-Means. From comparing execution times of K-means in figure \ref{fig:km_vocsize} and the RF codebook in figure \ref{fig:p3_voc} we observe considerable speed gains from utilising the RF codebook. This may be attributed to the reduced complexity of RF Codebook creation, +An alternative to codebook creation via K-means involves using an ensemble of totally random trees. We code each descriptor according to which leaf of each tree in the ensemble it is sorted. This effectively performs an unsupervised quantization of our descriptors. The vocabulary size is determined by the number of leaves in each random tree multiplied by the ensemble size. The returned leafe node ID's are then binned into a histogram to create a bag-of-words, much like the one created using K-Means. From comparing execution times of K-means in figure \ref{fig:km_vocsize} and the RF codebook in figure \ref{fig:p3_voc} we observe considerable speed gains from utilising the RF codebook. This may be attributed to the reduced complexity of RF Codebook creation, which is $O(\sqrt{D} N \log K)$ compared to $O(DNK)$ for K-means. Codebook mapping given a created vocabulary is also quicker than K-means, $O(\log K)$ (assuming a balanced tree) vs $O(KD)$. -The effect of vocabulary size on classification accuracy can be observed both in figure \ref{fig:p3_voc}, in which we independently vary number of leaves and ensemble size, and figure \ref{fig:p3_colormap}, in which both parameters are varied simultaneously. It is possible to notice that these two parameters make classification accuracy plateau for *leaves*$>80$ and *estimators*$>100$. The peaks of 82% accuracy visible on the heatmap in figure \ref{fig:p3_colormap} are highly dependent on the seed and indicate the range of *good* hyperparametres. +The effect of vocabulary size on classification accuracy can be observed both in figure \ref{fig:p3_voc}, in which we independently vary number of leaves and ensemble size, and figure \ref{fig:p3_colormap}, in which both parameters are varied simultaneously. It is possible to notice that these two parameters make classification accuracy plateau for *leaves*$>80$ and *estimators*$>100$. The peaks of 82% accuracy visible on the heatmap in figure \ref{fig:p3_colormap} are highly dependent on the seed and indicate the range of *good* hyper-parameters. \begin{figure} \begin{center} @@ -148,7 +148,7 @@ In many applications the increase in training time would not justify the small i For the `Caltech_101` dataset, a RF codebook seems to be the most suitable method to perform RF classification. -The `water_lilly` is the most misclassified class, both for K-means and RF codebook (refer to figures \ref{fig:km_cm} and \ref{fig:p3_cm}). This indicates that the quantized descriptors obtained from the class do not provide for very discriminative splits, resulting in the prioritsation of other features in the first nodes of the decision trees. +The `water_lilly` is the most misclassified class, both for K-means and RF codebook (refer to figures \ref{fig:km_cm} and \ref{fig:p3_cm}). This indicates that the quantized descriptors obtained from the class do not provide for very discriminative splits, resulting in the prioritisation of other features in the first nodes of the decision trees. # References |