With the sun setting on yet another trimester at university, I thought I’d sneak in another blog on two fantastic machine learning algorithms, KMeans clustering and the Support Vector Machine (SVM), which were my marquee players in a project on the comments made in the New York Times during February 2017. Due to strict time constraints, the project could only be carried out on a small dataset, but I plan on rerunning it on a larger dataset during the month-long vacation from mid-June to July.
Coming back to the project, the main focus was to sufficiently train the machine on the dataset and then, at a satisfactory level, have it accurately determine the group of people most likely to have made a particular comment. One thing to note up front: SVM is a supervised algorithm, so it needs labelled data to train on, and while KMeans itself is unsupervised, scoring its clusters (as we do below) still requires ground-truth labels.
The earliest step in both algorithms was data pre-processing: removing the most commonly occurring stop words and then applying Porter stemming to weed out the suffixes from every word. For both algorithms, we used the Bag of Words (BOW) representation and the Term Frequency–Inverse Document Frequency (TF-IDF) representation to compare how accurately each model made predictions. We used the Normalised Mutual Information (NMI) score to evaluate the performance of the KMeans algorithm on our dataset and the accuracy score to evaluate the performance of the SVM.
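To make that pipeline concrete, here’s a minimal sketch of the pre-processing and the two representations using NLTK and scikit-learn. This isn’t the project’s actual code, and the two comments are just placeholders standing in for the NYT dataset:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # Lowercase, strip punctuation, drop stop words, Porter-stem the rest.
    tokens = (t.strip(".,!?;:\"'") for t in text.lower().split())
    return " ".join(stemmer.stem(t) for t in tokens
                    if t.isalpha() and t not in stop_words)

# Placeholder comments standing in for the real dataset.
comments = ["Great reporting on the budget.",
            "This editorial completely misses the point."]
cleaned = [preprocess(c) for c in comments]

# Bag of Words: raw term counts per document.
bow = CountVectorizer().fit_transform(cleaned)
# TF-IDF: the same counts, re-weighted by inverse document frequency.
tfidf = TfidfVectorizer().fit_transform(cleaned)
```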
For our KMeans clustering, the number of clusters was set with the formula k = sqrt(n/2), where n is the size of the dataset. For our Support Vector Machine (SVM), we randomly chose 80% of the data as the training set, so as to avoid any bias, while the remaining 20% was held out as test data. For both algorithms, we ran five iterations to get a more balanced result. A quick sketch of this setup is below, followed by tables showing the NMI and the accuracy obtained across the five runs.
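Here’s how that training and evaluation step might look in scikit-learn. Again, this is a hedged sketch rather than the project code: the tiny `docs` and `y` labels are placeholders for the vectorized comments and their group labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import normalized_mutual_info_score, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins: real X comes from the vectorizer above, y from the comment groups.
docs = ["tax cut budget", "budget deficit tax", "movie review film", "film actor review"]
y = [0, 0, 1, 1]
X = TfidfVectorizer().fit_transform(docs)

n = X.shape[0]
k = max(2, int(np.sqrt(n / 2)))   # clusters = sqrt(n/2), e.g. n = 800 comments -> k = 20

# KMeans: cluster the comments, then score the clusters against the true labels with NMI.
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
print("NMI:", normalized_mutual_info_score(y, clusters))

# SVM: random 80/20 train/test split, scored on held-out accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC().fit(X_tr, y_tr)
print("Accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

Repeating the last few lines five times with different random splits (and seeds) gives the per-iteration numbers in the tables below.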
Figure 1 – KMeans for BOW Representation.

| Iteration | Dataset | NMI |
|-----------|---------|-----|
| 0 | Comments for Feb 2017 | 0.046909 |
| 1 | Comments for Feb 2017 | 0.046909 |
| 2 | Comments for Feb 2017 | 0.046909 |
| 3 | Comments for Feb 2017 | 0.046909 |
| 4 | Comments for Feb 2017 | 0.046909 |
Figure 1a – KMeans for TF-IDF Representation.

| Iteration | Dataset | NMI |
|-----------|---------|-----|
| 0 | Comments for Feb 2017 | 0.14627 |
| 1 | Comments for Feb 2017 | 0.14627 |
| 2 | Comments for Feb 2017 | 0.14627 |
| 3 | Comments for Feb 2017 | 0.14627 |
| 4 | Comments for Feb 2017 | 0.14627 |
Figure 2 – SVM for BOW Representation.

| Iteration | Dataset | Accuracy |
|-----------|---------|----------|
| 0 | Comments for Feb 2017 | 0.775 |
| 1 | Comments for Feb 2017 | 0.850 |
| 2 | Comments for Feb 2017 | 0.750 |
| 3 | Comments for Feb 2017 | 0.750 |
| 4 | Comments for Feb 2017 | 0.775 |
Figure 2a – SVM for TF-IDF Representation.

| Iteration | Dataset | Accuracy |
|-----------|---------|----------|
| 0 | Comments for Feb 2017 | 0.875 |
| 1 | Comments for Feb 2017 | 0.750 |
| 2 | Comments for Feb 2017 | 0.700 |
| 3 | Comments for Feb 2017 | 0.850 |
| 4 | Comments for Feb 2017 | 0.875 |
For both algorithms, we observe that the TF-IDF representation works better than the Bag of Words (BOW) representation. Moreover, the SVM model is a better fit for our dataset, reaching a peak accuracy of 87.5%, while KMeans clustering never exceeded an NMI of 0.146 (keeping in mind that NMI and accuracy are different metrics, so this is a rough comparison rather than a like-for-like one).
Well, that’s about it for now. Can’t wait to test the algorithms on a bigger dataset. But for now, gotta hit the books. Exams coming up…