KMeans Clustering and Support Vector Machine.

With the sun setting on yet another trimester at university, I thought I’d sneak in another blog post on two fantastic machine learning algorithms, KMeans Clustering and Support Vector Machine (SVM), which were my marquee players in a project on the “comments made in the New York Times during Feb 2017”. Due to strict time constraints, the project could only be carried out on a smaller dataset, but I plan on running it on a larger dataset during the month-long vacation from mid-June to July.

Coming back to the project, the main focus was to train the machine on the dataset and, once it performed well enough, ask it to determine the group of people most likely to have made a particular comment. One thing to note up front: SVM is a supervised algorithm and needs labelled data to train on, whereas KMeans is unsupervised and only needs the labels afterwards, to score how well its clusters match the true groups.

The earliest step for both algorithms was data pre-processing: removing the most commonly occurring stop words and then using Porter stemming to strip the suffixes from every word. For both algorithms, we used the Bag of Words (BOW) representation and the Term Frequency-Inverse Document Frequency (TF-IDF) representation, so that we could compare how well the models did with each. We used the “Normalised Mutual Information” (NMI) score to evaluate the performance of the KMeans algorithm on our dataset and the “Accuracy Score” to evaluate the performance of the SVM algorithm.
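To make the pre-processing and the two representations a bit more concrete, here is a minimal sketch in plain Python. It is not the project code: the stop-word list is a tiny made-up one, `crude_stem` is a simple suffix stripper standing in for real Porter stemming, the three comments are invented, and the TF-IDF weighting uses the textbook idf = log(N / df), which is only one of several variants in use.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on", "this"}

# Suffixes removed by the crude stemmer below. Real Porter stemming has
# many more rules; this only illustrates the idea.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenise, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(vectors):
    """Weight raw counts by idf = log(N / df) for each vocabulary term."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    df = [sum(1 for vec in vectors if vec[j] > 0) for j in range(n_terms)]
    return [
        [vec[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for vec in vectors
    ]


# Invented example comments, purely for illustration.
comments = [
    "The senators are voting on the bill",
    "Voting on this bill is delayed",
    "The weather in the city is cold",
]
docs = [preprocess(c) for c in comments]
vocab, bow = bag_of_words(docs)
weighted = tf_idf(bow)
```

The point of the TF-IDF step shows up in the weights: a term that appears in most comments (like "bill" here) is pushed down, while a term unique to one comment keeps a higher weight.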

For our KMeans clustering, the rule of thumb k = sqrt(n/2), where n is the size of the dataset, was applied to set the number of clusters. For our Support Vector Machine (SVM), we randomly chose 80% of our data as the training set, so as to avoid any bias, while the remaining 20% was held out as test data. For both algorithms, we ran 5 iterations to get a more balanced result. The tables below show the NMI and the accuracy obtained by running KMeans and SVM through 5 iterations.
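As a rough sketch of the set-up described above, the cluster-count rule and the repeated random 80/20 split could look like this in plain Python. The dataset size of 200, the rounding of k to the nearest integer, and the use of the run number as the random seed are all assumptions made for the example, not details from the project.

```python
import math
import random


def rule_of_thumb_k(n_samples):
    """Number of clusters from the sqrt(n / 2) rule, rounded, at least 1."""
    return max(1, round(math.sqrt(n_samples / 2)))


def train_test_split_indices(n_samples, train_fraction=0.8, seed=None):
    """Shuffle the indices 0..n-1 and cut them into train and test parts."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


n_samples = 200  # hypothetical dataset size
k = rule_of_thumb_k(n_samples)

# Five independent random splits, one per run; a different seed per run
# gives a different shuffle, so each run trains and tests on different rows.
splits = [train_test_split_indices(n_samples, seed=run) for run in range(5)]
```

Reshuffling before every run is what lets the five accuracy scores differ from one another; averaging them then gives a steadier picture than any single split.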

Figure 1 – KMeans for BOW Representation.

Run  Dataset                NMI Score
0    Comments for Feb 2017  0.046909
1    Comments for Feb 2017  0.046909
2    Comments for Feb 2017  0.046909
3    Comments for Feb 2017  0.046909
4    Comments for Feb 2017  0.046909

Figure 1a – KMeans for TF-IDF Representation.

Run  Dataset                NMI Score
0    Comments for Feb 2017  0.14627
1    Comments for Feb 2017  0.14627
2    Comments for Feb 2017  0.14627
3    Comments for Feb 2017  0.14627
4    Comments for Feb 2017  0.14627

Figure 2 – SVM for BOW Representation.

Run  Dataset                Accuracy Score
0    Comments for Feb 2017  0.775
1    Comments for Feb 2017  0.850
2    Comments for Feb 2017  0.750
3    Comments for Feb 2017  0.750
4    Comments for Feb 2017  0.775

Figure 2a – SVM for TF-IDF Representation.

Run  Dataset                Accuracy Score
0    Comments for Feb 2017  0.875
1    Comments for Feb 2017  0.750
2    Comments for Feb 2017  0.700
3    Comments for Feb 2017  0.850
4    Comments for Feb 2017  0.875

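For both algorithms, we observe that the TF-IDF representation works better than the Bag of Words (BOW) representation: the KMeans NMI rises from 0.047 to 0.146, and the mean SVM accuracy over the five runs rises from 0.78 to 0.81, though individual SVM runs vary a fair bit. The SVM model also looks like the better fit for our dataset, peaking at 87.5% accuracy, but note that the KMeans NMI of 0.146 is not an accuracy, so the two numbers are not directly comparable.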
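Since NMI and accuracy measure different things, here is a small plain-Python sketch of both ends of the comparison. The NMI function uses arithmetic-mean normalisation, which is one common choice and may not match the exact variant used for the scores above; the toy label lists are invented, while the accuracy lists are copied from the SVM tables.

```python
import math
from collections import Counter
from statistics import mean


def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(true_labels, cluster_ids):
    """NMI = I(U;V) / ((H(U) + H(V)) / 2), arithmetic-mean normalisation."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_ids))
    true_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_ids)
    mutual_info = sum(
        (c / n) * math.log((c * n) / (true_counts[u] * cluster_counts[v]))
        for (u, v), c in joint.items()
    )
    denom = (entropy(true_labels) + entropy(cluster_ids)) / 2
    return mutual_info / denom if denom > 0 else 1.0


# Toy labels: the clusters match the groups perfectly, only the ids are
# swapped. A naive label-by-label accuracy would be 0, yet NMI is 1.
true_groups = [0, 0, 1, 1]
swapped_ids = [1, 1, 0, 0]
unrelated_ids = [0, 1, 0, 1]

nmi_swapped = normalized_mutual_info(true_groups, swapped_ids)
nmi_unrelated = normalized_mutual_info(true_groups, unrelated_ids)

# Per-run SVM accuracies copied from the tables above.
svm_bow = [0.775, 0.850, 0.750, 0.750, 0.775]
svm_tfidf = [0.875, 0.750, 0.700, 0.850, 0.875]

mean_bow = mean(svm_bow)      # about 0.78
mean_tfidf = mean(svm_tfidf)  # about 0.81
```

NMI runs from 0 (clusters tell you nothing about the groups) to 1 (clusters and groups line up exactly), regardless of how the cluster ids are numbered, which is why it cannot be read as a percentage of correct predictions.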

Well, that’s about it for now. Can’t wait to test the algorithms on a bigger dataset. But for now, gotta hit the books. Exams coming up…


CRICKET VS MATCH FIXING

Is it just me, or does anyone else feel the game has gone to the dogs in the recent past? Have glitz, glamour, arrogance, money etc. taken over what once upon a time used to be a “GENTLEMAN’S GAME”? The recent spot-fixing scandal seems to highlight all of the points mentioned above.

Why do such talented players indulge in such cheap activity? Are they not paid enough? I don’t think so. Right from flight tickets to checking them in at 5-star hotels, everything is taken care of. Some of these guys are paid more than what they actually deserve.

Sadly enough, this has been going on for a long, long time, and just when we were about to watch the games with genuine interest, this is what we are rewarded with: headlines on TV reading “3 CRICKETERS CAUGHT FOR SPOT FIXING”. Are these guys trying to rip people off? Some people spend a fortune to catch a glimpse of one cricket match, and is this what they deserve? The board, as usual, seems to be turning a blind eye towards all this, thinking that one day all will be forgotten.

Well, if this is the direction that they are heading in, then god save Indian cricket. Personally, I feel that if superstars in the form of a Rahul Dravid or a Sachin Tendulkar hadn’t emerged, the game would have died a long time ago in this country. Thanks to them, the sport here is still alive and kicking. Well, I hope that the same excitement and fervor returns to the game that we all still love and practically call our own.

 
