KMeans Clustering and Support Vector Machine.

With the sun setting on yet another trimester at university, I thought I’d sneak in another blog on two fantastic machine learning algorithms, KMeans Clustering and Support Vector Machines (SVM), which were my marquee players in a project on the comments made in the New York Times during February 2017. Due to strict time constraints, the project could only be carried out on a smaller dataset, but I plan on rerunning it on a larger dataset during the month-long vacation from mid-June to July.

Coming back to the project, the main focus was to sufficiently train the machine on the dataset and, once satisfied with its performance, ask it to accurately determine the group of people most likely to have made a particular comment. As a first note, SVM is a supervised algorithm and needs labelled data, while KMeans is unsupervised; the labels are only used afterwards to check how well the clusters line up with the true groups.

The first step for both algorithms was data pre-processing: removing the most commonly occurring stop words and then applying Porter stemming to strip the suffixes from every word. For both algorithms, we used the Bag of Words (BOW) representation and the Term Frequency–Inverse Document Frequency (TF-IDF) representation to compare how accurately each model makes predictions. We used Normalised Mutual Information (NMI) to evaluate the performance of the KMeans algorithm on our dataset and the accuracy score to evaluate the performance of the SVM algorithm.

For our KMeans clustering, the rule of thumb k = sqrt(n/2), where n is the size of the dataset, was applied to set the number of clusters. For our Support Vector Machine (SVM), we randomly chose 80% of our data as the training set, so as to avoid any bias, and held out the remaining 20% as test data. For both algorithms, we ran 5 iterations to get a more balanced result. The tables below show the NMI and the accuracy obtained by running KMeans and SVM over 5 iterations.
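As a sketch of that setup: the snippet below uses a small synthetic `make_blobs` matrix as a stand-in for the real TF-IDF features (the NYT comment data isn’t reproduced here), but the cluster-count rule, the 80/20 split, and the two evaluation metrics are the ones described above:

```python
import math

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, normalized_mutual_info_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for the TF-IDF matrix: 100 points drawn from 2 true groups.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# KMeans: number of clusters set by the rule of thumb k = sqrt(n / 2).
k = round(math.sqrt(len(X) / 2))  # sqrt(100 / 2) rounds to 7
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
nmi = normalized_mutual_info_score(y, km.labels_)  # in [0, 1]

# SVM: random 80/20 train/test split, evaluated with accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
svm = SVC().fit(X_tr, y_tr)
acc = accuracy_score(y_te, svm.predict(X_te))
```

Varying `random_state` across runs is one way to get the five differing iterations reported below.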

Figure 1 – KMeans for BOW Representation.

Run  Dataset                NMI Score
0    Comments for Feb 2017  0.046909
1    Comments for Feb 2017  0.046909
2    Comments for Feb 2017  0.046909
3    Comments for Feb 2017  0.046909
4    Comments for Feb 2017  0.046909

Figure 1a – KMeans for TF-IDF Representation.

Run  Dataset                NMI Score
0    Comments for Feb 2017  0.14627
1    Comments for Feb 2017  0.14627
2    Comments for Feb 2017  0.14627
3    Comments for Feb 2017  0.14627
4    Comments for Feb 2017  0.14627

Figure 2 – SVM for BOW Representation.

Run  Dataset                Accuracy Score
0    Comments for Feb 2017  0.775
1    Comments for Feb 2017  0.850
2    Comments for Feb 2017  0.750
3    Comments for Feb 2017  0.750
4    Comments for Feb 2017  0.775

Figure 2a – SVM for TF-IDF Representation.

Run  Dataset                Accuracy Score
0    Comments for Feb 2017  0.875
1    Comments for Feb 2017  0.750
2    Comments for Feb 2017  0.700
3    Comments for Feb 2017  0.850
4    Comments for Feb 2017  0.875

For both algorithms, we observe that the TF-IDF representation works better than the Bag of Words (BOW) representation. Moreover, the SVM model is the better fit for our dataset, reaching up to 87.5% accuracy, while KMeans tops out at an NMI of about 0.146 (bearing in mind that accuracy and NMI are different metrics, so this is an indicative rather than a direct comparison).

Well, that’s about it for now. Can’t wait to test the algorithms on a bigger dataset. But for now, gotta hit the books. Exams coming up…








Microsoft Excel vs R

Hi there. Hope you’re all having a great start to the week. It’s getting really cold here in Melbourne and the rain doesn’t help either, so I decided to stay indoors for the day. As I sip my coffee, I think back on a question posed by a friend last week. It’s a question I ask myself a lot these days: why use Microsoft Excel when you have R (or even Python, for that matter)? Having carried out a fair amount of data analysis in both these tools, both amazing technologies by the way, here’s my opinion based on a few key factors:

  1. Computing Speed: I had the opportunity to undertake very similar assignments in two different units, one in Excel, the other in R. While R was able to seamlessly process massive datasets, Excel wasn’t built for them and slowed down noticeably while handling calculations on large datasets. THE WINNER – R.
  2. Data Manipulation: Sorting, merging and binding large datasets are quicker and more straightforward in R compared to Excel. Also, techniques such as normalising data to a [0,1] interval are far easier in R than in Excel. THE WINNER – R.
  3. Accuracy: While Excel works like an absolute delight on smaller datasets, conducting tasks such as regression, descriptive statistics, logistic regression, etc. with relative ease, the accuracy of these tasks takes a hit as the datasets grow. On a personal level, I found an Excel regression on a US airports dataset took forever to run, only for the model to output inaccurate estimates. R, because of its superior computing speed, has no such problem carrying out tasks like regression. THE WINNER – R.
  4. Libraries: Excel has great tools to analyse and visualise data, no doubt. The chart options in Excel are absolutely spectacular and I have no right to criticise them. But then again, R goes one step further with libraries like ggplot2, putting R’s library ecosystem in the stratosphere, on par with Python’s. THE WINNER – R. And last but not least:
  5. Open Source: Not much fun when you’re all over the place trying to get Microsoft Office installed at a discount, is it? R, on the other hand, is as easy as it gets… Download! Install! THE WINNER – R.
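On the normalisation point in particular: since Python is the other open-source option mentioned above, here’s what rescaling a column to the [0,1] interval looks like as a few-line sketch (min-max normalisation; the sample values are made up for illustration):

```python
import numpy as np

def min_max_normalise(values):
    # Rescale to [0, 1]: (x - min) / (max - min).
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

scaled = min_max_normalise([10, 20, 30, 40, 50])
# The smallest value maps to 0.0 and the largest to 1.0.
```

The equivalent in R is a one-liner too; in Excel you’d be dragging a formula down every row.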

Don’t get me wrong. Excel is still a fantastic tool. Just not scalable enough. For data analysis, R wins… Every time!!

Here’s my take on the topic …would love to hear yours 🙂

Correlation vs Causation

In my first Data Analytics class I heard those important words… correlation does not imply causation. In fact, the two are poles apart.

Several assignments later, it’s all becoming clear to me. Correlation vs causation analysis can sometimes prove that statistics can be deceiving.

Correlation: The correlation coefficient (r) indicates the strength of a linear relationship. r = 1 implies a perfect positive linear relationship, while r = -1 implies a perfect negative one. Here’s a pretty interesting piece of correlation analysis: “There is a strong positive linear relationship between the number of films Nicolas Cage has appeared in and the number of people who drowned by falling into a swimming pool.” Yikes! Proof that statistics can sometimes be misleading.
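Both points fit in a couple of lines of Python (a toy example of mine, not tied to the Cage data): a perfectly linear relationship gives r = 1, yet a high r between two real-world series on its own proves nothing about one causing the other.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                # y is a perfect linear function of x
r = np.corrcoef(x, y)[0, 1]      # Pearson correlation coefficient
# r comes out as 1.0: a perfect positive linear relationship. Correlation
# alone would look just as strong for two causally unrelated series.
```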

Causation: The collapse of Lehman Brothers setting the wheels in motion for the 2008 economic recession, on the other hand, is genuine causation.

Also, are conclusions based on a population or sample of data? Every small detail counts.

Data Analytics is here to stay….


Over the last few months, one particular topic has been of interest to everyone: Bitcoin.

While I was aware of Bitcoin, its meteoric rise was something I did not expect to occur so early on. Last I heard, one bitcoin was trading north of 13,000 Australian dollars, which is pretty staggering.

This led me to my first introduction to blockchain technology, the underlying infrastructure supporting Bitcoin. In December last year, a guest professor at university running his own blockchain startup stated that the technology was soon going to replace the core banking systems of several banks, and that it had several use cases in industries like healthcare too.

Blockchain comes with several advantages over the traditional legacy systems of financial institutions, including features like:

  1. DISTRIBUTED LEDGERS, where information is duplicated across several nodes, making it extremely hard to tamper with.
  2. SMART CONTRACTS, which enable transactions to be carried out between two parties without the intervention of a third party, thus cutting out the unwanted middle-man fee.
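The tamper-resistance of a ledger comes from each block including a hash of the previous one, so changing any old entry breaks every later link. Here’s a toy illustration of that chaining idea in Python (a teaching sketch, nothing like production blockchain code, and the transactions are invented):

```python
import hashlib
import json

def block_hash(contents):
    # Hash a block's contents, which include the previous block's hash.
    return hashlib.sha256(json.dumps(contents, sort_keys=True).encode()).hexdigest()

def add_block(chain, data):
    prev = chain[-1]["hash"] if chain else "0" * 64
    contents = {"data": data, "prev_hash": prev}
    chain.append({**contents, "hash": block_hash(contents)})

def is_valid(chain):
    # Re-derive every hash; any edited block (or broken link) fails.
    for i, block in enumerate(chain):
        prev = chain[i - 1]["hash"] if i else "0" * 64
        if block["prev_hash"] != prev:
            return False
        if block["hash"] != block_hash({"data": block["data"], "prev_hash": prev}):
            return False
    return True

chain = []
add_block(chain, "Alice pays Bob 5")
add_block(chain, "Bob pays Carol 2")
assert is_valid(chain)

chain[0]["data"] = "Alice pays Bob 500"   # tamper with history
assert not is_valid(chain)                # the chain immediately fails validation
```

In a real distributed ledger, every node holds a copy and re-checks these links, which is what makes quietly rewriting history so difficult.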

But here’s the catch. Blockchain is not regulated yet. How long before the technology becomes full-fledged? Blockchain coupled with Artificial Intelligence could be a match made in heaven.

Exciting times ahead….





I realize it’s 2018, but something inspired me to sit down at a computer and start writing. So here I am.

I landed in Melbourne on a cold winter’s night on the 29th of June 2017. It’s been 8 months since, and the journey has been humbling. From cooking your own food, to doing your own laundry, to learning to manage your finances, not to mention managing your university course load, it’s life in the fast lane. Living abroad really does give you some stick. It’s true.

So, Australia, what have you got in store for me this year? Can’t wait.





Is it just me, or does anyone else feel the game has gone to the dogs in the recent past? Have glitz, glamour, arrogance, money, etc. taken over what once upon a time used to be a “GENTLEMAN’S GAME”? The recent spot-fixing scandal that unfolded seems to highlight all of the above.

Why do such talented players indulge in such cheap activity? Are they not paid enough? I don’t think so. Right from flight tickets to checking them in at 5-star hotels, everything is taken care of. Some of these guys are paid more than what they actually deserve.

Sadly enough, this has been going on for a long, long time, and just when we were about to watch the games with genuine interest, this is what we are rewarded with: headlines on TV reading “3 CRICKETERS CAUGHT FOR SPOT FIXING”. Are these guys trying to rip people off? Some people spend a fortune to catch a glimpse of one cricket match, and is this what they deserve? The board, as usual, seems to be turning a blind eye to all this, thinking that one day it will all be forgotten.

Well, if this is the direction they are heading in, then God save Indian cricket. Personally, I feel that if superstars in the form of a Rahul Dravid or a Sachin Tendulkar hadn’t emerged, the game would have died a long time ago in this country. Thanks to them, the sport here is still alive and kicking. Well, I hope the same excitement and fervour is brought back into the game that we all still love and practically call our own.

