Process and product of various data science tasks, from data collection, data preparation and data visualization to basic statistical analysis and modelling. Datasets for practice are available.

Selected as Top 100 Data Science Resources for 2018


October 30, 2019

While I mainly host my datasets on my GitHub repository, I have also cross-shared some datasets on another platform, as it is integrated with quite a few other tools. It is also more user-friendly for users who might not want to dabble in GitHub...

September 15, 2019

With R and Anaconda installed, we can also use R in Jupyter Notebook. My previous laptop died, so now I have to reinstall everything. But this time I ran into some issues that I didn't have with my previous laptop (not too sure why).

So what is necessary to...

May 7, 2019

A common question is how much data is considered enough. The answer: it depends. First and foremost, we need to know what comprises the total population. If the population is small, and there are enough resources to obtain whatever information you want on the...
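As a rough illustration of how population size feeds into "how much is enough", here is a minimal Python sketch. It assumes Cochran's sample size formula with a finite population correction, which is a common textbook approach and not necessarily the one used in the post. The function name `sample_size` and its default parameters (95% confidence, 5% margin of error, 50% expected proportion) are hypothetical choices for illustration.

```python
import math


def sample_size(population, margin_of_error=0.05, z=1.96, proportion=0.5):
    """Estimate the sample size needed for a given population (illustrative only).

    Assumes simple random sampling and Cochran's formula; z=1.96 corresponds
    to a 95% confidence level, and proportion=0.5 is the most conservative guess.
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite population correction: smaller populations need fewer responses.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


for size in (100, 1000, 1_000_000):
    print(size, sample_size(size))
```

The required sample grows much more slowly than the population, which is one reason the answer "depends" on what the total population is.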

December 15, 2018

This tutorial shows how to get the frequency distribution of words used in a chunk of text. It is a simpler alternative to a more elaborate text mining post that involves automatic removal of stopwords such as "the", "a" and "and".

The script basically breaks the chunk of tex...
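The general idea of counting word frequencies can be sketched in a few lines. This is not the post's own script; it is a minimal Python sketch that assumes words are runs of letters and apostrophes, and the function name `word_frequencies` and the sample text are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercase word to its number of occurrences."""
    # Lowercase the text, then extract runs of letters/apostrophes as words.
    # Stopwords are deliberately NOT removed in this simple version.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


sample = "The cat and the hat. The end."
freq = word_frequencies(sample)

# Print the most frequent words first.
for word, count in freq.most_common(3):
    print(word, count)
```

Because stopwords are kept, common words such as "the" will dominate the counts; filtering them out is the extra step the more elaborate approach adds.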

June 16, 2018

Data cleaning is one of the most important tasks in data science, but it is unglamorous, underappreciated and under-discussed. Common tasks involved in data cleaning include, but are not limited to:

  • Merging/appending

  • Checking completeness of data

  • ...

June 9, 2018

Sometimes you want to get started on analyzing data with the main objective of practising the basics of a certain language. So the focus is not so much on the analysis itself but getting familiar with the commands and steps involved in a data analysis. In such cases, w...

March 23, 2018

It's been a while since the closing of the Datathon but I hadn't found the time to write about it. This was my first attempt at participating in a Kaggle competition, and I participated because of the nature of the competition - the WiDS Datathon competition seeks to e...

December 24, 2017

This is Part II of a four-part post. Part I talks about collecting text data from Twitter, while Part II discusses the analysis of text data, i.e. text mining. Part III outlines the process of presenting the data using Tableau and Part IV delves into insights from the analys...

November 22, 2017

Over the years, the number of tools (or software) I have to install/use has increased steadily, given the different types of tasks I have to perform. Some tools serve a similar purpose, but I ended up with another tool because of the school/work environment setup. Not reco...
