
DATA DOUBLE CONFIRM

Getting started

Over the years, the number of tools (or software) I have to install and use has increased steadily with the different types of tasks I have to perform. Some tools serve a similar purpose, but I ended up with another tool because of the school or work environment setup. I'm not recommending any particular order in which to pick up these tools, as it all depends on what you need, but here's mine, and I feel pretty comfortable with it (I'm giving my take on the level of difficulty as well):

1. R

Started using since: 2008

Level of difficulty: Low

Used for: Statistical analysis/ data mining

Many functions and algorithms are built into what we call packages. Install the relevant packages and the analysis can get going. R manuals describing the packages, with examples, are available online.

Link to installation: https://www.r-project.org/

1.1. RStudio

This is the GUI I used for R.

Link to installation: https://www.rstudio.com/

2. Tableau

Started using since: 2015

Level of difficulty: Low

Used for: Data visualization/ data exploration

The drag-and-drop interface makes the barrier to entry (or knowledge inertia) a lot lower. The main challenge for me is preparing the data in the appropriate format to achieve the desired visualization.

3. MySQL

Started using since: 2015

Level of difficulty: Medium

Used for: Database creation and management

The main challenge for me is developing nested queries or subqueries (basically queries within queries). As a quick win, I often end up creating more views (i.e. saved queries that act as data subsets), which, of course, means less efficient code. Also, installation is painful. According to the tutorial below, "Installation could be the hardest part in this exercise."

Link to installation (instructions):
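
To make the subquery-versus-view trade-off mentioned above a bit more concrete, here is a minimal sketch. It is only an illustration under a few assumptions: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, and the orders table, its columns and the view names are all made up for this example. The SQL itself is plain enough that it should carry over to MySQL with little or no change.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('ann', 10), ('ann', 30), ('bob', 5), ('cat', 50);
    """
)

# Approach 1: one nested query. The inner SELECT builds per-customer totals,
# and a second subquery supplies the average order amount to compare against.
nested_sql = """
    SELECT customer, total
    FROM (SELECT customer, SUM(amount) AS total
          FROM orders
          GROUP BY customer) AS totals
    WHERE total > (SELECT AVG(amount) FROM orders)
    ORDER BY customer
"""
nested_rows = conn.execute(nested_sql).fetchall()

# Approach 2: save each intermediate step as a view, then query the views.
conn.executescript(
    """
    CREATE VIEW customer_totals AS
        SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;
    CREATE VIEW average_order AS
        SELECT AVG(amount) AS avg_amount FROM orders;
    """
)
view_sql = """
    SELECT customer, total
    FROM customer_totals, average_order
    WHERE total > avg_amount
    ORDER BY customer
"""
view_rows = conn.execute(view_sql).fetchall()

print(nested_rows)
print(view_rows)
conn.close()
```

Both approaches return the same rows. The view version is easier to read step by step, but it leaves extra objects behind in the database, which is the clutter I was referring to.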

3.1. MySQL Workbench

This is the GUI I used for MySQL.

4. Python

Started using since: 2016

Level of difficulty: High

Used for: Statistical analysis/ machine learning/ web-scraping

The main challenge for me is the language syntax itself: a single line of code can contain so many ".", "[ ]", and "( )". Performing a simple data transformation in Python is not as simple as it seems. Here is an example of what I mean. Also, installing packages (or modules) is not as simple as in R. You have to use the command line to do it (instructions).

(Spyder, which is included in the download, is the Python development environment that we will use. Hmm, Anaconda, Spyder and Python. Same same but different.)
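
For a rough feel of the ".", "[ ]" and "( )" pile-up described above, here is a small made-up sketch using only the standard library. It is not the example linked earlier; the records list and its fields are invented purely for illustration. The first version packs everything into one line, the second spells out the same steps one at a time.

```python
# Made-up records for illustration; scores are stored as text on purpose.
records = [
    {"name": "ann", "dept": "ops", "score": "71.5"},
    {"name": "bob", "dept": "ops", "score": "64.0"},
    {"name": "cat", "dept": "dev", "score": "88.5"},
]

# One dense line: sort by numeric score, take the first record,
# pull out its name and upper-case it.
top = sorted(records, key=lambda r: float(r["score"]), reverse=True)[0]["name"].upper()

# The same transformation, one step per line.
by_score = sorted(records, key=lambda r: float(r["score"]), reverse=True)
best_record = by_score[0]
top_stepwise = best_record["name"].upper()

print(top, top_stepwise)
```

Breaking the line up costs a few extra variable names, but each step then has only one kind of bracket to read.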

5. PostgreSQL

Started using since: 2017

Level of difficulty: Low

Used for: Database creation and management

This is similar to MySQL: both are relational database systems queried with SQL, each with a slight twist of its own, so prior experience with MySQL helps.

Link to installation: https://www.postgresql.org/

5.1. DBeaver

This is the GUI I used for PostgreSQL.

Link to installation: https://dbeaver.jkiss.org/

6. Git

Started using since: 2017

Level of difficulty: High

Used for: Version control/ Collaborative work

It's not easy for me because I do not have experience with the command line. I'm still new to this and have mainly survived on commands like git pull, git add, git commit and git push.

6.1. GitHub

Create a GitHub account at https://github.com/.

You can create a repository to upload your work, or build on the work of others in other repositories. For example, this is my GitHub repository containing some datasets I've curated. If you are interested in the technicalities of the difference between Git and GitHub, read this article.

7. Jupyter Notebook

Started using since: 2017

Level of difficulty: Low

Used for: Collaborative work/ code and output documentation

This is a web-based application that allows you to run your code and save its output. It supports many languages, but I've only used it for Python so far. Also, as you can see, Python is a prerequisite for installing it, so things get easier when you already have Python installed.

Link to installation: http://jupyter.org/

8. PuTTY

Started using since: 2017

Level of difficulty: Medium

Used for: Remote access to server computers

This is not a data science tool, but it is necessary if we have to access work on remote servers. The command line is involved.

8.1. MobaXterm

I decided to get this interface because uploading and downloading files is much easier with it.

Link to installation: https://mobaxterm.mobatek.net/

PS: I'm only listing open-source tools, as everybody can access them. Tableau is an exception, as it's free for students. Excel is a pretty resourceful tool for data analysis as well, but in today's environment, knowing how to use Excel is expected of everyone. A GUI (Graphical User Interface) is good to have. It makes starting out less daunting and provides a more comfortable user experience.

PPS: There is a constant debate over R versus Python. This is one good read that I found and can relate to.

PPPS: Yes, as you can see from 8 and 8.1, I'm a Windows user (in case it makes any difference).

Feel free to share your order of learning as well!
