top of page
Untitled

DATA DOUBLE CONFIRM

Creating hypothetical datasets - Process - Python

This post is a replicate of the previous post on R but using Python this time round.

Sometimes you want to get started on analyzing data with the main objective of practising the basics of a certain language. So the focus is not so much on the analysis itself but getting familiar with the commands and steps involved in a data analysis. In such cases, we can create our own (simple/ small) hypothetical datasets without spend time sourcing datasets/ scraping data. It is especially helpful to get comfortable with the language first before interacting with big and complex real-world data. This is a common practice in most data science programs. But of course, often the excitement lies in understanding real-world situations/ applications hence some might find that hypothetical datasets are not interesting.

There are several practice datasets within R itself such as mtcars and iris, but there isn't any in Python. However, it is possible to import these datasets from R to Python. This can be done by installing pydataset via the command prompt: pip install pydataset

When I wanted to import the module in Jupyter notebook, there was some error due to the notebook running in different environment from Python in the command prompt (ImportError: No module named 'pydataset'). So I checked where my package was installed.

The paths for my Python environment within Jupyter notebook (below) did not contain the path where the package was installed. Hence I appended the path where the package was installed and it worked: sys.path.append('C:\\Users\\HuiXiang\\Anaconda3\\lib\\site-packages')

Now I can import the practice datasets within R to Python.

There could be occasions where we want to create examples that fit a particular context. In this notebook, I outline the few lines of code to create a hypothetical dataset consisting of five columns, namely id, gender of students, and scores of test 1, 2 and 3. There are a total of 50 male and 50 female students. Each test score follows its own normal distribution. Of which some of the scores for test 1 and test 2 were made missing to mimic dirty/ incomplete data in the real-world. The first few rows of the dataset is shown below. This is a simple exercise with few lines of code and makes use of loop. You can now go ahead to create your own hypothetical dataset!

References:

bottom of page