This post replicates the previous post on R, this time using Python.
Sometimes you want to start analyzing data mainly to practise the basics of a language. The focus is then less on the analysis itself and more on getting familiar with the commands and steps involved in a data analysis. In such cases, we can create our own small, simple hypothetical datasets instead of spending time sourcing datasets or scraping data. It is especially helpful to get comfortable with the language first before working with big, complex real-world data, and this is common practice in most data science programs. Of course, much of the excitement lies in understanding real-world situations and applications, so some might find hypothetical datasets less interesting.
R ships with several practice datasets such as mtcars and iris, but Python itself does not include any. However, these datasets can be made available in Python by installing the pydataset package from the command prompt: pip install pydataset
![](https://static.wixstatic.com/media/1ea3da_4a555c2bf00c4773bcc25057e86716db~mv2.png/v1/fill/w_980,h_520,al_c,q_90,usm_0.66_1.00_0.01,enc_avif,quality_auto/1ea3da_4a555c2bf00c4773bcc25057e86716db~mv2.png)
![](https://static.wixstatic.com/media/1ea3da_2880284509114368b62c1ba32999adfa~mv2.png/v1/fill/w_980,h_522,al_c,q_90,usm_0.66_1.00_0.01,enc_avif,quality_auto/1ea3da_2880284509114368b62c1ba32999adfa~mv2.png)
When I tried to import the module in a Jupyter notebook, I got an error (ImportError: No module named 'pydataset') because the notebook was running in a different environment from the Python used in the command prompt. So I checked where the package had been installed.
![](https://static.wixstatic.com/media/1ea3da_c3599782177f4e19837991d1100f5dc6~mv2.png/v1/fill/w_980,h_134,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/1ea3da_c3599782177f4e19837991d1100f5dc6~mv2.png)
The search paths for the Python environment inside the Jupyter notebook (below) did not include the directory where the package was installed. I therefore appended that directory to the search path, and the import worked: sys.path.append('C:\\Users\\HuiXiang\\Anaconda3\\lib\\site-packages')
![](https://static.wixstatic.com/media/1ea3da_218b92efa1e54aef8c91a3bd993a65ab~mv2.png/v1/fill/w_980,h_518,al_c,q_90,usm_0.66_1.00_0.01,enc_avif,quality_auto/1ea3da_218b92efa1e54aef8c91a3bd993a65ab~mv2.png)
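The check-and-append step described above can be sketched roughly as follows. This is a minimal illustration rather than the exact code from my notebook: the helper name `is_importable` is made up for this example, and the site-packages path is the one from my machine, so yours will differ.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if Python can locate the module on its current search path."""
    return importlib.util.find_spec(module_name) is not None


# Which interpreter is the notebook actually running?
print(sys.executable)

# Directories that this interpreter searches when importing modules.
for path in sys.path:
    print(path)

# Path taken from this post; replace it with the location pip reported on your machine.
site_packages = 'C:\\Users\\HuiXiang\\Anaconda3\\lib\\site-packages'

# Only append when the package cannot be found and the path is not already listed.
if not is_importable("pydataset") and site_packages not in sys.path:
    sys.path.append(site_packages)
```

Appending to sys.path only lasts for the current session. Another option, if your notebook supports it, is to run `%pip install pydataset` in a cell so that the package is installed into the environment the notebook is using.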
Now I can load R's practice datasets in Python.
![](https://static.wixstatic.com/media/1ea3da_14a5b203b95848e8b22bf488f2a0f387~mv2.png/v1/fill/w_980,h_522,al_c,q_90,usm_0.66_1.00_0.01,enc_avif,quality_auto/1ea3da_14a5b203b95848e8b22bf488f2a0f387~mv2.png)
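For readers who cannot see the screenshot, loading a dataset looks roughly like this. It is a small sketch that assumes pydataset is installed. The wrapper `load_r_dataset` is a hypothetical helper added here so the snippet does not crash when the package is missing.

```python
try:
    # pydataset exposes a single entry point, `data`
    from pydataset import data
except ImportError:
    data = None  # package not installed in this environment


def load_r_dataset(name):
    """Return the named R dataset as a pandas DataFrame, or None if pydataset is unavailable."""
    if data is None:
        return None
    return data(name)


mtcars = load_r_dataset("mtcars")
if mtcars is not None:
    print(mtcars.head())  # first few rows of the dataset
```

Calling `data()` with no arguments should list the datasets that are available.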
There may be occasions when we want to create examples that fit a particular context. In this notebook, I outline the few lines of code needed to create a hypothetical dataset with five columns: id, gender of the student, and the scores for tests 1, 2 and 3. There are 50 male and 50 female students in total. Each test score follows its own normal distribution, and some of the scores for test 1 and test 2 were made missing to mimic the dirty, incomplete data found in the real world. The first few rows of the dataset are shown below. This is a simple exercise that takes only a few lines of code and makes use of a loop. You can now go ahead and create your own hypothetical dataset!
![](https://static.wixstatic.com/media/1ea3da_a9e4fa79f5104eba915790048a658b3e~mv2.png/v1/fill/w_980,h_317,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/1ea3da_a9e4fa79f5104eba915790048a658b3e~mv2.png)
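One possible way to build such a dataset with NumPy and pandas is sketched below. It is not the exact notebook code. The means, standard deviations, random seed, number of missing values per test, and the clipping to a 0 to 100 range are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption) so the example is reproducible.
rng = np.random.default_rng(seed=42)

n_students = 100
df = pd.DataFrame({
    "id": np.arange(1, n_students + 1),
    "gender": ["male"] * 50 + ["female"] * 50,
})

# Hypothetical (mean, standard deviation) for each test.
test_params = {"test1": (60, 10), "test2": (65, 12), "test3": (70, 8)}

# Loop over the tests and draw each column from its own normal distribution.
for test, (mean, sd) in test_params.items():
    scores = rng.normal(loc=mean, scale=sd, size=n_students)
    # Keep scores within 0 to 100 (an assumption about the marking scale).
    df[test] = np.clip(scores, 0, 100).round(1)

# Blank out 5 randomly chosen scores in test 1 and in test 2 to mimic missing data.
n_missing = 5
for test in ["test1", "test2"]:
    missing_rows = rng.choice(n_students, size=n_missing, replace=False)
    df.loc[missing_rows, test] = np.nan

print(df.head())
```

Changing `test_params` or `n_missing` is an easy way to try out different scenarios, such as a harder test or messier data.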
References: