Sometimes you want to start analyzing data mainly to practise the basics of a certain language. The focus is then less on the analysis itself and more on getting familiar with the commands and steps involved in a data analysis. In such cases, we can create our own small, simple hypothetical datasets without spending time sourcing datasets or scraping data. It is especially helpful to get comfortable with the language first, before working with big and complex real-world data. This is a common practice in many data science programs. Of course, the excitement often lies in understanding real-world situations and applications, so some might find hypothetical datasets less interesting.
R itself ships with several datasets, such as mtcars and iris, but there are occasions when we want to create examples that fit a particular context.
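For reference, here is a minimal sketch of how the built-in datasets mentioned above can be inspected. It uses only base R, and the variable names `mtcars_dim` and `iris_species` are just illustrative.

```r
# Built-in datasets are available by name in a fresh R session.
head(mtcars)   # first six rows of the car dataset
str(iris)      # column types and a preview of the flower dataset

# Store a few basic facts about each dataset (illustrative names).
mtcars_dim   <- dim(mtcars)           # number of rows and columns
iris_species <- levels(iris$Species)  # the species recorded in iris
```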
![](https://static.wixstatic.com/media/1ea3da_a1ed35fdf89646fba96cb4350f88bf01~mv2.png/v1/fill/w_980,h_696,al_c,q_90,usm_0.66_1.00_0.01,enc_avif,quality_auto/1ea3da_a1ed35fdf89646fba96cb4350f88bf01~mv2.png)
In this notebook, I outline the few lines of code needed to create a hypothetical dataset with five columns: id, gender, and the scores for tests 1, 2 and 3. There are 50 male and 50 female students in total. Each test score follows its own normal distribution, and some of the scores for test 1 and test 2 were set to missing to mimic the dirty, incomplete data found in the real world. The first few rows of the dataset are shown below. This is a simple exercise that takes only a few lines of code and makes use of a loop. You can now go ahead and create your own hypothetical dataset!
![](https://static.wixstatic.com/media/1ea3da_ae2b3312a8d84d74ac1f29dd875a2cf5~mv2.png/v1/fill/w_980,h_311,al_c,q_85,usm_0.66_1.00_0.01,enc_avif,quality_auto/1ea3da_ae2b3312a8d84d74ac1f29dd875a2cf5~mv2.png)
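A dataset like the one described above could be built roughly as follows. This is a sketch under stated assumptions rather than the exact code from the notebook: the seed, the means and standard deviations of each test, the number of missing values per column, and the names `students`, `n_per_gender` and `missing_rows` are all made up for illustration.

```r
# Sketch of a hypothetical student test-score dataset.
# All numeric parameters below are assumptions, chosen only for illustration.
set.seed(123)  # assumed seed, so the random draws are reproducible

n_per_gender <- 50
n <- 2 * n_per_gender

students <- data.frame(
  id     = 1:n,
  gender = rep(c("male", "female"), each = n_per_gender),
  test1  = round(rnorm(n, mean = 60, sd = 10)),  # each test has its own
  test2  = round(rnorm(n, mean = 65, sd = 12)),  # normal distribution
  test3  = round(rnorm(n, mean = 70, sd = 8))
)

# Loop over the two columns that should contain gaps and set a random
# handful of rows in each to NA (5 per column is an assumption).
for (col in c("test1", "test2")) {
  missing_rows <- sample(n, size = 5)
  students[missing_rows, col] <- NA
}

head(students)  # show the first few rows
```

Changing the arguments to `rnorm()` or the `size` passed to `sample()` is an easy way to make the data cleaner or messier, depending on what you want to practise.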