Creating hypothetical datasets - Process - R

datadoubleconfirm
Jun 9, 2018
1 min read

Sometimes you want to get started on analyzing data with the main objective of practising the basics of a certain language. So the focus is not so much on the analysis itself but getting familiar with the commands and steps involved in a data analysis. In such cases, we can create our own (simple/ small) hypothetical datasets without spend time sourcing datasets/ scraping data. It is especially helpful to get comfortable with the language first before interacting with big and complex real-world data. This is a common practice in most data science programs. But of course, often the excitement lies in understanding real-world situations/ applications hence some might find that hypothetical datasets are not interesting.

There are several datasets within R itself such as mtcars and iris, but there could be occasions where we want to create examples that fit a particular context.

In this notebook, I outline the few lines of code to create a hypothetical dataset consisting of five columns, namely id, gender of students, and scores of test 1, 2 and 3. There are a total of 50 male and 50 female students. Each test score follows its own normal distribution. Of which some of the scores for test 1 and test 2 were made missing to mimic dirty/ incomplete data in the real-world. The first few rows of the dataset is shown below. This is a simple exercise with few lines of code and makes use of loop. You can now go ahead to create your own hypothetical dataset!

DATA DOUBLE CONFIRM

DATA DOUBLE CONFIRM

Creating hypothetical datasets - Process - R

Comments