top of page
Untitled

DATA DOUBLE CONFIRM

Creating hypothetical datasets - Process - R

Sometimes you want to get started on analyzing data with the main objective of practising the basics of a certain language. So the focus is not so much on the analysis itself but getting familiar with the commands and steps involved in a data analysis. In such cases, we can create our own (simple/ small) hypothetical datasets without spend time sourcing datasets/ scraping data. It is especially helpful to get comfortable with the language first before interacting with big and complex real-world data. This is a common practice in most data science programs. But of course, often the excitement lies in understanding real-world situations/ applications hence some might find that hypothetical datasets are not interesting.

There are several datasets within R itself such as mtcars and iris, but there could be occasions where we want to create examples that fit a particular context.

In this notebook, I outline the few lines of code to create a hypothetical dataset consisting of five columns, namely id, gender of students, and scores of test 1, 2 and 3. There are a total of 50 male and 50 female students. Each test score follows its own normal distribution. Of which some of the scores for test 1 and test 2 were made missing to mimic dirty/ incomplete data in the real-world. The first few rows of the dataset is shown below. This is a simple exercise with few lines of code and makes use of loop. You can now go ahead to create your own hypothetical dataset!

bottom of page