This post is a replicate of the previous post on R but using Python this time round. However, note that there is a difference in data randomly generated by R and Python. For most of this exercise, we use prepared hypothetical datasets.
Data cleaning is one of the most important tasks in data science but it is unglamorous, underappreciated and under-discussed. These are some common tasks involved in data cleaning but not limited to:
Most, if not all, of the time, the datasets that we have to analyse are unclean. i.e. they are not necessarily complete/ accurate/ valid. This will impact the accuracy of our analysis if we do not clean them properly.
I will make use of mock-up data to demonstrate the various tasks listed above. The notebook for the data cleaning exercise can be found here.
Merging and appending is very common because data can be stored in various databases or they could come in several batches, hence we need to combine these datasets together before we can start analysing them.
We check for completeness because sometimes mistakes can (accidentally) happen during the stage of data extraction and datasets handed over for analysis are incomplete. Usually we will check for the number of records and number of variables. This step is especially important as incomplete data would result in inaccurate/ bias reporting/ analysis.
Apart from checking for completeness, we also check the validity of the data. For example, certain fields can only take positive values and we have to make sure that's the case. If there are negative values, we will have to inquire the reasons behind and this process also helps us in understanding the business better, where there could be exceptions or other interpretations behind a "supposedly/ intuitively invalid" value.
Quite often, we will need to de-duplicate records in the data. Duplication could happen due to a system glitch where a transaction was recorded twice, or it could be that an error occurred on the first entry and a subsequent entry was meant to overwrite the first entry, or for whatever other reasons there could be. Again, we have to discuss with the business users how we should handle such records and whether the duplicate records are valid. If they are meant to be invalidated, we will have to drop these duplicate records from our data for analysis i.e. de-duplicate.
How we handle missing values is important - whether we should drop it from the analysis or whether to do an imputation as well as what kind of imputation makes a difference on the results of the analysis. For example, imputing missing values as zero will skew the mean while imputing them as the mean would not. You will see it from the exercise.
Recoding can also be part of data transformation (that I will cover it in a future post). In data cleaning, recoding is sometimes done to "correct" the data. e.g. data entry errors, or another way of putting it will be updating the data.
The notebook for the data cleaning exercise can be found here.