Data integrity is vital for robust analysis. In the process of open-sourcing Singapore's Covid-19 data, I ran into several data integrity issues: figures were back-dated, and data definitions kept changing, for example when the national vaccination programme added a new vaccine, or when a drop in the nation's population caused a sudden surge/adjustment in vaccination rates.
The latest one is case numbers being backdated to 6 Jan 2022, when GPs started to use Protocol 2. Protocol 2 cases are individuals who tested positive but are well, or who have been assessed by a doctor to have a mild condition. That explains why the cumulative count was 297,549 cases as of 20 Jan 2022 and 307,813 cases as of 21 Jan 2022, an increase of 10,264, yet only 3,156 new cases were reported for 21 Jan 2022. In other words, the previous day's cumulative count plus the day's new cases does not add up to the day's cumulative count; the remainder comes from the backdated Protocol 2 cases. This was not made clear on the officially published dashboards, and I had to look up the explanation for the discrepancy myself.
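This kind of discrepancy is easy to surface programmatically by reconciling the daily new-case figures against the day-over-day change in cumulative totals. Below is a minimal sketch, assuming a CSV with columns "date", "new_cases", and "cumulative_cases"; the file name and column names are illustrative, not the actual schema of the dataset.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("covid19_singapore.csv", parse_dates=["date"]).sort_values("date")

# The difference between consecutive cumulative totals should equal the day's new cases.
df["implied_new"] = df["cumulative_cases"].diff()
df["gap"] = df["implied_new"] - df["new_cases"]

# Dates where the figures do not reconcile, e.g. because of backdated cases.
discrepancies = df.loc[df["gap"].abs() > 0, ["date", "new_cases", "implied_new", "gap"]]
print(discrepancies)
```

For 21 Jan 2022, a check like this would show an implied 10,264 new cases against the reported 3,156, flagging the backdated Protocol 2 cases as a gap of 7,108.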
All these data inconsistencies have been recorded in the Overview of the dataset: https://data.world/hxchua/covid-19-singapore. Do be mindful of them when interpreting the figures.
When data is not consistently represented, we need to be cautious about whether a trend reflects a change in data definition or an actual observed change in reality (or both). This affects how we assess the magnitude of changes in trends.
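One way to keep this in view during analysis is to mark known definition-change dates on any trend plot, so a jump can be traced to a change in methodology rather than read as a real shift. A minimal sketch, reusing the hypothetical dataframe above and an illustrative change date; the authoritative list of changes is in the dataset's Overview, not here.

```python
import matplotlib.pyplot as plt

# Illustrative definition-change dates, not an exhaustive list.
definition_changes = {
    "2022-01-21": "Protocol 2 cases backdated to 6 Jan",
}

fig, ax = plt.subplots()
ax.plot(df["date"], df["cumulative_cases"], label="Cumulative cases")
for day, note in definition_changes.items():
    ax.axvline(pd.Timestamp(day), linestyle="--", alpha=0.5)
    ax.annotate(note, (pd.Timestamp(day), df["cumulative_cases"].max() * 0.5),
                rotation=90, fontsize=8, va="center")
ax.legend()
plt.show()
```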
If you're looking for a dataset with multiple data quality challenges, this might be the one for you!