The Singapore government publishes locations visited by COVID-19 patients in the last 14 days, and removes any data earlier. I decided to capture the data in its entirety in a Google sheet (for anyone who would be interested to do further analysis). This data is also shared on data.world. What's necessary is to consistently update the data. I used to do this manually but it is very inefficient, and of course, being in this field of data science/ engineering, I should be automating this process. I finally got to doing it.
This is what the entire automation process is like:
1. Retrieve existing data in the spreadsheet using Google API and modules gspread and gspread_dataframe
2. Scrape the data from the government website using BeautifulSoup
3. Identify the earliest (i.e. minimum) date listed on the government website
4. Remove records within the existing data that is later than/ equals to the minimum date
5. Append the scraped data to the existing data
6. Push the updated data to the spreadsheet
Note for pt. 4: Because new records are updated within the 14-day window period and this 14-day window period overlaps with the existing data. So we remove these dates that overlap to avoid duplicate records.
To use Google API, follow the instructions here for installation and here for authentication. To use gspread and gspread_dataframe, please run pip install gspread and pip install gspread_dataframe.
In the process of scraping the data, some data cleaning is also necessary, such as
i. removing * from date, and converting date from string to date format,
ii. splitting the 'Location (Address)' column into two to record sub-locations as this is how I designed the dataset,
iii. I also replaced the word 'to' to '-' in the 'Time' column as previously when I started collecting the data from other sources, it was '-'.
iv. Add in another column to indicate the source of the data i.e. the government website.
There were a couple of errors I ran into and one of which is "'Invalid JSON payload received. Unexpected token". I managed to find the solution here, which is to fill up 'NaN' values with empty string, df.fillna('', inplace=True) and the problem was solved.
Towards the end, I kept running into "The caller does not have permission"/ "PERMISSION_DENIED" error messages and finally found the solution here. Basically, we have to share the edit access of the Google sheet with the service account of the Google project.
The code for the entire process can be found here.