This is Part I of a two-part post. Part I talks about scraping data from SGDI while Part II outlines the process of presenting the data using Tableau.
The code builds on the one covered in a previous post on how to use Beautifulsoup in Python to scrape web data.
Contact information on government agencies is available on the Singapore Government Directory (SGDI) but the information is segregated by (type of) agencies. Users have to click on the agencies they are interested one-by-one to find out the information.
I thought it would be great if we can have a single page showing general contact information of all agencies (something like this).
In order to do the above, we need to first collect the data. As a proof-of-concept, I manually collected the data on Ministries (which url is slightly different from that of statutory boards). For the data on Statutory Boards, I decided to write a script to automate the collection. The data items that I was interested in were: 'name', 'address', 'tel', 'fax', 'website', and 'email'. Also, I need the images of the agencies' logo. While it is not exactly taxing to perform the data collection manually (there're about 60 statutory boards), it would still be nice to do some coding and do things in a "smarter" way :)
One challenge I had was extracting the email addresses of the organisations. It doesn't show up in the parsed html code (as seen in Out [35]:) using Beautifulsoup. I had to make use of regular expression (and hence to import the re package).
Next, what I want to do is to save the images of the logos to my computer. On each organisation page, the logo image is tied to a url. I extract the acronym of the organisation as the name of the file and save the image to its respective type i.e. .jpg/ .gif/ .png.
While most of the agencies had all data available on SGDI (i.e. 'name', 'address', 'tel', 'fax', 'website', and 'email' as well as logo image), some agencies had one or two missing data fields. Hence some manual effort was necessary, where I had to search for the information on the organisation website itself. Some agencies did not have an email address for the public correspondence but an online site containing an Enquiry/ Feedback form, (or some had both), hence a separate field was created to reflect it.
The eventual dataset put together has the following data items:
Organisation
Type (Ministry or Statutory Board)
Acronym
Address
Zipcode
Latitude
Longitude
Website
Tel
Fax
Email
Enquiry/ Feedback Form
Parent Ministry
To find which statutory boards were under a particular Ministry, I referred to a Careers@Gov page. Statutory Boards listed on SGDI but were not under any Ministry were dropped. Zipcode was extracted from the address and latitude/ longitude information was obtained through passing this data through Tableau. As some organisations were in the same building (i.e. identical zipcode), in order to make them appear as different dots on the map thereafter, some coordinates were manually edited at the 2nd or 3rd decimal point.
The code for scraping the data (and automating the downloading of images) can be found here.
The dataset GovSG.csv can be found here.
The visualization on the one-stop guide to Government agencies in Singapore can be found here.