

Web scraping using Beautifulsoup - Process - Python

This is Part I of a four-part post. Part I covers scraping data from a website (bookdepository.com, in this case), Part II discusses data cleaning/preparation, Part III outlines the process of presenting the data in Tableau, and Part IV delves into insights from the analysis.

While there are packages built specifically for scraping Twitter data, the most commonly used package for scraping web data in general is BeautifulSoup. Web scraping is a useful skill because it allows you to "collect" data that you would like to analyze, and it is much more cost-effective and much less time-consuming than a survey, for example. Hence, if ready/open data is not available, we can look for data on the Internet as an alternative before considering fieldwork.

Over here, I've put together an exercise to crawl the Bestsellers on Book Depository. The main steps are as follows:

- From the main landing page, we want to get the URL of every individual book page. We need to know how this information is embedded within the HTML code, which we can find out by right-clicking the part of the page containing the information we want and selecting "Inspect". Click the dropdowns in the HTML code to view more. (The first sketch after this list illustrates this step.)

- There are 1,000 results and the first main landing page consists of books 1-30. We will start with one main landing page, then run a loop to do the same for the rest of the main landing pages. That means we need to know how many main landing pages there are in total, which is easy: each main landing page holds 30 books, so 1,000/30 ≈ 33.3, which rounds up to 34 pages.

- After getting the URLs of all individual book pages on this one main page, we identify the data items we want for each book on its individual page, such as book material, author, rank, main category, sub-category, rating, rating count, sale price, list price, number of pages, date published, and ISBN13. Similarly, we right-click the part of the page containing the information we want and select "Inspect".

- With BeautifulSoup, parsing/extracting information from HTML is easy, because we can focus on specific tags/classes. We identify the tags and classes we need to scrape for all the data items we want. (See the second sketch after this list.)

- The information presented in the "Product details" section differs slightly from one individual book page to another, so for convenience I only scraped data items that are consistent across pages.

- After scraping one book page, we move on to the rest of the URLs we previously collected from the main landing page, i.e. we loop through books 1-30, before proceeding to the other main landing pages. (The last sketch below strings these steps together.)
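
To make the first two steps concrete, here is a minimal sketch of collecting the individual book URLs from the bestsellers landing pages. The `?page=N` URL pattern and the `book-item`/`title` CSS classes are assumptions about how bookdepository.com was structured, not verbatim from my notebook, so treat them as placeholders to verify with "Inspect".

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://www.bookdepository.com"

def get_book_urls(page_num):
    # Each landing page shows 30 books; pages are assumed to be selected with ?page=N.
    resp = requests.get(f"{BASE}/bestsellers", params={"page": page_num})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    urls = []
    # Each book is assumed to sit in a <div class="book-item">, with the link
    # to the individual book page inside an <h3 class="title"> anchor.
    for item in soup.find_all("div", class_="book-item"):
        heading = item.find("h3", class_="title")
        if heading and heading.a:
            urls.append(BASE + heading.a["href"])
    return urls

# 1,000 results at 30 per page -> 34 landing pages (the last one is partial).
all_book_urls = []
for page in range(1, 35):
    all_book_urls.extend(get_book_urls(page))
```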
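
And here is a sketch of pulling the data items from one individual book page. The selectors are again assumptions: the "Product details" section is treated as a list of label/value pairs (`ul.biblio-info` with `<label>` and `<span>` children), and fields missing on a given page are simply skipped, mirroring the decision above to keep only data that is consistent across pages.

```python
def scrape_book(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    record = {"url": url}
    title = soup.find("h1")
    record["title"] = title.get_text(strip=True) if title else None
    # "Product details" is assumed to be a <ul class="biblio-info"> whose
    # items pair a <label> (e.g. "ISBN13") with a <span> holding the value.
    details = soup.find("ul", class_="biblio-info")
    if details:
        for li in details.find_all("li"):
            label = li.find("label")
            value = li.find("span")
            if label and value:
                record[label.get_text(strip=True)] = value.get_text(" ", strip=True)
    return record
```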
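
Finally, a sketch that strings the steps together: loop over every book URL collected earlier and write the rows out as a CSV. The column names here are illustrative; restricting the writer to a fixed set of columns is one way to drop the fields that differ across pages.

```python
import csv
import time

FIELDS = ["url", "title", "Format", "Publication date", "ISBN13"]  # illustrative

rows = []
for url in all_book_urls:
    rows.append(scrape_book(url))
    time.sleep(1)  # be polite to the server between requests

with open("bookdepo.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
```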

The raw dataset scraped, bookdepo.csv, can be found here. The file description is written in the README on the main page of the GitHub repository. The Python notebook used for scraping the data can be found here.
