top of page
Untitled

DATA DOUBLE CONFIRM

Data preparation for scraped data - Process - Excel

This is Part II of a four-part post. Part I talks about scraping data from a website (bookdepository.com, in this case) while Part II discusses data cleaning/ preparation. Part III outlines the process of presenting the data using Tableau and Part IV delves into insights from the analysis.

This serves to provide a "cheat"/ "quick-and-dirty" method of cleaning datasets using Excel, should the size of the dataset be manageable within Excel. The following is how the raw data looks like upon scraping. The dataset can be found here.

As we can see, there are some variables that we need to clean before we can use them, namely author, ratingcount, saleprice, listprice, and numofpages.

Some useful functions include

- Replace. Ctrl + H. We can remove characters with that function. eg. I can remove "S$" from the variables saleprice and listprice or "By (author)" from author with this function. Additional lines can be removed using the following keyboard steps: Ctrl+H, Ctrl+J, Tab, Enter.

- Extracting text before/ between certain characters/ words.

eg. to extract the number of ratings given i.e. figures from ratingcount column, input the following formula under a new column

=MID(I2,SEARCH("(",I2)+1,SEARCH(" rating",I2)-SEARCH("(",I2)-1)

eg. to extract the figures from numofpages column, input the following formula under a new column

=LEFT(L2,(FIND(" ",L2,1)-1))

- Checking if a certain text variable contains a certain word/ substring of a word.

=IF(ISNUMBER(SEARCH("substring",text)), "Yes", "No") Not relevant in this case but useful in many occasions. If the substring is referenced to a cell within the workbook, "" is not necessary.

The cleaned dataset can be found here.

bottom of page