Extracting data from tables in PDF

July 7, 2018

Having data in tables in PDF is probably one of the most agonizing thing for users. It feels as if the data is there but not there. In this post, I cover two resources that allow us to extract data into excel/ csv format. I made use of the U.S. Complement to the End of Childhood Report 2018 as an example here. 



The first is https://pdftables.com/. 


It allows one to upload a PDF and converts the first 25 pages into excel format for free without signing up. 


This is how the table looks like upon conversion. You can see some rows are jumbled up. e.g. New Hampshire and Massachusetts somehow got concatenated for the columns Infant deaths and Infant mortality.



The second one is Tabula. The link points towards their Github page and provides more details relating to how you can use it. 

Somehow I couldn't get it to work at port 8080 but it worked after I changed it to port 9999.  

java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -Dwarbler.port=9999 -jar tabula.jar


You will see this page upon loading successfully. After which you can import PDFs from which you would like to extract data. 



It allows us to select the area which we would like to extract. Hence we can choose to extract table by table and the output would be different from that of pdftables.com which puts two tables into one sheet.


This is how the output looks like.  


Both interfaces require some form of human effort in both selection of area for extraction and/ or cleaning. i.e. the process is not fully automated. However, it reduces some manual work from data re-entry.


Please reload

Recent Posts

Please reload

©2017-2020 by DATA DOUBLE CONFIRM. Proudly created with Wix.com

This site was designed with the
website builder. Create your website today.
Start Now