DATA DOUBLE CONFIRM

Extracting data from tables in PDF

Having data in tables in a PDF is probably one of the most agonizing things for users. It feels as if the data is there, yet not there.


[Edited on 18 Mar 2021] I found this useful resource online: https://theautomatic.net/2019/05/24/3-ways-to-scrape-tables-from-pdfs-with-python/. I ran into issues while trying to use tabula-py, with an error relating to Java: JavaNotFoundError: `java` command is not found from this Python process. I did not manage to resolve it after spending some time, so I decided to try camelot instead, which worked for me.
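For anyone hitting the same error, here is a minimal sketch of the two steps described above: checking whether `java` is visible to the Python process, and reading tables with camelot instead. The helper names `java_on_path` and `extract_tables` are my own for illustration, and the default page is an assumption; camelot has to be installed separately.

```python
import shutil


def java_on_path() -> bool:
    """Return True if a `java` executable is visible to this Python process."""
    return shutil.which("java") is not None


def extract_tables(pdf_path: str, pages: str = "1"):
    """Return one pandas DataFrame per table camelot detects on the given pages."""
    # Third-party import kept inside the function so java_on_path() works without camelot.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    return [table.df for table in tables]


if __name__ == "__main__":
    if not java_on_path():
        print("`java` is not on PATH, so tabula-py is likely to raise JavaNotFoundError.")
```

If `java_on_path()` returns False, installing Java or adding it to PATH may be enough to get tabula-py running, though I have not verified that for my own setup.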


In this post, I cover two resources that allow us to extract data into Excel/CSV format. I use the U.S. Complement to the End of Childhood Report 2018 as the example here.

The first is https://pdftables.com/.

It lets you upload a PDF and converts the first 25 pages into Excel format for free, without signing up.

This is what the table looks like after conversion. You can see that some rows are jumbled up, e.g. the values for New Hampshire and Massachusetts somehow got concatenated in the Infant deaths and Infant mortality columns.
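Jumbled rows like these can be flagged for manual checking once the sheet is exported to CSV. The sketch below is one possible approach, not something pdftables.com provides: it marks any row whose numeric cells do not parse as a single number. The sample rows are made up for illustration and are not figures from the report.

```python
def is_single_number(cell: str) -> bool:
    """Return True if the cell holds exactly one number (thousands commas allowed)."""
    text = cell.replace(",", "").strip()
    try:
        float(text)
    except ValueError:
        return False
    return True


def rows_to_review(rows, numeric_columns):
    """Return the indices of rows where any numeric column is empty or merged."""
    flagged = []
    for index, row in enumerate(rows):
        if any(not is_single_number(row[col]) for col in numeric_columns):
            flagged.append(index)
    return flagged


# Hypothetical rows: the second has two values merged per cell, the third is empty.
sample_rows = [
    {"State": "A", "Infant deaths": "120", "Infant mortality": "4.1"},
    {"State": "B", "Infant deaths": "45 260", "Infant mortality": "3.7 3.9"},
    {"State": "C", "Infant deaths": "", "Infant mortality": ""},
]

flagged = rows_to_review(sample_rows, ["Infant deaths", "Infant mortality"])
print(flagged)  # [1, 2]
```

The flagged rows still need to be fixed by hand against the PDF, but at least you know where to look.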

The second one is Tabula. Its GitHub page provides more details on how to use it.

Somehow I couldn't get it to work on port 8080, but it worked after I changed it to port 9999.

java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -Dwarbler.port=9999 -jar tabula.jar
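I never found out why port 8080 failed for me. One possible cause, and this is only an assumption, is that another program was already using that port. The sketch below checks whether a port can be bound locally and builds the same command as above for whichever port you pick. The function names are my own, and `tabula.jar` is assumed to be in the current directory.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if this process can bind the port, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def tabula_command(port: int, jar: str = "tabula.jar") -> list:
    """Build the java command shown above, with a configurable port."""
    return [
        "java",
        "-Dfile.encoding=utf-8",
        "-Xms256M",
        "-Xmx1024M",
        f"-Dwarbler.port={port}",
        "-jar",
        jar,
    ]


if __name__ == "__main__":
    for candidate in (8080, 9999):
        status = "free" if port_is_free(candidate) else "in use"
        print(candidate, status)
    print(" ".join(tabula_command(9999)))
```

The resulting list can be passed to `subprocess.run` to start Tabula from Python instead of typing the command by hand.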

You will see this page once it loads successfully. You can then import the PDFs from which you would like to extract data.

It lets us select the area we would like to extract. Hence we can extract table by table, and the output differs from that of pdftables.com, which puts two tables into one sheet.

This is what the output looks like.

Both interfaces require some human effort, whether in selecting the area for extraction, in cleaning the output, or both, i.e. the process is not fully automated. However, they do cut down the manual work of re-entering data.