Extracting data from tables in PDF [Updated] - Python

datadoubleconfirm
Jul 6, 2024

The previous post on extracting data from tables in PDF is many years old and needed an update, as it didn't cover a Python package that is now widely used and is also suggested by ChatGPT: pdfplumber.

I wanted to extract the data published by the National Environment Agency (NEA) of Singapore on waste and recycling statistics in PDF format (waste-and-recycling-statistics-2018-to-2022). I first tried tabula-py, but the results were unsatisfactory because it grouped all the columns together into a single column. pdfplumber works nicely and is pretty straightforward to use. You can find the Python code here: https://github.com/hxchua/datadoubleconfirm/blob/master/notebooks/pdftables.ipynb.
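For readers who just want the general shape of a pdfplumber workflow, here is a minimal sketch. It is not the code from the linked notebook. The helper names (extract_tables, clean_rows, write_csv, main) and the output file names are made up for illustration, and it assumes pdfplumber is installed (pip install pdfplumber) and the PDF has been downloaded locally.

```python
import csv


def extract_tables(pdf_path):
    """Return every table pdfplumber detects, each as a list of rows."""
    import pdfplumber  # third-party package, imported here so the helpers below work without it

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() gives one list of rows per table found on the page;
            # each row is a list of cell strings (or None for empty cells)
            tables.extend(page.extract_tables())
    return tables


def clean_rows(rows):
    """Replace None cells with "" and collapse line breaks and repeated spaces inside cells."""
    return [[" ".join((cell or "").split()) for cell in row] for row in rows]


def write_csv(rows, csv_path):
    """Write the rows of one table to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def main(pdf_path):
    """Hypothetical driver: save each detected table as table_0.csv, table_1.csv, ..."""
    for i, table in enumerate(extract_tables(pdf_path)):
        write_csv(clean_rows(table), f"table_{i}.csv")
```

Calling main with the path to the downloaded NEA PDF (the local file name is up to you) should write one CSV per detected table. How cleanly the columns separate depends on the PDF's layout, so expect to do some further cleaning afterwards.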

If you're looking to use this dataset for practice on data cleaning, analysis, and visualization, it's a good one as the data is quite dirty. You can find the dataset on Waste statistics and overall recycling rate in Singapore from 2003 to 2023 here: https://data.world/hxchua/waste-in-singapore.

The original attachment is published on this NEA website: https://www.nea.gov.sg/our-services/waste-management/waste-statistics-and-overall-recycling
