DATA DOUBLE CONFIRM

Extracting data from tables in PDF [Updated] - Python

The previous post on extracting data from tables in PDF is many years old and needed an update, as it didn't cover pdfplumber, a Python package that is now widely used and is also suggested by ChatGPT.


I wanted to extract the waste and recycling statistics published in PDF format by the National Environment Agency (NEA) of Singapore (waste-and-recycling-statistics-2018-to-2022). I first tried tabula-py, but the results weren't satisfactory: it grouped all the columns together into a single column. pdfplumber works nicely, and it's pretty straightforward. You can find the Python code here: https://github.com/hxchua/datadoubleconfirm/blob/master/notebooks/pdftables.ipynb.
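
To give a feel for how little code this takes, here is a minimal sketch of table extraction with pdfplumber. It is not the code from the notebook linked above, just an illustration under a few assumptions: the helper names `clean_cell`, `table_to_records` and `extract_tables` are made up for this example, the first row of each table is assumed to be the header, and the PDF is assumed to be saved locally. Only `pdfplumber.open`, `pdf.pages` and `page.extract_tables()` come from the library itself.

```python
def clean_cell(cell):
    """Normalise one cell: pdfplumber returns None for empty cells,
    and multi-line cells contain newline characters."""
    return (cell or "").replace("\n", " ").strip()


def table_to_records(table):
    """Turn one extracted table (a list of rows) into a list of dicts.

    Assumption for this sketch: the first row holds the column names.
    """
    header, *rows = table
    names = [clean_cell(h) for h in header]
    return [dict(zip(names, (clean_cell(c) for c in row))) for row in rows]


def extract_tables(pdf_path):
    """Yield every table pdfplumber finds in the PDF, page by page."""
    import pdfplumber  # third-party package: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns a list of tables; each table is a
            # list of rows, and each row is a list of strings (or None).
            for table in page.extract_tables():
                yield table
```

With the PDF saved locally (the file name here is an assumption), something like `for table in extract_tables("waste-and-recycling-statistics-2018-to-2022.pdf"): print(table_to_records(table))` would print each table as a list of row dictionaries. Real tables often need more tidying than this, for example when the header spans more than one row.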




If you're looking for a dataset to practise data cleaning, analysis, and visualization on, this is a good one, as the data is quite dirty. You can find the dataset on waste statistics and the overall recycling rate in Singapore from 2003 to 2023 here: https://data.world/hxchua/waste-in-singapore.

