SoFunction
Updated on 2024-11-21

Example: Extracting Data from a PDF with Python

01

Introduction

Data is the foundation of any analysis in data science, and most analyses start from clean data stored in comma-separated value (CSV) tables. However, because Portable Document Format (PDF) files are among the most widely used file formats, every data scientist should know how to extract data from a PDF file and convert it to a format such as CSV for use in analysis or model building.

In this article, we focus on extracting data tables from PDF files. Similar techniques can be applied to other kinds of content in a PDF, such as text or images. Using one example, we show how to extract a data table from a PDF file and convert it into a format suitable for further analysis and model building.

02

Example: Using Python to extract a table from a PDF file

a) Copy the table to Excel and save it as table_1_raw.csv

The pasted data is stored in a one-dimensional format, so it must be reshaped, cleaned and transformed before it can be used.
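To make the reshaping idea concrete, here is a minimal sketch on made-up data. The values, the names raw, table, header and data, and the 3-by-2 shape are all illustrative assumptions, not the contents of the article's table. It also assumes the pasted cells sit in a single column, listed row by row, so that a reshape can fold them back into rows.

```python
import pandas as pd

# Hypothetical one-column data, as it might look after pasting a table
# with 2 columns and 3 rows (header row included) into a spreadsheet.
raw = pd.DataFrame(["name", "share", "A", "10%", "B", "20%"])

# Fold the 6 cells back into 3 rows of 2 cells each (row-major order).
table = pd.DataFrame(raw.values.reshape(3, 2))

# Promote the first row to column names and keep the rest as data.
header = table.iloc[0].tolist()
data = table.iloc[1:].reset_index(drop=True)
data.columns = header
print(data)
```

The reshape only works when the number of cells equals rows times columns, so a failed reshape is a useful early sign that the copy step dropped or merged cells.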

b) Import the necessary libraries

import pandas as pd
import numpy as np

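c) Import the raw data and reshape it

df = pd.read_csv("table_1_raw.csv", header=None)

# Reshape the one-dimensional data into 25 rows by 10 columns
df2 = pd.DataFrame(df.values.reshape(25, 10))
# The first row holds the column names; the remaining rows are the data
column_names = df2[0:1].values[0]
df3 = df2[1:].copy()
df3.columns = column_names
df3.head()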

d) Data wrangling using string processing tools

Columns x5, x6, and x7 are expressed as percentages, so we need to remove the percent (%) symbol:

# Work on a copy of df3 so the reshaped table is left unchanged
df4 = df3.copy()
# Drop the last character (the percent sign) from every value
df4['x5']=list(map(lambda x: x[:-1], df4['x5'].values))
df4['x6']=list(map(lambda x: x[:-1], df4['x6'].values))
df4['x7']=list(map(lambda x: x[:-1], df4['x7'].values))
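Slicing off the last character removes it whether or not it is a percent sign. As a hedged alternative, the sketch below uses the pandas string method str.rstrip, which removes only trailing percent signs. The series s and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical column of percentage strings; the last value has no sign.
s = pd.Series(["12.5%", "7%", "100"], name="x5")

# Remove a trailing percent sign; values without one are left unchanged.
stripped = s.str.rstrip("%")
print(stripped.tolist())
```

This matters when some cells in the copied table are missing the symbol: slicing would silently cut a digit off those values, while rstrip leaves them intact.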

e) Convert the data to numeric form

The values in columns x5, x6 and x7 are still strings, so we convert them to numeric values as follows:

df4['x5']=[float(x) for x in df4['x5'].values]
df4['x6']=[float(x) for x in df4['x6'].values]
df4['x7']=[float(x) for x in df4['x7'].values]
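Calling float on every value raises an error at the first cell that is not a valid number. One possible alternative, sketched here on made-up data, is pandas.to_numeric with errors="coerce", which turns unparseable cells into NaN so they can be inspected afterwards. The names cleaned and numeric are illustrative.

```python
import pandas as pd

# Hypothetical cleaned strings, including one cell that is not a number.
cleaned = pd.Series(["12.5", "7", "n/a"])

# Convert to floats; anything that cannot be parsed becomes NaN.
numeric = pd.to_numeric(cleaned, errors="coerce")

# Count the cells that failed to convert so they can be checked by hand.
n_bad = int(numeric.isna().sum())
print(numeric.dtype, n_bad)
```

Whether to coerce or to fail loudly is a judgment call: coercion keeps the pipeline running, but the NaN count should be checked before the data is used.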

f) View the final form of the converted data

df4.head(n=5)

g) Export the final data to a csv file

df4.to_csv('table_1_final.csv',index=False)
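A quick way to check an export is to read it back and compare it with what was written. The sketch below does this with an in-memory buffer instead of a file, on a small made-up table; the names final, buffer and restored are illustrative.

```python
import io

import pandas as pd

# Hypothetical final table with one text column and one numeric column.
final = pd.DataFrame({"x1": ["A", "B"], "x5": [12.5, 7.0]})

# Write the CSV to an in-memory buffer, without the index column.
buffer = io.StringIO()
final.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing index=False, as in the export above, keeps the row index out of the file, so the columns read back match the columns that were written.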
