01
Preamble
Data is at the heart of every data science analysis, and the datasets used in most analyses are clean tables stored as comma-separated values (CSV). However, the portable document format (PDF) is one of the most widely used file formats, so every data scientist should know how to extract data from PDF files and convert it to a format such as CSV for analysis or model building.
In this article, we focus on extracting data tables from PDF files; a similar approach can be used for other kinds of data, such as text or images. We illustrate with an example how to extract a table from a PDF file and convert it into a format suitable for further analysis and model building.
02
Example: Using Python to extract a table from a PDF file
a) Copy the table to Excel and save it as table_1_raw.csv
The pasted data is stored in a one-dimensional format, so it must be reshaped, cleaned, and transformed.
b) Import the necessary libraries
import pandas as pd
import numpy as np
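c) Import the raw data and reshape it
df = pd.read_csv("table_1_raw.csv", header=None)
# Reshape the one-dimensional data into 25 rows by 10 columns
df2 = pd.DataFrame(df.values.reshape(25, 10))
# The first row holds the column names
column_names = df2[0:1].values[0]
df3 = df2[1:].copy()
df3.columns = column_names
df3.head()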
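To make the reshaping step concrete, here is a minimal sketch on a small made-up column rather than the real file. The helper name `reshape_flat_table` and the sample values are assumptions for illustration only; for the table in this article the call would use 25 rows and 10 columns.

```python
import pandas as pd

def reshape_flat_table(flat, n_rows, n_cols):
    """Turn one-dimensional data into an n_rows x n_cols table,
    using the first reshaped row as the column names."""
    values = flat.to_numpy().reshape(n_rows, n_cols)
    return pd.DataFrame(values[1:], columns=values[0])

# Made-up data: a header row (x1, x2, x3) followed by two records,
# all stacked in a single column.
flat = pd.DataFrame(["x1", "x2", "x3", "1", "2", "10%", "4", "5", "20%"])
table = reshape_flat_table(flat, n_rows=3, n_cols=3)
print(table)
```

Note that `reshape` raises a `ValueError` if the number of values does not equal `n_rows * n_cols`, which is a useful early warning that cells were lost or merged while copying.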
d) Data wrangling using string processing tools
We note from the table above that columns x5, x6, and x7 are expressed as percentages, so we need to remove the percent (%) symbol:
# Work on a copy so the reshaped table df3 stays unchanged
df4 = df3.copy()
# Drop the last character (the "%") from every value
df4['x5'] = list(map(lambda x: x[:-1], df4['x5'].values))
df4['x6'] = list(map(lambda x: x[:-1], df4['x6'].values))
df4['x7'] = list(map(lambda x: x[:-1], df4['x7'].values))
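As an alternative to slicing off the last character, the same cleanup can be sketched with pandas string methods. This is only an illustration on made-up values; the sample frame `df` is an assumption, and only the column names x5 to x7 come from the article.

```python
import pandas as pd

# Made-up sample; note that one value in x6 has no "%" at all.
df = pd.DataFrame({
    "x5": ["10%", "7.5%"],
    "x6": ["3", "12%"],
    "x7": ["0%", "99.9%"],
})

for col in ["x5", "x6", "x7"]:
    # rstrip removes only trailing "%" characters, so a value without
    # the symbol is left intact, whereas x[:-1] would cut off a digit.
    df[col] = df[col].str.rstrip("%")

print(df)
```

The values are still strings at this point; converting them to numbers is a separate step.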
e) Convert the data to numeric form
Columns x5, x6, and x7 still hold string values, so we need to convert them to numeric data as follows:
for col in ['x5', 'x6', 'x7']:
    df4[col] = df4[col].astype(float)
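If some cells might not parse as numbers (for example a stray "n/a" left over from the PDF), a more forgiving conversion can be sketched with `pd.to_numeric`. The sample values below are made up for illustration.

```python
import pandas as pd

# Made-up sample with one value that is not a number.
df = pd.DataFrame({"x5": ["10", "7.5", "n/a"]})

# errors="coerce" turns unparseable entries into NaN instead of raising,
# so bad cells can be inspected after the conversion.
df["x5"] = pd.to_numeric(df["x5"], errors="coerce")

n_bad = int(df["x5"].isna().sum())
print(df)
print("unparseable values:", n_bad)
```

Counting the resulting NaN values gives a quick check of how much of the table failed to convert.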
f) View the final form of the converted data
df4.head(n=5)
g) Export the final data to a csv file
df4.to_csv('table_1_final.csv',index=False)
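A quick sanity check after exporting is to read the file back and compare it with the data in memory. The sketch below uses a small made-up frame and a temporary directory, both of which are assumptions for illustration; only the file name comes from the article.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the final cleaned table.
df_final = pd.DataFrame({"x1": [1, 2], "x5": [10.0, 7.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table_1_final.csv")
    # index=False keeps the row index out of the file
    df_final.to_csv(path, index=False)
    df_check = pd.read_csv(path)

# equals() compares values, column names, and dtypes
same = df_check.equals(df_final)
print(same)
```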