SoFunction
Updated on 2024-11-12

Extracting text from a PDF file and translating it automatically with Python

Tested against Python 3.5.2

First install the two packages:

$ pip install googletrans

$ pip install pdfminer3k

googletrans installs a command named translate, which calls the Google Translate API to translate text automatically.
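The library can also be called from Python instead of from the shell. The following is a minimal sketch, assuming the synchronous Translator interface of googletrans 3.x (in some later releases translate is a coroutine and has to be awaited). The names translate_text and translator are hypothetical helpers introduced here so the function can be tried without network access.

```python
def translate_text(text, dest="zh-cn", translator=None):
    """Translate text into the language dest and return the translated string."""
    if translator is None:
        # Imported lazily so this module still loads when googletrans is absent.
        # Assumption: googletrans 3.x, where Translator.translate is synchronous.
        from googletrans import Translator
        translator = Translator()
    result = translator.translate(text, dest=dest)
    # The returned object carries the translated string in its .text attribute.
    return result.text
```

Calling translate_text("Hello world") contacts the Google Translate service, so it needs network access and may fail if the service changes or throttles requests.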

pdfminer3k will provide a tool script:

$  

A search on * turned up commands that also remove headers and footers (highly recommended):

Use the pdftotext tool provided by Ubuntu:

$ pdftotext -y 50 -H 650 -W 1000 -nopgbrk 

$ pdftotext -f 147 -l 166 -y 50 -H 650 -W 1000 -nopgbrk 
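The same pdftotext call can be driven from Python. This is a sketch only: the function names and the file names in the usage note are hypothetical, and the default crop values are simply the ones used in the commands above (-y is the top edge of the crop area, -H and -W its height and width, -f and -l the first and last page, and -nopgbrk suppresses page-break characters).

```python
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, first=None, last=None,
                        y=50, height=650, width=1000):
    """Build the pdftotext argument list for a cropped, page-limited export."""
    cmd = ["pdftotext"]
    if first is not None:
        cmd += ["-f", str(first)]   # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]    # last page to convert
    # Crop area that skips the header and footer bands, without page breaks.
    cmd += ["-y", str(y), "-H", str(height), "-W", str(width), "-nopgbrk"]
    cmd += [pdf_path, txt_path]
    return cmd


def pdf_to_text(pdf_path, txt_path, **kwargs):
    """Run pdftotext; raises CalledProcessError if the tool exits with an error."""
    subprocess.check_call(build_pdftotext_cmd(pdf_path, txt_path, **kwargs))
```

For example, pdf_to_text("book.pdf", "trans_src.txt", first=147, last=166) would export pages 147 to 166, where book.pdf is a placeholder name. The crop values depend on the page layout of the PDF and usually need adjusting.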

Google Translate does not recognize paragraphs or whole sentences that are split across lines. If a sentence contains a line break, the translation comes out incomplete. You can verify this with the web version of Google Translate.

The converted text file therefore has to be spliced back together. The Linux xargs command can do this: it removes every line break in the file.

This creates a new problem: the whole file becomes a single line and the paragraph structure is lost. To preserve it, manually add a delimiter at the end of each paragraph before joining the lines. Here the special character @ is used.
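The joining and splitting steps can be sketched in Python as follows. The function name split_paragraphs is hypothetical, and the sketch assumes the @ markers have already been added to the text by hand.

```python
def split_paragraphs(text, delimiter="@"):
    """Return one single-line string per delimiter-terminated paragraph."""
    paragraphs = []
    for chunk in text.split(delimiter):
        # str.split() with no argument splits on any run of whitespace,
        # so this removes line breaks and collapses repeated spaces.
        joined = " ".join(chunk.split())
        if joined:  # skip the empty piece after the final delimiter
            paragraphs.append(joined)
    return paragraphs
```

Each returned string is a complete paragraph on one line, which is the form Google Translate handles well.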

Execute the following command:

cat trans_src.txt |xargs |xargs -0 -d '@' -i{} translate -d zh-cn {} |tee trans_dst.txt

cat sva_src_1to2.txt |xargs |xargs -0 -d '&' -i{} translate -d zh-cn {} |xargs -d'\n' -n4 | awk -F'zh-cn' '{print $2}' | awk -F'[][]' '{print $2}' | tee sva_dst_1to2.txt

Redirect the translated text to a file, then do some simple post-processing on the file, and you're done.
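For readers who prefer to stay in Python, the whole pipeline can be approximated as below. This is a sketch, not the method used above: translate_file is a hypothetical name, and the translate argument is any function that takes a string and returns its translation, for example a thin wrapper around googletrans.

```python
def translate_file(src_path, dst_path, translate, delimiter="@"):
    """Translate each delimiter-terminated paragraph of src_path into dst_path.

    Returns the number of paragraphs written.
    """
    with open(src_path, encoding="utf-8") as src:
        text = src.read()

    count = 0
    with open(dst_path, "w", encoding="utf-8") as dst:
        for chunk in text.split(delimiter):
            # Join the lines of one paragraph into a single line.
            paragraph = " ".join(chunk.split())
            if not paragraph:
                continue
            # One translated paragraph per output line.
            dst.write(translate(paragraph) + "\n")
            count += 1
    return count
```

Because the translation function is passed in, the file handling can be checked offline with a stand-in such as str.upper before any real translation requests are sent.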

That is everything I have to share about extracting text from PDF files and translating it automatically with Python. I hope it serves as a useful reference, and thank you for your support.