Tested against Python 3.5.2
First install the two packages:
$ pip install googletrans
$ pip install pdfminer3k
googletrans provides a command, translate, which calls the Google Translate API to perform automatic translation.
pdfminer3k provides a tool script:
$
Search * for commands that remove headers and footers (highly recommended):
Use the pdftotext tool provided by Ubuntu:
$ pdftotext -y 50 -H 650 -W 1000 -nopgbrk
$ pdftotext -f 147 -l 166 -y 50 -H 650 -W 1000 -nopgbrk
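To drive the same extraction from Python, one option is to wrap the command with subprocess. This is only a minimal sketch: the helper names build_pdftotext_cmd and extract_text and the file names are hypothetical, while the flags are the ones used above (-f and -l for the page range, -y, -H and -W for the crop area that cuts off headers and footers, -nopgbrk to suppress page-break characters).

```python
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path, first=None, last=None,
                        y=50, height=650, width=1000):
    """Build the pdftotext argument list (hypothetical helper)."""
    cmd = ["pdftotext"]
    # Optional page range: -f is the first page, -l the last page.
    if first is not None:
        cmd += ["-f", str(first)]
    if last is not None:
        cmd += ["-l", str(last)]
    # Crop area: start y pixels from the top, keep a height x width region.
    cmd += ["-y", str(y), "-H", str(height), "-W", str(width), "-nopgbrk"]
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path, **kwargs):
    """Run pdftotext; raises CalledProcessError if the tool fails."""
    subprocess.check_call(build_pdftotext_cmd(pdf_path, txt_path, **kwargs))
```

For example, extract_text("book.pdf", "trans_src.txt", first=147, last=166) would correspond to the second command, assuming those file names.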
Google Translate does not recognize paragraphs or whole sentences across line breaks: if a sentence contains a line break, the translation comes out incomplete. You can verify this with the web version of Google Translate.
The text file converted from the PDF therefore needs its lines joined. The Linux xargs command does this by removing every line break in the file.
This raises a new problem: the whole file becomes a single line and the paragraph structure is lost. So we manually add a delimiter at each paragraph boundary, here the special character @.
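The same joining and splitting can also be done in Python instead of xargs. The sketch below is an illustrative equivalent, not the method used in this post; the function name split_paragraphs is hypothetical, and it assumes the @ delimiters have already been added to the text by hand.

```python
def split_paragraphs(text, delimiter="@"):
    """Split on the delimiter and collapse each paragraph to a single line."""
    paragraphs = []
    for chunk in text.split(delimiter):
        # str.split() with no arguments splits on any run of whitespace,
        # including newlines, so joining with " " removes the line breaks.
        joined = " ".join(chunk.split())
        if joined:  # skip empty chunks, e.g. after a trailing delimiter
            paragraphs.append(joined)
    return paragraphs
```

Each returned string is one paragraph on one line, which is the form Google Translate handles best.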
Execute the following command:
cat trans_src.txt |xargs |xargs -0 -d '@' -i{} translate -d zh-cn {} |tee trans_dst.txt
cat sva_src_1to2.txt |xargs |xargs -0 -d '&' -i{} translate -d zh-cn {} |xargs -d'\n' -n4 | awk -F'zh-cn' '{print $2}' | awk -F'[][]' '{print $2}' | tee sva_dst_1to2.txt
Redirect the translated text to a file, then do some simple post-processing on the file, and you're done.
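If you prefer to stay in Python rather than pipe through the translate command, the per-paragraph translation could be sketched as follows. The function name translate_paragraphs and the translate_fn parameter are hypothetical. The default path assumes a synchronous googletrans release in which Translator().translate(text, dest=...) returns an object with a .text attribute; newer releases may behave differently, so treat this as a starting point.

```python
def translate_paragraphs(paragraphs, dest="zh-cn", translate_fn=None):
    """Translate each paragraph separately and return the translated strings."""
    if translate_fn is None:
        # Assumption: a synchronous googletrans release whose
        # Translator.translate() returns an object with a .text attribute.
        from googletrans import Translator
        translator = Translator()

        def translate_fn(text):
            return translator.translate(text, dest=dest).text

    # One request per paragraph, so each paragraph reaches the
    # translator as a single line without internal line breaks.
    return [translate_fn(p) for p in paragraphs]
```

Passing your own translate_fn makes it possible to try the function without network access. The returned list can then be written to a file such as trans_dst.txt, one paragraph per line.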
That is everything I have to share on extracting text from PDF files with Python and translating it automatically. I hope it serves as a useful reference.