SoFunction
Updated on 2024-11-17

Reading and storing TXT and PDF document data in Python, and solving the garbled-text problem

First, several commonly used methods

Read TXT documents: urlopen()

Read PDF documents: pdfminer3k

Second, the problem of garbled text

(1), Code that produces garbled output

from urllib.request import urlopen
# Fetch the wiki content
html = urlopen("/")
print(html.read())

Why the output is garbled:

A computer can only handle the two digits 0 and 1, so to process text it must first turn the text into numbers made of 0s and 1s. The earliest computers grouped eight bits (eight 0s and 1s) into one byte, so the largest integer one byte can represent is 255 = 11111111. To represent a larger number, more bytes are needed.

Since computers were invented by Americans, only 128 characters (codes 0 to 127) were encoded at first: the common Arabic numerals, the upper and lower case letters, and the symbols on the keyboard, plus some control codes. This encoding is known as ASCII. For example, the ASCII code for the capital letter A is 65, which is converted to the binary value 01000001, and that is what the computer processes.
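The byte and ASCII values mentioned above can be checked directly in Python. This is only a small sketch using built-ins; the variable names are illustrative.

```python
# One byte is eight bits, so the largest value it can hold is 255.
largest_byte = 0b11111111
print(largest_byte)        # 255

# The ASCII code of "A" and its eight-bit binary form.
code_a = ord("A")
bits_a = format(code_a, "08b")
print(code_a, bits_a)      # 65 01000001

# 256 needs nine bits, so it no longer fits in a single byte.
bits_needed = (256).bit_length()
print(bits_needed)         # 9
```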

Obviously, ASCII cannot represent Chinese, so China developed its own GB2312 encoding, which is compatible with ASCII. The problem is this: suppose the three characters of Muzi.com are encoded in GB2312 as 61, 62, 63. In another code table the same values may stand for other characters. Under a Japanese encoding, for instance, 61 62 63 maps to different characters, so the text means something else once it is opened.
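The mismatch can be reproduced in a few lines. The sketch below assumes Shift_JIS as the Japanese encoding and uses the sample string "中文"; neither comes from the original text.

```python
text = "中文"  # sample text, chosen for illustration

# Encode with GB2312, as a Chinese system would.
raw = text.encode("gb2312")

# Decoding with the matching code table restores the text.
round_trip = raw.decode("gb2312")

# Decoding the same bytes with a Japanese code table (Shift_JIS is assumed here)
# yields different characters: this is the garbled output.
garbled = raw.decode("shift_jis", errors="replace")

print(raw)
print(round_trip)
print(garbled)
```

The bytes never change; only the code table used to interpret them does.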

Solution:

The international Unicode standard brings the character sets of the world together into one. Unicode-encoded content therefore opens correctly on any computer that uses Unicode.

There is a cost, however. The ASCII code for A is 01000001, while its two-byte Unicode form is 00000000 01000001, which wastes space when the text is mostly English.

This is why the UTF-8 encoding appeared. It is variable-length: A is still stored as the single byte 01000001, while a common Chinese character usually takes three bytes.
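The sizes can be compared directly. This sketch uses UTF-16 (big-endian) as a stand-in for the fixed two-byte Unicode form described above, which is an assumption about what that form refers to.

```python
# "A" in a two-byte encoding versus UTF-8.
a_two_bytes = "A".encode("utf-16-be")
a_utf8 = "A".encode("utf-8")

# A common Chinese character in UTF-8.
zh_utf8 = "中".encode("utf-8")

print(len(a_two_bytes), len(a_utf8), len(zh_utf8))         # 2 1 3
print(" ".join(format(b, "08b") for b in a_two_bytes))     # 00000000 01000001
```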

(2), Notepad works with Unicode text in memory. When the file is saved to disk, the text is converted to UTF-8.

When the file is opened again, the stored bytes are converted back to Unicode.

Why it is done this way: storing as UTF-8 saves space, and working in Unicode once the file is open ensures maximum compatibility.
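The same save and open cycle can be sketched in Python, where a str plays the role of the in-memory Unicode text. The file name note.txt and the sample string are made up for the example.

```python
import os
import tempfile

text = "Python 中文"  # a str is Unicode text in memory

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "note.txt")

# Saving: the text is converted to UTF-8 bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# What is actually stored is raw bytes.
with open(path, "rb") as f:
    stored = f.read()

# Opening: the bytes are decoded back to Unicode text.
with open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

print(len(text), len(stored))   # 9 characters, 13 bytes
print(loaded == text)           # True
```

Passing encoding="utf-8" explicitly avoids depending on the platform default, which differs between systems.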

(3), A server reads Unicode text and converts it to UTF-8 before passing it to the browser. Network bandwidth is expensive, so the conversion reduces the amount of data that has to be sent.

(4), Python 3 strings are Unicode by default, so Python 3 handles text in many languages.

A Unicode str can be encoded to bytes in a specified encoding with the encode() method.

When a bytes object is printed, bytes outside the ASCII range are shown as \x## escapes. To get readable text back, call decode() on the bytes object (not on a str), for example b'\x##'.decode('utf-8').
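A short round trip shows encode() and decode() together. The sample string is illustrative only.

```python
text = "中文"

# str -> bytes
data = text.encode("utf-8")
print(data)                    # b'\xe4\xb8\xad\xe6\x96\x87'

# bytes -> str
restored = data.decode("utf-8")
print(restored)                # 中文

# ASCII cannot represent these characters, so encoding to ASCII fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                # False
```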

(5), Solutions

from urllib.request import urlopen
# Fetch the wiki content
html = urlopen("/")
# read() returns bytes; decode them as UTF-8 to get readable text
print(html.read().decode("utf-8"))
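Hard-coding "utf-8" works when the page really is UTF-8. As a possible refinement, the sketch below reads the charset declared in the response headers and falls back to UTF-8 otherwise. The helper name read_text is hypothetical, and the data: URL is only a stand-in so the example runs without a network connection.

```python
from urllib.request import urlopen


def read_text(url, fallback="utf-8"):
    """Fetch url and decode the body with the declared charset, else fallback."""
    with urlopen(url) as response:
        raw = response.read()
        # get_content_charset() returns None when no charset is declared.
        charset = response.headers.get_content_charset() or fallback
    return raw.decode(charset, errors="replace")


# Stand-in URL: a data: URL carrying the UTF-8 bytes of "中文".
demo_url = "data:text/plain;charset=utf-8,%E4%B8%AD%E6%96%87"
print(read_text(demo_url))     # 中文
```

Using errors="replace" keeps the call from raising on a stray bad byte, at the cost of showing a replacement character in its place.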

Third, pdfminer3k installation

Method 1:

(1), Download the package from the following address and unzip it: /pypi/pdfminer3k/

(2), Open a command-line window as administrator, change to the directory where the package was extracted, and run python setup.py install

Method 2:

(3), Install it directly in PyCharm
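Whichever method is used, it may help to confirm that the package is visible to the interpreter before going further. The sketch below is one way to check; the helper name is_installed is hypothetical. Note that pdfminer3k is imported under the package name pdfminer.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# pdfminer3k installs a package that is imported as "pdfminer".
print(is_installed("pdfminer"))
```

If this prints False, the install most likely went to a different interpreter than the one running the script.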

(4), The process of reading a PDF: first create a parser (PDFParser) and a document object (PDFDocument), and associate them with each other through their two set methods. Then call the document object's initialization method (a password can be passed as a parameter). At this point the content of the file is loaded into the document object.

Next, create a resource manager and a layout parameter object, then create an aggregator that combines the two, and create a page interpreter that works with the aggregator. The interpreter decodes the PDF document into a form Python can work with.

(5), Reading the PDF: get the pages through the document object's get_pages() method, and process them one by one with the interpreter's process_page() method.

(6), Example

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Open the PDF file in binary read mode
fp = open("", "rb")
# Create a parser associated with the file
parser = PDFParser(fp)
# Create a PDF document object
doc = PDFDocument()
# Connect the parser and the document object to each other
parser.set_document(doc)
doc.set_parser(parser)
# Initialize the document; if it has a password, pass it here
doc.initialize("")
# Create the PDF resource manager
resource = PDFResourceManager()
# Create the layout parameter object
laparam = LAParams()
# Create the aggregator
device = PDFPageAggregator(resource, laparams=laparam)
# Create the PDF page interpreter
interpreter = PDFPageInterpreter(resource, device)
# Use the document object to get the collection of pages
for page in doc.get_pages():
  # Process the page with the page interpreter
  interpreter.process_page(page)
  # Get the page layout from the aggregator
  layout = device.get_result()
  for out in layout:
    # Only some layout objects carry text
    if hasattr(out, "get_text"):
      print(out.get_text())
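The page loop can also be wrapped in a function that returns the text instead of printing it. The sketch below uses only the calls already seen in the example (get_pages, process_page, get_result, get_text). The function name extract_text and the Fake classes are hypothetical; the Fake classes stand in for the pdfminer objects so the sketch runs without a PDF file.

```python
def extract_text(doc, interpreter, device):
    """Collect the text of every page, using the same calls as the example."""
    pages = []
    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        # Only some layout objects carry text, hence the hasattr check.
        parts = [obj.get_text() for obj in layout if hasattr(obj, "get_text")]
        pages.append("".join(parts))
    return pages


# Hypothetical stand-ins, not part of pdfminer.
class FakeText:
    def __init__(self, text):
        self.text = text

    def get_text(self):
        return self.text


class FakeFigure:
    """A layout object without text; it should be skipped."""


class FakeDoc:
    def get_pages(self):
        return [[FakeText("Hello "), FakeFigure(), FakeText("PDF\n")]]


class FakeDevice:
    def __init__(self):
        self.current = None

    def get_result(self):
        return self.current


class FakeInterpreter:
    def __init__(self, device):
        self.device = device

    def process_page(self, page):
        self.device.current = page


fake_device = FakeDevice()
pages = extract_text(FakeDoc(), FakeInterpreter(fake_device), fake_device)
print(pages)   # ['Hello PDF\n']
```

With the real objects, the call would presumably be extract_text(doc, interpreter, device) after the setup steps shown in the example.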

To read PDF content from a website instead, open it with urlopen:

fp = urlopen("/zh-cn/articles/")
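A network response can usually only be read forward, whereas a file opened with open() can also seek. As a cautious variation, the response can be copied into memory first so that the parser receives a seekable object. Whether pdfminer3k strictly needs this is not confirmed here. The helper name open_remote is hypothetical, and the data: URL is only a stand-in so the sketch runs offline.

```python
import io
from urllib.request import urlopen


def open_remote(url):
    """Download url fully and return a seekable in-memory file object."""
    with urlopen(url) as response:
        return io.BytesIO(response.read())


# Stand-in URL: a data: URL whose body is the bytes "%PDF-1.4".
demo_url = "data:application/pdf,%25PDF-1.4"
fp = open_remote(demo_url)
print(fp.seekable())   # True
header = fp.read(5)
print(header)          # b'%PDF-'
fp.seek(0)             # rewind so a parser would start from the beginning
```

The trade-off is memory: the whole file is held in RAM, which is fine for ordinary documents but not for very large ones.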

The above is my personal experience. I hope it serves as a useful reference. If anything is wrong or has not been fully thought through, please do not hesitate to point it out.