Whether you are automating a login or writing a crawler, sooner or later you run into CAPTCHAs. This time we look at two Python modules for optical recognition of CAPTCHAs: tesserocr and pytesseract. Both are OCR libraries for Python, but each is really just a Python API layer over tesseract; pytesseract, for example, wraps Google's Tesseract-OCR engine. Since tesseract is at the core of both, we need to install tesseract before installing tesserocr or pytesseract.
Download and install
Download Address:/tesseract/tesseract-ocr-w64-setup-v4.0.0.
Once the download is complete, double-click the installer. During setup you can check the "Additional language data (download)" option to install the language packs that the OCR engine supports, but downloading them this way is very slow. Instead, we can download the language pack zip file directly from /tesseract-ocr/tessdata/, unzip it, and copy the files in tessdata-master into the tessdata directory under the Tesseract installation directory, C:\Program Files (x86)\Tesseract-OCR\tessdata.
Finally, configure the environment variables by adding C:\Program Files (x86)\Tesseract-OCR to PATH. Open a command prompt and type tesseract. If Tesseract prints its usage information, the configuration is complete.
To view the installed language packs, run: tesseract --list-langs
On my machine the output shows 167 installed language packs, covering English and other scripts.
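The PATH check and the language listing above can also be scripted. The following is only a minimal sketch, assuming that the tesseract executable is on PATH and that the first line of the --list-langs output is a header followed by one language code per line. The helper names parse_langs and installed_langs are made up for this illustration.

```python
import shutil
import subprocess


def parse_langs(output):
    """Return language codes from `tesseract --list-langs` output.

    Assumes the first non-empty line is a header and every later
    non-empty line holds one language code.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1:]


def installed_langs():
    """Return installed language codes, or None if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
    )
    # As a precaution, fall back to stderr if nothing arrived on stdout.
    return parse_langs(proc.stdout or proc.stderr)
```

Calling installed_langs() returns None when the environment variable has not been configured, which makes a missing PATH entry easy to spot before any recognition is attempted.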
Test
CAPTCHA image used for the experiment
Basic usage syntax
tesseract <image name> result
(tesseract, then the image name, then the name of the file to generate; Tesseract writes the recognized text to result.txt)
Result
Tesseract recognizes P, 2, and X correctly, but reads C as G. The accuracy is still fairly high, so let's look at how to use it from Python.
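The command-line call shown above can also be driven from a script. The sketch below is one possible way to do that, not the approach the rest of this article takes. It assumes tesseract is on PATH; the helper names build_command and recognize_to_file are hypothetical, and -l is the standard Tesseract flag for choosing a language pack.

```python
import subprocess


def build_command(image_path, output_base, lang=None):
    """Build the argument list for `tesseract <image> <output base>`."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        # -l selects the language pack, for example "eng"
        cmd += ["-l", lang]
    return cmd


def recognize_to_file(image_path, output_base, lang=None):
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()
```

For example, recognize_to_file("captcha.png", "result") would run the same command as the manual test, where captcha.png is a placeholder file name.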
Using tesseract from Python
pytesseract can be downloaded and installed with pip: pip install pytesseract
Captcha Recognition Script
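import pytesseract
from PIL import Image

# Open the CAPTCHA image; put the path to the image file inside the quotes
im = Image.open('')
# Run OCR on the image and print the recognized text
print(pytesseract.image_to_string(im))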
Result
The result is the same as above: individual characters are still not recognized accurately.
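Raw OCR output often carries whitespace or stray punctuation around the characters we care about. As a small sketch, and assuming the CAPTCHA contains only ASCII letters and digits (as the P, 2, X, C example suggests), the text can be filtered before it is used. The name clean_captcha_text is made up for this illustration, and filtering cannot repair a misread such as C being read as G.

```python
import string

# Assumption: the CAPTCHA alphabet is ASCII letters and digits only.
ALLOWED = set(string.ascii_letters + string.digits)


def clean_captcha_text(raw):
    """Keep only ASCII letters and digits from raw OCR output."""
    return "".join(ch for ch in raw if ch in ALLOWED)
```

It would be applied to the string returned by pytesseract.image_to_string before submitting the CAPTCHA.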
Image processing
CAPTCHAs on websites today are usually designed to be quite complex, so recognizing them directly often fails. The following code first converts the image to grayscale and then binarizes it.
import pytesseract
from PIL import Image

# Open the CAPTCHA image; put the path to the image file inside the quotes
im = Image.open('')
# Convert to grayscale
im = im.convert('L')
# Binarization threshold
threshold = 150
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
# Convert to a binary image via the table: 1 becomes white, 0 becomes black
im = im.point(table, "1")
im.show()
print(pytesseract.image_to_string(im))
Original image
After grayscale conversion and binarization
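The threshold of 150 will not suit every CAPTCHA, so it can help to make it a parameter. The sketch below wraps the same grayscale and lookup-table steps in two helpers; build_table and binarize are hypothetical names, and binarize expects an image object that has already been opened with PIL.

```python
def build_table(threshold):
    """Return a 256-entry lookup table: 0 below the threshold, 1 otherwise."""
    return [0 if value < threshold else 1 for value in range(256)]


def binarize(im, threshold=150):
    """Grayscale a PIL image, then map it to a 1-bit image via the table."""
    return im.convert("L").point(build_table(threshold), "1")
```

With this in place, trying a different threshold is a one-line change, for example binarize(im, 120), and the result can be passed to pytesseract.image_to_string as before.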
This is the whole content of this article.