SoFunction
Updated on 2024-11-19

Example of python method to recognize CAPTCHA using Tesseract

Whether it is automated login or crawler, always around the CAPTCHA, this time to talk about python in the optical recognition of the CAPTCHA moduletesserocrcap (a poem)pytesseracttesserocrcap (a poem)pytesseractis an OCR-recognition library for Python, but is actually a library for thetesseractA layer of Python API wrapping done by thepytesseractIt's Google.Tesseract-OCRengine wrappers; so they are at their coretesseractTherefore, when installing thetesserocrBefore that, we need to install thetesseract

download and install

Download Address:/tesseract/tesseract-ocr-w64-setup-v4.0.0.

Once the download is complete, double-click to install, and you can check the box forAdditional language data(download)option to install the language packs supported by OCR Recognition, but downloading the language packs is really slow, we can directly download the language packs from the/tesseract-ocr/tessdata/Download the zip file of the language pack, unzip it and place thetessdata-masterCopy the files in theTesseractThe installation directory of theC:\Program Files (x86)\Tesseract-OCR\tessdatadirectory, and finally we configure the environment variables, we set theC:\Program Files (x86)\Tesseract-OCRAdd it to the environment variables. Go to a command prompt and typetesseractIf the following result is displayed, the configuration is complete

View the installed language packs:tesseract --list-langs

It shows that I have a total of 167 language packs installed that contain English or other characters.

test (machinery etc)

QR codes for experiments

Basic Usage Syntax
tesseract result (tesseract image name generates file name)

in the end

From the results, it recognizes P, 2, and X, but recognizes C as G. The recognition is still high, so let's look at using it in python.

python introduces tesseract

The download and installation can be done using the pip command in python.pip install pytesseract

Captcha Recognition Script

import pytesseract
from PIL import Image
im=('')
print(pytesseract.image_to_string(im))

in the end

The result is the same as above, individual characters are not recognized very accurately.

image processing

Now the website of the QR code design is usually very difficult to complex, if the direct recognition of the words is difficult to identify, the following code is gray-scale processing and binarization

import pytesseract
from PIL import Image
im=('')
# Ash disposal
im=('L')
# This is the binarization threshold #
threshold=150
table=[]
for i in range(256):
 if i<threshold:
  (0)
 else:
  (1)
#Convert to binary image via table, 1's work as white, 0's are black
im=(table,"1")
()
print(pytesseract.image_to_string(im))

original figure

After Graying and Binarization

This is the whole content of this article.