A detailed tutorial on graphical CAPTCHA recognition in Python

Preamble

(There's an Easter egg at the end.)

Many websites now take measures against crawlers, and one of them is the CAPTCHA. As technology has developed, CAPTCHAs have become more and more elaborate. The earliest ones were simple images made up of a few digits. Later versions added English letters and distracting curves, and some sites even use Chinese characters, all of which makes recognition harder.

Then the 12306 CAPTCHA appeared, and behavioral CAPTCHAs began to develop. Anyone who has used 12306 has probably had a headache over its CAPTCHA: we have to read the text prompt and click every picture that matches its description, and the verification passes only if every selection is correct. Interactive CAPTCHAs like this are increasingly common. For example, the GeeTest slide CAPTCHA requires dragging a slider until the puzzle piece fits its gap, and the click CAPTCHA requires clicking every correct target. There are also sliding-grid CAPTCHAs, arithmetic CAPTCHAs, and so on.

As CAPTCHAs become more complex, the crawler's job becomes harder. Sometimes we cannot access a page at all until we pass its CAPTCHA. This chapter gives a unified explanation of CAPTCHA recognition.

The CAPTCHAs covered are ordinary graphical CAPTCHAs, GeeTest slide CAPTCHAs, click CAPTCHAs, and Weibo grid CAPTCHAs. Each is recognized in a different way and with a different idea. Once we understand how these are recognized, we can apply similar methods to other types of CAPTCHA.

Environment

Python 3.9
PyCharm

Graphical CAPTCHA Recognition

Let's start with the simplest type, the graphical CAPTCHA. This type was the first to appear and is still very common. It usually consists of four letters or digits. For example, the registration page of such-and-such a website has a CAPTCHA of this kind.

Generally, the last item on the form is the graphical CAPTCHA, and we must enter the characters in the image exactly to complete registration or login.

1. Objectives of this section

This section uses a website's CAPTCHA as an example to show how to recognize graphical CAPTCHAs with OCR.

2. Preparatory work

Recognizing graphical CAPTCHAs requires the tesserocr library. An installation tutorial is available at the end of this article.

3. Obtaining the CAPTCHA

To make testing easier, let's first save the CAPTCHA image locally.

Open the developer tools and find the CAPTCHA element. It is an image; open the URL in its src attribute to view the CAPTCHA, then right-click it and save it locally.

This gives us a CAPTCHA image to use for test recognition.

4. Recognition tests

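Next, create a new project, put the CAPTCHA image in the project's root directory, and use the tesserocr library to recognize it. The code is shown below:

import tesserocr
from PIL import Image

# 'captcha.jpg' is a placeholder; use the filename you saved the CAPTCHA under
image = Image.open('captcha.jpg')
# Run OCR on the whole image and return the recognized text
result = tesserocr.image_to_text(image)
print(result)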

Here we open the image as an Image object and pass it to tesserocr's image_to_text() method to perform the recognition. The process is very simple, and the result is as follows: JR42. Isn't it amazing?
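OCR output often carries trailing newlines or stray spaces, so it can help to normalize the string before submitting it to a form. The following is a minimal sketch under that assumption; normalize_ocr_text is a hypothetical helper name, not part of tesserocr.

```python
import re


def normalize_ocr_text(raw):
    """Remove all whitespace (spaces, newlines, form feeds) from raw OCR output.

    Hypothetical helper for illustration; it only tidies the string and
    cannot correct characters that were misrecognized.
    """
    return re.sub(r'\s+', '', raw)
```

With this helper, the recognition call could be written as normalize_ocr_text(tesserocr.image_to_text(image)).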

5. CAPTCHA processing

Next, let's switch to a different CAPTCHA image and save it locally as well.

Rerunning the code above on this image outputs FFKT.

This time the recognized text differs from the actual CAPTCHA, because the extra lines drawn across the image interfere with recognition.

In this case we need some extra preprocessing, such as converting to grayscale and binarizing. Passing 'L' to the Image object's convert() method converts the image to grayscale, and passing '1' binarizes it. The code is shown below:

image = image.convert('L')  # convert to grayscale
image = image.convert('1')  # binarize to pure black and white
image.show()                # preview the processed image
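To make the grayscale step less of a black box: Pillow's documentation describes convert('L') as applying the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The sketch below reproduces that formula for a single pixel in plain Python. The name to_gray is hypothetical, and Pillow's internal rounding may differ slightly from this version.

```python
def to_gray(r, g, b):
    """Approximate the gray level Pillow's convert('L') gives an RGB pixel.

    Uses the ITU-R 601-2 luma weights with integer arithmetic, rounding to
    the nearest integer. Illustration only; not Pillow's actual code.
    """
    return (r * 299 + g * 587 + b * 114 + 500) // 1000


# Green contributes most to perceived brightness, blue the least.
print(to_gray(255, 0, 0), to_gray(0, 255, 0), to_gray(0, 0, 255))
```

This is why a colored interference line can end up with a gray level quite different from the characters, which is what thresholding then exploits.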

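We can also specify the binarization threshold ourselves. The convert('1') call above does not take a threshold: by default Pillow dithers the image, and with dithering disabled it uses a fixed threshold of 127. To use a different threshold, we cannot convert the original image directly; we first convert it to grayscale and then apply our own threshold with the point() method. The code is shown below:

image = image.convert('L')  # grayscale first
threshold = 80
# Lookup table: gray levels below the threshold become 0 (black), the rest 1 (white)
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
# Map every pixel through the table and output a bilevel (mode '1') image
image = image.point(table, '1')
image.show()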

Running this displays the processed image: the interfering lines from the original CAPTCHA are gone, and the whole image is pure black and white. Running the recognition code again on this processed image now gives the correct result.
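The thresholding step above can be wrapped in small reusable functions. This is a sketch, not the only way to do it: build_threshold_table and binarize are hypothetical names, and binarize assumes its argument is a Pillow Image object.

```python
def build_threshold_table(threshold):
    """Return a 256-entry lookup table for Image.point().

    Gray levels below `threshold` map to 0 (black); all others map to 1 (white).
    """
    if not 0 <= threshold <= 256:
        raise ValueError('threshold must be between 0 and 256')
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(image, threshold=80):
    """Convert a Pillow image to grayscale, then to black and white.

    `image` is assumed to be a PIL.Image.Image; 80 is the threshold used in
    this article and will usually need tuning for other CAPTCHAs.
    """
    return image.convert('L').point(build_threshold_table(threshold), '1')
```

A lower threshold keeps only the darkest pixels black, so it suits CAPTCHAs whose characters are darker than the interference lines.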

So for images with interference, grayscale conversion and binarization can improve recognition accuracy.
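Because the best threshold differs from image to image, one possible extension is to try several thresholds and keep the first result that looks like a valid CAPTCHA. The sketch below assumes the CAPTCHA is four letters or digits, as described earlier. The function name, the candidate thresholds, and the ocr parameter (any callable that takes an image and returns text, such as tesserocr.image_to_text) are all illustrative choices, not part of any library.

```python
import re

# Assumption: the CAPTCHA consists of exactly four letters or digits.
CAPTCHA_PATTERN = re.compile(r'[A-Za-z0-9]{4}')


def recognize_with_thresholds(image, ocr, thresholds=(80, 100, 120, 140, 160)):
    """Binarize `image` at each threshold and run `ocr` on the result.

    Returns (text, threshold) for the first output that matches the expected
    format, or (None, None) if none does. A matching format does not
    guarantee that the text is actually correct.
    """
    gray = image.convert('L')
    for threshold in thresholds:
        table = [0 if level < threshold else 1 for level in range(256)]
        binary = gray.point(table, '1')
        text = re.sub(r'\s+', '', ocr(binary))  # drop newlines and spaces
        if CAPTCHA_PATTERN.fullmatch(text):
            return text, threshold
    return None, None
```

A call might look like recognize_with_thresholds(image, tesserocr.image_to_text), where image is the Image object opened earlier.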

tesserocr library installation

Here I'll give you a brief tutorial on how to install this library.

1. Installation of tesseract software

To install tesseract on Windows 10, download the installer from /tesseract/.

Files with dev in the name are development versions, and those without dev are stable versions, so choose one without dev. For example, you can download tesseract-ocr-win64-setup-v5.3.0.

After the download finishes, run the installer. During installation you can check the Additional language data (download) option to install language packs, so that OCR can recognize multiple languages. (You can also select only Chinese.)

2. Environmental configuration

In System Variables, edit Path and add the directory where you installed Tesseract. Then create a new system variable named TESSDATA_PREFIX with the value D:\Program Files(X86)\Tesseract-OCR\tessdata (adjust this to your own Tesseract installation path).

3. Installation of the tesserocr package
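After changing the variables, open a new terminal and confirm that they took effect. The sketch below checks the two settings described above using only the standard library. check_tesseract_setup is a hypothetical helper name, and its env and which parameters exist only to make it easy to test.

```python
import os
import shutil


def check_tesseract_setup(env=None, which=shutil.which):
    """Return a list of problems found with the Tesseract environment.

    Checks that a `tesseract` executable is reachable through PATH and that
    TESSDATA_PREFIX is set and points to an existing directory.
    """
    env = os.environ if env is None else env
    problems = []
    if which('tesseract') is None:
        problems.append('tesseract executable not found on PATH')
    prefix = env.get('TESSDATA_PREFIX')
    if not prefix:
        problems.append('TESSDATA_PREFIX is not set')
    elif not os.path.isdir(prefix):
        problems.append('TESSDATA_PREFIX points to a missing directory: ' + prefix)
    return problems


for problem in check_tesseract_setup():
    print(problem)
```

An empty list means both settings look right; it does not verify that the language data files themselves are present.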

- Try pip install:

pip install tesserocr

- If that doesn't work, try installing via a .whl file.

Download address: /simonflueckiger/tesserocr-windows_build/releases. I won't cover how to install a .whl file here; leave a comment or send me a private message if you're not sure how.
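Whichever installation route you take, a quick way to confirm that the package works is to import it and ask for the Tesseract version it was built against. This is a small sketch: tesserocr_status is a hypothetical helper name, while tesserocr.tesseract_version() is provided by the tesserocr package.

```python
def tesserocr_status():
    """Return the Tesseract version string reported by tesserocr.

    Returns None if the tesserocr package cannot be imported.
    """
    try:
        import tesserocr
    except ImportError:
        return None
    return tesserocr.tesseract_version()


status = tesserocr_status()
print(status if status is not None else 'tesserocr is not installed')
```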

TIP

tesserocr is only one means of recognition. If you need higher accuracy, you can use TensorFlow to train a deep learning model that recognizes graphical CAPTCHAs.
