SoFunction
Updated on 2024-11-15

A complete walkthrough of character-based image CAPTCHA recognition in Python

1 Summary

CAPTCHAs are ubiquitous and important on today's Internet, acting as a firewall for many systems. However, as OCR technology develops, the security problems that CAPTCHAs expose are becoming more and more serious. This paper walks through a complete recognition process for a character-based CAPTCHA, which should be a useful reference for both CAPTCHA security and OCR recognition techniques.

The source code this paper is based on (a traditional machine-learning SVM approach) is shared at: /zhengwh/captcha-svm

2 Keywords

Keywords: security, character image, captcha recognition, OCR, Python, SVM, PIL

3 Disclaimer

The material used in this paper comes from image resources that are fully open to the public on a website built with an old Web framework.

We only crawled the site's publicly accessible image resources and performed no operations beyond that.

The identity of the vulnerable website has been omitted from this report.

The author has notified the site's staff of the vulnerability, and they are actively migrating to a new system.

The main purpose of this report is OCR study and exchange, and to raise awareness of CAPTCHA security.

4 Introduction

This chapter presents the corresponding recognition solution as a technical complement, so that readers gain a deeper understanding of how CAPTCHAs function and of the security issues involved.

5 Basic tools

For the purposes of this article, only basic programming knowledge is required: the booming machine-learning field offers many well-packaged open-source solutions, so ordinary programmers no longer need to master complex mathematical theory to apply these tools.

Primary development environment:

  • Python 3.5: the Python interpreter version
  • PIL: image processing library
  • libsvm: open-source SVM machine-learning library

The installation of the environment is not the focus of this article, so it is omitted.

6 Basic Processes

In general, the recognition process for character-based CAPTCHA is as follows:

1. Prepare the original image material
2. Preprocess the images
3. Cut the images into single characters
4. Normalize the image size
5. Label the character images
6. Extract features from the character images
7. Generate a training dataset of features and labels
8. Train on the labeled feature data to produce a recognition model
9. Use the model to predict new, unknown images
10. Achieve the goal of returning the correct character string for a given image

7 Material Preparation

7.1 Material Selection

Since this paper is intended as an introductory study, the requirement was "representative, but not too difficult", so I looked around the Internet for a fairly representative, simple character-based CAPTCHA (which felt a bit like hunting for vulnerabilities).

I finally found this CAPTCHA image on an older site (presumably built on a decade-old site framework).

Original image:

Zoom in for a clear picture:

This image fulfills the requirements and has the following characteristics on closer inspection.

Characteristics that facilitate identification:

It is composed purely of Arabic numerals; the number of characters is always 4; the characters are arranged regularly; and a uniform font is used.

These characteristics are the main reasons this CAPTCHA is simple, and they will be relied on in the subsequent code.

Unfavorable identification characteristics:

Disturbing noise in the background of the image

Although this is an unfavorable characteristic, the interference is so weak that only simple methods are needed to remove it.

7.2 Material Acquisition

Since a large amount of material is needed for training, manually saving images one at a time in the browser is impractical, so it is recommended to write an automated download program.

The main steps are as follows:

1. Use the browser's network-capture tools to find the interface that generates random CAPTCHA images
2. Request that interface in bulk to fetch the images
3. Save the images to a local disk directory

These are some of the basic IT skills and will not be expanded upon in detail in this article.

The code regarding network requests and file saving is as follows:

import requests

def downloads_pic(**kwargs):
 pic_name = kwargs.get('pic_name', None)

 url = 'http://xxxx/rand_code_captcha/'
 res = requests.get(url, stream=True)
 with open(pic_path + pic_name + '.bmp', 'wb') as f: # pic_path: the local save directory, defined elsewhere
  for chunk in res.iter_content(chunk_size=1024):
   if chunk: # filter out keep-alive new chunks
    f.write(chunk)
    f.flush()

Run this in a loop N times, and you will have N CAPTCHA images saved.

Below are a few dozen of the images saved to local files, for effect:

8 Image Preprocessing

Although current machine-learning algorithms are quite advanced, it is still very necessary to preprocess the images to make them friendlier for machine recognition: this reduces complexity during training and increases the recognition rate.

The processing steps for the above raw material are as follows:

1. Read the original image material
2. Binarize the color image to black and white
3. Remove background noise

8.1 Image binarization

The main steps are as follows:

  1. Convert the RGB color image to a grayscale image
  2. Convert the grayscale image to a binary image according to a set threshold

image = Image.open(img_path) # from PIL import Image
imgry = image.convert('L') # convert to grayscale

table = get_bin_table()
out = imgry.point(table, '1')

The binary function quoted above is defined as follows:

def get_bin_table(threshold=140):
 """
 Get the grayscale-to-binary mapping table
 :param threshold: grayscale values below it map to 0 (black), the rest to 1 (white)
 :return: a 256-entry lookup table
 """
 table = []
 for i in range(256):
  if i < threshold:
   table.append(0)
  else:
   table.append(1)

 return table

PIL transforms it into a binary image: 0 means black, 1 means white. The binarized output of a noisy "6937" image is shown below:

1110111011110111011111011110111100110111
1101111111110110101111110101111111101111
1100111011111000001111111001011111011111
1101111011111111101111011110111111011111
1110000111111000011101100001110111011111

If you step back from the screen and squint, you can vaguely make out the skeleton of "6937".
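The table-based thresholding above can be restated compactly and checked without PIL; the grayscale values in this sketch are invented for illustration:

```python
def get_bin_table(threshold=140):
    # Map grayscale 0..255 to 0 (black) below the threshold, otherwise 1 (white)
    return [0 if i < threshold else 1 for i in range(256)]

table = get_bin_table()
gray_row = [12, 250, 139, 140, 255, 0]   # hypothetical grayscale pixel values
binary_row = [table[v] for v in gray_row]
print(binary_row)  # [0, 1, 0, 1, 1, 0]
```

This is exactly what `imgry.point(table, '1')` does pixel by pixel across the whole grayscale image.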

8.2 Noise removal

After conversion to a binary image, noise removal is required. The material chosen for this paper is relatively simple, and most of the noise is of the simplest kind: isolated points, so a large amount of noise can be removed just by detecting these isolated points.

For removing more complex noise, or even interference lines and color blocks, there are more mature algorithms such as flood fill; interested readers can study them later.

In this article, to simplify the problem, we use a simple scheme of our own devising:

  • For each black point, count the black points in its 3x3 neighborhood
  • If the neighborhood (including the point itself) contains no more than 2 black points, the point is considered isolated; collect all such isolated points
  • Remove all the isolated points in one batch

Details about the specific algorithmic principles are described below.

Divide all pixel points into three main categories, as shown below:

  • Vertex points: class A
  • Non-vertex boundary points: class B
  • Interior points: class C

A schematic of the three classes of points is shown below:

Among them:

-Class A points count the 3 neighboring points around them (as shown in the red box above)
-Class B points count the 5 neighboring points around them (as shown in the red box above)
-Class C points count the 8 neighboring points around them (as shown in the red box above)

Of course, class A and class B points are further subdivided, because the reference point sits in a different position within the counted region:

-Class A points are subdivided into: upper left, lower left, upper right, lower right
-Class B points are subdivided into: top, bottom, left, right
-Class C points need no subdivision

These subdivisions then serve as the guide for the subsequent coordinate lookups.

The python implementation of the main algorithm is as follows:

def sum_9_region(img, x, y):
 """
 Count the black points in the 9-neighborhood box centered at the current point
 :param img: binary image (0 = black, 1 = white)
 :param x: x coordinate of the current point
 :param y: y coordinate of the current point
 :return: number of black points in the neighborhood, including the point itself
 """
 # todo: determine lower limits on the image's width and height
 cur_pixel = img.getpixel((x, y)) # value of the current pixel
 width, height = img.size

 if cur_pixel == 1: # if the current point is white, skip the neighborhood count
  return 0

 if y == 0: # first row
  if x == 0: # upper-left vertex, 4-neighborhood
   # 3 points next to the center point
   total = cur_pixel \
     + img.getpixel((x, y + 1)) \
     + img.getpixel((x + 1, y)) \
     + img.getpixel((x + 1, y + 1))
   return 4 - total
  elif x == width - 1: # upper-right vertex
   total = cur_pixel \
     + img.getpixel((x, y + 1)) \
     + img.getpixel((x - 1, y)) \
     + img.getpixel((x - 1, y + 1))
   return 4 - total
  else: # top edge, non-vertex, 6-neighborhood
   total = img.getpixel((x - 1, y)) \
     + img.getpixel((x - 1, y + 1)) \
     + cur_pixel \
     + img.getpixel((x, y + 1)) \
     + img.getpixel((x + 1, y)) \
     + img.getpixel((x + 1, y + 1))
   return 6 - total
 elif y == height - 1: # bottom row
  if x == 0: # lower-left vertex
   # 3 points next to the center point
   total = cur_pixel \
     + img.getpixel((x + 1, y)) \
     + img.getpixel((x + 1, y - 1)) \
     + img.getpixel((x, y - 1))
   return 4 - total
  elif x == width - 1: # lower-right vertex
   total = cur_pixel \
     + img.getpixel((x, y - 1)) \
     + img.getpixel((x - 1, y)) \
     + img.getpixel((x - 1, y - 1))
   return 4 - total
  else: # bottom edge, non-vertex, 6-neighborhood
   total = cur_pixel \
     + img.getpixel((x - 1, y)) \
     + img.getpixel((x + 1, y)) \
     + img.getpixel((x, y - 1)) \
     + img.getpixel((x - 1, y - 1)) \
     + img.getpixel((x + 1, y - 1))
   return 6 - total
 else: # y is not on a border
  if x == 0: # left edge, non-vertex
   total = img.getpixel((x, y - 1)) \
     + cur_pixel \
     + img.getpixel((x, y + 1)) \
     + img.getpixel((x + 1, y - 1)) \
     + img.getpixel((x + 1, y)) \
     + img.getpixel((x + 1, y + 1))
   return 6 - total
  elif x == width - 1: # right edge, non-vertex
   total = img.getpixel((x, y - 1)) \
     + cur_pixel \
     + img.getpixel((x, y + 1)) \
     + img.getpixel((x - 1, y - 1)) \
     + img.getpixel((x - 1, y)) \
     + img.getpixel((x - 1, y + 1))
   return 6 - total
  else: # interior point, full 9-region
   total = img.getpixel((x - 1, y - 1)) \
     + img.getpixel((x - 1, y)) \
     + img.getpixel((x - 1, y + 1)) \
     + img.getpixel((x, y - 1)) \
     + cur_pixel \
     + img.getpixel((x, y + 1)) \
     + img.getpixel((x + 1, y - 1)) \
     + img.getpixel((x + 1, y)) \
     + img.getpixel((x + 1, y + 1))
   return 9 - total

Tip: this part is quite a test of care and patience; it involves a fair amount of busywork and took half a night to finish.

After computing the number of surrounding black points for each pixel (note: in the PIL-converted image, black points have the value 0), we only need to filter out the coordinates of the points whose count is 1 or 2; those are the isolated points. This criterion may not be perfectly accurate, but it basically meets the needs of this paper.
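To make the isolated-point filter concrete, here is a self-contained sketch operating on a plain 2D list instead of a PIL image (0 = black, 1 = white, matching PIL's "1" mode). The grid is a made-up example, and the removal is batched to mirror the "collect first, remove all at once" idea above:

```python
def count_black_9_region(grid, x, y):
    """Count black points (value 0) in the 3x3 box centered at (x, y), center included."""
    height, width = len(grid), len(grid[0])
    cnt = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and grid[ny][nx] == 0:
                cnt += 1
    return cnt

def remove_isolated_points(grid):
    """Turn black points whose 9-region black count is 1 or 2 into white, in one batch."""
    isolated = [(x, y)
                for y, row in enumerate(grid)
                for x, v in enumerate(row)
                if v == 0 and count_black_9_region(grid, x, y) <= 2]
    for x, y in isolated:
        grid[y][x] = 1
    return grid

noisy = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 1],   # stroke dot at (1, 1); a lone noise dot at (4, 1)
    [1, 0, 0, 1, 1, 1],   # the three stroke dots support each other
    [1, 1, 1, 1, 1, 1],
]
cleaned = remove_isolated_points(noisy)
```

Collecting the isolated points before erasing any of them matters: removing as you scan would change the neighbor counts of points not yet visited.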

The pre-processed image is shown below.

Comparing the original image at the beginning of the article, those isolated dots have been removed and a relatively clean CAPTCHA image has been generated.

9 Picture Character Cutting

Since a character-based CAPTCHA image can essentially be viewed as a series of single character images stitched together, in order to simplify the object of study, we can also decompose these images down to the atomic level, i.e.: images that contain only a single character.

As a result, the object of our research changes from "N kinds of string combinations" to "10 Arabic numerals", which greatly simplifies and reduces what has to be processed.

9.1 Segmentation algorithms

Real-world character CAPTCHAs are generated in myriad ways, with all sorts of twists and distortions, and there is no very general character-segmentation algorithm; each one must be devised by carefully studying the characteristics of the specific images to be recognized.

Of course, the material chosen for this paper was picked to keep this step simple, as described below.

Use image-editing software (Photoshop or similar) to open the CAPTCHA image, zoom in to the pixel level, and observe its parameters:

The following parameters can be obtained:

-The whole image is 40*10 pixels
-Each character is 6*10 pixels
-The outermost characters are 2 pixels from the left and right edges, and adjacent characters are 4 pixels apart
-The characters touch the top and bottom edges (i.e., 0 pixels apart)

This makes it easy to locate the pixel area that each character occupies in the whole picture, and then it can be segmented, the specific code is as follows:

def get_crop_imgs(img):
 """
 Cut the image according to its layout; this depends on the specific CAPTCHA # see the schematic
 :param img: the preprocessed CAPTCHA image
 :return: a list of 4 single-character images
 """
 child_img_list = []
 for i in range(4):
  x = 2 + i * (6 + 4) # see the schematic
  y = 0
  child_img = img.crop((x, y, x + 6, y + 10))
  child_img_list.append(child_img)

 return child_img_list

Then you get the cut atomic level picture elements:
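The crop coordinates implied by the measured layout can be verified with a little arithmetic; the helper below is an illustrative sketch, not part of the original code:

```python
def get_crop_boxes(char_count=4, char_w=6, char_h=10, left_margin=2, gap=4):
    """Compute the (left, upper, right, lower) crop box of each character.

    The defaults reflect the layout measured above: a 2-pixel edge margin,
    6*10 characters, and 4-pixel gaps between characters."""
    boxes = []
    for i in range(char_count):
        x = left_margin + i * (char_w + gap)
        boxes.append((x, 0, x + char_w, char_h))
    return boxes

boxes = get_crop_boxes()
print(boxes)
# The last box ends at x = 38, leaving the 2-pixel right margin of the 40-pixel-wide image.
```

These are exactly the boxes that `img.crop(...)` receives in `get_crop_imgs`.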

9.2 Content Summary

Based on this part of the discussion, you should see that if a CAPTCHA's interference (distortion, noise, interfering color blocks, interfering lines ...) is not strong enough, two conclusions follow:

There's not much difference between a 4-character and a 40,000-character CAPTCHA

There is little difference between pure numbers and CAPTCHAs that are a combination of numbers and letters

  • Pure numbers. The number of categories is 10
  • purely alphabetical
    • Case insensitive. The number of categories is 26
    • Case sensitive. The number of categories is 52
  • Combination of numbers and case-sensitive letters. The number of categories is 62

When the difficulty does not grow exponentially or geometrically, but only adds a linear, bounded amount of computation, longer or richer alphabets make little difference.

10 Size Normalization

The research objects chosen for this paper already have a uniform size of 6*10, so this part needs no extra processing. However, some CAPTCHAs apply distortion and scaling, in which case this step becomes one of the difficult points of the image processing.

11 Model Training Steps

In the previous sections, the processing and segmentation of individual images was completed. Next comes the training of the recognition model.

The entire training process is as follows:

1. Prepare a large quantity of preprocessed material cut down to atomic-level (single-character) images
2. Manually classify the material images, i.e., label them
3. Define the recognition features of a single image
4. Use SVM to train on the labeled feature files and obtain a model file

12 Material Preparation

For the training phase, another 3000 4-digit CAPTCHA images of the same pattern were downloaded. Processing and cutting these 3000 images yields 12000 atomic-level images.

From these 12,000 images, some material with strong interference that would hurt training and recognition was removed. The cut images look as follows:

13 Material Markers

With the recognition method used in this paper, the machine has no concept of digits at the outset. The material must therefore be labeled manually, to tell the machine which kind of image is a 1, which is a 2, and so on.

This process is called "tagging".

The exact method of labeling is:

Create a directory for each digit from 0 to 9, with the directory name serving as the label

Manually determine each image's content and drag the image into the matching digit directory

Store about 100 images in each directory

In general, the more labeled material, the greater the trained model's discriminative and predictive power. For example, in this paper, when only a dozen or so images were labeled, the recognition rate on new test images was basically zero; but at around 100 images per digit it reached nearly 100%.

14 Feature Selection

For a cut single-character image, the pixel-level zoomed view is shown below:

On a macro level, the essence of the different digit images is that black is filled in on particular pixels according to certain rules, so all the features ultimately revolve around pixels.

Each character image is 6 pixels wide and 10 pixels high, so in theory the simplest, crudest feature definition is 60 features: the values of the 60 pixels. However, such high dimensionality would clearly cause too much computation, so it can be reduced appropriately.

Consulting the corresponding literature [2] gives another simple, crude feature definition:

  1. The number of black pixels on each line gives 10 features
  2. The number of black pixels on each column gives 6 features

Finally a 16-dimensional set of features is obtained and the implementation code is as follows:

def get_feature(img):
 """
 Get the feature values of the specified image:
 count the black points in each row (height 10 gives 10 dimensions) and in each column (width 6 gives 6 more), 16 dimensions in total
 :param img: a 6*10 binary character image
 :return: a list of 16 feature values
 """

 width, height = img.size

 pixel_cnt_list = []
 for y in range(height):
  pix_cnt_x = 0
  for x in range(width):
   if img.getpixel((x, y)) == 0: # black point
    pix_cnt_x += 1

  pixel_cnt_list.append(pix_cnt_x)

 for x in range(width):
  pix_cnt_y = 0
  for y in range(height):
   if img.getpixel((x, y)) == 0: # black point
    pix_cnt_y += 1

  pixel_cnt_list.append(pix_cnt_y)

 return pixel_cnt_list
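The same 16-dimensional feature can be computed on a plain 2D list, which makes it easy to verify by hand; the 6*10 glyph below is a made-up example rather than an actual cut character:

```python
def get_feature_from_grid(grid):
    """Row black-point counts (10) followed by column black-point counts (6); 0 = black."""
    height, width = len(grid), len(grid[0])
    features = [row.count(0) for row in grid]                      # 10 row features
    features += [sum(1 for y in range(height) if grid[y][x] == 0)  # 6 column features
                 for x in range(width)]
    return features

# Hypothetical 6*10 binary character (1 = white, 0 = black)
glyph = [[1] * 6 for _ in range(10)]
glyph[0] = [1, 0, 0, 0, 0, 1]   # a horizontal bar on the top row
for y in range(10):
    glyph[y][2] = 0             # a vertical bar in column 2

features = get_feature_from_grid(glyph)
print(len(features))  # 16
```

The row counts capture the horizontal bar (a high count in row 0) and the column counts capture the vertical bar (a high count in column 2), which is precisely why this crude projection feature separates digit shapes reasonably well.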

The image material is then characterized to generate a set of vector files with eigenvalues and tagged values in the format specified by libSVM. An example of the content is shown below:


The description is as follows:

1. The first column is the label, i.e., the digit this image represents; its value is one of the markers 0 through 9
2. It is followed by the 16 feature values, each written as index:value
3. If there are 1,000 training images, 1,000 rows of records are generated

If you are interested in this file format, you can search for more information on the libSVM website.
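A record in this format can be assembled from a label and the 16 features as sketched below; the label and feature values are invented for illustration, and note that libSVM also permits omitting zero-valued features (sparse format):

```python
def to_libsvm_line(label, features):
    """Render one training record: '<label> 1:<f1> 2:<f2> ...' (indices are 1-based)."""
    pairs = ['%d:%d' % (i + 1, v) for i, v in enumerate(features)]
    return '%d %s' % (label, ' '.join(pairs))

features = [2, 4, 1, 1, 1, 1, 1, 1, 4, 2, 3, 5, 2, 2, 5, 3]  # hypothetical 16 feature values
line = to_libsvm_line(8, features)
print(line)  # "8 1:2 2:4 3:1 ... 16:3"
```

Writing one such line per labeled character image produces the training file that `svm_read_problem` consumes in the next section.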

15 Model Training

This stage is relatively simple, because this paper directly uses the open-source libSVM package as an application: input the feature file, output the model file.

A lot of related Chinese materials can be searched [1].

The main code is as follows:

from svmutil import svm_read_problem, svm_train, svm_save_model

def train_svm_model():
 """
 Train and generate the model file
 :return: None; the model is written to model_path
 """
 y, x = svm_read_problem(svm_root + '/train_pix_feature_xy.txt')
 model = svm_train(y, x)
 svm_save_model(model_path, model)

Note: The name of the generated model file is svm_model_file

16 Model Testing

After training, the model needs to be tested with brand-new labeled images outside the training set, i.e., a test set.

The test experiments in this paper are as follows:

  1. Model testing was performed using a set of 21 images, all labeled 8
  2. The test images generate a labeled feature file named last_test_pix_xy_new.txt

With only a dozen or so images per character in the early training set, the model separated the training samples well but had essentially no discriminative power on new test samples; recognition was basically wrong. Gradually increasing the number of training samples labeled 8 improved the situation:

  1. At around 60 images, accuracy reaches about 80%
  2. At 185 images, accuracy is basically 100%

Applying the same strengthening used for the digit 8 to the other digits 0 through 9 eventually achieves a nearly 100% recognition rate on all digit images. In this paper's example, roughly 100 training images per digit are enough to reach a 100% recognition rate.

The model test code is as follows:

from svmutil import svm_read_problem, svm_load_model, svm_predict

def svm_model_test():
 """
 Test the model with the test set
 :return: None; prints the predicted labels
 """
 yt, xt = svm_read_problem(svm_root + '/last_test_pix_xy_new.txt')
 model = svm_load_model(model_path)
 p_label, p_acc, p_val = svm_predict(yt, xt, model) # p_label is the recognition result

 cnt = 0
 for item in p_label:
  print('%d' % item, end=',')
  cnt += 1
  if cnt % 8 == 0:
   print('')

At this point, the CAPTCHA recognition itself can be considered complete.

17 Complete identification process

The previous sections prepared the toolset for CAPTCHA recognition. To turn this into continuous recognition of dynamic CAPTCHAs from a given site, a bit of extra code is needed to organize the process into a stable black-box recognition interface.

The main steps are as follows:

1. Pass in a set of CAPTCHA images
2. Pre-processing of images: denoising, binarization, etc.
3. Cut into 4 ordered single-character pictures
4. Use the model file to recognize each of the four images
5. Splice the recognition results
6. Return recognition results
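The steps above can be sketched as a thin orchestration function whose helpers stand in for the pieces developed earlier; the names and wiring here are illustrative assumptions, not the article's exact code:

```python
def recognize_captcha(img, preprocess, crop, extract, predict):
    """Glue the pipeline together: preprocess -> cut -> per-character predict -> join."""
    clean = preprocess(img)                   # denoise, binarize, etc.
    chars = crop(clean)                       # 4 ordered single-character images
    digits = [predict(extract(c)) for c in chars]
    return ''.join(str(d) for d in digits)

# Usage with trivial stand-in callables, just to show the wiring:
result = recognize_captcha(
    'raw-image',
    preprocess=lambda img: img,
    crop=lambda img: ['c1', 'c2', 'c3', 'c4'],
    extract=lambda c: c,
    predict=lambda feat: 7,
)
print(result)  # '7777'
```

Passing the steps in as callables keeps the black-box interface stable even if an individual stage (say, the noise-removal strategy) is later swapped out.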

In this article, we then request a CAPTCHA HTTP interface on the network, fetch the CAPTCHA image, recognize it, and save the image using the recognition result as its file name. The effect is as follows:

Apparently, an almost 100% recognition rate has been achieved.

Without any optimization of the algorithm, running this program on a PC with a mainstream configuration recognizes one CAPTCHA in about 200 ms (a large share of the time is spent blocked on the network request).

18 Efficiency Optimization

Better efficiency can be achieved later on by way of optimization.

Software-level optimization

1. Make the network requests for image resources asynchronous and non-blocking
2. Use multiple processes in parallel to exploit multi-core CPUs
3. Carefully select and experiment with image features to reduce dimensionality

It should then be possible to recognize 10 to 100 CAPTCHAs per second.

Hardware-level optimization

1. Crudely increase CPU performance
2. Crudely add more machines

Basically, with 10 4-core machines making simultaneous requests, a conservative estimate is that efficiency can be raised to 10,000 CAPTCHAs recognized per second.

19 Internet Security Alert

What are the security risks if the CAPTCHA is recognized?

Having understood recognition efficiency from the previous section, you will view the following scenarios from a new perspective:

On the 12306 train-ticketing site during Spring Festival, 500 tickets for a train are released at 8:00 a.m. and are all snatched up within a second; ordinary people cannot get a ticket, yet scalpers hold plenty. A mobile-phone vendor's website opens a purchase event at 10:00 a.m.; countless people who waited a long time go home empty-handed, while the same scalpers have goods in bulk.

Setting aside for now whether there are irregular dealings behind the scenes, even with fully legal procedures, as long as the CAPTCHA can be recognized by technical means, the computer's powerful computing and automation capabilities make it completely feasible to grab large quantities of resources into the hands of a few scalpers.

So next time you fail to grab a ticket and feel upset, you can still scold 12306, but rather than accusing it of shady practices, scold its poor IT skills.

A broken CAPTCHA is equivalent to no CAPTCHA at all; if there is no other risk-control strategy either, the system is completely defenseless against automated programs.

Indeed, some web applications do not even have a CAPTCHA; and even when one exists, if it is not hard enough, the application can still only be slaughtered!

So, although this piece is small, safety cannot be ignored.

20 Positive Application Scenarios

This article describes a simple implementation of OCR technology, which also has some good, positive, and constructive application scenarios:

-Bank card number identification
-ID number identification
-License plate number recognition

These scenes have characteristics that are very similar to the material studied in this paper:

1. Single font
2. Characters are simple combinations of numbers or letters
3. The arrangement of the text is standardized and unified

So as long as the raw data is captured in a reasonably standardized way when the photo is taken, recognition should not be too difficult.

21 Summary

This paper chose only a typical and relatively simple CAPTCHA as its recognition example, but it expresses the complete process of recognizing this kind of CAPTCHA, and can serve for exchange and learning.

Because IT capability varies widely from place to place, many old IT systems still contain old page frameworks, and the CAPTCHAs used in them are equally old and completely unable to withstand today's recognition techniques. For example, I have seen some college students practice directly on their own school's academic-affairs system CAPTCHA.

Finally, this paper deliberately proposes the following initiatives:

For those who have mastered OCR technology:

-Don't do anything illegal; there is plenty of news about "white hats" getting arrested
-Without breaking the law, you can still offer a well-intentioned reminder to the administrators of vulnerable systems
-Use your skills to promote social progress and productivity, such as digitizing paper books

For companies or organizations still using old, outdated IT systems:

You should realize the seriousness of the matter as soon as possible, and either upgrade the system or hand this part of the business over to a specialized security company.