I. Preface
This experiment explains the principles of CAPTCHA cracking through a simple example. The following points will be learned and practiced:
Python Basics
Use of the PIL module
II. Examples in detail
Install the pillow (PIL) library:
$ sudo apt-get update
$ sudo apt-get install python-dev
$ sudo apt-get install libtiff5-dev libjpeg8-dev zlib1g-dev \
    libfreetype6-dev liblcms2-dev libwebp-dev tcl8.6-dev tk8.6-dev python-tk
$ sudo pip install pillow
Download the file for the experiment:
$ wget /courses/364/python_captcha.zip
$ unzip python_captcha.zip
$ cd python_captcha
This is the CAPTCHA we used for our experiment
Extract text images
Create a new file in the working directory and edit it.
#-*- coding:utf8 -*-
from PIL import Image

im = Image.open("captcha.gif")
# Convert the image to 8-bit palette mode
im = im.convert("P")
# Print the color histogram
print im.histogram()
Output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0 , 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 3, 1, 3, 3, 0, 0, 0, 0, 0, 0, 1, 0, 3, 2, 132, 1, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 15, 0 , 1, 0, 1, 0, 0, 8, 1, 0, 0, 0, 0, 1, 6, 0, 2, 0, 0, 0, 0, 18, 1, 1, 1, 1, 1, 2, 365, 115, 0, 1, 0, 0, 0, 135, 186, 0, 0, 1, 0, 0, 0, 116, 3, 0, 0, 0, 0, 0, 21, 1, 1, 0, 0, 0, 2, 10, 2, 0, 0, 0, 0, 2, 10, 0, 0, 0, 0, 1, 0, 625]
Each entry in the color histogram gives the number of pixels in the image that have the corresponding color index.
Each pixel can take one of 256 palette colors, and you'll notice that white is by far the most common (index 255, the last position, holds 625 white pixels). The red pixels sit around index 220, and sorting the histogram lets us pick out the useful colors.
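To make the histogram idea concrete, here is a minimal sketch in plain Python (no PIL required): each pixel of a palette-mode image stores an index from 0 to 255, and the histogram simply counts how many pixels carry each index. The pixel values below are invented for illustration.

```python
from collections import Counter

# Hypothetical palette indexes for a tiny 3x3 image:
# 255 is white (background), 220 is one of the red shades.
pixels = [255, 255, 220, 255, 220, 220, 255, 255, 255]

# A palette histogram has one counter slot per possible index.
counts = Counter(pixels)
histogram = [counts.get(i, 0) for i in range(256)]

# Index 255 (white) dominates, just as in the real CAPTCHA.
assert histogram[255] == 6
assert histogram[220] == 3
```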
his = im.histogram()
values = {}

for i in range(256):
    values[i] = his[i]

for j,k in sorted(values.items(), key=lambda x:x[1], reverse=True)[:10]:
    print j,k
Output:
255 625
212 365
220 186
219 135
169 132
227 116
213 115
234 21
205 18
184 15
This gives the ten most common colors in the image. Among them, indexes 220 and 227 are the reds and grays we need; with this information we can construct a black-and-white binary image.
#-*- coding:utf8 -*-
from PIL import Image

im = Image.open("captcha.gif")
im = im.convert("P")
im2 = Image.new("P", im.size, 255)

for x in range(im.size[1]):
    for y in range(im.size[0]):
        pix = im.getpixel((y,x))
        if pix == 220 or pix == 227:  # these are the color indexes we want
            im2.putpixel((y,x), 0)

im2.save("output.gif")
Getting results:
Extract single character images
The next task is to extract the set of pixels belonging to each single character. Because this example is simple, we can cut the image vertically:
inletter = False
foundletter = False
start = 0
end = 0
letters = []

for y in range(im2.size[0]):
    for x in range(im2.size[1]):
        pix = im2.getpixel((y,x))
        if pix != 255:
            inletter = True

    if foundletter == False and inletter == True:
        foundletter = True
        start = y

    if foundletter == True and inletter == False:
        foundletter = False
        end = y
        letters.append((start,end))

    inletter = False

print letters
Output:
[(6, 14), (15, 25), (27, 35), (37, 46), (48, 56), (57, 67)]
This gives the column indexes where each character starts and ends.
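The column scan above can be demonstrated without PIL on a hand-made grid of pixel values. The grid below is invented for illustration (255 is background, 0 is ink); the inner loop is condensed into an `any(...)` check, but the start/end bookkeeping is the same:

```python
# Each inner list is one column of a tiny binarized image;
# 0 marks a character pixel, 255 marks background.
columns = [
    [255, 255, 255],  # column 0: empty
    [255, 0, 255],    # column 1: ink -> a letter starts here
    [0, 0, 255],      # column 2: still inside the letter
    [255, 255, 255],  # column 3: empty -> the letter ends
    [255, 0, 0],      # column 4: a second letter starts
    [255, 255, 255],  # column 5: empty -> the second letter ends
]

letters = []
foundletter = False
start = end = 0

for y, column in enumerate(columns):
    inletter = any(pix != 255 for pix in column)
    if not foundletter and inletter:
        foundletter = True
        start = y
    if foundletter and not inletter:
        foundletter = False
        end = y
        letters.append((start, end))

# Two letters found, spanning columns 1-3 and 4-5.
assert letters == [(1, 3), (4, 5)]
```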
import hashlib
import time

count = 0
for letter in letters:
    m = hashlib.md5()
    im3 = im2.crop(( letter[0], 0, letter[1], im2.size[1] ))
    m.update("%s%s"%(time.time(),count))
    im3.save("./%s.gif"%(m.hexdigest()))
    count += 1
(This continues from the code above.) Cropping the image gives us the portion containing each individual character.
AI and vector space image recognition
Here we use a vector space search engine for character recognition, which has many advantages:
- Doesn't require a lot of training iterations
- No risk of overtraining
- You can add or remove erroneous data at any time and see the effect immediately
- Easy to understand and to code
- Provides ranked results, so you can view the closest multiple matches
- Anything it can't recognize can simply be added to the search engine, after which it is immediately recognized
Of course it also has drawbacks: classification is much slower than with neural networks, it can't find its own way to solve problems, and so on.
"Vector space search engine" sounds lofty, but the principle is actually very simple. Take the example in the article:
Suppose you have three documents. How do we compute the similarity between them? The more words two documents share, the more similar they are. But what if there are too many words? We choose a few key words, also called features. Each feature is like a dimension in space (x, y, z, and so on), and a set of features forms a vector. We build such a vector for each document, then compute the angle between the vectors to obtain the similarity of the documents.
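As a concrete illustration of the angle idea (with made-up word counts), two documents that share words get a cosine close to 1, while documents with nothing in common get 0:

```python
import math

def cosine(a, b):
    # Dot product over shared keys, divided by the product of magnitudes.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    mag = lambda v: math.sqrt(sum(c ** 2 for c in v.values()))
    return dot / (mag(a) * mag(b))

doc1 = {"captcha": 2, "python": 1}
doc2 = {"captcha": 1, "python": 1}  # similar wording to doc1
doc3 = {"cooking": 3}               # nothing in common with doc1

assert cosine(doc1, doc2) > 0.9
assert cosine(doc1, doc3) == 0.0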
Implement vector spaces with Python classes:
import math

class VectorCompare:
    # Calculate the magnitude of a vector
    def magnitude(self, concordance):
        total = 0
        for word,count in concordance.iteritems():
            total += count ** 2
        return math.sqrt(total)

    # Calculate the cosine between two vectors
    def relation(self, concordance1, concordance2):
        topvalue = 0
        for word, count in concordance1.iteritems():
            if concordance2.has_key(word):
                topvalue += count * concordance2[word]
        return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))
It compares two Python dictionaries and outputs their similarity, expressed as a number between 0 and 1.
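A quick standalone check of how the comparison behaves (the class is restated here with Python 3 dict methods so the snippet runs on its own; the "pixel vectors" are invented for illustration):

```python
import math

class VectorCompare:
    def magnitude(self, concordance):
        total = 0
        for word, count in concordance.items():
            total += count ** 2
        return math.sqrt(total)

    def relation(self, concordance1, concordance2):
        topvalue = 0
        for word, count in concordance1.items():
            if word in concordance2:
                topvalue += count * concordance2[word]
        return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))

v = VectorCompare()

# Two identical "pixel vectors" score 1.0; a disjoint one scores lower.
a = {0: 255, 1: 0, 2: 255}
b = {0: 255, 1: 0, 2: 255}
c = {0: 0, 1: 255, 2: 0}

assert abs(v.relation(a, b) - 1.0) < 1e-9
assert v.relation(a, c) < v.relation(a, b)
```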
Putting it all together
There is still the work of collecting a large number of CAPTCHAs and extracting single-character images to build a training set. Anyone who has followed the article to this point will know how to do that, so it is omitted here; you can directly use the provided training set for what follows.
The iconset directory holds our training set.
Final additions:
import os

# Convert an image to a vector (pixel position -> pixel value)
def buildvector(im):
    d1 = {}
    count = 0
    for i in im.getdata():
        d1[count] = i
        count += 1
    return d1

v = VectorCompare()

iconset = ['0','1','2','3','4','5','6','7','8','9','0','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

# Load the training set
imageset = []
for letter in iconset:
    for img in os.listdir('./iconset/%s/'%(letter)):
        temp = []
        if img != "Thumbs.db" and img != ".DS_Store":
            temp.append(buildvector(Image.open("./iconset/%s/%s"%(letter,img))))
        imageset.append({letter:temp})

count = 0
# Cut up the CAPTCHA image
for letter in letters:
    m = hashlib.md5()
    im3 = im2.crop(( letter[0], 0, letter[1], im2.size[1] ))

    guess = []
    # Compare each cut CAPTCHA snippet with every training snippet
    for image in imageset:
        for x,y in image.iteritems():
            if len(y) != 0:
                guess.append( ( v.relation(y[0], buildvector(im3)), x ) )

    guess.sort(reverse=True)
    print "", guess[0]
    count += 1
Get the result
Everything is ready, run our code and try it out:
python
Output:
(0.96376811594202894, '7')
(0.96234028545977002, 's')
(0.9286884286888929, '9')
(0.98350370609844473, 't')
(0.96751165072506273, '9')
(0.96989711688772628, 'j')
It's the right solution, well done.
III. Summary
That is the entire content of this article. I hope it can be of some help to your study or work; if you have any questions, feel free to leave a comment.