I. Preface
This experiment explains the principles of CAPTCHA cracking through a simple example. The following points will be learned and practiced:
Python Basics
Use of the PIL module
II. Examples in detail
Install the pillow (PIL) library:
$ sudo apt-get update
$ sudo apt-get install python-dev
$ sudo apt-get install libtiff5-dev libjpeg8-dev zlib1g-dev \
    libfreetype6-dev liblcms2-dev libwebp-dev tcl8.6-dev tk8.6-dev python-tk
$ sudo pip install pillow
Download the file for the experiment:
$ wget /courses/364/python_captcha.zip
$ unzip python_captcha.zip
$ cd python_captcha
This is the CAPTCHA we used for our experiment
Extract text images
Create a new file in the working directory and edit it.
#-*- coding:utf8 -*-
from PIL import Image

im = Image.open("captcha.gif")
# Convert the image to 8-bit palette mode
im = im.convert("P")
# Print the color histogram
print im.histogram()
Output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0 , 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 3, 1, 3, 3, 0, 0, 0, 0, 0, 0, 1, 0, 3, 2, 132, 1, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 15, 0 , 1, 0, 1, 0, 0, 8, 1, 0, 0, 0, 0, 1, 6, 0, 2, 0, 0, 0, 0, 18, 1, 1, 1, 1, 1, 2, 365, 115, 0, 1, 0, 0, 0, 135, 186, 0, 0, 1, 0, 0, 0, 116, 3, 0, 0, 0, 0, 0, 21, 1, 1, 0, 0, 0, 2, 10, 2, 0, 0, 0, 0, 2, 10, 0, 0, 0, 0, 1, 0, 625]
Each entry in the color histogram gives the number of pixels in the image that have the corresponding color index.
Each pixel can take one of 256 palette colors, and you'll notice that white is by far the most common (index 255, the last position, holds 625 white pixels). The red pixels sit around index 220, and sorting the histogram lets us pick out the useful colors.
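To make the histogram idea concrete, here is a minimal sketch in plain Python (no PIL required): each pixel of a palette-mode image stores an index from 0 to 255, and the histogram simply counts how many pixels carry each index. The pixel values below are invented for illustration.

```python
from collections import Counter

# Hypothetical palette indexes for a tiny 3x3 image:
# 255 is white (background), 220 is one of the red shades.
pixels = [255, 255, 220, 255, 220, 220, 255, 255, 255]

# A palette histogram has one counter slot per possible index.
counts = Counter(pixels)
histogram = [counts.get(i, 0) for i in range(256)]

# Index 255 (white) dominates, just as in the real CAPTCHA.
assert histogram[255] == 6
assert histogram[220] == 3
```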
his = im.histogram()
values = {}

for i in range(256):
    values[i] = his[i]

for j,k in sorted(values.items(), key=lambda x:x[1], reverse=True)[:10]:
    print j,k
Output:
255 625
212 365
220 186
219 135
169 132
227 116
213 115
234 21
205 18
184 15
This gives the ten most common colors in the image. Among them, indexes 220 and 227 are the reds and grays we need; with this information we can construct a black-and-white binary image.
#-*- coding:utf8 -*-
from PIL import Image

im = Image.open("captcha.gif")
im = im.convert("P")
im2 = Image.new("P", im.size, 255)

for x in range(im.size[1]):
    for y in range(im.size[0]):
        pix = im.getpixel((y,x))
        if pix == 220 or pix == 227:  # these are the color indexes we want
            im2.putpixel((y,x), 0)

im2.save("output.gif")
Getting results:
Extract single character images
The next task is to extract the set of pixels belonging to each single character. Because this example is simple, we can cut the image vertically:
inletter = False
foundletter = False
start = 0
end = 0
letters = []

for y in range(im2.size[0]):
    for x in range(im2.size[1]):
        pix = im2.getpixel((y,x))
        if pix != 255:
            inletter = True

    if foundletter == False and inletter == True:
        foundletter = True
        start = y

    if foundletter == True and inletter == False:
        foundletter = False
        end = y
        letters.append((start,end))

    inletter = False

print letters
Output:
[(6, 14), (15, 25), (27, 35), (37, 46), (48, 56), (57, 67)]
This gives the column indexes where each character starts and ends.
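The column scan above can be demonstrated without PIL on a hand-made grid of pixel values. The grid below is invented for illustration (255 is background, 0 is ink); the inner loop is condensed into an `any(...)` check, but the start/end bookkeeping is the same:

```python
# Each inner list is one column of a tiny binarized image;
# 0 marks a character pixel, 255 marks background.
columns = [
    [255, 255, 255],  # column 0: empty
    [255, 0, 255],    # column 1: ink -> a letter starts here
    [0, 0, 255],      # column 2: still inside the letter
    [255, 255, 255],  # column 3: empty -> the letter ends
    [255, 0, 0],      # column 4: a second letter starts
    [255, 255, 255],  # column 5: empty -> the second letter ends
]

letters = []
foundletter = False
start = end = 0

for y, column in enumerate(columns):
    inletter = any(pix != 255 for pix in column)
    if not foundletter and inletter:
        foundletter = True
        start = y
    if foundletter and not inletter:
        foundletter = False
        end = y
        letters.append((start, end))

# Two letters found, spanning columns 1-3 and 4-5.
assert letters == [(1, 3), (4, 5)]
```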
import hashlib
import time

count = 0
for letter in letters:
    m = hashlib.md5()
    im3 = im2.crop(( letter[0], 0, letter[1], im2.size[1] ))
    m.update("%s%s"%(time.time(),count))
    im3.save("./%s.gif"%(m.hexdigest()))
    count += 1
(This continues from the code above.) Cropping the image gives us the portion containing each individual character.
AI and vector space image recognition
Here we use a vector space search engine for character recognition, which has many advantages:
- Doesn't require a lot of training iterations
- No risk of overtraining
- You can add or remove erroneous data at any time and see the effect immediately
- Easy to understand and to code
- Provides ranked results, so you can view the closest multiple matches
- Anything it can't recognize can simply be added to the search engine, after which it is immediately recognized
Of course it also has drawbacks: classification is much slower than with neural networks, it can't find its own way to solve problems, and so on.
"Vector space search engine" sounds lofty, but the principle is actually very simple. Take the example in the article:
Suppose you have three documents. How do we compute the similarity between them? The more words two documents share, the more similar they are. But what if there are too many words? We choose a few key words, also called features. Each feature is like a dimension in space (x, y, z, and so on), and a set of features forms a vector. We build such a vector for each document, then compute the angle between the vectors to obtain the similarity of the documents.
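As a concrete illustration of the angle idea (with made-up word counts), two documents that share words get a cosine close to 1, while documents with nothing in common get 0:

```python
import math

def cosine(a, b):
    # Dot product over shared keys, divided by the product of magnitudes.
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    mag = lambda v: math.sqrt(sum(c ** 2 for c in v.values()))
    return dot / (mag(a) * mag(b))

doc1 = {"captcha": 2, "python": 1}
doc2 = {"captcha": 1, "python": 1}  # similar wording to doc1
doc3 = {"cooking": 3}               # nothing in common with doc1

assert cosine(doc1, doc2) > 0.9
assert cosine(doc1, doc3) == 0.0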
Implement vector spaces with Python classes:
import math

class VectorCompare:
    # Calculate the magnitude of a vector
    def magnitude(self, concordance):
        total = 0
        for word,count in concordance.iteritems():
            total += count ** 2
        return math.sqrt(total)

    # Calculate the cosine between two vectors
    def relation(self, concordance1, concordance2):
        topvalue = 0
        for word, count in concordance1.iteritems():
            if concordance2.has_key(word):
                topvalue += count * concordance2[word]
        return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))
It compares two Python dictionaries and outputs their similarity, expressed as a number between 0 and 1.
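A quick standalone check of how the comparison behaves (the class is restated here with Python 3 dict methods so the snippet runs on its own; the "pixel vectors" are invented for illustration):

```python
import math

class VectorCompare:
    def magnitude(self, concordance):
        total = 0
        for word, count in concordance.items():
            total += count ** 2
        return math.sqrt(total)

    def relation(self, concordance1, concordance2):
        topvalue = 0
        for word, count in concordance1.items():
            if word in concordance2:
                topvalue += count * concordance2[word]
        return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))

v = VectorCompare()

# Two identical "pixel vectors" score 1.0; a disjoint one scores lower.
a = {0: 255, 1: 0, 2: 255}
b = {0: 255, 1: 0, 2: 255}
c = {0: 0, 1: 255, 2: 0}

assert abs(v.relation(a, b) - 1.0) < 1e-9
assert v.relation(a, c) < v.relation(a, b)
```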
Putting it all together
There is still the work of collecting a large number of CAPTCHAs and extracting single-character images to build a training set. Anyone who has followed the article to this point will know how to do that, so it is omitted here; you can directly use the provided training set for what follows.
The iconset directory holds our training set.
Final additions:
import os

# Convert an image to a vector (pixel position -> pixel value)
def buildvector(im):
    d1 = {}
    count = 0
    for i in im.getdata():
        d1[count] = i
        count += 1
    return d1

v = VectorCompare()

iconset = ['0','1','2','3','4','5','6','7','8','9','0','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

# Load the training set
imageset = []
for letter in iconset:
    for img in os.listdir('./iconset/%s/'%(letter)):
        temp = []
        if img != "Thumbs.db" and img != ".DS_Store":
            temp.append(buildvector(Image.open("./iconset/%s/%s"%(letter,img))))
        imageset.append({letter:temp})

count = 0
# Cut up the CAPTCHA image
for letter in letters:
    m = hashlib.md5()
    im3 = im2.crop(( letter[0], 0, letter[1], im2.size[1] ))

    guess = []
    # Compare each cut CAPTCHA snippet with every training snippet
    for image in imageset:
        for x,y in image.iteritems():
            if len(y) != 0:
                guess.append( ( v.relation(y[0], buildvector(im3)), x ) )

    guess.sort(reverse=True)
    print "", guess[0]
    count += 1
Get the result
Everything is ready, run our code and try it out:
python
Output:
(0.96376811594202894, '7')
(0.96234028545977002, 's')
(0.9286884286888929, '9')
(0.98350370609844473, 't')
(0.96751165072506273, '9')
(0.96989711688772628, 'j')
It's the right solution, well done.
III. Summary
That is the entire content of this article. I hope it can be of some help to your study or work; if you have any questions, feel free to leave a comment.