A detailed example of CAPTCHA recognition in Python

Dependencies

sudo apt-get install python-imaging
sudo apt-get install tesseract-ocr
pip install pytesseract numpy

Using Google's OCR engine (Tesseract) to recognize CAPTCHAs

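from PIL import Image
import pytesseract

# Path to the CAPTCHA image file
image = Image.open('')
# Run Tesseract OCR on the image and return the recognized text
vcode = pytesseract.image_to_string(image)
print(vcode)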

However, pytesseract's recognition rate is not high on its own, and the CAPTCHA of a typical website comes with a lot of distracting elements. (￣▽￣)"

So we first need to denoise the CAPTCHA.

For single-pixel interference lines and points, we can scan the entire image and examine the colors of the eight pixels adjacent to each pixel. If the number of neighbors that differ from it is greater than a certain value, the pixel is isolated and should be removed.
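The eight-neighbor check can be sketched roughly as follows. This is only an illustration, not the code used for the school's CAPTCHA: the function name remove_isolated_pixels, the 0/1 grid representation, and the cutoff max_different=6 are all assumptions, and the cutoff in particular needs tuning per CAPTCHA.

```python
def remove_isolated_pixels(grid, max_different=6):
    """Return a copy of a binary grid with isolated dark pixels cleared.

    grid is a list of rows; 0 is a dark (ink) pixel and 1 is background.
    A dark pixel is cleared when more than max_different of its eight
    neighbors are background. Positions outside the image count as
    background. The default cutoff is an assumed value, not a standard.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    cleaned = [row[:] for row in grid]  # never modify the input while scanning
    for y in range(height):
        for x in range(width):
            if grid[y][x] != 0:
                continue  # only dark pixels can be noise
            different = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue  # skip the pixel itself
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if not inside or grid[ny][nx] != 0:
                        different += 1
            if different > max_different:
                cleaned[y][x] = 1
    return cleaned


noisy = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1],  # lone dot: all eight neighbors differ
    [1, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1],  # 2x2 block: part of a character, must survive
    [1, 1, 1, 1, 1, 1],
]
cleaned = remove_isolated_pixels(noisy)
```

Counting against the original grid rather than the partially cleaned copy keeps the result independent of scan order. A lower cutoff also removes one-pixel-wide lines, at the risk of eating thin strokes of the characters.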

Alternatively, try setting a threshold to binarize the CAPTCHA directly.
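A minimal sketch of threshold binarization with PIL, assuming the image is first converted to grayscale. The helper names build_threshold_table and binarize and the default threshold of 140 are hypothetical; a suitable threshold depends on the CAPTCHA and usually has to be found by trial.

```python
def build_threshold_table(threshold):
    """Lookup table mapping gray levels 0..255 to black (0) or white (255)."""
    return [0 if level < threshold else 255 for level in range(256)]


def binarize(image, threshold=140):
    """Convert a PIL image to grayscale, then to pure black and white.

    image is expected to be a PIL Image; point() applies the 256-entry
    lookup table to every pixel of the grayscale image.
    """
    return image.convert('L').point(build_threshold_table(threshold))


# Usage (the file name is a placeholder):
#   from PIL import Image
#   bw = binarize(Image.open('captcha.png'), threshold=140)
#   bw.show()
```

Pixels darker than the threshold become black and everything else becomes white, so light background noise disappears while dark text remains.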

Here are two CAPTCHAs from the school's website

We can see that the CAPTCHA has single-pixel interference points, so we need to try to remove them. But after refreshing the CAPTCHA repeatedly, we found that this CAPTCHA:

1. Uses addition only

2. Adds numbers of up to two digits

3. Always renders the text in pure red (255,0,0)

From these observations we can conclude that this CAPTCHA generation algorithm is flawed: because the text is always pure red, we can keep only the red pixels and discard everything else as noise.

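from PIL import Image
from numpy import array
import pytesseract

# Path to the CAPTCHA image file
im = Image.open('')
im = im.convert('RGB')
# Enlarge the image to make recognition easier.
# NEAREST keeps the original pixel colors, so the exact red test below still works.
im = im.resize((200, 80), Image.NEAREST)
a = array(im)
for i in range(len(a)):
    for j in range(len(a[i])):
        # Pixels with a full red channel are text: paint them black.
        if a[i][j][0] == 255:
            a[i][j] = [0, 0, 0]
        # Everything else is background or noise: paint it white.
        else:
            a[i][j] = [255, 255, 255]
im = Image.fromarray(a)
im.show()
vcode = pytesseract.image_to_string(im)
print(vcode)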

Using the above script, we can binarize the image and recognize it with Google's OCR engine. Then we can use eval() to evaluate the recognized expression.
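Since eval() runs whatever text the OCR step returns, a stricter alternative is to parse the expression by hand. The sketch below is not the approach used above but a swapped-in technique. It assumes the recognized text looks like "12+34", optionally followed by "=" or "=?"; the name solve_addition and the exact pattern are hypothetical.

```python
import re

# One or two digits, a plus sign, one or two digits, then an optional "=" or "=?".
# Whitespace is tolerated everywhere because OCR output often contains it.
ADDITION = re.compile(r'^\s*(\d{1,2})\s*\+\s*(\d{1,2})\s*(?:=\s*\??)?\s*$')


def solve_addition(text):
    """Return the sum for OCR text such as '12+34=', or None if it does not match."""
    match = ADDITION.match(text)
    if match is None:
        return None  # misrecognized or unexpected text: let the caller retry
    return int(match.group(1)) + int(match.group(2))


print(solve_addition('12+34='))  # 46
print(solve_addition('12-3'))    # None
```

Returning None on a mismatch also gives a cheap signal that the OCR result was garbled and the CAPTCHA should be fetched again.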

Summary

That concludes this basic introduction to CAPTCHA recognition in Python. I hope this article helps with your study or work; if you have questions, feel free to leave a comment.