Dependencies
sudo apt-get install python-imaging
sudo apt-get install tesseract-ocr
pip install pytesseract
Using Google's OCR (Tesseract) to recognize CAPTCHAs
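from PIL import Image
import pytesseract

# Open the CAPTCHA image; fill in the path to your image file.
image = Image.open('')
# Run Tesseract OCR on the image and get the recognized text.
vcode = pytesseract.image_to_string(image)
print(vcode)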
However, pytesseract's recognition rate is not high on its own, and the CAPTCHAs on a typical website come with a lot of distracting elements. ( ̄▽ ̄)"
So we first need to denoise the CAPTCHA.
To remove single-pixel interference lines and points, we can scan the entire image and compare the color of each pixel with its eight neighboring pixels. If the number of differing neighbors is greater than a certain value, the pixel is isolated and should be removed.
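The neighbor check described above can be sketched roughly as follows. This is only an illustration under some assumptions: the image is already a binary grid of 0 (black) and 255 (white) values stored as nested lists, and the function name `remove_isolated_pixels` and its `threshold` default of 6 are made up for this example and would need tuning for a real CAPTCHA.

```python
def remove_isolated_pixels(grid, threshold=6):
    """Return a copy of a binary image with isolated pixels flipped.

    grid: list of rows, each a list of 0 (black) or 255 (white) values.
    threshold: hypothetical cutoff; a pixel counts as noise when more
    than this many of its neighbours have a different colour.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    # Decide from the original grid, write into a copy, so that earlier
    # flips do not influence later decisions.
    cleaned = [row[:] for row in grid]
    for y in range(height):
        for x in range(width):
            differing = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    # Neighbours outside the image are skipped.
                    if 0 <= ny < height and 0 <= nx < width:
                        if grid[ny][nx] != grid[y][x]:
                            differing += 1
            if differing > threshold:
                # Flip the pixel to the colour of its surroundings.
                cleaned[y][x] = 255 - grid[y][x]
    return cleaned
```

Because border pixels have fewer than eight neighbors, this sketch never removes noise on the very edge of the image when the threshold is 6; lowering the threshold removes more noise but also starts to eat into thin strokes of the text.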
Alternatively, try setting a threshold to binarize the CAPTCHA directly.
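A minimal sketch of threshold binarization is shown here. It assumes the pixels are given as rows of (r, g, b) tuples, and the function name `binarize` and the default threshold of 128 are placeholders chosen for illustration, not values taken from any particular CAPTCHA.

```python
def binarize(pixels, threshold=128):
    """Convert rows of (r, g, b) tuples into rows of 0 / 255 values.

    threshold: hypothetical cutoff on the grey level; darker pixels
    become black (0), the rest become white (255).
    """
    result = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            # Standard luma weights for converting RGB to greyscale.
            gray = (299 * r + 587 * g + 114 * b) // 1000
            out_row.append(0 if gray < threshold else 255)
        result.append(out_row)
    return result
```

With Pillow, roughly the same effect can be had with `im.convert('L').point(lambda p: 0 if p < 128 else 255)`. The right threshold depends on how dark the text is compared with the noise, so it usually takes a few tries.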
Here are two CAPTCHAs from the school's website
We can see that the CAPTCHA contains single-pixel interference points, so we need to try to remove them. But after refreshing the CAPTCHA repeatedly, we found that this CAPTCHA has the following properties:
1. Addition only
2. Operands of at most two digits
3. The text is always pure red (255,0,0)
From this we can conclude that the CAPTCHA generation algorithm is flawed: because the text is always pure red, it can be separated from the background and the noise by color alone.
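from PIL import Image
import numpy as np
import pytesseract

# Open the CAPTCHA image; fill in the path to your image file.
im = Image.open('')
im = im.convert('RGB')
# Enlarge the image to make recognition easier.
im = im.resize((200, 80))
a = np.array(im)
# Binarize: pixels whose red channel is 255 become black text,
# everything else becomes white background.
for i in range(len(a)):
    for j in range(len(a[i])):
        if a[i][j][0] == 255:
            a[i][j] = [0, 0, 0]
        else:
            a[i][j] = [255, 255, 255]
im = Image.fromarray(a)
im.show()
vcode = pytesseract.image_to_string(im)
print(vcode)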
With the script above we can binarize the image and recognize it with Google's OCR. We can then use eval() to evaluate the recognized expression.
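Since eval() runs whatever text the OCR step returns, a narrower alternative is to parse the two operands explicitly. The sketch below is one possible way to do that; it assumes the recognized text looks something like `12+7=?`, which is a guess about the CAPTCHA format, and `solve_addition` is a name invented for this example.

```python
import re


def solve_addition(text):
    """Return the sum for OCR text such as '12+7=?', or None if it does not parse."""
    # Drop whitespace that OCR often inserts between characters.
    compact = re.sub(r"\s+", "", text)
    # Expect two operands of one or two digits joined by '+';
    # anything after the second operand (such as '=?') is ignored.
    match = re.match(r"^(\d{1,2})\+(\d{1,2})", compact)
    if match is None:
        return None
    return int(match.group(1)) + int(match.group(2))
```

Returning None on a failed parse makes it easy to refresh the CAPTCHA and try again when the OCR result is garbled.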
Summary
That concludes this basic introduction to CAPTCHA recognition in Python. I hope this article helps with your study or work; if you have any questions, feel free to leave a comment.