
Python + OpenCV text detection implementation

In this tutorial, you will learn how to use OpenCV to detect text in an image using the EAST Text Detector.

The EAST text detector requires us to run OpenCV 3.4.2 or OpenCV 4 on our system.


In the first part of today's tutorial, I'll discuss why detecting text in natural scene images can be so challenging. From there I'll briefly discuss the EAST text detector, why we use it, and what makes the algorithm so novel - I'll also provide a link to the original paper so you can read the details if you'd like.

Finally, I'll provide my Python + OpenCV text detection implementation so you can start applying text detection in your own applications.

Why is natural scene text detection so challenging?

Detecting text in a constrained and controlled environment can often be accomplished by using heuristic-based methods, such as utilizing gradient information or the fact that text is often grouped into paragraphs and characters appear in a straight line.

However, natural scene text detection is different - and much more challenging. Due to the proliferation of cheap digital cameras, not to mention the fact that nearly every smartphone now has a camera, we need to pay close attention to the conditions under which an image was captured - and, moreover, to which assumptions we can and cannot make. Below, I summarize the natural scene text detection challenges described in Celine Mancas-Thillou and Bernard Gosselin's excellent 2017 paper, "Natural Scene Text Understanding":

  • Image/sensor noise: Sensor noise from a handheld camera is typically higher than that of a traditional scanner. In addition, low-cost cameras typically interpolate the pixels of the raw sensor to produce real colors.
  • Perspective: Text in natural scenes can naturally exhibit perspective distortion, appearing at angles that are not parallel to the viewing plane, which makes the text harder to recognize.
  • Blurring: Images from uncontrolled environments tend to be blurry, especially if the end user is holding a smartphone without some form of stabilization.
  • Lighting conditions: We cannot make any assumptions about the lighting conditions in a natural scene image. It may be nearly dark, the camera's flash may be on, or the sun may be shining brightly and saturating the entire image.
  • Resolution: not all cameras are created equal - we may deal with cameras with lower than standard resolution.
  • Non-paper objects: Most, but not all, paper is not reflective (at least in the context of the paper you are trying to scan). Text in natural scenes may be highly reflective, including logos, signs, etc.
  • Non-flat objects: Consider what happens when you put text around a bottle - the text on the surface will be distorted. While humans may still be able to "detect" and read the text easily, our algorithms will face difficulties. We need to be able to handle such use cases.
  • Unknown layout: We cannot use any a priori information to provide our algorithm with "clues" about the location of the text.

EAST Deep Learning Text Detector

With the release of OpenCV 3.4.2 and OpenCV 4, we now have access to a deep learning-based text detector called EAST, which is based on the 2017 paper EAST: An Efficient and Accurate Scene Text Detector by Zhou et al.

We call this algorithm "EAST" because it is an efficient and accurate pipeline for scene text detection.

The group of authors say that the EAST pipeline is capable of predicting words and lines of text in any orientation on a 720p image and can run at 13 FPS. Perhaps most importantly, because the deep learning model is end-to-end, it can bypass the computationally expensive sub-algorithms typically applied by other text detectors, including candidate aggregation and word partitioning.

To build and train such a deep learning model, the EAST method utilizes a novel, well-designed loss function. For more detailed information about EAST, including the architectural design and training methodology, be sure to refer to the author's publication.

Project structure

$ tree --dirsfirst
.
├── images
│   ├── car_wash.png
│   ├── lebron_james.jpg
│   └── 
├── frozen_east_text_detection.pb
├── text_detection.py
└── text_detection_video.py

Please note that I have provided three sample images in the images/ directory. You may wish to add images from your own smartphone collection or images you find online. We will review two .py files today:

  • text_detection.py : detect text in still images.
  • text_detection_video.py : detect text via webcam or input video file.

Implementation note

The text detection implementation I'm including today is based on OpenCV's official C++ example; however, I must admit to having some trouble converting it to Python.

First of all, Point2f and RotatedRect functions are not available in Python, so I can't 100% mimic the C++ implementation. The C++ implementation can generate a rotated bounding box, but unfortunately, the one I'm sharing with you today cannot.

Second, the NMSBoxes function doesn't return any values for the Python bindings (at least for my pre-release installation of OpenCV 4), which ultimately causes OpenCV to throw an error. The NMSBoxes function does work in OpenCV 3.4.2, but I was unable to test it exhaustively.

I worked around this by using my own non-maxima suppression implementation from imutils, but again, I am not convinced that the two are 100% interchangeable, since it appears that NMSBoxes accepts additional parameters.

Given all this, I have done my best to provide you with the best OpenCV text detection implementation using the working features and resources I have. If you have any improvements to the method, please feel free to share them in the comments below.

Implementing our text detector with OpenCV

Before we get started, I'd like to point out that you need to have at least OpenCV 3.4.2 (or OpenCV 4) installed on your system in order to use OpenCV's EAST text detector. Next, make sure you also have imutils installed/upgraded on your system:

 pip install --upgrade imutils
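
Since the EAST detector needs OpenCV 3.4.2 or newer, it is also worth confirming which OpenCV version your Python environment actually picks up:

 python -c "import cv2; print(cv2.__version__)"

If this prints a version older than 3.4.2, the EAST model will not load.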

At this point your system has been configured, so open text_detection.py and insert the following code:

# import the necessary packages
from imutils.object_detection import non_max_suppression
import numpy as np
import argparse
import time
import cv2
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", type=str,
	help="path to input image")
ap.add_argument("-east", "--east", type=str,
	help="path to input EAST text detector")
ap.add_argument("-c", "--min-confidence", type=float, default=0.5,
	help="minimum probability required to inspect a region")
ap.add_argument("-w", "--width", type=int, default=320,
	help="resized image width (should be multiple of 32)")
ap.add_argument("-e", "--height", type=int, default=320,
	help="resized image height (should be multiple of 32)")
args = vars(ap.parse_args())

First, we import the required packages and modules. Notably, we import NumPy, OpenCV, and my non_max_suppression implementation from imutils.object_detection. We then parse five command line arguments:

--image : The path to our input image.

--east : The path to the EAST scene text detector model file.

--min-confidence : The probability threshold for determining text. Optional, default=0.5.

--width : The resized image width - must be a multiple of 32 (see the helper sketch below). Optional, default=320.

--height : The resized image height - must be a multiple of 32. Optional, default=320.
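
If you want to guard against passing a width or height that violates the multiple-of-32 constraint, a small helper - my own addition, not part of the original script - can snap any requested dimension to the nearest valid value:

def round_to_multiple_of_32(value):
	# snap to the nearest multiple of 32, but never below 32
	return max(32, int(round(value / 32.0)) * 32)

# e.g. round_to_multiple_of_32(300) -> 288, round_to_multiple_of_32(330) -> 320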

IMPORTANT NOTE: The EAST text detector requires that your input image dimensions be multiples of 32, so if you choose to adjust the --width and --height values, make sure they are multiples of 32! From there, let's load our image and resize it:

# load the input image and grab the image dimensions
image = cv2.imread(args["image"])
orig = image.copy()
(H, W) = image.shape[:2]
# set the new width and height and then determine the ratio in change
# for both the width and height
(newW, newH) = (args["width"], args["height"])
rW = W / float(newW)
rH = H / float(newH)
# resize the image and grab the new image dimensions
image = cv2.resize(image, (newW, newH))
(H, W) = image.shape[:2]

We load and copy our input image, then determine the ratio of the original image dimensions to the new image dimensions (based on the command line arguments provided for --width and --height). We then resize the image, ignoring the aspect ratio. In order to perform text detection using OpenCV and the EAST deep learning model, we need to extract the output feature maps of two layers:

# define the two output layer names for the EAST detector model that
# we are interested -- the first is the output probabilities and the
# second can be used to derive the bounding box coordinates of text
layerNames = [
	"feature_fusion/Conv_7/Sigmoid",
	"feature_fusion/concat_3"]

We build a list of layerNames:

The first layer is our output sigmoid activation, which gives us the probability of whether a region contains text or not.

The second layer is the output feature map, which represents the "geometry" of the image - we will be able to use this geometry to derive the bounding box coordinates of the text in the input image.

Let's load OpenCV's EAST text detector:

# load the pre-trained EAST text detector
print("[INFO] loading EAST text detector...")
net = cv2.dnn.readNet(args["east"])
# construct a blob from the image and then perform a forward pass of
# the model to obtain the two output layer sets
blob = cv2.dnn.blobFromImage(image, 1.0, (W, H),
	(123.68, 116.78, 103.94), swapRB=True, crop=False)
start = time.time()
net.setInput(blob)
(scores, geometry) = net.forward(layerNames)
end = time.time()
# show timing information on text prediction
print("[INFO] text detection took {:.6f} seconds".format(end - start))

We load the neural network into memory by passing the path to the EAST detector to cv2.dnn.readNet.

We then prepare our image by converting it to a blob. To read more about this step, see Deep Learning: How OpenCV's blobFromImage works. To predict text, we simply set the blob as the network input and call net.forward. These lines are wrapped with timestamps so that we can print the elapsed time. By providing layerNames as an argument to net.forward, we instruct OpenCV to return the two feature maps we are interested in:

  • The geometry map, used to derive the bounding box coordinates of text in the input image
  • The scores map, which contains the probability that a given region contains text
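
To make the layout of these two outputs concrete, it can help to print their shapes after the forward pass; for a 320x320 input you should see feature maps at one quarter of the input resolution, roughly like this:

print(scores.shape)    # (1, 1, 80, 80)  -- one text/no-text probability per cell
print(geometry.shape)  # (1, 5, 80, 80)  -- four distances plus one angle per cell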

We need to loop through each of these values one by one:

# grab the number of rows and columns from the scores volume, then
# initialize our set of bounding box rectangles and corresponding
# confidence scores
(numRows, numCols) = scores.shape[2:4]
rects = []
confidences = []
# loop over the number of rows
for y in range(0, numRows):
	# extract the scores (probabilities), followed by the geometrical
	# data used to derive potential bounding box coordinates that
	# surround text
	scoresData = scores[0, 0, y]
	xData0 = geometry[0, 0, y]
	xData1 = geometry[0, 1, y]
	xData2 = geometry[0, 2, y]
	xData3 = geometry[0, 3, y]
	anglesData = geometry[0, 4, y]

We first grab the dimensions of the scores volume and then initialize two lists:

  • rects : stores the bounding box (x, y) coordinates of the text area.
  • confidences : stores the probabilities associated with each of the bounding boxes in rects.

We will apply non-maxima suppression to these regions later. We then loop over the rows, extracting the scores and geometry data for the current row y. Next, we loop over each column index of the currently selected row:

	# loop over the number of columns
	for x in range(0, numCols):
		# if our score does not have sufficient probability, ignore it
		if scoresData[x] < args["min_confidence"]:
			continue
		# compute the offset factor as our resulting feature maps will
		# be 4x smaller than the input image
		(offsetX, offsetY) = (x * 4.0, y * 4.0)
		# extract the rotation angle for the prediction and then
		# compute the sin and cosine
		angle = anglesData[x]
		cos = np.cos(angle)
		sin = np.sin(angle)
		# use the geometry volume to derive the width and height of
		# the bounding box
		h = xData0[x] + xData2[x]
		w = xData1[x] + xData3[x]
		# compute both the starting and ending (x, y)-coordinates for
		# the text prediction bounding box
		endX = int(offsetX + (cos * xData1[x]) + (sin * xData2[x]))
		endY = int(offsetY - (sin * xData1[x]) + (cos * xData2[x]))
		startX = int(endX - w)
		startY = int(endY - h)
		# add the bounding box coordinates and probability score to
		# our respective lists
		rects.append((startX, startY, endX, endY))
		confidences.append(scoresData[x])

For each row, we start traversing the columns. We need to filter out weak text detection by ignoring regions that do not have a high enough probability.

As the image passes through the network, the EAST text detector naturally reduces the volume size - our volume size is actually 4 times smaller than our input image, so we multiply by 4 to bring the coordinates back to the original image.
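
As a quick worked example (my own numbers, purely illustrative): a detection at feature-map column x=10 and row y=5 corresponds to (offsetX, offsetY) = (40.0, 20.0) in the 320x320 resized image, and those coordinates are later multiplied by rW and rH to map the box back onto the original image.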

Extract the angle data. Then we update our rectangle and confidence lists separately. We are almost done! The final step is to apply non-maximal suppression to our bounding boxes to suppress weakly overlapping bounding boxes, and then display the resulting text predictions:

# apply non-maxima suppression to suppress weak, overlapping bounding
# boxes
boxes = non_max_suppression(np.array(rects), probs=confidences)
# loop over the bounding boxes
for (startX, startY, endX, endY) in boxes:
	# scale the bounding box coordinates based on the respective
	# ratios
	startX = int(startX * rW)
	startY = int(startY * rH)
	endX = int(endX * rW)
	endY = int(endY * rH)
	# draw the bounding box on the image
	cv2.rectangle(orig, (startX, startY), (endX, endY), (0, 255, 0), 2)
# show the output image
("Text Detection", orig)
(0)

As I mentioned in the previous section, I was unable to use OpenCV's NMSBoxes for non-maxima suppression in my OpenCV 4 installation because the Python bindings did not return a value, which ultimately led to an OpenCV error. I was unable to fully test this in OpenCV 3.4.2, so it may work in v3.4.2.

Instead, I used the non-maximum suppression implementation provided in the imutils package. The results still look good; however, I cannot compare my output against the NMSBoxes function to confirm whether they are identical. Looping over our bounding boxes, we scale the coordinates back to the original image dimensions and draw the boxes on our original image. The original image remains displayed until a key is pressed.
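
For reference, here is roughly what a call to OpenCV's own NMSBoxes could look like in an installation where the Python binding does return indices. This is an untested sketch on my part: NMSBoxes expects boxes in (x, y, width, height) form, and the 0.4 overlap threshold below is an arbitrary choice, not a value from the original script:

# convert our (startX, startY, endX, endY) boxes to (x, y, w, h) form
nmsRects = [(sx, sy, ex - sx, ey - sy) for (sx, sy, ex, ey) in rects]
indices = cv2.dnn.NMSBoxes(nmsRects, [float(c) for c in confidences],
	args["min_confidence"], 0.4)
# keep only the boxes that survive suppression
boxes = [rects[i] for i in np.array(indices).flatten()]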

As a final implementation note, I'd like to mention that the two nested for loops over the scores and geometry volumes would be a great example of where you could significantly speed up your pipeline with Cython. I've already demonstrated the power of Cython in fast, optimized pixel loops with OpenCV and Python.
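
I have not included a Cython version here, but as an illustration of the same idea (removing the Python-level loops), below is a vectorized NumPy sketch of the decoding step. The function name decode_predictions_vectorized is my own, and I have only checked it against the (1, 1, H/4, W/4) scores and (1, 5, H/4, W/4) geometry layout used above, so treat it as a starting point rather than a drop-in replacement:

def decode_predictions_vectorized(scores, geometry, min_confidence=0.5):
	# unpack the per-cell scores, the four distance maps, and the angle map
	scoresData = scores[0, 0]
	(xData0, xData1, xData2, xData3, anglesData) = geometry[0]
	(numRows, numCols) = scoresData.shape
	# each feature-map cell corresponds to a 4x4 patch of the resized input
	(cols, rows) = np.meshgrid(np.arange(numCols), np.arange(numRows))
	(offsetX, offsetY) = (cols * 4.0, rows * 4.0)
	(cos, sin) = (np.cos(anglesData), np.sin(anglesData))
	h = xData0 + xData2
	w = xData1 + xData3
	endX = offsetX + (cos * xData1) + (sin * xData2)
	endY = offsetY - (sin * xData1) + (cos * xData2)
	(startX, startY) = (endX - w, endY - h)
	# keep only the cells whose text probability is high enough
	mask = scoresData >= min_confidence
	rects = np.stack([startX[mask], startY[mask],
		endX[mask], endY[mask]], axis=1).astype("int")
	return (rects.tolist(), scoresData[mask].tolist())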

OpenCV Text Detection Results

Are you ready to apply text detection to images?

Download frozen_east_text_detection.pb at:

oyyd/frozen_east_text_detection.pb

From there, you can execute the following command in the terminal (note the two command line arguments):

$ python text_detection.py --image images/lebron_james.jpg \
	--east frozen_east_text_detection.pb

Your results should resemble the diagram below:

Three text regions are detected in the image of LeBron James. Now let's try to detect text in a business sign:

$ python text_detection.py --image images/car_wash.png \
	--east frozen_east_text_detection.pb

Detecting Text in Video with OpenCV

Now that we've learned how to detect text in images, let's move on to detecting text in videos using OpenCV. This explanation will be very brief; please refer to the previous section for more details as needed. Open text_detection_video.py and insert the following code:

# import the necessary packages
from imutils.video import VideoStream
from imutils.video import FPS
from imutils.object_detection import non_max_suppression
import numpy as np
import argparse
import imutils
import time
import cv2

We'll start by importing our packages. We will use VideoStream to access the webcam and FPS to benchmark the frames per second of this script. Everything else is the same as in the previous section.

For convenience, let's define a new function to decode our predictions - it will be reused for every frame and will make our main loop cleaner:

def decode_predictions(scores, geometry):
	# grab the number of rows and columns from the scores volume, then
	# initialize our set of bounding box rectangles and corresponding
	# confidence scores
	(numRows, numCols) = scores.shape[2:4]
	rects = []
	confidences = []
	# loop over the number of rows
	for y in range(0, numRows):
		# extract the scores (probabilities), followed by the
		# geometrical data used to derive potential bounding box
		# coordinates that surround text
		scoresData = scores[0, 0, y]
		xData0 = geometry[0, 0, y]
		xData1 = geometry[0, 1, y]
		xData2 = geometry[0, 2, y]
		xData3 = geometry[0, 3, y]
		anglesData = geometry[0, 4, y]
		# loop over the number of columns
		for x in range(0, numCols):
			# if our score does not have sufficient probability,
			# ignore it
			if scoresData[x] < args["min_confidence"]:
				continue
			# compute the offset factor as our resulting feature
			# maps will be 4x smaller than the input image
			(offsetX, offsetY) = (x * 4.0, y * 4.0)
			# extract the rotation angle for the prediction and
			# then compute the sin and cosine
			angle = anglesData[x]
			cos = np.cos(angle)
			sin = np.sin(angle)
			# use the geometry volume to derive the width and height
			# of the bounding box
			h = xData0[x] + xData2[x]
			w = xData1[x] + xData3[x]
			# compute both the starting and ending (x, y)-coordinates
			# for the text prediction bounding box
			endX = int(offsetX + (cos * xData1[x]) + (sin * xData2[x]))
			endY = int(offsetY - (sin * xData1[x]) + (cos * xData2[x]))
			startX = int(endX - w)
			startY = int(endY - h)
			# add the bounding box coordinates and probability score
			# to our respective lists
			rects.append((startX, startY, endX, endY))
			confidences.append(scoresData[x])
	# return a tuple of the bounding boxes and associated confidences
	return (rects, confidences)

The decode_predictions function is defined.

This function is used to extract (1) the bounding box coordinates of a text region and (2) the probability that the region contains text. This dedicated function will make the code easier to read and manage later in the script. Let's parse our command line arguments:

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-east", "--east", type=str,
	help="path to input EAST text detector")
ap.add_argument("-v", "--video", type=str,
	help="path to optional input video file")
ap.add_argument("-c", "--min-confidence", type=float, default=0.5,
	help="minimum probability required to inspect a region")
ap.add_argument("-w", "--width", type=int, default=320,
	help="resized image width (should be multiple of 32)")
ap.add_argument("-e", "--height", type=int, default=320,
	help="resized image height (should be multiple of 32)")
args = vars(ap.parse_args())

Command line argument parsing:

--east : The path to the EAST scene text detector model file.

--video : The path to our input video. Optional - if a video path is provided, the webcam will not be used.

--min-confidence : The probability threshold for determining text. Optional, default=0.5.

--width : The resized frame width (must be a multiple of 32). Optional, default=320.

--height : The resized frame height (must be a multiple of 32). Optional, default=320.

The main change from the image-only script in the previous section (in terms of command line arguments) is that I've replaced the --image argument with --video. Important: the EAST text detector requires that your input frame dimensions be multiples of 32, so if you choose to adjust the --width and --height values, make sure they are multiples of 32! Next, we'll perform some important initializations that mimic the previous script:

# initialize the original frame dimensions, new frame dimensions,
# and ratio between the dimensions
(W, H) = (None, None)
(newW, newH) = (args["width"], args["height"])
(rW, rH) = (None, None)
# define the two output layer names for the EAST detector model that
# we are interested -- the first is the output probabilities and the
# second can be used to derive the bounding box coordinates of text
layerNames = [
	"feature_fusion/Conv_7/Sigmoid",
	"feature_fusion/concat_3"]
# load the pre-trained EAST text detector
print("[INFO] loading EAST text detector...")
net = cv2.dnn.readNet(args["east"])

The height/width and ratio initializations will allow us to properly scale the bounding boxes later. Our output layer names are defined, and we load our pre-trained EAST text detector. The following block sets up our video stream and frames-per-second counter:

# if a video path was not supplied, grab the reference to the web cam
if not args.get("video", False):
	print("[INFO] starting video stream...")
	vs = VideoStream(src=0).start()
	time.sleep(1.0)
# otherwise, grab a reference to the video file
else:
	vs = cv2.VideoCapture(args["video"])
# start the FPS throughput estimator
fps = FPS().start()

Our video stream is set up as either a webcam or a video file.

We then initialize the frames-per-second counter and start looping over incoming frames:

# loop over frames from the video stream
while True:
	# grab the current frame, then handle if we are using a
	# VideoStream or VideoCapture object
	frame = vs.read()
	frame = frame[1] if args.get("video", False) else frame
	# check to see if we have reached the end of the stream
	if frame is None:
		break
	# resize the frame, maintaining the aspect ratio
	frame = imutils.resize(frame, width=1000)
	orig = frame.copy()
	# if our frame dimensions are None, we still need to compute the
	# ratio of old frame dimensions to new frame dimensions
	if W is None or H is None:
		(H, W) = frame.shape[:2]
		rW = W / float(newW)
		rH = H / float(newH)
	# resize the frame, this time ignoring aspect ratio
	frame = cv2.resize(frame, (newW, newH))

We iterate over the video/webcam frames. The frame is first resized while maintaining the aspect ratio. From there, we grab the dimensions and compute the scaling ratios. We then resize the frame again (to dimensions that are multiples of 32), this time ignoring the aspect ratio, because we have already safely stored the ratios. Inference and drawing the text region bounding boxes happen in the following lines:

	# construct a blob from the frame and then perform a forward pass
	# of the model to obtain the two output layer sets
	blob = cv2.dnn.blobFromImage(frame, 1.0, (newW, newH),
		(123.68, 116.78, 103.94), swapRB=True, crop=False)
	net.setInput(blob)
	(scores, geometry) = net.forward(layerNames)
	# decode the predictions, then  apply non-maxima suppression to
	# suppress weak, overlapping bounding boxes
	(rects, confidences) = decode_predictions(scores, geometry)
	boxes = non_max_suppression(np.array(rects), probs=confidences)
	# loop over the bounding boxes
	for (startX, startY, endX, endY) in boxes:
		# scale the bounding box coordinates based on the respective
		# ratios
		startX = int(startX * rW)
		startY = int(startY * rH)
		endX = int(endX * rW)
		endY = int(endY * rH)
		# draw the bounding box on the frame
		cv2.rectangle(orig, (startX, startY), (endX, endY), (0, 255, 0), 2)

In this block, we:

Use EAST to detect text regions by creating a blob and passing it to the network.

Decode the predictions and apply the NMS. We use the decode_predictions function previously defined in this script and my imutils non_max_suppression convenience function.

Loop over the bounding boxes and draw them on the frame. This involves scaling the box coordinates by the previously computed ratios.

From there we'll close the frame handling loop as well as the script itself:

	# update the FPS counter
	fps.update()
	# show the output frame
	cv2.imshow("Text Detection", orig)
	key = cv2.waitKey(1) & 0xFF
	# if the `q` key was pressed, break from the loop
	if key == ord("q"):
		break
# stop the timer and display FPS information
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
# if we are using a webcam, release the pointer
if not args.get("video", False):
	vs.stop()
# otherwise, release the file pointer
else:
	vs.release()
# close all windows
cv2.destroyAllWindows()

We update our FPS counter on each iteration of the loop so that we can calculate and display timings after we break out of the loop. We display the output of the EAST text detection and handle keypresses. If "q" is pressed, we break out of the loop and proceed to clean up and release pointers.

Video text detection results

Open a terminal and execute the following command (this will start your webcam, since we didn't provide --video via a command line argument):

python text_detection_video.py --east frozen_east_text_detection.pb 
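
If you would rather run the detector on a video file instead of the webcam, supply one via --video (the filename below is just a placeholder):

python text_detection_video.py --east frozen_east_text_detection.pb \
	--video my_video.mp4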

Summary

In today's blog post, we learned how to use OpenCV's new EAST Text Detector to automatically detect the presence of text in natural scene images.

The text detector is not only accurate, but can run in near real-time at approximately 13 FPS on 720p images.

In order to provide an implementation of OpenCV's EAST text detector, I needed to convert OpenCV's C++ examples; however, I encountered a number of challenges, such as:

  • I was unable to use OpenCV's NMSBoxes for non-maxima suppression and instead had to use the implementation from imutils.
  • Due to the lack of a Python binding for RotatedRect, it is not possible to compute a true rotated bounding box.

That concludes the details of this Python + OpenCV text detection implementation. For more information about text detection with Python, please see my other related articles!