I am trying to detect the underlines where students write their answers in the homework, but I cannot get the Hough Line Transform to work. It is detecting way too many lines, and if I increase the thresholds, it only detects vertical lines. Is there another method to do this?
This is the code I have based on another post:
gray = cv2.imread(image_path + '000003.png')
edges = cv2.Canny(gray,50,150,apertureSize = 3)
cv2.imwrite('edges-50-150.jpg',edges)
minLineLength=100
lines = cv2.HoughLinesP(image=edges,rho=10,theta=np.pi/180, threshold=100,lines=np.array([]), minLineLength=minLineLength,maxLineGap=80)
a,b,c = lines.shape
for i in range(a):
    cv2.line(gray, (lines[i][0][0], lines[i][0][1]), (lines[i][0][2], lines[i][0][3]), (0, 0, 255), 3, cv2.LINE_AA)
cv2.imwrite('houghlines5.jpg',gray)
When I run the code above I get these lines: Hough Transform Result
Edit: original image - Original Image
I am working on a project with the goal of extracting structured data from a series of tables captured in images.
I have achieved some success adapting the process outlined in this extremely helpful medium post.
As best I understand, this program works by creating a contour mask, of sorts, to outline the borders of a table. Here is the relevant code performing that function:
#Load image as numpy array
img = np.array(img)
#Threshold image to binary image
thresh,img_bin = cv2.threshold(img,128,255,cv2.THRESH_BINARY |cv2.THRESH_OTSU)
#inverting the image
img_bin = 255-img_bin
# Length(width) of kernel as 100th of total width
kernel_len = np.array(img).shape[1]//100
# Defining a vertical kernel to detect all vertical lines of image
ver_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_len))
# Defining a horizontal kernel to detect all horizontal lines of image
hor_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_len, 1))
# A kernel of 2x2
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
#Use vertical kernel to detect and save the vertical lines in a jpg
image_1 = cv2.erode(img_bin, ver_kernel, iterations=3)
vertical_lines = cv2.dilate(image_1, ver_kernel, iterations=3)
#Use horizontal kernel to detect and save the horizontal lines in a jpg
image_2 = cv2.erode(img_bin, hor_kernel, iterations=3)
horizontal_lines = cv2.dilate(image_2, hor_kernel, iterations=3)
# Combine horizontal and vertical lines in a new third image, with both having same weight.
img_vh = cv2.addWeighted(vertical_lines, 0.5, horizontal_lines, 0.5, 0.0)
#Eroding and thresholding the image
img_vh = cv2.erode(~img_vh, kernel, iterations=2)
thresh, img_vh = cv2.threshold(img_vh,128,255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
This process produces a NumPy array that can be interpreted as an image like this:
From there, the program can identify the table cells outlined on four sides by the contour mask.
Unfortunately, many of the tables that I seek to process, including the one above, lack perfect border formatting. The left-most column above lacks a left border (there is still data inside it). Other tables I have lack internal borders entirely, relying on white space to format the data for the human eye.
As best I can tell, my path forward here is to add the missing contour lines myself using some kind of logic based on visual elements on the page. In the first example, I could attempt to add a left-side vertical line to the contour mask based on the position of the other contours. In the second example, I could try to add table borders based on consistencies in the position of the text.
That being said, this strategy would require a significant amount of logic, and may not be flexible enough to deal with the various table formats I may come into contact with.
Am I approaching this challenge with the right strategy? Is there a deployable software solution that I am not seeing? Ideally, I'd like this to be as automated as possible.
Any help would be greatly appreciated!
I've been researching and trying a couple functions to get what I want and I feel like I might be overthinking it.
One version of my code is below. The sample image is here.
My end goal is to find the angle (yellow) of the approximated line with respect to the frame (green line) Final
I haven't even got to the angle portion of the program yet.
The results I was obtaining from the below code were as follows. Canny Closed Small Removed
Anybody have a better way of creating the difference and establishing the estimated line?
Any help is appreciated.
import cv2
import numpy as np
pX = int(512)
pY = int(768)
img = cv2.imread('IMAGE LOCATION', cv2.IMREAD_COLOR)
imgS = cv2.resize(img, (pX, pY))
aimg = cv2.imread('IMAGE LOCATION', cv2.IMREAD_GRAYSCALE)
# Blur image to reduce noise and resize for viewing
blur = cv2.medianBlur(aimg, 5)
rblur = cv2.resize(blur, (384, 512))
canny = cv2.Canny(rblur, 120, 255, 1)
cv2.imshow('canny', canny)
kernel = np.ones((2, 2), np.uint8)
#fringeMesh = cv2.dilate(canny, kernel, iterations=2)
#fringeMesh2 = cv2.dilate(fringeMesh, None, iterations=1)
#cv2.imshow('fringeMesh', fringeMesh2)
closing = cv2.morphologyEx(canny, cv2.MORPH_CLOSE, kernel)
cv2.imshow('Closed', closing)
nb_components, output, stats, centroids = cv2.connectedComponentsWithStats(closing, connectivity=8)
#connectedComponentswithStats yields every separated component with information on each of them, such as size
sizes = stats[1:, -1]; nb_components = nb_components - 1
min_size = 200 #num_pixels
fringeMesh3 = np.zeros((output.shape))
for i in range(0, nb_components):
    if sizes[i] >= min_size:
        fringeMesh3[output == i + 1] = 255
#contours, _ = cv2.findContours(fringeMesh3, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
#cv2.drawContours(fringeMesh3, contours, -1, (0, 255, 0), 1)
cv2.imshow('final', fringeMesh3)
#cv2.imshow("Natural", imgS)
#cv2.imshow("img", img)
cv2.imshow("aimg", aimg)
cv2.imshow("Blur", rblur)
cv2.waitKey()
cv2.destroyAllWindows()
You can fit a straight line to the first white pixel you encounter in each column, starting from the bottom.
I had to trim your image because you shared a screen grab of it with a window decoration, title and frame rather than your actual image:
import cv2
import math
import numpy as np
# Load image as greyscale
im = cv2.imread('trimmed.jpg', cv2.IMREAD_GRAYSCALE)
# Get index of first white pixel in each column, starting at the bottom
yvals = (im[::-1,:]>200).argmax(axis=0)
# Make the x values 0, 1, 2, 3...
xvals = np.arange(0,im.shape[1])
# Fit a line of the form y = mx + c
z = np.polyfit(xvals, yvals, 1)
# Convert the slope to an angle
angle = np.arctan(z[0]) * 180/math.pi
Note 1: The value of z (the result of fitting) is:
array([ -0.74002694, 428.01463745])
which means the equation of the line you are looking for is:
y = -0.74002694 * x + 428.01463745
i.e. the y-intercept is at row 428 from the bottom of the image.
Note 2: Try to avoid JPEG as an intermediate format in image processing. It is lossy and changes your pixel values, so where you have thresholded and done your morphology and expect values of exactly 255 and 0, JPEG will alter those values and you end up testing for a range or thresholding again.
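To sanity-check the column scan and line fit described above without the original photo, here is a minimal sketch on a synthetic mask. The function name `fit_boundary_from_bottom`, the image size and the boundary used to build the mask are all assumptions for illustration; only the argmax and polyfit steps come from the answer. Columns with no white pixel are skipped, because argmax returns 0 for them and would drag the fit down.

```python
import numpy as np

def fit_boundary_from_bottom(mask, white=200):
    """Fit y = m*x + c to the first white pixel in each column, counted from the bottom."""
    is_white = mask[::-1, :] > white      # flip so that row 0 is the bottom row
    has_white = is_white.any(axis=0)      # argmax gives 0 for all-black columns, so skip those
    yvals = is_white.argmax(axis=0)
    xvals = np.arange(mask.shape[1])
    slope, intercept = np.polyfit(xvals[has_white], yvals[has_white], 1)
    return slope, intercept, np.degrees(np.arctan(slope))

# Hypothetical test image: white above the line y = 150 - 0.5*x (y measured from the bottom)
height, width = 200, 300
from_bottom = np.arange(height)[:, None]
cols = np.arange(width)[None, :]
flipped = ((from_bottom >= 150 - 0.5 * cols) * 255).astype(np.uint8)
synthetic = flipped[::-1, :]              # back to normal image orientation (row 0 at the top)

slope, intercept, angle = fit_boundary_from_bottom(synthetic)
print(slope, intercept, angle)
```

The recovered slope should come out close to the -0.5 used to build the mask, with a small offset in the intercept from pixel rounding.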
Your 'Closed' image seems to quite clearly segment the two regions, so I'd suggest you focus on turning that boundary into a line that you can do something with. Connected components analysis and contour detection don't really provide any useful information here, so aren't necessary.
One quite simple approach to finding the line angle is to find the first white pixel in each row. To get only the rows that are part of your diagonal, don't include rows where that pixel is too close to either side (e.g. within 5%). That gives you a set of points (pixel locations) on the boundary of your two types of grass.
From there you can either do a linear regression to get an equation for the straight line, or you can get two points by averaging the x values for the top and bottom half of the rows, and then calculate the gradient angle from that.
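A rough sketch of the row scan described above, assuming a binary mask where the right-hand region is white. The helper name `first_white_per_row`, the 5% margin default and the synthetic mask are assumptions for illustration. It shows both options: a linear regression, and the two-point estimate from averaging the top and bottom halves.

```python
import numpy as np

def first_white_per_row(mask, margin=0.05):
    """Return (xs, ys) of the first white pixel in each row, skipping rows whose
    first white pixel lies within `margin` of the left or right edge."""
    height, width = mask.shape
    is_white = mask > 0
    first_x = is_white.argmax(axis=1)
    has_white = is_white.any(axis=1)
    keep = has_white & (first_x > margin * width) & (first_x < (1 - margin) * width)
    return first_x[keep], np.nonzero(keep)[0]

# Hypothetical mask: white to the right of the line x = 20 + 0.5*y
rows = np.arange(200)[:, None]
cols = np.arange(200)[None, :]
mask = ((cols >= 20 + 0.5 * rows) * 255).astype(np.uint8)

xs, ys = first_white_per_row(mask)

# Option 1: linear regression, fitting x as a function of y since the boundary may be steep
slope, intercept = np.polyfit(ys, xs, 1)
angle_from_vertical = np.degrees(np.arctan(slope))

# Option 2: two points from averaging the top and bottom halves of the rows
half = len(ys) // 2
slope_two_point = (xs[half:].mean() - xs[:half].mean()) / (ys[half:].mean() - ys[:half].mean())
print(slope, slope_two_point, angle_from_vertical)
```

Fitting x against y rather than y against x is a design choice here: it avoids a very large slope when the boundary is close to vertical.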
An alternative approach would be doing another morphological close with a very large kernel, to end up with just a solid white region and a solid black region, which you could turn into a line with canny or findContours. From there you could either get some points by averaging, use the endpoints, or given a smooth enough result from a large enough kernel you could detect the line with hough lines.
I'm trying to remove horizontal and vertical lines in this image in order to have more distinct text areas.
I'm using the below code, which follows this guide
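import math
import cv2
import numpy as np

image = cv2.imread('image.jpg')
gray = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
# assumption: `blurred` was never defined in this snippet, so the grayscale image is thresholded directly
thresh = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV,
    25,
    15
)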
# Create the images that will use to extract the horizontal and vertical lines
horizontal = np.copy(thresh)
vertical = np.copy(thresh)
# Specify size on horizontal axis
cols = horizontal.shape[1]
horizontal_size = math.ceil(cols / 20)
# Create structure element for extracting horizontal lines through morphology operations
horizontalStructure = cv2.getStructuringElement(cv2.MORPH_RECT, (horizontal_size, 1))
# Apply morphology operations
horizontal = cv2.erode(horizontal, horizontalStructure)
horizontal = cv2.dilate(horizontal, horizontalStructure)
# Show extracted horizontal lines
cv2.imwrite("horizontal.jpg", horizontal)
# Specify size on vertical axis
rows = vertical.shape[0]
verticalsize = math.ceil(rows / 20)
# Create structure element for extracting vertical lines through morphology operations
verticalStructure = cv2.getStructuringElement(cv2.MORPH_RECT, (1, verticalsize))
# Apply morphology operations
vertical = cv2.erode(vertical, verticalStructure)
vertical = cv2.dilate(vertical, verticalStructure)
After this, I know I would need to isolate the lines and mask the original image with the white lines, however I'm not really sure on how to proceed.
Does anyone have any suggestion?
Jeru's answer already gives you what you want. But I wanted to add an alternative that is maybe a bit more general than what you have so far.
You are converting the color image to gray-value, then apply adaptive threshold in an attempt to find lines. You filter this to get only the long horizontal and vertical lines, then use that mask to paint the original image white at those locations.
Here we look for all lines, and remove them from the image by painting them with whatever the surrounding color is. This process does not involve thresholding at all; all morphological operations are applied to the channels of the color image.
Ideally we'd use color morphology, but implementations of that are rare. Mathematical morphology is based on maximum and minimum operations, and the maximum or minimum of a color triplet (i.e. a vector) is not well defined.
So instead we apply the following procedure to each of the three color channels independently. This should produce results that are good enough for this application:
Extract the red channel: take the input RGB image, and extract the first channel. This is a gray-value image. We'll call this image channel.
Apply a top-hat filter to detect the thin structures: the difference between a closing with a small structuring element (SE) applied to channel, and channel (a closing is a dilation followed by an erosion with the same SE, you're using this to find lines as well). We'll call this output thin. thin = closing(channel)-channel. This step is similar to your local thresholding, but no actual threshold is applied. The resulting intensities indicate how dark the lines are with respect to the background. If you add thin to channel, you'll fill in these thin structures. The size of the SE here determines what is considered "thin".
Filter out the short lines, to keep only the long ones: apply an opening with a long horizontal SE to thin, and an opening with a long vertical SE to thin, and take the maximum of the two results. We'll call this lines. Note that this is the same process you used to generate horizontal and vertical. Instead of adding them together as Jeru suggested, we take the maximum. This makes it so that output intensities still match the contrast in channel. (In Mathematical Morphology parlance, the supremum of openings is an opening). The length of the SEs here determines what is long enough to be a line.
Fill in the lines in the original image channel: now simply add lines to channel. Write the result to the first channel of the output image.
Repeat the same process with the other two channels.
Using DIPlib this is quite a simple script:
import diplib as dip
input = dip.ImageReadTIFF('/home/cris/tmp/T4tbM.tif')
output = input.Copy()
for ii in range(0,3):
    channel = output.TensorElement(ii)
    thin = dip.Closing(channel, dip.SE(5, 'rectangular')) - channel
    vertical = dip.Opening(thin, dip.SE([100,1], 'rectangular'))
    horizontal = dip.Opening(thin, dip.SE([1,100], 'rectangular'))
    lines = dip.Supremum(vertical,horizontal)
    channel += lines # overwrites output image
Edit:
Increasing the size of the first SE (set to 5 above) so that it is large enough to also remove the thicker gray bar in the middle of the example image causes part of the block containing the inverted text "POWERLIFTING" to be left in thin.
To filter out those parts as well, we can change the definition of thin as follows:
notthin = dip.Closing(channel, dip.SE(11, 'rectangular'), ["add max"])
notthin = dip.MorphologicalReconstruction(notthin, channel, 1, "erosion")
thin = notthin - channel
That is, instead of thin=closing(channel)-channel, we do thin=reconstruct(closing(channel))-channel. The reconstruction simply expands selected (not thin) structures so that where part of a structure was selected, now the full structure is selected. The only thing that is now in thin are lines that are not connected to thicker structures.
I've also added "add max" as a boundary condition -- this causes the closing to expand the area outside the image with white, and therefore see lines at the edges of the image as lines.
To elaborate more here is what to do:
First, combine the resulting images of vertical and horizontal. This will give you an image containing both the horizontal and vertical lines. Both images are of type uint8 (unsigned 8-bit integer); because NumPy's + wraps around on overflow (255 + 255 gives 254 where lines cross), a saturating cv2.add is the safer way to combine them:
res = cv2.add(vertical, horizontal)
Finally, mask the resulting image obtained above with the original 3-channel image. This can be accomplished using cv2.bitwise_and:
fin = cv2.bitwise_and(image, image, mask = cv2.bitwise_not(res))
A sample for removing horizontal lines.
Sample image:
import cv2
import numpy as np
img = cv2.imread("Image path", 0)
if len(img.shape) != 2:
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
else:
    gray = img
gray = cv2.bitwise_not(gray)
bw = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                           cv2.THRESH_BINARY, 15, -2)
horizontal = np.copy(bw)
cols = horizontal.shape[1]
horizontal_size = cols // 30
horizontalStructure = cv2.getStructuringElement(cv2.MORPH_RECT, (horizontal_size, 1))
horizontal = cv2.erode(horizontal, horizontalStructure)
horizontal = cv2.dilate(horizontal, horizontalStructure)
cv2.imwrite("horizontal_lines_extracted.png", horizontal)
horizontal_inv = cv2.bitwise_not(horizontal)
cv2.imwrite("inverse_extracted.png", horizontal_inv)
masked_img = cv2.bitwise_and(gray, gray, mask=horizontal_inv)
masked_img_inv = cv2.bitwise_not(masked_img)
cv2.imwrite("masked_img.png", masked_img_inv)
=> horizontal_lines_extracted.png:
=> inverse_extracted.png
=> masked_img.png(resultant image after masking)
Do you want something like this?
image = cv2.imread('image.jpg', cv2.IMREAD_UNCHANGED)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
ret,binary = cv2.threshold(gray, 170, 255, cv2.THRESH_BINARY)#|cv2.THRESH_OTSU)
V = cv2.Sobel(binary, cv2.CV_8U, dx=1, dy=0)
H = cv2.Sobel(binary, cv2.CV_8U, dx=0, dy=1)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
V = cv2.morphologyEx(V, cv2.MORPH_DILATE, kernel, iterations = 2)
H = cv2.morphologyEx(H, cv2.MORPH_DILATE, kernel, iterations = 2)
rows,cols = image.shape[:2]
mask = np.zeros(image.shape[:2], dtype=np.uint8)
# [-2] picks the contour list in both OpenCV 3 (three return values) and OpenCV 4 (two)
contours = cv2.findContours(V, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)[-2]
for cnt in contours:
    (x,y,w,h) = cv2.boundingRect(cnt)
    # manipulate these values to change accuracy
    if h > rows/2 and w < 10:
        cv2.drawContours(mask, [cnt], -1, 255,-1)
contours = cv2.findContours(H, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)[-2]
for cnt in contours:
    (x,y,w,h) = cv2.boundingRect(cnt)
    # manipulate these values to change accuracy
    if w > cols/2 and h < 10:
        cv2.drawContours(mask, [cnt], -1, 255,-1)
mask = cv2.morphologyEx(mask, cv2.MORPH_DILATE, kernel, iterations = 2)
image[mask == 255] = (255,255,255)
So I have found a solution by using part of Juke's suggestion. Eventually I would need to continue to process the image using a binary mode so figured I might keep it that way.
First, combine the resulting images of vertical and horizontal. This will give you an image containing both the horizontal and vertical lines. Both images are of type uint8 (unsigned 8-bit integer); a saturating cv2.add keeps the crossings at 255, whereas NumPy's + would wrap them around to 254:
res = cv2.add(vertical, horizontal)
Then, subtract res from thresh, the thresholded input image that was used to find the lines. This removes the white lines, and the result can then be used for further morphological transformations.
fin = thresh - res
I am attempting to pull text from a few hundred JPGs that contain information on capital punishment records; the JPGs are hosted by the Texas Department of Criminal Justice (TDCJ). Below is an example snippet with personally identifiable information removed.
I've identified the underlines as being the impediment to proper OCR--if I go in, screenshot a sub-snippet and manually white-out lines, the resulting OCR through pytesseract is very good. But with underlines present, it's extremely poor.
How can I best remove these horizontal lines? What I have tried:
Started on OpenCV doc's walkthrough: Extract horizontal and vertical lines by using morphological operations. Got stuck pretty quickly, because I know zero C++.
Followed along with Removing Horizontal Lines in image - ended up with an illegible string.
Followed along with Removing long horizontal/vertical lines from edge image using OpenCV - wasn't able to get the intuition behind sizing the array of zeros here.
Tagging this question with c++ in the hope that someone could help to translate Step 5 of the docs walkthrough to Python. I've tried a batch of transformations such as the Hough Line Transform, but I am feeling around in the dark within a library and area I have zero prior experience with.
import cv2
# Inverted grayscale
img = cv2.imread('rsnippet.jpg', cv2.IMREAD_GRAYSCALE)
img = cv2.bitwise_not(img)
# Transform inverted grayscale to binary
th = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
cv2.THRESH_BINARY, 15, -2)
# An alternative; Not sure if `th` or `th2` is optimal here
th2 = cv2.threshold(img, 170, 255, cv2.THRESH_BINARY)[1]
# Create corresponding structure element for horizontal lines.
# Start by cloning th/th2.
horiz = th.copy()
r, c = horiz.shape
# Lost after here - not understanding intuition behind sizing/partitioning
All the answers so far seem to be using morphological operations. Here's something a bit different. This should give fairly good results if the lines are horizontal.
For this I use a part of your sample image shown below.
Load the image, convert it to gray scale and invert it.
import cv2
import numpy as np
import matplotlib.pyplot as plt
im = cv2.imread('sample.jpg')
gray = 255 - cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
Inverted gray-scale image:
If you scan a row in this inverted image, you'll see that its profile looks different depending on the presence or the absence of a line.
plt.figure(1)
plt.plot(gray[18, :] > 16, 'g-')
plt.axis([0, gray.shape[1], 0, 1.1])
plt.figure(2)
plt.plot(gray[36, :] > 16, 'r-')
plt.axis([0, gray.shape[1], 0, 1.1])
Profile in green is a row where there's no underline, red is for a row with underline. If you take the average of each profile, you'll see that red one has a higher average.
So, using this approach you can detect the underlines and remove them.
for row in range(gray.shape[0]):
    avg = np.average(gray[row, :] > 16)
    if avg > 0.9:
        cv2.line(im, (0, row), (gray.shape[1]-1, row), (0, 0, 255))
        cv2.line(gray, (0, row), (gray.shape[1]-1, row), (0, 0, 0), 1)
cv2.imshow("gray", 255 - gray)
cv2.imshow("im", im)
Here are the detected underlines in red, and the cleaned image.
tesseract output of the cleaned image:
Convthed as th(
shot once in the
she stepped fr<
brother-in-lawii
collect on life in
applied for man
to the scheme i|
The reason for using only part of the image should be clear by now. Since personally identifiable information has been removed in the original image, the threshold wouldn't have worked there. But this should not be a problem when you apply it to the real images. Sometimes you may have to adjust the thresholds (16, 0.9).
The result does not look very good with parts of the letters removed and some of the faint lines still remaining. Will update if I can improve it a bit more.
UPDATE:
Did some improvements: clean up, and link the missing parts of the letters. I've commented the code, so I believe the process is clear. You can also check the resulting intermediate images to see how it works. Results are a bit better.
tesseract output of the cleaned image:
Convicted as th(
shot once in the
she stepped fr<
brother-in-law. ‘
collect on life ix
applied for man
to the scheme i|
tesseract output of the cleaned image:
)r-hire of 29-year-old .
revolver in the garage ‘
red that the victim‘s h
{2000 to kill her. mum
250.000. Before the kil
If$| 50.000 each on bin
to police.
python code:
import cv2
import numpy as np
import matplotlib.pyplot as plt
im = cv2.imread('sample2.jpg')
gray = 255 - cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
# prepare a mask using Otsu threshold, then copy from original. this removes some noise
__, bw = cv2.threshold(cv2.dilate(gray, None), 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
gray = cv2.bitwise_and(gray, bw)
# make copy of the low-noise underlined image
grayu = gray.copy()
imcpy = im.copy()
# scan each row and remove lines
for row in range(gray.shape[0]):
    avg = np.average(gray[row, :] > 16)
    if avg > 0.9:
        cv2.line(im, (0, row), (gray.shape[1]-1, row), (0, 0, 255))
        cv2.line(gray, (0, row), (gray.shape[1]-1, row), (0, 0, 0), 1)
cont = gray.copy()
graycpy = gray.copy()
# after contour processing, the residual will contain small contours
residual = gray.copy()
# find contours
# [-2:] keeps (contours, hierarchy) in both OpenCV 3 (three return values) and OpenCV 4 (two)
contours, hierarchy = cv2.findContours(cont, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)[-2:]
for i in range(len(contours)):
    # find the boundingbox of the contour
    x, y, w, h = cv2.boundingRect(contours[i])
    if 10 < h:
        cv2.drawContours(im, contours, i, (0, 255, 0), -1)
        # if boundingbox height is higher than threshold, remove the contour from residual image
        cv2.drawContours(residual, contours, i, (0, 0, 0), -1)
    else:
        cv2.drawContours(im, contours, i, (255, 0, 0), -1)
        # if boundingbox height is less than or equal to threshold, remove the contour from the gray image
        cv2.drawContours(gray, contours, i, (0, 0, 0), -1)
# now the residual only contains small contours. open it to remove thin lines
st = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
residual = cv2.morphologyEx(residual, cv2.MORPH_OPEN, st, iterations=1)
# prepare a mask for residual components
__, residual = cv2.threshold(residual, 0, 255, cv2.THRESH_BINARY)
cv2.imshow("gray", gray)
cv2.imshow("residual", residual)
# combine the residuals. we still need to link the residuals
combined = cv2.bitwise_or(cv2.bitwise_and(graycpy, residual), gray)
# link the residuals
st = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (1, 7))
linked = cv2.morphologyEx(combined, cv2.MORPH_CLOSE, st, iterations=1)
cv2.imshow("linked", linked)
# prepare a mask from linked image
__, mask = cv2.threshold(linked, 0, 255, cv2.THRESH_BINARY)
# copy region from low-noise underlined image
clean = 255 - cv2.bitwise_and(grayu, mask)
cv2.imshow("clean", clean)
cv2.imshow("im", im)
One can try this.
import cv2
import numpy as np
import matplotlib.pyplot as plt

img = cv2.imread('img_provided_by_op.jpg', 0)
img = cv2.bitwise_not(img)
# (1) clean up noises
kernel_clean = np.ones((2,2),np.uint8)
cleaned = cv2.erode(img, kernel_clean, iterations=1)
# (2) Extract lines
kernel_line = np.ones((1, 5), np.uint8)
clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)
clean_lines = cv2.dilate(clean_lines, kernel_line, iterations=6)
# (3) Subtract lines
cleaned_img_without_lines = cleaned - clean_lines
cleaned_img_without_lines = cv2.bitwise_not(cleaned_img_without_lines)
plt.imshow(cleaned_img_without_lines)
plt.show()
cv2.imwrite('img_wanted.jpg', cleaned_img_without_lines)
Demo
The method is based on the answer by Zaw Lin. He/she identified lines in the image and just did subtraction to get rid of them. However, we cannot just subtract lines here because we have letters e, t, E, T, - containing lines as well! If we just subtract horizontal lines from the image, e will be nearly identical to c. - will be gone...
Q: How do we find lines?
To find lines, we can make use of erode function. To make use of erode, we need to define a kernel. (You can think of a kernel as a window/shape that functions operate on.)
The kernel slides through the image (as in 2D convolution). A pixel in the original image (either 1 or 0) will be considered 1 only if all the pixels under the kernel is 1, otherwise it is eroded (made to zero). -- (Source).
To extract lines, we define a kernel, kernel_line as np.ones((1, 5)), [1, 1, 1, 1, 1]. This kernel will slide through the image and erode pixels that have 0 under the kernel.
More specifically, while the kernel is applied to one pixel, it will capture the two pixels to its left and two to its right.
[X X Y X X]
     ^
     |
Applied to Y, `kernel_line` captures Y's neighbors. If any of them is 0,
Y will be set to 0.
Horizontal lines will be preserved under this kernel while pixels that don't have horizontal neighbors will disappear. This is how we capture lines with the following line.
clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)
Q: How do we avoid extracting lines within e, E, t, T, and -?
We will combine erosion and dilation with the iterations parameter.
clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)
You might have noticed the iterations=6 part. The effect of this parameter will make the flat part in e, E, t, T, - disappear. This is because while we apply the same operation multiple times, the boundary part of these lines would be shrinking. (Applying the same kernel, only the boundary part will meet 0s and become 0 as the result.) We use this trick to make the lines in these characters disappear.
This, however, comes with a side effect that the long underline part that we want to get rid of also shrinks. We can grow it with dilate!
clean_lines = cv2.dilate(clean_lines, kernel_line, iterations=6)
Contrary to erosion, which shrinks an image, dilation makes the image larger. While we still have the same kernel, kernel_line, if any part under the kernel is 1, the target pixel will be 1. Applying this, the boundary will grow back. (The part in e, E, t, T, - won't grow back if we pick the parameter carefully such that it disappears at the erosion part.)
With this additional trick, we can successfully get rid of the lines without hurting e, E, t, T, and -.
Most of the lines to be detected in your source are long horizontal lines, so this is similar to another answer of mine: Find single color, horizontal spaces in image
This is the source image:
Here are my two main steps to remove the long horizontal line:
Do morph-close with long line kernel on the gray image
kernel = np.ones((1,40), np.uint8)
morphed = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)
This gives a morphed image that contains the long lines:
Invert the morphed image, and add to the source image:
dst = cv2.add(gray, (255-morphed))
This gives an image with the long lines removed:
Simple enough, right? Some small line segments remain, but I think they have little effect on OCR. Note that almost all characters keep their original shape; only g, j, p, q, y and Q may look slightly different. Modern OCR tools such as Tesseract (with LSTM technology) are able to deal with such minor confusion.
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
Complete code, which saves the line-removed image as line_removed.png:
#!/usr/bin/python3
# 2018.01.21 16:33:42 CST
import cv2
import numpy as np
## Read
img = cv2.imread("img04.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
## (1) Create long line kernel, and do morph-close-op
kernel = np.ones((1,40), np.uint8)
morphed = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)
cv2.imwrite("line_detected.png", morphed)
## (2) Invert the morphed image, and add to the source image:
dst = cv2.add(gray, (255-morphed))
cv2.imwrite("line_removed.png", dst)
Update # 2018.01.23 13:15:15 CST:
Tesseract is a powerful OCR tool. Today I installed tesseract-4.0 and pytesseract, then ran OCR with pytesseract on my result, line_removed.png.
import cv2
import pytesseract
img = cv2.imread("line_removed.png")
print(pytesseract.image_to_string(img, lang="eng"))
This is the result, which looks fine to me.
Convicted as the triggerman in the murder—for—hire of 29—year—old .
shot once in the head with a 357 Magnum revolver in the garage of her home at ..
she stepped from her car. Police discovered that the victim‘s husband,
brother—in—law, _ ______ paid _ $2,000 to kill her, apparently so .. _
collect on life insurance policies totaling $250,000. Before the killing, .
applied for additional life insurance policies of $150,000 each on himself and his wife
to the scheme in three different statements to police.
was
and
could
had also
. confessed
A few suggestions:
Given that you're starting with a JPEG, don't compound the loss. Save your intermediate files as PNGs. Tesseract copes with those just fine.
Scale the image 2x (using cv2.resize) before handing it to Tesseract.
Try detecting and removing the black underline. (This question might help). Doing that while preserving descenders might be tricky.
Explore Tesseract command-line options, of which there are many (and they're horribly documented, some requiring dives into C++ source to try to understand them). It's looking like ligatures are causing some grief. IIRC (it's been a while), there's a setting or two that might help.