Python - Improving Tesseract OCR to recognize a list of names

I'm working on a project that will recognize the teams in a game of Overwatch and record which players were on which team. It has a predefined list of who is playing; it only needs to work out which of the two team images each player appears in. So far I have had success capturing the images for each team and getting a rough output for each player's name, but the OCR is getting several letters confused.
My input images:
And the output I get from OCR:
W THEMIGHTVMRT
ERSVZENVRTTR
ERSVLUCID
ERSVZRRVR
ERSVMEI
EFISVSDMBRR
ERSV RNR
ERSVZENVRTTR
EFISVZHRVR
ERSVMCCREE
ERSVMEI
EHSVRDRDHDG
From this, you can see that the OCR confuses "A" with "R" and "Y" with "V". I was able to get the font file that Overwatch uses and generate a .traineddata file using Train Your Tesseract - I'm aware that there is probably a better way of generating this file, though I'm not sure how.
My code:
from pytesseract import *
import pyscreenshot
pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
team1 = pyscreenshot.grab(bbox=(50,450,530,810)) # X1, Y1, X2, Y2
team1.save("team1screenshot.png")
team1text = pytesseract.image_to_string(team1, config=tessdata_dir_config, lang='owf')
team2 = pyscreenshot.grab(bbox=(800,450,1280,810)) # X1, Y1, X2, Y2
team2.save("team2screenshot.png")
team2text = pytesseract.image_to_string(team2, config=tessdata_dir_config, lang='owf')
print(team1text)
print("------------------")
print(team2text)
How should I improve the recognition of these characters? Do I need a better .traineddata file, or is this a matter of better image processing?
Thanks for any help!

As @FlorianBrucker mentioned, running a similarity test on the strings makes it possible (with some fine tuning) to find the correct name after the OCR step.
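A minimal sketch of that idea using the standard-library difflib module; the roster below is a hypothetical placeholder for the predefined player list, and the OCR lines are taken from the output above:
import difflib

# Hypothetical roster; replace with the real predefined list of players.
known_players = ["THEMIGHTYMRT", "ZENYATTA", "LUCIO", "ZARYA", "MEI", "SOMBRA", "ANA", "MCCREE", "ROADHOG"]

def closest_player(ocr_line, players):
    # Return the roster entry with the highest similarity ratio to the OCR line.
    return max(players, key=lambda name: difflib.SequenceMatcher(None, ocr_line.upper(), name).ratio())

for line in ["ERSVZENVRTTR", "ERSVLUCID", "ERSVMEI"]:
    print(line, "->", closest_player(line, known_players))
The fine tuning would come from things like stripping the recurring garbage prefix before matching, or rejecting matches whose ratio falls below a threshold.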

You could try a custom OCR config that does a sparse text search ("Find as much text as possible in no particular order") by setting the page segmentation mode (psm) to 11 in the Tesseract config.
See if you can do this:
tessdata_dir_config = "--oem 3 --psm 11"
To see a complete list of supported page segmentation modes (psm), use tesseract -h. Here's the list as of 3.21:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
I'm using the pytesseract Python wrapper for Tesseract (https://github.com/madmaze/pytesseract). With it, Tesseract can be configured like this:
custom_oem_psm_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(image, config=custom_oem_psm_config)

Related

How can I change the image layout options in python-docx?

I need to create a document that follows the same pattern as another one, but I need to do this with Python. In that other document the image and its text are aligned side by side, but when I try to insert my image with the following code
from docx import Document

doc = Document()          # or an existing document opened with Document("other.docx")
p = doc.add_paragraph()
r = p.add_run()
r.add_picture(myPath)     # myPath: path to the image file
r.add_text(myText)        # myText: the text that should sit next to the image
the image stays aligned only with the first line of the text, like in this image: aligned just with the first line. I see that if I open the file in Word and change the layout options to "With Text Wrapping" (the second option), everything works exactly as I want. But how can I change these layout options using Python?
There is no API support for floating images in python-docx, which seems to be what you are asking about. Run.add_picture() adds a so-called "inline" picture (shape) which is treated like a single glyph (character in the run). So the line height grows to the height of the image and only that single line can abut the image.
One alternative would be to use a table with two cells, place the image in one cell and the text in the other.
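As a rough sketch of that table workaround with python-docx (myPath and myText are the same placeholders as in the question):
from docx import Document

doc = Document()
table = doc.add_table(rows=1, cols=2)
left, right = table.cell(0, 0), table.cell(0, 1)
left.paragraphs[0].add_run().add_picture(myPath)   # image in the left cell
right.paragraphs[0].add_run(myText)                # text in the right cell, beside the image
doc.save("output.docx")
With no table style applied the borders are not drawn, so the result reads as an image with text sitting next to it.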

Tesseract OCR having trouble detecting numbers

I am trying to detect some numbers with tesseract in python. Below you will find my starting image and what I can get it down to. Here is the code I used to get it there.
import pytesseract
import cv2
import numpy as np
pytesseract.pytesseract.tesseract_cmd = "C:\\Users\\choll\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.exe"
image = cv2.imread(r'64normalwart.png')
lower = np.array([254, 254, 254])
upper = np.array([255, 255, 255])
image = cv2.inRange(image, lower, upper)
image = cv2.bitwise_not(image)
#Uses a language that should work with minecraft text, I have tried with and without, no luck
text = pytesseract.image_to_string(image, lang='mc')
print(text)
cv2.imwrite("Wartthreshnew.jpg", image)
cv2.imshow("Image", image)
cv2.waitKey(0)
I end up with black numbers on a white background, which seems pretty good, but Tesseract still cannot detect the numbers. I also noticed the numbers are pretty jagged, but I don't know how to fix that. Does anyone have recommendations for how I could get Tesseract to recognize these numbers?
Starting Image
What I end up with
Your problem is with the page segmentation mode. Tesseract segments every image in a different way. When you don't choose an appropriate PSM, it goes for mode 3, which is automatic and might not be suitable for your case. I've just tried your image and it works perfectly with PSM 6.
df = pytesseract.image_to_string(np.array(image),lang='eng', config='--psm 6')
These are all the PSMs available at this moment:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Use pytesseract.image_to_string(img, config='--psm 8') or try different configs to see if the image gets recognized. There is a useful link here: Pytesseract OCR multiple config options
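If it helps, a quick way to compare modes is to loop over a few candidates on the same preprocessed image (the image variable from the question's code) and eyeball the output:
# Sketch: try several page segmentation modes on the same image and compare the results.
for psm in (6, 7, 8, 11, 13):
    config = f'--oem 3 --psm {psm}'
    print(psm, repr(pytesseract.image_to_string(image, config=config)))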

EasyOCR not recognizing simple numbers

I am trying to analyze a page footer in a video and retrieve the current page number. I got the frame collection working, but I am struggling to read the page number itself using EasyOCR.
I already tried pytesseract, but that doesn't work well. It misinterprets numbers: 10 gets recognized as 113, 6 as 41, and so on. Overall it's very inconsistent, even though I format my input image correctly with grayscale, thresholding and cropping (only analyzing the page-number area of the footer).
Here is the code:
import cv2
import easyocr

# Assumed initialization; the question's code references a pre-built reader.
reader = easyocr.Reader(['en'])

def getPageNumberTest(path, psm):
    image = cv2.imread(path)
    height = len(image)
    width = len(image[0])
    # the height of the footer
    footerHeight = 90  # int(height / 15.5)
    # retrieve only the footer from the image
    cropped = image[height-footerHeight:height, 0:width]
    results = reader.readtext(cropped)
    return results
Which gives me the following output:
Is there a setting I am missing? Is there a way to instruct EasyOCR to look for numbers only?
Any help or hint is appreciated!
EDIT:
After some fiddling around with optimizations of the number images, I am now back to the beginning, not optimizing the images at all. All that's left is the conversion to grayscale and a resize.
This is what a normal input looks like:
But the results are:
Which is weird, because for most numbers (especially for single digits) this works flawlessly, yielding certainties of over 95%...
I tried deblurring, thresholding, denoising with cv2.filter2D(), blurring, ...
When I use thresholding, for example, my output looks like this (ignoring the "1"; the same applies to the single digit "1"):
I had a look into pattern matching, which isn't an option because I don't know the page number's shape beforehand...
txt = pytesseract.image_to_string(final_image, config='--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789')
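For context, that whitelist could be slotted into the question's cropping code roughly like this; EasyOCR's readtext also accepts an allowlist argument that restricts the character set, though treat that parameter name as an assumption to check against the EasyOCR docs:
# Sketch: restrict recognition to digits; 'cropped' is the footer crop from the question's code.
gray = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)
txt = pytesseract.image_to_string(gray, config='--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789')
digits = reader.readtext(cropped, allowlist='0123456789')  # EasyOCR equivalent (assumed parameter name)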
According to my tests, PaddleOCR works better than EasyOCR in most scenes.
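If you want to try that, a minimal PaddleOCR sketch might look like the following; it assumes the paddleocr package (pip install paddleocr paddlepaddle) and the classic PaddleOCR/ocr API, so check the current docs before relying on it:
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang='en', use_angle_cls=True)   # loads detection + recognition models
result = ocr.ocr('footer_crop.png', cls=True)    # hypothetical image path

# result holds, per image, a list of (box, (text, confidence)) entries
for box, (text, confidence) in result[0]:
    print(text, confidence)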

Why doesn't reading text from an image using pytesseract work?

Here is my code:
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'F:\Installations\tesseract'
print(pytesseract.image_to_string('images/meme1.png', lang='eng'))
And here is the image:
And the output is as follows:
GP.
ed <a
= va
ay Roce Thee .
‘ , Pe ship
RCAC Tm alesy-3
Pein Reg a
years —
? >
ee bs
I see the word "years" in the output, so it does recognize some of the text, but why doesn't it recognize it fully?
OCR is still a very hard problem in cluttered scenes. You probably won't get better results without doing some preprocessing on the image. In this specific case it makes sense to threshold the image first, to only extract the white regions (i.e. the text). You can look into opencv for this: https://docs.opencv.org/3.4/d7/d4d/tutorial_py_thresholding.html
Additionally, in your image, there are only two lines of text in arbitrary positions, so it might make sense to play around with page segmentation modes: https://github.com/tesseract-ocr/tesseract/issues/434
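A minimal preprocessing sketch along those lines, assuming the text is the brightest part of the meme and using OpenCV's global threshold before handing the result to Tesseract (the 200 cutoff is a guess to tune):
import cv2
import pytesseract

img = cv2.imread('images/meme1.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Near-white pixels (the text) become black on a white background, which Tesseract tends to prefer.
_, thresh = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
print(pytesseract.image_to_string(thresh, config='--psm 11'))  # sparse text, no particular order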

How to extract images that appear in 2 or more copies in a .docx file

I have a .docx with a mix of text and images (some duplicates and some not). I want the script to ultimately return only the images that appear at least twice in the word document (i.e. images that appear once can be discarded).
I've tried manual extraction using Microsoft Word itself and docx2txt (shown below); both extract all images from the Word document, but they automatically deduplicate them (i.e. only one copy of each unique image ends up in the new folder), which runs counter to what I'm aiming for. Is there any way to solve this, or is there a different approach that works better?
import docx2txt
text = docx2txt.process(r"C:\Users\name\Documents\document_with_image.docx", r'C:\Users\name\Documents\folder_of_choice')
Thanks so much!
You can convert the Word document to LaTeX and then check for duplicate images from there. You will need to install the pypandoc library using the command:
pip install pypandoc
import pypandoc

output = pypandoc.convert_file(path, 'latex')
When converted to LaTeX, each image will be preceded by '\includegraphics'.
Then extract the images themselves with docx2txt:
import docx2txt

tempvar = docx2txt.process(path, impath)
Now you can scan the directory "impath" where the images have been put to generate a list of the names of the images.
import os

image_li = {}
for entry in os.scandir(impath):
    if entry.path.endswith(".png"):
        templ = entry.path.split("\\")
        templ2 = templ[-1]
        templ3 = templ2.split(".")
        im_name = str(templ3[0])
        image_li[im_name] = 0
Now you have a dictionary whose keys are the image names and whose values are all 0.
Next, iterate through the LaTeX output and increment an image's count by 1 every time its name appears.
After parsing the whole LaTeX output, delete every image file whose count is less than 2.
output_list = output.split(r'\includegraphics')
for out in output_list:
    for imn in image_li.keys():
        if imn in out:
            image_li[imn] += 1

for entry in os.scandir(impath):
    if entry.path.endswith(".png"):
        templ = entry.path.split("\\")
        templ2 = templ[-1]
        templ3 = templ2.split(".")
        im_name = str(templ3[0])
        if image_li[im_name] <= 1:
            os.remove(entry.path)
