Using Python and Tesseract OCR to solve Captcha

I am not planning to spam, and besides, Google has made this kind of captcha largely obsolete with reCAPTCHA. I am doing this as a project to learn more about OCR and, eventually, maybe neural networks.
So I have an image from a captcha. I have been able to make modest progress, but Tesseract's documentation isn't great. Here is the code I have so far; the results are below it.
from selenium import webdriver
from selenium.webdriver.common import keys
import time
import random
import pytesseract
from pytesseract import image_to_string
from PIL import Image, ImageEnhance, ImageFilter
def ParsePic():
    # Point pytesseract at the local Tesseract install
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'
    im = Image.open("path\\screenshot.png")
    # Emphasize edges and fine detail, then boost the contrast
    im = im.filter(ImageFilter.CONTOUR)
    im = im.filter(ImageFilter.DETAIL)
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(4)
    # Convert to greyscale and save the preprocessed image for OCR
    im = im.convert('L')
    im.save('temp10.png')
    text = image_to_string(Image.open('temp10.png'))
    print(text)
Original Image
Output
I understand that captchas were made specifically to defeat OCR, but I have read that this is no longer the case, and I'm interested in learning how it is done.
My question is: how do I make the background a single color, so that the text becomes easily readable?

Late answer, but anyway...
You are doing edge detection, but there are obviously too many edges in this image, so that will not work.
You will have to do something with the colors instead.
I don't know if this is true for every one of your captchas, but you may be able to get away with just contrast.
You can test this by opening your original in Paint (or any other image editor) and saving the image as "monochrome" (black and white only, NOT greyscale).
result:
without any other editing! (Even the question mark is gone.)
This would be ready for OCR right away.
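If you want to do that same conversion in code rather than in Paint, here is a minimal PIL sketch; the threshold value and file names are my own assumptions, so tune them per captcha:
from PIL import Image

THRESHOLD = 128  # assumed cutoff between background and text; adjust per image

im = Image.open('captcha.png').convert('L')           # greyscale first
bw = im.point(lambda p: 255 if p > THRESHOLD else 0)  # hard black/white threshold
bw = bw.convert('1')                                  # true 1-bit monochrome
bw.save('captcha_bw.png')                             # ready for OCR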
Maybe your other images are not this easy, but color/contrast is the way to go for you. If you need ideas on how to use color, contrast and other tricks to solve captchas, you can take a look at harder examples and how I solved them here: https://github.com/cracker0dks/CaptchaSolver
cheers

Related

Detecting Bangla characters using pytesseract

I am trying to detect Bangla characters in images of Bangla number plates using Python, so I decided to use pytesseract. For this purpose I have used the code below:
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('input.png'),lang="ben")
print(text)
The problem is that when I print the result, the output is empty.
When I tried writing the result out to a text file, I got the same empty output.
Example Picture: (Link)
Expected output (should be something like, or at least close to, the following):
ঢাকা মেট্রো হ
৪৫ ২৩০৭
P.S.: I downloaded the Bengali language data while installing Tesseract-OCR-64, and I am running this in VS Code.
Can anyone help me solve this problem, or give me an idea of how to approach it?
The solution to this problem:
You need to segment out the individual characters (you can take any approach you want, deep learning or classical image processing) and feed PyTesseract one character at a time, using only clear crops.
Reason: Tesseract can detect Bangla in pictures of clear and reasonably high resolution, but it appears to do much worse on small, low-resolution text in this language (which is quite understandable).
Code:
### any deep learning approach or any image processing approach here
# load the segmented character
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
character = pytesseract.image_to_string(Image.open('char.png'),lang="ben")
print(character)
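If each crop really is a single character, it may also help to tell Tesseract so explicitly via page segmentation mode 10 ("treat the image as a single character"), a small variation on the call above:
character = pytesseract.image_to_string(Image.open('char.png'), lang="ben", config="--psm 10")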

How would I filter the results from Tesseract against a dictionary supplied?

I started learning Python a while ago and have just now started learning Tesseract to build a tool for my own use. I have a script that takes four screenshots of specific parts of the screen and then uses Tesseract to pull data from those images. It is mostly accurate and gets the words right almost 100% of the time, but there are still some garbage letters and symbols in the results that I don't want.
Rather than trying to process the image further (if that really is the easiest way, I could do it, but I suspect it would still let through data I don't want), I would like to keep only those words in the result that are in a dictionary I supply.
import cv2
import pytesseract
import pyscreenshot as ImageGrab
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Grab four regions of the screen and save each capture
im = ImageGrab.grab(bbox=(580, 430, 780, 500))
im.save(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab.png')
im2 = ImageGrab.grab(bbox=(770, 430, 960, 500))
im2.save(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab2.png')
im3 = ImageGrab.grab(bbox=(950, 430, 1150, 500))
im3.save(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab3.png')
im4 = ImageGrab.grab(bbox=(1140, 430, 1320, 500))
im4.save(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab4.png')

# Re-open the saved captures and OCR each one
image = Image.open(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab.png')
image2 = Image.open(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab2.png')
image3 = Image.open(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab3.png')
image4 = Image.open(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab4.png')
print(pytesseract.image_to_string(image))
print(pytesseract.image_to_string(image2))
print(pytesseract.image_to_string(image3))
print(pytesseract.image_to_string(image4))
This is what I've written so far. It might not be as clean as possible, but it does what I want for now. When I run the program against my test screenshot, I get the following:
Ballistica Prime Receiver
a
“Ze ij
Titania Prime Blueprint
—
‘|! stradavar Prime
Blueprint
My.
Bronco Prime Barrel
uby-
Here is my screenshot:
It picks up the words perfectly fine, but output like "uby-" and "‘|!" isn't needed, which is why I want to clean the results by keeping only words that are in the dictionary. If there's an easier way to do this I'd love to know; I've only been using Tesseract for a day or so, and other than the aforementioned image processing I don't know of another way to do it.
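One straightforward approach is to split the OCR output into tokens and keep only those found in the supplied word list. A minimal sketch; the vocabulary, the stripped punctuation, and the helper name are my own assumptions:
def keep_known_words(ocr_text, vocabulary):
    # Keep only tokens that, once stray punctuation is stripped, are known words
    words = ocr_text.split()
    return ' '.join(w for w in words if w.strip('.,!|-‘’"').lower() in vocabulary)

# Hypothetical vocabulary; load your real dictionary here instead
allowed = {'ballistica', 'prime', 'receiver', 'titania', 'blueprint'}
print(keep_known_words('Ballistica Prime Receiver uby-', allowed))  # -> Ballistica Prime Receiver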

Python Image Scaling

I'm trying to scale a screenshot using this code:
im = Image.open(img_path)
im = im.resize((newWidth, newHeight), Image.ANTIALIAS)
but this results in a very low quality image; the text in particular is impossible to read.
Original
Click for original Image
Scaled
Click for scaled Image
I have tried other algorithms in PIL, but none of them gives the result I want.
I actually tried resizing my images inside Office PowerPoint, and there the text stays clear and readable.
PowerPoint scaled
Click for Office scaled Image
Are there any other ways I can scale the images?
This worked for me:
import cv2
import imutils

im = cv2.imread(img_path)                # imutils operates on OpenCV (NumPy) images
im = imutils.resize(im, width=newWidth)  # target width in pixels; height is derived to keep the aspect ratio
If you want more details, you can look at https://www.programcreek.com/python/example/93640/imutils.resize

How to parse pixel data of an online image in Python

For a project I need to parse pixel data from a large number of online images. I realised it could well be faster to load the images into program memory with a GET request, carry out the required operations, then move on to the next image, removing the need to read and write them to storage. However, in doing this I have run into several problems; is there a not (overly) complicated way to do this?
Edit: I didn't include code because, as far as I can tell, everything I've seen (scikit-image, Pillow, ImageMagick) is a complete dead end. I'm not looking for somebody to write code for me, just a pointer in the right direction.
It's easy to load an image directly from a URL:
import io
import urllib.request
from PIL import Image

url = "https://cdn.pixabay.com/photo/2013/07/12/12/58/tv-test-pattern-146649_1280.png"
# Download into memory and open from a seekable buffer; nothing touches disk
with urllib.request.urlopen(url) as resp:
    img = Image.open(io.BytesIO(resp.read()))
The image is now loaded, entirely in memory.
Getting the pixels is also easy: Get pixel's RGB using PIL
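From there, a couple of the standard PIL pixel accessors (the coordinates are just examples):
px = img.convert('RGB').load()  # pixel-access object on an RGB copy of the image
r, g, b = px[0, 0]              # RGB of the top-left pixel
print(img.getpixel((10, 10)))   # or query a single pixel directly (value depends on the image mode)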

Python PIL to extract number from image

I have an image like this one:
and I would like to have the number in black on a white background so that I can use OCR to recognise it. How could I achieve that in Python?
Many thanks,
John.
You don't need to manipulate the image for OCR. For example, you could just use pytesser:
from PIL import Image
from pytesser import *

im = Image.open('wjNL6.jpg')
text = image_to_string(im)
print(text)
Output:
0
If you just want to turn a white-on-black image to black-on-white, that's trivial; it's just invert:
from PIL import Image, ImageOps
img = Image.open('zero.jpg')
inverted = ImageOps.invert(img)
inverted.save('invzero.png')
If you also want to do some basic processing like increasing the contrast, see the other functions in the ImageOps module, like autocontrast. They're all pretty easy to use, but if you get stuck, you can always ask a new question. For more complex enhancements, look around the rest of PIL: ImageEnhance can be used to sharpen an image, ImageFilter can do edge detection and unsharp masking, and so on. You may also want to convert to greyscale (mode 'L') or even pure black and white (mode '1'); that's all in the Image.convert method.
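As a concrete illustration, here is a minimal sketch chaining a few of those steps; the file names are assumptions:
from PIL import Image, ImageOps

img = Image.open('zero.jpg').convert('L')  # greyscale (mode 'L')
img = ImageOps.autocontrast(img)           # stretch the contrast to the full range
img = img.convert('1')                     # pure black and white (mode '1')
img.save('zero_bw.png')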
Of course you have to know what processing you want to do. One thing you might want to try is playing around with the image in Photoshop or GIMP and keeping track of what operations you do, then looking for how to implement those operations in PIL. (It might be simpler to just use gimp-fu scripting in the first place instead of trying to use PIL…)
