How to read this file using tesseract (Python)? - python

How to read this png file using tesseract?


Decoding a Captcha isn't an easy job, but maybe this example will be helpful.
To recognize text from a Captcha reliably you will usually need more preprocessing magic with cv2.
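import base64
import pytesseract
from PIL import Image
bfile = 'blablabla'  # your base64-encoded image string (placeholder)
ffile = base64.b64decode(bfile)  # str.decode('base64') only exists on Python 2
# Write the decoded bytes to disk, then open the file with Pillow
with open('capcha.png', 'wb') as pl:
    pl.write(ffile)
img = Image.open('capcha.png')
text = pytesseract.image_to_string(img)
print(text)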

Related

Python 3 - How can I open an image in pillow that I already opened using open('image','r')

How can I open an image in pillow that I already opened using open('image','r')
I have an image that I opened using the open() function, but I want to use the image in Pillow.
Actually, I encoded it using base64; the program then decodes it and gives me a variable in the same format that open() returns. I just want to show the image, so if there is a way to show it without saving it, please let me know.
Here is the code that I use to decode it, just so you know:
import base64
image_64_encode = 'this-string-is-big-so'
image_64_decode = base64.decodebytes(image_64_encode.encode('ascii'))  # decodebytes needs bytes, not str
I just want to show the image.
Like this:
from base64 import b64decode
from PIL import Image
import io
# Load useful-looking base64 string of a PNG
b64 = 'iVBORw0KGgoAAAANSUhEUgAAAEAAAABAEAIAAAB1mzrKAAAABGdBTUEAALGPC/xhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAAABmJLR0T///////8JWPfcAAAAB3RJTUUH5QsDDgY2fEahewAAANhJREFUeNrt3LENg0AQRcFFOue4/yIhN9K5jAl4UwHS02flAB977z1h1sxzr1M/xnsVAFszz7W++jHeqwVgBcAKgHUDsBaAFQArANYNwFoAVgCsAFg3AFszv/tz6sd4r15BWAGwbgDWArACYAXACoB1hLEWgBUAKwDWDcBaAFYArADYmvldn24A0wKwAmAFwPodgLUArABYAbBuANYCsAJgBcAKgHWEsRaAFQDr+wCsG4D1CsIKgBUA6wZgLQArAFYArBuAtQCsAFgBsG4A1gKwY++Z/sDe+QP/bITzNFwkpgAAACV0RVh0ZGF0ZTpjcmVhdGUAMjAyMS0xMS0wM1QxNDowNjo1NCswMDowMGz+3A4AAAAldEVYdGRhdGU6bW9kaWZ5ADIwMjEtMTEtMDNUMTQ6MDY6NTQrMDA6MDAdo2SyAAAAAElFTkSuQmCC'
# Open it with PIL - no disk access required
im = Image.open(io.BytesIO(b64decode(b64)))
print(im)
# prints: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=64x64 at 0x7FF3B90251C0>
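The clue is in the Pillow documentation for Image.open, which describes the fp parameter as:
fp – A filename (string), pathlib.Path object or a file object.
io.BytesIO wraps the decoded bytes in exactly such a file object, so nothing has to be written to disk.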
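To make the round trip easy to try without a long base64 literal, here is a small sketch that builds a PNG in memory, encodes it, and opens it again through io.BytesIO. The helper name image_from_base64 is only illustrative, and the red 64x32 test image stands in for your real payload.

```python
import base64
import io

from PIL import Image


def image_from_base64(b64_string):
    """Decode a base64 string and open it as a Pillow image, entirely in memory."""
    raw = base64.b64decode(b64_string)
    image = Image.open(io.BytesIO(raw))
    image.load()  # decode the pixel data now instead of lazily
    return image


# Build a small PNG in memory to stand in for the real base64 payload.
buffer = io.BytesIO()
Image.new("RGB", (64, 32), color=(255, 0, 0)).save(buffer, format="PNG")
sample_b64 = base64.b64encode(buffer.getvalue()).decode("ascii")

im = image_from_base64(sample_b64)
print(im.format, im.mode, im.size)
```

Calling im.show() afterwards opens the image in the system viewer. Note that Pillow itself writes a temporary file for the viewer, so "no disk access" applies to your own code only.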

How to convert image from a url to text using pytesseract without saving in memory

I have a URL that points to an image which contains only text. I want to extract the text from the image. I was able to find a solution using pytesseract. However, I need to save the image to local storage and then pass the saved file to the function to get the text. Is there any way to do this in memory?
import requests
from PIL import Image
from pytesseract import image_to_string
img_resp = requests.get(img_url)
with open('test.png', 'wb') as img:
    img.write(img_resp.content)
print(image_to_string(Image.open('test.png')))
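You can wrap the response content in io.BytesIO and pass it to Image.open like this:
import io
import requests
import pytesseract
from PIL import Image
img_resp = requests.get(img_url)
img = Image.open(io.BytesIO(img_resp.content))  # file-like object, nothing written to disk
print(pytesseract.image_to_string(img))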

cv2 to tesseract directly without saving

import pytesseract
from pdf2image import convert_from_path, convert_from_bytes
import cv2, numpy
def pil_to_cv2(image):
    open_cv_image = numpy.array(image)
    return open_cv_image[:, :, ::-1].copy()  # RGB -> BGR
path = 'OriginalsFile.pdf'
images = convert_from_path(path)
cv_h = [pil_to_cv2(i) for i in images]
img_header = cv_h[0][:160, :]
# print(pytesseract.image_to_string(Image.open('test.png')))  # I only found this in the tesseract docs
Hello, is there a way to read img_header directly with pytesseract without saving it to disk?
pytesseract docs
pytesseract.image_to_string() input format
As the documentation explains, pytesseract.image_to_string() takes a PIL image as input (recent pytesseract releases also accept a NumPy array, which they treat as RGB).
So you can easily convert your OpenCV image into a PIL one, like this:
from PIL import Image
# ... (your code)
print(pytesseract.image_to_string(Image.fromarray(img_header)))
If you really don't want to use PIL, see:
https://github.com/madmaze/pytesseract/blob/master/src/pytesseract.py
pytesseract is a thin wrapper that runs the tesseract command. If you look at the run_and_get_output() function, you'll see that it saves your image into a temporary file and then passes that file's path to tesseract.
Hence you can do the same with OpenCV: rewrite the single pytesseract .py file so that it writes the temporary file with OpenCV, although I don't see any performance improvement in doing so.
Image.fromarray lets you hand the array to tesseract as a PIL image without saving it to disk, but you should also make sure that you don't send a list of PIL images to tesseract. convert_from_path returns a list of PIL images, one per page, so for a multi-page PDF you need to send each page to tesseract individually.
import pytesseract
from pdf2image import convert_from_path
from PIL import Image
import cv2, numpy
def pil_to_cv2(image):
    open_cv_image = numpy.array(image)
    return open_cv_image[:, :, ::-1].copy()
path = 'OriginalsFile.pdf'  # same file as in the question
doc = convert_from_path(path)
for page_number, page_data in enumerate(doc):
    cv_h = pil_to_cv2(page_data)
    img_header = cv_h[:160, :]
    print(f"{page_number} - {pytesseract.image_to_string(Image.fromarray(img_header))}")
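One detail worth keeping in mind: pil_to_cv2 flips the channels to BGR, and Image.fromarray assumes RGB, so the image handed to tesseract has red and blue swapped. That rarely matters for OCR, but here is a small sketch that crops the header and restores RGB order first. The helper name bgr_header_to_pil is made up, and the default of 160 rows simply mirrors the slice used in the question.

```python
import numpy as np
from PIL import Image


def bgr_header_to_pil(image_bgr, header_rows=160):
    """Crop the top rows of an OpenCV-style BGR array and return an RGB Pillow image."""
    header = image_bgr[:header_rows, :]
    # Reverse the channel axis (BGR -> RGB) and make a compact copy.
    rgb = np.ascontiguousarray(header[:, :, ::-1])
    return Image.fromarray(rgb)


# Synthetic "page": 200 rows, pure blue when read in BGR order.
page_bgr = np.zeros((200, 300, 3), dtype=np.uint8)
page_bgr[:, :, 0] = 255
header = bgr_header_to_pil(page_bgr)
print(header.size, header.getpixel((0, 0)))
```

The returned image can go straight into pytesseract.image_to_string(header).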

Unable To Convert PDF to Text format

I am getting the error below while parsing a PDF file with PyPDF2.
I have attached the PDF to be parsed along with the error.
Can anyone help?
import PyPDF2
def convert(data):
    pdfName = data
    read_pdf = PyPDF2.PdfFileReader(pdfName)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content)
    return (page_content)
error:
PyPDF2.utils.PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.
There are open-source tools for this, such as Tesseract (an OCR engine), often combined with OpenCV for image preprocessing.
If you want to use Tesseract, there is a Python wrapper library called pytesseract.
Most OCR tools work on images, so you first have to convert your PDF into an image format such as PNG or JPG. After that you can load the image and process it with pytesseract.
Here is some sample code showing how to use pytesseract. Suppose you have already converted your PDF to an image with the filename pdfName.png:
from PIL import Image
import pytesseract
def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    # Use Pillow's Image class to open the image and pytesseract to detect the string in it
    text = pytesseract.image_to_string(Image.open(filename))
    return text
print(ocr_core('pdfName.png'))
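The conversion step itself is not shown in this answer. As a rough sketch, assuming the pdf2image package (which needs Poppler installed) and pytesseract (which needs the Tesseract binary), the two steps can be combined so that no intermediate PNG is written. The function name ocr_pdf_pages and the lang parameter default are illustrative choices, not part of the original answer.

```python
def ocr_pdf_pages(pdf_path, lang="eng"):
    """Return a list with the OCR text of each page of a PDF (hypothetical helper).

    Assumes pdf2image (plus Poppler) and pytesseract (plus Tesseract) are installed.
    """
    # Imported here so this module can be loaded even when the OCR stack is missing.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)  # one PIL image per page
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]


# Example call (the file name is a placeholder):
# for number, text in enumerate(ocr_pdf_pages("pdfName.pdf"), start=1):
#     print(number, text)
```

Because convert_from_path returns one image per page, the result is a list with one string per page rather than a single string.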

Extracting text/characters from Xray images using python

I am trying to extract the characters in an X-ray image. I tried pytesseract without success. I also used Canny edge detection to remove noise before extracting, but I still cannot get the text/chars out. Can you please help or guide me to extract them?
Try this tutorial to locate the text:
https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
Then, once you have located the text, you can isolate it and use tesseract to recognize it.
If it's a DICOM file, you could use gdcm to read the attribute. It's available for Python too.
pytesseract should be sufficient if the file is in PNG or JPG form.
Now suppose im is your image. Use the code below.
from PIL import Image
import pytesseract
# Point pytesseract at the Tesseract executable (Windows example; adjust to your install)
pytesseract.pytesseract.tesseract_cmd = r'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
im = Image.open('F:/kush/invert.jpg')
print(pytesseract.image_to_string(im, lang='eng'))
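If plain image_to_string still returns nothing, some preprocessing may help. This is only a sketch under the assumption that the X-ray labels are light text on a dark background, which Tesseract tends to handle worse than dark text on light. It uses Pillow's ImageOps to convert to grayscale, stretch the contrast, invert, and enlarge. The helper name prepare_xray_for_ocr and the scale factor of 2 are illustrative guesses, not tested settings for real X-rays.

```python
from PIL import Image, ImageOps


def prepare_xray_for_ocr(image, scale=2):
    """Return a grayscale, contrast-stretched, inverted and enlarged copy of image."""
    gray = ImageOps.grayscale(image)
    stretched = ImageOps.autocontrast(gray)
    inverted = ImageOps.invert(stretched)  # light-on-dark becomes dark-on-light
    width, height = inverted.size
    return inverted.resize((width * scale, height * scale))


# Synthetic stand-in: dark background with a lighter block where text would be.
sample = Image.new("L", (100, 40), color=50)
sample.paste(200, (60, 10, 90, 30))
prepared = prepare_xray_for_ocr(sample)
print(prepared.mode, prepared.size)
```

The prepared image can then be passed to pytesseract.image_to_string in place of the original.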
