OCR in python on Google Colab, identify information in jpg - python

I'm trying to use an OCR function to identify the information in a .jpg (referring this link) and system shows:
SyntaxError: invalid character in identifier
!sudo apt install tesseract-ocr
!pip install pytesseract
import pytesseract
import shutil
import os
import random
try:
from PIL import Image
except ImportError:
import Image
from google.colab import files
uploaded = files.upload()
image_path_in_colab=‘Test.jpeg’
extractedInformation = pytesseract.image_to_string(Image.open('image_path_in_colab'), lang='eng')
print(extractedInformation)
I expect to export information in the .jpg

Related

why google colab cannot identify image file using while using tesseract?

!apt install tesseract-ocr
!pip install pytesserac
import pytesseract
from PIL import Image
# Open image file
image = Image.open("/content/book.PNG")
# Recognize text
text = pytesseract.image_to_string(image)
print(text)
I wrote this code in google colab and getting an error "cannot identify image file '/content/book.PNG'"
I am expecting the code to run.

Problem with python script that takes image to text

When running this script and trying to paste it, it doesn't work.
import pytesseract
import pyperclip
import PIL.Image
# Open the image file
image = PIL.Image.open('generate_word_picture.png')
# Extract the text from the image using pytesseract
text = pytesseract.image_to_string(image)
# Copy the text to the clipboard
pyperclip.copy(text)
This is the image:
Image
I have these installed:
-pip install pytesseract
-pip install pyperclip
-https://github.com/UB-Mannheim/tesseract/wiki

UnidentifiedImageError: cannot identify image file '59.jpg

!sudo apt install tesseract-ocr
!pip install pytesseract
from google.colab import files
file_uploaded = files.upload()
import pytesseract
import cv2
import os
from PIL import Image
from google.colab.patches import cv2_imshow
from google.colab import drive
from google.colab import files
image=cv2.imread('prac.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
filename = "{}.jpg".format(os.getpid())
cv2.imwrite(filename, gray)
text = pytesseract.image_to_string(Image.open(filename), lang = None)
os.remove(filename)
print(text)
cv2_imshow(image)
In the line "text = ~~", this code cause error "UnidentifiedImageError: cannot identify image file '59.jpg'". But I don't even have image named 59. what's wrong with this?
Your code works. But you are supposed to restart Colab runtime once pytesseract and its dependencies are installed, as you were guided by pip install (in red):
You must restart the runtime in order to use newly installed versions. ,
followed by RESTART RUNTIME button.
After restarting runtime and running that cell again you should get expected result. Though you'd probably want to re-organize your Notebook a bit and split that code (at least) into 3 different cells:
installing dependencies
importing modules
your image upload & OCR code
"image named 59" is created by image=cv2.imread('prac.jpg'); gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)in your code.

FileNotFoundError: [WinError 2] The system cannot find the file specified while trying to use pytesseract

I am trying to write a python script to extract the text from image and i keep getting this error. The script is given below. Error
from PIL import Image
from pytesseract import image_to_string
print (image_to_string(Image.open('samp.png')))
print (image_to_string(Image.open('test-english.jpg'), lang='eng'))
Try the following steps, this worked for me.
1) Download tesseract-OCR from here and install it in the location C:/Program Files
2)write the following code
from PIL import Image
from pytesseract import image_to_string
#pytesseract.pytesseract.tesseract_cmd = '<full_path_to_your_tesseract_executable>'
i.e
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'
3)Now run this
print(pytesseract.image_to_string(Image.open('D:/image_file.jpg')))
Hope that help!

Python error when importing image_to_string from tesseract

I recently used tesseract OCR with python and I kept getting an error when I was trying to import image_to_string from tesseract.
Code causing the problem:
# Perform OCR using tesseract-ocr library
from tesseract import image_to_string
image = Image.open('input-NEAREST.tif')
print image_to_string(image)
Error caused by above code:
Traceback (most recent call last):
file "./captcha.py", line 52, in <module>
from tesseract import image_to_string
ImportError: cannot import name image_to_string
I've verified that the tesseract module is installed:
digital_alchemy#roaming-gnome /home $ pydoc modules | grep 'tesseract'
Hdf5StubImagePlugin _tesseract gzip sipconfig
ORBit cairo mako tesseract
I believe that I've grabbed all the required packages but unfortunately I'm just stuck at this point. It appears that the function is not in the module.
Any help greatly appreciated.
Another possibility that seems to have worked for me is to modify pytesseract so that instead of import Image it has from PIL import Image
Code that works in PyCharm after modifying pytesseract:
from pytesseract import image_to_string
from PIL import Image
im = Image.open(r'C:\Users\<user>\Downloads\dashboard-test.jpeg')
print(im)
print(image_to_string(im))
Pytesseract I installed via the package management built into PyCharm
Is your syntax correct for the module you have installed? That image_to_string functions looks like it is from PyTesser per the usage example on this page:
https://code.google.com/p/pytesser/
Your import looks like it is for python-tesseract which has a more complicated usage example listed:
https://code.google.com/p/python-tesseract/
For windows followed below steps
pip3 install pytesseract
pip3 install pillow
Installation of tessaract-ocr is also required
https://github.com/tesseract-ocr/tesseract/wiki
otherwise you will get an error Tessract is not on path
Python code
from PIL import Image
from pytesseract import image_to_string
print ( image_to_string(Image.open('test.tif'),lang='eng') )
what works for me:
after I install the pytesseract form tesseract-ocr-setup-3.05.02-20180621.exe
I add the line
pytesseract.pytesseract.tesseract_cmd="C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe"
and use the code form the above this is all the code:
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd="C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe"
im=Image.open("C:\\Users\\<user>\\Desktop\\ro\\capt.png")
print(pytesseract.image_to_string(im,lang='eng'))
I am using windows 10 with PyCharm Community Edition 2018.2.3 x64

Categories