Error while performing OCR using pytesseract - python

I want to use pytesseract. This is my code:
import pytesseract
from pdf2image import convert_from_path
PDF_file = 'file.pdf'
text = ''
pages = convert_from_path(PDF_file, 500)
pageText = pytesseract.image_to_string(pages[0])
As a result, I get this error:
Traceback (most recent call last):
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 409, in pdfinfo_from_path
proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 1307, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\user\Desktop\projects\pdfparser\pdftest.py", line 13, in <module>
pages = convert_from_path(PDF_file, 500)
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 89, in convert_from_path
page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 430, in pdfinfo_from_path
raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

As a lot of comments have already pointed out, the error message
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
tells you precisely what went wrong: Poppler is not installed, or it is not on your PATH. Please refer to the pdf2image README for installation help.
pdf2image is only a wrapper around Poppler's command-line utilities such as pdftoppm. Many Linux distributions ship these tools by default, so you would not need to bother with them there, but Windows does not.
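If Poppler is installed but not on PATH, pdf2image can also be told where to find it. The sketch below first checks whether the tools are visible, then passes a directory through the poppler_path argument that appears in the traceback above. The helper names and the example directory in the usage note are made up for illustration; use whatever location you unpacked Poppler to.

```python
import shutil


def poppler_on_path():
    """Return True if the Poppler tools pdf2image shells out to are on PATH."""
    return all(shutil.which(tool) is not None for tool in ("pdfinfo", "pdftoppm"))


def first_page_image(pdf_file, poppler_dir=None, dpi=500):
    """Convert a PDF and return its first page as a PIL image.

    poppler_dir is the folder holding pdfinfo/pdftoppm; leave it as None
    when the tools are already on PATH.
    """
    # Imported here so the PATH check above works without pdf2image installed.
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_file, dpi, poppler_path=poppler_dir)
    return pages[0]


if __name__ == "__main__":
    print("Poppler tools found on PATH:", poppler_on_path())
```

For example, first_page_image('file.pdf', poppler_dir=r'C:\tools\poppler\bin') would use a Poppler copy from that (hypothetical) folder without touching the system PATH.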

Related

Textract: failed with exit code 127 // windows 10 // pdftotext

When I run my program (after packaging it with PyInstaller), which reads and converts a PDF file and enters the result into a Google Sheet, I get the error shown below. I cannot figure out what the problem is:
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\utils.py", line 82, in run
pipe = subprocess.Popen(
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 1307, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\tkinter\__init__.py", line 1883, in __call__
return self.func(*args)
File "EinkaufRGWindows.py", line 40, in InkoopRekeningen
text = textract.process(str(importfolder) + str(i))
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\__init__.py", line 77, in process
return parser.process(filename, encoding, **kwargs)
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\pdf_parser.py", line 28, in extract
raise ex
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\pdf_parser.py", line 20, in extract
return self.extract_pdftotext(filename, **kwargs)
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\pdf_parser.py", line 43, in extract_pdftotext
stdout, _ = self.run(args)
File "C:\Users\trpfinance\AppData\Local\Programs\Python\Python38-32\lib\site-packages\textract\parsers\utils.py", line 90, in run
raise exceptions.ShellError(
textract.exceptions.ShellError: The command `pdftotext //Mac/Home/Desktop/Wickey Einkauf Test/Rekeningen/Lekkerkerker_ - 20803471.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------
The underlying exception is a FileNotFoundError. According to the error message, the command being run is:
pdftotext //Mac/Home/Desktop/Wickey Einkauf Test/Rekeningen/Lekkerkerker_ - 20803471.pdf -
There are a couple of things I would look at here. First, the file path starts with a double slash, which seems wrong. Second, the file path contains spaces but is not enclosed in quotation marks, so pdftotext would read it as several separate arguments rather than one. You can fix this by formatting your subprocess call so that the file path is wrapped in quotation marks, like so:
pdftotext "example file path.pdf" -
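When the command is launched from Python, another way to avoid quoting problems is to pass the arguments as a list, so that each element reaches the program as exactly one argument. A small sketch of the difference follows; the path is a made-up example, and shlex is used only to show how a POSIX-style shell would split each command string.

```python
import shlex

# Hypothetical path containing spaces, similar in shape to the one in the question.
path = "Wickey Einkauf Test/Rekeningen/invoice 1.pdf"

# Without quotes, a shell-style split breaks the path at every space.
unquoted = shlex.split(f"pdftotext {path} -")

# With quotes, the path survives as a single argument.
quoted = shlex.split(f'pdftotext "{path}" -')

# Passing a list to subprocess needs no quoting at all.
argv = ["pdftotext", path, "-"]

if __name__ == "__main__":
    print(len(unquoted), "arguments without quotes")
    print(len(quoted), "arguments with quotes")
    print("list form matches quoted form:", quoted == argv)
```

Keep in mind that the command shown in the error text may not reflect the quoting that was actually used: the textract traceback in the next question shows the message being assembled with ' '.join(args), which suggests the arguments were already a list.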
You need to install pdftotext using pip.
To install it you need to have Microsoft Visual C++ 14 or greater.
I had the same issue. It seems to be an OS issue. For me, switching to Git Bash worked. https://github.com/deanmalmgren/textract/issues/229
If you are using PyCharm, change the default terminal to Bash.

textract doesn't work on PDF

I'm new to Python. I'm using PyCharm 2018.2 and the latest version of Anaconda on Windows 10.
After solving all the problems with installing textract on Windows 10, the installation succeeded from the Anaconda Prompt. I also set the project interpreter to \continuum\anaconda3\python.exe.
My goal is to extract the text from large PDF files and save it as a .txt file.
I tried the test_pdf.py files from textract, but they don't work.
This is the resulting message:
"textract" is misspelled or could not be found (my own translation from German)
So I tried my own code, following the textract page, but it doesn't work either:
Code:
import textract
text = textract.process('pfad/large.pdf')
Results:
C:\Users\raz\AppData\Local\Continuum\anaconda3\python.exe "C:/Users/raz/Google Drive/FOM/Master/Master/NurText/Testo.py"
Traceback (most recent call last):
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\utils.py", line 85, in run
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\subprocess.py", line 997, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] Das System kann die angegebene Datei nicht finden
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/raz/Google Drive/FOM/Master/Master/NurText/Testo.py", line 2, in <module>
text = textract.process('pfad/large.pdf')
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\__init__.py", line 77, in process
return parser.process(filename, encoding, **kwargs)
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\utils.py", line 46, in process
byte_string = self.extract(filename, **kwargs)
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\pdf_parser.py", line 28, in extract
raise ex
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\pdf_parser.py", line 20, in extract
return self.extract_pdftotext(filename, **kwargs)
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\pdf_parser.py", line 43, in extract_pdftotext
stdout, _ = self.run(args)
File "C:\Users\raz\AppData\Local\Continuum\anaconda3\lib\site-packages\textract-1.6.1-py3.6.egg\textract\parsers\utils.py", line 92, in run
' '.join(args), 127, '', '',
textract.exceptions.ShellError: The command pdftotext pfad/large.pdf - failed with exit code 127
------------- stdout -------------
------------- stderr -------------
Thanks for your help

Simple HTML to PDF python library error

I'm using pydf to convert HTML to PDF on our server. This example comes straight from its docs and illustrates the problem:
import pydf
pdf = pydf.generate_pdf('<h1>this is html</h1>')
with open('test_doc.pdf', 'wb') as f:
    f.write(pdf)
When I run this file, I get the same error every time:
(pdf) <computer>:<folder> <user>$ python pdf.py
Traceback (most recent call last):
File "pdf.py", line 3, in <module>
pdf = pydf.generate_pdf('<h1>this is html</h1>')
File "/Users/nilesbrandon/Projects/pdf/pdf/lib/python2.7/site-packages/pydf/wkhtmltopdf.py", line 121, in generate_pdf
return gen_pdf(html_file.name, cmd_args)
File "/Users/nilesbrandon/Projects/pdf/pdf/lib/python2.7/site-packages/pydf/wkhtmltopdf.py", line 105, in gen_pdf
_, stderr, returncode = execute_wk(*cmd_args)
File "/Users/nilesbrandon/Projects/pdf/pdf/lib/python2.7/site-packages/pydf/wkhtmltopdf.py", line 22, in execute_wk
p = subprocess.Popen(wk_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
I'm running this in a virtualenv and my pip freeze is only the following:
python-pdf==0.30
Any idea what could be going wrong here?
Since you are on macOS, you need to supply a wkhtmltopdf binary yourself. As the pydf documentation puts it:
pydf comes bundled with a wkhtmltopdf binary which will only work on Linux amd64 architectures. If you're on another OS or architecture your mileage may vary, it is likely that you'll need to supply your own wkhtmltopdf binary and point pydf towards it by setting the WKHTMLTOPDF_PATH variable.
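A minimal sketch of that setup follows. It assumes WKHTMLTOPDF_PATH is read as an environment variable and that it has to be set before pydf is imported; both points are assumptions to check against the pydf README. The helper name is made up for illustration.

```python
import os
import shutil


def point_pydf_at_wkhtmltopdf(binary_path=None, environ=os.environ):
    """Set WKHTMLTOPDF_PATH to a wkhtmltopdf binary and return the path used.

    If binary_path is not given, PATH is searched. Returns None, and leaves
    the environment untouched, when no binary can be located.
    """
    path = binary_path or shutil.which("wkhtmltopdf")
    if path is not None:
        environ["WKHTMLTOPDF_PATH"] = path
    return path


if __name__ == "__main__":
    chosen = point_pydf_at_wkhtmltopdf()
    if chosen is None:
        print("wkhtmltopdf was not found on PATH; install it first.")
    else:
        # Assumption: pydf picks the variable up at import time,
        # so the import comes after the environment is prepared.
        import pydf

        pdf = pydf.generate_pdf('<h1>this is html</h1>')
        with open('test_doc.pdf', 'wb') as f:
            f.write(pdf)
```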

FileNotFoundError on python

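import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
# printscreen_pil is a PIL image captured earlier in the script
img = printscreen_pil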
img = img.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(img)
img = enhancer.enhance(2)
img = img.convert('1')
img.save('temp.jpg')
text = pytesseract.image_to_string(Image.open('temp.jpg'))
I want to read the image and convert it to text, but I get the error "The system cannot find the file specified". I think it has to do with Python's working directory. I'm sorry if this is a stupid question, but I hope you can help me.
This is the complete error message:
Traceback (most recent call last):
File "C:\Users\pncor\Documents\pyprograms\bot.py", line 23, in <module>
text = pytesseract.image_to_string(Image.open('temp.jpg'))
File "C:\Users\pncor\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pytesseract\pytesseract.py", line 122, in image_to_string
config=config)
File "C:\Users\pncor\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pytesseract\pytesseract.py", line 46, in run_tesseract
proc = subprocess.Popen(command, stderr=subprocess.PIPE)
File "C:\Users\pncor\AppData\Local\Programs\Python\Python36-32\lib\subprocess.py", line 707, in __init__
restore_signals, start_new_session)
File "C:\Users\pncor\AppData\Local\Programs\Python\Python36-32\lib\subprocess.py", line 990, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
The tesseract package does not seem to be installed on your system, or it is not found on your PATH. pytesseract runs the tesseract binary as a subprocess in order to perform the OCR.
Use your operating system's package manager to install it, or refer to the installation documentation. Since you are using Windows, follow the Windows-specific installation instructions.
Also, I don't think it is necessary to write the enhanced image to a file first; just pass it directly to pytesseract.image_to_string:
text = pytesseract.image_to_string(img)
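If tesseract is installed but still not found, pytesseract can be pointed at the executable explicitly through its tesseract_cmd setting. A hedged sketch follows: the install location is only an assumed example, and the helper name is made up for illustration.

```python
import os

# Assumed install location; adjust it to wherever tesseract.exe lives on your machine.
ASSUMED_TESSERACT_EXE = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def use_tesseract_at(exe_path):
    """Point pytesseract at an explicit tesseract binary.

    Returns False without changing anything when the file does not exist.
    """
    if not os.path.isfile(exe_path):
        return False
    # Imported here so the existence check works without pytesseract installed.
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = exe_path
    return True


if __name__ == "__main__":
    if not use_tesseract_at(ASSUMED_TESSERACT_EXE):
        print("tesseract.exe not found at", ASSUMED_TESSERACT_EXE)
```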

Compiling and Executing Java file in python

How can I open a Java file in Python? I've searched the net and found this:
import os.path, subprocess
from subprocess import STDOUT, PIPE
def compile_java (java_file):
    subprocess.check_call(['javac', java_file])
def execute_java (java_file):
    cmd=['java', java_file]
    proc=subprocess.Popen(cmd, stdout = PIPE, stderr = STDOUT)
    input = subprocess.Popen(cmd, stdin = PIPE)
    print(proc.stdout.read())
compile_java("CsMain.java")
execute_java("CsMain")
but then I got this error:
Traceback (most recent call last):
File "C:\Python33\lib\subprocess.py", line 1106, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\casestudy\opener.py", line 13, in <module>
compile_java("CsMain.java")
File "C:\casestudy\opener.py", line 5, in compile_java
subprocess.check_call(['javac', java_file])
File "C:\Python33\lib\subprocess.py", line 539, in check_call
retcode = call(*popenargs, **kwargs)
File "C:\Python33\lib\subprocess.py", line 520, in call
with Popen(*popenargs, **kwargs) as p:
File "C:\Python33\lib\subprocess.py", line 820, in __init__
restore_signals, start_new_session)
File "C:\Python33\lib\subprocess.py", line 1112, in _execute_child
raise WindowsError(*e.args)
FileNotFoundError: [WinError 2] The system cannot find the file specified
>>>
The Python file and the Java file are in the same folder, and I am using Python 3.3.2. How can I resolve this? Or is there another way of doing this? Any answer is appreciated, thanks!
I think it isn't recognizing the javac command. Try manually running the command and if javac isn't a recognized command, register it in your PATH variable and try again.
Or you could just try typing the full pathname to the Java directory for javac and java.
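The check suggested above can also be done from Python before anything is launched. The sketch below looks up javac and java on PATH, fails with a clearer message when they are missing, and otherwise runs them by their full paths. The function names are illustrative, and capture_output needs Python 3.7 or newer, so this will not run on the 3.3.2 mentioned in the question.

```python
import shutil
import subprocess


def resolve_tool(name):
    """Return the full path of a command-line tool, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            f"{name!r} was not found on PATH; install the JDK or add its bin folder to PATH"
        )
    return path


def compile_and_run(java_file, class_name):
    """Compile java_file with javac, run class_name with java, return its stdout."""
    javac = resolve_tool("javac")
    java = resolve_tool("java")
    subprocess.run([javac, java_file], check=True)
    result = subprocess.run([java, class_name], capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(compile_and_run("CsMain.java", "CsMain"))
```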
You need to pass the full path to your Java file, like this:
compile_java(r"C:\path\to\this\CsMain.java")
