How to convert pdf document to ocr pdf document

I need to convert a PDF document to an OCR PDF document, the way Adobe Acrobat does. I tried the ocrmypdf module, but it is not working. I am using Python 2.7. Suggestions for other modules are also appreciated.
import logging
import os
import subprocess
import sys
import time
import shutil
path = r"D:\Nikhil Scraping\Pdf all processing"
for filenames in os.listdir(path):
    print (filenames)
    filename = filenames.split('.')[0]
    print (filename)
    input_path = os.path.join(path, filenames)
    outputfile = filename + "_OCR.pdf"
    cmd = ["ocrmypdf", "--output-type", "pdf", input_path, outputfile]
    logging.info(cmd)
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    result = proc.stdout.read()
Error shown:
1-9-US 118137380VP1.pdf
1-9-US 118137380VP1
Traceback (most recent call last):
  File "D:\Nikhil Scraping\Pdf all processing\pdf_ocr_working.py", line 19, in <module>
    proc=subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
  File "C:\Python27\Lib\subprocess.py", line 710, in __init__
    errread, errwrite)
  File "C:\Python27\Lib\subprocess.py", line 958, in _execute_child
    startupinfo)
WindowsError: [Error 2] The system cannot find the file specified
When I run the same code in Python 3.7, it runs without an error, but no output file is generated.
It also works on macOS; I don't know why Windows shows this error.

You are joining all filenames here instead of one filename:
input_path=os.path.join(path,filenames)
Use this code instead:
input_path=os.path.join(path,filename)
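The WindowsError in the question is raised inside subprocess.Popen itself, before any PDF is processed, which usually suggests that Windows could not find the ocrmypdf executable rather than the input file. Note also that in the question's loop, filenames appears to hold one name per iteration, so joining it with path already yields a single input path. A minimal sketch of both checks, assuming Python 3.3+ (for shutil.which); the helper name build_ocr_command is hypothetical and the folder name is a placeholder:

```python
import os
import shutil


def build_ocr_command(folder, name):
    """Build the ocrmypdf argument list for one PDF in `folder` (hypothetical helper)."""
    stem, _ext = os.path.splitext(name)  # splits off only the final extension
    input_path = os.path.join(folder, name)  # keeps the .pdf extension
    output_path = os.path.join(folder, stem + "_OCR.pdf")
    return ["ocrmypdf", "--output-type", "pdf", input_path, output_path]


# shutil.which returns None when the program is not on the PATH that Python sees.
if shutil.which("ocrmypdf") is None:
    print("ocrmypdf was not found on PATH; check that it is installed for this Python")

print(build_ocr_command("pdfs", "1-9-US 118137380VP1.pdf"))
```

If the check prints the warning, installing ocrmypdf (and its external dependencies) for the interpreter in use, or passing the full path to the executable, would be the next thing to try.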

Related

subprocess.call() in python is not producing files despite not producing an error

I am using the script below, taken from https://github.com/theeko74/pdfc/blob/master/pdf_compressor.py:
import os
import sys
import subprocess
def compress(input_file_path, output_file_path, power=0):
    """Function to compress PDF via Ghostscript command line interface"""
    quality = {
        0: '/default',
        1: '/prepress',
        2: '/printer',
        3: '/ebook',
        4: '/screen'
    }
    # Basic controls
    # Check if valid path
    if not os.path.isfile(input_file_path):
        print("Error: invalid path for input PDF file")
        sys.exit(1)
    # Check if file is a PDF by extension
    if input_file_path.split('.')[-1].lower() != 'pdf':
        print("Error: input file is not a PDF")
        sys.exit(1)
    print("Compress PDF...")
    initial_size = os.path.getsize(input_file_path)
    # subprocess.call matches the "in call" frame in the traceback below
    subprocess.call(['gs', '-sDEVICE=pdfwrite', '-dCompatibilityLevel=1.4',
                     '-dPDFSETTINGS={}'.format(quality[power]),
                     '-dNOPAUSE', '-dQUIET', '-dBATCH',
                     '-sOutputFile={}'.format(output_file_path),
                     input_file_path])
compress("D:/Documents/Pdf Handler/test.pdf","D:/Documents/Pdf Handler/testCompress.pdf")
I want to use it to compress a PDF, but whenever I run it, the following error is produced:
Traceback (most recent call last):
  File "D:/Documents/Pdf Handler/compress.py", line 39, in <module>
    compress("D:/Documents/Pdf Handler/test.pdf","D:/Documents/Pdf Handler/testCompress.pdf")
  File "D:/Documents/Pdf Handler/compress.py", line 31, in compress
    input_path]
  File "C:\Python37\lib\subprocess.py", line 323, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Python37\lib\subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "C:\Python37\lib\subprocess.py", line 1178, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
After some research, I found out that I had to add shell=True to subprocess.call:
...
    print("Compress PDF...")
    initial_size = os.path.getsize(input_file_path)
    subprocess.call(['gs', '-sDEVICE=pdfwrite', '-dCompatibilityLevel=1.4',
                     '-dPDFSETTINGS={}'.format(quality[power]),
                     '-dNOPAUSE', '-dQUIET', '-dBATCH',
                     '-sOutputFile={}'.format(output_file_path),
                     input_file_path], shell=True)
compress("D:/Documents/Pdf Handler/test.pdf","D:/Documents/Pdf Handler/testCompress.pdf")
While this did stop the error from being raised, the code now doesn't appear to do anything. It runs, but no files are added to the specified directory.
Your help would be massively appreciated. As I mentioned, I took this code from elsewhere and the documentation is very sparse.

This isn't really an answer, but it's meant to help you debug this.
You're currently flying blind, so you need to inspect stdout and stderr like this:
import subprocess
proc = subprocess.Popen(['gs', '-sDEVICE=pdfwrite', '-dCompatibilityLevel=1.4',
                         '-dPDFSETTINGS={}'.format(quality[power]),
                         '-dNOPAUSE', '-dQUIET', '-dBATCH',
                         '-sOutputFile={}'.format(output_file_path),
                         input_file_path], shell=True, stdout=subprocess.PIPE,
                        stderr=subprocess.PIPE)
out, err = proc.communicate()
print(out)
print(err)
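With shell=True, a missing program no longer raises FileNotFoundError in Python; the shell reports the problem on stderr and exits with a non-zero code, which may be why nothing seems to happen. The sketch below, assuming Python 3.5+, resolves the program first and returns the exit code along with the captured output. The helper names are hypothetical, and gswin64c and gswin32c are the usual names of Ghostscript's console executables on Windows, which is an assumption about the asker's installation.

```python
import shutil
import subprocess
import sys


def find_program(candidates):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None


def run_and_report(args):
    """Run `args` without a shell and return (exit code, stdout, stderr) as text."""
    proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    return proc.returncode, proc.stdout, proc.stderr


# Assumed executable names: 'gs' on Unix-like systems, 'gswin64c'/'gswin32c' on Windows.
gs = find_program(["gs", "gswin64c", "gswin32c"])
print("Ghostscript executable:", gs)

# Demonstration with the Python interpreter itself, which is known to exist.
code, out, err = run_and_report([sys.executable, "-c", "print('ok')"])
print(code, out.strip(), err.strip())
```

If find_program returns None, Ghostscript is probably not installed or not on PATH, and the full path to its executable would have to be passed as the first list element instead of 'gs'.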

Python: Can't find file even though file referenced exists

I'm getting this error when trying to run a Python script. Is it saying that it can't find subprocess.py? Because I found it in the location it's listing there, so I doubt that's the issue. What file can't it find?
Traceback (most recent call last):
  File "D:\Projects\PythonMathPlots\MandelbrotVideoGenerator.py", line 201, in <module>
    run( ['open', 'MandelbrotZoom.mp4'] )
  File "C:\Users\Aaron\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 472, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Users\Aaron\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "C:\Users\Aaron\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1178, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

You may need to pass full paths in the run(...) call: both the path to the open executable and the path to the .mp4 file.
Most likely, open does not exist on your system (it is a macOS command), and you have to use the name of your video player program instead.
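Since open is a macOS command, on Windows there is nothing by that name for subprocess to start. A minimal cross-platform sketch follows; opener_command and open_file are hypothetical helper names, and xdg-open is assumed to be available on Linux desktops.

```python
import os
import subprocess
import sys


def opener_command(path, platform=sys.platform):
    """Return the argument list that opens `path`, or None where os.startfile applies."""
    if platform.startswith("win"):
        return None  # Windows: use os.startfile instead of an external command
    if platform == "darwin":
        return ["open", path]
    return ["xdg-open", path]  # assumed default for Linux desktops


def open_file(path):
    """Open `path` with the default application for its file type."""
    cmd = opener_command(path)
    if cmd is None:
        os.startfile(path)  # only exists on Windows
    else:
        subprocess.run(cmd, check=True)
```

In the question's script, the last line would then become open_file('MandelbrotZoom.mp4').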

Make sure the user you're running the script as has read permission for the file.

You may also try subprocess.Popen(args, shell=True); using shell=True may help here.
Also, build the path with path = os.path.join(filepath, filename) and, before passing it to Popen, check it with assert os.path.exists(path).
But note that there are some downsides to using shell=True:
Actual meaning of 'shell=True' in subprocess
https://medium.com/python-pandemonium/a-trap-of-shell-true-in-the-subprocess-module-6db7fc66cdfd
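One way to avoid shell=True altogether is to pass the arguments as a list, so that a path containing spaces reaches the program as a single argument without any quoting. A small sketch, assuming Python 3.5+; it uses the Python interpreter as a stand-in for the real program, and echo_first_argument is a hypothetical name.

```python
import subprocess
import sys


def echo_first_argument(value):
    """Start a child process with a list of arguments and return what it received."""
    child_code = "import sys; print(sys.argv[1])"
    result = subprocess.run([sys.executable, "-c", child_code, value],
                            stdout=subprocess.PIPE, universal_newlines=True,
                            check=True)
    return result.stdout.rstrip("\n")


# The space in "Pdf Handler" needs no escaping because no shell parses the command.
print(echo_first_argument("D:/Documents/Pdf Handler/test.pdf"))
```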

Simple HTML to PDF python library error

I'm using pydf to convert HTML to a PDF on our server. This example comes straight from its docs and illustrates the problem:
import pydf
pdf = pydf.generate_pdf('<h1>this is html</h1>')
with open('test_doc.pdf', 'wb') as f:
    f.write(pdf)
When I run this file, I get the same error every time:
(pdf) <computer>:<folder> <user>$ python pdf.py
Traceback (most recent call last):
  File "pdf.py", line 3, in <module>
    pdf = pydf.generate_pdf('<h1>this is html</h1>')
  File "/Users/nilesbrandon/Projects/pdf/pdf/lib/python2.7/site-packages/pydf/wkhtmltopdf.py", line 121, in generate_pdf
    return gen_pdf(html_file.name, cmd_args)
  File "/Users/nilesbrandon/Projects/pdf/pdf/lib/python2.7/site-packages/pydf/wkhtmltopdf.py", line 105, in gen_pdf
    _, stderr, returncode = execute_wk(*cmd_args)
  File "/Users/nilesbrandon/Projects/pdf/pdf/lib/python2.7/site-packages/pydf/wkhtmltopdf.py", line 22, in execute_wk
    p = subprocess.Popen(wk_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
I'm running this in a virtualenv and my pip freeze is only the following:
python-pdf==0.30
Any idea what could be going wrong here?

As you are using macOS, you need to download a wkhtmltopdf binary yourself:
pydf comes bundled with a wkhtmltopdf binary which will only work on Linux amd64 architectures. If you're on another OS or architecture your mileage may vary, it is likely that you'll need to supply your own wkhtmltopdf binary and point pydf towards it by setting the WKHTMLTOPDF_PATH variable.
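A sketch of pointing pydf at a separately installed binary. It assumes WKHTMLTOPDF_PATH is read as an environment variable and therefore sets it before pydf is imported; the helper name configure_wkhtmltopdf and the fallback search through PATH are illustrative, not part of pydf.

```python
import os
import shutil


def configure_wkhtmltopdf(binary_path=None):
    """Set WKHTMLTOPDF_PATH to an existing binary; return the path used or None."""
    path = binary_path or shutil.which("wkhtmltopdf")
    if path and os.path.isfile(path):
        os.environ["WKHTMLTOPDF_PATH"] = path
        return path
    return None


if configure_wkhtmltopdf() is None:
    print("wkhtmltopdf not found; install it or pass its full path explicitly")
# Import pydf only after the variable is set (assumption: it is read at import time).
```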

How can I get Python to find ffprobe?

I have ffmpeg and ffprobe installed on my mac (macOS Sierra), and I have added their path to PATH. I can run them from terminal.
I am trying to use ffprobe to get the width and height of a video file using the following code:
import subprocess
import shlex
import json
# function to find the resolution of the input video file
def findVideoResolution(pathToInputVideo):
    cmd = "ffprobe -v quiet -print_format json -show_streams"
    args = shlex.split(cmd)
    args.append(pathToInputVideo)
    # run the ffprobe process, decode stdout into utf-8 & convert to JSON
    ffprobeOutput = subprocess.check_output(args).decode('utf-8')
    ffprobeOutput = json.loads(ffprobeOutput)
    # find height and width
    height = ffprobeOutput['streams'][0]['height']
    width = ffprobeOutput['streams'][0]['width']
    return height, width
h, w = findVideoResolution("/Users/tomburrows/Documents/qfpics/user1/order1/movie.mov")
print(h, w)
I am sorry I cannot provide an MCVE, as I didn't write this code and I don't really know how it works.
It gives the following error:
Traceback (most recent call last):
  File "/Users/tomburrows/Dropbox/Moviepy Tests/get_dimensions.py", line 21, in <module>
    h, w = findVideoResolution("/Users/tomburrows/Documents/qfpics/user1/order1/movie.mov")
  File "/Users/tomburrows/Dropbox/Moviepy Tests/get_dimensions.py", line 12, in findVideoResolution
    ffprobeOutput = subprocess.check_output(args).decode('utf-8')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 626, in check_output
    **kwargs).stdout
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 693, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 947, in __init__
    restore_signals, start_new_session)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'ffprobe'
If Python is not reading my PATH, how can I specify where ffprobe is?
Edit:
It appears that the PATH Python sees is not the same as my shell PATH.
Using os.environ["PATH"]+=":/the_path/of/ffprobe/dir" at the beginning of each program allows me to use ffprobe, but why might the PATH in Python differ from my shell PATH?

You can use
import os
print(os.environ['PATH'])
to check whether the directory containing ffprobe is on the PATH that Python sees. Judging by the error, it likely is not.
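One possible explanation, stated as an assumption: programs started outside a terminal (for example an IDE launched from the Finder) do not read the shell's startup files, so they can see a shorter PATH than the terminal does. The sketch below checks what Python can find and extends PATH for the current process only. ensure_on_path is a hypothetical helper, and /the_path/of/ffprobe/dir is the placeholder from the question.

```python
import os
import shutil


def ensure_on_path(directory):
    """Append `directory` to this process's PATH if it is not already listed."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    if directory not in entries:
        entries.append(directory)
        os.environ["PATH"] = os.pathsep.join(e for e in entries if e)
    return os.environ["PATH"]


print("ffprobe before:", shutil.which("ffprobe"))
ensure_on_path("/the_path/of/ffprobe/dir")  # placeholder directory from the question
print("ffprobe after:", shutil.which("ffprobe"))
```

Using os.pathsep instead of a hard-coded ":" keeps the same code working on Windows, where the separator is ";".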

Error in opening image file in PIL

I am trying to execute the following code
from pytesser import *
import Image
i="C:/Documents and Settings/Administrator/Desktop/attachments/R1PNDTCB.jpg"
print i
im = Image.open(i.strip())
text = image_to_string(im)
print text
I get the following error:
C:/Documents and Settings/Administrator/Desktop/attachments/R1PNDTCB.jpg
Traceback (most recent call last):
  File "C:\Python27\Lib\site-packages\Pythonwin\pywin\framework\scriptutils.py", line 322, in RunScript
    debugger.run(codeObject, __main__.__dict__, start_stepping=0)
  File "C:\Python27\Lib\site-packages\Pythonwin\pywin\debugger\__init__.py", line 60, in run
    _GetCurrentDebugger().run(cmd, globals,locals, start_stepping)
  File "C:\Python27\Lib\site-packages\Pythonwin\pywin\debugger\debugger.py", line 655, in run
    exec cmd in globals, locals
  File "C:\Documents and Settings\Administrator\Desktop\attachments\ocr.py", line 1, in <module>
    from pytesser import *
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 1952, in open
    fp = __builtin__.open(fp, "rb")
IOError: [Errno 2] No such file or directory: 'C:/Documents and Settings/Administrator/Desktop/attachments/R1PNDTCB.jpg'
Can someone please explain what I am doing wrong here?
I renamed the image file, moved the Python file and the images to a new folder, and moved the folder to the E: drive.
Now the code is as follows:
from pytesser import *
import Image
import os
i=os.path.join("E:\\","ocr","a.jpg")
print i
im = Image.open(i.strip())
text = image_to_string(im)
print text
Now the error is as follows:
E:\ocr\a.jpg
Traceback (most recent call last):
  File "or.py", line 8, in <module>
    text = image_to_string(im)
  File "C:\Python27\lib\pytesser.py", line 31, in image_to_string
    call_tesseract(scratch_image_name, scratch_text_name_root)
  File "C:\Python27\lib\pytesser.py", line 21, in call_tesseract
    proc = subprocess.Popen(args)
  File "C:\Python27\lib\subprocess.py", line 679, in __init__
    errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 893, in _execute_child
    startupinfo)
WindowsError: [Error 2] The system cannot find the file specified

You need to install Tesseract first. Just installing pytesseract is not enough. Then edit the tesseract_cmd variable in pytesseract.py to point to the tesseract binary. For example, in my installation I set it to
tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

The exception is pretty clear: the file either doesn't exist, or you lack sufficient permissions to access it. If neither is the case, please provide evidence (e.g. relevant dir commands with output, run as the same user).

Your image path, maybe?
i="C:\\Documents and Settings\\Administrator\\Desktop\\attachments\\R1PNDTCB.jpg"
Try this:
import os
os.path.join("C:\\", "Documents and Settings", "Administrator")
You should get a string similar to the one in the line above.
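To see what os.path.join produces for a Windows path on any platform, the ntpath module (the Windows implementation behind os.path) can be called directly. A small sketch; the os.path.exists check at the end is the more direct way to test whether the path itself is the problem.

```python
import ntpath
import os

# ntpath applies Windows path rules even when this runs on another OS.
joined = ntpath.join("C:\\", "Documents and Settings", "Administrator")
print(joined)

# Checking the real file before opening it separates path problems from other errors.
image_path = "C:/Documents and Settings/Administrator/Desktop/attachments/R1PNDTCB.jpg"
print(os.path.exists(image_path))
```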

Try this first:
os.path.expanduser('~/Desktop/attachments/R1PNDTCB.jpg')
It could be that the space in the 'Documents and Settings' is causing this problem.
EDIT:
Use os.path.join so it uses the correct directory separator.

Just add these two lines to your code
import os
os.chdir(r'C:\Python27\Lib\site-packages\pytesser')
before
from pytesser import *

If you are using pytesseract, you have to make sure that Tesseract-OCR is installed on your system. After that, you have to set the path of the tesseract binary in your code, as below:
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
You can download Tesseract-OCR from https://github.com/UB-Mannheim/tesseract/wiki
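A sketch of setting the path only when the binary is actually present. pick_tesseract is a hypothetical helper, the candidate locations are assumptions based on the paths mentioned in the answers, and the pytesseract import is guarded so the snippet also runs where the package is not installed.

```python
import os
import shutil


def pick_tesseract(candidates):
    """Return the first existing candidate path, else whatever PATH provides (or None)."""
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return shutil.which("tesseract")


# Assumed install locations; adjust these to the actual installation.
tesseract_path = pick_tesseract([
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
])

try:
    import pytesseract
except ImportError:
    pytesseract = None

if pytesseract is not None and tesseract_path is not None:
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
print("tesseract binary:", tesseract_path)
```

If the printed value is None, Tesseract itself is most likely not installed, which would match the "cannot find the file specified" error raised from subprocess.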
