I am using Python 2.7 and PDFMiner to extract text from PDFs. I noticed that PDFMiner sometimes gives me words with strange letters, while PDF viewers don't. For some documents the result returned by PDFMiner and by PDF viewers is the same (strange), but there are documents where PDF viewers can recognize the text (copy-paste works). Here is an example of the returned values:
from pdf viewer: فتــح بـــاب ا�ستيــراد البيــ�ض والدجــــاج المجمـــد
from PDFMiner: óªéªdG êÉ````LódGh ¢†``«ÑdG OGô``«à°SG ÜÉH í``àa
So my question is: can I get the same result as the PDF viewer, and what is wrong with PDFMiner? Is it missing encodings I don't know about?
Yes.
This will happen when custom font encodings have been used (e.g. Identity-H, Identity-V, etc.) but the fonts have not been embedded properly.
PDFMiner gives garbage output in such cases because the encoding is required to interpret the text.
Maybe the PDF file you are trying to read uses an encoding not yet supported by PDFMiner.
I had a similar problem last month and finally solved it by using a Java library named PDFBox and calling it from Python. PDFBox supported the encoding I needed and worked like a charm!
First I downloaded PDFBox from the official site and then referenced the path to the .jar file from my code.
Here is a simplified version of the code I used (untested, but based on my original tested code).
You will need subprocess32, which you can install with pip install subprocess32.
import subprocess32 as subprocess
import os
import tempfile


class RunnableError(Exception):
    """Defined here so the snippet is self-contained."""
    pass


def extractPdf(file_path, pdfboxPath, timeout=30, encoding='UTF-8'):
    #tempfile = temp_file(data, suffix='.pdf')
    try:
        command_args = ['java', '-jar', os.path.expanduser(pdfboxPath), 'ExtractText',
                        '-console', '-encoding', encoding, file_path]
        status, stdout, stderr = external_process(command_args, timeout=timeout)
    except subprocess.TimeoutExpired:
        raise RunnableError('PDFBox timed out while processing document')
    finally:
        pass  # os.remove(tempfile)

    if status != 0:
        raise RunnableError('PDFBox returned error status code {0}.\nPossible error:\n{1}'.format(status, stderr))

    # We can use result from PDFBox directly, no manipulation needed
    pdf_plain_text = stdout
    return pdf_plain_text


def external_process(process_args, input_data='', timeout=None):
    process = subprocess.Popen(process_args,
                               stdout=subprocess.PIPE,
                               stdin=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    try:
        (stdout, stderr) = process.communicate(input_data, timeout)
    except subprocess.TimeoutExpired as e:
        # cleanup process
        # see https://docs.python.org/3.3/library/subprocess.html?highlight=subprocess#subprocess.Popen.communicate
        process.kill()
        process.communicate()
        raise e
    exit_status = process.returncode
    return (exit_status, stdout, stderr)


def temp_file(data, suffix=''):
    handle, file_path = tempfile.mkstemp(suffix=suffix)
    f = os.fdopen(handle, 'w')
    f.write(data)
    f.close()
    return file_path


if __name__ == '__main__':
    filename = 'document.pdf'  # placeholder: path to the PDF you want to extract
    text = extractPdf(filename, 'pdfbox-app-2.0.3.jar')
This code was not entirely written by me. I followed the suggestions of other Stack Overflow answers, but it was a month ago, so I lost the original sources. If anyone finds the original posts where I got the pieces of this code, please let me know so I can give them the credit they deserve.
Recently, we replaced curl with aria2c in order to download files faster from our backend servers for later conversion to different formats.
Now for some reason we ran into the following issue with aria2c:
Pool callback raised exception: InterfaceError(0, '')
It's not clear to us where this InterfaceError occurs or what it could actually mean. Moreover, we can run the same command manually without any problems.
Please also have a look at our download function:
def download_file(descriptor):
    """
    Creates the WORKING_DIR structure and downloads the descriptor.
    The descriptor should be a URI (processed via aria2c).
    Returns the created resource path.
    """
    makedirs(WORKING_DIR + 'output/', exist_ok=True)
    file_path = WORKING_DIR + decompose_uri(descriptor)['fileNameExt']
    print(file_path)
    try:
        print(descriptor)
        exec_command(f'aria2c -x16 "{descriptor}" -o "{file_path}"')
    except CalledProcessError as err:
        log('DEBUG', f'Aria2C error: {err.stderr}')
        raise VodProcessingException("Download failed. Aria2C error")
    return file_path
def exec_command(string):
    """
    Shell command interface.
    Returns returncode, stdout, stderr.
    """
    log('DEBUG', f'[Command] {string}')
    output = run(string, shell=True, check=True, capture_output=True)
    return output.returncode, output.stdout, output.stderr
Is stdout here maybe misinterpreted by Python, which then drops into this InterfaceError?
Thanks in advance
As I only wanted aria2c to download files faster, since it supports multiple connections, I have now switched over to a tool called "axel". It also supports multiple connections, without the excessive overhead aria2c has, at least for me in this situation.
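For reference, the download call roughly becomes something like this (an untested sketch reusing the helpers from the question; axel's -n flag sets the connection count and -o the output path):
def download_file(descriptor):
    makedirs(WORKING_DIR + 'output/', exist_ok=True)
    file_path = WORKING_DIR + decompose_uri(descriptor)['fileNameExt']
    try:
        # -n: number of connections, -o: output file (axel's equivalents of aria2c -x16 / -o)
        exec_command(f'axel -n 16 -o "{file_path}" "{descriptor}"')
    except CalledProcessError as err:
        log('DEBUG', f'axel error: {err.stderr}')
        raise VodProcessingException("Download failed. axel error")
    return file_path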
I have a basic script that reads a file containing package names, builds a command string from each one, and stores it in a variable.
I then call os.popen to run the command and store the output in a variable for further processing.
I loop over the output looking for an 'Error:' string and print any match. That all works, but it only prints the error; I also want to know which package caused it. Even when I include the package variable in the print, I only get the error.
Here are the contents of the file:
kernel-3.10.0-1160.el7
openshift-clients-4.3.7-202003130552.git.0.6027a27.el7
NetworkManager-config-server-1.18.8-1.el7
python2-psutil-5.6.6-1.el7ar
systemd-219-67.el7_7.1.x86_64
Here is the script:
import os
import sys

f = open("data1", "r")
for pkg in f:
    #print(pkg)
    command = 'yum --showduplicates list ' + pkg
    with os.popen(command) as results_in:
        for item in results_in:
            if 'Error:' in item:
                print(item + "package name:" + pkg)
Here are the results of the script:
Error: No matching Packages to list
I was hoping to get the error + package name.
Can someone please tell me what I need to do to make the proper adjustments?
yum is writing the error message to stderr, not stdout. What you're seeing is the error message being printed by yum, not from your script.
You need to redirect stderr to stdout so you can capture it and check it.
It's also a good idea to remove the trailing newline from the line being read from the file, so do pkg = pkg.strip()
command = 'yum --showduplicates list ' + pkg + ' 2>&1'
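Putting both suggestions together, the loop might look roughly like this (an untested sketch):
import os

f = open("data1", "r")
for pkg in f:
    pkg = pkg.strip()  # drop the trailing newline
    # 2>&1 merges stderr into stdout so os.popen can see the "Error:" line
    command = 'yum --showduplicates list ' + pkg + ' 2>&1'
    with os.popen(command) as results_in:
        for item in results_in:
            if 'Error:' in item:
                print(item.strip() + " package name: " + pkg)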
I wrote the script another way to get the data I'm looking for. Thank you for your help! You sparked the idea of checking stderr, so I chased that method to capture it and respond based on it.
import subprocess
import shlex

f = open("data1", "r")
for pkg in f:
    command = 'yum list available ' + pkg
    proc = subprocess.Popen(shlex.split(command), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output = proc.stdout.readline()
    stdout, stderr = proc.communicate()
    if 'Error' in str(stderr, 'utf-8').strip():
        print("Error not available: " + pkg)
    else:
        print("Package available: " + pkg)
Hi, I'm trying to make a video converter for Django with Python. I forked the django-ffmpeg module, which does almost everything I want, except that it doesn't catch errors if the conversion fails.
Basically, the module passes the ffmpeg conversion command to the command line interface, like this:
/usr/bin/ffmpeg -hide_banner -nostats -i %(input_file)s -target film-dvd %(output_file)s
The module uses this method to pass the ffmpeg command to the CLI and get the output:
def _cli(self, cmd, without_output=False):
    print 'cli'
    if os.name == 'posix':
        import commands
        return commands.getoutput(cmd)
    else:
        import subprocess
        if without_output:
            DEVNULL = open(os.devnull, 'wb')
            subprocess.Popen(cmd, stdout=DEVNULL, stderr=DEVNULL)
        else:
            p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
            return p.stdout.read()
But if, for example, you upload a corrupted video file, it only returns the ffmpeg message printed on the CLI; nothing is triggered to indicate that something failed.
This is a sample of ffmpeg's output when the conversion failed:
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x237d500] Format mov,mp4,m4a,3gp,3g2,mj2 detected only with low score of 1, misdetection possible!
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x237d500] moov atom not found
/home/user/PycharmProjects/videotest/media/videos/orig/270f412927f3405aba041265725cdf6b.mp4: Invalid data found when processing input
I was wondering if there's any way to turn that into an exception, and how, so I can handle it easily.
The only option that came to my mind is to search for "Invalid data found when processing input" in the CLI output string, but I'm not sure this is the best approach. Can anyone help and guide me with this, please?
You need to check the returncode of the Popen object that you're creating.
Check the docs: https://docs.python.org/3/library/subprocess.html#subprocess.Popen
Your code should wait for the subprocess to finish (with wait) and then check the returncode. If the returncode is != 0, you can raise any exception you want.
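A minimal sketch of that idea (not the module's actual code; run_ffmpeg is just a placeholder name):
import subprocess

def run_ffmpeg(cmd):
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    stdout, stderr = p.communicate()  # waits for ffmpeg to exit
    if p.returncode != 0:
        # non-zero exit code: the conversion failed, raise whatever exception fits
        raise RuntimeError('ffmpeg failed with code %d: %s' % (p.returncode, stderr))
    return stdout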
This is how I implemented it in case it's useful to someone else:
def _cli(self, cmd):
    errors = False
    import subprocess
    try:
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
        stdoutdata, stderrdata = p.communicate()
        if p.wait() != 0:
            # Handle error / raise exception
            errors = True
            print "There were some errors"
            return stderrdata, errors
        print 'conversion success '
        return stderrdata, errors
    except OSError as e:
        errors = True
        return e.strerror, errors
How can I use ffprobe in a Python script and also export the output as a CSV file?
The command I want to perform is:
ffprobe -i file_name -show_frames -select_streams v:1 -print_format csv > filename.csv
I looked at other posts about a similar problem and changed it a little:
import subprocess

def probe_file(filename):
    cmnd = ['ffprobe', '-i', filename, '-show_frames', '-select_streams', 'a', '-print_format', 'csv']
    p = subprocess.Popen(cmnd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print filename
    out, err = p.communicate()
    print "==========output=========="
    print out
    if err:
        print "========= error ========="
        print err
However, I can't seem to get the "> filename.csv" part working.
After analysing the video, I want all the output saved as a CSV file named after the input file.
Does anyone know how I can approach this?
Thanks in advance
If you want the data available to python (as a list of lists):
data = [l.split(',') for l in out.decode('ascii').splitlines()]
If you just want to dump the data to a file:
with open("{}.csv".format(filename), "w") as f:
    f.write(out.decode('ascii'))
Note that I'm using Python 3, hence the decode()s (out is bytes, and must be decoded). You're clearly using Python 2, so the decode()s probably aren't necessary.
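Putting the pieces together, the whole flow might look roughly like this (an untested sketch; probe_to_csv is just an illustrative name, and writing the raw bytes sidesteps the decode question):
import subprocess

def probe_to_csv(filename):
    cmnd = ['ffprobe', '-i', filename, '-show_frames', '-select_streams', 'v:1',
            '-print_format', 'csv']
    p = subprocess.Popen(cmnd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    if p.returncode != 0:
        raise RuntimeError(err.decode('utf-8', 'replace'))
    # name the CSV after the input file, e.g. video.mp4 -> video.mp4.csv
    with open("{}.csv".format(filename), "wb") as f:
        f.write(out)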
I would like to log all the output of a Python script. I tried:
import sys
log = []
class writer(object):
def write(self, data):
log.append(data)
sys.stdout = writer()
sys.stderr = writer()
Now, if I "print 'something'" it gets logged. But if I make, for instance, a syntax error, say "print 'something# ", it won't get logged; it will go to the console instead.
How do I capture also the errors from Python interpreter?
I saw a possible solution here:
http://www.velocityreviews.com/forums/showpost.php?p=1868822&postcount=3
but the second example logs into /dev/null - this is not what I want. I would like to log it into a list like my example above or StringIO or such...
Also, preferably I don't want to create a subprocess (and read its stdout and stderr in separate thread).
I have a piece of software I wrote for work that captures stderr to a file like so:
import sys
sys.stderr = open('C:\\err.txt', 'w')
so it's definitely possible.
I believe your problem is that you are creating two instances of writer.
Maybe something more like:
import sys

class writer(object):
    log = []

    def write(self, data):
        self.log.append(data)

logger = writer()
sys.stdout = logger
sys.stderr = logger
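A quick way to check it works (a small sketch):
print('to stdout')
sys.stderr.write('to stderr\n')

sys.stdout = sys.__stdout__  # restore the real streams before inspecting
sys.stderr = sys.__stderr__
print(logger.log)            # shows the captured writes from both streams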
You can't do anything in Python code that can capture errors during the compilation of that same code. How could it? If the compiler can't finish compiling the code, it won't run the code, so your redirection hasn't even taken effect yet.
That's where your (undesired) subprocess comes in. You can write Python code that redirects the stdout, then invokes the Python interpreter to compile some other piece of code.
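For example, a sketch of that approach ('other_script.py' is just a placeholder for the file you actually want to run):
import subprocess
import sys

proc = subprocess.Popen([sys.executable, 'other_script.py'],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
output, _ = proc.communicate()
log = output.splitlines()  # syntax errors from other_script.py are captured here too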
I can't think of an easy way. The Python process's standard error lives on a lower level than a Python file object (C vs. Python).
You could wrap the Python script in a second Python script and use subprocess.Popen. It's also possible to pull off some magic like this in a single script:
import os
import subprocess
import sys
cat = subprocess.Popen("/bin/cat", stdin=subprocess.PIPE, stdout=subprocess.PIPE)
os.close(sys.stderr.fileno())
os.dup2(cat.stdin.fileno(), sys.stderr.fileno())
And then use select.poll() to check cat.stdout regularly to find output.
Yes, that seems to work.
The problem I foresee is that most of the time, something printed to stderr by python indicates it's about to exit. The more usual way to handle this would be via exceptions.
Edit: Somehow I missed the os.pipe() function.
import os, sys
r, w = os.pipe()
os.close(sys.stderr.fileno())
os.dup2(w, sys.stderr.fileno())
Then read from r
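For example, something like this (a sketch; it assumes the r and w from above, and select.poll is Unix-only):
import select

poller = select.poll()
poller.register(r, select.POLLIN)
if poller.poll(0):               # non-blocking check for pending data
    captured = os.read(r, 4096)  # whatever was written to stderr since the dup2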
To redirect the output and errors on Windows, you can use the following command outside of your Python file:
python a.py 1> a.out 2>&1
Source: https://support.microsoft.com/en-us/help/110930/redirecting-error-messages-from-command-prompt-stderr-stdout
Since Python 3.5 you can use contextlib.redirect_stdout (and contextlib.redirect_stderr for the error stream):
from contextlib import redirect_stdout

with open('help.txt', 'w') as f:
    with redirect_stdout(f):
        help(pow)
For such a requirement, it is usually much easier to do it at the OS level instead of in Python.
For example, if you're going to run "a.py" and record all the messages it generates into the file "a.out", it would just be
python a.py > a.out 2>&1
The first part, > a.out, redirects stdout to a file called a.out; the second part, 2>&1, then redirects stderr to stdout (0: stdin, 1: stdout, 2: stderr), so errors end up in the same file. Note that the order matters: 2>&1 must come after the file redirection.
And as far as I know, this command works on Windows, Linux and macOS! For other file redirection techniques, just search for your OS plus "file redirection".
I found this approach to redirecting stderr particularly helpful. Essentially, you need to understand whether your output is going to stdout or stderr. The difference? Stdout is the normal output of a shell command (think an 'ls' listing), while stderr is any error output.
It may be that you want to take a shell command's output and redirect it to a log file only if it is normal output. Using ls as an example here, with the all-files flag:
# Imports
import sys
import subprocess
# Open file
log = open("output.txt", "w+")
# Declare command
cmd = 'ls -a'
# Run shell command piping to stdout
result = subprocess.run(cmd, stdout=subprocess.PIPE, shell=True)
# Assuming utf-8 encoding
txt = result.stdout.decode('utf-8')
# Write and close file
log.write(txt)
log.close()
If you wanted to make this an error log, you could do the same with stderr. It's exactly the same code as for stdout, with stderr in its place. This pipes any error messages that would be sent to the console into the log instead. Doing so actually keeps them from flooding your terminal window as well!
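For instance, the stderr version would look something like this (a sketch, reusing cmd from the block above):
# Run shell command, piping only stderr
result = subprocess.run(cmd, stderr=subprocess.PIPE, shell=True)
# Assuming utf-8 encoding
err_txt = result.stderr.decode('utf-8')
# Write the captured error output to its own log
with open("error.txt", "w+") as err_log:
    err_log.write(err_txt)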
Saw this was a post from a while ago, but figured this could save someone some time :)
import sys
import tkinter

# ********************************************

def mklistenconsswitch(*printf: callable) -> callable:
    def wrapper(*fcs: callable) -> callable:
        def newf(data):
            [prf(data) for prf in fcs]
        return newf

    stdoutw, stderrw = sys.stdout.write, sys.stderr.write
    funcs = [
        (wrapper(sys.stdout.write, *printf), wrapper(sys.stderr.write, *printf)),
        (stdoutw, stderrw),
    ]

    def switch():
        sys.stdout.write, sys.stderr.write = dummy = funcs[0]
        funcs[0] = funcs[1]
        funcs[1] = dummy

    return switch

# ********************************************

def datasupplier():
    i = 5.5
    while i > 0:
        yield i
        i -= .5

def testloop():
    print(supplier.__next__())
    svvitch()
    root.after(500, testloop)

root = tkinter.Tk()
cons = tkinter.Text(root)
cons.pack(fill='both', expand=True)

supplier = datasupplier()
svvitch = mklistenconsswitch(lambda text: cons.insert('end', text))

testloop()
root.mainloop()
Python will not execute your code if there is an error. But you can import your script from another script and catch exceptions. Example:
Script.py
print 'something#
FinalScript.py
from importlib.machinery import SourceFileLoader

try:
    SourceFileLoader("main", "<SCRIPT PATH>").load_module()
except Exception as e:
    # Handle the exception here
    pass
To add to Ned's answer, it is difficult to capture the errors on the fly during the compilation.
You can write several print statements in your script and redirect stdout to a file; it will stop writing to the file when the error occurs. To debug the code, you could check the last logged output and then check your script after that point.
Something like this:
# Add to the beginning of the script execution (e.g. under if __name__ == "__main__":).
import os
import sys
from datetime import datetime

dt = datetime.now()
script_dir = os.path.dirname(os.path.abspath(__file__))  # gets the path of the script
stdout_file = script_dir + r'\logs\log' + ('').join(str(dt.date()).split("-")) + r'.log'
sys.stdout = open(stdout_file, 'w')
This will create a log file and stream the print statements to the file.
Note: Watch out for escape characters in your file path when concatenating with script_dir in the second-to-last line of the code. You might want something like a raw string. You can check this thread for that.
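For example, something like this avoids the backslash worry entirely (a sketch of the same snippet using os.path.join):
import os
import sys
from datetime import datetime

dt = datetime.now()
script_dir = os.path.dirname(os.path.abspath(__file__))
log_name = 'log' + dt.strftime('%Y%m%d') + '.log'          # same date-stamped name
stdout_file = os.path.join(script_dir, 'logs', log_name)   # no manual backslashes
sys.stdout = open(stdout_file, 'w')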