While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error:
pdf2txt.py 2.pdf
Traceback (most recent call last):
File "/usr/local/bin/pdf2txt.py", line 115, in <module>
if __name__ == '__main__': sys.exit(main(sys.argv))
File "/usr/local/bin/pdf2txt.py", line 109, in main
interpreter.process_page(page)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
self.init_resources(resources)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
font = self.get_font(None, subspec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
font = PDFCIDFont(self, spec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
StringIO(self.fontfile.get_data()))
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
While the similar file (1.pdf) doesn't cause a problem.
I can't find any information about the error. I added an issue on the pdfminer GitHub repository, but it remained unanswered. Can someone explain to me why this is happening? What can I do to parse 2.pdf?
Update: I get a similar error with BytesIO instead of StringIO after installing pdfminer directly from the GitHub repository.
$ pdf2txt.py 2.pdf
Traceback (most recent call last):
File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
if __name__ == '__main__': sys.exit(main(sys.argv))
File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
interpreter.process_page(page)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
self.init_resources(resources)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
font = self.get_font(None, subspec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
font = PDFCIDFont(self, spec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
BytesIO(self.fontfile.get_data()))
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
TL; DR
Thanks to #mkl and #hynecker for the extra info... With that I can confirm this is a bug in pdfminer and your PDF. Whenever pdfminer tries to get embedded file streams (e.g. font definitions), it is picking up the last one in the file before an endobj. Sadly, not all PDFs rigorously add the end tag and so pdfminer should be resilient to this.
Quick fix for this issue
I've created a patch - which has been submitted as a pull request on github. See https://github.com/euske/pdfminer/pull/159.
Detailed diagnosis
As mentioned in the other answers, the reason you're seeing this is that you're not getting the expected number of bytes from the stream as pdfminer is unpacking the data. But why?
As you can see in your stack trace, pdfminer (rightly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading out a set of binary tables.
This doesn't work for 2.pdf because it has a text stream. You can see this by running dumppdf -b -i 18 2.pdf. I've put the start here:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
>> def /CMapName /Adobe-Identity-UCS def
...
So, garbage in, garbage out... Is this a bug in your file or pdfminer? Well, the fact that other readers can handle it made me suspicious.
Digging around a little more, I see that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that these cannot be the same.
Digging in to the code further, I see that all streams are getting the same data. Oops! This is the bug. The cause appears to be related to the fact that this PDF is missing some end tags - as noted by #hynecker.
The fix is to return the right data for each stream. Any other fix to just swallow the error will result in bad data being used for all streams and so, for example, incorrect font definitions.
I believe the attached patch will fix your problem and should be safe to use in general.
I fixed your problem in the source code, and I try on your file 2.pdf to make sure it worked.
In the file pdffont.py I replaced:
class TrueTypeFont(object):
class CMapNotFound(Exception):
pass
def __init__(self, name, fp):
self.name = name
self.fp = fp
self.tables = {}
self.fonttype = fp.read(4)
(ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
for _ in xrange(ntables):
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
self.tables[name] = (offset, length)
return
by this:
class TrueTypeFont(object):
class CMapNotFound(Exception):
pass
def __init__(self, name, fp):
self.name = name
self.fp = fp
self.tables = {}
self.fonttype = fp.read(4)
(ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
for _ in xrange(ntables):
fp_bytes = fp.read(16)
if len(fp_bytes) < 16:
break
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp_bytes)
self.tables[name] = (offset, length)
return
Explanations
#Nabeel Ahmed was right
The foramt string >4sLLL requires 16 bytes size of buffer, which is specified correctly to fp.read to read 16 bytes at a time.
So, the problem can only be with the buffer stream it's reading i.e. the content of your specific PDF file.
In the code we see that fp.read(16) are made in a loop without any check.Thus, we don't know for sure if it successfully read it all. It could for instance reached an EOF.
To avoid this problem, I just break out of the for loop when this kind of problem appears.
for _ in xrange(ntables):
fp_bytes = fp.read(16)
if len(fp_bytes) < 16:
break
In any regular cases, it shouldn't change anything anyway.
I will try to do a pull request on github, but I'm not even sure it will be accepted so I suggest you do a monkey patch for now and modify your /home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py file right now.
This is really an invalid PDF because there are some missing keywords endobj after three indirect objects. (object 5, 18 and 22)
The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white space), followed by the value of the object bracketed between the keywords obj and endobj.
(chapter 7.3.10 in PDF reference)
The example 2.pdf is a simple PDF 1.3 version that uses a simple uncompressed cross reference and uncompressed object separators. The failure can be easily found by grep command and by a general file viewer that the PDF has 22 indirect objects. The pattern " obj" is found correctly exactly 22 times (never accidentally in a string object or in a stream, fortunately for simplicity), but the keyword endobj is three times missing.
$ grep --binary-files=text -B1 -A2 -E " obj|endobj" 2.pdf
...
18 0 obj
<< /Length 451967/Length1 451967/Filter [/FlateDecode] >>
stream
...
endstream % # see the missing "endobj" here
17 0 obj
<< /Length 12743 /Filter [/FlateDecode] >>
stream
...
endstream
endobj
...
Similarly the object 5 has no endobj before object 1 and the object 22 has no endobj before object 21.
It is known that broken cross references in PDF can be and should be usually reconstructed by obj/endobj keywords (see the PDF reference, chapter C.2) Some applications do probably vice-versa fix missing endobj if cross references are correct, but it is no written advice.
The last error message tells you a lot:
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in
init
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
You can easily debug what is going on, for example, by putting necessary debug statements exactly in pdffont.py file. My guess is that there is something special about your pdf contents. Judging by the method name - TrueTypeFont - which throws the error message, there is some incompatibility with the font type.
Let start with explaining the statement where you're getting exception:
struct.unpack('>4sLLL', fp.read(16))
where the synopsis is:
struct.unpack(fmt, buffer)
The method unpack, unpacks from the buffer buffer (which
presumably earlier packed by pack(fmt, ...)) according to the
format string fmt. The result is a tuple even if it
contains exactly one item. The buffer’s size in bytes must match the
size required by the format, as reflected by calcsize().
The most common case is, wrong number of bytes (16) for the format used (>4sLLL) - for example, for a format expecting 4 bytes, you have specified 3 bytes:
(name, tsum, offset, length) = struct.unpack('BH', fp.read(3))
for this you'll get
struct.error: unpack requires a string argument of length 4
The reason - the format struct ('BH') expects 4 bytes i.e. when we pack something using 'BH' format it'll occupy 4 bytes of memory.
A good explanation here.
To clarify it further - let's look into the >4sLLL format string. To verify the size unpack 'd be expecting for the buffer (the bytes you're reading from the PDF file). Quoting from docs:
The buffer’s size in bytes must match the size required by the format,
as reflected by calcsize().
>>> import struct
>>> struct.calcsize('>4sLLL')
16
>>>
To this point we can say there's nothing wrong with the statement:
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
The foramt string >4sLLL requires 16 bytes size of buffer, which is specified correctly to fp.read to read 16 bytes at a time.
So, the problem can only be with the buffer stream it's reading i.e. the content of your specific PDF file.
Can be a bug - as per this comment:
This is a bug in the upstream PDFminer by #euske There seems to be
patches for this so it should be an easy fix. Beyond this I also need
to strengthen the pdf parsing such that we never error out from a
failed parse
I'll edit the question it I find something helpful to add here - a solution, or a patch.
In case you still get some struct errors after applying Peter's patch, especially when parsing many files in one script's run (using os.listdir), try changing resource manager caching to false.
rsrcmgr = PDFResourceManager(caching=False)
It helped me to get rid of the rest of errors after applying above solutions.
Related
In case of the file size is under 5000 bytes (InMemoryUploadedFile).
This code doesn't work
mime_type = magic.from_buffer(file.read(), mime=True)
It returns wrong mime_type.
For example, I have a file cv.docx with 4074 bytes size.
It returns a mime_type:
'application/x-empty'
instead of
'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
Could you please suggest me any advices to solve this case?
I had this problem as well. It's very likely not to do with the file size, because I have tested magic.from_buffer on 90 byte text/plain files as well and it returned the right value.
The problem is that the file has somehow become empty. In my case, this is because the file was a stream and I had already read from the stream (remember if you read from a stream and read again, the second read will start where the first read finished -- unlike reading from the start of a file each time).
This example is from flask
mime_type1 = magic.from_buffer(request.stream.read(2048), mime=True) // returns text/plain
mime_type = magic.from_buffer(request.files["file"].stream.read(2048), mime=True) // returns application/x-empty because the stream has already been read from
It's hard to exactly diagnose without seeing your earlier code but check where else you are working with the file and comment those out.
You might need to do something like
file.seek(0)
mime_type = magic.from_buffer(file.read(), mime=True)
I am getting the following error, when reading certain PDF files using PyPDF2. Due to the confidential nature of these documents, I can't share them, but I can try and provide information which can help solve this problem.
Stacktrace -
inputpdf = PdfFileReader(open(pdfpath, "rb"), strict=False)
File "/home/tata/.virtualenvs/obu/local/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1084, in __init__
self.read(stream)
File "/home/tata/.virtualenvs/obu/local/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1732, in read
num = readObject(stream, self)
File "/home/tata/.virtualenvs/obu/local/lib/python2.7/site-packages/PyPDF2/generic.py", line 74, in readObject
return BooleanObject.readFromStream(stream)
File "/home/tata/.virtualenvs/obu/local/lib/python2.7/site-packages/PyPDF2/generic.py", line 137, in readFromStream
raise utils.PdfReadError('Could not read Boolean object')
PdfReadError: Could not read Boolean object
The exception seems to be raised from the following function, in generic.py:
def readFromStream(stream):
word = stream.read(4)
if word == b_("true"):
return BooleanObject(True)
elif word == b_("fals"):
stream.read(1)
return BooleanObject(False)
else:
raise utils.PdfReadError('Could not read Boolean object')
Printing the variable word prints the string trai, but I am not sure what this string represents.
Since the PyPDF2 project seems unmaintained, can someone help me figure out a solution for this?
Note : Please note that these PDFs are not password protected.
It seems as if all pdfs are encrypted in some way. Using the solution cited in this issue #53 in PyPDF2's github repository, I used the following command to generate another pdf ( The Decrypted version of the original pdf ) -
qpdf --password= --decrypt input.pdf output.pdf
and then reading output.pdf worked for me. I am not sure as to how I can determine beforehand, whether a pdf is encrypted ( or in this particular state ) or not. But this solution temporarily solves the problem.
For a larger project, I'm currently in the process of writing a Stanford polygon file (PLY) parser. The example at Github Gists is currently capable of parsing ASCII-format PLY files into a data abstraction Mesh. It also contains a description of the actual grammar, for those inclined.
However the format definition (PLY - Polygon File Format) also includes two binary formats (little and big endian). Since those two formats are much more common (and storage-space efficient), I would like to be able to parse those files with pyparsing as well.
I'm grateful for some advice on how to do that, if at all possible.
The idea of the binary PLY files is that, the header portion consists of an ASCII description of the actual data of the file, and the body contains the actual data. An example (data in brackets are hex bytes):
ply
format binary_little_endian 1.0
element vertex 1
property float x
property float y
property float z
property uchar red
property uchar green
property uchar blue
property uchar alpha
end_header
[84 72 F1 C1 D8 FD 9F C1 00 00 00 00 3B 45 CB FF]
My first approach was to just load the input file in binary format (using bytes instead of str), and adapt the parser accordingly, but this somehow throws pyparsing off track. Also, I don't really know how to tell pyparsing how to grok byte groups.
File "components.py", line 338, in create
mesh = PlyParser.create().load(mesh_path)
File "model_parser.py", line 120, in create
property_position = aggregate_property("position", b"x", b"y", b"z")
File "model_parser.py", line 113, in aggregate_property
aggregates.append(pp.Group(property_simple_prefix + keyword_or(*keywords)("name")))
File "model_parser.py", line 87, in keyword_or
return pp.Or(pp.CaselessKeyword(literal) for literal in keywords)
File "pyparsing.py", line 3418, in __init__
super(Or,self).__init__(exprs, savelist)
File "pyparsing.py", line 3222, in __init__
exprs = list(exprs)
File "model_parser.py", line 87, in <genexpr>
return pp.Or(pp.CaselessKeyword(literal) for literal in keywords)
File "pyparsing.py", line 2496, in __init__
super(CaselessKeyword,self).__init__( matchString, identChars, caseless=True )
File "pyparsing.py", line 2422, in __init__
self.matchLen = len(matchString)
TypeError: object of type 'int' has no len()
What you might want to try is to open the file as text, use pyparsing to parse the header and capture the end position of the "end header" token. Use the structure information extracted from the header to build a Python struct reader that will process the binary content. Then reopen the file as binary, seek to the position, and use the struct reader to load the binary content. Probably simpler than twisting pyparsing to be both text and binary.
There is already a module for parsing binary PLY files: python-plyfile.
You could either use this or at least look at the source code to get an idea how it works.
It uses numpy.fromfile - which is described as a "highly efficient way of reading binary data with a known data-type" - to to the binary data reading.
Recently, without changes to codes/libs, I started getting python error_proto: line too long error when reading email (poplib.retr) from hotmail inbox. I am using Python version 2.7.8. I understand that a long line may be caused this error. But is there a way to go around this or a certain version I need to put in place. Thank you for any advice/direction anyone can give.
Here is a traceback error:
"/opt/rh/python27/root/usr/lib64/python2.7/poplib.py", line 232, in retr\n return self._longcmd(\'RETR %s\' % which)\n',
' File "/opt/rh/python27/root/usr/lib64/python2.7/poplib.py", line 167, in _longcmd\n return self._getlongresp()\n',
' File "/opt/rh/python27/root/usr/lib64/python2.7/poplib.py", line 152, in _getlongresp\n line, o = self._getline()\n',
' File "/opt/rh/python27/root/usr/lib64/python2.7/poplib.py", line 377, in _getline\n raise error_proto(\'line too long\')\n',
'error_proto: line too long\n'
A python bug report exists for this issue here: https://bugs.python.org/issue16041
The work around I put inplace was as follows:
import poplib
poplib._MAXLINE=20480
I thought this was a better idea, rather than editing the poplib.py library file directly.
Woody
Are you sure you've not updated poplib? Have a look at the most recent diff, committed last night:
# Added:
...
# maximal line length when calling readline(). This is to prevent
# reading arbitrary length lines. RFC 1939 limits POP3 line length to
# 512 characters, including CRLF. We have selected 2048 just to be on
# the safe side.
_MAXLINE = 2048
...
# in_getline()...
if len(self.buffer) > _MAXLINE:
raise error_proto('line too long')
...it looks suspiciously similar to your problem.
So if you roll back to the previous version, it will probably be OK.
I've been previously writing code for a quiz program with a text file that stores all of the participants' results. The code that converts the text file to a dictionary and the text file itself are shown below:
Code:
import collections
from collections import defaultdict
scores_guessed = collections.defaultdict(lambda: collections.deque(maxlen=4))
with open('GuessScores.txt') as f:
for line in f:
name,val = line.split(":")
scores_guessed[name].appendleft(int(val))
for k in sorted(scores_guessed):
print("\n"+k," ".join(map(str,scores_guessed[k])))
writer = open('GuessScores.txt', 'wb')
for key, value in scores_guessed.items():
output = "%s:%s\n" % (key,value)
writer.write(output)
The text file appears like this:
Jack:10
Dave:20
Adam:30
Jack:40
Adam:50
Dave:60
Jack:70
Dave:80
Jack:90
Jack:100
Dave:110
Dave:120
Adam:130
Adam:140
Adam:150
Now, when I run the program code, the dictionary appears like this:
Adam 150 140 130 50
Dave 120 110 80 60
Jack 100 90 70 40
Now, this arranges the dictionary into order of highest scores, and the top 4 scores!
I want the python IDLE to overwrite the GuessScores.txt to this:
Adam:150
Adam:140
Adam:130
Adam:50
Dave:120
Dave:110
Dave:80
Dave:60
Jack:100
Jack:90
Jack:70
Jack:40
BUT when I run the code, this error appears:
Traceback (most recent call last):
File "/Users/Ahmad/Desktop/Test Files SO copy/readFile_prompt.py", line 16, in <module>
writer.write(output)
TypeError: 'str' does not support the buffer interface
The GuessScores.txt file is empty because it cannot write to the file, since there is the error above.
Why is this happening? And what is the fix? I have asked this previously but there were numerous issues. I am running Python 3.3.2 on a Mac 10.8 Mavericks iMac, if that makes any help.
Thanks,
Delbert.
The first issue is that you are trying to write text to a file that you opened in binary mode. In 3.x, this will no longer work. "text" vs. "binary" used to mean very little (only affecting line-ending translation, so no difference at all on some systems). Now it means like what it sounds like: a file opened in text mode is one whose contents are to be treated like text with some specific encoding, and a file opened in binary mode is one whose contents are to be treated as a sequence of bytes.
Thus, you need open('GuessScores.txt', 'w'), not open('GuessScores.txt', 'wb').
That said, you really should be using with blocks to manage the files, and you're going to have to write code that actually formats the dictionary content in the way you want. I assume you intend to output in sorted name order, and you need to iterate over each deque and write a line for each item. Something like:
with open('GuessScores.txt', 'w') as f:
for name, scores in sorted(scores_guessed.items()):
for score in scores:
f.write("{}:{}\n".format(name, score))
(Note also the new-style formatting.)
If necessary, you can explicitly specify the encoding of the file in the open call, with the encoding keyword parameter. (If you don't know what I mean by "encoding", you must learn. I'm serious. Drop everything and look it up.)
The writing problem has to do with the b in your open function. You've opened it in binary mode, so only bytes can be written. You can either remove the b or call bytes on output to give it the right type. You have a logic error anyway though. When I run it on Python 2.7, the output to GuessedScores.txt is this:
Dave:deque([120,110,80,60],maxlen=4)
Jack:deque([100, 90, 70, 40], maxlen=4)
Adam:deque([150, 140, 130, 50], maxlen=4)
So your values are the whole deques, not the individual scores. You'll have to format them, similar to how you did in your print statement.