Parsing binary Stanford polygon files (PLY) with Pyparsing - python

For a larger project, I'm currently in the process of writing a Stanford polygon file (PLY) parser. The example at Github Gists is currently capable of parsing ASCII-format PLY files into a data abstraction Mesh. It also contains a description of the actual grammar, for those inclined.
However, the format definition (PLY - Polygon File Format) also includes two binary formats (little and big endian). Since those two formats are much more common (and more storage-efficient), I would like to be able to parse those files with pyparsing as well.
I would be grateful for some advice on how to do that, if it is possible at all.
The idea of the binary PLY format is that the header portion consists of an ASCII description of the data in the file, and the body contains the actual data. An example (data in brackets are hex bytes):
ply
format binary_little_endian 1.0
element vertex 1
property float x
property float y
property float z
property uchar red
property uchar green
property uchar blue
property uchar alpha
end_header
[84 72 F1 C1 D8 FD 9F C1 00 00 00 00 3B 45 CB FF]
My first approach was to just load the input file in binary format (using bytes instead of str), and adapt the parser accordingly, but this somehow throws pyparsing off track. Also, I don't really know how to tell pyparsing how to grok byte groups.
File "components.py", line 338, in create
mesh = PlyParser.create().load(mesh_path)
File "model_parser.py", line 120, in create
property_position = aggregate_property("position", b"x", b"y", b"z")
File "model_parser.py", line 113, in aggregate_property
aggregates.append(pp.Group(property_simple_prefix + keyword_or(*keywords)("name")))
File "model_parser.py", line 87, in keyword_or
return pp.Or(pp.CaselessKeyword(literal) for literal in keywords)
File "pyparsing.py", line 3418, in __init__
super(Or,self).__init__(exprs, savelist)
File "pyparsing.py", line 3222, in __init__
exprs = list(exprs)
File "model_parser.py", line 87, in <genexpr>
return pp.Or(pp.CaselessKeyword(literal) for literal in keywords)
File "pyparsing.py", line 2496, in __init__
super(CaselessKeyword,self).__init__( matchString, identChars, caseless=True )
File "pyparsing.py", line 2422, in __init__
self.matchLen = len(matchString)
TypeError: object of type 'int' has no len()

What you might want to try is to open the file as text, use pyparsing to parse the header, and capture the end position of the end_header token. Use the structure information extracted from the header to build a Python struct reader that will process the binary content. Then reopen the file as binary, seek to that position, and use the struct reader to load the binary content. This is probably simpler than twisting pyparsing to handle both text and binary.
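A minimal sketch of that two-stage idea, under some loud assumptions: the header is split with plain str.split() where the pyparsing grammar would normally sit, the stream is read in binary mode throughout (so no reopen and seek is needed), only the vertex element is decoded, and the names PLY_TO_STRUCT and read_ply_vertices are made up for illustration.

```python
import io
import struct

# PLY scalar type names mapped to struct format characters.
# Assumption: only these common type names appear in the header.
PLY_TO_STRUCT = {
    "char": "b", "uchar": "B",
    "short": "h", "ushort": "H",
    "int": "i", "uint": "I",
    "float": "f", "double": "d",
}
BYTE_ORDER = {"binary_little_endian": "<", "binary_big_endian": ">"}


def read_ply_vertices(stream):
    """Parse the ASCII header of a binary PLY stream, then unpack the vertices.

    Hypothetical helper: it raises KeyError for ASCII-format files and
    ignores every element other than "vertex".
    """
    byte_order = None
    count = 0
    names = []
    fmt = ""
    in_vertex = False
    while True:
        raw = stream.readline()
        if not raw:
            raise ValueError("end_header not found")
        words = raw.decode("ascii").split()
        if not words:
            continue
        if words[0] == "end_header":
            break  # the stream is now positioned at the first body byte
        if words[0] == "format":
            byte_order = BYTE_ORDER[words[1]]
        elif words[0] == "element":
            in_vertex = words[1] == "vertex"
            if in_vertex:
                count = int(words[2])
        elif words[0] == "property" and in_vertex:
            fmt += PLY_TO_STRUCT[words[1]]
            names.append(words[2])
    # One struct describes one vertex record; read `count` of them.
    record = struct.Struct(byte_order + fmt)
    return [
        dict(zip(names, record.unpack(stream.read(record.size))))
        for _ in range(count)
    ]
```

For a real file, the stream would come from open(path, "rb"); the header layout from the question (three floats followed by four uchars) produces the struct format "<fffBBBB", which is 16 bytes per vertex.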

There is already a module for parsing binary PLY files: python-plyfile.
You could either use this or at least look at the source code to get an idea how it works.
It uses numpy.fromfile - which is described as a "highly efficient way of reading binary data with a known data-type" - to do the binary data reading.

Related

Find a string in a binary file

I am trying to extract data from a binary file where the data chunks are "tagged" with ASCII text. I need to find the word "tracers" in the binary file so I can read the next 4 bytes (int).
I am trying to simply loop over the lines, decoding them and checking for the text, which works. But I am having trouble seeking to the correct place in the file directly after the text (the seek_to_key function):
from io import BytesIO
import struct
binary = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x00\xd6\x00\x8c<TE\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00tracers\x00\xf2N\x03\x00P\xd9U=6\x1e\x92=\xbe\xa8\x0b<\xb1\x9f\x9f=\xaf%\x82=3\x81|=\xbeM\xb4=\x94\xa7\xa6<\xb9\xbd\xcb=\xba\x18\xc7=\x18?\xca<j\xe37=\xbc\x1cm=\x8a\xa6\xb5=q\xc1\x8f;\xe7\xee\xa0=\xe7\xec\xf7<\xc3\xb8\x8c=\xedw\xae=C$\x84<\x94\x18\x9c=&Tj=\xb3#\xb3=\r\xdd3=\x0eL==4\x00~<\xc6q\x1e=pHw=\xc1\x9a\x92="\x08\x9a=\xe6a\xeb<\xa4#.=\xc4\x0f-=\xa9O\xcb=i\'\x15=\x94\x03\x80=\x8f\xcd\xaf=\xd6\x00\x8c<TE\x9f<m\x9ad<[;Q=\x157X=\x17\xf1u=\xb8(\xa4=\x13\xd3\xfa<\x811_=\xd1iX=Q\x17^;\xd1n\xbe=\xfcb\xcc=\xe8\x9b\x99=W\xa9\x16=\xc5\x83\xa4=\xc0%\x98<\xbb|\x99<>#\x8b:\x1cY\x82;\xb8T\xa4<Cv\x87="n\x1c<J\x152=\x1f\xb2\x9d=&\x18\xb6=\x8a\xf9{=\x0fT\xba=HrX=\xa0\\S=#\xee\xbd=\x1e,\xc5=y\rU<gK\x84=\xe3*\r=\x04\xc4M=\x98a\xb3<\x95 T=\xf2Z\x94=lL\x15=\x07\x1b^=\xf3W\x83<\xf6\xff\xa1<\xb8\xfb\xcb<p\xb4\xd8<\xc9#\xfd<s\xa6\x1f;\xbf7W<\x8a\x9c\x82<\x1c\xb7l=\xa7\xd0\xb7=\xe4\x8d\x97=\xe2\x7f\x82=\x82\xa1\xcc<\xdfs\xca=C\x10p=\xb4\xfa\xb0=\xf35\x87=\x9d\x8bR<d\xb9\x0c<\xb26\xcd=\r\xd5\x1d<\xf4p\xb1=f)\xaf=\xe2M\\=F|\xf9<\x9baW=\x85|\xa3=\x0f\xdd\xa1=\xb6f\xa9=\xcbW\xcf<\xfa\x1a\xbe=\xeb\xda\xb2=\x88\xfb\x8e=\x9f+$=\xbbS\xac;\xa2o\xb5=\x08\xca\xe5<\xc9IC=\xa8\x05\xa6=\xbc 
\xbd=\x8e\x8d}=U\xcd\xba=\xcbG\x89=}\xadg=Z\xad\x9f=_=\xb6:y\x1c==\xa5\x0b3<<\xe5\x1e=*\xa0\xb6=\n\xcd\xb8\xd9<u\xb5W=rZ\x88=\xe0w}=\xa5\xf0\xa0=\xf4\x91\x82=\xe4r\xc5<\x0e\x91A=Z\x9d-<[N:=\xf1\t\x1e=\xc5_\xc2=\xf8\xea\x98=t\xd7\xbf<~N\xce==#\x93=\x98A\xa7=c\x81x=\xe3\xc6\x94=\xe2&\xcc=\x05\xa9^=\xf7\x05\xa8=[m\x81=\x1b\x0b\x84=\xf5\x98\xb9=+\x90\xd8<\xa2\xcc\xa5=5^\x92=\x0e\x9d\x1d=\x96\xc7\x8b;\xc5E\x9e;r\x1e\xc7=\xea6\xbf=\x19mN;\xd9$D=\x85\xa9\x8b=!\xe9\x90=\xe4/~<\xc1\x9c\xaf=\xde\xe4\x18=e\xb0H=hLO;\x9f\xf8\x8b=p.\xcf=L\x1f\x01<\xea\x19\xaf=Z\xd5\xc2<\xb4\xd8\xcf=s\x84\x0c=\x987\xa5;\x19Z\x93=\x0c\x8fO=y/\x97=\xeaOG=\xb0Fl=\x03\x7f\xbe=\x96\n'
binary_data = BytesIO()
binary_data.write(binary)
binary_data.seek(0)
def seek_to_key(f, line_str, key):
    key_start = line_str.find(key)
    offset = len(line_str[key_start+len(key)].encode('utf-8'))
    f.seek(-offset, 1)
for line in binary_data:
    line_str = line.decode('utf-8', errors='replace')
    print(line_str)
    if 'tracers' in line_str:
        seek_to_key(binary_data, line_str, 'tracers')
        nfloats = struct.unpack('<i', binary_data.read(4))
        print(nfloats)
        break
Any recommendations on a better way to do this would be awesome!
It's not completely clear to me what you are trying to achieve. Please explain that in more detail if you want a better answer. What I understand from your current question and code is that you are trying to read the 32-bit number directly after the ASCII text 'tracers'. I'm guessing this is only the first step of your code, since the name 'nfloats' suggests that you will be reading a number of floats in the next step ;-) But I'll try to answer this question only.
There are a number of problems with your code:
First of all, a simple typo: Instead of line_str[key_start+len(key)] you probably meant line_str[key_start+len(key):]. (You missed the colon.)
You are mixing binary and text data. Why do you decode the binary data as UTF-8? It clearly isn't. You can't just "decode" binary data as UTF-8, slicing a piece of it, and then re-encode that using UTF-8. In this case, the part after your marker is 518 bytes, but when encoded as UTF-8 it becomes 920 bytes. This messes up your offset calculation. Tip: you can search binary data in binary data in Python :-) For example: b'Hello, world!'.find(b'world') returns 7. So you don't have to encode/decode the data at all.
You are reading line by line. Why is that? Lines are a concept of text files and don't have a real meaning in binary files. It could work, but that depends on the file format (which I don't know). In any case, your current code can only find one tracer per line. Is that intentional, or could there be more markers in each line? Anyway, if the file is small enough to fit in memory, it would be much easier to process the data in one chunk.
A minor note: you could write binary_data = BytesIO(binary) and avoid the additional write(). Also the seek(0) is not necessary.
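To make the points above concrete, here is a small sketch. It first shows that decoding arbitrary bytes as UTF-8 with errors='replace' and re-encoding changes the length, which is why offsets computed from the decoded text are unreliable. It then reads the integer by searching the bytes directly. The helper name read_int_after and the sample bytes are invented for illustration.

```python
import struct


def read_int_after(data, key):
    """Return the little-endian int32 stored right after `key`, or None."""
    pos = data.find(key)
    if pos == -1:
        return None
    # unpack_from reads at an offset without slicing or seeking.
    (value,) = struct.unpack_from('<i', data, pos + len(key))
    return value


# Two bytes that are not valid UTF-8 each become U+FFFD, which takes
# three bytes when encoded again, so the round trip is longer than the input.
raw = b'tracers\x00\x8c\xff'
round_trip = raw.decode('utf-8', errors='replace').encode('utf-8')
print(len(raw), len(round_trip))

# Searching bytes in bytes needs no decoding at all.
sample = b'\x00\x00tracers' + struct.pack('<i', 42) + b'\x00'
print(read_int_after(sample, b'tracers'))
```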
Example code
I think the following code gives the correct result. I hope it will be a useful start to finish your application. Note that this code conforms to the Style Guide for Python Code and that all pylint issues were resolved (except for a too long line and missing docstrings).
import io
import struct
DATA = b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\n\x00\x00\xd6\x00\x8c<TE\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00tracers\x00\xf2N\x03\x00P\xd9U=6\x1e\x92=\xbe\xa8\x0b<\xb1\x9f\x9f=\xaf%\x82=3\x81|=\xbeM\xb4=\x94\xa7\xa6<\xb9\xbd\xcb=\xba\x18\xc7=\x18?\xca<j\xe37=\xbc\x1cm=\x8a\xa6\xb5=q\xc1\x8f;\xe7\xee\xa0=\xe7\xec\xf7<\xc3\xb8\x8c=\xedw\xae=C$\x84<\x94\x18\x9c=&Tj=\xb3#\xb3=\r\xdd3=\x0eL==4\x00~<\xc6q\x1e=pHw=\xc1\x9a\x92="\x08\x9a=\xe6a\xeb<\xa4#.=\xc4\x0f-=\xa9O\xcb=i\'\x15=\x94\x03\x80=\x8f\xcd\xaf=\xd6\x00\x8c<TE\x9f<m\x9ad<[;Q=\x157X=\x17\xf1u=\xb8(\xa4=\x13\xd3\xfa<\x811_=\xd1iX=Q\x17^;\xd1n\xbe=\xfcb\xcc=\xe8\x9b\x99=W\xa9\x16=\xc5\x83\xa4=\xc0%\x98<\xbb|\x99<>#\x8b:\x1cY\x82;\xb8T\xa4<Cv\x87="n\x1c<J\x152=\x1f\xb2\x9d=&\x18\xb6=\x8a\xf9{=\x0fT\xba=HrX=\xa0\\S=#\xee\xbd=\x1e,\xc5=y\rU<gK\x84=\xe3*\r=\x04\xc4M=\x98a\xb3<\x95 T=\xf2Z\x94=lL\x15=\x07\x1b^=\xf3W\x83<\xf6\xff\xa1<\xb8\xfb\xcb<p\xb4\xd8<\xc9#\xfd<s\xa6\x1f;\xbf7W<\x8a\x9c\x82<\x1c\xb7l=\xa7\xd0\xb7=\xe4\x8d\x97=\xe2\x7f\x82=\x82\xa1\xcc<\xdfs\xca=C\x10p=\xb4\xfa\xb0=\xf35\x87=\x9d\x8bR<d\xb9\x0c<\xb26\xcd=\r\xd5\x1d<\xf4p\xb1=f)\xaf=\xe2M\\=F|\xf9<\x9baW=\x85|\xa3=\x0f\xdd\xa1=\xb6f\xa9=\xcbW\xcf<\xfa\x1a\xbe=\xeb\xda\xb2=\x88\xfb\x8e=\x9f+$=\xbbS\xac;\xa2o\xb5=\x08\xca\xe5<\xc9IC=\xa8\x05\xa6=\xbc 
\xbd=\x8e\x8d}=U\xcd\xba=\xcbG\x89=}\xadg=Z\xad\x9f=_=\xb6:y\x1c==\xa5\x0b3<<\xe5\x1e=*\xa0\xb6=\n\xcd\xb8\xd9<u\xb5W=rZ\x88=\xe0w}=\xa5\xf0\xa0=\xf4\x91\x82=\xe4r\xc5<\x0e\x91A=Z\x9d-<[N:=\xf1\t\x1e=\xc5_\xc2=\xf8\xea\x98=t\xd7\xbf<~N\xce==#\x93=\x98A\xa7=c\x81x=\xe3\xc6\x94=\xe2&\xcc=\x05\xa9^=\xf7\x05\xa8=[m\x81=\x1b\x0b\x84=\xf5\x98\xb9=+\x90\xd8<\xa2\xcc\xa5=5^\x92=\x0e\x9d\x1d=\x96\xc7\x8b;\xc5E\x9e;r\x1e\xc7=\xea6\xbf=\x19mN;\xd9$D=\x85\xa9\x8b=!\xe9\x90=\xe4/~<\xc1\x9c\xaf=\xde\xe4\x18=e\xb0H=hLO;\x9f\xf8\x8b=p.\xcf=L\x1f\x01<\xea\x19\xaf=Z\xd5\xc2<\xb4\xd8\xcf=s\x84\x0c=\x987\xa5;\x19Z\x93=\x0c\x8fO=y/\x97=\xeaOG=\xb0Fl=\x03\x7f\xbe=\x96\n' # noqa
def find_tracers(data):
    start = 0
    while True:
        pos = data.find(b'tracers', start)
        if pos == -1:
            break
        num_floats = struct.unpack('<i', data[pos+7: pos+11])
        print(num_floats)
        start = pos + 11
def main():
    with io.BytesIO(DATA) as file:
        data = file.read()
    find_tracers(data)
if __name__ == '__main__':
    main()

Python getting unrecognizable characters after reading data from file

I'm using Python to recreate a program that was written in Fortran 95. The program opens a binary file containing only float numbers and reads a specific value. It works just fine in Fortran: when I execute the code, I get 284.69, for example.
However, when I try to do the same in Python, reading the entire first line of the file, I get characters like these:
Y{�C�x�Cz~�C�x�C�j�C�r�C�v�Ch�Ck�CVx�C
Here is how I open the file and read the values:
f = open(args.model_files[0], "r").readlines()
print str(f[0])
I can't provide a file as an example because it is too big, but I can confirm that it contains only float numbers.
I would like to at least understand what type of characters I'm getting, or what I'm doing wrong when opening the file, any suggestion is welcome.

struct.error: unpack requires a string argument of length 16

While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error:
pdf2txt.py 2.pdf
Traceback (most recent call last):
File "/usr/local/bin/pdf2txt.py", line 115, in <module>
if __name__ == '__main__': sys.exit(main(sys.argv))
File "/usr/local/bin/pdf2txt.py", line 109, in main
interpreter.process_page(page)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
self.init_resources(resources)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
font = self.get_font(None, subspec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
font = PDFCIDFont(self, spec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
StringIO(self.fontfile.get_data()))
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
While the similar file (1.pdf) doesn't cause a problem.
I can't find any information about the error. I added an issue on the pdfminer GitHub repository, but it remained unanswered. Can someone explain to me why this is happening? What can I do to parse 2.pdf?
Update: I get a similar error with BytesIO instead of StringIO after installing pdfminer directly from the GitHub repository.
$ pdf2txt.py 2.pdf
Traceback (most recent call last):
File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
if __name__ == '__main__': sys.exit(main(sys.argv))
File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
interpreter.process_page(page)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
self.init_resources(resources)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
font = self.get_font(None, subspec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
font = PDFCIDFont(self, spec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
BytesIO(self.fontfile.get_data()))
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
TL;DR
Thanks to @mkl and @hynecker for the extra info... With that I can confirm this is a bug in pdfminer and your PDF. Whenever pdfminer tries to get embedded file streams (e.g. font definitions), it is picking up the last one in the file before an endobj. Sadly, not all PDFs rigorously add the end tag and so pdfminer should be resilient to this.
Quick fix for this issue
I've created a patch - which has been submitted as a pull request on github. See https://github.com/euske/pdfminer/pull/159.
Detailed diagnosis
As mentioned in the other answers, the reason you're seeing this is that you're not getting the expected number of bytes from the stream as pdfminer is unpacking the data. But why?
As you can see in your stack trace, pdfminer (rightly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading out a set of binary tables.
This doesn't work for 2.pdf because it has a text stream. You can see this by running dumppdf -b -i 18 2.pdf. I've put the start here:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
>> def /CMapName /Adobe-Identity-UCS def
...
So, garbage in, garbage out... Is this a bug in your file or pdfminer? Well, the fact that other readers can handle it made me suspicious.
Digging around a little more, I see that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that these cannot be the same.
Digging into the code further, I see that all streams are getting the same data. Oops! This is the bug. The cause appears to be related to the fact that this PDF is missing some end tags - as noted by @hynecker.
The fix is to return the right data for each stream. Any other fix to just swallow the error will result in bad data being used for all streams and so, for example, incorrect font definitions.
I believe the attached patch will fix your problem and should be safe to use in general.
I fixed your problem in the source code, and I tried it on your file 2.pdf to make sure it worked.
In the file pdffont.py I replaced:
class TrueTypeFont(object):
    class CMapNotFound(Exception):
        pass
    def __init__(self, name, fp):
        self.name = name
        self.fp = fp
        self.tables = {}
        self.fonttype = fp.read(4)
        (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
        for _ in xrange(ntables):
            (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
            self.tables[name] = (offset, length)
        return
by this:
class TrueTypeFont(object):
    class CMapNotFound(Exception):
        pass
    def __init__(self, name, fp):
        self.name = name
        self.fp = fp
        self.tables = {}
        self.fonttype = fp.read(4)
        (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
        for _ in xrange(ntables):
            fp_bytes = fp.read(16)
            if len(fp_bytes) < 16:
                break
            (name, tsum, offset, length) = struct.unpack('>4sLLL', fp_bytes)
            self.tables[name] = (offset, length)
        return
Explanations
@Nabeel Ahmed was right:
The format string >4sLLL requires a 16-byte buffer, which is specified correctly to fp.read to read 16 bytes at a time.
So, the problem can only be with the buffer stream it's reading, i.e. the content of your specific PDF file.
In the code we see that the fp.read(16) calls are made in a loop without any check. Thus, we don't know for sure whether each call successfully read all 16 bytes. It could, for instance, have reached EOF.
To avoid this problem, I just break out of the for loop when this kind of problem appears.
for _ in xrange(ntables):
    fp_bytes = fp.read(16)
    if len(fp_bytes) < 16:
        break
In regular cases, it shouldn't change anything anyway.
I will try to do a pull request on github, but I'm not even sure it will be accepted so I suggest you do a monkey patch for now and modify your /home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py file right now.
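The guard from this answer can be tried in isolation. The following is only a Python 3 sketch of the same idea, not pdfminer's actual code: the function name read_table_directory and the sample table entries are invented for illustration.

```python
import io
import struct

# One TrueType table directory entry: 4-byte tag, checksum, offset, length.
TABLE_ENTRY = struct.Struct('>4sLLL')


def read_table_directory(fp, ntables):
    """Read up to `ntables` entries, stopping early on a short read."""
    tables = {}
    for _ in range(ntables):
        chunk = fp.read(TABLE_ENTRY.size)
        if len(chunk) < TABLE_ENTRY.size:
            break  # truncated stream: keep what was read so far
        name, _checksum, offset, length = TABLE_ENTRY.unpack(chunk)
        tables[name] = (offset, length)
    return tables


# Two complete entries followed by 5 stray bytes, while 3 entries are claimed.
stream = io.BytesIO(
    TABLE_ENTRY.pack(b'cmap', 0, 100, 20)
    + TABLE_ENTRY.pack(b'glyf', 0, 120, 300)
    + b'\x00' * 5
)
print(read_table_directory(stream, 3))
```

As another answer on this page points out, silently stopping like this hides the underlying problem when the stream handed to the parser is the wrong one, so it is best treated as a workaround.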
This is really an invalid PDF because the endobj keyword is missing after three indirect objects (objects 5, 18 and 22).
The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white space), followed by the value of the object bracketed between the keywords obj and endobj.
(chapter 7.3.10 in PDF reference)
The example 2.pdf is a simple PDF 1.3 file that uses an uncompressed cross-reference table and uncompressed object separators. The failure is easy to find with grep or a general file viewer: the PDF has 22 indirect objects. The pattern " obj" is found exactly 22 times (fortunately never by accident inside a string object or a stream), but the keyword endobj is missing three times.
$ grep --binary-files=text -B1 -A2 -E " obj|endobj" 2.pdf
...
18 0 obj
<< /Length 451967/Length1 451967/Filter [/FlateDecode] >>
stream
...
endstream % # see the missing "endobj" here
17 0 obj
<< /Length 12743 /Filter [/FlateDecode] >>
stream
...
endstream
endobj
...
Similarly the object 5 has no endobj before object 1 and the object 22 has no endobj before object 21.
It is known that broken cross-references in a PDF can, and usually should, be reconstructed from the obj/endobj keywords (see the PDF reference, chapter C.2). Some applications probably do the reverse and repair a missing endobj when the cross-references are correct, but there is no written advice recommending this.
The last error message tells you a lot:
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
You can easily debug what is going on, for example by putting the necessary debug statements directly in the pdffont.py file. My guess is that there is something special about your PDF contents. Judging by the name of the class that throws the error, TrueTypeFont, there is some incompatibility with the font type.
Let's start by explaining the statement where you're getting the exception:
struct.unpack('>4sLLL', fp.read(16))
where the synopsis is:
struct.unpack(fmt, buffer)
The method unpack, unpacks from the buffer buffer (which
presumably earlier packed by pack(fmt, ...)) according to the
format string fmt. The result is a tuple even if it
contains exactly one item. The buffer’s size in bytes must match the
size required by the format, as reflected by calcsize().
The most common cause is a wrong number of bytes for the format used. For example, for a format expecting 4 bytes, you have supplied 3 bytes:
(name, tsum, offset, length) = struct.unpack('BH', fp.read(3))
for this you'll get
struct.error: unpack requires a string argument of length 4
The reason - the format struct ('BH') expects 4 bytes i.e. when we pack something using 'BH' format it'll occupy 4 bytes of memory.
A good explanation here.
To clarify further, let's look at the >4sLLL format string and verify the size that unpack would expect for the buffer (the bytes you're reading from the PDF file). Quoting from the docs:
The buffer’s size in bytes must match the size required by the format,
as reflected by calcsize().
>>> import struct
>>> struct.calcsize('>4sLLL')
16
>>>
To this point we can say there's nothing wrong with the statement:
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
The format string >4sLLL requires a 16-byte buffer, which is specified correctly to fp.read to read 16 bytes at a time.
So, the problem can only be with the buffer stream it's reading i.e. the content of your specific PDF file.
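As a quick check of the sizes discussed above, the sketch below prints what calcsize reports for a few format strings. One detail worth knowing: a format without a byte-order prefix, such as 'BH', uses native alignment, so a padding byte is typically inserted before the H field, which is where the 4 comes from. With an explicit prefix ('<' or '>') there is no padding. The dictionary name sizes is just for illustration.

```python
import struct

# Sizes with an explicit byte order are fixed by the struct documentation.
# The native-alignment size of 'BH' depends on the platform.
sizes = {fmt: struct.calcsize(fmt) for fmt in ('>4sLLL', '<BH', 'BH')}
for fmt, size in sizes.items():
    print(fmt, size)
```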
It could be a bug, as per this comment:
This is a bug in the upstream PDFminer by @euske. There seems to be
patches for this so it should be an easy fix. Beyond this I also need
to strengthen the pdf parsing such that we never error out from a
failed parse
I'll edit this if I find something helpful to add here - a solution, or a patch.
In case you still get some struct errors after applying Peter's patch, especially when parsing many files in one script's run (using os.listdir), try changing resource manager caching to false.
rsrcmgr = PDFResourceManager(caching=False)
It helped me to get rid of the rest of errors after applying above solutions.

Python Overwrite Dictionary to Text File doesn't work... why?

I've been previously writing code for a quiz program with a text file that stores all of the participants' results. The code that converts the text file to a dictionary and the text file itself are shown below:
Code:
import collections
from collections import defaultdict
scores_guessed = collections.defaultdict(lambda: collections.deque(maxlen=4))
with open('GuessScores.txt') as f:
    for line in f:
        name,val = line.split(":")
        scores_guessed[name].appendleft(int(val))
for k in sorted(scores_guessed):
    print("\n"+k," ".join(map(str,scores_guessed[k])))
writer = open('GuessScores.txt', 'wb')
for key, value in scores_guessed.items():
    output = "%s:%s\n" % (key,value)
    writer.write(output)
The text file appears like this:
Jack:10
Dave:20
Adam:30
Jack:40
Adam:50
Dave:60
Jack:70
Dave:80
Jack:90
Jack:100
Dave:110
Dave:120
Adam:130
Adam:140
Adam:150
Now, when I run the program code, the dictionary appears like this:
Adam 150 140 130 50
Dave 120 110 80 60
Jack 100 90 70 40
Now, this arranges the dictionary into order of highest scores, and the top 4 scores!
I want the python IDLE to overwrite the GuessScores.txt to this:
Adam:150
Adam:140
Adam:130
Adam:50
Dave:120
Dave:110
Dave:80
Dave:60
Jack:100
Jack:90
Jack:70
Jack:40
BUT when I run the code, this error appears:
Traceback (most recent call last):
File "/Users/Ahmad/Desktop/Test Files SO copy/readFile_prompt.py", line 16, in <module>
writer.write(output)
TypeError: 'str' does not support the buffer interface
The GuessScores.txt file is empty because it cannot write to the file, since there is the error above.
Why is this happening? And what is the fix? I have asked this previously but there were numerous issues. I am running Python 3.3.2 on a Mac 10.8 Mavericks iMac, if that makes any help.
Thanks,
Delbert.
The first issue is that you are trying to write text to a file that you opened in binary mode. In 3.x, this will no longer work. "Text" vs. "binary" used to mean very little (only affecting line-ending translation, so no difference at all on some systems). Now it means what it sounds like: a file opened in text mode is one whose contents are to be treated as text with some specific encoding, and a file opened in binary mode is one whose contents are to be treated as a sequence of bytes.
Thus, you need open('GuessScores.txt', 'w'), not open('GuessScores.txt', 'wb').
That said, you really should be using with blocks to manage the files, and you're going to have to write code that actually formats the dictionary content in the way you want. I assume you intend to output in sorted name order, and you need to iterate over each deque and write a line for each item. Something like:
with open('GuessScores.txt', 'w') as f:
    for name, scores in sorted(scores_guessed.items()):
        for score in scores:
            f.write("{}:{}\n".format(name, score))
(Note also the new-style formatting.)
If necessary, you can explicitly specify the encoding of the file in the open call, with the encoding keyword parameter. (If you don't know what I mean by "encoding", you must learn. I'm serious. Drop everything and look it up.)
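Putting the reading and writing halves together, a self-contained sketch might look like the following. The function name rewrite_scores, the keep parameter, and the choice of UTF-8 are assumptions for illustration. The logic otherwise follows the question's deque approach and the text-mode write shown above.

```python
import collections


def rewrite_scores(path, keep=4):
    """Keep each player's `keep` most recent scores and rewrite the file."""
    scores = collections.defaultdict(lambda: collections.deque(maxlen=keep))
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            name, val = line.strip().split(":")
            # appendleft puts the newest score first; maxlen drops the oldest.
            scores[name].appendleft(int(val))
    # Text mode ('w', not 'wb') because we are writing str, not bytes.
    with open(path, "w", encoding="utf-8") as f:
        for name in sorted(scores):
            for score in scores[name]:
                f.write("{}:{}\n".format(name, score))
    return scores
```

Calling rewrite_scores('GuessScores.txt') on the file from the question should leave it with four lines per player, newest score first, in alphabetical order of player name.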
The writing problem has to do with the b in your open function. You've opened it in binary mode, so only bytes can be written. You can either remove the b or call bytes on output to give it the right type. You have a logic error anyway though. When I run it on Python 2.7, the output to GuessScores.txt is this:
Dave:deque([120, 110, 80, 60], maxlen=4)
Jack:deque([100, 90, 70, 40], maxlen=4)
Adam:deque([150, 140, 130, 50], maxlen=4)
So your values are the whole deques, not the individual scores. You'll have to format them, similar to how you did in your print statement.

How can I say a file is SVG without using a magic number?

An SVG file is basically an XML file, so I could use the string <?xml (or its hex representation, '3c 3f 78 6d 6c') as a magic number, but there are a few reasons not to do that. For example, extra whitespace at the start could break this check.
The other images I need to check are all binary and have magic numbers. How can I quickly check, in Python, whether a file is SVG without relying on the extension?
XML is not required to start with the <?xml preamble, so testing for that prefix is not a good detection technique — not to mention that it would identify every XML as SVG. A decent detection, and really easy to implement, is to use a real XML parser to test that the file is well-formed XML that contains the svg top-level element:
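# Note: cElementTree was removed in Python 3.9; on newer versions,
# import xml.etree.ElementTree instead (it uses the C accelerator when available).
import xml.etree.cElementTree as et
def is_svg(filename):
    tag = None
    with open(filename, "r") as f:
        try:
            # Only the first 'start' event is needed: the root element.
            for event, el in et.iterparse(f, ('start',)):
                tag = el.tag
                break
        except et.ParseError:
            pass
    return tag == '{http://www.w3.org/2000/svg}svg'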
Using cElementTree ensures that the detection is efficient through the use of expat; timeit shows that an SVG file was detected as such in ~200μs, and a non-SVG in 35μs. The iterparse API enables the parser to forego creating the whole element tree (module name notwithstanding) and only read the initial portion of the document, regardless of total file size.
You could try reading the beginning of the file as binary - if you can't find any magic numbers, you read it as a text file and match to any textual patterns you wish. Or vice-versa.
This is from man file (here), for the unix file command:
The magic tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable ... These files have a “magic number” stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of a “magic” has been applied by extension to data files. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way. ...
(my emphasis)
And here's one example of the "magic" that the file command uses to identify an svg file (see source for more):
...
0 string \<?xml\ version=
>14 regex ['"\ \t]*[0-9.]+['"\ \t]*
>>19 search/4096 \<svg SVG Scalable Vector Graphics image
...
0 string \<svg SVG Scalable Vector Graphics image
...
As described by man magic, each line follows the format <offset> <type> <test> <message>.
If I understand correctly, the code above looks for the literal "<?xml version=". If that is found, it looks for a version number, as described by the regular expression. If that is found, it searches the next 4096 bytes until it finds the literal "<svg". If any of this fails, it looks for the literal "<svg" at the start of the file, and so on.
Something similar could be implemented in Python.
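A rough sketch of such a Python version, mirroring the two magic rules quoted above rather than reproducing libmagic exactly. The function names looks_like_svg and file_looks_like_svg and the 8192-byte read size are assumptions chosen for illustration.

```python
import re

# Same pattern as the ">14 regex" line of the magic entry above.
VERSION_RE = re.compile(rb'''['" \t]*[0-9.]+['" \t]*''')


def looks_like_svg(head):
    """Apply the two magic rules to the leading bytes of a file."""
    # Rule 2: the file starts directly with "<svg".
    if head.startswith(b"<svg"):
        return True
    # Rule 1: XML declaration, a version number, then "<svg" nearby.
    if head.startswith(b"<?xml version="):
        if VERSION_RE.match(head, 14):
            # Approximates "search/4096" starting at offset 19.
            return b"<svg" in head[19:19 + 4096 + 4]
    return False


def file_looks_like_svg(path):
    """Read only the beginning of the file, in binary mode."""
    with open(path, "rb") as f:
        return looks_like_svg(f.read(8192))
```

Like the magic rules it imitates, this is a heuristic: leading whitespace, a byte-order mark, or a long comment before the root element would defeat it, which is the trade-off against the XML-parser approach in the other answer.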
Note there's also python-magic, which provides an interface to libmagic, as used by the unix file command.
