I couldn't find this anywhere, so sorry if I missed it. It seems like it should be simple, but somehow it isn't. I have a simple program that opens a log (say, Log1.lg) and strips out any lines that contain certain keywords. It then writes the remaining lines to a second file, which is renamed to Log1.lg.clean.
The way I've implemented this is with os.rename, so the code looks like this:
#define source and key words
source_log = 'Log1.lg'
bad_words = ['word', 'bad']
#clean up the log
with open(source_log) as orig_log, open('cleanlog.lg', 'w') as cleanlog:
    for line in orig_log:
        if not any(bad_word in line for bad_word in bad_words):
            cleanlog.write(line)
#rename file and open in Notepad
rename = source_log + '.clean'
new_log = os.rename("cleanlog.lg", rename)
prog = "notepad.exe"
subprocess.Popen(prog, new_log)
The error I'm getting is this:
File "C:\Users\me\Downloads\PythonStuff\stripMmax.py", line 23, in cleanLog
subprocess.Popen(prog, new_log)
File "C:\Python27\lib\subprocess.py", line 339, in __init__
raise TypeError("bufsize must be an integer")
TypeError: bufsize must be an integer
I'm using Python 2.7 if that's relevant. I don't get why this isn't working or why it's requiring a bufsize. I've seen other examples where this works this way so I'm thinking maybe this command doesn't work in 2.7 the way I'm typing it?
The documentation shows how to use this properly using the actual file name in quotes, but as you can see, mine here is contained in a variable which seems to cause issues. Thanks in advance!
Look at the signature of the subprocess.Popen constructor: its second positional argument is bufsize, which explains your error. The program and its arguments have to be passed together as a single list. Also note that os.rename does not return anything, so new_log will be None; use your rename variable instead. Your call should look like this:
subprocess.Popen([prog, rename])
You likely also want to wait on the created Popen object:
proc = subprocess.Popen([prog, rename])
proc.wait()
Or something like that.
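For reference, here is a minimal sketch of the whole flow with the two fixes applied. It is written for Python 3 and makes a few assumptions of its own: the function names clean_log, viewer_command and open_in_viewer are made up for illustration, and it writes straight to the .clean file instead of going through os.rename.

```python
import subprocess


def clean_log(source_log, bad_words):
    """Copy source_log to source_log + '.clean', dropping lines that contain a bad word."""
    cleaned = source_log + ".clean"
    with open(source_log) as orig_log, open(cleaned, "w") as clean:
        for line in orig_log:
            if not any(bad_word in line for bad_word in bad_words):
                clean.write(line)
    return cleaned


def viewer_command(path, prog="notepad.exe"):
    """Build the argument list for Popen: the program first, then its arguments."""
    return [prog, path]


def open_in_viewer(path, prog="notepad.exe"):
    """Launch the viewer on path and block until it exits."""
    proc = subprocess.Popen(viewer_command(path, prog))
    return proc.wait()
```

Something like open_in_viewer(clean_log('Log1.lg', ['word', 'bad'])) would then open the cleaned log. The key point is that Popen receives one list, so nothing ends up in the bufsize slot.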
While processing a PDF file (2.pdf) with pdfminer (pdf2txt.py) I received the following error:
pdf2txt.py 2.pdf
Traceback (most recent call last):
File "/usr/local/bin/pdf2txt.py", line 115, in <module>
if __name__ == '__main__': sys.exit(main(sys.argv))
File "/usr/local/bin/pdf2txt.py", line 109, in main
interpreter.process_page(page)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 832, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 843, in render_contents
self.init_resources(resources)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 347, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 195, in get_font
font = self.get_font(None, subspec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfinterp.py", line 186, in get_font
font = PDFCIDFont(self, spec)
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 654, in __init__
StringIO(self.fontfile.get_data()))
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
While the similar file (1.pdf) doesn't cause a problem.
I can't find any information about the error. I added an issue on the pdfminer GitHub repository, but it remained unanswered. Can someone explain to me why this is happening? What can I do to parse 2.pdf?
Update: I get a similar error with BytesIO instead of StringIO after installing pdfminer directly from the GitHub repository.
$ pdf2txt.py 2.pdf
Traceback (most recent call last):
File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 116, in <module>
if __name__ == '__main__': sys.exit(main(sys.argv))
File "/home/danil/projects/python/pdfminer-source/env/bin/pdf2txt.py", line 110, in main
interpreter.process_page(page)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 839, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 850, in render_contents
self.init_resources(resources)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
font = self.get_font(None, subspec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
font = PDFCIDFont(self, spec)
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 665, in __init__
BytesIO(self.fontfile.get_data()))
File "/home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py", line 386, in __init__
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
TL;DR
Thanks to @mkl and @hynecker for the extra info. With that I can confirm this is a bug in pdfminer and a problem in your PDF. Whenever pdfminer tries to get embedded file streams (e.g. font definitions), it picks up the last one in the file before an endobj. Sadly, not all PDFs rigorously add the end tag, so pdfminer should be resilient to this.
Quick fix for this issue
I've created a patch - which has been submitted as a pull request on github. See https://github.com/euske/pdfminer/pull/159.
Detailed diagnosis
As mentioned in the other answers, the reason you're seeing this is that you're not getting the expected number of bytes from the stream as pdfminer is unpacking the data. But why?
As you can see in your stack trace, pdfminer (rightly) spots that it has a CID font to process. It then goes on to process the embedded font file as a TrueType font (in pdffont.py). It tries to parse the associated stream (stream ID 18) by reading out a set of binary tables.
This doesn't work for 2.pdf because it has a text stream. You can see this by running dumppdf -b -i 18 2.pdf. I've put the start here:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0
>> def /CMapName /Adobe-Identity-UCS def
...
So, garbage in, garbage out... Is this a bug in your file or pdfminer? Well, the fact that other readers can handle it made me suspicious.
Digging around a little more, I see that this stream is identical to stream ID 17, which is the cmap for the ToUnicode field. A quick look at the PDF spec shows that these cannot be the same.
Digging into the code further, I see that all streams are getting the same data. Oops! This is the bug. The cause appears to be related to the fact that this PDF is missing some end tags, as noted by @hynecker.
The fix is to return the right data for each stream. Any other fix to just swallow the error will result in bad data being used for all streams and so, for example, incorrect font definitions.
I believe the attached patch will fix your problem and should be safe to use in general.
I fixed your problem in the source code, and I tested it on your file 2.pdf to make sure it works.
In the file pdffont.py I replaced this:
class TrueTypeFont(object):
    class CMapNotFound(Exception):
        pass
    def __init__(self, name, fp):
        self.name = name
        self.fp = fp
        self.tables = {}
        self.fonttype = fp.read(4)
        (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
        for _ in xrange(ntables):
            (name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
            self.tables[name] = (offset, length)
        return
by this:
class TrueTypeFont(object):
    class CMapNotFound(Exception):
        pass
    def __init__(self, name, fp):
        self.name = name
        self.fp = fp
        self.tables = {}
        self.fonttype = fp.read(4)
        (ntables, _1, _2, _3) = struct.unpack('>HHHH', fp.read(8))
        for _ in xrange(ntables):
            fp_bytes = fp.read(16)
            if len(fp_bytes) < 16:
                break
            (name, tsum, offset, length) = struct.unpack('>4sLLL', fp_bytes)
            self.tables[name] = (offset, length)
        return
Explanations
@Nabeel Ahmed was right:
The format string >4sLLL requires a 16-byte buffer, which is correctly supplied by having fp.read read 16 bytes at a time.
So, the problem can only be with the buffer stream it's reading, i.e. the content of your specific PDF file.
In the code we see that fp.read(16) is called in a loop without any check. Thus, we don't know for sure that each call successfully read all 16 bytes; it could, for instance, have reached EOF.
To avoid this problem, I just break out of the for loop when this happens.
for _ in xrange(ntables):
    fp_bytes = fp.read(16)
    if len(fp_bytes) < 16:
        break
For well-formed font data, this shouldn't change anything.
I will try to open a pull request on GitHub, but I'm not sure it will be accepted, so for now I suggest you patch your local copy of /home/danil/projects/python/pdfminer-source/env/local/lib/python2.7/site-packages/pdfminer/pdffont.py directly.
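To try the short-read check outside pdfminer, here is a standalone sketch of the same idea. It is written for Python 3 (range instead of xrange), and read_table_directory is a name invented for this example, not a pdfminer function.

```python
import struct
from io import BytesIO


def read_table_directory(fp):
    """Read TrueType table records from fp, stopping early on a short read."""
    fonttype = fp.read(4)
    header = fp.read(8)
    if len(header) < 8:
        return fonttype, {}
    (ntables, _1, _2, _3) = struct.unpack(">HHHH", header)
    tables = {}
    for _ in range(ntables):
        record = fp.read(16)
        if len(record) < 16:
            break  # truncated or non-font data: stop instead of raising struct.error
        (name, tsum, offset, length) = struct.unpack(">4sLLL", record)
        tables[name] = (offset, length)
    return fonttype, tables
```

Keep in mind the caveat from the earlier answer: this only avoids the crash. If the stream is not really font data, the tables that do get read are still meaningless.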
This is really an invalid PDF, because the keyword endobj is missing after three indirect objects (objects 5, 18 and 22).
The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white space), followed by the value of the object bracketed between the keywords obj and endobj.
(chapter 7.3.10 in PDF reference)
The example 2.pdf is a simple PDF 1.3 file that uses an uncompressed cross-reference table and uncompressed object separators. The failure is easy to find with grep and a general file viewer: the PDF has 22 indirect objects, and the pattern " obj" is found exactly 22 times (fortunately never by accident inside a string object or a stream), but the keyword endobj is missing three times.
$ grep --binary-files=text -B1 -A2 -E " obj|endobj" 2.pdf
...
18 0 obj
<< /Length 451967/Length1 451967/Filter [/FlateDecode] >>
stream
...
endstream % # see the missing "endobj" here
17 0 obj
<< /Length 12743 /Filter [/FlateDecode] >>
stream
...
endstream
endobj
...
Similarly, object 5 has no endobj before object 1, and object 22 has no endobj before object 21.
It is known that broken cross-references in a PDF can, and usually should, be reconstructed from the obj/endobj keywords (see the PDF reference, chapter C.2). Some applications probably do the reverse and repair a missing endobj when the cross-references are correct, but there is no written advice to that effect.
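The grep check above can also be done from Python. This is only a rough sketch: count_object_keywords is a made-up helper, and like the grep version it assumes the keywords never appear by accident inside a string or a stream.

```python
import re

# "N G obj" opens an indirect object, "endobj" closes it.
OBJ_START = re.compile(rb"\b\d+\s+\d+\s+obj\b")
OBJ_END = re.compile(rb"\bendobj\b")


def count_object_keywords(data):
    """Return (number of 'N G obj' openers, number of 'endobj' closers) in raw PDF bytes."""
    return len(OBJ_START.findall(data)), len(OBJ_END.findall(data))
```

Calling it on open('2.pdf', 'rb').read() and comparing the two numbers shows whether any endobj is missing.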
The last error message tells you a lot:
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdffont.py", line 375, in __init__
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
struct.error: unpack requires a string argument of length 16
You can easily debug what is going on, for example by adding debug statements directly to pdffont.py. My guess is that there is something special about your PDF's contents. Judging by the class that raises the error, TrueTypeFont, there is some incompatibility with the font type.
Let's start by explaining the statement where you're getting the exception:
struct.unpack('>4sLLL', fp.read(16))
where the synopsis is:
struct.unpack(fmt, buffer)
The unpack method unpacks from the buffer buffer (presumably packed
earlier by pack(fmt, ...)) according to the format string fmt.
The result is a tuple even if it contains exactly one item.
The buffer’s size in bytes must match the
size required by the format, as reflected by calcsize().
The most common cause of this error is passing the wrong number of bytes (here 16) for the format used (here >4sLLL). For example, if the format expects 4 bytes and you supply only 3:
(a, b) = struct.unpack('BH', fp.read(3))
you'll get
struct.error: unpack requires a string argument of length 4
The reason is that the format 'BH' expects 4 bytes: with the default native alignment, a pad byte is inserted between the B and the H, so anything packed with 'BH' occupies 4 bytes of memory.
A good explanation here.
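If you want to check the 'BH' example yourself, something like the following should do. It is a small sketch for Python 3; try_unpack is a helper invented here, and the 4-byte size of 'BH' assumes a typical platform where a short is aligned to 2 bytes.

```python
import struct


def try_unpack(fmt, data):
    """Return the unpacked tuple, or None if data is not exactly calcsize(fmt) bytes long."""
    if len(data) != struct.calcsize(fmt):
        return None
    return struct.unpack(fmt, data)


# Native alignment pads the B so that the H starts on an even offset.
native_size = struct.calcsize("BH")
# A byte-order prefix such as ">" switches to standard sizes with no padding.
packed_size = struct.calcsize(">BH")
print(native_size, packed_size)
```

Checking the length before calling unpack, as try_unpack does, turns the struct.error into an ordinary None that the caller can handle.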
To clarify further, let's look at the >4sLLL format string and verify the size that unpack expects for the buffer (the bytes you're reading from the PDF file). Quoting from the docs:
The buffer’s size in bytes must match the size required by the format,
as reflected by calcsize().
>>> import struct
>>> struct.calcsize('>4sLLL')
16
>>>
At this point we can say there's nothing wrong with the statement:
(name, tsum, offset, length) = struct.unpack('>4sLLL', fp.read(16))
The format string >4sLLL requires a 16-byte buffer, which is correctly supplied by having fp.read read 16 bytes at a time.
So, the problem can only be with the buffer stream it's reading, i.e. the content of your specific PDF file.
It could be a bug, as per this comment:
This is a bug in the upstream PDFminer by @euske. There seems to be
patches for this so it should be an easy fix. Beyond this I also need
to strengthen the pdf parsing such that we never error out from a
failed parse
I'll edit this post if I find something helpful to add here, such as a solution or a patch.
If you still get some struct errors after applying Peter's patch, especially when parsing many files in a single run of a script (using os.listdir), try disabling caching in the resource manager:
rsrcmgr = PDFResourceManager(caching=False)
It helped me get rid of the remaining errors after applying the solutions above.
I am trying to write a small script to automate a local BLAST alignment.
I ran the commands in the terminal and they work perfectly. However, when I try to automate this, I get a message like: Empty XML file.
Do we have to add a "system" waiting time to let the file be written, or did I do something wrong?
The code:
import os
from Bio.Blast import NCBIXML

#sequence identifier as key, sequence as value.
for element in dictionnaryOfSequence:
    #I make a little temporary fasta file because the blast command needs a fasta file as input.
    out_fasta = open("tmp.fasta", 'w')
    query = ">" + element + "\n" + str(dictionnaryOfSequence[element])
    out_fasta.write(query) # And I have this file with my sequence correctly filled
    out_fasta.close() # EDIT : It was out of my loop....
    #Now the blast command, which works well in the terminal, I have my tmp.xml file well filled.
    os.system("blastn -db reads.fasta -query tmp.fasta -out tmp.xml -outfmt 5 -max_target_seqs 5000")
    #Parsing of the xml file.
    handle = open("tmp.xml", 'r')
    blast_records = NCBIXML.read(handle)
    print blast_records
I get an error: Your XML file was empty, and the blast_records object doesn't exist.
Did I do something wrong with the handles?
Any advice is welcome. Thank you very much for your ideas and help.
EDIT: Problem solved, sorry for the useless question. I handled the file handle incorrectly and did not open the file in the right location. The same goes for closing it.
Sorry.
Try opening the file "tmp.xml" in Internet Explorer. Are all the tags closed?
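On the question of a waiting time: os.system only returns once the command has finished, so no sleep should be needed. What does matter is that the FASTA file is closed (and so flushed to disk) before blastn reads it. Here is a minimal Python 3 sketch of that; write_fasta and run_and_check are names made up for this example.

```python
import os
import subprocess


def write_fasta(path, seq_id, sequence):
    """Write one FASTA record; the with block closes (and flushes) the file on exit."""
    with open(path, "w") as out_fasta:
        out_fasta.write(">" + seq_id + "\n" + str(sequence) + "\n")


def run_and_check(command, expected_output):
    """Run command (a list of arguments) and fail loudly if its output file is missing or empty."""
    subprocess.check_call(command)  # blocks until the command exits; raises on a non-zero exit code
    if not os.path.isfile(expected_output) or os.path.getsize(expected_output) == 0:
        raise RuntimeError("no output written to " + expected_output)
    return expected_output
```

With the command from the question this would be called as run_and_check(["blastn", "-db", "reads.fasta", "-query", "tmp.fasta", "-out", "tmp.xml", "-outfmt", "5", "-max_target_seqs", "5000"], "tmp.xml"), and the returned path can then be opened and handed to NCBIXML.read.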
I get a strange error in Python. When I try to extract a password-protected file using the zipfile module, I get an exception when I set "oy" as the password. Everything else seems to work. Is this a bug in the zipfile module?
import zipfile
zip = zipfile.ZipFile("file.zip", "r")
zip.setpassword("oy".encode('utf-8'))
zip.extractall() #Above password "oy" generates the error here
zip.close()
This is the exception I get:
Traceback (most recent call last):
File "unzip.py", line 4, in <module>
zip.extractall()
File "C:\Program Files\Python32\lib\zipfile.py", line 1002, in extractall
self.extract(zipinfo, path, pwd)
File "C:\Program Files\Python32\lib\zipfile.py", line 990, in extract
return self._extract_member(member, path, pwd)
File "C:\Program Files\Python32\lib\zipfile.py", line 1035, in _extract_member
shutil.copyfileobj(source, target)
File "C:\Program Files\Python32\lib\shutil.py", line 65, in copyfileobj
buf = fsrc.read(length)
File "C:\Program Files\Python32\lib\zipfile.py", line 581, in read
data = self.read1(n - len(buf))
File "C:\Program Files\Python32\lib\zipfile.py", line 633, in read1
max(n - len_readbuffer, self.MIN_READ_SIZE)
zlib.error: Error -3 while decompressing: invalid block type
If I use UTF-16 as encoding I get this error:
zlib.error: Error -3 while decompressing: invalid distance too far back
EDIT
I have now tested on a Linux virtual machine with the following setup:
Python version: 2.6.5
I created a password-protected zip file with zip -e file.zip hello.txt
Now it seems the problem is something else. Now I can extract the zip file even if the password is wrong!
try:
    zip.setpassword("ks") # "ks" is wrong password but it still extracts the zip
    zip.extractall()
except RuntimeError:
    print "wrong!"
Sometimes I can extract the zip file with an incorrect password. The file (inside the zip file) is then extracted but when I try to open it the information seems to be corrupted/decrypted.
If there's a problem with the password, usually you get the following exception:
RuntimeError: ('Bad password for file', <zipfile.ZipInfo object at 0xb76dec2c>)
Since your exception complains about the block type, most probably your .zip archive is corrupted. Have you tried unpacking it with a standalone unzip utility?
Or maybe you have used something funny, like 7zip to create it, which makes incompatible .zip archives.
You don't provide enough information (OS version? Python version? ZIP archive creator and contents? are there many files in those archives or single file in single archive? do all those files give same errors, or you can unpack some of them?), so here's quick Q&A section, which should help you to find and remedy the problem.
Q1. Is this a bug in Python?
A1. Unlikely.
Q2. What might cause this behaviour?
A2. Broken zip files or incompatible zip compressors. Since you don't tell us anything, it's hard to point to the exact cause.
Q3. How to find the cause?
A3. Try to isolate the problem: find the file which gives you an error, try zip.testzip() and/or decompress that particular file with a different unzip utility, and share the results. Only you have access to the problematic files, so nobody can help you unless you try something yourself.
Q4. How to fix this?
A4. You cannot. Use a different zip extractor; ZipFile won't work.
Try using the testzip() method to check the file's integrity before extracting files.
It could possibly be a bug in zipfile, or a bug in your zip implementation. I noticed that your line numbers do not match mine, so I guess you are on a Python 3.2 release earlier than the current 3.2.3 that I have.
Now, as to your code, it does work for me on Python 3.2.3 on Linux. I suggest you update to the latest 3.2.x as there seem to be a number of bug fixes related to zipfile and zlib, including fixes for crashes.
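To combine the suggestions above (run testzip() first, and tell a password problem apart from a damaged archive), a sketch along these lines might help. It targets Python 3; safe_extract is a hypothetical helper name, and the messages are just examples.

```python
import zipfile
import zlib


def safe_extract(path, dest, password=None):
    """Test the archive, then extract it; return (ok, message) instead of raising."""
    try:
        with zipfile.ZipFile(path, "r") as archive:
            if password is not None:
                archive.setpassword(password)  # must be bytes on Python 3
            bad_member = archive.testzip()  # reads every member and checks its CRC
            if bad_member is not None:
                return False, "corrupt member: " + bad_member
            archive.extractall(dest)
    except RuntimeError as exc:  # zipfile raises this for a bad or missing password
        return False, "extraction refused: %s" % exc
    except (zipfile.BadZipFile, zlib.error) as exc:
        return False, "damaged archive: %s" % exc
    return True, "ok"
```

Calling safe_extract("file.zip", ".", b"oy") would then report which kind of failure occurred rather than dying with a traceback halfway through extraction.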
I'm trying to write a small Python script that will get query results from a database, write them to a file, and then sftp the file to a different server. The pieces work just fine but I'm getting a weird error when trying to sftp the file immediately after it's written.
The error I'm getting is
File "/usr/lib/python2.4/site-packages/paramiko/sftp_client.py", line 558, in put
file_size = os.stat(localpath).st_size
TypeError: coercing to Unicode: need string or buffer, file found
The offending line of code is just
sftp.put(outputfile, sftpoutputfile)
I tried using a copy of the output file instead of the one that's being written in the script and that worked exactly as it's supposed to. I'm calling file.close() after the file is written (and before setting up the sftp) so it seems like the file should be, well, closed and usable after that. Can someone tell me what I'm doing wrong? I can post more of the code if that would be helpful. Thank you very much.
The error message is telling you that it (in this case, os.stat) wants a stringlike object, and you're giving it the file instead.
Looking at the source of sftp_client.py in my copy of paramiko, we see
def put(self, localpath, remotepath, callback=None, confirm=True):
    [...]
    file_size = os.stat(localpath).st_size
    fl = file(localpath, 'rb')
    try:
        fr = self.file(remotepath, 'wb')
        fr.set_pipelined(True)
so I'm pretty sure that it wants the filename, not the file itself.
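In other words, pass the path string rather than the file object. If the variable may hold either one, a small adapter like the one below would cover both cases. This is a Python 3 sketch, and as_local_path is a name made up for illustration, not part of paramiko.

```python
import os


def as_local_path(path_or_file):
    """Return a path string, accepting either a path or a file object opened from one."""
    if isinstance(path_or_file, str):
        return path_or_file
    # File objects created with open(path) keep that path in .name, even after close().
    return path_or_file.name
```

The call from the question would then become sftp.put(as_local_path(outputfile), sftpoutputfile), so os.stat receives a string either way.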