Decoding Mail Subject Thunderbird in Python 3.x - python

For a workaround, see below.
Original Question:
Sorry, I am simply too dumb to solve this on my own. I am trying to read the "subjects" from several emails stored in a .mbox folder from Thunderbird. Now, I am trying to decode the header with decode_header(), but I am still getting UnicodeErrors.
I am using the following function (I am sure there is a smarter way to do this, but this is not the point of this post)
import mailbox
from email.header import decode_header

def header_to_string(header):
    try:
        header, encoding = decode_header(header)[0]
    except Exception:
        return "something went wrong {}".format(header)
    if encoding is None:
        return header
    return header.decode(encoding)

mflder = mailbox.mbox("mailfolder")
for message in mflder:
    print(header_to_string(message["subject"]))
The first 100 outputs or so are perfectly fine, but then this error message appears:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-97-e252df04c215> in <module>
----> 1 for message in mflder:
2 try:
3 print(header_to_string(message["subject"]))
4 except:
5 print("0")
~\anaconda3\lib\mailbox.py in itervalues(self)
107 for key in self.iterkeys():
108 try:
--> 109 value = self[key]
110 except KeyError:
111 continue
~\anaconda3\lib\mailbox.py in __getitem__(self, key)
71 """Return the keyed message; raise KeyError if it doesn't exist."""
72 if not self._factory:
---> 73 return self.get_message(key)
74 else:
75 with contextlib.closing(self.get_file(key)) as file:
~\anaconda3\lib\mailbox.py in get_message(self, key)
779 string = self._file.read(stop - self._file.tell())
780 msg = self._message_factory(string.replace(linesep, b'\n'))
--> 781 msg.set_from(from_line[5:].decode('ascii'))
782 return msg
783
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)
How can I force mailbox.py to decode a different encoding? Or is the header simply broken? And if I understood this correctly, headers are supposed to be "ASCII", right? I mean, this is the point of this entire MIME thing, no?
Thanks for your help!
Workaround
I found a workaround by simply avoiding direct iteration over the .mbox mailfolder representation. Instead of using ...
for message in mflder:
    # do something
... simply use:
for x in range(len(mflder)):
    try:
        message = mflder[x]
        print(header_to_string(message["subject"]))
    except:
        print("Failed loading message!")
This skips the broken messages in the .mbox folder. Yet, I stumbled upon several other issues while working with the .mbox folder subjects. For instance, the headers are sometimes split into several tuples by the decode_header() function, so in order to receive the full subjects one needs to extend the header_to_string() function as well. But this is not related to this question anymore. I am a noob and a hobby programmer, but I remember working with the Outlook API and Python, which was MUCH easier...
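For completeness, here is a sketch of how the multi-tuple case can be handled: decode_header() returns one (data, charset) chunk per encoded word, so joining every chunk yields the full subject. The sample header below is made up for illustration.

```python
from email.header import decode_header

def header_to_string(header):
    """Join all decoded chunks of a possibly multi-part MIME header."""
    parts = []
    for data, encoding in decode_header(header):
        if isinstance(data, bytes):
            parts.append(data.decode(encoding or "ascii", errors="replace"))
        else:
            parts.append(data)  # already a str for plain ASCII chunks
    return "".join(parts)

print(header_to_string("=?utf-8?b?SMOkbGxv?= =?utf-8?q?_Welt?="))  # Hällo Welt
```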

Solution
It looks like you either have a corrupted "mailfolder" mbox file or there is a bug in Python's mailbox module triggered by something in your file. I can't tell what is going on without the mbox input file or a minimal example file that reproduces the issue.
You could do some debugging yourself. Each message in the file starts with a "From" line that should look like:
From - Mon Mar 30 18:18:04 2020
From the stack trace you posted, it looks like that line is malformed in one of the messages. Personally, I would use an IDE debugger (PyCharm) to track down the malformed line, but it can also be done with Python's built-in pdb. Wrap your loop like this:
import pdb

try:
    for message in mflder:
        print(header_to_string(message["subject"]))
except:
    pdb.post_mortem()
When you run the code now, it will drop into the debugger when the exception occurs. At that prompt, you can enter l to list the code where the debugger stopped; this should match the last frame printed in your stack trace you originally posted. Once you are there, there are two commands that will tell you what is going on:
p from_line
will show you the malformed "From" line.
p start
will show you at what offset in the file the mailbox code thinks the message was supposed to be.
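If you prefer a non-interactive check, here is a hedged sketch (the helper name and demo file are my own, not from the original post) that scans the raw mbox for "From " separator lines whose remainder is not pure ASCII. Note it is a heuristic: body lines that happen to start with "From " will also be flagged.

```python
import os
import tempfile

def find_bad_from_lines(path):
    """Return (lineno, line) for every 'From ' line containing non-ASCII bytes."""
    bad = []
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, 1):
            if line.startswith(b"From "):
                try:
                    line[5:].decode("ascii")
                except UnicodeDecodeError:
                    bad.append((lineno, line))
    return bad

# Build a tiny demo mbox: one good separator, one with a stray 0x93 byte.
path = os.path.join(tempfile.gettempdir(), "demo.mbox")
with open(path, "wb") as f:
    f.write(b"From - Mon Mar 30 18:18:04 2020\nSubject: ok\n\nbody\n")
    f.write(b"From - Mon \x93ar 30 18:18:04 2020\nSubject: bad\n\nbody\n")

print(find_bad_from_lines(path))  # flags the second separator, at line 5
```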
Previous answer that didn't solve the original problem but still applies
In the real world, there will be messages that don't comply with the standards. You can make the code more tolerant if you don't want to reject the bad messages. Decoding with "latin-1" is one way to handle headers with bytes outside ASCII. This cannot fail, because all 256 possible byte values map to valid Unicode characters (the first 256 code points of Unicode coincide one-to-one with ISO/IEC 8859-1, a.k.a. "latin-1"). It may or may not give you the text the sender intended.
import mailbox
from email.header import decode_header

mflder = mailbox.mbox("mailfolder")

def get_subject(message):
    header = message["subject"]
    if not header:
        return ''
    header, encoding = decode_header(header)[0]
    if encoding is not None:
        try:
            header = header.decode(encoding)
        except:
            header = header.decode('latin-1')
    return header

for message in mflder:
    print(get_subject(message))

Related

On the wrong foot with regular python reading of a text file, error line at the exception is wrong

When reading a UTF-8 text file in Python you may encounter an illegal UTF-8 character. Next you will probably try to find the line (number) containing the illegal character, but this will likely fail. This is illustrated by the code below.
Step 1: Create a file containing an illegal utf-8 character (a1 hex = 161 decimal)
filename = r"D:\wrong_utf8.txt"
longstring = "test just_a_text"*10
with open(filename, "wb") as f:
    for lineno in range(1, 100):
        if lineno == 85:
            f.write(f"{longstring}\terrrocharacter->".encode('utf-8')+bytes.fromhex('a1')+"\r\n".encode('utf-8'))
        else:
            f.write(f"{longstring}\t{lineno}\r\n".encode('utf-8'))
Step 2: Read the file and catch the error:
print("First pass, regular Python textline read.")
with open(filename, "r", encoding='utf8') as f:
    lineno = 0
    while True:
        try:
            lineno += 1
            line = f.readline()
            if not line:
                break
            print(lineno)
        except UnicodeDecodeError:
            print(f"UnicodeDecodeError at line {lineno}\n")
            break
It prints: UnicodeDecodeError at line 50
I would expect the error line to be line 85. However, lineno 50 is printed! So, the customer who sent the file to us was unable to find the illegal character. I tried additional parameters to the open statement (including buffering) but was unable to get the right error line number.
Note: if you sufficiently shorten the longstring, the problem goes away. So the problem probably has to do with python's internal buffering.
I succeeded by using the following code to find the error line:
print("Second pass, Python byteline read.")
with open(filename, 'rb') as f:
    lineno = 0
    while True:
        try:
            lineno += 1
            line = f.readline()
            if not line:
                break
            lineutf8 = line.decode('utf8')
            print(lineno)
        except UnicodeDecodeError:  # Exception as e:
            mybytelist = line.split(b'\t')
            for index, field in enumerate(mybytelist):
                try:
                    fieldutf8 = field.decode('utf8')
                except UnicodeDecodeError:
                    print(f'UnicodeDecodeError in line {lineno}, field {index+1}, offending field: {field}!')
                    break
            break
Now it prints the right lineno: UnicodeDecodeError in line 85, field 2, offending field: b'errrocharacter->\xa1\r\n'!
Is this the pythonic way of finding the error line? It works all right but I somehow have the feeling that a better method should be available where it is not required to read the file twice and/or use a binary read.
The actual cause is indeed the way Python internally processes text files. They are read in chunks, each chunk is decoded according to the specified encoding, and then, if you use readline or iterate over the file object, the decoded buffer is split into lines which are returned one at a time.
You can have an evidence of that by examining the UnicodeDecodeError object at the time of the error:
....
except UnicodeDecodeError as e:
    print(f"UnicodeDecodeError at line {lineno}\n")
    print(repr(e))  # or err = e to save the object and examine it later
    break
With your example data, you can find that Python was trying to decode a buffer of 8149 bytes, and that the offending character occurs at position 5836 in that buffer.
This processing is deep inside the Python io library, because text files have to be buffered and the binary buffer is decoded before being split into lines. So IMHO little can be done here, and the best way is probably your second try: read the file as a binary file and decode the lines one at a time.
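The start, end, and object attributes of the exception work the same on any bytes you decode yourself; a minimal in-memory sketch (the sample buffer is made up):

```python
data = b"good line\nbad: \xa1 here\n"
try:
    data.decode("utf8")
except UnicodeDecodeError as e:
    # e.object is the buffer being decoded; e.start/e.end bracket the bad bytes
    bad_bytes = e.object[e.start:e.end]
    line_no = e.object[:e.start].count(b"\n") + 1
    print(line_no, bad_bytes)  # 2 b'\xa1'
```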
Alternatively, you could use errors='replace' to replace any offending byte with a REPLACEMENT CHARACTER (U+FFFD). But then, you would no longer test for an error, but search for that character in the line:
with open(filename, "r", encoding='utf8', errors='replace') as f:
    lineno = 0
    while True:
        lineno += 1
        line = f.readline()
        if not line:
            break
        if chr(0xfffd) in line:
            print(f"UnicodeDecodeError at line {lineno}\n")
            break
        print(lineno)
This one also gives as expected:
...
80
81
82
83
84
UnicodeDecodeError at line 85
The UnicodeDecodeError has information about the error that can be used to improve the reporting of the error.
My proposal would be to decode the whole file in one go. If the content is good then there is no need to iterate around a loop. Especially as reading a binary file doesn't have the concept of lines.
If there is an error raised with the decode, then the UnicodeDecodeError has the start and end values of the bad content.
Only decoding up to that bad character allows the lines to be counted efficiently with len and splitlines.
If you want to display the bad line then doing the decode with replace errors set might be useful along with the line number from the previous step.
I would also consider raising a custom exception with the new information.
Here is an example:
from pathlib import Path

def create_bad(filename):
    longstring = "test just_a_text" * 10
    with open(filename, "wb") as f:
        for lineno in range(1, 100):
            if lineno == 85:
                f.write(f"{longstring}\terrrocharacter->".encode('utf-8') + bytes.fromhex('a1') + "\r\n".encode('utf-8'))
            else:
                f.write(f"{longstring}\t{lineno}\r\n".encode('utf-8'))

class BadUnicodeInFile(Exception):
    """Add information about line numbers"""
    pass

def new_read_bad(filename):
    file = Path(filename)
    data = file.read_bytes()
    try:
        file_content = data.decode('utf8')
    except UnicodeDecodeError as err:
        bad_line_no = len(err.object[:err.start].decode('utf8').splitlines())
        bad_line_content = err.object.decode('utf8', 'replace').splitlines()[bad_line_no - 1]
        bad_content = err.object[err.start:err.end]
        raise BadUnicodeInFile(
            f"{filename} has bad content ({bad_content}) on: line number {bad_line_no}\n"
            f"\t{bad_line_content}")
    return file_content

if __name__ == '__main__':
    create_bad("/tmp/wrong_utf8.txt")
    new_read_bad("/tmp/wrong_utf8.txt")
This gave the following output:
Traceback (most recent call last):
File "/home/user1/stack_overflow/wrong_utf8.py", line 39, in new_read_bad
file_content = data.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 14028: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user1/stack_overflow/wrong_utf8.py", line 52, in <module>
new_read_bad("/tmp/wrong_utf8.txt")
File "/home/user1/stack_overflow/wrong_utf8.py", line 44, in new_read_bad
raise BadUnicodeInFile(
__main__.BadUnicodeInFile: /tmp/wrong_utf8.txt has bad content (b'\xa1') on: line number 85
test just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_texttest just_a_text errrocharacter->�

Reading a csv file with some lines having wrong encoding and return incorrect lines to a user

I have some users uploading csv files to be ingested. In Python 2, I was able to open the file in binary, pass it to a unicodecsv.DictReader, and if some rows had an encoding issue, like an invalid Unicode character because the customer used CP1251 or something, I could log those rows and return exactly which rows had an issue.
With py3.7, it doesn't seem that I can do this -- the csv module requires the file to be decoded, and if I instead pass it a generator like (line.decode('utf8') for line in my_binary_file), I can't make it throw an exception only for the bad lines and keep going after. I tried using unicodecsv, even though it hasn't seen a commit in over four years and doesn't technically support py > 3.5, and it doesn't seem to work either -- the iterator just stops after the bad row.
I can see two ways around this, neither of which is appealing:
1) decode the file line by line beforehand and find bad lines, which is wasteful, or
2) write my own CSV parser which allows skipping of bad lines, which seems like asking for trouble.
Can I do this another way?
For reference, here's example code that worked in py2:
def unicode_safe_iterator(reader):
    while True:
        try:
            yield True, next(reader)
        except UnicodeDecodeError as exc:
            yield False, 'UnicodeDecodeError: %s' % str(exc)
        # uncomment for py3:
        # except StopIteration:
        #     return

def get_data_iter_from_csv(csv_file, ...):
    reader = unicodecsv.DictReader(csv_file)
    error_messages = []
    line_num = 1
    for valid, row in unicode_safe_iterator(reader):
        line_num += 1
        if not valid:
            error_messages.append(dict(line_number=line_num, error=row))
        else:
            row_data = validate_row_data(row)  # check for errors other than encoding, etc.
            if not error_messages:
                # stop yielding in case of errors, but keep iterating to find all errors.
                yield row_data
    if error_messages:
        raise ValidationError(Errors.CSV_FILE_ERRORS, error_items=error_messages)

data_iter = get_data_iter_from_csv(open(path_to_csv, 'rb'), ...)
Here is a workaround. We read the file as a byte stream, split it at newlines, and try to convert each line into a utf8 string. If that fails, we convert the improper parts into a cp1251 string.
Thereafter you can use io.StringIO to imitate an open file.
import csv, io

def convert(bl):
    rslt = []
    done = False
    pos = 0
    while not done:
        try:
            s = bl[pos:].decode("utf8")
            rslt.append(s)
            done = True
        except UnicodeDecodeError as ev:
            abs_start, abs_end = pos + ev.start, pos + ev.end
            rslt.append(bl[pos:abs_start].decode("utf8"))
            rslt.append(bl[abs_start:abs_end].decode("cp1251", errors="replace"))
            pos = abs_end
            if pos >= len(bl):
                done = True
    return "".join(rslt)

with open(path_to_csv, "rb") as ff:
    data = ff.read().split(b'\x0a')
text = [convert(line) for line in data]
text = "\n".join(text)
print(text)
rdr = csv.DictReader(io.StringIO(text))
It can be done at once, not line by line, too:
with open(path_to_csv, "rb") as ff:
    text = convert(ff.read())
rdr = csv.DictReader(io.StringIO(text))
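Since csv.reader and csv.DictReader accept any iterable of strings, another option (a sketch with made-up names, not from the original post) is a generator that decodes each line on the fly, recording which lines needed repair while letting the parse continue. Caveat: a bad byte inside a quoted multi-line field would shift the reported numbers.

```python
import csv
import io

def tolerant_lines(binary_file, bad_rows):
    """Decode each line as UTF-8, recording line numbers that needed repair."""
    for lineno, raw in enumerate(binary_file, 1):
        try:
            yield raw.decode("utf8")
        except UnicodeDecodeError:
            bad_rows.append(lineno)
            yield raw.decode("utf8", errors="replace")

bad = []
data = io.BytesIO(b"a,b\n1,2\n3,\xa1\n")  # line 3 holds an invalid byte
rows = list(csv.reader(tolerant_lines(data, bad)))
print(bad)  # [3]
```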

Python: Using SSML with SAPI (comtypes)

TL;DR: I'm trying to pass an XML object (using ET) to a Comtypes (SAPI) object in python 3.7.2 on Windows 10. It's failing due to invalid chars (see error below). Unicode characters are read correctly from the file, can be printed (but do not display correctly on the console). It seems like the XML is being passed as ASCII or that I'm missing a flag? (https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ee431843(v%3Dvs.85)). If it is a missing flag, how do I pass it? (I haven't figured that part out yet..)
Long form description
I'm using Python 3.7.2 on Windows 10 and trying to create an SSML (https://www.w3.org/TR/speech-synthesis/) document to use with Microsoft's speech API. The voice struggles with certain words, and the SSML format supports a phoneme tag, which allows you to specify how to pronounce a given word. Microsoft implements parts of the standard (https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#phoneme-element), so I found a UTF-8 encoded library containing IPA pronunciations. When I try to call the SAPI (with parts of the code replaced), I get the following error:
Traceback (most recent call last):
File "pdf_to_speech.py", line 132, in <module>
audioConverter(text = "Hello world extended test",outputFile = output_file)
File "pdf_to_speech.py", line 88, in __call__
self.engine.speak(text)
_ctypes.COMError: (-2147200902, None, ("'ph' attribute in 'phoneme' element is not valid.", None, None, 0, None))
I've been trying to debug, but when I print the pronunciations of the words the characters are boxes. However if I copy and paste them from my console, they look fine (see below).
həˈloʊ,
ˈwɝːld
ɪkˈstɛndəd,
ˈtɛst
Best Guess
I'm unsure whether the problem is caused by
1) I've changed versions of pythons to be able to print unicode
2) I fixed problems with reading the file
3) I had incorrect manipulations of the string
I'm pretty sure the problem is that I'm not passing it as a unicode to the comtype object. The ideas I'm looking into are
1) Is there a flag missing?
2) Is it being converted to ascii when its being passed to comtypes (C types error)?
3) Is the XML being passed incorrectly/ am I missing a step?
Sneak peek at the code
This is the class that reads the IPA dictionary and then generates the XML file. Look at _load_phonemes and _pronounce.
class SSML_Generator:
    def __init__(self, pause, phonemeFile):
        self.pause = pause
        if isinstance(phonemeFile, str):
            print("Loading dictionary")
            self.phonemeDict = self._load_phonemes(phonemeFile)
            print(len(self.phonemeDict))
        else:
            self.phonemeDict = {}

    def _load_phonemes(self, phonemeFile):
        phonemeDict = {}
        with io.open(phonemeFile, 'r', encoding='utf-8') as f:
            for line in f:
                tok = line.split()
                #print(len(tok))
                phonemeDict[tok[0].lower()] = tok[1].lower()
        return phonemeDict

    def __call__(self, text):
        SSML_document = self._header()
        for utterance in text:
            parent_tag = self._pronounce(utterance, SSML_document)
            #parent_tag.tail = self._pause(parent_tag)
            SSML_document.append(parent_tag)
        ET.dump(SSML_document)
        return SSML_document

    def _pause(self, parent_tag):
        return ET.fromstring("<break time=\"150ms\" />")  # ET.SubElement(parent_tag,"break",{"time":str(self.pause)+"ms"})

    def _header(self):
        return ET.Element("speak", {"version": "1.0", "xmlns": "http://www.w3.org/2001/10/synthesis", "xml:lang": "en-US"})

    # TODO: Add rate https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#prosody-element
    def _rate(self):
        pass

    # TODO: Add pitch
    def _pitch(self):
        pass

    def _pronounce(self, word, parent_tag):
        if word in self.phonemeDict:
            sys.stdout.buffer.write(self.phonemeDict[word].encode("utf-8"))
            return ET.fromstring("<phoneme alphabet=\"ipa\" ph=\"" + self.phonemeDict[word] + "\"> </phoneme>")  # ET.SubElement(parent_tag,"phoneme",{"alphabet":"ipa","ph":self.phonemeDict[word]})
        else:
            return parent_tag

    # Nice to have: Transform acronyms into their pronunciation (See say as tag)
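One thing worth noting: _pronounce builds the phoneme element by string concatenation, so any stray quote or special character in the IPA string could itself produce a malformed ph attribute. Whether or not that is the cause here, a hedged sketch (my own helper, not from the original code) that lets ElementTree do the quoting:

```python
import xml.etree.ElementTree as ET

def phoneme_element(word, ipa):
    # ET escapes attribute values itself, so quotes or special characters
    # in the IPA string cannot produce malformed XML
    el = ET.Element("phoneme", {"alphabet": "ipa", "ph": ipa})
    el.text = word
    return el

xml_text = ET.tostring(phoneme_element("hello", "həˈloʊ"), encoding="unicode")
print(xml_text)
```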
I've also added how the code writes to the comtype object (SAPI) in case the error is there.
def __call__(self, text, outputFile):
    # https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms723606(v%3Dvs.85)
    self.stream.Open(outputFile + ".wav", self.SpeechLib.SSFMCreateForWrite)
    self.engine.AudioOutputStream = self.stream
    text = self._text_processing(text)
    text = self.SSML_generator(text)
    text = ET.tostring(text, encoding='utf8', method='xml').decode('utf-8')
    self.engine.speak(text)
    self.stream.Close()
Thanks in advance for your help!
Try using single quotes inside the ph attribute.
Like this
my_text = '<speak><phoneme alphabet="x-sampa" ph=\'v"e.de.ni.e\'>ведение</phoneme></speak>'
also remember to use \ to escape single quote
UPD
Also this error could mean that your ph cannot be parsed. You can check docs there: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup
this example will work
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-Jessa24kRUS">
<s>His name is Mike <phoneme alphabet="ups" ph="JH AU"> Zhou </phoneme></s>
</voice>
</speak>
but this doesn't
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-Jessa24kRUS">
<s>His name is Mike <phoneme alphabet="ups" ph="JHU AUA"> Zhou </phoneme></s>
</voice>
</speak>

Receiving a Windows message in Python - UnicodeEncodeError ...?

I have a little Python 3.3 script that successfully sends (SendMessage) a WM_COPYDATA message (inspired from here , works with XYplorer):
import win32api
import win32gui
import struct
import array

def sendScript(window, message):
    CopyDataStruct = "IIP"
    dwData = 0x00400001  # value required by XYplorer
    buffer = array.array("u", message)
    cds = struct.pack(CopyDataStruct, dwData, buffer.buffer_info()[1] * 2 + 1, buffer.buffer_info()[0])
    win32api.SendMessage(window, 0x004A, 0, cds)  # 0x004A is the WM_COPYDATA id

message = "helloworld"
sendScript(window, message)  # I write the hwnd manually during debugging
Now I need to write a receiver script, still in Python. The script in this answer seems to work (after converting all the print statements to the print() form). "Seems", because it prints out all properties of the received message (hwnd, wparam, lparam, etc.) except the content of the message.
I get an error instead, UnicodeEncodeError. More specifically,
Python WNDPROC handler failed
Traceback (most recent call last):
File "C:\Python\xxx.py", line 45, in OnCopyData
print(ctypes.wstring_at(pCDS.contents.lpData))
File "C:\Python\python-3.3.2\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-13: character maps to <undefined>
I don't know how to fix it, also because I'm not using "fancy" characters in the message so I really can't see why I get this error. I also tried setting a different length of message in print(ctypes.wstring_at(pCDS.contents.lpData)) as well as using simply string_at, but without success (in the latter case I obtain a binary string).
ctypes.wstring_at (in the line print(ctypes.wstring_at(pCDS.contents.lpData))) may not match the string type that the sender sent. Try changing it to:
print (ctypes.string_at(pCDS.contents.lpData))
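Two separate things are worth untangling here, sketched below under the assumption that the sender packed the buffer with array.array("u", ...), which stores UTF-16 on Windows: string_at returns raw bytes that still need decoding, and the traceback actually shows print() failing to encode for the cp850 console, not the read failing.

```python
# Simulate the bytes a Windows sender would place in lpData.
raw = "helloworld".encode("utf-16-le") + b"\x00\x00"
msg = raw.decode("utf-16-le").rstrip("\x00")

def console_safe(text, codec="cp850"):
    # Re-encode with errors="replace" so a limited console codepage
    # cannot raise UnicodeEncodeError on characters it lacks
    return text.encode(codec, errors="replace").decode(codec)

print(console_safe(msg))  # helloworld
```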

Parsing a PDF with no /Root object using PDFMiner

I'm trying to extract text from a large number of PDFs using PDFMiner python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of PDFs:
ipython stack trace:
/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
331 break
332 else:
--> 333 raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
334 if self.catalog.get('Type') is not LITERAL_CATALOG:
335 if STRICT:
PDFSyntaxError: No /Root object! - Is this really a PDF?
Of course, I immediately checked to see whether or not these PDFs were corrupted, but they can be read just fine.
Is there any way to read these PDFs despite the absence of a root object? I'm not too sure where to go from here.
Many thanks!
Edit:
I tried using PyPDF in an attempt to get some differential diagnostics. The stack trace is below:
In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
372 self.flattenedPages = None
373 self.resolvedObjects = {}
--> 374 self.read(stream)
375 self.stream = stream
376 self._override_encryption = False
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
708 line = self.readNextEndLine(stream)
709 if line[:5] != "%%EOF":
--> 710 raise utils.PdfReadError, "EOF marker not found"
711
712 # find startxref entry - the location of the xref table
PdfReadError: EOF marker not found
Quonux suggested that perhaps PDFMiner stopped parsing after reaching the first EOF character. This would seem to suggest otherwise, but I'm very much clueless. Any thoughts?
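One way to verify pyPdf's complaint directly is a self-contained check (the helper name and demo file are my own) for whether the file's tail contains the %%EOF marker at all:

```python
import os
import tempfile

def has_eof_marker(path, tail=1024):
    """Return True if the last `tail` bytes of the file contain %%EOF."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail))
        return b"%%EOF" in f.read()

# Demo on a minimal stand-in file; a real PDF ends the same way.
good = os.path.join(tempfile.gettempdir(), "good.pdf")
with open(good, "wb") as f:
    f.write(b"%PDF-1.4\n...\n%%EOF\n")

print(has_eof_marker(good))  # True
```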
The solution in slate pdf is to use 'rb', i.e. read-binary mode. Because slate pdf depends on PDFMiner and I had the same problem, this should solve your problem.
fp = open(r'C:\Users\USER\workspace\slate_minner\document1.pdf', 'rb')
doc = slate.PDF(fp)
print doc
Interesting problem. I performed some research.
The function which parses the pdf (from PDFMiner's source code):
def set_parser(self, parser):
    "Set the document to use a given PDFParser object."
    if self._parser: return
    self._parser = parser
    # Retrieve the information of each header that was appended
    # (maybe multiple times) at the end of the document.
    self.xrefs = parser.read_xref()
    for xref in self.xrefs:
        trailer = xref.get_trailer()
        if not trailer: continue
        # If there's an encryption info, remember it.
        if 'Encrypt' in trailer:
            #assert not self.encryption
            self.encryption = (list_value(trailer['ID']),
                               dict_value(trailer['Encrypt']))
        if 'Info' in trailer:
            self.info.append(dict_value(trailer['Info']))
        if 'Root' in trailer:
            # Every PDF file must have exactly one /Root dictionary.
            self.catalog = dict_value(trailer['Root'])
            break
    else:
        raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    if self.catalog.get('Type') is not LITERAL_CATALOG:
        if STRICT:
            raise PDFSyntaxError('Catalog not found!')
    return
If there were a problem with the EOF, another exception would be raised:
'''another function from the source'''
def load(self, parser, debug=0):
    while 1:
        try:
            (pos, line) = parser.nextline()
            if not line.strip(): continue
        except PSEOF:
            raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
        if not line:
            raise PDFNoValidXRef('Premature eof: %r' % parser)
        if line.startswith('trailer'):
            parser.seek(pos)
            break
        f = line.strip().split(' ')
        if len(f) != 2:
            raise PDFNoValidXRef('Trailer not found: %r: line=%r' % (parser, line))
        try:
            (start, nobjs) = map(long, f)
        except ValueError:
            raise PDFNoValidXRef('Invalid line: %r: line=%r' % (parser, line))
        for objid in xrange(start, start+nobjs):
            try:
                (_, line) = parser.nextline()
            except PSEOF:
                raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
            f = line.strip().split(' ')
            if len(f) != 3:
                raise PDFNoValidXRef('Invalid XRef format: %r, line=%r' % (parser, line))
            (pos, genno, use) = f
            if use != 'n': continue
            self.offsets[objid] = (int(genno), long(pos))
    if 1 <= debug:
        print >>sys.stderr, 'xref objects:', self.offsets
    self.load_trailer(parser)
    return
From Wikipedia (PDF specs):
A PDF file consists primarily of objects, of which there are eight types:
Boolean values, representing true or false
Numbers
Strings
Names
Arrays, ordered collections of objects
Dictionaries, collections of objects indexed by Names
Streams, usually containing large amounts of data
The null object
Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF.
I think the problem is that your "damaged pdf" has a few /Root elements.
Possible solution:
You can download the sources and add print calls at each place where xref objects are retrieved and where the parser tries to parse those objects. That will make it possible to determine the full context of the error (before the exception appears).
PS: I think it is some kind of bug in the product.
I have had this same problem in Ubuntu. I have a very simple solution: just print the pdf file as a pdf. If you are in Ubuntu:
Open the pdf file using the (Ubuntu) document viewer.
Go to File
Go to Print
Choose "print to file" and check the "pdf" mark
If you want to make the process automatic, follow for instance this, i.e., use this script to print all your pdf files automatically. A Linux script like this also works:
for f in *.pdfx
do
    lowriter --headless --convert-to pdf "$f"
done
Note that I renamed the original (problematic) pdf files to .pdfx.
I got this error as well and kept trying
fp = open('example', 'rb')
However, I still got the error the OP indicated. What I found is that I had a bug in my code where the PDF was still open by another function.
So make sure you don't have the PDF open in memory elsewhere as well.
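A context manager makes that guarantee automatic: the handle is closed as soon as the block exits, so no other part of the program is left holding the file open. A minimal sketch (the demo file is created on the spot; with slate you would call slate.PDF(fp) inside the with block):

```python
import os
import tempfile

# Create a stand-in file for the demo.
path = os.path.join(tempfile.gettempdir(), "example.pdf")
with open(path, "wb") as f:
    f.write(b"%PDF-1.4\n%%EOF\n")

# The handle is guaranteed closed once the block exits.
with open(path, "rb") as fp:
    data = fp.read()

print(fp.closed)  # True
```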
An answer above is right. This error appears only on Windows, and the workaround is to replace
with open(path, 'rb')
with
fp = open(path, 'rb')
