Parsing a PDF with no /Root object using PDFMiner - python

I'm trying to extract text from a large number of PDFs using PDFMiner python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of PDFs:
ipython stack trace:
/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
331 break
332 else:
--> 333 raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
334 if self.catalog.get('Type') is not LITERAL_CATALOG:
335 if STRICT:
PDFSyntaxError: No /Root object! - Is this really a PDF?
Of course, I immediately checked to see whether or not these PDFs were corrupted, but they can be read just fine.
Is there any way to read these PDFs despite the absence of a root object? I'm not too sure where to go from here.
Many thanks!
Edit:
I tried using PyPDF in an attempt to get some differential diagnostics. The stack trace is below:
In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
372 self.flattenedPages = None
373 self.resolvedObjects = {}
--> 374 self.read(stream)
375 self.stream = stream
376 self._override_encryption = False
/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
708 line = self.readNextEndLine(stream)
709 if line[:5] != "%%EOF":
--> 710 raise utils.PdfReadError, "EOF marker not found"
711
712 # find startxref entry - the location of the xref table
PdfReadError: EOF marker not found
Quonux suggested that perhaps PDFMiner stopped parsing after reaching the first EOF character. This would seem to suggest otherwise, but I'm very much clueless. Any thoughts?

The fix when using slate is to open the file in 'rb' (read binary) mode.
Since slate depends on PDFMiner and I had the same problem, this should solve yours too.
import slate
# raw string so the backslashes in the Windows path are not treated as escapes
fp = open(r'C:\Users\USER\workspace\slate_minner\document1.pdf', 'rb')
doc = slate.PDF(fp)
print doc
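Before blaming the parser, it can help to confirm that the bytes read from disk still look like a PDF. The sketch below is only a rough sanity check under stated assumptions: the helper name pdf_markers and the 1024-byte windows are made up for illustration and are not part of PDFMiner or slate.

```python
def pdf_markers(data):
    """Return (has_header, has_eof) for the raw bytes of a file.

    Hypothetical helper: it only looks for the "%PDF-" signature near the
    start and the "%%EOF" marker near the end of the data.
    """
    has_header = b"%PDF-" in data[:1024]
    has_eof = b"%%EOF" in data[-1024:]
    return has_header, has_eof


# Usage sketch (the file name is a placeholder):
# with open("document1.pdf", "rb") as fp:
#     print(pdf_markers(fp.read()))
```

If the header is found but the end marker is not, the file was probably truncated or read in text mode, which is consistent with the "EOF marker not found" error from pyPdf.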

Interesting problem. I did some research.
This is the function that raises the error (from PDFMiner's source code):
def set_parser(self, parser):
    "Set the document to use a given PDFParser object."
    if self._parser: return
    self._parser = parser
    # Retrieve the information of each header that was appended
    # (maybe multiple times) at the end of the document.
    self.xrefs = parser.read_xref()
    for xref in self.xrefs:
        trailer = xref.get_trailer()
        if not trailer: continue
        # If there's an encryption info, remember it.
        if 'Encrypt' in trailer:
            #assert not self.encryption
            self.encryption = (list_value(trailer['ID']),
                               dict_value(trailer['Encrypt']))
        if 'Info' in trailer:
            self.info.append(dict_value(trailer['Info']))
        if 'Root' in trailer:
            # Every PDF file must have exactly one /Root dictionary.
            self.catalog = dict_value(trailer['Root'])
            break
    else:
        raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    if self.catalog.get('Type') is not LITERAL_CATALOG:
        if STRICT:
            raise PDFSyntaxError('Catalog not found!')
    return
If the problem were with EOF, a different exception would be raised:
'''another function from source'''
def load(self, parser, debug=0):
    while 1:
        try:
            (pos, line) = parser.nextline()
            if not line.strip(): continue
        except PSEOF:
            raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
        if not line:
            raise PDFNoValidXRef('Premature eof: %r' % parser)
        if line.startswith('trailer'):
            parser.seek(pos)
            break
        f = line.strip().split(' ')
        if len(f) != 2:
            raise PDFNoValidXRef('Trailer not found: %r: line=%r' % (parser, line))
        try:
            (start, nobjs) = map(long, f)
        except ValueError:
            raise PDFNoValidXRef('Invalid line: %r: line=%r' % (parser, line))
        for objid in xrange(start, start+nobjs):
            try:
                (_, line) = parser.nextline()
            except PSEOF:
                raise PDFNoValidXRef('Unexpected EOF - file corrupted?')
            f = line.strip().split(' ')
            if len(f) != 3:
                raise PDFNoValidXRef('Invalid XRef format: %r, line=%r' % (parser, line))
            (pos, genno, use) = f
            if use != 'n': continue
            self.offsets[objid] = (int(genno), long(pos))
    if 1 <= debug:
        print >>sys.stderr, 'xref objects:', self.offsets
    self.load_trailer(parser)
    return
From Wikipedia (on the PDF specification):
A PDF file consists primarily of objects, of which there are eight types:
Boolean values, representing true or false
Numbers
Strings
Names
Arrays, ordered collections of objects
Dictionaries, collections of objects indexed by Names
Streams, usually containing large amounts of data
The null object
Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file (incremental update). Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects and is especially useful for Tagged PDF.
I think the problem is that your "damaged" PDF contains several root elements.
Possible solution:
You can download the PDFMiner sources and add print statements wherever the xref objects are retrieved and wherever the parser tries to parse them. That lets you see everything that happens before the error appears.
P.S. I think this is some kind of bug in the library.
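A lighter alternative to patching the library is to count the structural keywords in the raw bytes first. This is only a rough sketch: the function name count_structure_markers is hypothetical, and a plain byte search can also match occurrences inside stream data. Note as well that PDFs using cross-reference streams (PDF 1.5 and later) may legitimately contain no trailer keyword.

```python
def count_structure_markers(data):
    """Count occurrences of a few PDF structural keywords in raw bytes.

    Hypothetical diagnostic helper, not part of PDFMiner. The counts are
    hints only, because the keywords may also occur inside stream data.
    """
    markers = (b"startxref", b"trailer", b"/Root", b"%%EOF")
    return {marker.decode("ascii"): data.count(marker) for marker in markers}


# Usage sketch (the file name is a placeholder):
# with open("failing.pdf", "rb") as fp:
#     print(count_structure_markers(fp.read()))
```

More than one %%EOF usually just means the file was updated incrementally; a count of zero for /Root is the case that matches the PDFMiner error.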

I have had this same problem in Ubuntu. I have a very simple solution: just print the PDF file as a PDF. If you are in Ubuntu:
Open the PDF file using the (Ubuntu) document viewer.
Go to File.
Go to Print.
Choose to print to a file and tick the "pdf" option.
If you want to automate the process, you can use a script that prints all your PDF files automatically. A Linux script like this also works:
for f in *.pdfx
do
    lowriter --headless --convert-to pdf "$f"
done
Note that I renamed the original (problematic) PDF files to .pdfx.

I got this error as well, even though I was already using
fp = open('example','rb')
What I eventually found was a bug in my code: the PDF was still held open by another function.
So make sure you don't have the PDF open elsewhere as well.

An answer above is right. This error appears only on Windows, and the workaround is to change
with open(path, 'rb')
to
fp = open(path,'rb')


Decoding Mail Subject Thunderbird in Python 3.x

For a workaround, see below
/Original Question:
Sorry, I am simply too dumb to solve this on my own. I am trying to read the "subjects" from several emails stored in a .mbox folder from Thunderbird. Now, I am trying to decode the header with decode_header(), but I am still getting UnicodeErrors.
I am using the following function (I am sure there is a smarter way to do this, but this is not the point of this post)
import mailbox
from email.header import decode_header
def header_to_string(header):
    try:
        # only the first decoded part of the header is used here
        header, encoding = decode_header(header)[0]
    except Exception:
        return "something went wrong {}".format(header)
    if encoding is None:
        return header
    else:
        return header.decode(encoding)
mflder = mailbox.mbox("mailfolder")
for message in mflder:
    print(header_to_string(message["subject"]))
The first 100 outputs or so are perfectly fine, but then this error message appears:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-97-e252df04c215> in <module>
----> 1 for message in mflder:
2 try:
3 print(header_to_string(message["subject"]))
4 except:
5 print("0")
~\anaconda3\lib\mailbox.py in itervalues(self)
107 for key in self.iterkeys():
108 try:
--> 109 value = self[key]
110 except KeyError:
111 continue
~\anaconda3\lib\mailbox.py in __getitem__(self, key)
71 """Return the keyed message; raise KeyError if it doesn't exist."""
72 if not self._factory:
---> 73 return self.get_message(key)
74 else:
75 with contextlib.closing(self.get_file(key)) as file:
~\anaconda3\lib\mailbox.py in get_message(self, key)
779 string = self._file.read(stop - self._file.tell())
780 msg = self._message_factory(string.replace(linesep, b'\n'))
--> 781 msg.set_from(from_line[5:].decode('ascii'))
782 return msg
783
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 4: ordinal not in range(128)
How can I force mailbox.py to decode a different encoding? Or is the header simply broken? And if I understood this correctly, headers are supposed to be "ASCII", right? I mean, this is the point of this entire MIME thing, no?
Thanks for your help!
/Workaround
I found a workaround: do not iterate directly over the .mbox mail folder object. Instead of using ...
for message in mflder:
    # do something
... simply use:
for x in range(len(mflder)):
    try:
        message = mflder[x]
        print(header_to_string(message["subject"]))
    except:
        print("Failed loading message!")
This skips the broken messages in the .mbox folder. I ran into several other issues while working with the subjects, though. For instance, decode_header() sometimes splits a header into several tuples, so to get the full subject the header_to_string() function has to handle all of them. But that is no longer related to this question. I am a noob and a hobby programmer, but I remember working with the Outlook API and Python, which was MUCH easier...
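For the split-header issue mentioned above, one possible approach is to decode every part returned by decode_header() and join the results. This is a sketch, not the original poster's code: the name header_to_text is made up, and the fallback to latin-1 for undecodable parts is an assumption that may not match what the sender intended.

```python
from email.header import decode_header


def header_to_text(raw):
    """Decode all parts of a (possibly MIME-encoded) header into one string.

    Hypothetical helper. Parts that cannot be decoded with their declared
    charset fall back to latin-1, which never fails but may be wrong.
    """
    if raw is None:
        return ""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            try:
                chunk = chunk.decode(charset or "ascii")
            except (LookupError, UnicodeDecodeError):
                chunk = chunk.decode("latin-1")
        parts.append(chunk)
    return "".join(parts)
```

The standard library also offers email.header.make_header(), which can be wrapped in str() for a similar result when every declared charset is valid.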
Solution
It looks like either you have a corrupted "mailfolder" mbox file or there is a bug in Python's mailbox module triggered by something in your file. I can't tell what is going on without the mbox input file or a minimal example input file that reproduces the issue.
You could do some debugging yourself. Each message in the file starts with a "From" line that should look like:
From - Mon Mar 30 18:18:04 2020
From the stack trace you posted, it looks like that line is malformed in one of the messages. Personally, I would use an IDE debugger (PyCharm) to track down the malformed line, but it can also be done with Python's built-in pdb. Wrap your loop like this:
import pdb
try:
    for message in mflder:
        print(header_to_string(message["subject"]))
except:
    pdb.post_mortem()
When you run the code now, it will drop into the debugger when the exception occurs. At that prompt, you can enter l to list the code where the debugger stopped; this should match the last frame printed in your stack trace you originally posted. Once you are there, there are two commands that will tell you what is going on:
p from_line
will show you the malformed "From" line.
p start
will show you at what offset in the file the mailbox code thinks the message was supposed to be.
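If stepping through the debugger is inconvenient, the suspect lines can also be located by scanning the mbox file as raw bytes. The sketch below assumes the separator lines start with "From " as described above; the function name non_ascii_from_lines is hypothetical, and body lines that happen to start with "From " will be checked too.

```python
def non_ascii_from_lines(path):
    """Return (line_number, raw_line) for "From " lines that are not ASCII.

    Hypothetical diagnostic helper: it reads the file in binary mode so that
    no decoding happens before the check.
    """
    bad = []
    with open(path, "rb") as fp:
        for line_number, line in enumerate(fp, start=1):
            if line.startswith(b"From "):
                try:
                    line.decode("ascii")
                except UnicodeDecodeError:
                    bad.append((line_number, line.rstrip(b"\r\n")))
    return bad


# Usage sketch: print(non_ascii_from_lines("mailfolder"))
```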
Previous answer that didn't solve the original problem but still applies
In the real world, there will be messages that don't comply with the standards. You can try to make the code more tolerant if you don't want to reject the bad messages. Decoding with "latin-1" is one way to handle headers with bytes outside ASCII. This cannot fail because all possible byte values map to valid Unicode characters (one-to-one mapping of the first 256 codes of Unicode vs. ISO/IEC 8859-1, a.k.a. "latin-1"). This may or may not give you the text the sender intended.
import mailbox
from email.header import decode_header
mflder = mailbox.mbox("mailfolder")
def get_subject(message):
    header = message["subject"]
    if not header:
        return ''
    header, encoding = decode_header(header)[0]
    if encoding is not None:
        try:
            header = header.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # unknown or wrong charset: fall back to latin-1, which never fails
            header = header.decode('latin-1')
    return header
for message in mflder:
    print(get_subject(message))
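A short illustration of why latin-1 never fails yet may still give the wrong text. The sample bytes are made up; byte 0x93 is used because it is the one in the traceback, and the guess that such bytes come from Windows-1252 "smart quotes" is only an assumption.

```python
# Every byte value decodes under latin-1 and round-trips unchanged.
all_bytes = bytes(range(256))
round_trip_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# The same bytes read differently depending on the charset that is assumed.
raw = b"\x93quoted\x94"
as_latin1 = raw.decode("latin-1")  # C1 control characters U+0093 and U+0094
as_cp1252 = raw.decode("cp1252")   # curly quotes U+201C and U+201D
```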

Reading a csv file with some lines having wrong encoding and return incorrect lines to a user

I have some users uploading csv files to be ingested. In Python 2, I was able to open the file in binary, pass it to a unicodecsv.DictReader, and if some rows had an encoding issue, like an invalid Unicode character because the customer used CP1251 or something, I could log those rows and return exactly which rows had an issue.
With py3.7, it doesn't seem that I can do this -- the csv module requires the file to be decoded, and if I instead pass it a generator like (line.decode('utf8') for line in my_binary_file), I can't make it throw an exception only for the bad lines and keep going after. I tried using unicodecsv, even though it hasn't seen a commit in over four years and doesn't technically support py > 3.5, and it doesn't seem to work either -- the iterator just stops after the bad row.
I can see two ways around this, neither of which is appealing:
1) decode the file line by line beforehand and find bad lines, which is wasteful, or
2) write my own CSV parser which allows skipping of bad lines, which seems like asking for trouble.
Can I do this another way?
For reference, here's example code that worked in py2:
def unicode_safe_iterator(reader):
    while True:
        try:
            yield True, next(reader)
        except UnicodeDecodeError as exc:
            yield False, 'UnicodeDecodeError: %s' % str(exc)
        # uncomment for py3:
        # except StopIteration:
        #     return
def get_data_iter_from_csv(csv_file, ...):
    reader = unicodecsv.DictReader(csv_file)
    error_messages = []
    line_num = 1
    for valid, row in unicode_safe_iterator(reader):
        line_num += 1
        if not valid:
            error_messages.append(dict(line_number=line_num, error=row))
        else:
            row_data = validate_row_data(row) # check for errors other than encoding, etc.
            if not error_messages:
                # stop yielding in case of errors, but keep iterating to find all errors.
                yield row_data
    if error_messages:
        raise ValidationError(Errors.CSV_FILE_ERRORS, error_items=error_messages)
data_iter = get_data_iter_from_csv(open(path_to_csv, 'rb'), ...)
Here is a workaround. We read the file as a byte stream, split it at newlines and try to decode each line as UTF-8. If that fails, we decode the offending bytes as cp1251 instead.
Thereafter you can use io.StringIO to imitate an open file.
import csv, io
def convert(bl):
    rslt=[]
    done=False
    pos=0
    while not done:
        try:
            s=bl[pos:].decode("utf8")
            rslt.append(s)
            done=True
        except UnicodeDecodeError as ev:
            abs_start, abs_end= pos+ev.start, pos+ev.end
            rslt.append(bl[pos:abs_start].decode("utf8"))
            rslt.append(bl[abs_start:abs_end].decode("cp1251",errors="replace"))
            pos= abs_end
            if pos>= len(bl):
                done=True
    return "".join(rslt)
with open(path_to_csv,"rb") as ff:
    data= ff.read().split(b'\x0a')
    text= [ convert(line) for line in data ]
    text="\n".join(text)
    print(text)
    rdr= csv.DictReader(io.StringIO(text))
It can also be done in one go rather than line by line:
with open(path_to_csv,"rb") as ff:
    text= convert( ff.read() )
    rdr= csv.DictReader(io.StringIO(text))
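If the goal is to report the bad lines rather than repair them, a variant of the same idea is to decode line by line, remember which physical lines failed, and hand only the good lines to the csv module. This is a sketch with a made-up function name, and it assumes no quoted field spans several lines; with multi-line fields, dropping one physical line would corrupt the record around it.

```python
import csv
import io


def read_csv_reporting_bad_lines(binary_file, encoding="utf-8"):
    """Return (rows, bad_lines) for a CSV file opened in binary mode.

    Hypothetical helper: bad_lines holds (line_number, error_message) for
    every physical line that could not be decoded; those lines are skipped.
    """
    good_lines = []
    bad_lines = []
    for line_number, raw in enumerate(binary_file, start=1):
        try:
            good_lines.append(raw.decode(encoding))
        except UnicodeDecodeError as exc:
            bad_lines.append((line_number, str(exc)))
    reader = csv.DictReader(io.StringIO("".join(good_lines), newline=""))
    return list(reader), bad_lines


# Usage sketch:
# with open(path_to_csv, "rb") as ff:
#     rows, bad_lines = read_csv_reporting_bad_lines(ff)
```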

Python: Using SSML with SAPI (comtypes)

TL;DR: I'm trying to pass an XML object (using ET) to a Comtypes (SAPI) object in python 3.7.2 on Windows 10. It's failing due to invalid chars (see error below). Unicode characters are read correctly from the file, can be printed (but do not display correctly on the console). It seems like the XML is being passed as ASCII or that I'm missing a flag? (https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ee431843(v%3Dvs.85)). If it is a missing flag, how do I pass it? (I haven't figured that part out yet..)
Long form description
I'm using Python 3.7.2 on Windows 10 and trying to create an XML (SSML: https://www.w3.org/TR/speech-synthesis/) document to use with Microsoft's speech API. The voice struggles with certain words, and when I looked at the SSML format I saw that it supports a phoneme tag, which lets you specify how to pronounce a given word. Microsoft implements parts of the standard (https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#phoneme-element), so I found a UTF-8 encoded library containing IPA pronunciations. When I try to call SAPI, with parts of the code replaced, I get the following error:
Traceback (most recent call last):
File "pdf_to_speech.py", line 132, in <module>
audioConverter(text = "Hello world extended test",outputFile = output_file)
File "pdf_to_speech.py", line 88, in __call__
self.engine.speak(text)
_ctypes.COMError: (-2147200902, None, ("'ph' attribute in 'phoneme' element is not valid.", None, None, 0, None))
I've been trying to debug, but when I print the pronunciations of the words the characters are boxes. However if I copy and paste them from my console, they look fine (see below).
həˈloʊ,
ˈwɝːld
ɪkˈstɛndəd,
ˈtɛst
Best Guess
I'm unsure whether the problem is caused by
1) I've changed versions of pythons to be able to print unicode
2) I fixed problems with reading the file
3) I had incorrect manipulations of the string
I'm pretty sure the problem is that I'm not passing it as a unicode to the comtype object. The ideas I'm looking into are
1) Is there a flag missing?
2) Is it being converted to ascii when its being passed to comtypes (C types error)?
3) Is the XML being passed incorrectly/ am I missing a step?
Sneak peek at the code
This is the class that reads the IPA dictionary and then generates the XML file. Look at _load_phonemes and _pronounce.
class SSML_Generator:
    def __init__(self,pause,phonemeFile):
        self.pause = pause
        if isinstance(phonemeFile,str):
            print("Loading dictionary")
            self.phonemeDict = self._load_phonemes(phonemeFile)
            print(len(self.phonemeDict))
        else:
            self.phonemeDict = {}
    def _load_phonemes(self, phonemeFile):
        phonemeDict = {}
        with io.open(phonemeFile, 'r',encoding='utf-8') as f:
            for line in f:
                tok = line.split()
                #print(len(tok))
                phonemeDict[tok[0].lower()] = tok[1].lower()
        return phonemeDict
    def __call__(self,text):
        SSML_document = self._header()
        for utterance in text:
            parent_tag = self._pronounce(utterance,SSML_document)
            #parent_tag.tail = self._pause(parent_tag)
            SSML_document.append(parent_tag)
        ET.dump(SSML_document)
        return SSML_document
    def _pause(self,parent_tag):
        return ET.fromstring("<break time=\"150ms\" />") # ET.SubElement(parent_tag,"break",{"time":str(self.pause)+"ms"})
    def _header(self):
        return ET.Element("speak",{"version":"1.0", "xmlns":"http://www.w3.org/2001/10/synthesis", "xml:lang":"en-US"})
    # TODO: Add rate https://learn.microsoft.com/en-us/cortana/skills/speech-synthesis-markup-language#prosody-element
    def _rate(self):
        pass
    # TODO: Add pitch
    def _pitch(self):
        pass
    def _pronounce(self,word,parent_tag):
        if word in self.phonemeDict:
            sys.stdout.buffer.write(self.phonemeDict[word].encode("utf-8"))
            return ET.fromstring("<phoneme alphabet=\"ipa\" ph=\"" + self.phonemeDict[word] + "\"> </phoneme>")#ET.SubElement(parent_tag,"phoneme",{"alphabet":"ipa","ph":self.phonemeDict[word]})#<phoneme alphabet="string" ph="string"></phoneme>
        else:
            return parent_tag
    # Nice to have: Transform acronyms into their pronunciation (See say as tag)
I've also added how the code writes to the comtype object (SAPI) in case the error is there.
def __call__(self,text,outputFile):
    # https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms723606(v%3Dvs.85)
    self.stream.Open(outputFile + ".wav", self.SpeechLib.SSFMCreateForWrite)
    self.engine.AudioOutputStream = self.stream
    text = self._text_processing(text)
    text = self.SSML_generator(text)
    text = ET.tostring(text,encoding='utf8', method='xml').decode('utf-8')
    self.engine.speak(text)
    self.stream.Close()
Thanks in advance for your help!
Try using single quotes for the ph attribute.
Like this:
my_text = '<speak><phoneme alphabet="x-sampa" ph=\'v"e.de.ni.e\'>ведение</phoneme></speak>'
Also remember to use \ to escape the single quotes.
UPD
Also this error could mean that your ph cannot be parsed. You can check docs there: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup
this example will work
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-Jessa24kRUS">
<s>His name is Mike <phoneme alphabet="ups" ph="JH AU"> Zhou </phoneme></s>
</voice>
</speak>
but this doesn't
<speak version="1.0" xmlns="https://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<voice name="en-US-Jessa24kRUS">
<s>His name is Mike <phoneme alphabet="ups" ph="JHU AUA"> Zhou </phoneme></s>
</voice>
</speak>
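Quoting problems in the ph attribute can be sidestepped by letting ElementTree build the element instead of concatenating strings, since it escapes attribute values itself. The sketch below only shows how the markup could be generated; the function name build_ssml is made up, and whether SAPI accepts a particular IPA string is a separate question that this code does not answer.

```python
import xml.etree.ElementTree as ET


def build_ssml(word, ipa):
    """Return an SSML string with one phoneme element (hypothetical helper)."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    # ElementTree escapes quotes and other special characters in attributes.
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": ipa})
    phoneme.text = word
    # encoding="unicode" returns a str and keeps non-ASCII characters as-is.
    return ET.tostring(speak, encoding="unicode")


# Usage sketch: print(build_ssml("hello", "həˈloʊ"))
```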

I'm getting a syntax error on my IF statement, not sure why?

I'm trying to run the following code in Python 3.7. I keep getting an invalid syntax error and I'm not sure why. Can someone spot what I'm doing wrong? The indentation seems fine and I believe my print calls have the correct brackets, but I'm totally lost on the "if" and "else" statements.
class pdfPositionHandling:
    def parse_obj(self, lt_objs):
        # loop over the object list
        for obj in lt_objs:
            if isinstance(obj, pdfminer.layout.LTTextLine):
                print ("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_'))
            # if it's a textbox, also recurse
            if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
                self.parse_obj(obj._objs)
            # if it's a container, recurse
            elif isinstance(obj, pdfminer.layout.LTFigure):
                self.parse_obj(obj._objs)
    def parsepdf(self, filename, startpage, endpage):
        # Open a PDF file.
        fp = open(filename, 'rb')
        # Create a PDF parser object associated with the file object.
        parser = PDFParser(fp)
        # Create a PDF document object that stores the document structure.
        # Password for initialization as 2nd parameter
        document = PDFDocument(parser)
        # Check if the document allows text extraction. If not, abort.
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        # Create a PDF resource manager object that stores shared resources.
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object.
        device = PDFDevice(rsrcmgr)
        # BEGIN LAYOUT ANALYSIS
        # Set parameters for analysis.
        laparams = LAParams()
        # Create a PDF page aggregator object.
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        i = 0
        # loop over all pages in the document
        for page in PDFPage.create_pages(document):
            if i >= startpage and i <= endpage:
                # read the page into a layout object
                interpreter.process_page(page)
                layout = device.get_result()
                # extract text from this object
                self.parse_obj(layout._objs)
            i += 1
I get the following error:
File "C:/Users/951298/Documents/Python Scripts/PDF Scraping/untitled1.py", line 12
if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
^
SyntaxError: invalid syntax
Not sure why it's pointing at the colon at the end?
In line 9 you should have typed 3 closing parentheses at the end, but you only have 2. Add another parenthesis and it will work fine.
You forgot the closing bracket on your print statement. This causes an error on a later line because the interpreter ignores newlines while it is reading the code inside the brackets. In fact, the only reason it threw an error on line 12 is that if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal): is not a valid argument to pass to print.
For the same reason, the following code would throw an error on the else line.
bar = "a"
baz = "a"
def foo(msg, bar="\n"):
    print(msg, end=bar)
if bar == baz:
    foo("bar is equal to baz",
        bar = baz
else: #Throws error here
    foo("bar is not equal to baz")
#Not the best example, I know, sorry.
Odd, is it not? Be sure to take a look at the line(s) above the line that throws the error. They give you both context and potentially erroneous code. You especially need to watch for these kinds of errors in programming languages that require newline terminators.
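To see where the interpreter actually reports such an error, the source can be compiled without running it. This is a small sketch using the built-in compile(); the helper name syntax_error_location is made up. The reported line depends on the Python version: older versions point at the line after the unclosed bracket, as in the question, while newer ones tend to point at the bracket that was never closed.

```python
def syntax_error_location(source):
    """Return (line_number, message) of the first syntax error, or None.

    Hypothetical helper: the code is only compiled, never executed.
    """
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError as exc:
        return exc.lineno, exc.msg
    return None


broken = "print('%d' % (1,)\nif True:\n    pass\n"
location = syntax_error_location(broken)  # exact line varies by Python version
```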
In line 9 you should have 3 closing parentheses. I also happened to notice that you have two if statements and one elif statement but no else; they should all be if statements. Hope I helped!

Why is a file writable but os.access( file, os.W_OK ) returns False?

I am not quite sure what's going on here. According to the Python documentation:
os.W_OK: Value to include in the mode parameter of access() to test the writability of path.
I assumed this check would return True even if the file does not exist, as long as its path is valid and I have permission to write it.
But this is what happens when I check whether a file path is writable.
import os, subprocess
pwd = os.getcwd();
temp_file_to_write = os.path.join( pwd, "temp_file" );
# use os.access to check
say = "";
if ( os.access( temp_file_to_write, os.W_OK ) ) :
    say = "writeable";
else :
    say = "NOT writeable";
print "L10", temp_file_to_write, "is", say
# use try/except
try :
    with open( temp_file_to_write, "w" ) as F :
        F.write( "L14 I am a temp file which is said " + say + "\n" );
    print "L15", temp_file_to_write, "is written";
    print subprocess.check_output( ['cat', temp_file_to_write ] );
except Exception, e:
    print "L18", temp_file_to_write, "is NOT writeable";
It produces the following results
L10 /home/rex/python_code/sandbox/temp_file is NOT writeable
L15 /home/rex/python_code/sandbox/temp_file is written
L14 I am a temp file which is said NOT writeable
Does anyone know why? If my understanding of os.W_OK is wrong, could you tell me the right way in Python to check both of the following together: 1) whether a file path is valid; and 2) whether I have permission to write to it.
Whether or not you can create a new file depends on the permissions of the directory, not on the new (not yet existing) file.
Once the file has been created (exists), access(W_OK) may return True if you can modify its content.
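The distinction above can be sketched as a helper that checks the file when it exists and its parent directory when it does not. The name probably_writable is made up, and the result is only a hint: the actual write can still fail, so the open() call should be wrapped in try/except anyway.

```python
import os


def probably_writable(path):
    """Guess whether writing to path would succeed (hypothetical helper).

    An existing file is checked directly. For a missing file, the parent
    directory must exist and allow writing and traversal.
    """
    if os.path.exists(path):
        return os.access(path, os.W_OK)
    parent = os.path.dirname(os.path.abspath(path))
    return os.path.isdir(parent) and os.access(parent, os.W_OK | os.X_OK)


# Usage sketch: print(probably_writable(os.path.join(os.getcwd(), "temp_file")))
```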
Maybe you run your script with sudo (or something like this on Windows)?
I have this on linux (I gave chmod 400):
>>> os.access(fn, os.W_OK)
False
>>> f = open(fn, 'w')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IOError: [Errno 13] Permission denied: '/tmp/non-writable'
The original question asked how to check permissions to write a file. However, in Python it is better to use a try-except block to attempt to write to a file, instead of testing for access, when possible. The reason is given in the os.access() documentation on the Python.org website: https://docs.python.org/3/library/os.html
From the website:
Note: Using access() to check if a user is authorized to e.g. open a file before actually doing so using open() creates a security hole, because the user might exploit the short time interval between checking and opening the file to manipulate it. It’s preferable to use EAFP techniques. For example:
if os.access("myfile", os.R_OK):
    with open("myfile") as fp:
        return fp.read()
return "some default data"
is better written as:
try:
    fp = open("myfile")
except PermissionError:
    return "some default data"
else:
    with fp:
        return fp.read()
Note: I/O operations may fail even when access() indicates that they would succeed, particularly for operations on network filesystems which may have permissions semantics beyond the usual POSIX permission-bit model.
