I have a python function that contains the following code:
with open(modelfilepath, "rb") as modelfile, open(vcffilepath, "rb") as vcffile:
    for row in gtf_getrow(modelfile):
        print row
        # add features as appropriate
        if row["feature"] == "transcript":
            addfeature(some args...)
        if row["feature"] == "exon":
            addfeature(some other args..., vcffile=vcffile)
Execution of the addfeature() function passes through several functions before returning to the for loop. In the "exon" case, the vcffile object is passed as an argument to successive functions which eventually write to the vcffile.
The problem is that after a few iterations, the vcffile object seems to close spontaneously, which crashes the program. If I hardcode the function that uses vcffile to access the filename directly, the problem does not occur, but this seems like an undesirable solution since it removes control of the file from the with block. Nor do I want to have to open and close the file each time I access it, since this program is parsing hundreds of megabytes worth of tabular data. Thanks in advance for your suggestions.
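For illustration only (these helper names are not from the question), the kind of call chain being described might look like the sketch below. One thing worth checking in such a chain is whether any intermediate function wraps the handle in its own with block or calls close(), since either would close the file for every later iteration:

# Hypothetical sketch of the call chain; the function names are illustrative.
def format_record(row):
    # Placeholder formatting; the real code would build the output line here.
    return str(row) + "\n"

def addfeature(row, vcffile=None):
    if vcffile is not None:
        write_exon(row, vcffile)

def write_exon(row, vcffile):
    # Pitfall: re-entering `with` on an already-open file object closes it
    # when this block exits, so the next iteration sees a closed file.
    with vcffile as vcf:
        vcf.write(format_record(row))

def write_exon_safe(row, vcffile):
    # Just use the handle; the outer `with` in the main loop owns its lifetime.
    vcffile.write(format_record(row))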
Here is my code for accessing and editing the file:
def edit_default_settings(self, setting_type, value):
    with open("cam_settings.json", "r") as f:
        cam_settings = json.load(f)
    cam_settings[setting_type] = value
    with open("cam_settings.json", 'w') as f:
        json.dump(cam_settings, f, indent=4)
I use it in a program that runs for several hours a day, and about once a week I notice that the cam_settings.json file has become empty (literally empty, the file explorer shows 0 bytes), but I can't imagine how that is possible.
I would be glad to hear some comments on what could go wrong.
I can't see any issues with the code itself, but there can be an issue with the execution environment. Are you running the code in a multi-threaded environment or running multiple instances of the same program at once?
This situation can arise if this code is executed in parallel and multiple threads/processes try to access the file at the same time. Try logging each time the function is executed and whether it completed successfully, and add exception handlers with error logging.
If this turns out to be the problem, using buffering or a singleton pattern can solve the issue.
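For example, a minimal sketch of that kind of logging wrapped around the function from the question (the log file name and logger setup are assumptions):

import json
import logging

logging.basicConfig(filename="settings_writes.log", level=logging.INFO)

def edit_default_settings(self, setting_type, value):
    logging.info("edit_default_settings(%r, %r) called", setting_type, value)
    try:
        with open("cam_settings.json", "r") as f:
            cam_settings = json.load(f)
        cam_settings[setting_type] = value
        with open("cam_settings.json", "w") as f:
            json.dump(cam_settings, f, indent=4)
        logging.info("cam_settings.json written successfully")
    except Exception:
        logging.exception("failed to update cam_settings.json")
        raise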
As @Chels said, the file is truncated when it's opened with 'w'. That doesn't explain why it stays that way; I can only imagine that happening if your code crashed. Maybe you need to check logs for code crashes (or change how your code is run so that crash reasons get logged, if they aren't).
But there's a way to make this process safer in case of crashes. Write to a separate file and then replace the old file with the new file, only after the new file is fully written. You can use os.replace() for this. You could do this simply with a differently-named file:
with open(".cam_settings.json.tmp", 'w') as f:
json.dump(cam_settings, f, indent=4)
os.replace(".cam_settings.json.tmp", "cam_settings.json")
Or you could use a temporary file from the tempfile module.
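A sketch of the tempfile variant might look like this (save_settings is a hypothetical helper; the temporary file is created in the same directory so os.replace does not cross filesystems):

import json
import os
import tempfile

def save_settings(cam_settings, path="cam_settings.json"):
    # Write to a temporary file in the same directory, then atomically
    # swap it into place; a crash mid-write leaves the old file intact.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(cam_settings, f, indent=4)
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise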
When you open a file with the "w" mode, its existing content is erased immediately (anything previously written is replaced).
Not sure if this is what you are looking for, but it could be one of the reasons why "cam_settings.json" becomes empty after the call to open("cam_settings.json", 'w')!
In such a case, to append text instead, use the "a" mode, as in:
open("cam_settings.json", 'a')
I'm developing a Python app that deals with big objects, and to avoid filling the PC's RAM during execution, I chose to store my temporary objects (created at one step, used by the next step) in files with the pickle module.
While trying to optimize memory consumption, I saw a behaviour that I don't understand.
In the first case, I'm opening my temp file, then I loop to do the actions I need, and during the loop I regularly dump objects into the file. It works well, but as the file stays open, it consumes a lot of memory. Here is the code example:
tmp_file_path = "toto.txt"
with open(tmp_file_path, 'ab') as f:
p = pickle.Pickler(f)
for filepath in self.file_list: // loop over files to be treated
try:
my_obj = process_file(filepath)
storage_obj = StorageObj()
storage_obj.add(os.path.basename(filepath), my_obj)
p.dump(storage_obj)
[...]
In the second case, I'm only opening my temp file when I need to write to it:
tmp_file_path = "toto.txt"
for filepath in self.file_list: // loop over files to be treated
try:
my_obj = process_file(filepath)
storage_obj = StorageObj()
storage_obj.add(os.path.basename(filepath), my_obj)
with open(tmp_file_path, 'ab') as f:
p = pickle.Pickler(f)
p.dump(storage_obj)
[...]
The code in the two versions is the same except for this block:
with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)
which moves inside/outside the loop.
And for the unpickling part:
with open("toto.txt", 'rb') as f:
    try:
        u = pickle.Unpickler(f)
        storage_obj = u.load()
        while storage_obj:
            process_my_obj(storage_obj)
            storage_obj = u.load()
    except EOFError:
        pass
When I run both versions, the first one shows high memory consumption (due to the fact that the temp file remains open during the whole treatment, I guess), and at the end, with a given set of inputs, the application finds 622 elements in the unpickled data.
In the second case, memory consumption is far lower, but at the end, with the same inputs, the application finds only 440 elements in the unpickled data, and sometimes crashes with random errors during the Unpickler.load() call (for example AttributeError, but it's not always reproducible and not always the same error).
With an even bigger set of inputs, the first version often crashes with a MemoryError, so I'd like to use the second one, but it doesn't seem to save all my objects correctly.
Does anyone have an idea why the two behave differently?
Maybe opening / dumping / closing / reopening / dumping a file in my loop doesn't guarantee the content that is dumped?
EDIT 1:
All the pickling is done in a multiprocessing context, with 10 processes each writing to its own temp file, and the unpickling is done by the main process by reading each temp file created.
EDIT 2:
I can't provide a fully reproducible example (company code), but the treatment consists of parsing C files (the process_file method, based on the pycparser module) and generating an object representing the C file content (fields, functions, etc.) -> my_obj. my_obj is then stored in an object (StorageObj) that has a dict as an attribute, containing my_obj with the file it was extracted from as the key.
Thanks in advance if anyone finds the reason behind this, or suggests a way around it :)
This has nothing to do with the file. It is that you are using a single shared Pickler, which retains its memo table.
The example that does not have the issue creates a new Pickler with a fresh memo table and lets the old one be collected, effectively clearing the memo table.
But that doesn't explain why, when I create multiple Picklers, I retrieve less data in the end than with only one.
Now that is because you have written multiple pickles to the same file, and the method where you read one only reads the first; closing and reopening the file resets the file offset. When reading multiple objects, each call to load() advances the file offset to the start of the next object.
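Putting those two points together, one possible sketch is to keep a single open file and a single Pickler but call clear_memo() after each dump, and to read the objects back by calling load() repeatedly until EOFError (produce_storage_objects is a hypothetical stand-in for the loop in the question):

import pickle

# Writing: keep the file open, but clear the Pickler's memo table after
# each dump so references to already-pickled objects are not kept alive.
with open("toto.txt", "ab") as f:
    p = pickle.Pickler(f)
    for storage_obj in produce_storage_objects():   # hypothetical generator
        p.dump(storage_obj)
        p.clear_memo()

# Reading: call load() repeatedly; each call picks up where the previous
# one stopped, until EOFError signals the end of the file.
results = []
with open("toto.txt", "rb") as f:
    u = pickle.Unpickler(f)
    while True:
        try:
            results.append(u.load())
        except EOFError:
            break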
I'm using Windows 7 and I have a super-simple script that goes over a directory of images, checking a specified condition for each image (in my case, whether there's a face in the image, using dlib), while writing the paths of images that fulfilled the condition to a text file:
def process_dir(dir_path):
    i = 0
    with open(txt_output, 'a') as f:
        for filename in os.listdir(dir_path):
            # loading image to check whether dlib detects a face:
            image_path = os.path.join(dir_path, filename)
            opencv_img = cv2.imread(image_path)
            dets = detector(opencv_img, 1)
            if len(dets) > 0:
                f.write(image_path)
                f.write('\n')
                i = i + 1
                print i
Now the following thing happens: there seems to be a significant lag in appending lines to the file. For example, I can see the script has "finished" checking several images (i.e., the console prints ~20, meaning 20 files that fulfill the condition have been found) but the .txt file is still empty. At first I thought there was a problem with my script, but after waiting a while I saw that the lines were in fact added to the file, only it seems to be updated in "batches".
This may not seem like the most crucial issue (and it's definitely not), but still I'm wondering - what explains this behavior? As far as I understand, every time the f.write(image_path) line is executed the file is changed - then why do I see the update with a lag?
Data written to a file object won't necessarily show up on disk immediately.
In the interests of efficiency, most operating systems will buffer the writes, meaning that data is only written out to disk when a certain amount has accumulated (usually 4K).
If you want to write your data right now, use the flush() function, as others have said.
Did you try using a buffer size of 0: open(txt_output, 'a', 0)?
I'm not sure about Windows (please, someone correct me here if I'm wrong), but I believe this is because of how the write buffer is handled. Although you are requesting a write, the buffer only writes every so often (when the buffer is full), and when the file is closed. You can open the file with a smaller buffer:
with open(txt_output, 'a', 0) as f:
or manually flush it at the end of the loop:
if len(dets) > 0:
    f.write(image_path)
    f.write('\n')
    f.flush()
    i = i + 1
    print i
I would personally recommend flushing manually when you need to.
It sounds like you're running into file stream buffering.
In short, writing to a file is a very slow process (relative to other sorts of things that the processor does). Modifying the hard disk is about the slowest thing you can do, other than maybe printing to the screen.
Because of this, most file I/O libraries will "buffer" your output, meaning that as you write to the file the library will save your data in an in-memory buffer instead of modifying the hard disk right away. Only when the buffer fills up will it "flush" the buffer (write the data to disk), after which point it starts filling the buffer again. This often reduces the number of actual write operations by quite a lot.
To answer your question, the first thing to ask is: do you really need to append to the file immediately every time you find a face? It will probably slow down your processing by a noticeable amount, especially if you're processing a large number of files.
If you really do need to update immediately, you basically have two options:
Manually flush the write buffer each time you write to the file. In Python, this usually means calling f.flush(), as @JamieCounsell pointed out.
Tell Python not to use a buffer, or more accurately to use a buffer of size 0. As @VikasMadhusudana pointed out, you can tell Python how big a buffer to use with a third argument to open(): open(txt_output, 'a', 0) for a 0-byte buffer.
Again, you probably don't need this; the only case I can think of that might require this sort of thing is if you have some other external process that's watching the file and triggers off new data being added to it.
Hope that helps!
It's flush related, try:
print(image_path, file=f) # Python 3
or
print >>f, image_path # Python 2
instead of:
f.write(image_path)
f.write('\n')
print can flush for you (in Python 3, pass flush=True to force it).
Another good thing about print is that it gives you the newline for free.
with open(file, 'rb') as readerfile:
    reader = csv.reader(readerfile)
In the above syntax, can I perform the first and second line together? It seems unnecessary to use 2 variables ('readerfile' and 'reader' above) if I only need to use the latter.
Is the former variable ('readerfile') ever used?
Can I use the same variable name for both, or is that bad form?
You can do:
reader = csv.reader(open(file, 'rb'))
but that would mean you are not closing your file explicitly.
with open(file, 'rb') as readerfile:
The first line opens the file and stores the file object in readerfile. The with statement ensures that the file is closed when you exit the block by any means, including exceptions.
reader = csv.reader(readerfile)
The second line creates a CSV reader object using the file object. It needs the file object (otherwise where would it read the data from?). Of course you could conceivably store it in the same variable
readerfile = csv.reader(readerfile)
if you wanted to (and don't plan on using the file object again), but this will likely lead to confusion for readers of your code.
Note that you haven't read anything yet! You still need to iterate over the reader object in order to get the data that you're interested in, and if you close the file before that happens then the reader object won't work. The file object is used behind the scenes by the reader object, even if you "hide" it by overwriting the readerfile variable.
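A quick illustration of that last point (example.csv is a placeholder):

import csv

with open('example.csv', 'rb') as readerfile:
    reader = csv.reader(readerfile)
# The with block has exited here, so the underlying file is closed.

next(reader)   # raises ValueError: I/O operation on closed file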
Lastly, if you really want to do everything on one line, you could conceivably define a function that abstracts the with statement:
def with1(context, func):
    with context as x:
        return func(x)
Now you can write this as one line:
data = with1(open(file, 'rb'), lambda readerfile: list(csv.reader(readerfile)))
It's by no means clearer, however.
This is not recommended at all
Why is it important to use one line?
Most Python programmers know the benefits of using the with statement well. Keep in mind that readers may be lazy (that is, read line by line) in some cases. You want to be able to handle the file with the correct statement, ensuring it is closed correctly even if errors arise.
Nevertheless, you can use a one liner for this, as stated in other answers:
reader = csv.reader(open(file, 'rb'))
So basically you want a one-liner?
reader = csv.reader(open(file, 'rb'))
As said before, the problem with that is that with open() allows you to do the following steps in one go:
Open the file
Do what you want with the file (inside your open block)
Close the file (that is implicit and you don't have to specify it)
If you don't use with open but open directly, your file stays open until the object is garbage collected, and that could lead to unpredictable behaviour in some cases.
Plus, your original code (two lines) is much more readable than a one-liner.
If you put them together, then the file won't be closed automatically -- but that often doesn't really matter, since it will be closed automatically when the script terminates.
It's not common to need to reference the raw file once a csv.reader instance has been created from it (except possibly to explicitly close it if you're not using a with statement).
If you use the same variable name for both, it will probably work because the csv.reader instance will still hold a reference to the file object, so it won't be garbage collected until the program ends. It's not a common idiom, however.
Since csv files are often processed sequentially, the following can be a fairly concise way to do it, since the csv.reader instance frequently doesn't really need to be given a variable name and it will close the file properly even if an exception occurs:
with open(file, 'rb') as readerfile:
    for row in csv.reader(readerfile):
        # process the data...
I'm writing a script that gets the most recently modified file from a unix directory.
I'm certain it works, but I have to create a unittest to prove it.
The problem is the setUp function. I want to be able to predict the order the files are created in.
self.filenames = ["test1.txt", "test2.txt", "test3.txt", "filename.txt", "test4"]
newest = ''
for fn in self.filenames:
    if pattern.match(fn): newest = fn
    with open(fn, "w") as f: f.write("some text")
The pattern is "test.*.txt" so it just matches the first three in the list. In multiple tests, newest sometimes returns 'test3.txt' and sometimes 'test1.txt'.
Use os.utime to explicitly set the modification time on the files that you have created. That way your test will run faster.
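For example, a sketch of how the setUp could assign increasing modification times with os.utime (the test class name and the one-second spacing are arbitrary):

import os
import time
import unittest

class NewestFileTest(unittest.TestCase):   # hypothetical test class name
    def setUp(self):
        self.filenames = ["test1.txt", "test2.txt", "test3.txt", "filename.txt", "test4"]
        now = time.time()
        for i, fn in enumerate(self.filenames):
            with open(fn, "w") as f:
                f.write("some text")
            # Set (atime, mtime) explicitly so the order is deterministic.
            os.utime(fn, (now + i, now + i))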
I doubt that the filesystem you are using supports fractional seconds on file create time.
I suggest you insert a call to time.sleep(1) in your loop so that the filesystem actually has a different timestamp on each created file.
It could be due to syncing. Just because you call write() on files in a certain order, it doesn't mean the data will be updated by the OS in that order.
Try calling f.flush() followed by os.fsync() on your file object before going to the next file. Giving it some time between calls (using sleep()) might also help.
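A sketch of that pattern applied to the loop from the question:

import os

for fn in self.filenames:          # same loop as in the question's setUp
    with open(fn, "w") as f:
        f.write("some text")
        f.flush()                  # push Python's buffer down to the OS
        os.fsync(f.fileno())       # ask the OS to commit the data to disk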