I'm developing a Python app that deals with big objects, and to avoid filling the PC's RAM during execution, I chose to store my temporary objects (created at one step, used by the next step) in files with the pickle module.
While trying to optimize memory consumption, I saw a behaviour that I don't understand.
In the first case, I'm opening my temp file, then I loop to do the actions I need, and during the loop I regularly dump objects into the file. It works well, but as the file pointer remains open, it consumes a lot of memory. Here is the code example:
tmp_file_path = "toto.txt"
with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)
    for filepath in self.file_list:  # loop over files to be treated
        try:
            my_obj = process_file(filepath)
            storage_obj = StorageObj()
            storage_obj.add(os.path.basename(filepath), my_obj)
            p.dump(storage_obj)
            [...]
In the second case, I'm only opening my temp file when I need to write to it:
tmp_file_path = "toto.txt"
for filepath in self.file_list:  # loop over files to be treated
    try:
        my_obj = process_file(filepath)
        storage_obj = StorageObj()
        storage_obj.add(os.path.basename(filepath), my_obj)
        with open(tmp_file_path, 'ab') as f:
            p = pickle.Pickler(f)
            p.dump(storage_obj)
        [...]
The code between the two versions is the same except for the block:
with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)
which moves inside/outside the loop.
And for the unpickling part:
with open("toto.txt", 'rb') as f:
    try:
        u = pickle.Unpickler(f)
        storage_obj = u.load()
        while storage_obj:
            process_my_obj(storage_obj)
            storage_obj = u.load()
    except EOFError:
        pass
When I'm running both codes, in the first case I have a high memory consumption (due to the fact that the temp file remains open during the treatment, I guess), and in the end, with a set of inputs, the application finds 622 elements in the unpickled data.
In the second case, memory consumption is far lower, but in the end, with the same inputs, the application finds 440 elements in the unpickled data, and it sometimes crashes with random errors during the Unpickler.load() method (for example AttributeError, but it's not always reproducible and not always the same error).
With an even bigger set of inputs, the first code example often crashes with a MemoryError, so I'd like to use the second code example, but it seems that it doesn't manage to save all my objects correctly.
Does anyone have an idea of the reason why there are differences between the two behaviours?
Maybe opening / dumping / closing / reopening / dumping / etc. a file in my loop doesn't guarantee the content that is dumped?
EDIT 1:
All the pickling part is done in a multiprocessing context, with 10 processes writing to their own temp files, and the unpickling is done by the main process, by reading each temp file created.
EDIT 2:
I can't provide a fully reproducible example (company code), but the treatment consists of parsing C files (the process_file method, based on the pycparser module) and generating an object representing the C file's content (fields, functions, etc.) -> my_obj. my_obj is then stored in an object (StorageObj) that has a dict as an attribute, containing the my_obj object with the file it was extracted from as key.
Thanks in advance if anyone finds the reason behind this, or suggests a way around it :)
This has nothing to do with the file. It is that you are using a common Pickler which is retaining its memo table.
The example that does not have the issue creates a new Pickler with a fresh memo table and lets the old one be collected, effectively clearing the memo table.
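For what it's worth, a minimal sketch of a middle ground (reusing process_file, StorageObj and self.file_list from the question) would be to keep the single open file from the first version but drop the memo after every dump with Pickler.clear_memo():
import os
import pickle

tmp_file_path = "toto.txt"
with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)
    for filepath in self.file_list:  # loop over files to be treated
        my_obj = process_file(filepath)
        storage_obj = StorageObj()
        storage_obj.add(os.path.basename(filepath), my_obj)
        p.dump(storage_obj)
        p.clear_memo()  # forget previously pickled objects so they can be garbage collected
This keeps a single file handle but should bound the memo's memory use.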
But that doesn't explain why, when I create multiple Picklers, I retrieve less data in the end than with only one.
Now that is because you have written multiple pickles to the same file, and the method where you read one only reads the first, since closing and reopening the file resets the file offset. When reading multiple objects, each time you call load() the file offset advances to the start of the next object.
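To illustrate that last point, a small standalone check (toy data, hypothetical file name) shows that separate pickles appended across open/close cycles all come back, as long as the reader keeps calling load() until EOFError:
import pickle

records = [{'a': 1}, {'b': 2}, {'c': 3}]

# write each record in its own open/close cycle, as in the second version
for rec in records:
    with open('check.pkl', 'ab') as f:
        pickle.Pickler(f).dump(rec)

# each load() advances the offset to the start of the next pickled object
loaded = []
with open('check.pkl', 'rb') as f:
    u = pickle.Unpickler(f)
    while True:
        try:
            loaded.append(u.load())
        except EOFError:
            break

print(loaded)  # all three records are recovered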
Related
Multiple sources lead me to believe that using a del statement is rarely necessary. However, I am working on a program that needs to read in huge files (6 GB), with the filenames picked up from a list, do some transformation, write them to a datastore, and then pick up the next file.
For instance, the reference variables buffer and processed will get overwritten with every iteration of the loop; is there a point in deleting them explicitly?
files = list()  # contains 1000 filenames
for file in files:
    buffer = read_from_s3(file)
    processed = process_data(buffer)
    del buffer  # needed?
    write_to_another_s3(processed)
    del processed  # needed?
I read position data from a GPS sensor into a dictionary, which I send at a cyclic interval to a server.
If I have no coverage, the data will be saved in a list.
If the connection can be re-established, all list items will be transmitted.
But if a power interruption occurs, all temporary data elements will be lost.
What would be the best, Pythonic solution to save this data?
I am using an SD card as storage, so I am not sure if writing every element to a file would be the best solution.
Current implementation:
stageddata = []
position = {'lat':'1.2345', 'lon':'2.3455', 'timestamp':'2020-10-18T15:08:04'}
if not transmission(position):
    stageddata.append(position)
else:
    while stageddata:
        position = stageddata.pop()
        if not transmission(position):
            stageddata.append(position)
            return
EDIT: Finding the "best" solution may be very subjective. I agree with zvone that a power outage can be prevented; perhaps a shutdown routine should save the temporary data.
So the question may be: how to save a given list to a file in a Pythonic way?
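For instance, a shutdown routine could be as small as an atexit hook that pickles the list (just a sketch; the file name is made up, and it only covers a clean shutdown, not a hard power cut):
import atexit
import pickle

stageddata = []

def save_staged():
    # runs on normal interpreter shutdown, not on a sudden power loss
    with open('staged_positions.pkl', 'wb') as f:
        pickle.dump(stageddata, f)

atexit.register(save_staged)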
A good solution for temporary storage in Python is tempfile.
You can use it, e.g., like the following:
import tempfile

with tempfile.TemporaryFile() as fp:
    # Store your variable
    fp.write(your_variable_to_temp_store)
    # Do some other stuff
    # Read file
    fp.seek(0)
    fp.read()
I agree with the comment of zvone. In order to know the best solution, we would need more information.
The following would be a robust and configurable solution.
import os
import pickle

backup_interval = 2
backup_file = 'gps_position_backup.bin'

def read_backup_data():
    file_backup_data = []
    if os.path.exists(backup_file):
        with open(backup_file, 'rb') as f:
            while True:
                try:
                    coordinates = pickle.load(f)
                except EOFError:
                    break
                file_backup_data += coordinates
    return file_backup_data

# When the script is started and backup data exists, stageddata uses it
stageddata = read_backup_data()

def write_backup_data():
    tmp_backup_file = 'tmp_' + backup_file
    with open(tmp_backup_file, 'wb') as f:
        pickle.dump(stageddata, f)
    os.replace(tmp_backup_file, backup_file)
    print('Wrote data backup!')

# Mockup variable and method
transmission_return = False

def transmission(position):
    return transmission_return

def try_transmission(position):
    if not transmission(position):
        stageddata.append(position)
        if len(stageddata) % backup_interval == 0:
            write_backup_data()
    else:
        while stageddata:
            position = stageddata.pop()
            if not transmission(position):
                stageddata.append(position)
                return
            else:
                if len(stageddata) % backup_interval == 0:
                    write_backup_data()

if __name__ == '__main__':
    # transmission_return is False, so write to backup_file
    for counter in range(10):
        position = {'lat':'1.2345', 'lon':'2.3455'}
        try_transmission(position)

    # transmission_return is True, transmit positions and "update" backup_file
    transmission_return = True
    position = {'lat':'1.2345', 'lon':'2.3455'}
    try_transmission(position)
I moved your code into some functions. With the variable backup_interval, it is possible to control how often a backup is written to disk.
Additional Notes:
I use the built-in pickle module, since the data does not have to be human readable or transformable for other programming languages. Alternatives are JSON, which is human readable (a rough JSON variant is sketched after these notes), or msgpack, which might be faster but needs an extra package to be installed. The tempfile approach is not a Pythonic solution here, as the file cannot easily be retrieved in case the program crashes.
stageddata is written to disk when it hits the backup_interval (obviously), but also when transmission returns True within the while loop. This is needed to "synchronize" the data on disk.
The data is written to disk completely anew every time. A more sophisticated approach would be to append just the newly added positions, but then the synchronizing part that I described before would be more complicated too. Additionally, the safer temporary-file approach (see the Edit below) would not work.
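As mentioned in the notes, a JSON variant is possible too. A rough sketch of the two backup functions (same atomic-replace idea, whole list rewritten each time; the file name is just an example):
import json
import os

backup_file = 'gps_position_backup.json'

def read_backup_data():
    if not os.path.exists(backup_file):
        return []
    with open(backup_file, 'r') as f:
        return json.load(f)

def write_backup_data(stageddata):
    tmp_backup_file = 'tmp_' + backup_file
    with open(tmp_backup_file, 'w') as f:
        json.dump(stageddata, f)
    os.replace(tmp_backup_file, backup_file)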
Edit: I just reconsidered your use case. The main problem here is restoring the data even if the program gets interrupted at any time (due to a power interruption or whatever). My first solution just wrote the data to disk (which solves part of the problem), but it could still happen that the program crashes the moment it is writing to disk. In that case the file would probably be corrupt and the data lost. I adapted the function write_backup_data() so that it writes to a temporary file first and then replaces the old file. So now, even if a lot of data has to be written to disk and the crash happens there, the previous backup file is still available.
Maybe saving it as binary data could help to minimize the storage. The pickle and shelve modules will help with storing objects and serializing them (to serialize an object means to convert its state to a byte stream so that the byte stream can be reverted back into a copy of the object). But you should be careful that, when you resolve the power interruption, it does not overwrite the data you have been storing; with open(file, "a") (a = append) you can avoid that.
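As a rough illustration of the shelve idea (the file name and the use of the timestamp as key are just assumptions): every position gets its own key, so nothing that was already stored gets overwritten:
import shelve

def stage_position(position):
    # each position is stored under its own key, so earlier entries are kept
    with shelve.open('staged_positions') as db:
        db[position['timestamp']] = position

def load_staged_positions():
    with shelve.open('staged_positions') as db:
        return list(db.values())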
I'm using Windows 7 and I have a super-simple script that goes over a directory of images, checking a specified condition for each image (in my case, whether there's a face in the image, using dlib), while writing the paths of images that fulfilled the condition to a text file:
def process_dir(dir_path):
    i = 0
    with open(txt_output, 'a') as f:
        for filename in os.listdir(dir_path):
            # loading image to check whether dlib detects a face:
            image_path = os.path.join(dir_path, filename)
            opencv_img = cv2.imread(image_path)
            dets = detector(opencv_img, 1)
            if len(dets) > 0:
                f.write(image_path)
                f.write('\n')
                i = i + 1
                print i
Now the following thing happens: there seems to be a significant lag in appending lines to the file. For example, I can see the script has "finished" checking several images (i.e., the console prints ~20, meaning 20 files that fulfill the condition have been found), but the .txt file is still empty. At first I thought there was a problem with my script, but after waiting a while I saw that the lines were in fact added to the file, only it seems to be updated in "batches".
This may not seem like the most crucial issue (and it's definitely not), but still I'm wondering - what explains this behavior? As far as I understand, every time the f.write(image_path) line is executed the file is changed - then why do I see the update with a lag?
Data written to a file object won't necessarily show up on disk immediately.
In the interests of efficiency, most operating systems will buffer the writes, meaning that data is only written out to disk when a certain amount has accumulated (usually 4K).
If you want to write your data right now, use the flush() function, as others have said.
Did you try using buffer size 0: open(txt_output, 'a', 0)?
I'm not sure about Windows (please, someone correct me here if I'm wrong), but I believe this is because of how the write buffer is handled. Although you are requesting a write, the buffer only writes every so often (when the buffer is full) and when the file is closed. You can open the file with a smaller buffer:
with open(txt_output, 'a', 0) as f:
or manually flush it at the end of the loop:
if len(dets) > 0:
    f.write(image_path)
    f.write('\n')
    f.flush()
    i = i + 1
    print i
I would personally recommend flushing manually when you need to.
It sounds like you're running into file stream buffering.
In short, writing to a file is a very slow process (relative to other sorts of things that the processor does). Modifying the hard disk is about the slowest thing you can do, other than maybe printing to the screen.
Because of this, most file I/O libraries will "buffer" your output, meaning that as you write to the file the library will save your data in an in-memory buffer instead of modifying the hard disk right away. Only when the buffer fills up will it "flush" the buffer (write the data to disk), after which point it starts filling the buffer again. This often reduces the number of actual write operations by quite a lot.
To answer your question, the first question to answer is: do you really need to append to the file immediately every time you find a face? It will probably slow down your processing by a noticeable amount, especially if you're processing a large number of files.
If you really do need to update immediately, you basically have two options:
Manually flush the write buffer each time you write to the file. In Python, this usually means calling f.flush(), as #JamieCounsell pointed out.
Tell Python to just not use a buffer, or more accurately to use a buffer of size 0. As #VikasMadhusudana pointed out, you can tell Python how big of a buffer to use with a third argument to open(): open(txt_output, 'a', 0) for a 0-byte buffer.
Again, you probably don't need this; the only case I can think that might require this sort of thing is if you have some other external operation that's watching the file and triggers off of new data being added to it.
Hope that helps!
It's flush related; try:
print(image_path, file=f, flush=True)  # Python 3
or
print >>f, image_path  # Python 2
instead of:
f.write(image_path)
f.write('\n')
print with flush=True flushes for you (Python 3).
Another good thing about print is that it gives you the newline for free.
I have a python function that contains the following code:
with open(modelfilepath, "rb") as modelfile, open(vcffilepath, "rb") as vcffile:
    for row in gtf_getrow(modelfile):
        print row
        # add features as appropriate
        if row["feature"] == "transcript":
            addfeature(some args...)
        if row["feature"] == "exon":
            addfeature(some other args..., vcffile=vcffile)
Execution of the addfeature() function passes through several functions before returning to the for loop. In the "exon" case, the vcffile object is passed as an argument to successive functions which eventually write to the vcffile.
The problem is that after a few iterations, the vcffile object seems to close spontaneously, which crashes the program. If I hardcode the function that uses vcffile to access the filename directly, the problem does not occur, but this seems like an undesirable solution since it removes control of the file from the with block. Nor do I want to have to open and close the file each time I access it, since this program is parsing hundreds of megabytes worth of tabular data. Thanks in advance for your suggestions.
For the following Python code:
fp = open('output.txt', 'wb')
# Very big file, writes a lot of lines, n is a very large number
for i in range(1, n):
    fp.write('something' * n)
fp.close()
The writing process above can last more than 30 minutes. Sometimes I get a MemoryError. Is the content of the file stored in memory before closing, or is it written to a temp file? If it's in a temporary file, what is its general location on a Linux OS?
Edit:
Added fp.write in a for loop
It's stored in the operating system's disk cache in memory until it is flushed to disk, either implicitly due to timing or space issues, or explicitly via fp.flush().
There will be write buffering in the Linux kernel, but at (ir)regular intervals they will be flushed to disk. Running out of such buffer space should never cause an application-level memory error; the buffers should empty before that happens, pausing the application while doing so.
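If you want to be explicit about both layers (Python's userspace buffer and the kernel's page cache), a flush() followed by os.fsync() pushes the data all the way to disk; a minimal sketch:
import os

fp = open('output.txt', 'wb')
fp.write(b'something')
fp.flush()               # empty Python's own buffer into the OS
os.fsync(fp.fileno())    # ask the kernel to write its cached pages to disk
fp.close()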
Building on ataylor's comment to the question:
You might want to nest your loop. Something like
for i in range(1, n):
    for each in range(n):
        fp.write('something')
fp.close()
That way, the only thing that gets put into memory is the string "something", not "something" * n.
If you are writing out a large file for which the writes might fail, you are better off flushing the file to disk yourself at regular intervals using fp.flush(). This way the file will be in a location of your choosing that you can easily get to, rather than being at the mercy of the OS:
fp = open('output.txt', 'wb')
counter = 0
for line in many_lines:
    fp.write(line)
    counter += 1
    if counter > 999:
        fp.flush()
        counter = 0  # reset so the flush really happens every 1000 lines
fp.close()
This will flush the file to disk every 1000 lines.
If you write line by line, it should not be a problem. You should show the code of what you are doing before the write. For a start, you can try to delete objects where not necessary, use fp.flush(), etc.
File writing should never give a memory error; in all probability, you have a bug in another place.
If you have a loop and a memory error, then I would check whether you are "leaking" references to objects.
Something like:
def do_something(a, b=[]):
    b.append(a)
    return b

fp = open('output.txt', 'wb')
for i in range(1, n):
    something = do_something(i)
    fp.write(something)
fp.close()
I am just picking an example here, but in your actual case the reference leak may be much more difficult to find; however, this case will leak memory inside do_something because of the way Python handles default parameters of functions.
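For reference, the usual fix for that particular leak is to use None as the default and create the list inside the function; a sketch (not necessarily your actual bug):
def do_something(a, b=None):
    if b is None:
        b = []  # a fresh list on every call, nothing accumulates between calls
    b.append(a)
    return b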