How to temporarily save data in Python? - python

I read position data from a GPS sensor into a dictionary, which I send to a server at a cyclic interval.
If I have no coverage, the data is saved in a list.
When the connection can be reestablished, all list items are transmitted.
But if a power interruption occurs, all temporarily held data elements are lost.
What would be the most pythonic solution for saving this data?
I am using an SD card as storage, so I am not sure whether writing every element to a file would be the best solution.
Current implementation:
stageddata = []
position = {'lat':'1.2345', 'lon':'2.3455', 'timestamp':'2020-10-18T15:08:04'}
if not transmission(position):
    stageddata.append(position)
else:
    while stageddata:
        position = stageddata.pop()
        if not transmission(position):
            stageddata.append(position)
            return
EDIT: Finding the "best" solution may be very subjective. I agree with zvone that losing data to a power outage can be prevented; perhaps a shutdown routine should save the temporary data.
So the question may be: what is the pythonic way to save a given list to a file?
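For example, a rough sketch of what I have in mind (the file name is just a placeholder) would be to dump the staged list as JSON and load it back on startup:

import json

STAGING_FILE = 'staged_positions.json'  # placeholder path

def save_staged(stageddata):
    # Persist the staged position dicts so they survive a power interruption.
    with open(STAGING_FILE, 'w') as f:
        json.dump(stageddata, f)

def load_staged():
    # Restore previously staged positions, or start with an empty list.
    try:
        with open(STAGING_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return []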

A good solution for temporary storage in Python is tempfile.
You can use it, e.g., like the following:
import tempfile

with tempfile.TemporaryFile() as fp:
    # Store your variable (as bytes, since the file is opened in binary mode)
    fp.write(your_variable_to_temp_store)
    # Do some other stuff
    # Read the file back
    fp.seek(0)
    fp.read()

I agree with zvone's comment: in order to know the best solution, we would need more information.
The following would be a robust and configurable solution:
import os
import pickle

backup_interval = 2
backup_file = 'gps_position_backup.bin'

def read_backup_data():
    file_backup_data = []
    if os.path.exists(backup_file):
        with open(backup_file, 'rb') as f:
            while True:
                try:
                    coordinates = pickle.load(f)
                except EOFError:
                    break
                file_backup_data += coordinates
    return file_backup_data

# When the script is started and backup data exists, stageddata uses it
stageddata = read_backup_data()

def write_backup_data():
    tmp_backup_file = 'tmp_' + backup_file
    with open(tmp_backup_file, 'wb') as f:
        pickle.dump(stageddata, f)
    os.replace(tmp_backup_file, backup_file)
    print('Wrote data backup!')

# Mockup variable and method
transmission_return = False

def transmission(position):
    return transmission_return

def try_transmission(position):
    if not transmission(position):
        stageddata.append(position)
        if len(stageddata) % backup_interval == 0:
            write_backup_data()
    else:
        while stageddata:
            position = stageddata.pop()
            if not transmission(position):
                stageddata.append(position)
                return
        else:
            if len(stageddata) % backup_interval == 0:
                write_backup_data()

if __name__ == '__main__':
    # transmission_return is False, so write to backup_file
    for counter in range(10):
        position = {'lat':'1.2345', 'lon':'2.3455'}
        try_transmission(position)

    # transmission_return is True, transmit positions and "update" backup_file
    transmission_return = True
    position = {'lat':'1.2345', 'lon':'2.3455'}
    try_transmission(position)
I moved your code into some functions. With the variable backup_interval, it is possible to control how often a backup is written to disk.
Additional Notes:
I use the built-in pickle module, since the data does not have to be human readable or consumable by other programming languages. Alternatives are JSON, which is human readable, or msgpack, which might be faster but needs an extra package to be installed. A tempfile is not a suitable solution here, as its contents cannot easily be retrieved if the program crashes.
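If human-readable backups were preferred, the two helpers above could be swapped for a JSON-based variant along these lines (a sketch only; write_backup_data() takes the list as a parameter here rather than using the global):

import json
import os

backup_file = 'gps_position_backup.json'  # hypothetical JSON counterpart

def read_backup_data():
    # Return the previously backed-up list, or an empty list if no backup exists.
    if os.path.exists(backup_file):
        with open(backup_file, 'r') as f:
            return json.load(f)
    return []

def write_backup_data(stageddata):
    # Write to a temporary file first, then atomically replace the old backup.
    tmp_backup_file = 'tmp_' + backup_file
    with open(tmp_backup_file, 'w') as f:
        json.dump(stageddata, f)
    os.replace(tmp_backup_file, backup_file)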
stageddata is written to disk when it hits the backup_interval (obviously), but also when transmission returns True within the while loop. This is needed to "synchronize" the data on disk.
The data is written to disk completely anew every time. A more sophisticated approach would be to append only the newly added positions, but then the synchronizing part that I described before would become more complicated, and the safer temporary-file approach (see the edit below) would not work.
Edit: I just reconsidered your use case. The main problem here is restoring the data even if the program gets interrupted at any time (due to a power interruption or whatever). My first solution just wrote the data to disk (which solves part of the problem), but the program could still crash at the very moment it is writing to disk; in that case the file would probably be corrupted and the data lost. I adapted the function write_backup_data() so that it writes to a temporary file first and then replaces the old file. So now, even if a lot of data has to be written to disk and the crash happens right then, the previous backup file is still available.

Saving the data in a binary format could help minimize the storage used. The pickle and shelve modules help with storing and serializing objects (to serialize an object means to convert its state to a byte stream, so that the byte stream can later be reverted back into a copy of the object). Be careful, though, that when you recover from the power interruption you do not overwrite the data you have already stored; opening the file with open(file, "a") ("a" for append) avoids that.
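For illustration, a small sketch of the shelve approach (the 'positions' key and the file name are arbitrary choices here):

import shelve

# shelve keeps a persistent, dict-like store on disk, so staged positions
# survive a power interruption without overwriting what was saved earlier.
with shelve.open('staged_positions') as db:
    staged = db.get('positions', [])                  # previously staged items, if any
    staged.append({'lat': '1.2345', 'lon': '2.3455'})
    db['positions'] = staged                          # write the updated list back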

Related

Behaviour difference using Python pickle module

I'm developing a Python app that deals with big objects, and to avoid filling the PC's RAM during execution, I chose to store my temporary objects (created in one step, used by the next step) in files with the pickle module.
While trying to optimize memory consumption, I saw a behaviour that I don't understand.
In the first case, I open my temp file, then loop over the actions I need, regularly dumping objects into the file during the loop. It works well, but as the file pointer remains open, it consumes a lot of memory. Here is the code example:
tmp_file_path = "toto.txt"
with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)
    for filepath in self.file_list:  # loop over files to be treated
        try:
            my_obj = process_file(filepath)
            storage_obj = StorageObj()
            storage_obj.add(os.path.basename(filepath), my_obj)
            p.dump(storage_obj)
            [...]
In the second case, I only open my temp file when I need to write to it:
tmp_file_path = "toto.txt"
for filepath in self.file_list:  # loop over files to be treated
    try:
        my_obj = process_file(filepath)
        storage_obj = StorageObj()
        storage_obj.add(os.path.basename(filepath), my_obj)
        with open(tmp_file_path, 'ab') as f:
            p = pickle.Pickler(f)
            p.dump(storage_obj)
            [...]
The code in the two versions is the same, except for the block:
with open(tmp_file_path, 'ab') as f:
    p = pickle.Pickler(f)
which moves inside/outside the loop.
And for the unpickling part :
with open("toto.txt", 'rb') as f:
try:
u = pickle.Unpickler(f)
storage_obj = u.load()
while storage_obj:
process_my_obj(storage_obj)
storage_obj = u.load()
except EOFError:
pass
When I run both versions, in the first case I get high memory consumption (due to the fact that the temp file remains open during the processing, I guess), and in the end, with a given set of inputs, the application finds 622 elements in the unpickled data.
In the second case, memory consumption is far lower, but in the end, with the same inputs, the application finds only 440 elements in the unpickled data, and it sometimes crashes with random errors during the Unpickler.load() method (for example AttributeError; it's not always reproducible and not always the same error).
With an even bigger set of inputs, the first code example often crashes with a MemoryError, so I'd like to use the second one, but it doesn't seem to save all my objects correctly.
Does anyone have an idea why the two versions behave differently?
Maybe opening / dumping / closing / reopening / dumping etc. the file in my loop doesn't guarantee the content that is dumped?
EDIT 1:
All the pickling is done in a multiprocessing context, with 10 processes writing to their own temp files, and the unpickling is done by the main process by reading each temp file created.
EDIT 2:
I can't provide a fully reproducible example (company code), but the processing consists of parsing C files (the process_file method, based on the pycparser module) and generating an object representing the C file's content (fields, functions, etc.) -> my_obj. my_obj is then stored in an object (StorageObj) that has a dict as an attribute, containing the my_obj object keyed by the file it was extracted from.
Thanks in advance if anyone finds the reason behind this or can suggest a way to work around it :)
This has nothing to do with the file. It is that you are using a common Pickler, which retains its memo table.
The example that does not have the issue creates a new Pickler with a fresh memo table and lets the old one be collected, effectively clearing the memo table.
But that doesn't explain why, when I create multiple Picklers, I retrieve less data in the end than with only one.
Now that is because you have written multiple pickles to the same file, while the method where you read them only reads the first one, as closing and reopening the file resets the file offset. When reading multiple objects, each call to load() advances the file offset to the start of the next object.
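For what it's worth, a sketch of a middle ground (with placeholder data) is to keep one Pickler for the whole file but clear its memo after every dump, so memory stays bounded while every object still lands in the same stream and can be read back with repeated load() calls:

import pickle

items = [{'a': 1}, {'b': 2}, {'c': 3}]  # placeholder data

# One Pickler for the whole file; clear_memo() drops the references the pickler
# keeps for detecting shared objects, so memory does not grow with each dump.
with open('toto.txt', 'wb') as f:
    p = pickle.Pickler(f)
    for item in items:
        p.dump(item)
        p.clear_memo()

# Reading back: every load() call consumes the next pickled object in the stream.
with open('toto.txt', 'rb') as f:
    u = pickle.Unpickler(f)
    try:
        while True:
            print(u.load())
    except EOFError:
        pass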

Python script that writes results to a txt file - why the lag?

I'm using Windows 7 and I have a super-simple script that goes over a directory of images, checking a specified condition for each image (in my case, whether there's a face in the image, using dlib), while writing the paths of images that fulfilled the condition to a text file:
def process_dir(dir_path):
    i = 0
    with open(txt_output, 'a') as f:
        for filename in os.listdir(dir_path):
            # loading image to check whether dlib detects a face:
            image_path = os.path.join(dir_path, filename)
            opencv_img = cv2.imread(image_path)
            dets = detector(opencv_img, 1)
            if len(dets) > 0:
                f.write(image_path)
                f.write('\n')
                i = i + 1
                print i
Now the following happens: there seems to be a significant lag in appending lines to the file. For example, I can see that the script has "finished" checking several images (i.e., the console prints ~20, meaning 20 files fulfilling the condition have been found), but the .txt file is still empty. At first I thought there was a problem with my script, but after waiting a while I saw that the lines were in fact added to the file; it just seems to be updated in "batches".
This may not seem like the most crucial issue (and it's definitely not), but I'm still wondering: what explains this behavior? As far as I understand, every time the f.write(image_path) line is executed the file is changed - so why do I see the update with a lag?
Data written to a file object won't necessarily show up on disk immediately.
In the interests of efficiency, most operating systems will buffer the writes, meaning that data is only written out to disk when a certain amount has accumulated (usually 4K).
If you want to write your data right now, use the flush() function, as others have said.
Did you try using buffer size 0: open(txt_output, 'a', 0)?
I'm not sure about Windows (please, someone correct me here if I'm wrong), but I believe this is because of how the write buffer is handled. Although you request a write, the buffer is only written out every so often (when it is full) and when the file is closed. You can open the file with a smaller buffer:
with open(txt_output, 'a', 0) as f:
or manually flush it at the end of the loop:
if len(dets) > 0:
    f.write(image_path)
    f.write('\n')
    f.flush()
    i = i + 1
    print i
I would personally recommend flushing manually when you need to.
It sounds like you're running into file stream buffering.
In short, writing to a file is a very slow process (relative to other sorts of things that the processor does). Modifying the hard disk is about the slowest thing you can do, other than maybe printing to the screen.
Because of this, most file I/O libraries will "buffer" your output, meaning that as you write to the file the library will save your data in an in-memory buffer instead of modifying the hard disk right away. Only when the buffer fills up will it "flush" the buffer (write the data to disk), after which point it starts filling the buffer again. This often reduces the number of actual write operations by quite a lot.
To answer your question, the first thing to ask is: do you really need to append to the file immediately every time you find a face? It will probably slow down your processing by a noticeable amount, especially if you're processing a large number of files.
If you really do need to update immediately, you basically have two options:
Manually flush the write buffer each time you write to the file. In Python, this usually means calling f.flush(), as #JamieCounsell pointed out.
Tell Python not to use a buffer at all, or more accurately to use a buffer of size 0. As #VikasMadhusudana pointed out, you can tell Python how big a buffer to use with a third argument to open(): open(txt_output, 'a', 0) for a 0-byte buffer. (Note that in Python 3 unbuffered mode is only allowed for binary files; the code in the question is Python 2.)
Again, you probably don't need this; the only case I can think of that might require this sort of thing is if some other external process is watching the file and triggering off of new data being added to it.
Hope that helps!
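For completeness, a minimal Python 3 sketch of the same idea (the paths here are just example values): text files can be opened line-buffered with buffering=1, or flushed explicitly.

txt_output = 'detected_faces.txt'        # example output path
image_path = '/some/dir/face_001.jpg'    # example value

# Line-buffered text file: each completed line is handed to the OS
# as soon as the newline is written.
with open(txt_output, 'a', buffering=1) as f:
    f.write(image_path + '\n')
    # Or force the buffer out explicitly at any point:
    f.flush()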
It's flush related; try:
print(image_path, file=f, flush=True)  # Python 3
or
print >>f, image_path  # Python 2
instead of:
f.write(image_path)
f.write('\n')
In Python 3, flush=True makes print flush the file for you.
Another good thing about print is that it gives you the newline for free.

Python securely remove file

How can I securely remove a file using Python? The function os.remove(path) only removes the directory entry, but I want to securely delete the file, similar to the Apple feature called "Secure Empty Trash" that randomly overwrites the file.
What function securely removes a file using this method?
You can use srm to securely remove files. You can use Python's os.system() function to call srm.
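For example, a sketch of shelling out to srm (assuming the external srm tool is installed and on the PATH; subprocess is generally preferable to os.system):

import subprocess

def srm_delete(path):
    # Run the external srm tool; raises CalledProcessError if srm fails
    # and FileNotFoundError if srm is not installed.
    subprocess.run(['srm', path], check=True)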
You can very easily write a function in Python to overwrite a file with random data, even repeatedly, then delete it. Something like this:
import os

def secure_delete(path, passes=1):
    with open(path, "ba+") as delfile:
        length = delfile.tell()
    with open(path, "br+") as delfile:
        for i in range(passes):
            delfile.seek(0)
            delfile.write(os.urandom(length))
    os.remove(path)
Shelling out to srm is likely to be faster, however.
You can use srm, sure, but you can also easily implement it in Python. Refer to Wikipedia for the data patterns to overwrite the file content with. Note that, depending on the actual storage technology, the required data patterns may be quite different. Furthermore, if your file is located on a log-structured file system, or even on a file system with copy-on-write optimisation like btrfs, your goal may be unachievable from user space.
After you are done mashing up the disk area that was used to store the file, remove the file itself with os.remove().
If you also want to erase any trace of the file name, you can try to allocate and deallocate a whole bunch of randomly named files in the same directory, though depending on the directory's inode structure (linear, btree, hash, etc.) it may be very tough to guarantee you actually overwrote the old file name.
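As a small illustration of the file-name point, one sketch (with no guarantee on any particular file system) is to rename the file to a random name of similar length before removing it, in the spirit of shred -u:

import os

def scrub_name_and_remove(path):
    # Rename to a random hex name of roughly the same length, then delete.
    # This does not guarantee the old name is unrecoverable on every file system.
    directory, name = os.path.split(path)
    n = max(len(name), 8)
    random_name = os.urandom(n).hex()[:n]
    new_path = os.path.join(directory, random_name)
    os.rename(path, new_path)
    os.remove(new_path)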
So, at least in Python 3, using #kindall's solution I only got it to append, meaning the entire contents of the file were still intact and every pass just added to the overall size of the file. It ended up being [Original Contents][Random Data of that Size][Random Data of that Size][Random Data of that Size], which is obviously not the desired effect.
This trickery worked for me, though. I open the file in append mode to find the length, then reopen it in r+ so that I can seek to the beginning (in append mode, what seems to have caused the undesired effect is that it was not actually possible to seek to 0).
So check this out:
def secure_delete(path, passes=3):
    with open(path, "ba+", buffering=0) as delfile:
        length = delfile.tell()
        delfile.close()
    with open(path, "br+", buffering=0) as delfile:
        #print("Length of file:%s" % length)
        for i in range(passes):
            delfile.seek(0, 0)
            delfile.write(os.urandom(length))
            #wait = input("Pass %s Complete" % i)
        #wait = input("All %s Passes Complete" % passes)
        delfile.seek(0)
        for x in range(length):
            delfile.write(b'\x00')
        #wait = input("Final Zero Pass Complete")
    os.remove(path)
    # Note that the TRUE shred actually renames the file to all zeros (with the
    # length of the filename considered) to thwart metadata filename collection;
    # I didn't really care to implement that here.
Uncomment the prompts to check the file after each pass. This looked good when I tested it, with the caveat that the filename is not shredded like the real shred -zu does.
The answers implementing a manual solution did not work for me. My solution is as follows; it seems to work okay.
import os

def secure_delete(path, passes=1):
    length = os.path.getsize(path)
    with open(path, "br+", buffering=-1) as f:
        for i in range(passes):
            f.seek(0)
            f.write(os.urandom(length))
        f.close()

Python Disk Imaging

I'm trying to make a script for disk imaging (such as to the .dd format) in Python. It originally started as a project to get another hex debugger, and I got more interested in trying to get raw data from the drive, which turned into wanting to be able to image the drive first. Anyway, I've been looking around for about a week or so, and the best way to get data from the drive for smaller drives appears to be something like:
with file("/dev/sda") as f:
    i = file("~/imagingtest.dd", "wb")
    i.write(f.read(SIZE))
with SIZE being the disk size. The problem is (and this seems to be a well-known issue) that with larger disks (in my case a total size of 250059350016 bytes) this shows up as:
"OverflowError: Python int too large to convert to C long"
Is there a more appropriate way to get around this issue? It works fine for a small flash drive, but trying to image a whole drive fails.
I've seen mention of possibly just iterating by the sector size (512) for the number of sectors (in my case 488397168), but I'd like to verify exactly how to do this in a way that is functional.
Thanks in advance for any assistance, sorry for any ignorance you easily notice.
Yes, that's how you should do it, though you could go higher than the sector size if you wanted to.
with open("/dev/sda",'rb') as f:
with open("~/imagingtest.dd", "wb") as i:
while True:
if i.write(f.read(512)) == 0:
break
Read the data in blocks. When you reach the end of the device, .read(blocksize) will return an empty bytes object (b'').
You can use iter() with a sentinel to do this easily in a loop:
from functools import partial

blocksize = 12345
with open("/dev/sda", 'rb') as f:
    for block in iter(partial(f.read, blocksize), b''):
        pass  # do something with the data block
You really want to open the device in binary mode, 'rb' if you want to make sure no line translations take place.
However, if you are trying to create copy into another file, you want to look at shutil.copyfile():
import shutil
shutil.copyfile('/dev/sda', 'destinationfile')
It'll take care of the opening, reading and writing for you. If you want more control over the blocksize used for that, use shutil.copyfileobj(), open the file objects yourself, and specify a blocksize:
import shutil
blocksize = 12345
with open("/dev/sda", 'rb') as f, open('destinationfile', 'wb') as dest:
    shutil.copyfileobj(f, dest, blocksize)

Does the Python "open" function save its content in memory or in a temp file?

For the following Python code:
fp = open('output.txt', 'wb')
# Very big file, writes a lot of lines, n is a very large number
for i in range(1, n):
    fp.write('something' * n)
fp.close()
The writing process above can last more than 30 minutes. Sometimes I get a MemoryError. Is the content of the file stored in memory before closing, or is it written to a temp file? If it is in a temporary file, what is its general location on a Linux OS?
Edit:
Added fp.write in a for loop
It's stored in the operating system's disk cache in memory until it is flushed to disk, either implicitly due to timing or space issues, or explicitly via fp.flush().
There will be write buffering in the Linux kernel, but at (ir)regular intervals the buffers will be flushed to disk. Running out of such buffer space should never cause an application-level memory error; the buffers should empty before that happens, pausing the application while they do so.
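If you want the data not just handed to the kernel but actually forced out to stable storage, a sketch of the usual pattern (with an example filename) is flush() followed by os.fsync():

import os

with open('output.txt', 'wb') as fp:
    fp.write(b'something')
    fp.flush()              # push Python's userspace buffer to the kernel
    os.fsync(fp.fileno())   # ask the kernel to write its cache through to disk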
Building on ataylor's comment to the question:
You might want to nest your loop. Something like:
for i in range(1, n):
    for each in range(n):
        fp.write('something')
fp.close()
That way, the only thing that gets put into memory is the string "something", not "something" * n.
If you are writing out a large file for which the writes might fail, you are better off flushing the file to disk yourself at regular intervals using fp.flush(). That way the file will be in a location of your choosing that you can easily get to, rather than being at the mercy of the OS:
fp = open('output.txt', 'wb')
counter = 0
for line in many_lines:
    fp.write(line)
    counter += 1
    if counter > 999:
        fp.flush()
        counter = 0
fp.close()
This will flush the file to disk every 1000 lines.
If you write line by line, it should not be a problem. You should show the code of what you are doing before the write. For a start, you can try deleting objects where they are not necessary, using fp.flush(), etc.
File writing should never give a memory error; in all probability, you have a bug somewhere else.
If you have a loop and a memory error, then I would check whether you are "leaking" references to objects.
Something like this:
def do_something(a, b=[]):
    # b defaults to a single shared list, so it keeps growing across calls
    b.append(a)
    return b

fp = open('output.txt', 'w')
for i in range(1, n):
    something = do_something(i)
    fp.write(str(something))
fp.close()
I am just picking an example here, but in your actual case the reference leak may be much more difficult to find. However, this case leaks memory inside do_something because of the way Python handles default parameters of functions.
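For reference, the usual way to avoid that particular leak is to use None as the default and create the list inside the function:

def do_something(a, b=None):
    # A fresh list is created on every call instead of reusing one shared default.
    if b is None:
        b = []
    b.append(a)
    return b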
