Does python make a copy of opened files in memory?

So I would like to search for filenames with os.walk() and write the resulting list of names to a file. I would like to know which is more efficient: opening the file and writing each result as I find it, or storing everything in a list and then writing the whole list at once. That list could be big, so I wonder if the second solution would even work.

See this example:
import os
fil = open('/tmp/stuff', 'w')
fil.write('aaa')
os.system('cat /tmp/stuff')
You may expect to see aaa, but instead you get nothing. This is because Python has an internal buffer. Writing to disk is expensive, as it has to:
Tell the OS to write it.
Actually transfer the data to the disk (on a hard disk it may involve spinning it up, waiting for IO time, etc.).
Wait for the OS to report success on the writing.
If you write lots of small pieces of data, that overhead adds up to quite some time. Instead, Python keeps a buffer and only actually writes from time to time. You don't have to worry about memory growth, as the buffer is kept small. From the docs:
"0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used."
When you are done, make sure you call fil.close(), or call fil.flush() at any point during execution, or pass buffering=0 to open() to disable buffering.
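For example, a hedged tweak of the snippet above: flushing before the cat makes the data show up.
import os
fil = open('/tmp/stuff', 'w')
fil.write('aaa')
fil.flush()                  # push Python's buffer out to the OS
os.system('cat /tmp/stuff')  # now prints: aaa
fil.close()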
Another thing to consider is what happens if, for some reason, the program exits in the middle of the process. If you store everything in memory, it will be lost. What is already on disk will remain there (though unless you flush, there is no guarantee of how much was actually saved).
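Applied back to the original question, a minimal sketch of the incremental approach (the paths here are placeholders):
import os
with open('/tmp/filenames.txt', 'w') as out:        # hypothetical output file
    for root, dirs, files in os.walk('/some/dir'):  # hypothetical start directory
        for name in files:
            out.write(os.path.join(root, name) + '\n')
# Memory use stays flat: Python's buffer batches all the small writes.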

Related

Make python process writes be scheduled for writeback immediately without being marked dirty

We are building a python framework that captures data from a framegrabber card through a cffi interface. After some manipulation, we try to write RAW images (numpy arrays, using the tofile method) to disk at a rate of around 120 MB/s. We are well aware that our disks are capable of handling this throughput.
The problem we were experiencing was dropped frames, often entire seconds of data completely missing from the framegrabber output. What we found was that these frame drops occurred when our Debian system hit the dirty_background_ratio set in sysctl. The kernel's background flush threads would then kick in, choke up the framegrabber, and cause it to skip frames.
Not surprisingly, setting dirty_background_ratio to 0% got rid of the problem entirely (it is worth noting that even small values like 1% and 2% still resulted in ~40% frame loss).
So, my question is, is there any way to get this python process to write in such a way that it is immediately scheduled for writeout, bypassing the dirty buffer entirely?
Thanks
So here's one way I've managed to do it.
By using the numpy memmap object you can instantiate an array that directly corresponds to a part of the disk. Calling its flush() method, or del-ing the object, causes the array to sync to disk, completely bypassing the OS's dirty buffer. I've successfully written ~280GB to disk at max throughput using this method.
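A minimal sketch of that approach, with a made-up frame shape and output path (np.memmap's flush() calls msync under the hood):
import numpy as np

frame = np.zeros((1024, 1024), dtype=np.uint8)      # stand-in for one grabbed RAW frame
mm = np.memmap('/data/frames.raw', dtype=np.uint8,  # hypothetical output path
               mode='w+', shape=frame.shape)
mm[:] = frame        # copy the frame into the mapped region
mm.flush()           # sync the mapped pages to disk
del mm               # unmap; any remaining dirty pages are written out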
Will continue researching.
Another option is to get the file's descriptor and call os.fsync on it. This will schedule the data for writeback immediately.
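A hedged sketch of that variant (the path and data are placeholders):
import os

f = open('/data/frames.raw', 'wb')   # hypothetical output path
f.write(b'\x00' * 4096)              # stand-in for one block of frame data
f.flush()                            # Python's buffer -> OS page cache
os.fsync(f.fileno())                 # force the OS to write it out to the device now
f.close()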

Does python "file write()" method guarantee data has been correctly written?

I'm new to python and I'm writing a script to patch a file with something like:
def getPatchDatas(file):
    f = open(file, "rb")
    datas = f.read()
    f.close()
    return datas
f = open("myfile.bin","r+b")
f.seek(0xC020)
f.write(getPatchDatas("mypatch.bin"))
f.close()
I would like to be sure the patch has been applied correctly.
So, if no error / exception is raised, does it mean I'm 100% sure the patch has been correctly written?
Or is it better to double check with something like:
f = open("myfile.bin","rb")
f.seek(0xC020)
if not f.read(0x20) == getPatchDatas("mypatch.bin"):
    print "Patch not applied correctly!"
f.close()
??
Thanks.
No, it doesn't, but in practice it roughly does. It depends how much it matters to you.
Anything could go wrong - it could be a consumer hard disk which lies to the operating system about when it has finished writing data to disk. It could be corrupted in memory and that corrupt version gets written to disk, or it could be corrupted inside the disk during writing by electrical or physical problems.
It could be intercepted by kernel modules on Linux, filter drivers on Windows or a FUSE filesystem provider which doesn't actually support writing but pretends it does, meaning nothing was written.
It could be screwed up by a corrupted Python install where exceptions don't work or were deliberately hacked out of it, or file objects monkeypatched, or accidentally run in an uncommon implementation of Python which fakes supporting files but is otherwise identical.
These kinds of reasons are why servers use server-class hardware with higher tolerance to temperature and electrical variation, error-checking and correcting (ECC) memory, RAID controller battery backups, checksumming filesystems like ZFS, uninterruptible power supplies, and so on.
But, as far as normal people and low risk things go - if it's written without error, it's as good as written. Double-checking makes sense - especially as it's that easy. It's nice to know if something has failed.
In a single process, it is.
With multiple processes (e.g. one process writing and another reading; even if you ensure the read only happens after write() is called, the write needs some time to finish), you may need a file lock.
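For example, a minimal advisory-lock sketch using fcntl (POSIX only; both processes must take the lock, and getPatchDatas is the question's own helper):
import fcntl

f = open("myfile.bin", "r+b")
fcntl.flock(f, fcntl.LOCK_EX)   # block until we hold the exclusive lock
f.seek(0xC020)
f.write(getPatchDatas("mypatch.bin"))
fcntl.flock(f, fcntl.LOCK_UN)   # release so the reading process can proceed
f.close()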

What happens if I don't close a txt file

I'm about to write a program for a racecar that creates a txt file and continuously adds new lines to it. Unfortunately I can't close the file, because when the car shuts off, the Raspberry Pi the program is running on also shuts down. So I have no chance to close the txt file.
Is this a problem?
Yes and no. Data is buffered at several places on its way to disk: Python's file object, the underlying C functions, the operating system, and the disk controller. Even closing the file does not guarantee that all of these buffers have been written physically; only the first two levels are forced to push their buffers to the next level. The same can be achieved by flushing the file handle without closing it.
As long as a power-off can occur at any time, you have to deal with the fact that some data may be lost or only partially written.
Closing a file is also important to free the operating system's limited resources, but that is not a concern in your setup.
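A hedged sketch of the flush-per-line idea (the path and helper name are made up); adding os.fsync pushes the OS buffer too, at the cost of throughput:
import os

log = open('/home/pi/telemetry.txt', 'a')   # hypothetical log file

def log_line(text):
    log.write(text + '\n')
    log.flush()                # Python's buffer -> OS
    os.fsync(log.fileno())     # OS buffer -> storage; bounds the loss to one line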

Parallel I/O - why does it work?

I have a python function which reads a line from a text file and writes it to another text file. It repeats this for every line in the file. Essentially:
Read line 1 -> Write line 1 -> Read line 2 -> Write line 2...
And so on.
I can parallelise this process, using a queue to pass data, so it is more like:
Read line 1 -> Read line 2 -> Read line 3...
Write line 1 -> Write line 2....
My question is: why does this work (as in, why do I get a speed-up)? It sounds like a daft question, but I was thinking: surely my hard disk can only do one thing at once? So why isn't one process put on hold until the other has completed?
Things like this are hidden from the user when writing in a high-level language. I'd like to know what's going on at a low level.
In short: IO buffering. Two levels of it, even.
First, Python itself has IO buffers. So, when you write all those lines to the file, Python doesn't necessarily invoke the write syscall immediately; it does that when it flushes its buffers, which could be any time between the call to write and the closing of the file. (This obviously doesn't apply if you make the syscalls yourself.)
But separate from this, the operating system also implements buffers. These work the same way: you make the 'write to disk' syscall, and the OS puts the data in its write buffer, serving subsequent reads of that file from it. But it doesn't necessarily write the data to disk yet; it can wait, theoretically, until you unmount that filesystem (possibly at shutdown). This is (part of) why it can be a bad idea to unplug a USB storage device without unmounting or 'safely removing' it: things you've written to it aren't necessarily physically on the device yet. Whatever the OS does is unaffected by what language you're writing in, or by how thick a wrapper around the syscalls you have.
As well as this, both Python and the OS do read buffering: when you read one line from the file, Python and the OS anticipate that you might want the next several lines too, and read them into main memory ahead of time to avoid having to go all the way down to the disk later.
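A minimal sketch of such a pipeline with two threads and a queue (file names are placeholders):
import threading
import queue

q = queue.Queue(maxsize=1000)   # bounded, so the reader can't run arbitrarily far ahead

def reader(path):
    with open(path) as src:
        for line in src:
            q.put(line)
    q.put(None)                 # sentinel: no more lines

def writer(path):
    with open(path, 'w') as dst:
        while True:
            line = q.get()
            if line is None:
                break
            dst.write(line)

t1 = threading.Thread(target=reader, args=('in.txt',))
t2 = threading.Thread(target=writer, args=('out.txt',))
t1.start(); t2.start()
t1.join(); t2.join()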

Abort a slow flush to disk after write?

Is there a way to abort a python write operation in such a way that the OS doesn't feel it's necessary to flush the unwritten data to the disc?
I'm writing data to a USB device, typically many megabytes. I'm using 4096 bytes as my block size for the writes, but it appears that Linux caches up a bunch of data early on and writes it out to the USB device slowly. If at some point during the write my user decides to cancel, I want the app to stop writing immediately. I can see that there's a delay between when the data stops flowing from the application and when the USB activity light stops blinking: several seconds, up to about 10 seconds typically. I find that the app hangs in the close() method, I assume waiting for the OS to finish writing the buffered data. I call flush() after every write, but that doesn't appear to have any impact on the delay. I've scoured the python docs for an answer but have found nothing.
It's somewhat filesystem dependent, but on some filesystems, if you delete a file before all of it has been allocated, the IO to write the blocks will never happen. The same may be true if you truncate it, so that the part which is still being written gets chopped off.
Not sure that you can really abort a write if you still want to access the data. Also, the kinds of filesystems that support this (e.g. xfs, ext4) are not normally used on USB sticks.
If you do want to flush data to the disc, use fdatasync(). Merely flushing your IO library's buffer into the OS's will not achieve any physical flushing.
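For instance (hedged; the mount point is hypothetical):
import os

f = open('/media/usb/out.bin', 'wb')   # hypothetical USB mount point
f.write(b'\x00' * 4096)                # one block of data
f.flush()                              # library buffer -> OS page cache
os.fdatasync(f.fileno())               # page cache -> device (data only, no metadata)
f.close()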
Assuming I understand this correctly, you want to be able to 'abort' and NOT flush the data. This IS possible using ctypes and a little pokery. It is very OS dependent, so I'll give you the OS X version and then what you need to change for Linux:
f = open('flibble1.txt', 'w')
f.write("hello world")
import ctypes
x = ctypes.cdll.LoadLibrary('/usr/lib/libc.dylib')
x.close(f.fileno())   # close the descriptor behind Python's back: nothing gets flushed
try:
    del f
except IOError:
    pass
If you change /usr/lib/libc.dylib to /usr/lib/libc.so.6 for Linux, you should be good to go. Basically, by calling close() instead of fclose(), no call to fsync() is made and nothing is flushed.
Hope that's useful.
When you abort the write operation, try doing file.truncate(0) before closing it.
