Parallel I/O - why does it work? - python

I have a python function which reads a line from a text file and writes it to another text file. It repeats this for every line in the file. Essentially:
Read line 1 -> Write line 1 -> Read line 2 -> Write line 2...
And so on.
I can parallelise this process, using a queue to pass data, so it is more like:
Read line 1 -> Read line 2 -> Read line 3...
Write line 1 -> Write line 2....
My question is - why does this work (as in, why do I get a speed-up)? It sounds like a daft question, but I was thinking - surely my hard disk can only do one thing at once? So why isn't one process put on hold until the other has completed?
Things like this are hidden from the user when writing in a high-level language... I'd like to know what's going on at a low level.

In short: IO buffering. Two levels of it, even.
First, Python itself has IO buffers. So, when you write all those lines to the file, Python doesn't necessarily invoke the write syscall immediately - it does that when it flushes its buffers, which could be any time from when you call write until you close the file. This obviously doesn't apply if you work at a low enough level to make the syscalls yourself.
But separate to this, the operating system will also implement buffers. These work the same way - you make the 'write to disk' syscall, the OS puts the data in its write buffer and will use that when other processes read that file back. But it doesn't necessarily write it to disk yet - it can wait, theoretically, until you unmount that filesystem (possibly at shutdown). This is (part of) why it can be a bad idea to unplug a USB storage device without unmounting or 'safely removing' it, for example - things you've written to it aren't necessarily physically on the device yet. Anything the OS does is unaffected by what language you're writing in, or how much of a wrapper around the syscalls you have.
As well as this, both Python and the OS can do read buffering - essentially, when you read one line from the file, Python/the OS anticipates that you might be interested in the next several lines as well, and so reads them into main memory to avoid having to defer all the way down to the disk itself later.
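As a concrete illustration, here is a minimal sketch of the reader/writer split described in the question, using Python 3's threading and queue modules; the file names are placeholders:

import queue
import threading

SENTINEL = None  # marks the end of the stream

def reader(src_path, q):
    # Read lines and hand them to the writer as fast as the disk allows.
    with open(src_path) as src:
        for line in src:
            q.put(line)
    q.put(SENTINEL)  # tell the writer there is nothing more to come

def writer(dst_path, q):
    # Pull lines off the queue and write them; they land in Python's and
    # then the OS's write buffers long before they reach the physical disk.
    with open(dst_path, 'w') as dst:
        while True:
            line = q.get()
            if line is SENTINEL:
                break
            dst.write(line)

q = queue.Queue(maxsize=1000)  # bounded so the reader cannot run far ahead
t_read = threading.Thread(target=reader, args=('input.txt', q))
t_write = threading.Thread(target=writer, args=('output.txt', q))
t_read.start(); t_write.start()
t_read.join(); t_write.join()

Both threads mostly talk to the buffers described above rather than to the physical disk, which is why the reading and writing can overlap instead of serialising on the hardware.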

Related

In Python 2.7.x how to verify file writing is done to disk after .close()?

I am writing a tool in Python 2.7.x on Windows 10 / Server 2016. As part of my program I write a file of variable size (could be 1KiB, could be 1GiB, or anything!) and I have been having an issue where things that happen after I call myFile.close() run into a case where the file hasn't been fully written to the disk yet, even though .close() was called and returned (no multi-threading or multi-processing here).
What is the best way in Python 2.7.x on a Windows 10/server2016 system to verify that all of the file contents I wanted pushed to the disk was actually done writing on that disk?
I know that a time.sleep(1) helps, but that's arbitrary and I don't know how fast the disk write speed actually is (the range is from 1 MB/s to 3 GB/s), and the file could be quite large! So I need something less arbitrary that can check that all of the file was fully written to disk before my tool continues to the next step.
You should call myFile.flush(), followed by os.fsync(myFile.fileno()), before closing the file. Reference: https://docs.python.org/2/library/os.html#files-and-directories
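A minimal sketch of that pattern, with a placeholder file name and payload:

import os

data = b'example payload'  # placeholder contents

with open('output.bin', 'wb') as f:
    f.write(data)
    f.flush()              # push Python's internal buffer down to the OS
    os.fsync(f.fileno())   # ask the OS to push its buffers to the disk
# by the time the with-block closes the file, the data has been fsync'ed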

What happens if I don't close a txt file

I'm about to write a program for a racecar that creates a txt file and continuously adds new lines to it. Unfortunately I can't close the file, because when the car shuts off, the Raspberry Pi (which the program is running on) is also shut down. So I have no chance of closing the txt file.
Is this a problem?
Yes and no. Data is buffered at different places in the process of writing: the Python file object, the underlying C functions, the operating system, and the disk controller. Even closing the file does not guarantee that all these buffers have been written physically; only the first two levels are forced to write their buffers on to the next level. The same can be achieved by flushing the file handle without closing it.
As long as a power-off can occur at any time, you have to deal with the fact that some data may be lost or partially written.
Closing a file is also important to free limited operating-system resources, but that is not a concern in your setup.
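If losing as little as possible matters more than throughput, one option is to flush (and optionally fsync) after every line. A rough sketch, where read_sensor_line() is a hypothetical stand-in for the real data source and the log path is a placeholder:

import os
import time

def read_sensor_line():
    # hypothetical stand-in; replace with the real telemetry read
    return '%.3f,%d' % (time.time(), 42)

with open('telemetry.log', 'a') as log:   # placeholder file name
    while True:
        log.write(read_sensor_line() + '\n')
        log.flush()               # Python buffer -> OS
        os.fsync(log.fileno())    # OS buffers -> disk; costly, but bounds the loss
        time.sleep(0.1)

With this, at most the line being written when the power is cut is at risk.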

Does python make a copy of opened files in memory?

So I would like to search for filenames with os.walk() and write the resulting list of names to a file. I would like to know which is more efficient: opening the file and then writing each result as I find it, or storing everything in a list and then writing the whole list. That list could be big, so I wonder if the second solution would work.
See this example:
import os
fil = open('/tmp/stuff', 'w')
fil.write('aaa')
os.system('cat /tmp/stuff')
You may expect to see aaa, but instead you get nothing. This is because Python has an internal buffer. Writing to disk is expensive, as it has to:
Tell the OS to write it.
Actually transfer the data to the disk (on a hard disk it may involve spinning it up, waiting for IO time, etc.).
Wait for the OS to report success on the writing.
If you write many small things, this can add up to quite some time. Instead, what Python does is keep a buffer and only actually write from time to time. You don't have to worry about memory growth, as the buffer is kept at a small size. From the docs:
"0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used."
When you are done, make sure you call fil.close(), or call fil.flush() at any point during execution, or pass buffering=0 to open() to disable buffering.
Another thing to consider is what happens if, for some reason, the program exits in the middle of the process. If you store everything in memory, it will be lost. What you have on disk will remain there (but unless you flush, there is no guarantee of how much was actually saved).
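Coming back to the original os.walk() question, the streaming version is simple and keeps memory flat no matter how many files there are. A sketch with placeholder paths:

import os

with open('filenames.txt', 'w') as out:        # placeholder output path
    for dirpath, dirnames, filenames in os.walk('/some/dir'):  # placeholder root
        for name in filenames:
            out.write(os.path.join(dirpath, name) + '\n')
# leaving the with-block closes the file, which flushes Python's buffer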

When does Python write a file to disk?

I have a library that interacts with a configuration file. When the library is imported, the initialization code reads the configuration file, possibly updates it, and then writes the updated contents back to the file (even if nothing was changed).
Very occasionally, I encounter a problem where the contents of the configuration file simply disappear. Specifically, this happens when I run many invocations of a short script (using the library), back-to-back, thousands of times. It never occurs in the same directories, which leads me to believe it's a somewhat random problem - specifically a race condition with IO.
This is a pain to debug, since I can never reliably reproduce the problem and it only happens on some systems. I have a suspicion about what might happen, but I wanted to see if my picture of file I/O in Python is correct.
So the question is, when does a Python program actually write file contents to a disk? I thought that the contents would make it to disk by the time that the file closed, but then I can't explain this error. When python closes a file, does it flush the contents to the disk itself, or simply queue it up to the filesystem? Is it possible that file contents can be written to disk after Python terminates? And can I avoid this issue by using fp.flush(); os.fsync(fp.fileno()) (where fp is the file handle)?
If it matters, I'm programming on a Unix system (Mac OS X, specifically). Edit: Also, keep in mind that the processes are not running concurrently.
Appendix: Here is the specific race condition that I suspect:
Process #1 is invoked.
Process #1 opens the configuration file in read mode and closes it when finished.
Process #1 opens the configuration file in write mode, erasing all of its contents. The erasing of the contents is synced to the disk.
Process #1 writes the new contents to the file handle and closes it.
Process #1: Upon closing the file, Python tells the OS to queue writing these contents to disk.
Process #1 closes and exits.
Process #2 is invoked.
Process #2 opens the configuration file in read mode, but the new contents aren't synced yet. Process #2 sees an empty file.
The OS finally finishes writing the contents to disk, after Process #2 has read the file.
Process #2, thinking the file is empty, sets defaults for the configuration file.
Process #2 writes its version of the configuration file to disk, overwriting the last version.
It is almost certainly not Python's fault. If Python closes the file, OR exits cleanly (rather than being killed by a signal), then the OS will have the new contents for the file. Any subsequent open should return the new contents. There must be something more complicated going on. Here are some thoughts.
What you describe sounds more likely to be a filesystem bug than a Python bug, and a filesystem bug is pretty unlikely.
Filesystem bugs are far more likely if your files actually reside in a remote filesystem. Do they?
Do all the processes use the same file? Do "ls -li" on the file to see its inode number, and see if it ever changes. In your scenario, it should not. Is it possible that something is moving files, or moving directories, or deleting directories and recreating them? Are there symlinks involved?
Are you sure that there is no overlap in the running of your programs? Are any of them run from a shell with "&" at the end (i.e. in the background)? That could easily mean that a second one is started before the first one is finished.
Are there any other programs writing to the same file?
This isn't your question, but if you need atomic changes (so that any program running in parallel only sees either the old version or the new one, never the empty file), the way to achieve it is to write the new content to another file (e.g. "foo.tmp"), then do os.rename("foo.tmp", "foo"). Rename is atomic.
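A sketch of that write-to-temp-then-rename pattern, with hypothetical names:

import os

def write_config_atomically(path, contents):
    tmp_path = path + '.tmp'          # temp file in the same directory/filesystem
    with open(tmp_path, 'w') as tmp:
        tmp.write(contents)
        tmp.flush()
        os.fsync(tmp.fileno())        # make sure the new contents are on disk
    os.rename(tmp_path, path)         # atomic on POSIX: readers see old or new, never empty

write_config_atomically('app.conf', 'key = value\n')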

Python: Reading New Information Added To Massive Files

I'm working on a Python script to parse Squid (http://www.squid-cache.org/) log files. While the logs are rotated every day to stop them getting too big, they do reach between 40-90MB by the end of each day.
Essentially what I'm doing is reading the file line by line, parsing out the data I need (IP, requested URL, time) and adding it to an sqlite database. However this seems to be taking a very long time (it's been running for over 20 minutes now).
So obviously, re-reading the file can't be done. What I would like to do is read the file and then detect all new lines written. Or even better, at the start of the day the script will simply read the data in real time as it is added so there will never be any long processing times.
How would I go about doing this?
One way to achieve this is by emulating tail -f. The script would constantly monitor the file and process each new line as it appears.
For a discussion and some recipes, see tail -f in python with no time.sleep
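The simplest polling variant of that emulation looks something like this (the linked post covers ways to avoid the sleep); the log path is a placeholder:

import time

def follow(path):
    with open(path) as f:
        f.seek(0, 2)               # 2 == os.SEEK_END: start at the end of the file
        while True:
            line = f.readline()
            if not line:           # nothing new yet
                time.sleep(0.5)
                continue
            yield line

for new_line in follow('/var/log/squid/access.log'):
    print(new_line, end='')        # replace with the parsing + sqlite insert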
One way to do this is to use file-system monitoring with py-inotify (http://pyinotify.sourceforge.net/) and set a callback function to be executed whenever the log file's size changes.
Another way to do it, without requiring external modules, is to record in the filesystem (possibly in your sqlite database itself) the offset of the end of the last line read from the log file (which you get with file.tell()), and just read the newly added lines from that offset onwards, which is done with a simple call to file.seek(offset) before looping through the lines.
The main difference between keeping track of the offset and the "tail" emulation described in the other post is that this one allows your script to be run multiple times, i.e. there is no need for it to be running continually, and it can pick up where it left off after a crash.
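A sketch of the offset-tracking approach, with placeholder paths and the parsing left as a comment:

import os

LOG_FILE = '/var/log/squid/access.log'   # placeholder
OFFSET_FILE = 'squid.offset'             # could just as well be a row in sqlite

offset = 0
if os.path.exists(OFFSET_FILE):
    with open(OFFSET_FILE) as f:
        offset = int(f.read().strip() or 0)

if offset > os.path.getsize(LOG_FILE):
    offset = 0                           # the log was rotated; start over

with open(LOG_FILE) as log:
    log.seek(offset)                     # skip everything handled by the last run
    for line in log:
        pass                             # parse the line and insert into sqlite here
    offset = log.tell()                  # remember where this run stopped

with open(OFFSET_FILE, 'w') as f:
    f.write(str(offset))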
