When is a write to disk triggered? - python

In Python, I can open a file with f= open(<filename>,<permissions>). This returns an object f which I can write to using f.write(<some data>).
If, at this point, I access the original file (e.g. with cat from a terminal), it appears empty: Python stored the data I wrote to the object f and not the actual on-disk file. If I then call f.close(), the data in f is persisted to the on-disk file (and I can access it from other programs).
I assume data is buffered to improve latency. However, what happens if the buffered data grows a lot? Will Python initiate a write? If so, details on the internals (what influences the buffer size? is the disk I/O handled within Python or by another program/thread? is there a chance Python will just hang during the write?) would be much appreciated.

The general subject of I/O buffering has been treated many times (including in questions linked from the comments). But to answer your specific questions:
By default, when writing to a terminal (“the screen”), a newline causes the text to be flushed up through it. For all files, the buffer is flushed each time it fills. (Large single writes might flush any existing buffer contents and then bypass it.)
The buffer has a fixed size and is allocated before any data is written; Python 3 doesn’t use stdio, so it chooses its own buffer sizes. (A few kB is typical.)
The “disk I/O” (really kernel I/O, which is distinguishable only in certain special circumstances like network/power failure) happens within whatever Python write triggers the flush.
Yes, it can hang, if the file is a pipe to a busy process, a socket over a slow network, a special device, or even a regular file mounted from a remote machine.
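The flush-when-full behaviour described above can be observed directly. The following is a minimal sketch, assuming CPython 3 and a regular local file opened in binary mode; observe_buffering and its buffer_size parameter are names made up for this illustration, not part of any library. It records the file's on-disk size after a small write, after a write larger than the buffer, and after close.

```python
import os
import tempfile


def observe_buffering(buffer_size=4096):
    """Return the on-disk file size seen at three points during buffered writes.

    Hypothetical helper for illustration; assumes a regular local file.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb", buffering=buffer_size) as f:
            # Small write: fits in the buffer, so it stays in Python's memory.
            f.write(b"x" * 10)
            after_small = os.path.getsize(path)

            # Write larger than the buffer: the pending bytes are flushed
            # as part of this call.
            f.write(b"y" * (buffer_size * 2))
            after_large = os.path.getsize(path)

        # close() flushes whatever is still buffered.
        after_close = os.path.getsize(path)
    finally:
        os.remove(path)
    return after_small, after_large, after_close


if __name__ == "__main__":
    print(observe_buffering())
```

Text-mode files add another layer on top of this binary buffer, so the exact points at which data appears on disk may differ there.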

Related

In h5py, do I need to call flush() before I close a file?

In the Python HDF5 library h5py, do I need to flush() a file before I close() it?
Or does closing the file already make sure that any data that might still be in the buffers will be written to disk?
What exactly is the point of flushing? When would flushing be necessary?
No, you do not need to flush the file before closing. Flushing is done automatically by the underlying HDF5 C library when you close the file.
As to the point of flushing: file I/O is slow compared to things like memory or cache access. If programs had to wait until data was actually on the disk each time a write was performed, that would slow things down a lot. So the actual writing to disk is buffered by at least the OS, and in many cases also by the I/O library being used (e.g., the C standard I/O library). When you ask to write data to a file, it usually just means that the OS has copied your data to its own internal buffer, and will actually put it on the disk when it's convenient to do so.
Flushing overrides this buffering, at whatever level the call is made. So calling h5py.File.flush() will flush the HDF5 library buffers, but not necessarily the OS buffers. The point of this is to give the program some control over when data actually leaves a buffer.
For example, writing to the standard output is usually line-buffered. But if you really want to see the output before a newline, you can call fflush(stdout). This might make sense if you are piping the standard output of one process into another: that downstream process can start consuming the input right away, without waiting for the OS to decide it's a good time.
Another good example is making a call to fork(2). This usually copies the entire address space of a process, which means the I/O buffers as well. That may result in duplicated output, unnecessary copying, etc. Flushing a stream guarantees that the buffer is empty before forking.
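The difference between flushing a library buffer and flushing the OS buffers can be sketched with plain Python files (this does not use h5py; write_durably is a hypothetical helper name). flush() hands the data to the OS, and os.fsync() then asks the OS to push it to the storage device.

```python
import os


def write_durably(path, data):
    """Write bytes to path and ask both buffering layers to let go of them.

    Hypothetical helper for illustration only.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()               # Python's buffer -> operating system
        os.fsync(f.fileno())    # operating system cache -> storage device
```

For most programs the flush() that close() performs is enough; the extra fsync() is only worth its cost when the data must survive a crash or power loss.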

Why doesn't flask show console log in the right order? [duplicate]

I learned that by default I/O in programs is buffered, i.e. data is served to the requesting program from temporary storage.
I understand that buffering improves IO performance (maybe by reducing system calls). I have seen examples of disabling buffering, like setvbuf in C. What is the difference between the two modes and when should one be used over the other?
You want unbuffered output whenever you want to ensure that the output has been written before continuing. One example is standard error under a C runtime library - this is usually unbuffered by default. Since errors are (hopefully) infrequent, you want to know about them immediately. On the other hand, standard output is buffered simply because it's assumed there will be far more data going through it.
Another example is a logging library. If your log messages are held within buffers in your process, and your process dumps core, there is a very good chance that output will never be written.
In addition, it's not just system calls that are minimized but disk I/O as well. Let's say a program reads a file one byte at a time. With unbuffered input, you will go out to the (relatively very slow) disk for every byte even though it probably has to read in a whole block anyway (the disk hardware itself may have buffers but you're still going out to the disk controller which is going to be slower than in-memory access).
By buffering, the whole block is read in to the buffer at once then the individual bytes are delivered to you from the (in-memory, incredibly fast) buffer area.
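To make the byte-at-a-time case concrete, here is a rough Python sketch that counts how often the raw file is actually read, with and without a buffer in front of it. CountingFileIO and count_raw_reads are names invented for this example; the 4096-byte buffer size is an arbitrary assumption.

```python
import io


class CountingFileIO(io.FileIO):
    """Raw (unbuffered) file that counts how many times it is read.

    Illustrative subclass; each counted call corresponds roughly to one
    trip into the operating system.
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.raw_reads = 0

    def read(self, size=-1):
        self.raw_reads += 1
        return super().read(size)

    def readinto(self, buffer):
        self.raw_reads += 1
        return super().readinto(buffer)


def count_raw_reads(path, buffered, buffer_size=4096):
    """Read a file one byte at a time and return the number of raw reads."""
    raw = CountingFileIO(path, "r")
    stream = io.BufferedReader(raw, buffer_size=buffer_size) if buffered else raw
    with stream:
        while stream.read(1):
            pass
    return raw.raw_reads
```

With the buffer in place, the one-byte reads are served from memory and the raw file is only touched once per buffer-sized block.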
Keep in mind that buffering can take many forms, such as in the following example:
+-------------------+-------------------+
| Process A         | Process B         |
+-------------------+-------------------+
| C runtime library | C runtime library | C RTL buffers
+-------------------+-------------------+
|               OS caches               | Operating system buffers
+---------------------------------------+
|    Disk controller hardware cache     | Disk hardware buffers
+---------------------------------------+
|                 Disk                  |
+---------------------------------------+
You want unbuffered output when you already have a large sequence of bytes ready to write to disk, and want to avoid an extra copy into a second buffer in the middle.
Buffered output streams will accumulate write results into an intermediate buffer, sending it to the OS file system only when enough data has accumulated (or flush() is requested). This reduces the number of file system calls. Since file system calls can be expensive on most platforms (compared to short memcpy), buffered output is a net win when performing a large number of small writes. Unbuffered output is generally better when you already have large buffers to send -- copying to an intermediate buffer will not reduce the number of OS calls further, and introduces additional work.
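The trade-off can be illustrated in Python with an in-memory stand-in for the OS that simply counts how many writes reach it. CountingSink and raw_calls are hypothetical names, and the sizes are arbitrary; this is a sketch of the idea, not a benchmark.

```python
import io


class CountingSink(io.RawIOBase):
    """Stand-in for a raw file: accepts everything and counts write() calls."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.total = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.total += len(b)
        return len(b)


def raw_calls(n_writes, chunk, buffer_size=None):
    """Write chunk n_writes times; return (raw write calls, total bytes).

    buffer_size=None means unbuffered: every write goes straight to the sink.
    """
    sink = CountingSink()
    out = sink if buffer_size is None else io.BufferedWriter(sink, buffer_size=buffer_size)
    for _ in range(n_writes):
        out.write(chunk)
    out.flush()
    return sink.calls, sink.total


if __name__ == "__main__":
    print("many small, unbuffered:", raw_calls(1000, b"0123456789"))
    print("many small, buffered:  ", raw_calls(1000, b"0123456789", 4096))
    print("one large, unbuffered: ", raw_calls(1, b"x" * 1_000_000))
    print("one large, buffered:   ", raw_calls(1, b"x" * 1_000_000, 4096))
```

Many small writes collapse into a handful of calls when buffered, while a single large write gains nothing from the intermediate buffer.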
Unbuffered output has nothing to do with ensuring your data reaches the disk. Unbuffered I/O writes don't guarantee the data has reached the physical disk: the OS file system is free to hold on to a copy of your data for as long as it likes before writing it out. Flushing a stream (which close() does on your behalf) only pushes the library's own buffer down to the OS; to force the OS to commit the data to disk you need a sync call such as fsync().

Is there any advantage to reading the entire file

Are there any advantages/disadvantages to reading an entire file in one go rather than reading the bytes as required? So is there any advantage to:
file_handle = open("somefile", "rb")
file_contents = file_handle.read()
# do all the things using file_contents
compared to:
file_handle = open("somefile", "rb")
part1 = file_handle.read(10)
# do some stuff
part2 = file_handle.read(8)
# do some more stuff etc
Background: I am writing a p-code (bytecode) interpreter in Python and have initially just written a naive implementation that reads bytes from the file as required and performs the necessary actions etc. A friend I was showing the program has suggested that I should instead read the entire file into memory (Python list?) and then process it from memory to avoid lots of slow disk reads. The test files are currently less than 1KB and will probably be at most a few 100KB so I would have expected the Operating System and disk controller system to cache the file obviating any performance issues caused by repeatedly reading small chunks of the file.
Cache aside, you still have system calls. Each read() results in a mode switch to trigger the kernel. You can see this with strace or another tool to look at system calls.
This might be premature for a 100 KB file though. As always, test your code to know for sure.
If you want to do any kind of random access then putting it in a list is going to be much faster than seeking from disk. Even if the OS does cache disk access, you are hitting another layer of cache. In any case, you can't be sure how the OS will behave.
Here are 3 cases I can think of that would motivate doing it in-memory:
You might have a jump instruction which you can execute by adding a number to your program counter. Doing that to the index of an array vs seeking a file is a good use case.
You may want to optimise your VM's behaviour, and that may involve reading the file more than once. Scanning a list twice vs reading a file twice will be much quicker.
Depending on opcodes and the grammar of your language you may want to look ahead in a 'cycle' to speed up execution. If that ends up doing two seeks then this could end up degrading performance.
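As a rough illustration of the first case, here is a toy interpreter that loads the whole file into a bytes object and treats jumps as arithmetic on an index. The three opcodes and their encoding are invented for this sketch and have nothing to do with the asker's actual p-code format.

```python
# Hypothetical opcodes, made up for this example.
OP_HALT = 0x00
OP_PUSH = 0x01   # followed by one operand byte
OP_JUMP = 0x02   # followed by a relative offset byte


def load_program(path):
    """Read the entire bytecode file into memory in one call."""
    with open(path, "rb") as f:
        return f.read()


def run(code):
    """Execute the toy bytecode and return the final stack."""
    stack = []
    pc = 0  # program counter: just an index into the bytes object
    while pc < len(code):
        op = code[pc]
        if op == OP_HALT:
            break
        elif op == OP_PUSH:
            stack.append(code[pc + 1])
            pc += 2
        elif op == OP_JUMP:
            pc += code[pc + 1]  # a jump is plain addition, no file seek
        else:
            raise ValueError(f"unknown opcode {op:#x} at offset {pc}")
    return stack
```

With the program held in a bytes object, look-ahead and repeated passes are ordinary indexing as well.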
If your file will always be small enough to fit in RAM then it's probably worth reading it all into memory. Profile it with a real program and see if it's noticeably faster.
A single call to read() will be faster than multiple calls to read(). The tradeoff is that with a single call you must be able to fit all data in memory at once, whereas with multiple reads you only have to retain a fraction of the total amount of data. For files that are just a few kilobytes or megabytes, the difference won't be noticeable. For files that are several gigs in size, memory becomes more important.
Also, to do a single read means all of the data must be present, whereas multiple reads can be used to process data as it is streaming in from an external source.
If you are looking for performance, I would recommend using generators. Since your files are small, memory will not be a big concern, but it's still good practice. Still, reading a file from disk multiple times is a definite bottleneck for a scalable solution.
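One way the generator suggestion might look in practice is a function that yields the file in fixed-size chunks, so only one chunk is held in memory at a time. read_chunks and its default chunk size are assumptions made for this sketch.

```python
from functools import partial


def read_chunks(path, chunk_size=64 * 1024):
    """Yield successive chunks of a binary file without loading it all at once."""
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read(chunk_size)
        # until it returns b"", which signals end of file.
        yield from iter(partial(f.read, chunk_size), b"")
```

A caller can then write something like `for chunk in read_chunks("somefile"): ...` and process data as it arrives.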

python read() and write() in large blocks / memory management

I'm writing some python code that splices together large files at various points. I've done something similar in C where I allocated a 1MB char array and used that as the read/write buffer. And it was very simple: read 1MB into the char array then write it out.
But with python I'm assuming it is different: each time I call read() with size = 1M, it will allocate a 1M long character string. And hopefully when the buffer goes out of scope it will be freed in the next gc pass.
Would python handle the allocation this way? If so, would the constant allocation/deallocation cycle be computationally expensive?
Can I tell python to use the same block of memory just like in C? Or is the python vm smart enough to do it itself?
I guess what I'm essentially aiming for is kinda like an implementation of dd in python.
Search site docs.python.org for readinto to find docs appropriate for the version of Python you're using. readinto is a low-level feature. They'll look a lot like this:
readinto(b)
Read up to len(b) bytes into bytearray b and return the number of bytes read.
Like read(), multiple reads may be issued to the underlying raw stream, unless the latter is interactive.
A BlockingIOError is raised if the underlying raw stream is in non blocking-mode, and has no data available at the moment.
But don't worry about it prematurely. Python allocates and deallocates dynamic memory at a ferocious rate, and it's likely that the cost of repeatedly getting & free'ing a measly megabyte will be lost in the noise. And note that CPython is primarily reference-counted, so your buffer will get reclaimed "immediately" when it goes out of scope. As to whether Python will reuse the same memory space each time, the odds are decent but it's not assured. Python does nothing to try to force that, but depending on the entire allocation/deallocation pattern and the details of the system C's malloc()/free() implementation, it's not impossible it will get reused ;-)
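For completeness, a dd-style copy loop built on readinto might look like the sketch below. It assumes Python 3 binary streams; copy_stream and the 1 MB default block size are illustrative choices, not anything prescribed by the docs quoted above.

```python
def copy_stream(src, dst, block_size=1024 * 1024):
    """Copy src to dst through one reusable buffer; return the bytes copied.

    src must support readinto() and dst must support write(), as binary
    file objects do.
    """
    buf = bytearray(block_size)   # allocated once, reused for every block
    view = memoryview(buf)        # lets us slice without copying
    total = 0
    while True:
        n = src.readinto(buf)
        if not n:                 # 0 at end of file (None if a non-blocking raw stream has no data)
            break
        dst.write(view[:n])       # write only the bytes actually read
        total += n
    return total
```

Typical use would be `with open(a, "rb") as src, open(b, "wb") as dst: copy_stream(src, dst)`.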

How does urllib.urlopen() work?

Let's consider a big file (~100MB). Let's consider that the file is line-based (a text file, with relatively short line ~80 chars).
If I use built-in open()/file() the file will be loaded in lazy manner.
I.e. if I do aFile.readline(), only a chunk of the file will reside in memory. Does the urllib.urlopen() do something similar (with usage of a cache on disk)?
How big is the difference in performance between urllib.urlopen().readline() and file().readline()? Let's consider that file is located on localhost. Once I open it with urllib.urlopen() and then with file(). How big will be difference in performance/memory consumption when i loop over the file with readline()?
What is best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or shall I load bunch of lines(~50) into a list and then process the list?
open (or file) and urllib.urlopen look like they're more or less doing the same thing there. urllib.urlopen is (basically) creating a socket._socketobject and then invoking the makefile method (contents of that method included below)
def makefile(self, mode='r', bufsize=-1):
    """makefile([mode[, bufsize]]) -> file object
    Return a regular file object corresponding to the socket. The mode
    and bufsize arguments are as for the built-in open() function."""
    return _fileobject(self._sock, mode, bufsize)
Does the urllib.urlopen() do something similar (with usage of a cache on disk)?
The operating system does. When you use a networking API such as urllib, the operating system and the network card do the low-level work of splitting data into small packets that are sent over the network, and of receiving incoming packets. Those are stored in a cache, so that the application can abstract away the packet concept and pretend it sends and receives continuous streams of data.
How big is the difference in performance between urllib.urlopen().readline() and file().readline()?
It is hard to compare these two. For urllib, this depends on the speed of the network, as well as the speed of the server. Even for local servers, there is some abstraction overhead, so that, usually, it is slower to read from the networking API than from a file directly.
For actual performance comparisons, you will have to write a test script and do the measurement. However, why do you even bother? You cannot replace one with another since they serve different purposes.
What is best way to process a file opened via urllib.urlopen()? Is it faster to process it line by line? Or shall I load bunch of lines(~50) into a list and then process the list?
Since the bottleneck is the networking speed, it might be a good idea to process the data as soon as you get it. This way, the operating system can cache more incoming data "in the background".
It makes no sense to cache lines in a list before processing them. Your program will just sit there waiting for enough data to arrive while it could be doing something useful already.
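Processing each line as it arrives could be sketched as follows. This uses urllib.request.urlopen, the Python 3 counterpart of the urllib.urlopen discussed in the question; iter_lines is a hypothetical helper name.

```python
import urllib.request


def iter_lines(url):
    """Yield the lines of a URL one at a time, as bytes, as they are read."""
    with urllib.request.urlopen(url) as response:
        for raw_line in response:
            # Each line is handed to the caller immediately, so processing
            # overlaps with the OS receiving further data.
            yield raw_line.rstrip(b"\r\n")
```

A caller would loop with `for line in iter_lines(url): ...` and do the per-line work inside the loop instead of collecting lines into a list first.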
