Why doesn't the write function write to the file immediately? [duplicate] - python

I'm running a test, and found that the file doesn't actually get written until I control-C to abort the program. Can anyone explain why that would happen?
I expected it to write at the same time, so I could read the file in the middle of the process.
import os
from time import sleep

f = open("log.txt", "a+")
i = 0
while True:
    f.write(str(i))
    f.write("\n")
    i += 1
    sleep(0.1)

Writing to disk is slow, so many programs store up writes into large chunks which they write all-at-once. This is called buffering, and Python does it automatically when you open a file.
When you write to the file, you're actually writing to a "buffer" in memory. When it fills up, Python will automatically write it to disk. You can tell it "write everything in the buffer to disk now" with
f.flush()
This isn't quite the whole story, because the operating system will probably buffer writes as well. You can tell it to write the buffer of the file with
os.fsync(f.fileno())
Finally, you can tell Python not to buffer a particular file by passing a buffering argument of 0, e.g. open("log.txt", "w", 0) (Python 2; in Python 3, unbuffered mode is only allowed for binary files), or to keep a line buffer with open("log.txt", "w", 1). Naturally, this will slow down all operations on that file, because writes are slow.
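Putting that together, here is a minimal sketch of the question's loop that flushes (and optionally fsyncs) after every write, so log.txt can be read while the program is still running:
import os
from time import sleep

with open("log.txt", "a+") as f:
    i = 0
    while True:
        f.write(str(i))
        f.write("\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # optional: ask the OS to push its buffer to disk
        i += 1
        sleep(0.1)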

You need to f.close() to flush the file write buffer out to the file. Or in your case you might just want to do f.flush(); os.fsync(f.fileno()) so you can keep looping with the opened file handle.
Don't forget to import os.

You have to force the write, so I use the following lines to make sure a file is written:
# Two commands together force the OS to store the file buffer to disc
f.flush()
os.fsync(f.fileno())

You will want to check out file.flush() - although take note that this might not write the data to disk, to quote:
Note:
flush() does not necessarily write the file’s data to disk. Use flush() followed by os.fsync() to ensure this behavior.
Closing the file (file.close()) will also ensure that the data is written - using with will do this implicitly, and is generally a better choice for more readability and clarity - not to mention solving other potential problems.
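For illustration, a minimal sketch (not the question's exact code) of how a with block flushes and closes the file for you:
with open("log.txt", "a+") as f:
    for i in range(100):
        f.write("%d\n" % i)
# the file is flushed and closed here, even if an exception was raised inside the block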

This is a Windows-ism. If you add an explicit .close() when you're done with the file, it'll appear in Explorer at that time. Even just flushing it might be enough (I don't have a Windows box handy to test). But basically f.write does not actually write, it just appends to the write buffer - until the buffer gets flushed you won't see it.
On Unix the file will typically show up as a 0-byte file in this situation.

The file handle needs to be flushed:
f.flush()

The file does not get written, as the output buffer is not flushed until garbage collection takes effect and flushes the I/O buffer (more than likely by calling f.close()).
Alternatively, in your loop, you can call f.flush() followed by os.fsync(f.fileno()), as documented here.
f.flush()
os.fsync(f.fileno())
All that being said, if you ever plan on sharing the data in that file with other portions of your code, I would highly recommend using a StringIO object.

Related

How can I make python mmap assignment atomic?

How can I make Python mmap assignment atomic? Nothing is said about atomicity here: https://docs.python.org/3.0/library/mmap.html
import mmap
import struct

huge_list1 = [888 for _ in range(100000000)]
huge_list2 = [9999 for _ in huge_list1]
b1 = struct.pack("100000000I", *huge_list1)
b2 = struct.pack("100000000I", *huge_list2)
f = open('mmp', 'wb')
f.write(b1)
f.close()
f = open('mmp', 'r+b')
m = mmap.mmap(f.fileno(), 0)
m[:] = b2
Immediately, I execute the following code in another process
import mmap
import struct

f = open('mmp', 'rb')
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
mm = m[:]
l = struct.unpack("100000000I", mm)
set(l)
Then I am seeing {888, 9999}
Which means mmap is not atomic. Anyway to make it atomic?
In general, you can't. File writes aren't atomic to begin with, whether done via mmap or write. Some filesystems, such as Tahoe-LAFS, do have a file "put" operation, but even there it's a matter of known completion, not atomic operation (chunks are stored individually). Atomicity of file content updates is usually achieved with one of three methods:
Using the rename call, where you can be sure a name points to either the old or new file (Python's Path.replace might be clearer). This is the method used in e.g. maildir; see the sketch after this list.
Using file locks. These are in general cooperative, meaning all programs that access the file must use the same locking method consistently. Sometimes this is not possible, for instance across some network filesystems. Due to this inconsistency, other lock methods such as lock files are also used - thus the "same method" requirement.
Using smaller accesses that are atomic due to underlying architecture, such as disk sectors. This is done e.g. in SQLite's journal headers. Notably the threshold is different with mmap because the memory page itself may be shared, allowing far finer granularity for atomic accesses (perhaps CPU word size or single byte).
The topic is fairly complex. The key to combining any of these synchronization methods with mmap is mmap.flush.
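As a rough sketch of the rename approach (the helper name atomic_write is illustrative, not from any library): write the new content to a temporary file in the same directory, flush and fsync it, then rename it over the target, so readers see either the old bytes or the new ones, never a mix.
import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file in the same directory, then rename over the target.
    # os.replace swaps the name in one step (atomic on POSIX filesystems).
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, 'wb') as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise

atomic_write('mmp', b2)  # b2 as built in the question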
I don't think it's a mmap problem - I'd bet that it happens because f.close() guarantees just that Python has sent the data to the underlying OS's buffer, but that doesn't mean it has actually been written. Then when you open it again, and give the handle to mmap, you're still operating on the buffer.
You can try syncing the buffer before you close the file to ensure everything has been written:
import os
f = open('mmp', 'wb')
f.write(b1)
f.flush()
os.fsync(f.fileno())
f.close()
Or better, just to let Python handle closing cleanly in case of an error:
with open('mmp', 'wb') as f:
    f.write(b1)
    f.flush()
    os.fsync(f.fileno())
Although even os.fsync() is not a 100% guarantee, from the underlying fsync() man page:
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
But I'd bet that it would fail to do what you need only in very rare edge cases.

read a large file without memory reallocation

When I want to read a binary file in memory in python I just do:
with open("file.bin","rb") as f:
contents = f.read()
With "reasonable" size files, it's perfect, but when the files are huge (say, 1Gb or more), when monitoring the process, we notice that the memory increases then shrinks, then increases, ... probably the effect of realloc behind the scenes, when the original chunk of memory is too small to hold the file.
Done several times, this realloc + memmove operation takes a lot of CPU time. In C, I wouldn't have the problem because I would pass a properly allocated buffer to fread for instance (but here I can't because bytes objects are immutable, so I cannot pre-allocate).
Of course I could read it chunk by chunk like this:
with open("file.bin","rb") as f:
while True:
contents = f.read(CHUNK_SIZE)
if contents:
chunks.append(contents)
else:
break
but then I would have to join the bytes chunks, but that would also take twice the needed memory at some point, and I may not be able to afford it.
Is there a way to read a big binary file into a buffer with a single big memory allocation, and efficiently CPU-wise?
You can use the os.open function, which is basically a wrapper around the POSIX open syscall.
import os

# Note: os.O_BINARY exists only on Windows; on POSIX systems use plain os.O_RDONLY
fd = os.open("file.bin", os.O_RDONLY | os.O_BINARY)
This opens the file in binary read-only mode (the equivalent of "rb").
os.open returns a file descriptor which does not have read methods. You'll have to read n bytes at a time:
data = os.read(fd, 100)
Once done, use os.close to close the file:
os.close(fd)
You're reading a file in Python just like you'd do it in C!
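Pulling those pieces together, a rough sketch of reading a file in fixed-size chunks with these low-level calls (the O_BINARY flag is guarded because it only exists on Windows):
import os

flags = os.O_RDONLY | getattr(os, "O_BINARY", 0)  # O_BINARY is Windows-only
fd = os.open("file.bin", flags)
try:
    chunks = []
    while True:
        data = os.read(fd, 1 << 20)  # read up to 1 MiB at a time
        if not data:                 # empty bytes means end of file
            break
        chunks.append(data)
finally:
    os.close(fd)
contents = b"".join(chunks)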
Here's a couple of useful references:
Official docs
Library Reference
Disclaimer: Based on my knowledge of how C's open function works, I believe this should do the trick.
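If the goal is really a single big allocation, another sketch (an alternative to the answer above, not part of it) is to preallocate a mutable bytearray of the file's size and fill it in place with readinto():
import os

path = "file.bin"
size = os.path.getsize(path)
buf = bytearray(size)              # one allocation, filled in place
with open(path, "rb") as f:
    view = memoryview(buf)
    filled = 0
    while filled < size:
        n = f.readinto(view[filled:])  # reads directly into the buffer, no extra copies
        if n == 0:                     # unexpected end of file
            break
        filled += n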

Can read() and readlines() work together when reading a file in Python?

I'm a Python beginner, and I'm doing some tests of file operations.
I read a file with read() and readlines(). Each of them works perfectly on its own. However, when I call readlines() on the same file after read(), I am surprised to find that readlines() can't read anything.
P.S. I tried switching their order, and the second call still can't read anything from the file.
So, how do the functions actually work?
Below is my code:
filea = open('/Users/gssflyaway/Documents/web/echarts-2.2.7/LICENSE.TXT')
print filea.readlines()
print '-' * 50
print filea.read()
filea.close()
The result in PyCharm:
Files are read from the disk by moving a pointer (like a bookmark so that the file object knows where it left) around. A read operation advances the pointer and if you read the whole file, the pointer will be at the very end of the file. Same applies to both readlines and read. If you want to re-read the file, you can use seek to reset the pointer to the beginning to start a new round.
filea.seek(0)
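A minimal sketch applying this to the code in the question (Python 2, as above):
filea = open('/Users/gssflyaway/Documents/web/echarts-2.2.7/LICENSE.TXT')
print filea.readlines()   # the pointer is now at the end of the file
print '-' * 50
filea.seek(0)             # move the pointer back to the start
print filea.read()        # read() now returns the whole file again
filea.close()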

Python script that writes result to txt file - why the lag?

I'm using Windows 7 and I have a super-simple script that goes over a directory of images, checking a specified condition for each image (in my case, whether there's a face in the image, using dlib), while writing the paths of images that fulfilled the condition to a text file:
def process_dir(dir_path):
    i = 0
    with open(txt_output, 'a') as f:
        for filename in os.listdir(dir_path):
            # loading image to check whether dlib detects a face:
            image_path = os.path.join(dir_path, filename)
            opencv_img = cv2.imread(image_path)
            dets = detector(opencv_img, 1)
            if len(dets) > 0:
                f.write(image_path)
                f.write('\n')
                i = i + 1
                print i
Now the following thing happens: there seems to be a significant lag in appending lines to the file. For example, I can see the script has "finished" checking several images (i.e., the console prints ~20, meaning 20 files that fulfill the condition have been found) but the .txt file is still empty. At first I thought there was a problem with my script, but after waiting a while I saw that they were in fact added to the file, only it seems to be updated in "batches".
This may not seem like the most crucial issue (and it's definitely not), but still I'm wondering - what explains this behavior? As far as I understand, every time the f.write(image_path) line is executed the file is changed - then why do I see the update with a lag?
Data written to a file object won't necessarily show up on disk immediately.
In the interests of efficiency, most operating systems and I/O libraries will buffer the writes, meaning that data is only written out to disk once a certain amount has accumulated (Python's default buffer is a few kilobytes, typically 8 KB).
If you want to write your data right now, use the flush() function, as others have said.
Did you try opening with buffer size 0: open(txt_output, 'a', 0)? (This works for the Python 2 code above; in Python 3 an unbuffered open is only allowed in binary mode.)
I'm not sure about Windows (please, someone correct me here if I'm wrong), but I believe this is because of how the write buffer is handled. Although you are requesting a write, the buffer only writes every so often (when the buffer is full) and when the file is closed. You can open the file with a smaller buffer:
with open(txt_output, 'a', 0) as f:
or manually flush it at the end of the loop:
if len(dets) > 0:
    f.write(image_path)
    f.write('\n')
    f.flush()
    i = i + 1
    print i
I would personally recommend flushing manually when you need to.
It sounds like you're running into file stream buffering.
In short, writing to a file is a very slow process (relative to other sorts of things that the processor does). Modifying the hard disk is about the slowest thing you can do, other than maybe printing to the screen.
Because of this, most file I/O libraries will "buffer" your output, meaning that as you write to the file the library will save your data in an in-memory buffer instead of modifying the hard disk right away. Only when the buffer fills up will it "flush" the buffer (write the data to disk), after which point it starts filling the buffer again. This often reduces the number of actual write operations by quite a lot.
To answer your question, the first thing to ask is: do you really need to append to the file immediately every time you find a face? It will probably slow down your processing by a noticeable amount, especially if you're processing a large number of files.
If you really do need to update immediately, you basically have two options:
Manually flush the write buffer each time you write to the file. In Python, this usually means calling f.flush(), as @JamieCounsell pointed out.
Tell Python to just not use a buffer, or more accurately to use a buffer of size 0. As @VikasMadhusudana pointed out, you can tell Python how big of a buffer to use with a third argument to open(): open(txt_output, 'a', 0) for a 0-byte buffer.
Again, you probably don't need this; the only case I can think that might require this sort of thing is if you have some other external operation that's watching the file and triggers off of new data being added to it.
Hope that helps!
It's flush related, try:
print(image_path, file=f, flush=True)  # Python 3
or
print >> f, image_path  # Python 2 (does not flush on its own)
instead of:
f.write(image_path)
f.write('\n')
In Python 3, print flushes when you pass flush=True; the Python 2 print statement does not flush by itself, so you would still need f.flush() there.
Another good thing about print is that it gives you the newline for free.

How come a file doesn't get written until I stop the program?
