How can I make a Python mmap assignment atomic? Nothing about atomicity is mentioned here: https://docs.python.org/3.0/library/mmap.html
import mmap
import struct

huge_list1 = [888 for _ in range(100000000)]
huge_list2 = [9999 for _ in huge_list1]
b1 = struct.pack("100000000I", *huge_list1)
b2 = struct.pack("100000000I", *huge_list2)
f = open('mmp', 'wb')
f.write(b1)
f.close()
f = open('mmp', 'r+')
m = mmap.mmap(f.fileno(), 0)
m[:] = b2  # overwrite the whole mapping in place
Immediately afterwards, I execute the following code in another process:
import mmap
import struct

f = open('mmp', 'r')
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
mm = m[:]
l = struct.unpack("100000000I", mm)
set(l)
Then I see {888, 9999},
which means the mmap write is not atomic. Is there any way to make it atomic?
In general, you can't. File writes aren't atomic to begin with, whether done via mmap or write. Some filesystems, such as Tahoe-LAFS, do have a file put operation, but even there it's a matter of known completion, not an atomic operation (chunks are stored individually). Atomicity of file content updates is usually achieved with one of three methods:
Using the rename call, where you can be sure a name points to either the old or the new file (Python's Path.replace may be clearer). This is the method used in e.g. maildir; a sketch follows this list.
Using file locks. These are in general cooperative, meaning all programs that access the file must use the same locking method consistently. Sometimes this is not possible, for instance across some network filesystems; due to that inconsistency, other locking methods such as lock files are also used, hence the "same method" requirement. A second sketch below combines a lock with mmap.
Using smaller accesses that are atomic due to underlying architecture, such as disk sectors. This is done e.g. in SQLite's journal headers. Notably the threshold is different with mmap because the memory page itself may be shared, allowing far finer granularity for atomic accesses (perhaps CPU word size or single byte).
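A minimal sketch of the first method (the helper name replace_atomically and the file names are illustrative, and it assumes the whole new payload fits in memory): write the new contents to a temporary file in the same directory, force it to disk, then atomically swap it into place, so readers see either the old contents or the new, never a mix.
import os
import struct
import tempfile

def replace_atomically(path, payload):
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())       # make sure the new contents are on disk
        os.replace(tmp_path, path)       # atomic rename into place
    except BaseException:
        os.unlink(tmp_path)              # clean up the temporary file on failure
        raise

new_bytes = struct.pack("3I", 9999, 9999, 9999)
replace_atomically("mmp", new_bytes)
Note that a process which already has the old file mmapped keeps seeing the old contents until it re-opens and re-maps the file.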
The topic is fairly complex. The key to combining any of these synchronization methods with mmap is mmap.flush.
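And a minimal sketch of the locking approach combined with mmap.flush (POSIX only, since it uses fcntl.flock; the file name and record layout are illustrative, and every cooperating process must take the same lock):
import fcntl
import mmap
import struct

with open("mmp", "r+b") as f:
    fcntl.flock(f.fileno(), fcntl.LOCK_EX)     # block until we own the lock
    try:
        m = mmap.mmap(f.fileno(), 0)
        m[:4] = struct.pack("I", 9999)         # update one record while locked
        m.flush()                              # push the dirty pages to the file
        m.close()
    finally:
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)
A reader would take fcntl.LOCK_SH around its own mmap read in the same way.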
I don't think it's an mmap problem. I'd bet it happens because f.close() only guarantees that Python has handed the data to the underlying OS buffers, which doesn't mean it has actually been written. Then when you open the file again and give the handle to mmap, you're still operating on the buffer.
You can try syncing the buffer before you close the file to ensure everything has been written:
import os
f = open('mmp', 'wb')
f.write(b1)
f.flush()
os.fsync(f.fileno())
f.close()
Or better, just let Python handle closing cleanly even in case of an error:
with open('mmp', 'wb') as f:
    f.write(b1)
    f.flush()
    os.fsync(f.fileno())
Although even os.fsync() is not a 100% guarantee, from the underlying fsync() man page:
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
But I'd bet that it would fail to do what you need only in very rare edge cases.
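For completeness, a hedged sketch of the directory fsync the man page mentions (POSIX only; the directory path is illustrative):
import os

dir_fd = os.open(".", os.O_RDONLY)   # descriptor for the containing directory
try:
    os.fsync(dir_fd)                 # flush the directory entry itself
finally:
    os.close(dir_fd)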
Related
I'm running a test, and found that the file doesn't actually get written until I control-C to abort the program. Can anyone explain why that would happen?
I expected it to write at the same time, so I could read the file in the middle of the process.
import os
from time import sleep
f = open("log.txt", "a+")
i = 0
while True:
    f.write(str(i))
    f.write("\n")
    i += 1
    sleep(0.1)
Writing to disk is slow, so many programs store up writes into large chunks which they write all-at-once. This is called buffering, and Python does it automatically when you open a file.
When you write to the file, you're actually writing to a "buffer" in memory. When it fills up, Python will automatically write it to disk. You can tell it "write everything in the buffer to disk now" with
f.flush()
This isn't quite the whole story, because the operating system will probably buffer writes as well. You can tell it to write the buffer of the file with
os.fsync(f.fileno())
Finally, you can tell Python not to buffer a particular file at all with open(filename, "wb", 0) (unbuffered mode is only allowed for binary files in Python 3), or to use line buffering with open(filename, "w", 1), which flushes after every newline. Naturally, this will slow down all operations on that file, because writes are slow.
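For reference, in Python 3 those buffering knobs look like this (the file names are illustrative):
raw = open("log.bin", "wb", buffering=0)   # unbuffered: every write goes straight to the OS
line = open("log.txt", "w", buffering=1)   # line-buffered text: flushes on every "\n"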
You need f.close() to flush the write buffer out to the file. Or, in your case, you might just want to do f.flush(); os.fsync(f.fileno()) so you can keep looping with the open file handle.
Don't forget to import os.
You have to force the write, so I use the following lines to make sure the file is written:
# These two commands together force the OS to write the file buffer to disk
f.flush()
os.fsync(f.fileno())
You will want to check out file.flush() - although take note that this might not write the data to disk, to quote:
Note:
flush() does not necessarily write the file’s data to disk. Use flush() followed by os.fsync() to ensure this behavior.
Closing the file (file.close()) will also ensure that the data is written - using with will do this implicitly, and is generally a better choice for more readability and clarity - not to mention solving other potential problems.
This is a Windows-ism. If you add an explicit .close() when you're done with the file, it'll appear in Explorer at that time. Even just flushing it might be enough (I don't have a Windows box handy to test). But basically f.write does not actually write; it just appends to the write buffer, and until the buffer gets flushed you won't see anything on disk.
On Unix the file will typically show up as a 0-byte file in this situation.
The file handle needs to be flushed:
f.flush()
The file does not get written because the output buffer is not flushed until garbage collection takes effect and flushes the I/O buffer (most likely via the implicit f.close()).
Alternately, in your loop, you can call f.flush() followed by os.fsync(f.fileno()), as documented here.
f.flush()
os.fsync(f.fileno())
All that being said, if you ever plan on sharing the data in that file with other portions of your code, I would highly recommend using a StringIO object.
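If you go that route, a minimal, hypothetical sketch with io.StringIO (in Python 3 StringIO lives in the io module):
import io

buf = io.StringIO()              # in-memory text buffer, no disk involved
buf.write("some log line\n")
buf.write("another line\n")
print(buf.getvalue())            # consumers see the full contents immediately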
When I want to read a binary file into memory in Python, I just do:
with open("file.bin","rb") as f:
contents = f.read()
With "reasonable" size files, it's perfect, but when the files are huge (say, 1Gb or more), when monitoring the process, we notice that the memory increases then shrinks, then increases, ... probably the effect of realloc behind the scenes, when the original chunk of memory is too small to hold the file.
Done several times, this realloc + memmove operation takes a lot of CPU time. In C, I wouldn't have the problem because I would pass a properly allocated buffer to fread for instance (but here I can't because bytes objects are immutable, so I cannot pre-allocate).
Of course I could read it chunk by chunk like this:
with open("file.bin","rb") as f:
while True:
contents = f.read(CHUNK_SIZE)
if contents:
chunks.append(contents)
else:
break
but then I would have to join the byte chunks, which would take twice the needed memory at some point, and I may not be able to afford that.
Is there a way to read a big binary file into a buffer with a single big memory allocation, and efficiently CPU-wise?
You can use the os.open method, which is basically a wrapper around the POSIX syscall open.
import os

# os.O_BINARY exists only on Windows; on POSIX, files are always "binary"
fd = os.open("file.bin", os.O_RDONLY | getattr(os, "O_BINARY", 0))
This opens the file in rb mode.
os.open returns a file descriptor which does not have read methods. You'll have to read n bytes at a time:
data = os.read(fd, 100)
Once done, use os.close to close the file:
os.close(fd)
You're reading a file in Python just like you'd do it in C!
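Putting those pieces together, here is a minimal sketch (the chunk size and variable names are my own, not part of the answer above) that reads the whole file into one preallocated bytearray, which addresses the single-allocation concern from the question:
import os

CHUNK_SIZE = 1 << 20  # 1 MiB per call; an arbitrary choice

fd = os.open("file.bin", os.O_RDONLY)
try:
    size = os.fstat(fd).st_size
    buf = bytearray(size)              # the single large allocation up front
    pos = 0
    while pos < size:
        chunk = os.read(fd, min(CHUNK_SIZE, size - pos))
        if not chunk:                  # unexpected EOF
            break
        buf[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
finally:
    os.close(fd)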
Here are a couple of useful references:
Official docs
Library Reference
Disclaimer: Based on my knowledge of how C's open function works, I believe this should do the trick.
The use case for this would be creating multiple generators based on some file-object without any of them trampling each other's read state.
Originally I (thought I) had a working implementation using seek() and tell() where each generator was decorated by a meta-generator which maintained the file-handle position. This worked fine on things like StringIO, but failed on real files due to the read-ahead buffer mutilating the offset.
Using readline() or otherwise mocking the real file object isn't viable, since the reason for doing this was the excessively large files that prompted a generator expression in the first place, so losing the read-ahead buffer isn't really a good option. (As an aside, why was Python implemented this way? Shouldn't the buffer act like a cache rather than being exposed to the user? Proper encapsulation would have prevented this tell() issue...)
I then tried to use copy.copy, but that results in something like this: <closed file '<uninitialized file>', mode '<uninitialized file>' at 0x7f722ffda810>. Which appears unusable.
Does there exist an alternative way to copy? Is there a way to initialize a file-object? Or should I give up on this use case entirely because it is not possible in Python?
You are looking for itertools.tee.
from itertools import tee
with open("somefile.txt", "r") as fh:
fh1, fh2, fh3 = tee(fh, 3)
Once you call tee, do not use the parent iterator again. The iterators returned from tee may be used freely and independently, however.
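A minimal usage sketch (somefile.txt is illustrative): the tee'd iterators advance independently of each other.
from itertools import tee

with open("somefile.txt", "r") as fh:
    fh1, fh2 = tee(fh, 2)
    first_seen_by_fh1 = next(fh1)   # advances only fh1
    first_seen_by_fh2 = next(fh2)   # fh2 still starts at the first line
    assert first_seen_by_fh1 == first_seen_by_fh2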
For file objects specifically (to keep file-specific methods like read), you can just open a file multiple times; each file object will maintain its own file pointer as it reads the file.
fh1, fh2, fh3 = [open("somefile.txt") for i in range(3)]
or, if you already have a file object fh:
fh1, fh2, fh3 = [open(fh.name) for i in range(3)]
This doesn't preserve an already advanced file pointer, but it's easy enough to jump ahead:
for x in fh1, fh2, fh3:
    x.seek(fh.tell())
For the following Python code:
fp = open('output.txt', 'wb')
# Very big file, writes a lot of lines, n is a very large number
for i in range(1, n):
    fp.write('something' * n)
fp.close()
The writing process above can last more than 30 min. Sometimes I get the error MemoryError. Is the content of the file before closing stored in memory or written in a temp file? If it's in a temporary file, what is its general location on a Linux OS?
Edit:
Added fp.write in a for loop
It's stored in the operating system's disk cache in memory until it is flushed to disk, either implicitly due to timing or space issues, or explicitly via fp.flush().
There will be write buffering in the Linux kernel, but those buffers are flushed to disk at (ir)regular intervals. Running out of such buffer space should never cause an application-level MemoryError; the buffers should empty before that happens, pausing the application while they do.
Building on ataylor's comment to the question:
You might want to nest your loop. Something like
for i in range(1, n):
    for each in range(n):
        fp.write('something')
fp.close()
That way, the only thing that gets put into memory is the string "something", not "something" * n.
If you are writing out a large file for which the writes might fail, you are better off flushing the file to disk yourself at regular intervals using fp.flush(). That way the data will be in a location of your choosing that you can easily get to, rather than being at the mercy of the OS:
fp = open('output.txt', 'wb')
counter = 0
for line in many_lines:
    fp.write(line)
    counter += 1
    if counter > 999:
        fp.flush()
        counter = 0
fp.close()
This will flush the file to disk every 1000 lines.
If you write line by line, it should not be a problem. You should show the code of what you are doing before the write. For a start, you can try to delete objects where they are no longer necessary, use fp.flush(), etc.
File writing should never give a memory error; in all probability, you have a bug somewhere else.
If you have a loop and a memory error, then I would check whether you are "leaking" references to objects.
Something like:
def do_something(a, b=[]):        # mutable default argument: b is shared across calls
    b.append(a)
    return b

fp = open('output.txt', 'w')
for i in range(1, n):
    something = do_something(i)   # the returned list keeps growing
    fp.write(str(something))
fp.close()
I am just picking an example here; in your actual case the reference leak may be much harder to find. However, this example leaks memory inside do_something because of the way Python handles default parameters of functions.
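For completeness, the usual fix for that pattern, as a sketch rather than part of the original answer: use None as the default so a fresh list is created on every call.
def do_something(a, b=None):
    if b is None:
        b = []          # new list per call instead of one shared between calls
    b.append(a)
    return b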