read a large file without memory reallocation - python

When I want to read a binary file into memory in Python, I just do:
with open("file.bin", "rb") as f:
    contents = f.read()
With "reasonable" size files, it's perfect, but when the files are huge (say, 1Gb or more), when monitoring the process, we notice that the memory increases then shrinks, then increases, ... probably the effect of realloc behind the scenes, when the original chunk of memory is too small to hold the file.
Done several times, this realloc + memmove operation takes a lot of CPU time. In C I wouldn't have the problem, because I would pass a properly allocated buffer to fread, for instance; here I can't, because bytes objects are immutable, so I cannot pre-allocate.
Of course I could read it chunk by chunk like this:
with open("file.bin","rb") as f:
while True:
contents = f.read(CHUNK_SIZE)
if contents:
chunks.append(contents)
else:
break
but then I would have to join the chunks, which would also take twice the needed memory at some point, and I may not be able to afford that.
Is there a method to read a big binary file into a buffer with a single big memory allocation, and efficiently CPU-wise?

You can use the os.open method, which is basically a wrapper around the POSIX syscall open.
import os
fd = os.open("file.bin", os.O_RDONLY | os.O_BINARY)
This opens the file in rb mode. (Note that os.O_BINARY only exists on Windows; on POSIX systems os.O_RDONLY alone is enough.)
os.open returns a file descriptor which does not have read methods. You'll have to read n bytes at a time:
data = os.read(fd, 100)
Once done, use os.close to close the file:
os.close(fd)
You're reading a file in Python just like you'd do it in C!
Here are a couple of useful references:
Official docs
Library Reference
Disclaimer: Based on my knowledge of how C's open function works, I believe this should do the trick.
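Separately from the os.open approach above, if the goal is the single up-front allocation the question asks for, here is a hedged sketch (not part of the answer above): pre-allocate a mutable bytearray of the file's size and fill it in place with readinto(), which plays the role of the buffer you would hand to fread in C.
import os

with open("file.bin", "rb") as f:
    size = os.fstat(f.fileno()).st_size   # size known up front: one allocation
    buf = bytearray(size)                 # mutable, unlike bytes
    view = memoryview(buf)                # lets readinto() fill it piece by piece
    filled = 0
    while filled < size:
        n = f.readinto(view[filled:])     # reads directly into buf, no temporary copy
        if n == 0:                        # unexpected EOF (file shrank?)
            break
        filled += n
If you need an immutable bytes object afterwards, bytes(buf) would still copy once, so staying with the bytearray (or a memoryview of it) is what keeps this to a single allocation.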


How can I make python mmap assignment atomic?

How can I make a Python mmap assignment atomic? Nothing is said about atomicity here: https://docs.python.org/3.0/library/mmap.html
import mmap
import struct

huge_list1 = [888 for _ in range(100000000)]
huge_list2 = [9999 for _ in huge_list1]
b1 = struct.pack("100000000I", *huge_list1)
b2 = struct.pack("100000000I", *huge_list2)

f = open('mmp', 'wb')
f.write(b1)
f.close()

f = open('mmp', 'r+b')
m = mmap.mmap(f.fileno(), 0)
m[:] = b2
Immediately afterwards, I execute the following code in another process:
f = open('mmp', 'rb')
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
mm = m[:]
l = struct.unpack("100000000I", mm)
set(l)
Then I am seeing {888, 9999}
Which means mmap is not atomic. Is there any way to make it atomic?
In general, you can't. File writes aren't atomic to begin with, whether done via mmap or write. Some filesystems, such as Tahoe-LAFS, do have a file "put" operation, but even there it's a matter of known completion, not atomic operation (chunks are stored individually). Atomicity of file-content updates is usually achieved with three methods:
Using the rename call, where you can be sure a name points to either the old or the new file (Python's Path.replace might be clearer). This is the method used in e.g. maildir; a sketch follows after this list.
Using file locks. These are in general cooperative, meaning all programs that access the file must use the same locking method consistently. Sometimes this is not possible, for instance across some network filesystems. Due to this inconsistency, other lock methods such as lock files are also used - thus the "same method" requirement.
Using smaller accesses that are atomic due to underlying architecture, such as disk sectors. This is done e.g. in SQLite's journal headers. Notably the threshold is different with mmap because the memory page itself may be shared, allowing far finer granularity for atomic accesses (perhaps CPU word size or single byte).
The topic is fairly complex. The key to combining any of these synchronization methods with mmap is mmap.flush.
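As a rough sketch of the first method above (rename), and nothing more than that: write the new contents to a temporary file in the same directory, flush and fsync it, then swap it into place with os.replace. A reader that opens and mmaps 'mmp' then sees either the complete old contents or the complete new contents, never a mix.
import os
import tempfile

def atomic_write(path, data):
    # temp file in the same directory, so os.replace stays on one filesystem
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # make sure the bytes are on disk
        os.replace(tmp, path)         # atomically swap the directory entry
    except BaseException:
        os.unlink(tmp)
        raise

atomic_write('mmp', b2)               # b2 as in the question's code
Note that a process which already has the old file mapped keeps seeing the old bytes; it has to reopen and re-mmap to pick up the new version.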
I don't think it's an mmap problem - I'd bet it happens because f.close() only guarantees that Python has handed the data to the underlying OS's buffers, which doesn't mean it has actually been written. Then when you open the file again and give the handle to mmap, you're still operating on those buffers.
You can try syncing the buffer before you close the file to ensure everything has been written:
import os
f = open('mmp', 'wb')
f.write(b1)
f.flush()
os.fsync(f.fileno())
f.close()
Or better, just to let Python handle closing cleanly in case of an error:
with open('mmp', 'wb') as f:
    f.write(b1)
    f.flush()
    os.fsync(f.fileno())
Although even os.fsync() is not a 100% guarantee, from the underlying fsync() man page:
Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.
But I'd bet that it would fail to do what you need only in very rare edge cases.
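For completeness, a small sketch of what that man-page note means in Python (POSIX-only, and rarely needed in practice): fsync the directory containing the file as well, so the directory entry itself reaches the disk.
import os

# flush the directory entry for 'mmp' too (POSIX only)
dir_fd = os.open(os.path.dirname(os.path.abspath('mmp')), os.O_RDONLY)
try:
    os.fsync(dir_fd)
finally:
    os.close(dir_fd)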

Bufferization in GzipFile

Imagine the following simple script:
import gzip

def reader():
    for line in open('logfile.log'):
        # do some stuff here like splitting the line or filtering etc.
        yield some_new_line

def writer(stream):
    with gzip.GzipFile('some_output_file.gz', 'w') as fh:
        for _s in stream:
            fh.write(_s + '\n')

stream = reader()
writer(stream)
So pretty simple - read lines using generators and write some result into a gzip file.
But how to speed it up? The HDD seems to be the bottleneck. I saw I can specify a buffer size for reads, using the open(file, mode, buffering) syntax, but I'm not quite sure it will work in my case (with generators).
Also, I didn't find any buffering parameter for the gzip.GzipFile call. From the code, it's based on some buffered class, but I don't see any further docs on that.
I have a (crazy?) idea to create an explicit cache and replace the open calls with it, so that it reads the file in bigger chunks, say 8 MB, and then splits them into lines. As for writes, I thought of collecting the lines into a list (say 5000 of them) and then dumping them into the file in one go.
Am I trying to re-invent the wheel? I'm not satisfied with the performance the script currently has, so I'm trying to speed it up as much as possible.
UPD: I have around 4-5 different parallel workers running, all performing reads and writes, so I guess the HDD is jumping from one sector to another, and this is the reason why I want to add some buffering and dump the data periodically in big chunks.
Thanks!
I can just propose more compact code:
def reader():
    for line in open('logfile.log'):
        # do some stuff here like splitting the line or filtering etc.
        yield some_new_line

def writer(stream):
    with gzip.GzipFile('some_output_file.gz', 'w') as fh:
        fh.writelines(stream)

writer(reader())
However, there is no actual speed-up here. Python will manage the streams for you, but if you cannot spare the memory to write the whole file at once, the speed-up will not be great.
Compression through gzip is the slowest step. The following function will give you only ~3% speed-up (disregarding the generator's part).
def writer():
    f = open('logfile.log').read()
    gzip.GzipFile('some_output_file.gz', 'w').write(f)

writer()
So, if you need gzip, then you cannot do much.
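That said, two knobs are at least worth trying, as a hedged sketch rather than a guaranteed win: open the input and the underlying output with a large buffer (the 8 MB figure from the question is reused here as an assumption), and trade compression ratio for speed with GzipFile's compresslevel.
import gzip

BUF = 8 * 1024 * 1024   # 8 MB, borrowed from the question's idea

def reader():
    with open('logfile.log', 'r', buffering=BUF) as f:
        for line in f:
            yield line.rstrip('\n')            # stand-in for the real per-line work

def writer(stream):
    raw = open('some_output_file.gz', 'wb', buffering=BUF)
    # compresslevel=1 is much faster than the default 9, at the cost of a bigger file
    with gzip.GzipFile(fileobj=raw, mode='wb', compresslevel=1) as fh:
        for _s in stream:
            fh.write((_s + '\n').encode())
    raw.close()                                # GzipFile does not close fileobj for us

writer(reader())
Whether the bigger buffers help depends on how the workers contend for the disk; the compresslevel change is usually the more noticeable one.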

Process data, much larger than physical memory, in chunks

I need to process some data that is a few hundred times bigger than RAM. I would like to read in a large chunk, process it, save the result, free the memory, and repeat. Is there a way to make this efficient in Python?
The general key is that you want to process the file iteratively.
If you're just dealing with a text file, this is trivial: for line in f: only reads in one line at a time. (Actually it buffers things up, but the buffers are small enough that you don't have to worry about it.)
If you're dealing with some other specific file type, like a numpy binary file, a CSV file, an XML document, etc., there are generally similar special-purpose solutions, but nobody can describe them to you unless you tell us what kind of data you have.
But what if you have a general binary file?
First, the read method takes an optional max bytes to read. So, instead of this:
data = f.read()
process(data)
You can do this:
while True:
    data = f.read(8192)
    if not data:
        break
    process(data)
You may want to instead write a function like this:
def chunks(f):
    while True:
        data = f.read(8192)
        if not data:
            break
        yield data
Then you can just do this:
for chunk in chunks(f):
    process(chunk)
You could also do this with the two-argument iter (together with functools.partial), but many people find that a bit obscure:
from functools import partial

for chunk in iter(partial(f.read, 8192), b''):
    process(chunk)
Either way, this option applies to all of the other variants below (except for a single mmap, which is trivial enough that there's no point).
There's nothing magic about the number 8192 there. You generally do want a power of 2, and ideally a multiple of your system's page size. Beyond that, your performance won't vary that much whether you're using 4KB or 4MB—and if it does, you'll have to test what works best for your use case.
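If you are curious what those numbers are on your machine, both are exposed directly (just a quick check, nothing the code above depends on):
import io
import mmap

print(io.DEFAULT_BUFFER_SIZE)   # the chunk size Python's own buffered I/O uses
print(mmap.PAGESIZE)            # the system page size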
Anyway, this assumes you can just process each 8K at a time without keeping around any context. If you're, e.g., feeding data into a progressive decoder or hasher or something, that's perfect.
But if you need to process one "chunk" at a time, your chunks could end up straddling an 8K boundary. How do you deal with that?
It depends on how your chunks are delimited in the file, but the basic idea is pretty simple. For example, let's say you use NUL bytes as a separator (not very likely, but easy to show as a toy example).
data = b''
while True:
    buf = f.read(8192)
    if not buf:
        process(data)
        break
    data += buf
    chunks = data.split(b'\0')
    for chunk in chunks[:-1]:
        process(chunk)
    data = chunks[-1]
This kind of code is very common in networking (because sockets can't just "read all", so you always have to read into a buffer and chunk into messages), so you may find some useful examples in networking code that uses a protocol similar to your file format.
Alternatively, you can use mmap.
If your virtual memory size is larger than the file, this is trivial:
with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    process(m)
Now m acts like a giant bytes object, just as if you'd called read() to read the whole thing into memory—but the OS will automatically page bits in and out of memory as necessary.
If you're trying to read a file too big to fit into your virtual memory size (e.g., a 4GB file with 32-bit Python, or a 20EB file with 64-bit Python—which is only likely to happen in 2013 if you're reading a sparse or virtual file like, say, the VM file for another process on linux), you have to implement windowing—mmap in a piece of the file at a time. For example:
windowsize = 8*1024*1024
size = os.fstat(f.fileno()).st_size
for start in range(0, size, windowsize):
    # the last window may be shorter than windowsize
    with mmap.mmap(f.fileno(), min(windowsize, size - start),
                   access=mmap.ACCESS_READ, offset=start) as m:
        process(m)
Of course mapping windows has the same issue as reading chunks if you need to delimit things, and you can solve it the same way.
But, as an optimization, instead of buffering you can just slide the window forward to the page containing the end of the last complete message, rather than a fixed 8MB at a time, and then you can avoid any copying. This is a bit more complicated, so if you want to do it, search for something like "sliding mmap window", and write a new question if you get stuck.
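If you do go down that road, here is a rough sketch of the idea, assuming NUL-delimited records (as in the toy example above) that are always shorter than the window; the alignment below is there because offsets passed to mmap must be multiples of mmap.ALLOCATIONGRANULARITY.
import mmap
import os

def sliding_records(f, sep=b'\0', window=8 * 1024 * 1024):
    gran = mmap.ALLOCATIONGRANULARITY
    size = os.fstat(f.fileno()).st_size
    pos = 0                                    # absolute offset of the next unread byte
    while pos < size:
        offset = (pos // gran) * gran          # align the mapping's start
        length = min(window, size - offset)
        with mmap.mmap(f.fileno(), length, offset=offset,
                       access=mmap.ACCESS_READ) as m:
            skip = pos - offset                # part of this window already consumed
            end = m.rfind(sep)                 # last record boundary in this window
            if end < skip:                     # no new separator here
                if offset + length >= size:    # end of file: emit the trailing piece
                    yield m[skip:]
                    return
                raise ValueError("record longer than the window")
            for record in m[skip:end].split(sep):
                yield record
            pos = offset + end + 1             # continue just after that separator

# usage: for record in sliding_records(f): process(record)
Each record is still copied out of the mapping once by the slice, but nothing is buffered across windows and, apart from a little alignment overlap, no byte is mapped twice.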

How come a file doesn't get written until I stop the program?

I'm running a test, and found that the file doesn't actually get written until I control-C to abort the program. Can anyone explain why that would happen?
I expected it to write at the same time, so I could read the file in the middle of the process.
import os
from time import sleep

f = open("log.txt", "a+")
i = 0
while True:
    f.write(str(i))
    f.write("\n")
    i += 1
    sleep(0.1)
Writing to disk is slow, so many programs store up writes into large chunks which they write all-at-once. This is called buffering, and Python does it automatically when you open a file.
When you write to the file, you're actually writing to a "buffer" in memory. When it fills up, Python will automatically write it to disk. You can tell it "write everything in the buffer to disk now" with
f.flush()
This isn't quite the whole story, because the operating system will probably buffer writes as well. You can tell it to write the buffer of the file with
os.fsync(f.fileno())
Finally, you can tell Python not to buffer a particular file by passing buffering=0 when opening it (in Python 3 this is only allowed in binary mode), or to keep only a line-sized buffer with buffering=1. Naturally, this will slow down all operations on that file, because writes are slow.
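For instance (the file names are just illustrative; in Python 3 the unbuffered variant is only accepted for binary files):
f_line = open("log.txt", "a+", buffering=1)    # line-buffered text: handed to the OS at each "\n"
f_raw = open("log.bin", "ab", buffering=0)     # fully unbuffered: binary mode only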
You need to f.close() to flush the write buffer out to the file. Or, in your case, you might just want to do f.flush(); os.fsync(f.fileno()) so you can keep looping with the opened file handle.
Don't forget to import os.
You have to force the write, so I use the following lines to make sure the file is written:
# Two commands together force the OS to store the file buffer to disc
f.flush()
os.fsync(f.fileno())
You will want to check out file.flush() - although take note that this might not write the data to disk, to quote:
Note:
flush() does not necessarily write the file’s data to disk. Use flush() followed by os.fsync() to ensure this behavior.
Closing the file (file.close()) will also ensure that the data is written - using with will do this implicitly, and is generally a better choice for more readability and clarity - not to mention solving other potential problems.
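For instance, a version of the question's loop that keeps the with style and makes each line visible right away (flushing every iteration, which of course costs speed):
from time import sleep

with open("log.txt", "a+") as f:       # closed, and therefore flushed, automatically
    for i in range(100):               # bounded here only so the example terminates
        f.write(str(i) + "\n")
        f.flush()                      # push this line to the OS immediately
        sleep(0.1)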
This is a Windows-ism. If you add an explicit .close() when you're done with the file, it'll appear in Explorer at that time. Even just flushing it might be enough (I don't have a Windows box handy to test). But basically f.write does not actually write, it just appends to the write buffer - until the buffer gets flushed you won't see the data.
On Unix the file will typically show up as a 0-byte file in this situation.
The file handle needs to be flushed:
f.flush()
The file does not get written because the output buffer is not flushed until garbage collection takes effect and flushes the I/O buffer (most likely by calling f.close() for you).
Alternatively, in your loop, you can call f.flush() followed by os.fsync(f.fileno()), as documented here.
f.flush()
os.fsync(f.fileno())
All that being said, if you ever plan on sharing the data in that file with other portions of your code, I would highly recommend using a StringIO object.
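A minimal illustration of that last suggestion, in case it helps (io.StringIO for text, io.BytesIO for bytes):
import io

buf = io.StringIO()
for i in range(10):
    buf.write(str(i) + "\n")

# other parts of the program can read it back without touching the disk
print(buf.getvalue())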
