Python's mmap() performance degrades over time

I am wondering why Python's mmap() performance goes down over time. I have a little app which makes changes to N files; if the set is big (not really that big, say 1000 files), the first 200 go at demon speed, but after that it gets slower and slower. It looks like I should free memory once in a while, but I don't know how, and most importantly I don't understand why Python doesn't do this automagically.
Any help?
-- edit --
It's something like this:
import os
from mmap import mmap

def function(filename, N):
    fd = open(filename, 'rb+')
    size = os.path.getsize(filename)
    mapped = mmap(fd.fileno(), size)
    for i in range(N):
        some_operations_on_mmapped_block()
    mapped.close()

Your OS caches the mmap'd pages in RAM. Reads and writes go at RAM speed from the cache, and dirty pages are eventually flushed. On Linux, performance will be great until you have to start flushing pages; this is controlled by the vm.dirty_ratio sysctl variable. Once you start flushing dirty pages to disk, the reads compete with the writes on your busy IO bus/device. Another thing to consider is simply whether your OS has enough RAM to cache all the files (the buffers counter in top's output). So I would watch the output of "vmstat 1" while your program runs and watch the cache / buff counters go up until suddenly you start doing IO.
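If the writeback spikes are the problem, one option is to flush dirty pages yourself at a steady interval instead of letting them pile up until the kernel forces a big writeback. A minimal sketch, assuming a flush interval you would tune for your workload:
import mmap
import os

FLUSH_EVERY = 50  # assumed interval; tune for your workload

def process(filename, N):
    fd = open(filename, 'rb+')
    mapped = mmap.mmap(fd.fileno(), os.path.getsize(filename))
    try:
        for i in range(N):
            # ... do the per-iteration work on the mapped block here ...
            if i % FLUSH_EVERY == 0:
                mapped.flush()  # write dirty pages back now, in small batches
    finally:
        mapped.close()
        fd.close()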

Related

NVMe Throughput Testing with Python

Currently I need to do some throughput testing. My hardware setup is that I have a Samsung 950 Pro connected to an NVMe controller that is hooked to the motherboard via a PCIe port. There is a Linux nvme device corresponding to the SSD, which I have mounted at a location on the filesystem.
My hope was to use Python to do this. I was planning on opening a file on the filesystem where the SSD is mounted, recording the time, writing a stream of n bytes to the file, recording the time again, then closing the file using os module file operation utilities. Here is the function to gauge write throughput:
import os
import time

def perform_timed_write(num_bytes, blocksize, fd):
    """
    This function writes to a file and records the time.

    The function has three steps: write, record the elapsed time, and
    calculate the transfer rate.

    Parameters
    ----------
    num_bytes: int
        total number of bytes to write to the file
    blocksize: int
        size of each individual write
    fd: string
        path on the filesystem to write to

    Returns
    -------
    bytes_per_second: float
        rate of transfer
    """
    # generate a random byte string of one block
    random_byte_string = os.urandom(blocksize)
    # open the file
    write_file = os.open(fd, os.O_CREAT | os.O_WRONLY | os.O_NONBLOCK)
    # record time, write, record time again
    bytes_written = 0
    before_write = time.clock()
    while bytes_written < num_bytes:
        os.write(write_file, random_byte_string)
        bytes_written += blocksize
    after_write = time.clock()
    # close the file
    os.close(write_file)
    # calculate elapsed time
    elapsed_time = after_write - before_write
    # calculate bytes per second
    bytes_per_second = num_bytes / elapsed_time
    return bytes_per_second
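A hypothetical invocation (the path and sizes here are assumptions, not from the original post) might look like:
rate = perform_timed_write(num_bytes=1024 ** 3, blocksize=4096, fd='/fsmnt/fs1/testfile')
print('%.1f MB/s' % (rate / 1e6))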
My other method of testing is to use the Linux fio utility.
https://linux.die.net/man/1/fio
After mounting the SSD at /fsmnt/fs1, I used this job file to test the throughput:
;Write to 1 file on partition
[global]
ioengine=libaio
buffered=0
rw=write
bs=4k
size=1g
openfiles=1
[file1]
directory=/fsmnt/fs1
I noticed that the write speed returned from the Python function is significantly higher than that reported by fio. Because Python is so high-level, there is a lot of control you give up, and I am wondering whether Python is doing something under the hood that inflates its speeds. Does anyone know why Python would report write speeds so much higher than those generated by fio?
The reason your Python program does better than your fio job is that this is not a fair comparison and they are testing different things:
You banned fio from using Linux's buffer cache (by using buffered=0 which is the same as saying direct=1) by telling it to do O_DIRECT operations. With the job you specified, fio will have to send down a single 4k write and then wait for that write to complete at the device (and that acknowledgement has to get all the way back to fio) before it can send the next.
Your Python script is allowed to send down writes that can be buffered at multiple levels (e.g. within userspace by the C library and then again in the buffer cache of the kernel) before touching your SSD. This generally means the writes will be accumulated and merged together before being sent down to the lower level, resulting in chunkier I/Os that have less overhead. Further, since you don't do any explicit flushing, in theory no I/O has to be sent to the disk before your program exits (in practice this will depend on a number of factors, such as how much I/O you do, the amount of RAM Linux can set aside for buffers, the maximum time the filesystem will hold dirty data, how long you do the I/O for, etc.)! Your os.close(write_file) will just be turned into an fclose() which says this in its Linux man page:
Note that fclose() flushes only the user-space buffers provided by the C library. To ensure that the data is physically stored on disk the kernel buffers must be flushed too, for example, with sync(2) or fsync(2).
In fact, you take your final time before calling os.close(), so you may even be omitting the time it took for the final "batches" of data to be handed to the kernel, let alone reach the SSD!
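If the goal is to time what actually reaches the device from Python, a minimal sketch (assuming os.fsync is acceptable for the test; the names are illustrative) forces the kernel buffers out before taking the final timestamp:
import os
import time

def timed_write_with_fsync(path, num_bytes, blocksize):
    buf = os.urandom(blocksize)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY)
    written = 0
    start = time.time()
    while written < num_bytes:
        written += os.write(fd, buf)
    os.fsync(fd)  # wait for the kernel to push the data to the device
    elapsed = time.time() - start
    os.close(fd)
    return num_bytes / elapsed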
Your Python script is closer to this fio job:
[global]
ioengine=psync
rw=write
bs=4k
size=1g
[file1]
filename=/fsmnt/fio.tmp
Even with this, fio is still at a disadvantage because your Python program has userspace buffering (so bs=8k may be closer).
The key takeaway is that your Python program is not really testing your SSD's speed at your specified block size, and your original fio job is a bit weird and heavily restricted (the libaio ioengine is asynchronous, but with a depth of 1 you're not going to be able to benefit from that, and that's before we get to the behaviour of Linux AIO when using filesystems), so it does different things to your Python program. If you're not doing significantly more buffered I/O than the size of the largest buffer (and on Linux the kernel's buffer size scales with RAM), and the buffered I/Os are small, the exercise turns into a demonstration of the effectiveness of buffering.
If you need the exact performance of the NVMe device, fio is the best choice. FIO can write test data to the device directly, without any file system. Here is an example:
[global]
ioengine=libaio
invalidate=1
iodepth=32
time_based
direct=1
filename=/dev/nvme0n1
[write-nvme]
stonewall
bs=128K
rw=write
numjobs=1
runtime=10000
SPDK is another choice. There is an existing example of a performance test at https://github.com/spdk/spdk/tree/master/examples/nvme/perf.
Pynvme, which is based on SPDK, is a Python extension. You can write performance tests with its ioworker().

How can I force Python code to read input files again without rebooting my computer

I am scanning through a large number of files looking for some markers. I am becoming fairly confident that once I have run through the code one time, Python is not rereading the actual files from disk. I find this behavior strange because I was told that one reason I needed to structure my file access the way I have is so that the handle and file contents would be flushed. But that can't be what is happening.
There are 9,568 file paths in the list I am reading from. If I shut down Python and reboot my computer it takes roughly 6 minutes to read the files and determine if there is anything returned from the regular expression.
However, if I run the code a second time it takes about 36 seconds. Just for grins, the average document has 53,000 words.
Therefore I am concluding that Python still has access to the file it read in the first iteration.
I want to also observe that the first time I do this I can hear the disk spin (E:\ - Python is on C:). E is just a spinning disk with 126 MB cache - I don't think the cache is big enough to hold the contents of these files. When I do it later I do not hear the disk spin.
Here is the code:
import re

test_7A_re = re.compile(r'\n\s*ITEM\s*7\(*a\)*[.]*\s*-*\s*QUANT.*\n', re.IGNORECASE)
no7a = []
for path in path_list:
    path = path.strip()
    with open(path, 'r') as fh:
        string = fh.read()
        items = [item for item in re.finditer(test_7A_re, string)]
        if len(items) == 0:
            no7a.append(path)
            continue
I care about this for a number of reasons. One is that I was thinking about using multiprocessing, but if the bottleneck is reading in the files I don't see that I will gain much. I also think this is a problem because I would be worried about a file being modified and not having the most recent version of it available.
I am tagging this 2.7 because I have no idea if this behavior is persistent across versions.
To confirm this behavior I modified my code to run as a .py file, and added some timing code. I then rebooted my computer - the first time it ran it took 5.6 minutes and the second time (without rebooting) the time was 36 seconds. Output is the same in both cases.
The really interesting thing is that even if I shut down IDLE (but do not reboot my computer), it still takes 36 seconds to run the code.
All of this suggests to me that the files are not read from disk after the first time - this is amazing behavior to me but it seems dangerous.
To be clear, the results are the same each time. Given the timing tests I have run and the fact that I do not hear the disk spinning, I believe the files are somehow still accessible to Python without going back to disk.
This is caused by caching in Windows. It is not related to Python.
In order to stop Windows from caching your reads:
Disable paging file in Windows and fill the RAM up to 90%
Use some tool to disable file caching in Windows like this one.
Run your code on a Linux VM on your Windows machine that has limited RAM. In Linux you can control the caching much better
Make the files much bigger, so that they won't fit in cache
I fail to see why this is a problem. I'm not 100% certain of how Windows handles file cache invalidation, but unless the "Last modified time" changes, you and I and Windows would assume that the file still holds the same content. If the file holds the same content, I don't see why reading from the cache is a problem.
I'm pretty sure that if you change the last modified date, say, by opening the file for write access and then closing it right away, Windows will have sufficient doubt about the file content to invalidate the cache.
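To see the cache effect directly, a minimal timing sketch (the path below is hypothetical) compares a cold first read with an immediate second read of the same file:
import time

def timed_read(path):
    start = time.time()
    with open(path, 'rb') as fh:
        fh.read()
    return time.time() - start

path = r'E:\docs\example.txt'    # hypothetical file on the spinning disk
cold = timed_read(path)          # first read after a reboot: served from the disk
warm = timed_read(path)          # second read: usually served from the OS file cache
print(cold, warm)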

Does python make a copy of opened files in memory?

So I would like to search for filenames with os.walk() and write the resulting list of names to a file. I would like to know which is more efficient: opening the file and writing each result as I find it, or storing everything in a list and then writing the whole list. That list could be big, so I wonder whether the second solution would work.
See this example:
import os
fil = open('/tmp/stuff', 'w')
fil.write('aaa')
os.system('cat /tmp/stuff')
You may expect to see aaa, but instead you get nothing. This is because Python has an internal buffer. Writing to disk is expensive, as it has to:
Tell the OS to write it.
Actually transfer the data to the disk (on a hard disk it may involve spinning it up, waiting for IO time, etc.).
Wait for the OS to report success on the writing.
If you write many small things, this can add up to quite some time. Instead, what Python does is keep a buffer and only actually write from time to time. You don't have to worry about memory growth, as the buffer is kept at a low value. From the docs:
"0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used."
When you are done, make sure you call fil.close(); you can also call fil.flush() at any point during execution, or open the file with buffering=0 to disable buffering.
Another thing to consider is what happens if, for some reason, the program exits in the middle of the process. If you store everything in memory, it will be lost. What you have on disk will remain there (but unless you flush, there is no guarantee of how much was actually saved).
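For the original question, this means you can simply write each name as you find it and let the buffer batch the actual disk writes. A minimal sketch (the root directory and output path are assumptions):
import os

with open('/tmp/filenames.txt', 'w') as out:
    for dirpath, dirnames, filenames in os.walk('/some/root'):
        for name in filenames:
            out.write(os.path.join(dirpath, name) + '\n')
    # leaving the with-block closes the file, which flushes whatever is still buffered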

Use python, should i cache large data in array and write to file in once?

I have a gevent-powered crawler that downloads pages all the time. The crawler adopts a producer-consumer pattern; I feed the queue with data like this: {method:get, url:xxxx, other_info:yyyy}.
Now I want to assemble some responses into files. The problem is that I can't just open the file and write whenever a request ends; that would be costly in I/O, and the data would not be in the correct order.
I assume maybe I should number all requests, cache the responses in order, and open a greenlet to loop and assemble the files. The pseudo code might look like this:
max_chunk = 1000
data = []

def wait_and_assemble_file():  # a loop
    while True:
        if len(data) == 28:
            f = open('test.txt', 'a')
            for d in data:
                f.write(d)
            f.close()
        gevent.sleep(0)

def after_request(response, index):  # Execute after every request ends
    data[index] = response  # every response is about 5-25k
Is there a better solution? There are thousands of concurrent requests, and I worry that memory use may grow too fast, that too many loops may run at once, or that something else unexpected may happen.
Update:
The code above just demonstrates how the data caching and file writing are done. In the practical situation, there may be a hundred loops running, waiting for the caching to complete and writing to different files.
Update 2:
@IT Ninja suggested using a queue system, so I wrote an alternative using Redis:
def after_request(response, session_id, total_block_count, index):  # Execute after every request ends
    redis.lpush(session_id, msgpack.packb({'index': index, 'content': response}))  # save data to redis
    redis.incr(session_id + ':count')
    if int(redis.get(session_id + ':count')) == total_block_count:  # all data blocks are prepared
        save(session_id)

def save(session_name):
    texts = redis.lrange(session_name, 0, -1)
    redis.delete(session_name)
    redis.delete(session_name + ':count')
    data_array = [None] * len(texts)
    for t in texts:
        _d = msgpack.unpackb(t)
        index = _d['index']
        content = _d['content']
        data_array[index] = content
    r = open(session_name + '.txt', 'w')
    for i in data_array:
        r.write(i)
    r.close()
It looks a bit better, but I doubt whether saving large data in Redis is a good idea. I hope for more suggestions!
Something like this may be better handled with a queue system, instead of each thread having its own file handle. This is because you may run into race conditions when writing the file if each thread has its own handle.
As far as resources go, this should not consume much beyond your disk writes, assuming the information being passed to the file is not extremely large (Python is really good about this). If this does pose a problem, though, reading the file into memory in chunks (and writing in proportional chunks) can greatly reduce it, as long as that is available as an option for file uploads.
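A minimal sketch of that single-writer idea (names and the block count are illustrative, not from the original post): worker greenlets push (index, response) pairs onto a gevent queue, and one writer greenlet owns the file handle and writes the blocks out in order.
import gevent
from gevent.queue import Queue

write_queue = Queue()

def writer(path, total_blocks):
    pending = {}
    next_index = 0
    with open(path, 'w') as out:
        while next_index < total_blocks:
            index, content = write_queue.get()   # blocks until a worker pushes data
            pending[index] = content
            while next_index in pending:         # write blocks in order as they arrive
                out.write(pending.pop(next_index))
                next_index += 1

def after_request(response, index):  # called by each worker when its request ends
    write_queue.put((index, response))

writer_greenlet = gevent.spawn(writer, 'test.txt', 28)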
It depends on the size of the data. If it is very big, keeping the whole structure in memory can slow down the program.
If memory is not a problem, you should keep the structure in memory instead of reading from a file all the time. Opening a file again and again with concurrent requests is not a good solution.

Abort a slow flush to disk after write?

Is there a way to abort a python write operation in such a way that the OS doesn't feel it's necessary to flush the unwritten data to the disc?
I'm writing data to a USB device, typically many megabytes. I'm using 4096 bytes as my block size on the write, but it appears that Linux caches up a bunch of data early on and writes it out to the USB device slowly. If at some point during the write my user decides to cancel, I want the app to stop writing immediately. I can see that there's a delay between when the data stops flowing from the application and when the USB activity light stops blinking: several seconds, up to about 10 seconds typically. I find that the app is stuck in the close() method, I'm assuming waiting for the OS to finish writing the buffered data. I call flush() after every write, but that doesn't appear to have any impact on the delay. I've scoured the Python docs for an answer but have found nothing.
It's somewhat filesystem dependent, but in some filesystems, if you delete a file before (all of) it is allocated, the IO to write the blocks will never happen. This might also be true if you truncate it so that the part which is still being written is chopped off.
Not sure that you can really abort a write if you want to still access the data. Also the kinds of filesystems that support this (e.g. xfs, ext4) are not normally used on USB sticks.
If you want to flush data to the disc, use fdatasync(). Merely flushing your IO library's buffer into the OS one will not achieve any physical flushing.
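Following the fdatasync() point above, one way to bound how long a cancel can take is to keep the kernel's dirty backlog small by syncing periodically while writing. A minimal sketch (the path and interval are assumptions, not from the original post):
import os

SYNC_EVERY = 64  # assumed: sync after this many blocks

def copy_to_usb(blocks, path='/mnt/usb/output.bin'):
    with open(path, 'wb') as f:
        for i, block in enumerate(blocks):
            f.write(block)
            if i % SYNC_EVERY == 0:
                f.flush()                 # userspace buffer -> kernel
                os.fdatasync(f.fileno())  # kernel page cache -> device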
Assuming I am understanding this correctly, you want to be able to 'abort' and NOT flush the data. This IS possible using ctypes and a little pokery. It is very OS dependent, so I'll give you the OSX version and then explain what you can change for Linux:
f = open('flibble1.txt', 'w')
f.write("hello world")
import ctypes
x = ctypes.cdll.LoadLibrary('/usr/lib/libc.dylib')
x.close(f.fileno())
try:
    del f
except IOError:
    pass
If you change /usr/lib/libc.dylib to the libc.so.6 in /usr/lib for Linux then you should be good to go. Basically by calling close() instead of fclose(), no call to fsync() is done and nothing is flushed.
Hope that's useful.
When you abort the write operation, try doing file.truncate(0) before closing it.
