Currently I need to do some throughput testing. My hardware setup is a Samsung 950 Pro connected to an NVMe controller hooked to the motherboard via a PCIe port. Linux exposes an nvme device for the drive, which I have mounted at a location on the filesystem.
My hope was to use Python for this. I was planning on opening a file on the filesystem where the SSD is mounted, recording the time, writing an n-byte stream to the file, recording the time again, and then closing the file, using the os module's file-operation utilities. Here is the function I use to gauge write throughput.
import os
import time

def perform_timed_write(num_bytes, blocksize, fd):
    """
    Write num_bytes to a file in blocksize chunks and record the time taken.

    The function has three steps: write the data, record the elapsed time,
    and calculate the transfer rate.

    Parameters
    ----------
    num_bytes: int
        total number of bytes to write to the file
    blocksize: int
        size in bytes of each individual write
    fd: string
        location on the filesystem to write to

    Returns
    -------
    bytes_per_second: float
        rate of transfer
    """
    # generate a random block to write repeatedly
    random_byte_string = os.urandom(blocksize)
    # open the file
    write_file = os.open(fd, os.O_CREAT | os.O_WRONLY | os.O_NONBLOCK)
    # record time, write, record time again
    bytes_written = 0
    before_write = time.clock()
    while bytes_written < num_bytes:
        os.write(write_file, random_byte_string)
        bytes_written += blocksize
    after_write = time.clock()
    # close the file
    os.close(write_file)
    # calculate elapsed time
    elapsed_time = after_write - before_write
    # calculate bytes per second
    bytes_per_second = num_bytes / elapsed_time
    return bytes_per_second
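For reference, a call matching the fio job below (bs=4k, size=1g) would look something like this; the test file path is just an example:

rate = perform_timed_write(1024**3, 4096, '/fsmnt/fs1/testfile')  # 1 GiB in 4 KiB writes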
My other method of testing is to use the Linux fio utility.
https://linux.die.net/man/1/fio
After mounting the SSD at /fsmnt/fs1, I used this jobfile to test the throughput
;Write to 1 file on partition
[global]
ioengine=libaio
buffered=0
rw=write
bs=4k
size=1g
openfiles=1
[file1]
directory=/fsmnt/fs1
I noticed that the write speed returned by the Python function is significantly higher than that reported by fio. Because Python is so high-level, you give up a lot of control, so I am wondering if Python is doing something under the hood to inflate its speeds. Does anyone know why Python would report write speeds so much higher than those from fio?
The reason your Python program does better than your fio job is that this is not a fair comparison: they are testing different things.
You banned fio from using Linux's buffer cache (buffered=0 is the same as saying direct=1) by telling it to do O_DIRECT operations. With the job you specified, fio has to send down a single 4k write and then wait for that write to complete at the device (and the acknowledgement has to get all the way back to fio) before it can send the next one.
Your Python script is allowed to send down writes that can be buffered before touching your SSD; since os.write() issues the write(2) syscall directly, the buffering here happens chiefly in the kernel's buffer cache. This generally means the writes will be accumulated and merged together before being sent down to the lower level, resulting in chunkier I/Os that have less overhead. Further, since you don't do any explicit flushing, in theory no I/O has to be sent to the disk before your program exits (in practice this will depend on a number of factors, like how much I/O you do, how much RAM Linux can set aside for buffers, the maximum time the filesystem will hold dirty data, how long you do the I/O for, etc.)! Your os.close(write_file) just closes the file descriptor; it does not flush the kernel's buffers either. The Linux man page for the related C library call fclose() makes the point explicitly:
Note that fclose() flushes only the user-space buffers provided by the C library. To ensure that the data is physically stored on disk the kernel buffers must be flushed too, for example, with sync(2) or fsync(2).
In fact, you take your final time before calling os.close(), so you may even be omitting the time it took for the final "batches" of data to reach even the kernel, let alone the SSD!
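To charge the flush cost to the measurement, the script would need an explicit fsync(2) before taking the final timestamp. A minimal sketch of that idea (the function is mine, not from the question, and uses wall-clock time):

import os
import time

def timed_write_with_fsync(path, num_bytes, blocksize):
    buf = os.urandom(blocksize)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC)
    written = 0
    start = time.time()
    while written < num_bytes:
        written += os.write(fd, buf)  # count what write(2) actually accepted
    os.fsync(fd)  # force dirty pages to the device before stopping the clock
    elapsed = time.time() - start
    os.close(fd)
    return num_bytes / elapsed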
Your Python script is closer to this fio job:
[global]
ioengine=psync
rw=write
bs=4k
size=1g
[file1]
filename=/fsmnt/fio.tmp
Even with this, fio might still be at a slight disadvantage if your Python program gets extra userspace buffering in front of the kernel (in which case bs=8k may be closer).
The key takeaway is that your Python program is not really testing your SSD's speed at your specified block size, and your original fio job is a bit odd, heavily restricted (the libaio ioengine is asynchronous, but with a depth of 1 you're not going to be able to benefit from that, and that's before we get to the behaviour of Linux AIO when used with filesystems), and does different things to your Python program. If you're not doing significantly more buffered I/O than the size of the largest buffer (and on Linux the kernel's buffer cache scales with RAM), and if the buffered I/Os are small, the exercise turns into a demonstration of the effectiveness of buffering.
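For what it's worth, the Python side can also be made to match the O_DIRECT behaviour of the original fio job. This is a rough, Linux-only sketch under stated assumptions: O_DIRECT needs an aligned buffer (an anonymous mmap is page-aligned), num_bytes must be a multiple of blocksize, and blocksize must suit the device's logical block size:

import mmap
import os
import time

def timed_direct_write(path, num_bytes, blocksize):
    buf = mmap.mmap(-1, blocksize)  # anonymous maps are page-aligned
    buf.write(os.urandom(blocksize))
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC | os.O_DIRECT)
    written = 0
    start = time.time()
    while written < num_bytes:
        written += os.write(fd, buf)  # each write must complete at the device
    elapsed = time.time() - start
    os.close(fd)
    return num_bytes / elapsed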
If you need the exact performance of the NVMe device, fio is the best choice. fio can write test data to the device directly, without any filesystem. Here is an example:
[global]
ioengine=libaio
invalidate=1
iodepth=32
time_based
direct=1
filename=/dev/nvme0n1
[write-nvme]
stonewall
bs=128K
rw=write
numjobs=1
runtime=10000
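Note that a write job aimed at /dev/nvme0n1 bypasses the filesystem and overwrites the raw device, destroying any data and filesystem on it, so only run it against a disk you can wipe.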
SPDK is another choice. There is an existing performance test example at https://github.com/spdk/spdk/tree/master/examples/nvme/perf.
Pynvme, which is based on SPDK, is a Python extension. You can write performance tests with its ioworker().
Related
Is there a way to capture and write very fast serial data to a file?
I'm using a 32kSPS external ADC and a baud rate of 2000000 while printing in the following format: adc_value (32bits) \t millis()
This results in ~15 prints every 1 ms. Unfortunately, every single solution I have tried fails to capture and store real-time data to a file. This includes: Processing sketches, TeraTerm, Serial Port Monitor, PuTTY and some Python scripts. All of them are unable to log the data in real time.
Arduino Serial Monitor on the other hand is able to display real time serial data, but it's unable to log it in a file, as it lacks this function.
Here's a screenshot of the Arduino serial monitor with the incoming data: [screenshot omitted]
One problematic thing is probably that you do a write each time you receive a new record. That wastes a lot of time on many small writes.
Instead, try to collect the data into buffers, and when a buffer is about to overflow, write the whole buffer in a single, as low-level as possible, write call.
And so as not to stall the receiving of data too much, you could use threads and double-buffering: receive data in one thread and write it into a buffer. When the buffer is about to overflow, signal a second thread and switch to a second buffer. The other thread takes the full buffer, writes it to disk, and waits for the next buffer to become full. A sketch of the idea follows.
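Here is a minimal sketch of that producer/consumer scheme using pyserial; the port name, baud rate, buffer size and output path are assumptions, and a thread-safe queue stands in for the explicit pair of buffers:

import queue
import threading

import serial

BUF_SIZE = 4096

def reader(ser, q):
    # producer: pull bytes off the serial port as fast as possible
    while True:
        data = ser.read(BUF_SIZE)  # up to BUF_SIZE bytes, fewer on timeout
        if data:
            q.put(data)

def writer(q, path):
    # consumer: drain whole buffers to disk in large, unbuffered writes
    with open(path, 'wb', buffering=0) as f:
        while True:
            f.write(q.get())

ser = serial.Serial('/dev/ttyUSB0', 2000000, timeout=1)  # port is an assumption
q = queue.Queue()
threading.Thread(target=reader, args=(ser, q), daemon=True).start()
writer(q, 'capture.bin')  # consumer runs in the main thread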
After trying more than 10 possible solutions to this problem, including dedicated serial capture software, Python scripts, Matlab scripts, and some C project alternatives, the only one that kinda worked for me proved to be MegunoLink Pro.
It does not achieve the full 32kSPS potential of the ADC, rather around 12-15kSPS, but it is still much better than anything else I've tried.
Not achieving the full 32kSPS might also be a limitation of the Serial.print() method that I'm using for printing values to the serial console. By the way, the platform I've been using is the ESP32.
Later edit: don't forget to edit MegunoLinkPro.exe.config file in the MegunoLink Pro install directory in order to add further baud rates, like 1000000 or 2000000. By default it is limited to 500000.
I run several Python subprocesses to migrate data to S3. I noticed that my Python subprocesses often drop to 0% CPU and that this condition lasts more than one minute. This significantly decreases the performance of the migration process.
Here is a screenshot of the subprocess: [screenshot omitted]
The subprocess does these things:
1. Query all tables from a database.
2. Spawn a subprocess for each table:
for table in tables:
    print "Spawn process to process {0} table".format(table)
    process = multiprocessing.Process(name="Process " + table,
                                      target=target_def,
                                      args=(args, table))
    process.daemon = True
    process.start()
    processes.append(process)

for process in processes:
    process.join()
3. Query data from the database using LIMIT and OFFSET. I used the PyMySQL library to query the data, roughly as in the sketch below.
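For reference, the paging query might look roughly like this with PyMySQL (the helper and its names are mine). Note that large OFFSET values force the server to scan and discard all preceding rows on every call, which fits the slow-query diagnosis in the accepted answer below:

import pymysql

def fetch_batch(conn, table, limit, offset):
    # page through one table; the table name is interpolated, the values are bound
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM `%s` LIMIT %%s OFFSET %%s" % table,
                    (limit, offset))
        return cur.fetchall()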
4. Transform the returned data into another structure. construct_structure_def() is a function that transforms a row into another format:
buffer_string = []
for i, row_file in enumerate(row_files):
    if i == num_of_rows:
        buffer_string.append(json.dumps(construct_structure_def(row_file)))
    else:
        buffer_string.append(json.dumps(construct_structure_def(row_file)) + "\n")
content = ''.join(buffer_string)
5. Write the transformed data into a file and compress it with gzip:
with gzip.open(file_path, 'wb') as outfile:
    outfile.write(content)
return file_name
6. Upload the file to S3.
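The question doesn't show the upload step; if it uses boto3, it would be something along these lines (the bucket name is a placeholder):

import boto3

s3 = boto3.client('s3')
s3.upload_file(file_path, 'my-migration-bucket', file_name)  # bucket/key are placeholders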
7. Repeat steps 3 - 6 until there are no more rows to fetch.
To speed things up, I create the subprocesses for each table using the built-in multiprocessing.Process.
I ran my script in a virtual machine. Following are the specs:
Processor: Intel(R) Xeon(R) CPU E5-2690 @ 2.90 GHz (2 processors)
Virtual processors: 4
Installed RAM: 32 GB
OS: Windows Enterprise Edition.
I saw in a post here that one of the main possibilities is a memory I/O limitation, so I tried running a single subprocess to test that theory, but to no avail.
Any idea why this is happening? Let me know if you guys need more information.
Thank you in advance!
Turns out the culprit was the query I ran. The query took a long time to return its results, which left the Python script idle and hence at zero percent CPU.
Your VM is a Windows machine; I'm more of a Linux person, so I'd love it if someone would back me up here.
I think the daemon is the problem here.
I've read about daemon processes and especially about TSRs.
The first line of the TSR article says:
Regarding computers, a terminate and stay resident program (commonly referred to by the initialism TSR) is a computer program that uses a system call in DOS operating systems to return control of the computer to the operating system, as though the program has quit, but stays resident in computer memory so it can be reactivated by a hardware or software interrupt.
As I understand it, making the process a daemon (or a TSR in your case) makes it dormant until a syscall wakes it up, which I don't think is the case here (correct me if I'm wrong).
I am working on a project for work and I am stuck on a part where I need to monitor a serial line and listen for certain words using Python.
So the setup is that we have an automated RAM-testing machine that tests RAM one module at a time and interacts, via serial, with software that came with the machine. The software that came with the RAM tester is for monitoring/configuring the testing process; it also displays all of the information from the SPD chip of each module. While the RAM tester was running, I ran a serial port monitoring program and was able to see the same information that it displays in the software. The data I'm interested in is the speed of the RAM and the pass/fail result, both of which I was able to find in the data I monitored coming over the serial line. There are only 5 different speeds of RAM that we test, so I was hoping to have Python monitor the serial line and wait for the speed of the RAM and the pass/fail result to come across. Once Python detects the speed of the RAM, and if it passes, I will have Python write to an Arduino, and the Arduino will control a conveyor belt that sorts the RAM by speed.
My idea is to have a variable for each of the RAM speeds and set the variables to 0. When Python detects the RAM speed on the serial line, it will set the corresponding variable to 1. Then, when the test is over, the result, either pass or fail, will come over the serial line. This is where I am going to try to use an if statement. I imagine it would look something like this:
if PC-6400 == 1 and ser.read() == pass
    ser.write(PC-6400) # serial write to the arduino
I know the use of ser.read() == pass is incorrect, and that's where I'm stuck. I do not know how to use the ser.read() function to look for certain words. I need it to look for the RAM speed (in this case PC-6400) and the word pass, but I have not been successful in getting it to find either. Where I am currently stuck is getting Python to detect the RAM speed so it can change the value of the variable. Would it be something close to this?
if ser.read() == PC-6400
    PC-6400 = 1
This whole thing is a bit difficult for me to explain and I hope it all makes sense. I thank you in advance if anyone can give me some advice on how to get this going. I am pretty new to Python and this is the most adventurous project I have worked on with it so far.
I'm still a bit confused, but here's a very basic example to hopefully get you started.
This will read from a serial port. It will attempt to read 512 bytes (for a string, that just means 512 characters). If 512 bytes aren't available it will wait forever, so make sure you set a timeout when you make the serial connection.
return_str = ser.read(size=512)
You can also see how many bytes are waiting to be read:
print "num bytes available = ", ser.inWaiting()
Once you have a string, you can check words within the string:
if "PASS" in return_str:
print "the module passed!"
if "PC-6400" in return_str:
print "module type is PC-6400"
To do something similar, but with variables:
# the name pass is reserved
pass_flag = "PASS"
PC6400 = 0
if pass_flag in return_str and "PC-6400" in return_str:
    PC6400 = 1
Keep in mind that it is possible to read only part of a line if you are too quick. You can add delays by using timeouts or the time.sleep() function. You might also find you need to wait a couple of seconds after initiating the connection, as the Arduino resets when you connect; this gives it time to reset before you try to read or write. Putting the pieces together gives something like the sketch below.
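A rough end-to-end sketch assuming pyserial; the port name, baud rate, and the exact tokens the tester prints ("PC-6400", "PASS") are assumptions that need checking against the real serial traffic:

import time
import serial

ser = serial.Serial('COM3', 9600, timeout=2)  # port and baud rate are assumptions
time.sleep(2)  # the Arduino resets when the port opens; give it time

speed_seen = None
while True:
    line = ser.readline().decode('ascii', errors='ignore')  # one full line or timeout
    if "PC-6400" in line:
        speed_seen = "PC-6400"
    if "PASS" in line and speed_seen:
        ser.write((speed_seen + "\n").encode('ascii'))  # tell the Arduino which bin
        break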
I have Python 2.7 code which uses ftplib's storbinary function for uploading files to an FTP server and retrbinary for downloading from this server.
However, the upload takes a very long time on three laptops from different brands compared to a Dell laptop. The strange part is that when I upload any file manually, it takes the same time on all the systems.
The manual upload rate and the upload rate with the Python script are the same on the Dell laptop. However, on every other brand of laptop I have tried (IBM, Toshiba, Fujitsu-Siemens) the Python script has a much lower upload rate than the manual attempt. Also, on all these other laptops the upload rate using the Python script is the same (1 Mbit/s), while the manual upload rate is approx. 8 Mbit/s.
I have tried varying the file size for the upload, to no avail. TCP Optimizer improved the download rate on all the systems but had no effect on the upload rate. The download rate using this script is fine on all the systems and the same as the manual download rate.
I have checked the server and it has more than 90% free space. The network connection is the same for all the laptops, and I try uploading with only one laptop at a time. All the laptops have almost the same system configuration, the same operating system, and approximately the same free drive space. If anything, the Dell laptop has slightly less processing power and RAM than two of the others, but I suppose this has no effect: I have checked many times how much CPU and network were being used during these uploads and downloads, and I am sure that no virus or other program has been eating up my bandwidth.
Even with the storbinary command, when I specify the blocksize to be 57344 (56 kB), the upload rate improves to about 5 Kbit/s from the original 1 to 1.5 Kbit/s. What's the reason for that? And how can I find out the blocksize used by my manual upload client (I used FileZilla), or better yet, the optimal blocksize for upload? #guidot
Complete code:

def upnew(counter=0):
    f = open("c:/10", "w")
    f.write(os.urandom(10*1024*1024))
    f.close()
    print "Logging in..."
    ftpserver = 'xxxxxxx'
    ftpuser = 'xxxxxxx'
    ftppw = 'xxxxxxxxx'
    ftp = FTP(ftpserver)
    ftp.login(ftpuser, ftppw)
    t = open("c:/10", "rb")
    upstart = time.clock()
    ftp.storbinary('STOR 10', t)
    upende = time.clock() - upstart
    print ((10*8)/upende)
    print "press Return to disconnect"
    raw_input()
    ftp.quit()
    print "FTP connection closed"

upnew(1)
Please provide a working sample of your code... without being able to see how you're implementing the ftp function, it's impossible to give useful feedback, but in general you might benefit from using threads or sockets.
I may be wrong, but it appears that the issue resides in the way you call and use ftp.storbinary()
I would try using ftp.ntransfercmd() instead, with a buffer to break up the transfer as it proceeds. This gives you the added benefit of being able to track the progress of the FTP transfer and, in theory, to pause and restart the process as needed.
Haven't tested the performance of this script but you could try doing something like this:
import ftplib
import os
import sys

# host, login, passwd and zipname are assumed to be defined elsewhere
def ftpUploader():
    BLOCKSIZE = 57344  # 56 kB
    ftp = ftplib.FTP()
    ftp.connect(host)
    ftp.login(login, passwd)
    ftp.voidcmd("TYPE I")  # switch to binary mode
    f = open(zipname, 'rb')
    datasock, esize = ftp.ntransfercmd(
        'STOR %s' % os.path.basename(zipname))
    size = os.stat(zipname)[6]  # file size in bytes
    bytes_so_far = 0
    print 'started'
    while 1:
        buf = f.read(BLOCKSIZE)
        if not buf:
            break
        datasock.sendall(buf)
        bytes_so_far += len(buf)
        print "\rSent %d of %d bytes %.1f%%\r" % (
            bytes_so_far, size, 100 * bytes_so_far / size)
        sys.stdout.flush()
    datasock.close()  # closing the data socket ends the transfer
    f.close()
    ftp.voidresp()
    ftp.quit()
    print 'Complete...'
Python is primarily used for scripting and process automation and is not technically considered a very fast language (though faster than most other scripting languages). FileZilla is coded in C/C++ and is far superior to Python in performance. That being said, it's not a fair comparison, and we should keep that in mind when trying to identify issues in logic that may cause general performance problems.
storbinary basically acts as a wrapper for ntransfercmd, calling ntransfercmd without requiring us to define our own buffer (hence my earlier recommendation).
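For completeness, storbinary itself accepts a blocksize argument (the default is 8192 bytes), so the larger block can be tried without dropping down to ntransfercmd:

ftp.storbinary('STOR 10', t, blocksize=57344)  # same 56 kB block as in the comment above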
Additionally, after analyzing your code snippet again, I noticed that you're calling storbinary via the print statement... is this in error?
At this point, we'll need all of the related code used in this example to identify any issues in logic that may affect performance; please build upon your previous snippet to provide additional information.
Another factor to consider here is the general system environment in which you're conducting your tests... consider the location of each system you've performed said testing on and how far it is from the FTP server; additionally, differences in ISP or DNS servers could play a major role when troubleshooting performance issues related to TCP/IP-based connections such as FTP.
I am wondering why Python's mmap() performance goes down over time. I mean, I have a little app which makes changes to N files; if the set is big (not really big, say 1000), the first 200 go at demon speed, but after that it gets slower and slower. It looks like I should free memory once in a while, but I don't know how, and most importantly, why Python does not do this automagically.
Any help?
-- edit --
It's something like this:

import os
from mmap import mmap

def function(filename, N):
    fd = open(filename, 'rb+')
    size = os.path.getsize(filename)
    mapped = mmap(fd.fileno(), size)
    for i in range(N):
        some_operations_on_mmaped_block()
    mapped.close()
Your OS caches the mmap'd pages in RAM, so reads and writes go at RAM speed from the cache, and dirty pages are eventually flushed. On Linux, performance will be great until you have to start flushing pages; this is controlled by the vm.dirty_ratio sysctl variable. Once you start flushing dirty pages to disk, the reads compete with the writes on your busy I/O bus/device. Another thing to consider is simply whether your OS has enough RAM to cache all the files (watch the buffers counter in top's output). So I would watch the output of "vmstat 1" while your program runs and watch the cache/buff counters go up until suddenly you start doing I/O.
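If steadier throughput matters more than peak speed, one option is to flush dirty pages explicitly at intervals instead of letting the kernel accumulate them all; a minimal sketch (the per-iteration work is a placeholder, as in the question):

import os
from mmap import mmap

def process_file(filename, n_ops, flush_every=100):
    with open(filename, 'rb+') as f:
        mapped = mmap(f.fileno(), os.path.getsize(filename))
        for i in range(n_ops):
            # ... operate on the mapped block here ...
            if (i + 1) % flush_every == 0:
                mapped.flush()  # write dirty pages back now, in small doses
        mapped.close()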