Measuring time using pycuda.driver.Event gives wrong results - python

I ran SimpleSpeedTest.py from the PyCuda examples, producing the following output:
Using nbr_values == 8192
Calculating 100000 iterations
SourceModule time and first three results:
0.058294s, [ 0.005477 0.005477 0.005477]
Elementwise time and first three results:
0.102527s, [ 0.005477 0.005477 0.005477]
Elementwise Python looping time and first three results:
2.398071s, [ 0.005477 0.005477 0.005477]
GPUArray time and first three results:
8.207257s, [ 0.005477 0.005477 0.005477]
CPU time measured using :
0.000002s, [ 0.005477 0.005477 0.005477]
The first four time measurements are reasonable; the last one (0.000002s), however, is way off. The CPU result should be the slowest of all, yet it comes out orders of magnitude faster than the fastest GPU method, so the measured time must be wrong. This is strange, since the same timing method seems to work fine for the first four results.
So I took some code from SimpleSpeedTest.py and made a small test file [2], which produced:
time measured using option 1:
0.000002s
time measured using option 2:
5.989620s
Option 1 measures the duration using pycuda.driver.Event.record() (as in SimpleSpeedTest.py), option 2 uses time.clock(). Again, option 1 is off while option 2 gives a reasonable result (the time it takes to run the test file is around 6s).
Does anyone have an idea as to why this is happening?
Since option 1 is the method used in SimpleSpeedTest.py, could my setup be causing the problem? I am running a GTX 470, display driver 301.42, CUDA 4.2, Python 2.7 64-bit, PyCuda 2012.1, on an X5650 Xeon.
[2] Test file:
import numpy
import time
import pycuda.driver as drv
import pycuda.autoinit
n_iter = 100000
nbr_values = 8192 # = 64 * 128 (values as used in SimpleSpeedTest.py)
start = drv.Event() # option 1 uses pycuda.driver.Event
end = drv.Event()
a = numpy.ones(nbr_values).astype(numpy.float32) # test data
start.record() # start option 1 (inserting recording points into GPU stream)
tic = time.clock() # start option 2 (using CPU time)
for i in range(n_iter):
    a = numpy.sin(a) # do some work
end.record() # end option 1
toc = time.clock() # end option 2
end.synchronize()
events_secs = start.time_till(end)*1e-3
time_secs = toc - tic
print "time measured using option 1:"
print "%fs " % events_secs
print "time measured using option 2:"
print "%fs " % time_secs

I contacted Andreas Klöckner and he suggested synchronizing on the start event as well.
...
start.record()
start.synchronize()
...
And this seems to solve the issue!
time measured using option 1:
5.944461s
time measured using option 2:
5.944314s
Apparently CUDA's behaviour changed in the last two years. I updated SimpleSpeedTest.py.
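For reference, the corrected event-timing pattern from the snippets above then looks roughly like this (a minimal sketch; the work in the middle is whatever you want to time):
import pycuda.driver as drv
import pycuda.autoinit

start = drv.Event()
end = drv.Event()

start.record()       # enqueue the start marker on the GPU stream
start.synchronize()  # make sure the start marker has really been recorded

# ... work to be timed goes here ...

end.record()         # enqueue the end marker
end.synchronize()    # wait until the GPU has passed the end marker

elapsed_secs = start.time_till(end) * 1e-3  # time_till() returns milliseconds
print("%fs" % elapsed_secs)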

Related

Order of magnitude execution time difference of for-loop function between PC and Mac

I have a function containing a for-loop with numerous actions that I am executing on a dataframe with shape (68k, 10) in a Jupyter notebook on Windows. My machine has an 11th gen Intel i7-1165G7 2.8 GHz processor (1.5 years old) with 16 GB of RAM.
The execution takes 2928 seconds according to cProfile; the top tasks by execution time are in the profiling output (not reproduced here).
On my colleague's 4-year-old Mac, the same function takes 191 seconds.
The built-in method nt.stat, which takes 812 seconds on my machine, doesn't even appear in my colleague's profiling results. I have also tried this on a third colleague's older Windows machine, and there it takes even longer than on mine.
Why is there such a big difference in the execution time?
If helpful, I can try to include the entire cProfile results.
Update:
I created a reproducible example and ran on both machines.
Here is the code:
import numpy as np
import pandas as pd
import cProfile

df_shape = [100000, 8]
df = pd.DataFrame(np.random.uniform(0, 0.3, df_shape),
                  index=pd.date_range('2022-01-01', periods=df_shape[0], freq='H'))
df.columns = ['loss_' + str(i) for i in range(df_shape[1])]
var_incoming = np.random.normal(550, 80, df.shape[0])

def test_func(df_, vars):
    a_dct = {}
    wf_dct = {}
    res_lst = []
    wf_res_lst = []
    for i in range(df_.shape[0]):
        y = vars[i]
        for j, z in df_.iloc[i, :].items():
            a = z * y
            a_dct[j] = a
            y = y * (1 - z)
            wf_dct[j] = y
        temp1 = pd.DataFrame(dict(a_dct), index=[df_.index[i]])
        temp2 = pd.DataFrame(dict(wf_dct), index=[df_.index[i]])
        res_lst.append(temp1)
        wf_res_lst.append(temp2)
    res = pd.concat(res_lst, axis=0)
    wf = pd.concat(wf_res_lst, axis=0)
    return res, wf
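The exact profiling call isn't shown in the post; presumably it was something along the lines of this hypothetical driver:
# Hypothetical invocation: profile the reproducible example above and sort
# the report by cumulative time.
cProfile.run('test_func(df, var_incoming)', sort='cumtime')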
The commands performed seem to be largely the same across the machines. The difference in execution time is now smaller, and unfortunately I no longer see the nt.stat call that was previously taking very long. (The profiling output from my colleague's machine is not reproduced here.)

Python Why Is For-loop Performance Consistently Faster Compared to Using Multiprocessing?

I am trying to learn the multiprocessing library in Python 3.9. One thing I compared was the performance of a computation repeated on a dataset of 220500 samples, once using the multiprocessing library and once using for loops.
Throughout my tests I consistently get better performance using for loops. Here is the code for the test I am running: I compute the FFT of a signal with 220500 samples, and each test repeats this computation a certain number of times, setting the number of processes to 10, 100, and 1000 respectively.
import time
import numpy as np
from scipy.signal import get_window
from scipy.fftpack import fft
import multiprocessing
from itertools import product
def make_signal():
    # moved this code into a function to make threading portion of code clearer
    DUR = 5
    FREQ_HZ = 10
    Fs = 44100
    # precompute the size
    N = DUR * Fs
    # get a windowing function
    w = get_window('hanning', N)
    t = np.linspace(0, DUR, N)
    x = np.zeros_like(t)
    b = 2*np.pi*FREQ_HZ*t
    for i in range(50):
        x += np.sin(b*i)
    return x*w, Fs

def fft_(x, Fs):
    yfft = fft(x)[:x.size//2]
    xfft = np.linspace(0, Fs//2, yfft.size)
    return 2/yfft.size * np.abs(yfft), xfft

if __name__ == "__main__":
    # grab the raw sample data which will be computed by the fft function
    x = make_signal()
    # len(x) = 220500
    # create 5 different tests, each with the amount of processes below
    # array([  10,  100, 1000])
    tests_sweep = np.logspace(1, 3, 3, dtype=int)
    # sweep through the processes
    for iteration, test_num in enumerate(tests_sweep):
        # create a list of the amount of processes to give for each iteration
        fft_processes = []
        for i in range(test_num):
            fft_processes.append(x)
        start = time.time()
        # repeat the process for test_num amount of times (e.g. 10, 100, 1000)
        with multiprocessing.Pool() as pool:
            results = pool.starmap(fft_, fft_processes)
        end = time.time()
        print(f'{iteration}: Multiprocessing method with {test_num} processes took: {end - start:.2f} sec')
        start = time.time()
        for args in fft_processes:
            # repeat the process the same amount of time as the multiprocessing method using for loops
            fft_(*args)
        end = time.time()
        print(f'{iteration}: For-loop method with {test_num} processes took: {end - start:.2f} sec')
        print('----------')
Here are the results of my test.
0: Multiprocessing method with 10 processes took: 0.84 sec
0: For-loop method with 10 processes took: 0.05 sec
----------
1: Multiprocessing method with 100 processes took: 1.46 sec
1: For-loop method with 100 processes took: 0.45 sec
----------
2: Multiprocessing method with 1000 processes took: 6.70 sec
2: For-loop method with 1000 processes took: 4.21 sec
----------
Why is the for-loop method considerably faster? Am I using the multiprocessing library correctly? Thanks.
There is a nontrivial amount of overhead in starting a new process. In addition, the data has to be copied from one process to another (again with some overhead compared to a normal memory copy).
Another aspect is that you should limit the number of processes to the number of cores you have. Going over that will make you incur process-switching costs as well.
This, coupled with the fact that you have little computation per task, makes the switch to multiprocessing not worthwhile here.
I think if you make the signal significantly longer (10x or 100x) you should start seeing some benefit from using multiple cores.
Also check whether the operations you are running already use some parallelism internally. They might be implemented with threads, which are significantly cheaper than processes (but historically didn't work well in Python, due to the GIL).
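To illustrate those suggestions, a minimal sketch that caps the pool at the core count and hands each worker a chunk of tasks at a time (assuming make_signal and fft_ as defined in the question; the chunksize value is just an illustration):
import os
import multiprocessing

if __name__ == "__main__":
    x = make_signal()            # (signal, Fs) tuple, as in the question
    tasks = [x] * 1000           # the 1000-task case

    # cap the worker count at the number of cores and hand each worker
    # several tasks at a time to amortise the inter-process communication
    with multiprocessing.Pool(processes=os.cpu_count()) as pool:
        results = pool.starmap(fft_, tasks, chunksize=50)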

How to Properly use CuPy Streams

I'm currently trying to figure out how to use CuPy streams effectively. The following code calculates a matrix power via repeated matrix multiplication. I would expect it to spend most of its time at the synchronize line, but it seems to spend most of its time in the matmul loop. Is this a bug in CuPy, or am I misusing the CuPy stream?
#!/usr/bin/env python
"""
stream_example.py
Inefficiently calculates a matrix power through repeated matrix multiplication.
"""
import numpy as np
import cupy
import sys
import time
def main(N, power):
    compute_stream = cupy.cuda.stream.Stream(non_blocking=True)
    with compute_stream:
        d_mat = cupy.random.randn(N*N, dtype=cupy.float64).reshape(N, N)
        d_ret = d_mat
        cupy.matmul(d_ret, d_mat)

        start_time = time.time()
        for i in range(power - 1):
            d_ret = cupy.matmul(d_ret, d_mat)
        end_time = time.time()
        print(f"Time spent on cupy.matmul for loop: {end_time - start_time}")

        start_time = time.time()
        compute_stream.synchronize()
        end_time = time.time()
        print(f"Time spent compute_stream.synchronize(): {end_time - start_time}")

if __name__ == "__main__":
    main(int(sys.argv[1]), int(sys.argv[2]))
The results show that most of the time is spent within the repeated multiplication for loop rather than stream.synchronize(). Can cupy.matmul() not be used asynchronously?
$ python3 stream_example.py 16384 1024
Time spent on cupy.matmul for loop: 2.667935609817505
Time spent compute_stream.synchronize(): 4.2438507080078125e-05
It looks like adding the following works around this issue. I'll reserve the green checkmark for someone who can come up with a less hacky solution:
import cupy_backends.cuda.libs.cublas
from cupy.cuda import device
handle = device.get_cublas_handle()
...
cupy_backends.cuda.libs.cublas.setStream(handle, compute_stream.ptr)
$ python3 stream_example.py 16384 4
Time spent on cupy.matmul for loop: 0.007548093795776367
Time spent compute_stream.synchronize(): 5.099333047866821
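As a side note, if the goal is to measure how long the GPU itself spends in the matmul loop (rather than how long the host takes to enqueue it), CUDA events recorded on the stream can be used instead of time.time(). A minimal sketch, reusing d_mat, d_ret, power and compute_stream from the example above:
start_evt = cupy.cuda.Event()
end_evt = cupy.cuda.Event()

start_evt.record(compute_stream)       # marker before the first matmul
for i in range(power - 1):
    d_ret = cupy.matmul(d_ret, d_mat)
end_evt.record(compute_stream)         # marker after the last matmul
end_evt.synchronize()                  # block until the GPU reaches the end marker

# elapsed GPU time between the two markers, in milliseconds
print(cupy.cuda.get_elapsed_time(start_evt, end_evt))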

Numba Python - how to exploit parallelism effectively?

I have been trying to exploit Numba to speed up large array calculations. I have been measuring the calculation speed in GFLOPS, and it consistently falls far short of my expectations for my CPU.
My processor is an i9-9900K, which according to float32 benchmarks should be capable of over 200 GFLOPS. In my tests I have never exceeded about 50 GFLOPS. This is running on all 8 cores.
On a single core I achieve about 17 GFLOPS, which (I believe) is 50% of the theoretical performance. I'm not sure if this is improvable, but the fact that it doesn't extend well to multi-core is a problem.
I am trying to learn this because I am planning to write some image processing code that desperately needs every speed boost possible. I also feel I should understand this first, before I dip my toes into GPU computing.
Here is some example code with a few of my attempts at writing fast functions. The operation I am testing is multiplying an array by a float32 and then summing the whole array, i.e. a MAC operation.
How can I get better results?
import os
# os.environ["NUMBA_ENABLE_AVX"] = "1"
import numpy as np
import timeit
from timeit import default_timer as timer
import numba
# numba.config.NUMBA_ENABLE_AVX = 1
# numba.config.LOOP_VECTORIZE = 1
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32, float64
from numba import jit, njit, prange
from numba import vectorize
from numba import cuda

lengthY = 16       # 2D array Y axis
lengthX = 2**16    # X axis
totalops = lengthY * lengthX * 2  # MAC operation has 2 operations
iters = 100
doParallel = True

@njit(fastmath=True, parallel=doParallel)
def MAC_numpy(testarray):
    output = (float)(0.0)
    multconst = (float)(.99)
    output = np.sum(np.multiply(testarray, multconst))
    return output

@njit(fastmath=True, parallel=doParallel)
def MAC_01(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(lengthX):
            output += multconst*testarray[y,x]
    return output

@njit(fastmath=True, parallel=doParallel)
def MAC_04(testarray):
    lengthX = testarray.shape[1]
    lengthY = testarray.shape[0]
    output = (float)(0.0)
    multconst = (float)(.99)
    for y in prange(lengthY):
        for x in prange(int(lengthX/4)):
            xn = x*4
            output += multconst*testarray[y,xn] + multconst*testarray[y,xn+1] + multconst*testarray[y,xn+2] + multconst*testarray[y,xn+3]
    return output

# ======================================= TESTS =======================================
testarray = np.random.rand(lengthY, lengthX)

# ==== MAC_numpy ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_numpy(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_numpy")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))

# ==== MAC_01 ====
time = 1000
lengthX = testarray.shape[1]
lengthY = testarray.shape[0]
for n in range(iters):
    start = timer()
    output = MAC_01(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_01")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))

# ==== MAC_04 ====
time = 1000
for n in range(iters):
    start = timer()
    output = MAC_04(testarray)
    end = timer()
    if((end-start) < time): #get shortest time
        time = end-start
print("\nMAC_04")
print("output = %f" % (output))
print(type(output))
print("fastest time = %16.10f us" % (time*10**6))
print("Compute Rate = %f GFLOPS" % ((totalops/time)/10**9))
Q : How can I get better results?
1st : Learn how to avoid doing useless work - you can eliminate half of the FLOPs outright, not to mention avoiding half of all the RAM I/O, each write-back costing about +100~350 [ns].
Due to the distributive law for MUL and ADD, ( a*C + b*C ) == ( a + b )*C, it is better to first np.sum( A ) and only then multiply the sum by the (float) constant.
# output = np.sum(np.multiply(testarray, multconst)) # AWFULLY INEFFICIENT
output = np.sum(testarray) * multconst
2nd : Learn how to best align data with the order of processing: cache-line reuse gets you ~100x faster access to already pre-fetched data. Vectorised code that does not follow the pre-fetched data layout pays the RAM-access latency many times over instead of smartly re-using data blocks it has already paid for. Designing work units around this principle costs a few more SLOCs, but the reward is worth it - roughly a ~100x speedup for free, simply from not writing badly or naively designed loop iterators.
3rd : Learn how to efficiently harness vectorised (block-directed) operations inside numpy or numba code blocks, and avoid forcing numba to spend time auto-analysing the call signatures: you pay extra for this analysis on every call, even though you designed the code and know exactly which data types will be passed in, so declare the signature up front instead (an explicit-signature sketch appears at the end of this answer).
4th : Learn where the extended Amdahl's Law - with all the relevant add-on costs and the atomicity of processing taken into account - actually supports your wish for a speedup, so that you never pay more than you get back (the add-on costs at least have to be justified). Paying extra costs without any reward is entirely possible, and it has no beneficial impact on your code's performance (rather the opposite); a small formula sketch follows after these points.
5th : Learn when and how manually created inline(s) may save your code, once steps 1-4 are well learnt and routinely exercised with proper craftsmanship. Using popular COTS frameworks is fine, yet they may deliver results only after a few days of work, while a hand-crafted, single-purpose, smartly designed assembly code got the same results in about 12 minutes(!) instead of several days, without any GPU/CPU tricks - yes, that much faster - just by not doing a single step more than was needed for the numerical processing of the large matrix data.
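As a small illustration of the 4th point (the variable names and numbers here are mine, not from the answer), the overhead-extended form of the speedup can be written as:
def overhead_extended_speedup(p, n, overhead_fraction):
    """Amdahl's Law with parallelisation add-on costs folded in.

    p                 -- parallelisable fraction of the original run time (0..1)
    n                 -- number of workers executing the parallel part
    overhead_fraction -- add-on costs (spawning, data copies, signalling, ...)
                         expressed as a fraction of the original run time
    """
    return 1.0 / ((1.0 - p) + p / n + overhead_fraction)

# Example: 95% parallelisable work on 8 cores looks like a ~5.9x speedup,
# but add-on costs worth 20% of the original run time shrink it to ~2.7x.
print(overhead_extended_speedup(0.95, 8, 0.20))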
Did I mention that float32 may surprise you by being processed slower than float64 at small scales, while at larger data scales (~ n [GB]) the RAM I/O times grow more slowly thanks to more efficient float32 pre-fetches? That never happens here, though, because a float64 array gets processed - unless one explicitly instructs the constructor(s) to down-convert the default data type, like this:
>>> np.random.rand(10, 2).dtype
dtype('float64')

testarray = np.random.rand(lengthY, lengthX).astype(dtype=np.float32)
Avoiding extensive memory allocations is another performance trick, supported in numpy call signatures (the out= parameter). Using this option for large arrays will save a lot of extra time otherwise wasted on mem-allocs for large interim arrays. Re-using already pre-allocated memory zones and wisely controlled gc-policing are further signs of a professional focused on low latency and design-for-performance.
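Putting the 1st and 3rd points (plus the float32 down-conversion) together, here is a minimal sketch of what an explicitly-typed, parallel float32 version could look like (my illustration, not code from the answer; the names and the row-wise split are assumptions):
import numpy as np
from numba import njit, prange, float32

# Eager compilation: an explicit signature means numba compiles once up front
# and never spends time auto-analysing argument types at call time.
@njit(float32(float32[:, ::1], float32), fastmath=True, parallel=True)
def mac_sum_then_mul(arr, multconst):
    total = np.float32(0.0)
    for y in prange(arr.shape[0]):          # parallel over rows
        row_sum = np.float32(0.0)
        for x in range(arr.shape[1]):       # contiguous, cache-friendly inner loop
            row_sum += arr[y, x]
        total += row_sum
    return total * multconst                # multiply once, after the summation

testarray32 = np.random.rand(16, 2**16).astype(np.float32)   # down-converted input
print(mac_sum_then_mul(testarray32, np.float32(0.99)))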

Calculating user, nice, sys, idle, iowait, irq and sirq from /proc/stat

/proc/stat shows ticks for user, nice, sys, idle, iowait, irq and sirq like this:
cpu 6214713 286 1216407 121074379 260283 253506 197368 0 0 0
How can I calculate the individual utilizations (in %) for user, nice, etc. from these values? Like the values shown in 'top' or 'vmstat'.
This code calculates user utilization spread over all cores.
import os
import time
import multiprocessing
def main():
    jiffy = os.sysconf(os.sysconf_names['SC_CLK_TCK'])
    num_cpu = multiprocessing.cpu_count()

    stat_fd = open('/proc/stat')
    stat_buf = stat_fd.readlines()[0].split()
    user, nice, sys, idle, iowait, irq, sirq = ( float(stat_buf[1]), float(stat_buf[2]),
                                                 float(stat_buf[3]), float(stat_buf[4]),
                                                 float(stat_buf[5]), float(stat_buf[6]),
                                                 float(stat_buf[7]) )
    stat_fd.close()

    time.sleep(1)

    stat_fd = open('/proc/stat')
    stat_buf = stat_fd.readlines()[0].split()
    user_n, nice_n, sys_n, idle_n, iowait_n, irq_n, sirq_n = ( float(stat_buf[1]), float(stat_buf[2]),
                                                               float(stat_buf[3]), float(stat_buf[4]),
                                                               float(stat_buf[5]), float(stat_buf[6]),
                                                               float(stat_buf[7]) )
    stat_fd.close()

    print ((user_n - user) * 100 / jiffy) / num_cpu

if __name__ == '__main__':
    main()
From Documentation/filesystems/proc.txt:
(...) These numbers identify the amount of time the CPU has spent performing
different kinds of work. Time units are in USER_HZ (typically hundredths of a second).
So to figure out utilization in terms of percentages you need to:
Find out what USER_HZ is on the machine
Find out how long it's been since the system booted.
The second one is easy: there is a btime line in that same file which you can use for that. For USER_HZ, check out How to get number of mili seconds per jiffy.
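For interval utilisation (the kind of numbers top shows), the percentages can also be formed directly from two samples of the cpu line, since dividing each field's delta by the delta of the total cancels USER_HZ out. A minimal sketch (my illustration, not code from the answer; field order follows proc(5)):
import time

def read_cpu_fields():
    # first line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    with open('/proc/stat') as f:
        return [int(v) for v in f.readline().split()[1:]]

before = read_cpu_fields()
time.sleep(1)
after = read_cpu_fields()

deltas = [b - a for a, b in zip(before, after)]
total = sum(deltas)

names = ['user', 'nice', 'sys', 'idle', 'iowait', 'irq', 'softirq']
for name, delta in zip(names, deltas):
    # percentage of the interval spent in this state, across all CPUs combined
    print('%-8s %5.1f%%' % (name, 100.0 * delta / total))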
