Python numpy code more efficient than eigen3 or plain C++

I had some code in Python3 (with numpy) that I wanted to convert to C++ (with eigen3) in order to get a more efficient program. So I decided to test a simple example to assess the performance gain I would get. The code consists of two random arrays that are multiplied coefficient-wise. My conclusion was that the Python code with numpy is about 30% faster than the C++ one. I'd like to know why the interpreted Python code is faster than compiled C++ code. Am I missing something in the C++ code?
I'm using gcc 9.1.0, Eigen 3.3.7, Python 3.7.3 and Numpy 1.16.4.
Possible explanations:
C++ program isn't using vectorization
Numpy is a lot more optimized than I thought
Time is measuring different things in each program
There is a similar question on Stack Overflow (Eigen Matrix vs Numpy Array multiplication performance). I tested this on my computer and got the expected result that Eigen is more efficient than numpy, but the operation there is matrix multiplication rather than coefficient-wise multiplication.
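One thing worth checking before comparing further (a suggestion, not something tested in the original post): by default g++ targets baseline x86-64, i.e. SSE2 only, so Eigen cannot emit AVX code. Rebuilding with the host's full instruction set enabled, for example

g++ -O3 -march=native -I/usr/include/eigen3/ main_eigen.cpp -o prog_eigen

lets Eigen and the auto-vectorizer use AVX/AVX2. Keep in mind that a coefficient-wise product over arrays of 4096x4000 doubles (~131 MB per array, three arrays streamed through RAM) is largely memory-bound, so wider SIMD may not close the whole gap.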
Python code (main.py)
Execution command: python3 main.py
import numpy as np
import time
Lx = 4096
Ly = 4000
# Filling arrays
a = np.random.rand(Lx, Ly).astype(np.float64)
a1 = np.random.rand(Lx, Ly).astype(np.float64)
# Coefficient-wise product
start = time.time()
b = a*a1
# Compute the elapsed time
end = time.time()
print(b.sum())
print("duration: ", end-start)
C++ code with eigen3 (main_eigen.cpp)
Compilation command: g++ -O3 -I/usr/include/eigen3/ main_eigen.cpp -o prog_eigen
#include <iostream>
#include <chrono>
#include "Eigen/Dense"
#define Lx 4096
#define Ly 4000
typedef double T;
int main(){
    // Allocating arrays
    Eigen::Array<T, -1, -1> KPM_ghosts(Lx, Ly), KPM_ghosts1(Lx, Ly), b(Lx, Ly);

    // Filling the arrays
    KPM_ghosts.setRandom();
    KPM_ghosts1.setRandom();

    // Coefficient-wise product
    auto start = std::chrono::system_clock::now();
    b = KPM_ghosts*KPM_ghosts1;

    // Compute the elapsed time
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

    // Print the sum so the compiler doesn't optimize the code away
    std::cout << b.sum() << "\n";

    return 0;
}
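As a quick sanity check on the vectorization hypothesis, Eigen can report which SIMD instruction sets it was actually compiled with. A minimal sketch using Eigen's own query function:

#include <iostream>
#include "Eigen/Core"

int main(){
    // Prints the SIMD instruction sets Eigen is using, e.g. "SSE, SSE2"
    // without -march=native versus "AVX, ..." with it
    std::cout << Eigen::SimdInstructionSetsInUse() << "\n";
    return 0;
}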
Plain C++ code (main.cpp)
Compilation command: g++ -O3 main.cpp -o prog
#include <iostream>
#include <chrono>
#include <cstdlib>
#define Lx 4096
#define Ly 4000
#define N (Lx*Ly)
typedef double T;

int main(){
    // Allocating arrays on the heap: three arrays of Lx*Ly doubles are
    // ~131 MB each, far too large for the stack
    T *lin_vector1 = new T[N];
    T *lin_vector2 = new T[N];
    T *lin_vector3 = new T[N];

    // Filling the arrays
    for(unsigned i = 0; i < N; i++){
        lin_vector1[i] = std::rand()*1.0/RAND_MAX;
        lin_vector2[i] = std::rand()*1.0/RAND_MAX;
    }

    // Coefficient-wise product
    auto start = std::chrono::system_clock::now();
    for(unsigned i = 0; i < N; i++)
        lin_vector3[i] = lin_vector1[i]*lin_vector2[i];

    // Compute the elapsed time
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

    // Print the sum so the compiler doesn't optimize the code away
    double sum = 0;
    for(unsigned i = 0; i < N; i++)
        sum += lin_vector3[i];
    std::cout << "sum: " << sum << "\n";

    delete[] lin_vector1;
    delete[] lin_vector2;
    delete[] lin_vector3;
    return 0;
}
Runtime of each program over 10 runs
Plain C++
elapsed time: 0.210664s
elapsed time: 0.215406s
elapsed time: 0.222483s
elapsed time: 0.21526s
elapsed time: 0.216346s
elapsed time: 0.218951s
elapsed time: 0.21587s
elapsed time: 0.213639s
elapsed time: 0.219399s
elapsed time: 0.213403s
C++ with Eigen3
elapsed time: 0.21052s
elapsed time: 0.220779s
elapsed time: 0.216269s
elapsed time: 0.229234s
elapsed time: 0.212265s
elapsed time: 0.256714s
elapsed time: 0.212396s
elapsed time: 0.248241s
elapsed time: 0.241537s
elapsed time: 0.323519s
Python
duration: 0.23946428298950195
duration: 0.1663036346435547
duration: 0.17225909233093262
duration: 0.15922021865844727
duration: 0.16628384590148926
duration: 0.15654635429382324
duration: 0.15859222412109375
duration: 0.1633443832397461
duration: 0.1685199737548828
duration: 0.16393446922302246

Related

Performance of LLVM-Compiler on native C code vs Python+Numba

I recently did some tests on performance optimization in Python. One part was benchmarking a Monte Carlo Pi calculation by using SWIG to compile a C library to import into Python. The other solution used Numba. Now I wonder why the native C solution is worse than Numba, even though the LLVM compiler is used for both. So I'm wondering if I'm doing something wrong.
Runtime on my Laptop
native C module: 7.09 s
Python+Numba: 2.75 s
Native C code
#include "swigtest.h"
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
float monte_carlo_pi(long nsamples)
{
    int acc = 0;
    long i;
    float x, y;
    float res;
    float iRMX = 1.0/(float)RAND_MAX;

    srand(time(NULL));
    for(i = 0; i < nsamples; i++)
    {
        x = (float)rand()*iRMX;
        y = (float)rand()*iRMX;
        if((x*x + y*y) < 1.0) { acc += 1; }
    }
    res = 4.0 * (float)acc / (float)nsamples;
    printf("cres = %.5f\n", res);
    return res;
}
swigtest.i
%module swigtest
%{
#define SWIG_FILE_WITH_INIT
#include "swigtest.h"
%}
float monte_carlo_pi(long nsamples);
Compiler call
clang.exe swigtest.c swigtest_wrap.c -Ofast -o _swigtest.pyd -I C:\python37\include -shared -L c:\python37\libs -g0 -mtune=intel -msse4.2 -mmmx
testswig.py
from swigtest import monte_carlo_pi
import time
import os
start = time.time()
pi = monte_carlo_pi(250000000)
print("pi: %.5f" % pi)
print("tm:",time.time()-start)
Python version with Numba
from numba import jit
import random
import time

start = time.time()

@jit(nopython=True, cache=True, fastmath=True)
def monte_carlo_pi(nsamples: int) -> float:
    acc: int = 0
    for i in range(nsamples):
        x: float = random.random()
        y: float = random.random()
        if (x * x + y * y) < 1.0: acc += 1
    return 4.0 * acc / nsamples

pi = monte_carlo_pi(250000000)
print("pi:", pi)
print("tm:", time.time()-start)
Summary up to now:
The rand() function seems to consume most of the time. Using a deterministic approach like this

...
ns = (long)sqrt((double)nsamples) + 1;
dx = 1./sqrt((double)nsamples);
dy = dx;
...
for(i = 0; i < ns; i++)
    for(k = 0; k < ns; k++)
    {
        x = i*dx;
        y = k*dy;
        if((x*x + y*y) < 1.0) { accLoc += 1; }
    }
...

instead of rand() results in an execution time of only 0.04 s! Obviously Numba uses a much more efficient random function.
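To illustrate that the PRNG dominates the budget, here is a minimal sketch (the name monte_carlo_pi_fast is mine, and this is not what Numba does internally) that swaps rand() for a tiny inline xorshift32 generator while leaving the rest of the loop unchanged:

#include <cstdio>
#include <cstdint>
#include <ctime>

// Tiny xorshift32 PRNG: three shifts and XORs per draw, trivially inlined
static inline uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

float monte_carlo_pi_fast(long nsamples)
{
    uint32_t state = (uint32_t)std::time(nullptr) | 1u; // avoid the all-zero state
    const float iMAX = 1.0f / 4294967296.0f;            // 2^-32 maps draws to [0,1)
    long acc = 0;
    for (long i = 0; i < nsamples; i++)
    {
        float x = xorshift32(&state) * iMAX;
        float y = xorshift32(&state) * iMAX;
        if ((x*x + y*y) < 1.0f) acc += 1;
    }
    return 4.0f * (float)acc / (float)nsamples;
}

int main()
{
    std::printf("pi = %.5f\n", monte_carlo_pi_fast(250000000));
    return 0;
}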

Same random numbers in C++ as computed by Python3 numpy.random.rand

I would like to duplicate in C++ the testing of some code that has already been implemented in Python3 and relies on numpy.random.rand and randn values with a specific seed (e.g., seed = 1).
I understand that Python's random implementation is based on a Mersenne twister. The C++ standard library also supplies this in std::mersenne_twister_engine.
The C++ version returns an unsigned int, whereas Python rand is a floating point value.
Is there a way to obtain the same values in C++ as are generated in Python, and to be sure that they are the same? And the same for an array generated by randn?
You can do it this way for integer values:
import numpy as np
np.random.seed(12345)
print(np.random.randint(256**4, dtype='<u4', size=1)[0])
#include <iostream>
#include <random>

int main()
{
    std::mt19937 e2(12345);
    std::cout << e2() << std::endl;
}
The result of both snippets is 3992670690
By looking at the source code of numpy's rand, you can implement it in your C++ code this way:
import numpy as np
np.random.seed(12345)
print(np.random.rand())
#include <iostream>
#include <iomanip>
#include <random>

int main()
{
    std::mt19937 e2(12345);
    int a = e2() >> 5;
    int b = e2() >> 6;
    double value = (a * 67108864.0 + b) / 9007199254740992.0;
    std::cout << std::fixed << std::setprecision(16) << value << std::endl;
}
Both random values are 0.9296160928171479
It would be convenient to use std::generate_canonical, but it uses a different method to convert the output of the Mersenne twister to a double. The reason they differ is likely that generate_canonical is more optimized than the generator used in NumPy, avoiding costly floating point operations, especially multiplication and division, as seen in its source code. However, its output is implementation dependent, while NumPy produces the same result on all platforms.
double value = std::generate_canonical<double, std::numeric_limits<double>::digits>(e2);
This doesn't match: it produces 0.8901547132827379, which differs from the output of the Python code.
For completeness and to avoid re-inventing the wheel, here is an implementation of both numpy.random.rand and numpy.random.randn in C++.
The header file:
#ifndef RANDOMNUMGEN_NUMPYCOMPATIBLE_H
#define RANDOMNUMGEN_NUMPYCOMPATIBLE_H

#include "RandomNumGenerator.h"
#include <cmath>
#include <cstdint>
#include <random>
#include <string>

// Uniform distribution - numpy.rand
class RandomNumGen_NumpyCompatible {
public:
    RandomNumGen_NumpyCompatible();
    RandomNumGen_NumpyCompatible(std::uint_fast32_t newSeed);

    std::uint_fast32_t min() const { return m_mersenneEngine.min(); }
    std::uint_fast32_t max() const { return m_mersenneEngine.max(); }

    void seed(std::uint_fast32_t seed);
    void discard(unsigned long long); // NOTE!! Advances and discards twice as many values as passed in, to keep tracking with Numpy order
    std::uint_fast32_t operator()(); // Simply returns the next Mersenne value from the engine
    double getDouble(); // Calculates the next uniformly random double as numpy.rand does

    std::string getGeneratorType() const { return "RandomNumGen_NumpyCompatible"; }

private:
    std::mt19937 m_mersenneEngine;
};

///////////////////

// Gaussian distribution - numpy.randn
class GaussianRandomNumGen_NumpyCompatible {
public:
    GaussianRandomNumGen_NumpyCompatible();
    GaussianRandomNumGen_NumpyCompatible(std::uint_fast32_t newSeed);

    std::uint_fast32_t min() const { return m_mersenneEngine.min(); }
    std::uint_fast32_t max() const { return m_mersenneEngine.max(); }

    void seed(std::uint_fast32_t seed);
    void discard(unsigned long long); // NOTE!! Advances and discards twice as many values as passed in, to keep tracking with Numpy order
    std::uint_fast32_t operator()(); // Simply returns the next Mersenne value from the engine
    double getDouble(); // Calculates the next normally (Gaussian) distributed random double as numpy.randn does

    std::string getGeneratorType() const { return "GaussianRandomNumGen_NumpyCompatible"; }

private:
    bool m_haveNextVal;
    double m_nextVal;
    std::mt19937 m_mersenneEngine;
};

#endif
And the implementation:
#include "RandomNumGen_NumpyCompatible.h"
RandomNumGen_NumpyCompatible::RandomNumGen_NumpyCompatible()
{
}
RandomNumGen_NumpyCompatible::RandomNumGen_NumpyCompatible(std::uint_fast32_t seed)
: m_mersenneEngine(seed)
{
}
void RandomNumGen_NumpyCompatible::seed(std::uint_fast32_t newSeed)
{
m_mersenneEngine.seed(newSeed);
}
void RandomNumGen_NumpyCompatible::discard(unsigned long long z)
{
//Advances and discards TWICE as many values to keep with Numpy order
m_mersenneEngine.discard(2*z);
}
std::uint_fast32_t RandomNumGen_NumpyCompatible::operator()()
{
return m_mersenneEngine();
}
double RandomNumGen_NumpyCompatible::getDouble()
{
int a = m_mersenneEngine() >> 5;
int b = m_mersenneEngine() >> 6;
return (a * 67108864.0 + b) / 9007199254740992.0;
}
///////////////////
GaussianRandomNumGen_NumpyCompatible::GaussianRandomNumGen_NumpyCompatible()
: m_haveNextVal(false)
{
}
GaussianRandomNumGen_NumpyCompatible::GaussianRandomNumGen_NumpyCompatible(std::uint_fast32_t seed)
: m_haveNextVal(false), m_mersenneEngine(seed)
{
}
void GaussianRandomNumGen_NumpyCompatible::seed(std::uint_fast32_t newSeed)
{
m_mersenneEngine.seed(newSeed);
}
void GaussianRandomNumGen_NumpyCompatible::discard(unsigned long long z)
{
//Burn some CPU cyles here
for (unsigned i = 0; i < z; ++i)
getDouble();
}
std::uint_fast32_t GaussianRandomNumGen_NumpyCompatible::operator()()
{
return m_mersenneEngine();
}
double GaussianRandomNumGen_NumpyCompatible::getDouble()
{
if (m_haveNextVal) {
m_haveNextVal = false;
return m_nextVal;
}
double f, x1, x2, r2;
do {
int a1 = m_mersenneEngine() >> 5;
int b1 = m_mersenneEngine() >> 6;
int a2 = m_mersenneEngine() >> 5;
int b2 = m_mersenneEngine() >> 6;
x1 = 2.0 * ((a1 * 67108864.0 + b1) / 9007199254740992.0) - 1.0;
x2 = 2.0 * ((a2 * 67108864.0 + b2) / 9007199254740992.0) - 1.0;
r2 = x1 * x1 + x2 * x2;
} while (r2 >= 1.0 || r2 == 0.0);
/* Box-Muller transform */
f = sqrt(-2.0 * log(r2) / r2);
m_haveNextVal = true;
m_nextVal = f * x1;
return f * x2;
}
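For reference, a quick check of the uniform generator against the Python snippet earlier in this answer (np.random.seed(12345) followed by np.random.rand() printed 0.9296160928171479):

#include <iostream>
#include <iomanip>
#include "RandomNumGen_NumpyCompatible.h"

int main()
{
    RandomNumGen_NumpyCompatible gen(12345);
    // Expected output: 0.9296160928171479, matching NumPy seeded with 12345
    std::cout << std::fixed << std::setprecision(16) << gen.getDouble() << "\n";
    return 0;
}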
After doing a bit of testing, it does seem that the values are within a tolerance (see #fdermishin's comment below) when the C++ unsigned int is divided by the maximum value for an unsigned int, like this:
#include <limits>
...
std::mt19937 generator1(seed); // mt19937 is a standard mersenne_twister_engine
unsigned val1 = generator1();
std::cout << "Gen 1 random value: " << val1 << std::endl;
std::cout << "Normalized Gen 1: " << static_cast<double>(val1) / std::numeric_limits<std::uint32_t>::max() << std::endl;
However, Python's version seems to skip every other value.
Given the following two programs:
#!/usr/bin/env python3

import sys
import numpy as np

def main():
    np.random.seed(1)
    for i in range(0, 10):
        print(np.random.rand())

###########
# Call main and exit success
if __name__ == "__main__":
    main()
    sys.exit()
and
#include <cstdlib>
#include <cstdint>
#include <iostream>
#include <random>
#include <limits>

int main()
{
    unsigned seed = 1;
    std::mt19937 generator1(seed); // mt19937 is a standard mersenne_twister_engine
    for (unsigned i = 0; i < 10; ++i) {
        unsigned val1 = generator1();
        std::cout << "Normalized, #" << i << ": "
                  << (static_cast<double>(val1) / std::numeric_limits<std::uint32_t>::max()) << std::endl;
    }
    return EXIT_SUCCESS;
}
the Python program prints:
0.417022004702574
0.7203244934421581
0.00011437481734488664
0.30233257263183977
0.14675589081711304
0.0923385947687978
0.1862602113776709
0.34556072704304774
0.39676747423066994
0.538816734003357
whereas the C++ program prints:
Normalized, #0: 0.417022
Normalized, #1: 0.997185
Normalized, #2: 0.720324
Normalized, #3: 0.932557
Normalized, #4: 0.000114381
Normalized, #5: 0.128124
Normalized, #6: 0.302333
Normalized, #7: 0.999041
Normalized, #8: 0.146756
Normalized, #9: 0.236089
I could easily skip every other value in the C++ version, which should give me numbers that match the Python version (within a tolerance). But why would Python's implementation seem to skip every other value, or where do these extra values in the C++ version come from?
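The two-draws-per-double scheme from the answer above explains the "skipping": numpy consumes two 32-bit Mersenne outputs for every double (the 27 high bits of the first and the 26 high bits of the second form one 53-bit value), so normalizing single outputs matches only every other value, and at lower precision. A minimal sketch that should reproduce the ten Python values above with seed 1:

#include <cstdio>
#include <random>

int main()
{
    std::mt19937 gen(1);
    for (int i = 0; i < 10; ++i) {
        // numpy.random.rand consumes two 32-bit draws per double:
        // 27 high bits of the first, 26 high bits of the second
        double a = gen() >> 5;
        double b = gen() >> 6;
        std::printf("%.16f\n", (a * 67108864.0 + b) / 9007199254740992.0);
    }
    return 0;
}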

Matrix multiplication benchmarking on Titan RTX with double and single precisions

I am trying to understand the difference in performances between single and double precisions of our GPU workstation.
Our workstation is equipped with two TITAN RTX GPUs, but I am running the benchmark on a single Titan RTX.
I am testing the performance with cublas matrix-matrix multiplications. I multiply 8192x8192 matrices that consist of random floats or doubles. To ensure that there is no mistake on my end, I also repeat this procedure in Python using the cupy library, and the results are very similar.
The test results are ~75 ms per 1 multiplication for floats and ~2,000 ms for doubles.
If I had an older GPU, this would make a lot of sense, as 75 * 32 = 2400 ≈ 2000, so my double-precision performance would be ~32 times poorer, as expected from the table https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions.
However, my GPU has Compute Capability 7.5, therefore I expect degradation of the performance with doubles only by a factor of 2.
Other info: Ubuntu 18 LTS, nvcc 10.2, driver 440.82.
Here is the CUDA code:
#include <iostream>
#include <chrono>
#include <string>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#include <math.h>
#include <stdio.h>
#include <cuda.h>
#include <device_functions.h>
#include <sstream>
#include <time.h>
unsigned long mix(unsigned long a, unsigned long b, unsigned long c)
{
    a=a-b;  a=a-c;  a=a^(c >> 13);
    b=b-c;  b=b-a;  b=b^(a << 8);
    c=c-a;  c=c-b;  c=c^(b >> 13);
    a=a-b;  a=a-c;  a=a^(c >> 12);
    b=b-c;  b=b-a;  b=b^(a << 16);
    c=c-a;  c=c-b;  c=c^(b >> 5);
    a=a-b;  a=a-c;  a=a^(c >> 3);
    b=b-c;  b=b-a;  b=b^(a << 10);
    c=c-a;  c=c-b;  c=c^(b >> 15);
    return c;
}

using namespace std;

int main()
{
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    cudaDeviceProp deviceProp;
    cublasStatus_t err;
    cudaGetDeviceProperties(&deviceProp, 0);
    printf("Detected %d devices \n", deviceCount);
    printf("Device %d has compute capability %d.%d:\n\t maxshmem %d. \n\t maxthreads per block %d. \n\t max threads dim %d. %d. %d.\n ", 0,
           deviceProp.major, deviceProp.minor, deviceProp.sharedMemPerBlock, deviceProp.maxThreadsPerBlock, deviceProp.maxThreadsDim[0],
           deviceProp.maxThreadsDim[1], deviceProp.maxThreadsDim[2]);

    cudaEvent_t start_d, stop_d;
    cudaEventCreate(&start_d);
    cudaEventCreate(&stop_d);

    // RNG initialization
    unsigned long seed = mix(clock(), time(NULL), 0);
    srand(seed);

    int N = 8192;
    int Nloops = 2;
    int memsize = N*N*sizeof(double);
    double *a = (double *)malloc(memsize);
    double *b = (double *)malloc(memsize);
    double *c = (double *)malloc(memsize);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++){
            a[i*N+j] = ((double)rand() / RAND_MAX);
            b[i*N+j] = ((double)rand() / RAND_MAX);
        }

    double *a_d, *b_d, *c_d;
    cudaMalloc((void **)&a_d, memsize);
    cudaMalloc((void **)&b_d, memsize);
    cudaMalloc((void **)&c_d, memsize);
    cudaMemcpy(a_d, a, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, memsize, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    double alpha = 1.0;
    double beta = 0.0;

    auto start = chrono::steady_clock::now();
    clock_t start1;
    start1 = clock();
    cudaEventRecord(start_d);

    // cudaGetLastError() clears the error state, so fetch it once and reuse it
    cudaError_t cuerr = cudaGetLastError();
    if (cuerr != cudaSuccess)
        printf("%s \n", cudaGetErrorString(cuerr));

    for (int i = 0; i < Nloops; i++)
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, a_d, N, b_d, N, &beta, c_d, N);

    cudaEventRecord(stop_d);
    cudaDeviceSynchronize();
    auto end = chrono::steady_clock::now();
    start1 = clock() - start1;
    cudaEventSynchronize(stop_d);
    cublasDestroy(handle);

    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start_d, stop_d);
    std::cout << "Cuda event " << milliseconds / Nloops << " ms" << endl;
    std::cout << " time elapsed " << start1 / (double)CLOCKS_PER_SEC / Nloops << '\n';
    cout << "time elapsed for 1 multiplication: "
         << ((double)chrono::duration_cast<chrono::microseconds>(end-start).count()) / (Nloops*1000.0)
         << " milliseconds" << endl;

    free(a); free(b); free(c);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
}
And this is the Python code that yields consistent results:
import cupy as cp
import time

iterations = 2
a = cp.random.rand(8192, 8192).astype(cp.float64)
b = cp.random.rand(8192, 8192).astype(cp.float64)

def ab(a, b, iterations):
    for i in range(iterations):
        cp.matmul(a, b, out=None)

ab(a, b, 1)  # warm up
cp.cuda.Device(0).synchronize()

t1 = time.time()
ab(a, b, iterations)
cp.cuda.Device(0).synchronize()
t2 = time.time()

total = (t2 - t1) / iterations
print(total)
Ok, I found the answer. The table I linked in my question has a footnote saying that for compute capability 7.5 (which is the case here) the FP64 throughput is 2 results per clock cycle per SM (not 32), while for floats it is 64. That means multiply-add operations on doubles are 32 times slower than on floats.
If both the float and double problems were fully arithmetic-bound, I would expect a slowdown of ~32. In reality, the slowdown is slightly smaller (2000/75 ≈ 27), which may be a consequence of the float problem being bandwidth-bound, or may be related to other things.
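A back-of-envelope check supports this. One N x N GEMM costs about 2*N^3 FLOPs; dividing by the measured times gives the achieved throughput, which can be compared against the card's advertised peaks (~16.3 TFLOPS FP32 for the Titan RTX, hence ~0.51 TFLOPS FP64 at the 1/32 rate; these figures are quoted from NVIDIA's spec sheet from memory, so treat them as approximate):

#include <cstdio>

int main()
{
    const double N = 8192.0;
    const double flops = 2.0 * N * N * N; // ~1.1e12 FLOPs per GEMM
    // Achieved: ~14.7 TFLOP/s FP32 (peak ~16.3), ~0.55 TFLOP/s FP64 (peak ~0.51)
    std::printf("FP32: %.1f TFLOP/s\n", flops / 0.075 / 1e12);
    std::printf("FP64: %.2f TFLOP/s\n", flops / 2.0 / 1e12);
    return 0;
}

Both measurements land close to the peaks, which is consistent with the runs being arithmetic-bound and with the ratio coming out slightly under 32.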

Python implementation faster than C

I apologise if comparisons are not supposed to work this way. I'm new to programming and just curious as to why this is the case.
I have a large binary file containing word embeddings (4.5 GB). Each line has a word followed by its embedding, which is comprised of 300 float values. I'm simply finding the total number of lines.
For C, I use mmap:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

int main(void)
{
    int fd;
    struct stat sb;
    off_t offset = 0, pa_offset;
    size_t length, i;
    char *addr;
    int count = 0;

    fd = open("processed_data/crawl-300d-2M.vec", O_RDONLY);
    if(fd == -1){
        handle_error("open");
    }
    if(fstat(fd, &sb) < 0){
        close(fd);
        handle_error("fstat");
    }
    pa_offset = offset & ~(sysconf(_SC_PAGE_SIZE) - 1);
    if(offset >= sb.st_size){
        fprintf(stderr, "offset is past end of file\n");
        exit(EXIT_FAILURE);
    }
    length = sb.st_size - offset;
    addr = mmap(0, (length + offset - pa_offset), PROT_READ, MAP_SHARED, fd, pa_offset);
    if (addr == MAP_FAILED) handle_error("mmap");

    // Timing only this loop
    clock_t begin = clock();
    for(i = 0; i < length; i++){
        if(*(addr+i) == '\n') count++;
    }
    printf("%d\n", count);
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%f\n", time_spent);
    return 0;
}
This takes 11.283060 seconds.
Python:
import timeit

file = open('processed_data/crawl-300d-2M.vec', 'r')
count = 0
start_time = timeit.default_timer()
for line in file:
    count += 1
print(count)
elapsed = timeit.default_timer() - start_time
print(elapsed)
This takes 3.0633065439997154 seconds.
Doesn't the Python code read each character to find new lines? If so, why is my C code so inefficient?
Hard to say, because I assume it will be heavily implementation dependent. But at first glance, the main difference between your Python and C programs is that the C program uses mmap. It is a very powerful tool (which you do not really need here...) and as such can come with some overhead. Since the reference Python implementation is written in C, it is likely that the loop

for line in file:
    count += 1

ends up as a loop around a tiny C function calling fgets. I would bet a coin that a naive C program using fgets will be slightly faster than the Python equivalent, because it saves all the Python overhead. But IMHO there is no surprise that using mmap in C can be less efficient than fgets through Python.
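For reference, a minimal sketch of such an fgets-based counter (the filename is taken from the question). It counts newline characters rather than fgets calls, so that lines longer than the buffer are not double-counted:

#include <cstdio>
#include <cstring>

int main()
{
    std::FILE *f = std::fopen("processed_data/crawl-300d-2M.vec", "r");
    if (!f) { std::perror("fopen"); return 1; }

    char buf[1 << 16];
    long count = 0;
    while (std::fgets(buf, sizeof buf, f)) {
        // A long line arrives in several chunks; only the chunk that
        // actually contains the '\n' ends a line
        if (std::strchr(buf, '\n')) ++count;
    }
    std::printf("%ld\n", count);
    std::fclose(f);
    return 0;
}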

Disappointing results in pyCUDA benchmark for distance computing between N points

The following script was set up for benchmark purposes. It computes the distance between N points using a Euclidean L2 norm. Three different routines are implemented:
High-level solution using the scipy.spatial.distance.pdist function.
Fairly low-level OpenMP powered scipy.weave.inline solution.
pyCUDA powered GPGPU solution.
Here are the benchmark results on an i5-3470 (16 GB RAM) using a GTX 660 (2 GB RAM):
------------
Scipy Pdist
Execution time: 3.01975 s
First five elements: [ 0.74968684 0.71457213 0.833188 0.48084545 0.86407363]
Last five elements: [ 0.65717077 0.76850474 0.29652017 0.856179 0.56074625]
------------
Weave Inline
Execution time: 2.48705 s
First five elements: [ 0.74968684 0.71457213 0.83318806 0.48084542 0.86407363]
Last five elements: [ 0.65717083 0.76850474 0.29652017 0.856179 0.56074625]
------------
pyCUDA
CUDA clock timing: 0.713028930664
Execution time: 2.04364 s
First five elements: [ 0.74968684 0.71457213 0.83318806 0.48084542 0.86407363]
Last five elements: [ 0.65717083 0.76850468 0.29652017 0.856179 0.56074625]
------------
I am a bit disappointed by the pyCUDA performance. Since I am new to CUDA, there is probably something I am missing here. So where is the crux of the matter? Am I reaching the limits of global memory bandwidth? A poor choice of block and grid sizes?
import numpy,time,math
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from scipy.spatial.distance import pdist
from scipy import weave

def weave_solution(x):
    """
    OpenMP powered weave inline.
    """
    N,DIM = numpy.shape(x)
    L = ((N-1)**2+(N-1))/2
    solution = numpy.zeros(L).astype(numpy.float32)
    ncpu = 4
    weave_omp = {'headers'            : ['<omp.h>'],
                 'extra_compile_args' : ['-fopenmp'],
                 'extra_link_args'    : ['-lgomp']}
    code = \
    r'''
    omp_set_num_threads(ncpu);
    #pragma omp parallel
    {
        int j,d,pos;
        float r=0.0;
        #pragma omp for
        for (int i=0; i<(N-1); i++){
            for (j=(i+1); j<N; j++){
                r = 0.0;
                for (d=0; d<DIM; d++){
                    r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
                }
                pos = (i*N+j)-(i*(i+1)/2)-i-1;
                solution[pos] = sqrt(r);
            }
        }
    }
    '''
    weave.inline(code,['x','N','DIM','solution','ncpu'],**weave_omp)
    return numpy.array(solution)

def scipy_solution(x):
    """
    SciPy High-level function
    """
    return pdist(x).astype(numpy.float32)

def cuda_solution(x):
    """
    pyCUDA
    """
    N,DIM = numpy.shape(x)
    N = numpy.int32(N)
    DIM = numpy.int32(DIM)
    L = ((N-1)**2+(N-1))/2
    solution = numpy.zeros(L).astype(numpy.float32)

    start = drv.Event()
    end = drv.Event()

    mod = SourceModule("""
    __global__ void distance(float *x,int N,int DIM,float *solution){
        const int i = blockDim.x * blockIdx.x + threadIdx.x;
        int j,d,pos;
        float r=0.0;
        if ( i < (N-1) ){
            for (j=(i+1); j<N; j++){
                r = 0.0;
                for (d=0; d<DIM; d++){
                    r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
                }
                pos = (i*N+j)-(i*(i+1)/2)-i-1;
                solution[pos] = sqrt(r);
            }
        }
    }
    """)

    func = mod.get_function("distance")
    start.record()
    func(drv.In(x),N,DIM,drv.Out(solution),block=(192,1,1),grid=(192,1))
    end.record()
    end.synchronize()
    secs = start.time_till(end)*1e-3
    print "CUDA clock timing: ", secs
    return solution

if __name__ == '__main__':

    # Set up data points
    N = 25000
    DIM = 3
    x = numpy.random.rand(N,DIM).astype(numpy.float32)

    print "-"*12
    # Scipy solution
    print "Scipy Pdist"
    stime = time.time()
    spsolution = scipy_solution(x)
    stime = time.time()-stime
    print "Execution time: {0:.5f} s".format(stime)
    print "First five elements:", spsolution[:5]
    print "Last five elements:", spsolution[-5:]
    print "-"*12

    # Weave solution
    print "Weave Inline"
    wtime = time.time()
    wsolution = weave_solution(x)
    wtime = time.time()-wtime
    print "Execution time: {0:.5f} s".format(wtime)
    print "First five elements:", wsolution[:5]
    print "Last five elements:", wsolution[-5:]
    print "-"*12

    # pyCUDA solution
    print "pyCUDA"
    ctime = time.time()
    csolution = cuda_solution(x)
    ctime = time.time()-ctime
    print "Execution time: {0:.5f} s".format(ctime)
    print "First five elements:", csolution[:5]
    print "Last five elements:", csolution[-5:]
    print "-"*12
Edit:
I have added the hashbang line
#!/usr/bin/env python
at the top of the file and made it executable. After commenting out the computation using weave.inline and scipy.spatial.distance.pdist, the NVIDIA Visual Profiler reports the following results:
[NVIDIA Visual Profiler results screenshot]
Right now you have 192 threads each updating N-1 positions; you could easily launch more blocks/threads.
Instead of this loop, for (j=(i+1); j<N; j++){, replace it with N-1 threads doing just the inner loop.
If you want to take it further, you could have (N-1) * DIM threads each doing the statement in the inner loop, store the result to shared memory, and finally do a reduction on that. See Optimizing Parallel Reduction in CUDA.
Looking at this line:
r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
The memory access pattern is not uniform and coalesced. I also do not know whether nvcc will optimize your expression to only two memory transactions instead of the four shown here, as I do not know if pyCUDA passes -O3 to nvcc. Put (x[i*DIM+d]-x[j*DIM+d]) into a register variable to make sure, and just square it yourself.
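A sketch of that rewrite (shown as a plain helper function here; inside the kernel it would be the body of the inner loop, or a __device__ helper):

// Keep the difference in a local (register) variable so each
// operand is loaded once and squared without a second pair of loads
float dist2(const float *x, int i, int j, int DIM)
{
    float r = 0.0f;
    for (int d = 0; d < DIM; d++){
        float diff = x[i*DIM + d] - x[j*DIM + d];
        r += diff * diff;
    }
    return r;
}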
You can also try putting #pragma unroll before each for loop, to unroll them where possible.
