I'm learning to use OpenCL in Python and I wanted to optimize one of my functions. I learned that this can be done by copying data from global memory into local memory. However, it doesn't work as it should: the run time is twice as long. Is this done correctly? Can I optimize this code further?
__kernel void sumOP( __global float *input,
__global float *weights,
int layer_size,
__global float *partialSums,__local float* cache)
{
private const int i = get_global_id(0);
private const int in_layer_s = layer_size;
private const int item_id = get_local_id(0);
private const int group_id = get_group_id(0);
private const int group_count = get_num_groups(0);
const int localsize = get_local_size(0);
for ( int x = 0; x < in_layer_s; x++ )
{
cache[x] = weights[i*in_layer_s + x];
}
float total1 = 0;
for ( int x = 0; x < in_layer_s; x++ )
{
total1 += cache[x] *input[x];
}
partialSums[i] = sigmoid(total1);
}
Python call
l = opencl.LocalMemory(len(inputs))
event = program.sumOP(queue, output.shape, np.random.randn(6,).shape, inputs.data, weights.data,np.int32(len(inputs)),output.data,l)
Thanks for some advice
Besides the data race caused by all work items of a group writing to the same local memory address cache[x] (as Dithermaster said) and the missing barrier() call, some optimizations can be added once those are fixed:
First loop in kernel
for ( int x = 0; x < in_layer_s; x++ )
{
cache[x] = weights[i*in_layer_s + x];
}
scans a different memory area for each work item, one element at a time. This is probably bad for global memory performance: on every loop step, neighboring work items read addresses that are in_layer_s elements apart, so they may hit the same memory channel or even the same memory bank and be served serially. It gets worse as in_layer_s grows, especially if it is a power of 2. To solve this problem, neighboring work items should access contiguous addresses. A GPU works better when global memory is accessed uniformly across work items. In local memory, random access or access with gaps between work items is less of an issue. That's why it is advised to use uniform loads/stores on global memory and to do random/scatter/gather access on local memory.
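To make the access pattern concrete, here is a small sketch in plain Python (not OpenCL) that lists the global addresses neighboring work items touch on the same loop step. The first function uses the question's indexing, weights[i*in_layer_s + x]. The second uses a transposed layout, weights[x*num_items + i], which is only an assumption for illustration: it would require the host to upload the weights transposed, and the name num_items is made up here.

```python
def addresses_row_major(num_items, layer_size, x):
    """Element index read by each work item i at loop step x (question's layout)."""
    return [i * layer_size + x for i in range(num_items)]


def addresses_transposed(num_items, layer_size, x):
    """Element index read by each work item i at loop step x (hypothetical transposed layout)."""
    # layer_size is unused here; it is kept so both functions share a signature.
    return [x * num_items + i for i in range(num_items)]


# Four neighboring work items, 1024 weights per work item, first loop step.
row = addresses_row_major(4, 1024, 0)
col = addresses_transposed(4, 1024, 0)

print(row)  # neighbors are 1024 elements apart
print(col)  # neighbors are adjacent
```

With the question's layout the neighbors are in_layer_s elements apart on every step; with the transposed layout they read adjacent elements, which is the pattern the paragraph above recommends for global memory.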
Second loop in kernel
for ( int x = 0; x < in_layer_s; x++ )
{
total1 += cache[x] *input[x];
}
uses only a single accumulator. This is a dependency chain: each loop iteration must complete before the next one can start. Use at least 2 temporary "total" variables and unroll the loop. Also, if in_layer_s is small enough, the input array could be moved into local or constant memory for faster access, since all work items read the same input array repeatedly (maybe half of the input in constant memory and the other half in local memory, to increase total bandwidth).
Is weights[i*in_layer_s + x] an array of structs? If yes, you can get a speedup by turning it into a struct of arrays, which makes the first loop's optimization unnecessary. This adds some code bloat on the host side, but if speed is the priority, a struct of arrays is both faster and more readable on the GPU side. It also makes it possible to upload only the necessary weights data (one array of the SoA) from the host to the GPU, further decreasing total latency (upload + compute + download).
You can also try asynchronous local<-->global transfer functions to make loading and computing overlapped for each workitem group, to hide even more latency as a last resort. https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/async_work_group_copy.html
I have always used Arduino IDE but now, I am using MuEditor.
I used Arduino IDE and it was easier to compare the current value to its previous value, and output max value after a certain period. Code from the Arduino is shown below.
int sample1 = 0;
void loop() {
int sensorValue = 0;
sensorValue = analogRead(A0);
for (int i = 0; i <= 100; i++) {
if (sensorValue > sample1) {
sample1 = sensorValue;
}
}
Serial.println(sample1 * (5.0 / 1023.0));
}
I want to use the same concept in Mu Editor, and I cannot seem to get it right.
I would like to continuously compare the current value to the previous value and output the max value after a certain period. This is what I came up with in Mu Editor. I would appreciate your help on this.
import time
import board
from analogio import AnalogIn
analog_in = AnalogIn(board.A1)
Sample = 0
def get_voltage(pin):
    return (pin.value * 3.3) / 65536
while True:
    for x in range(1000):
        if Sample < analog_in:
            Sample = analog_in
    print((get_voltage(Sample1),))
    time.sleep(0.1)
Result:
With CircuitPython's analogio.AnalogIn() you need to read the .value property of the AnalogIn instance; the instance itself cannot be compared to a number.
import analogio
import board
with analogio.AnalogIn(board.A1) as pin_sensor:
    pin_sensor.value  # raw 16-bit reading (0-65535), not yet volts
# pin_sensor is closed when leaving the above scope
Just get the max() at each loop if you're discarding the other values
...
result = 0
with analogio.AnalogIn(board.A1) as pin_sensor:
    for x in range(..):
        result = max(result, pin_sensor.value)
print(result)
However, note that, especially with electronics, you more likely want a more robust mechanism:
collect frequently enough to catch the frequency of what you're sampling (see Nyquist-Shannon theorem), ideally as often as you can (however, if you have a collection of digital logic, this may be infeasible, or drive a faster processor choice, etc.)
carefully discard outliers (which can be transients/static/contact bounce..)
refer to a moving average (take the average of the last N samples)
collect with a precise timer and interrupt (such that your sampling rate isn't dependent on unrelated logic)
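As a rough illustration of the moving-average point, here is a minimal sketch in plain Python. The readings are made-up numbers rather than real sensor data, and the class name MovingAverage is hypothetical. On a board, each value would come from pin_sensor.value instead.

```python
class MovingAverage:
    """Average of the last `window` samples."""

    def __init__(self, window):
        self.window = window
        self.samples = []

    def add(self, value):
        # Keep only the most recent `window` samples.
        self.samples.append(value)
        if len(self.samples) > self.window:
            self.samples.pop(0)
        return sum(self.samples) / len(self.samples)


# Made-up raw readings with one transient spike (5000).
readings = [100, 102, 98, 5000, 101, 99]

avg = MovingAverage(window=3)
smoothed = [avg.add(r) for r in readings]
peak = max(smoothed)
```

Note that a plain moving average only dampens the spike; it does not remove it. That is why discarding outliers is listed as a separate step above.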
Is there a Cython-ic way to set a cdef array to zeros? I have a function with the following signature:
cdef cget_values(double[:] cpc_x, double[:] cpc_y):
The function is called as follows:
cdef double cpc_x [16]
cdef double cpc_y [16]
cget_values(cpc_x, cpc_y)
Now the first thing I would like to do is set everything in these arrays to zeros. Currently, I am doing that with a for loop as:
for i in range(16):
    cpc_x[i] = 0.0
    cpc_y[i] = 0.0
I was wondering if this is a reasonable approach without much overhead. I call this function a lot and was wondering if there is a more elegant/faster way to do this in cython.
I assume you are already using @cython.boundscheck(False), so there is not much you can do to improve on it performance-wise.
For readability I would use:
cpc_x[:]=0.0
cpc_y[:]=0.0
Cython translates this into for-loops. An additional advantage: even if @cython.boundscheck(False) isn't used, the resulting C code will still contain no bounds checks (__Pyx_RaiseBufferIndexError). Here is the resulting code for a[:]=0.0:
{
double __pyx_temp_scalar = 0.0;
{
Py_ssize_t __pyx_temp_extent_0 = __pyx_v_a.shape[0];
Py_ssize_t __pyx_temp_stride_0 = __pyx_v_a.strides[0];
char *__pyx_temp_pointer_0;
Py_ssize_t __pyx_temp_idx_0;
__pyx_temp_pointer_0 = __pyx_v_a.data;
for (__pyx_temp_idx_0 = 0; __pyx_temp_idx_0 < __pyx_temp_extent_0; __pyx_temp_idx_0++) {
*((double *) __pyx_temp_pointer_0) = __pyx_temp_scalar;
__pyx_temp_pointer_0 += __pyx_temp_stride_0;
}
}
}
What could improve performance is declaring the memory views as contiguous (i.e. double[::1] instead of double[:]). The resulting C code for a[:]=0.0 would then be:
{
double __pyx_temp_scalar = 0.0;
{
Py_ssize_t __pyx_temp_extent = __pyx_v_a.shape[0];
Py_ssize_t __pyx_temp_idx;
double *__pyx_temp_pointer = (double *) __pyx_v_a.data;
for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
__pyx_temp_pointer += 1;
}
}
}
As one can see, strides[0] is no longer used in the contiguous version: the unit stride is known at compile time, so the resulting C code can be better optimized (see for example here).
One could be tempted to get smart and to use low-level memset-function:
from libc.string cimport memset
memset(&cpc_x[0], 0, 16*sizeof(double))
However, for bigger arrays there will be no difference compared to using a contiguous memory view (i.e. double[::1], see here for example). There might be less overhead for smaller sizes, but I never cared enough to check.
I'm writing a program to find the roots of nth order Legendre Polynomials using c++; my code is attached below:
double* legRoots(int n)
{
double myRoots[n];
double x, dx, Pi = atan2(1,1)*4;
int iters = 0;
double tolerance = 1e-20;
double error = 10*tolerance;
int maxIterations = 1000;
for(int i = 1; i<=n; i++)
{
x = cos(Pi*(i-.25)/(n+.5));
do
{
dx -= legDir(n,x)/legDif(n,x);
x += dx;
iters += 1;
error = abs(dx);
} while (error>tolerance && iters<maxIterations);
myRoots[i-1] = x;
}
return myRoots;
}
This assumes the existence of working functions for the Legendre polynomial and its derivative. I do have them, but I thought including them would make for unreadable walls of code. The function works in the sense that it returns an array of calculated values, but they're wildly off. It outputs the following:
3.95253e-323
6.94492e-310
6.95268e-310
6.42285e-323
4.94066e-323
2.07355e-317
where an equivalent function I've written in Python gives the following:
[-0.90617985 -0.54064082 0. 0.54064082 0.90617985]
I was hoping another set of eyes could help me see what the issue in my C++ code is that's causing the values to be wildly off. I'm not doing anything different in my Python code that I'm doing in C++, so any help anyone could give on this is greatly appreciated, thanks. For reference, I'm mostly trying to emulate the method found on Rosetta code in regards to Gaussian Quadrature: http://rosettacode.org/wiki/Numerical_integration/Gauss-Legendre_Quadrature.
You are returning the address of a local variable that lives on the stack; the array no longer exists once the function returns.
{
double myRoots[n];
...
return myRoots; // Not a safe thing to do
}
I suggest changing your function definition to
void legRoots(int n, double *myRoots)
omitting the return statement, and defining myRoots before calling the function
double myRoots[10];
legRoots(10, myRoots);
Option 2 would be to allocate myRoots dynamically with new or malloc.
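For comparison, the Newton iteration described in the question can be sketched in Python as below. This is not the asker's Python code (which was not posted) and the helper names are made up. It evaluates the polynomial and its derivative with the standard three-term recurrence, and uses a tolerance of 1e-14 as an assumption, since 1e-20 is below what double precision can resolve for values of this size. Note that it assigns dx afresh on every step and restarts the iteration count for each root, which the C++ loop as posted does not appear to do.

```python
import math


def legendre_and_derivative(n, x):
    """Return (P_n(x), P_n'(x)) for |x| < 1 using the three-term recurrence."""
    if n == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, x  # P_0 and P_1
    for k in range(2, n + 1):
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    # Derivative identity; it divides by zero at x = +/-1, which the roots avoid.
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp


def legendre_roots(n, tolerance=1e-14, max_iterations=100):
    """Roots of P_n by Newton's method, starting from the cosine initial guess."""
    roots = []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(max_iterations):  # iteration budget restarts per root
            p, dp = legendre_and_derivative(n, x)
            dx = -p / dp  # a fresh Newton step, not accumulated
            x += dx
            if abs(dx) <= tolerance:
                break
        roots.append(x)
    return roots


roots = legendre_roots(5)
print(roots)
```

Because the function builds and returns a new list, the lifetime problem from the C++ version does not arise here; in C++ the caller-supplied buffer suggested above plays the same role.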
I am using a large CUDA matrix library developed within our organization. I need to save the state of a CUDA RNG to take a snapshot of a long-running simulation, and be able to restore it later. This is simple with, e.g., Python + NumPy:
state = numpy.random.get_state()
# state is a tuple with 5 fields which can be pickled, etc.
...
numpy.random.set_state(state)
I cannot seem to find equivalent functionality in the CUDA host API. You can set the seed and offset, but there is no way to retrieve them in order to save them. The device API seems to offer something like this, but this library uses the host API, and it would be monstrous to change.
The hacky solution I am thinking about is to keep track of the number of calls to the RNG (reset when a seed is set), and simply call an RNG function repeatedly. However, I am not sure whether the function parameters (e.g. matrix shapes) must be identical to reach the same state. Similarly, if the number of calls were equivalent to the offset parameter used when initializing the RNG, this would work as well, i.e., if I call the RNG 200 times, I could set the offset to 200. However, in Python, the offset in the state can increase by more than 1 with each call, so this is also potentially wrong.
Any insights into how to tackle this are appreciated!
For the CURAND Host API, I believe curandSetGeneratorOffset() can probably work for this.
Here's a modified example from the curand host API documentation:
$ cat t721.cu
/*
* This program uses the host CURAND API to generate 10
* pseudorandom floats. And then regenerate those same floats.
*/
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#define CUDA_CALL(x) do { if((x)!=cudaSuccess) { \
printf("Error at %s:%d\n",__FILE__,__LINE__);\
return EXIT_FAILURE;}} while(0)
#define CURAND_CALL(x) do { if((x)!=CURAND_STATUS_SUCCESS) { \
printf("Error at %s:%d\n",__FILE__,__LINE__);\
return EXIT_FAILURE;}} while(0)
int main(int argc, char *argv[])
{
size_t n = 10;
size_t i;
curandGenerator_t gen;
float *devData, *hostData;
/* Allocate n floats on host */
hostData = (float *)calloc(n, sizeof(float));
/* Allocate n floats on device */
CUDA_CALL(cudaMalloc((void **)&devData, n*sizeof(float)));
/* Create pseudo-random number generator */
CURAND_CALL(curandCreateGenerator(&gen,
CURAND_RNG_PSEUDO_DEFAULT));
/* Set seed */
CURAND_CALL(curandSetPseudoRandomGeneratorSeed(gen,
1234ULL));
// generator offset = 0
/* Generate n floats on device */
CURAND_CALL(curandGenerateUniform(gen, devData, n));
// generator offset = n
/* Generate n floats on device */
CURAND_CALL(curandGenerateUniform(gen, devData, n));
// generator offset = 2n
/* Copy device memory to host */
CUDA_CALL(cudaMemcpy(hostData, devData, n * sizeof(float),
cudaMemcpyDeviceToHost));
/* Show result */
for(i = 0; i < n; i++) {
printf("%1.4f ", hostData[i]);
}
printf("\n\n");
CURAND_CALL(curandSetGeneratorOffset(gen, n));
// generator offset = n
CURAND_CALL(curandGenerateUniform(gen, devData, n));
// generator offset = 2n
/* Copy device memory to host */
CUDA_CALL(cudaMemcpy(hostData, devData, n * sizeof(float),
cudaMemcpyDeviceToHost));
/* Show result */
for(i = 0; i < n; i++) {
printf("%1.4f ", hostData[i]);
}
printf("\n");
/* Cleanup */
CURAND_CALL(curandDestroyGenerator(gen));
CUDA_CALL(cudaFree(devData));
free(hostData);
return EXIT_SUCCESS;
}
$ nvcc -o t721 t721.cu -lcurand
$ ./t721
0.7816 0.2338 0.6791 0.2824 0.6299 0.1212 0.4333 0.3831 0.5136 0.2987
0.7816 0.2338 0.6791 0.2824 0.6299 0.1212 0.4333 0.3831 0.5136 0.2987
$
So you'll need to keep track of the quantity of random numbers generated (not the number of RNG function calls) up to the point when you do your checkpoint, and save that.
When you restart, initialize the generator in the same way:
/* Create pseudo-random number generator */
CURAND_CALL(curandCreateGenerator(&gen,
CURAND_RNG_PSEUDO_DEFAULT));
/* Set seed */
CURAND_CALL(curandSetPseudoRandomGeneratorSeed(gen,
1234ULL));
but then advance by the number of previously generated values (n):
CURAND_CALL(curandSetGeneratorOffset(gen, n));
So, it is possible to store and restore the state by tracking the number of 32-bit values generated and passing that count to curandSetGeneratorOffset. The algorithm looks something like:
template<typename T> void RNG(T* X, size_t N /* number of values */){
...
if (sizeof(T) == 1)
    offset += (N+4-1)/4;   // four 8-bit values per 32-bit word, rounded up
else if (sizeof(T) == 2)
    offset += (N+2-1)/2;   // two 16-bit values per 32-bit word, rounded up
else if (sizeof(T) == 4 || USING_GENERATE_UNIFORM_DOUBLE)
    offset += N;
else if (sizeof(T) == 8)
    offset += 2*N;
}
For N generated 8-bit values, advance the offset by N/4 rounded up; for 16-bit values, by N/2 rounded up. For 32-bit values advance by N, and for 64-bit values advance by 2*N.
HOWEVER, if you use GenerateUniformDouble, you only need to advance by N, not 2*N. I'm not sure why.
Thanks for the help!
I have seen people using a buffer in different languages for fast input/output in online judges. For example, this problem http://www.spoj.pl/problems/INTEST/ is solved in C like this:
#include <stdio.h>
#define size 50000
int main (void){
unsigned int n=0,k,t;
char buff[size];
unsigned int divisible=0;
int block_read=0;
int j;
t=0;
scanf("%u %u\n",&t,&k); /* %u matches unsigned int */
while(t){
block_read =fread(buff,1,size,stdin);
for(j=0;j<block_read;j++){
if(buff[j]=='\n'){
t--;
if(n%k==0){
divisible++;
}
n=0;
}
else{
n = n*10 + (buff[j] - '0');
}
}
}
printf("%u",divisible);
return 0;
}
How can this be done with python?
import sys
file = sys.stdin
size = 50000
t = 0
while(t != 0)
block_read = file.read(size)
...
...
Most probably this will not increase performance, though: Python is an interpreted language, so you basically want to spend as much time as possible in native code (the standard library input/parsing routines in this case).
TL;DR either use built-in routines to parse integers or get some sort of 3rd party library which is optimized for speed.
I tried solving this one in Python 3 and couldn't get it to work no matter how I tried reading the input. I then switched to running it under Python 2.5 so I could use
import psyco
psyco.full()
After making that change I was able to get it to work by simply reading input from sys.stdin one line at a time in a for loop. I read the first line using raw_input() and parsed the values of n and k, then used the following loop to read the remainder of the input.
count = 0
for line in sys.stdin:
    count += not int(line) % k
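A Python 3 sketch of the "use built-in routines" advice: read the whole input in one call, split it on whitespace, and let int() do the parsing, so most of the work happens in native code. The function name count_divisible is made up for illustration, and this sketch makes no promise about meeting the judge's time limit.

```python
import io
import sys


def count_divisible(stream):
    """Count how many of the n input integers are divisible by k.

    `stream` is a binary stream such as sys.stdin.buffer. The first two
    tokens are n and k, followed by n integers.
    """
    tokens = stream.read().split()  # one bulk read, split on any whitespace
    n, k = int(tokens[0]), int(tokens[1])
    return sum(1 for tok in tokens[2:2 + n] if int(tok) % k == 0)


# Small self-check with in-memory data in the problem's input format.
sample = io.BytesIO(b"7 3\n1\n51\n966369\n7\n9\n999996\n11\n")
sample_count = count_divisible(sample)

# On the judge one would call: print(count_divisible(sys.stdin.buffer))
```

Reading bytes from sys.stdin.buffer skips text decoding, and int() accepts the byte tokens directly.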