I'm writing a program to find the roots of nth order Legendre Polynomials using c++; my code is attached below:
double* legRoots(int n)
{
double myRoots[n];
double x, dx, Pi = atan2(1,1)*4;
int iters = 0;
double tolerance = 1e-20;
double error = 10*tolerance;
int maxIterations = 1000;
for(int i = 1; i<=n; i++)
{
x = cos(Pi*(i-.25)/(n+.5));
do
{
dx -= legDir(n,x)/legDif(n,x);
x += dx;
iters += 1;
error = abs(dx);
} while (error>tolerance && iters<maxIterations);
myRoots[i-1] = x;
}
return myRoots;
}
Assuming the existence of functioning Legendre Polynomial and Legendre Polynomial derivative generating functions, which I do have but I thought that would make for unreadable walls of code text. This function is functioning in the sense that it's returning an array calculated values, but they're wildly off, outputting the following:
3.95253e-323
6.94492e-310
6.95268e-310
6.42285e-323
4.94066e-323
2.07355e-317
where an equivalent function I've written in Python gives the following:
[-0.90617985 -0.54064082 0. 0.54064082 0.90617985]
I was hoping another set of eyes could help me see what the issue in my C++ code is that's causing the values to be wildly off. I'm not doing anything different in my Python code that I'm doing in C++, so any help anyone could give on this is greatly appreciated, thanks. For reference, I'm mostly trying to emulate the method found on Rosetta code in regards to Gaussian Quadrature: http://rosettacode.org/wiki/Numerical_integration/Gauss-Legendre_Quadrature.
You are returning an address to a temporary variable in stack
{
double myRoots[n];
...
return myRoots; // Not a safe thing to do
}
I suggest changing your function definition to
void legRoots(int n, double *myRoots)
omitting the return statement, and defining myroots before calling the function
double myRoots[10];
legRoots(10, myRoots);
Option 2 would be to allocate myRoots dynamically with new or malloc.
Related
I created Python Bindings using pybind11. Everything worked perfectly, but when I did a speed check test the result was disappointing.
Basically, I have a function in C++ that adds two numbers and I want to use that function from a Python script. I also included a for loop to ran 100 times to better view the difference in the processing time.
For the function "imported" from C++, using pybind11, I obtain: 0.002310514450073242 ~ 0.0034799575805664062
For the simple Python script, I obtain: 0.0012788772583007812 ~ 0.0015883445739746094
main.cpp file:
#include <pybind11/pybind11.h>
namespace py = pybind11;
double sum(double a, double b) {
return a + b;
}
PYBIND11_MODULE(SumFunction, var) {
var.doc() = "pybind11 example module";
var.def("sum", &sum, "This function adds two input numbers");
}
main.py file:
from build.SumFunction import *
import time
start = time.time()
for i in range(100):
print(sum(2.3,5.2))
end = time.time()
print(end - start)
CMakeLists.txt file:
cmake_minimum_required(VERSION 3.0.0)
project(Projectpybind11 VERSION 0.1.0)
include(CTest)
enable_testing()
add_subdirectory(pybind11)
pybind11_add_module(SumFunction main.cpp)
set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)
Simple Python script:
import time
def summ(a,b):
return a+b
start = time.time()
for i in range(100):
print(summ(2.3,5.2))
end = time.time()
print(end - start)
Benchmarking is a very complicated thing, even can be called as a Systemic Engineering.
Because there are many processes will interference our benchmarking job. For
example: NIC interrupt responsing / keyboard or mouse input / OS scheduling...
I have encountered my producing process being blocked by OS for up to 15 seconds!
So as the other advisors have pointed out, the print() invokes more
unnecessary interference.
Your testing computation is too simple.
You must think it out clearly what are you comparing for.
The speed of passing arguments between Python and C++ is obviously slower than
that of within Python side. So I assume that you want to compare the computing
speed of both, instead of arguments passing speed.
If so, I think your computing codes are too simple, and these will lead to the
time we counted is mainly the time for passing args, while the time for
computing is merely the minor of the total.
So, I put out my sample below, I will be glad to see anyone polish it.
Your loop count is too less.
The less loops, the more randomness. Similar with my opinion 1, testing time
is merely 0.000x second. It is possible, that the running process be interferenced by OS.
I think we should make the testing time to last at least a few of seconds.
C++ is not always faster than Python. Now time there are so many Python
modules/libs can use GPU to execute heavy computation, and parallelly do matrix operations even only by using CPU.
I guess that perhaps you are evaluating whether or not using Pybind11 in your project. I think that comparing like this worth nothing, because what is the best tool depends on what is the real requirement, but it is a good lesson to learn things.
I recently encountered a case, Python is faster than C++ in a Deep Learning.
Haha, funny?
At the end, I run my sample in my PC, and found that the C++ computing speed is faster up to 100 times than that in Python. I hope it be helpful for you.
If anyone would please revise/correct my opinions, it's my pleasure!
Pls forgive my ugly English, I hope I have expressed things correctly.
ComplexCpp.cpp:
#include <cmath>
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>
namespace py = pybind11;
double Compute( double x, py::array_t<double> ys ) {
// std::cout << "x:" << std::setprecision( 16 ) << x << std::endl;
auto r = ys.unchecked<1>();
for( py::ssize_t i = 0; i < r.shape( 0 ); ++i ) {
double y = r( i );
// std::cout << "y:" << std::setprecision( 16 ) << y << std::endl;
x += y;
x *= y;
y = std::max( y, 1.001 );
x /= y;
x *= std::log( y );
}
return x;
};
PYBIND11_MODULE( ComplexCpp, m ) {
m.def( "Compute", &Compute, "a more complicated computing" );
};
tryComplexCpp.py
import ComplexCpp
import math
import numpy as np
import random
import time
def PyCompute(x: float, ys: np.ndarray) -> float:
#print(f'x:{x}')
for y in ys:
#print(f'y:{y}')
x += y
x *= y
y = max(y, 1.001)
x /= y
x *= math.log(y)
return x
LOOPS: int = 100000000
if __name__ == "__main__":
# initialize random
x0 = random.random()
""" We store all args in a array, then pass them into both C++ func and
python side, to ensure that args for both sides are same. """
args = np.ndarray(LOOPS, dtype=np.float64)
for i in range(LOOPS):
args[i] = random.random()
print('Args are ready, now start...')
# try it with C++
start_time = time.time()
x = ComplexCpp.Compute(x0, args)
print(f'Computing with C++ in { time.time() - start_time }.\n')
# forcely use the result to prevent the entire procedure be optimized(omit)
print(f'The result is {x}\n')
# try it with python
start_time = time.time()
x = PyCompute(x0, args)
print(f'Computing with Python in { time.time() - start_time }.\n')
# forcely use the result to prevent the entire procedure be optimized(omit)
print(f'The result is {x}\n')
I'm learning to use opencl in python and I wanted to optimize one of the function. I learned that this can be done by storing global memory in local memory. However, it doesn't work as it should, the duration is twice as long. This is well done? Can I optimize this code more?
__kernel void sumOP( __global float *input,
__global float *weights,
int layer_size,
__global float *partialSums,__local float* cache)
{
private const int i = get_global_id(0);
private const int in_layer_s = layer_size;
private const int item_id = get_local_id(0);
private const int group_id = get_group_id(0);
private const int group_count = get_num_groups(0);
const int localsize = get_local_size(0);
for ( int x = 0; x < in_layer_s; x++ )
{
cache[x] = weights[i*in_layer_s + x];
}
float total1 = 0;
for ( int x = 0; x < in_layer_s; x++ )
{
total1 += cache[x] *input[x];
}
partialSums[i] = sigmoid(total1);
}
Python call
l = opencl.LocalMemory(len(inputs))
event = program.sumOP(queue, output.shape, np.random.randn(6,).shape, inputs.data, weights.data,np.int32(len(inputs)),output.data,l)
Thanks for some advice
Besides doing a data write race condition with writing to same shared memory address cache[x] by all workitems of a group (as Dithermaster said) and lack of barrier() function, some optimizations can be added after those are fixed:
First loop in kernel
for ( int x = 0; x < in_layer_s; x++ )
{
cache[x] = weights[i*in_layer_s + x];
}
scans a different memory area for each work item, one element at a time. This is probably wrong in terms of global memory performance because each workitem in their own loop, could be using same memory channel or even same memory bank, hence, all workitems access that channel or bank serially. This is worse if in_layer_s gets a larger value and especially if it is power of 2. To solve this problem, all workitems should access contiguous addresses with their neighbors. GPU works better when global memory is accessed uniformly with workitems. On local memory, it is less of an issue to access randomly or with gaps between workitems. Thats why its advised to use uniform save/load on global while doing random/scatter/gather on local.
Second loop in kernel
for ( int x = 0; x < in_layer_s; x++ )
{
total1 += cache[x] *input[x];
}
is using only single accumulator. This a dependency chain that needs each loop cycle to be completed before moving on to next. Use at least 2 temporary "total" variables and unroll the loop. Here, if in_layer_s is small enough, input array could be moved into local or constant memory to access it faster (repeatedly by all workitems, since all workitems access same input array) (maybe half of input to constant memory and other half to local memory to increase total bandwidth)
Is weights[i*in_layer_s + x]; an array of structs? If yes, you can achieve a speedup by making it a struct of arrays and get rid of first loop's optimization altogether, with an increase of code bloat in host side but if priority is speed then struct of arrays is both faster and readable on gpu side. This also makes it possible to upload only the necessary weights data(an array of SOA) to gpu from host side, decreasing total latency (upload + compute + download) further.
You can also try asynchronous local<-->global transfer functions to make loading and computing overlapped for each workitem group, to hide even more latency as a last resort. https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/async_work_group_copy.html
I'v written a Python extension module with C to speed up computation times. The first step is a 2D integration of a function f(x,y,k), which is very fast and allows me to integrate over y in [y1(x),y2(x)] and x in [a,b] whilst assigning a float to k. But I really need to integrate k over the range [c,d]. Currently, I'm doing something like this in Python
inner = lambda k: calc.kernel(l,k,ki)
I = quad(inner,c,d)[0]
where calc is my C-extension module and calc.kernel calls gauss2 to perform 2D integration. l and ki are just other variables. But with my data, quad still takes many hours to finish. I would like to do all calculations within the C-extension module, but I'm really stumped on how to implement this outer integral. Here is my C-code
#include <Python.h>
#include <math.h>
double A96[96]={ /* abscissas for 96-point Gauss quadrature */
};
double W96[96]={ /* weights for 96-point Gauss quadrature */
};
double Y1(double x){
return 0;
}
double Y2(double x){
return x;
}
double gauss1(double F(double),double a,double b)
{ /* 96-pt Gauss qaudrature integrates F(x) from a to b */
int i;
double cx,dx,q;
cx=(a+b)/2;
dx=(b-a)/2;
q=0;
for(i=0;i<48;i++)
q+=W96[i]*(F(cx-dx*A96[i])+F(cx+dx*A96[i]));
return(q*dx);
}
double gauss2(double F(double,double,int,double,double),double Y1(double),double Y2(double),double a,double b,int l,double k, double ki)
{/* 96x96-pt 2-D Gauss qaudrature integrates
F(x,y) from y=Y1(x) to Y2(x) and x=a to b */
int i,j,h;
double cx,cy,dx,dy,q,w,x,y1,y2;
cx=(a+b)/2;
dx=(b-a)/2;
q=0;
for(i=0;i<48;i++)
{
for(h=-1;h<=1;h+=2)
{
x=cx+h*dx*A96[i];
y1=Y1(x);
y2=Y2(x);
cy=(y1+y2)/2;
dy=(y2-y1)/2;
w=dy*W96[i];
for(j=0;j<48;j++)
q+=w*W96[j]*(F(x,cy-dy*A96[j],l,k,ki)+F(x,cy+dy*A96[j],l,k,ki));
}
}
return(q*dx);
}
double ps_fact(double z){
double M = 0.3;
return 3/2*(M*(1+z)*(1+z)*(1+z) + (1-M))*(M*(1+z)*(1+z)*(1+z) + (1-M))*(M*(1+z)*(1+z)*(1+z) + (1-M))/(1+z)/(1+z);
}
double drdz(double z){
double M = 0.3;
return 3000/sqrt(M*(1+z)*(1+z)*(1+z) + (1-M));
}
double rInt(double z){
double M = 0.3;
return 3000/sqrt(M*(1+z)*(1+z)*(1+z) + (1-M));
}
double kernel_func ( double y , double x, int l,double k, double ki) {
return ps_fact(y)*ki*rInt(x)*sqrt(M_PI/2/rInt(x))*jn(l+0.5,ki*rInt(x))*drdz(x)*(rInt(x)-rInt(y))/rInt(y)*sqrt(M_PI/2/rInt(y))*jn(l+0.5,k*rInt(y))*drdz(y);
}
static PyObject* calc(PyObject* self, PyObject* args)
{
int l;
double k, ki;
if (!PyArg_ParseTuple(args, "idd", &l, &k, &ki))
return NULL;
double res;
res = gauss2(kernel_func,Y1, Y2, 0,10,l, k, ki);
return Py_BuildValue("d", res);
}
static PyMethodDef CalcMethods[] = {
{"kernel", calc, METH_VARARGS, "Calculates kernel values."},
{NULL, NULL, 0, NULL}
};
PyMODINIT_FUNC initcalc(void){
(void) Py_InitModule("calc", CalcMethods);
A96 and W96 both contain the points for the Gaussian quadrature, so don't worry that they are empty here. I should add I don't take any credit for the functions gauss1 and gauss2.
EDIT: python code was wrong - edited now.
Maybe the source code for scipy integrate quad is a good place to start if you haven't looked there : https://github.com/scipy/scipy/blob/v0.17.0/scipy/integrate/quadpack.py#L45-L360
Looks like most of the work is already being done by native Fortran code, which is normally either as fast or faster than C/C++ code. You will be hard pressed to improve on that, unless you create/find a CUDA implementation.
You make the Fortran code multithreaded, if it's not already and the source is open. Lastly, you could make a threading dispatcher in C/Fortran (python doesn't support real threading because of the GIL) and just make your calls to quad parallel from one another atleast. Interfacing calc directly with Fortran quad would probably save you some decent overhead too.
Is it possible to retrieve the colour of a pixel on the screen in C in Rasbmc (Raspberry Pi)?
My plan is to use this along with WiringPi to control RGB LEDs according to the averaged colour at certain points on the screen.
I've looked into various options including using Python. One site provides examples which seems useful but I am yet to make them work. I have tried both the Python and C Examples.
The code I am using is the following:
#include <X11/Xlib.h>
void get_pixel_color (Display *d, int x, int y, XColor *color)
{
XImage *image;
image = XGetImage (d, RootWindow (d, DefaultScreen (d)), x, y, 1, 1, AllPlanes, XYPixmap);
color->pixel = XGetPixel (image, 0, 0);
XFree (image);
XQueryColor (d, DefaultColormap(d, DefaultScreen (d)), color);
}
XColor c;
get_pixel_color(display, 30, 40, &c);
printf ("%d %d %d\n", c.red, c.green, c.blue);
However I can't seem to get it to work. These are the errors I get when I use the C example:
What I don't understand is the first error I get (the others seem to be related to the formating of the printf function so I guess they can be ignored?). What does it mean by numeric constants, does it mean the x and y components of the get_pixel_color function? That seems weird to me but I know I must be misunderstanding something here!
Your call to get_pixelcolor isn't in a function.
You seem to be lacking a main function altogether, unless I'm missing something completely. You're calling functions OUTSIDE of a program block. In C, that formatting is reserved for prototyping. Your compiler is expecting a prototype (Prototyping is describing a function roughly, "takes these arguments, returns this" , before you actually describe the implementation of it. If you're unfamiliar with the terminology.) , when you're feeding it a function.
At the very least you're looking at:
int main(){
XColor c;
get_pixel_color(display, 30, 40, &c);
printf ("%d %d %d\n", c.red, c.green, c.blue);
return 0;
}
Give that a try, and see if it works.
If this is part of a larger program which already has a main, just call the function something different and call it from main.
Is this your actual .c file, or just a snippet? If it your actual file (judging by the line numbers, it is), then you will need to put the last three lines in a main() function:
int main (int argc, char **argv) {
XColor c;
get_pixel_color(display, 30, 40, &c);
printf ("%d %d %d\n", c.red, c.green, c.blue);
return 0;
}
I am a Python n00b and at the risk of asking an elementary question, here I go.
I am porting some code from C to Python for various reasons that I don't want to go into.
In the C code, I have some code that I reproduce below.
float table[3][101][4];
int kx[6] = {0,1,0,2,1,0};
int kz[6] = {0,0,1,0,1,2};
I want an equivalent Python expression for the C code below:
float *px, *pz;
int lx = LX; /* constant defined somewhere else */
int lz = LZ; /* constant defined somewhere else */
px = &(table[kx[i]][0][0])+lx;
pz = &(table[kz[i]][0][0])+lz;
Can someone please help me by giving me the equivalent expression in Python?
Here's the thing... you can't do pointers in python, so what you're showing here is not "portable" in the sense that:
float *px, *pz; <-- this doesn't exist
int lx = LX; /* constant defined somewhere else */
int lz = LZ; /* constant defined somewhere else */
px = &(table[kx[i]][0][0])+lx;
pz = &(table[kz[i]][0][0])+lz;
^ ^ ^
| | |
+----+----------------------+---- Therefore none of this makes any sense...
What you're trying to do is have a pointer to some offset in your multidimensional array table, because you can't do that in python, you don't want to "port" this code verbatim.
Follow the logic beyond this, what are you doing with px and pz? That is the code you need to understand to try and port.
There is no direct equivalent for your C code, since Python has no pointers or pointer arithmetic. Instead, refactor your code to index into the table with bracket notation.
table[kx[i]][0][lx] = 3
would be a rough equivalent of the C
px = &(table[kx[i]][0][0])+lx;
*px = 3;
Note that in Python, your table would not be contiguous. In particular, while this might work in C:
px[10] = 3; // Bounds violation!
This will IndexError in Python:
table[kx[i]][0][lx + 10] = 3