C running slower than PyPy - python

I'm running the two programs below. They both perform the same mathematical procedure (summing a series out to a large number of terms) and, as expected, produce the same output.
But for some reason, the PyPy code runs significantly faster than the C code.
I cannot figure out why this is happening, as I expected the C code to be faster.
I'd be thankful if anyone could help me by clarifying this (maybe there is a better way to write the C code?).
C code:
#include <stdio.h>
#include <math.h>

int main()
{
    double Sum = 0.0;
    long n;
    for (n = 2; n < 1000000000; n = n + 1) {
        double Sign;
        Sign = pow(-1.0, n % 2);
        double N;
        N = (double) n;
        double Sqrt;
        Sqrt = sqrt(N);
        double InvSqrt;
        InvSqrt = 1.0 / Sqrt;
        double Ln;
        Ln = log(N);
        double LnSq;
        LnSq = pow(Ln, 2.0);
        double Term;
        Term = Sign * InvSqrt * LnSq;
        Sum = Sum + Term;
    }
    double Coeff;
    Coeff = Sum / 2.0;
    printf("%0.14f \n", Coeff);
    return 0;
}
PyPy code (PyPy is a faster implementation of Python):
from math import log, sqrt

Sum = 0
for n in range(2, 1000000000):
    Sum += ((-1)**(n % 2) * (log(n))**2) / sqrt(n)
print(Sum / 2)

This is not surprising: PyPy performs a number of run-time optimizations by default, whereas C compilers by default perform no optimization at all. Dave Beazley's 2012 PyCon keynote covers this explicitly and provides a deep explanation of why it happens.
Per the referenced talk, C should surpass PyPy when compiled at optimization level 2 or 3 (you can watch the full section on the performance of Fibonacci generation in CPython, PyPy, and C starting here).
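For example, a typical GCC invocation with optimization enabled might look like this (the file name is only illustrative; -lm links the math library):
gcc -O2 -o series series.c -lm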

In addition to the compiler's optimization level, you can improve the code itself:
int main()
{
    double Sum = 0.0;
    long n;
    for (n = 2; n < 1000000000; ++n)
    {
        double N = n; // cast is implicit, only for code readability, no effect on runtime!
        double Sqrt = sqrt(N);
        //double InvSqrt;     // spare that:
        //InvSqrt = 1.0/Sqrt; // you spare this division with!
        double Ln = log(N);
        double LnSq;
        //LnSq = pow(Ln,2.0);
        LnSq = Ln*Ln; // more efficient
        double Term;
        //Term = Sign * InvSqrt * LnSq;
        Term = LnSq / Sqrt;
        if (n % 2)
            Term = -Term; // just negating, no multiplication
                          // (IEEE provided: just one bit inverted)
        Sum = Sum + Term;
    }
    // ...
// ...
Now we can simplify the code a little more:
int main()
{
    double Sum = 0.0;
    for (long n = 2; n < 1000000000; ++n)
    //   ^^^^ possible since C99, better scope, no runtime effect
    {
        double N = n;
        double Ln = log(N);
        double Term = Ln * Ln / sqrt(N);
        if (n % 2)
            Sum -= Term;
        else
            Sum += Term;
    }
    // ...
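For reference, here is a complete, compilable version of the simplified program, assembled from the fragments above (a sketch; the output format follows the original):
#include <stdio.h>
#include <math.h>

int main()
{
    double Sum = 0.0;
    for (long n = 2; n < 1000000000; ++n)
    {
        double N = n;
        double Ln = log(N);
        double Term = Ln * Ln / sqrt(N); // |term| = ln(n)^2 / sqrt(n)
        if (n % 2)
            Sum -= Term; // odd n: negative term
        else
            Sum += Term; // even n: positive term
    }
    printf("%0.14f \n", Sum / 2.0);
    return 0;
}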

Related

Inconsistent value returned from C++ function vs Python function for skew normal distribution

I have a function to estimate the alpha parameter of the skew normal distribution in C++ and in Python. The Python function is written using NumPy and the C++ function uses the STL. My issue is that my C++ implementation gives me incorrect results. The two functions are essentially identical, but the Python version gives correct results whereas the C++ one does not. I have investigated this in some detail and cannot come to a conclusion as to what's causing the error; any help would be great.
Python Function
import numpy as np

def convert_to_alpha(skew):
    a = np.pi/2
    skew_ = abs(skew)
    numerator = np.power(skew_, (2/3))
    b = (4-np.pi)/2
    b = np.power(b, (2/3))
    denom = numerator + b
    delta = np.sqrt(a * (numerator/denom))
    a = delta/np.sqrt((1-np.power(delta, 2)))
    return a * np.sign(skew)
C++ Function
double convert_to_alpha(double skew)
{
    double pi = 3.141592653589793;
    double a = pi / 2;
    double skew_ = std::abs(skew);
    double numerator = std::pow(skew_, (2 / 3));
    double b = (4 - pi) / 2;
    b = std::pow(b, (2 / 3));
    double denom = numerator + b;
    double delta = std::sqrt(a * (numerator / denom));
    double alpha = delta / std::sqrt((1 - std::pow(delta, 2)));
    if (skew == 0) { return 0; }
    else if (std::signbit(skew) == 1) { return -1 * alpha; }
    else return alpha;
}
The Python function returns the values I would expect, whereas the C++ function does not. For example, for input 0.99 I'd expect 27.85xxxx, and for input 0.5 I'd expect 2.17xxxx, which is exactly what I get from the Python implementation; C++ gives me 1.91306.
Also, strangely, the C++ implementation seems to return 1.91306 regardless of the input.
Driver code for C++
#include <cmath>
#include <math.h>
#include <iostream>

int main()
{
    double convert_to_alpha(double skew);
    std::cout << "skew: " << convert_to_alpha(0.99);
    return 0;
}

double convert_to_alpha(double skew)
{
    double pi = 3.141592653589793;
    double a = pi / 2;
    double skew_ = std::abs(skew);
    double numerator = std::pow(skew_, (2 / 3));
    double b = (4 - pi) / 2;
    b = std::pow(b, (2 / 3));
    double denom = numerator + b;
    double delta = std::sqrt(a * (numerator / denom));
    double alpha = delta / std::sqrt((1 - std::pow(delta, 2)));
    if (skew == 0) { return 0; }                              // if skew is 0 return 0
    else if (std::signbit(skew) == 1) { return -1 * alpha; }  // if skew is negative return -alpha
    else return alpha;                                        // if skew is positive return alpha
}
I'd expect the results to be very similar, definitely not as different as they are currently. I have not encountered an issue like this before so any help figuring out what's causing the inconsistency with the C++ implementation would be very helpful.
You're using integer literals in what I can only assume are intended to be floating-point operations.
Lines such as
double numerator = std::pow(skew_, (2 / 3));
will resolve to
double numerator = std::pow(skew_, 0);
because 2 / 3 is integer division, which truncates to 0.
If you want such divisions to keep their mathematical value, make sure at least one of the operands is of a floating-point type:
double numerator = std::pow(skew_, (2.0 / 3.0));
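A minimal demonstration of the difference (the base 0.5 is just an illustrative value):
#include <cmath>
#include <iostream>

int main()
{
    std::cout << (2 / 3) << "\n";                  // 0: integer division truncates
    std::cout << (2.0 / 3.0) << "\n";              // 0.666667
    std::cout << std::pow(0.5, 2 / 3) << "\n";     // pow(0.5, 0) == 1
    std::cout << std::pow(0.5, 2.0 / 3.0) << "\n"; // ~0.63, the intended result
    return 0;
}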

Why does this algorithm work so much faster in python than in C++?

I was reading "Algorithms in C++" by Robert Sedgewick and was given this exercise: rewrite this weighted quick-union with path compression by halving algorithm in another programming language.
The algorithm is used to check whether two objects are connected. For example, for entries like 1 - 2, 2 - 3 and 1 - 3, the first two entries create new connections, whereas in the third entry 1 and 3 are already connected, because 3 can be reached from 1 via 1 - 2 - 3, so the third entry does not require creating a new connection.
Sorry if the algorithm description is hard to follow; English is not my mother tongue.
So here is the algorithm itself:
#include <iostream>
#include <cstdlib> // for rand()/srand()
#include <ctime>

using namespace std;

static const int N {100000};

int main()
{
    srand(time(NULL));
    int i;
    int j;
    int id[N];
    int sz[N];    // Stores tree sizes
    int Ncount{}; // Counts the number of new connections
    int Mcount{}; // Counts the number of all attempted connections
    for (i = 0; i < N; i++)
    {
        id[i] = i;
        sz[i] = 1;
    }
    while (Ncount < N - 1)
    {
        i = rand() % N;
        j = rand() % N;
        for (; i != id[i]; i = id[i])
            id[i] = id[id[i]];
        for (; j != id[j]; j = id[j])
            id[j] = id[id[j]];
        Mcount++;
        if (i == j) // Checks if i and j are connected
            continue;
        if (sz[i] < sz[j]) // Smaller tree will be
                           // connected to a bigger one
        {
            id[i] = j;
            sz[j] += sz[i];
        }
        else
        {
            id[j] = i;
            sz[i] += sz[j];
        }
        Ncount++;
    }
    cout << "Mcount: " << Mcount << endl;
    cout << "Ncount: " << Ncount << endl;
    return 0;
}
I know a tiny bit of Python, so I chose it for this exercise. This is what I got:
import random

N = 100000
idList = list(range(0, N))
sz = [1] * N
Ncount = 0
Mcount = 0
while Ncount < N - 1:
    i = random.randrange(0, N)
    j = random.randrange(0, N)
    while i is not idList[i]:
        idList[i] = idList[idList[i]]
        i = idList[i]
    while j is not idList[j]:
        idList[j] = idList[idList[j]]
        j = idList[j]
    Mcount += 1
    if i is j:
        continue
    if sz[i] < sz[j]:
        idList[i] = j
        sz[j] += sz[i]
    else:
        idList[j] = i
        sz[i] += sz[j]
    Ncount += 1
print("Mcount: ", Mcount)
print("Ncount: ", Ncount)
But I stumbled upon an interesting nuance: when I set N to 100000 or more, the C++ version appears to be a lot slower than the Python one. The Python version took about 10 seconds to complete the task, whereas the C++ version was running so slowly I just had to shut it down.
So my question is: what is the cause of that? Does this happen because of the difference between rand() % N and random.randrange(0, N)? Or have I just done something wrong?
I'd be very grateful if someone could explain this to me, thanks in advance!
Those codes do different things.
You have to compare numbers in Python with ==.
>>> x=100000
>>> y=100000
>>> x is y
False
There might be other problems, haven't checked. Have you compared the results of the apps?
As pointed out above, the codes are not equivalent, especially when it comes to the use of is vs ==.
Look at the following Python code:
while i is not idList[i]:
    idList[i] = idList[idList[i]]
    i = idList[i]
This is evaluated 0 or 1 times. Why? Because if the while condition evaluates to True the first time, then i = idList[i] makes the condition False on the second pass, because now i is for sure an object taken from idList.
The equivalent C++:
for (; i != id[i]; i = id[i])
    id[i] = id[id[i]];
Here the code checks value equality rather than object identity, and the number of times it runs is not fixed at 0 or 1.
So yes, using is vs == makes a huge difference, because in Python is tests object identity (being the very same object) rather than value equality.
Comparing the Python and C++ code above is like comparing apples and pears.
Note: the short answer to the question is that the Python version runs much faster because it is doing a lot less work than the C++ version.

Eigen Matrix vs Numpy Array multiplication performance

I read in this question that Eigen has very good performance. However, I tried to compare Eigen MatrixXi multiplication speed against NumPy array multiplication, and NumPy performs better (~26 seconds vs. ~29). Is there a more efficient way to do this in Eigen?
Here is my code:
Numpy:
import numpy as np
import time
n_a_rows = 4000
n_a_cols = 3000
n_b_rows = n_a_cols
n_b_cols = 200
a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)
start = time.time()
d = np.dot(a, b)
end = time.time()
print "time taken : {}".format(end - start)
Result:
time taken : 25.9291000366
Eigen:
#include <iostream>
#include <ctime> // for clock()
#include <Eigen/Dense>

using namespace Eigen;

int main()
{
    int n_a_rows = 4000;
    int n_a_cols = 3000;
    int n_b_rows = n_a_cols;
    int n_b_cols = 200;

    MatrixXi a(n_a_rows, n_a_cols);
    for (int i = 0; i < n_a_rows; ++i)
        for (int j = 0; j < n_a_cols; ++j)
            a(i, j) = n_a_cols * i + j;

    MatrixXi b(n_b_rows, n_b_cols);
    for (int i = 0; i < n_b_rows; ++i)
        for (int j = 0; j < n_b_cols; ++j)
            b(i, j) = n_b_cols * i + j;

    MatrixXi d(n_a_rows, n_b_cols);

    clock_t begin = clock();
    d = a * b;
    clock_t end = clock();

    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Time taken : " << elapsed_secs << std::endl;
}
Result:
Time taken : 29.05
I am using numpy 1.8.1 and eigen 3.2.0-4.
My question has been answered by Jitse Niesen and ggael in the comments.
I needed to add a flag to turn on optimizations when compiling: -O2 -DNDEBUG (O is a capital o, not a zero).
After adding this flag, the Eigen code runs in 0.6 seconds, as opposed to ~29 seconds without it.
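For example, a typical invocation might look like this (the include path is illustrative and depends on where Eigen lives on your system):
g++ -O2 -DNDEBUG -I /path/to/eigen main.cpp -o main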
Change:
a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)
into:
a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)*1.0
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)*1.0
This gives a factor-100 boost, at least on my laptop:
time taken : 11.1231250763
vs:
time taken : 0.124922037125
(The floating-point product is typically dispatched to an optimized BLAS routine, whereas the integer product takes a slower generic path.)
That is, unless you really want to multiply integers. In Eigen it is also quicker to multiply double-precision numbers (it amounts to replacing MatrixXi with MatrixXd in three places), but there I see a factor of just 1.5: Time taken : 0.555005 vs 0.846788.
Is there a more efficient way to do this in Eigen?
Whenever you have a matrix multiplication where the matrix on the left side of the = does not also appear on the right side, you can safely tell Eigen that no aliasing takes place. This saves an unnecessary temporary and assignment operation, which for big matrices can make a noticeable difference in performance. It is done with the .noalias() member function as follows:
d.noalias() = a * b;
This way a * b is evaluated directly into d. Otherwise, to avoid aliasing problems, Eigen will first store the product in a temporary variable and then assign that variable to your target matrix d.
So, in your code, the line:
d = a * b;
is effectively evaluated as:
temp = a*b;
d = temp;
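A minimal sketch of the pattern (the matrix sizes here are arbitrary):
#include <iostream>
#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(1000, 800);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(800, 200);
    Eigen::MatrixXd d(1000, 200);

    // d does not appear on the right-hand side, so no aliasing can occur:
    // the product is evaluated directly into d, skipping the temporary.
    d.noalias() = a * b;

    std::cout << d(0, 0) << std::endl;
    return 0;
}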

How to convert this Python code to C++

I have a working algorithm in Python which I want to convert to C++:
def gcd(a, b):
    if (a % b == 0):
        return b
    else:
        return gcd(b, a % b)

def solution(N, M):
    lcm = N * M / gcd(N, M)
    return lcm / M
I'm having a problem with large input values, as the product of N and M causes integer overflow, and using long to store its value doesn't seem to help, unless I'm doing something wrong.
Here's my current code:
int gcd(int a, int b)
{
    if (a % b == 0)
        return b;
    else
        return gcd(b, a % b);
}

int solution(int N, int M) {
    // Calculate greatest common divisor
    int g = gcd(N, M);
    // Calculate the least common multiple
    long m = N * M;
    int lcm = m / g;
    return lcm / M;
}
You are computing g = gcd(N, M), then m = N * M, then lcm = m / g, and finally returning lcm / M. That's the same as returning N / gcd(N, M), since (N * M / g) / M = N / g. You don't need those intermediate calculations; get rid of them. Now there's no problem with overflow (unless M = 0, that is, which you aren't protecting against).
int solution(int N, int M) {
    if (M == 0) {
        handle_error();
    }
    else {
        return N / gcd(N, M);
    }
}
To begin with, change:
long m = N * M;
int lcm = m / g;
To:
long long m = (long long)N * M;
int lcm = (int)(m / g);
In general, you might as well change every int in your code to unsigned long long...
But if you have some BigInt class at hand, then you might want to use it instead.
Here is one for free: http://planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=9735&lngWId=3
It stores a natural number of any conceivable size, and supports all arithmetic operators provided in C++.
The problem is in long m = N * M.
The multiplication is performed in 32-bit int arithmetic, because both operands are of type int; the overflow happens before the result is ever converted to long.
The correction is long long m = (long long)N * M;
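A minimal demonstration of the widening fix (the values are illustrative; their product does not fit in 32 bits):
#include <iostream>

int main()
{
    int N = 1000000;
    int M = 1000003;
    // Casting one operand first makes the multiplication happen in 64 bits.
    // Plain N * M would be a 32-bit multiply on typical platforms and would
    // overflow before the result is ever converted to long long.
    long long m = (long long)N * M;
    std::cout << m << std::endl; // 1000003000000
    return 0;
}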

Trying to import a python snippet to C/C++ (PI spigot algorithm)

Some time ago (I can't remember where) I found this Python snippet, which implements a spigot algorithm for calculating digits of pi:
def pi_digits():
    """generator for digits of pi"""
    q,r,t,k,n,l = 1,0,1,1,3,3
    while True:
        if 4*q+r-t < n*t:
            yield n
            q,r,t,k,n,l = (10*q,10*(r-n*t),t,k,(10*(3*q+r))/t-10*n,l)
        else:
            q,r,t,k,n,l = (q*k,(2*q+r)*l,t*l,k+1,(q*(7*k+2)+r*l)/(t*l),l+2)

digits = pi_digits()
for i in range(30): print digits.next()
Now I want to implement this in C++. My attempt was:
#include <cmath>
#include <cstdlib>
#include <iostream>

typedef long long ll;

void help() {
    std::cout << "Usage: pi2 <digits>" << std::endl;
    exit(1);
}

void pi(const long long digits) {
    ll q, r, t, k, n, l;
    q=1;
    r=0;
    t=1;
    k=1;
    n=3;
    l=3;
    for (ll i=0; i<digits; ++i) {
        if (4*q+r-t < n*t) {
            std::cout << n;
            q=10*q;
            r=10*(r-n*t);
            n = ( 10 * ( 3 * q + r) / t ) - 10 * n; //Thanks to maverik
        } else {
            q=q*k;
            r=(2*q+r)*l;
            t=t*l;
            k=k+1;
            n=(q*(7*k+2)+r*l)/(t*l);
            l=l+2;
        }
    }
}

int main(int argc, char** argv) {
    if (argc<2) help();
    ll digits = 0;
    if (digits=atoll(argv[1])<1) help();
    pi(digits);
    return 0;
}
But it never calls std::cout's operator<<, while the Python version works.
Can you help me?
Thanks.
The reason is that your code is not performing equivalent calculations in the two languages.
There are (as far as I can see) two reasons for this:
First, in the Python code, all the calculations in a tuple assignment are done at the same time:
q,r,t,k,n,l = (q*k,(2*q+r)*l,t*l,k+1,(q*(7*k+2)+r*l)/(t*l),l+2)
In the C++ code, the calculations are performed one at a time, so each one uses the results of the previous ones instead of the old values (as the Python code does).
Second, you're using ints in Python and long longs in C++. The divisions in the C++ code produce long longs, while those in Python (assuming Python 2) produce rounded-down ints.
This could also cause miscalculations, which can make your condition never become true.
P.S.
Implementing this in C from scratch is probably a better idea than porting a Python algorithm.
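To illustrate the first point, here is a sketch of the loop with every new value computed into a temporary before anything is overwritten, mirroring Python's tuple assignment (variable names taken from the question's code; the loop also counts printed digits, since the question's for loop counts every iteration including non-printing ones). The second point still applies: C++ division truncates toward zero while Python 2 floors, and long long eventually overflows, so only the first handful of digits is reliable.
#include <iostream>

typedef long long ll;

void pi(const ll digits) {
    ll q = 1, r = 0, t = 1, k = 1, n = 3, l = 3;
    ll printed = 0;
    while (printed < digits) {
        if (4*q + r - t < n*t) {
            std::cout << n;
            ++printed;
            // compute every new value from the OLD ones before overwriting
            ll q2 = 10*q;
            ll r2 = 10*(r - n*t);
            ll n2 = (10*(3*q + r)) / t - 10*n;
            q = q2; r = r2; n = n2;
        } else {
            ll q2 = q*k;
            ll r2 = (2*q + r)*l;
            ll t2 = t*l;
            ll n2 = (q*(7*k + 2) + r*l) / (t*l);
            q = q2; r = r2; t = t2; n = n2;
            k = k + 1;
            l = l + 2;
        }
    }
    std::cout << std::endl;
}

int main() {
    pi(8); // first 8 digits: 31415926 (64-bit arithmetic overflows not long after this)
    return 0;
}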
Looks like, according to the Python code, there should be:
n = ( 10 * ( 3 * q + r) / t ) - 10 * n;
And here:
if ( 4 * q + r - t < n * t) ...
Or have I missed something?
