I have pretty much the same code in python and C. Python example:
import numpy
nbr_values = 8192
n_iter = 100000
a = numpy.ones(nbr_values).astype(numpy.float32)
for i in range(n_iter):
a = numpy.sin(a)
C example:
#include <stdio.h>
#include <math.h>
int main(void)
{
    int i, j;
    int nbr_values = 8192;
    int n_iter = 100000;
    double x;
    for (j = 0; j < nbr_values; j++) {
        x = 1;
        for (i = 0; i < n_iter; i++)
            x = sin(x);
    }
    return 0;
}
Something strange happened when I ran both examples:
$ time python numpy_test.py
real 0m5.967s
user 0m5.932s
sys 0m0.012s
$ g++ sin.c
$ time ./a.out
real 0m13.371s
user 0m13.301s
sys 0m0.008s
It looks like Python/NumPy is twice as fast as C. Is there any mistake in the experiment above? How can you explain it?
P.S. I have Ubuntu 12.04, 8G ram, core i5 btw
First, turn on optimization. Second, subtleties matter: your C code is definitely not 'basically the same'.
Here is equivalent C code:
sinary2.c:
#include <math.h>
#include <stdlib.h>
float *sin_array(const float *input, size_t elements)
{
    int i = 0;
    float *output = malloc(sizeof(float) * elements);
    for (i = 0; i < elements; ++i) {
        output[i] = sin(input[i]);
    }
    return output;
}
sinary.c:
#include <math.h>
#include <stdlib.h>
extern float *sin_array(const float *input, size_t elements);

int main(void)
{
    int i;
    int nbr_values = 8192;
    int n_iter = 100000;
    float *x = malloc(sizeof(float) * nbr_values);
    for (i = 0; i < nbr_values; ++i) {
        x[i] = 1;
    }
    for (i = 0; i < n_iter; i++) {
        float *newary = sin_array(x, nbr_values);
        free(x);
        x = newary;
    }
    return 0;
}
Results:
$ time python foo.py
real 0m5.986s
user 0m5.783s
sys 0m0.050s
$ gcc -O3 -ffast-math sinary.c sinary2.c -lm
$ time ./a.out
real 0m5.204s
user 0m4.995s
sys 0m0.208s
The reason the program has to be split in two is to fool the optimizer a bit; otherwise it will realize that the whole loop has no effect at all and optimize it out. Putting things in two files denies the compiler visibility into the possible side effects of sin_array while it's compiling main, so it has to assume the function actually has some and call it repeatedly.
Your original program is not at all equivalent, for several reasons: you have nested loops in the C version and not in Python; you work with arrays of values in the Python version and not in the C version; you create and discard arrays in the Python version and not in the C version; and you use float in the Python version but double in the C version.
Simply calling the sin function the appropriate number of times does not make for an equivalent test.
Also, the optimizer is a really big deal for C. Comparing unoptimized C code against anything else when you care about speed is the wrong thing to do. Of course, you also need to be mindful: the C optimizer is very sophisticated, and if you're testing something that really doesn't do anything, it may well notice this fact and simply not do anything at all, resulting in a program that's ridiculously fast.
Because "numpy" is a dedicated math library implemented for speed. C has standard functions for sin/cos, that are generally derived for accuracy.
You are also not comparing apples with apples, as you are using double in C and float32 (float) in Python. If we change the Python code to calculate in float64 instead, the time increases by about 2.5 seconds on my machine, making it roughly match the correctly optimized C version.
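For reference, a minimal sketch of that float64 change to the question's Python benchmark (only the dtype differs from the original; timings will vary by machine):

import numpy

nbr_values = 8192
n_iter = 100000
a = numpy.ones(nbr_values).astype(numpy.float64)  # double precision, matching the C code
for i in range(n_iter):
    a = numpy.sin(a)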
If the whole test were made to do something more complicated that requires more control structures (if/else, do/while, etc.), then you would probably see even less difference between C and Python, because the C compiler can't really compute "sin" any faster unless you implement a better "sin" function.
Never mind the fact that your code isn't quite the same on both sides... ;)
You seem to be doing the same operation in C 8192 x 100000 times but only 100000 times in Python (I haven't used NumPy before, so I may misunderstand the code). Why are you using an array in the Python case? (Again, I'm not used to NumPy, so perhaps the dereferencing is implicit.) If you wish to use an array, be careful: doubles have a performance hit in terms of caching and optimised vectorisation. You're using different types between the two implementations (float vs. double), but given the algorithm I don't think it matters.
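As a point of reference, here is a small sketch of what each NumPy iteration actually does: numpy.sin is applied element-wise, so every pass of the loop processes all 8192 values, not one.

import numpy

a = numpy.ones(8192, dtype=numpy.float32)
b = numpy.sin(a)            # one call evaluates sin() for all 8192 elements
assert b.shape == (8192,)   # so 100000 iterations amount to 8192 * 100000 sin evaluations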
The main reason for a lot of anomalous performance-benchmark results surrounding C vs. Python-this, Python-that... is simply that the C implementation is often poor.
https://www.ibm.com/developerworks/community/blogs/jfp/entry/A_Comparison_Of_C_Julia_Python_Numba_Cython_Scipy_and_BLAS_on_LU_Factorization?lang=en
If you notice, the guy writes C to process an array of doubles (without using the restrict or const keywords where he could have), builds with optimisation, and then forces the compiler to use SIMD rather than AVX. In short, the compiler is using an inefficient instruction set for doubles and the wrong type of registers too, if he wanted performance; you can be sure Numba and NumPy use as many bells and whistles as possible and ship with very efficient C and C++ libraries to begin with. In short, if you want speed with C you have to think about it; you may even have to disassemble the code, and perhaps disable optimisation and use compiler intrinsics instead. C gives you the tools to do it, so don't expect the compiler to do all the work for you. If you don't want that degree of involvement, use Cython, Numba, NumPy, SciPy etc. They're very fast, but you won't be able to eke every last bit of performance out of the machine; to do that, use C, C++ or newer versions of Fortran.
Here is a very good article on these very points (I'd use SciPy):
https://www.scipy.org/scipylib/faq.html
Related
I was talking to my friend about these two pieces of code. He said the Python one terminates, but the C++ one doesn't.
Python:
arr = [1, 2, 3]
for i in range(len(arr)):
    arr.append(i)
print("done")
C++:
#include <iostream>
#include <vector>
using namespace std;
int main() {
    vector<int> arr{1, 2, 3};
    for (int i = 0; i < arr.size(); i++) {
        arr.push_back(i);
    }
    cout << "done" << endl;
    return 0;
}
I challenged that and ran it on two computers. The first one ran out of memory (bad alloc) because it had 4 GB of RAM. My Mac has 12 GB of RAM and was able to run and terminate just fine.
I thought it wouldn't run forever because the type of size() in vector is an unsigned int. Since my Mac is 64-bit, I thought it could store 2^(64-2) = 2^62 ints (which is true), but the unsigned int for the size is 32 bits for some reason.
Is this some bug in the C++ compiler that does not change the max_size() to be relative to the system's hardware? The overflow causes the program to terminate. Or is it for some other reason?
There is not a bug in your C++ compiler manifesting itself here.
int is overflowing (due to the i++), the behaviour of which is undefined. (It's feasible that you'll run out of memory on some platforms before this overflow occurs.) Note that there is no defined behaviour that will make i negative, although that is a common occurrence on machines with 2's complement signed integral types once std::numeric_limits<int>::max() is attained; and if i were, say, -1, then i < arr.size() would be false due to the implicit conversion of i to an unsigned type.
The Python version pre-computes range(len(arr)); that is, subsequent appends do not change that initial bound.
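A small sketch of that point (the variable names are mine; the bound is fixed when the range is created):

arr = [1, 2, 3]
r = range(len(arr))   # the bound (3) is captured here, once
arr.append(99)        # growing the list afterwards does not affect r
print(list(r))        # prints [0, 1, 2]

# A loop that re-checked len(arr) on every pass would mirror the C++ behaviour
# and keep growing the list until memory runs out:
# i = 0
# while i < len(arr):
#     arr.append(i)
#     i += 1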
Looking for some help with some C++ code to implement/call the GDALFillNodata() algorithm. I already have a working version using Python and GDAL, which is somewhat slow at filling elevation DEMs (1.5 GB). I'm curious if this is possible. I've written the code for a command-line application and posted it here. File paths are hard-coded at the moment. It builds (CodeBlocks 16.1/MinGW) and runs, but then crashes.
I'm not a C++ programmer, though I wish I were, but I'm trying to understand the language better. I'm moderately decent at Python. I'm likely missing something basic here that's normal to C++.
There is likely code that's been commented out during testing, so if something doesn't make sense, that's why.
Here's the Code:
#include <iostream>
#include "gdal.h"
#include "gdal_priv.h"
#include "cpl_conv.h"
#include "gdal_alg.h"
int main()
{
GDALAllRegister();
//CPLPushErrorHandler(CPLQuietErrorHandler);
// Read/Write Files
const char *input = "D:/myIn.tif";
GDALDataset *pSrcDataset;
//GDALRasterBandH hMaskBand;
GDALRasterBand *poBand;
//CPLErr maskBand;
int maskFlags;
int noData;
double maxSearch = 10.0;
int maxInt = 1;
int nBlockXSize, nBlockYSize;
double adfGeoTransform[6];
//CPLErr eErr;
pSrcDataset = (GDALDataset*) GDALOpen(input, GA_Update);
CPLAssert( pSrcDataset != NULL );
poBand = pSrcDataset->GetRasterBand( 1 );
poBand->GetBlockSize( &nBlockXSize, &nBlockYSize );
printf( "Block=%dx%d Type=%s, ColorInterp=%s\n",
nBlockXSize, nBlockYSize,
GDALGetDataTypeName(poBand->GetRasterDataType()),
GDALGetColorInterpretationName(
poBand->GetColorInterpretation()) );
noData = pSrcDataset->GetRasterBand(1)->GetNoDataValue();
printf( "No Data Value = %i\n",noData );
printf( "Driver: %s/%s\n",
pSrcDataset->GetDriver()->GetDescription(),
pSrcDataset->GetDriver()->GetMetadataItem( GDAL_DMD_LONGNAME ) );
printf( "Size is %dx%dx%d\n",
pSrcDataset->GetRasterXSize(), pSrcDataset->GetRasterYSize(),
pSrcDataset->GetRasterCount() );
if( pSrcDataset->GetProjectionRef() != NULL )
printf( "Projection is `%s'\n", pSrcDataset->GetProjectionRef() );
if( pSrcDataset->GetGeoTransform( adfGeoTransform ) == CE_None )
printf( "Origin = (%.6f,%.6f)\n", adfGeoTransform[0],
adfGeoTransform[3] );
printf( "Pixel Size = (%.6f,%.6f)\n",adfGeoTransform[1],
adfGeoTransform[5] );
//maskBand = pSrcDataset->GetRasterBand(1)->GetMaskBand();
//hMaskBand = GDALGetMaskBand( maskBand );
//hMaskBand = pSrcDataset->GetRasterBand(1)->GetNoDataValue();
maskFlags = pSrcDataset->GetRasterBand(1)->GetMaskFlags();
printf ( "Mask Flags = %i\n", maskFlags );
printf ( "Processing image\n" );
GDALFillNodata(pSrcDataset, pSrcDataset->GetRasterBand(1)->GetMaskBand(),
               maxSearch, 0, maxInt, NULL, NULL, NULL);
//CPLAssert( eErr == CE_None);
GDALClose(pSrcDataset);
return 0;
}
Here are some errors that I'm getting when running this code (see image links). The process returns 255, which I think is something unique to CodeBlocks?
Program Crashes
Process returns 255
Here is the Python implementation. Pretty straightforward. Is None the same as NULL? I ask because one of the errors I got referred to using NULL as the hMaskBand (rasterfill.cpp).
#Run the gdal fill
ET = gdal.Open(infile, GA_Update)
ETband = ET.GetRasterBand(1)
result = gdal.FillNodata(targetBand = ETband, maskBand = None,
maxSearchDist = 500, smoothingIterations = 1)
print result # return 0
ET = None
Please let me know if you need more information. Knowing what little I know about C++ it's probably my build environment. :)
Thanks,
Heath
As a follow-up to the comments above, the problem is that the first argument given to the GDALFillNodata API is a GDALDataset* instead of a GDALRasterBand*.
But I've struggled too much with GDAL myself to pass up giving a more explanatory answer.
The GDAL library's interface is very C-oriented, even though it is indeed written using C++ features (e.g. classes).
So you will find that many of the available APIs take arguments that do not refer to the types defined in the library. Instead, they use typedefs like GDALRasterBandH, which are really aliases for void*.
This has the nasty effect that, whatever type of argument you pass, the compiler won't complain, so you get a lot of errors at runtime if you make such (very common) mistakes.
Why do they do this? I think it was a way to exploit PIMPL, to avoid recompiling several translation units every time one of these classes was modified.
The real problem with this approach is that you lose all the benefits of static type checking, i.e. the compiler telling you when there is a type mismatch. As a side note, this is not a criticism of the GDAL lib (thank God we have it! It is a huge and very useful project); it's just the way it is.
By the way, I'm currently working on a C++ open-source project, Rasterix, which is meant to be a user-friendly GUI for GDAL tools for raster datasets. Don't take it as a "do it like me" hint, as this app still needs proper testing, but there are a lot of GDAL API use cases there that may help if you need examples of how to use them.
For a Monte-Carlo like simulation, I need to pick at random thousands of random Gaussian vectors (that is, vectors having independently normally distributed entries). Each such vector is of fixed length (around 100).
NumPy has a method of achieving this:
import numpy.random
vectors = [numpy.random.normal(size=100) for _ in xrange(10000)]
NumPy's random.normal function is of linear complexity, with overhead for small size values. However, it looks like that overhead is not significant for size=100 (perhaps around 30%, tested empirically; compare with the overhead for size=1, which is about 2300%). Perhaps I can save some of this overhead by rolling once, then splitting the array (haven't tried that yet).
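A hedged sketch of that "roll once, then split" idea: a single batched draw reshaped into row vectors (the variable names are mine):

import numpy.random

n_vectors, dim = 10000, 100
flat = numpy.random.normal(size=n_vectors * dim)   # one call to the generator
vectors = flat.reshape(n_vectors, dim)             # viewed as 10000 vectors of length 100
# equivalently, in one call:
# vectors = numpy.random.normal(size=(n_vectors, dim))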
However, it is still much too slow for my needs. Perhaps I'm too greedy here; I know that NumPy's randomization functions are written in C with optimization in mind; still,
timeit numpy.random.normal(size=100)
# 100000 loops, best of 3: 5.8 us per loop
(tested inside IPython, using its magic %timeit)
That makes ~0.06 seconds for 10k vectors. I was wondering whether there's a much faster method which would allow me to roll 10k vectors of size 100 (say) within less than 0.6ms, that is, 100 times faster. A solution may involve C extensions or whatever else is needed.
Update
A very simple C++ program, based on an example from cppreference, shows much better performance:
#include <iostream>
#include <random>
int main()
{
    float x;
    std::random_device rd;
    std::mt19937 gen(rd());
    std::normal_distribution<> d(0, 1);
    for (int i = 0; i < 100000; i++)
    {
        x = d(gen);
    }
    std::cout << x << '\n';
    return 0;
}
and time shows:
real 0m0.028s
user 0m0.020s
sys 0m0.004s
which is about 20x faster than what NumPy gives. However, I am not sure about the overhead of C extensions for Python, and I have no intuition about whether this can become a Python function that is faster than numpy.random.normal.
I'm looking into replacing some C code with Python code and using PyPy as the interpreter. The code does a lot of list/dictionary operations, so to get a vague idea of the performance of PyPy vs C I am writing sorting algorithms. To test all my read functions I wrote a bubble sort, both in Python and C++. CPython of course sucks at 6.468s; PyPy came in at 0.366s and C++ at 0.229s. Then I remembered that I had forgotten -O3 on the C++ code, and the time went to 0.042s. For a 32768-element dataset, C++ with -O3 takes only 2.588s and PyPy 19.65s. Is there anything I can do to speed up my Python code (besides using a better sort algorithm, of course) or to change how I use PyPy (some flag or something)?
Python code (the read_nums module is omitted since its time is trivial: 0.036s on the 32768 dataset):
import read_nums
import sys
nums = read_nums.read_nums(sys.argv[1])
done = False
while not done:
    done = True
    for i in range(len(nums)-1):
        if nums[i] > nums[i+1]:
            nums[i], nums[i+1] = nums[i+1], nums[i]
            done = False
$ time pypy-c2.0 bubble_sort.py test_32768_1.nums
real 0m20.199s
user 0m20.189s
sys 0m0.009s
C++ code (the read_nums function is again omitted since it takes little time: 0.017s):
#include <iostream>
#include <vector>
#include "read_nums.h"

int main(int argc, char** argv)
{
    std::vector<int> nums;
    int count, i, tmp;
    bool done;
    if (argc < 2)
    {
        std::cout << "Usage: " << argv[0] << " filename" << std::endl;
        return 1;
    }
    count = read_nums(argv[1], nums);
    done = false;
    while (!done)
    {
        done = true;
        for (i = 0; i < count - 1; ++i)
        {
            if (nums[i] > nums[i+1])
            {
                tmp = nums[i];
                nums[i] = nums[i+1];
                nums[i+1] = tmp;
                done = false;
            }
        }
    }
    for (i = 0; i < count; ++i)
    {
        std::cout << nums[i] << ", ";
    }
    return 0;
}
$ time ./bubble_sort test_32768_1.nums > /dev/null
real 0m2.587s
user 0m2.586s
sys 0m0.001s
P.S. Some of the numbers given in the first paragraph are a little different from the numbers from time because they're the numbers I got the first time.
Further improvements:
Just tried xrange instead of range and the run time went to 16.370s.
Moved the code, starting from the first done = False through the last done = False, into a function (sketched below); the speed is now 8.771-8.834s.
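A sketch of that refactor, reusing the question's read_nums module (locals inside a function are cheaper to look up than module-level globals in both CPython and PyPy):

import sys
import read_nums  # module from the question, not shown here

def bubble_sort(nums):
    done = False
    while not done:
        done = True
        for i in xrange(len(nums) - 1):
            if nums[i] > nums[i + 1]:
                nums[i], nums[i + 1] = nums[i + 1], nums[i]
                done = False

nums = read_nums.read_nums(sys.argv[1])
bubble_sort(nums)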
The most relevant way to answer this question is to note that the speed of C, CPython and PyPy are not differing by a constant factor: it depends most importantly on what is done and the way it is written. For example, if your C code is doing naive things like walking arrays when the "equivalent" Python code would naturally use dictionaries, then any implementation of Python is faster than C provided the arrays are long enough. Of course, this is not the case on most real-life examples, but the same argument still applies to a smaller extent. There is no one-size-fits-all way to predict the relative speed of a program written in C, or rewritten in Python and running on CPython or PyPy.
Obviously there are guidelines about these relative speeds: on small algorithmic examples you could expect the speed of PyPy to approach that of "gcc -O0". In your example it is "only" 1.6x slower. We might help you optimize it, or even find optimizations missing in PyPy, in order to gain 10% or 30% more speed. But this is a tiny example that has nothing to do with your real program. For the reasons above, the speed we get here is only vaguely related to the speed you'll get in the end.
I can only say that rewriting code from C to Python for reasons of clarity, notably when the C has become too tangled up for further developments, is clearly a win in the long run --- even in the case where at the end you need to rewrite some parts of it in C again. And PyPy's goal here is to reduce the need for that. While it would be nice to say that no-one ever needs C any more, it's just not true :-)
I'm intending to create a web service which performs a large number of manually-specified calculations as fast as possible, and have been exploring the use of DLR.
Sorry if this is long but feel free to skim over and get the general gist.
I've been using the IronPython library as it makes the calculations very easy to specify. My work laptop gives a performance of about 400,000 calculations per second doing the following:
ScriptEngine py = Python.CreateEngine();
ScriptScope pys = py.CreateScope();
ScriptSource src = py.CreateScriptSourceFromString(#"
def result():
res = [None]*1000000
for i in range(0, 1000000):
res[i] = b.GetValue() + 1
return res
result()
");
CompiledCode compiled = src.Compile();
pys.SetVariable("b", new DynamicValue());
long start = DateTime.Now.Ticks;
var res = compiled.Execute(pys);
long end = DateTime.Now.Ticks;
Console.WriteLine("...Finished. Sample data:");
for (int i = 0; i < 10; i++)
{
Console.WriteLine(res[i]);
}
Console.WriteLine("Took " + (end - start) / 10000 + "ms to run 1000000 times.");
Where DynamicValue is a class that returns random numbers from a pre-built array (seeded and built at run time).
When I create a DLR class to do the same thing, I get much higher performance (~10,000,000 calculations per second). The class is as follows:
class DynamicCalc : IDynamicMetaObjectProvider
{
    DynamicMetaObject IDynamicMetaObjectProvider.GetMetaObject(Expression parameter)
    {
        return new DynamicCalcMetaObject(parameter, this);
    }

    private class DynamicCalcMetaObject : DynamicMetaObject
    {
        internal DynamicCalcMetaObject(Expression parameter, DynamicCalc value) : base(parameter, BindingRestrictions.Empty, value) { }

        public override DynamicMetaObject BindInvokeMember(InvokeMemberBinder binder, DynamicMetaObject[] args)
        {
            Expression Add = Expression.Convert(Expression.Add(args[0].Expression, args[1].Expression), typeof(System.Object));
            DynamicMetaObject methodInfo = new DynamicMetaObject(Expression.Block(Add), BindingRestrictions.GetTypeRestriction(Expression, LimitType));
            return methodInfo;
        }
    }
}
and is called/tested in the same way by doing the following:
dynamic obj = new DynamicCalc();
long t1 = DateTime.Now.Ticks;
for (int i = 0; i < 10000000; i++)
{
results[i] = obj.Add(ar1[i], ar2[i]);
}
long t2 = DateTime.Now.Ticks;
Where ar1 and ar2 are pre-built, runtime seeded arrays of random numbers.
The speed is great this way, but it's not easy to specify the calculation. I'd basically be looking at creating my own lexer & parser, whereas IronPython has everything I need already there.
I'd have thought I could get much better performance from IronPython since it is implemented on top of the DLR, and I could do with better than what I'm getting.
Is my example making best use of the IronPython engine? Is it possible to get significantly better performance out of it?
(Edit) Same as first example but with the loop in C#, setting variables and calling the python function:
ScriptSource src = py.CreateScriptSourceFromString(#"b + 1");
CompiledCode compiled = src.Compile();
double[] res = new double[1000000];
for(int i=0; i<1000000; i++)
{
pys.SetVariable("b", args1[i]);
res[i] = compiled.Execute(pys);
}
where pys is a ScriptScope from py, and args1 is a pre-built array of random doubles. This example executes slower than running the loop in the Python code and passing in the entire arrays.
delnan's comment leads you to some of the problems here, but I'll get specific about what the differences are. In the C# version you've cut out a significant number of the dynamic calls that you have in the Python version. For starters, your loop is typed to int and it sounds like ar1 and ar2 are strongly typed arrays. So in the C# version the only dynamic operations you have are the call to obj.Add (which is 1 operation in C#) and potentially the assignment to results if it's not typed to object, which seems unlikely. Also note all of this code is lock free.
In the Python version you first have the allocation of the list - this also appears to be inside your timer, whereas in C# it doesn't look like it is. Then you have the dynamic call to range; luckily that only happens once, but it again creates a gigantic list in memory - delnan's suggestion of xrange is an improvement here. Then you have the loop counter i, which is getting boxed to an object for every iteration through the loop. Then you have the call to b.GetValue(), which is actually 2 dynamic invocations - first a get member to fetch the "GetValue" method and then an invoke on that bound method object. This again creates one new object for every iteration of the loop. Then you have the result of b.GetValue(), which may be yet another value that's boxed on every iteration. Then you add 1 to that result, another boxing operation on every iteration. Finally you store this into your list, which is yet another dynamic operation - and I think this final operation needs to lock to ensure the list remains consistent (again, delnan's suggestion of using a list comprehension improves this).
So in summary during the loop we have:
                     C#   IronPython
Dynamic Operations    1            4
Allocations           1            4
Locks Acquired        0            1
So basically Python's dynamic behavior does come at a cost vs C#. If you want the best of both worlds you can try and balance what you do in C# vs what you do in Python. For example you could write the loop in C# and have it call a delegate which is a Python function (you can do scope.GetVariable<Func<object, object>>(...) to get a function out of the scope as a delegate). You could also consider allocating a .NET array for the results if you really need to get every last bit of performance, as it may reduce working set and GC copying by not keeping around a bunch of boxed values.
To do the delegate you could have the user write:
def computeValue(value):
    return value + 1
Then in the C# code you'd do:
CompiledCode compiled = src.Compile();
compiled.Execute(pys);
var computer = pys.GetVariable<Func<object,object>>("computeValue");
Now you can do:
for (int i = 0; i < 10000000; i++)
{
results[i] = computer(i);
}
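On the Python side, a sketch of delnan's two suggestions mentioned above (xrange, and a list comprehension instead of the pre-allocated list), with the rest of the hosted script unchanged:

def result():
    # xrange avoids materialising a million-element list up front, and the list
    # comprehension avoids the per-iteration indexed store into res
    return [b.GetValue() + 1 for i in xrange(1000000)]

result()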
If you are concerned about computation speed, it may be better to look at a lower-level computation specification. Python and C# are high-level languages, and their runtimes can spend a lot of time on work under the covers.
Look on this LLVM wrapper library: http://www.llvmpy.org
Install it using: pip install llvmpy ply
or on Debian Linux: apt install python-llvmpy python-ply
You still need to write a tiny compiler (you can use the PLY library) and bind it to LLVM JIT calls (see the LLVM Execution Engine), but this approach can be more effective (the generated code is much closer to real CPU code) and more portable compared to the .NET jail.
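As a rough illustration of the PLY half of that, a minimal lexer sketch (the token set and the input string are made up for the example; the parser and the LLVM binding are left out):

import ply.lex as lex

tokens = ("NUMBER", "PLUS", "TIMES")

t_PLUS = r"\+"
t_TIMES = r"\*"
t_ignore = " \t"

def t_NUMBER(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()        # builds the lexer from the rules above
lexer.input("2 + 3 * 4")
for tok in lexer:
    print(tok.type, tok.value)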
LLVM has ready-to-use optimizing compiler infrastructure, including a lot of optimizer-stage modules, and a big user and developer community.
Also look here: http://gmarkall.github.io/tutorials/llvm-cauldron-2016
PS: If you are interested in it, I can help you with a compiler, contributing to my project's manual in parallel. But it will not be a jumpstart; this topic is new to me too.