DFT in Python taking significantly longer than C


I'm currently translating some C code to Python. The code helps identify errors arising from the CLEAN algorithm used in radio astronomy. For this analysis, the Fourier transforms of the intensity map, the Q Stokes map and the U Stokes map must be evaluated at specific pixel positions (given by ANT_Pix). These maps are just 257*257 arrays.
The code below takes a few seconds to run in C but hours in Python. I'm pretty sure it is poorly optimized, as my knowledge of Python is quite limited.
Thanks for any help you can give.
Update: My question is whether there is a better way to implement the loops in Python that will speed things up. I've read quite a few answers here to other Python questions that recommend avoiding nested for loops where possible, and I'm wondering if anyone knows a good way to implement something like the Python code below without the loops, or with better-optimised loops. I realise this may be a tall order though!
I've been using the FFT up till now, but my supervisor wants to see what difference the DFT will make. This is because the antenna positions will not, in general, fall on exact pixel values, and using the FFT requires rounding to the closest pixel.
I'm using Python because CASA, the program used to reduce radio astronomy datasets, is written in Python, and running Python scripts in it is far, far easier than running C.
Original Code
import math
import numpy

def DFT_Vis(ANT_Pix="",IMap="",QMap="",UMap="", NMap="", Nvis=""):
    UV=numpy.zeros([Nvis,6])
    Offset=(NMap+1)/2
    ANT=ANT_Pix+Offset
    i=0
    l=0
    k=0
    SumI=0
    SumRL=0
    SumLR=0
    z=0
    RL=QMap+1j*UMap
    LR=QMap-1j*UMap
    # Lookup table of the NMap complex phase factors
    Factor=[math.e**(-2j*math.pi*z/NMap) for z in range(NMap)]
    for i in range(Nvis):
        X=ANT[i,0]
        Y=ANT[i,1]
        for l in range(NMap):
            for k in range(NMap):
                Temp=Factor[int((X*l)%NMap)]*Factor[int((Y*k)%NMap)]
                SumI+=IMap[l,k]*Temp
                SumRL+=RL[l,k]*Temp
                # LR, not IMap: the original post repeated IMap here, apparently a copy-paste slip
                SumLR+=LR[l,k]*Temp
            k=1
        UV[i,0]=SumI.real
        UV[i,1]=SumI.imag
        UV[i,2]=SumRL.real
        UV[i,3]=SumRL.imag
        UV[i,4]=SumLR.real
        UV[i,5]=SumLR.imag
        l=1
        k=1
        SumI=0
        SumRL=0
        SumLR=0
    return(UV)
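One way to remove the nested loops is to notice that the phase factor separates into an X part and a Y part, so the double sum becomes two matrix products. The following is only a sketch under some assumptions: the function name dft_vis_vectorized is made up for illustration, the maps are assumed square, the offset is taken as the integer (n + 1) // 2, and the phase is computed exactly for non-integer positions instead of being truncated to a table index as in the loop above.

```python
import numpy as np

def dft_vis_vectorized(ant_pix, imap, qmap, umap):
    """Evaluate the 2-D DFT of three maps at arbitrary (non-integer) positions."""
    n = imap.shape[0]  # maps are assumed to be square, n x n
    ant = np.asarray(ant_pix, dtype=float) + (n + 1) // 2
    idx = np.arange(n)
    # phase_x[i, l] = exp(-2j*pi*X_i*l/n); phase_y[i, k] likewise for Y_i and k
    phase_x = np.exp(-2j * np.pi * np.outer(ant[:, 0], idx) / n)
    phase_y = np.exp(-2j * np.pi * np.outer(ant[:, 1], idx) / n)

    def transform(m):
        # sum over l and k of m[l, k] * phase_x[i, l] * phase_y[i, k], for every i
        return ((phase_x @ m) * phase_y).sum(axis=1)

    sums = [transform(imap),
            transform(qmap + 1j * umap),   # RL
            transform(qmap - 1j * umap)]   # LR
    uv = np.empty((ant.shape[0], 6))
    for col, s in enumerate(sums):
        uv[:, 2 * col] = s.real
        uv[:, 2 * col + 1] = s.imag
    return uv
```

The cost is still proportional to Nvis * NMap * NMap, but the work happens inside NumPy's compiled routines rather than in the Python interpreter.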

You should probably use NumPy's Fourier transform code rather than writing your own: http://docs.scipy.org/doc/numpy/reference/routines.fft.html

If you are interested in boosting the performance of your script, Cython could be an option.

I am not an expert on the FFT, but my understanding is that the FFT is simply a fast way to compute the DFT. So to me your question sounds like you are trying to write a bubble sort algorithm to see if it gives a better answer than quicksort. They are both sorting algorithms that would give the same result!
So I am questioning your basic premise. I am wondering if you can just change your rounding on your data and get the same result from the SciPy FFT code.
Also, according to my DSP textbook, the FFT can produce a more accurate answer than computing the DFT the long way, simply because floating point operations are inexact, and the FFT invokes fewer floating point operations along the way to finding the correct answer.
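As a small check of the claim that the FFT and the DFT give the same values on the pixel grid, here is a sketch that compares a direct O(N^2) DFT with numpy.fft.fft on random data. The helper name naive_dft is made up for illustration.

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    x = np.asarray(x, dtype=complex)
    n = x.size
    k = np.arange(n)
    # w[j, m] = exp(-2j*pi*j*m/n)
    w = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return w @ x

rng = np.random.default_rng(0)
signal = rng.standard_normal(64)
max_diff = np.max(np.abs(naive_dft(signal) - np.fft.fft(signal)))
print(max_diff)  # difference is at the level of floating point rounding
```

This only covers integer frequency indices; evaluating the transform at non-integer positions, as the question describes, is something the FFT grid does not give directly.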
If you have some working C code that does the calculation you want, you could always wrap the C code to let you call it from Python. Discussion here: Wrapping a C library in Python: C, Cython or ctypes?
To answer your actual question: as @ZoZo123 noted, it would be a big win to change from range() to xrange(). With range(), Python 2 has to build a list of numbers and then destroy it when done; with xrange(), Python just makes a lazy object that yields the numbers one at a time. (But note that in Python 3.x, range() is itself lazy and there is no xrange().)
Also, if this code does not have to integrate with the rest of your code, you might try running this code under PyPy. This is exactly the sort of code that PyPy can best optimize. The problem with PyPy is that currently your project must be "pure" Python, and it looks like you are using NumPy. (There are projects to get NumPy and PyPy to work together, but that's not done yet.) http://pypy.org/
If this code does need to integrate with the rest of your code, then I think you need to look at Cython (as noted by @Krzysztof Rosiński).

Related

Program a butterworth filter using numpy (not scipy!) on a BeagleBone Black

I am a new user of Python and an amateur programmer in general. I am hoping to filter a signal using just the numpy library. It will be programmed onto a BeagleBone Black running Angstrom Linux, so the newest numpy version it will update to is 1.4, and due to either rumored data limitations (I am not actually sure how to check) or just the version of numpy being too old, scipy will not work on the board.
So the first solution is to get a new operating system but I would not know where to start; I am more comfortable in the realm of putting equations into a program.
I was hoping to use the filtfilt function but maybe it would be best to start with lfilter. This site seemed helpful for implementing it but it is a bit beyond me:
http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.signal.lfilter.html
I am capable of getting the filter coefficients in MATLAB then transferring them to the BeagleBone. The x is just the array that is my signal which I can upload.
The second section is a bit of a jump - so is there a way to perform a z-transform in just numpy, not scipy? Also, based on all of the secrecy of the filter algorithm in MATLAB, I do not have faith in working that out, but is there some sort of mathematical algorithm description, or better yet code, describing how I may accomplish this?
Thanks for your patience in reading through this and the response. Please do not use complicated language in the response!
-Rob

For the filter design functions, you can copy the code from scipy/signal/filter_design.py; it is almost pure Python.
But to do lfilter for IIR filters, you need a for loop over every sample in the data array. Since for loops in Python are slow, I think you need to implement it in C and call it through ctypes. Do you have a C compiler on the target machine?
If you can design your filter as an FIR filter, then you can use numpy.convolve(b, x).
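To make the two cases concrete, here is a minimal sketch of the difference equation that lfilter implements, written with plain loops and basic numpy arrays only. The name lfilter_direct is made up for illustration, and the code has not been checked against numpy 1.4 specifically. For an FIR filter (a = [1]) it gives the same result as the first len(x) samples of numpy.convolve(b, x).

```python
import numpy as np

def lfilter_direct(b, a, x):
    """y[n] = (sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]) / a[0]"""
    b = np.asarray(b, dtype=float)
    a = np.asarray(a, dtype=float)
    x = np.asarray(x, dtype=float)
    b = b / a[0]  # normalise so that a[0] == 1
    a = a / a[0]
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(b)):          # feed-forward (FIR) part
            if n - k >= 0:
                acc += b[k] * x[n - k]
        for k in range(1, len(a)):       # feedback (IIR) part
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

# Impulse response of y[n] = x[n] + 0.5*y[n-1]
impulse = np.zeros(5)
impulse[0] = 1.0
print(lfilter_direct([1.0], [1.0, -0.5], impulse))
```

The coefficients b and a would come from the filter design step (for example, exported from MATLAB as described in the question).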

Can I improve python runtime by compiling?

I'm writing a small toy simulation in Python. Granted, these simulations are slow. To my understanding, the major reason that Python code is slow is that Python is an interpreted language. I don't want to give up Python, since the clear syntax and the available libraries cut the writing time significantly. So is there a simple way for me to "compile" my Python code?
Edit
To answer some questions:
Yes, I'm using numpy. It greatly simplifies the code, and I don't think I can improve performance by writing the functions on my own. I use numpy for all my lists and I add all of the beads together. Namely, I invoke
pos += V*dt + forces*0.5*dt**2
where 'pos', 'V', and 'forces' are all np.array of (2000,3) dimensions.
I'm quite certain that the slow part is the forces calculation. This is logical, as I have to iterate over all my particles and check their positions. For my real project (Ph.D. stuff) I have code of roughly the same level of complexity, and I know that this is the expensive part.

If none of the solutions in the comments suffice, you can also take a look at Cython.
For a quick tutorial & example check:
http://docs.cython.org/src/tutorial/cython_tutorial.html
Used at the correct spots (e.g. around frequently called functions) it can easily speed things up by a factor of 10 - 100.

Python is a slightly odd language in that it is both interpreted and compiled. Well, sort of. When you run it, the source is compiled to bytecode (cached in ".pyc" files), which the interpreter then executes, so we can quickly get bogged down in semantic details here. Hell, I don't even know if what I just said is strictly accurate. But at the end of the day you want to speed things up, so...
First, use the profiler and timeit to work out where all the time is going
Second, rewrite your pure Python code to improve the slow bits you've discovered
Third, see how it goes when optimised
Now, depending on your scenario, seriously think "Can I run it on a bigger CPU/memory?"
OK, try rewriting those slow sections in C++
Screw it, write it all in C++
If you get as far as the last option, I dare say you're screwed and the savings aren't going to be significant.
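As an illustration of the first two steps, the sketch below uses timeit to compare an element-by-element Python loop with the equivalent whole-array numpy expression for a position update on (2000, 3) arrays. The function names and the random data are assumptions made for the example, not code from the question.

```python
import timeit
import numpy as np

def step_loop(pos, vel, forces, dt):
    """Position update written as explicit Python loops."""
    out = pos.copy()
    for i in range(pos.shape[0]):
        for j in range(pos.shape[1]):
            out[i, j] += vel[i, j] * dt + forces[i, j] * 0.5 * dt ** 2
    return out

def step_numpy(pos, vel, forces, dt):
    """Same update as a single whole-array expression."""
    return pos + vel * dt + forces * 0.5 * dt ** 2

rng = np.random.default_rng(0)
pos, vel, forces = (rng.standard_normal((2000, 3)) for _ in range(3))

t_loop = timeit.timeit(lambda: step_loop(pos, vel, forces, 0.01), number=3)
t_numpy = timeit.timeit(lambda: step_numpy(pos, vel, forces, 0.01), number=3)
print("loop: %.4f s, numpy: %.4f s" % (t_loop, t_numpy))
```

The actual timings depend on the machine, so measure before deciding which part to rewrite.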

Coordinate container types in Python Aggdraw for fastest possible rendering?

Original Question:
I have a question about the Python Aggdraw module that I cannot find in the Aggdraw documentation. I'm using the ".polygon" command which renders a polygon on an image object and takes input coordinates as its argument.
My question is if anyone knows or has experience with what types of sequence containers the xy coordinates can be in (list, tuple, generator, itertools-generator, array, numpy-array, deque, etc), and most importantly which input type will help Aggdraw render the image in the fastest possible way?
The docs only mention that the polygon method takes: "A Python sequence (x, y, x, y, …)"
I'm thinking that Aggdraw is optimized for some sequence types more than others, and/or that some sequence types have to be converted first, and thus some types will be faster than others. So maybe someone knows these details about Aggdraw's inner workings, either in theory or from experience?
I have done some preliminary testing, and will do more soon, but I still want to know the theory behind why one option might be faster, because it might be that I am not doing the tests properly or that there are additional ways to optimize Aggdraw rendering that I don't know about.
(Btw, this may seem like trivial optimization, but not when the goal is to render tens of thousands of polygons quickly and to zoom in and out of them. So for this question I don't want suggestions for other rendering modules (from my testing Aggdraw appears to be one of the fastest anyway). I also know that there are other optimization bottlenecks like coordinate-to-pixel transformations etc., but for now I'm only focusing on the final step of Aggdraw's internal rendering speed.)
Thanks a bunch, curious to see what knowledge and experience others out there have with Aggdraw.
A Winner? Some Preliminary Tests
I have now conducted some preliminary tests and reported the results in an Answer further down the page if you want the details. The main finding is that rounding float coordinates to pixel coordinates as integers and having them in arrays are the fastest way to make Aggdraw render an image or map, and lead to incredibly fast rendering speedups on the scale of 650% at speeds that can be compared with well-known and commonly used GIS software. What remains is to find fast ways to optimize coordinate transformations and shapefile loading, and these are daunting tasks indeed. For all the findings check out my Answer post further down the page.
I'm still interested to hear if you have done any tests of your own, or if you have other useful answers or comments. I'm still curious about the answers to the Bonus question if anyone knows.
Bonus question:
If you don't know the specific answer to this question, it might still help if you know which programming language the actual Aggdraw rendering is done in. I've read that the Aggdraw module is just a Python binding for the original C++ Anti-Grain Geometry library, but I'm not entirely sure what that actually means. Does it mean that the Aggdraw Python commands are simply a way of accessing and activating the C++ library "behind the scenes", so that the actual rendering is done in C++ and at C++ speeds? If so, then I would guess that C++ would have to convert the Python sequence to a C++ sequence, and the optimization would be to find out which Python sequence can be converted the fastest. Or is the Aggdraw module simply the original library rewritten in pure Python (and thus much slower than the C++ version)? If so, which Python types does it support, and which is faster for the type of rendering work it has to do?

A Winner? Some Preliminary Tests
Here are the results from my initial testings of which input types are faster for aggdraw rendering. One clue was to be found in the aggdraw docs where it said that aggdraw.polygon() only takes "sequences": officially defined as "str, unicode, list, tuple, bytearray, buffer, xrange" (http://docs.python.org/2/library/stdtypes.html). Luckily however I found that there are also additional input types that aggdraw rendering accepts. After some testing I came up with a list of the input container types that I could find that aggdraw (and maybe also PIL) rendering supports:
tuples
lists
arrays
Numpy arrays
deques
Unfortunately, aggdraw does not support, and raises errors for, coordinates contained in:
generators
itertools generators
sets
dictionaries
And then for the performance testing! The test polygons were a subset of 20 000 (multi)polygons from the Global Administrative Units Database of worldwide sub-national province boundaries, loaded into memory using the PyShp shapefile reader module (http://code.google.com/p/pyshp/). To ensure that the tests only measured aggdraw's internal rendering speed I made sure to start the timer only after the polygon coordinates were already transformed to aggdraw image pixel coordinates, AND after I had created a list of input arguments with the correct input type and aggdraw.Pen and .Brush objects. I then timed and ran the rendering using itertools.starmap with the preloaded coordinates and arguments:
t = time.time()
iterat = itertools.starmap(draw.polygon, args)  # draw is the aggdraw.Draw() object
for runfunc in iterat:  # iterating through the itertools generator consumes and runs it
    pass
print(time.time() - t)
My findings confirm the traditional notion that tuples and arrays are the fastest Python containers to iterate over; both ended up on top. Lists were about 50% slower, and so were numpy arrays (this was initially surprising given Numpy's reputation for speed, but then I read that Numpy arrays are only fast when one uses Numpy's own functions on them, and that for normal Python iteration they are generally slower than other types). Deques, usually considered fast, turned out to be the slowest (almost 100% slower, i.e. about 2x).
### Coordinates as FLOATS
### Pure rendering time (seconds) for 20 000 polygons from the GADM dataset
tuples  8.90130587328
arrays  9.03419164657
lists   13.424952522
numpy   13.1880489246
deque   16.8887938784
In other words, if you usually use lists for aggdraw coordinates, you should know that lists took about 50% longer to render than tuples or arrays, so switching cuts rendering time by roughly a third. Not the most radical improvement, but still useful and easy to implement.
But wait! I did find another way to squeeze more performance out of the aggdraw module, quite a lot actually. I forget why I did it, but when I tried rounding the transformed floating point coordinates to the nearest pixel as an integer type (i.e. "int(round(eachcoordinate))") before rendering them, I got a 6.5x rendering speedup (650%) compared to the most common list container, a worthwhile and easy optimization. Surprisingly, the array container type turns out to be about 25% faster than tuples when the renderer doesn't have to worry about rounding numbers. This prerounding leads to no loss of visual detail that I could see, because these floating points can only be assigned to one pixel anyway, and that might be why prerounding the coordinates before sending them off to the aggdraw renderer speeds up the process: aggdraw then doesn't have to do it. A potential caveat is that taking away the decimal information could change how aggdraw does its anti-aliasing, but in my opinion the final map still looks equally anti-aliased and smooth. Finally, this rounding optimization must be weighed against the time it takes to round the numbers in Python, but from what I can see the prerounding time does not outweigh the benefit of the rendering speedup. Further optimization should be explored for how to round and convert the coordinates in a fast way.
### Coordinates as INTEGERS (rounded to pixels)
### Pure rendering time (seconds) for 20 000 polygons from the GADM dataset
arrays  1.40970077294
tuples  2.19892537074
lists   6.70839555276
numpy   6.47806400659
deque   7.57472232757
In conclusion then: arrays and tuples are the fastest container types to use when providing aggdraw (and possibly also PIL?) with drawing coordinates.
Given the hefty rendering speeds that can be obtained when using the correct input type with aggdraw, it becomes particularly crucial and rewarding to find even the slightest optimizations for other aspects of the map rendering process, such as coordinate transformation routines (I am already exploring and finding for instance that Numpy is particularly fast for such purposes).
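One possible way to do the rounding and container conversion in bulk is sketched below: numpy rounds all coordinates at once, and the result is packed into an integer array.array, the container type that came out fastest in the tests above. The helper name to_pixel_array is made up for illustration, and whether this beats rounding in a plain Python loop for a given dataset would need to be timed.

```python
from array import array
import numpy as np

def to_pixel_array(coords):
    """Round (x, y) float pairs to integer pixels and flatten to x, y, x, y, ..."""
    rounded = np.rint(np.asarray(coords, dtype=float)).astype(np.intc)
    # array("i") holds C ints, matching numpy's intc
    return array("i", rounded.ravel().tolist())

pixels = to_pixel_array([(0.4, 1.6), (2.7, 3.49)])
print(pixels)
```

Note that np.rint rounds exact halves to the nearest even integer, which can differ from round() under Python 2.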
A more general finding from all of this is that Python can potentially be used for very fast map rendering applications, which further opens the possibilities for Python geospatial scripting; e.g. the entire GADM dataset of 200 000+ provinces can theoretically be rendered in about 1.5*10=15 seconds, not counting the coordinate to image coordinate transformation, which is way faster than QGIS and even ArcGIS, which in my experience struggle with displaying the GADM dataset.
All results were obtained on an 8-core, 2-year-old Windows 7 machine, using Python 2.6.5. Whether these results are also the most efficient when it comes to loading and/or processing the data is a question that has to be tested and answered in another post. It would be interesting to hear if someone else already has any good insights on these aspects.

FSharp runs my algorithm slower than Python

Years ago, I solved a problem via dynamic programming:
https://www.thanassis.space/fillupDVD.html
The solution was coded in Python.
As part of expanding my horizons, I recently started learning OCaml/F#. What better way to test the waters, than by doing a direct port of the imperative code I wrote in Python to F# - and start from there, moving in steps towards a functional programming solution.
The results of this first, direct port... are disconcerting:
Under Python:
bash$ time python fitToSize.py
....
real 0m1.482s
user 0m1.413s
sys 0m0.067s
Under FSharp:
bash$ time mono ./fitToSize.exe
....
real 0m2.235s
user 0m2.427s
sys 0m0.063s
(in case you noticed the "mono" above: I tested under Windows as well, with Visual Studio - same speed).
I am... puzzled, to say the least. Python runs code faster than F# ? A compiled binary, using the .NET runtime, runs SLOWER than Python's interpreted code?!?!
I know about startup costs of VMs (mono in this case) and how JITs improve things for languages like Python, but still... I expected a speedup, not a slowdown!
Have I done something wrong, perhaps?
I have uploaded the code here:
https://www.thanassis.space/fsharp.slower.than.python.tar.gz
Note that the F# code is more or less a direct, line-by-line translation of the Python code.
P.S. There are of course other gains, e.g. the static type safety offered by F# - but if the resulting speed of an imperative algorithm is worse under F# ... I am disappointed, to say the least.
EDIT: Direct access, as requested in the comments:
the Python code: https://gist.github.com/950697
the FSharp code: https://gist.github.com/950699

Dr Jon Harrop, whom I contacted over e-mail, explained what is going on:
The problem is simply that the program has been optimized for Python. This is common when the programmer is more familiar with one language than the other, of course. You just have to learn a different set of rules that dictate how F# programs should be optimized...
Several things jumped out at me such as the use of a "for i in 1..n do" loop rather than a "for i=1 to n do" loop (which is faster in general but not significant here), repeatedly doing List.mapi on a list to mimic an array index (which allocated intermediate lists unnecessarily) and your use of the F# TryGetValue for Dictionary which allocates unnecessarily (the .NET TryGetValue that accepts a ref is faster in general but not so much here)
... but the real killer problem turned out to be your use of a hash table to implement a dense 2D matrix. Using a hash table is ideal in Python because its hash table implementation has been extremely well optimized (as evidenced by the fact that your Python code is running as fast as F# compiled to native code!) but arrays are a much better way to represent dense matrices, particularly when you want a default value of zero.
The funny part is that when I first coded this algorithm, I DID use a table -- I changed the implementation to a dictionary for reasons of clarity (avoiding the array boundary checks made the code simpler - and much easier to reason about).
Jon transformed my code (back :-)) into its array version, and it runs at 100x speed.
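To show the two representations side by side, here is a Python sketch of a small "fill a capacity as fully as possible" dynamic program, once with a dictionary keyed by tuples and once with a dense 2D list. This is not the code from the linked gists; the function names and the recurrence are assumptions chosen only to illustrate the data-structure choice.

```python
def best_fill_dict(sizes, capacity):
    """DP table stored in a dict keyed by (i, c); missing entries count as 0."""
    best = {}
    for i in range(1, len(sizes) + 1):
        s = sizes[i - 1]
        for c in range(capacity + 1):
            skip = best.get((i - 1, c), 0)
            take = best.get((i - 1, c - s), 0) + s if s <= c else 0
            best[(i, c)] = max(skip, take)
    return best.get((len(sizes), capacity), 0)

def best_fill_array(sizes, capacity):
    """Same recurrence with a dense 2D list initialised to 0."""
    best = [[0] * (capacity + 1) for _ in range(len(sizes) + 1)]
    for i in range(1, len(sizes) + 1):
        s = sizes[i - 1]
        for c in range(capacity + 1):
            skip = best[i - 1][c]
            take = best[i - 1][c - s] + s if s <= c else 0
            best[i][c] = max(skip, take)
    return best[len(sizes)][capacity]

print(best_fill_dict([3, 5, 7], 11), best_fill_array([3, 5, 7], 11))
```

Both versions compute the same table; the dense version needs no hashing and gets the default value of zero for free, which is the point made in the e-mail above.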
Moral of the story:
F# Dictionary needs work... when using tuples as keys, compiled F# is slower than interpreted Python's hash tables!
Obvious, but no harm in repeating: Cleaner code sometimes means... much slower code.
Thank you, Jon -- much appreciated.
EDIT: the fact that replacing Dictionary with Array makes F# finally run at the speeds a compiled language is expected to run, doesn't negate the need for a fix in Dictionary's speed (I hope F# people from MS are reading this). Other algorithms depend on dictionaries/hashes, and can't be easily switched to using arrays; making programs suffer "interpreter-speeds" whenever one uses a Dictionary, is arguably, a bug. If, as some have said in the comments, the problem is not with F# but with .NET Dictionary, then I'd argue that this... is a bug in .NET!
EDIT2: The clearest solution, that doesn't require the algorithm to switch to arrays (some algorithms simply won't be amenable to that) is to change this:
let optimalResults = new Dictionary<_,_>()
into this:
let optimalResults = new Dictionary<_,_>(HashIdentity.Structural)
This change makes the F# code run 2.7x faster, thus finally beating Python (1.6x faster). The weird thing is that tuples by default use structural comparison, so in principle, the comparisons done by the Dictionary on the keys are the same (with or without Structural). Dr Harrop theorizes that the speed difference may be attributed to virtual dispatch: "AFAIK, .NET does little to optimize virtual dispatch away and the cost of virtual dispatch is extremely high on modern hardware because it is a "computed goto" that jumps the program counter to an unpredictable location and, consequently, undermines branch prediction logic and will almost certainly cause the entire CPU pipeline to be flushed and reloaded".
In plain words, and as suggested by Don Syme (look at the bottom 3 answers), "be explicit about the use of structural hashing when using reference-typed keys in conjunction with the .NET collections". (Dr. Harrop in the comments below also says that we should always use Structural comparisons when using .NET collections).
Dear F# team in MS, if there is a way to automatically fix this, please do.

As Jon Harrop has pointed out, simply constructing the dictionaries using Dictionary(HashIdentity.Structural) gives a major performance improvement (a factor of 3 on my computer). This is almost certainly the minimally invasive change you need to make to get better performance than Python, and keeps your code idiomatic (as opposed to replacing tuples with structs, etc.) and parallel to the Python implementation.

Edit: I was wrong, it's not a question of value type vs reference type. The performance problem was related to the hash function, as explained in other comments. I keep my answer here because there's an interesting discussion. My code partially fixed the performance issue, but this is not the clean and recommended solution.
--
On my computer, I made your sample run twice as fast by replacing the tuple with a struct. This means, the equivalent F# code should run faster than your Python code. I don't agree with the comments saying that .NET hashtables are slow, I believe there's no significant difference with Python or other languages implementations. Also, I don't agree with the "You can't 1-to-1 translate code expect it to be faster": F# code will generally be faster than Python for most tasks (static typing is very helpful to the compiler). In your sample, most of the time is spent doing hashtable lookups, so it's fair to imagine that both languages should be almost as fast.
I think the performance issue is related to garbage collection (but I haven't checked with a profiler). The reason why using tuples can be slower here than structures has been discussed in an SO question (Why is the new Tuple type in .Net 4.0 a reference type (class) and not a value type (struct)) and an MSDN page (Building tuples):
"If they are reference types, this means there can be lots of garbage generated if you are changing elements in a tuple in a tight loop. [...] F# tuples were reference types, but there was a feeling from the team that they could realize a performance improvement if two, and perhaps three, element tuples were value types instead. Some teams that had created internal tuples had used value instead of reference types, because their scenarios were very sensitive to creating lots of managed objects."
Of course, as Jon said in another comment, the obvious optimization in your example is to replace hashtables with arrays. Arrays are obviously much faster (integer index, no hashing, no collision handling, no reallocation, more compact), but this is very specific to your problem, and it doesn't explain the performance difference with Python (as far as I know, Python code is using hashtables, not arrays).
To reproduce my 50% speedup, here is the full code: http://pastebin.com/nbYrEi5d
In short, I replaced the tuple with this type:
type Tup = {x: int; y: int}
Also, it seems like a detail, but you should move the List.mapi (fun i x -> (i,x)) fileSizes out of the enclosing loop. I believe Python enumerate does not actually allocate a list (so it's fair to allocate the list only once in F#, or use Seq module, or use a mutable counter).

Hmm... if the hashtable is the major bottleneck, then it is probably the hash function itself. I haven't looked at the specific hash function, but take one of the most common hash functions, namely
((a * x + b) % p) % q
The modulus operation % is painfully slow; if p and q are of the form 2^k - 1, we can do the modulus with an and, an add and a shift operation.
Dietzfelbinger's universal hash function h_a : [2^w] -> [2^l] is
h_a(x) = floor(((a * x) % 2^w) / 2^(w-l))
where a is a random odd w-bit seed.
It can be computed by (a*x) >> (w-l) in w-bit unsigned arithmetic, which is considerably faster than the first hash function. I had to implement a hash table with linked lists for collision handling. It took 10 minutes to implement and test; we had to test it with both functions and analyse the difference in speed. The second hash function gave, as I remember, around a 4-10x speed gain, depending on the size of the table.
But the thing to learn here is that if your program's bottleneck is hashtable lookup, the hash function has to be fast too.
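The multiply-shift scheme can be sketched in a few lines. Python integers do not overflow, so the "% 2^w" that a w-bit machine multiply performs for free has to be written as an explicit mask here. The constants W and L_BITS and the helper name make_multiply_shift are assumptions for the example.

```python
import random

W = 64        # word size in bits
L_BITS = 10   # the table has 2**L_BITS buckets
MASK = (1 << W) - 1

def make_multiply_shift(seed=None):
    """Return (a, h) where h(x) = ((a*x) mod 2^W) >> (W - L_BITS)."""
    rng = random.Random(seed)
    a = rng.getrandbits(W) | 1  # random odd W-bit multiplier

    def h(x):
        # The mask emulates W-bit unsigned overflow; the shift keeps the top L_BITS bits
        return ((a * x) & MASK) >> (W - L_BITS)

    return a, h

a, h = make_multiply_shift(seed=42)
print([h(x) for x in range(5)])
```

In a language with fixed-width unsigned integers the mask disappears, leaving one multiply and one shift per hash.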

I need to speed up a function. Should I use cython, ctypes, or something else?

I'm having a lot of fun learning Python by writing a genetic programming type of application.
I've had some great advice from Torsten Marek, Paul Hankin and Alex Martelli on this site.
The program has 4 main functions:
generate (randomly) an expression tree.
evaluate the fitness of the tree
crossbreed
mutate
Since generate, crossbreed and mutate all call 'evaluate the fitness', it is the busiest function and the primary bottleneck speedwise.
As is the nature of genetic algorithms, it has to search an immense solution space so the faster the better. I want to speed up each of these functions. I'll start with the fitness evaluator. My question is what is the best way to do this. I've been looking into cython, ctypes and 'linking and embedding'. They are all new to me and quite beyond me at the moment but I look forward to learning one and eventually all of them.
The 'fitness function' needs to compare the value of the expression tree to the value of the target expression. So it will consist of a postfix evaluator which will read the tree in a postfix order. I have all the code in python.
I need advice on which I should learn and use now: cython, ctypes or linking and embedding.
Thank you.

Ignore everyone else's answer for now. The first thing you should learn to use is the profiler. Python comes with profile/cProfile; you should learn how to read the results and analyze where the real bottlenecks are. The goal of optimization is three-fold: reduce the time spent on each call, reduce the number of calls made, and reduce memory usage to avoid disk thrashing.
The first goal is relatively easy. The profiler will show you the most time-consuming functions, and you can go straight to those functions to optimize them.
The second and third goals are harder, since they mean changing the algorithm to reduce the need to make so many calls. Find the functions that have a high number of calls and try to find ways to reduce the need to call them. Utilize the built-in collections; they're very well optimized.
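A minimal sketch of profiling with cProfile and reading the result with pstats follows. The functions evaluate_fitness and run_generation are stand-ins invented for the example, not the asker's code.

```python
import cProfile
import io
import pstats

def evaluate_fitness(candidate):
    """Stand-in for an expensive fitness evaluation."""
    total = 0
    for i in range(2000):
        total += (candidate * i) % 7
    return total

def run_generation(population_size=200):
    """Stand-in for one generation: evaluates every candidate once."""
    best = 0
    for candidate in range(population_size):
        best = max(best, evaluate_fitness(candidate))
    return best

profiler = cProfile.Profile()
profiler.enable()
run_generation()
profiler.disable()

# Show the five entries with the largest cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The ncalls column points at the second goal (too many calls), while tottime and cumtime point at the first (slow calls).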
If you're doing a lot of number and array processing, you should take a look at pandas, Numpy/Scipy, gmpy third party modules; they're well optimised C libraries for processing arrays/tabular data.
Another thing you want to try is PyPy. PyPy can JIT recompile and do much more advanced optimisation than CPython, and it'll work without the need to change your python code. Though well optimised code targeting CPython can look quite different from well optimised code targeting PyPy.
Next to try is Cython. Cython is a slightly different language from Python; in fact, Cython is best described as C with typed, Python-like syntax.
For parts of your code that are in very tight loops and that you can no longer optimize any other way, you may want to rewrite them as a C extension. Python has very good support for extending with C. In PyPy, the best way to extend is with cffi.

Cython is the quickest way to get the job done, either by writing your algorithm directly in Cython, or by writing it in C and binding it to Python with Cython.
My advice: learn Cython.

Another great option is boost::python which lets you easily wrap C or C++.
Of these possibilities though, since you have python code already written, cython is probably a good thing to try first. Perhaps you won't have to rewrite any code to get a speedup.

Try to work your fitness function so that it will support memoization. This will replace all calls that are duplicates of previous calls with a quick dict lookup.
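A sketch of what memoization could look like for a postfix evaluator is shown below, using functools.lru_cache from the standard library (Python 3.2+). The token format (a tuple containing numbers, the variable "x" and operator strings) is an assumption made for the example; the key requirement is that the arguments are hashable, which is why the expression is a tuple and not a list.

```python
import operator
from functools import lru_cache

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

@lru_cache(maxsize=None)
def eval_postfix(tokens, x):
    """Evaluate a postfix expression given as a tuple of tokens at the value x."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        elif tok == "x":
            stack.append(x)
        else:
            stack.append(tok)  # numeric constant
    return stack[0]

expr = ("x", "x", "*", 1, "+")   # x*x + 1 in postfix order
first = eval_postfix(expr, 3)    # computed
second = eval_postfix(expr, 3)   # served from the cache
print(first, second, eval_postfix.cache_info())
```

On older Python versions the same effect can be had with a plain dict keyed by (tokens, x).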
