Timeit showing that regular Python is faster than NumPy?

I'm working on a piece of code for a game that calculates the distances between all the objects on the screen using their in-game coordinate positions. Originally I was going to use basic Python and lists to do this, but since the number of distances that will need to be calculated grows quadratically with the number of objects, I thought it might be faster to do it with NumPy.
I'm not very familiar with numpy, and I've been experimenting on basic bits of code with it. I wrote a little bit of code to time how long it takes for the same function to complete a calculation in numpy and in regular Python, and numpy seems to consistently take a good bit more time than the regular python.
The function is very simple. It starts with 1.1 and then increments 200,000 times, adding 0.1 to the last value and then finding the square root of the new value. It's not what I'll actually be doing in the game code, which will involve finding total distance vectors from position coordinates; it's just a quick test I threw together. I already read here that the initialization of arrays takes more time in NumPy, so I moved the initialization of both the NumPy array and the Python list outside their functions, but Python is still faster than NumPy.
Here is the bit of code:
#!/usr/bin/python3
import numpy
from timeit import timeit
#from time import process_time as timer
import math

thing = numpy.array([1.1, 0.0], dtype='float')
thing2 = [1.1, 0.0]

def NPFunc():
    for x in range(1, 200000):
        thing[0] += 0.1
        thing[1] = numpy.sqrt(thing[0])
    print(thing)
    return None

def PyFunc():
    for x in range(1, 200000):
        thing2[0] += 0.1
        thing2[1] = math.sqrt(thing2[0])
    print(thing2)
    return None

print(timeit(NPFunc, number=1))
print(timeit(PyFunc, number=1))
It gives this result, which indicates normal Python is 3x faster:
[ 20000.99999999 141.42489173]
0.2917748889885843
[20000.99999998944, 141.42489172698504]
0.10341173503547907
Am I doing something wrong, or is this calculation just so simple that it isn't a good test for numpy?

Am I doing something wrong, or is this calculation just so simple that it isn't a good test for NumPy?
It's not really that the calculation is simple, but that you're not taking any advantage of NumPy.
The main benefit of NumPy is vectorization: you can apply an operation to every element of an array in one go, and whatever looping is needed happens inside some tightly-optimized C (or Fortran or C++ or whatever) loop inside NumPy, rather than in a slow generic Python iteration.
But you're only accessing a single value, so there's no looping to be done in C.
On top of that, because the values in an array are stored as "native" values, NumPy functions don't need to unbox them, pulling the raw C double out of a Python float, and then re-box them in a new Python float, the way any Python math functions have to.
But you're not doing that either. In fact, you're doubling that work: you're pulling the value out of the array as a float (boxing it), then passing it to a function (which has to unbox it, and then rebox it to return a result), then storing it back in an array (unboxing it again).
And meanwhile, because np.sqrt is designed to work on arrays, it has to first check the type of what you're passing it and decide whether it needs to loop over an array or unbox and rebox a single value or whatever, while math.sqrt just takes a single value. When you call np.sqrt on an array of 200000 elements, the added cost of that type switch is negligible, but when you're doing it every time through the inner loop, that's a different story.
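A throwaway way to see just that per-call cost (a sketch; exact numbers will vary by machine) is to time the two square-root calls on a single scalar:
import math
from timeit import timeit
import numpy as np

# Same scalar square root a million times each; np.sqrt pays its
# type-dispatch and boxing cost on every single call.
print(timeit(lambda: math.sqrt(2.0), number=1000000))
print(timeit(lambda: np.sqrt(2.0), number=1000000))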
So, it's not an unfair test.
You've demonstrated that using NumPy to pull out values one at a time, act on them one at a time, and store them back in the array one at a time is slower than just not using NumPy.
But, if you compare it to actually taking advantage of NumPy—e.g., by creating an array of 200000 floats and then calling np.sqrt on that array vs. looping over it and calling math.sqrt on each one—you'll demonstrate that using NumPy the way it was intended is faster than not using it.
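For instance, a minimal sketch of that fairer comparison (using an illustrative array of 200,000 values rather than the question's exact loop) might look like this:
import math
from timeit import timeit
import numpy as np

values = np.arange(200000) * 0.1 + 1.1        # 200,000 floats
py_values = values.tolist()

def np_sqrt_all():
    return np.sqrt(values)                    # one call; the loop runs in C

def py_sqrt_all():
    return [math.sqrt(v) for v in py_values]  # one Python-level call per element

print(timeit(np_sqrt_all, number=100))
print(timeit(py_sqrt_all, number=100))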

You're comparing it wrong. To see NumPy's advantage, hand it the whole array at once:
import numpy as np
from timeit import timeit
a_list = np.arange(0, 20000, 0.1)
timeit(lambda: np.sqrt(a_list), number=1)

Related

python one line for loop over a function returning a tuple

I've tried searching quite a lot on this one, but being relatively new to python I feel I am missing the required terminology to find what I'm looking for.
I have a function:
def my_function(x, y):
    # code...
    return (a, b, c)
Where x and y are numpy arrays of length 2000 and the return values are integers. I'm looking for a shorthand (one-liner) to loop over this function as such:
Output = [my_function(X[i],Y[i]) for i in range(len(Y))]
Where X and Y are of the shape (135,2000). However, after running this I am currently having to do the following to separate out 'Output' into three numpy arrays.
Output = np.asarray(Output)
a = Output.T[0]
b = Output.T[1]
c = Output.T[2]
Which I feel isn't the best practice. I have tried:
(a,b,c) = [my_function(X[i],Y[i]) for i in range(len(Y))]
But this doesn't seem to work. Does anyone know a quick way around my problem?
my_function(X[i], Y[i]) for i in range(len(Y))
On the verge of crossing the "opinion-based" border, ...Y[i]... for i in range(len(Y)) is usually a big no-no in Python, and it is an even bigger no-no when working with numpy arrays. One of the advantages of working with numpy is the 'vectorization' it provides, pushing the for loop down to the C level rather than the (slower) Python level.
So, if you rewrite my_function so it can handle the arrays in a vectorized fashion using the multiple tools and methods that numpy provides, you may not even need that "one-liner" you are looking for.
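If the function genuinely can't be vectorized, one common shortcut (a sketch using the names from the question) is to transpose the list of tuples as it comes back:
import numpy as np

# zip(*...) turns a list of (a, b, c) tuples into three sequences in one step.
results = [my_function(X[i], Y[i]) for i in range(len(Y))]
a, b, c = (np.asarray(col) for col in zip(*results))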

How to perform an operation on every element in a numpy matrix?

Say I have a function foo() that takes in a single float and returns a single float. What's the fastest/most pythonic way to apply this function to every element in a numpy matrix or array?
What I essentially need is a version of this code that doesn't use a loop:
import numpy as np
big_matrix = np.matrix(np.ones((1000, 1000)))
for i in xrange(np.shape(big_matrix)[0]):
    for j in xrange(np.shape(big_matrix)[1]):
        big_matrix[i, j] = foo(big_matrix[i, j])
I was trying to find something in the numpy documentation that will allow me to do this but I haven't found anything.
Edit: As I mentioned in the comments, specifically the function I need to work with is the sigmoid function, f(z) = 1 / (1 + exp(-z)).
If foo is really a black box that takes a scalar, and returns a scalar, then you must use some sort of iteration. People often try np.vectorize and realize that, as documented, it does not speed things up much. It is most valuable as a way of broadcasting several inputs. It uses np.frompyfunc, which is slightly faster, but with a less convenient interface.
The proper numpy way is to change your function so it works with arrays. That shouldn't be hard to do with the function in your comments:
f(z) = 1 / (1 + exp(-z))
There's a np.exp function. The rest is simple math.
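For example, a minimal sketch of the vectorized sigmoid (using a plain ndarray rather than np.matrix) might be:
import numpy as np

def sigmoid(z):
    # Works element-wise on scalars and arrays alike; no explicit loop needed.
    return 1.0 / (1.0 + np.exp(-z))

big_matrix = np.ones((1000, 1000))
big_matrix = sigmoid(big_matrix)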

Python: For loop Vs. Map

I am currently in the process of optimising the translation part of my software, which translates coordinates a given number of times. My current translation code is in the translate function and the supposedly optimised portion is in the translate_map function.
I read here that the map function should be used instead of for loops where possible because the loop is performed in C.
When I run a test case below, the map function actually runs slower than a standard for loop. Why does the map perform slower than the conventional for loop? How could I optimise the translate function to run faster?
import time

def translate(atom_list):
    for i in atom_list:
        i[1] += 1
        i[2] += 1
        i[3] += 1

atoms = [[1,1,1,1]]*1000
start = time.time()
for x in xrange(10000):
    translate(atoms)
print time.time() - start

atoms = [[1,1,1,1]]*1000
start = time.time()

def translate_map(atom_list):
    atom_list[1] += 1
    atom_list[2] += 1
    atom_list[3] += 1

for x in xrange(10000):
    map(translate_map, atoms)
print time.time() - start
output:
2.92705798149
4.14674210548
I suspect most of the overhead you're seeing with your map implementation comes from function call overhead. The translate function does all its work within a single loop, so there's just a single function call for the whole process. The implementation with map makes a separate function call for every item in the list.
A second source of overhead (though I suspect it is small compared to the function calls) is that map creates a list of the return values from the function. Since translate_map doesn't have a return statement, this will be a list of None values. Note that in Python 3, map returns a lazy iterator, so your map version won't do any work at all unless you iterate over the result of the map call. The explicit loop is much clearer though, so I'd stick with that (if you don't go for numpy).
Oh, yes, numpy would make this much easier (and almost certainly faster too):
def translate(arr):  # arr should be a numpy array
    arr += 1
That's it! No loops needed (at the Python level).
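As a usage sketch (keeping the question's layout, where column 0 should stay untouched, so slicing columns 1-3 rather than adding to the whole array):
import numpy as np

atoms = np.ones((1000, 4))   # same shape as the list-of-lists in the question
atoms[:, 1:] += 1            # translate columns 1-3 of every row in one vectorized step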

How to avoid enormous additional memory consumption when using numpy vectorize?

This code below best illustrates my problem:
The output to the console (NB it takes ~8 minutes to run even the first test) shows the 512x512x512x16-bit array allocations consuming no more than expected (256MByte for each one), and looking at "top" the process generally remains sub-600MByte as expected.
However, while the vectorized version of the function is being called, the process expands to an enormous size (over 7GByte!). Even the most obvious explanation I can think of to account for this - that vectorize is converting the inputs and outputs to float64 internally - could only account for a couple of gigabytes, even though the vectorized function returns an int16, and the returned array is certainly an int16. Is there some way to avoid this happening? Am I using/understanding vectorize's otypes argument wrong?
import numpy as np
import subprocess

def logmem():
    subprocess.call('cat /proc/meminfo | grep MemFree', shell=True)

def fn(x):
    return np.int16(x*x)

def test_plain(v):
    print "Explicit looping:"
    logmem()
    r = np.zeros(v.shape, dtype=np.int16)
    for z in xrange(v.shape[0]):
        for y in xrange(v.shape[1]):
            for x in xrange(v.shape[2]):
                r[z,y,x] = fn(x)
    print type(r[0,0,0])
    logmem()
    return r

vecfn = np.vectorize(fn, otypes=[np.int16])

def test_vectorize(v):
    print "Vectorize:"
    logmem()
    r = vecfn(v)
    print type(r[0,0,0])
    logmem()
    return r

logmem()
s = (512,512,512)
v = np.ones(s, dtype=np.int16)
logmem()
test_plain(v)
test_vectorize(v)
v = None
logmem()
I'm using whichever versions of Python/numpy are current on an amd64 Debian Squeeze system (Python 2.6.6, numpy 1.4.1).
It is a basic problem of vectorisation that all intermediate values are also vectors. While this is a convenient way to get a decent speed enhancement, it can be very inefficient with memory usage, and will be constantly thrashing your CPU cache. To overcome this problem, you need to use an approach which has explicit loops running at compiled speed, not at python speed. The best ways to do this are to use cython, fortran code wrapped with f2py or numexpr. You can find a comparison of these approaches here, although this focuses more on speed than memory usage.
You can read the source code of vectorize(): it converts the array's dtype to object and calls np.frompyfunc() to create a ufunc from your Python function; the ufunc returns an object array, and finally vectorize() converts the object array back to an int16 array.
An array uses a lot of memory when its dtype is object.
Using a Python function to do element-wise calculation is slow, even if it's converted to a ufunc by frompyfunc().
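For the specific fn in the question (squaring int16 values), a memory-friendly sketch is to skip vectorize entirely and let NumPy do the arithmetic on the whole array:
import numpy as np

v = np.ones((512, 512, 512), dtype=np.int16)
# int16 * int16 stays int16, so no object or float64 temporaries are created;
# only one result array of the same 256MB size is allocated.
r = v * v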

Speed up NumPy loop

I'm running a model in Python and I'm trying to speed up the execution time. Through profiling the code I've found that a huge amount of the total processing time is spent in the cell_in_shadow function below. I'm wondering if there is any way to speed it up?
The aim of the function is to provide a boolean response stating whether the specified cell in the NumPy array is shadowed by another cell (in the x direction only). It does this by stepping backwards along the row checking each cell against the height it must be to make the given cell in shadow. The values in shadow_map are calculated by another function not shown here - for this example, take shadow_map to be an array with values similar to:
[0] = 0 (not used)
[1] = 3
[2] = 7
[3] = 18
The add_x function is used to ensure that the array indices loop around (using clock-face arithmetic), as the grid has periodic boundaries (anything going off one side will re-appear on the other side).
def cell_in_shadow(x, y):
    """Returns True if the specified cell is in shadow, False if not."""
    # Get the global variables we need
    global grid
    global shadow_map
    global x_len
    # Record the original position and move to the left
    orig_x = x
    x = add_x(x, -1)
    while x != orig_x:
        # Gets the height that's needed from the shadow_map (the array index is the distance using clock-face arithmetic)
        height_needed = shadow_map[((x - orig_x) % x_len)]
        if grid[y, x] - grid[y, orig_x] >= height_needed:
            return True
        # Go to the cell to the left
        x = add_x(x, -1)
    return False

def add_x(a, b):
    """Adds the two numbers using clockface arithmetic with the x_len"""
    global x_len
    return (a + b) % x_len
I do agree with Sancho that Cython will probably be the way to go, but here are a couple of small speed-ups:
A. Store grid[y, orig_x] in some variable before you start the while loop and use that variable instead. This will save a bunch of look-up calls to the grid array.
B. Since you are basically just starting at x_len - 1 in shadow_map and working down to 1, you can avoid using the modulus so much. Basically, change:
while x != orig_x:
    height_needed = shadow_map[((x - orig_x) % x_len)]
to
for i in xrange(x_len - 1, 0, -1):
    height_needed = shadow_map[i]
or just get rid of the height_needed variable altogether with:
if grid[y, x] - grid[y, orig_x] >= shadow_map[i]:
These are small changes, but they might help a little bit.
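Putting A and B together, a rough sketch of the rewritten function (keeping the question's globals; illustrative rather than tested) might look like:
def cell_in_shadow(x, y):
    """Returns True if the specified cell is in shadow, False if not."""
    global grid, shadow_map, x_len
    orig_height = grid[y, x]               # A: look this height up once
    for i in xrange(x_len - 1, 0, -1):     # B: walk shadow_map indices directly
        other_x = (x + i) % x_len          # the cell i steps away (periodic boundary)
        if grid[y, other_x] - orig_height >= shadow_map[i]:
            return True
    return False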
Also, if you plan on going the Cython route, I would consider having your function do this process for the whole grid, or at least a row at a time. That will save a lot of the function call overhead. However, you might not be able to really do this depending on how you are using the results.
Lastly, have you tried using Psyco? It takes less work than Cython though it probably won't give you quite as big of a speed boost. I would certainly try it first.
If you're not limited to strict Python, I'd suggest using Cython for this. It can allow static typing of the indices and efficient, direct access to a numpy array's underlying data buffer at c speed.
Check out a short tutorial/example at http://wiki.cython.org/tutorials/numpy
In that example, which is doing operations very similar to what you're doing (incrementing indices, accessing individual elements of numpy arrays), adding type information to the index variables cut the time in half compared to the original. Adding efficient indexing into the numpy arrays by giving them type information cut the time to about 1% of the original.
Most Python code is already valid Cython, so you can just use what you have and add annotations and type information where needed to give you some speed-ups.
I suspect you'd get the most out of adding type information to your indices x, y, orig_x and to the numpy arrays.
The following guide compares several different approaches to optimising numerical code in python:
Scipy PerformancePython
It is a bit out of date, but still helpful. Note that it refers to pyrex, which has since been forked to create the Cython project, as mentioned by Sancho.
Personally I prefer f2py, because I think that fortran 90 has many of the nice features of numpy (e.g. adding two arrays together with one operation), but has the full speed of compiled code. On the other hand if you don't know fortran then this may not be the way to go.
I briefly experimented with cython, and the trouble I found was that by default cython generates code which can handle arbitrary python types, but which is still very slow. You then have to spend time adding all the necessary cython declarations to get it to be more specific and fast, whereas if you go with C or fortran then you will tend to get fast code straight out of the box. Again this is biased by me already being familiar with these languages, whereas Cython may be more appropriate if Python is the only language you know.
