Speed up NumPy loop - python

I'm running a model in Python and I'm trying to speed up the execution time. Through profiling the code I've found that a huge amount of the total processing time is spent in the cell_in_shadow function below. I'm wondering if there is any way to speed it up?
The aim of the function is to provide a boolean response stating whether the specified cell in the NumPy array is shadowed by another cell (in the x direction only). It does this by stepping backwards along the row checking each cell against the height it must be to make the given cell in shadow. The values in shadow_map are calculated by another function not shown here - for this example, take shadow_map to be an array with values similar to:
[0] = 0 (not used)
[1] = 3
[2] = 7
[3] = 18
The add_x function is used to ensure that the array indices loop around (using clock-face arithmetic), as the grid has periodic boundaries (anything going off one side will re-appear on the other side).
def cell_in_shadow(x, y):
    """Returns True if the specified cell is in shadow, False if not."""
    # Get the global variables we need
    global grid
    global shadow_map
    global x_len
    # Record the original position and move to the left
    orig_x = x
    x = add_x(x, -1)
    while x != orig_x:
        # Get the height that's needed from the shadow_map (the array index is the distance, using clock-face arithmetic)
        height_needed = shadow_map[(x - orig_x) % x_len]
        if grid[y, x] - grid[y, orig_x] >= height_needed:
            return True
        # Go to the cell to the left
        x = add_x(x, -1)
    return False

def add_x(a, b):
    """Adds the two numbers using clock-face arithmetic with the x_len"""
    global x_len
    return (a + b) % x_len

I do agree with Sancho that Cython will probably be the way to go, but here are a couple of small speed-ups:
A. Store grid[y, orig_x] in some variable before you start the while loop and use that variable instead. This will save a bunch of look-up calls to the grid array.
B. Since you are basically just starting at x_len - 1 in shadow_map and working down to 1, you can avoid using the modulus so much. Basically, change:
while x != orig_x:
    height_needed = shadow_map[(x - orig_x) % x_len]
to
for i in xrange(x_len - 1, 0, -1):
    height_needed = shadow_map[i]
or just get rid of the height_needed variable altogether with:
    if grid[y, x] - grid[y, orig_x] >= shadow_map[i]:
These are small changes, but they might help a little bit.
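Putting A and B together, a minimal sketch of the revised function (untested, and assuming grid, shadow_map, x_len and add_x are defined exactly as in the question) might look like:
def cell_in_shadow(x, y):
    """Returns True if the specified cell is in shadow, False if not."""
    base_height = grid[y, x]                # suggestion A: look this up once
    cur = add_x(x, -1)
    for i in xrange(x_len - 1, 0, -1):      # suggestion B: no modulus inside the loop
        if grid[y, cur] - base_height >= shadow_map[i]:
            return True
        cur = add_x(cur, -1)
    return False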
Also, if you plan on going the Cython route, I would consider having your function do this process for the whole grid, or at least a row at a time. That will save a lot of the function call overhead. However, you might not be able to really do this depending on how you are using the results.
Lastly, have you tried using Psyco? It takes less work than Cython though it probably won't give you quite as big of a speed boost. I would certainly try it first.

If you're not limited to strict Python, I'd suggest using Cython for this. It can allow static typing of the indices and efficient, direct access to a numpy array's underlying data buffer at C speed.
Check out a short tutorial/example at http://wiki.cython.org/tutorials/numpy
In that example, which is doing operations very similar to what you're doing (incrementing indices, accessing individual elements of numpy arrays), adding type information to the index variables cut the time in half compared to the original. Adding efficient indexing into the numpy arrays by giving them type information cut the time to about 1% of the original.
Most Python code is already valid Cython, so you can just use what you have and add annotations and type information where needed to give you some speed-ups.
I suspect you'd get the most out of adding type information to your indices x, y, orig_x and the numpy arrays.

The following guide compares several different approaches to optimising numerical code in python:
Scipy PerformancePython
It is a bit out of date, but still helpful. Note that it refers to pyrex, which has since been forked to create the Cython project, as mentioned by Sancho.
Personally I prefer f2py, because I think that fortran 90 has many of the nice features of numpy (e.g. adding two arrays together with one operation), but has the full speed of compiled code. On the other hand if you don't know fortran then this may not be the way to go.
I briefly experimented with cython, and the trouble I found was that by default cython generates code which can handle arbitrary python types, but which is still very slow. You then have to spend time adding all the necessary cython declarations to get it to be more specific and fast, whereas if you go with C or fortran then you will tend to get fast code straight out of the box. Again this is biased by me already being familiar with these languages, whereas Cython may be more appropriate if Python is the only language you know.

Related

speed up function based on list comprehension

I'm trying to get the 15 most relevant items for each user, but every function I tried takes an eternity (more than 6 hours; I shut it down after that...).
I have 418 unique users and 3718 unique items.
The U2tfifd dict also has 418 entries, and there are 32645 words in tfidf_feature_names.
The shape of my interactions_full_df is (40733, 3).
I tried:
def index_tfidf_users(user_id):
    return [users for users in U2tfifd[user_id].flatten().tolist()]

def get_relevant_items(user_id):
    return sorted(zip(tfidf_feature_names, index_tfidf_users(user_id)), key=lambda x: -x[1])[:15]

def get_tfidf_token(user_id):
    return [words for words, values in get_relevant_items(user_id)]

and then:
interactions_full_df["tags"] = interactions_full_df["user_id"].apply(lambda x: get_tfidf_token(x))
or
def get_tfidf_token(user_id):
    tags = []
    v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
    for words, values in v:
        tags.append(words)
    return tags
or
def get_tfidf_token(user_id):
    v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
    tags = [words for words in v]
    return tags
U2tfifd is a dict with keys = user_id, values = an array
There are several things going on which could cause poor performance in your code. The impact of each of these will depend on things like your Python version (2.x or 3.x), your RAM speed, and whatnot. You'll need to experiment and benchmark the various potential improvements yourself.
1. TFIDF Sparsity (~10x speedup depending on sparsity)
One glaring potential problem is that TFIDF naturally returns sparse data (e.g. a paragraph doesn't use anywhere near as many unique words as an entire book), and working with dense structures like numpy arrays is a strange choice when the data is probably zero almost everywhere.
If you'll be doing this same analysis in the future, it might be helpful to make/use a version of TFIDF with sparse array outputs so that when you extract your tokens you can skip over the zero values. This would likely have the secondary benefit of the entire sparse array for each user fitting in the cache and preventing costly RAM access in your sorts and other operations.
It might be worth sparsifying your data anyway. On my potato, a quick benchmark on data which should be similar to yours indicates that the process can be done in ~30s. The process replaces much of the work you're doing with a highly optimized routine coded in C and wrapped for use in Python. The only real cost is the second pass through the non-zero entries, but unless that pass is pretty efficient to begin with you should be better off working with sparse data.
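A rough sketch of that idea (assuming scipy is available and that U2tfifd[user_id] is a dense 1-D TFIDF vector as in the question) could look like:
import numpy as np
from scipy import sparse

def get_tfidf_token_sparse(user_id):
    row = sparse.csr_matrix(U2tfifd[user_id].reshape(1, -1))
    cols = row.indices                      # column indices of the non-zero entries
    vals = row.data                         # their TFIDF values
    top = cols[np.argsort(vals)[-15:]]      # only the non-zero values ever get sorted
    return [tfidf_feature_names[i] for i in top]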
2. Duplicated Efforts and Memoization (~100x speedup)
If U2tfifd has 418 entries and interactions_full_df has 40733 rows then at least 40315 (or 99.0%) of your calls to get_tfidf_token() are wasted since you've already computed the answer. There are tons of memoization decorators out there, but you don't need anything very complicated for your use case.
def memoize(f):
    _cache = {}
    def _f(arg):
        if arg not in _cache:
            _cache[arg] = f(arg)
        return _cache[arg]
    return _f

@memoize
def get_tfidf_token(user_id):
    ...
Breaking this down, the function memoize() returns another function. The behavior of that function is to check a local cache for the expected return value before computing it and storing it if necessary.
The syntax @memoize... is short for something like the following.
def uncached_get_tfidf_token(user_id):
    ...

get_tfidf_token = memoize(uncached_get_tfidf_token)
The @ symbol is used to signify that we want the modified, or decorated, version of get_tfidf_token() instead of the original. Depending on your application, it might be beneficial to chain decorators together.
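As an aside, the standard library already ships one of those ready-made memoization decorators; functools.lru_cache (Python 3.2+) would work here as long as user_id is hashable:
from functools import lru_cache

@lru_cache(maxsize=None)
def get_tfidf_token(user_id):
    ...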
3. Vectorized Operations (varying speedup, benchmarking necessary)
Python doesn't really have a notion of primitive types like other languages, and even integers take 24 bytes in memory on my machine. Lists aren't usually packed, so you can incur costly cache misses as you're plowing through them. No matter how little work the CPU is doing for sorting and whatnot, clobbering a whole new chunk of memory to turn your array into a list and only using that brand new, expensive memory once is going to incur a performance hit.
Many of the things you are trying to do have fast (SIMD vectorized, parallelized, memory-efficient, packed memory, and other fun optimizations) numpy equivalents AND avoid unnecessary array copies and type conversions. It seems you're already using numpy anyway, so you won't have any extra imports or dependencies.
As one example, zip() creates another list in memory in Python 2.x and still does unnecessary work in Python 3.x when you really only care about the indices of tfidf_feature_names. To compute those indices, you can use something like the following, which avoids an unnecessary list creation and uses an optimized routine with slightly better asymptotic complexity as an added bonus.
def get_tfidf_token(user_id):
    temp = U2tfifd[user_id].flatten()
    ind = np.argpartition(temp, len(temp)-15)[-15:]
    return tfidf_feature_names[ind]                   # works if tfidf_feature_names is a numpy array
    return [tfidf_feature_names[i] for i in ind]      # always works
Depending on the shape of U2tfifd[user_id], you could avoid the costly .flatten() computation by passing an axis argument to np.argsort() and flattening the 15 obtained indices instead.
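A sketch of that variant (hypothetical, assuming U2tfifd[user_id] has shape (1, n)):
def get_tfidf_token(user_id):
    arr = U2tfifd[user_id]                      # no .flatten() of the data itself
    ind = np.argsort(arr, axis=-1)[0, -15:]     # indices of the 15 largest values
    return [tfidf_feature_names[i] for i in ind]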
4. Bonus
The sorted() function supports a reverse argument so that you can avoid extra computations like throwing a negative on every value. Simply use
sorted(..., reverse=True)
Even better, since you really don't care about the sort itself but just the 15 largest values you can get away with
sorted(...)[-15:]
to index the largest 15 instead of reversing the sort and taking the smallest 15. That doesn't really matter if you're using a better function for the application like np.argpartition(), but it could be helpful in the future.
You can also avoid some function calls by replacing .apply(lambda x : get_tfidf_token(x)) with .apply(get_tfidf_token) since get_tfidf_token is already a function which has the intended behavior. You don't really need the extra lambda.
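In other words, the assignment from the question becomes simply:
interactions_full_df["tags"] = interactions_full_df["user_id"].apply(get_tfidf_token)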
As far as I can see though, most additional gains are fairly nitpicky and system-dependent. You can make most things faster with Cython or straight C with enough time for example, but you already have reasonably fast routines which do what you want out of the box. The extra engineering effort probably isn't worth any potential gains.

Timeit showing that regular python is faster than numpy?

I'm working on a piece of code for a game that calculates the distances between all the objects on the screen using their in-game coordinate positions. Originally I was going to use basic Python and lists to do this, but since the number of distances that need to be calculated grows quadratically with the number of objects, I thought it might be faster to do it with numpy.
I'm not very familiar with numpy, and I've been experimenting on basic bits of code with it. I wrote a little bit of code to time how long it takes for the same function to complete a calculation in numpy and in regular Python, and numpy seems to consistently take a good bit more time than the regular python.
The function is very simple. It starts with 1.1 and then increments 200,000 times, adding 0.1 to the last value and then finding the square root of the new value. It's not what I'll actually be doing in the game code, which will involve finding total distance vectors from position coordinates; it's just a quick test I threw together. I already read here that the initialization of arrays takes more time in NumPy, so I moved the initializations of both the numpy and python arrays outside their functions, but Python is still faster than numpy.
Here is the bit of code:
#!/usr/bin/python3
import numpy
from timeit import timeit
#from time import process_time as timer
import math
thing = numpy.array([1.1,0.0], dtype='float')
thing2 = [1.1,0.0]
def NPFunc():
    for x in range(1, 200000):
        thing[0] += 0.1
        thing[1] = numpy.sqrt(thing[0])
    print(thing)
    return None

def PyFunc():
    for x in range(1, 200000):
        thing2[0] += 0.1
        thing2[1] = math.sqrt(thing2[0])
    print(thing2)
    return None
print(timeit(NPFunc, number=1))
print(timeit(PyFunc, number=1))
It gives this result, which indicates normal Python is 3x faster:
[ 20000.99999999 141.42489173]
0.2917748889885843
[20000.99999998944, 141.42489172698504]
0.10341173503547907
Am I doing something wrong, or is this calculation just so simple that it isn't a good test for numpy?
Am I doing something wrong, or is this calculation just so simple that it isn't a good test for NumPy?
It's not really that the calculation is simple, but that you're not taking any advantage of NumPy.
The main benefit of NumPy is vectorization: you can apply an operation to every element of an array in one go, and whatever looping is needed happens inside some tightly-optimized C (or Fortran or C++ or whatever) loop inside NumPy, rather than in a slow generic Python iteration.
But you're only accessing a single value, so there's no looping to be done in C.
On top of that, because the values in an array are stored as "native" values, NumPy functions don't need to unbox them, pulling the raw C double out of a Python float, and then re-box them in a new Python float, the way any Python math functions have to.
But you're not doing that either. In fact, you're doubling that work: you're pulling the value out of the array as a float (boxing it), then passing it to a function (which has to unbox it, and then rebox it to return a result), then storing it back in an array (unboxing it again).
And meanwhile, because np.sqrt is designed to work on arrays, it has to first check the type of what you're passing it and decide whether it needs to loop over an array or unbox and rebox a single value or whatever, while math.sqrt just takes a single value. When you call np.sqrt on an array of 200000 elements, the added cost of that type switch is negligible, but when you're doing it every time through the inner loop, that's a different story.
So, it's not an unfair test.
You've demonstrated that using NumPy to pull out values one at a time, act on them one at a time, and store them back in the array one at a time is slower than just not using NumPy.
But, if you compare it to actually taking advantage of NumPy—e.g., by creating an array of 200000 floats and then calling np.sqrt on that array vs. looping over it and calling math.sqrt on each one—you'll demonstrate that using NumPy the way it was intended is faster than not using it.
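For example, a rough version of that comparison (not part of the question, just to illustrate the point) might be:
import math
import numpy as np
from timeit import timeit

values = 1.1 + 0.1 * np.arange(200000)   # 200000 floats

print(timeit(lambda: np.sqrt(values), number=100))                          # one vectorized call
print(timeit(lambda: [math.sqrt(v) for v in values.tolist()], number=100))  # element by element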
you're comparing it wrong
a_list = np.arange(0,20000,0.1)
timeit(lambda:np.sqrt(a_list),number=1)

python one line for loop over a function returning a tuple

I've tried searching quite a lot on this one, but being relatively new to python I feel I am missing the required terminology to find what I'm looking for.
I have a function:
def my_function(x, y):
    # code...
    return (a, b, c)
Where x and y are numpy arrays of length 2000 and the return values are integers. I'm looking for a shorthand (one-liner) to loop over this function as such:
Output = [my_function(X[i],Y[i]) for i in range(len(Y))]
Where X and Y are of the shape (135,2000). However, after running this I am currently having to do the following to separate out 'Output' into three numpy arrays.
Output = np.asarray(Output)
a = Output.T[0]
b = Output.T[1]
c = Output.T[2]
Which I feel isn't the best practice. I have tried:
(a,b,c) = [my_function(X[i],Y[i]) for i in range(len(Y))]
But this doesn't seem to work. Does anyone know a quick way around my problem?
my_function(X[i], Y[i]) for i in range(len(Y))
On the verge of crossing the "opinion-based" border, ...Y[i]... for i in range(len(Y)) is usually a big no-no in Python. It is even a bigger no-no when working with numpy arrays. One of the advantages of working with numpy is the 'vectorization' that it provides, and thus pushing the for loop down to the C level rather than the (slower) Python level.
So, if you rewrite my_function so it can handle the arrays in a vectorized fashion using the multiple tools and methods that numpy provides, you may not even need that "one-liner" you are looking for.
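As a purely hypothetical illustration (the question doesn't show what my_function computes), a vectorized replacement that returns three arrays at once could look like this, with no Python-level loop at all:
import numpy as np

def my_function_vec(X, Y):
    # hypothetical body: operate on the whole (135, 2000) arrays in one go
    s = X + Y
    return s.max(axis=1), s.min(axis=1), s.argmax(axis=1)

a, b, c = my_function_vec(X, Y)   # three arrays of length 135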

Numpy: vectorizing two-branch test (ternary-operator like)

I am vectorizing a test in Numpy for the following idea: perform elementwise some test and pick expr1 or expr2 according to the test. This is like the ternary-operator in C: test?expr1:expr2
I see two major ways for performing that; I would like to know if there is a good reason to choose one rather than the other one; maybe also other tricks are available and I would be very happy to know about them. Main goal is speed; for that reason I don't want to use np.vectorize with an if-else statement.
For my example, I will re-build the min function; please, don't tell me about some Numpy function for computing that; this is a mere example!
Idea 1: Use the arithmetic value of the booleans in a multiplication:
# a and b have similar shape
test = a < b
ntest = np.logical_not(test)
out = test*a + ntest*b
Idea 2: More or less following the APL/J style of coding (by using the conditional expression as an index for an array made with one dimension more than initial arrays).
# a and b have similar shape
np.choose(a<b, np.array([b,a]))
This is a better way to use choose
np.choose(a<b, [b,a])
In my small timings it is faster. Also, the choose docs say: "If choices is itself an array (not recommended), ...".
(a<b).choose([b,a])
saves one level of function redirection.
Another option:
out = b.copy(); out[test] = a[test]
In quick tests this is actually faster. masked.filled uses np.copyto for this sort of 'where' copy, though it doesn't seem to be any faster.
A variation on the choose is where:
np.where(test,a,b)
Or use where (or np.nonzero) to convert boolean index to a numeric one:
I = np.where(test); out = b.copy(); out[I] = a[I]
For some reason this benchmarks faster than the one-piece where.
I've used the multiplication approach in the past, if I recall correctly even with APL (though that's decades ago). An old trick to avoid divide-by-zero was to add b==0 to the denominator: a/(b+(b==0)). But it's not as generally applicable; a*0 and a*1 have to make sense.
choose looks nice, but with the mode parameter it may be more powerful (and hence complicated) than needed.
I'm not sure there is a 'best' way. timing tests can evaluate certain situations, but I don't know where they can be generalized across all cases.
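If you want to run such timings yourself, a rough benchmark along these lines (array size and dtype will change the results) is easy to put together:
import numpy as np
from timeit import timeit

a = np.random.rand(1000000)
b = np.random.rand(1000000)
test = a < b

def masked_copy():
    out = b.copy()
    out[test] = a[test]
    return out

print(timeit(lambda: test*a + np.logical_not(test)*b, number=100))  # multiplication
print(timeit(lambda: np.choose(test, [b, a]), number=100))          # choose
print(timeit(lambda: np.where(test, a, b), number=100))             # where
print(timeit(masked_copy, number=100))                              # boolean-index copy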

Force matrix_world to be recalculated in Blender

I'm designing an add-on for Blender that changes the location of certain vertices of an object. Every object in Blender has a matrix_world attribute that holds a matrix transforming the coordinates of the vertices from the object frame to the world frame.
print(object.matrix_world) # unit matrix (as expected)
object.location += mathutils.Vector((5,0,0))
object.rotation_quaternion *= mathutils.Quaternion((0.0, 1.0, 0.0), math.radians(45))
print(object.matrix_world) # Also unit matrix!?!
The above snippet shows that after the translation, you still have the same matrix_world. How can I force blender to recalculate the matrix_world?
You should call Scene.update after changing those values, otherwise Blender won't recalculate matrix_world until it's needed [somewhere else]. The reason, according to the "Gotchas" section in the API docs, is that this re-calc is an expensive operation, so it's not done right away:
Sometimes you want to modify values from python and immediately access the updated values, eg:
Once changing the objects bpy.types.Object.location you may want to access its transformation right after from bpy.types.Object.matrix_world, but this doesn’t work as you might expect.
Consider the calculations that might go into working out the objects final transformation, this includes:
animation function curves.
drivers and their pythons expressions.
constraints
parent objects and all of their f-curves, constraints etc.
To avoid expensive recalculations every time a property is modified, Blender defers making the actual calculations until they are needed.
However, while the script runs you may want to access the updated values.
This can be done by calling bpy.types.Scene.update after modifying values which recalculates all data that is tagged to be updated.
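Applied to the snippet from the question, that looks roughly like this (a minimal sketch for the 2.7x API, where Scene.update() exists):
import bpy, mathutils

obj = bpy.context.active_object
obj.location += mathutils.Vector((5, 0, 0))

bpy.context.scene.update()      # force the deferred recalculation
print(obj.matrix_world)         # now reflects the translation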
Calls to bpy.context.scene.update() can become expensive when called within a loop.
If your objects have no complex constraints (e.g. plain or parented), the following can be used to recompute the world matrix after changing the object's .location, .rotation_euler/.rotation_quaternion, or .scale.
def update_matrices(obj):
    if obj.parent is None:
        obj.matrix_world = obj.matrix_basis
    else:
        obj.matrix_world = obj.parent.matrix_world * \
                           obj.matrix_parent_inverse * \
                           obj.matrix_basis
Some notes:
Immediately after setting object location/rotation/scale the object's matrix_basis is updated
But matrix_local (when parented) and matrix_world are only updated during scene.update()
When matrix_world is manually recomputed (using the code above), matrix_local is recomputed as well
If the object is parented, then its world matrix depends on the parent's world matrix as well as the parent's inverse matrix at the time of creation of the parenting relationship.
I needed to do this too but needed this value to be updated whilst I imported a large scene with tens of thousands of objects.
Calling 'scene.update()' became exponentially slower, so I needed to find a way to do this without calling that function.
This is what I came up with:
def BuildScaleMatrix(s):
    return Matrix.Scale(s[0], 4, (1,0,0)) * Matrix.Scale(s[1], 4, (0,1,0)) * Matrix.Scale(s[2], 4, (0,0,1))

def BuildRotationMatrixXYZ(r):
    return Matrix.Rotation(r[2], 4, 'Z') * Matrix.Rotation(r[1], 4, 'Y') * Matrix.Rotation(r[0], 4, 'X')

def BuildMatrix(t, r, s):
    return Matrix.Translation(t) * BuildRotationMatrixXYZ(r) * BuildScaleMatrix(s)

def UpdateObjectTransform(ob):
    ob.matrix_world = BuildMatrix(ob.location, ob.rotation_euler, ob.scale)
This isn't the most efficient way to build a matrix (if you know of a better way in Blender, please add it), and it only works for XYZ-order rotations, but it avoids the exponential slow-downs when dealing with large data sets.
Accessing Object.matrix_world causes it to "freeze", even if you don't do anything with it, e.g.:
m = C.active_object.matrix_world
causes the matrix to be stuck. Whenever you want to access the matrix, use
Object.matrix_world.copy()
Only if you want to write to the matrix should you use
C.active_object.matrix_world = m
