So I've got some code
tensors = []  # it's filled with 3D float tensors
total = sum(tensors)
if I change that last line to
total = tf.add_n(tensors)
then the code produces the same output but runs much more slowly and soon causes
an out-of-memory exception. What's going on here? Can someone explain how Python's built-in sum function and tf.add_n each handle a list of tensors, and why Python's sum would seemingly just be a better version?
When you use sum, you call a standard Python function that applies __add__ repeatedly to the elements of the list. Since __add__ (or +) is overloaded on TensorFlow's tensors, it works as expected: it creates a graph that can be executed during a session. It is not optimal, however, because you add as many operations as there are elements in your list; you are also enforcing the order of the operations (add the first two elements, then add the third to the result, and so on), which is also not optimal.
By contrast, add_n is a specialized operation to do just that. Looking at the graph is really telling I think:
import tensorflow as tf
with tf.variable_scope('sum'):
    xs = [tf.zeros(()) for _ in range(10)]
    sum(xs)
with tf.variable_scope('add_n'):
    xs = [tf.zeros(()) for _ in range(10)]
    tf.add_n(xs)
However, contrary to what I thought earlier, add_n takes up more memory because it waits for, and stores, all incoming inputs before summing them. If the number of inputs is large, the difference can be substantial.
The behavior I was expecting from add_n, that is, summation of inputs as they are available, is actually achieved by tf.accumulate_n. This should be the superior alternative, as it takes less memory than add_n but does not enforce the order of summation like sum.
Why did the authors of tensorflow-wavenet use sum instead of tf.accumulate_n? Most likely because tf.accumulate_n is not differentiable on TF < 1.7. So if you have to support TF < 1.7 and be memory efficient, good old sum is actually quite a good option.
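To make the "one operation per element, in a fixed order" point concrete, here is a minimal pure-Python sketch. The Node class is a hypothetical stand-in for a graph tensor (it is not a TensorFlow type); it only records the expression that sum builds and how long the resulting chain of additions is.

```python
class Node:
    """Hypothetical stand-in for a graph tensor; records how it was built."""

    def __init__(self, name, depth=0):
        self.name = name
        self.depth = depth  # length of the longest chain of adds behind this node

    def __add__(self, other):
        other_name = other.name if isinstance(other, Node) else repr(other)
        other_depth = other.depth if isinstance(other, Node) else 0
        return Node("(%s + %s)" % (self.name, other_name),
                    max(self.depth, other_depth) + 1)

    def __radd__(self, other):
        # sum() starts from the int 0, so its first step is 0 + x0,
        # which falls back to x0.__radd__(0).
        return Node("(%r + %s)" % (other, self.name), self.depth + 1)


xs = [Node("x%d" % i) for i in range(4)]
total = sum(xs)
print(total.name)   # ((((0 + x0) + x1) + x2) + x3)
print(total.depth)  # 4
```

Four inputs give four chained additions, each depending on the previous one. That is the fixed ordering described above, and it is also why each intermediate result can be dropped as soon as the next addition has consumed it.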
The sum() built-in accepts any iterable and therefore would seem to gain the memory-profile advantage of working with generators.
The add_n() function takes a list of tensors and seems to retain that data structure throughout, based on its requirement for shape comparison.
In [29]: y = [1,2,3,4,5,6,7,8,9,10]
In [30]: y.__sizeof__()
Out[30]: 120
In [31]: x = iter(y)
In [32]: x.__sizeof__()
Out[32]: 32
Suppose I have an algorithm that does the following in python with pytorch. Please ignore whether the steps are efficient. This is just a silly toy example.
def foo(input_list):
    # input_list is a list of N 2-D pytorch tensors of shape (h,w)
    tensor = torch.stack(input_list) # convert to tensor.shape(h,w,N)
    tensor1 = torch.transpose(tensor,0,2).unsqueeze(1) # convert to tensor.shape(N,1,h,w)
    tensor2 = torch.nn.functional.interpolate(tensor1,size=(500,500)) # upsample to new shape (N,1,500,500)
def bar(input_list):
    tensor = torch.stack(input_list) # convert to tensor.shape(h,w,N)
    tensor = torch.transpose(tensor,0,2).unsqueeze(1) # convert to tensor.shape(N,1,h,w)
    tensor = torch.nn.functional.interpolate(tensor,size=(500,500)) # upsample to new shape (N,1,500,500)
My question is whether it makes more sense to use method foo() or bar() or if it doesn't matter. My thought was that I save memory by rewriting over the same variable name (bar), since I will never actually need those intermediate steps. But if the CUDA interface is creating new memory spaces for each function, then I'm spending the same amount of memory with both methods.
tensor and tensor1 in your example are just different views of the same data in memory, so the memory difference of potentially maintaining two slightly different references to it should be negligible. The relevant part would only be tensor1 vs tensor2.
You might want to see this similar question:
Python: is the "old" memory free'd when a variable is assigned new content?
Since the reassignment to tensor that actually allocates new memory is also the final call in bar, I suspect that in this particular example the total memory wouldn't be impacted (tensor1 would be unreferenced once the function returns anyway).
With a longer chain of operations, I don't think the GC is guaranteed to be called on any of these reassignments, though it might give python some more flexibility. I'd probably prefer the style in foo just because it's easier to later change the order of operations in the chain. Keeping track of different names adds overhead for the programmer, not just the interpreter.
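As a rough illustration of what happens when a name is rebound, here is a small sketch that uses no PyTorch at all. Buffer and step are hypothetical stand-ins for a large tensor and an operation that allocates a new one. The behaviour shown is specific to CPython's reference counting; other Python implementations may release the old object later.

```python
import weakref


class Buffer:
    """Hypothetical stand-in for a large tensor."""

    def __init__(self, label):
        self.label = label


def step(buf, label):
    # Pretend this allocates a new, larger result from buf.
    return Buffer(label)


x = Buffer("stacked")
old = weakref.ref(x)       # lets us observe the old object without keeping it alive

x = step(x, "upsampled")   # old and new objects both exist until this assignment completes
print(old() is None)       # True on CPython: the old Buffer was released on rebinding
```

Note that during the call itself both the input and the output are alive, so reusing the name lowers memory held after each step, not the peak reached inside a step.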
I'm trying to get the 15 most relevant items for each user, but every function I tried took an eternity (more than 6 hours; I shut it down after that).
I have 418 unique users and 3718 unique items.
The U2tfifd dict also has 418 entries, and there are 32645 words in tfidf_feature_names.
The shape of my interactions_full_df is (40733, 3).
I tried:
def index_tfidf_users(user_id):
    return [users for users in U2tfifd[user_id].flatten().tolist()]
def get_relevant_items(user_id):
    return sorted(zip(tfidf_feature_names, index_tfidf_users(user_id)), key=lambda x: -x[1])[:15]
def get_tfidf_token(user_id):
    return [words for words, values in get_relevant_items(user_id)]
then interactions_full_df["tags"] = interactions_full_df["user_id"].apply(lambda x : get_tfidf_token(x))
or
def get_tfidf_token(user_id):
    tags = []
    v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
    for words, values in v:
        tags.append(words)
    return tags
or
def get_tfidf_token(user_id):
    v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
    tags = [words for words in v]
    return tags
U2tfifd is a dict with keys = user_id, values = an array
There are several things going on which could cause poor performance in your code. The impact of each of these will depend on things like your Python version (2.x or 3.x), your RAM speed, and whatnot. You'll need to experiment and benchmark the various potential improvements yourself.
1. TFIDF Sparsity (~10x speedup depending on sparsity)
One glaring potential problem is that TFIDF naturally returns sparse data (e.g. a paragraph doesn't use anywhere near as many unique words as an entire book), and working with dense structures like numpy arrays is a strange choice when the data is probably zero almost everywhere.
If you'll be doing this same analysis in the future, it might be helpful to make/use a version of TFIDF with sparse array outputs so that when you extract your tokens you can skip over the zero values. This would likely have the secondary benefit of the entire sparse array for each user fitting in the cache and preventing costly RAM access in your sorts and other operations.
It might be worth sparsifying your data anyway. On my potato, a quick benchmark on data which should be similar to yours indicates that the process can be done in ~30s. The process replaces much of the work you're doing with a highly optimized routine coded in C and wrapped for use in Python. The only real cost is the second pass through the non-zero entries, but unless that pass is pretty efficient to begin with you should be better off working with sparse data.
2. Duplicated Efforts and Memoization (~100x speedup)
If U2tfifd has 418 entries and interactions_full_df has 40733 rows then at least 40315 (or 99.0%) of your calls to get_tfidf_token() are wasted since you've already computed the answer. There are tons of memoization decorators out there, but you don't need anything very complicated for your use case.
def memoize(f):
    _cache = {}
    def _f(arg):
        if arg not in _cache:
            _cache[arg] = f(arg)
        return _cache[arg]
    return _f
@memoize
def get_tfidf_token(user_id):
    ...
Breaking this down, the function memoize() returns another function. The behavior of that function is to check a local cache for the expected return value before computing it and storing it if necessary.
The syntax @memoize... is short for something like the following.
def uncached_get_tfidf_token(user_id):
    ...
get_tfidf_token = memoize(uncached_get_tfidf_token)
The @ symbol is used to signify that we want the modified, or decorated, version of get_tfidf_token() instead of the original. Depending on your application, it might be beneficial to chain decorators together.
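If you would rather not write the decorator yourself, the standard library's functools.lru_cache does the same job. The sketch below is illustrative only: the function body is a placeholder rather than the real TF-IDF lookup, and the calls counter exists purely to show how often the body actually runs.

```python
from functools import lru_cache

calls = 0  # counts how many times the function body really runs


@lru_cache(maxsize=None)  # unbounded cache, keyed on the (hashable) argument
def get_tfidf_token(user_id):
    global calls
    calls += 1
    # Placeholder for the expensive per-user computation.
    return ("tag-for-%s" % user_id,)


user_ids = [1, 2, 1, 1, 2, 3]  # repeated users, as in the interactions table
results = [get_tfidf_token(u) for u in user_ids]
print(calls)                         # 3: one real computation per distinct user
print(get_tfidf_token.cache_info())  # hit/miss counts kept by lru_cache
```

The cached value is returned by reference, so returning a tuple (or otherwise treating the result as read-only) avoids one caller accidentally modifying what another caller sees.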
3. Vectorized Operations (varying speedup, benchmarking necessary)
Python doesn't really have a notion of primitive types like other languages, and even integers take 24 bytes in memory on my machine. Lists aren't usually packed, so you can incur costly cache misses as you're plowing through them. No matter how little work the CPU is doing for sorting and whatnot, allocating a whole new chunk of memory to turn your array into a list, and then using that brand new, expensive memory only once, is going to incur a performance hit.
Many of the things you are trying to do have fast (SIMD vectorized, parallelized, memory-efficient, packed memory, and other fun optimizations) numpy equivalents AND avoid unnecessary array copies and type conversions. It seems you're already using numpy anyway, so you won't have any extra imports or dependencies.
As one example, zip() creates another list in memory in Python 2.x and still does unnecessary work in Python 3.x when you really only care about the indices of tfidf_feature_names. To compute those indices, you can use something like the following, which avoids an unnecessary list creation and uses an optimized routine with slightly better asymptotic complexity as an added bonus.
def get_tfidf_token(user_id):
    temp = U2tfifd[user_id].flatten()
    ind = np.argpartition(temp, len(temp)-15)[-15:]
    return tfidf_feature_names[ind] # works if tfidf_feature_names is a numpy array
    # return [tfidf_feature_names[i] for i in ind] # use this line instead if it is a plain list; always works
Depending on the shape of U2tfifd[user_id], you could avoid the costly .flatten() computation by passing an axis argument to np.argsort() and flattening the 15 obtained indices instead.
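One detail worth knowing is that np.argpartition only guarantees that the last k indices belong to the k largest values; it does not order them. If the tags should come out ranked, sorting just those k entries is cheap. A minimal sketch on made-up data (the names and weights below are illustrative, not taken from the question):

```python
import numpy as np


def top_k_indices(values, k):
    """Indices of the k largest entries, ordered from largest to smallest."""
    values = np.asarray(values).ravel()
    # Partial selection: the last k positions hold the k largest values, unordered.
    part = np.argpartition(values, len(values) - k)[-k:]
    # Full sort of only those k entries, then reverse for descending order.
    return part[np.argsort(values[part])[::-1]]


names = np.array(["alpha", "beta", "gamma", "delta", "epsilon"])
weights = np.array([0.1, 0.7, 0.0, 0.4, 0.9])

idx = top_k_indices(weights, 3)
print(names[idx].tolist())  # ['epsilon', 'beta', 'delta']
```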
4. Bonus
The sorted() function supports a reverse argument so that you can avoid extra computations like throwing a negative on every value. Simply use
sorted(..., reverse=True)
Even better, since you really don't care about the sort itself but just the 15 largest values you can get away with
sorted(...)[-15:]
to index the largest 15 instead of reversing the sort and taking the smallest 15. That doesn't really matter if you're using a better function for the application like np.argpartition(), but it could be helpful in the future.
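If you want to stay in pure Python, the standard library's heapq.nlargest picks the top k items without sorting the whole sequence. A small sketch on made-up data (top_k_tokens is a hypothetical helper name, and the sample names and weights are illustrative):

```python
import heapq


def top_k_tokens(feature_names, weights, k=15):
    """Return the k feature names with the largest weights, best first."""
    # Pair each weight with its name; tuples compare by weight first.
    pairs = heapq.nlargest(k, zip(weights, feature_names))
    return [name for _, name in pairs]


names = ["alpha", "beta", "gamma", "delta", "epsilon"]
weights = [0.1, 0.7, 0.0, 0.4, 0.9]
print(top_k_tokens(names, weights, k=3))  # ['epsilon', 'beta', 'delta']
```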
You can also avoid some function calls by replacing .apply(lambda x : get_tfidf_token(x)) with .apply(get_tfidf_token) since get_tfidf_token is already a function which has the intended behavior. You don't really need the extra lambda.
As far as I can see though, most additional gains are fairly nitpicky and system-dependent. You can make most things faster with Cython or straight C with enough time for example, but you already have reasonably fast routines which do what you want out of the box. The extra engineering effort probably isn't worth any potential gains.
I'm working on a piece of code for a game that calculates the distances between all the objects on the screen using their in-game coordinate positions. Originally I was going to use basic Python and lists to do this, but since the number of distances that will need calculated will increase exponentially with the number of objects, I thought it might be faster to do it with numpy.
I'm not very familiar with numpy, and I've been experimenting on basic bits of code with it. I wrote a little bit of code to time how long it takes for the same function to complete a calculation in numpy and in regular Python, and numpy seems to consistently take a good bit more time than the regular python.
The function is very simple. It starts with 1.1 and then increments 200,000 times, adding 0.1 to the last value and then finding the square root of the new value. It's not what I'll actually be doing in the game code, which will involve finding total distance vectors from position coordinates; it's just a quick test I threw together. I already read here that the initialization of arrays takes more time in NumPy, so I moved the initializations of both the numpy and python arrays outside their functions, but Python is still faster than numpy.
Here is the bit of code:
#!/usr/bin/python3
import numpy
from timeit import timeit
#from time import process_time as timer
import math
thing = numpy.array([1.1,0.0], dtype='float')
thing2 = [1.1,0.0]
def NPFunc():
    for x in range(1,200000):
        thing[0] += 0.1
        thing[1] = numpy.sqrt(thing[0])
    print(thing)
    return None
def PyFunc():
    for x in range(1,200000):
        thing2[0] += 0.1
        thing2[1] = math.sqrt(thing2[0])
    print(thing2)
    return None
print(timeit(NPFunc, number=1))
print(timeit(PyFunc, number=1))
It gives this result, which indicates normal Python is 3x faster:
[ 20000.99999999 141.42489173]
0.2917748889885843
[20000.99999998944, 141.42489172698504]
0.10341173503547907
Am I doing something wrong, or is this calculation just so simple that it isn't a good test for numpy?
Am I doing something wrong, or is this calculation just so simple that it isn't a good test for NumPy?
It's not really that the calculation is simple, but that you're not taking any advantage of NumPy.
The main benefit of NumPy is vectorization: you can apply an operation to every element of an array in one go, and whatever looping is needed happens inside some tightly-optimized C (or Fortran or C++ or whatever) loop inside NumPy, rather than in a slow generic Python iteration.
But you're only accessing a single value, so there's no looping to be done in C.
On top of that, because the values in an array are stored as "native" values, NumPy functions don't need to unbox them, pulling the raw C double out of a Python float, and then re-box them in a new Python float, the way any Python math functions have to.
But you're not doing that either. In fact, you're doubling that work: you're pulling the value out of the array as a float (boxing it), then passing it to a function (which has to unbox it, and then rebox it to return a result), then storing it back in an array (unboxing it again).
And meanwhile, because np.sqrt is designed to work on arrays, it has to first check the type of what you're passing it and decide whether it needs to loop over an array or unbox and rebox a single value or whatever, while math.sqrt just takes a single value. When you call np.sqrt on an array of 200000 elements, the added cost of that type switch is negligible, but when you're doing it every time through the inner loop, that's a different story.
So, it's not an unfair test.
You've demonstrated that using NumPy to pull out values one at a time, act on them one at a time, and store them back in the array one at a time is slower than just not using NumPy.
But, if you compare it to actually taking advantage of NumPy—e.g., by creating an array of 200000 floats and then calling np.sqrt on that array vs. looping over it and calling math.sqrt on each one—you'll demonstrate that using NumPy the way it was intended is faster than not using it.
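A sketch of that fairer comparison might look like the following. It builds roughly the same 199,999 inputs the question's loop visits (computed directly rather than by repeated addition, so the last digits can differ slightly) and times one vectorized call against a per-element loop. No timings are quoted here because they depend on the machine.

```python
import math
from timeit import timeit

import numpy as np

# Approximately the values the question's loop steps through: 1.2, 1.3, ...
values = 1.1 + 0.1 * np.arange(1, 200000)


def numpy_roots():
    # One call; the loop over elements runs in compiled code.
    return np.sqrt(values)


def python_roots():
    # One math.sqrt call per element, driven by the interpreter.
    return [math.sqrt(v) for v in values.tolist()]


print("numpy :", timeit(numpy_roots, number=10))
print("python:", timeit(python_roots, number=10))
```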
you're comparing it wrong
a_list = np.arange(0,20000,0.1)
timeit(lambda:np.sqrt(a_list),number=1)
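For the use case the question actually has in mind, distances between all pairs of objects, the same idea applies: let NumPy do the looping. The sketch below assumes positions are given as (x, y) pairs; pairwise_distances is a hypothetical helper name and the sample coordinates are made up.

```python
import numpy as np


def pairwise_distances(positions):
    """Return an (n, n) array where entry [i, j] is the distance from i to j."""
    positions = np.asarray(positions, dtype=float)            # shape (n, 2)
    # Broadcasting yields every coordinate difference at once: shape (n, n, 2).
    diff = positions[:, None, :] - positions[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


positions = [(0, 0), (3, 4), (6, 8)]
distances = pairwise_distances(positions)
print(distances)
```

This builds an n-by-n-by-2 temporary, so memory grows with the square of the number of objects; for a few hundred objects that is usually fine.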
When doing repeated operations in numpy/scipy, there's a lot of overhead because most operations return a new object.
For example
for i in range(100):
    x = A*x
I would like to avoid this by passing a reference to the operation, like you would in C
for i in range(100):
    np.dot(A,x,x_new) #x_new would now store the result of the multiplication
    x,x_new = x_new,x
Is there any way to do this? I would like this not just for multiplication but for all operations that return a matrix or a vector.
See Learning to avoid unnecessary array copies in IPython Books. From there, note e.g. these guidelines:
a *= b
will not produce a copy, whereas:
a = a * b
will produce a copy. Also, flatten() will copy, while ravel() only copies if necessary and returns a view otherwise (and thus should in general be preferred). reshape() also does not produce a copy, but returns a view.
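These guidelines can be checked directly with np.shares_memory. A small sketch on a made-up array (the variable names are illustrative):

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)
b = np.full((2, 3), 2.0)

original = a
a *= b                      # in place: a still refers to the same array object
in_place = a is original

c = a * b                   # allocates a new array for the result

flat_copy = a.flatten()     # always a copy
flat_view = a.ravel()       # a view here, because a is contiguous

print(in_place)                        # True
print(np.shares_memory(a, c))          # False
print(np.shares_memory(a, flat_copy))  # False
print(np.shares_memory(a, flat_view))  # True
```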
Furthermore, as @hpaulj and @ali_m noted in their comments, many numpy functions support an out parameter, so have a look at the docs. From numpy.dot() docs:
out : ndarray, optional
Output argument.
This must have the exact kind that would be returned if it was not used. In particular, it must have the right type, must be C-contiguous, and its dtype must be the dtype that would be returned for dot(a,b). This is a performance feature. Therefore, if these conditions are not met, an exception is raised, instead of attempting to be flexible.
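Applied to the loop in the question, this might look like the sketch below. The matrix and vector are made-up examples. The two buffers must be distinct arrays, since writing the result into the array that is still being read as input would not be safe.

```python
import numpy as np

A = np.array([[0.5, 0.5],
              [0.25, 0.75]])
x = np.array([1.0, 0.0])
x0 = x.copy()                 # kept only to check the result afterwards
x_new = np.empty_like(x)      # preallocated output buffer: same shape and dtype

for _ in range(100):
    np.dot(A, x, out=x_new)   # writes A @ x into x_new, no new allocation
    x, x_new = x_new, x       # swap roles: the result becomes the next input

print(x)
```

Only two vectors are ever allocated, however many iterations run.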
I'm running a model in Python and I'm trying to speed up the execution time. Through profiling the code I've found that a huge amount of the total processing time is spent in the cell_in_shadow function below. I'm wondering if there is any way to speed it up?
The aim of the function is to provide a boolean response stating whether the specified cell in the NumPy array is shadowed by another cell (in the x direction only). It does this by stepping backwards along the row checking each cell against the height it must be to make the given cell in shadow. The values in shadow_map are calculated by another function not shown here - for this example, take shadow_map to be an array with values similar to:
[0] = 0 (not used)
[1] = 3
[2] = 7
[3] = 18
The add_x function is used to ensure that the array indices loop around (using clock-face arithmetic), as the grid has periodic boundaries (anything going off one side will re-appear on the other side).
def cell_in_shadow(x, y):
    """Returns True if the specified cell is in shadow, False if not."""
    # Get the global variables we need
    global grid
    global shadow_map
    global x_len
    # Record the original length and move to the left
    orig_x = x
    x = add_x(x, -1)
    while x != orig_x:
        # Gets the height that's needed from the shadow_map (the array index is the distance using clock-face arithmetic)
        height_needed = shadow_map[( (x - orig_x) % x_len)]
        if grid[y, x] - grid[y, orig_x] >= height_needed:
            return True
        # Go to the cell to the left
        x = add_x(x, -1)
def add_x(a, b):
    """Adds the two numbers using clockface arithmetic with the x_len"""
    global x_len
    return (a + b) % x_len
I do agree with Sancho that Cython will probably be the way to go, but here are a couple of small speed-ups:
A. Store grid[y, orig_x] in some variable before you start the while loop and use that variable instead. This will save a bunch of look-up calls to the grid array.
B. Since you are basically just starting at x_len - 1 in shadow_map and working down to 1, you can avoid using the modulus so much. Basically, change:
while x != orig_x:
    height_needed = shadow_map[( (x - orig_x) % x_len)]
to
for i in xrange(x_len-1,0,-1):
    height_needed = shadow_map[i]
or just get rid of the height_needed variable altogether with:
    if grid[y, x] - grid[y, orig_x] >= shadow_map[i]:
These are small changes, but they might help a little bit.
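Putting suggestions A and B together might look like the first function below. The second function is an additional NumPy variant that checks a whole row in one expression; it is not something proposed above. Both take grid and shadow_map as arguments instead of globals, which is an assumption made for this sketch, as are the sample values.

```python
import numpy as np


def cell_in_shadow_loop(grid, shadow_map, orig_x, y):
    x_len = grid.shape[1]
    base_height = grid[y, orig_x]          # suggestion A: look this up once
    for i in range(x_len - 1, 0, -1):      # suggestion B: i is the shadow_map index
        x = (orig_x + i) % x_len           # i == x_len - 1 is the cell just to the left
        if grid[y, x] - base_height >= shadow_map[i]:
            return True
    return False


def cell_in_shadow_vectorized(grid, shadow_map, orig_x, y):
    x_len = grid.shape[1]
    i = np.arange(1, x_len)                # every shadow_map index except 0
    cols = (orig_x + i) % x_len            # the matching column for each index
    return bool(np.any(grid[y, cols] - grid[y, orig_x] >= shadow_map[i]))


grid = np.array([[0, 5, 0, 0],
                 [0, 0, 0, 0]])
shadow_map = np.array([0, 3, 7, 18])

print(cell_in_shadow_loop(grid, shadow_map, 0, 0))        # True
print(cell_in_shadow_vectorized(grid, shadow_map, 0, 0))  # True
```

The loop version can stop at the first shadowing cell, while the vectorized one always inspects the whole row, so which is faster depends on the row length and on how often cells are in shadow.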
Also, if you plan on going the Cython route, I would consider having your function do this process for the whole grid, or at least a row at a time. That will save a lot of the function call overhead. However, you might not be able to really do this depending on how you are using the results.
Lastly, have you tried using Psyco? It takes less work than Cython though it probably won't give you quite as big of a speed boost. I would certainly try it first.
If you're not limited to strict Python, I'd suggest using Cython for this. It can allow static typing of the indices and efficient, direct access to a numpy array's underlying data buffer at c speed.
Check out a short tutorial/example at http://wiki.cython.org/tutorials/numpy
In that example, which is doing operations very similar to what you're doing (incrementing indices, accessing individual elements of numpy arrays), adding type information to the index variables cut the time in half compared to the original. Adding efficient indexing into the numpy arrays by giving them type information cut the time to about 1% of the original.
Most Python code is already valid Cython, so you can just use what you have and add annotations and type information where needed to give you some speed-ups.
I suspect you'd get the most out of adding type information to your indices x, y, orig_x and to the numpy arrays.
The following guide compares several different approaches to optimising numerical code in python:
Scipy PerformancePython
It is a bit out of date, but still helpful. Note that it refers to pyrex, which has since been forked to create the Cython project, as mentioned by Sancho.
Personally I prefer f2py, because I think that fortran 90 has many of the nice features of numpy (e.g. adding two arrays together with one operation), but has the full speed of compiled code. On the other hand if you don't know fortran then this may not be the way to go.
I briefly experimented with cython, and the trouble I found was that by default cython generates code which can handle arbitrary python types, but which is still very slow. You then have to spend time adding all the necessary cython declarations to get it to be more specific and fast, whereas if you go with C or fortran then you will tend to get fast code straight out of the box. Again this is biased by me already being familiar with these languages, whereas Cython may be more appropriate if Python is the only language you know.