speed up function based on list comprehension - python

I'm trying to get the 15 most relevant item for each users but every functions i tried took an eternity. (more than 6 hours i shutdown it after that ...)
I have 418 unique users, 3718 unique items.
U2tfifd dict has as well 418 entry and there is 32645 words in tfidf_feature_names.
Shape of my interactions_full_df is (40733, 3)
i tried :
def index_tfidf_users(user_id) :
return [users for users in U2tfifd[user_id].flatten().tolist()]
def get_relevant_items(user_id):
return sorted(zip(tfidf_feature_names, index_tfidf_users(user_id)), key=lambda x: -x[1])[:15]
def get_tfidf_token(user_id) :
return [words for words, values in get_relevant_items(user_id)]
then interactions_full_df["tags"] = interactions_full_df["user_id"].apply(lambda x : get_tfidf_token(x))
or
def get_tfidf_token(user_id) :
tags = []
v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
for words, values in v :
tags.append(words)
return tags
or
def get_tfidf_token(user_id) :
v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
tags = [words for words in v]
return tags
U2tfifd is a dict with keys = user_id, values = an array

There are several things going on which could cause poor performance in your code. The impact of each of these will depend on things like your Python version (2.x or 3.x), your RAM speed, and whatnot. You'll need to experiment and benchmark the various potential improvements yourself.
1. TFIDF Sparsity (~10x speedup depending on sparsity)
One glaring potential problem is that TFIDF naturally returns sparse data (e.g. a paragraph doesn't use anywhere near as many unique words as an entire book), and working with dense structures like numpy arrays is a strange choice when the data is probably zero almost everywhere.
If you'll be doing this same analysis in the future, it might be helpful to make/use a version of TFIDF with sparse array outputs so that when you extract your tokens you can skip over the zero values. This would likely have the secondary benefit of the entire sparse array for each user fitting in the cache and preventing costly RAM access in your sorts and other operations.
It might be worth sparsifying your data anyway. On my potato, a quick benchmark on data which should be similar to yours indicates that the process can be done in ~30s. The process replaces much of the work you're doing with a highly optimized routine coded in C and wrapped for use in Python. The only real cost is the second pass through the non-zero entries, but unless that pass is pretty efficient to begin with you should be better off working with sparse data.
2. Duplicated Efforts and Memoization (~100x speedup)
If U2tfifd has 418 entries and interactions_full_df has 40733 rows then at least 40315 (or 99.0%) of your calls to get_tfidf_token() are wasted since you've already computed the answer. There are tons of memoization decorators out there, but you don't need anything very complicated for your use case.
def memoize(f):
_cache = {}
def _f(arg):
if arg not in _cache:
_cache[arg] = f(arg)
return _cache[arg]
return _f
#memoize
def get_tfidf_token(user_id):
...
Breaking this down, the function memoize() returns another function. The behavior of that function is to check a local cache for the expected return value before computing it and storing it if necessary.
The syntax #memoize... is short for something like the following.
def uncached_get_tfidf_token(user_id):
...
get_tfidf_token = memoize(uncached_get_tfidf_token)
The # symbol is used to signify that we want the modified, or decorated, version of get_tfidf_token() instead of the original. Depending on your application, it might be beneficial to chain decorators together.
3. Vectorized Operations (varying speedup, benchmarking necessary)
Python doesn't really have a notion of primitive types like other languages, and even integers take 24 bytes in memory on my machine. Lists aren't usually be packed, so you can incur costly cache misses as you're plowing through them. No matter how little work the CPU is doing for sorting and whatnot, clobbering a whole new chunk of memory to turn your array into a list and only using that brand new, expensive memory once is going to incur a performance hit.
Many of the things you are trying to do have fast (SIMD vectorized, parallelized, memory-efficient, packed memory, and other fun optimizations) numpy equivalents AND avoid unnecessary array copies and type conversions. It seems you're already using numpy anyway, so you won't have any extra imports or dependencies.
As one example, zip() creates another list in memory in Python 2.x and still does unnecessary work in Python 3.x when you really only care about the indices of tfidf_feature_names. To compute those indices, you can use something like the following, which avoids an unnecessary list creation and uses an optimized routine with slightly better asymptotic complexity as an added bonus.
def get_tfidf_token(user_id):
temp = U2tfifd[user_id].flatten()
ind = np.argpartition(temp, len(temp)-15)[-15:]
return tfidf_feature_names[ind] # works if tfidf_feature_names is a numpy array
return [tfidf_feature_names[i] for i in ind] # always works
Depending on the shape of U2tfifd[user_id], you could avoid the costly .flatten() computation by passing an axis argument to np.argsort() and flattening the 15 obtained indices instead.
4. Bonus
The sorted() function supports a reverse argument so that you can avoid extra computations like throwing a negative on every value. Simply use
sorted(..., reverse=True)
Even better, since you really don't care about the sort itself but just the 15 largest values you can get away with
sorted(...)[-15:]
to index the largest 15 instead of reversing the sort and taking the smallest 15. That doesn't really matter if you're using a better function for the application like np.argpartition(), but it could be helpful in the future.
You can also avoid some function calls by replacing .apply(lambda x : get_tfidf_token(x)) with .apply(get_tfidf_token) since get_tfidf_token is already a function which has the intended behavior. You don't really need the extra lambda.
As far as I can see though, most additional gains are fairly nitpicky and system-dependent. You can make most things faster with Cython or straight C with enough time for example, but you already have reasonably fast routines which do what you want out of the box. The extra engineering effort probably isn't worth any potential gains.

Related

scala slower than python in constructing a set

I am learning scala by converting some of my python code to scala code. I just encountered an issue where the python code is significantly outperforming the scala code. The code is supposed to construct a set of candidate pairs based on some conditions. Scala has comparable runtime performance with python for all previous parts.
The id_map is an array of map from Long to set of string. The average number of k-v pairs in the map is 1942.
The scala code snippet is below:
// id_map Array[mutable.Map[Long, Set[String]]
val candidate_pairs = id_map
.flatMap(hashmap => hashmap.values)
.filter(_.size >= 2)
.flatMap(strset => strset.toList.combinations(2))
.map(_.sorted)
.toSet
and the corresponding python code is
candidate_pairs = set()
for hashmap in id_map.values():
for strset in hashmap.values():
if len(strset) >= 2:
for pair in combinations(strset, 2):
candidate_pairs.add(tuple(sorted(pair)))
The scala code snippet takes 80 seconds while python version takes 10 seconds.
I am wondering what can I optimize the above code to make it faster. What I have been trying is updating the set using the for loop
var candidate_pairs = Set.empty[List[String]]
for (
hashmap: mutable.Map[Long, Set[String]] <- id_map;
setstr: Set[String] <- hashmap.values if setstr.size >= 2;
pair <- setstr.toList.combinations(2)
)
candidate_pairs += pair.sorted
and although the candidate_pairs is updated a lot of time and each time it creates a new set, it actually is faster than the previous scala version, and takes about 50 seconds, still worse than python though. I tried using mutable set but however the result is about the same as the immutable version.
Any help would be appreciated! Thanks!
Being slower than python sounds ... surprising.
First of all, make sure you have adequate memory settings, and it is not spending half of those 80 seconds in GC.
Also, be sure to "warm up" the JVM (run your function a few times before doing actual measurement), use the same exact data for runs in python and scala (not just same statistics, exactly the same data), and do not include the time spent acquiring/generating data into measurement. Make several runs and compare average time, not how much a single run took.
Having said that, a few ways to make your code faster:
Adding .view (or .iterator) after id_map in your implementation cuts the execution time by about factor of 4 in my experiments.
(.view makes your chained transformation applied "lazily" – essentially, making a single pass through the single instance of array instead of multiple with multiple copies).
- Replacing .map(_.sorted) with
.map {
case List(a,b) if a < b => (a,b)
case List(a,b) => (b, a)
}
Shaves off about another 75% (sorting two element lists is mostly overhead).
This changes the return type to tuples rather than lists (constructing lots of tiny lists also adds up), but this seems even more appropriate in this case actually.
– Removing .filter(_.size >= 2) (it is redundant anyway, and computing size of a collection may get expensive) yields further improvement, but fairly small, that I did not bother to measure exactly.
Additionally, it may be cheaper to get rid of the separate sort step altogether, and just add .sorted before .combinations. I have not tested it, because it would be futile without knowing more details about your data profile.
These are some general improvements that should improve your performance either way, though it is hard to be sure you'll see the same effect as I do, as I don't really know anything about your data beyond that average map size, the improvement you see might be even better than mine, or it could be somewhat smaller ... but you should see some.
I ran this version with some test Scala code I created. On a list of 1944 elements, it completed in about 15 ms on my laptop.
id_map
.flatMap(hashmap => hashmap.values)
.flatMap { strset =>
if (strset.size >= 2) {
strset.toIndexedSeq.combinations(2)
} else IndexedSeq.empty
}.map(_.sorted).toSet
Main changes I have are to use an IndexedSeq instead of a List (which is a LinkedList), and to do the filter on the fly.
I assume you didn't want to hyper optimize, in which case you could still remove a lot of the intermediate collections created in the flatMap, combinations, conversion to IndexedSeq and toSet call.

Is it better for performance to use min() mutliple times or to store it in a variable?

I have a small python code that uses min(list) multiple times on the same unchanged list, this got me wondering if I use the result of functions like min(), len(), etc... multiple times on an unchanged list is it better for me to store those results in variables, does it affect memory usage / performance at all?
If I had to guess I'd say that if a function like min() gets called many times it'd be better for performance to store it in a variable, but I am not sure of this since I don't really know how python gets the value or if python automatically stores this value somewhere as long as the list isn't changed.
Speed
It is almost always cheaper to store the result and re-use it rather than re-call the function multiple times.
Python does not cache (store and later remember) results from functions like min(), len(), etc.
Here is a quick speed test:
timeit.timeit("c = min(x) + min(x)", "x = [1, 2, 3]")
0.24990593400000005
timeit.timeit("a = min(x); b = a + a", "x = [1, 2, 3]")
0.1296667110000005
The second is almost twice as fast, because storing a variable is much cheaper than re-calling the min function.
Memory use
If the result is a single number, as with min() or len(), then memory use is negligible.
If the result is something substantial (e.g. a large table of values), then you can remove it when you're done with it using del
large_object = expensive_function()
do_something(large_object)
do_something_else(large_object)
del large_object
Also, large objects will automatically be deleted from memory when they fall out of scope (e.g. when a function returns) or when garbage collection rounds happen at regular intervals. For this reason, del is only necessary in certain circumstances like when dealing with circular references to an object.
min() is very fast compared to many other operations, such as I/O. So the efficiency improvements could be small for short lists and only a few repeated calls. However, if you cache the results of min(), you can realize some time savings. See the code below for examples of time you can actually save. As you can see, you need multiple iterations of the loops that contain min() calls to get any substantial the time savings.
import timeit
lst = range(2)
def test_min():
x = [min(lst) for i in range(10)]
def test_cached_min():
min_lst = min(lst)
x = [min_lst for i in range(10)]
print(timeit.timeit("test_min()", globals = locals(), number = 1000))
print(timeit.timeit("test_cached_min()", globals = locals(), number = 1000))
# lst = range(2):
# 0.0027015960000000006
# 0.0010772920000000005
# lst = range(2000):
# 0.5262554810000001
# 0.05257684900000004
functions like min or max definitely have to traverse the array each time (giving them a complexity of O(n)). So yeah, specially if your array is larger, it's a better idea to store it in a variable rather than performing the calculation again.
More details about performance in another question
Time Complexity of Python List Operations
Source
The table shows that:
function len (to get length) has complexity O(1) (so very fast, so already stored)
function min (to get minimum) is O(n) (depends upon size of list, so computed each time).
This means that:
len does not need to be stored for reuse
min should be stored for reuse (especially for large lists)
If you are only using it 1-5 times, it doesn't really matter. But if you are going to call it anymore, and really less too, it is best to just save it as a variable. It will take next to no memory, and very little time to do so and to pull it from memory.

python sum on array of tensors vs tf.add_n

So I've got some code
tensors = [] //its filled with 3D float tensors
total = sum(tensors)
if I change that last line to
total = tf.add_n(tensors)
then the code produces the same output but runs much more slowly and soon causes
an out-of-memory exception. Whats going on here? Can someone explain how pythons built in sum function and tf.add_n interact with an array of tensors respectively and why pythons sum would seemingly just be a better version?
When you use sum, you call a standard python algorithm that call __add__ recursively on the elements of the array. Since __add__ (or +) indeed is overloaded on tensorflow's tensors, it works as expected: it creates a graph that can be executed during a session. It is not optimal, however, because you add as many operation as there are elements in your list; also, you are enforcing the order of the operation (add the first two elements, then the third to the result, and so on), which is also not optimal.
By contrast, add_n is a specialized operation to do just that. Looking at the graph is really telling I think:
import tensorflow as tf
with tf.variable_scope('sum'):
xs = [tf.zeros(()) for _ in range(10)]
sum(xs)
with tf.variable_scope('add_n'):
xs = [tf.zeros(()) for _ in range(10)]
tf.add_n(xs)
However – contrary to what I thought earlier – add_n takes up more memory because it waits – and store – for all incoming inputs before storing them. If the number of inputs is large, then the difference can be substantial.
The behavior I was expecting from add_n, that is, summation of inputs as they are available, is actually achieved by tf.accumulate_n. This should be the superior alternative, as it takes less memory than add_n but does not enforce the order of summation like sum.
Why did the authors of tensorflow-wavenet used sum instead of tf.accumulate_n? Certainly because before this function is not differentiable on TF < 1.7. So if you have to support TF < 1.7 and be memory efficient, good old sum is actually quite a good option.
The sum() built-in only takes iterables and therefor would seem to gain the advantage of using generators in regards to memory profile.
the add_n() function for tensor takes a list of tensors and seem to retain that data structure throughout handling based on it's requirement for shape comparison.
In [29]: y = [1,2,3,4,5,6,7,8,9,10]
In [30]: y.__sizeof__()
Out[30]: 120
In [31]: x = iter(y)
In [32]: x.__sizeof__()
Out[32]: 32

Python :: Iteration vs Recursion on string manipulation

In the examples below, both functions have roughly the same number of procedures.
def lenIter(aStr):
count = 0
for c in aStr:
count += 1
return count
or
def lenRecur(aStr):
if aStr == '':
return 0
return 1 + lenRecur(aStr[1:])
Picking between the two techniques is a matter of style or is there a most efficient method here?
Python does not perform tail call optimization, so the recursive solution can hit a stack overflow on long strings. The iterative method does not have this flaw.
That said, len(str) is faster than both methods.
This is not correct: 'functions have roughly the same number of procedures'. You probably mean that: 'these procedures require the same number of operations', or, more formally 'they have the same computational time complexity'.
While both have the same computational time complexity, the one using recursion requires additional CPU instructions to execute code for creating new instances of procedures during recursion, and to switch contexts. And to clean up after returning from every recursion. While these operations do not increase the theoretical computational complexity, in most real life implementations of operating systems they will put significant load.
Also the resursive method will have higher space complexity, as each new instance of recursively-called procedure needs new storage for its data.
Surely the first approach is more optimized, as python doesn't have to do a lot of function call and string slicing, which each of these operations are contain some other operations that cost much for python interpreter, and may be cause a lot of problems in future and in dealing with log strings.
As a more pythonic way you better to use len() function in order to get the length of a string.
You can also use code object to see the required stack sized for each function:
>>> lenRecur.__code__.co_stacksize
4
>>> lenIter.__code__.co_stacksize
3

List comprehension is sorting autmatically [duplicate]

The question arose when answering to another SO question (there).
When I iterate several times over a python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale of changing the order ? Is it deterministic, or random? Or implementation defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is if python set iteration order only depends on the algorithm used to implement sets, or also on the execution context?
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it inbetween the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call inbetween could produce very hard to find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python) integers less than the machine word size (32 bit or 64 bit) hash to themself, but text strings, bytes strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime
objects are “salted” with an unpredictable random value. Although they
remain constant within an individual Python process, they are not
predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service
caused by carefully-chosen inputs that exploit the worst case
performance of a dict insertion, O(n^2) complexity. See
http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and
other mappings. Python has never made guarantees about this ordering
(and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle
seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)
while True:
shuffle(data)
lb = list(frozenset(data))
if lb != la:
print(''.join(data), ''.join(lb))
break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
And when I call the same python
program repeatedly (not random, not
input dependent), will I get the same
ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object) :
def __init__(self,val) :
self.val = val
def __repr__(self) :
return str(self.val)
x = set()
for y in range(500) :
x.add(Foo(y))
print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent so I should say that I'm running the macports Python2.6 on snow-leopard. While the program will output the same answer for long runs of time, doing something that affects the system entropy pool (writing to the disk mostly works) will somethimes kick it into a different output.
The class Foo is just a simple int wrapper as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
The answer is simply a NO.
Python set operation is NOT stable.
I did a simple experiment to show this.
The code:
import random
random.seed(1)
x=[]
class aaa(object):
def __init__(self,a,b):
self.a=a
self.b=b
for i in range(5):
x.append(aaa(random.choice('asf'),random.randint(1,4000)))
for j in x:
print(j.a,j.b)
print('====')
for j in set(x):
print(j.a,j.b)
Run this for twice, you will get this:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
set PYTHONHASHSEED to 0, see details here, here and here.
Use OrderedDict instead.
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.

Categories