Which is faster and why? Set or List? - python

Let's say that I have a graph and want to test whether b in N[a]. Which is the faster implementation, and why?
a, b = range(2)
N = [set([b]), set([a, b])]  # neighbours stored as sets
N = [[b], [a, b]]  # neighbours stored as lists
This is obviously oversimplified, but imagine that the graph becomes really dense.

Membership testing in a set is vastly faster, especially for large sets. That is because a set hashes the value to locate its bucket directly instead of scanning. Since Python implementations automatically resize that hash table, lookup time is constant on average (O(1)) no matter the size of the set (assuming the hash function is sufficiently good).
In contrast, to evaluate whether an object is a member of a list, Python has to compare it against the members one by one until it finds a match, and against every single member if there is none, i.e. the test is O(n).
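A rough way to see this for yourself is to time the same lookup against both containers. This is only a sketch: the size `n`, the probed value and the repeat count are arbitrary choices of mine, and the absolute numbers will differ on your machine.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
missing = -1  # worst case for the list: every element gets compared

# Both containers give the same answer; only the cost differs.
list_time = timeit.timeit(lambda: missing in as_list, number=100)
set_time = timeit.timeit(lambda: missing in as_set, number=100)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

Probing for a value that is absent is deliberate, since it forces the list to do a full scan on every lookup.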

It all depends on what you're trying to accomplish. Using your example verbatim, it's faster to use lists, as you don't have to go through the overhead of creating the sets:
import timeit
def use_sets(a, b):
    return [set([b]), set([a, b])]
def use_lists(a, b):
    return [[b], [a, b]]
# Only the construction of each structure is timed here, not any lookups.
t = timeit.Timer("use_sets(a, b)", """from __main__ import use_sets
a, b = range(2)""")
print("use_sets()", t.timeit(number=1000000))
t = timeit.Timer("use_lists(a, b)", """from __main__ import use_lists
a, b = range(2)""")
print("use_lists()", t.timeit(number=1000000))
use_sets() 1.57522511482
use_lists() 0.783344984055
However, for reasons already mentioned here, you benefit from using sets when you are searching large sets. It's impossible to tell by your example where that inflection point is for you and whether or not you'll see the benefit.
I suggest you test it both ways and go with whatever is faster for your specific use-case.

A set (I mean a hash-based set such as HashSet) is much faster than a list for looking up a value. A list has to be scanned sequentially to find out whether the value exists, whereas a hash set can jump straight to the right bucket and look up the value in almost constant time.


Most efficient way to get a key from a dictionary

Let d be a large Python dictionary (one that still fits into memory) whose keys we do not know. What is the most efficient way to get a key of d, where it does not matter which key you get, such that d is unchanged in both content and order (for newer versions of Python) once you are done? Here "efficient" should mean something like: the memory used to perform the task is small compared to the size of the dictionary, and the speed is at least as fast as any of the methods below. This question is not about readability but about Python dictionary objects. For example, two methods are:
1) Use the list method
any_key = list(d)[0]
2) Use the popitem method
any_key, y = d.popitem()
So both methods essentially implement a peekkey() method. My basic timeit analysis shows that method 2) is much faster than method 1), and I assume that method 2) uses a lot less memory (but I do not really know if this is true yet). Is method 2) "best" or is there something better?
Extra brownie points if you get a fast and a readable method using only Python. Even more points for a C/Python method that accesses the dictionary object directly if that method is significantly faster than the best python method.
If you do not care about which key you get, and you don't mean "sample" in the random sense, then just grab the first key using next
key = next(iter(d.keys()))
which can be written more briefly as
key = next(iter(d))
Just to test performance, if I generate a dict with 1000 elements
d = {k:k for k in range(1000)}
then benchmarking these two methods, the next approach is about 95% faster
>>> timeit.timeit('sample_key = list(d)[0]', setup='d = {k:k for k in range(1000)}')
>>> timeit.timeit('next(iter(d.keys()))', setup='d = {k:k for k in range(1000)}')
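As a small self-contained check (the dictionary contents here are made up for illustration), the following shows that peeking with next(iter(d)) leaves the dictionary untouched, and that passing a default to next() avoids a StopIteration on an empty dict.

```python
d = {k: k * k for k in range(1000)}
before = list(d.items())

# next(iter(d)) creates only a small iterator object, not a copy of the keys.
any_key = next(iter(d))

# A default value avoids StopIteration when the dict is empty.
no_key = next(iter({}), None)

# Content and order are exactly as they were before the peek.
unchanged = list(d.items()) == before
print(any_key in d, no_key, unchanged)
```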

speed up function based on list comprehension

I'm trying to get the 15 most relevant items for each user, but every function I tried took an eternity (more than 6 hours; I shut it down after that).
I have 418 unique users, 3718 unique items.
The U2tfifd dict also has 418 entries, and there are 32645 words in tfidf_feature_names.
Shape of my interactions_full_df is (40733, 3)
I tried:
def index_tfidf_users(user_id):
    return [users for users in U2tfifd[user_id].flatten().tolist()]
def get_relevant_items(user_id):
    return sorted(zip(tfidf_feature_names, index_tfidf_users(user_id)), key=lambda x: -x[1])[:15]
def get_tfidf_token(user_id):
    return [words for words, values in get_relevant_items(user_id)]
then interactions_full_df["tags"] = interactions_full_df["user_id"].apply(lambda x: get_tfidf_token(x))
def get_tfidf_token(user_id):
    tags = []
    v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
    for words, values in v:
        tags.append(words)
    return tags
def get_tfidf_token(user_id):
    v = sorted(zip(tfidf_feature_names, U2tfifd[user_id].flatten().tolist()), key=lambda x: -x[1])[:15]
    tags = [words for words in v]
    return tags
U2tfifd is a dict with keys = user_id, values = an array
There are several things going on which could cause poor performance in your code. The impact of each of these will depend on things like your Python version (2.x or 3.x), your RAM speed, and whatnot. You'll need to experiment and benchmark the various potential improvements yourself.
1. TFIDF Sparsity (~10x speedup depending on sparsity)
One glaring potential problem is that TFIDF naturally returns sparse data (e.g. a paragraph doesn't use anywhere near as many unique words as an entire book), and working with dense structures like numpy arrays is a strange choice when the data is probably zero almost everywhere.
If you'll be doing this same analysis in the future, it might be helpful to make/use a version of TFIDF with sparse array outputs so that when you extract your tokens you can skip over the zero values. This would likely have the secondary benefit of the entire sparse array for each user fitting in the cache and preventing costly RAM access in your sorts and other operations.
It might be worth sparsifying your data anyway. On my potato, a quick benchmark on data which should be similar to yours indicates that the process can be done in ~30s. The process replaces much of the work you're doing with a highly optimized routine coded in C and wrapped for use in Python. The only real cost is the second pass through the non-zero entries, but unless that pass is pretty efficient to begin with you should be better off working with sparse data.
2. Duplicated Efforts and Memoization (~100x speedup)
If U2tfifd has 418 entries and interactions_full_df has 40733 rows then at least 40315 (or 99.0%) of your calls to get_tfidf_token() are wasted since you've already computed the answer. There are tons of memoization decorators out there, but you don't need anything very complicated for your use case.
def memoize(f):
    _cache = {}
    def _f(arg):
        if arg not in _cache:
            _cache[arg] = f(arg)
        return _cache[arg]
    return _f
@memoize
def get_tfidf_token(user_id):
    ...
Breaking this down, the function memoize() returns another function. That function checks a local cache for the expected return value before computing it, and stores the result if necessary.
The syntax @memoize... is short for something like the following.
def uncached_get_tfidf_token(user_id):
    ...
get_tfidf_token = memoize(uncached_get_tfidf_token)
The @ symbol is used to signify that we want the modified, or decorated, version of get_tfidf_token() instead of the original. Depending on your application, it might be beneficial to chain decorators together.
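If you would rather not write the decorator yourself, the standard library's functools.lru_cache does the same job. The sketch below uses a made-up stand-in function (expensive_lookup is hypothetical, not the original get_tfidf_token body) just to show that repeated arguments are computed only once.

```python
from functools import lru_cache

calls = []  # records each time the underlying function really runs

@lru_cache(maxsize=None)
def expensive_lookup(user_id):
    # Hypothetical stand-in for get_tfidf_token(); not the original body.
    calls.append(user_id)
    return ("tag-%d" % user_id,)

results = [expensive_lookup(u) for u in [1, 2, 1, 1, 2, 3]]
print(len(calls), "real computations for", len(results), "calls")
```

Note that lru_cache requires hashable arguments, and that the cached object is shared between callers, so returning a tuple rather than a list is the safer choice.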
3. Vectorized Operations (varying speedup, benchmarking necessary)
Python doesn't really have a notion of primitive types like other languages, and even integers take 24 bytes in memory on my machine. Lists usually aren't packed, so you can incur costly cache misses as you're plowing through them. No matter how little work the CPU is doing for sorting and whatnot, clobbering a whole new chunk of memory to turn your array into a list and only using that brand new, expensive memory once is going to incur a performance hit.
Many of the things you are trying to do have fast (SIMD vectorized, parallelized, memory-efficient, packed memory, and other fun optimizations) numpy equivalents AND avoid unnecessary array copies and type conversions. It seems you're already using numpy anyway, so you won't have any extra imports or dependencies.
As one example, zip() creates another list in memory in Python 2.x and still does unnecessary work in Python 3.x when you really only care about the indices of tfidf_feature_names. To compute those indices, you can use something like the following, which avoids an unnecessary list creation and uses an optimized routine with slightly better asymptotic complexity as an added bonus.
import numpy as np
def get_tfidf_token(user_id):
    temp = U2tfifd[user_id].flatten()
    # indices of the 15 largest values, in no particular order
    ind = np.argpartition(temp, len(temp)-15)[-15:]
    return tfidf_feature_names[ind] # works if tfidf_feature_names is a numpy array
    # alternative: return [tfidf_feature_names[i] for i in ind] # always works
Depending on the shape of U2tfifd[user_id], you could avoid the costly .flatten() computation by passing an axis argument to np.argsort() and flattening the 15 obtained indices instead.
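One thing to keep in mind is that np.argpartition() does not order the entries it selects. If you want the top entries ranked, a rough sketch is to sort just the selected candidates afterwards. The helper name top_k_indices and the sample arrays are my own, and the function assumes a non-empty array and k >= 1.

```python
import numpy as np

def top_k_indices(scores, k):
    """Return indices of the k largest scores, highest first (assumes k >= 1)."""
    scores = np.asarray(scores).ravel()
    k = min(k, scores.size)
    # argpartition places the k largest values in the last k slots, unordered.
    candidates = np.argpartition(scores, scores.size - k)[-k:]
    # Sort only those k candidates, descending by score.
    return candidates[np.argsort(scores[candidates])[::-1]]

scores = np.array([0.1, 0.9, 0.0, 0.5, 0.7])
names = np.array(["a", "b", "c", "d", "e"])
print(names[top_k_indices(scores, 3)])
```

Sorting 15 candidates is negligible next to sorting all 32645 scores, so the ranking comes almost for free.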
4. Bonus
The sorted() function supports a reverse argument so that you can avoid extra computations like throwing a negative on every value. Simply use
sorted(..., reverse=True)
Even better, since you really don't care about the sort itself but just the 15 largest values, you can get away with
sorted(...)[-15:]
to index the largest 15 instead of reversing the sort and taking the smallest 15. That doesn't really matter if you're using a better function for the application like np.argpartition(), but it could be helpful in the future.
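If you stay in pure Python, another option (not part of the approach above, just a standard-library alternative) is heapq.nlargest, which keeps only k candidates around instead of sorting everything. The names and weights below are made-up sample data.

```python
import heapq

feature_names = ["alpha", "beta", "gamma", "delta", "epsilon"]  # made-up data
weights = [0.2, 0.9, 0.0, 0.7, 0.4]

# nlargest tracks only the k best pairs seen so far instead of sorting all of them.
top = heapq.nlargest(3, zip(feature_names, weights), key=lambda pair: pair[1])
top_names = [name for name, _ in top]
print(top_names)
```

The result comes back ordered from largest to smallest, so no reverse step is needed.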
You can also avoid some function calls by replacing .apply(lambda x : get_tfidf_token(x)) with .apply(get_tfidf_token) since get_tfidf_token is already a function which has the intended behavior. You don't really need the extra lambda.
As far as I can see though, most additional gains are fairly nitpicky and system-dependent. You can make most things faster with Cython or straight C with enough time for example, but you already have reasonably fast routines which do what you want out of the box. The extra engineering effort probably isn't worth any potential gains.

Optimizing Itertools Results in Python

I am calling itertools in Python (see below). In this code, snp_dic is a dictionary with integer keys and sets as values. The goal is to find the smallest list of keys whose values, when unioned together, equal set_union. (For those of you interested, this is equivalent to finding a global optimum for set cover, the popular NP-hard problem!) The algorithm below works, but the goal here is optimization.
The most obvious optimization I see has to do with itertools. Let's say that for a length r there exists a combination of r sets in snp_dic whose union equals set_union. Basic probability dictates that if this combination exists and is distributed uniformly at random among the combinations, then on average I should only have to iterate over half the combinations to find it. Itertools, however, will return all the possible combinations, which takes twice as long as the expected time of checking the union at each iteration.
A logical solution would seem to be to implement itertools.combinations() locally. Based on the "equivalent" Python implementation of itertools.combinations() in the Python docs, however, that is approximately twice as slow, because itertools.combinations() calls a C-level implementation rather than a Python-native one.
The question (finally) is then: how can I stream the results of itertools.combinations() one by one, so that I can check set unions as I go along, while still running in nearly the same time as the Python implementation of itertools.combinations()? In an answer, I would appreciate it if you could include the results of timing your new method to prove it runs in a similar time to the Python-native implementation. Any other optimizations are also appreciated.
import itertools
from functools import reduce  # reduce is not a builtin in Python 3
def min_informative_helper(snp_dic, min, set_union):
    union = lambda set_iterable : reduce(lambda a,b: a|b, set_iterable) #takes the union of sets
    for i in range(min, len(snp_dic)):
        combinations = itertools.combinations(snp_dic, i)
        combinations = [{i:snp_dic[i] for i in combination} for combination in combinations]
        for combination in combinations:
            comb_union = union(combination.values())
            if(comb_union == set_union):
                return combination.keys()
The itertools functions return lazy iterators, so nothing is computed until you ask for it. It is the list comprehension in your code that forces every combination to be built up front. To stream them, simply use
for combo in itertools.combinations(snp_dic, i):
    ...  # remainder of your logic
combinations() produces one new tuple each time it is advanced: one per loop iteration.
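Putting that together, a minimal sketch of the streaming version might look like the following. The function name min_cover and the sample data are mine, and unlike the original I let the size run up to len(snp_dic) inclusive, which is an assumption about the intended behaviour.

```python
import itertools

def min_cover(snp_dic, set_union, start=1):
    """Return the first smallest tuple of keys whose sets union to set_union."""
    for r in range(start, len(snp_dic) + 1):
        # combinations() is lazy: each tuple is produced only when requested.
        for keys in itertools.combinations(snp_dic, r):
            if set().union(*(snp_dic[k] for k in keys)) == set_union:
                return keys  # stop early; later combinations are never built
    return None

snp_dic = {1: {"a"}, 2: {"b", "c"}, 3: {"a", "b"}, 4: {"c"}}
print(min_cover(snp_dic, {"a", "b", "c"}))
```

Because the function returns as soon as a cover is found, only the combinations up to that point are ever generated, and no intermediate dictionaries are created.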

List comprehension is sorting automatically [duplicate]

The question arose while answering another SO question (there).
When I iterate several times over a Python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale for changing the order? Is it deterministic, or random? Or implementation defined?
And when I call the same Python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is whether Python set iteration order depends only on the algorithm used to implement sets, or also on the execution context.
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with a prime probe), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory.) You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard to find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python) integers less than the machine word size (32 bit or 64 bit) hash to themselves, but text strings, bytes strings, and datetime objects hash to integers that vary randomly from one run to the next; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
By default, the __hash__() values of str, bytes and datetime
objects are “salted” with an unpredictable random value. Although they
remain constant within an individual Python process, they are not
predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service
caused by carefully-chosen inputs that exploit the worst case
performance of a dict insertion, O(n^2) complexity. See
http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and
other mappings. Python has never made guarantees about this ordering
(and it typically varies between 32-bit and 64-bit builds).
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
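To see the effect of PYTHONHASHSEED without restarting anything by hand, here is a rough sketch that launches fresh interpreters with a chosen seed. The helper name set_order is my own, and it assumes Python 3.7+ for the capture_output argument.

```python
import os
import subprocess
import sys

def set_order(seed):
    """Run a fresh interpreter with the given PYTHONHASHSEED and return its output."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = "print(list({'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'}))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

first = set_order(0)
second = set_order(0)
print(first == second)       # a fixed seed gives the same order in every run
print(set_order("random"))   # with randomization on, the order may differ per run
```

The hash seed is fixed when the interpreter starts, which is why this has to spawn new processes rather than change the setting in the running one.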
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)
while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
"And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?"
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)
x = set()
for y in range(500):
    x.add(Foo(y))
print(list(x)[-10:])
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent, so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will output the same answer for long runs of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.
The class Foo is just a simple int wrapper as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
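A minimal sketch of that idea follows. It is backed by a plain dict, which preserves insertion order from Python 3.7 on; collections.OrderedDict would serve the same purpose on older versions. The class is illustrative only and implements just a handful of set operations.

```python
class OrderedSet:
    """Minimal insertion-ordered set backed by a dict (ordered since Python 3.7)."""

    def __init__(self, iterable=()):
        # dict.fromkeys drops duplicates while keeping first-seen order.
        self._items = dict.fromkeys(iterable)

    def add(self, item):
        self._items[item] = None

    def discard(self, item):
        self._items.pop(item, None)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

s = OrderedSet("banana")
s.add("c")
print(list(s))
```

Iteration order here depends only on insertion order, not on hash values, so it is the same on every run.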
The answer is simply a NO.
Python set operation is NOT stable.
I did a simple experiment to show this.
The code:
import random
class aaa(object):
def __init__(self,a,b):
for i in range(5):
for j in x:
for j in set(x):
Run this twice and you will get this:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
set PYTHONHASHSEED to 0, see details here, here and here.
Use OrderedDict instead.
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.

Why are methods so slow?

OK, I understand why, in languages like C++, calling a virtual method defined in a class is slower than calling a non-virtual method (you have to go through the dynamic dispatch table to look up the correct implementation to call).
But in Python, if I have:
list_of_sets = generate_a_list_containg_a_bunch_of_sets()
intersection_of_all = reduce(list_of_sets[0].intersection, list_of_sets)
This is dramatically (in my experiments about 40%) slower than:
list_of_sets = generate_a_list_containg_a_bunch_of_sets()
intersection_of_all = reduce(set.intersection, list_of_sets)
What I don't get is why that should be so much slower. The method lookup (I would think) would happen on the call to reduce, so the inside of reduce, where the intersection method is actually called, shouldn't have to look it up again (it should just reuse the same method reference).
Could someone illuminate where my understanding is flawed?
This is completely unrelated to method binding etc. The first version computes the intersection of three sets in each iteration, while the second version only intersects two sets. This is easy to see if we use the explicit loops instead.
Variant 1:
intersection = list_of_sets[0]
for s in list_of_sets[1:]:
    intersection = list_of_sets[0].intersection(intersection, s)
Variant 2:
intersection = list_of_sets[0]
for s in list_of_sets[1:]:
    intersection = set.intersection(intersection, s)
(Would you agree now Guido has a point?)
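To make the difference concrete, here is a small check with made-up sets. It shows that the bound method quietly adds its own instance as a third operand, and that both forms still give the same result, so only the amount of work differs.

```python
from functools import reduce

list_of_sets = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4, 5}]
first = list_of_sets[0]

# A bound method already carries its instance, so first.intersection(acc, s)
# is the same call as set.intersection(first, acc, s): three operands.
bound = reduce(first.intersection, list_of_sets)
unbound = reduce(set.intersection, list_of_sets)

# Passing all sets in one call avoids reduce() entirely.
direct = set.intersection(*list_of_sets)
print(bound == unbound == direct)
```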
Note that this will probably be even faster:
intersection = list_of_sets[0]
for s in list_of_sets[1:]:
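The body of the loop above is cut off here. One plausible way to finish it (this is my assumption, not necessarily what was originally written) is to update a copy of the first set in place, which avoids allocating a new set on every iteration.

```python
list_of_sets = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4, 5}]

# Copy first so that the original first set is not mutated.
intersection = set(list_of_sets[0])
for s in list_of_sets[1:]:
    intersection.intersection_update(s)  # same effect as: intersection &= s
    if not intersection:
        break  # already empty, so the remaining sets cannot change the result

print(intersection)
```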
