Fast Looping over combinations of elements from two large dictionaries of sets - python

I have two very big dictionaries of sets and I want to loop over all combinations of pairs of them to calculate a "score" for each pair and store this in another object. The key in each dictionary is a unique identifier for each set. The code I am currently using is something along the lines of:
score_matrix = {}
for id_1, set_1 in set_dict_1.items():
    number_elements_1 = len(set_1)
    score_matrix[id_1] = {}
    for id_2, set_2 in set_dict_2.items():
        score_matrix[id_1][id_2] = float(len(set_1 & set_2)) / number_elements_1
I am testing this on data where set_dict_1 and set_dict_2 each have around 25k entries, so this thing has to process around 625,000,000 combinations! Obviously I don't expect this to be done in seconds, but in pure Python like this it is taking too long. In fact, I don't know how long it takes because I didn't let it finish: it had been running for around 15 minutes before I gave up.
I assume there is a more efficient way to achieve this. Any ideas? Perhaps numpy might help, but I'm not sure how.
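One direction worth trying, as a rough sketch (it assumes scipy is installed and that the 25k x 25k result fits in memory as a dense array): encode each set as a 0/1 row of a sparse membership matrix over the union of all elements, so that a single sparse matrix product yields every pairwise intersection size at once and the per-pair Python loop disappears.

import numpy as np
from scipy import sparse

def score_matrix_sparse(set_dict_1, set_dict_2):
    ids_1, sets_1 = zip(*set_dict_1.items())
    ids_2, sets_2 = zip(*set_dict_2.items())

    # Assign a column index to every element appearing in either dictionary.
    universe = {e: i for i, e in enumerate(set().union(*sets_1, *sets_2))}

    def membership(sets):
        # len(sets) x len(universe) sparse 0/1 matrix of set membership.
        rows, cols = [], []
        for r, s in enumerate(sets):
            for e in s:
                rows.append(r)
                cols.append(universe[e])
        data = np.ones(len(cols), dtype=np.int32)
        return sparse.csr_matrix((data, (rows, cols)),
                                 shape=(len(sets), len(universe)))

    m1 = membership(sets_1)
    m2 = membership(sets_2)

    # intersections[i, j] == len(sets_1[i] & sets_2[j])
    intersections = (m1 @ m2.T).toarray()
    sizes_1 = np.asarray(m1.sum(axis=1))    # column vector of len(set_1) values
    scores = intersections / sizes_1        # divide each row by its set's size
    return ids_1, ids_2, scores

Note that a dense 25k x 25k float64 result is itself around 5 GB, so in practice you would probably process m1 in blocks of rows, but the inner work is all done in compiled code.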

Related

How to split a dataframe and select all possible pairs?

I have a dataframe that I want to separate in order to apply a certain function.
I have the fields df['beam'], df['track'], df['cycle'] and want to split the dataframe by the unique values of these three. Then I want to apply this function (it works between two individual dataframes) to each pair of pieces whose df['track'] values differ. Also, the result doesn't change if you switch the order of the pair, so I'd like to avoid unnecessary calls to the function if possible.
I currently work through it with four nested for loops feeding an if conditional, but I'm absolutely sure there's a better, cleaner way.
I'd appreciate all help!
Edit: I ended up solving it like this:
I split the original dataframe into multiple by using df.groupby()
dfsplit=df.groupby(['beam','track','cycle'])
This generates a groupby object that behaves like a dictionary whose keys are all the unique ['beam','track','cycle'] combinations as tuples
I combined all possible ['beam','track','cycle'] pairs with the use of itertools.combinations()
keys=list(itertools.combinations(dfsplit.keys(),2))
This generates a list of 2-element tuples where each element is one ['beam','track','cycle'] tuple itself, and it doesn't include the tuple with the order swapped, so I avoid calling the function twice for what would be the same case.
I removed the combinations where 'track' was the same through a for loop
for k in keys.copy():
    if k[0][1]==k[1][1]:
        keys.remove(k)
Now I can call my function by looping through the list of combinations
for k in keys:
    function(dfsplit[k[0]],dfsplit[k[1]])
Step 3 is taking a long time, probably because I have a very large number of unique ['beam','track','cycle'] combinations so the list is very long, but also probably because I'm doing it sub-optimally. I'll keep the question open in case someone realizes a better way to do this last step.
EDIT 2:
Solved the problem with step 3, once again with itertools, just by doing
keys=list(itertools.filterfalse(lambda k : k[0][1]==k[1][1], keys))
itertools.filterfalse keeps all elements of the list for which the given function returns false, so it does the same thing as the previous for loop but by selecting the false cases instead of removing the true ones. It's very fast and I believe this solves my problem for good.
I don't know how to mark the question as solved so I'll just repeat the solution here:
dfsplit=df.groupby(['beam','track','cycle'])
keys=list(itertools.combinations(dfsplit.keys(),2))
keys=list(itertools.filterfalse(lambda k : k[0][1]==k[1][1], keys))
for k in keys:
    function(dfsplit[k[0]],dfsplit[k[1]])
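For reference, the same sequence can be written against the documented pandas GroupBy API; this is only a sketch (not the poster's exact code), and it assumes a DataFrame df with those three columns plus the pairwise function(df_a, df_b) from the question, with groups/get_group used to retrieve each piece:

import itertools

def apply_to_pairs_with_different_track(df, function):
    """Call function(df_a, df_b) for every unordered pair of groups whose 'track' differs."""
    grouped = df.groupby(['beam', 'track', 'cycle'])
    keys = list(grouped.groups)  # the unique (beam, track, cycle) tuples
    for key_a, key_b in itertools.combinations(keys, 2):
        if key_a[1] != key_b[1]:  # position 1 of the key tuple is 'track'
            function(grouped.get_group(key_a), grouped.get_group(key_b))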

How to get all combinations from multiple lists with additional filtering?

a = [[1,2,3,4],[2,3,4,5],[5,6,7,8],[4,3,2,3]]
output: (1,2,5,4),(1,2,5,3),(1,2,5,2)...
I have multiple lists and I'm trying to find all possible combinations of their elements, similar to the code above. However, due to the number of lists and the number of items in each list, my computer can't process it. So I was wondering whether I could filter some combinations out by adding the items (numbers) together, e.g. (1,2,5,4) = 12, and then, if that total is under 15, discard the combination and move on to the next one. I would have to do something like this anyway after generating all the combinations, so I figured why not do it earlier.
My code which generates all possible combinations:
list(itertools.product(*a))
Any ideas on how I can implement this filtering concept?
You can use a list comprehension:
from itertools import product
a = [[1,2,3,4],[2,3,4,5],[5,6,7,8],[4,3,2,3]]
l = [e for e in product(*a) if sum(e) > 15]
However, be aware that generating the initial products is itself the bottleneck. Your example only yields 256 of them, but with enough lists there is a real possibility that your computer can never generate them all, and filtering afterwards won't change anything.
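If the filtered combinations are consumed one at a time, a generator expression avoids ever holding the full list in memory. A small sketch (it still enumerates every product, so it saves memory rather than time):

from itertools import product

a = [[1, 2, 3, 4], [2, 3, 4, 5], [5, 6, 7, 8], [4, 3, 2, 3]]

# Nothing is stored: each combination is tested and handed over one at a time.
kept = (combo for combo in product(*a) if sum(combo) > 15)

for combo in kept:
    print(combo)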

Collapse list of lists to eliminate redundancy

I have a couple of long lists of lists of related objects that I'd like to group to reduce redundancy. Pseudocode:
>>>list_of_lists = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]...]
>>>remove_redundancy(list_of_lists)
[[1,2,3,4,8,9,10],[5,6,7]...]
So lists that contain the same elements would be collapsed into single lists. Collapsing them is easy, once I find lists to combine I can make the lists into sets and take their union, but I'm not sure how to compare the lists. Do I need to do a series of for loops?
My first thought was that I should loop through and check whether each item in a sublist is in any of the other lists, if yes, merge the lists and then start over, but that seems terribly inefficient. I did some searching and found this: Python - dividing a list-of-lists to groups but my data isn't structured. Also, my actual data is a series of strings and thus not sortable in any meaningful sense.
I can write some gnarly looping code to make this work, but I was wondering if there are any built-in functions that would make this sort of comparison easier. Maybe something in list comprehensions?
I think this is a reasonably efficient way of doing it, if I understand your question correctly. The result here will be a list of sets.
Maybe the missing bit of knowledge was d & g (also written d.intersection(g)) for finding the set intersection, along with the fact that an empty set is "falsey" in Python:
data = [[1,2,3],[3,4],[5,6,7],[1,8,9,10]]
result = []
for d in data:
    d = set(d)
    matched = [d]
    unmatched = []
    # first divide into matching and non-matching groups
    for g in result:
        if d & g:
            matched.append(g)
        else:
            unmatched.append(g)
    # then combine all matching groups into one group
    # while leaving unmatched groups intact
    result = unmatched + [set().union(*matched)]
print(result)
# [set([5, 6, 7]), set([1, 2, 3, 4, 8, 9, 10])]
We start with no groups at all (result = []). Then we take the first list from the data. We then check which of the existing groups intersect this list and which don't. Then we merge all of these matching groups along with the list (achieved by starting with matched = [d]). We don't touch the non-matching groups (though maybe some of these will end up being merged in a later iteration). If you add a line print(result) in each loop you should be able to see how it's built up.
The union of all the sets in matched is computed by set().union(*matched). For reference:
Pythonic Way to Create Union of All Values Contained in Multiple Lists
What does the Star operator mean?
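A quick illustration of the star-unpacking union referenced above:

matched = [{1, 2, 3}, {3, 4}]
print(set().union(*matched))   # {1, 2, 3, 4}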
I assume that you want to merge lists that contain any common element.
Here is a function that checks efficiently (to the best of my knowledge) whether any two lists contain at least one common element (according to the == operator):
import functools  # Python 2.5+

def seematch(X, Y):
    return functools.reduce(lambda x, y: x | y,
                            functools.reduce(lambda x, y: x + y,
                                             [[k == l for k in X] for l in Y]))
It would be even faster with a reduce that can be interrupted upon finding "true", as described here:
Stopping a Reduce() operation mid way. Functional way of doing partial running sum
I was trying to find an elegant way to iterate quickly once that is in place, but I think a good way would simply be to loop once, creating another container that will hold the "merged" lists: you loop once over the lists contained in the original list and, for each of them, over the new lists created in the proxy container.
Having said that - it seems there might be a much better option - see if you can do away with that redundancy by some sort of book-keeping on the previous steps.
I know this is an incomplete answer - hope that helped anyway!
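As a side note on the interruptible-reduce idea above: if the elements are hashable (the question mentions strings, which are), a set-based membership test gives the same early exit without reduce at all. A small sketch:

def share_element(xs, ys):
    """Return True as soon as one element of xs is found in ys."""
    ys_set = set(ys)                      # assumes hashable elements
    return any(x in ys_set for x in xs)   # any() stops at the first hit

print(share_element([1, 2, 3], [3, 4]))   # True
print(share_element([1, 2, 3], [5, 6]))   # False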

Program running too slow! : Suggest Algorithmic/Implementation Optimization

I have a huge Python list (A) of lists. The length of A is around 90,000, and each inner list contains around 700 tuples of (datetime.date, string). I am analyzing this data as follows: I take a window of size x over each inner list, where x = len(inner list) * (some fraction <= 1), and I save every ordered pair (a, b) where a occurs before b in that window (the inner lists are sorted with respect to time). I then slide the window to the end of the list, adding one element at one end and removing one from the other, which takes O(window size) time per step since I only consider the new pairs. My code:
for i in xrange(window_size):
    j = i+1
    while j < window_size:
        check_and_update(cur, my_list[i][1], my_list[j][1], log)
        j = j+1

i = 1
while i <= len(my_list)-window_size:
    j = i
    k = i+window_size-1
    while j < k:
        check_and_update(cur, my_list[j][1], my_list[k][1], log)
        j += 1
    i += 1
Here cur is actually a sqlite3 database cursor, my_list is the list containing the tuples (I run this code for every list in A), and log is an open logfile. In check_and_update() I look the tuple up in the database and either insert it with a count of 1 or increment its total number of occurrences so far. Code:
def check_and_update(cur, start, end, log):
    t = str(start) + ":" + str(end)
    cur.execute("INSERT OR REPLACE INTO Extra (tuple,count) "
                "VALUES ( ? , coalesce((SELECT count +1 from Extra WHERE tuple = ?),1))", [t, t])
As expected, the number of tuples is HUGE; I previously experimented with a dictionary, which ate up memory very quickly, so I resorted to SQLite3, but now it is too slow. I have tried indexing, with no help. My program is probably spending far too much time querying and updating the database. Do you have any optimization ideas for this problem, perhaps a change of algorithm or a different approach/tool? Thank you!
Edit: My goal here is to count the tuples of strings that occur within the window, grouped by the number of different inner lists they occur in. I extract this information with this query:
for i in range(1, size+1):
    cur.execute('select * from Extra where count = ?', (i,))
    # other stuff
For example (I am ignoring the date entries and will write them as 'dt'):
My_list = [
    [ (dt,'user1'), (dt,'user2'), (dt,'user3') ],
    [ (dt,'user3'), (dt,'user4') ],
    [ (dt,'user2'), (dt,'user3'), (dt,'user1') ],
]
Here, if I take fraction = 1, the results are:
only 1 occurrence in window: 5 (user 1-2,1-3,3-4,2-1,3-1)
only 2 occurrence in window: 2 (user 2-3)
Let me get this straight.
You have up to about 22 billion potential tuples (90,000 lists, any of about 700 entries paired with any of the entries that follow it, about 350 on average), possibly fewer depending on the window size. You want to find, by the number of inner lists that they appear in, how many tuples there are.
Data of this size has to live on disk. The rule for data that lives on disk due to size is, "Never randomly access, instead generate and then sort."
So I'd suggest that you write out each tuple to a log file, one tuple per line. Sort that file. Now all instances of any given tuple are in one place. Then run through the file and, for each tuple, emit the count of how many times it appears (that is, how many inner lists it is in). Sort that second file. Now run through that file, and you can extract how many tuples appeared 1x, 2x, 3x, etc.
If you have multiple machines, it is easy to convert this into a MapReduce. (Which is morally the same approach, but you get to parallelize a lot of stuff.)
Apache Hadoop is one MapReduce implementation that is suited to this kind of problem.
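A minimal sketch of that pipeline, assuming the pairs have already been written to a file named pairs.txt (one "a:b" line per inner list the pair occurs in, deduplicated within each list) and that a Unix sort command is available; the second sort from the answer is folded into a small in-memory Counter here, since the histogram of frequencies is tiny:

import subprocess
from collections import Counter
from itertools import groupby

# Pass 1: an external sort puts every copy of a pair on adjacent lines,
# so the data never has to fit in memory.
subprocess.run(["sort", "pairs.txt", "-o", "pairs.sorted.txt"], check=True)

# Pass 2: stream through the sorted file, count how many inner lists each
# pair appeared in, and histogram those counts.
histogram = Counter()
with open("pairs.sorted.txt") as f:
    for pair, occurrences in groupby(line.rstrip("\n") for line in f):
        histogram[sum(1 for _ in occurrences)] += 1

for n_lists, n_pairs in sorted(histogram.items()):
    print(f"{n_pairs} pair(s) occur in exactly {n_lists} inner list(s)")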

Best way to sort 1M records in Python

I have a service that takes a list of about 1,000,000 dictionaries and does the following:
myHashTable = {}
myLists = { 'hits':{}, 'misses':{}, 'total':{} }
sortedLists = { 'hits':[], 'misses':[], 'total':[] }  # naming this 'sorted' would shadow the builtin sorted() used below
for item in myList:
    id = item.pop('id')
    myHashTable[id] = item
    for k, v in item.iteritems():
        myLists[k][id] = v
So, if I had the following list of dictionaries:
[ {'id':'id1', 'hits':200, 'misses':300, 'total':400},
{'id':'id2', 'hits':300, 'misses':100, 'total':500},
{'id':'id3', 'hits':100, 'misses':400, 'total':600}
]
I end up with
myHashTable =
{
'id1': {'hits':200, 'misses':300, 'total':400},
'id2': {'hits':300, 'misses':100, 'total':500},
'id3': {'hits':100, 'misses':400, 'total':600}
}
and
myLists =
{
'hits': {'id1':200, 'id2':300, 'id3':100},
'misses': {'id1':300, 'id2':100, 'id3':400},
'total': {'id1':400, 'id2':500, 'id3':600}
}
I then need to sort all of the data in each of the myLists dictionaries.
What I'm doing currently is something like the following:
import operator

def doSort(key):
    sortedLists[key] = sorted(myLists[key].items(), key=operator.itemgetter(1), reverse=True)
which would yield, in the case of misses:
[('id3', 400), ('id1', 300), ('id2', 100)]
This works great when I have up to 100,000 records or so, but with 1,000,000 it takes at least 5-10 minutes to sort each of the 16 lists (my original list of dictionaries actually has 17 fields, including the id which is popped).
* EDIT * This service is a ThreadingTCPServer which has a method allowing a client to connect and add new data. The new data may include new records (meaning dictionaries with 'id's unique from what is already in memory) or modified records (meaning the same 'id' with different data for the other key/value pairs).
So, once this is running I would pass in
[
    {'id':'id1', 'hits':205, 'misses':305, 'total':480},
    {'id':'id4', 'hits':30, 'misses':40, 'total':60},
    {'id':'id5', 'hits':50, 'misses':90, 'total':20}
]
I have been using dictionaries to store the data so that I don't end up with duplicates. After the dictionaries are updated with the new/modified data I re-sort each of them.
* END EDIT *
So, what is the best way for me to sort these? Is there a better method?
You may find this related answer from Guido: Sorting a million 32-bit integers in 2MB of RAM using Python
What you really want is an ordered container, instead of an unordered one. That would implicitly sort the results as they're inserted. The standard data structure for this is a tree.
However, there doesn't seem to be one of these in Python. I can't explain that; this is a core, fundamental data type in any language. Python's dict and set are both unordered containers, which map to the basic data structure of a hash table. It should definitely have an optimized tree data structure; there are many things you can do with them that are impossible with a hash table, and they're quite tricky to implement well, so people generally don't want to be doing it themselves.
(There's also nothing mapping to a linked list, which also should be a core data type. No, a deque is not equivalent.)
I don't have an existing ordered container implementation to point you to (and it should probably be implemented natively, not in Python), but hopefully this will point you in the right direction.
A good tree implementation should support iterating across a range by value ("iterate all values from [2,100] in order"), find next/prev value from any other node in O(1), efficient range extraction ("delete all values in [2,100] and return them in a new tree"), etc. If anyone has a well-optimized data structure like this for Python, I'd love to know about it. (Not all operations fit nicely in Python's data model; for example, to get next/prev value from another value, you need a reference to a node, not the value itself.)
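For illustration only, here is a rough standard-library sketch of the "iterate a range by value" interface described above. bisect keeps a plain list sorted on every insert, so this shows the shape of the API but not the performance of a real tree (insertion here is O(n)):

import bisect

class SortedValues:
    def __init__(self):
        self._values = []

    def add(self, value):
        bisect.insort(self._values, value)      # keeps the list sorted

    def range(self, low, high):
        """All stored values v with low <= v <= high, in order."""
        lo = bisect.bisect_left(self._values, low)
        hi = bisect.bisect_right(self._values, high)
        return self._values[lo:hi]

s = SortedValues()
for v in (42, 7, 100, 2, 63):
    s.add(v)
print(s.range(2, 100))   # [2, 7, 42, 63, 100]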
If you have a fixed number of fields, use tuples instead of dictionaries. Place the field you want to sort on in first position, and just use mylist.sort()
This seems to be pretty fast.
raw= [ {'id':'id1', 'hits':200, 'misses':300, 'total':400},
{'id':'id2', 'hits':300, 'misses':100, 'total':500},
{'id':'id3', 'hits':100, 'misses':400, 'total':600}
]
hits= [ (r['hits'],r['id']) for r in raw ]
hits.sort()
misses = [ (r['misses'],r['id']) for r in raw ]
misses.sort()
total = [ (r['total'],r['id']) for r in raw ]
total.sort()
Yes, it makes three passes through the raw data. I think it's faster than pulling out the data in one pass.
Instead of trying to keep your list ordered, maybe you can get by with a heap queue. It lets you push any item, keeping the 'smallest' one at h[0]; pushing, and popping this item (which 'bubbles' up the next smallest), are O(log n) operations.
so, just ask yourself:
do i need the whole list ordered all the time? : use an ordered structure (like Zope's BTree package, as mentioned by Ealdwulf)
or the whole list ordered but only after a day's work of random insertions?: use sort like you're doing, or like S.Lott's answer
or just a few 'smallest' items at any moment? : use heapq
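A small illustration of the heapq option, using the misses values from the question's example data:

import heapq

h = []
for id_, misses in [('id1', 300), ('id2', 100), ('id3', 400)]:
    heapq.heappush(h, (misses, id_))   # each push is O(log n)

print(h[0])               # (100, 'id2') -- the current smallest, available in O(1)
print(heapq.heappop(h))   # (100, 'id2') -- each pop is O(log n)
print(heapq.heappop(h))   # (300, 'id1')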
Others have provided some excellent advice; try it out.
As general advice, in situations like this you need to profile your code: know exactly where most of the time is spent. Bottlenecks hide well, in places you least expect them.
If there is a lot of number crunching involved then a JIT compiler like the (now-dead) psyco might also help. When processing takes minutes or hours, a 2x speed-up really counts.
http://docs.python.org/library/profile.html
http://www.vrplumber.com/programming/runsnakerun/
http://psyco.sourceforge.net/
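As a concrete starting point (doSort is the function from the question), the standard-library profiler can be driven with a single call:

import cProfile

# Prints a per-function breakdown of where the time goes, sorted by cumulative time.
cProfile.run("doSort('misses')", sort="cumulative")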
sorted(myLists[key], key=myLists[key].get, reverse=True)
should save you some time, though not a lot.
I would look into using a different sorting algorithm. Something like a merge sort might work: break the list up into smaller lists, sort them individually, then merge them back together in a loop.
Pseudo code:
list1 = []  // sorted separately
list2 = []  // sorted separately

// Recombine sorted lists
result = []
while (list1.hasMoreElements || list2.hasMoreElements):
    if (!list1.hasMoreElements):
        result.addAll(list2)
        break
    elseif (!list2.hasMoreElements):
        result.addAll(list1)
        break
    if (list1.peek < list2.peek):
        result.add(list1.pop)
    else:
        result.add(list2.pop)
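For what it's worth, the recombination step in that pseudocode already exists in the standard library as heapq.merge, which lazily merges any number of individually sorted iterables:

import heapq

list1 = [100, 300, 400]   # already sorted separately
list2 = [200, 500, 600]   # already sorted separately

merged = list(heapq.merge(list1, list2))
print(merged)   # [100, 200, 300, 400, 500, 600]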
Glenn Maynard is correct that a sorted mapping would be appropriate here. This is one for python: http://wiki.zope.org/ZODB/guide/node6.html#SECTION000630000000000000000
I've done some quick profiling of both the original way and S.Lott's proposal. In neither case does it take 5-10 minutes per field; the actual sorting is not the problem. It looks like most of the time is spent slinging data around and transforming it. Also, my memory usage is skyrocketing: my Python process is over 350 MB of RAM! Are you sure you're not using up all your RAM and paging to disk? Even with my crappy 3-year-old power-saving laptop processor, I am seeing results well under 5-10 minutes per key sorted for a million items. What I can't explain is the variability in the actual sort() calls. I know Python's sort is extra good at sorting partially sorted lists, so maybe his list is getting partially sorted in the transform from the raw data to the list to be sorted.
Here are the results for S.Lott's method:
done creating data
done transform. elapsed: 16.5160000324
sorting one key slott's way takes 1.29699993134
here's the code to get those results:
starttransform = time.time()
hits= [ (r['hits'],r['id']) for r in myList ]
endtransform = time.time()
print "done transform. elapsed: " + str(endtransform - starttransform)
hits.sort()
endslottsort = time.time()
print "sorting one key slott's way takes " + str(endslottsort - endtransform)
Now the results for the original method, or at least a close version with some instrumentation added:
done creating data
done transform. elapsed: 8.125
about to get stuff to be sorted
done getting data. elapsed time: 37.5939998627
about to sort key hits
done sorting on key <hits> elapsed time: 5.54699993134
Here's the code:
from operator import itemgetter  # needed for the sort key

for k, v in myLists.iteritems():
    time1 = time.time()
    print "about to get stuff to be sorted "
    tobesorted = myLists[k].items()
    time2 = time.time()
    print "done getting data. elapsed time: " + str(time2-time1)
    print "about to sort key " + str(k)
    tobesorted.sort(key=itemgetter(1))  # list.sort() sorts in place and returns None
    mysorted[k] = tobesorted
    time3 = time.time()
    print "done sorting on key <" + str(k) + "> elapsed time: " + str(time3-time2)
Honestly, the best way is to not use Python. If performance is a major concern for this, use a faster language.
