I have a list of lists, let's call it thelist, that looks like this:
[[Redditor(name='Kyle'), Redditor(name='complex_r'), Redditor(name='Lor')],
[Redditor(name='krispy'), Redditor(name='flyry'), Redditor(name='Ooer'), Redditor(name='Iewee')],
[Redditor(name='Athaa'), Redditor(name='The_Dark_'), Redditor(name='drpeterfost'), Redditor(name='owise23'), Redditor(name='invirtibrit')],
[Redditor(name='Dacak'), Redditor(name='synbio'), Redditor(name='Thosee')]]
thelist has 1000 elements (lists). I'm trying to compare each of these lists with every other list pairwise and count the number of common elements for each pair. The code doing that:
def calculate(list1, list2):
    a = 0
    for i in list1:
        if i in list2:
            a += 1
    return a

for i in range(len(thelist) - 1):
    for j in range(i + 1, len(thelist)):
        print calculate(thelist[i], thelist[j])
My problem is: the calculation is extremely slow, taking 2 or more seconds per pair of lists depending on their lengths. I'm guessing this has to do with my list structure. What am I missing here?
First, I would recommend making your class hashable; a good guide is referenced here: What's a correct and good way to implement __hash__()?
You can then make your list of lists a list of sets by doing:
thelist = [set(l) for l in thelist]
Then your function will work much faster!
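For instance, once the rows are sets, the whole count collapses to a set intersection. A minimal sketch, using plain strings as stand-ins for the Redditor objects:

```python
def calculate(set1, set2):
    # Set intersection replaces the O(n*m) nested membership tests
    return len(set1 & set2)

thelist = [
    {'Kyle', 'complex_r', 'Lor'},
    {'krispy', 'flyry', 'Ooer', 'Iewee'},
    {'Kyle', 'krispy', 'Thosee'},
]

for i in range(len(thelist) - 1):
    for j in range(i + 1, len(thelist)):
        print(calculate(thelist[i], thelist[j]))  # prints 0, 1, 1
```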
Related
For example, I have a list of lists of integers like
x = [[1,2,3,4], [4,5,6], [2,3,1,9]]
Assume the length of x is in the millions. In that case, iterating through each element will be very slow.
Is there any faster way?
Without any prior knowledge or extra information about the list (e.g., whether it's sorted), you have no real choice but to iterate over the entire list. Note, however, that doing this with a generator can be much more performant than creating a filtered list, since the values are computed only when you consume them, not upfront:
search = 2
listGenerator = (i for i in x if search in i)
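For example, consuming only the first match stops the scan as soon as a hit is found:

```python
x = [[1, 2, 3, 4], [4, 5, 6], [2, 3, 1, 9]]
search = 2
listGenerator = (i for i in x if search in i)

# Only as many sublists are scanned as needed to produce the first hit
first_match = next(listGenerator, None)
print(first_match)  # [1, 2, 3, 4]
```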
I want to create a nested for loop that uses a list of integer numbers for the loops' ranges, just as follows:
a = [5,4,7,2,7,3,8,3,8,9,3,2,1,5]
for i in range(a[0]):
    for j in range(a[1]):
        for k in range(a[2]):
            for l in range(a[3]):
                ...
                    ...
                        do_some_function()
Is there a way that I can do it automatically?
You can iterate over the Cartesian product of the list's ranges with
for items in itertools.product(*(range(item) for item in a)):
items will be a tuple containing one value from each range.
Note: this approach is very time- and resource-consuming; the number of iterations is the product of all the values in a. It might be worth considering whether the concept your question is based on can be optimized.
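A small runnable sketch of the same idea, with a shorter list so the output stays manageable:

```python
import itertools

a = [2, 3, 2]  # small ranges; the full product has 2 * 3 * 2 = 12 tuples

# One flat loop replaces the len(a) levels of nesting
combos = list(itertools.product(*(range(n) for n in a)))

print(len(combos))   # 12
print(combos[0])     # (0, 0, 0)
```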
I'm trying to find the fastest way to solve this problem, say I have a list of lists:
myList = [[1,2,3,4,5],[2,3],[4,5,6,7],[1,2,3],[3,7]]
I'd like to be able to remove all the lists that are sublists of one of the other lists, for example I'd like the following output:
myList = [[1,2,3,4,5],[4,5,6,7],[3,7]]
Where the lists [2,3] and [1,2,3] were removed because they are completely contained in one of the other lists, while [3,7] was not removed because no single list contained all those elements.
I'm not restricted to any one data structure, if a list of lists or a set is easier to work with, that would be fine too.
The best I could come up with was something like this, but it doesn't really work because I'm trying to remove from a list while iterating over it. I tried copying into a new list, but somehow I couldn't get that working right either.
for outer in range(0, len(myList)):
    outerSet = set(myList[outer])
    for inner in range(outer, len(myList)):
        innerSet = set(myList[inner])
        if innerSet.issubset(outerSet):
            myList.remove(innerSet)
Thanks.
The key to solving your problem is a list of sets:
lists = [[1,2,3,4,5],[2,3],[4,5,6,7],[1,2,3],[3,7]]
sets = [set(l) for l in lists]
new_list = [l for l,s in zip(lists, sets) if not any(s < other for other in sets)]
This converts the inner lists to sets, compares each set against every other set to see whether it is strictly contained in one (using the < operator), and adds the original list to the new list of lists only if it is not a strict subset of any other set.
Hi, I have a list of, say, 100 items, and I want a slice of, say, 6 randomly selected items. Is there any way to do it in a very simple, concise statement?
This is what I came up with (but it fetches items in sequence):
mylist  # 100 items
N = 100
L = 6
start = random.randint(0, N - L)
mylist[start:start+L]
You could use random.shuffle() on the list before you slice.
If the order of the list matters, make a copy of it first and slice out of the copy.
mylist  # 100 items
shuffleList = mylist[:]  # copy, so the original list keeps its order
L = 6
random.shuffle(shuffleList)
start = random.randint(0, len(shuffleList) - L)
shuffleList[start:start+L]
As above, you could also use len() instead of defining the length of the list.
As THC4K suggested below, you could use the random.sample() method if you want a set of random elements from the list (which is how I read your question):
mylist #100 items
L=6
random.sample(mylist, L)
That's a lot tidier than my first try at it!
I have a Python function, defined as follows, which I use to delete from list1 the items that are already in list2. I am using Python 2.6.2 on Windows XP.
def compareLists(list1, list2):
    curIndex = 0
    while curIndex < len(list1):
        if list1[curIndex] in list2:
            list1.pop(curIndex)
        else:
            curIndex += 1
Here, list1 and list2 are a list of lists
list1 = [ ['a', 11221, '2232'], ['b', 1321, '22342'] .. ]
# list2 has a similar format.
I tried this function with list1 having 38,000 elements and list2 having 150,000 elements. If I put in a print statement to report the current iteration, I find that the function slows down as it runs. At first it processes around 1000 or more items per second, and after a while it drops to around 20-50 per second. Why might that be happening?
EDIT: In the case with my data, the curIndex remains 0 or very close to 0 so the pop operation on list1 is almost always on the first item.
If possible, can someone also suggest a better way of doing the same thing in a different way?
Try a more Pythonic approach to the filtering, something like:
list2_set = set(list2)
[x for x in list1 if x not in list2_set]
(Build the set once up front; writing set(list2) inside the condition would rebuild it for every element of list1.) Converting both lists to sets is unnecessary, and would be slow and memory-hungry on large amounts of data.
Since your data is a list of lists, you need to do something in order to hash it.
Try out
list2_set = set([tuple(x) for x in list2])
diff = [x for x in list1 if tuple(x) not in list2_set]
I tested out your original function, and my approach, using the following test data:
list1 = [[x+1, x*2] for x in range(38000)]
list2 = [[x+1, x*2] for x in range(10000, 160000)]
Timings - not scientific, but still:
#Original function
real 2m16.780s
user 2m16.744s
sys 0m0.017s
#My function
real 0m0.433s
user 0m0.423s
sys 0m0.007s
There are 2 issues that cause your algorithm to scale poorly:
x in list is an O(n) operation.
pop(n) where n is in the middle of the array is an O(n) operation.
Both issues cause the algorithm to scale as O(n^2) for large amounts of data. gnud's implementation is probably the best solution, since it fixes both problems without changing the order of elements or removing potential duplicates.
If we rule the data structure itself out, look at your memory usage next. If you end up asking the OS to swap in for you (i.e., the list takes up more memory than you have), Python's going to sit in iowait waiting on the OS to get the pages from disk, which makes sense given your description.
Is Python sitting in a jacuzzi of iowait when this slowdown happens? Anything else going on in the environment?
(If you're not sure, update with your platform and one of us will tell you how to tell.)
One reason the code can become slower is that you have big elements in both lists which share many common elements, so each list1[curIndex] in list2 check takes more time.
Here are a couple of ways to fix this:
If you don't care about the order, convert both lists into sets and use set1.difference(set2)
If the order in list1 is important, then at least convert list2 into a set because in is much faster with a set.
Lastly, try a filter: filter(lambda x: x not in set2, list1)
[EDIT] Since set() doesn't work on lists of lists (the inner lists are unhashable; didn't expect that), try:
result = filter(lambda x: x not in list2, list1)
It should still be much faster than your version. If it isn't, your last option is to make sure there can't be duplicate elements in either list. That would let you remove items from both lists (making each comparison cheaper as you find elements from list2).
EDIT: I've updated my answer to account for lists being unhashable, as well as some other feedback. This one is even tested.
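A sketch of the set-based route, converting the inner lists to tuples so they are hashable (the sample rows here are made up):

```python
list1 = [['a', 11221, '2232'], ['b', 1321, '22342'], ['c', 1, '2']]
list2 = [['b', 1321, '22342']]

# The inner lists are unhashable, so convert each row to a tuple first
set2 = set(tuple(row) for row in list2)
list1 = [row for row in list1 if tuple(row) not in set2]

print(list1)  # [['a', 11221, '2232'], ['c', 1, '2']]
```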
It probably relates to the cost of popping an item out of the middle of a list.
Alternatively, have you tried using sets to handle this?
def difference(list1, list2):
    # Build the lookup set once, converting the rows to tuples so they hash
    list2_set = set(tuple(y) for y in list2)
    return [x for x in list1 if tuple(x) not in list2_set]
You can then set list1 to the resulting list, if that is your intention, by doing
list1 = difference(list1, list2)
The often-suggested set won't work here, because the two lists contain lists, which are unhashable. You need to change your data structure first.
You can either:
convert the sublists into tuples or class instances to make them hashable, then use sets; or
keep both lists sorted, after which you only need to compare the lists' heads.
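A sketch of the sorted, merge-style comparison (the rows are converted to tuples so they sort reliably; the sample data is made up):

```python
def difference_sorted(list1, list2):
    """Return the items of list1 that are not in list2."""
    a = sorted(tuple(x) for x in list1)
    b = sorted(tuple(x) for x in list2)
    result = []
    j = 0
    for item in a:
        # Advance list2's head past anything smaller than the current item
        while j < len(b) and b[j] < item:
            j += 1
        if j == len(b) or b[j] != item:
            result.append(list(item))
    return result

print(difference_sorted([[1, 2], [3, 4], [5, 6]], [[3, 4]]))
# [[1, 2], [5, 6]]
```

Each list is walked at most once after sorting, so the comparison itself is O(n + m) instead of the O(n*m) of repeated `in` checks.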