I have a Python function, defined as follows, which I use to delete from list1 the items that are already in list2. I am using Python 2.6.2 on Windows XP.
def compareLists(list1, list2):
    curIndex = 0
    while curIndex < len(list1):
        if list1[curIndex] in list2:
            list1.pop(curIndex)
        else:
            curIndex += 1
Here, list1 and list2 are lists of lists:
list1 = [ ['a', 11221, '2232'], ['b', 1321, '22342'] .. ]
# list2 has a similar format.
I tried this function with list1 containing 38,000 elements and list2 containing 150,000 elements. If I put in a print statement to show the current iteration, I find that the function slows down with each iteration. At first it processes around 1,000 or more items per second, and after a while it drops to around 20-50 per second. Why might that be happening?
EDIT: In the case of my data, curIndex remains at or very close to 0, so the pop operation on list1 is almost always on the first item.
If possible, can someone also suggest a better way of doing the same thing?
Try a more pythonic approach to the filtering, something like
[x for x in list1 if x not in set(list2)]
Converting both lists to sets is unnecessary, and will be very slow and memory-hungry on large amounts of data.
Since your data is a list of lists, and lists are unhashable, you need to convert the inner lists to something hashable first.
Try out
list2_set = set([tuple(x) for x in list2])
diff = [x for x in list1 if tuple(x) not in list2_set]
I tested out your original function, and my approach, using the following test data:
list1 = [[x+1, x*2] for x in range(38000)]
list2 = [[x+1, x*2] for x in range(10000, 160000)]
Timings - not scientific, but still:
#Original function
real 2m16.780s
user 2m16.744s
sys 0m0.017s
#My function
real 0m0.433s
user 0m0.423s
sys 0m0.007s
There are two issues that cause your algorithm to scale poorly:
x in list is an O(n) operation.
pop(i) for an i that is not at the end of the list is an O(n) operation, because every element after it has to shift down.
Together they make the algorithm O(n^2) for large amounts of data. gnud's implementation would probably be the best solution, since it solves both problems without changing the order of elements or removing potential duplicates.
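If you want to keep the in-place behaviour of the original function, a minimal sketch of the same idea (the name remove_common is just illustrative, and it assumes the inner lists hold only hashable values so they can be turned into tuples) might look like:

def remove_common(list1, list2):
    # Build the set once; membership tests against it are O(1) on average.
    list2_set = set(tuple(x) for x in list2)
    # Rebind through a slice so callers holding a reference to list1 see the change.
    list1[:] = [x for x in list1 if tuple(x) not in list2_set]

This keeps the original order of list1 and touches each element of both lists only once.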
If we rule the data structure itself out, look at your memory usage next. If you end up asking the OS to swap in for you (i.e., the list takes up more memory than you have), Python's going to sit in iowait waiting on the OS to get the pages from disk, which makes sense given your description.
Is Python sitting in a jacuzzi of iowait when this slowdown happens? Anything else going on in the environment?
(If you're not sure, update with your platform and one of us will tell you how to tell.)
The only reason the code itself would become slower is if you have big elements in both lists which share a lot of common values (so each list1[curIndex] in list2 check takes more time).
Here are a couple of ways to fix this:
If you don't care about the order, convert both lists into sets and use set1.difference(set2)
If the order in list1 is important, then at least convert list2 into a set because in is much faster with a set.
Lastly, try a filter: filter(lambda x: x not in set2, list1)
[EDIT] Since set() doesn't work on lists of lists (the inner lists are unhashable; didn't expect that), try:
result = filter(lambda x: x not in list2, list1)
It should still be much faster than your version. If it isn't, then your last option is to make sure that there can't be duplicate elements in either list. That would allow you to remove items from both lists (and therefore make the comparison ever cheaper as you find elements from list2).
EDIT: I've updated my answer to account for lists being unhashable, as well as some other feedback. This one is even tested.
It probably relates to the cost of popping an item out of the middle of a list.
Alternatively have you tried using sets to handle this?
def difference(list1, list2):
    # Build the set of tuples once, rather than rebuilding it for every element of list1.
    list2_set = set(tuple(y) for y in list2)
    return [x for x in list1 if tuple(x) not in list2_set]
You can then set list1 to the resulting list, if that is your intention, by doing:
list1 = difference(list1, list2)
The often-suggested set won't work here, because the two lists contain lists, which are unhashable. You need to change your data structure first.
You can either:
convert the sublists into tuples or class instances to make them hashable, then use sets, or
keep both lists sorted, then you only have to compare the lists' heads (see the sketch below).
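A rough sketch of the second idea (a merge-style walk, assuming both lists are already sorted and their sublists are comparable; the name sorted_difference is just illustrative):

def sorted_difference(list1, list2):
    # Walk both sorted lists in lockstep, like the merge step of merge sort.
    result = []
    j = 0
    for item in list1:
        # Skip over list2 entries that are smaller than the current item.
        while j < len(list2) and list2[j] < item:
            j += 1
        # Keep the item only if it has no match at the current position in list2.
        if j >= len(list2) or list2[j] != item:
            result.append(item)
    return result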
I have a list and a subset of it, and I want to find the index, in the full list, of each element of the subset. I have currently tried this code:
def convert_toindex(listof_elements, listof_indices):
    for i in range(len(listof_elements)):
        listof_elements[:] = [listof_indices.index(x) for x in listof_elements]
    return listof_elements
list1 = ['lol', 'please', 'help']
list2 = ['help', 'lol', 'please', 'extra']
What I want to happen when I do convert_toindex(list1, list2) is the output to be [2, 0, 1]
However, when I do this I get a ValueError: '0' is not in list.
0, however, appears nowhere in either list so I am not sure why this is happening.
Secondly, if I have a list of lists, and I want to do this process for all the nested lists inside the big list, would I do something like this?
for smalllist in biglist:
    smalllist[:] = [dict_of_indices[x] for x in smalllist]
Where dict_of_indices is the dictionary of indices created following the top answer.
The problem is that, instead of doing this once, you're doing it over and over, N times:
for i in range(len(listof_elements)):
    listof_elements[:] = [listof_indices.index(x) for x in listof_elements]
The first time through, you replace every value in listof_elements with its index in listof_indices. So far, so good. In fact, you should be done there.
But then you do it a second time. You look up each of those indices, as if they were values, in listof_indices. And some of them aren't there. So you get an error.
You can solve this just by removing the outer loop. You're already done after the first time.
You may be confused because this problem seems to inherently require two loops—but you already do have two loops. The first is the obvious one in the list comprehension, and the second one is the one hidden inside listof_indices.index.
While we're at it: while this problem does require two loops, it doesn't require them to be nested.
Instead of looping over listof_indices to find each x, you can loop over it in advance to build a dictionary:
dict_of_indices = {value: index for index, value in enumerate(listof_indices)}
And then just do a direct lookup in that dictionary:
listof_elements[:] = [dict_of_indices[x] for x in listof_elements]
Besides being a whole lot faster (O(N+M) time rather than O(N*M)), I think this might also be easier to understand, and to debug. The first line may be a bit tricky, but you can easily print out the dict and verify that it's correct. And then the second line is about as trivial as you can get.
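Putting it together, a minimal sketch of the fixed function (the exact names are just illustrative):

def convert_toindex(listof_elements, listof_indices):
    # Map each value to its position in listof_indices, built in a single pass.
    dict_of_indices = {value: index for index, value in enumerate(listof_indices)}
    # Replace each element with its index via an O(1) dictionary lookup.
    return [dict_of_indices[x] for x in listof_elements]

list1 = ['lol', 'please', 'help']
list2 = ['help', 'lol', 'please', 'extra']
print(convert_toindex(list1, list2))  # [1, 2, 0] -- the position of each list1 item in list2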
I was doing one of the course exercises on Codecademy for Python and I had a few questions I couldn't seem to find an answer to:
For this block of code, how exactly does python check whether something is "in" or "not in" a list? Does it run through each item in the list to check or does it use a quicker process?
Also, how would this code be affected if it were running with a massive list of numbers (thousands or millions)? Would it slow down as the list size increases, and are there better alternatives?
numbers = [1, 1, 2, 3, 5, 8, 13]

def remove_duplicates(list):
    new_list = []
    for i in list:
        if i not in new_list:
            new_list.append(i)
    return new_list

remove_duplicates(numbers)
Thanks!
P.S. Why does this code not function the same?
numbers = [1, 1, 2, 3, 5, 8, 13]

def remove_duplicates(list):
    new_list = []
    new_list.append(i for i in list if i not in new_list)
    return new_list
In order to execute i not in new_list Python has to do a linear scan of the list. The scanning loop breaks as soon as the result of the test is known, but if i is actually not in the list the whole list must be scanned to determine that. It does that at C speed, so it's faster than doing a Python loop to explicitly check each item. Doing the occasional in some_list test is ok, but if you need to do a lot of such membership tests it's much better to use a set.
On average, with random data, testing membership has to scan through half the list items, and in general the time taken to perform the scan is proportional to the length of the list. In the usual notation the size of the list is denoted by n, and the time complexity of this task is written as O(n).
In contrast, determining membership of a set (or a dict) can be done (on average) in constant time, so its time complexity is O(1). Please see TimeComplexity in the Python Wiki for further details on this topic. Thanks, Serge, for that link.
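A quick, informal way to see the difference on your own machine (the numbers will vary; this is just an illustration):

import timeit

setup = "data_list = list(range(100000)); data_set = set(data_list)"
# Worst case for the list: the item is absent, so the scan has to touch every element.
print(timeit.timeit("-1 in data_list", setup=setup, number=1000))
print(timeit.timeit("-1 in data_set", setup=setup, number=1000))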
Of course, if you're using a set then you get de-duplication for free, since it's impossible to add duplicate items to a set.
One problem with sets is that they generally don't preserve order. But you can use a set as an auxiliary collection to speed up de-duping. Here is an illustration of one common technique to de-dupe a list, or other ordered collection, which does preserve order. I'll use a string as the data source because I'm too lazy to type out a list. ;)
new_list = []
seen = set()
for c in "this is a test":
    if c not in seen:
        new_list.append(c)
        seen.add(c)
print(new_list)
output
['t', 'h', 'i', 's', ' ', 'a', 'e']
Please see How do you remove duplicates from a list whilst preserving order? for more examples. Thanks, Jean-François Fabre, for the link.
As for your PS, that code appends a single generator object to new_list; it doesn't append what the generator would produce.
I assume you already tried to do it with a list comprehension:
new_list = [i for i in list if i not in new_list]
That doesn't work, because the new_list doesn't exist until the list comp finishes running, so doing in new_list would raise a NameError. And even if you did new_list = [] before the list comp, it won't be modified by the list comp, and the result of the list comp would simply replace that empty list object with a new one.
BTW, please don't use list as a variable name (even in example code) since that shadows the built-in list type, which can lead to mysterious error messages.
You are asking multiple questions, and one of them is whether you can do this more efficiently. I'll answer that one.
OK, let's say you have thousands or millions of numbers. Where do they come from? Say they are stored in some kind of text file; then you would probably want to use numpy (if you are sticking with Python, that is). Example:
import numpy as np
numbers = np.array([1, 1, 2, 3, 5, 8, 13], dtype=np.int32)
numbers = np.unique(numbers).tolist()
This will be more efficient (above all, more memory-efficient) than reading the data with plain Python and performing list(set(...)), as below:
numbers = [1, 1, 2, 3, 5, 8, 13]
numbers = list(set(numbers))
You are asking for the algorithmic complexity of this function. To find that you need to see what is happening at each step.
You are scanning the input list one item at a time, which takes 1 unit of work per item, because retrieving something from a list by index is an O(1) operation.
The list to which you are adding unique items grows by at most one element per step, so at any point in time the unique-items list can be up to size n.
Now, checking whether the item you picked is already in the unique-items list takes up to n work in the worst case, because we have to scan each item already there to decide.
So if you sum up the total work across the steps, it is 1 + 2 + 3 + ... + n, which is n(n + 1) / 2. If you have a million items, plugging n = 1,000,000 into the formula gives roughly 5 * 10^11 operations.
This is not exactly how lists work internally, but it helps to visualize the cost this way.
To answer the question in the title: Python has more efficient data types, but the list object is just a plain array. If you want a more efficient way to search for values, you can use a dict (or a set), which stores objects by their hash in a hash table; I assume that is what you were thinking of when you mentioned "a quicker process".
As to the second code snippet:
list.append() inserts whatever value you give it at the end of the list. i for i in list if i not in new_list is a generator expression, so append() inserts that generator object itself into the list. list.extend() does what you want: it takes an iterable and appends all of its elements to the list.
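A small sketch of the difference (illustrative only):

numbers = [1, 1, 2, 3, 5, 8, 13]

appended = []
appended.append(i for i in numbers)  # appends one generator object
print(appended)                      # [<generator object ...>]

extended = []
extended.extend(i for i in numbers)  # consumes the generator, element by element
print(extended)                      # [1, 1, 2, 3, 5, 8, 13]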
I have a list of lists, let's call it thelist, that looks like this:
[[Redditor(name='Kyle'), Redditor(name='complex_r'), Redditor(name='Lor')],
[Redditor(name='krispy'), Redditor(name='flyry'), Redditor(name='Ooer'), Redditor(name='Iewee')],
[Redditor(name='Athaa'), Redditor(name='The_Dark_'), Redditor(name='drpeterfost'), Redditor(name='owise23'), Redditor(name='invirtibrit')],
[Redditor(name='Dacak'), Redditor(name='synbio'), Redditor(name='Thosee')]]
thelist has 1000 elements (or lists). I'm trying to compare each one of these with the others pairwise and get the number of common elements for each pair of lists. Here is the code doing that:
def calculate(list1, list2):
    a = 0
    for i in list1:
        if i in list2:
            a += 1
    return a

for i in range(len(thelist)-1):
    for j in range(i+1, len(thelist)):
        print calculate(thelist[i], thelist[j])
My problem is that the calculation is extremely slow, taking 2 or more seconds per list pair depending on their length. I'm guessing this has to do with my list structure. What am I missing here?
First, I would recommend making your class hashable, as described here: What's a correct and good way to implement __hash__()?
You can then make your list of lists a list of sets by doing:
thelist = [set(l) for l in thelist]
Then your function will work much faster!
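A minimal sketch of the idea (the Redditor class here is only a stand-in based on the repr in the question; the real class would need its own __hash__ and __eq__, as in the linked answer):

class Redditor(object):
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return isinstance(other, Redditor) and self.name == other.name
    def __hash__(self):
        return hash(self.name)

thelist = [
    [Redditor('Kyle'), Redditor('complex_r'), Redditor('Lor')],
    [Redditor('Kyle'), Redditor('flyry'), Redditor('Lor')],
]
thelist = [set(l) for l in thelist]

# Counting common elements for a pair is now a fast set intersection.
for i in range(len(thelist) - 1):
    for j in range(i + 1, len(thelist)):
        print(len(thelist[i] & thelist[j]))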
Which one of these is considered the more pythonic, taking into account scalability and readability?
Using enumerate:
group = ['A','B','C']
tag = ['a','b','c']
for idx, x in enumerate(group):
    print(x, tag[idx])
or using zip:
for x, y in zip(group, tag):
    print(x, y)
The reason I ask is that I have been using a mix of both. I should keep to one standard approach, but which should it be?
No doubt, zip is more pythonic. It doesn't require that you use a variable to store an index (which you don't otherwise need), and using it allows handling the lists uniformly, while with enumerate, you iterate over one list, and index the other list, i.e. non-uniform handling.
However, you should be aware of the caveat that zip runs only up to the shorter of the two lists. To avoid duplicating someone else's answer I'd just include a reference here: someone else's answer.
user3100115 aptly points out that in Python 2, you should prefer itertools.izip over zip, due to its lazy nature (faster and more memory-efficient). In Python 3, zip already behaves like Python 2's izip.
While others have pointed out that zip is in fact more pythonic than enumerate, I came here to see if it was any more efficient. According to my tests, zip is around 10 to 20% faster than enumerate when simply accessing and using items from multiple lists in parallel.
Here I have three lists of (the same) increasing length being accessed in parallel. When the lists are more than a couple of items in length, the time ratio of zip/enumerate is below 1, i.e. zip is faster.
Code I used:
import timeit
setup = \
"""
import random
size = {}
a = [ random.randint(0,i+1) for i in range(size) ]
b = [ random.random()*i for i in range(size) ]
c = [ random.random()+i for i in range(size) ]
"""
code_zip = \
"""
data = []
for x,y,z in zip(a,b,c):
    data.append(x+z+y)
"""
code_enum = \
"""
data = []
for i,x in enumerate(a):
    data.append(x+c[i]+b[i])
"""
runs = 10000
sizes = [ 2**i for i in range(16) ]
data = []
for size in sizes:
    formatted_setup = setup.format(size)
    time_zip = timeit.timeit(code_zip, formatted_setup, number=runs)
    time_enum = timeit.timeit(code_enum, formatted_setup, number=runs)
    ratio = time_zip/time_enum
    row = (size, time_zip, time_enum, ratio)
    data.append(row)
with open("testzipspeed.csv", 'w') as csv_file:
csv_file.write("size,time_zip,time_enumerate,ratio\n")
for row in data:
csv_file.write(",".join([ str(i) for i in row ])+"\n")
The answer to the question asked in your title, "Which is more pythonic; zip or enumerate...?" is: they both are. enumerate is just a special case of zip.
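To see why, a quick illustrative check:

from itertools import count

seq = ['a', 'b', 'c']
assert list(enumerate(seq)) == list(zip(count(), seq))  # [(0, 'a'), (1, 'b'), (2, 'c')]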
The answer to your more specific question about that for loop is: use zip, but not for the reasons you've seen so far.
The biggest advantage of zip in that loop has nothing to do with zip itself. It has to do with avoiding the assumptions made in your enumerate loop. To explain, I'll make two different generators based on your two examples:
def process_items_and_tags(items, tags):
    "Do something with two iterables: items and tags."
    for item, tag in zip(items, tags):
        yield process(item, tag)
def process_items_and_list_of_tags(items, tags_list):
    "Do something with an iterable of items and an indexable collection of tags."
    for idx, item in enumerate(items):
        yield process(item, tags_list[idx])
Both generators can take any iterable as their first argument (items), but they differ in how they handle their second argument. The enumerate-based approach can only process tags in a list-like collection with [] indexing. That rules out a huge number of iterables, like file streams and generators, for no good reason.
Why is one parameter more tightly constrained than the other? The restriction isn't inherent in the problem the user is trying to solve, since the generator could just as easily have been written the other way 'round:
def process_list_of_items_and_tags(items_list, tags):
    "Do something with an indexable collection of items and an iterable of tags."
    for idx, tag in enumerate(tags):
        yield process(items_list[idx], tag)
Same result, different restriction on the inputs. Why should your caller have to know or care about any of that?
As an added penalty, anything of the form some_list[some_index] could raise an IndexError, which you would have to either catch or prevent in some way. That's not normally a problem when your loop both enumerates and accesses the same list-like collection, but here you're enumerating one and then accessing items from another. You'd have to add more code to handle an error that could not have happened in the zip-based version.
Avoiding the unnecessary idx variable is also nice, but hardly the deciding difference between the two approaches.
For more on the subject of iterables, generators, and functions that use them, see Ned Batchelder's PyCon US 2013 talk, "Loop Like a Native" (text, 30-minute video).
As others have said, zip is more pythonic because it doesn't require another variable, though you could also use:
import sys
from collections import deque

deque(map(lambda x, y: sys.stdout.write(x + " " + y + "\n"), group, tag), maxlen=0)
Since we are printing output here, the None values returned by sys.stdout.write need to be discarded (which is what the zero-length deque does), and this also assumes your lists are of the same length.
Update: Well, in this case it may not be as good, because you are only printing the group and tag values and sys.stdout.write produces None values; but practically, if you needed to fetch values, it would be better.
zip might be more Pythonic, but it has a gotcha. If you want to change elements in place, you need to use indexing. Iterating over the elements will not work. For example:
x = [1, 2, 3]
for elem in x:
    elem *= 10
print(x)
Output: [1,2,3]
y = [1, 2, 3]
for index in range(len(y)):
    y[index] *= 10
print(y)
Output: [10,20,30]
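For what it's worth, enumerate also lets you modify elements in place without a bare range(len(...)) loop; a small sketch:

z = [1, 2, 3]
for index, elem in enumerate(z):
    z[index] = elem * 10  # write back through the index
print(z)  # [10, 20, 30]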
This is a trivial starting question, but I think range(len(list)) isn't pythonic, so let's try some alternatives.
Thinking about it and reading the excellent Python documentation, enumerate is a solution for iterables when you need a for loop, and a comprehension is a compact way to build an iterable.
list_a = ['a', 'b', 'c']
list_2 = ['1', '2', '3']

[print(a) for a in list_a]
executes the print for each item, and perhaps a generator is better:
generator_item = (print(i, a) for i, a in enumerate(list_a) if a.find('a') == 0)
next(generator_item)
For multiline and more complex for loops, we can use enumerate(zip(...)):
for i, (arg1, arg2) in enumerate(zip(list_a, list_2)):
    print('multiline')  # do complex code
But perhaps in more extended pythonic code we can use another format with itertools; note idx (an alias for itertools.count) placed at the end of the zip:
from itertools import count as idx

for arg1, arg2, i in zip(list_a, list_2, idx(start=1)):
    print(f'multiline {i}: {arg1}, {arg2}')  # do complex code
I want to return true from the if statement only if all of the elements from list 1 also exist in list 2 (list 2 is a superset of list 1). What is the most pythonic way of writing this?
You can use set operations:
if set(list1) <= set(list2):
    # ...
Note that the comparison itself is fast, but converting the lists to sets might not (depends on the size of the lists).
Converting to a set also removes any duplicates. So if you have duplicate elements and want to ensure that they are also duplicates in the other list, using sets will not work (see the sketch below).
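If duplicates do matter, one option is a multiset comparison with collections.Counter (a sketch, not part of the original answer; is_sub_multiset is just an illustrative name):

from collections import Counter

def is_sub_multiset(list1, list2):
    # True only if every element of list1 appears in list2 at least as many times.
    counts1, counts2 = Counter(list1), Counter(list2)
    return all(counts2[item] >= count for item, count in counts1.items())

print(is_sub_multiset([1, 1, 2], [1, 2, 3]))     # False: only one 1 in the second list
print(is_sub_multiset([1, 1, 2], [1, 1, 2, 3]))  # True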
You can use built-in all() function:
if all(x in sLVals for x in fLVals):
    # do something
In case you use sets, I think you can take a look at the difference method; as far as I know, it is quite a fast way:
if set(fLVals).difference(sLVals):
    # there is a difference
else:
    # no difference
Either set.issuperset or all(x in L2 for x in L1).
This one came straight out of good folks at MIT:
from operator import and_
reduce(and_, [x in b for x in a])
I tried to find the "readings.pdf" they had posted for the 6.01 class about a year ago...but I can't find it anymore.
Head to my profile and send me an email, and I'll send you the .pdf where I got this example. It's a very good book, but it doesn't seem to be a part of the class anymore.