Iterating over a growing set in Python

I have a set, setOfManyElements (call it S), which contains n elements. I need to go through all those elements and run a function on each element of S:
for s in setOfManyElements:
    elementsFound = EvilFunction(s)
    setOfManyElements |= elementsFound
EvilFunction(s) returns the set of elements it has found. Some of them will already be in S, some will be new, and some will be in S and will have already been tested.
The problem is that each time I run EvilFunction, S will expand (until it reaches a maximum set, at which point it will stop growing). So I am essentially iterating over a growing set. Also, EvilFunction takes a long time to compute, so I do not want to run it twice on the same data.
Is there an efficient way to approach this problem in Python 2.7?
LATE EDIT: changed the name of the variables to make them more understandable. Thanks for the suggestion

I suggest an incremental version of 6502's approach:
seen = set(initial_items)
active = set(initial_items)
while active:
    next_active = set()
    for item in active:
        for result in evil_func(item):
            if result not in seen:
                seen.add(result)
                next_active.add(result)
    active = next_active
This visits each item only once, and when finished seen contains all visited items.
For further research: this is a breadth-first graph search.
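For example, the loop can be packaged as a function and applied to the names from the question. This is only a sketch assuming EvilFunction and setOfManyElements exist as described above (expand_all is just an illustrative name):
expand_all_example = """
def expand_all(initial_items, evil_func):
    # breadth-first expansion: evil_func is called exactly once per distinct item
    seen = set(initial_items)
    active = set(initial_items)
    while active:
        next_active = set()
        for item in active:
            for result in evil_func(item):
                if result not in seen:
                    seen.add(result)
                    next_active.add(result)
        active = next_active
    return seen

setOfManyElements = expand_all(setOfManyElements, EvilFunction)
"""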

You can just keep a set of already visited elements and pick a not-yet-visited element each time:
visited = set()
todo = set(S)          # work on a copy so the original S is not drained by pop()
while todo:
    s = todo.pop()
    visited.add(s)
    todo |= EvilFunction(s) - visited

Iterating over a set in your scenario is a bad idea: you have no guarantee on the ordering, and iterators are not meant to be used on a set that is being modified. You do not know what will happen to the iterator (in current CPython this typically raises a RuntimeError), nor where a newly inserted element would end up relative to it.
However, using a list and a set may be a good idea:
list_elements = list(set_elements)   # set_elements is the original set from the question
for s in list_elements:
    elementsFound = EvilFunction(s)
    new_subset = elementsFound - set_elements   # set difference against the set, not the list
    list_elements.extend(new_subset)
    set_elements |= new_subset
Edit
Depending on the size of everything, you could even drop the set entirely
for s in list_elements:
    elementsFound = EvilFunction(s)
    list_elements.extend(i for i in elementsFound if i not in list_elements)
However, I am not sure about the performance of this; I think you should profile. If the list is huge, the set-based solution looks better, since set membership tests and set operations are cheap. For moderate sizes, EvilFunction may be expensive enough that the linear membership tests on the list do not matter.
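A minimal profiling sketch along those lines, using a cheap stand-in for EvilFunction just to time the bookkeeping of the two loops (the real numbers will of course depend on the real function and data):
import timeit

def evil_function(s):
    # cheap stand-in for the real EvilFunction, only here to exercise the loops
    return {s * 2 % 1000, s * 3 % 1000}

def with_set(initial):
    set_elements = set(initial)
    list_elements = list(set_elements)
    for s in list_elements:
        new_subset = evil_function(s) - set_elements
        list_elements.extend(new_subset)
        set_elements |= new_subset
    return set_elements

def list_only(initial):
    list_elements = list(initial)
    for s in list_elements:
        list_elements.extend(i for i in evil_function(s) if i not in list_elements)
    return set(list_elements)

print(timeit.timeit(lambda: with_set(range(100)), number=100))
print(timeit.timeit(lambda: list_only(range(100)), number=100))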

Related

Efficient reverse order comparison of huge growing list in Python

In Python, my goal is to maintain a unique list of points (complex scalars, rounded), while steadily creating new ones with a function, like in this pseudo code
list_of_points = []
while True:
    # generate new point according to some rule
    z = generate()
    # check whether this point is already there
    if z not in list_of_points:
        list_of_points.append(z)
    if some_condition:
        break
Now list_of_points can become potentially huge (like 10 million entries or even more) during the process and duplicates are quite frequent. In fact about 50% of the time, a newly created point is already somewhere in the list. However, what I know is that oftentimes the already existing point is near the end of the list. Sometimes it is in the "bulk" and only very occasionally it can be found near the beginning.
This brought me to the idea of doing the search in reverse order. But how would I do this most efficiently (in terms of raw speed), given my potentially large list which grows during the process. Is the list container even the best way here?
I managed to gain some performance by doing this
list_of_points = []
while True:
    # generate new point according to some rule
    z = generate()
    # check very end of list
    if z in list_of_points[-10:]:
        continue
    # check deeper into the list
    if z in list_of_points[-100:-10]:
        continue
    # check the rest
    if z not in list_of_points[:-100]:
        list_of_points.append(z)
    if some_condition:
        break
Apparently, this is not very elegant. Using a second, FIFO-type container (collections.deque) instead gives about the same speed-up.
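For reference, a minimal sketch of that deque variant; generate() here is only a stand-in for the real point generator (rounded complex scalars, as in the question), and the cache size of 100 and the stopping condition are arbitrary choices:
import random
from collections import deque

def generate():
    # stand-in generator: complex scalars rounded to two decimals
    return complex(round(random.uniform(-1, 1), 2), round(random.uniform(-1, 1), 2))

list_of_points = []
recent = deque(maxlen=100)            # FIFO cache of recently added points
while len(list_of_points) < 2000:     # stand-in for some_condition
    z = generate()
    if z in recent:                   # cheap check against the most recent points first
        continue
    if z not in list_of_points:       # fall back to the full (slow) scan
        list_of_points.append(z)
        recent.append(z)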
Your best bet might be to use a set instead of a list; Python sets use hashing to insert items, so lookup and insertion are very fast. And you can skip the step of checking whether an item is already in the list by simply trying to add it: if it is already in the set, it won't be added again, since duplicates are not allowed.
Stealing your pseudo code example:
set_of_points = set()   # note: {} would create an empty dict, not a set
while True:
    # get size of set
    a = len(set_of_points)
    # generate new point according to some rule
    z = generate()
    # try to add z to the set
    set_of_points.add(z)
    b = len(set_of_points)
    # if a == b it was not added, thus it already existed in the set
    if some_condition:
        break
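If the a == b check is actually needed, for instance to count how often an existing point is regenerated, it could be used like this (generate and some_condition are the same placeholders as in the pseudo code above):
duplicates = 0
set_of_points = set()
while True:
    a = len(set_of_points)
    z = generate()
    set_of_points.add(z)
    if len(set_of_points) == a:
        # z was already in the set, so the add was a no-op
        duplicates += 1
    if some_condition:
        break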
Use a set. This is what sets are for. Ah - you already have an answer saying that. So my other comment: this part of your code appears to be incorrect:
# check the rest
if z not in list_of_points[100:]:
    list_of_points.append(z)
In context, I believe you meant to write list_of_points[:-100] there instead. You already checked the last 100, but, as is, you're skipping checking the first 100 instead.
But even better, use plain list_of_points. As the list grows longer, the cost of possibly doing 100 redundant comparisons becomes trivial compared to the cost of copying len(list_of_points) - 100 elements just to build the slice.

Are nested for loops always slow?

It seems there are quite a few questions and answers related to the speed of nested for loops - I think I looked at every single one of them! But unfortunately I am still not exactly sure why my code is slow. I'm hoping I can get some guidance from you fine people.
I download a csv file daily that has ~116,000 entries. Items are added and taken away from it at inconsistent points in the file, so every day I want to see what was added and what was taken away.
Getting the entries from csv to a list takes no time at all, for both the old and new list, but I encounter a big speed decrease in the next part of the code, although at the end, it does what I want and spits out the difference - items added and items removed.
Each of the 116,000 items in the list is a dictionary like so:
old or new = [{'Date Stamped': '', 'Name': '', 'Registration Number': '', 'Type': '', 'Form Name': '', 'URL': ''}]
when I get to this point:
added = [i for i in new if not i in old]
removed = [i for i in old if not i in new]
It takes 25 minutes to finish! I feel like this is a long time, but also I may not be understanding exactly what I'm doing.
Each list (old and new) has ~116000 items in it. Is that because i has to iterate through ~116,000 items 4 times?
It does what I want, in the end, but it seems awfully slow for what it's doing; that said, this is really the first time I've worked with a data set with this many items, so maybe it's par for the course.
Is this slow because it is a nested for loop? Is it slow because of the size? I am definitely an amateur and really appreciate everyone's help. Thanks so much.
Effectively, yes, it's slow because it's a nested for loop, because of the size.
Python's element in list operation works by just searching the entire list, element by element, for the one it wants. If you have to do that for every single element in new, that means you're possibly searching the entire old for each element in new.
Lists are not a good datastructure for searching through. What you should be doing instead, if you have a use case like this, is to transform them into a set first - an unordered collection (but order probably doesn't matter) which uses a hashtable to determine whether elements are present in it. Now, instead of searching the entire datastructure element-by-element, it can just hash the element being searched for, check if there's an element there, and say so if so.
In other words, element in set is O(1) on average, while element in list is O(n), so the set version is dramatically more efficient. For a relatively small overhead cost (creating the sets in the first place), this shaves a huge amount of time off of the for loops that follow:
old_set = set(old)
new_set = set(new)
added = [i for i in new if i not in old_set]
removed = [i for i in old if i not in new_set]
Furthermore, you can even dispense with the list comprehension, because set supports operations from set theory - taking the difference between two sets (elements in one set that are not in the other) is as easy as subtracting them:
added = list(new_set - old_set) # (new_set - old_set) is identical to new_set.difference(old_set)
removed = list(old_set - new_set)
which is probably even more efficient than a list comprehension, because it's optimized for exactly this use case.
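One caveat for this particular question: the rows are dicts, and dicts are not hashable, so set(old) would raise a TypeError. A minimal workaround sketch is to convert each row to a hashable tuple first (frozen_row is just an illustrative helper name):
def frozen_row(row):
    # turn a row dict into a hashable, order-independent representation
    return tuple(sorted(row.items()))

old_set = set(frozen_row(r) for r in old)
new_set = set(frozen_row(r) for r in new)

added = [r for r in new if frozen_row(r) not in old_set]
removed = [r for r in old if frozen_row(r) not in new_set]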

Efficient data structure for insertion, deletion and search

I need to keep track of open and closed states as certain resources are used. The first structure I have used is two lists, open_states_list and closed_states_list, to which I add and from which I remove as states are updated.
Inserting is O(1) if we ignore the internal memory allocation and reallocation mechanism, I guess: open_states_list.append(x).
Deleting is O(n): open_states_list.remove(x).
Searching, just like deleting, is O(n): x in open_states_list.
Getting the list is obviously O(1).
The second structure I have used is a single dictionary with boolean values open_states = {} where if open_states[x] is True then x is open and if it is False then it is closed.
"Inserting" is O(1) as we are just setting a key and a value: open_states[x] = True.
"Deleting" is O(1) for the same reason: open_states[x] = False.
"Searching" is also O(1), just accessing the value of the key: open_states[x].
Getting the list is O(n): [x for x, s in open_states.iteritems() if s].
Our most common operation is checking whether x is open or not (search), so the second option is better. But our second most common operation, getting the list of open or closed states, follows the search operation very closely in frequency, and it is more efficient in the first option. We cannot choose the first option, though, because there checking the state of a resource is O(n).
Inserting is very common too. Removing is not as essential.
What would be the most efficient data structure for these needs?
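For concreteness, here is a minimal sketch of the second structure described above (the dict of booleans), with the listing operation that is the expensive part; the helper names are illustrative only:
open_states = {}

def set_open(x):
    open_states[x] = True                # O(1) insert / mark as open

def set_closed(x):
    open_states[x] = False               # O(1) mark as closed

def is_open(x):
    return open_states.get(x, False)     # O(1) search

def list_open():
    # O(n): has to scan every tracked resource
    return [x for x, s in open_states.iteritems() if s]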

Python for loop index, without enumerate

Many of us know that enumerate is used when you are in a for loop and need to know the index. However, it has its downsides. According to my tests with the timeit module, just using enumerate makes the code 2x slower, and adding the tuple assignment makes it up to 3x slower. These numbers may sound fast enough for most programmers, but people dealing with algorithms know that every bit of code you can optimize is a huge advantage. Now to my question:
An example of this usage would be the need to find the indexes of multiple elements in a list. Say that there are two elements we need to find. The first two solutions that occur to me are like so:
x, y = 0, 0
for ind, val in enumerate(lst):
    if x and y:
        break
    if val == "a":
        x = ind
    elif val == "b":
        y = ind
The solution above iterates the list, assigns the values, then breaks once both are found.
x = lst.index("a")
y = lst.index("b")
This is another solution, which I didn't want to use because it appeared really naive: it iterates over the same list twice to find two elements, whereas the first solution does it in a single pass. So in complexity terms, even though we make extra assignments in the first solution, it should be faster than the second one on larger lists. But my assumption failed.
Here is the code I tested the performance: https://codeshare.io/XfvGA
The second solution was 2x to 10x faster than the first one, depending on the positions of the two elements. There are several possibilities for why this would occur:
There is an optimization in the index() method that I am unaware of.
Lower-level assignments are being made in the index() method, possibly using C++ code.
The conditions and extra assignments in the first solution make it slower than expected.
Even these reasons fall short of explaining why iterating the list twice would beat iterating it once. Though languages differ greatly in how fast they run code, the iteration process itself is independent of the programming language: if you need to check a million elements, you still have to check a million elements (illustrated by map() not being much faster than using a loop to change values).
So, to clarify what is being asked here (though I still need you to examine the cases I presented), the question can be put like this. We know that Python's for loop is actually a while loop running in the background (possibly in C?). That means the index is being stored and incremented somewhere in memory as the loop runs. If there were a way to access it, this would eliminate the cost of calling and unpacking enumerate. My question is:
Does such a way exist? If not, could one be made (and why, or why not)?
The sources I used for more information on the subject:
Python speed
Python objects time complexity
Performance tips for Python
I don't think that enumerate is the problem; to prove this you can do:
x, y = 0, 0
for val in a:
    if x and y:
        break
    if val == "a":
        x = val
    elif val == "b":
        y = val
This doesn't do the same thing you wanted in the first place (you don't get the index), but if you measure it with timeit you will find that the difference is not so significant, meaning that enumerate is not the source of the problem (in my case it was 0.185 vs 0.155 when running your example, so it is faster, but the second solution got 0.055 on my computer).
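A minimal sketch of that kind of measurement (the list contents and the positions of "a" and "b" are arbitrary choices; absolute timings will differ per machine):
import timeit

lst = list("x" * 1000) + ["a"] + list("x" * 1000) + ["b"]

def with_enumerate():
    x = y = 0
    for ind, val in enumerate(lst):
        if x and y:
            break
        if val == "a":
            x = ind
        elif val == "b":
            y = ind
    return x, y

def with_index():
    return lst.index("a"), lst.index("b")

print(timeit.timeit(with_enumerate, number=1000))
print(timeit.timeit(with_index, number=1000))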
The reason that lst.index is faster is that it is implemented in C.
You can see its source code here:
https://svn.python.org/projects/python/trunk/Objects/listobject.c
The index function is called listindex in this file and is defined like this:
static PyObject *
listindex(PyListObject *self, PyObject *args)
(I couldn't find a way to add a link directly to the function.)
You are trying to be un-Pythonic, which isn't going to end terribly well for you. If you really need to have that iterator count information available, there is a well-known and optimized way to do that: enumerate(). If you need to find an item in a list, there is a well-known and optimized way to do that: lst.index(). As DorElias showed above/below, enumerate is not the problem, it's that you're attempting to reinvent the wheel with the rest of your for loop. enumerate is going to be the best-supported (clearest, fastest, etc.) way to maintain an iteration count in every situation where an iteration count is actually the thing you need.

list membership test or set

Is it more efficient to check if an item is already in a list before adding it:
for word in open('book.txt','r').read().split():
    if word in list:
        pass
    else:
        list.append(word)
or to add everything and then run set() on it, like this:
for word in open('book.txt','r').read().split():
    list.append(word)
list = set(list)
If the ultimate intention is to construct a set, construct it directly and don't bother with the list:
words = set(open('book.txt','r').read().split())
This will be simple and efficient.
Just as your original code, this has the downside of first reading the entire file into memory. If that's an issue, this can be solved by reading one line at a time:
words = set(word for line in open('book.txt', 'r') for word in line.split())
(Thanks @Steve Jessop for the suggestion.)
Definitely don't take the first approach in your question, unless you know the list to be short, as it will need to scan the entire list on every single word.
A set is a hash table while a list is an array. set membership tests are O(1) while list membership tests are O(n). If anything, you should be filtering the list using a set, not filtering a set using a list.
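For instance, a minimal sketch of the "filter the list using a set" direction (naughty_words is borrowed from the answer below purely as an illustration):
words = open('book.txt', 'r').read().split()
naughty_words = ['darn', 'heck']      # illustrative contents
naughty = set(naughty_words)          # O(1) membership tests from here on
good_words = [w for w in words if w not in naughty]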
It's worth testing to find out; but I frequently use comprehensions to filter my lists, and I find that works well; particularly if the code is experimental and subject to change.
l = list( open( 'book.txt', 'r').read().split() )
unique_l = list(set( l ))
# maybe something else:
good_l = [ word for word in l if not word in naughty_words ]
I have heard that this helps with efficiency; but as I said, a test tells more.
The algorithm with word in list is an expensive operation. Why? Because, to see if an item is in the list, you have to check every item in the list. Every time. It's a Shlemiel the painter algorithm. Every lookup is O(n), and you do it n times. There's no startup cost, but it gets expensive very quickly. And you end up looking at each item way more than one time - on average, len(list)/2 times.
Looking to see if things are in the set is (usually) MUCH cheaper. Items are hashed, so you calculate the hash, look there, and if it's not there, it's not in the set - O(1). You do have to create the set the first time, so you'll look at every item once. Then you look at each item one more time to see if it's already in your set. Still overall O(n).
So, doing list(set(mylist)) is definitely preferable to your first solution.
@NPE's answer doesn't close the file explicitly. It's better to use a context manager:
with open('book.txt','r') as fin:
    words = set(fin.read().split())
For normal text files this is probably adequate. If it's an entire DNA sequence for example you probably don't want to read the entire file into memory at once.
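If both concerns apply (closing the file explicitly and not reading it all into memory at once), the two ideas combine naturally; a minimal sketch:
words = set()
with open('book.txt', 'r') as fin:
    for line in fin:
        words.update(line.split())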
