I have some Python code that is taking too much time (it actually never completed):
imp_pos_words = ' '.join([i for i in pos_word_ls if i not in unimp_words])
'unimp_words' is a list of 99,000 alphabetic words
'pos_word_ls' is a list of 1,540,000 alphabetic words
I want to remove from 'pos_word_ls' all the words that appear in 'unimp_words'.
PS: 'pos_word_ls' has duplicate words, so I can't cast it to a set and use set difference.
Please help :)
When you check if i not in unimp_words, you traverse the entire list to find whether i is in it, which takes O(n) time, where n is the length of the list. Since you're doing this 1,540,000 times, it's incredibly slow.
Instead what you can use is a set which will be much faster. This is because when you check if an item is in the set, a hash function is used to find out where i is in the set, and this takes O(1) time.
To convert your list unimp_words to a set, you can use unimp_words = set(unimp_words). Now when you check if i not in unimp_words it should be much faster.
Use a set for just the unimp_words. The i not in lookup will be much faster.
unimp_words = set(unimp_words)
imp_pos_words = ' '.join([i for i in pos_word_ls if i not in unimp_words])
If it is a list, if i not in unimp_words has to traverse the whole list every time it checks a word. A set's hashed lookup is much faster, so your list comprehension should speed up roughly in proportion to the length of unimp_words (about 99,000x).
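To get a feel for the difference, here is a rough, hypothetical benchmark (the word lists are generated randomly and sized down, so the exact numbers will vary):
import random
import string
import timeit

def random_word(k=7):
    # Generate a random lowercase "word" of length k
    return ''.join(random.choices(string.ascii_lowercase, k=k))

unimp_words = [random_word() for _ in range(99_000)]
pos_word_ls = [random_word() for _ in range(1_000)]   # kept small so the list version finishes quickly

unimp_set = set(unimp_words)

list_time = timeit.timeit(lambda: [w for w in pos_word_ls if w not in unimp_words], number=1)
set_time = timeit.timeit(lambda: [w for w in pos_word_ls if w not in unimp_set], number=1)
print(f"list lookup: {list_time:.2f} s   set lookup: {set_time:.4f} s")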
Related
I have an input of about 2-5 million strings of about 400 characters each, coming from a stored text file.
I need to check for duplicates before adding them to the collection that I check against (it doesn't have to be a list; it can be any other data type, and it is effectively a set since all items are unique).
At most about 0.01% of my data is non-unique, and I need to filter those out.
I'm wondering if there is any faster way to check whether the item already exists than:
a = []
for item in data:
    if item not in a:
        a.append(item)
I do not want to lose the order.
Would hashing be faster (I don't need encryption)? But then I'd have to maintain a hash table for all the values to check first.
Is there any way I'm missing?
I'm on Python 2, and can go up to Python 3.5 at most.
It's hard to answer this question because it keeps changing ;-) The version I'm answering asks whether there's a faster way than:
a = []
for item in data:
    if item not in a:
        a.append(item)
That will be horridly slow, taking time quadratic in len(data). In any version of Python the following will take expected-case time linear in len(data):
seen = set()
for item in data:
    if item not in seen:
        seen.add(item)
        emit(item)
where emit() does whatever you like (append to a list, write to a file, whatever).
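For instance, the same pattern can be wrapped in a reusable generator (a sketch; the name unique_everseen is borrowed from the itertools recipes):
def unique_everseen(iterable):
    # Yield items in their original order, skipping anything seen before
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

deduped = list(unique_everseen(data))   # keeps the first occurrence of each string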
In comments I already noted ways to achieve the same thing with ordered dictionaries (whether ordered by language guarantee in Python 3.7, or via the OrderedDict type from the collections package). The code just above is the most memory-efficient, though.
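For reference, a sketch of those ordered-dictionary variants (dict.fromkeys relies on the insertion-order guarantee of Python 3.7+; OrderedDict also works on 2.7 and 3.5):
from collections import OrderedDict

deduped = list(dict.fromkeys(data))          # Python 3.7+: plain dicts keep insertion order
deduped = list(OrderedDict.fromkeys(data))   # Python 2.7 / 3.5: use OrderedDict instead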
You can try this,
a = list(set(data))
A list is an ordered sequence of elements, whereas a set is an unordered collection of distinct elements.
Let's say I have a big list:
word_list = [elt.strip() for elt in open("bible_words.txt", "r").readlines()]
# complexity O(n) --> proportional to list length "n"
I have learned that the hash function used for building dictionaries makes lookups much faster, like so:
word_dict = dict((elt, 1) for elt in word_list)
# lookup complexity O(1) --> constant
Starting from word_list, what is the most efficient and recommended way to reduce the complexity of my code?
The code from the question does just one thing: fills all words from a file into a list. The complexity of that is O(n).
Filling the same words into any other type of container will still have at least O(n) complexity, because it has to read all of the words from the file and it has to put all of the words into the container.
What is different with a dict?
Finding out whether something is in a list has O(n) complexity, because the algorithm has to go through the list item by item and check whether it is the sought item. The item can be found at position 0, which is fast, or it could be the last item (or not in the list at all), which makes it O(n).
In a dict, data is organized in "buckets". When a key:value pair is saved to a dict, the hash of the key is calculated and that number is used to identify the bucket into which the data is stored. Later, when the key is looked up, hash(key) is calculated again to identify the bucket, and then only that bucket is searched. There is typically only one key:value pair per bucket, so the search can be done in O(1).
For more details, see the article about DictionaryKeys on python.org.
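As a loose illustration of the bucket idea (the real implementation is more elaborate, but the principle is the same):
key = "flower"
num_buckets = 8                         # toy value; real tables grow as needed
bucket_index = hash(key) % num_buckets  # hash() maps the key to a number, which picks the bucket
print(bucket_index)                     # the same key lands in the same bucket (within one run)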
How about a set?
A set is something like a dictionary with only keys and no values. The question contains this code:
word_dict = dict((elt, 1) for elt in word_list)
That is obviously a dictionary which does not need values, so a set would be more appropriate.
BTW, there is no need to first create word_list as a list and then convert it to a set or dict. The first step can be skipped:
set_of_words = {elt.strip() for elt in open("bible_words.txt", "r").readlines()}
Are there any drawbacks?
Always ;)
A set does not have duplicates. So counting how many times a word is in the set will never return 2. If that is needed, don't use a set.
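(If you do need counts, a collections.Counter keeps them while still offering fast lookups; a minimal sketch:)
from collections import Counter

counts = Counter(["the", "and", "the", "of"])
print(counts["the"])     # 2
print("and" in counts)   # membership test is still O(1) on average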
A set is not ordered. There is no way to check which was the first word in the set. If that is needed, don't use a set.
Objects saved to sets have to be hashable, which kind-of implies that they are immutable. If it was possible to modify the object, then its hash would change, so it would be in the wrong bucket and searching for it would fail. Anyway, str, int, float, and tuple objects are immutable, so at least those can go into sets.
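A quick demonstration of the hashability requirement:
ok = {(1, 2), "word", 3.14}      # tuples, strings and floats are hashable
try:
    bad = {[1, 2], "word"}       # lists are mutable, hence unhashable
except TypeError as exc:
    print(exc)                   # unhashable type: 'list'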
Writing to a set is probably going to be a bit slower than writing to a list. Still O(n), but a slower O(n), because it has to calculate hashes and organize into buckets, whereas a list just dumps one item after another. See timings below.
Reading everything from a set is also going to be a bit slower than reading everything from a list.
All of these apply to dict as well as to set.
Some examples with timings
Writing to list vs. set:
>>> timeit.timeit('[n for n in range(1000000)]', number=10)
0.7802875302271843
>>> timeit.timeit('{n for n in range(1000000)}', number=10)
1.025623542189976
Reading from list vs. set:
>>> timeit.timeit('989234 in values', setup='values=[n for n in range(1000000)]', number=10)
0.19846207875508526
>>> timeit.timeit('989234 in values', setup='values={n for n in range(1000000)}', number=10)
3.5699193290383846e-06
So, writing to a set seems to be about 30% slower, but finding an item in the set is thousands of times faster when there are thousands of items.
I have a list of ~30 floats. I want to see if a specific float is in my list. For example:
1 >> # For the example below my list has integers, not floats
2 >> list_a = range(30)
3 >> 5.5 in list_a
False
4 >> 1 in list_a
True
The bottleneck in my code is line 3. I search if an item is in my list numerous times, and I require a faster alternative. This bottleneck takes over 99% of my time.
I was able to speed up my code by making list_a a set instead of a list. Are there any other ways to significantly speed up this line?
If the list is not sorted, the best possible time to check whether an element is in it is O(n), because the element may be anywhere, so you need to look at each item and check whether it is the one you are looking for.
If the list were sorted, you could use binary search for O(log n) lookup time. You can also use a hash map for average O(1) lookup time (or the built-in set, which uses the same hashing machinery and accomplishes the same task).
That does not make much sense for a list of length 30, though.
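For completeness, here is a sketch of the binary-search option using the standard bisect module (for 30 items a set, or even a plain list, is already fine):
import bisect

list_a = sorted(range(30))          # sorted once, O(n log n)

def contains(sorted_seq, x):
    # Binary search: O(log n) per lookup
    i = bisect.bisect_left(sorted_seq, x)
    return i < len(sorted_seq) and sorted_seq[i] == x

print(contains(list_a, 5.5))   # False
print(contains(list_a, 1))     # True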
In my experience, Python indeed slows down when searching through a long list.
To complement the suggestion above, my suggestion is to partition the list into subsets, provided the list can be partitioned and each query can easily be assigned to the correct subset.
For example, when searching for a word in an English dictionary, first split the dictionary into 26 sections based on each word's initial letter. If the query is "apple", you only need to search the "A" section. The advantage is that you have greatly limited the search space, hence the speed boost.
For a numerical list, subset it either by range or by first digit.
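A rough sketch of that idea for a word list (the data and names here are made up for illustration):
from collections import defaultdict

words = ["apple", "apricot", "banana", "cherry"]   # made-up sample data

buckets = defaultdict(set)             # group words by first letter, once
for w in words:
    buckets[w[0]].add(w)

query = "apple"
print(query in buckets[query[0]])      # only the "a" bucket is searched -> True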
Hope this helps.
Say that I have a list of strings and a set of the same strings:
l = [str1, str2, str3, str4, ...]
s = set([str1, str2, str3, str4, ...])
I need to run a string comparison with a phrase that I have: comparephrase
I need to iterate over all the elements in the list or the set and generate a ratio between comparephrase and each compared string. I know that set() is faster when doing a membership test. However, I am not doing a membership test but comparing my phrase to the strings in the list/set. Does set() still offer faster speed? If so, why? It seems to me that this set is actually a set with a list inside. Wouldn't that take a long time, since we're iterating over the list within the set?
Iterating over a list is much faster than iterating over a set.
The currently accepted answer uses a very small set and list, and hence the difference is negligible there.
The following code explains it:
>>> import timeit
>>> l = [ x*x for x in range(1, 400)]
>>> s = set(l)
>>> timeit.timeit("for i in s: pass", "from __main__ import s")
12.152284085999781
>>> timeit.timeit("for i in l: pass", "from __main__ import l")
5.460189446001095
>>> timeit.timeit("if 567 in l: pass", "from __main__ import l")
6.0497558240003855
>>> timeit.timeit("if 567 in s: pass", "from __main__ import s")
0.04609546199935721
I do not know what makes set iteration slow, but the fact is evident from the output above.
A Python set is optimized for equality tests and duplicate removal, and thus implements a hash table underneath. I believe this would make it very slightly slower than a list, if you have to compare every element to comparephrase; lists are very good for iterating over every element one after the other. The difference is probably going to be negligible in almost any case, though.
I've run some tests with timeit, and (while list performs slightly faster) there is no significant difference:
>>> import timeit
>>> # For the set
>>> timeit.timeit("for i in s: pass", "s = set([1,4,7,10,13])")
0.20565616500061878
>>> # For the list
>>> timeit.timeit("for i in l: pass", "l = [1,4,7,10,13]")
0.19532391999928223
These values stay very much the same (0.20 vs. 0.19) even when trying multiple times.
However, the overhead of creating the sets can be significant.
As Aditya Shaw stated, the test in the accepted answer is not really representative.
Let me roughly explain the technical differences between iterating over lists and sets in a simple way.
Iterating lists
Lists have their elements organized by order and index "by design", so they can be iterated easily.
Access by index is fast because it relies on a few cheap memory read operations.
Iterating sets
Sets iterate slower because their element access is done by hashes.
Imagine a big tree with many branches, where each leaf is an element. A hash is translated into "directions" for traversing the branches until the leaf is reached.
Finding a leaf (element) is still fast, but slower than a simple index access into a list.
Sets have no linked items, so iteration cannot simply jump to the "next" item the way a list can; it has to start from the root of the tree for each element.
Sets (and dicts) are, in this respect, the opposite of lists.
Each kind has a primary use case.
Lists and sets swap their advantages when it comes to finding elements directly.
Does a list contain an element?
The list has to be scanned element by element until a match is found, so the time depends heavily on the list size.
If the element is near the beginning, it is found quite fast even in large lists.
If the element is near the end or not in the list at all, the list gets scanned completely or almost completely.
Does a set contain an element?
It just has to traverse, say, 5 branches to see whether there's a leaf.
Even in large sets, the number of branches to traverse is relatively low.
Why not make and use a universal collection type?
If a set had an internal list index, the set's operations would be slower because the list would also have to be updated and/or checked.
If a list had an internal set to find items faster, list operations would be slower because the hashes would also have to be updated and/or checked.
The data reachable via the hash and via the index could also become inconsistent because of duplicate handling, and maintaining both would increase memory usage overall.
Is it more efficient to check if an item is already in a list before adding it:
for word in open('book.txt','r').read().split():
    if word in list:
        pass
    else:
        list.append(word)
or to add everything and then run set() on it, like this:
for word in open('book.txt','r').read().split():
    list.append(word)
list = set(list)
If the ultimate intention is to construct a set, construct it directly and don't bother with the list:
words = set(open('book.txt','r').read().split())
This will be simple and efficient.
Like your original code, this has the downside of first reading the entire file into memory. If that's an issue, it can be solved by reading one line at a time:
words = set(word for line in open('book.txt', 'r') for word in line.split())
(Thanks @Steve Jessop for the suggestion.)
Definitely don't take the first approach in your question, unless you know the list to be short, as it will need to scan the entire list on every single word.
A set is a hash table, while a list is an array. Set membership tests are O(1) on average, while list membership tests are O(n). If anything, you should be filtering the list using a set, not filtering a set using a list.
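In other words, something along these lines (a sketch; stop_words.txt is a hypothetical file):
stop_words = set(open('stop_words.txt').read().split())   # hypothetical filter file, turned into a set once
kept = [w for w in open('book.txt').read().split()
        if w not in stop_words]                            # O(1) lookup per word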
It's worth testing to find out; but I frequently use comprehensions to filter my lists, and I find that works well; particularly if the code is experimental and subject to change.
l = list(open('book.txt', 'r').read().split())
unique_l = list(set(l))
# maybe something else:
good_l = [word for word in l if word not in naughty_words]
I have heard that this helps with efficiency; but as I said, a test tells more.
The algorithm with word in list is an expensive operation. Why? Because, to see if an item is in the list, you have to check every item in the list. Every time. It's a Shlemiel the painter algorithm. Every lookup is O(n), and you do it n times. There's no startup cost, but it gets expensive very quickly. And you end up looking at each item way more than one time - on average, len(list)/2 times.
Looking to see if things are in the set is (usually) MUCH cheaper. Items are hashed, so you calculate the hash, look there, and if it's not there, it's not in the set: O(1). You do have to create the set the first time, so you'll look at every item once. Then you look at each item one more time to see if it's already in your set. Still O(n) overall.
So, doing list(set(mylist)) is definitely preferable to your first solution.
@NPE's answer doesn't close the file explicitly. It's better to use a context manager:
with open('book.txt','r') as fin:
    words = set(fin.read().split())
For normal text files this is probably adequate. If it's an entire DNA sequence, for example, you probably don't want to read the whole file into memory at once.
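In that case the two ideas combine naturally; a minimal sketch that streams the file line by line inside the context manager:
words = set()
with open('book.txt', 'r') as fin:
    for line in fin:                  # one line at a time, not the whole file
        words.update(line.split())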