Is it more efficient to check if an item is already in a list before adding it:
words = []
for word in open('book.txt', 'r').read().split():
    if word in words:
        pass
    else:
        words.append(word)
or to add everything and then run set() on it, like this:
words = []
for word in open('book.txt', 'r').read().split():
    words.append(word)
words = set(words)
If the ultimate intention is to construct a set, construct it directly and don't bother with the list:
words = set(open('book.txt','r').read().split())
This will be simple and efficient.
Just like your original code, this has the downside of first reading the entire file into memory. If that's an issue, it can be solved by reading one line at a time:
words = set(word for line in open('book.txt', 'r') for word in line.split())
(Thanks @Steve Jessop for the suggestion.)
Definitely don't take the first approach in your question, unless you know the list to be short, as it will need to scan the entire list on every single word.
A set is a hash table while a list is an array. set membership tests are O(1) while list membership tests are O(n). If anything, you should be filtering the list using a set, not filtering a set using a list.
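As a minimal sketch of that last point (stop_words is an illustrative name, not something from the question; only book.txt is):

stop_words = {'the', 'and', 'of'}          # a set, so each membership test is O(1)
words = open('book.txt', 'r').read().split()
kept = [word for word in words if word not in stop_words]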
It's worth testing to find out; but I frequently use comprehensions to filter my lists, and I find that works well; particularly if the code is experimental and subject to change.
l = list(open('book.txt', 'r').read().split())
unique_l = list(set(l))
# maybe something else:
good_l = [word for word in l if word not in naughty_words]
I have heard that this helps with efficiency; but as I said, a test tells more.
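If you do want to time it, a rough timeit sketch along these lines works; the word lists here are invented purely for illustration, and the exact numbers will vary by machine and Python version:

import timeit

setup = """
import random, string
random.seed(0)
words = [''.join(random.choice(string.ascii_lowercase) for _ in range(5)) for _ in range(10000)]
naughty_list = words[:1000]
naughty_set = set(naughty_list)
"""
# same comprehension, filtering against a list vs. against a set
print(timeit.timeit('[w for w in words if w not in naughty_list]', setup=setup, number=10))
print(timeit.timeit('[w for w in words if w not in naughty_set]', setup=setup, number=10))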
The algorithm with word in list is an expensive operation. Why? Because, to see if an item is in the list, you have to check every item in the list. Every time. It's a Shlemiel the painter algorithm. Every lookup is O(n), and you do it n times. There's no startup cost, but it gets expensive very quickly. And you end up looking at each item way more than one time - on average, len(list)/2 times.
Looking to see if things are in a set is (usually) MUCH cheaper. Items are hashed, so you calculate the hash, look there, and if it's not there, it's not in the set: O(1). You do have to create the set the first time, so you'll look at every item once. Then you look at each item one more time to see if it's already in your set. Still O(n) overall.
So, doing list(set(mylist)) is definitely preferable to your first solution.
@NPE's answer doesn't close the file explicitly. It's better to use a context manager:
with open('book.txt', 'r') as fin:
    words = set(fin.read().split())
For normal text files this is probably adequate. If it's an entire DNA sequence, for example, you probably don't want to read the entire file into memory at once.
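A sketch that combines both points, reading one line at a time inside the context manager so the whole file never has to sit in memory:

words = set()
with open('book.txt', 'r') as fin:
    for line in fin:
        words.update(line.split())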
Related
I have an input of about 2-5 million strings of about 400 characters each, coming from a stored text file.
I need to check for duplicates before adding each one to the list I check against (it doesn't have to be a list; it can be any other data type; it's technically a set since all items are unique).
I can expect at most about 0.01% of my data to be non-unique, and I need to filter the duplicates out.
I'm wondering if there is any faster way for me to check if the item exists in the list rather than:
a = []
for item in data:
    if item not in a:
        a.append(item)
I do not want to lose the order.
Would hashing be faster (I don't need encryption)? But then I'd have to maintain a hash table for all the values to check first.
Is there any way I'm missing?
I'm on Python 2 and can go up to Python 3.5 at most.
It's hard to answer this question because it keeps changing ;-) The version I'm answering asks whether there's a faster way than:
a = []
for item in data:
    if item not in a:
        a.append(item)
That will be horridly slow, taking time quadratic in len(data). In any version of Python the following will take expected-case time linear in len(data):
seen = set()
for item in data:
    if item not in seen:
        seen.add(item)
        emit(item)
where emit() does whatever you like (append to a list, write to a file, whatever).
In comments I already noted ways to achieve the same thing with ordered dictionaries (whether ordered by language guarantee in Python 3.7, or via the OrderedDict type from the collections module). The code just above is the most memory-efficient, though.
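For reference, a sketch of those ordered-dictionary variants (data is the iterable from the question; dict.fromkeys relies on the insertion-order guarantee of Python 3.7+, while OrderedDict works on older versions):

from collections import OrderedDict

# Python 3.7+: plain dicts preserve insertion order
unique_in_order = list(dict.fromkeys(data))

# Older versions, including the 2.x-to-3.5 range mentioned in the question
unique_in_order = list(OrderedDict.fromkeys(data))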
You can try this:
a = list(set(data))
A list is an ordered sequence of elements, whereas a set is an unordered collection of distinct elements.
I have some Python code that is taking too much time (it actually never completes):
imp_pos_words = ' '.join([i for i in pos_word_ls if i not in unimp_words])
'unimp_words' is a list of 99,000 alphabetic words
'pos_word_ls' is a list of 1,540,000 alphabetic words
I want to omit all the words that are in 'unimp_words' from 'pos_word_ls'.
PS: 'pos_word_ls' has duplicate words, so I can't cast it to a set and take the set difference.
Please help :)
When you check if i not in unimp_words, you traverse the entire list to find whether i is in it or not, which takes O(n) time, where n is the length of the list. Since you're doing this 1,540,000 times, it'll be incredibly slow.
Instead, you can use a set, which will be much faster. When you check whether an item is in a set, a hash function is used to find out where i would be stored, and this takes O(1) time.
To convert your list unimp_words to a set, you can use unimp_words = set(unimp_words). Now when you check if i not in unimp_words it should be much faster.
Use a set for just the unimp_words. The i not in lookup will be much faster.
unimp_words = set(unimp_words)
imp_pos_words = ' '.join([i for i in pos_word_ls if i not in unimp_words])
If it is a list, i not in unimp_words will have to traverse the whole list every time it checks a word. A set's hashed lookup is much faster, and your list comprehension will be about 99,000 times faster.
Let's say I have a big list:
word_list = [elt.strip() for elt in open("bible_words.txt", "r").readlines()]
# complexity O(n) --> proportional to the list length "n"
I have learned that the hash function used for building dictionaries allows lookups to be much faster, like so:
word_dict = dict((elt, 1) for elt in word_list)
# lookup complexity O(1) --> constant
Starting from word_list, what is the recommended, most efficient way to reduce the complexity of my code?
The code from the question does just one thing: fills all words from a file into a list. The complexity of that is O(n).
Filling the same words into any other type of container will still have at least O(n) complexity, because it has to read all of the words from the file and it has to put all of the words into the container.
What is different with a dict?
Finding out whether something is in a list has O(n) complexity, because the algorithm has to go through the list item by item and check whether it is the sought item. The item can be found at position 0, which is fast, or it could be the last item (or not in the list at all), which makes it O(n).
In a dict, data is organized in "buckets". When a key:value pair is saved to a dict, the hash of the key is calculated and that number is used to identify the bucket into which the data is stored. Later on, when the key is looked up, hash(key) is calculated again to identify the bucket, and then only that bucket is searched. There is typically only one key:value pair per bucket, so the search can be done in O(1).
For more details, see the article about DictionaryKeys on python.org.
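As a purely conceptual sketch of the bucket idea (CPython's real dict uses open addressing, perturbation and resizing, so this is the intuition, not the actual algorithm):

key = 'genesis'
num_buckets = 8                      # pretend the table has 8 slots
bucket = hash(key) % num_buckets     # the same key always maps to the same bucket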
How about a set?
A set is something like a dictionary with only keys and no values. The question contains this code:
word_dict = dict((elt, 1) for elt in word_list)
That is obviously a dictionary which does not need values, so a set would be more appropriate.
BTW, there is no need to first create word_list as a list and then convert it to a set or dict. The first step can be skipped:
set_of_words = {elt.strip() for elt in open("bible_words.txt", "r").readlines()}
Are there any drawbacks?
Always ;)
A set does not have duplicates. So counting how many times a word is in the set will never return 2. If that is needed, don't use a set.
A set is not ordered. There is no way to check which was the first word in the set. If that is needed, don't use a set.
Objects saved to sets have to be hashable, which kind-of implies that they are immutable. If it was possible to modify the object, then its hash would change, so it would be in the wrong bucket and searching for it would fail. Anyway, str, int, float, and tuple objects are immutable, so at least those can go into sets.
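For example, a tuple goes into a set without complaint, while the equivalent list is rejected:

words = set()
words.add(('in', 'the', 'beginning'))      # tuple: hashable, fine
try:
    words.add(['in', 'the', 'beginning'])  # list: mutable, not hashable
except TypeError as exc:
    print(exc)                             # unhashable type: 'list'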
Writing to a set is probably going to be a bit slower than writing to a list. Still O(n), but a slower O(n), because it has to calculate hashes and organize into buckets, whereas a list just dumps one item after another. See timings below.
Reading everything from a set is also going to be a bit slower than reading everything from a list.
All of these apply to dict as well as to set.
Some examples with timings
Writing to list vs. set:
>>> timeit.timeit('[n for n in range(1000000)]', number=10)
0.7802875302271843
>>> timeit.timeit('{n for n in range(1000000)}', number=10)
1.025623542189976
Reading from list vs. set:
>>> timeit.timeit('989234 in values', setup='values=[n for n in range(1000000)]', number=10)
0.19846207875508526
>>> timeit.timeit('989234 in values', setup='values={n for n in range(1000000)}', number=10)
3.5699193290383846e-06
So, writing to a set seems to be about 30% slower, but finding an item in the set is thousands of times faster when there are thousands of items.
I have to check the presence of millions of elements (strings of 20-30 letters) against a collection containing 10-100k of those elements. Is there a faster way of doing that in Python than set()?
import sys

# load ids
ids = set(x.strip() for x in open(idfile))

for line in sys.stdin:
    id = line.strip()
    if id in ids:
        # print fastq
        print id
        # update ids
        ids.remove(id)
set is as fast as it gets.
However, if you rewrite your code to create the set once, and not change it, you can use the frozenset built-in type. It's exactly the same except immutable.
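A sketch of that, reusing idfile from the question; note it only applies if you drop the ids.remove() step, since a frozenset cannot be changed after creation:

import sys

ids = frozenset(x.strip() for x in open(idfile))
for line in sys.stdin:
    id = line.strip()
    if id in ids:
        print(id)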
If you're still having speed problems, you need to speed your program up in other ways, such as by using PyPy instead of CPython.
As I noted in my comment, what's probably slowing you down is that you're sequentially checking each line from sys.stdin for membership of your 'master' set. This is going to be really, really slow, and doesn't allow you to make use of the speed of set operations. As an example:
#!/usr/bin/env python
import random
# create two million-element sets of random numbers
a = set(random.sample(xrange(10000000),1000000))
b = set(random.sample(xrange(10000000),1000000))
# a intersection b
c = a & b
# a difference c
d = list(a - c)
print "set d is all remaining elements in a not common to a intersection b"
print "length of d is %s" % len(d)
The above runs in ~6 wallclock seconds on my five-year-old machine, and it's testing for membership in larger sets than you require (unless I've misunderstood you). Most of that time is actually taken up creating the sets, so you won't even have that overhead. The fact that the strings you refer to are long isn't relevant here; creating a set creates a hash table, as agf explained. I suspect (though again, it's not clear from your question) that if you can get all your input data into a set before you do any membership testing, it'll be a lot faster than reading it in one item at a time and then checking each for set membership.
You should try to split your data to make the search faster. A tree structure would allow you to find very quickly whether an item is present or not.
For example, start with a simple map that links the first letter to all the keys starting with that letter; that way you don't have to search all the keys, only a smaller part of them.
This would look like:
ids = {}
for line in open(idfile):
    id = line.strip()                  # strip the newline, as in the lookup loop
    ids.setdefault(id[0], set()).add(id)

for line in sys.stdin:
    id = line.strip()
    if id in ids.get(id[0], set()):
        # print fastq
        print id
        # update ids
        ids[id[0]].remove(id)
Creation will be a bit slower, but the search should be much faster (I would expect around 20 times faster, if the first character of your keys is well distributed and not always the same).
This is just a first step: you could do the same thing with the second character and so on; the search would then just be walking the tree with each letter...
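A sketch of that second step, bucketing on the first two characters instead of one (idfile is from the question, and seen() is just a hypothetical helper name):

ids = {}
for line in open(idfile):
    id = line.strip()
    ids.setdefault(id[:2], set()).add(id)

def seen(candidate):
    return candidate in ids.get(candidate[:2], set())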
As mentioned by urschrei, you should "vectorize" the check.
It is faster to check for the presence of a million elements once (as that is done in C) than to do the check for one element a million times.
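In terms of the code from the question, that could look roughly like this; it assumes all of stdin fits in memory and that you don't care about the original stdin order:

import sys

ids = set(x.strip() for x in open(idfile))        # idfile as in the question
queries = set(line.strip() for line in sys.stdin)
for id in ids & queries:                          # one C-level intersection
    print(id)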
I have a list called L inside a loop that must iterate through millions of lines. The salient features are:
for line in lines:
    L = ['a', 'list', 'with', 'lots', 'of', 'items']
    L[3] = 'prefix_text_to_item3' + L[3]
    # do more stuff with L...
Is there a better approach to adding text to a list item that would speed up my code? Can .join be used? Thanks.
In performance-oriented code, it is not a good idea to add two strings together; it is preferable to use "".join(items_to_join) instead. (I found some benchmarks here: http://www.skymind.com/~ocrow/python_string/)
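A rough illustration of the difference (the numbers will vary by machine and Python version, and CPython sometimes optimises += on strings, so treat this as a sketch):

import timeit

pieces = ['word'] * 10000

def with_concat():
    s = ''
    for p in pieces:
        s += ' ' + p
    return s

def with_join():
    return ' '.join(pieces)

print(timeit.timeit(with_concat, number=100))
print(timeit.timeit(with_join, number=100))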
Accessing an element of a Python list is O(1), and replacing it is also O(1); the string concatenation itself is linear in the length of the strings involved, which you can't avoid here. So the code you have provided is running about as fast as it can, as far as I can tell. :) You probably can't afford to do this, but when I need to process that much information I go to C++ or some other compiled language; things run much quicker. For the time complexity of list operations in Python, you may consult this page: http://wiki.python.org/moin/TimeComplexity and also: What is the runtime complexity of python list functions?
Don't actually create list objects.
Use generator functions and generator expressions.
def appender(some_list, some_text):
    for item in some_list:
        yield item + some_text
This appender function does not actually create a new list. It avoids some of the memory management overheads associated with creating a new list.
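A possible way to consume it (the list and the suffix here are made up for illustration):

L = ['a', 'list', 'with', 'lots', 'of', 'items']
for new_item in appender(L, '_suffix'):
    print(new_item)

# or, if you really do need a list at the end:
suffixed = list(appender(L, '_suffix'))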
There may be a better approach depending on what you are doing with list L.
For instance, if you are printing it, something like this may be faster.
print "{0} {1} {2} {3}{4} {5}".format(L[0], L[1], L[2], 'prefix_text_to_item3', L[3], L[4])
What happens to L later in the program?