Fastest way to extract matching strings - python

I want to search for words that match a given word in a list (example below). However, say the list contains millions of words. What is the most efficient way to perform this search? I was thinking of tokenizing each list and putting the words in a hash table, then performing the word search / match and retrieving the list of words that contain this word. From what I can see, this operation will take O(n) operations. Is there any other way, maybe without using hash tables?
words_list = ['yek', 'lion', 'opt']
# e.g. if we were to search or match the word "key" against the words in the list, we should get the word "yek", or a list of words if there are many that match
Also, is there a python library or third party package that can perform efficient searches?

It's not entirely clear what you mean by "match" here, but if you can reduce that to an equality comparison, the problem reduces to a set lookup, which takes O(1) average time.
For example, if "match" means "has exactly the same set of characters":
words_set = {frozenset(word) for word in words_list}
Then, to look up a word:
frozenset(word) in words_set
Or, if it means "has exactly the same multiset of characters" (i.e., counting duplicates but ignoring order):
words_set = {tuple(sorted(word)) for word in words_list}
tuple(sorted(word)) in words_set
(The sorted characters are wrapped in a tuple because a list is not hashable.)
… or, if you prefer:
import collections
words_set = {frozenset(collections.Counter(word).items()) for word in words_list}
frozenset(collections.Counter(word).items()) in words_set
(A Counter itself isn't hashable, so it's frozen into a set of (character, count) pairs here.)
Either way, the key (no pun intended… but maybe it should have been) idea here is to come up with a transformation that turns your values (strings) into values that are identical iff they match (a set of characters, a multiset of characters, a tuple of sorted characters, etc.). Then, the whole point of a set is that it can look for a value that's equal to your value in constant time.
Of course transforming the list takes O(N) time (unless you just build the transformed set in the first place, instead of building the list and then converting it), but you can use it over and over, and it takes O(1) time each time instead of O(N), which is what it sounds like you care about.
If you need to get back the matching word rather than just know that there is one, you can still do this with a set, but it's easier (if you can afford to waste a bit of space) with a dict:
words_dict = {frozenset(word): word for word in words_list}
words_dict[frozenset(word)] # KeyError if no match
If there could be multiple matches, just change the dict to a multidict:
words_dict = collections.defaultdict(set)
for word in words_list:
    words_dict[frozenset(word)].add(word)

words_dict[frozenset(word)] # empty set if no match
Or, if you explicitly want it to be a list rather than a set:
words_dict = collections.defaultdict(list)
for word in words_list:
    words_dict[frozenset(word)].append(word)

words_dict[frozenset(word)] # empty list if no match
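For instance, putting the pieces together with the sample list from the question (a quick sketch using the set-of-characters key):
import collections

words_list = ['yek', 'lion', 'opt']
words_dict = collections.defaultdict(set)
for word in words_list:
    words_dict[frozenset(word)].add(word)

print(words_dict[frozenset('key')])   # {'yek'}
print(words_dict[frozenset('cat')])   # set() -- no match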
If you want to do it without using hash tables (why?), you can use a search tree or other logarithmic data structure:
import blist # pip install blist to get it
words_dict = blist.sorteddict()
for word in words_list:
    key = ''.join(sorted(word))   # sorted-character keys are totally ordered, unlike frozensets
    words_dict.setdefault(key, set()).add(word)

words_dict[''.join(sorted(word))] # KeyError if no match
This looks almost identical, except for the fact that it's not quite trivial to wrap defaultdict around a blist.sorteddict—but that just takes a few lines of code. (And maybe you actually want a KeyError rather than an empty set, so I figured it was worth showing both defaultdict and normal dict with setdefault somewhere, so you can choose.)
But under the covers, it's using a hybrid B-tree variant instead of a hash table. Although this is O(log N) time instead of O(1), in some cases it's actually faster than a dict.


Most efficient way to get first value that startswith of large list

I have a very large list with over 100M strings. An example of that list looks as follows:
l = ['1,1,5.8067',
'1,2,4.9700',
'2,2,3.9623',
'2,3,1.9438',
'2,7,1.0645',
'3,3,8.9331',
'3,5,2.6772',
'3,7,3.8107',
'3,9,7.1008']
I would like to get the first string that starts with e.g. '3'.
To do so, I have used a lambda iterator followed by next() to get the first item:
next(filter(lambda i: i.startswith('3,'), l))
Out[1]: '3,3,8.9331'
Considering the size of the list, this strategy unfortunately still takes a relatively long time for a process I have to do over and over again. I was wondering if someone could come up with an even faster, more efficient approach. I am open to alternative strategies.
I have no way of testing it myself, but it is possible that if you join all the strings with a character that does not appear in any of them:
concat_list = '$'.join(l)
and then use a simple concat_list.find('$3,'), it would be faster. This could help if the strings are relatively short, since the whole text is then in one contiguous block of memory. (A match at the very start of the list needs special handling, e.g. by prepending a '$' as well.)
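A minimal sketch of that idea (assuming '$' never appears in the data; the variable names are just illustrative):
concat_list = '$' + '$'.join(l)              # leading '$' so a match at index 0 is also found
pos = concat_list.find('$3,')
if pos != -1:
    end = concat_list.find('$', pos + 1)     # the separator after the matching string
    first_match = concat_list[pos + 1:end if end != -1 else None]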
If the number of unique letters in the text is small, you can use the Abrahamson-Kosaraju method and get a time complexity of practically O(n).
Another approach is to use joblib: create n workers where the i-th worker checks items i, i + n, i + 2n, and so on; when one finds the pattern, it stops the others. So the running time is roughly that of the naive algorithm divided by n.
Since your actual strings consist of relatively short tokens (such as 301) after splitting the strings by tabs, you can build a dict with each possible prefix of the first token as a key, so that subsequent lookups take only O(1) average time.
Build the dict with the values of the list in reverse order, so that the first value in the list that starts with each distinct prefix is the one retained in the final dict:
d = {s[:i + 1]: s for s in reversed(l) for i in range(len(s.split('\t')[0]))}
so that given:
l = ['301\t301\t51.806763\n', '301\t302\t46.970094\n',
'301\t303\t39.962393\n', '301\t304\t18.943836\n',
'301\t305\t11.064584\n', '301\t306\t4.751911\n']
d['3'] will return '301\t301\t51.806763\n'.
If you only need to test each of the first tokens as a whole, rather than prefixes, you can simply make the first tokens as the keys instead:
d = {s.split('\t')[0]: s for s in reversed(l)}
so that d['301'] will return '301\t301\t51.806763\n'.

How can I get the value of multiple elements while searching a dict efficiently? (python)

How can I search any given txt file for anagrams and display the anagrams for every word in that file?
So far I can read the file, extract every single word and alphabetically sort every single word. I've tried making two dicts one dict containing the actual words in the text file as keys and the alphabetically sorted version of the words as values, and another dict of the dictionary file I have that is set up the same way.
Using both these dictionaries I've been unable to find an efficient way to get the following output for every word in the input list:
'eerst': steer reste trees
If I try to loop through all the words in the given list and, inside each loop, loop over the dictionary, looking for and recording the anagrams, it takes too much time and is very inefficient. If I try the following:
for x in input_list:
    if x in dictionary:
        print dictionary[x]
I only get the first anagram of every word and nothing else.
If that made any sense, any suggestions would be immensely helpful.
I'm not sure if what I'm thinking of is what you're currently doing in your code, but I can't think of anything better:
from collections import defaultdict
words = 'dog god steer reste trees dog fred steer'.split() # or words from a file
unique_words = set(words)
anagram_dict = defaultdict(list)
for word in unique_words:
    key = "".join(sorted(word))
    anagram_dict[key].append(word)

for anagram_list in anagram_dict.values():
    if len(anagram_list) > 1:
        print(*anagram_list)
This will print (in arbitrary order):
god dog
steer trees reste
If you wanted to get the dictionary key value, you could make the final loop be over the items rather than the values of anagram_dict (and you could print out words that don't have any anagrams like 'fred' in the example above, if you wanted). Note that thanks to the set, duplicates of words are not sorted multiple times.
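For instance, a small variation of the final loop, iterating over the items so the sorted-letters key is shown next to its group (a sketch matching the output format you described):
for key, anagram_list in anagram_dict.items():
    if len(anagram_list) > 1:
        print("'{}':".format(key), *anagram_list)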
Running time should be O(M + U*N*log(N)) where M is the number of words, U is the number of unique ones and N is their average length. Unless you're sorting an organic chemistry textbook or something else that has lots of long words, it should be pretty close to linear in the length of the input.
Here is another way to get anagrams using itertools.groupby
from itertools import groupby
words = list_of_words
for k, g in groupby(sorted(words, key=sorted), key=sorted):
    g = list(g)
    if len(g) > 1:
        print(g)
The big-O complexity isn't quite as good as the usual dictionary-of-lists approach, but it's still fairly efficient, and it sounds funny when you read it out loud.

Python: how to check if an item is in a list efficiently?

I have a list of strings (word-like), and, while I am parsing a text, I need to check if a word belongs to the group of words in my current list.
However, my input is pretty big (about 600 million lines), and checking whether an element belongs to a list is an O(n) operation according to the Python documentation.
My code is something like:
words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)
As it takes too much time (days, actually), I wanted to improve the part that is taking most of the time. I had a look at Python collections and, more precisely, at deque. However, it only gives O(1) access to the head and the tail of a list, not the middle.
Does someone have an idea of how to do this in a better way?
You might consider a trie or a DAWG or a database. There are several Python implementations of the same.
Here are some relative timings for you to consider, of a set vs a list:
import timeit
import random

with open('/usr/share/dict/words','r') as di: # UNIX 250k unique word list
    all_words_set = {line.strip() for line in di}

all_words_list = list(all_words_set) # slightly faster if this list is sorted...

test_list = [random.choice(all_words_list) for i in range(10000)]
test_set = set(test_list)

def set_f():
    count = 0
    for word in test_set:
        if word in all_words_set:
            count += 1
    return count

def list_f():
    count = 0
    for word in test_list:
        if word in all_words_list:
            count += 1
    return count

def mix_f():
    # use list for source, set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set:
            count += 1
    return count

print "list:", timeit.Timer(list_f).timeit(1), "secs"
print "set:", timeit.Timer(set_f).timeit(1), "secs"
print "mixed:", timeit.Timer(mix_f).timeit(1), "secs"
Prints:
list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs
i.e., matching a set of 10,000 words against a set of 250,000 words is 17,085 times faster than matching a list of the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392 times faster than an unsorted list alone.
For membership testing, a list is O(n) and sets and dicts are O(1) for lookups.
Conclusion: Use better data structures for 600 million lines of text!
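Applied to the code in the question, the minimal change is to build the lookup collection as a set once and keep the rest as it is (a sketch; my_list and line are the names from the question):
my_set = set(my_list)   # built once: O(len(my_list))
words_in_line = [word for word in line if word in my_set]   # each membership test is O(1) on average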
I'm not clear on why you chose a list in the first place, but here are some alternatives:
Using a set() is likely a good idea. This is very fast, though unordered, but sometimes that's exactly what's needed.
If you need things ordered and to have arbitrary lookups as well, you could use a tree of some sort:
http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/
If set membership testing with a small number of false positives here or there is acceptable, you might check into a bloom filter:
http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
Depending on what you're doing, a trie might also be very good.
This uses a list comprehension:
words_in_line = [word for word in line if word in my_list]
which would be more efficient than the code you posted, though how much more for your huge data set is hard to know.
There are two improvements you can make here.
Back your word list with a hash table. This will give you O(1) performance when checking whether a word is present in your word list. There are a number of ways to do this; the most fitting in this scenario is to convert your list to a set.
Use a more appropriate structure for your matching-word collection.
If you need to store all of the matches in memory at the same time, use a deque, since its append performance is superior to that of lists.
If you don't need all the matches in memory at once, consider using a generator. A generator is used to iterate over matched values according to the logic you specify, but it only stores part of the resulting list in memory at a time. It may offer improved performance if you are experiencing I/O bottlenecks.
Below is an example implementation based on my suggestions (opting for a generator, since I can't imagine you need all those words in memory at once).
from itertools import chain
d = set(['a','b','c']) # Load our dictionary
f = open('c:\\input.txt','r')
# Build a generator to get the words in the file
all_words_generator = chain.from_iterable(line.split() for line in f)
# Build a generator to filter out the non-dictionary words
matching_words_generator = (word for word in all_words_generator if word in d)
for matched_word in matching_words_generator:
    # Do something with matched_word
    print matched_word
# We're reading the file during the above loop, so don't close it too early
f.close()
input.txt
a b dog cat
c dog poop
maybe b cat
dog
Output
a
b
c
b

Fastest way in Python to find a 'startswith' substring in a long sorted list of strings

I've done a lot of Googling, but haven't found anything, so I'm really sorry if I'm just searching for the wrong things.
I am writing an implementation of the game Ghost for MIT Introduction to Programming, assignment 5.
As part of this, I need to determine whether a string of characters is the start of any valid word. I have a list of valid words ("wordlist").
Update: I could use something that iterated through the list each time, such as Peter's simple suggestion:
def word_exists(wordlist, word_fragment):
    return any(w.startswith(word_fragment) for w in wordlist)
I previously had:
wordlist = [w for w in wordlist if w.startswith(word_fragment)]
(from here) to narrow the list down to the list of valid words that start with that fragment and consider it a loss if wordlist is empty. The reason that I took this approach was that I (incorrectly, see below) thought that this would save time, as subsequent lookups would only have to search a smaller list.
It occurred to me that this is going through each item in the original wordlist (38,000-odd words) checking the start of each. This seems silly when wordlist is ordered, and the comprehension could stop once it hits something that is after the word fragment. I tried this:
newlist = []
for w in wordlist:
    if w[:len(word_fragment)] > word_fragment:
        # Take advantage of the fact that the list is sorted
        break
    if w.startswith(word_fragment):
        newlist.append(w)
return newlist
but that is about the same speed, which I thought may be because list comprehensions run as compiled code?
I then thought that more efficient again would be some form of binary search in the list to find the block of matching words. Is this the way to go, or am I missing something really obvious?
Clearly it isn't really a big deal in this case, but I'm just starting out with programming and want to do things properly.
UPDATE:
I have since tested the below suggestions with a simple test script. While Peter's binary search/bisect would clearly be better for a single run, I was interested in whether the narrowing list would win over a series of fragments. In fact, it did not:
The totals for all strings "p", "py", "pyt", "pyth", "pytho" are as follows:
In total, Peter's simple test took 0.175472736359
In total, Peter's bisect left test took 9.36985015869e-05
In total, the list comprehension took 0.0499348640442
In total, Neil G's bisect took 0.000373601913452
The overhead of creating a second list etc clearly took more time than searching the longer list. In hindsight, this was likely the best approach regardless, as the "reducing list" approach increased the time for the first run, which was the worst case scenario.
Thanks all for some excellent suggestions, and well done Peter for the best answer!!!
Generator expressions are evaluated lazily, so if you only need to determine whether or not your word is valid, I would expect the following to be more efficient since it doesn't necessarily force it to build the full list once it finds a match:
def word_exists(wordlist, word_fragment):
    return any(w.startswith(word_fragment) for w in wordlist)
Note that the lack of square brackets is important for this to work.
However this is obviously still linear in the worst case. You're correct that binary search would be more efficient; you can use the built-in bisect module for that. It might look something like this:
from bisect import bisect_left

def word_exists(wordlist, word_fragment):
    try:
        return wordlist[bisect_left(wordlist, word_fragment)].startswith(word_fragment)
    except IndexError:
        return False # word_fragment is greater than all entries in wordlist
bisect_left runs in O(log(n)) so is going to be considerably faster for a large wordlist.
Edit: I would guess that the example you gave loses out if your word_fragment is something really common (like 't'), in which case it probably spends most of its time assembling a large list of valid words, and the gain from only having to do a partial scan of the list is negligible. Hard to say for sure, but it's a little academic since binary search is better anyway.
You're right that you can do this more efficiently given that the list is sorted.
I'm building off of Peter's answer, which returns a single element. I see that you want all the words that start with a given prefix. Here's how you do that:
from bisect import bisect_left
wordlist[bisect_left(wordlist, word_fragment):
         bisect_left(wordlist, word_fragment[:-1] + chr(ord(word_fragment[-1]) + 1))]
This returns the slice from your original sorted list.
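For example, a quick sketch with a made-up sorted word list:
from bisect import bisect_left

wordlist = ['apple', 'apt', 'aptly', 'bat', 'batch', 'bath', 'cat']
word_fragment = 'bat'

lo = bisect_left(wordlist, word_fragment)
hi = bisect_left(wordlist, word_fragment[:-1] + chr(ord(word_fragment[-1]) + 1))
print(wordlist[lo:hi])   # ['bat', 'batch', 'bath']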
As Peter suggested, I would use the bisect module, especially if you're reading from a large file of words.
If you really need speed, you could make a daemon ( How do you create a daemon in Python? ) that has a pre-processed data structure suited for the task.
I suggest you could use "tries"
http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=usingTries
There are many algorithms and data structures to index and search strings inside a text; some of them are included in the standard libraries, but not all of them. The trie data structure is a good example of one that isn't.
Let word be a single string and let dictionary be a large set of words. If we have a dictionary, and we need to know if a single word is inside the dictionary, a trie is a data structure that can help us. But you may be asking yourself, "Why use tries if set and hash tables can do the same?" There are two main reasons:
Tries can insert and find strings in O(L) time (where L represents the length of a single word). This is much faster than a set, and a bit faster than a hash table.
A set or a hash table can only find words in a dictionary that match exactly the single word that we are looking for; a trie allows us to find words that have a single character different, a prefix in common, a character missing, etc.
Tries can be useful in TopCoder problems, but they also have a great number of applications in software engineering. For example, consider a web browser. Do you know how the web browser can autocomplete your text or show you many possibilities of the text that you could be writing? Yes, with a trie you can do it very fast. Do you know how an orthographic corrector can check that every word that you type is in a dictionary? Again, a trie. You can also use a trie for suggested corrections of the words that are present in the text but not in the dictionary.
An example would be:
start = {'a': nodea, 'b': nodeb, 'c': nodec, ...}
nodea = {'a': nodeaa, 'b': nodeab, 'c': nodeac, ...}
nodeb = {'a': nodeba, 'b': nodebb, 'c': nodebc, ...}
etc.
Then if you want all the words starting with "ab" you would just traverse start['a']['b'], and that would be all the words you want.
To build it, you could iterate through your word list and, for each word, iterate through the characters, adding a new default dict where required.
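A minimal sketch of that nested-dict idea, using collections.defaultdict so missing nodes are created on demand (the sample words and the '$' end-of-word marker are just illustrative):
from collections import defaultdict

def make_trie():
    return defaultdict(make_trie)

trie = make_trie()
for word in ['able', 'abler', 'about', 'bake']:
    node = trie
    for ch in word:
        node = node[ch]          # descend, creating nodes as needed
    node['$'] = word             # '$' marks the end of a complete word

def words_with_prefix(trie, prefix):
    node = trie
    for ch in prefix:
        if ch not in node:       # no word starts with this prefix
            return
        node = node[ch]
    stack = [node]               # collect every complete word below this node
    while stack:
        node = stack.pop()
        for key, child in node.items():
            if key == '$':
                yield child
            else:
                stack.append(child)

print(sorted(words_with_prefix(trie, 'ab')))   # ['able', 'abler', 'about']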
In case of binary search (assuming wordlist is sorted), I'm thinking of something like this:
wordlist = "ab", "abc", "bc", "bcf", "bct", "cft", "k", "l", "m"
fragment = "bc"
a, m, b = 0, 0, len(wordlist)-1
iterations = 0
while True:
if (a + b) / 2 == m: break # endless loop = nothing found
m = (a + b) / 2
iterations += 1
if wordlist[m].startswith(fragment): break # found word
if wordlist[m] > fragment >= wordlist[a]: a, b = a, m
elif wordlist[b] >= fragment >= wordlist[m]: a, b = m, b
if wordlist[m].startswith(fragment):
print wordlist[m], iterations
else:
print "Not found", iterations
It will find one matching word, or none. You will then have to look to the left and right of it to find the other matching words. My algorithm might be incorrect; it's just a rough version of my thoughts.
Here's my fastest way to narrow the list wordlist down to a list of valid words starting with a given fragment:
sect() is a generator function that uses Peter's excellent idea of employing bisect, plus the islice() function:
from bisect import bisect_left
from itertools import islice
from time import clock

A, B = [], []
iterations = 5
repetition = 10

with open('words.txt') as f:
    wordlist = f.read().split()
wordlist.sort()
print 'wordlist[0:10]==', wordlist[0:10]

def sect(wordlist, word_fragment):
    lgth = len(word_fragment)
    for w in islice(wordlist, bisect_left(wordlist, word_fragment), None):
        if w[0:lgth] == word_fragment:
            yield w
        else:
            break

def hooloo(wordlist, word_fragment):
    usque = len(word_fragment)
    for w in wordlist:
        if w[:usque] > word_fragment:
            break
        if w.startswith(word_fragment):
            yield w

for rep in xrange(repetition):
    te = clock()
    for i in xrange(iterations):
        newlistA = list(sect(wordlist, 'VEST'))
    A.append(clock() - te)

    te = clock()
    for i in xrange(iterations):
        newlistB = list(hooloo(wordlist, 'VEST'))
    B.append(clock() - te)

print '\niterations =', iterations, ' number of tries:', repetition, '\n'
print newlistA, '\n', min(A), '\n'
print newlistB, '\n', min(B), '\n'
result
wordlist[0:10]== ['AA', 'AAH', 'AAHED', 'AAHING', 'AAHS', 'AAL', 'AALII', 'AALIIS', 'AALS', 'AARDVARK']
iterations = 5 number of tries: 30
['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES']
0.0286089433154
['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES']
0.415578236899
sect() is 14.5 times faster than hooloo().
PS: I know timeit exists, but here, for such a result, clock() is fully sufficient.
Doing binary search in the list is not going to guarantee you anything. I am not sure how that would work either.
You have a list which is ordered, which is good news. The algorithmic complexity of both your cases is O(n), which is not bad: you just have to iterate through the whole wordlist once.
But in the second case, the practical performance should be better, because you break as soon as you find that the remaining cases cannot apply. Try a list where the 1st element matches and the remaining 38000 - 1 elements do not match; you will see the second beat the first.

String Occurrence Counting Algorithm

I am curious what is the most efficient algorithm (or commonly used) to count the number of occurrences of a string in a chunk of text.
From what I read, the Boyer–Moore string search algorithm is the standard for string searches, but I am not sure if counting occurrences in an efficient way would be the same as searching for a string.
In Python this is what I want:
text_chunck = "one two three four one five six one"
occurance_count(text_chunck, "one") # gives 3.
EDIT: It seems like Python's str.count serves as such a method; however, I am not able to find what algorithm it uses.
For starters, yes, you can accomplish this with Boyer-Moore very efficiently. However, depending on some other parameters of your problem, there might be a better solution.
The Aho-Corasick string matching algorithm will find all occurrences of a set of pattern strings in a target string and does so in time O(m + n + z), where m is the length of the string to search, n is the combined length of all the patterns to match, and z is the total number of matches produced. This is linear in the size of the source and target strings if you just have one string to match. It also will find overlapping occurrences of the same string. Moreover, if you want to check how many times a set of strings appears in some source string, you only need to make one call to the algorithm. On top of this, if the set of strings that you want to search for never changes, you can do the O(n) work as preprocessing time and then find all matches in O(m + z).
If, on the other hand, you have one source string and a rapidly-changing set of substrings to search for, you may want to use a suffix tree. With O(m) preprocessing time on the string that you will be searching in, you can, in O(n) time per substring, check how many times a particular substring of length n appears in the string.
Finally, if you're looking for something you can code up easily and with minimal hassle, you might want to consider looking into the Rabin-Karp algorithm, which uses a rolling hash function to find strings. This can be coded up in roughly ten to fifteen lines of code, has no preprocessing time, and for normal text strings (lots of text with few matches) can find all matches very quickly.
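For illustration, a rough Rabin-Karp sketch for counting (possibly overlapping) occurrences might look like this; the base and modulus are arbitrary choices, and the explicit slice comparison guards against hash collisions:
def count_occurrences(text, pattern, base=256, mod=10**9 + 7):
    n, m = len(text), len(pattern)
    if m == 0 or n < m:
        return 0
    high = pow(base, m - 1, mod)                 # weight of the leading character in the window
    p_hash = t_hash = 0
    for i in range(m):                           # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    count = 0
    for i in range(n - m + 1):
        if t_hash == p_hash and text[i:i + m] == pattern:
            count += 1
        if i < n - m:                            # roll the window one character to the right
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return count

count_occurrences("one two three four one five six one", "one") # gives 3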
Hope this helps!
Boyer-Moore would be a good choice for counting occurrences, since it has some overhead that you would only need to do once. It does better the longer the pattern string is, so for "one" it would not be a good choice.
If you want to count overlaps, start the next search one character after the previous match. If you want to ignore overlaps, start the next search the full pattern string length after the previous match.
If your language has an indexOf or strpos method for finding one string in another, you can use that. If it proves too slow, then choose a better algorithm.
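As a sketch of that counting loop in Python (the function name is just illustrative), using str.find and choosing the step size to count or ignore overlaps:
def occurrence_count(text, pattern, overlapping=False):
    if not pattern:
        return 0
    step = 1 if overlapping else len(pattern)    # how far to jump after each match
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + step)
    return count

occurrence_count("aaaa", "aa")                   # 2 (non-overlapping)
occurrence_count("aaaa", "aa", overlapping=True) # 3 (overlapping)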
Hellnar,
You can use a simple dictionary to count character occurrences in a string. The algorithm is a counting algorithm; here is an example:
"""
The counting algorithm is used to count the occurences of a character
in a string. This allows you to compare anagrams and strings themselves.
ex. animal, lamina a=2,n=1,i=1,m=1
"""
def count_occurences(str):
    occurences = {}
    for char in str:
        if char in occurences:
            occurences[char] = occurences[char] + 1
        else:
            occurences[char] = 1
    return occurences

def is_matched(s1, s2):
    matched = True
    s1_count_table = count_occurences(s1)
    for char in s2:
        if char in s1_count_table and s1_count_table[char] > 0:
            s1_count_table[char] -= 1
        else:
            matched = False
            break
    return matched

#counting.is_matched("animal","laminar")
This example just returns True or False depending on whether the strings match. Keep in mind, this algorithm counts the number of times each character shows up in a string; this is good for anagrams.
