String Occurrence Counting Algorithm - python

I am curious what the most efficient (or most commonly used) algorithm is for counting the number of occurrences of a string in a chunk of text.
From what I read, the Boyer–Moore string search algorithm is the standard for string searches, but I am not sure whether counting occurrences efficiently would be the same as searching for a string.
In Python this is what I want:
text_chunk = "one two three four one five six one"
occurrence_count(text_chunk, "one") # gives 3
EDIT: It seems like Python's str.count serves as such a method; however, I am not able to find out what algorithm it uses.
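For reference, a quick check of str.count on the example above; note that str.count only counts non-overlapping occurrences:
text_chunk = "one two three four one five six one"
print(text_chunk.count("one"))   # 3
print("aaaa".count("aa"))        # 2, not 3: overlapping matches are not counted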

For starters, yes, you can accomplish this with Boyer-Moore very efficiently. However, depending on some other parameters of your problem, there might be a better solution.
The Aho-Corasick string matching algorithm will find all occurrences of a set of pattern strings in a target string and does so in time O(m + n + z), where m is the length of the string to search, n is the combined length of all the patterns to match, and z is the total number of matches produced. This is linear in the size of the source and target strings if you just have one string to match. It also will find overlapping occurrences of the same string. Moreover, if you want to check how many times a set of strings appears in some source string, you only need to make one call to the algorithm. On top of this, if the set of strings that you want to search for never changes, you can do the O(n) work as preprocessing time and then find all matches in O(m + z).
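To make that concrete, here is a minimal, illustrative sketch of Aho-Corasick in Python (the function names and the list-of-dicts automaton representation are my own, not from any particular library); it counts overlapping occurrences of every pattern in one pass over the text:
from collections import deque

def build_automaton(patterns):
    # trie over the patterns: goto[s] maps a character to the next state,
    # fail[s] is the failure link, out[s] lists indices of patterns ending at s
    goto, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    # breadth-first pass to fill in the failure links
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t].extend(out[fail[t]])
    return goto, fail, out

def count_occurrences(text, patterns):
    goto, fail, out = build_automaton(patterns)
    counts = {p: 0 for p in patterns}
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            counts[patterns[idx]] += 1
    return counts

count_occurrences("one two three four one five six one", ["one"])   # {'one': 3}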
If, on the other hand, you have one source string and a rapidly-changing set of substrings to search for, you may want to use a suffix tree. With O(m) preprocessing time on the string that you will be searching in, you can, in O(n) time per substring, check how many times a particular substring of length n appears in the string.
Finally, if you're looking for something you can code up easily and with minimal hassle, you might want to consider looking into the Rabin-Karp algorithm, which uses a rolling hash function to find strings. This can be coded up in roughly ten to fifteen lines of code, has no preprocessing time, and for normal text strings (lots of text with few matches) can find all matches very quickly.
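As a rough sketch of the rolling-hash idea (the base and modulus here are arbitrary choices, and this version counts overlapping occurrences):
def rabin_karp_count(text, pattern, base=256, mod=1000003):
    n, m = len(text), len(pattern)
    if m == 0 or n < m:
        return 0
    high = pow(base, m - 1, mod)   # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    count = 0
    for i in range(n - m + 1):
        # verify the actual substring on a hash hit to rule out collisions
        if t_hash == p_hash and text[i:i + m] == pattern:
            count += 1
        if i < n - m:
            # roll the hash: drop text[i], append text[i + m]
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return count

rabin_karp_count("one two three four one five six one", "one")   # 3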
Hope this helps!

Boyer-Moore would be a good choice for counting occurrences, since it has some overhead that you would only need to do once. It does better the longer the pattern string is, so for "one" it would not be a good choice.
If you want to count overlaps, start the next search one character after the previous match. If you want to ignore overlaps, start the next search the full pattern string length after the previous match.
If your language has an indexOf or strpos method for finding one string in another, you can use that. If it proves too slow, then choose a better algorithm.
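A minimal sketch of that find/indexOf loop, with a flag choosing between the two behaviours (the function name just mirrors the question):
def occurrence_count(text, pattern, allow_overlaps=False):
    # step one character past a match to allow overlaps,
    # or the full pattern length to ignore them
    step = 1 if allow_overlaps else len(pattern)
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + step)
    return count

occurrence_count("one two three four one five six one", "one")   # 3
occurrence_count("aaaa", "aa", allow_overlaps=True)              # 3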

Hellnar,
You can use a simple dictionary to count occurrences of characters in a string. The algorithm is a counting algorithm; here is an example:
"""
The counting algorithm is used to count the occurences of a character
in a string. This allows you to compare anagrams and strings themselves.
ex. animal, lamina a=2,n=1,i=1,m=1
"""
def count_occurences(str):
occurences = {}
for char in str:
if char in occurences:
occurences[char] = occurences[char] + 1
else:
occurences[char] = 1
return occurences
def is_matched(s1,s2):
matched = True
s1_count_table = count_occurences(s1)
for char in s2:
if char in s1_count_table and s1_count_table[char]>0:
s1_count_table[char] -= 1
else:
matched = False
break
return matched
#counting.is_matched("animal","laminar")
This example just returns True or False depending on whether the strings match. Keep in mind that this algorithm counts the number of times a character shows up in a string, which is good for anagrams.

Related

Fastest way to extract matching strings

I want to search for words that match a given word in a list (example below). However, say there is a list that contains millions of words. What is the most efficient way to perform this search? I was thinking of tokenizing each list and putting the words in a hash table, then performing the word search / match and retrieving the list of words that contain this word. From what I can see, this operation will take O(n) operations. Is there any other way, maybe without using hash tables?
words_list = ['yek', 'lion', 'opt']
# e.g. if we were to search or match the word "key" with the words in the list we should get the word "yek", or a list of words if there are many that match
Also, is there a python library or third party package that can perform efficient searches?
It's not entirely clear what you mean by "match" here, but if you can reduce that to an identity comparison, the problem reduces to a set lookup, which is O(1) time.
For example, if "match" means "has exactly the same set of characters":
words_set = {frozenset(word) for word in words_list}
Then, to look up a word:
frozenset(word) in words_set
Or, if it means "has exactly the same multiset of characters" (i.e., counting duplicates but ignoring order), note that sorted() returns a list, which isn't hashable, so wrap it in a tuple:
words_set = {tuple(sorted(word)) for word in words_list}
tuple(sorted(word)) in words_set
… or, if you prefer (a Counter itself isn't hashable either, so freeze its items):
words_set = {frozenset(collections.Counter(word).items()) for word in words_list}
frozenset(collections.Counter(word).items()) in words_set
Either way, the key (no pun intended… but maybe it should have been) idea here is to come up with a transformation that turns your values (strings) into values that are identical iff they match (a set of characters, a multiset of characters, an ordered list of sorted characters, etc.). Then, the whole point of a set is that it can look for a value that's equal to your value in constant time.
Of course transforming the list takes O(N) time (unless you just build the transformed set in the first place, instead of building the list and then converting it), but you can use it over and over, and it takes O(1) time each time instead of O(N), which is what it sounds like you care about.
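To make the build-once, query-many pattern concrete, here is a quick sketch using the sorted-tuple (multiset) variant:
words_list = ['yek', 'lion', 'opt']

# build once: O(N) over the word list
words_set = {tuple(sorted(word)) for word in words_list}

# query as often as you like: O(1) per lookup, plus the cost of sorting the query word
print(tuple(sorted('key')) in words_set)    # True, 'key' has the same letters as 'yek'
print(tuple(sorted('loin')) in words_set)   # True, matches 'lion'
print(tuple(sorted('cat')) in words_set)    # False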
If you need to get back the matching word rather than just know that there is one, you can still do this with a set, but it's easier (if you can afford to waste a bit of space) with a dict:
words_dict = {frozenset(word): word for word in words_list}
words_dict[frozenset(word)] # KeyError if no match
If there could be multiple matches, just change the dict to a multidict:
words_dict = collections.defaultdict(set)
for word in words_list:
    words_dict[frozenset(word)].add(word)
words_dict[frozenset(word)] # empty set if no match
Or, if you explicitly want it to be a list rather than a set:
words_dict = collections.defaultdict(list)
for word in words_list:
    words_dict[frozenset(word)].append(word)
words_dict[frozenset(word)] # empty list if no match
If you want to do it without using hash tables (why?), you can use a search tree or other logarithmic data structure:
import blist # pip install blist to get it
words_dict = blist.sorteddict()
for word in words_list:
    words_dict.setdefault(tuple(sorted(word)), set()).add(word)
words_dict[tuple(sorted(word))] # KeyError if no match
This looks almost identical, except for the fact that it's not quite trivial to wrap defaultdict around a blist.sorteddict—but that just takes a few lines of code. (And maybe you actually want a KeyError rather than an empty set, so I figured it was worth showing both defaultdict and normal dict with setdefault somewhere, so you can choose.)
But under the covers, it's using a hybrid B-tree variant instead of a hash table. Although this is O(log N) time instead of O(1), in some cases it's actually faster than a dict.

How to cut up a string in a specific way, retaining what is useful?

I'm trying to make a program for my biology research.
I need to take this sequence:
NNNNNNNNNNCCNNAGTGNGNACAGACGACGGGCCCTGGCCCCTCGCACACCCTGGACCA
AGTCAATCGCACCCACTTCCCTTTCTTCTCGGATGTCAAGGGCGACCACCGGTTGGTGTT
GAGCGTCGTGGAGACCACCGTTCTGGGGCTCATCTTTGTCGTCTCACTGCTGGGCAACGT
GTGTGCTCTAGTGCTGGTGGCGCGCCGTCGGCGCCGTGGGGCGACAGCCAGCCTGGTGCT
CAACCTCTTCTGCGCGGATTTGCTCTTCACCAGCGCCATCCCTCTAGTGCTCGTCGTGCG
CTGGACTGAGGCCTGGCTGTTGGGGCCCGTCGTCTGCCACCTGCTCTTCTACGTGATGAC
AATGAGCGGCAGCGTCACGATCCTCACACTGGCCGCGGTCAGCCTGGAGCGCATGGTGTG
CATCGTGCGCCTCCGGCGCGGCTTGAGCGGCCCGGGGCGGCGGACTCAGGCGGCACTGCT
GGCTTTCATATGGGGTTACTCGGCGCTCGCCGCGCTGCCCCTCTGCATCTTGTTCCGCGT
GGTCCCGCAGCGCCTTCCCGGCGGGGACCAGGAAATTCCGATTTGCACATTGGATTGGCC
CAACCGCATAGGAGAAATCTCATGGGATGTGTTTTTTGTGACTTTGAACTTCCTGGTGCC
GGGACTGGTCATTGTGATCAGTTACTCCAAAATTTTACAGATCACGAAAGCATCGCGGAA
GAGGCTTACGCTGAGCTTGGCATACTCTGAGAGCCACCAGATCCGAGTGTCCCAACAAGA
CTACCGACTCTTCCGCACGCTCTTCCTGCTCATGGTTTCCTTCTTCATCATGTGGAGTCC
CATCATCATCACCATCCTCNCATCTTGATCCAAAACTTCCGGCAGGACCTGGNCATCTGG
NCATCCCTTTTCTTCTGGGNNGTNNNNNCACGTTGCNACTCTNCCTAAANCCCATACTGT
ANNANATGNCGCTNNNAGGAANGAATGGAGGAANANTTTTTGNNNNNNNNN
...and remove everything up to and including the last N at the beginning, and everything from the first N at the end onward. In other words, to make it look something like this:
ACAGACGACGGGCCCTGGCCCCTCGCACACCCTGGACCA
AGTCAATCGCACCCACTTCCCTTTCTTCTCGGATGTCAAGGGCGACCACCGGTTGGTGTT
GAGCGTCGTGGAGACCACCGTTCTGGGGCTCATCTTTGTCGTCTCACTGCTGGGCAACGT
GTGTGCTCTAGTGCTGGTGGCGCGCCGTCGGCGCCGTGGGGCGACAGCCAGCCTGGTGCT
CAACCTCTTCTGCGCGGATTTGCTCTTCACCAGCGCCATCCCTCTAGTGCTCGTCGTGCG
CTGGACTGAGGCCTGGCTGTTGGGGCCCGTCGTCTGCCACCTGCTCTTCTACGTGATGAC
AATGAGCGGCAGCGTCACGATCCTCACACTGGCCGCGGTCAGCCTGGAGCGCATGGTGTG
CATCGTGCGCCTCCGGCGCGGCTTGAGCGGCCCGGGGCGGCGGACTCAGGCGGCACTGCT
GGCTTTCATATGGGGTTACTCGGCGCTCGCCGCGCTGCCCCTCTGCATCTTGTTCCGCGT
GGTCCCGCAGCGCCTTCCCGGCGGGGACCAGGAAATTCCGATTTGCACATTGGATTGGCC
CAACCGCATAGGAGAAATCTCATGGGATGTGTTTTTTGTGACTTTGAACTTCCTGGTGCC
GGGACTGGTCATTGTGATCAGTTACTCCAAAATTTTACAGATCACGAAAGCATCGCGGAA
GAGGCTTACGCTGAGCTTGGCATACTCTGAGAGCCACCAGATCCGAGTGTCCCAACAAGA
CTACCGACTCTTCCGCACGCTCTTCCTGCTCATGGTTTCCTTCTTCATCATGTGGAGTCC
CATCATCATCACCATCCTC
How would I do this?
I think you may be looking for the longest sequence of non-N characters in the input.
Otherwise, you have no rule to distinguish the last N in the prefix from the first N in the suffix. There is nothing at all different about the N you want to start after (before the ACAGAC…) and the next N (before the CATCCC), or, for that matter, the previous one (before the GN) except that it picks out the longest sequence. In fact, other than the 10 N's at the very start and the 9 at the very end, there doesn't seem to be anything special about any of the N's.
The easiest way to do that is to just grab all the sequences and keep the longest:
max(s.split('N'), key=len)
If you have some additional rule on top of this—e.g., the longest sequence whose length is divisible by three (which in this case is the same thing)—you can do the same basic thing:
max((seq for seq in s.split('N') if len(seq) % 3 == 0), key=len)
@abarnert's answer is correct, but str.split() returns a list of substrings, meaning the memory usage is O(N) (i.e. it can use a lot of memory). This isn't a problem when your input is short, but when processing DNA sequences your input is typically very long. To avoid the memory overhead, you need to use an iterator. I recommend re's finditer.
import re
_find_n_free_substrings = re.compile(r'[^N]+', re.MULTILINE).finditer

def longest_n_free_substring(string):
    substrings = (match.group(0) for match in _find_n_free_substrings(string))
    return max(substrings, key=len)
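For illustration, a quick run of the finditer-based version on a small made-up fragment:
raw = "NNNACGTNNAGGTTCCNNN"
print(longest_n_free_substring(raw))   # AGGTTCC, the longest stretch with no N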

How to create a Boggle Board from a list of Words? (reverse Boggle solver!)

I am trying to solve the reverse Boggle problem. Simply put, given a list of words, come up with a 4x4 grid of letters in which as many of the words in the list as possible can be found in sequences of adjacent letters (letters are adjacent both orthogonally and diagonally).
I DO NOT want to take a known board and solve it. That is an easy TRIE problem and has been discussed/solved to death here for people's CS projects.
Example word list:
margays, jaguars, cougars, tomcats, margay, jaguar, cougar, pumas, puma, toms
Solution:
ATJY
CTSA
OMGS
PUAR
This problem is HARD (for me). Algorithm I have so far:
For each word in the input, make a list of all possible ways it can legally appear on the board by itself.
Try all possible combinations of placing word #2 on those boards and keep the ones that have no conflicts.
Repeat till end of list.
...
Profit!!! (for those that read /.)
Obviously, there are implementation details. Start with the longest word first. Ignore words that are substrings of other words.
I can generate all 68k possible boards for a 7 character word in around 0.4 seconds. Then when I add an additional 7 character word, I need to compare 68k x 68k boards x 7 comparisons. Solve time becomes glacial.
There must be a better way to do this!!!!
Some code:
BOARD_SIDE_LENGTH = 4

class Board:
    def __init__(self):
        pass

    def setup(self, word, start_position):
        self.word = word
        self.indexSequence = [start_position, ]
        self.letters_left_over = word[1:]
        self.overlay = []
        # set up template for overlay. When we compare boards, we will add to this if the board fits
        for i in range(BOARD_SIDE_LENGTH*BOARD_SIDE_LENGTH):
            self.overlay.append('')
        self.overlay[start_position] = word[0]
        self.overlay_count = 0

    @classmethod
    def copy(boardClass, board):
        newBoard = boardClass()
        newBoard.word = board.word
        newBoard.indexSequence = board.indexSequence[:]
        newBoard.letters_left_over = board.letters_left_over
        newBoard.overlay = board.overlay[:]
        newBoard.overlay_count = board.overlay_count
        return newBoard

    # need to check if otherboard will fit into existing board (allowed to use blank spaces!)
    # otherBoard will always be just a single word
    @classmethod
    def testOverlay(self, this_board, otherBoard):
        for pos in otherBoard.indexSequence:
            this_board_letter = this_board.overlay[pos]
            other_board_letter = otherBoard.overlay[pos]
            if this_board_letter == '' or other_board_letter == '':
                continue
            elif this_board_letter == other_board_letter:
                continue
            else:
                return False
        return True

    @classmethod
    def doOverlay(self, this_board, otherBoard):
        # otherBoard will always be just a single word
        for pos in otherBoard.indexSequence:
            this_board.overlay[pos] = otherBoard.overlay[pos]
        this_board.overlay_count = this_board.overlay_count + 1

    @classmethod
    def newFromBoard(boardClass, board, next_position):
        newBoard = boardClass()
        newBoard.indexSequence = board.indexSequence + [next_position]
        newBoard.word = board.word
        newBoard.overlay = board.overlay[:]
        newBoard.overlay[next_position] = board.letters_left_over[0]
        newBoard.letters_left_over = board.letters_left_over[1:]
        newBoard.overlay_count = board.overlay_count
        return newBoard

    def getValidCoordinates(self, board, position):
        row = position / 4
        column = position % 4
        for r in range(row - 1, row + 2):
            for c in range(column - 1, column + 2):
                if r >= 0 and r < BOARD_SIDE_LENGTH and c >= 0 and c < BOARD_SIDE_LENGTH:
                    if (r*BOARD_SIDE_LENGTH+c not in board.indexSequence):
                        yield r, c


class boardgen:
    def __init__(self):
        self.boards = []

    def createAll(self, board):
        # get the next letter
        if len(board.letters_left_over) == 0:
            self.boards.append(board)
            return
        next_letter = board.letters_left_over[0]
        last_position = board.indexSequence[-1]
        for row, column in board.getValidCoordinates(board, last_position):
            new_board = Board.newFromBoard(board, row*BOARD_SIDE_LENGTH+column)
            self.createAll(new_board)
And use it like this:
words = ['margays', 'jaguars', 'cougars', 'tomcats', 'margay', 'jaguar', 'cougar', 'pumas', 'puma']
words.sort(key=len)
first_word = words.pop()

# generate all boards for the first word
overlaid_boards = []
for i in range(BOARD_SIDE_LENGTH*BOARD_SIDE_LENGTH):
    test_board = Board()
    test_board.setup(first_word, i)
    generator = boardgen()
    generator.createAll(test_board)
    overlaid_boards += generator.boards
This is an interesting problem. I can't quite come up with a full, optimized solution, but here are some ideas you might try.
The hard part is the requirement to find the optimal subset if you can't fit all the words in. That's going to add a lot to the complexity. Start by eliminating word combinations that obviously aren't going to work. Cut any words with >16 letters. Count the number of unique letters needed. Be sure to take into account letters repeated in the same word. For example, if the list includes "eagle" I don't think you are allowed to use the same 'e' for both ends of the word. If your list of needed letters is >16, you have to drop some words. Figuring out which ones to cut first is an interesting sub-problem... I'd start with the words containing the least used letters. It might help to have all sub-lists sorted by score.
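As a sketch of that letter-counting lower bound (taking letters repeated within a single word into account, since those cannot share a cell), something like this could be used to discard impossible word sets early; the helper name is just for illustration:
from collections import Counter

def cells_needed(words):
    # for each letter, the board needs at least as many cells as the
    # single word that uses that letter the most times
    need = Counter()
    for word in words:
        for letter, cnt in Counter(word.lower()).items():
            need[letter] = max(need[letter], cnt)
    return sum(need.values())

cells_needed(['eagle', 'puma'])   # 8: e, e, a, g, l, p, u, m
# any word set with cells_needed(...) > 16 cannot fit in full on a 4x4 board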
Then you can do the trivial cases where the total of word lengths is <16. After that, you start with the full list of words and see if there's a solution for that. If not, figure out which word(s) to drop and try again.
Given a word list then, the core algorithm is to find a grid (if one exists) that contains all of those words.
The dumb brute-force way would be to iterate over all the grids possible with the letters you need, and test each one to see if all your words fit. It's pretty harsh though: the middle case is 16! ≈ 2×10^13 boards. The exact formula for n unique letters is (16!)/(16-n)! × n^(16-n), which gives a worst case in the range of 3×10^16. Not very manageable.
Even if you can avoid rotations and flips, that only saves you 1/16 of the search space.
A somewhat smarter greedy algorithm would be to sort the words by some criteria, like difficulty or length. A recursive solution would be to take the top word remaining on the list, and attempt to place it on the grid. Then recurse with that grid and the remaining word list. If you fill up the grid before you run out of words, then you have to back track and try another way of placing the word. A greedier approach would be to try placements that re-use the most letters first.
You can do some pruning. If at any point the number of spaces left in the grid is less than the remaining set of unique letters needed, then you can eliminate those sub-trees. There are a few other cases where it's obvious there's no solution that can be cut, especially when the remaining grid spaces are < the length of the last word.
The search space for this depends on word lengths and how many letters are re-used. I'm sure it's better than brute-force, but I don't know if it's enough to make the problem reasonable.
The smart way would be to use some form of dynamic programming. I can't quite see the complete algorithm for this. One idea is to have a tree or graph of the letters, connecting each letter to "adjacent" letters in the word list. Then you start with the most-connected letter and try to map the tree onto the grid. Always place the letter that completes the most of the word list. There'd have to be some way of handling the case of multiple of the same letter in the grid. And I'm not sure how to order it so you don't have to search every combination.
The best thing would be to have a dynamic algorithm that also included all the sub word lists. So if the list had "fog" and "fox", and fox doesn't fit but fog does, it would be able to handle that without having to run the whole thing on both versions of the list. That's adding complexity because now you have to rank each solution by score as you go. But in the cases where all the words won't fit it would save a lot of time.
Good luck on this.
There are a couple of general ideas for speeding up backtrack search you could try:
1) Early checks. It usually helps to discard partial solutions that can never work as early as possible, even at the cost of more work. Consider all two-character sequences produced by chopping up the words you are trying to fit in - e.g. PUMAS contributes PU, UM, MA, and AS (a small sketch of extracting these pairs follows after point 2). These must all be present in the final answer. If a partial solution does not have enough overlapped two-character spaces free to contain all of the overlapped two-character sequences it does not yet have, then it cannot be extended to a final answer.
2) Symmetries. I think this is probably most useful if you are trying to prove that there is no solution. Given one way of filling in a board, you can rotate and reflect that solution to find other ways of filling in a board. If you have 68K starting points and one starting point is a rotation or reflection of another starting point, you don't need to try both, because if you can (or could) solve the problem from one starting point you can get the answer from the other starting point by rotating or reflecting the board. So you might be able to divide the number of starting points you need to try by some integer.
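A small sketch of extracting the two-character requirements from point 1, treating each pair as unordered since board adjacency is symmetric (a doubled letter such as 'oo' collapses to a one-element set here):
def required_pairs(words):
    # every adjacent letter pair in every word must appear on some pair of
    # adjacent cells of a finished board
    pairs = set()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs.add(frozenset((a, b)))
    return pairs

required_pairs(['pumas'])   # the pairs PU, UM, MA and AS from the example above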
This problem is not the only one to have a large number of alternatives at each stage. This also affects the traveling salesman problem. If you can accept not having a guarantee that you will find the absolute best answer, you could try not following up the least promising of these 68k choices. You need some sort of score to decide which to keep - you might wish to keep those which use as many letters already in place as possible. Some programs for the traveling salesman problems discard unpromising links between nodes very early. A more general approach which discards partial solutions rather than doing a full depth first search or branch and bound is Limited Discrepancy Search - see for example http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.2426.
Of course some approaches to the TSP discard tree search completely in favor of some sort of hill-climbing approach. You might start off with a filled boggle square and repeatedly attempt to find your words in it, modifying a few characters in order to force them in, trying to find steps which successively increase the number of words that can be found in the square. The easiest form of hill-climbing is repeated simple hill-climbing from multiple random starts. Another approach is to restart the hill-climbing by randomizing only a portion of the solution so far - since you don't know the best size of portion to randomize, you might decide to choose the size of the portion at random, so that at least some fraction of the time you are randomizing the correct size of region to produce a new square to start from. Genetic algorithms and simulated annealing are very popular here. A paper on a new idea, Late Acceptance Hill-Climbing, also describes some of its competitors - http://www.cs.nott.ac.uk/~yxb/LAHC/LAHC-TR.pdf

Python text search question [closed]

Suppose you open a text file in Python, and you'd like to search for words containing a certain set of letters.
Say you type in 6 different letters (a, b, c, d, e, f) you want to search for.
You'd like to find words matching at least 3 of those letters.
Each letter can only appear once in a word.
And the letter 'a' always has to be included.
What should the code look like for this specific kind of search?
Let's see...
return [x for x in document.split()
        if 'a' in x and sum((1 if y in 'abcdef' else 0 for y in x)) >= 3]
split with no parameters acts as a "words" function, splitting on any whitespace and removing words that contain no characters. Then you check if the letter 'a' is in the word. If 'a' is in the word, you use a generator expression that goes over every letter in the word. If the letter is inside of the string of available letters, then it returns a 1 which contributes to the sum. Otherwise, it returns 0. Then if the sum is 3 or greater, it keeps it. A generator is used instead of a list comprehension because sum will accept anything iterable and it stops a temporary list from having to be created (less memory overhead).
It doesn't have the best access times because of the use of in (which on a string should have an O(n) time), but that generally isn't a very big problem unless the data sets are huge. You can optimize that a bit to pack the string into a set and the constant 'abcdef' can easily be a set. I just didn't want to ruin the nice one liner.
EDIT: Oh, and to improve time on the if portion (which is where the inefficiencies are), you could separate it out into a function that iterates over the string once and returns True if the conditions are met. I would have done this, but it ruined my one liner.
EDIT 2: I didn't see the "must have 3 different characters" part. You can't do that in a one liner. You can just take the if portion out into a function.
def is_valid(word, chars):
    count = 0
    for x in word:
        if x in chars:
            count += 1
            chars.remove(x)
    return count >= 3 and 'a' not in chars

def parse_document(document):
    return [x for x in document.split() if is_valid(x, set('abcdef'))]
This one shouldn't have any performance problems on real world data sets.
Here is what I would do if I had to write this:
I'd have a function that, given a word, would check whether it satisfies the criteria and would return a boolean flag.
Then I'd have some code that would iterate over all words in the file, present each of them to the function, and print out those for which the function has returned True.
I agree with aix's general plan, but it's perhaps even more general than a 'design pattern,' and I'm not sure how far it gets you, since it boils down to, "figure out a way to check for what you want to find and then check everything you need to check."
Advice about how to find what you want to find: You've entered into one of the most fundamental areas of algorithm research. Though LCS (longest common substring) is better covered, you'll have no problems finding good examples for containment either. The most rigorous discussion of this topic I've seen is on a Google cs wonk's website: http://neil.fraser.name. He has something called diff-match-patch which is released and optimized in many different languages, including python, which can be downloaded here:
http://code.google.com/p/google-diff-match-patch/
If you'd like to understand more about Python and algorithms, Magnus Hetland has written a great book about Python algorithms, and his website features some examples within string matching, fuzzy string matching and so on, including the Levenshtein distance in a very simple to grasp format (Google for Magnus Hetland, I don't remember the address).
Within the standard library you can look at difflib, which offers many ways to assess similarity of strings. You are looking for containment, which is not the same, but it is quite related, and you could potentially make a set of candidate words that you could compare, depending on your needs.
Alternatively you could use the new addition to python, Counter, and reconstruct the words you're testing as lists of strings, then make a function that requires counts of 1 or more for each of your tested letters.
Finally, on to the second part of the aix's approach, 'then apply it to everything you want to test,' I'd suggest you look at itertools. If you have any kind of efficiency constraint, you will want to use generators and a test like the one aix proposes can be most efficiently carried out in python with itertools.ifilter. You have your function that returns True for the values you want to keep, and the builtin function bool. So you can just do itertools.ifilter(bool,test_iterable), which will return all the values that succeed.
Good luck
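As a sketch of that plan, here is one hypothetical predicate wired up with Python 2's itertools.ifilter (for brevity it checks "contains 'a'" plus "at least 3 distinct letters from abcdef", and ignores the only-once-per-word rule):
from itertools import ifilter   # Python 2; in Python 3 the builtin filter() is already lazy

def matches(word):
    # hypothetical test: 'a' must be present, plus at least 3 distinct wanted letters
    wanted = set('abcdef')
    return 'a' in word and len(set(word) & wanted) >= 3

document = "a deadbeef cab faced bed"
print list(ifilter(matches, document.split()))   # ['deadbeef', 'cab', 'faced']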
words = 'fubar cadre obsequious xray'
def find_words(src, required=[], letters=[], min_match=3):
    required = set(required)
    letters = set(letters)
    words = ((word, set(word)) for word in src.split())
    words = (word for word in words if word[1].issuperset(required))
    words = (word for word in words if len(word[1].intersection(letters)) >= min_match)
    words = (word[0] for word in words)
    return words
w = find_words(words, required=['a'], letters=['a', 'b', 'c', 'd', 'e', 'f'])
print list(w)
EDIT 1: I too didn't read the requirements closely enough. Here is a version that ensures a word contains no more than one instance of any valid letter.
from collections import Counter

def valid(word, letters, min_match):
    """At least min_match matched letters, no more than one of any letter"""
    c = 0
    count = Counter(word)
    for letter in letters:
        char_count = count.get(letter, 0)
        if char_count > 1:
            return False
        elif char_count == 1:
            c += 1
    return c >= min_match

def find_words(srcfile, required=[], letters=[], min_match=3):
    required = set(required)
    words = (word for word in srcfile.split())
    words = (word for word in words if set(word).issuperset(required))
    words = (word for word in words if valid(word, letters, min_match))
    return words

Fastest way in Python to find a 'startswith' substring in a long sorted list of strings

I've done a lot of Googling, but haven't found anything, so I'm really sorry if I'm just searching for the wrong things.
I am writing an implementation of the word game Ghost for MIT Introduction to Programming, assignment 5.
As part of this, I need to determine whether a string of characters is the start of any valid word. I have a list of valid words ("wordlist").
Update: I could use something that iterated through the list each time, such as Peter's simple suggestion:
def word_exists(wordlist, word_fragment):
    return any(w.startswith(word_fragment) for w in wordlist)
I previously had:
wordlist = [w for w in wordlist if w.startswith(word_fragment)]
(from here) to narrow the list down to the list of valid words that start with that fragment and consider it a loss if wordlist is empty. The reason that I took this approach was that I (incorrectly, see below) thought that this would save time, as subsequent lookups would only have to search a smaller list.
It occurred to me that this is going through each item in the original wordlist (38,000-odd words) checking the start of each. This seems silly when wordlist is ordered, and the comprehension could stop once it hits something that is after the word fragment. I tried this:
newlist = []
for w in wordlist:
    if w[:len(word_fragment)] > word_fragment:
        # Take advantage of the fact that the list is sorted
        break
    if w.startswith(word_fragment):
        newlist.append(w)
return newlist
but that is about the same speed, which I thought may be because list comprehensions run as compiled code?
I then thought that more efficient again would be some form of binary search in the list to find the block of matching words. Is this the way to go, or am I missing something really obvious?
Clearly it isn't really a big deal in this case, but I'm just starting out with programming and want to do things properly.
UPDATE:
I have since tested the below suggestions with a simple test script. While Peter's binary search/bisect would clearly be better for a single run, I was interested in whether the narrowing list would win over a series of fragments. In fact, it did not:
The totals for all strings "p", "py", "pyt", "pyth", "pytho" are as follows:
In total, Peter's simple test took 0.175472736359
In total, Peter's bisect left test took 9.36985015869e-05
In total, the list comprehension took 0.0499348640442
In total, Neil G's bisect took 0.000373601913452
The overhead of creating a second list etc clearly took more time than searching the longer list. In hindsight, this was likely the best approach regardless, as the "reducing list" approach increased the time for the first run, which was the worst case scenario.
Thanks all for some excellent suggestions, and well done Peter for the best answer!!!
Generator expressions are evaluated lazily, so if you only need to determine whether or not your word is valid, I would expect the following to be more efficient since it doesn't necessarily force it to build the full list once it finds a match:
def word_exists(wordlist, word_fragment):
    return any(w.startswith(word_fragment) for w in wordlist)
Note that the lack of square brackets is important for this to work.
However this is obviously still linear in the worst case. You're correct that binary search would be more efficient; you can use the built-in bisect module for that. It might look something like this:
from bisect import bisect_left
def word_exists(wordlist, word_fragment):
    try:
        return wordlist[bisect_left(wordlist, word_fragment)].startswith(word_fragment)
    except IndexError:
        return False # word_fragment is greater than all entries in wordlist
bisect_left runs in O(log(n)) so is going to be considerably faster for a large wordlist.
Edit: I would guess that the example you gave loses out if your word_fragment is something really common (like 't'), in which case it probably spends most of its time assembling a large list of valid words, and the gain from only having to do a partial scan of the list is negligible. Hard to say for sure, but it's a little academic since binary search is better anyway.
You're right that you can do this more efficiently given that the list is sorted.
I'm building off of @Peter's answer, which returns a single element. I see that you want all the words that start with a given prefix. Here's how you do that:
from bisect import bisect_left
wordlist[bisect_left(wordlist, word_fragment):
         bisect_left(wordlist, word_fragment[:-1] + chr(ord(word_fragment[-1]) + 1))]
This returns the slice from your original sorted list.
As Peter suggested I would use the Bisect module. Especially if you're reading from a large file of words.
If you really need speed you could make a daemon ( How do you create a daemon in Python? ) that has a pre-processed data structure suited for the task
I suggest you could use "tries"
http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=usingTries
There are many algorithms and data structures to index and search strings inside a text; some of them are included in the standard libraries, but not all of them. The trie data structure is a good example of one that isn't.
Let word be a single string and let dictionary be a large set of words. If we have a dictionary, and we need to know if a single word is inside of the dictionary, the trie is a data structure that can help us. But you may be asking yourself, "Why use tries if set and hash tables can do the same?" There are two main reasons:
The trie can insert and find strings in O(L) time (where L represents the length of a single word). This is much faster than a set, and a bit faster than a hash table.
The set and the hash table can only find words in a dictionary that match exactly the single word we are looking for; the trie allows us to find words that have a single character different, a prefix in common, a character missing, etc.
Tries can be useful in TopCoder problems, but also have a great number of applications in software engineering. For example, consider a web browser. Do you know how the web browser can autocomplete your text or show you many possibilities of the text that you could be writing? Yes, with a trie you can do it very fast. Do you know how an orthographic corrector can check that every word that you type is in a dictionary? Again, a trie. You can also use a trie for suggested corrections of the words that are present in the text but not in the dictionary.
An example would be:
start={'a':nodea,'b':nodeb,'c':nodec...}
nodea={'a':nodeaa,'b':nodeab,'c':nodeac...}
nodeb={'a':nodeba,'b':nodebb,'c':nodebc...}
etc..
Then if you want all the words starting with "ab" you would just traverse start['a']['b'], and that would be all the words you want.
To build it you could iterate through your wordlist and, for each word, iterate through the characters, adding a new default dict where required.
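A minimal sketch of such a defaultdict-based trie (the '$' end-of-word marker and the helper names are my own conventions, not a standard API):
from collections import defaultdict

def make_trie():
    # every node is just a dict of children; missing children are created on access
    return defaultdict(make_trie)

def build_trie(wordlist):
    root = make_trie()
    for word in wordlist:
        node = root
        for ch in word:
            node = node[ch]
        node['$'] = True   # mark "a word ends here"
    return root

def words_with_prefix(trie, prefix):
    # walk down to the node for the prefix, then collect every word below it
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [(node, prefix)]
    while stack:
        node, sofar = stack.pop()
        if node.get('$'):
            found.append(sofar)
        for ch, child in node.items():
            if ch != '$':
                stack.append((child, sofar + ch))
    return found

trie = build_trie(['apple', 'apply', 'bee'])
words_with_prefix(trie, 'app')   # ['apple', 'apply'] (in no particular order)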
In case of binary search (assuming wordlist is sorted), I'm thinking of something like this:
wordlist = "ab", "abc", "bc", "bcf", "bct", "cft", "k", "l", "m"
fragment = "bc"
a, m, b = 0, 0, len(wordlist)-1
iterations = 0
while True:
if (a + b) / 2 == m: break # endless loop = nothing found
m = (a + b) / 2
iterations += 1
if wordlist[m].startswith(fragment): break # found word
if wordlist[m] > fragment >= wordlist[a]: a, b = a, m
elif wordlist[b] >= fragment >= wordlist[m]: a, b = m, b
if wordlist[m].startswith(fragment):
print wordlist[m], iterations
else:
print "Not found", iterations
It will find one matched word, or none. You will then have to look to the left and right of it to find other matched words. My algorithm might be incorrect; it's just a rough version of my thoughts.
Here's my fastest way to narrow the list wordlist down to a list of valid words starting with a given fragment:
sect() is a generator function that uses Peter's excellent idea of employing bisect, together with the islice() function:
from bisect import bisect_left
from itertools import islice
from time import clock

A, B = [], []
iterations = 5
repetition = 10

with open('words.txt') as f:
    wordlist = f.read().split()
wordlist.sort()
print 'wordlist[0:10]==', wordlist[0:10]

def sect(wordlist, word_fragment):
    lgth = len(word_fragment)
    for w in islice(wordlist, bisect_left(wordlist, word_fragment), None):
        if w[0:lgth] == word_fragment:
            yield w
        else:
            break

def hooloo(wordlist, word_fragment):
    usque = len(word_fragment)
    for w in wordlist:
        if w[:usque] > word_fragment:
            break
        if w.startswith(word_fragment):
            yield w

for rep in xrange(repetition):
    te = clock()
    for i in xrange(iterations):
        newlistA = list(sect(wordlist, 'VEST'))
    A.append(clock() - te)

    te = clock()
    for i in xrange(iterations):
        newlistB = list(hooloo(wordlist, 'VEST'))
    B.append(clock() - te)

print '\niterations =', iterations, ' number of tries:', repetition, '\n'
print newlistA, '\n', min(A), '\n'
print newlistB, '\n', min(B), '\n'
result
wordlist[0:10]== ['AA', 'AAH', 'AAHED', 'AAHING', 'AAHS', 'AAL', 'AALII', 'AALIIS', 'AALS', 'AARDVARK']
iterations = 5 number of tries: 30
['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES']
0.0286089433154
['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES']
0.415578236899
sect() is 14.5 times faster than hooloo()
PS:
I know about the existence of timeit, but here, for such a result, clock() is fully sufficient
Doing binary search in the list is not going to guarantee you anything. I am not sure how that would work either.
You have a list which is ordered, which is good news. The algorithmic complexity of both of your approaches is O(n), which is not bad: you just have to iterate through the whole wordlist once.
But in the second case the practical performance should be better, because you break as soon as you know the remaining entries cannot match. Try a list where the first element is a match and the remaining 38,000 - 1 elements do not match; you will see the second approach beat the first.
