I will have to perform a spelling check-like operation in Python as follows:
I have a huge list of words (let's call it the lexicon). I am now given some text (let's call it the sample). I have to search for each sample word in the lexicon. If I cannot find it, that sample word is an error.
In short - a brute-force spelling checker. However, searching through the lexicon linearly for each sample word is bound to be slow. What's a better method to do this?
The complicating factor is that neither the sample nor the lexicon is in English. It is in a language which instead of 26 characters, can have over 300 - stored in Unicode.
A suggestion of any algorithm / data structure / parallelization method will be helpful. Algorithms which have high speed at the cost of less than 100% accuracy would be perfect, since I don't need 100% accuracy. I know about Norvig's algorithm for this, but it seems English-specific.
You can use a set of Unicode strings:
s = set(u"rabbit", u"lamb", u"calf")
and use the in operator to check whether a word occurs:
>>> u"rabbit" in s
True
>>> u"wolf" in s
False
This look-up is essentially O(1), so the size of the dictionary does not matter.
Edit: Here's the complete code for a (case-sensitive) spell checker (2.6 or above):
from io import open
import re

with open("dictionary", encoding="utf-8") as f:
    words = set(line.strip() for line in f)

with open("document", encoding="utf-8") as f:
    # re.UNICODE so that \w+ also matches word characters outside ASCII
    for w in re.findall(r"\w+", f.read(), re.UNICODE):
        if w not in words:
            print "Misspelled:", w.encode("utf-8")
(The print assumes your terminal uses UTF-8.)
This is where sets come into play. Create a set of all the words in your dictionary and then use the membership operator to check whether a word is present in the dictionary or not.
Here is a simplified example:
>>> dictionary = {'Python','check-like', 'will', 'perform','follows:', 'spelling', 'operation'}
>>> for word in "I will have to perform a spelling check-like operation in Python as follows:".split():
...     if word in dictionary:
...         print "Found {0} in the dictionary".format(word)
...     else:
...         print "{0} not present in the dictionary".format(word)
I not present in the dictionary
Found will in the dictionary
have not present in the dictionary
to not present in the dictionary
Found perform in the dictionary
a not present in the dictionary
Found spelling in the dictionary
Found check-like in the dictionary
Found operation in the dictionary
in not present in the dictionary
Found Python in the dictionary
as not present in the dictionary
Found follows: in the dictionary
>>>
Try it with a set, like everyone is telling you. Set lookups were optimized in Python's C code by experienced programmers, so there's no way you can do better in your little application.
Unicode is not an issue: set and dictionary keys can be Unicode text just as well as English text; it doesn't matter. The only consideration for you might be Unicode normalization, since different orderings of combining diacritics will not compare equal. If this is an issue for your language, I would first ensure the lexicon is stored in normalized form, and then normalize each word before you check it, e.g. unicodedata.normalize('NFC', word).
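A rough Python 3 sketch of that normalization step (the file name lexicon.txt and the choice of NFC are illustrative assumptions; pick whichever form suits your language):
import unicodedata

def load_lexicon(path):
    # Normalize every lexicon entry once, at load time.
    with open(path, encoding="utf-8") as f:
        return set(unicodedata.normalize("NFC", line.strip()) for line in f)

def is_known(word, lexicon):
    # Normalize the sample word the same way, so differently composed
    # diacritics still compare equal.
    return unicodedata.normalize("NFC", word) in lexicon

lexicon = load_lexicon("lexicon.txt")
print(is_known(u"rabbit", lexicon))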
Use a tree structure to store the words, such that each path from root to leaf represents a single word. If your traversal cannot reach a leaf, or reaches a leaf before the end of the word, you have a word not in your lexicon.
Apart from the benefits Emil mentions in the comments, note also that this allows you to do things like back-tracking to find alternative spellings.
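If you want to experiment with that, a dict of dicts is enough for a minimal trie. The sketch below is my own illustration, not part of the answer; the end-of-word marker is needed so that words which are prefixes of other words are handled correctly:
# Minimal trie sketch using nested dicts; _END marks a complete word.
_END = "_end_"

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root

def in_trie(trie, word):
    node = trie
    for ch in word:
        if ch not in node:      # fell off the tree: not in the lexicon
            return False
        node = node[ch]
    return _END in node         # must end exactly on a complete word

trie = build_trie([u"rabbit", u"lamb", u"calf"])
print(in_trie(trie, u"rabbit"))   # True
print(in_trie(trie, u"rab"))      # False (prefix only, not a word)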
The average time complexity of hashed search in a python dictionary is O(1). You can therefore use a "dictionary with no values" (a.k.a. a set)
That's what python dictionaries and sets are for! :)
Either store your lexicon in a dictionary if each word has some value (say frequency), or a set if you just need to check for existence. Searching them is O(1) so it will be damn fast.
lex = set(('word1', 'word2', .....))
for w in words:
    if w not in lex:
        print "Error: %s" % w
First, you need to create an index of your lexicon. You could build your own indexing system, but a better way is to use a full-text search engine.
I can recommend Apache Lucene or Sphinx; both are fast and open source.
Afterwards you can send search queries from Python to the search engine and process the replies.
Here is a post I wrote on checking such things. It's similar to how the Google suggestion/spell checker works.
http://blog.mattalcock.com/2012/12/5/python-spell-checker/
Hope it helps.
Related
I have a list of words like [bike, motorbike, copyright].
Now I want to check if the word consists of subwords which are also stand-alone words. That means the output of my algorithm should be something like: [bike, motor, motorbike, copy, right, copyright].
I already know how to check if a word is an English word:
import enchant

english_words = []
arr = ['bike', 'motorbike', 'copyright', 'apfel']
d_brit = enchant.Dict("en_GB")

for word in arr:
    if d_brit.check(word):
        english_words.append(word)
I also found an algorithm which decomposes the word in all possible ways: Splitting a word into all possible 'subwords' - All possible combinations
Unfortunately, splitting the word like this and then checking whether each part is an English word simply takes too long, because my dataset is far too large.
Can anyone help?
The nested for loops used in the code are extremely slow in Python. As performance seems to be the main issue, I would recommend looking for available Python packages that do parts of the job, building your own extension module (e.g. using Cython), or not using Python at all.
Some alternatives to splitting the word like this:
search for words that start with the first characters of str; if a found word is a prefix of str, check whether the rest is also a word in the dataset (a sketch along these lines follows below)
split str into two portions at lengths that make sense given the length distribution of the dataset, i.e. the most common word lengths, then search for matches with a basic comparison (just a wild idea)
These are a few quick ideas for faster algorithms I can think of. But if these are not quick enough, then BernieD is right.
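To make the first idea concrete, here is a hedged sketch (my own illustration, not part of the answer) that replaces the pyenchant calls with lookups in a plain Python set of known words; known_words below is a made-up stand-in for however you load your word list:
# Find all substrings of a word that are themselves known words.
def subword_hits(word, known_words, min_length=3):
    hits = set()
    for start in range(len(word)):
        for end in range(start + min_length, len(word) + 1):
            if word[start:end] in known_words:
                hits.add(word[start:end])
    return hits

known_words = {'bike', 'motor', 'motorbike', 'copy', 'right', 'copyright'}
for w in ['bike', 'motorbike', 'copyright', 'apfel']:
    print(w, sorted(subword_hits(w, known_words)))
Set membership is O(1) on average, so the cost is dominated by the number of substrings per word rather than by the size of the word list.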
I have a list of over 500 very important, but arbitrary, strings. They look like:
list_important_codes = ['xido9','uaid3','frps09','ggix21']
What I know
* Casing is not important, but all other characters must match exactly.
* Every string starts with 4 alphabetical characters and ends with either one or two numerical characters.
* I have a list of about 100,000 strings, list_recorded_codes, that were hand-typed and should match list_important_codes exactly, but about 10,000 of them don't. Because these strings were typed manually, the incorrect strings are usually only about 1 character off (errors such as: an added space, two letters switched around, "01" instead of "1", etc.).
What I need to do
I need to iterate through list_recorded_codes and find all of their perfect matches within list_important_codes.
What I tried
I spent about 10 hours trying to manually program a way to fix each word, but it seems to be impractical and incredibly tedious. Not to mention, when my list doubles in size at a later date, I would have to go through that manual process all over again.
The solution I think I need, and the expected output
I'm hoping that Python's NLTK can efficiently 'score' these arbitrary terms to find a 'best score'. For example, if the word in question is inputword = "gdix88", and that word gets compared to give score(inputword,"gdox89")=.84 and score(inputword,"sudh88")=.21, my expected output would be highscore=.84, highscoreword='gdox89'.
for manually_entered_text in ['xido9','uaid3','frp09','ggix21']:
    get_highest_score_from_important_words()  # returns word_with_highest_score
    manually_entered_text = word_with_highest_score
I am also willing to use a different set of tools to fix this issue if needed. but also, the simpler the better! Thank you!
The 'score' you are looking for is called an edit distance. There is quite a lot of literature and algorithms available - easy to find, but only after you know the proper term :)
See the corresponding wikipedia article.
The nltk package provides an implementation of the so-called Levenshtein edit-distance:
from nltk.metrics.distance import edit_distance
if __name__ == '__main__':
    print(edit_distance("xido9", "xido9 "))
    print(edit_distance("xido9", "xido8"))
    print(edit_distance("xido9", "xido9xxx"))
    print(edit_distance("xido9", "xido9"))
The results are 1, 1, 3 and 0 in this case.
Here is the documentation of the corresponding nltk module
There are more specialized versions of this score that take into account how frequent various typing errors are (for example, 'e' instead of 'r' might occur quite often because the keys are next to each other on a QWERTY keyboard).
But the classic Levenshtein distance is where I would start.
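Applied to the question's lists, a hedged sketch (the recorded values below are made-up examples of typos, and "best score" is interpreted here as smallest edit distance) could look like this:
from nltk.metrics.distance import edit_distance

list_important_codes = ['xido9', 'uaid3', 'frps09', 'ggix21']
list_recorded_codes = ['xido9 ', 'uadi3', 'frp09', 'ggix12']   # made-up typos

for recorded in list_recorded_codes:
    cleaned = recorded.strip().lower()
    # Pick the important code with the smallest edit distance to the typo.
    best = min(list_important_codes,
               key=lambda code: edit_distance(cleaned, code.lower()))
    print(recorded, '->', best)
In practice you would probably also want to flag matches whose best distance is still large (say more than 2), since those are unlikely to be simple typos.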
You could apply a dynamic programming approach to this problem. Once you have your scoring matrix, your alignment matrix, and your local and global alignment functions set up, you could iterate through the list_important_codes and find the highest-scoring alignment in the list_recorded_codes. Here is a project I did for DNA sequence alignment: DNA alignment. You can easily adapt it to your problem.
I have a 50mb regex trie that I'm using to split phrases apart.
Here is the relevant code:
import io
import re
with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    regex = myfile.read()

while True == True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
Since the regex is so large, this takes forever!
Here is the code I'm trying now, with re.compile(TempRegex):
import io
import re
with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    TempRegex = myfile.read()

regex = re.compile(TempRegex)

while True == True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
What I'm trying to do is I'm trying to check to see if an entered phrase is a combination of names. For example, the phrase "johnsmith123" to return ['john', 'smith', '123']. The regex file was created by a tool from a word list of every first and last name from Facebook. I want to see if an entered phrase is a combination of words from that wordlist essentially ... If johns and mith are names in the list, then I would want "johnsmith123" to return ['john', 'smith', '123', 'johns', 'mith'].
I don't think that regex is the way to go here. It seems to me that all you are trying to do is to find a list of all of the substrings of a given string that happen to be names.
If the user's input is a password or passphrase, that implies a relatively short string. It's easy to break that string up into the set of possible substrings, and then test that set against another set containing the names.
The number of substrings in a string of length n is n(n+1)/2. Assuming that no one is going to enter more than say 40 characters you are only looking at 820 substrings, many of which could be eliminated as being too short. Here is some code to do that:
def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s)-start+1):
            yield s[start:start+length]
So the problem then is loading the names into a suitable data structure. Your regex is 50MB, but considering the snippet that you showed in one of your comments, the amount of actual data is going to be a lot smaller than that due to the overhead of the regex syntax.
If you just used text files with one name per line you could do this:
names = set(word.strip().lower() for word in open('names.txt'))

def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s)-start+1):
            yield s[start:start+length]

s = 'johnsmith123'
print(sorted(names.intersection(substrings(s))))
Might give output:
['jo', 'john', 'johns', 'mi', 'smith']
I doubt that there will be memory issues given the likely small data set, but if you find that there's not enough memory to load the full data set at once, you could look at using sqlite3 with a simple table to store the names. This will be slower to query, but the data won't all have to fit in memory.
Another way could be to use the shelve module to create a persistent dictionary with names as keys.
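A minimal sketch of the shelve idea (the file names names.txt and names.db are assumptions; this trades lookup speed for not holding all the names in RAM):
import shelve

# Build the persistent dictionary once, with names as keys.
with shelve.open("names.db") as db:
    with open("names.txt") as f:
        for line in f:
            db[line.strip().lower()] = True

# Later, possibly in another run of the program, test membership
# without loading the whole name list into memory.
with shelve.open("names.db") as db:
    print("smith" in db)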
Python's regex engine does not actually implement true regular expressions, since it includes features such as lookbehind, capture groups, and back references, and it uses backtracking to match the leftmost valid branch instead of the longest.
If you use a true regex engine, you will almost always get better results if your regex does not require those features.
One of the most important qualities of a true regular expression is that it will always return a result in time proportional to the length of the input, without using any memory.
I've written one myself using a DFA implemented in C (but usable from Python via cffi), which has optimal asymptotic performance, although I haven't tried constant-factor improvements such as vectorization and assembly generation. I didn't make a generally usable API, since I only need to call it from within my library, but it shouldn't be too hard to figure out from the examples. (Note that search can be implemented as match with .* up front, then matching backward, but for my purpose I would rather return a single character as an error token.) Link to my project
You might also consider building the DFA offline and using it for multiple runs of your program - but this is what flex does, so there was no point in doing that for my project; maybe just use flex if you're comfortable with C. Of course, you'd almost certainly have to write a fair bit of custom C code to use my project anyway ...
If you compile it, the regex pattern is compiled into bytecode and then run by the matching engine. If you don't compile it, the same regex has to be loaded over and over whenever it is called. That's why the compiled version is much faster if you are using the same regex for multiple different records.
I'm working on a project that searches specific users' Twitter streams from my followers list and retweets them. The code below works fine, but if the string appears inside a word it also matches (for instance, if the desired string was only "man" but they wrote "manager", it'd get retweeted). I'm still pretty new to Python, but my hunch is regex will be the way to go; my attempts have proved useless thus far.
if tweet["user"]["screen_name"] in friends:
for phrase in list:
if phrase in tweet["text"].lower():
print tweet
api.retweet(tweet["id"])
return True
Since you only want to match whole words the easiest way to get Python to do this is to split the tweet text into a list of words and then test for the presence of each of your words using in.
There's an optimization you can use because position isn't important: by building a set from the word list you make searching much faster (technically, O(1) rather than O(n)) because of the fast hashed access used by sets and dicts (thank you Tim Peters, also author of The Zen of Python).
The full solution is:
if tweet["user"]["screen_name"] in friends:
tweet_words = set(tweet["text"].lower().split())
for phrase in list:
if phrase in tweet_words:
print tweet
api.retweet(tweet["id"])
return True
This is not a complete solution. Really you should be taking care of things like purging leading and trailing punctuation. You could write a function to do that, and call it with the tweet text as an argument instead of using a .split() method call.
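For example, a small helper along those lines might look like this (a sketch only; str.strip with string.punctuation handles ASCII punctuation, so tweets with fancier characters may need more care):
import string

def tweet_word_set(text):
    # Lower-case, split on whitespace, strip leading/trailing punctuation.
    return set(w.strip(string.punctuation) for w in text.lower().split())

print(sorted(tweet_word_set("Goodness me, she's a great Perl programmer!")))
# ['a', 'goodness', 'great', 'me', 'perl', 'programmer', "she's"]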
Given the set optimization above, it occurred to me that iteration in Python could be avoided altogether if the phrases were a set also (the iteration will still happen, but at C speed rather than Python speed). So in the code that follows, let's suppose that during initialization you have executed the code
tweet_words = set(l.lower() for l in list)
By the way, list is a terrible name for a variable, since by using it you make the Python list type unavailable under its usual name (though you can still get at it with tricks like type([])). Perhaps better to call it word_list or something else both more meaningful and not an existing name. You will have to adapt this code to your needs, it's just to give you the idea. Note that tweet_words only has to be set once.
list = ['Python', 'Perl', 'COBOL']
tweets = [
    "This vacation just isn't worth the bother",
    "Goodness me she's a great Perl programmer",
    "This one slides by under the radar",
    "I used to program COBOL but I'm all right now",
    "A visit to the doctor is not reported"
]

tweet_words = set(w.lower() for w in list)

for tweet in tweets:
    if set(tweet.lower().split()) & tweet_words:
        print(tweet)
If you want to use regexes to do this, look for a pattern that is of the form \b<string>\b. In your case this would be:
pattern = re.compile(r"\bman\b")
if re.search(pattern, tweet["text"].lower()):
#do your thing
\b looks for a word boundary in regex. So prefixing and suffixing your pattern with it will match only the pattern. Hope it helps.
How would I go about making a program where the user enters a string, and the program generates a list of words beginning with that string?
Ex:
User: "abd"
Program: abdicate, abdomen, abduct...
Thanks!
Edit: I'm using python, but I assume that this is a fairly language-independent problem.
Use a trie.
Add your list of words to a trie. Each path from the root to a leaf is a valid word. A path from a root to an intermediate node represents a prefix, and the children of the intermediate node are valid completions for the prefix.
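A minimal dict-of-dicts sketch of that idea (my own illustration; the end-of-word marker and the function names are not a standard API):
# Trie-based prefix completion.
_END = "_end_"

def insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[_END] = True

def completions(root, prefix):
    node = root
    for ch in prefix:                 # walk down to the node for the prefix
        if ch not in node:
            return []
        node = node[ch]
    results = []
    def collect(node, acc):           # gather every word below that node
        if _END in node:
            results.append(prefix + acc)
        for ch, child in node.items():
            if ch != _END:
                collect(child, acc + ch)
    collect(node, "")
    return sorted(results)

root = {}
for w in ["abdicate", "abdomen", "abduct", "about"]:
    insert(root, w)
print(completions(root, "abd"))   # ['abdicate', 'abdomen', 'abduct']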
One of the best ways to do this is to use a directed graph to store your dictionary. It takes a little bit of setting up, but once done it is fairly easy to then do the type of searches you are talking about.
The nodes in the graph correspond to a letter in your word, so each node will have one incoming link and up to 26 (in English) outgoing links.
You could also use a hybrid approach where you maintain a sorted list containing your dictionary and use the directed graph as an index into your dictionary. Then you just look up your prefix in your directed graph and then go to that point in your dictionary and spit out all words matching your search criteria.
If you're on a Debian[-like] machine,
#!/bin/bash
echo -n "Enter a word: "
read input
grep "^$input" /usr/share/dict/words
Takes all of 0.040s on my P200.
egrep `read input && echo ^$input` /usr/share/dict/words
Oh, I didn't see the Python edit; here is the same thing in Python:
my_input = raw_input("Enter beginning of word: ")
my_words = open("/usr/share/dict/words").readlines()
my_found_words = [x for x in my_words if x[0:len(my_input)] == my_input]
If you really want speed, use a trie/automaton. However, something that will be faster than simply scanning the whole list, given that the list of words is sorted:
from itertools import takewhile, islice
import bisect
def prefixes(words, pfx):
    return list(
        takewhile(lambda x: x.startswith(pfx),
                  islice(words,
                         bisect.bisect_right(words, pfx),
                         len(words))))
Note that an automaton is O(1) with regard to the size of your dictionary, while this algorithm is O(log(m)) and then O(n) with regard to the number of strings that actually start with the prefix, while the full scan is O(m), with n << m.
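For example, assuming the word list has already been loaded and sorted:
words = sorted(line.strip() for line in open("/usr/share/dict/words"))
print(prefixes(words, "abdicat"))   # words beginning with 'abdicat'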
def main(script, name):
    for word in open("/usr/share/dict/words"):
        if word.startswith(name):
            print word,

if __name__ == "__main__":
    import sys
    main(*sys.argv)
If you really want to be efficient, use suffix trees or suffix arrays - Wikipedia article.
Your problem is what suffix trees were designed to handle.
There is even an implementation for Python - here.
You can use str.startswith(). Reference from the official docs:
str.startswith(prefix[, start[, end]])
Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.
Try the code below:
dictionary = ['apple', 'abdicate', 'orange', 'abdomen', 'abduct', 'banana']
user_input = input('Enter something: ')
for word in dictionary:
    if word.startswith(user_input):
        print(word)
Output:
Enter something: abd
abdicate
abdomen
abduct
var words = from word in dictionary
            where word.key.StartsWith("bla-bla-bla")
            select word;
Try using regex to search through your list of words, e.g. /^word/ and report all matches.
If you need to be really fast, use a tree:
Build an array and split the words into 26 sets based on the first letter, then split each of those into 26 based on the second letter, and so on.
So if your user types "abd" you would look up Array[0][1][3] and get a list of all the words starting like that. At that point your list should be small enough to pass over to the client and use JavaScript to filter.
Most Pythonic solution
# set your list of words, whatever the source
words_list = ('cat', 'dog', 'banana')

# get the word from the user input
user_word = raw_input("Enter a word:\n")

# create a generator, so your output is flexible and stores almost nothing in memory
word_generator = (word for word in words_list if word.startswith(user_word))

# now you can do anything you want with it
# here we just list it:
for word in word_generator:
    print word
Remember generators can only be used once, so turn it into a list (using list(word_generator)) or use the itertools.tee function if you expect to use it more than once.
Best way to do it:
Store it in a database and use SQL to look for the word you need. If there are a lot of words in your dictionary, it will be much faster and more efficient.
Python has plenty of DB APIs to help you do the job ;-)
If your dictionary is really big, I'd suggest indexing with a Python text index (PyLucene - note that I've never used the Python extension for Lucene). The search would be efficient and you could even return a search 'score'.
Also, if your dictionary is relatively static you won't even have the overhead of re-indexing very often.
Don't use a bazooka to kill a fly. Use something simple just like SQLite. There are all the tools you need for every modern language and you can just do:
SELECT word FROM dict WHERE word LIKE 'user_entry%'
It's lightning fast and a baby could do it. What's more, it's portable, persistent and easy to maintain.
Python tutorial:
http://www.initd.org/pub/software/pysqlite/doc/usage-guide.html
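A hedged sketch of that approach with Python's built-in sqlite3 module (the table layout and file name are made up; note the prefix is passed as a bound parameter rather than pasted into the SQL string):
import sqlite3

conn = sqlite3.connect("dict.db")
conn.execute("CREATE TABLE IF NOT EXISTS dict (word TEXT PRIMARY KEY)")
conn.executemany("INSERT OR IGNORE INTO dict VALUES (?)",
                 [("abdicate",), ("abdomen",), ("abduct",), ("banana",)])
conn.commit()

user_entry = "abd"
rows = conn.execute("SELECT word FROM dict WHERE word LIKE ? ORDER BY word",
                    (user_entry + "%",))
print([row[0] for row in rows])   # ['abdicate', 'abdomen', 'abduct']
conn.close()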
A linear scan is slow, but a prefix tree is probably overkill. Keeping the words sorted and using a binary search is a fast and simple compromise.
import bisect
words = sorted(map(str.strip, open('/usr/share/dict/words')))
def lookup(prefix):
    return words[bisect.bisect_left(words, prefix):bisect.bisect_right(words, prefix + '~')]
>>> lookup('abdicat')
['abdicate', 'abdication', 'abdicative', 'abdicator']
If you store the words in a .csv file, you can use pandas to solve this rather neatly, and after you have read it once you can reuse the already loaded data frame if the user should be able to perform more than one search per session.
import pandas as pd

# header=None so the single column of words can be addressed as df[0]
df = pd.read_csv('dictionary.csv', header=None)
matching_words = df[0].loc[df[0].str.startswith(user_entry)]