I have a 50 MB trie-based regex that I'm using to split phrases apart.
Here is the relevant code:
import io
import re
with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    regex = myfile.read()

while True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
Since the regex is so large, this takes forever!
Here is the code I'm trying now, with re.compile(TempRegex):
import io
import re
with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    TempRegex = myfile.read()

regex = re.compile(TempRegex)

while True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
What I'm trying to do is check whether an entered phrase is a combination of names. For example, I want the phrase "johnsmith123" to return ['john', 'smith', '123']. The regex file was created by a tool from a word list of every first and last name from Facebook. Essentially, I want to see whether an entered phrase is a combination of words from that word list ... If "johns" and "mith" are also names in the list, then I would want "johnsmith123" to return ['john', 'smith', '123', 'johns', 'mith'].
I don't think that regex is the way to go here. It seems to me that all you are trying to do is to find a list of all of the substrings of a given string that happen to be names.
If the user's input is a password or passphrase, that implies a relatively short string. It's easy to break that string up into the set of possible substrings, and then test that set against another set containing the names.
The number of substrings in a string of length n is n(n+1)/2. Assuming that no one is going to enter more than say 40 characters you are only looking at 820 substrings, many of which could be eliminated as being too short. Here is some code to do that:
def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s)-start+1):
            yield s[start:start+length]
So the problem then is loading the names into a suitable data structure. Your regex is 50MB, but considering the snippet that you showed in one of your comments, the amount of actual data is going to be a lot smaller than that due to the overhead of the regex syntax.
If you just used text files with one name per line you could do this:
names = set(word.strip().lower() for word in open('names.txt'))

def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s)-start+1):
            yield s[start:start+length]

s = 'johnsmith123'
print(sorted(names.intersection(substrings(s))))
This might give output like:
['jo', 'john', 'johns', 'mi', 'smith']
I doubt that there will be memory issues given the likely small data set, but if you find that there's not enough memory to load the full data set at once, you could look at using sqlite3 with a simple table to store the names. This will be slower to query, but it won't all need to fit in memory.
Another way could be to use the shelve module to create a persistent dictionary with names as keys.
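For instance, a minimal sketch of the shelve approach might look like this (the file names and the one-name-per-line source file are assumptions, and substrings() is the generator defined above):
import shelve

# One-time build step: store each name as a key in a persistent dict.
# 'names.txt' (one name per line) and 'names.shelf' are assumed file names.
with shelve.open('names.shelf') as db:
    for line in open('names.txt'):
        db[line.strip().lower()] = True

# At query time only the keys you actually touch are read from disk.
with shelve.open('names.shelf', flag='r') as db:
    s = 'johnsmith123'
    print(sorted(sub for sub in substrings(s) if sub in db))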
Python's re engine is not actually a regular expression engine in the formal sense: it includes features such as lookbehind, capture groups, and back references, and it uses backtracking to match the leftmost valid branch instead of the longest.
If you use a true regex engine, you will almost always get better results if your regex does not require those features.
One of the most important qualities of a true regular expression is that it will always return a result in time proportional to the length of the input, using only constant memory.
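To make that concrete, here is a hedged sketch using Google's RE2 engine through its Python binding (this assumes the google-re2 package, which exposes an re-like module named re2; the binding and the small name list are my assumptions, not part of the original answer):
# Assumes: pip install google-re2, which provides a module `re2`
# that mirrors the standard `re` API for the features RE2 supports.
import re2

names = ['john', 'johns', 'smith', 'mith']
# A plain alternation of literal names - no backreferences or lookbehind -
# so RE2 can compile it to an automaton and match in linear time.
pattern = re2.compile('|'.join(sorted(names, key=len, reverse=True)))
print(pattern.findall('johnsmith123'))  # non-overlapping name matches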
I've written one myself using a DFA implemented in C (but usable from python via cffi), which will have optimal asymptotic performance, but I haven't tried constant-factor improvements such as vectorization and assembly generation. I didn't make a generally usable API though since I only need to call it from within my library, but it shouldn't be too hard to figure out from the examples. (Note that search can be implemented as match with .* up front, then match backward, but for my purpose I would rather return a single character as an error token). Link to my project
You might also consider building the DFA offline and using it for multiple runs of your program - but that is what flex does, so there was no point in doing it for my project; maybe just use flex if you're comfortable with C? Of course you'd almost certainly have to write a fair bit of custom C code to use my project anyway ...
If you compile the pattern, it is translated into bytecode once and then run by the matching engine on every call. If you don't compile it, the module-level functions have to look the pattern up in a cache (and recompile it on a cache miss) every time they are called. That's why the compiled version is much faster when you are applying the same regex to many different records.
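A minimal illustration of the usual pattern, reusing the file name from the question:
import io
import re

with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    regex = re.compile(myfile.read())

while True:
    password = input("Enter a phrase to be split: ")
    # Calling the method on the compiled pattern object avoids any
    # per-call lookup of the huge pattern string.
    print(regex.findall(password))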
I have a list of several thousand locations and a list of millions of sentences. My objective is to return a list of tuples that report the comment that was matched and the location mentioned within the comment. For example:
locations = ['Turin', 'Milan']
state_init = ['NY', 'OK', 'CA']
sent = ['This is a sent about turin. ok?', 'This is a sent about milano.', 'Alan Turing was not from the state of OK.']
result = [('Turin', 'This is a sent about turin. ok?'), ('Milan', 'this is a sent about Melan'), ('OK', 'Alan Turing was not from the state of OK.')]
In words, I do not want to match on locations embedded within other words, I do not want to match state initials if they are not capitalized. If possible, I would like to catch misspellings or fuzzy matches of locations that either omit a correct letter, replace one correct letter with an incorrect letter or have one error in the ordering of all of the correct letters. For example:
Milan
should match
Melan, Mlian, or Mlan but not Milano
The below function works very well at doing everything except the fuzzy matching and returning a tuple but I do not know how to do either of these things without a for loop. Not that I am against using a for loop but I still would not know how to implement this in a way that is computationally efficient.
Is there a way to add these functionalities that I am interested in having or am I trying to do too much in a single function?
import re

def find_keyword_comments(sents, locations, state_init):
    keywords = '|'.join(locations)
    keywords1 = '|'.join(state_init)
    word = re.compile(r"^.*\b({})\b.*$".format(keywords), re.I)
    word1 = re.compile(r"^.*\b({})\b.*$".format(keywords1))
    newlist = filter(word.match, sents)
    newlist1 = filter(word1.match, sents)
    final = list(newlist) + list(newlist1)
    return final
I would recommend you look at metrics for fuzzy matching; the main one you are interested in is the Levenshtein distance (sometimes called the edit distance).
Here are some implementations in pure python, but you can leverage a few modules to make your life easier:
fuzzywuzzy is a very common (pip-installable) package which implements this distance for what they call the pure ratio. It provides a bit more functionality than you are maybe looking for (partial string matching, ignoring punctuation marks, token order insensitivity...). The only drawback is that the ratio takes into account the length of the string as well. See this response for further basic usage
from fuzzywuzzy import fuzz
fuzz.ratio("this is a test", "this is a test!") # 96
python-Levenshtein is a pretty fast package because it is basically a wrapper in python to the C library underneath. The documentation is not the nicest, but should work. It is now back in the PyPI index so it is pip installable.
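To see how that plays out on the question's own examples, here is a small sketch with python-Levenshtein (the exact threshold, and how to treat transpositions and insertions, are left open):
# pip install python-Levenshtein
import Levenshtein

# Edit distances from the question's candidate spellings to 'milan':
for candidate in ['melan', 'mlan', 'mlian', 'milano']:
    print(candidate, Levenshtein.distance(candidate, 'milan'))
# melan  -> 1 (one substitution)
# mlan   -> 1 (one deletion)
# mlian  -> 2 (a transposition is two edits in plain Levenshtein;
#              Damerau-Levenshtein counts it as one)
# milano -> 1 (one insertion, which you do *not* want to accept,
#              so a distance threshold alone is not quite enough)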
I'm working on a project that searches specific users' Twitter streams from my followers list and retweets them. The code below works fine, but if the string appears inside a word (for instance, if the desired string was only "man" but they wrote "manager"), it'd get retweeted. I'm still pretty new to Python, but my hunch is regex will be the way to go; my attempts have proved useless thus far.
if tweet["user"]["screen_name"] in friends:
    for phrase in list:
        if phrase in tweet["text"].lower():
            print tweet
            api.retweet(tweet["id"])
            return True
Since you only want to match whole words the easiest way to get Python to do this is to split the tweet text into a list of words and then test for the presence of each of your words using in.
There's an optimization you can use because position isn't important: by building a set from the word list you make searching much faster (technically, O(1) rather than O(n)) because of the fast hashed access used by sets and dicts (thank you Tim Peters, also author of The Zen of Python).
The full solution is:
if tweet["user"]["screen_name"] in friends:
    tweet_words = set(tweet["text"].lower().split())
    for phrase in list:
        if phrase in tweet_words:
            print tweet
            api.retweet(tweet["id"])
            return True
This is not a complete solution. Really you should be taking care of things like purging leading and trailing punctuation. You could write a function to do that, and call it with the tweet text as an argument instead of using a .split() method call.
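A minimal sketch of such a cleaning function (the name is mine, and stripping only string.punctuation is an assumption about what needs purging):
import string

def tweet_word_set(text):
    # Lower-case, split on whitespace, and strip leading/trailing
    # punctuation from each token before building the set.
    return set(word.strip(string.punctuation)
               for word in text.lower().split())

tweet_words = tweet_word_set("Goodness me, she's a great Perl programmer!")
# 'me,' becomes 'me' and 'programmer!' becomes 'programmer'.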
Given that optimization it occurred to me that iteration in Python could be avoided altogether if the phrases were a set also (the iteration will still happen, but at C speeds rather than Python speeds). So in the code that follows let's suppose that you have during initialization executed the code
tweet_words = set(l.lower() for l in list)
By the way, list is a terrible name for a variable, since by using it you make the Python list type unavailable under its usual name (though you can still get at it with tricks like type([])). Perhaps better to call it word_list or something else both more meaningful and not an existing name. You will have to adapt this code to your needs, it's just to give you the idea. Note that tweet_words only has to be set once.
list = ['Python', 'Perl', 'COBOL']
tweets = [
"This vacation just isn't worth the bother",
"Goodness me she's a great Perl programmer",
"This one slides by under the radar",
"I used to program COBOL but I'm all right now",
"A visit to the doctor is not reported"
]
tweet_words = set(w.lower() for w in list)
for tweet in tweets:
    if set(tweet.lower().split()) & tweet_words:
        print(tweet)
If you want to use regexes to do this, look for a pattern that is of the form \b<string>\b. In your case this would be:
pattern = re.compile(r"\bman\b")
if re.search(pattern, tweet["text"].lower()):
    # do your thing
\b looks for a word boundary in regex. So prefixing and suffixing your pattern with it will match only the pattern. Hope it helps.
I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches 'this' and 'thing'. Now if I try to input a whole file and search through it:
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()

filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so I have spaces and it is not a list. I can match things such as a single letter, so if I just search for 't' then I get a bunch of t's. This tells me that I am matching single letters - however, I want to match whole words, and even more, to preserve the wildcard structure.
What I would like to happen is for the user to enter text (including what will be a wildcard) that I can substitute into the place where 'th*' is, with the wildcard still doing what it should. That leads to the question: can I just stick in a variable holding the search text in place of 'th*'? After some investigation I am wondering if I am somehow supposed to translate the 'th*', and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way to go about "passing in regex formulas", and what do I have wrong in the code, given that the second snippet does not operate on the incoming text the way the first one (correctly) does?
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split(' ')  # split on spaces to make a list

filtered = fnmatch.filter(file_contents, 'th*')
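And yes, the pattern can simply come from a variable - fnmatch does not care where the string originated. A short sketch (the prompt text is mine), also showing the fnmatch.translate route from the question if you would rather go through re:
import fnmatch
import re

with open('testfilefolder/wssnt10.txt') as f:
    words = f.read().lower().split()

user_pattern = input('Enter a pattern (wildcards allowed): ')  # e.g. 'th*'

# Option 1: hand the user's pattern straight to fnmatch.
filtered = fnmatch.filter(words, user_pattern)

# Option 2: translate the wildcard pattern into a regex and use re.
regex = re.compile(fnmatch.translate(user_pattern))
filtered_re = [w for w in words if regex.match(w)]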
I'm fairly new to Python, and am writing a series of script to convert between some proprietary markup formats. I'm iterating line by line over files and then basically doing a large number (100-200) of substitutions that basically fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
the str.replace() function seems to be pretty efficient (fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen the re.sub() method with a function as an argument, but am unsure if this would be better? I guess it depends on what kind of optimizations Python does internally. Thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall('(<[^>]+>)', line)
for tag in m:
    tag_new = re.sub(r"\*t\([^\)]*\)", "", tag)
    tag_new = re.sub(r"\*p\([^\)]*\)", "", tag_new)
    # do many more searches...
    if tag != tag_new:
        line = line.replace(tag, tag_new, 1)  # potentially problematic
Any thoughts of efficiency here?
Thanks!
str.replace() is more efficient if you're going to do basic search and replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).
I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.
You can also improve efficiency by working on larger strings (call re.sub once on the whole file instead of once per line). That increases memory use, which shouldn't be a problem unless the file is huge, and it improves execution time.
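As a hedged sketch of combining the two ideas, here is one way to fold many literal replacements into a single re.sub call with a lookup dict (the entries are just the ones from the question):
import re

replacements = {
    "-": "<EMDASH>",
    "<\\#>": "#",
    "<\\n>": "",
    "\xe1": "•",
}

# One alternation of the escaped literal keys, longest first so that
# longer tags win over single characters they happen to contain.
pattern = re.compile("|".join(
    re.escape(k) for k in sorted(replacements, key=len, reverse=True)))

def replace_all(line):
    # The callback simply looks the matched text up in the dict.
    return pattern.sub(lambda m: replacements[m.group(0)], line)

print(replace_all("a-b<\\#>c<\\n>"))   # a<EMDASH>b#c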
If you don't actually need the regex and are just doing literal replacing, string.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution though would probably be to use cStringIO
Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).
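A rough sketch of that tokenising approach, using re.split with a capturing group so the tags survive as their own tokens (clean_tag stands in for the question's many per-tag substitutions):
import re

TAG = re.compile(r'(<[^>]+>)')   # capturing group keeps tags in the output

def clean_tag(tag):
    # Stand-in for the ~100 per-tag substitutions from the question.
    tag = re.sub(r'\*t\([^)]*\)', '', tag)
    tag = re.sub(r'\*p\([^)]*\)', '', tag)
    return tag

def process_line(line):
    tokens = TAG.split(line)   # alternates: text, tag, text, tag, ...
    return ''.join(clean_tag(t) if TAG.fullmatch(t) else t for t in tokens)

print(process_line('plain <x *t(1) y> more <z *p(2)> text'))
# plain <x  y> more <z > text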
You can pass a function object to re.sub instead of a substitution string, it takes the match object and returns the substitution, so for example
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object, this lambda is just an example.
Using a single regex that does all the substitution should be slightly faster than iterating over the string many times, but if a lot of substitutions are performed the overhead of calling the function object that computes the substitution may be significant.
How would I go about making a program where the user enters a string, and the program generates a list of words beginning with that string?
Ex:
User: "abd"
Program:abdicate, abdomen, abduct...
Thanks!
Edit: I'm using python, but I assume that this is a fairly language-independent problem.
Use a trie.
Add your list of words to a trie. Each path from the root to a node marked as end-of-word spells a valid word. A path from the root to an intermediate node represents a prefix, and the subtree below that node gives the valid completions of the prefix.
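A minimal dict-based trie sketch along those lines (the end-of-word marker and the function names are my own choices):
END = '$'   # marker key: the prefix spelled so far is a complete word

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def completions(trie, prefix):
    node = trie
    for ch in prefix:               # walk down to the node for the prefix
        if ch not in node:
            return                  # no word starts with this prefix
        node = node[ch]
    stack = [(node, prefix)]
    while stack:                    # collect every word below that node
        node, word = stack.pop()
        if END in node:
            yield word
        for ch, child in node.items():
            if ch != END:
                stack.append((child, word + ch))

trie = build_trie(['abdicate', 'abdomen', 'abduct', 'able'])
print(sorted(completions(trie, 'abd')))   # ['abdicate', 'abdomen', 'abduct']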
One of the best ways to do this is to use a directed graph to store your dictionary. It takes a little bit of setting up, but once done it is fairly easy to then do the type of searches you are talking about.
The nodes in the graph correspond to a letter in your word, so each node will have one incoming link and up to 26 (in English) outgoing links.
You could also use a hybrid approach where you maintain a sorted list containing your dictionary and use the directed graph as an index into your dictionary. Then you just look up your prefix in your directed graph and then go to that point in your dictionary and spit out all words matching your search criteria.
If you're on a Debian[-like] machine,
#!/bin/bash
echo -n "Enter a word: "
read input
grep "^$input" /usr/share/dict/words
Takes all of 0.040s on my P200.
egrep `read input && echo ^$input` /usr/share/dict/words
Oh, I didn't see the Python edit; here is the same thing in Python:
my_input = raw_input("Enter beginning of word: ")
my_words = open("/usr/share/dict/words").readlines()
my_found_words = [x for x in my_words if x[0:len(my_input)] == my_input]
If you really want speed, use a trie/automaton. However, something that will be faster than simply scanning the whole list, given that the list of words is sorted:
from itertools import takewhile, islice
import bisect
def prefixes(words, pfx):
    return list(
        takewhile(lambda x: x.startswith(pfx),
                  islice(words,
                         bisect.bisect_left(words, pfx),
                         len(words))))
Note that an automaton is O(1) with regard to the size of your dictionary, while this algorithm is O(log(m)) and then O(n) with regard to the number of strings that actually start with the prefix, while the full scan is O(m), with n << m.
def main(script, name):
    for word in open("/usr/share/dict/words"):
        if word.startswith(name):
            print word,

if __name__ == "__main__":
    import sys
    main(*sys.argv)
If you really want to be efficient - use suffix trees or suffix arrays - wikipedia article.
Your problem is what suffix trees were designed to handle.
There is even an implementation for Python - here
You can use str.startswith(). Reference from the official docs:
str.startswith(prefix[, start[, end]])
Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.
Try the code below:
dictionary = ['apple', 'abdicate', 'orange', 'abdomen', 'abduct', 'banana']
user_input = input('Enter something: ')
for word in dictionary:
    if word.startswith(user_input):
        print(word)
Output:
Enter something: abd
abdicate
abdomen
abduct
var words = from word in dictionary
            where word.key.StartsWith("bla-bla-bla")
            select word;
Try using regex to search through your list of words, e.g. /^word/ and report all matches.
If you need to be really fast, use a tree:
build an array and split the words into 26 sets based on the first letter, then split each set into 26 based on the second letter, then again.
So if your user types "abd" you would look for Array[0][1][3] and get a list of all the words starting like that. At that point your list should be small enough to pass over to the client and use javascript to filter.
Most Pythonic solution
# set your list of words, whatever the source
words_list = ('cat', 'dog', 'banana')

# get the word from the user input
user_word = raw_input("Enter a word:\n")

# create a generator, so the output is flexible and almost nothing is stored in memory
word_generator = (word for word in words_list if word.startswith(user_word))

# now you can do anything you want with it
# here we just list it:
for word in word_generator:
    print word
Remember generators can only be used once, so turn it into a list (using list(word_generator)) or use the itertools.tee function if you expect to use it more than once.
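For example, a quick sketch of the itertools.tee option:
import itertools

word_generator = (w for w in ('cat', 'dog', 'banana') if w.startswith('ba'))

# tee gives two independent iterators over the same underlying generator.
first_pass, second_pass = itertools.tee(word_generator)
print(list(first_pass))    # ['banana']
print(list(second_pass))   # ['banana'] - still available, unlike the original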
Best way to do it:
Store the words in a database and use SQL to look up the word you need. If there are a lot of words in your dictionary, it will be much faster and more efficient.
Python has plenty of DB APIs to help you do the job ;-)
If your dictionary is really big, I'd suggest indexing with a Python text index (PyLucene - note that I've never used the Python extension for Lucene). The search would be efficient and you could even return a search 'score'.
Also, if your dictionary is relatively static you won't even have the overhead of re-indexing very often.
Don't use a bazooka to kill a fly. Use something simple just like SQLite. There are all the tools you need for every modern languages and you can just do :
"SELECT word FROM dict WHERE word LIKE "user_entry%"
It's lightning fast and a baby could do it. What's more it's portable, persistent and so easy to maintain.
Python tuto :
http://www.initd.org/pub/software/pysqlite/doc/usage-guide.html
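A small sketch of that with Python's built-in sqlite3 module (the table and file names are mine; note the prefix is passed as a parameter rather than pasted into the SQL):
import sqlite3

conn = sqlite3.connect('dict.db')
conn.execute("CREATE TABLE IF NOT EXISTS dict (word TEXT)")
conn.executemany("INSERT INTO dict (word) VALUES (?)",
                 [('abdicate',), ('abdomen',), ('abduct',), ('apple',)])
conn.execute("CREATE INDEX IF NOT EXISTS idx_word ON dict (word)")
conn.commit()

user_entry = 'abd'
rows = conn.execute("SELECT word FROM dict WHERE word LIKE ?",
                    (user_entry + '%',))
print([row[0] for row in rows])   # ['abdicate', 'abdomen', 'abduct']
conn.close()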
A linear scan is slow, but a prefix tree is probably overkill. Keeping the words sorted and using a binary search is a fast and simple compromise.
import bisect
words = sorted(map(str.strip, open('/usr/share/dict/words')))
def lookup(prefix):
    return words[bisect.bisect_left(words, prefix):bisect.bisect_right(words, prefix + '~')]
>>> lookup('abdicat')
['abdicate', 'abdication', 'abdicative', 'abdicator']
If you store the words in a .csv file, you can use pandas to solve this rather neatly, and after you have read it once you can reuse the already loaded data frame if the user should be able to perform more than one search per session.
import pandas as pd

df = pd.read_csv('dictionary.csv', header=None)   # assumes one word per line, no header row
matching_words = df[0].loc[df[0].str.startswith(user_entry)]