Find regex match in large dictionary using Python quickly

I have a large dictionary containing regex patterns as keys and numeric values as values. Given a corpus (broken down into a list of individual word tokens), I would like to find the regex that best matches each word in order to obtain its respective value.
The dictionary contains many patterns that are ambiguous, in the sense that a word may match more than one of them, so I want to find the longest regex or 'best match' (e.g. the dictionary contains affect+ as well as affected and affection).
My issue is that when I run a large text sample through the dictionary to find the regex match of each word token, it takes a long time (0.1 s per word), which obviously adds up over thousands of words. This is because it goes through the whole dictionary each time to find the 'best match'.
Is there a faster way to achieve this? Please see the problematic part of my code below.
# scans the entire dictionary for every token -- this is the slow part
for word in textTokens:
    for reg, value in dictionary.items():
        if re.match(reg, word):
            matchedWords.append(reg)

Because you mentioned the input regexes have the structure word+ (a simple word followed by the + regex symbol), you can use a modified version of the Aho-Corasick algorithm. In this algorithm you build a finite-state machine from the search patterns, and the machine can be extended to accept a few regex constructs. In your case, a very simplistic solution would be to pad your keys to the length of the longest word in the list and accept anything that comes after the padding. Wildcards such as '.' and '?' are easy to implement; for '*' you have to either run to the end of the word or backtrack and follow the other path through the automaton, which can mean an exponential number of choices, though still in constant memory (as with any deterministic finite automaton).
A finite-state machine for a list of your regex keys can be built in linear time, meaning time proportional to the sum of the lengths of your dictionary keys. As explained here, there is also support for finding the longest matched word in the dictionary.
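For the simple word+ structure described in the question, even a plain prefix trie gets most of the benefit: each token is matched in a single left-to-right walk instead of a scan over the whole dictionary. Below is a minimal sketch under that assumption (it is not a full Aho-Corasick construction); the names dictionary and textTokens are taken from the question.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.value = None        # value stored for a key that ends here
        self.open_ended = False  # True if the key ended with '+'

def build_trie(dictionary):
    root = TrieNode()
    for key, value in dictionary.items():
        node = root
        for ch in key.rstrip('+'):
            node = node.children.setdefault(ch, TrieNode())
        node.value = value
        node.open_ended = key.endswith('+')
    return root

def longest_match(root, word):
    """Return the value of the longest dictionary key matching `word`, or None."""
    best = None
    node = root
    for i, ch in enumerate(word):
        if ch not in node.children:
            break
        node = node.children[ch]
        # an open-ended key ('affect+') matches any continuation;
        # an exact key must consume the whole word
        if node.value is not None and (node.open_ended or i == len(word) - 1):
            best = node.value
    return best

trie = build_trie({'affect+': 1, 'affected': 2, 'affection': 3})
print(longest_match(trie, 'affection'))  # 3 -- the longest key wins
print(longest_match(trie, 'affects'))    # 1 -- matched via 'affect+'
Building the trie is linear in the total length of the keys, and each lookup is linear in the length of the word, so the per-token cost no longer depends on the size of the dictionary.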

Related

Matching list of words in any order but only if adjacent/separated by a maximum of n words

I have only been able to find advice on how to match a number of substrings in any order, or separated by a maximum of n words (not both at once).
I need to implement a condition in python where a number of terms appear in any order but separated by e.g. a maximum of one word, or adjacent. I've found a way to implement the "in any order" part using lookarounds, but it doesn't account for the adjacent/separated by a maximum of one word issue. To illustrate:
re.search("^.*(?=.*word1\s*\w*\sword2)(?=.*\w*)(?=.*word3\w*).*$", "word1 filler word2 and word3")
This should match "word1 word2" or "word1 max1word word2" and "word3*", in any order, separated by one word as in this case - which it does. However, it also matches a string where the terms are separated by two or more words, which it should not. I tried doing it like this:
re.search("^.*(?=\s?word1\s*\w*\sword2)(?=\s?\w*)(?=\s?word3\w*).*$", "word1 word2 word3")
hoping that using \s? at the beginning of each bracketed term instead of .* would fix it but that doesn't work at all (no match even when there should be one).
Does anyone know of a solution?
In the actual patterns I'm looking for it's more than just two separate strings, so writing out each possible combination is not feasible.
Well, your question is not perfectly clear, but you could try this, assuming word1, word2 and word3 are known words:
(?:word1(\s\w+)?\sword2)|(?:word2(\s\w+)?\sword1)|word3
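A quick sanity check of that pattern with Python's re module (the sample strings are adapted from the question):
import re

pattern = r'(?:word1(\s\w+)?\sword2)|(?:word2(\s\w+)?\sword1)|word3'
print(re.search(pattern, "word1 filler word2 and word3"))  # matches "word1 filler word2"
print(re.search(pattern, "word1 one two word2"))           # None: two words in between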
I'm trying to identify patents related to AI using a keyword search in abstracts and titles
In that case, I wouldn't recommend regular expressions at all.
Instead,
Use a library that provides stemming.
This is how your code knows that "program", "programming", "programmer", etc. are related; it is much more robust than a regular-expression solution and more likely to be semantically accurate.
Use a library that can generate n-grams.
N-grams are essentially tuples of subsequent words from a text. You can infer that words appearing in the same tuple are more likely to be related to each other.
Use a library of stop words.
These are words that are so common as to be unlikely to provide value, like "and".
Putting this all together, 3-grams for
inductive randomword logic and programming
would be something like
(induct, randomword, logic)
(randomword, logic, program)
Applying this to an arbitrary article, you can then look for stemmed n-grams with whatever terms you wish. Adjust n if you find you're having too many false negatives or positives.
This isn't the place for a complete tutorial, but a library like NLTK provides all of these features.
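For illustration, here is a rough sketch of that pipeline with NLTK (stemming, stop-word removal, n-grams); it assumes the punkt and stopwords data have already been fetched with nltk.download():
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.util import ngrams

text = "inductive randomword logic and programming"
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# tokenize, drop stop words, stem what remains
tokens = [stemmer.stem(t) for t in nltk.word_tokenize(text.lower())
          if t.isalpha() and t not in stop_words]

print(list(ngrams(tokens, 3)))
# [('induct', 'randomword', 'logic'), ('randomword', 'logic', 'program')]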

Reading text from a file, then writing to another file with repetitions in text marked

I'm a beginner to both Python and to this forum, so please excuse any vague descriptions or mistakes.
I have a problem regarding reading/writing to a file. What I'm trying to do is to read a text from a file and then find the words that occur more than one time, mark them as repeated_word and then write the original text to another file but with the repeated words marked with star signs around them.
I find it difficult to understand how I'm going to compare just the words (without punctuation etc) but still be able to write the words in its original context to the file.
I have been recommended to use regex by some, but I don't know how to use it. Another approach is to iterate through the text string, tokenize and normalize it, sort of by going through each character, and then make some kind of object or element out of each word.
I am thankful to anyone who might have ideas on how to solve this. The main problem is not how to find which words that are repeated but how to mark them and then write them to the file in their context. Some help with the coding would be much appreciated, thanks.
EDIT
I have updated the code with what I've come up with so far. If there is anything you would consider "bad coding", please comment on it.
To explain the Whitelist class: the assignment has two parts, one where I am supposed to mark the words and one regarding a whitelist containing words that are "allowed repetitions" and shall therefore not be marked.
I have read heaps of stuff about regex but I still can't get my head around how to use it.
Basically, you need to do two things: find which words are repeated, and then transform each of these words into something else (namely, the original word with some marker around it). Since there's no way to know which words are repeated without going through the entire file, you will need to make two passes.
For the first pass, all you need to do is extract the words from the text and count how many times each one occurs. In order to determine what the words are, you can use a regular expression. A good starting point might be
regex = re.compile(r"[\w']+")
The function re.compile creates a regular expression from a string. This regular expression matches any sequence of one or more word characters (\w) or apostrophes, so it will catch contractions but not punctuation, and I think in many "normal" English texts this should capture all the words.
Once you have created the regular expression object, you can use its finditer method to iterate over all matches of this regular expression in your text.
for word in regex.finditer(text):
You can use the Counter class to count how many times each word occurs. (I leave the implementation as an exercise. :-P The documentation should be quite helpful.)
After you've gotten a count of how many times each word occurs, you will have to pick out those whose counts are 2 or more, and come up with some way to identify them in the input text. I think a regular expression will also help you here. Specifically, you can create a regular expression object which will match any of a selected set of words, by compiling the string consisting of the words joined by |.
regex = re.compile('|'.join(words))
where words is a list or set or some iterable. Since you're new to Python, let's not get too fancy (although one can); just code up a way to go through your Counter or whatever and create a list of all words which have a count of 2 or more, then create the regular expression as I showed you.
Once you have that, you'll probably benefit from the sub method, which takes a string and replaces all matches of the regular expression in it with some other text. In your case, the replacement text will be the original word with asterisks around it, so you can do this:
new_text = regex.sub(r'*\g<0>*', text)
In a regular expression replacement, \g<0> refers to whatever was matched by the regex (note that sub takes the replacement string first and the text to process second).
Finally, you can write new_text to a file.
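Putting the pieces together, a minimal sketch of the whole two-pass approach might look like this (the file names are placeholders):
import re
from collections import Counter

with open('input.txt') as infile:          # placeholder file name
    text = infile.read()

word_re = re.compile(r"[\w']+")

# pass 1: count how many times each word occurs (ignoring case)
counts = Counter(match.group(0).lower() for match in word_re.finditer(text))
repeated = [word for word, n in counts.items() if n >= 2]

# pass 2: wrap every repeated word in asterisks
if repeated:
    marker = re.compile(r"\b(?:" + "|".join(map(re.escape, repeated)) + r")\b",
                        re.IGNORECASE)
    text = marker.sub(r'*\g<0>*', text)

with open('output.txt', 'w') as outfile:   # placeholder file name
    outfile.write(text)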
If you know that the text only contains alphabetic characters, it may be easier to just ignore characters that are outside of a-z than to try to remove all the punctuation.
Here is one way to remove all characters that are not a-z or space:
file = ''.join(c for c in file if 97 <= ord(c) <= 122 or c == ' ')
This works because ord() returns the ASCII code for a given character, and ASCII 97-122 represent a-z (in lowercase).
Then you'll want to split that into words, which you can accomplish like this:
words = file.split()
If you pass this to the Counter data structure it will count the occurrences of each word.
counter = Counter(file.split())
Then counter.items() will contain a mapping from word to number of occurrences.
OK. I presume that this is a homework assignment, so I'm not going to give you a complete solution. But, you really need to do a number of things.
The first is to read the input file into memory. Then split it into its component words (tokenize it), probably held in a list, suitably cleaned up to remove stray punctuation. You seem to be well on your way to doing that, but I would recommend you look at the split() and strip() methods available for strings.
You need to consider whether you want the count to be case sensitive or not, and so you might want to convert each word in the list to (say) lowercase to keep this consistent. So you could do this with a for loop and the string lower() method, but a list-comprehension is probably better.
You then need to go through the list of words and count how many times each one appears. If you check out collections.Counter you will find that it does the heavy lifting for you; alternatively, you will need to build a dictionary which has the words as keys and their counts as values. (You might also want to check out the collections.defaultdict class here as well).
Finally, you need to go through the text you've read from the file and for each word it contains which has more than one match (i.e. the count in the dictionary or counter is > 1) mark it appropriately. Regular expressions are designed to do exactly this sort of thing. So I recommend you look at the re library.
Having done that, you simply then write the result to a file, which is simple enough.
Finally, with respect to your file operations (reading and writing) I would recommend you consider replacing the try ... except construct with a with ... as one.
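For example, the with statement closes the file for you even if an exception occurs, so an explicit try ... except/finally around the file handling is no longer needed (the file name here is a placeholder):
with open('input.txt') as infile:   # closed automatically when the block exits
    text = infile.read()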

The most elegant way to find n words in String with the particular word

There is a big string and I need to find all substrings containing exactly N words (if it is possible).
For example:
big_string = "The most elegant way to find n words in String with the particular word"
N = 2
find_sub(big_string, 'find', N=2) # => ['way to find n words']
I've tried to solve it with regular expressions, but it turned out to be more complex than I expected at first. Is there an elegant solution around that I've just overlooked?
Upd
By word we mean everything separated by \b
The N parameter indicates how many words should appear on each side of 'find'
For your specific example (if we use the "word" definition of regular expressions, i.e. anything containing letters, digits and underscores) the regex would look like this:
r'(?:\w+\W+){2}find(?:\W+\w+){2}'
\w matches one of said word characters. \W matches any other character. I think it's obvious where in the pattern your parameters go. You can use the pattern with re.search or re.findall.
The issue is if there are less than the desired amount of words around your query (i.e. if it's too close to one end of the string). But you should be able to get away with:
r'(?:\w+\W+){0,2}find(?:\W+\w+){0,2}'
thanks to the greediness of repetition. Note that in any case, if you want multiple results, matches can never overlap. So with the first pattern you will only get the first match if two occurrences of find are too close to each other, whereas with the second you won't get n words before the second find (the ones that were already consumed will be missing). In particular, if two occurrences of find are closer together than n, so that the second find is already part of the first match, you can't get the second match at all.
If you want to treat a word as anything that is not a white-space character, the approach looks similar:
r'(?:\S+\s+){0,2}find(?:\s+\S+){0,2}'
For anything else you will have to come up with the character classes yourself, I guess.
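One possible way to wrap the whitespace-based pattern into the find_sub function from the question (the function name and sample string are taken from there):
import re

def find_sub(big_string, word, N=2):
    pattern = r'(?:\S+\s+){0,%d}%s(?:\s+\S+){0,%d}' % (N, re.escape(word), N)
    return re.findall(pattern, big_string)

big_string = "The most elegant way to find n words in String with the particular word"
print(find_sub(big_string, 'find', N=2))  # ['way to find n words']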

Extracting whole words

I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there's plenty of regex ninjas around here, so hopefully someone can help me out.
Currently I'm extracting all alphabetical sequences with '[a-z]+'. This is an okay approximation, but it drags a lot of rubbish out with it.
Ideally I would like some regex (doesn't have to be pretty or efficient) that extracts all alphabetical sequences delimited by natural word separators (such as [/-_,.: ] etc.), and ignores any alphabetical sequences with illegal bounds.
However I'd also be happy to just be able to get all alphabetical sequences that ARE NOT adjacent to a number. So for instance 'pie21' would NOT extract 'pie', but 'http://foo.com' would extract ['http', 'foo', 'com'].
I tried lookahead and lookbehind assertions, but they were applied per-character (so for example re.findall('(?<!\d)[a-z]+(?!\d)', 'pie21') would return 'pi' when I want it to return nothing). I tried wrapping the alpha part as a term ((?:[a-z]+)) but it didn't help.
More detail: The data is an email database, so it's mostly plain English with normal numbers, but occasionally there's rubbish strings like GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA and AC7A21C0 that I'd like to ignore completely. I'm assuming any alphabetical sequence with a number in it is rubbish.
If you restrict yourself to ASCII letters, then use (with the re.I option set)
\b[a-z]+\b
\b is a word boundary anchor, matching only at the start and end of alphanumeric "words". So \b[a-z]+\b matches pie, but not pie21 or 21pie.
To also allow other non-ASCII letters, you can use something like this:
\b[^\W\d_]+\b
which also allows accented characters etc. You may need to set the re.UNICODE option, especially when using Python 2, in order to allow the \w shorthand to match non-ASCII letters.
[^\W\d_] as a negated character class allows any alphanumeric character except for digits and underscore.
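A small demonstration of both patterns (re.I makes them case-insensitive; under Python 3, str patterns are Unicode-aware by default):
import re

text = "pie21 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA café"
print(re.findall(r'\b[a-z]+\b', text, re.I))       # ['http', 'foo', 'com']
print(re.findall(r'\b[^\W\d_]+\b', text, re.I))    # ['http', 'foo', 'com', 'café']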
Are you familiar with word boundaries (\b)? You can extract words by putting \b around the sequence and matching the alphabet within:
\b([a-zA-Z]+)\b
For instance, this will grab whole words but stop at tokens such as hyphens, periods, semi-colons, etc.
You can read about the \b sequence, and others, in the Python manual.
EDIT Also, if you want to rule out a number following or preceding the match, you can use a negative look-ahead/look-behind:
(?!\d) # negative look-ahead for numbers
(?<!\d) # negative look-behind for numbers
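Combining those lookarounds with the whole-word pattern above, a quick sketch:
import re

text = "pie21 21pie http://foo.com words"
print(re.findall(r'(?<!\d)\b[a-zA-Z]+\b(?!\d)', text))
# ['http', 'foo', 'com', 'words']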
What about:
import re
yourString="pie 42 http://foo.com GIHQ4NWL0S5SCGBDD40ZXE5IDP13TYNEA pie42"
filter(lambda x: re.match(r"^[a-zA-Z]+$", x), set(re.split(r"[\s:/,.]", yourString)))
Note that:
split explodes your string into potential candidates => returns a list of "potential words"
set performs uniqueness filtering => transforms the list into a set, thus removing entries appearing more than once. This step is not mandatory.
filter reduces the number of candidates: it takes a list, applies a test function to each element, and returns a list of the elements that pass the test. In our case, the test function is "anonymous"
lambda : anonymous function, taking an item and checking if it's a word (upper or lower letters only)
EDIT : added some explanations
Sample code
print re.search(ur'(?u)ривет\b', ur'Привет')
print re.search(ur'(?u)\bривет\b', ur'Привет')
or
s = ur"abcd ААБВ"
import re
rx1 = re.compile(ur"(?u)АБВ")
rx2 = re.compile(ur"(?u)АБВ\b")
rx3 = re.compile(ur"(?u)\bАБВ\b")
print rx1.findall(s)
print rx2.findall(s)
print rx3.findall(s)

Text Parsing Design

Let's say I have a paragraph of text like so:
Snails can be found in a very wide
range of environments including
ditches, deserts, and the abyssal
depths of the sea. Numerous kinds of snail can
also be found in fresh waters. (source)
I have 10,000 regex rules to match text, which can overlap. For example, the regex /Snails? can/i will find two matches (italicized in the text). The regex /can( also)? be/i has two matches (bolded).
After iterating through my regexes and finding matches, what is the best data structure to use, that given some place in the text, it returns all regexes that mached it? For example, if I want the matches for line 1, character 8 (0-based, which is the a in can), I would get a match for both regexes previously described.
I can create a hashmap: (key: character location, value: set of all matching regexes). Is this optimal? Is there a better way to parse the text with thousands of regexes (to not loop through each one)?
Thanks!
Storing all of the matches in a dictionary will work, but it means you'll have to keep all of the matches in memory at the same time. If your data is small enough to easily fit into memory, don't worry about it. Just do what works and move on.
If you do need to reduce memory usage or increase speed, it really depends on how you are using the data. For example, if you process positions starting at the beginning and going to the end, you could use re.finditer to process all of the regexes iteratively and not keep extra matches in memory longer than needed.
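As a rough sketch of the dictionary idea with finditer; the names regexes and text are placeholders, and the sample data comes from the question:
import re
from collections import defaultdict

text = ("Snails can be found in a very wide range of environments. "
        "Numerous kinds of snail can also be found in fresh waters.")
regexes = [r'Snails? can', r'can( also)? be']

matches_at = defaultdict(set)   # character offset -> patterns matching there
for pattern in regexes:
    for m in re.finditer(pattern, text, re.IGNORECASE):
        for pos in range(m.start(), m.end()):
            matches_at[pos].add(pattern)

print(matches_at[8])   # character 8 is the 'a' in the first 'can': both patterns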
I'm assuming that your regexes do not cross sentence boundaries. In that case you could:
1) break your text into an array of sentences
2) for each sentence, simply record which regexes (by id) have matched.
3) when you would like to see the match, run the regex again.
"Store less / compute more" solution.
