Regex fuzzy word match - python

Tough regex question: I want to use regexes to extract information from news sentences about crackdowns. Here are some examples:
doc1 = "5 young students arrested"
doc2 = "10 rebels were reported killed"
I want to match sentences based on lists of entities and outcomes:
entities = ['students','rebels']
outcomes = ['arrested','killed']
How can I use a regex to extract the number of participants from 0-99999, any of the entities, any of the outcomes, all while ignoring random text (such as 'young' or 'were reported')? This is what I have:
re.findall(r'\d{1,5} \D{1,50}'+ '|'.join(entities) + '\D{1,50}' + '|'.join(outcomes),doc1)
i.e., a number, some optional random text, an entity, some more optional random text, and an outcome.
Something is going wrong, I think because of the OR statements. Thanks for your help!

This regex should match your two examples:
pattern = r'\d+\s+.*?(' + '|'.join(entities) + r').*?(' + '|'.join(outcomes) + ')'
What you were missing were parentheses around the ORs.
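For example, a quick sketch applying the grouped pattern to the two sample sentences with re.search:

import re

entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']
pattern = r'\d+\s+.*?(' + '|'.join(entities) + r').*?(' + '|'.join(outcomes) + ')'

for doc in ("5 young students arrested", "10 rebels were reported killed"):
    m = re.search(pattern, doc)
    if m:
        print(m.group(0), '->', m.group(1), m.group(2))
# 5 young students arrested -> students arrested
# 10 rebels were reported killed -> rebels killed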
However, using only regex likely won't give you good results. Consider using Natural Language Processing libraries like NLTK, which can parse sentences.

As @ReutSharabani already answered, this is not a proper way to do NLP, but this answers the literal question.
The regex should read:
import re

doc1 = "5 young students arrested"
entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']

p = re.compile(r'(\d{1,5})\D{1,50}(' + '|'.join(entities) + r')\D{1,50}(' + '|'.join(outcomes) + r')')
m = p.match(doc1)
number = m.group(1)   # '5'
entity = m.group(2)   # 'students'
outcome = m.group(3)  # 'arrested'
You forgot to group () your OR-operations. What you generated instead was \d{1,5} \D{1,50}students|rebels\D{1,50}arrested|killed, where each | splits the entire pattern into alternatives rather than just separating the word lists.

You ought to try out the regex module!
It has built-in fuzzy matching capabilities. The other answers seem much more robust and sleek, but this could be done simply with fuzzy matching as well!
import regex

pattern = r'(?:\d{1,5}(%(entities)s)(%(outcomes)s)){i}' % {
    'entities': '|'.join(entities), 'outcomes': '|'.join(outcomes)}
regex.match(pattern, news_sentence)
What's happening here is that the {i} (applied to a group wrapping the whole pattern) indicates you want a match with any number of inserts, so filler text like ' young ' is tolerated. The problem is that it could also insert characters into one of the entities or outcomes and still yield a match. If you want to accept slight alterations in the spelling of any of your outcomes or entities, you could also use {e<=1} or similar. Read more about approximate matching in the regex module documentation!
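For instance, a minimal sketch (the pattern layout and the misspelled sample sentence are mine, not from the original answer) combining the word lists with {e<=1} so each keyword tolerates one spelling error:

import regex

entities = ['students', 'rebels']
outcomes = ['arrested', 'killed']

# {e<=1} tolerates at most one error (insertion, deletion, or substitution)
# inside each keyword group
pattern = r'(\d{1,5})\D{1,50}?((?:%s){e<=1})\D{1,50}?((?:%s){e<=1})' % (
    '|'.join(entities), '|'.join(outcomes))

m = regex.search(pattern, '5 young studens were arested')  # two misspellings
if m:
    print(m.group(1), m.group(2), m.group(3))  # expected: 5 studens arested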

Related

Remove repeating characters from a sentence but retain the words' meaning

I want to remove repeating characters from a sentence while making sure the words still retain their meaning (if they have any). For example: "I'm so haaappppyyyy about offline school"
should become "I'm so happy about offline school". See, haaappppyyyy became happy, while offline and school stay the same instead of becoming ofline and schol.
I've tried two solutions, using re and itertools, but neither really fits what I'm looking for.
Using Regex :
tweet = "I'm so haaappppyyyy about offline school"
repeat_char = re.compile(r"(.)\1{1,}", re.IGNORECASE)
tweet = repeat_char.sub(r"\1\1", tweet)
tweet = re.sub(r"(.)\1{2,}", r"\1", tweet)
output :
I'm so haappyy about offline school # it keeps 2 chars for every repeated char
using itertools :
tweet = "I'm so happy about offline school"
tweet = ''.join(ch for ch, _ in itertools.groupby(tweet))
output :
I'm so hapy about ofline schol
How can I fix this? Should I make a list of words I want to exclude?
In addition, I want it to also be able to reduce words that follow a pattern to their base form. For example:
wkwk (base form)
wkwkwkwk
wkwkwkwkwkwkwk
I want to make the second and the third word into the first word, the base form
You can combine regex and NLP here by iterating over all words in a string; once you find one with identical consecutive letters, reduce them to at most 2 consecutive occurrences of the same letter and run an automatic spellcheck to fix the spelling.
See an example Python code:
import re
from textblob import Word

tweet = "I'm so haaappppyyyy about offline school"
rx = re.compile(r'([^\W\d_])\1{2,}')

# for each word: if it has 3+ identical consecutive letters, squash them
# to 2 and run the spellchecker; otherwise leave the word alone
print(re.sub(r'[^\W\d_]+',
             lambda x: Word(rx.sub(r'\1\1', x.group())).correct() if rx.search(x.group()) else x.group(),
             tweet))
# => "I'm so happy about offline school"
The code uses the TextBlob library, but you may use any you like.
Note that ([^\W\d_])\1{2,} matches any letter repeated three or more times in a row, and [^\W\d_]+ matches one or more letters.
This answer was originally written for Regex to reduce repeated chars in a string which was closed as duplicate before I could submit my post. So I "recycled" it here.
Regex is not always the best solution
Regex for validation of formats or input
A regex is often used for low-level pattern recognition and substitution.
It may be useful for validation of formats. You can see it as "dumb" automation.
Linguistics (NLP)
When it comes to natural language (NLP), or here spelling (dictionary), semantics may play a role. Depending on the context, "ass" and "as" may both be correctly spelled, although the semantics are very different.
(I apologize for the rude examples, but I am not a native speaker and those two had the most distinct meanings depending on letter duplication.)
For those cases a regex or simple pattern recognition may not be sufficient. It can take more effort to apply it correctly than to research a language-specific library or solution (including a basic application).
Examples for spelling that a regex may struggle with
Like the difference between "haappy" (orthographically invalid, but only the vowels "aa" are duplicated, not the consonants "pp"), "yeees" (contains no duplicates in correct spelling), and "kiss" (correctly spelled with duplicate consonants).
Spelling correction requires more
For example, a dictionary to look up whether duplicated characters (vowels or consonants) are valid for the correct spelling of the word in its form.
Consider a spelling-correction module
You could use textblob module for spelling correction:
To install:
pip install textblob
Example for some test-cases (independent words):
from textblob import TextBlob
incorrect_words = ["cmputr", "yeees", "haappy"] # incorrect spelling
text = ",".join(incorrect_words) # join them as comma separated list
print(f"original words: {text}")
b = TextBlob(text)
# prints the corrected spelling
print(f"corrected words: {b.correct()}")
Prints:
original words: cmputr,yeees,haappy
corrected words: computer,eyes,happy
Surprise: You might have expected "yes" (so did I). But the correction does not remove the 2 duplicated vowels "ee"; instead it rearranges the letters to keep almost all of them (5 of 6, only one "e" removed).
Example for the given sentence:
from textblob import TextBlob
tweet = "I'm so haaappppyyyy about offline school" # either escape or use different quotes when a single-quote (') is enclosed
print(TextBlob(tweet).correct())
Prints:
I'm so haaappppyyyy about office school
Unfortunately quite a bit worse:
not "happy"
semantically out of scope with "office" instead of "offline"
Apparently a preceding cleaning step using regex, like Wiktor suggests, may improve the result.
See also:
Stackabuse: Spelling Correction in Python with TextBlob, tutorial
documentation: TextBlob: Simplified Text Processing
Well, first of all you need a list (or set) of all allowed words to compare against.
I'd approach it with the assumption (which might be wrong) that no words contain sequences of more than two repeating characters. So for each word, generate a list of all potential candidates; for example, "haaappppppyyyy" would yield ["haappyy", "happyy", "happy", etc.]. Then it's just a matter of checking which of those candidates actually exists by comparing against the allowed word list, as in the sketch below.
The time complexity of this is quite high, though, so if it needs to go fast then throw a hash table on it or something :)
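A minimal sketch of that idea (the word set here is a toy stand-in; in practice you would load a real dictionary):

import itertools

allowed = {'happy', 'offline', 'school'}  # toy word list; replace with a real one

def candidates(word):
    # collapse each run of repeated characters to either 1 or 2 occurrences
    runs = [(ch, len(list(grp))) for ch, grp in itertools.groupby(word)]
    options = [[ch, ch * 2] if n > 1 else [ch] for ch, n in runs]
    for combo in itertools.product(*options):
        yield ''.join(combo)

def reduce_word(word):
    for cand in candidates(word.lower()):
        if cand in allowed:
            return cand
    return word  # no dictionary hit; leave the word unchanged

print(reduce_word('haaappppyyyy'))  # -> happy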

Regex: how to use re.sub with variable number of elements?

I'm trying to replace {x;y} patterns in a text corpus with "x or y", except that the number of elements is variable, so sometimes there will be 3 or more elements, i.e. {x;y;z} (the max is 9).
I'm trying to do this with regex, but I'm not sure how to do it such that the replacement adapts to the number of elements present. I mean, if I use a regex with a variable component like the following
part = r'(;[\w\s]+)'
regex = r'\(([\w\s]+);([\w\s]+){}?\)'.format(part)
re.sub(regex, r'\1 or \2 or \3', text)
I will sometimes get an additional 'or' (and more if I increase the number of variable elements) when there are only 2 elements present in the braces, which I don't want. The alternative is to do this many times with a different number of variable parts, but the code would be very clunky. Is there any way I could achieve this with regex methods? Would appreciate any ideas.
I'm using Python 3.5 with Spyder.
The scenario is just a bit too much for a regular search-and-replace action, so I would recommend passing in a function to dynamically generate the replacement string.
import re

text = 'There goes my {cat;dog} playing in the {street;garden}.'

def replacer(m):
    return m.group(1).replace(';', ' or ')

output = re.sub(r'\{((\w;?)*\w)\}', replacer, text)
print(output)
Output:
There goes my cat or dog playing in the street or garden.
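If the elements can contain spaces (the attempt in the question used [\w\s]+), a slightly looser pattern along the same lines should work; a sketch with my own sample text:

import re

text = 'Meet me in the {front garden;back yard;street}.'
output = re.sub(r'\{([^{}]+)\}', lambda m: m.group(1).replace(';', ' or '), text)
print(output)  # Meet me in the front garden or back yard or street.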

mapping RegEx to additional "instruction" using a dictionary

I looked over some old Java code of mine in which I extracted dates and their formats from a number of strings. It was a terrible mess of if conditions, regex patterns, and matchers.
So I thought about how I would solve this nowadays, and in Python. I have a number of regex patterns which map to a date format, from which subsequently a time stamp is created. I heard "If there is a switch statement in Java, in Python there should be a dictionary":
pattern_dic = {
    r"[\d]{2}:[\d]{2}, .{3} [\d]{1,2}, [\d]{4} \(UTC\)": "HH:mm, MMM dd, yyyy (zzz)",
    r"[\d]{2}:[\d]{2}, [\d]{1,2} .{3} [\d]{4} \(UTC\)": "HH:mm, dd MMM yyyy (zzz)",
    ...
}
(I think that I have to change these date patterns because I just copied them from the Java solution.)
In another problem, where I had regex/replacement pairs, I found a pretty nice solution using the dictionary like this
(courtesy of some brilliant person on Stack Overflow). This works only if the matching regex is a simple string, so it can be looked up in the dictionary (I think).
pattern_acc = re.compile(r'\b(' + '|'.join(pattern_dic.keys()) + r')\b')
comment = pattern_acc.sub(lambda x: pattern_dic[x.group()], comment)
Here is what I came up with so far. My problem is that I don't know how I can get the matching part of the regex to look up in my dictionary ("matching_date_pattern"):
def multi_match(input_string, pattern_dic):
    date_pattern = re.compile(r'\b(' + '|'.join(pattern_dic.keys()) + r')\b')
    matches = date_pattern.findall(input_string)
    date_formats = []
    for match in matches:
        matching_string = match.group()
        date_format = pattern_dic["matching_date_pattern"]
        date_formats.append((matching_string, date_format))
edit:
I should have stated that I would like to solve this as a preliminary problem: I would like to separate the matching and the searching while still being able to access the matching pattern.
Think, for example, of regular expressions consisting of many groups, where the "instructions" they map to become more complex. Imagine, for example, that you expect a lot of different text objects, like links, markdown elements, and so on. My problem at the moment boils down to knowing which pattern matched, between matching and searching.
Maybe the question is also how expensive it is to compile patterns, since compiling them separately of course makes it easier to access them.
The code you snatched from Stack Overflow is good if you want to match any of multiple regexps, but it doesn't solve your problem of finding which of your regexps matched in each particular case. You should rather just iterate over pattern_dic and check every key in turn:
def multi_match(input_string, pattern_dic):
    date_formats = []
    for regexp in pattern_dic:
        match = re.search(regexp, input_string)
        if match:
            matching_string = match.group()
            date_format = pattern_dic[regexp]
            date_formats.append((matching_string, date_format))
    return date_formats
Side remark: .append takes one argument, so it is necessary to form a tuple, hence the additional pair of parentheses.
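A quick usage sketch with the two patterns from the question (the input string is my own example):

import re

pattern_dic = {
    r"[\d]{2}:[\d]{2}, .{3} [\d]{1,2}, [\d]{4} \(UTC\)": "HH:mm, MMM dd, yyyy (zzz)",
    r"[\d]{2}:[\d]{2}, [\d]{1,2} .{3} [\d]{4} \(UTC\)": "HH:mm, dd MMM yyyy (zzz)",
}

print(multi_match("edited 16:30, Apr 3, 2021 (UTC)", pattern_dic))
# [('16:30, Apr 3, 2021 (UTC)', 'HH:mm, MMM dd, yyyy (zzz)')]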

Beginner with regular expressions; need help writing a specific query - space, followed by 1-3 numbers, followed by any number of letters

I'm working with some poorly formatted HTML and I need to find every instance of a certain type of pattern. The issue is as follows:
a space, followed by a 1 to 3 digit number, followed by letters (a word, usually). Here are some examples of what I mean.
hello 7Out
how 99In
are 123May
So I would be looking for the expression to get the "7Out", "99In", "123May", etc. The initial space does not need to be included. I hope this is descriptive enough, as I am literally just starting to expose myself to regular expressions and am still struggling a bit. In the end, I will want to count the total number of these instances and add the total count to a df that already exists, so if you have any suggestions on how to do that I would be open to that as well. Thanks for your help in advance!
Your regular expression will be: r'\w\s(\d{1,3}[a-zA-Z]+)'
So in order to get the count, you can use len() on the list returned by findall. The code will be:
import re

string = 'hello 70qwqeqwfwe123 12wfgtr123 34wfegr123 dqwfrgb'
result = re.findall(r'\w\s(\d{1,3}[a-zA-Z]+)', string)
print("result =", result)  # all the found occurrences as a list
print("len(result) =", len(result))  # the total number of occurrences
The result will be:
result = ['70qwqeqwfwe', '12wfgtr', '34wfegr']
len(result) = 3
Hint: findall evaluates the regular expression and returns results based on grouping; I'm using that to solve this problem.
Try these:
re.findall(r'(\w\s((\d{1,3})[a-zA-Z]+))',string)
re.findall(r'\w\s((\d{1,3})[a-zA-Z]+)',string)
To get an idea about regular expressions, refer to the Python re documentation and Tutorials Point, and use an online regex tester to play with the matching characters.
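As for adding the count to an existing DataFrame, a hedged sketch assuming a pandas DataFrame with a text column (the column names are made up):

import pandas as pd

df = pd.DataFrame({'html_text': ['hello 7Out how 99In', 'are 123May', 'no match here']})
df['n_matches'] = df['html_text'].str.count(r'\w\s\d{1,3}[a-zA-Z]+')
print(df)  # n_matches: 2, 1, 0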

Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

I wondered how you would go about tokenizing strings in English (or other western languages) if whitespaces were removed?
The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'
In the novel, the Sheep Man is translated as saying things like:
"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."
So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.
What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?
Specifically, python-wise, how would you structure a (forgiving) translation flow? I'm not asking for a completed answer, just how your thought process would go about breaking the problem down.
I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers.
Thanks!
I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
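A greedy longest-match variant of that dictionary idea, as a small sketch (toy word set; a real solution would load a full dictionary and still needs the ambiguity handling mentioned above):

words = {'like', 'we', 'said', 'will', 'do', 'what', 'can'}  # toy dictionary

def segment(text):
    out, i = [], 0
    while i < len(text):
        # take the longest dictionary word starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in words:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])  # no word starts here; emit the character
            i += 1
    return out

print(segment('likewesaid'))  # -> ['like', 'we', 'said']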
First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.
Once you have the words, you have two possible approaches:
You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.
On the other hand, you could just take some large corpus of English text and compute the relative probabilities of certain words being next to each other, getting a list of pairs and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.
I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.
This is just some quick code I wrote out that I think would work fairly well to extract words from a snippet like the one you gave... It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged type of solution:
textstring = '"likewesaid, we\'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan\'tdoit-alone. Yougottaworktoo."'

indiv_characters = list(textstring)  # splits the string into individual characters

teststring = ''
sequential_indiv_word_list = []

for cur_char in indiv_characters:
    teststring = teststring + cur_char
    # test teststring against an English dictionary here, e.g. a set of
    # words or an API that returns True/False for dictionary membership
    if in_english_dict(teststring):
        sequential_indiv_word_list.append(teststring)
        teststring = ''

# at the end, assemble a sentence from the pieces of
# sequential_indiv_word_list by putting a space between each word
There are some more issues to be worked out. For example, if the string never yields a match, this would obviously not work, as it would just keep adding more characters forever. However, since your demo string had some spaces, you could have it recognize those too and automatically start over at each of them.
Also, you need to account for punctuation; write conditionals like:
if cur_char == ',' or cur_char == '.':
    # start a new "word" automatically
