I'm trying to parse sentences in Python: for any sentence I get, I should take only the words that appear after the words 'say' or 'ask' (if neither word appears, I should take the whole sentence).
I simply did it with regular expressions:
sen = re.search('(?s)(?<=say|Say).*$', current_game_row["sentence"], re.M | re.I)
(this is only for 'say', but adding 'ask' is not a problem...)
The problem is that if I get a sentence with punctuation such as a comma or colon (,:) after the word 'say', the match includes it too.
Someone suggested that I use nltk tokenization to handle this, but I'm new to Python and don't understand how to use it. I see that nltk has RegexpParser, but I'm not sure how to use it.
Please help me :-)
**
I forgot to mention that I want to recognize 'said', 'asked', etc. too, and I don't want to catch words that merely contain 'say' or 'ask' (I'm not sure there are such words...).
In addition, if there are multiple occurrences of 'say' or 'ask', I only want to catch the first one in the sentence.
**
Everything after a Keyword
We can deal with the unwanted punctuation by using \W+ to consume all non-word characters (punctuation and whitespace) that follow the keyword.
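import re

sentence = "Hearsay? With masked flasks I said: abracadabra"
# Alternation of every keyword form we want to recognize
keys = '|'.join(['ask', 'asks', 'asked', 'say', 'says', 'said'])
# \b...\b matches whole words only; \W+ skips punctuation/space after the keyword
result = re.search(rf'\b({keys})\b\W+(.*)', sentence, re.S | re.I)
if result is None:
    print(sentence)  # no keyword: keep the whole sentence
else:
    print(result.group(2))  # everything after the first keyword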
Output:
abracadabra
case-insensitive: You have the case-insensitive flag re.I, so we can remove the Say permutation.
multi-line: You have the re.M option, which directs ^ to match not only at the start of your string but also right after every \n within it (and $ likewise before every \n). We can drop it since the pattern no longer uses ^ or $.
dot-matches-all: You have (?s) which directs . to match everything including \n. This is the same as applying re.S flag.
re.M and re.S are independent: the first only changes how ^ and $ behave, the second only changes what the dot matches. I think your sentence might be a text blob with newlines inside, so I removed re.M and kept (?s) as re.S.
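To cover the fallback from the question (return the whole sentence when no keyword is present), the search can be wrapped in a small helper. This is only a sketch: the name text_after_keyword and the keyword list are illustrative choices, not something the question prescribes.

```python
import re

# Keyword forms to recognize; extend this list as needed (assumed set).
KEYWORDS = ['ask', 'asks', 'asked', 'say', 'says', 'said']
ALTERNATION = '|'.join(KEYWORDS)
# Whole-word keyword, then at least one non-word character, then the rest.
PATTERN = re.compile(rf'\b(?:{ALTERNATION})\b\W+(.*)', re.S | re.I)

def text_after_keyword(sentence):
    """Return the text after the first keyword, or the sentence itself."""
    match = PATTERN.search(sentence)
    return match.group(1) if match else sentence

print(text_after_keyword("Hearsay? With masked flasks I said: abracadabra"))
print(text_after_keyword("No keyword here"))
```

Because re.search stops at the leftmost match, a sentence with several keywords is split at the first one only.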
Related
I have a string like below:
"i'm just returning from work. *oeee* all and we can go into some detail *oo*. what is it that happened as far as you're aware *aouu*"
with some junk characters like above (highlighted with '*' marks). All I could observe was that the junk characters come as bunches of vowels knit together. Now, I need to remove any word that has a space before and after it, consists only of vowels (like oeee, aouu, etc.) and has a length of 2 or more. How do I achieve this in Python?
Currently, I have built a tuple of replacement pairs like ((" oeee "," "),(" aouu "," ")) and I send it through a for loop with replace. But if the word is 'oeeee', I need to add a new item to the tuple. There must be a better way.
P.S: there will be no '*' in the actual text. I just put it here to highlight.
You need to use re.sub to do a regex replacement in python. You should use this regex:
\b[aeiou]{2,}\b
which will match a sequence of 2 or more vowels in a word by themselves. We use \b to match the boundaries of the word so it will match at the beginning and end of the string (in your string, aouu) as well as words adjacent to punctuation (in your string, oo). If your text may include uppercase vowels too, use the re.I flag to ignore case:
import re
text = "i'm just returning from work. oeee all and we can go into some detail oo. what is it that happened as far as you're aware aouu"
print(re.sub(r'\b[aeiou]{2,}\b', '', text, flags=re.I))
Output
i'm just returning from work. all and we can go into some detail . what is it that happened as far as you're aware
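Removing the words leaves their surrounding spaces behind, for example the stray space before the period after "detail" in the output above. One possible way to tidy this up, assuming it is acceptable to collapse whitespace in your text, is a couple of extra passes. The helper name strip_vowel_words is made up for this sketch.

```python
import re

def strip_vowel_words(text):
    # Drop standalone words made of 2 or more vowels (case-insensitive).
    cleaned = re.sub(r'\b[aeiou]{2,}\b', '', text, flags=re.I)
    # Remove whitespace that now sits directly before punctuation.
    cleaned = re.sub(r'\s+([.,;:!?])', r'\1', cleaned)
    # Collapse any remaining runs of whitespace and trim the ends.
    return re.sub(r'\s{2,}', ' ', cleaned).strip()

print(strip_vowel_words("work. oeee all and detail oo. aware aouu"))
```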
I have been fixing some mistakes across all my procedures. Now I need to find the last "end;" in a procedure and replace it with other text.
I wrote this: (\s.*)(end|END)(.*(;).*)
But it does not work correctly: it also replaces some words in the middle of the text. I'm using the re library from Python.
You can use
result = re.sub(r'(?si)(.*)\bend\b', r'\g<1>some other word', text)
The regex matches
(?si) - an inline re.DOTALL (s) and re.IGNORECASE (i) modifier
(.*) - Group 1: any zero or more chars as many as possible
\bend\b - a whole word end.
The \g<1>some other word replacement is the Group 1 value (I used \g<1> since it will be helpful if your some other word starts with a digit) plus your word.
NOTE: if your some other word can contain literal backslashes, do not forget to double them.
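If doubling backslashes in the replacement is inconvenient, one alternative is to pass a function as the replacement, because its return value is inserted literally. A minimal sketch, where the helper name replace_last_end is hypothetical:

```python
import re

def replace_last_end(text, replacement):
    # The greedy (.*) with DOTALL runs to the end of the text and backtracks
    # to the final whole-word "end" (any letter case).
    # A function replacement is not scanned for backslash escapes.
    return re.sub(r'(?si)(.*)\bend\b',
                  lambda m: m.group(1) + replacement,
                  text)

procedure = "begin\n  if x then\n  end;\nend;"
print(replace_last_end(procedure, r"finish\n"))
```

Only the last end is replaced; the earlier one and the trailing semicolon stay as they are.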
I want to capitalize the first word after a dot in a whole paragraph (str) full of sentences. The problem is that all chars are lowercase.
I tried something like this:
text = "here a long. paragraph full of sentences. what in this case does not work. i am lost"
re.sub(r'(\b\. )([a-zA-z])', r'\1' (r'\2').upper(), text)
I expect something like this:
"Here a long. Paragraph full of sentences. What in this case does not work. I am lost."
You can use re.sub with a lambda:
import re
text = "here a long. paragraph full of sentences. what in this case does not work. i am lost"
result = re.sub(r'(?<=^)\w|(?<=\.\s)\w', lambda x: x.group().upper(), text)
Output:
'Here a long. Paragraph full of sentences. What in this case does not work. I am lost'
Regex Explanation:
(?<=^)\w: matches a word character (letter, digit or underscore) preceded by the start of the string.
(?<=\.\s)\w: matches a word character preceded by a period and a whitespace character.
You can also use the regex (^\s*|\.\s+)([a-z]), which doesn't depend on lookarounds. Lookbehind is not available in every regex dialect (JavaScript, for example, only gained it in ECMAScript 2018), so this form is simpler and more widely supported. Group 1 captures either the zero or more whitespace characters at the start of the string, or a literal dot followed by one or more whitespace characters. Group 2 captures the lowercase letter that follows. The replacement is group 1 unchanged plus group 2 uppercased, built with a lambda expression. Check this Python code:
import re
arr = ['here a long. paragraph full of sentences. what in this case does not work. i am lost',
' this para contains more than one space after period and also has unneeded space at the start of string. here a long. paragraph full of sentences. what in this case does not work. i am lost']
for s in arr:
    print(re.sub(r'(^\s*|\.\s+)([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))
Output,
Here a long. Paragraph full of sentences. What in this case does not work. I am lost
This para contains more than one space after period and also has unneeded space at the start of string. Here a long. Paragraph full of sentences. What in this case does not work. I am lost
And in case you want to get rid of the extra whitespace and reduce it to just one space, take that \s* out of group 1 and use the regex ((?:^|\.\s))\s*([a-z]), with this updated Python code:
import re
arr = ['here a long. paragraph full of sentences. what in this case does not work. i am lost',
' this para contains more than one space after period and also has unneeded space at the start of string. here a long. paragraph full of sentences. what in this case does not work. i am lost']
for s in arr:
    print(re.sub(r'((?:^|\.\s))\s*([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))
You get the following, where extra whitespace is reduced to just one space, which may often be desired:
Here a long. Paragraph full of sentences. What in this case does not work. I am lost
This para contains more than one space after period and also has unneeded space at the start of string. Here a long. Paragraph full of sentences. What in this case does not work. I am lost
Also, if this were done with a regex engine whose replacement syntax supports case conversion (for example a PCRE-based tool that understands \U), you could avoid the lambda function and simply replace with \1\U\2.
Suppose I have the following sentence,
Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!
I'm trying to capture the punctuation (except the apostrophe and hyphen) using regular expressions, but I also want to ignore certain words. For example, I'm ignoring Dr., and so I don't want to capture the . in the word Dr.
Ideally, the regex should capture the text in between the parentheses:
Hi(, )my( )name( )is( )Dr.( )Who(. )I'm( )in( )love( )with( )fish-fingers( )and( )custard( !!)
Note that I have a Python list that contains words like "Dr." that I want to ignore. I'm also using string.punctuation to get a list of punctuation characters to use in the regex. I've tried using negative lookahead but it was still catching the "." in Dr. Any help appreciated!
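You can first throw away all your stop words (like "Dr.") and then all letters and digits.
import re
text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"
# Remove the whole stop words; a character class like [Dr.|Prof.] would instead
# delete the individual characters D, r, ., |, P, o and f
tmp = re.sub(r'\b(?:Dr|Prof)\.', '', text)
print(re.sub(r'[a-zA-Z0-9]+', '', tmp))
Would that work?
It would print (runs of spaces condensed here):
, . ' - !!
The output is the text in between the parentheses in your question, though the apostrophe and hyphen are still included.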
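Since the question mentions a Python list of words to ignore and string.punctuation, another option is to build one pattern from both. In the sketch below, each ignored word is tried first and consumed without being captured, so its period never reaches the punctuation group. The function name punctuation_runs and the choice to treat whitespace as part of each run are assumptions made for illustration.

```python
import re
import string

def punctuation_runs(text, ignore_words, keep="'-"):
    # Punctuation to capture: everything except the characters in `keep`.
    punct = ''.join(ch for ch in string.punctuation if ch not in keep)
    separator = rf'([\s{re.escape(punct)}]+)'
    # Ignored words come first so they win over the separator group.
    alternatives = [re.escape(word) for word in ignore_words] + [separator]
    pattern = re.compile('|'.join(alternatives))
    # Group 1 is None whenever an ignored word matched instead.
    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]

text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"
print(punctuation_runs(text, ["Dr.", "Prof."]))
```

Note that an ignored word is matched wherever its characters occur, so a stricter version might anchor it with a boundary check before the word.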
I understand that \b can represent either the beginning or the end of a word. When would \b be required to represent the end? I'm asking because it seems that it's always necessary to have \s to indicate the end of the word, therefore eliminating the need to have \b. Like in the case below, one with a '\b' to end the inner group, the other without, and they get the same result.
m = re.search(r'(\b\w+\b)\s+\1', 'Cherry tree blooming will begin in in later March')
print(m.group())
m = re.search(r'(\b\w+)\s+\1', 'Cherry tree blooming will begin in in later March')
print(m.group())
\s is just whitespace. You can have word boundaries that aren't whitespace (punctuation, etc.) which is when you need to use \b. If you're only matching words that are delimited by whitespace then you can just use \s; and in that case you don't need the \b.
import re
sentence = 'Non-whitespace delimiters: Commas, semicolons; etc.'
print(re.findall(r'(\b\w+)\s+', sentence))
print(re.findall(r'(\b\w+\b)+', sentence))
Produces:
['whitespace']
['Non', 'whitespace', 'delimiters', 'Commas', 'semicolons', 'etc']
Notice how trying to catch word endings with just \s ends up missing most of them.
Consider wanting to match the word "march":
>>> regex = re.compile(r'\bmarch\b')
It can come at the end of the sentence...
>>> regex.search('I love march')
<_sre.SRE_Match object at 0x10568e4a8>
Or the beginning ...
>>> regex.search('march is a great month')
<_sre.SRE_Match object at 0x10568e440>
But if I don't want to match things like marching, word boundaries are the most convenient:
>>> regex.search('my favorite pass-time is marching')
>>>
You might be thinking "But I can get all of these things using r'\s+march\s+'" and you're kind of right... The difference is in what matches. With the \s+, you also might be including some whitespace in the match (since that's what \s+ means). This can make certain things like search for a word and replace it more difficult because you might have to manage keeping the whitespace consistent with what it was before.
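To make the difference concrete, here is a small comparison of the two patterns used for a replacement. The sample sentence is made up for this sketch.

```python
import re

text = 'in march we march.'

# \b is zero-width, so only the word itself is replaced.
with_boundaries = re.sub(r'\bmarch\b', 'April', text)
# \s+ consumes the surrounding spaces and misses the word before the period.
with_whitespace = re.sub(r'\s+march\s+', 'April', text)

print(with_boundaries)  # in April we April.
print(with_whitespace)  # inAprilwe march.
```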
It's not because it's at the end of the word, it's because you know what comes after the word. In your example:
m = re.search(r'(\b\w+\b)\s+\1', 'Cherry tree blooming will begin in in later March')
...the first \b is necessary to prevent a match starting with the in in begin. The second one is redundant because you're explicitly matching the non-word characters (\s+) that follow the word. Word boundaries are for situations where you don't know what the character on the other side will be, or even if there will be a character there.
Where you should be using another one is at the end of the regex. For example:
m = re.search(r'(\b\w+)\s+\1\b', "Let's go to the theater")
Without the second \b, you would get a false positive for the theater.
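Running both variants on that sentence shows the false positive; this is just the example above spelled out as a script.

```python
import re

sentence = "Let's go to the theater"

# No trailing \b: the backreference matches the "the" inside "theater".
without_boundary = re.search(r'(\b\w+)\s+\1', sentence)
# Trailing \b: the repeated word must end where the match ends.
with_boundary = re.search(r'(\b\w+)\s+\1\b', sentence)

print(without_boundary.group())  # the the
print(with_boundary)             # None
```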
"I understand that \b can represent either the beginning or the end of a word. When would \b be required to represent the end?"
\b is never required to represent the end, or beginning, of a word. To answer your bigger question, it's only useful during development -- when working with natural language, you'll ultimately need to replace \b with something else. Why?
The \b operator matches a word boundary as you've discovered. But a key concept here is, "What is a word?" For \b, a word is a run of \w characters: the narrow set [A-Za-z0-9_] in ASCII mode (in Python 3, str patterns also count Unicode letters and digits unless re.ASCII is set). So a word here is not a natural language word but a computer language identifier. The \b operator exists for a formal language's parser.
This means it doesn't handle common natural language situations like:
The word let's becomes two words, 'let' and 's', if \b represents the boundaries of a word. Also, titles like Mr. and Mrs. lose their period.
Similarly, if \b represents the start of a word, then the apostrophe in these cases will be lost: 'twas 'bout 'cause
Hyphenated words suffer at the hand of \b as well, e.g. mother-in-law (unless you want her to suffer).
Unfortunately, you can't simply augment \b by including it in a character set as it doesn't represent a character. You may be able to combine it with other characters via alternation in a zero-width assertion.
When working with natural language, the \b operator is great for quickly prototyping an idea, but ultimately, probably not what you want. Ditto \w, but, since it represents a character, it's more easily augmented.
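One way to augment \w along these lines is to define your own word-character class and emulate \b with negative lookarounds. The sketch below assumes, purely for illustration, that apostrophes and hyphens count as word characters. WORD_CHARS and whole_word are hypothetical names.

```python
import re

# Assumed definition of a natural-language word character for this sketch.
WORD_CHARS = r"[\w'-]"

def whole_word(word):
    # Zero-width checks: no word character directly before or after the word.
    return re.compile(rf"(?<!{WORD_CHARS}){re.escape(word)}(?!{WORD_CHARS})")

text = "my mother-in-law studies law"
plain = re.search(r"\blaw\b", text)       # matches inside mother-in-law
custom = whole_word("law").search(text)   # matches only the standalone word
tokens = re.findall(rf"{WORD_CHARS}+", "Let's visit my mother-in-law")

print(plain.start(), custom.start())  # 13 25
print(tokens)  # ["Let's", 'visit', 'my', 'mother-in-law']
```

A leading apostrophe as in 'twas is kept too, but so is an opening single quote, so real text would need a more careful definition.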