We have repetitive words like Mr and Mrs in a text, and we would like to add a space before and after the keywords Mr and Mrs. The problem is that the word Mr also occurs inside Mrs. Please assist in solving the query:
Input:
Hi This is Mr.Sam. Hello, this is MrsPamela.Mr.Sam, what is your call about? Mrs.Pamela, I have a question for you.
import re

s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
words = ("Mr", "Mrs")

def add_spaces(string, words):
    for word in words:
        # pattern to match any non-space char before the word
        patt1 = re.compile(r'\S{}'.format(word))
        matches = re.findall(patt1, string)
        for match in matches:
            non_space_char = match[0]
            string = string.replace(match, '{} {}'.format(non_space_char, word))
        # pattern to match any non-space char after the word
        patt2 = re.compile(r'{}\S'.format(word))
        matches = re.findall(patt2, string)
        for match in matches:
            non_space_char = match[-1]
            string = string.replace(match, '{} {}'.format(word, non_space_char))
    return string

print(add_spaces(s, words))
Present Output:
Hi This is Mr .Sam. Hello, this is Mr sPamela. Mr .Sam, what is your call about? Mr s.Pamela, I have a question for you.
Expected Output:
Hi This is Mr .Sam. Hello, this is Mrs Pamela. Mr .Sam, what is your call about? Mrs .Pamela, I have a question for you.
You didn't specify anything after the letter 'r', so your pattern matches any non-space character followed by 'M' and 'r'. This captures any 'Mr' even when it's followed by an 's', as in Mrs, which is why your first pattern adds a space in the middle of Mrs.
A better pattern would be r'\bMr\b'
'\b' matches a word boundary; see the docs for further explanation: https://docs.python.org/3/library/re.html
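Applied to the original problem, a minimal sketch of that fix might look like this (re.sub with word-boundary anchors inserts the spaces, then a second pass collapses doubled spaces; note it only handles whole-word occurrences, so a run-together 'MrsPamela' has no boundary after 'Mrs' and is left alone):

```python
import re

def add_spaces(string, words):
    for word in words:
        # \b anchors at word boundaries, so 'Mr' no longer matches inside 'Mrs'
        string = re.sub(r'\b{}\b'.format(word), ' {} '.format(word), string)
    # collapse the doubled spaces left by adjacent insertions
    return re.sub(' +', ' ', string).strip()

s = "Hi This is Mr.Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about?"
print(add_spaces(s, ("Mr", "Mrs")))
```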
I do not have very extensive knowledge of the re module, but I came up with a solution that extends to any number of words and strings and that works (tested in Python 3), although it is probably verbose and you may find something more optimized and more concise.
On the other hand, the procedure is not very difficult to understand:
- To begin with, the program orders the words list by descending length.
- Then it finds the matches of the longer words first and takes note of the sections where matches were already made, in order not to change them again. (Note that this introduces a limitation, but it is necessary because the program cannot know whether a word in the words variable is allowed to be contained in another; anyway, it does not affect your case.)
- When it has taken note of all matches (in non-blocked parts of the string) for a word, it adds the corresponding spaces and corrects the blocked indexes (they have moved due to the insertion of the spaces).
- Finally, it trims multiple spaces down to single ones.
Note: I used a list for the variable words instead of a tuple
import re

def add_spaces(string, words):
    # Get the length of the longest word
    max_length = 0
    for word in words:
        if len(word) > max_length:
            max_length = len(word)
    print("max_length =", max_length)
    # Order the words by descending length
    ordered_words = []
    i = max_length
    while i > 0:
        for word in words:
            if len(word) == i:
                ordered_words.append(word)
        i -= 1
    print("ordered_words =", ordered_words)
    # Iterate over the words, adding spaces around each match and "blocking"
    # the matched section so it is not modified again
    blocked_sections = []
    for word in ordered_words:
        matches = [match.start() for match in re.finditer(word, string)]
        print("matches of", word, "are:", matches)
        spaces_position_to_add = []
        for match in matches:
            blocked = False
            for blocked_section in blocked_sections:
                if blocked_section[0] <= match <= blocked_section[1]:
                    blocked = True
            if not blocked:
                # Block the section and store the positions to modify later
                blocked_sections.append([match, match + len(word)])
                spaces_position_to_add.append([match, match + len(word) + 1])
        # Add the spaces and update the existing blocked_sections
        spaces_added = 0
        for new_space in spaces_position_to_add:
            # Add a space before and after the word (new_space[1] already
            # accounts for the space inserted before the word)
            string = string[:new_space[0] + spaces_added] + " " + string[new_space[0] + spaces_added:]
            string = string[:new_space[1] + spaces_added] + " " + string[new_space[1] + spaces_added:]
            spaces_added += 2
            # Update the existing blocked_sections
            for blocked_section in blocked_sections:
                if new_space[0] < blocked_section[0]:
                    blocked_section[0] += 2
                    blocked_section[1] += 2
    # Trim extra spaces
    string = re.sub(' +', ' ', string)
    return string

### MAIN ###
if __name__ == '__main__':
    s = "Hi This is Mr Sam. Hello, this is Mrs.Pamela.Mr.Sam, what is your call about? Mrs. Pamela, I have a question for you."
    words = ["Mr", "Mrs"]
    print(s)
    print(add_spaces(s, words))
I want to calculate the occurrences of a given word in an article. I tried to use the split method to cut the article into pieces and calculate the length, like this:
def get_occur(str, word):
    lst = str.split(word)
    return len(lst) - 1
But the problem is that I always over-count when the word is a substring of another word. For example, I only want to count the number of "sad" in the sentence "I am very sad and she is a saddist". It should be one, but because "sad" is part of "saddist", I count it accidentally. If I use " sad " instead, I omit words at the start and end of sentences. Plus, I am dealing with a huge number of articles, so it is most desirable that I don't have to compare each word. How can I address this? Much appreciated.
You can use regular expressions:
import re
def count(text, pattern):
    return len(re.findall(rf"\b{pattern}\b", text, flags=re.IGNORECASE))
\b marks word boundaries and the passed flag makes the matching case insensitive:
>>> count("Sadly, the SAD man is sad.", "sad")
2
If you want to only count lower-case occurrences, just omit the flag.
As mentioned by @schwobaseggl in the comments, this will miss the word before a comma, and there may be other such cases, so I have updated the answer.
from nltk.tokenize import word_tokenize
text = word_tokenize(text)
This will give you a list of words. Now use the below code
count = 0
for word in text:
    if word.lower() == 'sad':  # .lower() to make it case-insensitive
        count += 1
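If pulling in nltk is too heavy for your corpus size, a plain-regex tokenizer plus collections.Counter gives the same whole-word count in one pass per article (a minimal sketch; `count_word` is a made-up name, and the `\w+` tokenization is cruder than nltk's):

```python
import re
from collections import Counter

def count_word(text, word):
    # \w+ grabs runs of word characters, so punctuation never sticks to a token;
    # lower-casing both sides makes the count case-insensitive
    tokens = Counter(re.findall(r"\w+", text.lower()))
    return tokens[word.lower()]

print(count_word("I am very sad and she is a saddist. Sad!", "sad"))  # 2
```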
Hi, I am new to regex and stuck on this question.
Q- Identify all of words that look like names in the sentence. In other words, those which are capitalized but aren't the first word in the sentence.
sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
Here's what I did, but I'm not getting any output (the idea was to skip the text from the beginning until reaching a capitalized word that is a name):
p = re.compile(r'[^A-Z]\w+[A-Z]\w+')
m = p.finditer(sentence)
for m in m:
    print(m)
Assuming there's always only one space after a dot before another sentence begins, you can use a negative lookbehind pattern to exclude names that are preceded by a dot and a space, and another negative lookbehind pattern to exclude the beginning of the string. Also use \b to ensure that a captial letter is matched at a word boundary:
re.findall(r'(?<!\. )(?<!^)\b[A-Z]\w*', sentence)
This returns:
['Harry', 'Susy']
You can use a positive lookbehind to look for the capitalization pattern of a word that is not at the beginning of a sentence.
Like so:
>>> sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
>>> re.findall(r'(?<=[a-z,][ ])([A-Z][a-z]*)', sentence)
['Harry', 'Susy']
Imo best done with nltk:
from nltk import sent_tokenize, word_tokenize
sentence = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
for sent in sent_tokenize(sentence):
    words = word_tokenize(sent)
    possible_names = [word for word in words[1:] if word[0].isupper()]
    print(possible_names)
Or - if you're into comprehensions:
names = [word
         for sent in sent_tokenize(sentence)
         for word in word_tokenize(sent)[1:]
         if word[0].isupper()]
Which will yield
['Harry', 'Susy']
You're overwriting your m variable. Try this:
p = re.compile(r'[^A-Z]\w+[A-Z]\w+')
for m in p.finditer(sentence):
    print(m)
I have a bunch of documents and I'm interested in finding mentions of clinical trials. These are always denoted by the letters being in all caps (e.g. ASPIRE). I want to match any word in all caps, greater than three letters. I also want the surrounding +- 4 words for context.
Below is what I currently have. It kind of works, but fails the test below.
import re
pattern = r'((?:\w*\s*){,4})\s*([A-Z]{4,})\s*((?:\s*\w*){,4})'
line = r"Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."
re.findall(pattern, line)
You can use this Python code, which does it in two steps. First we split the input on 4+ letter capitalized words, and then we find up to 4 words on either side of each match.
import re

text = 'Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY'
re1 = r'\b([A-Z]{4,})\b'
re2 = r'(?:\s*\w+\b){,4}'

arr = re.split(re1, text)
result = []
for i in range(len(arr)):
    if i % 2:
        result.append((re.search(re2, arr[i - 1]).group(), arr[i], re.search(re2, arr[i + 1]).group()))
print(result)
Output:
[('Lorem', 'IPSUM', ' is simply'), (' is simply', 'DUMMY', ' text of the printing'), (' text of the printing', 'INDUSTRY', '')]
Would the following regex work for you?
(\b\w+\b\W*){,4}[A-Z]{3,}\W*(\b\w+\b\W*){,4}
Tested here: https://regex101.com/r/nTzLue/1/
On the left side you could match any word character \w+ one or more times, followed by any non-word character \W+ one or more times. Combine those two in a non-capturing group and repeat it 4 times, like (?:\w+\W+){4}.
Then capture 3 or more uppercase characters in a group: ([A-Z]{3,}).
On the right side you can then reverse the order of the word and non-word matching from the left side: (?:\W+\w+){4}.
(?:\w+\W+){4}([A-Z]{3,})(?:\W+\w+){4}
The captured group will contain your uppercase word and the non-capturing groups will contain the surrounding words.
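Note that with a single pattern like this, re.findall consumes the context words, so two all-caps words fewer than four words apart cannot both keep their full context. One way around that (a different approach from the regex above, with hypothetical names) is to tokenize once and slice up to four words on either side of each all-caps hit:

```python
import re

def caps_with_context(text, n=4):
    # split on whitespace; punctuation stays attached to the tokens
    words = text.split()
    results = []
    for i, w in enumerate(words):
        core = w.strip('.,;:!?')
        # an all-caps token of four or more letters counts as a trial name
        if re.fullmatch(r'[A-Z]{4,}', core):
            before = ' '.join(words[max(0, i - n):i])
            after = ' '.join(words[i + 1:i + 1 + n])
            results.append((before, core, after))
    return results

line = "Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."
for before, word, after in caps_with_context(line):
    print(before, '|', word, '|', after)
```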
This should do the job:
pattern = r'(?:(\w+ ){4})[A-Z]{3}(\w+ ){5}'
I'm trying to extract a sentence from a paragraph using regular expressions in python.
Usually the code that I'm testing extracts the sentence correctly, but in the following paragraph the sentence does not get extracted correctly.
The paragraph:
"But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections."
A new type of vaccine?
The code:
def splitParagraphIntoSentences(paragraph):
    import re
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    f = open("bs.txt", 'r')
    text = f.read()
    mylist = []
    sentences = splitParagraphIntoSentences(text)
    for s in sentences:
        mylist.append(s.strip())
    for i in mylist:
        print(i)
When tested with the above paragraph, it gives output exactly the same as the input paragraph, but the output should look like:
But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections
A new type of vaccine
Is there anything wrong with the regular expression?
Riccardo Murri's answer is correct, but I thought I'd throw a bit more light on the subject.
There was a similar question asked with regard to PHP: php sentence boundaries detection. My answer to that question includes handling exceptions such as "Mr.", "Mrs." and "Jr.". I've adapted that regex to work with Python (which places more restrictions on lookbehinds). Here is a modified and tested version of your script which uses this new regex:
def splitParagraphIntoSentences(paragraph):
    import re
    sentenceEnders = re.compile(r"""
        # Split sentences on whitespace between them.
        (?:                   # Group for two positive lookbehinds.
            (?<=[.!?])        # Either an end of sentence punct,
          | (?<=[.!?]['"])    # or end of sentence punct and quote.
        )                     # End group of two positive lookbehinds.
        (?<!  Mr\.   )        # Don't end sentence on "Mr."
        (?<!  Mrs\.  )        # Don't end sentence on "Mrs."
        (?<!  Jr\.   )        # Don't end sentence on "Jr."
        (?<!  Dr\.   )        # Don't end sentence on "Dr."
        (?<!  Prof\. )        # Don't end sentence on "Prof."
        (?<!  Sr\.   )        # Don't end sentence on "Sr."
        \s+                   # Split on whitespace between sentences.
        """,
        re.IGNORECASE | re.VERBOSE)
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    f = open("bs.txt", 'r')
    text = f.read()
    mylist = []
    sentences = splitParagraphIntoSentences(text)
    for s in sentences:
        mylist.append(s.strip())
    for i in mylist:
        print(i)
You can see how it handles the special cases and it is easy to add or remove them as required. It correctly parses your example paragraph. It also correctly parses the following test paragraph (which includes more special cases):
This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!"
But note that there are other exceptions that can fail, as Riccardo Murri has correctly pointed out.
The paragraph you've posted as an example has its first sentence enclosed in double quotes ", and the closing quote comes immediately after the full stop: infections."
Your regexp [.!?]\s{1,2} is looking for a period followed by one or two spaces as the sentence terminator, so it won't catch it.
It can be adjusted to cope with this case by allowing for optional closing quotes:
sentenceEnders = re.compile(r'''[.!?]['"]?\s{1,2}(?=[A-Z])''')
However, with the above regexp you would be removing the end quote from the sentence. Keeping it is slightly more tricky and can be done using a look-behind assertion:
sentenceEnders = re.compile(r'''(?<=[.!?]['"\s])\s*(?=[A-Z])''')
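A quick sanity check of the look-behind version (the sample sentence here is made up; Python 3.7+ is assumed, since one of the split points is a zero-width match):

```python
import re

# Look behind for end-of-sentence punctuation plus a quote or whitespace,
# so the closing quote stays attached to its sentence.
sentenceEnders = re.compile(r'''(?<=[.!?]['"\s])\s*(?=[A-Z])''')

text = 'He said "this is it." A new type of vaccine? Yes!'
print(sentenceEnders.split(text))
```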
Note, however, that there are a lot of cases where a regexp-based splitter fails, e.g.:
- Abbreviations: "In the works of Dr. A. B. Givental ..." -- according to your regexp, this will be incorrectly split after "Dr.", "A." and "B." (You can adjust for the single-letter case, but you cannot detect an abbreviation unless you hard-code it.)
- Use of exclamation marks in the middle of the sentence: "... when, lo and behold! M. Deshayes himself appeared..."
- Use of multiple quote marks and nested quotes, etc.
Yes, there is something wrong: you take the separator into account only if it is followed by one or two spaces and then a capital letter, so the end of the "A new type of vaccine?" sentence won't get matched, for example.
I would not be too restrictive about the spaces either, unless that is intentional (the text might not be well formatted), because e.g. "Hello Lucky Boy!How are you today?" would not get split.
I also do not understand your example: why is only the first sentence enclosed in double quotes?
Anyway:
>>> Text="""But in the case of malaria infections, dendritic cells and stuff.
A new type of vaccine? My uncle!
"""
>>> Sentences = re.split(r'[?!.]\s*', Text)
>>> Sentences
['But in the case of malaria infections, dendritic cells and stuff',
'A new type of vaccine',
'My uncle',
'']
You might also filter out the empty sentences:
>>> NonemptyS = [s for s in Sentences if s]
I would like to replace strings like 'HDMWhoSomeThing' with 'HDM Who Some Thing' using regex.
So I would like to extract words that start with an upper-case letter or consist of upper-case letters only. Notice that in the string 'HDMWho' the last upper-case letter is in fact the first letter of the word 'Who' and should not be included in the word 'HDM'.
What is the correct regex to achieve this goal? I have tried many regexes similar to [A-Z][a-z]+ but without success. [A-Z][a-z]+ gives me 'Who Some Thing' - without 'HDM', of course.
Any ideas?
Thanks,
Rukki
#!/usr/bin/env python
import re
from collections import deque

pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
chunks = deque(re.split(pattern, 'HDMWhoSomeMONKEYThingXYZ'))
result = []
while len(chunks):
    buf = chunks.popleft()
    if len(buf) == 0:
        continue
    if re.match(r'^[A-Z]$', buf) and len(chunks):
        buf += chunks.popleft()
    result.append(buf)
print(' '.join(result))
Output:
HDM Who Some MONKEY Thing XYZ
Judging by lines of code, this task is a much more natural fit for re.findall:
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z][a-z]*)'
print(' '.join(re.findall(pattern, 'HDMWhoSomeMONKEYThingX')))
Output:
HDM Who Some MONKEY Thing X
Try to split with this regular expression:
/(?=[A-Z][a-z])/
And if your regular expression engine does not support splitting empty matches, try this regular expression to put spaces between the words:
/([A-Z])(?![A-Z])/
Replace it with " $1" (space plus match of the first group). Then you can split at the space.
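For Python's re module specifically, re.split has accepted zero-width patterns since Python 3.7, so the lookahead split works directly (older versions either skip the empty matches or raise ValueError for them):

```python
import re

# split before every capital letter that starts a Title-case word
print(re.split(r'(?=[A-Z][a-z])', 'HDMWhoSomeThing'))
# ['HDM', 'Who', 'Some', 'Thing']
```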
One-liner:
' '.join(a or b for a,b in re.findall('([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))',s))
using regexp
([A-Z][a-z]+)|(?:([A-Z]*)(?=[A-Z]))
So 'words' in this case are:
- any number of uppercase letters, unless the last uppercase letter is followed by a lowercase letter;
- one uppercase letter followed by any number of lowercase letters.
so try:
([A-Z]+(?![a-z])|[A-Z][a-z]*)
The first alternation includes a negative lookahead (?![a-z]), which handles the boundary between an all-caps word and an initial caps word.
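A quick check of that pattern (the lookahead makes the greedy [A-Z]+ give back the final capital when it actually starts the next Title-case word):

```python
import re

pattern = r'[A-Z]+(?![a-z])|[A-Z][a-z]*'

print(' '.join(re.findall(pattern, 'HDMWhoSomeThing')))     # HDM Who Some Thing
print(' '.join(re.findall(pattern, 'SomeHDMWhoThingXYZ')))  # Some HDM Who Thing XYZ
```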
Maybe '[A-Z]*?[A-Z][a-z]+'?
Edit: This seems to work: [A-Z]{2,}(?![a-z])|[A-Z][a-z]+
import re

def find_stuff(s):
    p = re.compile(r'[A-Z]{2,}(?![a-z])|[A-Z][a-z]+')
    m = p.findall(s)
    result = ''
    for x in m:
        result += x + ' '
    print(result)

find_stuff('HDMWhoSomeThing')
find_stuff('SomeHDMWhoThing')
find_stuff('SomeHDMWhoThing')
Prints out:
HDM Who Some Thing
Some HDM Who Thing