I have a regular expression in Python that is expected to remove all occurrences of the word "NOTE." together with the sentence that follows it.
How can I do this correctly, and also return all sentences removed that way?
import re
text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
clean_text = re.sub(r"NOTE\..*?(?=\.)", "", text)
Expected result:
clean_text:
The weather is good. The sky is blue. Note that it's a dummy text.
unique_sentences_removed:
["This is the subsequent sentence to be removed.", "This is another subsequent sentence to be removed."]
Stealing The fourth bird's regex, but using re.split so we only need to search once. When the pattern contains a capture group, re.split returns a list alternating between non-matched and matched parts: join the former to get the cleaned text, and the latter are your removals.
import re
pattern = r"\bNOTE\.\s*([^.]*\.)\s*"
text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
parts = re.split(pattern, text)
clean_text = ''.join(parts[::2])
print(clean_text)
unique_sentences_removed = parts[1::2]
print(unique_sentences_removed)
Output:
The weather is good. The sky is blue. Note that it's a dummy text.
['This is the subsequent sentence to be removed.', 'This is another subsequent sentence to be removed.']
Demo
One option to remove the NOTE part is to use a pattern that also matches the dot after the next sentence, followed by optional whitespace chars, instead of only asserting the dot in a lookahead.
If you add a capture group to the pattern, you can use re.findall with the same pattern to return the capture group value.
The pattern matches:
\bNOTE\.\s* Match the word NOTE followed by . and optional whitespace chars
([^.]*\.) Capture group 1, match optional chars other than . and then match the .
\s* Match optional whitespace chars
See the matches and the capture group value at this regex101 demo and a Python demo.
import re
pattern = r"\bNOTE\.\s*([^.]*\.)\s*"
text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
clean_text = re.sub(pattern, "", text)
print(clean_text)
unique_sentences_removed = re.findall(pattern, text)
print(unique_sentences_removed)
Output
The weather is good. The sky is blue. Note that it's a dummy text.
['This is the subsequent sentence to be removed.', 'This is another subsequent sentence to be removed.']
You can capture the removed sentences in one pass using a replacement function with a side-effect that saves the removed sentence:
import re
def clean(text):
    removed = []

    def repl(m):
        removed.append(m.group(1))
        return ''

    clean_text = re.sub(r"NOTE\.\s*(.*?\.)\s*", repl, text)
    return clean_text, removed
text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
result, removed = clean(text)
print(result)
print(removed)
Output:
The weather is good. The sky is blue. Note that it's a dummy text.
['This is the subsequent sentence to be removed.', 'This is another subsequent sentence to be removed.']
Related
Basically, I want to drop all the dots in abbreviations like "L.L.C.", converting them to "LLC". I don't have a list of all the abbreviations; I want to convert them as they are found. This step is performed before sentence tokenization.
text = """
Proligo L.L.C. is a limited liability company.
S.A. is a place.
She works for AAA L.P. in somewhere.
"""
text = re.sub(r"(?:([A-Z])\.){2,}", "\1", text)
This does not work.
I want to remove the dots from the abbreviations so that the dots will not break the sentence tokenizer.
Thank you!
P.S. Sorry for not being clear enough. I edited the sample text.
Try using a callback function with re.sub:
import re

def callback(s):
    # strip the dots from the matched abbreviation
    return s.replace('.', '')

text = "L.L.C., S.A., L.P."
text = re.sub(r"(?:[A-Z]\.)+", lambda m: callback(m.group()), text)
print(text)  # LLC, SA, LP
The regex pattern (?:[A-Z]\.)+ will match any run of dotted capital letters. Then, for each match, the callback function strips off the dots.
import re
string = 'ha.f.d.s.a.s.d.f'
print(re.sub(r'\.', '', string))
#output
hafdsasdf
Note that this only works properly if your text does not contain multiple sentences. If it does, it will create one long sentence, as all '.' are replaced.
Use this regular expression:
>>> re.sub(r"(?<=[A-Z]).", "", text)
'LLC, SA, LP'
>>>
regex101
The answers here are extremely aggressive: any capital alphabetical character followed by a period will be replaced.
I'd recommend:
text = "L.L.C., S.A., L.P."
text = re.sub(r"L\.L\.C\.|S\.A\.|L\.P\.", lambda x: x.group().replace(".", ""), text)
print(text) # => "LLC, SA, LP"
This will only match the abbreviations you're asking for. You can add word boundaries for additional strictness.
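A minimal sketch of that word-boundary variant, reusing the sample text from above (the leading \b keeps a match from starting in the middle of a longer token):

```python
import re

text = "L.L.C., S.A., L.P."
# \b anchors each abbreviation at a word boundary before stripping its dots
result = re.sub(r"\b(?:L\.L\.C\.|S\.A\.|L\.P\.)",
                lambda m: m.group().replace(".", ""), text)
print(result)  # LLC, SA, LP
```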
I'm looking to grab noise text that has a specific pattern in it:
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
I want to remove every whitespace-delimited token in this sentence that contains &#.
result = "this is some text and some more text and some other stuff"
been trying:
re.compile(r'([\s]&#.*?([\s]))').sub(" ", text)
I can't seem to get the first part though.
You may use
\S+&#\S+\s*
See a demo on regex101.com.
In Python:
import re
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
rx = re.compile(r'\S+&#\S+\s*')
text = rx.sub('', text)
print(text)
Which yields
this is some text and some more text and some other stuff
You can use this regex to capture that noise string,
\s+\S*&#\S*\s+
and replace it with a single space.
Here, \s+ matches one or more whitespace characters, then \S* matches zero or more non-whitespace characters sandwiching &#, then \S* again matches zero or more non-whitespace characters, finally followed by \s+, one or more whitespace characters; the whole match is replaced with a single space, giving you your intended string.
Also, if this noise string can appear at the very start or very end of the string, feel free to change \s+ to \s*.
Regex Demo
Python code,
import re
s = 'this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff'
print(re.sub(r'\s+\S*&#\S*\s+', ' ', s))
Prints,
this is some text and some more text and some other stuff
Try this:
import re
text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
result = re.findall(r"[a-zA-Z]+&#[a-zA-Z]+", text)
print(result)
['lskdfmd&#kjansdl', 'sldkf&#lsakjd']
Now remove the words in the result list from the list of all the words.
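That removal step can be sketched like this (a hedged two-step version: find the noise tokens first, then rebuild the text from the remaining words):

```python
import re

text = "this is some text lskdfmd&#kjansdl and some more text sldkf&#lsakjd and some other stuff"
# find the noise tokens, then keep only the words that are not noise
noise = re.findall(r"[a-zA-Z]+&#[a-zA-Z]+", text)
cleaned = " ".join(w for w in text.split() if w not in noise)
print(cleaned)  # this is some text and some more text and some other stuff
```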
Edit 1, suggested by @Jan:
re.sub(r"[a-zA-Z]+&#[a-zA-Z]+", '', text)
output: 'this is some text  and some more text  and some other stuff' (note the doubled spaces left behind)
Edit 2, suggested by @Pushpesh Kumar Rajwanshi:
re.sub(r" [a-zA-Z]+&#[a-zA-Z]+ ", " ", text)
output: 'this is some text and some more text and some other stuff'
I'm looking for regex to get the result below.
The original sentence is:
txt="そう言え"
txt="そう言う"
and expected result is:
output="そう"
output="そう"
What I want to do here is to remove a two-letter word that includes the character "言".
I tried output = re.sub(r"^(?=.*言).*$", "", txt) in Python, but it actually removes the entire sentence. What do I do?
You can use a pattern that matches 言 followed by another word (denoted by \w), so that re.sub can replace the match with an empty string:
re.sub(r"言\w", "", txt)
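A minimal runnable sketch, assuming Python 3, where \w matches Unicode word characters (including kana and kanji) by default:

```python
import re

# 言 plus one following word character is removed in both sample strings
for txt in ("そう言え", "そう言う"):
    print(re.sub(r"言\w", "", txt))  # そう
```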
As the title explains, this regex pattern basically checks the description variable for matching word combinations within set, e.g.:
set = ["oak", "wood"]
then if it finds those 2 words within a 5 word spacing it will return those words. However, I need it to return the matching sentence. So if for example the description was:
description = "...would be a lovely addition to any home. This lovely oak hard wood table comes in a variety of sizes. Another great reason to consider..."
instead of just returning the matching words I want it to return the entire sentence that contains the keywords.
This is what I'm working with at the moment which obviously just returns the matching set pair.
re.findall(r"\b(?:(%s)\W+(?:\w+\W+){0,5}?(%s)|(%s)\W+(?:\w+\W+){0,5}?(%s))\b" % (set[0], set[1], set[1], set[0]), description)
I'm also aware that this pattern will look beyond a single sentence for a match, so it might find a match spanning two different sentences. If possible, I'd also like to find a way that restricts matches to within a single sentence.
I'd appreciate any help I can get with this.
EDIT: Just to clarify my desired output is:
"This lovely oak hard wood table comes in a variety of sizes."
As this is the sentence which contains the matching keyword pair.
Thanks!
As per my comment, some dummy code using nltk (I do not have access to Python right now):
from nltk import sent_tokenize

for sent in sent_tokenize(your_data_here):
    if any(word in sent for word in ['foo', 'bar']):
        # do sth. useful here
        ...
Obviously, you could even apply your initial regex on sent (it's a string after all).
You can use the following RegEx:
print(re.findall(r"(^|(?<=\.))([a-zA-Z0-9\s]*oak[a-zA-Z0-9\s]*wood.*?(?=\.|$)).*?|([a-zA-Z0-9\s]*wood[a-zA-Z0-9\s]*oak.*?(?=\.|$))", description))
where:
r"(^|(?<=\.))" # means start at 'start of string' or right after a '.'
r"([a-zA-Z0-9\s]*oak[a-zA-Z0-9\s]*wood.*?(?=\.|$)).*?" # any letters/numbers/spaces followed by 'oak', then any letters/numbers/spaces, then 'wood', stopping at the first occurrence of '.' or 'end of string'
r"([a-zA-Z0-9\s]*wood[a-zA-Z0-9\s]*oak.*?(?=\.|$))" # same as the previous, but the | (or) alternative matches the wood-oak case
Output:
[('', ' This lovely oak hard wood table comes in a variety of sizes', '')]
Is it a must to use regex? I found it more straightforward to just use the below:
set = ["oak","wood"]
description = "...would be a lovely addition to any home. This lovely oak hard wood table comes in a variety of sizes. Another great reason to consider..."
description2 = "...would be a lovely addition to any home. This is NOT oak however we do make other varieties that use cherry for a different style of hard wood."
def test_result(desc):
    for sent in desc.split(". "):
        # strip a trailing dot so "wood." is still found as the word "wood"
        words = sent.rstrip(".").split(" ")
        if all(s in words for s in set):
            if -5 <= words.index("oak") - words.index("wood") <= 5:
                print(sent)

test_result(description)
test_result(description2)
Result:
This lovely oak hard wood table comes in a variety of sizes
You may try with following regex:
[^.]*?\boak(?:\W+[^\W.]+){0,5}?\W+wood(?:\W+[^\W.]+){0,5}?\W+table(?:\W+[^\W.]+){0,5}?\W+variety[^.]*\.+
Demo with several examples
Explained:
[^.]*? # Anything but a dot, ungreedy
\b oak # First word (with word boundary)
(?:\W+[^\W.]+){0,5}? # Some (0-5) random words: (separator + word except dot) x 5, ungreedy
\W+ wood # Second word. Starts with some separator
(?:\W+[^\W.]+){0,5}? # Again, random words, ungreedy
\W+ table # third word. Starts with some separator
(?:\W+[^\W.]+){0,5}? # Again, random words, ungreedy
\W+ variety # Final required word
[^.]* # The rest of the sentence (non dot characters) up to the end
\.+ # We match the final dot (or ... if more exist)
You can get it to capture the entire sentence by having it look for periods at the ends. You can also have it exclude periods from the search in the middle by replacing \W (match non-word characters) with [^.\w] (match anything that isn't a period or a word character).
"(^|\.)([^.]*\b(?:(%s)[^.\w]+(?:\w+[^.\w]+){0,5}?(%s)|(%s)[^.\w]+(?:\w+[^.\w]+){0,5}?(%s))\b[^.]*)(\.|$)"
The (^|\.) will match the beginning of the input or a period and the (\.|$) will match a period or the end of the input (in case there is input after the last period).
I can't test this in python right now, but it should point you in the right direction even if I have an error or typo.
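A quick sketch of that adjusted pattern in Python, using the question's sample description (variable names here are illustrative):

```python
import re

words = ["oak", "wood"]
description = ("...would be a lovely addition to any home. This lovely oak hard "
               "wood table comes in a variety of sizes. Another great reason to consider...")
# sentence-bounded pattern: group 2 captures the whole matching sentence
pattern = (r"(^|\.)([^.]*\b(?:(%s)[^.\w]+(?:\w+[^.\w]+){0,5}?(%s)"
           r"|(%s)[^.\w]+(?:\w+[^.\w]+){0,5}?(%s))\b[^.]*)(\.|$)"
           % (words[0], words[1], words[1], words[0]))
m = re.search(pattern, description)
if m:
    print(m.group(2).strip())
```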
I'm trying to extract a sentence from a paragraph using regular expressions in python.
Usually the code that I'm testing extracts the sentence correctly, but in the following paragraph the sentence does not get extracted correctly.
The paragraph:
"But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections."
A new type of vaccine?
The code:
def splitParagraphIntoSentences(paragraph):
    import re
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    f = open("bs.txt", 'r')
    text = f.read()
    mylist = []
    sentences = splitParagraphIntoSentences(text)
    for s in sentences:
        mylist.append(s.strip())
    for i in mylist:
        print(i)
When tested with the above paragraph, it gives output exactly the same as the input paragraph, but the output should look like:
But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections
A new type of vaccine
Is there anything wrong with the regular expression?
Riccardo Murri's answer is correct, but I thought I'd throw a bit more light on the subject.
There was a similar question asked with regard to PHP: php sentence boundaries detection. My answer to that question includes handling the exceptions such as "Mr.", "Mrs." and "Jr.". I've adapted that regex to work with Python, (which places more restrictions on lookbehinds). Here is a modified and tested version of your script which uses this new regex:
def splitParagraphIntoSentences(paragraph):
    import re
    sentenceEnders = re.compile(r"""
        # Split sentences on whitespace between them.
        (?:               # Group for two positive lookbehinds.
          (?<=[.!?])        # Either an end of sentence punct,
        | (?<=[.!?]['"])    # or end of sentence punct and quote.
        )                 # End group of two positive lookbehinds.
        (?<!  Mr\.  )     # Don't end sentence on "Mr."
        (?<! Mrs\.  )     # Don't end sentence on "Mrs."
        (?<!  Jr\.  )     # Don't end sentence on "Jr."
        (?<!  Dr\.  )     # Don't end sentence on "Dr."
        (?<! Prof\. )     # Don't end sentence on "Prof."
        (?<!  Sr\.  )     # Don't end sentence on "Sr."
        \s+               # Split on whitespace between sentences.
        """,
        re.IGNORECASE | re.VERBOSE)
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    f = open("bs.txt", 'r')
    text = f.read()
    mylist = []
    sentences = splitParagraphIntoSentences(text)
    for s in sentences:
        mylist.append(s.strip())
    for i in mylist:
        print(i)
You can see how it handles the special cases and it is easy to add or remove them as required. It correctly parses your example paragraph. It also correctly parses the following test paragraph (which includes more special cases):
This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!"
But note that there are other exceptions that can fail which Riccardo Murri has correctly pointed out.
The paragraph you've posted as an example has its first sentence
enclosed in double quotes ", and the closing quote comes immediately
after the full stop: infections."
Your regexp [.!?]\s{1,2} is looking for a period followed by one or
two spaces as sentence terminator, so it won't catch it.
It can be adjusted to cope with this case by allowing for optional
closing quotes:
sentenceEnders = re.compile(r'''[.!?]['"]?\s{1,2}(?=[A-Z])''')
However, with the above regexp you would be removing the end quote
from the sentence. Keeping it is slightly more tricky and can be done
using a look-behind assertion:
sentenceEnders = re.compile(r'''(?<=[.!?]['"\s])\s*(?=[A-Z])''')
Note, however, that there are a lot of cases where a regexp-based splitter
fails, e.g.:
Abbreviations: "In the works of Dr. A. B. Givental ..." --
according to your regexp, this will be incorrectly split after
"Dr.", "A." and "B." (You can adjust the single-letter case,
but you cannot detect an abbreviation unless you hard-code it.)
Use of exclamation marks in the middle of the sentence:
"... when, lo and behold! M. Deshayes himself appeared..."
Use of multiple quote marks and nested quotes, etc.
Yes, there is something wrong: you take the separator into account only if it is followed by one or two spaces and then a capital letter, so the end of the "A new type of vaccine?" sentence won't get matched, for example.
I would not be too restrictive about the spaces either, unless it is intentional (the text might not be well formatted), because e.g. "Hello Lucky Boy!How are you today?" would not get split.
I also do not understand your example: why is only the first sentence enclosed in " ?
Anyway:
>>> Text="""But in the case of malaria infections, dendritic cells and stuff.
A new type of vaccine? My uncle!
"""
>>> Sentences = re.split(r'[?!.][\s]*', Text)
>>> Sentences
['But in the case of malaria infections, dendritic cells and stuff',
'A new type of vaccine',
'My uncle',
'']
You might also filter the empty sentences:
>>> NonemptyS = [ s for s in Sentences if s ]