A regex for extracting sentences from a paragraph in Python


I'm trying to extract a sentence from a paragraph using regular expressions in python.
Usually the code that I'm testing extracts the sentence correctly, but in the following paragraph the sentence does not get extracted correctly.
The paragraph:
"But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections."
A new type of vaccine?
The code:
import re

def splitParagraphIntoSentences(paragraph):
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    with open("bs.txt", 'r') as f:
        text = f.read()
    mylist = []
    sentences = splitParagraphIntoSentences(text)
    for s in sentences:
        mylist.append(s.strip())
    for i in mylist:
        print(i)
When tested with the above paragraph, the output is exactly the same as the input paragraph, but it should look like this:
But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections
A new type of vaccine
Is there anything wrong with the regular expression?

Riccardo Murri's answer is correct, but I thought I'd throw a bit more light on the subject.
There was a similar question asked with regard to PHP: php sentence boundaries detection. My answer to that question includes handling exceptions such as "Mr.", "Mrs." and "Jr.". I've adapted that regex to work with Python (which places more restrictions on lookbehinds). Here is a modified and tested version of your script which uses this new regex:
import re

def splitParagraphIntoSentences(paragraph):
    sentenceEnders = re.compile(r"""
        # Split sentences on whitespace between them.
        (?:               # Group for two positive lookbehinds.
          (?<=[.!?])      # Either an end of sentence punct,
        | (?<=[.!?]['"])  # or end of sentence punct and quote.
        )                 # End group of two positive lookbehinds.
        (?<!  Mr\.   )    # Don't end sentence on "Mr."
        (?<!  Mrs\.  )    # Don't end sentence on "Mrs."
        (?<!  Jr\.   )    # Don't end sentence on "Jr."
        (?<!  Dr\.   )    # Don't end sentence on "Dr."
        (?<!  Prof\. )    # Don't end sentence on "Prof."
        (?<!  Sr\.   )    # Don't end sentence on "Sr."
        \s+               # Split on whitespace between sentences.
        """,
        re.IGNORECASE | re.VERBOSE)
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    with open("bs.txt", 'r') as f:
        text = f.read()
    mylist = []
    sentences = splitParagraphIntoSentences(text)
    for s in sentences:
        mylist.append(s.strip())
    for i in mylist:
        print(i)
You can see how it handles the special cases and it is easy to add or remove them as required. It correctly parses your example paragraph. It also correctly parses the following test paragraph (which includes more special cases):
This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!"
But note that there are other exceptions that can fail which Riccardo Murri has correctly pointed out.
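As a quick check of the pattern above, here is a minimal sketch that applies a compact, non-verbose form of the same lookbehinds to the test paragraph. The names SENTENCE_SPLIT and split_sentences are illustrative only and do not come from the original script.

```python
import re

# Compact form of the verbose pattern above (same lookbehinds, same exceptions).
SENTENCE_SPLIT = re.compile(
    r"""(?:(?<=[.!?])|(?<=[.!?]['"]))"""
    r"(?<!Mr\.)(?<!Mrs\.)(?<!Jr\.)(?<!Dr\.)(?<!Prof\.)(?<!Sr\.)"
    r"\s+",
    re.IGNORECASE,
)

def split_sentences(paragraph):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in SENTENCE_SPLIT.split(paragraph) if s.strip()]

test_paragraph = (
    'This is sentence one. Sentence two! Sentence three? Sentence "four". '
    'Sentence "five"! Sentence "six"? Sentence "seven." Sentence \'eight!\' '
    'Dr. Jones said: "Mrs. Smith you have a lovely daughter!"'
)
for sentence in split_sentences(test_paragraph):
    print(sentence)
```

Because "Dr." and "Mrs." are in the exception list, the last sentence should stay in one piece, giving nine sentences in total.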

The paragraph you've posted as an example has its first sentence
enclosed in double quotes ", and the closing quote comes immediately
after the full stop: infections."
Your regexp [.!?]\s{1,2} is looking for a period followed by one or
two spaces as sentence terminator, so it won't catch it.
It can be adjusted to cope with this case by allowing for optional
closing quotes:
sentenceEnders = re.compile(r'''[.!?]['"]?\s{1,2}(?=[A-Z])''')
However, with the above regexp you would be removing the end quote
from the sentence. Keeping it is slightly more tricky and can be done
using a look-behind assertion:
sentenceEnders = re.compile(r'''(?<=[.!?]['"\s])\s*(?=[A-Z])''')
Note, however, that there are a lot of cases where a regexp-based splitter
fails, e.g.:
Abbreviations: "In the works of Dr. A. B. Givental ..." --
according to your regexp, this will be incorrectly split after
"Dr.", "A." and "B." (You can adjust the single-letter case,
but you cannot detect an abbreviation unless you hard-code it.)
Use of exclamation marks in the middle of the sentence:
"... when, lo and behold! M. Deshayes himself appeared..."
Use of multiple quote marks and nested quotes, etc.
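The following sketch tries the look-behind variant on the paragraph from the question and on the abbreviation example, to show both the fix and the failure mode. It assumes Python 3.7 or later, where re.split also splits on empty matches; the variable names are illustrative.

```python
import re

# Look-behind variant from above: the closing quote stays with its sentence.
sentence_enders = re.compile(r'''(?<=[.!?]['"\s])\s*(?=[A-Z])''')

paragraph = (
    '"But in the case of malaria infections and sepsis, dendritic cells '
    'throughout the body are concentrated on alerting the immune system, '
    'which prevents them from detecting and responding to any new infections."\n'
    'A new type of vaccine?'
)
sentences = [s.strip() for s in sentence_enders.split(paragraph) if s.strip()]
print(sentences)

# The abbreviation case is split after "Dr.", "A." and "B.", as described above.
pieces = [s.strip() for s in sentence_enders.split("In the works of Dr. A. B. Givental ...")]
print(pieces)
```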

Yes, there is something wrong. You take the separator into account only if it is followed by one or two spaces and then a capital letter, so the end of the "A new type of vaccine?" sentence won't get matched, for example.
I would not be too restrictive about the spaces either, unless that is intentional (the text might not be well formatted), because e.g. "Hello Lucky Boy!How are you today?" would not get split.
I also do not understand your example: why is only the first sentence enclosed in " ?
Anyway:
>>> Text="""But in the case of malaria infections, dendritic cells and stuff.
A new type of vaccine? My uncle!
"""
>>> Sentences = re.split('[?!.][\s]*',Text)
>>> Sentences
['But in the case of malaria infections, dendritic cells and stuff',
'A new type of vaccine',
'My uncle',
'']
You might also filter the empty sentences:
>>> NonemptyS = [ s for s in Sentences if s ]

Related

How to retrieve the text removed by regex sub?

I have a regex expression in Python that is expected to remove all occurences of the word "NOTE." and the following sentence.
How can I do it correctly and also return all sentences being removed that way?
import re
text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
clean_text = re.sub("NOTE\..*?(?=\.)", "", text)
Expected result:
clean_text:
The weather is good. The sky is blue. Note that it's a dummy text.
unique_sentences_removed:
["This is the subsequent sentence to be removed.", "This is another subsequent sentence to be removed."]

Stealing The fourth bird's regex but using re.split so we only need to search once. It returns a list alternating between non-matched and matched parts. Join the former to get the text, and the latter are your removals.
import re
pattern = r"\bNOTE\.\s*([^.]*\.)\s*"
text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
parts = re.split(pattern, text)
clean_text = ''.join(parts[::2])
print(clean_text)
unique_sentences_removed = parts[1::2]
print(unique_sentences_removed)
Output:
The weather is good. The sky is blue. Note that it's a dummy text.
['This is the subsequent sentence to be removed.', 'This is another subsequent sentence to be removed.']
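The alternating layout relied on above comes from the capture group: when the pattern contains exactly one group, re.split returns the captured text between the unmatched pieces. A tiny illustration with a made-up input string:

```python
import re

# One capturing group in the pattern makes re.split keep the captured text.
parts = re.split(r"(\d)", "a1b2c")
print(parts)        # ['a', '1', 'b', '2', 'c']
print(parts[::2])   # unmatched pieces: ['a', 'b', 'c']
print(parts[1::2])  # captured pieces:  ['1', '2']
```

With more than one capture group the stride would change, so the [::2] and [1::2] slices only apply to single-group patterns like the one used here.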

One option to remove the NOTE part is to use a pattern that also matches the dot ending the following sentence, followed by optional whitespace chars, instead of only asserting the dot.
If you add a capture group to the pattern, you can use re.findall with the same pattern to return the capture group value.
The pattern matches:
\bNOTE\.\s* Match the word NOTE followed by . and optional whitespace chars
([^.]*\.) Capture group 1, match optional chars other than . and then match the .
\s* Match optional whitespace chars
See the matches and the capture group value at this regex101 demo and a Python demo.
import re
pattern = r"\bNOTE\.\s*([^.]*\.)\s*"
text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
clean_text = re.sub(pattern, "", text)
print(clean_text)
unique_sentences_removed = re.findall(pattern, text)
print(unique_sentences_removed)
Output
The weather is good. The sky is blue. Note that it's a dummy text.
['This is the subsequent sentence to be removed.', 'This is another subsequent sentence to be removed.']

You can capture the removed sentences in one pass using a replacement function with a side-effect that saves the removed sentence:
import re

def clean(text):
    removed = []

    def repl(m):
        # Side effect: remember the sentence that is being removed
        removed.append(m.group(1))
        return ''

    clean_text = re.sub(r"NOTE\.\s*(.*?\.)\s*", repl, text)
    return clean_text, removed

text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
result, removed = clean(text)
print(result)
print(removed)
Output:
The weather is good. The sky is blue. Note that it's a dummy text.
['This is the subsequent sentence to be removed.', 'This is another subsequent sentence to be removed.']

Filter out words depending on surrounding punctuation

Objective:
I'm looking for a way to match or skip words based on whether or not they are surrounded by quotations marks ' ', guillemets « » or parentheses ( ).
Examples of desired results:
len(re.findall("my word", "blablabla 'my word' blablabla")) should return 0 because linguistically speaking my word =/= 'my word' and hence shouldn't be matched;
len(re.findall("'my word'", "blablabla 'my word' blablabla")) should return 1 because linguistically speaking 'my word' = 'my word' and hence should be matched;
But here's the catch — both len(re.findall("my word", "blablabla «my word» blablabla")) and len(re.findall("my word", "blablabla (my word) blablabla")) should return 1.
My attempt:
I have the following expression (correct me if I'm wrong) at my disposal but am clueless as to how to implement it: (?<!\w)'[^ ].*?\w*?[^ ]'
I wish to make the following code len(re.findall(r'(?<!\w)'+re.escape(myword)+r'(?!\w)', sentence)) – whose aim is to strip out punctuation marks I believe – take into account all of the aforementioned situations.
For now, my code detects my word inside of 'my word' which is not what I want.
Thanks in advance!

I think one strategy is to use a negative look-ahead:
my_word = "word"
r"(?!'" + my_word + "')[^']" + my_word
This should do the job as you can check here.
Since negative look-ahead does not consume characters, to prevent a match you need to use [^'] to ensure the quotation mark ' is not an allowed character preceding your my_word. The ^ starting an enumeration of characters means precisely that.
If you want to expand the list of quotation marks that should cause the word not to be counted as found, it is enough to change ' into a list of disallowed characters:
r"(?!['`]" + my_word + "['`])[^'`]" + my_word
It is worth noting that the example from @Prasanna's question is going to be impossible to match using regex. You would need to use a proper parser, e.g. pyparsing, to handle such situations, because regular expressions cannot handle a match that requires two arbitrary but equal counts of characters (e.g. any number of 'a' followed by the same number of 'b' letters). It will not be possible to create a generic regular expression with a look-ahead that handles n words followed by myword and at the same time skips n words if they are preceded by a quotation mark.
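For the simple case in the question (single quotes directly around the search term), a look-behind plus look-ahead on the quote character may be enough. This is only a sketch under that assumption; the helper name count_unquoted is made up for illustration, and an apostrophe directly after the term (as in a possessive) would also block the match.

```python
import re

def count_unquoted(word, sentence):
    """Count occurrences of word that do not touch a single quote or a word character."""
    pattern = r"(?<!['\w])" + re.escape(word) + r"(?!['\w])"
    return len(re.findall(pattern, sentence))

print(count_unquoted("my word", "blablabla 'my word' blablabla"))    # 0
print(count_unquoted("'my word'", "blablabla 'my word' blablabla"))  # 1
print(count_unquoted("my word", "blablabla «my word» blablabla"))    # 1
print(count_unquoted("my word", "blablabla (my word) blablabla"))    # 1
```

Guillemets and parentheses are neither quotes nor word characters, so they do not prevent a match, which is the behaviour the question asks for.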

How can I modify this regex pattern so that once it finds a match it returns the whole sentence, not just the words it matched?

As the title explains this regex pattern basically checks the description variable for matching word combinations within set, eg:
set = ["oak", "wood"]
then if it finds those 2 words within a 5 word spacing it will return those words. However, I need it to return the matching sentence. So if for example the description was:
description = "...would be a lovely addition to any home. This lovely oak hard wood table comes in a variety of sizes. Another great reason to consider..."
instead of just returning the matching words I want it to return the entire sentence that contains the keywords.
This is what I'm working with at the moment which obviously just returns the matching set pair.
re.findall(r"\b(?:(%s)\W+(?:\w+\W+){0,5}?(%s)|(%s)\W+(?:\w+\W+){0,5}?(%s))\b" % (set[0], set[1], set[1], set[0]), description)
I'm also aware that this pattern can look beyond a single sentence, so it may find a match that spans two different sentences. If possible, I'd also like to restrict matches to a single sentence.
I'd appreciate any help I can get with this.
EDIT: Just to clarify my desired output is:
"This lovely oak hard wood table comes in a variety of sizes."
As this is the sentence which contains the matching keyword pair.
Thanks!

As per my comment some dummy code using nltk (do not have access to Python right now):
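from nltk import sent_tokenize

for sent in sent_tokenize(your_data_here):
    # any(['foo', 'bar']) in sent would raise a TypeError; test each word instead
    if any(word in sent for word in ['foo', 'bar']):
        # do sth. useful here
        pass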
Obviously, you could even apply your initial regex on sent (it's a string after all).

You can use the following RegEx:
print(re.findall(r"(^|(?<=\.))([a-zA-Z0-9\s]*oak[a-zA-Z0-9\s]*wood.*?(?=\.|$)).*?|([a-zA-Z0-9\s]*wood[a-zA-Z0-9\s]*oak.*?(?=\.|$))", description))
where:
r"(^|(?<=\.))" # means start with 'start of string' or '.'
r"([a-zA-Z0-9\s]*oak[a-zA-Z0-9\s]*wood.*?(?=\.|$)).*?" # means any letters/numbers/spaces followed by 'oak', followed by any letters/numbers/spaces, followed by 'wood', stopping at the first occurrence of a '.' or the end of the string
r"([a-zA-Z0-9\s]*wood[a-zA-Z0-9\s]*oak.*?(?=\.|$))" # same as previous, but with | (or) condition matches the wood-oak case
Output:
('', ' This lovely oak hard wood table comes in a variety of sizes',
'')

Is it a must to use regex? I found it more straightforward to just use the below:
set = ["oak","wood"]
description = "...would be a lovely addition to any home. This lovely oak hard wood table comes in a variety of sizes. Another great reason to consider..."
description2 = "...would be a lovely addition to any home. This is NOT oak however we do make other varieties that use cherry for a different style of hard wood."

def test_result(desc):
    for sent in desc.split(". "):
        # Strip punctuation from each word so "wood." still counts as "wood"
        words = [w.strip(".,") for w in sent.split(" ")]
        if all(s in words for s in set):
            if -5 <= words.index("oak") - words.index("wood") <= 5:
                print(sent)

test_result(description)
test_result(description2)
Result:
This lovely oak hard wood table comes in a variety of sizes

You may try with following regex:
[^.]*?\boak(?:\W+[^\W.]+){0,5}?\W+wood(?:\W+[^\W.]+){0,5}?\W+table(?:\W+[^\W.]+){0,5}?\W+variety[^.]*\.+
Explained:
[^.]*? # Anything but a dot, ungreedy
\b oak # First word (with word boundary)
(?:\W+[^\W.]+){0,5}? # Some (0-5) random words: (separator + word except dot) x 5, ungreedy
\W+ wood # Second word. Starts with some separator
(?:\W+[^\W.]+){0,5}? # Again, random words, ungreedy
\W+ table # third word. Starts with some separator
(?:\W+[^\W.]+){0,5}? # Again, random words, ungreedy
\W+ variety # Final required word
[^.]* # The rest of the sentence (non dot characters) up to the end
\.+ # We match the final dot (or ... if more exist)

You can get it to capture the entire sentence by having it look for periods at the ends. You can also have it exclude periods from the search in the middle by replacing \W (match non-word characters) with [^.\w] (match anything that isn't a period or a word character).
"(^|\.)([^.]*\b(?:(%s)[^.\w]+(?:\w+[^.\w]+){0,5}?(%s)|(%s)[^.\w]+(?:\w+[^.\w]+){0,5}?(%s))\b[^.]*)(\.|$)"
The (^|\.) will match the beginning of the input or a period and the (\.|$) will match a period or the end of the input (in case there is input after the last period).
I can't test this in python right now, but it should point you in the right direction even if I have an error or typo.
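Another way to keep matches inside one sentence is to split the text into sentences first and then apply the proximity pattern from the question to each sentence. The sketch below assumes a naive sentence split on whitespace after ., ! or ?; the function name sentences_with_pair and its max_gap parameter are illustrative.

```python
import re

def sentences_with_pair(text, first, second, max_gap=5):
    """Return the sentences in which both words occur within max_gap words of each other."""
    pair = re.compile(
        r"\b(?:{a}\W+(?:\w+\W+){{0,{n}}}?{b}|{b}\W+(?:\w+\W+){{0,{n}}}?{a})\b".format(
            a=re.escape(first), b=re.escape(second), n=max_gap
        )
    )
    # Naive split: whitespace preceded by sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if pair.search(s)]

description = ("...would be a lovely addition to any home. This lovely oak hard wood "
               "table comes in a variety of sizes. Another great reason to consider...")
print(sentences_with_pair(description, "oak", "wood"))
```

Since the pattern only ever sees one sentence at a time, a keyword pair that straddles two sentences cannot match.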

how to not remove apostrophe only for some words in text file in python

In a sentence, how can I remove apostrophes, double quotes, commas and so on from all words except words like it's, what's etc.? Also, at the end of the sentence there must be a space between the last word and the full stop.
For example
Input Sentence :
"'This has punctuation, and it's hard to remove. ?"
Desired Output Sentence :
This has punctuation and it's hard to remove .

Use a negative look-behind
(?<!\w)["'?]|,(?= )
Remove the matched '"? characters through re.sub.
And your code would be,
>>> s = '\"\'This has punctuation, and it\'s hard to remove. ?\" '
>>> m = re.sub(r'(?<!\w)[\"\'\?]|,(?= )', r'', s)
>>> m
"This has punctuation and it's hard to remove. "

I propose this code:
import re
sentences = [""""'This has punctuation, and it's hard to remove. ?" """,
"Did you see Cress' haircut?.",
"This 'thing' hasn't a really bad habit, you know?.",
"'I bought this for $30 from Best Buy it's. What a waste of money! The ear gels are 'comfortable at first, but what's after an hour."]
for s in sentences:
    # Remove the specified characters
    new_s = re.sub(r"""["?,$!]|'(?!(?<! ')[ts])""", "", s)
    # Deal with the final dot
    new_s = re.sub(r"\.", " .", new_s)
    print(new_s)
Output:
This has punctuation and it's hard to remove .
Did you see Cress haircut .
This thing hasn't a really bad habit you know .
I bought this for 30 from Best Buy it's . What a waste of money The ear gels are comfortable at first but what's after an hour .
The regex:
["?,$!] # Match " ? , $ or !
| # OR
' # A ' if it does not have...
(?!
(?<! ')
[ts] # t or s after it, provided it has no ` '` before the t or s
)

Use this:
(?<![tT](?=.[sS]))["'?:;,.]
If you also want to leave the period at the end of a line (as long as it is preceded by a space):
(?<![tT](?=.[sS]))(?<! (?=.$))["'?:;,.]

My take on this is to remove all quotation marks (and other punctuation) at either end of a word: split the sentence into words (separated by whitespace) and strip any leading or trailing punctuation from each word.
>>> ''.join(e.strip(string.punctuation) for e in re.split("(\s)",st))
"This has punctuation and it's hard to remove "

Use the string.strip(delimiter) function for the outside quotes
like this :
output = chaine.strip("\"")
Be careful: some characters have to be escaped with a '\', such as ', " and \. Alternatively you can write them as "'" or '"' (unsure).
Edit: hmm, I didn't think about the apostrophes. If the apostrophes are the only problem, you can strip the rest first and then parse the string manually with a for statement: record the index of the first apostrophe found and, if it is followed by an 's', leave it. I don't know; you have to settle the lexical/semantic rules before coding it.
Edit 2 :
If the string is only a sentence, and always has a dot at the end, and always needs the space, then use this at the end :
chaine[:-2]+" "+chaine[-2:]
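Putting the pieces of this thread together, one possible sketch combines the negative look-behind substitution from the first answer with a second substitution that leaves exactly one space before the final full stop. The function name clean_sentence is made up for illustration, and the second step assumes the sentence ends with a single full stop.

```python
import re

def clean_sentence(s):
    # Drop quotes and question marks not preceded by a word character,
    # and commas that are followed by a space.
    s = re.sub(r"""(?<!\w)["'?]|,(?= )""", "", s)
    # Leave exactly one space before the final full stop.
    return re.sub(r"\s*\.\s*$", " .", s)

print(clean_sentence("\"'This has punctuation, and it's hard to remove. ?\""))
```

The apostrophe in it's survives because it is preceded by a word character, which is what the look-behind checks.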

Print the line between specific pattern

I want to print the lines between specific string, my string is as follows:
my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
...
... '''
In this string I want to search for "main" and print it as:
##start/file1/file/images/graphs/main
##start/info/version/version/info/main
How can I do this?
I tried to find the lines between two ##start and search for main.

Try something like:
def get_mains(my_string):
    section = ''
    for line in my_string.split('\n'):
        if line[0:7] == "##start":
            section = line
            continue
        if 'main' in line:
            yield '/'.join([section, line])

for main in get_mains(my_string):
    print(main)

There is a way to do this with Python's regular expressions ("regex" for short).
Regex is a small language for searching a string for patterns. The string 'Hello, World' matches the regex pattern 'llo, Wor' because it contains exactly that run of characters; on the surface this looks like a plain substring test. The real power of regex comes from special characters. 'Hello, World' also matches the pattern 'Hello, \w\w\w\w\w', because \w is a special sequence that stands for any letter (plus digits and the underscore). So 'Hello, Bobby', 'Hello, World' and 'Hello, kitty' all match 'Hello, \w\w\w\w\w'. There are many more of these special characters, and they are all very useful.
To actually answer your question, I constructed a pattern that matches
##start\textICareAbout
file_I_don't_care
file_I_don't_care
file_I_care_about\main
which is
r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)')
The leading r makes the string a raw string, so the backslashes don't have to be doubled. Everything in parentheses becomes a group. Groups are pieces of text that we want to be able to recall later. There are two capturing groups here (the (?:...) inside {line} is non-capturing). The first one is (##start{line}), the second one is (.*main). The first group matches anything that starts with ##start and continues for a whole line, so lines like
##start/file1 or ##start/new
The second group matches lines that end in main, because .* matches every character except newlines. In between the two groups there is {line}*?, which means 'match any complete line, any number of times, but as few as possible'. So tying it all together, we have:
match anything that starts with ##start, then we match any number of lines, and then we match any line that ends in main.
import re

# define my_string here
pattern = re.compile(r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)'))
for match in pattern.findall(my_string):
    string = match[0][:-1]  # don't want the trailing \n
    string += '/'
    string += match[1]
    print(string)
For your example, it outputs
##start/file1/file/images/graphs/main
##start/new/version/info/main
So regex is pretty cool, and other languages have it too. It is a very powerful tool and well worth learning.
Also just a side note, I use the .format function, because I think it looks much cleaner and easier to read, so
'hello{line}world'.format(line=r'(?:.*\n)') just becomes evaluated to 'hello(?:.*\n)world', and it would match
hello
Any Text Here. Anything at all. (just for one line)
world
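Note that the output shown above attaches version/info/main to ##start/new rather than to ##start/info/version, because the lazy {line}*? is allowed to run across a later ##start header. One possible variant, sketched here with a shortened copy of the question's string, adds a negative look-ahead so that the skipped lines (and the main line itself) cannot be headers. The variable names are illustrative.

```python
import re

my_string = """
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
"""

# Skipped lines and the "main" line must not start with ##start.
pattern = re.compile(r"(##start.*)\n(?:(?!##start).*\n)*?((?!##start).*main)")

results = [header + "/" + main for header, main in pattern.findall(my_string)]
for line in results:
    print(line)
```

With this change a section that contains no main line (such as ##start/new) produces no match instead of borrowing one from the next section.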
