Sentence matching with regex

I have a text that is split across many lines in no particular format, so I decided to call line.strip('\n') on each line. Then I want to split the text into sentences using the sentence end marker (.), with these rules:
1. A period (.) that is followed by \s (whitespace) or \S (like " or ') and then by [A-Z] should split.
2. Do not split on [0-9]\.[A-Za-z], as in 1.stackoverflow real time solution.
My program only solves half of rule 1: a period (.) that is followed by \s and [A-Z]. Below is the code:
# -*- coding: utf-8 -*-
import re, sys
source = open(sys.argv[1], 'rb')
dest = open(sys.argv[2], 'wb')
sent = []
for line in source:
    line1 = line.strip('\n')
    k = re.sub(r'\.\s+([A-Z“])'.decode('utf8'), '.\n\g<1>', line1)
    sent.append(k)
for line in sent:
    dest.write(''.join(line))
Also, I'd like to know the best way to master regex; it seems confusing.

To include the single quote in the character class, escape it with a \. The regex should be:
\.\s+[A-Z"\']
That's really all you need. You only need to tell a regex what to match, you don't need to specify what you don't want to match. Everything that doesn't fit the pattern won't match.
This regex will match any period followed by whitespace followed by a capital letter or a quote. Since a period immediately preceded by an number and immediately followed by a letter doesn't meet those criteria, it won't match.
This is assuming that the regex you had was working to split on a period followed by whitespace followed by a capital, as you stated. Note, however, that the match consumes the capital letter too, so "I am Sam. Sam I am." would split into "I am Sam" and "am I am." Is that really what you want? If not, use zero-width assertions to exclude the parts you want to match but also keep. Here are your options, in order of what I think it's most likely you want.
1) Keep the period and the first letter or opening quote of the next sentence; lose the whitespace:
(?<=\.)\s+(?=[A-Z"\'])
This will split the example above into I am Sam. and Sam I am.
2) Keep the first letter of the next sentence; lose the period and whitespace:
\.\s+(?=[A-Z"\'])
This will split into I am Sam and Sam I am. This presumes that there are more sentences afterward, otherwise the period will stay with the second sentence, because it's not followed by whitespace and a capital letter or quote. If this option is the one you want - the sentences without the periods, then you might want to also match a period followed by the end of the string, with optional intervening whitespace, so that the final period and any trailing whitespace will be dropped:
\.(?:\s+(?=[A-Z"\'])|\s*$)
Note the ?:. You need non-capturing parentheses, because if a split pattern contains capture groups, anything captured by a group is added as an element in the results (e.g. re.split(r'(\+)', 'a+b+c') gives you ['a', '+', 'b', '+', 'c'] rather than just ['a', 'b', 'c']).
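3) Keep everything; whitespace goes with the preceding sentence:
(?<=\.\s+)(?=[A-Z"\'])
This will give you I am Sam. (with its trailing whitespace) and Sam I am. Note that this pattern needs an engine that supports variable-width lookbehind. Python's built-in re module rejects it with "look-behind requires fixed-width pattern", so there you would need a fixed-width form such as (?<=\.\s)(?=[A-Z"\']), which only handles a single whitespace character.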
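As a minimal Python sketch of options 1 and 2 above, re.split can be applied as follows. The sample text and variable names are made up for illustration, and filtering out the empty trailing element is an assumption about what you want in the result:

```python
import re

# Hypothetical sample text; "1.stackoverflow" must not be split.
text = 'I am Sam. Sam I am. Visit 1.stackoverflow today.'

# Option 1: drop only the whitespace between sentences, keep the periods.
keep_periods = re.split(r'(?<=\.)\s+(?=[A-Z"\'])', text)

# Option 2 (extended form): drop the period and whitespace, including a final period.
# Matching the final period leaves an empty string at the end, so filter it out.
drop_periods = [s for s in re.split(r'\.(?:\s+(?=[A-Z"\'])|\s*$)', text) if s]

print(keep_periods)  # ['I am Sam.', 'Sam I am.', 'Visit 1.stackoverflow today.']
print(drop_periods)  # ['I am Sam', 'Sam I am', 'Visit 1.stackoverflow today']
```

The period in 1.stackoverflow is left alone in both cases because it is followed by neither whitespace nor the end of the string.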
Regarding the last part of your question, the best resource for regex syntax I've seen is http://www.regular-expressions.info. Start with this summary: http://www.regular-expressions.info/reference.html Then go to the Tutorial page for more advanced details: http://www.regular-expressions.info/tutorial.html

Related

Capitalize each first word of a sentence in a paragraph

I want to capitalize the first word after each dot in a whole paragraph (str) full of sentences. The problem is that all characters are lowercase.
I tried something like this:
text = "here a long. paragraph full of sentences. what in this case does not work. i am lost"
re.sub(r'(\b\. )([a-zA-z])', r'\1' (r'\2').upper(), text)
I expect something like this:
"Here a long. Paragraph full of sentences. What in this case does not work. I am lost."

You can use re.sub with a lambda:
import re
text = "here a long. paragraph full of sentences. what in this case does not work. i am lost"
result = re.sub(r'(?<=^)\w|(?<=\.\s)\w', lambda x: x.group().upper(), text)
Output:
'Here a long. Paragraph full of sentences. What in this case does not work. I am lost'
Regex Explanation:
(?<=^)\w: matches a word character (letter, digit, or underscore) at the start of the string.
(?<=\.\s)\w: matches a word character preceded by a period and a single whitespace character.
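One consequence of the fixed-width lookbehind is worth seeing in code: it only fires when exactly one whitespace character separates the period from the next word. The sketch below wraps the substitution in a hypothetical helper, capitalize_sentences, and writes (?<=^)\w as the equivalent ^\w:

```python
import re

def capitalize_sentences(text):
    # ^\w         : the first word character of the string
    # (?<=\.\s)\w : a word character right after a period and ONE whitespace character
    return re.sub(r'^\w|(?<=\.\s)\w', lambda m: m.group().upper(), text)

single = capitalize_sentences("here a long. paragraph full of sentences.")
double = capitalize_sentences("here a long.  paragraph full of sentences.")

print(single)  # Here a long. Paragraph full of sentences.
print(double)  # Here a long.  paragraph full of sentences.
```

With two spaces after the period, the character directly before "paragraph" is a space that is itself preceded by a space rather than a period, so the lookbehind fails and the word stays lowercase.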

You can use the regex ((?:^|\.\s)\s*)([a-z]), written equivalently as (^\s*|\.\s+)([a-z]) in the code below. It doesn't depend on lookarounds, which may not be available in the regex dialect you are using, so it is simpler and more widely supported. For example, JavaScript only gained lookbehind in ECMAScript 2018, and it is not yet supported everywhere. Group 1 captures either zero or more whitespace characters at the start of the string, or a literal dot followed by one or more whitespace characters. Group 2 captures the lowercase letter that follows, using ([a-z]). The matched text is then replaced with the group 1 text followed by the group 2 letter, converted to uppercase by a lambda expression. Check this Python code:
import re
arr = ['here a long. paragraph full of sentences. what in this case does not work. i am lost',
' this para contains more than one space after period and also has unneeded space at the start of string. here a long. paragraph full of sentences. what in this case does not work. i am lost']
for s in arr:
    print(re.sub(r'(^\s*|\.\s+)([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))
Output,
Here a long. Paragraph full of sentences. What in this case does not work. I am lost
This para contains more than one space after period and also has unneeded space at the start of string. Here a long. Paragraph full of sentences. What in this case does not work. I am lost
If you want to collapse the extra whitespace to a single space, move \s* out of group 1 and use the regex ((?:^|\.\s))\s*([a-z]) instead. The updated Python code:
import re
arr = ['here a long. paragraph full of sentences. what in this case does not work. i am lost',
' this para contains more than one space after period and also has unneeded space at the start of string. here a long. paragraph full of sentences. what in this case does not work. i am lost']
for s in arr:
    print(re.sub(r'((?:^|\.\s))\s*([a-z])', lambda m: m.group(1) + m.group(2).upper(), s))
You get the following, where extra whitespace is reduced to just one space, which is often what you want:
Here a long. Paragraph full of sentences. What in this case does not work. I am lost
This para contains more than one space after period and also has unneeded space at the start of string. Here a long. Paragraph full of sentences. What in this case does not work. I am lost
Also, if you were using a PCRE-based tool whose replacement syntax supports \U, you would not need a lambda function at all and could simply replace with \1\U\2

Use CR/LF pair to reject a match using Regex

I am struggling to reject matches for words separated by newline character.
Here's the test string:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
Red
Abcd
DDDD
Rules for regex:
1) Reject a word if it's followed by comma. Therefore, we will drop Catto.
2) Only select words that begin with a capital letter. Hence, and etc. will be dropped
3) If the word is followed by a carriage return (i.e. it is the first name), then ignore it.
Here's my attempt: \b([A-Z][a-z]+)\s(?!\n)
Explanation:
\b #start at a word boundary
([A-Z][a-z]+) #start with A-Z followed by a-z
\s #Last name must be followed by a space character
(?!\n) #The word shouldn't be followed by newline char i.e. ignore first names.
There are two problems with my regex.
1) Andrew is matched as Andre. I am unsure why the w is missed. I have also observed that the w of Andrew is not missed if I remove everything after the w of Andrew from the bottom of the sample text, i.e. the sample text would look like:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
The output should be:
Cardoza
Jerry
You might ask: Why should Andrew be rejected? This is because of two reasons: a) Andrew is not followed by space. b) There is no first_name "space" last_name combination.
2) The first names are getting selected using my regex. How do I ignore first names?
I researched SO. There seems to be a similar thread, "ignoring newline character in regex match", but the answer doesn't talk about ignoring \r.
This problem is adapted from Watt's Beginning Regex book. I have spent close to 1 hour on this problem without any success. Any explanation will be greatly appreciated. I am using Python's re module.
Here's regex101 for reference.

Andre (and not the trailing w) is being matched in your regex because the last token is negative lookahead for \n, and just before that is an optional space. So, Andrew<end of line> fails due to being at the end of the line, so the engine backtracks to Andre, which succeeds.
Maybe the optional quantifier in \s? in your regex101 was a typo, but it would probably be easier to start from scratch. If you want to find the initial names that are followed by a space and then another name, then you can use
^[A-Z][a-z]+(?= [A-Z][a-z]+$)
with the m flag:
https://regex101.com/r/kqeMcH/5
The m flag allows for ^ to match the beginning of a line, and $ to match the end of the line - easier than messing with looking for \ns. (Without the m flag, ^ will only match the beginning of the string, while $ will similarly only match the end of the string)
That is, start with repeated alphabetical characters, then lookahead for a space and more alphabetical characters, followed by the end of the line. Using positive lookahead will be a lot easier than negative lookahead for newlines and such.
Note that literal spaces are a bit more reliable in a regex than \s, because \s matches any whitespace character, including newlines. If you're looking for literal spaces, better to use a literal space.
To use flags in Python regex, either pass the flags= argument or set inline flags at the beginning of the pattern, e.g.
pattern = r'(?m)^[A-Z][a-z]+(?= [A-Z][a-z]+$)'
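To check the pattern against the test string from the question, a short sketch using the flags= form might look like this (the variable names are illustrative only):

```python
import re

names = """Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
Red
Abcd
DDDD"""

# ^ and $ match at line boundaries because of re.MULTILINE.
# The lookahead requires a literal space and a second capitalized word up to the line end.
pattern = re.compile(r'^[A-Z][a-z]+(?= [A-Z][a-z]+$)', flags=re.MULTILINE)
first_words = pattern.findall(names)

print(first_words)  # ['Cardoza', 'Jerry']
```

Catto and Duncan are rejected because a comma follows them, and Andrew is rejected because a line break, not a space, follows it.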

how to write a regular expression which matches a pattern if the sentence ends by period '.'

I have a group of strings like the following:
a phrase containing spaces
A sentence contains spaces as well, but end by period.
I'd like to find a regular expression that matches the spaces (like [ \t\f]) in the 2nd line, which ends with '.'.
I've looked around and found no solution, so I have come here for help.
I am using Python, but I would not mind seeing a PCRE solution even if it is not possible in Python.
I came up with a regex, but it could not exclude the first line.

Here is a regex pattern which, if applied repeatedly to every line, should be able to match spaces in that line, assuming the line ends with period:
\s+(?=.*\.$)
Here is my attempt at a Python script. I don't print the space when a match is found, because we can't see it. Instead, I print something visible:
import re
line = 'A sentence contains spaces as well, but end by period.'
spaces = re.findall(r'\s+(?=.*\.$)', line)
for space in spaces:
    print('found a space')
found a space (printed 9 times)
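If you would rather run the pattern once over the whole text instead of line by line, a sketch like the following may work. It assumes the re.MULTILINE flag so that $ matches at the end of each line, and it swaps \s for the [ \t\f] class from the question so that the newline between the lines is not itself reported as a match:

```python
import re

text = (
    "a phrase containing spaces\n"
    "A sentence contains spaces as well, but end by period."
)

# [ \t\f]+  : a run of spaces, tabs or form feeds (never a newline)
# (?=.*\.$) : the rest of the same line must end with a period
matches = re.findall(r'[ \t\f]+(?=.*\.$)', text, flags=re.MULTILINE)

print(len(matches))  # 9
```

Because . does not match a newline by default, the lookahead cannot reach into the second line, so the spaces in the first line are skipped.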

How to allow regular expression to return empty string

I have a series of text files to parse which may or may not contain any one of a collection of headers, and then lines of data or comment below that header. All header groups are preceded by a double line break.
I am seeking a regular expression that will return an empty string if it sees a header followed immediately by a double line break. I need to differentiate whether a document has that header with no content, or does not have that header at all.
For example, here are portions of two documents:
Dogs
Spaniel
Beagle

Birds
Parrot
and
Dogs

Amphibians
Frogs
Salamanders
I would like a regex that would return Spaniel\nBeagle in the first document, and an empty string for the second.
The closest I have been able to find is (in Python syntax) expr = re.compile("Dogs(.+?|)?\n\n", re.DOTALL). This returns the correct value for the first, but in the second case it returns \n\nAmphibians\nFrogs\nSalamanders. The second question mark and the pipe do not do what I had hoped.
I am handling this by program logic right now, searching for Dogs\n\n and only returning contents if that regex is not found, but it is unsatisfying because nothing beats the feeling of a single regular expression doing the job.
So: is there a regex that will match the second document, and return ""?

Problem
Your Dogs(.+?|)?\n\n pattern matches the word Dogs anywhere in the document, then tries to optionally (as there is an empty alternative after the |) match any 1 or more (due to the +? quantifier) characters, but as few as possible (since +? is a lazy quantifier), up to the first 2 newlines.
That means the regex returns an empty group only if there is no other double newline further in the text. Otherwise it grabs all the text up to that later double newline, because .+? must consume at least one character (the first newline), and the \n\n part can then no longer match the 2 newlines right after Dogs.
Solution
You may use a *? quantifier instead of the +? one to allow matching zero or more characters. The pattern Dogs(.*?)\n\n (still with re.DOTALL) will find Dogs, then any 0+ chars, as few as possible, up to the first \n\n, even one that appears right after Dogs.
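A minimal sketch of this fix is shown below. The helper name dogs_section and the sample documents are assumptions for illustration; the documents follow the question's example, with a blank line before each header group and at the end. Stripping the leading newline from the group is also an assumption about the desired output:

```python
import re

expr = re.compile(r"Dogs(.*?)\n\n", re.DOTALL)

def dogs_section(document):
    """Return the text under Dogs, '' if the header has no content, None if it is absent."""
    match = expr.search(document)
    # The group starts with the newline that follows the header, so strip it.
    return match.group(1).strip("\n") if match else None

with_content = dogs_section("Dogs\nSpaniel\nBeagle\n\nBirds\nParrot\n\n")
empty_header = dogs_section("Dogs\n\nAmphibians\nFrogs\nSalamanders\n\n")
no_header = dogs_section("Birds\nParrot\n\n")

print(repr(with_content))  # 'Spaniel\nBeagle'
print(repr(empty_header))  # ''
print(repr(no_header))     # None
```

Returning None when there is no match keeps the "header with no content" case distinct from the "no header at all" case.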
Optimization:
If you process very long strings, and if Dogs appears at the beginning of a line, you may use an unrolled regex, since .*? is known to slow regex execution with longer inputs.
Use
expr = re.compile(r"^Dogs(.*(?:\n(?!\n).*)*)", re.MULTILINE)
Basically, it will match:
^ - start of a line
Dogs - the Dogs substring
(.*(?:\n(?!\n).*)*) - Group 1, capturing:
    .* - zero or more chars other than linebreak chars (as the re.DOTALL modifier is not used)
    (?:\n(?!\n).*)* - zero or more sequences of:
        \n(?!\n) - a newline not followed by another newline
        .* - zero or more chars other than linebreak chars
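The unrolled pattern can be tried on the two example documents from the question like this. The documents are reconstructed here with a blank line before each new header, which is an assumption based on the question's description:

```python
import re

expr = re.compile(r"^Dogs(.*(?:\n(?!\n).*)*)", re.MULTILINE)

doc_with_content = "Dogs\nSpaniel\nBeagle\n\nBirds\nParrot"
doc_empty_header = "Dogs\n\nAmphibians\nFrogs\nSalamanders"

content = expr.search(doc_with_content).group(1)
empty = expr.search(doc_empty_header).group(1)

print(repr(content))  # '\nSpaniel\nBeagle'
print(repr(empty))    # ''
```

Unlike Dogs(.*?)\n\n, this version does not need a double newline after the section, so it also works when the Dogs group is the last one in the file.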

match until a certain pattern using regex

I have string in a text file containing some text as follows:
txt = "java.awt.GridBagLayout.layoutContainer"
I am looking to get everything before the Class Name, "GridBagLayout".
I have tried something like the following, but I can't figure out how to get rid of the trailing ".":
txt = re.findall(r'java\S?[^A-Z]*', txt)
and I get the following: "java.awt."
instead of what I want: "java.awt"
Any pointers as to how I could fix this?

Without using capture groups, you can use lookahead (the (?= ... ) business).
java\s?[^A-Z]*(?=\.[A-Z]) should capture everything you're after. Here it is broken down:
java //Literal word "java"
\s? //Match for an optional space character. (can change to \s* if there can be multiple)
[^A-Z]* //Any number of non-capital-letter characters
(?=\.[A-Z]) //Look ahead for (but don't add to selection) a literal period and a capital letter.

Make your pattern match a period followed by a capital letter:
'(java\S?[^A-Z]*?)\.[A-Z]'
Everything in capture group one will be what you want.

This seems to do what you want with re.findall(): (java\S?[^A-Z]*)\.[A-Z]
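For comparison, here is a short sketch that runs the lookahead pattern and the capture-group pattern from the answers above against the string from the question (the variable names are illustrative):

```python
import re

txt = "java.awt.GridBagLayout.layoutContainer"

# Lookahead: the period and capital letter are checked but not included in the match.
with_lookahead = re.findall(r'java\s?[^A-Z]*(?=\.[A-Z])', txt)

# Capture group: the period and capital letter are matched, but findall returns only group 1.
with_group = re.findall(r'(java\S?[^A-Z]*)\.[A-Z]', txt)

print(with_lookahead)  # ['java.awt']
print(with_group)      # ['java.awt']
```

In both cases [^A-Z]* first grabs ".awt." (or "awt."), and the engine backtracks one character so that the trailing period can satisfy \.[A-Z].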
