Regex for questions taking multiple sentences - python

I'm using re to take the questions from a text. I just want the sentence with the question, but it's taking multiple sentences before the question as well. My code looks like this:
match = re.findall("[A-Z].*\?", data2)
print(match)
An example of a result I get is:
'He knows me, and I know him. Do YOU know me? Hey?'
The two questions should be separated, and the non-question sentence shouldn't be there. Thanks for any help.

The . character in regex matches any character (except a newline by default), including periods, and .* is greedy, so the match runs from the first capital letter to the last question mark. Why not simply match anything besides the sentence-ending punctuation?
questions = re.findall(r"\s*([^\.\?]+\?)", data2)
# \s* sentence beginning space to ignore
# ( start capture group
# [^\.\?]+ negated character class matching anything besides "." and "?" (one or more)
# \? question mark to end sentence
# ) end capture group
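As a quick check, the pattern above can be run against the sample sentence from the question. This is only a sketch: the name data2 comes from the question, and its contents here are assumed to be the example result quoted there.

```python
import re

# Sample text taken from the example in the question (assumed contents of data2).
data2 = "He knows me, and I know him. Do YOU know me? Hey?"

# Capture runs of characters that contain no "." or "?" and end in "?".
questions = re.findall(r"\s*([^.?]+\?)", data2)
print(questions)
```

A sentence ending in "!" would still be absorbed into the following question; adding "!" to the character class ([^.?!]) is one way to handle that.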

You could look for letters, digits, and whitespace that end with a '?'.
>>> [i.strip() for i in re.findall('[\w\d\s]+\?', s)]
['Do YOU know me?', 'Hey?']
There would still be edge cases to handle, such as punctuation like ',' inside the question, or other complexities.
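To illustrate the comma edge case, here is a small sketch with a made-up sentence (it is not from the question). The match can only start after the comma, so the beginning of the question is lost.

```python
import re

# Hypothetical input with a comma inside the question.
s = "Well, do you know me? Hey?"

# \w already covers digits, so [\w\s] behaves the same as [\w\d\s].
found = [i.strip() for i in re.findall(r"[\w\s]+\?", s)]
print(found)
```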

You can use
(?<!\S)[A-Z][^?.]*\?(?!\S)
The pattern matches:
(?<!\S) Negative lookbehind, assert a whitespace boundary to the left
[A-Z] Match a single uppercase char A-Z
[^?.]*\? Match 0+ times any char except ? and . and then match a ?
(?!\S) Negative lookahead, assert a whitespace boundary to the right
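A minimal sketch of this pattern in Python, assuming data2 holds the example text from the question:

```python
import re

# Assumed contents of data2, copied from the example in the question.
data2 = "He knows me, and I know him. Do YOU know me? Hey?"

# Whitespace boundary, capital letter, no "?" or "." inside, closing "?",
# then another whitespace boundary.
pattern = re.compile(r"(?<!\S)[A-Z][^?.]*\?(?!\S)")
questions = pattern.findall(data2)
print(questions)
```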

You should use the ^ at the beginning of your expression so your regex expression should look like this: "^[A-Z].*\?".
"Matches the beginning of the string, or the beginning of a line if the multiline flag (m) is enabled. This matches a position, not a character."
If you have multiple sentences in your line you can use the following regex:
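"(?<=\.\s)[A-Z].*\?"
(?<= starts a positive lookbehind. Here it requires a period (.) and one whitespace character immediately before the sentence. Python's built-in re module only accepts fixed-width lookbehinds, so a quantifier such as \s+ inside the lookbehind raises an error; the third-party regex module is needed for that.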
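Python's built-in re module requires a lookbehind to be fixed-width, so one possible variant is sketched below. It is an assumption on my part rather than the pattern given above: the lookbehind checks exactly one punctuation character plus one whitespace character, and the greedy .* is replaced by a negated character class so each question stops at its own "?".

```python
import re

# Assumed contents of data2, copied from the example in the question.
data2 = "He knows me, and I know him. Do YOU know me? Hey?"

# A sentence starts either at the beginning of a line or right after
# sentence-ending punctuation followed by a single whitespace character.
pattern = re.compile(r"(?:^|(?<=[.?!]\s))[A-Z][^.?!]*\?", re.MULTILINE)
questions = pattern.findall(data2)
print(questions)
```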

Related

Regex to find sentences of a minimum length

I am trying to create a regular expression that finds sentences with a minimum length.
Really my conditions are:
there must at least be 5 words in a sequence
words in sequence must be distinct
sequence must be followed by some punctuation character.
So far I have tried
^(\b\w*\b\s?){5,}\s?[.?!]$
If my sample text is:
This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in.
I would like to match on strings 1 and 5.
I am using the Python re library. I am using regex101 to test, and it appears the regex I have above is doing quite a bit of work with regard to backtracking, so I imagine those knowledgeable in regex may be a bit appalled (my apologies).
You can use the following regex to identify the strings that meet all three conditions:
^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$
with the case-insensitive flag set.
Python's regex engine performs the following operations.
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ chars
\b(\w+)\b # match a word in cap grp 1
.+ # match 1+ chars
\b\1\b # match the contents of cap grp 1 with word breaks
) # end negative lookahead
(?: # begin non-cap grp
.* # match 0+ chars
\b\w+\b # match a word
) # end non-cap grp
{5} # execute non-cap grp 5 times
.* # match 0+ chars
[.?!] # match a punctuation char
\s* # match 0+ whitespaces
$ # match end of line
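Here is a small sketch of how the pattern might be applied to the sample text, one line at a time. The names PATTERN, text and matches are mine, and match.group(0) is used because the only capture group sits inside the negative lookahead.

```python
import re

PATTERN = re.compile(
    r"^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$",
    re.IGNORECASE,
)

text = """This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in."""

# Test each line separately and keep the full match for the lines that pass.
matches = []
for line in text.splitlines():
    m = PATTERN.match(line)
    if m:
        matches.append(m.group(0))

print(matches)
```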
Items 1. and 3. are easily done by regex, but
2. words in sequence must be distinct
I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.
I recommend splitting the string on the character " " and checking word by word. Quicker, no sweat.
Edit: it can be done with a lookahead, as Cary said.
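For comparison, the split-and-check idea could look roughly like this. It is only a sketch: the function name is_valid_sentence and the min_words parameter are made up for illustration, and words are compared case-insensitively, which is an assumption based on the sample line "Not not not ...".

```python
import string


def is_valid_sentence(line, min_words=5):
    """Return True if line ends in . ? or ! and has at least min_words distinct words."""
    line = line.strip()
    if not line or line[-1] not in ".?!":
        return False
    # Split on whitespace, drop surrounding punctuation, and ignore case.
    words = [w.strip(string.punctuation).lower() for w in line.split()]
    words = [w for w in words if w]
    return len(words) >= min_words and len(set(words)) == len(words)


lines = [
    "This is a sentence I would like to parse.",
    "This is too short.",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]
results = [is_valid_sentence(line) for line in lines]
print(results)
```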

Regex for words not ending with a lowercase letter

I am creating a regex that matches words of at least 3 characters that do not end with a lowercase letter.
import re
re.findall(r'(\w{3,})(?![a-z])\b','I am tyinG a mixed charAv case VOW')
My output is
['tyinG', 'mixed', 'charAv', 'case', 'VOW']
My expected output is
['tyinG', 'VOW']
I get the expected output when I use re.findall(r'(\w{3,})(?<![a-z])\b','I am tyinG a mixed charAv case VOW')
When I tried it on je.im, my first regex (the one without <) appeared to give the correct result.
What is the relevance of < here?
The first pattern (\w{3,})(?![a-z])\b does not give you the expected result because the pattern is first matching 3+ word chars and then asserts using a negative lookahead (?! that what is directly on the right is not a lowercase char a-z.
That assertion will always be true at the end of a word, because any lowercase a-z chars have already been consumed by \w and the next character is not a word character.
The second pattern (\w{3,})(?<![a-z])\b does give you the right result as it first tries to match 3 or more word chars and after that asserts using a negative lookbehind (?<! what is directly to the left is not a lowercase char a-z.
If you want to use a lookaround, you can make the pattern a bit more efficient by making use of a word boundary at the beginning.
At the end of the pattern place the negative lookbehind after the word boundary to first anchor it and then do the assertion.
\b\w{3,}\b(?<![a-z])
Note that you can omit the capturing group if you want the single match only.
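The difference between the two assertions can be seen side by side in a short sketch using the sentence from the question (the variable names are mine):

```python
import re

text = "I am tyinG a mixed charAv case VOW"

# Lookahead: checks the character AFTER the word, which is never a-z
# once \w{3,} has consumed the whole word.
lookahead_result = re.findall(r"(\w{3,})(?![a-z])\b", text)

# Lookbehind: checks the LAST character of the word that was just matched.
lookbehind_result = re.findall(r"\b\w{3,}\b(?<![a-z])", text)

print(lookahead_result)
print(lookbehind_result)
```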

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in it but not at the beginning and not at the end of the word
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
This means: find one or more word characters (letters, digits, or underscore), as indicated by the '+', then a '-', followed by one or more word characters again.
str is a built-in name, so it's better not to use it as a variable name.
st = 'word semi-column peace'
# \w+ word - \w+ word after -
print(re.findall(r"\b\w+-\w+\b",st))
['semi-column']
a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to prevent a match in words with hyphens at the beginning or end. In a string like "-not-wanted-", both \b\w+-\w+\b and \w+-\w+ will still match the inner "not-wanted".
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
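To see why the extra conditions matter, the simpler patterns can be compared with the lookaround version on the "-not-wanted-" case. This is a minimal sketch; the variable names are only for illustration.

```python
import re

sample = "-not-wanted-"

# Both simpler patterns still pick up the inner part of the word.
with_boundaries = re.findall(r"\b\w+-\w+\b", sample)
without_boundaries = re.findall(r"\w+-\w+", sample)

# The lookarounds reject any candidate touching a hyphen or word character.
strict = re.compile(r"(?<![-\w])\w+(?:-\w+)+(?![-\w])")
strict_on_sample = strict.findall(sample)

text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
strict_on_text = strict.findall(text)

print(with_boundaries, without_boundaries, strict_on_sample)
print(strict_on_text)
```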
You can also use the following regex:
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+-\S+", st))
['semi-column']
You can try something like this. Centering on the hyphen, I match until there is whitespace in either direction. I also check whether the word is surrounded by hyphens (e.g. -test-cats-) and, if it is, I leave those outer hyphens out of the match. The regular expression should also work with findall.
import re
st = 'word semi-column peace'
# [^\s-]+ matches a run of characters that are neither whitespace nor hyphens
m = re.search(r'([^\s-]+-[^\s-]+)', st)
if m:
    print(m.group(1))

Python Regex doesn't match . (dot) as a character

I have a regex that matches all three characters words in a string:
\b[^\s]{3}\b
When I use it with the string:
And the tiger attacked you.
this is the result:
regex = re.compile(r"\b[^\s]{3}\b")
regex.findall(string)
[u'And', u'the', u'you']
As you can see it matches you as a word of three characters, but I want the expression to take "you." with the "." as a 4 chars word.
I have the same problem with ",", ";", ":", etc.
I'm pretty new with regex but I guess it happens because those characters are treated like word boundaries.
Is there a way of doing this?
Thanks in advance,
EDIT
Thanks to the answers of @BrenBarn and @Kendall Frey, I managed to get to the regex I was looking for:
(?<!\w)[^\s]{3}(?=$|\s)
If you want to make sure the word is preceded and followed by a space (and not a period like is happening in your case), then use lookaround.
(?<=\s)\w{3}(?=\s)
If you need it to match punctuation as part of words (such as 'in.') then \w won't be adequate, and you can use \S (anything but a space)
(?<=\s)\S{3}(?=\s)
As described in the documentation:
A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
So if you want a period to count as a word character and not a word boundary, you can't use \b to indicate a word boundary. You'll have to use your own character class. For instance, you can use a regex like \s[^\s]{3}\s if you want to match 3 non-space characters surrounded by spaces. If you still want the boundary to be zero-width (i.e., restrict the match but not be included in it), you could use lookaround, something like (?<=\s)[^\s]{3}(?=\s).
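The effect of swapping \b for lookarounds can be checked with a short sketch. The test sentence is the one from the question with "Go in." prepended, which is my own addition so that a three-character token ending in a period is present.

```python
import re

# Sentence from the question, with "Go in." added as an assumed extra case.
s = "Go in. And the tiger attacked you."

# \b treats "." as a boundary, so "you" matches and "in." does not.
with_word_boundary = re.findall(r"\b[^\s]{3}\b", s)

# Whitespace lookarounds treat "in." as a three-character token.
with_lookaround = re.findall(r"(?<=\s)[^\s]{3}(?=\s)", s)

# The pattern from the question's edit also accepts the end of the string.
from_edit = re.findall(r"(?<!\w)[^\s]{3}(?=$|\s)", s)

print(with_word_boundary)
print(with_lookaround)
print(from_edit)
```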
This would be my approach. It also matches words that come right after punctuation.
import re
r = r'''
\b # word boundary
( # capturing parentheses
[^\s]{3} # anything but whitespace 3 times
\b # word boundary
(?=[^\.,;:]|$) # dont allow . or , or ; or : after word boundary but allow end of string
| # OR
[^\s]{2} # anything but whitespace 2 times
[\.,;:] # a . or , or ; or :
)
'''
s = 'And the tiger attacked you. on,bla tw; th: fo.tes'
print(re.findall(r, s, re.X))
output:
['And', 'the', 'on,', 'bla', 'tw;', 'th:', 'fo.', 'tes']

How to match a word that doesn't start with X but ends with Y with regex

Example;
X=This
Y=That
not matching;
ThisWordShouldNotMatchThat
ThisWordShouldNotMatch
WordShouldNotMatch
matching;
AWordShouldMatchThat
I tried (?<!...) but seems not to be easy :)
^(?!This).*That$
As a free-spacing regex:
^ # Start of string
(?!This) # Assert that "This" can't be matched here
.* # Match the rest of the string
That # making sure we match "That"
$ # right at the end of the string
This will match a single word that fulfills your criteria, but only if this word is the only input to the regex. If you need to find words inside a string of many other words, then use
\b(?!This)\w*That\b
\b is the word boundary anchor, so it matches at the start and at the end of a word. \w means "alphanumeric character". If you also want to allow non-alphanumerics as part of your "word", then use \S instead - this will match anything that's not a space.
In Python, you could do words = re.findall(r"\b(?!This)\w*That\b", text).
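Both variants can be tried on the example words from the question. This is a minimal sketch; the names candidates, whole_string_matches and words are mine.

```python
import re

candidates = [
    "ThisWordShouldNotMatchThat",
    "ThisWordShouldNotMatch",
    "WordShouldNotMatch",
    "AWordShouldMatchThat",
]

# Anchored version: each candidate is the entire input.
whole_string_matches = [w for w in candidates if re.search(r"^(?!This).*That$", w)]

# Word-boundary version: search inside a longer text.
text = " ".join(candidates)
words = re.findall(r"\b(?!This)\w*That\b", text)

print(whole_string_matches)
print(words)
```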
