I have a regex pattern as follows:
r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+'
and I am trying to modify it so that it only matches the dot at the end of a sentence and not the letter before it. Here is my string:
sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'
and here is what i have done:
import re
re.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+', sent)
However, it removes the last letter of the words:
current output:
['This is the U.A. we have r.a.d. golden 13.56 dat',' a better date 34. was ther',
'']
my desired output is:
['This is the U.A. we have r.a.d. golden 13.56 date',' a better date 34. was there',
'']
I do not know how to modify the pattern to keep the last letter of the words 'date' and 'there'.
Your pattern can be reduced to and fixed as
(?<=(?<![.\s])[a-zA-Z])\.
If you need to also match multiple dots, put back + after the \..
Details:
(?<=(?<![.\s])[a-zA-Z]) - a positive lookbehind that matches a location that is immediately preceded with
(?<![.\s]) - a negative lookbehind that fails the match if there is a . or whitespace immediately to the left of the current location
[a-zA-Z] - an ASCII letter
\. - a literal dot.
Your pattern is an alternation of two patterns, (?<!\.|\s)[a-z]\. and (?<!\.|\s)[A-Z]\., that differ only in [a-z] versus [A-Z], so it can be shortened to (?<!\.|\s)[a-zA-Z]\. The [a-zA-Z] must then go into a non-consuming pattern so that re.split does not eat the letter, and a positive lookbehind is the natural way to do that.
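As a quick check, here is a minimal sketch that applies the lookbehind pattern to the string from the question. The names SENTENCE_DOT and parts are made up for this illustration.

```python
import re

sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'

# Split only on a dot that follows a letter, where that letter is not itself
# preceded by a dot or whitespace. Lookbehinds consume nothing, so the letter
# stays in the result.
SENTENCE_DOT = re.compile(r'(?<=(?<![.\s])[a-zA-Z])\.')

parts = SENTENCE_DOT.split(sent)
print(parts)
```

The trailing empty string is there because the input ends with a matching dot.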
Related
I'm using re to take the questions from a text. I just want the sentence with the question, but it's taking multiple sentences before the question as well. My code looks like this:
match = re.findall("[A-Z].*\?", data2)
print(match)
an example of a result I get is:
'He knows me, and I know him. Do YOU know me? Hey?'
the two questions should be separated and the non question sentence shouldn't be there. Thanks for any help.
The . character in a regex matches any character (except a newline by default), including periods, which you don't want to include. Why not simply match anything besides the sentence-ending punctuation?
questions = re.findall(r"\s*([^\.\?]+\?)", data2)
# \s* sentence beginning space to ignore
# ( start capture group
# [^\.\?]+ negated character class matching anything besides "." and "?" (one or more)
# \? question mark to end sentence
# ) end capture group
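A runnable sketch of this approach, assuming data2 holds the sample text from the question. Inside a character class the dot and question mark need no escaping, so the class is written as [^.?] here.

```python
import re

# Sample text taken from the question; data2 is the asker's variable name.
data2 = 'He knows me, and I know him. Do YOU know me? Hey?'

# Skip leading whitespace, then capture a run without "." or "?" that ends in "?".
questions = re.findall(r"\s*([^.?]+\?)", data2)
print(questions)
```

The statement "He knows me, and I know him." is dropped because it ends in a period rather than a question mark.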
You could look for letters, digits, and whitespace that end with a '?' (\w already covers digits, so \d is not needed).
>>> [i.strip() for i in re.findall(r'[\w\s]+\?', s)]
['Do YOU know me?', 'Hey?']
There would still be some edge cases to handle: a question that contains a ',' or other punctuation would be cut off at that character.
You can use
(?<!\S)[A-Z][^?.]*\?(?!\S)
The pattern matches:
(?<!\S) Negative lookbehind, assert a whitespace boundary to the left
[A-Z] Match a single uppercase char A-Z
[^?.]*\? Match 0+ times any char except ? and . and then match a ?
(?!\S) Negative lookahead, assert a whitespace boundary to the right
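The whitespace-boundary pattern can be tried on the sample from the question with a short sketch like this one. The names QUESTION, text and found are illustrative only.

```python
import re

text = 'He knows me, and I know him. Do YOU know me? Hey?'

# A question starts with an uppercase letter at a whitespace boundary,
# contains no "." or "?", and ends with "?" at a whitespace boundary.
QUESTION = re.compile(r'(?<!\S)[A-Z][^?.]*\?(?!\S)')

found = QUESTION.findall(text)
print(found)
```

Because the character class excludes periods, a question that contains an abbreviation or a decimal number would not be matched by this pattern.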
You should use ^ at the beginning of your expression, so that it looks like this: "^[A-Z].*\?".
"Matches the beginning of the string, or the beginning of a line if the multiline flag (m) is enabled. This matches a position, not a character."
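If you have multiple sentences in your line you can use the following regex:
"(?<=\.\s)[A-Z].*?\?"
?<= is called positive lookbehind. Here it only accepts a sentence that has a period (\.) and a whitespace character right before it. The period must be escaped, and Python's re module needs a fixed-width lookbehind, so a single \s is used rather than \s+. The lazy .*? stops at the first question mark.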
I'd like to match a number, positive or negative, possibly with a currency sign in front. But I don't want something like PSM-9. My code is:
test='AAA PCSK-9, $111 -3,33'
re.findall(r'\b-?[$€£]?-?\d+[\d,.]*\b', test)
Output is: ['-9', '111', '3,33']
Could someone explain why -9 is matched? Thank you in advance.
Edit:
I don't want any part of PCSK-9 to be matched; it is the name of a product rather than a number. So my desired output is:
['111', '3,33']
This is because \b matches the gap between K and -, a word character and a non-word character. If you want to avoid matching - when it's preceded by a word character, you can use a negative lookbehind instead:
re.findall(r'[$€£]?(?:(?<!\w)-)?\d+[\d,.]*\b', test)
With your sample input, this returns:
['9', '$111', '-3,33']
The 9 is still matched on its own, since only the leading minus sign is rejected.
Demo: https://regex101.com/r/A66C5W/1
The word boundary matches between the K and the dash. After the dash is matched by the first -?, the next two parts, [$€£]? and -?, are optional because of the question mark, and then \d+ matches one or more digits. This results in the match -9.
What you might use instead of a word boundary is a pair of assertions, a negative lookbehind and a negative lookahead, that check that there is no non-whitespace character \S directly before or after the match.
(?<!\S)-?[$€£]?(\d+(?:[,.]\d+)?)(?!\S)
-9 is matched because - is a non-word character and K is a word character, so between them there is a word boundary \b, which is exactly what your regexp asks for.
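To see the difference side by side, the following sketch runs the original word-boundary pattern and the whitespace-boundary pattern on the test string from the question. The variable names with_word_boundary and with_space_boundary are made up for this example.

```python
import re

test = 'AAA PCSK-9, $111 -3,33'

# \b sits between "K" and "-", so the original pattern can start at the dash.
with_word_boundary = re.findall(r'\b-?[$€£]?-?\d+[\d,.]*\b', test)

# Requiring whitespace (or the string edge) on both sides rejects anything
# glued to other text. findall returns only the capture group, so the sign
# and the currency symbol are left out of the result.
with_space_boundary = re.findall(r'(?<!\S)-?[$€£]?(\d+(?:[,.]\d+)?)(?!\S)', test)

print(with_word_boundary)
print(with_space_boundary)
```

If the sign or the currency symbol should be kept, the capture group would have to be widened to include them.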
I am trying to find a pattern which allows me to find a four-digit year. But I do not want to get results in which the year is preceded by a month, e.g. "This is Jan 2009" should not give any result, but "This is 2009" should return 2009. I use findall with a lookbehind on Jan|Feb but I get 'an 2009' instead of an empty result. What am I missing? How can I do it?
Any otherwise matching string preceded by a string matching the negative lookbehind is not matched.
In your current regex, [a-z]* \d{4} matches "an 2009".
The negative lookbehind '(?<!Jan|Feb)' is checked at the position where that match starts, right after "This is J". That text does not end in Jan or Feb, so the lookbehind does not block the match.
If you remove '[a-z]*' from the regex, then no match will be returned on your test string.
To fix such problems:
First, write the match you want \d{4}
Then, write what you don't want (?<!Jan |Feb )
That is (?<!Jan |Feb )\d{4}
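A small sketch of the resulting pattern on the two strings from the question. The name YEAR is illustrative, and only Jan and Feb are covered, as in the answer. Both alternatives in the lookbehind are four characters wide, which Python's re module requires.

```python
import re

# Four digits that are not directly preceded by "Jan " or "Feb ".
YEAR = re.compile(r'(?<!Jan |Feb )\d{4}')

plain = YEAR.findall('This is 2009')
with_month = YEAR.findall('This is Jan 2009')

print(plain)
print(with_month)
```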
You may want to try this:
(?i)(?<!jan|feb)(?<!uary)\s+[0-9]*[0-9]
Hope it helps.
This generalized example should work for the cases you mentioned in your question above (edited to account for full month names):
INPUTS:
'This is 2009'
'This is Jan 2009'
REGEX:
re.findall(r'(?:\b[^A-Z][a-z]+\s)(\d{4})', text)
OUTPUTS:
['2009']
[]
EXPLANATION:
(?: ... ) is a non-capturing group, so the word it matches is not included in the output
\b asserts a word boundary
[^A-Z] matches one character that is not an uppercase letter, which rules out a capitalized word such as Jan
[a-z]+ matches one or more lowercase letters
\s matches a single whitespace character
(\d{4}) captures exactly four digits (\d with the quantifier {4})
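The generalized pattern can be exercised with a sketch like the one below. PATTERN and the result names are made up, and the third input is an extra case, not taken from the question, added to show one limitation: the pattern needs a word before the year.

```python
import re

PATTERN = re.compile(r'(?:\b[^A-Z][a-z]+\s)(\d{4})')

after_lowercase_word = PATTERN.findall('This is 2009')
after_month = PATTERN.findall('This is Jan 2009')
# Extra case for illustration: no preceding word, so nothing is found.
at_start = PATTERN.findall('2009 was good')

print(after_lowercase_word, after_month, at_start)
```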
Input is a two-sentence string:
s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'
I'd like to .split s into sentences based on the logic that:
sentences end with one or more periods, exclamation marks, question marks, or period+quotation mark
and are then followed by 1+ whitespace characters and a capitalized alpha character.
Desired result:
['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']
Also okay:
['Sentence 1 here', 'This sentence contains 1 fl. oz. but is one sentence.']
But I currently chop off the first character of the second sentence because the uppercase character is consumed by the pattern:
import re
END_SENT = re.compile(r'[.!?(.")]+[ ]+[A-Z]')
print(END_SENT.split(s))
['Sentence 1 here', 'his sentence contains 1 fl. oz. but is one sentence.']
Notice the missing T. How can I tell .split to ignore certain elements of the compiled pattern?
((?<=[.!?])|(?<=\.\")) +(?=[A-Z])
Although I would suggest the pattern below, which also splits after a quote that follows any of .!?
((?<=[.!?])|(?<=[.!?]\")) +(?=[A-Z])
Explanation
The common stuff in both +(?=[A-Z])
' +' #One or more spaces(The actual splitting chars used.)
(?= #START positive look ahead, check that this follows, but do not consume it
[A-Z] #Any capitalized alphabet
) #END positive look ahead
The conditions for what comes before the space
For Solution1
( #GROUP START
(?<= #START Positive look behind, Make sure this comes before but do not consume
[.!?] #any one of these chars should come before the splitting space
) #END positive look behind
| #OR condition this is also the reason we had to put all this in GROUP
(?<= #START Positive look behind,
\.\" #splitting space may be preceded by .", a case not covered by the previous set of . or ! or ?
) #END positive look behind
) #END GROUP
For Solution2
( #GROUP START
(?<=[.!?]) #Same as the previous look behind
| #OR condition
(?<=[.!?]\") #Only difference here is that we are allowing quote after any of . or ! or ?
) #GROUP END
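One caveat when using these patterns with Python's re.split: the text matched by a capturing group is added to the result list, so the outer ( ... ) would put an empty string between the sentences. The sketch below therefore swaps in a non-capturing (?: ... ) group. That change is an adaptation for Python, not part of the patterns above, and the quoted sample string is made up for illustration.

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# Capturing version: the (empty) group text shows up in the result.
with_group = re.split(r'((?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])', s)

# Non-capturing version: only the sentences are returned.
END_SENT = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])')
sentences = END_SENT.split(s)

# Made-up sample with a closing quote after the period.
quoted = END_SENT.split('He said "Stop." Then he left.')

print(with_group)
print(sentences)
print(quoted)
```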
It's easier to describe the sentence than trying to identify the delimiter. So instead of re.split try with re.findall:
re.findall(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])', s)
To preserve the next uppercase letter, the pattern uses a lookahead that is only a test and doesn't consume characters.
details:
( # capture group: re.findall returns only the capture group content if any
[^.?!\s] # the first character isn't a space or a punctuation character
.*? # a non-greedy quantifier
[.?!]* # optional trailing punctuation characters
)
\s* # zero or more white-spaces
(?![^A-Z]) # not followed by a character that isn't an uppercase letter
# (so what follows is either an uppercase letter or the end of the string)
Obviously, for more complicated cases with abbreviations, names, etc., you have to use tools like nltk or any other nlp tools trained with dictionaries.
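Here is a minimal sketch of the findall approach on the string from the question, followed by a made-up second input that shows the kind of case the remark about names refers to. SENTENCE and the result names are illustrative.

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# Describe a sentence: starts with a non-space, non-punctuation character,
# runs lazily up to optional punctuation, and must be followed by an
# uppercase letter or the end of the string.
SENTENCE = re.compile(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])')

sentences = SENTENCE.findall(s)
print(sentences)

# Made-up input: a capitalized name in mid-sentence also ends a "sentence".
with_name = SENTENCE.findall('I met John. He left.')
print(with_name)
```

The second result is split before the name, which is why capitalized words inside a sentence need a dictionary-based tool or extra rules.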
I want a regex in Python which extracts one or multiple occurrences of words starting with capital letters, unless the word is the first word of a sentence. I know it's not a robust and consistent method but it'll solve my problem as I don't want to use any statistical method (e.g. as in NLTK or StanfordNER).
Examples:
extract('His name is John Wayne.')
should return ['John Wayne'].
extract('He is The President of Neverland.')
should return ['The President', 'Neverland'] because they are capitalized words and they don't occur at the beginning of a sentence.
another example:
extract('He came home. Although late, it was nice to have Patrick there.')
should return ['Patrick'] because 'He' and 'Although' occur at the beginning of a sentence.
Also it could drop punctuation for example 'He was John, who came' should return 'John' and not 'John,'.
You can use this expression for this task:
(?<!\.\s)(?!^)\b([A-Z]\w*(?:\s+[A-Z]\w*)*)
RegEx Breakup:
(?<!\.\s) - Negative lookbehind to assert we don't have a DOT and space before
(?!^) - Negative lookahead to assert we are not at start
\b - Word boundary
( - Start capturing group
[A-Z]\w* - Match a word starting with a capital letter
(?: - Start non-capturing group
\s+ - Match 1 or more whitespaces
[A-Z]\w* - Match a capital letter word
)* - End non-capturing group. Match 0 or more of these
) - End capturing group
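Wrapped in the extract function that the question asks for, the expression could be used as in the sketch below. The function body and the name CAPITALIZED are assumptions made for illustration; the inputs are the examples from the question.

```python
import re

# Capitalized word sequences that are neither at the start of the string
# nor directly after a period and a whitespace character.
CAPITALIZED = re.compile(r'(?<!\.\s)(?!^)\b([A-Z]\w*(?:\s+[A-Z]\w*)*)')


def extract(text):
    """Return runs of capitalized words that do not start a sentence."""
    return CAPITALIZED.findall(text)


print(extract('His name is John Wayne.'))
print(extract('He is The President of Neverland.'))
print(extract('He came home. Although late, it was nice to have Patrick there.'))
print(extract('He was John, who came'))
```

Since \w does not match punctuation, the trailing comma after John is left out without any extra handling.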