I am trying to find a pattern that matches a four-digit year, but I do not want a match when the year is preceded by a month. For example, "This is Jan 2009" should give no result, while "This is 2009" should return 2009. I use findall with a lookbehind on Jan|Feb, but I get 'an 2009' instead of an empty result. What am I missing, and how can I do it?
A negative lookbehind rejects an otherwise valid match when the text immediately before the match position matches the lookbehind pattern.
In your current regex, [a-z]* \d{4} matches "an 2009".
That match starts right after "This is J", which does not end in Jan or Feb, so the negative lookbehind '(?<!Jan|Feb)' does not reject it.
If you remove '[a-z]*' from the regex, then no match will be returned on your test string.
To fix such problems:
First, write the match you want \d{4}
Then, write what you don't want (?<!Jan |Feb )
That is (?<!Jan |Feb )\d{4}
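A minimal sketch of the pattern above, run against the two test strings from the question. The helper name find_years is made up for illustration and is not part of the original answer.

```python
import re

# Four digits not immediately preceded by "Jan " or "Feb ".
# Both lookbehind alternatives are four characters wide, which Python's
# re module needs (it only accepts fixed-width lookbehinds).
YEAR = re.compile(r'(?<!Jan |Feb )\d{4}')

def find_years(text):
    """Hypothetical helper: return four-digit runs that do not follow 'Jan ' or 'Feb '."""
    return YEAR.findall(text)

print(find_years("This is 2009"))      # ['2009']
print(find_years("This is Jan 2009"))  # []
```

More month abbreviations can be added to the lookbehind as long as every alternative keeps the same width.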
You may want to try this:
(?i)(?<!jan|feb)(?<!uary)\s+[0-9]*[0-9]
Hope it helps.
This generalized example should work for the cases you mentioned in your question above (edited to account for full month names):
INPUTS:
'This is 2009'
'This is Jan 2009'
REGEX:
re.findall(r'(?:\b[^A-Z][a-z]+\s)(\d{4})', text)
OUTPUTS:
['2009']
[]
EXPLANATION:
?: marks a non-capturing group, so it is not included in the output
\b asserts a word boundary
[^A-Z] matches one character that is not a capital letter
[a-z]+ matches one or more lowercase letters
\s matches any whitespace character
(\d{4}) is a capturing group that matches a digit (\d) exactly four times ({4})
Related
I have a regex pattern as follows:
r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+'
and I am trying to modify it so that it only matches the dot at the end of sentences and not the letter before it. Here is my string:
sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'
and here is what i have done:
import re
re.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+', sent)
however what happens is that it removes the last letter of the words:
current output:
['This is the U.A. we have r.a.d. golden 13.56 dat',' a better date 34. was ther',
'']
my desired output is:
['This is the U.A. we have r.a.d. golden 13.56 date',' a better date 34. was there',
'']
I do not know how to modify the pattern to keep the last letter of the words 'date' and 'there'.
Your pattern can be reduced to and fixed as
(?<=(?<![.\s])[a-zA-Z])\.
See the regex demo.
If you need to also match multiple dots, put back + after the \..
Details:
(?<=(?<![.\s])[a-zA-Z]) - a positive lookbehind that matches a location that is immediately preceded with
(?<![.\s]) - a negative lookbehind that fails the match if there is a . or whitespace immediately to the left of the current location
[a-zA-Z] - an ASCII letter
\. - a literal dot.
Your pattern is basically an alternation of two patterns, (?<!\.|\s)[a-z]\. and (?<!\.|\s)[A-Z]\., which differ only in [a-z] versus [A-Z], so the alternation can be shortened to (?<!\.|\s)[a-zA-Z]\. The [a-zA-Z] must then go into a non-consuming pattern so that the letters are not eaten up when splitting, which makes a positive lookbehind the natural solution.
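As a quick check, here is a small sketch that applies the lookbehind-based pattern to the string from the question. The name SENTENCE_DOT is only an illustrative choice.

```python
import re

sent = 'This is the U.A. we have r.a.d. golden 13.56 date. a better date 34. was there.'

# Split on a dot only when the letter before it is itself not preceded by
# a dot or whitespace. The letter sits inside a lookbehind, so it is not
# consumed and stays in the output.
SENTENCE_DOT = re.compile(r'(?<=(?<![.\s])[a-zA-Z])\.')

parts = SENTENCE_DOT.split(sent)
print(parts)
# ['This is the U.A. we have r.a.d. golden 13.56 date', ' a better date 34. was there', '']
```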
I don't understand why this only gives 125, the first number. Why does it not give all positive numbers in the string? My goal is to extract all positive numbers.
import re
pattern = re.compile(r"^[+]?\d+")
text = "125 -898 8969 4788 -2 158 -947 599"
matches = pattern.finditer(text)
for match in matches:
    print(match)
Try using the regular expression
-\d+|(\d+)
Disregard the matches. The strings representing non-negative integers are saved in capture group 1.
Demo
The idea is to match but not save to a capture group what you don't want (negative numbers), and both match and save to a capture group what you do want (non-negative numbers).
The regex attempts to match -\d+. If that succeeds the regex engine's internal string pointer is moved to just after the last digit matched. If -\d+ is not matched an attempt is made to match the second part of the alternation (following |). If \d+ is matched the match is saved to capture group 1.
Any plus signs in the string can be disregarded.
For a fuller description of this technique see The Greatest Regex Trick Ever. (Search for "Tarzan"|(Tarzan) to get to the punch line.)
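A minimal sketch of this capture-group technique in Python, using the string from the question. The variable name non_negative is an assumption made for the example.

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# Negative numbers are consumed by the left branch and leave group 1 unset;
# only digit runs without a leading minus reach the capturing right branch.
pattern = re.compile(r'-\d+|(\d+)')

# Keep group 1 only for the matches where it actually participated.
non_negative = [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]
print(non_negative)  # ['125', '8969', '4788', '158', '599']
```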
The following pattern will only match non negative numbers:
pattern = re.compile(r"(?:^|[^\-\d])(\d+)")
pattern.findall(text)
OUTPUT
['125', '8969', '4788', '158', '599']
For the sake of completeness another idea by use of \b and a lookbehind.
\b(?<!-)\d+
See this demo at regex101
Your pattern ^[+]?\d+ is anchored at the start of the string, and will give only that match at the beginning.
Another option is to assert a whitespace boundary to the left, and match the optional + followed by 1 or more digits.
(?<!\S)\+?\d+\b
(?<!\S) Assert a whitespace boundary to the left
\+? Match an optional +
\d+\b Match 1 or more digits followed by a word boundary
Regex demo
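A short sketch of this whitespace-boundary pattern with finditer replaced by findall. The second input string, with an explicit plus sign, is an assumed extra test case that does not come from the question.

```python
import re

# Optional "+" and digits, but only when nothing except whitespace
# (or the start of the string) sits directly to the left.
POSITIVE = re.compile(r'(?<!\S)\+?\d+\b')

text = "125 -898 8969 4788 -2 158 -947 599"
print(POSITIVE.findall(text))        # ['125', '8969', '4788', '158', '599']

# Assumed extra input showing that an explicit plus sign is kept.
print(POSITIVE.findall("+7 -3 12"))  # ['+7', '12']
```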
Use , to separate the numbers in the string.
I'm using re to take the questions from a text. I just want the sentence with the question, but it's taking multiple sentences before the question as well. My code looks like this:
match = re.findall("[A-Z].*\?", data2)
print(match)
an example of a result I get is:
'He knows me, and I know him. Do YOU know me? Hey?'
the two questions should be separated and the non question sentence shouldn't be there. Thanks for any help.
The . character in regex matches any character (except a newline by default), including periods, which you don't want to include. Why not simply match anything besides the sentence-ending punctuation?
questions = re.findall(r"\s*([^\.\?]+\?)", data2)
# \s* sentence beginning space to ignore
# ( start capture group
# [^\.\?]+ negated character class matching anything besides "." and "?" (one or more)
# \? question mark to end sentence
# ) end capture group
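Here is a small runnable sketch of that idea. The value of data2 is assumed to be the sample result quoted in the question, since the real input text is not shown.

```python
import re

# Assumed sample input, taken from the result quoted in the question.
data2 = "He knows me, and I know him. Do YOU know me? Hey?"

# Capture runs that contain neither "." nor "?" and that end in "?".
questions = re.findall(r'\s*([^.?]+\?)', data2)
print(questions)  # ['Do YOU know me?', 'Hey?']
```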
You could look for letters, digits, and whitespace that end with a '?'.
>>> [i.strip() for i in re.findall('[\w\d\s]+\?', s)]
['Do YOU know me?', 'Hey?']
There would still be some edge cases to handle, like there could be punctuation like a ',' or other complexities.
You can use
(?<!\S)[A-Z][^?.]*\?(?!\S)
The pattern matches:
(?<!\S) Negative lookbehind, assert a whitespace boundary to the left
[A-Z] Match a single uppercase char A-Z
[^?.]*\? Match 0+ times any char except ? and . and then match a ?
(?!\S) Negative lookahead, assert a whitespace boundary to the right
Regex demo
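A quick sketch of how this pattern behaves on the sample from the question. The names QUESTION and sample are illustrative only.

```python
import re

# A capital letter at a whitespace boundary, then anything except "?" and ".",
# then a "?" that is followed by whitespace or the end of the string.
QUESTION = re.compile(r'(?<!\S)[A-Z][^?.]*\?(?!\S)')

# Assumed sample input, taken from the result quoted in the question.
sample = "He knows me, and I know him. Do YOU know me? Hey?"
found = QUESTION.findall(sample)
print(found)  # ['Do YOU know me?', 'Hey?']
```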
You should use the ^ at the beginning of your expression so your regex expression should look like this: "^[A-Z].*\?".
"Matches the beginning of the string, or the beginning of a line if the multiline flag (m) is enabled. This matches a position, not a character."
If you have multiple sentences in your line you can use the following regex:
"(?<=\.\s)[A-Z].*\?"
?<= is called a positive lookbehind. Here it requires a period (\.) and one whitespace character immediately before the capital letter. Python's re module only accepts fixed-width lookbehinds, so the lookbehind uses a single \s rather than \s+.
I have a pattern which looks like:
abc*_def(##)
and I want to check whether it matches some strings.
E.g. it matches:
abc1_def23
abc10_def99
but does not match for:
abc9_def9
So the * stands for a number which can have one or more digits.
The # stands for a number with one digit
I want the value in the parenthesis as result
What would be the easiest and simplest solution for this problem?
Replace the * and # with regex expressions and then check whether the strings match?
Like this:
pattern = pattern.replace('*', '[0-9]*')
pattern = pattern.replace('#', '[0-9]')
pattern = '^' + pattern + '$'
Or program it myself?
Based on your requirements, I would go for a regex for the simple reason it's already available and tested, so it's easiest as you were asking.
The only "complicated" thing in your requirements is avoiding, after def, the same digits you have after abc.
This can be done with a negative lookahead that contains a backreference. The regex you can use is:
\babc(\d+)_def((?!\1)\d{1,2})\b
\b matches word boundaries; if you enclose your regex between two \b you restrict your search to words, i.e. text delimited by spaces, punctuation etc.
abc matches the string abc
\d+ matches one or more digits; if there is an upper limit to the number of digits you want, use \d{1,MAX} where MAX is your maximum number of digits; \d stands for a digit and + indicates 1 or more repetitions
(\d+) is a group: the parentheses mark \d+ as something you want to "remember" inside your regex; it's somewhat similar to defining a variable; in this case, (\d+) is your first group since no other group is defined before it (i.e. to its left)
_def matches the string _def
(?!\1) is the part where you say "I don't want to repeat the first group after _def". \1 represents the first group, while (?!whatever) is a check that succeeds if what follows the current position is NOT (the negation is given by !) whatever you want to negate.
Live demo here.
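A minimal sketch of this regex applied to the three strings from the question. The helper name def_value is hypothetical and only wraps re.search for the example.

```python
import re

# (?!\1) rejects a second number that starts with the digits captured after "abc".
PATTERN = re.compile(r'\babc(\d+)_def((?!\1)\d{1,2})\b')

def def_value(s):
    """Hypothetical helper: return the digits after _def, or None if there is no match."""
    m = PATTERN.search(s)
    return m.group(2) if m else None

for s in ['abc1_def23', 'abc10_def99', 'abc9_def9']:
    print(s, def_value(s))
# abc1_def23 23
# abc10_def99 99
# abc9_def9 None
```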
I had the hardest time getting this to work. The trick was the $
import re
yourlist = ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc955_def9', 'abc_def9', 'abc9_def9288', 'abc49_def9234']
for item in yourlist:
    # $ anchors the two digits to the end of the string
    if re.search(r'abc[0-9]+_def[0-9][0-9]$', item):
        print(item, 'is a match')
You could match your pattern like:
abc\d+_def(\d{2})
abc Match literally
\d+ Match 1 or more digits
_ Match underscore
def Match literally
( Capturing group (Your 2 digits will be in this group)
\d{2} Match 2 digits
) Close capturing group
Then you could, for example, use search to check for a match and .group(1) to get the digits captured by the parentheses.
Demo Python
You could also add word boundaries:
\babc\d+_def(\d{2})\b
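The search and .group(1) steps could look like the following sketch, using the word-boundary version of the pattern. The helper name two_digits is an assumption made for the example.

```python
import re

# abc, one or more digits, _def, then exactly two captured digits.
PATTERN = re.compile(r'\babc\d+_def(\d{2})\b')

def two_digits(s):
    """Hypothetical helper: return the two digits after _def, or None if there is no match."""
    m = PATTERN.search(s)
    return m.group(1) if m else None

print(two_digits('abc1_def23'))   # 23
print(two_digits('abc10_def99'))  # 99
print(two_digits('abc9_def9'))    # None
```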
What regex would match any string containing daily, but not when daily is preceded by m?
For example, it should match the following strings:
beta.daily
abcdaily
dailyabc
daily
But it must not match
mdaily or
abcmdaily or
mdailyabc
I have tried the following and other regexes, but failed each time:
r'[^m]daily': it doesn't match daily on its own
r'[^m]?daily': it matches strings containing mdaily, which is not intended
Just add a negative lookbehind, (?<!m), before daily:
(?<!m)daily
The zero width negative lookbehind, (?<!m), makes sure daily is not preceded by m.
Demo
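A small sketch that checks this lookbehind against the strings listed in the question. The helper name has_daily is illustrative only.

```python
import re

# "daily" that is not immediately preceded by "m".
DAILY = re.compile(r'(?<!m)daily')

def has_daily(s):
    """Hypothetical helper: True if s contains daily without an m right before it."""
    return DAILY.search(s) is not None

for s in ['beta.daily', 'abcdaily', 'dailyabc', 'daily']:
    print(s, has_daily(s))   # True for each

for s in ['mdaily', 'abcmdaily', 'mdailyabc']:
    print(s, has_daily(s))   # False for each
```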