I want to match a pattern like '2 years', '4 days' in a text, and meanwhile want to avoid a pattern like '2 years old', i.e., I don't want a 'old' following 'years'. I thought a negative lookahead (?!old) would help. But I don't know how to do it. I tried
r=regex.compile(r'\b(\d+)\s*(years?|months?|days?)\s*(?!old)\b')
but it still matches the '2 years' in '2 years old'.
For a full match you can omit the capture groups. If at least one whitespace char must separate the digits from the unit, use \s+ (one or more) instead of \s*.
To prevent partial matches, you can use word boundaries \b
\b\d+\s+(?:year|month|day)s?\b(?!\s+old\b)
The pattern matches
\b\d+\s+ A word boundary, match 1+ digits and 1+ whitespace chars
(?:year|month|day)s?\b Match any of the alternatives and optional s
(?!\s+old\b) Negative lookahead, assert not 1+whitespace chars followed by old and a word boundary to the right
See a regex demo
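As a quick check of the pattern above, here is a minimal sketch using Python's built-in re module (the question uses the third-party regex module, which should behave the same for this pattern). The sample sentence is made up for illustration.

```python
import re

# Pattern from the answer above; the sample text is a made-up illustration.
pattern = re.compile(r"\b\d+\s+(?:year|month|day)s?\b(?!\s+old\b)")
text = "She is 2 years old, moved 4 days ago and will stay 3 months."

matches = pattern.findall(text)
print(matches)  # ['4 days', '3 months']
```

'2 years' is skipped because the lookahead sees ' old' right after it, and the trailing \b stops the engine from falling back to '2 year'.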
Put \s* inside the lookahead:
r'\b(\d+)\s*(years?|months?|days?)(?!\s*old)\b'
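As far as I understand, your regexp matched \s* zero times for the 2 years old case. After backtracking, the lookahead (?!old) is tested directly after 2 years, where the remaining text is a space followed by old, so the assertion succeeds and \b matches at the end of years. With \s* inside the lookahead, the space can no longer be skipped that way.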
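The difference between the two patterns can be sketched as follows, assuming the built-in re module in place of regex. The variable names are only for illustration.

```python
import re

# Pattern from the question: \s* sits outside the lookahead.
original = re.compile(r"\b(\d+)\s*(years?|months?|days?)\s*(?!old)\b")
# Pattern from this answer: \s* moved inside the lookahead.
fixed = re.compile(r"\b(\d+)\s*(years?|months?|days?)(?!\s*old)\b")

sample = "2 years old"
original_match = original.search(sample)
fixed_match = fixed.search(sample)

print(original_match.group())            # 2 years
print(fixed_match)                       # None
print(fixed.search("2 years").group())   # 2 years
```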
I'm using re to take the questions from a text. I just want the sentence with the question, but it's taking multiple sentences before the question as well. My code looks like this:
match = re.findall("[A-Z].*\?", data2)
print(match)
an example of a result I get is:
'He knows me, and I know him. Do YOU know me? Hey?'
the two questions should be separated and the non question sentence shouldn't be there. Thanks for any help.
The . character in a regex matches any character (except a newline by default), including periods, which you don't want to include. Why not simply match anything besides the sentence-ending punctuation?
questions = re.findall(r"\s*([^\.\?]+\?)", data2)
# \s* sentence beginning space to ignore
# ( start capture group
# [^\.\?]+ negated character class matching anything besides "." and "?" (one or more)
# \? question mark to end sentence
# ) end capture group
You could look for letters, digits, and whitespace that end with a '?'.
>>> [i.strip() for i in re.findall(r'[\w\d\s]+\?', s)]
['Do YOU know me?', 'Hey?']
There are still edge cases to handle: for example, a question that contains punctuation such as ',' would be cut off at that character.
You can use
(?<!\S)[A-Z][^?.]*\?(?!\S)
The pattern matches:
(?<!\S) Negative lookbehind, assert a whitespace boundary to the left
[A-Z] Match a single uppercase char A-Z
[^?.]*\? Match 0+ times any char except ? and . and then match a ?
(?!\S) Negative lookahead, assert a whitespace boundary to the right
Regex demo
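A minimal sketch of this pattern applied to the example sentence from the question, using the question's variable name data2:

```python
import re

data2 = "He knows me, and I know him. Do YOU know me? Hey?"

# Whitespace boundary, uppercase letter, anything except "?" and ".", then "?".
pattern = re.compile(r"(?<!\S)[A-Z][^?.]*\?(?!\S)")

questions = pattern.findall(data2)
print(questions)  # ['Do YOU know me?', 'Hey?']
```

The first sentence is skipped because [^?.]* cannot cross its closing period.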
You should use the ^ at the beginning of your expression so your regex expression should look like this: "^[A-Z].*\?".
"Matches the beginning of the string, or the beginning of a line if the multiline flag (m) is enabled. This matches a position, not a character."
If you have multiple sentences in your line you can use the following regex:
"(?<=.\s+)[A-Z].*\?"
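(?<=...) is called a positive lookbehind. Together with the ^ version above, we try to find sentences which either start on a new line or have a period (.) and one or more whitespace characters before them. Two caveats: the dot should be escaped as \. to match a literal period, and Python's built-in re module rejects \s+ inside a lookbehind because it requires a fixed-width pattern (the third-party regex module accepts it).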
I am trying to create a regular expression that finds sentences with a minimum length.
Really my conditions are:
there must at least be 5 words in a sequence
words in sequence must be distinct
sequence must be followed by some punctuation character.
So far I have tried
^(\b\w*\b\s?){5,}\s?[.?!]$
If my sample text is:
This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in.
I would like to match on strings 1 and 5.
I am using the Python re library. I am using regex101 to test, and it appears the regex I have above is doing quite a bit of work with regard to backtracking, so I imagine those knowledgeable in regex may be a bit appalled (my apologies).
You can use the following regex to identify the strings that meet all three conditions:
^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$
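with the case-insensitive flag set (and the multiline flag as well if the text contains several lines, so that ^ and $ match at each line).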
Demo
Python's regex engine performs the following operations.
^                  # match beginning of line
(?!                # begin negative lookahead
  .*               # match 0+ chars
  \b(\w+)\b        # match a word in cap grp 1
  .+               # match 1+ chars
  \b\1\b           # match the contents of cap grp 1 with word boundaries
)                  # end negative lookahead
(?:                # begin non-cap grp
  .*               # match 0+ chars
  \b\w+\b          # match a word
)                  # end non-cap grp
{5}                # execute non-cap grp 5 times
.*                 # match 0+ chars
[.?!]              # match a punctuation char
\s*                # match 0+ whitespaces
$                  # match end of line
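The regex can be tried against the sample text from the question with a short sketch like the one below. Testing line by line (rather than using the multiline flag on the whole text) is a choice made here for simplicity, and the names rx, sample and accepted are illustrative.

```python
import re

# Regex from this answer, compiled case-insensitively.
rx = re.compile(
    r"^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$",
    re.IGNORECASE,
)

sample = """This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in."""

# Keep only the lines that satisfy all three conditions.
accepted = [line for line in sample.splitlines() if rx.search(line)]
for line in accepted:
    print(line)
```

Only the first and fifth lines are printed, which are the two strings the question wants to match.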
Items 1. and 3. are easily done by regex, but
2. words in sequence must be distinct
I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.
I recommend splitting the string on the space character and checking it word by word. Quicker, no sweat.
Edit: it can be done with a lookahead, as Cary said.
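The split-and-check approach suggested in this answer could look roughly like the sketch below. The function name is_distinct_sentence, the min_words parameter and the case-insensitive comparison are assumptions made for illustration, not part of the original answer.

```python
def is_distinct_sentence(line, min_words=5):
    """Return True if line ends in . ? or ! and has >= min_words distinct words."""
    line = line.strip()
    if not line or line[-1] not in ".?!":
        return False
    # Drop the final punctuation char, split on whitespace, compare case-insensitively.
    words = [word.lower() for word in line[:-1].split()]
    return len(words) >= min_words and len(set(words)) == len(words)


sample = [
    "This is a sentence I would like to parse.",
    "This is too short.",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]

accepted = [line for line in sample if is_distinct_sentence(line)]
for line in accepted:
    print(line)
```

This sketch only strips the final punctuation character, so commas or quotes inside a sentence would stay attached to their words.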
I want to remove words with numbers. After research I understood that
s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub("\S*\d\S*", "", s).strip()
This code solves my problem.
However, I don't understand how it works. I know a little regex: \d matches any digit [0-9], \S is for white spaces, and * means 0 or more occurrences of the pattern to its left.
"\S*\d\S*"
This is the part I can't follow. In particular, I don't see how it identifies AB55.
Can anyone please explain it to me? Thanks
This replaces every run of non-space characters that contains at least one digit with the empty string "".
Because \S* is greedy, the first \S* takes as much as it can and backtracks only far enough to leave one digit for \d:
AB55 : AB5 is \S*, 5 is \d, empty string is \S*
55CD : 5 is \S*, 5 is \d, CD is \S*
A55D : A5 is \S*, 5 is \d, D is \S*
5555 : 555 is \S*, 5 is \d, empty string is \S*
re.sub("\S*\d\S*", "", s) replaces all of these substrings with the empty string "", and .strip() then removes the whitespace left at the beginning and end of the result.
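To see which part of each word every piece of the pattern consumes, one option is to add capturing groups around the three pieces. This is a small sketch; the grouped pattern and the names parts and cleaned are only for illustration.

```python
import re

s = "ABCD abcd AB55 55CD A55D 5555"

# Same pattern with each piece wrapped in a group to expose what it consumed.
parts = [m.groups() for m in re.finditer(r"(\S*)(\d)(\S*)", s)]
print(parts)
# [('AB5', '5', ''), ('5', '5', 'CD'), ('A5', '5', 'D'), ('555', '5', '')]

cleaned = re.sub(r"\S*\d\S*", "", s)
print(repr(cleaned))          # spaces are left where the words were
print(repr(cleaned.strip()))  # 'ABCD abcd'
```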
You misunderstand the code. \S is the opposite of \s: it matches with everything except whitespace.
Since the Kleene star (*) is greedy, it aims to match as many non-space characters as possible, followed by a digit, followed by as many non-space characters as possible. It will thus match a full word in which at least one character is a digit.
All these matches are then replaced by the empty string, and therefore removed from the original string.
Your code first matches 0+ non-whitespace chars with \S* (it is \s* that matches whitespace chars), which runs all the way to the end of the "word". Then it backtracks to match a digit and again matches 0+ non-whitespace chars.
The pattern will for example also match a single digit.
You could slightly optimize the pattern by first matching chars that are neither whitespace nor a digit, [^\s\d]*, using a negated character class. This prevents the first \S* from matching the whole word and then backtracking.
[^\s\d]*\d\S*
Regex demo
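A short sketch comparing the two patterns on the string from the question; both should find the same words and only differ in how much backtracking is needed. The names original and optimized are illustrative.

```python
import re

s = "ABCD abcd AB55 55CD A55D 5555"

original = re.compile(r"\S*\d\S*")
optimized = re.compile(r"[^\s\d]*\d\S*")

original_words = original.findall(s)
optimized_words = optimized.findall(s)
print(original_words)   # ['AB55', '55CD', 'A55D', '5555']
print(optimized_words)  # ['AB55', '55CD', 'A55D', '5555']
```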
You say that \S is for white spaces, but it is not, and that is why the regex confuses you.
This is what the Python documentation says about \s and \S:
\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
If you use \s instead, which does match whitespace characters,
you'll get output like this:
>>> import re
>>>
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub("\s*\d\s*", "", s).strip()
'ABCD abcd ABCD AD'
Why does this regex work in Python but not in Ruby:
/(?<!([0-1\b][0-9]|[2][0-3]))/
Would be great to hear an explanation and also how to get around it in Ruby
EDIT w/ the whole line of code:
re.sub(r'(?<!([0-1\b][0-9]|[2][0-3])):(?!([0-5][0-9])((?i)(am)|(pm)|(a\.m)|(p\.m)|(a\.m\.)|(p\.m\.))?\b)' , ':\n' , s)
Basically, I'm trying to add '\n' when there is a colon and it is not a time.
Ruby's regex engine doesn't allow capturing groups in negative lookbehinds.
If you need grouping, you can use a non-capturing group (?:):
[8] pry(main)> /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
SyntaxError: (eval):2: invalid pattern in look-behind: /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
[8] pry(main)> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
=> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
Docs:
(?<!subexp) negative look-behind
Subexp of look-behind must be fixed-width.
But top-level alternatives can be of various lengths.
ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
In negative look-behind, capturing group isn't allowed,
but non-capturing group (?:) is allowed.
Learned from this answer.
According to the Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. Although this restriction is common among regex engines, not all of them treat it as an error, hence the difference you see between the re and Onigmo regex libraries.
Now, as for your regex, it does not work correctly in either Ruby or Python: the \b inside a character class in a Python or Ruby regex matches a BACKSPACE (\x08) char, not a word boundary. Moreover, when you use a word boundary after an optional non-word char, then if that char appears in the string, a word char must appear immediately to the right of it. The word boundary must be moved to right after m, before \.?.
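The point about \b inside a character class can be checked with a few lines of Python. This is a small illustration using the built-in re module.

```python
import re

# Inside a character class, \b means the backspace character (\x08).
backspace_match = re.fullmatch(r"[\b]", "\x08")
print(bool(backspace_match))  # True

# It does not act as a word boundary there, so nothing matches in "a b".
print(re.search(r"[\b]", "a b"))  # None

# Outside a character class, \b is the usual word boundary.
boundary_match = re.search(r"\bb", "a b")
print(bool(boundary_match))  # True
```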
Another flaw with the current approach is that lookbehinds are not the best to exclude certain contexts like here. E.g. you can't account for a variable amount of whitespaces between the time digits and am / pm. It is better to match the contexts you do not want to touch and match and capture those you want to modify. So, we need two main alternatives here, one matching am/pm in time strings and another matching them in all other contexts.
Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.
Regex demo
\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?
\b - word boundary
((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
(?:[01]?[0-9]|2[0-3]) - an optional 0 or 1 and then any digit or 2 and then a digit from 0 to 3
:[0-5][0-9] - : and then a number from 00 to 59
\s* - 0+ whitespaces
[pa]\.?m\b\.? - a or p, an optional dot, m, a word boundary, an optional dot
| - or
\b[ap]\.?m\b\.? - word boundary, a or p, an optional dot, m, a word boundary, an optional dot
Python fixed solution:
import re
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)
Ruby solution:
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
result = text.gsub(rx) { $1 || "\n" }
Output:
"\n \n \n 10:56pm 10:43 a.m."
#mrzasa has certainly found the problem.
But, taking a guess at your intent to replace a non-time colon with ':\n',
it could be done like this. It does a little whitespace trimming as well.
(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*(?![0-5][0-9](?:[ap]\.?m\b\.?)?)
PCRE - https://regex101.com/r/7TxbAJ/1 Replace $1\n
Python - https://regex101.com/r/w0oqdZ/1 Replace \1\n
Readable version
(?i)
(?<!
\b [01] [0-9]
)
(?<!
\b [2] [0-3]
)
( # (1 start)
[^\S\r\n]*
:
) # (1 end)
[^\S\r\n]*
(?!
[0-5] [0-9]
(?: [ap] \.? m \b \.? )?
)
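A minimal Python sketch of this replacement, using the pattern above with the \1\n replacement. The sample text is made up for illustration.

```python
import re

# Pattern from this answer, split over two string literals for readability.
pattern = re.compile(
    r"(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*"
    r"(?![0-5][0-9](?:[ap]\.?m\b\.?)?)"
)

text = "Agenda: start at 10:30 pm, end at 23:15"
result = pattern.sub(r"\1\n", text)
print(repr(result))  # 'Agenda:\nstart at 10:30 pm, end at 23:15'
```

The colon after "Agenda" gets a newline (and the following space is trimmed), while the colons inside 10:30 and 23:15 are left alone by the lookbehinds.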
I am trying to find a pattern which allows me to find a four-digit year. But I do not want to get results in which the year is preceded by a month, e.g. "This is Jan 2009" should not give any result, but "This is 2009" should return 2009. I use findall with a lookbehind at Jan|Feb, but I get 'an 2009' instead of blank. What am I missing? How can I do it?
Any otherwise matching string preceded by a string matching the negative lookbehind is not matched.
In your current regex, [a-z]* \d{4} matches "an 2009".
The negative lookbehind '(?<!Jan|Feb)' does not match the "This is J" part, so it is not triggered.
If you remove '[a-z]*' from the regex, then no match will be returned on your test string.
To fix such problems:
First, write the match you want \d{4}
Then, write what you don't want (?<!Jan |Feb )
That is (?<!Jan |Feb )\d{4}
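The resulting pattern can be checked against the two strings from the question with a short sketch. Note that Python's re module needs every alternative in a lookbehind to have the same width, which holds here because 'Jan ' and 'Feb ' are both four characters.

```python
import re

pattern = re.compile(r"(?<!Jan |Feb )\d{4}")

plain = pattern.findall("This is 2009")
with_month = pattern.findall("This is Jan 2009")
print(plain)       # ['2009']
print(with_month)  # []
```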
You may want to try this:
(?i)(?<!jan|feb)(?<!uary)\s+[0-9]*[0-9]
Hope it helps.
This generalized example should work for the cases you mentioned in your question above (edited to account for full month names):
INPUTS:
'This is 2009'
'This is Jan 2009'
REGEX:
re.findall(r'(?:\b[^A-Z][a-z]+\s)(\d{4})', text)
OUTPUTS:
['2009']
[]
EXPLANATION:
?: indicates a non-capturing group, therefore it will not be included in the output
\b asserts a word boundary
[^A-Z] matches a single character that is not an uppercase letter, so the preceding word cannot start with a capital
[a-z]+ matches one or more lowercase letters
\s matches a single whitespace character
(\d{4}) captures exactly four digits (\d repeated {4} times)