Regex - Ignore if group has prefix - python

I am trying to capture 8-digit phone numbers in free text. A number should be ignored if a particular string appears before it.
My regex:
(\b(\+?001|002)?[-]?\d{4}(-|\s)?\d{4}\b)
To Capture:
+001 12345678
12345678
Not Capture:
TTT-12345678-123
TTT-12345678
I am trying to use a negative lookbehind, as in the example below:
\w*(?<!foo)bar
But the above works only if the regex doesn't have subsequent groups.

You may use
(?<!TTT-)(?<!\w)(?:\+?001|002)?[-\s]?\d{4}[-\s]?\d{4}\b
See the regex demo
Details
(?<!TTT-) - no TTT- allowed immediately on the left
(?<!\w) - no word char allowed immediately on the left
(?:\+?001|002)? - an optional non-capturing group matching 1 or 0 occurrences of +001, 001 or 002
[-\s]? - an optional - or whitespace
\d{4} - any four digits
[-\s]?\d{4} - an optional - or whitespace and any four digits
\b - a word boundary.
If the number can be glued to a word char on the right, replace the \b word boundary with the right-hand digit boundary, (?!\d).
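A quick way to check the pattern against the samples from the question is a small Python sketch like the one below. The names PHONE_RE, PHONE_RE_DIGIT_BOUNDARY and find_numbers are made up for illustration; only the two patterns come from the answer.

```python
import re

# Pattern from the answer (illustrative constant name).
PHONE_RE = re.compile(r'(?<!TTT-)(?<!\w)(?:\+?001|002)?[-\s]?\d{4}[-\s]?\d{4}\b')

# Variant with the right-hand digit boundary (?!\d) instead of \b.
PHONE_RE_DIGIT_BOUNDARY = re.compile(
    r'(?<!TTT-)(?<!\w)(?:\+?001|002)?[-\s]?\d{4}[-\s]?\d{4}(?!\d)'
)

def find_numbers(text, pattern=PHONE_RE):
    """Return whole matches; the pattern has no capturing groups."""
    return pattern.findall(text)

for sample in ['+001 12345678', '12345678', 'TTT-12345678-123', 'TTT-12345678']:
    print(repr(sample), '->', find_numbers(sample))

# A number glued to letters on the right only matches with (?!\d).
print(find_numbers('12345678abc'))
print(find_numbers('12345678abc', PHONE_RE_DIGIT_BOUNDARY))
```

The first two samples should be found and the two TTT- samples skipped, because either (?<!TTT-) or (?<!\w) fails at every possible start position.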

Related

Python regular expression for height?

I am trying to create a regular expression that works for different types of height inputs. It should work for the following examples:
5-10
5-09
5-9
6'
6'0
5'9"
5'09"
5'9
5'09
I don't need to consider values below 4'0 or above 6'11.
Here's my regular expression so far:
[456][-']\d{1,2}"?
I need to make the " not work if there is a - between feet and inches.
Also, for the inches part, I am currently allowing for either 1 or 2 digits, when I really only want to allow for two digits when the first digit is a 0 or 1, and if it is 1, the second digit can only be 0 or 1.
For example, 00-09 should work, and 10 and 11 should work, but not 12 or any other two-digit number.
You might use an alternation with an optional - and digits part, or match the ' followed by a second ' and use a capture group with an if clause to match up the "
\b(?<![-'"])(?:1[01]|0?\d)(?:'(?:(?:1[01]|0?\d)\b"?)?|-(?:1[01]|0?\d\b))(?![-'"])
The pattern matches:
\b A word boundary to prevent a partial word match
(?<![-'"]) Negative lookbehind, assert not ' or - or " directly to the left
(?:1[01]|0?\d) Match from 0-9 with optional leading 0 and 10 and 11
(?: Non capture group
' Match literally
(?: Non capture group
(?:1[01]|0?\d)\b Match 0-9 with optional leading 0, or 10 or 11, followed by a word boundary
"? Match optional "
)? Close non capture group and make it optional
| Or
- Match literally
(?:1[01]|0?\d\b) Match 0-9 10 or 11 followed by a word boundary
) Close the outer group
(?![-'"]) Negative lookahead, assert not - or ' or " to the right
Regex demo
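To try the pattern on the inputs from the question, a sketch along these lines could be used. HEIGHT_RE and is_height are hypothetical names, and using re.fullmatch to validate a whole string is an assumption about how the pattern would be applied. Note that, as written, the feet part accepts 0 through 11 rather than only 4 to 6.

```python
import re

# Pattern from the answer; the triple-quoted raw string lets it contain both ' and ".
HEIGHT_RE = re.compile(
    r"""\b(?<![-'"])(?:1[01]|0?\d)(?:'(?:(?:1[01]|0?\d)\b"?)?|-(?:1[01]|0?\d\b))(?![-'"])"""
)

def is_height(value):
    """True when the whole string is matched by the pattern."""
    return HEIGHT_RE.fullmatch(value) is not None

accepted = ['5-10', '5-09', '5-9', "6'", "6'0", '5\'9"', '5\'09"', "5'9", "5'09"]
rejected = ['5-9"', "5'12", '5-12']

for value in accepted + rejected:
    print(repr(value), '->', is_height(value))
```

All nine examples from the question should pass, while a " after the - form and inches of 12 are rejected.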

Regex python ignore word followed by given character

I have the regex (?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9-_]+)(?!\w).
Given the string #first#nope #second#Hello #my-friend, email# whats.up#example.com #friend, what can I do to exclude the strings #first and #second since they are not whole words on their own ?
In other words, exclude them since they are followed by #.
You can use
(?<![a-zA-Z0-9_.-])#(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![#\w])
(?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w])
See the regex demo. Details:
(?<![a-zA-Z0-9_.-]) - a negative lookbehind that matches a location that is not immediately preceded with ASCII digits, letters, _, . and -
# - a # char
(?=([A-Za-z]+[A-Za-z0-9_-]*)) - a positive lookahead with a capturing group inside that captures one or more ASCII letters and then zero or more ASCII letters, digits, - or _ chars
\1 - the Group 1 value (backreferences are atomic, no backtracking is allowed through them)
(?![#\w]) - a negative lookahead that fails the match if there is a word char (letter, digit or _) or a # char immediately to the right of the current location.
Note that I put the hyphens at the end of the character classes; this is best practice.
The (?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w]) alternative uses shorthand character classes and the (?a) inline modifier (the equivalent of re.ASCII / re.A), which makes \w match only ASCII chars (as in the original version). Remove (?a) if you plan to match any Unicode digits/letters.
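Here is a minimal Python sketch running both patterns on the string from the question. The variable names are illustrative. Since each pattern has one capturing group, re.findall returns the Group 1 values, i.e. the tags without the leading #.

```python
import re

text = '#first#nope #second#Hello #my-friend, email# whats.up#example.com #friend'

# Explicit ASCII character classes.
explicit = r'(?<![a-zA-Z0-9_.-])#(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![#\w])'
# Shorthand classes restricted to ASCII via the (?a) inline flag.
shorthand = r'(?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w])'

tags_explicit = re.findall(explicit, text)
tags_shorthand = re.findall(shorthand, text)

print(tags_explicit)   # ['my-friend', 'friend']
print(tags_shorthand)  # ['my-friend', 'friend']
```

#first and #second are skipped because the lookahead captures the whole tag and then (?![#\w]) sees the following #. The engine cannot retry with a shorter capture.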
Another option is to assert a whitespace boundary to the left, and assert no word char or # sign to the right.
(?<!\S)#([A-Za-z]+[\w-]+)(?![#\w])
The pattern matches:
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left
# Match literally
([A-Za-z]+[\w-]+) Capture group1, match 1+ chars A-Za-z and then 1+ word chars or -
(?![#\w]) Negative lookahead, assert not # or word char to the right
Regex demo
Or match a non word boundary \B before the # instead of a lookbehind.
\B#([A-Za-z]+[\w-]+)(?![#\w])
Regex demo
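The two alternatives give the same result on the question's string, but they differ on what may precede the #. A small sketch (variable names and the second test string are made up for illustration):

```python
import re

text = '#first#nope #second#Hello #my-friend, email# whats.up#example.com #friend'

whitespace_boundary = r'(?<!\S)#([A-Za-z]+[\w-]+)(?![#\w])'
non_word_boundary = r'\B#([A-Za-z]+[\w-]+)(?![#\w])'

tags_ws = re.findall(whitespace_boundary, text)
tags_nwb = re.findall(non_word_boundary, text)
print(tags_ws)   # ['my-friend', 'friend']
print(tags_nwb)  # ['my-friend', 'friend']

# Difference: (?<!\S) needs whitespace or the string start before #,
# while \B only needs the previous char not to be a word char.
punctuated = 'see (#note)'
ws_punct = re.findall(whitespace_boundary, punctuated)
nwb_punct = re.findall(non_word_boundary, punctuated)
print(ws_punct)   # []
print(nwb_punct)  # ['note']
```

Which one to pick depends on whether a tag right after punctuation such as ( should count.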

Greedy Python RegEx capturing group to include "and"

I need some help writing regex expressions. I need an expression that can match the following patterns (including words and digits, spaces and commas):
Line 145
Line3544354
Lines 10,12
Line items 45,10,26
Lines 10 and 45
Thus far, I wrote one expression which includes the first three patterns and all case variations:
r'(?i)(line item[\.*\,*\s*\d+]+]+|line[\.*\,*\s*\d+]+|lines[\.*\,*\s*\d+]+|line items[\.*\,*\s*\d+]+)'
I would like to include the last two patterns listed, but I am not sure how. I have written this expression for the pattern matching "Lines 10 and 45" by modifying the capturing group as follows:
r'(Lines[\.*\,*\w*\s*\d+]+)'
However, it does not work as expected. It selects all word characters in the string. I would like to keep my expressions greedy, but not sure how to implement the last two patterns in the list.
Any suggestions please?
You may use
(?i)lines?(?:\s+items?)?\s*\d+(?:\.\d+)?(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)*
See the regex demo.
Pattern details:
(?i) - ignore case inline modifier
lines? - line or lines (? quantifier makes the preceding pattern optional, matching 1 or 0 occurrences)
(?:\s+items?)? - an optional non-capturing group matching 1 or 0 occurrences of 1+ whitespaces followed with item and an optional s char
\s* - 0+ whitespaces
\d+(?:\.\d+)? - 1+ digits followed with an optional sequence of . and 1+ digits
(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)* - 0 or more repetitions of
\s* - 0+ whitespaces
(?:,|and) - , or and char sequence
\s* - 0+ whitespaces
\d+(?:\.\d+)? - 1+ digits followed with an optional sequence of . and 1+ digits
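As a quick check, here is a sketch that runs the pattern over the five examples from the question and over one free-text sentence. LINE_RE and the sample sentence are made up for illustration.

```python
import re

LINE_RE = re.compile(
    r'(?i)lines?(?:\s+items?)?\s*\d+(?:\.\d+)?(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)*'
)

examples = ['Line 145', 'Line3544354', 'Lines 10,12', 'Line items 45,10,26', 'Lines 10 and 45']
for example in examples:
    # fullmatch confirms the whole example is consumed
    print(repr(example), '->', bool(LINE_RE.fullmatch(example)))

sentence = 'See Line 145, Lines 10 and 45, and line items 45,10,26.'
found = LINE_RE.findall(sentence)
print(found)  # ['Line 145', 'Lines 10 and 45', 'line items 45,10,26']
```

The trailing group only continues after a , or and when more digits follow, so the match stops cleanly before ordinary words.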

Invalid pattern in look-behind

Why does this regex work in Python but not in Ruby:
/(?<!([0-1\b][0-9]|[2][0-3]))/
Would be great to hear an explanation and also how to get around it in Ruby
EDIT w/ the whole line of code:
re.sub(r'(?<!([0-1\b][0-9]|[2][0-3])):(?!([0-5][0-9])((?i)(am)|(pm)|(a\.m)|(p\.m)|(a\.m\.)|(p\.m\.))?\b)' , ':\n' , s)
Basically, I'm trying to add '\n' when there is a colon and it is not a time.
Ruby regex engine doesn't allow capturing groups in look behinds.
If you need grouping, you can use a non-capturing group (?:):
[8] pry(main)> /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
SyntaxError: (eval):2: invalid pattern in look-behind: /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
[8] pry(main)> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
=> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
Docs:
(?<!subexp) negative look-behind
Subexp of look-behind must be fixed-width.
But top-level alternatives can be of various lengths.
ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
In negative look-behind, capturing group isn't allowed,
but non-capturing group (?:) is allowed.
Learned from this answer.
According to the Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. Although this restriction is common among regex engines, not all of them count it as an error, hence you see the difference between the re and Onigmo regex libraries.
Now, as for your regex, it does not work correctly in either Ruby or Python: the \b inside a character class in a Python and Ruby regex matches a BACKSPACE (\x08) char, not a word boundary. Moreover, when you use a word boundary after an optional non-word char, if the char appears in the string a word char must appear immediately to the right of that non-word char. The word boundary must be moved to right after m, before \.?.
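Both points can be checked in Python with a short sketch. The variable names and the sample string are made up for illustration.

```python
import re

# 1) Inside a character class, \b is the backspace char (\x08), not a word boundary.
backspace_match = re.fullmatch(r'[\b]', '\x08')
space_then_digit = re.search(r'[0-1\b][0-9]', ' 5')   # a space is not a backspace
print(bool(backspace_match))  # True
print(space_then_digit)       # None

# 2) A word boundary right after the final dot needs a word char after that dot.
sample = 'at 10 a.m. today'
boundary_after_dot = re.search(r'a\.m\.\b', sample)
boundary_after_m = re.search(r'a\.?m\b\.?', sample)
print(boundary_after_dot)        # None: '.' followed by a space is not a boundary
print(boundary_after_m.group())  # a.m.
```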
Another flaw with the current approach is that lookbehinds are not the best to exclude certain contexts like here. E.g. you can't account for a variable amount of whitespaces between the time digits and am / pm. It is better to match the contexts you do not want to touch and match and capture those you want to modify. So, we need two main alternatives here, one matching am/pm in time strings and another matching them in all other contexts.
Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.
Regex demo
\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?
\b - word boundary
((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
(?:[01]?[0-9]|2[0-3]) - an optional 0 or 1 and then any digit or 2 and then a digit from 0 to 3
:[0-5][0-9] - : and then a number from 00 to 59
\s* - 0+ whitespaces
[pa]\.?m\b\.? - a or p, an optional dot, m, a word boundary, an optional dot
| - or
\b[ap]\.?m\b\.? - word boundary, a or p, an optional dot, m, a word boundary, an optional dot
Python fixed solution:
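import re
text = 'am pm P.M. 10:56pm 10:43 a.m.'
# Group 1 captures am/pm that is part of a time; the second alternative matches am/pm anywhere else
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
# Keep Group 1 when it matched (a time); otherwise replace the standalone am/pm with a newline
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)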
Ruby solution:
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
result = text.gsub(rx) { $1 || "\n" }
Output:
"\n \n \n 10:56pm 10:43 a.m."
For sure #mrzasa found the problem.
But...
Taking a guess at your intent to replace a non-time colon with ':\n',
it could be done like this, I guess. It does a little whitespace trim as well.
(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*(?![0-5][0-9](?:[ap]\.?m\b\.?)?)
PCRE - https://regex101.com/r/7TxbAJ/1 Replace $1\n
Python - https://regex101.com/r/w0oqdZ/1 Replace \1\n
Readable version
(?i)
(?<!
\b [01] [0-9]
)
(?<!
\b [2] [0-3]
)
( # (1 start)
[^\S\r\n]*
:
) # (1 end)
[^\S\r\n]*
(?!
[0-5] [0-9]
(?: [ap] \.? m \b \.? )?
)
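In Python, the pattern above could be applied roughly like this. COLON_RE and break_after_colons are hypothetical names, and the sample strings are made up.

```python
import re

COLON_RE = re.compile(
    r'(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*'
    r'(?![0-5][0-9](?:[ap]\.?m\b\.?)?)'
)

def break_after_colons(text):
    """Insert a newline after colons that do not look like part of a time."""
    # \1 keeps the colon; horizontal whitespace after it is dropped
    return COLON_RE.sub(r'\1\n', text)

print(repr(break_after_colons('Note: call at 10:30pm')))
print(repr(break_after_colons('Meet at 09:15 sharp')))
```

The first string gets a line break after "Note:", while the colons inside 10:30pm and 09:15 are left alone because the lookbehinds see a two-digit hour right before them.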

How to extract different types of sub-strings from a string in python using regular expression?

As the title says, I'm supposed to get some sub-strings from a string which looks like this: "-23/45 + 14/9". What I need to get from that string is the four numbers and the operator in the middle. What confuses me is how to do this with only one regular expression pattern. Below is the requirement:
Write a regular expression patt that can be used to extract
(numerator,denominator,operator,numerator,denominator)
from a string containing a fraction, an arithmetic operator, and a fraction. You may
assume there is a space before and after the arithmetic operator and no spaces
surrounding the / character in a fraction. And all fractions will have a numerator and
denominator.
Example:
>>> s = "-23/45 + 14/9"
>>> re.findall(patt,s)
[( "-23","45","+","14","9")]
>>> s = "-23/45 * 14/9"
>>> re.findall(patt,s)
[( "-23","45","*","14","9")]
In general, your code should handle any of the operators +, -, * and /.
Note: see the operator module for the two-argument function equivalents of the arithmetic
(and other) operators
My problem is how to do this with only one regular expression. I have thought about getting the sub-strings that contain numbers and stopping at any character which is not a number, but this will miss the operator in the middle. Another idea is to include all the operators (+ - * /) and stop at white space, but this will make the first two and the last two numbers run together. Can anybody give me a direction on how to solve this problem with only one regular expression pattern? Thanks a lot!
Try this regex:
(-?\d+)\s*\/\s*(\d+) *([+*\/-])\s*(-?\d+)\s*\/(\d+)
Click for regex Demo
You can extract the required information from Group 1 to Group 5
Explanation:
(-?\d+) - matches an optional - followed by 1+ occurrences of a digit and capture it in Group 1
\s*\/\s* - matches 0+ occurrences of a whitespace followed by a / followed by 0+ occurrences of a whitespace
(\d+) - matches 1+ occurrences of a digit and capture it in Group 2
 * (a space followed by *) - matches 0+ occurrences of a space
([+*\/-]) - matches one of the operators in +,-,/,* and captures it in Group 3
\s* - matches 0+ occurrences of a whitespace
(-?\d+) - matches an optional - followed by 1+ occurrences of a digit and capture it in Group 4
\s*\/ - matches 0+ occurrences of a whitespace followed by /
(\d+) - matches 1+ occurrences of a digit and capture it in Group 5
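Used with re.findall as in the assignment, the pattern could be tried like this. It is a minimal sketch: patt is the name from the question, the other names are illustrative.

```python
import re

patt = r'(-?\d+)\s*\/\s*(\d+) *([+*\/-])\s*(-?\d+)\s*\/(\d+)'

plus_result = re.findall(patt, '-23/45 + 14/9')
times_result = re.findall(patt, '-23/45 * 14/9')
divide_result = re.findall(patt, '-23/45 / 14/9')

print(plus_result)    # [('-23', '45', '+', '14', '9')]
print(times_result)   # [('-23', '45', '*', '14', '9')]
print(divide_result)  # [('-23', '45', '/', '14', '9')]
```

With five capturing groups, findall returns one 5-tuple per match in the order (numerator, denominator, operator, numerator, denominator).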
