Greedy Python RegEx capturing group to include "and" - python

I need some help writing regex expressions. I need an expression that can match the following patterns (including words and digits, spaces and commas):
Line 145
Line3544354
Lines 10,12
Line items 45,10,26
Lines 10 and 45
Thus far, I wrote one expression which includes the first three patterns and all case variations:
r'(?i)(line item[\.*\,*\s*\d+]+]+|line[\.*\,*\s*\d+]+|lines[\.*\,*\s*\d+]+|line items[\.*\,*\s*\d+]+)'
I would like to include the last two patterns listed, but I am not sure how. I wrote this expression for the pattern matching "Lines 10 and 45" by modifying the capturing group as follows:
r'(Lines[\.*\,*\w*\s*\d+]+)'
However, it does not work as expected: it selects all word characters in the string. I would like to keep my expressions greedy, but I am not sure how to implement the last two patterns in the list.
Any suggestions please?

You may use
(?i)lines?(?:\s+items?)?\s*\d+(?:\.\d+)?(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)*
See the regex demo.
Pattern details:
(?i) - ignore case inline modifier
lines? - line or lines (? quantifier makes the preceding pattern optional, matching 1 or 0 occurrences)
(?:\s+items?)? - an optional non-capturing group matching 1 or 0 occurrences of 1+ whitespaces followed with item and an optional s char
\s* - 0+ whitespaces
\d+(?:\.\d+)? - 1+ digits followed with an optional sequence of . and 1+ digits
(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)* - 0 or more repetitions of
\s* - 0+ whitespaces
(?:,|and) - , or and char sequence
\s* - 0+ whitespaces
\d+(?:\.\d+)? - 1+ digits followed with an optional sequence of . and 1+ digits
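A quick way to check the pattern against the five sample strings from the question can be sketched as follows. The names LINE_REF, samples and text are illustrative only, and the sentence in text is made up for the demonstration:

```python
import re

# Pattern from the answer above; LINE_REF is only an illustrative name.
LINE_REF = re.compile(
    r'(?i)lines?(?:\s+items?)?\s*\d+(?:\.\d+)?(?:\s*(?:,|and)\s*\d+(?:\.\d+)?)*'
)

# The five sample strings listed in the question.
samples = [
    'Line 145',
    'Line3544354',
    'Lines 10,12',
    'Line items 45,10,26',
    'Lines 10 and 45',
]

for s in samples:
    m = LINE_REF.fullmatch(s)
    print(repr(s), '->', m.group() if m else None)

# Made-up sentence to show the pattern used for searching inside longer text.
text = 'See Lines 10 and 45, then line items 45,10,26 and Line 145.'
found = LINE_REF.findall(text)
print(found)  # ['Lines 10 and 45', 'line items 45,10,26', 'Line 145']
```

Since the pattern contains only non-capturing groups, re.findall returns the whole matches rather than group values.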

Related

regex pattern to match whole word or word followed by another

I'm starting to learn regex in order to match words in python columns and replace them for other values.
df['col1']=df['col1'].str.replace(r'(?i)unlimi+\w*', 'Unlimited', regex=True)
This pattern matches different variations of the word Unlimited. But some values in the column contain not just one word, but two or more:
ex:
[Unlimited, Unlimited (on-net), Unlimited (on-off-net)]
I was wondering if there is a way to match all of the words in the previous example with a single regex line.
You can use
df['col1']=df['col1'].str.replace(r'(?i)unlimi\w*(?:\s*\([^()]*\))?', 'Unlimited', regex=True)
See the regex demo.
The (?i)unlimi\w*(?:\s*\([^()]*\))? regex matches
(?i) - the regex to the right is case insensitive
unlimi - a fixed string
\w* - zero or more word chars
(?:\s*\([^()]*\))? - an optional sequence of
\s* - zero or more whitespaces
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char.
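To try the pattern without a DataFrame, here is a minimal sketch that applies the same regex with the standard library's re.sub to a plain list. The list values and the names UNLIMITED and cleaned are assumptions made for illustration (the misspelled and upper-case entries are invented test cases); the answer itself uses Series.str.replace:

```python
import re

# Same pattern as in the answer; UNLIMITED is an illustrative name.
UNLIMITED = re.compile(r'(?i)unlimi\w*(?:\s*\([^()]*\))?')

# Hypothetical column values, including a misspelling and trailing text.
values = [
    'Unlimited',
    'Unlimited (on-net)',
    'unlimitted (on-off-net)',
    'UNLIMITED calls',
]

# Replace each match (word plus optional parenthesized suffix) with 'Unlimited'.
cleaned = [UNLIMITED.sub('Unlimited', v) for v in values]
print(cleaned)  # ['Unlimited', 'Unlimited', 'Unlimited', 'Unlimited calls']
```

The optional group only consumes a parenthesized suffix, so other trailing text such as " calls" is left in place.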

select group based on same value in regular Expression

I have a following content
ONE
1234234534564 123
34erewrwer323 123
123fsgrt43232 123
TWO
42433412133fr 234
fafafd3234132 342
THREE
sfafdfe345233 3234
FOUR
324ereffdf343 4323
fvdafasf34nhj 4323
fsfnhjdgh342g 4323
Consider ONE, TWO, THREE and FOUR as separate groups. I want to match only ONE and FOUR: a group should match only if the second value on every line in the group is the same and the group has more than one line. How can I do that with a regular expression?
I have already tried the following regex, but it is not up to the mark:
\w+\n\w+\t(\d+)(\n\w+\t\1){2,}
You may use
r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$'
See the regex demo.
Details
(?m) - enable re.MULTILINE mode to make ^ / $ match start and end of lines respectively
^ - start of a line
[A-Z]+ - 1+ uppercase ASCII letters (adjust as you see fit)
\r?\n - a line break like CRLF or LF
\S+ - 1+ non-whitespace chars
\s+ - 1+ whitespaces (or use \t if a tab is the field separator)
(\d+) - Capturing group 1, one or more digits
(?:\r?\n\S+\s+\1)+ - one or more repetitions of a line break followed with 1+ non-whitespaces, 1+ whitespaces and the same value as in Group 1 since \1 is a backreference to the value stored in that group
$ - end of line.
In Python, use re.finditer:
import re
# text is assumed to hold the input shown in the question
for m in re.finditer(r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$', text):
    print(m.group())
See the Python demo.
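A self-contained version using the sample data from the question might look like this. The names GROUP_RE and headers are illustrative, and a single space is assumed as the field separator, as shown in the question:

```python
import re

# Sample input from the question (space-separated fields assumed).
text = """ONE
1234234534564 123
34erewrwer323 123
123fsgrt43232 123
TWO
42433412133fr 234
fafafd3234132 342
THREE
sfafdfe345233 3234
FOUR
324ereffdf343 4323
fvdafasf34nhj 4323
fsfnhjdgh342g 4323"""

# Header line, first data line (value captured in Group 1), then one or
# more further lines whose second field repeats Group 1 via \1.
GROUP_RE = re.compile(r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$')

# The first line of each match is the group header.
headers = [m.group().splitlines()[0] for m in GROUP_RE.finditer(text)]
print(headers)  # ['ONE', 'FOUR']
```

TWO is skipped because its second values differ, and THREE is skipped because it has only one line.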

How to extract different types of sub-strings from a string in python using regular expression?

As the title says, I'm supposed to get some sub-strings from a string which looks like this: "-23/45 + 14/9". What I need to get from that string is the four numbers and the operator in the middle. What confuses me is how to do this with only one regular expression pattern. Below is the requirement:
Write a regular expression patt that can be used to extract
(numerator,denominator,operator,numerator,denominator)
from a string containing a fraction, an arithmetic operator, and a fraction. You may
assume there is a space before and after the arithmetic operator and no spaces
surrounding the / character in a fraction. And all fractions will have a numerator and
denominator.
Example:
>>> s = "-23/45 + 14/9"
>>> re.findall(patt,s)
[( "-23","45","+","14","9")]
>>> s = "-23/45 * 14/9"
>>> re.findall(patt,s)
[( "-23","45","*","14","9")]
In general, your code should handle any of the operators +, -, * and /.
Note: the operator module for the two argument function equivalents of the arithmetic
(and other) operators
My problem is how to do this with only one regular expression. I have thought about capturing the substrings that contain digits and stopping at any non-digit character, but this misses the operator in the middle. Another idea is to include all the operators (+ - * /) and stop at whitespace, but this runs the first two and the last two numbers together. Can anybody point me toward a way to solve this with a single regular expression pattern? Thanks a lot!
Try this regex:
(-?\d+)\s*\/\s*(\d+) *([+*\/-])\s*(-?\d+)\s*\/(\d+)
Click for regex Demo
You can extract the required information from Group 1 to Group 5
Explanation:
(-?\d+) - matches an optional - followed by 1+ occurrences of a digit and capture it in Group 1
\s*\/\s* - matches 0+ occurrences of a whitespace followed by a / followed by 0+ occurrences of a whitespace
(\d+) - matches 1+ occurrences of a digit and capture it in Group 2
" *" - matches 0+ occurrences of a space (a literal space followed by the * quantifier)
([+*\/-]) - matches one of the operators in +,-,/,* and captures it in Group 3
\s* - matches 0+ occurrences of a whitespace
(-?\d+) - matches an optional - followed by 1+ occurrences of a digit and capture it in Group 4
\s*\/ - matches 0+ occurrences of a whitespace followed by /
(\d+) - matches 1+ occurrences of a digit and capture it in Group 5
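As a rough check, the pattern can be passed to re.findall, which returns one tuple of the five groups per match. The variable name patt follows the assignment text; the last two input strings and the results dictionary are made up here to exercise the - and / operators:

```python
import re

# Pattern from the answer, stored under the name used in the assignment.
patt = r'(-?\d+)\s*\/\s*(\d+) *([+*\/-])\s*(-?\d+)\s*\/(\d+)'

# First two inputs come from the question; the last two are extra test cases.
inputs = ['-23/45 + 14/9', '-23/45 * 14/9', '1/2 - 3/4', '1/2 / -3/4']

results = {s: re.findall(patt, s) for s in inputs}
for s, r in results.items():
    print(s, '->', r)
```

For the first input this prints [('-23', '45', '+', '14', '9')], that is (numerator, denominator, operator, numerator, denominator).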

extract string using regular expression

fix_release='Ubuntu 16.04 LTS'
p = re.compile(r'(Ubuntu)\b(\d+[.]\d+)\b')
fix_release = p.search(fix_release)
logger.info(fix_release) #fix_release is None
I want to extract the string 'Ubuntu 16.04'
But, result is None.... How can I extract the correct sentence?
You confused the word boundary \b with whitespace. \b matches the boundary between a word character and a non-word character and consumes no characters. For your case, you can simply use r'Ubuntu \d+\.\d+':
fix_release='Ubuntu 16.04 LTS'
p = re.compile(r'Ubuntu \d+\.\d+')
p.search(fix_release).group(0)
# 'Ubuntu 16.04'
Try this Regex:
Ubuntu\s*\d+(?:\.\d+)?
Click for Demo
Explanation:
Ubuntu - matches Ubuntu literally
\s* - matches 0+ occurrences of a white-space, as many as possible
\d+ - matches 1+ digits, as many as possible
(?:\.\d+)? - matches a . followed by 1+ digits, as many as possible. A ? at the end makes this part optional.
Note: In your regex, you are using \b for the spaces. \b returns 0 length matches between a word-character and a non-word character. You can use \s instead
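The difference between the two patterns can be seen side by side in a short sketch. The variable names release, broken and fixed are illustrative:

```python
import re

release = 'Ubuntu 16.04 LTS'

# \b is zero-width, so nothing in this pattern consumes the space
# between "Ubuntu" and "16.04"; the search fails.
broken = re.search(r'(Ubuntu)\b(\d+[.]\d+)\b', release)
print(broken)  # None

# \s* consumes the whitespace, so the search succeeds.
fixed = re.search(r'Ubuntu\s*\d+(?:\.\d+)?', release)
print(fixed.group(0) if fixed else None)  # Ubuntu 16.04
```

Checking the result of re.search before calling .group() avoids an AttributeError when there is no match.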

Return the next nth result \w+ after a hyphen globally

Just getting to the next stage of understanding regex, hoping the community can help...
string = These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4
There are multiple test names preceded by a '-' hyphen which I derive from regex
/(?<=-)\w+/g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result using greediness with regex /(?!.*-)(?<=-)\w+/g
Result:
YIFY (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result Globally using the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
See IDEONE demo:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern searches for a hyphen, followed with 0+ whitespaces, and then captures 1+ digits, letters or underscores. When a pattern contains a capturing group, re.findall returns only the captured texts, so you only get the Group 1 values captured with (\w+).
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the match index you want. Here is a regex demo.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).
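The re.search approach can be wrapped in a small helper that checks for a match before reading Group 1, as the note above recommends. The function name nth_match is hypothetical and not part of the original answer; this is only a sketch:

```python
import re

def nth_match(s, n):
    """Return the n-th (1-based) word that follows a hyphen, or None."""
    # Build ^(?:.*?-\s*(\w+)){n}; Group 1 keeps the value of the last repetition.
    pattern = r'^(?:.*?-\s*(\w+)){%d}' % n
    m = re.search(pattern, s)
    return m.group(1) if m else None

s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
print(nth_match(s, 1))  # AUSVERSION
print(nth_match(s, 3))  # TESTAGAIN
print(nth_match(s, 5))  # None, since there are only four hyphens
```

A repeated capturing group keeps only its last iteration, which is why Group 1 holds the n-th value.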
