Python Regex: Separating 'kg' and 'g' or 'ML' and 'L' - python

The task is to separate the attributes of a product from its string. I am using a regex to extract the required parts, but I am having difficulty distinguishing "L" from "ML" (or "l" from "ml"). The same happens with "kg" and "g": the regex always picks the shorter string.
import re

prod = 'TestProduct- 200 ML x24'
searchobj = re.findall(r'([0-9]+).*(g|kg|ltr|l|ml)\s*x*[*]*([0-9]+)', prod, re.I)
print(searchobj)
#output
[('200', 'L', '24')]
How can I get the following output instead?
[('200', 'ML', '24')]
Thanks.

You could specify that you only want whole words of the form (g|kg|ltr|l|ml)\s by changing that to \s(g|kg|ltr|l|ml)\s (require a space before and after the expression).

You can use
(\d+(?:\.\d+)?)\s*(g|kg|ltr|l|ml)\s*x*\**(\d+(?:\.\d+)?)
See the regex demo.
Details
(\d+(?:\.\d+)?) - Group 1: one or more digits, and then an optional sequence of a dot and then one or more digits
\s* - 0+ whitespaces
(g|kg|ltr|l|ml) - Group 2: one of the char (sequences)
\s* - 0+ whitespaces
x* - 0 or more x chars
\** - 0 or more * chars
(\d+(?:\.\d+)?) - Group 3: one or more digits, and then an optional sequence of a dot and then one or more digits
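
A minimal sketch comparing the question's pattern with the one above on the sample string. The variable names (greedy, fixed, and the two result lists) are made up for illustration. The relevant difference is that \s* replaces .*, so nothing can consume the "M" before the unit group; re.I is still needed because the alternation is written in lowercase.

```python
import re

prod = 'TestProduct- 200 ML x24'

# Pattern from the question: the greedy .* gives back only as much as it must,
# so it keeps the "M" and leaves just "L" for the unit group.
greedy = r'([0-9]+).*(g|kg|ltr|l|ml)\s*x*[*]*([0-9]+)'

# Pattern from the answer: only whitespace may sit between the number and the unit.
fixed = r'(\d+(?:\.\d+)?)\s*(g|kg|ltr|l|ml)\s*x*\**(\d+(?:\.\d+)?)'

greedy_result = re.findall(greedy, prod, re.I)
fixed_result = re.findall(fixed, prod, re.I)

print(greedy_result)  # [('200', 'L', '24')]
print(fixed_result)   # [('200', 'ML', '24')]
```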

Related

How to create regex where there can be specific char between characters in pattern and I have one "wildcard" char?

I want to create regex in python where I'm given a substring and I want to find it in my string. Characters in substring and my string are always either D, T or F. There are two conditions for match:
After every character in the given substring, a '-' character may occur (I especially don't know how to approach this one)
Every character can be either the character I'm looking for or 'X', so X is a "wildcard" (I believe I can use '|' for that, e.g. ([DTF]|X))
So what I mean is if I'm given DTTFDD as substring other proper matches would be:
D-TTFDD
DXTFDD
Edit: These matches can occur in bigger string such as FTDTTDFDD-TTFXDTFTFD
How can I put all of this together?
Looks like you could try:
[DX]-?(?:[TX]-?){2}[FX]-?(?:[DX]-?){2}
See the online demo
[DX]-? - A literal "D" or "X" followed by an optional hyphen.
(?: - Open non-capture group:
[TX]-? - A literal "T" or "X" followed by an optional hyphen.
){2} - Close non-capture group and match twice.
[FX]-? - A literal "F" or "X" followed by an optional hyphen.
(?: - Open non-capture group:
[DX]-? - A literal "D" or "X" followed by an optional hyphen.
){2} - Close non-capture group and match twice.
A little less verbose without the non-capture groups:
[DX]-?[TX]-?[TX]-?[FX]-?[DX]-?[DX]-?
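
Since the substring is given at run time, the flat form above can be generated instead of typed by hand. This is only a sketch: build_pattern is a hypothetical helper, and it skips escaping on the assumption (taken from the question) that the substring contains only D, T and F.

```python
import re

def build_pattern(sub):
    """Build '[cX]-?' for every character c of sub (hypothetical helper)."""
    # No re.escape needed only because the alphabet is limited to D, T, F.
    return "".join(f"[{ch}X]-?" for ch in sub)

pattern = build_pattern("DTTFDD")
print(pattern)  # [DX]-?[TX]-?[TX]-?[FX]-?[DX]-?[DX]-?

# The two variants from the question match in full.
print(bool(re.fullmatch(pattern, "D-TTFDD")))  # True
print(bool(re.fullmatch(pattern, "DXTFDD")))   # True

# Searching inside the longer string from the question's edit.
found = re.search(pattern, "FTDTTDFDD-TTFXDTFTFD")
print(found.group())  # D-TTFXD
```

Note that the trailing -? also allows a hyphen after the last character; drop it from the final element if that is not wanted.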

Regex positive lookahead for a character only after paired brackets

I am trying to parse SQL code using regex in Python.
I need an expression that ends a group at the end of the string or at a comma, but only when that comma comes after all opened parentheses have been closed.
My current regexp matches second group only up to first occurrence of a comma, regardless of parentheses count:
(?m)^\s*'?([A-Za-z0-9_-]+)'?\s*=\s*((?s:.)*?)(?:\s*)(?=,|\Z)
For example, in the string below:
COL1 = DEF1,
COL2 = DEF(TEST,
TEST2),
COL3 = FUN(1, 2),
I get:
0: DEF1
1: DEF(TEST
2: FUN(1
And I would like it to match:
0: DEF1
1: DEF(TEST,
TEST2)
2: FUN(1, 2)
Thanks in advance!
You may use
(?sm)^\s*'?([\w-]+)'?\s*=\s*(.*?)(?=^\s*'?[\w-]+'?\s*=|\Z)
See the regex demo
Details
(?sm) - DOTALL and MULTILINE options on
^ - start of a line
\s* - 0+ whitespaces
'? - an optional '
([\w-]+) - Group 1: one or more word or - chars
'? - an optional '
\s*=\s* - a = enclosed with 0+ whitespaces
(.*?) - Group 2: any zero or more chars, as few as possible (line breaks included, since DOTALL is on)
(?=^\s*'?[\w-]+'?\s*=|\Z) - a positive lookahead requiring the end of string (\Z) or ^\s*'?[\w-]+'?\s*= pattern immediately to the right of the current location.
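
A short sketch of the pattern applied to the sample from the question. The names sql, pattern and pairs are illustrative. Note that the pattern does not count parentheses: it assumes every assignment starts on its own line, and group 2 keeps the trailing comma and newline, which are stripped in a separate step here.

```python
import re

sql = "COL1 = DEF1,\nCOL2 = DEF(TEST,\nTEST2),\nCOL3 = FUN(1, 2),"

pattern = re.compile(r"(?sm)^\s*'?([\w-]+)'?\s*=\s*(.*?)(?=^\s*'?[\w-]+'?\s*=|\Z)")

# Group 2 runs up to the next "name =" line (or the end of the string),
# so remove the surrounding whitespace and the trailing comma afterwards.
pairs = [(name, value.strip().rstrip(",")) for name, value in pattern.findall(sql)]

for name, value in pairs:
    print(name, "->", repr(value))
```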

Remove duplicate words in a string using regex

I'm working on my regex skills and I found that some of my strings have a duplicated word at the start. I would like to remove the duplicates and keep just one occurrence of the word:
server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log
I used the regex below, but I still get both server_server in my output:
((.*?))_(?!\D)
How can I reduce the output to a single server_ when there are two or more, and keep it as is when there is only one?
The output doesn't have to contain the digits or the part after the dot, i.e. .zzz, .xyz, etc.
Expected output -
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
You could back-reference the word in your search expression:
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'
and use the "one or more times" quantifier so it still works when there are more than 2 occurrences:
>>> s = "server_server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
Getting rid of the suffix is not the hardest part: just capture the rest and discard the end:
>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
You may use a single re.sub call to match and remove what you do not need and match and capture what you need:
re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)
See the regex demo
Details
^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed with the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
$ - end of string.
The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
Python demo:
import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))
Output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check

How to extract different types of sub-strings from a string in python using regular expression?

As the title says, I need to extract some substrings from a string that looks like this: "-23/45 + 14/9". What I need from that string is the four numbers and the operator in the middle. What confuses me is how to do this with a single regular expression pattern. Below is the requirement:
Write a regular expression patt that can be used to extract
(numerator,denominator,operator,numerator,denominator)
from a string containing a fraction, an arithmetic operator, and a fraction. You may
assume there is a space before and after the arithmetic operator and no spaces
surrounding the / character in a fraction. And all fractions will have a numerator and
denominator.
Example:
>>> s = "-23/45 + 14/9"
>>> re.findall(patt,s)
[('-23', '45', '+', '14', '9')]
>>> s = "-23/45 * 14/9"
>>> re.findall(patt,s)
[('-23', '45', '*', '14', '9')]
In general, your code should handle any of the operators +, -, * and /.
Note: see the operator module for the two-argument function equivalents of the arithmetic
(and other) operators
My problem is how to do this with only one regular expression. I have thought about capturing the substrings that contain digits and stopping at any non-digit character, but that misses the operator in the middle. Another idea is to include all the operators (+ - * /) and stop at whitespace, but that merges the numerator and denominator of each fraction. Can anybody point me in the right direction for solving this with a single regular expression pattern? Thanks a lot!
Try this regex:
(-?\d+)\s*\/\s*(\d+) *([+*\/-])\s*(-?\d+)\s*\/(\d+)
Click for regex Demo
You can extract the required information from Group 1 to Group 5
Explanation:
(-?\d+) - matches an optional - followed by 1+ occurrences of a digit and capture it in Group 1
\s*\/\s* - matches 0+ occurrences of a whitespace followed by a / followed by 0+ occurrences of a whitespace
(\d+) - matches 1+ occurrences of a digit and capture it in Group 2
 * (a space followed by *) - matches 0+ occurrences of a space
([+*\/-]) - matches one of the operators in +,-,/,* and captures it in Group 3
\s* - matches 0+ occurrences of a whitespace
(-?\d+) - matches an optional - followed by 1+ occurrences of a digit and capture it in Group 4
\s*\/ - matches 0+ occurrences of a whitespace followed by /
(\d+) - matches 1+ occurrences of a digit and capture it in Group 5
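
A quick check of the pattern with re.findall, as the assignment asks. The name patt comes from the assignment text; the other names and the extra test strings for - and / are added here for illustration.

```python
import re

# Pattern from the answer, unchanged (escaping "/" is harmless in Python).
patt = r'(-?\d+)\s*\/\s*(\d+) *([+*\/-])\s*(-?\d+)\s*\/(\d+)'

plus = re.findall(patt, "-23/45 + 14/9")
times = re.findall(patt, "-23/45 * 14/9")
minus = re.findall(patt, "-23/45 - 14/9")
divide = re.findall(patt, "1/2 / 3/4")

print(plus)    # [('-23', '45', '+', '14', '9')]
print(times)   # [('-23', '45', '*', '14', '9')]
print(minus)   # [('-23', '45', '-', '14', '9')]
print(divide)  # [('1', '2', '/', '3', '4')]
```

Because the pattern has five capturing groups, re.findall returns one 5-tuple per match, which is exactly the shape the assignment expects.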

Split string by number of whitespaces

I have a string that looks like either of these three examples:
1: Name = astring Some comments
2: Typ = one two thee Must be "sand", "mud" or "bedload"
3: RDW = 0.02 [ - ] Some comment about RDW
I first split the variable name and rest like so:
re.findall(r'\s*([a-zA-Z0-9_]+)\s*=\s*(.*)', line)
I then want to split the right part of the string into a part containing the values and a part containing the comments (if there are any). I want to do this by looking at the number of whitespace characters: if a run exceeds, say, 4, I assume the comments start there.
Any idea on how to do this?
I currently have
re.findall(r'(?:(\S+)\s{0,3})+', dataString)
However if I test this using the string:
'aa aa23r234rf2134213^$&$%& bb'
Then it also selects 'bb'
You may use a single regex with re.findall:
^\s*(\w+)\s*=\s*(.*?)(?:(?:\s{4,}|\[)(.*))?$
See the regex demo.
Details:
^ - start of string
\s* - 0+ whitespaces
(\w+) - capturing group #1 matching 1 or more letters/digits/underscores
\s*=\s* - = enclosed with 0+ whitespaces
(.*?) - capturing group #2 matching any 0+ chars, as few as possible, up to the first...
(?:(?:\s{4,}|\[)(.*))? - an optional group matching
(?:\s{4,}|\[) - 4 or more whitespaces or a [
(.*) - capturing group #3 matching 0+ chars up to
$ - the end of string.
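
A small sketch of how the pattern could be wrapped for per-line use. split_line and LINE_RE are hypothetical names, and the spacing in the sample lines is assumed (the wide gaps before the comments are reconstructed, not taken from the question).

```python
import re

LINE_RE = re.compile(r'^\s*(\w+)\s*=\s*(.*?)(?:(?:\s{4,}|\[)(.*))?$')

def split_line(line):
    """Return (name, value, comment), or None if the line has no 'name = ...' form."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    name, value, comment = m.groups()
    # Group 3 is None when there is no comment part; normalise it to ''.
    return name, value.strip(), (comment or '').strip()

print(split_line('Name = astring      Some comments'))
# ('Name', 'astring', 'Some comments')
print(split_line('Typ = one two thee'))
# ('Typ', 'one two thee', '')
print(split_line('RDW = 0.02 [ - ]    Some comment about RDW'))
# ('RDW', '0.02', '- ]    Some comment about RDW')
```

As the last call shows, a [ also ends the value part, so a bracketed unit lands in the third group together with the comment.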
