Parsing parameters from string using a regex with groups in python

Please, I'm trying to grab some parameters from a string. The parameters start with : or $ and are enclosed between brackets.
Ex:
some text [more text :Parameter1] more text [more (:Parameter2)]
My goal is to get two matches as the following:
Full match: [more text :Parameter1]
Group 1: :Parameter1
Full match: [more (:Parameter2)]
Group 1: :Parameter2
The following regex almost works, except when the parameter itself is enclosed in parentheses, like Parameter2.
r"\[.*?([:\$].*?)]"
In those cases I get:
Full match: [more (:Parameter2)]
Group 1: :Parameter2)
Note that group 1 includes the closing parenthesis.
I couldn't find a way to remove it. Appreciate any help.
regex101 tests
Thanks.

If you want the parameter to be between the opening and the matching closing parenthesis, you might make use of negated character classes [^][()$:] to match any character that is not in the character class.
To match either of the possibilities you could use an alternation which will give you 2 capturing groups:
\[[^][()$:]*(?:\(([:$][^][()$:]+)\)|([:$][^][()$:]+))\]
About the pattern
\[ Match [
[^][()$:]* Match 0+ times any character that is not in the character class
(?: Non capturing group
\( Match (
( Capturing group 1
[:$][^][()$:]+ Match $ or :, then match 1+ chars not in the character class
) Close group 1
\) Match )
| Or
( Capturing group 2
[:$][^][()$:]+ Match $ or :, then match 1+ chars not in the character class
) Close group 2
) Close non capturing group
\] Match ]
Regex demo
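Because the alternation places the parameter in group 1 or group 2 depending on which branch matched, the caller has to pick whichever group took part in the match. A minimal Python sketch of that step, using the question's sample string (the variable names are only illustrative):

```python
import re

# Pattern from the answer above: group 1 is the parenthesised form,
# group 2 is the bare form.
pattern = re.compile(r'\[[^][()$:]*(?:\(([:$][^][()$:]+)\)|([:$][^][()$:]+))\]')

s = 'some text [more text :Parameter1] more text [more (:Parameter2)]'

results = []
for m in pattern.finditer(s):
    # Only one of the two groups participates in any given match.
    param = m.group(1) or m.group(2)
    results.append((m.group(0), param))

print(results)
```

Using `m.group(1) or m.group(2)` works here because the group that did not participate is `None`, and a parameter that did match is never empty.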

With extended regex pattern:
import re
s = 'some text [more text :Parameter1] more text [more (:Parameter2)]'
res = re.findall(r'(\[[^\[\]:$]+\(?([:$][^:$)]+)\)?\])', s)
print(res)
The output (in format (<full_match>, <group_1>)):
[('[more text :Parameter1]', ':Parameter1'), ('[more (:Parameter2)]', ':Parameter2')]

This regex does what you want:
\[.*?([:\$].*?)\)?]
Output:
[more text :Parameter1]
:Parameter1
[more (:Parameter2)]
:Parameter2

I'd suggest a simple expression,
(\[[^(:]+([^]]+)\])
and then scripting the rest of the problem to avoid look-arounds.
Test
import re
regex = r"(\[[^(:]+([^]]+)\])"
test_str = "some text [more text :Parameter1] more text [more (:Parameter2)]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.
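With this expression, group 2 still carries the parentheses for the second parameter, so "scripting the rest" could be as small as stripping them in Python. A sketch under that assumption (the `params` name is illustrative):

```python
import re

s = 'some text [more text :Parameter1] more text [more (:Parameter2)]'
pattern = re.compile(r'(\[[^(:]+([^]]+)\])')

# Group 2 may come back as '(:Parameter2)', so remove any leading or
# trailing parentheses in plain Python rather than in the regex.
params = [m.group(2).strip('()') for m in pattern.finditer(s)]
print(params)
```

Note that `str.strip('()')` removes every leading and trailing parenthesis, which is fine for this input but would also trim a parameter that legitimately ends in `)`.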

You can use the following regex:
(\[[^:]+([:$][^])]+)[])]+)
It avoids lazy quantifiers, which should generally make it faster.
Regex details:
\[ matches [
[^:]+ matches 1 or more times any characters but a :
([:$][^])]+) second group:
[:$] matches either : or $
[^])]+ matches 1 or more times any characters but a ] or )
[])]+ matches ] and/or ) at least one time
Demo
import re
s = 'some text [more text :Parameter1] more text [more (:Parameter2)]'
print(re.findall(r'(\[[^:]+([:$][^])]+)[])]+)', s))
Output:
[('[more text :Parameter1]', ':Parameter1'), ('[more (:Parameter2)]', ':Parameter2')]

Related

What is the correct way of grabbing an inner string in regular expressions for Python for multiple conditions

I would like to return all strings within the specified starting and end strings.
Given a string libs = 'libr(lib1), libr(lib2), libr(lib3), req(reqlib), libra(nonlib)'.
From the above libs string I would like to search for strings that are in between libr( and ) or the string between req( and ).
I would like to return ['lib1', 'lib2', 'lib3', 'reqlib']
import re
libs = 'libr(lib1), libr(lib2), libr(lib3), req(reqlib), libra(nonlib)'
pat1 = r'libr+\((.*?)\)'
pat2 = r'req+\((.*?)\)'
pat = f"{pat1}|{pat2}"
re.findall(pat, libs)
The code above currently returns [('lib1', ''), ('lib2', ''), ('lib3', ''), ('', 'reqlib')] and I am not sure how I should fix this.

Try this regex
(?:(?<=libr\()|(?<=req\())[^)]+
Explanation:
(?:(?<=libr\()|(?<=req\())
(?<=libr\() - positive lookbehind that matches the position which is immediately preceded by text libr(
| - or
(?<=req\() - positive lookbehind that matches the position which is immediately preceded by text req(
[^)]+ - matches 1+ occurrences of any character which is not a ). So, this will match everything until it finds the next )
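For reference, the pattern above can be run in Python as follows. Python's `re` module requires each lookbehind to be fixed-width, which is why the two lookbehinds sit side by side in an alternation instead of being merged into one. The `names` variable is just for illustration:

```python
import re

libs = 'libr(lib1), libr(lib2), libr(lib3), req(reqlib), libra(nonlib)'

# No capture groups are used, so findall returns the full matches directly.
names = re.findall(r'(?:(?<=libr\()|(?<=req\())[^)]+', libs)
print(names)
```

`libra(nonlib)` is skipped because the text immediately before `nonlib` is `libra(`, which ends in neither `libr(` nor `req(`.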

You can do it like this:
pat1 = r'(?<=libr\().*?(?=\))'
pat2 = r'(?<=req\().*?(?=\))'
It uses positive lookbehind (?<=) and positive lookahead (?=).
.*? : selects all characters in between. I'll name it "content"
(?<=libr\() : "content" must be preceded by libr( (the ( is escaped)
(?=\)) : "content" must be followed by ) (the ) is escaped too)
Complete code:
import re
libs = 'libr(lib1), libr(lib2), libr(lib3), req(reqlib), libra(nonlib)'
pat1 = r'(?<=libr\().*?(?=\))'
pat2 = r'(?<=req\().*?(?=\))'
pat = f"{pat1}|{pat2}"
result = re.findall(pat, libs)
print(result)
Output:
['lib1', 'lib2', 'lib3', 'reqlib']

I think a common way to do this is to use an alternation for the word that must precede the text you want to capture:
\b(?:libr|req)\(([^)]+)
See the online demo
\b - Word-boundary.
(?: - Open non-capture group:
libr|req - Match "libr" or "req".
) - Close non-capture group.
\( - A literal opening parenthesis.
( - Open a capture group:
[^)]+ - Match 1+ characters apart from a closing parenthesis.
) - Close capture group.
A python demo:
import re
libs = 'libr(lib1), libr(lib2), libr(lib3), req(reqlib), libra(nonlib)'
lst = re.findall(r'\b(?:libr|req)\(([^)]+)', libs)
print(lst)
Prints:
['lib1', 'lib2', 'lib3', 'reqlib']

Find substrings that start and end with same Uppercase Character

I have a homework problem where I need to use regex to parse substrings out of a large string.
The goal is to select substrings that match the following parameters:
Substring starts and ends with the same uppercase character, and I need to ignore any instances of uppercase characters with the number 0 in front of them.
For example, ZAp0ZuZAuX0AZA would contain the matches ZAp0ZuZ and AuX0AZA
I've been messing around with this for a few hours and honestly haven't even gotten close...
I've tried some stuff like the code below, but that will select everything from the first uppercase through the last uppercase. I've also
[A-Z]{1}[[:alnum:]]*[A-Z]{1} <--- this selects the whole string
[A-Z]{1}[[:alnum:]][A-Z]{1} <--- this gives me strings like ZuZ, AuX
Really appreciate any help, I'm totally stumped on this one.

Regular expressions may not be the best tool for this, since you could simply split the string. However, if you have to or want to use one, this expression might give you an idea of the problems you may face as your character list expands:
(?=.[A-Z])([A-Z])(.*?)\1
I have added the lookahead (?=.[A-Z]), which requires the second character of the match to be an uppercase letter. You can remove it and the expression would still work, but you can add such boundaries to your expressions for safety.
JavaScript Test
const regex = /([A-Z])(.*?)\1/gm;
const str = `ZAp0ZuZAuX0AZA
ZApxxZuZAuXxafaAZA
ZApxaf09xZuZAuX090xafaAZA
abcZApxaf09xZuZAuX090xafaAZA`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Python Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"([A-Z])(.*?)\1"
test_str = ("ZAp0ZuZAuX0AZA\n"
"ZApxxZuZAuXxafaAZA\n"
"ZApxaf09xZuZAuX090xafaAZA\n"
"abcZApxaf09xZuZAuX090xafaAZA")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

This might work
(?<!0)([A-Z]).*?(?<!0)\1
https://regex101.com/r/nES9FP/1
Explained
(?<! 0 ) # Ignore Upper case with zero in front of it
( [A-Z] ) # (1), This Upper case is to be found down stream
.*? # Lazy, any character
(?<! 0 ) # Ignore Upper case with zero in front of it
\1 # Backref to what is in group (1)

You may use
(?<!0)([A-Z]).*?(?<!0)\1
See the regex demo.
Details
(?<!0)([A-Z]) - Group 1: an ASCII uppercase letter not preceded with a zero
.*? - any char but a linebreak char as few as possible
(?<!0)\1 - the same letter as in Group 1 not immediately preceded with 0.
See the Python demo:
import re
s="ZAp0ZuZAuX0AZA"
for m in re.finditer(r'(?<!0)([A-Z]).*?(?<!0)\1', s):
    print(m.group())  # prints ZAp0ZuZ, then AuX0AZA

Regex for Text Between Brackets and Text Between Semicolons

I have the following shape of string: PW[Yasui Chitetsu]; and would like to get only the name inside the brackets: Yasui Chitetsu. I'm trying something like
[^(PW\[)](.*)[^\]]
as a regular expression, but the last bracket is still in it. How do I unselect it? I don't think I need anything fancy like look behinds, etc, for this case.

The Problems with What You've Tried
There are a few problems with what you've tried:
It will omit the first and last characters of your match from the group, giving you something like asui Chitets.
It will have even more errors on strings that start with P or W. For example, in PW[Paul McCartney], you would match only ul McCartne with the group and aul McCartney with the full match.
The Regex
You want something like this:
(?<=\[)([^]]+)(?=\])
Here's a regex101 demo.
Explanation
(?<=\[) means that the match must be preceded by [
([^]]+) matches 1 or more characters that are not ]
(?=\]) means that the match must be followed by ]
Sample Code
Here's some sample code (from the above regex101 link):
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?<=\[)([^]]+)(?=\])"
test_str = "PW[Yasui Chitetsu]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Semicolons
In your title, you mentioned finding text between semicolons. The same logic would work for that, giving you this regex:
(?<=;)([^;]+)(?=;)
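Since the question mentions not wanting look-behinds, here is a sketch of the bracket case with a plain capture group instead, followed by the semicolon lookaround pattern applied to a made-up sample. The `;alpha;beta;gamma;` input and the variable names are assumptions for illustration only:

```python
import re

test_str = "PW[Yasui Chitetsu];"

# Capture group between literal brackets; no lookarounds involved.
m = re.search(r'\[([^\]]+)\]', test_str)
name = m.group(1) if m else None
print(name)

# Lookaround version for text between semicolons (hypothetical sample input).
# The semicolons are not consumed, so adjacent fields can share a delimiter.
fields = re.findall(r'(?<=;)[^;]+(?=;)', ';alpha;beta;gamma;')
print(fields)
```

The capture-group form returns the name through `group(1)`, while the lookaround form makes the name the whole match; which one is more convenient depends on how the result is consumed.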

Python - Basic validation of international names?

Given a name string, I want to validate a few basic conditions:
-The characters belong to a recognized script/alphabet (Latin, Chinese, Arabic, etc) and aren't say, emojis.
-The string doesn't contain digits and is of length < 40
I know the latter can be accomplished via regex but is there a unicode way to accomplish the first? Are there any text processing libraries I can leverage?

You should be able to check this using the Unicode Character classes in regex.
[\p{P}\s\w]{40,}
The most important part here is the \w character class using Unicode mode:
\p{P} matches any kind of punctuation character
\s matches any kind of invisible character (equal to [\p{Z}\h\v])
\w matches any word character in any script (equal to [\p{L}\p{N}_])
Live Demo
You may want to add more like \p{Sc} to match currency symbols, etc.
But to be able to take advantage of this, you need to use the regex module (an alternative to the standard re module) that supports Unicode codepoint properties with the \p{} syntax.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import regex as re
regex = r"[\p{P}\s\w]{40,}"
test_str = ("Wow cool song!Wow cool song!Wow cool song!Wow cool song! 🕺🏻 \nWow cool song! 🕺🏻Wow cool song! 🕺🏻Wow cool song! 🕺🏻\n")
matches = re.finditer(regex, test_str, re.UNICODE | re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
PS: .NET Regex gives you some more options like \p{IsGreek}.
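If installing the third-party regex module is not an option, a rough standard-library alternative is to check each character's Unicode general category with `unicodedata`. This is a different technique from the `\p{}` classes above, and only a sketch: the helper name `is_plausible_name`, the set of allowed separators, and the 39-character limit (taken from the question's "length < 40") are assumptions.

```python
import unicodedata

def is_plausible_name(name: str, max_len: int = 39) -> bool:
    """Hypothetical helper: accept letters, combining marks and a few separators."""
    if not name or len(name) > max_len:
        return False
    allowed_extra = set(" '-.")  # assumed separators commonly found in names
    for ch in name:
        if ch in allowed_extra:
            continue
        # Categories starting with 'L' are letters and 'M' are combining marks;
        # digits ('Nd') and emoji ('So') fall outside both.
        if unicodedata.category(ch)[0] not in ('L', 'M'):
            return False
    return True

for candidate in ["José García", "李小龙", "John3", "Wow 🕺"]:
    print(candidate, is_plausible_name(candidate))
```

This checks that characters are letters in some script, not that they belong to one particular script, so mixed-script strings would still pass.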

Python regex: parsing newick format

I have a string like:
(A\2009_2009-01-04:0.2,(A\name2\human\2007_2007:0.3,A\chicken\ird16\2016_20016:0.4)A\name3\epi66321\2001_2001-04-04:0.5)A\name_with_space\2014_2014:0.1)A\name4\66036-8a\2004_2004-12-05;
In this tree, names are enclosed on the left by either an open bracket "(", a closing bracket ")", or a comma, and enclosed on the right with a colon ':'. That is, the substrings "A\2009_2009-01-04", "A\name2\human\2007_2007", "A\name3\epi66321\2001_2001-04-04", are names. (this is actually a tree in newick format).
I'd like to find a regex pattern which finds all names, with as little restriction on namespace as possible. Think of names as variables, like this example from Wikipedia:
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;
Where A, B, C etc. can be any string. The only restriction on namespace is that names cannot contain rounded or square brackets, '&', ',' or ':', because these are special characters that define the tree format, the same way that the comma defines a csv format.
Bonus: sometimes, internal nodes within the tree aren't labelled:
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);
In which case, a regex that correctly returns a string of length zero would be great.

It seems you want to extract substrings that start with 1+ (, ) or , and then contain 1+ characters other than : and ;, as many as possible, but stop at the word boundary.
Use
r'[(),]+([^;:]+)\b'
See the regex demo.
Pattern details
[(),]+ - one or more characters in the character class: (, ) or ,
([^;:]+) - Group 1: one or more chars other than ; and :, as many as possible
\b - a word boundary
Python demo:
import re
rx = r'[(),]+([^;:]+)\b'
s = "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;((A\\2009_2009-01-04:0.2,(A\\name2\\human\\2007_2007:0.3,A\\chicken\\ird16\\2016_20016:0.4)A\\name3\\epi66321\\2001_2001-04-04:0.5)A\\name_with_space\\2014_2014:0.1)A\\name4\\66036-8a\\2004_2004-12-05;"
res = re.findall(rx, s)
for val in res:
    print(val)
Output:
A
B
C
D
E
F
A\2009_2009-01-04
A\name2\human\2007_2007
A\chicken\ird16\2016_20016
A\name3\epi66321\2001_2001-04-04
A\name_with_space\2014_2014
A\name4\66036-8a\2004_2004-12-05

You can use the regex
(\w+)(?=:|;)
See the sample code:
import re
regex = r"(\w+)(?=:|;)"
test_str = "((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;"
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
The output is
Match 1 was found at 2-3: B
Match 2 was found at 9-10: C
Match 3 was found at 15-16: D
Match 4 was found at 21-22: E
Match 5 was found at 27-28: F
Match 6 was found at 33-34: A

A working solution:
[(),]([A-E])(?!;)
See live demo. One mistake you made was escaping characters inside the character class; but inside it they don't have special meaning.
I also took care of selecting against a trailing semicolon.

pattern = re.compile(r'[(),]A/[\S]*?:')
Not the most elegant, because I made use of the fact that all my names start with "A/". This will not be true for future use cases, just this current one. Will leave this question open if someone can find a more generalizable solution.
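One possibly more general approach is to build the pattern from the characters the question says a name cannot contain, rather than from a known prefix. The sketch below assumes every node, labelled or not, is followed by ':' (a branch length) or by the final ';'. The `NAME` pattern and the `node_names` helper are illustrative names, not part of any library.

```python
import re

# Assumptions: a name follows '(', ')' or ',', is followed by ':' or ';',
# and never contains the characters ( ) [ ] & , : ;
NAME = re.compile(r'[(),]([^()\[\]&,:;]*)(?=[:;])')

def node_names(newick: str) -> list:
    """Hypothetical helper: one entry per node, '' when the node is unlabelled."""
    return NAME.findall(newick)

print(node_names('(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;'))
print(node_names('(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);'))
```

Because the group uses `*` instead of `+`, unlabelled internal nodes come back as empty strings, which covers the bonus case. In the second tree the root is unlabelled as well, so the list ends with two empty strings.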

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
