Find substrings that start and end with same Uppercase Character

I have a homework problem where I need to use regex to parse substrings out of a large string.
The goal is to select substrings that match the following parameters:
Substring starts and ends with the same uppercase character, and I need to ignore any instances of uppercase characters with the number 0 in front of them.
For example, ZAp0ZuZAuX0AZA would contain the matches ZAp0ZuZ and AuX0AZA
I've been messing around with this for a few hours and honestly haven't even gotten close...
I've tried patterns like the ones below, but the first selects everything from the first uppercase letter through the last one, and the second only returns three-character strings:
[A-Z]{1}[[:alnum:]]*[A-Z]{1} <--- this selects the whole string
[A-Z]{1}[[:alnum:]][A-Z]{1} <--- this gives me strings like ZuZ, AuX
Really appreciate any help, I'm totally stumped on this one.

Regular expressions may not be the best tool for this, since you could simply split the string. However, if you have to or wish to use one, this expression might give you an idea of the problems you could face as your character list expands:
(?=.[A-Z])([A-Z])(.*?)\1
The (?=.[A-Z]) lookahead requires the character right after the opening letter to be uppercase as well. You can remove it and the expression still works, which is what the tests below do. You can, however, add such boundaries to your expressions for safety.
JavaScript Test
const regex = /([A-Z])(.*?)\1/gm;
const str = `ZAp0ZuZAuX0AZA
ZApxxZuZAuXxafaAZA
ZApxaf09xZuZAuX090xafaAZA
abcZApxaf09xZuZAuX090xafaAZA`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Python Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"([A-Z])(.*?)\1"
test_str = ("ZAp0ZuZAuX0AZA\n"
"ZApxxZuZAuXxafaAZA\n"
"ZApxaf09xZuZAuX090xafaAZA\n"
"abcZApxaf09xZuZAuX090xafaAZA")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
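As a quick check against the sample string from the question, the sketch below runs the pattern above as it is, without any handling of a preceding zero. The helper name find_spans is made up for illustration. On this input it appears to end the first match at the Z that follows a 0, so the "ignore letters preceded by 0" rule still needs a lookbehind or some extra scripting.

```python
import re

def find_spans(text):
    # Hypothetical helper: spans that start and end with the same uppercase
    # letter, with NO special handling of a preceding zero.
    return [m.group() for m in re.finditer(r"([A-Z])(.*?)\1", text)]

print(find_spans("ZAp0ZuZAuX0AZA"))  # ['ZAp0Z', 'ZAuX0AZ']
```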

This might work
(?<!0)([A-Z]).*?(?<!0)\1
https://regex101.com/r/nES9FP/1
Explained
(?<! 0 ) # Ignore Upper case with zero in front of it
( [A-Z] ) # (1), This Upper case is to be found down stream
.*? # Lazy, any character
(?<! 0 ) # Ignore Upper case with zero in front of it
\1 # Backref to what is in group (1)
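The commented layout above can be kept in real code with Python's re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. This is only a sketch; the function name same_letter_spans is an assumption made for the example.

```python
import re

PATTERN = re.compile(r"""
    (?<!0)      # the opening letter must not be preceded by a zero
    ([A-Z])     # group 1: the opening uppercase letter
    .*?         # lazily consume any characters (no line breaks)
    (?<!0)      # the closing letter must not be preceded by a zero
    \1          # the same letter that group 1 captured
""", re.VERBOSE)

def same_letter_spans(text):
    # Hypothetical helper: return every non-overlapping match as a string.
    return [m.group() for m in PATTERN.finditer(text)]

print(same_letter_spans("ZAp0ZuZAuX0AZA"))  # ['ZAp0ZuZ', 'AuX0AZA']
```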

You may use
(?<!0)([A-Z]).*?(?<!0)\1
See the regex demo.
Details
(?<!0)([A-Z]) - Group 1: an ASCII uppercase letter not preceded with a zero
.*? - any char but a linebreak char as few as possible
(?<!0)\1 - the same letter as in Group 1 not immediately preceded with 0.
See the Python demo:
import re
s="ZAp0ZuZAuX0AZA"
for m in re.finditer(r'(?<!0)([A-Z]).*?(?<!0)\1', s):
    print(m.group()) # prints ZAp0ZuZ, then AuX0AZA

Related

Parsing parameters from string using a regex with groups in python

Please, I'm trying to grab some parameters from a string. The parameters start with : or $ and are enclosed between brackets.
Ex:
some text [more text :Parameter1] more text [more (:Parameter2)]
My goal is to get two matches as the following:
Full match: [more text :Parameter1]
Group 1: :Parameter1
Full match: [more (:Parameter2)]
Group 1: :Parameter2
The following regex almost works, except for the cases where the parameter itself is enclosed in parentheses, like Parameter2.
r"\[.*?([:\$].*?)]"
and in these cases I get:
Full match: [more (:Parameter2)]
Group 1: :Parameter2)
Note that group1 comes with the last parenthesis.
I couldn't find a way to remove it. Appreciate any help.
regex101 tests
Thanks.

If you want the parameter to be between the opening and the matching closing parenthesis, you might make use of negated character classes [^][()$:] to match any character that is not in the character class.
To match either of the possibilities you could use an alternation which will give you 2 capturing groups:
\[[^][()$:]*(?:\(([:$][^][()$:]+)\)|([:$][^][()$:]+))\]
About the pattern
\[ Match [
[^][()$:]* Match 0+ times any character that is not in the character class
(?: Non capturing group
\( Match (
( Capturing group 1
[:$][^][()$:]+ Match $ or :, then match 1+ chars not in the character class
) Close group 1
\) Match )
| Or
( Capturing group 2
[:$][^][()$:]+ Match $ or :, then match 1+ chars not in the character class
) Close group 2
) Close non capturing group
\] Match ]
Regex demo
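Because the alternation puts the parameter in group 1 or in group 2 depending on whether it was wrapped in parentheses, the calling code has to pick whichever group took part in the match. A minimal sketch of that step, where find_parameters is a name invented for this example:

```python
import re

PATTERN = re.compile(r"\[[^][()$:]*(?:\(([:$][^][()$:]+)\)|([:$][^][()$:]+))\]")

def find_parameters(text):
    # Hypothetical helper: return (full_match, parameter) pairs.
    # Only one of groups 1 and 2 is set per match; the other is None.
    return [(m.group(0), m.group(1) or m.group(2)) for m in PATTERN.finditer(text)]

s = "some text [more text :Parameter1] more text [more (:Parameter2)]"
for full, param in find_parameters(s):
    print(full, "->", param)
```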

With extended regex pattern:
import re
s = 'some text [more text :Parameter1] more text [more (:Parameter2)]'
res = re.findall(r'(\[[^\[\]:$]+\(?([:$][^:$)]+)\)?\])', s)
print(res)
The output (in format (<full_match>, <group_1>)):
[('[more text :Parameter1]', ':Parameter1'), ('[more (:Parameter2)]', ':Parameter2')]

This regex does what you want:
\[.*?([:\$].*?)\)?]
Output:
[more text :Parameter1]
:Parameter1
[more (:Parameter2)]
:Parameter2

I'd suggest a simple expression,
(\[[^(:]+([^]]+)\])
and then scripting the rest of the problem to avoid look-arounds.
Test
import re
regex = r"(\[[^(:]+([^]]+)\])"
test_str = "some text [more text :Parameter1] more text [more (:Parameter2)]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.
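With this simple expression, group 2 still carries the surrounding parentheses when the parameter is wrapped in them, so "scripting the rest" could be as small as stripping them afterwards. A sketch under that assumption (the name extract_parameters is hypothetical, and it assumes parameter names never start or end with a parenthesis of their own):

```python
import re

REGEX = re.compile(r"(\[[^(:]+([^]]+)\])")

def extract_parameters(text):
    # Hypothetical helper: take group 2 and drop any wrapping parentheses.
    return [m.group(2).strip("()") for m in REGEX.finditer(text)]

test_str = "some text [more text :Parameter1] more text [more (:Parameter2)]"
print(extract_parameters(test_str))  # [':Parameter1', ':Parameter2']
```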

You can use the following regex:
(\[[^:]+([:$][^])]+)[])]+)
It will be faster than using lazy quantifiers.
Regex details:
\[ matches [
[^:]+ matches 1 or more times any characters but a :
([:$][^])]+) second group:
[:$] matches either : or $
[^])]+ matches 1 or more times any characters but a ] or )
[])]+ matches ] and/or ) at least one time
Demo
import re
s = 'some text [more text :Parameter1] more text [more (:Parameter2)]'
print(re.findall(r'(\[[^:]+([:$][^])]+)[])]+)', s))
Output:
[('[more text :Parameter1]', ':Parameter1'), ('[more (:Parameter2)]', ':Parameter2')]

RegEx for capturing scientific citations

I am trying to capture brackets of text that have at least one digit in them (think citations). This is my regex now, and it works fine: https://regex101.com/r/oOHPvO/5
\((?=.*\d).+?\)
So I wanted it to capture (Author 2000) and (2000) but not (Author).
I am trying to use python to capture all these brackets, but in python it also captures the text in the brackets even if they don't have digits.
import re
with open('text.txt') as f:
    f = f.read()
s = r"\((?=.*\d).*?\)"
citations = re.findall(s, f)
citations = list(set(citations))
for c in citations:
    print (c)
Any ideas what I am doing wrong?

You may use
re.findall(r'\([^()\d]*\d[^()]*\)', s)
See the regex demo
Details
\( - a ( char
[^()\d]* - 0 or more chars other than (, ) and digit
\d - a digit
[^()]* - 0 or more chars other than (, )
\) - a ) char.
Python demo:
import re
rx = re.compile(r"\([^()\d]*\d[^()]*\)")
s = "Some (Author) and (Author 2000)"
print(rx.findall(s)) # => ['(Author 2000)']
To get the results without parentheses, add a capturing group:
rx = re.compile(r"\(([^()\d]*\d[^()]*)\)")
                    ^                ^
See this Python demo.
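One likely reason the original pattern also returns digit-free brackets is that the lookahead (?=.*\d) is free to scan past the closing parenthesis, so any digit later on the same line satisfies it. The sketch below compares that pattern with a variant whose lookahead is confined to the current pair of parentheses; the sample string is made up for illustration.

```python
import re

s = "Some (Author) and (Author 2000), see also (2000)"

# The lookahead may look beyond ")" and find a digit further along the line.
loose = re.findall(r"\((?=.*\d).+?\)", s)

# The lookahead may only look at characters inside the current parentheses.
strict = re.findall(r"\((?=[^()]*\d)[^()]+\)", s)

print(loose)   # ['(Author)', '(Author 2000)', '(2000)']
print(strict)  # ['(Author 2000)', '(2000)']
```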

A more robust way to handle this may be to add boundaries, since your expression is likely to grow. For example, we could define character lists for each part of the data we wish to collect. Note that this expression expects an author name before the year, so it would not match (2000) on its own:
(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\)).
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))."
test_str = "some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)"
matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
const regex = /(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))./mgi;
const str = `some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

How to force Python RegEx to match all possible groups connected by | (OR)

I am new to RegEx, and I wonder if there is a way that we can force RegEx to match all possible groups (if there are multiple) at the same 'match' where patterns are connected by OR (see below).
I've tried this: (?P<broad>travel)|(?P<step>step)|(?P<dist>distance|far|km), but if the input is: Tell me how many steps I traveled, the code only matches one of travel or step. I've also tried using findall instead of search, but then the group information is lost (because the output is a list).
I expect that the code can match all possible groups in the same 'match' if available, instead of quitting as soon as a match is found.
Current output:
Match 1
broad None
step step
dist None
Match 2
broad travel
step None
dist None
Expected output:
Match 1
broad travel
step step
dist None

Maybe here, we can use finditer and test our expression:
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(travel)|(step)|(distance|far|km)"
test_str = "Tell me how many steps I traveled"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
const regex = /(travel)|(step)|(distance|far|km)/gm;
const str = `Tell me how many steps I traveled`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
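A single match of an alternation only ever fills one branch, so the "all groups in one result" output from the question has to be assembled in code. One possible approach, sketched below, is to walk every finditer match and merge the named groups into one dictionary. The function name collect_groups is an assumption, and the sketch keeps only the first hit per group.

```python
import re

PATTERN = re.compile(r"(?P<broad>travel)|(?P<step>step)|(?P<dist>distance|far|km)")

def collect_groups(text):
    # Start with every group name mapped to None, as in the expected output.
    found = dict.fromkeys(PATTERN.groupindex)
    for m in PATTERN.finditer(text):
        name = m.lastgroup  # the named branch that produced this match
        if found[name] is None:
            found[name] = m.group(name)
    return found

print(collect_groups("Tell me how many steps I traveled"))
# {'broad': 'travel', 'step': 'step', 'dist': None}
```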

Regex for Text Between Brackets and Text Between Semicolons

I have the following shape of string: PW[Yasui Chitetsu]; and would like to get only the name inside the brackets: Yasui Chitetsu. I'm trying something like
[^(PW\[)](.*)[^\]]
as a regular expression, but the last bracket is still in it. How do I unselect it? I don't think I need anything fancy like look behinds, etc, for this case.

The Problems with What You've Tried
There are a few problems with what you've tried:
It will omit the first and last characters of your match from the group, giving you something like asui Chitets.
It will have even more errors on strings that start with P or W. For example, in PW[Paul McCartney], you would match only ul McCartne with the group and aul McCartney with the full match.
The Regex
You want something like this:
(?<=\[)([^]]+)(?=\])
Here's a regex101 demo.
Explanation
(?<=\[) means that the match must be preceded by [
([^]]+) matches 1 or more characters that are not ]
(?=\]) means that the match must be followed by ]
Sample Code
Here's some sample code (from the above regex101 link):
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?<=\[)([^]]+)(?=\])"
test_str = "PW[Yasui Chitetsu]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Semicolons
In your title, you mentioned finding text between semicolons. The same logic would work for that, giving you this regex:
(?<=;)([^;]+)(?=;)
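Since the question asks for something without lookbehinds, a plain capturing group is another option for the bracket case, while the semicolon pattern benefits from lookarounds because they leave each semicolon available to the next field. A short sketch of both, with helper names and the semicolon sample string invented for the example:

```python
import re

def bracket_contents(text):
    # Capture what sits between [ and ] using only a capturing group.
    return re.findall(r"\[([^\]]+)\]", text)

def semicolon_fields(text):
    # The lookarounds do not consume the semicolons, so neighbouring
    # fields can share the same delimiter.
    return re.findall(r"(?<=;)([^;]+)(?=;)", text)

print(bracket_contents("PW[Yasui Chitetsu];"))  # ['Yasui Chitetsu']
print(semicolon_fields(";first;second;third"))  # ['first', 'second']
```

Note that the last field, third, is not returned because no semicolon follows it.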

Python regex: parsing newick format

I have a string like:
(A\2009_2009-01-04:0.2,(A\name2\human\2007_2007:0.3,A\chicken\ird16\2016_20016:0.4)A\name3\epi66321\2001_2001-04-04:0.5)A\name_with_space\2014_2014:0.1)A\name4\66036-8a\2004_2004-12-05;
In this tree, names are enclosed on the left by either an open bracket "(", a closing bracket ")", or a comma, and enclosed on the right with a colon ':'. That is, the substrings "A\2009_2009-01-04", "A\name2\human\2007_2007", "A\name3\epi66321\2001_2001-04-04", are names. (this is actually a tree in newick format).
I'd like to find a regex pattern which finds all names, with as little restriction on namespace as possible. Think of names as variables, like this example from Wikipedia:
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;
Where A, B, C etc. can be any string. The only restriction on namespace is that names cannot contain rounded or square brackets, '&', ',' or ':', because these are special characters that define the tree format, the same way that the comma defines a csv format.
Bonus: sometimes, internal nodes within the tree aren't labelled:
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);
In which case, a regex that correctly returns a string of length zero would be great.

It seems you want to extract substrings that start with one or more (, ) or , characters and then contain one or more characters other than : and ;, as many as possible, but stop at a word boundary.
Use
r'[(),]+([^;:]+)\b'
See the regex demo.
Pattern details
[(),]+ - one or more characters in the character class: (, ) or ,
([^;:]+) - Group 1: one or more chars other than ; and :, as many as possible
\b - a word boundary
Python demo:
import re
rx = r'[(),]+([^;:]+)\b'
s = "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;((A\\2009_2009-01-04:0.2,(A\\name2\\human\\2007_2007:0.3,A\\chicken\\ird16\\2016_20016:0.4)A\\name3\\epi66321\\2001_2001-04-04:0.5)A\\name_with_space\\2014_2014:0.1)A\\name4\\66036-8a\\2004_2004-12-05;"
res = re.findall(rx, s)
for val in res:
    print(val)
Output:
A
B
C
D
E
F
A\2009_2009-01-04
A\name2\human\2007_2007
A\chicken\ird16\2016_20016
A\name3\epi66321\2001_2001-04-04
A\name_with_space\2014_2014
A\name4\66036-8a\2004_2004-12-05

You can use the regex
(\w+)(?=:|;)
See the sample code:
import re
regex = r"(\w+)(?=:|;)"
test_str = "((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A;"
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
The output is
Match 1 was found at 2-3: B
Match 2 was found at 9-10: C
Match 3 was found at 15-16: D
Match 4 was found at 21-22: E
Match 5 was found at 27-28: F
Match 6 was found at 33-34: A

A working solution:
[(),]([A-E])(?!;)
See live demo. One mistake you made was escaping characters inside the character class; but inside it they don't have special meaning.
I also took care of selecting against a trailing semicolon.

pattern = re.compile(r'[(),]A/[\S]*?:')
Not the most elegant, because I made use of the fact that all my names start with "A/". This will not be true for future use cases, just this current one. Will leave this question open if someone can find a more generalizable solution.
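A possibly more general sketch, built only from the rules stated in the question: a name follows (, ) or a comma, ends at : (or at ; for the root), and may not contain round or square brackets, &, a comma or a colon. Using * instead of + lets unlabelled nodes come back as empty strings. The function name node_names is an assumption, and the pattern assumes every non-root node carries a :length suffix, as in the examples.

```python
import re

# Lookbehind: the name starts right after "(", ")" or ",".
# Lookahead: the name ends right before ":" or ";".
NAME = re.compile(r"(?<=[(),])[^()\[\],:;&]*(?=[:;])")

def node_names(newick):
    # Hypothetical helper: return node names in order; "" marks an unlabelled node.
    return NAME.findall(newick)

print(node_names("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;"))
# ['A', 'B', 'C', 'D', 'E', 'F']
print(node_names("(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);"))
# ['A', 'B', 'C', 'D', '', '']
```

In the second tree both the internal node and the root are unlabelled, which is why two empty strings are returned.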

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.
