RegEx for capturing scientific citations - python

I am trying to capture brackets of text that have at least one digit in them (think citations). This is my regex now, and it works fine: https://regex101.com/r/oOHPvO/5
\((?=.*\d).+?\)
So I wanted it to capture (Author 2000) and (2000) but not (Author).
I am trying to use python to capture all these brackets, but in python it also captures the text in the brackets even if they don't have digits.
import re
with open('text.txt') as f:
    f = f.read()
s = r"\((?=.*\d).*?\)"
citations = re.findall(s, f)
citations = list(set(citations))
for c in citations:
    print (c)
Any ideas what I am doing wrong?

You may use
re.findall(r'\([^()\d]*\d[^()]*\)', s)
See the regex demo
Details
\( - a ( char
[^()\d]* - 0 or more chars other than (, ) and digit
\d - a digit
[^()]* - 0 or more chars other than (, )
\) - a ) char.
Python demo:
import re
rx = re.compile(r"\([^()\d]*\d[^()]*\)")
s = "Some (Author) and (Author 2000)"
print(rx.findall(s)) # => ['(Author 2000)']
To get the results without parentheses, add a capturing group:
rx = re.compile(r"\(([^()\d]*\d[^()]*)\)")
                    ^                 ^
See this Python demo.
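To see why the lookahead version over-matches, it may help to run both patterns on the same text. This is only a sketch: the sample string is made up for illustration, and the two patterns are the one from the question and the one from this answer. The lookahead only checks that a digit appears somewhere later on the line, not inside the same pair of parentheses:

```python
import re

# Hypothetical sample text, not taken from the question's text.txt
text = "Some (Author) and (Author 2000) then (2000) and (see note)."

# Pattern from the question: the lookahead may find its digit in a later bracket
lookahead = re.compile(r"\((?=.*\d).*?\)")
# Pattern from this answer: the digit must occur before the closing parenthesis
negated = re.compile(r"\([^()\d]*\d[^()]*\)")

lookahead_hits = lookahead.findall(text)
negated_hits = negated.findall(text)

print(lookahead_hits)  # ['(Author)', '(Author 2000)', '(2000)']
print(negated_hits)    # ['(Author 2000)', '(2000)']
```

(Author) is matched by the first pattern because a digit follows it later in the string; (see note) is not, because no digit comes after it.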

A more robust way to handle this, especially as your expression grows, might be to add explicit boundaries. For example, we could use a separate character class for each part of the data we wish to collect:
(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\)).
DEMO
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))."
test_str = "some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)"
matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
const regex = /(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))./mgi;
const str = `some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

Related

Parsing parameters from string using a regex with groups in python

Please, I'm trying to grab some parameters from a string. The parameters start with : or $ and are enclosed between brackets.
Ex:
some text [more text :Parameter1] more text [more (:Parameter2)]
My goal is to get two matches as the following:
Full match: [more text :Parameter1]
Group 1: :Parameter1
Full match: [more (:Parameter2)]
Group 1: :Parameter2
The following regex almost works, except for the cases where the parameter itself is enclosed in parentheses, like Parameter2.
r"\[.*?([:\$].*?)]"
and in these cases I get:
Full match: [more (:Parameter2)]
Group 1: :Parameter2)
Note that Group 1 comes with the trailing parenthesis.
I couldn't find a way to remove it. Appreciate any help.
regex101 tests
Thanks.
If you want the parameter to be between the opening and the matching closing parenthesis, you might make use of negated character classes [^][()$:] to match any character that is not in the character class.
To match either of the possibilities you could use an alternation which will give you 2 capturing groups:
\[[^][()$:]*(?:\(([:$][^][()$:]+)\)|([:$][^][()$:]+))\]
About the pattern
\[ Match [
[^][()$:]* Match 0+ times any character that is not in the character class
(?: Non capturing group
\( Match (
( Capturing group 1
[:$][^][()$:]+ Match $ or :, then match 1+ chars not in the character class
) Close group 1
\) Match )
| Or
( Capturing group 2
[:$][^][()$:]+ Match $ or :, then match 1+ chars not in the character class
) Close group 2
) Close non capturing group
\] Match ]
Regex demo
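In Python, only one of the two capturing groups is filled for each match, so the parameter can be read from whichever group is not None. A minimal sketch, assuming the sample string from the question; the character classes are written with escaped brackets, which should be equivalent to the pattern above:

```python
import re

pattern = re.compile(
    r"\[[^\]\[()$:]*(?:\(([:$][^\]\[()$:]+)\)|([:$][^\]\[()$:]+))\]"
)
s = "some text [more text :Parameter1] more text [more (:Parameter2)]"

results = []
for m in pattern.finditer(s):
    # Group 1 is set for "(:Param)", group 2 for a bare ":Param"
    param = m.group(1) or m.group(2)
    results.append((m.group(0), param))

print(results)
# [('[more text :Parameter1]', ':Parameter1'), ('[more (:Parameter2)]', ':Parameter2')]
```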
With extended regex pattern:
import re
s = 'some text [more text :Parameter1] more text [more (:Parameter2)]'
res = re.findall(r'(\[[^\[\]:$]+\(?([:$][^:$)]+)\)?\])', s)
print(res)
The output (in format (<full_match>, <group_1>)):
[('[more text :Parameter1]', ':Parameter1'), ('[more (:Parameter2)]', ':Parameter2')]
This regex does what you want:
\[.*?([:\$].*?)\)?]
Output:
[more text :Parameter1]
:Parameter1
[more (:Parameter2)]
:Parameter2
I'd suggest a simple expression,
(\[[^(:]+([^]]+)\])
and then scripting the rest of the problem to avoid look-arounds.
Test
import re
regex = r"(\[[^(:]+([^]]+)\])"
test_str = "some text [more text :Parameter1] more text [more (:Parameter2)]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.
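"Scripting the rest" could be as small as stripping the optional parentheses from the second group after matching. This is one possible sketch, not the only way to do it; the strip("()") step is my assumption about what the post-processing would look like:

```python
import re

regex = re.compile(r"(\[[^(:]+([^]]+)\])")
test_str = "some text [more text :Parameter1] more text [more (:Parameter2)]"

# Group 2 is ":Parameter1" for the first match and "(:Parameter2)" for the second,
# so remove any surrounding parentheses in plain Python instead of in the regex.
params = [m.group(2).strip("()") for m in regex.finditer(test_str)]

print(params)  # [':Parameter1', ':Parameter2']
```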
You can use the following regex:
(\[[^:]+([:$][^])]+)[])]+)
It will be faster than using lazy quantifiers.
Regex details:
\[ matches [
[^:]+ matches 1 or more times any characters but a :
([:$][^])]+) second group:
[:$] matches either : or $
[^])]+ matches 1 or more times any characters but a ] or )
[])]+ matches ] and/or ) at least one time
Demo
import re
s = 'some text [more text :Parameter1] more text [more (:Parameter2)]'
print(re.findall(r'(\[[^:]+([:$][^])]+)[])]+)', s))
Output:
[('[more text :Parameter1]', ':Parameter1'), ('[more (:Parameter2)]', ':Parameter2')]

Python re.findall finds strangely wrong patterns [duplicate]

I'm genuinely curious why re.findall does such weird things as returning empty strings and tuples (what is that supposed to mean?). It seems it does not treat parentheses () normally, and it also interprets | wrongly: ab|cd is read as (ab)|(cd), not a(b|c)d as you would normally think. Because of that I can't define the regex I need.
But in this example I see clearly wrong behaviour on a simple pattern:
([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}
which describes simple URLs like gskinner.com and www.capitolconnection.org, as you can see in the regex help at https://regexr.com/. With re.findall I get:
hotmail.
living.
item.
2.
4S.
that is, letters followed by just a dot. How can that be?
The full code, where I try to filter out junk from the text, is:
import re
singles = r'[()\.\/$%=0-9,?!=; \t\n\r\f\v\":\[\]><]'
digits_str = singles + r'[()\-\.\/$%=0-9 \t\n\r\f\v\'\":\[\]]*'
#small_word = '[a-zA-Z0-9]{1,3}'
#junk_then_small_word = singles + small_word + '(' + singles + small_word + ')*'
email = singles + r'\S+#\S*'
http_str = r'[^\.]+\.+[^\.]+\.+([^\.]+\.+)+?'
http = '(http|https|www)' + http_str
web_address = r'([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}'
pat = email + '|' + digits_str
d_pat = re.compile(web_address)
text = '''"Lucy Gonzalez" test-defis-wtf <stagecoachmama#hotmail.com> on 11/28/2000 01:02:22 PM
http://www.living.com/shopping/item/item.jhtml?.productId=LC-JJHY-2.00-10.4S.I will send checks
directly to the vendor for any bills pre 4/20. I will fax you copies. I will also try and get the payphone transferred.
www.capitolconnection.org <http://www.capitolconnection.org>.
and/or =3D=3D=3D=3D=3D=3D=3D= O\'rourke'''
print('findall:')
for x in re.findall(d_pat,text):
    print(x)
print('split:')
for x in re.split(d_pat,text):
    print(x)
From the documentation of re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
Your regex has groups, namely the part in parentheses. If you want to display the entire match, put your regex in one big group (put parentheses around the whole thing) and then do print(x[0]) instead of print(x).
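An alternative to wrapping everything in one big group is to turn the inner group into a non-capturing one with (?:...), so that findall has no groups to return and falls back to the whole match. A small sketch on a made-up sample string (the string is an assumption, the pattern is the one from the question):

```python
import re

# Hypothetical sample text for illustration
text = "mail hotmail.com and www.capitolconnection.org today"

with_group = r"([a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}"
non_capturing = r"(?:[a-zA-Z0-9]+\.+)+[a-zA-Z0-9]{1,3}"

# With a capturing group, findall returns the last repetition of that group
group_hits = re.findall(with_group, text)
# Without capturing groups, findall returns the full matches
full_hits = re.findall(non_capturing, text)

print(group_hits)  # ['hotmail.', 'capitolconnection.']
print(full_hits)   # ['hotmail.com', 'www.capitolconnection.org']
```

This also explains the "letters then just a dot" results: they are the contents of the repeated group, not the match itself.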
My guess is that the expression itself needs to be modified, and that this might be the problem. For instance, to match the desired patterns we could start with an expression similar to:
([a-zA-Z0-9]+)\.
If we wish to have 1 to 3 chars after the ., we would expand it to:
([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?
Demo 1
Demo 2
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"([a-zA-Z0-9]+)\.([a-zA-Z0-9]{1,3})?"
test_str = ("hotmail.\n"
"living.\n"
"item.\n"
"2.\n"
"4S.\n"
"hotmail.com\n"
"living.org\n"
"item.co\n"
"2.321\n"
"4S.123")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Find substrings that start and end with same Uppercase Character

I have a homework problem where I need to use regex to parse substrings out of a large string.
The goal is to select substrings that match the following parameters:
Substring starts and ends with the same uppercase character, and I need to ignore any instances of uppercase characters with the number 0 in front of them.
For example, ZAp0ZuZAuX0AZA would contain the matches ZAp0ZuZ and AuX0AZA
I've been messing around with this for a few hours and honestly haven't even gotten close...
I've tried some stuff like the code below, but that will select everything from the first uppercase through the last uppercase. I've also
[A-Z]{1}[[:alnum:]]*[A-Z]{1} <--- this selects the whole string
[A-Z]{1}[[:alnum:]][A-Z]{1} <--- this gives me strings like ZuZ, AuX
Really appreciate any help, I'm totally stumped on this one.
Regular expressions may not be the best tool for this, since you could simply split the string instead. However, if you have to or wish to use one, this expression might give you an idea of the problems you could face as your character list expands:
(?=.[A-Z])([A-Z])(.*?)\1
The (?=.[A-Z]) lookahead I added requires the character right after the first character of the match to be an uppercase letter. You can remove it and the expression would still work (the tests below omit it), but you can add such boundaries to your expressions for safety.
JavaScript Test
const regex = /([A-Z])(.*?)\1/gm;
const str = `ZAp0ZuZAuX0AZA
ZApxxZuZAuXxafaAZA
ZApxaf09xZuZAuX090xafaAZA
abcZApxaf09xZuZAuX090xafaAZA`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
Python Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"([A-Z])(.*?)\1"
test_str = ("ZAp0ZuZAuX0AZA\n"
"ZApxxZuZAuXxafaAZA\n"
"ZApxaf09xZuZAuX090xafaAZA\n"
"abcZApxaf09xZuZAuX090xafaAZA")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
This might work
(?<!0)([A-Z]).*?(?<!0)\1
https://regex101.com/r/nES9FP/1
Explained
(?<! 0 ) # Ignore Upper case with zero in front of it
( [A-Z] ) # (1), This Upper case is to be found down stream
.*? # Lazy, any character
(?<! 0 ) # Ignore Upper case with zero in front of it
\1 # Backref to what is in group (1)
You may use
(?<!0)([A-Z]).*?(?<!0)\1
See the regex demo.
Details
(?<!0)([A-Z]) - Group 1: an ASCII uppercase letter not preceded with a zero
.*? - any char but a linebreak char as few as possible
(?<!0)\1 - the same letter as in Group 1 not immediately preceded with 0.
See the Python demo:
import re
s = "ZAp0ZuZAuX0AZA"
for m in re.finditer(r'(?<!0)([A-Z]).*?(?<!0)\1', s):
    print(m.group()) # prints ZAp0ZuZ, then AuX0AZA
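To see what the two (?<!0) lookbehinds contribute, here is a small side-by-side comparison with the plain backreference pattern ([A-Z]).*?\1 on the sample string from the question. It is only an illustration; the helper name all_matches is made up:

```python
import re

s = "ZAp0ZuZAuX0AZA"

def all_matches(pattern, text):
    # Use finditer so we get the full match, not just the captured letter
    return [m.group() for m in re.finditer(pattern, text)]

plain = all_matches(r"([A-Z]).*?\1", s)
guarded = all_matches(r"(?<!0)([A-Z]).*?(?<!0)\1", s)

print(plain)    # ['ZAp0Z', 'ZAuX0AZ']
print(guarded)  # ['ZAp0ZuZ', 'AuX0AZA']
```

Without the lookbehinds, the match ends at the first repeated letter even when a 0 precedes it, which is exactly the case the question wants to ignore.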

How to force Python RegEx to match all possible groups connected by | (*OR*)

I am new to RegEx, and I wonder if there is a way that we can force RegEx to match all possible groups (if there are multiple) at the same 'match' where patterns are connected by OR (see below).
I've tried this: (?P<broad>travel)|(?P<step>step)|(?P<dist>distance|far|km), but if the input is: Tell me how many steps I traveled, the code only matches one of travel or step. I've also tried using findall instead of search, but then the group information is lost (because the output is a list).
I expect that the code can match all possible groups in the same 'match' if available, instead of quitting as soon as a match is found.
Current output:
Match 1
broad None
step step
dist None
Match 2
broad travel
step None
dist None
Expected output:
Match 1
broad travel
step step
dist None
One option here is to use finditer, which yields a match object for each match, and test our expression:
Demo
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(travel)|(step)|(distance|far|km)"
test_str = "Tell me how many steps I traveled"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Demo
const regex = /(travel)|(step)|(distance|far|km)/gm;
const str = `Tell me how many steps I traveled`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
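A single match can only ever take one branch of an alternation, so the "Expected output" from the question cannot come out of one match object. One possible workaround, sketched here under that assumption, is to keep the named groups from the question and merge the finditer results into one dictionary. The helper name collect_groups is hypothetical:

```python
import re

pattern = re.compile(
    r"(?P<broad>travel)|(?P<step>step)|(?P<dist>distance|far|km)"
)

def collect_groups(text):
    # Start with every named group set to None
    found = dict.fromkeys(pattern.groupindex)
    for m in pattern.finditer(text):
        # lastgroup is the name of the branch that matched
        name = m.lastgroup
        if found[name] is None:
            found[name] = m.group(name)
    return found

merged = collect_groups("Tell me how many steps I traveled")
print(merged)  # {'broad': 'travel', 'step': 'step', 'dist': None}
```

Only the first hit per group is kept here; collecting all hits in a list per group would be a straightforward variation.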

Regex for Text Between Brackets and Text Between Semicolons

I have the following shape of string: PW[Yasui Chitetsu]; and would like to get only the name inside the brackets: Yasui Chitetsu. I'm trying something like
[^(PW\[)](.*)[^\]]
as a regular expression, but the last bracket is still in it. How do I unselect it? I don't think I need anything fancy like look behinds, etc, for this case.
The Problems with What You've Tried
There are a few problems with what you've tried:
It will omit the first and last characters of your match from the group, giving you something like asui Chitets.
It will have even more errors on strings that start with P or W. For example, in PW[Paul McCartney], you would match only ul McCartne with the group and aul McCartney with the full match.
The Regex
You want something like this:
(?<=\[)([^]]+)(?=\])
Here's a regex101 demo.
Explanation
(?<=\[) means that the match must be preceded by [
([^]]+) matches 1 or more characters that are not ]
(?=\]) means that the match must be followed by ]
Sample Code
Here's some sample code (from the above regex101 link):
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?<=\[)([^]]+)(?=\])"
test_str = "PW[Yasui Chitetsu]"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Semicolons
In your title, you mentioned finding text between semicolons. The same logic would work for that, giving you this regex:
(?<=;)([^;]+)(?=;)
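Both expressions can be tried with re.findall in a few lines. This is a rough sketch: the semicolon-separated string is made up for illustration, and the last pattern is a plain capturing-group variant without lookarounds that I am adding as an alternative, since the question says lookbehinds should not be necessary:

```python
import re

# Lookaround versions from above; findall returns the contents of the single group
bracket_text = re.findall(r"(?<=\[)([^]]+)(?=\])", "PW[Yasui Chitetsu];")
# Hypothetical semicolon-separated sample
between_semicolons = re.findall(r"(?<=;)([^;]+)(?=;)", "a;b;c;d")

# Alternative without lookarounds: match the brackets, capture only the inside
plain_group = re.search(r"\[([^\]]+)\]", "PW[Yasui Chitetsu];").group(1)

print(bracket_text)        # ['Yasui Chitetsu']
print(between_semicolons)  # ['b', 'c']
print(plain_group)         # Yasui Chitetsu
```

Because the lookarounds do not consume the semicolons, neighbouring fields can share a delimiter; the first and last fields (a and d) are not returned since they are not enclosed by semicolons on both sides.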
