python regex: capturing group within OR

python regex: capturing group within OR - python

I'm using python and the re module to parse some strings and extract a 4 digits code associated with a prefix. Here are 2 examples of strings I would have to parse:
str1 = "random stuff tokenA1234 more stuff"
str2 = "whatever here tokenB5678 tokenA0123 and more there"
tokenA and tokenB are the prefixes and 1234, 5678, 0123 are the digits I need to grab. token A and B are just an example here. The prefix can be something like an address http://domain.com/ (tokenA) or a string like Id: ('[Ii]d:?\s?') (tokenB).
My regex looks like:
re.findall('.*?(?:tokenA([0-9]{4})|tokenB([0-9]{4})).*?', str1)
When parsing the 2 strings above, I get:
[('1234','')]
[('','5678'),('0123','')]
And I'd like to simply get ['1234'] or ['5678','0123'] instead of a tuple.
How can I modify the regex to achieve that? Thanks in advance.

You get tuples as a result since you have more than 1 capturing group in your regex. See re.findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So, the solution is to use only one capturing group.
Since you have tokens in your regex, you can use them inside a group. Since only tokens differ, ([0-9]{4}) part is common for both, just use an alternation operator between tokens put into a non-capturing group:
(?:tokenA|tokenB)([0-9]{4})
^^^^^^^^^^^^^^^^^
The regex means:
(?:tokenA|tokenB) - match but not capture tokenA or tokenB
([0-9]{4}) - match and capture into Group 1 four digits
IDEONE demo:
import re
s = "tokenA1234tokenB34567"
print(re.findall(r'(?:tokenA|tokenB)([0-9]{4})', s))
Result: ['1234', '3456']

Simply do this:
re.findall(r"token[AB](\d{4})", s)
Put [AB] inside a character class, so that it would match either A or B

Related

Extract date from inside a string with Python

I have the following string, while the first letters can differ and can also be sometimes two, sometimes three or four.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date.
filename_without_ending.split(".")[0][-6:]
PZA191030_392001_USB
Sometimes it looks liket his
This solution is not valid since this is also might differ from time to time. The only REAL pattern is really the first six numbers.
How do I do this?
Thank you!

You could get the first 6 digits using a pattern an a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
Regex demo | Python demo
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
print(matches.group(1))
Output
191030

You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030

This can also be done by:
filename_without_ending.split(".")[0][2::]
This splits the string from the 3rd letter to the end.

Since first letters can differ we have to ignore alphabets and extract digits.
So using re module (for regular expressions) apply regex pattern on string. It will give matching pattern out of string.
'\d' is used to match [0-9]digits and + operator used for matching 1 digit atleast(1/more).
findall() will find all the occurences of matching pattern in a given string while #search() is used to find matching 1st occurence only.
import re
str="PR191030.213101.ABD"
print(re.findall(r"\d+",str)[0])
print(re.search(r"\d+",str).group())

regex. Find multiple occurrence of pattern

I have the following string
my_string = "this data is F56 F23 and G87"
And I would like to use regex to return the following output
['F56 F23', 'G87']
So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
I approached the problem with python and with this code
import re
re.findall(r'\b(F\d{2}|G\d{2})\b', my_string)
I was able to get all the occurrences
['F56', 'F23', 'G87']
But I would like to have the first two groups together since they are consecutive occurrences. Any ideas of how I can achieve that?

You can use this regex:
\b[FG]\d{2}(?:\s+[FG]\d{2})*\b
Non-capturing group (?:\s+[FG]\d{2})* will find zero or more of the following space separated F/G substrings.
Code:
>>> my_string = "this data is F56 F23 and G87"
>>> re.findall(r'\b[FG]\d{2}(?:\s+[FG]\d{2})*\b', my_string)
['F56 F23', 'G87']

So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
You can do this with:
\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b
in case it is separated by at least one spacing character. If that is not a requirement, you can do this with:
\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b
Both the first and second regex generate:
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']

print map(lambda x : x[0].strip(), re.findall(r'((\b(F\d{2}|G\d{2})\b\s*)+)', my_string))
change your regex to r'((\b(F\d{2}|G\d{2})\b\s*)+)' (brackets around, /s* to find all, that are connected by whitespaces, a + after the last bracket to find more than one occurance (greedy)
now you have a list of lists, of which you need every 0th Argument. You can use map and lambda for this. To kill last blanks I used strip()

Regex: skip the first match of a character in group?

From this string
s = 'stringalading-0.26.0-1'
I'd like to extract the part 0.26.0-1. I can think of various ways to achieve this, using split or a regular expression using a pattern like this
pattern = r'\d+\.\d+\.\d+\-\d+'
I also tried to use a group of characters, like so:
pattern = r'[.\-\d]+'
This gives me:
In [30]: re.findall(pattern, s)
Out[30]: ['-0.26.0-1']
So I wondered: is it possible to skip the first occurrence of a character in a group, in this case the first occurrence of -?

is it possible to to skip the first occurrence of a character in a group, in this case the first occurrence of -?
NO, because when matching, the regex engine processes the string from left to right, and once the matching pattern is found, the matched chunk of text is written to the match buffer. Thus, either write a regex that only matches what you need, or post-process the found result by stripping unwanted characters from the left.
I think you do not need a regex here. You can split the string with - and pass the maxsplit argument set to 1, then just access the second item:
s = 'stringalading-0.26.0-1'
print(s.split("-", 1)[1]) # => '0.26.0-1'
See the Python demo
Also, your first regex works well:
import re
s = 'stringalading-0.26.0-1'
pat = r'\d+\.\d+\.\d+-\d+'
print(re.findall(pat, s)) # => ['0.26.0-1']

Do:
-(.*)
and get captured group 1.
Example:
In [9]: s = 'stringalading-0.26.0-1'
In [10]: re.search(r'-(.*)', s).group(1)
Out[10]: '0.26.0-1'

regular expression: may or may not contain a string

I want to match a floating number that might be in the form of 0.1234567 or 1.23e-5
Here is my python code:
import re
def main():
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
for svs_elem in m2:
print svs_elem
main()
It prints blank... Based on my test, the problem was in (e-\d+)? part.

See emphasis:
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
You have a group, so it’s returned instead of the entire match, but it doesn’t match in any of your cases. Make it non-capturing with (?:e-\d+):
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')

Use a non-capturing group. The matches are succeeding, but the output is the contents of the optional groups that don't actually match.
See the output when your input includes something like e-6:
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['', '', 'e-6']
With a non-capturing group ((?:...)):
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['1:0.00003', '3:0.123456', '8:-0.12345e-6']
Here's are some simpler examples to demonstrate how capturing groups work and how they influence the output of findall. First, no groups:
>>> re.findall("a[bc]", "ab")
["ab"]
Here, the string "ab" matched the regex, so we print everything the regex matched.
>>> re.findall("a([bc])", "ab")
["b"]
This time, we put the [bc] inside a capturing group, so even though the entire string is still matched by the regex, findall only includes the part inside the capturing group in its output.
>>> re.findall("a(?:[bc])", "ab")
["ab"]
Now, by converting the capturing group to a non-capturing group, findall again uses the match of the entire regex in its output.
>>> re.findall("a([bc])?", "a")
['']
>>> re.findall("a(?:[bc])?", "a")
['a']
In both of these final case, the regular expression as a whole matches, so the return value is a non-empty list. In the first one, the capturing group itself doesn't match any text, though, so the empty string is part of the output. In the second, we don't have a capturing group, so the match of the entire regex is used for the output.

Regular Expression in python

When the parenthesis were used in the below program output is
['www.google.com'].
import re
teststring = "href=\"www.google.com\""
m=re.findall('href="(.*?)"',teststring)
print m;
If parenthesis is removed in findall function output is ['href="www.google.com"'].
import re
teststring = "href=\"www.google.com\""
m=re.findall('href=".*?"',teststring)
print m;
Would be helpful if someone explained how it works.

The re.findall() documentation is quite clear on the difference:
Return all non-overlapping matches of pattern in string, as a list of strings. […] If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
So .findall() returns a list containing one of three types of values, depending on the number of groups in the pattern:
0 capturing groups in the pattern (no (...) parenthesis): the whole matched string ('href="www.google.com"' in your second example).
1 capturing group in the pattern: return the captured group ('www.google.com' in your first example).
more than 1 capturing group in the pattern: return a tuple of all matched groups.
Use non-capturing groups ((?:...)) if you don't want that behaviour, or add groups if you want more information. For example, adding a group around the href= part would result in a list of tuples with two elements each:
>>> re.findall('(href=)"(.*?)"', teststring)
[('href=', 'www.google.com')]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

python regex: capturing group within OR - python

Simply do this: re.findall(r"token[AB](\d{4})", s) Put [AB] inside a character class, so that it would match either A or B

Related

Extract date from inside a string with Python

regex. Find multiple occurrence of pattern

Regex: skip the first match of a character in group?

regular expression: may or may not contain a string

Regular Expression in python

Categories

Resources