Python regex: how to capture all groups that may be optional?

minutes_pattern = re.compile(
r"""
(?:(?P<hours>\d+)H)?
(?:(?P<minutes>\d+)M)?
(?:(?P<seconds>\d+)S)?
""",
re.VERBOSE,
)
I'm trying to capture all HOURS, MINUTES, SECONDS.
However, these may or may not exist:
Examples:
PT30H20M10S
PT20M10S
PT20M
PT3010S
PT30H
I'm thinking (?:(?P<hours>\d+)H)? says:
?: non capturing group
?P hours group name
\d+ one or more digits
H must exist
)? this group may or may not exist
But this compiled pattern does not capture any of the 3 groups. Help please
regex = minutes_pattern.search(duration)
print(regex.groups())
This only returns None
(None, None, None)
Ideally this would show (for the 1st example):
group hours: 30
group minutes: 20
group seconds: 10
And for the 2nd example:
group hours: None
group minutes: 20
group seconds: 10

The problem is that your regex successfully matches an empty string, so when you use .search to search for the first match, the first match occurs at index 0 with a length of 0:
>>> minutes_pattern.search(duration)
<_sre.SRE_Match object; span=(0, 0), match=''>
The simplest way to fix this would be to include the letters PT at the start of your regex, since these are always present in your input string, and they prevent the regex returning an empty match:
>>> minutes_pattern = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")
>>> minutes_pattern.search(duration)
<_sre.SRE_Match object; span=(0, 11), match='PT30H20M10S'>
>>> minutes_pattern.search(duration).groups()
('30', '20', '10')
Alternatively, if you don't want to include PT in your regex (perhaps some other inputs won't include that prefix), you could use .findall instead of .search, and then filter the list to find the "non-trivial" match:
>>> minutes_pattern = re.compile(r"(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")
>>> minutes_pattern.findall(duration)
[('', '', ''), ('', '', ''), ('30', '20', '10'), ('', '', '')]
>>> next(filter(any, minutes_pattern.findall(duration)), None)
('30', '20', '10')
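Putting the first fix together with the named groups from the question, a minimal sketch might look like the following. The function name parse_duration and the use of fullmatch (which rejects input without the PT prefix) are assumptions for illustration, not part of the original answer.

```python
import re

# Anchoring on the literal "PT" prefix prevents the zero-length match at index 0.
DURATION = re.compile(
    r"PT(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?"
)

def parse_duration(text):
    """Return {'hours': ..., 'minutes': ..., 'seconds': ...} with int or None
    values, or None when the text is not a PT-prefixed duration."""
    match = DURATION.fullmatch(text)
    if match is None:
        return None
    # groupdict() maps each named group to its text, or None if it did not match.
    return {
        name: int(value) if value is not None else None
        for name, value in match.groupdict().items()
    }

print(parse_duration("PT30H20M10S"))  # {'hours': 30, 'minutes': 20, 'seconds': 10}
print(parse_duration("PT20M10S"))     # {'hours': None, 'minutes': 20, 'seconds': 10}
```

Unmatched optional groups come back as None, which matches the output asked for in the question.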


Python recognize part of string (position and length)

I have a file (.VAR) that gives a position and a length within a string on each row; see the example below.
*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE
How do I retrieve the " xxLxxx:" value, which is always preceded by a space and always ends with a colon, but is never at the same location within the string?
Preferably I would like to find the number before L as the position and the number after L as the length, but searching only for "L" would also match other values within the string. Therefore I think I have to use the space_number_L_number_colon pattern to recognize this part, but I don't know how.
Any thoughts? TIA
You can use a regex here.
Example:
s='''*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE'''
import re
out = re.findall(r'\s(\d+)L(\d+):', s)
output: [('1', '8'), ('29', '4'), ('33', '2')]
As integers:
out = [tuple(map(int, x)) for x in re.findall(r'\s(\d+)L(\d+):', s)]
output: [(1, 8), (29, 4), (33, 2)]
regex:
\s # space
(\d+) # capture one or more digits
L # literal L
(\d+) # capture one or more digits
: # literal :
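If the file is processed row by row, the same pattern can be applied per line with named groups. This is only a sketch: the function name parse_var_lines and the choice to also capture the text after the colon as a name are assumptions, since the question does not say how the values are used afterwards.

```python
import re

# Same pattern as above, with named groups; "name" (the text after the colon)
# is an extra capture added for illustration.
FIELD = re.compile(r"\s(?P<position>\d+)L(?P<length>\d+):(?P<name>.*)")

def parse_var_lines(text):
    """Return a list of (name, position, length) tuples, one per matching row."""
    fields = []
    for line in text.splitlines():
        match = FIELD.search(line)
        if match:  # rows without a " <digits>L<digits>:" part are skipped
            fields.append(
                (match["name"], int(match["position"]), int(match["length"]))
            )
    return fields

sample = """*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE"""

print(parse_var_lines(sample))
# [('StringONE', 1, 8), ('StringTWO', 29, 4), ('StringTHREE', 33, 2)]
```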

Regular expression to extract number with hyphen

The text is like "1-2years. 3years. 10years."
I want to get the result [(1,2),(3),(10)].
I use Python.
I first tried r"([0-9]?)[-]?([0-9])years". It works well except for the case of 10. I also tried r"([0-9]?)[-]?([0-9]|10)years" but the result is still [(1,2),(3),(1,0)].
Your attempt r"([0-9]?)[-]?([0-9])years" doesn't work for the case of 10 because you ask it to match one (or zero) digit per group.
You also don't need the hyphen in brackets.
This should work:
(\d+)(?:-(\d+))?years
Explanation:
(\d+): Capturing group for one or more digits
(?: ): Non-capturing group
- : hyphen
(\d+): Capturing group for one or more digits
(?: )?: Make the previous non-capturing group optional
In python:
import re
result = re.findall(r"(\d+)(?:-(\d+))?years", "1-2years. 3years. 10years.")
# Gives: [('1', '2'), ('3', ''), ('10', '')]
Each tuple in the list contains two elements: The number on the left side of the hyphen, and the number on the right side of the hyphen. Removing the blank elements is quite easy: you loop over each item in result, then you loop over each match in this item and only select it (and convert it to int) if it is not empty.
final_result = [tuple(int(match) for match in item if match) for item in result]
# gives: [(1, 2), (3,), (10,)]
This should work:
import re
st = '1-2years. 3years. 10years.'
result = [tuple(e for e in tup if e)
for tup in re.findall(r'(?:(\d+)-(\d+)|(\d+))years', st)]
# [('1', '2'), ('3',), ('10',)]
The regex will look for either one number, or two separated by a hyphen, immediately prior to the word years. If we give this to re.findall(), it will give us the output [('1', '2', ''), ('', '', '3'), ('', '', '10')], so we also use a quick list comprehension to filter out the empty strings.
Alternately we could use r'(\d+)(?:-(\d+))?years' to basically the same effect, which is closer to what you've already tried.
You can use this pattern: (?:(\d+)-)?(\d+)years
Code:
import re
pattern = r"(?:(\d+)-)?(\d+)years"
text = "1-2years. 3years. 10years."
print([tuple(int(z) for z in x if z) for x in re.findall(pattern, text)])
Output:
[(1, 2), (3,), (10,)]
You only match a single digit as the character class [0-9] is not repeated.
Another option is to match the first digits with an optional part for - and digits.
Then you can split the matches on -
\b(\d+(?:-\d+)?)years\.
\b A word boundary
( Capture group 1 (which will be returned by re.findall)
\d+(?:-\d+)? Match 1+ digits and optionally match - and again 1+ digits
) Close group 1
years\. Match literally with the escaped .
Example
import re
pattern = r"\b(\d+(?:-\d+)?)years\."
s = "1-2years. 3years. 10years."
res = [tuple(v.split('-')) for v in re.findall(pattern, s)]
print(res)
Output
[('1', '2'), ('3',), ('10',)]
Or if a list of lists is also ok instead of tuples
res = [v.split('-') for v in re.findall(pattern, s)]
Output
[['1', '2'], ['3'], ['10']]
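One detail the answers above work around: re.findall reports an unmatched optional group as an empty string, whereas Match.groups() reports it as None. A sketch using re.finditer, with the hypothetical helper name year_ranges, filters on None and converts to int in one pass:

```python
import re

YEARS = re.compile(r"(\d+)(?:-(\d+))?years")

def year_ranges(text):
    """Return one tuple of ints per match: (low, high) or (single,)."""
    return [
        # groups() yields None for the optional second number when it is absent
        tuple(int(g) for g in m.groups() if g is not None)
        for m in YEARS.finditer(text)
    ]

print(year_ranges("1-2years. 3years. 10years."))
# [(1, 2), (3,), (10,)]
```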

How can I get the number of groups to vary depending on the number of lines?

I have this regex: ^:([^:]+):([^:]*) which works as in this regex101 link.
Now, in Python, I have this:
def get_data():
    data = read_mt_file()
    match_fields = re.compile('^:([^:]+):([^:]*)', re.MULTILINE)
    fields = re.findall(match_fields, data)
    return fields
Which, for a file containing the data from regex101, returns:
[('1', 'text\ntext\n\n'), ('20', 'text\n\n'), ('21', 'text\ntext\ntext\n\n'), ('22', ' \n\n'), ('25', 'aa\naa\naaaaa')]
Now, this is ok, but I want to change the regex, so that I can get the number of groups to vary depending on the number of lines. Meaning:
for the first line, now, I get two groups:
1
text\ntext\n\n
I'd like to get instead:
1
((text\n), (text\n\n)) <-- those should somehow be in the same group but separated, each in its own subgroup. Somehow I need to know they both belong to one field but are separate lines.
So, In python, the desired result for that file would be:
[('1', '(text\n), (text\n\n)'), ('20', 'text\n\n'), ('21', '(text\n), (text\n), (text\n\n)'), ('22', ' \n\n'), ('25', '(aa\n), (aa\n), (aaaaa)')]
Is this possible with regex? Could this be achieved with some nice string manipulation instead ?
To do what you want, you'd need another regex.
This is because a repeated capturing group only keeps the last item it matched:
>>> re.match(r'(\d)+', '12345').groups()
('5',)
Instead of using one regex you'll need to use two.
The one that you are using at the moment, and then one to match all the 'sub-groups', using say re.findall.
You can get these sub-groups by simply matching anything that isn't a \n and then any amount of \n.
So you could use a regex such as [^\n]+\n*:
>>> re.findall(r'[^\n]+\n*', 'text\ntext')
['text\n', 'text']
>>> re.findall(r'[^\n]+\n*', 'text\ntext\n\n')
['text\n', 'text\n\n']
>>> re.findall(r'[^\n]+\n*', '')
[]
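The two-regex approach described above could be combined roughly as follows. The names FIELD, LINE and split_fields are made up for this sketch, and the sample string is a shortened stand-in for the file contents.

```python
import re

# First pass: the original pattern, one match per ":tag:body" field.
FIELD = re.compile(r"^:([^:]+):([^:]*)", re.MULTILINE)
# Second pass: one line of the body plus any newlines that follow it.
LINE = re.compile(r"[^\n]+\n*")

def split_fields(data):
    """Return [(tag, (line1, line2, ...)), ...] for every field in data."""
    return [(tag, tuple(LINE.findall(body))) for tag, body in FIELD.findall(data)]

sample = ":1:text\ntext\n\n:20:text\n\n:25:aa\naa\naaaaa"
print(split_fields(sample))
# [('1', ('text\n', 'text\n\n')), ('20', ('text\n\n',)), ('25', ('aa\n', 'aa\n', 'aaaaa'))]
```

Each field keeps its tag, and the nested tuple records which lines belong to it.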
You may use a simple trick: after getting the matches with your regex, run a .+\n* regex over the Group 2 value:
import re
p = re.compile(r'^:([^:]+):([^:]+)', re.MULTILINE)
s = ":1:text\ntext\n\n:20:text\n\n:21:text\ntext\ntext\n\n:22: \n\n:25:aa\naa\naaaaa"
print([[x.group(1)] + re.findall(r".+\n*", x.group(2)) for x in p.finditer(s)])
Here,
p.finditer(s) finds all matches in the string using your regex
[x.group(1)] - a list created from the first group contents
re.findall(r".+\n*", x.group(2)) - fetches individual lines from Group 2 contents (with trailing newlines, 0 or more)
[] + re.findall - combining the lists into 1.
Result is
[['1', 'text\n', 'text\n\n'], ['20', 'text\n\n'], ['21', 'text\n', 'text\n', 'text\n\n'], ['22', ' \n\n'], ['25', 'aa\n', 'aa\n', 'aaaaa']]
Another approach: match all the substrings with your pattern and then use a re.sub to add ), ( between the lines ending with optional newlines:
[(x, "({})".format(re.sub(r".+(?!\n*$)\n+", r"\g<0>), (", y))) for x, y in p.findall(s)]
Result:
[('1', '(text\n), (text\n\n)'), ('20', '(text\n\n)'), ('21', '(text\n), (text\n), (text\n\n)'), ('22', '( \n\n)'), ('25', '(aa\n), (aa\n), (aaaaa)')]
Here:
p.findall(s) - grabs all the matches in the form of a list of tuples containing your capture group contents using your regex
(x, "({})".format(re.sub(r".+(?!\n*$)\n+", r"\g<0>), (", y))) - creates a tuple from Group 1 contents and Group 2 contents that are a bit modified with the re.sub the way described below
.+(?!\n*$)\n+ - pattern that matches 1+ characters other than newline and then 1+ newline symbols if they are not at the end of the string. If they are at the end of the string, there will be no replacement made (to avoid , () at the end). The \g<0> in the replacement string is re-inserting the whole match back into the resulting string and appends ), ( to it.

Python Regular expression not returning as expected

I am having trouble understanding the output of this regular expression. I am using the following regex to find a dates in text:
^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$
It appears to be matching the pattern within text correctly, but I'm confused by the return values.
For this test string:
TestString = "10-20-2015"
It's returning this:
[('10', '20', '', '')]
If I put () around the entire regex, I get this returned:
[('10-20-2015', '10', '20', '', '')]
I would expect it to simply return the full date string, but it appears to be breaking the results up and I don't understand why. Wrapping my regex in () returns the full date string, but it also returns 4 extra values.
How do I make this ONLY match the full date string and not small parts of the string?
from my console:
Python 3.4.2 (default, Oct 8 2014, 10:45:20)
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = "^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"
>>> TestString = "10-20-2015"
>>> re.findall(pattern, TestString, re.I)
[('10', '20', '', '')]
>>> pattern = "(^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$)"
>>> re.findall(pattern, TestString, re.I)
[('10-20-2015', '10', '20', '', '')]
>>>
>>> TestString = "10--2015"
>>> re.findall(pattern, TestString, re.I)
[]
>>> pattern = "^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"
>>> re.findall(pattern, TestString, re.I)
[]
Based on the response, here was my answer: ((?:(?:1[0-2]|0[1-9])-(?:3[01]|[12][0-9]|0[1-9])|(?:3[01]|[12][0-9]|0[1-9])-(?:1[0-2]|0[1-9]))-(?:[0-9]{2})?[0-9]{2})
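Every () is a capturing group: (1[0-2]|0?[1-9]) captures 10, (3[01]|[12][0-9]|0?[1-9]) captures 20, and so on. When the pattern contains groups, re.findall returns the groups rather than the whole match, which is why you get a tuple. When you surround everything in (), that outer group opens before the others, so it becomes group 1 and captures the whole match. To group without capturing, use a non-capturing group, (?:...), instead of (...).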
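To illustrate the difference, here is a small sketch with two variants of the question's pattern: one where every group is non-capturing, so re.findall returns the full match, and one that keeps the capturing groups but reads group(0) from re.finditer instead. The variable names are chosen for this example only.

```python
import re

# Variant 1: every group is non-capturing, so findall returns whole matches.
non_capturing = re.compile(
    r"^(?:(?:1[0-2]|0?[1-9])-(?:3[01]|[12][0-9]|0?[1-9])"
    r"|(?:3[01]|[12][0-9]|0?[1-9])-(?:1[0-2]|0?[1-9]))"
    r"-(?:[0-9]{2})?[0-9]{2}$"
)
whole = non_capturing.findall("10-20-2015")
print(whole)  # ['10-20-2015']

# Variant 2: keep the original capturing groups, but ask each match object
# for group(0), which is always the entire match.
capturing = re.compile(
    r"^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])"
    r"|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))"
    r"-(?:[0-9]{2})?[0-9]{2}$"
)
via_finditer = [m.group(0) for m in capturing.finditer("10-20-2015")]
print(via_finditer)  # ['10-20-2015']
```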
We can do that using one of the most important re functions - search(). This function scans through a string, looking for any location where this RE matches.
import re
text = "10-20-2015"
date_regex = r'(\d{1,2})-(\d{1,2})-(\d{4})'
"""
\d in the above pattern stands for a numerical character [0-9].
The numbers in curly brackets {} indicate how many repetitions are permitted.
Parentheses/round brackets are used for capturing groups so that we can treat
multiple characters as a single unit.
"""
search_date = re.search(date_regex, text)
# for entire match
print(search_date.group())
# also print(search_date.group(0)) can be used
# for the first parenthesized subgroup
print(search_date.group(1))
# for the second parenthesized subgroup
print(search_date.group(2))
# for the third parenthesized subgroup
print(search_date.group(3))
# for a tuple of all matched subgroups
print(search_date.group(1, 2, 3))
Output for each of the print statement mentioned above:
10-20-2015
10
20
2015
('10', '20', '2015')
Hope this answer clears your doubt :-)

python regular expression, extracting set of numbers from a string

How can I get the numbers 24 and 200 from the string "Size:24 Resp_code:200"
using re in Python? I have tried \d+, but then I only get 24.
In addition, I have also tried this:
import re
string2 = " Size:24 Resp_code:200"
regx = "(\d+) Resp_code:(\d+)"
print re.search(regx, string2).group(0)
print re.search(regx, string2).group(1)
Here the output is:
24 Resp_code:200
24
Any advice on how to solve this?
Thanks in advance.
The group 0 contains the whole matched string. Extract group 1, group 2 instead.
>>> string2 = " Size:24 Resp_code:200"
>>> regx = r"(\d+) Resp_code:(\d+)"
>>> match = re.search(regx, string2)
>>> match.group(1), match.group(2)
('24', '200')
>>> match.groups() # to get all groups from 1 to however many groups
('24', '200')
or using re.findall:
>>> re.findall(r'\d+', string2)
['24', '200']
Use:
print(re.search(regx, string2).group(1))  # 24
print(re.search(regx, string2).group(2))  # 200
group(0) returns the whole string matched by your regex, whereas group(1) is the first captured group and group(2) is the second.
Check the doc:
If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned)
You are doing it right, but capture groups are numbered from 1, not 0; group(0) returns the whole match:
>>> re.search(regx, string2).group(1,2)
('24', '200')
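As a variation on the answers above, named groups make the two numbers self-describing. This is a sketch only: the pattern assumes the Size: and Resp_code: labels always appear in that order, and the names LOG_FIELDS and parse_fields are hypothetical.

```python
import re

# Assumes "Size:<digits>" is followed by whitespace and "Resp_code:<digits>".
LOG_FIELDS = re.compile(r"Size:(?P<size>\d+)\s+Resp_code:(?P<resp_code>\d+)")

def parse_fields(line):
    """Return {'size': int, 'resp_code': int}, or None if the line does not match."""
    match = LOG_FIELDS.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}

print(parse_fields(" Size:24 Resp_code:200"))
# {'size': 24, 'resp_code': 200}
```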
