python re match groups

I want to extract some fields from a string, but I am not sure how many there are.
I used a regexp, but there are some problems I do not understand.
For example:
199 -> (199)
199,200 -> (199,200)
300,20,500 -> (300, 20, 500)
I tried, but somehow I cannot get this to work.
I hope someone can give me some advice. I would appreciate it.
The regex I tried:
>>> re.match('^(\d+,)*(\d+)$', '20,59,199,300').groups()
('199,', '300')
# In this, I do not really care about the ',', since I could use .strip(',') to trim it.
I did some googling and tried to use re.findall, but I am not sure how I get this:
>>> re.findall('^(\d+,)*(\d+)$', '20,59,199,300')
[('199,', '300')]
---- update ----
I realize that without the whole story this question can be confusing.
Basically, I want to validate syntax as defined in crontab (or similar).
I created an array, _VALID_EXPRESSION: it is a nested tuple.
(field_1,
field_2,
)
Each field has two tuples:
field_1: ((0, 59), (r'....', r'....'))
         valid_value  valid_format
In my code, it looks like this:
_VALID_EXPRESSION = \
    (((0, 59), (r'^\*$', r'^\*/(\d+)$', r'^(\d+)-(\d+)$',
                r'^(\d+)-(\d+)/(\d+)$', r'^(\d+,)*(\d+)$')),  # second
     ((0, 59), (r'^\*$', r'^\*/(\d+)$', r'^(\d+)-(\d+)$',
                r'^(\d+)-(\d+)/(\d+)$', r'^(\d+,)*(\d+)$')),  # minute
     ....)
In my parse function, all I have to do is extract all the groups and see whether they are within the valid range.
One of the regexes I need has to correctly match a string like '50,200,300' and extract all the numbers in it. (I could use split(), of course, but that would betray my original intention, so I dislike that idea.)
I hope this is helpful.

Why not just use str.split?
numbers = targetstr.split(',')

The simplest solution with a regex is this:
r"(\d+,?)"
You can use findall to get the '300,', '20,', and '500' that you want. Or, if you don't want the commas:
r"(\d+),?"
This matches a group of 1 or more digits, followed by 0 or 1 commas (not in the group).
Either way:
>>> s = '300,20,500'
>>> r = re.compile(r"(\d+),?")
>>> r.findall(s)
['300', '20', '500']
However, as Sahil Grover points out, if those are your input strings, this is equivalent to just calling s.split(','). If your input strings might have non-digits, then this will ensure you only match digit strings, but even that would probably be simpler as filter(str.isdigit, s.split(',')).
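For instance, with a hypothetical input containing a non-numeric field, both approaches keep only the digit strings:

```python
import re

# Hypothetical dirty input with a non-numeric field in the middle
s2 = '300,abc,500'

print(re.findall(r"(\d+),?", s2))                 # ['300', '500']
print([x for x in s2.split(',') if x.isdigit()])  # ['300', '500']
```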
If you want a tuple of ints instead of a list of strs:
>>> tuple(map(int, r.findall(s)))
(300, 20, 500)
If you find comprehensions/generator expressions easier to read than map/filter calls:
>>> tuple(int(x) for x in r.findall(s))
(300, 20, 500)
Or, more simply:
>>> tuple(int(x) for x in s.split(',') if x.isdigit())
(300, 20, 500)
And if you want the string (300, 20, 500), there's no need for a regex at all: calling repr on the tuple gives exactly that:
>>> repr(tuple(map(int, r.findall(s))))
'(300, 20, 500)'
(Simply wrapping the original string, '(' + s + ')', would give '(300,20,500)', without the spaces.)
Your original regex:
'^(\d+,)*(\d+)$'
… is going to return exactly two groups, because you have exactly two groups in the pattern. And, since you're explicitly wrapping it in ^ and $, it has to match the entire string, so findall isn't going to help you here—it's going to find the exact same one match (of two groups) as match.
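To see the repeated-group behavior directly, and one common workaround (a sketch: validate with the anchored pattern, then extract with a simpler one):

```python
import re

s = '20,59,199,300'

# A repeated group only keeps its *last* repetition:
print(re.match(r'^(\d+,)*(\d+)$', s).groups())  # ('199,', '300')

# Workaround: validate the shape first, then pull out all the numbers.
if re.match(r'^(\d+,)*(\d+)$', s):
    numbers = re.findall(r'\d+', s)
print(numbers)  # ['20', '59', '199', '300']
```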


How to match a string pattern in Python

I am looking to match a pattern such as
(u'-<21 characters>', N),
where the 21 characters are 0-9, a-z, A-Z plus characters like ~!##$%^&*()_ ...
and N is a number from 1 to 99.
I am trying to retrieve the 21 characters as well as the number N using the re.match method so I can use them later on, but I do not know how, and the documentation is hard to follow. How do I do this?
Here is one program that might do what you want.
Note the use of parentheses () to isolate the data you are looking for, and the use of m.group(1) and m.group(2) to retrieve those saved items.
Note also the use of re.search() instead of re.match(). re.match() must match the data from the very beginning of the string; re.search(), on the other hand, finds the first match regardless of its location in the string. (Also consider re.findall() if a string might have multiple matches.)
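A quick sketch of the difference, using one line of the sample data:

```python
import re

line = "prefix (u'--UE_y6auTgq3FXlvUMkbw', 10),"

# re.match anchors at position 0, so it fails on this line:
print(re.match(r"'(.+)', (\d+)", line))            # None
# re.search scans the whole string and finds the match:
print(re.search(r"'(.+)', (\d+)", line).group(2))  # 10
```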
Don't be confused by my use of .splitlines(); it is just for the sake of the sample program. You could equally well do with open('foo.txt') as data: / for line in data:.
import re
data = '''
(u'--UE_y6auTgq3FXlvUMkbw', 10),
(u'--XBxRlD92RaV6TyUnP8Ow', 1),
(u'--sSW-WY3vyASh_eVPGUAw', 2),
(u'-0GkcDiIgVm0XzDZC8RFOg', 9),
(u'-0OlcD1Ngv3yHXZE6KDlnw', 1),
(u'-0QBrNvhrPQCaeo7mTo0zQ', 1)
'''
data = data.splitlines()
for line in data:
    m = re.search(r"'(.+)', (\d+)", line)
    if m:
        chars = m.group(1)
        N = int(m.group(2))
        print("I found a match!: {}, {}".format(chars, N))

How to escape null characters, i.e. [''], while using the regex split function? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts, after the second underscore '_' and before '.txt'. So I used the following Python regex split:
time_info = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert, but maybe you could just remove the empty strings from your list?
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
time_info = list(filter(None, str_list))
(In Python 3, filter returns an iterator, hence the list() call.)
Don't use re.split(); use the groups() method of a regex match object.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
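For instance, with the hypothetical group names start and end:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'
m = re.search(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.', f)
print(m.groupdict())  # {'start': '20111007T084734', 'end': '20111008T023142'}
```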
If the timestamps are always after the second _ then you can use str.split and str.strip:
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_", 2)[-1].split("-")
['20111007T084734', '20111008T023142']
(Beware: strip(".txt") removes any of the characters '.', 't', 'x' from both ends rather than a literal suffix. It happens to work here, but strs[:-4] or, in Python 3.9+, strs.removesuffix('.txt') is safer.)
Since this came up on Google, and for completeness: try using re.findall as an alternative!
This does require a little rethinking, but it still returns a list of matches, like split does. That makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). But it does solve the underlying issue: your list of matches suddenly has zero-length strings in it, and you don't want that.
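For this file name, matching the timestamps directly (a sketch of the findall approach) looks like:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Match what you want instead of splitting on what you don't.
print(re.findall(r'\d{8}T\d{6}', f))  # ['20111007T084734', '20111008T023142']
```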
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

Python - how to substitute a substring using regex with n occurrences

I have a string with many occurrences of a single pattern, like
a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
and I have another string like
b = 'rerTTTytu'
I want to substitute the entire second string, using the 'QQQ' and the 'TTT' as reference points, and in this case I want to find 3 different results:
'ererTTTytuohnQQQjkhjhnmQQQlkj'
'eresQQQutnrerTTTytujhnmQQQlkj'
'eresQQQutnohnQQQjkhjrerTTTytu'
I've tried using re.sub:
re.sub(r'\w{3}QQQ\w{3}', b, a)
but I obtain only the first result, and I don't know how to get the other two.
Edit: as requested, the two characters surrounding 'QQQ' are now replaced as well.
I don't know if this is the most elegant or simplest solution for the problem, but it works:
import re

a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
b = 'rerTTTytu'

# Find the start of every occurrence of ??QQQ?? in a (? = any non-whitespace character)
matches = [x.start() for x in re.finditer(r'\S{2}QQQ\S{2}', a)]
# Replace one occurrence of ??QQQ?? with b at a time
results = [a[:idx] + re.sub(r'\S{2}QQQ\S{2}', b, a[idx:], 1) for idx in matches]
print(results)
Output
['errerTTTytunohnQQQjkhjhnmQQQlkj',
'eresQQQutnorerTTTytuhjhnmQQQlkj',
'eresQQQutnohnQQQjkhjhrerTTTytuj']
Since you didn't specify the output format, I just put it in a list.
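An alternative sketch that avoids running re.sub a second time, splicing b in at each match's span directly:

```python
import re

a = 'eresQQQutnohnQQQjkhjhnmQQQlkj'
b = 'rerTTTytu'

# For each match, keep the text before and after it and insert b in between.
results = [a[:m.start()] + b + a[m.end():]
           for m in re.finditer(r'\S{2}QQQ\S{2}', a)]
print(results)
```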

Find first matching regex from list of regexes

Let's say I have a list of regexes like such (this is a simple example, the real code has more complex regexes):
regs = [r'apple', 'strawberry', r'pear', r'.*berry', r'fruit: [a-z]*']
I want to exactly match one of the regexes above (so ^regex$) and return its index. Additionally, I want to match the leftmost regex in the list, so find('strawberry') should return 1 while find('blueberry') should return 3. I'm going to re-use the same set of regexes a lot, so precomputation is fine.
This is what I've coded, but it feels bad. The regex engine should know which alternative matched, and this feels terribly inefficient (keep in mind that the example above is simplified; the real regexes are more complicated and more numerous):
import re

regs_compiled = [re.compile(reg) for reg in regs]
# Note the outer (?:...) group, so that ^ and $ apply to every alternative
regs_combined = re.compile('^(?:' +
                           '|'.join('(?:{})'.format(reg) for reg in regs) +
                           ')$')

def find(s):
    if regs_combined.match(s):
        for i, reg in enumerate(regs_compiled):
            if reg.fullmatch(s):
                return i
    return -1
Is there a way to find out which subexpression(s) were used to match the regex without looping explicitly?
The only way to figure out which subexpression of the regular expression matched the string would be to use capturing groups for every one and then check which group is not None. But this would require that no subexpression uses capturing groups on its own.
E.g.
>>> regs_combined = re.compile('^(?:' +
...                            '|'.join('({})'.format(reg) for reg in regs) +
...                            ')$')
>>> m = re.match(regs_combined, 'strawberry')
>>> m.groups()
(None, 'strawberry', None, None, None)
>>> m.lastindex - 1
1
Other than that, the standard regular expression implementation does not provide further information. You could of course build your own engine that exposes it, but apart from your very special use case it is difficult to make this work in other situations, which is probably why existing implementations do not provide it.
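Putting the answer together as a runnable sketch (note the outer non-capturing group, so that ^ and $ apply to every alternative rather than only the first and last):

```python
import re

regs = [r'apple', r'strawberry', r'pear', r'.*berry', r'fruit: [a-z]*']

# Wrap each regex in a capturing group; lastindex tells us which one matched.
combined = re.compile('^(?:' + '|'.join('({})'.format(r) for r in regs) + ')$')

def find(s):
    m = combined.match(s)
    return m.lastindex - 1 if m else -1

print(find('strawberry'))  # 1
print(find('blueberry'))   # 3
print(find('vegetable'))   # -1
```

As the answer notes, this only works if no subexpression contains capturing groups of its own.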

Regex to Split 1st Colon

I have a time in ISO 8601 format (2009-11-19T19:55:00) paired with a name, commence. I'm trying to parse this into two parts. I'm currently up to here:
import re
sColon = re.compile('[:]')
aString = sColon.split("commence:2009-11-19T19:55:00")
Obviously this returns:
>>> aString
['commence','2009-11-19T19','55','00']
What I'd like it to return is this:
>>>aString
['commence','2009-11-19T19:55:00']
How would I go about doing this in the original creation of sColon? Also, do you recommend any regular-expression links or books that you have found useful? I can see myself needing them in the future!
EDIT:
To clarify: I need a regular expression that splits only at the very first instance of ':'. Is this possible? The text (commence) before the colon can change, yes...
>>> first, colon, rest = "commence:2009-11-19T19:55:00".partition(':')
>>> (first, colon, rest)
('commence', ':', '2009-11-19T19:55:00')
You could pass a maximum split count to the split function:
>>> "commence:2009-11-19T19:55:00".split(":",1)
['commence', '2009-11-19T19:55:00']
Official Docs
S.split([sep [,maxsplit]]) -> list of strings
Return a list of the words in the string S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.
Looks like you need .index(':'), then a slice?
OP, don't do unnecessary work. A regex is not needed for what you are doing here. Python has very good string-manipulation methods that you can use. All you need is split() and slicing; those are the very basics of Python.
>>> "commence:2009-11-19T19:55:00".split(":",1)
['commence', '2009-11-19T19:55:00']
>>>
