Split a string storing the indices of every substring - python

Python has a handy built-in .split() that returns a list of substrings. Is there a built-in, or at least not too messy, way to split on multiple delimiters and automatically get the coordinates of each substring? Something like this:
"abc? !cde".some_smart_split("!?") -> [("abc", 0, 2), (" ", 4, 4), ("cde", 6, 8)]
Of course, I can write some naive code myself. But my use case is much more complicated, so it would be great to find something concise.

Using re.finditer:
>>> import re
>>> [(match.group(0), match.start(), match.end())
...  for match in re.finditer(r'[^!?]+', 'abc? !cde')]
[('abc', 0, 3), (' ', 4, 5), ('cde', 6, 9)]
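The pattern [^!?]+ matches runs of characters that are neither ! nor ?.
match.group(0) returns the matched string.
match.start() and match.end() return the indices of the matched part. As with slicing, the end index is exclusive, which is why the first tuple is ('abc', 0, 3) rather than the ("abc", 0, 2) asked for in the question.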
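
The same idea can be wrapped in a small helper. This is only a minimal sketch: the name split_with_spans and its delimiters parameter are hypothetical, and it assumes delimiters is a non-empty string of single-character delimiters. re.escape is applied so that characters such as ] or ^ stay literal inside the character class.

```python
import re

def split_with_spans(text, delimiters):
    """Return (substring, start, end) for each run of non-delimiter characters.

    `end` is exclusive, following the Python slicing convention.
    Assumes `delimiters` is a non-empty string of single characters.
    """
    # Escape the delimiters so that ']' or '^' are treated literally
    # inside the negated character class.
    pattern = '[^' + re.escape(delimiters) + ']+'
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(pattern, text)]

print(split_with_spans('abc? !cde', '!?'))
# [('abc', 0, 3), (' ', 4, 5), ('cde', 6, 9)]
```

If inclusive end indices are needed, as in the question, subtract 1 from each end value.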

Related

Match a completed string with regex pattern

I have a string:
s = '`re.``search`(*pattern*, *string*, *flags=0*)'
Using sub, it is easy to produce the result I want:
In [100]: re.sub(r'[`*]','',s)
Out[100]: 're.search(pattern, string, flags=0)'
I'd like to refactor it by writing a whole regex pattern instead of substituting.
In [101]: re.search(r'[^`*]+',s)
Out[101]: <_sre.SRE_Match object; span=(1, 4), match='re.'>
It stops at the first match, 're.', while I intend to retrieve the complete string.
How can I accomplish this?
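
A single match is always one contiguous slice of the input, so one re.search call cannot return text with the backticks and asterisks skipped. One possible approach, sketched here rather than offered as the definitive answer, is to collect every run of allowed characters with re.findall and join the pieces:

```python
import re

s = '`re.``search`(*pattern*, *string*, *flags=0*)'

# findall returns every non-overlapping run of characters
# that are neither a backtick nor an asterisk.
pieces = re.findall(r'[^`*]+', s)

# Joining the runs gives the same text that re.sub produced.
cleaned = ''.join(pieces)
print(cleaned)
# re.search(pattern, string, flags=0)
```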

How to match a string pattern in python

I am looking to match a pattern such as
(u'-<21 characters>', N),
21 characters from 0-9, a-z, A-Z plus characters like ~!##$%^&*()_ ...
N is a number from 1 to 99
I am trying to retrieve the 21 characters as well as the number N so that I can use them later. I would like to use the re.match method, but I do not know how, and I cannot follow the documentation. How do I do so?
Here is one program that might do what you want.
Note the use of parentheses () to isolate the data you are looking for. Note also the use of m.group(1), m.group(2) to retrieve those saved items.
Note also the use of re.search() instead of re.match(). re.match() only matches at the very beginning of the string. re.search(), on the other hand, will find the first match, regardless of its location in the string. (But also consider using re.findall() if a string might have multiple matches.)
Don't be confused by my use of .splitlines(), it is just for the sake of the sample program. You could equally well do data = open('foo.txt') / for line in data:.
import re
data = '''
(u'--UE_y6auTgq3FXlvUMkbw', 10),
(u'--XBxRlD92RaV6TyUnP8Ow', 1),
(u'--sSW-WY3vyASh_eVPGUAw', 2),
(u'-0GkcDiIgVm0XzDZC8RFOg', 9),
(u'-0OlcD1Ngv3yHXZE6KDlnw', 1),
(u'-0QBrNvhrPQCaeo7mTo0zQ', 1)
'''
data = data.splitlines()
for line in data:
    m = re.search(r"'(.+)', (\d+)", line)
    if m:
        chars = m.group(1)
        N = int(m.group(2))
        print("I found a match!: {}, {}".format(chars, N))
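
For the re.findall() variant mentioned above, a stricter pattern is sketched below. It assumes the format from the question: a literal hyphen, exactly 21 further characters, then a number from 1 to 99. The names text, pattern and pairs are illustrative only. Unlike the program above, the captured string here does not include the leading hyphen.

```python
import re

text = "(u'--UE_y6auTgq3FXlvUMkbw', 10),\n(u'-0GkcDiIgVm0XzDZC8RFOg', 9)"

# \(u'-      literal opening, including the leading hyphen
# (.{21})    exactly 21 characters of any kind (group 1)
# ([1-9]\d?) a number from 1 to 99 without leading zeros (group 2)
pattern = re.compile(r"\(u'-(.{21})', ([1-9]\d?)\)")

# With two groups in the pattern, findall returns a list of 2-tuples.
pairs = [(chars, int(n)) for chars, n in pattern.findall(text)]
print(pairs)
# [('-UE_y6auTgq3FXlvUMkbw', 10), ('0GkcDiIgVm0XzDZC8RFOg', 9)]
```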

Find a word in a string python using regex or other methods

I am trying to go through an array of words and check whether each one exists in a string. I understand there are many options for doing this, such as re.search, but I need to distinguish between some words (e.g. Java vs Javascript).
An example:
import re
s = 'Some types (python, c++, java, javascript) are examples of programming.'
words = ['python', 'java', 'c++', 'javascript', 'programming']
for w in words:
    p = re.search(w, s)
    print(p)
>><_sre.SRE_Match object; span=(12, 18), match='python'>
>><_sre.SRE_Match object; span=(20, 24), match='java'>
>><_sre.SRE_Match object; span=(20, 30), match='javascript'>
>><_sre.SRE_Match object; span=(48, 59), match='programming'>
The above works to an extent but matches Java with Javascript.
EDIT: Here was my solution
for w in words:
    regexPart1 = r"\s"
    regexPart2 = r"(?:!+|,|\.|\·|;|:|\(|\)|\"|\?+)?\s"
    p = re.compile(regexPart1 + re.escape(w) + regexPart2, re.IGNORECASE)
    result = p.search(s)
You want to add word boundary marks to your regular expressions, say r'\bjavascript\b' in place of merely 'javascript'. (Note also that + should be escaped in c++.)
Also, iterating over the words and searching for each one separately forgoes the potential efficiency of a single compiled regexp. It may be better to combine the regexps into one:
w = r'\b(?:python|java|c\+\+|javascript|programming)\b'
re.search(w,s)
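
One caveat with \b: it only matches between a word character and a non-word character, so a trailing \b after c\+\+ fails when the next character is a comma or a space. A possible workaround, sketched below with hypothetical variable names, is to replace \b with the lookarounds (?<!\w) and (?!\w), which only require that no word character touches the match. re.escape is used so the words do not have to be escaped by hand.

```python
import re

s = 'Some types (python, c++, java, javascript) are examples of programming.'
words = ['python', 'java', 'c++', 'javascript', 'programming']

# Escape each word so that '+' in 'c++' is matched literally.
alternatives = '|'.join(re.escape(w) for w in words)

# (?<!\w) and (?!\w) assert that no word character sits directly
# before or after the match, so 'java' inside 'javascript' is rejected.
pattern = re.compile(r'(?<!\w)(?:' + alternatives + r')(?!\w)', re.IGNORECASE)

found = [m.group(0) for m in pattern.finditer(s)]
print(found)
# ['python', 'c++', 'java', 'javascript', 'programming']
```

Using finditer or findall instead of search returns every word that occurs, not just the first one.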

Why do Python regex spans extend one place past the actual match?

Looking at the spans returned from my regex matches, I noticed that they always return one past the actual match; e.g. in the example at Regular Expression HOWTO
>>> print(p.match('::: message'))
None
>>> m = p.search('::: message'); print(m)
<_sre.SRE_Match object at 0x...>
>>> m.group()
'message'
>>> m.span()
(4, 11)
The resulting span in the example is (4, 11) vs. the actual location (4, 10). This causes some trouble for me as the left-hand and right-hand boundaries have different meanings and I need to compare the relative positions of the spans.
Is there a good reason for this or can I go ahead and modify the spans to my liking by subtracting one from the right boundary?
Because in Python the end value of slices and ranges is always exclusive, and '::: message'[4:11] reflects the actual matched text:
>>> '::: message'[4:11]
'message'
Thus, you can use the MatchObject.span() results to slice the matched text from the original string:
>>> import re
>>> p = re.compile('[a-z]+')  # the pattern used in the Regular Expression HOWTO example
>>> s = '::: message'
>>> match = p.search(s)
>>> match.span()
(4, 11)
>>> s[slice(*match.span())]
'message'
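
For comparing the relative positions of spans, the half-open convention tends to simplify the arithmetic rather than complicate it. The sketch below uses an illustrative input string with two matches; the variable names are assumptions, not part of the question.

```python
import re

p = re.compile('[a-z]+')
s = '::: message sent'

# Two matches: 'message' and 'sent'.
first, second = [m.span() for m in p.finditer(s)]
print(first, second)
# (4, 11) (12, 16)

start, end = first
length = end - start            # length of the match, no +1 needed
gap = second[0] - first[1]      # characters between the two matches
last_index = end - 1            # inclusive index of the last matched character
print(length, gap, last_index)
# 7 1 10
```

If inclusive right boundaries are really needed, subtracting one from the end value is safe for non-empty matches; an empty match has start == end, so the subtraction would place the end before the start.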

python re match groups

I want to extract some fields from a string, but I am not sure how many there are.
I used a regexp, but I ran into some problems that I do not understand.
For example:
199 -> (199)
199,200 -> (199,200)
300,20,500 -> (300,20, 500)
I tried, but I cannot get this to work.
I hope someone can give me some advice. I would appreciate it.
The regex I tried:
>>> re.match('^(\d+,)*(\d+)$', '20,59,199,300').groups()
('199,', '300')
Here I do not really care about the ',' since I could use .strip(',') to trim it.
I did some googling and tried re.findall, but I do not understand why I get this:
>>> re.findall('^(\d+,)*(\d+)$', '20,59,199,300')
[('199,', '300')]
Update:
I realize that without the whole story this question can be confusing.
Basically, I want to validate the syntax defined in crontab (or similar).
I created an array, _VALID_EXPRESSION, made of nested tuples:
(field_1,
field_2,
)
Each field holds two tuples, the valid value range and the valid formats:
field_1: ((0,59), (r'....', r'....'))
          valid_value  valid_format
In my code, it looks like this:
_VALID_EXPRESSION = \
    (((0, 59), (r'^\*$', r'^\*/(\d+)$', r'^(\d+)-(\d+)$',
                r'^(\d+)-(\d+)/(\d+)$', r'^(\d+,)*(\d+)$')),  # second
     ((0, 59), (r'^\*$', r'^\*\/(\d+)$', r'^(\d+)-(\d+)$',
                r'^(\d+)-(\d+)/(\d+)$', r'^(\d+,)*(\d+)$')),  # minute
     .... )
In my parse function, all I have to do is extract all the groups and check that they are within the valid value range.
One of the regexps I need must correctly match a string such as '50,200,300' and extract all the numbers from it. (I could use split() of course, but that would defeat my original intention, so I dislike that idea.)
I hope this helps.
Why not just use a string.split?
numbers = targetstr.split(',')
The simplest solution with a regex is this:
r"(\d+,?)"
You can use findall to get the '300,', '20,', and '500' that you want. Or, if you don't want the commas:
r"(\d+),?"
This matches a group of 1 or more digits, followed by 0 or 1 commas (not in the group).
Either way:
>>> s = '300,20,500'
>>> r = re.compile(r"(\d+),?")
>>> r.findall(s)
['300', '20', '500']
However, as Sahil Grover points out, if those are your input strings, this is equivalent to just calling s.split(','). If your input strings might have non-digits, then this will ensure you only match digit strings, but even that would probably be simpler as filter(str.isdigit, s.split(',')).
If you want a tuple of ints instead of a list of strs:
>>> tuple(map(int, r.findall(s)))
(300, 20, 500)
If you find comprehensions/generator expressions easier to read than map/filter calls:
>>> tuple(int(x) for x in r.findall(s))
(300, 20, 500)
Or, more simply:
>>> tuple(int(x) for x in s.split(',') if x.isdigit())
(300, 20, 500)
And if you want a parenthesized string such as (300,20,500), calling repr on the tuple gives you '(300, 20, 500)' with spaces, but there's a much easier way to wrap the original input:
>>> '(' + s + ')'
'(300,20,500)'
Your original regex:
'^(\d+,)*(\d+)$'
… is going to return exactly two groups, because you have exactly two groups in the pattern. And, since you're explicitly wrapping it in ^ and $, it has to match the entire string, so findall isn't going to help you here—it's going to find the exact same one match (of two groups) as match.
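
To keep the original regex for validation and still get every number, one option is to use it only as a format check and extract the numbers in a second step. The sketch below is an assumption-laden illustration: LIST_FORMAT, parse_number_list, and the low/high parameters are hypothetical names, not part of the question's code.

```python
import re

# The pattern from the question, used here only to validate the format.
LIST_FORMAT = re.compile(r'^(\d+,)*(\d+)$')

def parse_number_list(field, low, high):
    """Return a tuple of ints if `field` is a valid comma-separated list
    with every value in [low, high]; otherwise return None."""
    if not LIST_FORMAT.match(field):
        return None
    # A repeated group keeps only its last repetition, so the numbers
    # are extracted separately once the format is known to be valid.
    numbers = tuple(int(x) for x in re.findall(r'\d+', field))
    if all(low <= n <= high for n in numbers):
        return numbers
    return None

print(LIST_FORMAT.match('20,59,199,300').groups())
# ('199,', '300')
print(parse_number_list('20,59,199,300', 0, 300))
# (20, 59, 199, 300)
print(parse_number_list('50,200,300', 0, 59))
# None
```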
