Python: Getting text of a Regex match - python

I have a regex match object in Python. I want to get the text it matched. Say if the pattern is '1.3', and the search string is 'abc123xyz', I want to get '123'. How can I do that?
I know I can use match.string[match.start():match.end()], but I find that to be quite cumbersome (and in some cases wasteful) for such a basic query.
Is there a simpler way?

You can simply use the match object's group function, like:
match = re.search(r"1.3", "abc123xyz")
if match:
    doSomethingWith(match.group(0))
to get the entire match. EDIT: as thg435 points out, you can also omit the 0 and just call match.group().
Additional note: if your pattern contains parentheses, you can also get these submatches by passing 1, 2 and so on to group().
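A minimal sketch of the calls mentioned above. The second pattern, with parentheses, is a made-up variation on the question's example and is only there to show how submatches are retrieved.

```python
import re

# The question's example: pattern '1.3' searched in 'abc123xyz'.
match = re.search(r"1.3", "abc123xyz")
whole = match.group() if match else None   # same as match.group(0)

# Hypothetical variation with parentheses, to illustrate submatches.
sub = re.search(r"(1)(.)(3)", "abc123xyz")
middle = sub.group(2) if sub else None     # text matched by the second group

print(whole, middle)
```

Checking the match object before calling group() avoids an AttributeError when the pattern is not found.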

With match(), the pattern has to cover the string from its start, so put the part you want inside "()" to be able to get it as a group:
>>> var = 'abc123xyz'
>>> exp = re.compile(".*(1.3).*")
>>> exp.match(var)
<_sre.SRE_Match object at 0x691738>
>>> exp.match(var).groups()
('123',)
>>> exp.match(var).group(0)
'abc123xyz'
>>> exp.match(var).group(1)
'123'
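Without the surrounding .* the same call returns None, because match() only matches at the beginning of the string (search() would find the pattern anywhere):
>>> var = 'abc123xyz'
>>> exp = re.compile("1.3")
>>> print(exp.match(var))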
None
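For comparison, a short sketch of how match() and search() behave on the same input. It reuses the variable names from the session above; nothing else is assumed.

```python
import re

var = "abc123xyz"
exp = re.compile(r"1.3")

at_start = exp.match(var)    # match() only tries position 0, so this is None
anywhere = exp.search(var)   # search() scans the whole string

print(at_start)
print(anywhere.group())
```

So when the pattern can appear anywhere in the string, search() with group() is enough and no wrapping .* or extra parentheses are needed.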

Related

How to get String Before Last occurrence of substring?

I want to get String before last occurrence of my given sub string.
My String was,
path =
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov
My substring is 1001-1010, which occurs twice. All I want is the string before its last occurrence.
Note: my substring is dynamic, with varying padding, but it consists only of numbers.
I want,
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v
I have done this using regex and slicing,
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> q = re.findall("\d*-\d*",p)
>>> q[-1].join(p.split(q[-1])[:-1])
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v'
>>>
Is there any better way to do this purely with regex?
Please note I have tried many other answers, e.g.:
regular expression to match everything until the last occurrence of /
Regex Last occurrence?
I got the answer by using regex with slicing, but I want to achieve it by using regex alone.
Why use a regex? Just use the built-in string methods:
path = "D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov"
index = path.rfind("1001-1010")
print(path[:index])
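One caveat with this approach: rfind() returns -1 when the substring is missing, and slicing with -1 silently drops the last character. A small sketch that guards against that follows; the helper name before_last and the shortened sample path are made up for illustration.

```python
def before_last(text, needle):
    """Return the part of text before the last occurrence of needle.

    Hypothetical helper; returns text unchanged when needle is absent.
    """
    index = text.rfind(needle)
    # rfind() returns -1 when nothing is found, so handle that explicitly.
    return text if index == -1 else text[:index]

sample = "a/p1001-1010/b_v1001-1010.mov"   # shortened, made-up path
print(before_last(sample, "1001-1010"))
print(before_last(sample, "9999"))
```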
You can use a simple greedy match and a capture group:
(.*)1001-1010
Your match is in capture group #1
Since .* is greedy by nature, it will match as much as possible before matching your keyword 1001-1010, so the keyword that gets matched is its last occurrence.
As per comments below if keyword is not a static string then you may use this regex:
r'(.*\D)\d+-\d+'
Python Code:
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> print (re.findall(r'(.*\D)\d+-\d+', p))
['D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v']
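If the keyword is only known at run time, the greedy capture can be built from it. This is a sketch under that assumption; the function name prefix_before_last and the shortened sample string are hypothetical, and re.escape() is used so that characters in the keyword are matched literally.

```python
import re

def prefix_before_last(text, keyword):
    """Return everything before the last occurrence of keyword, or None."""
    # Greedy (.*) pushes the keyword match as far right as possible.
    m = re.search(r"(.*)" + re.escape(keyword), text)
    return m.group(1) if m else None

sample = "x_p1001-1010/y_v1001-1010.mov"   # shortened, made-up path
print(prefix_before_last(sample, "1001-1010"))
```

Note that . does not match newlines by default, which is fine for a single path but matters for multi-line input.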
Thanks @anubhava,
My first regex was,
.*(\d*-\d*)\/
Now I have corrected mine:
.*(\d*-\d*)
or
(.*)(\d*-\d*)
which gives me,
>>> q = re.search(r'.+(\d*-\d*)', p)
>>> q.group()
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010'
>>>
(.*\D)\d+-\d+
This gives me exactly what I want:
>>> q = re.search(r'(.*\D)\d+-\d+', p)
>>> q.groups()
('D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v',)
>>>

How to escape null characters, i.e. [' '], while using regex split function? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
time_info = list(filter(None, str_list))  # list() is needed on Python 3, where filter() is lazy
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
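A short sketch of the named-group variant, using the pattern suggested above and one of the file names from the question. The group names groupA and groupB come from that pattern; the guard for a failed match is an addition.

```python
import re

f = "000014_L_20111007T084734-20111008T023142.txt"
m = re.search(r"[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.", f)

# groupdict() maps each group name to the text it captured.
time_info = m.groupdict() if m else {}
print(time_info)
```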
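If the timestamps are always after the second _ then you can use str.split. Note that str.strip(".txt") removes any of the characters ".", "t" and "x" from both ends rather than the suffix as a whole, so the extension is cut off with rsplit here instead:
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.rsplit(".", 1)[0].split("_", 2)[-1].split("-")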
['20111007T084734', '20111008T023142']
Since this came up on google and for completeness, try using re.findall as an alternative!
This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer, and it doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
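As a sketch of the re.findall idea: instead of describing the separators, describe the pieces you want. The timestamp shape used here (8 digits, "T", 6 digits) is an assumption read off the sample file names, not something stated in the question.

```python
import re

f = "000014_L_20111007T084734-20111008T023142.txt"

# Assumed timestamp shape: 8 digits, 'T', 6 digits.
time_info = re.findall(r"\d{8}T\d{6}", f)
print(time_info)
```

Because findall() only returns what the pattern matched, no empty strings appear in the result.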
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

Regex Expression not matching correctly

I'm tackling a python challenge problem to find a block of text in the format xXXXxXXXx (lower vs upper case, not all X's) in a chunk like this:
jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn
I have tested the following RegEx and found it correctly matches what I am looking for from this site (http://www.regexr.com/):
'([a-z])([A-Z]){3}([a-z])([A-Z]){3}([a-z])'
However, when I try to match this expression to the block of text, it just returns the entire string:
In [1]: import re
In [2]: example = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
In [3]: expression = re.compile(r'([a-z])([A-Z]){3}([a-z])([A-Z]){3}([a-z])')
In [4]: found = expression.search(example)
In [5]: print found.string
jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn
Any ideas? Is my expression incorrect? Also, if there is a simpler way to represent that expression, feel free to let me know. I'm fairly new to RegEx.
You need to return the match group instead of the string attribute.
>>> import re
>>> s = 'jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn'
>>> rgx = re.compile(r'[a-z][A-Z]{3}[a-z][A-Z]{3}[a-z]')
>>> found = rgx.search(s).group()
>>> print(found)
nJDKoJIWh
The string attribute always returns the string passed as input to the match. This is clearly documented:
string
The string passed to match() or search().
The problem has nothing to do with the matching, you're just grabbing the wrong thing from the match object. Use match.group(0) (or match.group()).
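To make the difference concrete, here is a small sketch using the question's sample text. It shows the string attribute next to group() and span(); only standard match-object members are used.

```python
import re

example = "jdskvSJNDfbSJneSfnJDKoJIWhsjnfakjn"
found = re.search(r"[a-z][A-Z]{3}[a-z][A-Z]{3}[a-z]", example)

whole_input = found.string   # the original string passed to search()
matched = found.group()      # only the text the pattern matched
start, end = found.span()    # where the match sits inside the input

print(matched, start, end)
```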
Based on xXXXxXXXx, if you want runs of three uppercase letters with a single lowercase letter between them, this is what you want:
([a-z])(([A-Z]){3}([a-z]))+
You can also get the matched text from your search call with group():
print(expression.search(example).group(0))

Python Regex searching

I want to use regex to search in a file for this expression:
time:<float> s
I only want to get the float number.
I'm learning about regex, and this is what I did:
astr = 'lalala time:1.5 s\n'
p = re.compile(r'time:(\d+).*(\d+)')
m = p.search(astr)
Well, I get time:1.5 from m.group(0)
How can I directly just get 1.5 ?
I'm including some extra Python-specific material since you said you're learning regex. As already mentioned, the simplest regex for this would certainly be \d+\.\d+ in various commands as described below.
Something that threw me off with python initially was getting my head around the return types of various re methods and when to use group() vs. groups().
There are several methods you might use:
re.match()
re.search()
re.findall()
match() will only return an object if the pattern is found at the beginning of the string.
search() will find the first occurrence of the pattern and stop.
findall() will find everything in the string.
The return type for match() and search() is a match object (re.Match in recent Python 3 versions), or None if a match isn't found. However, the return type for findall() is a list. These different return types obviously have ramifications for how you get the values out of your match.
Both match and search expose the group() and groups() methods for retrieving your matches. But when using findall you'll want to iterate through your list or pull a value out by index. So using findall:
>>> import re
>>> search_me = '123abc'
>>> easy = re.compile(r'123')
>>> matches = easy.findall(search_me)
>>> for match in matches: print(match)
...
123
If you're using search() or match(), you'll want to use .group() or groups() to retrieve your match depending on how you've set up your regular expression.
From the documentation, "The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are."
Therefore, if you have no groups in your regex, as shown in the following example, groups() gives you back an empty tuple:
>>> import re
>>> search_me = '123abc'
>>> easy = re.compile(r'123')
>>> matches = easy.search(search_me)
>>> print(matches.groups())
()
Adding a "group" to your regular expression enables you to use this:
>>> import re
>>> search_me = '123abc'
>>> easy = re.compile(r'(123)')
>>> matches = easy.search(search_me)
>>> print(matches.groups())
('123',)
You don't have to specify groups in your regex. group(0) or group() will return the entire match even if you don't have anything in parentheses in your expression, because group() defaults to group(0).
>>> import re
>>> search_me = '123abc'
>>> easy = re.compile(r'123')
>>> matches = easy.search(search_me)
>>> print(matches.group(0))
123
If you are using parentheses you can use group() to retrieve specific groups and subgroups.
>>> import re
>>> search_me = '123abc'
>>> easy = re.compile(r'((1)(2)(3))')
>>> matches = easy.search(search_me)
>>> print(matches.group(1))
123
>>> print(matches.group(2))
1
>>> print(matches.group(3))
2
>>> print(matches.group(4))
3
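The numbering in the nested example follows the order of the opening parentheses. A brief sketch that lists every group in one go follows; the loop over m.re.groups (the number of groups in the compiled pattern) is just one way to do this.

```python
import re

m = re.search(r"((1)(2)(3))", "123abc")

# Groups are numbered by the position of their opening parenthesis,
# so the outer group is 1 and the inner ones are 2, 3 and 4.
# Index 0 is always the entire match.
numbered = [m.group(i) for i in range(m.re.groups + 1)]
print(numbered)
```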
I'd like to point out as well that you don't have to compile your regex unless you want to for reasons of usability and/or readability. For occasional use it is unlikely to improve performance noticeably, since the re module caches recently compiled patterns.
>>> import re
>>> search_me = '123abc'
>>> # easy = re.compile(r'123')
>>> # matches = easy.search(search_me)
>>> matches = re.search(r'123', search_me)
>>> print(matches.group())
123
Hope this helps! I found sites like debuggex helpful while learning regex. (Although sometimes you have to refresh those pages; I was banging my head for a couple hours one night before I realized that after reloading the page my regex worked just fine.) Lately I think you're served just as well by throwing sandbox code into something like wakari.io, or an IDE like PyCharm, etc., and observing the output. http://www.rexegg.com/ is also a good site for general regex knowledge.
You could create another group for that. I would also change the regex slightly to allow for numbers that don't have a decimal separator.
re.compile(r'time:((\d+)(\.?(\d+))?)')
Now you can use group(1) to retrieve the match of the floating point number.
I think the regex you actually want is something more like:
re.compile(r'time:(\d+\.\d+)')
or even:
re.compile(r'time:(\d+(?:\.\d+)?)') # This one will capture integers too.
Note that I've put the entire time into one group. I've also escaped the ., which otherwise matches any character in a regex.
Then, you'd get 1.5 from m.group(1) -- m.group(0) is the entire match. m.group(1) is the first submatch (parenthesized grouping), m.group(2) is the second grouping, etc.
example:
>>> import re
>>> p = re.compile(r'time:(\d+(?:\.\d+)?)')
>>> p.search('time:34')
<_sre.SRE_Match object at 0x10fa77d50>
>>> p.search('time:34').group(1)
'34'
>>> p.search('time:34.55').group(1)
'34.55'
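Since the goal is to pull the number out of a file, here is a rough sketch that applies the same pattern line by line and converts the captured text to float. The function name extract_times and the sample lines are made up; with a real file you would pass the open file object instead of the list.

```python
import re

TIME_RE = re.compile(r"time:(\d+(?:\.\d+)?)")

def extract_times(lines):
    """Return the time value of every line that contains one, as floats."""
    times = []
    for line in lines:
        m = TIME_RE.search(line)
        if m:  # skip lines without a time:<number> field
            times.append(float(m.group(1)))
    return times

sample = ["lalala time:1.5 s\n", "no timing here\n", "time:34 s\n"]
print(extract_times(sample))
```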

python regular expression substitute

I need to find the value of "taxid" in a large number of strings similar to one given below. For this particular string, the 'taxid' value is '9606'. I need to discard everything else. The "taxid" may appear anywhere in the text, but will always be followed by a ":" and then number.
score:0.86|taxid:9606(Human)|intact:EBI-999900
How to write regular expression for this in python.
>>> import re
>>> s = 'score:0.86|taxid:9606(Human)|intact:EBI-999900'
>>> re.search(r'taxid:(\d+)', s).group(1)
'9606'
If there are multiple taxids, use re.findall, which returns a list of all matches:
>>> re.findall(r'taxid:(\d+)', s)
['9606']
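A sketch of the multiple-taxid case. The input line with two taxid fields is invented for illustration; re.finditer is shown alongside re.findall for when the position of each hit is also of interest.

```python
import re

# Made-up line with two taxid fields, for illustration only.
s = "taxid:9606(Human)|taxid:10090(Mouse)|intact:EBI-999900"

ids = re.findall(r"taxid:(\d+)", s)

# finditer() yields match objects, so positions are available too.
hits = [(m.group(1), m.start()) for m in re.finditer(r"taxid:(\d+)", s)]

print(ids)
print(hits)
```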
for line in lines:
    # Captures everything between "taxid:" and the next "|", e.g. '9606(Human)'
    match = re.match(r".*\|taxid:([^|]+)\|.*", line)
    if match:
        print(match.groups())
