Number of repeats in a regex in python - python

Say I have the following regex:
"GGAGG.{5,13}?(ATG|GTG|TTG)(...)+?(TGA|TAA|TAG)"
Is there a way to see how many repeats are being done with for the part .{5,13}?
Looking to know how far between GGAGG and a start codon. I could go and search for it manually later but wondering if there's a better way within the original regex.

You could do
"GGAGG(.{5,13}?)(ATG|GTG|TTG)(...)+?(TGA|TAA|TAG)"
and then use code like
rem = re.match(pat, s)
dist_between_ggagg_and_start_codon = len(rem.group(1))

You can get the position of the entire match or groups using match.start method. Using that information:
>>> import re
>>> seq = 'xxxxGGAGGxxxxxxxATGxxxTGA'
>>> pattern = "GGAGG.{5,13}?(ATG|GTG|TTG)(...)+?(TGA|TAA|TAG)"
>>> match = re.search(pattern, seq)
>>> match.start(1) - match.start() - 5 # 5 = len(GGAGG)
7

Related

How to get String Before Last occurrence of substring?

I want to get String before last occurrence of my given sub string.
My String was,
path =
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov
my substring, 1001-1010 which will occurred twice. all i want is get string before its last occurrence.
Note: My substring is dynamic with different padding but only number.
I want,
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v
I have done using regex and slicing,
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> q = re.findall("\d*-\d*",p)
>>> q[-1].join(p.split(q[-1])[:-1])
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v'
>>>
Is their any better way to do by purely using regex?
Please Note I have tried so many eg:
regular expression to match everything until the last occurrence of /
Regex Last occurrence?
I got answer by using regex with slicing but i want to achieve by using regex alone..
Why use regex. Just use built in string methods:
path = "D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov"
index = path.rfind("1001-1010")
print(path[:index])
You can use a simple greedy match and a capture group:
(.*)1001-1010
Your match is in capture group #1
Since .* is greedy by nature, it will match longest match before matching your keyword 1001-1010.
RegEx Demo
As per comments below if keyword is not a static string then you may use this regex:
r'(.*\D)\d+-\d+'
Python Code:
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> print (re.findall(r'(.*\D)\d+-\d+', p))
['D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v']
Thanks #anubhava,
My first regex was,
.*(\d*-\d*)\/
Now i have corrected mine..
.*(\d*-\d*)
or
(.*)(\d*-\d*)
which gives me,
>>> q = re.search('.+(\d*-\d*)', p)
>>> q.group()
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v0001-1001'
>>>
(.*\D)\d+-\d+
this gives me exactly what i want...
>>> q = re.search('(.*\D)\d+-\d+', p)
>>> q.groups()
('D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v',)
>>>

Python - Most elegant way to extract a substring, being given left and right borders [duplicate]

This question already has answers here:
How to extract the substring between two markers?
(22 answers)
Closed 4 years ago.
I have a string - Python :
string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
Expected output is :
"Atlantis-GPS-coordinates"
I know that the expected output is ALWAYS surrounded by "/bar/" on the left and "/" on the right :
"/bar/Atlantis-GPS-coordinates/"
Proposed solution would look like :
a = string.find("/bar/")
b = string.find("/",a+5)
output=string[a+5,b]
This works, but I don't like it.
Does someone know a beautiful function or tip ?
You can use split:
>>> string.split("/bar/")[1].split("/")[0]
'Atlantis-GPS-coordinates'
Some efficiency from adding a max split of 1 I suppose:
>>> string.split("/bar/", 1)[1].split("/", 1)[0]
'Atlantis-GPS-coordinates'
Or use partition:
>>> string.partition("/bar/")[2].partition("/")[0]
'Atlantis-GPS-coordinates'
Or a regex:
>>> re.search(r'/bar/([^/]+)', string).group(1)
'Atlantis-GPS-coordinates'
Depends on what speaks to you and your data.
What you haven't isn't all that bad. I'd write it as:
start = string.find('/bar/') + 5
end = string.find('/', start)
output = string[start:end]
as long as you know that /bar/WHAT-YOU-WANT/ is always going to be present. Otherwise, I would reach for the regular expression knife:
>>> import re
>>> PATTERN = re.compile('^.*/bar/([^/]*)/.*$')
>>> s = '/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/'
>>> match = PATTERN.match(s)
>>> match.group(1)
'Atlantis-GPS-coordinates'
import re
pattern = '(?<=/bar/).+?/'
string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
result = re.search(pattern, string)
print string[result.start():result.end() - 1]
# "Atlantis-GPS-coordinates"
That is a Python 2.x example. What it does first is:
1. (?<=/bar/) means only process the following regex if this precedes it (so that /bar/ must be before it)
2. '.+?/' means any amount of characters up until the next '/' char
Hope that helps some.
If you need to do this kind of search a bunch it is better to 'compile' this search for performance, but if you only need to do it once don't bother.
Using re (slower than other solutions):
>>> import re
>>> string = "/foo13546897/bar/Atlantis-GPS-coordinates/bar457822368/foo/"
>>> re.search(r'(?<=/bar/)[^/]+(?=/)', string).group()
'Atlantis-GPS-coordinates'

Finding the index of the second match of a regular expression in python

So I am trying to rename files to match the naming convention for plex mediaserver. ( SxxEyy )
Now I have a ton of files that use eg. 411 for S04E11. I have written a little function that will search for an occurrence of this pattern and replace it with the correct convention. Like this :
pattern1 = re.compile('[Ss]\\d+[Ee]\\d+')
pattern2 = re.compile('[\.\-]\d{3,4}')
def plexify_name(string):
#If the file matches the pattern we want, don't change it
if pattern1.search(string):
return string
elif pattern2.search(string):
piece_to_change = pattern2.search(string)
endpos = piece_to_change.end()
startpos = piece_to_change.start()
#Cut out the piece to change
cut = string[startpos+1:endpos-1]
if len(cut) == 4:
cut = 'S'+cut[0:2] + 'E' + cut[2:4]
if len(cut) == 3:
cut = 'S0'+cut[0:1] + 'E' + cut[1:3]
return string[0:startpos+1] + cut + string[endpos-1:]
And this works very well. But it turns out that some of the filenames will have a year in them eg. the.flash.2014.118.mp4 In which case it will change the 2014.
I tried using
pattern2.findall(string)
Which does return a list of strings like this --> ['.2014', '.118'] but what I want is a list of matchobjects so I can check if there is 2 and in that case use the start/end of the second. I can't seem to find something to do this in the re documentation. I am missing something or do I need to take a totally different approach?
You could try anchoring the match to the file extension:
pattern2 = re.compile(r'[.-]\d{3,4}(?=[.]mp4$)')
Here, (?= ... ) is a look-ahead assertion, meaning that the thing has to be there for the regex to match, but it's not part of the match:
>>> pattern2.findall('test.118.mp4')
['.118']
>>> pattern2.findall('test.2014.118.mp4')
['.118']
>>> pattern2.findall('test.123.mp4.118.mp4')
['.118']
Of course, you want it to work with all possible extensions:
>>> p2 = re.compile(r'[.-]\d{3,4}(?=[.][^.]+$)')
>>> p2.findall('test.2014.118.avi')
['.118']
>>> p2.findall('test.2014.118.mov')
['.118']
If there is more stuff between the episode number and the extension, regexes for matching that start to get tricky, so I would suggest a non-regex approach for dealing with that:
>>> f = 'test.123.castle.2014.118.x264.mp4'
>>> [p for p in f.split('.') if p.isdigit()][-1]
'118'
Or, alternatively, you can get match objects for all matches by using finditer and expanding the iterator by converting it to a list:
>>> p2 = re.compile(r'[.-]\d{3,4}')
>>> f = 'test.2014.712.x264.mp4'
>>> matches = list(p2.finditer(f))
>>> matches[-1].group(0)
'.712'

python regex find characters from and end of the string

svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz
from the following string I need to fetch Rev1233. So i was wondering if we can have any regexpression to do that. I like to do following string.search ("Rev" uptill next /)
so far I split this using split array
s1,s2,s3,s4,s5 = string ("/",4)
You don't need a regex to do this. It is as simple as:
str = 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz'
str.split('/')[-2]
Here is a quick python example
>>> impot re
>>> s = 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz'
>>> p = re.compile('.*/(Rev\d+)/.*')
>>> p.match(s).groups()[0]
'Rev1223'
Find second part from the end using regex, if preferred:
/(Rev\d+)/[^/]+$
http://regex101.com/r/cC6fO3/1
>>> import re
>>> m = re.search('/(Rev\d+)/[^/]+$', 'svn-backup-test,2014/09/24/18/Rev1223/FullSvnCheckout.tgz')
>>> m.groups()[0]
'Rev1223'

Python re.search() not returning full group match

import re
ip6 = "1234:0678:0000:0000:00cd:0000:0000:0000"
zeroes = re.search("(:?0000)+", ip6)
print zeroes.group(0)
:0000:0000
I'm trying to find the longest sequence of four zeroes separated by colons. The string contains a sequence of three such groups, yet only two groups are printed. Why?
EDIT: It's printing :0000:0000 because that's the first match in the string -- but I thought regexps always looked for the longest match?
Answer updated to work in Python 2.6:
p = re.compile('((:?0000)+)')
longestword = ""
for word in p.findall(ip6):
if len(word[0])>len(longestword):
longestword = word[0]
print longestword
If you're not stuck on regular-expressions, you could use itertools.groupby:
from itertools import groupby
ip6 = "1234:0678:0000:0000:00cd:0000:0000:0000"
longest = 0
for section, elems in groupby(ip6.split(':')):
if section == '0000':
longest = len(list(elems))
print longest # Prints '3', the number of times '0000' repeats the most.
# you could, of course, generate a string of 0000:... from this
I'm sure this can be boiled down to something a bit more elegant, but I think this communicates the point.
I'm using Python 2.7.3
How about using re.finditer()
$ uname -r
3.2.0-4-amd64
#!/usr/bin/env python
import re
ip6 = "1234:0678:0000:0000:00cd:0000:0000:0000"
iters = re.finditer("(:?0000)+", ip6)
for match in iters:
print 'match.group() -> ',match.group()

Categories