Regex to retrieve the last few characters of a string - python

Regex to retrieve the last portion of a string:
https://play.google.com/store/apps/details?id=com.lima.doodlejump
I'm looking to retrieve the string followed by id=
The following regex didn't seem to work in python
sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
re.search("id=(.*?)", sampleURL).group(1)
The above should give me an output:
com.lima.doodlejump
Is my search group right?

Your regular expression
(.*?)
will not work because, it will match between zero and unlimited times, as few times as possible (becasue of the ?). So, you have the following choices of RegEx
(.*) # Matches the rest of the string
(.*?)$ # Matches till the end of the string
But, you don't need RegEx at all here, simply split the string like this
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
print data.split("id=", 1)[-1]
Output
com.lima.doodlejump
If you really have to use RegEx, you can do like this
data = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
import re
print re.search("id=(.*)", data).group(1)
Output
com.lima.doodlejump

I'm surprised that nobody has mentioned urlparse yet...
>>> s = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> urlparse.urlparse(s)
ParseResult(scheme='https', netloc='play.google.com', path='/store/apps/details', params='', query='id=com.lima.doodlejump', fragment='')
>>> urlparse.parse_qs(urlparse.urlparse(s).query)
{'id': ['com.lima.doodlejump']}
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id']
['com.lima.doodlejump']
>>> urlparse.parse_qs(urlparse.urlparse(s).query)['id'][0]
'com.lima.doodlejump'
The HUGE advantage here is that if the url query string gets more components then it could easily break the other solutions which rely on a simple str.split. It won't confuse urlparse however :).

Just split it in the place you want:
id = url.split('id=')[1]
If you print id, you'll get:
com.lima.doodlejump
Regex isn't needed here :)
However, in case there are multiple id=s in your string, and you only wanted the last one:
id = url.split('id=')[-1]
Hope this helps!

This works:
>>> import re
>>> sampleURL = "https://play.google.com/store/apps/details?id=com.lima.doodlejump"
>>> re.search("id=(.+)", sampleURL).group(1)
'com.lima.doodlejump'
>>>
Instead of capturing non-greedily for zero or more characters, this code captures greedily for one or more.

Related

I want to extract data using regular expression in python

I have a string = "ProductId%3D967164%26Colour%3Dbright-royal" and i want to extract data using regex so output will be 967164bright-royal.
I have tried with this (?:ProductId%3D|Colour%3D)(.*) in python with regex, but getting output as 967164%26Colour%3Dbright-royal.
Can anyone please help me to find out regex for it.
You don't need a regex here, use urllib.parse module:
from urllib.parse import parse_qs, unquote
qs = "ProductId%3D967164%26Colour%3Dbright-royal"
d = parse_qs(unquote(qs))
print(d)
# Output:
{'ProductId': ['967164'], 'Colour': ['bright-royal']}
Final output:
>>> ''.join(i[0] for i in d.values())
'967164bright-royal'
Update
>>> ''.join(re.findall(r'%3D(\S*?)(?=%26|$)', qs))
'967164bright-royal'
The alternative matches on the first part, you can not get a single match for 2 separate parts in the string.
If you want to capture both values using a regex in a capture group:
(?:ProductId|Colour)%3D(\S*?)(?=%26|$)
Regex demo
import re
pattern = r"(?:ProductId|Colour)%3D(\S*?)(?=%26|$)"
s = "ProductId%3D967164%26Colour%3Dbright-royal"
print(''.join(re.findall(pattern, s)))
Output
967164bright-royal
If you must use a regular expression and you can guarantee that the string will always be formatted the way you expect, you could try this.
import re
pattern = r"ProductId%3D(\d+)%26Colour%3D(.*)"
string = "ProductId%3D967164%26Colour%3Dbright-royal"
matches = re.match(pattern, string)
print(f"{matches[1]}{matches[2]}")

How to get String Before Last occurrence of substring?

I want to get String before last occurrence of my given sub string.
My String was,
path =
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov
my substring, 1001-1010 which will occurred twice. all i want is get string before its last occurrence.
Note: My substring is dynamic with different padding but only number.
I want,
D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v
I have done using regex and slicing,
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> q = re.findall("\d*-\d*",p)
>>> q[-1].join(p.split(q[-1])[:-1])
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v'
>>>
Is their any better way to do by purely using regex?
Please Note I have tried so many eg:
regular expression to match everything until the last occurrence of /
Regex Last occurrence?
I got answer by using regex with slicing but i want to achieve by using regex alone..
Why use regex. Just use built in string methods:
path = "D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov"
index = path.rfind("1001-1010")
print(path[:index])
You can use a simple greedy match and a capture group:
(.*)1001-1010
Your match is in capture group #1
Since .* is greedy by nature, it will match longest match before matching your keyword 1001-1010.
RegEx Demo
As per comments below if keyword is not a static string then you may use this regex:
r'(.*\D)\d+-\d+'
Python Code:
>>> p = 'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v1001-1010.mov'
>>> print (re.findall(r'(.*\D)\d+-\d+', p))
['D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v']
Thanks #anubhava,
My first regex was,
.*(\d*-\d*)\/
Now i have corrected mine..
.*(\d*-\d*)
or
(.*)(\d*-\d*)
which gives me,
>>> q = re.search('.+(\d*-\d*)', p)
>>> q.group()
'D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v0001-1001'
>>>
(.*\D)\d+-\d+
this gives me exactly what i want...
>>> q = re.search('(.*\D)\d+-\d+', p)
>>> q.groups()
('D:/me/vol101/Prod/cent/2019_04_23_01/image/AVEN_000_3400_img_pic_p1001-1010/pxy/AVEN_000_3400_img-mp4_to_MOV_v',)
>>>

regex to extract data between quotes

As title says string is '="24digit number"' and I want to extract number between "" (example: ="000021484123647598423458" should get me '000021484123647598423458').
There are answers that answer how to get data between " but in my case I also need to confirm that =" exist without capturing (there are also other "\d{24}" strings, but they are for other stuff) it.
I couldn't modify these answers to get what I need.
My latest regex was ((?<=\")\d{24}(?=\")) and string is ="000021484123647598423458".
UPDATE: I think I will settle with pattern r'^(?:\=\")(\d{24})(?:\")' because I just want to capture digit characters.
word = '="000021484123647598423458"'
pattern = r'^(?:\=\")(\d{24})(?:\")'
match = re.findall(pattern, word)[0]
Thank you all for suggestions.
You could have it like:
=(['"])(\d{24})\1
See a demo on regex101.com.
In Python:
import re
string = '="000021484123647598423458"'
rx = re.compile(r'''=(['"])(\d{24})\1''')
print(rx.search(string).group(2))
# 000021484123647598423458
Any one of the following works:
>>> st = '="000021484123647598423458"'
>>> import re
>>> re.findall(r'".*\d+.*"',st)
['"000021484123647598423458"']
or
>>> re.findall(r'".*\d{24}.*"',st)
['"000021484123647598423458"']
or
>>> re.findall(r'"\d{24}"',st)
['"000021484123647598423458"']

Python Regular Expression Extracting 'name= ....'

I'm using a Python script to read data from our corporate instance of JIRA. There is a value that is returned as a string and I need to figure out how to extract one bit of info from it. What I need is the 'name= ....' and I just need the numbers from that result.
<class 'list'>: ['com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]']
I just need the 2016.2.4 portion of it. This number will not always be the same either.
Any thoughts as how to do this with RE? I'm new to regular expressions and would appreciate any help.
A simple regular expression can do the trick: name=([0-9.]+).
The primary part of the regex is ([0-9.]+) which will search for any digit (0-9) or period (.) in succession (+).
Now, to use this:
import re
pattern = re.compile('name=([0-9.]+)')
string = '''<class 'list'>: ['com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]']'''
matches = pattern.search(string)
# Only assign the value if a match is found
name_value = '' if not matches else matches.group(1)
Use a capturing group to extract the version name:
>>> import re
>>> s = 'com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]'
>>> re.search(r"name=([0-9.]+)", s).group(1)
'2016.2.4'
where ([0-9.]+) is a capturing group matching one or more digits or dots, parenthesis define a capturing group.
A non-regex option would involve some splitting by ,, = and -:
>>> l = [item.split("=") for item in s.split(",")]
>>> next(value[1] for value in l if value[0] == "name").split(" - ")[0]
'2016.2.4'
This, of course, needs testing and error handling.

Python Regex searching

I want to use regex to search in a file for this expression:
time:<float> s
I only want to get the float number.
I'm learning about regex, and this is what I did:
astr = 'lalala time:1.5 s\n'
p = re.compile(r'time:(\d+).*(\d+)')
m = p.search(astr)
Well, I get time:1.5 from m.group(0)
How can I directly just get 1.5 ?
I'm including some extra python-specific materiel since you said you're learning regex. As already mentioned the simplest regex for this would certainly be \d+\.\d+ in various commands as described below.
Something that threw me off with python initially was getting my head around the return types of various re methods and when to use group() vs. groups().
There are several methods you might use:
re.match()
re.search()
re.findall()
match() will only return an object if the pattern is found at the beginning of the string.
search() will find the first pattern and top.
findall() will find everything in the string.
The return type for match() and search() is a match object, __Match[T], or None, if a match isn't found. However the return type for findall() is a list[T]. These different return types obviously have ramifications for how you get the values out of your match.
Both match and search expose the group() and groups() methods for retrieving your matches. But when using findall you'll want to iterate through your list or pull a value with an enumerator. So using findall:
>>>import re
>>>easy = re.compile(r'123')
>>>matches = easy.findall(search_me)
>>>for match in matches: print match
123
If you're using search() or match(), you'll want to use .group() or groups() to retrieve your match depending on how you've set up your regular expression.
From the documentation, "The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are."
Therefore if you have no groups in your regex, as shown in the following example, you wont get anything back:
>>>import re
>>>search_me = '123abc'
>>>easy = re.compile(r'123')
>>>matches = easy.search(search_me)
>>>print matches.groups()
()
Adding a "group" to your regular expression enables you to use this:
>>>import re
>>>search_me = '123abc'
>>>easy = re.compile(r'(123)')
>>>matches = easy.search(search_me)
>>>print matches.groups()
('123',)
You don't have to specify groups in your regex. group(0) or group() will return the entire match even if you don't have anything in parenthesis in your expression. --group() defaults to group(0).
>>>import re
>>>search_me = '123abc'
>>>easy = re.compile(r'123')
>>>matches = easy.search(search_me)
>>>print matches.group(0)
123
If you are using parenthesis you can use group to match specific groups and subgroups.
>>>import re
>>>search_me = '123abc'
>>>easy = re.compile(r'((1)(2)(3))')
>>>matches = easy.search(search_me)
>>>print matches.group(1)
>>>print matches.group(2)
>>>print matches.group(3)
>>>print matches.group(4)
123
1
2
3
I'd like to point as well that you don't have to compile your regex unless you care to for reasons of usability and/or readability. It won't improve your performance.
>>>import re
>>>search_me = '123abc'
>>>#easy = re.compile(r'123')
>>>#matches = easy.search(search_me)
>>>matches = re.search(r'123', search_me)
>>>print matches.group()
Hope this helps! I found sites like debuggex helpful while learning regex. (Although sometimes you have to refresh those pages; I was banging my head for a couple hours one night before I realized that after reloading the page my regex worked just fine.) Lately I think you're served just as well by throwing sandbox code into something like wakari.io, or an IDE like PyCharm, etc., and observing the output. http://www.rexegg.com/ is also a good site for general regex knowledge.
You could do create another group for that. And I would also change the regex slightly to allow for numbers that don't have a decimal separator.
re.compile(r'time:((\d+)(\.?(\d+))?')
Now you can use group(1) to capture the match of the floating point number.
I think the regex you actually want is something more like:
re.compile(r'time:(\d+\.\d+)')
or even:
re.compile(r'time:(\d+(?:\.\d+)?)') # This one will capture integers too.
Note that I've put the entire time into 1 grouping. I've also escaped the . which means any character in regex.
Then, you'd get 1.5 from m.group(1) -- m.group(0) is the entire match. m.group(1) is the first submatch (parenthesized grouping), m.group(2) is the second grouping, etc.
example:
>>> import re
>>> p = re.compile(r'time:(\d+(?:\.\d+)?)')
>>> p.search('time:34')
<_sre.SRE_Match object at 0x10fa77d50>
>>> p.search('time:34').group(1)
'34'
>>> p.search('time:34.55').group(1)
'34.55'

Categories