Extract sub path from url with regex - python

I have this url:
http://www.example.com/en/news/2016/07/17/1207151/%D9%81%D8%AA%D9%88%D8%A7%DB%8C-%D8%B1%D9%87%D8%A8%D8%B1-
I want to extract 1207151 from it.
Here is my regex:
pattern = '(http[s]?:\/\/)?([^\/\s]+\/)+[^/]+[^/]+[^/]+[^/]/(?<field1>[^/]+)/'
But it doesn't work.
What is my mistake?

You can use this regex in Python:
>>> import re
>>> url = 'http://www.example.com/en/news/2016/07/17/1207151/%D9%81%D8%AA%D9%88%D8%A7%DB%8C-%D8%B1%D9%87%D8%A8%D8%B1-'
>>> re.search(r'^https?://(?:([^/]+)/){7}', url).group(1)
'1207151'
(?:([^/]+)/){7} matches one or more non-slash characters followed by a /, seven times in a row. A repeated capture group keeps only its last iteration, so group 1 holds the seventh path segment.

You've got a couple of things going on.
First, escape your /s consistently. (Python's re module does not require escaping /, but delimiter-based flavors such as PCRE do.) You've got most of them, but missed a couple:
(http[s]?:\/\/)?([^\/\s]+\/)+[^\/]+[^\/]+[^\/]+[^\/]\/(?<field1>[^\/]+)\/
From here, you have a number of "1 or more not /" in a row that can be reduced:
[^\/]+[^\/]+[^\/]+ ==> [^\/]{3,}
But that's not what you meant to do. You meant to have several blocks of "non-/" characters, each followed by a "/", and based on your example you want that 6 times before your named capture group.
([^\/]+\/){6}
Here's what works (Python's re module spells the named group (?P<field1>...) rather than (?<field1>...)):
http[s]?:\/\/([^\/]+\/){6}(?P<field1>[^\/]+)\/
And you can see it in action here: https://regex101.com/r/kkqwRJ/2
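A minimal Python sketch of the same idea, assuming the URL always has six path-like segments (host included) before the wanted one. The variable names are illustrative; the pattern uses Python's (?P<name>...) syntax and leaves / unescaped, since Python does not need it escaped.

```python
import re

url = ('http://www.example.com/en/news/2016/07/17/1207151/'
       '%D9%81%D8%AA%D9%88%D8%A7%DB%8C-%D8%B1%D9%87%D8%A8%D8%B1-')

# Six "segment/" blocks (host, en, news, 2016, 07, 17), then the named group.
pattern = re.compile(r'https?://(?:[^/]+/){6}(?P<field1>[^/]+)/')

match = pattern.match(url)
field1 = match.group('field1') if match else None
print(field1)  # 1207151
```

Checking for None before calling group() avoids an AttributeError on URLs that have fewer segments.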

import re
# s is assumed to hold the URL string from the question
print(re.search(r'.*/([^/]+)/.*', s).group(1))

Related

Extract values in name=value lines with regex

I'm sorry for asking, because there are similar questions around, but I can't adapt their answers to my problem.
These are the input lines (e.g. from a config file):
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values after the = sign with a regex in Python 3.9.
I tried this on regex101:
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=. I want exactly the opposite.
The expected result (what Python's re.findall() returns) is:
['share2', 'share8', 'shareSSH', 'share9']
Try the pattern profile\d+\.name=(.*) (see the Regex 101 example):
import re
# txt is assumed to hold the input lines as a single string
re.findall(r'profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex, split should work absolutely fine:
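A rough sketch of a split-based approach, assuming txt holds the config lines as one string. This is an illustration only, not the answerer's original snippet.

```python
txt = """profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9"""

# Split each line once on the first "=" and keep the right-hand side.
values = [line.split("=", 1)[1] for line in txt.splitlines() if "=" in line]
print(values)  # ['share2', 'share8', 'shareSSH', 'share9']
```

Passing 1 as maxsplit keeps any further = characters inside the value intact.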
Try removing the ? quantifier. With nothing after it in the pattern, the lazy (.*?) makes your capture group match an empty string.
regex101
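To see the difference, here is a small sketch comparing the lazy and greedy groups. It assumes the input is a single multi-line string (txt is an illustrative name), so re.MULTILINE is needed for ^ to match at each line start.

```python
import re

txt = ("profile2.name=share2\nprofile8.name=share8\n"
       "profile4.name=shareSSH\nprofile9.name=share9")

# Lazy group with nothing after it: it matches as little as possible, i.e. nothing.
lazy = re.findall(r'^profile[0-9]\.name=(.*?)', txt, flags=re.MULTILINE)
# Greedy group: runs to the end of each line ('.' does not match a newline).
greedy = re.findall(r'^profile[0-9]\.name=(.*)', txt, flags=re.MULTILINE)

print(lazy)    # ['', '', '', '']
print(greedy)  # ['share2', 'share8', 'shareSSH', 'share9']
```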

Extract URL's inclusive with fragments in string using Python with Regex

OK, I know people are going to say this question has been asked a million times, but my question is different. I have searched Stack Overflow many times to make sure this is not a duplicate.
I want a regex in Python that extracts a URL from a string, including its fragment.
What I have done so far is:
import re
test = 'This is a string with my URL as follows http://www.example.org/foo.html#bar and here i continue with my string'
test = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', test)
print (test)
The output I get from the above code is ['http://www.example.org/foo.html'],
which is not what I want.
I want the output to be ['http://www.example.org/foo.html#bar']
Your original regex is this:
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
Couldn't you just add '#', like this?
http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),#]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
I am unclear as to what you mean by 'fragments'... Do you mean anything up to the space in the string?
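A quick sketch of both readings, assuming the pattern's third alternative is [$-_@.&+] (a range that does not contain #). The first pattern adds # to the punctuation class; the second is a simpler, looser option that takes everything up to the next whitespace. The variable names are illustrative.

```python
import re

test = ('This is a string with my URL as follows '
        'http://www.example.org/foo.html#bar and here i continue with my string')

# '#' added to the punctuation class so the fragment is kept.
with_hash = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),#]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)
# Looser alternative: anything that is not whitespace after the scheme.
up_to_space = re.compile(r'https?://\S+')

urls_hash = with_hash.findall(test)
urls_space = up_to_space.findall(test)
print(urls_hash)   # ['http://www.example.org/foo.html#bar']
print(urls_space)  # ['http://www.example.org/foo.html#bar']
```

The \S+ version also swallows trailing punctuation such as a full stop right after the URL, so it suits only loosely formatted text where that does not matter.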

Find string with regular expression in python

I am a newbie in Python and I am trying to extract a piece of one string from another.
I looked at other similar questions but I could not find my answer.
I have a variable which contains a list of URLs that look like this:
http://92.230.38.21/ios/Channel767/Hotbird.mp3
http://92.230.38.21/ios/Channel9798/Coldbird.mp3
....
I want the mp3 file name (in this example Hotbird, Coldbird, etc.).
I know I should be able to do it with re.findall(), but I have no idea what regular expression I need.
Any idea?
Update:
Here is the part I used:
for final in match2:
    netname = re.findall(r'\W+\//\W+\/\W+\/\W+\/\W+', final)
    print(final)
    print(netname)
That did not work. Then I tried this one, which extracts only the IP address (92.230.38.21) but not the name:
for final in match2:
    netname = re.findall(r'\d+\.\d+\.\d+\.\d+', final)
    print(final)
You may just use str.split():
>>> urls = ["http://92.230.38.21/ios/Channel767/Hotbird.mp3", "http://92.230.38.21/ios/Channel9798/Coldbird.mp3"]
>>> for url in urls:
...     print(url.split("/")[-1].split(".")[0])
...
Hotbird
Coldbird
And here is an example regex-based approach:
>>> import re
>>>
>>> pattern = re.compile(r"/(\w+)\.mp3$")
>>> for url in urls:
...     print(pattern.search(url).group(1))
...
Hotbird
Coldbird
where we use a capturing group (\w+) to capture the mp3 file name: one or more word characters (letters, digits, or underscores) followed by a dot and mp3 at the end of the URL.
How about this?
([^/]*mp3)$
I think that might work.
Basically it says: anchored at the end of the line, match mp3 and everything before it back to the last slash. Note that the captured text includes the .mp3 extension.
I think it will perform well.
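A short sketch comparing that pattern with a variant that leaves the extension outside the group, assuming every URL ends in .mp3. The name_only pattern is an illustrative variation, not part of the answer above.

```python
import re

urls = [
    "http://92.230.38.21/ios/Channel767/Hotbird.mp3",
    "http://92.230.38.21/ios/Channel9798/Coldbird.mp3",
]

with_ext = re.compile(r"([^/]*mp3)$")      # the pattern suggested above
name_only = re.compile(r"([^/]*)\.mp3$")   # variant: extension outside the group

full_names = [with_ext.search(u).group(1) for u in urls]
bare_names = [name_only.search(u).group(1) for u in urls]
print(full_names)  # ['Hotbird.mp3', 'Coldbird.mp3']
print(bare_names)  # ['Hotbird', 'Coldbird']
```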

Remove Part of String Before the Last Forward Slash

The program I am currently working on retrieves URLs from a website and puts them into a list. What I want to get is the last section of the URL.
So, if the first element in my list of URLs is "https://docs.python.org/3.4/tutorial/interpreter.html" I would want to remove everything before "interpreter.html".
Is there a function, library, or regex I could use to make this happen? I've looked at other Stack Overflow posts but the solutions don't seem to work.
These are two of my several attempts:
for link in link_list:
    file_names.append(link.replace('/[^/]*$',''))
print(file_names)
and
for link in link_list:
    file_names.append(link.rpartition('//')[-1])
print(file_names)
Have a look at str.rsplit.
>>> s = 'https://docs.python.org/3.4/tutorial/interpreter.html'
>>> s.rsplit('/',1)
['https://docs.python.org/3.4/tutorial', 'interpreter.html']
>>> s.rsplit('/',1)[1]
'interpreter.html'
And to use a regex:
>>> import re
>>> re.search(r'(.*)/(.*)',s).group(2)
'interpreter.html'
The first group is greedy, so it consumes everything up to the last /. The second group then captures what lies between that last / and the end of the string.
Debuggex Demo
Small note: the problem with link.rpartition('//')[-1] in your code is that it splits on // and not on /. Remove the extra /, as in link.rpartition('/')[-1].
That doesn't need regex.
import os
for link in link_list:
    file_names.append(os.path.basename(link))
You can use rpartition():
>>> s = 'https://docs.python.org/3.4/tutorial/interpreter.html'
>>> s.rpartition('/')
('https://docs.python.org/3.4/tutorial', '/', 'interpreter.html')
And take the last part of the 3 element tuple that is returned:
>>> s.rpartition('/')[2]
'interpreter.html'
Just use str.split:
url = "/some/url/with/a/file.html"
print(url.split("/")[-1])
# Result should be "file.html"
split gives you a list of the strings that were separated by "/". The [-1] gives you the last element of that list, which is what you want.
Here's a more general, regex way of doing this:
re.sub(r'^.+/([^/]+)$', r'\1', "http://test.org/3/files/interpreter.html")
'interpreter.html'
This should work if you plan to use regex (str.replace does not accept patterns, so use re.sub):
import re
for link in link_list:
    file_names.append(re.sub(r'.*/', '', link))
print(file_names)

How to stop python Regular Expression being too greedy

I'm trying to match (in Python) the show name and season/episode numbers from tv episode filenames in the format:
Show.One.S01E05.720p.HDTV.x264-CTU.mkv
and
Show.Two.S08E02.HDTV.XviD-LOL.avi
My regular expression:
(?P<show>[\w\s.,_-]+)\.[Ss]?(?P<season>[\d]{1,2})[XxEe]?(?P<episode>[\d]{2})
matches correctly on Show Two giving me Show Two, 08 and 02. However the 720 in Show One means I get back 7 and 20 for season/episode.
If I remove the ? after [XxEe] then it matches both types but I want that range to be optional for filenames where the episode identifier isn't included.
I've tried using ?? to stop the [XxEe] match being greedy as listed in the python docs re module section but this has no effect.
How can I capture the series name section and the season/episode section while ignoring the rest of the string?
Make the first group non-greedy:
import re
p = re.compile(r'(?P<show>[\w\s.,_-]+?)\.[Ss]?(?P<season>[\d]{1,2})[XxEe]?(?P<episode>[\d]{2})')
print(p.findall("Game.of.Thrones.S01E05.720p.HDTV.x264-CTU.mkv"))
# [('Game.of.Thrones', '01', '05')]
print(p.findall("Entourage.S08E02.HDTV.XviD-LOL.avi"))
# [('Entourage', '08', '02')]
Note the ? following the + in the first group.
Explanation:
The first group eats too much, so making it lazy lets the rest of the pattern match sooner. (Not a really nice example, by the way; I would have changed the names, as they sound a bit too Warezzz-y, to be honest ;-) )
Try this (note the ? added after the + in the first group):
(?P<show>[\w\s.,_-]+?)\.[Ss]?(?P<season>[\d]{1,2})[XxEe]?(?P<episode>[\d]{2})
Add an escaped dot (\.) at the end of the regex:
(?P<show>[\w\s.,_-]+)\.[Ss]?(?P<season>[\d]{1,2})[XxEe]?(?P<episode>[\d]{2})\.
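A small sketch that runs both suggestions (the lazy first group and the trailing dot) against the two filenames from the question. The variable names and the shared tail fragment are illustrative; \d is used in place of [\d], which is equivalent.

```python
import re

names = [
    "Show.One.S01E05.720p.HDTV.x264-CTU.mkv",
    "Show.Two.S08E02.HDTV.XviD-LOL.avi",
]

# Shared season/episode part of the pattern.
tail = r"\.[Ss]?(?P<season>\d{1,2})[XxEe]?(?P<episode>\d{2})"
# Lazy first group: stops at the earliest point where the tail can match.
lazy_show = re.compile(r"(?P<show>[\w\s.,_-]+?)" + tail)
# Greedy first group, but the episode must be followed by a literal dot.
trailing_dot = re.compile(r"(?P<show>[\w\s.,_-]+)" + tail + r"\.")

lazy_results = [lazy_show.search(n).groupdict() for n in names]
dot_results = [trailing_dot.search(n).groupdict() for n in names]
print(lazy_results[0])  # {'show': 'Show.One', 'season': '01', 'episode': '05'}
print(dot_results[0])   # {'show': 'Show.One', 'season': '01', 'episode': '05'}
```

The trailing-dot variant rejects the 720p reading because the p after 720 is not a dot, so the engine backtracks to S01E05.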
