The program I am currently working on retrieves URLs from a website and puts them into a list. What I want to get is the last section of the URL.
So, if the first element in my list of URLs is "https://docs.python.org/3.4/tutorial/interpreter.html" I would want to remove everything before "interpreter.html".
Is there a function, library, or regex I could use to make this happen? I've looked at other Stack Overflow posts but the solutions don't seem to work.
These are two of my several attempts:
for link in link_list:
file_names.append(link.replace('/[^/]*$',''))
print(file_names)
&
for link in link_list:
file_names.append(link.rpartition('//')[-1])
print(file_names)
Have a look at str.rsplit.
>>> s = 'https://docs.python.org/3.4/tutorial/interpreter.html'
>>> s.rsplit('/',1)
['https://docs.python.org/3.4/tutorial', 'interpreter.html']
>>> s.rsplit('/',1)[1]
'interpreter.html'
And to use RegEx
>>> re.search(r'(.*)/(.*)',s).group(2)
'interpreter.html'
Then match the 2nd group which lies between the last / and the end of String. This is a greedy usage of the greedy technique in RegEx.
Debuggex Demo
Small Note - The problem with link.rpartition('//')[-1] in your code is that you are trying to match // and not /. So remove the extra / as in link.rpartition('/')[-1].
That doesn't need regex.
import os
for link in link_list:
file_names.append(os.path.basename(link))
You can use rpartition():
>>> s = 'https://docs.python.org/3.4/tutorial/interpreter.html'
>>> s.rpartition('/')
('https://docs.python.org/3.4/tutorial', '/', 'interpreter.html')
And take the last part of the 3 element tuple that is returned:
>>> s.rpartition('/')[2]
'interpreter.html'
Just use string.split:
url = "/some/url/with/a/file.html"
print url.split("/")[-1]
# Result should be "file.html"
split gives you an array of strings that were separated by "/". The [-1] gives you the last element in the array, which is what you want.
Here's a more general, regex way of doing this:
re.sub(r'^.+/([^/]+)$', r'\1', "http://test.org/3/files/interpreter.html")
'interpreter.html'
This should work if you plan to use regex
for link in link_list:
file_names.append(link.replace('.*/',''))
print(file_names)
Related
I have a string:
s3://tester/test.pdf
I want to exclude s3://tester/ so even if i have s3://tester/folder/anotherone/test.pdf I am getting the entire path after s3://tester/
I have attempted to use the split & partition method but I can't seem to get it.
Currently am trying:
string.partition('/')[3]
But i get an error saying that it out of index.
EDIT: I should have specified that the name of the bucket will not always be the same so I want to make sure that it is only grabbing anything after the 3rd '/'.
You can use str.split():
path = 's3://tester/test.pdf'
print(path.split('/', 3)[-1])
Output:
test.pdf
UPDATE: With regex:
import re
path = 's3://tester/test.pdf'
print(re.split('/',path,3)[-1])
Output:
test.pdf
Have you tried .replace?
You could do:
string = "s3://tester/test.pdf"
string = string.replace("s3://tester/", "")
print(string)
This will replace "s3://tester/" with the empty string ""
Alternatively, you could use .split rather than .partition
You could also try:
string = "s3://tester/test.pdf"
string = "/".join(string.split("/")[3:])
print(string)
To answer "How to get everything after x amount of characters in python"
string[x:]
PLEASE SEE UPDATE
ORIGINAL
Using the builtin re module.
p = re.search(r'(?<=s3:\/\/tester\/).+', s).group()
The pattern uses a lookbehind to skip over the part you wish to ignore and matches any and all characters following it until the entire string is consumed, returning the matched group to the p variable for further processing.
This code will work for any length path following the explicit s3://tester/ schema you provided in your question.
UPDATE
Just saw updates duh.
Got the wrong end of the stick on this one, my bad.
Below re method should work no matter S3 variable, returning all after third / in string.
p = ''.join(re.findall(r'\/[^\/]+', s)[1:])[1:]
I have this url:
http://www.example.com/en/news/2016/07/17/1207151/%D9%81%D8%AA%D9%88%D8%A7%DB%8C-%D8%B1%D9%87%D8%A8%D8%B1-
I am going to extract 1207151 here.
here is my regext:
pattern = '(http[s]?:\/\/)?([^\/\s]+\/)+[^/]+[^/]+[^/]+[^/]/(?<field1>[^/]+)/'
but it's wrong!
what is my mistake?
You can use this regex in python code:
>>> url = 'http://www.example.com/en/news/2016/07/17/1207151/%D9%81%D8%AA%D9%88%D8%A7%DB%8C-%D8%B1%D9%87%D8%A8%D8%B1-'
>>> re.search(r'^https?://(?:([^/]+)/){7}', url).group(1)
'1207151'
([^/]+)/){7} will match 1 or more of any non-forward-slash and a / 7 times, giving us last match in captured group #1.
You've got a couple things going on.
First you'll need to properly escape all of your /s. You've got most of them, but missed a couple:
(http[s]?:\/\/)?([^\/\s]+\/)+[^\/]+[^\/]+[^\/]+[^\/]\/(?<field1>[^\/]+)\/
From here, you have a number of "1 or more not /" in a row that can be reduced:
[^\/]+[^\/]+[^\/]+ ==> [^\/]{3,}
But that's not what you meant to do, you meant to have many blocks of "non /" followed by a "/" and based on your example, you want it 6 times before using your named capture group.
([^\/]+\/){6}
Here's what works:
http[s]?:\/\/([^\/]+\/){6}(?<field1>[^\/]+)\/
And you can see it in action here: https://regex101.com/r/kkqwRJ/2
import re
print re.search(r'.*/([^/]+)/.*',s).group(1)
I am a newbie in python and I am trying to cut piece of string in another string at python.
I looked at other similar questions but I could not find my answer.
I have a variable which contain a domain list which the domains look like this :
http://92.230.38.21/ios/Channel767/Hotbird.mp3
http://92.230.38.21/ios/Channel9798/Coldbird.mp3
....
I want the mp3 file name (in this example Hotbird, Coldbird etc)
I know I must be able to do it with re.findall() but I have no idea about regular expressions I need to use.
Any idea?
Update:
Here is the part I used:
for final in match2:
netname=re.findall('\W+\//\W+\/\W+\/\W+\/\W+', final)
print final
print netname
Which did not work. Then I tried to do this one which only cut the ip address (92.230.28.21) but not the name:
for final in match2:
netname=re.findall('\d+\.\d+\.\d+\.\d+', final)
print final
You may just use str.split():
>>> urls = ["http://92.230.38.21/ios/Channel767/Hotbird.mp3", "http://92.230.38.21/ios/Channel9798/Coldbird.mp3"]
>>> for url in urls:
... print(url.split("/")[-1].split(".")[0])
...
Hotbird
Coldbird
And here is an example regex-based approach:
>>> import re
>>>
>>> pattern = re.compile(r"/(\w+)\.mp3$")
>>> for url in urls:
... print(pattern.search(url).group(1))
...
Hotbird
Coldbird
where we are using a capturing group (\w+) to capture the mp3 filename consisting of one or more aplhanumeric characters which is followed by a dot, mp3 at the end of the url.
How about ?
([^/]*mp3)$
I think that might work
Basically it says...
Match from the end of the line, start with mp3, then match everything back to the first slash.
Think it will perform well.
So I read lines with urllib and it returns items with lots of information in it. I want to extract a link from each item. So I want to get the part of the string that begins with "http" and ends with ".mp3". I'm stuck.
You can use regular expressions for what you describe:
In [48]: s='Link: http://google.com/song.mp3 Another link, http://yahoo.com/another_song.mp3'
In [49]: re.findall('http.*?mp3', s)
Out[49]: ['http://google.com/song.mp3', 'http://yahoo.com/another_song.mp3']
I am scraping a page with Python and BeautifulSoup library.
I have to get the URL only from this string. This actually is in href attribute of the a tag. I have scraped it but cannot seem to find a way to extract the URL from this
javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');
You can write a straightforward regex to extract the URL.
>>> import re
>>> href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
>>> re.findall(r"'(.*?)'", href)
['/Sheraton-Tucson-Hotel-177/tnc/150/24795/en', 'TC_POPUP', 'width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no']
>>> _[0]
'/Sheraton-Tucson-Hotel-177/tnc/150/24795/en'
The regex in question here is
'(.*?)'
Which reads "find a single-quote, followed by whatever (and capture the whatever), followed by another single quote, and do so non-greedily because of the ? operator". This extracts the arguments of window.open; then, just pick the first one to get the URL.
You shouldn't have any nested ' in your href, since those should be escaped to %27. If you do, though, this will not work, and you may need a solution that doesn't use regexes.
I did it that way.
terms = javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');
terms.split("('")[1].split("','")[0]
outputs
/Sheraton-Tucson-Hotel-177/tnc/150/24795/en
Instead of a regex, you could just partition it twice on something, (eg: '):
s.partition("'")[2].partition("'")[0]
# /Sheraton-Tucson-Hotel-177/tnc/150/24795/en
Here's a quick and ugly answer
href.split("'")[1]