If I open a file using urllib2, like so:
remotefile = urllib2.urlopen('http://example.com/somefile.zip')
Is there an easy way to get the file name other than parsing the original URL?
EDIT: changed openfile to urlopen... not sure how that happened.
EDIT2: I ended up using:
filename = url.split('/')[-1].split('#')[0].split('?')[0]
Unless I'm mistaken, this should strip out all potential queries as well.
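For example, on a hypothetical URL carrying both a query and a fragment:
>>> url = 'http://example.com/somefile.zip?foo=bar#frag'
>>> url.split('/')[-1].split('#')[0].split('?')[0]
'somefile.zip'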
Did you mean urllib2.urlopen?
You could potentially lift the intended filename if the server is sending a Content-Disposition header by checking remotefile.info()['Content-Disposition'], but as it is I think you'll just have to parse the URL.
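A minimal sketch of that check, assuming Python 2's urllib2 (getheader returns None when the header is absent):
import urllib2
remotefile = urllib2.urlopen('http://example.com/somefile.zip')
# None when the server sends no Content-Disposition header
disposition = remotefile.info().getheader('Content-Disposition')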
You could use urlparse.urlsplit, but if you have any URLs like the second example below, you'll still end up having to pull the file name out yourself anyway:
>>> urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')
>>> urlparse.urlsplit('http://example.com/somedir/somefile.zip')
('http', 'example.com', '/somedir/somefile.zip', '', '')
Might as well just do this:
>>> 'http://example.com/somefile.zip'.split('/')[-1]
'somefile.zip'
>>> 'http://example.com/somedir/somefile.zip'.split('/')[-1]
'somefile.zip'
If you only want the file name itself, assuming there are no query variables at the end like http://example.com/somedir/somefile.zip?foo=bar, then you can use os.path.basename for this:
[user@host]$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.basename("http://example.com/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip?foo=bar")
'somefile.zip?foo=bar'
Some other posters mentioned using urlparse, which will work, but you'd still need to strip the leading directory from the file name. If you use os.path.basename() then you don't have to worry about that, since it returns only the final part of the URL or file path.
I think that "the file name" isn't a very well defined concept when it comes to http transfers. The server might (but is not required to) provide one as "content-disposition" header, you can try to get that with remotefile.headers['Content-Disposition']. If this fails, you probably have to parse the URI yourself.
Just saw this. I normally do:
filename = url.split("?")[0].split("/")[-1]
Using urlsplit is the safest option:
url = 'http://example.com/somefile.zip'
urlparse.urlsplit(url).path.split('/')[-1]
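For instance, on a hypothetical URL with a query string and fragment, the path component stays clean:
>>> url = 'http://example.com/somedir/somefile.zip?foo=bar#frag'
>>> urlparse.urlsplit(url).path.split('/')[-1]
'somefile.zip'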
Do you mean urllib2.urlopen? There is no function called openfile in the urllib2 module.
Anyway, use the urllib2.urlparse functions:
>>> from urllib2 import urlparse
>>> print urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')
Voila.
You could also combine the two best-rated answers:
Using urllib2.urlparse.urlsplit() to get the path part of the URL, and then os.path.basename for the actual file name.
Full code would be:
>>> remotefile = urllib2.urlopen(url)
>>> try:
...     filename = remotefile.info()['Content-Disposition']
... except KeyError:
...     filename = os.path.basename(urllib2.urlparse.urlsplit(url).path)
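Note that the Content-Disposition value is usually the whole 'attachment; filename="somefile.zip"' string rather than a bare name, so an extra extraction step is needed. A rough sketch (not a full RFC 6266 parser):
disposition = remotefile.info()['Content-Disposition']
if 'filename=' in disposition:
    # e.g. 'attachment; filename="somefile.zip"' -> 'somefile.zip'
    filename = disposition.split('filename=')[-1].strip('"; ')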
The os.path.basename function works not only for file paths but also for URLs, so you don't have to parse the URL yourself. Also, it's important to note that you should use result.url instead of the original url in order to follow redirect responses:
import os
import urllib2
result = urllib2.urlopen(url)
real_url = urllib2.urlparse.urlparse(result.url)
filename = os.path.basename(real_url.path)
I guess it depends what you mean by parsing. There is no way to get the filename without parsing the URL, i.e. the remote server doesn't give you a filename. However, you don't have to do much yourself; there's the urlparse module:
In [9]: urlparse.urlparse('http://example.com/somefile.zip')
Out[9]: ('http', 'example.com', '/somefile.zip', '', '', '')
Not that I know of.
But you can parse it easily enough like this:
url = 'http://example.com/somefile.zip'
print url.split('/')[-1]
Using requests, though you can do it just as easily with urllib(2):
import requests
from urllib import unquote
from urlparse import urlparse

filename = False
sample = requests.get(url)
if sample.status_code == 200:
    # has_key does not work here; the 'in' check also helps avoid problems with names
    if not filename:
        if 'content-disposition' in sample.headers:
            filename = sample.headers['content-disposition'].split('filename=')[-1].replace('"', '').replace(';', '')
        else:
            filename = urlparse(sample.url).query.split('/')[-1].split('=')[-1].split('&')[-1]
    if not filename:
        if url.split('/')[-1] != '':
            filename = sample.url.split('/')[-1].split('=')[-1].split('&')[-1]
    if filename:
        filename = unquote(filename)
You can probably use a simple regular expression here. Something like:
In [26]: import re
In [27]: pat = re.compile('.+[\/\?#=]([\w-]+\.[\w-]+(?:\.[\w-]+)?$)')
In [28]: test_set
Out[28]:
['http://www.google.com/a341.tar.gz',
'http://www.google.com/a341.gz',
'http://www.google.com/asdasd/aadssd.gz',
'http://www.google.com/asdasd?aadssd.gz',
'http://www.google.com/asdasd#blah.gz',
'http://www.google.com/asdasd?filename=xxxbl.gz']
In [30]: for url in test_set:
....: match = pat.match(url)
....: if match and match.groups():
....: print(match.groups()[0])
....:
a341.tar.gz
a341.gz
aadssd.gz
aadssd.gz
blah.gz
xxxbl.gz
Using PurePosixPath (from pathlib, Python 3.4+), which is not operating-system-dependent and handles URLs gracefully, is the pythonic solution:
>>> from pathlib import PurePosixPath
>>> path = PurePosixPath('http://example.com/somefile.zip')
>>> path.name
'somefile.zip'
>>> path = PurePosixPath('http://example.com/nested/somefile.zip')
>>> path.name
'somefile.zip'
Notice how there is no network traffic here or anything (i.e. those URLs don't go anywhere); this just uses standard parsing rules.
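One caveat: PurePosixPath knows nothing about query strings or fragments, so on a URL like somefile.zip?foo=bar the .name would still carry the ?foo=bar part. A small sketch, combining it with urlsplit to strip those first:
>>> from urllib.parse import urlsplit
>>> PurePosixPath(urlsplit('http://example.com/somefile.zip?foo=bar').path).name
'somefile.zip'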
import os,urllib2
resp = urllib2.urlopen('http://www.example.com/index.html')
my_url = resp.geturl()
os.path.split(my_url)[1]
# 'index.html'
This is not openfile, but maybe still helps :)
I have some links stored in a file which looks like this:
http://r14---sn-p5qlsnss.googlevideo.com/videoplayback?itag=22&id=o-AOtM1kWozUiJKP2ENWH989ZIfJaZNPVvXTrBkXx40lG5&key=yt6&ip=159.253.144.86&lmt=1480060612064057&dur=1047.870&mv=m&source=youtube&ms=au&ei=DtN8WLfwFsKb1gKXho6YDw&expire=1484597102&mn=sn-p5qlsnss&mm=31&ipbits=0&nh=IgpwcjAzLmlhZDA3KgkxMjcuMC4wLjE&initcwndbps=4717500&mt=1484575249&pl=24&signature=1ECAB2B56C30CBF760721A1A26A7E80963DB36B8.6336B2C9C41DB53C8FA1D2A037793275F57C4825&ratebypass=yes&mime=video%2Fmp4&upn=tUcEt34Qe6c&sparams=dur%2Cei%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Csource%2Cupn%2Cexpire&title=600ft+UFO+Crash+Site+Discovered+On+Mars%21+11%2F23%2F16
At the end of the link we have the video's title. I want to read this link from a file and get the video's title in a proper format (with those '+' and '%' signs properly resolved). How do I do that?
I cannot use raw cgi as suggested here since the link is read from a file and not submitted by a form. Any idea?
There's the super convenient urllib.parse.parse_qs for Python 3, but if you're using Python 2, you might have to dig out the title string first.
import urllib
url = 'http://r14---sn-p5qlsnss.googlevideo.com/videoplayback?itag=22&id=o-AOtM1kWozUiJKP2ENWH989ZIfJaZNPVvXTrBkXx40lG5&key=yt6&ip=159.253.144.86&lmt=1480060612064057&dur=1047.870&mv=m&source=youtube&ms=au&ei=DtN8WLfwFsKb1gKXho6YDw&expire=1484597102&mn=sn-p5qlsnss&mm=31&ipbits=0&nh=IgpwcjAzLmlhZDA3KgkxMjcuMC4wLjE&initcwndbps=4717500&mt=1484575249&pl=24&signature=1ECAB2B56C30CBF760721A1A26A7E80963DB36B8.6336B2C9C41DB53C8FA1D2A037793275F57C4825&ratebypass=yes&mime=video%2Fmp4&upn=tUcEt34Qe6c&sparams=dur%2Cei%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Csource%2Cupn%2Cexpire&title=600ft+UFO+Crash+Site+Discovered+On+Mars%21+11%2F23%2F16'
title = url[url.rfind('&title=') + 7:]
print urllib.unquote_plus(title)
Note: thanks to bereal for pointing out that parse_qs is also available in Python 2, so just:
import urlparse
print urlparse.parse_qs(url)['title'][0]
600ft UFO Crash Site Discovered On Mars! 11/23/16
You could use urllib.parse.parse_qs and give it the string:
In [17]: urllib.parse.parse_qs(s)
Out[17]:
{'dur': ['1047.870'],
'ei': ['DtN8WLfwFsKb1gKXho6YDw'],
'expire': ['1484597102'],
'http://r14---sn-p5qlsnss.googlevideo.com/videoplayback?itag': ['22'],
[.. and so on ..]
'source': ['youtube'],
'sparams': ['dur,ei,id,initcwndbps,ip,ipbits,itag,lmt,mime,mm,mn,ms,mv,nh,pl,ratebypass,source,upn,expire'],
'title': ['600ft UFO Crash Site Discovered On Mars! 11/23/16'],
'upn': ['tUcEt34Qe6c']}
In [18]: urllib.parse.parse_qs(s)["title"][0]
Out[18]: '600ft UFO Crash Site Discovered On Mars! 11/23/16'
Purl can fit your needs:
import purl
u = purl.URL('http://r14---sn-p5qlsnss.googlevideo.com/videoplayback?itag=22&id=o-AOtM1kWozUiJKP2ENWH989ZIfJaZNPVvXTrBkXx40lG5&key=yt6&ip=159.253.144.86&lmt=1480060612064057&dur=1047.870&mv=m&source=youtube&ms=au&ei=DtN8WLfwFsKb1gKXho6YDw&expire=1484597102&mn=sn-p5qlsnss&mm=31&ipbits=0&nh=IgpwcjAzLmlhZDA3KgkxMjcuMC4wLjE&initcwndbps=4717500&mt=1484575249&pl=24&signature=1ECAB2B56C30CBF760721A1A26A7E80963DB36B8.6336B2C9C41DB53C8FA1D2A037793275F57C4825&ratebypass=yes&mime=video%2Fmp4&upn=tUcEt34Qe6c&sparams=dur%2Cei%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Csource%2Cupn%2Cexpire&title=600ft+UFO+Crash+Site+Discovered+On+Mars%21+11%2F23%2F16')
print(u.query_param('title'))
Use urlparse.parse_qs:
try:
    import urlparse  # Python 2
except ImportError:
    from urllib import parse as urlparse  # Python 3

rv = urlparse.parse_qs(link)
title = rv['title'][0]
import urllib
a = "http://r14---sn-p5qlsnss.googlevideo.com/videoplayback?itag=22&id=o-AOtM1kWozUiJKP2ENWH989ZIfJaZNPVvXTrBkXx40lG5&key=yt6&ip=159.253.144.86&lmt=1480060612064057&dur=1047.870&mv=m&source=youtube&ms=au&ei=DtN8WLfwFsKb1gKXho6YDw&expire=1484597102&mn=sn-p5qlsnss&mm=31&ipbits=0&nh=IgpwcjAzLmlhZDA3KgkxMjcuMC4wLjE&initcwndbps=4717500&mt=1484575249&pl=24&signature=1ECAB2B56C30CBF760721A1A26A7E80963DB36B8.6336B2C9C41DB53C8FA1D2A037793275F57C4825&ratebypass=yes&mime=video%2Fmp4&upn=tUcEt34Qe6c&sparams=dur%2Cei%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Csource%2Cupn%2Cexpire&title=600ft+UFO+Crash+Site+Discovered+On+Mars%21+11%2F23%2F16"
b = a.split('=')[-1]
print urllib.unquote_plus(b)
How do I truncate the below URLs right after the domain's "com" using Python, i.e. keep youtube.com only?
youtube.com/video/AiL6nL
yahoo.com/video/Hhj9B2
youtube.com/video/MpVHQ
google.com/video/PGuTN
youtube.com/video/VU34MI
Is it possible to truncate like this?
Check out Python's urlparse library. It is part of the standard library, so nothing else needs to be installed.
So you could do the following:
import urlparse
import re

# the URLs from the question
url_list = ['youtube.com/video/AiL6nL', 'yahoo.com/video/Hhj9B2',
            'youtube.com/video/MpVHQ', 'google.com/video/PGuTN',
            'youtube.com/video/VU34MI']

def check_and_add_http(url):
    # checks if 'http://' is present at the start of the URL and adds it if not
    http_regex = re.compile(r'^http[s]?://')
    if http_regex.match(url):
        # 'http://' or 'https://' is already present
        return url
    else:
        # add 'http://' for urlparse to work
        return 'http://' + url

for url in url_list:
    url = check_and_add_http(url)
    print(urlparse.urlsplit(url)[1])
You can read more about urlsplit() in the documentation, including the indexes if you want to read the other parts of the URL.
You can use split():
myUrl.split(r"/")[0]
to get "youtube.com"
and:
myUrl.split(r"/", 1)[1]
to get everything else
I'd use the function urlsplit from the standard library:
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2
myurl = "http://docs.python.org/2/library/urlparse.html"
urlsplit(myurl)[1] # returns 'docs.python.org'
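Since urlsplit returns a SplitResult namedtuple, attribute access works too and reads a little better:
urlsplit(myurl).netloc # also 'docs.python.org'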
No library function can tell that those strings are supposed to be absolute URLs, since, formally, they are relative ones. So, you have to prepend //.
>>> url = 'youtube.com/bla/foo'
>>> urlparse.urlsplit('//' + url)[1]
'youtube.com'
Just a crazy alternative solution using tldextract:
>>> import tldextract
>>> ext = tldextract.extract('youtube.com/video/AiL6nL')
>>> ".".join(ext[1:3])
'youtube.com'
For your particular input, you could use str.partition() or str.split():
print('youtube.com/video/AiL6nL'.partition('/')[0])
# -> youtube.com
Note: the urlparse module (which you could use in general to parse a URL) doesn't work in this case:
import urlparse
urlparse.urlsplit('youtube.com/video/AiL6nL')
# -> SplitResult(scheme='', netloc='', path='youtube.com/video/AiL6nL',
# query='', fragment='')
In general, it is safe to use a regex here if you know that all lines start with a hostname and each line otherwise contains a well-formed URI:
import re
print("\n".join(re.findall(r"(?m)^\s*([^\/?#]*)", text)))
Output
youtube.com
yahoo.com
youtube.com
google.com
youtube.com
Note: it doesn't remove the optional port part -- host:port.
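If the port should go too, adding ':' to the excluded characters handles it; a small variant of the same regex, on a hypothetical host:port line:
import re
text = "youtube.com:8080/video/AiL6nL"
print("\n".join(re.findall(r"(?m)^\s*([^\/?#:]*)", text)))
# youtube.com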
I'm quite new to Python. I'm trying to parse a file of URLs to leave only the domain name.
Some of the URLs in my log file begin with http:// and some begin with www. Some begin with both.
This is the part of my code which strips the http:// part. What do I need to add to it to look for both http and www. and remove both?
line = re.findall(r'(https?://\S+)', line)
Currently, when I run the code, only http:// is stripped. If I change the code to the following:
line = re.findall(r'(https?://www.\S+)', line)
Only domains starting with both are affected.
I need the code to be more conditional.
TIA
edit... here is my full code...
import re
import sys
from urlparse import urlparse
f = open(sys.argv[1], "r")
for line in f.readlines():
    line = re.findall(r'(https?://\S+)', line)
    if line:
        parsed = urlparse(line[0])
        print parsed.hostname
f.close()
I mistagged my original post as regex; it is indeed using urlparse.
It might be overkill for this specific situation, but I'd generally use urlparse.urlsplit (Python 2) or urllib.parse.urlsplit (Python 3).
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2
import re

url = 'www.python.org'

# URLs must have a scheme:
# www.python.org is an invalid URL,
# http://www.python.org is valid
if not re.match(r'http(s?):', url):
    url = 'http://' + url
# url is now 'http://www.python.org'

parsed = urlsplit(url)
# parsed.scheme is 'http'
# parsed.netloc is 'www.python.org'
# parsed.path is '' since no path was given

host = parsed.netloc  # www.python.org

# Removing www. is a bad idea, because www.python.org
# could resolve to something different than python.org
if host.startswith('www.'):
    host = host[4:]
You can do without regexes here.
with open("file_path","r") as f:
lines = f.read()
lines = lines.replace("http://","")
lines = lines.replace("www.", "") # May replace some false positives ('www.com')
urls = [url.split('/')[0] for url in lines.split()]
print '\n'.join(urls)
Example file input:
http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
www.foobar.com
Output:
foo.com
foobar.com
bar.com
foobar.com
Edit:
There could be a tricky URL like foobarwww.com, and the above approach would strip its www. part. We would then have to fall back to using regexes.
Replace the line lines = lines.replace("www.", "") with lines = re.sub(r'(www\.)(?!com)', r'', lines). Of course, every possible TLD should be used for the not-match pattern.
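A rough sketch of that idea (the TLD list here is illustrative, not exhaustive):
import re
# strip 'www.' only when it is not immediately followed by a TLD,
# so a hostname like foobarwww.com is left alone
lines = re.sub(r'www\.(?!(?:com|org|net|edu)\b)', '', lines)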
I came across the same problem. This is a solution based on regular expressions:
>>> import re
>>> rec = re.compile(r"https?://(www\.)?")
>>> rec.sub('', 'https://domain.com/bla/').strip().strip('/')
'domain.com/bla'
>>> rec.sub('', 'https://domain.com/bla/ ').strip().strip('/')
'domain.com/bla'
>>> rec.sub('', 'http://domain.com/bla/ ').strip().strip('/')
'domain.com/bla'
>>> rec.sub('', 'http://www.domain.com/bla/ ').strip().strip('/')
'domain.com/bla'
Check out the urlparse library, which can do these things for you automatically.
>>> urlparse.urlsplit('http://www.google.com.au/q?test')
SplitResult(scheme='http', netloc='www.google.com.au', path='/q', query='test', fragment='')
You can use urlparse. Also, the solution should be generic enough to remove things other than 'www' before the domain name (i.e., handle cases like server1.domain.com). The following is a quick try that should work:
from urlparse import urlparse
url = 'http://www.muneeb.org/files/alan_turing_thesis.jpg'
o = urlparse(url)
domain = o.hostname
temp = domain.rsplit('.')
if len(temp) == 3:
    domain = temp[1] + '.' + temp[2]
print domain
I believe @Muneeb Ali is nearest to the solution, but the problem appears with something like frontdomain.domain.co.uk.
I suppose:
domain = ""
for i in range(1, len(temp) - 1):
    domain += temp[i] + "."
domain += temp[-1]
Is there a nicer way to do this?
I'm doing this:
urlparse.urljoin('http://example.com/mypage', '?name=joe')
And I get this:
'http://example.com/?name=joe'
While I want to get this:
'http://example.com/mypage?name=joe'
What am I doing wrong?
You could use urlparse.urlunparse:
import urlparse
parsed = list(urlparse.urlparse('http://example.com/mypage'))
parsed[4] = 'name=joe'  # index 4 is the query component
urlparse.urlunparse(parsed)
You're experiencing a known bug which affects Python 2.4-2.6.
If you can't change or patch your version of Python, #jd's solution will work around the issue.
However, if you need a more generic solution that works as a standard urljoin would, you can use a wrapper method which implements the workaround for that specific use case, and default to the standard urljoin() otherwise.
For example:
import urlparse
def myurljoin(base, url, allow_fragments=True):
    if url[0] != "?":
        return urlparse.urljoin(base, url, allow_fragments)
    if not allow_fragments:
        url = url.split("#", 1)[0]
    parsed = list(urlparse.urlparse(base))
    parsed[4] = url[1:]  # assign the query field (index 4 of the 6-tuple)
    return urlparse.urlunparse(parsed)
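A quick check of the wrapper against the question's inputs:
>>> myurljoin('http://example.com/mypage', '?name=joe')
'http://example.com/mypage?name=joe'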
I solved it by bundling Python 2.6's urlparse module with my project. I also had to bundle namedtuple which was defined in collections, since urlparse uses it.
Are you sure? On Python 2.7:
>>> import urlparse
>>> urlparse.urljoin('http://example.com/mypage', '?name=joe')
'http://example.com/mypage?name=joe'
>>> import httplib
>>> conn = httplib.HTTPConnection("www.google.com")
>>> conn.request("HEAD", "/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK
This code will get the HTTP status code. However, notice that I split up "google.com" and "/index.html" on two lines, which is confusing.
What if I want to find the status code of just a general URL?
http://mydomain.com/sunny/boo.avi
http://anotherdomain.com/podcast.mp3
http://anotherdomain.com/rss/fee.xml
Can't I just stick the URL into it, and make it work?
Edit: I cannot use urllib, because I don't want to download the file.
Alternatively, if you expect that actually downloading the data is problematic and you really need the HEAD method, you could parse the URL using urlparse:
>>> import httplib
>>> import urlparse
>>> url = "http://www.google.com/index.html"
>>> (scheme, netloc, path, params, query, fragment) = urlparse.urlparse(url)
>>> conn = httplib.HTTPConnection(netloc)
>>> conn.request("HEAD", urlparse.urlunparse(('', '', path, params, query, fragment))
>>> res = conn.getresponse()
>>> print res.status, res.reason
302 Found
And wrap this into a function taking the URL as an argument.
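A sketch of that function with Python 2 names (head_status is just an illustrative name):
import httplib
import urlparse

def head_status(url):
    # issue a HEAD request for the given URL and return the status code
    parts = urlparse.urlparse(url)
    conn = httplib.HTTPConnection(parts.netloc)
    conn.request("HEAD", urlparse.urlunparse(('', '', parts.path, parts.params, parts.query, parts.fragment)))
    return conn.getresponse().status
The session above then becomes head_status("http://www.google.com/index.html").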
Maybe you are better off using the URL library instead?
In Python 2, use urllib2:
>>> import urllib2
>>> url = urllib2.urlopen("http://www.google.com/index.html")
>>> url.getcode()
200
In Python 3, use urllib.request:
>>> import urllib.request
>>> url = urllib.request.urlopen("http://www.google.com/index.html")
>>> url.getcode()
200
The HTTPConnection constructor takes a server argument (with an optional port). You have to split the server from the resource you actually want.
For a simpler way to download web resources directly, you could go with urllib2 but urllib2 only supports GET or POST methods, no HEAD, so you end up downloading the whole resource.
According to the spec you're supposed to split it up like that. Maybe Python could abstract that out for you a bit, but they're probably just giving you straight access to the header so you know exactly how it's being formatted, which is really the preference.
Keep in mind that not all web servers support HEAD on each resource so you'll end up retrieving the resource anyway. You should write code accordingly.
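One way to write code accordingly is to try HEAD and fall back to GET when the server rejects it. A rough sketch using the common trick of overriding urllib2.Request.get_method (status_with_fallback is an illustrative name):
import urllib2

def status_with_fallback(url):
    # try HEAD first; fall back to GET if the server doesn't allow it
    request = urllib2.Request(url)
    request.get_method = lambda: 'HEAD'
    try:
        return urllib2.urlopen(request).getcode()
    except urllib2.HTTPError as e:
        if e.code == 405:  # Method Not Allowed
            return urllib2.urlopen(url).getcode()
        return e.code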
I like urllib2, sample code:
import urllib2
res = urllib2.urlopen('http://google.com/index.html')
res.getcode()  # contains the status code
If something went wrong, you'll get an exception you can catch.
EDIT: Thanks, changed res.code to res.getcode(), since the second one is documented.