Handling web links in Python

Handling web links in Python - python

I have some links stored in a file which looks like this:
http://r14---sn-p5qlsnss.googlevideo.com/videoplayback?itag=22&id=o-AOtM1kWozUiJKP2ENWH989ZIfJaZNPVvXTrBkXx40lG5&key=yt6&ip=159.253.144.86&lmt=1480060612064057&dur=1047.870&mv=m&source=youtube&ms=au&ei=DtN8WLfwFsKb1gKXho6YDw&expire=1484597102&mn=sn-p5qlsnss&mm=31&ipbits=0&nh=IgpwcjAzLmlhZDA3KgkxMjcuMC4wLjE&initcwndbps=4717500&mt=1484575249&pl=24&signature=1ECAB2B56C30CBF760721A1A26A7E80963DB36B8.6336B2C9C41DB53C8FA1D2A037793275F57C4825&ratebypass=yes&mime=video%2Fmp4&upn=tUcEt34Qe6c&sparams=dur%2Cei%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Csource%2Cupn%2Cexpire&title=600ft+UFO+Crash+Site+Discovered+On+Mars%21+11%2F23%2F16
At the end of the link we have the video's title. I want to read this link from a file and get the video's title in a proper format (with those '+' and '%' signs properly resolved). How do I do that?
I cannot use raw cgi as suggested here since the link is read from a file and not submitted by a form. Any idea?

There's super convenient urllib.parse.parse_qs for python 3, but if you're using python 2, you might have to dig out the title string first.
import urllib
url = 'http://r14---sn-p5qlsnss.googlevideo.com/videoplayback?itag=22&id=o-AOtM1kWozUiJKP2ENWH989ZIfJaZNPVvXTrBkXx40lG5&key=yt6&ip=159.253.144.86&lmt=1480060612064057&dur=1047.870&mv=m&source=youtube&ms=au&ei=DtN8WLfwFsKb1gKXho6YDw&expire=1484597102&mn=sn-p5qlsnss&mm=31&ipbits=0&nh=IgpwcjAzLmlhZDA3KgkxMjcuMC4wLjE&initcwndbps=4717500&mt=1484575249&pl=24&signature=1ECAB2B56C30CBF760721A1A26A7E80963DB36B8.6336B2C9C41DB53C8FA1D2A037793275F57C4825&ratebypass=yes&mime=video%2Fmp4&upn=tUcEt34Qe6c&sparams=dur%2Cei%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Csource%2Cupn%2Cexpire&title=600ft+UFO+Crash+Site+Discovered+On+Mars%21+11%2F23%2F16'
title = url[url.rfind('&title=') + 7:]
print urllib.unquote_plus(title)
Note: thanks to bereal for pointing out parse_qs is also available in python 2, so just:
import urlparse
print urlparse.parse_qs(url)['title'][0]
'600ft UFO Crash Site Discovered On Mars! 11/23/16'

You could use urllib.parse.parse_qs and give it the string:
In [17]: urllib.parse.parse_qs(s)
Out[17]:
{'dur': ['1047.870'],
'ei': ['DtN8WLfwFsKb1gKXho6YDw'],
'expire': ['1484597102'],
'http://r14---sn-p5qlsnss.googlevideo.com/videoplayback?itag': ['22'],
[.. and so on ..]
'source': ['youtube'],
'sparams': ['dur,ei,id,initcwndbps,ip,ipbits,itag,lmt,mime,mm,mn,ms,mv,nh,pl,ratebypass,source,upn,expire'],
'title': ['600ft UFO Crash Site Discovered On Mars! 11/23/16'],
'upn': ['tUcEt34Qe6c']}
In [18]: urllib.parse.parse_qs(s)["title"][0]
Out[18]: '600ft UFO Crash Site Discovered On Mars! 11/23/16'

Purl can fit your needs:
import purl
u = purl.URL('http://r14---sn-p5qlsnss.googlevideo.com/videoplayback?itag=22&id=o-AOtM1kWozUiJKP2ENWH989ZIfJaZNPVvXTrBkXx40lG5&key=yt6&ip=159.253.144.86&lmt=1480060612064057&dur=1047.870&mv=m&source=youtube&ms=au&ei=DtN8WLfwFsKb1gKXho6YDw&expire=1484597102&mn=sn-p5qlsnss&mm=31&ipbits=0&nh=IgpwcjAzLmlhZDA3KgkxMjcuMC4wLjE&initcwndbps=4717500&mt=1484575249&pl=24&signature=1ECAB2B56C30CBF760721A1A26A7E80963DB36B8.6336B2C9C41DB53C8FA1D2A037793275F57C4825&ratebypass=yes&mime=video%2Fmp4&upn=tUcEt34Qe6c&sparams=dur%2Cei%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Csource%2Cupn%2Cexpire&title=600ft+UFO+Crash+Site+Discovered+On+Mars%21+11%2F23%2F16')
print(u.query_param('title'))

use urlparse.parse_qs:
try:
from urlparse import urlparse # for python2
except:
from urllib import parse as urlparse # for python3
rv = urlparse.parse_qs(link)
title = rv['title'][0]

import urllib
a = "http://r14---sn-p5qlsnss.googlevideo.com/videoplayback?itag=22&id=o-AOtM1kWozUiJKP2ENWH989ZIfJaZNPVvXTrBkXx40lG5&key=yt6&ip=159.253.144.86&lmt=1480060612064057&dur=1047.870&mv=m&source=youtube&ms=au&ei=DtN8WLfwFsKb1gKXho6YDw&expire=1484597102&mn=sn-p5qlsnss&mm=31&ipbits=0&nh=IgpwcjAzLmlhZDA3KgkxMjcuMC4wLjE&initcwndbps=4717500&mt=1484575249&pl=24&signature=1ECAB2B56C30CBF760721A1A26A7E80963DB36B8.6336B2C9C41DB53C8FA1D2A037793275F57C4825&ratebypass=yes&mime=video%2Fmp4&upn=tUcEt34Qe6c&sparams=dur%2Cei%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Csource%2Cupn%2Cexpire&title=600ft+UFO+Crash+Site+Discovered+On+Mars%21+11%2F23%2F16"
b = a.split('=')[-1]
print urllib.unquote_plus(b)

Related

Python- regular expression to print word within link

I am using Jupyter Notebook to get docid=PE209374738 as my output using reg ex. It is currently stored in a dictionary in this format:
{'Url': 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'}.
This is my code:
results= xmldoc.getElementsByTagName("result")
dict= {}
for a in results:
url= 'Url'
dict[url] = a.getElementsByTagName("url")[0].childNodes[0].nodeValue
docid= re.search(r'\?(.*?)&')
Does anyone have any suggestions on how to print that id?

The standard library already has methods for parsing URLs properly, no need for regex.
In Python 3:
from urllib.parse import urlparse, parse_qs
url = 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'
print(parse_qs(urlparse(url).query)['docid'][0]) # PE209374738
In Python 2 the first line is:
from urlparse import urlparse, parse_qs

#alex-hall is correct, you probably should better parse this using a proper URL parser.
That said, your original question was about doing it with using regexps, so here is the solution (which you nearly nailed already):
s = 'https://backtoschool.com/document.php?docid=PE209374738&datasource=PHE&vid=3326&referrer=api'
m = re.search(r'\?docid=(.*?)&', s)
print m.groups()[0]
This will print the desired PE209374738.

Weird json value urllib python

I'm trying to manipulate a dynamic JSON from this site:
http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do
It has 3 elements, imagem, a base64, labelValorCaptcha, just a message, and uuidCaptcha, a value to pass by parameter to play a sound in this link bellow:
http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha=sajcaptcha_e7b072e1fce5493cbdc46c9e4738ab8a
When I enter in the first site through a browser and put in the second link the uuidCaptha after the equal ("..uuidCaptcha="), the sound plays normally. I wrote a simple code to catch this elements.
import urllib, json
url = "http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do"
response = urllib.urlopen(url)
data = json.loads(response.read())
urlSound = "http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha="
print urlSound + data['uuidCaptcha']
But I dont know what's happening, the caught value of the uuidCaptcha doesn't work. Open a error web page.
Someone knows?
Thanks!

It works for me.
$ cat a.py
#!/usr/bin/env python
# encoding: utf-8
import urllib, json
url = "http://esaj.tjsc.jus.br/cposgtj/imagemCaptcha.do"
response = urllib.urlopen(url)
data = json.loads(response.read())
urlSound = "http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha="
print urlSound + data['uuidCaptcha']
$ python a.py
http://esaj.tjsc.jus.br/cposgtj/somCaptcha.do?timestamp=1455996420264&uuidCaptcha=sajcaptcha_efc8d4bc3bdb428eab8370c4e04ab42c

As I said #Charlie Harding, the best way is download the page and get the JSON values, because this JSON is dynamic and need an opened web link to exist.
More info here.

Python's `urlparse`: Adding GET keywords to a URL

I'm doing this:
urlparse.urljoin('http://example.com/mypage', '?name=joe')
And I get this:
'http://example.com/?name=joe'
While I want to get this:
'http://example.com/mypage?name=joe'
What am I doing wrong?

You could use urlparse.urlunparse :
import urlparse
parsed = list(urlparse.urlparse('http://example.com/mypage'))
parsed[4] = 'name=joe'
urlparse.urlunparse(parsed)

You're experiencing a known bug which affects Python 2.4-2.6.
If you can't change or patch your version of Python, #jd's solution will work around the issue.
However, if you need a more generic solution that works as a standard urljoin would, you can use a wrapper method which implements the workaround for that specific use case, and default to the standard urljoin() otherwise.
For example:
import urlparse
def myurljoin(base, url, allow_fragments=True):
if url[0] != "?":
return urlparse.urljoin(base, url, allow_fragments)
if not allow_fragments:
url = url.split("#", 1)[0]
parsed = list(urlparse.urlparse(base))
parsed[4] = url[1:] # assign params field
return urlparse.urlunparse(parsed)

I solved it by bundling Python 2.6's urlparse module with my project. I also had to bundle namedtuple which was defined in collections, since urlparse uses it.

Are you sure? On Python 2.7:
>>> import urlparse
>>> urlparse.urljoin('http://example.com/mypage', '?name=joe')
'http://example.com/mypage?name=joe'

Python error when using urllib.open

When I run this:
import urllib
feed = urllib.urlopen("http://www.yahoo.com")
print feed
I get this output in the interactive window (PythonWin):
<addinfourl at 48213968 whose fp = <socket._fileobject object at 0x02E14070>>
I'm expecting to get the source of the above URL. I know this has worked on other computers (like the ones at school) but this is on my laptop and I'm not sure what the problem is here. Also, I don't understand this error at all. What does it mean? Addinfourl? fp? Please help.

Try this:
print feed.read()
See Python docs here.

urllib.urlopen actually returns a file-like object so to retrieve the contents you will need to use:
import urllib
feed = urllib.urlopen("http://www.yahoo.com")
print feed.read()

In python 3.0:
import urllib
import urllib.request
fh = urllib.request.urlopen(url)
html = fh.read().decode("iso-8859-1")
fh.close()
print (html)

urllib2 file name

If I open a file using urllib2, like so:
remotefile = urllib2.urlopen('http://example.com/somefile.zip')
Is there an easy way to get the file name other then parsing the original URL?
EDIT: changed openfile to urlopen... not sure how that happened.
EDIT2: I ended up using:
filename = url.split('/')[-1].split('#')[0].split('?')[0]
Unless I'm mistaken, this should strip out all potential queries as well.

Did you mean urllib2.urlopen?
You could potentially lift the intended filename if the server was sending a Content-Disposition header by checking remotefile.info()['Content-Disposition'], but as it is I think you'll just have to parse the url.
You could use urlparse.urlsplit, but if you have any URLs like at the second example, you'll end up having to pull the file name out yourself anyway:
>>> urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')
>>> urlparse.urlsplit('http://example.com/somedir/somefile.zip')
('http', 'example.com', '/somedir/somefile.zip', '', '')
Might as well just do this:
>>> 'http://example.com/somefile.zip'.split('/')[-1]
'somefile.zip'
>>> 'http://example.com/somedir/somefile.zip'.split('/')[-1]
'somefile.zip'

If you only want the file name itself, assuming that there's no query variables at the end like http://example.com/somedir/somefile.zip?foo=bar then you can use os.path.basename for this:
[user#host]$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.basename("http://example.com/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip?foo=bar")
'somefile.zip?foo=bar'
Some other posters mentioned using urlparse, which will work, but you'd still need to strip the leading directory from the file name. If you use os.path.basename() then you don't have to worry about that, since it returns only the final part of the URL or file path.

I think that "the file name" isn't a very well defined concept when it comes to http transfers. The server might (but is not required to) provide one as "content-disposition" header, you can try to get that with remotefile.headers['Content-Disposition']. If this fails, you probably have to parse the URI yourself.

Just saw this I normally do..
filename = url.split("?")[0].split("/")[-1]

Using urlsplit is the safest option:
url = 'http://example.com/somefile.zip'
urlparse.urlsplit(url).path.split('/')[-1]

Do you mean urllib2.urlopen? There is no function called openfile in the urllib2 module.
Anyway, use the urllib2.urlparse functions:
>>> from urllib2 import urlparse
>>> print urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')
Voila.

You could also combine both of the two best-rated answers :
Using urllib2.urlparse.urlsplit() to get the path part of the URL, and then os.path.basename for the actual file name.
Full code would be :
>>> remotefile=urllib2.urlopen(url)
>>> try:
>>> filename=remotefile.info()['Content-Disposition']
>>> except KeyError:
>>> filename=os.path.basename(urllib2.urlparse.urlsplit(url).path)

The os.path.basename function works not only for file paths, but also for urls, so you don't have to manually parse the URL yourself. Also, it's important to note that you should use result.url instead of the original url in order to follow redirect responses:
import os
import urllib2
result = urllib2.urlopen(url)
real_url = urllib2.urlparse.urlparse(result.url)
filename = os.path.basename(real_url.path)

I guess it depends what you mean by parsing. There is no way to get the filename without parsing the URL, i.e. the remote server doesn't give you a filename. However, you don't have to do much yourself, there's the urlparse module:
In [9]: urlparse.urlparse('http://example.com/somefile.zip')
Out[9]: ('http', 'example.com', '/somefile.zip', '', '', '')

not that I know of.
but you can parse it easy enough like this:
url = 'http://example.com/somefile.zip'
print url.split('/')[-1]

using requests, but you can do it easy with urllib(2)
import requests
from urllib import unquote
from urlparse import urlparse
sample = requests.get(url)
if sample.status_code == 200:
#has_key not work here, and this help avoid problem with names
if filename == False:
if 'content-disposition' in sample.headers.keys():
filename = sample.headers['content-disposition'].split('filename=')[-1].replace('"','').replace(';','')
else:
filename = urlparse(sample.url).query.split('/')[-1].split('=')[-1].split('&')[-1]
if not filename:
if url.split('/')[-1] != '':
filename = sample.url.split('/')[-1].split('=')[-1].split('&')[-1]
filename = unquote(filename)

You probably can use simple regular expression here. Something like:
In [26]: import re
In [27]: pat = re.compile('.+[\/\?#=]([\w-]+\.[\w-]+(?:\.[\w-]+)?$)')
In [28]: test_set
['http://www.google.com/a341.tar.gz',
'http://www.google.com/a341.gz',
'http://www.google.com/asdasd/aadssd.gz',
'http://www.google.com/asdasd?aadssd.gz',
'http://www.google.com/asdasd#blah.gz',
'http://www.google.com/asdasd?filename=xxxbl.gz']
In [30]: for url in test_set:
....: match = pat.match(url)
....: if match and match.groups():
....: print(match.groups()[0])
....:
a341.tar.gz
a341.gz
aadssd.gz
aadssd.gz
blah.gz
xxxbl.gz

Using PurePosixPath which is not operating system—dependent and handles urls gracefully is the pythonic solution:
>>> from pathlib import PurePosixPath
>>> path = PurePosixPath('http://example.com/somefile.zip')
>>> path.name
'somefile.zip'
>>> path = PurePosixPath('http://example.com/nested/somefile.zip')
>>> path.name
'somefile.zip'
Notice how there is no network traffic here or anything (i.e. those urls don't go anywhere) - just using standard parsing rules.

import os,urllib2
resp = urllib2.urlopen('http://www.example.com/index.html')
my_url = resp.geturl()
os.path.split(my_url)[1]
# 'index.html'
This is not openfile, but maybe still helps :)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Handling web links in Python - python

use urlparse.parse_qs: try: from urlparse import urlparse # for python2 except: from urllib import parse as urlparse # for python3 rv = urlparse.parse_qs(link) title = rv['title'][0]

Related

Python- regular expression to print word within link

Weird json value urllib python

Python's `urlparse`: Adding GET keywords to a URL

Python error when using urllib.open

urllib2 file name

Categories

Resources