Python webbrowser - Open a URL without http://

I am trying to get Python to open a website URL. This code works:
import webbrowser
url = 'http://www.example.com/'
webbrowser.open(url)
I have noticed that Python will only open the URL if it has http:// at the beginning.
Is it possible to get python to open the URL if it's in any of the formats in the examples below?
url = 'http://www.example.com/'
url = 'https://example.com/'
url = 'www.example.com/'
url = 'example.com/'
The URLs will be pulled from outside sources, so I can't change what data I receive.
I have looked at the Python docs and can't find the answer on Stack Overflow.

Why not just add it?
if not url.startswith('http'):
    if url.startswith('www'):
        url = "http://" + url
    else:
        url = "http://www." + url

If you really don't want to change the URL string (which is quite fast and easy, as stazima said), then you can use Python 3: it supports all the URL types listed in your question (I tested them).
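In case it helps, a minimal sketch combining both suggestions (the open_url helper name is mine, not from either answer):

import webbrowser

def open_url(url):
    # Prepend a scheme when one is missing so webbrowser can
    # handle all four formats from the question
    if not url.startswith(('http://', 'https://')):
        url = 'http://' + url
    webbrowser.open(url)

for u in ('http://www.example.com/', 'https://example.com/',
          'www.example.com/', 'example.com/'):
    open_url(u)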

Related

How to test external url or links in a django website?

Hi, I am building a blogging website in Django 1.8 with Python 3. In the blog, users will write posts and sometimes add external links.
I want to crawl all the pages in this blog website and test whether every external link provided by the users is valid or not.
How can I do this? Should I use something like Scrapy?
import urllib.request  # urllib2 was merged into urllib.request in Python 3
import fnmatch

def site_checker(url):
    # Add http:// if the URL does not already start with a scheme
    url_chk = url.split('/')
    if not fnmatch.fnmatch(url_chk[0], 'http*'):
        url = 'http://%s' % url
    print(url)
    try:
        response = urllib.request.urlopen(url).read()
        if response:
            print('site is legit')
    except Exception:
        print("not a legit site yo!")

site_checker('google')             ## not a complete url
site_checker('http://google.com')  ## this works
Hopefully this works. urllib will read the HTML of the site, and if it's not empty, it's a legit site; else it's not a site. I also added a URL check to add http:// if it's not there.
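As a side note, a sketch of a sturdier check (my own variation, not part of the answer above): testing the HTTP status code with the requests library avoids counting error pages with non-empty bodies as valid links.

import requests

def check_link(url, timeout=10):
    # Hypothetical helper: a link counts as valid if the final
    # response after redirects has a non-error status code
    if not url.startswith(('http://', 'https://')):
        url = 'http://' + url
    try:
        r = requests.head(url, allow_redirects=True, timeout=timeout)
        return r.status_code < 400
    except requests.RequestException:
        return False

print(check_link('google.com'))  # True if reachable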

Automating download of executable which is within several nested urls by listening to build triggers

Is there a way I can click on the latest URL on a web page, then a URL within it, to download an exe file in Python?
I know how to download files from a static URL, but what about changing URLs?
Note: I want to go to the latest URL out of all the URLs, then I need to click a URL within it, and finally download the file.
Thanks in advance!
I used BeautifulSoup to accomplish this as suggested by ZZY. Thanks ZZY. Basically, we can do something like this:
from bs4 import BeautifulSoup  # older versions: from BeautifulSoup import BeautifulSoup

page = self.authorizedopen(username, password, url)  # custom helper that opens the URL with credentials
text = page.read()
page.close()

soup = BeautifulSoup(text)
data = ''
for tag in soup.findAll('a', href=True):
    data = tag['href']  # ends up holding the last link on the page
return url + '/' + data
And constantly manipulated the URL to reach where I wanted to. Then used plain urllib2 to download the required file.
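For reference, a rough end-to-end sketch of that flow; the start page URL is a placeholder and the 'newest link is listed last' rule is an assumption about the page layout:

import urllib2
from bs4 import BeautifulSoup

def latest_link(page_url):
    # Collect all links on the page and follow the last one
    soup = BeautifulSoup(urllib2.urlopen(page_url).read())
    links = [tag['href'] for tag in soup.findAll('a', href=True)]
    return page_url + '/' + links[-1]  # assumes the newest entry is listed last

build_page = latest_link('http://example.com/builds')  # hypothetical start page
exe_url = latest_link(build_page)
with open('latest.exe', 'wb') as f:
    f.write(urllib2.urlopen(exe_url).read())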

Python to Save Web Pages

This is probably a very simple task, but I cannot find any help. I have a website that takes the form www.xyz.com/somestuff/ID. I have a list of the IDs I need information from. I was hoping to have a simple script that goes to the site and downloads the (complete) web page for each ID, saved in a simple form like ID_whatever_the_default_save_name_is in a specific folder.
Can I run a simple Python script to do this for me? I can do it by hand, as it is only 75 different pages, but I was hoping to use this to learn how to do things like this in the future.
Mechanize is a great package for crawling the web with Python. A simple example for your issue would be:
import mechanize
br = mechanize.Browser()
response = br.open("http://www.xyz.com/somestuff/ID")  # mechanize also needs the http:// scheme
print response
This simply grabs your URL and prints the response from the server.
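If you then want to save the page rather than print it, a minimal follow-up might look like this (the output filename is a placeholder):

import mechanize

br = mechanize.Browser()
response = br.open("http://www.xyz.com/somestuff/ID")
with open("ID.html", "wb") as f:  # name the file after the ID
    f.write(response.read())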
This can be done simply in Python using the urllib module. Here is a simple example in Python 3:
import urllib.request
url = 'http://www.xyz.com/somestuff/ID'  # urllib needs the scheme
req = urllib.request.Request(url)
page = urllib.request.urlopen(req)
src = page.read()
print(src)
For more info on the urllib module -> http://docs.python.org/3.3/library/urllib.html
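Putting that together with the list of IDs from the question, a sketch (the ids list and the base URL are placeholders):

import urllib.request

ids = ['1', '2', '3']  # the 75 IDs from your list
for page_id in ids:
    url = 'http://www.xyz.com/somestuff/' + page_id
    src = urllib.request.urlopen(url).read()
    with open(page_id + '.html', 'wb') as f:  # read() returns bytes
        f.write(src)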
Do you want just the HTML code for the website? If so, just create a URL variable with the host site and add the page number as you go. I'll do this for an example with http://www.notalwaysright.com:
import urllib.request
url = "http://www.notalwaysright.com/page/"
for x in range(1, 71):
    newurl = url + str(x)  # range yields ints; the URL needs a string
    response = urllib.request.urlopen(newurl)
    with open("Page/" + str(x), "ab") as p:  # binary mode: read() returns bytes
        p.write(response.read())

Downloading a .csv file from the web (with redirects) in python

Let me start by saying that I know there are a few topics discussing problems similar to mine, but the suggested solutions do not seem to work for me for some reason.
Also, I am new to downloading files from the internet using scripts. Up until now I have mostly used Python as a MATLAB replacement (using numpy/scipy).
My goal:
I want to download a lot of .csv files from an internet database (http://dna.korea.ac.kr/vhot/) automatically using Python. I want to do this because it is too cumbersome to download the 1000+ csv files I require by hand. The database can only be accessed through a UI, where you have to select several options from drop-down menus to finally end up with links to .csv files after some steps.
I have figured out that the url you get after filling out the drop down menus and pressing 'search' contains all the parameters of the drop-down menu. This means I can just change those instead of using the drop down menu, which helps a lot.
An example URL from this website is (let's call it url1):
url1 = http://dna.korea.ac.kr/vhot/search.php?species=Human&selector=drop&mirname=&mirname_drop=hbv-miR-B2RC&pita=on&set=and&miranda_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&gene=
On this page I can select 5 csv-files, one example directs me to the following url:
url2 = http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=and&gene_filter=&method=pita&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&targetscan=&miranda=&rnahybrid=&microt=&pita=on
However, this doesn't contain the csv file directly, but appears to be a 'redirect' (a new term for me that I found by googling, so correct me if I am wrong).
One strange thing: I appear to have to load url1 in my browser before I can access url2 (I do not know if it has to be the same day, or hour; url2 didn't work for me today and it did yesterday, and only after accessing url1 did it work again...). If I do not access url1 before url2, I get "no results" instead of my csv file in my browser. Does anyone know what is going on here?
However, my main problem is that I cannot save the csv files from python.
I have tried using the packages urllib, urllib2 and requests, but I cannot get it to work.
From what I understand, the requests package should take care of redirects, but I haven't been able to make it work.
The solutions from the following web pages do not appear to work for me (or I am messing up):
stackoverflow.com/questions/7603044/how-to-download-a-file-returned-indirectly-from-html-form-submission-pyt
stackoverflow.com/questions/9419162/python-download-returned-zip-file-from-url
techniqal.com/blog/2008/07/31/python-file-read-write-with-urllib2/
Some of the things I have tried include:
import urllib2
import csv
import sys
url = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=&microt=&pita='
#1
u = urllib2.urlopen(url)
localFile = open('file.csv', 'w')
localFile.write(u.read())
localFile.close()
#2
req = urllib2.Request(url)
res = urllib2.urlopen(req)
finalurl = res.geturl()
pass
# finalurl = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=&microt=&pita='
#3
import requests
r = requests.get(url)
r.content
pass
#r.content = "<script>location.replace('download_send.php?name=qgN9Th&type=targetscan');</script>"
#4
import requests
r = requests.get(url,
                 allow_redirects=True,
                 data={'download_open': 'Download', 'format_open': '.csv'})
print r.content
# r.content = "
#5
import urllib
test1 = urllib.urlretrieve(url, "test.csv")
test2 = urllib.urlopen(url)
pass
For #2, #3 and #4 the outputs are displayed after the code.
For #1 and #5 I just get a .csv file containing only </script>.
Option #3 just gives me a new redirect, I think; can this help me?
Can anybody help me with my problem?
The page does not send an HTTP redirect; instead the redirect is done via JavaScript.
urllib and requests do not process JavaScript, so they cannot follow it to the download URL.
You have to extract the final download URL yourself, and then open it using any of the methods.
You could extract the URL using the re module, with a regex like r'location\.replace\((.*?)\)'.
Based on the response from ch3ka, I think I got it to work. From the source code I get the JavaScript redirect, and from this redirect I can get the data.
import re
import requests

# Fetch the page source, which contains the JavaScript redirect
redirect = requests.get(url).content
# Search for the JavaScript redirect in the source code
# --> based on ch3ka's answer
m = re.search(r"location\.replace\('(.*?)'\)", redirect).group(1)
# Build the real download URL from the redirect target, then fetch the data
new_url = 'http://dna.korea.ac.kr/vhot/' + m
data = requests.get(new_url).content
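On the side question of why url2 only works after url1 has been visited: one possible explanation (an assumption on my part, not confirmed in this thread) is that the search page sets up session state, in which case fetching both URLs inside one requests.Session may help:

import requests

s = requests.Session()          # keeps any cookies the search page sets
s.get(url1)                     # url1: the 'search' URL from the question
redirect = s.get(url2).content  # url2: the download URL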

ValueError: unknown url type in urllib2, though the url is fine if opened in a browser

Basically, I am trying to download a URL using urllib2 in Python.
The code is the following:
import urllib2
req = urllib2.Request('www.tattoo-cover.co.uk')
req.add_header('User-agent','Mozilla/5.0')
result = urllib2.urlopen(req)
This raises a ValueError and the program crashes for the URL in the example.
When I access the url in a browser, it works fine.
Any ideas how to handle the problem?
UPDATE:
Thanks to Ben James and sth, the problem has been identified: add 'http://'.
Now the question is refined:
Is it possible to handle such cases automatically with some built-in function, or do I have to do error handling with subsequent string concatenation?
When you enter a URL in a browser without the protocol, it defaults to HTTP. urllib2 won't make that assumption for you; you need to prefix it with http://.
You have to use a complete URL including the protocol, not just specify a host name.
The correct URL would be http://www.tattoo-cover.co.uk/.
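A minimal sketch of that fix applied to the code from the question:

import urllib2

url = 'www.tattoo-cover.co.uk'
if not url.startswith('http'):
    url = 'http://' + url  # urllib2 needs an explicit scheme

req = urllib2.Request(url)
req.add_header('User-agent', 'Mozilla/5.0')
result = urllib2.urlopen(req)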
You can use the urlparse function from urllib.parse (Python 3) to check for the presence of an addressing scheme (http, https, ftp) and prepend the scheme if it is not present:
In [1]: from urllib.parse import urlparse
   ...: url = 'www.myurl.com'
   ...: if not urlparse(url).scheme:
   ...:     url = 'http://' + url
   ...: url
Out[1]: 'http://www.myurl.com'
You can use the urlparse function for that, I think:
Python User Documentation
