Get webpage contents with Python? - python

I'm using Python 3.1, if that helps.
Anyways, I'm trying to get the contents of this webpage. I Googled for a little bit and tried different things, but they didn't work. I'm guessing that this should be an easy task, but...I can't get it. :/.
Results of urllib, urllib2:
>>> import urllib2
Traceback (most recent call last):
File "<pyshell#0>", line 1, in <module>
import urllib2
ImportError: No module named urllib2
>>> import urllib
>>> urllib.urlopen("http://www.python.org")
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
urllib.urlopen("http://www.python.org")
AttributeError: 'module' object has no attribute 'urlopen'
>>>
Python 3 solution
Thank you, Jason. :D.
import urllib.request
page = urllib.request.urlopen('http://services.runescape.com/m=hiscore/ranking?table=0&category_type=0&time_filter=0&date=1519066080774&user=zezima')
print(page.read())

If you're writing a project which installs packages from PyPI, then the best and most common library to do this is requests. It provides lots of convenient but powerful features. Use it like this:
import requests
response = requests.get('http://hiscore.runescape.com/index_lite.ws?player=zezima')
print (response.status_code)
print (response.content)
But if your project does not install its own dependencies, i.e. is limited to things built-in to the standard library, then you should consult one of the other answers.

Because you're using Python 3.1, you need to use the new Python 3.1 APIs.
Try:
urllib.request.urlopen('http://www.python.org/')
Alternately, it looks like you're working from Python 2 examples. Write it in Python 2, then use the 2to3 tool to convert it. On Windows, 2to3.py is in \python31\tools\scripts. Can someone else point out where to find 2to3.py on other platforms?
Edit
These days, I write Python 2 and 3 compatible code by using six.
from six.moves import urllib
urllib.request.urlopen('http://www.python.org')
Assuming you have six installed, that runs on both Python 2 and Python 3.

If you ask me. try this one
import urllib2
resp = urllib2.urlopen('http://hiscore.runescape.com/index_lite.ws?player=zezima')
and read the normal way ie
page = resp.read()
Good luck though

Mechanize is a great package for "acting like a browser", if you want to handle cookie state, etc.
http://wwwsearch.sourceforge.net/mechanize/

You can use urlib2 and parse the HTML yourself.
Or try Beautiful Soup to do some of the parsing for you.

Also you can use faster_than_requests package. That's very fast and simple:
import faster_than_requests as r
content = r.get2str("http://test.com/")
Look at this comparison:

A solution with works with Python 2.X and Python 3.X:
try:
# For Python 3.0 and later
from urllib.request import urlopen
except ImportError:
# Fall back to Python 2's urllib2
from urllib2 import urlopen
url = 'http://hiscore.runescape.com/index_lite.ws?player=zezima'
response = urlopen(url)
data = str(response.read())

Suppose you want to GET a webpage's content. The following code does it:
# -*- coding: utf-8 -*-
# python
# example of getting a web page
from urllib import urlopen
print urlopen("http://xahlee.info/python/python_index.html").read()

Related

urllib2 download HTML file

Using urllib2 in Python 2.7.4, I can readily download an Excel file:
output_file = 'excel.xls'
url = 'http://www.nbmg.unr.edu/geothermal/GEOTHERM-30Jun11.xls'
file(output_file, 'wb').write(urllib2.urlopen(url).read())
This results in the expected file that I can use as I wish.
However, trying to download just an HTML file gives me an empty file:
output_file = 'webpage.html'
url = 'http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html'
file(output_file, 'wb').write(urllib2.urlopen(url).read())
I had the same results using urllib. There must be something simple I'm missing or don't understand. How do I download an HTML file from a URL? Why doesn't my code work?
If you want to download files or simply save a webpage you can use urlretrieve(from urllib library)instead of use read and write.
import urllib
urllib.urlretrieve("http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html","doc.html")
#urllib.urlretrieve("url","save as..")
If you need to set a timeout you have to put it at the start of your file:
import socket
socket.setdefaulttimeout(25)
#seconds
It also Python 2.7.4 in my OS X 10.9, and the codes work well on it.
So I think there maybe other problems prevent its working. Can you open "http://www.nbmg.unr.edu/geothermal/GEOTHERM-30Jun11.xls" in your browser?
This may not directly answer the question, but if you're working with HTTP and have sufficient privileges to install python packages, I'd really recommend doing this with 'requests'. There's a related answered here - https://stackoverflow.com/a/13137873/45698

BeautifulSoup HTMLParseError. What's wrong with this?

This is my code:
from bs4 import BeautifulSoup as BS
import urllib2
url = "http://services.runescape.com/m=news/recruit-a-friend-for-free-membership-and-xp"
res = urllib2.urlopen(url)
soup = BS(res.read())
other_content = soup.find_all('div',{'class':'Content'})[0]
print other_content
Yet an error comes up:
/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py:149: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
Traceback (most recent call last):
File "web.py", line 5, in <module>
soup = BS(res.read())
File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 172, in __init__
self._feed()
File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 185, in _feed
self.builder.feed(self.markup)
File "/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py", line 150, in feed
raise e
I've let two other people use this code, and it works for them perfectly fine. Why is it not working for me? I have bs4 installed...
Per the error message, one thing you may need to do is install lxml, which will provide a more powerful parsing engine for BeautifulSoup to use. See this section in the docs for a better overview, but the likely reason that it works for two other people is that they have lxml (or another parser that handles the HTML properly) installed, meaning that BeautifulSoup uses it instead of the standard built-in (side note: your example works for me as well on a system with lxml installed, but fails on one without it).
Also, see this note in the docs:
If you’re using a version of Python 2 earlier than 2.7.3, or a version
of Python 3 earlier than 3.2.2, it’s essential that you install lxml
or html5lib–Python’s built-in HTML parser is just not very good in
older versions.
I would recommend running sudo apt-get install python-lxml and seeing if the problem continues.

Urllib2 HTTPS truncated response

I am trying to fetch a page using urllib2.urlopen (actually, I am using mechanize, but this is the method that mechanize calls) When I fetch the page, I am getting incomplete responses; the page gets truncated. However, if I access the non-HTTPS version of the page, I get the complete page.
I am on Arch Linux (3.5.4-1-ARCH x86_64). I am running openssl 1.0.1c. This problem occurs on another Arch Linux machine I own, but not when using Python 3 (3.3.0).
This problem seems to be related to urllib2 not retrieving entire HTTP response.
I tested it on the only online Python interpreter that would let me use urllib2 (Py I/O) and it worked as expected. Here is the code:
import urllib2
u = urllib2.urlopen('https://wa151.avayalive.com/WAAdminPanel/login.aspx?ReturnUrl=%2fWAAdminPanel%2fprivate%2fHome.aspx')
print u.read()[-100:]
The last lines should contain the usual </body></html>.
When I try urllib.urlretrieve on my machines, I get:
ContentTooShortError: retrieval incomplete: got only 11365 out of 13805 bytes
I cannot test urlretrieve on the online interpreter because it will not let users write to temporary files. Later in the evening, I will try fetching the URL from my machine, but from a different location.
I'm getting the same error, using Python 2.7, on a different Linux system:
>>> urllib.urlretrieve('https://wa151.avayalive.com/WAAdminPanel/login.aspx?ReturnUrl=%2fWAAdminPanel%2fprivate%2fHome.aspx')
---------------------------------------------------------------------------
ContentTooShortError Traceback (most recent call last)
...
ContentTooShortError: retrieval incomplete: got only 11365 out of 13805 bytes
However, the same operation can be done (and actually works for me) using requests:
>>> import requests
>>> r = requests.get('https://wa151.avayalive.com/WAAdminPanel/login.aspx?ReturnUrl=%2fWAAdminPanel%2fprivate%2fHome.aspx')
>>> with open(somefilepath, 'w') as f:
... f.write(r.text)
Is that working for you?

How can I parse a YAML file from the web using PyYAML?

I need to get a YAML file from the web and parse it using PyYAMl, but i can't seem to find a way to do it.
import urllib
import yaml
fileToBeParsed = urllib.urlopen("http://website.com/file.yml")
pythonObject = yaml.open(fileToBeParsed)
print pythonObject
The error produced when runing this is:
AttributeError: 'module' object has no attribute 'open'
If it helps, I am using python 2. Sorry if this is a silly question.
I believe you want yaml.load(fileToBeParsed) and I would suggest looking at urllib2.urlopen if not the requests module.

Pyquery: I'm using pyquery ,the HTTP_REFFER is required in this page,How can i do with it?

the website needs HTTP_REFFER when i send request..
the common way to open pages in PyQuery is `
> doc=pyQuery(url=r'http://www.....')
how can i add HTTP_REFFER ?
`
pyQuery uses urlopen from urllib.request if you're on py3 or urllib2 if you're on py2. When you feed it with the url parameter it should either be a string or a Request object.
In the python2 case let's see how it would look like if you want to add an http_header to your request:
import urllib2
url = urllib2.Request("http://...", headers={'HTTP_REFERER': "http://..."})
doc = pyQuery(url=url)
It would be similar in the python3 case. It's always good to read through the code of the libs your're working with, you can find the pyQuery code here.

Categories