BeautifulSoup HTMLParseError. What's wrong with this? - python

This is my code:
from bs4 import BeautifulSoup as BS
import urllib2
url = "http://services.runescape.com/m=news/recruit-a-friend-for-free-membership-and-xp"
res = urllib2.urlopen(url)
soup = BS(res.read())
other_content = soup.find_all('div',{'class':'Content'})[0]
print other_content
Yet an error comes up:
/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py:149: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
"Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
Traceback (most recent call last):
File "web.py", line 5, in <module>
soup = BS(res.read())
File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 172, in __init__
self._feed()
File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 185, in _feed
self.builder.feed(self.markup)
File "/Library/Python/2.7/site-packages/bs4/builder/_htmlparser.py", line 150, in feed
raise e
I've let two other people use this code, and it works for them perfectly fine. Why is it not working for me? I have bs4 installed...

Per the error message, one thing you may need to do is install lxml, which will provide a more powerful parsing engine for BeautifulSoup to use. See this section in the docs for a better overview, but the likely reason that it works for two other people is that they have lxml (or another parser that handles the HTML properly) installed, meaning that BeautifulSoup uses it instead of the standard built-in (side note: your example works for me as well on a system with lxml installed, but fails on one without it).
Also, see this note in the docs:
If you’re using a version of Python 2 earlier than 2.7.3, or a version
of Python 3 earlier than 3.2.2, it’s essential that you install lxml
or html5lib–Python’s built-in HTML parser is just not very good in
older versions.
I would recommend running sudo apt-get install python-lxml and seeing if the problem continues.

Related

`document.lastModified` in Python

In python, by using an HTML parser, is it possible to get the document.lastModified property of a web page. I'm trying to retrieve the date at which the webpage/document was last modified by the owner.
A somewhat related question "I am downloading a file using Python urllib2. How do I check how large the file size is?", suggests that the following (untested) code should work:
import urllib2
req = urllib2.urlopen("http://example.com/file.zip")
total_size = int(req.info().getheader('last-modified'))
You might want to add a default value as the second parameter to getheader(), in case it isn't set.
You can also look for a last-modified date in the HTML code, most notably in the meta-tags. The htmldate module does just that.
Here is how it could work:
1. Install the package:
pip/pip3/pipenv (your choice) -U htmldate
2. Retrieve a web page, parse it and output the date:
from htmldate import find_date
find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
(disclaimer: I'm the author)

Reading a page and parsing it with minidom.parse or minidom.parseString in Python?

I have either of these codes:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parse(res)
which gives me the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0
Or this:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parseString(res.read())
which gives me the same error. res.read() reads fine and is a string.
I would like to parse through the code later. How can I do this using xml.dom.minidom?
The reason you're getting this error is that the page isn't valid XML. It's HTML 5. The doctype right at the top tells you this, even if you ignore the content type. You can't parse HTML with an XML parser.*
If you want to stick with what's in the stdlib, you can use html.parser (Python 3.x) / HTMLParser (2.x).** However, you may want to consider third-party libraries like lxml (which, despite the name, can parse HTML), html5lib, or BeautifulSoup (which wraps up a lower-level parser in a really nice interface).
* Well, unless it's XHTML, or the XML output of HTML5, but that's not the case here.
** Do not use htmllib unless you're using an old version of Python without a working HTMLParser. This module is deprecated for a reason.

urllib2 download HTML file

Using urllib2 in Python 2.7.4, I can readily download an Excel file:
output_file = 'excel.xls'
url = 'http://www.nbmg.unr.edu/geothermal/GEOTHERM-30Jun11.xls'
file(output_file, 'wb').write(urllib2.urlopen(url).read())
This results in the expected file that I can use as I wish.
However, trying to download just an HTML file gives me an empty file:
output_file = 'webpage.html'
url = 'http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html'
file(output_file, 'wb').write(urllib2.urlopen(url).read())
I had the same results using urllib. There must be something simple I'm missing or don't understand. How do I download an HTML file from a URL? Why doesn't my code work?
If you want to download files or simply save a webpage you can use urlretrieve(from urllib library)instead of use read and write.
import urllib
urllib.urlretrieve("http://www.nbmg.unr.edu/geothermal/mapfiles/nvgeowel.html","doc.html")
#urllib.urlretrieve("url","save as..")
If you need to set a timeout you have to put it at the start of your file:
import socket
socket.setdefaulttimeout(25)
#seconds
It also Python 2.7.4 in my OS X 10.9, and the codes work well on it.
So I think there maybe other problems prevent its working. Can you open "http://www.nbmg.unr.edu/geothermal/GEOTHERM-30Jun11.xls" in your browser?
This may not directly answer the question, but if you're working with HTTP and have sufficient privileges to install python packages, I'd really recommend doing this with 'requests'. There's a related answered here - https://stackoverflow.com/a/13137873/45698

Pyquery: I'm using pyquery ,the HTTP_REFFER is required in this page,How can i do with it?

the website needs HTTP_REFFER when i send request..
the common way to open pages in PyQuery is `
> doc=pyQuery(url=r'http://www.....')
how can i add HTTP_REFFER ?
`
pyQuery uses urlopen from urllib.request if you're on py3 or urllib2 if you're on py2. When you feed it with the url parameter it should either be a string or a Request object.
In the python2 case let's see how it would look like if you want to add an http_header to your request:
import urllib2
url = urllib2.Request("http://...", headers={'HTTP_REFERER': "http://..."})
doc = pyQuery(url=url)
It would be similar in the python3 case. It's always good to read through the code of the libs your're working with, you can find the pyQuery code here.

Get webpage contents with Python?

I'm using Python 3.1, if that helps.
Anyways, I'm trying to get the contents of this webpage. I Googled for a little bit and tried different things, but they didn't work. I'm guessing that this should be an easy task, but...I can't get it. :/.
Results of urllib, urllib2:
>>> import urllib2
Traceback (most recent call last):
File "<pyshell#0>", line 1, in <module>
import urllib2
ImportError: No module named urllib2
>>> import urllib
>>> urllib.urlopen("http://www.python.org")
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
urllib.urlopen("http://www.python.org")
AttributeError: 'module' object has no attribute 'urlopen'
>>>
Python 3 solution
Thank you, Jason. :D.
import urllib.request
page = urllib.request.urlopen('http://services.runescape.com/m=hiscore/ranking?table=0&category_type=0&time_filter=0&date=1519066080774&user=zezima')
print(page.read())
If you're writing a project which installs packages from PyPI, then the best and most common library to do this is requests. It provides lots of convenient but powerful features. Use it like this:
import requests
response = requests.get('http://hiscore.runescape.com/index_lite.ws?player=zezima')
print (response.status_code)
print (response.content)
But if your project does not install its own dependencies, i.e. is limited to things built-in to the standard library, then you should consult one of the other answers.
Because you're using Python 3.1, you need to use the new Python 3.1 APIs.
Try:
urllib.request.urlopen('http://www.python.org/')
Alternately, it looks like you're working from Python 2 examples. Write it in Python 2, then use the 2to3 tool to convert it. On Windows, 2to3.py is in \python31\tools\scripts. Can someone else point out where to find 2to3.py on other platforms?
Edit
These days, I write Python 2 and 3 compatible code by using six.
from six.moves import urllib
urllib.request.urlopen('http://www.python.org')
Assuming you have six installed, that runs on both Python 2 and Python 3.
If you ask me. try this one
import urllib2
resp = urllib2.urlopen('http://hiscore.runescape.com/index_lite.ws?player=zezima')
and read the normal way ie
page = resp.read()
Good luck though
Mechanize is a great package for "acting like a browser", if you want to handle cookie state, etc.
http://wwwsearch.sourceforge.net/mechanize/
You can use urlib2 and parse the HTML yourself.
Or try Beautiful Soup to do some of the parsing for you.
Also you can use faster_than_requests package. That's very fast and simple:
import faster_than_requests as r
content = r.get2str("http://test.com/")
Look at this comparison:
A solution with works with Python 2.X and Python 3.X:
try:
# For Python 3.0 and later
from urllib.request import urlopen
except ImportError:
# Fall back to Python 2's urllib2
from urllib2 import urlopen
url = 'http://hiscore.runescape.com/index_lite.ws?player=zezima'
response = urlopen(url)
data = str(response.read())
Suppose you want to GET a webpage's content. The following code does it:
# -*- coding: utf-8 -*-
# python
# example of getting a web page
from urllib import urlopen
print urlopen("http://xahlee.info/python/python_index.html").read()

Categories