I am trying to fetch a remote page with the Python requests module, reconstruct a DOM tree, do some processing, and save the result to a file. When I fetch a page and just write it straight to the file, everything works (I can open the HTML file later in the browser and it is rendered correctly).
However, if I create a pyquery object, do some processing, and then save it via str() conversion, it fails. Specifically, special characters such as && get escaped inside the script tags of the saved source (caused by the application of pyquery), and that prevents the page from rendering correctly.
Here is my code:
import requests
from lxml import etree
from pyquery import PyQuery as pq
user_agent = {'User-agent': 'Mozilla/5.0'}
r = requests.get('http://www.google.com', headers=user_agent, timeout=4)
DOM = pq(r.text)
# some optional processing
fTest = open("fTest.html", "wb")
fTest.write(str(DOM))
fTest.close()
So, the question is: how do I make sure that special characters aren't escaped after applying pyquery? I suppose it might be related to lxml (the parent library of pyquery), but after a tedious online search and experiments with different ways of serializing the object, I still haven't managed it. Maybe this is also related to Unicode handling?!
Many thanks in advance!
I have found an elegant solution to the problem, and the reason why the code didn't work before.
First, carefully read the page http://lxml.de/lxmlhtml.html.
It has a section "Creating HTML with the E-factory". After that section they point out that the etree.tostring() method works for XML only; for HTML, which may also contain script or style tags, it will mess things up. So..
Second, the solution is to use the overloaded lxml.html.tostring() function.
The final working code is:
# for networking
import requests
# for parsing and serialization
from lxml import etree
from lxml.html import tostring as html2str # IMPORTANT!!!
from pyquery import PyQuery as pq
user_agent = {'User-agent': 'Mozilla/5.0'}
r = requests.get('http://www.google.com', headers=user_agent, timeout=4)
# construct DOM object
DOM = pq(r.text)
# do stuff with DOM
#
# save result to file
fTest = open("fTest.html", "wb")
fTest.write(html2str(DOM.root)) # IMPORTANT!!!
fTest.close()
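To make the difference concrete, here is a minimal sketch (my own toy fragment, not the original page) comparing the two serializers on a script tag:
# Sketch: compare XML vs. HTML serialization of a <script> element in lxml.
from lxml import etree, html

fragment = '<html><body><script>if (a && b) { go(); }</script></body></html>'
doc = html.fromstring(fragment)

print(etree.tostring(doc))  # XML serializer: "&&" comes out as "&amp;&amp;"
print(html.tostring(doc))   # HTML serializer: the script body is left untouched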
Hope it will save some of you time in the future! Have fun, guys! ;)
Related
I am trying to read the contents of an HTML file with BeautifulSoup, but I'm receiving a UnicodeDecodeError.
I also tried changing the parser to html.parser instead of lxml, but it doesn't work.
However, if I use the requests library to request the URL it works; it only fails when I read the HTML file locally.
answer:
I needed to specify an encoding; the open call should look something like this: with open('lap.html', encoding="utf8") as html_file:
You are passing a file object to BeautifulSoup; instead, you have to pass the contents of the file.
Try:
content = html_file.read()
source = BeautifulSoup(content, 'lxml')
First of all, fix the typo soruce to source, then put spaces around the equals sign, and then find out what in the file might not be encodable with the encoding you are using, because that error refers to a character which can't be decoded/encoded.
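Putting the encoding fix and the "pass the contents, not the file object" fix together, a minimal sketch (the file name 'lap.html' comes from the answer above; adjust the encoding to whatever your file actually uses):
from bs4 import BeautifulSoup

# Open the file with an explicit encoding, then hand its contents to BeautifulSoup.
with open('lap.html', encoding='utf8') as html_file:
    content = html_file.read()

source = BeautifulSoup(content, 'lxml')
print(source.title)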
I can't get URL
base_url = "http://status.aws.amazon.com/"
socket.setdefaulttimeout(30)
htmldata = urllib2.urlopen(base_url)
for url in parser.url_list:
    get_rss_th = threading.Thread(target=parser.get_rss, name="get_rss_th", args=(url,))
    get_rss_th.start()
print htmldata
<addinfourl at 140176301032584 whose fp = <socket._fileobject object at 0x7f7d56a09750>>
When I use htmldata.read() instead (as in "Python error when using urllib.open"), I just get a blank screen.
Python 2.7
Whole code: https://github.com/tech-sketch/zabbix_aws_template/blob/master/scripts/AWS_Service_Health_Dashboard.py
The problem is that from the URL (an RSS feed) I can't get any output; the data variable from data = zbx_client.recv(4096) is empty, so there is no status.
There's no real problem with your code (except for a bunch of indentation errors and syntax errors that apparently aren't in your real code), only with your attempts to debug it.
First, you did this:
print htmldata
That's perfectly fine, but since htmldata is a urllib2 response object, printing it just prints that response object. Which apparently looks like this:
<addinfourl at 140176301032584 whose fp = <socket._fileobject object at 0x7f7d56a09750>>
That doesn't look like particularly useful information, but that's the kind of output you get when you print something that's only really useful for debugging purposes. It tells you what type of object it is, some unique identifier for it, and the key members (in this case, the socket fileobject wrapped up by the response).
Then you apparently tried this:
print htmldata.read()
But you had already called read on the same object earlier:
parser.feed(htmldata.read())
When you read() the same file-like object twice, the first time gets everything in the file, and the second time gets everything after everything in the file—that is, nothing.
What you want to do is read() the contents once, into a string, and then you can reuse that string as many times as you want:
contents = htmldata.read()
parser.feed(contents)
print contents
It's also worth noting that, as the urllib2 documentation said right at the top:
See also The Requests package is recommended for a higher-level HTTP client interface.
Using urllib2 can be a big pain, in a lot of ways, and this is just one of the more minor ones. Occasionally you can't use requests because you have to dig deep under the covers of HTTP, or handle some protocol it doesn't understand, or you just can't install third-party libraries, so urllib2 (or urllib.request, as it's renamed in Python 3.x) is still there. But when you don't have to use it, it's better not to. Even Python itself, in the ensurepip bootstrapper, uses requests instead of urllib2.
With requests, the normal way to access the contents of a response is with the content (for binary) or text (for Unicode text) properties. You don't have to worry about when to read(); it does it automatically for you, and lets you access the text over and over. So, you can just do this:
import requests
base_url = "http://status.aws.amazon.com/"
response = requests.get(base_url, timeout=30)
parser.feed(response.content) # assuming it wants bytes, not unicode
print response.text
If I use this code:
import urllib2
import socket
base_url = "http://status.aws.amazon.com/"
socket.setdefaulttimeout(30)
htmldata = urllib2.urlopen(base_url)
print(htmldata.read())
I get the page's HTML code.
I have these two pieces of code:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parse(res)
which gives me the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0
Or this:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parseString(res.read())
which gives me the same error. res.read() reads fine and is a string.
I would like to parse through the code later. How can I do this using xml.dom.minidom?
The reason you're getting this error is that the page isn't valid XML. It's HTML 5. The doctype right at the top tells you this, even if you ignore the content type. You can't parse HTML with an XML parser.*
If you want to stick with what's in the stdlib, you can use html.parser (Python 3.x) / HTMLParser (2.x).** However, you may want to consider third-party libraries like lxml (which, despite the name, can parse HTML), html5lib, or BeautifulSoup (which wraps up a lower-level parser in a really nice interface).
* Well, unless it's XHTML, or the XML output of HTML5, but that's not the case here.
** Do not use htmllib unless you're using an old version of Python without a working HTMLParser. This module is deprecated for a reason.
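For example, a minimal sketch using lxml.html (one of the libraries mentioned above; Python 2 spelling to match the question):
import urllib
import lxml.html

res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
doc = lxml.html.parse(res).getroot()  # HTML parser, so the HTML 5 page is fine
print doc.findall('.//a')             # e.g. grab all the <a> elements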
I'm trying to parse an XML document I retrieve from the web, but it crashes after parsing with this error:
': failed to load external entity "<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>
That is the second line in the XML that is downloaded. Is there a way to prevent the parser from trying to load the external entity, or another way to solve this? This is the code I have so far:
import urllib2
import lxml.etree as etree
file = urllib2.urlopen("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
data = file.read()
file.close()
tree = etree.parse(data)
In concert with what mzjn said, if you do want to pass a string to etree.parse(), just wrap it in a StringIO object.
Example:
from lxml import etree
from StringIO import StringIO
myString = "<html><p>blah blah blah</p></html>"
tree = etree.parse(StringIO(myString))
This method is used in the lxml documentation.
etree.parse(source) expects source to be one of
a file name/path
a file object
a file-like object
a URL using the HTTP or FTP protocol
The problem is that you are supplying the XML content as a string.
You can also do without urllib2.urlopen(). Just use
tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
Demonstration (using lxml 2.3.4):
>>> from lxml import etree
>>> tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
>>> tree.getroot()
<Element {http://www.w3.org/2005/Atom}feed at 0xedaa08>
>>>
In a competing answer, it is suggested that lxml fails because of the stylesheet referenced by the processing instruction in the document. But that is not the problem here. lxml does not try to load the stylesheet, and the XML document is parsed just fine if you do as described above.
If you want to actually load the stylesheet, you have to be explicit about it. Something like this is needed:
from lxml import etree
tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
# Create an _XSLTProcessingInstruction object
pi = tree.xpath("//processing-instruction()")[0]
# Parse the stylesheet and return an ElementTree
xsl = pi.parseXSL()
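From there, actually applying the stylesheet is one more step; a sketch, assuming the goal is to transform the same document:
# Build an XSLT transform from the parsed stylesheet and run it on the document.
transform = etree.XSLT(xsl)
result = transform(tree)
print(str(result)[:200])  # peek at the transformed output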
The lxml docs for parse() say: "To parse from a string, use the fromstring() function instead."
parse(...)
parse(source, parser=None, base_url=None)
Return an ElementTree object loaded with source elements. If no parser
is provided as second argument, the default parser is used.
The ``source`` can be any of the following:
- a file name/path
- a file object
- a file-like object
- a URL using the HTTP or FTP protocol
To parse from a string, use the ``fromstring()`` function instead.
Note that it is generally faster to parse from a file path or URL
than from an open file object or file-like object. Transparent
decompression from gzip compressed sources is supported (unless
explicitly disabled in libxml2).
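So a minimal fix for the code in the question would be to hand the downloaded data to fromstring() (a sketch, keeping the urllib2 download from the question):
import urllib2
from lxml import etree

data = urllib2.urlopen("http://www.greenbuttondata.org/data/15MinLP_15Days.xml").read()
root = etree.fromstring(data)  # parse from a string/bytes object instead of parse()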
You're getting that error because the XML you're loading references an external resource:
<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>
LXML doesn't know how to resolve GreenButtonDataStyleSheet.xslt. You and I probably realize that it's going to be available relative to your original URL, http://www.greenbuttondata.org/data/15MinLP_15Days.xml...the trick is to tell lxml how to go about loading it.
The lxml documentation includes a section titled "Document loading and URL resolving", which has just about all the information you need.
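A rough sketch of the kind of custom resolver that section describes (the class name and the base URL are my assumptions; see the lxml resolver documentation for the full picture):
import urllib2
from lxml import etree

DATA_BASE = "http://www.greenbuttondata.org/data/"

class SiteRelativeResolver(etree.Resolver):
    def resolve(self, url, pubid, context):
        # Map a bare reference like "GreenButtonDataStyleSheet.xslt" onto the site.
        if not url.startswith("http"):
            return self.resolve_string(urllib2.urlopen(DATA_BASE + url).read(), context)
        return None  # fall back to lxml's default handling for absolute URLs

parser = etree.XMLParser()
parser.resolvers.add(SiteRelativeResolver())
tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml", parser)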
The website needs an HTTP_REFERER header when I send the request.
The common way to open pages in PyQuery is:
doc = pyQuery(url=r'http://www.....')
How can I add HTTP_REFERER?
pyQuery uses urlopen from urllib.request if you're on Python 3, or from urllib2 if you're on Python 2. The url parameter you feed it should be either a string or a Request object.
In the Python 2 case, let's see how it would look if you want to add an HTTP header to your request:
import urllib2
url = urllib2.Request("http://...", headers={'HTTP_REFERER': "http://..."})
doc = pyQuery(url=url)
It would be similar in the Python 3 case. It's always good to read through the code of the libs you're working with; you can find the pyQuery code here.
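For reference, a Python 3 sketch of the same idea, assuming pyQuery accepts a urllib.request.Request object the same way (the URLs stay elided as in the snippet above):
from urllib.request import Request
from pyquery import PyQuery as pyQuery

# Note: the header name servers usually check is spelled 'Referer'.
url = Request("http://...", headers={'HTTP_REFERER': "http://..."})
doc = pyQuery(url=url)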