how to use lxml to get the url - python

I want to know how to use lxml to fetch a URL, so that I can then use XPath to parse the data I want.
Please guide me, thank you very much.
res = requests.get('http://www.ipeen.com.tw/comment/778246')
doc = parse(res.content)
name = doc.xpath("//meta[@itemprop='name']/@content")
print name
There are errors in my code:
doc = parse(res.content)
File "/Users/ome/djangoenv/lib/python2.7/site-packages/lxml/html/__init__.py", line 786, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95015)
IOError

res.content is a string, an HTML string.
You need to use lxml.html.fromstring():
import lxml.html
import requests
res = requests.get('http://www.ipeen.com.tw/comment/778246')
doc = lxml.html.fromstring(res.content)
name = doc.xpath(".//meta[@itemprop='name']/@content")
print name

Presumably res.content is a string containing the contents of the page. parse takes a filename or file-like object. Thus, you are using the page content as the name of a file. This is probably not what you want. To construct a tree from a string, use fromstring rather than parse.
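To make the difference concrete, here is a minimal sketch that needs no network access. The inline page and the "Example Restaurant" value are made up for illustration and stand in for res.content; it is written for Python 3, so print is a function.

```python
import io
import lxml.html

# Hypothetical page content standing in for res.content (bytes).
page = b"""<html><head>
<meta itemprop="name" content="Example Restaurant">
<title>Example</title>
</head><body><p>hello</p></body></html>"""

# fromstring() builds a tree directly from the markup held in memory.
doc = lxml.html.fromstring(page)
name = doc.xpath("//meta[@itemprop='name']/@content")
print(name)

# parse() expects a filename, URL, or file-like object instead,
# so the same bytes have to be wrapped in a file-like object first.
tree = lxml.html.parse(io.BytesIO(page))
same_name = tree.xpath("//meta[@itemprop='name']/@content")
```

Either route gives the same result; fromstring is simply the shorter one when the content is already in a variable.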

Related

Reading local file with lxml -Python

I have what I thought was very basic code to read an xml file into Python. But I'm baffled that I'm running into issues.
I thought this code should work:
from lxml import etree
parser = etree.XMLParser(ns_clean=True, recover=True)
tree = etree.parse('file.xml', parser)
#root = tree.getroot()
However, I get this issue:
Traceback (most recent call last):
File "C:/Users/Root/Documents/xml.py", line 5, in <module>
tree = etree.parse('file.xml', parser)
File "src\lxml\etree.pyx", line 3519, in lxml.etree.parse
File "src\lxml\parser.pxi", line 1839, in lxml.etree._parseDocument
File "src\lxml\parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
File "src\lxml\parser.pxi", line 1769, in lxml.etree._parseDocFromFile
File "src\lxml\parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError
OSError: Error reading file 'file.xml': failed to load external entity "file.xml"
I read a few posts with similar issues, and it turns out that the problem was not including the full path. So I've run the code with a few iterations:
tree = etree.parse(r'C:\Users\Root\Documents\file.xml', parser)
tree = etree.parse("C:\\Users\\Root\\Documents\\file.xml", parser)
tree = etree.parse('C:/Users/Root/Documents/file.xml', parser)
Am I missing something obvious? Any help is much appreciated

Python Crawler - html.fromstring

I am trying to parse a web page with this code.
ac = requests.get('link....')
html_text = ac.text
lx = html.fromstring(html_text)
When I run this code I am getting this error
Traceback (most recent call last):
File "Crawler.py", line 197, in <module>
cnx.close()
File "Crawler.py", line 46, in RequestPage
lx = html.fromstring(html_text)
File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 867, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 752, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:76696)
File "src\lxml\parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:115101)
File "src\lxml\parser.pxi", line 1711, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:113677)
File "src\lxml\parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:107847)
File "src\lxml\parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:102150)
File "src\lxml\parser.pxi", line 694, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:103800)
File "src\lxml\parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:102888)
lxml.etree.XMLSyntaxError: line 1843: Tag ie:menuitem invalid
I found the HTML tag which causes the error:
<ie:menuitem id="MSOMenu_Help" iconsrc="/_layouts/images/HelpIcon.gif" onmenuclick="MSOWebPartPage_SetNewWindowLocation(MenuWebPart.getAttribute('helpLink'), MenuWebPart.getAttribute('helpMode'))" text="Help" type="option" style="display:none">
</ie:menuitem>
You found the HTML tag that caused the error, but did you fix it? If not, try this:
ac = requests.get('link....')
lx = html.fromstring(ac.content)
valueOfHTMLTag = lx.xpath('//TAG[@class/id="Name"]/text()')
where you replace:
TAG with the tag whose value you want to get,
class/id with either class or id, whichever attribute identifies the tag,
Name with the class or id name of the tag.
This will return a list with the text values of every tag that has the matching class/id.
Hope this helps!
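As a concrete instance of that XPath template, here is a small sketch on an inline document. The tag names, the "note" class and the "header" id are invented for illustration; the code is written for Python 3.

```python
import lxml.html

# Made-up markup; the class and id names are placeholders.
html_text = b"""
<html><body>
  <div id="header">Site header</div>
  <p class="note">first note</p>
  <p class="note">second note</p>
</body></html>
"""

doc = lxml.html.fromstring(html_text)

# Select by class: every <p class="note">, returning its text nodes.
notes = doc.xpath('//p[@class="note"]/text()')

# Select by id: the single <div id="header">.
header = doc.xpath('//div[@id="header"]/text()')
```

Both calls return lists, even when only one element matches.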

How to read an xml file with & sign

This is my xml file:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE papers>
<papers>
<paper>
<title>Title containing & and more</title>
</paper>
</papers>
How do I read that using lxml's etree? I tried
from lxml import etree
with open(xml_file, 'r') as inf:
    tree = etree.parse(inf)
but it results in the following Traceback:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
File "parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92461)
File "parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:91757)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 5, column 30
If you need to retain the & character, you can parse the file as HTML.
from lxml import html
tree = html.parse(path)
If you don't need the & character, you can create a new XML parser and pass the recover=True option.
from lxml import etree
parser = etree.XMLParser(recover=True)
tree = etree.parse(path, parser=parser)
Since the XML file is malformed because of the bare ampersand (in XML, & must be written as the predefined entity &amp;), use BeautifulSoup if you can. It is a more error-tolerant parser.
from bs4 import BeautifulSoup
soup = BeautifulSoup(data)
print soup.find("title").text
outputs
Title containing & and more
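Another option, not taken from the answers above, is to escape the bare ampersands before handing the data to the XML parser. The sketch below assumes the file content is available as bytes; the regular expression is a rough heuristic that leaves anything shaped like an entity or character reference alone, so it will not catch every malformed case. It is written for Python 3.

```python
import re
from lxml import etree

# The malformed document from the question, as bytes.
raw = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE papers>
<papers>
<paper>
<title>Title containing & and more</title>
</paper>
</papers>
"""

# Replace any "&" that does not start something shaped like
# "&name;", "&#123;" or "&#x1F;" with the entity "&amp;".
bare_amp = re.compile(rb"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")
fixed = bare_amp.sub(b"&amp;", raw)

root = etree.fromstring(fixed)
title = root.findtext("paper/title")
print(title)
```

This keeps the data as XML and preserves the ampersand in the parsed text, at the cost of modifying the input before parsing.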

printing title of URL from a file in python

I'm trying to read URLs from a file and output the title of each page:
import lxml.html
file = open('ab.txt','r')
for line in file:
    t = lxml.html.parse(line)
    print t.find(".//title").text
The error :
Traceback (most recent call last):
File "C:\Python27\site.py", line 4, in <module>
t = lxml.html.parse(line)
File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 661, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71797)
File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72080)
File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71175)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68173)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64493)
IOError: Error reading file 'http://example.com/5129860
': failed to load HTTP resource
The ab.txt has:
example.com/123
example.com/234
example.com/456
....
Anything wrong in here?
The parse function in lxml.html parses a filename, URL, or file-like object into an HTML document and returns a tree. According to the documentation, its signature is:
parse(filename_or_url, parser=None, base_url=None, **kw)
So you can directly pass the filename and get your output.
t = lxml.html.parse('ab.txt')
print t.find(".//title").text
for line in file:
    t = lxml.html.parse(line)
    print t.find(".//title").text
Here you read each line and pass it directly to lxml.html.parse, which means that the argument to the function is not valid HTTP content. You should modify these lines as follows:
from urllib2 import urlopen
for line in file:
    # strip the trailing newline before using the line as a URL
    content = urlopen(line.strip())
    t = lxml.html.parse(content)
    print t.find(".//title").text
Here the content of the page behind each URL is read into the variable content, so parse receives valid HTTP content.
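Note that the traceback shows a line break inside the quoted URL, which suggests each line is passed on with its trailing newline. A minimal sketch of cleaning the lines first is below. normalize_url and title_of are hypothetical helper names, the default "http" scheme is an assumption (the ab.txt listing shows no scheme), and the sample HTML is made up so the snippet runs without network access. It is written for Python 3.

```python
import lxml.html

def normalize_url(line, default_scheme="http"):
    """Strip whitespace/newlines and add a scheme if the line has none."""
    url = line.strip()
    if not url:
        return None  # skip blank lines
    if "://" not in url:
        url = default_scheme + "://" + url
    return url

def title_of(html_bytes):
    """Return the <title> text of an HTML document given as bytes."""
    doc = lxml.html.fromstring(html_bytes)
    return doc.findtext(".//title")

# Lines as they would come out of iterating over the file.
lines = ["example.com/123\n", "\n", "http://example.com/234\n"]
urls = [u for u in map(normalize_url, lines) if u]

# In real use the bytes would come from fetching each URL.
sample_title = title_of(
    b"<html><head><title>Sample page</title></head><body></body></html>"
)
```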

Why is the slash at the end of lxml.html.parse() important?

I am using lxml to scrape html. This code works.
lxml.html.parse( "http://google.com/" )
This code does not.
lxml.html.parse( "http://google.com" )
Why does the slash at the end of the URL matter? Thank you.
To be clear, here is the error log that python is giving me from the latter code.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/davidfaux/epd-7.2-2-rh5-x86/lib/python2.7/site-packages/lxml/html/__init__.py", line 692, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2953, in lxml.etree.parse (src/lxml/lxml.etree.c:56204)
File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82287)
File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82580)
File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81619)
File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78528)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 588, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74665)
IOError: Error reading file 'http://google.com': failed to load HTTP resource
Because without the slash, Google isn't sending you a page, it's sending you a redirect. In fact, it's redirecting you to the URL with the slash! The body of the redirect is probably empty.
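One way to sidestep this is to fetch the page with a client that follows redirects and hand the response to lxml. The sketch below uses urlopen from the Python 3 standard library, which follows redirects by default; page_title and fake_opener are hypothetical names, and the stub opener exists only so the example runs without network access.

```python
import io
from urllib.request import urlopen
import lxml.html

def page_title(url, opener=urlopen):
    """Fetch url with opener (redirects handled there) and return its title."""
    response = opener(url)
    try:
        # lxml.html.parse accepts any file-like object with read().
        tree = lxml.html.parse(response)
    finally:
        response.close()
    title = tree.find(".//title")
    return title.text if title is not None else None

def fake_opener(url):
    # Stand-in for urlopen: returns canned HTML instead of hitting the network.
    return io.BytesIO(
        b"<html><head><title>Stub page</title></head><body></body></html>"
    )

stub_title = page_title("http://google.com", opener=fake_opener)
```

In real use, call page_title("http://google.com") without the opener argument, so that the URL without the trailing slash is resolved by the HTTP client before lxml sees any content.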
