printing title of URL from a file in python - python

I'm trying to fetch URL from file and output the title of page :
import lxml.html
file = open('ab.txt','r')
for line in file:
t = lxml.html.parse(line)
print t.find(".//title").text
The error :
Traceback (most recent call last):
File "C:\Python27\site.py", line 4, in <module>
t = lxml.html.parse(line)
File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 661, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71797)
File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:72080)
File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:71175)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:68173)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64493)
IOError: Error reading file 'http://example.com/5129860
': failed to load HTTP resource
The ab.txt has:
example.com/123
example.com/234
example.com/456
....
Anything wrong in here?

The parse method in lxml.html parses a filename, URL, or file-like object into an HTML document and returns a tree. From the documentation, the arguments of this function are like this,
parse(filename_or_url, parser=None, base_url=None, **kw)
So you can directly pass the filename and get your output.
t = lxml.html.parse('ab.txt')
print t.find(".//title").text

for line in file:
t = lxml.html.parse(line)
print t.find(".//title").text
Here you are trying to read each line and parse each line using the lxml.html.parse Which means that the argument to function is not a valid http content. you should be modifiying these lines as
from urllib2 import urlopen
for line in file:
content = urlopen(line)
t = lxml.html.parse(content)
print t.find(".//title").text
Here the entire content of the file, is read to variable content. There by it contains a valid http content.

Related

python lxml in docker: "Document is empty" while parsing

Why this code is working without issues on my mac with any version of python, requests and lxml, but doesn't work in any docker container? i tried everything(
it just fails on 34533 line (discovered by printing el.sourceline)
from requests import get
from lxml import etree
r = get('https://printbar.ru/synsfiles/yandex/market/idrr_full.xml')
with open('test.xml', 'wb') as f:
f.write(r.content)
tree = etree.iterparse(source='test.xml', events=('end',))
for (ev, el) in tree:
continue
print('ok')
https://printbar.ru/synsfiles/yandex/market/idrr_full.xml seems completely valid and works locally on any of my macs...
i tried ubuntu, alpine, several python containers even with prebuilt lxml, nothing helped. I expected that parsing this file won't throw this error in the middle of parsing:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "src/lxml/iterparse.pxi", line 210, in lxml.etree.iterparse.__next__
File "src/lxml/iterparse.pxi", line 195, in lxml.etree.iterparse.__next__
File "src/lxml/iterparse.pxi", line 230, in lxml.etree.iterparse._read_more_events
File "src/lxml/parser.pxi", line 1376, in lxml.etree._FeedParser.feed
File "src/lxml/parser.pxi", line 606, in lxml.etree._ParserContext._handleParseResult
File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
File "test.xml", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
xmllint says that there is encoding error, but it works locally on mac...) HOW?) i want it dockerized!)

Can etree.XMLParser in recover mode still throw a parse error?

I have a utility method that parses XML using a parser created as etree.XMLParser(recover=True). I would like to test failure scenarios in a unit test. Except for empty input throwing an lxml.etree.XMLSyntaxError, I can't seem to break the parser.
My question is: is it possible to construct a StringIO or BytesIO input for this parser such that the parser throws a parse error?
Here's some examples (tested with Python 3.5 and lxml 4.3.3):
from io import BytesIO
from lxml import etree
def parse(xml):
parser = etree.XMLParser(recover=True)
elem = etree.parse(BytesIO(xml), parser)
print(etree.tostring(elem))
parse(b'<broken<') # prints b'<broken/>'
parse(b'</lf|\jf>') # prints None
parse('<?xml encoding="ascii"?><foo>æøå</foo>'.encode('utf-8')) # prints b'<foo/>'
parse(b'') # Throws lxml.etree.XMLSyntaxError
If I slap a NULL character at the beginning of any of the bad inputs you show that don't raise an error, I do get an error. For instance:
parse(b'\0<broken<')
produces:
Traceback (most recent call last):
File "test.py", line 13, in <module>
parse(b'\0<broken<') # prints b'<broken/>'
File "test.py", line 9, in parse
elem = etree.parse(BytesIO(xml), parser)
File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1857, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
Isn't it because you are using recover=True?
recover - try hard to parse through broken XML
I changed recover=False and I get:
Traceback (most recent call last):
File "./foo.py", line 11, in <module>
parse(b'<broken<') # prints b'<broken/>'
File "./foo.py", line 7, in parse
elem = etree.parse(BytesIO(xml), parser)
File "src/lxml/etree.pyx", line 3435, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1857, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: error parsing attribute name, line 1, column 8
Am I missing something?

Python Crawler - html.fromstring

I am trying to parse web page with this code.
ac = requests.get('link....')
html_text = ac.text
lx = html.fromstring(html_text)
When I run this code I am getting this error
Traceback (most recent call last):
File "Crawler.py", line 197, in <module>
cnx.close()
File "Crawler.py", line 46, in RequestPage
lx = html.fromstring(html_text)
File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 867, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 752, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:76696)
File "src\lxml\parser.pxi", line 1830, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:115101)
File "src\lxml\parser.pxi", line 1711, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:113677)
File "src\lxml\parser.pxi", line 1051, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:107847)
File "src\lxml\parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:102150)
File "src\lxml\parser.pxi", line 694, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:103800)
File "src\lxml\parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:102888)
lxml.etree.XMLSyntaxError: line 1843: Tag ie:menuitem invalid
I found the html tag which cause to the error:
<ie:menuitem id="MSOMenu_Help" iconsrc="/_layouts/images/HelpIcon.gif" onmenuclick="MSOWebPartPage_SetNewWindowLocation(MenuWebPart.getAttribute('helpLink'), MenuWebPart.getAttribute('helpMode'))" text="Help" type="option" style="display:none">
</ie:menuitem>
You found the HTML tag that case the error but did you fix it? If not try this:
ac = requests.get('link....')
lx = html.fromstring(ac.content)
valueOfHTMLTag = lx.xpath('//TAG[#class/id="Name"]/text()')
where you change:
TAG in the tag that you want to get the value of.
choose class or id of the tag
the id/class name of the tag
This will return an array with the values of that tag with the correct class/id.
Hope this helps!

how to use lxml to get the url

I want to know how to use lxml to get a url,and then I can use xpath to parse the data I want .
Please guide me,thank you very much.
res = requests.get('http://www.ipeen.com.tw/comment/778246')
doc = parse(res.content)
name = doc.xpath("//meta[#itemprop='name']/#content")
print name
There are errors in my code:
doc = parse(res.content)
File "/Users/ome/djangoenv/lib/python2.7/site-packages/lxml/html/__init__.py", line 786, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95015)
IOError
res.content is a string, an HTML string.
You need to use lxml.html.fromstring():
import lxml.html
import requests
res = requests.get('http://www.ipeen.com.tw/comment/778246')
doc = lxml.html.fromstring(res.content)
name = doc.xpath(".//meta[#itemprop='name']/#content")
print name
Presumably res.content is a string containing the contents of the page. parse takes a filename or file-like object. Thus, you are using the page content as the name of a file. This is probably not what you want. To construct a tree from a string, use fromstring rather than parse.

Why is the slash at the end of lxml.html.parse() important?

I am using lxml to scrape html. This code works.
lxml.html.parse( "http://google.com/" )
This code does not.
lxml.html.parse( "http://google.com" )
Why does the slash at the end of the URL matter? Thank you.
To be clear, here is the error log that python is giving me from the latter code.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/davidfaux/epd-7.2-2-rh5-x86/lib/python2.7/site-packages/lxml/html/__init__.py", line 692, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2953, in lxml.etree.parse (src/lxml/lxml.etree.c:56204)
File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82287)
File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82580)
File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81619)
File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78528)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 588, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74665)
IOError: Error reading file 'http://google.com': failed to load HTTP resource
Because without the slash, Google isn't sending you a page, it's sending you a redirect. In fact, it's redirecting you to the URL with the slash! The body of the redirect is probably empty.

Categories