How i parse with lxml a result page with form?

How i parse with lxml a result page with form? - python

I try to parse a secondary page with form . I use example code source from this link :
http://blog.ianbicking.org/2007/09/24/lxmlhtml/
On my test i use this url: http://www.infofer.ro/
Like on example , I use this values :
>>> pprint(form.form_values())
[('cboData', '8/30/2010'),
('txtPlecare', 'Bucuresti Nord'),
('txtSosire', 'Constanta'),
('tip', 'GO'),
('lng', '1')]
The result is take it with this :
result = parse(submit_form(form)).getroot()
This is another page with another form .
I try something like this :
>>> page2=parse(result).getroot()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/site-packages/lxml/html/__init__.py", line 661, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49945)
File "parser.pxi", line 1525, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:72026)
TypeError: cannot parse from 'HtmlElement'
How i parse the form from secondary page ?
Regards.

The getroot method does not give you another "page", but an instance of lxml.html.HtmlElement.
There is no need (and no way) to parse this once more, you already have everything you need packed into the result variable.

Related

Unexpected error when using feedparser.py

I have had great success parsing RSS feeds from the National Hurricane Center using the feedparser module:
import feedparser
feedparser.parse('https://www.nhc.noaa.gov/gis-at.xml') #Works Fine
feedparser.parse('https://www.nhc.noaa.gov/gis-ep.xml') #Works Fine
However, when I try to read the superficially similar feed from the Central Pacific Hurricane Center, I generate a KeyError:
feedparser.parse('http://www.prh.noaa.gov/cphc/gis-cp.xml') #Doesn't work
Is this a bug with feedparser? Is the CPHC's feed malformed? Is there an option that I've forgotten to specify? It seems the trouble is that there isn't a key named 'where', but I don't know why this isn't a problem for the NHC feeds. The stack is reproduced below:
>>> import feedparser
>>> feedparser.parse('http://www.prh.noaa.gov/cphc/gis-cp.xml')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 3956, in parse
saxparser.parse(source)
File ".../anaconda3/lib/python3.6/xml/sax/expatreader.py", line 111, in parse
xmlreader.IncrementalParser.parse(self, source)
File ".../anaconda3/lib/python3.6/xml/sax/xmlreader.py", line 125, in parse
self.feed(buffer)
File ".../anaconda3/lib/python3.6/xml/sax/expatreader.py", line 217, in feed
self._parser.Parse(data, isFinal)
File "/tmp/build/80754af9/python_1516124163501/work/Modules/pyexpat.c", line 414, in StartElement
File ".../anaconda3/lib/python3.6/xml/sax/expatreader.py", line 370, in start_element_ns
AttributesNSImpl(newattrs, qnames))
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 2031, in startElementNS
self.unknown_starttag(localname, list(attrsD.items()))
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 666, in unknown_starttag
return method(attrsD)
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 1500, in _start_gml_point
self._parse_srs_attrs(attrsD)
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 1496, in _parse_srs_attrs
context['where']['srsName'] = srsName
File ".../anaconda3/lib/python3.6/site-packages/feedparser.py", line 356, in __getitem__
return dict.__getitem__(self, key)
KeyError: 'where'

I know this is an old question but I faced myself this issue and became my first opensource contribution :)
Is this a bug with feedparser?
Yes, it was.
Is the CPHC's feed malformed?
Also yes, or at least it doesn't follow the GeoRSSS GML model to the letter. If you check the GMLPoint description you will see the following structure:
<georss:where>
<gml:Point>
<gml:pos>45.256 -71.92</gml:pos>
</gml:Point>
</georss:where>
but the feed data is structured this way:
<gml:Point>
<gml:pos>45.256 -71.92</gml:pos>
</gml:Point>
So that's why the KeyError: 'where' occurs, due to the absent of where tag.
This was fixed on feedparser's 6.0.9 hotfix (see https://github.com/kurtmckee/feedparser/pull/306)

python html parsing fails in document with javascript

I'm trying to use Python to parse HTML (although strictly speaking, the server claims it's xhtml) and every parser I have tried (ElementTree, minidom, and lxml) all fail. When I go to look at where the problem is, it's inside a script tag:
<script type="text/javascript">
... // some javascript code
if (condition1 && condition2) { // croaks on this line
I see what the problem is, the ampersand should be quoted. The problem is, this is inside a javascript script tag, so it cannot be quoted, because that would break the code.
What's going on here? How is inline javascript able to break my parse, and what can I do about it?
Update: per request, here is the code used with lxml.
>>> from lxml import etree
>>> tree=etree.parse("http://192.168.1.185/site.html")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95050)
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 77, column 22
The lxml manual starts Chapter 9 by stating "lxml provides a very simple and powerful API for parsing XML and HTML" so I would expect to not see that exception.

There are a lot of really crappy ways for HTML parsing to break. Bad HTML is ubiquitous, and both script sections and various templating languages throw monkey wrenches into the works.
But, you also seem to be using XML-oriented parsers for the job, which are stricter and thus much, much more likely to break if not presented with exactly-right, totally valid input. Which most HTML--including most XHTML--manifestly is not.
So, use a parser designed to overlook some of the HTML gotchas:
import lxml.html
d = lxml.html.parse(URL)
That should take you off to the races.

Problem using the submit method in scrape.py library

I'm using the scrape.py library to scrape a website. (library and documentation can be found here http://zesty.ca/scrape/)
There is a a button on the page I want the session to press, but I don't understand exactly how to use the submit function. As I understand I am supposed to give it a region object of a form. The button itself is an input html element. I tried giving it both the form and input, and I get the same error every time.
My code (on google app engine):
s.go(url)
form = s.doc.first(name="form1")
s.submit(region=form)
or
s.go(url)
input = s.doc.first(tagname="input", id="blabla")
s.submit(region=input)
and the error:
ERROR 2011-05-01 23:37:18,673 __init__.py:427] sequence item 0: expected string, NoneType found
Traceback (most recent call last):
File "\appengine\ext\webapp\__init__.py", line 636, in __call__
handler.post(*groups)
File "main.py", line 135, in post
s.submit(region=form)
File "scrape.py", line 342, in submit
return self.go(url, p, redirects)
File "scrape.py", line 288, in go
self.cookiejar)
File "scrape.py", line 176, in fetch
data = urlencode(data)
File "scrape.py", line 409, in urlencode
for key, value in params.items()]
File "scrape.py", line 405, in urlquote
return ''.join(map(urlquoted.get, text))
TypeError: sequence item 0: expected string, NoneType found

Yes I do know that this is a year old but since I am currently using scrape.py and I know the answer to this question I thought I should add it for those who come after.
The problem is in the submit.
Instead of s.submit(region=form) it should be s.submit(form).
The reason is that the variable form contains something like <Region 1254:1250> so you don't need to tell scrape.py that it's there, it is expected to be there.
So it's probably nothing to do with Javascript.

My assupmtion is that it's probably because the button and the form were covered in javascript, so scrape probably couldn't work with that. Need libraries that support JS, like selenium or windmill.

BeautifulSoup not working

i am trying to import the content of my blog using BeautifulSoup,using the the syntax as given below
import urllib2
from BeautifulSoup import BeautifulSoup
response=urllib2.urlopen('http://www.bugsandbrains.blogspot.com')
html=response.read()
soup=BeautifulSoup(html)
Every thing worked fine two or three time after that it started throwing HtmlParseError
i see it highly unlikely that the structure of the page might have changed within a few minutes what else can might be causing this problem ?
i am enclosing the trace as well.
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib/python2.6/HTMLParser.py", line 150, in goahead
k = self.parse_endtag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 317, in parse_endtag
self.error("bad end tag: %r" % (rawdata[i:j],))
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParseError: bad end tag: u"</scr' + 'ipt>", at line 1152, column 16

I just tried your code on Windows with:
Python: 2.6 (same as yours)
BeautiSoup: 3.0.8.1 (latest)
I can't reproduce this. Are you using the latest code 3.0 series which is meant for Python 2.6, not 3.1 series which is for Python 3 [0]. Sorry, but can't think of any other clues right now.
[0] http://www.crummy.com/software/BeautifulSoup/#Download

I have tried your code, and it works. My env: ActivePython 2.6.6.15, BeautifulSoup 3.0.8.1. I printed out soup variable and it contains content of "Boredom Induced Post". When I tested http://www.bugsandbrains.blogspot.com with browsers they shows Wave Sandbox login page. No clue about what is wrong :(

how to fix or make an exception for this error

I'm creating a code that gets image's urls from any web pages, the code are in python and use BeutifulSoup and httplib2.
When I run the code, I get the next error:
Look me http://movies.nytimes.com (this line is printed by the code)
Traceback (most recent call last):
File "main.py", line 103, in <module>
visit(initialList,profundidad)
File "main.py", line 98, in visit
visit(dodo[indice], bottom -1)
File "main.py", line 94, in visit
getImages(w)
File "main.py", line 34, in getImages
iSoupList = BeautifulSoup(response, parseOnlyThese=SoupStrainer('img'))
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1499, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1230, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup.py", line 1263, in _feed
self.builder.feed(markup)
File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
endpos = self.check_for_whole_start_tag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
self.error("malformed start tag")
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 942, column 118
Someone can explain me how to fix or make an exeption for the error

Are you using latest version of BeautifulSoup?
This seems a known issue of version 3.1.x, because it started using a new parser (HTMLParser, instead of SGMLParser) that is much worse at processing malformed HTML. You can find more information about this on BeautifulSoup website.
As a quick solution, you can simply use an older version (3.0.7a).

To catch that error specifically, change your code to look like this:
try:
iSoupList = BeautifulSoup(response, parseOnlyThese=SoupStrainer('img'))
except HTMLParseError:
#Do something intelligent here
Here's some more reading on Python's try except blocks:
http://docs.python.org/tutorial/errors.html

I got that error when I had the string =& in my HTML document. When I replaced that string (in my case with =and) then I no longer received that parsing error.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How i parse with lxml a result page with form? - python

The getroot method does not give you another "page", but an instance of lxml.html.HtmlElement. There is no need (and no way) to parse this once more, you already have everything you need packed into the result variable.

Related

Unexpected error when using feedparser.py

python html parsing fails in document with javascript

Problem using the submit method in scrape.py library

BeautifulSoup not working

how to fix or make an exception for this error

Categories

Resources