Python/Mechanize - Can not select form - ParseError(exc)


I am getting this error:
>>> br = Browser()
>>> br.open("http://www.bestforumz.com/forum/")
<response_seek_wrapper at 0x21f9fd0 whose wrapped object = <closeable_response at 0x21f9558 whose fp = <socket._fileobject object at 0x021F5F30>>>
>>> br.select_form(nr=0)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    br.select_form(nr=0)
  File "build\bdist.win32\egg\mechanize\_mechanize.py", line 505, in select_form
    global_form = self._factory.global_form
  File "build\bdist.win32\egg\mechanize\_html.py", line 546, in __getattr__
    self.forms()
  File "build\bdist.win32\egg\mechanize\_html.py", line 559, in forms
    self._forms_factory.forms())
  File "build\bdist.win32\egg\mechanize\_html.py", line 228, in forms
    raise ParseError(exc)
ParseError: <unprintable ParseError object>
Please help me out.
Thanks

Here is a trick I have been using to force mechanize to parse broken HTML:
br = mechanize.Browser(factory=mechanize.DefaultFactory(i_want_broken_xhtml_support=True))

mechanize isn't guaranteed to parse all HTML. You might have to do this by hand (which isn't too hard, this being Python).
Are you trying to post a query to the site's search.php page? You can use urllib2 for this.
import urllib
import urllib2
values = dict(foo="hello", bar="world")  # examine the form for the actual field names
try:
    # Passing a data argument makes urllib2 send a POST request
    req = urllib2.Request("http://example.com/search.php",
                          urllib.urlencode(values))
    response_page = urllib2.urlopen(req).read()
except urllib2.HTTPError as details:
    pass  # do something with the error here...
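If you are on Python 3, urllib2 no longer exists; its pieces live in urllib.request, urllib.parse and urllib.error. A rough sketch of the same request there, assuming the same placeholder field names (foo, bar) and the example.com URL from above. It only builds the request and does not send it:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder field names; inspect the real form for the actual ones.
values = {"foo": "hello", "bar": "world"}

# On Python 3 the request body must be bytes, so encode the query string.
body = urlencode(values).encode("utf-8")

# Supplying data switches the request method from GET to POST.
req = Request("http://example.com/search.php", data=body)

print(req.get_method())
print(req.data)
```

To actually send it, pass req to urllib.request.urlopen and catch urllib.error.HTTPError around that call.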

Related

Python - BeautifulSoup error while scraping

UPDATE: Using lxml instead of html.parser helped solve the problem, as Freddier suggested in the answer below!
I am trying to webscrape some information off of this website: https://www.ticketmonster.co.kr/deal/952393926.
I get an error when I run soup(thispage, 'html.parser'), but only for this specific page. Does anyone know why this is happening?
The code I have so far is very simple:
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
url = "https://www.ticketmonster.co.kr/deal/952393926"
openU = urlopen(url)
thispage = openU.read()
openU.close()
pageS = soup(thispage, 'html.parser')
The error I get is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 228, in __init__
    self._feed()
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\builder\_htmlparser.py", line 215, in feed
    parser.feed(markup)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 179, in goahead
    k = self.parse_html_declaration(i)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 264, in parse_html_declaration
    return self.parse_marked_section(i)
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 149, in parse_marked_section
    sectName, j = self._scan_name( i+3, i )
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 391, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 34, in error
    "subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()
Please help!

Try using
pageS = soup(thispage, 'lxml')
instead of
pageS = soup(thispage, 'html.parser')
It looks like it may be a problem with character encoding when using "html.parser".

Error trying to use BeautifulSoup

For a school assignment we need to use some movie data. Instead of copying and pasting all the info I need, I thought I would scrape it off IMDb. However, I am not familiar with Python and I am running into an issue here.
This is my code:
import urllib
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.imdb.com"
values = {'q' : 'The Matrix'}
data = urllib.parse.urlencode(values).encode("utf-8")
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
r = response.read()
soup = BeautifulSoup(r)
That code keeps giving me the error:
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    soup = BeautifulSoup(r)
  File "C:\Users\My Name\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 153, in __init__
    builder = builder_class()
  File "C:\Users\My Name\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\builder\_htmlparser.py", line 39, in __init__
    return super(HTMLParserTreeBuilder, self).__init__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'strict'
Do any of you great minds know what I am doing wrong?
I tried Google and found a post mentioning that it might have something to do with requests, so I uninstalled requests and installed it again... that didn't work.

Finding specific text using BeautifulSoup

I'm trying to grab all the winner categories from this page:
http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013
I've written this in sublime:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013"
page = urllib2.urlopen(url)
soup_package = BeautifulSoup(page)
page.close()
# Find every div with class="bestOfItem". This works.
all_categories = soup_package.findAll("div",class_="bestOfItem")
# print(all_categories)
#this part breaks it:
soup = BeautifulSoup(all_categories)
winner = soup.a.string
print(winner)
When I run this in terminal, I get the following error:
Traceback (most recent call last):
  File "winners.py", line 12, in <module>
    soup = BeautifulSoup(all_categories)
  File "build/bdist.macosx-10.9-intel/egg/bs4/__init__.py", line 193, in __init__
  File "build/bdist.macosx-10.9-intel/egg/bs4/builder/_lxml.py", line 99, in prepare_markup
  File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 249, in encodings
  File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 304, in find_declared_encoding
TypeError: expected string or buffer
Does anyone know what's happening there?

You are trying to create a new BeautifulSoup object from a list of elements.
soup = BeautifulSoup(all_categories)
There is absolutely no need to do this here; just loop over each match instead:
for match in all_categories:
    winner = match.a.string
    print(winner)

AttributeError: 'HTTPResponse' object has no attribute 'type'

So, I am trying to build a program that retrieves the scores of the NHL season from Yahoo's RSS feed.
I am not an experienced programmer, so some things haven't quite gotten into my head just yet. However, here is my code so far:
from urllib.request import urlopen
import xml.etree.cElementTree as ET
YAHOO_NHL_URL = 'http://sports.yahoo.com/nhl/rss'
def retrievalyahoo():
    nhl_site = urlopen('http://sports.yahoo.com/nhl/rss')
    tree = ET.parse(urlopen(nhl_site))
retrievalyahoo()
The title above states the error I get after I test the aforementioned code.
EDIT: Okay, after the fix I get this traceback instead, which puzzles me:
Traceback (most recent call last):
  File "C:/Nathaniel's Folder/Website Scores.py", line 12, in <module>
    retrievalyahoo()
  File "C:/Nathaniel's Folder/Website Scores.py", line 10, in retrievalyahoo
    tree = ET.parse(nhl_site)
  File "C:\Python33\lib\xml\etree\ElementTree.py", line 1242, in parse
    tree.parse(source, parser)
  File "C:\Python33\lib\xml\etree\ElementTree.py", line 1730, in parse
    self._root = parser._parse(source)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 17, column 291
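A note on the ParseError from the edit: the message names the line and column where the parser gave up, and the exception object carries the same pair in its position attribute. A minimal sketch of reading it, assuming a deliberately malformed inline string (BROKEN_FEED, made up for illustration) in place of the live feed; describe_parse_error is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Deliberately malformed XML (an unescaped "&"), standing in for the real feed.
BROKEN_FEED = "<rss><title>Scores & Highlights</title></rss>"


def describe_parse_error(xml_text):
    """Return None if xml_text parses, else (line, column, nearby text)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # position is the (line, column) pair that also appears in the message.
        line, column = err.position
        lines = xml_text.splitlines()
        bad_line = lines[line - 1] if 0 < line <= len(lines) else ""
        # Keep a window of text around the reported column for inspection.
        snippet = bad_line[max(0, column - 20):column + 20]
        return line, column, snippet
    return None


result = describe_parse_error(BROKEN_FEED)
print(result)
```

Printing the text around the reported column usually shows what upset the parser. Whether the Yahoo feed fails for the same reason as this made-up string is not something the sketch can tell you.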

The problem is that you're trying to call urlopen on the result of urlopen.
Just call it once, like this:
nhl_site = urlopen('http://sports.yahoo.com/nhl/rss')
tree = ET.parse(nhl_site)
The error message probably could be nicer. If you look at the docs for urlopen:
Open the URL url, which can be either a string or a Request object.
Clearly the http.client.HTTPResponse object that it returns is neither a string nor a Request object. What's happened here is that urlopen sees that it's not a string, and therefore assumes it's a Request, and starts trying to access methods and attributes that Request objects have. This kind of design is generally a good thing, because it lets you pass things that act just like a Request and they'll just work… but it does mean that if you pass something that doesn't act like a Request, the error message can be mystifying.
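You can see this behaviour without touching the network. A small sketch, assuming a made-up FakeResponse class as a stand-in for the already-opened response (it is neither a string nor a Request); try_open is a hypothetical helper name:

```python
from urllib.request import urlopen


class FakeResponse:
    """Hypothetical stand-in for an already-opened response object."""


def try_open(target):
    """Call urlopen(target) and return the AttributeError text, if any."""
    try:
        urlopen(target)
    except AttributeError as err:
        return str(err)
    return None


# urlopen sees a non-string, treats it as a Request, and fails on a
# Request attribute before any connection is attempted.
message = try_open(FakeResponse())
print(message)
```

The error names the stand-in class and the missing Request attribute, which is the same shape as the message in the question title.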

QLineEdit and custom function incompatibility

I'm trying to finish my project with Python and PyQt4 and I'm having an issue passing a QLineEdit variable through a function I made. The string should work as an url and when I pass it through my first argument, which tries to read the url and get its content, it throws me this error:
Traceback (most recent call last):
  File "programa2.py", line 158, in on_link_clicked
    download_mango(self.a, self.path2)
  File "c:\Users\Poblet\ManGet\mango.py", line 19, in download_mango
    urlContent = urllib2.urlopen(url).read() # We read the URL
  File "c:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "c:\Python27\lib\urllib2.py", line 386, in open
    protocol = req.get_type()
AttributeError: 'QString' object has no attribute 'get_type'
Which is triggered by the following action:
def on_link_clicked(self):
    self.a = self.linkEdit.displayText()
    download_mango(self.a, self.path2)
And I'm completely lost. Could it be a PyQt4 issue or something wrong with my function?
I appreciate your help.

You didn't post enough code to justify your statement that
The string should work as an url and when I pass it through my first argument
Looks like you are passing a QString into urlopen. Just wrap it in str() and you should be OK.
>>> url = QString('http://stackoverflow.com/questions/11121475')
>>> urllib2.urlopen(url).read()
### this generates your error ending with
AttributeError: 'QString' object has no attribute 'get_type'
>>> urllib2.urlopen(str(url)).read()
### works

self.a is missing "http://" or "https://".
Try:
download_mango("http://" + self.a, self.path2)
See http://www.nullege.com/codes/search/urllib2.Request.get_type
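Both suggestions can be combined in one small helper that runs before the URL reaches urlopen. A sketch only: FakeQString is a made-up stand-in for PyQt4's QString so the example runs without Qt, and normalize_url is a hypothetical name, not part of PyQt or urllib:

```python
class FakeQString:
    """Hypothetical stand-in for QString: string-like, but not a str."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        return self._text


def normalize_url(raw):
    """Coerce a string-like value to str and make sure it has a scheme."""
    text = str(raw).strip()
    # Assumption: anything without "://" is a bare host/path meant for HTTP.
    if "://" not in text:
        text = "http://" + text
    return text


print(normalize_url(FakeQString("  example.com/page ")))
print(normalize_url("https://example.com"))
```

In on_link_clicked this would look like download_mango(normalize_url(self.a), self.path2), assuming download_mango expects a plain string.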
