Error trying to use BeautifulSoup - python

For a school assignment we need to use some movie data. Instead of copying and pasting all the info I need, I thought I would scrape it from IMDB. However, I am not familiar with Python and I am running into an issue.
This is my code:
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.imdb.com"
values = {'q' : 'The Matrix'}
data = urllib.parse.urlencode(values).encode("utf-8")
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
r = response.read()
soup = BeautifulSoup(r)
That code keeps giving me the error:
> Traceback (most recent call last):
>   File "<pyshell#16>", line 1, in <module>
>     soup = BeautifulSoup(r)
>   File "C:\Users\My Name\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 153, in __init__
>     builder = builder_class()
>   File "C:\Users\My Name\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\builder\_htmlparser.py", line 39, in __init__
>     return super(HTMLParserTreeBuilder, self).__init__(*args, **kwargs)
> TypeError: __init__() got an unexpected keyword argument 'strict'
Do any of you great minds know what I am doing wrong?
I searched Google and found a post mentioning that it might have something to do with requests, so I uninstalled requests and installed it again, but that didn't work.

Related

How to download and process data from the internet without writing it as file?

In python 3.6.8 I am trying to download a 'file' from a URL and process it directly, without creating a local file. I have tried the following code
import io
import requests
url = "https://raw.githubusercontent.com/enzoftware/random/master/README.md"
response = requests.get(url, stream=True)
with io.BytesIO(response.text) as f:
    print(f.readlines())
but I get an error
Traceback (most recent call last):
  File "tester.py", line 7, in <module>
    with io.BytesIO(response.text) as f:
TypeError: a bytes-like object is required, not 'str'
How to do it right?
Assuming you just want to read it line by line, rather than considering any document (HTML) structure it may have, you can just do:
import requests
url = "https://raw.githubusercontent.com/enzoftware/random/master/README.md"
response = requests.get(url, stream=True)
for line in response.text.splitlines():
    print(line)
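The original TypeError comes from handing a str (response.text) to io.BytesIO, which only accepts bytes. With requests, response.content holds the raw bytes, so that is the value to wrap if a file-like object is really wanted. The sketch below shows the idea using a hard-coded byte string in place of a real download; read_lines and the sample payload are illustrative names, not part of the question.

```python
import io


def read_lines(payload: bytes, encoding: str = "utf-8") -> list:
    """Treat downloaded bytes as an in-memory file and return decoded lines."""
    # BytesIO requires bytes; with requests this would be response.content,
    # not response.text (which is already a decoded str).
    with io.BytesIO(payload) as buffer:
        return [raw.decode(encoding).rstrip("\r\n") for raw in buffer.readlines()]


# Stand-in for response.content so the sketch runs without network access.
sample = b"# random\nfirst line\nsecond line\n"
for line in read_lines(sample):
    print(line)
```

If the text is all that is needed, io.StringIO(response.text) or the splitlines() loop shown earlier avoids the decode step entirely.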

Get number of results on Google search results in Python 3

Is there any way to return the number of Google search results in Python 3? I tried several approaches from SO, but none of them still work:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> def get_results(name):
...     re = requests.get('https://www.google.com/search', params={'q': name})
...     soup = BeautifulSoup(re.text, 'lxml')
...     response = soup.find('div', {'id': 'resultStats'})
...     return int(response.text.replace(',', '').split()[1])
...
>>> get_results('Leonardo DiCaprio')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in get_results
AttributeError: 'NoneType' object has no attribute 'text'
The variable response in your get_results() function is None because the request to Google returned an error page, so the div you are looking for does not exist. You should check for a successful response status, and that find() actually returned an element, before trying to parse the results.
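A minimal sketch of those two checks follows. To keep it self-contained it uses the standard library's html.parser instead of BeautifulSoup, and it takes the status code and page text as arguments rather than making a request; parse_result_count and the helper class are hypothetical names. With requests, the arguments would be the response's status_code and text.

```python
from html.parser import HTMLParser
from typing import Optional


class _ResultStatsParser(HTMLParser):
    """Collects the text inside the first <div id="resultStats">."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0        # > 0 while we are inside the target div
        self.found = False
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1   # nested div inside the target
        elif not self.found and dict(attrs).get("id") == "resultStats":
            self.found = True
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.text_parts.append(data)


def parse_result_count(status_code: int, html: str) -> Optional[int]:
    """Return the result count, or None if the page cannot be parsed."""
    if status_code != 200:     # error page: do not try to parse it
        return None
    parser = _ResultStatsParser()
    parser.feed(html)
    if not parser.found:       # the div is missing, as in the question
        return None
    words = "".join(parser.text_parts).replace(",", "").split()
    # Same idea as the original split()[1], but guarded.
    if len(words) < 2 or not words[1].isdigit():
        return None
    return int(words[1])
```

Returning None instead of raising lets the caller decide whether to retry, report the status, or give up.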

Beautiful soup 4 constructor error

I am running python 3.5 with BeautifulSoup4 and getting an error when I attempt to pass the plain text of a webpage to the constructor.
The source code I am trying to run is
import requests
from bs4 import BeautifulSoup
tcg = 'http://magic.tcgplayer.com/db/deck_search_result.asp?Format=Commander'
sourcecode = requests.get(tcg)
plaintext = sourcecode.text
soup = BeautifulSoup(plaintext)
When running this I get the following error
Traceback (most recent call last):
  File "/Users/Brian/PycharmProjects/magic_crawler/main.py", line 11, in <module>
    soup = BeautifulSoup(plaintext)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py", line 202, in __init__
    self._feed()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py", line 216, in _feed
    self.builder.feed(self.markup)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/builder/_htmlparser.py", line 156, in feed
    parser = BeautifulSoupHTMLParser(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'strict'
Python 3.5 is an alpha release (the first beta is expected this weekend but isn't out just yet at the time of this post). BeautifulSoup certainly hasn't claimed any compatibility with 3.5.
Stick to using Python 3.4 for now.
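For context on why the traceback mentions 'strict': the standard library's HTMLParser dropped its strict keyword in Python 3.5, so a bs4 release that still passes it fails exactly as shown. The following sketch only probes the running interpreter; accepts_strict_keyword is an illustrative name.

```python
from html.parser import HTMLParser


def accepts_strict_keyword() -> bool:
    """Return True if this interpreter's HTMLParser still accepts `strict`."""
    try:
        HTMLParser(strict=False)
    except TypeError:
        # Python 3.5 and later: the keyword no longer exists.
        return False
    return True


print(accepts_strict_keyword())
```

On Python 3.5 and later this prints False. On such interpreters the likely remedy is to install a newer beautifulsoup4 release that no longer passes the keyword, rather than to change the calling code.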

Finding specific text using BeautifulSoup

I'm trying to grab all the winner categories from this page:
http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013
I've written this in sublime:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013"
page = urllib2.urlopen(url)
soup_package = BeautifulSoup(page)
page.close()
# Find every div with class="bestOfItem". This works.
all_categories = soup_package.findAll("div", class_="bestOfItem")
# print(all_categories)
# This part breaks it:
soup = BeautifulSoup(all_categories)
winner = soup.a.string
print(winner)
When I run this in terminal, I get the following error:
Traceback (most recent call last):
  File "winners.py", line 12, in <module>
    soup = BeautifulSoup(all_categories)
  File "build/bdist.macosx-10.9-intel/egg/bs4/__init__.py", line 193, in __init__
  File "build/bdist.macosx-10.9-intel/egg/bs4/builder/_lxml.py", line 99, in prepare_markup
  File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 249, in encodings
  File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 304, in find_declared_encoding
TypeError: expected string or buffer
Does anyone know what's happening there?
You are trying to create a new BeautifulSoup object from a list of elements.
soup = BeautifulSoup(all_categories)
There is absolutely no need to do this here; just loop over each match instead:
for match in all_categories:
    winner = match.a.string
    print(winner)

Python/Mechanize - Can not select form - ParseError(exc)

I am getting this error:
>>> br = Browser()
>>> br.open("http://www.bestforumz.com/forum/")
<response_seek_wrapper at 0x21f9fd0 whose wrapped object = <closeable_response at 0x21f9558 whose fp = <socket._fileobject object at 0x021F5F30>>>
>>> br.select_form(nr=0)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    br.select_form(nr=0)
  File "build\bdist.win32\egg\mechanize\_mechanize.py", line 505, in select_form
    global_form = self._factory.global_form
  File "build\bdist.win32\egg\mechanize\_html.py", line 546, in __getattr__
    self.forms()
  File "build\bdist.win32\egg\mechanize\_html.py", line 559, in forms
    self._forms_factory.forms())
  File "build\bdist.win32\egg\mechanize\_html.py", line 228, in forms
    raise ParseError(exc)
ParseError: <unprintable ParseError object>
Please help me out.
Thanks
Here is a trick I have used to force mechanize to parse broken HTML:
br = mechanize.Browser(factory=mechanize.DefaultFactory(i_want_broken_xhtml_support=True))
mechanize isn't guaranteed to parse all HTML. You might have to do this by hand (which isn't too hard, this being Python).
Are you trying to post a query to the site's search.php page? You can use urllib2 for this.
import urllib2
import urllib
values = dict(foo="hello", bar="world")  # examine form for actual vars
try:
    req = urllib2.Request("http://example.com/search.php",
                          urllib.urlencode(values))
    response_page = urllib2.urlopen(req).read()
except urllib2.HTTPError as details:
    pass  # do something with the error here...
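The snippet above is Python 2 (urllib2). On Python 3 the same idea can be sketched with urllib.request, urllib.parse and urllib.error. The helper names build_search_request and fetch are illustrative, and the URL and form fields are the same placeholders as above; the real field names have to be read from the form on the page. Only the request is built here, so the sketch runs without network access.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_search_request(url, values):
    """Build a POST request whose body is the URL-encoded form values."""
    # Request expects the body as bytes, so encode the query string.
    body = urllib.parse.urlencode(values).encode("utf-8")
    return urllib.request.Request(url, data=body)


def fetch(request):
    """Send the request; return the body as bytes, or None on an HTTP error."""
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as details:
        print("Request failed with status", details.code)
        return None


# Placeholder URL and field names; inspect the real form for actual values.
req = build_search_request("http://example.com/search.php",
                           {"foo": "hello", "bar": "world"})
print(req.get_method(), req.data)
```

Calling fetch(req) would send the request. Supplying data is what turns the request into a POST.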
