I am running Python 3.5 with BeautifulSoup 4 and getting an error when I attempt to pass the plain text of a webpage to the constructor.
The source code I am trying to run is
import requests
from bs4 import BeautifulSoup
tcg = 'http://magic.tcgplayer.com/db/deck_search_result.asp?Format=Commander'
sourcecode = requests.get(tcg)
plaintext = sourcecode.text
soup = BeautifulSoup(plaintext)
When running this I get the following error:
Traceback (most recent call last):
File "/Users/Brian/PycharmProjects/magic_crawler/main.py", line 11, in <module>
soup = BeautifulSoup(plaintext)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py", line 202, in __init__
self._feed()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py", line 216, in _feed
self.builder.feed(self.markup)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/builder/_htmlparser.py", line 156, in feed
parser = BeautifulSoupHTMLParser(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'strict'
Python 3.5 is an alpha release (the first beta is expected this weekend but isn't out just yet at the time of this post). BeautifulSoup certainly hasn't claimed any compatibility with 3.5.
Stick to using Python 3.4 for now.
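As a quick sanity check (a sketch, not part of the original answer), you can print both versions before constructing a soup; on a compatible pair the toy parse below succeeds instead of raising the 'strict' TypeError:

```python
import sys

import bs4
from bs4 import BeautifulSoup

# Print both versions first: the 'strict' TypeError above comes from an
# old Beautiful Soup release meeting a Python it was never tested on.
print("Python:", ".".join(map(str, sys.version_info[:3])))
print("Beautiful Soup:", bs4.__version__)

# On a supported combination this constructs without a TypeError.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.string)
```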
UPDATE: Using lxml instead of html.parser helped solve the problem, as Freddier suggested in the answer below!
I am trying to webscrape some information off of this website: https://www.ticketmonster.co.kr/deal/952393926.
I get an error when I run soup(thispage, 'html.parser'), but this error only happens for this specific page. Does anyone know why this is happening?
The code I have so far is very simple:
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

url = 'https://www.ticketmonster.co.kr/deal/952393926'
openU = urlopen(url)
thispage = openU.read()
openU.close()
pageS = soup(thispage, 'html.parser')
The error I get is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 228, in __init__
self._feed()
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\__init__.py", line 289, in _feed
self.builder.feed(self.markup)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\site-packages\bs4\builder\_htmlparser.py", line 215, in feed
parser.feed(markup)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 111, in feed
self.goahead(0)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 179, in goahead
k = self.parse_html_declaration(i)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\html\parser.py", line 264, in parse_html_declaration
return self.parse_marked_section(i)
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 149, in parse_marked_section
sectName, j = self._scan_name( i+3, i )
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 391, in _scan_name
% rawdata[declstartpos:declstartpos+20])
File "C:\Users\Kathy\AppData\Local\Programs\Python\Python36\lib\_markupbase.py", line 34, in error
"subclasses of ParserBase must override error()")
NotImplementedError: subclasses of ParserBase must override error()
Please help!
Try using
pageS = soup(thispage, 'lxml')
instead of
pageS = soup(thispage, 'html.parser')
It looks like it may be a problem with character encoding when using "html.parser".
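For completeness, a minimal sketch of why the parser choice matters (the marked-section markup below is a stand-in for whatever the real page contains, and lxml is assumed to be installed):

```python
from bs4 import BeautifulSoup

# html.parser stumbled over a marked section (<![ ... ]>) somewhere in
# this page's markup; lxml recovers from such constructs and keeps parsing.
broken = "<html><body><![if mso]><p>deal info</p></body></html>"

soup = BeautifulSoup(broken, "lxml")
print(soup.p.string)
```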
For a school assignment we need to use some movie data. Instead of copy/pasting all the info I need, I thought I would scrape it off IMDb. However, I am not familiar with Python and I am running into an issue here.
This is my code:
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.imdb.com"
values = {'q' : 'The Matrix'}
data = urllib.parse.urlencode(values).encode("utf-8")
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
r = response.read()
soup = BeautifulSoup(r)
That code keeps giving me the error:
> Traceback (most recent call last): File "<pyshell#16>", line 1, in
> <module>
> soup = BeautifulSoup(r) File "C:\Users\My Name\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py",
> line 153, in __init__
> builder = builder_class() File "C:\Users\My Name\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\builder\_htmlparser.py",
> line 39, in __init__
>   return super(HTMLParserTreeBuilder, self).__init__(*args, **kwargs) TypeError: __init__() got an unexpected keyword argument 'strict'
Do any of you great minds know what I am doing wrong?
I tried using Google and found a post mentioning it might have something to do with requests, so I uninstalled requests and installed it again... it didn't work.
I'm trying to grab all the winner categories from this page:
http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013
I've written this in sublime:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013"
page = urllib2.urlopen(url)
soup_package = BeautifulSoup(page)
page.close()
#find everything in the div class="bestOfItem". This works.
all_categories = soup_package.findAll("div",class_="bestOfItem")
# print(all_categories)
#this part breaks it:
soup = BeautifulSoup(all_categories)
winner = soup.a.string
print(winner)
When I run this in terminal, I get the following error:
Traceback (most recent call last):
File "winners.py", line 12, in <module>
soup = BeautifulSoup(all_categories)
File "build/bdist.macosx-10.9-intel/egg/bs4/__init__.py", line 193, in __init__
File "build/bdist.macosx-10.9-intel/egg/bs4/builder/_lxml.py", line 99, in prepare_markup
File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 249, in encodings
File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 304, in find_declared_encoding
TypeError: expected string or buffer
Any one know what's happening there?
You are trying to create a new BeautifulSoup object from a list of elements.
soup = BeautifulSoup(all_categories)
There is absolutely no need to do this here; just loop over each match instead:
for match in all_categories:
    winner = match.a.string
    print(winner)
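Put together with a tiny synthetic snippet (a sketch; the real page's markup will differ), the pattern looks like this:

```python
from bs4 import BeautifulSoup

html = """
<div class="bestOfItem"><a href="/a">Best Coffee</a></div>
<div class="bestOfItem"><a href="/b">Best Pizza</a></div>
"""

soup_package = BeautifulSoup(html, "html.parser")
all_categories = soup_package.find_all("div", class_="bestOfItem")

# Each match is already a parsed Tag; there is no need to feed the
# result list back into a second BeautifulSoup() call.
for match in all_categories:
    print(match.a.string)
```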
I'm writing a web spider to get some information from a website. When I parse this page http://www.tripadvisor.com/Hotels-g294265-oa120-Singapore-Hotels.html#ACCOM_OVERVIEW, I find that some information is lost. I print the HTML doc using soup.prettify(), and it is not the same as the doc I get using urllib2.urlopen(); something is lost. The code is as follows:
htmlDoc = urllib2.urlopen(sourceUrl).read()
soup = BeautifulSoup(htmlDoc)
subHotelUrlTags = soup.findAll(name='a', attrs={'class' : 'property_title'})
print len(subHotelUrlTags)
#if len(subHotelUrlTags) != 30:
# print soup.prettify()
for hotelUrlTag in subHotelUrlTags:
hotelUrls.append(website + hotelUrlTag['href'])
I tried using HTMLParser to do the same thing, and it prints out the following errors:
Traceback (most recent call last):
File "./spider_new.py", line 47, in <module>
hotelUrls = getHotelUrls()
File "./spider_new.py", line 40, in getHotelUrls
hotelParser.close()
File "/usr/lib/python2.6/HTMLParser.py", line 112, in close
self.goahead(1)
File "/usr/lib/python2.6/HTMLParser.py", line 164, in goahead
self.error("EOF in middle of construct")
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: EOF in middle of construct, at line 3286, column 1
Download and install lxml.
It can parse such "faulty" webpages. (The HTML is probably broken in some weird way, and Python's HTML parser isn't great at understanding that sort of thing, even with bs4's help.)
Also, you don't need to change your code if you install lxml, BeautifulSoup will automatically pick up lxml and use it to parse the HTML instead.
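If you do want to be explicit rather than relying on auto-selection (a sketch with made-up markup; lxml assumed installed), name the parser when constructing the soup:

```python
from bs4 import BeautifulSoup

html = '<a class="property_title" href="/Hotel_Review-demo">Hotel A</a>'

# Naming the parser pins the behaviour across machines, instead of
# letting BeautifulSoup silently pick the best installed backend.
soup = BeautifulSoup(html, "lxml")
tags = soup.find_all("a", attrs={"class": "property_title"})
print([tag["href"] for tag in tags])
```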
I am having a strange error using BeautifulSoup.
Here is a snippet of the code I am running:
while True:
listing_soup = soupify(urlget(page_url))
for i in listing_soup.findAll('div', 'searchResultContent'):
# do some stuff ...
Here is the exception being thrown:
Traceback (most recent call last):
File "C:\path\to\script.py", line 716, in <module>
for i in listing_soup.findAll('div', 'searchResultContent'):
File "c:\python27\BeautifulSoup.py", line 612, in findAll
return self._findAll(name, attrs, text, limit, generator, **kwargs)
File "c:\python27\BeautifulSoup.py", line 275, in _findAll
strainer = SoupStrainer(name, attrs, text, **kwargs)
File "c:\python27\BeautifulSoup.py", line 660, in __init__
self.attrs=attrs.copy()
AttributeError: 'str' object has no attribute 'copy'
I am running Python 2.7.3 on Windows XP Professional. This script works fine on Ubuntu Linux.
Note:
I am expecting the data from the web to be UTF-8, so the Python script starts with the following line:
# coding=utf-8
Judging from the line numbers, you're using Beautiful Soup 3.0.0, which doesn't have the "search by CSS class" shortcut you're trying to use (it was reintroduced in 3.0.1). More to the point, you're using a version of the software that's five years old. I recommend Beautiful Soup 4 for all new projects.
Most likely you don't see the problem on Ubuntu because your Ubuntu installation is running a more recent version of Beautiful Soup.
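In Beautiful Soup 4, the same by-class search is spelled with the class_ keyword (a minimal sketch with toy markup):

```python
from bs4 import BeautifulSoup

html = ('<div class="searchResultContent">item 1</div>'
        '<div class="searchResultContent">item 2</div>')
soup = BeautifulSoup(html, "html.parser")

# bs4 accepts either a bare string or the class_ keyword as the
# CSS-class shortcut, avoiding the old attrs-as-string pitfall.
results = soup.find_all("div", class_="searchResultContent")
print(len(results))
```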