Finding specific text using BeautifulSoup - python

I'm trying to grab all the winner categories from this page:
http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013
I've written this in Sublime:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.chicagoreader.com/chicago/BestOf?category=4053660&year=2013"
page = urllib2.urlopen(url)
soup_package = BeautifulSoup(page)
page.close()
# find every div with class="bestOfItem". This works.
all_categories = soup_package.findAll("div",class_="bestOfItem")
# print(all_categories)
#this part breaks it:
soup = BeautifulSoup(all_categories)
winner = soup.a.string
print(winner)
When I run this in terminal, I get the following error:
Traceback (most recent call last):
File "winners.py", line 12, in <module>
soup = BeautifulSoup(all_categories)
File "build/bdist.macosx-10.9-intel/egg/bs4/__init__.py", line 193, in __init__
File "build/bdist.macosx-10.9-intel/egg/bs4/builder/_lxml.py", line 99, in prepare_markup
File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 249, in encodings
File "build/bdist.macosx-10.9-intel/egg/bs4/dammit.py", line 304, in find_declared_encoding
TypeError: expected string or buffer
Does anyone know what's happening there?

You are trying to create a new BeautifulSoup object from a list of elements.
soup = BeautifulSoup(all_categories)
There is no need to do this here. all_categories is already a list of parsed Tag objects, so just loop over each match instead:
for match in all_categories:
    winner = match.a.string
    print(winner)
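As a self-contained illustration of the loop above, here is a minimal sketch that runs against a small inline document instead of the live page. The markup and the category names in it are made up for the example, and "html.parser" is chosen only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="bestOfItem"><a href="/a">Best Pizza</a></div>
<div class="bestOfItem"><a href="/b">Best Bar</a></div>
<div class="other"><a href="/c">Not a winner</a></div>
"""

soup_package = BeautifulSoup(html, "html.parser")

# find_all returns a list-like result of Tag objects, not markup text,
# so each element can be queried directly.
all_categories = soup_package.find_all("div", class_="bestOfItem")

winners = []
for match in all_categories:
    link = match.a
    # Guard against divs without a link or without plain text inside it.
    if link is not None and link.string is not None:
        winners.append(str(link.string))

print(winners)
```

The None checks are an addition for robustness; on a page where every matching div contains a simple link they are not strictly needed.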

Related

How to web scrape from a txt file

Let's say I use an online tool like HTML Source Code Viewer:
I enter a link and it generates the HTML source code.
Then I select only the <li> tags that I want, something like this:
<li class='item'><a class='list-link' href='https://foo1.com'><img src='https://foo1.com/imgfoo1.jpg' /></a></li><li class='item'><a class='list-link' href='https://foo2.com'><img src='https://foo1.com/imgfoo2.jpg' /></a></li><li class='item'><a class='list-link' href='https://foo3.com'><img src='https://foo1.com/imgfoo3.jpg' /></a></li>
Sometimes it's one long line like this. I then save it to a text file named urlcontainer.txt.
So, how should I scrape that?
This is the code I run in Python from the terminal:
import requests
import numpy as np
from bs4 import BeautifulSoup as soup
page_html = np.genfromtxt('urlcontainer.txt',dtype='str')
page_soup = soup(page_html, "html.parser") #I got the error on this line
And this is the error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 225, in __init__
markup, from_encoding, exclude_encodings=exclude_encodings)):
File "/usr/lib/python2.7/dist-packages/bs4/builder/_htmlparser.py", line 157, in prepare_markup
exclude_encodings=exclude_encodings)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 352, in __init__
markup, override_encodings, is_html, exclude_encodings)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 228, in __init__
self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 280, in strip_byte_order_mark
if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
The thing is, when I type page_html in the terminal, this is the value:
array(['<li', "class='item'><a", "class='list-link'",
"href='https://foo1.com'><img",
"src='https://foo1.com/imgfoo1.jpg'", '/></a></li><li',
"class='item'><a", "class='list-link'",
"href='https://foo2.com'><img",
"src='https://foo1.com/imgfoo2.jpg'", '/></a></li><li',
"class='item'><a", "class='list-link'",
"href='https://foo3.com'><img",
"src='https://foo1.com/imgfoo3.jpg'", '/></a></li>'],
dtype='|S34')
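Just read the file as you normally would; there is no need for NumPy. np.genfromtxt splits the text on whitespace into an array of fragments, which BeautifulSoup cannot parse.
from bs4 import BeautifulSoup
with open("urlcontainer.txt") as f:
    page = f.read()
soup = BeautifulSoup(page, "html.parser")
Then carry on with your parsing.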
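To see the difference between the two ways of loading the file, here is a small sketch using only the standard library. It writes a shortened copy of the markup from the question to a temporary file (the temporary directory is an assumption made so the example cleans up after itself), reads it back as one string, and then splits on whitespace to mimic the fragments that showed up in the NumPy array.

```python
import os
import tempfile

# Shortened copy of the one-line markup from the question.
markup = ("<li class='item'><a class='list-link' href='https://foo1.com'>"
          "<img src='https://foo1.com/imgfoo1.jpg' /></a></li>"
          "<li class='item'><a class='list-link' href='https://foo2.com'>"
          "<img src='https://foo1.com/imgfoo2.jpg' /></a></li>")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "urlcontainer.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)

    # Plain read: the whole file comes back as a single string,
    # which is what a parser expects.
    with open(path, encoding="utf-8") as f:
        page = f.read()

# Splitting on whitespace breaks the tags apart, much like the
# array shown in the question.
fragments = page.split()

print(type(page).__name__)
print(fragments[:3])
```

The string in page can be handed straight to BeautifulSoup(page, "html.parser"); the list in fragments cannot.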

Error trying to use BeautifulSoup

For a school assignment we need to use some movie data. Instead of copying and pasting all the info I need, I thought I would scrape it from IMDb. However, I am not familiar with Python and I am running into an issue.
This is my code:
import urllib
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.imdb.com"
values = {'q' : 'The Matrix'}
data = urllib.parse.urlencode(values).encode("utf-8")
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
r = response.read()
soup = BeautifulSoup(r)
That code keeps giving me the error:
Traceback (most recent call last):
File "<pyshell#16>", line 1, in <module>
soup = BeautifulSoup(r)
File "C:\Users\My Name\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 153, in __init__
builder = builder_class()
File "C:\Users\My Name\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\builder\_htmlparser.py", line 39, in __init__
return super(HTMLParserTreeBuilder, self).__init__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'strict'
Do any of you great minds know what I am doing wrong?
I searched Google and found a post suggesting it might have something to do with requests, so I uninstalled requests and installed it again, but that didn't work.

How to recover a document from improperly closed tags in Python?

Here is my problem
I have a sample text like
text="""<!--translated from:
The Dutch Royal Library
"""
I tried to strip the tags from this text, but I always get an error using this code:
t = html.fromstring(text)
ctext = t.text_content()
and my error is
Traceback (most recent call last):
File "test.py", line 31, in <module>
t = html.fromstring(text)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 535, in document_fromstring
"Document is empty")
lxml.etree.ParserError: Document is empty
I traced the bug I found that removing unclosed
I already tried using BeautifulSoup
and here is my code
soup = BeautifulSoup(text)
print soup.prettify()
but to no avail. Can anyone help me?
Try removing the unclosed comment opener (the first four characters, <!--):
soup = BeautifulSoup(text[4:])
print soup.prettify()
Then BeautifulSoup will be able to find the content. More information about the library is available on its documentation page.
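Slicing off exactly four characters only works when the text is known to start with the opener. A slightly more defensive variant might look like the sketch below; strip_unclosed_comment is a hypothetical helper written for this example, not part of BeautifulSoup or lxml, and it only handles the single case of a leading comment that is never closed.

```python
def strip_unclosed_comment(text):
    """Drop a leading '<!--' when no closing '-->' follows it.

    Hypothetical helper: text with a properly closed comment, or with
    no leading comment at all, is returned unchanged.
    """
    stripped = text.lstrip()
    if stripped.startswith("<!--") and "-->" not in stripped:
        return stripped[len("<!--"):]
    return text


text = """<!--translated from:
The Dutch Royal Library
"""

cleaned = strip_unclosed_comment(text)
print(cleaned)
```

The cleaned string can then be passed to BeautifulSoup or lxml as in the snippets above.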

Information lost when using BeautifulSoup to parse an HTML page

I'm writing a web spider to get some information from a website. When I parse this page, http://www.tripadvisor.com/Hotels-g294265-oa120-Singapore-Hotels.html#ACCOM_OVERVIEW, I find that some information is lost. Printing the parsed document with soup.prettify() shows that it is not the same as the document returned by urllib2.urlopen(); part of it is missing. My code is as follows:
htmlDoc = urllib2.urlopen(sourceUrl).read()
soup = BeautifulSoup(htmlDoc)
subHotelUrlTags = soup.findAll(name='a', attrs={'class' : 'property_title'})
print len(subHotelUrlTags)
#if len(subHotelUrlTags) != 30:
# print soup.prettify()
for hotelUrlTag in subHotelUrlTags:
    hotelUrls.append(website + hotelUrlTag['href'])
When I try to do the same thing with HTMLParser, it prints the following error:
Traceback (most recent call last):
File "./spider_new.py", line 47, in <module>
hotelUrls = getHotelUrls()
File "./spider_new.py", line 40, in getHotelUrls
hotelParser.close()
File "/usr/lib/python2.6/HTMLParser.py", line 112, in close
self.goahead(1)
File "/usr/lib/python2.6/HTMLParser.py", line 164, in goahead
self.error("EOF in middle of construct")
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: EOF in middle of construct, at line 3286, column 1
Download and install lxml.
It can parse such "faulty" webpages. (The HTML is probably broken in some odd way, and Python's built-in HTML parser isn't great at handling that sort of thing, even with bs4's help.)
You also don't need to change your code after installing lxml: when no parser is named, BeautifulSoup automatically picks up lxml and uses it to parse the HTML instead.
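If you would rather name the parser explicitly than rely on automatic selection, one possible approach is sketched below. pick_parser is a hypothetical helper for this example; "lxml" and "html.parser" are the parser names BeautifulSoup accepts as its second argument.

```python
import importlib.util


def pick_parser():
    """Return "lxml" when that package is importable, else "html.parser".

    Hypothetical helper: it only checks whether lxml is installed,
    it does not import it or parse anything.
    """
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser_name = pick_parser()
print(parser_name)

# Intended use with the code from the question:
# soup = BeautifulSoup(htmlDoc, parser_name)
```

Naming the parser makes the script behave the same way on every machine where lxml is installed, instead of silently depending on which parsers happen to be available.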

Why won't Python regex work on a formatted string of HTML?

from bs4 import BeautifulSoup
import urllib
import re
soup = urllib.urlopen("http://atlanta.craigslist.org/cto/")
soup = BeautifulSoup(soup)
souped = soup.p
print souped
m = re.search("\\$.",souped)
print m.group(0)
I can download and print out the html just fine, but it always breaks when I add the last two lines.
I get this error:
Traceback (most recent call last):
File "C:\Python27\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 323, in RunScript
debugger.run(codeObject, __main__.__dict__, start_stepping=0)
File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\__init__.py", line 60, in run
_GetCurrentDebugger().run(cmd, globals,locals, start_stepping)
File "C:\Python27\Lib\site-packages\pythonwin\pywin\debugger\debugger.py", line 655, in run
exec cmd in globals, locals
File "C:\Users\Zack\Documents\Scripto.py", line 1, in <module>
from bs4 import BeautifulSoup
File "C:\Python27\lib\re.py", line 142, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
Thanks lots!
You probably want re.search("\\$.", str(souped)).
souped is a Tag object, not a string. Printing it converts it to text for you, but re.search does not. To use it as text in another context, convert it first with str(souped), or with unicode(souped) if you need a unicode string.
You could also pass a compiled regex as the search criterion to the .find() method:
>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen # from urllib.request import urlopen
>>> import re
>>> page = urlopen("http://atlanta.craigslist.org/cto/")
>>> soup = BeautifulSoup(page)
>>> soup.find('p', text=re.compile(r"\$."))
' -\n\t\t\t $7500'
soup.p returns a Tag object. You could use str() or unicode() to convert it to a string:
>>> p = soup.p
>>> str(p)
'<p class="row">\n<span class="ih" id="images:5Nb5I85J83N73p33H6
c2pd3447d5bff6d1757.jpg">\xa0</span>\n<a href="http://atlanta.cr
aigslist.org/nat/cto/2870295634.html">2000 Lexus RX 300</a> -\n\
t\t\t $7500<font size="-1"> (Buford)</font> <span class="p"> pic
\xa0img</span><br class="c" />\n</p>'
>>> re.search(r"\$.", str(p)).group(0)
'$7'
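The same failure and fix can be reproduced without downloading anything. In the sketch below, FakeTag is a hypothetical stand-in for a bs4 Tag (an object that prints as markup but is not itself a string), and the pattern \$\d+ is swapped in for \$. so that the whole price is captured rather than just the first character after the dollar sign.

```python
import re


class FakeTag:
    """Hypothetical stand-in for a bs4 Tag: printable, but not a string."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


souped = FakeTag('<p class="row">2000 Lexus RX 300 - $7500</p>')

# Passing the object itself fails, because re only accepts
# strings or bytes-like objects.
try:
    re.search(r"\$\d+", souped)
    raised = False
except TypeError:
    raised = True

# Converting to text first works.
match = re.search(r"\$\d+", str(souped))
price = match.group(0) if match else None

print(raised, price)
```

Checking match before calling group(0) avoids an AttributeError on paragraphs that contain no price at all.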
