Beautiful Soup 'ResultSet' object has no attribute 'text' - python

from bs4 import BeautifulSoup
import urllib.request
import win_unicode_console
win_unicode_console.enable()
link = ('https://pietroalbini.io/')
req = urllib.request.Request(link, headers={'User-Agent': 'Mozilla/5.0'})
url = urllib.request.urlopen(req).read()
soup = BeautifulSoup(url, "html.parser")
body = soup.find_all('div', {"class":"wrapper"})
print(body.text)
Hi, I have a problem with Beautiful Soup. If I run this code without ".text" at the end it shows me a list of divs, but if I add ".text" at the end I get this error:
Traceback (most recent call last):
File "script.py", line 15, in
print(body.text)
AttributeError: 'ResultSet' object has no attribute 'text'

find_all returns a ResultSet object, which you can iterate over using a for loop. What you can do is:
for wrapper in soup.find_all('div', {"class": "wrapper"}):
    print(wrapper.text)

If you type:
print(type(body))
you'll see that body is <class 'bs4.element.ResultSet'>, i.e. the collection of all elements that match the class. You can either iterate over them:
for div in body:
    print(div.text)
Or, if you know there is only one such div, use find instead:
div = soup.find('div', {"class": "wrapper"})
print(div.text)
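To see the difference in behaviour, here is a minimal sketch using an inline HTML string (the markup is a hypothetical stand-in for the real page). Note that find returns only the first match, and returns None when nothing matches, so guard before using .text:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page
html = '<div class="wrapper">first</div><div class="wrapper">second</div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div', {"class": "wrapper"})   # first match only
print(div.text)                                # first

missing = soup.find('div', {"class": "no-such-class"})
print(missing)                                 # None - guard before using .text
```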

Posting this as an answer, as stated in the comments almost verbatim.
Your code should be the following:
for div in body:
    print(div.text)
Or whatever naming scheme you prefer.
The find_all method returns a list-like collection (loosely using the term list here) of the items BeautifulSoup found matching your criteria after parsing the source page's HTML, searched recursively or non-recursively depending on how you call it.
As the error says, the resulting set of objects has no attribute text, since it isn't a single element but a collection of them.
However, the items inside the result set (should any be found) do.
You can view the documentation here
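As a quick illustration of that distinction, here is a minimal sketch with an inline HTML string (hypothetical markup, not the real page): the ResultSet itself has no .text, but each item inside it does:

```python
from bs4 import BeautifulSoup

html = '<div class="wrapper">one</div><div class="wrapper">two</div>'
soup = BeautifulSoup(html, 'html.parser')
body = soup.find_all('div', {"class": "wrapper"})

print(type(body).__name__)         # ResultSet - the collection itself
print([div.text for div in body])  # the items have .text
```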

Related

Why does ResultSet object has no attribute 'find'?

I'm trying to scrape the text off inside the "Other areas of Wikipedia" section on the Wikipedia front page. However, I run into the error ResultSet object has no attribute 'find'. What's wrong with my code and how do I get it to work?
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml' )
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
otherAreasContainerTexts = otherAreasContainer.find_all('li')
for otherAreasContainerText in otherAreasContainerTexts:
    print(otherAreasContainerText.text)
In your code otherAreasContainer is of type ResultSet, and ResultSet doesn't have a .find_all() method.
To select all <li> from under the "Other areas of Wikipedia", you can use CSS selector h2:contains("Other areas of Wikipedia") + div li.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
for li in soup.select('h2:contains("Other areas of Wikipedia") + div li'):
    print(li.text)
Prints:
Community portal – Bulletin board, projects, resources and activities covering a wide range of Wikipedia areas.
Help desk – Ask questions about using Wikipedia.
Local embassy – For Wikipedia-related communication in languages other than English.
Reference desk – Serving as virtual librarians, Wikipedia volunteers tackle your questions on a wide range of subjects.
Site news – Announcements, updates, articles and press releases on Wikipedia and the Wikimedia Foundation.
Village pump – For discussions about Wikipedia itself, including areas for technical issues and policies.
More about CSS Selectors.
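The same selector can be tried against a small inline snippet (a simplified, hypothetical stand-in for the Wikipedia front-page structure). The "+" combinator selects the div immediately following the matching h2; note that recent soupsieve versions may emit a deprecation warning for :contains() in favour of :-soup-contains():

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the Wikipedia front-page structure
html = '''
<h2>Other areas of Wikipedia</h2>
<div><ul><li>Help desk</li><li>Village pump</li></ul></div>
<h2>Something else</h2>
<div><ul><li>Unrelated</li></ul></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# "+" selects only the div immediately after the matching h2
for li in soup.select('h2:contains("Other areas of Wikipedia") + div li'):
    print(li.text)
```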
Running your code I got
Traceback (most recent call last):
File "h.py", line 7, in <module>
otherAreasContainerTexts = otherAreasContainer.find_all('li')
File "/home/td/anaconda3/lib/python3.7/site-packages/bs4/element.py", line 1620, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
This should be part of your question - make it easy for us to spot your problem!
find_all returns a ResultSet, which is essentially a list of the elements found. You need to iterate over each of the elements to continue:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml' )
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
for other in otherAreasContainer:
    otherAreasContainerTexts = other.find_all('li')
    for otherAreasContainerText in otherAreasContainerTexts:
        print(otherAreasContainerText.text)
The result of find_all is a list, and a list has no find or find_all attribute; you must iterate over otherAreasContainer and call find_all on each element, like this:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
for other in otherAreasContainer:
    otherAreasContainerTexts = other.find_all('li')
    for otherAreasContainerText in otherAreasContainerTexts:
        print(otherAreasContainerText.text)

TypeError: 'NoneType' object is not iterable using BeautifulSoup

I am pretty new to Python and this could be a very simple type of error, but I can't work out what's wrong. I am trying to get the links from a website containing a specific substring, but I get "TypeError: 'NoneType' object is not iterable" when I do. I believe the problem is related to the links I get from the website. Does anybody know what the problem is here?
from bs4 import BeautifulSoup
from urllib.request import urlopen
html_page = urlopen("http://www.scoresway.com/?sport=soccer&page=competition&id=87&view=matches")
soup = BeautifulSoup(html_page, 'html.parser')
lista=[]
for link in soup.find_all('a'):
    lista.append(link.get('href'))
for text in lista:
    if "competition" in text:
        print(text)
You're getting a TypeError exception because some 'a' tags don't have a 'href' attribute, so get('href') returns None, which is not iterable.
You can fix this by replacing:
soup.find_all('a')
with:
soup.find_all('a', href=True)
to ensure that all the links you collect have a 'href' attribute.
In the line lista.append(link.get('href')) the expression link.get('href') can return None. After that you try to evaluate "competition" in text, where text can be None, which is not iterable. To avoid this, use link.get('href', '') to set a default value for get(): the empty string '' is iterable.
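Both fixes can be seen side by side on a minimal inline snippet (the anchor tags here are hypothetical examples, not the real page):

```python
from bs4 import BeautifulSoup

# Anchors with and without href - the second one is where None comes from
html = '<a href="/competition/87">match</a><a name="anchor-only">no link</a>'
soup = BeautifulSoup(html, 'html.parser')

# Option 1: only collect <a> tags that actually carry an href
with_href = [a['href'] for a in soup.find_all('a', href=True)]
print(with_href)  # ['/competition/87']

# Option 2: default to '' so the substring test never sees None
for a in soup.find_all('a'):
    text = a.get('href', '')
    if "competition" in text:
        print(text)
```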
I found mistakes in two places.
First of all, that import only works on Python 3; on Python 2 there is no urllib.request module:
from urllib.request import urlopen
# on Python 2 this would be:
from urllib import urlopen
Second, when you fetch the links from the page, link.get('href') returns None for a few links:
print(lista)
# prints [None, u'http://facebook.com/scoresway', u'http://twitter.com/scoresway', ...., None]
As you can see, your list contains two None values, and that's why you get "TypeError: 'NoneType' object is not iterable" when iterating over it.
How to fix it?
You should keep the None values out of the list:
for link in soup.find_all('a'):
    href = link.get('href')
    if href is not None:  # add this check
        lista.append(href)

BeautifulSoup search on beautifulsoup result?

Scraping a hotel website to retrieve titles and prices.
"hotelInfo" is the div that holds the interesting content.
It makes sense to me that I would want to only perform my operations on this div. My code is as follows -
from bs4 import BeautifulSoup
import requests
response = requests.get("http://$hotelurlhere.com")
soup = BeautifulSoup(response.text)
hotelInfo = soup.select('div.hotel-wrap')
hotelTitle = soup.find_all('h3', attrs={'class': 'p-name'})
hotelNameList = []
hotelPriceList = []
for hotel in hotelInfo:
    for title in hotelTitle:
        hotelNameList.append(title.text)
It makes more sense to me that hotelTitle should be a BeautifulSoup search on hotelInfo above. However, when I try this:
hotelTitle = hotelInfo.find_all('h3', attrs={'class': 'p-name'})
Error message:
Traceback (most recent call last):
File "main.py", line 8, in <module>
hotelTitle = hotelInfo.find_all('h3', attrs={'class': 'p-name'})
AttributeError: 'list' object has no attribute 'find_all'
An error was returned which was related to the list element not having an attribute of "find_all". I understand that this is because hotelInfo is a list element that was returned. I've searched for information on the correct way to check for the h3 info within this list but I am not having any success.
What is the best way to do this?
Shouldn't I be able to set hotelTitle to hotelInfo.find_all rather than just soup.find_all?
As the error message clearly suggests, there is no find_all() method which you can invoke in a list object. In this case, you should call find_all() on individual member of the list instead, assuming that you need some information from the div.hotel-wrap as well as the corresponding h3 :
for hotel in hotelInfo:
    hotelTitle = hotel.find_all('h3', attrs={'class': 'p-name'})
If you only need the h3 elements, you can combine the two selectors to get them directly without having to find hotelInfo first :
hotelTitle = soup.select('div.hotel-wrap h3.p-name')
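Here is a minimal sketch of that combined selector on an inline snippet (the markup is a hypothetical stand-in mirroring the hotel page's structure); the descendant selector keeps each title paired with its wrapping div and ignores stray h3 elements outside a wrap:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the hotel page's structure
html = '''
<div class="hotel-wrap"><h3 class="p-name">Hotel One</h3></div>
<div class="hotel-wrap"><h3 class="p-name">Hotel Two</h3></div>
<h3 class="p-name">Not inside a wrap</h3>
'''
soup = BeautifulSoup(html, 'html.parser')

# Only h3.p-name elements nested under div.hotel-wrap are selected
names = [h3.text for h3 in soup.select('div.hotel-wrap h3.p-name')]
print(names)  # ['Hotel One', 'Hotel Two']
```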
You can also pair each hotel with its title using zip, like this:
for hotelinfo, hoteltitle in zip(hotelInfo, hotelTitle):
    data = {
        'hotelinfo': hotelinfo.get_text(),
        'hoteltitle': hoteltitle.get_text(),
    }
    print(data)

Why does find_all give an error even though there is no error in just find? (Python Beautiful Soup)

I am trying to get the titles of songs from Billboard top 100.
The picture shows the relevant part of their HTML.
I wrote this code:
from bs4 import BeautifulSoup
import urllib.request
url= 'http://www.billboard.com/charts/year-end/2015/hot-100-songs'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
songtitle = soup.find("div", {"class": "row-title"}).h2.contents
print(songtitle)
It retrieves the first title "UPTOWN FUNK!"
When I use find_all it gives me the error:
line 6, in <module>
songtitle = soup.find_all("div", {"class": "row-title"}).h2.contents
AttributeError: 'ResultSet' object has no attribute 'h2'
Why does it give me an error instead of giving me all the titles? The full html script can be found by using Control Shift J in chrome on this site: http://www.billboard.com/charts/year-end/2015/hot-100-songs
.find_all() returns a ResultSet object which is basically a list of Tag instances - it does not have a find() method. You need to loop over the results of find_all() and call find() on each tag:
for item in soup.find_all("div", {"class": "row-title"}):
    songtitle = item.h2.contents
    print(songtitle)
Or, make a CSS selector:
for title in soup.select("div.row-title h2"):
    print(title.get_text())
By the way, this problem is covered in the documentation:
AttributeError: 'ResultSet' object has no attribute 'foo' - This
usually happens because you expected find_all() to return a single tag
or string. But find_all() returns a list of tags and strings–a
ResultSet object. You need to iterate over the list and look at the
.foo of each one. Or, if you really only want one result, you need to
use find() instead of find_all().
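The documentation's point can be seen on a minimal inline snippet (hypothetical markup reproducing the two titles from the chart): find() gives a single Tag so attribute access works directly, while find_all() gives a ResultSet that must be looped over first:

```python
from bs4 import BeautifulSoup

html = ('<div class="row-title"><h2>UPTOWN FUNK!</h2></div>'
        '<div class="row-title"><h2>THINKING OUT LOUD</h2></div>')
soup = BeautifulSoup(html, 'html.parser')

# find() -> one Tag (or None), so .h2 works directly
first = soup.find("div", {"class": "row-title"})
print(first.h2.text)  # UPTOWN FUNK!

# find_all() -> ResultSet, so loop before touching .h2
for div in soup.find_all("div", {"class": "row-title"}):
    print(div.h2.text)
```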
find_all always returns a list, so you can use list operations such as indexing.
For example:
songtitle = soup.find_all("div", {"class": "row-title"})[0]
print(songtitle.h2.text)
songtitle = soup.find_all("div", {"class": "row-title"})[1]
print(songtitle.h2.text)
Output:
UPTOWN FUNK!
THINKING OUT LOUD
Or loop over all of them:
for item in soup.find_all("div", {"class": "row-title"}):
    songtitle = item.h2.text
    print(songtitle)

BeautifulSoup: AttributeError: 'NavigableString' object has no attribute 'name'

Do you know why the first example in the BeautifulSoup tutorial http://www.crummy.com/software/BeautifulSoup/documentation.html#QuickStart gives AttributeError: 'NavigableString' object has no attribute 'name'? According to this answer, the space characters in the HTML cause the problem. I tried with the sources of a few pages; one worked, the others gave the same error (even after I removed the spaces). Can you explain what "name" refers to and why this error happens? Thanks.
Just ignore NavigableString objects while iterating through the tree:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for body_child in soup.body.children:
    if isinstance(body_child, NavigableString):
        continue
    if isinstance(body_child, Tag):
        print(body_child.name)
name refers to the name of the tag, if the object is a Tag object (i.e. for <html>, name = "html").
If you have spaces in your markup between nodes, BeautifulSoup turns them into NavigableStrings. So if you use the index of contents to grab nodes, you might grab a NavigableString instead of the next Tag.
To avoid this, query for the node you are looking for: Searching the Parse Tree.
Or, if you know the name of the next tag you want, you can use that name as a property, and it will return the first Tag with that name, or None if no child with that name exists: Using Tag Names as Members.
If you want to use contents, you have to check the type of the objects you are working with. The error you are getting just means the code tried to access the name property while assuming the object is a Tag.
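A minimal sketch of that type check, using an inline HTML string (hypothetical markup): the whitespace between tags shows up as NavigableString entries in .contents, interleaved with the Tags:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# The newlines between tags become NavigableString entries in .contents
html = '<body>\n<p>hello</p>\n<p>world</p>\n</body>'
soup = BeautifulSoup(html, 'html.parser')

for child in soup.body.contents:
    if isinstance(child, Tag):
        print(child.name)          # only Tags have a meaningful .name
    elif isinstance(child, NavigableString):
        print(repr(str(child)))    # the stray whitespace nodes
```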
You can use try/except to skip the cases where a NavigableString is hit in the loop (catching the AttributeError it raises, since NavigableString itself is not an exception class), like this:
for j in soup.find_all(...):
    try:
        print(j.find(...))
    except AttributeError:
        pass
This is the latest working code to obtain the names of the tags in the soup:
import requests
from bs4 import BeautifulSoup, Tag

res = requests.get(url).content
soup = BeautifulSoup(res, 'lxml')
for child in soup.body.children:
    if isinstance(child, Tag):
        print(child.name)
