I'm trying to scrape the text inside the "Other areas of Wikipedia" section on the Wikipedia front page. However, I run into the error ResultSet object has no attribute 'find_all'. What's wrong with my code, and how do I get it to work?
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
otherAreasContainerTexts = otherAreasContainer.find_all('li')
for otherAreasContainerText in otherAreasContainerTexts:
    print(otherAreasContainerText.text)
In your code otherAreasContainer is of type ResultSet, and a ResultSet doesn't have a .find_all() method.
To select all <li> under the "Other areas of Wikipedia" heading, you can use the CSS selector h2:contains("Other areas of Wikipedia") + div li.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
for li in soup.select('h2:contains("Other areas of Wikipedia") + div li'):
    print(li.text)
Prints:
Community portal – Bulletin board, projects, resources and activities covering a wide range of Wikipedia areas.
Help desk – Ask questions about using Wikipedia.
Local embassy – For Wikipedia-related communication in languages other than English.
Reference desk – Serving as virtual librarians, Wikipedia volunteers tackle your questions on a wide range of subjects.
Site news – Announcements, updates, articles and press releases on Wikipedia and the Wikimedia Foundation.
Village pump – For discussions about Wikipedia itself, including areas for technical issues and policies.
More about CSS Selectors.
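Note that newer releases of soupsieve (the library BeautifulSoup uses for .select()) deprecate :contains() in favour of :-soup-contains(). A minimal variant of the same selector, assuming a recent soupsieve is installed:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# :-soup-contains() replaces the deprecated :contains() pseudo-class
for li in soup.select('h2:-soup-contains("Other areas of Wikipedia") + div li'):
    print(li.text)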
Running your code, I got:
Traceback (most recent call last):
File "h.py", line 7, in <module>
otherAreasContainerTexts = otherAreasContainer.find_all('li')
File "/home/td/anaconda3/lib/python3.7/site-packages/bs4/element.py", line 1620, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
This should be part of your question - make it easy for us to spot your problem!
find_all returns a ResultSet, which is essentially a list of the elements found. You need to iterate over the elements to continue:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
for other in otherAreasContainer:
    otherAreasContainerTexts = other.find_all('li')
    for otherAreasContainerText in otherAreasContainerTexts:
        print(otherAreasContainerText.text)
The result of find_all is a list-like ResultSet, and it has no find or find_all attribute. You must iterate over otherAreasContainer and call find_all on each element, like this:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
for other in otherAreasContainer:
    otherAreasContainerTexts = other.find_all('li')
    for otherAreasContainerText in otherAreasContainerTexts:
        print(otherAreasContainerText.text)
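If you only want the "Other areas of Wikipedia" section rather than every mp-bordered div, one option is to locate the heading first and walk to the div that follows it. A sketch, assuming the section heading text is still present on the main page:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Find the <h2> heading by its text, then the div that follows it
heading = None
for h2 in soup.find_all('h2'):
    if 'Other areas of Wikipedia' in h2.get_text():
        heading = h2
        break

if heading is not None:
    for li in heading.find_next('div').find_all('li'):
        print(li.text)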
Related
I want to scrape data from booking.com but I get some errors and couldn't find any similar code.
I want to scrape the name of the hotel, the price, etc.
I have tried BeautifulSoup 4 and tried to get the data into a CSV file.
import requests
from bs4 import BeautifulSoup
import pandas
# Replace search_url with a valid one by visiting and searching booking.com
search_url = 'https://www.booking.com/searchresults.....'
page = requests.get(search_url)
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id = 'search_results_table' )
#print(week)
items = week.find_all(class_='sr-hotel__name')
print(items[0])
print(items[0].find(class_ = 'sr-hotel__name').get_text())
print(items[0].find(class_ = 'short-desc').get_text())
Here is a sample URL that can be used in place of search_url.
This is the error message:
<span class="sr-hotel__name " data-et-click="
">
The Fort Printers
</span>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-44-77b38c8546bb> in <module>
11 items = week.find_all(class_='sr-hotel__name')
12 print(items[0])
---> 13 print(items[0].find(class_ = 'sr-hotel__name').get_text())
14 print(items[0].find(class_ = 'short-desc').get_text())
15
AttributeError: 'NoneType' object has no attribute 'get_text'
Instead of using the find() method multiple times, consider using the getText() method directly:
import requests
from bs4 import BeautifulSoup
import pandas
# Replace search_url with a valid one by visiting and searching booking.com
search_url = 'https://www.booking.com/searchresults.....'
page = requests.get(search_url)
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id = 'search_results_table' )
#print(week)
items = week.find_all(class_='sr-hotel__name')
# print the whole thing
print(items[0])
hotel_name = items[0].getText()
# print hotel name
print(hotel_name)
# print without newlines
print(hotel_name[1:-1])
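A small refinement: instead of slicing the string with [1:-1] to drop the surrounding newlines, get_text() accepts a strip argument that trims whitespace for you:

# strip=True removes leading/trailing whitespace and newlines
hotel_name = items[0].get_text(strip=True)
print(hotel_name)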
Hope this helps. I would suggest reading more of the BeautifulSoup documentation.
First of all, buddy, scraping this site with requests might be really hard, since you have to imitate the request your browser sends.
You'll have to use a sniffing tool (Burp, Fiddler, Wireshark) or, in some cases, look at the Network tab in your browser's developer tools, which is relatively tedious.
I'd suggest you use Selenium, which is a web driver that makes your life easier when scraping sites. Read more about it here: https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72
And as for your error, I think you should use only .text instead of .get_text()
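If you do go the Selenium route, a minimal sketch might look like the following. This assumes Selenium 4+ with Chrome and a matching chromedriver installed; the sr-hotel__name class comes from the question and may well have changed on the live site, and search_url is still a placeholder:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Placeholder - replace with a real booking.com results URL
search_url = 'https://www.booking.com/searchresults.....'

driver = webdriver.Chrome()
driver.get(search_url)

# Collect hotel names; the class name is an assumption taken from the question
for el in driver.find_elements(By.CLASS_NAME, 'sr-hotel__name'):
    print(el.text.strip())

driver.quit()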
I am trying to pull back only the title from some source code online. My code currently pulls all the correct lines, but I cannot figure out how to make it return only the title.
from bs4 import BeautifulSoup # BeautifulSoup is in bs4 package
import requests
URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')
for link in tb.find_all('tr'):
    name = link.find('td')
    print(name.get_text('title'))
I expect it to only say
Nexus
Pylon
Gateway
Assimilator
etc.
but I get the error:
Traceback (most recent call last):
File "main.py", line 11, in <module>
print(name.get_text().strip())
AttributeError: 'NoneType' object has no attribute 'get_text'
I don't understand what I am doing wrong, since from what I've read it should only pull back the desired results.
Try the code below. Your first row contains table headers instead of table data, so looking for a td tag there returns None.
Add a check so you only read the title once a span is actually found inside the row, as below.
from bs4 import BeautifulSoup # BeautifulSoup is in bs4 package
import requests
URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')
for link in tb.find_all('tr'):
    name = link.find('span')
    if name is not None:
        # Process only if the element is available
        print(name['title'])
I think you should use something like
for link in tb.find_all('tr'):
    for name in link.select('td[title]'):
        print(name['title'])
As far as I can see, the string comes back empty because there is no title tag; the title you want is an attribute of the td tag, not text inside it.
bkyada's answer is perfect. If you want another solution:
In your for loop, instead of finding a td, find the span and read its title attribute.
containers = link.find('span')
if containers is not None:
    print(containers['title'])
It is more efficient to simply use the class name to identify the elements with a title attribute, as they all have one in the first column.
from bs4 import BeautifulSoup # BeautifulSoup is in bs4 package
import requests
URL = 'https://sc2replaystats.com/replay/playerStats/10774659/8465'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
tb = soup.find('table', class_='table table-striped table-condensed')
titles = [i['title'] for i in tb.select('.blizzard_icons_single')]
print(titles)
titles = {i['title'] for i in tb.select('.blizzard_icons_single')} #set of unique
print(titles)
As the title attribute is limited to that column, you could also have used a (slightly slower) attribute selector:
titles = {i['title'] for i in tb.select('[title]')} #set of unique
I am practicing building web scrapers. One that I am working on now involves going to a site, scraping links for the various cities on that site, then taking all of the links for each of the cities and scraping all the links for the properties in said cities.
I'm using the following code:
import requests
from bs4 import BeautifulSoup
main_url = "http://www.chapter-living.com/"
# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title") # Bottom page not loaded dynamycally
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")] # Links to cities
If I print out city_tags I get the HTML I want. However, when I print cities_links I get AttributeError: 'ResultSet' object has no attribute 'find_all'.
I gather from other questions on here that this error occurs because city_tags returns None, but that can't be the case if it is printing out the desired HTML? I have noticed that said HTML is wrapped in [] - does this make a difference?
Well, city_tags is a bs4.element.ResultSet (essentially a list) of tags, and you are calling find_all on it. You probably want to call find_all on every element of the ResultSet, or in this specific case just retrieve their href attribute:
import requests
from bs4 import BeautifulSoup
main_url = "http://www.chapter-living.com/"
# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title") # Bottom page not loaded dynamycally
cities_links = [main_url + tag["href"] for tag in city_tags] # Links to cities
As the error says, city_tags is a ResultSet, which is a list of nodes and doesn't have the find_all method. You either have to loop through the set and apply find_all to each individual node or, in your case, simply extract the href attribute from each node:
[tag['href'] for tag in city_tags]
#['https://www.chapter-living.com/blog/',
# 'https://www.chapter-living.com/testimonials/',
# 'https://www.chapter-living.com/events/']
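One caveat: the hrefs shown above are already absolute URLs, so prepending main_url (as in the earlier snippet) would produce broken links. If the site ever mixes relative and absolute hrefs, urljoin handles both cases safely; a sketch under that assumption:

from urllib.parse import urljoin

# urljoin leaves absolute hrefs untouched and resolves relative ones against main_url
cities_links = [urljoin(main_url, tag["href"]) for tag in city_tags]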
Scraping a hotel website to retrieve titles and prices.
"hotelInfo" is the div that holds the interesting content.
It makes sense to me that I would want to only perform my operations on this div. My code is as follows -
from bs4 import BeautifulSoup
import requests
response = requests.get("http://$hotelurlhere.com")
soup = BeautifulSoup(response.text)
hotelInfo = soup.select('div.hotel-wrap')
hotelTitle = soup.find_all('h3', attrs={'class': 'p-name'})
hotelNameList = []
hotelPriceList = []
for hotel in hotelInfo:
    for title in hotelTitle:
        hotelNameList.append(title.text)
It makes more sense to me that hotelTitle should be a BeautifulSoup search on hotelInfo above. However, when I try this:
hotelTitle = hotelInfo.find_all('h3', attrs={'class': 'p-name'})
Error message:
Traceback (most recent call last):
File "main.py", line 8, in <module>
hotelTitle = hotelInfo.find_all('h3', attrs={'class': 'p-name'})
AttributeError: 'list' object has no attribute 'find_all'
An error was returned which was related to the list element not having an attribute of "find_all". I understand that this is because hotelInfo is a list element that was returned. I've searched for information on the correct way to check for the h3 info within this list but I am not having any success.
What is the best way to do this?
Shouldn't I be able to set hotelTitle to hotelInfo.find_all rather than just soup.find_all?
As the error message clearly suggests, there is no find_all() method that you can invoke on a list object. In this case, you should call find_all() on each individual member of the list instead, assuming that you need some information from the div.hotel-wrap as well as the corresponding h3:
for hotel in hotelInfo:
    hotelTitle = hotel.find_all('h3', attrs={'class': 'p-name'})
If you only need the h3 elements, you can combine the two selectors to get them directly without having to find hotelInfo first:
hotelTitle = soup.select('div.hotel-wrap h3.p-name')
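If you also want to keep each title paired with other data from the same hotel card (such as a price), iterating per hotel keeps the association intact. A sketch, where the price class name is a guess and needs to be checked against the actual page markup:

hotelNameList = []
hotelPriceList = []
for hotel in soup.select('div.hotel-wrap'):
    title = hotel.select_one('h3.p-name')
    price = hotel.select_one('.price')  # hypothetical class - inspect the page for the real one
    if title is not None:
        hotelNameList.append(title.get_text(strip=True))
    if price is not None:
        hotelPriceList.append(price.get_text(strip=True))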
for info, title in zip(hotelInfo, hotelTitle):
    data = {
        'hotelinfo': info.get_text(),
    }
    print(data)
Like that.
from bs4 import BeautifulSoup
import urllib.request
import win_unicode_console
win_unicode_console.enable()
link = ('https://pietroalbini.io/')
req = urllib.request.Request(link, headers={'User-Agent': 'Mozilla/5.0'})
url = urllib.request.urlopen(req).read()
soup = BeautifulSoup(url, "html.parser")
body = soup.find_all('div', {"class":"wrapper"})
print(body.text)
Hi, I have a problem with Beautiful Soup. If I run this code without ".text" at the end it shows me a list of divs, but if I add ".text" at the end I get the error:
Traceback (most recent call last):
File "script.py", line 15, in
print(body.text)
AttributeError: 'ResultSet' object has no attribute 'text'
find_all returns a ResultSet object which you can iterate over using a for loop. What you can do is:
for wrapper in body:
    print(wrapper.text)
If you type:
print(type(body))
you'll see that body is <class 'bs4.element.ResultSet'>. It holds all the elements that match the class. You can either iterate over them:
for div in body:
    print(div.text)
Or, if you know there is only one such div, you can use find instead:
div = soup.find('div', {"class":"wrapper"})
div.text
I probably should have posted this as an answer, so, as stated in the comments, almost verbatim:
Your code should be the following:
for div in body:
    print(div.text)
Or whatever naming scheme you prefer.
The find_all method returns a list-like collection (loosely using the term list here) of items that BeautifulSoup found matching your criteria after parsing the source webpage's HTML, either recursively or non-recursively depending on how you search.
As the error says, the resulting set of objects has no text attribute, since it isn't a single element but rather a collection of them.
However, the items inside the resulting set (should any be found) do.
You can view the documentation here