I'm scraping a hotel website to retrieve titles and prices. "hotelInfo" is the div that holds the interesting content, so it makes sense to me to perform my operations only on this div. My code is as follows:
from bs4 import BeautifulSoup
import requests

response = requests.get("http://$hotelurlhere.com")
soup = BeautifulSoup(response.text, 'html.parser')
hotelInfo = soup.select('div.hotel-wrap')
hotelTitle = soup.find_all('h3', attrs={'class': 'p-name'})
hotelNameList = []
hotelPriceList = []
for hotel in hotelInfo:
    for title in hotelTitle:
        hotelNameList.append(title.text)
It makes more sense to me that hotelTitle should be a BeautifulSoup search on hotelInfo instead. However, when I try this:
hotelTitle = hotelInfo.find_all('h3', attrs={'class': 'p-name'})
Error message:

Traceback (most recent call last):
  File "main.py", line 8, in <module>
    hotelTitle = hotelInfo.find_all('h3', attrs={'class': 'p-name'})
AttributeError: 'list' object has no attribute 'find_all'
An error was returned saying that the list has no "find_all" attribute. I understand that this is because hotelInfo is a list. I've searched for information on the correct way to check for the h3 info within this list, but I am not having any success.
What is the best way to do this?
Shouldn't I be able to set hotelTitle to hotelInfo.find_all rather than just soup.find_all?
As the error message suggests, a list object has no find_all() method to invoke. In this case you should call find_all() on each individual member of the list instead, assuming you need some information from each div.hotel-wrap as well as its corresponding h3:
for hotel in hotelInfo:
    hotelTitle = hotel.find_all('h3', attrs={'class': 'p-name'})
If you only need the h3 elements, you can combine the two selectors and get them directly, without having to find hotelInfo first:
hotelTitle = soup.select('div.hotel-wrap h3.p-name')
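Putting the two ideas together, here is a minimal sketch of the per-hotel approach; the HTML is made up (the real hotel URL is not given in the question), but the class names mirror the ones used above:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real hotel page;
# the class names mirror the ones in the question.
html = """
<div class="hotel-wrap"><h3 class="p-name">Hotel Alpha</h3></div>
<div class="hotel-wrap"><h3 class="p-name">Hotel Beta</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

hotelNameList = []
for hotel in soup.select("div.hotel-wrap"):
    # find() scoped to this hotel keeps each title paired with its card
    title = hotel.find("h3", attrs={"class": "p-name"})
    if title is not None:
        hotelNameList.append(title.text)

print(hotelNameList)  # ['Hotel Alpha', 'Hotel Beta']
```

Scoping find() to each hotel element is what keeps titles and prices paired correctly once you extract more than one field per card.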
You can also zip the two lists together and build a dict per hotel:

for hotelinfo, hoteltitle in zip(hotelinfos, hoteltitles):
    data = {
        'hotelinfo': hotelinfo.get_text(),
    }
    print(data)

Something like that.
I'm trying to scrape this job site for a specific job title, but I keep getting this error message:
Traceback (most recent call last):
  File "/home/malachi/Documents/python_projects/Practice/Jobsearcher.py", line 7, in <module>
    print(results.prettify())
AttributeError: 'NoneType' object has no attribute 'prettify'
I've run this same code on other websites with different class names and got results, but when I run it on the website I need, it says the class doesn't exist.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://careers.united.com/job-search-results/")
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="jobTitle")
print(results)
print(results.prettify())
I have changed this line of code:
results = soup.find(class_ = "jobTitle")
to these two lines of code. I have tested them and they work for me.
results = soup.find('a', attrs={'id': "job-result0"})
results = results.string
I use Google Chrome. It has the free extension ChroPath, which makes it super easy to identify selectors: I just right-click on text in the browser and select Inspect, sometimes twice, and the correct HTML tag is highlighted.
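The root cause of the traceback above is that find() returns None when nothing matches, and None has no prettify(). A small defensive sketch of that pattern, using made-up HTML in place of the real careers page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet illustrating why print(results.prettify()) blew up:
# find() returns None when nothing matches, so check before using the result.
soup = BeautifulSoup('<a id="job-result0">Pilot</a>', "html.parser")

results = soup.find(class_="jobTitle")   # class not present -> None
if results is None:
    # fall back to the selector that does exist on the page
    fallback = soup.find("a", attrs={"id": "job-result0"})
    text = fallback.string if fallback else "no match"
else:
    text = results.get_text()

print(text)  # Pilot
```

Guarding the result of find() this way turns a confusing AttributeError into an explicit "selector didn't match" branch.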
I am trying to get to the element that contains the rating data, but I can't figure out how to traverse into it (image linked below). The span element for both the critic rating and the audience rating has the same class (mop-ratings-wrap__percentage). I tried to get the elements by separately traversing into their respective divs ('mop-ratings-wrap__half' and 'mop-ratings-wrap__half audience-score'), but I am getting this error:
runfile('/Users/*/.spyder-py3/temp.py', wdir='/Users/*/.spyder-py3')
Traceback (most recent call last):
  File "/Users/*/.spyder-py3/temp.py", line 22, in <module>
    cr=a.find('span', attrs={'class':'mop-ratings-wrap__percentage'})
TypeError: find() takes no keyword arguments
Here is my code:
# -*- coding: utf-8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome("/Users/*/Downloads/chromedriver")
critics_rating = []
audience_rating = []
driver.get("https://www.rottentomatoes.com/m/bill_and_ted_face_the_music")
content = driver.page_source
soup = BeautifulSoup(content, "lxml")
for a in soup.find('div', attrs={'class':'mop-ratings-wrap__half'}):
    cr = a.find('span', attrs={'class':'mop-ratings-wrap__percentage'})
    critics_rating.append(cr.text)
for b in soup.find('div', attrs={'class':'mop-ratings-wrap__half audience-score'}):
    ar = b.find('span', attrs={'class':'mop-ratings-wrap__percentage'})
    audience_rating.append(ar.text)
print(critics_rating)
I am following this article: https://www.edureka.co/blog/web-scraping-with-python/#demo
And here is the data I want to extract
I suspect that iterating over the result of soup.find() yields strings rather than the bs4 objects you are expecting. You therefore end up calling "somestring".find(), and str.find() takes no keyword arguments.
(I would comment this, but I lack reputation, sorry.)
The issue is in your loop for a in soup.find('div', attrs={'class':'mop-ratings-wrap__half'}):. You have returned one single element and then tried to traverse through it, which is equivalent to traversing the strings inside the returned element, and you cannot run a keyword-argument find() on strings.
Solution: if you want to loop through element(s) and use the find method on each of them, use find_all instead. It returns a list of elements, which you can traverse one by one in a loop.
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')

ratings = []
for a in soup.find_all('div', attrs={'class':'mop-ratings-wrap__half'}):
    cr = a.find('span', attrs={'class':'mop-ratings-wrap__percentage'})
    ratings.append(cr.text)

for rating in ratings:
    print(rating.replace("\n", "").strip())
The above code will print both ratings.
Note: this is not the most sophisticated way to get your desired result, but I have tried to answer your doubt rather than give a better solution. You can use ratings[0] to print the critic rating and ratings[1] to print the user rating.
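If you do want the two ratings labelled rather than positional, the extra "audience-score" class on the second div can be used to tell them apart. A sketch on stand-in HTML (the real page is rendered via Selenium, but the traversal logic is the same):

```python
from bs4 import BeautifulSoup

# Stand-in HTML mimicking the Rotten Tomatoes structure from the question.
html = """
<div class="mop-ratings-wrap__half">
  <span class="mop-ratings-wrap__percentage"> 81% </span></div>
<div class="mop-ratings-wrap__half audience-score">
  <span class="mop-ratings-wrap__percentage"> 77% </span></div>
"""
soup = BeautifulSoup(html, "html.parser")

ratings = {}
for half in soup.find_all("div", attrs={"class": "mop-ratings-wrap__half"}):
    span = half.find("span", attrs={"class": "mop-ratings-wrap__percentage"})
    # "audience-score" is an extra class present only on the audience half
    key = "audience" if "audience-score" in half.get("class", []) else "critics"
    ratings[key] = span.text.strip()

print(ratings)  # {'critics': '81%', 'audience': '77%'}
```

Note that find_all with a single class name matches elements that carry that class among others, which is why both halves are found by one query.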
I'm trying to scrape the text inside the "Other areas of Wikipedia" section on the Wikipedia front page. However, I run into the error ResultSet object has no attribute 'find'. What's wrong with my code, and how do I get it to work?
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
otherAreasContainerTexts = otherAreasContainer.find_all('li')
for otherAreasContainerText in otherAreasContainerTexts:
    print(otherAreasContainerText.text)
In your code otherAreasContainer is of type ResultSet, and ResultSet doesn't have a .find_all() method.
To select all <li> from under the "Other areas of Wikipedia", you can use CSS selector h2:contains("Other areas of Wikipedia") + div li.
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
for li in soup.select('h2:contains("Other areas of Wikipedia") + div li'):
    print(li.text)
Prints:
Community portal – Bulletin board, projects, resources and activities covering a wide range of Wikipedia areas.
Help desk – Ask questions about using Wikipedia.
Local embassy – For Wikipedia-related communication in languages other than English.
Reference desk – Serving as virtual librarians, Wikipedia volunteers tackle your questions on a wide range of subjects.
Site news – Announcements, updates, articles and press releases on Wikipedia and the Wikimedia Foundation.
Village pump – For discussions about Wikipedia itself, including areas for technical issues and policies.
More about CSS Selectors.
Running your code I got:

Traceback (most recent call last):
  File "h.py", line 7, in <module>
    otherAreasContainerTexts = otherAreasContainer.find_all('li')
  File "/home/td/anaconda3/lib/python3.7/site-packages/bs4/element.py", line 1620, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
This should be part of your question; make it easy for us to spot your problem!
find_all returns a ResultSet, which is essentially a list of the elements found. You need to enumerate each of the elements to continue:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
for other in otherAreasContainer:
    otherAreasContainerTexts = other.find_all('li')
    for otherAreasContainerText in otherAreasContainerTexts:
        print(otherAreasContainerText.text)
The result of find_all is a list, and a list has no find or find_all attribute; you must iterate over otherAreasContainer and call find_all on each element, like this:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
for other in otherAreasContainer:
    otherAreasContainerTexts = other.find_all('li')
    for otherAreasContainerText in otherAreasContainerTexts:
        print(otherAreasContainerText.text)
I want to grab the price of bitcoin from this website: https://www.coindesk.com/price/bitcoin, but I am not sure how to do it; I'm pretty new to coding.
This is my code so far. I am not sure what I am doing wrong. Thanks in advance.
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.coindesk.com/price/bitcoin')
r_content = r.content
soup = BeautifulSoup(r_content, 'lxml')
p_value = soup.find('span', {'class': "currency-price", "data-value": True})['data-value']
print(p_value)
This is the result:

Traceback (most recent call last):
  File "C:/Users/aidan/PycharmProjects/scraping/Scraper.py", line 8, in <module>
    p_value = soup.find('span', {'class': "currency-price", "data-value": True})['data-value']
TypeError: 'NoneType' object is not subscriptable
Content is dynamically sourced from an API call that returns JSON. With requests, JavaScript doesn't run, so this content is never added to the DOM, and the various DOM changes that produce the HTML you see in the browser don't occur. You can request a list of currencies or a single currency.
import requests
r = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=BTC').json()
print(r)
price = r['data']['currency']['BTC']['quotes']['USD']['price']
print(price)
r = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=ADA,BCH,BSV,BTC,BTG,DASH,DCR,DOGE,EOS,ETC,ETH,IOTA,LSK,LTC,NEO,QTUM,TRX,XEM,XLM,XMR,XRP,ZEC').json()
print(r)
The problem here is that the soup.find() call returns None (that is, there is no span with the attributes you have defined on the page), so when you try to get data-value there is no dictionary to look it up in.
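A minimal sketch of the defensive pattern, only indexing ['data-value'] once find() has actually matched. The span here is made up; on the real page the tag never appears in the static HTML at all, which is exactly why find() returned None:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the currency-price span is deliberately absent,
# mirroring what requests actually receives from the page.
soup = BeautifulSoup('<span class="other-price">1</span>', "html.parser")

tag = soup.find("span", {"class": "currency-price", "data-value": True})
if tag is not None:
    p_value = tag["data-value"]
else:
    p_value = None  # tag absent: fall back instead of crashing

print(p_value)  # None
```

Subscripting only after the None check turns the TypeError into an explicit missing-element case you can handle.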
Your website doesn't hold the data in its HTML, so you can't scrape it this way, but the site uses an endpoint that you can call directly:
import requests

data = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=BTC').json()
p_value = data['data']['currency']['BTC']['quotes']['USD']['price']
print(p_value)
# output: 11375.678380772

The price changes all the time, so your output may be different.
from bs4 import BeautifulSoup
import urllib.request
import win_unicode_console

win_unicode_console.enable()

link = 'https://pietroalbini.io/'
req = urllib.request.Request(link, headers={'User-Agent': 'Mozilla/5.0'})
url = urllib.request.urlopen(req).read()
soup = BeautifulSoup(url, "html.parser")
body = soup.find_all('div', {"class": "wrapper"})
print(body.text)
Hi, I have a problem with Beautiful Soup. If I run this code without ".text" at the end, it shows me a list of divs, but if I add ".text" at the end, I get the error:
Traceback (most recent call last):
  File "script.py", line 15, in <module>
    print(body.text)
AttributeError: 'ResultSet' object has no attribute 'text'
find_all returns a ResultSet object, which you can iterate over using a for loop. What you can do is:

for wrapper in soup.find_all('div', {"class": "wrapper"}):
    print(wrapper.text)
If you type:

print(type(body))

you'll see body is <class 'bs4.element.ResultSet'>. It holds all the elements that match the class. You can either iterate over them:

for div in body:
    print(div.text)

Or, if you know there is only one such div, use find instead:

div = soup.find('div', {"class": "wrapper"})
div.text
Probably should have posted as an answer, so, as stated in the comments almost verbatim, your code should be the following:

for div in body:
    print(div.text)

Or some naming schema to your preference thereof.
The find_all method returns a generated list (loosely using the term list here) of items that BeautifulSoup has found matching your criteria after parsing the source webpage's HTML, either recursively or non-recursively depending on how you search.
As the error says, the resulting set of objects has no text attribute, since it isn't an element but rather a collection of them.
However, the items inside the resulting set (should any be found) do.
You can view the documentation here.
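The distinction in all of these answers can be demonstrated in a few lines, using a tiny inline document in place of the real site:

```python
from bs4 import BeautifulSoup

# find_all() gives a list-like ResultSet; only the individual
# Tag items inside it have .text.
soup = BeautifulSoup(
    '<div class="wrapper">one</div><div class="wrapper">two</div>',
    "html.parser",
)

body = soup.find_all("div", {"class": "wrapper"})
texts = [div.text for div in body]  # iterate; don't call body.text

print(texts)  # ['one', 'two']
```

Calling body.text here would raise the same AttributeError as in the question, while iterating and taking .text per element works.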