How to Grab Specific Text - python

I want to grab the price of bitcoin from this website: https://www.coindesk.com/price/bitcoin
but I am not sure how to do it; I'm pretty new to coding.
This is my code so far; I am not sure what I am doing wrong. Thanks in advance.
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.coindesk.com/price/bitcoin')
r_content = r.content
soup = BeautifulSoup(r_content, 'lxml')
p_value = soup.find('span', {'class': "currency-price", "data-value": True})['data-value']
print(p_value)
This is the result:
Traceback (most recent call last):
  File "C:/Users/aidan/PycharmProjects/scraping/Scraper.py", line 8, in <module>
    p_value = soup.find('span', {'class': "currency-price", "data-value": True})['data-value']
TypeError: 'NoneType' object is not subscriptable

The content is dynamically sourced from an API call returning JSON. You can request a list of currencies or a single currency. With requests, JavaScript doesn't run, so this content is never added to the DOM, and the various DOM changes that produce the HTML you see in the browser don't occur.
import requests
r = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=BTC').json()
print(r)
price = r['data']['currency']['BTC']['quotes']['USD']['price']
print(price)
r = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=ADA,BCH,BSV,BTC,BTG,DASH,DCR,DOGE,EOS,ETC,ETH,IOTA,LSK,LTC,NEO,QTUM,TRX,XEM,XLM,XMR,XRP,ZEC').json()
print(r)

The problem here is that the soup.find() call is not returning a value (that is, there is no span with the attributes you have defined on the page) therefore when you try to get data-value there is no dictionary to look it up in.
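A minimal sketch of guarding against that None before subscripting, using a small inline HTML snippet (hypothetical, standing in for the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: no span with the expected class/attribute exists
html = '<div><span class="other">no price here</span></div>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('span', {'class': 'currency-price', 'data-value': True})
if tag is not None:
    print(tag['data-value'])
else:
    print('price span not found')  # avoids the TypeError
```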

Your website doesn't hold the data in the HTML, so you can't scrape it that way, but the site uses an endpoint that you could call directly:
data = requests.get('https://production.api.coindesk.com/v1/currency/ticker?currencies=BTC').json()
p_value = data['data']['currency']['BTC']['quotes']['USD']['price']
print(p_value)
# output: 11375.678380772
The price changes all the time, so your output may be different.
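The nesting the answer indexes into can be seen on a stand-in dict shaped like the ticker response (the price value here is illustrative only):

```python
# Stand-in for the JSON returned by the ticker endpoint; values are illustrative
r = {'data': {'currency': {'BTC': {'quotes': {'USD': {'price': 11375.678380772}}}}}}

price = r['data']['currency']['BTC']['quotes']['USD']['price']
print(price)  # 11375.678380772
```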

Related

Is there a way to more accurately search for a class with BeautifulSoup

I'm trying to scrape this job site for a specific job title but I keep getting this error message.
Traceback (most recent call last):
  File "/home/malachi/Documents/python_projects/Practice/Jobsearcher.py", line 7, in <module>
    print(results.prettify())
AttributeError: 'NoneType' object has no attribute 'prettify'
I've run this same code on other websites with different class names and I got results but when I run it on the website I need it says that the class doesn't exist
from bs4 import BeautifulSoup
import requests

page = requests.get("https://careers.united.com/job-search-results/")
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(class_="jobTitle")
print(results)
print(results.prettify())
I have changed this line of code:
results = soup.find(class_="jobTitle")
to these two lines of code. I have tested them and they work for me.
results = soup.find('a', attrs={'id': "job-result0"})
results = results.string
I use Google Chrome. It has the free extension ChroPath, which makes it super easy to identify selectors. I just right click on text in a browser and select Inspect, sometimes twice, and the correct HTML tag is highlighted.
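On a minimal inline snippet (hypothetical markup standing in for the job site), finding by id and reading .string looks like:

```python
from bs4 import BeautifulSoup

# Hypothetical anchor mimicking the job-result0 link
html = '<a id="job-result0">Flight Attendant</a>'
soup = BeautifulSoup(html, 'html.parser')

results = soup.find('a', attrs={'id': 'job-result0'})
print(results.string)  # Flight Attendant
```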

Why does ResultSet object has no attribute 'find'?

I'm trying to scrape the text off inside the "Other areas of Wikipedia" section on the Wikipedia front page. However, I run into the error ResultSet object has no attribute 'find'. What's wrong with my code and how do I get it to work?
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml' )
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
otherAreasContainerTexts = otherAreasContainer.find_all('li')
for otherAreasContainerText in otherAreasContainerTexts:
    print(otherAreasContainerText.text)
In your code otherAreasContainer is of type ResultSet, and ResultSet doesn't have a .find_all() method.
To select all <li> from under the "Other areas of Wikipedia", you can use CSS selector h2:contains("Other areas of Wikipedia") + div li.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
for li in soup.select('h2:contains("Other areas of Wikipedia") + div li'):
    print(li.text)
Prints:
Community portal – Bulletin board, projects, resources and activities covering a wide range of Wikipedia areas.
Help desk – Ask questions about using Wikipedia.
Local embassy – For Wikipedia-related communication in languages other than English.
Reference desk – Serving as virtual librarians, Wikipedia volunteers tackle your questions on a wide range of subjects.
Site news – Announcements, updates, articles and press releases on Wikipedia and the Wikimedia Foundation.
Village pump – For discussions about Wikipedia itself, including areas for technical issues and policies.
More about CSS Selectors.
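If your BeautifulSoup/soupsieve version rejects the non-standard :contains() pseudo-class, the same elements can be reached by navigating from the <h2>. A sketch on a simplified inline snippet (not the real front-page markup):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the front-page section
html = '''
<h2>Other areas of Wikipedia</h2>
<div><ul><li>Help desk</li><li>Village pump</li></ul></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Find the heading, then the <li> items in the <div> that follows it
h2 = soup.find('h2', string='Other areas of Wikipedia')
items = [li.text for li in h2.find_next_sibling('div').find_all('li')]
for text in items:
    print(text)
```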
Running your code I got
Traceback (most recent call last):
  File "h.py", line 7, in <module>
    otherAreasContainerTexts = otherAreasContainer.find_all('li')
  File "/home/td/anaconda3/lib/python3.7/site-packages/bs4/element.py", line 1620, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
This should be part of your question - make it easy for us to spot your problem!
find_all returns a ResultSet, which is essentially a list of the elements found. You need to enumerate each of the elements to continue:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml' )
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
for other in otherAreasContainer:
    otherAreasContainerTexts = other.find_all('li')
    for otherAreasContainerText in otherAreasContainerTexts:
        print(otherAreasContainerText.text)
The result of find_all is a list, and a list has no find or find_all attribute; you must iterate over otherAreasContainer and call find_all on each element, like this:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
otherAreasContainer = soup.find_all('div', class_='mp-bordered')
for other in otherAreasContainer:
    otherAreasContainerTexts = other.find_all('li')
    for otherAreasContainerText in otherAreasContainerTexts:
        print(otherAreasContainerText.text)

I tried lots of times to grab the data from booking.com, but I couldn't

I want to scrape data from booking.com but got some errors and couldn't find any similar code.
I want to scrape the name of the hotel, the price, etc.
I have tried BeautifulSoup 4 and tried to write the data to a CSV file.
import requests
from bs4 import BeautifulSoup
import pandas
# Replace search_url with a valid one by visiting and searching booking.com
search_url = 'https://www.booking.com/searchresults.....'
page = requests.get(search_url)
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id = 'search_results_table' )
#print(week)
items = week.find_all(class_='sr-hotel__name')
print(items[0])
print(items[0].find(class_ = 'sr-hotel__name').get_text())
print(items[0].find(class_ = 'short-desc').get_text())
Here is a sample URL that can be used in place of search_url.
This is the error msg...
<span class="sr-hotel__name " data-et-click="
">
The Fort Printers
</span>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-44-77b38c8546bb> in <module>
11 items = week.find_all(class_='sr-hotel__name')
12 print(items[0])
---> 13 print(items[0].find(class_ = 'sr-hotel__name').get_text())
14 print(items[0].find(class_ = 'short-desc').get_text())
15
AttributeError: 'NoneType' object has no attribute 'get_text'
Instead of calling the find() method multiple times, it can help to use the getText() method directly.
import requests
from bs4 import BeautifulSoup
import pandas
# Replace search_url with a valid one by visiting and searching booking.com
search_url = 'https://www.booking.com/searchresults.....'
page = requests.get(search_url)
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='search_results_table')
#print(week)
items = week.find_all(class_='sr-hotel__name')
# print the whole thing
print(items[0])
hotel_name = items[0].getText()
# print hotel name
print(hotel_name)
# print without the surrounding newlines
print(hotel_name.strip())
Hope this helps. I would suggest reading more of BeautifulSoup documentation.
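The original find() returned None because items[0] already is the sr-hotel__name span, so searching inside it for the same class finds nothing, while get_text() reads the text directly. A self-contained illustration on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking one search result
html = '<span class="sr-hotel__name">\nThe Fort Printers\n</span>'
soup = BeautifulSoup(html, 'html.parser')
item = soup.find(class_='sr-hotel__name')

# Searching *inside* the span for the same class finds nothing
print(item.find(class_='sr-hotel__name'))  # None

# get_text(strip=True) drops the surrounding newlines
print(item.get_text(strip=True))  # The Fort Printers
```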
First of all, using requests might be really hard, since you have to imitate the request your browser sends exactly.
You'll have to use a sniffing tool (Burp, Fiddler, Wireshark) or, in some cases, look at the Network tab in your browser's developer tools, which is relatively hard...
I'd suggest using Selenium, a web driver that makes your life easier when scraping sites; read more about it here: https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72
And as for your error, I think you should just use .text instead of .get_text()

Trouble grabbing data from a webpage located within comment

I've written a script in Python to get some data from a website. It seems I did it the right way; however, when I print the data I get a "list index out of range" error. The data is within comments, so in my script I tried to use BeautifulSoup's built-in comment processing. Could anybody point out where I'm going wrong?
Link to the website: website_link
Script I've tried so far with:
import requests
from bs4 import BeautifulSoup, Comment
res = requests.get("replace_with_the_above_link")
soup = BeautifulSoup(res.text, 'lxml')
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    sauce = BeautifulSoup(comment, 'lxml')
    items = sauce.select("#tco_detail_data")[0]
    data = ' '.join([' '.join(item.text.split()) for item in items.select("li")])
    print(data)
This is the traceback:
Traceback (most recent call last):
  File "C:\Users\Local\Programs\Python\Python35-32\new_line_one.py", line 8, in <module>
    items = sauce.select("#tco_detail_data")[0]
IndexError: list index out of range
Please click on the below link to see which portion of data I would like to grab: Expected_output_link
None of the comments contain HTML with a "#tco_detail_data" tag, so select returns an empty list, which raises an IndexError when you try to select the first item.
However, you can find the data in a "ul#tco_detail_data" tag in the page's regular HTML.
res = requests.get(link)
soup = BeautifulSoup(res.text, 'lxml')
data = soup.select_one("#tco_detail_data")
print(data)
If you want data in a list,
data = [list(item.stripped_strings) for item in data.select("ul")]
If you prefer a string,
data = '\n'.join([item.get_text(' ', strip=True) for item in data.select("ul")])

BeautifulSoup search on beautifulsoup result?

Scraping a hotel website to retrieve titles and prices.
"hotelInfo" is the div that holds the interesting content.
It makes sense to me that I would want to only perform my operations on this div. My code is as follows -
from bs4 import BeautifulSoup
import requests
response = requests.get("http://$hotelurlhere.com")
soup = BeautifulSoup(response.text)
hotelInfo = soup.select('div.hotel-wrap')
hotelTitle = soup.find_all('h3', attrs={'class': 'p-name'})
hotelNameList = []
hotelPriceList = []
for hotel in hotelInfo:
    for title in hotelTitle:
        hotelNameList.append(title.text)
It makes more sense to say that hotelTitle should be a BeautifulSoup search on hotelInfo above. However, when I try this
hotelTitle = hotelInfo.find_all('h3', attrs={'class': 'p-name'})
Error message:
Traceback (most recent call last):
  File "main.py", line 8, in <module>
    hotelTitle = hotelInfo.find_all('h3', attrs={'class': 'p-name'})
AttributeError: 'list' object has no attribute 'find_all'
An error was returned which was related to the list element not having an attribute of "find_all". I understand that this is because hotelInfo is a list element that was returned. I've searched for information on the correct way to check for the h3 info within this list but I am not having any success.
What is the best way to do this?
Shouldn't I be able to set hoteTitle to hotelInfo.find_all rather than just soup.find_all?
As the error message clearly suggests, there is no find_all() method you can invoke on a list object. In this case, you should call find_all() on each member of the list instead, assuming that you need some information from the div.hotel-wrap as well as the corresponding h3:
for hotel in hotelInfo:
    hotelTitle = hotel.find_all('h3', attrs={'class': 'p-name'})
If you only need the h3 elements, you can combine the two selectors to get them directly without having to find hotelInfo first:
hotelTitle = soup.select('div.hotel-wrap h3.p-name')
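Both approaches can be checked on a small inline snippet (hypothetical markup mirroring div.hotel-wrap / h3.p-name):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two hotel cards
html = '''
<div class="hotel-wrap"><h3 class="p-name">Hotel Alpha</h3></div>
<div class="hotel-wrap"><h3 class="p-name">Hotel Beta</h3></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Per-element find_all on each hotel div
names = [h3.text
         for hotel in soup.select('div.hotel-wrap')
         for h3 in hotel.find_all('h3', attrs={'class': 'p-name'})]
print(names)  # ['Hotel Alpha', 'Hotel Beta']

# Combined CSS selector gets the titles directly
names2 = [h3.text for h3 in soup.select('div.hotel-wrap h3.p-name')]
print(names2)  # ['Hotel Alpha', 'Hotel Beta']
```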
for hotelinfo, hoteltitle in zip(hotelinfos, hoteltitles):
    data = {
        'hotelinfo': hotelinfo.get_text(),
    }
    print(data)
Like that.
