Unable to scrape the real-time price of bitcoin using beautifulsoup - python

I'm trying to scrape the real-time price of bitcoin. The price on the website changes roughly every 5 seconds, but in my code it's not updating; it stays the same as the first price the code scraped. Can you help me understand why this is happening?
import time
import requests
from bs4 import BeautifulSoup

url = 'https://coinmarketcap.com/currencies/bitcoin/'
for i in range(100):
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.text, 'html.parser')
    price = soup.find('span', attrs={"class": "cmc-details-panel-price__price"})
    print(price)
    time.sleep(20)
My output:
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
<span class="cmc-details-panel-price__price">$18,106.79</span>
... (the same value repeated on every iteration)

The site is using live updates, presumably via some JavaScript: every time you refresh the site you get the same initial value, and then the site fires a trigger to update it. Since your request can't wait for or interact with JavaScript on the page, it always gets the first value from the load.
My advice is to use an API; it's more efficient than scraping websites.
The first Google search gives: https://www.coindesk.com/coindesk-api as a free Bitcoin API.
See if their API endpoint: https://api.coindesk.com/v1/bpi/currentprice.json
gives what you need, and then just parse the JSON.
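For example, a minimal sketch of parsing that endpoint (the field names follow CoinDesk's documented BPI response, where the USD quote sits under bpi -> USD; verify against the JSON you actually receive):
import requests

# field names per CoinDesk's documented BPI response; check them against the live JSON
response = requests.get('https://api.coindesk.com/v1/bpi/currentprice.json')
data = response.json()
print(data['bpi']['USD']['rate_float'])  # e.g. 18106.79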
Edit: Read the terms on their page.

An API is better, but if you still want to scrape, here's how you can do it using Google search results:
import time, requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# check 100 times (or use a while loop instead)
for _ in range(100):
    html = requests.get('https://www.google.com/search?q=bitcoin+usd', headers=headers)
    soup = BeautifulSoup(html.text, 'lxml')
    print(soup.select_one('.SwHCTb').text)
    time.sleep(20)  # sleep so the price has time to change
Output:
58,654.40
58,654.40
58,654.40
58,654.40
58,594.20
58,594.20
58,594.20
58,586.30
58,586.30
...
Alternatively, you can get this information by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference, in this case, is that you don't have to figure out how to bypass blocks from Google.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "bitcoin usd",
    "gl": "us",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()
print(results['answer_box']['result'])
# 60,571.40 United States Dollar
Disclaimer, I work for SerpApi.

Related

Scraping <div> inside a <div>

I'm having some trouble scraping names from a <div> that is nested inside another <div> (my search matches a completely different part of the page, even though I tried to target a specific card-body).
https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500
I need this part:
<div class="card-body p-0">
<div class="row no-gutters py-1 px-3">
<div class="col col-lg order-lg-1 text-nowrap text-ellipsis">
example
Even though I find names, they are not from the list I want. Does anybody know how to locate them?
I'm using BeautifulSoup and lxml. Part of my code:
from bs4 import BeautifulSoup
import requests

html_text = requests.get('https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500').text
soup = BeautifulSoup(html_text, 'lxml')
itemlocator = soup.find('div', class_='card-body p-0')
for items in itemlocator:
    print(items)
The following script should produce the available names that you see on that page. However, it seems you are only after the container in which Commander appears. In that case, you can try the below to get the desired portion; it's concise and efficient compared to your current attempt.
import requests
from bs4 import BeautifulSoup

link = 'https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}
html_text = requests.get(link, headers=headers)
soup = BeautifulSoup(html_text.text, 'lxml')
item = soup.select_one(".card-body > .no-gutters a[href^='/name/Commander']")
item_text = item.get_text(strip=True)
datetime = item.find_parent().find_parent().select_one("time").get("datetime")
print(item_text, datetime)
Output:
Commander 2021-03-19T13:10:40.000Z
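And if you want the full list of names rather than just Commander, a sketch along the same lines should work (the selector generalizes the one above and is an assumption; check it against the live markup):
import requests
from bs4 import BeautifulSoup

link = 'https://namemc.com/minecraft-names?sort=asc&length_op=&length=3&lang=&searches=500'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}
soup = BeautifulSoup(requests.get(link, headers=headers).text, 'lxml')

# assumed: every name link inside the card-body rows
for anchor in soup.select(".card-body > .no-gutters a[href^='/name/']"):
    print(anchor.get_text(strip=True))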

Can't scrape category titles from a webpage

I've written a scraper in Python to get different category names from a webpage, but it is unable to fetch anything from that page. I'm seriously confused and can't figure out where I'm going wrong. Any help would be vastly appreciated.
Here is the link to the webpage: URL
Here is what I've tried so far:
from bs4 import BeautifulSoup
import requests

res = requests.get("replace_with_above_url", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")
for items in soup.select('.slide_container .h3.standardTitle'):
    print(items.text)
The elements containing one such category name I'm after:
<div class="slide_container">
<a href="/offers/furniture/" tabindex="0">
<picture style="float: left; width: 100%;"><img style="width:100%" src="/_m4/9/8/1513184943_4413.jpg" data-w="270"></picture>
<div class="floated-details inverted" style="height: 69px;">
<div class="h3 margin-top-sm margin-bottom-sm standardTitle">
Furniture Offers <!-- This is the name I'm after -->
</div>
<p class="carouselDesc">
</p>
</div>
</a>
</div>
from bs4 import BeautifulSoup
import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'referer': 'https://www.therange.co.uk/',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
}

res = requests.get("https://www.therange.co.uk/", headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
for items in soup.select('.slide_container .h3.standardTitle'):
    print(items.text)
Try this.
A user-agent alone is not enough, because headers are an important part of scraping: if you omit headers the server expects, it may treat you as a bot.
Also, use "html.parser" instead of "lxml":
soup = BeautifulSoup(res.text, "html.parser")

Scrape URLs from <cite> tags using BeautifulSoup

I am trying to scrape the URLs from Google using Requests and Beautiful Soup web scraping libraries.
for URL in soup.find_all('cite'):
    print(URL.text)
I was previously trying to get the URLs by searching for the links and then reading each link's href, but the problem with that method seems to be that these URLs are cached by Google, and the link is often broken when trying to access it.
I noticed that Google uses cite tags to hold the URLs. Whilst this works for the vast majority of URLs, sometimes there are other bits of text on the page also within cite tags.
Most of the tags have a class = "_Rm" or class = "Rm bc". How could I tell Beautiful Soup to search for tags with a class of substring "Rm"?
I understand there is probably a better way to do all of this. Is anyone aware of how I could do this / another method which will return the actual URL of websites?
This is the code that I had previously been using to get URLs
for URL in soup.find_all("a",href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
print ("\n" + URL.text + "\n")
print re.split(":(?=http)",URL["href"].replace("/url?q=",""))'''
You can go to the parent container and use the .text method; since in this case there is no unwanted text inside it, this will return all "cite" links. Alternatively, use the third-party API SerpApi (see below).
Code and full example:
from bs4 import BeautifulSoup
import requests
import lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=java', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

for container in soup.findAll('div', class_='TbwUpd NJjxre'):
    link = container.text
    print(link)
Output:
https://www.java.com
https://www.oracle.com › java › technologies
https://www.oracle.com › java › technologies › javase-d...
https://en.wikipedia.org › wiki › Java_(programming_l...
https://en.wikipedia.org › wiki › Java
https://www.supremecourt.gov › opinions
https://openjdk.java.net
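As for the original question about matching a class substring: BeautifulSoup's class_ argument accepts a compiled regular expression, so a minimal sketch (reusing the soup object from above; the "_Rm"/"Rm bc" classes are from the question and may no longer exist in Google's current markup) would be:
import re

# matches any <cite> whose class attribute contains "Rm" (covers "_Rm" and "Rm bc")
for cite in soup.find_all('cite', class_=re.compile('Rm')):
    print(cite.text)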
Alternatively, you can use Google Search Engine Results API from SerpApi. It's a paid API with a Free trial.
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "java",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Link: {result['displayed_link']}")
Output:
Link: https://www.java.com
Link: https://www.oracle.com › java › technologies
Link: https://www.oracle.com › java › technologies › javase-d...
Link: https://en.wikipedia.org › wiki › Java_(programming_l...
Link: https://en.wikipedia.org › wiki › Java
Link: https://www.supremecourt.gov › opinions
Link: https://openjdk.java.net
Disclaimer, I work for SerpApi.

Retrieve a number from a span tag, using Python requests and Beautiful Soup

I'm new to Python and HTML. I am trying to retrieve the number of comments from a page using Requests and BeautifulSoup.
In this example I am trying to get the number 226. Here is the code as I can see it when I inspect the page in Chrome:
<a title="Go to the comments page" class="article__comments-counts" href="http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/comments/">
<span class="civil-comment-count" data-site-id="globeandmail" data-id="33519766" data-language="en">
226
</span>
Comments
</a>
When I request the text from the URL, I can find the code but there is no content between the span tags, no 226. Here is my code:
import requests, bs4

url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'html.parser')
span = soup.find('span', class_='civil-comment-count')
It returns this, the same as above but without the 226.
<span class="civil-comment-count" data-id="33519766" data-language="en" data-site-id="globeandmail">
</span>
I'm at a loss as to why the value isn't appearing. Thank you in advance for any assistance.
The page, and specifically the number of comments, requires JavaScript to be loaded and shown. But you don't have to use Selenium: make a request to the API behind it instead:
import requests

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"}

    # visit the main page
    base_url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
    session.get(base_url)

    # get the comments count
    url = "https://api-civilcomments.global.ssl.fastly.net/api/v1/topics/multiple_comments_count.json"
    params = {"publication_slug": "globeandmail",
              "reference_language": "en",
              "reference_ids": "33519766"}
    r = session.get(url, params=params)
    print(r.json())
Prints:
{'comment_counts': {'33519766': 226}}
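If you only need the number itself, index into that JSON (the key is the reference id passed in params above):
print(r.json()["comment_counts"]["33519766"])  # 226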
This page uses JavaScript to fetch the comment number; with JavaScript disabled, the span renders empty (the original answer included a screenshot showing this).
You can find the real URL that returns the number in Chrome's Developer Tools (Network tab).
Then you can mimic the request, as in @alecxe's code above.

Web Scraping - No content displayed

I am trying to fetch the stock price of a company specified by the user as input. I am using requests to get the source code and BeautifulSoup to scrape it. I am fetching the data from google.com and trying to fetch only the last stock price (806.93 in the screenshot from the original question). When I run my script, it prints None; none of the data is being fetched. What am I missing?
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
company = raw_input("Enter the company name:")
URL = "https://www.google.co.in/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q="+company+"+stock"
request = requests.get(URL)
soup = BeautifulSoup(request.content,"lxml")
code = soup.find('span',{'class':'_Rnb fmob_pr fac-l','data-symbol':'GOOGL'})
print code.contents[0]
The source code of the page (shown in a screenshot in the original question, not reproduced here) had the price inside the span element targeted above.
Looks like that source is from inspecting the element, not the actual page source. A couple of suggestions: use Google Finance to get rid of some noise; https://www.google.com/finance?q=googl would be the URL. On that page there is a section that looks like this:
<div class=g-unit>
<div id=market-data-div class="id-market-data-div nwp g-floatfix">
<div id=price-panel class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span id="ref_694653_l">806.93</span>
</span>
<div class="id-price-change nwp">
<span class="ch bld"><span class="chg" id="ref_694653_c">+9.68</span>
<span class="chg" id="ref_694653_cp">(1.21%)</span>
</span>
</div>
</div>
You should be able to pull the number out of that.
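For example, a short sketch against that markup (assuming you've parsed the Google Finance page into soup):
price = soup.find('span', class_='pr').get_text(strip=True)
print(price)  # 806.93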
I went to
https://www.google.com/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q=+google+stock
, right-clicked and chose "View Page Source", but did not see the code that you screenshotted.
Then I typed out a section of your code screenshot, created a BeautifulSoup object from it, and ran your find on it:
test_screenshot = BeautifulSoup('<div class="_F0c" data-tmid="/m/07zln7n"><span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span> = $0<span class ="_hgj">USD</span>')
test_screenshot.find('span', {'class': '_Rnb fmob_pr fac-l', 'data-symbol': 'GOOGL'})
Which will output what you want:
<span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span>
This means that the HTML you are getting back is not the HTML you expect.
I suggest using the Google Finance page:
https://www.google.com/finance?q=google (replace 'google' with what you want to search), which will give you what you are looking for:
request = requests.get(URL)
soup = BeautifulSoup(request.content, "lxml")
code = soup.find("span", {'class': 'pr'})
print code.contents
Will give you
[u'\n', <span id="ref_694653_l">806.93</span>, u'\n'].
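To pull just the number out of those contents, a one-line sketch against the markup above:
print code.find('span').text  # 806.93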
In general, scraping Google search results can get really nasty, so try to avoid it if you can.
You might also want to look into Yahoo Finance Python API.
You're looking for this:
# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
It might be because there's no user-agent specified in your request headers.
The default requests user-agent is python-requests, so Google blocks the request because it knows it's a bot rather than a "real" user visit, and you receive different HTML with different selectors and elements, or some sort of error. Setting a user-agent fakes a real user visit by adding that information to the HTTP request headers.
Pass the user-agent in the request headers:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

response = requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
    'q': 'alphabet inc class a stock',
    'gl': 'us'
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
print(current_price)
# 2,816.00
Alternatively, you can achieve the same thing by using Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and quickly get the data you want, rather than figuring out why certain things don't work as expected and then maintaining the scraper over time.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "alphabet inc class a stock",
    "gl": "us",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

current_price = results['answer_box']['price']
print(current_price)
# 2,816.00
P.S - I wrote an in-depth blog post about how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.
