How to use BeautifulSoup to get real-time stock price on website? - python

I am working on a project to get the real-time stock price from http://www.jpmhkwarrants.com/en_hk/market-statistics/underlying/underlying-terms/code/1. I have searched online and tried several ways to get the price, but they all fail. Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getStockPrice():
    url = "http://www.jpmhkwarrants.com/zh_hk/market-statistics/underlying/underlying-terms/code/1"
    r = urlopen(url)
    soup = BeautifulSoup(r.read(), 'lxml')
    price = soup.find(id="real_time_box").find("span", {"class": "price"})
    print(price)
The output is "None". I know that the price is filled in by a script, but I have no idea how to retrieve it. Can this be solved with BeautifulSoup or some other module?

View the page source and you will see HTML like this:
<div class="table detail">
.....
<div class="tl">即市走勢 <span class="description">前收市價</span>
.....
<td>買入價(延遲*)<span>82.15</span></td>
The span we want is the second match (index 1); select it with:
price = soup.select('.table.detail td span')[1]
print(price.text)
Demo:
from bs4 import BeautifulSoup
from urllib.request import urlopen

def getStockPrice():
    url = "http://www.jpmhkwarrants.com/zh_hk/market-statistics/underlying/underlying-terms/code/1"
    r = urlopen(url)
    soup = BeautifulSoup(r.read(), 'html.parser')
    # the second span under a td in the "table detail" block holds the price
    price = soup.select('.table.detail td span')[1]
    print(price.text)

getStockPrice()
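A fixed index such as [1] raises IndexError as soon as the page layout changes. The following is a minimal offline sketch of the same select-by-index idea with a bounds check. The markup, the 81.90 value, and the helper name span_text_at are assumptions made for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the snippet above; the real page may differ.
SAMPLE_HTML = """
<div class="table detail">
  <table>
    <tr><td>Previous close<span>81.90</span></td></tr>
    <tr><td>Bid (delayed*)<span>82.15</span></td></tr>
  </table>
</div>
"""


def span_text_at(html, selector, index):
    """Return the text of the index-th match for selector, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.select(selector)
    if not 0 <= index < len(matches):
        return None
    return matches[index].get_text(strip=True)


print(span_text_at(SAMPLE_HTML, ".table.detail td span", 1))
```

Returning None instead of raising lets the caller decide what to do when the expected span is missing.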

Related

Web scraping in Python - but problems exporting data to excel

I'm trying to export some data to Excel. I'm a total beginner, so I apologise for any dumb questions.
I'm practising scraping on the demo site webscraper.io, and so far I have scraped the data that I want, which is the laptop names and the links for the products:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url ="https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html)
css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    print(full_url)
I'm having major difficulties wrapping my head around how to export the text and full_url to Excel.
I have seen it done like this:
import pandas as pd
df = pd.DataFrame(laptops)
df.to_excel("laptops_testing.xlsx", encoding="utf-8")
But when I do so, I get an .xlsx file that contains a lot of data and markup that I don't want. I just want the data that I have been printing (text and full_url).
The data I'm seeing in Excel looks like this:
<div class="thumbnail">
<img alt="item" class="img-responsive" src="/images/test-sites/e-commerce/items/cart2.png"/>
<div class="caption">
<h4 class="pull-right price">$295.99</h4>
<h4>
<a class="title" href="/test-sites/e-commerce/allinone/product/545" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>
</h4>
<p class="description">Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd</p>
</div>
<div class="ratings">
<p class="pull-right">14 reviews</p>
<p data-rating="3">
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
</p>
</div>
</div>
This is not hard to solve: append the names and URLs to lists, turn the lists into a pandas DataFrame, and then write that to a new Excel file.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)
laptop_name = []
laptop_url = []
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    # append the laptop name
    laptop_name.append(text)
    print(full_url)
    # append the URL
    laptop_url.append(full_url)
# turn the two lists into a DataFrame
new_df = pd.DataFrame({'Laptop Name': laptop_name, 'Laptop url': laptop_url})
print(new_df)
# name of the Excel file to write
file_name = 'laptop.xlsx'
new_df.to_excel(file_name)
Use the soup.select function to find elements by extended CSS selectors.
Here's a short solution:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
# build (name, absolute URL) pairs for every link inside a product column
laptops = [(a.getText(), requests.compat.urljoin(url, a.get('href')))
           for a in soup.select("div.col-sm-4.col-lg-4.col-md-4 a")]
df = pd.DataFrame(laptops)
df.to_excel("laptops_testing.xlsx")
Try this. Remember to import pandas.
Also, try not to run the code too many times: each run sends a new request to the website.
html = r.text
soup = BeautifulSoup(html)
css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)
data = []
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    data.append([text, full_url])
df = pd.DataFrame(data, columns = ["laptop name","Url"])
df.to_csv("name")
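The answers above share one pattern: collect (name, URL) pairs and hand them to pandas. Here is a minimal offline sketch of that pattern, so it can be tried without sending a request on every run. The trimmed sample markup, the function name laptops_to_frame, and the column labels are assumptions for illustration, and urljoin from the standard library stands in for the f-string concatenation.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

# Trimmed-down stand-in for the page markup shown in the question.
SAMPLE_HTML = """
<div class="col-sm-4 col-lg-4 col-md-4">
  <div class="thumbnail"><div class="caption"><h4>
    <a class="title" href="/test-sites/e-commerce/allinone/product/545"
       title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>
  </h4></div></div>
</div>
"""


def laptops_to_frame(html, base_url):
    """Collect (name, absolute URL) pairs into a two-column DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in soup.select("div.col-sm-4.col-lg-4.col-md-4 a.title")
    ]
    return pd.DataFrame(rows, columns=["Laptop name", "URL"])


df = laptops_to_frame(SAMPLE_HTML, BASE_URL)
# to_csv without a path returns the CSV text instead of writing a file
print(df.to_csv(index=False))
```

Swapping the last line for df.to_excel("laptops.xlsx", index=False) writes a spreadsheet instead; that call needs an Excel engine such as openpyxl to be installed.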

Finding a specific span element with BeautifulSoup

I am trying to create a script to scrape price data from Udemy courses.
I'm struggling with navigating the HTML tree because the element I'm looking for is located inside multiple nested divs.
Here's the structure of the HTML element I'm trying to access:
What I tried:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
print(parent_div.find_all("span"))
and even:
response = requests.get(url)
print(response)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span span span")
Here’s the URL: https://www.udemy.com/course/the-complete-web-development-bootcamp/
I tried searching all the spans in the HTML, and the specific span I'm looking for doesn't appear. Maybe that's because it's nested inside a div?
I would appreciate a little guidance!
The price is loaded by JavaScript, so it cannot be scraped from the static HTML with BeautifulSoup.
The data is loaded from an API endpoint that takes the course ID.
Course ID of this course: 1565838
You can get the info directly from that endpoint like this:
import requests
course_id = '1565838'
url= f'https://www.udemy.com/api-2.0/course-landing-components/{course_id}/me/?components=price_text'
response = requests.get(url)
x = response.json()
print(x['price_text']['data']['pricing_result']['price'])
Output:
{'amount': 455.0, 'currency': 'INR', 'price_string': '₹455', 'currency_symbol': '₹'}
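Chained indexing like the lookup above raises KeyError if the endpoint ever omits a level. The sketch below walks the same key path defensively. The sample body is reconstructed only from the access path and output shown in this answer, so the real response may contain more fields; the helper name extract_price is an assumption.

```python
import json

# Response body reconstructed from the access path and output shown above;
# the real endpoint may return additional fields.
SAMPLE_BODY = json.dumps({
    "price_text": {"data": {"pricing_result": {"price": {
        "amount": 455.0, "currency": "INR",
        "price_string": "₹455", "currency_symbol": "₹",
    }}}}
})


def extract_price(payload):
    """Walk the nested keys; return None if any level is missing."""
    node = payload
    for key in ("price_text", "data", "pricing_result", "price"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


price = extract_price(json.loads(SAMPLE_BODY))
print(price["amount"], price["currency"])
```

With a live request, extract_price(response.json()) would take the place of the json.loads call.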
I tried your first approach several times and it works more-or-less for me, although it has returned a different number of span elements on different attempts (10 is the usual number but I have seen as few as 1 on one occasion):
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
parent_div = doc.find(class_="sidebar-container--purchase-section--17KRp")
spans = parent_div.find_all("span")
print(len(spans))
for span in spans:
    print(span)
Prints:
10
<span data-checked="checked" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--4" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Subscribe</span>
<span>Try it free for 7 days</span>
<span class="udlite-text-xs purchase-section-container--cta-subscript--349MH">$29.99 per month after trial</span>
<span class="purchase-section-container--line--159eG"></span>
<span data-checked="" data-name="u872-accordion--3" data-type="radio" id="u872-accordion-panel--6" style="display:none"></span>
<span class="purchase-options--option-radio--1zjJ_ udlite-fake-toggle-input udlite-fake-toggle-radio udlite-fake-toggle-radio-small"></span>
<span class="udlite-accordion-panel-title">Buy Course</span>
<span class="money-back">30-Day Money-Back Guarantee</span>
As far as your second approach goes, your main div does not have that many nested span elements, so it is bound to fail. Try just one span element:
import requests
from bs4 import BeautifulSoup
url = 'https://www.udemy.com/course/the-complete-web-development-bootcamp/'
response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")
main = doc.find(class_="generic-purchase-section--main-cta-container--3xxeM")
title = main.select_one("div span")
print(title)
Prints:
<span class="money-back">30-Day Money-Back Guarantee</span>
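To see why the number of span steps in the selector matters, here is a small offline sketch. The markup and the main-cta class name are hypothetical stand-ins, not Udemy's actual HTML. Each space in a CSS selector means "descendant of", so "div span span span" only matches a span nested inside two other spans.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one span inside nested divs, no span-in-span nesting.
SAMPLE_HTML = """
<div class="main-cta">
  <div>
    <div><span class="money-back">30-Day Money-Back Guarantee</span></div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
main = soup.find(class_="main-cta")

too_deep = main.select_one("div span span span")  # needs three nested spans
one_level = main.select_one("div span")           # any span inside a div

print(too_deep)
print(one_level.get_text())
```

select_one returns None rather than raising when nothing matches, so the result is worth checking before calling methods on it.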

BeautifulSoup parsing issues some div not showing

I'm trying to parse this page: https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/
The problem is, in this element: https://gyazo.com/e544be64a41a121bdb0c0f71aef50692 ,
I want the div that contains the price. If you inspect the page, you can see the html code for this part, shows like this:
<div class="price">
<div class="price">
"thePrice"
<sup>93</sup>
</div>
</div>
BUT, when using page_soup = soup(my_html_page, 'html.parser') or page_soup = soup(my_html_page, 'lxml') or page_soup = soup(my_html_page, 'html5lib') I only get this as the result for that part:
<div class="price"></div>
And that's it. I've been searching for hours on the internet to figure out why that inner div doesn't get parsed.
I tried three different parsers, and none seems to get past the fact that the inner child shares the same class name as its parent, if this is the issue.
Hope this helps.
from bs4 import BeautifulSoup
import requests
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
html = BeautifulSoup(requests.get(url).content, 'html.parser')
prices = html.find_all("div", {"class": "price"})
for price in prices:
    print(price.text)
Output:
561€95
169€94
165€95
1 165€94
7 599€95
267€95
259€94
599€95
511€94
1 042€94
2 572€94
783€95
2 479€94
2 699€95
499€94
386€95
169€94
2 343€95
783€95
499€94
499€94
259€94
267€95
165€95
169€94
2 399€95
561€95
2 699€95
2 699€95
6 059€95
7 589€95
10 991€95
9 619€94
2 479€94
3 135€95
7 589€95
511€94
1 042€94
386€95
599€95
1 165€94
2 572€94
783€95
2 479€94
2 699€95
499€94
169€94
2 343€95
2 699€95
3 135€95
6 816€95
7 589€95
561€95
267€95
To scrape all prices where class="price", see this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ldlc.com/fr-be/informatique/pieces-informatique/carte-professionnelle/c4685/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Select all the 'price' classes
for tag in soup.select('div.price'):
    print(tag.text)
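The scraped strings such as 561€95 are not yet numbers. The following is a possible sketch for converting them. It assumes that the digits before the euro sign are whole euros and the digits after it are cents, which is what the sup element in the question suggests but is not confirmed by the site. The function name parse_price is made up for this example.

```python
from decimal import Decimal

DIGITS = "0123456789"


def parse_price(text):
    """Convert text such as '1 165€94' to Decimal('1165.94').

    Assumes euros come before the euro sign and cents after it.
    Spaces used as thousands separators are dropped.
    """
    euros_part, sep, cents_part = text.partition("€")
    euros = "".join(ch for ch in euros_part if ch in DIGITS)
    cents = "".join(ch for ch in cents_part if ch in DIGITS)
    if not sep or not euros:
        raise ValueError(f"unrecognised price text: {text!r}")
    return Decimal(f"{euros}.{cents or '00'}")


for raw in ["561€95", "1 165€94", "10 991€95"]:
    print(raw, "->", parse_price(raw))
```

Decimal is used instead of float so that the cents are kept exactly.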

Can't access a tweet id with beautiful soup

My goal is to retrieve the ids of tweets in a twitter search as they are being posted. My code so far looks like this:
import requests
from bs4 import BeautifulSoup
keys = some_key_words + " -filter:retweets AND -filter:replies"
query = "https://twitter.com/search?f=tweets&vertical=default&q=" + keys + "&src=typd&lang=es"
req = requests.get(query).text
soup = BeautifulSoup(req, "lxml")
for tweets in soup.findAll("li",{"class":"js-stream-item stream-item stream-item"}):
    print(tweets)
However, this doesn't return anything. Is there a problem with the code itself or am I looking at the wrong place of the source code? I understand that the ids should be stored here:
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item" data-item-id="1210306781806833664" id="stream-item-tweet-1210306781806833664" data-item-type="tweet">
from bs4 import BeautifulSoup
data = """
<div class="stream">
<ol class="stream-items js-navigable-stream" id="stream-items-id">
<li class="js-stream-item stream-item stream-item
" data-item-id="1210306781806833664"
id="stream-item-tweet-1210306781806833664"
data-item-type="tweet"
>
...
"""
soup = BeautifulSoup(data, 'html.parser')
for item in soup.findAll("li", {'class': 'js-stream-item stream-item stream-item'}):
    # read the attribute value from the matched <li>
    print(item.get("data-item-id"))
Output:
1210306781806833664
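An alternative to matching the full class string is a CSS attribute selector, which keys on the data attributes themselves. This is a sketch of that different technique, not the method used in the answer above. The second li element and the function name tweet_ids are hypothetical, and the approach only helps if the HTML returned by the request actually contains these elements.

```python
from bs4 import BeautifulSoup

# Markup adapted from the question, plus one hypothetical non-tweet item.
SAMPLE_HTML = """
<ol class="stream-items js-navigable-stream" id="stream-items-id">
  <li class="js-stream-item stream-item stream-item"
      data-item-id="1210306781806833664"
      id="stream-item-tweet-1210306781806833664"
      data-item-type="tweet"></li>
  <li class="js-stream-item stream-item" data-item-type="user"></li>
</ol>
"""


def tweet_ids(html):
    """Return the data-item-id of every <li> marked as a tweet."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.select('li[data-item-type="tweet"][data-item-id]')
    return [li["data-item-id"] for li in items]


print(tweet_ids(SAMPLE_HTML))
```

Requiring the data-item-id attribute in the selector means the list comprehension cannot hit a missing key.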

Scraping multiple data tags from HTML using beautiful Soup

I am attempting to scrape HTML to create a dictionary that includes a pitcher's name and his handedness. The data tags are buried; so far I've only been able to collect the pitcher's name from the data set. The HTML output (for each player) is as follows:
<div class="pitcher players">
<input name="import-data" type="hidden" value="%5B%7B%22slate_id%22%3A20190%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210893103%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20192%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210894893%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%2C%7B%22slate_id%22%3A20193%2C%22type%22%3A%22classic%22%2C%22player_id%22%3A%2210895115%22%2C%22salary%22%3A%2211800%22%2C%22position%22%3A%22SP%22%2C%22fpts%22%3A14.96%7D%5D"/>
<a class="player-popup" data-url="https://rotogrinders.com/players/johnny-cueto-11193?site=draftkings" href="https://rotogrinders.com/players/johnny-cueto-11193">Johnny Cueto</a>
<span class="meta stats">
<span class="stats">
R
</span>
<span class="salary" data-role="salary" data-salary="$11.8K">
$11.8K
</span>
<span class="fpts" data-fpts="14.96" data-product="56" data-role="authorize" title="Projected Points">14.96</span>
I've tinkered and am coming up empty; I'm sure I'm overthinking this. Here is the code I have so far:
import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = [soup.find_all("div", {'class':'pitcher players'})]
What's the best way to loop through the results set for the more granular data tag information I need?
I need the text from the HTML beginning with <a class="player-popup", and the handedness from the <span class="stats"> tag.
Optimally, I would have a dictionary with the following:
{Johnny Cueto : R, Player 2 : L, ...}
import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/lineups/mlb?site=draftkings"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
players_confirmed = {}
results = soup.find_all("div", {'class': 'pitcher players'})
dicti={}
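for j in results:
    # name comes from the first <a>; handedness from the second .stats match
    dicti[j.a.text] = j.select(".stats")[1].text.strip("\n").strip()
Just use the select or find function on each element that was found, and you will be able to iterate over the results.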
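The same loop can be tried offline against the markup from the question. In this sketch the first block is trimmed from the question's HTML, while the second player, the function name pitcher_hands, and the selector span.meta span.stats (used in place of the [1] index) are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# First block trimmed from the question; the second player is hypothetical.
SAMPLE_HTML = """
<div class="pitcher players">
  <a class="player-popup" href="https://rotogrinders.com/players/johnny-cueto-11193">Johnny Cueto</a>
  <span class="meta stats">
    <span class="stats">
      R
    </span>
    <span class="salary" data-role="salary" data-salary="$11.8K">$11.8K</span>
  </span>
</div>
<div class="pitcher players">
  <a class="player-popup" href="#">Player 2</a>
  <span class="meta stats"><span class="stats">L</span></span>
</div>
"""


def pitcher_hands(html):
    """Map each pitcher's name to the text of the inner span.stats."""
    soup = BeautifulSoup(html, "html.parser")
    hands = {}
    for block in soup.find_all("div", {"class": "pitcher players"}):
        name = block.find("a", class_="player-popup")
        hand = block.select_one("span.meta span.stats")
        if name and hand:  # skip blocks that lack either element
            hands[name.get_text(strip=True)] = hand.get_text(strip=True)
    return hands


print(pitcher_hands(SAMPLE_HTML))
```

Selecting the inner span by its parent avoids depending on the order in which .stats elements appear.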
