import requests
from bs4 import BeautifulSoup

URL = 'https://www.moneycontrol.com/india/stockpricequote/cigarettes/itc/ITC'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')
# time.sleep(5)
var1 = float(soup.find('td', attrs={'class': 'espopn'}).get_text().replace(",", ""))
With this code I am able to get the value of var1, but the web page I am accessing does not show real-time data as soon as it loads; it takes about a second after landing on the page for the real-time value to appear.
Because of this, the value I am getting in var1 is not the real-time value.
I want to know how I can wait after landing on the web page before doing the web scraping.
Thanks in Advance.
1. The data is updated dynamically, so it is hard to get with bs4 alone; you can fetch it from the site's API instead. Here is how to find it.
2. Open Chrome developer mode, go to the Network tab, and filter by XHR. Now reload the website; under the Name column you will find many request URLs.
3. Use the search box on the left to search for the price. It points you to the URL that returns it; click on it, go to Headers, copy that URL, and make the call using the requests module.
import requests

# API endpoint found via the Network tab (XHR) in Chrome DevTools
res = requests.get("https://api.moneycontrol.com/mcapi/v1/stock/get-stock-price?scIdList=ITC%2CVST%2CGPI%2CIWP540954%2CGTC&scId=ITC")
main_data = res.json()
print(main_data['data'][0])
Output:
{'companyName': 'ITC',
'lastPrice': '215.25',
'perChange': '-0.62',
'marketCap': '264947.87',
'scTtm': '19.99',
'perform1yr': '7.33',
'priceBook': '4.16'}
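To get the same number that var1 held in the original snippet, you can read lastPrice from that dict; note the API returns it as a string, so it needs converting:

import requests

# Same endpoint as above; lastPrice arrives as a string, so convert it
res = requests.get("https://api.moneycontrol.com/mcapi/v1/stock/get-stock-price?scIdList=ITC%2CVST%2CGPI%2CIWP540954%2CGTC&scId=ITC")
var1 = float(res.json()['data'][0]['lastPrice'])
print(var1)  # e.g. 215.25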
I'm trying to scrape the gold price ticker from Yahoo! Finance.
from bs4 import BeautifulSoup
import requests, lxml
response = requests.get('https://finance.yahoo.com/quote/GC=F?p=GC=F')
soup = BeautifulSoup(response.text, 'lxml')
gold_price = soup.findAll("div", class_='My(6px) Pos(r) smartphone_Mt(6px)')[2].find_all('p').text
Whenever I run this it returns: list index out of range.
When I do print(len(soup)) it returns 4.
Any ideas?
Thank you.
You can make a direct request to the Yahoo server. To locate the query URL, open the Network tab via DevTools (F12) -> Fetch/XHR -> find the name spark?symbols= (refresh the page if you don't see any), find the needed symbol, and see the response in the Preview tab that opens on the right.
You can make direct requests to any of these links when the request method is GET; POST requests are much more complicated to reproduce.
You need the json and requests libraries; there is no need for bs4. Note that making a lot of such requests might get your IP blocked (or rate-limited), or you might stop getting any response because their system detects a bot, since a regular user won't make such requests to the server repeatedly. So you need to figure out how to work around that.
Update:
There's possibly a hard limit on how many requests can be made within a given period of time.
Code and example in the online IDE (contains full JSON response):
import requests

# Query URL found via DevTools -> Network -> Fetch/XHR (name: spark?symbols=)
response = requests.get('https://query1.finance.yahoo.com/v7/finance/spark?symbols=GC%3DF&range=1d&interval=5m&indicators=close&includeTimestamps=false&includePrePost=false&corsDomain=finance.yahoo.com&.tsrc=finance')
data_1 = response.json()
gold_price = data_1['spark']['result'][0]['response'][0]['meta']['previousClose']
print(gold_price)
# 1830.8
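Regarding the possible rate limit mentioned above, here is a minimal sketch of spacing out repeated calls with a browser-like User-Agent and a simple retry/backoff; the delay and retry values are illustrative assumptions, not Yahoo's documented limits:

import time
import requests

URL = ('https://query1.finance.yahoo.com/v7/finance/spark'
       '?symbols=GC%3DF&range=1d&interval=5m&indicators=close')
HEADERS = {'User-Agent': 'Mozilla/5.0'}  # look less like a default requests client

def fetch_gold_price(retries=3, backoff=5):
    # Retry with a growing delay if the server starts refusing requests
    for attempt in range(retries):
        response = requests.get(URL, headers=HEADERS)
        if response.status_code == 200:
            data = response.json()
            return data['spark']['result'][0]['response'][0]['meta']['previousClose']
        time.sleep(backoff * (attempt + 1))
    return None

print(fetch_gold_price())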
P.S. I have a blog post about scraping the Yahoo! Finance home page, which is somewhat relevant.
I am working on a web scraping project and want to get a list of products from Dell's website. I found this link (https://www.dell.com/support/home/us/en/04/products/) which pulls up a box with a list of product categories (really just redirect urls; if it doesn't come up for you, click the button that says "Browse all products"). I tried using Python Requests to GET the page and save the text to a file to parse through, but the response doesn't contain any of the categories/redirect urls. My code is as basic as it gets:
import requests

url = "https://www.dell.com/support/home/us/en/04/products/"
page = requests.get(url)
# The with-block closes the file automatically, so no explicit close() is needed
with open("laptops.txt", "w", encoding="utf-8") as outf:
    outf.write(page.text)
Is there a way to get these redirect urls? I am essentially trying to make my own site map of their products so that I can scrape the details of each one. Thanks
This page uses JavaScript to get and display these links, but requests/urllib and BeautifulSoup/lxml can't run JavaScript.
Using DevTools in Firefox/Chrome (Network tab) I found that it reads them from the url
https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743
so I use that to get the links.
You may have to change country=pl&language=pl in the url to get results in a different language.
import requests
from bs4 import BeautifulSoup as BS

url = "https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743"
response = requests.get(url)
soup = BS(response.text, 'html.parser')

# Every product category in this component is a plain <a> link
all_items = soup.find_all('a')
for item in all_items:
    print(item.text, item['href'])
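As a side note, you can let requests build that query string from a params dict, which makes swapping country and language less error-prone; a sketch using the same endpoint and values:

import requests
from bs4 import BeautifulSoup as BS

# Same endpoint as above; requests assembles the query string from params.
# The "_" timestamp in the original url looks like a cache-buster, omitted here.
base_url = "https://www.dell.com/support/components/productselector/allproducts"
params = {
    'category': 'all-products/esuprt_',
    'country': 'pl',    # change these two for a different country/language
    'language': 'pl',
    'region': 'emea',
    'segment': 'bsd',
    'customerset': 'plbsd1',
    'openmodal': 'true',
}
response = requests.get(base_url, params=params)
soup = BS(response.text, 'html.parser')
for item in soup.find_all('a'):
    print(item.text, item['href'])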
BTW: another method is to use Selenium to control a real web browser, which can run JavaScript.
Try using the Selenium Chrome driver. It helps with handling dynamic data on a website and also offers features like clicking buttons, handling page refreshes, etc.
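A minimal sketch of the Selenium approach both answers mention, assuming chromedriver is installed and on your PATH; the explicit wait and the link selector are illustrative:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
browser.get("https://www.dell.com/support/home/us/en/04/products/")

# Wait up to 10 s for at least one link to appear after the JavaScript runs
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'a'))
)

# Parse the rendered HTML rather than the bare server response
soup = BeautifulSoup(browser.page_source, 'html.parser')
for item in soup.find_all('a'):
    print(item.text, item.get('href'))

browser.quit()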
I am making a project where I want to see the average karma of users on various subreddits on Reddit. As such, I am in the process of scraping users' karma, which is proving a bit difficult with the new Reddit structure.
I am not able to use PRAW, as the karma figures there are not correct.
According to the page source of a user, all I need to find is the following two variables: commentKarma and postKarma. Both of these variables are found under the "" section; see this example: view-source:https://www.reddit.com/user/loganb3171. However, when I use selenium's page_source or BeautifulSoup, they do not show up.
I have been working on this problem for a couple of hours now and I am nowhere near solving it.
Any and all help is appreciated.
Neither of these snippets gives me the entire page source that you get by right-clicking and choosing "view page source":
source_var = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
source_var = driver.page_source
Okay, so I see from the snippet in the question that you're using Selenium. If that's the case, there's no way to set request headers with the web driver, so Reddit will know you are a bot.
If you only need the page source, you can use requests to fetch the page and then either load that HTML in Selenium or parse it with BeautifulSoup:
from bs4 import BeautifulSoup
import requests

url = "https://www.reddit.com/user/loganb3171"
# The User-Agent header makes the request look like it comes from a browser
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
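If commentKarma and postKarma really do appear as plain text in that page source, as the question suggests, here is a hedged sketch of pulling them out with a regex; the exact pattern depends on how the values are embedded, so treat it as a starting point:

import re
import requests

url = "https://www.reddit.com/user/loganb3171"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

# Assumes the values are embedded like "commentKarma":12345 somewhere in the HTML
for key in ('commentKarma', 'postKarma'):
    match = re.search(r'"%s"\s*:\s*(\d+)' % key, page.text)
    if match:
        print(key, match.group(1))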
I want to scrape data from a <span> element on a given website using BeautifulSoup. You can see in the screenshot where it is located. However, the code that I'm using just returns an empty list; I can't find the data I want in it. What am I doing wrong?
from bs4 import BeautifulSoup
import urllib.request

url = "http://144.122.167.229"
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
data = opener.open(url).read()
soup = BeautifulSoup(data, 'html.parser')

your_data = list()
for line in soup.findAll('span', attrs={'id': 'mc1_legend_value'}):
    your_data.append(line.text)
for line in soup.findAll('span'):
    your_data.append(line.text)
Screenshot: https://imgur.com/a/z0vNh
Thank you.
The dashboard in the screenshot looks to me like something JavaScript would generate. If you can't find the tag in the page source, that means it was added later by some JavaScript code, or your browser tried to fix some html that it considered broken or out of place.
Keep in mind that right now you're sending a request to a server and it serves the plain html back. A browser would parse the html and execute any JavaScript code it finds. In your case, beautiful soup and urllib don't execute any JavaScript code: urllib fetches the html and beautiful soup makes it easier to parse and extract the relevant information.
If you want to get the value from that tag, I recommend using a headless browser to render the page and only then parsing its html with beautiful soup or any other parser.
Give selenium a try: http://selenium-python.readthedocs.io/.
You can control your own browser programmatically. You can make it request the page for you, render it, save the new html in a variable, parse it using beautiful soup, and extract the values you're interested in. I believe it already has its own parser implemented, which you can use directly to search for that tag.
Or maybe even try scrapinghub's splash: https://github.com/scrapinghub/splash
If the dashboard communicates with a server in real time and that value is continuously received from the server, you could take a look at what requests are sent to the server in order to get it. Look in the developer console under the Network tab: press F12 to open the developer console and click on Network. Refresh the page and you should see all the requests sent to the server along with the responses. Requests sent by the JavaScript are usually XMLHttpRequests; click on XHR in the Network tab to filter out the other requests. (These are instructions for Google Chrome; Firefox might differ a bit.)
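Once you spot the XHR request that carries the value, you can usually replay it with requests; a sketch with a hypothetical endpoint path, since the real one depends on what you see in the Network tab:

import requests

# Hypothetical endpoint: replace with the actual XHR URL from the Network tab
url = "http://144.122.167.229/api/legend_values"
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})

# Many dashboards return JSON, but check the response in DevTools to be sure
print(response.json())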
I'm trying to loop through Zillow pages and extract data. I know that the URL is being updated with a new page number after each iteration, but the data extracted is as if the URL were still on page 1.
import selenium
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import pandas as pd

next_page = 'https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/'
num_data1 = pd.DataFrame(columns=['name', 'number'])
browser = webdriver.Chrome()
browser.get('https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/')

while True:
    page = requests.get(next_page)
    contents = page.content
    soup = BeautifulSoup(contents, 'html.parser')
    number_p = soup.find_all('p', attrs={'class': 'ldb-phone-number'}, text=True)
    name_p = soup.find_all('p', attrs={'class': 'ldb-contact-name'}, text=True)
    number_p = pd.DataFrame(number_p, columns=['number'])
    name_p = pd.DataFrame(name_p, columns=['name'])
    num_data = number_p['number'].apply(lambda x: x.text.strip())
    nam_data = name_p['name'].apply(lambda x: x.text.strip())
    number_df = pd.DataFrame(num_data, columns=['number'])
    name_df = pd.DataFrame(nam_data, columns=['name'])
    num_data0 = pd.concat([number_df, name_df], axis=1)
    num_data1 = num_data1.append(num_data0)
    try:
        button = browser.find_element_by_css_selector('.zsg-pagination>li.zsg-pagination-next>a').click()
        next_page = str(browser.current_url)
    except IndexError:
        break
Replace page = requests.get(next_page) with page = browser.page_source (and pass that string straight to BeautifulSoup, since it is already the HTML rather than a response object).
Basically what's happening is that you're moving to the next page in Chrome, but then trying to load that page's url with requests, which gets redirected back to page one by Zillow (probably because the request doesn't have the cookies or appropriate request headers).
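A minimal sketch of the loop with that change applied, dropping the pandas bookkeeping for brevity; note that Selenium raises NoSuchElementException rather than IndexError when the next-page link is gone:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
browser.get('https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/')

while True:
    # Parse the page Selenium is actually on instead of re-requesting the URL
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    names = soup.find_all('p', attrs={'class': 'ldb-contact-name'})
    numbers = soup.find_all('p', attrs={'class': 'ldb-phone-number'})
    for name, number in zip(names, numbers):
        print(name.text.strip(), number.text.strip())
    try:
        # NoSuchElementException is raised when there is no "next" link left
        browser.find_element_by_css_selector(
            '.zsg-pagination>li.zsg-pagination-next>a').click()
    except NoSuchElementException:
        break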
Why not make your life easier and use the Zillow API instead of scraping? (Do you even have permission to scrape their site?)