Python Requests-HTML - Can't find specific data - python

I am trying to scrape a web page using python requests-html library.
link to that web page is https://www.koyfin.com/charts/g/USADebt2GDP?view=table ,
below image shows (red rounded data) the data what i want to get.
My code is like this,
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.koyfin.com/charts/g/USADebt2GDP?view=table')
r.html.render(timeout=60)
print(r.text)
web page html like this,
Problem is when i scrape the web page i can't find the data i want, in HTML code i can see
the data inside first div tags in body section.
Any specific suggestions for how to solve this.
Thanks.

The problem is that the data is being loaded by JavaScript code after the initial page load. One solution is to use Selenium to drive a web browser to scrape the page. But using a regular browser I looked at the network requests that were being made and it appears that the data you seek is being loaded with the following AJAX call:
https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly
So:
import requests
response = requests.get('https://api.koyfin.com/api/v2/commands/g/g.gec/USADebt2GDP?dateFrom=2010-08-20&dateTo=2020-09-05&period=yearly')
results = response.json();
print(results)
for t in results['graph']['data']:
print(t)
Prints:
{'ticker': 'USADebt2GDP', 'companyName': 'United States Gross Federal Debt to GDP', 'startDate': '1940-12-31T00:00:00.000Z', 'endDate': '2019-12-31T00:00:00.000Z', 'unit': 'percent', 'graph': {'column_names': ['Date', 'Volume'], 'data': [['2010-12-31', 91.4], ['2011-12-31', 96], ['2012-12-31', 100.1], ['2013-12-31', 101.2], ['2014-12-31', 103.2], ['2015-12-31', 100.8], ['2016-12-31', 105.8], ['2017-12-31', 105.4], ['2018-12-31', 106.1], ['2019-12-31', 106.9]]}, 'withoutLiveData': True}
['2010-12-31', 91.4]
['2011-12-31', 96]
['2012-12-31', 100.1]
['2013-12-31', 101.2]
['2014-12-31', 103.2]
['2015-12-31', 100.8]
['2016-12-31', 105.8]
['2017-12-31', 105.4]
['2018-12-31', 106.1]
['2019-12-31', 106.9]
How I Came Up with the URL
And when you click on the last message:

Related

Not able to get updated data from web page using BeautifulSoup

import requests
URL = 'https://www.moneycontrol.com/india/stockpricequote/cigarettes/itc/ITC'
response = requests.get(URL)
soup = BeautifulSoup(response.text,'html.parser')
# time.sleep(5)
var1 = float(soup.find('td', attrs={'class': 'espopn'}).get_text().replace(",",""))
With this code, I am able to the value of var1, but the web page which I am accessing not showing real-time data once we land on the web page, it took 1 sec to update the real-time value once we land on the web page.
Due to which the value that I am getting in var1 is not a real-time value.
Wanted to know how I can wait once I land on the web page before doing web scraping.
Thanks in Advance.
1.As Data is updating dynamic so hard to get from bs4 so you can try from api itself so how to find it
2.Go to chrome developer mode and then Network tab find xhr and now reload your website under Name tab you will find links but there are lot of
3.But on left side there is search so you can search price and from it gives url and you click on that go to headers copy that url and make call using requests module
import requests
from bs4 import BeautifulSoup
res=requests.get("https://api.moneycontrol.com/mcapi/v1/stock/get-stock-price?scIdList=ITC%2CVST%2CGPI%2CIWP540954%2CGTC&scId=ITC")
main_data=res.json()
main_data['data'][0]
Output:
{'companyName': 'ITC',
'lastPrice': '215.25',
'perChange': '-0.62',
'marketCap': '264947.87',
'scTtm': '19.99',
'perform1yr': '7.33',
'priceBook': '4.16'}
Image:

How do I get a list of redirect urls from Dell.com

I am working on a web scraping project and want to get a list of products from Dell's website. I found this link (https://www.dell.com/support/home/us/en/04/products/) which pulls up a box with a list of product categories (really just redirect urls. If it doesn't come up for you click the button which says "Browse all products"). I tried using Python Requests to GET the page and save the text to a file to parse through, but the response doesn't contain any of the categories/redirect urls. My code is as basic as it gets:
import requests
url = "https://www.dell.com/support/home/us/en/04/products/"
page = requests.get(url)
with open("laptops.txt", "w", encoding="utf-8") as outf:
outf.write(page.text)
outf.close()
Is there a way to get these redirect urls? I am essentially trying to make my own site map of their products so that I can scrape the details of each one. Thanks
This page uses JavaScript to get and display these links - but requests/urllib and BeautifulSoup/lxml can't run JavaScript.
Using DevTools in Firefox/Chrome (tab: Network) I found it reads it from url
https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743
so I use it to get links.
You may have to to change country=pl&language=pl in url to get it in different language.
import requests
from bs4 import BeautifulSoup as BS
url = "https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743"
response = requests.get(url)
soup = BS(response.text, 'html.parser')
all_items = soup.find_all('a')
for item in all_items:
print(item.text, item['href'])
BTW: Other method is it use Selenium to control real web browser which can run JavaScript.
try using selenium chrome driver it helps for handling dynamic data on website and also features like clicking buttons, handling page refresh etc.
Beginner guide to web scraping

How web scrape with request, Bs4 when there is a script result?

I am trying to get some data from this website:
http://www.espn.com.br/futebol/resultados/_/liga/BRA.1/data/20181018
When I inspect the page on my browser I can see all the values I need on the HTML. I want to fetch the game result and the players names (for each date, in this example 2018-10-18)
On no game days the website shows:
"Sem jogos nesta data", which is it easy to find on browser inspection:
But when using
url = 'http://www.espn.com.br/futebol/resultados/_/liga/todos/data/20181018'
page = requests.get(url, "lxml")
The output is basically the website where I can't find the phrase "Sem jogos nesta data"
How can I get fetch the HTML containing the script results? Is it possible with request? urllib?
Looks like the data you are looking for that comes from their backend API. I would use selenium-python package instead of requests.
Here is example:
driver = webdriver.Firefox()
driver.get("http://www.espn.com.br/futebol/resultados/_/liga/todos/data/20181018")
value = driver.find_elements(By.XPATH, '//*[#id="events"]/div')
drive.close()
I didn't check the code but it should be working

Trigger data response from .aspx page

from bs4 import BeautifulSoup
from pprint import pprint
import requests
url = 'http://estadistico.ut.com.sv/OperacionDiaria.aspx'
s = requests.Session()
pagereq = s.get(url)
soup = BeautifulSoup(pagereq.content, 'lxml')
viewstategenerator = soup.find("input", attrs = {'id': '__VIEWSTATEGENERATOR'})['value']
viewstate = soup.find("input", attrs = {'id': '__VIEWSTATE'})['value']
eventvalidation = soup.find("input", attrs = {'id': '__EVENTVALIDATION'})['value']
eventtarget = 'ASPxDashboardViewer1'
DXCss = '1_33,1_4,1_9,1_5,15_2,15_4'
DXScript = '1_232,1_134,1_225,1_169,1_187,15_1,1_183,1_182,1_140,1_147,1_148,1_142,1_141,1_143,1_144,1_145,1_146,15_0,15_6,15_7'
eventargument = {"Task":"Export","ExportInfo":{"Mode":"SingleItem","GroupName":"pivotDashboardItem1","FileName":"Generación+por+tipo+de+tecnología+(MWh)","ClientState":{"clientSize":{"width":509,"height":385},"titleHeight":48,"itemsState":[{"name":"pivotDashboardItem1","headerHeight":34,"position":{"left":11,"top":146},"width":227,"height":108,"virtualSize":'null',"scroll":{"horizontal":'true',"vertical":'true'}}]},"Format":"Excel","DocumentOptions":{"paperKind":"Letter","pageLayout":"Portrait","scaleMode":"AutoFitWithinOnePage","scaleFactor":1,"autoFitPageCount":1,"showTitle":'true',"title":"Operación+Diaria","imageFormatOptions":{"format":"Png","resolution":96},"excelFormatOptions":{"format":"Csv","csvValueSeparator":","},"commonOptions":{"filterStatePresentation":"None","includeCaption":'true',"caption":"Generación+por+tipo+de+tecnología+(MWh)"},"pivotOptions":{"printHeadersOnEveryPage":'true'},"gridOptions":{"fitToPageWidth":'true',"printHeadersOnEveryPage":'true'},"chartOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"pieOptions":{"autoArrangeContent":'true'},"gaugeOptions":{"autoArrangeContent":'true'},"cardOptions":{"autoArrangeContent":'true'},"mapOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"rangeFilterOptions":{"automaticPageLayout":'true',"sizeMode":"Stretch"},"imageOptions":{},"fileName":"Generación+por+tipo+de+tecnología+(MWh)"},"ItemType":"PIVOT"},"Context":"BwAHAAIkY2NkNWRiYzItYzIwNS00MDIyLTkzZjUtYWQ0NzVhYTM5Y2E3Ag9PcGVyYWNpb25EaWFyaWECAAIAAAAAAMByQA==","RequestMarker":1,"ClientState":{}}
postdata = {'__EVENTTARGET': eventtarget,
'__EVENTARGUMENT': eventargument,
'__VIEWSTATE': viewstate,
'__VIEWSTATEGENERATOR': viewstategenerator,
'__EVENTVALIDATION': eventvalidation,
'DXScript': DXScript,
'DXCss': DXCss
}
datareq = s.post(url, data = postdata)
print datareq.text
I'm trying to scrape data from this .aspx webpage. The page loads the data dynamically via javascript so scraping directly with requests/BeautifulSoup won't work.
By looking at the network traffic I can see that when you click the export (Exportar a) button for an element, select a type of export (excel, csv) then confirm a POST request is made to the page. It returns a base64 encoded string of the data I need. As far as I can tell there is no way to make a GET request for the file directly as it is only generated when requested.
What I'm trying to do is is copy the POST request which triggers the csv response. So first I scrape for __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION. __EVENTTARGET, DXCSS and DXScript look to be fixed. __EVENTARGUMENT is copied directly from the POST request.
My code returns a server application error. I'm thinking the problem is either a) wrong __EVENTARGUMENT (maybe part dynamic rather than fixed?), b) not really understanding how .aspx pages work or c) what I'm trying to do isn't possible with these tools.
I did look at using selenium to trigger the data export but I couldn't see a way to capture the server response.
I was able to get help from someone who knows more about aspx pages than me.
Link to the Github gist that provides the solution.
https://gist.github.com/jarek/d73c672d8dd4ddb48d80bffc4d8038ba

How to use Python to reach an external web page

I have external web page like www.3thpart.com/folder/test.jsp?page=IND
This page uses "TOKEN" to authenticate there is no login(authentication) but for each different request web site creates a token.
I use this page in my company by some users lots of time. To get easier and save time What I need to do is to collect result in backand then display the result in a HTML page in from of the users.
I tried python with Visual studio. I am comfused. is it possible to give me advise that the way i have to follow. I mean I am trying to understand the logic of this situation.
Thanks in advance
C.A.
UPDATE:
import requests
url = 'https://subdomain.domain.com.tr/folder/main.jsp?page=ID'
payload = {
'Requested': 'fieldValue',
'Cmdd ': 'fieldValue',
'TOKEN': '',
'LANG': 'tr',
}
s = requests.Session()
response = s.post(url, data=payload)
outfile =open( "/Projects/Gib/test.txt","w")
response.encoding = 'UTF-8'
outfile.write(response.text)

Categories