Scrape a JavaScript heavy website with requests+bs4

I'm trying to scrape some stock market data, but I'd rather not use Selenium. I usually trace the fetch requests in the browser and reproduce them with the requests module.
On this site, I found the URLs that I think are responsible for fetching data for the frontend.
But those links can't be accessed directly (ex: link). They return HTTP 405. Why is that?

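HTTP 405 means Method Not Allowed: the endpoint rejects GET, which is what a browser or requests.get sends. You can access the API by making a POST request instead.
Try this:
import json
import requests
response = requests.post('https://www.cse.lk/api/marketSummery')
json.loads(response.content)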
Output
{'id': 30071974,
'tradeVolume': 5829973145.4,
'shareVolume': 231803649,
'tradeDate': 1660549800496}
The other APIs work too, e.g.:
url = "https://www.cse.lk/api/dailyMarketSummery"
response = requests.post(url)
json.loads(response.content)
Output
[[{'id': 12893,
'tradeDate': 1660501800000,
'marketTurnover': 5829973000.0,
'marketTrades': 51086.0,
'marketDomestic': 50376.0,
'marketForeign': 710.0,
'equityTurnover': 5829973000.0,
'equityDomesticPurchase': 5709995000.0,
'equityDomesticSales': 5592159700.0,
'equityForeignPurchase': 119978192.0,
'equityForeignSales': 237813568.0,
'volumeOfTurnOverNumber': 231803648.0,
'volumeOfTurnoverDomestic': 226246512.0,
'volumeOfTurnoverForeign': 5320827,
'tradesNo': 51086,
...
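
The GET-versus-POST behaviour described above can be wrapped in a small helper. This is a minimal sketch, not something from the original answers: fetch_json is a hypothetical name, and the fallback assumes the endpoint needs no request body, as with the two URLs shown.

```python
import requests


def fetch_json(url, session=None):
    """GET the URL; if the server answers 405, retry the same URL with POST."""
    http = session or requests.Session()
    r = http.get(url, timeout=10)
    if r.status_code == 405:
        # The Allow header, when present, lists the methods the endpoint accepts.
        print('GET rejected, allowed methods:', r.headers.get('Allow', 'not stated'))
        r = http.post(url, timeout=10)
    r.raise_for_status()
    return r.json()
```

Calling fetch_json('https://www.cse.lk/api/marketSummery') would then go through the POST branch if the site still answers GET with 405.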

import requests
from pprint import pp
def main(url):
    r = requests.post(url)
    pp(r.json())
main('https://www.cse.lk/api/marketSummery')
Output
{'id': 30071974,
'tradeVolume': 5829973145.4,
'shareVolume': 231803649,
'tradeDate': 1660549800496}

Related

Why won't python request pagination work?

I'm trying to use pagination to request multiple pages of rental listings from Zillow. Otherwise I'm limited to the first page only. However, my code seems to load only the first page, even if I request other pages manually.
# Rent
import requests
from bs4 import BeautifulSoup as soup
import json
url = 'https://www.zillow.com/torrance-ca/rentals'
params = {
'q': {"pagination":{"currentPage": 1},"isMapVisible":False,"filterState":{"fore":{"value":False},"mf":{"value":False},"ah":{"value":True},"auc":{"value":False},"nc":{"value":False},"fr":{"value":True},"land":{"value":False},"manu":{"value":False},"fsbo":{"value":False},"cmsn":{"value":False},"fsba":{"value":False}},"isListVisible":True}
}
headers = {
# headers were copied from network tab on developer tools in chrome
}
html = requests.get(url=url,headers=headers, params=params)
html.status_code
bsobj = soup(html.content, 'lxml')
for script in bsobj.find_all('script'):
    inner_text_with_string = str(script.string)
    if inner_text_with_string[:18] == '<!--{"queryState":':
        my_query = inner_text_with_string
        my_query = my_query.strip('><!-')
        data = json.loads(my_query)
        data = data['cat1']['searchResults']['listResults']
        print(data)
This returns about 40 listings. However, if I change "pagination":{"currentPage": 1} to "pagination":{"currentPage": 2}, it returns the same listings! It's as if the pagination parameter isn't recognized.
I believe these are the correct parameters, as I took them straight from the URL query string and used http://urlprettyprint.com/ to pretty-print it.
Any thoughts on what I'm doing wrong?

Passing this dict through the params argument of requests sends the wrong data; you can confirm this by printing the response's URL (html.url in your code). What I would do is use urllib.parse.urlencode:
from urllib.parse import urlencode
...
html = requests.get(url=f"{url}?{urlencode(params)}", headers=headers)
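
As a hedged sketch of why the original call misbehaves: requests expands a dict-valued parameter by iterating over it, so only the keys end up in the URL. One option is to serialize the nested value to a JSON string yourself before passing it. The snippet below only prepares the requests (nothing is sent); the shortened query_state dict is a stand-in for the full one in the question, and whether the site expects exactly this JSON encoding is an assumption.

```python
import json
from urllib.parse import parse_qs, urlsplit

import requests

url = 'https://www.zillow.com/torrance-ca/rentals'
# Shortened stand-in for the full query state from the question.
query_state = {"pagination": {"currentPage": 2}, "isMapVisible": False, "isListVisible": True}

# Nested dict passed directly: requests iterates over the dict and sends only its keys.
naive = requests.Request('GET', url, params={'q': query_state}).prepare()
print(naive.url)

# Serialized first: the whole structure survives as one JSON-encoded value.
fixed = requests.Request('GET', url, params={'q': json.dumps(query_state)}).prepare()
print(fixed.url)

# Decode the query string again to confirm the page number is really in there.
sent = json.loads(parse_qs(urlsplit(fixed.url).query)['q'][0])
print(sent['pagination']['currentPage'])
```

Comparing the two printed URLs makes the difference visible without hitting the site.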

Is it possible to send python requests data in format "&username=login&password=password"

I need to send data with Python requests as application/x-www-form-urlencoded. I couldn't find the answer. It must be in that format, otherwise the site won't accept it :(

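Passing a dict to the data argument of requests.post should work; requests form-encodes it and sets the Content-Type header to application/x-www-form-urlencoded:
import requests
url = 'https://example.com/login'  # placeholder: replace with the real URL
r = requests.post(url, data={'username': 'login', 'password': 'password'})
Note that the json argument sends an application/json body instead, so it is not what you want here:
r = requests.post(url, json={'username': 'login', 'password': 'password'})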
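
To check what actually goes over the wire for a form-encoded body, the request can be prepared without sending it. This is a small sketch; the URL is a placeholder and the credentials are the literal strings from the question title.

```python
import requests

# Placeholder endpoint; nothing is sent because the request is only prepared.
url = 'https://example.com/login'
credentials = {'username': 'login', 'password': 'password'}

# data= with a dict makes requests build a form-encoded body.
prepared = requests.Request('POST', url, data=credentials).prepare()
print(prepared.headers['Content-Type'])
print(prepared.body)
```

The printed body is the key=value&key=value string the question asks for, and the header is set without any manual work.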

Python - Requests pulling HTML instead of JSON

I'm building a Python web scraper (personal use) and am running into some trouble retrieving a JSON file. I was able to find the request URL I need, but when I run my script (I'm using Requests) the URL returns HTML instead of the JSON shown in the Chrome Developer Tools console. Here's my current script:
import requests
import json
url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
r = requests.get(url)
print(r.text)
Completely new to Python, so any push in the right direction is greatly appreciated. Thanks!

It looks like that website chooses the response format based on the Accept header sent with the request. So try:
import requests
import json
url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
r = requests.get(url, headers={'accept': 'application/json'})
print(r.json())
You can have a look at the full API documentation for further reference: http://docs.python-requests.org/en/latest/api/.
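
Since the same URL can return HTML or JSON depending on the Accept header, it can help to fail loudly when the server falls back to HTML. The helper below is a sketch under that assumption; get_json is a hypothetical name, not part of requests.

```python
import requests


def get_json(url, session=None):
    """Ask for JSON explicitly and raise if the server sends something else."""
    http = session or requests
    r = http.get(url, headers={'Accept': 'application/json'}, timeout=10)
    r.raise_for_status()
    content_type = r.headers.get('Content-Type', '')
    if 'json' not in content_type.lower():
        # Typically means the HTML page came back instead of the API payload.
        raise ValueError(f'expected JSON, got {content_type!r}')
    return r.json()
```

With this in place, a call such as get_json(url) either returns the parsed data or raises an error that names the unexpected content type, instead of an opaque JSON decode failure.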

Trigger data response from .aspx page

from bs4 import BeautifulSoup
from pprint import pprint
import requests
url = 'http://estadistico.ut.com.sv/OperacionDiaria.aspx'
s = requests.Session()
pagereq = s.get(url)
soup = BeautifulSoup(pagereq.content, 'lxml')
viewstategenerator = soup.find("input", attrs = {'id': '__VIEWSTATEGENERATOR'})['value']
viewstate = soup.find("input", attrs = {'id': '__VIEWSTATE'})['value']
eventvalidation = soup.find("input", attrs = {'id': '__EVENTVALIDATION'})['value']
eventtarget = 'ASPxDashboardViewer1'
DXCss = '1_33,1_4,1_9,1_5,15_2,15_4'
DXScript = '1_232,1_134,1_225,1_169,1_187,15_1,1_183,1_182,1_140,1_147,1_148,1_142,1_141,1_143,1_144,1_145,1_146,15_0,15_6,15_7'
eventargument = {"Task":"Export","ExportInfo":{"Mode":"SingleItem","GroupName":"pivotDashboardItem1","FileName":"Generación+por+tipo+de+tecnología+(MWh)","ClientState":{"clientSize":{"width":509,"height":385},"titleHeight":48,"itemsState":[{"name":"pivotDashboardItem1","headerHeight":34,"position":{"left":11,"top":146},"width":227,"height":108,"virtualSize":'null',"scroll":{"horizontal":'true',"vertical":'true'}}]},"Format":"Excel","DocumentOptions":{"paperKind":"Letter","pageLayout":"Portrait","scaleMode":"AutoFitWithinOnePage","scaleFactor":1,"autoFitPageCount":1,"showTitle":'true',"title":"Operación+Diaria","imageFormatOptions":{"format":"Png","resolution":96},"excelFormatOptions":{"format":"Csv","csvValueSeparator":","},"commonOptions":{"filterStatePresentation":"None","includeCaption":'true',"caption":"Generación+por+tipo+de+tecnología+(MWh)"},"pivotOptions":{"printHeadersOnEveryPage":'true'},"gridOptions":{"fitToPageWidth":'true',"printHeadersOnEveryPage":'true'},"chartOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"pieOptions":{"autoArrangeContent":'true'},"gaugeOptions":{"autoArrangeContent":'true'},"cardOptions":{"autoArrangeContent":'true'},"mapOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"rangeFilterOptions":{"automaticPageLayout":'true',"sizeMode":"Stretch"},"imageOptions":{},"fileName":"Generación+por+tipo+de+tecnología+(MWh)"},"ItemType":"PIVOT"},"Context":"BwAHAAIkY2NkNWRiYzItYzIwNS00MDIyLTkzZjUtYWQ0NzVhYTM5Y2E3Ag9PcGVyYWNpb25EaWFyaWECAAIAAAAAAMByQA==","RequestMarker":1,"ClientState":{}}
postdata = {'__EVENTTARGET': eventtarget,
'__EVENTARGUMENT': eventargument,
'__VIEWSTATE': viewstate,
'__VIEWSTATEGENERATOR': viewstategenerator,
'__EVENTVALIDATION': eventvalidation,
'DXScript': DXScript,
'DXCss': DXCss
}
datareq = s.post(url, data = postdata)
print(datareq.text)
I'm trying to scrape data from this .aspx webpage. The page loads the data dynamically via JavaScript, so scraping directly with requests/BeautifulSoup won't work.
By looking at the network traffic I can see that when you click the export (Exportar a) button for an element, select a type of export (Excel, CSV) and then confirm, a POST request is made to the page. It returns a base64-encoded string of the data I need. As far as I can tell there is no way to make a GET request for the file directly, as it is only generated when requested.
What I'm trying to do is copy the POST request which triggers the CSV response. So first I scrape for __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION. __EVENTTARGET, DXCss and DXScript look to be fixed. __EVENTARGUMENT is copied directly from the POST request.
My code returns a server application error. I'm thinking the problem is either a) wrong __EVENTARGUMENT (maybe part dynamic rather than fixed?), b) not really understanding how .aspx pages work or c) what I'm trying to do isn't possible with these tools.
I did look at using selenium to trigger the data export but I couldn't see a way to capture the server response.

I was able to get help from someone who knows more about aspx pages than me.
Link to the Github gist that provides the solution.
https://gist.github.com/jarek/d73c672d8dd4ddb48d80bffc4d8038ba
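
The hidden-field scraping step described in the question can be sketched generically. This is not the solution from the gist: hidden_form_fields is a hypothetical helper, and the sample HTML is a made-up stand-in for the real page, used only so the snippet runs offline.

```python
from bs4 import BeautifulSoup


def hidden_form_fields(html):
    """Return {name: value} for every hidden <input> that has a name attribute."""
    soup = BeautifulSoup(html, 'html.parser')
    fields = {}
    for tag in soup.find_all('input', attrs={'type': 'hidden'}):
        name = tag.get('name')
        if name:
            fields[name] = tag.get('value', '')
    return fields


# Tiny made-up page standing in for the real .aspx response.
sample = '''
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="CA0B0334" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="visible" value="ignored" />
</form>
'''
print(hidden_form_fields(sample))
```

Collecting every hidden input this way avoids hard-coding the three field names, which may matter if the page adds further state fields.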

python/scrapy for dynamic content

I am trying to write a Python/Scrapy script to get a list of ads from https://www.donedeal.ie/search/search?section=cars&adType=forsale&source=&sort=relevance%20desc&max=30&start=0. I'm interested in getting the URLs of individual ads. I found that the page makes an XHR POST request to https://www.donedeal.ie/search/api/v3/find/.
I tried this in the Scrapy shell to test my idea:
from scrapy.http import FormRequest
url = 'https://www.donedeal.ie/search/api/v3/find/'
payload = {'section': "cars", 'adType': "forsale", 'source': "", 'sort': "relevance desc", 'area': '', 'max': '30', 'start':'0'}
req = FormRequest(url, formdata=payload)
fetch(req)
But I get no response. In Chrome dev tools I saw that this request returns a JSON response with item IDs, which I could use to build the URLs myself.
I tried the Selenium approach as well, giving the page time to load the dynamic content, but that didn't seem to work either. Completely lost at this stage :(

The problem is with the call; the payload is almost OK.
The site you want to scrape accepts only JSON as payload, so you should change your FormRequest to something like this:
import json
from scrapy import Request
# inside a spider callback
yield Request(url, method='POST',
              body=json.dumps(payload),
              headers={'Content-Type': 'application/json'})
This is because FormRequest is meant to simulate HTML forms (the content type is set to application/x-www-form-urlencoded), not JSON calls.

I was not able to create a working example with Scrapy.
However, I did come up with two other solutions for you.
In the examples below, response contains JSON data.
Working Example #1 using urllib2 — Tested with Python 2.7.10
import urllib2
url = 'https://www.donedeal.ie/search/api/v3/find/'
payload = '{"section":"cars","adType":"forsale","source":"","sort":"relevance desc","max":30,"start":0,"area":[]}'
req = urllib2.Request(url)
req.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(req, payload).read()
Working Example #2 using requests — Tested with Python 2.7.10 and 3.3.5 and 3.5.0
import requests
url = 'https://www.donedeal.ie/search/api/v3/find/'
payload = {"section": "cars", "adType": "forsale", "source": "", "sort": "relevance desc", "max": 30, "start": 0, "area": []}
# json= serializes the dict and sets Content-Type: application/json
response = requests.post(url, json=payload).content
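
One detail worth checking with the json argument of requests is the type of the value passed in. The sketch below prepares two requests without sending them and compares the bodies; the behaviour shown is that of requests itself, while how the site reacts to each body is not verified here.

```python
import json

import requests

url = 'https://www.donedeal.ie/search/api/v3/find/'
payload = {"section": "cars", "adType": "forsale", "source": "",
           "sort": "relevance desc", "max": 30, "start": 0, "area": []}

# json= with a dict: serialized once, Content-Type set automatically.
as_dict = requests.Request('POST', url, json=payload).prepare()
print(as_dict.headers['Content-Type'])
print(type(json.loads(as_dict.body)).__name__)

# json= with an already-serialized string: it gets JSON-encoded a second time,
# so the receiver would decode a quoted string instead of an object.
as_text = requests.Request('POST', url, json=json.dumps(payload)).prepare()
print(type(json.loads(as_text.body)).__name__)
```

If the payload is already a JSON string, passing it through data= together with an explicit Content-Type header avoids the second round of encoding.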
