I'm trying to scrape some data from a stock market site, but I don't want to use Selenium. I always trace the fetch requests in the browser's network tab and replay them with the requests module.
On this site, I found the URLs which I think are responsible for fetching data to the frontend.
But those links can't be accessed directly (example: link); they throw HTTP 405. Why is that?
HTTP 405 means "Method Not Allowed": opening the link in a browser sends a GET request, which this endpoint rejects. You can access the API by making a POST request instead.
Try this:
import requests
import json

response = requests.post('https://www.cse.lk/api/marketSummery')
json.loads(response.content)
Output
{'id': 30071974,
'tradeVolume': 5829973145.4,
'shareVolume': 231803649,
'tradeDate': 1660549800496}
The other APIs work too, e.g.:
url = "https://www.cse.lk/api/dailyMarketSummery"
response = requests.post(url)
json.loads(response.content)
Output
[[{'id': 12893,
'tradeDate': 1660501800000,
'marketTurnover': 5829973000.0,
'marketTrades': 51086.0,
'marketDomestic': 50376.0,
'marketForeign': 710.0,
'equityTurnover': 5829973000.0,
'equityDomesticPurchase': 5709995000.0,
'equityDomesticSales': 5592159700.0,
'equityForeignPurchase': 119978192.0,
'equityForeignSales': 237813568.0,
'volumeOfTurnOverNumber': 231803648.0,
'volumeOfTurnoverDomestic': 226246512.0,
'volumeOfTurnoverForeign': 5320827,
'tradesNo': 51086,
...
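As an aside, the tradeDate values are Unix epoch timestamps in milliseconds; a quick sketch to turn one into a readable date:

from datetime import datetime, timezone

# tradeDate is in milliseconds, so divide by 1000 before converting
ts = 1660549800496
print(datetime.fromtimestamp(ts / 1000, tz=timezone.utc))
# 2022-08-15 07:50:00.496000+00:00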
import requests
from pprint import pp

def main(url):
    r = requests.post(url)
    pp(r.json())

main('https://www.cse.lk/api/marketSummery')
Output
{'id': 30071974,
 'tradeVolume': 5829973145.4,
 'shareVolume': 231803649,
 'tradeDate': 1660549800496}
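The same wrapper works for the other endpoints, e.g.:

main('https://www.cse.lk/api/dailyMarketSummery')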
I am quite new to Python and have been writing some test scripts to figure things out. I managed to write test scripts that log in to a webpage and scrape the data; however, the code didn't seem to work on my company's internal webpage. First of all, I was getting an SSL error, which I managed to fix.
The following code manages to scrape the login page, but I can't seem to pass the payload. I am not sure if there is a better way to do what I'm doing, because once I get past the login page the website is dynamic and mainly written in JavaScript.
import requests
import os.path
import time
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.request import urlopen
from urllib import request, parse
import ssl

url = "https://10.43.52.106/3/Authenticate"
context = ssl._create_unverified_context()

data = {'j_username': 'username',
        'j_password': 'Password',
        'j_newpassword': '',
        'j_newpasswordrepeat': '',
        'continue': '',
        'submit': 'submit form'
        }
data = parse.urlencode(data).encode()

req = request.Request(url, data=data)
response = request.urlopen("https://10.43.52.106/3/Console",
                           context=context)
print(response.read())
I appreciate any help, and I know it's going to be hard to test as it's an internal webpage.
This is also the payload that gets sent when logging in: [screenshot of the login payload, not reproduced here]
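One thing that stands out in the code above: req is built with the login data but never sent, so the urlopen call fetches the Console page without ever authenticating. A minimal sketch of the same flow using requests with a session (hedged: the field names are taken from the form data above, and verify=False mirrors the unverified SSL context):

import requests

session = requests.Session()

payload = {'j_username': 'username',
           'j_password': 'Password',
           'j_newpassword': '',
           'j_newpasswordrepeat': '',
           'continue': '',
           'submit': 'submit form'}

# verify=False mirrors ssl._create_unverified_context(); acceptable for an
# internal host with a self-signed certificate, but not for public sites.
session.post('https://10.43.52.106/3/Authenticate', data=payload, verify=False)

# The session keeps any cookies set by the login, so this request should
# run as the logged-in user.
response = session.get('https://10.43.52.106/3/Console', verify=False)
print(response.text)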
I'm building a Python web scraper (personal use) and am running into some trouble retrieving a JSON file. I was able to find the request URL I need, but when I run my script (I'm using Requests) the URL returns HTML instead of the JSON shown in the Chrome Developer Tools console. Here's my current script:
import requests
import json
url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
r = requests.get(url)
print(r.text)
Completely new to Python, so any push in the right direction is greatly appreciated. Thanks!
Looks like that website varies its response based on the Accept header provided in the request. So try:
import requests
import json
url = 'https://nytimes.wd5.myworkdayjobs.com/Video?clientRequestID=1f1a6071627946499b4b09fd0f668ef0'
r = requests.get(url, headers={'accept': 'application/json'})
print(r.json())
You can have a look at the full API for further reference: http://docs.python-requests.org/en/latest/api/.
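If you want to be defensive about it, you can confirm the server honoured the Accept header before parsing (a small sketch; requests matches header names case-insensitively):

r = requests.get(url, headers={'accept': 'application/json'})
print(r.headers.get('Content-Type'))  # should include 'application/json'
data = r.json()  # raises a ValueError if the body is not valid JSON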
from bs4 import BeautifulSoup
from pprint import pprint
import requests
url = 'http://estadistico.ut.com.sv/OperacionDiaria.aspx'
s = requests.Session()
pagereq = s.get(url)
soup = BeautifulSoup(pagereq.content, 'lxml')
viewstategenerator = soup.find("input", attrs = {'id': '__VIEWSTATEGENERATOR'})['value']
viewstate = soup.find("input", attrs = {'id': '__VIEWSTATE'})['value']
eventvalidation = soup.find("input", attrs = {'id': '__EVENTVALIDATION'})['value']
eventtarget = 'ASPxDashboardViewer1'
DXCss = '1_33,1_4,1_9,1_5,15_2,15_4'
DXScript = '1_232,1_134,1_225,1_169,1_187,15_1,1_183,1_182,1_140,1_147,1_148,1_142,1_141,1_143,1_144,1_145,1_146,15_0,15_6,15_7'
eventargument = {"Task":"Export","ExportInfo":{"Mode":"SingleItem","GroupName":"pivotDashboardItem1","FileName":"Generación+por+tipo+de+tecnología+(MWh)","ClientState":{"clientSize":{"width":509,"height":385},"titleHeight":48,"itemsState":[{"name":"pivotDashboardItem1","headerHeight":34,"position":{"left":11,"top":146},"width":227,"height":108,"virtualSize":'null',"scroll":{"horizontal":'true',"vertical":'true'}}]},"Format":"Excel","DocumentOptions":{"paperKind":"Letter","pageLayout":"Portrait","scaleMode":"AutoFitWithinOnePage","scaleFactor":1,"autoFitPageCount":1,"showTitle":'true',"title":"Operación+Diaria","imageFormatOptions":{"format":"Png","resolution":96},"excelFormatOptions":{"format":"Csv","csvValueSeparator":","},"commonOptions":{"filterStatePresentation":"None","includeCaption":'true',"caption":"Generación+por+tipo+de+tecnología+(MWh)"},"pivotOptions":{"printHeadersOnEveryPage":'true'},"gridOptions":{"fitToPageWidth":'true',"printHeadersOnEveryPage":'true'},"chartOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"pieOptions":{"autoArrangeContent":'true'},"gaugeOptions":{"autoArrangeContent":'true'},"cardOptions":{"autoArrangeContent":'true'},"mapOptions":{"automaticPageLayout":'true',"sizeMode":"Zoom"},"rangeFilterOptions":{"automaticPageLayout":'true',"sizeMode":"Stretch"},"imageOptions":{},"fileName":"Generación+por+tipo+de+tecnología+(MWh)"},"ItemType":"PIVOT"},"Context":"BwAHAAIkY2NkNWRiYzItYzIwNS00MDIyLTkzZjUtYWQ0NzVhYTM5Y2E3Ag9PcGVyYWNpb25EaWFyaWECAAIAAAAAAMByQA==","RequestMarker":1,"ClientState":{}}
postdata = {'__EVENTTARGET': eventtarget,
            '__EVENTARGUMENT': eventargument,
            '__VIEWSTATE': viewstate,
            '__VIEWSTATEGENERATOR': viewstategenerator,
            '__EVENTVALIDATION': eventvalidation,
            'DXScript': DXScript,
            'DXCss': DXCss
            }

datareq = s.post(url, data=postdata)
print(datareq.text)
I'm trying to scrape data from this .aspx webpage. The page loads the data dynamically via javascript so scraping directly with requests/BeautifulSoup won't work.
By looking at the network traffic I can see that when you click the export (Exportar a) button for an element, select a type of export (excel, csv) then confirm a POST request is made to the page. It returns a base64 encoded string of the data I need. As far as I can tell there is no way to make a GET request for the file directly as it is only generated when requested.
What I'm trying to do is copy the POST request which triggers the CSV response. So first I scrape for __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION. __EVENTTARGET, DXCss and DXScript look to be fixed. __EVENTARGUMENT is copied directly from the POST request.
My code returns a server application error. I'm thinking the problem is either a) a wrong __EVENTARGUMENT (maybe partly dynamic rather than fixed?), b) not really understanding how .aspx pages work, or c) what I'm trying to do not being possible with these tools.
I did look at using selenium to trigger the data export but I couldn't see a way to capture the server response.
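One detail worth double-checking: in the code above, __EVENTARGUMENT is passed to requests as a Python dict, so what actually gets posted is its repr() rather than the JSON string the browser sends. A minimal sketch of serializing it explicitly (the payload is trimmed here just to show the shape; build it with real Python True/None values, not the strings 'true'/'null'):

import json

eventargument = {"Task": "Export",
                 "ExportInfo": {"Mode": "SingleItem",
                                "GroupName": "pivotDashboardItem1",
                                "Format": "Excel"},
                 "RequestMarker": 1,
                 "ClientState": {}}
postdata['__EVENTARGUMENT'] = json.dumps(eventargument)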
I was able to get help from someone who knows more about .aspx pages than I do.
Here is a link to the GitHub gist that provides the solution:
https://gist.github.com/jarek/d73c672d8dd4ddb48d80bffc4d8038ba
I have another question about POST requests.
This should be almost identical to the Stack Overflow question 'Using requests.post to post multipart form data via python not working', but for some reason I can't get it to work. The website is http://www.camp.bicnirrh.res.in/predict/. I want to post a file that is already in FASTA format to this website and select the 'SVM' option, using requests in Python. This is based on what @NorthCat gave me previously, which worked like a charm:
import requests

file = {'file': open('Bishop/newdenovo2.txt', 'r').read()}
url = 'http://www.camp.bicnirrh.res.in/predict/hii.php'
payload = {"algo[]": "svm"}

session = requests.session()  # the original snippet used session without creating it
response = session.post(url, files=file, data=payload)
print(response.text)
Since it's not working, I assumed the payload was the problem. I've been playing with the payload, but I can't get any of these to work.
payload = {'S1':str(data), 'filename':'', 'algo[]':'svm'} # where I tried just reading the file in, called 'data'
payload = {'svm':'svm'} # not actually in the headers, but I tried this too
payload = {'S1': '', 'algo[]':'svm', 'B1': 'Submit'}
None of these payloads resulted in data.
Any help is appreciated. Thanks so much!
You need to set the file post variable name to "userfile", i.e.
file={'userfile':(open('Bishop/newdenovo2.txt','r').read())}
Note that the read() is unnecessary, but it doesn't prevent the file upload from succeeding. Here is some code that should work for you:
import requests

session = requests.session()
response = session.post('http://www.camp.bicnirrh.res.in/predict/hii.php',
                        files={'userfile': ('fasta.txt', open('fasta.txt'), 'text/plain')},
                        data={'algo[]': 'svm'})
response.text contains the HTML results; save it to a file and view it in your browser, or parse it with something like Beautiful Soup to extract the results.
In the request I've specified a MIME type of "text/plain" for the file. This is not necessary, but it serves as documentation and might help the receiving server.
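For example, a minimal sketch of saving and skimming the result (the output file name here is arbitrary):

from bs4 import BeautifulSoup

# Save the returned HTML so it can be opened in a browser.
with open('camp_results.html', 'w') as f:
    f.write(response.text)

# Or extract the visible text directly; which tags to target depends on
# the page structure, so this is only a starting point.
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.get_text())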
The content of my fasta.txt file is:
>24.6jsd2.Tut
GGTGTTGATCATGGCTCAGGACAAACGCTGGCGGCGTGCTTAATACATGCAAGTCGAACGGGCTACCTTCGGGTAGCTAGTGGCGGACGGGTGAGTAACACGTAGGTTTTCTGCCCAATAGTGGGGAATAACAGCTCGAAAGAGTTGCTAATACCGCATAAGCTCTCTTGCGTGGGCAGGAGAGGAAACCCCAGGAGCAATTCTGGGGGCTATAGGAGGAGCCTGCGGCGGATTAGCTAGATGGTGGGGTAAAGGCCTACCATGGCGACGATCCGTAGCTGGTCTGAGAGGACGGCCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAAGGAATATTCCACAATGGCCGAAAGCGTGATGGAGCGAAACCGCGTGCGGGAGGAAGCCTTTCGGGGTGTAAACCGCTTTTAGGGGAGATGAAACGCCACCGTAAGGTGGCTAAGACAGTACCCCCTGAATAAGCATCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGATGCAAGCGTTGTCCGGATTTACTGGGCGTAAAGCGCGCGCAGGCGGCAGGTTAAGTAAGGTGTGAAATCTCCCTGCTCAACGGGGAGGGTGCACTCCAGACTGACCAGCTAGAGGACGGTAGAGGGTGGTGGAATTGCTGGTGTAGCGGTGAAATGCGTAGAGATCAGCAGGAACACCCGTGGCGAAGGCGGCCACCTGGGCCGTACCTGACGCTGAGGCGCGAAGGCTAGGGGAGCGAACGGGATTAGATACCCCGGTAGTCCTAGCAGTAAACGATGTCCACTAGGTGTGGGGGGTTGTTGACCCCTTCCGTGCCGAAGCCAACGCATTAAGTGGACCGCCTGGGGAGTACGGTCGCAAGACTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCAGCGGAGCGTGTGGTTTAATTCGATGCGACGCGAAGAACCTTACCTGGGCTTGACATGCTATCGCAACACCCTGAAAGGGGTGCCTCCTTCGGGACGGTAGCACAGATGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTCCTTAGTTGTATATCTAAGGAGACTGCCGGAGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCAGCATGGCTCTTACGTCCAGGGCTACACATACGCTACAATGGCCGTTACAGTGAGATGCCACACCGCGAGGTGGAGCAGATCTCCAAAGGCGGCCTCAGTTCAGATTGCACTCTGCAACCCGAGTGCATGAAGTCGGAGTTGCTAGTAACCGCGTGTCAGCATAGCGCGGTGAATATGTTCCCGGGTCTTGTACACACCGCCCGTCACGTCATGGGAGCCGGCAACACTTCGAGTCCGTGAGCTAACCCCCCCTTTCGAGGGTGTGGGAGGCAGCGGCCGAGGGTGGGGCTGGTGACTGGGACGAAGTCGTAACAAGGT