Scrape XML response from AJAX request with Python

I'm trying to get the data that gets loaded into the chart on this page when hitting the max (time range) button. The data is loaded with an AJAX request.
I inspected the request and tried to reproduce it with the Python requests library, but I'm only able to retrieve the one-year data from the chart.
Here is the code I used:
import requests

r = requests.get("https://www.justetf.com/en/etf-profile.html?0-4.0-tabs-panel-chart-dates-ptl_max&groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart&_=1576272593482")
r.content
I also tried to use Session:
from requests import Session

session = Session()
session.head('http://justetf.com')
response = session.get(
    url='https://www.justetf.com/en/etf-profile.html?0-4.0-tabs-panel-chart-dates-ptl_max&groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart&_=1575929227619',
    data={
        "0-4.0-tabs-panel-chart-dates-ptl_max": "",
        "groupField": "none",
        "sortField": "ter",
        "sortOrder": "asc",
        "from": "search",
        "isin": "IE00B3VWN518",
        "tab": "chart",
        "_": "1575929227619",
    },
    headers={
        'Host': 'www.justetf.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
        'Accept': 'application/xml, text/xml, */*; q=0.01',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Wicket-Ajax': 'true',
        'Wicket-Ajax-BaseURL': 'en/etf-profile.html?0&groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart',
        'Wicket-FocusedElementId': 'id28',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive',
        'Referer': 'https://www.justetf.com/en/etf-profile.html?groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart',
        'Cookie': 'locale_=en; _ga=GA1.2.1297456970.1574289342; cookieconsent_status=dismiss; AWSALB=QMWHJxgfcpLXJLqX0i0FgBuLn+mpVHVeLRQ6upH338LdggA4/thXHT2vVWQX7pdBd1r486usZXgpAF8RpDsGJNtf6ei8e5NHTsg0hzVHR9C+Fj89AWuQ7ue+fzV2; JSESSIONID=ABB2A35B91751CA9B2D293F5A04505BE; _gid=GA1.2.1029531470.1575928527; _gat=1',
        'TE': 'Trailer'
    },
    cookies={"_ga": "GA1.2.1297456970.1574289342", "_gid": "GA1.2.1411779365.1574289342", "AWSALB": "5v+tPMgooQC0deJBlEGl2wVeUSmwVGJdydie1D6dAZSRAK5eBsmg+DQCdBj8t25YRytC5NIi0TbU3PmDcNMjiyFPTp1xKHgwNjZcDvMRePZjTxthds5DsvelzE2I", "JSESSIONID": "310F346AED94D1A345207A3489DCF83D", "locale_": "en"}
)
but I get this response:
<ajax-response><redirect><![CDATA[/en/etf-profile.html?0&groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart]]></redirect></ajax-response>
Why am I not getting the same XML response that I get in my browser when I hit MAX?

Okay, below is my solution for obtaining the data you're after:
url = "https://www.justetf.com/en/etf-profile.html"
querystring = {
# Modify this string to get the timeline you want
# Currently it is set to "max" as you can see
"0-1.0-tabs-panel-chart-dates-ptl_max":"",
"groupField":"none",
"sortField":"ter",
"sortOrder":"asc",
"from":"search",
"isin":"IE00B3VWN518",
"tab":"chart",
"_":"1576627890798"}
# Not all of these headers may be necessary
headers = {
'authority': "www.justetf.com",
'accept': "application/xml, text/xml, */*; q=0.01",
'x-requested-with': "XMLHttpRequest",
'wicket-ajax-baseurl': "en/etf-profile.html?0&groupField=none&sortField=ter&sortOrder=asc&from=search&isin=IE00B3VWN518&tab=chart",
'wicket-ajax': "true",
'wicket-focusedelementid': "id27",
'Connection': "keep-alive",
}
session = requests.Session()
# The first request won't return what we want but it sets the cookies
response = session.get( url, params=querystring)
# Cookies have been set now we can make the 2nd request and get the data we want
response = session.get( url, headers=headers, params=querystring)
print(response.text)
As a bonus, I have included a link to a repl.it where I actually parse the data and extract each individual data point. You can find it here.
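If you would rather parse the points inline, here is a minimal sketch. It assumes the ajax-response embeds the series as JavaScript pairs of a Date.UTC(...) timestamp and a float value; the regex is an assumption, so check it against the actual payload:
import re

# Hypothetical pattern for points embedded as "[Date.UTC(2019,11,13),123.45]";
# adjust it if the markup differs.
point_re = re.compile(r"Date\.UTC\((\d+),(\d+),(\d+)\)\s*,\s*([0-9.]+)")
points = [
    (int(y), int(m) + 1, int(d), float(v))  # JavaScript months are 0-based
    for y, m, d, v in point_re.findall(response.text)
]
print(points[:5])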
Let me know if that helps!

Related

How to programmatically generate new headers from URL?

I'm writing a script to pull financial data from a website every day; however, the script stops working after a few hours because the cookies expire. I need to generate new headers (I think just the x-xsrf-token and cookies) each time I run the script (once a day); otherwise I get a 401 status code.
This is the page I'm trying to pull data from:
https://www.barchart.com/futures/quotes/CLZ22/futures-prices?viewName=main&timeFrame=current
And this is the XHR response url that I'm trying to scrape:
https://www.barchart.com/proxies/core-api/v1/quotes/get?fields=symbol%2CcontractSymbol%2ClastPrice%2CpriceChange%2CopenPrice%2ChighPrice%2ClowPrice%2CpreviousPrice%2Cvolume%2CopenInterest%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&lists=futures.contractInRoot&root=CL&meta=field.shortName%2Cfield.type%2Cfield.description%2Clists.lastUpdate&hasOptions=true&page=1&limit=100&raw=1
The only way I currently know how to do that is to go to the website, copy the XHR request as cURL (bash), paste it into Postman, and then manually paste the resulting headers into my existing code.
Below is the code generated from Postman. I've been trying to figure out how to generate the headers from the URL so I don't have to run to Postman and manually refresh the headers every day.
import requests

url = "https://www.barchart.com/proxies/core-api/v1/quotes/get?fields=symbol%2CcontractSymbol%2ClastPrice%2CpriceChange%2CopenPrice%2ChighPrice%2ClowPrice%2CpreviousPrice%2Cvolume%2CopenInterest%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&lists=futures.contractInRoot&root=CL&meta=field.shortName%2Cfield.type%2Cfield.description%2Clists.lastUpdate&hasOptions=true&page=1&limit=100&raw=1"
payload = {}
headers = {
    'authority': 'www.barchart.com',
    'accept': 'application/json',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'webinar124WebinarClosed=true; market=eyJpdiI6IkovREZvUVlZMGFzM2x3b05wb3V4cGc9PSIsInZhbHVlIjoiSVJoT00rMTdWUFFYRlJiOG53OU12dTdjcUhEL3FKTW5XUy9FZFNjc1Z2VWkwdjV2RkNrWXpGZzYzMUNpK2IxbyIsIm1hYyI6IjkzNDYxOTg5OWQ5MzgxYjhlMGI4ODg4NDRlMDA1NWE3MjUxYTNmOTMzNzllYjBjYjhmNGM1ZGZiYjA0Yzk5ODEifQ%3D%3D; bcFreeUserPageView=0; laravel_token=eyJpdiI6ImR3bGlHVTY3WEhGdkdEWlBleGtkMFE9PSIsInZhbHVlIjoiWnFJMml3dis3cVN0d2VIdDRCbFQvczRmVGZxcjFYYTF4YWpBd09NSVJBVXRQYVVLdWxnaUlTM3dXTElUaUJHK1VoQkxaQkdsRHNlTzZRU3c2R3NhZzVROUYvRHM4TTQ3V2srcHZLZG9Ra3BzOUZndXhxME4rSmtYODZHTWtmN3pmOENtRGZWQmdhUEZFc0FiZ0dSV1BEbC9acTVVQnBTOUl1Y2ZleW50WVAxSmYvMTdQQVZlN0lRQ25qR1BKQWZUMU1XbE5rcW14ZTYvTkpVbkpmcXc2RVRHUmtrUHlTNithNkJiY1ZTNG1rWkl1cHkxeVRWUU9zZUE2RFhoN2VYeGFnZytPN2RBZ3VPS0tJdVQxZUw4eHB1d2FZN3JKNlJ3QmllYWx2N21nUGlFb25OYXM0aFhjbFBCS0Q0ajJTSmMiLCJtYWMiOiJkODgwMjliMzM4MGI4M2E4Njk4MmE3ODYzMDY2ZmRkYjRmN2MzZGExYThhMTliMTE2YjNiZDQ1YzkzZWMzMWQwIn0%3D; XSRF-TOKEN=eyJpdiI6InQyUll2aHRCaXFlQkZIRXV0TjdaVGc9PSIsInZhbHVlIjoiSmtZaXlTbmVrTkJNVmEyUHQrUDFZN1RWNCt5cmFSanMxcnpTTW8vTjdrTU1RVlZQWktXNnhtakJjeVJ6Y0h3cFpkaWl4UnBvS28vTHNCUzNsM0ZRcXN2ZG9tWnFLTUVwdUZHY2VhNmxSRFg0ajhXU0lobFRZaFZRanhHZis4STkiLCJtYWMiOiI2MDc3NjIzNTAwMmY5MjlkNjRkMTVkYTZjYmNiM2RiNjg4ZDI1MmUzZWEzYjc1NWY0ZDNiZGNjNzY0ZGY2NGY5In0%3D; laravel_session=eyJpdiI6IlhPVGVDbTVURlpWRDcvNWVMWUgxclE9PSIsInZhbHVlIjoiVEJvTUVIVkRHOFlQUXNKcUJRaGtmZ2U4aVcrbE9JNDV3bG1adG1DLzVpSzI5Z0lqYlk2NU5TQkE5ZTAzMHZPL1VoVjJlZU9kSkYvT1VERFBsK1BnRUVzaGMzVlNiRFFTQzFPblEyMUFXSjM3dmdRQXhnTXFSaVYwSkNkZ3ZJS3UiLCJtYWMiOiJjMzUzMzAyMjEzYzYwZGZmM2M3OTMwMGE0OGM3NTJmM2M3MzhkNDUyYjE2OTI4Njg5ODQxNDM3NjcyMzM0ZWE5In0%3D',
    'referer': 'https://www.barchart.com/futures/quotes/CLZ22/futures-prices?viewName=main&timeFrame=current',
    'sec-ch-ua': '"Chromium";v="106", "Google Chrome";v="106", "Not;A=Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    'x-xsrf-token': 'eyJpdiI6InQyUll2aHRCaXFlQkZIRXV0TjdaVGc9PSIsInZhbHVlIjoiSmtZaXlTbmVrTkJNVmEyUHQrUDFZN1RWNCt5cmFSanMxcnpTTW8vTjdrTU1RVlZQWktXNnhtakJjeVJ6Y0h3cFpkaWl4UnBvS28vTHNCUzNsM0ZRcXN2ZG9tWnFLTUVwdUZHY2VhNmxSRFg0ajhXU0lobFRZaFZRanhHZis4STkiLCJtYWMiOiI2MDc3NjIzNTAwMmY5MjlkNjRkMTVkYTZjYmNiM2RiNjg4ZDI1MmUzZWEzYjc1NWY0ZDNiZGNjNzY0ZGY2NGY5In0='
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
Is there a way to generate these headers in Python from the XHR response URL that I can then use when sending my GET request?
The cookies are the bottleneck. You first have to fetch them and then pass them along with the request:
import requests
from urllib.parse import unquote

ua_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0"
}
params = {
    'fields': 'symbol,contractSymbol,lastPrice,priceChange,openPrice,highPrice,lowPrice,previousPrice,volume,openInterest,tradeTime,symbolCode,symbolType,hasOptions',
    'lists': 'futures.contractInRoot',
    'root': 'CL',
    'meta': 'field.shortName,field.type,field.description,lists.lastUpdate',
    'hasOptions': 'true',
    'page': '1',
    'limit': '100',
    'raw': '1',
}
with requests.Session() as s:
    # get cookies
    s.get("https://www.barchart.com/options/iv-rank-percentile/stocks", headers=ua_headers)
    # use one cookie as an HTTP header
    ua_headers["X-XSRF-TOKEN"] = unquote(s.cookies["XSRF-TOKEN"])
    response = s.get('https://www.barchart.com/proxies/core-api/v1/quotes/get', params=params, headers=ua_headers)
    print(response.json())
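Since the script runs once a day, the cookie/token refresh can be wrapped in a small helper called at the start of every run; a minimal sketch (fresh_session is my name, not a library function, and the warm-up URL is an assumption: any barchart.com page that sets the XSRF-TOKEN cookie should work):
import requests
from urllib.parse import unquote

def fresh_session():
    """Return a requests.Session with current cookies and the XSRF header set."""
    s = requests.Session()
    s.headers["User-Agent"] = (
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0"
    )
    # Warm-up request: the server sets the XSRF-TOKEN and laravel_session cookies here
    s.get("https://www.barchart.com/options/iv-rank-percentile/stocks")
    # Mirror the XSRF cookie into the header the API checks
    s.headers["X-XSRF-TOKEN"] = unquote(s.cookies["XSRF-TOKEN"])
    return s

# Usage: call once per run, then reuse the session for every request that day.
# s = fresh_session()
# data = s.get("https://www.barchart.com/proxies/core-api/v1/quotes/get", params=params).json()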

How to fix a 403 error on a POST request (Python + requests, code included)?

I am trying to log in to the site using a POST request, as I need to be logged in to use it. I pass all the information from the Chrome request headers in the header and everything from the form data in the data, but as a result I get a 403 error. Can anyone help? Unfortunately, I can't share the website.
(Where test appears, I have a real site specified.)
(Screenshots of the request headers and form data omitted.)
import requests

url = 'test.ru'
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Length': '82',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'cookie',
    'Host': 'test',
    'Origin': 'test.ru',
    'Referer': 'test.ru/sign',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36',
}
data = {
    'username': 'admin',
    'password': 'q1',
    'remember-me': '1',
    '_csrf': '6a973144-4dec-42-ab-9fda-f62f6054fe68',
}
s = requests.Session()
s.post(url, data=data, headers=header)
*The screenshots are fresh and the code was written yesterday, so some data may vary, but when testing the code I took the current values from the browser.
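A likely cause of the 403 is the hardcoded _csrf value going stale between sessions. Below is a minimal sketch of fetching a fresh token from the sign-in page before posting; the hidden-input name _csrf and the /sign path are assumptions based on the form data above:
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

s = requests.Session()
# Load the sign-in page first so the session picks up fresh cookies.
login_page = s.get('https://test.ru/sign')
# Assumption: the page embeds the token as <input type="hidden" name="_csrf" value="...">
soup = BeautifulSoup(login_page.text, 'html.parser')
csrf_token = soup.find('input', {'name': '_csrf'})['value']
data = {
    'username': 'admin',
    'password': 'q1',
    'remember-me': '1',
    '_csrf': csrf_token,  # fresh token instead of a hardcoded one
}
r = s.post('https://test.ru/sign', data=data)
print(r.status_code)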

Invalid URL when using Python Requests

I am trying to access the API that returns programme data on this page when you scroll down and new tiles are displayed on the screen. Looking in Chrome DevTools I found the API being called and put together the following requests script:
import requests
import cloudscraper

session = requests.session()
url = 'https://ie.api.atom.nowtv.com/adapter-atlas/v3/query/node?slug=/entertainment/collections/all-entertainment&represent=(items[take=60](items(items[select_list=iceberg])))'
session.headers = {
    'Host': 'https://www.nowtv.com',
    'Connection': 'keep-alive',
    'Accept': 'application/json, text/javascript, */*',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Referer': 'https://www.nowtv.com',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'
}
scraper = cloudscraper.create_scraper(sess=session)
r = scraper.get(url)
data = r.content
print(data)
session.close()
This is returning the following only:
b'<HTML><HEAD>\n<TITLE>Invalid URL</TITLE>\n</HEAD><BODY>\n<H1>Invalid URL</H1>\nThe requested URL "[no URL]", is invalid.<p>\nReference #9.3c0f0317.1608324989.5902cff\n</BODY></HTML>\n'
I assume the issue is the bracketed expression at the end of the URL. I am not sure, however, how to handle it in a requests call. Can anyone provide the correct syntax?
Thanks
The issue is the Host session header value; don't set it.
That alone should be enough, but I've done some additional things as well:
add the X-* headers:
session.headers.update(**{
    'X-SkyOTT-Proposition': 'NOWTV',
    'X-SkyOTT-Language': 'en',
    'X-SkyOTT-Platform': 'PC',
    'X-SkyOTT-Territory': 'GB',
    'X-SkyOTT-Device': 'COMPUTER'
})
then visit the main page without the XHR header set and with a broader Accept header value:
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
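A sketch of that warm-up request, reusing the Accept value quoted above:
# Visit the main page once (no X-Requested-With header) so the session is established
session.get('https://www.nowtv.com/', headers={
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
})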
I've also used params for the GET parameters. You don't have to, but it's cleaner:
In [33]: url = 'https://ie.api.atom.nowtv.com/adapter-atlas/v3/query/node'
In [34]: response = session.get(url, params={
             'slug': '/entertainment/collections/all-entertainment',
             'represent': '(items[take=60,skip=2340](items(items[select_list=iceberg])))'
         }, headers={
             'Accept': 'application/json, text/plain, */*',
             'X-Requested-With': 'XMLHttpRequest'
         })
In [35]: response
Out[35]: <Response [200]>
In [36]: response.text
Out[36]: '{"links":{"self":"/adapter-atlas/v3/query/node/e5b0e516-2b84-11e9-b860-83982be1b6a6"},"id":"e5b0e516-2b84-11e9-b860-83982be1b6a6","type":"CATALOGUE/COLLECTION","segmentId":"","segmentName":"default","childTypes":{"next_items":{"nodeTypes":["ASSET/PROGRAMME","CATALOGUE/SERIES"],"count":68},"items":{"nodeTypes":["ASSET/PROGRAMME","CATALOGUE/SERIES"],"count":2376},"curation-config":{"nodeTypes":["CATALOGUE/CURATIONCONFIG"],"count":1}},"attributes":{"childNodeTyp
...

Logging into website and scraping data

The website I am trying to log in to is https://realitysportsonline.com/RSOLanding.aspx. I can't seem to get the login to work, since the process is a little different from a typical site with a dedicated login page. I don't get any errors, but the login action doesn't work, which then causes the request for the main page to redirect to the homepage.
import requests

url = "https://realitysportsonline.com/RSOLanding.aspx"
main = "https://realitysportsonline.com/SetLineup_Contracts.aspx?leagueId=3000&viewingTeam=1"
data = {
    "username": "",
    "password": "",
    "vc_btn3 vc_btn3-size-md vc_btn3-shape-rounded vc_btn3-style-3d vc_btn3-color-danger": "Log In",
}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Referer': 'https://realitysportsonline.com/RSOLanding.aspx',
    'Host': 'realitysportsonline.com',
    'Connection': 'keep-alive',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}
s = requests.session()
s.get(url)
r = s.post(url, data, headers=header)
page = requests.get(main)
First of all, you create a session, but then, assuming your POST request worked, you request an authorised page without using the session you created.
You need to make the request with the s object you created like so:
page = s.get(main)
However, there were also a few issues with your POST request. You were making the request to the home page instead of the /Login route, and you were missing the Content-Type header.
import requests

url = "https://realitysportsonline.com/Services/AccountService.svc/Login"
main = "https://realitysportsonline.com/LeagueSetup.aspx?create=true"
payload = {"username": "", "password": ""}
headers = {
    'Content-Type': "text/json",
    'Cache-Control': "no-cache"
}
s = requests.session()
response = s.post(url, json=payload, headers=headers)
page = s.get(main)
PS: your main request URL redirects to the homepage, even with a valid session (at least for me).

requests.post to hastebin.com not working

I'm trying to use the requests module from Python to make a post on http://hastebin.com/
but I've been failing and don't know what to do anymore. Is there any way I can actually make a post on the site? Here is my current code:
import requests

payload = "s2345"
headers = {
    'Host': 'hastebin.com',
    'Connection': 'keep-alive',
    'Content-Length': '5',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Origin': 'http://hastebin.com',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36',
    'Content-Type': 'application/json; charset=UTF-8',
    'Referer': 'http://hastebin.com/',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8'
}
req = requests.post('http://hastebin.com/', headers=headers, params=payload)
print(req.json())
Looking over the provided haste client code, the server expects a raw POST of the file, without a specific content type. The client also posts to the /documents path, not the root URL.
They are also not picky about headers; just leave those all to requests to set. The following works for me and creates a new document on the site:
import requests

payload = "s2345"
response = requests.post('http://hastebin.com/documents', data=payload)
if response.status_code == 200:
    print(response.json()['key'])
Note that I used data here, not the params option, which sets the URL query parameters.
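To turn the returned key into a shareable link, append it to the site root; a small sketch, assuming documents are served at /<key>:
# Build the document URL from the key returned by /documents
key = response.json()['key']
print('http://hastebin.com/' + key)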
