How to programmatically generate new headers from URL? - python

I'm writing a script to pull financial data from a website everyday however the script stops working after a few hours because the cookies expire. I need to generate new headers (I think just the x-xsrf-token and cookies) each time I run the script (once a day), otherwise I get a 401 status code.
This is the page I'm trying to pull data from:
https://www.barchart.com/futures/quotes/CLZ22/futures-prices?viewName=main&timeFrame=current
And this is the XHR response url that I'm trying to scrape:
https://www.barchart.com/proxies/core-api/v1/quotes/get?fields=symbol%2CcontractSymbol%2ClastPrice%2CpriceChange%2CopenPrice%2ChighPrice%2ClowPrice%2CpreviousPrice%2Cvolume%2CopenInterest%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&lists=futures.contractInRoot&root=CL&meta=field.shortName%2Cfield.type%2Cfield.description%2Clists.lastUpdate&hasOptions=true&page=1&limit=100&raw=1
The only way I know how to currently do that is to go to the website and copy the XHR request as cURL (bash) then paste this into Postman and manually paste these headers into my existing code.
Below is the code generated from Postman. I've been trying to figure out how to generate the headers from the URL so I don't have to run to Postman and manually refresh the headers everyday.
import requests
url = "https://www.barchart.com/proxies/core-api/v1/quotes/get?fields=symbol%2CcontractSymbol%2ClastPrice%2CpriceChange%2CopenPrice%2ChighPrice%2ClowPrice%2CpreviousPrice%2Cvolume%2CopenInterest%2CtradeTime%2CsymbolCode%2CsymbolType%2ChasOptions&lists=futures.contractInRoot&root=CL&meta=field.shortName%2Cfield.type%2Cfield.description%2Clists.lastUpdate&hasOptions=true&page=1&limit=100&raw=1"
payload={}
headers = {
'authority': 'www.barchart.com',
'accept': 'application/json',
'accept-language': 'en-US,en;q=0.9',
'cookie': 'webinar124WebinarClosed=true; market=eyJpdiI6IkovREZvUVlZMGFzM2x3b05wb3V4cGc9PSIsInZhbHVlIjoiSVJoT00rMTdWUFFYRlJiOG53OU12dTdjcUhEL3FKTW5XUy9FZFNjc1Z2VWkwdjV2RkNrWXpGZzYzMUNpK2IxbyIsIm1hYyI6IjkzNDYxOTg5OWQ5MzgxYjhlMGI4ODg4NDRlMDA1NWE3MjUxYTNmOTMzNzllYjBjYjhmNGM1ZGZiYjA0Yzk5ODEifQ%3D%3D; bcFreeUserPageView=0; laravel_token=eyJpdiI6ImR3bGlHVTY3WEhGdkdEWlBleGtkMFE9PSIsInZhbHVlIjoiWnFJMml3dis3cVN0d2VIdDRCbFQvczRmVGZxcjFYYTF4YWpBd09NSVJBVXRQYVVLdWxnaUlTM3dXTElUaUJHK1VoQkxaQkdsRHNlTzZRU3c2R3NhZzVROUYvRHM4TTQ3V2srcHZLZG9Ra3BzOUZndXhxME4rSmtYODZHTWtmN3pmOENtRGZWQmdhUEZFc0FiZ0dSV1BEbC9acTVVQnBTOUl1Y2ZleW50WVAxSmYvMTdQQVZlN0lRQ25qR1BKQWZUMU1XbE5rcW14ZTYvTkpVbkpmcXc2RVRHUmtrUHlTNithNkJiY1ZTNG1rWkl1cHkxeVRWUU9zZUE2RFhoN2VYeGFnZytPN2RBZ3VPS0tJdVQxZUw4eHB1d2FZN3JKNlJ3QmllYWx2N21nUGlFb25OYXM0aFhjbFBCS0Q0ajJTSmMiLCJtYWMiOiJkODgwMjliMzM4MGI4M2E4Njk4MmE3ODYzMDY2ZmRkYjRmN2MzZGExYThhMTliMTE2YjNiZDQ1YzkzZWMzMWQwIn0%3D; XSRF-TOKEN=eyJpdiI6InQyUll2aHRCaXFlQkZIRXV0TjdaVGc9PSIsInZhbHVlIjoiSmtZaXlTbmVrTkJNVmEyUHQrUDFZN1RWNCt5cmFSanMxcnpTTW8vTjdrTU1RVlZQWktXNnhtakJjeVJ6Y0h3cFpkaWl4UnBvS28vTHNCUzNsM0ZRcXN2ZG9tWnFLTUVwdUZHY2VhNmxSRFg0ajhXU0lobFRZaFZRanhHZis4STkiLCJtYWMiOiI2MDc3NjIzNTAwMmY5MjlkNjRkMTVkYTZjYmNiM2RiNjg4ZDI1MmUzZWEzYjc1NWY0ZDNiZGNjNzY0ZGY2NGY5In0%3D; laravel_session=eyJpdiI6IlhPVGVDbTVURlpWRDcvNWVMWUgxclE9PSIsInZhbHVlIjoiVEJvTUVIVkRHOFlQUXNKcUJRaGtmZ2U4aVcrbE9JNDV3bG1adG1DLzVpSzI5Z0lqYlk2NU5TQkE5ZTAzMHZPL1VoVjJlZU9kSkYvT1VERFBsK1BnRUVzaGMzVlNiRFFTQzFPblEyMUFXSjM3dmdRQXhnTXFSaVYwSkNkZ3ZJS3UiLCJtYWMiOiJjMzUzMzAyMjEzYzYwZGZmM2M3OTMwMGE0OGM3NTJmM2M3MzhkNDUyYjE2OTI4Njg5ODQxNDM3NjcyMzM0ZWE5In0%3D',
'referer': 'https://www.barchart.com/futures/quotes/CLZ22/futures-prices?viewName=main&timeFrame=current',
'sec-ch-ua': '"Chromium";v="106", "Google Chrome";v="106", "Not;A=Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
'x-xsrf-token': 'eyJpdiI6InQyUll2aHRCaXFlQkZIRXV0TjdaVGc9PSIsInZhbHVlIjoiSmtZaXlTbmVrTkJNVmEyUHQrUDFZN1RWNCt5cmFSanMxcnpTTW8vTjdrTU1RVlZQWktXNnhtakJjeVJ6Y0h3cFpkaWl4UnBvS28vTHNCUzNsM0ZRcXN2ZG9tWnFLTUVwdUZHY2VhNmxSRFg0ajhXU0lobFRZaFZRanhHZis4STkiLCJtYWMiOiI2MDc3NjIzNTAwMmY5MjlkNjRkMTVkYTZjYmNiM2RiNjg4ZDI1MmUzZWEzYjc1NWY0ZDNiZGNjNzY0ZGY2NGY5In0='
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
Is there a way to generate these headers in Python from xhr response url that I can then use when sending my GET request?

The cookies are the bottleneck. You first have to fetch them and then pass them along with the request:
import requests
from urllib.parse import unquote
ua_headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0"
}
params = {
'fields': 'symbol,contractSymbol,lastPrice,priceChange,openPrice,highPrice,lowPrice,previousPrice,volume,openInterest,tradeTime,symbolCode,symbolType,hasOptions',
'lists': 'futures.contractInRoot',
'root': 'CL',
'meta': 'field.shortName,field.type,field.description,lists.lastUpdate',
'hasOptions': 'true',
'page': '1',
'limit': '100',
'raw': '1',
}
with requests.Session() as s:
# get cookies
s.get("https://www.barchart.com/options/iv-rank-percentile/stocks", headers=ua_headers)
# use one cookie as HTTP header
headers["X-XSRF-TOKEN"] = unquote(s.cookies["XSRF-TOKEN"])
response = s.get('https://www.barchart.com/proxies/core-api/v1/quotes/get', params=params, headers=headers)
print(response.json())

Related

Submitting POST Request with Image

I'm trying to do my first POST request (on TinEye) that involves uploading an image. I'm trying to piece together bits from these answers: Python POST Request with an Image , How to post image using requests? , Sending images by POST using python requests , and Sending image over POST request with Python Requests , but I'm still missing something.
The headers of the request looks like this:
headers1:
headers2:
(...not sure what identifying info, if any, there are in there so I've blocked them just in case)
And the payload looks like this:
payload:
So, with all this info, what I've attempted so far looks like this:
import requests
import random,string
# pip install requests_toolbelt
from requests_toolbelt import MultipartEncoder
image_filename = "2015_Aston_Martin_DB9_GT_(19839443910).jpg" # Change this to another filename
imported_image = open(image_filename, 'rb')
def submit_image_post_request(image):
# Create a get request to get the initial cookies
cookies = requests.get("https://tineye.com/").cookies
# Generate a WebKitFormBoundary
boundary = '----WebKitFormBoundary' + ''.join(random.sample(string.ascii_letters + string.digits, 16))
# Generate the headers
headers = {
'authority': 'tineye.com',
'accept': 'application/json, text/plain, */*',
'accept-language': 'en-US,en;q=0.9',
'content-type': 'multipart/form-data; boundary=' + boundary,
'origin': 'https://tineye.com',
'referer': 'https://tineye.com/search/c8570370e2b2338dc656c8cefe221655b8a0ca17?sort=score&order=desc&page=1',
'sec-ch-ua': '"Chromium";v="104", " Not A;Brand";v="99", "Google Chrome";v="104"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
}
# Give the params
params = {
'sort': 'score',
'order': 'desc',
}
# Now comes the experimenting.
# Define a 'files' variable for the request using the opened image file
files = {
'image': image
}
# Try to recreate the "fields" of the form/request
fields = {
'file': (image_filename, image, "image/jpeg"),
# 'file_id': "0"
# "Content-Disposition": 'form-data; name="image"; filename=image_filename'
}
# Generate a MultipartEncoder using the fields and same boundary as in the headers
m = MultipartEncoder(fields=fields, boundary=boundary)
# Send the request
response = requests.post('https://tineye.com/result_json/', params=params, headers=headers, files=files, cookies=cookies, data=m)
return response
response = submit_image_post_request(imported_image)
It's not working obviously, I get a 400 response currently, and it's because of the last little bit of the function, as I'm not quite sure how to recreate the request. Looking to get some guidance on it.
I found an article that showed how to copy the request as a curl from Chrome, import it into Postman, and then export the corresponding python request from Postman, which I have done below as an updated attempt at got the 200 response code. Woohoo!
def search_image(self, image):
url = "https://tineye.com/result_json/"
cookies = requests.get(url).cookies
payload={}
files=[
('image',('file', image,'application/octet-stream'))
]
headers = {
'authority': 'tineye.com',
'accept': 'application/json, text/plain, */*',
'accept-language': 'en-US,en;q=0.9',
# 'cookie': '_ga=GA1.2.1487505347.1661754780; sort=score; order=desc; _gid=GA1.2.613122987.1662166051; __cf_bm=VYpWBFxDJVgFr_e6N_51uElQ4P0qmZtysVNuPdG4MU4-1662166051-0-AQ3g7/Ygshplz8dghxLlCTA8TBrR0b+YXr9kOMfagi18Ypry9kWkDQELjUXOGpClZgoX/BjZExzf+3r6aL8ytCau2kM8z5u3sFanPVaA39wOni+AMGy69RFrGBP8om+naQ==; tineye=fz1Bqk4sJOQqVaf4XCHM59qTFw8LSS6aLP3fQQoIYLyVWIsQR_-XpM-E6-L5GXQ8eex1ia7GI0-ffA57yuR-ll0nfPeAPkDzqdp1Uw; _gat_gtag_UA_2430070_8=1',
'origin': 'https://tineye.com',
'referer': 'https://tineye.com/search',
'sec-ch-ua': '"Chromium";v="104", " Not A;Brand";v="99", "Google Chrome";v="104"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
}
response = requests.post(url, headers=headers, data=payload, files=files, cookies=cookies, timeout=60)
return response

Python Requests Does Not get website that opens on browser

I have been trying to access this website https://www.dickssportinggoods.com/f/tents-accessories with requests module but it just keeps processing and does not stop while the same website works fine on browser. Scrappy gives a time out error for the same website. Is there something that should be taken into account while accessing websites like these. Thanks
For sites like these you can try to add the extra headers that your browser does. Following these steps worked for me -
Open the link in incognito window with the network tab open.
Copy the first request made by right clicking -> copy -> copy as curl
Go to https://curl.trillworks.com/. Paste the curl command to get the equivalent python requests code.
Now try removing headers one by one until it works with the minimal headers.
Image for reference - https://i.stack.imgur.com/vRS98.png
Edit -
import requests
headers = {
'authority': 'www.dickssportinggoods.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'sec-ch-ua-mobile': '?0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
}
response = requests.get('https://www.dickssportinggoods.com/f/tents-accessories', headers=headers)
print(response.text)
Have you tried adding headers?
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.dickssportinggoods.com/f/tents-accessories', headers=headers)
response.raise_for_status()
print(response.text)
So Thanks to #Marcel and #Sonal but appart from headers, it just worked when i put the statement in a try/except block.
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0\
Win64\
x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}
session = requests.Session()
try:
r = session.get(
link, headers=headers, stream=True)
return r
except requests.exceptions.ConnectionError:
r.status_code = "Connection refused"

Invalid URL when using Python Requests

I am trying to access the API returning program data at this page when you scroll down and new tiles are displayed on the screen. Looking in Chrome Tools I have found the API being called and put together the following Requests script:
import requests
session = requests.session()
url = 'https://ie.api.atom.nowtv.com/adapter-atlas/v3/query/node?slug=/entertainment/collections/all-entertainment&represent=(items[take=60](items(items[select_list=iceberg])))'
session.headers = {
'Host': 'https://www.nowtv.com',
'Connection': 'keep-alive',
'Accept': 'application/json, text/javascript, */*',
'X-Requested-With': 'XMLHttpRequest',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
'Referer': 'https://www.nowtv.com',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'
}
scraper = cloudscraper.create_scraper(sess=session)
r = scraper.get(url)
data = r.content
print(data)
session.close()
This is returning the following only:
b'<HTML><HEAD>\n<TITLE>Invalid URL</TITLE>\n</HEAD><BODY>\n<H1>Invalid URL</H1>\nThe requested URL "[no URL]", is invalid.<p>\nReference #9.3c0f0317.1608324989.5902cff\n</BODY></HTML>\n'
I assume the issue is the part at the end of the URL that is in curly brackets. I am not sure however how to handle these in a Requests call. Can anyone provide the correct syntax?
Thanks
The issue is the Host session header value, don't set it.
That should be enough. But I've done some additional things as well:
add the X-* headers:
session.headers.update(**{
'X-SkyOTT-Proposition': 'NOWTV',
'X-SkyOTT-Language': 'en',
'X-SkyOTT-Platform': 'PC',
'X-SkyOTT-Territory': 'GB',
'X-SkyOTT-Device': 'COMPUTER'
})
visit the main page without XHR header set and with a broader Accept header value:
text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
I've also used params for the GET parameters - you don't have to do it, I think. It's just cleaner:
In [33]: url = 'https://ie.api.atom.nowtv.com/adapter-atlas/v3/query/node'
In [34]: response = session.get(url, params={
'slug': '/entertainment/collections/all-entertainment',
'represent': '(items[take=60,skip=2340](items(items[select_list=iceberg])))'
}, headers={
'Accept': 'application/json, text/plain, */*',
'X-Requested-With':'XMLHttpRequest'
})
In [35]: response
Out[35]: <Response [200]>
In [36]: response.text
Out[36]: '{"links":{"self":"/adapter-atlas/v3/query/node/e5b0e516-2b84-11e9-b860-83982be1b6a6"},"id":"e5b0e516-2b84-11e9-b860-83982be1b6a6","type":"CATALOGUE/COLLECTION","segmentId":"","segmentName":"default","childTypes":{"next_items":{"nodeTypes":["ASSET/PROGRAMME","CATALOGUE/SERIES"],"count":68},"items":{"nodeTypes":["ASSET/PROGRAMME","CATALOGUE/SERIES"],"count":2376},"curation-config":{"nodeTypes":["CATALOGUE/CURATIONCONFIG"],"count":1}},"attributes":{"childNodeTyp
...

Unable to log into website using requests.post despite matching network form data

I am unable to login to a website using requests and fetch the API data behind an account. The requests payload data matches the form data used for normally logging in.
My code is as follows:
urlpage = 'https://speechanddebate.org/login'
header = {'User-Agent': 'Chrome/84.0.4147.89'}
payload = {'log': "email#gmail.com",
'pwd': "password",
'wp-submit': 'Log In',
'rememberme': 'forever',
'redirect_to': '/account',
'testcookie': '1'}
session = requests.Session()
test = session.post(urlpage, headers = header, data = payload)
I used inspect element to find what data is sent via POST when I log in normally rather than through webscraping and it gives this result when I check under networking:
I am not sure what I am doing differently compared to the other StackOverFlow answers out there. Here's a list of code modifications I've tried to make:
Without sessions and just doing a normal request
Making the data URL encoded
Changing it and having a with requests.Session() as session: block instead of just
session = requests.Session()
And tried POST with headers and without headers etc.
When I login normally I get the status code 302 indicating that the login was successful and I've been transferred to another web page. However, when I do it through webscraping, it fails to login and returns status code 200 and returns it back to the login page.
Try
headers = {
'authority': 'www.speechanddebate.org',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
'origin': 'https://www.speechanddebate.org',
'content-type': 'application/x-www-form-urlencoded',
'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'referer': 'https://www.speechanddebate.org/login/',
'accept-language': 'en-US,en;q=0.9',
}

How to get right JSON responce from POST request from PYTHON script

Task is to get JSON responce from POST request from particular website.
Everything works fine in browser as follows. You may simulate the case yourself tryin to start enter text into Start Location field.
webaddress to check: https://www.hapag-lloyd.com/en/online-business/schedules/interactive-schedule.html
Chrome Dev Tool Screen 1 - Request URL and Header
Chrome Dev Tool Screen 2 - POST data
JSON RESPONCE (it must be like this)
{"rows":[{"LOCATION_COUNTRYABBREV":"GE","LOCATION_BUSINESSPOSTALCODE":"","LOCATION_BUSINESSLOCATIONNAME":"BATUMI","LOCATION_BUSINESSLOCODE":"GEBUS","STANDARDLOCATION_BUSINESSLOCODE":"GEBUS","LOCATION_PORTTYPE":"S","DISPLAYNAME":""}]}
My code as follows:
import requests
url = 'https://www.hapag-lloyd.com/en/online-business/schedules/interactive-schedule.html?_sschedules_interactive=_raction&action=getTypeAheadService'
POST_QUERY = 'batumi'
params = {
'query': POST_QUERY,
'reportname': 'FRTA0101',
'callConfiguration': "[resultLines=10,readDef1=location_businessLocationName STARTSWITH,readDef2=location_businessLocode STARTSWITH,readClause1=location_businessLocode<>'' AND location_portType='S' AND stdSubLocation_string10='STD',readClause2=location_businessLocode<>'' AND location_portType<>'S' AND stdSubLocation_string10='STD',readClause3=location_businessLocode<>'' AND location_portType='S' AND stdSubLocation_string10='SUB',readClause4=location_businessLocode<>'' AND stdSubLocation_string10='SUB',readClause5=location_businessLocode='' AND stdSubLocation_string10='SUB',sortDef1=location_businessLocationName ASC,resultAttr1=location_businessLocationName,resultAttr2=location_businessLocode,resultAttr3=location_businessPostalCode,resultAttr4=standardLocation_businessLocode,resultAttr5=location_countryAbbrev,resultAttr6=location_portType]"
}
headers = {
"Accept": "*/*",
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-EN,en;q=0.9,en-US;q=0.8,en;q=0.7',
'Cache-Control': 'no-cache',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'DNT': '1',
'Host': 'www.hapag-lloyd.com',
'Origin': 'https://www.hapag-lloyd.com',
'Pragma': 'no-cache',
# 'Proxy-Connection': 'keep-alive',
'Referer': 'https://www.hapag-lloyd.com/en/online-business/schedules/interactive-schedule.html',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
print('Testing location: ', POST_QUERY)
var_cities = requests.post(url,data=params,headers=headers)
print(var_cities.content) #it does print some %$#%$
Python Print Content Screen
My question is "How to get right JSON responce from POST request from PYTHON script"?
I think using BeautifulSoup is a better option.
Try this
Python Convert HTML into JSON using Soup
print(var_cities.text)
This returns the html as string. Is this what you expected to get as a response? And to convert this into json, look at the answer above...

Categories