I understand what the Location header does in HTTP.
When I access a site with Chrome, I can see Location in the response headers.
However, when I access it with Python requests, I can't get that info:
import requests
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7,uk;q=0.6,en-GB;q=0.5',
}
response = requests.get('https://ec.ef.com.cn/partner/englishcenters', headers=headers)
response.headers
Does this matter for Scrapy? How do I get that info? I'm guessing it might be a flag the site could use for anti-scraping.
What you see in your screenshot is a response with HTTP status code 302, which usually makes clients (including Python requests) automatically redirect to another URL, specified in the Location header.
If you enter the URL you shared (https://ec.ef.com.cn/partner/englishcenters) in your browser, you'll see that you get redirected to some other URL. The same behaviour can be observed in your Python code: print response.url and it should show the URL you've been redirected to.
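You can observe this redirect behaviour without hitting the real site. Here is a self-contained sketch using a throwaway local server (the /old and /new paths are made up for the demo, they are not part of the site in question):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/old":
            # Reply with 302 and a Location header, like the site does.
            self.send_response(302)
            self.send_header("Location", "/new")
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"final page")

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# With redirects disabled you can see the 302 and its Location header:
raw = requests.get(f"{base}/old", allow_redirects=False)
print(raw.status_code, raw.headers["Location"])  # 302 /new

# By default requests follows the redirect; the 302 ends up in .history:
followed = requests.get(f"{base}/old")
print(followed.status_code, followed.url)

server.shutdown()
```

So the Location header is still there; by default requests simply consumes it, follows the redirect, and hands you the final response.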
Apologies if this is a bit website-specific (barchart.com). I used the guidance provided here for properly connecting to and scraping barchart.com for futures data. However, after hours of trying, I am at a loss as to how to pull off the same trick for their pre-market data table: Barchart_Premarket_Site.
Does anyone know the trick to get the pre-market data?
Here is the basic connection, for which I get a 403:
import requests
geturl = r'https://www.barchart.com/stocks/pre-market-trading/volume-advances?orderBy=preMarketVolume&orderDir=desc'
s = requests.Session()
r = s.get(geturl)
# j = r.json()
print(r)
All that was required was to add more headers to the request. You can find your own headers using Chrome's developer tools: find the API request that loads the table and copy in a few of the headers associated with that request.
import requests
request_url = "https://www.barchart.com/proxies/core-api/v1/quotes/get?lists=stocks.us.premarket.volume_advances&orderDir=desc&fields=symbol%2CsymbolName%2CpreMarketLastPrice%2CpreMarketPriceChange%2CpreMarketPercentChange%2CpreMarketVolume%2CpreMarketAverage5dVolume%2CpreMarketPreviousLast%2CpreMarketPreviousChange%2CpreMarketPreviousPercentChange%2CpreMarketTradeTime%2CnextEarningsDate%2CnextEarningsDate%2CtimeCode%2CsymbolCode%2CsymbolType%2ChasOptions&orderBy=preMarketVolume&meta=field.shortName%2Cfield.type%2Cfield.description%2Clists.lastUpdate&hasOptions=true&page=1&limit=100&raw=1"
headers = {
'accept': 'application/json',
'cookie': '_gcl_au=1.1.685644914.1670446600; _fbp=fb.1.1670446600221.1987872306; _pbjs_userid_consent_data=3524755945110770; _pubcid=e7cf9178-59bc-4a82-b6c4-a2708ed78b8d; _admrla=2.2-1e3aed0d7d9d2975-a678aeef-7671-11ed-803e-d12e87d011f0; _lr_env_src_ats=false; _cc_id=6c9e21e7f9c269f8501e2616f9e68632; __browsiUID=c0174d21-a0ab-4dfe-8978-29ae08f44964; __qca=P0-531499020-1670446603686; __gads=ID=220b766bf87e15f9-22fa0316ded8001f:T=1670446598:S=ALNI_MaEWcBqESsJKLF0AwoIVvrKjpjZ_g; panoramaId_expiry=1673549551401; panoramaId=9aa5615403becfbc8adf14a3024816d53938b8cdbea6c8f5cabb60112755d70c; udmsrc=%7B%7D; _pk_id.1.73a4=1aee00a1c66e897b.1672997455.; _ccm_inf=1; bcPremierAdsListScreen=true; _hjSessionUser_2563157=eyJpZCI6ImI2MTM5NTQ4LWUxYzMtNTU2NS04MmM3LTk4ODQ5MWNjY2YxZCIsImNyZWF0ZWQiOjE2NzMwMzQ3OTY0NDAsImV4aXN0aW5nIjp0cnVlfQ==; bcFreeUserPageView=0; _gid=GA1.2.449489725.1673276404; _ga_4HQ9CY2XKK=GS1.1.1673303248.3.0.1673303248.0.0.0; _ga=GA1.2.606341620.1670446600; __aaxsc=2; aasd=5%7C1673314072749; webinar131WebinarClosed=true; _lr_geo_location_state=NC; _lr_geo_location=US; udm_edge_floater_fcap=%5B1673397095403%2C1673392312561%2C1673078162569%2C1673076955809%2C1673075752582%2C1673066137343%2C1673056514808%2C1673051706099%2C1673042087115%2C1673037276340%2C1672960427551%2C1672952009965%2C1672947201101%5D; pbjs-unifiedid=%7B%22TDID%22%3A%2219345091-e7fd-4323-baeb-4627c879c6ba%22%2C%22TDID_LOOKUP%22%3A%22TRUE%22%2C%22TDID_CREATED_AT%22%3A%222022-12-05T19%3A48%3A10%22%7D; __gpi=UID=000008c6d06e1e0d:T=1670446598:RT=1673433090:S=ALNI_MZS6mLx8CJg9iN6kzx4JeDFHPOMjg; market=eyJpdiI6InJvcVNudkprUjQ1bE0yWWQrSTlYY1E9PSIsInZhbHVlIjoieUpabHpmSnJGSkIxc0o1enpyb1dLdENBSWp4UE5NYUZwUFg3OGs0TGJSL0dQWUNpTDU0a2hZbklOQTFNd09OVSIsIm1hYyI6IjBjMjJkNDExZjRhOTc2M2QwYWU3NGUyNmVlZTgyMzY2NWM2MjQyOTY2MjY2YmUxODI2Y2RkY2FlNzI3MjNkOTIifQ%3D%3D; _lr_retry_request=true; __browsiSessionID=c02dadca-6355-415f-aa80-926cccd94759&true&false&DEFAULT&us&desktop-4.11.12&false; IC_ViewCounter_www.barchart.com=2; '
    'cto_bundle=dxDlRl90VldIJTJGa0VaRzRIS0xnQmdQOXVWVlhybWJ3NDluY29PelBnM0prMkFxZkxyZWh4dkZNZG9LcyUyRjY1VWlIMWRldkRVRlJ5QW05dHlsQU1xN2VmbzlJOXZFSTNlcFRxUkRxYiUyRlp6Z3hhUHpBekdReU5idVV0WnkxVll0eGp5TyUyQlVzJTJCVDVoRkpWWlZ4R0hOSUl2YTVJVDhBJTNEJTNE; cto_bidid=51ixCl92dkhqbmVmdnlTZHVYS25nWTk2eDVMUnVRNjhEMUhxa3FlcmFzRHVNSERUQkd5cFZrM0QyQyUyRkVNNkV6S0ZHOUZPcTBTR2lBUjA5QUc5YU1ucW9GMFZBWHB4aU9sMlo3WHAlMkJYWjZmJTJGWkpsWSUzRA; _awl=2.1673451629.5-df997ba8dc13bee936d8d14a9771e587-6763652d75732d6561737431-0; laravel_token=eyJpdiI6IjR2YStGblAxWlZoZzllcEtPUUFLNlE9PSIsInZhbHVlIjoiY3E2bHdQWFkyT1FFUHFka2NMMVoyREFvQlZwWXlxc3F0SlRuZnIyTHJsSWtNVFA0K1czcDloWFF2d0lVZys3azZyelkrWks5SWxuRW05MGlqV1I4QmViMU9KKzArVXJOTWNVK2hqZVRocVNHM3NZa1dNeStQbnNyYVBtcjlUeTZzT2lpV2t1ek1UOE1wSUFudmg0NzFTQ3VPeDJiYk16bGNBTzVqVHBCcFRZdTFsZjBVREVyUEhLeThjZm9wSGIzQ2NDVE0ya0xOQWx1VGx0aUlEUE9yakU4Q3RicWFmNDdkYjJSWHlsSWYwajlSUkozVmQ4OVNGNzZEeWhtUExtcXB6VnNrY2NsUzRFQnJyMlhiejFtc0l3U2p5SW5BbFFDZTN0dk9EUWNOR2hVYUdMbmhFUFZVT24xOFFGVkM3L2giLCJtYWMiOiIxYzM5Yzk1ZWNjNjM0NzdjMmM4YTJkZDg0ZmY5MWQwNWUzOTlhNTAwNjg2MTNmNTNlYzY4M2MzYWQ3MDA4MThlIn0%3D; XSRF-TOKEN=eyJpdiI6Ik1PMGEvOGFkZ1p1ekpNcXIvZWZtcHc9PSIsInZhbHVlIjoiMVZYQ3NCV1hjcWREdG5uSDVqYXZVVy91U29USys1dkJJeFNZZG9QVGNRNDhmMTJIeitVV2NUV0xSUC9ZTThVM3FDQWZBcVdNclhqSkx4MG1NTGhadUNlNXRLMEdUc3RDcEVwNnJVYU9FNTBub2NKRWxlekxBZmZEVXNhZUlwWnoiLCJtYWMiOiIxYTI0N2E2OGMxMzRhNmFiYTliMzBlYTdjYWZlNzUwMDRlY2Q5YjI2YzY4OGZlMWIxYmM0YTE3YzZkMTdhMGM3In0%3D; laravel_session=eyJpdiI6InJIcmMxRWVacmtGc2tENS9zYUFFOVE9PSIsInZhbHVlIjoibG1vQWh1d1dmaUNBZTV4dGdJbWhTVEoyMWVrblBFNTBycTBPai9ad2llcHRkd0hHUTI4ZS8rUFNFVm5LNEcvd1RXY1RwOHdVZHplNU92Vk9xUHZjYmMrUC9Cc3hJUkJNWE54OVR1UHFaTExpM1BRcWRSWEJ5Q3gvVVNzajdHZUoiLCJtYWMiOiI5NDVkOGU4NGM5Y2MwMThmMTgwMzQyOWQ1Yzc5MzU5ZGU2ZjkwMWRjYzBjZWJiZDFhMTQzODMzZmE2NWExMGQ3In0%3D',
'referer': 'https://www.barchart.com/stocks/pre-market-trading/volume-advances?orderBy=preMarketVolume&orderDir=desc',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
'x-xsrf-token': 'eyJpdiI6Im1LQVRpVEJONzZwMDRVQnhYK0I5SWc9PSIsInZhbHVlIjoiMkRIMnJBb1VDQmRscjNlajF1dVR2eWxRbGNJTGZCNWxMaWk3N0EzQWlyOWk0cXJBK2oyUVJ1N282R2VOVWh6WlhJcXdZdFplZmRqaFhPa203bi9HeFBxckJKeUVzVDRETHI5OHlxNDZnOEF5WVV5NXdNSWJiWk95UlFHRXQwN2siLCJtYWMiOiI1NTkyZjk2M2FlNTE0NDI0ODQ3YmE4ZjIyZDY1MzM2MTA3ZTY4NDA5NzA5YzViMjhiN2UwYTFhNTM1Y2ZkMjk5In0='
}
r = requests.get(request_url,headers=headers)
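As an aside, most of that long request_url is URL-encoded query parameters. Decoding them shows exactly which fields the API is being asked for; this sketch only parses the URL locally and does not contact barchart.com:

```python
# Decode the query string of the barchart API URL to inspect its parameters.
from urllib.parse import urlsplit, parse_qs

request_url = "https://www.barchart.com/proxies/core-api/v1/quotes/get?lists=stocks.us.premarket.volume_advances&orderDir=desc&fields=symbol%2CsymbolName%2CpreMarketLastPrice%2CpreMarketPriceChange%2CpreMarketPercentChange%2CpreMarketVolume%2CpreMarketAverage5dVolume%2CpreMarketPreviousLast%2CpreMarketPreviousChange%2CpreMarketPreviousPercentChange%2CpreMarketTradeTime%2CnextEarningsDate%2CnextEarningsDate%2CtimeCode%2CsymbolCode%2CsymbolType%2ChasOptions&orderBy=preMarketVolume&meta=field.shortName%2Cfield.type%2Cfield.description%2Clists.lastUpdate&hasOptions=true&page=1&limit=100&raw=1"

params = parse_qs(urlsplit(request_url).query)
fields = params["fields"][0].split(",")  # %2C decodes to a comma
print(fields[:3])  # ['symbol', 'symbolName', 'preMarketLastPrice']
print(params["limit"][0], params["orderBy"][0])
```

This also makes it easy to tweak the request, e.g. change limit or orderBy, without hand-editing percent-encoded text.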
My code below downloads the website https://www.nasdaq.com/market-activity/stocks/mrtn/earnings . I am interested in the data in its tables, say the "Quarterly Earnings Surprise Amount" table. From the developer tools in Chrome, I can see the data sits in tags such as:
<td class="earnings-forecast__cell">1.13</td>
But when I download the page with the code below, the number inside the tag disappears; I only get <td class="earnings-forecast__cell"> </td>.
Can you please help me fix this? Thanks, HHC
import requests
from bs4 import BeautifulSoup as soup
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'referer': 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
}
# Send a get request to server:
url = 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
html = requests.get(url=url,headers=header)
# Check that the request succeeded (2xx status codes)
print(html.status_code)
data = soup(html.content, 'lxml')
print(type(data))
# print(data)
If you look at the page source, you can see that the table you are interested in doesn't contain any values. This indicates that the data in the table is rendered via JavaScript.
On checking the requests sent from the browser's "Network" tab, we can see that an XHR request is sent by a JS script, and the reply contains the data you are looking for. The endpoint the script requests is: https://api.nasdaq.com/api/company/MRTN/earnings-surprise.
Try this:
import requests
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'referer': 'https://www.nasdaq.com/market-activity/stocks/mrtn/earnings'
}
url = 'https://api.nasdaq.com/api/company/MRTN/earnings-surprise'
response = requests.get(url=url, headers=header)
if response.status_code == 200:
    print(response.json())
else:
    print("Failed", response.status_code)
In Chrome, filter the requests to "Fetch/XHR" and you should be able to see the request. (Refresh the page once with the "Network" tab open.)
Happy coding!
I am currently using Python requests to scrape data from a website and using Postman as a tool to help me do it.
To those not familiar with Postman, it sends a get request and generates a code snippet to be used in many languages, including Python.
By using it, I can get data from the website quite easily, but it seems like the 'Cookie' part of the headers provided by Postman changes over time, so I can't automate my code to run at any time. The issue is that when the cookie is no longer valid, I get an access denied message.
Here's an example of the code provided by Postman:
import requests
url = "https://wsloja.ifood.com.br/ifood-ws-v3/restaurants/7c854a4c-01a4-48d8-b3d4-239c6c069f6a/menu"
payload = {}
headers = {
'access_key': '69f181d5-0046-4221-b7b2-deef62bd60d5',
'browser': 'Windows',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
'Accept': 'application/json, text/plain, */*',
'secret_key': '9ef4fb4f-7a1d-4e0d-a9b1-9b82873297d8',
'Cache-Control': 'no-cache, no-store',
'X-Ifood-Session-Id': '85956739-2fac-4ebf-85d3-1aceda9738df',
'platform': 'Desktop',
'app_version': '8.37.0',
'Cookie': 'session_token=TlNUXzMyMjJfMTU5Nzg1MDE5NTIxNF84NDI5NTA2NDQ2MjUxMg==; _abck=AD1745CB8A0963BF3DD67C8AF7932007~-1~YAAQtXsGYH8UUe9zAQAACZ+IAgStbP4nYLMtonPvQ+4UY+iHA3k6XctPbGQmPF18spdWlGiDB4/HbBvDiF0jbgZmr2ETL8YF+f71Uwhsj+L8K+Fk4PFWBolAffkIRDfSubrf/tZOYRfmw09o59aFuQor5LeqxzXkfVsXE8uIJE0P/nC1JfImZ35G0OFt+HyIgDUZMFQ54Wnbap7+LMSWcvMKF6U/RlLm46ybnNnT/l/NLRaEAOIeIE3/JdKVVcYT2t4uePfrTkr5eD499nyhFJCwSVQytS9P7ZNAM4rFIPnM6kPtwcPjolLNeeU=~-1~-1~-1; ak_bmsc=129F92B2F8AC14A400433647B8C29EA3C9063145805E0000DB253D5F49CE7151~plVgguVnRQTAstyzs8P89cFlKQnC9ISQCH9KPHa8xYPDVoV2iQ/Hij2PL9r8EKEqcQfzkGmUWpK09ZpU0tL/llmBloi+S+Znl5P5/NJeV6Ex2gXqBu1ZCxc9soMWWyrdvG+0FFvSP3a6h3gaouPh2O/Tm4Ghk9ddR92t380WBkxvjXBpiPzoYp1DCO4yrEsn3Tip1Gan43IUHuCvO+zkRmgrE3Prfl1T/g0Px9mvLSVrg=; bm_sz=3106E71C2F26305AE435A7DA00506F01~YAAQRTEGyfky691zAQAAGuDbBggFW4fJcnF1UtgEsoXMFkEZk1rG8JMddyrxP3WleKrWBY7jA/Q08btQE43cKWmQ2qtGdB+ryPtI2KLNqQtKM5LnWRzU+RqBQqVbZKh/Rvp2pfTvf5lBO0FRCvESmYjeGvIbnntzaKvLQiDLO3kZnqmMqdyxcG1f51aoOasrjfo=; bm_sv=B4011FABDD7E457DDA32CBAB588CE882~aVOIuceCgWY25bT2YyltUzGUS3z5Ns7gJ3j30i/KuVUgG1coWzGavUdKU7RfSJewTvE47IPiLztXFBd+mj7c9U/IJp+hIa3c4z7fp22WX22YDI7ny3JxN73IUoagS1yQsyKMuxzxZOU9NpcIl/Eq8QkcycBvh2KZhhIZE5LnpFM='
}
response = requests.request("GET", url, headers=headers, data = payload)
print(response.text.encode('utf8'))
Here's just the Cookie part where I get access denied:
'Cookie': 'session_token=TlNUXzMyMjJfMTU5Nzg1MDE5NTIxNF84NDI5NTA2NDQ2MjUxMg==; _abck=AD1745CB8A0963BF3DD67C8AF7932007~-1~YAAQtXsGYH8UUe9zAQAACZ+IAgStbP4nYLMtonPvQ+4UY+iHA3k6XctPbGQmPF18spdWlGiDB4/HbBvDiF0jbgZmr2ETL8YF+f71Uwhsj+L8K+Fk4PFWBolAffkIRDfSubrf/tZOYRfmw09o59aFuQor5LeqxzXkfVsXE8uIJE0P/nC1JfImZ35G0OFt+HyIgDUZMFQ54Wnbap7+LMSWcvMKF6U/RlLm46ybnNnT/l/NLRaEAOIeIE3/JdKVVcYT2t4uePfrTkr5eD499nyhFJCwSVQytS9P7ZNAM4rFIPnM6kPtwcPjolLNeeU=~-1~-1~-1; ak_bmsc=129F92B2F8AC14A400433647B8C29EA3C9063145805E0000DB253D5F49CE7151~plVgguVnRQTAstyzs8P89cFlKQnC9ISQCH9KPHa8xYPDVoV2iQ/Hij2PL9r8EKEqcQfzkGmUWpK09ZpU0tL/llmBloi+S+Znl5P5/NJeV6Ex2gXqBu1ZCxc9soMWWyrdvG+0FFvSP3a6h3gaouPh2O/Tm4Ghk9ddR92t380WBkxvjXBpiPzoYp1DCO4yrEsn3Tip1Gan43IUHuCvO+zkRmgrE3Prfl1T/g0Px9mvLSVrg=; bm_sz=3106E71C2F26305AE435A7DA00506F01~YAAQRTEGyfky691zAQAAGuDbBggFW4fJcnF1UtgEsoXMFkEZk1rG8JMddyrxP3WleKrWBY7jA/Q08btQE43cKWmQ2qtGdB+ryPtI2KLNqQtKM5LnWRzU+RqBQqVbZKh/Rvp2pfTvf5lBO0FRCvESmYjeGvIbnntzaKvLQiDLO3kZnqmMqdyxcG1f51aoOasrjfo=; bm_sv=B4011FABDD7E457DDA32CBAB588CE882~aVOIuceCgWY25bT2YyltUzGUS3z5Ns7gJ3j30i/KuVUgG1coWzGavUdKU7RfSJewTvE47IPiLztXFBd+mj7c9U/IJp+hIa3c4z7fp22WX23E755znZL76c0V/amxbHU9BUnrEff3HGcsniyh5mU+C9XVmtNRLd8oT1UW9WUg3qE=' }
Which is slightly different from the one before.
How could I get around this by having Python obtain the session token itself?
Apparently just removing 'Cookie' from headers does the job.
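In other words, strip the hard-coded Cookie from the generated snippet and let a requests.Session manage cookies for you. A minimal sketch (the header values below are placeholders, not real tokens):

```python
import requests

# Headers as generated by Postman; the Cookie value here is a stale placeholder.
headers = {
    'User-Agent': 'Mozilla/5.0 (placeholder UA string)',
    'Accept': 'application/json, text/plain, */*',
    'Cookie': 'session_token=STALE_TOKEN; _abck=STALE_VALUE',
}

# Drop the hard-coded cookie; a Session stores any Set-Cookie the server
# sends back and replays it on later requests automatically.
headers.pop('Cookie', None)
session = requests.Session()

print('Cookie' in headers)  # False
```

With that, each run negotiates a fresh session token instead of replaying an expired one.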
I am using the Python requests module to grab data from a website.
The first time I run the script, everything works fine and the data is correct. If I run the script again, it returns the same data, even though that data has changed on the website when opened in a browser. No matter how often I run the script, the data stays the same. BUT!
After 5 or 6 minutes, if I run the script again, the data is updated. It looks like requests is caching the response.
In the browser, every time I hit refresh, the data updates correctly.
r = requests.get('https://verysecretwebsite.com', headers=headers)
r.text
I'm actually using the following headers:
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.gismeteo.ru/weather-orenburg-5159/now/',
    'DNT': '1',
    'Connection': 'false',
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'no-cache, max-age=0',
    'TE': 'Trailers',
}
but with no luck.
I'm trying to grab this link https://www.gismeteo.ru/weather-orenburg-5159/now/ , specifically the element with data-dateformat="G:i".
In your code you haven't set any headers. This means that requests will always send its default User-Agent header, like User-Agent: python-requests/2.22.0, and use no caching directives such as Cache-Control.
The remote server of your website may have different caching policies for different client applications. It can respond with different data, or use a different caching time, based on the User-Agent and/or Cache-Control headers of your request.
So check which headers your browser uses to make requests to your site (F12 in Chrome) and add them to your request. You can also add a Cache-Control directive to force the server to return the most recent data.
Example:
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36",
    "Cache-Control": "no-cache, max-age=0",  # disable caching
}
r = requests.get("https://www.mysecretURL.com", headers=headers)
The requests.get() method doesn't cache data by default (from this StackOverflow post). I'm not entirely sure of the reason for the lag, as refreshing your browser is essentially identical to calling requests.get(). You could try creating a loop that automatically collects data every 5-10 seconds or so; that should work fine and keep you from having to run the same lines of code by hand. Hope this helps!
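Such a loop can be sketched as follows. The fetch argument is a stand-in for your real requests.get call, and the interval is set to 0 in the example so it finishes instantly:

```python
import time

def poll(fetch, interval_s=5.0, max_polls=3):
    """Call fetch() max_polls times, sleeping interval_s seconds in between."""
    results = []
    for i in range(max_polls):
        results.append(fetch())
        if i < max_polls - 1:
            time.sleep(interval_s)
    return results

# Stub standing in for a real HTTP call such as requests.get(url).text:
counter = iter(range(100))
data = poll(lambda: next(counter), interval_s=0, max_polls=3)
print(data)  # [0, 1, 2]
```

In real use you would pass something like lambda: requests.get(url, headers=headers).text and a 5-10 second interval.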
I'm just trying to use a simple Python GET request to access JSON data from stats.nba.com. It seems pretty straightforward, as I can enter the URL into my browser and get the results I'm looking for. However, whenever I run this, the program just runs with no end. I'm wondering if I have to include some kind of headers information in my GET request.
The code is below:
import requests
url = 'http://stats.nba.com/stats/commonteamroster?LeagueID=00&Season=2017-18&TeamID=1610612756'
response=requests.get(url)
print(response.text)
I have tried to visit the URL you've given; you can add headers to your request to avoid this problem (the minimum you need to provide is User-Agent, and I think you can add as much header information as you like):
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get(url, headers=headers)
The stats.nba.com website needs your 'User-Agent' header information.
You can get your request header information from the Network tab in the browser.
Take Chrome as an example: press F12, visit the URL you've given, and you can find the relevant request information there; the most useful part is the request headers.
You need to use headers. Try copying from your browser's network tab. Here's what worked for me:
request_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'stats.nba.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
And here's the modified get:
response = requests.get(url, headers = request_headers)