How to scrape a website based on response headers using requests?


I am trying to scrape https://www.foodhall.co.id/grand-indonesia/catalog.
I found the API endpoint https://api.foodhall.co.id/v1/catalog/productbycategoryv2, which the page above loads its products from. I checked the response headers in the browser's developer tools, and they are as follows:
HTTP/1.1 200 OK
Date: Mon, 16 Jan 2023 03:07:59 GMT
Server: Apache/2.4.41 (Ubuntu)
Set-Cookie: advanced-api=cigighcd1tcmdoj0eic643mogl; path=/; HttpOnly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Access-Control-Allow-Origin: *
Content-Length: 2685
Keep-Alive: timeout=5, max=94
Connection: Keep-Alive
Content-Type: application/json; charset=UTF-8
What should I look for in the response headers so that I don't get an invalid authorization error?
Currently my code is as follows:
import requests
payload = {
    'store': '49',
    'category_id': '',
    'search': '',
    'filter': '',
    'tag': '',
    'lang': 'ID',
    'page': '0'
}
headers = {
    'Authorization': 'Bearer 17485f41ae19fbba0f4edf3241c9f033bb1af4e1c843789acfc9cf5136d443ea1673838475',
    'Connection': 'keep-alive',
    'Content-Length': '57',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'api.foodhall.co.id',
    'Origin': 'https://www.foodhall.co.id',
    'Referer': 'https://www.foodhall.co.id/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}
response = requests.post('https://api.foodhall.co.id/v1/catalog/productbycategoryv2', json=payload, headers=headers)
The response JSON is {'success': 0, 'message': 'invalid Authorization'}. I thought the Set-Cookie response header might be tied to the authorization, so I guess my next step is to figure out how to get the authorization code.
Can someone help me?
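One detail worth checking in the code above: the copied Content-Type is application/x-www-form-urlencoded with a Content-Length of 57, yet the payload is passed via json=, which makes requests send a JSON body. The sketch below uses only the standard library to compare the two encodings of the same payload. It is an observation about the body format only; whether it also resolves the 'invalid Authorization' message is an assumption, since the Bearer token may simply have expired.

```python
import json
from urllib.parse import urlencode

# Payload copied from the question.
payload = {
    "store": "49",
    "category_id": "",
    "search": "",
    "filter": "",
    "tag": "",
    "lang": "ID",
    "page": "0",
}

# Body as a browser form post encodes it (comparable to data=payload in requests).
form_body = urlencode(payload)
# Body as JSON (comparable to json=payload in requests).
json_body = json.dumps(payload)

print(form_body)
print(len(form_body))  # compare with the Content-Length: 57 request header
print(len(json_body))
```

If the form encoding is what the API expects, the call in the question would pass data=payload rather than json=payload, and the hand-written Content-Length header could be dropped, since requests computes it.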

Related

Selenium succeeding, python requests library failing, despite same url and same request headers - what's the difference?

# selenium-request.py
from seleniumwire import webdriver # Import from seleniumwire
# Create a new instance of the Chrome driver
driver = webdriver.Chrome()
driver.get('https://www.cmegroup.com/content/cmegroup/en/tools-information/advisorySearch/jcr:content/full-par/cmeadvisorysearch.advisorySearch.advisorynotices:Advisory%20Notices.-.2.12|07|2021.01|01|2008.json')
for request in driver.requests:
    if request.response:
        print(request.response.headers)
When I run that code I get the headers Selenium uses:
$ python selenium-request.py
Accept-Ranges: bytes
Access-Control-Allow-Origin: http://star-website.com
Content-Type: application/json
ETag: W/"36b8a-5d3d28ed9cc43"
Last-Modified: Thu, 23 Dec 2021 16:16:16 GMT
Referrer-Policy: no-referrer-when-downgrade
Server: Apache
ServerID: e1
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Accept-Encoding
Content-Encoding: gzip
Cache-Control: max-age=86400
Date: Thu, 23 Dec 2021 16:16:16 GMT
Content-Length: 46236
Connection: keep-alive
Content-Security-Policy: frame-ancestors 'self' *.cmegroup.com *.quikstrike.net commodex.co.il openexchange.community.cmegroup.com staging.tickertocker.com http://www.straitsfinancial.com www.straitsfinancial.com http://straitsfinancial.com https://www.home.saxo https://app.topsteptrader.com https://help.topsteptrader.com https://staging.topsteptrader.com https://blueeditsitecore.sys.dom https://bluesitecore.sys.dom https://sitecoredev.orange.saxobank.com https://sitecoredev-nocache.orange.saxobank.com https://sitecoredevedit.orange.tst2.dom http://star-website.com https://www.investing.com https://*.benzinga.com https://bz.zingbot.bz https://www.zingbot.bz https://gdcdyn.interactivebrokers.com https://www.interactivebrokers.com https://zingbot.bz https://www.zingbot.bz https://m.zingbot.bz https://bz.zingbot.bz https://dev.futuresfirstacademy.com https://uat.futuresfirstacademy.com https://futuresfirstacademy.com http://stage.barchart.com http://www.barchart.com https://www.infinityfutures.com https://kilofutures.com https://m.cqg.com https://mdemo.cqg.com *.chicago.cme.com:7822 https://uatm.cqg.com https://local.zingbot.bz https://www.gulfbondsukuk.org www.kgieworld.sg https://www.propex24.wpcomstaging.com https://www.propex24.com *.straitsfinancial.gate39tech.com us.straitsfinancial.com https://*.kapcoclients.com https://kapcoclients.com https://*.wallstreetbound.org https://wallstreetbound.org https://cofcointl.plateau.com https://rise.articulate.com https://members.tradeday.com http://blf-django.herokuapp.com https://www.bluelinefutures.com https://www.bluelinefutures.live https://www.bluelinefutures.trade https://login.chicago.cme.com https://loginnr.chicago.cme.com https://logincert.chicago.cme.com https://login-ny.chicago.cme.com https://ampfutures.com https://cme.ampfutures.com https://*.advantagefutures.com https://*.e-futures.com https://*.etrade.com https://*.gffbrokers.com https://infinityfutures-cn.com https://sweetfutures.com https://*.tradovate.com 
https://home.saxo https://*.tickmill.co.uk https://*.directa.it https://big.pt https://*.tradestation-international.com https://*.stonex.com http://tradinglesson.com https://tradinglesson.com *.ibroker.it *.ibroker.es *.cornertrader.ch *.whselfinvest.com *.banxbroker.de *.ameritrade.com *.sweetfutures.com *.danielstrading.com *.gainfutures.com *.futuresonline.com *.tdainc.com *.lsvp.com *.schwab.com *.schwab.co.uk *.us.global.schwab.com *.dev.schwab.com;
Set-Cookie: ak_bmsc=AB0A9701302106EABE2E195C6AC2A074~000000000000000000000000000000~YAAQLtERAvOZVN19AQAA7C8U6A7AWr7StAmiphZPltguFftPSOXgfa2NAq7Vts+40k7AdnPG55ULK1vyBRhPRdqWbtYml3JTC3RjHLu31l8kWBFvysYyuY2uz4GpkvmOWoBSN/Dl/2bQ9bEgbiYj3tCZ1o+wEvMfsiAWiJeMY3M1ozu6nyQz0JVpdvfsqun3z5wGhpJWhkjrJjeIyHvVdzx2uyIb1azRFlHT+nRCR6NHGoaMM/G2sI1DqPOXPB5btXjdncvB739c2Beh7RgWD/zvb78qpAJDUR1KOenDy1EwN2Bg8pqH1sxlsoVrl7i7r/pAOaWKfd4U1FKP7p730GfOp/m2VRBIdYgHDPHPvGeITPKrR/G22aR886r9Lerhug==; Domain=.cmegroup.com; Path=/; Expires=Thu, 23 Dec 2021 18:16:01 GMT; Max-Age=7185; HttpOnly
I copy these exact headers into a python dict and request as follows:
# python-request.py
import requests
headers = {
"Accept-Ranges": "bytes",
"Access-Control-Allow-Origin": "http://star-website.com",
"Content-Type": "application/json",
"ETag": 'W/"36b8a-5d3d28ed9cc43"',
"Last-Modified": "Thu, 23 Dec 2021 16:16:16 GMT",
"Referrer-Policy": "no-referrer-when-downgrade",
"Server": "Apache",
"ServerID": "e1",
"Strict-Transport-Security": "max-age=31536000; includeSubDomains",
"Vary": "Accept-Encoding",
"Content-Encoding": "gzip",
"Cache-Control": "max-age=86400",
"Date": "Thu, 23 Dec 2021 16:16:16 GMT",
"Content-Length": "46236",
"Connection": "keep-alive",
"Content-Security-Policy": "frame-ancestors 'self' *.cmegroup.com *.quikstrike.net commodex.co.il openexchange.community.cmegroup.com staging.tickertocker.com http://www.straitsfinancial.com www.straitsfinancial.com http://straitsfinancial.com https://www.home.saxo https://app.topsteptrader.com https://help.topsteptrader.com https://staging.topsteptrader.com https://blueeditsitecore.sys.dom https://bluesitecore.sys.dom https://sitecoredev.orange.saxobank.com https://sitecoredev-nocache.orange.saxobank.com https://sitecoredevedit.orange.tst2.dom http://star-website.com https://www.investing.com https://*.benzinga.com https://bz.zingbot.bz https://www.zingbot.bz https://gdcdyn.interactivebrokers.com https://www.interactivebrokers.com https://zingbot.bz https://www.zingbot.bz https://m.zingbot.bz https://bz.zingbot.bz https://dev.futuresfirstacademy.com https://uat.futuresfirstacademy.com https://futuresfirstacademy.com http://stage.barchart.com http://www.barchart.com https://www.infinityfutures.com https://kilofutures.com https://m.cqg.com https://mdemo.cqg.com *.chicago.cme.com:7822 https://uatm.cqg.com https://local.zingbot.bz https://www.gulfbondsukuk.org www.kgieworld.sg https://www.propex24.wpcomstaging.com https://www.propex24.com *.straitsfinancial.gate39tech.com us.straitsfinancial.com https://*.kapcoclients.com https://kapcoclients.com https://*.wallstreetbound.org https://wallstreetbound.org https://cofcointl.plateau.com https://rise.articulate.com https://members.tradeday.com http://blf-django.herokuapp.com https://www.bluelinefutures.com https://www.bluelinefutures.live https://www.bluelinefutures.trade https://login.chicago.cme.com https://loginnr.chicago.cme.com https://logincert.chicago.cme.com https://login-ny.chicago.cme.com https://ampfutures.com https://cme.ampfutures.com https://*.advantagefutures.com https://*.e-futures.com https://*.etrade.com https://*.gffbrokers.com https://infinityfutures-cn.com https://sweetfutures.com 
https://*.tradovate.com https://home.saxo https://*.tickmill.co.uk https://*.directa.it https://big.pt https://*.tradestation-international.com https://*.stonex.com http://tradinglesson.com https://tradinglesson.com *.ibroker.it *.ibroker.es *.cornertrader.ch *.whselfinvest.com *.banxbroker.de *.ameritrade.com *.sweetfutures.com *.danielstrading.com *.gainfutures.com *.futuresonline.com *.tdainc.com *.lsvp.com *.schwab.com *.schwab.co.uk *.us.global.schwab.com *.dev.schwab.com;",
"Set-Cookie": "ak_bmsc=AB0A9701302106EABE2E195C6AC2A074~000000000000000000000000000000~YAAQLtERAvOZVN19AQAA7C8U6A7AWr7StAmiphZPltguFftPSOXgfa2NAq7Vts+40k7AdnPG55ULK1vyBRhPRdqWbtYml3JTC3RjHLu31l8kWBFvysYyuY2uz4GpkvmOWoBSN/Dl/2bQ9bEgbiYj3tCZ1o+wEvMfsiAWiJeMY3M1ozu6nyQz0JVpdvfsqun3z5wGhpJWhkjrJjeIyHvVdzx2uyIb1azRFlHT+nRCR6NHGoaMM/G2sI1DqPOXPB5btXjdncvB739c2Beh7RgWD/zvb78qpAJDUR1KOenDy1EwN2Bg8pqH1sxlsoVrl7i7r/pAOaWKfd4U1FKP7p730GfOp/m2VRBIdYgHDPHPvGeITPKrR/G22aR886r9Lerhug==; Domain=.cmegroup.com; Path=/; Expires=Thu, 23 Dec 2021 18:16:01 GMT; Max-Age=7185; HttpOnly"
}
requests.get(
"https://www.cmegroup.com/content/cmegroup/en/tools-information/advisorySearch/jcr:content/full-par/cmeadvisorysearch.advisorySearch.advisorynotices:Advisory%20Notices.-.2.12|07|2021.01|01|2008.json",
headers=headers)
When I run this it just hangs indefinitely, so there is some issue with the request.
Apart from the headers, what is the difference between the requests made by python and Selenium - how could I identify the issue and hopefully get this working with the python requests library?
Update
I updated the code to get the request.headers instead:
Host: www.cmegroup.com
Connection: keep-alive
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Linux"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
... but the python requests script has the same result when using these headers, just hanging (or timing out if I set a timeout parameter).
Further update
Debug output is as follows:
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.cmegroup.com:443
send: b'GET /content/cmegroup/en/tools-information/advisorySearch/jcr:content/full-par/cmeadvisorysearch.advisorySearch.advisorynotices:Advisory%20Notices.-.2.12%7C07%7C2021.01%7C01%7C2008.json HTTP/1.1\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\nConnection: keep-alive\r\nHost: www.cmegroup.com\r\nsec-ch-ua: " Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"\r\nsec-ch-ua-mobile: ?0\r\nsec-ch-ua-platform: Linux\r\nUpgrade-Insecure-Requests: 1\r\nSec-Fetch-Site: none\r\nSec-Fetch-Mode: navigate\r\nSec-Fetch-User: ?1\r\nSec-Fetch-Dest: document\r\nAccept-Language: en-US,en;q=0.9\r\n\r\n'

It looks like it only needs a compatible User-Agent header.
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0',
}
url = 'https://www.cmegroup.com/content/cmegroup/en/tools-information/advisorySearch/jcr:content/full-par/cmeadvisorysearch.advisorySearch.advisorynotices:Advisory%20Notices.-.2.12|07|2021.01|01|2008.json'
response = requests.get(url, headers=headers, timeout=30)
print(response.status_code) # Prints 200 (OK).
print(response.json()) # Prints the output as JSON. The "item" key holds a list of 50 values.
This snippet did the trick for me.

It looks like you are using the response headers, not the request headers.
Try
print(request.headers)
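To make the distinction concrete, here is a minimal sketch that strips headers a server sends from a dict that was copied out of a response. RESPONSE_ONLY and request_headers_only are hypothetical names introduced for illustration, and the set is a partial list rather than an exhaustive registry.

```python
# Headers that only make sense in a response; sending them in a request is pointless.
# Partial, illustrative list (an assumption, not an exhaustive registry).
RESPONSE_ONLY = {
    "accept-ranges", "access-control-allow-origin", "content-security-policy",
    "etag", "last-modified", "referrer-policy", "server", "set-cookie",
    "strict-transport-security", "vary",
}

def request_headers_only(headers):
    """Return a copy of `headers` without the response-only entries."""
    return {k: v for k, v in headers.items() if k.lower() not in RESPONSE_ONLY}

# A mix of response and request headers, shortened from the question.
copied = {
    "Server": "Apache",
    "ETag": 'W/"36b8a-5d3d28ed9cc43"',
    "Vary": "Accept-Encoding",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept-Language": "en-US,en;q=0.9",
}
print(request_headers_only(copied))
```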

Connect to websocket with cloudflare protection on python

The essence of the problem: I used to connect to the websocket by sending the Origin, User-Agent, and Cookie headers, and the connection worked. The site owner has since moved the websocket to its own domain and put Cloudflare protection in front of it, and my connection method no longer works. Please suggest a method, or point me to information, on how to connect to a websocket behind Cloudflare.
Example of my code:
import websocket
import json
import time
import traceback
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.173', 'cookie': '__cfduid=da97b059db0292806e2affdf9c3f4fd8b1593022325; _csrf=i8W6njc7hUXMOf4iQjiAxKg1; language=en; theme=darkTheme; pro_version=false; csgo_ses=1489162147d69debd9fe5d0ea2e445c87a117578d774502172d7151b89b82f7f; steamid=76561199068891508; avatar=https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/fe/fef49e7fa7e1997310d705b2a6158ff8dc1cdfeb_medium.jpg; username=andrewcrook232; thirdparty_token=06d04856ce6e334aa1368696df775e7ba0b1b898db135b0af0b5dc0fe001dd55; user_type=old; sellerid=6721648; type_device=desktop', 'origin': 'https://cs.money'}
def start_ws():
    try:
        ws = websocket.WebSocketApp("wss://ws.cs.money/ws", on_message = on_message, cookie = json.dumps(headers))
        print("Connected")
        while True:
            ws.run_forever(ping_timeout=20)
            print("Reload")
            time.sleep(20)
    except:
        print(traceback.format_exc())
def on_message(ws, message):
    try:
        print(message)
    except:
        print(traceback.format_exc())
if __name__ == "__main__":
    start_ws()
Below is all the information that I got with Chrome Inspector (f12) -> Network -> WS -> headers, this information should be more than enough to successfully join WSS.
Request URL: wss://ws.cs.money/ws
Request Method: GET
Status Code: 101 Switching Protocols
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400
CF-Cache-Status: DYNAMIC
CF-RAY: 5a886ad37f4b8ac6-KBP
cf-request-id: 038921182700008ac6798a2200000001
Connection: upgrade
Date: Wed, 24 Jun 2020 18:12:29 GMT
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
Sec-WebSocket-Accept: zrH4CEKXm3BY5z77HroJDqGgYSc=
Server: cloudflare
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Upgrade: websocket
X-Content-Type-Options: nosniff
Accept-Encoding: gzip, deflate, br
Accept-Language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7
Cache-Control: no-cache
Connection: Upgrade
Host: ws.cs.money
Origin: https://cs.money
Pragma: no-cache
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Sec-WebSocket-Key: GXVT8QewAgPEZDEZZ+x3dA==
Sec-WebSocket-Version: 13
Upgrade: websocket
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.173
Also additional page data:
Request URL: https://cs.money/
Request Method: GET
Status Code: 200
Remote Address: 104.20.76.156:443
Referrer Policy: no-referrer-when-downgrade
alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400
cf-cache-status: DYNAMIC
cf-ray: 5a886ab5adac8aea-KBP
cf-request-id: 038921058800008aea96109200000001
content-encoding: br
content-security-policy: script-src 'self' cs.money dev.csgo.trade gleam.io www.am4charts.com translate.google.com translate.googleapis.com www.googletagmanager.com www.googleoptimize.com www.google-analytics.com connect.facebook.net https://vk.com 'unsafe-inline' top-fwz1.mail.ru 'unsafe-eval' api.usersnap.com cdn.usersnap.com cs.money mc.yandex.ru diffuser-cdn.app-us1.com diffuser-cdn.app-us1.com prism.app-us1.com trackcmp.net api.basisid.com https://cdn.amplitude.com sc-static.net support.cs.money embed-sandbox.bridgerpay.com embed.bridgerpay.com cs.money; worker-src 'self' data: blob: cs.money; object-src cs.money dota.money; media-src cs.money dota.money; frame-src cs.money dota.money onesignal.com https://*.com https://*.ru https://*.ua http://www.youtube.com
content-type: text/html; charset=utf-8
date: Wed, 24 Jun 2020 18:12:25 GMT
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
set-cookie: user_type=old; Path=/
set-cookie: language=en; Max-Age=8640000; Domain=cs.money; Path=/; Expires=Fri, 02 Oct 2020 18:12:25 GMT
set-cookie: language=en; Max-Age=8640000; Domain=.cs.money; Path=/; Expires=Fri, 02 Oct 2020 18:12:25 GMT
set-cookie: sellerid=6721648; Max-Age=8640000; Domain=cs.money; Path=/; Expires=Fri, 02 Oct 2020 18:12:25 GMT
set-cookie: pro_version=false; Max-Age=8640000; Domain=cs.money; Path=/; Expires=Fri, 02 Oct 2020 18:12:25 GMT
status: 200
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-cache-status: BYPASS
x-content-type-options: nosniff
x-dns-prefetch-control: off
x-download-options: noopen
x-frame-options: SAMEORIGIN
x-powered-by: PHP 4.1.0
x-xss-protection: 1; mode=block
:authority: cs.money
:method: GET
:path: /
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7
cache-control: max-age=0
cookie: __cfduid=da97b059db0292806e2affdf9c3f4fd8b1593022325; _csrf=i8W6njc7hUXMOf4iQjiAxKg1; language=en; theme=darkTheme; pro_version=false; csgo_ses=1489162147d69debd9fe5d0ea2e445c87a117578d774502172d7151b89b82f7f; steamid=76561199068891508; avatar=https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/fe/fef49e7fa7e1997310d705b2a6158ff8dc1cdfeb_medium.jpg; username=andrewcrook232; thirdparty_token=06d04856ce6e334aa1368696df775e7ba0b1b898db135b0af0b5dc0fe001dd55; user_type=old; sellerid=6721648; type_device=desktop
referer: https://steamcommunity.com/openid/login?openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.return_to=https%3A%2F%2Fauth.dota.trade%2Flogin%2Fcallback%3FredirectUrl%3Dhttps%3A%2F%2Fcs.money%26callbackUrl%3Dhttps%3A%2F%2Fcs.money%2Flogin&openid.realm=https%3A%2F%2Fauth.dota.trade
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: cross-site
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.173

I'm not sure about the real cause, but your code appears to have a bug.
To open a websocket connection with custom headers, pass them to the header parameter instead of JSON-dumping them into cookie.
ws = websocket.WebSocketApp("wss://ws.cs.money/ws",
                            on_message = on_message,
                            cookie = json.dumps(headers))
should be
cookie_string = headers['cookie']
del headers['cookie']
header_without_cookie = headers
ws = websocket.WebSocketApp("wss://ws.cs.money/ws",
                            on_message = on_message,
                            header = header_without_cookie,
                            cookie = cookie_string)
The websocket-client documentation is sparse, so you may want to read the source code for usage:
https://github.com/websocket-client/websocket-client/blob/2222f2c49d71afd74fcda486e3dfd14399e647af/websocket/_app.py
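The fix above can be sketched as a small helper that leaves the original dict untouched. split_cookie is a hypothetical name, and the shortened header values are placeholders. The WebSocketApp call appears only as a comment because it needs the websocket-client package and a live server, and sending the right headers is not guaranteed to get past Cloudflare.

```python
def split_cookie(headers):
    """Split a headers dict into (headers without the cookie, cookie string)."""
    rest = {k: v for k, v in headers.items() if k.lower() != "cookie"}
    cookie = next((v for k, v in headers.items() if k.lower() == "cookie"), None)
    return rest, cookie

# Shortened placeholder values; the real ones come from the browser.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "cookie": "language=en; theme=darkTheme",
    "origin": "https://cs.money",
}
header_without_cookie, cookie_string = split_cookie(headers)
print(header_without_cookie)
print(cookie_string)

# Usage with websocket-client, mirroring the answer above (not executed here):
# ws = websocket.WebSocketApp("wss://ws.cs.money/ws",
#                             on_message=on_message,
#                             header=header_without_cookie,
#                             cookie=cookie_string)
```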

Open URL using python requests only then proceed to download file

I'm trying to download a file from a website using Python's requests module.
However, the site lets me download the file only if the download link is clicked directly from the download page.
So, using requests, I tried hitting the download page's URL first with requests.get() and then downloading the file. Unfortunately this doesn't work: a message asking me to open the download page first is simply written into file.torrent.
import requests
def download(username, password):
    with requests.Session() as session:
        session.post('https://website.net/forum/login.php', data={'login_username': username, 'login_password': password})
        # Download page URL
        requests.get('https://website.net/forum/viewtopic.php?t=2508126')
        # The download URL itself
        response = requests.get('https://website.net/forum/dl.php?t=2508126')
        with open('file.torrent', 'wb') as f:
            f.write(response.content)
download(username='XXXXX', password='YYYYY')
Response when downloading directly from the download page (works) :
General :
Request URL: https://website.net/forum/dl.php?t=2508126
Request Method: GET
Status Code: 200 OK
Remote Address: 185.37.128.136:443
Referrer Policy: no-referrer-when-downgrade
Response Headers :
Cache-Control: no-store, no-cache, must-revalidate
Cache-Control: post-check=0, pre-check=0
Content-Disposition: attachment; filename="[website.net].t2508126.torrent"
Content-Length: 33641
Content-Type: application/x-bittorrent; name="[website.net].t2508126.torrent"
Date: Thu, 14 Feb 2019 07:57:08 GMT
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Last-Modified: Thu, 14 Feb 2019 07:57:09 GMT
Pragma: no-cache
Server: nginx
Set-Cookie: bb_dl=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/forum/; domain=.website.net
Request Headers :
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Cookie: bb_t=a%3A3%3A%7Bi%3A2507902%3Bi%3A1550052944%3Bi%3A2508011%3Bi%3A1550120230%3Bi%3A2508126%3Bi%3A1550125516%3B%7D; bb_data=1-27969311-wXVPJGcedLE1I2mM9H0u-3106784170-1550128652-1550131012-3061288864-1; bb_dl=2508126
Host: website.net
Referer: https://website.net/forum/viewtopic.php?t=2508126
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3701.0 Safari/537.36
Query String Parameters :
t: 2508126
Response when opening the download link on its own (doesn't work) :
General :
Request URL: https://website.net/forum/dl.php?t=2508126
Request Method: GET
Status Code: 200 OK
Remote Address: 185.37.128.136:443
Referrer Policy: no-referrer-when-downgrade
Response Headers :
Cache-Control: no-store, no-cache, must-revalidate
Cache-Control: post-check=0, pre-check=0
Content-Type: text/html; charset=windows-1251
Date: Thu, 14 Feb 2019 08:03:29 GMT
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Last-Modified: Thu, 14 Feb 2019 08:03:29 GMT
Pragma: no-cache
Server: nginx
Transfer-Encoding: chunked
Request Headers :
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Cookie: bb_t=a%3A3%3A%7Bi%3A2507902%3Bi%3A1550052944%3Bi%3A2508011%3Bi%3A1550120230%3Bi%3A2508126%3Bi%3A1550125516%3B%7D; bb_data=1-27969311-wXVPJGcedLE1I2mM9H0u-3106784170-1550128652-1550131390-3061288864-1
Host: website.net
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3701.0 Safari/537.36
Query String Parameters :
t: 2508126

This works for me:
data={'login_username': username, 'login_password': password, 'login': ''}
and using session.get() instead of requests.get(), so that the cookies stored on the session after login are sent with the later requests.
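Putting those two changes together, the function from the question might look like the sketch below. It takes the session as a parameter so the flow works with any object that offers post() and get(); in practice that would be a requests.Session(). The Referer header is an assumption based on the working request headers shown in the question, not something the answer states, and topic_id and out_path are illustrative parameter names.

```python
BASE = "https://website.net/forum"  # placeholder host used in the question

def download(session, username, password, topic_id, out_path="file.torrent"):
    """Log in, visit the download page, then fetch the file with the same session."""
    session.post(
        f"{BASE}/login.php",
        data={"login_username": username, "login_password": password, "login": ""},
    )
    page_url = f"{BASE}/viewtopic.php?t={topic_id}"
    # Visiting the page through the session lets it keep any cookies the site sets.
    session.get(page_url)
    # Assumption: the site may also check the Referer, as in the working browser request.
    response = session.get(f"{BASE}/dl.php?t={topic_id}", headers={"Referer": page_url})
    with open(out_path, "wb") as f:
        f.write(response.content)
    return response

# Usage (needs the requests package and real credentials, so it is not run here):
# import requests
# with requests.Session() as s:
#     download(s, "XXXXX", "YYYYY", 2508126)
```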

Getting bytes response from request

I'm trying to perform a request in Python 3 to a URL that should return JSON. Instead it returns a sequence of bytes that I'm unable to convert. Why am I receiving this type of response, and how can I convert it into human-readable data?
Below is a snippet of my code:
headers = {}
headers['Host']= 'XXXXX' # hidden
headers['Connection']= 'keep-alive'
headers['Content-Length']= '122'
headers['Accept']= 'application/json, text/javascript, */*; q=0.01'
headers['Origin']= 'XXXXX' # hidden
headers['X-Requested-With']= 'XMLHttpRequest'
headers['User-Agent']= 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
headers['Content-Type']= 'application/json'
headers['Referer']= 'XXXXX' # hidden
headers['Accept-Encoding']= 'gzip, deflate, br'
headers['Accept-Language']= 'pt-BR,pt;q=0.9,en-US;q=0.8,en;q=0.7'
headers['Cookie'] = 'XXXXX' # hidden
try:
    req = request.Request(url,post_data,headers)
    x = request.urlopen(req)
    print(x.read())
    print(x.info())
except Exception as e:
    print(e)
Below is the response received:
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03L\x8fAK\x031\x10\x85\xff\xca0\x07Q\x88\x899(\xb2\xd0\x93\xf4\xe2\xa1-z]\x90\xecf\xb6\x1b\xd9d\xca$-H\xe9\x7f7\x91\x8a^\x86\x997\xef\x1b\xde\x9c\xf1D\x92\x03\'\xec\xd0j\x8b\nI\x84\x05\xbb\xf3_\x13)g\xb7\xa7\xea\x88n\x99X"yx}\xdfn \x17\ti\xaf Q(3\t8\x11\xf7\xa5\x80\x87O\x1aK\x95\x8fq QW\x1bp5\x14\x8e\xaaV\x18g\'n,\x95\xe1i\xcaT\xe0\x01n\x07\xaa\xb7\t\xfa\xdfD\xab\x9a\xe7&R\x99\xd9\xaf\xd6Z\xeb\x1e\xef\x1aj\x8eYL\xae<\x99\x03\xc9\xf2hN\x94<\xcbG\x1bL\x8b\xa5\x0f\x11\x96\x90\x08\xec\xd3\xb3\xee\x13^\x14&\x17[\xfc\xb6}\xdb\xbd\xac\x7f\x1eS\xff\xfe\xda9\xc9\x04t\xd5G\xf6M\xb4\x8d\x0c\x1e\xbb{{\xf9\x06\x00\x00\xff\xff\x03\x00\xc4\xd9gg\'\x01\x00\x00'
Date: Wed, 26 Dec 2018 16:46:48 GMT
Server: Apache
Strict-Transport-Security: max-age=16070400
X-UA-Compatible: IE=Edge,chrome=1, IE=Edge,chrome=1
Content-Type: application/json; charset=utf-8
Vary: Accept-Encoding
Content-Encoding: gzip
X-Frame-Options: SAMEORIGIN
Connection: close
Transfer-Encoding: chunked

The response is gzip-compressed: note the Content-Encoding: gzip header.
Decompress it and then parse the result with json.loads.
Example:
import zlib
# x is the response object from the question
decompressed_data = zlib.decompress(x.read(), 16 + zlib.MAX_WBITS)
Another option is to tell the server you don't accept compressed content: remove gzip (and probably the other compression types) from the Accept-Encoding request header.
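The decompress-then-parse route can be tried end to end without a server. In this sketch the sample payload is made up for the demo and is compressed locally to stand in for the bytes returned by the response's read() call; note that the bytes printed in the question begin with \x1f\x8b, the gzip magic number.

```python
import gzip
import json
import zlib

# Stand-in for the raw response bytes; the payload is invented for this demo.
sample = {"status": "ok", "items": [1, 2, 3]}
raw = gzip.compress(json.dumps(sample).encode("utf-8"))

# 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
decompressed = zlib.decompress(raw, 16 + zlib.MAX_WBITS)
data = json.loads(decompressed.decode("utf-8"))
print(data)
```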

Try something like this; requests decompresses gzip responses automatically:
import requests
r = requests.post('your URL',data=YourData)
r.json()

How to get request headers rather than response headers using Python Requests

How can I grab the request headers for an XHR request using the Python Requests module? The following code seems to return the response headers:
import requests
r = requests.get('http://www.whoscored.com/tournamentsfeed/12496/Fixtures/?d=2015W50&isAggregate=false')
headers = r.headers
print(headers)
This returns an object that looks like this:
{'content-length': '624', 'content-encoding': 'gzip', 'expires': '-1', 'vary': 'Accept-Encoding', 'server': 'Microsoft-IIS/8.0', 'pragma': 'no-cache', 'cache-control': 'no-cache', 'date': 'Tue, 15 Dec 2015 14:41:34 GMT', 'x-powered-by': 'ASP.NET', 'content-type': 'text/html; charset=utf-8'}
However, when I look in Chrome developer tools the request header looks like this:
Host: www.whoscored.com
Connection: keep-alive
Accept: text/plain, */*; q=0.01
Model-Last-Mode: W50hFYr7jwZWt40WUb9udPVFxmB6g9yct204X0/gmf4=
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36
Referer: http://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6
Cookie: __gads=ID=d09f8c0cdc1a4258:T=1449875272:S=ALNI_MbTPDtXiIlHK49F4FOqdDap__pfCA; nlsnocrvu=1; OX_plg=swf|shk|pm; _ga=GA1.3.578623339.1449875271; _gat=1; _ga=GA1.2.578623339.1449875271
Can anyone assist?
Thanks

You need to check for request headers like this
r.request.headers
That would give you something like
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.7.0 CPython/2.7.10 Darwin/15.0.0'}
For obvious reasons it won't be the same as you see in the Chrome developer tools, because the browser adds its own headers which the requests module doesn't.
GET /tournamentsfeed/12496/Fixtures/?d=2015W50&isAggregate=false HTTP/1.1
Host: www.whoscored.com
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8,fr;q=0.6
Cookie: _ga=GA1.2.788154924.1450195026; _gat_as25n45=1
To get these headers you need to run some js code to pull the headers.
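If the goal is simply to see what requests itself would send, the headers can also be inspected without any network traffic by preparing the request through a session. This is a sketch that assumes the requests package is installed; the X-Requested-With value is taken from the Chrome headers in the question and is added here only to show how extra headers merge with the defaults.

```python
import requests

url = "http://www.whoscored.com/tournamentsfeed/12496/Fixtures/?d=2015W50&isAggregate=false"

session = requests.Session()
# Build the request and merge in the session defaults without sending anything.
prepared = session.prepare_request(
    requests.Request("GET", url, headers={"X-Requested-With": "XMLHttpRequest"})
)

# These are the headers requests would put on the wire for this call.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

After a real call, r.request.headers holds the same kind of mapping for the request that was actually sent.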
