I'm trying to download a few pictures from a website using Python. These are a couple of them:
https://www.innovasport.com/medias/IS-DB2455-728-1.jpg?context=bWFzdGVyfGltYWdlc3w3Mzg2MHxpbWFnZS9qcGVnfGltYWdlcy9oM2QvaDk5LzEwNTY3NzE3MDYwNjM4LmpwZ3wwYmJiZjA5MjZkZjNhZTQwMTZiYzdmNTVhM2RmZDhiMTY2ZjI2YzNkY2QzZmUwYmQxNzc5OTY2MTZlNGMxYzBi
https://www.innovasport.com/medias/playera-nike-jordan-jumpman-is-CJ0921-011-1.png?context=bWFzdGVyfGltYWdlc3w1MDg4MnxpbWFnZS9wbmd8aW1hZ2VzL2g3ZS9oMjgvOTc1OTk4MjY4MjE0Mi5wbmd8NTZlODA5YzhmMDZmOGMyYTBkODliMGM3NGE0NGE0YzBlOThhMTAzM2ZmMWMyODM4M2ZjNTVjNmNmZWExM2VkNw
When opening the picture URLs directly in my browser, they open without problems. But when using a simple Python script I keep getting a 403 error message. Is there a way to get past this error?
First I tried a simple requests.get like this:
import requests

tshirt_pic = requests.get("https://www.innovasport.com/medias/playera-nike-jordan-jumpman-is-CJ0921-011-1.png?context=bWFzdGVyfGltYWdlc3w1MDg4MnxpbWFnZS9wbmd8aW1hZ2VzL2g3ZS9oMjgvOTc1OTk4MjY4MjE0Mi5wbmd8NTZlODA5YzhmMDZmOGMyYTBkODliMGM3NGE0NGE0YzBlOThhMTAzM2ZmMWMyODM4M2ZjNTVjNmNmZWExM2VkNw").content
with open('my_pic.png', 'wb') as file:
    file.write(tshirt_pic)
But the only thing it downloaded was the error page's HTML.
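Checking the response before writing it to disk confirms it's the error page rather than the image (a quick diagnostic sketch using the same URL):
import requests

resp = requests.get("https://www.innovasport.com/medias/playera-nike-jordan-jumpman-is-CJ0921-011-1.png?context=bWFzdGVyfGltYWdlc3w1MDg4MnxpbWFnZS9wbmd8aW1hZ2VzL2g3ZS9oMjgvOTc1OTk4MjY4MjE0Mi5wbmd8NTZlODA5YzhmMDZmOGMyYTBkODliMGM3NGE0NGE0YzBlOThhMTAzM2ZmMWMyODM4M2ZjNTVjNmNmZWExM2VkNw")
print(resp.status_code)                  # 403
print(resp.headers.get('Content-Type'))  # text/html, i.e. the error page, not the image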
Then, investigating a bit, I found suggestions to use the Network tab in Chrome's developer tools to copy the request as cURL, translate it with https://curlconverter.com/, and use the result for all picture requests from the site. But that didn't work either. I copied the cURL command, translated it, and used it in my code like this, but I still got the 403 error.
import requests
cookies = {
'_gcl_au': '1.1.1635147955.1647278637',
'_vwo_uuid_v2': 'D7C88F448C90BA0FDC79062506EA49315|3249aaf2b173d1e5e3eb349b745491dd',
'_ALGOLIA': 'anonymous-cd27a5b3-6878-44ea-a8da-61284eefc793',
'_bamls_usid': 'd7c79c9a-0c5d-4e6d-a8e0-8b35f9873454',
'_vwo_uuid': 'D7C88F448C90BA0FDC79062506EA49315',
'scarab.visitor': '%225627060B81890552%22',
'mdLogger': 'false',
'kampyle_userid': '7a7c-b559-ced5-cdfe-7a9f-1b46-f8f6-5992',
'cd_user_id': '17f897547b459f-0ee040800429a-192b1e05-1fa400-17f897547b5b8e',
'_hjSessionUser_2536688': 'eyJpZCI6IjVkNjlkMzMxLWNlMDAtNTVhMi04NWY3LTIyZDcwNjRmNjYwNyIsImNyZWF0ZWQiOjE2NDcyNzg2Mzc5NjksImV4aXN0aW5nIjp0cnVlfQ==',
'BVBRANDID': 'b448d567-f852-45f4-b114-784aa3f60b22',
'scarab.profile': '%22000000000000210994%7C1647278780%22',
'_gcl_aw': 'GCL.1647291885.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB',
'_gac_UA-36216968-26': '1.1647291885.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB',
'_gac_UA-36216968-1': '1.1647291892.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB',
'DECLINED_DATE': '1647363860132',
'_ga': 'GA1.1.1904468361.1647278637',
'_vis_opt_s': '3%7C',
'kampyleUserSession': '1647443910700',
'kampyleUserSessionsCount': '12',
'kampyleSessionPageCounter': '1',
'kampyleUserPercentile': '76.69597099170795',
'_vwo_ds': '3%3Aa_0%2Ct_0%3A0%241647278636%3A14.83535232%3A%3A65_0%2C62_0%3A443_0%2C431_0%2C430_0%2C427_0%2C5_0%2C4_0%3A0',
'_ga_Z8VQ1XLMKW': 'GS1.1.1647482462.8.0.1647482462.60',
'_ga': 'GA1.2.1904468361.1647278637',
'_clck': 'ocen0m|1|ezu|0',
'_uetvid': '87e73510a3bb11ec845961db4a2ee8f4',
'__cf_bm': 'Wq7Ri6TwjPflvuTcmaxx_NBGVkBBtewayXNGkF6iRFY-1650921677-0-AenKxR/rPecqs7Ap4M3KUrJz1uWOsVHq8XxTeffRjRC39318w5Y5p5s7+izyRnzL9CxSdTMCKZpgIg1sQoyaPxQ=',
'cf_chl_2': 'a1a45f416d52eea',
'cf_chl_prog': 'b',
}
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Language': 'es-ES,es;q=0.9,en;q=0.8',
'Connection': 'keep-alive',
# Requests sorts cookies= alphabetically
# 'Cookie': '_gcl_au=1.1.1635147955.1647278637; _vwo_uuid_v2=D7C88F448C90BA0FDC79062506EA49315|3249aaf2b173d1e5e3eb349b745491dd; _ALGOLIA=anonymous-cd27a5b3-6878-44ea-a8da-61284eefc793; _bamls_usid=d7c79c9a-0c5d-4e6d-a8e0-8b35f9873454; _vwo_uuid=D7C88F448C90BA0FDC79062506EA49315; scarab.visitor=%225627060B81890552%22; mdLogger=false; kampyle_userid=7a7c-b559-ced5-cdfe-7a9f-1b46-f8f6-5992; cd_user_id=17f897547b459f-0ee040800429a-192b1e05-1fa400-17f897547b5b8e; _hjSessionUser_2536688=eyJpZCI6IjVkNjlkMzMxLWNlMDAtNTVhMi04NWY3LTIyZDcwNjRmNjYwNyIsImNyZWF0ZWQiOjE2NDcyNzg2Mzc5NjksImV4aXN0aW5nIjp0cnVlfQ==; BVBRANDID=b448d567-f852-45f4-b114-784aa3f60b22; scarab.profile=%22000000000000210994%7C1647278780%22; _gcl_aw=GCL.1647291885.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB; _gac_UA-36216968-26=1.1647291885.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB; _gac_UA-36216968-1=1.1647291892.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB; DECLINED_DATE=1647363860132; _ga=GA1.1.1904468361.1647278637; _vis_opt_s=3%7C; kampyleUserSession=1647443910700; kampyleUserSessionsCount=12; kampyleSessionPageCounter=1; kampyleUserPercentile=76.69597099170795; _vwo_ds=3%3Aa_0%2Ct_0%3A0%241647278636%3A14.83535232%3A%3A65_0%2C62_0%3A443_0%2C431_0%2C430_0%2C427_0%2C5_0%2C4_0%3A0; _ga_Z8VQ1XLMKW=GS1.1.1647482462.8.0.1647482462.60; _ga=GA1.2.1904468361.1647278637; _clck=ocen0m|1|ezu|0; _uetvid=87e73510a3bb11ec845961db4a2ee8f4; __cf_bm=Wq7Ri6TwjPflvuTcmaxx_NBGVkBBtewayXNGkF6iRFY-1650921677-0-AenKxR/rPecqs7Ap4M3KUrJz1uWOsVHq8XxTeffRjRC39318w5Y5p5s7+izyRnzL9CxSdTMCKZpgIg1sQoyaPxQ=; cf_chl_2=a1a45f416d52eea; cf_chl_prog=b',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Linux"',
}
params = {
'context': 'bWFzdGVyfGltYWdlc3w3Mzg2MHxpbWFnZS9qcGVnfGltYWdlcy9oM2QvaDk5LzEwNTY3NzE3MDYwNjM4LmpwZ3wwYmJiZjA5MjZkZjNhZTQwMTZiYzdmNTVhM2RmZDhiMTY2ZjI2YzNkY2QzZmUwYmQxNzc5OTY2MTZlNGMxYzBi',
}
response = requests.get('https://www.innovasport.com/medias/IS-DB2455-728-1.jpg', params=params, cookies=cookies, headers=headers)
print(response)
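The __cf_bm and cf_chl_* cookies in the copied request suggest the 403 comes from a Cloudflare challenge; those cookies expire quickly, so replaying them stops working. A minimal sketch of one common workaround, using the third-party cloudscraper package (an assumption that the block really is Cloudflare's JS challenge):
import cloudscraper

scraper = cloudscraper.create_scraper()  # drop-in replacement for requests.Session
response = scraper.get(
    'https://www.innovasport.com/medias/IS-DB2455-728-1.jpg',
    params={'context': 'bWFzdGVyfGltYWdlc3w3Mzg2MHxpbWFnZS9qcGVnfGltYWdlcy9oM2QvaDk5LzEwNTY3NzE3MDYwNjM4LmpwZ3wwYmJiZjA5MjZkZjNhZTQwMTZiYzdmNTVhM2RmZDhiMTY2ZjI2YzNkY2QzZmUwYmQxNzc5OTY2MTZlNGMxYzBi'},
)
# Only save the body if it is actually an image, not an error page.
if response.ok and response.headers.get('Content-Type', '').startswith('image/'):
    with open('my_pic.jpg', 'wb') as file:
        file.write(response.content)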
Related
I am getting a different response using the requests library in Python compared to the raw response data shown in Chrome dev tools.
The page: https://www.gerflor.co.uk/professionals-products/floors/taralay-impression-control.html
When clicking on one of the colour filter options, say 'Brown light', a request to 'get-colors.html' appears in the network tab. I have replicated this request with the appropriate headers and payload, yet I am getting a different response.
The response in the dev tools shows JSON, but when making this request in Python I get a transparent web page instead. Even clicking on the file in the dev tools to open it in a new tab brings up a transparent web page rather than the JSON response I am looking for. It seems as if this response is exclusive to the dev tools, and I cannot figure out how to recreate the request to get the desired response.
Here is what I have done:
import requests
import json
url = ("https://www.gerflor.co.uk/colors-enhancer/get-colors.html")
headers = {
'accept': 'application/json, text/plain, */*',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
'cache-control': 'no-cache',
'content-length': '72',
'content-type': 'application/json;charset=UTF-8',
'cookie': '_ga=GA1.3.1278783742.1660305222; _hjSessionUser_1471753=eyJpZCI6IjU5OWIyOTJjLTZkM2ItNThiNi1iYzI4LTAzMDA0ZmVhYzFjZSIsImNyZWF0ZWQiOjE2NjAzMDUyMjIzMzksImV4aXN0aW5nIjp0cnVlfQ==; ln_or=eyI2NTM1MSI6ImQifQ%3D%3D; valid_navigation=1; tarteaucitron=!hotjar=true!googletagmanager=true; _gid=GA1.3.1938727070.1673437106; cc_cookie_accept=cc_cookie_accept; fuel_csrf_token=78fd0611d0719f24c2b40f49fab7ccc13f7623d7b9350a97cd81b93695a6febf695420653980ff9cb210e383896f5978f0becffda036cf0575a1ce0ff4d7f5b5; _hjIncludedInSessionSample=0; _hjSession_1471753=eyJpZCI6IjA2ZTg5YjgyLWUzNTYtNDRkZS1iOWY4LTA1OTI2Yjg0Mjk0OCIsImNyZWF0ZWQiOjE2NzM0NDM1Njg1MjEsImluU2FtcGxlIjpmYWxzZX0=; _hjIncludedInPageviewSample=1; _hjAbsoluteSessionInProgress=0; fuelfid=arY7ozatUQWFOvY0HgkmZI8qYSa1FPLDmxHaLIrgXxwtF7ypHdBPuVtgoCbjTLu4_bELQd33yf9brInne0Q0SmdvR1dPd1VoaDEyaXFmZFlxaS15ZzdZcDliYThkU0gyVGtXdXQ5aVFDdVk; _gat_UA-2144775-3=1',
'origin': 'https://www.gerflor.co.uk',
'pragma': 'no-cache',
'referer': 'https://www.gerflor.co.uk/professionals-products/floors/taralay-impression-control.html',
'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
}
payload = {'decors': [], 'shades': ['10020302'], 'designs': [], 'productId': '100031445'}
response = requests.post(url, headers=headers, data=payload)
I should be getting a JSON response from here, but instead I only get the HTML text of a transparent web page. I also tried creating a requests.Session() and making the POST request through it, but I get the same result.
Anyone have any insight as to why this is happening and what can be done to resolve this?
Thank you.
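One thing stands out in the code above, offered as a hedged guess rather than a confirmed fix: the copied headers declare 'content-type: application/json', but data=payload sends the dictionary form-encoded, so the server may misparse the body and fall back to an HTML response. requests' json= parameter serializes the payload and sets the header itself; the hard-coded content-length and cookies are also best left for requests to compute. A minimal sketch (the fresh-cookie step is an assumption based on the fuel_csrf_token cookie above):
import requests

url = 'https://www.gerflor.co.uk/colors-enhancer/get-colors.html'
payload = {'decors': [], 'shades': ['10020302'], 'designs': [], 'productId': '100031445'}

with requests.Session() as session:
    # Visit the product page first so the session collects fresh cookies
    # (including fuel_csrf_token) instead of replaying stale copied ones.
    session.get('https://www.gerflor.co.uk/professionals-products/floors/taralay-impression-control.html',
                headers={'user-agent': 'Mozilla/5.0'})
    response = session.post(url, json=payload, headers={
        'accept': 'application/json, text/plain, */*',
        'referer': 'https://www.gerflor.co.uk/professionals-products/floors/taralay-impression-control.html',
        'user-agent': 'Mozilla/5.0',
    })
    print(response.headers.get('Content-Type'))  # expect application/json if it worked
    print(response.text[:200])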
I'm trying to get an HTML response in Python with requests, but I only get a 403, while in the Chrome browser the link works fine and the page loads: https://www.dell.com/support/home/en-us/product-support/servicetag/0-ek1RYjR0NnNuandqYVQ1NjdUMm9IZz090/overview
I've copied the exact headers of the successfully loaded page from the Chrome Developer Tools -> Network recording, but no luck (below).
import requests
headers = {
'authority': 'www.dell.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US;q=0.9,en;q=0.8',
'cache-control': 'max-age=0',
'sec-ch-ua': '"Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
response = requests.get('https://www.dell.com/support/home/en-us/product-support/servicetag/0-ek1RYjR0NnNuandqYVQ1NjdUMm9IZz090/overview', headers=headers)
print(response.status_code)
Also, requests.get('https://www.dell.com/support/home/en-us?lwp=rt') returns 200 with no problem.
Can't figure out what the difference between the browser and the Python request might be in this case.
UPD: Python 3.7.3, running from a Jupyter Notebook, but yes, that hardly matters; I tried running it from the Python console as well.
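When carefully copied headers still produce a 403, one plausible (unconfirmed) explanation is that the server fingerprints the TLS/HTTP2 handshake itself, which no amount of header tweaking in requests can change. A sketch under that assumption, using the third-party curl_cffi package to impersonate Chrome's handshake:
from curl_cffi import requests as curl_requests

response = curl_requests.get(
    'https://www.dell.com/support/home/en-us/product-support/servicetag/0-ek1RYjR0NnNuandqYVQ1NjdUMm9IZz090/overview',
    impersonate='chrome110',  # newer curl_cffi versions also accept plain 'chrome'
)
print(response.status_code)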
The Problem
I am trying to scrape a website, but I can't reach the table content when I post the request from Postman. I tried the requests library to get the info, and I tried the cloudscraper library to look like a real browser, but the table in the resulting HTML is empty. How can I solve this?
Screenshots
1 - The Form
2 - Result
Code
import requests
url = "https://www.turkiye.gov.tr/mersin-yenisehir-belediyesi-arsa-rayic-degeri-sorgulama?submit"
payload='btn=Sorgula&caddesokak=&id=&islem=&mahalle=27&token=%7B609B03-5C5357-904654-84788D-227746-F7EEF8-F661BE-1B3F90%7D&yil=2021'
headers = {
'sec-ch-ua': '"Google Chrome";v="95", "Chromium";v="95", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'Upgrade-Insecure-Requests': '1',
'DNT': '1',
'Content-Type': 'application/x-www-form-urlencoded',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'document',
'Cookie': 'TURKIYESESSIONID=9a8ab4rjv7oprv5atidcmlo95i; language=tr_TR.UTF-8; TS01ee3a52=015c1cbb6d657270d7a05c71f0c60353ad5d33d8832ac14f33c8078bc783d34e5862d30b42518895fc09263e263aa5d0c8ac69356e191fa7dfed849b6029e59b84d9634c98180a76df4845df847364cfd3771e1e8c; w3p=4090734784.20480.0000'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
The problem you're having is that Postman and the requests library don't execute JavaScript, and the site you're trying to scrape relies heavily on JavaScript. I checked in my browser: if you disable JS on that site, it returns a blank page. A workaround is the selenium library; it has a learning curve, but it will be able to scrape any site like that.
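A minimal selenium sketch of that workaround (hedged: the field names mahalle, yil, and btn are taken from the POST payload above, and the real page structure may differ):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.turkiye.gov.tr/mersin-yenisehir-belediyesi-arsa-rayic-degeri-sorgulama')

# Fill the form with the same values the raw POST payload used.
Select(driver.find_element(By.NAME, 'mahalle')).select_by_value('27')
Select(driver.find_element(By.NAME, 'yil')).select_by_value('2021')
driver.find_element(By.NAME, 'btn').click()

# Wait until JavaScript has rendered the result table, then read it.
table = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
)
print(table.text)
driver.quit()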
I have been trying to access this website https://www.dickssportinggoods.com/f/tents-accessories with the requests module, but it just keeps processing and does not stop, while the same website works fine in a browser. Scrapy gives a timeout error for the same website. Is there something that should be taken into account when accessing websites like these? Thanks.
For sites like these you can try adding the extra headers that your browser sends. Following these steps worked for me:
Open the link in an incognito window with the network tab open.
Copy the first request made by right-clicking -> Copy -> Copy as cURL.
Go to https://curl.trillworks.com/ and paste the cURL command to get the equivalent Python requests code.
Now try removing headers one by one until it works with the minimal set.
Image for reference - https://i.stack.imgur.com/vRS98.png
Edit -
import requests
headers = {
'authority': 'www.dickssportinggoods.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'sec-ch-ua-mobile': '?0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
}
response = requests.get('https://www.dickssportinggoods.com/f/tents-accessories', headers=headers)
print(response.text)
Have you tried adding headers?
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.dickssportinggoods.com/f/tents-accessories', headers=headers)
response.raise_for_status()
print(response.text)
So thanks to @Marcel and @Sonal, but apart from the headers, it only worked when I put the request in a try/except block.
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}
session = requests.Session()

# The return statements imply this ran inside a function; wrapping it in one
# (the name fetch is illustrative) makes the snippet valid on its own.
def fetch(link):
    try:
        r = session.get(link, headers=headers, stream=True)
        return r
    except requests.exceptions.ConnectionError:
        # r does not exist in this branch, so return a marker
        # instead of assigning to r.status_code
        return "Connection refused"
I am unable to log in to a website using requests and fetch the API data behind an account. The request's payload matches the form data used when logging in normally.
My code is as follows:
urlpage = 'https://speechanddebate.org/login'
header = {'User-Agent': 'Chrome/84.0.4147.89'}
payload = {'log': "email#gmail.com",
'pwd': "password",
'wp-submit': 'Log In',
'rememberme': 'forever',
'redirect_to': '/account',
'testcookie': '1'}
session = requests.Session()
test = session.post(urlpage, headers = header, data = payload)
I used Inspect Element to find what data is sent via POST when I log in normally rather than through web scraping, and checked the result under the Network tab.
I am not sure what I am doing differently compared to the other Stack Overflow answers out there. Here's a list of code modifications I've tried:
Without sessions, just doing a normal request
Making the data URL-encoded
Using a with requests.Session() as session: block instead of just session = requests.Session()
POST with headers and without headers, etc.
When I log in normally, I get status code 302, indicating that the login succeeded and I was redirected to another page. However, when I do it through web scraping, the login fails: it returns status code 200 and sends me back to the login page.
Try
headers = {
'authority': 'www.speechanddebate.org',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
'origin': 'https://www.speechanddebate.org',
'content-type': 'application/x-www-form-urlencoded',
'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'referer': 'https://www.speechanddebate.org/login/',
'accept-language': 'en-US,en;q=0.9',
}
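A sketch of how those headers might be wired into the full login flow (hedged: the wp-submit and testcookie fields in the question's payload suggest a standard WordPress login, which rejects the POST unless the test cookie from a prior GET is present):
import requests

session = requests.Session()
# GET the login page first so WordPress can set its test cookie.
session.get('https://www.speechanddebate.org/login/', headers=headers)
# Reuse the payload from the question; note the trailing slash and www,
# matching the referer header above.
response = session.post('https://www.speechanddebate.org/login/',
                        headers=headers, data=payload)
# A 302 in response.history suggests the login was accepted.
print(response.history, response.url)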