Downloading an image with requests does not work correctly - python

I have a question.
Is there any way to download an image that is currently displayed on a website using requests, but without fetching it again from its URL?
In my case that does not work, because the image shown on the page differs from the one I get from the link: the image changes every time the link is requested. I want to download exactly what is on the page so that I can retype the code shown in it.
Previously I used Selenium and its screenshot option for this, but I have already rewritten all of my code to use requests and this is the only piece I am missing.
Does anyone have an idea how to download the image that is currently on the page?
Below is the code with links:
import requests
from requests_html import HTMLSession
headers = {
'Content-Type': 'image/png',
'Host': 'www.oglaszamy24.pl',
'Connection': 'keep-alive',
'Cache-Control': 'max-age=0',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-GPC': '1',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'document',
'Referer': 'https://www.oglaszamy24.pl/dodaj-ogloszenie2.php?c1=8&c2=40&at=1',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7'
}
session = HTMLSession()
r = session.get('https://www.oglaszamy24.pl/dodaj-ogloszenie2.php?c1=8&c2=40&at=1')
r.html.render(sleep=2,timeout=20)
links = r.html.find("#captcha_img")
result = str(links)
results = result.split("src=")[1].split("'")[1]
resultss = "https://www.oglaszamy24.pl/"+results
with open('image.png', 'wb') as f:
    f.write(requests.get(resultss, headers=headers).content)

I'd rather use PIL (the Python Imaging Library, maintained today as Pillow) and crop a screenshot of the page to the element's bounding box (coordinates and size). You can get those from a browser automation library like Selenium; BS4 (BeautifulSoup) only parses the HTML and has no layout information.
Then you'd have a local copy (a screenshot) of what the user would see.
A lot of sites have protection against scrapers, and CAPTCHA services usually do not allow their resources to be downloaded, either via requests or otherwise.
But like that NFT joke goes: you don't download a screenshot...
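The cropping step described above can be sketched as follows. This is a minimal sketch, not the answerer's code: crop_box is a hypothetical helper, the numbers are made up, and the scale parameter is an assumption for displays where the screenshot has more pixels than the page has CSS pixels. With Selenium, the two dictionaries would come from element.location and element.size.

```python
def crop_box(location, size, scale=1.0):
    """Convert an element's location and size into a (left, upper, right, lower) box."""
    left = location["x"] * scale
    upper = location["y"] * scale
    right = left + size["width"] * scale
    lower = upper + size["height"] * scale
    # Image coordinates are whole pixels, so round each edge.
    return tuple(round(v) for v in (left, upper, right, lower))


# Hypothetical values standing in for element.location and element.size.
box = crop_box({"x": 120, "y": 340}, {"width": 200, "height": 60})
print(box)
```

The resulting tuple is in the (left, upper, right, lower) order that Pillow's Image.crop expects, so Image.open('page.png').crop(box) would cut the element out of a full-page screenshot.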

Related

Raw response data different than response from requests

I am getting a different response using the requests library in Python than the raw response data shown in Chrome DevTools.
The page: https://www.gerflor.co.uk/professionals-products/floors/taralay-impression-control.html
When clicking one of the colour filter options, say 'Brown light', a request named 'get-colors.html' appears in the Network tab. I have replicated this request with the appropriate headers and payload, yet I am getting a different response.
DevTools shows a JSON response, but when I make this request in Python I get a transparent web page. Even opening the request in a new tab from DevTools shows a transparent web page rather than the JSON response I am looking for. It seems as if this response is only available when viewed within DevTools, and I cannot figure out how to recreate the request to get the desired response.
Here is what I have done:
import requests
import json
url = ("https://www.gerflor.co.uk/colors-enhancer/get-colors.html")
headers = {'accept': 'application/json, text/plain, */*', 'accept-encoding': 'gzip, deflate, br', 'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8', 'cache-control': 'no-cache', 'content-length': '72', 'content-type': 'application/json;charset=UTF-8', 'cookie': '_ga=GA1.3.1278783742.1660305222; _hjSessionUser_1471753=eyJpZCI6IjU5OWIyOTJjLTZkM2ItNThiNi1iYzI4LTAzMDA0ZmVhYzFjZSIsImNyZWF0ZWQiOjE2NjAzMDUyMjIzMzksImV4aXN0aW5nIjp0cnVlfQ==; ln_or=eyI2NTM1MSI6ImQifQ%3D%3D; valid_navigation=1; tarteaucitron=!hotjar=true!googletagmanager=true; _gid=GA1.3.1938727070.1673437106; cc_cookie_accept=cc_cookie_accept; fuel_csrf_token=78fd0611d0719f24c2b40f49fab7ccc13f7623d7b9350a97cd81b93695a6febf695420653980ff9cb210e383896f5978f0becffda036cf0575a1ce0ff4d7f5b5; _hjIncludedInSessionSample=0; _hjSession_1471753=eyJpZCI6IjA2ZTg5YjgyLWUzNTYtNDRkZS1iOWY4LTA1OTI2Yjg0Mjk0OCIsImNyZWF0ZWQiOjE2NzM0NDM1Njg1MjEsImluU2FtcGxlIjpmYWxzZX0=; _hjIncludedInPageviewSample=1; _hjAbsoluteSessionInProgress=0; fuelfid=arY7ozatUQWFOvY0HgkmZI8qYSa1FPLDmxHaLIrgXxwtF7ypHdBPuVtgoCbjTLu4_bELQd33yf9brInne0Q0SmdvR1dPd1VoaDEyaXFmZFlxaS15ZzdZcDliYThkU0gyVGtXdXQ5aVFDdVk; _gat_UA-2144775-3=1', 'origin': 'https://www.gerflor.co.uk', 'pragma': 'no-cache', 'referer': 'https://www.gerflor.co.uk/professionals-products/floors/taralay-impression-control.html', 'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"', 'sec-ch-ua-mobile': '?0', 'sec-ch-ua-platform': '"Windows"', 'sec-fetch-dest': 'empty', 'sec-fetch-mode': 'cors', 'sec-fetch-site': 'same-origin', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
payload = {'decors': [], 'shades': ['10020302'], 'designs': [], 'productId': '100031445'}
response = requests.post(url, headers=headers, data=payload)
I should be getting a JSON response here, but instead I am only getting the HTML of a transparent web page. I have tried using requests.Session() and making the POST request that way, but I still get the same result.
Does anyone have any insight into why this is happening and what can be done to resolve it?
Thank you.
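One thing that may be worth checking here, although it is not confirmed for this site: requests form-encodes a dictionary passed as data=, whereas the copied content-type header announces JSON. The sketch below uses only the standard library to compare the two encodings of the payload above; urlencode is used as a stand-in for what data= produces.

```python
import json
from urllib.parse import urlencode

payload = {'decors': [], 'shades': ['10020302'], 'designs': [], 'productId': '100031445'}

# Form encoding, roughly what data=<dict> sends. Empty lists vanish entirely.
form_body = urlencode(payload, doseq=True)

# Compact JSON, the shape a browser produces with JSON.stringify.
json_body = json.dumps(payload, separators=(',', ':'))

print(form_body)                        # shades=10020302&productId=100031445
print(json_body)
print(len(json_body.encode('utf-8')))   # 72
```

The compact JSON body is exactly 72 bytes, which matches the content-length header copied from the browser and suggests the browser was sending JSON. In requests that corresponds to requests.post(url, headers=headers, json=payload); the hard-coded content-length header can then be left out, since requests computes it.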

Scraping image source from an interactive map

I would like to scrape the source URLs of images on an interactive map that shows the locations of various traffic cameras, and export them to JSON or CSV as a list. Because I am trying to gather this data from different websites, I have tried ParseHub and Octoparse, to no avail. I previously attempted to use BS4 and Selenium but wasn't able to extract the div / img tag with the src URL. Any help would be appreciated. Below are examples of two websites with similar but different methods of housing the images.
https://tripcheck.com/
https://cwwp2.dot.ca.gov/vm/iframemap.htm (Caltrans uses iframes)
The image names (for TripCheck) come from an API call. You would need to request the CCTV IDs, then you can build the URLs.
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0',
'Accept': '*/*',
'Accept-Language': 'en-US,en;q=0.5',
'X-Requested-With': 'XMLHttpRequest',
'DNT': '1',
'Connection': 'keep-alive',
'Referer': 'https://tripcheck.com/',
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
}
params = {
'dt': '1659122796377',
}
response = requests.get('https://tripcheck.com/Scripts/map/data/cctvinventory.js', params=params, headers=headers)
response.json()['features'][0]['attributes']['filename']
output:
'AstoriaUS101MeglerBrNB_pid392.jpg'
Above, you would iterate over the features array in the JSON response and read each entry's attributes. Then, for the URL:
import time
cam = response.json()['features'][0]['attributes']['filename']
rand = str(int(time.time() * 1000))  # current time in milliseconds (13 digits)
f'https://tripcheck.com/RoadCams/cams/{cam}?rand={rand}'
output:
'https://tripcheck.com/RoadCams/cams/AstoriaUS101MeglerBrNB_pid392.jpg?rand=1659123325440'
Note that the rand parameter appears to be a timestamp in milliseconds, as does the 'dt' parameter in the original request. You can use time.time() to generate a timestamp and manipulate it as needed.
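Putting the two steps together, a sketch for collecting every camera URL could look like this. It assumes the inventory keeps the ['features'][i]['attributes']['filename'] shape shown above; camera_urls is a hypothetical helper, and the sample data is a hand-written stand-in for response.json() (the second filename is invented).

```python
import time


def camera_urls(inventory, now_ms=None):
    """Build one image URL per camera in an inventory shaped like the response above."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # 13-digit millisecond timestamp
    urls = []
    for feature in inventory.get('features', []):
        filename = feature.get('attributes', {}).get('filename')
        if filename:  # skip entries that carry no image name
            urls.append(f'https://tripcheck.com/RoadCams/cams/{filename}?rand={now_ms}')
    return urls


# Stand-in for response.json(); only the first filename comes from the real output.
sample = {'features': [
    {'attributes': {'filename': 'AstoriaUS101MeglerBrNB_pid392.jpg'}},
    {'attributes': {'filename': 'Example_pid000.jpg'}},
    {'attributes': {}},
]}
urls = camera_urls(sample, now_ms=1659123325440)
print(urls[0])
```

The list can then be written out with the json or csv module, which covers the export the question asks for.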

Saving cookies across requests to request headers

I am working on a project in which I need to log in to icloud.com using requests. I first tried doing it myself, but then switched to the pyicloud library, which handles the login and completes the 2FA for me. After logging in, I need to create Hide My Email addresses, which the library doesn't do, so I tried to do that myself with POST and GET requests. However, I want to package this and make it user friendly, so that the user doesn't need to touch the code: it should automatically get the cookies and put them in the request headers. This is my main problem.
This is my code:
from pyicloud import PyiCloudService
import requests
import json
session = requests.Session()
api = PyiCloudService('mail', 'password')
# here is the 2fa and login function, but after this comment user is logged in
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection': 'keep-alive',
'Content-Length': '2',
'Content-Type': 'text/plain',
'Origin': 'https://www.icloud.com',
'Referer': 'https://www.icloud.com/',
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-site',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="101", "Google Chrome";v="101"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"macOS"'
}
session.get('https://icloud.com/settings/')
r = session.post('https://p113-maildomainws.icloud.com/v1/hme/generate?clientBuildNumber=2215Project36&clientMasteringNumber=2215B21&clientId=8b343412-32c8-43d6-9b36-ffc417865d6e&dsid=8267218741', headers=headers, json={})
print(r.text)
With the cookie entered manually into the headers, it prints this:
{"success":true,"timestamp":1653818738,"result":{"hme":"clones.lacks_0d#icloud.com"}}
Without the cookie, which I want to be added to the headers automatically, it prints this:
{"reason":"Missing X-APPLE-WEBAUTH-USER cookie","error":1}
I tried creating
session = requests.Session()
and, as another user suggested, calling
session.get('https://icloud.com/settings/')
but this doesn't work either.
I need to somehow get the 'cookie': 'x' entry into the headers without changing the headers manually, maybe using something from the response headers.
Any help will be appreciated.
Thank you, and have a nice day :)
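A likely reason the fresh requests.Session() has no cookies is that it is a different object from the session pyicloud logged in with, so nothing is shared between them. Below is a minimal sketch of copying cookies from one jar to another, using only the standard library: make_cookie and copy_cookies are hypothetical helpers and the cookie value is a placeholder. In requests, session.cookies is a CookieJar subclass, so the same loop applies; whether pyicloud exposes its logged-in session (for example as api.session) is an assumption to verify against the library.

```python
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain):
    """Build a minimal session cookie (helper for this sketch only)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=True, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def copy_cookies(source, target):
    """Copy every cookie from one jar into another and return how many were copied."""
    count = 0
    for cookie in source:
        target.set_cookie(cookie)
        count += 1
    return count


# Stand-in for the jar of the session that performed the login.
logged_in = CookieJar()
logged_in.set_cookie(make_cookie("X-APPLE-WEBAUTH-USER", "placeholder", ".icloud.com"))

# Stand-in for the jar of the new, empty session.
fresh = CookieJar()
copied = copy_cookies(logged_in, fresh)
names = [cookie.name for cookie in fresh]
print(copied, names)
```

Once the cookies are in the session's jar, requests adds the Cookie header itself for matching domains, so no 'cookie' entry has to be written into the headers by hand.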

Getting 403 when trying to download an image

I'm trying to download a few pictures from a website using Python. These are a couple of them:
https://www.innovasport.com/medias/IS-DB2455-728-1.jpg?context=bWFzdGVyfGltYWdlc3w3Mzg2MHxpbWFnZS9qcGVnfGltYWdlcy9oM2QvaDk5LzEwNTY3NzE3MDYwNjM4LmpwZ3wwYmJiZjA5MjZkZjNhZTQwMTZiYzdmNTVhM2RmZDhiMTY2ZjI2YzNkY2QzZmUwYmQxNzc5OTY2MTZlNGMxYzBi
https://www.innovasport.com/medias/playera-nike-jordan-jumpman-is-CJ0921-011-1.png?context=bWFzdGVyfGltYWdlc3w1MDg4MnxpbWFnZS9wbmd8aW1hZ2VzL2g3ZS9oMjgvOTc1OTk4MjY4MjE0Mi5wbmd8NTZlODA5YzhmMDZmOGMyYTBkODliMGM3NGE0NGE0YzBlOThhMTAzM2ZmMWMyODM4M2ZjNTVjNmNmZWExM2VkNw
When opening the picture URLs directly in my browser, they open without problems. But when using a simple Python script I keep getting a 403 error message. Is there a way to get past this error?
First I tried a simple requests.get like this:
tshirt_pic = requests.get("https://www.innovasport.com/medias/playera-nike-jordan-jumpman-is-CJ0921-011-1.png?context=bWFzdGVyfGltYWdlc3w1MDg4MnxpbWFnZS9wbmd8aW1hZ2VzL2g3ZS9oMjgvOTc1OTk4MjY4MjE0Mi5wbmd8NTZlODA5YzhmMDZmOGMyYTBkODliMGM3NGE0NGE0YzBlOThhMTAzM2ZmMWMyODM4M2ZjNTVjNmNmZWExM2VkNw").content
with open('my_pic.png', 'wb') as file:
    file.write(tshirt_pic)
But the only thing it downloaded was the error HTML.
Then, investigating a bit, I got suggestions to use the Network tab of the Chrome developer tools to copy the request as cURL, translate it using https://curlconverter.com/ and use the result for all picture requests from the site. But that didn't work either. I copied the cURL command, translated it, and used it in my code like this, but I still got the 403 error.
import requests
cookies = {
'_gcl_au': '1.1.1635147955.1647278637',
'_vwo_uuid_v2': 'D7C88F448C90BA0FDC79062506EA49315|3249aaf2b173d1e5e3eb349b745491dd',
'_ALGOLIA': 'anonymous-cd27a5b3-6878-44ea-a8da-61284eefc793',
'_bamls_usid': 'd7c79c9a-0c5d-4e6d-a8e0-8b35f9873454',
'_vwo_uuid': 'D7C88F448C90BA0FDC79062506EA49315',
'scarab.visitor': '%225627060B81890552%22',
'mdLogger': 'false',
'kampyle_userid': '7a7c-b559-ced5-cdfe-7a9f-1b46-f8f6-5992',
'cd_user_id': '17f897547b459f-0ee040800429a-192b1e05-1fa400-17f897547b5b8e',
'_hjSessionUser_2536688': 'eyJpZCI6IjVkNjlkMzMxLWNlMDAtNTVhMi04NWY3LTIyZDcwNjRmNjYwNyIsImNyZWF0ZWQiOjE2NDcyNzg2Mzc5NjksImV4aXN0aW5nIjp0cnVlfQ==',
'BVBRANDID': 'b448d567-f852-45f4-b114-784aa3f60b22',
'scarab.profile': '%22000000000000210994%7C1647278780%22',
'_gcl_aw': 'GCL.1647291885.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB',
'_gac_UA-36216968-26': '1.1647291885.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB',
'_gac_UA-36216968-1': '1.1647291892.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB',
'DECLINED_DATE': '1647363860132',
'_ga': 'GA1.1.1904468361.1647278637',
'_vis_opt_s': '3%7C',
'kampyleUserSession': '1647443910700',
'kampyleUserSessionsCount': '12',
'kampyleSessionPageCounter': '1',
'kampyleUserPercentile': '76.69597099170795',
'_vwo_ds': '3%3Aa_0%2Ct_0%3A0%241647278636%3A14.83535232%3A%3A65_0%2C62_0%3A443_0%2C431_0%2C430_0%2C427_0%2C5_0%2C4_0%3A0',
'_ga_Z8VQ1XLMKW': 'GS1.1.1647482462.8.0.1647482462.60',
'_ga': 'GA1.2.1904468361.1647278637',
'_clck': 'ocen0m|1|ezu|0',
'_uetvid': '87e73510a3bb11ec845961db4a2ee8f4',
'__cf_bm': 'Wq7Ri6TwjPflvuTcmaxx_NBGVkBBtewayXNGkF6iRFY-1650921677-0-AenKxR/rPecqs7Ap4M3KUrJz1uWOsVHq8XxTeffRjRC39318w5Y5p5s7+izyRnzL9CxSdTMCKZpgIg1sQoyaPxQ=',
'cf_chl_2': 'a1a45f416d52eea',
'cf_chl_prog': 'b',
}
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Language': 'es-ES,es;q=0.9,en;q=0.8',
'Connection': 'keep-alive',
# Requests sorts cookies= alphabetically
# 'Cookie': '_gcl_au=1.1.1635147955.1647278637; _vwo_uuid_v2=D7C88F448C90BA0FDC79062506EA49315|3249aaf2b173d1e5e3eb349b745491dd; _ALGOLIA=anonymous-cd27a5b3-6878-44ea-a8da-61284eefc793; _bamls_usid=d7c79c9a-0c5d-4e6d-a8e0-8b35f9873454; _vwo_uuid=D7C88F448C90BA0FDC79062506EA49315; scarab.visitor=%225627060B81890552%22; mdLogger=false; kampyle_userid=7a7c-b559-ced5-cdfe-7a9f-1b46-f8f6-5992; cd_user_id=17f897547b459f-0ee040800429a-192b1e05-1fa400-17f897547b5b8e; _hjSessionUser_2536688=eyJpZCI6IjVkNjlkMzMxLWNlMDAtNTVhMi04NWY3LTIyZDcwNjRmNjYwNyIsImNyZWF0ZWQiOjE2NDcyNzg2Mzc5NjksImV4aXN0aW5nIjp0cnVlfQ==; BVBRANDID=b448d567-f852-45f4-b114-784aa3f60b22; scarab.profile=%22000000000000210994%7C1647278780%22; _gcl_aw=GCL.1647291885.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB; _gac_UA-36216968-26=1.1647291885.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB; _gac_UA-36216968-1=1.1647291892.Cj0KCQjwz7uRBhDRARIsAFqjulk5bAJN9VM34Mvu46mlftXA-pi6u3ihl_b1WF2cYYGRHiiBqQv-IlMaAiEIEALw_wcB; DECLINED_DATE=1647363860132; _ga=GA1.1.1904468361.1647278637; _vis_opt_s=3%7C; kampyleUserSession=1647443910700; kampyleUserSessionsCount=12; kampyleSessionPageCounter=1; kampyleUserPercentile=76.69597099170795; _vwo_ds=3%3Aa_0%2Ct_0%3A0%241647278636%3A14.83535232%3A%3A65_0%2C62_0%3A443_0%2C431_0%2C430_0%2C427_0%2C5_0%2C4_0%3A0; _ga_Z8VQ1XLMKW=GS1.1.1647482462.8.0.1647482462.60; _ga=GA1.2.1904468361.1647278637; _clck=ocen0m|1|ezu|0; _uetvid=87e73510a3bb11ec845961db4a2ee8f4; __cf_bm=Wq7Ri6TwjPflvuTcmaxx_NBGVkBBtewayXNGkF6iRFY-1650921677-0-AenKxR/rPecqs7Ap4M3KUrJz1uWOsVHq8XxTeffRjRC39318w5Y5p5s7+izyRnzL9CxSdTMCKZpgIg1sQoyaPxQ=; cf_chl_2=a1a45f416d52eea; cf_chl_prog=b',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Linux"',
}
params = {
'context': 'bWFzdGVyfGltYWdlc3w3Mzg2MHxpbWFnZS9qcGVnfGltYWdlcy9oM2QvaDk5LzEwNTY3NzE3MDYwNjM4LmpwZ3wwYmJiZjA5MjZkZjNhZTQwMTZiYzdmNTVhM2RmZDhiMTY2ZjI2YzNkY2QzZmUwYmQxNzc5OTY2MTZlNGMxYzBi',
}
response = requests.get('https://www.innovasport.com/medias/IS-DB2455-728-1.jpg', params=params, cookies=cookies, headers=headers)
print(response)
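Independently of what causes the 403, it can help to check what was actually downloaded before writing it to a .png or .jpg file, so that an error page is not saved as an image. A minimal sketch based on the well-known file signatures; sniff_payload is a hypothetical helper name.

```python
def sniff_payload(content):
    """Guess what a downloaded body contains from its first bytes."""
    if content.startswith(b'\x89PNG\r\n\x1a\n'):  # PNG signature
        return 'png'
    if content.startswith(b'\xff\xd8\xff'):       # JPEG start-of-image marker
        return 'jpeg'
    head = content.lstrip()[:15].lower()
    if head.startswith((b'<!doctype html', b'<html')):
        return 'html'
    return 'unknown'


# An error page saved as my_pic.png would be reported as HTML, not as an image.
print(sniff_payload(b'<!DOCTYPE html><html><body>403 Forbidden</body></html>'))
```

With requests this would be applied to response.content, alongside a check of response.status_code.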

Download PDF from PeerJ

I am trying to use Python requests to download a PDF from PeerJ. For example, https://peerj.com/articles/1.pdf.
My code is simply:
r = requests.get('https://peerj.com/articles/1.pdf')
However, the Response object returned displays as <Response [432]>, which indicates an HTTP 432 error. As far as I know, that error code is not assigned.
When I examine r.text or r.content, there is some HTML which says that it's an error 432 and gives a link to the same PDF, https://peerj.com/articles/1.pdf.
I can view the PDF when I open it in my browser (Chrome).
How do I get the actual PDF (as a bytes object, like I should get from r.content)?
While opening the page you mentioned, I also opened the developer tools in my Firefox browser, copied the HTTP request headers from there, and passed them to the headers parameter of the requests.get function.
a = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.5',
'Connection': 'keep-alive',
'Host': 'peerj.com',
'Referer': 'https://peerj.com/articles/1.pdf',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}
r = requests.get('https://peerj.com/articles/1.pdf', headers=a)
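When headers are copied wholesale from the browser like this, a few of them are usually better left to the client: requests derives Host and Content-Length itself, and advertising br in Accept-Encoding is only useful if a Brotli decoder is installed. Here is a small sketch of trimming a copied header dictionary; clean_copied_headers is a hypothetical helper, and which of the remaining headers the site actually requires is not established here.

```python
# Headers that the HTTP client normally derives from the URL and body itself.
CLIENT_MANAGED = {"host", "content-length"}


def clean_copied_headers(headers):
    """Return a copy of browser-copied headers without client-managed entries."""
    cleaned = {}
    for name, value in headers.items():
        key = name.lower()
        if key in CLIENT_MANAGED:
            continue
        if key == "accept-encoding":
            # Keep only encodings that are decoded without extra packages.
            encodings = [e.strip() for e in value.split(",")]
            value = ", ".join(e for e in encodings if e != "br")
        cleaned[name] = value
    return cleaned


copied = {
    "Accept-Encoding": "gzip, deflate, br",
    "Host": "peerj.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0",
}
cleaned = clean_copied_headers(copied)
print(cleaned)
```

The cleaned dictionary can be passed as headers= in place of the original.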
