Download PDF from PeerJ - python

I am trying to use Python requests to download a PDF from PeerJ. For example, https://peerj.com/articles/1.pdf.
My code is simply:
r = requests.get('https://peerj.com/articles/1.pdf')
However, the Response object returned displays as <Response [432]>, which indicates an HTTP 432 error. As far as I know, that error code is not assigned.
When I examine r.text or r.content, there is some HTML which says that it's an error 432 and gives a link to the same PDF, https://peerj.com/articles/1.pdf.
I can view the PDF when I open it in my browser (Chrome).
How do I get the actual PDF (as a bytes object, like I should get from r.content)?

While opening the site you mentioned, I also opened the developer tools in my Firefox browser, copied the HTTP request headers from there, and assigned them to the headers parameter of the requests.get function:
a = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.5',
'Connection': 'keep-alive',
'Host': 'peerj.com',
'Referer': 'https://peerj.com/articles/1.pdf',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}
r = requests.get('https://peerj.com/articles/1.pdf', headers=a)
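For what it's worth, a Session can make this approach slightly more robust, since any cookies the server sets on an earlier response get replayed automatically on later requests. The sketch below (it may still not satisfy every server-side check) also shows how to inspect the merged headers without sending anything over the network:

```python
import requests

# A sketch, not a guaranteed fix: keep browser-like headers on a Session
# so they (plus any stored cookies) apply to every request it makes.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) '
                  'Gecko/20100101 Firefox/95.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
})

# Preparing the request without sending it shows the headers that would
# actually go on the wire.
prepared = session.prepare_request(
    requests.Request('GET', 'https://peerj.com/articles/1.pdf')
)
print(prepared.headers['User-Agent'])
```

Once a `session.get(...)` of the PDF succeeds, the bytes are in `r.content` as usual.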


Raw response data different than response from requests

I am getting a different response using the requests library in python compared to the raw response data shown in chrome dev tools.
The page: https://www.gerflor.co.uk/professionals-products/floors/taralay-impression-control.html
When clicking on the colour filter options for say the colour 'Brown light', a request appears in the network tab 'get-colors.html'. I have replicated this request with the appropriate headers and payload, yet I am getting a different response.
The response in the dev tools shows a JSON response, but when making this request in Python I get a transparent web page. Even opening the file in a new tab from the dev tools shows a transparent web page rather than the JSON response I am looking for. It seems this response is only available within the dev tools, and I cannot figure out how to recreate the request to get the desired response.
Here is what I have done:
import requests
import json
url = "https://www.gerflor.co.uk/colors-enhancer/get-colors.html"
headers = {'accept': 'application/json, text/plain, */*', 'accept-encoding': 'gzip, deflate, br', 'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8', 'cache-control': 'no-cache', 'content-length': '72', 'content-type': 'application/json;charset=UTF-8', 'cookie': '_ga=GA1.3.1278783742.1660305222; _hjSessionUser_1471753=eyJpZCI6IjU5OWIyOTJjLTZkM2ItNThiNi1iYzI4LTAzMDA0ZmVhYzFjZSIsImNyZWF0ZWQiOjE2NjAzMDUyMjIzMzksImV4aXN0aW5nIjp0cnVlfQ==; ln_or=eyI2NTM1MSI6ImQifQ%3D%3D; valid_navigation=1; tarteaucitron=!hotjar=true!googletagmanager=true; _gid=GA1.3.1938727070.1673437106; cc_cookie_accept=cc_cookie_accept; fuel_csrf_token=78fd0611d0719f24c2b40f49fab7ccc13f7623d7b9350a97cd81b93695a6febf695420653980ff9cb210e383896f5978f0becffda036cf0575a1ce0ff4d7f5b5; _hjIncludedInSessionSample=0; _hjSession_1471753=eyJpZCI6IjA2ZTg5YjgyLWUzNTYtNDRkZS1iOWY4LTA1OTI2Yjg0Mjk0OCIsImNyZWF0ZWQiOjE2NzM0NDM1Njg1MjEsImluU2FtcGxlIjpmYWxzZX0=; _hjIncludedInPageviewSample=1; _hjAbsoluteSessionInProgress=0; fuelfid=arY7ozatUQWFOvY0HgkmZI8qYSa1FPLDmxHaLIrgXxwtF7ypHdBPuVtgoCbjTLu4_bELQd33yf9brInne0Q0SmdvR1dPd1VoaDEyaXFmZFlxaS15ZzdZcDliYThkU0gyVGtXdXQ5aVFDdVk; _gat_UA-2144775-3=1', 'origin': 'https://www.gerflor.co.uk', 'pragma': 'no-cache', 'referer': 'https://www.gerflor.co.uk/professionals-products/floors/taralay-impression-control.html', 'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"', 'sec-ch-ua-mobile': '?0', 'sec-ch-ua-platform': '"Windows"', 'sec-fetch-dest': 'empty', 'sec-fetch-mode': 'cors', 'sec-fetch-site': 'same-origin', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
payload = {'decors': [], 'shades': ['10020302'], 'designs': [], 'productId': '100031445'}
response = requests.post(url, headers=headers, data=payload)
I should be getting a JSON response here, but instead I am only getting the HTML text of a transparent web page. I have also tried creating a requests.Session() and making the POST request through it, but with the same result.
Anyone have any insight as to why this is happening and what can be done to resolve this?
Thank you.
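One detail worth checking in the snippet above (a guess, not a confirmed fix): passing a dict via data= form-encodes it, even though the copied Content-Type header claims application/json. Using json= serializes the payload the way the browser did. Preparing both variants offline shows the difference:

```python
import requests

url = "https://www.gerflor.co.uk/colors-enhancer/get-colors.html"
payload = {'decors': [], 'shades': ['10020302'], 'designs': [], 'productId': '100031445'}

# data= form-encodes the dict (and silently drops the empty-list values).
form = requests.Request("POST", url, data=payload).prepare()
print(form.headers["Content-Type"])   # application/x-www-form-urlencoded
print(form.body)

# json= serializes the dict as a JSON document, matching the browser request.
as_json = requests.Request("POST", url, json=payload).prepare()
print(as_json.headers["Content-Type"])  # application/json
print(as_json.body)
```

If the server still answers with HTML, something else (cookies, CSRF token, JS-set state) is also involved, but the body encoding is the first mismatch to rule out.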

Python GET request returns 403 Access Denied. Headers are used. Page is accessible via browser

I'm trying to get an HTML response in Python with requests, but I only get a 403, while in the Chrome browser the link works fine and the page loads: https://www.dell.com/support/home/en-us/product-support/servicetag/0-ek1RYjR0NnNuandqYVQ1NjdUMm9IZz090/overview
I've copied the exact headers of the successfully loaded page from the Chrome Developer Tools → Network recording, but no luck (below).
import requests
headers = {
'authority': 'www.dell.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US;q=0.9,en;q=0.8',
'cache-control': 'max-age=0',
'sec-ch-ua': '"Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
response = requests.get('https://www.dell.com/support/home/en-us/product-support/servicetag/0-ek1RYjR0NnNuandqYVQ1NjdUMm9IZz090/overview', headers=headers)
print(response.status_code)
Also, requests.get('https://www.dell.com/support/home/en-us?lwp=rt') returns 200 with no problem.
I can't figure out what the difference between a browser and a Python request might be in this case.
UPD: Python 3.7.3, running from a Jupyter Notebook, but yes, that hardly matters; I tried running it from the Python console as well.
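One observable difference: requests merges its own default headers into every request unless you override them, so the request on the wire is never identical to the browser's even when you copy headers. (Sites with aggressive bot protection may also fingerprint the TLS handshake itself, which no header change can fix.) A quick way to see those defaults:

```python
import requests

session = requests.Session()

# These defaults (User-Agent, Accept-Encoding, Connection, ...) are merged
# into every request made through the session unless explicitly overridden.
print(dict(session.headers))
```

Updating `session.headers` with the browser-copied values replaces the defaults for matching keys, but header ordering and TLS-level details remain outside requests' control.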

Bypass DDoS-GUARD via python requests

I need to parse a site protected by DDoS-GUARD.
In the Firefox devtools I found the GET endpoint https://stolichki.ru/cities/all
If I open this URL in Firefox, it returns a JSON object.
But Python requests returns an HTML page with a 403 status.
response_raw = requests.get('https://stolichki.ru/cities/all')
print(response_raw.text)
I tried to change the headers of request
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'ru-RU,ru;q=0.8,en-US;q=0.5,en;q=0.3',
'Connection': 'keep-alive',
'DNT': '1',
'Host': 'stolichki.ru',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Sec-GPC': '1',
'TE': 'trailers',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0'
}
But it didn't help.
Please don't suggest the grab library.
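DDoS-GUARD normally sets __ddg* cookies after a JavaScript challenge, which plain requests never executes. A common workaround is to copy those cookies from a browser session that has already passed the challenge and replay them. The sketch below uses a placeholder cookie name and value; both are assumptions you would replace with what your own browser shows in devtools → Storage:

```python
import requests

session = requests.Session()

# Hypothetical cookie: copy the real __ddg* name/value pairs from a
# browser session that already passed the DDoS-GUARD challenge.
session.cookies.set('__ddg1_', 'REPLACE_WITH_BROWSER_VALUE', domain='stolichki.ru')
session.headers['User-Agent'] = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0'
)
# response = session.get('https://stolichki.ru/cities/all')  # network call
```

The copied cookies expire, so this only works until the guard issues a new challenge; anything longer-lived needs a driver that executes JavaScript.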

How to scrape a website which returns an empty table?

The Problem
I am trying to scrape the website. However, I can't reach the table content when I send a POST request from Postman. I tried the requests library to get the info, and the cloudscraper library to look like a real person. The resulting HTML's table is empty. How can I solve this?
Screenshots
1 - The Form
2 - Result
Code
import requests
url = "https://www.turkiye.gov.tr/mersin-yenisehir-belediyesi-arsa-rayic-degeri-sorgulama?submit"
payload='btn=Sorgula&caddesokak=&id=&islem=&mahalle=27&token=%7B609B03-5C5357-904654-84788D-227746-F7EEF8-F661BE-1B3F90%7D&yil=2021'
headers = {
'sec-ch-ua': '"Google Chrome";v="95", "Chromium";v="95", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'Upgrade-Insecure-Requests': '1',
'DNT': '1',
'Content-Type': 'application/x-www-form-urlencoded',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'document',
'Cookie': 'TURKIYESESSIONID=9a8ab4rjv7oprv5atidcmlo95i; language=tr_TR.UTF-8; TS01ee3a52=015c1cbb6d657270d7a05c71f0c60353ad5d33d8832ac14f33c8078bc783d34e5862d30b42518895fc09263e263aa5d0c8ac69356e191fa7dfed849b6029e59b84d9634c98180a76df4845df847364cfd3771e1e8c; w3p=4090734784.20480.0000'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
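Separate from anything else blocking the request, one fragile detail in the snippet above is the hard-coded token and session cookie: both are session-bound, so a fresh session would need to fetch the form page first and extract a new token from it. A sketch of that extraction step, using hypothetical markup rather than the site's real HTML:

```python
import re

# Hypothetical markup: the real form page's hidden input may differ.
html = '<input type="hidden" name="token" value="{609B03-5C5357-904654}">'

# Pull the token value out of the hidden input so it can be re-submitted
# in the same session's POST body.
match = re.search(r'name="token"\s+value="([^"]+)"', html)
token = match.group(1)
print(token)
```

In practice you would do the initial GET and the POST through one requests.Session() so the session cookie and the token stay paired.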
The problem you're having is that Postman and the requests library don't execute JavaScript, and the site you're trying to scrape relies heavily on JavaScript. I personally checked in my browser, and if you disable JS on that site it returns a blank page. A workaround is the Selenium library; it has a learning curve, but it will be able to scrape any site like that.

Unable to get cookies in python requests module

I want to scrape data from this URL https://weibo.com/hebgqt?refer_flag=1001030103_&is_all=1
I am able to scrape the data if I pass the cookie in the headers manually, but I want to do it automatically. Here is the code.
import requests
url = 'https://weibo.com/hebgqt?refer_flag=1001030103_&is_all=1'
headers = {
'authority': 'weibo.com',
'cache-control': 'max-age=0',
'sec-ch-ua': '^\\^',
'sec-ch-ua-mobile': '?0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
'cookie': 'SINAGLOBAL=764815322341.5566.1622097283265; SUB=_2AkMXj8zTf8NxqwJRmP0RzmrjaY1yyg3EieKh0z0IJRMxHRl-yT92qmgntRB6PA_iPI199P4zlRz9zonVc5W23plzUH7V; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9W55o9Nf.NuDNjNQuIS8pJY_; _s_tentry=-; Apache=3847225399074.1636.1624690011593; ULV=1624690011604:5:4:4:3847225399074.1636.1624690011593:1624608998989',
}
response = requests.get(url, headers=headers).text
print(response)
I tried to get the cookies with the following code, but I am getting an empty dictionary.
import requests
url = 'https://weibo.com/hebgqt?refer_flag=1001030103_&is_all=1'
r = requests.get(url)
print(r.cookies.get_dict())
Note: the website is Chinese, so I am using NordVPN; if I don't use it, I get a SysCallError.
Please help me to find cookies or any other way to fetch data from the above URL.
I think in order to read cookies, you should use a requests Session, as shown here:
https://stackoverflow.com/a/25092059/7426792
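A minimal sketch of that Session approach (whether weibo.com actually sets usable cookies on an unauthenticated response is not guaranteed; the simulated cookie below just shows where a server-set cookie would land):

```python
import requests

session = requests.Session()

# Any Set-Cookie header on a response made through this Session is stored
# in session.cookies automatically; we insert one directly to simulate that.
session.cookies.set('SINAGLOBAL', 'example-value', domain='weibo.com')

print(session.cookies.get_dict())
```

If `session.get(url)` still leaves the jar empty, the cookies are most likely being set by JavaScript after page load, in which case a plain HTTP client never sees them.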
