I want to make a crawler for a website that requires login.
I have an email and password (sorry, but I cannot share them).
This is the website:
https://www.eurekalert.org/
When I click on login, it redirects me here:
https://signin.aaas.org/oxauth/login
First, I did this:
session = requests.session()
r = session.get('https://www.eurekalert.org/')
cookies = r.cookies.get_dict()
#cookies = cookies['PHPSESSID']
print("COOKIE eurekalert", cookies)
The only cookie I could get is:
{'PHPSESSID': 'vd2jp35ss5d0sm0i5em5k9hsca'}
But logging in requires more cookie key-value pairs.
I have managed to log in, but only by supplying cookie data that I cannot retrieve programmatically:
login_response = session.post('https://signin.aaas.org/oxauth/login', headers=login_headers, data=login_data)
headers = {
'authority': 'www.eurekalert.org',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'cross-site',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
'sec-ch-ua-mobile': '?0',
'referer': 'https://signin.aaas.org/',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
'cookie': '_fbp=fb.1.1626615735334.1262364960; __gads=ID=d7a7a2d080319a5d:T=1626615735:S=ALNI_MYdVrKc4-uasMo3sVMCjzFABP0TeQ; __utmz=28029352.1626615736.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utma=28029352.223995016.1626615735.1626615735.1626615735.1; _ga=GA1.2.223995016.1626615735; adBlockEnabled=not%20blocked; _gid=GA1.2.109852943.1629792860; AMCVS_242B6472541199F70A4C98A6%40AdobeOrg=1; AMCV_242B6472541199F70A4C98A6%40AdobeOrg=-1124106680%7CMCIDTS%7C18864%7CMCMID%7C62442014968131466430435549466681355333%7CMCAAMLH-1630397660%7C6%7CMCAAMB-1630397660%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCOPTOUT-1629800060s%7CNONE%7CvVersion%7C5.2.0; s_cc=true; __atuvc=2%7C31%2C0%7C32%2C0%7C33%2C1%7C34; PHPSESSID=af75g985r6eccuisu8dvkkv41v; s_tp=1616; s_ppv=www.eurekalert.org%2C58%2C58%2C938',
}
response = session.get('https://www.eurekalert.org/reporter/home', headers=headers)
print(response)
soup = BeautifulSoup(response.content, 'html.parser')
The headers (with the full cookie data) were collected via DevTools: Network -> Copy -> Copy as cURL, then pasted here: https://curl.trillworks.com/
But the values that go into 'cookie' should be retrieved dynamically; that is what I'm missing.
When I look in the browser's cookie tab, all the values are there, but I cannot get them with my request.
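For what it's worth, the standard way around hand-copying the cookie header is to do every request through one requests.Session: the session stores each Set-Cookie it receives and re-sends the matching cookies automatically. A minimal sketch of that mechanism, runnable offline (the PHPSESSID value here is made up):

```python
import requests

session = requests.Session()
# In real use, session.get('https://www.eurekalert.org/') would store the
# PHPSESSID cookie here automatically; we set a fake one so this runs offline.
session.cookies.set("PHPSESSID", "abc123", domain="www.eurekalert.org")

# Prepare (without sending) a follow-up request through the same session:
req = requests.Request("GET", "https://www.eurekalert.org/reporter/home")
prepped = session.prepare_request(req)

# The session has merged its stored cookies into the Cookie header for us:
print(prepped.headers["Cookie"])  # PHPSESSID=abc123
```

The same merging happens on every session.get/session.post, so as long as the login POST goes through the session, the cookies it sets ride along on later requests without any manual 'cookie' header.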
I'm trying to do my first POST request (on TinEye) that involves uploading an image. I'm trying to piece together bits from these answers: Python POST Request with an Image , How to post image using requests? , Sending images by POST using python requests , and Sending image over POST request with Python Requests , but I'm still missing something.
The headers of the request look like this:
[screenshot: request headers, part 1]
[screenshot: request headers, part 2]
(...not sure what identifying info, if any, is in there, so I've blocked it out just in case)
And the payload looks like this:
[screenshot: request payload]
So, with all this info, what I've attempted so far looks like this:
import requests
import random,string
# pip install requests_toolbelt
from requests_toolbelt import MultipartEncoder
image_filename = "2015_Aston_Martin_DB9_GT_(19839443910).jpg" # Change this to another filename
imported_image = open(image_filename, 'rb')
def submit_image_post_request(image):
    # Create a GET request to get the initial cookies
    cookies = requests.get("https://tineye.com/").cookies
    # Generate a WebKitFormBoundary
    boundary = '----WebKitFormBoundary' + ''.join(random.sample(string.ascii_letters + string.digits, 16))
    # Generate the headers
    headers = {
        'authority': 'tineye.com',
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'en-US,en;q=0.9',
        'content-type': 'multipart/form-data; boundary=' + boundary,
        'origin': 'https://tineye.com',
        'referer': 'https://tineye.com/search/c8570370e2b2338dc656c8cefe221655b8a0ca17?sort=score&order=desc&page=1',
        'sec-ch-ua': '"Chromium";v="104", " Not A;Brand";v="99", "Google Chrome";v="104"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
    }
    # Give the params
    params = {
        'sort': 'score',
        'order': 'desc',
    }
    # Now comes the experimenting.
    # Define a 'files' variable for the request using the opened image file
    files = {
        'image': image
    }
    # Try to recreate the "fields" of the form/request
    fields = {
        'file': (image_filename, image, "image/jpeg"),
        # 'file_id': "0"
        # "Content-Disposition": 'form-data; name="image"; filename=image_filename'
    }
    # Generate a MultipartEncoder using the fields and same boundary as in the headers
    m = MultipartEncoder(fields=fields, boundary=boundary)
    # Send the request
    response = requests.post('https://tineye.com/result_json/', params=params, headers=headers, files=files, cookies=cookies, data=m)
    return response
response = submit_image_post_request(imported_image)
Obviously it's not working: I currently get a 400 response, and it comes down to the last part of the function, since I'm not quite sure how to recreate the request. I'm looking for some guidance on it.
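One plausible cause of the 400 (my reading, not confirmed by TinEye) is that the request mixes a hand-built MultipartEncoder body (data=m) with files=, while Content-Type carries a hand-picked boundary, so the declared boundary and the actual body can disagree. If the file goes through files= alone and Content-Type is left unset, requests generates the boundary and the body together so they always match. A sketch with fake file bytes, runnable offline because nothing is sent:

```python
import requests

# Build (but don't send) the POST, letting requests create the multipart
# body itself: pass the file via files= and do NOT set Content-Type by hand.
req = requests.Request(
    "POST",
    "https://tineye.com/result_json/",  # endpoint from the question
    files={"image": ("car.jpg", b"\xff\xd8fake-jpeg-bytes", "image/jpeg")},
)
prepped = req.prepare()

# The boundary in the header and in the body were generated together:
print(prepped.headers["Content-Type"])  # multipart/form-data; boundary=...
```

The field name ("image"), filename, and MIME type in the tuple are whatever the form expects; only the mechanism matters here.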
I found an article that showed how to copy the request as cURL from Chrome, import it into Postman, and then export the corresponding Python request from Postman, which I have done below as an updated attempt, and I got the 200 response code. Woohoo!
def search_image(self, image):
    url = "https://tineye.com/result_json/"
    cookies = requests.get(url).cookies
    payload = {}
    files = [
        ('image', ('file', image, 'application/octet-stream'))
    ]
    headers = {
        'authority': 'tineye.com',
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'en-US,en;q=0.9',
        # 'cookie': '_ga=GA1.2.1487505347.1661754780; sort=score; order=desc; _gid=GA1.2.613122987.1662166051; __cf_bm=VYpWBFxDJVgFr_e6N_51uElQ4P0qmZtysVNuPdG4MU4-1662166051-0-AQ3g7/Ygshplz8dghxLlCTA8TBrR0b+YXr9kOMfagi18Ypry9kWkDQELjUXOGpClZgoX/BjZExzf+3r6aL8ytCau2kM8z5u3sFanPVaA39wOni+AMGy69RFrGBP8om+naQ==; tineye=fz1Bqk4sJOQqVaf4XCHM59qTFw8LSS6aLP3fQQoIYLyVWIsQR_-XpM-E6-L5GXQ8eex1ia7GI0-ffA57yuR-ll0nfPeAPkDzqdp1Uw; _gat_gtag_UA_2430070_8=1',
        'origin': 'https://tineye.com',
        'referer': 'https://tineye.com/search',
        'sec-ch-ua': '"Chromium";v="104", " Not A;Brand";v="99", "Google Chrome";v="104"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
    }
    response = requests.post(url, headers=headers, data=payload, files=files, cookies=cookies, timeout=60)
    return response
I am working on a project and I need to log in to icloud.com using requests. I tried doing it myself, but then I imported the pyicloud library, which does the login and completes the 2FA for me. After logging in, though, I need to create Hide My Email addresses, which the library doesn't do, so I tried to do that myself with POST and GET requests. However, I want to compile it and make it user-friendly, so the user won't need to touch the code: it should automatically get the cookies and put them in the request header, and this is my main problem.
This is my code:
from pyicloud import PyiCloudService
import requests
import json
session = requests.Session()
api = PyiCloudService('mail', 'password')
# here is the 2fa and login function, but after this comment user is logged in
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection': 'keep-alive',
'Content-Length': '2',
'Content-Type': 'text/plain',
'Origin': 'https://www.icloud.com',
'Referer': 'https://www.icloud.com/',
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-site',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="101", "Google Chrome";v="101"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"macOS"'
}
session.get('https://icloud.com/settings/')
r = session.post('https://p113-maildomainws.icloud.com/v1/hme/generate?clientBuildNumber=2215Project36&clientMasteringNumber=2215B21&clientId=8b343412-32c8-43d6-9b36-ffc417865d6e&dsid=8267218741', headers=headers, json={})
print(r.text)
With the cookie entered into the header manually, it prints this:
{"success":true,"timestamp":1653818738,"result":{"hme":"clones.lacks_0d#icloud.com"}}
Without the cookie, which I want to put into the header automatically, it prints this:
{"reason":"Missing X-APPLE-WEBAUTH-USER cookie","error":1}
I tried creating
session = requests.Session()
and this, which another user told me to do, but it also doesn't work:
session.get('https://icloud.com/settings/')
I need to somehow get the 'cookie': 'x' pair into the header without changing the headers manually, maybe using the response headers.
Any help will be appreciated.
Thank you, and have a nice day :)
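One way to stop pasting the header by hand is to build it from a cookie jar. With pyicloud, the authenticated jar should be reachable through the service's own session (api.session.cookies in the versions I've seen, but that attribute name is an assumption worth checking against your installed release). Here a jar with a made-up token stands in for it so the sketch runs offline:

```python
import requests

# Stand-in for the authenticated jar, e.g. api.session.cookies with pyicloud
# (attribute name is an assumption -- check your installed version).
jar = requests.cookies.RequestsCookieJar()
jar.set("X-APPLE-WEBAUTH-USER", "fake-token-value", domain=".icloud.com")

# Build the Cookie header from whatever the jar currently holds:
cookie_header = "; ".join(f"{c.name}={c.value}" for c in jar)
headers = {"Cookie": cookie_header}
print(headers["Cookie"])  # X-APPLE-WEBAUTH-USER=fake-token-value
```

If the generate endpoint only needs the cookies (rather than that exact header string), passing cookies=jar to session.post is simpler still, since requests then builds the Cookie header itself.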
I'm trying to produce the JSON data so I can search for available camp rentals, and the only way seems to be a request with a header; otherwise I get a Not Authorized message when just using the URL. Unfortunately, I'm having no luck this way either, since I keep getting a Session has expired message. I'm not a web developer, so I'm not sure what the cause is. Any help would be greatly appreciated. Thank you.
import time
import sys
import requests
url = "https://reservations.piratecoveresort.com/irmdata/api/irm?sessionID=_rdpirm01&arrival=2021-10-26&departure=2021-10-28&people1=1&people2=0&people3=0&people4=0&promocode=&groupnum=&rateplan=RACK&changeResNum=&roomtype=&roomnum=&propertycode=&locationcode=&preferences=&preferences=&preferences=&preferences=&preferences=WTF&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&masterType=&page=&start=0&limit=12&multiRoom=false"
payload={}
headers = {
'authority': 'reservations.piratecoveresort.com',
'method': 'GET',
'path': '/irmdata/api/irm?sessionID=_rdpirm01&arrival=2021-10-26&departure=2021-10-28&people1=1&people2=0&people3=0&people4=0&promocode=&groupnum=&rateplan=RACK&changeResNum=&roomtype=&roomnum=&propertycode=&locationcode=&preferences=&preferences=&preferences=&preferences=&preferences=WTF&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&preferences=&masterType=&page=&start=0&limit=12&multiRoom=false',
'scheme': 'https',
'accept': 'application/json, text/plain, */*',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'authentication': '',
'content-type': 'application/json; charset=utf-8',
'cookie': 'rdpirm01=',
'dnt': '0',
'referer': 'https://reservations.piratecoveresort.com/irmng/',
# 'sec-ch-ua': '"Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': "Windows",
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
}
response = requests.request("GET", url, headers=headers, data=payload)
print(response.text)
Result
Session Expired
You're getting "session expired" because the session cookie (and possibly the authentication token too) has expired. You can fix this by using a requests Session, which will set these session headers for you. Read more here:
https://docs.python-requests.org/en/master/user/advanced/
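A sketch of that pattern, assuming the site issues its session cookie on the first visit (the cookie value below is made up so the example runs offline):

```python
import requests

session = requests.Session()
# Headers set here are sent with every request made through this session:
session.headers.update({"accept": "application/json, text/plain, */*"})

# In real use, session.get("https://reservations.piratecoveresort.com/irmng/")
# would store the server's Set-Cookie values in session.cookies automatically;
# we simulate that with a manual set() so this sketch runs offline.
session.cookies.set("rdpirm01", "fresh-session-value",
                    domain="reservations.piratecoveresort.com")

# Later session.get(...) calls re-send the stored cookies on their own:
print(session.cookies.get_dict())  # {'rdpirm01': 'fresh-session-value'}
```

Note that the expired sessionID baked into the URL's query string would still need to come from a fresh visit; a Session only handles the cookie and default-header side.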
The Problem
I am trying to scrape the website, but I can't reach the table content when I post a request from Postman. I tried the requests library to get the info. I also tried the cloudscraper library to look like a real visitor. The resulting HTML's table is empty. How can I solve it?
Screenshots
1 - The Form
2 - Result
Code
import requests
url = "https://www.turkiye.gov.tr/mersin-yenisehir-belediyesi-arsa-rayic-degeri-sorgulama?submit"
payload='btn=Sorgula&caddesokak=&id=&islem=&mahalle=27&token=%7B609B03-5C5357-904654-84788D-227746-F7EEF8-F661BE-1B3F90%7D&yil=2021'
headers = {
'sec-ch-ua': '"Google Chrome";v="95", "Chromium";v="95", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'Upgrade-Insecure-Requests': '1',
'DNT': '1',
'Content-Type': 'application/x-www-form-urlencoded',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'document',
'Cookie': 'TURKIYESESSIONID=9a8ab4rjv7oprv5atidcmlo95i; language=tr_TR.UTF-8; TS01ee3a52=015c1cbb6d657270d7a05c71f0c60353ad5d33d8832ac14f33c8078bc783d34e5862d30b42518895fc09263e263aa5d0c8ac69356e191fa7dfed849b6029e59b84d9634c98180a76df4845df847364cfd3771e1e8c; w3p=4090734784.20480.0000'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
The problem you're having is that Postman and the requests library don't execute JavaScript, and the site you're trying to scrape relies heavily on JavaScript. I checked in my browser: if you disable JS on that site, it returns a blank page. A workaround is the Selenium library; it has a learning curve, but it can scrape any site like that.
I want to scrape data from this URL https://weibo.com/hebgqt?refer_flag=1001030103_&is_all=1
I am able to scrape the data if I pass the cookie in the headers manually, but I want to do it automatically. Here is the code.
import requests
url = 'https://weibo.com/hebgqt?refer_flag=1001030103_&is_all=1'
headers = {
'authority': 'weibo.com',
'cache-control': 'max-age=0',
'sec-ch-ua': '^\\^',
'sec-ch-ua-mobile': '?0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
'cookie': 'SINAGLOBAL=764815322341.5566.1622097283265; SUB=_2AkMXj8zTf8NxqwJRmP0RzmrjaY1yyg3EieKh0z0IJRMxHRl-yT92qmgntRB6PA_iPI199P4zlRz9zonVc5W23plzUH7V; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9W55o9Nf.NuDNjNQuIS8pJY_; _s_tentry=-; Apache=3847225399074.1636.1624690011593; ULV=1624690011604:5:4:4:3847225399074.1636.1624690011593:1624608998989',
}
response = requests.get(url, headers=headers).text
print(response)
I tried to get the cookies with the following code, but I get an empty dictionary.
import requests
url = 'https://weibo.com/hebgqt?refer_flag=1001030103_&is_all=1'
r = requests.get(url)
print(r.cookies.get_dict())
Note: the website is Chinese, so I am using NordVPN; without it I get a SysCallError.
Please help me find the cookies, or any other way to fetch data from the above URL.
I think that in order to read cookies, you should use a requests Session, as shown here:
https://stackoverflow.com/a/25092059/7426792
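To illustrate the linked answer: a Session accumulates every Set-Cookie it receives (including those set during redirects) in session.cookies, which you can read back as a dict. An offline sketch of the jar mechanics, with cookie names from the question and made-up values:

```python
import requests

session = requests.Session()
# In real use, session.get('https://weibo.com/...') stores the server's
# Set-Cookie values here automatically; we set fakes so this runs offline.
session.cookies.set("SINAGLOBAL", "example-value", domain=".weibo.com")
session.cookies.set("SUB", "another-value", domain=".weibo.com")

# The stored cookies are re-sent on every later session request, and can be
# inspected as a plain dict at any time:
print(session.cookies.get_dict())
```

By contrast, r.cookies on a plain requests.get only shows cookies set on that single final response; if weibo.com sets its cookies on a redirect hop or via JavaScript, that would explain the empty dictionary.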