I want to scrape data from this URL https://weibo.com/hebgqt?refer_flag=1001030103_&is_all=1
I can scrape the data if I pass the cookie in the headers manually, but I want to obtain it automatically. Here is the code.
import requests
url = 'https://weibo.com/hebgqt?refer_flag=1001030103_&is_all=1'
headers = {
'authority': 'weibo.com',
'cache-control': 'max-age=0',
'sec-ch-ua': '^\\^',
'sec-ch-ua-mobile': '?0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
'cookie': 'SINAGLOBAL=764815322341.5566.1622097283265; SUB=_2AkMXj8zTf8NxqwJRmP0RzmrjaY1yyg3EieKh0z0IJRMxHRl-yT92qmgntRB6PA_iPI199P4zlRz9zonVc5W23plzUH7V; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9W55o9Nf.NuDNjNQuIS8pJY_; _s_tentry=-; Apache=3847225399074.1636.1624690011593; ULV=1624690011604:5:4:4:3847225399074.1636.1624690011593:1624608998989',
}
response = requests.get(url, headers=headers).text
print(response)
I tried to get the cookies with the following code, but I get an empty dictionary.
import requests
url = 'https://weibo.com/hebgqt?refer_flag=1001030103_&is_all=1'
r = requests.get(url)
print(r.cookies.get_dict())
Note: The website is Chinese, so I am using NordVPN. Without it I get a SysCallError.
Please help me to find cookies or any other way to fetch data from the above URL.
I think that in order to read cookies you should use a requests Session, as shown here:
https://stackoverflow.com/a/25092059/7426792
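A minimal sketch of that idea, assuming the server sets its cookies through Set-Cookie response headers. Cookies that the page creates with JavaScript in the browser will not show up this way, so the result may still be empty for some sites. The helper name `cookies_after_visit` is hypothetical.

```python
import requests


def cookies_after_visit(session, url, timeout=10):
    """Visit `url` with `session` and return the cookies the session now holds.

    Hypothetical helper: a Session stores the cookies it receives and
    sends them back automatically on later requests.
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return session.cookies.get_dict()
```

It could be called as `cookies_after_visit(requests.Session(), url)`. Reusing the same session object for the following requests sends the stored cookies without building a 'cookie' header by hand.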
Related
I'm trying to get an HTML response in Python with requests, but I only get a 403, while in the Chrome browser the link works fine and the page loads: https://www.dell.com/support/home/en-us/product-support/servicetag/0-ek1RYjR0NnNuandqYVQ1NjdUMm9IZz090/overview
I've copied the exact headers of the successfully loaded page from Chrome Developer Tools -> Network recording, but no luck (see below).
import requests
headers = {
'authority': 'www.dell.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US;q=0.9,en;q=0.8',
'cache-control': 'max-age=0',
'sec-ch-ua': '"Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
response = requests.get('https://www.dell.com/support/home/en-us/product-support/servicetag/0-ek1RYjR0NnNuandqYVQ1NjdUMm9IZz090/overview', headers=headers)
print(response.status_code)
Also, requests.get('https://www.dell.com/support/home/en-us?lwp=rt') returns 200 with no problem.
Can't figure out what the difference between a browser and a python request might be in this case.
Update: Python 3.7.3 running in Jupyter Notebook, though that hardly matters. I tried running it from the Python console as well.
The Problem
I am trying to scrape a website, but I can't get the table content when I send the POST request from Postman. I also tried the requests library, and the cloudscraper library to look more like a real browser. In every case the table in the returned HTML is empty. How can I solve this?
Screenshots
1 - The Form
2 - Result
Code
import requests
url = "https://www.turkiye.gov.tr/mersin-yenisehir-belediyesi-arsa-rayic-degeri-sorgulama?submit"
payload='btn=Sorgula&caddesokak=&id=&islem=&mahalle=27&token=%7B609B03-5C5357-904654-84788D-227746-F7EEF8-F661BE-1B3F90%7D&yil=2021'
headers = {
'sec-ch-ua': '"Google Chrome";v="95", "Chromium";v="95", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'Upgrade-Insecure-Requests': '1',
'DNT': '1',
'Content-Type': 'application/x-www-form-urlencoded',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'document',
'Cookie': 'TURKIYESESSIONID=9a8ab4rjv7oprv5atidcmlo95i; language=tr_TR.UTF-8; TS01ee3a52=015c1cbb6d657270d7a05c71f0c60353ad5d33d8832ac14f33c8078bc783d34e5862d30b42518895fc09263e263aa5d0c8ac69356e191fa7dfed849b6029e59b84d9634c98180a76df4845df847364cfd3771e1e8c; w3p=4090734784.20480.0000'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
The problem is that neither Postman nor the requests library executes JavaScript, and the site you're trying to scrape relies heavily on it. I checked in my browser: with JavaScript disabled, the site returns a blank page. A workaround is the Selenium library. It has a learning curve, but because it drives a real browser it can scrape sites like this one.
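A rough sketch of the Selenium approach, assuming Selenium and Chrome are installed. The names `fetch_rendered_html`, `table_rows` and the `wait_css` selector are hypothetical, and filling in and submitting the query form is left out; the sketch only shows loading a page in a real browser and reading the table from the rendered HTML.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, wait_css="table", timeout=15):
    """Load `url` in headless Chrome and return the HTML after JavaScript ran."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one element matching the selector exists.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return the table cells found in `html` as a list of rows."""
    parser = TableRows()
    parser.feed(html)
    return parser.rows
```

With these two pieces, `table_rows(fetch_rendered_html(url))` would return the rows of whatever table the browser ends up rendering.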
I have been trying to access this website https://www.dickssportinggoods.com/f/tents-accessories with the requests module, but the request just hangs and never finishes, while the same website works fine in a browser. Scrapy gives a timeout error for the same website. Is there something that should be taken into account when accessing websites like this? Thanks.
For sites like these you can try adding the extra headers that your browser sends. The following steps worked for me:
Open the link in an incognito window with the Network tab open.
Copy the first request by right-clicking it -> Copy -> Copy as cURL.
Go to https://curl.trillworks.com/. Paste the curl command to get the equivalent Python requests code.
Now try removing headers one by one until you are left with the minimal set that still works.
Image for reference - https://i.stack.imgur.com/vRS98.png
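The last step can be automated. The sketch below is one possible way to do it, not part of the original steps: `minimal_headers` and the `works` callback are hypothetical names, and `works` is whatever check you consider a success.

```python
def minimal_headers(headers, works):
    """Drop headers one at a time, keeping each removal that still works.

    `works` is a caller-supplied function that takes a header dict and
    returns True when the request succeeds with those headers.
    """
    kept = dict(headers)
    for name in list(headers):
        # Try the request without this one header.
        trial = {k: v for k, v in kept.items() if k != name}
        if works(trial):
            kept = trial  # the header was not needed
    return kept
```

For example, `works` could be `lambda h: requests.get(url, headers=h, timeout=10).status_code == 200`, assuming a 200 status is enough to call the request successful. Each check sends one real request, so a full header set costs about a dozen requests.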
Edit -
import requests
headers = {
'authority': 'www.dickssportinggoods.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'sec-ch-ua-mobile': '?0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
}
response = requests.get('https://www.dickssportinggoods.com/f/tents-accessories', headers=headers)
print(response.text)
Have you tried adding headers?
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.dickssportinggoods.com/f/tents-accessories', headers=headers)
response.raise_for_status()
print(response.text)
Thanks to @Marcel and @Sonal. Apart from the headers, it only worked once I put the request in a try/except block.
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36'
}

def fetch(link):
    # Return the response, or None if the connection is refused.
    session = requests.Session()
    try:
        return session.get(link, headers=headers, stream=True)
    except requests.exceptions.ConnectionError:
        return None
I want to build a crawler for a website that requires login.
I have an email and password (sorry, but I cannot share them).
This is the website:
https://www.eurekalert.org/
When I click on login, it redirects me here:
https://signin.aaas.org/oxauth/login
First I have done this:
session = requests.session()
r = session.get('https://www.eurekalert.org/')
cookies = r.cookies.get_dict()
#cookies = cookies['PHPSESSID']
print("COOKIE eurekalert", cookies)
The only cookie that I could get is:
{'PHPSESSID': 'vd2jp35ss5d0sm0i5em5k9hsca'}
But for logging in I need more cookie key-value pairs.
I have managed to log in, but only by sending cookie data copied from the browser, which I cannot retrieve from my script:
login_response = session.post('https://signin.aaas.org/oxauth/login', headers=login_headers, data=login_data)
headers = {
'authority': 'www.eurekalert.org',
'cache-control': 'max-age=0',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'cross-site',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
'sec-ch-ua-mobile': '?0',
'referer': 'https://signin.aaas.org/',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
'cookie': '_fbp=fb.1.1626615735334.1262364960; __gads=ID=d7a7a2d080319a5d:T=1626615735:S=ALNI_MYdVrKc4-uasMo3sVMCjzFABP0TeQ; __utmz=28029352.1626615736.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utma=28029352.223995016.1626615735.1626615735.1626615735.1; _ga=GA1.2.223995016.1626615735; adBlockEnabled=not%20blocked; _gid=GA1.2.109852943.1629792860; AMCVS_242B6472541199F70A4C98A6%40AdobeOrg=1; AMCV_242B6472541199F70A4C98A6%40AdobeOrg=-1124106680%7CMCIDTS%7C18864%7CMCMID%7C62442014968131466430435549466681355333%7CMCAAMLH-1630397660%7C6%7CMCAAMB-1630397660%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCOPTOUT-1629800060s%7CNONE%7CvVersion%7C5.2.0; s_cc=true; __atuvc=2%7C31%2C0%7C32%2C0%7C33%2C1%7C34; PHPSESSID=af75g985r6eccuisu8dvkkv41v; s_tp=1616; s_ppv=www.eurekalert.org%2C58%2C58%2C938',
}
response = session.get('https://www.eurekalert.org/reporter/home', headers=headers)
print(response)
soup = BeautifulSoup(response.content, 'html.parser')
The headers (including the full cookie data) were collected with Network -> Copy -> Copy as cURL and pasted here: https://curl.trillworks.com/
But the values that go in the cookie should be retrieved dynamically. I'm missing the value that should go in 'cookie'.
When I open the browser's cookie tab, all the values are there, but I cannot get them with my request.
I have a script like this (url.py):
import requests
headers = {
'authority': 'www.spain.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9,pt;q=0.8',
}
links=['https://www.spain.com']
for url in links:
    page = requests.get(url, headers=headers)
    print(page)
Output
ubuntu@OS-Ubuntu:/mnt/$ python3 url.py
<Response [200]>
I need this to be filled in automatically because I will receive a txt file (domain.txt) with the domains like this:
www.spain.com
www.uk.com
www.italy.com
I want the Python script to be generic: I would just add more domains to domain.txt, run url.py, and it would automatically make the request for every domain in domain.txt.
Can you help me with that?
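Assuming url.py is located in the same directory as domain.txt, you can open the file and read each domain into a list. The file contains bare domains, so prepend the scheme, which requests requires:
with open('domain.txt', 'r') as f:
    links = ['https://' + line.strip() for line in f if line.strip()]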
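A slightly fuller sketch of the same idea, wrapped in a function. The name `load_urls` and the `scheme` parameter are hypothetical; the sketch assumes one domain per line, skips blank lines, and adds `https://` only when a line has no scheme, since requests rejects URLs without one.

```python
def load_urls(path, scheme="https"):
    """Read one domain per line from `path` and return full URLs."""
    urls = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            domain = line.strip()
            if not domain:
                continue  # skip blank lines
            if "://" not in domain:
                # Bare domain such as www.spain.com: add the scheme.
                domain = f"{scheme}://{domain}"
            urls.append(domain)
    return urls
```

In url.py the hardcoded list would then become `links = load_urls('domain.txt')`. The fixed 'authority': 'www.spain.com' header would be wrong for the other domains, so it is probably best removed.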