I am using Python 3.5.2. I want to scrape a webpage where cookies are required, but when I use requests.session() the cookies maintained in the session are not updated, so my scraping fails constantly. Below is my code snippet.
import requests
from bs4 import BeautifulSoup
import time
import requests.utils
session = requests.session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"})
print(session.cookies.get_dict())
url = "http://www.beianbaba.com/"
session.get(url)
print(session.cookies.get_dict())
Do you have any idea what is going on here? Thank you so much in advance.
It seems that the website simply is not setting any cookies in its response. I used the exact same code but requested https://google.com instead:
import requests
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"})
print(session.cookies.get_dict())
url = "http://google.com/"
session.get(url)
print(session.cookies.get_dict())
And got this output:
{}
{'NID': 'a cookie that i removed'}
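If the site you actually care about only sets its cookies via JavaScript (so nothing ever arrives in a Set-Cookie header), requests will never see them. One workaround is to copy the cookies from your browser's developer tools and inject them into the session by hand. A minimal sketch with a made-up cookie name and value:
import requests
session = requests.session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"})
# Hypothetical cookie copied from the browser; replace name, value and domain
# with whatever the real site actually sets.
session.cookies.set("example_cookie", "example_value", domain="www.beianbaba.com")
response = session.get("http://www.beianbaba.com/")
print(session.cookies.get_dict())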
I'm not sure if there's an API for this but I'm trying to scrape the price of certain products from Wayfair.
I wrote some Python code using Beautiful Soup and requests, but the HTML I get back says "Our systems have detected unusual traffic from your computer network."
Is there anything I can do to make this work?
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'
}
def fetch_wayfair_price(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup)
fetch_wayfair_price('https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
The Wayfair site loads a lot of data in after the initial page, so just getting the HTML from the URL you provided probably won't be all you need. That said, I was able to get the request to work.
Start by using a session from the requests library; this will track cookies and session state. I also added upgrade-insecure-requests: '1' to the headers so that you can avoid some of the issues that HTTPS introduces to scraping. The request now returns the same response as the browser request does, but the information you want is probably loaded in subsequent requests that the browser/JS makes.
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1'
}
def fetch_wayfair_price(url):
    with requests.Session() as session:
        post = session.get(url, headers=headers)
        soup = BeautifulSoup(post.text, 'html.parser')
        print(soup)
fetch_wayfair_price(
    'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
Note: The session that the requests library creates really should persist outside of the fetch_wayfair_price method. I have contained it in the method to match your example.
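For instance, you could create the session once and pass it into the function, so cookies persist across several product pages. A rough sketch reusing the same (hypothetical) helper name:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1'
}

def fetch_wayfair_price(session, url):
    # Reuse the caller's session so cookies carry over between requests.
    response = session.get(url, headers=headers)
    return BeautifulSoup(response.text, 'html.parser')

with requests.Session() as session:
    urls = [
        'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html',
    ]
    for url in urls:
        soup = fetch_wayfair_price(session, url)
        print(soup.title)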
I want to scrape https://www.jdsports.it/ using BeautifulSoup but I get access denied.
On my PC I have no problem accessing the site, and I'm using the same user agent as in the Python program, but in the program the result is different; you can see the output below.
EDIT:
I think I need cookies to gain access to the site. How can I get them and use them in the Python program to access and scrape the site?
The script works if I use "https://www.jdsports.com", which is the same site but for a different region.
Thanks!
import time
import requests
from bs4 import BeautifulSoup
import smtplib
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
url = 'https://www.jdsports.it/'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
soup.encode('utf-8')
status = soup.get_text()
print (status)
The output is:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.jdsports.it/" on this server.<p>
Reference #18.35657b5c.1589627513.36921df8
</p></body>
</html>
I suspected HTTP/2 at first, but wasn't able to get that working either. Perhaps you will have more luck; here's an HTTP/2 starting point:
import asyncio
import httpx
import logging
logging.basicConfig(format='%(message)s', level=logging.DEBUG)
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}
url = 'https://www.jdsports.it/'
async def f():
    client = httpx.AsyncClient(http2=True)
    r = await client.get(url, allow_redirects=True, headers=headers)
    print(r.text)
asyncio.run(f())
(Tested both on Windows and Linux.) Could this have something to do with TLS1.2? That's where I'd look next, as curl works.
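If you want to rule TLS out, you can force requests to negotiate TLS 1.2 by mounting a transport adapter with a custom SSL context. This is only a sketch of where I'd start poking, not a confirmed fix; the ssl_context keyword is passed through to urllib3's pool manager:
import ssl
import requests
from requests.adapters import HTTPAdapter

class Tls12Adapter(HTTPAdapter):
    # Pin the connection pool to TLS 1.2 (needs Python 3.7+ and a recent OpenSSL).
    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        ctx.minimum_version = ssl.TLSVersion.TLSv1_2
        ctx.maximum_version = ssl.TLSVersion.TLSv1_2
        kwargs['ssl_context'] = ctx
        return super().init_poolmanager(*args, **kwargs)

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}
session = requests.Session()
session.mount('https://', Tls12Adapter())
r = session.get('https://www.jdsports.it/', headers=headers)
print(r.status_code)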
Input
import requests
from http import cookiejar
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64;rv:57.0) Gecko/20100101 Firefox/57.0'}
url = "http://www.baidu.com/"
session = requests.Session()
req = session.put(url=url, headers=headers)
cookie = requests.utils.dict_from_cookiejar(req.cookies)
print(session.cookies.get_dict())
print(cookie)
Gives output:
{'BAIDUID': '323CFCB910A545D7FCCDA005A9E070BC:FG=1', 'BDSVRTM': '0'}
{'BAIDUID': '323CFCB910A545D7FCCDA005A9E070BC:FG=1'}
I'm trying to use this code to get all the cookies from the Baidu website, but it only returns the first cookie. Comparing with the cookies in the browser (shown in the picture), there are 9 cookies. How can I get all of them?
You didn't maintain your session, so it terminated after the second cookie.
import requests
from http import cookiejar
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64;rv:57.0) Gecko/20100101 Firefox/57.0'}
url = "http://www.baidu.com/"
with requests.Session() as s:
    req = s.get(url, headers=headers)
    print(req.cookies.get_dict())

>>> print(req.cookies.get_dict().keys())
['BDSVRTM', 'BAIDUID', 'H_PS_PSSID', 'BIDUPSID', 'PSTM', 'BD_HOME']
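More generally, the session's cookie jar keeps accumulating cookies as you make more requests in the same session, so the browser's larger set usually comes from loading extra pages and resources. A small sketch (the second URL is just a hypothetical follow-up page):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64;rv:57.0) Gecko/20100101 Firefox/57.0'}
with requests.Session() as s:
    s.get('http://www.baidu.com/', headers=headers)
    s.get('http://www.baidu.com/s?wd=python', headers=headers)  # hypothetical follow-up request
    # s.cookies merges the cookies from every response made in this session.
    print(s.cookies.get_dict())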
import requests
import webbrowser
from bs4 import BeautifulSoup
url = 'https://www.gamefaqs.com'
#headers={'User-Agent': 'Mozilla/5.0'}
headers ={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
response = requests.get(url, headers)
response.status_code is returning 403.
I can browse the website using Firefox/Chrome, so it seems to be a coding error.
I can't figure out what mistake I'm making.
Thank you.
This works if you make the request through a Session object.
import requests
session = requests.Session()
response = session.get('https://www.gamefaqs.com', headers={'User-Agent': 'Mozilla/5.0'})
print(response.status_code)
Output:
200
Using the keyword argument works for me; when you pass the dict positionally, requests.get treats it as the params argument, so your User-Agent header is never actually sent:
import requests
headers={'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.gamefaqs.com', headers=headers)
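You can see the difference by inspecting what was actually sent; a quick check (the exact default User-Agent string depends on your requests version):
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
# Positional: the dict is interpreted as params, so it ends up in the query
# string and the default python-requests User-Agent is sent instead.
wrong = requests.get('https://www.gamefaqs.com', headers)
print(wrong.request.url)
print(wrong.request.headers.get('User-Agent'))
# Keyword: the dict really is sent as request headers.
right = requests.get('https://www.gamefaqs.com', headers=headers)
print(right.request.headers.get('User-Agent'))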
Try using a Session.
import requests
url = 'https://www.gamefaqs.com'
session = requests.Session()
response = session.get(url, headers={'user-agent': 'Mozilla/5.0'})
print(response.status_code)
If the request still returns 403 Forbidden (after using a session object and adding a user-agent to the headers), you may need to add more headers:
headers = {
    'user-agent': "Mozilla/5.0 ...",
    'accept': 'text/html,application...',
    'referer': 'https://...',
}
r = session.get(url, headers=headers)
In Chrome, the request headers can be found under Network > Headers > Request Headers in the Developer Tools. (Press F12 to open them.)
The reason is that some websites check for a user-agent or for the presence of specific headers before accepting the request.
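For example, you can copy the full header set the browser sends and reuse it; the values below are placeholders, so paste the real ones for the site you are scraping:
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',  # placeholder, copy from DevTools
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.5',
    'referer': 'https://www.google.com/',  # placeholder
}
with requests.Session() as session:
    r = session.get('https://www.gamefaqs.com', headers=headers)
    print(r.status_code)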
I am trying to use the code below to access websites in Python 3 using urllib:
url = "http://www.goal.com"
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
r = urllib.request.Request(url=url, headers=headers)
urllib.request.urlopen(r).read(1000)
It works fine when it accesses "yahoo.com", but it always returns a 403 error when accessing sites such as "goal.com" and "hkticketing.com.hk", and I cannot figure out what I am missing. I appreciate your help.
In Python 2.x, you can use urllib2 to fetch the contents. Set the addheaders attribute to add the header information, then invoke the open method, read the contents, and finally print them.
import urllib2
import sys
print sys.version
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0')]
print opener.open('http://hkticketing.com.hk').read()
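Since the question is about Python 3, here is a minimal sketch of the same opener-with-headers approach using urllib.request; some sites may still return 403 if they filter on other headers or on the TLS fingerprint:
import urllib.request

# Build an opener that sends a browser-like User-Agent on every request.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0')]
print(opener.open('http://www.goal.com').read(1000))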