Access denied [403] when accessing site with BeautifulSoup python - python

I want to scrape https://www.jdsports.it/ using BeautifulSoup but I get access denied.
On my pc I don't get any problem accessing the site and I'm using the same user agent of the Python program but on the program the result is different, you can see the output below.
EDIT:
I think I need cookies to gain access to the site. How can I get them and use them to access the site with the python program to scrape it?
-The script works if I use "https://www.jdsports.com" that's the same site but with different region.
Thanks!
import time
import requests
from bs4 import BeautifulSoup
import smtplib
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
url = 'https://www.jdsports.it/'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
soup.encode('utf-8')
status = soup.findAll.get_text()
print (status)
The output is:
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
You don't have permission to access "http://www.jdsports.it/" on this server.<p>
Reference #18.35657b5c.1589627513.36921df8
</p></body>
</html>
>
python beautifulsoup user-agent cookies python-requests

Suspected HTTP/2 at first, but wasn't able to get that working either. Perhaps you are more lucky, here's a HTTP/2 starting point:
import asyncio
import httpx
import logging
logging.basicConfig(format='%(message)s', level=logging.DEBUG)
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}
url = 'https://www.jdsports.it/'
async def f():
client = httpx.AsyncClient(http2=True)
r = await client.get(url, allow_redirects=True, headers=headers)
print(r.text)
asyncio.run(f())
(Tested both on Windows and Linux.) Could this have something to do with TLS1.2? That's where I'd look next, as curl works.

Related

I cannot parse a JavaScript rendered webpage with Python Requests

I tried to write a little app for parsing this page: https://apps.microsoft.com/store/category/Business
I cannot get a full html code. The tag body is not full.
import requests
def get_data(url):
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
}
req = requests.get(url, headers=headers)
with open("index.html", "w") as file:
file.write(req.text)
get_data("https://apps.microsoft.com/store/category/Business")
You cannot just parse this page because it is a client side rendered page through JavaScript.
You need to use a tool like:
pyppeteer
Selenium
Or maybe try to reverse engineer the page and directly call the APIs.
(Or maybe see if Microsoft has a public API you can call to get the info you want).

503 Error When Trying To Crawl One Single Website Page | Python | Requests

Goal:
I am trying to scrape the HTML from this page: https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=.
(note - I will eventually want to paginate and scrape all job listings from this page)
My issue:
I get a 503 error when I try to scrape the page using Python and Requests. I am working out of Google Colab.
Initial Code:
import requests
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = requests.get(url)
print(response)
Attempted solutions:
Using 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
Implementing this code I found in another thread:
import requests
def getUrl(url):
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
}
res = requests.get(url, headers=headers)
res.raise_for_status()
getUrl('https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=')
I am able to access the website via my browser.
Is there anything else I can try?
Thank you
That page is protected by cloudflare, there's some options to try to bypass it, seems that using cloudscraper works:
import cloudscraper
scraper = cloudscraper.create_scraper()
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = scraper.get(url).text
print(response)
In order to use it, you'll need to install it:
pip install cloudscraper

Wayfair product prices

I'm not sure if there's an API for this but I'm trying to scrape the price of certain products from Wayfair.
I wrote some python code using Beautiful Soup and requests but I'm getting some HTML which mentions Our systems have detected unusual traffic from your computer network.
Is there anything I can do to make this work?
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'
}
def fetch_wayfair_price(url):
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content)
print(soup)
fetch_wayfair_price('https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
The Wayfair site loads a lot of data in after the initial page, so just getting the html from the URL you provided probably won't be all you need. That being said, I was able to get the request to work.
Start by using a session from the requests libary; this will track cookies and session. I also added upgrade-insecure-requests: '1' to the headers so that you can avoid some of the issue that HTTPS introduce to scraping. The request now returns the same response as the browser request does, but the information you want is probably loaded in subsequent requests the browser/JS make.
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
'upgrade-insecure-requests': '1',
'dnt': '1'
}
def fetch_wayfair_price(url):
with requests.Session() as session:
post = session.get(url, headers=headers)
soup = BeautifulSoup(post.text, 'html.parser')
print(soup)
fetch_wayfair_price(
'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
Note: The session that the requests libary creates really should persist outside of the fetch_wayfair_price method. I have contained it in the method to match your example.

Not able to replicate AJAX using Python Requests

I am trying to replicate ajax request from a web page (https://droughtmonitor.unl.edu/Data/DataTables.aspx). AJAX is initiated when we select values from dropdowns.
I am using the following request using python, but not able to see the response as in Network tab of the browser.
import bs4
import requests
import lxml
ses = requests.Session()
ses.get('https://droughtmonitor.unl.edu/Data/DataTables.aspx')
headers_dict = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
url = 'https://droughtmonitor.unl.edu/Ajax2018.aspx/ReturnTabularDMAreaPercent_urban'
req_data = {'area':'00064', 'statstype':'1'}
resp = ses.post(url,data = req_data,headers = headers_dict)
soup = bs4.BeautifulSoup(resp.content,'lxml')
print(soup)
You need to add several things to your request to get an Answer from the server.
You need to convert your dict to json to pass it as string and not as dict.
You also need to specify the type of request-data by setting the request header to Content-Type:application/json; charset=utf-8
with those changes I was able to request the correkt data.
import bs4
import requests
ses = requests.Session()
ses.get('https://droughtmonitor.unl.edu/Data/DataTables.aspx')
headers_dict = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Content-Type': 'application/json; charset=utf-8'}
url = 'https://droughtmonitor.unl.edu/Ajax2018.aspx/ReturnTabularDMAreaPercent_urban'
req_data = json.dumps({'area':'00037', 'statstype':'1'})
resp = ses.post(url,data = req_data,headers = headers_dict)
soup = bs4.BeautifulSoup(resp.content,'lxml')
print(soup)
Quite a tricky problem I must say.
From the requests documentation:
Instead of encoding the dict yourself, you can also pass it directly
using the json parameter (added in version 2.4.2) and it will be
encoded automatically:
>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}
>>> r = requests.post(url, json=payload)
Then, to get the output, call r.json() and you will get the data you are looking for.

Read a pdf file from a url and POST it on another server using python requests

I have tried the code below. Doesn't work.
The PDF is readable in browser.
I want to GET the pdf file from the GET-url and POST it to another server.
import requests
response = requests.get(url='some url')
requests.post(url='my_url', files={'file':response.content})
Link: (Expired)
It is caused by a missing header, specific the Uses-Agent. Looks like the site checks it.
The call returns a HTTP 406 (response.status_code). With the header a HTTP 200 is returned.
Try this:
import requests
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"}
response = requests.get(url='some url', headers=header)
requests.post(url='my_url', files={'file':response.content})

Categories