I'm trying to open this website using Python's urllib and BeautifulSoup, but I keep getting a 403 error. Can someone help me with this error?
My current code is this:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'
uClient = uReq(my_url)
That call fails with the 403 error.
I searched around and tried using the approach below, but it too is giving me the same error.
from urllib.request import Request, urlopen
url="https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
Any help is appreciated.
Try using a requests.Session() as below:
import requests

my_session = requests.Session()

# Hit the site root first so the session picks up the site's cookies
for_cookies = my_session.get("https://www.cubesmart.com")
cookies = for_cookies.cookies

# A browser-like User-Agent keeps the server from rejecting the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'

response = my_session.get(my_url, headers=headers, cookies=cookies)
print(response.status_code)  # 200
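Since you were already using BeautifulSoup, you can then feed the response straight into it; a minimal sketch of the follow-up (the print is just a sanity check, not taken from the actual page structure):
from bs4 import BeautifulSoup

page_soup = BeautifulSoup(response.text, 'html.parser')
print(page_soup.title)  # should print the page's <title> tag if the fetch worked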
I am trying to get the HTML of the following page: https://betway.es/es/sports/cpn/tennis/230 in order to get the matches' names and the odds, with this Python code:
from bs4 import BeautifulSoup
import urllib.request
url = 'https://betway.es/es/sports/cpn/tennis/230'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
But when I run the code it throws the following exception: HTTPError: HTTP Error 403: Forbidden
I have seen that it might be possible with headers, but I am completely new to this module, so I have no idea how to use them. Any advice? In addition, even when I am able to download the page, I cannot find the odds in it. Does anyone know what the reason could be?
Unfortunately, I'm in a country blocked by this site. But, using the requests package:
import requests as rq
from bs4 import BeautifulSoup as bs
url = 'https://betway.es/es/sports/cpn/tennis/230'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"}
page = rq.get(url, headers=headers)
You can find your headers in the browser dev tools: F12 -> Network -> click any request -> Headers tab.
Since I can't verify the page contents from here, this is, as a result, only a partial answer.
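On the missing odds: if they don't show up in the downloaded HTML, the site is most likely injecting them with JavaScript after the initial load, so the thing to look for is the XHR/JSON request in that same Network tab rather than the static page. A quick way to check, continuing from the fetch above:
soup = bs(page.content, 'html.parser')
# If nothing odds-like appears in the static text, the data arrives via JavaScript
print(soup.get_text()[:500])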
I am trying to parse the comments on the webpage https://xueqiu.com/S/SZ300816, but I am not able to get them correctly with the requests library:
>>> url = 'https://xueqiu.com/S/SZ300816'
>>> headers
{'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
>>> response = requests.get(url, headers=headers)
>>> from bs4 import BeautifulSoup as bs4
>>> soup = bs4(response.text)
>>> soup.findAll('article', {'class': "timeline__item"})
[]
>>>
Can someone please suggest what I am doing wrong? Thanks.
I got the URL below from the Network tab of Chrome's developer tools; the comment data is loaded from it in JSON format. I tried to reproduce and solve your problem; I hope this helps.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
}

def scrape(url):
    # A session applies the headers to every request it makes
    with requests.Session() as req:
        req.headers.update(headers)
        r = req.get(url)
        mydata = r.json()  # the endpoint returns JSON, not HTML
        print(mydata['list'][0])
        print(mydata['list'][0]['text'])
        print(mydata['list'][0]['description'])

url = 'https://xueqiu.com/query/v1/symbol/search/status?u=141606248084627&uuid=1331335789820403712&count=10&comment=0&symbol=SZ300816&hl=0&source=all&sort=&page=1&q=&type=11&session_token=null&access_token=db48cfe87b71562f38e03269b22f459d974aa8ae'
scrape(url)
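One caveat: the session_token and access_token baked into that URL came from my browser session and will presumably expire, so grab a fresh URL from the Network tab if it stops working. The query string also carries the paging parameters (count=10, page=1), so later comment pages can likely be fetched by stepping the page number; a minimal sketch, assuming the endpoint keeps the same shape:
# Hypothetical pagination: swap the page number in the query string
for page_no in range(1, 4):
    scrape(url.replace('page=1', f'page={page_no}'))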
I have very limited knowledge of web crawling/scraping and am trying to create a web crawler for this URL. However, when I try the usual printing of the response text from the server, I get this:
<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>
I don't think there's anything wrong with the code, as it works on other websites I've tried it on. I was hoping you good folks here could help me figure this out. This is just a hunch, but is it caused by the URL not ending in .xml?
import requests
url = 'https://phys.org/rss-feed/'
res = requests.get(url)
print(res.text[:500])
Try using BeautifulSoup and a header to make your request look like one from a real browser:
import requests
from bs4 import BeautifulSoup

URL = 'https://phys.org/rss-feed/'
# A browser-like User-Agent is what avoids the 400 response here
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}

resp = requests.get(URL, headers=headers)
soup = BeautifulSoup(resp.content, "lxml")  # the "lxml" parser requires the lxml package
print(soup)
Just masking alone also works:
import requests
URL='https://phys.org/rss-feed/'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
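Since the endpoint serves an RSS feed, the natural next step is to parse it as XML rather than HTML; a minimal sketch, assuming the standard RSS layout of <item> elements with <title> and <link> children:
import requests
from bs4 import BeautifulSoup

URL = 'https://phys.org/rss-feed/'
headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"}

resp = requests.get(URL, headers=headers)
feed = BeautifulSoup(resp.content, "xml")  # the "xml" feature also requires lxml

# Standard RSS: one <item> per article, each with <title> and <link> children
for item in feed.find_all("item"):
    print(item.title.get_text(), "-", item.link.get_text())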
I'm getting HTTPError: 400 for the code below; I don't understand why I'm not able to open the URL.
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd
import re
search_url = f'https://www.booking.com/reviewlist.en-gb.html?aid=304142&label=gen173nr-1DCAsoAkIbY2VudHJvLXlhcy1pc2xhbmQtYWJ1LWRoYWJpSDNYBGhsiAEBmAEJuAEGyAEM2AED6AEBiAIBqAIDuAKEwOrxBcACAQ&sid=61a721d17d76bc82ccf82c3c3d92de7c&cc1=ae&dist=1&pagename=centro-yas-island-abu-dhabi&srpvid=fee14d92dc160043&type=total&rows=10&offset=0'
page = requests.get(search_url)
print(page)
if page.status_code == requests.codes.ok:
    soup = BeautifulSoup(page.text, 'lxml')
    # get_property_attributes(soup)
else:
    print('open error')
Output: <Response [400]>
Please, can anyone give me some suggestions to overcome this issue?
Try adding a headers parameter to the request:
from bs4 import BeautifulSoup
import requests

search_url = 'https://www.booking.com/reviewlist.en-gb.html?aid=304142&label=gen173nr-1DCAsoAkIbY2VudHJvLXlhcy1pc2xhbmQtYWJ1LWRoYWJpSDNYBGhsiAEBmAEJuAEGyAEM2AED6AEBiAIBqAIDuAKEwOrxBcACAQ&sid=61a721d17d76bc82ccf82c3c3d92de7c&cc1=ae&dist=1&pagename=centro-yas-island-abu-dhabi&srpvid=fee14d92dc160043&type=total&rows=10&offset=0'

# A browser-like User-Agent is what turns the 400 into a 200 here
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'}

page = requests.get(search_url, headers=headers)
print(page)

if page.status_code == requests.codes.ok:
    soup = BeautifulSoup(page.text, 'lxml')
    # get_property_attributes(soup)
else:
    print('open error')
Output:
<Response [200]>
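Since the URL already exposes rows=10 and offset=0 paging parameters, further review pages can presumably be fetched by stepping the offset; a minimal sketch under that assumption, reusing search_url and headers from above:
# Hypothetical pagination: step the offset by the page size (rows=10)
for offset in range(0, 30, 10):
    paged_url = search_url.replace('offset=0', f'offset={offset}')
    resp = requests.get(paged_url, headers=headers)
    print(offset, resp.status_code)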
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myUrl = "https://mee6.xyz/levels/159962941502783488"
uClient = uReq(myUrl)
pageHtml = uClient.read()
print("pageHtml)
I'm trying to access a page to start scraping it, but the server replies that the request is forbidden (HTTP 403). I've looked at other answers, but they don't match how I'm writing my code.
You are being blocked because of your User-Agent. Try spoofing it like this:
from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

myUrl = "https://mee6.xyz/levels/159962941502783488"

# data=None keeps this a GET request; the header makes it look like a real browser
req = Request(
    myUrl,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
    }
)

uClient = uReq(req)
pageHtml = uClient.read()
print(pageHtml)
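Since BeautifulSoup is already imported (as soup), the HTML can then be handed to it directly; a minimal follow-up sketch:
page_soup = soup(pageHtml, "html.parser")
print(page_soup.title)  # quick sanity check that the parse worked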