What's wrong with this get method call using BeautifulSoup? - python

I'm attempting to scrape a web page. When executing this code, it outputs running1 but not running2. Why would this be the case?
Code:
from time import gmtime, strftime
import requests
from bs4 import BeautifulSoup
import smtplib
from email.mime.text import MIMEText
print("running1")
url = "https://www.johnlewis.com/nordictrack-commercial-14-9-elliptical-cross-trainer/p5639979"
response = requests.get(url)
print("running2")
soup = BeautifulSoup(response.text, 'lxml')
print("running3")

To get a correct response from the server, try specifying a User-Agent HTTP header:
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
url = "https://www.johnlewis.com/nordictrack-commercial-14-9-elliptical-cross-trainer/p5639979"
response = requests.get(url, headers=headers)
print(response.text)
Prints:
<!DOCTYPE html><html lang="en"><head>
...
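With the header in place, the original script runs past the request. A minimal sketch combining the fix with the question's code (the timeout and the final title print are my own additions, not part of the original):
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
url = "https://www.johnlewis.com/nordictrack-commercial-14-9-elliptical-cross-trainer/p5639979"

print("running1")
# Without a timeout, the request can hang indefinitely if the server stalls it.
response = requests.get(url, headers=headers, timeout=10)
print("running2")
soup = BeautifulSoup(response.text, "lxml")
print("running3")
if soup.title:
    print(soup.title.get_text(strip=True))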

Related

Getting a JSON attribute from a URL

So here's my script:
import requests
import urllib
import json
url = 'https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560'
response = json.loads(requests.get(url).text)
print(response["offers"])
After grabbing the page source of https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560, I want to extract this data:
"offers":{"#type":"Offer","url":"https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560","priceCurrency":"USD","price":1449.95,"priceValidUntil":"4/7/2021","availability":"https://schema.org/InStock"}
More specifically, price and priceValidUntil
From some googling, I think this would be the way to do it, but since there's so much data within the webpage, my script takes a long time to run.
Is there a more efficient way of getting this JSON data, and am I grabbing it correctly?
You can use this example to load the JSON data embedded in the HTML page:
import json
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
url = "https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
data = json.loads(
    soup.select_one('script[type="application/ld+json"]').contents[0]
)
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
print("Price:", data["offers"]["price"])
print("Price valid until:", data["offers"]["priceValidUntil"])
Prints:
Price: 1449.95
Price valid until: 4/8/2021
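If a page ever carries more than one ld+json block, select_one can pick the wrong one. A defensive variant (my own addition, assuming the Offer data keeps the schema.org shape shown above):
for tag in soup.find_all("script", type="application/ld+json"):
    try:
        block = json.loads(tag.string or "")
    except json.JSONDecodeError:
        continue  # skip malformed or empty blocks
    if isinstance(block, dict) and "offers" in block:
        print("Price:", block["offers"]["price"])
        print("Price valid until:", block["offers"]["priceValidUntil"])
        break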

Unable to get the correct response page using the requests library

I am trying to parse the comments present on webpage https://xueqiu.com/S/SZ300816.
But I am not able to get them through the requests library:
>>> url = 'https://xueqiu.com/S/SZ300816'
>>> headers
{'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
>>> response = requests.get(url, headers=headers)
>>> from bs4 import BeautifulSoup as bs4
>>> soup = bs4(response.text)
>>> soup.findAll('article', {'class': "timeline__item"})
[]
>>>
Can someone please suggest what I am doing wrong? Thanks.
I got the URL from the Network tab of the Chrome development tools; the comment data is loaded from this URL in JSON format. I tried to resolve your problem; hope this helps.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
}

def scrape(url):
    with requests.Session() as req:
        req.headers.update(headers)
        r = req.get(url)
        mydata = r.json()
        # Each element of 'list' is one comment on the stock page.
        print(mydata['list'][0])
        print(mydata['list'][0]['text'])
        print(mydata['list'][0]['description'])

url = 'https://xueqiu.com/query/v1/symbol/search/status?u=141606248084627&uuid=1331335789820403712&count=10&comment=0&symbol=SZ300816&hl=0&source=all&sort=&page=1&q=&type=11&session_token=null&access_token=db48cfe87b71562f38e03269b22f459d974aa8ae'
scrape(url)
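Note that the query string embeds session-specific values (u, uuid, session_token, access_token), so the exact URL above will likely stop working. A hedged sketch of a more durable approach, assuming the endpoint accepts a cookie-bearing session without those tokens (an untested assumption):
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
}

def scrape_pages(symbol, pages=3):
    with requests.Session() as req:
        req.headers.update(headers)
        # Visit the HTML page first so the session picks up whatever
        # cookies xueqiu sets; the JSON endpoint may reject bare requests.
        req.get(f"https://xueqiu.com/S/{symbol}")
        for page in range(1, pages + 1):
            api = ("https://xueqiu.com/query/v1/symbol/search/status"
                   f"?count=10&comment=0&symbol={symbol}&source=all&page={page}&type=11")
            data = req.get(api).json()
            for item in data.get("list", []):
                print(item.get("text", ""))

scrape_pages("SZ300816")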

Amazon.com returns status 503

I am trying to get https://www.amazon.com content with the Python Requests library, but I get a server error instantly. Here is the code:
import requests
response = requests.get('https://www.amazon.com')
print(response)
This code returns <Response [503]>. Can anyone tell me why this is happening and how to fix it?
Amazon requires that you specify a User-Agent HTTP header to get a 200 response:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
response = requests.get('https://www.amazon.com', headers=headers)
print(response)
Prints:
<Response [200]>
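Amazon throttles aggressively, so occasional 503s can appear even with a proper User-Agent. If you want requests to retry those automatically, urllib3's Retry helper can be mounted on a session (the retry count and backoff below are arbitrary choices of mine, not anything Amazon documents):
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}

session = requests.Session()
# Retry up to 3 times on 503, sleeping 1s, 2s, 4s between attempts.
retries = Retry(total=3, status_forcelist=[503], backoff_factor=1)
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get('https://www.amazon.com', headers=headers)
print(response)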
Try this:
import requests
headers = {'User-Agent': 'Mozilla 5.0'}
response = requests.get('https://www.amazon.com', headers=headers)
print(response)
You are printing the response object itself, not the content you want.
The code should be like this:
import requests
response = requests.get('https://www.amazon.com')
print(response.content)
You can also use .json(), .status_code, or .text in place of .content.

Python Requests Error 400 Browser Sent An Invalid Request

I have very limited knowledge of web crawling/scraping and am trying to create a web crawler for this URL. However, when I try the usual printing of the response text from the server, I get this:
<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>
I don't think there's anything wrong with the code, as it works on other websites I've tried it on. I was hoping you good folks here could help me figure this out. And this is just a hunch, but is this caused by the URL not ending in .xml?
import requests
url = 'https://phys.org/rss-feed/'
res = requests.get(url)
print(res.text[:500])
Try using BeautifulSoup and a header to mask your request like a real one:
import requests
from bs4 import BeautifulSoup
URL='https://phys.org/rss-feed/'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
soup = BeautifulSoup(resp.content, "lxml")
print(soup)
Just masking alone also works:
import requests
URL='https://phys.org/rss-feed/'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
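Since https://phys.org/rss-feed/ returns an RSS document, you can also parse it as XML once the request succeeds. A short sketch (assuming the feed uses the standard RSS item/title layout; requires lxml for the "xml" parser):
import requests
from bs4 import BeautifulSoup

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}

resp = requests.get("https://phys.org/rss-feed/", headers=headers)
soup = BeautifulSoup(resp.content, "xml")  # XML parser, not HTML

# Print the headline of each feed entry.
for item in soup.find_all("item"):
    print(item.title.get_text(strip=True))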

Python Web Scrape - 403 Error

I'm trying to open this website using Python, BeautifulSoup, and urllib, but I keep getting a 403 error. Can someone guide me through this error?
My current code is this;
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'
uClient = uReq(my_url)
but I get the 403 error.
I searched around and tried using the approach below, but it too is giving me the same error.
from urllib.request import Request, urlopen
url="https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
Any help is appreciated.
Try using session() from requests, as below:
import requests
my_session = requests.session()
for_cookies = my_session.get("https://www.cubesmart.com")
cookies = for_cookies.cookies
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'
response = my_session.get(my_url, headers=headers, cookies=cookies)
print(response.status_code) # 200
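From here the page can be handed to BeautifulSoup as usual; a quick sanity check (my addition) that real content came back:
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text(strip=True))  # page title confirms we got past the 403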
