I am trying to parse the comments on the page https://xueqiu.com/S/SZ300816, but I am not able to get them through the requests library:
>>> url = 'https://xueqiu.com/S/SZ300816'
>>> headers
{'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
>>> response = requests.get(url, headers=headers)
>>> from bs4 import BeautifulSoup as bs4
>>> soup = bs4(response.text)
>>> soup.findAll('article', {'class': "timeline__item"})
[]
>>>
Can someone please suggest what I am doing wrong? Thanks.
I got the URL from the Network tab of the Chrome developer tools; the data is loaded from this URL in JSON format. I tried to reproduce your problem — hope this helps.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
}

def scrape(url):
    with requests.Session() as req:
        req.headers.update(headers)
        r = req.get(url)
        mydata = r.json()
        print(mydata['list'][0])
        print(mydata['list'][0]['text'])
        print(mydata['list'][0]['description'])
url = 'https://xueqiu.com/query/v1/symbol/search/status?u=141606248084627&uuid=1331335789820403712&count=10&comment=0&symbol=SZ300816&hl=0&source=all&sort=&page=1&q=&type=11&session_token=null&access_token=db48cfe87b71562f38e03269b22f459d974aa8ae'
scrape(url)
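The endpoint's exact schema may change, but assuming each entry in mydata['list'] carries an HTML fragment in its text field (as the snippet above suggests), the comment bodies can be flattened to plain text with only the standard library — a minimal sketch, with a made-up sample payload shaped like the response:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the character data, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        self.parts.append(data)

def strip_html(fragment):
    """Remove HTML tags from a comment body, keeping only the text."""
    p = _TextExtractor()
    p.feed(fragment)
    return "".join(p.parts)

def extract_comments(payload):
    """Pull the plain-text body of each comment out of the JSON payload."""
    return [strip_html(item.get("text", "")) for item in payload.get("list", [])]

# A hypothetical payload with the same shape as the endpoint's response:
sample = {"list": [{"text": "<p>Great quarter for <b>SZ300816</b></p>"}]}
print(extract_comments(sample))  # → ['Great quarter for SZ300816']
```

This keeps the parsing logic testable without hitting the network; feed it r.json() in place of the sample dict.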
I'm attempting to scrape a web page. When executing this code, it outputs running1 but not running2. Why would this be the case?
Code:
from time import gmtime, strftime
import requests
from bs4 import BeautifulSoup
import smtplib
from email.mime.text import MIMEText
print("running1")
url = "https://www.johnlewis.com/nordictrack-commercial-14-9-elliptical-cross-trainer/p5639979"
response = requests.get(url)
print("running2")
soup = BeautifulSoup(response.text, 'lxml')
print("running3")
To get a correct response from the server, try specifying a User-Agent HTTP header:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
url = "https://www.johnlewis.com/nordictrack-commercial-14-9-elliptical-cross-trainer/p5639979"
response = requests.get(url, headers=headers)
print(response.text)
Prints:
<!DOCTYPE html><html lang="en"><head>
...
I have very limited knowledge of web crawling/scraping and am trying to create a web crawler for this URL. However, when I try the usual printing of the response text from the server, I get this:
<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>
I don't think there's anything wrong with the code, as it works on other websites I've tried it on. I was hoping you folks could help me figure this out. This is just a hunch, but is it caused by the URL not ending in .xml?
import requests
url = 'https://phys.org/rss-feed/'
res = requests.get(url)
print(res.text[:500])
Try using BeautifulSoup and a header to make your request look like one from a real browser:
import requests,lxml
from bs4 import BeautifulSoup
URL='https://phys.org/rss-feed/'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
soup = BeautifulSoup(resp.content, "lxml")
print(soup)
Just masking alone also works:
import requests
URL='https://phys.org/rss-feed/'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
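Once retrieved, the feed is ordinary RSS 2.0 XML, so the item titles can also be pulled out with just the standard library. A small sketch, assuming the usual <channel>/<item>/<title> layout (shown here against an inline sample rather than the live feed):

```python
import xml.etree.ElementTree as ET

def feed_titles(rss_xml):
    """Return the <title> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title") for item in root.iter("item")]

# Inline stand-in for the downloaded feed body (resp.text):
sample = """<rss version="2.0"><channel>
<item><title>First story</title></item>
<item><title>Second story</title></item>
</channel></rss>"""
print(feed_titles(sample))  # → ['First story', 'Second story']
```

Pass resp.text to feed_titles() once the masked request succeeds.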
I'm trying to open this website using Python's BeautifulSoup and urllib, but I keep getting a 403 error. Can someone help me with this error?
My current code is this:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'
uClient = uReq(my_url)
but I get the 403 error. I searched around and tried the approach below, but it gives me the same error:
from urllib.request import Request, urlopen
url="https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
Any help is appreciated.
Try using a requests session(), as below:
import requests
my_session = requests.session()
for_cookies = my_session.get("https://www.cubesmart.com")
cookies = for_cookies.cookies
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'
response = my_session.get(my_url, headers=headers, cookies=cookies)
print(response.status_code) # 200
I am using Python 3.5.2. I want to scrape a webpage where cookies are required, but when I use requests.session() the cookies maintained in the session are not updated, so my scraping fails constantly. The following is my code snippet:
import requests
from bs4 import BeautifulSoup
import time
import requests.utils
session = requests.session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"})
print(session.cookies.get_dict())
url = "http://www.beianbaba.com/"
session.get(url)
print(session.cookies.get_dict())
Do you have any idea about this? Thank you so much in advance.
It seems that the website is simply not setting any cookies in its response. I used the exact same code but requested https://google.com:
import requests
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"})
print(session.cookies.get_dict())
url = "http://google.com/"
session.get(url)
print(session.cookies.get_dict())
And got this output:
{}
{'NID': 'a cookie that i removed'}
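The jar mechanics can also be checked locally, without any network traffic: this sketch builds an http.cookiejar.Cookie by hand (the NID name and value are placeholders, not real Google values) to show that the jar faithfully stores whatever a server would send — so an empty get_dict() really does mean the server sent nothing:

```python
from http.cookiejar import CookieJar, Cookie

def make_cookie(name, value, domain):
    """Build a minimal Cookie object; most fields are left at safe defaults."""
    return Cookie(
        version=0, name=name, value=value, port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True, secure=False, expires=None,
        discard=True, comment=None, comment_url=None, rest={},
    )

jar = CookieJar()  # same class requests uses under the hood for session.cookies
jar.set_cookie(make_cookie("NID", "example-value", "google.com"))
print({c.name: c.value for c in jar})  # → {'NID': 'example-value'}
```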
import csv
import requests
from bs4 import BeautifulSoup

f = open('ala2009link.csv', 'r')
s = open('2009alanews.csv', 'w')
for row in csv.reader(f):
    url = row[0]
    print(url)
    res = requests.get(url)
    print(res.content)
    soup = BeautifulSoup(res.content, 'html.parser')
    print(soup)
    data = soup.find_all("article", {"class": "article-wrapper news"})
    #data = soup.find_all("main", {"class": "main-content"})
    for item in data:
        title = item.find_all("h2", {"class": "article-headline"})[0].text
        s.write("%s \n" % title)
    content = soup.find_all("p")
    for main in content:
        k = main.text.encode('utf-8')
        s.write("%s \n" % k)
        #k = csv.writer(s)
        #k.writerow('%s\n' % (main))
s.close()
f.close()
This is my code to extract data from a website, but I don't know why I can't extract anything. Is an ad-blocker warning blocking my BeautifulSoup?
This is an example link: http://www.rolltide.com/news/2009/6/23/Bert_Bank_Passes_Away.aspx?path=football
The reason no results are returned is that this website requires a User-Agent header in your request.
To fix this, add a headers parameter with a User-Agent to the requests.get() call, like so:
import requests

url = 'http://www.rolltide.com/news/2009/6/23/Bert_Bank_Passes_Away.aspx?path=football'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/29.0.1547.65 Chrome/29.0.1547.65 Safari/537.36',
}
res = requests.get(url, headers=headers)
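With the headers fix in place, the headline extraction itself can be sanity-checked offline. A stdlib-only sketch — the sample HTML mimics the page's article-headline markup, which is an assumption about the site's structure:

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect the text inside <h2 class="article-headline"> tags."""
    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_headline = False
    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "article-headline") in attrs:
            self._in_headline = True
    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_headline = False
    def handle_data(self, data):
        if self._in_headline:
            self.headlines.append(data.strip())

# Stand-in for res.text, shaped like the article markup the question targets:
sample = ('<article class="article-wrapper news">'
          '<h2 class="article-headline">Bert Bank Passes Away</h2></article>')
parser = HeadlineParser()
parser.feed(sample)
print(parser.headlines)  # → ['Bert Bank Passes Away']
```

Feeding parser.feed(res.text) would apply the same extraction to the real response.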