I have very limited knowledge of web crawling/scraping and am trying to create a web crawler for this URL. However, when I print the response text from the server as usual, I get this:
<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>
I don't think there's anything wrong with the code as it works on other websites I've tried it on. Was hoping you good folks here could help me figure this out. And this is just a hunch, but is this caused by the url not ending in a .xml?
import requests
url = 'https://phys.org/rss-feed/'
res = requests.get(url)
print(res.text[:500])
Try sending a browser-like User-Agent header so the request looks like it comes from a real browser, and parse the result with BeautifulSoup:
import requests,lxml
from bs4 import BeautifulSoup
URL='https://phys.org/rss-feed/'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
soup = BeautifulSoup(resp.content, "lxml")
print(soup)
Sending the header alone, without BeautifulSoup, also works:
import requests
URL='https://phys.org/rss-feed/'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
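Once the feed comes back with a 200, you may not need BeautifulSoup at all. Here is a minimal sketch, assuming the feed follows the usual RSS 2.0 layout (`<item>` elements with `<title>` and `<link>` children). The sample XML and the `parse_items` helper are made up for illustration; in practice you would pass `resp.content` instead of the sample.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for resp.content; a real feed has more fields.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title><link>https://example.com/1</link></item>
    <item><title>Second story</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


items = parse_items(SAMPLE_RSS)
for title, link in items:
    print(title, "->", link)
```

The channel's own `<title>` is skipped because only `<item>` elements are visited.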
I am trying to get the URL that this link redirects to. When I click the link I land on another page, and I want to store that page's URL. I tried the requests module, but it did not return a usable response.
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)'}
url = 'https://www.forexfactory.com/news/403059-manufacturing-in-us-expands-after-reaching-three-year-low/hit'
response = requests.get(url, headers=headers)
print(response) # Output: <Response [503]>
So, how can I extract this link?
You can use cloudscraper to get past the Cloudflare check and follow the redirect:
import cloudscraper
scraper = cloudscraper.create_scraper()
url = 'https://www.forexfactory.com/news/403059-manufacturing-in-us-expands-after-reaching-three-year-low/hit'
r = scraper.get(url)
print(r.url)
You can use the requests library:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)'}
url = 'https://www.forexfactory.com/news/403059-manufacturing-in-us-expands-after-reaching-three-year-low/hit'
response = requests.get(url, headers=headers)
print(response.url)
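To see why reading the URL off the response works, here is a small self-contained sketch. It is not the Forex Factory setup: it starts a toy local server (the `/hit` and `/final` paths and the `resolve_final_url` helper are made up) and uses the standard library's `geturl()`, which reports the final URL after redirects in the same way `response.url` does in requests.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectHandler(BaseHTTPRequestHandler):
    """Toy server: /hit redirects to /final, which returns a small body."""

    def do_GET(self):
        if self.path == "/hit":
            self.send_response(302)
            self.send_header("Location", "/final")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"landed"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def resolve_final_url(url):
    """Follow redirects and return the URL of the final response."""
    # ProxyHandler({}) ignores any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.geturl()


# Port 0 lets the OS pick a free port for the toy server.
server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

final_url = resolve_final_url(f"http://127.0.0.1:{port}/hit")
print(final_url)

server.shutdown()
server.server_close()
```

With requests, `response.history` additionally lists the intermediate redirect responses, which can help when debugging a chain of redirects.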
I am trying to parse the comments on the page https://xueqiu.com/S/SZ300816.
But I am not able to get them with the requests library:
>>> url = 'https://xueqiu.com/S/SZ300816'
>>> headers
{'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
>>> response = requests.get(url, headers=headers)
>>> from bs4 import BeautifulSoup as bs4
>>> soup = bs4(response.text)
>>> soup.findAll('article', {'class': "timeline__item"})
[]
>>>
Can someone please suggest what I am doing wrong? Thanks.
I found the URL below in the Network tab of Chrome's developer tools. The comments are loaded from it in JSON format, so you can request it directly. Hope this helps.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
}

def scrape(url):
    # The session sends the same headers on every request it makes.
    with requests.Session() as req:
        req.headers.update(headers)
        r = req.get(url)
        mydata = r.json()
        # Print the first entry, then its text and description fields.
        print(mydata['list'][0])
        print(mydata['list'][0]['text'])
        print(mydata['list'][0]['description'])

url = 'https://xueqiu.com/query/v1/symbol/search/status?u=141606248084627&uuid=1331335789820403712&count=10&comment=0&symbol=SZ300816&hl=0&source=all&sort=&page=1&q=&type=11&session_token=null&access_token=db48cfe87b71562f38e03269b22f459d974aa8ae'
scrape(url)
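If you want every comment rather than just the first one, something like the following might work. It is only a sketch: the sample payload is made up and mimics nothing beyond the `list`, `text` and `description` keys used above, and `extract_comments` is a hypothetical helper name.

```python
import json

# Made-up payload that only mimics the keys used in the answer above.
SAMPLE_JSON = """
{
  "list": [
    {"text": "first comment", "description": "first description"},
    {"text": "second comment"}
  ]
}
"""


def extract_comments(payload):
    """Return (text, description) pairs; missing keys become empty strings."""
    comments = []
    for entry in payload.get("list", []):
        comments.append((entry.get("text", ""), entry.get("description", "")))
    return comments


comments = extract_comments(json.loads(SAMPLE_JSON))
for text, description in comments:
    print(text, "|", description)
```

Using `.get` with a default avoids a `KeyError` when an entry lacks one of the fields; with the real response you would pass `r.json()` instead of the parsed sample.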
So I'm trying to extract the current EUR/USD price from a website using Python urllib but the website does not send the same HTML it sends to Chrome. The first part of the HTML is the same as on Chrome but it does not want to give me the EUR/USD value. Can I somehow bypass this?
Here's the code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
while True:
    req = Request('https://www.strategystocks.co.uk/currencies-market.html', headers={"User-Agent":'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
    html = urlopen(req).read()
    soup = BeautifulSoup(html, "html.parser")
    print(soup)
    # Prices are decimal values, so parse them as float rather than int.
    buy = float(soup.find("span", class_="buyPrice").text)
    sell = float(soup.find("span", class_="sellPrice").text)
    print("Buy", buy)
    print("Sell", sell)
The data is loaded via JavaScript, so it is not in the HTML you download, but you can simulate the Ajax request with the requests library:
import requests
url = 'https://marketools.plus500.com/Feeds/UpdateTable?instsIds=2&isUseSentiments=true'
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}
data = requests.get(url, headers=headers).json()
# print(data) # <-- uncomment this to print all data
print('Buy =',data['Feeds'][0]['B'])
print('Sell =',data['Feeds'][0]['S'])
Prints:
Buy = 1.08411
Sell = 1.08403
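If you would rather not hand-write the query string, you can build it from a dict. This is a minimal sketch using only the two parameters that appear in the URL above; the `build_feed_url` helper is made up, and whether the endpoint accepts other instrument IDs or `false` for the sentiments flag is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://marketools.plus500.com/Feeds/UpdateTable"


def build_feed_url(instrument_id, use_sentiments=True):
    """Build the feed URL from its two query parameters."""
    query = urlencode({
        "instsIds": instrument_id,
        # The original URL spells the flag as lowercase "true".
        "isUseSentiments": "true" if use_sentiments else "false",
    })
    return f"{BASE_URL}?{query}"


url = build_feed_url(2)
print(url)
# Round-trip check: split the query string back into a dict of lists.
print(parse_qs(urlsplit(url).query))
```

With requests you can get the same effect by passing the dict as `params=` to `requests.get`, which encodes it for you.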
I'm trying to open up this website using python beautifulsoup and urllib but I keep getting a 403 error. Can someone guide me with this error?
My current code is this:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'
uClient = uReq(my_url)
but I get the 403 error.
I searched around and tried using the approach below, but it too is giving me the same error.
from urllib.request import Request, urlopen
url="https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
Any help is appreciated.
Try using a requests session, so that the cookies set by the home page are sent along with the second request:
import requests
my_session = requests.session()
for_cookies = my_session.get("https://www.cubesmart.com")
cookies = for_cookies.cookies
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'
response = my_session.get(my_url, headers=headers, cookies=cookies)
print(response.status_code) # 200
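To illustrate why the first visit matters, here is a self-contained sketch with a toy local server. It is an assumption that the real site gates pages on a cookie in this way; the `visited` cookie, the paths and the helper names are all made up. The standard library's `HTTPCookieProcessor` plays the role that the session plays in requests.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGateHandler(BaseHTTPRequestHandler):
    """Toy server: / sets a cookie, other paths return 403 without it."""

    def do_GET(self):
        has_cookie = "visited=1" in self.headers.get("Cookie", "")
        if self.path == "/":
            self.send_response(200)
            self.send_header("Set-Cookie", "visited=1; Path=/")
        elif has_cookie:
            self.send_response(200)
        else:
            self.send_response(403)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_status(opener, url):
    """Return the HTTP status code, including error codes such as 403."""
    try:
        with opener.open(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


server = HTTPServer(("127.0.0.1", 0), CookieGateHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

no_proxy = urllib.request.ProxyHandler({})  # ignore environment proxies
plain = urllib.request.build_opener(no_proxy)
session = urllib.request.build_opener(
    no_proxy, urllib.request.HTTPCookieProcessor(CookieJar())
)

status_without_cookie = fetch_status(plain, base + "/page")
fetch_status(session, base + "/")  # first visit stores the cookie
status_with_cookie = fetch_status(session, base + "/page")
print(status_without_cookie, status_with_cookie)

server.shutdown()
server.server_close()
```

A `requests.Session` stores cookies from earlier responses automatically, so passing `cookies=cookies` explicitly on the second call should not be necessary.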
I'm trying to get all the information that shows up under "Inspect" in a browser such as Chrome. Currently I can get the page source, but it doesn't contain everything that Inspect shows.
When I tried using
with urllib.request.urlopen(section_url) as url:
    html = url.read()
I got the following error message: "urllib.error.HTTPError: HTTP Error 403: Forbidden"
I'm assuming this is because the URL I'm trying to fetch is https rather than http, and I was wondering whether there is a specific way to get that information over https, since the normal methods aren't working.
Note: I've also tried this, but it didn't show me everything:
f = requests.get(url)
print(f.text)
You need to send a User-Agent header so that the server treats the request as coming from a browser rather than a robot. The 403 is not caused by https.
import urllib.request, urllib.error, urllib.parse
url = 'http://www.google.com' #Input your url
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
headers = { 'User-Agent' : user_agent }
req = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(req)
html = response.read()
response.close()
adapted from https://stackoverflow.com/a/3949760/6622817
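If you want to see the status code instead of a traceback when the server still refuses, you could wrap the same idea in a helper. This is a rough sketch: `build_request` and `fetch` are made-up names, the User-Agent string is just an example value, and `https://example.com/` is a placeholder. Nothing is fetched until you call `fetch` yourself.

```python
import urllib.error
import urllib.request

# Example value only; any current browser User-Agent string should do.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Return the decoded body, or None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=10) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        print(f"{url} returned HTTP {err.code}")
        return None


# Inspect the header without touching the network.
request = build_request("https://example.com/")
print(request.get_header("User-agent"))
```

Note that `Request` stores header names capitalized as `User-agent`, which is why `get_header` is called with that spelling.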