Unable to read website intermittently with requests - python

I tried to read a website using Python requests.
However, it sometimes succeeds but sometimes fails.
Here is my code:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = 'https://hn.house.ifeng.com/homedetail/25762.shtml'
res = requests.get(url, headers = headers, timeout = 10, verify=False)
What are the reasons and how to solve it?
Thank you very much.

Related

Python request returning 403

I am trying to use requests to make an api call that this page is making https://www.betonline.ag/sportsbook/martial-arts/mma.
requests.post(
url='https://api.betonline.ag/offering/api/offering/sports/offering-by-league',
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15'}
json={"Sport":"martial-arts","League":"mma","ScheduleText":None,"Period":0}
)
I have also tried including all the headers I see in the image but am still unable to get a 200 and get the response.
What am I missing?

Python timeout error when trying to scrape sizes on product page

hi can anyone get this to work - I am trying to scrape sizes from an interactive dropdown selector but keep getting a timeout error
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
soup = BeautifulSoup(requests.get("https://www.asos.com/nike/nike-air max-95-logo-leather-trainers-in-dark-navy-orange/prd/20750072 colourwayid=60085113", timeout=60.0).content)
print([size.text.strip() for size in soup.find(class_="colour-size select")])
It's because you've forgot the parameter headers
Try again:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
soup = BeautifulSoup(requests.get("https://www.asos.com/nike/nike-air max-95-logo-leather-trainers-in-dark-navy-orange/prd/20750072 colourwayid=60085113",
timeout=60.0,
headers=headers).content)

Python: ConnectionError: 'Connection aborted' when scraping specific websites

I'm trying to scrape this website:
https://www.footpatrol.com/
However it seems like the website denies my scraping attempt.
Using headers did not help.
from bs4 import BeautifulSoup
import requests
url = "https://www.footpatrol.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers = headers)
data = r.text
soup = BeautifulSoup(data, 'lxml')
for a in soup.find_all():
print(a)
This leads to me getting the ConnectionError, how can I fix my code so I can scrape the site?
I'm able to get a response by changing the User Agent to:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
and the following User Agent also works:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
It seems that the Chrome version is the culprit in your User Agent.

How to change user agent urllib2

I'm trying to access a page using the following
page = urllib2.urlopen(full_url)
soup = BeautifulSoup(page, 'html.parser')
li_post_id = "post-" + str(post_id)
li_soup = soup.find('li', attrs={'id':li_post_id})
This works fine on my ubuntu machine, but when running it on my Windows server I get 403 Forbidden error, so I assume the issue is with the user agent.
How do I change this, say, to Firefox? I have only seen tutorials to change the user agent using requests, but I don't want to change all of my code to this.
You could try this.
import random
import requests, bs4
agents= [
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko)',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)',
'Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)']
headers = {"User-Agent":random.choice(agents)}
response = requests.get(full_url,headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
Changing the header doesn't have anything to do with BeautifulSoup. It is meant for HTML parsing only. You need to change it in your urllib request like so:
Python3
import urllib.request
req = urllib.request.build_opener()
req.addheaders = [('User-Agent', 'Some user agent')]
response = req.open('http://www.stackoverflow.com')
Python2.7
import urllib2
req = urllib2.build_opener()
req.addheaders = [('User-Agent', 'Some user agent')]
response = req.open('http://www.stackoverflow.com')

Python Requests Get XML

If I go to http://boxinsider.cratejoy.com/feed/ I can see the XML just fine. But when I try to access it using python requests, I get a 403 error.
blog_url = 'http://boxinsider.cratejoy.com/feed/'
headers = {'Accepts': 'text/html,application/xml'}
blog_request = requests.get(blog_url, timeout=10, headers=headers)
Any ideas on why?
Because it's hosted by WPEngine and they filter user agents.
Try this:
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
requests.get('http://boxinsider.cratejoy.com/feed/', headers={'User-agent': USER_AGENT})

Categories