If I go to http://boxinsider.cratejoy.com/feed/ I can see the XML just fine. But when I try to access it using python requests, I get a 403 error.
import requests
blog_url = 'http://boxinsider.cratejoy.com/feed/'
headers = {'Accept': 'text/html,application/xml'}
blog_request = requests.get(blog_url, timeout=10, headers=headers)
Any ideas on why?
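Because it's hosted by WPEngine, and they filter user agents. By default requests identifies itself as python-requests/<version>, which is presumably what gets rejected here.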
Try this:
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
requests.get('http://boxinsider.cratejoy.com/feed/', headers={'User-agent': USER_AGENT})
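To make the mechanism concrete, here is a minimal, self-contained sketch of user-agent filtering. It does not reproduce WPEngine's actual rules, which are not known here: the toy local server, its "starts with python" rule, and the helper name status_for are all assumptions made for illustration. It uses only the standard library (urllib rather than requests), whose default user agent also begins with "Python".

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Browser-like user agent taken from the answer above.
BROWSER_UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/42.0.2311.152 Safari/537.36")


class FilteringHandler(BaseHTTPRequestHandler):
    """Toy server (hypothetical rule): 403 for any User-Agent starting with 'python'."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        status = 403 if agent.lower().startswith("python") else 200
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


# Bypass any proxy configured in the environment; the server is local.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def status_for(url, user_agent=None):
    """Return the HTTP status of a GET, optionally overriding the User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


# Port 0 lets the OS pick a free port.
server = HTTPServer(("127.0.0.1", 0), FilteringHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/feed/" % server.server_port

default_status = status_for(url)              # default "Python-urllib/3.x" agent
browser_status = status_for(url, BROWSER_UA)  # browser-like agent

server.shutdown()
server.server_close()
print(default_status, browser_status)
```

The same request is refused or accepted depending only on the User-Agent header, which is why the page loads in a browser but not from a script.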
Related
I am trying to use requests to make the same API call that this page makes: https://www.betonline.ag/sportsbook/martial-arts/mma
requests.post(
url='https://api.betonline.ag/offering/api/offering/sports/offering-by-league',
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15'},
json={"Sport":"martial-arts","League":"mma","ScheduleText":None,"Period":0}
)
I have also tried including all the headers I see in the image but am still unable to get a 200 and get the response.
What am I missing?
I tried to read a website using Python requests.
It sometimes succeeds and sometimes fails.
Here is my code:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = 'https://hn.house.ifeng.com/homedetail/25762.shtml'
res = requests.get(url, headers = headers, timeout = 10, verify=False)
What could the reasons be, and how can I solve it?
Thank you very much.
I want to scrape https://health.usnews.com/doctors/specialists-index, but when I send a request to this site from a scrapy spider, the response has status code 403. I added a user_agent to my request, but it still does not work.
I referred to these two answers, "Python Doesn't Have Permission To Access On This Server / Return City/State from ZIP" and "403:You don't have permission to access /index.php on this server", but they did not work for me.
My user_agent is Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36. Can someone help me scrape the site mentioned above?
Try adding 'authority' to the headers as well. The following works for me in the scrapy shell:
from scrapy import Request
headers = {
'authority': 'health.usnews.com',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
url = "https://health.usnews.com/doctors/specialists-index"
req = Request(url, headers=headers)
fetch(req)
This is the first time I am trying requests.post() because I have always used requests.get(). So I'm trying to navigate to a website and search. I am using yellowpages.com, and before I get negative feedback about using the site to scrape or about an API, I just want to try it out. The problem I am running into is that it spits out some html that isn't remotely what I am looking for. I'll post my code below to show you what I am talking about.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://www.yellowpages.com"
search_terms = "Car Dealership"
location = "Jackson, MS"
q = {'search_terms': search_terms, 'geo_locations_terms': location}
page = requests.post(url, headers=headers, params=q)
print(page.text)
Your request boils down to
$ curl -X POST \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36' \
'https://www.yellowpages.com/?search_terms=Car Dealership&geo_locations_terms=Jackson, MS'
For this the server returns a 502 Bad Gateway status code.
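The reason is that you are combining POST with query parameters (params). params appends the fields to the URL's query string and leaves the request body empty, whereas a server handling a POSTed form usually expects the fields in the body. Use data instead: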
requests.post(url, headers=headers, data=q)
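To see where the fields end up in each case, here is a small offline sketch. It uses the standard library rather than requests, so it only illustrates the query-string versus body distinction and is not what requests does internally; the names as_params and as_data are made up for the example, and nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "https://www.yellowpages.com"
q = {"search_terms": "Car Dealership", "geo_locations_terms": "Jackson, MS"}
encoded = urlencode(q)  # form-encodes the fields (spaces become '+')

# Rough equivalent of params=q: fields go into the URL, the body stays empty.
as_params = Request(url + "/?" + encoded, method="POST")

# Rough equivalent of data=q: the URL is untouched, fields travel in the body.
as_data = Request(url, data=encoded.encode("ascii"), method="POST")

print(as_params.full_url)  # URL carrying the query string
print(as_params.data)      # no body
print(as_data.full_url)    # plain URL
print(as_data.data)        # form-encoded body
```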
I'm trying to access a page using the following
page = urllib2.urlopen(full_url)
soup = BeautifulSoup(page, 'html.parser')
li_post_id = "post-" + str(post_id)
li_soup = soup.find('li', attrs={'id':li_post_id})
This works fine on my Ubuntu machine, but when I run it on my Windows server I get a 403 Forbidden error, so I assume the issue is the user agent.
How do I change this, say, to Firefox? I have only seen tutorials to change the user agent using requests, but I don't want to change all of my code to this.
You could try this.
import random
import requests
from bs4 import BeautifulSoup

# Pool of browser-like user agents to choose from
agents = [
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko)',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)',
    'Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)',
]
# Send a randomly chosen user agent with the request
headers = {"User-Agent": random.choice(agents)}
response = requests.get(full_url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
Changing the header doesn't have anything to do with BeautifulSoup. It is meant for HTML parsing only. You need to change it in your urllib request like so:
Python 3
import urllib.request
# build_opener() returns an opener; its addheaders apply to every request it makes
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Some user agent')]
response = opener.open('http://www.stackoverflow.com')
Python 2.7
import urllib2
# build_opener() returns an opener; its addheaders apply to every request it makes
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Some user agent')]
response = opener.open('http://www.stackoverflow.com')
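If you would rather set the header for a single request than for a whole opener, a Request object also accepts a headers dict. The sketch below is for Python 3 and only builds the request, so it runs offline. The Firefox-like string is an illustrative placeholder, not a value from the question. In Python 2.7 the equivalent call should be urllib2.Request(full_url, headers=...).

```python
import urllib.request

# Illustrative Firefox-like user agent (an assumption; use any string you need).
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

full_url = "http://www.stackoverflow.com"
req = urllib.request.Request(full_url, headers={"User-Agent": USER_AGENT})

# Request stores header names capitalized, hence the "User-agent" lookup key.
sent_agent = req.get_header("User-agent")
print(sent_agent)

# To actually fetch the page, pass the request to urlopen and keep the rest
# of the BeautifulSoup code unchanged:
# page = urllib.request.urlopen(req)
```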