I want to scrape https://health.usnews.com/doctors/specialists-index, but when I send a request to this site from a Scrapy spider, it returns status code 403. I added a user agent to my request, but it still doesn't work.
I referred to these two answers, "Python Doesn't Have Permission To Access On This Server / Return City/State from ZIP" and "403: You don't have permission to access /index.php on this server", but they didn't work for me.
My user agent is Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36. Can someone help me scrape the site mentioned above?
Try adding 'authority' to the headers as well. The code below works for me in the Scrapy shell:
from scrapy import Request
headers = {
'authority': 'health.usnews.com',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
url = "https://health.usnews.com/doctors/specialists-index"
req = Request(url, headers=headers)
fetch(req)
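Since the question is about a spider rather than the shell, the same headers can also be applied project-wide through Scrapy settings. This is a minimal sketch, assuming a standard project settings.py; USER_AGENT and DEFAULT_REQUEST_HEADERS are existing Scrapy settings, but whether these headers are enough to avoid the 403 on this site is not guaranteed.

```python
# settings.py (sketch): send the headers from the answer above on every request.

# Picked up by Scrapy's built-in UserAgentMiddleware.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
)

# Applied by DefaultHeadersMiddleware to any request that does not already
# set the same header. Overriding this setting replaces the whole dict, so
# the Accept headers are listed again here (assumed typical values).
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "authority": "health.usnews.com",
}
```

Headers passed directly to a Request, as in the shell example, still take precedence over these defaults.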
Related
I have looked at other questions on Stack Overflow about the HTTP 403 error but have not found a solution there.
I would like to get a 200 response instead of the 403.
I am trying to scrape this URL: https://angel.co/startups.
import requests
import random
my_session = requests.session()
for_cookies = my_session.get('https://angel.co/startups')
cookies = for_cookies.cookies
user_agents_list = [
    'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
]
response = my_session.get('https://angel.co/startups', cookies=cookies,
                          headers={'User-Agent': random.choice(user_agents_list)})
print(response.text)
response.status_code  # 403
When I run this code, I get a 403 error instead of the whole HTML page.
Apart from that, I managed to scrape the first page using cloudscraper, but I have no idea how to scrape the other pages.
The pages are numbered 1, 2, 3, ... 2500.
This may be due to Cloudflare or some other kind of bot protection.
If so, cloudscraper can be used to get past it:
import cloudscraper
url = "https://angel.co/startups"
scraper = cloudscraper.create_scraper()
response = scraper.get(url)
text = response.text
print(response.status_code)
Output
200
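The question also asks how to reach pages 1 to 2500, which the answer above does not cover. The following is a rough sketch only: the "page" query parameter is a hypothetical name (the real site may paginate differently, for example through an API), and page_urls and scrape_pages are illustrative helpers, not part of cloudscraper. The fetch function is passed in, so scraper.get from the answer can be used as is.

```python
import time
from urllib.parse import urlencode


def page_urls(base_url, first, last, param="page"):
    """Yield one URL per page number; `param` is a hypothetical query key."""
    for number in range(first, last + 1):
        yield f"{base_url}?{urlencode({param: number})}"


def scrape_pages(get, urls, delay=1.0):
    """Fetch each URL with `get` and return {url: html} for 200 responses."""
    pages = {}
    for url in urls:
        response = get(url)
        if response.status_code != 200:
            # Stop at the first non-200 instead of hammering a blocked endpoint.
            break
        pages[url] = response.text
        # Pause between requests to keep the load on the server low.
        time.sleep(delay)
    return pages
```

With the scraper created above, this could be called as scrape_pages(scraper.get, page_urls("https://angel.co/startups", 1, 2500)).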
I am trying to use requests to make the API call that this page makes: https://www.betonline.ag/sportsbook/martial-arts/mma.
import requests
requests.post(
    url='https://api.betonline.ag/offering/api/offering/sports/offering-by-league',
    headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15'},
    json={"Sport": "martial-arts", "League": "mma", "ScheduleText": None, "Period": 0},
)
I have also tried including all the headers I see in the image but am still unable to get a 200 and get the response.
What am I missing?
Goal:
I am trying to scrape the HTML from this page: https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=.
(note - I will eventually want to paginate and scrape all job listings from this page)
My issue:
I get a 503 error when I try to scrape the page using Python and Requests. I am working out of Google Colab.
Initial Code:
import requests
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = requests.get(url)
print(response)
Attempted solutions:
Using 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
Implementing this code I found in another thread:
import requests
def getUrl(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    }
    res = requests.get(url, headers=headers)
    res.raise_for_status()
getUrl('https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d=')
I am able to access the website via my browser.
Is there anything else I can try?
Thank you
That page is protected by Cloudflare. There are a few options for trying to bypass it; using cloudscraper seems to work:
import cloudscraper
scraper = cloudscraper.create_scraper()
url = 'https://www.doherty.jobs/jobs/search?q=&l=&lat=&long=&d='
response = scraper.get(url).text
print(response)
In order to use it, you'll need to install it:
pip install cloudscraper
If I go to http://boxinsider.cratejoy.com/feed/ I can see the XML just fine. But when I try to access it using python requests, I get a 403 error.
blog_url = 'http://boxinsider.cratejoy.com/feed/'
headers = {'Accepts': 'text/html,application/xml'}
blog_request = requests.get(blog_url, timeout=10, headers=headers)
Any ideas on why?
Because it's hosted by WPEngine and they filter user agents.
Try this:
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36"
requests.get('http://boxinsider.cratejoy.com/feed/', headers={'User-agent': USER_AGENT})
I have a script meant for personal use that scrapes some websites for information. Until recently it worked just fine, but it seems one of the websites has beefed up its security, and I can no longer access its contents.
I'm using python with requests and BeautifulSoup to scrape the data, but when I try to grab the content of the website with requests, I run into the following:
'<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><iframe src="/_Incapsula_Resource?CWUDNSAI=9_4E402615&incident_id=133000790078576866-343390778581910775&edet=12&cinfo=4bb304cac75381e904000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 133000790078576866-343390778581910775</iframe></html>'
I've done a bit of research, and it looks like this is what's stopping me: http://www.robotstxt.org/meta.html
Is there any way I can convince the website that I'm not a malicious robot? This is a script I run ~1 time per day on a single bit of source, so I'm not really a burden on their servers by any means. Just someone with a script to make things easier :)
EDIT: Tried switching to mechanize and ignoring robots.txt that way, but I'm now getting a 403 Forbidden response. I suppose they have changed their stance on scraping and have not updated their TOS yet. Time to go to Plan B and stop using the website, unless anyone has any other ideas.
What is most likely happening is that the server is checking the user agent and denying access to the default user agents used by bots.
For example, requests sets the user agent to something like python-requests/2.9.1.
You can specify the headers yourself.
import random
import requests
url = "https://google.com"
# Browser user-agent strings to choose from
UAS = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
    "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
)
# Pick one at random and send it instead of the python-requests default
ua = random.choice(UAS)
headers = {'user-agent': ua}
r = requests.get(url, headers=headers)
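Changing the user agent may not be enough on its own, and a block page like the Incapsula one quoted in the question is still HTML, so it helps to check what actually came back before handing it to BeautifulSoup. A small sketch of such a check follows; looks_blocked is a hypothetical helper, the marker strings are copied from the response in the question, and treating 403 and 503 as blocks is an assumption based on the errors reported in these threads.

```python
# Status codes reported as blocks in the questions above (assumption).
BLOCK_STATUSES = {403, 503}
# Substrings taken from the Incapsula response quoted in the question.
BLOCK_MARKERS = ("_Incapsula_Resource", "Incapsula incident ID")


def looks_blocked(status_code, body):
    """Return True if the response looks like a block page, not real content."""
    if status_code in BLOCK_STATUSES:
        return True
    return any(marker in body for marker in BLOCK_MARKERS)
```

After the request above, this would be used as looks_blocked(r.status_code, r.text), skipping the parsing step when it returns True.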