python error 403 while scraping angle list website - python

I have watched other questions on stakeoverflow regarding HTTP 403 error however, have not found solution there.
i would like to change error from 403 to 200
trying to scrape this url https://angel.co/startups.
import requests
import random
my_session = requests.session()
for_cookies = my_session.get('https://angel.co/startups')
cookies = for_cookies.cookies
user_agents_list = [
'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko)
Mobile/15E148',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/99.0.4844.83 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/99.0.4844.51 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/105.0.0.0 Safari/537.36',
]
response = my_session.get('https://angel.co/startups',cookies=cookies, headers={'User-Agent':
random.choice(user_agents_list)})
print(response.text)
response.status_code #403
while running this code i am getting 403 error and instead of whole HTML page.
apart from that, i successfully managed to scrape 1st page using cloudscraper however, no idea how to scraper another pages.
page format 1,2,3...2500

It may be due to cloudflare protection or some sort of protection.
So, use cloudscraper to bypass it.
import cloudscraper
url = "https://angel.co/startups"
scraper = cloudscraper.create_scraper()
response = scraper.get(url)
text = response.text
print(response.status_code)
Output
200

Related

Python IndexError: list index out of range. Can someone help me?

Can someone help me to solve my problem in this code?
CODE:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/83.0.4103.61 Safari/537.36"}
url = "https://www.amazon.com/RUNMUS-Surround-Canceling-Compatible-Controller/dp/B07GRM747Y"
resp = requests.get(url, headers=headers)
s = BeautifulSoup(resp.content, features='lxml')
product_title = s.select("#productTitle")[0].get_text().strip()
print(product_title)
If you try to print what you get as response, you will not encounter related errors.
import requests
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"}
url = "https://www.amazon.com/RUNMUS-Surround-Canceling-Compatible-Controller/dp/B07GRM747Y"
resp = requests.get(url, headers=headers)
print(resp.content)
The output you are getting from this request:
b'<!--\n To discuss automated access to Amazon data please contact api-services-support#amazon.com.\n For information about migrating to our APIs refer to our Marketplace APIs...
The site you are sending requests is not allowing you to access content with provided headers. So your s.select("#productTitle") creates empty list therefore you are getting an index error.

Python: ConnectionError: 'Connection aborted' when scraping specific websites

I'm trying to scrape this website:
https://www.footpatrol.com/
However it seems like the website denies my scraping attempt.
Using headers did not help.
from bs4 import BeautifulSoup
import requests
url = "https://www.footpatrol.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers = headers)
data = r.text
soup = BeautifulSoup(data, 'lxml')
for a in soup.find_all():
print(a)
This leads to me getting the ConnectionError, how can I fix my code so I can scrape the site?
I'm able to get a response by changing the User Agent to:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
and the following User Agent also works:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
It seems that the Chrome version is the culprit in your User Agent.

Not able to generate request object

r = request.get(url='https://www.zomato.com')
It gives a time out error and python interpreter just does not respond . I have tried some other sites and it works for them , but for this site it does not . Why is that ?
Usually, provide a User-Agent header may help a lot when trying to get a web page, which makes it more like a browser visit:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063'
}
r = requests.get('https://www.zomato.com', headers=headers)

Python requests not responding

Done a few small successful projects, been struggling to get the requests from this website fro ages - any tips?
UPDATE - Would like to get full beautiful soup request so I can start scraping the information from the tables
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.transfermarkt.co.uk/championship/marktwerte/wettbewerb/GB2")
soup = BeautifulSoup(r.content,"html.parser")
print soup
returning
<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</hr></body>
</html>
You need to pretend to be a real user with a browser and provide a User-Agent header:
r = requests.get("http://www.transfermarkt.co.uk/championship/marktwerte/wettbewerb/GB2", headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
})
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> r = requests.get("http://www.transfermarkt.co.uk/championship/marktwerte/wettbewerb/GB2", headers={
... "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
... })
>>> soup = BeautifulSoup(r.content,"html.parser")
>>> print(soup.title.get_text())
Top market values 15/16 - Championship - Transfermarkt
There are some sites where requests fail to give response as many of them track if the request originating party is a browser or a bot.
So, let us look like a browser.
It can be done by modifying the header as follows:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36', "Upgrade-Insecure-Requests": "1","DNT": "1","Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Language": "en-US,en;q=0.5","Accept-Encoding": "gzip, deflate"}
And then, just simply add this header your GET request as follow:
response = requests.get("https://example.com",headers=headers)
In total you will get:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36', "Upgrade-Insecure-Requests": "1","DNT": "1","Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Language": "en-US,en;q=0.5","Accept-Encoding": "gzip, deflate"}
response = requests.get("https://example.com",headers=headers)

Python requests vs. robots.txt

I have a script meant for personal use that scrapes some websites for information and until recently it worked just fine, but it seems one of the websites buffed up its security and I can no longer get access to its contents.
I'm using python with requests and BeautifulSoup to scrape the data, but when I try to grab the content of the website with requests, I run into the following:
'<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><iframe src="/_Incapsula_Resource?CWUDNSAI=9_4E402615&incident_id=133000790078576866-343390778581910775&edet=12&cinfo=4bb304cac75381e904000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 133000790078576866-343390778581910775</iframe></html>'
I've done a bit of research, and it looks like this is what's stopping me: http://www.robotstxt.org/meta.html
Is there any way I can convince the website that I'm not a malicious robot? This is a script I run ~1 time per day on a single bit of source, so I'm not really a burden on their servers by any means. Just someone with a script to make things easier :)
EDIT: Tried switching to mechanize and ignoring robots.txt that way, but I'm not getting a 403 Forbidden response. I suppose they have changed their stance on scraping and have not updated their TOS yet. Time to go to Plan B, by no longer using the website unless anyone has any other ideas.
What is most likely happening is the Server is checking the user-agent and denying access to the default user-agent used by bots.
For example requests sets the user-agent to something like python-requests/2.9.1
You can specify the headers your self.
url = "https://google.com"
UAS = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
"Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
)
ua = UAS[random.randrange(len(UAS))]
headers = {'user-agent': ua}
r = requests.get(url, headers=headers)

Categories