I have a script meant for personal use that scrapes some websites for information. Until recently it worked just fine, but it seems one of the websites has beefed up its security and I can no longer access its contents.
I'm using Python with requests and BeautifulSoup to scrape the data, but when I try to grab the content of the website with requests, I run into the following:
'<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><iframe src="/_Incapsula_Resource?CWUDNSAI=9_4E402615&incident_id=133000790078576866-343390778581910775&edet=12&cinfo=4bb304cac75381e904000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 133000790078576866-343390778581910775</iframe></html>'
I've done a bit of research, and it looks like this is what's stopping me: http://www.robotstxt.org/meta.html
Is there any way I can convince the website that I'm not a malicious robot? This is a script I run about once per day on a single bit of source, so I'm not really a burden on their servers by any means. Just someone with a script to make things easier :)
EDIT: Tried switching to mechanize and ignoring robots.txt that way, but I'm now getting a 403 Forbidden response. I suppose they have changed their stance on scraping but have not updated their TOS yet. Time to go to Plan B: no longer using the website, unless anyone has any other ideas.
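For reference, a minimal sketch of what "switching to mechanize and ignoring robots.txt" could look like; the user-agent string and URL below are placeholders, not the asker's actual values:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
# Send a browser-like User-Agent instead of mechanize's default
br.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1')]
response = br.open("https://example.com")  # placeholder URL
print(response.read()[:200])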
What is most likely happening is that the server is checking the User-Agent header and denying access to the default user agent used by bots and scraping libraries.
For example, requests sets the user agent to something like python-requests/2.9.1.
You can specify the headers yourself.
import random
import requests

url = "https://google.com"

# A pool of common desktop browser user agents to rotate through
UAS = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
       "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
       "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
       )

# Pick one at random and send it as the User-Agent header
ua = UAS[random.randrange(len(UAS))]
headers = {'user-agent': ua}
r = requests.get(url, headers=headers)
I have looked at other questions on Stack Overflow regarding the HTTP 403 error; however, I have not found a solution there.
I would like to get a 200 response instead of the 403.
I am trying to scrape this URL: https://angel.co/startups.
import requests
import random
my_session = requests.session()
for_cookies = my_session.get('https://angel.co/startups')
cookies = for_cookies.cookies
user_agents_list = [
    'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
]
response = my_session.get('https://angel.co/startups', cookies=cookies, headers={'User-Agent': random.choice(user_agents_list)})
print(response.text)
response.status_code #403
While running this code I am getting the 403 error instead of the whole HTML page.
Apart from that, I successfully managed to scrape the first page using cloudscraper; however, I have no idea how to scrape the other pages.
The pages are numbered 1, 2, 3 ... 2500.
It may be due to Cloudflare protection or some similar anti-bot protection.
So, use cloudscraper to bypass it.
import cloudscraper
url = "https://angel.co/startups"
scraper = cloudscraper.create_scraper()
response = scraper.get(url)
text = response.text
print(response.status_code)
Output
200
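As for the remaining pages (1, 2, 3 ... 2500), here is a minimal sketch assuming the listing accepts a page query parameter such as ?page=N; that parameter name is an assumption and should be checked against the site's actual pagination.

import cloudscraper

scraper = cloudscraper.create_scraper()
base_url = "https://angel.co/startups"

for page in range(1, 6):  # first five pages as a demo; the listing reportedly goes up to 2500
    # cloudscraper sessions behave like requests sessions, so params= works the same way
    response = scraper.get(base_url, params={"page": page})  # assumes a ?page=N parameter
    print(page, response.status_code, len(response.text))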
In my research I find several ways to create a fake User-Agent and also hide my User-Agent when making requests.
But what I'm actually looking for is how do I keep my User-Agent always up to date without needing to get the value manually.
The manual way is, for example, to search for it on Google.
Is there a correct method for this?
For example, currently the User-Agent I use to make requests is this:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
url = f'https://XXXXXXXXXXXXXXXXX.com'
response = requests.get(url, headers=headers).json()
But the Chrome version appearing there is already much older than the one I currently use. I'd like to know if there's a way to keep this up to date without having to manually modify it every time!
You can use fake-useragent:
https://github.com/hellysmile/fake-useragent
It has a lot of options and keeps the user agent strings updated.
from fake_useragent import UserAgent
ua = UserAgent()
ua.ie
# Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US);
ua.msie
# Mozilla/5.0 (compatible; MSIE 10.0; Macintosh; Intel Mac OS X 10_7_3; Trident/6.0)'
ua['Internet Explorer']
# Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)
ua.opera
# Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11
ua.chrome
# Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2'
ua.google
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13
ua['google chrome']
# Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11
ua.firefox
# Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1
ua.ff
# Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1
ua.safari
# Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25
# and the best one, random via real world browser usage statistic
ua.random
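A minimal sketch of plugging one of these strings into a requests call (the URL below is a placeholder):

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.random}  # a current, real-world user agent string
response = requests.get("https://example.com", headers=headers)  # placeholder URL
print(response.status_code)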
You can get your browser's user agent in JavaScript in the console with navigator.userAgent. As for automatically getting that value into your Python scripts, I'm not really aware of a way to do that.
I'm brand new to coding and have finished making a simple program to web scrape some stock websites for particular data. The simplified code looks like this:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Personal_User_Agent'}
fv = f"https://finviz.com/quote.ashx?t=JAGX"
r_fv = requests.get(fv, headers=headers)
soup_fv = BeautifulSoup(r_fv.text, 'html.parser')
fv_ticker_title = soup_fv.find('title')
print(fv_ticker_title)
The website would not work until I added a user agent, but then it worked fine. I then created a website through Python's local host, which also worked fine, so I thought I was ready to make the website public via PythonAnywhere.
However, when I went to make the website public, the program shuts down every time it goes to access information through web scraping (i.e. using the user agent). I didn't like the idea of using my own user agent on a public domain, but I couldn't find out how other people who web scrape handle this problem when a user agent is required on a public domain. Any advice!?
I would add some random headers to rotate through rather than my own headers. Something like this should work:
import random

import requests
from bs4 import BeautifulSoup

header_list = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_18_3) AppleWebKit/537.34 (KHTML, like Gecko) Chrome/82.0.412.92 Safari/539.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11.12; rv:87.0) Gecko/20170102 Firefox/78.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.32 (KHTML, like Gecko) Chrome/82.0.12.17 Safari/535.42'
]
fv = f"https://finviz.com/quote.ashx?t=JAGX"
user_agent = random.choice(header_list)
headers = {'User-Agent': user_agent}
r_fv = requests.get(fv, headers=headers)
soup_fv = BeautifulSoup(r_fv.text, 'html.parser')
fv_ticker_title = soup_fv.find('title')
print(fv_ticker_title)
Or, option 2: use a library called fake-headers to generate them off the cuff:
import requests
from bs4 import BeautifulSoup
from fake_headers import Headers

fv = f"https://finviz.com/quote.ashx?t=JAGX"
headers = Headers(os="mac", headers=True).generate()
r_fv = requests.get(fv, headers=headers)
soup_fv = BeautifulSoup(r_fv.text, 'html.parser')
fv_ticker_title = soup_fv.find('title')
print(fv_ticker_title)
Really depends on whether you want to use a library or not...
I'm learning how to use python requests (Python 3) and I am trying to make a simple requests.get to get the HTML code from several websites. Although it works for most of them, there is one I am having trouble with.
When I call http://es.rs-online.com/, everything works fine:
In [1]: import requests
   ...: html = requests.get("http://es.rs-online.com/")

In [2]: html
Out[2]: <Response [200]>
However, when I try it with http://es.farnell.com/, Python is unable to complete the request and keeps working on it forever. If I set a timeout, no matter how long, the requests.get() will always be interrupted by the timeout and by nothing else. I have also tried adding headers, but that didn't solve the issue. I also don't think the error has anything to do with the proxy I'm using, as I am able to open this website in my browser. Currently, my code looks like this:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
html = requests.get("http://es.farnell.com/", headers=headers, timeout=5, allow_redirects=True)
After 5 secs, I get the expected timeout notification.
ReadTimeout: HTTPConnectionPool(host='es.farnell.com', port=80): Read timed out. (read timeout=5)
Does anyone know what could be the issue?
The problem is in your headers. Do remember that some sites are more lenient than others when it comes to the content of the headers you are sending. In order to fix the issue, you should replace your current headers with:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
    'Upgrade-Insecure-Requests': '1',
    'DNT': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
}
I would also recommend sending the GET request to https://es.farnell.com/ rather than http://es.farnell.com/, removing timeout=5, and removing allow_redirects=True (as it is True by default).
All in all, your code should look like this:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
    'Upgrade-Insecure-Requests': '1',
    'DNT': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
}
html = requests.get("https://es.farnell.com", headers=headers)
Hope this helps.
import requests

r = requests.get(url='https://www.zomato.com')
It gives a timeout error and the Python interpreter just does not respond. I have tried some other sites and it works for them, but for this site it does not. Why is that?
Usually, providing a User-Agent header helps a lot when trying to get a web page, as it makes the request look more like a browser visit:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063'
}
r = requests.get('https://www.zomato.com', headers=headers)