Changing User Agent in Python 3 for urllib.request.urlopen

I want to open a url using urllib.request.urlopen('someurl'):
with urllib.request.urlopen('someurl') as url:
    b = url.read()
I keep getting the following error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I understand the error to be due to the site not letting Python access it, to stop bots wasting its network resources, which is understandable. I went searching and found that you need to change the user agent for urllib. However, all the guides and solutions I have found for changing the user agent are for urllib2, and since I am using Python 3 those solutions don't work.
How can I fix this problem with python 3?

From the Python docs:
import urllib.request

req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))

from urllib.request import urlopen, Request
urlopen(Request(url, headers={'User-Agent': 'Mozilla'}))
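Either form attaches the header to the Request object, which you can confirm offline before making any network call (a quick check, not part of the original answers; example.com is a placeholder):

```python
from urllib.request import Request

req = Request('http://example.com', headers={'User-Agent': 'Mozilla/5.0'})
# urllib normalizes header names with str.capitalize(), hence 'User-agent'
print(req.get_header('User-agent'))  # Mozilla/5.0
```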

I just answered a similar question here: https://stackoverflow.com/a/43501438/206820
In case you not only want to open the URL but also want to download a resource (say, a PDF file), you can use the code below:
from urllib.request import ProxyHandler, build_opener, install_opener, urlretrieve

# proxy = ProxyHandler({'http': 'http://192.168.1.31:8888'})
proxy = ProxyHandler({})  # empty handler: no proxy, direct connection
opener = build_opener(proxy)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30')]
install_opener(opener)
# file_url and file_name are defined elsewhere
result = urlretrieve(url=file_url, filename=file_name)
The reason I added the proxy is to be able to monitor the traffic in Charles.

The host site's rejection comes from the OWASP ModSecurity Core Rules for Apache mod_security. Rule 900002 has a list of "bad" user agents, and one of them is "python-urllib2". That's why requests with the default user agent fail.
Unfortunately, if you use Python's "robotparser" module,
https://docs.python.org/3.5/library/urllib.robotparser.html?highlight=robotparser#module-urllib.robotparser
it uses the default Python user agent when fetching robots.txt, and there's no parameter to change that. If robotparser's attempt to read robots.txt is refused (not merely URL not found), it then treats every URL on that site as disallowed.
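One workaround (a sketch; the fetch_robots helper and the UA string are illustrative, not from the original answers) is to fetch robots.txt yourself with a custom User-Agent and hand the lines to RobotFileParser.parse(), which itself never touches the network:

```python
import urllib.request
import urllib.robotparser

def fetch_robots(robots_url, user_agent='Mozilla/5.0'):
    # Fetch robots.txt with our own User-Agent instead of letting
    # RobotFileParser.read() send the default python-urllib one.
    rp = urllib.robotparser.RobotFileParser(robots_url)
    req = urllib.request.Request(robots_url, headers={'User-Agent': user_agent})
    with urllib.request.urlopen(req) as resp:
        rp.parse(resp.read().decode('utf-8').splitlines())
    return rp

# parse() works on plain lines, so the logic can be checked offline:
rp = urllib.robotparser.RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])
print(rp.can_fetch('Mozilla/5.0', 'http://example.com/private/x'))  # False
print(rp.can_fetch('Mozilla/5.0', 'http://example.com/public/x'))   # True
```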

Related

Web Scraping TooManyRedirects: Exceeded 30 redirects. requests_ip_rotator

import requests
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS

if __name__ == "__main__":
    # Create gateway object and initialise in AWS
    gateway = ApiGateway("https://spare.avspart.com", regions=EXTRA_REGIONS,
                         access_key_id='my key', access_key_secret='my secret key')
    gateway.start(force=True)

    # Execute from random IP
    session = requests.Session()
    # session.max_redirects = 100
    session.mount("https://spare.avspart.com", gateway)

    # setting User-Agent header
    session.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
    response = session.get("https://spare.avspart.com/catalog/case/64848/4534337/677993/")
    print(response.status_code)

    # Delete gateways
    gateway.shutdown()
I am trying to scrape the page "https://spare.avspart.com/catalog/case/64848/4534337/677993/" using requests-ip-rotator because I was blocked when using requests.get(), but when I try to access it I get a TooManyRedirects: Exceeded 30 redirects. error.
I have read through most of the posts on this problem and tried various things, such as changing session.max_redirects, trying different types of headers, and reaching out to the library's creator. The accepted answer for the same issue seems to solve the problem, but when I implement it in my code the issue persists.
It would be great if anyone has any recommendations for other things I can try.
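One way to see what is actually happening (a diagnostic sketch, not a fix; example.com stands in for the real URL, and show_redirect_chain is an illustrative helper) is to disable automatic redirect following and print each hop's Location header:

```python
import requests
from urllib.parse import urljoin

session = requests.Session()
session.max_redirects = 5  # fail fast instead of chasing 30 hops
session.headers['User-Agent'] = 'Mozilla/5.0'

def show_redirect_chain(session, url, limit=10):
    # Fetch without following redirects so each hop's Location is visible.
    resp = session.get(url, allow_redirects=False)
    while resp.is_redirect and limit > 0:
        # Location may be relative, so resolve it against the current URL
        target = urljoin(resp.url, resp.headers['Location'])
        print(resp.status_code, '->', target)
        resp = session.get(target, allow_redirects=False)
        limit -= 1
    return resp
```

A redirect loop between two URLs, or a redirect back to the same URL (often a cookie or fingerprinting check), shows up immediately in this output.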

Connection error in python-requests

I'm trying to scrape using BeautifulSoup with Anaconda for Python 3.6.
I am trying to scrape accuweather.com to find the weather in Tel Aviv.
This is my code:
from bs4 import BeautifulSoup
import requests
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854")
soup = BeautifulSoup(data.text, "html.parser")
soup.find('div', class_='info')
I get this error:
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))
What can I do and what does this error mean?
What does this error mean
Googling for "errno 10060" yields quite a few results. Basically, it's a low-level network error (it's not HTTP specific, you can have the same issue for any kind of network connection), whose canonical description is
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
In other words, your system failed to connect to the host. This might come from a lot of reasons, either temporary (like your internet connection being down) or not (like a proxy, if you are behind one, blocking access to this host), or quite simply (as is the case here) the host blocking your requests.
The first thing to do when you have such an error is to check your internet connection, then try to get the URL in your browser. If the browser can get it, the host is most likely blocking you, usually based on your client's "User-Agent" header (the client here is requests). Specifying a "standard" user-agent header as explained in newbie's answer should solve the problem (and it does in this case, or at least it did for me).
NB: to set the user agent:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)
The problem does not come from the code, but from the website.
If you add User-Agent field in the header of the request it will look like it comes from a browser.
Example:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)

Timeout during fetching https website using Python

I have trouble fetching zomato.com website using Python and requests library.
import requests

r = requests.get('https://www.zomato.com/san-antonio')
print(r.status_code)
I run this script and get no response. I'm guessing that the problem is HTTPS, but I tried it with some other HTTPS websites and it worked like a charm, and 200 was printed to the console.
Am I missing something here?
You'll need to pretend you're coming from an actual browser:
import requests
r = requests.get('https://www.zomato.com/san-antonio', headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"})
print(r.status_code)
# returns: 200

Python 3 Website detects scraper when using User-Agent spoofing

I'm trying to scrape some information from Indeed.com using urllib. Occasionally, the job link gets redirected to the hiring company's webpage. When this happens, Indeed throws up some html about using an incompatible browser or device, rather than continuing to the redirected page. After looking around, I found that in most cases spoofing urllib's user agent to look like a browser is enough to get around this, but this doesn't seem to be the case here.
Any suggestions on where to go beyond spoofing the User-Agent? Is it possible Indeed is able to realize the User-Agent is spoofed, and that there is no way around this?
Here's an example of the code:
import urllib.request
from fake_useragent import UserAgent
from http.cookiejar import CookieJar
ua = UserAgent()
website = 'http://www.indeed.com/rc/clk?jk=0fd52fac51427150&fccid=7f79c79993ec7e60'
req = urllib.request.Request(website)
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', ua.chrome)]
response = opener.open(req)
print(response.read().decode('utf-8'))
Thanks for the help!
This header usually works:
HDR = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
Another option is to use the requests package.
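With requests, the same header dict can be attached to a Session, which also keeps cookies between requests, much like the CookieJar in the question (a sketch of what that might look like; not guaranteed to get past Indeed's checks):

```python
import requests

HDR = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 '
                     '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

session = requests.Session()   # a Session keeps cookies between requests
session.headers.update(HDR)

# The outgoing headers can be inspected offline, before anything is sent:
prepared = session.prepare_request(requests.Request('GET', 'http://www.indeed.com/'))
print(prepared.headers['User-Agent'])
```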

Trouble downloading website using urllib(2) and requests - Bad status line

I'm trying to download pages from the site
http://statsheet.com/
like this
url = 'http://statsheet.com'
urllib2.urlopen(url)
I have tried with the Python modules urllib, urllib2 and "requests", but I only get error messages like "got a bad status line", "BadStatusLine" or similar.
Is there any way to get around this?
You need to specify a common browser user agent e.g.
wget -U "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.34 Safari/537.36" http://statsheet.com
Related question/answer:
Changing user agent on urllib2.urlopen
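The same idea in Python 3's urllib (a sketch mirroring the wget command above; the actual network call is left commented out):

```python
import urllib.request

ua = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) '
      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.34 Safari/537.36')
req = urllib.request.Request('http://statsheet.com', headers={'User-Agent': ua})
# html = urllib.request.urlopen(req).read()  # actual fetch, needs network
print(req.get_header('User-agent'))
```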
