Trouble downloading website using urllib(2) and requests - Bad status line - python

I'm trying to download pages from the site
http://statsheet.com/
like this:
import urllib2
url = 'http://statsheet.com'
urllib2.urlopen(url)
I have tried the Python modules urllib, urllib2 and requests, but I only get error messages like "got a bad status line", "BadStatusLine" or similar.
Is there any way to get around this?

You need to specify a common browser user agent, e.g. with wget:
wget -U "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.34
Safari/537.36" http://statsheet.com
Related question/answer:
Changing user agent on urllib2.urlopen

Related

Python web scraping, requests object hangs

I am trying to scrape the website in python, https://www.nseindia.com/
However, when I try to load the website using requests in Python, the call simply hangs. Below is the code I am using:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
r = requests.get('https://www.nseindia.com/',headers=headers)
The requests.get call simply hangs; I'm not sure what I am doing wrong here. The same URL works perfectly in Chrome or any other browser.
Appreciate any help.
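A sketch that may help narrow this down (not a confirmed fix for nseindia.com): requests has no default timeout, so passing one makes the call fail fast with an exception instead of hanging, and some sites also expect additional browser-like request headers:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}
# timeout=(connect seconds, read seconds); raises requests.exceptions.Timeout instead of blocking forever
r = requests.get('https://www.nseindia.com/', headers=headers, timeout=(5, 10))
print(r.status_code)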

Why do I not get any content with python requests get, but still a 200 response?

I'm doing a requests.get(url='url', verify=False) from my Django application, hosted on an Ubuntu server on AWS, to a URL served by Django REST Framework. There are no permissions or authentication on the DRF endpoint, because I'm the one that made it. I've added headers such as
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
but wasn't able to get any content.
BUT when I run ./manage.py shell and run the exact same command, I get the output that I need!
EDIT 1:
So I've started using subprocess.check_output("curl <url> --insecure", shell=True) and it works, but I know this is not a very "nice" way to do things.
I found out what the problem was.
My application, as deployed, was running single-threaded, not multithreaded.
I increased the worker count and that fixed everything.
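For reference, if the app is served with Gunicorn (an assumption; the answer above doesn't name the server), the worker count is what controls this:
# myproject.wsgi is a placeholder for the project's actual WSGI module
gunicorn myproject.wsgi:application --workers 3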

Timeout during fetching https website using Python

I have trouble fetching the zomato.com website using Python and the requests library.
import requests
r = requests.get('https://www.zomato.com/san-antonio')
print r.status_code
I run this script and get no response. I'm guessing that the problem is HTTPS, but I tried it with some other HTTPS websites and it worked like a charm, and 200 was printed to the console.
Am I missing something here?
You'll need to pretend you're coming from an actual browser:
import requests
r = requests.get('https://www.zomato.com/san-antonio', headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"})
print(r.status_code)
# returns: 200

Scrapy view returns a blank page

I'm new to Scrapy and I was just trying to scrape http://www.diseasesdatabase.com/
When I type scrapy view http://www.diseasesdatabase.com/, it displays a blank page, but if I download the page and run the command on the local file, it displays as usual. Why is this happening?
Pretend to be a real browser by providing a User-Agent header:
scrapy view http://www.diseasesdatabase.com/ -s USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36"
Worked for me.
Note that the -s option here overrides the built-in USER_AGENT setting.
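To make this permanent for the whole project rather than passing -s on every command, the same value can be set in the project's settings.py (assuming a standard Scrapy project layout):
# settings.py
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36'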

Changing User Agent in Python 3 for urrlib.request.urlopen

I want to open a URL using urllib.request.urlopen('someurl'):
with urllib.request.urlopen('someurl') as url:
    b = url.read()
I keep getting the following error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I understand the error to be due to the site not letting Python access it, to stop bots from wasting its network resources, which is understandable. I went searching and found that you need to change the user agent for urllib. However, all the guides and solutions I have found for changing the user agent are for urllib2, and I am using Python 3, so those solutions don't work.
How can I fix this problem with Python 3?
From the Python docs:
import urllib.request

req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
Or, more compactly:
from urllib.request import urlopen, Request
urlopen(Request(url, headers={'User-Agent': 'Mozilla'}))
I just answered a similar question here: https://stackoverflow.com/a/43501438/206820
In case you not only want to open the URL but also want to download the resource (say, a PDF file), you can use the code below:
from urllib.request import ProxyHandler, build_opener, install_opener, urlretrieve

# proxy = ProxyHandler({'http': 'http://192.168.1.31:8888'})
proxy = ProxyHandler({})
opener = build_opener(proxy)
# Give the globally installed opener a browser-like User-Agent
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30')]
install_opener(opener)
result = urlretrieve(url=file_url, filename=file_name)
The reason I added the proxy is to monitor the traffic in Charles.
The host site's rejection is coming from the OWASP ModSecurity Core Rules for Apache mod_security. Rule 900002 has a list of "bad" user agents, and one of them is "python-urllib2". That's why requests with the default user agent fail.
Unfortunately, if you use Python's "robotparser" module,
https://docs.python.org/3.5/library/urllib.robotparser.html?highlight=robotparser#module-urllib.robotparser
it uses the default Python user agent, and there's no parameter to change that. If robotparser's attempt to read robots.txt is refused (not just URL not found), it then treats all URLs from that site as disallowed.
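One workaround sketch (not an official robotparser feature, just a use of its parse() method): fetch robots.txt yourself with a browser-like User-Agent, then hand the lines to RobotFileParser so it never makes its own request. The statsheet.com URL is reused from the question above purely as an example:
import urllib.request
import urllib.robotparser

# Fetch robots.txt with a browser-like User-Agent instead of Python's default
req = urllib.request.Request('http://statsheet.com/robots.txt',
                             headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req) as resp:
    lines = resp.read().decode('utf-8').splitlines()

# Feed the already-fetched rules to robotparser so it skips its own (possibly blocked) request
rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)
print(rp.can_fetch('Mozilla/5.0', 'http://statsheet.com/some-page'))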
