For some reason, when I try to get and process the following URL with python-requests, I receive an error that causes my program to fail. Other, similar URLs seem to work fine:
import requests
test = requests.get('http://t.co/Ilvvq1cKjK')
print test.url, test.status_code
What could be causing this URL to fail instead of just producing a 404 status code?
The requests library has an exception hierarchy, as listed here.
So wrap your GET request in a try/except block:
import requests
try:
    test = requests.get('http://t.co/Ilvvq1cKjK')
    print test.url, test.status_code
except requests.exceptions.ConnectionError as e:
    print e.request.url, "*connection failed*"
That way you end up with behaviour similar to what you have now (you still get the redirected URL), but you handle the case where no connection can be made instead of printing a status code.
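If you would rather not enumerate individual subclasses, you can catch the base class of that hierarchy instead. A minimal sketch (requests.exceptions.RequestException is the root exception in requests):

import requests

try:
    test = requests.get('http://t.co/Ilvvq1cKjK')
    print test.url, test.status_code
except requests.exceptions.RequestException as e:
    # Catches ConnectionError, Timeout, TooManyRedirects, etc.
    print "request failed:", e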
Related
I have the following code:
urllib2.urlopen('http://muhlenberg.edu/alumni/').geturl()
This should return http://www.muhlenbergconnect.com/s/1570/index.aspx?gid=2&pgid=61/, which is the link that it gets redirected to. However, I get an IncompleteRead error when running that code.
Is there a way I can prevent this error from happening and still return the correct link?
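If the main goal is for the script to survive the error, one option is to catch it. A sketch assuming Python 2, where the exception is httplib.IncompleteRead (its .partial attribute holds the bytes received before the connection broke); recovering the redirect target from the partial data is not guaranteed:

import urllib2
import httplib

try:
    final_url = urllib2.urlopen('http://muhlenberg.edu/alumni/').geturl()
except httplib.IncompleteRead as e:
    # e.partial holds whatever bytes arrived before the read was cut short
    print 'read truncated after %d bytes' % len(e.partial)
    final_url = None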
I'm working on a simple web scraper for a page which requires users to be logged in to see its content.
from twill.commands import *
go("https://website.com/user")
fv("1","edit-name","NICKNAME")
fv("1","edit-pass","NICKNAME")
submit('0')
That is my current code. When running it, I get the following error:
raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/user': No schema supplied. Perhaps you meant http:///user?
What am I doing wrong?
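For context, the exception itself just means requests was handed a relative URL with no scheme. A minimal reproduction outside twill:

import requests

requests.get('https://website.com/user')  # fine: absolute URL with a scheme
requests.get('/user')                     # raises requests.exceptions.MissingSchema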
I am trying to use Python to get a JSON file from the web. If I open the URL in my browser (Mozilla or Chromium), I do see the JSON. But when I do the following in Python:
response = urllib2.urlopen(url)
data = json.loads(response.read())
I get an error message telling me the following (translated into English): Errno 10060: the connection attempt failed because the server did not respond within a certain time period, or the connection failed, or the host did not respond.
ADDED
It looks like many people have faced the described problem. There are also some answers to similar (or the same) questions. For example, here we can see the following solution:
import requests
r = requests.get("http://www.google.com", proxies={"http": "http://61.233.25.166:80"})
print(r.text)
This is already a step forward for me (I think it is very likely that the proxy is the cause of the problem). However, I still have not got it working, since I do not know the URL of my proxy, and I will probably also need a user name and password. How can I find them? How is it that my browsers have them and I do not?
ADDED 2
I think I am now one step further. I have used this site to find out what my proxy is: http://www.whatismyproxy.com/
Then I have used the following code:
proxies = {'http': 'my_proxy.blabla.com/'}
r = requests.get(url, proxies=proxies)
print r
As a result I get
<Response [404]>
That does not look good, but at least I think my proxy address is correct, because when I randomly change it I get a different error:
Cannot connect to proxy
So I can connect to the proxy, but something is not found.
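For reference, if the proxy requires authentication, requests accepts the credentials embedded in the proxy URL. A sketch with a hypothetical host, port, user name, and password:

import requests

proxies = {
    'http': 'http://username:password@my_proxy.blabla.com:8080',  # hypothetical credentials
}
r = requests.get('http://www.google.com', proxies=proxies)
print r.status_code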
I think something might be going wrong when you try to get the JSON from the online source (URL). Just to make things clear, here is a small code snippet:
#!/usr/bin/env python
try:
    # For Python 3+
    from urllib.request import urlopen
except ImportError:
    # For Python 2
    from urllib2 import urlopen

import json

def get_jsonparsed_data(url):
    response = urlopen(url)
    # Decode the raw bytes before parsing; str() would keep the b'...' wrapper on Python 3
    data = response.read().decode('utf-8')
    return json.loads(data)
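For example, calling it against a hypothetical JSON endpoint:

url = 'https://api.example.com/data.json'  # hypothetical endpoint
data = get_jsonparsed_data(url)
print(data)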
If you still get a connection error, you can try a couple of steps:
Try to urlopen() a random site from the interpreter (interactive mode). If you are able to grab the source code, you're good. If not, check your internet connection or try the requests module. Check here
Check whether the JSON at the URL uses correct syntax. For sample JSON syntax, check here
Try the simplejson module (see the sketch after this list).
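For that last step, simplejson is a drop-in replacement for the standard json module (assuming it has been installed, e.g. via pip):

import simplejson

# Same interface as the stdlib json module
data = simplejson.loads('{"status": "ok"}')
print(data['status'])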
Edit 1:
If you want to access websites through a system-wide proxy, you will have to use a proxy handler that connects to that proxy via loopback (localhost). Sample code is shown below.
import urllib2

proxy = urllib2.ProxyHandler({
    'http': '127.0.0.1',
    'https': '127.0.0.1'
})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

# This way you can send both http and https requests through the proxy
urllib2.urlopen('http://www.google.com')
urllib2.urlopen('https://www.google.com')
I have not worked a lot with ProxyHandler; I just know the theory and the code. I am sure there are better ways to access websites through proxies, ones that do not involve installing the opener every time you run the program, but hopefully this will point you in the right direction.
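One such alternative is to use the opener object directly instead of installing it globally. A sketch:

import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1', 'https': '127.0.0.1'})
opener = urllib2.build_opener(proxy)

# No install_opener() needed; route individual requests through the opener
response = opener.open('http://www.google.com')
print response.getcode()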
I need some ideas for testing which web server is behind a link. I do not know where to start. It would be something like this (pseudocode):
site = 'example.com'
if (site == Apache)
    print '[ok] Apache - Version:'
else
    print '[No] Is not apache'
I prefer using requests, since it's simple and well documented, and it doesn't raise an error on HTTP error responses the way urllib does:
import requests

response = requests.get("http://stackoverflow.com/")
# Header lookup is case-insensitive; .get() avoids a KeyError if no Server header is present
if "Apache" in response.headers.get('server', ''):
    print "Apache Server found"
else:
    print "This is no Apache Server"
Also see http://www.python-requests.org/en/latest/ for more information.
In Python 3:
import urllib.request
response = urllib.request.urlopen('http://www.google.com')
print(response.headers['Server'])
would be the simplest way to get the server header in some cases.
Some sites (like Stack Overflow), however, will return a 403 error code.
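If the 403 comes from the site rejecting urllib's default User-Agent, sending a browser-like header may help. A sketch:

import urllib.request

req = urllib.request.Request(
    'http://stackoverflow.com/',
    headers={'User-Agent': 'Mozilla/5.0'}  # some sites block Python's default agent
)
response = urllib.request.urlopen(req)
print(response.headers['Server'])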
Using a Nokia N900, I have a urllib.urlopen statement that I want to be skipped if the server is offline (if it fails to connect, proceed to the next line of code).
How should / could this be done in Python?
According to the urllib documentation, it will raise IOError if the connection can't be made.
import urllib

try:
    urllib.urlopen(url)
except IOError:
    # exception handling goes here if you want it
    pass
else:
    DoSomethingUseful()
Edit: As unutbu pointed out, urllib2 is more flexible. The Python documentation has a good tutorial on how to use it.
import urllib

try:
    urllib.urlopen("http://fgsfds.fgsfds")
except IOError:
    pass
If you are using Python 3, urllib.request.urlopen has a timeout parameter. You could use it like this:
import urllib.request as request

try:
    response = request.urlopen('http://google.com', timeout=0.001)
    print(response)
except request.URLError as err:
    print('got here')
    # urllib.error.URLError: <urlopen error timed out>
timeout is measured in seconds. The ultra-short value above is just to demonstrate that it works. In real life you'd probably want to set it to a larger value, of course.
urlopen also raises a urllib.error.URLError (which is also accessible as request.URLError) if the URL does not exist or your network is down.
For Python 2.6+, equivalent code can be found here.
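For reference, urllib2.urlopen has accepted a timeout argument since Python 2.6, so a rough equivalent is:

import urllib2

try:
    response = urllib2.urlopen('http://google.com', timeout=0.001)
    print response
except urllib2.URLError as err:
    print 'got here'
    # urllib2.URLError: <urlopen error timed out>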