I want to get the response code from a web server, but sometimes I get code 200 even if the page doesn't exist, and I don't know how to deal with it.
I'm using this code:
import urllib.request
import urllib.error

def checking_url(link):
    try:
        link = urllib.request.urlopen(link)
        response = link.code
    except urllib.error.HTTPError as e:
        response = e.code
    return response
When I'm checking a website like this one:
https://www.wykop.pl/notexistlinkkk/
It still returns code 200 even if the page doesn't exist.
Is there any solution to deal with it?
I found a solution; I'm going to test it with more websites now.
I had to use http.client.
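A minimal sketch of that approach with http.client (assuming the point is that http.client does not follow redirects, so a 301/302 is reported as-is instead of the 200 you see after redirection):

import http.client
from urllib.parse import urlparse

def checking_url(link):
    # Open a connection to the host directly; http.client performs no
    # automatic redirection, so the status of the first response is returned.
    parsed = urlparse(link)
    if parsed.scheme == "https":
        conn = http.client.HTTPSConnection(parsed.netloc)
    else:
        conn = http.client.HTTPConnection(parsed.netloc)
    conn.request("HEAD", parsed.path or "/")
    status = conn.getresponse().status
    conn.close()
    return status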
You are getting response code 200 because the website you are checking redirects automatically. In the URL you gave, even if you specify a non-existent page, it redirects you to the home page rather than returning a 404 status code. Your code works fine.
import urllib2

thisCode = None
try:
    i = urllib2.urlopen('http://www.google.com')
    thisCode = i.code
except urllib2.HTTPError as e:
    thisCode = e.code
print thisCode
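If you want to treat that silent redirect to the home page as "page does not exist", one option, going back to urllib.request from the question (just a sketch, not tested against that particular site), is to compare the URL you asked for with the URL you actually ended up on:

import urllib.request

def page_really_exists(url):
    response = urllib.request.urlopen(url)
    # If the server silently redirects missing pages to its home page,
    # the final URL reported by geturl() differs from the one requested.
    return response.code == 200 and response.geturl().rstrip('/') == url.rstrip('/')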
I get data from an API in Django.
The data comes from an order form on another website.
The data also includes a URL, for example example.com, but I can't validate the input because I don't have access to the order form.
The URL I get can also come in different forms. More examples:
example.de
http://example.de
www.example.com
https://example.de
http://www.example.de
https://www.example.de
Now I would like to open the URL to find out the correct URL.
For example, if I open example.com in my browser, I get the correct URL http://example.com/, and that is what I would like for all URLs.
How can I do that quickly in Python?
If you get status_code 200, you know that you have a valid address.
Regarding https://: you will get an SSL error if you don't follow the answers in this guide. Once you have that in place, the program below will find the correct URL for you.
import requests
import traceback

validProtocols = ["https://www.", "http://www.", "https://", "http://"]

def removeAnyProtocol(url):
    url = url.replace("www.", "")  # remove any inputs containing just www. since we aren't planning on using them regardless
    for protocol in validProtocols:
        url = url.replace(protocol, "")
    return url

def validateUrl(url):
    for protocol in validProtocols:
        if protocol not in url:
            pUrl = protocol + removeAnyProtocol(url)
            try:
                req = requests.head(pUrl, allow_redirects=True)
                if req.status_code == 200:
                    return pUrl
                else:
                    continue
            except Exception:
                print(traceback.format_exc())
                continue
        else:
            try:
                req = requests.head(url, allow_redirects=True)
                if req.status_code == 200:
                    return url
            except Exception:
                print(traceback.format_exc())
                continue
Usage:
correctUrl = validateUrl("google.com") # https://www.google.com
I'm sending a POST request to some URL and this URL then throws a 200 OK or 401 Unauthorized code depending on the parameters provided in the POST request.
In addition to that return code, the website also returns a text message, which is especially useful on errors so the person who made the request knows why it failed. For that, I use this code:
#!/usr/bin/env python
import urllib
import urllib2

url = 'https://site/request'
params = {
    'param1': 'value1',
    'param2': 'value2',
    ...
}

data = urllib.urlencode(params)
req = urllib2.Request(url, data)
try:
    response = urllib2.urlopen(req)
    the_page = response.read()
except urllib2.URLError as e:
    print e.code, e.reason  # Returns only 401 Unauthorized, not the text
When the request is successful, I get a 200 code and I can grab the message with the the_page variable. Pretty useless in that case.
But when it fails, the line which throws the URLError is the one calling urlopen(), so I can't grab the web error message.
Is there any way to grab the message even on a URLError event? If not, is there an alternative way to do a POST request and grab the Web content on error?
The Python version is 2.7.6 in my case.
Thanks
In case you catch an HTTPError—it’s a more specific subclass of URLError and I think it would be raised in case of a 401—it can be read as a file-like object, yielding page contents:
urllib2.HTTPError documentation
Overriding urllib2.HTTPError or urllib.error.HTTPError and reading response HTML anyway
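A short sketch of that idea, reusing the request setup from the question (the key point being that the HTTPError object itself has a read() method):

import urllib
import urllib2

url = 'https://site/request'
data = urllib.urlencode({'param1': 'value1', 'param2': 'value2'})
req = urllib2.Request(url, data)
try:
    response = urllib2.urlopen(req)
    the_page = response.read()
except urllib2.HTTPError as e:
    # HTTPError is a file-like object: both the status code and the
    # error body the server sent are available here.
    print e.code
    print e.read()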
I would suggest using the requests library (install using pip install requests)
import requests

url = 'https://site/request'
params = {
    'param1': 'value1',
    'param2': 'value2',
}

post_response = requests.post(url, json=params)
if post_response.ok:
    the_page = post_response.content
    # do your stuff here

print post_response.content      # this will give you the message regardless of failure
print post_response.status_code  # this will give you the status code of the request
post_response.raise_for_status() # this will throw an error if the status is not 200
Docs: http://docs.python-requests.org/en/latest/
I'm writing a script to DL the entire collection of BBC podcasts from various show hosts. My script uses BS4, Mechanize, and wget.
I would like to know how I can test whether a request for a URL yields a response code of '404' from the server. I have written the function below:
def getResponseCode(br, url):
    print("Opening: " + url)
    try:
        response = br.open(url)
        print("Response code: " + str(response.code))
        return True
    except (mechanize.HTTPError, mechanize.URLError) as e:
        if isinstance(e, mechanize.HTTPError):
            print("Mechanize error: " + str(e.code))
        else:
            print("Mechanize error: " + str(e.reason.args))
        return False
I pass into it my Browser() object and a URL string. It returns either True or False depending on whether the response is a '404' or a '200' (well, actually, Mechanize throws an exception if it is anything other than a '200', hence the exception handling).
In main() I am basically looping over this function passing in a number of URLs from a list of URLs that I have scraped with BS4. When the function returns True I proceed to download the MP3 with wget.
However. My problem is:
The URLs are direct paths to the podcast MP3 files on the remote server, and I have noticed that when the URL is available, br.open(<URL>) will hang. I suspect this is because Mechanize is caching/downloading the actual data from the server. I do not want this because I merely want to return True if the response code is '200'. How can I avoid caching/downloading and just test the response code?
I have tried using br.open_novisit(url, data=None) however the hang still persists...
I don't think there's any good way to get Mechanize to do what you want. The whole point of Mechanize is that it's trying to simulate a browser visiting a URL, and a browser visiting a URL downloads the page. If you don't want to do that, don't use an API designed for that.
On top of that, whatever API you're using, by sending a GET request for the URL, you're asking the server to send you the entire response. Why do that just to hang up on it as soon as possible? Use the HEAD request to ask the server whether it's available. (Sometimes servers won't HEAD things even when they should, so you'll have to fall back to GET. But cross that bridge if you come to it.)
For example:
req = urllib.request.Request(url, method='HEAD')
resp = urllib.request.urlopen(req)
return 200 <= resp.code < 300
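If one of these servers turns out not to allow HEAD (a 405, say), a fallback to GET might look like the sketch below; the answer above only mentions that fallback in passing, so the details here are an assumption:

import urllib.request
import urllib.error

def url_is_available(url):
    # Try a cheap HEAD first; fall back to GET only if the server rejects HEAD.
    for method in ('HEAD', 'GET'):
        try:
            req = urllib.request.Request(url, method=method)
            resp = urllib.request.urlopen(req)
            return 200 <= resp.code < 300
        except urllib.error.HTTPError as e:
            if method == 'HEAD' and e.code == 405:
                continue  # server does not allow HEAD, retry with GET
            return False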
But this raises a question:
When the function returns True I proceed to download the MP3 with wget.
Why? Why not just use wget in the first place? If the URL is gettable, it will get the URL; if not, it will give you an error—just as easily as Mechanize will. And that avoids hitting each URL twice.
For that matter, why try to script wget, instead of using the built-in support in the stdlib or a third-party module like requests?
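For instance, fetching each MP3 straight from Python with requests removes the extra round trip entirely. A sketch with an assumed local-filename scheme, not the asker's actual script:

import os
import requests

def download_mp3(url, dest_dir='.'):
    # Stream the response so large podcast files are not held in memory.
    resp = requests.get(url, stream=True)
    if resp.status_code != 200:
        print("Skipping %s (status %d)" % (url, resp.status_code))
        return False
    filename = os.path.join(dest_dir, url.rsplit('/', 1)[-1])  # assumed naming scheme
    with open(filename, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
    return True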
If you're just looking for a way to parallelize things, that's easy to do in Python:
from concurrent import futures
import urllib.request

def is_good_url(url):
    req = urllib.request.Request(url, method='HEAD')
    resp = urllib.request.urlopen(req)
    return url, 200 <= resp.code < 300

with futures.ThreadPoolExecutor(max_workers=8) as executor:
    fs = [executor.submit(is_good_url, url) for url in urls]
    results = (f.result() for f in futures.as_completed(fs))
    good_urls = [url for (url, good) in results if good]
And to change this to actually download the valid URLs instead of just making a note of which ones are valid, just change the task function to something that fetches and saves the data from a GET instead of doing the HEAD thing. The ThreadPoolExecutor Example in the docs does almost exactly what you want.
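Concretely, the only change is the task function. A sketch along those lines, assuming the same urls list as above (the save-to-disk details are assumptions, not part of the original answer):

import os
import urllib.error
import urllib.request
from concurrent import futures

def download_url(url, dest_dir='.'):
    # GET the resource and write it to disk; return the URL plus a success flag.
    try:
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
    except urllib.error.URLError:
        return url, False
    filename = os.path.join(dest_dir, url.rsplit('/', 1)[-1])  # assumed naming scheme
    with open(filename, 'wb') as f:
        f.write(data)
    return url, True

with futures.ThreadPoolExecutor(max_workers=8) as executor:
    fs = [executor.submit(download_url, url) for url in urls]
    downloaded = [url for (url, ok) in (f.result() for f in futures.as_completed(fs)) if ok]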
I'm trying to scrape data off an internal website with urllib2. When I run
try:
    resp = urllib2.urlopen(urlBase)
    data = resp.read()
except HTTPError as e1:
    print("HTTP Error %d trying to reach %s" % (e1.code, urlBase))
except URLError as e2:
    print("URLError %d" % e2.code)
    print(e2.read())
I get an HTTPError with e1.code of 404. If I navigate to the site on Firefox and use the developer tools I see an HTTP code of 200. Does anyone know what the problem could be?
Edit 1 Before I call this, I also install an empty proxy handler so urllib2 doesn't try to use the proxy settings set by my shell:
handler = urllib2.ProxyHandler({})
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
Edit 2 FWIW, the URL I'm navigating to is an Apache index and not an HTML document. However, the status code as read by Firefox still says HTTP/1.1 Status 200.
This sometimes happens to me after I've been using an HTTP proxy like Charles. In my case, the fix is simply opening and closing the HTTP proxy.
Turns out a function inside the try I stripped out was trying to access another page that was triggering the 404 error.
I am trying to get a page from wikipedia. I have already added a 'User-Agent' header to my request. However, when I open the page using urllib2.urlopen I get the following page as a result:
ERROR: The requested URL could not be retrieved
While trying to retrieve the URL the following error was encountered:
Access Denied.
Access control configuration prevents your request from being allowed at this time. Please contact your service provider if you feel this is incorrect.
Here is the code I use to open the page:
def get_site(request_user_link, request):  # request_user_link is the request for the URL entered by the user
    # request is the request generated by the current page - used to get HTTP_USER_AGENT
    # tag for WIKIPEDIA and other sites
    request_user_link.add_header('User-Agent', str(request.META['HTTP_USER_AGENT']))
    try:
        response = urllib2.urlopen(request_user_link)
    except urllib2.HTTPError as err:
        logger.error('HTTPError = ' + str(err.code))
        response = None
    except urllib2.URLError as err:
        logger.error('URLError = ' + str(err.reason))
        response = None
    except httplib.HTTPException as err:
        logger.error('HTTPException')
        response = None
    except Exception:
        import traceback
        logger.error('generic exception: ' + traceback.format_exc())
        response = None
    return response
I pass the value of the HTTP_USER_AGENT from the current user object as the "User-Agent" header for the request I send to wikipedia.
If there are any other headers I need to add to this request, please let me know. Otherwise, please advise an alternate solution.
EDIT: Please note that I was able to get the page successfully yesterday after I added the 'User-Agent' header. Today, I seem to be getting this Error page.
Wikipedia is not very forgiving if you violate its crawling rules. Because you first exposed your IP with the standard urllib2 user-agent, you were flagged in the logs. When the logs were 'processed', your IP was banned. This should be easy to test by running your script from another IP. Be careful, since Wikipedia is also known to block IP ranges.
IP bans are usually temporary, but with multiple offenses they can become permanent.
Wikipedia also auto-bans known proxy servers. I suspect that they themselves parse anonymous proxy sites like proxy-list.org and commercial proxy sites like hidemyass.com for the IPs.
Wikipedia does this, of course, to protect its content from vandalism and spam. Please respect the rules.
If possible, I suggest using a local copy of Wikipedia on your own servers. That copy you can hammer to your heart's content.
I wrote a script that reads from Wikipedia; this is a simplified version.
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # Wikipedia needs a non-default User-Agent
resource = opener.open(URL)
data = resource.read()
resource.close()
# data is your website.