I am fetching various links using requests; however, some links are "bad", i.e. they simply don't load, which causes my program to hang and eventually crash.
Is there a way to set a time limit for a request, so that if that time passes without a response from the URL it just returns some kind of error? Or is there some other way I can prevent bad links from breaking my program?
urllib2 is one option:
import urllib2

test_url = "http://www.test.com"
try:
    urllib2.urlopen(test_url)
except:
    pass
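Since the question uses requests: requests.get also accepts a timeout argument and raises an exception when it is exceeded. A minimal sketch, with a placeholder URL and an arbitrary 5-second limit:

import requests

test_url = "http://www.test.com"
try:
    # timeout is in seconds and applies to connecting and to waiting for data
    r = requests.get(test_url, timeout=5)
except requests.exceptions.RequestException:
    # covers Timeout, ConnectionError and other request failures
    pass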
I am trying to use requests to pull information from the NPI API, but it is taking over 20 seconds on average. If I access it via my web browser, it takes less than a second. I'm rather new to this, and any help would be greatly appreciated. Here is my code.
import json
import sys
import requests

url = "https://npiregistry.cms.hhs.gov/api/?number=&enumeration_type=&taxonomy_description=&first_name=&last_name=&organization_name=&address_purpose=&city=&state=&postal_code=10017&country_code=&limit=&skip="
htmlfile = requests.get(url)
data = htmlfile.json()

for i in data["results"]:
    print(i)
This might be due to the response being incorrectly formatted, or due to requests taking longer than necessary to set up the request. To solve these issues, read on:
Server response formatted incorrectly
A possible issue might be that parsing the response is actually the offending line. You can check this by not reading the response you receive from the server. If the code is still slow, this is not your problem, but if that fixes it, the problem likely lies with parsing the response.
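One rough way to check this, using the URL from the question, is to time the download and the parsing separately:

import time
import requests

url = "https://npiregistry.cms.hhs.gov/api/?number=&enumeration_type=&taxonomy_description=&first_name=&last_name=&organization_name=&address_purpose=&city=&state=&postal_code=10017&country_code=&limit=&skip="

start = time.time()
r = requests.get(url)
print("download took {:.2f}s".format(time.time() - start))

start = time.time()
data = r.json()
print("parsing took {:.2f}s".format(time.time() - start))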
If some headers are set incorrectly, this can lead to parsing errors that prevent chunked transfer (source).
In other cases, setting the encoding manually might resolve parsing problems (source).
To fix those, try:
r = requests.get(url)
r.raw.chunked = True  # Fix issue 1
r.encoding = 'utf-8'  # Fix issue 2
print(r.text)
Setting up the request takes long
This is mainly applicable if you're sending multiple requests in a row. To save requests from having to set up a new connection each time, you can use a requests.Session. This keeps the connection to the server open and configured, and it also persists cookies as a nice benefit. Try this (source):
import requests

session = requests.Session()
for _ in range(10):
    session.get(url)
Didn't solve your issue?
If that did not solve your issue, I have collected some other possible solutions here.
I'm using urlopen() to open a website and pull (financial) data from it. Here is my line:
sourceCode = urlopen('xxxxxxxx').read()
After this, I pull out the data I need. I loop through different pages on the same domain to pull data (stock info). I end the body of the loop with:
time.sleep(1)
as I'm told that keeps the site from blocking me. My program will run for a few minutes, but at some point, it stalls and quits pulling data. I can rerun it and it'll run another arbitrary amount of time and then stall.
Is there something I can do to prevent this?
This worked (for most websites) for me:
If you're using the urllib.request library, you can create a Request and spoof the user agent. This might mean that they stop blocking you.
from urllib.request import Request, urlopen

# path is the URL you want to fetch
req = Request(path, headers={'User-Agent': 'Mozilla/5.0'})
data = urlopen(req).read()
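If the stall is a hung connection rather than the site blocking you, you can also pass a timeout to urlopen so it raises instead of waiting forever. A sketch, assuming you keep the urllib.request approach (the URL and the 10-second limit are placeholders):

from urllib.request import Request, urlopen
from urllib.error import URLError
import socket

url = 'http://example.com/quote-page'  # placeholder for one of the stock pages
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    # timeout is in seconds; a hung connect or read raises instead of stalling the loop
    sourceCode = urlopen(req, timeout=10).read()
except (socket.timeout, URLError):
    sourceCode = None  # skip this page and move on to the next one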
Hope this helps
Edit: I have found that I made a mistake; the cause of the error wasn't urllib but nltk, which wasn't able to process a long string that came from this exact page. Sorry about this one.
I do not know why, but this happens no matter whether I use urllib2.urlopen or requests when I come across one specific URL.
import requests

r = requests.get('SomeURL')
html = r.text
print html
Here is its behavior:
1) When I go through a loop of 200 URLs, it freezes each time at exactly the same URL. It stays there for hours if I do not terminate the program.
2) When I try just the example code above outside of the loop, it works.
3) If I blacklist just this one URL, it goes through the loop without problems.
It doesn't actually return any kind of error code, it works fine outside of the loop, and a timeout is set, but the timeout doesn't do anything; the request still hangs for an indefinite time.
So is there any other way to forcefully stop the HTTP GET request after a certain time, since the timeout doesn't work? Is there any library other than urllib2 and requests that could do the job and that respects timeout limits?
for i in range(0, mincount):
    # call the request for urlist[i]
It always works but freezes only when I request this one site. If I made 200 requests to Yahoo, for example, it would work; but when I try to go to this particular URL, I cannot.
Edit: It is a standard for loop and there is not much room for error.
I think it's simply a very slow page; on my system, it takes about 9.7s to load.
If you are trying to run it in a short loop, it would indeed seem to freeze.
You could try something like
import requests

links = [
    'SomeURL',
    'http://www.google.com/'
]

for link in links:
    try:
        html = requests.get(link, timeout=2.).content
        print("Successfully loaded {}".format(link))
    except requests.Timeout:
        print("Timed out loading {}".format(link))
which gave me
Timed out loading SomeURL
Successfully loaded http://www.google.com/
This code was working fine and reading html.
Then the site just stopped giving any data on read(). No error code.
Is it because the web server has detected something unusual?
(Before I figured out setting the User-Agent, I was getting Error 403: Bad Behavior.)
Does urllib2 have some noticeable signature that raises a flag?
Would switching to another library help?
I'm not doing anything suspicious. I can't see any behavioural difference between using this library to read a page and using the lynx browser.
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
search_url = 'http://www.google.com/?search=stackoverflow'
raw = opener.open(search_url)
print raw.headers
print raw.read()
Given your print statement, I assume you are running this on Python 2.x.
I ran the same thing on my system and it works, irrespective of setting the user agent. What David Robinson suggested might have something to do with it here.
On another note, I have personally used the following example snippet many times:
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
Over multiple use cases, I have never encountered your error.
Try using mechanize instead of plain urllib2 for crawling search engines; it mimics a browser's behaviour better.
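A minimal sketch of that approach, assuming the mechanize package is installed (the URL is just an example):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # search engines usually disallow bots in robots.txt
br.addheaders = [('User-agent', 'Mozilla/5.0')]

response = br.open('http://www.google.com/?search=stackoverflow')
html = response.read()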
I'm getting a weird error that I can't seem to find a solution for.
This error does not occur every time I hit this segment of code, and neither does it happen for the same iteration through the loop (it happens in a loop). If I run it enough, it doesn't seem to encounter the error and the program executes successfully. Regardless, I'd still like to figure out why this is happening.
Here is my error, versions, trace, etc: http://dpaste.com/681658/
It seems to happen with the following line in my code:
page = urllib2.urlopen(url)
Where url is... a URL, obviously. And I do have import urllib2 in my code.
The BadStatusLine exception is raised when you call urllib2.urlopen(url) and the remote server responds with an HTTP status line that Python cannot understand (often an empty or malformed response).
Assuming that you don't control url, you can't prevent this from happening. All you can do is catch the exception, and manage it gracefully.
import urllib2
from httplib import BadStatusLine

try:
    page = urllib2.urlopen(url)
    # do something with page
except BadStatusLine:
    print "could not fetch %s" % url
Explanations from other users are right and good, but in practice you may find this useful:
In my experience this usually happens when you are sending unquoted values in the URL parameters, such as values containing spaces or other characters that need to be quoted or URL-encoded.
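A minimal sketch of encoding the parameters first, assuming Python 2 (as elsewhere in this thread) and a purely hypothetical endpoint with placeholder values:

import urllib
import urllib2

# hypothetical query values; the spaces and ampersand would break an unencoded URL
params = {'q': 'some value', 'category': 'books & music'}
url = 'http://example.com/search?' + urllib.urlencode(params)

page = urllib2.urlopen(url)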
This doesn't have anything to do with Django; it's an exception thrown by urllib2, which couldn't parse the response after fetching your URL. It may be a network issue or a malformed response… Some servers / applications throw this kind of error randomly. If you don't control what this URL returns, you're left with catching the exception, debugging which URLs are causing problems, and trying to identify a pattern.
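If you go the route of finding out which URLs are causing problems, a minimal sketch (the URL list here is a placeholder) could simply collect the failures for later inspection:

import urllib2
from urllib2 import URLError
from httplib import BadStatusLine

urls = ['http://example.com/a', 'http://example.com/b']  # placeholder list of your URLs
failed = []

for url in urls:
    try:
        page = urllib2.urlopen(url).read()
    except (BadStatusLine, URLError):
        # remember which URLs misbehave so you can look for a common cause later
        failed.append(url)

print "problem URLs: %r" % failed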