Python web scraping, skip url if error - python

I'm trying to scrape one site (about 7000 links, all in a list), and because of my method, it is taking a LONG time and I guess that I'm ok with that (since that implies staying undetected). But if I do get any kind of error in trying to retrieve a page, can I just skip it?? Right now, if there's an error, the code breaks and gives me a bunch of error messages. Here's my code:
Collection is a list of lists and the resultant file. Basically, I'm trying to run a loop with get_url_data() (which I have a previous question to thank for) over all the URLs in urllist. I catch something called HTTPError, but that doesn't seem to handle all the errors, hence this post. As a related side-quest, it would also be nice to get a list of the URLs that couldn't be processed, but that's not my main concern (it would be cool if someone could show me how, though).
Collection=[]
def get_url_data(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
    except HTTPError:
        return None
    site = bs4.BeautifulSoup(r.text)
    groups=site.select('div.filters')
    word=url.split("/")[-1]
    B=[]
    for x in groups:
        B.append(word)
        T=[a.get_text() for a in x.select('div.blahblah [class=txt]')]
        A1=[a.get_text() for a in site.select('div.blah [class=txt]')]
        if len(T)==1 and len(A1)>0 and T[0]=='verb' and A1[0]!='as in':
            B.append(T)
            B.append([a.get_text() for a in x.select('div.blahblah [class=ttl]')])
            B.append([a.get_text() for a in x.select('div.blah [class=text]')])
            Collection.append(B)
        B=[]
for url in urllist:
    get_url_data(url)
I think this was the main error, which triggered the others; a bunch of further errors followed, each starting with "During handling of the above exception, another exception occurred."
Traceback (most recent call last):
  File "C:\Python34\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 319, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

You can make your try/except block look like this:
try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()
except Exception:
    return
Catching Exception handles every ordinary error, because nearly all built-in and library exceptions derive from it.
If you want to see the error message, bind the exception to a name in the except clause and print it:
except Exception as e:
    print(e)
    return
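For the side-quest of listing the URLs that could not be processed, here is a minimal sketch. scrape_all and fake_fetch are hypothetical names, not part of the question's code; in real use the fetch argument would be something like get_url_data, changed so that it lets exceptions propagate instead of returning None.

```python
def scrape_all(urls, fetch):
    """Call fetch(url) for every URL; skip and record the ones that raise."""
    results = []
    failed = []  # (url, error text) pairs for URLs that could not be processed
    for url in urls:
        try:
            results.append(fetch(url))
        except Exception as exc:  # broad on purpose: any failure means "skip"
            failed.append((url, str(exc)))
    return results, failed


def fake_fetch(url):
    # Hypothetical stand-in for a real downloader; fails for "bad" URLs
    if "bad" in url:
        raise ValueError("could not fetch " + url)
    return url.upper()


results, failed = scrape_all(["http://a/ok", "http://a/bad"], fake_fetch)
print(results)
print(failed)
```

Keeping the error text next to each failed URL makes it easier to decide later which ones are worth retrying.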

Related

How to avoid or skip error 400 in python while calling the API

Note: I wrote my code after referring to a few examples on Stack Overflow, but I still could not get the required output.
I have a Python script in which a loop iterates over calls to an Instagram API. I give the user_id as input to the API, which returns the number of posts, followers and following. Each time it gets a response, I load it as JSON and append the values to the lists data1, data2 and data3.
The issue is that some accounts are private and the API call is not allowed for them. When I run the script in the IDLE Python shell, it gives this error:
Traceback (most recent call last):
  File "<pyshell#144>", line 18, in <module>
    beta=json.load(url)
  File "C:\Users\rnair\AppData\Local\Programs\Python\Python35\lib\site-packages\simplejson-3.8.2-py3.5-win-amd64.egg\simplejson\__init__.py", line 455, in load
    return loads(fp.read(),
  File "C:\Users\rnair\AppData\Local\Programs\Python\Python35\lib\tempfile.py", line 483, in func_wrapper
    return func(*args, **kwargs)
ValueError: read of closed file
But the JSON contains this:
{
    "meta": {
        "error_type": "APINotAllowedError",
        "code": 400,
        "error_message": "you cannot view this resource"
    }
}
My code is:
for r in range(307,601):
    var=r,sheet.cell(row=r,column=2).value
    xy=var[1]
    ij=str(xy)
    if xy=="Account Deleted":
        data1.append('null')
        data2.append('null')
        data3.append('null')
        continue
    myopener=Myopen()
    try:
        url=myopener.open('https://api.instagram.com/v1/users/'+ij+'/?access_token=641567093.1fb234f.a0ffbe574e844e1c818145097050cf33')
    except urllib.error.HTTPError as e:  # I want the change here
        data1.append('Private Account')
        data2.append('Private Account')
        data3.append('Private Account')
        continue
    beta=json.load(url)
    item=beta['data']['counts']
    data1.append(item['media'])
    data2.append(item['followed_by'])
    data3.append(item['follows'])
I am using Python version 3.5.2. The main question is: if the loop runs and a particular call is blocked with this error, how do I skip it and keep running the next iterations? Also, if the account is private, I want to append "Private Account" to the lists.
It looks like the code that actually fetches the URL is inside your custom type, Myopen (which is not shown). It also looks like it is not raising the HTTPError you are expecting, since your json.load line is still being executed (and leading to the ValueError).
If you want your error-handling block to fire, you would need to check within Myopen whether the response status code is != 200 and raise the HTTPError you are expecting instead of whatever it is doing now.
I'm not personally familiar with FancyURLopener, but it looks like it supports a getcode method. Maybe try something like this instead of expecting an HTTPError:
url = myopener.open('yoururl')
if url.getcode() == 400:
    data1.append('Private Account')
    data2.append('Private Account')
    data3.append('Private Account')
    continue
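The same idea, checking the status code before touching the body, can be sketched as a small helper that is easy to test without calling the API. counts_or_placeholder and the success payload are made up for illustration; only the field names come from the question's code.

```python
import json

PRIVATE = "Private Account"


def counts_or_placeholder(status_code, body_text):
    """Return (media, followed_by, follows), or placeholders on a non-200 status."""
    if status_code != 200:
        # Do not parse the body at all when the call was rejected
        return (PRIVATE, PRIVATE, PRIVATE)
    counts = json.loads(body_text)["data"]["counts"]
    return (counts["media"], counts["followed_by"], counts["follows"])


# Made-up success payload, shaped like the fields the question reads
ok_body = '{"data": {"counts": {"media": 3, "followed_by": 10, "follows": 7}}}'
print(counts_or_placeholder(200, ok_body))
print(counts_or_placeholder(400, ""))
```

Inside the loop, the three returned values would be appended to data1, data2 and data3 in turn.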

Django model.DoesNotExist exception somehow replaced with an AttributeError

I'm experiencing an odd exception on a Django 1.5 site:
"TypeError: 'exceptions.AttributeError' object is not callable"
Essentially it looks like the model.DoesNotExist exception has been replaced with an AttributeError.
The error is intermittent, but once it happens it seems to 'stick' until I restart the process, which makes me think it might be a case of the model class getting incorrectly set in the course of a particular request, and then persisting.
Bottom of traceback:
  File "/opt/mysite/django/apps/profiles/models.py", line 353, in profile_from_cache
    profile = self.get(user=user_id)
  File "/opt/mysite/.virtualenvs/django/lib/python2.7/site-packages/django/db/models/manager.py", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/opt/mysite/.virtualenvs/django/lib/python2.7/site-packages/django/db/models/query.py", line 404, in get
    self.model._meta.object_name)
TypeError: 'exceptions.AttributeError' object is not callable
Line of code from django/db/models/query.py:
if not num:
    raise self.model.DoesNotExist(
        "%s matching query does not exist." %
        self.model._meta.object_name)
So it looks as if it's trying to pass a message to the DoesNotExist exception on the model, but it's somehow been replaced by an AttributeError instead.
The issue only seems to happen from http requests - if I do the same action from the command line I just get a DoesNotExist exception (which is what should be happening).
I can't find any obvious reason this should be happening. Any ideas?
(PS this seems to be the same issue. The user's answer to it, I think, is wrong: https://groups.google.com/forum/#!topic/django-users/k9JMyXlUt3Q)
Possibly relevant code
Here is an outline of the model manager:
class CacheManager(models.Manager):
    def profile_from_cache(self, user_id=None):
        profile = cache.get("profile_%s" % user_id)
        if profile is None:
            try:
                profile = self.get(user=user_id)
            except Profile.DoesNotExist:
                return None
            cache.set("profile_%s" % user_id, profile, settings.CACHE_TIMEOUT)
        return profile
    ...
class Profile(models.Model):
    ...
    caches = CacheManager()
Here's the line of code that seems to be causing the error. In this case it's in some middleware, but there are a few different places, all causing the same thing.
Profile.caches.profile_from_cache(user_id=request.user.pk)
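It was caused by incorrect syntax elsewhere in the project, in code that tried to catch multiple exceptions like this:
try:
    do_something()
except AttributeError, Profile.DoesNotExist:
    pass
When it should have been this:
try:
    do_something()
except (AttributeError, Profile.DoesNotExist):
    pass
In Python 2, the comma form means "catch AttributeError and bind the caught exception to the name Profile.DoesNotExist". So the first time an AttributeError was caught there, the AttributeError instance overwrote Profile.DoesNotExist in memory. When Django later ran raise self.model.DoesNotExist(...), it tried to call that instance, hence "'exceptions.AttributeError' object is not callable".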
Thanks to this post for helping.
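For illustration, a minimal Python 3 sketch of the parenthesized form. DoesNotExist is a hypothetical stand-in for Profile.DoesNotExist, and do_something and run are placeholders. In Python 3 the comma form is a syntax error, so this particular mistake cannot happen there.

```python
class DoesNotExist(Exception):
    """Hypothetical stand-in for Profile.DoesNotExist."""


def do_something(exc_type):
    # Placeholder that simply raises the requested exception type
    raise exc_type("boom")


def run(exc_type):
    try:
        do_something(exc_type)
    except (AttributeError, DoesNotExist):
        # The tuple catches either type and rebinds nothing
        return "ignored"


print(run(AttributeError))
print(run(DoesNotExist))
print(isinstance(DoesNotExist, type))  # still a class after both calls
```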

Safest way to open a file in python 3.4

I was expecting the following would work but PyDev is returning an error:
try fh = open(myFile):
    logging.info("success")
except Exception as e:
    logging.critical("failed because:")
    logging.critical(e)
gives
Encountered "fh" at line 237, column 5. Was expecting: ":"...
I've looked around and cannot find a safe way to open a filehandle for reading in Python 3.4 and report errors properly. Can someone point me in the correct direction please?
You misplaced the colon; it belongs directly after try, which is best put on its own, separate line:
try:
    fh = open(myFile)
    logging.info("success")
except Exception as e:
    logging.critical("failed because:")
    logging.critical(e)
You placed the colon after the open() call instead.
Instead of passing in e as a separate argument, you can tell logging to pick up the exception automatically:
try:
    fh = open(myFile)
    logging.info("success")
except Exception:
    logging.critical("failed because:", exc_info=True)
and a full traceback will be included in the log. This is what the logging.exception() function does; it'll call logging.error() with exc_info set to true, producing a message at log level ERROR plus a traceback.
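Putting the pieces together, here is one possible sketch of a "safe" read. read_text is a hypothetical helper name, and catching OSError rather than Exception is an assumption about which failures should be reported: it covers a missing file, a permissions problem and similar I/O errors.

```python
import logging
import os
import tempfile


def read_text(path):
    """Return the file's text, or None if it cannot be opened or read."""
    try:
        with open(path) as fh:  # closed automatically, even on error
            return fh.read()
    except OSError:
        # logging.exception logs at ERROR level and appends the traceback
        logging.exception("could not read %s", path)
        return None


# Demo with a temporary file and a path that does not exist
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "example.txt")
    with open(good, "w") as out:
        out.write("hello")
    present = read_text(good)
    absent = read_text(os.path.join(tmp, "missing.txt"))
```

Narrowing the except clause this way means a programming error inside the block still surfaces instead of being logged as a file problem.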

Handling IncompleteRead,URLError

This is part of a web-mining script.
def printer(q,missing):
    while 1:
        tmpurl=q.get()
        try:
            image=urllib2.urlopen(tmpurl).read()
        except httplib.HTTPException:
            missing.put(tmpurl)
            continue
        wf=open(tmpurl[-35:]+".jpg","wb")
        wf.write(image)
        wf.close()
q is a Queue() of URLs and missing is an empty queue that gathers the URLs that raise errors.
It runs in parallel on 10 threads.
Every time I run it, I get this:
  File "C:\Python27\lib\socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "C:\Python27\lib\httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "C:\Python27\lib\httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
  File "C:\Python27\lib\httplib.py", line 649, in _safe_read
    raise IncompleteRead(''.join(s), amt)
IncompleteRead: IncompleteRead(5274 bytes read, 2918 more expected)
But I do use the except...
I tried other things, like
httplib.IncompleteRead
urllib2.URLError
and even
image=urllib2.urlopen(tmpurl,timeout=999999).read()
but none of this works.
How can I catch the IncompleteRead and URLError?
I think the correct answer to this question depends on what you consider an "error-raising URL".
Methods of catching multiple exceptions
If you think any URL which raises an exception should be added to the missing queue then you can do:
try:
    image=urllib2.urlopen(tmpurl).read()
except (httplib.HTTPException, httplib.IncompleteRead, urllib2.URLError):
    missing.put(tmpurl)
    continue
This will catch any of those three exceptions and add that url to the missing queue. More simply you could do:
try:
    image=urllib2.urlopen(tmpurl).read()
except:
    missing.put(tmpurl)
    continue
This catches any exception, but it is not considered Pythonic and could hide other possible errors in your code.
If by "error-raising URL" you mean any URL that raises an httplib.HTTPException error but you'd still like to keep processing if the other errors are received then you can do:
try:
    image=urllib2.urlopen(tmpurl).read()
except httplib.HTTPException:
    missing.put(tmpurl)
    continue
except (httplib.IncompleteRead, urllib2.URLError):
    continue
This will only add the URL to the missing queue if it raises an httplib.HTTPException, but will otherwise catch urllib2.URLError and keep your script from crashing. Note that httplib.IncompleteRead is itself a subclass of httplib.HTTPException, so the first clause already handles it and such URLs do end up in the missing queue.
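How the clauses are matched can be checked with a small sketch. It is written for Python 3, where httplib and urllib2 became http.client and urllib.error. classify is a hypothetical helper, and the exceptions are raised by hand rather than by a real download.

```python
import http.client
import urllib.error


def classify(exc):
    """Raise exc and report which except clause handles it."""
    try:
        raise exc
    except http.client.HTTPException:
        return "missing"   # would be put on the missing queue
    except urllib.error.URLError:
        return "skipped"   # silently skipped


is_sub = issubclass(http.client.IncompleteRead, http.client.HTTPException)
print(is_sub)
print(classify(http.client.IncompleteRead(b"partial")))
print(classify(urllib.error.URLError("unreachable")))
```

Clauses are tried top to bottom, so a subclass listed after its base class is never reached.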
Iterating over a Queue
As an aside, while 1 loops are always a bit concerning to me. If the producer puts a sentinel value such as "STOP" on the queue when there is no more work, you should be able to loop through the Queue contents using the following pattern, though you're free to continue doing it your way:
for tmpurl in iter(q.get, "STOP"):
    # rest of your code goes here
    pass
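A self-contained Python 3 sketch of the sentinel pattern follows (the module is named Queue in Python 2). The URLs are placeholders, and with several worker threads each worker would need its own "STOP" entry.

```python
import queue

q = queue.Queue()
for item in ["http://example.com/a", "http://example.com/b", "STOP"]:
    q.put(item)

seen = []
# iter(callable, sentinel) calls q.get() repeatedly until it returns "STOP"
for tmpurl in iter(q.get, "STOP"):
    seen.append(tmpurl)

print(seen)
```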
Safely working with files
As another aside, unless it's absolutely necessary to do otherwise, you should use context managers to open and modify files. So your three file-operation lines would become:
with open(tmpurl[-35:]+".jpg","wb") as wf:
    wf.write(image)
The context manager takes care of closing the file, and will do so even if an exception occurs while writing to the file.

Identify the exception to be used from the traceback

I want to catch an exception when a user fails to log in because of a wrong password.
So I wrote a function using imaplib. I entered a wrong password and got a traceback with error details.
My question is actually general: how do you identify the exception to name in the try/except block from the error messages?
This is what I got:
>>> count("testarc31#gmail.com","Xbox#36")
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    count("testarc31#gmail.com","Xbox#36")
  File "E:\Arindam\py_progs\Mail Notifier\0.0.19\Mail.py", line 24, in count
    obj.login(m,p)
  File "C:\Python27\lib\imaplib.py", line 500, in login
    raise self.error(dat[-1])
error: [AUTHENTICATIONFAILED] Invalid credentials (Failure)
If I want to write a try/except, what do I put in the except part?
try:
    login(mail, password)
except ????:
    something
Questions:
1) What should ???? be here? Can it be deduced directly from the error report?
2) Is there a general way to identify which exception to catch from the error we get?
You want to use something like this:
try:
    ...code that might raise an exception...
except ExceptionType as e:
    ...do something...
In your case, you probably want this:
try:
    login(mail, password)
except imaplib.IMAP4.error as e:
    print("Ouch -- an error from imaplib!")
To identify the type of an exception, look at the last line of the traceback: the text before the colon is the exception's class name. In this case it's just "error" -- unfortunately the module name is not included. The raise self.error(dat[-1]) line in imaplib.py just above it shows where it comes from. You can also get a better idea by doing:
try:
    login(mail, password)
except Exception as e:
    print(type(e))
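To get a name that can be pasted into an except clause, the module and qualified class name can be combined. This is a Python 3 sketch: describe is a hypothetical helper, and the exception is raised by hand with the message from the traceback instead of performing a real login.

```python
import imaplib


def describe(exc):
    """Return the module-qualified class name of an exception instance."""
    cls = type(exc)
    return cls.__module__ + "." + cls.__qualname__


try:
    raise imaplib.IMAP4.error("[AUTHENTICATIONFAILED] Invalid credentials (Failure)")
except Exception as e:
    name = describe(e)

print(name)
```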
