Here is a piece of a web-mining script:
def printer(q, missing):
    while 1:
        tmpurl = q.get()
        try:
            image = urllib2.urlopen(tmpurl).read()
        except httplib.HTTPException:
            missing.put(tmpurl)
            continue
        wf = open(tmpurl[-35:] + ".jpg", "wb")
        wf.write(image)
        wf.close()
q is a Queue() of URLs, and missing is an empty queue that gathers error-raising URLs.
It runs in parallel across 10 threads.
Every time I run this, I get:
File "C:\Python27\lib\socket.py", line 351, in read
data = self._sock.recv(rbufsize)
File "C:\Python27\lib\httplib.py", line 541, in read
return self._read_chunked(amt)
File "C:\Python27\lib\httplib.py", line 592, in _read_chunked
value.append(self._safe_read(amt))
File "C:\Python27\lib\httplib.py", line 649, in _safe_read
raise IncompleteRead(''.join(s), amt)
IncompleteRead: IncompleteRead(5274 bytes read, 2918 more expected)
but I do use the except clause...
I tried other things, like catching
httplib.IncompleteRead
urllib2.URLError
and even
image = urllib2.urlopen(tmpurl, timeout=999999).read()
but none of this is working.
How can I catch the IncompleteRead and URLError?
I think the correct answer to this question depends on what you consider an "error-raising URL".
Methods of catching multiple exceptions
If you think any URL which raises an exception should be added to the missing queue then you can do:
try:
    image = urllib2.urlopen(tmpurl).read()
except (httplib.HTTPException, httplib.IncompleteRead, urllib2.URLError):
    missing.put(tmpurl)
    continue
This will catch any of those three exceptions and add that URL to the missing queue. More simply, you could do:
try:
    image = urllib2.urlopen(tmpurl).read()
except:
    missing.put(tmpurl)
    continue
This catches any exception at all, but it is not considered Pythonic and could hide other possible errors in your code.
If by "error-raising URL" you mean any URL that raises an httplib.HTTPException error but you'd still like to keep processing if the other errors are received then you can do:
try:
    image = urllib2.urlopen(tmpurl).read()
except httplib.HTTPException:
    missing.put(tmpurl)
    continue
except (httplib.IncompleteRead, urllib2.URLError):
    continue
This will only add the URL to the missing queue if it raises an httplib.HTTPException, but will otherwise catch httplib.IncompleteRead and urllib2.URLError and keep your script from crashing.
Iterating over a Queue
As an aside, while 1 loops are always a bit concerning to me. You should be able to loop through the Queue contents using the following pattern, though you're free to continue doing it your way:
for tmpurl in iter(q.get, "STOP"):
    # rest of your code goes here
    pass
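For that pattern to terminate, each worker needs to eventually pull a "STOP" value off the queue. Here is a minimal sketch of the producer side, assuming Python 2.7 (as in your traceback) and the 10 threads you mention; the urls list is a placeholder for your actual URL source:

import threading
from Queue import Queue

q = Queue()
missing = Queue()

workers = [threading.Thread(target=printer, args=(q, missing)) for _ in range(10)]
for w in workers:
    w.start()

for url in urls:      # urls is assumed to be your list of image URLs
    q.put(url)

for _ in workers:     # one sentinel per worker, so every thread leaves its loop
    q.put("STOP")
for w in workers:
    w.join()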
Safely working with files
As another aside, unless it's absolutely necessary to do otherwise, you should use context managers to open and modify files. So your three file-operation lines would become:
with open(tmpurl[-35:] + ".jpg", "wb") as wf:
    wf.write(image)
The context manager takes care of closing the file, and will do so even if an exception occurs while writing to the file.
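Putting those pieces together, a revised printer could look like this (a sketch that combines the snippets above; the "STOP" sentinel and the choice to catch all three exceptions are assumptions you may want to adjust):

import httplib
import urllib2

def printer(q, missing):
    for tmpurl in iter(q.get, "STOP"):
        try:
            image = urllib2.urlopen(tmpurl).read()
        except (httplib.HTTPException, httplib.IncompleteRead, urllib2.URLError):
            missing.put(tmpurl)
            continue
        # the context manager closes the file even if the write fails
        with open(tmpurl[-35:] + ".jpg", "wb") as wf:
            wf.write(image)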
Related
I have a good piece of code to read FASTA files:
import gzip
from itertools import groupby

def is_header(line):
    return line[0] == '>'

def parse_fasta(filename):
    if filename.endswith('.gz'):
        opener = lambda filename: gzip.open(filename, 'rb')
    else:
        opener = lambda filename: open(filename, 'r')
    with opener(filename) as f:
        fasta_iter = (it[1] for it in groupby(f, is_header))
        for name in fasta_iter:
            name = name.__next__()[1:].strip()
            sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
            yield name, sequences.upper()
It worked fine until I updated to Python 3.10.4; now when I try to use it I get this error:
Traceback (most recent call last):
  File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_parser.py", line 21, in parse_fasta
    sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_split_chr_pls.py", line 112, in <module>
    sys.exit(main())
  File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_split_chr_pls.py", line 81, in main
    plasmid, chromosome = split_sequences_from_fasta_file(filename)
  File "/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/fasta_parser.py", line 111, in split_sequences_from_fasta_file
    for header, seq in parse_fasta(filename)
RuntimeError: generator raised StopIteration
I run my code in a conda (conda 4.13.0) environment, and I tried to debug the function but got stuck.
I don't want to lose this function and have to switch to Biopython.
If you have any idea how to fix it, I'd really appreciate it.
Thanks,
Paulo
Example of fasta file:
> seq_name
AGGTAGGGA
The funny thing is that when I run the function in the Python interpreter at the command line, everything works fine, but when I call the function imported from the script, that's when I get the errors.
>>> import gzip
>>> from itertools import groupby
>>> def is_header(line):
...     return line[0] == '>'
...
>>> for name, seq in parse_fasta("/media/paulosschlogl/Paulo/pauloscchlogl/Genome_kmers/Genomes/Archaea/Asgard/Chr/GCA_008000775.1_ASM800077v1_genomic.fna.gz"):
...     print(name, seq[:50])
...
CP042905.1 Candidatus Prometheoarchaeum syntrophicum strain MK-D1 chromosome, complete genome TAAATATTATAGCCCGTAATAGCAGAGTCACCAACACTTAAAGGTGCATC
>>> quit()
The exception you're getting is because you're manually calling the __next__ method on various iterators in your code. Eventually you do that on an iterator that doesn't have any values left, and you'll get a StopIteration exception raised.
In much older versions of Python, that was OK to leave uncaught in a generator function; the StopIteration exception would continue to bubble up just like any other exception. For a generator function, raising StopIteration is an expected part of its behavior (it happens automatically when the function ends, either with a return statement or by reaching the end of its code). This changed with PEP 479, introduced in Python 3.5 and the default behavior since 3.7: a StopIteration that goes uncaught inside a generator is now converted into a RuntimeError.
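A minimal, standalone illustration of that change (not taken from your code):

def gen():
    yield next(iter([]))   # next() on an empty iterator raises StopIteration inside the generator

list(gen())
# Python 2 and Python < 3.5: the generator simply ends, so this returns []
# Python 3.7+ (PEP 479 by default): RuntimeError: generator raised StopIteration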
Now, given the logic of your code, I'm not exactly sure why you're getting empty iterators. If the file is in the format you describe, the __next__ calls should always have a value to get, and the StopIteration that comes when there are no values left will be received by the for loop, which will suppress it (and just end the loop). Perhaps some of your files are not correctly formatted, with a header line by itself with no subsequent sequences?
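For example, a hypothetical file that ends with a bare header would trigger exactly that, because the final fasta_iter.__next__() call has no sequence group left to return:

>seq_1
AGGTAGGGA
>trailing_header_with_no_sequence_lines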
In any case, you can better diagnose the issue if you catch the StopIteration and print out some diagnostic information. I'd try:
with opener(filename) as f:
    fasta_iter = (it[1] for it in groupby(f, is_header))
    for name in fasta_iter:
        name = name.__next__()[1:].strip()
        try:
            sequences = ''.join(seq.strip() for seq in fasta_iter.__next__())
        except StopIteration as si:
            print(f'no sequence for {name}')
            raise ValueError() from si
        yield name, sequences.upper()
If you find that the missing sequence is a normal thing and happens at the end of every one of your files, then you could suppress the error by just putting a return statement in the except block, or maybe by using zip in your for loop (for name_iter, sequences_iter in zip(fasta_iter, fasta_iter):). I'd hesitate to jump to that as the first solution though, since it will throw away a header if there is an extra one, and silently losing data is generally a bad idea.
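For reference, the zip variant mentioned above, applied to the body of parse_fasta, could look like this (a sketch; note that it silently drops a trailing header that has no sequence lines):

with opener(filename) as f:
    fasta_iter = (it[1] for it in groupby(f, is_header))
    # zip pairs each header group with the sequence group that follows it
    for header_group, seq_group in zip(fasta_iter, fasta_iter):
        name = next(header_group)[1:].strip()
        sequences = ''.join(seq.strip() for seq in seq_group)
        yield name, sequences.upper()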
Note: I have written my code after referring to a few examples on Stack Overflow, but I still could not get the required output.
I have a Python script in which a loop iterates over an Instagram API. I give the user_id as input to the API, which returns the number of posts, followers, and following. Each time it gets a response, I load it as JSON and append the values to the lists data1, data2, and data3.
The issue is: some accounts are private and the API call is not allowed for them. When I run the script in the IDLE Python shell, it gives this error:
Traceback (most recent call last):
  File "<pyshell#144>", line 18, in <module>
    beta=json.load(url)
  File "C:\Users\rnair\AppData\Local\Programs\Python\Python35\lib\site-packages\simplejson-3.8.2-py3.5-win-amd64.egg\simplejson\__init__.py", line 455, in load
    return loads(fp.read(),
  File "C:\Users\rnair\AppData\Local\Programs\Python\Python35\lib\tempfile.py", line 483, in func_wrapper
    return func(*args, **kwargs)
ValueError: read of closed file
But the JSON response contains this:
{
    "meta": {
        "error_type": "APINotAllowedError",
        "code": 400,
        "error_message": "you cannot view this resource"
    }
}
My code is:
for r in range(307, 601):
    var = r, sheet.cell(row=r, column=2).value
    xy = var[1]
    ij = str(xy)
    if xy == "Account Deleted":
        data1.append('null')
        data2.append('null')
        data3.append('null')
        continue
    myopener = Myopen()
    try:
        url = myopener.open('https://api.instagram.com/v1/users/' + ij + '/?access_token=641567093.1fb234f.a0ffbe574e844e1c818145097050cf33')
    except urllib.error.HTTPError as e:  # I want the change here
        data1.append('Private Account')
        data2.append('Private Account')
        data3.append('Private Account')
        continue
    beta = json.load(url)
    item = beta['data']['counts']
    data1.append(item['media'])
    data2.append(item['followed_by'])
    data3.append(item['follows'])
I am using Python 3.5.2. The main question is: if the loop is running and a particular call is blocked and raises this error, how do I skip it and keep running the next iterations? Also, if the account is private, I want to append "Private Account" to the lists.
It looks like the code that is actually fetching the URL is within your custom type, Myopen (which is not shown). It also looks like it's not throwing the HTTPError you are expecting, since your json.load line is still being executed (and leading to the ValueError that is being thrown).
If you want your error-handling block to fire, you would need to check the response status code within Myopen, see whether it is != 200, and throw the HTTPError you are expecting instead of whatever it's doing now.
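If you go that route, one way to do it (a sketch that assumes Myopen subclasses urllib.request.FancyURLopener, which the question does not show) would be:

import urllib.request
import urllib.error

class Myopen(urllib.request.FancyURLopener):
    # FancyURLopener swallows HTTP errors by default; re-raise them so the
    # except urllib.error.HTTPError block in the loop actually fires
    def http_error_default(self, url, fp, errcode, errmsg, headers):
        raise urllib.error.HTTPError(url, errcode, errmsg, headers, fp)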
I'm not personally familiar with FancyURLopener, but it looks like it supports a getcode method. Maybe try something like this instead of expecting an HTTPError:
url = myopener.open('yoururl')
if url.getcode() == 400:
    data1.append('Private Account')
    data2.append('Private Account')
    data3.append('Private Account')
    continue
I have a problem with identifying an exception.
I'm writing a scraper that scrapes a lot of different websites; some errors I want to handle and some I only want to ignore.
I catch my exceptions like this:
except Exception as e:
Most of the exceptions I can identify like this:
type(e).__name__ == "IOError"
But I have one exception, "[Errno 10054] An existing connection was forcibly closed by the remote host", whose class name is just "error", which is too vague, and I'm guessing other errors also have that name. I'm guessing I can somehow get the errno number from the exception and thus identify it, but I don't know how.
First, you should not rely on the exception's class name, but on the class itself - two classes from two different modules can have the same value for the __name__ attribute while being different exceptions. So what you want is:
try:
    something_that_may_raise()
except IOError as e:
    handle_io_error(e)
except SomeOtherError as e:
    handle_some_other_error(e)
etc...
Then you have two kinds of exceptions: the ones that you can actually handle one way or another, and the other ones. If the program is only for your personal use, the best way to handle "the other ones" is usually to not handle them at all - the Python runtime will catch them and display a nice traceback with all relevant information (so you know what happened and where, and can eventually add some handling for this case).
If it's a "public" program and/or if you do have some things to clean up before the program crashes, you can add a last "catch-all" except clause at the program's top level that will log the error and traceback somewhere so it isn't lost (logging.exception is your friend), clean up what has to be cleaned up, and terminate with a more friendly error message.
There are very few cases where one would really want to just ignore an exception (I mean pretending nothing wrong or unexpected happened and happily continuing). At the very least you will want to notify the user that one of the actions failed and why - in your case that might be a top-level loop iterating over a set of sites to scrape, with an inner try/except block catching "expected" error cases, i.e.:
# config:
config = [
    # ('url', {params})
    ('some.site.tld', {"param1": value1, "param2": value2}),
    ('some.other.tld', {"param1": value1, "answer": 42}),
    # etc
]

def run():
    for url, params in config:
        try:
            results = scrap(url, **params)
        except (SomeKnownError, SomeOtherExceptedException) as e:
            # things that are to be expected and mostly harmless
            #
            # you configured your logger so that warnings only
            # go to stderr
            logger.warning("failed to scrap %s : %s - skipping", url, e)
        except (MoreSeriousError, SomethingIWannaKnowAbout) as e:
            # things that are more annoying and you want to know
            # about but that shouldn't prevent from continuing
            # with the remaining sites
            #
            # you configured your logger so that exceptions go
            # to both stderr and your email.
            logger.exception("failed to scrap %s : %s - skipping", url, e)
        else:
            do_something_with(results)
Then have a top-level handler around the call to run() that takes care of unexpected errors:
def main(argv):
    parse_args()
    try:
        set_up_everything()
        run()
        return 0
    except Exception as e:
        logger.exception("oops, something unexpected happened : %s", e)
        return 1
    finally:
        do_some_cleanup()

if __name__ == "__main__":
    sys.exit(main(sys.argv))
Note that the logging module has an SMTPHandler - but since mail can easily fail too, you'd better still have a reliable log (stderr and tee to a file?) locally. The logging module takes some time to learn, but it really pays off in the long run.
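For reference, a minimal setup along those lines (the logger name, file name, and addresses are placeholders, not from the question):

import logging
import logging.handlers

logger = logging.getLogger("scraper")
logger.setLevel(logging.DEBUG)

# reliable local logging: stderr plus a file
logger.addHandler(logging.StreamHandler())             # defaults to stderr
logger.addHandler(logging.FileHandler("scraper.log"))

# optional email for serious errors - an extra channel, not a replacement
smtp = logging.handlers.SMTPHandler(
    mailhost="localhost",
    fromaddr="scraper@example.com",
    toaddrs=["you@example.com"],
    subject="scraper error",
)
smtp.setLevel(logging.ERROR)
logger.addHandler(smtp)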
I was expecting the following would work but PyDev is returning an error:
try fh = open(myFile):
    logging.info("success")
except Exception as e:
    logging.critical("failed because:")
    logging.critical(e)
gives
Encountered "fh" at line 237, column 5. Was expecting: ":"...
I've looked around and cannot find a safe way to open a filehandle for reading in Python 3.4 and report errors properly. Can someone point me in the correct direction please?
You misplaced the :; it comes directly after try; it is better to put that on its own, separate line:
try:
    fh = open(myFile)
    logging.info("success")
except Exception as e:
    logging.critical("failed because:")
    logging.critical(e)
You placed the : after the open() call instead.
Instead of passing in e as a separate argument, you can tell logging to pick up the exception automatically:
try:
    fh = open(myFile)
    logging.info("success")
except Exception:
    logging.critical("failed because:", exc_info=True)
and a full traceback will be included in the log. This is what the logging.exception() function does; it'll call logging.error() with exc_info set to true, producing a message at log level ERROR plus a traceback.
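In other words, a shorter equivalent would be (note it logs at ERROR rather than CRITICAL):

try:
    fh = open(myFile)
    logging.info("success")
except Exception:
    logging.exception("failed because:")   # ERROR-level message plus the full traceback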
I'm trying to scrape one site (about 7000 links, all in a list), and because of my method it is taking a LONG time; I guess I'm OK with that (since it implies staying undetected). But if I get any kind of error while trying to retrieve a page, can I just skip it? Right now, if there's an error, the code breaks and gives me a bunch of error messages. Here's my code:
Collection is a list of lists and the resultant file. Basically, I'm trying to run a loop with get_url_data() (which I have a previous question to thank for) over all the URLs in urllist. I have something called HTTPError, but that doesn't seem to handle all the errors, hence this post. In a related side quest, it would also be nice to get a list of the URLs that couldn't be processed, but that's not my main concern (though it would be cool if someone could show me how).
Collection = []

def get_url_data(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
    except HTTPError:
        return None
    site = bs4.BeautifulSoup(r.text)
    groups = site.select('div.filters')
    word = url.split("/")[-1]
    B = []
    for x in groups:
        B.append(word)
        T = [a.get_text() for a in x.select('div.blahblah [class=txt]')]
        A1 = [a.get_text() for a in site.select('div.blah [class=txt]')]
        if len(T) == 1 and len(A1) > 0 and T[0] == 'verb' and A1[0] != 'as in':
            B.append(T)
            B.append([a.get_text() for a in x.select('div.blahblah [class=ttl]')])
            B.append([a.get_text() for a in x.select('div.blah [class=text]')])
        Collection.append(B)
        B = []

for url in urllist:
    get_url_data(url)
I think the main error was this one, which triggered the others, because there were a bunch of errors starting with "During handling of the above exception, another exception occurred":
Traceback (most recent call last):
  File "C:\Python34\lib\site-packages\requests\packages\urllib3\connectionpool.py", line 319, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'
You can make your try/except block look like this:
try:
    r = requests.get(url, timeout=10)
    r.raise_for_status()
except Exception:
    return
The Exception class will handle all the errors and exceptions.
If you want to get the exception message, you can print it in your except block. You then have to bind the exception instance to a name ("as e") first:
except Exception as e:
    print(e)
    return
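As for the side question of collecting the URLs that could not be processed, a minimal sketch (the failed_urls list is an addition of mine, not part of the original code):

import requests

failed_urls = []   # hypothetical: URLs that raised during retrieval

def get_url_data(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
    except Exception as e:
        print(e)
        failed_urls.append(url)   # remember the URL so you can retry or inspect it later
        return
    # ... rest of the parsing code from the question ...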