I have a small script that repeatedly (hourly) fetches tweets from the API, using sixohsix's
Twitter Wrapper for Python. I handle most, if not all, of the errors coming from the Twitter API, i.e. all the 5xx and 4xx stuff.
Nonetheless, roughly once every 2-3 days the program exits and the traceback below is displayed in the shell. I have no clue what it means, but I don't think it is directly related to what my script does, since the script runs correctly most of the time.
This is where I call a function of the wrapper in my script:
KW = {
    'count': 200,            # number of tweets to fetch (fetch maximum)
    'user_id': tweeter['user_id'],
    'include_rts': 'false',  # do not include native RTs
    'trim_user': 'true',
}

timeline = tw.twitter_request(tw_endpoint,
                              tw_endpoint.statuses.user_timeline, KW)
The function tw.twitter_request(tw_endpoint, tw_endpoint.statuses.user_timeline, KW) basically does return tw_endpoint.statuses_user_timeline(**args), where args translates to KW, and tw_endpoint is an OAuth-authorized endpoint obtained from the sixohsix library's
return twitter.Twitter(domain='api.twitter.com', api_version='1.1',
auth=twitter.oauth.OAuth(access_token, access_token_secret,
consumer_key, consumer_secret))
This is the traceback:
Traceback (most recent call last):
File "search_twitter_entities.py", line 166, in <module>
tw_endpoint.statuses.user_timeline, KW)
File "/home/tg/mild/twitter_utils.py", line 171, in twitter_request
return twitter_function(**args)
File "build/bdist.linux-x86_64/egg/twitter/api.py", line 173, in __call__
File "build/bdist.linux-x86_64/egg/twitter/api.py", line 177, in _handle_response
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1180, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
The only thing I can gather from that traceback is that the error happens somewhere deep inside another Python library and has something to do with an invalid HTTP status line coming from the Twitter API or the wrapper... But as I said, maybe some of you could give me a hint on how to debug or solve this, since it is pretty annoying having to regularly check my script and restart it to keep fetching tweets.
EDIT: To clarify a little, the first two functions in the traceback are already inside a try-except block. For example, the try-except block in twitter_utils.py filters out 4xx and 5xx exceptions, but also catches general exceptions with a bare except:. So what I don't understand is why the error is not caught at that point and instead the program is force-closed and a traceback is printed. In short, I seem to be in a situation where I cannot catch an error, much like a parse error in a PHP script. How would I handle this?
Perhaps this will point you in the right direction. This is the code that runs when BadStatusLine is raised:
class BadStatusLine(HTTPException):
    def __init__(self, line):
        if not line:
            line = repr(line)
        self.args = line,
        self.line = line
I'm not too familiar with httplib, but if I had to guess, you're getting an empty response/status line and, well, it can't be parsed. There is a comment right before the line your program is stopping at:
# Presumably, the server closed the connection before
# sending a valid response.
raise BadStatusLine(line)
If Twitter is closing the connection before sending a response, you could simply try again: wrap the call at search_twitter_entities.py, line 166 in a try/except and repeat it a couple of times (ugly).
try:
    timeline = tw.twitter_request(tw_endpoint,
                                  tw_endpoint.statuses.user_timeline, KW)
except:
    try:
        timeline = tw.twitter_request(tw_endpoint,
                                      tw_endpoint.statuses.user_timeline, KW)  # try again
    except:
        pass
Or, assuming you can reassign timeline as None each time, do a while loop:
timeline = None
while timeline is None:
    try:
        timeline = tw.twitter_request(tw_endpoint,
                                      tw_endpoint.statuses.user_timeline, KW)
    except:
        pass
I didn't test any of that, so check for bad code.
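A slightly more structured variant of the same idea, as an untested sketch: it assumes the same tw.twitter_request call and KW dict from the question, retries a bounded number of times, and catches httplib.BadStatusLine explicitly instead of using a bare except.

import time
import httplib

timeline = None
for attempt in range(3):  # bounded retries instead of an endless loop
    try:
        timeline = tw.twitter_request(tw_endpoint,
                                      tw_endpoint.statuses.user_timeline, KW)
        break  # success, stop retrying
    except httplib.BadStatusLine:
        # the server closed the connection before sending a status line;
        # wait a moment and try again
        time.sleep(5)

if timeline is None:
    pass  # all attempts failed; skip this user for now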
Related
I have a bit of code that uses newspaper to go take a look at various media outlets and download articles from them. This has been working fine for a long time but has recently started acting up. I can see what the problem is but as I'm new to Python I'm not sure about the best way to address it. Basically (I think) I need to make a modification to keep the occasional malformed web address from crashing the script entirely and instead allow it to dispense with that web address and move on to the others.
The origins of the error is when I attempt to download an article using:
article.download()
Some articles (they change every day obviously) will throw the following error but the script continues to run:
Traceback (most recent call last):
File "C:\Anaconda3\lib\encodings\idna.py", line 167, in encode
raise UnicodeError("label too long")
UnicodeError: label too long
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\newspaper\mthreading.py", line 38, in run
func(*args, **kargs)
File "C:\Anaconda3\lib\site-packages\newspaper\source.py", line 350, in download_articles
html = network.get_html(url, config=self.config)
File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 39, in get_html return get_html_2XX_only(url, config, response)
File "C:\Anaconda3\lib\site-packages\newspaper\network.py", line 60, in get_html_2XX_only url=url, **get_request_kwargs(timeout, useragent))
File "C:\Anaconda3\lib\site-packages\requests\api.py", line 72, in get return request('get', url, params=params, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\api.py", line 58, in request return session.request(method=method, url=url, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 502, in request resp = self.send(prep, **send_kwargs)
File "C:\Anaconda3\lib\site-packages\requests\sessions.py", line 612, in send r = adapter.send(request, **kwargs)
File "C:\Anaconda3\lib\site-packages\requests\adapters.py", line 440, in send timeout=timeout
File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen chunked=chunked)
File "C:\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 356, in _make_request conn.request(method, url, **httplib_request_kw)
File "C:\Anaconda3\lib\http\client.py", line 1107, in request self._send_request(method, url, body, headers)
File "C:\Anaconda3\lib\http\client.py", line 1152, in _send_request self.endheaders(body)
File "C:\Anaconda3\lib\http\client.py", line 1103, in endheaders self._send_output(message_body)
File "C:\Anaconda3\lib\http\client.py", line 934, in _send_output self.send(msg)
File "C:\Anaconda3\lib\http\client.py", line 877, in send self.connect()
File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 166, in connect conn = self._new_conn()
File "C:\Anaconda3\lib\site-packages\urllib3\connection.py", line 141, in _new_conn (self.host, self.port), self.timeout, **extra_kw)
File "C:\Anaconda3\lib\site-packages\urllib3\util\connection.py", line 60, in create_connection for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "C:\Anaconda3\lib\socket.py", line 733, in getaddrinfo for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
The next bit is then supposed to parse each article, run natural language processing on it, and write certain elements to a dataframe, so I then have:
for paper in papers:
    for article in paper.articles:
        article.parse()
        print(article.title)
        article.nlp()
        if article.publish_date is None:
            d = datetime.now().date()
        else:
            d = article.publish_date.date()
        stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title,
                          article.summary, article.keywords, article.url]
        i += 1
(This might be a little sloppy too but that's a problem for another day)
This runs fine until it gets to one of the URLs with that error; then it throws an ArticleException and the script crashes:
C:\Anaconda3\lib\site-packages\PIL\TiffImagePlugin.py:709: UserWarning: Corrupt EXIF data. Expecting to read 2 bytes but only got 0.
warnings.warn(str(msg))
ArticleException                          Traceback (most recent call last)
<ipython-input-17-2106485c4bbb> in <module>()
4 for paper in papers:
5 for article in paper.articles:
----> 6 article.parse()
7 print(article.title)
8 article.nlp()
C:\Anaconda3\lib\site-packages\newspaper\article.py in parse(self)
183
184 def parse(self):
--> 185 self.throw_if_not_downloaded_verbose()
186
187 self.doc = self.config.get_parser().fromstring(self.html)
C:\Anaconda3\lib\site-packages\newspaper\article.py in throw_if_not_downloaded_verbose(self)
519 if self.download_state == ArticleDownloadState.NOT_STARTED:
520 print('You must `download()` an article first!')
--> 521 raise ArticleException()
522 elif self.download_state == ArticleDownloadState.FAILED_RESPONSE:
523 print('Article `download()` failed with %s on URL %s' %
ArticleException:
So what's the best way to keep this from terminating my script? Should I address it in the download stage where I'm getting the unicode error or at the parse stage by telling it to overlook those bad addresses? And how would I go about implementing that correction?
Really appreciate any advice.
I had the same issue, and although using except: pass is generally not recommended, the following worked for me:
try:
    a.parse()
    file.write(a.title + '\n')
except:
    pass
What I've found is that Navid is correct for this exact problem.
However, .parse() is only one of the functions that can trip you up. I wrap all the calls inside a try/except structure like this:
word_list = []
for words in google_news.articles:
    try:
        words.download()
        words.parse()
        words.nlp()
    except:
        pass
    word_list.append(words.keywords)
You can try catching the ArticleException. Don't forget to import the newspaper module.
try:
    article.download()
    article.parse()
except newspaper.article.ArticleException:
    pass  # do something, e.g. log the failed URL and move on
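Putting that together with the loop from the question, a minimal sketch (untested; it assumes your existing papers, stories and i variables) that skips articles whose download failed rather than letting parse() abort the whole run:

import newspaper
from datetime import datetime

for paper in papers:
    for article in paper.articles:
        try:
            # parse() raises ArticleException when the earlier download() failed
            article.parse()
            article.nlp()
        except newspaper.article.ArticleException:
            continue  # skip the article with the bad URL and move on
        print(article.title)
        if article.publish_date is None:
            d = datetime.now().date()
        else:
            d = article.publish_date.date()
        stories.loc[i] = [paper.brand, d, datetime.now().date(), article.title,
                          article.summary, article.keywords, article.url]
        i += 1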
I've been at this for the better part of a day but have been coming up with the same Error 400 for quite some time. Basically, the application's goal is to parse a book's ISBN from the Amazon referral url and use it as the reference key to pull images from Amazon's Product Advertising API. The webpage is written in Python 3.4 and Django 1.8. I spent quite a while researching on here and settled for using python-amazon-simple-product-api since it would make parsing results from Amazon a little easier.
Answers like How to use Python Amazon Simple Product API to get price of a product
make it seem pretty simple, but I haven't quite gotten it to look up a product successfully yet. Here's a console printout of what my method usually does, with the correct ISBN already filled in:
>>> from amazon.api import AmazonAPI
>>> access_key='amazon-access-key'
>>> secret ='amazon-secret-key'
>>> assoc ='amazon-associate-account-name'
>>> amazon = AmazonAPI(access_key, secret, assoc)
>>> product = amazon.lookup(ItemId='1632360705')
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/home/tsuko/.virtualenvs/django17/lib/python3.4/site-packages/amazon/api.py", line 161, in lo
okup
response = self.api.ItemLookup(ResponseGroup=ResponseGroup, **kwargs)
File "/home/tsuko/.virtualenvs/django17/lib/python3.4/site-packages/bottlenose/api.py", line 242, i
n __call__
{'api_url': api_url, 'cache_url': cache_url})
File "/home/tsuko/.virtualenvs/django17/lib/python3.4/site-packages/bottlenose/api.py", line 203, i
n _call_api
return urllib2.urlopen(api_request, timeout=self.Timeout)
File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 461, in open
response = meth(req, response)
File "/usr/lib/python3.4/urllib/request.py", line 571, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.4/urllib/request.py", line 499, in error
return self._call_chain(*args)
File "/usr/lib/python3.4/urllib/request.py", line 433, in _call_chain
result = func(*args)
File "/usr/lib/python3.4/urllib/request.py", line 579, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
Now I'm curious whether this is some quirk of PythonAnywhere or whether I've missed a configuration setting in Django. As far as I can tell from AWS and the Amazon Associates page, my keys are correct. I'm not too worried about parsing at this point, just getting the object. I've even tried bypassing the API and using Bottlenose directly (which the API extends), but I get the same Error 400 result.
I'm really new to Django and Amazon's API; any assistance would be appreciated!
You still haven't authorised your account for API access. To do so, you can go to https://affiliate-program.amazon.com/gp/advertising/api/registration/pipeline.html
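If the 400 persists after registering, one thing that can help narrow it down is to catch the HTTPError yourself and print the response body, since the Product Advertising API often includes an XML message explaining why the request was rejected. A sketch, assuming the same amazon object from the console session above:

from urllib.error import HTTPError

try:
    product = amazon.lookup(ItemId='1632360705')
except HTTPError as e:
    # the response body often explains why the request was rejected
    # (bad signature, unrecognised Associate tag, missing parameter, ...)
    print(e.code)
    print(e.read().decode('utf-8'))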
I wrote a feature to run a task asynchronously via Celery and tested it locally, and it's all good. I shipped it to my staging environment, and when Celery tries to consume the tasks it fails with the following traceback.
I'm not even sure how I can go about debugging this error, as it's being triggered from Celery and happens deep in the Python standard lib. Any ideas?
Traceback (most recent call last):
File "/home/ubuntu/hypnos-venv/local/lib/python2.7/site-packages/celery/app/trace.py", line 238, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/ubuntu/hypnos-venv/local/lib/python2.7/site-packages/celery/app/trace.py", line 416, in __protected_call__
return self.run(*args, **kwargs)
File "/home/ubuntu/Hypnos/hypnos/recs_jobber/tasks.py", line 5, in send_sms_action
msg = twilio_client.sms.messages.create(body = sms_action.body, to=sms_action.to_number, from_=TW_NUMBER)
File "/home/ubuntu/hypnos-venv/local/lib/python2.7/site-packages/twilio/rest/resources/sms_messages.py", line 167, in create
return self.create_instance(kwargs)
File "/home/ubuntu/hypnos-venv/local/lib/python2.7/site-packages/twilio/rest/resources/base.py", line 352, in create_instance
data=transform_params(body))
File "/home/ubuntu/hypnos-venv/local/lib/python2.7/site-packages/twilio/rest/resources/base.py", line 204, in request
resp = make_twilio_request(method, uri, auth=self.auth, **kwargs)
File "/home/ubuntu/hypnos-venv/local/lib/python2.7/site-packages/twilio/rest/resources/base.py", line 129, in make_twilio_request
resp = make_request(method, uri, **kwargs)
File "/home/ubuntu/hypnos-venv/local/lib/python2.7/site-packages/twilio/rest/resources/base.py", line 101, in make_request
resp, content = http.request(url, method, headers=headers, body=data)
File "/home/ubuntu/hypnos-venv/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1570, in request
(response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
File "/home/ubuntu/hypnos-venv/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1317, in _request
(response, content) = self._conn_request(conn, request_uri, method, body, headers)
File "/home/ubuntu/hypnos-venv/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1252, in _conn_request
conn.connect()
File "/home/ubuntu/hypnos-venv/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1017, in connect
sock.settimeout(self.timeout)
File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
TypeError: a float is required
I used Celery's rdb to get into the frame, and it seems that the socket timeout is not set.
(Pdb) self.timeout
<Unset Timeout Value>
Any ideas how I might solve this? The error is flowing from Twilio -> httplib2 -> socket.py, which is a wrapper around _socket. This is over my head and I'm not sure even how to approach the problem.
By default, Twilio's lib sets timeout to:
class _UnsetTimeoutKls(object):
    """ A sentinel for an unset timeout. Defaults to the system timeout. """
    def __repr__(self):
        return '<Unset Timeout Value>'

# None has special meaning for timeouts, so we use this sigil to indicate
# that we don't care
UNSET_TIMEOUT = _UnsetTimeoutKls()
So something must be happening between instantiating a TwilioRestClient and the actual socket call that passes that _UnsetTimeoutKls sentinel through instead of resolving it to None.
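You can reproduce the error in isolation with a small sketch (Python 2.7): socket.settimeout() accepts only None or a number, so passing any other object, including this sentinel, raises exactly the TypeError from the traceback:

import socket

s = socket.socket()
s.settimeout(None)      # fine: blocking mode
s.settimeout(30.0)      # fine: 30-second timeout
s.settimeout(object())  # TypeError: a float is required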
Setting timeout to None when initialising TwilioRestClient seems to fix the error:
self.client = TwilioRestClient(TWILIO_SID,
                               TWILIO_AUTH_TOKEN,
                               timeout=None)
Here's what's going on:
You're telling the Twilio library to create an SMS message.
Twilio goes over to httplib2 to connect to the Twilio servers.
httplib2, while connecting to the Twilio servers, sets the timeout of a socket.
The only problem is that for some reason, self.timeout in the penultimate stack frame is not a float as required. You may want to try running your application under the Python debugger, e.g.:
python -m pdb myapp.py
You'll get to a prompt from which you can type run to run your application. Once the error occurs, it should drop you back to the prompt. You should then be able to type up until you get to the offending frame and see what self.timeout is. You may then want to look around to see where self.timeout is getting set to that, and why. You should then be able to resolve the issue by fixing that.
I am seeing the python-requests library crash with the following traceback:
Traceback (most recent call last):
File "/usr/lib/python3.2/http/client.py", line 529, in _read_chunked
chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./app.py", line 507, in getUrlContents
response = requests.get(url, headers=headers, auth=authCredentials, timeout=http_timeout_seconds)
File "/home/dotancohen/code/lib/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/home/dotancohen/code/lib/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/home/dotancohen/code/lib/requests/sessions.py", line 338, in request
resp = self.send(prep, **send_kwargs)
File "/home/dotancohen/code/lib/requests/sessions.py", line 441, in send
r = adapter.send(request, **kwargs)
File "/home/dotancohen/code/lib/requests/adapters.py", line 340, in send
r.content
File "/home/dotancohen/code/lib/requests/models.py", line 601, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "/home/dotancohen/code/lib/requests/models.py", line 542, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/home/dotancohen/code/lib/requests/packages/urllib3/response.py", line 222, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/home/dotancohen/code/lib/requests/packages/urllib3/response.py", line 173, in read
data = self._fp.read(amt)
File "/usr/lib/python3.2/http/client.py", line 489, in read
return self._read_chunked(amt)
File "/usr/lib/python3.2/http/client.py", line 534, in _read_chunked
raise IncompleteRead(b''.join(value))
http.client.IncompleteRead: IncompleteRead(0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.2/threading.py", line 740, in _bootstrap_inner
self.run()
File "./app.py", line 298, in run
self.target(*self.args)
File "./app.py", line 400, in provider_query
url_contents = getUrlContents(str(providerUrl), '', authCredentials)
File "./app.py", line 523, in getUrlContents
except http.client.IncompleteRead as error:
NameError: global name 'http' is not defined
As can be seen, I've tried to catch the http.client.IncompleteRead: IncompleteRead(0 bytes read) error that requests is throwing with the line except http.client.IncompleteRead as error:. However, that is throwing a NameError due to http not being defined. So how can I catch that exception?
This is the code throwing the exception:
import requests
from requests_oauthlib import OAuth1
authCredentials = OAuth1('x', 'x', 'x', 'x')
response = requests.get(url, auth=authCredentials, timeout=20)
Note that I am not importing the http library myself, though requests imports it internally. The error is very intermittent (it happens perhaps once every few hours, even if I run the requests.get() command every ten seconds), so I'm not sure whether adding the http library to my imports has helped or not.
In any case, in the general sense: if an imported library A in turn imports library B, is it impossible to catch exceptions from B without importing B myself?
To answer your question
In any case, in the general sense: if an imported library A in turn imports library B, is it impossible to catch exceptions from B without importing B myself?
Not at all: you can reach B through A. For example:
a.py:
import b
# do some stuff with b
c.py:
import a
# but you want to use b
a.b # gives you full access to module b which was imported by a
Although this does the job, it doesn't look so pretty, especially with long package/module/class/function names in the real world.
So in your case, to handle the http exception, either figure out which package/module within requests imports http so that you can refer to it as requests.XX.http.WhateverError, or simply import http yourself, since it is part of the standard library.
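Since http.client is in the standard library anyway, the simpler route is the second one. A sketch using the names from the question (url and authCredentials assumed to be defined as in your code):

import http.client

import requests

try:
    response = requests.get(url, auth=authCredentials, timeout=20)
    content = response.content
except http.client.IncompleteRead as error:
    # the server closed the connection before the chunked body was complete
    content = None
except requests.exceptions.RequestException as error:
    # anything requests itself raises (ConnectionError, Timeout, ...)
    content = None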
It's hard to analyze the problem when you only give the stdout and not the source,
but check this link out: http://docs.python-requests.org/en/latest/user/quickstart/#errors-and-exceptions
Basically,
try to catch the exception wherever the error is raised in your code.
Exceptions:
In the event of a network problem (e.g. DNS failure, refused connection, etc),
Requests will raise a **ConnectionError** exception.
In the event of the rare invalid HTTP response,
Requests will raise an **HTTPError** exception.
If a request times out, a **Timeout** exception is raised.
If a request exceeds the configured number of maximum redirections,
a **TooManyRedirects** exception is raised.
All exceptions that Requests explicitly raises inherit
from **requests.exceptions.RequestException.**
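For example, a sketch using the question's names (note that the http.client.IncompleteRead from your traceback is not a requests exception, so it still needs its own except clause, as in the other answer):

import requests

try:
    response = requests.get(url, auth=authCredentials, timeout=20)
    response.raise_for_status()
except requests.exceptions.RequestException as error:
    # covers ConnectionError, HTTPError, Timeout, TooManyRedirects, ...
    response = None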
Hope that helped.
I've got a list of 100 websites in CSV format. All of the sites have the same general format, including a large table with 7 columns. I wrote this script to extract the data from the 7th column of each of the websites and then write this data to file. The script below only partially works: opening the output file (after running the script) shows that something is being skipped, because it only contains 98 rows (and the script clearly also raises some exceptions). Guidance on how to implement exception handling in this context would be much appreciated. Thank you!
import csv, urllib2, re

def replace(variab): return variab.replace(",", " ")

urls = csv.reader(open('input100.txt', 'rb'))   # access list of 100 URLs

for url in urls:
    html = urllib2.urlopen(url[0]).read()       # get HTML starting with the first URL
    col7 = re.findall('td7.*?td', html)         # use regex to get data from column 7
    string = str(col7)                          # stringify data
    neat = re.findall('div3.*?div', string)     # use regex to get target text
    result = map(replace, neat)                 # apply function to remove ','s from elements
    string2 = ", ".join(result)                 # separate list elements with ', ' for export to csv
    output = open('output.csv', 'ab')           # open file for writing
    output.write(string2 + '\n')                # append output to file and create new line
    output.close()
This returns:
Traceback (most recent call last):
File "C:\Python26\supertest3.py", line 6, in <module>
html = urllib2.urlopen(url[0]).read()
File "C:\Python26\lib\urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python26\lib\urllib2.py", line 383, in open
response = self._open(req, data)
File "C:\Python26\lib\urllib2.py", line 401, in _open
'_open', req)
File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
result = func(*args)
File "C:\Python26\lib\urllib2.py", line 1130, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Python26\lib\urllib2.py", line 1103, in do_open
r = h.getresponse()
File "C:\Python26\lib\httplib.py", line 950, in getresponse
response.begin()
File "C:\Python26\lib\httplib.py", line 390, in begin
version, status, reason = self._read_status()
File "C:\Python26\lib\httplib.py", line 354, in _read_status
raise BadStatusLine(line)
BadStatusLine
Make the body of your for loop into:
import sys

for url in urls:
    try:
        ...the body you have now...
    except Exception, e:
        print>>sys.stderr, "Url %r not processed: error (%s)" % (url, e)
(Or, use logging.error instead of the goofy print>>, if you're already using the logging module of the standard library [and you should;-)]).
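For completeness, a sketch of the whole loop with logging instead of print>> (untested; the regexes and filenames come from your script, the log filename is just an assumption, and the output file is opened once rather than once per URL):

import csv
import logging
import re
import urllib2

logging.basicConfig(filename='errors.log', level=logging.ERROR)

urls = csv.reader(open('input100.txt', 'rb'))
output = open('output.csv', 'ab')

for url in urls:
    try:
        html = urllib2.urlopen(url[0]).read()
        col7 = re.findall('td7.*?td', html)
        neat = re.findall('div3.*?div', str(col7))
        result = [item.replace(",", " ") for item in neat]
        output.write(", ".join(result) + '\n')
    except Exception, e:
        # log the failed URL and keep going instead of crashing the whole run
        logging.error("URL %r not processed: %s", url, e)

output.close()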
I'd recommend reading the Errors and Exceptions Python documentation, especially section 8.3 -- Handling Exceptions.