I am writing a program which uses Tweepy to get data from Twitter. Tweepy uses another thread, and on occasion this thread throws an exception. However, my error catching logic does not catch the exceptions because they occur in a different thread. Is there any way to catch exceptions that are thrown by other threads without changing the thread's code?
To clarify, I needed to use the extra thread option in Tweepy so that the stream wouldn't block the rest of my program from executing. I get occasional updates from a database regarding which Twitter accounts to track, and the only way I was able to do this while streaming was to stream on a separate thread.
while 1:
    # Create twitter stream
    try:
        # Reconnect to the stream if it was disconnected (or at start)
        if reconnect:
            reconnect = False
            # NEW THREAD CREATED HERE
            tweet_stream.filter(follow=twitter_uids, async=True)
        # Sleep for sleep_interval before checking for new usernames
        time.sleep(sleep_interval)
        users_update = get_user_names(twitter_usernames)
        # Restart the stream if new users were found in DB
        if len(users_update) != 0:
            # Disconnect and set flag for stream to be restarted with new usernames
            twitter_usernames = users_update
            twitter_uids = get_twitter_uids(users_update)
            reconnect = True
            tweet_stream.disconnect()
            tweet_stream._thread.join()
    except Exception as e:
        # ERROR HANDLING CODE
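For what it's worth, one generic way to observe exceptions raised in threads you don't control is the interpreter-wide threading.excepthook (Python 3.8+). It fires for any uncaught exception that escapes a thread's run(), so it only helps if the library doesn't swallow the error itself. A minimal sketch:

import threading

def handle_thread_exception(args):
    # Called for any uncaught exception in any thread (Python 3.8+);
    # args carries exc_type, exc_value, exc_traceback and thread.
    print(f"Uncaught exception in {args.thread.name}: {args.exc_value}")
    # e.g. set a flag here so the main loop knows to reconnect

threading.excepthook = handle_thread_exception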
How should the transaction requires abort case be handled for the Transactional Producer API?
According to the documentation, if while trying to commit a transaction, an error occurs where a transaction should be aborted (possibly due to rebalancing), the steps to be taken should be:
producer --> abort transaction
producer --> begin transaction
rewind consumer offsets
Aborting and beginning a new transaction are straightforward. However, how, and how far, should the consumer offsets be rewound? What would this look like for the Python client? In the case of a single message being consumed at a time, should only that message simply be reprocessed?
As a reference, the code example that I'm referring to is:
while True:
    try:
        producer.commit_transaction(10.0)
        break
    except KafkaException as e:
        if e.args[0].retriable():
            # retriable error, try again
            continue
        # **************Relevant to the question**************
        elif e.args[0].txn_requires_abort():
            # abort current transaction, begin a new transaction,
            # and rewind the consumer to start over.
            producer.abort_transaction()
            producer.begin_transaction()
            rewind_consumer_offsets...()
        # **************
        else:
            # treat all other errors as fatal
            raise
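The docs leave rewind_consumer_offsets...() unspecified. One plausible implementation, sketched below as a hypothetical helper for the confluent-kafka client, is to seek every assigned partition back to its last committed offset. If offsets are only ever committed through producer.send_offsets_to_transaction(), that re-reads exactly the input consumed since the last committed transaction, so with one message consumed at a time it does amount to reprocessing just that message:

from confluent_kafka import OFFSET_BEGINNING

def rewind_consumer_offsets(consumer):
    # Hypothetical helper: look up the last committed offset for each
    # assigned partition and seek back to it.
    committed = consumer.committed(consumer.assignment(), timeout=10.0)
    for tp in committed:
        if tp.offset < 0:
            # Nothing committed yet for this partition: start from the beginning
            tp.offset = OFFSET_BEGINNING
        consumer.seek(tp)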
Firestore listeners will randomly close after some length of time (possibly due to inactivity), and in Python there is no easy way of catching the errors they throw, because they are thrown in a separate thread. For my case, I want to maintain a long-lasting listener that never closes due to inactivity or a server-side error.
I've tried wrapping everything in a try/except, and then wrapping all of that in a while(True) loop, but that doesn't catch the error because the error is thrown in a separate thread.
The error occurs after 10 minutes to 24 hours of inactivity (I'm not sure inactivity is the cause; it could be random, but the shortest interval I ever found was 10 minutes after starting it) on both Linux and Windows devices. I haven't tried Mac or any other devices, but I doubt it's device-specific.
Looking at the gRPC spec (the transport listeners use to communicate between client and server), there is no default timeout for the Python API (and a timeout wouldn't explain why it disconnects after different amounts of time), and no timeout is set anywhere in Firestore's listener code.
The specific error that occurs is:
google.api_core.exceptions.InternalServerError: 500 Received RST_STREAM with error code 0
and sometimes
google.api_core.exceptions.InternalServerError: 500 Received RST_STREAM with error code 2
Minimal code to show the problem (left running on a dummy collection called info that only has one document in it for a while):
class TestWatchInfo():
    def __init__(self):
        self.query_watch = db.collection(u'info').on_snapshot(self.on_snapshot)

    def on_snapshot(self, col_snapshot, changes, read_time):
        try:
            for change in changes:
                pass
        except Exception as err:
            print(err)
            print("Error occurred at " + str(time.ctime()))
            traceback.print_exc()

if __name__ == '__main__':
    try:
        test_object = TestWatchInfo()
        while(True):
            time.sleep(60)
    except Exception as err:
        print(err)
        print("Error occurred at " + str(time.ctime()))
        traceback.print_exc()
Ideally, I would be able to catch the actual error in the main Python thread, but as far as I can tell, since I am not the one spawning the threads, I have no way of adding thread/gRPC-specific code to catch that error. Alternatively, I would like to be able to auto-restart the gRPC connection after it gets closed on the server side.
In actuality, the Firestore listener just raises an error in the thread it created and closes the listener.
I figured out an alternative method for detecting the listener error and restarting the listener after a server-side close. I have no idea how to catch the actual error, but I figured out how to detect when Firestore just randomly closes the listener connection.
In the Firebase listener code they keep track of a private variable '_closed' that becomes true if the connection ever gets closed, for any reason. Therefore, if we periodically check it, we can restart our listener and be on our merry way.
Using the code from before, I added a new method start_snapshot in order to restart the failed listener on error, and in my long-running code I added a check against the listener to see if it is closed, restarting it if it is.
class TestWatchInfo():
    def __init__(self):
        self.start_snapshot()

    def start_snapshot(self):
        self.query_watch = db.collection(u'info').on_snapshot(self.on_snapshot)

    def on_snapshot(self, col_snapshot, changes, read_time):
        try:
            for change in changes:
                pass
        except Exception as err:
            print(err)
            print("Error occurred at " + str(time.ctime()))
            traceback.print_exc()

if __name__ == '__main__':
    try:
        test_object = TestWatchInfo()
        while(True):
            if test_object.query_watch._closed:
                test_object.start_snapshot()
            # code here
    except Exception as err:
        print(err)
        print("Error occurred at " + str(time.ctime()))
        traceback.print_exc()
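One caveat: as written, that while(True) loop spins at full speed between checks. In practice you would probably sleep between polls of the _closed flag (the 60-second interval below is an arbitrary choice):

while True:
    if test_object.query_watch._closed:
        test_object.start_snapshot()
    time.sleep(60)  # poll the _closed flag once a minute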
I am trying to use Python to download a batch of files, and I use the requests module with streaming turned on; in other words, I retrieve each file in 200K blocks.
However, sometimes the download just gets stuck (no response) with no error raised. I guess this is because the connection between my computer and the server is not stable enough. My question is: how can I detect this kind of stall and make a new connection?
You probably don't want to detect this from outside when you can just use timeouts to have requests fail, instead of hanging, if the server stops sending bytes.
Since you didn't show us your code, it's hard to show you how to change it… but I'll show you how to change some other code:
# hanging
text = requests.get(url).text

# not hanging
try:
    text = requests.get(url, timeout=10.0).text
except requests.exceptions.Timeout:
    pass  # failed, do something else

# trying until success
while True:
    try:
        text = requests.get(url, timeout=10.0).text
        break
    except requests.exceptions.Timeout:
        pass
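Since the question is about a streamed download in 200K blocks, the same idea applies there. Note that requests' timeout is a connect/per-read limit rather than a whole-download deadline, which is exactly what you want for stall detection. A sketch, with hypothetical url/path arguments:

import requests

def fetch(url, path):
    # timeout=(connect, read): each socket read must deliver within 10s
    with requests.get(url, stream=True, timeout=(5.0, 10.0)) as resp:
        resp.raise_for_status()
        with open(path, 'wb') as f:
            for block in resp.iter_content(chunk_size=200 * 1024):
                f.write(block)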
If you do want to detect it from outside for some reason, you'll need to use multiprocessing or similar to move the requests-driven code to a child process. Ideally you'll want it to post updates on some Queue (or set and notify some Condition-protected shared flag) every 200KB, so the main process can block on the Queue (or Condition) and kill the child process if it times out. For example:
import functools
import multiprocessing
import queue
import requests

def _download(url, q):
    # Stream the response, posting each 200KB block to the queue
    resp = requests.get(url, stream=True)
    for block in resp.iter_content(chunk_size=200 * 1024):
        q.put(block)
    q.put(None)  # sentinel: download complete

def download(url):
    q = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_download, args=(url, q))
    proc.start()
    try:
        # Collect blocks until the sentinel; give up if none arrives in 10s
        return b''.join(iter(functools.partial(q.get, timeout=10.0), None))
    except queue.Empty:
        proc.terminate()  # failed, do something else
    finally:
        proc.join()
I'm working on a research project that involves analyzing large amounts of data from Twitter. The project is being built in Python using Tweepy. As you might imagine I have to work very closely within the confines of the Twitter rate limiter. As such, my authentication code looks like this.
auth1 = tweepy.OAuthHandler("...", "...")
auth1.set_access_token("...", "...")
api1 = tweepy.API(auth1, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
This does a wonderful job of stopping and waiting before I trip my request limit on a small scaled-down run. However, when I try to run the program on my full data set, I eventually get this error while the program is sleeping:
tweepy.error.TweepError: Failed to send request: ('Connection aborted.', error(104, 'Connection reset by peer'))
My research tells me that this is happening because Twitter is disconnecting and I need to catch the error. How would I catch this error, reconnect and have my program pick up where it left off? Any advice would be welcome.
The Twitter disconnection errors are socket exceptions, which are a special case of IOError. To catch them you need to do something like:
auth = tweepy.OAuthHandler(…)  # set up your oauth here
try:
    stream = tweepy.Stream(auth=auth, listener=SomeListener())  # start the stream
except IOError as ex:
    print('I just caught the exception: %s' % ex)
If that works, wrap it in a while True loop with an increasing backoff, so as to provide some pause between reconnection attempts.
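A minimal sketch of that retry loop, reusing the auth and SomeListener names from above (the track list and backoff cap are placeholder choices):

import time
import tweepy

backoff = 1  # seconds to wait after a failed connection
while True:
    try:
        stream = tweepy.Stream(auth=auth, listener=SomeListener())
        stream.filter(track=['keyword'])  # blocks until disconnect or error
        backoff = 1  # the connection worked: reset the backoff
    except IOError as ex:
        print('I just caught the exception: %s' % ex)
        time.sleep(backoff)
        backoff = min(backoff * 2, 320)  # double the pause, capped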
I also tried wrapping Tweepy calls inside a while True loop in the same way, but I got issues with reconnections too (in some cases this solution does not solve the problem). Instead, I thought of switching the Auth (connected to the Tweepy API instance, here "twapi") in case of error, and that seems to work properly:
...
while True:
    try:
        users_stream = twapi.lookup_users(screen_names=[scrname_list_here])
    except tweepy.error.TweepError as ex:
        time.sleep(120)
        global twapi
        global switch_auth
        if switch_auth == False:
            twapi = tweepy.API(auths[auth_id + 1])
            switch_auth = True
        elif switch_auth == True:
            twapi = tweepy.API(auths[auth_id])
            switch_auth = False
        continue
    break
...
By using a bool variable switch_auth, it is possible (in case the Tweepy error related to a failed reconnection arises) to 'switch' the auth input of the Tweepy API instance (the handlers can be assumed to be stored in the auths list) and solve the problem.
The same technique can be used to 'switch' Auth when the search rate limit is reached. I hope it will be useful, just try!
I'm trying to figure out how to properly close an asynchronous tweepy stream.
The tweepy streaming module can be found here.
I start the stream like this:
stream = Stream(auth, listener)
stream.filter(track=['keyword'], async=True)
When closing the application, I try to close the stream as simple as:
stream.disconnect()
This method seems to work as intended, but it has one problem:
the stream thread is still in the middle of its loop (waiting for/handling tweets) and is not killed until the next iteration, so when the stream receives a tweet even after the app has closed, it still tries to call the listener object (this can be seen with a simple print statement on the listener object). I'm not sure if this is a bad thing or if it can simply be ignored.
I have 2 questions:
Is this the best way to close the stream or should I take a different approach?
Shouldn't the async thread be created as a daemon thread?
I had the same problem. I fixed it by restarting the script, since a Tweepy Stream doesn't stop until the next incoming tweet.
Example:
import sys
import os
import time

python = sys.executable
time.sleep(10)
print("restart")
os.execl(python, python, *sys.argv)
I didn't find another solution.
I am not positive that it applies to your situation, but in general you can have applicable entities clean up after themselves by putting them in a with block:
with Stream(auth, listener) as stream:
    stream.filter(track=['keyword'], async=True)
    # ...
# Outside the with-block; stream is automatically disposed of.
What "disposed of" actually means, it that the entities __exit__ function is called.
Presumably tweepy will have overridden that to Do The Right Thing.
As #VooDooNOFX suggests, you can check the source to be sure.
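If it turns out Stream doesn't define __enter__/__exit__ itself, a small hand-rolled context manager gives the same guarantee; a sketch (is_async is the newer spelling of the old async flag in tweepy 3.7+):

from contextlib import contextmanager

@contextmanager
def managed_stream(auth, listener):
    # Ensure disconnect() runs even if the body of the with-block raises
    stream = Stream(auth, listener)
    try:
        yield stream
    finally:
        stream.disconnect()

with managed_stream(auth, listener) as stream:
    stream.filter(track=['keyword'], is_async=True)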
This is by design. Looking at the source, you will notice that disconnect has no immediate termination option.
def disconnect(self):
    if self.running is False:
        return
    self.running = False
Calling disconnect() simply sets self.running = False, which is then checked on the next iteration of the _run method's loop.
You can ignore this side effect.
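If you do need to block until the loop has actually exited, you can join the stream's private _thread after disconnecting, just as the first snippet on this page does; a sketch:

stream.disconnect()  # sets running = False; the loop notices on its next pass
if getattr(stream, '_thread', None) is not None:
    stream._thread.join()  # wait for _run to actually return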
Instead of restarting the script, as #burkay suggests, I finally deleted the Stream object and started a new one. In my example, someone wants to add a new user to be followed, so I update the track list this way:
stream.disconnect()  # that should wait until the next tweet, so let's delete it
del stream

# now, create a new object
stream = tweepy.Stream(auth=api.auth, listener=listener)
stream.userstream(track=all_users(), async=True)