What's the best way to use the Twitter Stream API with Python to collect tweets in a large area?
I'm interested in geolocation, particularly the nationwide collection of tweets in North America. I'm currently using Python and Tweepy to dump tweets from the Twitter streaming API into a MongoDB database.
I'm using the API's locations filter to pull tweets within a bounding box, and then I filter further to store only tweets with coordinates. I've found that if my bounding box is large enough, I run into a Python connection error:
raise ProtocolError('Connection broken: %r' % e, e)
requests.packages.urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
I've made the bounding box smaller (I've successfully tried NYC and NYC + New England), but it seems the error returns once the bounding box is large enough. I've also tried threading, with the intention of running multiple StreamListeners concurrently, but I don't think the API allows this (I'm getting 420 errors), or at least not in the manner that I'm attempting.
I'm using Tweepy to set up a custom StreamListener class:
class MyListener(StreamListener):
    """Custom StreamListener for streaming data."""

    # def __init__(self):

    def on_data(self, data):
        try:
            db = pymongo.MongoClient(config.db_uri).twitter
            col = db.tweets
            decoded_json = json.loads(data)
            geo = str(decoded_json['coordinates'])
            user = decoded_json['user']['screen_name']
            if geo != "None":
                col.insert(decoded_json)
                print("Geolocated tweet saved from user %s" % user)
            else:
                print("No geo data from user %s" % user)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            time.sleep(5)
            return True

    def on_error(self, status):
        print(status)
        return True
This is what my Thread class looks like:
class myThread(threading.Thread):
    def __init__(self, threadID, name, streamFilter):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.streamFilter = streamFilter

    def run(self):
        print("Starting " + self.name)
        #twitter_stream.filter(locations=self.streamFilter)
        Stream(auth, MyListener()).filter(locations=self.streamFilter)
And main:
if __name__ == '__main__':
    auth = OAuthHandler(config.consumer_key, config.consumer_secret)
    auth.set_access_token(config.access_token, config.access_secret)
    api = tweepy.API(auth)
    twitter_stream = Stream(auth, MyListener())

    # Bounding boxes:
    northeast = [-78.44,40.88,-66.97,47.64]
    texas = [-107.31,25.68,-93.25,36.7]
    california = [-124.63,32.44,-113.47,42.2]

    northeastThread = myThread(1,"ne-thread", northeast)
    texasThread = myThread(2,"texas-thread", texas)
    caliThread = myThread(3,"cali-thread", california)

    northeastThread.start()
    time.sleep(5)
    texasThread.start()
    time.sleep(10)
    caliThread.start()
There is nothing bad or unusual about getting a ProtocolError. Connections do break from time to time. You should catch this error in your code and simply restart the stream. All will be good.
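A minimal sketch of the "catch the error and restart" advice, written as a generic helper so it does not depend on a particular Tweepy version. The name `run_with_reconnect`, its parameters, and the exponential backoff are illustrative assumptions, not something from the original answer:

```python
import time


def run_with_reconnect(start_stream, retry_on=(ConnectionError,),
                       max_retries=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call start_stream() and restart it when the connection breaks.

    Returns the number of restarts that were needed. Re-raises the last
    error once max_retries restarts have already been used up.
    """
    restarts = 0
    while True:
        try:
            start_stream()  # blocks until the stream ends or the connection breaks
            return restarts
        except retry_on:
            if restarts >= max_retries:
                raise
            # Back off exponentially so a flapping connection is not hammered.
            delay = min(base_delay * 2 ** restarts, max_delay)
            sleep(delay)
            restarts += 1
```

With the code from the question, this could be called roughly as `run_with_reconnect(lambda: Stream(auth, MyListener()).filter(locations=northeast), retry_on=(ProtocolError,))`, with `ProtocolError` imported from `urllib3.exceptions`. Backing off between attempts matters here because reconnecting too aggressively is one way to run into 420 responses.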
BTW, I noticed you are interrogating the geo field which has been deprecated. The field you want is coordinates. You might also find places useful.
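As a rough illustration of reading `coordinates` and falling back to the place, here is a sketch that works on the decoded JSON dict. It assumes the v1.1 payload layout, where `coordinates` is a GeoJSON point in `[longitude, latitude]` order and `place.bounding_box.coordinates` holds one ring of corner points; the helper name `extract_point` is made up for this example:

```python
def extract_point(tweet):
    """Return (longitude, latitude) for a decoded tweet dict, or None.

    Prefers the exact point in `coordinates`; falls back to the centre of
    the `place` bounding box when only a place is attached.
    """
    coords = tweet.get("coordinates")
    if coords and coords.get("type") == "Point":
        lon, lat = coords["coordinates"]
        return (lon, lat)

    place = tweet.get("place") or {}
    box = (place.get("bounding_box") or {}).get("coordinates")
    if box:
        corners = box[0]  # outer ring: a list of [lon, lat] pairs
        lons = [corner[0] for corner in corners]
        lats = [corner[1] for corner in corners]
        return ((min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2)
    return None
```

A place-derived centre is much coarser than an exact point, so it may be worth storing which of the two sources was used.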
(The Twitter API docs say multiple streaming connections are not allowed.)
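One way to stay within a single connection is to pass several bounding boxes in one `locations` list instead of opening one stream per box. The sketch below only builds and validates that flat list; the helper name `combine_boxes` and the limit of 25 boxes per connection are assumptions to check against the current API documentation:

```python
MAX_BOXES = 25  # assumed per-connection limit; verify against the API docs


def combine_boxes(*boxes):
    """Flatten [west, south, east, north] boxes into one locations list."""
    if len(boxes) > MAX_BOXES:
        raise ValueError("too many bounding boxes for one connection")
    flat = []
    for box in boxes:
        if len(box) != 4:
            raise ValueError("each box needs exactly four numbers: %r" % (box,))
        west, south, east, north = box
        # The south-west corner comes first, then the north-east corner.
        if west >= east or south >= north:
            raise ValueError("south-west corner must come first: %r" % (box,))
        flat.extend(box)
    return flat


northeast = [-78.44, 40.88, -66.97, 47.64]
texas = [-107.31, 25.68, -93.25, 36.7]
california = [-124.63, 32.44, -113.47, 42.2]
locations = combine_boxes(northeast, texas, california)
```

The resulting list could then be passed to a single `Stream(auth, MyListener()).filter(locations=locations)` call, which replaces the three threads in the question.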
It seems Twitter allocates a single block of tweets when you search for a keyword across a large area (say, a country or a city). I think this can be overcome by running multiple streams concurrently, but as separate programs.
Related
#!/usr/bin/env python
# twitterbots/bots/favretweet.py

import tweepy
import logging
from config import create_api
import seacret

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

#stream = tweepy.Stream(seacret.KEY, seacret.SECRET, seacret.TOKEN, seacret.TOKEN_SECRET)

class FavRetweetListener(tweepy.Stream):
    def __init__(self, api):
        self.api = api
        self.user = api.get_user(screen_name='MyGasAndEnergy1')

    def on_status(self, tweet):
        logger.info(f"Prosessing tweet id {tweet.id}")
        if tweet.in_reply_to_status_id is not None or tweet.user.id == self.user.user_id:
            return
        if not tweet.favorite:
            try:
                tweet.favorite()
            except Exception as e:
                logger.error("Error on Fav", exc_info=True)
        if not tweet.retweeted:
            try:
                tweet.retweet()
            except Exception as e:
                logger.error("Error on vav and retweet", exc_info=True)

    def on_error(self, status):
        logger.error(status)

def main(keywords):
    api = create_api()
    tweets_listener = FavRetweetListener(api)
    #new way to auth
    stream = tweepy.Stream(seacret.KEY, seacret.SECRET, seacret.TOKEN, seacret.TOKEN_SECRET)
    #old way to auto + important tweets_listener for actions
    stream = tweepy.Stream(api.auth, tweets_listener)
    stream.filter(track=keywords, languages=["en"])

if __name__ == "__main__":
    main(["Python", "Tweepy"])
I'm adapting some older code for my own use, but I can't figure out this part. The code is supposed to favorite and retweet tweets on Twitter when it finds a suitable keyword.
New code needs:
stream = tweepy.Stream(seacret.KEY, seacret.SECRET, seacret.TOKEN, seacret.TOKEN_SECRET)
Old code needs:
tweets_listener = FavRetweetListener(api)
stream = tweepy.Stream(api.auth, tweets_listener)
But the new Tweepy doesn't work with the older api.auth method. It wants all the secret tokens passed to tweepy.Stream() directly, which means I can't launch the rest of my code via tweets_listener, because tweepy.Stream() won't accept anything more.
How can I continue? I haven't found a solution by googling, and I'm not sure how to ask the right questions to move forward with this problem.
Tweepy is a Python package for working with the Twitter API. This script is originally from realpython.com. The problem is that I don't want to downgrade Tweepy.
So I need to keep FavRetweetListener, but I don't know how to refactor the code.
I switched to tweepy.Cursor and got it working. Thanks to all. Better question next time.
https://docs.tweepy.org/en/stable/v1_pagination.html#tweepy.Cursor
I'm looking for the fastest way to check in real time whether a specific user (TwitterID) has tweeted. To achieve this I have used Tweepy and its stream function, which notifies me of a new tweet after roughly 5 seconds. Is there a faster way to check whether someone has tweeted, either by using another library or raw requests, or by optimizing the code?
Thanks in advance.
import tweepy

TwitterID = "148137271"

class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        self.me = api.me()

    def on_status(self, tweet):
        #Filter if ID has tweeted
        if tweet.user.id_str == TwitterID:
            print("Tweeted:", tweet.text)

    def on_error(self, status):
        print("Error detected")
        print(status)

# Authenticate to Twitter
auth = tweepy.OAuthHandler("x")
auth.set_access_token("Y",
                      "Z")

# Create API object
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

tweets_listener = MyStreamListener(api)
stream = tweepy.Stream(api.auth, tweets_listener)
stream.filter([TwitterID])
I'd say around 5 seconds is a reasonable latency, given that your program is not running on the same server as Twitter's core systems. You're subject to network and API latency and those things are outside of your control. There's no real way to rewrite this logic to change the time between a Tweet being posted and it reaching the API. If you think about the internal stuff going on inside Twitter itself from a Tweet being posted and it being fanned out to potentially millions of followers, the fact that the API - AT THE END OF AN UNKNOWN NETWORK CONNECTION - gets the Tweet data inside of < 5 seconds is pretty crazy in itself.
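To see how much of that delay is actually delivery latency, one option is to compare each tweet's `created_at` timestamp with the local receive time. This is only a sketch: it assumes the raw v1.1 `created_at` string format (for example `Wed Oct 10 20:19:24 +0000 2018`), an accurate local clock, and a helper name, `delivery_latency_seconds`, invented for this example. Because `created_at` has one-second resolution, the result is approximate.

```python
from datetime import datetime, timezone

# Format of the raw `created_at` string in a v1.1 tweet payload (assumed).
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def delivery_latency_seconds(created_at, received_at=None):
    """Seconds between a tweet's creation time and when it was received."""
    posted = datetime.strptime(created_at, CREATED_AT_FORMAT)
    if received_at is None:
        received_at = datetime.now(timezone.utc)
    return (received_at - posted).total_seconds()
```

Inside `on_status`, this could be called with the raw string from `tweet._json["created_at"]` to log the latency per tweet.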
on_direct_message is never called when a direct message arrives. I'm using Python 3.7 and the latest Tweepy library. Direct messages work fine in the Twitter account itself, but not with the code snippet I'm using. The same snippet works well for listening to tweets.
twitter_stream=Stream(auth,StdOutListener())
print('Stream created...')
twitter_stream.filter(follow=[user.id_str], is_async=True)
NB: The app has read, write and direct message permissions, and all access parameters are correct.
The StdOutListener is:
class StdOutListener( StreamListener ):
    def __init__( self ):
        self.tweetCount = 0

    def on_connect( self ):
        print("Connection established!!")

    def on_disconnect( self, notice ):
        print("Connection lost!! : ", notice)

    def on_data( self, status ):
        print("Entered on_data()")
        print(status, flush = True)
        return True

    def on_direct_message( self, status ):
        print("Entered on_direct_message()")
        try:
            print(status, flush = True)
            return True
        except BaseException as e:
            print("Failed on_direct_message()", str(e))

    def on_error( self, status ):
        print(status)
Direct Messages are not supported in the Twitter streaming API (they were a part of the User Streams API, which was removed in 2018 and replaced with the Account Activity API).
To receive Direct Messages in realtime, you will need to implement a webhook handler for the Account Activity API. You could try the twitivity library, or look at this Python sample app. Tweepy does not have built-in support for this API.
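One piece every such webhook handler needs is the response to Twitter's challenge-response check (CRC), which is sent as a GET request carrying a `crc_token`. The sketch below covers only that computation, as I understand it from the Account Activity API documentation: an HMAC-SHA256 of the token keyed with the app's consumer secret, base64-encoded and prefixed with `sha256=`. The function name `crc_response` is an assumption, and the web framework wiring is left out.

```python
import base64
import hashlib
import hmac


def crc_response(crc_token, consumer_secret):
    """Build the JSON-serialisable body that answers a CRC request."""
    digest = hmac.new(
        consumer_secret.encode("utf-8"),
        crc_token.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    token = base64.b64encode(digest).decode("ascii")
    return {"response_token": "sha256=" + token}
```

The handler would return this dict as JSON from the same URL that later receives the POSTed direct message events.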
I'm using the streaming API to follow a specific user id, and I'm able to stream without any issues. However, when I compare all the tweets streamed in one day with the ones collected through the REST API, it seems that the streaming API missed some retweets, i.e. tweets from the user id that somebody else retweeted.
I would've expected tweets from the rest API to be missing due to deleted content, but I can't understand why there would be missing tweets from the streaming.
I checked and I'm not hitting the rate limit (all tweets collected throughout the day are less than 200), the connection wasn't interrupted, I tried different days, and it is always around 25% missing retweets. No other types of tweets are missing.
Any help is much appreciated!!
class StreamListener(tweepy.StreamListener):
    def __init__(self, output_file=sys.stdout):
        super(StreamListener,self).__init__()

    def on_status(self, status):
        with open('tweets.json', 'a') as tf:
            json.dump(status._json, tf)
            tf.write('\n')
        print(status.text)

    def on_error(self, status_code):
        if status_code == 420:
            return False

stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(follow=<id>)
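For reference, the comparison described above can be reproduced with something like the following sketch. It assumes the stream output is the one-JSON-object-per-line file written by `on_status`, and that the REST results are available as a list of decoded tweet dicts; the function names are made up for this example.

```python
import json


def load_ids(lines):
    """Collect the id_str of every tweet in an iterable of JSON lines."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line:
            ids.add(json.loads(line)["id_str"])
    return ids


def missing_from_stream(stream_lines, rest_tweets):
    """Return (all tweets missing from the stream, the retweets among them)."""
    seen = load_ids(stream_lines)
    missing = [t for t in rest_tweets if t["id_str"] not in seen]
    retweets = [t for t in missing if "retweeted_status" in t]
    return missing, retweets
```

Calling `missing_from_stream(open('tweets.json'), rest_tweets)` then shows exactly which ids are absent and whether they are all retweets.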
I am implementing a Twitter bot for fun purposes using Tweepy.
What I am trying to code is a bot that tracks a certain keyword and replies with a given string to each user who tweeted it.
I considered storing the Twitter stream in a .json file and looping over the Tweet objects for every user, but that seems impractical, as receiving the stream locks the program in a loop.
So, how can I track tweets with Twitter's Streaming API based on a certain keyword and reply to the users who tweeted it?
Current code:
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener

class MyListener(StreamListener):
    def on_data(self, data):
        try:
            with open("caguei.json", 'a+') as f:
                f.write(data)
                data = f.readline()
                tweet = json.loads(data)
                text = str("#%s acabou de. %s " % (tweet['user']['screen_name'], random.choice(exp)))
                tweepy.API.update_status(status=text, in_reply_to_status_id=tweet['user']['id'])
                #time.sleep(300)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        print(status)
        return True

api = tweepy.API(auth)
twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['dengue']) #Executing it the program locks on a loop
Tweepy's StreamListener class allows you to override its on_data method. That's where you should put your logic.
As per the code
class StreamListener(object):
    ...
    def on_data(self, raw_data):
        """Called when raw data is received from connection.
        Override this method if you wish to manually handle
        the stream data. Return False to stop stream and close connection.
        """
        ...
So in your listener, you can override this method and do your custom logic.
class MyListener(StreamListener):
    def on_data(self, data):
        do_whatever_with_data(data)
You can also override several other methods (on_direct_message, etc) and I encourage you to take a look at the code of StreamListener.
Update
Okay, you can do what you intend to do with the following:
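import json
import random

class MyListener(StreamListener):
    def __init__(self, *args, **kwargs):
        # Create the listener as MyListener(api) so that StreamListener
        # stores the authenticated API instance as self.api.
        super(MyListener, self).__init__(*args, **kwargs)
        self.file = open("whatever.json", "a+")

    def _persist_to_file(self, data):
        try:
            self.file.write(data)
        except BaseException:
            pass

    def on_data(self, data):
        try:
            tweet = json.loads(data)
            # exp is the list of reply phrases from the question's code.
            # The reply text mentions the author with "@" so that Twitter
            # treats the new status as a reply.
            text = str("@%s acabou de. %s " % (tweet['user']['screen_name'], random.choice(exp)))
            # Call update_status on the API instance (not the tweepy.API class)
            # and reply to the tweet's own id, not the author's user id.
            self.api.update_status(status=text, in_reply_to_status_id=tweet['id'])
            self._persist_to_file(data)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            return True

    def on_error(self, status):
        print(status)
        return True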
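Keeping the reply-building step separate from the listener makes it easier to test without a live stream. The sketch below is one way to do that; the helper name `build_reply` and the choice to skip messages without a `user` field (the stream also delivers non-tweet messages such as limit and delete notices) are assumptions of this example, not part of the answer above.

```python
import json
import random


def build_reply(raw_data, phrases, rng=random):
    """Return keyword arguments for a reply, or None for non-tweet messages."""
    message = json.loads(raw_data)
    # Limit notices, delete notices and similar messages carry no author.
    if "user" not in message or "id" not in message:
        return None
    text = "@%s %s" % (message["user"]["screen_name"], rng.choice(phrases))
    # Reply to the tweet itself, so use the tweet id rather than the user id.
    return {"status": text, "in_reply_to_status_id": message["id"]}
```

Inside `on_data`, the result can be passed to an authenticated API instance, for example `params = build_reply(data, exp)` followed by `api.update_status(**params)` when `params` is not None.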