I'm using the streaming API to follow a specific user ID, and I'm able to stream without any issues. However, when I compare all the tweets streamed in one day to the ones collected with the REST API, it seems the streaming API missed some retweets, i.e. tweets from that user ID which somebody else retweeted.
I would've expected tweets from the REST API to be missing due to deleted content, but I can't understand why there would be tweets missing from the stream.
I checked and I'm not hitting the rate limit (fewer than 200 tweets are collected over the whole day), the connection wasn't interrupted, and I tried different days; it's consistently around 25% of the retweets missing. No other types of tweets are missing.
Any help is much appreciated!!
import sys
import json
import tweepy

class StreamListener(tweepy.StreamListener):
    def __init__(self, output_file=sys.stdout):
        super(StreamListener, self).__init__()

    def on_status(self, status):
        with open('tweets.json', 'a') as tf:
            json.dump(status._json, tf)
            tf.write('\n')
        print(status.text)

    def on_error(self, status_code):
        if status_code == 420:
            return False

stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(follow=[<id>])
I'm looking for the fastest way to check in real time whether a specific user (TwitterID) has tweeted. To do this I've used Tweepy and the streaming API, which notifies me of a new tweet within roughly 5 seconds. Is there a faster way to detect that someone has tweeted, using another library, raw requests, or code optimisation?
Thanks in advance.
import tweepy

TwitterID = "148137271"

class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        self.me = api.me()

    def on_status(self, tweet):
        # Filter if ID has tweeted
        if tweet.user.id_str == TwitterID:
            print("Tweeted:", tweet.text)

    def on_error(self, status):
        print("Error detected")
        print(status)

# Authenticate to Twitter (OAuthHandler takes both the consumer key and secret)
auth = tweepy.OAuthHandler("X", "X")
auth.set_access_token("Y", "Z")

# Create API object
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

tweets_listener = MyStreamListener(api)
stream = tweepy.Stream(api.auth, tweets_listener)
stream.filter(follow=[TwitterID])
I'd say around 5 seconds is reasonable latency, given that your program is not running on the same servers as Twitter's core systems. You're subject to network and API latency, and those things are outside of your control. There's no real way to rewrite this logic to reduce the time between a Tweet being posted and it reaching you through the API. If you think about what goes on inside Twitter itself between a Tweet being posted and it being fanned out to potentially millions of followers, the fact that the API, at the end of an unknown network connection, delivers the Tweet data in under 5 seconds is pretty impressive in itself.
I'm using tweepy 3.10 to make a retweet bot that currently runs from a stream. It can filter out my own retweets, but if anyone else retweets something I've retweeted, it crashes. What would be the best way to filter out items I've already retweeted?
I tried adding:
    if tweettext.startswith("rt #") == True:
        return
But that didn't end up filtering the way I hoped it would.
The current logic is:
def on_status(self, tweet):
    # This tweet is a reply or I'm its author, so ignore it
    if tweet.in_reply_to_status_id is not None or \
            tweet.user.id == self.me.id:
        return
    else:
        # Retweets based on the above logic
        tweet.retweet()
        print(f"{tweet.user.name}:{tweet.text}")
For anyone hitting this question in the future, the solution was quite simple. Wrap the retweet call in a try/except block to ignore the error that's raised when a tweet can't be retweeted (e.g. because it has already been retweeted):
def on_status(self, tweet):
    # This tweet is a reply or I'm its author, so ignore it
    if tweet.in_reply_to_status_id is not None or \
            tweet.user.id == self.me.id:
        return
    try:
        # Retweet; raises an exception if it's already been retweeted
        tweet.retweet()
    except Exception:
        return
    print(f"{tweet.user.name}:{tweet.text}")
What I want to do is live-stream Tweets from Twitter's API: just the hashtag 'Brexit', only in the English language, and for a specific number of Tweets (1k - 2k).
So far my code will live-stream the Tweets, but however I modify it I either end up with it ignoring the count and streaming indefinitely, or I get errors. If I change it to stream only a specific user's Tweets, the count works but it ignores the hashtag; if I stream everything for the given hashtag, it completely ignores the count. I've had a decent go at fixing it, but I'm quite inexperienced and have really hit a brick wall.
If I could get some help with how to tick all these boxes at the same time it would be much appreciated!
The code below will just stream 'Brexit' Tweets indefinitely, so it ignores the count=10.
The bottom of the code is a bit of a mess from me playing with it, apologies:
import numpy as np
import pandas as pd
import tweepy
from tweepy import API
from tweepy import Cursor
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import Twitter_Credentials
import matplotlib.pyplot as plt

# Twitter client - hash out to stream all
class TwitterClient:
    def __init__(self, twitter_user=None):
        self.auth = TwitterAuthenticator().authenticate_twitter_app()
        self.twitter_client = API(self.auth)
        self.twitter_user = twitter_user

    def get_twitter_client_api(self):
        return self.twitter_client

# Twitter authenticator
class TwitterAuthenticator:
    def authenticate_twitter_app(self):
        auth = OAuthHandler(Twitter_Credentials.consumer_key, Twitter_Credentials.consumer_secret)
        auth.set_access_token(Twitter_Credentials.access_token, Twitter_Credentials.access_secret)
        return auth

class TwitterStreamer():
    # Class for streaming and processing live Tweets
    def __init__(self):
        self.twitter_authenticator = TwitterAuthenticator()

    def stream_tweets(self, fetched_tweets_filename, hash_tag_list):
        # This handles Twitter authentication and connection to the Twitter API
        listener = TwitterListener(fetched_tweets_filename)
        auth = self.twitter_authenticator.authenticate_twitter_app()
        stream = Stream(auth, listener)
        # This line filters the Twitter stream to capture data by keywords
        stream.filter(track=hash_tag_list)

# Twitter stream listener
class TwitterListener(StreamListener):
    # This is a listener class that prints incoming Tweets to stdout
    def __init__(self, fetched_tweets_filename):
        self.fetched_tweets_filename = fetched_tweets_filename

    def on_data(self, data):
        try:
            print(data)
            with open(self.fetched_tweets_filename, 'a') as tf:
                tf.write(data)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        if status == 420:
            # Return False in case the rate limit occurs
            return False
        print(status)

class TweetAnalyzer():
    # Functionality for analysing and categorising content from tweets
    def tweets_to_data_frame(self, tweets):
        df = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['tweets'])
        df['id'] = np.array([tweet.id for tweet in tweets])
        df['len'] = np.array([len(tweet.text) for tweet in tweets])
        df['date'] = np.array([tweet.created_at for tweet in tweets])
        df['source'] = np.array([tweet.source for tweet in tweets])
        df['likes'] = np.array([tweet.favorite_count for tweet in tweets])
        df['retweets'] = np.array([tweet.retweet_count for tweet in tweets])
        return df

if __name__ == "__main__":
    auth = OAuthHandler(Twitter_Credentials.consumer_key, Twitter_Credentials.consumer_secret)
    auth.set_access_token(Twitter_Credentials.access_token, Twitter_Credentials.access_secret)
    api = tweepy.API(auth)

    for tweet in Cursor(api.search, q="#brexit", count=10,
                        lang="en",
                        since="2019-04-03").items():
        fetched_tweets_filename = "tweets.json"
        twitter_streamer = TwitterStreamer()
        hash_tag_list = ["Brexit"]
        twitter_streamer.stream_tweets(fetched_tweets_filename, hash_tag_list)
You're trying to use two different methods of accessing the Twitter API - Streaming is realtime, and searching is a one-off API call.
Since streaming is continuous and realtime, there's no way to apply a count of results to it - the code simply opens a connection, says "hey, send me all the Tweets from now onwards that contain the hash_tag_list", and sits listening. At that point you then drop into the StreamListener, where for each Tweet received, you write them into a file.
You could apply a counter here, but you'd need to wrap it inside your StreamListener on_data handler, and increment the counter for each Tweet received. When you get to 1000 Tweets, stop listening.
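A minimal sketch of that counter (shown here as a plain class so the stopping logic is easy to follow; in the actual script it would subclass tweepy's StreamListener just like TwitterListener above, and names like max_tweets are illustrative):

```python
class CountingListener:
    """Collect raw tweet JSON into a file and stop after max_tweets.

    In tweepy 3.x, returning False from on_data makes Stream close
    the connection, which is how the stream gets stopped.
    """

    def __init__(self, fetched_tweets_filename, max_tweets=1000):
        self.fetched_tweets_filename = fetched_tweets_filename
        self.max_tweets = max_tweets
        self.count = 0

    def on_data(self, data):
        with open(self.fetched_tweets_filename, 'a') as tf:
            tf.write(data)
        self.count += 1
        # Returning False tells tweepy to disconnect the stream
        return self.count < self.max_tweets

    def on_error(self, status):
        if status == 420:
            # Rate limited: disconnect and back off
            return False
        print(status)
```

The only change from the original TwitterListener is the counter and the conditional return value.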
For the search option, you have a couple of issues... the first one is that you're asking for Tweets since 2019, but the standard search API can only go back 7 days in time. You've obviously asked for only 10 Tweets there. The way you've written the method though, what's actually happening is that for each Tweet in the collection of 10 that the API returns, you then create a realtime streaming connection and start listening and writing to a file. So that's not going to work.
You'll need to choose one - either search for 1000 Tweets and write them to a file (never set up TwitterStreamer()), or, listen for 1000 Tweets and write them to a file (drop the for Tweet in Cursor(api.search... and jump straight to the streamer).
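If you take the search route, a small helper can do the writing (a sketch; dump_tweets_to_file is an illustrative name, and it accepts any iterable of tweepy Status objects so you can feed it a Cursor directly):

```python
import json

def dump_tweets_to_file(tweets, filename):
    """Append each tweet's raw JSON to `filename`, one object per line."""
    count = 0
    with open(filename, 'a') as tf:
        for tweet in tweets:
            json.dump(tweet._json, tf)  # tweepy keeps the raw payload in _json
            tf.write('\n')
            count += 1
    return count
```

Called with something like `dump_tweets_to_file(Cursor(api.search, q="#Brexit", lang="en").items(1000), "tweets.json")`, it caps the collection at 1000 Tweets; remember the standard search index only reaches back about 7 days.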
Simply add the hashtag symbol to the search phrase in the list and it'll match tweets that use that specific hashtag. Note that track matching on the streaming API is case-insensitive, so "#Brexit" will also catch "#brexit". Merely using "Brexit" matches tweets that contain the keyword whether or not they use the hashtag.
hash_tag_list = ["#Brexit"]
I am making a Tweepy program using StreamListener that waits for an account to tweet, saves the tweet to a txt file, replaces some characters, and then tweets the updated text.
When I set the account to my own, @Bobwont, it works fine: it waits for @Bobwont to tweet, saves the tweet to a txt file, replaces the characters, and tweets the text.
When I set the account to @Zackfox, it seems to pull tweets from his profile instead of waiting for him to tweet; I'm not sure how else to explain it. I have posted my code and the terminal output below.
Please let me know if you need more information.
zabkfox.py:
class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        if hasattr(status, 'retweeted_status'):
            print('retweet')
        else:
            # Save, transform, and post the tweet text
            with open('tweet.txt', 'w') as tf:
                tf.write(status.text)
            with open('tweet.txt', 'r') as tf:
                contents = tf.read()
            newcontents = contents.replace('c', '\U0001F171\uFE0F')
            print(newcontents)
            api.update_status(newcontents)
        return True

    def on_error(self, status):
        # on_error receives a status code (an int), not a Status object
        print(status)

myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth=api.auth, listener=myStreamListener)
myStream.filter(follow=['1700626069'])
terminal:
Documents/zabkfox/zabkfox.py
this is my twitter bot
retweet
retweet
#za🅱️kfox I liked this and then unliked this. I had a good 🅱️hu🅱️kle. This is why I'm on twitter.
retweet
#za🅱️kfox #bluefa🅱️ebleedem Songs wa🅱️k af
retweet
retweet
retweet
retweet
It seems to only be doing it with tweets where he replies. But it shouldn't be pulling his tweets anyway. It should wait for him to tweet.
Found a fix:
Turns out it was picking up tweets where someone else had replied to him. I simply added this elif statement to work around it:
class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        if hasattr(status, 'retweeted_status'):
            print('retweet')
        elif status.in_reply_to_user_id is not None:
            # Replies carry a non-None in_reply_to_user_id
            print('reply')
        else:
            # Save, transform, and post the tweet text
            with open('tweet.txt', 'w') as tf:
                tf.write(status.text)
            with open('tweet.txt', 'r') as tf:
                contents = tf.read()
            newcontents = contents.replace('c', '\U0001F171\uFE0F')
            print(newcontents)
            api.update_status(newcontents)
        return True
What's the best way to use the Twitter streaming API with Python to collect tweets over a large area?
I'm interested in geolocation, particularly the nationwide collection of tweets in North America. I'm currently using Python and Tweepy to dump tweets from the Twitter streaming API into a MongoDB database.
I'm using the API's location filter to pull tweets within a bounding box, and then I filter further to store only tweets with coordinates. I've found that if my bounding box is large enough, I run into a Python connection error:
raise ProtocolError('Connection broken: %r' % e, e)
requests.packages.urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
I've made the bounding box smaller (I've successfully tried NYC and NYC + New England), but it seems the error returns once the bounding box is large enough. I've also tried threading with the intention of running multiple StreamListeners concurrently, but I don't think the API allows this (I'm getting 420 errors), or at least not in the way I'm attempting.
I'm using Tweepy to set up a custom StreamListener class:
class MyListener(StreamListener):
    """Custom StreamListener for streaming data."""

    def on_data(self, data):
        try:
            db = pymongo.MongoClient(config.db_uri).twitter
            col = db.tweets
            decoded_json = json.loads(data)
            geo = str(decoded_json['coordinates'])
            user = decoded_json['user']['screen_name']
            if geo != "None":
                col.insert(decoded_json)
                print("Geolocated tweet saved from user %s" % user)
            else:
                print("No geo data from user %s" % user)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            time.sleep(5)
        return True

    def on_error(self, status):
        print(status)
        return True
This is what my Thread class looks like:
class myThread(threading.Thread):
    def __init__(self, threadID, name, streamFilter):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.streamFilter = streamFilter

    def run(self):
        print("Starting " + self.name)
        # twitter_stream.filter(locations=self.streamFilter)
        Stream(auth, MyListener()).filter(locations=self.streamFilter)
And main:
if __name__ == '__main__':
    auth = OAuthHandler(config.consumer_key, config.consumer_secret)
    auth.set_access_token(config.access_token, config.access_secret)
    api = tweepy.API(auth)
    twitter_stream = Stream(auth, MyListener())

    # Bounding boxes:
    northeast = [-78.44, 40.88, -66.97, 47.64]
    texas = [-107.31, 25.68, -93.25, 36.7]
    california = [-124.63, 32.44, -113.47, 42.2]

    northeastThread = myThread(1, "ne-thread", northeast)
    texasThread = myThread(2, "texas-thread", texas)
    caliThread = myThread(3, "cali-thread", california)

    northeastThread.start()
    time.sleep(5)
    texasThread.start()
    time.sleep(10)
    caliThread.start()
There is nothing bad or unusual about getting a ProtocolError; connections do break from time to time. You should catch this error in your code and simply restart the stream, and all will be good.
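A restart loop can be tiny (a sketch; run_forever is an illustrative helper that takes any function which opens the stream and blocks, so you'd call it with something like lambda: Stream(auth, MyListener()).filter(locations=northeast) and retriable=(ProtocolError,)):

```python
import time

def run_forever(connect, retriable=(Exception,), backoff=5):
    """Call `connect()` (a function that opens the stream and blocks);
    when it raises one of the retriable errors, sleep and reconnect."""
    while True:
        try:
            connect()
            return  # the stream ended cleanly (e.g. the listener stopped it)
        except retriable:
            time.sleep(backoff)  # brief pause before reconnecting
```

Keeping the retriable exception tuple narrow (e.g. just ProtocolError) avoids silently retrying on real bugs like a bad credential.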
BTW, I noticed you are interrogating the geo field, which has been deprecated. The field you want is coordinates. You might also find place useful.
(The Twitter API docs say multiple streaming connections are not allowed.)
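Given that one-connection limit, a workaround is to put all the bounding boxes on a single stream: the locations parameter accepts multiple boxes (up to 25 on standard access) as one flat list of longitude/latitude pairs. With the boxes from the question:

```python
# Bounding boxes as [west_lon, south_lat, east_lon, north_lat]
northeast = [-78.44, 40.88, -66.97, 47.64]
texas = [-107.31, 25.68, -93.25, 36.7]
california = [-124.63, 32.44, -113.47, 42.2]

# One connection, three boxes: concatenate into a single flat list.
all_boxes = northeast + texas + california

# Then open a single stream (commented out here; needs auth and MyListener):
# Stream(auth, MyListener()).filter(locations=all_boxes)
```

This removes the need for threads entirely, since one filter call covers all three regions.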
It seems Twitter allocates one block of tweets when you track a keyword over a big geographic area (say a country or a large city). I think this can be overcome by running multiple streams concurrently, but as separate programs.