Tweepy: Stream data for X minutes? - python

I'm using tweepy to datamine the public stream of tweets for keywords. This is pretty straightforward and has been described in multiple places:
http://runnable.com/Us9rrMiTWf9bAAW3/how-to-stream-data-from-twitter-with-tweepy-for-python
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Copying code directly from the second link:
#Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#Variables that contain the user credentials to access the Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"

#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print data
        return True

    def on_error(self, status):
        print status

if __name__ == '__main__':

    #This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    #This line filters Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track=['python', 'javascript', 'ruby'])
What I can't figure out is how I can stream this data into a Python variable instead of printing it to the screen. I'm working in an IPython notebook and want to capture the stream in some variable, foo, after streaming for a minute or so. Furthermore, how do I get the stream to time out? It runs indefinitely in this manner.
Related:
Using tweepy to access Twitter's Streaming API

Yes, in the post, @Adil Moujahid mentions that his code ran for 3 days. I adapted the same code and, for initial testing, made the following tweaks:
a) Added a location filter to get a limited set of tweets instead of all tweets containing the keyword.
See How to add a location filter to tweepy module.
From here, you can create an intermediate variable in the above code as follows:
stream_all = Stream(auth, l)
Suppose we select the San Francisco area; we can add:
stream_SFO = stream_all.filter(locations=[-122.75,36.8,-121.75,37.8])
It is assumed that filtering by location takes less time than filtering by the keywords.
b) Then you can filter for the keywords:
tweet_iter = stream_SFO.filter(track=['python', 'javascript', 'ruby'])
c) You can then write it to a file as follows:
with open('file_name.json', 'w') as f:
    json.dump(tweet_iter, f, indent=1)
This should take much less time. I coincidentally wanted to address the same question that you posted today, so I don't have the execution time yet.
Hope this helps.
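For the other half of the question (capturing the tweets in a Python variable and stopping after a minute or so), here is a minimal sketch, assuming the same tweepy 3.x StreamListener API as the question; the listener name, the tweets list, and the 60-second limit are just placeholders:
import time
import json
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

class TimedListener(StreamListener):
    """Collects raw tweet JSON into a list and disconnects after a time limit."""
    def __init__(self, time_limit=60):
        super(TimedListener, self).__init__()
        self.start = time.time()
        self.time_limit = time_limit
        self.tweets = []            # the "variable" the question asks for

    def on_data(self, data):
        self.tweets.append(json.loads(data))
        # Returning False tells tweepy to disconnect the stream
        return (time.time() - self.start) < self.time_limit

    def on_error(self, status):
        print(status)
        return False

listener = TimedListener(time_limit=60)
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
Stream(auth, listener).filter(track=['python', 'javascript', 'ruby'])
foo = listener.tweets               # available after the stream disconnects
Note that the time check only runs when a tweet arrives, so for very quiet keywords the disconnect can happen a bit later than the limit.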

I notice that you are looking to stream data into a variable for later use. The way that I have done this is to create a method to stream data into a database using sqlite3 and sqlalchemy.
For example, first here is the regular code:
import tweepy
import json
import time
import db_commands
import credentials

API_KEY = credentials.ApiKey
API_KEY_SECRET = credentials.ApiKeySecret
ACCESS_TOKEN = credentials.AccessToken
ACCESS_TOKEN_SECRET = credentials.AccessTokenSecret

def create_auth_instance():
    """Set up Authentication Instance"""
    auth = tweepy.OAuthHandler(API_KEY, API_KEY_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return api


class MyStreamListener(tweepy.StreamListener):
    """Listen for tweets"""

    def __init__(self, api=None):
        self.counter = 0
        # References the auth instance for the listener
        self.api = create_auth_instance()
        # Creates a database command instance
        self.dbms = db_commands.MyDatabase(db_commands.SQLITE, dbname='mydb.sqlite')
        # Creates a database table
        self.dbms.create_db_tables()

    def on_connect(self):
        """Notify when the user is connected to Twitter"""
        print("Connected to Twitter API!")

    def on_status(self, tweet):
        """
        Every time a tweet is tracked, add the contents of the tweet,
        its username, text, and date created, into a sqlite3 database
        """
        user = tweet.user.screen_name
        text = tweet.text
        date_created = tweet.created_at
        self.dbms.insert(user, text, date_created)

    def on_error(self, status_code):
        """Handle error codes"""
        if status_code == 420:
            # Return False if the stream disconnects
            return False


def main():
    """Create twitter listener (Stream)"""
    tracker_subject = input("Type subject to track: ")
    twitter_listener = MyStreamListener()
    myStream = tweepy.Stream(auth=twitter_listener.api.auth, listener=twitter_listener)
    myStream.filter(track=[tracker_subject], is_async=True)


main()
As you can see in the code, we authenticate and create a listener and then activate a stream
twitter_listener = MyStreamListener()
myStream = tweepy.Stream(auth=twitter_listener.api.auth, listener=twitter_listener)
myStream.filter(track=[tracker_subject], is_async=True)
Every time we receive a tweet, the on_status function will execute; it can be used to perform a set of actions on the tweet data being streamed.
def on_status(self, tweet):
    """
    Every time a tweet is tracked, add the contents of the tweet,
    its username, text, and date created, into a sqlite3 database
    """
    user = tweet.user.screen_name
    text = tweet.text
    date_created = tweet.created_at
    self.dbms.insert(user, text, date_created)
The tweet data, tweet, is captured in three variables, user, text, and date_created, and then passed to the database controller initialized in the MyStreamListener class's __init__ function. The insert function it calls comes from the imported db_commands file.
Here is the code located in the db_commands.py file, which is imported with import db_commands.
from sqlalchemy import create_engine
from sqlalchemy import Table, Column, Integer, String, MetaData, ForeignKey

# Global Variables
SQLITE = 'sqlite'
# MYSQL = 'mysql'
# POSTGRESQL = 'postgresql'
# MICROSOFT_SQL_SERVER = 'mssqlserver'

# Table Names
TWEETS = 'tweets'


class MyDatabase:
    # http://docs.sqlalchemy.org/en/latest/core/engines.html
    DB_ENGINE = {
        SQLITE: 'sqlite:///{DB}',
        # MYSQL: 'mysql://scott:tiger@localhost/{DB}',
        # POSTGRESQL: 'postgresql://scott:tiger@localhost/{DB}',
        # MICROSOFT_SQL_SERVER: 'mssql+pymssql://scott:tiger@hostname:port/{DB}'
    }

    # Main DB Connection Ref Obj
    db_engine = None

    def __init__(self, dbtype, username='', password='', dbname=''):
        dbtype = dbtype.lower()
        if dbtype in self.DB_ENGINE.keys():
            engine_url = self.DB_ENGINE[dbtype].format(DB=dbname)
            self.db_engine = create_engine(engine_url)
            print(self.db_engine)
        else:
            print("DBType is not found in DB_ENGINE")

    def create_db_tables(self):
        metadata = MetaData()
        tweets = Table(TWEETS, metadata,
                       Column('id', Integer, primary_key=True),
                       Column('user', String),
                       Column('text', String),
                       Column('date_created', String),
                       )
        try:
            metadata.create_all(self.db_engine)
            print("Tables created")
        except Exception as e:
            print("Error occurred during Table creation!")
            print(e)

    # Insert, Update, Delete
    def execute_query(self, query=''):
        if query == '':
            return
        print(query)
        with self.db_engine.connect() as connection:
            try:
                connection.execute(query)
            except Exception as e:
                print(e)

    def insert(self, user, text, date_created):
        # Insert Data
        query = "INSERT INTO {}(user, text, date_created) " \
                "VALUES ('{}', '{}', '{}');".format(TWEETS, user, text, date_created)
        self.execute_query(query)
This code uses the sqlalchemy package to create a sqlite3 database and post tweets to a tweets table. SQLAlchemy can easily be installed with pip install sqlalchemy. If you use these two files together, you should be able to scrape tweets through a filter into a database. Please let me know if this helps and if you have any further questions.
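One caveat: the string-formatted INSERT above will break as soon as a tweet contains a single quote, and it is open to SQL injection. A hedged alternative sketch of the insert method, assuming the same tweets table and SQLAlchemy 1.x as in the answer, is to let SQLAlchemy bind the parameters:
from sqlalchemy import text

def insert(self, user, text_value, date_created):
    # Bound parameters handle quotes and escaping safely
    # (text_value instead of text, to avoid shadowing sqlalchemy.text)
    query = text(
        "INSERT INTO tweets (user, text, date_created) "
        "VALUES (:user, :text, :date_created)"
    )
    with self.db_engine.connect() as connection:
        connection.execute(query, {"user": user,
                                   "text": text_value,
                                   "date_created": str(date_created)})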

Related

how do I transfer a live twitter stream to mongo db in python

I need to make an app that will live-stream tweets from the Twitter API in one of two ways: either save them to a file and then load them from the file into my MongoDB database, or send them from the stream directly to the database.
Right now the program saves tweets to a file and prints them in the terminal, but I can't figure out where to do the transfer. If I put the insert_one call in on_status, it bails out as soon as it hits that line, and if I put it in the main area, it never runs.
from tweepy import Stream
from tweepy import OAuthHandler
import tweepy
import json
import pymongo

# tokens and access
access_token = ''
access_token_secret = ''
API_KEY = ''
API_KEY_secret = ''

# make connection to database
connection = pymongo.MongoClient('localhost', 27017)

# create the database and collection
database = connection.twitter_db
collection = database.tweets.create_index([("id", pymongo.ASCENDING)], unique=True,)

class MyListener(tweepy.Stream):

    def on_status(self, status):
        json_str = json.dumps(status._json)
        print(json_str)
        try:
            with open("UkraineTweets.json", "a") as file:
                file.write(json_str + "\n")
                print("written to the file")
                # collection.insert_one(data)
                # print("entered into mongo_db")
            return True
        except:
            pass
        return False

if __name__ == "__main__":
    authentication = tweepy.OAuthHandler(API_KEY, API_KEY_secret)
    authentication.set_access_token(access_token, access_token_secret)
    api = tweepy.API(authentication)

    stream = MyListener(API_KEY, API_KEY_secret, access_token, access_token_secret)
    stream.filter(track=['Russia', 'Ukraine', 'war', 'Putin', '#StandWithUkraine', '#StopPutinNOW'], languages=['en'])
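One thing to note: pymongo's create_index() returns the index name (a string), not a collection handle, so the commented-out collection.insert_one(data) call would fail if re-enabled. A hedged sketch of keeping the collection handle separate and doing the insert directly inside on_status (class and variable names are placeholders, and the tweepy v4 Stream subclass style of the question is assumed):
import pymongo
from pymongo.errors import DuplicateKeyError
import tweepy

connection = pymongo.MongoClient('localhost', 27017)
database = connection.twitter_db
collection = database.tweets                       # keep the collection handle itself
collection.create_index([("id", pymongo.ASCENDING)], unique=True)

class MongoListener(tweepy.Stream):
    def on_status(self, status):
        data = status._json                        # the full tweet as a dict
        try:
            collection.insert_one(data)            # write straight into MongoDB
            print("inserted tweet", data.get("id"))
        except DuplicateKeyError:
            pass                                   # already stored, keep streaming
        return True

stream = MongoListener(API_KEY, API_KEY_secret, access_token, access_token_secret)
stream.filter(track=['Russia', 'Ukraine'], languages=['en'])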

Handle 420 response code returned by Tweepy api

Whenever a user logs in to my application and searches, I have to start a streaming API call to fetch the data they need.
Here is my stream API class
import tweepy
import json
import sys

class TweetListener(tweepy.StreamListener):

    def on_connect(self):
        # Called initially to connect to the Streaming API
        print("You are now connected to the streaming API.")

    def on_error(self, status_code):
        # On error - if an error occurs, display the error / status code
        print('An Error has occured: ' + repr(status_code))
        return False

    def on_data(self, data):
        json_data = json.loads(data)
        print(json_data)
Here is my Python file, which calls the class above to start Twitter streaming:
import tweepy
from APIs.StreamKafkaApi1 import TweetListener

consumer_key = "***********"
consumer_secret = "*********"
access_token = "***********"
access_secret = "********"

hashtags = ["#ipl"]

def callStream():
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)

    tweetListener = TweetListener(userid, projectid)
    streamer = tweepy.Stream(api.auth, tweetListener)
    streamer.filter(track=hashtags, async=True)

if __name__ == "__main__":
    callStream()
But if I hit it more than twice, my application returns error code 420.
I thought of switching to a different set of API keys (using multiple keys) whenever error 420 occurs.
How do I get the error raised by the on_error method of the TweetListener class back in callStream()?
I would like to add onto @Andy Piper's answer. Response 420 means your script is making too many requests and has been rate limited. To resolve this, here is what I do (in the TweetListener class):
def on_limit(self, status):
    print("Rate Limit Exceeded, Sleep for 15 Mins")
    time.sleep(15 * 60)
    return True
Do this and the error will be handled.
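To the specific question of getting the error back into callStream(), one hedged option (a sketch, not the only way) is to record the status code on the listener so the caller can inspect it after the stream stops. This assumes tweepy 3.x and a blocking (non-async) filter call; the listener below is simplified and omits the userid/projectid arguments from the question:
class TweetListener(tweepy.StreamListener):

    def __init__(self, *args, **kwargs):
        super(TweetListener, self).__init__(*args, **kwargs)
        self.last_error = None           # filled in by on_error
    # ... keep your on_connect / on_data methods as before ...

    def on_error(self, status_code):
        self.last_error = status_code    # remember it for the caller
        return False                     # disconnect the stream


def callStream():
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)

    tweetListener = TweetListener()
    streamer = tweepy.Stream(api.auth, tweetListener)
    streamer.filter(track=hashtags)      # blocking call; returns after disconnect

    if tweetListener.last_error == 420:
        print("Rate limited (420); back off before reconnecting")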
If you insist on using multiple keys (I am not sure this works), try exception handling on the TweetListener and streamer for tweepy.error.RateLimitError, and a recursive call of the function using the next API key:
def callStream(key):
    # authenticate the API keys here
    try:
        tweetListener = TweetListener(userid, projectid)
        streamer = tweepy.Stream(api.auth, tweetListener)
        streamer.filter(track=hashtags, async=True)
    except tweepy.TweepError as e:
        if e.reason[0]['code'] == "420":
            callStream(nextKey)
    return True
Per the Twitter error response code documentation
Returned when an application is being rate limited for making too many
requests.
The Twitter streaming API does not support more than a couple of connections per user and IP address. It is against the Twitter Developer Policy to use multiple application keys to attempt to circumvent this and your apps could be suspended if you do.

Python API Streaming, write new file after certain size

I have a python script that maintains an open connection to the Twitter Streaming API, and writes the data into a json file. Is it possible to write to a new file, without breaking the connection, after the current file being written reaches a certain size? For example, I just streamed data for over 1 week, but all the data is contained in a single file (~2gb) making it slow to parse. If I could write to a new file after, say 500mb, then I would have 4 smaller files (e.g. dump1.json, dump2.json etc) to parse instead of one large one.
import tweepy
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener

# Add consumer/access tokens for Twitter API
consumer_key = '-----'
consumer_secret = '-----'
access_token = '-----'
access_secret = '-----'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

# Define streamlistener class to open a connection to Twitter and begin consuming data
class MyListener(StreamListener):

    def on_data(self, data):
        try:
            with open(r'G:\xxxx\Raw_tweets.json', 'a') as f:
                f.write(data)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            return True

    def on_error(self, status):
        print(status)
        return True

bounding_box = [-77.2157, 38.2036, -76.5215, 39.3365]  # filtering by location
keyword_list = ['']  # filtering by keyword

twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(locations=bounding_box)  # Filter Tweets in stream by location bounding box
#twitter_stream.filter(track=keyword_list)  # Filter Tweets in stream by keyword
Since you re-open your file every time, it is rather simple: use an index in the file name and advance it when the file size reaches the threshold.
class MyListener(StreamListener):

    def __init__(self):
        super(MyListener, self).__init__()
        self._file_index = 0

    def on_data(self, data):
        tweets_file = r'G:\xxxx\Raw_tweets{}.json'.format(self._file_index)
        # advance the index while the current file is already over the size threshold
        # (2**10 bytes here; use e.g. 500 * 2**20 for ~500 MB)
        while os.path.exists(tweets_file) and os.stat(tweets_file).st_size > 2**10:
            self._file_index += 1
            tweets_file = r'G:\xxxx\Raw_tweets{}.json'.format(self._file_index)
        ....
The while loop also takes care of the case where your app is restarted.
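For completeness, a hedged sketch of the full on_data with the write included and a 500 MB threshold (the size mentioned in the question); the path and class name mirror the snippets above and are placeholders, and tweepy 3.x is assumed:
import os
from tweepy.streaming import StreamListener

MAX_BYTES = 500 * 2**20   # roughly 500 MB per file

class RotatingListener(StreamListener):

    def __init__(self):
        super(RotatingListener, self).__init__()
        self._file_index = 0

    def _current_path(self):
        return r'G:\xxxx\Raw_tweets{}.json'.format(self._file_index)

    def on_data(self, data):
        # move to the next file name once the current one is too big
        path = self._current_path()
        while os.path.exists(path) and os.stat(path).st_size > MAX_BYTES:
            self._file_index += 1
            path = self._current_path()
        try:
            with open(path, 'a') as f:
                f.write(data)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            return True

    def on_error(self, status):
        print(status)
        return True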

MongoDB Python Twitter streaming not saving to database

Trying to create a Python script that will data-mine Twitter, but I'm not having good luck! I don't know what I'm doing wrong.
from pymongo import MongoClient
import json
import tweepy  # needed for tweepy.API below
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import datetime

# Auth Variables
consumer_key = "INSERT_KEY_HERE"
consumer_key_secret = "INSERT_KEY_HERE"
access_token = "INSERT_KEY_HERE"
access_token_secret = "INSERT_KEY_HERE"

# MongoDB connection info
connection = MongoClient('localhost', 27017)
db = connection.TwitterStream
db.tweets.ensure_index("id", unique=True, dropDups=True)
collection = db.tweets

# Key words to be tracked, (hashtags)
keyword_list = ['#MorningAfter', '#Clinton', '#Trump']

class StdOutListener(StreamListener):

    def on_data(self, data):
        # Load the Tweet into the variable "t"
        t = json.loads(data)

        # Pull important data from the tweet to store in the database.
        tweet_id = t['id_str']  # The Tweet ID from Twitter in string format
        text = t['text']  # The entire body of the Tweet
        hashtags = t['entities']['hashtags']  # Any hashtags used in the Tweet
        time_stamp = t['created_at']  # The timestamp of when the Tweet was created
        language = t['lang']  # The language of the Tweet

        # Convert the timestamp string given by Twitter to a date object called "created"
        created = datetime.datetime.strptime(time_stamp, '%a %b %d %H:%M:%S +0000 %Y')

        # Load all of the extracted Tweet data into the variable "tweet" that will be stored into the database
        tweet = {'id': tweet_id, 'text': text, 'hashtags': hashtags, 'language': language, 'created': created}

        # Save the refined Tweet data to MongoDB
        collection.insert(tweet)
        print(tweet_id + "\n")
        return True

    # Prints the reason for an error to your console
    def on_error(self, status):
        print(status)

l = StdOutListener(api=tweepy.API(wait_on_rate_limit=True))
auth = OAuthHandler(consumer_key, consumer_key_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, listener=l)
stream.filter(track=keyword_list)
Here is the script I have so far. I've done a few Google searches and compared what I have to what others have, and I can't find the source of the issue. It runs and connects to MongoDB, and the correct database is created, but nothing is being put into it. I have a bit of debug code that prints the tweet id, but it just prints 401 over and over at roughly 5-10 second intervals. I tried some basic examples I found while googling and still nothing happened. I think it might be an issue with the database connection?
Any ideas would be greatly appreciated, thank you!
I figured it out finally! The repeated 401 was the key here: it's an authentication error. I had to set my system clock to sync over the internet and reset it.

Twitter API - get tweets with specific id

I have a list of tweet ids for which I would like to download the text content. Is there any easy way to do this, preferably through a Python script? I had a look at libraries like Tweepy and things don't appear to be that simple, and downloading them manually is out of the question since my list is very long.
You can access specific tweets by their id with the statuses/show/:id API route. Most Python Twitter libraries follow the exact same patterns, or offer 'friendly' names for the methods.
For example, Twython offers several show_* methods, including Twython.show_status() that lets you load specific tweets:
from twython import Twython

CONSUMER_KEY = "<consumer key>"
CONSUMER_SECRET = "<consumer secret>"
OAUTH_TOKEN = "<application key>"
OAUTH_TOKEN_SECRET = "<application secret>"

twitter = Twython(
    CONSUMER_KEY, CONSUMER_SECRET,
    OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

tweet = twitter.show_status(id=id_of_tweet)
print(tweet['text'])
and the returned dictionary follows the Tweet object definition given by the API.
The tweepy library uses API.get_status():
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tweepy.API(auth)
tweet = api.get_status(id_of_tweet)
print(tweet.text)
where it returns a slightly richer object, but the attributes on it again reflect the published API.
Sharing my work, which was vastly accelerated by the previous answers (thank you). This Python 2.7 script fetches the text for tweet IDs stored in a file. Adjust get_tweet_id() for your input data format;
the original is configured for the data at https://github.com/mdredze/twitter_sandy
Update April 2018: responding late to @someone's bug report (thank you). This script no longer discards every 100th tweet ID (that was my bug). Please note that if a tweet is unavailable for whatever reason, the bulk fetch silently skips it. The script now warns if the response size is different from the request size.
'''
Gets text content for tweet IDs
'''

# standard
from __future__ import print_function
import getopt
import logging
import os
import sys
# import traceback

# third-party: `pip install tweepy`
import tweepy

# global logger level is configured in main()
Logger = None

# Generate your own at https://apps.twitter.com/app
CONSUMER_KEY = 'Consumer Key (API key)'
CONSUMER_SECRET = 'Consumer Secret (API Secret)'
OAUTH_TOKEN = 'Access Token'
OAUTH_TOKEN_SECRET = 'Access Token Secret'

# batch size depends on Twitter limit, 100 at this time
batch_size = 100


def get_tweet_id(line):
    '''
    Extracts and returns tweet ID from a line in the input.
    '''
    (tagid, _timestamp, _sandyflag) = line.split('\t')
    (_tag, _search, tweet_id) = tagid.split(':')
    return tweet_id


def get_tweets_single(twapi, idfilepath):
    '''
    Fetches content for tweet IDs in a file one at a time,
    which means a ton of HTTPS requests, so NOT recommended.

    `twapi`: Initialized, authorized API object from Tweepy
    `idfilepath`: Path to file containing IDs
    '''
    # process IDs from the file
    with open(idfilepath, 'rb') as idfile:
        for line in idfile:
            tweet_id = get_tweet_id(line)
            Logger.debug('get_tweets_single: fetching tweet for ID %s', tweet_id)
            try:
                tweet = twapi.get_status(tweet_id)
                print('%s,%s' % (tweet_id, tweet.text.encode('UTF-8')))
            except tweepy.TweepError as te:
                Logger.warn('get_tweets_single: failed to get tweet ID %s: %s', tweet_id, te.message)
                # traceback.print_exc(file=sys.stderr)
        # for
    # with


def get_tweet_list(twapi, idlist):
    '''
    Invokes bulk lookup method.
    Raises an exception if rate limit is exceeded.
    '''
    # fetch as little metadata as possible
    tweets = twapi.statuses_lookup(id_=idlist, include_entities=False, trim_user=True)
    if len(idlist) != len(tweets):
        Logger.warn('get_tweet_list: unexpected response size %d, expected %d', len(tweets), len(idlist))
    for tweet in tweets:
        print('%s,%s' % (tweet.id, tweet.text.encode('UTF-8')))


def get_tweets_bulk(twapi, idfilepath):
    '''
    Fetches content for tweet IDs in a file using bulk request method,
    which vastly reduces number of HTTPS requests compared to above;
    however, it does not warn about IDs that yield no tweet.

    `twapi`: Initialized, authorized API object from Tweepy
    `idfilepath`: Path to file containing IDs
    '''
    # process IDs from the file
    tweet_ids = list()
    with open(idfilepath, 'rb') as idfile:
        for line in idfile:
            tweet_id = get_tweet_id(line)
            Logger.debug('Enqueing tweet ID %s', tweet_id)
            tweet_ids.append(tweet_id)
            # API limits batch size
            if len(tweet_ids) == batch_size:
                Logger.debug('get_tweets_bulk: fetching batch of size %d', batch_size)
                get_tweet_list(twapi, tweet_ids)
                tweet_ids = list()
    # process remainder
    if len(tweet_ids) > 0:
        Logger.debug('get_tweets_bulk: fetching last batch of size %d', len(tweet_ids))
        get_tweet_list(twapi, tweet_ids)


def usage():
    print('Usage: get_tweets_by_id.py [options] file')
    print('    -s (single) makes one HTTPS request per tweet ID')
    print('    -v (verbose) enables detailed logging')
    sys.exit()


def main(args):
    logging.basicConfig(level=logging.WARN)
    global Logger
    Logger = logging.getLogger('get_tweets_by_id')
    bulk = True
    try:
        opts, args = getopt.getopt(args, 'sv')
    except getopt.GetoptError:
        usage()
    for opt, _optarg in opts:
        if opt in ('-s'):
            bulk = False
        elif opt in ('-v'):
            Logger.setLevel(logging.DEBUG)
            Logger.debug("main: verbose mode on")
        else:
            usage()
    if len(args) != 1:
        usage()
    idfile = args[0]
    if not os.path.isfile(idfile):
        print('Not found or not a file: %s' % idfile, file=sys.stderr)
        usage()

    # connect to twitter
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
    api = tweepy.API(auth)

    # hydrate tweet IDs
    if bulk:
        get_tweets_bulk(api, idfile)
    else:
        get_tweets_single(api, idfile)


if __name__ == '__main__':
    main(sys.argv[1:])
You can access tweets in bulk (up to 100 at a time) with the status/lookup endpoint: https://dev.twitter.com/rest/reference/get/statuses/lookup
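A hedged sketch of using that bulk endpoint through tweepy (assuming tweepy 3.x, where the method is called statuses_lookup and accepts up to 100 IDs per call; the credential names and example IDs below are placeholders):
import tweepy

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

tweet_ids = ['1234567890123456789', '9876543210987654321']  # up to 100 per call

# statuses_lookup silently drops IDs that are deleted or protected
for tweet in api.statuses_lookup(tweet_ids):
    print(tweet.id, tweet.text)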
I don't have enough reputation to add an actual comment, so sadly this is the way to go:
I found a bug and a strange thing in chrisinmtown's answer:
Every 100th tweet will be skipped due to the bug. Here is a simple solution:
if len(tweet_ids) < 100:
    tweet_ids.append(tweet_id)
else:
    tweet_ids.append(tweet_id)
    get_tweet_list(twapi, tweet_ids)
    tweet_ids = list()
Using the following is better, since it keeps working even past the rate limit:
api = tweepy.API(auth_handler=auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
