I'm trying to create a Python script that will data-mine Twitter, but I'm not having much luck and I don't know what I'm doing wrong.
from pymongo import MongoClient
import json
import datetime
import tweepy  # needed for the tweepy.API call below
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
# Auth Variables
consumer_key = "INSERT_KEY_HERE"
consumer_key_secret = "INSERT_KEY_HERE"
access_token = "INSERT_KEY_HERE"
access_token_secret = "INSERT_KEY_HERE"
# MongoDB connection info
connection = MongoClient('localhost', 27017)
db = connection.TwitterStream
# create_index replaces the deprecated ensure_index; MongoDB no longer supports dropDups
db.tweets.create_index("id", unique=True)
collection = db.tweets
# Key words to be tracked, (hashtags)
keyword_list = ['#MorningAfter', '#Clinton', '#Trump']
class StdOutListener(StreamListener):
def on_data(self, data):
# Load the Tweet into the variable "t"
t = json.loads(data)
# Pull important data from the tweet to store in the database.
tweet_id = t['id_str'] # The Tweet ID from Twitter in string format
text = t['text'] # The entire body of the Tweet
hashtags = t['entities']['hashtags'] # Any hashtags used in the Tweet
time_stamp = t['created_at'] # The timestamp of when the Tweet was created
language = t['lang'] # The language of the Tweet
# Convert the timestamp string given by Twitter to a date object called "created"
created = datetime.datetime.strptime(time_stamp, '%a %b %d %H:%M:%S +0000 %Y')
# Load all of the extracted Tweet data into the variable "tweet" that will be stored into the database
tweet = {'id': tweet_id, 'text': text, 'hashtags': hashtags, 'language': language, 'created': created}
# Save the refined Tweet data to MongoDB
        collection.insert_one(tweet)  # insert_one replaces the deprecated Collection.insert
print(tweet_id + "\n")
return True
# Prints the reason for an error to your console
def on_error(self, status):
print(status)
l = StdOutListener(api=tweepy.API(wait_on_rate_limit=True))
auth = OAuthHandler(consumer_key, consumer_key_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, listener=l)
stream.filter(track=keyword_list)
Here is the script I have so far. I've done a few Google searches and compared what I have against working examples, but I can't find the source of the issue. It runs and connects to MongoDB, and the database exists, but nothing is being inserted into it. I have a bit of debug code that prints the tweet ID, but it just prints 401 over and over at roughly 5-10 second intervals. I tried some basic examples I found while googling and still nothing happened. Could it be an issue with the database connection?
Any ideas would be greatly appreciated. Thank you!
I figured it out finally! The printing of 401 was the key: it's an HTTP authentication error. I had to sync my system clock with the internet; OAuth request signatures are timestamped, so a skewed clock makes Twitter reject them as unauthorized.
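If anyone else hits this: a quick way to check whether your credentials (or clock) are the problem, before starting the stream, is a plain REST call. A minimal sketch, assuming Tweepy 3.x as in the question:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# In Tweepy 3.x, verify_credentials() returns the authenticated user on
# success and False on failure (bad keys, or OAuth timestamps rejected
# because of a skewed system clock)
if api.verify_credentials():
    print("Authentication OK")
else:
    print("Authentication failed: check keys and system clock")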
I am trying to scrape Twitter profiles for a project I am doing. I have the following code
import tweepy  # the code below calls tweepy.OAuthHandler, tweepy.API, tweepy.Cursor
import pandas as pd
"""I like to have my python script print a message at the beginning. This helps me confirm whether everything is set up correctly. And it's nice to get an uplifting message ;)."""
print("You got this!")
access_token = ''
access_token_secret = ''
consumer_key = ''
consumer_secret = ''
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
tweets = []
count = 1
"""Twitter will automatically sample the last 7 days of data. Depending on how many total tweets there are with the specific hashtag, keyword, handle, or key phrase that you are looking for, you can set the date back further by adding since= as one of the parameters. You can also manually add in the number of tweets you want to get back in the items() section."""
for tweet in tweepy.Cursor(api.search, q="#BNonnecke", count=450, since='2020-02-28').items(50000):
print(count)
count += 1
try:
data = [tweet.created_at, tweet.id, tweet.text, tweet.user._json['screen_name'], tweet.user._json['name'], tweet.user._json['created_at'], tweet.entities['urls']]
data = tuple(data)
tweets.append(data)
except tweepy.TweepError as e:
print(e.reason)
continue
except StopIteration:
break
df = pd.DataFrame(tweets, columns = ['created_at','tweet_id', 'tweet_text', 'screen_name', 'name', 'account_creation_date', 'urls'])
"""Add the path to the folder you want to save the CSV file in as well as what you want the CSV file to be named inside the single quotations"""
df.to_csv(path_or_buf = '/Users/Name/Desktop/FolderName/FileName.csv', index=False)
however, I keep getting the error "API" object has no attribute "search" from the line "for tweet in tweepy.Cursor(api.search, q="#BNonnecke", count=450, since='2020-02-28').items(50000):" I am not really sure why and don't know how to resolve this issue.
Thanks so much!
The latest version of Tweepy (v4 upwards) now has a search_tweets method instead of a search method. Check the documentation.
API.search_tweets(q, *, geocode, lang, locale, result_type, count, until, since_id, max_id, include_entities)
Also, read the comment in your code :-) The Search API has a 7-day history limit, so searching for Tweets since 2020-02-28 will only return Tweets posted in the 7 days before the date you run your code.
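Something like this should work on v4 (an untested sketch; note that wait_on_rate_limit_notify was also removed in v4, and since is gone, so filter on tweet.created_at client-side if you need a start date):
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

# search_tweets replaces search in Tweepy v4; the standard search index
# only goes back about 7 days, and count caps at 100 per request
for tweet in tweepy.Cursor(api.search_tweets, q="#BNonnecke", count=100).items(500):
    print(tweet.created_at, tweet.id, tweet.text)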
I am trying to stream data from Twitter to an AWS S3 bucket. The good news is I can get the data to stream into my bucket, but it arrives in roughly 20 KB chunks (I think this may be due to some Firehose buffer settings), and it's not being saved as JSON even though I parse each record with json.loads in my Python code. Rather than JSON files, the objects in my S3 bucket have no file extension and long alphanumeric names. I think it may be something to do with the parameters being used in client.put_record().
Any help is greatly appreciated!
Please find my code below, which I got from GitHub here.
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json
import boto3
import time
# Variables that contain the user credentials to access the Twitter API
consumer_key = "MY_CONSUMER_KEY"
consumer_secret = "MY_CONSUMER_SECRET"
access_token = "MY_ACCESS_TOKEN"
access_token_secret = "MY_SECRET_ACCESS_TOKEN"
# Listener that parses each tweet and forwards selected fields to Kinesis Firehose
class StdOutListener(StreamListener):
def on_data(self, data):
tweet = json.loads(data)
try:
if 'extended_tweet' in tweet.keys():
#print (tweet['text'])
message_lst = [str(tweet['id']),
str(tweet['user']['name']),
str(tweet['user']['screen_name']),
tweet['extended_tweet']['full_text'],
str(tweet['user']['followers_count']),
str(tweet['user']['location']),
str(tweet['geo']),
str(tweet['created_at']),
'\n'
]
message = '\t'.join(message_lst)
print(message)
client.put_record(
DeliveryStreamName=delivery_stream,
Record={
'Data': message
}
)
elif 'text' in tweet.keys():
#print (tweet['text'])
message_lst = [str(tweet['id']),
str(tweet['user']['name']),
str(tweet['user']['screen_name']),
tweet['text'].replace('\n',' ').replace('\r',' '),
str(tweet['user']['followers_count']),
str(tweet['user']['location']),
str(tweet['geo']),
str(tweet['created_at']),
'\n'
]
message = '\t'.join(message_lst)
print(message)
client.put_record(
DeliveryStreamName=delivery_stream,
Record={
'Data': message
}
)
except (AttributeError, Exception) as e:
print (e)
return True
def on_error(self, status):
print (status)
if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
listener = StdOutListener()
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
#tweets = Table('tweets_ft',connection=conn)
client = boto3.client('firehose',
region_name='us-east-1',
aws_access_key_id='MY ACCESS KEY',
aws_secret_access_key='MY SECRET KEY'
)
delivery_stream = 'my_firehose'
    # Filter the Twitter stream to capture Tweets matching a keyword, e.g.:
#stream.filter(track=['trump'], stall_warnings=True)
while True:
try:
print('Twitter streaming...')
stream = Stream(auth, listener)
stream.filter(track=['brexit'], languages=['en'], stall_warnings=True)
except Exception as e:
print(e)
print('Disconnected...')
time.sleep(5)
continue
It's possible that you have enabled S3 compression for your Firehose. Please ensure that compression is disabled if you want to store raw JSON data in your bucket.
You could also have a transformation applied to your Firehose that encodes or otherwise transforms your JSON messages into some other format.
So it looks like the files were coming in with JSON formatting after all; I just had to open the files from S3 with Firefox and I could see their contents. The issue with the file sizes is due to the Firehose buffer settings; I have them set to the lowest values, which is why the files were being delivered in such small chunks.
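For anyone who wants actual JSON objects in S3 rather than tab-separated text, one option is to send each record as newline-delimited JSON. A rough sketch (the send_tweet helper is hypothetical, not part of the original script):
import json

def send_tweet(client, delivery_stream, tweet):
    # Keep only the fields we care about
    record = {
        'id': tweet['id'],
        'screen_name': tweet['user']['screen_name'],
        'text': tweet.get('extended_tweet', {}).get('full_text', tweet.get('text')),
        'created_at': tweet['created_at'],
    }
    # Firehose concatenates records into a single S3 object, so the
    # trailing newline keeps the object parseable line by line
    client.put_record(
        DeliveryStreamName=delivery_stream,
        Record={'Data': json.dumps(record) + '\n'}
    )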
I'm using tweepy to datamine the public stream of tweets for keywords. This is pretty straightforward and has been described in multiple places:
http://runnable.com/Us9rrMiTWf9bAAW3/how-to-stream-data-from-twitter-with-tweepy-for-python
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Copying code directly from the second link:
#Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
#Variables that contains the user credentials to access Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"
#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):
def on_data(self, data):
        print(data)
return True
def on_error(self, status):
        print(status)
if __name__ == '__main__':
#This handles Twitter authetification and the connection to Twitter Streaming API
l = StdOutListener()
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, l)
#This line filter Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'
stream.filter(track=['python', 'javascript', 'ruby'])
What I can't figure out is how to capture this data in a Python variable instead of printing it to the screen. I'm working in an IPython notebook and want to capture the stream in some variable, foo, after streaming for a minute or so. Furthermore, how do I get the stream to time out? It runs indefinitely in this manner.
Related:
Using tweepy to access Twitter's Streaming API
Yes, in the post, @Adil Moujahid mentions that his code ran for 3 days. I adapted the same code and, for initial testing, made the following tweaks:
(a) Added a location filter to get a limited set of tweets instead of all tweets containing the keyword.
See How to add a location filter to tweepy module.
From here, you can create an intermediate variable in the above code as follows:
stream_all = Stream(auth, l)
Suppose we select the San Francisco area; we can add:
stream_SFO = stream_all.filter(locations=[-122.75,36.8,-121.75,37.8])
The assumption is that filtering by location takes less time than filtering by keywords.
(b) Then you can filter for the keywords:
tweet_iter = stream_SFO.filter(track=['python', 'javascript', 'ruby'])
(c) You can then write it to file as follows:
with open('file_name.json', 'w') as f:
json.dump(tweet_iter,f,indent=1)
This should take much less time. Coincidentally, I wanted to address the same question you posted today, so I don't have execution timings yet.
Hope this helps.
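As for getting the stream into a variable with a timeout: one way (a sketch, assuming Tweepy 3.x as in the question) is to collect tweets on a listener attribute and return False from on_data to disconnect once a count or time limit is reached. Note the time check only fires when a tweet arrives, so the timeout is approximate:
import json
import time
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

class VariableListener(StreamListener):
    """Collect parsed tweets in self.tweets, then disconnect."""
    def __init__(self, max_tweets=100, timeout=60):
        super(VariableListener, self).__init__()
        self.tweets = []
        self.max_tweets = max_tweets
        self.deadline = time.time() + timeout

    def on_data(self, data):
        self.tweets.append(json.loads(data))
        # Returning False tells tweepy to disconnect the stream
        if len(self.tweets) >= self.max_tweets or time.time() > self.deadline:
            return False
        return True

    def on_error(self, status):
        print(status)
        return False

listener = VariableListener(max_tweets=50, timeout=60)
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
Stream(auth, listener).filter(track=['python', 'javascript', 'ruby'])
foo = listener.tweets  # the captured tweets, as a plain Python list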
I notice that you are looking to stream data into a variable for later use. The way I have done this is to stream the data into a database using sqlite3 and SQLAlchemy.
For example, first here is the regular code:
import tweepy
import json
import time
import db_commands
import credentials
API_KEY = credentials.ApiKey
API_KEY_SECRET = credentials.ApiKeySecret
ACCESS_TOKEN = credentials.AccessToken
ACCESS_TOKEN_SECRET = credentials.AccessTokenSecret
def create_auth_instance():
"""Set up Authentication Instance"""
auth = tweepy.OAuthHandler(API_KEY, API_KEY_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit = True)
return api
class MyStreamListener(tweepy.StreamListener):
""" Listen for tweets """
def __init__(self, api=None):
self.counter = 0
# References the auth instance for the listener
self.api = create_auth_instance()
# Creates a database command instance
self.dbms = db_commands.MyDatabase(db_commands.SQLITE, dbname='mydb.sqlite')
# Creates a database table
self.dbms.create_db_tables()
def on_connect(self):
"""Notify when user connected to twitter"""
print("Connected to Twitter API!")
def on_status(self, tweet):
"""
        Every time a tweet is tracked, add the contents of the tweet,
its username, text, and date created, into a sqlite3 database
"""
user = tweet.user.screen_name
text = tweet.text
date_created = tweet.created_at
self.dbms.insert(user, text, date_created)
def on_error(self, status_code):
"""Handle error codes"""
if status_code == 420:
# Return False if stream disconnects
return False
def main():
"""Create twitter listener (Stream)"""
tracker_subject = input("Type subject to track: ")
twitter_listener = MyStreamListener()
myStream = tweepy.Stream(auth=twitter_listener.api.auth, listener=twitter_listener)
myStream.filter(track=[tracker_subject], is_async=True)
main()
As you can see in the code, we authenticate, create a listener, and then activate a stream:
twitter_listener = MyStreamListener()
myStream = tweepy.Stream(auth=twitter_listener.api.auth, listener=twitter_listener)
myStream.filter(track=[tracker_subject], is_async=True)
Every time we receive a tweet, the 'on_status' function will execute; it can be used to perform a set of actions on the tweet data being streamed.
def on_status(self, tweet):
"""
        Every time a tweet is tracked, add the contents of the tweet,
its username, text, and date created, into a sqlite3 database
"""
user = tweet.user.screen_name
text = tweet.text
date_created = tweet.created_at
self.dbms.insert(user, text, date_created)
The tweet data, tweet, is captured in three variables, user, text, and date_created, and then passed to the database controller initialized in the MyStreamListener class's __init__ function. The insert function is called from the imported db_commands file.
Here is the code in the db_commands.py file, which is imported with import db_commands.
from sqlalchemy import create_engine
from sqlalchemy import Table, Column, Integer, String, MetaData, ForeignKey
# Global Variables
SQLITE = 'sqlite'
# MYSQL = 'mysql'
# POSTGRESQL = 'postgresql'
# MICROSOFT_SQL_SERVER = 'mssqlserver'
# Table Names
TWEETS = 'tweets'
class MyDatabase:
# http://docs.sqlalchemy.org/en/latest/core/engines.html
DB_ENGINE = {
SQLITE: 'sqlite:///{DB}',
        # MYSQL: 'mysql://scott:tiger@localhost/{DB}',
        # POSTGRESQL: 'postgresql://scott:tiger@localhost/{DB}',
        # MICROSOFT_SQL_SERVER: 'mssql+pymssql://scott:tiger@hostname:port/{DB}'
}
# Main DB Connection Ref Obj
db_engine = None
def __init__(self, dbtype, username='', password='', dbname=''):
dbtype = dbtype.lower()
if dbtype in self.DB_ENGINE.keys():
engine_url = self.DB_ENGINE[dbtype].format(DB=dbname)
self.db_engine = create_engine(engine_url)
print(self.db_engine)
else:
print("DBType is not found in DB_ENGINE")
def create_db_tables(self):
metadata = MetaData()
tweets = Table(TWEETS, metadata,
Column('id', Integer, primary_key=True),
Column('user', String),
Column('text', String),
Column('date_created', String),
)
try:
metadata.create_all(self.db_engine)
print("Tables created")
except Exception as e:
print("Error occurred during Table creation!")
print(e)
# Insert, Update, Delete
def execute_query(self, query=''):
if query == '' : return
print (query)
with self.db_engine.connect() as connection:
try:
connection.execute(query)
except Exception as e:
print(e)
def insert(self, user, text, date_created):
# Insert Data
query = "INSERT INTO {}(user, text, date_created)"\
"VALUES ('{}', '{}', '{}');".format(TWEETS, user, text, date_created)
self.execute_query(query)
This code uses the SQLAlchemy package to create a sqlite3 database and post tweets to a tweets table. SQLAlchemy can easily be installed with pip install sqlalchemy. If you use these two files together, you should be able to scrape tweets through a filter into a database. Please let me know if this helps and if you have any further questions.
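One caveat on the insert method above: building the INSERT with string formatting will break as soon as a tweet contains a quote character, and it is open to SQL injection. A safer sketch using SQLAlchemy bound parameters (assuming SQLAlchemy 1.x, as the rest of the answer does):
from sqlalchemy import text

def insert(self, user, tweet_text, date_created):
    # Bound parameters let the driver handle quoting and escaping,
    # so tweets containing quotes or backslashes insert cleanly
    query = text(
        "INSERT INTO tweets (user, text, date_created) "
        "VALUES (:user, :text, :date_created)"
    )
    with self.db_engine.connect() as connection:
        connection.execute(query, user=user, text=tweet_text,
                           date_created=str(date_created))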
I've been trying to figure out Tweepy for the last 3 hours and I'm still stuck.
I would like to get all my friends' tweets for the period between September and October 2014, filtered to the top 10 by retweet count.
I'm only vaguely familiar with StreamListener; as I understand it, it gives a list of tweets in real time. I was wondering if I could go back to last month and grab those tweets from my friends. Can this be done through Tweepy? This is the code I have now.
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import csv
ckey = 'xyz'
csecret = 'xyz'
atoken = 'xyz'
asecret = 'xyz'
auth = OAuthHandler(ckey,csecret)
auth.set_access_token(atoken, asecret)
class Listener(StreamListener):
def on_data(self,data):
        print(data)
return True
def on_error(self,status):
        print(status)
auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, Listener())
users = [123456, 7890019, 9919038] # this is the list of friends I would like to output
twitterStream.filter(follow=users, since='09-01-2014', until='10-01-2014')
You are correct in that StreamListener returns real-time tweets. To get past tweets from specific users, you need to use Tweepy's REST API wrapper, tweepy.API. An example, which would replace everything from your Listener class down:
api = tweepy.API(auth)
tweetlist = api.user_timeline(id=123456)
This returns a list of up to 20 status objects. You can play with the parameters to get more results; count and since_id will probably be helpful for your implementation. I think the most you can ask for in a single request is 200 tweets.
P.S. Not a major issue, but you authenticate twice in your code, which is not necessary.
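To get more than the default 20 statuses, a sketch using Tweepy's Cursor to page through the timeline (count caps at 200 per request, and Twitter only serves roughly the most recent 3200 tweets of a timeline):
import tweepy

api = tweepy.API(auth)

# Cursor handles the max_id paging behind the scenes; items(1000)
# stops after 1000 tweets or when the timeline runs out
for status in tweepy.Cursor(api.user_timeline, id=123456, count=200).items(1000):
    print(status.created_at, status.text)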
I've been having some trouble using the GetStreamFilter function from the Python-Twitter library.
I have used this code:
import twitter
import time
consumer_key = 'myConsumerKey'
consumer_secret = 'myConsumerSecret'
access_token = 'myAccessToken'
access_token_secret = 'myAccessTokenSecret'
apiTest = twitter.Api(consumer_key,
consumer_secret,
access_token,
access_token_secret)
#print(apiTest.VerifyCredentials())
while (True):
stream = apiTest.GetStreamFilter(None, ['someWord'])
try:
print(stream.next())
except:
print ("No posts")
time.sleep(3)
What I want to do is fetch new tweets that include the word "someWord", and do this every three seconds (or whenever a new one is published).
You need to create the stream only once and outside the loop.
stream = apiTest.GetStreamFilter(None, ['someWord'])
while True:
for tweet in stream:
print(tweet)
How about replacing your while True with a loop that extracts tweets from the stream?
for tweet in apiTest.GetStreamFilter(track=['someWord']):
    print(tweet)