How do I save streaming tweets in json via tweepy? - python

I've been learning Python for a couple of months through online courses and would like to further my learning through a real-world mini project.
For this project, I would like to collect tweets from the Twitter streaming API and store them in JSON format. (You could choose to save only the key information like status.text and status.id, but I've been advised that the best approach is to save all the data and do the processing afterwards.) However, with the addition of on_data() the code ceases to work. Would someone be able to assist, please? I'm also open to suggestions on the best way to store/process tweets! My end goal is to be able to track tweets based on demographic variables (e.g., country, user profile age, etc.) and the sentiment towards particular brands (e.g., Apple, HTC, Samsung).
In addition, I would also like to try filtering tweets by location AND keywords. I've adapted the code from How to add a location filter to tweepy module separately. However, while it works when there are a few keywords, it stops when the number of keywords grows. I presume my code is inefficient. Is there a better way of doing it?
### code to save tweets in json ###
import sys
import tweepy
import json

consumer_key = " "
consumer_secret = " "
access_key = " "
access_secret = " "

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

file = open('today.txt', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print status.text

    def on_data(self, data):
        json_data = json.loads(data)
        file.write(str(json_data))

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True  # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True  # Don't kill the stream

sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(track=['twitter'])

I found a way to save the tweets to a json file. Happy to hear how it can be improved!
# initialize blank list to contain tweets
tweets = []
# file that the collected tweets will be appended to
save_file = open('9may.json', 'a')

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(tweepy.StreamListener, self).__init__()
        self.save_file = tweets

    def on_data(self, tweet):
        self.save_file.append(json.loads(tweet))
        print tweet
        save_file.write(str(tweet))

In rereading your original question, I realize that you ask a lot of smaller questions. I'll try to answer most of them here but some may merit actually asking a separate question on SO.
Why does it break with the addition of on_data?
Without seeing the actual error, it's hard to say. It actually didn't work for me until I regenerated my consumer/access keys; I'd try that.
There are a few things I might do differently than your answer.
tweets is a global list. This means that if you have multiple StreamListeners (i.e. in multiple threads), every tweet collected by any stream listener will be added to this list. This is because assigning a list to another name does not copy it; both names refer to the same object in memory. If that's confusing, here's a basic example of what I mean:
>>> bar = []
>>> foo = bar
>>> foo.append(7)
>>> print bar
[7]
Notice that even though you only appended 7 to foo, foo and bar actually refer to the same thing (and therefore changing one changes both).
If you meant to do this, it's a pretty great solution. However, if your intention was to segregate tweets from different listeners, it could be a huge headache. I personally would construct my class like this:
class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        super(CustomStreamListener, self).__init__()  # note: super() takes the subclass, not tweepy.StreamListener
        self.list_of_tweets = []
This changes the tweets list to be only in the scope of your class. Also, I think it's appropriate to change the property name from self.save_file to self.list_of_tweets, because you also name the file that you're appending the tweets to save_file. Although this will not strictly cause an error, it's confusing that self.save_file is a list while save_file is a file. It helps future you and anyone else who reads your code figure out what everything does/is. More on variable naming.
In my comment, I mentioned that you shouldn't use file as a variable name. file is a Python builtin that constructs a new object of type file. You can technically shadow it, but it is a very bad idea to do so. For more builtins, see the Python documentation.
How do I filter results on multiple keywords?
All keywords are OR'd together in this type of search:
sapi.filter(track=['twitter', 'python', 'tweepy'])
This means that this will get tweets containing 'twitter', 'python', or 'tweepy'. If you want the intersection (AND) of all the terms, you have to post-process by checking each tweet against the list of all terms you want to search for, as in the sketch below.
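For instance, a minimal post-processing sketch (the term list and the use of status.text here are purely illustrative, not part of the original answer):
# Hypothetical helper: keep only tweets that contain ALL of the tracked terms.
required_terms = ['twitter', 'python', 'tweepy']

def matches_all_terms(text, terms=required_terms):
    lowered = text.lower()
    return all(term in lowered for term in terms)

# e.g. inside your listener's on_status:
#     if matches_all_terms(status.text):
#         self.list_of_tweets.append(status._json)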
How do I filter results based on location AND keyword?
I was about to suggest asking this as its own question, and then realized that you already did. A regex post-processing solution is a good way to accomplish this. You could also try filtering by both location and keyword like so:
sapi.filter(locations=[103.60998,1.25752,104.03295,1.44973], track=['twitter'])
What is the best way to store/process tweets?
That depends on how many you'll be collecting. I'm a fan of databases, especially if you're planning to do a sentiment analysis on a lot of tweets. When you collect data, you should only collect things you will need. This means that when you save results to your database (or wherever) in your on_data method, you should extract the important parts of the JSON and not save anything else. If, for example, you want to look at brand, country, and time, only take those three things; don't save the entire JSON dump of the tweet, because it'll just take up unnecessary space.
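As a sketch of that idea (the field names come from the standard v1.1 tweet JSON; the listener name and output file are placeholders):
import json
import tweepy

class SlimStreamListener(tweepy.StreamListener):
    def on_data(self, raw_data):
        tweet = json.loads(raw_data)
        # keep only the fields needed for the analysis
        slim = {
            'id': tweet.get('id_str'),
            'text': tweet.get('text'),
            'created_at': tweet.get('created_at'),
            'country': (tweet.get('place') or {}).get('country_code'),
        }
        with open('slim_tweets.json', 'a') as out:  # or insert into a database
            out.write(json.dumps(slim) + '\n')
        return True  # keep the stream alive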

I just insert the raw JSON into the database. It seems a bit ugly and hacky, but it does work. A notable problem is that the creation dates of the tweets are stored as strings. How do I compare dates from Twitter data stored in MongoDB via PyMongo? provides a way to fix that (I inserted a comment in the code to indicate where one would perform that task).
# ...
client = pymongo.MongoClient()
db = client.twitter_db
twitter_collection = db.tweets
# ...

class CustomStreamListener(tweepy.StreamListener):
    # ...
    def on_status(self, status):
        try:
            twitter_json = status._json
            # TODO: Transform created_at to Date objects before insertion
            tweet_id = twitter_collection.insert(twitter_json)
        except:
            # Catch any unicode errors while printing to console
            # and just ignore them to avoid breaking application.
            pass
    # ...

stream = tweepy.Stream(auth, CustomStreamListener(), timeout=None, compression=True)
stream.sample()
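For completeness, a rough sketch of that created_at transformation (assuming the standard date string the v1.1 API returns, e.g. 'Wed Aug 27 13:08:45 +0000 2008'):
from datetime import datetime

def with_parsed_date(twitter_json):
    # Hypothetical helper: convert created_at into a datetime so MongoDB
    # stores it as a Date and range queries behave as expected.
    created = twitter_json.get('created_at')
    if created:
        twitter_json['created_at'] = datetime.strptime(
            created, '%a %b %d %H:%M:%S +0000 %Y')
    return twitter_json

# e.g. tweet_id = twitter_collection.insert(with_parsed_date(status._json))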

Related

Problem with getting tweet_fields from Twitter API 2.0 using Tweepy

I have a similar problem to the one in this question (Problem with getting user.fields from Twitter API 2.0),
but I am using Tweepy. When making the request with tweet_fields, the response is only giving me the default values. In another function where I use user_fields it works perfectly.
I followed this guide, specifically number 17 (https://dev.to/twitterdev/a-comprehensive-guide-for-using-the-twitter-api-v2-using-tweepy-in-python-15d9)
My function looks like this:
def get_user_tweets():
    client = get_client()
    tweets = client.get_users_tweets(id=get_user_id(), max_results=5)
    ids = []
    for tweet in tweets.data:
        ids.append(str(tweet.id))
    tweets_info = client.get_tweets(ids=ids, tweet_fields=["public_metrics"])
    print(tweets_info)
This is my response (with the last tweets from elonmusk); there is no error code or anything else:
Response(data=[<Tweet id=1471419792770973699 text=#WholeMarsBlog I came to the US with no money & graduated with over $100k in debt, despite scholarships & working 2 jobs while at school>, <Tweet id=1471399837753135108 text=#TeslaOwnersEBay #PPathole #ScottAdamsSays #johniadarola #SenWarren It’s complicated, but hopefully out next quarter, along with Witcher. Lot of internal debate as to whether we should be putting effort towards generalized gaming emulation vs making individual games work well.>, <Tweet id=1471393851843792896 text=#PPathole #ScottAdamsSays #johniadarola #SenWarren Yeah!>, <Tweet id=1471338213549744130 text=link>, <Tweet id=1471325148435394566 text=#24_7TeslaNews #Tesla ❤️>], includes={}, errors=[], meta={})
I found this link: https://giters.com/tweepy/tweepy/issues/1670. According to it,
Response is a namedtuple. Here, within its data field, is a single Tweet object.
The string representation of a Tweet object will only ever include its ID and text. This was an intentional design choice, to reduce the excess of information that could be displayed when printing all the data as the string representation, as with models.Status. The ID and text are the only default / guaranteed fields, so the string representation remains consistent and unique, while still being concise. This design is used throughout the API v2 models.
To access the data of the Tweet object, you can use attributes or keys (like a dictionary) to access each field.
If you want all the data as a dictionary, you can use the data attribute/key.
In that case, to access public metrics, you could maybe try doing this instead:
tweets_info = client.get_tweets(ids=ids, tweet_fields=["public_metrics"])

for tweet in tweets_info.data:
    print(tweet["id"])
    print(tweet["public_metrics"])

How to check if extended entity is present or not in tweepy response

I am able to fetch different tweet parameters from a tweet:
keyword = tweepy.Cursor(api.search, val, tweet_mode='extended', lang='en').items(2)
tweetdone = 0
all_tweet = []

for tweet in keyword:
    tweet_record = {}
    tweet_record['tweet.text'] = tweet.full_text
    tweet_record['tweet.user.name'] = tweet.user.name
    tweet_record['tweet.user.location'] = tweet.user.location
    tweet_record['tweet.user.verified'] = tweet.user.verified
    tweet_record['tweet.lang'] = tweet.lang
    tweet_record['tweet.created_at'] = tweet.created_at
    tweet_record['tweet.user'] = tweet.user
    tweet_record['tweet.retweet_count'] = tweet.retweet_count
    tweet_record['tweet.favorite_count'] = tweet.favorite_count
I want to parse media objects from the tweet, but extended_entities, which contains media_url, is not present in all tweets.
so if I try to fetch it like this:
tweet_record['media_url'] = tweet.extended_entities.media_url
It errors out because extended_entities may not be present in some tweets.
How do I deal with this issue and fetch the media content correctly?
You have a couple of options here: you can check whether the key exists, or use some try/excepts.
Check whether key exists:
You can do this because tweepy returns a status object, which behaves similarly to a JSON document or a Python dictionary, so you essentially have key:value pairs. You should be able to use (going by your above code):
if 'extended_entities' in tweet:
    tweet_record['media_url'] = tweet.extended_entities.media_url
of course, the reverse is also possible
if 'extended_entities' not in tweet:
    # whatever you want to do
This could lead to problems, though: what if extended_entities exists, but for some reason media_url doesn't? And what if you want to get something even deeper than that (there isn't anything deeper for a status object, but hey, I'm just trying to future-proof here!)? You'd have to write long or multiply nested if statements, which won't look the best:
if 'extended_entities' in tweet:
    if 'media_url' in tweet['extended_entities']:
        # etc
so it might be easier to just throw it in a try except...
try:
    tweet_record['media_url'] = tweet.extended_entities.media_url
except AttributeError:
    # etc
This means the program won't error when particular elements aren't found; AttributeError is raised when you access an attribute an object doesn't have. You may of course want to re-order this for readability. Keep in mind, though, that while doing this is Pythonic, in my opinion it can be a bit hard to read if used too often.
I referred to this question when looking up things for this answer. Gives some good ideas for this sort of thing if you need further help.
Hope that helps.
Also, a good option is to use hasattr(Object, name) within an if-statement:
if hasattr(tweet, "extended_entities"):
    # do whatever
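A compact variant of the same idea, sketched here on the assumption that extended_entities is exposed as a plain dict on the status object (as it is in recent tweepy versions), is getattr with a default:
# Hypothetical sketch: fall back to an empty dict/list when the tweet
# has no extended_entities or no media attached.
media_list = getattr(tweet, 'extended_entities', {}).get('media', [])
tweet_record['media_urls'] = [m.get('media_url') for m in media_list]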

Twython with 140 character limitation of twitter

I am trying to search Twitter using Tython, but it seems that the library has a 140-character limitation. With Twitter's new feature, i.e. the 280-character length, what can one do?
This is not a limitation of Twython. The Twitter API by default returns the old 140-character limited tweet. In order to see the newer extended tweet you just need to supply this parameter to your search query:
tweet_mode=extended
Then, you will find the 280-character extended tweet in the full_text field of the returned tweet.
I use another library (TwitterAPI), but I think you would do something like this using Twython:
results = api.search(q='pizza', tweet_mode='extended')
for result in results['statuses']:
    print(result['full_text'])
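A more self-contained sketch of the same idea with Twython (the credentials are placeholders; Twython passes extra keyword arguments such as tweet_mode through to the search endpoint):
from twython import Twython

# Placeholder credentials (OAuth 1 user auth, as in the other examples here)
APP_KEY = 'MY APP KEY'
APP_SECRET = 'MY APP SECRET'
OAUTH_TOKEN = 'MY OAUTH TOKEN'
OAUTH_TOKEN_SECRET = 'MY OAUTH TOKEN SECRET'

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
results = twitter.search(q='pizza', tweet_mode='extended', count=10)

for status in results['statuses']:
    print(status['full_text'])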
Unfortunately, I am unable to find anything related to "Tython". However, if searching Twitter data (in this case posts) and/or gathering metadata is your goal, I would recommend having a look at the library TwitterSearch.
Here is a quick example from the provided link, searching for Twitter posts containing the words Guttenberg and Doktorarbeit.
from TwitterSearch import *

try:
    tso = TwitterSearchOrder()  # create a TwitterSearchOrder object
    tso.set_keywords(['Guttenberg', 'Doktorarbeit'])  # let's define all words we would like to have a look for
    tso.set_language('de')  # we want to see German tweets only
    tso.set_include_entities(False)  # and don't give us all that entity information

    # it's about time to create a TwitterSearch object with our secret tokens (API auth credentials)
    ts = TwitterSearch(
        consumer_key='aaabbb',
        consumer_secret='cccddd',
        access_token='111222',
        access_token_secret='333444'
    )

    # this is where the fun actually starts :)
    for tweet in ts.search_tweets_iterable(tso):
        print('@%s tweeted: %s' % (tweet['user']['screen_name'], tweet['text']))

except TwitterSearchException as e:  # take care of all those ugly errors if there are some
    print(e)

Twython getting tweets from user

I am using Twython to get a stream of tweets. I used this tutorial, except that I am not using GPIO.
My code is the following:
import time
from twython import TwythonStreamer

TERMS = '#stackoverflow'
APP_KEY = 'MY APP KEY'
APP_SECRET = 'MY APP SECRET'
OAUTH_TOKEN = 'MY OAUTH TOKEN'
OAUTH_TOKEN_SECRET = 'MY OAUTH TOKEN SECRET'

class BlinkyStreamer(TwythonStreamer):
    def on_success(self, data):
        if 'text' in data:
            print data['text'].encode('utf-8')

try:
    stream = BlinkyStreamer(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
    stream.statuses.filter(track=TERMS)
except KeyboardInterrupt:
    pass
That outputs a stream of all tweets containing #stackoverflow. But I want to output a tweet only if it is from a certain user, e.g. @StackStatus.
I am running this on a Raspberry Pi.
How would I do that? Any help is appreciated!
Edit: if there is another (or easier) way to execute some script when a new tweet is posted by some user, please let me know; this would solve my question as well!
The 'follow' parameter does not work as stated above by teknoboy. Correct usage is with the user's ID, not their screen name. You can get user IDs using http://gettwitterid.com/.
The third parameter available is location; you can use one, two, or three of them as desired. They are combined with OR, not AND.
Example Usage:
SearchTerm = 'abracadabra'  # If spaces are included, they are 'OR', i.e. it finds tweets with any one of the words, not the whole string.
Tweeter = '25073877'  # This is Donald Trump; finds tweets from him or mentioning him
Place = '"47.405,-177.296,1mi"'  # Sent from within 1 mile of Lat, Long
stream.statuses.filter(track=SearchTerm, follow=Tweeter, location=Place)
You should supply the filter with the follow parameter to stream a specific user's tweets.
if you wish to only follow one user, you can define
FOLLOW='StackStatus'
and change the appropriate line to
stream.statuses.filter(track=TERMS, follow=FOLLOW)
if you wish to see all the user's tweets, regardless of keyword, you can omit the track parameter:
stream.statuses.filter(follow=FOLLOW)
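One caveat worth noting (this goes beyond the original answer): the streaming follow parameter also delivers retweets of and replies to the followed account, so if you only want tweets authored by that user you can add a check in on_success. A sketch with a placeholder user ID:
from twython import TwythonStreamer

class SingleUserStreamer(TwythonStreamer):
    TARGET_USER_ID = '123456789'  # placeholder: numeric ID of the account to follow

    def on_success(self, data):
        user = data.get('user', {})
        if user.get('id_str') == self.TARGET_USER_ID and 'text' in data:
            # run whatever action you want when the user posts a new tweet
            print(data['text'].encode('utf-8'))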

Create classes that grab youtube queries and display information using Python

I want to use the urllib module to send HTTP requests and grab data. I can get the data by using the urlopen() function, but I'm not really sure how to incorporate it into classes. I really need help with the Query class to move forward. From the query I need to pull:
• Top Rated
• Top Favorites
• Most Viewed
• Most Recent
• Most Discussed
My issue is that I can't parse the XML document to retrieve this data. I also don't know how to use classes to do it.
Here is what I have so far:
import urllib  # this allows the program to send HTTP requests and to read the responses

class Query:
    '''Performs the actual HTTP requests and initial parsing to build the Video
    objects from the response. It will also calculate the following information
    based on the video and user results.'''

    def __init__(self, feed_id, max_results):
        '''Takes as input the type of query (feed_id) and the maximum number of
        results (max_results) that the query should obtain. The correct HTTP
        request must be constructed and submitted. The results are converted
        into Video objects, which are stored within this class.
        '''
        self.feed = feed_id
        self.max = max_results
        top_rated = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
        results_str = top_rated.read()
        splittedlist = results_str.split('<entry')
        top_rated.close()

    def __str__(self):
        '''Prints out the information on each video and YouTube user.'''
        pass

class Video:
    pass

class User:
    pass

# main function: this handles all the user inputs and stuff.
def main():
    useinput = raw_input('''Welcome to the YouTube text-based query application.
You can select a popular feed to perform a query on and view statistical
information about the related videos and users.
1) today
2) this week
3) this month
4) since youtube started
Please select a time (or 'Q' to quit):''')
    secondinput = raw_input("\n1) Top Rated\n2) Top Favorited\n3) Most Viewed\n4) Most Recent\n5) Most Discussed\n\nPlease select a feed (or 'Q' to quit):")
    thirdinput = raw_input("Enter the maximum number of results to obtain:")

main()

toplist = []
top_rated = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
result_str = top_rated.read()
top_rated.close()
splittedlist = result_str.split('<entry')
x = splittedlist[1].find('title')  # find the title index
splittedlist[1][x: x+75]  # string around the title (/ marks the end of the title)
w = splittedlist[1][x: x+75].find(">")  # gives you the start index
z = splittedlist[1][x: x+75].find("<")  # gives you the end index
titles = splittedlist[1][x: x+75][w+1:z]  # gives you the title!!!!
toplist.append(titles)
print toplist
I assume that your challenge is parsing XML.
results_str = top_rated.read()
splittedlist = results_str.split('<entry')
And I see you are using string functions to parse XML. Such functions based on finite automata (regular languages) are NOT suited for parsing context-free languages such as XML. Expect it to break very easily.
For more reasons, please refer to RegEx match open tags except XHTML self-contained tags.
Solution: consider using an XML parser like ElementTree. It comes with Python and allows you to browse the XML tree pythonically. http://effbot.org/zone/element-index.htm
You may come up with code like:
import xml.etree.ElementTree as ET
# ...
results_str = top_rated.read()
root = ET.fromstring(results_str)
for node in root:
    print node
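To actually pull the video titles out, a rough sketch (assuming the feed is standard Atom XML, whose namespace has to be spelled out in the element paths):
import urllib
import xml.etree.ElementTree as ET

ATOM_NS = '{http://www.w3.org/2005/Atom}'  # Atom namespace used by the GData feeds

feed = urllib.urlopen("http://gdata.youtube.com/feeds/api/standardfeeds/top_rated")
root = ET.fromstring(feed.read())
feed.close()

# each <entry> is one video; its <title> child holds the video title
titles = [entry.find(ATOM_NS + 'title').text
          for entry in root.findall(ATOM_NS + 'entry')]
print titles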
I also don't know how to use classes to do it.
Don't be in a rush to create classes :-)
In the above example, you are importing a module, not importing a class and instantiating/initializing it, as you would in Java. Python has powerful primitive types (dictionaries, lists) and treats modules as objects, so (IMO) you can go easy on classes.
You use classes to organize stuff, not because your teacher has indoctrinated you with "classes are good. Let's have lots of them."
Basically you want to use the Query class to communicate with the API.
def __init__(self, feed_id, max_results, time):
    qs = ("http://gdata.youtube.com/feeds/api/standardfeeds/" + feed_id +
          "?max-results=" + str(max_results) + "&time=" + time)
    self.feed_id = feed_id
    self.max_results = max_results
    wo = urllib.urlopen(qs)
    result_str = wo.read()
    wo.close()
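Calling it would then look something like this (the feed name and time value are just illustrative; these old GData feeds accepted time values such as today or this_month):
# hypothetical usage of the Query initializer sketched above
q = Query('top_rated', 10, 'this_month')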
