How to get tweets data that contain multiple keywords - python

I'm trying to accumulate tweet data using this typical code. As you can see, I'm tracking tweets that contain 'UniversalStudios', 'Disneyland' OR 'LosAngeles'. But what I really want are tweets that contain the keywords 'UniversalStudios', 'Disneyland' AND 'LosAngeles' all together. Can anyone tell me how to achieve that?
Thanks a lot in advance :)
# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):
    def on_data(self, data):
        all_data = json.loads(data)
        tweet = TextBlob(all_data["text"])
        # Add the 'sentiment' data to all_data
        # all_data['sentiment'] = tweet.sentiment
        # print(tweet)
        # print(tweet.sentiment)

        # Open a JSON text file to save the tweets
        with open('tweets.json', 'a') as tf:
            # Write a new line
            tf.write('\n')
            # Write the JSON data directly to the file
            json.dump(all_data, tf)
            # Alternatively: tf.write(json.dumps(all_data))
        return True

    def on_error(self, status):
        print(status)
if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    # This line filters Twitter streams to capture data by the keywords 'UniversalStudios', 'Disneyland', 'LosAngeles'
    stream.filter(languages=['en'], track=['UniversalStudios', 'Disneyland', 'LosAngeles'])

Twitter's API (see "track") mentions you need to have spaces between the phrases to mean ANDs (commas are ORs). I'm not sure how the library you're using handles it, but my bet would be:
track=['UniversalStudios Disneyland LosAngeles']
The quote from the docs:
By this model, you can think of commas as logical ORs, while spaces are equivalent to logical ANDs (e.g. ‘the twitter’ is the AND twitter, and ‘the,twitter’ is the OR twitter).
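If the library passes the track list straight through (Tweepy does for this parameter), you can also mix the two: each list element is a space-separated AND group, and separate elements are ORed together. A minimal sketch of that combination (the second group is purely illustrative):

# Each element is one AND group; elements are ORed together.
# This matches tweets containing all three keywords, OR tweets
# containing both 'Disneyland' and 'UniversalStudios'.
stream.filter(
    languages=['en'],
    track=['UniversalStudios Disneyland LosAngeles',
           'Disneyland UniversalStudios'],
)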

Related

How to stream Tweets by hashtag with language AND count filter using Tweepy?

So what I want to do is live-stream Tweets from Twitter's API: for just the hashtag 'Brexit', only in the English language, and for a specific number of Tweets (1k - 2k).
So far my code will live-stream the Tweets, but however I modify it I either end up with it ignoring the count and streaming indefinitely, or I get errors. If I change it to stream only a specific user's Tweets, the count works but the hashtag is ignored. If I stream everything for the given hashtag, the count is ignored entirely. I've had a decent go at trying to fix it, but I'm quite inexperienced and have really hit a brick wall.
Some help with how to tick all these boxes at the same time would be much appreciated!
The code below so far will just stream 'Brexit' Tweets indefinitely, ignoring the count=10.
The bottom of the code is a bit of a mess from me playing with it, apologies:
import numpy as np
import pandas as pd
import tweepy
from tweepy import API
from tweepy import Cursor
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import Twitter_Credentials
import matplotlib.pyplot as plt

# Twitter client - hash out to stream all
class TwitterClient:
    def __init__(self, twitter_user=None):
        self.auth = TwitterAuthenticator().authenticate_twitter_app()
        self.twitter_client = API(self.auth)
        self.twitter_user = twitter_user

    def get_twitter_client_api(self):
        return self.twitter_client

# Twitter authenticator
class TwitterAuthenticator:
    def authenticate_twitter_app(self):
        auth = OAuthHandler(Twitter_Credentials.consumer_key, Twitter_Credentials.consumer_secret)
        auth.set_access_token(Twitter_Credentials.access_token, Twitter_Credentials.access_secret)
        return auth

class TwitterStreamer():
    # Class for streaming and processing live Tweets
    def __init__(self):
        self.twitter_authenticator = TwitterAuthenticator()

    def stream_tweets(self, fetched_tweets_filename, hash_tag_list):
        # This handles Twitter authentication and connection to the Twitter API
        listener = TwitterListener(fetched_tweets_filename)
        auth = self.twitter_authenticator.authenticate_twitter_app()
        stream = Stream(auth, listener)
        # This line filters the Twitter stream to capture data by keywords
        stream.filter(track=hash_tag_list)

# Twitter stream listener
class TwitterListener(StreamListener):
    # This is a listener class that prints incoming Tweets to stdout
    def __init__(self, fetched_tweets_filename):
        self.fetched_tweets_filename = fetched_tweets_filename

    def on_data(self, data):
        try:
            print(data)
            with open(self.fetched_tweets_filename, 'a') as tf:
                tf.write(data)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            return True

    def on_error(self, status):
        if status == 420:
            # Return False in case a rate limit occurs
            return False
        print(status)

class TweetAnalyzer():
    # Functionality for analysing and categorising content from tweets
    def tweets_to_data_frame(self, tweets):
        df = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['tweets'])
        df['id'] = np.array([tweet.id for tweet in tweets])
        df['len'] = np.array([len(tweet.text) for tweet in tweets])
        df['date'] = np.array([tweet.created_at for tweet in tweets])
        df['source'] = np.array([tweet.source for tweet in tweets])
        df['likes'] = np.array([tweet.favorite_count for tweet in tweets])
        df['retweets'] = np.array([tweet.retweet_count for tweet in tweets])
        return df

if __name__ == "__main__":
    auth = OAuthHandler(Twitter_Credentials.consumer_key, Twitter_Credentials.consumer_secret)
    auth.set_access_token(Twitter_Credentials.access_token, Twitter_Credentials.access_secret)
    api = tweepy.API(auth)

    for tweet in Cursor(api.search, q="#brexit", count=10,
                        lang="en",
                        since="2019-04-03").items():
        fetched_tweets_filename = "tweets.json"
        twitter_streamer = TwitterStreamer()
        hash_tag_list = ["Brexit"]
        twitter_streamer.stream_tweets(fetched_tweets_filename, hash_tag_list)
You're trying to use two different methods of accessing the Twitter API - Streaming is realtime, and searching is a one-off API call.
Since streaming is continuous and realtime, there's no way to apply a count of results to it - the code simply opens a connection, says "hey, send me all the Tweets from now onwards that contain the hash_tag_list", and sits listening. At that point you then drop into the StreamListener, where for each Tweet received, you write them into a file.
You could apply a counter here, but you'd need to wrap it inside your StreamListener on_data handler, and increment the counter for each Tweet received. When you get to 1000 Tweets, stop listening.
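A minimal sketch of that counter, assuming the same Tweepy StreamListener API as the code above (the limit of 1000 is illustrative):

class CountingListener(StreamListener):
    # Writes each Tweet to a file and disconnects after max_tweets.
    def __init__(self, fetched_tweets_filename, max_tweets=1000):
        self.fetched_tweets_filename = fetched_tweets_filename
        self.max_tweets = max_tweets
        self.count = 0

    def on_data(self, data):
        with open(self.fetched_tweets_filename, 'a') as tf:
            tf.write(data)
        self.count += 1
        # Returning False from on_data tells Tweepy to disconnect the stream.
        return self.count < self.max_tweets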
For the search option, you have a couple of issues... the first one is that you're asking for Tweets since 2019, but the standard search API can only go back 7 days in time. You've obviously asked for only 10 Tweets there. The way you've written the method though, what's actually happening is that for each Tweet in the collection of 10 that the API returns, you then create a realtime streaming connection and start listening and writing to a file. So that's not going to work.
You'll need to choose one - either search for 1000 Tweets and write them to a file (never set up TwitterStreamer()), or, listen for 1000 Tweets and write them to a file (drop the for Tweet in Cursor(api.search... and jump straight to the streamer).
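If you go the search route, here's a minimal sketch reusing the Cursor call from the question; .items(1000) caps the total, and ._json is the raw payload dict Tweepy keeps on each result:

import json

# Search for up to 1,000 English '#Brexit' Tweets and write one JSON object per line.
with open("tweets.json", "a") as tf:
    for tweet in Cursor(api.search, q="#Brexit", lang="en").items(1000):
        tf.write(json.dumps(tweet._json) + "\n")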
Simply add the hashtag symbol to the search phrase in the list, and it'll match tweets that use that specific hashtag (track matching is case-insensitive per the docs, so "#Brexit" also covers "#brexit"). Merely using "Brexit" matches tweets that contain the keyword whether or not they use the hashtag.
hash_tag_list = ["#Brexit"]

Extracting raw Tweet data with Twitter API without shortened t.co link

I am learning to use the Twitter API with Tweepy. I would like help with extracting raw Tweet data - meaning no shortened URLs. This Tweet, for example, shows a YouTube link but when parsed by the API, prints a t.co link. How can I print the text as displayed? Thanks for your help.
Note: I have a similar concern as this question, but it is not the same.
Function code:
def get_tweets(username):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    # Call the API
    api = tweepy.API(auth)
    tweets = api.user_timeline(screen_name=username)

    # Empty array
    tmp = []

    # Create array of tweet information: username, tweet id, date/time, text
    tweets_for_csv = [tweet.text for tweet in tweets]  # CSV file created
    for j in tweets_for_csv:
        # Append tweets to the empty array tmp
        tmp.append(j)

    dict1 = {}
    punctuation = '''`~!##$%^&*(){}[];:'".,\/?'''
    tmps = str(tmp)
    for char in tmps:
        if char in punctuation:
            tmps = tmps.replace(char, " ")
    tmps2 = tmps.split(" ")
    a = 0
    while a < len(tmps2):
        for b in tmps2:
            dict1[a] = b
            a += 1
Twitter's API returns the raw Tweet data without any parsing. This data includes shortened URLs because that's how the Tweet is represented. Twitter itself simply parses and displays the original URL. The link itself is even still the shortened one.
Tweet objects have an entities attribute, which provides an entities object with a urls field that is an array of URL objects, representing the URLs included in the text of the Tweet, or an empty array if no links are present. Each URL object includes a display_url field with the original URL pasted/typed into the Tweet and an indices field that is an array of integers representing offsets within the Tweet text where the URL begins and ends. You can use these fields to replace the shortened URL.
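A minimal sketch of that replacement with Tweepy, assuming a Status object like the ones returned by api.user_timeline above; it swaps each t.co link for the expanded_url that sits alongside display_url in the entities:

def expand_urls(status):
    # Replace each shortened t.co link in the tweet text with its expanded URL.
    text = status.text
    for url in status.entities.get('urls', []):
        text = text.replace(url['url'], url['expanded_url'])
    return text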

Python issue with saving the output

As a new user to Python, I have hit an issue with the following code. Instead of only printing the results of a Twitter search on the screen, I need to save them to a file (ideally pipe-delimited, which I don't yet know how to produce...). However, the following code runs OK but doesn't create the Output.txt file. It did once and then never again. I am running it on Mac OS and ending the code with Ctrl+C (as I still don't know how to modify it to return only a specific number of tweets). I thought the issue might be related to flushing, but after trying the options from this post: Flushing issues, none of them seemed to work (unless I did something wrong, which is more than probable...).
import tweepy
import json
import sys

# Authentication details. To obtain these visit dev.twitter.com
consumer_key = 'xxxxxx'
consumer_secret = 'xxxxx'
access_token = 'xxxxx-xxxx'
access_token_secret = 'xxxxxxxx'

# This is the listener, responsible for receiving data
class StdOutListener(tweepy.StreamListener):
    def on_data(self, data):
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(data)
        # Also, we convert UTF-8 to ASCII ignoring all bad characters sent by users
        print '#%s: %s' % (decoded['user']['screen_name'], decoded['text'].encode('ascii', 'ignore'))
        print ''
        return True

    def on_error(self, status):
        print status

if __name__ == '__main__':
    l = StdOutListener()
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    print "Showing all new tweets for #Microsoft"
    stream = tweepy.Stream(auth, l)
    stream.filter(track=['Microsoft'])
    sys.stdout = open('Output.txt', 'w')
I think you would be much better off changing StdOutListener to write to the file directly. Assigning sys.stdout to a file is... weird. This way, you can still print things for debug output. Also note that file mode "w" truncates the file when it's opened.
class TweepyFileListener(tweepy.StreamListener):
    def on_data(self, data):
        print "on_data called"
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(data)
        msg = '#%s: %s\n' % (
            decoded['user']['screen_name'],
            decoded['text'].encode('ascii', 'ignore'))
        # You should really open the file in __init__
        # You should also use a RotatingFileHandler or this guy will get massive
        with open("Output.txt", "a") as tweet_log:
            print "Received: %s\n" % msg
            tweet_log.write(msg)
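Since you mention wanting pipe-delimited output, here's a minimal sketch of a helper you could call from on_data with the decoded dict (the field choice and the Output.psv name are illustrative):

def write_pipe_delimited(decoded, path="Output.psv"):
    # One pipe-delimited line per tweet: screen name, timestamp, text.
    # Strip any '|' from the text so it can't break the delimiting.
    fields = [
        decoded['user']['screen_name'],
        decoded['created_at'],
        decoded['text'].encode('ascii', 'ignore').replace('|', ' '),
    ]
    with open(path, "a") as tweet_log:
        tweet_log.write('|'.join(fields) + '\n')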

How to add a location filter to tweepy module

I have found the following piece of code that works pretty well for letting me view, in the Python shell, the standard 1% sample of the Twitter firehose:
import sys
import tweepy

consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print status.text

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True  # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True  # Don't kill the stream

sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(track=['manchester united'])
How do I add a filter to only parse tweets from a certain location? I've seen people adding GPS coordinates to other Twitter-related Python code, but I can't find anything specific to sapi within the Tweepy module.
Any ideas?
Thanks
The streaming API doesn't allow you to filter by location AND keyword simultaneously.
Bounding boxes do not act as filters for other filter parameters. For example
track=twitter&locations=-122.75,36.8,-121.75,37.8 would match any tweets containing
the term Twitter (even non-geo tweets) OR coming from the San Francisco area.
Source: https://dev.twitter.com/docs/streaming-apis/parameters#locations
What you can do is ask the streaming API for keyword-based or location-based tweets, and then filter the resulting stream in your app by inspecting each tweet.
If you modify the code as follows, you will capture tweets in the United Kingdom; those tweets are then filtered to show only the ones containing "manchester united":
import sys
import tweepy

consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        if 'manchester united' in status.text.lower():
            print status.text

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True  # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True  # Don't kill the stream

sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(locations=[-6.38, 49.87, 1.77, 55.81])
Juan gave the correct answer. I'm filtering for Germany only using this:
# Bounding boxes for geolocations
# Online-Tool to create boxes (c+p as raw CSV): http://boundingbox.klokantech.com/
GEOBOX_WORLD = [-180,-90,180,90]
GEOBOX_GERMANY = [5.0770049095, 47.2982950435, 15.0403900146, 54.9039819757]
stream.filter(locations=GEOBOX_GERMANY)
This is a pretty crude box that includes parts of some other countries. If you want a finer grain you can combine multiple boxes to fill out the location you need.
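A sketch of combining boxes: the locations parameter takes a flat list of coordinates, four values per box (SW longitude, SW latitude, NE longitude, NE latitude), so you can concatenate boxes into one call. The Austria box here is approximate and purely illustrative:

GEOBOX_GERMANY = [5.0770049095, 47.2982950435, 15.0403900146, 54.9039819757]
GEOBOX_AUSTRIA = [9.47, 46.37, 17.16, 49.02]  # approximate, illustrative

# Each group of four values is one bounding box.
stream.filter(locations=GEOBOX_GERMANY + GEOBOX_AUSTRIA)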
It should be noted, though, that filtering by geotags limits the number of tweets quite a bit. This is from roughly 5 million Tweets in my test database (the query returns the percentage of tweets that actually contain a geolocation):
> db.tweets.find({coordinates:{$ne:null}}).count() / db.tweets.count()
0.016668392651547598
So only 1.67% of my sample of the 1% stream includes a geotag. However, there are other ways of figuring out a user's location:
http://arxiv.org/ftp/arxiv/papers/1403/1403.2345.pdf
You can't apply both filters server-side as an AND: passing both, as below, is treated as an OR (see the docs quoted above). So you could filter by location at the output stage instead, if you were writing the tweets to a file.
sapi.filter(track=['manchester united'], locations=['GPS Coordinates'])
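A minimal sketch of that output-stage check, assuming the CustomStreamListener above and the UK bounding box from the earlier answer; it tests the tweet's GeoJSON coordinates field:

UK_BOX = [-6.38, 49.87, 1.77, 55.81]  # SW lon, SW lat, NE lon, NE lat

def in_box(status, box):
    # status.coordinates is GeoJSON: {'type': 'Point', 'coordinates': [lon, lat]}
    if status.coordinates is None:
        return False
    lon, lat = status.coordinates['coordinates']
    return box[0] <= lon <= box[2] and box[1] <= lat <= box[3]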

Downloading Full JSON data from Tweets Using Rest API and Tweepy, Querying by Tweet ID

Brand new to using tweepy and Twitter's API(s) in general, and I've realized (too late) that I've made a number of mistakes in collecting some Twitter data. I've been collecting tweets about the winter olympics and had been using the Streaming API to filter by search terms. However, instead of retrieving all the data available, I've only retrieved text, datetime, and Tweet ID. An example of the implemented stream listener is below:
import os
import sys
import tweepy

os.chdir('/my/preferred/location/Twitter Olympics/Data')

consumer_key = 'cons_key'
consumer_secret = 'cons_sec'
access_token = 'access_token'
access_secret = 'access_sec'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

# count is used to give an approximation of how many tweets I'm pulling at a given time.
count = []
f = open('feb24.txt', 'a')

class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print 'Running...'
        info = status.text, status.created_at, status.id
        f.write(str(info))
        for i in info:
            count.append(1)

    def on_error(self, status_code):
        print >> sys.stderr, "Encountered error with status code: ", status_code

    def on_timeout(self):
        print >> sys.stderr, "Timeout..."
        return True

sapi = tweepy.streaming.Stream(auth, StreamListener())
sapi.filter(track=["olympics", "olympics 2014", "sochi", "Sochi2014", "sochi 2014", "2014Sochi", "winter olympics"])
An example of the output that is stored in the .txt file is here:
('RT #Visa: There can only be one winner. Soak it in #TeamUSA, this is your #everywhere #Sochi2014 http://t.co/dVKYUln1r7', datetime.datetime(2014, 2, 15, 18, 9, 51), 111111111111111111).
So, here's my question. If I'm able to get the Tweet IDs into a list, is there a way to iterate over them to query the Twitter REST API and retrieve the full JSON for each? My hunch is yes, but I'm unsure about the implementation, and mainly about how to save the resulting data as a JSON file (since I've been using .txt files here). Thanks in advance for reading.
Figured it out. For anyone who has made this terrible mistake (just get all the data to begin with!), here's some code with a regular expression that will extract the ID numbers and store them in a list:
import re

# Read in your ugly text file.
tweet_string = open('nameoffile.txt', 'rU')
tweet_string = tweet_string.read()

# Find all the ID numbers with a regex.
id_finder = re.compile('[0-9]{18}')

# Go through the tweet_string object and find all
# the IDs that meet the regex criteria.
idList = re.findall(id_finder, tweet_string)
Now you can iterate over the list idList and feed each ID as a query to the API (assuming you've authenticated and have an instance of the API class), appending each result to a list. Something like:
tweet_list = []
for tweet_id in idList:
    tweet = api.get_status(tweet_id)
    tweet_list.append(tweet)
An important note: what gets appended to tweet_list will be Tweepy Status objects, not raw JSON. I still need a workaround for that, but the problem above is solved.
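If you want the full JSON on disk rather than Status objects, here's a minimal sketch, assuming Tweepy keeps the raw parsed payload on each Status as ._json (the output filename is illustrative):

import json

tweet_list = []
for tweet_id in idList:
    # ._json is the raw dict Tweepy parsed from the API response.
    tweet_list.append(api.get_status(tweet_id)._json)

# Write everything out as one JSON array.
with open('full_tweets.json', 'w') as f:
    json.dump(tweet_list, f)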
