I am working with Twitter's API and tweepy in the hopes of scraping available geolocation coordinates from Tweets. My end goal is to store only the coordinates of each Tweet in a table.
My issue is that when a Tweet does include location data, I get back more information than just the coordinates:
My code thus far is as follows:
import pandas as pd
import json
import tweepy
import csv
class MyStreamListener(tweepy.StreamListener):

    def on_status(self, status):
        if status.retweeted:
            return

        coords = status.coordinates
        geo = status.geo
        if geo is not None:
            geo = json.dumps(geo)
        if coords is not None:
            coords = json.dumps(coords)
        print(coords, geo)

        with open('coordinates_data.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow([coords, geo])

    def on_error(self, status_code):
        if status_code == 420:
            # returning False in on_error disconnects the stream
            return False
LOCATIONS = [-124.7771694, 24.520833, -66.947028, 49.384472, # Contiguous US
-164.639405, 58.806859, -144.152365, 71.76871, # Alaska
-160.161542, 18.776344, -154.641396, 22.878623] # Hawaii
auth = tweepy.OAuthHandler('access auths', 'access auths')
auth.set_access_token('token','token')
api = tweepy.API(auth)
user = api.me()
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)
myStream.filter(locations=LOCATIONS)
I'm sure this issue relates to my lack of 'json' understanding, or that I need to use a data dictionary.
I would appreciate any help!
Just to clarify, Tweepy is a third-party library that interfaces with Twitter's API.
That's just how Twitter represents coordinates objects. Tweepy parses the coordinates attribute of the Status/Tweet object data as the dictionary that it is. You can simply access the coordinates field as a key for that dictionary to get the list with the longitude and latitude values.
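For instance, a minimal sketch of a helper (the name write_coordinates is only illustrative) that stores just the longitude and latitude from that dictionary instead of the whole JSON-dumped object:
import csv

def write_coordinates(status, path='coordinates_data.csv'):
    # status.coordinates is a GeoJSON point:
    # {'type': 'Point', 'coordinates': [longitude, latitude]}
    if status.coordinates is not None:
        lon, lat = status.coordinates['coordinates']
        with open(path, 'a', newline='') as f:
            csv.writer(f).writerow([lon, lat])
Calling something like this from on_status, instead of the json.dumps lines, would give you a two-column table of coordinates.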
You also have a missing quotation mark, ', where you initialize auth, but I assume that's a typo from when you replaced your credentials for this question.
I am using code which is working fine. I have included the whole script below, as taken from GeeksforGeeks. But I want to modify it to add referenced_tweets.type. I am new to APIs and really want to understand how to do this.
import pandas as pd
import tweepy
# function to display data of each tweet
def printtweetdata(n, ith_tweet):
    print()
    print(f"Tweet {n}:")
    print(f"Username:{ith_tweet[0]}")
    print(f"likes:{ith_tweet[1]}")
    print(f"Location:{ith_tweet[2]}")
    print(f"Following Count:{ith_tweet[3]}")
    print(f"Follower Count:{ith_tweet[4]}")
    print(f"Total Tweets:{ith_tweet[5]}")
    print(f"Retweet Count:{ith_tweet[6]}")
    print(f"Tweet Text:{ith_tweet[7]}")
    print(f"Hashtags Used:{ith_tweet[8]}")
# function to perform data extraction
def scrape(words, date_since, numtweet):

    # Creating DataFrame using pandas
    db = pd.DataFrame(columns=['username', 'likes', 'location', 'following',
                               'followers', 'totaltweets', 'retweetcount', 'text', 'hashtags'])

    # We are using .Cursor() to search through twitter for the required tweets.
    # The number of tweets can be restricted using .items(number of tweets)
    tweets = tweepy.Cursor(api.search, q=words, lang="en",
                           since=date_since, tweet_mode='extended').items(numtweet)

    # .Cursor() returns an iterable object. Each item in
    # the iterator has various attributes that you can access to
    # get information about each tweet
    list_tweets = [tweet for tweet in tweets]

    # Counter to maintain Tweet Count
    i = 1

    # we will iterate over each tweet in the list for extracting information about each tweet
    for tweet in list_tweets:
        username = tweet.user.screen_name
        likes = tweet.favorite_count
        location = tweet.user.location
        following = tweet.user.friends_count
        followers = tweet.user.followers_count
        totaltweets = tweet.user.statuses_count
        retweetcount = tweet.retweet_count
        hashtags = tweet.entities['hashtags']

        # Retweets can be distinguished by a retweeted_status attribute;
        # if it is missing, the except block will be executed
        try:
            text = tweet.retweeted_status.full_text
        except AttributeError:
            text = tweet.full_text
        hashtext = list()
        for j in range(0, len(hashtags)):
            hashtext.append(hashtags[j]['text'])

        # Here we are appending all the extracted information in the DataFrame
        ith_tweet = [username, likes, location, following,
                     followers, totaltweets, retweetcount, text, hashtext]
        db.loc[len(db)] = ith_tweet

        # Function call to print tweet data on screen
        printtweetdata(i, ith_tweet)
        i = i + 1
    filename = 'etihad.csv'

    # we will save our database as a CSV file.
    db.to_csv(filename)
if __name__ == '__main__':

    # Enter your own credentials obtained
    # from your developer account
    consumer_key =
    consumer_secret =
    access_key =
    access_secret =

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # Enter Hashtag and initial date
    print("Enter Twitter HashTag to search for")
    words = input()
    print("Enter Date since The Tweets are required in yyyy-mm-dd")
    date_since = input()

    # number of tweets you want to extract in one run
    numtweet = 100
    scrape(words, date_since, numtweet)
    print('Scraping has completed!')
I now want to add referenced_tweets.type in order to find out whether a Tweet is a Retweet or not, but I'm not sure how to do it. Can someone help?
API.search uses the standard search API, part of Twitter API v1.1.
referenced_tweets is a value that can be set for tweet.fields, a Twitter API v2 fields parameter.
Currently, if you want to use Twitter API v2 through Tweepy, you'll have to use the development version of Tweepy on the master branch and its Client class. Otherwise, you'll need to wait until Tweepy v4.0 is released.
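As a rough illustration only (the method and parameter names here are assumptions based on the v2 Client and may differ on the development version), a search that requests referenced_tweets might look something like this:
import tweepy

# Hypothetical sketch: requires Tweepy's v2 Client (development version / v4.0+)
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Ask for the referenced_tweets field so each Tweet says whether it is a
# Retweet ("retweeted"), a Quote Tweet ("quoted"), or a reply ("replied_to")
response = client.search_recent_tweets(
    "#etihad",
    tweet_fields=["referenced_tweets"],
    max_results=100,
)

for tweet in response.data or []:
    types = [ref.type for ref in (tweet.referenced_tweets or [])]
    print(tweet.id, types or ["original"])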
Alternatively, if your only goal is to determine whether a Status/Tweet object is a Retweet or not, you can simply check for the retweeted_status attribute.
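For example, with the Status objects returned by api.search in your existing code, a check along these lines (tweet_type is just a hypothetical helper name) is enough:
def tweet_type(tweet):
    # v1.1 Status objects: Retweets carry a retweeted_status attribute,
    # original Tweets do not
    return 'retweeted' if hasattr(tweet, 'retweeted_status') else 'original'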
Hi StackOverflow community, I have a question about Twitter's API 1.1, Tweepy and Python. While doing a small analysis of the devices used for a certain hashtag, I noticed that the source field in the related JSON (api.source) comes back completely empty for certain accounts, including mine when I send a post from my app. What happened? Is this field going to disappear? Does anyone know anything about it? Thank you!
import json
from tweepy.streaming import StreamListener

# define streaming class
class TweetStreamListener(StreamListener):

    # on success
    def on_data(self, status):

        # convert string to JSON
        data = json.loads(status)

        # extract relevant information
        description = data['user']['description']
        loc = data['user']['location']
        text = data['text']
        coords = data['coordinates']
        name = data['user']['screen_name']
        user_created = data['user']['created_at']
        followers = data['user']['followers_count']
        id_str = data['id_str']
        created = data['created_at']
        retweets = data['retweet_count']
        bg_color = data['user']['profile_background_color']
        idTweet = data['id']
        source = data['source']

        print(name)
        print(text)
        print('Source: ', source)

        return True

    # on failure
    def on_error(self, status):
        print(status)
Results: the 'Source' field comes back blank. Is this a new behavior of the API?
Trying to obtain full tweets via the following code. I understand you want to set the parameter tweet_mode to the value 'extended', but since I'm not using a standard search query here, I don't know where it would fit. For the text field I always get partial text cut off by '...' followed by a URL. With this configuration, how would you go about getting full tweets?
from twython import Twython, TwythonStreamer
import json
import pandas as pd
import csv
def process_tweet(tweet):
    d = {}
    d['hashtags'] = [hashtag['text'] for hashtag in tweet['entities']['hashtags']]
    d['text'] = tweet['text']
    d['user'] = tweet['user']['screen_name']
    d['user_loc'] = tweet['user']['location']
    return d

# Create a class that inherits TwythonStreamer
class MyStreamer(TwythonStreamer):

    # Received data
    def on_success(self, data):

        # Only collect tweets in English
        if data['lang'] == 'en':
            tweet_data = process_tweet(data)
            self.save_to_csv(tweet_data)

    # Problem with the API
    def on_error(self, status_code, data):
        print(status_code, data)
        self.disconnect()

    # Save each tweet to csv file
    def save_to_csv(self, tweet):
        with open(r'tweets_about_california.csv', 'a') as file:
            writer = csv.writer(file)
            writer.writerow(list(tweet.values()))

# Instantiate from our streaming class
stream = MyStreamer(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'],
                    creds['ACCESS_TOKEN'], creds['ACCESS_SECRET'])

# Start the stream
stream.statuses.filter(track='california', tweet_mode='extended')
The tweet_mode=extended parameter has no effect on the v1.1 streaming API, as all Tweets are delivered in both extended and default (140) format.
If the Tweet object has the value truncated: true then there will be an additional element in the payload - extended_tweet. This is where the full_text value will be stored.
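In the Twython handler above, that means preferring extended_tweet when it is present; a small sketch (get_full_text is just an illustrative helper) of that check:
def get_full_text(tweet):
    # v1.1 streaming payload: if the Tweet was truncated, the complete text
    # lives under extended_tweet['full_text']; otherwise 'text' already has it
    if tweet.get('truncated') and 'extended_tweet' in tweet:
        return tweet['extended_tweet']['full_text']
    return tweet['text']
You could then use d['text'] = get_full_text(tweet) inside process_tweet.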
Note that this answer only applies to the v1.1 Twitter API, in v2 all Tweet text is returned by default in the streaming API (Twython does not currently support v2).
1. The API: stream.filter(). I read the documentation, which said that all parameters can be optional. However, when I leave them all empty, it doesn't work.
2. Still a question about the API. It is said that if I write code like below:
twitter_stream.filter(locations = [-180,-90, 180, 90])
It can filter all tweets with geographical information. However, when I check the JSON data, I still find many tweets whose geo attribute is null.
3. I tried to use the stream to get as many tweets as possible. However, it is said that it only gets tweets in real time. Are there any parameters to set a time range, for example to collect tweets from 2013 to 2015?
4. I tried to collect data through users and their followers, repeating the same step until I get as many tweets as I want. My code is below:
import tweepy
import chardet
import json
import sys

# set one global list to store all user_names
users_unused = ["Raithan8"]
users_used = []

def process_or_store(tweet):
    print(json.dumps(tweet))

consumer_key =
consumer_secret =
access_token =
access_token_secret =

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

def getAllTweets():

    # initialize one empty list to store all tweets
    screen_name = users_unused[0]
    users_unused.remove(screen_name)
    users_used.append(screen_name)
    print("this is the current user: " + screen_name)

    for friend in tweepy.Cursor(api.friends, screen_name=screen_name).items():
        if friend not in users_unused and friend not in users_used:
            users_unused.append(friend.screen_name)

    for follower in tweepy.Cursor(api.followers, screen_name=screen_name).items():
        if follower not in users_unused and follower not in users_used:
            users_unused.append(follower.screen_name)
    print(users_unused)
    print(users_used)

    alltweets = []

    # tweepy limits at most 200 tweets each time
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    alltweets.extend(new_tweets)
    if not alltweets:
        return alltweets
    oldest = alltweets[-1].id - 1

    while len(new_tweets) > 0:
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
        alltweets.extend(new_tweets)
        oldest = alltweets[-1].id - 1
    return alltweets

def storeTweets(alltweets, file_name="tweets.json"):
    for tweet in alltweets:
        json_data = tweet._json
        data = json.dumps(tweet._json)
        with open(file_name, "a") as f:
            if json_data['geo'] is not None:
                f.write(data)
                f.write("\n")

if __name__ == "__main__":
    while 1:
        if not users_unused:
            break
        storeTweets(getAllTweets())
I don't know why it runs so slowly. Maybe it is mainly because I initialize the Tweepy API as below:
api = tweepy.API(auth, wait_on_rate_limit=True)
But if I don't initialize it in this way, it raises the error below:
raise RateLimitError(error_msg, resp)
tweepy.error.RateLimitError: [{'message': 'Rate limit exceeded', 'code': 88}]
2) There's a difference between a tweet with coordinates and filtering by location.
Filtering by location means that the sender is located in the range of your filter. If you set it globally twitter_stream.filter(locations = [-180,-90, 180, 90]) it will return tweets for people who set their country name in their preferences.
If you need to filter by coordinates (a tweet that has coordinates), you can take a look at my blog post. But basically you need to set up a listener and then check whether the tweet has coordinates.
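A minimal sketch of that idea with Tweepy's StreamListener (the class name and printing are just placeholders):
import tweepy

class CoordinateListener(tweepy.StreamListener):
    def on_status(self, status):
        # Keep only Tweets that actually carry point coordinates
        if status.coordinates is not None:
            lon, lat = status.coordinates['coordinates']
            print(lon, lat)

# api is the tweepy.API object you already authenticated above
stream = tweepy.Stream(auth=api.auth, listener=CoordinateListener())
stream.filter(locations=[-180, -90, 180, 90])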
3 and 4) Twitter's Search API and Twitter's Streaming API are different in many ways, including their restrictions and rate limits (both on the Tweepy side and on Twitter's side).
There is a limit on how far back in time you can retrieve tweets.
Check the Tweepy API documentation again: with wait_on_rate_limit set to True, the client simply waits until your current rate-limit window is available again. That's why it's "slow", as you said.
However, the Streaming API doesn't have such restrictions.
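As a side note, in Tweepy 3.x you can also pass wait_on_rate_limit_notify=True so the client prints a message whenever it starts waiting, which makes those pauses visible (a sketch reusing the credential variables from your script):
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# wait_on_rate_limit: sleep through the rate-limit window instead of raising
# RateLimitError; wait_on_rate_limit_notify (Tweepy 3.x): print when that happens
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)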
I am trying to extract tweet locations from a specific area with Python using Tweepy, and to write them into a CSV file.
I am not very much into Python, but I managed to put together the following script, which kind of works:
import json
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

# Enter Twitter API Key information
consumer_key = 'cons_key'
consumer_secret = 'cons_secret'
access_token = 'acc_token'
access_secret = 'acc-secret'

file = open(r"C:\Python27\Output2.csv", "w")
file.write("X,Y\n")
data_list = []
count = 0

class listener(StreamListener):

    def on_data(self, data):
        global count

        # How many tweets you want to find, could change to time based
        if count <= 100:
            json_data = json.loads(data)
            coords = json_data["coordinates"]
            if coords is not None:
                print(coords["coordinates"])
                lon = coords["coordinates"][0]
                lat = coords["coordinates"][1]
                data_list.append(json_data)
                file.write(str(lon) + ",")
                file.write(str(lat) + "\n")
                count += 1
            return True
        else:
            file.close()
            return False

    def on_error(self, status):
        print(status)

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

twitterStream = Stream(auth, listener())

# What you want to search for here
twitterStream.filter(locations=[11.01, 47.85, 12.09, 48.43])
The problem is that it extracts the coordinates very slowly (about 10 entries per 30 minutes). Would there be a way to make this faster?
How can I add the timestamps for each tweet?
Is there a way to make sure I retrieve all the tweets possible for the specific region (I guess the max is all tweets of the past week)?
thanks very much in advance!
Twitter’s standard streaming API provides a 1% sample of all the Tweets posted. In addition, very few Tweets have location data added to them. So, I’m not surprised that you’re only getting a small number of Tweets in a 30 minute timespan for one specific bounding box. The only way to improve the volume would be to pay for the enterprise PowerTrack API.
Tweets all contain a created_at value which is the time stamp you’ll want to record.
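For example, inside the on_data handler above you could write the timestamp alongside the coordinates; a small sketch (write_row and the column order are just illustrative):
def write_row(json_data, out_file):
    # In v1.1 payloads created_at is a string like "Wed Oct 10 20:19:24 +0000 2018"
    created = json_data["created_at"]
    lon, lat = json_data["coordinates"]["coordinates"]
    out_file.write(str(lon) + "," + str(lat) + "," + created + "\n")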