Random sampling tweets with tweepy - python

I'm trying to analyze tweets that have the hashtag #contentmarketing. I first tried grabbing 20,000 tweets with tweepy but ran into the rate limit. So I'd like to take a random sample instead (or a couple random samples).
I'm not really familiar with random sampling through an API call. If I had an array that already contained the data, I would take random indices from that array without replacement. However, I don't think I can create that array in the first place without the rate limit kicking in.
Can anyone enlighten me on how to access random tweets (or random data from an API, overall)?
For reference, here's the code that got me in rate limit purgatory:
import tweepy
from tweepy import OAuthHandler
import json

consumerKey = 'my-key'
consumerSecret = 'my-key'
accessToken = 'my-key'
accessSecret = 'my-key'

auth = OAuthHandler(consumerKey, consumerSecret)
auth.set_access_token(accessToken, accessSecret)
api = tweepy.API(auth)

tweets = []
for tweet in tweepy.Cursor(api.search, q='#contentmarketing', count=20000,
                           lang='en', since='2017-06-20').items():
    tweets.append(tweet._json)  # Status objects aren't JSON serializable; keep the raw dict

with open('content-tweets.json', 'w') as f:
    json.dump(tweets, f, sort_keys=True, indent=4)

This should stop the rate limit from kicking in; just make the following change to your code:
api = tweepy.API(auth, wait_on_rate_limit=True)
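If you also want a console message whenever tweepy starts waiting out a window, tweepy 3.x additionally accepts wait_on_rate_limit_notify (note: this parameter was removed in tweepy 4.0):
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)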

I've never heard of a way to request random tweets directly. But you can keep fetching tweets indefinitely rather than all at once, which amounts to much the same thing.
With the standard search API you can make 450 requests per 15-minute window (app auth), so you can request 100 tweets every 2 seconds without ever hitting the limit.
Change the count parameter to 100 and add a time.sleep(2):
import time

for tweet in tweepy.Cursor(api.search, q='#contentmarketing', count=100,
                           lang='en', since='2017-06-20').items():
    tweets.append(tweet._json)
    time.sleep(2)
Reference: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html
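And if you still want a random sample once the tweets are collected, a minimal sketch using Python's standard library (the sample size of 2000 is illustrative): draw without replacement from the list you built, exactly as you would from a local array:
import random

# once `tweets` holds everything collected, sample without replacement
sample_size = min(2000, len(tweets))
sampled = random.sample(tweets, sample_size)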

Related

How to get the whole user timeline of a specific Twitter user

So I came up with this script to get all of the tweets from one Twitter user:
import tweepy
from tweepy import OAuthHandler
import json
def load_api():
    consumer_key = ''
    consumer_secret = ''
    access_token = ''
    access_secret = ''
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    return tweepy.API(auth)
api = load_api()
user = 'allkpop'
tweets = api.user_timeline(id=user, count=2000)
print('Found %d tweets' % len(tweets))
tweets_text = [t.text for t in tweets]
filename = 'tweets-'+user+'.json'
json.dump(tweets_text, open(filename, 'w'))
print('Saved to file:', filename)
But when I run it, I can only get 200 tweets per request. Is there a way to get 2000 tweets, or at least more than 200?
Please help me, thank you.
The Twitter API has request limits. The one you're using corresponds to the Twitter statuses/user_timeline endpoint. The max number that you can get for this endpoint is documented as 3,200. Also note that there's a max number of requests in a 15-minute window, which might explain why you're only getting 2000 instead of the max. Here are a few observations that might be interesting for you:
Documentation says that the max count is 200.
There's an include_rts (include retweets) parameter that might return more values. While it's part of the Twitter API, I can't see where Tweepy documents that.
You might try Tweepy Cursors to see if that will bring back more items.
Because of the 15 minute limits, you might be able to pause until the next 15 minute window to continue. That said, I don't know enough about your use case to know if this is practical or not.
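A hedged sketch of the Cursor approach combined with automatic rate-limit waiting (tweepy 3.x; the credentials are placeholders and 3200 is the documented timeline cap):
import tweepy

auth = tweepy.OAuthHandler('consumer-key', 'consumer-secret')
auth.set_access_token('access-token', 'access-secret')
# wait_on_rate_limit makes tweepy sleep through each 15-minute window for you
api = tweepy.API(auth, wait_on_rate_limit=True)

all_texts = []
# Cursor handles the max_id paging behind the scenes;
# include_rts=True keeps retweets, which count toward the 3200 cap
for status in tweepy.Cursor(api.user_timeline, screen_name='allkpop',
                            count=200, include_rts=True).items(3200):
    all_texts.append(status.text)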

Is there any way to speed up Python code for downloading tweets using tweepy?

Here is the code that I am using for this purpose. For each user request, it takes too long to download all the tweets. What are some ways to speed up the execution time? The idea is to run tweet analytics in real time as the user visits the website. I am new to Python, so any help would be appreciated.
import tweepy #https://github.com/tweepy/tweepy
#Twitter API credentials
consumer_key = ".."
consumer_secret = ".."
access_key = ".."
access_secret = ".."
def get_all_tweets(screen_name):
    # Twitter only allows access to a user's most recent 3240 tweets with this method

    # authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # initialize a list to hold all the tweepy Tweets
    alltweets = []

    # make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)

    # save most recent tweets
    alltweets.extend(new_tweets)

    # save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    # keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        print("getting tweets before {}".format(oldest))

        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)

        # save most recent tweets
        alltweets.extend(new_tweets)

        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

        print("...{} tweets downloaded so far".format(len(alltweets)))

    # transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text.encode("utf-8")] for tweet in alltweets]
    return outtweets
One way to make your solution faster would be to add a cache.
When you've downloaded all the tweets for a screen name, save them locally, for instance as [twitter_screen_name].json.
Then edit your function to check for a cache file. If it doesn't exist, create it empty. Then load it, refresh only what needs refreshing, and save the JSON cache file back, as sketched below.
This way, when a user visits, you'll download only the diff with Twitter. This will be much faster for regularly consulted screen names.
Then you could add something to auto-clear the cache, for instance a simple cron job that removes files whose last-accessed time is older than n days.
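A minimal sketch of that cache, assuming a local tweet_cache/ directory and that one 200-tweet request per refresh is enough (CACHE_DIR and get_tweets_cached are illustrative names, not part of the question's code):
import json
import os

CACHE_DIR = 'tweet_cache'  # hypothetical cache location

def get_tweets_cached(api, screen_name):
    """Fetch only the tweets newer than what is already on disk."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, '{}.json'.format(screen_name))

    cached = []
    if os.path.exists(path):
        with open(path) as f:
            cached = json.load(f)

    # since_id asks Twitter only for tweets newer than the newest cached one
    since_id = cached[0]['id'] if cached else None
    new = api.user_timeline(screen_name=screen_name, count=200, since_id=since_id)

    # prepend the fresh tweets (newest first) and write the cache back
    cached = [t._json for t in new] + cached
    with open(path, 'w') as f:
        json.dump(cached, f)
    return cached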

tweets about dengue in Brazil: top 10 people

I am using tweepy to get tweets about dengue in Brazil.
I am interested in getting recent tweets from the 10 people with the most followers. I use the Search API, not the Streaming API, because I don't need all the tweets, just the most relevant ones.
I am surprised to get so few tweets (only 17). Should I use the Streaming API instead?
Here is my code:
#api access
consumer_key = ""
consumer_secret = ""
access_token_key = ""
access_token_secret = ""

import csv

#write results in file
writer = csv.writer(open(r"twitter.csv", "wt"), lineterminator='\n', delimiter=';')
writer.writerow(["date", "langage", "place", "country", "username", "nb_followers", "tweet_text"])

import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth)

#get tweets for Brazil only
places = api.geo_search(query="Brazil", granularity="country")
place_id = places[0].id
print(place_id)

for tweet in tweepy.Cursor(api.search, q="dengue+OR+%23dengue&place:" + place_id, since="2015-08-01", until="2015-08-25").items():
    date = tweet.created_at
    langage = tweet.lang
    try:
        place = tweet.place.full_name
        country = tweet.place.country
    except AttributeError:  # tweet.place is often None
        place = None
        country = None
    username = tweet.user.screen_name
    nb_followers = tweet.user.followers_count
    tweet_text = tweet.text.encode('utf-8')
    print("created on", tweet.created_at)
    print("langage", tweet.lang)
    print("place:", place)
    print("country:", country)
    print("user:", tweet.user.screen_name)
    print("nb_followers:", tweet.user.followers_count)
    print(tweet.text.encode("utf-8"))
    print('')
    writer.writerow([date, langage, place, country, username, nb_followers, tweet_text])
Try doing the search manually and seeing what you get. It sounds like your application is appropriate for the search API.
I think I know what the problem is: the place attribute is rarely present in the data, so very few tweets are returned.
I am now using the lang attribute with the pt value (there is no pt-br language, unfortunately). It's not exactly what I want, since it also returns tweets from other countries such as Portugal, but it's the best I have found so far.
for tweet in tweepy.Cursor(api.search, q="dengue+OR+%23dengue", lang="pt", since=date, until=end_date).items():

Using Tweepy to print tweets just from your friends

I've been trying to figure out Tweepy for the last 3 hours and I'm still stuck.
I would like to be able to get all my friend's tweets for the period between Sept and Oct 2014, and have it be filtered by the top 10 number of retweets.
I'm only vaguely familiar with StreamListener; as I understand it, it yields tweets in real time. I was wondering whether I could go back to last month and grab those tweets from my friends. Can this be done through Tweepy? This is the code I have now:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import csv
ckey = 'xyz'
csecret = 'xyz'
atoken = 'xyz'
asecret = 'xyz'
auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

class Listener(StreamListener):
    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

twitterStream = Stream(auth, Listener())
users = [123456, 7890019, 9919038]  # this is the list of friends I would like to output
twitterStream.filter(follow=users)  # I want only Sept 1 - Oct 1 2014, but filter has no since/until
You are correct in that StreamListener returns real-time tweets. To get past tweets from specific users, you need to use tweepy's API wrapper, tweepy.API. An example, which would replace everything from your Listener class on down:
api = tweepy.API(auth)
tweetlist = api.user_timeline(id=123456)
This returns a list of up to 20 status objects. You can mess with the parameters to get more results, probably count and since will be helpful for your implementation. I think the most you can ask for with a single count is 200 tweets.
P.S. Not a major issue but you authenticate twice in your code which is not necessary.
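Building on that, a hedged sketch of the full idea, reusing the api object from the snippet above (the user IDs and date window come from the question; the filtering and ranking happen client-side, since user_timeline has no date parameters):
import datetime

start = datetime.datetime(2014, 9, 1)
end = datetime.datetime(2014, 10, 1)

statuses = []
for user_id in [123456, 7890019, 9919038]:
    # pull a recent chunk of each friend's timeline, then filter by date locally
    for status in api.user_timeline(id=user_id, count=200):
        if start <= status.created_at < end:
            statuses.append(status)

# top 10 by number of retweets
top10 = sorted(statuses, key=lambda s: s.retweet_count, reverse=True)[:10]
for s in top10:
    print(s.retweet_count, s.text)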

Managing Tweepy API Search

Please forgive me if this is a gross repeat of a question previously answered elsewhere, but I am lost on how to use the tweepy API search function. Is there any documentation available on how to search for tweets using the api.search() function?
Is there any way I can control features such as number of tweets returned, results type etc.?
The results seem to max out at 100 for some reason.
The code snippet I use is as follows:
searched_tweets = self.api.search(q=query, rpp=100, count=1000)
I originally worked out a solution based on Yuva Raj's suggestion to use additional parameters in GET search/tweets - the max_id parameter in conjunction with the id of the last tweet returned in each iteration of a loop that also checks for the occurrence of a TweepError.
However, I discovered there is a far simpler way to solve the problem using a tweepy.Cursor (see tweepy Cursor tutorial for more on using Cursor).
The following code fetches the most recent 1000 mentions of 'python'.
import tweepy
# assuming twitter_authentication.py contains each of the 4 oauth elements (1 per line)
from twitter_authentication import API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET
auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)
query = 'python'
max_tweets = 1000
searched_tweets = [status for status in tweepy.Cursor(api.search, q=query).items(max_tweets)]
Update: in response to Andre Petre's comment about potential memory consumption issues with tweepy.Cursor, I'll include my original solution, replacing the single statement list comprehension used above to compute searched_tweets with the following:
searched_tweets = []
last_id = -1
while len(searched_tweets) < max_tweets:
    count = max_tweets - len(searched_tweets)
    try:
        new_tweets = api.search(q=query, count=count, max_id=str(last_id - 1))
        if not new_tweets:
            break
        searched_tweets.extend(new_tweets)
        last_id = new_tweets[-1].id
    except tweepy.TweepError as e:
        # depending on TweepError.code, one may want to retry or wait
        # to keep things simple, we will give up on an error
        break
There's a problem in your code. Based on Twitter Documentation for GET search/tweets,
The number of tweets to return per page, up to a maximum of 100. Defaults to 15. This was
formerly the "rpp" parameter in the old Search API.
Your code should be,
CONSUMER_KEY = '....'
CONSUMER_SECRET = '....'
ACCESS_KEY = '....'
ACCESS_SECRET = '....'
auth = tweepy.auth.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
api = tweepy.API(auth)
search_results = api.search(q="hello", count=100)
for result in search_results:
    # do whatever you need with each result here
    print(result.text)
The other questions are old and the API has changed a lot.
The easy way is with Cursor (see the Cursor tutorial). pages() returns a list of elements per page, and you can limit how many pages it returns: .pages(5) only returns 5 pages.
for page in tweepy.Cursor(api.search, q='python', count=100, tweet_mode='extended').pages():
    # process status here
    process_page(page)
Here q is the query, count is how many tweets per request (100 is the maximum), and tweet_mode='extended' gives the full text (without it, the text is truncated to 140 characters). More info here. Retweets are still truncated, as confirmed by jaycech3n.
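For example, a small sketch of reading the untruncated text (with tweet_mode='extended' the text lives in full_text; for retweets the full text sits on the embedded retweeted_status):
for page in tweepy.Cursor(api.search, q='python', count=100, tweet_mode='extended').pages(2):
    for status in page:
        if hasattr(status, 'retweeted_status'):
            # retweets keep their untruncated text on the original status
            text = status.retweeted_status.full_text
        else:
            text = status.full_text
        print(text)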
If you don't want to use tweepy.Cursor, you need to pass max_id to bring in the next chunk:
last_id = None
results = True
while results:
    results = api.search(q='python', count=100, tweet_mode='extended', max_id=last_id)
    if results:
        process_result(results)
        # we subtract one so we don't fetch the same tweet again
        last_id = results[-1]._json['id'] - 1
I am working on extracting Twitter data around a location (here, around India) for all tweets that include a specific keyword or list of keywords.
import tweepy
import credentials  # all my twitter API credentials are in this file, in the same directory as this script

## set API connection
auth = tweepy.OAuthHandler(credentials.consumer_key,
                           credentials.consumer_secret)
auth.set_access_token(credentials.access_token,
                      credentials.access_secret)

# set wait_on_rate_limit=True, as twitter may block you from querying if it finds you exceeding some limits
api = tweepy.API(auth, wait_on_rate_limit=True)

search_words = ["#covid19", "2020", "lockdown"]
date_since = "2020-05-21"

tweets = tweepy.Cursor(api.search, q=search_words,
                       geocode="20.5937,78.9629,3000km",
                       lang="en", since=date_since).items(10)
## the geocode is for India; the format is geocode="latitude,longitude,radius"
## radius should be in miles (mi) or kilometres (km)

for tweet in tweets:
    print("created_at: {}\nuser: {}\ntweet text: {}\ngeo_location: {}".
          format(tweet.created_at, tweet.user.screen_name, tweet.text, tweet.user.location))
    print("\n")
## tweet.user.location gives the user's general location, not the location of the tweet itself;
## as it turns out, most users do not share the exact location of their tweets
Results:
created_at: 2020-05-28 16:48:23
user: XXXXXXXXX
tweet text: RT #Eatala_Rajender: Media Bulletin on status of positive cases #COVID19 in Telangana. (Dated. 28.05.2020)
# TelanganaFightsCorona
# StayHom…
geo_location: Hyderabad, India
You can search for tweets containing specific strings, as shown below:
tweets = api.search('Artificial Intelligence', count=200)
