Does tweets from search api overlap? - python

I'm very new to twitter api, and was wondering if I use search api, and I want to call it every minute, to retrieve about a 1000 tweets. Will I get duplicate tweets if in case there were created less than a 1000 tweets for a given criteria or I will call it more often than once a minute
I hope my question is clear, just in case if it matters I use python-twitter library.
and the way I get tweets is :
self.api = twitter.Api(consumer_key, consumer_secret ,access_key, access_secret)
self.api.VerifyCredentials()
self.api.GetSearch(self.hashtag, per_page=100)

Your search results will overlap because the API has no idea what you searched before. One way to prevent the overlap is to use use the tweet ID from the last retrieved tweet. Here is a python 2.7 snippet from my code:
maxid = 10000000000000000000
for i in range(0,10):
with open('output.json','a') as outfile:
time.sleep(5) # don't piss off twitter
print 'maxid=',maxid,', twitter loop',i
results = api.GetSearch('search_term', count=100,max_id = maxid)
for tweet in results:
tweet = str(tweet).replace('\n',' ').replace('\r',' ') # remove new lines
tweet = (json.loads(tweet))
maxid = tweet['id'] # redefine maxid
json.dump(tweet,outfile)
outfile.write('\n') #print tweets on new lines
This code gives you 10 loops of 100 tweets since the last id, which is defined each time through the loop. It then write a json file (with one tweet per line). I use this code to search into the recent past, but you can adapt it to have non-overlapping tweets by changing the 'max_id' to 'since_id'.

Related

Fetch the latest tweet from Twitter with Tweepy

I want to fetch the latest tweet if the keyword is met from a bounch of users at Twitter in real time.
This code fetchs the latest tweet if 'Twitter' keyword is met, and stores it in the "store" variable every 5 seconds and goes on forever.
Is there a way to make it to only fetch the tweet if it isent already present in the store variable. And if its already there it should stay on and search for the next tweet but not fetch it?
import tweepy
import time
api = 'APIKEY'
apisq = 'APISQ'
acc_tok = 'TOK'
acc_sq = 'TOkSQ'
auth = tweepy.OAuthHandler(api, apisq)
auth.set_access_token(acc_tok, acc_sq)
api = tweepy.API(auth)
store = []
username = 'somename'
while True:
first = []
get_tweets = api.user_timeline(screen_name=username, count=1)
test = get_tweets[0]
first.append(test.text)
time.sleep(5)
if any('Twitter' in word for word in first):
store.append(first)
print(store)
else:
continue
Ive tried with some Conditional Statements but has not been very succesful yet.
I think the important piece of data would be the 'id' field returned in the list. You could either add the tweets to a dictionary where the key would be the 'id' and the value the text of the tweet, or create a second list that contains the 'id' and then create a filter condition to validate that the 'id' isn't present in the other list before adding the new tweet.
The dictionary method is likely the quickest computationally, but the second list method is likely the easiest conceptually.

rate limit tweepy paginator search_all_tweets

I'm not sure why I am getting rate limited so quickly using:
mentions = []
for tweet in tweepy.Paginator(client.search_all_tweets, query= "to:######## lang:nl -is:retweet",
start_time = "2022-01-01T00:00:00Z", end_time = "2022-05-31T00:00:00Z",
max_results=500).flatten(limit=10000):
mention = tweet.text
mentions.append(mention)
I suppose I could put time.sleep(1) after these lines, but then it would mean I could only process one Tweet every second, whereas with a regular client.search_all_tweets I would get 500 Tweets per request.
Is there anything I'm missing here? How can I process more than one Tweet a second using tweepy.Paginator?
BTW: I have academic access and know the rate limit documentation.
See the FAQ section about this in Tweepy's documentation:
Why am I getting rate-limited so quickly when using Client.search_all_tweets() with Paginator?
The GET /2/tweets/search/all Twitter API endpoint that Client.search_all_tweets() uses has an additional 1 request per second rate limit that is not handled by Paginator.
You can time.sleep() 1 second while iterating through responses to handle this rate limit.
See also the relevant Tweepy issues #1688 and #1871.

Twython with 140 character limitation of twitter

I am trying to search in twitter using Tython, but it seems that the library has a limitation on 140 characters. With the new feature of python, i.e. 280 characters length, what can one do?
This is not a limitation of Twython. The Twitter API by default returns the old 140-character limited tweet. In order to see the newer extended tweet you just need to supply this parameter to your search query:
tweet_mode=extended
Then, you will find the 280-character extended tweet in the full_text field of the returned tweet.
I use another library (TwitterAPI), but I think you would do something like this using Twython:
results = api.search(q='pizza',tweet_mode='extended')
for result in results['statuses']:
print(result['full_text'])
Unfortunately, I am unable to find anything related "Tython". However, if searching twitter data (in this case posts) and/or gathering metadata is your goal, I would recommend you having a look into the library TwitterSearch.
Here is a quick example from the provided link with searching for Twitter-posts containing the words Gutenberg and Doktorarbeit.
from TwitterSearch import *
try:
tso = TwitterSearchOrder() # create a TwitterSearchOrder object
tso.set_keywords(['Guttenberg', 'Doktorarbeit']) # let's define all words we would like to have a look for
tso.set_language('de') # we want to see German tweets only
tso.set_include_entities(False) # and don't give us all those entity information
# it's about time to create a TwitterSearch object with our secret tokens (API auth credentials)
ts = TwitterSearch(
consumer_key = 'aaabbb',
consumer_secret = 'cccddd',
access_token = '111222',
access_token_secret = '333444'
)
# this is where the fun actually starts :)
for tweet in ts.search_tweets_iterable(tso):
print( '#%s tweeted: %s' % ( tweet['user']['screen_name'], tweet['text'] ) )
except TwitterSearchException as e: # take care of all those ugly errors if there are some
print(e)

How do I return more than 100 Twitter search results with Twython?

Twitter only returns 100 tweets per "page" when returning search results on the API. They provide the max_id and since_id in the returned search_metadata that can be used as parameters to get earlier/later tweets.
Twython 3.1.2 documentation suggests that this pattern is the "old way" to search:
results = twitter.search(q="xbox",count=423,max_id=421482533256044543)
for tweet in results['statuses']:
... do something
and that this is the "new way":
results = twitter.cursor(t.search,q='xbox',count=375)
for tweet in results:
... do something
When I do the latter, it appears to endlessly iterate over the same search results. I'm trying to push them to a CSV file, but it pushes a ton of duplicates.
What is the proper way to search for a large number of tweets, with Twython, and iterate through the set of unique results?
Edit: Another issue here is that when I try to iterate with the generator (for tweet in results:), it loops repeatedly, without stopping. Ah -- this is a bug... https://github.com/ryanmcgrath/twython/issues/300
I had the same problem, but it seems that you should just loop through a user's timeline in batches using the max_id parameter. The batches should be 100 as per Terence's answer (but actually, for user_timeline 200 is the max count), and just set the max_id to the last id in the previous set of returned tweets minus one (because max_id is inclusive). Here's the code:
'''
Get all tweets from a given user.
Batch size of 200 is the max for user_timeline.
'''
from twython import Twython, TwythonError
tweets = []
# Requires Authentication as of Twitter API v1.1
twitter = Twython(PUT YOUR TWITTER KEYS HERE!)
try:
user_timeline = twitter.get_user_timeline(screen_name='eugenebann',count=200)
except TwythonError as e:
print e
print len(user_timeline)
for tweet in user_timeline:
# Add whatever you want from the tweet, here we just add the text
tweets.append(tweet['text'])
# Count could be less than 200, see:
# https://dev.twitter.com/discussions/7513
while len(user_timeline) != 0:
try:
user_timeline = twitter.get_user_timeline(screen_name='eugenebann',count=200,max_id=user_timeline[len(user_timeline)-1]['id']-1)
except TwythonError as e:
print e
print len(user_timeline)
for tweet in user_timeline:
# Add whatever you want from the tweet, here we just add the text
tweets.append(tweet['text'])
# Number of tweets the user has made
print len(tweets)
As per the official Twitter API documentation.
Count optional
The number of tweets to return per page, up to a maximum of 100
You need to make repeated calls to the python method. However, there is no guarantee that these will be the next N, or if the tweets are really coming in it might miss some.
If you want all the tweets in a time frame you can use the streaming api: https://dev.twitter.com/docs/streaming-apis and combine this with the oauth2 module.
How can I consume tweets from Twitter's streaming api and store them in mongodb
python-twitter streaming api support/example
Disclaimer: i have not actually tried this
As a solution to the problem of returning 100 tweets for a search query using Twython, here is the link showing how it can be done using the "old way":
Twython search API with next_results

How to get id of the tweet posted in tweepy

I want to check if a certain tweet is a reply to the tweet that I sent. Here is how I think I can do it:
Step1: Post a tweet and store id of posted tweet
Step2: Listen to my handle and collect all the tweets that have my handle in it
Step3: Use tweet.in_reply_to_status_id to see if tweet is reply to the stored id
In this logic, I am not sure how to get the status id of the tweet that I am posting in step 1. Is there a way I can get it? If not, is there another way in which I can solve this problem?
What one could do, is get the last nth tweet from a user, and then get the tweet.id of the relevant tweet. This can be done doing:
latestTweets = api.user_timeline(screen_name = 'user', count = n, include_rts = False)
I, however, doubt that it is the most efficient way.
When you call the update_status method of your tweepy.API object, it returns a Status object. This object contains all of the information about your tweet.
Example:
my_tweet_ids = list()
api = tweepy.API(auth)
# Post to twitter
status = api.update_status(status='Testing out my Twitter')
# Append this tweet's id to my list of tweets
my_tweet_ids.append(status.id)
Then you can just iterate over the list using for each tweet_id in my_tweet_ids and check the number of replies for each one.

Categories