I am trying to use Tweepy, a Python library for the Twitter API, to crawl and store Twitter data in a fast, scalable way. I'm most interested in follower/following relationships. Is there a way to do this faster than I currently am? With the 15-minute rate-limit window, I seem to be able to store only about a dozen connections per 15 minutes. The tweepy.Cursor line seems to be the bottleneck.
for iters in range(20):
    nameSearching = namesToSearch[0]
    print("getting followers of " + nameSearching)
    for id in tweepy.Cursor(api.followers_ids, screen_name='elonmusk').items(bfs_value):
        print(id)
        ids.append(id)
    print("ids loaded")
    namesToSearch.pop(0)
    nodeSearching = nodemap[nameSearching]
I'm not sure why I am getting rate limited so quickly using:
mentions = []
for tweet in tweepy.Paginator(client.search_all_tweets, query="to:######## lang:nl -is:retweet",
                              start_time="2022-01-01T00:00:00Z", end_time="2022-05-31T00:00:00Z",
                              max_results=500).flatten(limit=10000):
    mention = tweet.text
    mentions.append(mention)
I suppose I could put time.sleep(1) after these lines, but then it would mean I could only process one Tweet every second, whereas with a regular client.search_all_tweets I would get 500 Tweets per request.
Is there anything I'm missing here? How can I process more than one Tweet a second using tweepy.Paginator?
BTW: I have academic access and know the rate limit documentation.
See the FAQ section about this in Tweepy's documentation:
Why am I getting rate-limited so quickly when using Client.search_all_tweets() with Paginator?
The GET /2/tweets/search/all Twitter API endpoint that Client.search_all_tweets() uses has an additional 1 request per second rate limit that is not handled by Paginator.
You can time.sleep() 1 second while iterating through responses to handle this rate limit.
See also the relevant Tweepy issues #1688 and #1871.
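A minimal sketch of the sleep suggested above, assuming you iterate over the Paginator page by page instead of through flatten(). The helper throttled() is hypothetical (it is not part of Tweepy): it pauses after each page has been consumed, so the next request goes out at least one second later. The Paginator call in the usage comment is only an illustration and needs a configured client.

```python
import time


def throttled(pages, delay=1.0, sleep=time.sleep):
    """Yield each page, then pause before the next one is fetched.

    `pages` can be any lazy iterable that performs one request per item,
    such as a tweepy.Paginator. `sleep` is injectable so the helper can be
    exercised without actually waiting.
    """
    for page in pages:
        yield page
        # Runs when the consumer asks for the next page, i.e. before the
        # underlying iterator issues its next request.
        sleep(delay)


# Hypothetical usage with Tweepy (requires a configured `client`):
#
# paginator = tweepy.Paginator(client.search_all_tweets, query=..., max_results=500)
# mentions = []
# for response in throttled(paginator):
#     for tweet in response.data or []:
#         mentions.append(tweet.text)
```

Because the pause happens once per page, each request can still carry up to max_results Tweets; the one-second wait applies per request, not per Tweet.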
I am trying to create a project that accesses a Twitter account using the Tweepy API, but I am faced with status code 429. I've looked around and I see that it means I am making too many requests. However, I only ever ask for 10 tweets at a time, and of those only one should exist during my testing.
for tweet in tweepy.Cursor(api.search, q='#realtwitchess ', lang=' ').items(10):
    try:
        text = str(tweet.text)
        textparts = str.split(text)  # convert tweet into string array to dissect
        print(text)
        for x, string in enumerate(textparts):
            if (x < len(textparts)-1):  # prevents error that arises with an incomplete call of the twitter bot to start a game
                if string == "gamestart" and textparts[x+1][:1] == "#":  # find games
                    otheruser = api.get_user(screen_name=textparts[2][1:])  # drop the # sign (although it might not matter)
                    self.games.append((tweet.user.id, otheruser.id))
                elif (len(textparts[x]) == 4):  # find moves
                    newMove = Move(tweet.user.id, string)
                    print(newMove.getMove())
                    self.moves.append(newMove)
        if tweet.user.id == thisBot.id:  # ignore self tweets
            continue
    except tweepy.TweepError as e:
        print(e.reason)
        sleep(900)
        continue
    except StopIteration:  # stop iteration when last tweet is reached
        break
When the error does appear, it is raised on the first for loop line. The weird part is that it doesn't complain every time, or even at consistent intervals. Sometimes it works and other times, seemingly at random, it doesn't.
We have tried adding longer sleep times in the loop and reducing the item count.
Pass wait_on_rate_limit=True when creating the API object, like this:
api = tweepy.API(auth, wait_on_rate_limit=True)
This makes Tweepy wait whenever a rate limit is reached, so the rest of the code stays within it.
You found the correct information about the error code. The 429 code is returned when a request cannot be served due to the application’s rate limit having been exhausted for the resource (from the documentation).
I suspect your problem is not the quantity of data but the frequency of requests.
Check the Twitter API rate limits (they are the same for Tweepy):
Rate limits are divided into 15 minute intervals. All endpoints require authentication, so there is no concept of unauthenticated calls and rate limits.
There are two initial buckets available for GET requests: 15 calls every 15 minutes, and 180 calls every 15 minutes.
Try keeping your API calls within these limits to avoid the problem.
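To make the two buckets concrete, here is a small arithmetic sketch. The helper min_spacing() is a hypothetical name; it only divides the 15-minute window by the number of calls allowed, which gives the average pause needed to stay within the budget.

```python
WINDOW_SECONDS = 15 * 60  # rate limits are counted per 15-minute window


def min_spacing(calls_per_window, window_seconds=WINDOW_SECONDS):
    """Seconds to leave between requests to stay within the window's budget."""
    return window_seconds / calls_per_window


for bucket in (15, 180):
    print(f"{bucket} calls per window: one call every {min_spacing(bucket):.0f} s")
# 15 calls per window: one call every 60 s
# 180 calls per window: one call every 5 s
```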
Update
Since Tweepy 3.2.0, the wait_on_rate_limit parameter is available.
If set to True, Tweepy automatically waits for the rate limit to reset instead of failing.
From documentation:
wait_on_rate_limit – Whether or not to automatically wait for rate limits to replenish
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
This should help with handling the rate limit. Note that wait_on_rate_limit_notify was removed in Tweepy 4.0, so on newer versions pass only wait_on_rate_limit=True.
With tweepy in Python I'm looking for a way to list all followers from one account, with username and number of followers.
Now I can obtain the list of all ids in this way:
ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="username").pages():
    ids.extend(page)
    time.sleep(1)
but with this list of ids I can't obtain the username and number of followers for every id, because I exceed the rate limit...
How can I complete this code?
Thank you all!
On the REST API, you are allowed 180 queries every 15 minutes on many endpoints (some allow only 15), and I guess the Streaming API has a similar limitation. You do not want to come too close to this limit, since your application will eventually get blocked even if you do not strictly hit it.
Since your problem has to do with the rate limit, you should put a sleep in your for loop. I'd say a sleep(4) should be enough, but it's mostly a matter of trial and error; try changing the value and see for yourself.
Something like:
sleeptime = 4
pages = tweepy.Cursor(api.followers, screen_name="username").pages()
while True:
    try:
        page = next(pages)
        time.sleep(sleeptime)
    except tweepy.TweepError:  # taking extra care of the "rate limit exceeded"
        time.sleep(60*15)
        page = next(pages)
    except StopIteration:
        break
    for user in page:
        print(user.id_str)
        print(user.screen_name)
        print(user.followers_count)
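If the follower IDs are already collected as in the question, one alternative is to look up the user objects in batches. This is only a sketch, assuming the v1.1 users/lookup endpoint, which accepts up to 100 IDs per request. chunked() is a hypothetical helper and the ids list is a stand-in; the Tweepy call in the comment is an assumption, and its keyword name differs between Tweepy versions.

```python
def chunked(items, size=100):
    """Split a list into consecutive slices of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


ids = list(range(250))  # stand-in for the follower IDs collected earlier
batches = list(chunked(ids))
print([len(batch) for batch in batches])  # [100, 100, 50]

# Hypothetical usage with Tweepy; check the keyword name for your version
# (assumed here: user_ids in 3.x, user_id in 4.x):
#
# for batch in chunked(ids):
#     for user in api.lookup_users(user_ids=batch):
#         print(user.screen_name, user.followers_count)
```

Batching this way turns one request per follower into one request per 100 followers, which is usually the larger saving compared with tuning the sleep value.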
Twitter only returns 100 tweets per "page" when returning search results on the API. They provide the max_id and since_id in the returned search_metadata that can be used as parameters to get earlier/later tweets.
Twython 3.1.2 documentation suggests that this pattern is the "old way" to search:
results = twitter.search(q="xbox", count=423, max_id=421482533256044543)
for tweet in results['statuses']:
    ...  # do something
and that this is the "new way":
results = twitter.cursor(twitter.search, q='xbox', count=375)
for tweet in results:
    ...  # do something
When I do the latter, it appears to endlessly iterate over the same search results. I'm trying to push them to a CSV file, but it pushes a ton of duplicates.
What is the proper way to search for a large number of tweets, with Twython, and iterate through the set of unique results?
Edit: Another issue here is that when I try to iterate with the generator (for tweet in results:), it loops repeatedly, without stopping. Ah -- this is a bug... https://github.com/ryanmcgrath/twython/issues/300
I had the same problem, but it seems that you should just loop through a user's timeline in batches using the max_id parameter. The batches should be 100 as per Terence's answer (but actually, for user_timeline 200 is the max count), and just set the max_id to the last id in the previous set of returned tweets minus one (because max_id is inclusive). Here's the code:
'''
Get all tweets from a given user.
Batch size of 200 is the max for user_timeline.
'''
from twython import Twython, TwythonError

tweets = []
# Requires authentication as of Twitter API v1.1.
# APP_KEY etc. are placeholders: put your Twitter keys here.
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
user_timeline = []  # stays empty if the first request fails
try:
    user_timeline = twitter.get_user_timeline(screen_name='eugenebann', count=200)
except TwythonError as e:
    print(e)
print(len(user_timeline))
for tweet in user_timeline:
    # Add whatever you want from the tweet, here we just add the text
    tweets.append(tweet['text'])
# Count could be less than 200, see:
# https://dev.twitter.com/discussions/7513
while len(user_timeline) != 0:
    try:
        # max_id is inclusive, so go one below the oldest id received so far
        user_timeline = twitter.get_user_timeline(screen_name='eugenebann', count=200,
                                                  max_id=user_timeline[-1]['id'] - 1)
    except TwythonError as e:
        print(e)
        break  # stop instead of repeating the same failing request forever
    print(len(user_timeline))
    for tweet in user_timeline:
        # Add whatever you want from the tweet, here we just add the text
        tweets.append(tweet['text'])
# Number of tweets the user has made
print(len(tweets))
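The same max_id walk can be isolated from the API call, which makes it easy to check without network access. This is a sketch under stated assumptions: fetch_all() and fake_fetch() are hypothetical names, and the fake data stands in for what a timeline call would return (newest first, each item a dict with an integer 'id').

```python
def fetch_all(fetch_page):
    """Collect every item by walking backwards with max_id.

    `fetch_page(max_id)` must return a list of dicts with an integer 'id',
    newest first, containing only items whose id <= max_id (or the newest
    items when max_id is None).
    """
    collected = []
    max_id = None
    while True:
        page = fetch_page(max_id)
        if not page:
            break
        collected.extend(page)
        # max_id is inclusive, so step just below the oldest id received.
        max_id = page[-1]['id'] - 1
    return collected


# Stand-in for an API call: 5 fake tweets served 2 at a time.
FAKE = [{'id': i, 'text': f'tweet {i}'} for i in range(5, 0, -1)]


def fake_fetch(max_id, count=2):
    eligible = [t for t in FAKE if max_id is None or t['id'] <= max_id]
    return eligible[:count]


print([t['id'] for t in fetch_all(fake_fetch)])  # [5, 4, 3, 2, 1]
```

With Twython, fetch_page could wrap get_user_timeline and pass max_id only when it is not None; that wrapper is left out here because it needs real credentials.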
As per the official Twitter API documentation:
count (optional): The number of tweets to return per page, up to a maximum of 100.
You need to make repeated calls to the Python method. However, there is no guarantee that each call returns the next N tweets, and if tweets are arriving quickly you might miss some.
If you want all the tweets in a time frame you can use the streaming api: https://dev.twitter.com/docs/streaming-apis and combine this with the oauth2 module.
How can I consume tweets from Twitter's streaming api and store them in mongodb
python-twitter streaming api support/example
Disclaimer: I have not actually tried this.
As a solution to the problem of returning 100 tweets for a search query using Twython, here is the link showing how it can be done using the "old way":
Twython search API with next_results
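A rough sketch of the next_results idea, assuming a v1.1 search response whose search_metadata includes a next_results query string while older results remain. next_max_id() is a hypothetical helper, and the metadata values below are made up for illustration.

```python
from urllib.parse import parse_qs


def next_max_id(search_metadata):
    """Return the max_id embedded in next_results, or None if it is absent."""
    next_results = search_metadata.get('next_results')
    if not next_results:
        return None
    query = parse_qs(next_results.lstrip('?'))
    return int(query['max_id'][0])


# Illustrative metadata; the values are invented for this example.
metadata = {'next_results': '?max_id=421482533256044542&q=xbox&count=100&include_entities=1'}
print(next_max_id(metadata))  # 421482533256044542
print(next_max_id({}))        # None
```

The returned value can be passed as max_id to the next search call, and the loop can stop once the helper returns None.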
I'm very new to the Twitter API and was wondering: if I use the search API and call it every minute to retrieve about 1000 tweets, will I get duplicate tweets if fewer than 1000 tweets were created for a given criterion, or if I call it more often than once a minute?
I hope my question is clear. In case it matters, I use the python-twitter library.
The way I get tweets is:
self.api = twitter.Api(consumer_key, consumer_secret ,access_key, access_secret)
self.api.VerifyCredentials()
self.api.GetSearch(self.hashtag, per_page=100)
Your search results will overlap because the API has no idea what you searched before. One way to prevent the overlap is to use the tweet ID from the last retrieved tweet. Here is a Python 2.7 snippet from my code:
maxid = 10000000000000000000
for i in range(0,10):
    with open('output.json','a') as outfile:
        time.sleep(5) # don't piss off twitter
        print 'maxid=',maxid,', twitter loop',i
        results = api.GetSearch('search_term', count=100,max_id = maxid)
        for tweet in results:
            tweet = str(tweet).replace('\n',' ').replace('\r',' ') # remove new lines
            tweet = (json.loads(tweet))
            maxid = tweet['id'] - 1 # redefine maxid; max_id is inclusive, so go one below this tweet
            json.dump(tweet,outfile)
            outfile.write('\n') #print tweets on new lines
This code gives you 10 loops of 100 tweets, each starting from the last id, which is redefined every time through the loop. It then writes a JSON file (with one tweet per line). I use this code to search into the recent past, but you can adapt it to fetch non-overlapping newer tweets by changing 'max_id' to 'since_id'.
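For the since_id variant mentioned above, a minimal sketch of polling without overlap might look like this. It assumes since_id is exclusive (only tweets with a greater ID are returned); poll_new() and fake_search() are hypothetical names, and the fake timeline replaces a real search call.

```python
def poll_new(fetch_since, since_id=None):
    """Fetch tweets newer than `since_id` and return them with the updated marker.

    `fetch_since(since_id)` is a stand-in for a search call; it must return
    dicts with an integer 'id', all greater than since_id.
    """
    tweets = fetch_since(since_id)
    if tweets:
        since_id = max(tweet['id'] for tweet in tweets)
    return tweets, since_id


# Fake timeline that grows between polls.
timeline = [{'id': 1}, {'id': 2}]


def fake_search(since_id):
    return [t for t in timeline if since_id is None or t['id'] > since_id]


first, marker = poll_new(fake_search)
timeline.append({'id': 3})
second, marker = poll_new(fake_search, marker)
print([t['id'] for t in first], [t['id'] for t in second], marker)  # [1, 2] [3] 3
```

Keeping the highest ID seen so far and passing it back on the next call is what prevents the one-minute polls from returning the same tweets again.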