Tweepy returns only 76 tweets - python

I am trying to gather movie reviews from Twitter. However, I get only 76 tweets. I tried to catch TweepError, but that doesn't help. Here is my code:
import tweepy
import time
import cPickle as pickle

auth = tweepy.OAuthHandler(**hidden**)
auth.set_access_token(**hidden**)
api = tweepy.API(auth)

def limit_handled(cursor):
    while True:
        try:
            yield cursor.next()
            print "I am awake..."
        except tweepy.TweepError:
            print "going to sleep..."
            time.sleep(15 * 60)
        except StopIteration:
            break

query = '#moviereview -filter:links'
max_tweets = 1000000
searched_tweets = [status.text for status in limit_handled(tweepy.Cursor(api.search, q=query).items(max_tweets))]

with open("twitter_reviews.pkl", "wb") as f:
    pickle.dump(searched_tweets, f, -1)

print len(searched_tweets)

Try modifying your query parameters; it is your query, not your code, that is filtering out further results.
Query for:
'#moviereview -filter:links'
provides 78 results (and counting)
Query for:
'#moviereview'
provides 1713 results (and counting)
Query for:
'#moviereview filter:links'
provides 4534 results (and counting)
And as @Ethan mentioned, per Twitter's API documentation (https://dev.twitter.com/rest/public/search):
The Twitter Search API searches against a sampling of recent Tweets
published in the past 7 days.
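For illustration, a minimal sketch (assuming tweepy 3.x and the authenticated api object from the question) that counts how many results each query variant actually returns within that 7-day window:

import tweepy

queries = ['#moviereview -filter:links', '#moviereview', '#moviereview filter:links']
for q in queries:
    # Cursor pages through everything the Search API returns for this query;
    # counting all items may take a while and can hit rate limits.
    n = sum(1 for _ in tweepy.Cursor(api.search, q=q).items())
    print("{0!r}: {1} tweets".format(q, n))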

Related

How do I set my result type to be recent, in tweepy python?

I'm trying to make a Twitter bot that retweets the 5 most recent posts in #fortnite on Twitter. The only problem I'm having is how to set the result type to recent posts only, so the bot only retweets the most recent posts in #Fortnite.
import tweepy as twitter
import keys
import time, datetime

auth = twitter.OAuthHandler(keys.API_KEY, keys.API_SECRET_KEY)
auth.set_access_token(keys.ACCESS_TOKEN, keys.SECRET_ACCESS_TOKEN)
api = twitter.API(auth)

def twitter_bot(hashtag, delay):
    while True:
        print(f"\n{datetime.datetime.now()}\n")
        for tweets in twitter.Cursor(api.search_tweets, q=hashtag, count=10).items(5):
            try:
                tweet_id = dict(tweets._json)["id"]
                tweet_text = dict(tweets._json)["text"]
                print("id: " + str(tweet_id))
                print("text: " + str(tweet_text))
                api.retweet(tweet_id)
            except:
                print("error")
        time.sleep(delay)

twitter_bot("#Fortnite", 10)
In the Twitter API docs, it says the following -
result_type
Specifies what type of search results you would prefer to receive. The current default is "mixed." Valid values include:
mixed : Include both popular and real time results in the response.
recent : return only the most recent results in the response
popular : return only the most popular results in the response.
Link to docs - https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets
If you know how to implement this into this code, I'd appreciate the help!
The result_type parameter is for the search tweets API, so:
import tweepy
import keys
import time, datetime

auth = tweepy.OAuthHandler(keys.API_KEY, keys.API_SECRET_KEY)
auth.set_access_token(keys.ACCESS_TOKEN, keys.SECRET_ACCESS_TOKEN)
api = tweepy.API(auth)

def twitter_bot(hashtag, delay):
    while True:
        print(f"\n{datetime.datetime.now()}\n")
        for tweets in api.search_tweets(q=hashtag, result_type="recent", count=5):
            try:
                tweet_id = dict(tweets._json)["id"]
                tweet_text = dict(tweets._json)["text"]
                print("id: " + str(tweet_id))
                print("text: " + str(tweet_text))
                api.retweet(tweet_id)
            except:
                print("error")
        time.sleep(delay)

twitter_bot("#Fortnite", 10)

How do I pull tweets from a user timeline for specific Covid-related keywords in Python?

I want to retrieve:
● at least 1000 tweets from a {user} timeline, replies included
● at least 100 tweets of the 1000 related to Covid-19 keywords like ["covid19", "wuhan", "mask", "lockdown", "quarantine", "sars-cov-2"] etc.
I wrote the function to retrieve the tweets:
def get_tweets_by_user(self, screen_name):
    '''
    Use user_timeline api to fetch POI related tweets, some postprocessing may be required.
    :return: List
    '''
    result = []
    tweets = api.user_timeline(screen_name=screen_name,
                               # 200 is the maximum allowed count
                               count=200,
                               include_rts=True,
                               # Necessary to keep full_text,
                               # otherwise only the first 140 characters are extracted
                               tweet_mode='extended')
    for tw in tweets:
        result.append(tw)
    return result
Now how do I retrieve 100 tweets related to Covid-19 keywords from the user timeline?
Register for the Twitter developer API. You'll need a couple of consumer keys. Tell them you're a student.
import json
import twitter  # install the python-twitter library to work with the Twitter dev API

consumer_key = "your key"
consumer_secret = "your key"
access_token = "your key"
access_token_secret = "your key"

api = twitter.Api(consumer_key=consumer_key,
                  consumer_secret=consumer_secret,
                  access_token_key=access_token,
                  access_token_secret=access_token_secret)

FILTER = ["covid-19 string here"]  # PUT YOUR COVID-19 STRING HERE
LANGUAGES = ['en']
store_file = "outputfileforcovidtweets.txt"
_location = ["put coordinates here"]

def main():
    with open(store_file, 'a') as z:
        for line in api.GetStreamFilter(track=FILTER, languages=LANGUAGES, locations=_location):
            z.write(json.dumps(line))
            z.write('\n')

main()
This will collect real-time tweets to your output file. :)
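If you would rather stay with the user_timeline approach from the question, one option is to page through the timeline with a Cursor and filter the fetched tweets against the keyword list afterwards. A minimal sketch, assuming tweepy and an authenticated api object; fetch_timeline and filter_covid_tweets are hypothetical helper names:

import tweepy

COVID_KEYWORDS = ["covid19", "wuhan", "mask", "lockdown", "quarantine", "sars-cov-2"]

def fetch_timeline(api, screen_name, n=1000):
    # Cursor pages through user_timeline 200 tweets at a time until n tweets are collected.
    return list(tweepy.Cursor(api.user_timeline,
                              screen_name=screen_name,
                              count=200,
                              include_rts=True,
                              tweet_mode='extended').items(n))

def filter_covid_tweets(tweets, keywords=COVID_KEYWORDS):
    # Case-insensitive substring match on the full tweet text.
    return [tw for tw in tweets
            if any(kw in tw.full_text.lower() for kw in keywords)]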

Excluding Search Terms in Python & Tweepy?

I want to search for #gaming, but exclude all tweets that have #stream in them. I'm using Python & Tweepy in Visual Studio Code. I have the below code, but when I run it still includes tweets with #stream. I looked up the Twitter API filtering rules and this is what it had. Could anyone help me with my code? Thanks!
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

user2 = api.me()
print(user2.name)

def main():
    search = ("#gaming -#stream")
    numberoftweets = 2
    for tweet in tweepy.Cursor(api.search, search).items(numberoftweets):
        try:
            tweet.favorite()
            tweet.retweet()
            api.create_friendship(tweet.user.id)
            print("Tweet Retweeted and followed")
        except tweepy.TweepError as e:
            print(e.reason)
        except StopIteration:
            break

main()
The search rules are here. I checked your search parameters and they seem to be working, so it's not clear where you are seeing #stream. It takes 10-15 seconds to pull 500 tweets and build a hashtag list, but I've run this several times and found no #stream. If I remove the - from in front of #stream, the hashtag is in the list; it works as expected.
import pandas as pd
search = ("#gaming -#stream")
numberoftweets = 500
tweets = tweepy.Cursor(api.search, search).items(numberoftweets)
df_hold_list = [pd.DataFrame([tweet._json.get('entities').get('hashtags')]) for tweet in tweets] # note brackets around json dictionary
df_hashtags = pd.concat(df_hold_list, axis=0).reset_index(drop=True)
df_hashtags.iloc[0]
hashtag_list = df_hashtags.stack().apply(lambda x: '' if pd.isnull(x) else x['text']).unique().tolist()
hashtag_list.sort()
print(hashtag_list)
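As a quicker spot-check, a sketch (using the same api object and search string) that asserts no returned tweet carries the excluded hashtag:

for tweet in tweepy.Cursor(api.search, q="#gaming -#stream").items(100):
    # Status.entities holds the parsed hashtags for each tweet.
    tags = [h["text"].lower() for h in tweet.entities.get("hashtags", [])]
    assert "stream" not in tags, tweet.id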

Questions about the Tweepy API

1. The API stream.filter(). I read the documentation, which said that all parameters can be optional. However, when I left it empty, it didn't work.
2. Still a question about this API. It is said that if I write code like below:
twitter_stream.filter(locations = [-180,-90, 180, 90])
it can filter all tweets with geographical information. However, when I check the JSON data, I still find many tweets whose geo attribute is null.
3. I tried to use the stream to get as many tweets as possible. However, it is said that it only gets tweets in real time. Are there any parameters to set the time, like collecting tweets from 2013 to 2015?
4. I tried to collect data through users and their followers, continuing the same step until I get as many tweets as I want. So my code is like below:
import tweepy
import chardet
import json
import sys

# set one global list to store all user_names
users_unused = ["Raithan8"]
users_used = []

def process_or_store(tweet):
    print(json.dumps(tweet))

consumer_key =
consumer_secret =
access_token =
access_token_secret =

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

def getAllTweets():
    # initialize one empty list to store all tweets
    screen_name = users_unused[0]
    users_unused.remove(screen_name)
    users_used.append(screen_name)
    print("this is the current user: " + screen_name)
    for friend in tweepy.Cursor(api.friends, screen_name=screen_name).items():
        if friend.screen_name not in users_unused and friend.screen_name not in users_used:
            users_unused.append(friend.screen_name)
    for follower in tweepy.Cursor(api.followers, screen_name=screen_name).items():
        if follower.screen_name not in users_unused and follower.screen_name not in users_used:
            users_unused.append(follower.screen_name)
    print(users_unused)
    print(users_used)
    alltweets = []
    # tweepy returns at most 200 tweets each time
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    alltweets.extend(new_tweets)
    if not alltweets:
        return alltweets
    oldest = alltweets[-1].id - 1
    while len(new_tweets) > 0:
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
        alltweets.extend(new_tweets)
        oldest = alltweets[-1].id - 1
    return alltweets

def storeTweets(alltweets, file_name="tweets.json"):
    for tweet in alltweets:
        json_data = tweet._json
        data = json.dumps(tweet._json)
        with open(file_name, "a") as f:
            if json_data['geo'] is not None:
                f.write(data)
                f.write("\n")

if __name__ == "__main__":
    while True:
        if not users_unused:
            break
        storeTweets(getAllTweets())
I don't know why it runs so slowly. Maybe it is mainly because I initialize the tweepy API as below:
api = tweepy.API(auth, wait_on_rate_limit=True)
But if I don't initialize it that way, it raises the error below:
raise RateLimitError(error_msg, resp)
tweepy.error.RateLimitError: [{'message': 'Rate limit exceeded', 'code': 88}]
2) There's a difference between a tweet with coordinates and filtering by location.
Filtering by location means that the sender is located within the range of your filter. If you set it globally with twitter_stream.filter(locations = [-180,-90, 180, 90]), it will also return tweets from people who merely set a country name in their preferences.
If you need to filter by coordinates (a tweet that has coordinates), you can take a look at my blog post. But basically, you need to set up a listener and then check whether the tweet has coordinates.
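A minimal sketch of such a listener, assuming tweepy 3.x's StreamListener API and the auth object from the question:

import tweepy

class CoordinateListener(tweepy.StreamListener):
    def on_status(self, status):
        # Keep only tweets that carry exact GPS coordinates,
        # not just a place or a user-declared location.
        if status.coordinates is not None:
            print(status.coordinates, status.text)

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. on a 420 rate-limit error.
        if status_code == 420:
            return False

stream = tweepy.Stream(auth=auth, listener=CoordinateListener())
stream.filter(locations=[-180, -90, 180, 90])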
3 and 4) Twitter's Search API and Twitter's Streaming API are different in many ways, including their restrictions on rate limits (in Tweepy) and Twitter's own rate limits.
You are limited in how many past tweets you can get.
Check the Tweepy API again: wait_on_rate_limit set to True just waits until your current rate-limit window is available again. That's why it's "slow", as you said.
The Streaming API, however, doesn't have such restrictions.
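For visibility into those waits, tweepy 3.x also accepts a notify flag, so the client logs whenever it sleeps on a rate limit (a one-line sketch):

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)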

How to get all tweets in a hashtag with tweepy?

I'm trying to collect every public tweet in a hashtag, but my code does not go further than 299 tweets.
I'm also trying to collect tweets from a specific time range, like tweets only between May 2015 and July 2016. Is there any way to do it in the main process, or should I write a little code for it?
Here is my code:
# if this is the first time, create a new array which
# will store the max id of the tweets for each keyword
if not os.path.isfile("max_ids.npy"):
    max_ids = np.empty(len(keywords))
    # every value is initialized as -1 in order to start from the beginning the first time the program runs
    max_ids.fill(-1)
else:
    max_ids = np.load("max_ids.npy")  # loads the previous max ids
    # if any new keywords were added, extend the max_ids array so it corresponds to every keyword
    if len(keywords) > len(max_ids):
        new_indexes = np.empty(len(keywords) - len(max_ids))
        new_indexes.fill(-1)
        max_ids = np.append(arr=max_ids, values=new_indexes)

count = 0
for i in range(len(keywords)):
    since_date = "2015-01-01"
    sinceId = None
    tweetCount = 0
    maxTweets = 5000000000000000000000  # maximum tweets to find per keyword
    tweetsPerQry = 100
    searchQuery = "#{0}".format(keywords[i])
    while tweetCount < maxTweets:
        if max_ids[i] < 0:
            if not sinceId:
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
            else:
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                        since_id=sinceId)
        else:
            if not sinceId:
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                        max_id=str(max_ids[i] - 1))
            else:
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                        max_id=str(max_ids[i] - 1),
                                        since_id=sinceId)
        if not new_tweets:
            print("Keyword: {0} No more tweets found".format(searchQuery))
            break
        for tweet in new_tweets:
            count += 1
            print(count)
            file_write.write(
                .
                .
                .
            )
            item = {
                .
                .
                .
                .
                .
            }
            # instead of using mongo's id for _id, use the tweet's id
            raw_data = tweet._json
            raw_data["_id"] = tweet.id
            raw_data.pop("id", None)
            try:
                db["Tweets"].insert_one(item)
            except pymongo.errors.DuplicateKeyError as e:
                print("Already exists in 'Tweets' collection.")
            try:
                db["RawTweets"].insert_one(raw_data)
            except pymongo.errors.DuplicateKeyError as e:
                print("Already exists in 'RawTweets' collection.")
        tweetCount += len(new_tweets)
        print("Downloaded {0} tweets".format(tweetCount))
        max_ids[i] = new_tweets[-1].id
        np.save(arr=max_ids, file="max_ids.npy")  # save in order to continue mining from where we left off the next time the program runs
Have a look at this: https://tweepy.readthedocs.io/en/v3.5.0/cursor_tutorial.html
And try this:
import tweepy

auth = tweepy.OAuthHandler(CONSUMER_TOKEN, CONSUMER_SECRET)
api = tweepy.API(auth)

for tweet in tweepy.Cursor(api.search, q='#python', rpp=100).items():
    # Do something
    pass
In your case you have a max number of tweets to get, so as per the linked tutorial you could do:
import tweepy

MAX_TWEETS = 5000000000000000000000

auth = tweepy.OAuthHandler(CONSUMER_TOKEN, CONSUMER_SECRET)
api = tweepy.API(auth)

for tweet in tweepy.Cursor(api.search, q='#python', rpp=100).items(MAX_TWEETS):
    # Do something
    pass
If you want tweets after a given ID, you can also pass that argument.
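For example, a sketch of the same Cursor call with since_id (LAST_SEEN_ID is a hypothetical placeholder for the newest tweet ID you already have):

for tweet in tweepy.Cursor(api.search, q='#python', rpp=100, since_id=LAST_SEEN_ID).items(MAX_TWEETS):
    # Only tweets newer than LAST_SEEN_ID are returned.
    pass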
Sorry, I can't answer in comment, too long. :)
Sure :) Check this example:
An advanced search for the #data keyword from May 2015 to July 2016
gives this URL: https://twitter.com/search?l=&q=%23data%20since%3A2015-05-01%20until%3A2016-07-31&src=typd
import requests

session = requests.session()
keyword = 'data'
date1 = '2015-05-01'
date2 = '2016-07-31'
url = ('https://twitter.com/search?l=&q=%23{0}%20since%3A{1}%20until%3A{2}&src=typd'
       .format(keyword, date1, date2))
response = session.get(url, stream=True)
Now we have all the requested tweets.
You will probably have problems with 'pagination'.
Pagination URL ->
https://twitter.com/i/search/timeline?vertical=news&q=%23data%20since%3A2015-05-01%20until%3A2016-07-31&src=typd&include_available_features=1&include_entities=1&max_position=TWEET-759522481271078912-759538448860581892-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA&reset_error_state=false
You could probably put a random tweet ID, or parse it first, or request some data from Twitter. It can be done.
Use Chrome's networking tab to find all the requested information :)
This code worked for me.
import tweepy
import pandas as pd
import os

# Twitter Access
auth = tweepy.OAuthHandler('xxx', 'xxx')
auth.set_access_token('xxx-xxx', 'xxx')
api = tweepy.API(auth, wait_on_rate_limit=True)

df = pd.DataFrame(columns=['text', 'source', 'url'])
msgs = []
msg = []

for tweet in tweepy.Cursor(api.search, q='#bmw', rpp=100).items(10):
    msg = [tweet.text, tweet.source, tweet.source_url]
    msg = tuple(msg)
    msgs.append(msg)

df = pd.DataFrame(msgs)
Check the Twitter API documentation; it probably allows only around 300 tweets to be parsed.
I would recommend forgetting the API and doing it with requests, with streaming. The API is an implementation on top of requests, with limitations.
