The problem I'm facing is that whenever I try to retrieve tweets for a hashtag, most of the results are retweets of an original tweet, and they all repeat the same like and retweet counts. For example, if I have a tweet with over 100 likes and 20 retweets, and there are 10 retweets of it, all 10 of those rows will show 100 likes and 20 retweets, which is redundant data. This is a very big issue, especially because I usually retrieve about 5,000 - 10,000 tweets for analysis.
Code:
import tweepy
import configparser
import pandas as pd
api_key = ''
api_key_secret = ''
access_token = ''
access_token_secret = ''
auth = tweepy.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
# user = '#elonmusk'
keywords = '#AsiaCup2022'
limit = 10000
tweets = tweepy.Cursor(api.search_tweets, q = keywords, count = 100, tweet_mode = 'extended').items(limit)
# tweets = api.user_timeline(screen_name = user, count = limit, tweet_mode = 'extended')
columns = ['User', 'Tweet', 'Likes', 'Retweets']
data = []
for tweet in tweets:
    try:
        data.append([tweet.user.screen_name, tweet.full_text, tweet.retweeted_status.favorite_count, tweet.retweet_count])
    except AttributeError:
        data.append([tweet.user.screen_name, tweet.full_text, tweet.favorite_count, tweet.retweet_count])
df = pd.DataFrame(data, columns=columns)
df.to_excel("Cup2022.xlsx")
An example of my issue: the same tweet has been retweeted by two different users, and both retweets carry the same like and retweet count as the original tweet. Any help would be appreciated; this is a really big problem for me because it skews my entire result.
You can exclude the retweets from the results of your search.
keywords = '#AsiaCup2022 -filter:retweets'
That way you will get only the original tweets and you will avoid that redundancy.
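As a quick sketch, the filter can also be composed programmatically, which keeps the operator out of the hashtag string (build_query is a hypothetical helper, not part of Tweepy):

```python
def build_query(hashtag, exclude_retweets=True, exclude_replies=False):
    # Compose a standard-search query from a hashtag plus
    # Twitter search operators such as -filter:retweets.
    parts = [hashtag]
    if exclude_retweets:
        parts.append("-filter:retweets")
    if exclude_replies:
        parts.append("-filter:replies")
    return " ".join(parts)

keywords = build_query("#AsiaCup2022")  # "#AsiaCup2022 -filter:retweets"
```

The resulting string can be passed as the `q` argument of the Cursor call exactly as in the question.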
Through the basic Academic Research developer account, I'm using the Tweepy API to collect tweets containing specified keywords or hashtags; this allows me to collect up to 10,000,000 tweets per month. Using the full-archive search, I'm trying to collect tweets from one whole calendar date at a time. I've gotten a rate-limit error despite the wait_on_rate_limit flag being set to True, and now there's an error about the request limit.
Here is the code:
import pandas as pd
import tweepy
# function to display data of each tweet
def printtweetdata(n, ith_tweet):
    print()
    print(f"Tweet {n}:")
    print(f"Username:{ith_tweet[0]}")
    print(f"tweet_ID:{ith_tweet[1]}")
    print(f"userID:{ith_tweet[2]}")
    print(f"creation:{ith_tweet[3]}")
    print(f"location:{ith_tweet[4]}")
    print(f"text:{ith_tweet[5]}")
    print(f"likes:{ith_tweet[6]}")
    print(f"retweets:{ith_tweet[7]}")
    print(f"hashtags:{ith_tweet[8]}")

# function to perform data extraction
def scrape(words, numtweet, since_date, until_date):
    # Creating DataFrame using pandas
    db = pd.DataFrame(columns=['username', 'tweet_ID', 'userID',
                               'creation', 'location', 'text', 'likes', 'retweets', 'hashtags'])
    # We are using .Cursor() to search through Twitter for the required tweets.
    # The number of tweets can be restricted using .items(number of tweets)
    tweets = tweepy.Cursor(api.search_full_archive, 'research', query=words,
                           fromDate=since_date, toDate=until_date).items(numtweet)
    # .Cursor() returns an iterable object. Each item in
    # the iterator has various attributes that you can access to
    # get information about each tweet
    list_tweets = [tweet for tweet in tweets]
    # Counter to maintain tweet count
    i = 1
    # iterate over each tweet in the list, extracting information about each one
    for tweet in list_tweets:
        username = tweet.user.screen_name
        tweet_ID = tweet.id
        userID = tweet.author.id
        creation = tweet.created_at
        location = tweet.user.location
        likes = tweet.favorite_count
        retweets = tweet.retweet_count
        hashtags = tweet.entities['hashtags']
        # Retweets can be distinguished by a retweeted_status attribute;
        # when that attribute is missing, the except block runs
        try:
            text = tweet.retweeted_status.full_text
        except AttributeError:
            text = tweet.text
        hashtext = list()
        for j in range(0, len(hashtags)):
            hashtext.append(hashtags[j]['text'])
        # Append all the extracted information to the DataFrame
        ith_tweet = [username, tweet_ID, userID,
                     creation, location, text, likes, retweets, hashtext]
        db.loc[len(db)] = ith_tweet
        # Function call to print tweet data on screen
        printtweetdata(i, ith_tweet)
        i = i + 1
    filename = 'C:/Users/USER/Desktop/الجامعة الالمانية/output/twitter.csv'
    # Save the DataFrame as a CSV file
    db.to_csv(filename)

if __name__ == '__main__':
    consumer_key = "####"
    consumer_secret = "###"
    access_token = "###"
    access_token_secret = "###"
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)
    since_date = '200701010000'
    until_date = '202101012359'
    words = "#USA"
    # number of tweets you want to extract in one run
    numtweet = 1000
    scrape(words, numtweet, since_date, until_date)
    print('Scraping has completed!')
I got this error:
TooManyRequests: 429 Too Many Requests
Request exceeds account’s current package request limits. Please upgrade your package and retry or contact Twitter about enterprise access.
Unfortunately, I believe this is due to the Sandbox quota; a paid premium package allows more requests per month.
See the Tweepy API documentation for the rate-limit details.
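Note that wait_on_rate_limit only sleeps through the 15-minute window limits; it cannot help with a monthly package quota, which is exhausted for the whole billing period once hit. One defensive option, sketched here under the assumption that you know your package's monthly request cap (RequestBudget is a hypothetical helper, not part of Tweepy), is to stop paginating client-side before the quota runs out:

```python
class RequestBudget:
    """Client-side guard for a fixed monthly request quota."""

    def __init__(self, monthly_limit):
        self.monthly_limit = monthly_limit
        self.used = 0

    def spend(self):
        # Raise *before* the request is made once the budget is gone,
        # instead of letting the API answer with a 429.
        if self.used >= self.monthly_limit:
            raise RuntimeError("monthly request quota exhausted")
        self.used += 1

    @property
    def remaining(self):
        return self.monthly_limit - self.used
```

Calling budget.spend() once per page request (i.e. once per underlying search_full_archive call, not once per tweet) keeps the scraper from burning the quota mid-run.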
So I created a hashtag #tweet230720211255 and I want to download each and every tweet posted with this hashtag from the last 7 days. I used Tweepy, and so far it does a good job of downloading tweets, but there is a problem I'm facing.
Tweepy only downloads the tweets that have text in them. What I mean is that if you post a tweet with no text except the hashtag, it won't get downloaded. I would appreciate some help here. The code I have used is below:
#Scraping
import tweepy # for tweet mining
import csv # to read and write csv files
import glob
#Processing
import pandas as pd
import preprocessor as p
import requests
import string
import re # In-built regular expressions library
from collections import Counter
CONSUMER_KEY = 'xxxx'
CONSUMER_SECRET = 'xxxx'
ACCESS_KEY = 'xxxx'
ACCESS_SECRET = 'xxxx'
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET) # Pass in Consumer key and secret for authentication by API
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET) # Pass in Access key and secret for authentication by API
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True) # Sleeps when API limit is reached
def get_tweets_in_data_range(data, phrase, max_tweets, end_date, start_date=None, start_via_tweet_id=None):
    search_query = phrase
    # + " -filter:links AND -filter:retweets AND -filter:replies"  # Exclude links, retweets, replies
    for i in tweepy.Cursor(api.search, q=search_query, since=start_date, until=end_date, since_id=start_via_tweet_id, lang="en", tweet_mode="extended").items(max_tweets):
        data.append([i.full_text, i.id, i.created_at, i.coordinates, i.retweet_count, i.favorite_count])  # Embedded Twitter parameters

scraped_data = []
scraped_data.append(["text", "id", "time", "location", "retweet_count", "fav_count"])
PHRASE = "\"#tweet230720211255\""
MAX_TWEETS = 1000  # Maximum number of tweets to scrape
START_DATE = '2021-07-23'  # only last 7 days supported
END_DATE = '2021-07-27'  # only last 7 days supported
get_tweets_in_data_range(scraped_data, PHRASE, MAX_TWEETS, END_DATE, START_DATE)  # call to get tweets between date ranges

cursor = tweepy.Cursor(api.user_timeline, id='burgerking', tweet_mode="extended").items(1)
for i in cursor:
    print(dir(i))

tweets = pd.DataFrame(scraped_data[1:], columns=scraped_data[0])
tweets_csv = tweets.to_csv('download.csv', index=True)  # saves a csv file with the data scraped

with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):
    print(tweets)
Please point out where I'm going wrong.
I'd like to get Tweets with #MeTooMen using Tweepy.
There are many Tweets using this hashtag as far as I can tell from searching Twitter, but I get 0 results when I try to get these Tweets with Tweepy. Do you have any idea what I can do to improve this code?
import os
import tweepy as tw
import pandas as pd
api_key = '*'
api_secret_key = '*'
access_token = '*'
access_token_secret = '*'
auth = tw.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
# Define the search term and the date_since date as variables
search_words = "#metoomen"
date_since = "2017-10-17"
date_until = "2018-01-31"
tweets = tw.Cursor(api.search,
                   q=search_words,
                   lang="en",
                   since=date_since,
                   until=date_until).items(5)
users_locs = [[tweet.user.screen_name, tweet.user.location, tweet.text] for tweet in tweets]
users_locs
>>> []
API.search uses Twitter's standard search API and doesn't accept date_since or date_until parameters:
Keep in mind that the search index has a 7-day limit. In other words, no tweets will be found for a date older than one week.
https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/guides/standard-operators also says:
[It] is not a complete index of all Tweets, but instead an index of recent Tweets. The index includes between 6-9 days of Tweets.
You'll need to use the Full-archive premium search API endpoint, with API.search_full_archive, instead.
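Unlike the standard search, search_full_archive takes fromDate and toDate strings at minute granularity (yyyymmddhhmm). A small sketch of converting the dates above (to_premium_ts is a hypothetical helper):

```python
from datetime import datetime

def to_premium_ts(d):
    # Premium search endpoints expect yyyymmddhhmm timestamps.
    return d.strftime("%Y%m%d%H%M")

from_date = to_premium_ts(datetime(2017, 10, 17))        # "201710170000"
to_date = to_premium_ts(datetime(2018, 1, 31, 23, 59))   # "201801312359"
```

These strings can then be passed as `fromDate=from_date, toDate=to_date` to API.search_full_archive, along with your premium environment label.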
I am using code that works fine; I have included the whole script, taken from GeeksforGeeks. But I want to modify it to add referenced_tweets.type. I am new to APIs and really want to understand how to fix this.
import pandas as pd
import tweepy
# function to display data of each tweet
def printtweetdata(n, ith_tweet):
    print()
    print(f"Tweet {n}:")
    print(f"Username:{ith_tweet[0]}")
    print(f"likes:{ith_tweet[1]}")
    print(f"Location:{ith_tweet[2]}")
    print(f"Following Count:{ith_tweet[3]}")
    print(f"Follower Count:{ith_tweet[4]}")
    print(f"Total Tweets:{ith_tweet[5]}")
    print(f"Retweet Count:{ith_tweet[6]}")
    print(f"Tweet Text:{ith_tweet[7]}")
    print(f"Hashtags Used:{ith_tweet[8]}")

# function to perform data extraction
def scrape(words, date_since, numtweet):
    # Creating DataFrame using pandas
    db = pd.DataFrame(columns=['username', 'likes', 'location', 'following',
                               'followers', 'totaltweets', 'retweetcount', 'text', 'hashtags'])
    # We are using .Cursor() to search through Twitter for the required tweets.
    # The number of tweets can be restricted using .items(number of tweets)
    tweets = tweepy.Cursor(api.search, q=words, lang="en",
                           since=date_since, tweet_mode='extended').items(numtweet)
    # .Cursor() returns an iterable object. Each item in
    # the iterator has various attributes that you can access to
    # get information about each tweet
    list_tweets = [tweet for tweet in tweets]
    # Counter to maintain tweet count
    i = 1
    # iterate over each tweet in the list, extracting information about each one
    for tweet in list_tweets:
        username = tweet.user.screen_name
        likes = tweet.favorite_count
        location = tweet.user.location
        following = tweet.user.friends_count
        followers = tweet.user.followers_count
        totaltweets = tweet.user.statuses_count
        retweetcount = tweet.retweet_count
        hashtags = tweet.entities['hashtags']
        # Retweets can be distinguished by a retweeted_status attribute;
        # when that attribute is missing, the except block runs
        try:
            text = tweet.retweeted_status.full_text
        except AttributeError:
            text = tweet.full_text
        hashtext = list()
        for j in range(0, len(hashtags)):
            hashtext.append(hashtags[j]['text'])
        # Append all the extracted information to the DataFrame
        ith_tweet = [username, likes, location, following,
                     followers, totaltweets, retweetcount, text, hashtext]
        db.loc[len(db)] = ith_tweet
        # Function call to print tweet data on screen
        printtweetdata(i, ith_tweet)
        i = i + 1
    filename = 'etihad.csv'
    # Save the DataFrame as a CSV file
    db.to_csv(filename)

if __name__ == '__main__':
    # Enter your own credentials obtained
    # from your developer account
    consumer_key = ""
    consumer_secret = ""
    access_key = ""
    access_secret = ""
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)
    # Enter hashtag and initial date
    print("Enter Twitter HashTag to search for")
    words = input()
    print("Enter Date since The Tweets are required in yyyy-mm-dd")
    date_since = input()
    # number of tweets you want to extract in one run
    numtweet = 100
    scrape(words, date_since, numtweet)
    print('Scraping has completed!')
I now want to add referenced_tweets.type in order to tell whether a Tweet is a Retweet or not, but I'm not sure how to do it. Can someone help?
API.search uses the standard search API, part of Twitter API v1.1.
referenced_tweets is a value that can be set for tweet.fields, a Twitter API v2 fields parameter.
Currently, if you want to use Twitter API v2 through Tweepy, you'll have to use the development version of Tweepy on the master branch and its Client class. Otherwise, you'll need to wait until Tweepy v4.0 is released.
Alternatively, if your only goal is to determine whether a Status/Tweet object is a Retweet or not, you can simply check for the retweeted_status attribute.
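For the v1.1 Status objects this script already works with, that check can be a one-liner (is_retweet is an illustrative helper, not a Tweepy method):

```python
def is_retweet(status):
    # v1.1 Status objects carry a retweeted_status attribute
    # only when the tweet is a Retweet.
    return hasattr(status, "retweeted_status")
```

In the scrape loop, is_retweet(tweet) could populate an extra DataFrame column alongside text, which gives the same information as referenced_tweets.type == "retweeted" in API v2.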
I have some Python code here that retrieves up to 200 Tweets from each of the USA Democratic political candidates' Twitter accounts. I have it set to exclude replies and Retweets, so it actually returns fewer. I know that you can return at most 200 Tweets per call, but you can make multiple calls (up to 180 in a 15-minute window), which would return many more Tweets. My question is how to make multiple calls while still returning the data in the pandas DataFrame format I currently have set up. Thanks!
import datetime as dt
import os
import numpy as np
import pandas as pd
import tweepy as tw
#define developer's permissions
consumer_key = 'xxxxxxxx'
consumer_secret = 'xxxxxxxx'
access_token = 'xxxxxx'
access_token_secret = 'xxxxxxx'
#access twitter's API
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
#function collects tweets from
def get_tweets(handle):
    try:
        tweets = api.user_timeline(screen_name=handle,
                                   count=200,
                                   exclude_replies=True,
                                   include_rts=False,
                                   tweet_mode="extended")
        print(handle, "Number of tweets extracted: {}\n".format(len(tweets)))
        df = pd.DataFrame(data=[tweet.user.screen_name for tweet in tweets], columns=['handle'])
        df['tweets'] = np.array([tweet.full_text for tweet in tweets])
        df['date'] = np.array([tweet.created_at for tweet in tweets])
        df['len'] = np.array([len(tweet.full_text) for tweet in tweets])
        df['like_count'] = np.array([tweet.favorite_count for tweet in tweets])
        df['rt_count'] = np.array([tweet.retweet_count for tweet in tweets])
    except Exception:
        # Fall back to an empty frame so the caller's concat still works
        df = pd.DataFrame()
    return df

# list of all the candidate twitter handles
handles = ['#JoeBiden', '#ewarren', '#BernieSanders', '#MikeBloomberg', '#PeteButtigieg', '#AndrewYang', '#AmyKlobuchar']
df = pd.DataFrame()
# loop through the different candidate twitter handles and collect each candidate's tweets
for handle in handles:
    df_new = get_tweets(handle)
    df = pd.concat((df, df_new))
#JoeBiden Number of tweets extracted: 200.
#ewarren Number of tweets extracted: 200.
#BernieSanders Number of tweets extracted: 200.
#MikeBloomberg Number of tweets extracted: 200.
#PeteButtigieg Number of tweets extracted: 200.
#AndrewYang Number of tweets extracted: 200.
#AmyKlobuchar Number of tweets extracted: 200.
First of all, you're going to want to regenerate your credentials now.
You can iterate through paginated results with a Cursor or by passing the since_id and/or max_id parameters for API.user_timeline.
See also the documentation for the GET statuses/user_timeline endpoint.
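The max_id approach can be sketched independently of the network: fetch a page, then request the next page with max_id set to one less than the oldest id seen, until no tweets come back. Here fetch_page stands in for a real API.user_timeline call (paginate_timeline and fetch_page are hypothetical names):

```python
def paginate_timeline(fetch_page, pages):
    # Collect several pages the way API.user_timeline is paged:
    # each page arrives newest-first, and the next request asks
    # for ids at or below (oldest id seen - 1).
    collected, max_id = [], None
    for _ in range(pages):
        batch = fetch_page(max_id)
        if not batch:
            break
        collected.extend(batch)
        max_id = batch[-1].id - 1
    return collected
```

A real fetch_page would call something like api.user_timeline(screen_name=handle, count=200, max_id=max_id, tweet_mode="extended"), and the collected statuses could then be turned into a DataFrame exactly as in the question's get_tweets function.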
The Twitter API documentation explains why you get fewer results:
exclude_replies - "This parameter will prevent replies from appearing in the returned timeline. Using exclude_replies with the count parameter will mean you will receive up-to count tweets — this is because the count parameter retrieves that many Tweets before filtering out retweets and replies."