I am trying to scrape Twitter profiles for a project I am doing. I have the following code
from tweepy import OAuthHandler
import pandas as pd
"""I like to have my python script print a message at the beginning. This helps me confirm whether everything is set up correctly. And it's nice to get an uplifting message ;)."""
print("You got this!")
access_token = ''
access_token_secret = ''
consumer_key = ''
consumer_secret = ''
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
tweets = []
count = 1
"""Twitter will automatically sample the last 7 days of data. Depending on how many total tweets there are with the specific hashtag, keyword, handle, or key phrase that you are looking for, you can set the date back further by adding since= as one of the parameters. You can also manually add in the number of tweets you want to get back in the items() section."""
for tweet in tweepy.Cursor(api.search, q="#BNonnecke", count=450, since='2020-02-28').items(50000):
print(count)
count += 1
try:
data = [tweet.created_at, tweet.id, tweet.text, tweet.user._json['screen_name'], tweet.user._json['name'], tweet.user._json['created_at'], tweet.entities['urls']]
data = tuple(data)
tweets.append(data)
except tweepy.TweepError as e:
print(e.reason)
continue
except StopIteration:
break
df = pd.DataFrame(tweets, columns = ['created_at','tweet_id', 'tweet_text', 'screen_name', 'name', 'account_creation_date', 'urls'])
"""Add the path to the folder you want to save the CSV file in as well as what you want the CSV file to be named inside the single quotations"""
df.to_csv(path_or_buf = '/Users/Name/Desktop/FolderName/FileName.csv', index=False)
however, I keep getting the error "API" object has no attribute "search" from the line "for tweet in tweepy.Cursor(api.search, q="#BNonnecke", count=450, since='2020-02-28').items(50000):" I am not really sure why and don't know how to resolve this issue.
Thanks so much!
The latest version of Tweepy (v4 upwards) now has a search_tweets method instead of a search method. Check the documentation.
API.search_tweets(q, *, geocode, lang, locale, result_type, count, until, since_id, max_id, include_entities)
Also, read the comment in your code :-) The Search API has a 7 day history limit, so searching for Tweets since 2020-02-28 will only return Tweets posted in the 7 days before the date you run your code.
Related
I'm trying to use Tweepy to get the number of followers for a list of twitter accounts, some of which are inactive/suspended. So far I have imported an excel file where one column is all the twitter usernames and another column would be the number of followers for that given account.
I want to iterate through the twitter handle column, remove any inactive/suspended usernames from the dataframe, then iterate through the revised dataframe and populate the number of followers column with the number of followers for each account using Tweepy. I know this will have to be done with two for loops and am looking for help with the first: I'm stuck on how to iterate through the column and remove these inactive accounts.
Here's the head of the dataframe for reference - I'd like to populate the Twitter Followers column and replace the NaNs with the number of twitter followers.
So far I've written an exception into my code that flags for specific Tweepy errors like inactive/suspended accounts.
I need help writing the first for loop that removes rows from the dataframe for inactive twitter accounts. Currently when I run this I get no output.
My code so far:
import tweepy
import sys
import csv
from urllib.request import urlopen
try:
import json
except ImportError:
import simplejson as json
from datetime import datetime
import pandas as pd
CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
# Creation of the actual interface, using authentication
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
df = pd.read_excel('/Downloads/inc5k_twitterhandles_2.xlsx', sheet_name=0)
usernames = df['Twitter Handle'].tolist()
rows_to_drop = []
for i in df['Twitter Handle']:
try:
user_followers=api.friends_ids(id=i)# this returns the interger value of the user
except tweepy.error.TweepError as t:
if t.message[0]['code'] == 50: #Code for user not found error
rows_to_drop.append(i)
elif t.message[0]['code'] == 88: # The code for the rate limit error
time.sleep(15*5) # Sleep for 15 minutes
df.drop(df.index[rows_to_drop], inplace=True )
print(df.head())
I'm trying to write a simple python programme that uses the tweepy API for twitter and wget to retrieve the image link from a twitter post ID (Example: twitter.com/ExampleUsername/12345678), then download the image from the link. The actual programme works fine, but there is a problem. While it runs FOR every ID in the dictionary (if there are 2 IDs, it runs 2 times), it doesn't use every ID, so the script ends up looking at the last ID on the dictionary, then downloading the image from that same id however many times there is an ID in the dictionary. Does anyone know how to make the script run again for every ID?
tl;dr I want the programme to look at the first ID, grab its image link, download it, then do the same thing with the next ID until its done all of the IDs.
#!/usr/bin/env python
# encoding: utf-8
import tweepy #https://github.com/tweepy/tweepy
import wget
#Twitter API credentials
consumer_key = "nice try :)"
consumer_secret = "nice try :)"
access_key = "nice try :)"
access_secret = "my, this joke is getting really redundant"
def get_all_tweets():
#authorize twitter, initialize tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
id_list = [1234567890, 0987654321]
# Hey StackOverflow, these are example ID's. They won't work as they're not real twitter ID's, so if you're gonna run this yourself, you'll want to find some twitter IDs on your own
# tweets = api.statuses_lookup(id_list)
for i in id_list:
tweets = []
tweets.extend(api.statuses_lookup(id_=id_list, include_entities=True))
for tweet in tweets:
spacefiller = (1+1)
# this is here so the loop runs, if it doesn't the app breaks
a = len(tweets)
print(tweet.entities['media'][0]['media_url'])
url = tweet.entities['media'][0]['media_url']
wget.download(url)
get_all_tweets()
Thanks,
~CS
I figured it out!
I knew that loop was being used for something...
I moved everything from a = len(tweets to wget.download(url) into the for tweet in tweets: loop, and removed the for i in id_list: loop.
Thanks to tdelany this programme works now! Thanks everyone!
Here's the new code if anyone wants it:
#!/usr/bin/env python
# encoding: utf-8
import tweepy #https://github.com/tweepy/tweepy
import wget
#Twitter API credentials
consumer_key = "nice try :)"
consumer_secret = "nice try :)"
access_key = "nice try :)"
access_secret = "my, this joke is getting really redundant"
def get_all_tweets():
#authorize twitter, initialize tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
id_list = [1234567890, 0987654321]
# Hey StackOverflow, these are example ID's. They won't work as they're not real twitter ID's, so if you're gonna run this yourself, you'll want to find some twitter IDs on your own
tweets = []
tweets.extend(api.statuses_lookup(id_=id_list, include_entities=True))
for tweet in tweets:
a = len(tweets)
print(tweet.entities['media'][0]['media_url'])
url = tweet.entities['media'][0]['media_url']
wget.download(url)
get_all_tweets()
One strange thing I see is that the variable i declared in the outer loop is never used after on. Shouldn't your code be
tweets.extend(api.statuses_lookup(id_=i, include_entities=True))
and not id_=id_list as you wrote?
so I came up with this script to get the all of a user tweet from one twitter user
import tweepy
from tweepy import OAuthHandler
import json
def load_api():
Consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
return tweepy.API(auth)
api = load_api()
user = 'allkpop'
tweets = api.user_timeline(id=user, count=2000)
print('Found %d tweets' % len(tweets))
tweets_text = [t.text for t in tweets]
filename = 'tweets-'+user+'.json'
json.dump(tweets_text, open(filename, 'w'))
print('Saved to file:', filename)
but when I run it I can only get 200 tweets per request. Is there a way to get 2000 tweets or at least more than 2000 tweets?
please help me, thank you
The Twitter API has request limits. The one you're using corresponds to the Twitter statuses/user_timeline endpoint. The max number that you can get for this endpoint is documented as 3,200. Also note that there's a max number of requests in a 15-minute window, which might explain why you're only getting 2000, instead of the max. Here are a couple observations that might be interesting for you:
Documentation says that the max count is 200.
There's an include_rts (include retweets) parameter that might return more values. While it's part of the Twitter API, I can't see where Tweepy documents that.
You might try Tweepy Cursors to see if that will bring back more items.
Because of the 15 minute limits, you might be able to pause until the next 15 minute window to continue. That said, I don't know enough about your use case to know if this is practical or not.
I am using tweepy to get tweets about dengue in Brazil.
I am interesting in getting recent tweets about the 10 people with the most followers. I use the search api, not streaming api, because I don't need all the tweets, just the most relevant.
I am surprised to get so few tweets (only 17). Should I use the streaming api instead?
Here is my code:
#api access
consumer_key=""
consumer_secret=""
access_token_key=""
access_token_secret=""
import csv
#write results in file
writer= csv.writer(open(r"twitter.csv", "wt"), lineterminator='\n', delimiter =';')
writer.writerow(["date", "langage", "place", "country", "username", "nb_followers", "tweet_text"])
import tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth)
#get tweets for Brazil only
places = api.geo_search(query="Brazil", granularity="country")
place_id = places[0].id
print(place_id)
for tweet in tweepy.Cursor(api.search, q="dengue+OR+%23dengue&place:" + place_id, since="2015-08-01", until="2015-08-25").items():
date=tweet.created_at
langage=tweet.lang
try:
place=tweet.place.full_name
country=tweet.place.country
except:
place=None
country=None
username=tweet.user.screen_name
nb_followers=tweet.user.followers_count
tweet_text=tweet.text.encode('utf-8')
print("created on", tweet.created_at)
print("langage", tweet.lang)
print("place:", place)
print("country:", country)
print("user:", tweet.user.screen_name)
print("nb_followers:", tweet.user.followers_count)
print(tweet.text.encode("utf-8"))
print('')
writer.writerow([date, langage, place, country, username, nb_followers, tweet_text])
Try doing the search manually and seeing what you get. It sounds like your application is appropriate for the search API.
I think I know what the problem is: The place attribute is rarely present in the data. Thus very few tweets are returned.
I am now using the lang attribute with pt value (their is no pt-br langage unfortunately). It's not exactly what I want since it returns tweets from other countries such as Portugal but it's so far the best I could find.
for tweet in tweepy.Cursor(api.search, q="dengue+OR+%23dengue", lang="pt", since=date, until=end_date).items():
Please forgive me if this is a gross repeat of a question previously answered elsewhere, but I am lost on how to use the tweepy API search function. Is there any documentation available on how to search for tweets using the api.search() function?
Is there any way I can control features such as number of tweets returned, results type etc.?
The results seem to max out at 100 for some reason.
the code snippet I use is as follows
searched_tweets = self.api.search(q=query,rpp=100,count=1000)
I originally worked out a solution based on Yuva Raj's suggestion to use additional parameters in GET search/tweets - the max_id parameter in conjunction with the id of the last tweet returned in each iteration of a loop that also checks for the occurrence of a TweepError.
However, I discovered there is a far simpler way to solve the problem using a tweepy.Cursor (see tweepy Cursor tutorial for more on using Cursor).
The following code fetches the most recent 1000 mentions of 'python'.
import tweepy
# assuming twitter_authentication.py contains each of the 4 oauth elements (1 per line)
from twitter_authentication import API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET
auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)
query = 'python'
max_tweets = 1000
searched_tweets = [status for status in tweepy.Cursor(api.search, q=query).items(max_tweets)]
Update: in response to Andre Petre's comment about potential memory consumption issues with tweepy.Cursor, I'll include my original solution, replacing the single statement list comprehension used above to compute searched_tweets with the following:
searched_tweets = []
last_id = -1
while len(searched_tweets) < max_tweets:
count = max_tweets - len(searched_tweets)
try:
new_tweets = api.search(q=query, count=count, max_id=str(last_id - 1))
if not new_tweets:
break
searched_tweets.extend(new_tweets)
last_id = new_tweets[-1].id
except tweepy.TweepError as e:
# depending on TweepError.code, one may want to retry or wait
# to keep things simple, we will give up on an error
break
There's a problem in your code. Based on Twitter Documentation for GET search/tweets,
The number of tweets to return per page, up to a maximum of 100. Defaults to 15. This was
formerly the "rpp" parameter in the old Search API.
Your code should be,
CONSUMER_KEY = '....'
CONSUMER_SECRET = '....'
ACCESS_KEY = '....'
ACCESS_SECRET = '....'
auth = tweepy.auth.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
api = tweepy.API(auth)
search_results = api.search(q="hello", count=100)
for i in search_results:
# Do Whatever You need to print here
The other questions are old and the API changed a lot.
Easy way, with Cursor (see the Cursor tutorial). Pages returns a list of elements (You can limit how many pages it returns. .pages(5) only returns 5 pages):
for page in tweepy.Cursor(api.search, q='python', count=100, tweet_mode='extended').pages():
# process status here
process_page(page)
Where q is the query, count how many will it bring for requests (100 is the maximum for requests) and tweet_mode='extended' is to have the full text. (without this the text is truncated to 140 characters) More info here. RTs are truncated as confirmed jaycech3n.
If you don't want to use tweepy.Cursor, you need to indicate max_id to bring the next chunk. See for more info.
last_id = None
result = True
while result:
result = api.search(q='python', count=100, tweet_mode='extended', max_id=last_id)
process_result(result)
# we subtract one to not have the same again.
last_id = result[-1]._json['id'] - 1
I am working on extracting twitter data for around a location (in here, around India), for all tweets which include a special keyword or a list of keywords.
import tweepy
import credentials ## all my twitter API credentials are in this file, this should be in the same directory as is this script
## set API connection
auth = tweepy.OAuthHandler(credentials.consumer_key,
credentials.consumer_secret)
auth.set_access_secret(credentials.access_token,
credentials.access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True) # set wait_on_rate_limit =True; as twitter may block you from querying if it finds you exceeding some limits
search_words = ["#covid19", "2020", "lockdown"]
date_since = "2020-05-21"
tweets = tweepy.Cursor(api.search, =search_words,
geocode="20.5937,78.9629,3000km",
lang="en", since=date_since).items(10)
## the geocode is for India; format for geocode="lattitude,longitude,radius"
## radius should be in miles or km
for tweet in tweets:
print("created_at: {}\nuser: {}\ntweet text: {}\ngeo_location: {}".
format(tweet.created_at, tweet.user.screen_name, tweet.text, tweet.user.location))
print("\n")
## tweet.user.location will give you the general location of the user and not the particular location for the tweet itself, as it turns out, most of the users do not share the exact location of the tweet
Results:
created_at: 2020-05-28 16:48:23
user: XXXXXXXXX
tweet text: RT #Eatala_Rajender: Media Bulletin on status of positive cases #COVID19 in Telangana. (Dated. 28.05.2020)
# TelanganaFightsCorona
# StayHom…
geo_location: Hyderabad, India
You can search the tweets with specific strings as showed below:
tweets = api.search('Artificial Intelligence', count=200)