Get huge number of tweets by location - python

I want to get all tweets on a specific topic that were tweeted from NYC. I have Twitter credentials, and my Python code for the Twitter API is the following:
import tweepy
import numpy as np
import pandas as pd
auth = tweepy.OAuthHandler("......", ".......")
auth.set_access_token("........", "........")
api = tweepy.API(auth)
df = pd.DataFrame(columns = ['Tweets', 'Date of Tweet', 'Retweet Count', 'User Location', 'User Registration Date'])
def stream():
    i = 0
    for tweet in tweepy.Cursor(api.search, q='climatechange', count=100000, lang='en', tweet_mode='extended', since='2020-02-2', until='2020-02-25',geocode='43.17305,-77.62479,100km').items():
        print(i, end='\r')
        df.loc[i, 'Tweets'] = tweet.full_text
        df.loc[i, 'Date of Tweet'] = tweet.created_at
        df.loc[i, 'Retweet Count'] = tweet.retweet_count
        df.loc[i, 'User Location'] = tweet.user.location
        df.loc[i, 'User Registration Date'] = tweet.user.created_at
        df.to_csv('GeoTweets1.csv')
        i+=1
        if i == 10000:
            break
        else:
            pass
stream()
df.info()
Questions:
1- I want to collect a large dataset of tweets, as many as I can. As I understand it, we can make 180 requests every 15 minutes and each request returns up to 100 tweets, which means 18,000 tweets per window. How can I change the code so that it fetches 18K tweets for a specific keyword and automatically repeats every 15 minutes? For example, to get tweets from NYC about climatechange, I want to run this code continuously for 10 hours, which is 40 windows of 15 minutes, so I could get up to 720K tweets.
2- I also have trouble getting tweets by location. When I run the above code and request tweets for a keyword such as climatechange, it gives me 100 tweets, but the geo query for New York gives me fewer. For example, geocode='43.17305,-77.62479,32km' gives me 22 tweets and geocode='43.17305,-77.62479,100km' gives me 12 tweets. Why doesn't the geo search give me 100 tweets?
Thank you for your help
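The arithmetic in question 1 can be checked with a short sketch. It only computes the theoretical upper bound implied by the numbers quoted in the question (180 requests per 15-minute window, up to 100 tweets per request). The function name and its parameters are illustrative, and the real yield is usually lower because a geo-filtered search returns fewer matches, as question 2 observes.

```python
def max_tweets(hours, requests_per_window=180, tweets_per_request=100, window_minutes=15):
    """Upper bound on tweets retrievable in `hours`, given the quoted limits."""
    # Only complete rate-limit windows are counted.
    windows = int(hours * 60 // window_minutes)
    return windows * requests_per_window * tweets_per_request

per_window = max_tweets(0.25)  # a single 15-minute window
ten_hours = max_tweets(10)     # 40 windows
print(per_window, ten_hours)
```

In tweepy, passing wait_on_rate_limit=True to tweepy.API (as some of the snippets further down this page do) makes the client sleep until the window resets instead of raising an error, so a Cursor loop can keep running across windows.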

Related

Only get Tweets that mention a country

Is it possible to exclusively gather Tweets which mention countries by name? I am only gathering Tweets from the US.
I know that Twitter exposes context_annotations in the payload, and that context_annotations identifies whether a tweet mentions a country. The documentation at https://developer.twitter.com/en/docs/twitter-api/annotations/overview says that countries are domain number 160 in context annotations.
I'm wondering if it is possible to gather only Tweets that mention country names. I am not familiar with Tweepy. I've finally managed to obtain Tweets from the US, but I am still unable to write the code to obtain only tweets which mention countries.
This is my current code:
client = tweepy.Client(bearer_token=bearer_token)
# Specify Query
query = ' "favorite country" place_country:US'
start_time = '2022-03-05T00:00:00Z'
end_time = '2022-03-11T00:00:00Z'
tweets = client.search_all_tweets(query=query, tweet_fields=['context_annotations', 'created_at', 'geo'],
                                  place_fields = ['place_type','geo'], expansions='geo.place_id',
                                  start_time=start_time,
                                  end_time=end_time, max_results=10000)
# Prepare to write to csv file
f = open('tweetSheet.csv','w')
writer = csv.writer(f)
# Write to csv file
for tweet in tweets.data:
    print(tweet.text)
    print(tweet.created_at)
    writer.writerow(['0', tweet.id, tweet.created_at, tweet.text])
# Close csv file
f.close()
has:geo
One way of doing this is to keep only tweets that have country attributes.
You can use the has:geo operator in your query instead of the place_country: operator seen in the Twitter Docs. This returns all tweets that are geotagged, and every geotagged tweet has a country attribute.
includes
Another way is to check whether the response has a non-empty includes attribute, which is empty when the tweet has no geo attributes: response.includes != {}. To get the country if needed, response.includes['places'][0].country works just fine (country_code holds the code). This is not very well documented in the Tweepy Docs, so here are all the geo attributes found in the Twitter Docs for a tweet:
twt_geo = 1602695447298162689
twt_no_geo = 1602719044645408768
response = client.get_tweet(
    twt_geo, place_fields=['country', 'country_code', 'place_type', 'name'], expansions=['geo.place_id'])
if(response.includes != {}):
    print(response.includes)
    print(response.includes['places'][0].country)
    print(response.includes['places'][0].country_code)
    print(response.includes['places'][0].place_type)
    print(response.includes['places'][0].name)
    print(response.includes['places'][0].full_name)
    print(response.includes['places'][0])
    print(response.data.geo)
    print(response.data.geo['place_id'])
else:
    print(response.data.id)
Hashtags
If you mean filtering for tweets that mention countries by name or as hashtags, you can extract the tweet text with response.data.text and compare it against the country names you would like to keep.
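The text comparison described above could look roughly like the following. This is a minimal sketch rather than a complete solution: the COUNTRY_NAMES list and both helper names are made up for illustration, and the plain strings in samples stand in for response.data.text.

```python
import re

# Illustrative list; extend it with the countries you care about.
COUNTRY_NAMES = ["France", "Germany", "Japan", "United States"]

# One case-insensitive pattern with word boundaries, so "Japan" matches
# "#Japan" and "japan" but not "Japanese".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in COUNTRY_NAMES) + r")\b",
    re.IGNORECASE,
)

def mentioned_countries(text):
    """Return the country mentions found in `text`, as written, in order."""
    return [m.group(1) for m in _PATTERN.finditer(text)]

def mentions_country(text):
    """True if `text` contains at least one of the listed country names."""
    return _PATTERN.search(text) is not None

samples = [
    "My favorite country is #Japan",
    "I love Japanese food",
    "france and Germany are neighbours",
]
kept = [t for t in samples if mentions_country(t)]
print(kept)
```

Multi-word names written as a single hashtag (for example #UnitedStates) would not match this pattern and would need their own entries in the list.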

Integrate for loop in twitter scrapper in Python

Hi all, I need a little wisdom.
I managed to make a scraper using the Twitter API and Tweepy. It scrapes tweets from individual profiles. I have a list of around 100 profiles that I want to scrape tweets from, but I can't figure out how to instruct the scraper to extract data from multiple profiles and how to save the output properly in CSV. I have the following code:
import tweepy
import time
import pandas as pd
import csv
# API keys that you saved earlier
api_key = ''
api_secrets = ''
access_token = ''
access_secret = ''
# Authenticate to Twitter
auth = tweepy.OAuthHandler(api_key,api_secrets)
auth.set_access_token(access_token,access_secret)
#Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)
username = "markrutte"
no_of_tweets = 3200
try:
    # The number of tweets we want to retrieve from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)
    # Pulling some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source, tweet.text] for tweet in tweets]
    # Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,',str(e))
    time.sleep(3)
In my head, I believe I should specify a list with usernames as the values. And then, for username in list: scrape tweets. However, I dont really know how to do this and am still learning. Can anyone give me some advice or know a tutorial on how I should do this?
Appreciate it.
If you put your scraping code into a function, you can then concat its results into an overall dataframe in a loop:
def get_tweets(username, no_of_tweets):
    #Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    try:
        #The number of tweets we want to retrieved from the user
        tweets = api.user_timeline(screen_name=username, count=no_of_tweets)
        #Pulling Some attributes from the tweet
        attributes_container = [[tweet.created_at, tweet.favorite_count,tweet.source, tweet.text] for tweet in tweets]
        # return a dataframe
        return pd.DataFrame(attributes_container, columns=columns)
    except BaseException as e:
        print('Status Failed On,',str(e))
        # return an empty dataframe
        return pd.DataFrame(columns=columns)
usernames = ['user1', 'user2', 'user3']
no_of_tweets = 3200
tweets_df = pd.concat([get_tweets(username, no_of_tweets) for username in usernames])
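The question also asks how to save the combined output to CSV, and once tweets from many profiles are merged it helps to record which profile each row came from. The sketch below shows one way to do that with the standard csv module. The write_all_tweets helper and the fetch_rows callable are hypothetical names, and fake_fetch stands in for the real api.user_timeline call so the example runs without credentials.

```python
import csv
import io

COLUMNS = ["Username", "Date Created", "Number of Likes", "Source of Tweet", "Tweet"]

def write_all_tweets(usernames, fetch_rows, out_file):
    """Write one CSV with a leading Username column and return the row count.

    `fetch_rows(username)` is a hypothetical callable returning a list of
    [date, likes, source, text] rows, e.g. a wrapper around api.user_timeline.
    """
    writer = csv.writer(out_file)
    writer.writerow(COLUMNS)
    total = 0
    for username in usernames:
        for row in fetch_rows(username):
            writer.writerow([username, *row])
            total += 1
    return total

# Stand-in for the real API call so the sketch runs without credentials.
def fake_fetch(username):
    return [["2023-01-01", 3, "Twitter Web App", f"hello from {username}"]]

buffer = io.StringIO()
written = write_all_tweets(["user1", "user2"], fake_fetch, buffer)
print(buffer.getvalue())
```

For a real file, open it with open('tweets.csv', 'w', newline='', encoding='utf-8') and pass that instead of the in-memory buffer. With the pandas approach above, adding a username column to each dataframe before concatenating and then calling to_csv gives the same result.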

Why does Twint's "since" & "until" not work?

I'm trying to get all tweets from 2018-01-01 until now from various firms.
My code works; however, I do not get the tweets from the full time range. Sometimes I only get the tweets from today and yesterday, or from mid-April up to now, but not since the beginning of 2018. I then get the message: [!] No more data! Scraping will stop now.
ticker = []
#read in csv file with company ticker in a list
with open('C:\\Users\\veron\\Desktop\\Test.csv', newline='') as inputfile:
    for row in csv.reader(inputfile):
        ticker.append(row[0])
#Getting tweets for every ticker in the list
for i in ticker:
    searchstring = (f"{i} since:2018-01-01")
    c = twint.Config()
    c.Search = searchstring
    c.Lang = "en"
    c.Panda = True
    c.Custom["tweet"] = ["date", "username", "tweet"]
    c.Store_csv = True
    c.Output = f"{i}.csv"
    twint.run.Search(c)
    df = pd.read_csv(f"{i}.csv")
    df['company'] = i
    df.to_csv(f"{i}.csv", index=False)
Has anyone had the same issue and have a tip?
You need to add the configuration parameter Since separately. For example:
c.Since = "2018-01-01"
Similarly for Until:
c.Until = "2017-12-27"
The official documentation might be helpful.
Since (string) - Filter Tweets sent since date, works only with twint.run.Search (Example: 2017-12-27).
Until (string) - Filter Tweets sent until date, works only with twint.run.Search (Example: 2017-12-27).

Twython Twitter Post is Truncated

I'm trying to get some tweets using Twython, but even with tweet_mode:extended the results are still truncated. Any ideas how I can get the full text?
def requestTweets(topic, resultType = "new", amount = 10, language = "en"):
    '''Get the n tweets for a topic, either newest (new) or most popular (popular)'''
    #Create Query
    query = {'q': topic,
             'result_type': resultType,
             'count': amount,
             'lang': language,
             'tweet_mode': 'extended',
            }
    #Get Data
    dict_ = {'user': [], 'date': [], 'full_text': [],'favorite_count': []}
    for status in python_tweets.search(**query)['statuses']:
        dict_['user'].append(status['user']['screen_name'])
        dict_['date'].append(status['created_at'])
        dict_['full_text'].append(status['full_text'])
        dict_['favorite_count'].append(status['favorite_count'])
    # Structure data in a pandas DataFrame for easier manipulation
    df = pd.DataFrame(dict_)
    df.sort_values(by='favorite_count', inplace=True, ascending=False)
    return df
tweets = requestTweets("chocolate")
for index, tweet in tweets.iterrows():
    print("***********************************")
    print(tweet['full_text'])
Results look like this:
I know it's kind of late, but I'm posting the answer for anyone who may need it.
I'm using twython==3.9.1, and it's possible to get the full text of tweets with the snippet below:
twython: Twython
tweets = twython.get_user_timeline(
    screen_name=user_screen_name,  # or user id
    count=200,  # max count is 200
    include_retweets=include_retweets,  # could be False or True
    exclude_replies=exclude_replies,  # could be False or True
    tweet_mode='extended',  # to get tweets full text
)
I couldn't find a way to do it with Twython, so I switched to tweepy in the end, still, if anyone has an answer, that would be great.
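One commonly reported cause of text that still looks truncated in extended mode, which the posts above do not confirm for this case, is retweets. In the v1.1 payload a retweet's own full_text is cut to make room for the "RT @user: " prefix, while the complete text sits in retweeted_status. A minimal sketch, assuming each status is a plain dict as Twython returns it (the helper name is made up for illustration):

```python
def full_text_of(status):
    """Return the untruncated text of a v1.1 tweet dict fetched with tweet_mode='extended'."""
    original = status.get("retweeted_status")
    if original is not None:
        # For retweets, the complete text lives on the original tweet.
        return original.get("full_text", original.get("text", ""))
    # Fall back to 'text' in case the tweet was fetched without extended mode.
    return status.get("full_text", status.get("text", ""))

plain = {"full_text": "A long original tweet"}
retweet = {
    "full_text": "RT @someone: A long original tw...",
    "retweeted_status": {"full_text": "A long original tweet that was cut off"},
}
print(full_text_of(plain))
print(full_text_of(retweet))
```

In the question's loop, this would mean appending full_text_of(status) instead of status['full_text'].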

How to collect Tweets in a JSON file on Twitter using Python?

I'm building a program that collects a specified number of tweets (no specific hashtags, just random posts) from a specific country (based on coordinates) over the span of 1-2 months.
For example, I'm collecting 200 tweets/status updates from the United States that were posted between September and October.
I'm doing this because I want to gather these tweets and perform sentiment analysis on them to see whether the average tweet from a specified country is negative or positive.
The problem I'm having is that I don't know how to "filter" for random tweets/status updates, because these kinds of tweets don't have hashtags. Furthermore, I'm not sure whether Twitter allows me to collect tweets that are 2 months old. Any suggestions?
code
import tweepy
from tweepy import OAuthHandler
import json
import datetime as dt
import time
import os
import sys
'''
I created a twitter account for anyone to use if they want to test the code!
I used Python 3 and tweepy version 3.5.0.
'''
def load_api():
    ''' Function that loads the twitter API after authorizing the user. '''
    consumer_key = 'nn'
    consumer_secret = 'nn'
    access_token = 'nn'
    access_secret = 'nnn'
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    # load the twitter API via tweepy, waiting out rate limits automatically
    api = tweepy.API(auth, wait_on_rate_limit=True)
    # return the configured client instead of building a second one
    return api
def tweet_search(api, query, max_tweets, max_id, since_id, geocode):
    ''' Function that takes in a search string 'query', the maximum
        number of tweets 'max_tweets', and the minimum (i.e., starting)
        tweet id. It returns a list of tweepy.models.Status objects. '''
    searched_tweets = []
    while len(searched_tweets) < max_tweets:
        remaining_tweets = max_tweets - len(searched_tweets)
        try:
            new_tweets = api.search(q=query, count=remaining_tweets,
                                    since_id=str(since_id),
                                    max_id=str(max_id-1))
                                    # geocode=geocode)
            print('found',len(new_tweets),'tweets')
            if not new_tweets:
                print('no tweets found')
                break
            searched_tweets.extend(new_tweets)
            max_id = new_tweets[-1].id
        except tweepy.TweepError:
            print('exception raised, waiting 15 minutes')
            print('(until:', dt.datetime.now()+dt.timedelta(minutes=15), ')')
            time.sleep(15*60)
            break # stop the loop
    return searched_tweets, max_id
def get_tweet_id(api, date='', days_ago=9, query='a'):
    ''' Function that gets the ID of a tweet. This ID can then be
        used as a 'starting point' from which to search. The query is
        required and has been set to a commonly used word by default.
        The variable 'days_ago' has been initialized to the maximum
        amount we are able to search back in time (9).'''
    if date:
        # return an ID from the start of the given day
        td = date + dt.timedelta(days=1)
        tweet_date = '{0}-{1:0>2}-{2:0>2}'.format(td.year, td.month, td.day)
        tweet = api.search(q=query, count=1, until=tweet_date)
    else:
        # return an ID from __ days ago
        td = dt.datetime.now() - dt.timedelta(days=days_ago)
        tweet_date = '{0}-{1:0>2}-{2:0>2}'.format(td.year, td.month, td.day)
        # get list of up to 10 tweets
        tweet = api.search(q=query, count=10, until=tweet_date)
    print('search limit (start/stop):',tweet[0].created_at)
    # return the id of the first tweet in the list
    return tweet[0].id
def write_tweets(tweets, filename):
    ''' Function that appends tweets to a file. '''
    with open(filename, 'a') as f:
        for tweet in tweets:
            json.dump(tweet._json, f)
            f.write('\n')
def main():
    ''' This is a script that continuously searches for tweets
        that were created over a given number of days. The search
        dates and search phrase can be changed below. '''
    ''' search variables: '''
    search_phrases = ['#PythonPleaseWork']
    time_limit = 1.0 # runtime limit in hours
    max_tweets = 20 # number of tweets per search but it doesn't seem to be working
    min_days_old, max_days_old = 1, 1 # search limits e.g., from 7 to 8
                                      # gives current weekday from last week,
                                      # min_days_old=0 will search from right now
    USA = '39.8,-95.583068847656,2500km' # this geocode includes nearly all American
                                         # states (and a large portion of Canada)
                                         # but it still fetches from outside the USA
    # loop over search items,
    # creating a new file for each
    for search_phrase in search_phrases:
        print('Search phrase =', search_phrase)
        ''' other variables '''
        name = search_phrase.split()[0]
        json_file_root = name + '/' + name
        os.makedirs(os.path.dirname(json_file_root), exist_ok=True)
        read_IDs = False
        # open a file in which to store the tweets
        if max_days_old - min_days_old == 1:
            d = dt.datetime.now() - dt.timedelta(days=min_days_old)
            day = '{0}-{1:0>2}-{2:0>2}'.format(d.year, d.month, d.day)
        else:
            d1 = dt.datetime.now() - dt.timedelta(days=max_days_old-1)
            d2 = dt.datetime.now() - dt.timedelta(days=min_days_old)
            day = '{0}-{1:0>2}-{2:0>2}_to_{3}-{4:0>2}-{5:0>2}'.format(
                  d1.year, d1.month, d1.day, d2.year, d2.month, d2.day)
        json_file = json_file_root + '_' + day + '.json'
        if os.path.isfile(json_file):
            print('Appending tweets to file named: ',json_file)
            read_IDs = True
        # authorize and load the twitter API
        api = load_api()
        # set the 'starting point' ID for tweet collection
        if read_IDs:
            # open the json file and get the latest tweet ID
            with open(json_file, 'r') as f:
                lines = f.readlines()
                max_id = json.loads(lines[-1])['id']
                print('Searching from the bottom ID in file')
        else:
            # get the ID of a tweet that is min_days_old
            if min_days_old == 0:
                max_id = -1
            else:
                max_id = get_tweet_id(api, days_ago=(min_days_old-1))
        # set the smallest ID to search for
        since_id = get_tweet_id(api, days_ago=(max_days_old-1))
        print('max id (starting point) =', max_id)
        print('since id (ending point) =', since_id)
        ''' tweet gathering loop '''
        start = dt.datetime.now()
        end = start + dt.timedelta(hours=time_limit)
        count, exitcount = 0, 0
        while dt.datetime.now() < end:
            count += 1
            print('count =',count)
            # collect tweets and update max_id
            tweets, max_id = tweet_search(api, search_phrase, max_tweets,
                                          max_id=max_id, since_id=since_id,
                                          geocode=USA)
            # write tweets to file in JSON format
            if tweets:
                write_tweets(tweets, json_file)
                exitcount = 0
            else:
                exitcount += 1
                if exitcount == 3:
                    if search_phrase == search_phrases[-1]:
                        sys.exit('Maximum number of empty tweet strings reached - exiting')
                    else:
                        print('Maximum number of empty tweet strings reached - breaking')
                        break
if __name__ == "__main__":
    main()
You cannot get 2 months of historical data with the Search API.
"The Twitter Search API searches against a sampling of recent Tweets published in the past 7 days.
Before getting involved, it’s important to know that the Search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results."
https://developer.twitter.com/en/docs/tweets/search/overview/basic-search
You can use the Streaming API with a country filter, and instead of hashtags you can use a few stop words. For example, for the US you can use "the,and", for France "le,la,et", etc.
In addition, it is not a good idea to share your access tokens.
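The question's code notes that the geocode circle still fetches tweets from outside the USA. One possible post-filter, sketched here under the assumption that the tweets were saved one JSON object per line (as write_tweets does) and that tagged tweets carry the v1.1 place object with its country_code field, is to drop everything whose place is not in the target country. The function name is illustrative, and tweets without a tagged place are discarded by this approach.

```python
import json

def tweets_from_country(json_lines, country_code="US"):
    """Yield tweet dicts whose tagged place matches `country_code`.

    `json_lines` is any iterable of JSON strings, one tweet per line,
    such as an open file written by write_tweets.
    """
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        place = tweet.get("place") or {}  # 'place' is null for untagged tweets
        if place.get("country_code") == country_code:
            yield tweet

sample = [
    json.dumps({"id": 1, "text": "hi", "place": {"country_code": "US"}}),
    json.dumps({"id": 2, "text": "salut", "place": {"country_code": "CA"}}),
    json.dumps({"id": 3, "text": "no place", "place": None}),
]
us_ids = [t["id"] for t in tweets_from_country(sample)]
print(us_ids)
```

Because only a small share of tweets have a tagged place, this filter trades volume for certainty about the country.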
