Twython Twitter Post is Truncated - python

I'm trying to get some tweets using Twython, but even with tweet_mode: 'extended' the results are still truncated. Any ideas how I can get the full text?
def requestTweets(topic, resultType="new", amount=10, language="en"):
    '''Get the n tweets for a topic, either newest (new) or most popular (popular)'''
    # Create query
    query = {'q': topic,
             'result_type': resultType,
             'count': amount,
             'lang': language,
             'tweet_mode': 'extended',
             }
    # Get data
    dict_ = {'user': [], 'date': [], 'full_text': [], 'favorite_count': []}
    for status in python_tweets.search(**query)['statuses']:
        dict_['user'].append(status['user']['screen_name'])
        dict_['date'].append(status['created_at'])
        dict_['full_text'].append(status['full_text'])
        dict_['favorite_count'].append(status['favorite_count'])
    # Structure data in a pandas DataFrame for easier manipulation
    df = pd.DataFrame(dict_)
    df.sort_values(by='favorite_count', inplace=True, ascending=False)
    return df

tweets = requestTweets("chocolate")
for index, tweet in tweets.iterrows():
    print("***********************************")
    print(tweet['full_text'])
The printed full_text still comes back cut short.

I know it's kind of late, but I'm posting the answer for whoever may need it.
I'm using twython==3.9.1, and it's possible to get the full text of tweets with the snippet below:
twython: Twython
tweets = twython.get_user_timeline(
    screen_name=user_screen_name,       # or user id
    count=200,                          # max count is 200
    include_retweets=include_retweets,  # could be False or True
    exclude_replies=exclude_replies,    # could be False or True
    tweet_mode='extended',              # to get tweets' full text
)

I couldn't find a way to do it with Twython, so I switched to Tweepy in the end. Still, if anyone has an answer, that would be great.
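One thing worth checking, in case it helps (an assumption about the cause, since the truncated output isn't shown): even with tweet_mode='extended', a retweet's top-level full_text is truncated; the untruncated text sits in the nested retweeted_status object. A minimal sketch against the question's search loop:

for status in python_tweets.search(**query)['statuses']:
    # retweets keep their untruncated text one level down
    if 'retweeted_status' in status:
        text = status['retweeted_status']['full_text']
    else:
        text = status['full_text']
    dict_['full_text'].append(text)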

Related

Only get Tweets that mention a country

Is it possible to exclusively gather Tweets which mention countries by name? So far I am only gathering Tweets from the US.
I know that Twitter allows us to access context_annotations from the payload, and that context_annotations identify whether a tweet mentions a country. Here, https://developer.twitter.com/en/docs/twitter-api/annotations/overview, they mention that countries are domain number 160 in context annotations.
I'm wondering if it is possible to exclusively gather Tweets that mention country names. I am not familiar with Tweepy; I've finally managed to obtain Tweets from the US, but am still unable to write the code to obtain only Tweets which mention countries.
This is my current code:
client = tweepy.Client(bearer_token=bearer_token)

# Specify query
query = '"favorite country" place_country:US'
start_time = '2022-03-05T00:00:00Z'
end_time = '2022-03-11T00:00:00Z'
tweets = client.search_all_tweets(query=query,
                                  tweet_fields=['context_annotations', 'created_at', 'geo'],
                                  place_fields=['place_type', 'geo'],
                                  expansions='geo.place_id',
                                  start_time=start_time,
                                  end_time=end_time,
                                  max_results=10000)

# Prepare to write to csv file
f = open('tweetSheet.csv', 'w')
writer = csv.writer(f)

# Write to csv file
for tweet in tweets.data:
    print(tweet.text)
    print(tweet.created_at)
    writer.writerow(['0', tweet.id, tweet.created_at, tweet.text])

# Close csv file
f.close()
has:geo
One way of doing this would be to filter for tweets that have country attributes.
You can use the has:geo operator in your query instead of the place_country: operator described in the Twitter Docs. That way you get all tweets that are geo-tagged, and every geo-tagged tweet has a country attribute.
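As a sketch, that can be as small as swapping the operator in the question's query (has:geo is a documented v2 search operator; treat the rest of the call as unchanged):

query = '"favorite country" has:geo'  # instead of: '"favorite country" place_country:US'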
includes
Another way would be to check whether the response has a non-empty includes attribute; it is empty ({}) when the tweet has no geo attributes: response.includes != {}. To get the country if needed, response.includes['places'][0].country works just fine. This is not very well documented in the Tweepy Docs, so here are the geo attributes found in the Twitter Docs for a tweet:
twt_geo = 1602695447298162689
twt_no_geo = 1602719044645408768

response = client.get_tweet(
    twt_geo, place_fields=['country', 'country_code', 'place_type', 'name'],
    expansions=['geo.place_id'])

if response.includes != {}:
    print(response.includes)
    print(response.includes['places'][0].country)
    print(response.includes['places'][0].country_code)
    print(response.includes['places'][0].place_type)
    print(response.includes['places'][0].name)
    print(response.includes['places'][0].full_name)
    print(response.includes['places'][0])
    print(response.data.geo)
    print(response.data.geo['place_id'])
else:
    print(response.data.id)
Hashtags
If by country mentions you mean filtering for tweets that have country names as hashtags, you can extract the tweet text with response.data.text and compare it against the country names you want to filter for.
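And for the context-annotation route the question asks about, a hedged sketch (field names follow the v2 payload; the '160' domain id comes from the annotations overview linked in the question, and context_annotations is only present when requested via tweet_fields, as the question's code already does):

COUNTRY_DOMAIN_ID = '160'  # "Countries" domain per the annotations docs

country_tweets = []
for tweet in tweets.data:
    # context_annotations is a list of {'domain': {...}, 'entity': {...}} dicts
    annotations = tweet.context_annotations or []
    if any(a['domain']['id'] == COUNTRY_DOMAIN_ID for a in annotations):
        country_tweets.append(tweet)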

Integrate for loop in twitter scrapper in Python

Hi all, I need a little wisdom.
I managed to make a scraper using the Twitter API and Tweepy. It scrapes tweets from individual profiles. I have a list of around 100 profiles that I want to scrape tweets from, but I can't figure out how to instruct the scraper to extract data from multiple profiles and how to save the output properly to CSV. I have the following code:
import tweepy
import time
import pandas as pd
import csv

# API keys that you saved earlier
api_key = ''
api_secrets = ''
access_token = ''
access_secret = ''

# Authenticate to Twitter
auth = tweepy.OAuthHandler(api_key, api_secrets)
auth.set_access_token(access_token, access_secret)

# Instantiate the tweepy API
api = tweepy.API(auth, wait_on_rate_limit=True)

username = "markrutte"
no_of_tweets = 3200

try:
    # The number of tweets we want to retrieve from the user
    tweets = api.user_timeline(screen_name=username, count=no_of_tweets)
    # Pulling some attributes from the tweet
    attributes_container = [[tweet.created_at, tweet.favorite_count, tweet.source, tweet.text] for tweet in tweets]
    # Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    tweets_df = pd.DataFrame(attributes_container, columns=columns)
except BaseException as e:
    print('Status Failed On,', str(e))
    time.sleep(3)
In my head, I believe I should specify a list with usernames as the values, and then, for username in list: scrape tweets. However, I don't really know how to do this and am still learning. Can anyone give me some advice, or does anyone know a tutorial on how I should do this?
Appreciate it.
If you put your scraping code into a function, you can then concat its results into an overall dataframe in a loop:
def get_tweets(username, no_of_tweets):
    # Creation of column list to rename the columns in the dataframe
    columns = ["Date Created", "Number of Likes", "Source of Tweet", "Tweet"]
    try:
        # The number of tweets we want to retrieve from the user
        tweets = api.user_timeline(screen_name=username, count=no_of_tweets)
        # Pulling some attributes from the tweet
        attributes_container = [[tweet.created_at, tweet.favorite_count, tweet.source, tweet.text] for tweet in tweets]
        # Return a dataframe
        return pd.DataFrame(attributes_container, columns=columns)
    except BaseException as e:
        print('Status Failed On,', str(e))
        # Return an empty dataframe
        return pd.DataFrame(columns=columns)

usernames = ['user1', 'user2', 'user3']
no_of_tweets = 3200
tweets_df = pd.concat([get_tweets(username, no_of_tweets) for username in usernames])
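To get from there to a CSV file, as the question asks, pandas can write the combined frame directly ('all_tweets.csv' is just a placeholder name):

# index=False keeps the row index out of the file
tweets_df.to_csv('all_tweets.csv', index=False)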

How to fix number of returned rows in Pandas?

I built some code that extracts data from YouTube for a search query, and now I need to convert the output into a pandas DataFrame so that I can later export it as .csv.
But I'm now stuck on the issue that my pd.DataFrame actually returns only the first row of parsed data instead of the full set. Please help!
Expected: pandas gives me back the same number of rows as the maxResults I'm searching for.
Now: pandas gives me back only the first line of info from the parsed data, no matter how much data was found.
Scraping code:
api_key = "***"

from googleapiclient.discovery import build
from pprint import PrettyPrinter
from google.colab import files

youtube = build('youtube', 'v3', developerKey=api_key)
print(type(youtube))
pp = PrettyPrinter()

nextPageToken = ''
for x in range(1):
    # while True:
    request = youtube.search().list(
        q='star wars',
        part='id,snippet',
        maxResults=3,
        order="viewCount",
        pageToken=nextPageToken,
        type='video')
    print(type(request))
    res = request.execute()
    pp.pprint(res)
    if 'nextPageToken' in res:
        nextPageToken = res['nextPageToken']
    # else:
    #     break
    ids = [item['id']['videoId'] for item in res['items']]
    results = youtube.videos().list(id=ids, part='snippet').execute()
    for result in results.get('items', []):
        print(result['id'])
        print(result['snippet']['channelTitle'])
        print(result['snippet']['title'])
        print(result['snippet']['description'])
Pandas Code:
data = {'Channel Title': [result['snippet']['channelTitle']],
        'Title': [result['snippet']['title']],
        'Description': [result['snippet']['description']]
        }

df = pd.DataFrame(data,
                  columns=['Channel Title', 'Title', 'Description'],
                  )
# df3 = pd.concat([df], ignore_index=True)
# df3.reset_index()
df.head()
# print(df3)
IIUC~
This:
data = {'Channel Title': [result['snippet']['channelTitle']],
        'Title': [result['snippet']['title']],
        'Description': [result['snippet']['description']]
        }
Should be:
data = {'Channel Title': [result['snippet']['channelTitle'] for result in results['items']],
        'Title': [result['snippet']['title'] for result in results['items']],
        'Description': [result['snippet']['description'] for result in results['items']]
        }
Otherwise you're just using result from the last iteration of your for loop.
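A stylistic alternative (same result, just built row-wise): give pandas one dict per video and let it infer the columns:

# one dict per item; pandas turns a list of dicts into rows
df = pd.DataFrame([{'Channel Title': r['snippet']['channelTitle'],
                    'Title': r['snippet']['title'],
                    'Description': r['snippet']['description']}
                   for r in results['items']])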

Twython using Enterprise Search API

I am trying to scrape tweets from Twitter using Twython, and I want to use the enterprise search API for this because I want to set the fromDate and toDate parameters.
I couldn't find a way to do it, though, and when I try to cursor tweets from those dates, it only returns tweets from about the last 14 days.
twitter = Twython(consumer_token, access_token=ACCESS_TOKEN)

# Search parameters
def search_query(QUERY_TO_BE_SEARCHED):
    """
    QUERY_TO_BE_SEARCHED : text you want to search for
    """
    df_dict = []
    results = twitter.cursor(twitter.search, q=QUERY_TO_BE_SEARCHED,
                             fromDate='2019071200', toDate='2019071400', count=100)
    for q in results:
        retweet_count = q['retweet_count']
        favs_count = q['favorite_count']
        date_created = q['created_at']
        text = q['text']
        hashtags = q['entities']['hashtags']
        user_name = '#' + str(q['user']['screen_name'])
        user_mentions = []
        if len(q['entities']['user_mentions']) != 0:
            for n in q['entities']['user_mentions']:
                user_mentions.append(n['screen_name'])  # Mentioned profile names in the tweet
        temp_dict = {'User ID': user_name, 'Date': date_created, 'Text': text,
                     'Favorites': favs_count, 'RTs': retweet_count,
                     'Hashtags': hashtags, 'Mentions': user_mentions}
        df_dict.append(temp_dict)
    return pd.DataFrame(df_dict)
That is my code. Can you help me improve it?
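For what it's worth: the standard search endpoint that twitter.search wraps only covers roughly the last week and ignores fromDate/toDate; those parameters belong to the premium/enterprise 30-day and full-archive endpoints. A hedged sketch using Twython's generic request method (the 'dev' environment label and the endpoint path are assumptions to verify against your premium account):

# premium 30-day search; dates use minute precision (YYYYMMDDhhmm)
results = twitter.request('tweets/search/30day/dev',  # 'dev' = your env label (assumption)
                          params={'query': QUERY_TO_BE_SEARCHED,
                                  'fromDate': '201907120000',
                                  'toDate': '201907140000',
                                  'maxResults': 100})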

How to create pandas dataframe from Twitter Search API?

I am working with the Twitter Search API which returns a dictionary of dictionaries. My goal is to create a dataframe from a list of keys in the response dictionary.
Example of API response here: Example Response
I have a list of keys within the Statuses dictionary
keys = ["created_at", "text", "in_reply_to_screen_name", "source"]
I would like to loop through each key value returned in the Statuses dictionary and put them in a dataframe with the keys as the columns.
I currently have code to loop through a single key individually, assign the values to a list, and append them to a dataframe, but I want a way to do more than one key at a time. Current code below:
# w is the word to be queried
w = 'keyword'

# count of tweets to return
count = 1000

# API call
query = twitter.search.tweets(q=w, count=count)

def data_l2(q, k1, k2):
    data = []
    for results in q[k1]:
        data.append(results[k2])
    return data

def data_l3(q, k1, k2, k3):
    # two-level accessor: status[k2][k3]
    data = []
    for results in q[k1]:
        data.append(results[k2][k3])
    return data

tweets = data_l2(query, "statuses", "text")
screen_names = data_l3(query, "statuses", "user", "screen_name")

data = {'screen_names': screen_names,
        'tweets': tweets}
frame = pd.DataFrame(data)
frame
I will share a more generic solution that I came up with while working with the Twitter API. Let's say you have the IDs of the tweets that you want to fetch in a list called my_ids:
# Fetch tweets from the twitter API using the following loop:
list_of_tweets = []
# Tweets that can't be found are saved in the list below:
cant_find_tweets_for_those_ids = []

for each_id in my_ids:
    try:
        list_of_tweets.append(api.get_status(each_id))
    except Exception as e:
        cant_find_tweets_for_those_ids.append(each_id)
Then, in this code block, we isolate the JSON part of each tweepy status object that we have downloaded and add them all into a list...
my_list_of_dicts = []
for each_json_tweet in list_of_tweets:
    my_list_of_dicts.append(each_json_tweet._json)
...and we write this list into a txt file:
with open('tweet_json.txt', 'w') as file:
    file.write(json.dumps(my_list_of_dicts, indent=4))
Now we are going to create a DataFrame from the tweet_json.txt file (I have added some keys that were relevant to the use case I was working on, but you can add your specific keys instead):
my_demo_list = []
with open('tweet_json.txt', encoding='utf-8') as json_file:
    all_data = json.load(json_file)
    for each_dictionary in all_data:
        tweet_id = each_dictionary['id']
        whole_tweet = each_dictionary['text']
        only_url = whole_tweet[whole_tweet.find('https'):]
        favorite_count = each_dictionary['favorite_count']
        retweet_count = each_dictionary['retweet_count']
        created_at = each_dictionary['created_at']
        whole_source = each_dictionary['source']
        only_device = whole_source[whole_source.find('rel="nofollow">') + 15:-4]
        source = only_device
        retweeted_status = each_dictionary.get('retweeted_status', 'Original tweet')
        if retweeted_status == 'Original tweet':
            url = only_url
        else:
            retweeted_status = 'This is a retweet'
            url = 'This is a retweet'
        my_demo_list.append({'tweet_id': str(tweet_id),
                             'favorite_count': int(favorite_count),
                             'retweet_count': int(retweet_count),
                             'url': url,
                             'created_at': created_at,
                             'source': source,
                             'retweeted_status': retweeted_status,
                             })

tweet_json = pd.DataFrame(my_demo_list, columns=['tweet_id', 'favorite_count',
                                                 'retweet_count', 'created_at',
                                                 'source', 'retweeted_status', 'url'])
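As a more direct sketch for the original multi-key question: since all four keys in the question's keys list live at the top level of each status, one dict comprehension can build every column at once:

keys = ["created_at", "text", "in_reply_to_screen_name", "source"]
# one column per key, one row per status
frame = pd.DataFrame({k: [status.get(k) for status in query["statuses"]] for k in keys})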
