How to retrieve full text from tweet using twarc2 - python

I am using twarc2 for retrieving tweets. The returned jsonl file has the following keys:
dict_keys(['text', 'conversation_id', 'entities', 'author_id', 'public_metrics', 'source', 'id', 'reply_settings', 'edit_history_tweet_ids', 'created_at', 'possibly_sensitive', 'lang', 'referenced_tweets', 'author', '__twarc'])
When I checked the value of data[0]['text'], it terminated with ... like below:
RT #Weather_West: "You may have heard that we have 12 years to fix everything. This is well-meaning nonsense, but it’s still nonsense. We h…
I am wondering how can I get the full text of the tweet. Apparently, twarc2 doesn't even return retweeted_status unlike tweepy which used to be helpful for retrieving the full text.

Actually, twarc2 csv auto-expands the tweets. So, instead of working with .jsonl, one can first convert to .csv and then one will be able to access the full text from the tweet.

Related

Retrieving full text using twarc2

I am retrieving tweets using twarc2 with search terms in the following way:
twarc2 search --archive --start-time "2015-01-01" --end-time "2018-12-31" --limit 25000 "faith OR #faith" results.jsonl
But the resultant tweets are truncated after a certain length. E.g. RT #AndrewYNg: We cannot abdicate responsibility when two children, ages 7 and 8, die in US custody. The US once said: "Give me your tired,… although the tweet is a bit longer. I read the twarc2 documentation but can't find any "extended" tweet_mode option for retrieving the full_text. Any help on this will be appreciated.
Are all the retrieved tweets truncated? The example you provided is a retweet (includes "RT"). RT are truncated but their original tweet is full text.
You can exclude retweets in twarc2. Try adding the below to your command:
-is:retweet
Hope that helps.

transform JSON file to be usable

Long story short, i get the query from spotify api which is JSON that has data about newest albums. How do i get the specific info from that like let's say every band name or every album title. I've tried a lot of ways to get that info that i found on the internet and nothing seems to work for me and after couple of hours im kinda frustrated
JSON data is on jsfiddle
here is the request
endpoint = "https://api.spotify.com/v1/browse/new-releases"
lookup_url = f"{endpoint}"
r = requests.get(lookup_url, headers=headers)
print(r.json())
you can find the
When you make this request like the comments have mentioned you get a dictionary which you can then access the keys and values. For example if you want to get the album_type you could do the following:
print(data["albums"]["items"][0]["album_type"])
Since items contains a list you would need to get the first values 0 and then access the album_type.
Output:
single
Here is a link to the code I used with your json.
I suggest you look into how to deal with json data in python, this is a good place to start.
I copied the data from the jsfiddle link.
Now try the following code:
import ast
pyobj=ast.literal_eval(str_cop_from_src)
later you can try with keys
pyobj["albums"]["items"][0]["album_type"]
pyobj will be a python dictionary will all data.

Tweets get back from twitter api are not showing whole tweets

This first I am using python twitter tool. I have question about results get back from it. It seems they are omission of original tweets.
import twitter
api = twitter.Api(consumer_key='jyd2tcu**OHiIrfg',
consumer_secret='****t80qZeM4JYvV5V8UpB0fTtebPSsb0LUjI9kYSZbLTRn',
access_token_key='1***74372608-dfi5bz22RTKep7GF04lk6FnPSYBgnD',
access_token_secret='5gt0YIw***gwPca5RXiwMksg7GM4ACQtl4')
results = api.GetSearch(
raw_query="q=immigration%20&result_type=recent")
Text I got back is
Text='RT #ddale8: Fox is now showing Trump\'s comments at Cabinet. He begins the clip by saying he\'s "heard numbers as high as $275 billion" for h…')
It ends with "…", is it how twitter api works or is there a way i can get whole tweets instead?
thank you
Try passing tweet_mode="extended" to the twitter.Api constructor.
I believe that since the original tweet is greater than 140 chars, we need to inform the interface to expect this as it does not do this by default.

Reading a dictionary from within a dictionary

I have a json file for tweet data. The data that I want to look at is the text of the tweet. For some reason, some of the tweets are too long to put into the normal text part of the dictionary.
It seems like there is a dictionary within another dictionary and I can't figure out how to access it very well.
Basically, what I want in the end is one column of a data frame that will have all of the text from each individual tweet. Here is a link to a small sample of the data that contains a problem tweet.
Here is the code I have so far:
import json
import pandas as pd
tweets = []
#This writes the json file so that I can work with it. This part works correctly.
with open("filelocation.txt") as source
for line in source:
if line.strip():
tweets.append(json.loads(line))
print(len(tweets)
df = pd.DataFrame.from_dict(tweets)
df.info()
When looking at the info you can see that there will be a column called extended_tweet that only encompasses one of the two sample tweets. Within this column, there seems to be another dictionary with one of those keys being full_text.
I want to add another column to the dataframe that just has this information along with the normal text column when the full_text is null.
My first thought was to try and read that specific column of the dataframe as a dictionary again using:
d = pd.DataFrame.from_dict(tweets['extended_tweet]['full_text])
But this doesn't work. I don't really understand why that doesn't work as that is how I read the data the first time.
My guess is that I can't look at the specific names because I am going back to the list and it would have to read all or none. The error it gives me says "KeyError: 'full_text' "
I also tried using the recommendation provided by this website. But this gave me a None value no matter what.
Thanks in advance!
I tried to do what #Dan D. suggested, however, this still gave me errors. But it gave me the idea to try this:
tweet[0]['extended_tweet']['full_text']
This works and gives me the value that I am looking for. But I need to run through the whole thing. So I tried this:
df['full'] = [tweet[i]['extended_tweet']['full_text'] for i in range(len(tweet))
This gives me "Key Error: 'extended_tweet' "
Does it seem like I am on the right track?
I would suggest to flatten out the dictionaries like this:
tweet = json.loads(line)
tweet['full_text'] = tweet['extended_tweet']['full_text']
tweets.append(tweet)
I don't know if the answer suggested earlier works. I never got that successfully. But I did figure out something else that works well for me.
What I really needed was a way to display the full text of a tweet. I first loaded the tweets from the json with what I posted above. Then I noticed that in the data file, there is something called truncated. If this value is true, the tweet is cut short and the full tweet is placed within the
tweet[i]['extended_tweet]['full_text]
In order to access it, I used this:
tweet_list = []
for i in range(len(tweets)):
if tweets[i]['truncated'] == 'True':
tweet_list.append(tweets[i]['extended_tweet']['full_text']
else:
tweet_list.append(tweets[i]['text']
Then I can work with the data using the whol text from each tweet.

Tweepy: extended mode with api.search

I've written a simple script to get the most trending 300 tweets containing a specific hashtag.
for self._tweet in tweepy.Cursor(self._api.search,q=self._screen_name,count=300, lang="en").items(300):
self._csvWriter.writerow([self._tweet.created_at, self._tweet.text.encode('utf-8')])
It works well and it save the result to CSV but the tweets are truncated.
I modified the code like this, adding the twitter_mode=extended parameter:
for self._tweet in tweepy.Cursor(self._api.search,q=self._screen_name,count=300, lang="en", tweet_mode="extended").items(300):
self._csvWriter.writerow([self._tweet.created_at, self._tweet.text.encode('utf-8')])
But I got this exception:
AttributeError: 'Status' object has no attribute 'text
My question is: how can I save an complete tweet using a Cursor? (complete = not truncated)
Thanks in advance (and sorry, I'm a Tweepy newbie trying to learn as much as possible)
You're really close, do this instead:
for self._tweet in tweepy.Cursor(self._api.search,q=self._screen_name,count=300, lang="en", tweet_mode="extended").items(300):
self._csvWriter.writerow([self._tweet.created_at, self._tweet.full_text.encode('utf-8')])
Notice that I used full_text in self._tweet.full_text.encode('utf-8'), rather than just text. The text property is null when you use tweet_mode='extended' and the tweet appears in full_text instead.

Categories