How to create pandas dataframe from Twitter Search API? - python

I am working with the Twitter Search API which returns a dictionary of dictionaries. My goal is to create a dataframe from a list of keys in the response dictionary.
Example of API response here: Example Response
I have a list of keys within the Statuses dictionary
keys = ["created_at", "text", "in_reply_to_screen_name", "source"]
I would like to loop through each key value returned in the Statuses dictionary and put them in a dataframe with the keys as the columns.
Currently have code to loop through a single key individually and assign to list then append to dataframe but want a way to do more than one key at a time. Current code below:
#w is the word to be queired
w = 'keyword'
#count of tweets to return
count = 1000
#API call
query = twitter.search.tweets(q= w, count = count)
def data_l2 (q, k1, k2):
data = []
for results in q[k1]:
data.append(results[k2])
return(data)
screen_names = data_l3(query, "statuses", "user", "screen_name")
data = {'screen_names':screen_names,
'tweets':tweets}
frame=pd.DataFrame(data)
frame

I will share a more generic solution that I came up with, as I was working with the Twitter API. Let's say you have the ID's of tweets that you want to fetch in a list called my_ids :
# Fetch tweets from the twitter API using the following loop:
list_of_tweets = []
# Tweets that can't be found are saved in the list below:
cant_find_tweets_for_those_ids = []
for each_id in my_ids:
try:
list_of_tweets.append(api.get_status(each_id))
except Exception as e:
cant_find_tweets_for_those_ids.append(each_id)
Then in this code block we isolate the json part of each tweepy status object that we have downloaded and we add them all into a list....
my_list_of_dicts = []
for each_json_tweet in list_of_tweets:
my_list_of_dicts.append(each_json_tweet._json)
...and we write this list into a txt file:
with open('tweet_json.txt', 'w') as file:
file.write(json.dumps(my_list_of_dicts, indent=4))
Now we are going to create a DataFrame from the tweet_json.txt file (I have added some keys that were relevant to my use case that I was working on, but you can add your specific keys instead):
my_demo_list = []
with open('tweet_json.txt', encoding='utf-8') as json_file:
all_data = json.load(json_file)
for each_dictionary in all_data:
tweet_id = each_dictionary['id']
whole_tweet = each_dictionary['text']
only_url = whole_tweet[whole_tweet.find('https'):]
favorite_count = each_dictionary['favorite_count']
retweet_count = each_dictionary['retweet_count']
created_at = each_dictionary['created_at']
whole_source = each_dictionary['source']
only_device = whole_source[whole_source.find('rel="nofollow">') + 15:-4]
source = only_device
retweeted_status = each_dictionary['retweeted_status'] = each_dictionary.get('retweeted_status', 'Original tweet')
if retweeted_status == 'Original tweet':
url = only_url
else:
retweeted_status = 'This is a retweet'
url = 'This is a retweet'
my_demo_list.append({'tweet_id': str(tweet_id),
'favorite_count': int(favorite_count),
'retweet_count': int(retweet_count),
'url': url,
'created_at': created_at,
'source': source,
'retweeted_status': retweeted_status,
})
tweet_json = pd.DataFrame(my_demo_list, columns = ['tweet_id', 'favorite_count',
'retweet_count', 'created_at',
'source', 'retweeted_status', 'url'])

Related

Processing API data (json) into a singular data frame (list of list of dictionaries)?

So this is a somewhat of a continuation from a previous post of mine except now I have API data to work with. I am trying to get keys Type and Email as columns in a data frame to come up with a final number. My code:
jsp_full=[]
for p in payloads:
payload = {"payload": {"segmentId":p}}
r = requests.post(url,headers = header, json = payload)
#print(r, r.reason)
time.sleep(r.elapsed.total_seconds())
json_data = r.json() if r and r.status_code == 200 else None
json_keys = json_data['payload']['supporters']
json_package = []
jsp_full.append(json_package)
for row in json_keys:
SID = row['supporterId']
Handle = row['contacts']
a_key = 'value'
list_values = [a_list[a_key] for a_list in Handle]
string = str(list_values).split(",")
data = {
'SupporterID' : SID,
'Email' : strip_characters(string[-1]),
'Type' : labels(p)
}
json_package.append(data)
t2 = round(time.perf_counter(),2)
b_key = "Email"
e = len([b_list[b_key] for b_list in json_package])
t = str(labels(p))
#print(json_package)
print(f'There are {e} emails in the {t} segment')
print(f'Finished in {t2 - t1} seconds')
excel = pd.DataFrame(json_package)
excel.to_excel(r'C:\Users\am\Desktop\email parsing\{0} segment {1}.xlsx'.format(t, str(today)), sheet_name=t)
This part works all well and good. Each payload in the API represents a different segment of people so I split them out into different files. However, I am at a point where I need to combine all records into a single data frame hence why I append out to jsp_full. This is a list of a list of dictionaries.
Once I have that I would run the balance of my code which is like this:
S= pd.DataFrame(jsp_full[0], index = {0})
Advocacy_Supporters = S.sort_values("Type").groupby("Type", as_index=False)["Email"].first()
print(Advocacy_Supporters['Email'].count())
print("The number of Unique Advocacy Supporters is :")
Advocacy_Supporters_Group = Advocacy_Supporters.groupby("Type")["Email"].nunique()
print(Advocacy_Supporters_Group)
Some sample data:
[{'SupporterID': '565f6a2f-c7fd-4f1b-bac2-e33976ef4306', 'Email': 'somebody#somewhere.edu', 'Type': 'd_Student Ambassadors'}, {'SupporterID': '7508dc12-7647-4e95-a8b8-bcb067861faf', 'Email': 'someoneelse#email.somewhere.edu', 'Type': 'd_Student Ambassadors'},...`
My desired output is a dataframe that looks like so:
SupporterID Email Type
565f6a2f-c7fd-4f1b-bac2-e33976ef4306 somebody#somewhere.edu d_Student Ambassadors
7508dc12-7647-4e95-a8b8-bcb067861faf someoneelse#email.somewhere.edu d_Student Ambassadors
Any help is greatly appreciated!!
So because this code creates an excel file for each segment, all I did was read back in the excels via a for loop like so:
filesnames = ['e_S Donors', 'b_Contributors', 'c_Activists', 'd_Student Ambassadors', 'a_Volunteers', 'f_Offline Action Takers']
S= pd.DataFrame()
for i in filesnames:
data = pd.read_excel(r'C:\Users\am\Desktop\email parsing\{0} segment {1}.xlsx'.format(i, str(today)),sheet_name= i, engine = 'openpyxl')
S= S.append(data)
This did the trick since it was in a format I already wanted.

Adding values to python dictionary, only returning last value

Below is a list of twitter handles I am using to scrape tweets
myDict = {}
list = ['ShoePalace', 'StreetWearDealz', 'ClothesUndrCost', 'DealsPlus', 'bodega', 'FRSHSneaks',
'more_sneakers', 'BOOSTLINKS', 'endclothing', 'DopeKixDaily', 'RSVPGallery', 'StealSupply',
'SneakerAlertHD', 'JustFreshKicks', 'solefed', 'SneakerMash', 'StealsBySwell', 'KicksDeals',
'FatKidDeals', 'sneakersteal', 'SOLELINKS', 'SneakerShouts', 'KicksUnderCost', 'snkr_twitr',
'KicksFinder']
In the for loop below I am cycling thru each twitter handle and grabbing data. After the data is pull I am attempting to add the data to the dictionary (myDict). Currently the code is only returning a single value:
{'title': 'Ad: Nike Air Max 97 Golf ‘Grass’ is back in stock at Nikestore!\n\n>>', 'url': 'example.com', 'image': 'image.jpg', 'tweet_url': 'example.com', 'username': 'KicksFinder', 'date': datetime.datetime(2020, 7, 27, 11, 44, 26)}
for i in list:
for tweet in get_tweets(i, pages=1):
tweet_url = 'https://www.twitter.com/' + tweet['tweetUrl']
username = tweet['username']
date = tweet['time']
text = tweet['text']
title = text.split('http')[0]
title = title.strip()
title = title.rstrip()
try:
entries = tweet['entries']
image = entries["photos"][0]
url = entries["urls"][0]
myDict['title'] = title
myDict['url'] = url
myDict['image'] = image
myDict['tweet_url'] = tweet_url
myDict['username'] = username
myDict['date'] = date
except IndexError:
title = title
image = ""
link = ""
return(myDict)
You're mutating a single dict, not adding to a list.
We can refactor your code to a handful of simpler functions that process tweepy? Tweets into dicts and others that yield processed tweet dicts for a given user.
Instead of printing the tweets at the end, you could now list.append them - or even simpler, just tweets = list(process_tweets_for_users(usernames)) :)
def process_tweet(tweet) -> dict:
"""
Turn a Twitter-native Tweet into a dict
"""
tweet_url = "https://www.twitter.com/" + tweet["tweetUrl"]
username = tweet["username"]
date = tweet["time"]
text = tweet["text"]
title = text.split("http")[0]
title = title.strip()
try:
entries = tweet["entries"]
image = entries["photos"][0]
url = entries["urls"][0]
except Exception:
image = url = None
return {
"title": title,
"url": url,
"image": image,
"tweet_url": tweet_url,
"username": username,
"date": date,
}
def process_user_tweets(username: str):
"""
Generate processed tweets for a given user.
"""
for tweet in get_tweets(username, pages=1):
try:
yield process_tweet(tweet)
except Exception as exc:
# TODO: improve error handling
print(exc)
def process_tweets_for_users(usernames):
"""
Generate processed tweets for a number of users.
"""
for username in usernames:
yield from process_user_tweets(username)
usernames = [
"ShoePalace",
"StreetWearDealz",
"ClothesUndrCost",
"DealsPlus",
"bodega",
"FRSHSneaks",
"more_sneakers",
"BOOSTLINKS",
"endclothing",
"DopeKixDaily",
"RSVPGallery",
"StealSupply",
"SneakerAlertHD",
"JustFreshKicks",
"solefed",
"SneakerMash",
"StealsBySwell",
"KicksDeals",
"FatKidDeals",
"sneakersteal",
"SOLELINKS",
"SneakerShouts",
"KicksUnderCost",
"snkr_twitr",
"KicksFinder",
]
for tweet in process_tweets_for_users(usernames):
print(tweet)
It is expected you only get the results for the last value in your lists because you seem to be overwriting the results for each tweet, instead of appending them to a list. I would use defauldict(list) and then append each tweet:
from collections import defaultdict
myDict = defaultdict(list)
for i in list:
for tweet in get_tweets(i, pages=1):
tweet_url = 'https://www.twitter.com/' + tweet['tweetUrl']
username = tweet['username']
date = tweet['time']
text = tweet['text']
title = text.split('http')[0]
title = title.strip()
title = title.rstrip()
try:
entries = tweet['entries']
image = entries["photos"][0]
url = entries["urls"][0]
myDict['title'].append(title)
myDict['url'].append(url)
myDict['image'].append(image)
myDict['tweet_url'].append(tweet_url)
myDict['username'].append(username)
myDict['date'].append(date)
except IndexError:
title = title
image = ""
link = ""
return(myDict)
Now that you have everything nice and tidy, you can put it into a nice dataframe to work with your data:
tweets_df = pd.DataFrame(tweets_df)

Parsing a JSON using specific key words using Python

I'm trying to parse a JSON of a sites stock.
The JSON: https://www.ssense.com/en-us/men/sneakers.json
So I want to take some keywords from the user. Then I want to parse the JSON using these keywords to find the name of the item and (in this specific case) return the ID, SKU and the URL.
So for example:
If I inputted "Black Fennec" I want to parse the JSON and find the ID,SKU, and URL of Black Fennec Sneakers (that have an ID of 3297299, a SKU of 191422M237006, and a url of /men/product/ps-paul-smith/black-fennec-sneakers/3297299 )
I have never attempted doing anything like this. Based on some guides that show how to parse a JSON I started out with this:
r = requests.Session()
stock = r.get("https://www.ssense.com/en-us/men/sneakers.json",headers = headers)
obj json_data = json.loads(stock.text)
However I am now confused. How do I find the product based off the keywords and how do I get the ID,Url and the SKU or it?
Theres a number of ways to handle the output. not sure what you want to do with it. But this should get you going.
EDIT 1:
import requests
r = requests.Session()
obj_json_data = r.get("https://www.ssense.com/en-us/men/sneakers.json").json()
products = obj_json_data['products']
keyword = input('Enter a keyword: ')
for product in products:
if keyword.upper() in product['name'].upper():
name = product['name']
id_var = product['id']
sku = product['sku']
url = product['url']
print ('Product: %s\nID: %s\nSKU: %s\nURL: %s' %(name, id_var, sku, url))
# if you only want to return the first match, uncomment next line
#break
I also have it setup to store it into a dataframe, and or a list too. Just to give some options of where to go with it.
import requests
import pandas as pd
r = requests.Session()
obj_json_data = r.get("https://www.ssense.com/en-us/men/sneakers.json").json()
products = obj_json_data['products']
keyword = input('Enter a keyword: ')
products_found = []
results = pd.DataFrame()
for product in products:
if keyword.upper() in product['name'].upper():
name = product['name']
id_var = product['id']
sku = product['sku']
url = product['url']
temp_df = pd.DataFrame([[name, id_var, sku, url]], columns=['name','id','sku','url'])
results = results.append(temp_df)
products_found = products_found.append(name)
print ('Product: %s\nID: %s\nSKU: %s\nURL: %s' %(name, id_var, sku, url))
if products_found == []:
print ('Nothing found')
EDIT 2: Here is another way to do it by converting the json to a dataframe, then filtering by those rows that have the keyword in the name (this is actually a better solution in my opinion)
import requests
import pandas as pd
from pandas.io.json import json_normalize
r = requests.Session()
obj_json_data = r.get("https://www.ssense.com/en-us/men/sneakers.json").json()
products = obj_json_data['products']
products_df = json_normalize(products)
keyword = input('Enter a keyword: ')
products_found = []
results = pd.DataFrame()
results = products_df[products_df['name'].str.contains(keyword, case = False)]
#print (results[['name', 'id', 'sku', 'url']])
products_found = list(results['name'])
if products_found == []:
print ('Nothing found')
else:
print ('Found: '+ str(products_found))

Issues with downloading dict into CSV accesing NY Times API via Python

I am currently trying to download a large number of NY Times articles using their API, based on Python 2.7. To do so, I was able to reuse a piece of code i found online:
[code]from nytimesarticle import articleAPI
api = articleAPI('...')
articles = api.search( q = 'Brazil',
fq = {'headline':'Brazil', 'source':['Reuters','AP', 'The New York Times']},
begin_date = '20090101' )
def parse_articles(articles):
'''
This function takes in a response to the NYT api and parses
the articles into a list of dictionaries
'''
news = []
for i in articles['response']['docs']:
dic = {}
dic['id'] = i['_id']
if i['abstract'] is not None:
dic['abstract'] = i['abstract'].encode("utf8")
dic['headline'] = i['headline']['main'].encode("utf8")
dic['desk'] = i['news_desk']
dic['date'] = i['pub_date'][0:10] # cutting time of day.
dic['section'] = i['section_name']
if i['snippet'] is not None:
dic['snippet'] = i['snippet'].encode("utf8")
dic['source'] = i['source']
dic['type'] = i['type_of_material']
dic['url'] = i['web_url']
dic['word_count'] = i['word_count']
# locations
locations = []
for x in range(0,len(i['keywords'])):
if 'glocations' in i['keywords'][x]['name']:
locations.append(i['keywords'][x]['value'])
dic['locations'] = locations
# subject
subjects = []
for x in range(0,len(i['keywords'])):
if 'subject' in i['keywords'][x]['name']:
subjects.append(i['keywords'][x]['value'])
dic['subjects'] = subjects
news.append(dic)
return(news)
def get_articles(date,query):
'''
This function accepts a year in string format (e.g.'1980')
and a query (e.g.'Amnesty International') and it will
return a list of parsed articles (in dictionaries)
for that year.
'''
all_articles = []
for i in range(0,100): #NYT limits pager to first 100 pages. But rarely will you find over 100 pages of results anyway.
articles = api.search(q = query,
fq = {'headline':'Brazil','source':['Reuters','AP', 'The New York Times']},
begin_date = date + '0101',
end_date = date + '1231',
page = str(i))
articles = parse_articles(articles)
all_articles = all_articles + articles
return(all_articles)
Download_all = []
for i in range(2009,2010):
print 'Processing' + str(i) + '...'
Amnesty_year = get_articles(str(i),'Brazil')
Download_all = Download_all + Amnesty_year
import csv
keys = Download_all[0].keys()
with open('brazil-mentions.csv', 'wb') as output_file:
dict_writer = csv.DictWriter(output_file, keys)
dict_writer.writeheader()
dict_writer.writerows(Download_all)
Without the last bit (starting with "... import csv" this seems to be working fine. If I simply print my results, ("print Download_all") I can see them, however in a very unstructured way. Running the actual code i however get the message:
File "C:\Users\xxx.yyy\AppData\Local\Continuum\Anaconda2\lib\csv.py", line 148, in _dict_to_list
+ ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: 'abstract'
Since I am quite a newbie at this, I would highly appreciate your help in guiding me how to download the news articles into a csv file in a structured way.
Thanks a lot in advance!
Best regards
Where you have:
keys = Download_all[0].keys()
This takes the column headers for the CSV from the dictionary for the first article. The problem is that the article dictionaries do not all have the same keys, so when you reach the first one that has the extra abstract key, it fails.
It looks like you'll have problems with abstract and snippet which are only added to the dictionary if they exist in the response.
You need to make keys equal to the superset of all possible keys:
keys = Download_all[0].keys() + ['abstract', 'snippet']
Or, ensure that every dict has a value for every field:
def parse_articles(articles):
...
if i['abstract'] is not None:
dic['abstract'] = i['abstract'].encode("utf8")
else:
dic['abstract'] = ""
...
if i['snippet'] is not None:
dic['snippet'] = i['snippet'].encode("utf8")
else:
dic['snippet'] = ""

Selecting values from a JSON file in Python

I am getting JIRA data using the following python code,
how do I store the response for more than one key (my example shows only one KEY but in general I get lot of data) and print only the values corresponding to total,key, customfield_12830, summary
import requests
import json
import logging
import datetime
import base64
import urllib
serverURL = 'https://jira-stability-tools.company.com/jira'
user = 'username'
password = 'password'
query = 'project = PROJECTNAME AND "Build Info" ~ BUILDNAME AND assignee=ASSIGNEENAME'
jql = '/rest/api/2/search?jql=%s' % urllib.quote(query)
response = requests.get(serverURL + jql,verify=False,auth=(user, password))
print response.json()
response.json() OUTPUT:-
http://pastebin.com/h8R4QMgB
From the the link you pasted to pastebin and from the json that I saw, its a you issues as list containing key, fields(which holds custom fields), self, id, expand.
You can simply iterate through this response and extract values for keys you want. You can go like.
data = response.json()
issues = data.get('issues', list())
x = list()
for issue in issues:
temp = {
'key': issue['key'],
'customfield': issue['fields']['customfield_12830'],
'total': issue['fields']['progress']['total']
}
x.append(temp)
print(x)
x is list of dictionaries containing the data for fields you mentioned. Let me know if I have been unclear somewhere or what I have given is not what you are looking for.
PS: It is always advisable to use dict.get('keyname', None) to get values as you can always put a default value if key is not found. For this solution I didn't do it as I just wanted to provide approach.
Update: In the comments you(OP) mentioned that it gives attributerror.Try this code
data = response.json()
issues = data.get('issues', list())
x = list()
for issue in issues:
temp = dict()
key = issue.get('key', None)
if key:
temp['key'] = key
fields = issue.get('fields', None)
if fields:
customfield = fields.get('customfield_12830', None)
temp['customfield'] = customfield
progress = fields.get('progress', None)
if progress:
total = progress.get('total', None)
temp['total'] = total
x.append(temp)
print(x)

Categories