Extract nested values from data frame using python

Extract nested values from data frame using python - python

I've extracted the data from API response and created a dictionary function:
def data_from_api(a):
dictionary = dict(
data = a['number']
,created_by = a['opened_by']
,assigned_to = a['assigned']
,closed_by = a['closed']
)
return dictionary
and then to df (around 1k records):
raw_data = []
for k in data['resultsData']:
records = data_from_api(k)
raw_data.append(records)
I would like to create a function allows to extract the nested fields {display_value} in the columns in the dataframe. I need only the names like John Snow, etc. Please see below:
How to create a function extracts the display values for those fields? I've tried to create something like:
df = pd.DataFrame.from_records(raw_data)
def get_nested_fields(nested):
if isinstance(nested, dict):
return nested['display_value']
else:
return ''
df['created_by'] = df['opened_by'].apply(get_nested_fields)
df['assigned_to'] = df['assigned'].apply(get_nested_fields)
df['closed_by'] = df['closed'].apply(get_nested_fields)
but I'm getting an error:
KeyError: 'created_by'
Could you please help me?

You can use .str and get() like below. If the key isn't there, it'll write None.
df = pd.DataFrame({'data':[1234, 5678, 5656], 'created_by':[{'display_value':'John Snow', 'link':'a.com'}, {'display_value':'John Dow'}, {'my_value':'Jane Doe'}]})
df['author'] = df['created_by'].str.get('display_value')
output
data created_by author
0 1234 {'display_value': 'John Snow', 'link': 'a.com'} John Snow
1 5678 {'display_value': 'John Dow'} John Dow
2 5656 {'my_value': 'Jane Doe'} None

Related

Processing API data (json) into a singular data frame (list of list of dictionaries)?

So this is a somewhat of a continuation from a previous post of mine except now I have API data to work with. I am trying to get keys Type and Email as columns in a data frame to come up with a final number. My code:
jsp_full=[]
for p in payloads:
payload = {"payload": {"segmentId":p}}
r = requests.post(url,headers = header, json = payload)
#print(r, r.reason)
time.sleep(r.elapsed.total_seconds())
json_data = r.json() if r and r.status_code == 200 else None
json_keys = json_data['payload']['supporters']
json_package = []
jsp_full.append(json_package)
for row in json_keys:
SID = row['supporterId']
Handle = row['contacts']
a_key = 'value'
list_values = [a_list[a_key] for a_list in Handle]
string = str(list_values).split(",")
data = {
'SupporterID' : SID,
'Email' : strip_characters(string[-1]),
'Type' : labels(p)
}
json_package.append(data)
t2 = round(time.perf_counter(),2)
b_key = "Email"
e = len([b_list[b_key] for b_list in json_package])
t = str(labels(p))
#print(json_package)
print(f'There are {e} emails in the {t} segment')
print(f'Finished in {t2 - t1} seconds')
excel = pd.DataFrame(json_package)
excel.to_excel(r'C:\Users\am\Desktop\email parsing\{0} segment {1}.xlsx'.format(t, str(today)), sheet_name=t)
This part works all well and good. Each payload in the API represents a different segment of people so I split them out into different files. However, I am at a point where I need to combine all records into a single data frame hence why I append out to jsp_full. This is a list of a list of dictionaries.
Once I have that I would run the balance of my code which is like this:
S= pd.DataFrame(jsp_full[0], index = {0})
Advocacy_Supporters = S.sort_values("Type").groupby("Type", as_index=False)["Email"].first()
print(Advocacy_Supporters['Email'].count())
print("The number of Unique Advocacy Supporters is :")
Advocacy_Supporters_Group = Advocacy_Supporters.groupby("Type")["Email"].nunique()
print(Advocacy_Supporters_Group)
Some sample data:
[{'SupporterID': '565f6a2f-c7fd-4f1b-bac2-e33976ef4306', 'Email': 'somebody#somewhere.edu', 'Type': 'd_Student Ambassadors'}, {'SupporterID': '7508dc12-7647-4e95-a8b8-bcb067861faf', 'Email': 'someoneelse#email.somewhere.edu', 'Type': 'd_Student Ambassadors'},...`
My desired output is a dataframe that looks like so:
SupporterID Email Type
565f6a2f-c7fd-4f1b-bac2-e33976ef4306 somebody#somewhere.edu d_Student Ambassadors
7508dc12-7647-4e95-a8b8-bcb067861faf someoneelse#email.somewhere.edu d_Student Ambassadors
Any help is greatly appreciated!!

So because this code creates an excel file for each segment, all I did was read back in the excels via a for loop like so:
filesnames = ['e_S Donors', 'b_Contributors', 'c_Activists', 'd_Student Ambassadors', 'a_Volunteers', 'f_Offline Action Takers']
S= pd.DataFrame()
for i in filesnames:
data = pd.read_excel(r'C:\Users\am\Desktop\email parsing\{0} segment {1}.xlsx'.format(i, str(today)),sheet_name= i, engine = 'openpyxl')
S= S.append(data)
This did the trick since it was in a format I already wanted.

I have an input file that I am trying to build a dictionary from

I have an input file that I am trying to build a data base from.
Each line looks like this:
Amy Shchumer, Trainwreck, I Feel Pretty, Snatched, Inside Amy Shchumer
Bill Hader,Inside Out, Trainwreck, Tropic Thunder
And so on.
The first string is an actor\actress, and then movies they played in.
The data isn't sorted and they are some trailing whitespaces.
I would like to create a dictionary that would look like this:
{'Trainwreck': {'Amy Shchumer', 'Bill Hader'}}
The key would be the movie, the values should be the actors in it, unified in a set data type.
def create_db():
my_dict = {}
raw_data = open('database.txt','r+')
for line in raw_data:
lst1 = line.split(",") //to split by the commas
len_row = len(lst1)
lst2 = list(lst1)
for j in range(1,len_row):
my_dict[lst2[j]] = set([lst2[0]])
print(my_dict)
It doesn't work... it doesn't solve the issue that when a key already exists then the actor should be unified in a set with the prev actor
Instead I end up with:
'Trainwreck': {'Amy Shchumer'}, 'Inside Out': {'Bill Hader'}

def create_db():
db = {}
with open("database.txt") as data:
for line in data.readlines():
person, *movies = line.split(",")
for m in movies:
m = m.strip()
db[m] = db.get(m, []) + [person]
return db
Output:
{'Trainwreck': ['Amy Shchumer', 'Bill Hader'],
'I Feel Pretty': ['Amy Shchumer'],
'Snatched': ['Amy Shchumer'],
'Inside Amy Shchumer': ['Amy Shchumer'],
'Inside Out': ['Bill Hader'],
'Tropic Thunder': ['Bill Hader']}
This will loop through the data and assign the first value of each line to person and the rest to movies (see here for an example of how * unpacks tuples). Then for all the movies, it uses .get to check if it’s in the database yet, returning the list if it is and an empty list if it isn’t. Then it adds the new actor to the list.
Another way to do this would be to use a defaultdict:
from collections import defaultdict
def create_db():
db = defaultdict(lambda: [])
with open("database.txt") as data:
for line in data.readlines():
person, *movies = line.split(",")
for m in movies:
db[m.strip()].append(person)
return db
which automatically assigns [] if the key does not exist.

How to populate dataframe from dictionary in loop

I am trying to perform entity analysis on text and I want to put the results in a dataframe. Currently the results are not stored in a dictionary, nor in a Dataframe. The results are extracted with two functions.
df:
ID title cur_working pos_arg neg_arg date
132 leave yes good coffee management, leadership and salary 13-04-2018
145 love it yes nice colleagues long days 14-04-2018
I have the following code:
result = entity_analysis(df, 'neg_arg', 'ID')
#This code loops through the rows and calls the function entities_text()
def entity_analysis(df, col, idcol):
temp_dict = {}
for index, row in df.iterrows():
id = (row[idcol])
x = (row[col])
entities = entities_text(x, id)
#temp_dict.append(entities)
#final = pd.DataFrame(columns = ['id', 'name', 'type', 'salience'])
return print(entities)
def entities_text(text, id):
"""Detects entities in the text."""
client = language.LanguageServiceClient()
ent_df = {}
if isinstance(text, six.binary_type):
text = text.decode('utf-8')
# Instantiates a plain text document.
document = types.Document(
content=text,
type=enums.Document.Type.PLAIN_TEXT)
# Detects entities in the document.
entities = client.analyze_entities(document).entities
# entity types from enums.Entity.Type
entity_type = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')
for entity in entities:
ent_df[id] = ({
'name': [entity.name],
'type': [entity_type[entity.type]],
'salience': [entity.salience]
})
return print(ent_df)
This code gives the following outcome:
{'132': {'name': ['management'], 'type': ['OTHER'], 'salience': [0.16079013049602509]}}
{'132': {'name': ['leadership'], 'type': ['OTHER'], 'salience': [0.05074194446206093]}}
{'132': {'name': ['salary'], 'type': ['OTHER'], 'salience': [0.27505040168762207]}}
{'145': {'name': ['days'], 'type': ['OTHER'], 'salience': [0.004272154998034239]}}
I have created temp_dict and a final dataframe in the function entity_analysis(). This thread explained that appending to a dataframe in a loop is not efficient. I don't know how to populate the dataframe in an efficient way. These threads are related to my question but they explain how to populate a Dataframe from existing data. When I try to use temp_dict.update(entities) and return temp_dict I get an error:
in entity_analysis
temp_dict.update(entities)
TypeError: 'NoneType' object is not iterable
I want the output to be like this:
ID name type salience
132 management OTHER 0.16079013049602509
132 leadership OTHER 0.05074194446206093
132 salary OTHER 0.27505040168762207
145 days OTHER 0.004272154998034239

One solution is to create a list of lists via your entities iterable. Then feed your list of lists into pd.DataFrame:
LoL = []
for entity in entities:
LoL.append([id, entity.name, entity_type[entity.type], entity.salience])
df = pd.DataFrame(LoL, columns=['ID', 'name', 'type', 'salience'])
If you also need the dictionary in the format you currently produce, then you can add your current logic to your for loop. However, first check whether you need to use two structures to store identical data.

Split string, unicode, unicode, string in python

I was trying to split combination of string, unicode in python. The split has to be made on the ResultSet object retrieved from web-site. Using the code below, I am able to get the details, actually it is user details:
from bs4 import BeautifulSoup
import urllib2
import re
url = "http://www.mouthshut.com/vinay_beriwal"
profile_user = urllib2.urlopen(url)
profile_soup = BeautifulSoup(profile_user.read())
usr_dtls = profile_soup.find("div",id=re.compile("_divAboutMe")).find_all('p')
for dt in usr_dtls:
usr_dtls = " ".join(dt.text.split())
print(usr_dtls)
The output is as below:
i love yellow..
Name: Vinay Beriwal
Age: 39 years
Hometown: New Delhi, India
Country: India
Member since: Feb 11, 2016
What I need is to create distinct 5 variables as Name, Age, Hometown, Country, Member since and store the corresponding value after ':' for same.
Thanks

You can use a dictionary to store name-value pairs.For example -
my_dict = {"Name":"Vinay","Age":21}
In my_dict, Name and Age are the keys of the dictionary, you can access values like this -
print (my_dict["Name"]) #This will print Vinay
Also, it's nice and better to use complete words for variable names.
results = profile_soup.find("div",id=re.compile("_divAboutMe")).find_all('p')
user_data={} #dictionary initialization
for result in results:
result = " ".join(result.text.split())
try:
var,value = result.strip().split(':')
user_data[var.strip()]=value.strip()
except:
pass
#If you print the user_data now
print (user_data)
'''
This is what it'll print
{'Age': ' 39 years', 'Country': ' India', 'Hometown': 'New Delhi, India', 'Name': 'Vinay Beriwal', 'Member since': 'Feb 11, 2016'}
'''

You can use a dictionary to store your data:
my_dict = {}
for dt in usr_dtls:
item = " ".join(dt.text.split())
try:
if ':' in item:
k, v = item.split(':')
my_dict[k.strip()] = v.strip()
except:
pass
Note: You should not use usr_dtls inside your for loop, because that's would override your original usr_dtls

Python: change value inside dual key dictionary

So what I'm trying to do is make a dictionary of people and their information but I want to use their names as the main key and have each part of their information to also have a key. I have not been able to figure out how to go about changing the values of their individual information.
I'm not even sure if I'm going about this the right way here is the code.
name = raw_input("name")
age = raw_input("age")
address = raw_input("address")
ramrod = {}
ramrod[name] = {'age': age}, {'address' : address}
print ramrod
#prints out something like this: {'Todd': ({'age': '67'}, {'address': '55555 FooBar rd'})}

What you are looking for is a simple nested dictionary:
>>> data = {"Bob": {"Age": 20, "Hobby": "Surfing"}}
>>> data["Bob"]["Age"]
20
A dictionary is not a pair - you can store more than one item in a dictionary. So you want one dictionary containing a mapping from name to information, where information is a dictionary containing mappings from the name of the information you want to store about the person to that information.
Note that if you have behaviour associated with the data, or you end up with a lot of large dictionaries, a class might be more suitable:
class Person:
def __init__(self, name, age, hobby):
self.name = name
self.age = age
self.hobby = hobby
>>> data = {"Bob": Person("Bob", 20, "Surfing")}
>>> data["Bob"].age
20

You were close
ramrod[name] = {'age': age, 'address' : address}

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract nested values from data frame using python - python

Related

Processing API data (json) into a singular data frame (list of list of dictionaries)?

I have an input file that I am trying to build a dictionary from

How to populate dataframe from dictionary in loop

Split string, unicode, unicode, string in python

Python: change value inside dual key dictionary

Categories

Resources