Flatten and Shape JSON DataFrame - python

I have the JSON-like data below in data. I want it to look like the Expected Result below.
import json
import pandas as pd

data = [{'useEventValue': True,
         'eventConditions': [{'type': 'CATEGORY',
                              'matchType': 'EXACT',
                              'expression': 'ABC'},
                             {'type': 'ACTION',
                              'matchType': 'EXACT',
                              'expression': 'DEF'},
                             {'type': 'LABEL',
                              'matchType': 'REGEXP',
                              'expression': 'GHI|JKL'}]}]
Expected Result:

  Category_matchType Category_expression Action_matchType Action_expression Label_matchType Label_expression
0              EXACT                 ABC            EXACT               DEF          REGEXP          GHI|JKL
What I've Tried:
This question is similar, but I'm not using the index the way the OP is. Following that example, I've tried json_normalize and then various forms of melt, stack, unstack, pivot, etc., but there has to be an easier way!
# This produces the result below, where I could start applying reshaping
# functions to get what I need, but it seems messy:
df = pd.json_normalize(data, 'eventConditions')
       type matchType expression
0  CATEGORY     EXACT        ABC
1    ACTION     EXACT        DEF
2     LABEL    REGEXP    GHI|JKL

We can use json_normalize to read the JSON data into a pandas DataFrame, then use stack followed by unstack to reshape it:
df = pd.json_normalize(data, 'eventConditions')
df = df.set_index([df.groupby('type').cumcount(), 'type']).stack().unstack([1, 2])
df.columns = df.columns.map('_'.join)
  CATEGORY_matchType CATEGORY_expression ACTION_matchType ACTION_expression LABEL_matchType LABEL_expression
0              EXACT                 ABC            EXACT               DEF          REGEXP          GHI|JKL

If your data is not too large, you could also process the JSON first and then build the DataFrame directly, like this:
import pandas as pd
import json

data = [{'useEventValue': True,
         'eventConditions': [{'type': 'CATEGORY',
                              'matchType': 'EXACT',
                              'expression': 'ABC'},
                             {'type': 'ACTION',
                              'matchType': 'EXACT',
                              'expression': 'DEF'},
                             {'type': 'LABEL',
                              'matchType': 'REGEXP',
                              'expression': 'GHI|JKL'}]}]

new_data = {}
for i in data:
    for event in i['eventConditions']:
        for key in event.keys():
            if key != 'type':
                col_name = event['type'] + '_' + key
                # list.append returns None, so a one-line conditional
                # assignment would wipe out an existing list; build or
                # extend the list explicitly instead.
                if col_name not in new_data:
                    new_data[col_name] = [event[key]]
                else:
                    new_data[col_name].append(event[key])

df = pd.DataFrame(new_data)
df
Just found a way to do it with Pandas only:
df = pd.json_normalize(data, 'eventConditions')
df = df.melt(id_vars=['type'])
df['type'] = df['type'] + '_' + df['variable']
df.drop(columns=['variable'], inplace=True)
df.set_index('type', inplace=True)
df = df.T
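For reference, printing the result on the sample data should give a single row roughly like the sketch below (the column order comes from melt, so it differs from the stack/unstack answer):

print(df)
# type   CATEGORY_matchType ACTION_matchType LABEL_matchType CATEGORY_expression ACTION_expression LABEL_expression
# value               EXACT            EXACT          REGEXP                 ABC               DEF          GHI|JKL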

Faster way to iterate over columns in pandas

I have the following task.
I have this data:
import pandas as pd
import numpy as np

# Note the leading space in the ' email' column name; it is kept here
# because the code below refers to it.
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        ' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan, np.nan, 'joe#gmail.com', 'rick#gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']}
df = pd.DataFrame(data)
What I have to do is, for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), extract the matching personal information and write it to a file.
E.g., if we look at most_visited_airport and Heathrow,
I need to output three files containing the names, emails, and phones of the people whose most-visited airport is Heathrow.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I'm not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which not only gets the unique values but also splits up the dataframe for you:

for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        # Use the loop variable for each attribute (the original wrote
        # group['name'] for all three), and remember that the email
        # column's name starts with a space.
        for attr in ('name', 'phone', ' email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr.strip()}.csv')
You can do it this way:

cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = [i for i in df[most_col].dropna().unique().tolist()]
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, in the options when writing to CSV you can keep the index, but don't forget to reset it before writing.
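A minimal sketch of that suggestion (the column and value here are just examples):

# Keep the index column in the CSV, but reset it first so it runs 0..n-1
# instead of carrying over the original row labels.
subset = df.loc[df['most_visited_airport'] == 'Heathrow', 'name'].dropna()
subset.reset_index(drop=True).to_csv('most_visited_airport_Heathrow_name.csv')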

pandas, dataframe: If you need to process data row by row, how to do it faster than itertuples

I know that .itertuples() and .iterrows() are slow, but how can I speed them up if I need to use and process data one row at a time, as shown below?
df = pd.read_csv('example.csv')

posts = []
for row in df.itertuples():
    post = Post(title=row.title, text=row.text, ...)
    posts.append(post)
You can use a list comprehension and unpacking (via kwargs) if your DataFrame columns have the same names as your class attributes. An example is shown below.
df = pd.DataFrame({"title": ["fizz", "buzz"], "text": ["aaaa", "bbbb"]})
posts = [Post(**kwargs) for kwargs in df.to_dict("records")]
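For the snippet above to run on its own, Post just needs a constructor taking title and text; a hypothetical minimal stand-in:

from dataclasses import dataclass

@dataclass
class Post:
    # Hypothetical placeholder for the question's Post class;
    # substitute your real model.
    title: str
    text: str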
What I usually do is use the apply function:
import pandas as pd

df = pd.DataFrame(dict(title=["title1", "title2", "title3"],
                       text=["text1", "text2", "text3"]))
df["Posts"] = df.apply(lambda x: dict(title=x["title"], text=x["text"]), axis=1)
posts = list(df["Posts"])
print(posts)
Output:
[{'title': 'title1', 'text': 'text1'}, {'title': 'title2', 'text': 'text2'}, {'title': 'title3', 'text': 'text3'}]
It's better to avoid a for loop when you have other methods to do it.

Error handling with dataframe.explode() in Python pandas

I have some data that I am using the df.explode() method on. I get an error when running the code. I know the error is caused because one of my rows (row 3) does not have a corresponding Qty for each location, but how can I handle that error?
Code that returns ValueError: cannot reindex from a duplicate axis:
import pandas as pd
import openpyxl

data = {'ITEM': ['Item1', 'Item2', 'Item3'],
        'Locations': ['loc1;loc2', 'loc3', 'loc4;loc5'],
        'Qty': ['100;200', '100', '500']}
df1 = pd.DataFrame(data, columns=['ITEM', 'Locations', 'Qty'])
print(df1)

formatted_df1 = (df1.set_index(['ITEM'])
                    .apply(lambda x: x.str.split(';').explode())
                    .reset_index())
print(formatted_df1)
Code that works (note that the last record has 500;600):

import pandas as pd
import openpyxl

data = {'ITEM': ['Item1', 'Item2', 'Item3'],
        'Locations': ['loc1;loc2', 'loc3', 'loc4;loc5'],
        'Qty': ['100;200', '100', '500;600']}
df1 = pd.DataFrame(data, columns=['ITEM', 'Locations', 'Qty'])
print(df1)

formatted_df1 = (df1.set_index(['ITEM'])
                    .apply(lambda x: x.str.split(';').explode())
                    .reset_index())
print(formatted_df1)
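The thread doesn't include an accepted fix, but one way to tolerate ragged rows like row 3 is to pad the shorter list before exploding. A sketch, assuming pandas 1.3+ (needed for multi-column explode):

import pandas as pd

data = {'ITEM': ['Item1', 'Item2', 'Item3'],
        'Locations': ['loc1;loc2', 'loc3', 'loc4;loc5'],
        'Qty': ['100;200', '100', '500']}
df1 = pd.DataFrame(data)

# Split both columns into lists first.
split = df1.assign(Locations=df1['Locations'].str.split(';'),
                   Qty=df1['Qty'].str.split(';'))

# Pad the shorter list in each row with None so the lengths always match.
widths = [max(len(l), len(q)) for l, q in zip(split['Locations'], split['Qty'])]
for col in ('Locations', 'Qty'):
    split[col] = [lst + [None] * (w - len(lst)) for lst, w in zip(split[col], widths)]

# Exploding several columns at once requires pandas 1.3+.
formatted_df1 = split.explode(['Locations', 'Qty']).reset_index(drop=True)
print(formatted_df1)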

Selectively load JSON data into a dataframe

I have some json data that I want to put into a pandas dataframe. The json looks like this:
{'date': [20170629,
          20170630,
          20170703,
          20170705,
          20170706,
          20170707],
 'errorMessage': None,
 'seriesarr': [{'chartOnlyFlag': 'false',
                'dqMaxValidStr': None,
                'expression': 'DB(FXO,V1,EUR,USD,7D,VOL)',
                'freq': None,
                'frequency': None,
                'iDailyDates': None,
                'label': '',
                'message': None,
                'plotPoints': [0.0481411225888,
                               0.0462401214563,
                               0.0587196848727,
                               0.0765737640932,
                               0.0678912611279,
                               0.0675766942022]}]}
I am trying to create a pandas DataFrame with 'date' as the index and 'plotPoints' as a second column. I don't need any of the other information.
I've tried
df = pd.io.json.json_normalize(data, record_path = 'date', meta = ['seriesarr', ['plotPoints']])
When I do this I get the following error:
KeyError: ("Try running with errors='ignore' as key %s is not always present", KeyError('plotPoints',))
Any help with this is appreciated.
Thanks!
IIUC, json_normalize may not be able to help you here. It might instead just be easier to extract that data and then load it into a dataframe directly. If need be, convert to datetime using pd.to_datetime:
date = data.get('date')
plotPoints = data.get('seriesarr')[0].get('plotPoints')

df = pd.DataFrame({'date': pd.to_datetime(date, format='%Y%m%d'),
                   'plotPoints': plotPoints})
df
        date  plotPoints
0 2017-06-29    0.048141
1 2017-06-30    0.046240
2 2017-07-03    0.058720
3 2017-07-05    0.076574
4 2017-07-06    0.067891
5 2017-07-07    0.067577
This is under the assumption that your data is exactly as shown in the question.
As @COLDSPEED pointed out, pulling the data directly out of the dictionary is the way to go, since 'plotPoints' is contained within a list of dicts.
A list-comprehension variation is below; it has date as the index and plotPoints as the column.
col1 = data['date']
# dict.iteritems() is Python 2 only; .items() does the same on Python 3
adict = dict((k, v) for d in data['seriesarr'] for k, v in d.items())
col2 = adict['plotPoints']
pd.DataFrame(data=col2, index=col1)
                 0
20170629  0.048141
20170630  0.046240
20170703  0.058720
20170705  0.076574
20170706  0.067891
20170707  0.067577

Get a list of keys and values in a nested dictionary oriented by index

I have an Excel file with a structure like this:
name   age  status
anna    35  single
petr    27  married
I have converted such a file into a nested dictionary with a structure like this:
{'anna': {'age': 35, 'status': 'single'},
 'petr': {'age': 27, 'status': 'married'}}
using pandas:
import pandas as pd
df = pd.read_excel('path/to/file')
df.set_index('name', inplace=True)
print(df.to_dict(orient='index'))
But now when running list(df.keys()) it returns a list of all the keys in the dictionary ('age', 'status', etc.) but not 'name'.
My eventual goal is to get back all the keys and values for a given name.
Is that possible somehow? Or should I import the data some other way to achieve this? I need to end up with a dictionary either way, because I will merge it with other dictionaries by key.
I think you need the parameter drop=False in set_index so that the name column is not dropped:
import pandas as pd

df = pd.read_excel('path/to/file')
df.set_index('name', inplace=True, drop=False)
print(df)
      name  age   status
name
anna  anna   35   single
petr  petr   27  married

d = df.to_dict(orient='index')
print(d)
{'anna': {'age': 35, 'status': 'single', 'name': 'anna'},
 'petr': {'age': 27, 'status': 'married', 'name': 'petr'}}

print(list(df.keys()))
['name', 'age', 'status']
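With name kept as a column, getting everything for one person is then a plain dictionary lookup:

print(d['anna'])
# {'age': 35, 'status': 'single', 'name': 'anna'}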
Given a dataframe read from Excel, you could do this to obtain the dictionary you want:

resulting_dict = {}
# Series.iteritems() is deprecated; .items() does the same thing
for name, info in df.groupby('name').apply(lambda x: x.to_dict()).items():
    stats = {}
    for key, values in info.items():
        if key != 'name':
            value = list(values.values())[0]
            stats[key] = value
    resulting_dict[name] = stats
Try this:

import pandas as pd

df = pd.read_excel('path/to/file')
df[df['name'] == 'anna']  # get all details of anna
df[df['name'] == 'petr']  # get all details of petr
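If you still need a dictionary rather than a filtered frame, the matching row can be converted directly; a small sketch (assumes the name exists in the data):

# Convert the single matching row to a plain dict.
details = df[df['name'] == 'anna'].to_dict(orient='records')[0]
print(details)  # e.g. {'name': 'anna', 'age': 35, 'status': 'single'}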
