Convert pandas multiindex dataframe to nested dictionary - python

I have a pandas multiindex dataframe that I'm trying to output as a nested dictionary.
import pandas as pd

# create the dataset
data = {'clump_thickness': {(0, 0): 274.0, (0, 1): 19.0, (1, 0): 67.0, (1, 1): 12.0, (2, 0): 83.0, (2, 1): 45.0, (3, 0): 16.0, (3, 1): 40.0, (4, 0): 4.0, (4, 1): 54.0, (5, 0): 0.0, (5, 1): 69.0, (6, 0): 0.0, (6, 1): 0.0, (7, 0): 0.0, (7, 1): 0.0, (8, 0): 0.0, (8, 1): 0.0, (9, 0): 0.0, (9, 1): 0.0}}
df = pd.DataFrame(data)
df.head()
#      clump_thickness
# 0 0            274.0
#   1             19.0
# 1 0             67.0
#   1             12.0
# 2 0             83.0
df is the dataframe that I want to output as a nested dictionary. The output I'm looking for is in the form -
{"0":
{
"0":274,
"1":19
},
"1":{
"0":67,
"1":12
},
"2":{
"0":83,
"1":45
},
"3":{
"0":16,
"1":40
},
"4":{
"0":4,
"1":54
},
"5":{
"0":0,
"1":69
}
}
Here the first index level forms the keys of the outermost dictionary. For each key we have a dictionary whose keys are the values in the second index level.
When I do df.to_dict(), instead of nesting, the multiindex keys are returned as tuples. How do I achieve the nested output?

This works for me:
d = {l: df.xs(l)['clump_thickness'].to_dict() for l in df.index.levels[0]}
Another solution, similar to DataFrame with MultiIndex to dict, but it is necessary to select the column first so that a Series is converted:
d = df.groupby(level=0).apply(lambda df: df.xs(df.name).clump_thickness.to_dict()).to_dict()
print (d)
{0: {0: 274.0, 1: 19.0},
1: {0: 67.0, 1: 12.0},
2: {0: 83.0, 1: 45.0},
3: {0: 16.0, 1: 40.0},
4: {0: 4.0, 1: 54.0},
5: {0: 0.0, 1: 69.0},
6: {0: 0.0, 1: 0.0},
7: {0: 0.0, 1: 0.0},
8: {0: 0.0, 1: 0.0},
9: {0: 0.0, 1: 0.0}}

Alternatively, unstack the inner index level into columns and convert row by row:
df.unstack().clump_thickness.apply(lambda x: x.to_dict(), axis=1).to_dict()
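For completeness, a minimal self-contained sketch of an equivalent groupby-based variant (the names nested and sub are purely illustrative, and the sample data is truncated for brevity):
import pandas as pd

# sample data from the question, truncated
data = {'clump_thickness': {(0, 0): 274.0, (0, 1): 19.0, (1, 0): 67.0, (1, 1): 12.0}}
df = pd.DataFrame(data)

# group on the outer index level, drop it from each group, then convert to dict
nested = {outer: sub.droplevel(0)['clump_thickness'].to_dict()
          for outer, sub in df.groupby(level=0)}
print(nested)
# {0: {0: 274.0, 1: 19.0}, 1: {0: 67.0, 1: 12.0}}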

Related

Convert `DataFrame.groupby()` into dictionary (and then reverse it)

Say I have the following DataFrame, where I have repeated observations per individual (column id_ind). Hence, the first two rows belong to the first individual, the third and fourth rows belong to the second individual, and so forth...
import pandas as pd
X = pd.DataFrame.from_dict({'x1_1': {0: -0.1766214634108258, 1: 1.645852185286492, 2: -0.13348860101031038, 3: 1.9681043689968933, 4: -1.7004428240831382, 5: 1.4580091413853749, 6: 0.06504113741068565, 7: -1.2168493676768384, 8: -0.3071304478616376, 9: 0.07121332925591593}, 'x1_2': {0: -2.4207773498298844, 1: -1.0828751040719462, 2: 2.73533787008624, 3: 1.5979611987152071, 4: 0.08835542172064115, 5: 1.2209786277076156, 6: -0.44205979195950784, 7: -0.692872860268244, 8: 0.0375521181289943, 9: 0.4656030062266639}, 'x1_3': {0: -1.548320898226322, 1: 0.8457342014424675, 2: -0.21250514722879738, 3: 0.5292389938329516, 4: -2.593946520223666, 5: -0.6188958526077123, 6: 1.6949245117526974, 7: -1.0271341091035742, 8: 0.637561891142571, 9: -0.7717170035055559}, 'x2_1': {0: 0.3797245517345564, 1: -2.2364391598508835, 2: 0.6205947900678905, 3: 0.6623865847688559, 4: 1.562036259999875, 5: -0.13081282910947759, 6: 0.03914373833251773, 7: -0.995761652421108, 8: 1.0649494418154162, 9: 1.3744782478849122}, 'x2_2': {0: -0.5052556836786106, 1: 1.1464291788297152, 2: -0.5662380273138174, 3: 0.6875729143723538, 4: 0.04653136473130827, 5: -0.012885303852347407, 6: 1.5893672346098884, 7: 0.5464286050059511, 8: -0.10430829457707284, 9: -0.5441755265313813}, 'x2_3': {0: -0.9762973303149007, 1: -0.983731467806563, 2: 1.465827578266328, 3: 0.5325950414202745, 4: -1.4452121324204903, 5: 0.8148816373643869, 6: 0.470791989780882, 7: -0.17951636294180473, 8: 0.7351814781280054, 9: -0.28776723200679066}, 'x3_1': {0: 0.12751822396637064, 1: -0.21926633684030983, 2: 0.15758799357206943, 3: 0.5885412224632464, 4: 0.11916562911189271, 5: -1.6436210334529249, 6: -0.12444368631987467, 7: 1.4618564171802453, 8: 0.6847234328916137, 9: -0.23177118858569187}, 'x3_2': {0: -0.6452955690715819, 1: 1.052094761527654, 2: 0.20190339195326157, 3: 0.6839430295237913, 4: -0.2607691613858866, 5: 0.3315513026670213, 6: 0.015901139336566113, 7: 0.15243420084881903, 8: -0.7604225072161022, 9: -0.4387652927008854}, 'x3_3': {0: -1.067058994377549, 1: 0.8026914180717286, 2: -1.9868531745912268, 3: -0.5057770735303253, 4: -1.6589569342151713, 5: 0.358172252880764, 6: 1.9238983803281329, 7: 2.2518318810978246, 8: -1.2781475121874357, 9: -0.7103081175166167}})
Y = pd.DataFrame.from_dict({'CHOICE': {0: 1.0, 1: 1.0, 2: 2.0, 3: 2.0, 4: 3.0, 5: 2.0, 6: 1.0, 7: 1.0, 8: 2.0, 9: 2.0}})
Z = pd.DataFrame.from_dict({'z1': {0: 2.4196730570917233, 1: 2.4196730570917233, 2: 2.822802255159467, 3: 2.822802255159467, 4: 2.073171091633643, 5: 2.073171091633643, 6: 2.044165101485163, 7: 2.044165101485163, 8: 2.4001241292606275, 9: 2.4001241292606275}, 'z2': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0, 8: 0.0, 9: 0.0}, 'z3': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 2.0, 5: 2.0, 6: 2.0, 7: 2.0, 8: 3.0, 9: 3.0}})
id = pd.DataFrame.from_dict({'id_choice': {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0, 5: 6.0, 6: 7.0, 7: 8.0, 8: 9.0, 9: 10.0}, 'id_ind': {0: 1.0, 1: 1.0, 2: 2.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 4.0, 7: 4.0, 8: 5.0, 9: 5.0}} )
# Create a dataframe with all the data
data = pd.concat([id, X, Z, Y], axis=1)
print(data.head(4))
#    id_choice  id_ind      x1_1      x1_2      x1_3      x2_1      x2_2  \
# 0        1.0     1.0 -0.176621 -2.420777 -1.548321  0.379725 -0.505256
# 1        2.0     1.0  1.645852 -1.082875  0.845734 -2.236439  1.146429
# 2        3.0     2.0 -0.133489  2.735338 -0.212505  0.620595 -0.566238
# 3        4.0     2.0  1.968104  1.597961  0.529239  0.662387  0.687573
#
#        x2_3      x3_1      x3_2      x3_3        z1   z2   z3  CHOICE
# 0 -0.976297  0.127518 -0.645296 -1.067059  2.419673  0.0  1.0     1.0
# 1 -0.983731 -0.219266  1.052095  0.802691  2.419673  0.0  1.0     1.0
# 2  1.465828  0.157588  0.201903 -1.986853  2.822802  0.0  1.0     2.0
# 3  0.532595  0.588541  0.683943 -0.505777  2.822802  0.0  1.0     2.0
I want to perform two operations.
First, I want to convert the DataFrame data into a dictionary of DataFrames where the keys are the individual identifiers (in this particular case, numbers ranging from 1.0 to 5.0). I've done this below as suggested here. Unfortunately, I am getting a dictionary of plain lists of values and not a dictionary of DataFrames.
# Create a dictionary with the data for each individual
data_dict = data.set_index('id_ind').groupby('id_ind').apply(lambda x : x.to_numpy().tolist()).to_dict()
print(data_dict.keys())
# dict_keys([1.0, 2.0, 3.0, 4.0, 5.0])
print(data_dict[1.0])
#[[1.0, -0.1766214634108258, -2.4207773498298844, -1.548320898226322, 0.3797245517345564, -0.5052556836786106, -0.9762973303149007, 0.12751822396637064, -0.6452955690715819, -1.067058994377549, 2.4196730570917233, 0.0, 1.0, 1.0], [2.0, 1.645852185286492, -1.0828751040719462, 0.8457342014424675, -2.2364391598508835, 1.1464291788297152, -0.983731467806563, -0.21926633684030983, 1.052094761527654, 0.8026914180717286, 2.4196730570917233, 0.0, 1.0, 1.0]]
Second, I want to recover the original DataFrame data by reversing the previous operation. The naive approach is as follows; however, it does not, of course, produce the expected result.
# Naive approach
res = pd.DataFrame.from_dict(data_dict, orient='index')
print(res)
# 0 1
#1.0 [1.0, -0.1766214634108258, -2.4207773498298844... [2.0, 1.645852185286492, -1.0828751040719462, ...
#2.0 [3.0, -0.13348860101031038, 2.73533787008624, ... [4.0, 1.9681043689968933, 1.5979611987152071, ...
#3.0 [5.0, -1.7004428240831382, 0.08835542172064115... [6.0, 1.4580091413853749, 1.2209786277076156, ...
#4.0 [7.0, 0.06504113741068565, -0.4420597919595078... [8.0, -1.2168493676768384, -0.692872860268244,...
#5.0 [9.0, -0.3071304478616376, 0.0375521181289943,... [10.0, 0.07121332925591593, 0.4656030062266639...
This solution was inspired by @mozway's comments.
# Create a dictionary with the data for each individual
data_dict = dict(list(data.groupby('id_ind')))
# Convert the dictionary into a dataframe
res = pd.concat(data_dict, axis=0).reset_index(drop=True)
print(res.head(4))
#    id_choice  id_ind      x1_1      x1_2      x1_3      x2_1      x2_2  \
# 0        1.0     1.0 -0.176621 -2.420777 -1.548321  0.379725 -0.505256
# 1        2.0     1.0  1.645852 -1.082875  0.845734 -2.236439  1.146429
# 2        3.0     2.0 -0.133489  2.735338 -0.212505  0.620595 -0.566238
# 3        4.0     2.0  1.968104  1.597961  0.529239  0.662387  0.687573
#
#        x2_3      x3_1      x3_2      x3_3        z1   z2   z3  CHOICE
# 0 -0.976297  0.127518 -0.645296 -1.067059  2.419673  0.0  1.0     1.0
# 1 -0.983731 -0.219266  1.052095  0.802691  2.419673  0.0  1.0     1.0
# 2  1.465828  0.157588  0.201903 -1.986853  2.822802  0.0  1.0     2.0
# 3  0.532595  0.588541  0.683943 -0.505777  2.822802  0.0  1.0     2.0
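As a sanity check, one can verify that the round trip reproduces the original frame; a small sketch assuming data, data_dict and res are the objects built above:
import pandas as pd

# the dictionary values are plain DataFrames here, one per individual
assert all(isinstance(v, pd.DataFrame) for v in data_dict.values())

# and the reassembled frame matches the original row for row
pd.testing.assert_frame_equal(data.reset_index(drop=True),
                              res.reset_index(drop=True))
print("round trip OK")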

Reshaping a pandas dataframe from binary columns into statistics

I have the following data frame
df = pd.DataFrame( {'Code Similarity & Clone Detection': {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 1.0}, 'Code Navigation & Understanding': {0: 0.0, 1: 0.0, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 1.0, 7: 1.0, 8: 1.0, 9: 0.0}, 'Security': {0: 1.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0}, 'ANN': {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.0, 4: 0.0, 5: 1.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0}, 'CNN': {0: 1.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0}, 'RNN': {0: 0.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0}, 'LSTM': {0: 0.0, 1: 1.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 0.0, 6: 1.0, 7: 0.0, 8: 1.0, 9: 1.0}} )
I want to convert this data frame into a new one with three columns: the first column, called "SE", contains the names of the first 4 columns in df; the second column, called 'DL', contains the names of the remaining columns; and the third column, called 'count', counts the occurrences for each SE and DL pair that occur together. The following figure (not reproduced here) shows the required shape.
Use:
#create MultiIndex by all combinations
mux = pd.MultiIndex.from_product([df.columns[:4], df.columns[4:]])
#repeat by first and second level with transpose
df1 = df.reindex(mux, axis=1, level=0).T
df2 = df.reindex(mux, axis=1, level=1).T
#sum together per columns, per MultiIndex
df=(df1.add(df2)
.sum(axis=1)
.sum(level=[0,1])
.astype(int)
.rename_axis(['SE','DL'])
.reset_index(name='count'))
print (df.head(10))
                                  SE                   DL  count
0  Code Similarity & Clone Detection                  ANN      5
1  Code Similarity & Clone Detection                  CNN      5
2  Code Similarity & Clone Detection                  RNN      3
3  Code Similarity & Clone Detection                 LSTM      7
4  Code Similarity & Clone Detection  attention mechanism      9
5  Code Similarity & Clone Detection          Autoencoder      7
6  Code Similarity & Clone Detection                  GNN      6
7  Code Similarity & Clone Detection             Other_DL      4
8    Code Navigation & Understanding                  ANN      8
9    Code Navigation & Understanding                  CNN      8
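Note that Series.sum(level=...) has since been removed from pandas; in recent versions the same aggregation can be written with an explicit groupby. A sketch, assuming the same df1 and df2 as above (the row order of the result may differ because groupby sorts its keys):
# modern-pandas equivalent of .sum(level=[0, 1]) in the snippet above
out = (df1.add(df2)
          .sum(axis=1)
          .groupby(level=[0, 1]).sum()
          .astype(int)
          .rename_axis(['SE', 'DL'])
          .reset_index(name='count'))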
EDIT: If you need to count only the rows where both the SE and the DL column equal 1, use:
# in real data change 3 to 4 to select the first 4 columns
mux = pd.MultiIndex.from_product([df.columns[:3], df.columns[3:]])
#repeat by first and second level with transpose
s1 = df.reindex(mux, axis=1, level=0).T.stack()
s2 = df.reindex(mux, axis=1, level=1).T.stack()
df = (s1[s1 == 1].eq(s2[s2 == 1]).sum(level=[0,1])
.rename_axis(['SE','DL'])
.sort_index(level=1)
.reset_index(name='count'))
print (df)
                                   SE    DL  count
0     Code Navigation & Understanding   ANN      2
1   Code Similarity & Clone Detection   ANN      0
2                            Security   ANN      2
3     Code Navigation & Understanding   CNN      0
4   Code Similarity & Clone Detection   CNN      0
5                            Security   CNN      3
6     Code Navigation & Understanding  LSTM      2
7   Code Similarity & Clone Detection  LSTM      1
8                            Security  LSTM      2
9     Code Navigation & Understanding   RNN      0
10  Code Similarity & Clone Detection   RNN      0
11                           Security   RNN      1
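An alternative sketch for the same "both equal 1" count is a plain matrix product between the SE block and the DL block, which yields the co-occurrence counts directly. It assumes the original df from the question with 3 SE columns (use 4 on the real data), since the snippets above rebind the name df:
# count rows where SE == 1 and DL == 1 for every SE/DL pair
co = df.iloc[:, :3].T.dot(df.iloc[:, 3:]).astype(int)
counts = (co.stack()
            .rename_axis(['SE', 'DL'])
            .reset_index(name='count'))
print(counts)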

plotnine/ggplot - changing legend positions

I have this dataframe:
df = pd.DataFrame({'ymin': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.511,
5: 0.571,
6: 0.5329999999999999,
7: 0.5389999999999999},
'ymax': {0: 0.511,
1: 0.571,
2: 0.533,
3: 0.539,
4: 1.0,
5: 1.0,
6: 1.0,
7: 1.0},
'xmin': {0: 0.0,
1: 0.14799999999999996,
2: 0.22400000000000003,
3: 0.5239999999999999,
4: 0.0,
5: 0.14799999999999996,
6: 0.22400000000000003,
7: 0.5239999999999999},
'xmax': {0: 0.148,
1: 0.22399999999999998,
2: 0.524,
3: 1.001,
4: 0.148,
5: 0.22399999999999998,
6: 0.524,
7: 1.001},
'variable': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'B', 5: 'B', 6: 'B', 7: 'B'}})
Which I plot like this:
(ggplot(df, aes(ymin = "ymin", ymax = "ymax",
xmin = "xmin", xmax = "xmax", fill = "variable"))
+ geom_rect(colour = "grey", alpha=0.7))
I'm looking to change the order of the legend entries to match the positions in the plot: blue on top and red at the bottom. Also, A should always be red and B always blue.
There might be a more standard way to do it, but here is a quick hack to fix your problem:
1. Change the order of your variable.
2. Assign the colors manually (you could also look up the exact color codes and use them instead of the color names, if that matters in your case).
df = df.assign(variable = pd.Categorical(df['variable'], ['B', 'A']))
(ggplot(df, aes(ymin = "ymin", ymax = "ymax",
xmin = "xmin", xmax = "xmax", fill = "variable"))+
geom_rect(colour = "grey", alpha=0.7)+
scale_fill_manual(values = ["blue", "red"]))
The output (image omitted) now shows the legend with B (blue) above A (red), matching the plot.
You could also set the order of the levels in R with df$variable <- factor(df$variable, levels = c("B", "A")).
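If the requirement is really that A is always red and B always blue regardless of legend order, tying each level to a fixed color with a mapping avoids depending on the category order. A sketch assuming the df defined above and a plotnine version that accepts a dict for values (otherwise pass a list ordered to match the levels):
from plotnine import ggplot, aes, geom_rect, scale_fill_manual

# map levels to colors explicitly so the assignment does not depend on category order
(ggplot(df, aes(ymin="ymin", ymax="ymax",
                xmin="xmin", xmax="xmax", fill="variable"))
 + geom_rect(colour="grey", alpha=0.7)
 + scale_fill_manual(values={"A": "red", "B": "blue"}))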

Pandas MultiIndex (more than 2 levels) DataFrame to Nested Dict/JSON

This question is similar to this one, but I want to take it a step further. Is it possible to extend the solution to work with more levels? Multilevel dataframes' .to_dict() method has some promising options, but most of them will return entries that are indexed by tuples (i.e. (A, 0, 0): 274.0) rather than nesting them in dictionaries.
For an example of what I'm looking to accomplish, consider this multiindex dataframe:
data = {0: {
('A', 0, 0): 274.0,
('A', 0, 1): 19.0,
('A', 1, 0): 67.0,
('A', 1, 1): 12.0,
('B', 0, 0): 83.0,
('B', 0, 1): 45.0
},
1: {
('A', 0, 0): 254.0,
('A', 0, 1): 11.0,
('A', 1, 0): 58.0,
('A', 1, 1): 11.0,
('B', 0, 0): 76.0,
('B', 0, 1): 56.0
}
}
df = pd.DataFrame(data).T
df.index = ['entry1', 'entry2']
df
# output:
             A                          B
             0            1             0
             0     1      0     1       0     1
entry1   274.0  19.0   67.0  12.0   83.0  45.0
entry2   254.0  11.0   58.0  11.0   76.0  56.0
You can imagine that we have many records here, not just two, and that the index names could be longer strings. How could you turn this into nested dictionaries (or directly to JSON) that look like this:
[
{'entry1': {'A': {0: {0: 274.0, 1: 19.0}, 1: {0: 67.0, 1: 12.0}},
'B': {0: {0: 83.0, 1: 45.0}}},
'entry2': {'A': {0: {0: 254.0, 1: 11.0}, 1: {0: 58.0, 1: 11.0}},
'B': {0: {0: 76.0, 1: 56.0}}}}
]
I'm thinking some amount of recursion could potentially be helpful, maybe something like this, but have so far been unsuccessful.
So, you really need to do 2 things here:
1. df.to_dict()
2. Convert this to a nested dictionary.
df.to_dict(orient='index') gives you a dictionary with the index as keys; it looks like this:
>>> df.to_dict(orient='index')
{'entry1': {('A', 0, 0): 274.0,
('A', 0, 1): 19.0,
('A', 1, 0): 67.0,
('A', 1, 1): 12.0,
('B', 0, 0): 83.0,
('B', 0, 1): 45.0},
'entry2': {('A', 0, 0): 254.0,
('A', 0, 1): 11.0,
('A', 1, 0): 58.0,
('A', 1, 1): 11.0,
('B', 0, 0): 76.0,
('B', 0, 1): 56.0}}
Now you need to nest this. Here's a trick from Martijn Pieters to do that:
def nest(d: dict) -> dict:
result = {}
for key, value in d.items():
target = result
for k in key[:-1]: # traverse all keys but the last
target = target.setdefault(k, {})
target[key[-1]] = value
return result
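A quick check of nest on a hand-built flat dict (values taken from the example above):
>>> nest({('A', 0, 0): 274.0, ('A', 0, 1): 19.0, ('B', 0, 0): 83.0})
{'A': {0: {0: 274.0, 1: 19.0}}, 'B': {0: {0: 83.0}}}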
Putting this all together:
def df_to_nested_dict(df: pd.DataFrame) -> dict:
d = df.to_dict(orient='index')
return {k: nest(v) for k, v in d.items()}
Output:
>>> df_to_nested_dict(df)
{'entry1': {'A': {0: {0: 274.0, 1: 19.0}, 1: {0: 67.0, 1: 12.0}},
'B': {0: {0: 83.0, 1: 45.0}}},
'entry2': {'A': {0: {0: 254.0, 1: 11.0}, 1: {0: 58.0, 1: 11.0}},
'B': {0: {0: 76.0, 1: 56.0}}}}
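Since the question also mentions going directly to JSON, a small sketch on top of the function above (json.dumps turns the integer keys into strings, as JSON requires):
import json

print(json.dumps(df_to_nested_dict(df), indent=2))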
I took the idea from the previous answer and slightly modified it.
1) Took the function nested_dict from Stack Overflow to create the dictionary:
from collections import defaultdict
def nested_dict(n, type):
    if n == 1:
        return defaultdict(type)
    else:
        return defaultdict(lambda: nested_dict(n-1, type))
2) Wrote the following function:
def df_to_nested_dict(df, type):
    # Get the number of index levels
    lvl = len(df.index.names)
    # Create the target dictionary
    new_nested_dict = nested_dict(lvl, type)
    # Convert the dataframe to a dictionary
    temp_dict = df.to_dict(orient='index')
    for x, y in temp_dict.items():
        dict_keys = ''
        # Process the individual items from the key
        for item in x:
            dkey = '[%r]' % (item,)  # repr() so both string and integer index labels work
            dict_keys = dict_keys + dkey
        # Create an assignment string and execute it
        dict_update = 'new_nested_dict%s = y' % dict_keys
        exec(dict_update)
    return new_nested_dict
It is the same idea, just done slightly differently.
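If the exec call is a concern, here is a minimal sketch of the same traversal performed directly on the nested defaultdict (it reuses nested_dict from above; the leaf still stores the whole row dict, and the function name is just illustrative):
def df_to_nested_dict_no_exec(df, leaf_type=dict):
    # one defaultdict level per index level
    out = nested_dict(df.index.nlevels, leaf_type)
    for key, row in df.to_dict(orient='index').items():
        target = out
        for k in key[:-1]:          # walk down all but the last index label
            target = target[k]
        target[key[-1]] = row       # store the row dict at the innermost key
    return out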

pandas: columns must be same length as key

I'm trying to re-format several columns into strings (they contain NaNs, so I can't just read them in as integers). All of the columns are currently float64, and I want to make it so they don't have decimals.
Here is the data:
{'crash_id': {0: 201226857.0,
1: 201226857.0,
2: 2012272611.0,
3: 2012272611.0,
4: 2012298998.0},
'driver_action1': {0: 1.0, 1: 1.0, 2: 29.0, 3: 1.0, 4: 3.0},
'driver_action2': {0: 99.0, 1: 99.0, 2: 1.0, 3: 99.0, 4: 99.0},
'driver_action3': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'driver_action4': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'harmful_event1': {0: 14.0, 1: 14.0, 2: 14.0, 3: 14.0, 4: 14.0},
'harmful_event2': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'harmful_event3': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'harmful_event4': {0: 99.0, 1: 99.0, 2: 99.0, 3: 99.0, 4: 99.0},
'most_damaged_area': {0: 14.0, 1: 2.0, 2: 14.0, 3: 14.0, 4: 3.0},
'most_harmful_event': {0: 14.0, 1: 14.0, 2: 14.0, 3: 14.0, 4: 14.0},
'point_of_impact': {0: 15.0, 1: 1.0, 2: 14.0, 3: 14.0, 4: 1.0},
'vehicle_id': {0: 20121.0, 1: 20122.0, 2: 20123.0, 3: 20124.0, 4: 20125.0},
'vehicle_maneuver': {0: 3.0, 1: 1.0, 2: 4.0, 3: 1.0, 4: 1.0}}
When I try to convert those columns to string, this is what happens:
>> df[['crash_id','vehicle_id','point_of_impact','most_damaged_area','most_harmful_event','vehicle_maneuver','harmful_event1','harmful_event2','harmful_event3','harmful_event4','driver_action1','driver_action2','driver_action3','driver_action4']] = df[['crash_id','vehicle_id','point_of_impact','most_damaged_area','most_harmful_event','vehicle_maneuver','harmful_event1','harmful_event2','harmful_event3','harmful_event4','driver_action1','driver_action2','driver_action3','driver_action4']].applymap(lambda x: '{:.0f}'.format(x))
File "C:\Users\<name>\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2376, in _setitem_array
raise ValueError('Columns must be same length as key')
ValueError: Columns must be same length as key
I've never seen this error before and feel like this is something simple...what am I doing wrong?
Your code runs for me with the dictionary you provided. Try creating a function to deal with the NaN cases separately; I think they are causing your issues.
Something basic like below:
def formatter(x):
    # pd.isna() catches both None and float NaN, which `x == None` would miss
    if pd.isna(x):
        return None
    else:
        return '{:.0f}'.format(x)
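A usage sketch with a tiny subset of the question's data (the frame and column list here are purely illustrative, and it relies on the formatter defined above):
import pandas as pd

# small illustrative frame with one missing value
df = pd.DataFrame({'crash_id': [201226857.0, 2012272611.0, None],
                   'vehicle_id': [20121.0, 20123.0, 20125.0]})

cols = ['crash_id', 'vehicle_id']
df[cols] = df[cols].applymap(formatter)   # on pandas >= 2.1, DataFrame.map does the same
print(df)
# the missing value stays as None, the rest become integer-looking strings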
