Converting to Pandas MultiIndex - python

I have a dataframe of the form:
SpeciesName 0
0 A [[Year: 1, Quantity: 2],[Year: 3, Quantity: 4...]]
1 B [[Year: 1, Quantity: 7],[Year: 2, Quantity: 15...]]
2 C [[Year: 2, Quantity: 9],[Year: 4, Quantity: 13...]]
I'm trying to create a MultiIndex that uses the SpeciesName and the Year as the index:
SpeciesName  Year
A            1     Data
             2     Data
B            1     Data
             2     Data
I have not been able to get pandas.MultiIndex(..) to work and my attempts at iterating through the dataset and manually creating a new object have not been very fruitful. Any insights would be greatly appreciated!

I'm going to assume your data is a list of dictionaries, because otherwise what you've written only makes sense if the values are strings, and I don't want to parse strings:
df = pd.DataFrame([
['A', [dict(Year=1, Quantity=2), dict(Year=3, Quantity=4)]],
['B', [dict(Year=1, Quantity=7), dict(Year=2, Quantity=15)]],
['C', [dict(Year=2, Quantity=9), dict(Year=4, Quantity=13)]]
], columns=['SpeciesName', 0])
df
SpeciesName 0
0 A [{'Year': 1, 'Quantity': 2}, {'Year': 3, 'Quantity': 4}]
1 B [{'Year': 1, 'Quantity': 7}, {'Year': 2, 'Quantity': 15}]
2 C [{'Year': 2, 'Quantity': 9}, {'Year': 4, 'Quantity': 13}]
Then the solution is obvious
pd.DataFrame.from_records(
    *zip(*(
        [d, s]
        for s, l in zip(
            df['SpeciesName'], df[0].values.tolist())
        for d in l
    ))
).set_index('Year', append=True)
        Quantity
  Year
A 1            2
  3            4
B 1            7
  2           15
C 2            9
  4           13
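If the one-liner above is hard to follow, the same result can be built in two explicit steps. This is only a sketch under the same assumption (each cell of column 0 is a list of dicts with 'Year' and 'Quantity' keys); the names records and result are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame([
    ['A', [dict(Year=1, Quantity=2), dict(Year=3, Quantity=4)]],
    ['B', [dict(Year=1, Quantity=7), dict(Year=2, Quantity=15)]],
    ['C', [dict(Year=2, Quantity=9), dict(Year=4, Quantity=13)]],
], columns=['SpeciesName', 0])

# Step 1: one flat record per (species, yearly entry) pair.
records = [
    {'SpeciesName': species, **entry}
    for species, entries in zip(df['SpeciesName'], df[0])
    for entry in entries
]

# Step 2: both keys become levels of the resulting MultiIndex.
result = pd.DataFrame(records).set_index(['SpeciesName', 'Year'])
print(result)
```

Unlike the one-liner, this names the first index level (SpeciesName), which makes later lookups such as result.loc[('B', 2)] easier to read.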

Related

Pandas: How to group by column values when column values are dicts?

I am doing an exercise in which the current requirement is to "Find the top 10 major project themes (using column 'mjtheme_namecode')".
My first thought was to do a groupby, then count and sort the groups.
However, the values in this column are lists of dicts, e.g.
[{'code': '1', 'name': 'Economic management'},
{'code': '6', 'name': 'Social protection and risk management'}]
and I can't (apparently) group these, at least not with groupby. I get an error:
TypeError: unhashable type: 'list'
Is there a trick? I'm guessing something along the lines of this question.
(I can group by another column that has string values and matches 1:1 with this column, but the exercise is specific.)
df.head()
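Using pandas >= 0.25 (the release that added DataFrame.explode), there are two steps to solve your problem:
Flatten the lists of dicts into one dict per row
Transform the dicts into columns
Step 1
df = df.explode('mjtheme_namecode').reset_index(drop=True)
Step 2
df = df.join(pd.DataFrame(df['mjtheme_namecode'].tolist()))
The reset_index(drop=True) in step 1 matters: explode repeats the original index label for every element, and join aligns on the index, so without a fresh index the new columns would be matched to the wrong rows.
Added: if the dicts are nested, you can try json_normalize instead (pd.json_normalize in pandas >= 1.0; older versions import it from pandas.io.json):
df = df.join(pd.json_normalize(df['mjtheme_namecode'].tolist()))
The only caveat is that explode repeats the values of all other columns for every exploded row (in case that is an issue).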
Using sample data:
x = [
[1,2,[{'a':1, 'b':3},{'a':2, 'b':4}]],
[1,3,[{'a':5, 'b':6},{'a':7, 'b':8}]]
]
df = pd.DataFrame(x, columns=['col1','col2','col3'])
Out[1]:
col1 col2 col3
0 1 2 [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
1 1 3 [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]
## Step 1
df = df.explode('col3').reset_index(drop=True)
Out[2]:
   col1  col2              col3
0     1     2  {'a': 1, 'b': 3}
1     1     2  {'a': 2, 'b': 4}
2     1     3  {'a': 5, 'b': 6}
3     1     3  {'a': 7, 'b': 8}
## Step 2
df = df.join(pd.DataFrame(df['col3'].tolist()))
Out[3]:
   col1  col2              col3  a  b
0     1     2  {'a': 1, 'b': 3}  1  3
1     1     2  {'a': 2, 'b': 4}  2  4
2     1     3  {'a': 5, 'b': 6}  5  6
3     1     3  {'a': 7, 'b': 8}  7  8
## Now you can group with the new variables
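As a rough sketch of that final grouping step for the original question, the snippet below counts how often each theme name occurs and keeps the ten most frequent. The three-row frame is a hypothetical stand-in for the real 'mjtheme_namecode' column, and the names themes and top10 are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the 'mjtheme_namecode' column from the question.
df = pd.DataFrame({
    'mjtheme_namecode': [
        [{'code': '1', 'name': 'Economic management'},
         {'code': '6', 'name': 'Social protection and risk management'}],
        [{'code': '6', 'name': 'Social protection and risk management'}],
        [{'code': '1', 'name': 'Economic management'},
         {'code': '6', 'name': 'Social protection and risk management'}],
    ]
})

# One dict per row, with a fresh 0..n-1 index.
exploded = df['mjtheme_namecode'].explode().reset_index(drop=True)

# Expand the dicts into 'code' and 'name' columns.
themes = pd.DataFrame(exploded.tolist())

# Count rows per theme name and keep the ten most frequent.
top10 = themes.groupby('name').size().sort_values(ascending=False).head(10)
print(top10)
```

Grouping on 'code' instead of 'name' works the same way and may be more robust if some names are missing in the real data.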

Pandas dataframe to duplicated matrix in sum of quantities

import pandas as pd
data = {0: {'ID': 'A', 'Qty': 1, 'Type': 'SVGA'},
1: {'ID': 'B', 'Qty': 2, 'Type': 'SVGA'},
2: {'ID': 'B', 'Qty': 2, 'Type': 'XGA'},
3: {'ID': 'C', 'Qty': 3, 'Type': 'XGA'},
4: {'ID': 'D', 'Qty': 4, 'Type': 'XGA'},
5: {'ID': 'A', 'Qty': 1, 'Type': 'LED'},
6: {'ID': 'C', 'Qty': 3, 'Type': 'LED'}}
df = pd.DataFrame.from_dict(data, orient='index')
Is it possible to transform this dataframe into a Type-by-Type matrix of summed quantities?
Expected output:
      LED  SVGA  XGA
LED     4     1    3
SVGA    1     3    2
XGA     3     2    9
It seems like the key here is the "ID" column, because the value for each Type-Type cell is computed with respect to whether these Types coexist for the same ID.
So, start with a self-merge on "ID". You can then pivot your result to get your matrix.
merge + crosstab
v = df.merge(df[['ID', 'Type']], on='ID')
pd.crosstab(v.Type_x, v.Type_y, v.Qty, aggfunc='sum')
Type_y  LED  SVGA  XGA
Type_x
LED       4     1    3
SVGA      1     3    2
XGA       3     2    9
merge + pivot_table
df.merge(df[['ID', 'Type']], on='ID').pivot_table(
index='Type_x', columns='Type_y', values='Qty', aggfunc='sum'
)
Type_y  LED  SVGA  XGA
Type_x
LED       4     1    3
SVGA      1     3    2
XGA       3     2    9
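To see what the self-merge contributes, it may help to run the two steps separately and inspect the intermediate frame. The sketch below reuses the question's data; the names pairs and matrix are illustrative only.

```python
import pandas as pd

data = {0: {'ID': 'A', 'Qty': 1, 'Type': 'SVGA'},
        1: {'ID': 'B', 'Qty': 2, 'Type': 'SVGA'},
        2: {'ID': 'B', 'Qty': 2, 'Type': 'XGA'},
        3: {'ID': 'C', 'Qty': 3, 'Type': 'XGA'},
        4: {'ID': 'D', 'Qty': 4, 'Type': 'XGA'},
        5: {'ID': 'A', 'Qty': 1, 'Type': 'LED'},
        6: {'ID': 'C', 'Qty': 3, 'Type': 'LED'}}
df = pd.DataFrame.from_dict(data, orient='index')

# Pair every row with every Type that shares its ID (including its own Type).
pairs = df.merge(df[['ID', 'Type']], on='ID')
print(pairs[pairs['ID'] == 'A'])

# Sum Qty over each (Type_x, Type_y) combination.
matrix = pairs.pivot_table(index='Type_x', columns='Type_y',
                           values='Qty', aggfunc='sum')
print(matrix)
```

For ID 'A' the merge yields four rows (SVGA/SVGA, SVGA/LED, LED/SVGA, LED/LED), which is why both the diagonal and the off-diagonal cells receive that ID's quantity.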

Keeping additional column when normalizing list of dicts

I have a dataframe containing id and list of dicts:
df = pd.DataFrame({
'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
[{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
'id': [100, 200]
})
and I want to normalize it like this:
    id   a   b
0  100   1   2
0  100  11  22
1  200   3   4
1  200  33  44
This gets most of the way:
pd.concat([
pd.DataFrame.from_dict(item)
for item in df.list_of_dicts
])
but is missing the id column.
I'm most interested in readability.
How about something like this:
d = {
'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
[{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
'id': [100, 200]
}
df = pd.DataFrame([pd.Series(x) for ld in d['list_of_dicts'] for x in ld])
id = [[x]*len(l) for l,x in zip(d['list_of_dicts'],d['id'])]
df['id'] = pd.Series([x for l in id for x in l])
EDIT - Here's a simpler version
t = [[('id', i)] + list(l.items()) for i, ll in zip(d['id'], d['list_of_dicts']) for l in ll]
df = pd.DataFrame([dict(x) for x in t])
And, if you really want the id column first on a Python version older than 3.7 (where plain dicts do not guarantee insertion order), you can change dict to OrderedDict from the collections module.
This is what I call an incomprehension
pd.DataFrame(
*list(map(list, zip(
*[(d, i) for i, l in zip(df.id, df.list_of_dicts) for d in l]
)))
).rename_axis('id').reset_index()
id a b
0 100 1 2
1 100 11 22
2 200 3 4
3 200 33 44
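Since the question asks for readability, here is a sketch of a step-by-step variant. It assumes pandas >= 0.25 for DataFrame.explode; the names exploded, values and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
                      [{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
    'id': [100, 200],
})

# One dict per row; 'id' is repeated for each dict from the same list.
exploded = df.explode('list_of_dicts').reset_index(drop=True)

# Expand the dicts into columns and place them next to 'id'.
values = pd.DataFrame(exploded['list_of_dicts'].tolist())
result = pd.concat([exploded[['id']], values], axis=1)
print(result)
```

Resetting the index before the concat keeps the two frames aligned row by row, because explode would otherwise leave duplicate index labels.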

Python sum values of list of dictionaries if two other key value pairs match

I have a list of dictionaries of the following form:
lst = [{"Name":'Nick','Hour':0,'Value':2.75},
{"Name":'Sam','Hour':1,'Value':7.0},
{"Name":'Nick','Hour':0,'Value':2.21},
{'Name':'Val',"Hour":1,'Value':10.1},
{'Name':'Nick','Hour':1,'Value':2.1},
{'Name':'Val',"Hour":1,'Value':11},]
I want to be able to sum all values for a name for a particular hour, e.g. if Name == Nick and Hour == 0, I want value to give me the sum of all values meeting the condition. 2.75 + 2.21, according to the piece above.
I have already tried the following but it doesn't help me out with both conditions.
finalList = collections.defaultdict(float)
for info in lst:
    finalList[info['Name']] += info['Value']
finalList = [{'Name': c, 'Value': finalList[c]} for c in finalList]
This sums up all the values for a particular Name, not checking if the Hour was the same. How can I incorporate that condition into my code as well?
My expected output :
finalList = [{"Name":'Nick','Hour':0,'Value':4.96},
{"Name":'Sam','Hour':1,'Value':7.0},
{'Name':'Val',"Hour":1,'Value':21.1},
{'Name':'Nick','Hour':1,'Value':2.1}...]
Consider using the pandas module; it is very convenient for data sets like this:
import pandas as pd
In [109]: lst
Out[109]:
[{'Hour': 0, 'Name': 'Nick', 'Value': 2.75},
{'Hour': 1, 'Name': 'Sam', 'Value': 7.0},
{'Hour': 0, 'Name': 'Nick', 'Value': 2.21},
{'Hour': 1, 'Name': 'Val', 'Value': 10.1},
{'Hour': 1, 'Name': 'Nick', 'Value': 2.1}]
In [110]: df = pd.DataFrame(lst)
In [111]: df
Out[111]:
Hour Name Value
0 0 Nick 2.75
1 1 Sam 7.00
2 0 Nick 2.21
3 1 Val 10.10
4 1 Nick 2.10
In [123]: df.groupby(['Name','Hour']).sum().reset_index()
Out[123]:
Name Hour Value
0 Nick 0 4.96
1 Nick 1 2.10
2 Sam 1 7.00
3 Val 1 10.10
export it to CSV:
df.groupby(['Name','Hour']).sum().reset_index().to_csv('/path/to/file.csv', index=False)
result:
Name,Hour,Value
Nick,0,4.96
Nick,1,2.1
Sam,1,7.0
Val,1,10.1
if you want to have it as a list of dictionaries:
In [125]: df.groupby(['Name','Hour']).sum().reset_index().to_dict('records')
Out[125]:
[{'Hour': 0, 'Name': 'Nick', 'Value': 4.96},
{'Hour': 1, 'Name': 'Nick', 'Value': 2.1},
{'Hour': 1, 'Name': 'Sam', 'Value': 7.0},
{'Hour': 1, 'Name': 'Val', 'Value': 10.1}]
you can do many fancy things using pandas:
In [112]: df.loc[(df.Name == 'Nick') & (df.Hour == 0), 'Value'].sum()
Out[112]: 4.96
In [121]: df.groupby('Name')['Value'].agg(['sum','mean'])
Out[121]:
sum mean
Name
Nick 7.06 2.353333
Sam 7.00 7.000000
Val 10.10 10.100000
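Alternatively, a plain list comprehension works:
[{'Name': name, 'Hour': hour, 'Value': sum(d['Value'] for d in lst if d['Name'] == name and d['Hour'] == hour)} for hour in hours for name in names]
Note that this produces an entry for every combination of name and hour, with a Value of 0 for combinations that never occur in lst.
If you don't already have all names and hours in lists (or sets), you can get them like so:
names = {d['Name'] for d in lst}
hours = {d['Hour'] for d in lst}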
You can use any (hashable) object as a key for a Python dictionary, so just use a tuple containing Name and Hour as the key:
from collections import defaultdict
d = defaultdict(float)
for item in lst:
    d[(item['Name'], item['Hour'])] += item['Value']
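To get from the tuple-keyed dictionary back to the list-of-dicts shape shown in the expected output, one more comprehension is enough. This is a sketch using the question's data; totals and final_list are illustrative names.

```python
from collections import defaultdict

lst = [{'Name': 'Nick', 'Hour': 0, 'Value': 2.75},
       {'Name': 'Sam', 'Hour': 1, 'Value': 7.0},
       {'Name': 'Nick', 'Hour': 0, 'Value': 2.21},
       {'Name': 'Val', 'Hour': 1, 'Value': 10.1},
       {'Name': 'Nick', 'Hour': 1, 'Value': 2.1},
       {'Name': 'Val', 'Hour': 1, 'Value': 11}]

# Accumulate values per (Name, Hour) pair.
totals = defaultdict(float)
for item in lst:
    totals[(item['Name'], item['Hour'])] += item['Value']

# Unpack each tuple key back into separate 'Name' and 'Hour' fields.
final_list = [
    {'Name': name, 'Hour': hour, 'Value': value}
    for (name, hour), value in totals.items()
]
print(final_list)
```

Because the values are floats, the sums can carry small rounding errors; round them when printing if exact two-decimal output matters.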

Reorganizing the data in a dataframe

I have data in the following format:
data = [
{'data1': [{'sub_data1': 0}, {'sub_data2': 4}, {'sub_data3': 1}, {'sub_data4': -5}]},
{'data2': [{'sub_data1': 1}, {'sub_data2': 1}, {'sub_data3': 1}, {'sub_data4': 12}]},
{'data3': [{'sub_data1': 3}, {'sub_data2': 0}, {'sub_data3': 1}, {'sub_data4': 7}]},
]
How should I reorganize it so that when I save it to HDF with
a = pd.DataFrame(data, columns=map(lambda x: x.name, ['data1', 'data2', 'data3']))
a.to_hdf('my_data.hdf')
I get a dataframe in the following format:
           data1  data2  data3
sub_data1      0      1      3
sub_data2      4      1      0
sub_data3      1      1      1
sub_data4     -5     12      7
update1: after following the advice given below, saving the data to an HDF file, and reading it back, I got this, which is not what I want:
data1 data2 data3
0 {u'sub_data1': 22} {u'sub_data1': 33} {u'sub_data1': 44}
1 {u'sub_data2': 0} {u'sub_data2': 11} {u'sub_data2': 44}
2 {u'sub_data3': 12} {u'sub_data3': 16} {u'sub_data3': 19}
3 {u'sub_data4': 0} {u'sub_data4': 0} {u'sub_data4': 0}
If you convert your data into a dictionary of dictionaries, you can then create the DataFrame very easily (the original answer used Python 2's iteritems(); items() works in both Python 2 and 3):
In [25]: data2 = {k: {m: n for i in v for m, n in i.items()} for x in data for k, v in x.items()}
In [26]: data2
Out[26]:
{'data1': {'sub_data1': 0, 'sub_data2': 4, 'sub_data3': 1, 'sub_data4': -5},
'data2': {'sub_data1': 1, 'sub_data2': 1, 'sub_data3': 1, 'sub_data4': 12},
'data3': {'sub_data1': 3, 'sub_data2': 0, 'sub_data3': 1, 'sub_data4': 7}}
In [27]: pd.DataFrame(data2)
Out[27]:
data1 data2 data3
sub_data1 0 1 3
sub_data2 4 1 0
sub_data3 1 1 1
sub_data4 -5 12 7
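The nested comprehension above is compact but dense. The sketch below spells out the same reshaping as explicit loops; merged and frame are illustrative names, and the code assumes each outer dict has a single key and each inner dict a single sub_data entry, as in the question.

```python
import pandas as pd

data = [
    {'data1': [{'sub_data1': 0}, {'sub_data2': 4}, {'sub_data3': 1}, {'sub_data4': -5}]},
    {'data2': [{'sub_data1': 1}, {'sub_data2': 1}, {'sub_data3': 1}, {'sub_data4': 12}]},
    {'data3': [{'sub_data1': 3}, {'sub_data2': 0}, {'sub_data3': 1}, {'sub_data4': 7}]},
]

merged = {}
for entry in data:                       # each entry holds one column, e.g. 'data1'
    for column, sub_dicts in entry.items():
        merged[column] = {}
        for sub in sub_dicts:            # each sub dict holds one 'sub_dataN' value
            merged[column].update(sub)

# Outer keys become columns, inner keys become the row index.
frame = pd.DataFrame(merged)
print(frame)
```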
