Converting to Pandas MultiIndex

I have a dataframe of the form:
  SpeciesName  0
0 A            [[Year: 1, Quantity: 2], [Year: 3, Quantity: 4...]]
1 B            [[Year: 1, Quantity: 7], [Year: 2, Quantity: 15...]]
2 C            [[Year: 2, Quantity: 9], [Year: 4, Quantity: 13...]]
I'm trying to create a MultiIndex that uses SpeciesName and Year as the index levels:
SpeciesName  Year
A            1     Data
             2     Data
B            1     Data
             2     Data
I have not been able to get pandas.MultiIndex(..) to work and my attempts at iterating through the dataset and manually creating a new object have not been very fruitful. Any insights would be greatly appreciated!

I'm going to assume your data is a list of dictionaries, because otherwise what you've written only makes sense as strings, and I don't want to parse strings.
import pandas as pd
df = pd.DataFrame([
    ['A', [dict(Year=1, Quantity=2), dict(Year=3, Quantity=4)]],
    ['B', [dict(Year=1, Quantity=7), dict(Year=2, Quantity=15)]],
    ['C', [dict(Year=2, Quantity=9), dict(Year=4, Quantity=13)]]
], columns=['SpeciesName', 0])
df
  SpeciesName                                                           0
0           A   [{'Year': 1, 'Quantity': 2}, {'Year': 3, 'Quantity': 4}]
1           B  [{'Year': 1, 'Quantity': 7}, {'Year': 2, 'Quantity': 15}]
2           C  [{'Year': 2, 'Quantity': 9}, {'Year': 4, 'Quantity': 13}]
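Then the solution is to pair every dictionary with its species name, and pass the dictionaries as the records and the names as the index:
# zip(*pairs) splits the [dict, name] pairs into (dicts, names),
# which from_records receives as (data, index)
pd.DataFrame.from_records(
    *zip(*(
        [d, s]
        for s, l in zip(
            df['SpeciesName'], df[0].values.tolist())
        for d in l
    ))
).set_index('Year', append=True)
        Quantity
  Year
A 1            2
  3            4
B 1            7
  2           15
C 2            9
  4           13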
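For readers who find the nested zip hard to follow, here is a minimal sketch of the same idea written as a plain comprehension. It assumes, like the answer above, that column 0 holds lists of dictionaries; the names `records` and `result` are only illustrative. Unlike the answer's version, it also names the first index level SpeciesName.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["A", [dict(Year=1, Quantity=2), dict(Year=3, Quantity=4)]],
        ["B", [dict(Year=1, Quantity=7), dict(Year=2, Quantity=15)]],
        ["C", [dict(Year=2, Quantity=9), dict(Year=4, Quantity=13)]],
    ],
    columns=["SpeciesName", 0],
)

# Build one flat record per (species, yearly entry) pair.
records = [
    {"SpeciesName": species, **entry}
    for species, entries in zip(df["SpeciesName"], df[0])
    for entry in entries
]

# Both SpeciesName and Year become levels of the MultiIndex.
result = pd.DataFrame(records).set_index(["SpeciesName", "Year"])
print(result)
```

A single row can then be looked up with a tuple key, for example `result.loc[("B", 2)]`.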

Related

Pandas: How to group by column values when column values are dicts?

I am doing an exercise in which the current requirement is to "Find the top 10 major project themes (using column 'mjtheme_namecode')".
My first thought was to do a groupby, then count and sort the groups.
However, the values in this column are lists of dicts, e.g.
[{'code': '1', 'name': 'Economic management'},
 {'code': '6', 'name': 'Social protection and risk management'}]
and I can't (apparently) group these, at least not with groupby. I get an error:
TypeError: unhashable type: 'list'
Is there a trick? I'm guessing something along the lines of this question.
(I can group by another column that has string values and matches 1:1 with this column, but the exercise is specific.)
df.head()

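There are two steps to solve your problem (using pandas 0.25 or later, which provides DataFrame.explode):
Flatten the lists of dicts
Transform the dicts into columns
Step 1
df = df.explode('mjtheme_namecode').reset_index(drop=True)
Step 2
df = df.join(pd.DataFrame(df['mjtheme_namecode'].values.tolist()))
Added: if the dicts are nested, you can try using json_normalize:
from pandas import json_normalize  # pandas >= 1.0; on 0.25 use: from pandas.io.json import json_normalize
df = df.join(json_normalize(df['mjtheme_namecode'].values.tolist()))
The only issue here is that DataFrame.explode repeats the values of all other columns for every exploded row (in case that is an issue). It also repeats the original index labels, which is why the index is reset before the join.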
Using sample data:
import pandas as pd
x = [
    [1, 2, [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]],
    [1, 3, [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]]
]
df = pd.DataFrame(x, columns=['col1', 'col2', 'col3'])
Out[1]:
   col1  col2                                  col3
0     1     2  [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
1     1     3  [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]
## Step 1 (reset the index so that every exploded row gets a unique label)
df = df.explode('col3').reset_index(drop=True)
Out[2]:
   col1  col2              col3
0     1     2  {'a': 1, 'b': 3}
1     1     2  {'a': 2, 'b': 4}
2     1     3  {'a': 5, 'b': 6}
3     1     3  {'a': 7, 'b': 8}
## Step 2
df = df.join(pd.DataFrame(df['col3'].values.tolist()))
Out[3]:
   col1  col2              col3  a  b
0     1     2  {'a': 1, 'b': 3}  1  3
1     1     2  {'a': 2, 'b': 4}  2  4
2     1     3  {'a': 5, 'b': 6}  5  6
3     1     3  {'a': 7, 'b': 8}  7  8
## Now you can group with the new variables
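As a rough sketch of that last grouping step applied to the original question, the snippet below counts themes by name and keeps the ten most frequent. The `projects` frame and its `project` column are made-up sample data, not the exercise's real dataset; only the `mjtheme_namecode` column name and the dict keys come from the question.

```python
import pandas as pd

# Hypothetical sample data shaped like the column described in the question.
projects = pd.DataFrame({
    "project": ["P1", "P2", "P3"],
    "mjtheme_namecode": [
        [{"code": "1", "name": "Economic management"},
         {"code": "6", "name": "Social protection and risk management"}],
        [{"code": "6", "name": "Social protection and risk management"}],
        [{"code": "1", "name": "Economic management"},
         {"code": "6", "name": "Social protection and risk management"}],
    ],
})

# Step 1: one row per dict, with a fresh unique index.
flat = projects.explode("mjtheme_namecode").reset_index(drop=True)

# Step 2: turn the dicts into 'code' and 'name' columns.
flat = flat.join(pd.DataFrame(flat["mjtheme_namecode"].tolist()))

# Group on the new hashable column and keep the ten largest groups.
top_themes = (
    flat.groupby("name").size().sort_values(ascending=False).head(10)
)
print(top_themes)
```

Grouping works here because the `name` column holds plain strings, which are hashable, whereas the original lists of dicts are not.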

Pandas dataframe to duplicated matrix in sum of quantities

import pandas as pd
data = {0: {'ID': 'A', 'Qty': 1, 'Type': 'SVGA'},
        1: {'ID': 'B', 'Qty': 2, 'Type': 'SVGA'},
        2: {'ID': 'B', 'Qty': 2, 'Type': 'XGA'},
        3: {'ID': 'C', 'Qty': 3, 'Type': 'XGA'},
        4: {'ID': 'D', 'Qty': 4, 'Type': 'XGA'},
        5: {'ID': 'A', 'Qty': 1, 'Type': 'LED'},
        6: {'ID': 'C', 'Qty': 3, 'Type': 'LED'}}
df = pd.DataFrame.from_dict(data, orient='index')
Is it possible to transform this dataframe into a duplicated matrix of summed quantities?
Expected output:
      LED  SVGA  XGA
LED     4     1    3
SVGA    1     3    2
XGA     3     2    9

It seems like the key here is the "ID" column, because the value for each Type-Type cell is computed with respect to whether these Types coexist for the same ID.
So, start with a self-merge on "ID". You can then pivot your result to get your matrix.
merge + crosstab
v = df.merge(df[['ID', 'Type']], on='ID')
pd.crosstab(v.Type_x, v.Type_y, v.Qty, aggfunc='sum')
Type_y  LED  SVGA  XGA
Type_x
LED       4     1    3
SVGA      1     3    2
XGA       3     2    9
merge + pivot_table
df.merge(df[['ID', 'Type']], on='ID').pivot_table(
    index='Type_x', columns='Type_y', values='Qty', aggfunc='sum'
)
Type_y  LED  SVGA  XGA
Type_x
LED       4     1    3
SVGA      1     3    2
XGA       3     2    9
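To see why the self-merge gives the expected matrix, it may help to inspect the intermediate frame. The sketch below reuses the question's data; the variable names `pairs` and `matrix` are only illustrative.

```python
import pandas as pd

data = {0: {'ID': 'A', 'Qty': 1, 'Type': 'SVGA'},
        1: {'ID': 'B', 'Qty': 2, 'Type': 'SVGA'},
        2: {'ID': 'B', 'Qty': 2, 'Type': 'XGA'},
        3: {'ID': 'C', 'Qty': 3, 'Type': 'XGA'},
        4: {'ID': 'D', 'Qty': 4, 'Type': 'XGA'},
        5: {'ID': 'A', 'Qty': 1, 'Type': 'LED'},
        6: {'ID': 'C', 'Qty': 3, 'Type': 'LED'}}
df = pd.DataFrame.from_dict(data, orient='index')

# Each row is paired with every Type that shares its ID (including itself).
# The clashing 'Type' columns get the default suffixes: Type_x and Type_y.
pairs = df.merge(df[['ID', 'Type']], on='ID')
print(pairs)

# Summing Qty per (Type_x, Type_y) pair fills the Type-by-Type matrix.
matrix = pairs.pivot_table(
    index='Type_x', columns='Type_y', values='Qty', aggfunc='sum'
)
print(matrix)
```

IDs A, B and C each carry two types and contribute four pairs apiece, while D contributes one, so `pairs` has 13 rows. The diagonal cells are simply the per-Type totals of Qty.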

Keeping additional column when normalizing list of dicts

I have a dataframe containing id and list of dicts:
df = pd.DataFrame({
    'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
                      [{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
    'id': [100, 200]
})
and I want to normalize it like this:
    id   a   b
0  100   1   2
0  100  11  22
1  200   3   4
1  200  33  44
This gets most of the way:
pd.concat([
    pd.DataFrame.from_dict(item)
    for item in df.list_of_dicts
])
but is missing the id column.
I'm most interested in readability.

How about something like this:
import pandas as pd
d = {
    'list_of_dicts': [[{'a': 1, 'b': 2}, {'a': 11, 'b': 22}],
                      [{'a': 3, 'b': 4}, {'a': 33, 'b': 44}]],
    'id': [100, 200]
}
df = pd.DataFrame([pd.Series(x) for ld in d['list_of_dicts'] for x in ld])
# Repeat each id once per dict in its list, then flatten
ids = [[x] * len(l) for l, x in zip(d['list_of_dicts'], d['id'])]
df['id'] = pd.Series([x for l in ids for x in l])
EDIT - Here's a simpler version
# zip pairs each id with its own list of dicts, rather than with every list
t = [[('id', i)] + list(l.items()) for i, ll in zip(d['id'], d['list_of_dicts']) for l in ll]
df = pd.DataFrame([dict(x) for x in t])
On Python 3.7+ dicts keep their insertion order, so the id column already comes first. On older versions, if you really want the id column first, you can change dict to OrderedDict from the collections module.

This is what I call an incomprehension
pd.DataFrame(
    *list(map(list, zip(
        *[(d, i) for i, l in zip(df.id, df.list_of_dicts) for d in l]
    )))
).rename_axis('id').reset_index()
    id   a   b
0  100   1   2
1  100  11  22
2  200   3   4
3  200  33  44
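Since the question asks for readability, here is one possible alternative sketched with DataFrame.explode. It assumes pandas 0.25 or later, and the names `flat`, `expanded` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "list_of_dicts": [[{"a": 1, "b": 2}, {"a": 11, "b": 22}],
                      [{"a": 3, "b": 4}, {"a": 33, "b": 44}]],
    "id": [100, 200],
})

# One row per dict; the id is repeated alongside each of its dicts.
flat = df.explode("list_of_dicts").reset_index(drop=True)

# Expand the dicts into columns; rows line up with 'flat' by position.
expanded = pd.DataFrame(flat["list_of_dicts"].tolist())

# Put id first, followed by the expanded columns.
result = pd.concat([flat[["id"]], expanded], axis=1)
print(result)
```

Resetting the index after the explode matters: without it, the repeated index labels would no longer match the fresh 0..n-1 index of `expanded`.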

Python sum values of list of dictionaries if two other key value pairs match

I have a list of dictionaries of the following form:
lst = [{"Name":'Nick','Hour':0,'Value':2.75},
{"Name":'Sam','Hour':1,'Value':7.0},
{"Name":'Nick','Hour':0,'Value':2.21},
{'Name':'Val',"Hour":1,'Value':10.1},
{'Name':'Nick','Hour':1,'Value':2.1},
{'Name':'Val',"Hour":1,'Value':11},]
I want to be able to sum all values for a name for a particular hour, e.g. if Name == Nick and Hour == 0, I want Value to give me the sum of all values meeting that condition. For the list above, that is 2.75 + 2.21.
I have already tried the following, but it doesn't handle both conditions.
import collections
finalList = collections.defaultdict(float)
for info in lst:
    finalList[info['Name']] += info['Value']
finalList = [{'Name': c, 'Value': finalList[c]} for c in finalList]
This sums up all the values for a particular Name, without checking whether the Hour is the same. How can I incorporate that condition into my code as well?
My expected output:
finalList = [{"Name":'Nick','Hour':0,'Value':4.96},
{"Name":'Sam','Hour':1,'Value':7.0},
{'Name':'Val',"Hour":1,'Value':21.1},
{'Name':'Nick','Hour':1,'Value':2.1}...]

Consider using the pandas module; it is very convenient for data sets like this:
import pandas as pd
In [109]: lst
Out[109]:
[{'Hour': 0, 'Name': 'Nick', 'Value': 2.75},
{'Hour': 1, 'Name': 'Sam', 'Value': 7.0},
{'Hour': 0, 'Name': 'Nick', 'Value': 2.21},
{'Hour': 1, 'Name': 'Val', 'Value': 10.1},
{'Hour': 1, 'Name': 'Nick', 'Value': 2.1}]
In [110]: df = pd.DataFrame(lst)
In [111]: df
Out[111]:
   Hour  Name  Value
0     0  Nick   2.75
1     1   Sam   7.00
2     0  Nick   2.21
3     1   Val  10.10
4     1  Nick   2.10
In [123]: df.groupby(['Name','Hour']).sum().reset_index()
Out[123]:
   Name  Hour  Value
0  Nick     0   4.96
1  Nick     1   2.10
2   Sam     1   7.00
3   Val     1  10.10
export it to CSV:
df.groupby(['Name','Hour']).sum().reset_index().to_csv('/path/to/file.csv', index=False)
result:
Name,Hour,Value
Nick,0,4.96
Nick,1,2.1
Sam,1,7.0
Val,1,10.1
If you want to have it as a list of dictionaries:
In [125]: df.groupby(['Name','Hour']).sum().reset_index().to_dict('records')
Out[125]:
[{'Hour': 0, 'Name': 'Nick', 'Value': 4.96},
{'Hour': 1, 'Name': 'Nick', 'Value': 2.1},
{'Hour': 1, 'Name': 'Sam', 'Value': 7.0},
{'Hour': 1, 'Name': 'Val', 'Value': 10.1}]
you can do many fancy things using pandas:
In [112]: df.loc[(df.Name == 'Nick') & (df.Hour == 0), 'Value'].sum()
Out[112]: 4.96
In [121]: df.groupby('Name')['Value'].agg(['sum','mean'])
Out[121]:
        sum       mean
Name
Nick   7.06   2.353333
Sam    7.00   7.000000
Val   10.10  10.100000

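[{'Name': name, 'Hour': hour,
  'Value': sum(d['Value'] for d in lst if d['Name'] == name and d['Hour'] == hour)}
 for hour in hours for name in names]
If you don't already have all names and hours in lists (or sets), you can get them like so:
names = {d['Name'] for d in lst}
hours = {d['Hour'] for d in lst}
Note that this builds an entry for every combination of name and hour, so combinations that never occur in lst (such as Sam at hour 0) appear with a Value of 0.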

You can use any hashable object as a key in a Python dictionary, so just use a tuple containing Name and Hour as the key:
from collections import defaultdict
d = defaultdict(float)
for item in lst:
    d[(item['Name'], item['Hour'])] += item['Value']
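To get from the tuple-keyed totals back to the list-of-dicts shape the question asks for, something like the following sketch could work. It uses the list from the question; the names `totals` and `final_list` are illustrative.

```python
from collections import defaultdict

lst = [{"Name": "Nick", "Hour": 0, "Value": 2.75},
       {"Name": "Sam", "Hour": 1, "Value": 7.0},
       {"Name": "Nick", "Hour": 0, "Value": 2.21},
       {"Name": "Val", "Hour": 1, "Value": 10.1},
       {"Name": "Nick", "Hour": 1, "Value": 2.1},
       {"Name": "Val", "Hour": 1, "Value": 11}]

# Accumulate one running total per (Name, Hour) pair.
totals = defaultdict(float)
for item in lst:
    totals[(item["Name"], item["Hour"])] += item["Value"]

# Unpack each tuple key back into the original dictionary layout.
final_list = [
    {"Name": name, "Hour": hour, "Value": value}
    for (name, hour), value in totals.items()
]
print(final_list)
```

Only combinations that actually occur in `lst` end up in the result, and they appear in order of first occurrence. The sums are ordinary floating-point additions, so they may differ from the decimal values by a tiny rounding error.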

Reorganizing the data in a dataframe

I have data in the following format:
data = [
    {'data1': [{'sub_data1': 0}, {'sub_data2': 4}, {'sub_data3': 1}, {'sub_data4': -5}]},
    {'data2': [{'sub_data1': 1}, {'sub_data2': 1}, {'sub_data3': 1}, {'sub_data4': 12}]},
    {'data3': [{'sub_data1': 3}, {'sub_data2': 0}, {'sub_data3': 1}, {'sub_data4': 7}]},
]
How should I reorganize it so that when I save it to HDF with
a = pd.DataFrame(data, columns=['data1', 'data2', 'data3'])
a.to_hdf('my_data.hdf', key='data')  # to_hdf needs a key; 'data' is an arbitrary choice
I get a dataframe in the following format:
            data1   data2   data3
_________________________________________
sub_data1     0       1       3
sub_data2     4       1       0
sub_data3     1       1       1
sub_data4    -5      12       7
Update 1: after following the advice given below, saving the result to an HDF file and reading it back, I got this, which is not what I want:
                data1               data2               data3
0  {u'sub_data1': 22}  {u'sub_data1': 33}  {u'sub_data1': 44}
1   {u'sub_data2': 0}  {u'sub_data2': 11}  {u'sub_data2': 44}
2  {u'sub_data3': 12}  {u'sub_data3': 16}  {u'sub_data3': 19}
3   {u'sub_data4': 0}   {u'sub_data4': 0}   {u'sub_data4': 0}

If you convert your data into a dictionary of dictionaries, you can then create the DataFrame very easily (on Python 2, use iteritems() instead of items()):
In [25]: data2 = {k: {m: n for i in v for m, n in i.items()} for x in data for k, v in x.items()}
In [26]: data2
Out[26]:
{'data1': {'sub_data1': 0, 'sub_data2': 4, 'sub_data3': 1, 'sub_data4': -5},
 'data2': {'sub_data1': 1, 'sub_data2': 1, 'sub_data3': 1, 'sub_data4': 12},
 'data3': {'sub_data1': 3, 'sub_data2': 0, 'sub_data3': 1, 'sub_data4': 7}}
In [27]: pd.DataFrame(data2)
Out[27]:
           data1  data2  data3
sub_data1      0      1      3
sub_data2      4      1      0
sub_data3      1      1      1
sub_data4     -5     12      7
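The nested comprehension can be hard to read, so here is the same reshaping written as explicit loops. This is only a sketch: the names `nested`, `merged` and `frame` are illustrative, and it leaves out the HDF step, which needs the optional PyTables dependency.

```python
import pandas as pd

data = [
    {'data1': [{'sub_data1': 0}, {'sub_data2': 4}, {'sub_data3': 1}, {'sub_data4': -5}]},
    {'data2': [{'sub_data1': 1}, {'sub_data2': 1}, {'sub_data3': 1}, {'sub_data4': 12}]},
    {'data3': [{'sub_data1': 3}, {'sub_data2': 0}, {'sub_data3': 1}, {'sub_data4': 7}]},
]

nested = {}
for entry in data:
    for column_name, sub_items in entry.items():
        # Merge the single-key dicts into one dict per column.
        merged = {}
        for sub in sub_items:
            merged.update(sub)
        nested[column_name] = merged

# Outer keys become columns, inner keys become the row index.
frame = pd.DataFrame(nested)
print(frame)
```

Because the frame now holds plain integers rather than dict objects, writing it out and reading it back should no longer produce cells that contain dictionaries.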
