Aggregate each column in a dataframe based on custom functions in Python

Here is my dataframe:
df = [{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 10, 'rate':0},
{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 20, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 5, 'grape': 30, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 10, 'grape': 40, 'rate':0}]
I would like to group by id and name so that apple = apple.sum(), grape = grape.sum() and rate = apple / grape * 100:
   id   name  apple  grape  rate
0   1    bob     90     30   300
1   2  smith     15     70  21.4
I have attempted this with the following:
import pandas as pd
df = pd.DataFrame(df)
def cal_rate(rate):
    return df['apple'] / df['grape']*100
agg_funcs = {'apple':'sum',
             'grape':'sum',
             'rate' : cal_rate}
df=df.groupby(['id','name']).agg(agg_funcs).reset_index()
But got this result:
   id   name  apple  grape  rate
0   1    bob     90     30   105
1   2  smith     15     70   105
Can you help me out? Thanks in advance.

Here you go:
import pandas as pd
df = [{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 10, 'rate':0},
{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 20, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 5, 'grape': 30, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 10, 'grape': 40, 'rate':0}]
df = pd.DataFrame(df)
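def cal_rate(group):
    # group is the 'rate' column of one (id, name) group;
    # its index is used to look up the full rows of that group
    frame = df.loc[group.index]
    return frame['apple'].sum() / frame['grape'].sum() * 100
agg_funcs = {'apple':'sum',
             'grape':'sum',
             'rate' : cal_rate}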
df=df.groupby(['id','name']).agg(agg_funcs).reset_index()
print(df)
Output:
   id   name  apple  grape   rate
0   1    bob     90     30  300.0
1   2  smith     15     70   21.4

You can also do it this way, summing first and computing the rate from the totals:
df = df.groupby(['id', 'name']).agg({'apple':'sum', 'grape':'sum'}).reset_index()
df['rate'] = (df['apple'] / df['grape']) *100

Just another way to do this:
import pandas as pd
df = [{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 10, 'rate':0},
{'id': 1, 'name': 'bob', 'apple': 45, 'grape': 20, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 5, 'grape': 30, 'rate':0},
{'id': 2, 'name': 'smith', 'apple': 10, 'grape': 40, 'rate':0}]
df = pd.DataFrame(df)
df=df.groupby(['id','name']).sum().reset_index()
df['rate']=round((df['apple'] / df['grape'])*100,1)
print(df)
Output:
   id   name  apple  grape   rate
0   1    bob     90     30  300.0
1   2  smith     15     70   21.4
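The answers above all come down to the same idea: the rate is a ratio of group totals, so it is simplest to sum first and divide afterwards. A minimal sketch of that idea as a reusable function follows. The name summarize is hypothetical, and the rate column from the question is left out of the input because it is recomputed anyway.

```python
import pandas as pd

rows = [
    {'id': 1, 'name': 'bob', 'apple': 45, 'grape': 10},
    {'id': 1, 'name': 'bob', 'apple': 45, 'grape': 20},
    {'id': 2, 'name': 'smith', 'apple': 5, 'grape': 30},
    {'id': 2, 'name': 'smith', 'apple': 10, 'grape': 40},
]
df = pd.DataFrame(rows)


def summarize(frame):
    """Sum apple and grape per (id, name), then derive rate from the sums."""
    totals = frame.groupby(['id', 'name'], as_index=False)[['apple', 'grape']].sum()
    # rate depends on two aggregated columns, so it is computed after the groupby
    return totals.assign(rate=lambda t: (t['apple'] / t['grape'] * 100).round(1))


result = summarize(df)
print(result)
```

Computing the rate after aggregation avoids reaching back into the original dataframe from inside an aggregation function.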

Related

Finding average in pandas from a value in a dictionary?

I am experimenting on a project with a large dataset of movies. I have a large data frame, with one column named "Genres" and one named "Vote Average". My goal is to find the 20 highest rated genres based on "Vote Average".
I would use a group by but I can't seem to figure it out because the genre information looks like this in the column "Genres" :
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]
How can I extract Comedy, Drama and Romance from the list above?
How can I group by individual genres while assigning each row's "Vote Average" to each genre, so I can print the top 20 rated genres in the data frame?
Genres Vote Average
1 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... 7.7
2 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... 6.9
3 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... 6.5
4 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... 6.1
5 [{'id': 35, 'name': 'Comedy'}] 5.7
... ... ...
32255 [{'id': 878, 'name': 'Science Fiction'}] 3.5
32256 [{'id': 18, 'name': 'Drama'}, {'id': 28, 'name... 5.7
32257 [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam... 3.8
32258 [] 0.0
32259 [] 0.0
EDIT: Example from Data frame is above. movies_metadata.csv from https://www.kaggle.com/rounakbanik/the-movies-dataset
EDIT:
Now that I have seen all the information on Kaggle, I think it may need a totally different method, because these genres are assigned to titles and they can't be in separate rows.
OLD:
First you have to convert it to a correct DataFrame with genres in separate rows instead of
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, ...]
Here is my example data
import pandas as pd
df = pd.DataFrame([
{'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
{'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
{'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])
print(df)
result:
Genre Vote Average
0 [{'id': 16, 'name': 'Animation'}] 7.7
1 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... 6.1
2 [{'id': 10749, 'name': 'Romance'}] 6.5
You can iterate over every row and use pd.DataFrame(row['Genre']) to create a small dataframe per row, then combine the pieces into one new dataframe (DataFrame.append was removed in pandas 2.0, so the pieces are collected in a list and joined with pd.concat):
frames = []
for index, row in df.iterrows():
    temp_df = pd.DataFrame(row['Genre'])
    temp_df['Vote Average'] = row['Vote Average']
    frames.append(temp_df)
new_df = pd.concat(frames)
print(new_df)
result:
      id       name  Vote Average
0     16  Animation           7.7
0     35     Comedy           6.1
1     18      Drama           6.1
2  10749    Romance           6.1
0  10749    Romance           6.5
and now you can do whatever you like.
Another method to correct the data:
First convert the list of dictionaries to separate rows, one dictionary per row
new_df = df.explode('Genre')
print(new_df)
result:
Genre Vote Average
0 {'id': 16, 'name': 'Animation'} 7.7
1 {'id': 35, 'name': 'Comedy'} 6.1
1 {'id': 18, 'name': 'Drama'} 6.1
1 {'id': 10749, 'name': 'Romance'} 6.1
2 {'id': 10749, 'name': 'Romance'} 6.5
and later convert every dictionary to columns
new_df['id'] = new_df['Genre'].str['id']
new_df['name'] = new_df['Genre'].str['name']
print(new_df)
result:
Genre Vote Average id name
0 {'id': 16, 'name': 'Animation'} 7.7 16 Animation
1 {'id': 35, 'name': 'Comedy'} 6.1 35 Comedy
1 {'id': 18, 'name': 'Drama'} 6.1 18 Drama
1 {'id': 10749, 'name': 'Romance'} 6.1 10749 Romance
2 {'id': 10749, 'name': 'Romance'} 6.5 10749 Romance
or using
new_df[['id','name']] = new_df['Genre'].apply(pd.Series)
Full example
import pandas as pd
df = pd.DataFrame([
{'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
{'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
{'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
])
print('--- df ---')
print(df)
print('--- iterrows ---')
frames = []
for index, row in df.iterrows():
    temp_df = pd.DataFrame(row['Genre'])
    temp_df['Vote Average'] = row['Vote Average']
    frames.append(temp_df)
new_df = pd.concat(frames)  # DataFrame.append was removed in pandas 2.0
print(new_df)
print('--- explode #1 ---')
new_df = df.explode('Genre')
print(new_df)
print('--- columns #1 ---')
new_df['id'] = new_df['Genre'].str['id']
new_df['name'] = new_df['Genre'].str['name']
new_df.drop('Genre', inplace=True, axis=1)
new_df.reset_index(inplace=True)
print(new_df)
print('--- explode #2 ---')
new_df = df.explode('Genre')
print(new_df)
print('--- columns #2 ---')
new_df[['id','name']] = new_df['Genre'].apply(pd.Series)
new_df.drop('Genre', inplace=True, axis=1)
new_df.reset_index(inplace=True)
print(new_df)
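Once the genres are in separate rows, the original goal (the highest rated genres) is a groupby away. The following is a rough sketch on the small example data, not on the real Kaggle file. It assumes the Genre column already holds Python lists of dictionaries; if the CSV stores them as strings, they would first need to be parsed. The function name top_genres and the extra empty-list row are illustrative additions.

```python
import pandas as pd

df = pd.DataFrame([
    {'Genre': [{'id': 16, 'name': 'Animation'}], 'Vote Average': 7.7},
    {'Genre': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'},
               {'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.1},
    {'Genre': [{'id': 10749, 'name': 'Romance'}], 'Vote Average': 6.5},
    {'Genre': [], 'Vote Average': 0.0},  # movie without genres (illustrative)
])


def top_genres(frame, n=20):
    """Return the n genres with the highest mean 'Vote Average'."""
    # explode() turns an empty list into NaN, so those rows are dropped
    exploded = frame.explode('Genre').dropna(subset=['Genre'])
    exploded = exploded.assign(name=exploded['Genre'].map(lambda genre: genre['name']))
    return exploded.groupby('name')['Vote Average'].mean().nlargest(n)


top = top_genres(df, n=2)
print(top)
```

A movie with several genres contributes its vote average to each of them, which is one possible reading of "assigning the rows to each genre".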

How to create a nested json with key names in python?

I have the following data in a pandas dataframe. I want to group the data based on month, then type.
     month  hour     Type  count
0        4     0     Bike      8
1        4     0  Pedelec     16
2        4     1     Bike      9
3        4     1  Pedelec      4
4        4     2     Bike     18
..     ...   ...      ...    ...
412     12    21  Pedelec     15
413     12    22     Bike      7
414     12    22  Pedelec     10
415     12    23     Bike      2
416     12    23  Pedelec     15
I want to convert this to a nested json with field names. The code I use to create a dictionary is this:
jsonfile=barchart.groupby(['month','Type'])[['hour','count']].apply(lambda x: x.to_dict('records')).reset_index(name='data').groupby('month')[['Type','data']].apply(lambda x: x.set_index('Type')['data'].to_dict()).reset_index(name='data').groupby('month')['data'].apply(list).to_dict()
The output I get is in this format:
[{'month': 4,
'values': [{'Bike': [{'hour': 0, 'count': 8},
{'hour': 1, 'count': 9},
{'hour': 2, 'count': 18},
{'hour': 3, 'count': 2},
{'hour': 4, 'count': 2},
...
{'hour': 23, 'count': 14}],
'Pedelec': [{'hour': 0, 'count': 16},
{'hour': 1, 'count': 4},
{'hour': 2, 'count': 12},
...
{'hour': 23, 'count': 27}]}]},
Expected output:
[{'month': 4,
'values': [{'Type': 'Bike': [{'hour': 0, 'count': 8},
{'hour': 1, 'count': 9},
I used the following to create my desired format:
jsonfile=barchart.groupby(['month','Type'])[['hour','count']].apply(lambda x: x.to_dict('records')).reset_index(name='data').groupby('month')[['Type','data']].apply(lambda x: x.set_index('Type')['data'].to_dict()).reset_index(name='data').groupby('month')['data'].apply(list).to_dict()
json_arr=[]
for month,values in jsonfile.items():
    arr=[]
    for value in values:
        # each value maps a type name to its list of {'hour', 'count'} records
        for types, val in value.items():
            arr.append({"type": types, "values": val})
    json_arr.append({"month": month, "values": arr} )
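The same nested structure can probably be built more directly with two nested groupby loops instead of the long chained expression. This is only a sketch on a few made-up rows in the shape of the question's barchart dataframe; the function name to_nested is hypothetical. Values are converted to plain int so that json.dumps accepts them.

```python
import json

import pandas as pd

barchart = pd.DataFrame({
    'month': [4, 4, 4, 4, 12, 12],
    'hour': [0, 0, 1, 1, 23, 23],
    'Type': ['Bike', 'Pedelec', 'Bike', 'Pedelec', 'Bike', 'Pedelec'],
    'count': [8, 16, 9, 4, 2, 15],
})


def to_nested(frame):
    """Group by month, then by Type, keeping hour/count records per type."""
    result = []
    for month, month_rows in frame.groupby('month'):
        values = []
        for type_name, type_rows in month_rows.groupby('Type'):
            records = [{'hour': int(h), 'count': int(c)}
                       for h, c in zip(type_rows['hour'], type_rows['count'])]
            values.append({'type': type_name, 'values': records})
        result.append({'month': int(month), 'values': values})
    return result


nested = to_nested(barchart)
print(json.dumps(nested, indent=2))
```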

python DataFrame split dict column to multiple columns

The column looks this way:
0 [{'id': 18, 'name': 'Drama'}, {'id': 10769, 'n...
1 [{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...
2 [{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...
3 [{'id': 18, 'name': 'Drama'}]
4 [{'id': 99, 'name': 'Documentary'}]
5 [{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...
6 [{'id': 10749, 'name': 'Romance'}, {'id': 18, ...
I wish to see ID columns with bool value for each genre:
index  id=18  id=10769  id=35  id=27  ...
0          1         1      0      0  ...
1          1         0      0      0  ...
2          0         0      1      1  ...
3          1         0      0      0  ...
...
Use a list comprehension with flattening and then the DataFrame constructor:
df = pd.DataFrame({'col':[[{'id': 18, 'name': 'Drama'}, {'id': 10769}],
[{'id': 99, 'name': 'Documentary'}]]})
print (type(df.loc[0, 'col']))
<class 'list'>
df = pd.DataFrame([y for x in df['col'] for y in x])
print (df)
id name
0 18 Drama
1 10769 NaN
2 99 Documentary
#alternative
#df = pd.concat([pd.DataFrame(x) for x in df['col']], ignore_index=True)
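The flattening above gives one row per dictionary, not the 0/1 column per genre id that the question asks for. One possible way to get there, sketched on the same kind of sample data, is to keep track of the source row and cross-tabulate rows against ids. The function name genre_flags and the third sample row are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col': [
    [{'id': 18, 'name': 'Drama'}, {'id': 10769}],
    [{'id': 99, 'name': 'Documentary'}],
    [{'id': 18, 'name': 'Drama'}],
]})


def genre_flags(frame, column='col'):
    """Return one 0/1 column per genre id, one row per original row."""
    # one (row label, id) pair per dictionary in the lists
    pairs = [(row, item['id'])
             for row, items in frame[column].items()
             for item in items]
    long_df = pd.DataFrame(pairs, columns=['row', 'id'])
    # crosstab counts occurrences; clip turns counts into 0/1 flags
    flags = pd.crosstab(long_df['row'], long_df['id']).clip(upper=1)
    # rows whose list was empty are missing from the crosstab; add them as zeros
    return flags.reindex(frame.index, fill_value=0).add_prefix('id=')


flags = genre_flags(df)
print(flags)
```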

List of dictionaries - stack one value of dictionary

I am having trouble adding up one value of a dictionary when certain conditions are met. For example, I have this list of dictionaries:
[{'plu': 1, 'price': 150, 'quantity': 2, 'stock': 5},
{'plu': 2, 'price': 150, 'quantity': 7, 'stock': 10},
{'plu': 1, 'price': 150, 'quantity': 6, 'stock': 5},
{'plu': 1, 'price': 200, 'quantity': 4, 'stock': 5},
{'plu': 2, 'price': 150, 'quantity': 3, 'stock': 10}
]
Then output should look like this:
[{'plu': 1, 'price': 150, 'quantity': 8, 'stock': 5},
{'plu': 1, 'price': 200, 'quantity': 4, 'stock': 5},
{'plu': 2, 'price': 150, 'quantity': 10, 'stock': 10}
]
Quantity should be added only if plu and price are the same, it should ignore key:values other than that (ex. stock). What is the most efficient way to do that?
#edit
I tried:
import itertools as it
keyfunc = lambda x: x['plu']
groups = it.groupby(sorted(new_data, key=keyfunc), keyfunc)
x = [{'plu': k, 'quantity': sum(x['quantity'] for x in g)} for k, g in groups]
But it works only on plu, and then I get only the quantity value when making an HTML table in Django; the others are empty.
You need to sort/groupby the combined key, not just one key. Easiest/most efficient way to do this is with operator.itemgetter. To preserve an arbitrary stock value, you'll need to use the group twice, so you'll need to convert it to a sequence:
from operator import itemgetter
keyfunc = itemgetter('plu', 'price')
# Unpack key and listify g so it can be reused
groups = ((plu, price, list(g))
          for (plu, price), g in it.groupby(sorted(new_data, key=keyfunc), keyfunc))
x = [{'plu': plu, 'price': price, 'stock': g[0]['stock'],
      'quantity': sum(x['quantity'] for x in g)}
     for plu, price, g in groups]
Alternatively, if stock is guaranteed to be the same for each unique plu/price pair, you can include it in the key to simplify matters, so you don't need to listify the groups:
keyfunc = itemgetter('plu', 'price', 'stock')
groups = it.groupby(sorted(new_data, key=keyfunc), keyfunc)
x = [{'plu': plu, 'price': price, 'stock': stock,
      'quantity': sum(x['quantity'] for x in g)}
     for (plu, price, stock), g in groups]
Optionally, you could create getquantity = itemgetter('quantity') at top level (like the keyfunc) and change sum(x['quantity'] for x in g) to sum(map(getquantity, g)) which pushes work to the C layer in CPython, and can be faster if your groups are large.
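Put together, the getquantity variant described above might look like the following self-contained sketch. It uses the sample data from the question and assumes, as in the simplified version, that stock is the same for every plu/price pair.

```python
import itertools as it
from operator import itemgetter

new_data = [
    {'plu': 1, 'price': 150, 'quantity': 2, 'stock': 5},
    {'plu': 2, 'price': 150, 'quantity': 7, 'stock': 10},
    {'plu': 1, 'price': 150, 'quantity': 6, 'stock': 5},
    {'plu': 1, 'price': 200, 'quantity': 4, 'stock': 5},
    {'plu': 2, 'price': 150, 'quantity': 3, 'stock': 10},
]

keyfunc = itemgetter('plu', 'price', 'stock')
getquantity = itemgetter('quantity')

# groupby only merges adjacent items, so the data is sorted by the same key first
groups = it.groupby(sorted(new_data, key=keyfunc), keyfunc)
x = [{'plu': plu, 'price': price, 'stock': stock,
      'quantity': sum(map(getquantity, g))}
     for (plu, price, stock), g in groups]
print(x)
```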
The other approach is to avoid sorting entirely using collections.Counter (or collections.defaultdict(int), though Counter makes the intent more clear here):
from collections import Counter
grouped = Counter()
for plu, price, stock, quantity in map(itemgetter('plu', 'price', 'stock', 'quantity'), new_data):
    grouped[plu, price, stock] += quantity
then convert back to your preferred form with:
x = [{'plu': plu, 'price': price, 'stock': stock, 'quantity': quantity}
for (plu, price, stock), quantity in grouped.items()]
This should be faster for large inputs, since it replaces the O(n log n) sorting work with n dict operations that each cost roughly O(1), for O(n) in total.
Using pandas will make this a trivial problem:
import pandas as pd
data = [{'plu': 1, 'price': 150, 'quantity': 2, 'stock': 5},
{'plu': 2, 'price': 150, 'quantity': 7, 'stock': 10},
{'plu': 1, 'price': 150, 'quantity': 6, 'stock': 5},
{'plu': 1, 'price': 200, 'quantity': 4, 'stock': 5},
{'plu': 2, 'price': 150, 'quantity': 3, 'stock': 10}]
df = pd.DataFrame.from_records(data)
# df
#
# plu price quantity stock
# 0 1 150 2 5
# 1 2 150 7 10
# 2 1 150 6 5
# 3 1 200 4 5
# 4 2 150 3 10
new_df = df.groupby(['plu','price','stock'], as_index=False).sum()
new_df = new_df[['plu','price','quantity','stock']] # Optional: reorder the columns
# new_df
#
# plu price quantity stock
# 0 1 150 8 5
# 1 1 200 4 5
# 2 2 150 10 10
And finally, if you want to, port it back to a list of dicts (though I would argue pandas gives you a lot more functionality to handle the data elements):
new_data = new_df.to_dict(orient='records')
# new_data
#
# [{'plu': 1, 'price': 150, 'quantity': 8, 'stock': 5},
# {'plu': 1, 'price': 200, 'quantity': 4, 'stock': 5},
# {'plu': 2, 'price': 150, 'quantity': 10, 'stock': 10}]

Convert dataframe to dictionary in Python

I have a csv file that I converted into dataframe using Pandas. Here's the dataframe:
Customer  ProductID  Count
John              1     50
John              2     45
Mary              1     75
Mary              2     10
Mary              5     15
I need an output in the form of a dictionary that looks like this:
{ProductID:1, Count:{John:50, Mary:75}},
{ProductID:2, Count:{John:45, Mary:10}},
{ProductID:5, Count:{John:0, Mary:15}}
I read the following answers:
python pandas dataframe to dictionary
and
Convert dataframe to dictionary
This is the code that I'm having:
df = pd.read_csv('customer.csv')
dict1 = df.set_index('Customer').T.to_dict('dict')
dict2 = df.to_dict(orient='records')
and this is my current output:
dict1 = {'John': {'Count': 45, 'ProductID': 2}, 'Mary': {'Count': 15, 'ProductID': 5}}
dict2 = [{'Count': 50, 'Customer': 'John', 'ProductID': 1},
{'Count': 45, 'Customer': 'John', 'ProductID': 2},
{'Count': 75, 'Customer': 'Mary', 'ProductID': 1},
{'Count': 10, 'Customer': 'Mary', 'ProductID': 2},
{'Count': 15, 'Customer': 'Mary', 'ProductID': 5}]
IIUC you can use:
d = (df.groupby('ProductID')
       .apply(lambda x: dict(zip(x.Customer, x.Count)))
       .reset_index(name='Count')
       .to_dict(orient='records'))
print (d)
[{'ProductID': 1, 'Count': {'John': 50, 'Mary': 75}},
{'ProductID': 2, 'Count': {'John': 45, 'Mary': 10}},
{'ProductID': 5, 'Count': {'Mary': 15}}]
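Note that the output above has no entry for John under ProductID 5, while the question asks for John: 0 there. If the zero entries matter, one option is to pivot first so that every customer appears for every product. This is a sketch under that assumption; the function name counts_by_product is hypothetical, and the dataframe is built inline instead of being read from customer.csv.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer': ['John', 'John', 'Mary', 'Mary', 'Mary'],
    'ProductID': [1, 2, 1, 2, 5],
    'Count': [50, 45, 75, 10, 15],
})


def counts_by_product(frame):
    """Build one record per ProductID with a count for every customer."""
    # fill_value=0 supplies the missing customer/product combinations
    wide = frame.pivot_table(index='ProductID', columns='Customer',
                             values='Count', aggfunc='sum', fill_value=0)
    return [{'ProductID': int(product_id),
             'Count': {name: int(count) for name, count in row.items()}}
            for product_id, row in wide.iterrows()]


records = counts_by_product(df)
print(records)
```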
