Pandas: How to create a multi-indexed pivot - python

I have a set of experiments defined by two variables: scenario and height. For each experiment, I take 3 measurements: result 1, 2 and 3.
The dataframe that collects all the results looks like this:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Scenario'] = np.repeat(['Scenario a', 'Scenario b', 'Scenario c'], 3)
df['height'] = np.tile([0, 1, 2], 3)
df['Result 1'] = np.arange(1, 10)
df['Result 2'] = np.arange(20, 29)
df['Result 3'] = np.arange(30, 39)
If I run the following:
mypiv = df.pivot(index='Scenario', columns='height').transpose()
writer = pd.ExcelWriter('test_df_pivot.xlsx')
mypiv.to_excel(writer, 'test df pivot')
writer.close()
I obtain a dataframe where columns are the scenarios, and the rows have a multi-index defined by result and height:
+----------+--------+------------+------------+------------+
| | height | Scenario a | Scenario b | Scenario c |
+----------+--------+------------+------------+------------+
| Result 1 | 0 | 1 | 4 | 7 |
| | 1 | 2 | 5 | 8 |
| | 2 | 3 | 6 | 9 |
| Result 2 | 0 | 20 | 23 | 26 |
| | 1 | 21 | 24 | 27 |
| | 2 | 22 | 25 | 28 |
| Result 3 | 0 | 30 | 33 | 36 |
| | 1 | 31 | 34 | 37 |
| | 2 | 32 | 35 | 38 |
+----------+--------+------------+------------+------------+
How can I create a pivot where the indices are swapped, i.e. height first, then result?
I couldn't find a way to create it directly. I managed to get what I want by swapping the levels and then re-sorting the result:
mypiv2 = mypiv.swaplevel(0, 1, axis=0).sort_index(level=0, axis=0, sort_remaining=True)
but I was wondering if there is a more direct way.

You can first set_index, then stack, and then unstack the Scenario level:
print(df.set_index(['height', 'Scenario']).stack().unstack(level=1))
Scenario Scenario a Scenario b Scenario c
height
0 Result 1 1 4 7
Result 2 20 23 26
Result 3 30 33 36
1 Result 1 2 5 8
Result 2 21 24 27
Result 3 31 34 37
2 Result 1 3 6 9
Result 2 22 25 28
Result 3 32 35 38
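For reference, the same height-first layout can also be reached by melting to long form and pivoting once; a minimal sketch, assuming the df defined in the question (melt's default value column name, 'value', is relied on here):
out = (df.melt(id_vars=['Scenario', 'height'], var_name='Result')
         .pivot_table(index=['height', 'Result'], columns='Scenario', values='value'))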

Related

groupby/eq compute mean of specific column

I'm trying to figure out how to use groupby/eq to compute the mean of a specific column. I have a df as seen below (original df).
I would like to group by 'Group' and 'players', restricted to rows where class equals 1, and get the mean of 'score'.
Example:
Group = a, players = 2
(16 + 13 + 19) / 3 = 16
+-------+---------+-------+-------+------------+
| Group | players | class | score | score_mean |
+-------+---------+-------+-------+------------+
| a | 2 | 2 | 14 | |
| a | 2 | 1 | 16 | 16 |
| a | 2 | 1 | 13 | 16 |
| a | 2 | 2 | 13 | |
| a | 2 | 1 | 19 | 16 |
| a | 2 | 2 | 17 | |
| a | 2 | 2 | 14 | |
+-------+---------+-------+-------+------------+
I've tried:
df['score_mean'] = df['class'].eq(1).groupby(['Group', 'players'])['score'].transform('mean')
but I kept getting a "KeyError".
original df:
+----+-------+---------+-------+-------+
| | Group | players | class | score |
+----+-------+---------+-------+-------+
| 0 | a | 1 | 1 | 10 |
| 1 | c | 2 | 1 | 20 |
| 2 | a | 1 | 3 | 29 |
| 3 | c | 1 | 3 | 22 |
| 4 | a | 2 | 2 | 14 |
| 5 | b | 1 | 2 | 16 |
| 6 | a | 2 | 1 | 16 |
| 7 | b | 2 | 3 | 17 |
| 8 | c | 1 | 2 | 22 |
| 9 | b | 1 | 2 | 23 |
| 10 | c | 2 | 2 | 22 |
| 11 | d | 1 | 1 | 13 |
| 12 | a | 2 | 1 | 13 |
| 13 | d | 1 | 3 | 23 |
| 14 | a | 2 | 2 | 13 |
| 15 | d | 2 | 1 | 34 |
| 16 | b | 1 | 3 | 32 |
| 17 | c | 2 | 2 | 29 |
| 18 | b | 2 | 2 | 28 |
| 19 | a | 2 | 1 | 19 |
| 20 | a | 1 | 1 | 19 |
| 21 | c | 1 | 1 | 27 |
| 22 | b | 1 | 3 | 47 |
| 23 | a | 2 | 2 | 17 |
| 24 | c | 1 | 1 | 14 |
| 25 | c | 2 | 2 | 25 |
| 26 | a | 1 | 3 | 67 |
| 27 | b | 2 | 3 | 21 |
| 28 | a | 1 | 3 | 27 |
| 29 | c | 1 | 1 | 16 |
| 30 | a | 2 | 2 | 14 |
| 31 | b | 1 | 2 | 25 |
+----+-------+---------+-------+-------+
data = {'Group':['a','c','a','c','a','b','a','b','c','b','c','d','a','d','a','d',
'b','c','b','a','a','c','b','a','c','c','a','b','a','c','a','b'],
'players':[1,2,1,1,2,1,2,2,1,1,2,1,2,1,2,2,1,2,2,2,1,1,1,2,1,2,1,2,1,1,2,1],
'class':[1,1,3,3,2,2,1,3,2,2,2,1,1,3,2,1,3,2,2,1,1,1,3,2,1,2,3,3,3,1,2,2],
'score':[10,20,29,22,14,16,16,17,22,23,22,13,13,23,13,34,32,29,28,19,19,27,47,17,14,25,67,21,27,16,14,25]}
df = pd.DataFrame(data)
Kindly advise.
Many thanks & regards
Try:
Via the set_index(), groupby(), assign() and reset_index() methods:
df = (df.set_index(['Group', 'players'])
        .assign(score_mean=df[df['class'].eq(1)]
                           .groupby(['Group', 'players'])['score'].mean())
        .reset_index())
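This works because assign aligns the Series of class-1 group means on the (Group, players) index set just before it. The Series being assigned can be inspected on its own:
print(df[df['class'].eq(1)].groupby(['Group', 'players'])['score'].mean())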
Update:
If you want the first df as your output, then use:
grouped = df.groupby(['Group', 'players', 'class']).transform('mean')
grouped = grouped.assign(players=df['players'], Group=df['Group'], Class=df['class']).where(df['Group'] == 'a').dropna()
grouped['score'] = grouped.apply(lambda x: float('NaN') if x['players'] == 1 else x['score'], 1)
grouped = grouped.dropna(subset=['score'])
Printing grouped now gives the desired output.
If I understood you correctly, you need values returned only where class = 1. Not sure what that will serve, but the code is below. Use groupby with transform and chain where:
df['score_mean'] = df.groupby(['Group', 'players'])['score'].transform('mean').where(df['class'] == 1).fillna('')
   Group  players  class  score score_mean
0      a        1      1     10       30.4
1      c        2      1     20       24.0
2      a        1      3     29
3      c        1      3     22
4      a        2      2     14
5      b        1      2     16
6      a        2      1     16  15.142857
7      b        2      3     17
8      c        1      2     22
9      b        1      2     23
10     c        2      2     22
11     d        1      1     13       18.0
12     a        2      1     13  15.142857
13     d        1      3     23
14     a        2      2     13
15     d        2      1     34       34.0
16     b        1      3     32
17     c        2      2     29
18     b        2      2     28
19     a        2      1     19  15.142857
20     a        1      1     19       30.4
21     c        1      1     27       20.2
22     b        1      3     47
23     a        2      2     17
24     c        1      1     14       20.2
25     c        2      2     25
26     a        1      3     67
27     b        2      3     21
28     a        1      3     27
29     c        1      1     16       20.2
30     a        2      2     14
31     b        1      2     25
Note that the mean here is taken over all rows of each (Group, players) pair, not only the class = 1 rows, which is why (a, 2) shows roughly 15.14 rather than the 16 in the expected output.
You could first filter by class and then create score_mean by doing a groupby and transform.
(
df[df['class']==1]
.assign(score_mean = lambda x: x.groupby(['Group', 'players']).score.transform('mean'))
)
Group players class score score_mean
0 a 1 1 10 14.5
1 c 2 1 20 20.0
6 a 2 1 16 16.0
11 d 1 1 13 13.0
12 a 2 1 13 16.0
15 d 2 1 34 34.0
19 a 2 1 19 16.0
20 a 1 1 19 14.5
21 c 1 1 27 19.0
24 c 1 1 14 19.0
29 c 1 1 16 19.0
If you want to keep other classes and set the mean to '', you can do:
(
df[df['class']==1]
.groupby(['Group', 'players']).score.transform('mean')
.pipe(lambda x: df.assign(score_mean = x))
.fillna('')
)
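An equivalent one-pass variant, as a sketch: mask the scores before grouping so that only class-1 values enter the (NaN-skipping) mean, then blank out every other row:
masked = df['score'].where(df['class'].eq(1))  # NaN outside class 1
df['score_mean'] = (masked.groupby([df['Group'], df['players']])
                          .transform('mean')   # mean of class-1 scores only
                          .where(df['class'].eq(1), ''))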

Find top N values within each group

I have a dataset similar to the sample below:
| id | size | old_a | old_b | new_a | new_b |
|----|--------|-------|-------|-------|-------|
| 6 | small | 3 | 0 | 21 | 0 |
| 6 | small | 9 | 0 | 23 | 0 |
| 13 | medium | 3 | 0 | 12 | 0 |
| 13 | medium | 37 | 0 | 20 | 1 |
| 20 | medium | 30 | 0 | 5 | 6 |
| 20 | medium | 12 | 2 | 3 | 0 |
| 12 | small | 7 | 0 | 2 | 0 |
| 10 | small | 8 | 0 | 12 | 0 |
| 15 | small | 19 | 0 | 3 | 0 |
| 15 | small | 54 | 0 | 8 | 0 |
| 87 | medium | 6 | 0 | 9 | 0 |
| 90 | medium | 11 | 1 | 16 | 0 |
| 90 | medium | 25 | 0 | 4 | 0 |
| 90 | medium | 10 | 0 | 5 | 0 |
| 9 | large | 8 | 1 | 23 | 0 |
| 9 | large | 19 | 0 | 2 | 0 |
| 1 | large | 1 | 0 | 0 | 0 |
| 50 | large | 34 | 0 | 7 | 0 |
This is the input for the above table:
data=[[6,'small',3,0,21,0],[6,'small',9,0,23,0],[13,'medium',3,0,12,0],[13,'medium',37,0,20,1],[20,'medium',30,0,5,6],[20,'medium',12,2,3,0],[12,'small',7,0,2,0],[10,'small',8,0,12,0],[15,'small',19,0,3,0],[15,'small',54,0,8,0],[87,'medium',6,0,9,0],[90,'medium',11,1,16,0],[90,'medium',25,0,4,0],[90,'medium',10,0,5,0],[9,'large',8,1,23,0],[9,'large',19,0,2,0],[1,'large',1,0,0,0],[50,'large',34,0,7,0]]
data = pd.DataFrame(data, columns=['id', 'size', 'old_a', 'old_b', 'new_a', 'new_b'])
I want output that groups the dataset by size and lists the top 2 ids based on the values of the 'new_a' column within each size group. Since some ids repeat multiple times, I want to sum the values of new_a for such ids first and then find the top 2 values. My final table should look like the one below:
| size | id | new_a |
|--------|----|-------|
| large | 9 | 25 |
| large | 50 | 7 |
| medium | 13 | 32 |
| medium | 90 | 25 |
| small | 6 | 44 |
| small | 10 | 12 |
I have tried the code below, but it isn't showing the top 2 values of new_a for each group within the 'size' column (nlargest(2) here takes the top 2 of the whole summed Series, not the top 2 per size group):
nlargest = data.groupby(['size', 'id'])['new_a'].sum().nlargest(2).reset_index()
print(
    data.groupby('size').apply(
        lambda x: x.groupby('id').sum().nlargest(2, columns='new_a')
    ).reset_index()[['size', 'id', 'new_a']]
)
Prints:
size id new_a
0 large 9 25
1 large 50 7
2 medium 13 32
3 medium 90 25
4 small 6 44
5 small 10 12
You can set size and id as the index to avoid a double groupby here, and use Series.sum with the level parameter (selecting the new_a column first, so each group is a Series):
data.set_index(["size", "id"])["new_a"].groupby(level=0).apply(
    lambda x: x.sum(level=1).nlargest(2)
).reset_index()
size id new_a
0 large 9 25
1 large 50 7
2 medium 13 32
3 medium 90 25
4 small 6 44
5 small 10 12
You can chain two groupby methods:
data.groupby(['id', 'size'])['new_a'].sum().groupby('size').nlargest(2)\
.droplevel(0).to_frame('new_a').reset_index()
Output:
id size new_a
0 9 large 25
1 50 large 7
2 13 medium 32
3 90 medium 25
4 6 small 44
5 10 small 12
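A variant that avoids the droplevel juggling, as a sketch: sort the summed Series once, then take the head of each size group:
s = data.groupby(['size', 'id'])['new_a'].sum()  # collapse repeated ids
top2 = (s.sort_values(ascending=False)
         .groupby(level='size').head(2)          # top 2 rows per size
         .reset_index()
         .sort_values(['size', 'new_a'], ascending=[True, False]))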

Get a combination of values in dataframe and implement a function

I want to take combinations of the values in a column and apply a function to each combination. What is the easiest way to do this?
Example Data
| name | value |
|------|-------|
| 6A | 1 |
| 6A | 1 |
| 6A | 1 |
| 6B | 3 |
| 6B | 3 |
| 6B | 3 |
| 6C | 7 |
| 6C | 5 |
| 6C | 4 |
The Result I Want
I used sum as the function in the example:
| pair | result |
|-------|--------|
| 6A_6B | 4 |
| 6A_6B | 4 |
| 6A_6B | 4 |
| 6A_6C | 8 |
| 6A_6C | 6 |
| 6A_6C | 5 |
| 6B_6C | 10 |
| 6B_6C | 8 |
| 6B_6C | 7 |
Note
My function takes pandas.Series objects as parameters.
For example:
x = the series of "6A" values
and
y = the series of "6B" values, so that
6A_6B = sum(x, y)
I find it more straightforward to reshape the data; then it's a simple addition of all pairwise combinations.
import pandas as pd
from itertools import combinations

u = (df.assign(idx=df.groupby('name').cumcount() + 1)
       .pivot(index='idx', columns='name', values='value'))
#name  6A  6B  6C
#idx
#1      1   3   7
#2      1   3   5
#3      1   3   4

l = []
for items in combinations(u.columns, 2):
    l.append(u.loc[:, list(items)].sum(axis=1).to_frame('result').assign(pair='_'.join(items)))
df = pd.concat(l)
result pair
idx
1 4 6A_6B
2 4 6A_6B
3 4 6A_6B
1 8 6A_6C
2 6 6A_6C
3 5 6A_6C
1 10 6B_6C
2 8 6B_6C
3 7 6B_6C
itertools.combinations
Off the top of my head:
from itertools import combinations

g = dict(tuple(df.groupby('name')))
pd.DataFrame([
    (f'{x}_{y}', a + b)
    for x, y in combinations(g, 2)
    for a, b in zip(g[x]['value'], g[y]['value'])
], columns=df.columns)
name value
0 6A_6B 4
1 6A_6B 4
2 6A_6B 4
3 6A_6C 8
4 6A_6C 6
5 6A_6C 5
6 6B_6C 10
7 6B_6C 8
8 6B_6C 7
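Since the question says the function takes pandas.Series objects, the reshaped frame u from the first answer generalizes beyond sum. A sketch, with func as a hypothetical stand-in for the real combining function:
from itertools import combinations
import pandas as pd

def func(x: pd.Series, y: pd.Series) -> pd.Series:
    # hypothetical placeholder for the asker's real function
    return x + y

pairs = {f'{a}_{b}': func(u[a], u[b]) for a, b in combinations(u.columns, 2)}
out = (pd.concat(pairs, names=['pair', 'idx'])
         .rename('result')
         .reset_index())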

Pandas: Transpose, groupby and summarize columns

I have a pandas DataFrame which looks like this:
| Id | Filter 1 | Filter 2 | Filter 3 |
|----|----------|----------|----------|
| 25 | 0 | 1 | 1 |
| 25 | 1 | 0 | 1 |
| 25 | 0 | 0 | 1 |
| 30 | 1 | 0 | 1 |
| 31 | 1 | 0 | 1 |
| 31 | 0 | 1 | 0 |
| 31 | 0 | 0 | 1 |
I need to transpose this table, add a "Name" column containing the name of the filter, and sum each filter's values per Id. The result table should look like this:
| Id | Name | Summ |
| 25 | Filter 1 | 1 |
| 25 | Filter 2 | 1 |
| 25 | Filter 3 | 3 |
| 30 | Filter 1 | 1 |
| 30 | Filter 2 | 0 |
| 30 | Filter 3 | 1 |
| 31 | Filter 1 | 1 |
| 31 | Filter 2 | 1 |
| 31 | Filter 3 | 2 |
The only solution I have come up with so far was to use apply on the data grouped by the Id column, but this method is too slow for my case: the dataset can have more than 40 columns and 50_000 rows. How can I do this with native pandas methods (e.g. pivot, transpose, groupby)?
Use:
df_new = (df.melt('Id', var_name='Name', value_name='Sum')
            .groupby(['Id', 'Name']).Sum.sum()
            .reset_index())
print(df_new)
Id Name Sum
0 25 Filter 1 1
1 25 Filter 2 1
2 25 Filter 3 3
3 30 Filter 1 1
4 30 Filter 2 0
5 30 Filter 3 1
6 31 Filter 1 1
7 31 Filter 2 1
8 31 Filter 3 2
stack, then groupby:
df.set_index('Id').stack().groupby(level=[0,1]).sum().reset_index()
Id level_1 0
0 25 Filter 1 1
1 25 Filter 2 1
2 25 Filter 3 3
3 30 Filter 1 1
4 30 Filter 2 0
5 30 Filter 3 1
6 31 Filter 1 1
7 31 Filter 2 1
8 31 Filter 3 2
Short version:
df.set_index('Id').sum(level=0).stack()  # equivalently: df.groupby('Id').sum().stack()
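To get the asker's Name/Summ headers, the stacked result can be renamed at the end; a sketch building on the groupby version above:
out = (df.set_index('Id').stack()
         .groupby(level=[0, 1]).sum()
         .reset_index()
         .rename(columns={'level_1': 'Name', 0: 'Summ'}))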
Using filter and melt
df.filter(like='Filter').groupby(df.Id).sum().T.reset_index().melt(id_vars='index')
index Id value
0 Filter 1 25 1
1 Filter 2 25 1
2 Filter 3 25 3
3 Filter 1 30 1
4 Filter 2 30 0
5 Filter 3 30 1
6 Filter 1 31 1
7 Filter 2 31 1
8 Filter 3 31 2

Removing duplicates with three conditions, Pandas

I have the following dataframe:
reference | topcredit | currentbalance | creditlimit
1 1 | 50 | 20 | 70
2 1 | 30 | 28 | 50
3 1 | 50 | 20 | 70
4 1 | 81 | 32 | 100
5 2 | 70 | 0 | 56
6 2 | 50 | 20 | 70
7 2 | 100 | 0 | 150
8 3 | 85 | 85 | 95
9 3 | 85 | 85 | 95
And so on...
I want to drop duplicates based on 'reference': only rows that share the same topcredit, currentbalance and creditlimit within the same reference.
In reference 1, lines 1 and 3 have the same numbers in all three columns, and line 6 of reference 2 has those same values too; since duplicates should only be dropped within a reference, I would like to keep one of the two matching rows of reference 1 and also keep line 6 of reference 2. In reference 3, both lines have the same information as well.
The expected output is:
reference | topcredit | currentbalance | creditlimit
1 | 50 | 20 | 70
1 | 30 | 28 | 50
1 | 81 | 32 | 100
2 | 70 | 24 | 56
2 | 50 | 20 | 70
2 | 100 | 80 | 150
3 | 85 | 85 | 95
I would appreciate the help; I've been searching for how to do it for a while.
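A minimal sketch of one way to do this, assuming the frame is named df: drop_duplicates with a subset spanning reference plus the three value columns treats rows as duplicates only when all four match, so identical values under a different reference are kept.
out = df.drop_duplicates(
    subset=['reference', 'topcredit', 'currentbalance', 'creditlimit'],
    keep='first',  # keep the first of each within-reference duplicate
)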
