I have a dataframe (df) with a MultiIndex:
Class A B
Sex M F M F
Group
B1 81 34 55 92
B2 38 3 25 11
B3 73 71 69 79
B4 69 23 27 96
B5 5 1 33 28
B6 40 81 91 87
I am trying to create 2 sum columns (one for M and one for F) so my output would look like:
Class A B Total
Sex M F M F Total M Total F
Group
B1 81 34 55 92 136 126
B2 38 3 25 11 63 14
B3 73 71 69 79 142 150
B4 69 23 27 96 96 119
B5 5 1 33 28 38 29
B6 40 81 91 87 131 168
I have tried:
df['Total M'] = df['M'].sum(axis=1)
df['Total F'] = df['F'].sum(axis=1)
without success
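For reproducibility, here is a minimal sketch that rebuilds the sample frame; the attempt above fails because plain df['M'] looks up 'M' in the first column level ('Class'), where it does not exist:
import pandas as pd

# Rebuild the sample frame from the question
cols = pd.MultiIndex.from_product([['A', 'B'], ['M', 'F']], names=['Class', 'Sex'])
df = pd.DataFrame(
    [[81, 34, 55, 92],
     [38,  3, 25, 11],
     [73, 71, 69, 79],
     [69, 23, 27, 96],
     [ 5,  1, 33, 28],
     [40, 81, 91, 87]],
    index=pd.Index(['B1', 'B2', 'B3', 'B4', 'B5', 'B6'], name='Group'),
    columns=cols,
)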
You can try:
df[('Total', 'Total M')] = df.xs('M', level=1, axis=1).sum(axis=1)
df[('Total', 'Total F')] = df.xs('F', level=1, axis=1).sum(axis=1)

# or in a for loop
for col in ['M', 'F']:
    df[('Total', f'Total {col}')] = df.xs(col, level=1, axis=1).sum(axis=1)

print(df)
A B Total
M F M F Total M Total F
B1 81 34 55 92 136 126
B2 38 3 25 11 63 14
B3 73 71 69 79 142 150
B4 69 23 27 96 96 119
B5 5 1 33 28 38 29
B6 40 81 91 87 131 168
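For context, df.xs('M', level=1, axis=1) takes the cross-section of every column whose second level ('Sex') is 'M' and drops that level, so .sum(axis=1) then adds the remaining columns row-wise:
print(df.xs('M', level=1, axis=1))
# Class   A   B
# Group
# B1     81  55
# B2     38  25
# B3     73  69
# ...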
Use the loc accessor:
df[('Total', 'Total M')] = df.loc(axis=1)[:, ['M']].sum(axis=1)
df[('Total', 'Total F')] = df.loc(axis=1)[:, ['F']].sum(axis=1)
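The same selection can also be spelled with pd.IndexSlice, which some find more explicit for multi-level column slicing:
# Equivalent selection via pd.IndexSlice
idx = pd.IndexSlice
df[('Total', 'Total M')] = df.loc[:, idx[:, 'M']].sum(axis=1)
df[('Total', 'Total F')] = df.loc[:, idx[:, 'F']].sum(axis=1)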
Here's an alternative approach using groupby. It is probably overkill for only two groups (M and F), but it should scale well when there are more.
totals = df.groupby(axis=1, level=1).sum()
totals.columns = pd.MultiIndex.from_product([['Total'], totals.columns])
df = df.join(totals)
Output:
Class A B Total
Sex M F M F F M
Group
B1 81 34 55 92 126 136
B2 38 3 25 11 14 63
B3 73 71 69 79 150 142
B4 69 23 27 96 119 96
B5 5 1 33 28 29 38
B6 40 81 91 87 168 131
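Note that F now precedes M because groupby sorts the group keys; sort=False preserves the original order. Also, axis=1 in groupby is deprecated in recent pandas versions, so a transpose-based equivalent may age better (a sketch, assuming the same df):
# Keep the original M/F order
totals = df.groupby(axis=1, level=1, sort=False).sum()

# Equivalent without the deprecated axis=1 groupby
totals = df.T.groupby(level=1, sort=False).sum().T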
I have a dataframe:
country group A B C D
0 1 a1 10 20 30 40
1 1 a1 11 21 31 41
2 1 a1 12 22 32 42
3 2 a2 50 60 70 80
4 2 a2 51 61 71 81
5 2 a2 52 62 72 82
6 2 a2 53 63 73 83
7 2 a2 50 60 70 80
8 3 a3 51 61 71 81
9 3 a3 52 62 72 82
10 3 a3 53 63 73 83
11 3 a3 53 63 73 83
My goal is to have a dataframe as follows:
country group A B C D
0 1 NAN 10 20 30 40
1 1 NAN 11 21 31 41
2 1 NAN 12 22 32 42
3 2 a1 50 60 70 80
4 2 a1 51 61 71 81
5 2 a1 52 62 72 82
6 2 a1 53 63 73 83
7 2 a1 50 60 70 80
8 3 a2 51 61 71 81
9 3 a2 52 62 72 82
10 3 a2 53 63 73 83
11 3 a2 53 63 73 83
That is, the 'group' column takes the label of the previous group, shifted forward onto the next group (so the first group becomes NaN).
You can use a mapping Series:
s = df.set_index('country')['group'].drop_duplicates()
df['group'] = df['country'].map(s.shift())
output:
country group A B C D
0 1 NaN 10 20 30 40
1 1 NaN 11 21 31 41
2 1 NaN 12 22 32 42
3 2 a1 50 60 70 80
4 2 a1 51 61 71 81
5 2 a1 52 62 72 82
6 2 a1 53 63 73 83
7 2 a1 50 60 70 80
8 3 a2 51 61 71 81
9 3 a2 52 62 72 82
10 3 a2 53 63 73 83
11 3 a2 53 63 73 83
mapping Series s:
country
1 a1
2 a2
3 a3
Name: group, dtype: object
Use Series.shift, compare the shifted values with the original column, and then forward-fill the missing values:
s = df['group'].shift()
df['group'] = s.where(s.ne(df['group'])).ffill()
print(df)
country group A B C D
0 1 NaN 10 20 30 40
1 1 NaN 11 21 31 41
2 1 NaN 12 22 32 42
3 2 a1 50 60 70 80
4 2 a1 51 61 71 81
5 2 a1 52 62 72 82
6 2 a1 53 63 73 83
7 2 a1 50 60 70 80
8 3 a2 51 61 71 81
9 3 a2 52 62 72 82
10 3 a2 53 63 73 83
11 3 a2 53 63 73 83
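To unpack why this works, here is the same logic annotated step by step: the shifted series holds the previous row's label, the comparison is True only on the first row of each group, and ffill carries the boundary value through the rest of the group.
s = df['group'].shift()                  # previous row's group label
boundary = s.ne(df['group'])             # True on the first row of each group
df['group'] = s.where(boundary).ffill()  # keep labels at boundaries, fill down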
I have a DataFrame and I need to create a new column which contains the second largest value of each row in the original DataFrame.
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1))
Desired output:
0 1 2 3 4 5 6 7 8 9 penultimate
0 52 69 62 7 20 69 38 10 57 17 62
1 52 94 49 63 1 90 14 76 20 84 90
2 78 37 58 7 27 41 27 26 48 51 58
3 6 39 99 36 62 90 47 25 60 84 90
4 37 36 91 93 76 69 86 95 69 6 93
5 5 54 73 61 22 29 99 27 46 24 73
6 71 65 45 9 63 46 4 93 36 18 71
7 85 7 76 46 65 97 64 52 28 80 85
How can this be done in as little code as possible?
You could use NumPy for this:
import numpy as np
df = pd.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1))
df['penultimate'] = np.sort(df.values, 1)[:, -2]
print(df)
Sorting with NumPy directly like this is typically much faster than a row-wise pandas apply.
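If the rows were wide, a full sort is not even necessary; np.partition only guarantees that the element at index -2 lands in its sorted position, which is cheaper (a sketch, replacing the np.sort line above):
# Partial sort: places the second-largest value of each row at column -2
df['penultimate'] = np.partition(df.values, -2, axis=1)[:, -2]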
Here is a simple lambda function!
# Input
df = pd.DataFrame(np.random.randint(1, 100, 80).reshape(8, -1))
# Output
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)
Cheers!
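One caveat: because of .unique(), this returns the second-largest distinct value, which differs from the NumPy answers whenever a row's maximum appears more than once. Series.nlargest matches the sort-based behavior (a sketch, assuming df is the original frame before any column was added):
# Second largest including repeated maxima (matches np.sort/np.partition)
df['penultimate'] = df.apply(lambda x: x.nlargest(2).iloc[-1], axis=1)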
Curious to know if there is a better way to keep the needed columns in a dataframe when the ones to keep are few and the ones to remove are many.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(10, 99, size=(13, 26)), columns=list('abcdefghijklmnopqrstuvwxyz'))
df1
Output:
a b c d e f g h i j ... q r s t u v w x y z
0 78 60 27 38 21 93 74 47 16 53 ... 79 56 40 41 87 80 14 82 12 50
1 84 73 59 46 91 43 22 28 57 52 ... 27 65 81 72 68 90 68 61 22 44
2 56 37 29 52 57 14 87 82 46 90 ... 67 57 29 14 55 30 46 72 56 91
3 86 44 46 79 41 74 32 49 42 32 ... 33 34 40 17 30 78 29 75 80 52
4 14 89 90 79 67 17 34 39 57 37 ... 93 49 78 91 26 73 40 48 91 36
5 16 62 32 87 56 81 82 17 59 57 ... 84 24 97 39 46 40 68 53 73 40
6 69 72 16 47 37 20 27 56 13 37 ... 10 28 17 35 39 14 51 85 69 53
7 81 34 35 20 66 44 86 23 94 57 ... 38 45 76 53 82 72 64 34 81 43
8 95 90 97 31 18 85 74 18 43 22 ... 20 20 96 25 53 76 55 96 58 98
9 73 53 72 94 55 33 22 40 11 64 ... 84 66 85 34 94 32 78 72 10 62
10 73 24 57 17 63 24 94 25 59 84 ... 34 45 27 28 47 23 38 80 45 41
11 69 18 22 42 95 38 16 47 68 36 ... 59 69 35 39 78 75 85 86 53 55
12 46 27 53 77 48 15 57 90 32 57 ... 32 79 18 67 71 86 54 11 36 51
13 rows × 26 columns
Say I have to keep only a few arbitrary columns, e.g. e, u, r, q, j; is there a better way to keep them than running df1.drop() with the other 21 column names passed in? I could not find a better way in any of the existing questions.
Edit:
Different from the solution in
Selecting multiple columns in a pandas dataframe
since the columns to drop are arbitrary rather than a contiguous range
You can copy all the columns you want to keep into a new dataframe and then overwrite your first dataframe, like so:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randint(10, 99, size=(13, 26)), columns=list('abcdefghijklmnopqrstuvwxyz'))
df2 = pd.DataFrame()
columns_to_keep = ["e", "r", "u"]
for column in columns_to_keep:
    df2[column] = df1[column]
df1 = df2
df1
or alternatively use a for loop to drop every column not in the keep-list:
columns_to_keep = ["e", "r", "u"]
for column_name, column_data in df1.items():  # iteritems() was removed in pandas 2.0
    if column_name not in columns_to_keep:
        df1 = df1.drop(column_name, axis=1)
df1
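The loop can also be collapsed into a single drop by letting Index.difference compute the discard set (a sketch, same df1 as above):
# Drop every column not in the keep-list in one call
columns_to_keep = ["e", "r", "u"]
df1 = df1.drop(columns=df1.columns.difference(columns_to_keep))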
Let's just use column filtering and reassign back to df1:
df1 = pd.DataFrame(np.random.randint(10, 99, size=(13, 26)), columns=list('abcdefghijklmnopqrstuvwxyz'))
columns_to_keep = ["e", "r", "u"]
df1 = df1[columns_to_keep]
df1.head()
Output:
e r u
0 65 95 13
1 58 42 75
2 95 34 12
3 43 20 79
4 83 27 47
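DataFrame.filter expresses the same intent and likewise returns a new frame:
# Equivalent: keep only the listed columns
df1 = df1.filter(items=["e", "r", "u"])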
I have a pandas data frame, df:
import pandas as pd
import numpy as np

np.random.seed(123)
s = np.arange(5)
frames = []
for i in s:
    s_df = pd.DataFrame({'time': np.arange(100),
                         'x': np.arange(100),
                         'y': np.arange(100),
                         'r': np.random.randint(60, 100)})
    s_df['unit'] = str(i)
    frames.append(s_df)
df = pd.concat(frames)  # DataFrame.append was removed in pandas 2.0
I want to select the 'x' and 'y' data for each 'unit', from 'time' 0 up to that unit's value of 'r', and then warp the selected data to fit a new normalized timescale of 0-100. The new DataFrame should look the same, but x and y will have been stretched to fit the new timescale.
I think you can start with this and modify:
df.groupby('unit', as_index=False, group_keys=False)\
  .apply(lambda g: g[g.time <= g.r.max()]
         .pipe(lambda x: x.assign(x=np.interp(x.time * 100 / x.r.max(), g.time, g.x),
                                  y=np.interp(x.time * 100 / x.r.max(), g.time, g.y))))
Output:
r time x y unit
0 91 0 0.369445 0.802790 0
1 91 1 0.802881 0.411523 0
2 91 2 0.080290 0.228482 0
3 91 3 0.248865 0.624470 0
4 91 4 0.350376 0.207805 0
5 91 5 0.604447 0.495269 0
6 91 6 0.402430 0.317250 0
7 91 7 0.205757 0.296371 0
8 91 8 0.426954 0.793716 0
9 91 9 0.728095 0.486691 0
10 91 10 0.087941 0.701258 0
11 91 11 0.653719 0.937834 0
12 91 12 0.702571 0.381267 0
13 91 13 0.113419 0.492686 0
14 91 14 0.381172 0.539422 0
15 91 15 0.490320 0.166290 0
16 91 16 0.440490 0.029675 0
17 91 17 0.663973 0.245057 0
18 91 18 0.273116 0.280711 0
19 91 19 0.807658 0.869288 0
20 91 20 0.227972 0.987803 0
21 91 21 0.747295 0.526613 0
22 91 22 0.491929 0.118479 0
23 91 23 0.403465 0.564284 0
24 91 24 0.618359 0.648467 0
25 91 25 0.867436 0.447866 0
26 91 26 0.487128 0.526473 0
27 91 27 0.135412 0.855466 0
28 91 28 0.469281 0.753690 0
29 91 29 0.397495 0.786670 0
.. .. ... ... ... ...
53 82 53 0.985053 0.534743 4
54 82 54 0.255997 0.789710 4
55 82 55 0.629316 0.889916 4
56 82 56 0.730672 0.539548 4
57 82 57 0.484289 0.278669 4
58 82 58 0.120573 0.754350 4
59 82 59 0.071606 0.912240 4
60 82 60 0.126613 0.775831 4
61 82 61 0.392633 0.706384 4
62 82 62 0.312653 0.698514 4
63 82 63 0.164337 0.420798 4
64 82 64 0.655284 0.317136 4
65 82 65 0.526961 0.484673 4
66 82 66 0.205197 0.516752 4
67 82 67 0.405965 0.314419 4
68 82 68 0.892710 0.620090 4
69 82 69 0.351876 0.422846 4
70 82 70 0.601449 0.152340 4
71 82 71 0.187239 0.486854 4
72 82 72 0.757108 0.727058 4
73 82 73 0.728311 0.623236 4
74 82 74 0.725225 0.279149 4
75 82 75 0.536730 0.746806 4
76 82 76 0.584319 0.543595 4
77 82 77 0.591636 0.451003 4
78 82 78 0.042806 0.766688 4
79 82 79 0.326183 0.832956 4
80 82 80 0.558992 0.507238 4
81 82 81 0.303649 0.143872 4
82 82 82 0.303214 0.113151 4
[428 rows x 5 columns]
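A more readable refactor of the same idea: per unit, truncate at r, then resample x and y onto the normalized 0-100 timescale with np.interp (the names warp and df2 are mine; the logic mirrors the one-liner above):
import numpy as np

def warp(g):
    r = g['r'].max()
    out = g[g['time'] <= r].copy()
    new_time = out['time'] * 100 / r              # normalized 0-100 positions
    out['x'] = np.interp(new_time, g['time'], g['x'])
    out['y'] = np.interp(new_time, g['time'], g['y'])
    return out

df2 = df.groupby('unit', group_keys=False).apply(warp)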
There are two pandas DataFrames, say dfx and dfy, of the same shape and with exactly the same column and row indices. I want to apply a function to the corresponding rows of these two DataFrames.
In other words, suppose we have a function as follows
def fun(row_x, row_y):
    ...  # a function of the corresponding rows
Let index be the common index of dfx, dfy. I want to compute in pandas the following list/Series
[fun(dfx[i], dfy[i]) for i in index] (pseudo-code)
With the following code I build a grouped, two-level-indexed DataFrame, but then I do not know how to apply agg in the proper way.
dfxy = pd.concat({'dfx':dfx, 'dfy':dfy})
dfxy = dfxy.swaplevel(0,1,axis=0).sort_index(level=0)
grouped=dfxy.groupby(level=0)
In [19]:
dfx = pd.DataFrame(data=np.random.randint(0, 100, 50).reshape(10, -1), columns=list('abcde'))
dfx
Out[19]:
    a   b   c   d   e
0   3  44   8  55  95
1  26   5  18  34  10
2  20  20  91  15   8
3  83   7  50  47  27
4  97  65  10  94  93
5  44   6  70  60   4
6  38  64   8  67  92
7  44  21  42   6  12
8  30  98  34   7  79
9  76   7  14  58   5
In [4]:
dfy = pd.DataFrame(data=np.random.randint(0, 100, 50).reshape(10, -1), columns=list('fghij'))
dfy
Out[4]:
    f   g   h   i   j
0  82  48  29  54  78
1   7  31  78  38  30
2  90  91  43   8  40
3  52  88  13  87  39
4  41  88  90  51  91
5  55   4  94  62  98
6  31  23   4  59  93
7  87  12  33  77   0
8  25  99  39  23   1
9   7  50  46  39  66
In [13]:
dfxy = pd.concat({'dfx': dfx, 'dfy': dfy}, axis=1)
dfxy
Out[13]:
  dfx                 dfy
    a   b   c   d   e   f   g   h   i   j
0  20  76   5  98  38  82  48  29  54  78
1  39  36   9   3  74   7  31  78  38  30
2  43  12  50  72  14  90  91  43   8  40
3  89  41  95  91  86  52  88  13  87  39
4  33  30  55  64  94  41  88  90  51  91
5  89  84  48   1  60  55   4  94  62  98
6  68  40  27  10  63  31  23   4  59  93
7  33  10  86  89  67  87  12  33  77   0
8  56  89   0  70  67  25  99  39  23   1
9  48  58  98  18  24   7  50  46  39  66
def f(x, y):
    return pd.Series(data=[np.mean(x), np.mean(y)], index=['x_mean', 'y_mean'])
In [17]:
dfxy.apply(lambda x: f(x['dfx'], x['dfy']), axis=1)
Out[17]:
x_mean y_mean
0 47.4 58.2
1 32.2 36.8
2 38.2 54.4
3 80.4 55.8
4 55.2 72.2
5 56.4 62.6
6 41.6 42.0
7 57.0 41.8
8 56.4 37.4
9 49.2 41.6
Could this be what you are looking for?
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: dfx = pd.DataFrame(data=np.random.randint(0, 100, 50).reshape(10, -1),
   ...:                    columns=['index', 'a', 'b', 'c', 'd'])
In [4]: dfy = pd.DataFrame(data=np.random.randint(0, 100, 50).reshape(10, -1),
   ...:                    columns=['index', 'a', 'b', 'c', 'd'])
In [5]: dfy['index'] = dfx['index']
In [6]: print(dfx)
index a b c d
0 25 41 46 18 98
1 0 21 9 20 29
2 18 78 63 94 70
3 86 71 71 95 64
4 23 33 19 34 29
5 69 10 91 19 42
6 92 68 60 12 58
7 74 49 22 74 1
8 47 35 56 41 80
9 93 20 44 16 49
In [7]: print(dfy)
index a b c d
0 25 28 35 96 89
1 0 44 94 50 43
2 18 18 39 75 45
3 86 18 87 72 88
4 23 2 28 24 4
5 69 53 55 55 40
6 92 0 52 54 91
7 74 8 1 96 59
8 47 74 21 7 7
9 93 42 83 42 60
In [8]: print(dfx.merge(dfy, on='index'))
index a_x b_x c_x d_x a_y b_y c_y d_y
0 25 41 46 18 98 28 35 96 89
1 0 21 9 20 29 44 94 50 43
2 18 78 63 94 70 18 39 75 45
3 86 71 71 95 64 18 87 72 88
4 23 33 19 34 29 2 28 24 4
5 69 10 91 19 42 53 55 55 40
6 92 68 60 12 58 0 52 54 91
7 74 49 22 74 1 8 1 96 59
8 47 35 56 41 80 74 21 7 7
9 93 20 44 16 49 42 83 42 60
In [9]: def my_function(x):
   ...:     return sum(x)
   ...:
In [10]: print(dfx.merge(dfy, on='index').drop('index', axis=1).apply(my_function, axis=1))
0 451
1 310
2 482
3 566
4 173
5 365
6 395
7 310
8 321
9 356
dtype: int64
In [11]: print(pd.DataFrame(
             {'my_function': dfx.merge(dfy, on='index')
                                .drop('index', axis=1)
                                .apply(my_function, axis=1),
              'index': dfx['index']}))
index my_function
0 25 451
1 0 310
2 18 482
3 86 566
4 23 173
5 69 365
6 92 395
7 74 310
8 47 321
9 93 356
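Since the indices are guaranteed to match, the merge/concat machinery can also be skipped entirely. A sketch of the literal pattern from the question, where fun is an illustrative stand-in for your row-wise function:
# Apply fun to each pair of aligned rows and collect the results
def fun(row_x, row_y):
    return row_x.mean() - row_y.mean()   # example: difference of row means

result = pd.Series([fun(dfx.loc[i], dfy.loc[i]) for i in dfx.index],
                   index=dfx.index)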