How to apply a function to rows of two pandas DataFrames

There are two pandas DataFrames, say dfx and dfy, with the same shape and exactly the same column and row indices. I want to apply a function to the corresponding rows of these two DataFrames.
In other words, suppose we have a function as follows:
def fun(row_x, row_y):
    ...  # a function of the corresponding rows
Let index be the common index of dfx and dfy. I want to compute in pandas the following list/Series:
[fun(dfx[i], dfy[i]) for i in index] (pseudo-code)
With the following code I build a grouped DataFrame with a two-level index, but I do not know how to apply agg to it properly.
dfxy = pd.concat({'dfx':dfx, 'dfy':dfy})
dfxy = dfxy.swaplevel(0,1,axis=0).sort_index(level=0)
grouped=dfxy.groupby(level=0)

In [19]:
dfx = pd.DataFrame(data = np.random.randint(0 , 100 , 50).reshape(10 ,-1) , columns=list('abcde'))
dfx
Out[19]:
a b c d e
3 44 8 55 95
26 5 18 34 10
20 20 91 15 8
83 7 50 47 27
97 65 10 94 93
44 6 70 60 4
38 64 8 67 92
44 21 42 6 12
30 98 34 7 79
76 7 14 58 5
In [4]:
dfy = pd.DataFrame(data = np.random.randint(0 , 100 , 50).reshape(10 ,-1) , columns=list('fghij'))
dfy
Out[4]:
f g h i j
82 48 29 54 78
7 31 78 38 30
90 91 43 8 40
52 88 13 87 39
41 88 90 51 91
55 4 94 62 98
31 23 4 59 93
87 12 33 77 0
25 99 39 23 1
7 50 46 39 66
In [13]:
dfxy = pd.concat({'dfx':dfx, 'dfy':dfy} , axis = 1)
dfxy
Out[13]:
dfx dfy
a b c d e f g h i j
20 76 5 98 38 82 48 29 54 78
39 36 9 3 74 7 31 78 38 30
43 12 50 72 14 90 91 43 8 40
89 41 95 91 86 52 88 13 87 39
33 30 55 64 94 41 88 90 51 91
89 84 48 1 60 55 4 94 62 98
68 40 27 10 63 31 23 4 59 93
33 10 86 89 67 87 12 33 77 0
56 89 0 70 67 25 99 39 23 1
48 58 98 18 24 7 50 46 39 66
def f(x, y):
    return pd.Series(data=[np.mean(x), np.mean(y)], index=['x_mean', 'y_mean'])
In [17]:
dfxy.apply( lambda x : f(x['dfx'] , x['dfy']) , axis = 1)
Out[17]:
x_mean y_mean
0 47.4 58.2
1 32.2 36.8
2 38.2 54.4
3 80.4 55.8
4 55.2 72.2
5 56.4 62.6
6 41.6 42.0
7 57.0 41.8
8 56.4 37.4
9 49.2 41.6
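If the two frames really do share the same index, a plain comprehension over that index is another option. A minimal sketch, assuming a made-up fun (a row-wise dot product) and seeded sample data that are not from the original post:

```python
import numpy as np
import pandas as pd

# Sample frames with identical index and columns (illustrative data only)
rng = np.random.default_rng(0)
dfx = pd.DataFrame(rng.integers(0, 100, size=(10, 5)), columns=list("abcde"))
dfy = pd.DataFrame(rng.integers(0, 100, size=(10, 5)), columns=list("abcde"))

def fun(row_x, row_y):
    # Hypothetical row function: dot product of the two rows
    return float((row_x * row_y).sum())

# Pair the rows by label and collect the results in a Series
result = pd.Series(
    [fun(dfx.loc[i], dfy.loc[i]) for i in dfx.index],
    index=dfx.index,
)
print(result)
```

This loops in Python, so it will be slow on large frames; when fun can be written with vectorized operations, that is usually preferable.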

Could this be what you are looking for?
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: dfx = pd.DataFrame(data=np.random.randint(0,100,50).reshape(10,-1),
columns=['index', 'a', 'b', 'c', 'd'])
In [4]: dfy = pd.DataFrame(data=np.random.randint(0,100,50).reshape(10,-1),
columns=['index', 'a', 'b', 'c', 'd'])
In [5]: dfy['index'] = dfx['index']
In [6]: print(dfx)
index a b c d
0 25 41 46 18 98
1 0 21 9 20 29
2 18 78 63 94 70
3 86 71 71 95 64
4 23 33 19 34 29
5 69 10 91 19 42
6 92 68 60 12 58
7 74 49 22 74 1
8 47 35 56 41 80
9 93 20 44 16 49
In [7]: print(dfy)
index a b c d
0 25 28 35 96 89
1 0 44 94 50 43
2 18 18 39 75 45
3 86 18 87 72 88
4 23 2 28 24 4
5 69 53 55 55 40
6 92 0 52 54 91
7 74 8 1 96 59
8 47 74 21 7 7
9 93 42 83 42 60
In [8]: print(dfx.merge(dfy, on='index'))
index a_x b_x c_x d_x a_y b_y c_y d_y
0 25 41 46 18 98 28 35 96 89
1 0 21 9 20 29 44 94 50 43
2 18 78 63 94 70 18 39 75 45
3 86 71 71 95 64 18 87 72 88
4 23 33 19 34 29 2 28 24 4
5 69 10 91 19 42 53 55 55 40
6 92 68 60 12 58 0 52 54 91
7 74 49 22 74 1 8 1 96 59
8 47 35 56 41 80 74 21 7 7
9 93 20 44 16 49 42 83 42 60
In [9]: def my_function(x):
...: return sum(x)
...:
In [10]: print(dfx.merge(dfy, on='index').drop('index', axis=1).apply(my_function, axis=1))
0 451
1 310
2 482
3 566
4 173
5 365
6 395
7 310
8 321
9 356
dtype: int64
In [11]: print(pd.DataFrame(
{
'my_function':
dfx.merge(dfy, on='index').\
drop('index', axis=1).apply(my_function, axis=1),
'index':
dfx['index']
}))
index my_function
0 25 451
1 0 310
2 18 482
3 86 566
4 23 173
5 69 365
6 92 395
7 74 310
8 47 321
9 93 356
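If the function needs the two rows separately rather than as one merged row, the suffixes added by merge can be used to split them again. A rough sketch, assuming a hypothetical fun and only the first three rows and the a, b columns of the frames printed above:

```python
import pandas as pd

dfx = pd.DataFrame({"index": [25, 0, 18], "a": [41, 21, 78], "b": [46, 9, 63]})
dfy = pd.DataFrame({"index": [25, 0, 18], "a": [28, 44, 18], "b": [35, 94, 39]})

merged = dfx.merge(dfy, on="index", suffixes=("_x", "_y"))
x_cols = [c for c in merged.columns if c.endswith("_x")]
y_cols = [c for c in merged.columns if c.endswith("_y")]

def fun(row_x, row_y):
    # Hypothetical two-row function: difference of the row sums
    return row_x.sum() - row_y.sum()

# Pass the dfx part and the dfy part of each merged row to fun
merged["result"] = merged.apply(lambda r: fun(r[x_cols], r[y_cols]), axis=1)
print(merged[["index", "result"]])
```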

Related

How to create a column that contains the penultimate value of each row?

I have a DataFrame and I need to create a new column which contains the second largest value of each row in the original Dataframe.
Sample:
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
Desired output:
0 1 2 3 4 5 6 7 8 9 penultimate
0 52 69 62 7 20 69 38 10 57 17 62
1 52 94 49 63 1 90 14 76 20 84 90
2 78 37 58 7 27 41 27 26 48 51 58
3 6 39 99 36 62 90 47 25 60 84 90
4 37 36 91 93 76 69 86 95 69 6 93
5 5 54 73 61 22 29 99 27 46 24 73
6 71 65 45 9 63 46 4 93 36 18 71
7 85 7 76 46 65 97 64 52 28 80 85
How can this be done in as little code as possible?
You could use NumPy for this:
import numpy as np
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
df['penultimate'] = np.sort(df.values, 1)[:, -2]
print(df)
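Using NumPy is faster than a row-wise apply. Note that when the maximum is tied, [:, -2] returns the maximum itself: for row 0 of the sample (which contains 69 twice) it gives 69, not the 62 shown in the desired output.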
Here is a simple lambda function!
# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
# Output
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)
Cheers!
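The two answers differ when the largest value is repeated in a row. A small sketch on the first two rows of the desired output above, using np.unique in place of sort_values().unique() (a swap made for brevity here, not the original answer's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[52, 69, 62, 7, 20, 69, 38, 10, 57, 17],
                   [52, 94, 49, 63, 1, 90, 14, 76, 20, 84]])

# Second-to-last element of each sorted row: duplicates count separately
second_with_ties = np.sort(df.to_numpy(), axis=1)[:, -2]

# Second-largest distinct value per row: np.unique sorts and de-duplicates
second_distinct = df.apply(lambda row: np.unique(row)[-2], axis=1)

print(second_with_ties)
print(second_distinct.to_numpy())
```

Which one is wanted depends on how ties should be treated; the desired output in the question matches the distinct-value version.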

compare two data frames and delete columns based on lookup table

I have two data frames:
df1:
A B C D E F
0 63 9 56 23 41 0
1 40 35 69 98 47 45
2 51 95 55 36 10 34
3 25 11 67 83 49 89
4 91 10 43 73 96 95
5 2 47 8 30 46 9
6 37 10 33 8 45 20
7 40 88 6 29 46 79
8 75 87 49 76 0 69
9 92 21 86 91 46 41
df2:
A B C D E F
0 0 0 0 1 1 0
I want to delete columns in df1 based on the values in df2 (a lookup table). Wherever df2 has a 1, I have to delete that column in df1.
So my final output should look like this:
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
Assuming len(df1.columns) == len(df2.columns):
df1.loc[:, ~df2.loc[0].astype(bool).values]
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
If the columns aren't the same, but df2 has a subset of columns in df1, then
df1.reindex(df2.columns[~df2.loc[0].astype(bool)], axis=1)
Or with drop, similar to @student's method:
df1.drop(df2.columns[df2.loc[0].astype(bool)], axis=1)
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
Column indexes also support intersection:
df1[df1.columns.intersection(df2.columns[~df2.iloc[0].astype(bool)])]
Out[354]:
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
You can try with drop to drop the columns:
remove_col = df2.columns[(df2 == 1).any()] # get columns with any value 1
df1.drop(remove_col, axis=1, inplace=True) # drop the columns in original dataframe
Or, in one line as:
df1.drop(df2.columns[(df2 == 1).any()], axis=1, inplace=True)
The following may be an easier solution to understand:
df1.loc[:,df2.loc[0]!=1]
Output:
A B C F
0 63 9 56 0
1 40 35 69 45
2 51 95 55 34
3 25 11 67 89
4 91 10 43 95
5 2 47 8 9
6 37 10 33 20
7 40 88 6 79
8 75 87 49 69
9 92 21 86 41
loc can be used to select rows or columns with a boolean or conditional lookup: https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
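For reference, a label-based variant that should also tolerate df2 listing columns that df1 does not have. This is only a sketch on the first two rows of df1 from the question, and the intermediate names (flagged, to_drop) are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [63, 40], "B": [9, 35], "C": [56, 69],
                    "D": [23, 98], "E": [41, 47], "F": [0, 45]})
df2 = pd.DataFrame([[0, 0, 0, 1, 1, 0]], columns=list("ABCDEF"))

# Boolean Series indexed by column label: True where the lookup row holds 1
flagged = df2.iloc[0].astype(bool)
to_drop = flagged.index[flagged]

# Only drop labels that actually exist in df1, so unknown labels do not raise
result = df1.drop(columns=to_drop.intersection(df1.columns))
print(result)
```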

Linear warping of pandas time-series

I have a pandas data frame, df:
import pandas as pd
import numpy as np
np.random.seed(123)
s = np.arange(5)
frames = []
for i in s:
    s_df = pd.DataFrame({'time': np.arange(100),
                         'x': np.arange(100),
                         'y': np.arange(100),
                         'r': np.random.randint(60, 100)})
    s_df['unit'] = str(i)
    frames.append(s_df)
# DataFrame.append was removed in pandas 2.0; concatenate once instead
df = pd.concat(frames)
I want to select the 'x' and 'y' data for each 'unit', from 'time' 0 up until its value of 'r', and then warp the selected data to fit a new normalized timescale of 0-100. The new DataFrame should look the same, but x and y will have been stretched to fit the new timescale.
I think you can start with this and modify:
df.groupby('unit', as_index=False, group_keys=False)\
.apply(lambda g: g[g.time <= g.r.max()].pipe(lambda x: x.assign(x = np.interp(x.time * 100/x.r.max(), g.time, g.x),
y = np.interp(x.time * 100/x.r.max(), g.time, g.y))))
Output:
r time x y unit
0 91 0 0.369445 0.802790 0
1 91 1 0.802881 0.411523 0
2 91 2 0.080290 0.228482 0
3 91 3 0.248865 0.624470 0
4 91 4 0.350376 0.207805 0
5 91 5 0.604447 0.495269 0
6 91 6 0.402430 0.317250 0
7 91 7 0.205757 0.296371 0
8 91 8 0.426954 0.793716 0
9 91 9 0.728095 0.486691 0
10 91 10 0.087941 0.701258 0
11 91 11 0.653719 0.937834 0
12 91 12 0.702571 0.381267 0
13 91 13 0.113419 0.492686 0
14 91 14 0.381172 0.539422 0
15 91 15 0.490320 0.166290 0
16 91 16 0.440490 0.029675 0
17 91 17 0.663973 0.245057 0
18 91 18 0.273116 0.280711 0
19 91 19 0.807658 0.869288 0
20 91 20 0.227972 0.987803 0
21 91 21 0.747295 0.526613 0
22 91 22 0.491929 0.118479 0
23 91 23 0.403465 0.564284 0
24 91 24 0.618359 0.648467 0
25 91 25 0.867436 0.447866 0
26 91 26 0.487128 0.526473 0
27 91 27 0.135412 0.855466 0
28 91 28 0.469281 0.753690 0
29 91 29 0.397495 0.786670 0
.. .. ... ... ... ...
53 82 53 0.985053 0.534743 4
54 82 54 0.255997 0.789710 4
55 82 55 0.629316 0.889916 4
56 82 56 0.730672 0.539548 4
57 82 57 0.484289 0.278669 4
58 82 58 0.120573 0.754350 4
59 82 59 0.071606 0.912240 4
60 82 60 0.126613 0.775831 4
61 82 61 0.392633 0.706384 4
62 82 62 0.312653 0.698514 4
63 82 63 0.164337 0.420798 4
64 82 64 0.655284 0.317136 4
65 82 65 0.526961 0.484673 4
66 82 66 0.205197 0.516752 4
67 82 67 0.405965 0.314419 4
68 82 68 0.892710 0.620090 4
69 82 69 0.351876 0.422846 4
70 82 70 0.601449 0.152340 4
71 82 71 0.187239 0.486854 4
72 82 72 0.757108 0.727058 4
73 82 73 0.728311 0.623236 4
74 82 74 0.725225 0.279149 4
75 82 75 0.536730 0.746806 4
76 82 76 0.584319 0.543595 4
77 82 77 0.591636 0.451003 4
78 82 78 0.042806 0.766688 4
79 82 79 0.326183 0.832956 4
80 82 80 0.558992 0.507238 4
81 82 81 0.303649 0.143872 4
82 82 82 0.303214 0.113151 4
[428 rows x 5 columns]
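One way to read "stretch the data up to r onto a 0-100 timescale" is sketched below with np.interp. The helper name warp_unit, the 101-point output grid, and the two small sample units are assumptions made for this illustration, not part of the original answer:

```python
import numpy as np
import pandas as pd

def warp_unit(g, n_points=101):
    """Resample x and y of one unit from time [0, r] onto a 0-100 grid."""
    r = g["r"].iloc[0]
    sel = g[g["time"] <= r]                    # keep samples up to time r
    new_time = np.linspace(0, 100, n_points)   # normalized timescale
    src_time = new_time * r / 100              # matching original times
    return pd.DataFrame({
        "time": new_time,
        "x": np.interp(src_time, sel["time"], sel["x"]),
        "y": np.interp(src_time, sel["time"], sel["y"]),
        "r": r,
        "unit": g["unit"].iloc[0],
    })

# Two illustrative units with r = 80 and r = 90
frames = [
    pd.DataFrame({"time": np.arange(100), "x": np.arange(100),
                  "y": np.arange(100), "r": r, "unit": str(i)})
    for i, r in enumerate([80, 90])
]
df = pd.concat(frames, ignore_index=True)

warped = pd.concat([warp_unit(g) for _, g in df.groupby("unit")],
                   ignore_index=True)
print(warped.groupby("unit")["x"].max())
```

With x equal to time, the last warped x of each unit equals that unit's r, which is a quick sanity check that the stretch goes in the intended direction.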

how to multiply multiple columns by another column pandas

I have a DataFrame with 100 columns, and I want to multiply the columns at positions 6 to 74 by the value of one column ('Count'). Please tell me how to do that. I have tried:
df = df.ix[0, 6:74].multiply(df["Count"], axis="index")
df = df[df.columns[6:74]]*df["Count"]
Neither of them works.
The resulting DataFrame should have all 100 original columns, with columns 6 to 74 holding the multiplied values in every row.
Assuming the same DataFrame provided by @MaxU.
Not easier, but a perspective on how to use other API elements:
pd.DataFrame.update and pd.DataFrame.mul
df.update(df.iloc[:, 3:7].mul(df.Count, 0))
df
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 15.366436 1.355862 7.231264 4.971494 12 70 69 0.225977
1 49 1 38 1.004190 1.095480 2.829990 0.273870 57 93 64 0.030430
2 2 53 49 49.749460 50.379200 54.157640 16.373240 22 31 41 0.629740
3 38 44 23 28.437516 73.545300 41.185368 73.545300 19 99 57 0.980604
4 45 2 60 10.093230 4.773825 10.502415 6.274170 43 63 55 0.136395
5 65 97 15 10.375760 57.066680 38.260615 14.915155 68 5 21 0.648485
6 95 90 45 52.776000 16.888320 22.517760 50.664960 76 32 75 0.703680
7 60 31 65 63.242210 2.976104 26.784936 38.689352 72 73 94 0.744026
8 64 96 96 7.505370 37.526850 11.007876 10.007160 68 56 39 0.500358
9 78 54 74 8.409275 25.227825 16.528575 9.569175 97 63 37 0.289975
Demo:
Sample DF:
In [6]: df = pd.DataFrame(np.random.randint(100,size=(10,10))) \
.assign(Count=np.random.rand(10))
In [7]: df
Out[7]:
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 68 6 32 22 12 70 69 0.225977
1 49 1 38 33 36 93 9 57 93 64 0.030430
2 2 53 49 79 80 86 26 22 31 41 0.629740
3 38 44 23 29 75 42 75 19 99 57 0.980604
4 45 2 60 74 35 77 46 43 63 55 0.136395
5 65 97 15 16 88 59 23 68 5 21 0.648485
6 95 90 45 75 24 32 72 76 32 75 0.703680
7 60 31 65 85 4 36 52 72 73 94 0.744026
8 64 96 96 15 75 22 20 68 56 39 0.500358
9 78 54 74 29 87 57 33 97 63 37 0.289975
Let's multiply columns 3-6 by df['Count']:
In [8]: df.iloc[:, 3:6+1]
Out[8]:
3 4 5 6
0 68 6 32 22
1 33 36 93 9
2 79 80 86 26
3 29 75 42 75
4 74 35 77 46
5 16 88 59 23
6 75 24 32 72
7 85 4 36 52
8 15 75 22 20
9 29 87 57 33
# axis=0 multiplies row by row; a plain `*= df['Count']` would align Count's index with the column labels instead
In [9]: df[df.columns[3:6+1]] = df.iloc[:, 3:6+1].mul(df['Count'], axis=0)
In [10]: df
Out[10]:
0 1 2 3 4 5 6 7 8 9 Count
0 89 38 89 15.366436 1.355862 7.231264 4.971494 12 70 69 0.225977
1 49 1 38 1.004190 1.095480 2.829990 0.273870 57 93 64 0.030430
2 2 53 49 49.749460 50.379200 54.157640 16.373240 22 31 41 0.629740
3 38 44 23 28.437516 73.545300 41.185368 73.545300 19 99 57 0.980604
4 45 2 60 10.093230 4.773825 10.502415 6.274170 43 63 55 0.136395
5 65 97 15 10.375760 57.066680 38.260615 14.915155 68 5 21 0.648485
6 95 90 45 52.776000 16.888320 22.517760 50.664960 76 32 75 0.703680
7 60 31 65 63.242210 2.976104 26.784936 38.689352 72 73 94 0.744026
8 64 96 96 7.505370 37.526850 11.007876 10.007160 68 56 39 0.500358
9 78 54 74 8.409275 25.227825 16.528575 9.569175 97 63 37 0.289975
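The easiest thing to do here would be to extract the values, multiply, and then assign back.
cols = df.columns[6:74]      # use 6:75 if position 74 should be included
u = df[cols].values          # 2-D block of the selected columns
v = df[['Count']].values     # Count as an (n, 1) column so it broadcasts per row
df[cols] = u * v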
By using combine_first
df.iloc[:, 3:6+1].mul(df['Count'],axis=0).combine_first(df)
You need to concatenate the multiplied columns with the remaining columns, slicing by column position and keeping the original column order:
df = pd.concat([df.iloc[:, :6], df.iloc[:, 6:74+1].multiply(df['Count'], axis=0), df.iloc[:, 75:]], axis=1)
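One pitfall worth checking when multiplying a block of columns by a Series: the * operator aligns the Series index with the column labels, while .mul(..., axis=0) aligns it with the rows. A small sketch on made-up data (not the frame from the answers above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape(10, 10)).assign(
    Count=np.linspace(0.1, 1.0, 10)
)
block = df.iloc[:, 3:7]

# Plain *: Count's index labels (0..9) are matched against column labels,
# so column 3 is scaled by Count[3], column 4 by Count[4], and so on
by_label = block * df["Count"]

# mul with axis=0: Count is matched against the row index,
# so row i is scaled by Count[i]
by_row = block.mul(df["Count"], axis=0)

print(by_label[3].head(3))
print(by_row[3].head(3))
```

For the question as asked (each row scaled by its own Count), the axis=0 form is the one that applies.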

Axis argument to .loc() to interpret the passed slicers on a axis=1

The documentation suggests:
You can also specify the axis argument to .loc to interpret the passed
slicers on a single axis.
However I get an error trying to slice along the column index.
import pandas as pd
import numpy as np
cols= [(yr,m) for yr in [2014,2015] for m in [7,8,9,10]]
df = pd.DataFrame(np.random.randint(1,100,(10,8)),index=tuple('ABCDEFGHIJ'))
df.columns =pd.MultiIndex.from_tuples(cols)
print(df.head())
2014 2015
7 8 9 10 7 8 9 10
A 68 51 6 48 24 3 4 85
B 79 75 68 62 19 40 63 45
C 60 15 32 32 37 95 56 38
D 4 54 81 50 13 64 65 13
E 78 21 84 1 83 18 39 57
#This does not work as expected
print(df.loc(axis=1)[(2014,9):(2015,8)])
AssertionError: Start slice bound is non-scalar
#but an arbitrary transpose and changing axis works!
df = df.T
print(df.loc(axis=0)[(2014,9):(2015,8)])
A B C D E F G H I J
2014 9 6 68 32 81 84 60 83 39 94 93
10 48 62 32 50 1 84 18 14 92 33
2015 7 24 19 37 13 83 69 31 91 69 90
8 3 40 95 64 18 8 32 93 16 25
So I could always assign the slice and re-transpose.
That though feels like a hack and the axis=1 setting should have worked.
df = df.loc(axis=0)[(2014,9):(2015,8)]
df = df.T
print(df)
2014 2015
9 10 7 8
A 64 98 99 87
B 43 36 22 84
C 32 78 86 66
D 67 8 34 73
E 83 54 96 33
F 18 83 36 71
G 13 25 76 8
H 69 4 99 84
I 3 52 50 62
J 67 60 9 49
That might be a bug. Please post an issue on GitHub. The canonical way to select things is to fully specify all the axes.
In [6]: df.loc[:,(2014,9):(2015,8)]
Out[6]:
2014 2015
9 10 7 8
A 26 2 44 69
B 41 7 5 1
C 8 27 23 22
D 54 72 81 93
E 18 23 54 7
F 11 81 37 83
G 60 38 59 29
H 3 95 89 96
I 6 9 77 9
J 90 92 10 32
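A self-contained version of the fully specified form, as a sketch with deterministic values instead of the random data above. It assumes the column MultiIndex is sorted, which from_product gives for these inputs:

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([[2014, 2015], [7, 8, 9, 10]])
df = pd.DataFrame(np.arange(80).reshape(10, 8),
                  index=list("ABCDEFGHIJ"), columns=cols)

# Rows: everything; columns: from (2014, 9) through (2015, 8), inclusive
subset = df.loc[:, (2014, 9):(2015, 8)]
print(subset.head())
```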
