how to use custom calculation based on two dataframes in python

how to use custom calculation based on two dataframes in python - python

I have 2 dataframes as below,
df1
index X1 X2 X3 X4
0 6 10 6 7
1 8 9 11 13
2 12 13 15 11
3 8 11 7 6
4 11 7 6 6
5 13 14 11 10
df2
index Y
0 20
1 14
2 17
3 14
4 15
5 20
I want to get 3rd dataframe such that new X(i) = Y - (Y/X(i))
index X1 X2 X3 X4
0 16.67 18.00 16.67 17.14
1 12.25 12.44 12.73 12.92
2 15.58 15.69 15.87 15.45
3 12.25 12.73 12.00 11.67
4 13.64 12.86 12.50 12.50
5 18.46 18.57 18.18 18.00
Please note, it shall match the index number of df1 and df2 while calculating.
Thanks in advance!

You can use numpy for vectorized solution i.e
df[df.columns] = (df2.values - df2.values/df.values)
X1 X2 X3 X4
index
0 16.666667 18.000000 16.666667 17.142857
1 12.250000 12.444444 12.727273 12.923077
2 15.583333 15.692308 15.866667 15.454545
3 12.250000 12.727273 12.000000 11.666667
4 13.636364 12.857143 12.500000 12.500000
5 18.461538 18.571429 18.181818 18.000000

Base on pandas
-(1/df1.div(df2.Y,0)).sub(df2.Y,0)
Out[634]:
X1 X2 X3 X4
index
0 16.666667 18.000000 16.666667 17.142857
1 12.250000 12.444444 12.727273 12.923077
2 15.583333 15.692308 15.866667 15.454545
3 12.250000 12.727273 12.000000 11.666667
4 13.636364 12.857143 12.500000 12.500000
5 18.461538 18.571429 18.181818 18.000000

Related

How to use each vector entry to fill NAN's of a separate groups in a dataframe

Say I have a vector ValsHR which looks like this:
valsHR=[78.8, 82.3, 91.0]
And I have a dataframe MainData
Age Patient HR
21 1 NaN
21 1 NaN
21 1 NaN
30 2 NaN
30 2 NaN
24 3 NaN
24 3 NaN
24 3 NaN
I want to fill the NaNs so that the first value in valsHR will only fill in the NaNs for patient 1, the second will fill the NaNs for patient 2 and the third will fill in for patient 3.
So far I've tried using this:
mainData['HR'] = mainData['HR'].fillna(ValsHR) but it fills all the NaNs with the first value in the vector.
I've also tried to use this:
mainData['HR'] = mainData.groupby('Patient').fillna(ValsHR) fills the NaNs with values that aren't in the valsHR vector at all.
I was wondering if anyone knew a way to do this?

Create dictionary by Patient values with missing values, map to original column and replace missing values only:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- value is not replaced
4 30 2 NaN
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
If some groups has no NaNs:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- group 2 is not replaced
4 30 2 100.0 <- group 2 is not replaced
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 100.0
5 24 3 82.3
6 24 3 82.3
7 24 3 82.3

It is simply mapping, if all of NaN should be replaced
import pandas as pd
from io import StringIO
valsHR=[78.8, 82.3, 91.0]
vals = {i:k for i,k in enumerate(valsHR, 1)}
df = pd.read_csv(StringIO("""Age Patient
21 1
21 1
21 1
30 2
30 2
24 3
24 3
24 3"""), sep="\s+")
df["HR"] = df["Patient"].map(vals)
>>> df
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 82.3
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0

Reshape DataFrame from long to wide along one column

I am looking for a way to reconfigure Table A show below into Table B.
Table A:
type x1 x2 x3
A 4 6 9
A 7 4 1
A 9 6 2
B 1 3 8
B 2 7 9
transformed into Table B:
type x1 x2 x3 x1' x2' x3' x1'' x2'' x3''
A 4 6 9 7 4 1 9 6 2
B 1 3 8 2 7 9 NA NA NA
The real Table A is over 150000 rows and 36 Columns. With 2100 unique "type" values.

You can set the index appropriately and then unstack:
df
type x1 x2 x3
0 A 4 6 9
1 A 7 4 1
2 A 9 6 2
3 B 1 3 8
4 B 2 7 9
res = (df.set_index(['type', df.groupby('type').cumcount()])
.unstack()
.sort_index(level=-1, axis=1))
res.columns = res.columns.map(lambda x: x[0] + "'" * int(x[1]))
res
x1 x2 x3 x1' x2' x3' x1'' x2'' x3''
type
A 4.0 6.0 9.0 7.0 4.0 1.0 9.0 6.0 2.0
B 1.0 3.0 8.0 2.0 7.0 9.0 NaN NaN NaN

pandas pct_change() in reverse

Suppose we have a dataframe and we calculate as percent change between rows
y_axis = [1,2,3,4,5,6,7,8,9]
x_axis = [100,105,115,95,90,88,110,100,0]
DF = pd.DataFrame({'Y':y_axis, 'X':x_axis})
DF = DF[['Y','X']]
DF['PCT'] = DF['X'].pct_change()
Y X PCT
0 1 100 NaN
1 2 105 0.050000
2 3 115 0.095238
3 4 95 -0.173913
4 5 90 -0.052632
5 6 88 -0.022222
6 7 110 0.250000
7 8 100 -0.090909
8 9 0 -1.000000
That way it starts from the first row.
I want calculate pct_change() starting from the last row.
One way to do it
DF['Reverse'] = list(reversed(x_axis))
DF['PCT_rev'] = DF['Reverse'].pct_change()
pct_rev = DF.PCT_rev.tolist()
DF['_PCT_'] = list(reversed(pct_rev))
DF2 = DF[['Y','X','PCT','_PCT_']]
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
But that is a very ugly and inefficient solution.
I was wondering if there are more elegant solutions?

DF.assign(_PCT_=DF.X.pct_change(-1))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
Series.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
periods : int, default 1 Periods to shift for forming percent change
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.pct_change.html

I deleted my other answer because #su79eu7k 's is way better.
You can cut your time in half by using the underlying arrays. But you also have to suppress a warning.
a = DF.X.values
DF.assign(_PCT_=np.append((a[:-1] - a[1:]) / a[1:], np.nan))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN

sort pandas dataframe indeces

I have a dataframe df which indeces are
df.index
Out[4]:
Index([u'2015-03-28_p001_2', u'2015-03-29_p001_2',
u'2015-03-30_p001_2', u'2015-03-31_p001_2',
u'2015-03-31_p002_3', u'2015-04-01_p001_2',
u'2015-04-01_p002_3', u'2015-04-02_p001_2',
u'2015-04-02_p002_3', u'2015-04-03_p001_2',
...
u'2016-03-31_p127_1', u'2016-04-01_p127_1',
u'2016-04-01_p128_3', u'2016-04-02_p127_1',
u'2016-04-02_p128_3', u'2016-04-03_p127_1',
u'2016-04-03_p128_3', u'2016-04-04_p127_1',
u'2016-04-05_p127_1', u'2016-04-06_p127_1'],
dtype='object', length=781)
The dataframe df is the results of a merge of 2 dataframes.
As you can see from the indeces are not sorted. E.g. '2015-03-31_p002_3'(5th position) comes before '2015-04-01_p001_2' (6th position)
I would like to group together all the _p001_2 and sort them according to the date, then all the _p002_3, etc etc.
But I didnt manage to do it...

If sort_index is not possible use, then it is a bit complicated - need create helper DataFrame by split, then sort_values and last reindex:
idx = pd.Index([u'2015-03-28_p001_2', u'2015-03-29_p001_2',
u'2015-03-30_p001_2', u'2015-03-31_p001_2',
u'2015-03-31_p002_3', u'2015-04-01_p001_2',
u'2015-04-01_p002_3', u'2015-04-02_p001_2',
u'2015-04-02_p002_3', u'2015-04-03_p001_2',
u'2016-03-31_p127_1', u'2016-04-01_p127_1',
u'2016-04-01_p128_3', u'2016-04-02_p127_1',
u'2016-04-02_p128_3', u'2016-04-03_p127_1',
u'2016-04-03_p128_3', u'2016-04-04_p127_1',
u'2016-04-05_p127_1', u'2016-04-06_p127_1'])
df = pd.DataFrame({'a':range(len(idx))}, index=idx)
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-03-31_p002_3 4
2015-04-01_p001_2 5
2015-04-01_p002_3 6
2015-04-02_p001_2 7
2015-04-02_p002_3 8
2015-04-03_p001_2 9
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-01_p128_3 12
2016-04-02_p127_1 13
2016-04-02_p128_3 14
2016-04-03_p127_1 15
2016-04-03_p128_3 16
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
df = df.sort_index()
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-03-31_p002_3 4
2015-04-01_p001_2 5
2015-04-01_p002_3 6
2015-04-02_p001_2 7
2015-04-02_p002_3 8
2015-04-03_p001_2 9
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-01_p128_3 12
2016-04-02_p127_1 13
2016-04-02_p128_3 14
2016-04-03_p127_1 15
2016-04-03_p128_3 16
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
df1 = df.index.to_series().str.split('_', expand=True)
df1[0] = pd.to_datetime(df1[0])
#if necessary change order columns for sorting
df1 = df1.sort_values(by=[1,2,0])
print (df1)
0 1 2
2015-03-28_p001_2 2015-03-28 p001 2
2015-03-29_p001_2 2015-03-29 p001 2
2015-03-30_p001_2 2015-03-30 p001 2
2015-03-31_p001_2 2015-03-31 p001 2
2015-04-01_p001_2 2015-04-01 p001 2
2015-04-02_p001_2 2015-04-02 p001 2
2015-04-03_p001_2 2015-04-03 p001 2
2015-03-31_p002_3 2015-03-31 p002 3
2015-04-01_p002_3 2015-04-01 p002 3
2015-04-02_p002_3 2015-04-02 p002 3
2016-03-31_p127_1 2016-03-31 p127 1
2016-04-01_p127_1 2016-04-01 p127 1
2016-04-02_p127_1 2016-04-02 p127 1
2016-04-03_p127_1 2016-04-03 p127 1
2016-04-04_p127_1 2016-04-04 p127 1
2016-04-05_p127_1 2016-04-05 p127 1
2016-04-06_p127_1 2016-04-06 p127 1
2016-04-01_p128_3 2016-04-01 p128 3
2016-04-02_p128_3 2016-04-02 p128 3
2016-04-03_p128_3 2016-04-03 p128 3
df = df.reindex(df1.index)
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-04-01_p001_2 5
2015-04-02_p001_2 7
2015-04-03_p001_2 9
2015-03-31_p002_3 4
2015-04-01_p002_3 6
2015-04-02_p002_3 8
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-02_p127_1 13
2016-04-03_p127_1 15
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
2016-04-01_p128_3 12
2016-04-02_p128_3 14
2016-04-03_p128_3 16
EDIT:
If duplicates, then is necessary create new columns, sort and last drop them:
df[[0,1,2]] = df.index.to_series().str.split('_', expand=True)
df[0] = pd.to_datetime(df[0])
df = df.sort_values(by=[1,2,0])
df = df.drop([0,1,2], axis=1)
print (df)
a
2015-03-28_p001_2 0
2015-03-29_p001_2 1
2015-03-30_p001_2 2
2015-03-31_p001_2 3
2015-04-01_p001_2 5
2015-04-02_p001_2 7
2015-04-03_p001_2 9
2015-03-31_p002_3 4
2015-04-01_p002_3 6
2015-04-02_p002_3 8
2016-03-31_p127_1 10
2016-04-01_p127_1 11
2016-04-02_p127_1 13
2016-04-03_p127_1 15
2016-04-04_p127_1 17
2016-04-05_p127_1 18
2016-04-06_p127_1 19
2016-04-01_p128_3 12
2016-04-02_p128_3 14
2016-04-03_p128_3 16

How to Reverse Rolling Sum?

I have a rolling sum calculated on a grouped data frame but its adding up the wrong way, it is a sum of the future, when I need a sum of the past.
What am I doing wrong here?
I import the data and sort by Dimension and Date (I have tried removing the date sort already)
df = pd.read_csv('Input.csv', parse_dates=True)
df.sort_values(['Dimension','Date'])
print(df)
I then create a new column which is a multi index grouped by rolling window
new_column = df.groupby('Dimension').Value1.apply(lambda x:
x.rolling(window=3).sum())
I then reset the index to be the same as the original
df['Sum_Value1'] = new_column.reset_index(level=0, drop=True)
print(df)
I have also tried reversing the index before the calculation, but that also failed.
Input
Dimension,Date,Value1,Value2
1,4/30/2002,10,20
1,1/31/2002,10,20
1,10/31/2001,10,20
1,7/31/2001,10,20
1,4/30/2001,10,20
1,1/31/2001,10,20
1,10/31/2000,10,20
2,4/30/2002,10,20
2,1/31/2002,10,20
2,10/31/2001,10,20
2,7/31/2001,10,20
2,4/30/2001,10,20
2,1/31/2001,10,20
2,10/31/2000,10,20
3,4/30/2002,10,20
3,1/31/2002,10,20
3,10/31/2001,10,20
3,7/31/2001,10,20
3,1/31/2001,10,20
3,10/31/2000,10,20
Output:
Dimension Date Value1 Value2 Sum_Value1
0 1 4/30/2002 10 20 NaN
1 1 1/31/2002 10 20 NaN
2 1 10/31/2001 10 20 30.0
3 1 7/31/2001 10 20 30.0
4 1 4/30/2001 10 20 30.0
5 1 1/31/2001 10 20 30.0
6 1 10/31/2000 10 20 30.0
7 2 4/30/2002 10 20 NaN
8 2 1/31/2002 10 20 NaN
9 2 10/31/2001 10 20 30.0
10 2 7/31/2001 10 20 30.0
11 2 4/30/2001 10 20 30.0
12 2 1/31/2001 10 20 30.0
13 2 10/31/2000 10 20 30.0
Goal Output:
Dimension Date Value1 Value2 Sum_Value1
0 1 4/30/2002 10 20 30.0
1 1 1/31/2002 10 20 30.0
2 1 10/31/2001 10 20 30.0
3 1 7/31/2001 10 20 30.0
4 1 4/30/2001 10 20 30.0
5 1 1/31/2001 10 20 NaN
6 1 10/31/2000 10 20 NaN
7 2 4/30/2002 10 20 30.0
8 2 1/31/2002 10 20 30.0
9 2 10/31/2001 10 20 30.0
10 2 7/31/2001 10 20 30.0
11 2 4/30/2001 10 20 30.0
12 2 1/31/2001 10 20 Nan
13 2 10/31/2000 10 20 NaN

You need a backward sum, therefore reverse your series before sum rolling it:
lambda x: x[::-1].rolling(window=3).sum()

You can shift the result by window-1 to get the left aligned results:
df["sum_value1"] = (df.groupby('Dimension').Value1
.apply(lambda x: x.rolling(window=3).sum().shift(-2)))

Rolling backwards is the same as rolling forward and then shifting the result:
x.rolling(window=3).sum().shift(-2)

def reverse_rolling(series, window, func):
index = series.index
series = pd.DataFrame(series.iloc[::-1])
series = series.rolling(window, 1).apply(func)
series = series.iloc[::-1]
series['index'] = index
series = series.set_index('index')
return series[0]

You can use
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer
df = pd.read_csv(r'C:\Users\xxxx\python\data.txt')
indexer = FixedForwardWindowIndexer(window_size=3)
df1 = df.join(df.groupby('Dimension')['Value1'].rolling(indexer, min_periods=3).sum().to_frame().reset_index(), rsuffix='_sum')
del df1['Dimension_sum']
del df1['level_1']
df1
Input:
Dimension Date Value1 Value2
0 1 4/30/2002 10 20
1 1 1/31/2002 10 20
2 1 10/31/2001 10 20
3 1 7/31/2001 10 20
4 1 4/30/2001 10 20
5 1 1/31/2001 10 20
6 1 10/31/2000 10 20
7 2 4/30/2002 10 20
8 2 1/31/2002 10 20
9 2 10/31/2001 10 20
10 2 7/31/2001 10 20
11 2 4/30/2001 10 20
12 2 1/31/2001 10 20
13 2 10/31/2000 10 20
14 3 4/30/2002 10 20
15 3 1/31/2002 10 20
16 3 10/31/2001 10 20
17 3 7/31/2001 10 20
18 3 1/31/2001 10 20
19 3 10/31/2000 10 20
OUTPUT:
Dimension Date Value1 Value2 Value1_sum
0 1 4/30/2002 10 20 30.0
1 1 1/31/2002 10 20 30.0
2 1 10/31/2001 10 20 30.0
3 1 7/31/2001 10 20 30.0
4 1 4/30/2001 10 20 30.0
5 1 1/31/2001 10 20 NaN
6 1 10/31/2000 10 20 NaN
7 2 4/30/2002 10 20 30.0
8 2 1/31/2002 10 20 30.0
9 2 10/31/2001 10 20 30.0
10 2 7/31/2001 10 20 30.0
11 2 4/30/2001 10 20 30.0
12 2 1/31/2001 10 20 NaN
13 2 10/31/2000 10 20 NaN
14 3 4/30/2002 10 20 30.0
15 3 1/31/2002 10 20 30.0
16 3 10/31/2001 10 20 30.0
17 3 7/31/2001 10 20 30.0
18 3 1/31/2001 10 20 NaN
19 3 10/31/2000 10 20 NaN

Just had to do the same thing myself and came up with a simple one-liner:
df['Sum_Value1'] = df['Value1'].iloc[::-1].rolling(window = 3).sum()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

how to use custom calculation based on two dataframes in python - python

Related

How to use each vector entry to fill NAN's of a separate groups in a dataframe

Reshape DataFrame from long to wide along one column

pandas pct_change() in reverse

sort pandas dataframe indeces

How to Reverse Rolling Sum?

Categories

Resources