Create new dataframe column from a shifted existing column - python

I have a dataframe with open, high, low, close prices of a stock. I want to add an additional column that has the percent change between today's open and yesterday's high. This is my current implementation, however, the resulting column contains percent changes between the current day's high and open.
df
open high low close
0 100 110 95 103
1 103 113 103 111
2 111 132 109 124
3 124 136 114 130
My attempt (incorrect):
df['prevhigh_curropen'] = (df['open'] - df['high']).shift(-1) / df['high'].shift(-1)
Output (incorrect):
open high low close prevhigh_curropen
0 100 110 95 103 -0.091
1 103 113 103 111 -0.089
2 111 132 109 124 -0.159
3 124 136 114 130 -0.088
Desired output:
open high low close prevhigh_curropen
0 100 110 95 103 nan
1 103 113 103 111 -0.064
2 111 132 109 124 -0.018
3 124 136 114 130 -0.061
Is there a non-iterative way to do this like I attempted above?

Your formula is wrong, you have to use df['high'].shift():
df = pd.DataFrame({'open': range(1, 11), 'high': range(1, 11)})
df['prevhigh_curropen'] = df['open'].sub(df['high'].shift()) \
.div(df['high'].shift()) \
.mul(100)
>>> df
open high prevhigh_curropen
0 1 1 NaN
1 2 2 100.000000
2 3 3 50.000000
3 4 4 33.333333
4 5 5 25.000000
5 6 6 20.000000
6 7 7 16.666667
7 8 8 14.285714
8 9 9 12.500000
9 10 10 11.111111
For your sample the output is:
>>> df
open high low close prevhigh_curropen
0 100 110 95 103 NaN
1 103 113 103 111 -6.363636
2 111 132 109 124 -1.769912
3 124 136 114 130 -6.060606
The first value is NaN because we don't know the high value from the previous day.

We can simplify the terms slightly from (a - b) / b to (a / b) - (b / b) to (a / b) - 1.
Mathematical Operators:
df['prevhigh_curropen'] = (df['open'] / df['high'].shift()) - 1
or with Series Methods:
df['prevhigh_curropen'] = df['open'].div(df['high'].shift()).sub(1)
*The benefit here is that we only need to shift once, and maintain 1 copy of df['high'].shift()
Resulting df:
open high low close prevhigh_curropen
0 100 110 95 103 NaN
1 103 113 103 111 -0.063636
2 111 132 109 124 -0.017699
3 124 136 114 130 -0.060606
Setup Used:
import pandas as pd
df = pd.DataFrame({
'open': [100, 103, 111, 124],
'high': [110, 113, 132, 136],
'low': [95, 103, 109, 114],
'close': [103, 111, 124, 130]
})

Related

concat result of apply in python

I am trying to apply a function on a column of a dataframe.
After getting multiple results as dataframes, I want to concat them all in one.
Why does the first option work and the second not?
import numpy as np
import pandas as pd
def testdf(n):
test = pd.DataFrame(np.random.randint(0,n*100,size=(n*3, 3)), columns=list('ABC'))
test['index'] = n
return test
test = pd.DataFrame({'id': [1,2,3,4]})
testapply = test['id'].apply(func = testdf)
#option 1
pd.concat([testapply[0],testapply[1],testapply[2],testapply[3]])
#option2
pd.concat([testapply])
pd.concat expects a sequence of pandas objects, but your #2 case/option passes a sequence of single pd.Series object that contains multiple dataframes, so it doesn't make concatenation - you just get that series as is.To fix your 2nd approach use unpacking:
print(pd.concat([*testapply]))
A B C index
0 91 15 91 1
1 93 85 91 1
2 26 87 74 1
0 195 103 134 2
1 14 26 159 2
2 96 143 9 2
3 18 153 35 2
4 148 146 130 2
5 99 149 103 2
0 276 150 115 3
1 232 126 91 3
2 37 242 234 3
3 144 73 81 3
4 96 153 145 3
5 144 94 207 3
6 104 197 49 3
7 0 93 179 3
8 16 29 27 3
0 390 74 379 4
1 78 37 148 4
2 350 381 260 4
3 279 112 260 4
4 115 387 173 4
5 70 213 378 4
6 43 37 149 4
7 240 399 117 4
8 123 0 47 4
9 255 172 1 4
10 311 329 9 4
11 346 234 374 4

How to group column which respects certain conditions

I try since this afternoon to group column which respects certain conditions. I giva an easy example, I got 3 column like it :
ID1_column_A ID2_column_B ID2_column_C
234 100 10
334 130 11
34 250 40
34 200 25
My aim is to group column who start per the same ID, so here, I will have only 2 column in output :
ID1_column_A Fusion_B_C
234 110
334 141
34 290
34 225
thanks for reading me
IIUC, you can try this:
df.groupby(df.columns.str.split('_').str[0], axis=1).sum()
Output:
ID1 ID2
0 234 110
1 334 141
2 34 290
3 34 225
import pandas as pd
data = pd.DataFrame({'ID1_column_A': [234, 334, 34, 34], 'ID2_column_B': [100, 130, 250, 200], 'ID2_column_C': [10, 11, 40, 25]})
data['Fusion_B_C'] = data.loc[:, 'ID2_column_B':'ID2_column_C'].sum(axis=1)
data.drop(columns=['ID2_column_B', 'ID2_column_C'], inplace=True)#Deleting unnecessary columns
print(data)
Output
ID1_column_A Fusion_B_C
0 234 110
1 334 141
2 34 290
3 34 225

Compare each row of Pandas df1 with every row within df2 and return string value from closest matching column

I have two data frames.
df1 includes 4 men and 4 women with their weight and height (inches).
#df1
John, 236, 76
Jack, 204, 74
Jim, 156, 71
Jared, 182, 72
Suzy, 119, 60
Sally, 149, 66
Sharon, 169, 65
Sammy, 182, 75
df2 includes 4 men and 4 women with their weight and height (inches).
#df2
Aaron, 285, 77
Abe, 236, 75
Alex, 178, 72
Adam, 195, 71
Mary, 148, 66
Maylee, 155, 66
Marilyn, 199, 65
Madison, 160, 73
What I am trying to do is have men from df1 be compared to men from df2 to see who they are most like based on height and weight. Just subtract weight from weight and height from height and return an absolute value for each man in df2. More specifically, return the name of the man most similar.
So in this case John's closest match is Abe so in a new column
df1['doppelganger'] = "Abe".
I'm a beginner hobbyist so even pointing me in the right direction would be helpful. I've been looking through stack overflow for about five hours trying to figure out how to go about something like this.
First is necessary distinguish men and women, here is used new column with repeat 4 times m and f. Then is used DataFrame.merge with outer join by new column for all combinations and created new columns for differences, last column is sum of them. then sorting by 3 columns by DataFrame.sort_values, so first row per groups by A and g are filtered by DataFrame.drop_duplicates:
df = (df1.assign(g = ['m']*4 + ['f']*4)
.merge(df2.assign(g = ['m']*4 + ['f']*4), on='g', how='outer', suffixes=('','_'))
.assign(dif1 = lambda x: x['B'].sub(x['B_']).abs(),
dif2 = lambda x: x['C'].sub(x['C_']).abs(),
sumdiff = lambda x: x['dif1'] + x['dif2'])
.sort_values(['A', 'g','sumdiff'])
.drop_duplicates(['A','g'])
.sort_index()
.rename(columns={'A_':'doppelganger'})
)
print (df)
A B C g doppelganger B_ C_ dif1 dif2 sumdiff
1 John 236 76 m Abe 236 75 0 1 1
7 Jack 204 74 m Adam 195 71 9 3 12
10 Jim 156 71 m Alex 178 72 22 1 23
14 Jared 182 72 m Alex 178 72 4 0 4
16 Suzy 119 60 f Mary 148 66 29 6 35
20 Sally 149 66 f Mary 148 66 1 0 1
25 Sharon 169 65 f Maylee 155 66 14 1 15
31 Sammy 182 75 f Madison 160 73 22 2 24
Input DataFrames:
print (df1)
A B C
0 John 236 76
1 Jack 204 74
2 Jim 156 71
3 Jared 182 72
4 Suzy 119 60
5 Sally 149 66
6 Sharon 169 65
7 Sammy 182 75
print (df2)
A B C
0 Aaron 285 77
1 Abe 236 75
2 Alex 178 72
3 Adam 195 71
4 Mary 148 66
5 Maylee 155 66
6 Marilyn 199 65
7 Madison 160 73

use result of a function applied to groupby for calculation on the original df

I am having some data which look like as shown below df.
I am trying to calculate first the mean angle for each group using the function mean_angle. The calculated mean angle is then used to do another calculation per group using the function fun.
import pandas as pd
import numpy as np
generate sample data
a = np.array([1,2,3,4]).repeat(4)
x1 = 90 + np.random.randint(-15, 15, size=a.size//2 - 2 )
x2 = 270 + np.random.randint(-50, 50, size=a.size//2 + 2 )
b = np.concatenate((x1, x2))
np.random.shuffle(b)
df = pd.DataFrame({'a':a, 'b':b})
The returned dataframe is printed below.
a b
0 1 295
1 1 78
2 1 280
3 1 94
4 2 308
5 2 227
6 2 96
7 2 299
8 3 248
9 3 288
10 3 81
11 3 78
12 4 103
13 4 265
14 4 309
15 4 229
My functions are mean_angle and fun
def mean_angle(deg):
deg = np.deg2rad(deg)
deg = deg[~np.isnan(deg)]
S = np.sum(np.sin(deg))
C = np.sum(np.cos(deg))
mu = np.arctan2(S,C)
mu = np.rad2deg(mu)
if mu <0:
mu = 360 + mu
return mu
def fun(x, mu):
return np.where(abs(mu - x) < 45, x, np.where(x+180<360, x+180, x-180))
what I have tried
mu = df.groupby(['a'])['b'].apply(mean_angle)
df2 = df.groupby(['a'])['b'].apply(fun, args = (mu,)) #this function should be element wise
I know it is totally wrong but I could not come up with a better way.
The desired output is something like this where mu the mean_angle per group
a b c
0 1 295 np.where(abs(mu - 295) < 45, 295, np.where(295 +180<360, 295 +180, 295 -180))
1 1 78 np.where(abs(mu - 78) < 45, 78, np.where(78 +180<360, 78 +180, 78 -180))
2 1 280 np.where(abs(mu - 280 < 45, 280, np.where(280 +180<360, 280 +180, 280 -180))
3 1 94 ...
4 2 308 ...
5 2 227 .
6 2 96 .
7 2 299 .
8 3 248 .
9 3 288 .
10 3 81 .
11 3 78 .
12 4 103 .
13 4 265 .
14 4 309 .
15 4 229 .
Any help is appreciated
You don't need your second function, just pass the necessary columns to np.where(). So creating your dataframe in the same manner and not modifying your mean_angle function, we have the following sample dataframe:
a b
0 1 228
1 1 291
2 1 84
3 1 226
4 2 266
5 2 311
6 2 82
7 2 274
8 3 79
9 3 250
10 3 222
11 3 88
12 4 80
13 4 291
14 4 100
15 4 293
Then create your c column (containing your mu values) using groupby() and transform(), and finally apply your np.where() logic:
df['c'] = df.groupby(['a'])['b'].transform(mean_angle)
df['c'] = np.where(abs(df['c'] - df['b']) < 45, df['b'], np.where(df['b']+180<360, df['b']+180, df['b']-180))
Yields:
a b c
0 1 228 228
1 1 291 111
2 1 84 264
3 1 226 226
4 2 266 266
5 2 311 311
6 2 82 262
7 2 274 274
8 3 79 259
9 3 250 70
10 3 222 42
11 3 88 268
12 4 80 260
13 4 291 111
14 4 100 280
15 4 293 113

pandas column values to row values

I have a dataset (171 columns) and when I take it into my dataframe, it looks like this way-
ANO MNO UJ2010 DJ2010 UF2010 DF2010 UM2010 DM2010 UA2010 DA2010 ...
1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209 05/04/2010 ...
2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348 05/04/2010 ...
3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151 05/04/2010 ...
Now I want to change my dataframe like this way -
ANO MNO Time Unit
1 A 06/01/2010 113
1 A 06/02/2010 129
1 A 06/03/2010 143
2 B 06/01/2010 218
2 B 06/02/2010 211
2 B 06/03/2010 244
3 C 06/01/2010 22
3 C 06/02/2010 114
3 C 06/03/2010 100
....
.....
I tried to use pd.melt, but I think it does not fullfil my purpose. How can I do this?
Use pd.lreshape as a close alternative to pd.melt after filtering the columns to be grouped under the distinct headers.
Through the use of pd.lreshape, when you inject a dictionary object as it's groups parameter, the keys would take on the new header name and all the list of column names fed as values to this dict would be cast under that single header. Thus, it produces a long formatted DF after the transformation.
Finally sort the DF w.r.t the unused columns to align these accordingly.
Then, a reset_index(drop=True) at the end to relabel the index axis to the default integer values by dropping off the intermediate index.
d = pd.lreshape(df, {"Time": df.filter(regex=r'^D').columns,
"Unit": df.filter(regex=r'^U').columns})
d.sort_values(['ANO', 'MNO']).reset_index(drop=True)
If there's a mismatch in the length of the grouping columns, then:
from itertools import groupby, chain
unused_cols = ['ANO', 'MNO']
cols = df.columns.difference(unused_cols)
# filter based on the common strings starting from the first slice upto end.
fnc = lambda x: x[1:]
pref1, pref2 = "D", "U"
# Obtain groups based on a common interval of slices.
groups = [list(g) for n, g in groupby(sorted(cols, key=fnc), key=fnc)]
# Fill single length list with it's other char counterpart.
fill_missing = [i if len(i)==2 else i +
[pref1 + i[0][1:] if i[0][0] == pref2 else pref2 + i[0][1:]]
for i in groups]
# Reindex based on newly obtained column names.
df = df.reindex(columns=unused_cols + list(chain(*fill_missing)))
Continue the same steps with pd.lreshape as mentioned above but this time with dropna=False parameter included.
You can reshape by stack but first create MultiIndex in columns with % and //.
MultiIndex values map pairs Time and Unit to second level of MultiIndex by floor division (//) by 2, differences of each pairs are created by modulo division (%).
Then stack use last level created by // and create new level of MultiIndex in index, which is not necessary, so is removed by reset_index(level=2, drop=True).
Last reset_index for convert first and second level to columns.
[[1,0]] is for swap columns for change ordering.
df = df.set_index(['ANO','MNO'])
cols = np.arange(len(df.columns))
df.columns = [cols % 2, cols // 2]
print (df)
0 1 0 1 0 1 0 1
0 0 1 1 2 2 3 3
ANO MNO
1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209 05/04/2010
2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348 05/04/2010
3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151 05/04/2010
df = df.stack()[[1,0]].reset_index(level=2, drop=True).reset_index()
df.columns = ['ANO','MNO','Time','Unit']
print (df)
ANO MNO Time Unit
0 1 A 06/01/2010 113
1 1 A 06/02/2010 129
2 1 A 06/03/2010 143
3 1 A 05/04/2010 209
4 2 B 06/01/2010 218
5 2 B 06/02/2010 211
6 2 B 06/03/2010 244
7 2 B 05/04/2010 348
8 3 C 06/01/2010 22
9 3 C 06/02/2010 114
10 3 C 06/03/2010 100
11 3 C 05/04/2010 151
EDIT:
#last column is missing
print (df)
ANO MNO UJ2010 DJ2010 UF2010 DF2010 UM2010 DM2010 UA2010
0 1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209
1 2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348
2 3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151
df = df.set_index(['ANO','MNO'])
#MultiIndex is created by first character of column names with all another
df.columns = [df.columns.str[0], df.columns.str[1:]]
print (df)
U D U D U D U
J2010 J2010 F2010 F2010 M2010 M2010 A2010
ANO MNO
1 A 113 06/01/2010 129 06/02/2010 143 06/03/2010 209
2 B 218 06/01/2010 211 06/02/2010 244 06/03/2010 348
3 C 22 06/01/2010 114 06/02/2010 100 06/03/2010 151
#stack add missing values, replace them by NaN
df = df.stack().reset_index(level=2, drop=True).reset_index()
df.columns = ['ANO','MNO','Time','Unit']
print (df)
ANO MNO Time Unit
0 1 A NaN 209
1 1 A 06/02/2010 129
2 1 A 06/01/2010 113
3 1 A 06/03/2010 143
4 2 B NaN 348
5 2 B 06/02/2010 211
6 2 B 06/01/2010 218
7 2 B 06/03/2010 244
8 3 C NaN 151
9 3 C 06/02/2010 114
10 3 C 06/01/2010 22
11 3 C 06/03/2010 100
You can use iloc with pd.concat for this. The solution is simple - just stack all relevant columns (which are selected via iloc) vertically one after another and concatenate them:
def rename(sub_df):
sub_df.columns = ["ANO", "MNO", "Time", "Unit"]
return sub_df
pd.concat([rename(df.iloc[:, [0, 1, x+1, x]])
for x in range(2, df.shape[1], 2)])
ANO MNO Time Unit
0 1 A 06/01/2010 113
1 2 B 06/01/2010 218
2 3 C 06/01/2010 22
0 1 A 06/02/2010 129
1 2 B 06/02/2010 211
2 3 C 06/02/2010 114
0 1 A 06/03/2010 143
1 2 B 06/03/2010 244
2 3 C 06/03/2010 100
0 1 A 05/04/2010 209
1 2 B 05/04/2010 348
2 3 C 05/04/2010 151

Categories