Currently I am translating some R scripts to Python, but I am struggling with the following line:
return(trackTable[, .(
AVERAGE_WIND_COMPONENT = sum(TRACK_WIND_COMPONENT*GROUND_DIST, na.rm = T)/sum(GROUND_DIST, na.rm = T) # CHECK!!!!!
), by=KEY_COLUMN])
Now I have tried to rewrite the R code in Python:
table['temp'] = (table['track_wind_component'] * table['ground_dist']) / table['ground_dist']
AVERAGE_WIND_COMPONENT = table.groupby(['KEY_COLUMN'])['temp'].sum()
AVERAGE_WIND_COMPONENT = pd.DataFrame({'KEY_COLUMN':AVERAGE_WIND_COMPONENT.index, 'AVERAGE_WIND_COMPONENT':AVERAGE_WIND_COMPONENT.values})
But my results for AVERAGE_WIND_COMPONENT are wrong... What did I translate incorrectly here? It is probably the groupby, or the way I build my temp column.
Example df:
KEY_COLUMN track_wind_component ground_dist
0 xyz -0.000000 2.262407
1 xyz 0.000000 9.769840
2 xyz -135.378229 38.581616
3 xyz 11.971863 30.996997
4 xyz -78.208083 45.404430
5 xyz -88.718762 48.589553
6 xyz -118.302506 22.193426
7 xyz -71.033648 76.602917
8 xyz -68.369886 11.092901
9 xyz -65.706124 6.210328
10 xyz -60.822561 17.444752
11 xyz 39.630277 18.082869
12 xyz 102.477706 35.175366
13 xyz 43.061773 8.793499
14 xyz -71.036785 15.289568
15 xyz 65.246215 49.247986
16 xyz -29.249612 1.043781
17 xyz -25.848495 11.490416
18 xyz -11.223688 NaN
expected result for this KEY_COLUMN: -36.8273304
OK, your expected result makes sense now.
First, create a function that uses np.sum(); this is the equivalent of R's sum(value, na.rm = T):
def my_agg(df):
    names = {
        'result': np.sum(df['track_wind_component'] * df['ground_dist']) / np.sum(df['ground_dist'])
    }
    return pd.Series(names, index=['result'])

df.groupby('KEY_COLUMN').apply(my_agg)
out:
result
KEY_COLUMN
xyz -36.827331
What was wrong with your code:
table['temp'] = (table['track_wind_component'] * table['ground_dist']) / table['ground_dist']
# this is just creating a column that is the exact same as
# table['track_wind_component'] because, for example, (x*y)/y = x
AVERAGE_WIND_COMPONENT = table.groupby(['KEY_COLUMN'])['temp'].sum()
# you are now essentially just grouping and summing the track_wind_column
What the R code is doing is taking the sum of (table['track_wind_component'] * table['ground_dist']) divided by the sum of (table['ground_dist']), all of which is grouped by KEY_COLUMN.
The R code also ignores NaN values; that is why I used np.sum().
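As a cross-check, here is a minimal sketch of the same grouped weighted average written as a one-off lambda (it assumes the example column names from the question; the variable name avg is just for illustration):

import numpy as np
import pandas as pd

# weighted average of track_wind_component, weighted by ground_dist, per KEY_COLUMN;
# np.sum on a pandas Series skips NaN, matching R's na.rm = T
avg = (df.groupby('KEY_COLUMN')
         .apply(lambda g: np.sum(g['track_wind_component'] * g['ground_dist'])
                          / np.sum(g['ground_dist']))
         .rename('AVERAGE_WIND_COMPONENT')
         .reset_index())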
I have a set of data from my vehicle tracking system that requires me to calculate the distance based on lat and long. I understand that using the haversine formula helps to get the distance between rows, but I'm sort of stuck, as I need the distance based on two fields (Model type and Mode).
My code is shown below:
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

mydataset = pd.read_csv(x + '.txt')
print(mydataset.shape)

mydataset = mydataset.sort_values(by=['Model', 'timestamp'])  # sort
mydataset['dist'] = np.concatenate(
    mydataset.groupby(["Model"])
             .apply(lambda x: haversine(x['Latitude'], x['Longitude'],
                                        x['Latitude'].shift(), x['Longitude'].shift()))
             .values)
With this, I am able to calculate the distance between rows based on the model (by using sorting).
But I would like to take it a step further and calculate based on both Mode and Model. My fields are "Index, Model, Mode, Lat, Long, Timestamp".
Please advise! Sample data:
Index, Model, Timestamp, Long, Lat, Mode(denote as 0 or 2), Distance Calculated
1, X, 2018-01-18 09:16:37.070, 103.87772815, 1.35653496, 0, 0.0
2, X, 2018-01-18 09:16:39.071, 103.87772815, 1.35653496, 0, 0.0
3, X, 2018-01-18 09:16:41.071, 103.87772815, 1.35653496, 0, 0.0
4, X, 2018-01-18 09:16:43.071, 103.87772052, 1.35653496, 0, 0.0008481795
5, X, 2018-01-18 09:16:45.071, 103.87770526, 1.35653329, 0, 0.0017064925312804799
6, X, 2018-01-18 09:16:51.070, 103.87770526, 1.35653329, 2, 0.0
7, X, 2018-01-18 09:16:53.071, 103.87770526, 1.35653329, 2, 0.0
8, X, 2018-01-18 09:59:55.072, 103.87770526, 1.35652828, 0, 0.0005570865824842293
I need it to calculate the distance of the total journey of a model, and also the distance of the total journey of a model in each mode.
I think you need to add a DataFrame constructor to the function and then add another column name to the groupby, like ["Model", "Mode(denote as 0 or 2)"] or ["Model", "Mode"], depending on your column names:
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2
    return pd.DataFrame(earth_radius * 2 * np.arcsin(np.sqrt(a)))

mydataset['dist'] = (mydataset.groupby(["Model", "Mode(denote as 0 or 2)"])
                              .apply(lambda x: haversine(x['Lat'],
                                                         x['Long'],
                                                         x['Lat'].shift(),
                                                         x['Long'].shift())).values)

# if you need to replace NaNs with 0
mydataset['dist'] = mydataset['dist'].fillna(0)
print(mydataset)
Index Model Timestamp Long Lat \
0 1 X 2018-01-18 09:16:37.070 103.877728 1.356535
1 2 X 2018-01-18 09:16:39.071 103.877728 1.356535
2 3 X 2018-01-18 09:16:41.071 103.877728 1.356535
3 4 X 2018-01-18 09:16:43.071 103.877721 1.356535
4 5 X 2018-01-18 09:16:45.071 103.877705 1.356533
5 6 X 2018-01-18 09:16:51.070 103.877705 1.356533
6 7 X 2018-01-18 09:16:53.071 103.877705 1.356533
7 8 X 2018-01-18 09:59:55.072 103.877705 1.356528
Mode(denote as 0 or 2) Distance Calculated dist
0 0 0.000000 0.000000
1 0 0.000000 0.000000
2 0 0.000000 0.000000
3 0 0.000848 0.000848
4 0 0.001706 0.001706
5 2 0.000000 0.000557
6 2 0.000000 0.000000
7 0 0.000557 0.000000
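If you then need the total journey distance per Model, and per Model within each Mode, a minimal sketch (assuming the column names shown above; the variable names are just for illustration):

# total distance per Model
total_by_model = mydataset.groupby('Model')['dist'].sum()

# total distance per Model and Mode
total_by_model_mode = mydataset.groupby(['Model', 'Mode(denote as 0 or 2)'])['dist'].sum()

print(total_by_model)
print(total_by_model_mode)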
I have a dataset of end-of-month values. I have to multiply the last 12 values and save the result in a 'val' column. How can I do that?
I tried to loop using shift and also Grouper, but it did not work.
My code:
filtered_df = df.copy()
filtered_df = filtered_df[filtered_df['monthly'].notnull()]
for index, row in filtered_df.iterrows():
    if index > 12:
        pre_1 = row.shift(1)
        pre_2 = row.shift(2)
        pre_3 = row.shift(3)
        pre_4 = row.shift(4)
        pre_5 = row.shift(5)
        pre_6 = row.shift(-6)
        pre_7 = row.shift(-7)
        pre_8 = row.shift(-8)
        pre_9 = row.shift(-9)
        pre_10 = row.shift(-10)
        pre_11 = row.shift(-11)
        pre_12 = row.shift(-12)
        all_vals = (pre_1['monthly'] * pre_2['monthly'] * pre_3['monthly'] * pre_4['monthly'] *
                    pre_5['monthly'] * pre_6['monthly'] * pre_7['monthly'] * pre_8['monthly'] *
                    pre_9['monthly'] * pre_10['monthly'] * pre_11['monthly'] * pre_12['monthly'])
        row['generic_momentum'] = all_vals
But I'm getting NaN values, and it is not picking the right column.
I also tried this, but it is not working:
df.tail(12).prod()
Dataset
Date monthly val
31/01/11 0.959630357
28/02/11 0.939530957
31/03/11 1.024870166
31/05/11 0.956831905
30/06/11 1.06549785
30/09/11 0.903054795
31/10/11 1.027355404
30/11/11 0.893328025
31/01/12 1.015152156
29/02/12 1.05621569
30/04/12 1.116884715
31/05/12 0.878896927
31/07/12 0.950743984
31/08/12 1.094999121
31/10/12 0.94769417
30/11/12 1.073116682
31/12/12 0.986747164
31/01/13 0.975354237
28/02/13 0.888879072
30/04/13 0.940063889
31/05/13 1.017259688
31/07/13 0.990201439
30/09/13 1.018815133
31/10/13 1.088671085
31/12/13 1.104019842
31/01/14 0.989041096
28/02/14 1.017825485
31/03/14 0.960047355
30/04/14 1.064095477
30/06/14 1.023850957
31/07/14 1.08941545
30/09/14 1.065516629
31/10/14 0.984540626
31/12/14 1.023386988
28/02/15 1.150857956
31/03/15 1.01209752
30/04/15 1.00295515
30/06/15 1.043231635
31/07/15 1.042820448
31/08/15 1.241814907
30/09/15 1.014741935
30/11/15 0.980878108
31/12/15 0.995258408
29/02/16 1.0507026
31/03/16 1.033018209
31/05/16 0.931798992
30/06/16 1.032879184
31/08/16 0.881060764
30/09/16 1.000240668
30/11/16 0.849364675
31/01/17 1.075015059
28/02/17 0.933706879
31/03/17 1.036073194
31/05/17 1.203092255
30/06/17 0.956726321
31/07/17 1.010709024
31/08/17 1.102072394
31/10/17 0.99223153
30/11/17 1.088148242
31/01/18 0.982730721
28/02/18 1.102215081
IIUC: Use a combination of pd.Series.rolling and np.prod
df['monthly val'].rolling(12).apply(np.prod)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 0.821766
12 0.814156
13 0.948878
14 0.877424
15 0.984058
16 0.911327
17 0.984289
....
An alternative is to use cumprod and shift
df['monthly val'].cumprod().pipe(lambda s: s / s.shift(12))
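To write the result into the val column the question asks for, a minimal sketch (assuming the returns column is named 'monthly'):

import numpy as np

# rolling product of the last 12 monthly values; the first 11 rows stay NaN
df['val'] = df['monthly'].rolling(12).apply(np.prod, raw=True)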
I've been working on this all morning and for the life of me cannot figure it out. I'm sure this is very basic, but I've become so frustrated my mind is being clouded. I'm attempting to calculate the total return of a portfolio of securities at each date (monthly).
The formula is (1 + r1) * (1 + r2) * ... * (1 + rt) - 1
Here is what I'm working with:
Adj_Returns = Adj_Close/Adj_Close.shift(1)-1
Adj_Returns['Risk Parity Portfolio'] = (Adj_Returns.loc['2003-01-31':]*Weights.shift(1)).sum(axis = 1)
Adj_Returns
SPY IYR LQD Risk Parity Portfolio
Date
2002-12-31 NaN NaN NaN 0.000000
2003-01-31 -0.019802 -0.014723 0.000774 -0.006840
2003-02-28 -0.013479 0.019342 0.015533 0.011701
2003-03-31 -0.001885 0.010015 0.001564 0.003556
2003-04-30 0.088985 0.045647 0.020696 0.036997
For example, with 2002-12-31 being base 100 for risk parity, I want 2003-01-31 to be 99.316 (100 * (1-0.006840)), 2003-02-28 to be 100.478 (99.316 * (1+ 0.011701)) so on and so forth.
Thanks!!
You want to use pd.DataFrame.cumprod
df.add(1).cumprod().sub(1).sum(1)
Consider the dataframe of returns df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.normal(.025, .03, (10, 5)), columns=list('ABCDE'))
df
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 0.024191 0.034487 0.035463 0.046461 0.048123
2 0.006754 0.035572 0.014424 0.012524 -0.002347
3 0.020724 0.047405 -0.020125 0.043341 0.037007
4 -0.003783 0.069827 0.014605 -0.019147 0.056897
5 0.056890 0.042756 0.033886 0.001758 0.049944
6 0.069609 0.032687 -0.001997 0.036253 0.009415
7 0.026503 0.053499 -0.006013 0.053447 0.047013
8 0.062084 0.029664 -0.015238 0.029886 0.062748
9 0.048341 0.065248 -0.024081 0.019139 0.028955
We can see the cumulative return or total return is
df.add(1).cumprod().sub(1)
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 -0.015641 0.020983 0.000139 0.001702 0.063343
2 -0.008993 0.057301 0.014565 0.014247 0.060847
3 0.011544 0.107423 -0.005853 0.058206 0.100105
4 0.007717 0.184750 0.008666 0.037944 0.162699
5 0.065046 0.235405 0.042847 0.039769 0.220768
6 0.139183 0.275786 0.040764 0.077464 0.232261
7 0.169375 0.344039 0.034505 0.135051 0.290194
8 0.241974 0.383909 0.018742 0.168973 0.371151
9 0.302013 0.474207 -0.005791 0.191346 0.410852
Plot it
df.add(1).cumprod().sub(1).plot()
Add the sum of returns as a new column
df.assign(Portfolio=df.add(1).cumprod().sub(1).sum(1))
A B C D E Portfolio
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521 -0.114311
1 0.024191 0.034487 0.035463 0.046461 0.048123 0.070526
2 0.006754 0.035572 0.014424 0.012524 -0.002347 0.137967
3 0.020724 0.047405 -0.020125 0.043341 0.037007 0.271425
4 -0.003783 0.069827 0.014605 -0.019147 0.056897 0.401777
5 0.056890 0.042756 0.033886 0.001758 0.049944 0.603835
6 0.069609 0.032687 -0.001997 0.036253 0.009415 0.765459
7 0.026503 0.053499 -0.006013 0.053447 0.047013 0.973165
8 0.062084 0.029664 -0.015238 0.029886 0.062748 1.184749
9 0.048341 0.065248 -0.024081 0.019139 0.028955 1.372626
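If you also want the base-100 index described in the question (100 at 2002-12-31, then compounded each month), a minimal sketch, assuming the portfolio returns sit in Adj_Returns['Risk Parity Portfolio']:

# grow a 100-based index by compounding (1 + r) each month
index_100 = 100 * Adj_Returns['Risk Parity Portfolio'].add(1).cumprod()
# e.g. 2003-01-31 -> 100 * (1 - 0.006840) = 99.316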
I have a dataframe:
ID Value
A 70
A 80
A 1000
A 100
A 200
A 130
A 60
A 300
A 800
A 200
A 150
A 250
I need to replace outliers with the median value.
I use:
df = pd.read_excel("test.xlsx")
grouped = df.groupby('ID')
statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25),
                           'median': grouped['Value'].median(),
                           'q3': grouped['Value'].quantile(.75)})

def is_outlier(row):
    iq_range = statBefore.loc[row.ID]['q3'] - statBefore.loc[row.ID]['q1']
    median = statBefore.loc[row.ID]['median']
    q3 = statBefore.loc[row.ID]['q3']
    q1 = statBefore.loc[row.ID]['q1']
    if row.Value > (q3 + (3 * iq_range)) or row.Value < (q1 - (3 * iq_range)):
        return True
    else:
        return False

# apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis=1)
But it returns a median of 175 and a q1 of 92, whereas I get 90, and it returns a q3 of 262.5, whereas I count and get 275.
What is wrong there?
This is simple and performant, with no Python for-loops to slow it down:
s = pd.Series([30, 31, 32, 45, 50, 999]) # example data
s.where(s.between(*s.quantile([0.25, 0.75])), s.median())
It gives you:
0 38.5
1 38.5
2 32.0
3 45.0
4 38.5
5 38.5
Unpacking that code, we have s.quantile([0.25, 0.75]) to get this:
0.25 31.25
0.75 48.75
We then use the values (31.25 and 48.75) as arguments to between(), with the * operator to unpack them because between() expects two separate arguments, not an array of length 2. That gives us:
0 False
1 False
2 True
3 True
4 False
5 False
Now that we have the binary mask, we use s.where() to choose the original values at the True locations, and fall back to s.median() otherwise.
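To adapt this to the per-ID grouping in the question, a hedged sketch using groupby().transform() (it assumes the 'ID' and 'Value' columns from the question, and replaces anything outside the q1-q3 band rather than using the 3*IQR fences from the question's code):

# per-ID quartiles and median, aligned back to the original rows
q1 = df.groupby('ID')['Value'].transform(lambda v: v.quantile(0.25))
q3 = df.groupby('ID')['Value'].transform(lambda v: v.quantile(0.75))
med = df.groupby('ID')['Value'].transform('median')

# keep values inside the band, replace the rest with the group median
df['Value'] = df['Value'].where(df['Value'].between(q1, q3), med)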
This is just how quantiles are defined:
df = pd.DataFrame(np.array([60, 70, 80, 100, 130, 150, 200, 200, 250, 300, 800, 1000]))
print(df.quantile(.25))
print(df.quantile(.50))
print(df.quantile(.75))
(The q1 for your data set is 95 btw)
The median is halfway between 150 and 200 (175).
The first quartile is three quarters of the way between 80 and 100 (95).
The third quartile is one quarter of the way between 250 and 300 (262.5).
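For reference, a small sketch of the linear interpolation pandas uses by default in quantile(), applied to the data above (the helper function is hypothetical, not part of pandas):

import numpy as np

vals = np.sort(np.array([60, 70, 80, 100, 130, 150, 200, 200, 250, 300, 800, 1000]))

def linear_quantile(x, q):
    # position of the quantile within the sorted values
    pos = q * (len(x) - 1)
    lo = int(np.floor(pos))
    frac = pos - lo
    hi = min(lo + 1, len(x) - 1)
    return x[lo] + frac * (x[hi] - x[lo])

print(linear_quantile(vals, 0.25))  # 95.0
print(linear_quantile(vals, 0.50))  # 175.0
print(linear_quantile(vals, 0.75))  # 262.5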