I would like to calculate the exponential moving average of my data. As usual, there are a few different ways to implement it in Python, and before I use any of them I wanted to understand (verify) them. The result is very surprising: none of them agree!
Below I use the TA-Lib EMA as well as the pandas ewm function. I have also included one from Excel, using the formula [data now − EMA(previous)] × multiplier + EMA(previous), with multiplier = 0.1818.
Can someone explain how they are calculated? Why do they all give different results? Which one is correct?
import pandas as pd
from talib import MA, EMA

df = pd.DataFrame({"Number": [x for x in range(1, 7)] * 5})
data = df["Number"].astype(float)  # TA-Lib expects float64 input
df["TA_MA"] = MA(data, timeperiod=5)    # TA-Lib simple moving average
df["PD_MA"] = data.rolling(5).mean()    # pandas rolling mean
df["TA_EMA"] = EMA(data, timeperiod=5)  # TA-Lib exponential moving average
df["PD_EMA_1"] = data.ewm(span=5, adjust=False).mean()
df["PD_EMA_2"] = data.ewm(span=5, adjust=True).mean()
Number TA_MA PD_MA TA_EMA PD_EMA_1 PD_EMA_2 Excel_EMA
0 1 NaN NaN NaN 1.000000 1.000000 NaN
1 2 NaN NaN NaN 1.333333 1.600000 NaN
2 3 NaN NaN NaN 1.888889 2.263158 NaN
3 4 NaN NaN NaN 2.592593 2.984615 NaN
4 5 3.0 3.0 3.000000 3.395062 3.758294 3.00
5 6 4.0 4.0 4.000000 4.263374 4.577444 3.55
6 1 3.8 3.8 3.000000 3.175583 3.310831 3.08
7 2 3.6 3.6 2.666667 2.783722 2.856146 2.89
8 3 3.4 3.4 2.777778 2.855815 2.905378 2.91
9 4 3.2 3.2 3.185185 3.237210 3.276691 3.11
10 5 3.0 3.0 3.790123 3.824807 3.857846 3.45
11 6 4.0 4.0 4.526749 4.549871 4.577444 3.91
12 1 3.8 3.8 3.351166 3.366581 3.378804 3.38
13 2 3.6 3.6 2.900777 2.911054 2.917623 3.13
14 3 3.4 3.4 2.933852 2.940703 2.945145 3.11
15 4 3.2 3.2 3.289234 3.293802 3.297299 3.27
16 5 3.0 3.0 3.859490 3.862534 3.865443 3.58
17 6 4.0 4.0 4.572993 4.575023 4.577444 4.02
18 1 3.8 3.8 3.381995 3.383349 3.384424 3.47
19 2 3.6 3.6 2.921330 2.922232 2.922811 3.21
20 3 3.4 3.4 2.947553 2.948155 2.948546 3.17
21 4 3.2 3.2 3.298369 3.298770 3.299077 3.32
22 5 3.0 3.0 3.865579 3.865847 3.866102 3.63
23 6 4.0 4.0 4.577053 4.577231 4.577444 4.06
24 1 3.8 3.8 3.384702 3.384821 3.384915 3.50
25 2 3.6 3.6 2.923135 2.923214 2.923265 3.23
26 3 3.4 3.4 2.948756 2.948809 2.948844 3.19
27 4 3.2 3.2 3.299171 3.299206 3.299233 3.33
28 5 3.0 3.0 3.866114 3.866137 3.866160 3.64
29 6 4.0 4.0 4.577409 4.577425 4.577444 4.07
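For what it's worth, all three EMA columns follow the same recurrence, EMA(t) = EMA(t−1) + α × (x(t) − EMA(t−1)); they differ only in the smoothing factor α and in how the series is seeded. pandas ewm(span=5, adjust=False) uses α = 2/(5+1) = 1/3 and seeds with the first data point; adjust=True instead renormalizes the weights over all points seen so far, which debiases the early values (the two converge, as the table shows). Below is a minimal sketch, reconstructed from the numbers above rather than from library source: seeding with the 5-period simple average and α = 1/3 reproduces the TA_EMA column, while the same seed with α = 0.1818 ≈ 2/(10+1) reproduces the Excel_EMA column, which suggests the Excel sheet used a 10-period multiplier on a 5-period seed:
import pandas as pd

def ema(series, period, alpha):
    # Seed with the simple average of the first `period` points, then
    # apply EMA(t) = EMA(t-1) + alpha * (x(t) - EMA(t-1)).
    out = [None] * (period - 1)
    prev = series.iloc[:period].mean()
    out.append(prev)
    for x in series.iloc[period:]:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return pd.Series(out, index=series.index, dtype=float)

data = pd.Series([x for x in range(1, 7)] * 5, dtype=float)
print(ema(data, 5, 2 / (5 + 1)))  # matches the TA_EMA column
print(ema(data, 5, 0.1818))       # matches the Excel_EMA column
So none of them is more "correct" than the others; they just make different choices of α and seed.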
I have the following dataframe in pandas:
Race_ID Athlete_ID Finish_time
0 1.0 1.0 56.1
1 1.0 3.0 60.2
2 1.0 2.0 57.1
3 1.0 4.0 57.2
4 2.0 2.0 56.2
5 2.0 1.0 56.3
6 2.0 3.0 56.4
7 2.0 4.0 56.5
8 3.0 1.0 61.2
9 3.0 2.0 62.1
10 3.0 3.0 60.4
11 3.0 4.0 60.0
12 4.0 2.0 55.0
13 4.0 1.0 54.0
14 4.0 3.0 53.0
15 4.0 4.0 52.0
where Race_ID is in descending order of time (i.e. 1 is the most recent race and 4 is the oldest race).
I want to add a new column Relative_time#t-1, which is the athlete's Finish_time in the previous race relative to the fastest time in that previous race. Hence the output would look something like:
Race_ID Athlete_ID Finish_time Relative_time#t-1
0 1.0 1.0 56.1 56.3/56.2
1 1.0 3.0 60.2 56.4/56.2
2 1.0 2.0 57.1 56.2/56.2
3 1.0 4.0 57.2 56.5/56.2
4 2.0 2.0 56.2 62.1/60
5 2.0 1.0 56.3 61.2/60
6 2.0 3.0 56.4 60.4/60
7 2.0 4.0 56.5 60/60
8 3.0 1.0 61.2 54/52
9 3.0 2.0 62.1 55/52
10 3.0 3.0 60.4 53/52
11 3.0 4.0 60.0 52/52
12 4.0 2.0 55.0 0
13 4.0 1.0 54.0 0
14 4.0 3.0 53.0 0
15 4.0 4.0 52.0 0
Here's the code:
import pandas as pd

data = [[1,1,56.1,'56.3/56.2'],
[1,3,60.2,'56.4/56.2'],
[1,2,57.1,'56.2/56.2'],
[1,4,57.2,'56.5/56.2'],
[2,2,56.2,'62.1/60'],
[2,1,56.3,'61.2/60'],
[2,3,56.4,'60.4/60'],
[2,4,56.5,'60/60'],
[3,1,61.2,'54/52'],
[3,2,62.1,'55/52'],
[3,3,60.4,'53/52'],
[3,4,60,'52/52'],
[4,2,55,'0'],
[4,1,54,'0'],
[4,3,53,'0'],
[4,4,52,'0']]
df = pd.DataFrame(data,columns=['Race_ID','Athlete_ID','Finish_time','Relative_time#t-1'],dtype=float)
I intentionally made Relative_time#t-1 a str instead of a number to show the formula.
Here is what I have tried:
import numpy as np

df.sort_values(by=['Race_ID', 'Athlete_ID'], ascending=[True, True], inplace=True)
df['Finish_time#t-1'] = df.groupby('Athlete_ID')['Finish_time'].shift(-1)
df['Finish_time#t-1'] = df['Finish_time#t-1'].replace(np.nan, 0)
So I get the numerator for the new column, but I don't know how to get the minimum time for each Race_ID (i.e. the value in the denominator).
Thank you in advance.
Try this: within each Athlete_ID group, shift(-1) pulls the finish time from the athlete's previous race, and mapping Race_ID through the shifted per-race minimum supplies the denominator:
(df.groupby('Athlete_ID')['Finish_time']
   .shift(-1)                         # athlete's time in the previous race
   .div(df['Race_ID'].map(
       df.groupby('Race_ID')['Finish_time']
       .min()                         # fastest time per race...
       .shift(-1)))                   # ...shifted to the previous race
   .fillna(0))                        # the oldest race has no t-1: use 0
Output:
0 1.001779
1 1.003559
2 1.000000
3 1.005338
4 1.035000
5 1.020000
6 1.006667
7 1.000000
8 1.038462
9 1.057692
10 1.019231
11 1.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
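To see where the denominator comes from, here is that inner mapping in isolation on the sample data: the per-race minimum is shifted so that race r looks up the fastest time of race r+1, i.e. the previous race in time:
prev_race_min = df.groupby('Race_ID')['Finish_time'].min().shift(-1)
print(prev_race_min)
# Race_ID
# 1.0    56.2
# 2.0    60.0
# 3.0    52.0
# 4.0     NaN   <- the oldest race has no t-1; fillna(0) turns these rows into 0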
So I have a local SQLite database, which I read into my program as a pandas dataframe using:
""" Separating hitters and pitchers """
pitchers = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE BF_y >= 20 AND BF_x >= 20", northwoods_db)
hitters = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE PA_y >= 25 AND PA_x >= 25", northwoods_db)
But when I do this, some of the numbers are not numeric. Here is a head of one of the dataframes:
index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21 -0.3 Hillsdale GMAC NCAA None 5 None ... 4.0 None 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 -- None Duke ACC NCAA None 15 None ... 13 0 1 88 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21 0.1 Wisconsin-Milwaukee Horz NCAA None 8 None ... 1.0 0.0 2.0 21.0 2.25 9.0 0.0 11.3 11.3 1.0
3 357 2017 22 1.0 Nova Southeastern SSC NCAA None 15.0 None ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21 -0.4 Creighton BigE NCAA None 4 None ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25
To make the dataframes numeric, I used these lines of code:
hitters = hitters.apply(pd.to_numeric, errors='coerce')
pitchers = pitchers.apply(pd.to_numeric, errors='coerce')
But when I did that, the new head of the dataframes was full of NaNs; it seems it got rid of all of the string values, but I want to keep those.
index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21.0 -0.3 NaN NaN NaN NaN 5.0 NaN ... 4.0 NaN 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 NaN NaN NaN NaN NaN NaN 15.0 NaN ... 13.0 0.0 1.0 88.0 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21.0 0.1 NaN NaN NaN NaN 8.0 NaN ... 1.0 0.0 2.0 21.0 2.250 9.0 0.0 11.3 11.3 1.00
3 357 2017 22.0 1.0 NaN NaN NaN NaN 15.0 NaN ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21.0 -0.4 NaN NaN NaN NaN 4.0 NaN ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25
Is there a better way to make the number values numeric and keep all my string columns? Maybe there is an SQLite function that can do it better? I am not sure; any help is appreciated.
Maybe you can use combine_first: the coerced copy has NaN exactly where the original held strings, and combine_first fills those holes back in from the original:
hitters_new = hitters.apply(pd.to_numeric, errors='coerce').combine_first(hitters)
pitchers_new = pitchers.apply(pd.to_numeric, errors='coerce').combine_first(pitchers)
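A toy demonstration of the pattern (a made-up two-column frame, not the question's schema): coercion turns numeric strings into numbers and text into NaN, and combine_first restores the text from the original, so string columns survive:
import pandas as pd

raw = pd.DataFrame({"Tm_x": ["Duke", "Creighton"], "BF_y": ["88", "226"]})
fixed = raw.apply(pd.to_numeric, errors="coerce").combine_first(raw)
print(fixed.dtypes)  # Tm_x stays object (strings), BF_y is now numeric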
You can try using astype or convert_dtypes. astype takes a dict mapping column names to dtypes, so if you already know which columns are numeric and which are strings, that can work; convert_dtypes instead infers a better dtype for each column automatically. Otherwise, take a look at this thread to do this automatically.
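For instance (hypothetical dtype choices; adjust the column list to your schema):
pitchers = pitchers.astype({"BF_y": "float64", "WHIP_y": "float64"})
hitters = hitters.convert_dtypes()  # infers a better dtype per column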
I have 2 dataframes:
df1
aa gg pm
1 3.3 0.5
1 0.0 4.7
1 9.3 0.2
2 0.3 0.6
2 14.0 91.0
3 13.0 31.0
4 13.1 64.0
5 1.3 0.5
6 3.3 0.5
7 11.1 3.0
7 11.3 24.0
8 3.2 0.0
8 5.3 0.3
8 3.3 0.3
and df2:
aa gg st
1 3.3 in
2 0.3 in
5 1.3 in
7 11.1 in
8 5.3 in
I would like to merge these two dataframes on cols aa and gg to get results like:
aa gg pm st
1 3.3 0.5 in
1 0.0 4.7
1 9.3 0.2
2 0.3 0.6 in
2 14.0 91.0
3 13.0 31.0
4 13.1 64.0
5 1.3 0.5 in
6 3.3 0.5
7 11.1 3.0 in
7 11.3 24.0
8 3.2 0.0
8 5.3 0.3 in
8 3.3 0.3
I want to map the col st details based on cols aa and gg. Please let me know how to do this.
You can multiply the float columns by 1000 or 10000 and convert them to integers, then use these new columns for the join; float keys can fail to match because of representation noise, while integer keys compare exactly:
df1['gg_int'] = df1['gg'].mul(1000).astype(int)
df2['gg_int'] = df2['gg'].mul(1000).astype(int)

# Drop df2's float gg so only aa and the integer key drive the join.
df = df1.merge(df2.drop('gg', axis=1), on=['aa', 'gg_int'], how='left')
df = df.drop('gg_int', axis=1)
print(df)
aa gg pm st
0 1 3.3 0.5 in
1 1 0.0 4.7 NaN
2 1 9.3 0.2 NaN
3 2 0.3 0.6 in
4 2 14.0 91.0 NaN
5 3 13.0 31.0 NaN
6 4 13.1 64.0 NaN
7 5 1.3 0.5 in
8 6 3.3 0.5 NaN
9 7 11.1 3.0 in
10 7 11.3 24.0 NaN
11 8 3.2 0.0 NaN
12 8 5.3 0.3 in
13 8 3.3 0.3 NaN
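If the gg values in both frames come from the same source and therefore compare bit-for-bit equal, a plain left merge on the float column also works; the integer-key trick above just guards against floating-point representation differences:
df = df1.merge(df2, on=['aa', 'gg'], how='left')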
I am sure there must be a very simple solution to this problem, but I am failing to find it (and browsing through previously asked questions, I didn't find the answer I wanted or didn't understand it).
I have a dataframe similar to this (just much bigger, with many more rows and columns):
x val1 val2 val3
0 0.0 10.0 NaN NaN
1 0.5 10.5 NaN NaN
2 1.0 11.0 NaN NaN
3 1.5 11.5 NaN 11.60
4 2.0 12.0 NaN 12.08
5 2.5 12.5 12.2 12.56
6 3.0 13.0 19.8 13.04
7 3.5 13.5 13.3 13.52
8 4.0 14.0 19.8 14.00
9 4.5 14.5 14.4 14.48
10 5.0 15.0 19.8 14.96
11 5.5 15.5 15.5 15.44
12 6.0 16.0 19.8 15.92
13 6.5 16.5 16.6 16.40
14 7.0 17.0 19.8 18.00
15 7.5 17.5 17.7 NaN
16 8.0 18.0 19.8 NaN
17 8.5 18.5 18.8 NaN
18 9.0 19.0 19.8 NaN
19 9.5 19.5 19.9 NaN
20 10.0 20.0 19.8 NaN
In the next step, I need to compute the derivative dVal/dx for each of the value columns (in reality I have more than 3 columns, so I need a robust solution in a loop; I can't select the rows manually each time). But because of the NaN values in some of the columns, I am facing the problem that x and val are not of the same dimension. I feel the way to overcome this would be to select only those x intervals for which val is not null, but I am not able to do that. I am probably making some very stupid mistakes (I am not a programmer and I am very untalented, so please be patient with me :) ).
Here is the code so far (now that I think of it, I may have introduced some mistakes just by leaving some old pieces of code because I've been messing with it for a while, trying different things):
import pandas as pd
import numpy as np
df = pd.read_csv('H:/DocumentsRedir/pokus/dataframe.csv', delimiter=',')
vals = list(df.columns.values)[1:]
for i in vals:
    V = np.asarray(pd.notnull(df[i]))
    mask = pd.notnull(df[i])
    X = np.asarray(df.loc[mask]['x'])
    derivative = np.diff(V) / np.diff(X)
But I am getting this error:
ValueError: operands could not be broadcast together with shapes (20,) (15,)
So, apparently, it did not select only the notnull values...
Is there an obvious mistake that I am making or a different approach that I should adopt? Thanks!
(And another less important question: is np.diff the right function to use here, or should I calculate it manually by finite differences? I'm not finding the numpy documentation very helpful.)
To calculate dVal/dX:
dVal = df.iloc[:, 1:].diff() # `x` is in column 0.
dX = df['x'].diff()
>>> dVal.apply(lambda series: series / dX)
val1 val2 val3
0 NaN NaN NaN
1 1 NaN NaN
2 1 NaN NaN
3 1 NaN NaN
4 1 NaN 0.96
5 1 NaN 0.96
6 1 15.2 0.96
7 1 -13.0 0.96
8 1 13.0 0.96
9 1 -10.8 0.96
10 1 10.8 0.96
11 1 -8.6 0.96
12 1 8.6 0.96
13 1 -6.4 0.96
14 1 6.4 3.20
15 1 -4.2 NaN
16 1 4.2 NaN
17 1 -2.0 NaN
18 1 2.0 NaN
19 1 0.2 NaN
20 1 -0.2 NaN
We difference all columns (except the first one), and then apply a lambda function to each column which divides it by the difference in the x column.
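As for the error in the original loop: pd.notnull(df[i]) is the boolean mask itself, so V gets one entry per row of the frame while X holds only the non-null rows, hence the mismatched (20,) and (15,) shapes. A minimal fix that keeps the same approach, selecting the values through the mask (and yes, np.diff(V) / np.diff(X) is exactly the forward finite difference, so it is a reasonable choice here):
import numpy as np

derivatives = {}
for i in vals:
    mask = df[i].notnull()
    V = df.loc[mask, i].to_numpy()    # column values where not null
    X = df.loc[mask, 'x'].to_numpy()  # matching x values, same length as V
    derivatives[i] = np.diff(V) / np.diff(X)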
Given the following data frame:
index value
1 0.8
2 0.9
3 1.0
4 0.9
5 nan
6 nan
7 nan
8 0.4
9 0.9
10 nan
11 0.8
12 2.0
13 1.4
14 1.9
15 nan
16 nan
17 nan
18 8.4
19 9.9
20 10.0
…
in which the 'value' data is separated into a number of clusters by NaN values. Is there any way I can calculate some statistics, such as the accumulated sum or the mean, of the clustered data? For example, I want to calculate the accumulated sum and generate the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 0
11 0.8 0.8
12 2.0 2.8
13 1.4 4.2
14 1.9 6.1
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
…
Any suggestions?
Also, as a simple extension of the problem: if two clusters of data are close enough, such that only a single NaN separates them, we consider them as one cluster of data, so that we would have the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 1.3
11 0.8 2.1
12 2.0 4.1
13 1.4 5.5
14 1.9 7.4
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
Thank you for the help!
You can do the first part using the compare-cumsum-groupby pattern. Your "simple extension" isn't quite so simple, but we can still pull it off by finding the parts of value that we want to treat as zero:
n = df["value"].isnull()
# A new cluster id starts wherever null/non-null flips.
clusters = (n != n.shift()).cumsum()
df["cumsum"] = df["value"].groupby(clusters).cumsum().fillna(0)

# For the extension: nulls whose run has length 1 are treated as zeros.
to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)
tmp_value = df["value"].where(~to_zero, 0)

# Re-cluster on the remaining nulls and cumsum again.
n2 = tmp_value.isnull()
new_clusters = (n2 != n2.shift()).cumsum()
df["cumsum_skip1"] = tmp_value.groupby(new_clusters).cumsum().fillna(0)
produces
>>> df
index value cumsum cumsum_skip1
0 1 0.8 0.8 0.8
1 2 0.9 1.7 1.7
2 3 1.0 2.7 2.7
3 4 0.9 3.6 3.6
4 5 NaN 0.0 0.0
5 6 NaN 0.0 0.0
6 7 NaN 0.0 0.0
7 8 0.4 0.4 0.4
8 9 0.9 1.3 1.3
9 10 NaN 0.0 1.3
10 11 0.8 0.8 2.1
11 12 2.0 2.8 4.1
12 13 1.4 4.2 5.5
13 14 1.9 6.1 7.4
14 15 NaN 0.0 0.0
15 16 NaN 0.0 0.0
16 17 NaN 0.0 0.0
17 18 8.4 8.4 8.4
18 19 9.9 18.3 18.3
19 20 10.0 28.3 28.3
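For intuition, the intermediate clusters labels give every run of nulls or values its own id, which is what makes the groupby work (values computed from the sample above):
print(clusters.tolist())
# [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7]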