Calculating the accumulated summation of clustered data in data frame in pandas - python

Given the following data frame:
index value
1 0.8
2 0.9
3 1.0
4 0.9
5 nan
6 nan
7 nan
8 0.4
9 0.9
10 nan
11 0.8
12 2.0
13 1.4
14 1.9
15 nan
16 nan
17 nan
18 8.4
19 9.9
20 10.0
…
Here the 'value' column is split into clusters by NaN values. Is there a way to calculate quantities such as the cumulative sum or the mean of each cluster? For example, I want to calculate the cumulative sum and generate the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 0
11 0.8 0.8
12 2.0 2.8
13 1.4 4.2
14 1.9 6.1
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
…
Any suggestions?
As a simple extension of the problem: if two clusters are close enough, for example separated by only one NaN, we treat them as a single cluster, which gives the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 1.3
11 0.8 2.1
12 2.0 4.1
13 1.4 5.5
14 1.9 7.4
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
Thank you for the help!

You can do the first part using the compare-cumsum-groupby pattern. Your "simple extension" isn't quite so simple, but we can still pull it off, by finding out the parts of value that we want to treat as zero:
n = df["value"].isnull()
clusters = (n != n.shift()).cumsum()
df["cumsum"] = df["value"].groupby(clusters).cumsum().fillna(0)
to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)
tmp_value = df["value"].where(~to_zero, 0)
n2 = tmp_value.isnull()
new_clusters = (n2 != n2.shift()).cumsum()
df["cumsum_skip1"] = tmp_value.groupby(new_clusters).cumsum().fillna(0)
produces
>>> df
index value cumsum cumsum_skip1
0 1 0.8 0.8 0.8
1 2 0.9 1.7 1.7
2 3 1.0 2.7 2.7
3 4 0.9 3.6 3.6
4 5 NaN 0.0 0.0
5 6 NaN 0.0 0.0
6 7 NaN 0.0 0.0
7 8 0.4 0.4 0.4
8 9 0.9 1.3 1.3
9 10 NaN 0.0 1.3
10 11 0.8 0.8 2.1
11 12 2.0 2.8 4.1
12 13 1.4 4.2 5.5
13 14 1.9 6.1 7.4
14 15 NaN 0.0 0.0
15 16 NaN 0.0 0.0
16 17 NaN 0.0 0.0
17 18 8.4 8.4 8.4
18 19 9.9 18.3 18.3
19 20 10.0 28.3 28.3
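The question also asks about the mean of each cluster. The same cluster labels can be reused for that. This is only a minimal sketch on a shortened copy of the sample data; the column name cluster_mean is made up for illustration, and all-NaN clusters are left as NaN rather than filled with 0:

```python
import numpy as np
import pandas as pd

# Shortened copy of the sample data from the question
df = pd.DataFrame({"value": [0.8, 0.9, 1.0, 0.9, np.nan, np.nan, np.nan,
                             0.4, 0.9, np.nan, 0.8, 2.0]})

# compare-cumsum-groupby: a new label starts wherever the NaN flag flips
n = df["value"].isnull()
clusters = (n != n.shift()).cumsum()

# Broadcast each cluster's mean back onto its rows (hypothetical column name)
df["cluster_mean"] = df["value"].groupby(clusters).transform("mean")
print(df)
```

Swapping "mean" for another aggregation name such as "sum" or "max" should work the same way.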

Related

Python code for rolling window regression by groups

I would like to perform a rolling window regression on panel data over a period of 12 months and get the monthly intercept for each fund as output. My data has funds (ID) with monthly returns.
Could you please help me with the Python code for this?
statsmodels provides RollingOLS, which you can combine with groupby.
Sample code:
import pandas as pd
import numpy as np
from statsmodels.regression.rolling import RollingOLS
# Read data & adding "intercept" column
df = pd.read_csv('sample_rolling_regression_OLS.csv')
df['intercept'] = 1
# Groupby then apply RollingOLS
df.groupby('name')[['y', 'intercept', 'x']].apply(lambda g: RollingOLS(g['y'], g[['intercept', 'x']], window=6).fit().params)
Sample data (also available for download at https://www.dropbox.com/s/zhklsg5cmfksufm/sample_rolling_regression_OLS.csv?dl=0):
name y x intercept
0 a 13.7 7.8 1
1 a -14.7 -9.7 1
2 a -3.4 -0.6 1
3 a 7.4 3.3 1
4 a -5.3 -1.9 1
5 a -8.3 -2.3 1
6 a 8.9 3.7 1
7 a 10.0 7.9 1
8 a 1.8 -0.4 1
9 a 6.7 3.1 1
10 a 17.4 9.9 1
11 a 8.9 7.7 1
12 a -3.1 -1.5 1
13 a -12.2 -7.9 1
14 a 7.6 4.9 1
15 a 4.2 2.3 1
16 a -15.3 -5.6 1
17 a 9.9 6.7 1
18 a 11.0 5.2 1
19 a 5.7 5.1 1
20 a -0.3 -0.6 1
21 a -15.0 -8.7 1
22 a -10.6 -5.7 1
23 a -16.0 -9.1 1
24 b 16.7 8.5 1
25 b 9.2 8.2 1
26 b 4.7 3.4 1
27 b -16.7 -8.7 1
28 b -4.8 -1.5 1
29 b -2.6 -2.2 1
30 b 16.3 9.5 1
31 b 15.8 9.8 1
32 b -10.8 -7.3 1
33 b -5.4 -3.4 1
34 b -6.0 -1.8 1
35 b 1.9 -0.6 1
36 b 6.3 6.1 1
37 b -14.7 -8.0 1
38 b -16.1 -9.7 1
39 b -10.5 -8.0 1
40 b 4.9 1.0 1
41 b 11.1 4.5 1
42 b -14.8 -8.5 1
43 b -0.2 -2.8 1
44 b 6.3 1.7 1
45 b -14.1 -8.7 1
46 b 13.8 8.9 1
47 b -6.2 -3.0 1
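For a single regressor, the rolling intercept can also be cross-checked without statsmodels, using the closed-form simple-regression formulas (slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)). This is a different technique from RollingOLS, shown only as a rough sketch: the helper name rolling_intercept, the 12-row window and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def rolling_intercept(g, window=12):
    """Rolling intercept of y ~ intercept + x (hypothetical helper)."""
    rx = g["x"].rolling(window)
    slope = rx.cov(g["y"]) / rx.var()
    return g["y"].rolling(window).mean() - slope * rx.mean()

# Synthetic, noise-free data: group "a" has intercept 2.0, group "b" has -1.0
rng = np.random.default_rng(0)
x = rng.normal(size=48)
demo = pd.DataFrame({
    "name": ["a"] * 24 + ["b"] * 24,
    "x": x,
    "y": np.where(np.arange(48) < 24, 2.0 + 3.0 * x, -1.0 + 0.5 * x),
})

# One rolling-intercept series per group, keyed by group name
intercepts = pd.concat(
    {name: rolling_intercept(g) for name, g in demo.groupby("name")}
)
print(intercepts.dropna().head())
```

The first window - 1 rows of each group come out as NaN, because a full window is not yet available there.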

python exponential moving average

I would like to calculate the exponential moving average of my data. As usual, there are a few different ways to implement it in Python. Before using any of them, I wanted to understand (verify) it, and the result is very surprising: none of them agree!
Below I use the TA-Lib EMA as well as the pandas ewm function. I have also included one from Excel, using the formula [data now - EMA (previous)] x multiplier + EMA (previous), with multiplier = 0.1818.
Can someone explain how they are calculated? Why do they all give different results? Which one is correct?
import pandas as pd
from talib import MA, EMA
df = pd.DataFrame({"Number": [x for x in range(1,7)]*5})
data = df["Number"]
df["TA_MA"] = MA(data, timeperiod = 5)
df["PD_MA"] = data.rolling(5).mean()
df["TA_EMA"] = EMA(data, timeperiod = 5)
df["PD_EMA_1"] = data.ewm(span=5, adjust=False).mean()
df["PD_EMA_2"] = data.ewm(span=5, adjust=True).mean()
Number TA_MA PD_MA TA_EMA PD_EMA_1 PD_EMA_2 Excel_EMA
0 1 NaN NaN NaN 1.000000 1.000000 NaN
1 2 NaN NaN NaN 1.333333 1.600000 NaN
2 3 NaN NaN NaN 1.888889 2.263158 NaN
3 4 NaN NaN NaN 2.592593 2.984615 NaN
4 5 3.0 3.0 3.000000 3.395062 3.758294 3.00
5 6 4.0 4.0 4.000000 4.263374 4.577444 3.55
6 1 3.8 3.8 3.000000 3.175583 3.310831 3.08
7 2 3.6 3.6 2.666667 2.783722 2.856146 2.89
8 3 3.4 3.4 2.777778 2.855815 2.905378 2.91
9 4 3.2 3.2 3.185185 3.237210 3.276691 3.11
10 5 3.0 3.0 3.790123 3.824807 3.857846 3.45
11 6 4.0 4.0 4.526749 4.549871 4.577444 3.91
12 1 3.8 3.8 3.351166 3.366581 3.378804 3.38
13 2 3.6 3.6 2.900777 2.911054 2.917623 3.13
14 3 3.4 3.4 2.933852 2.940703 2.945145 3.11
15 4 3.2 3.2 3.289234 3.293802 3.297299 3.27
16 5 3.0 3.0 3.859490 3.862534 3.865443 3.58
17 6 4.0 4.0 4.572993 4.575023 4.577444 4.02
18 1 3.8 3.8 3.381995 3.383349 3.384424 3.47
19 2 3.6 3.6 2.921330 2.922232 2.922811 3.21
20 3 3.4 3.4 2.947553 2.948155 2.948546 3.17
21 4 3.2 3.2 3.298369 3.298770 3.299077 3.32
22 5 3.0 3.0 3.865579 3.865847 3.866102 3.63
23 6 4.0 4.0 4.577053 4.577231 4.577444 4.06
24 1 3.8 3.8 3.384702 3.384821 3.384915 3.50
25 2 3.6 3.6 2.923135 2.923214 2.923265 3.23
26 3 3.4 3.4 2.948756 2.948809 2.948844 3.19
27 4 3.2 3.2 3.299171 3.299206 3.299233 3.33
28 5 3.0 3.0 3.866114 3.866137 3.866160 3.64
29 6 4.0 4.0 4.577409 4.577425 4.577444 4.07
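The columns in the table can be reproduced by hand, which suggests where the differences come from. The sketch below uses only NumPy and pandas. The two pandas variants follow the documented ewm formulas with alpha = 2 / (span + 1). The SMA-seeded variant is an assumption inferred from the numbers above: it matches the TA_EMA column with alpha = 1/3, and it matches the Excel column when the multiplier 0.1818 (about 2/11, which corresponds to a period of 10 rather than 5) is used. The function names are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([x for x in range(1, 7)] * 5, dtype=float)
x = data.to_numpy()
alpha = 2 / (5 + 1)  # span=5 gives alpha = 1/3

def ema_unadjusted(x, alpha):
    # adjust=False: y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
    out = [x[0]]
    for v in x[1:]:
        out.append((1 - alpha) * out[-1] + alpha * v)
    return np.array(out)

def ema_adjusted(x, alpha):
    # adjust=True: weighted mean of all past values, weights (1 - alpha)**i
    out = []
    for t in range(len(x)):
        w = (1 - alpha) ** np.arange(t, -1, -1)
        out.append(np.sum(w * x[: t + 1]) / np.sum(w))
    return np.array(out)

def ema_sma_seeded(x, period, alpha):
    # Start from the simple average of the first `period` values, then recurse
    out = np.full(len(x), np.nan)
    out[period - 1] = x[:period].mean()
    for t in range(period, len(x)):
        out[t] = out[t - 1] + alpha * (x[t] - out[t - 1])
    return out

ta_like = ema_sma_seeded(x, 5, alpha)        # compare with TA_EMA
excel_like = ema_sma_seeded(x, 5, 2 / 11)    # compare with Excel_EMA
print(np.round(ta_like[4:8], 6), np.round(excel_like[4:8], 2))
```

Under these assumptions no single column is "the correct one": the variants differ in how the first value is seeded, in whether the weights are normalised, and (for the Excel column) in the smoothing factor.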

Merging different length dataframe in Python/pandas

I have two dataframes:
df1
aa gg pm
1 3.3 0.5
1 0.0 4.7
1 9.3 0.2
2 0.3 0.6
2 14.0 91.0
3 13.0 31.0
4 13.1 64.0
5 1.3 0.5
6 3.3 0.5
7 11.1 3.0
7 11.3 24.0
8 3.2 0.0
8 5.3 0.3
8 3.3 0.3
and df2:
aa gg st
1 3.3 in
2 0.3 in
5 1.3 in
7 11.1 in
8 5.3 in
I would like to merge these two dataframes on columns aa and gg to get a result like:
aa gg pm st
1 3.3 0.5 in
1 0.0 4.7
1 9.3 0.2
2 0.3 0.6 in
2 14.0 91.0
3 13.0 31.0
4 13.1 64.0
5 1.3 0.5 in
6 3.3 0.5
7 11.1 3.0 in
7 11.3 24.0
8 3.2 0.0
8 5.3 0.3 in
8 3.3 0.3
I want to map the values of column st onto df1 based on columns aa and gg.
Please let me know how to do this.
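You can multiply the float columns by 1000 or 10000, round and convert them to integers, and then use these new columns for the join (rounding before the cast avoids truncating values such as 299.99999999999994 to 299):
df1['gg_int'] = df1['gg'].mul(1000).round().astype(int)
df2['gg_int'] = df2['gg'].mul(1000).round().astype(int)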
df = df1.merge(df2.drop('gg', axis=1), on=['aa','gg_int'], how='left')
df = df.drop('gg_int', axis=1)
print (df)
aa gg pm st
0 1 3.3 0.5 in
1 1 0.0 4.7 NaN
2 1 9.3 0.2 NaN
3 2 0.3 0.6 in
4 2 14.0 91.0 NaN
5 3 13.0 31.0 NaN
6 4 13.1 64.0 NaN
7 5 1.3 0.5 in
8 6 3.3 0.5 NaN
9 7 11.1 3.0 in
10 7 11.3 24.0 NaN
11 8 3.2 0.0 NaN
12 8 5.3 0.3 in
13 8 3.3 0.3 NaN
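To see why an integer key can help, here is a small illustrative sketch with made-up frames (left and right are not the data from the question). One key is stored as 0.1 + 0.2, which is not exactly equal to 0.3 in floating point, so a merge directly on the float column misses it, while the scaled and rounded integer key matches:

```python
import pandas as pd

# Made-up data: 0.1 + 0.2 is stored as 0.30000000000000004, not 0.3
left = pd.DataFrame({"aa": [1, 2], "gg": [0.1 + 0.2, 1.3], "pm": [0.5, 0.6]})
right = pd.DataFrame({"aa": [1, 2], "gg": [0.3, 1.3], "st": ["in", "in"]})

# Merging directly on the float column misses the first row
naive = left.merge(right, on=["aa", "gg"], how="left")

# Scale, round and cast to int, then merge on the integer key instead
left["gg_int"] = left["gg"].mul(1000).round().astype(int)
right["gg_int"] = right["gg"].mul(1000).round().astype(int)
fixed = (left.merge(right.drop(columns="gg"), on=["aa", "gg_int"], how="left")
             .drop(columns="gg_int"))

print(naive)
print(fixed)
```

The scale factor (1000 here) should be chosen to match the number of decimals that are meaningful in the data.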

How to get indexes of values in a Pandas DataFrame?

I am sure there must be a very simple solution to this problem, but I am failing to find it (and browsing through previously asked questions, I didn't find the answer I wanted or didn't understand it).
I have a dataframe similar to this (just much bigger, with many more rows and columns):
x val1 val2 val3
0 0.0 10.0 NaN NaN
1 0.5 10.5 NaN NaN
2 1.0 11.0 NaN NaN
3 1.5 11.5 NaN 11.60
4 2.0 12.0 NaN 12.08
5 2.5 12.5 12.2 12.56
6 3.0 13.0 19.8 13.04
7 3.5 13.5 13.3 13.52
8 4.0 14.0 19.8 14.00
9 4.5 14.5 14.4 14.48
10 5.0 15.0 19.8 14.96
11 5.5 15.5 15.5 15.44
12 6.0 16.0 19.8 15.92
13 6.5 16.5 16.6 16.40
14 7.0 17.0 19.8 18.00
15 7.5 17.5 17.7 NaN
16 8.0 18.0 19.8 NaN
17 8.5 18.5 18.8 NaN
18 9.0 19.0 19.8 NaN
19 9.5 19.5 19.9 NaN
20 10.0 20.0 19.8 NaN
In the next step, I need to compute the derivative dVal/dx for each of the value columns (in reality I have more than 3 columns, so I need a robust solution in a loop; I can't select the rows manually each time). But because of the NaN values in some of the columns, I am facing the problem that x and val do not have the same dimension. I feel the way to overcome this would be to select only those x intervals for which val is not null. But I am not able to do that. I am probably making some very stupid mistakes (I am not a programmer and I am very untalented, so please be patient with me:) ).
Here is the code so far (now that I think of it, I may have introduced some mistakes just by leaving some old pieces of code because I've been messing with it for a while, trying different things):
import pandas as pd
import numpy as np
df = pd.read_csv('H:/DocumentsRedir/pokus/dataframe.csv', delimiter=',')
vals = list(df.columns.values)[1:]
for i in vals:
    V = np.asarray(pd.notnull(df[i]))
    mask = pd.notnull(df[i])
    X = np.asarray(df.loc[mask]['x'])
    derivative=np.diff(V)/np.diff(X)
But I am getting this error:
ValueError: operands could not be broadcast together with shapes (20,) (15,)
So, apparently, it did not select only the notnull values...
Is there an obvious mistake that I am making or a different approach that I should adopt? Thanks!
(And another, less important question: is np.diff the right function to use here, or would I be better off calculating it manually by finite differences? I'm not finding the numpy documentation very helpful.)
To calculate dVal/dX:
dVal = df.iloc[:, 1:].diff() # `x` is in column 0.
dX = df['x'].diff()
>>> dVal.apply(lambda series: series / dX)
val1 val2 val3
0 NaN NaN NaN
1 1 NaN NaN
2 1 NaN NaN
3 1 NaN NaN
4 1 NaN 0.96
5 1 NaN 0.96
6 1 15.2 0.96
7 1 -13.0 0.96
8 1 13.0 0.96
9 1 -10.8 0.96
10 1 10.8 0.96
11 1 -8.6 0.96
12 1 8.6 0.96
13 1 -6.4 0.96
14 1 6.4 3.20
15 1 -4.2 NaN
16 1 4.2 NaN
17 1 -2.0 NaN
18 1 2.0 NaN
19 1 0.2 NaN
20 1 -0.2 NaN
We difference all columns (except the first one), and then apply a lambda function to each column which divides it by the difference in column X.
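As for the error in the question's loop: V there is built from pd.notnull(df[i]), so it holds the boolean mask (one entry per row) rather than the non-null values, which is why its length differs from X. A minimal sketch of the loop with the mask applied to both arrays, using a small made-up frame in place of the CSV file:

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the CSV from the question
df = pd.DataFrame({
    "x": [0.0, 0.5, 1.0, 1.5, 2.0],
    "val1": [10.0, 10.5, 11.0, 11.5, 12.0],
    "val2": [np.nan, np.nan, 11.0, 11.48, 11.96],
})

derivatives = {}
for col in df.columns[1:]:
    mask = df[col].notnull()
    V = df.loc[mask, col].to_numpy()   # the non-null values themselves
    X = df.loc[mask, "x"].to_numpy()   # the x values on the same rows
    derivatives[col] = np.diff(V) / np.diff(X)

print(derivatives)
```

Each entry then has one element fewer than the number of non-null rows in that column, so the results are kept in a dict rather than a single frame.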

converting sparse dataframe to dense dataframe

I have sparse data stored in a dataframe:
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
a b data
0 1 2 -0.824022
1 3 5 0.503239
2 5 5 -0.540105
Since I care about the null data the actual data would look like this:
true_df
a b data
0 1 1 NaN
1 1 2 -0.824022
2 1 3 NaN
3 1 4 NaN
4 1 5 NaN
5 2 1 NaN
6 2 2 NaN
7 2 3 NaN
8 2 4 NaN
9 2 5 NaN
10 3 1 NaN
11 3 2 NaN
12 3 3 NaN
13 3 4 NaN
14 3 5 0.503239
15 4 1 NaN
16 4 2 NaN
17 4 3 NaN
18 4 4 NaN
19 4 5 NaN
20 5 1 NaN
21 5 2 NaN
22 5 3 NaN
23 5 4 NaN
24 5 5 -0.540105
My question is how do I construct true_df? I was hoping there was some way to use pd.concat or pd.merge, that is, construct a dataframe that is the shape of the dense table and then join the two dataframes but that doesn't join in the expected way (the columns are not combined). The ultimate goal is to pivot on a and b.
As a follow up because I think kinjo is correct, why does this only work for integers and not for floats? Using:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(b,a) for b in np.arange(1,df.b.max()+0.1, 0.1) for a in np.arange(1,df.a.max()+0.1,0.1)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()
Will return:
a b data
0 1.0 1.0 NaN
1 1.0 1.1 NaN
2 1.0 1.2 NaN
3 1.0 1.3 NaN
4 1.0 1.4 NaN
5 1.0 1.5 NaN
6 1.0 1.6 NaN
7 1.1 1.0 NaN
8 1.1 1.1 NaN
9 1.1 1.2 NaN
10 1.1 1.3 NaN
11 1.1 1.4 NaN
12 1.1 1.5 NaN
13 1.1 1.6 NaN
14 1.2 1.0 NaN
15 1.2 1.1 NaN
16 1.2 1.2 NaN
17 1.2 1.3 NaN
18 1.2 1.4 NaN
19 1.2 1.5 NaN
20 1.2 1.6 NaN
21 1.3 1.0 NaN
22 1.3 1.1 NaN
23 1.3 1.2 NaN
24 1.3 1.3 NaN
25 1.3 1.4 NaN
26 1.3 1.5 NaN
27 1.3 1.6 NaN
28 1.4 1.0 NaN
29 1.4 1.1 NaN
30 1.4 1.2 NaN
31 1.4 1.3 NaN
32 1.4 1.4 NaN
33 1.4 1.5 NaN
34 1.4 1.6 NaN
35 1.5 1.0 NaN
36 1.5 1.1 NaN
37 1.5 1.2 NaN
38 1.5 1.3 NaN
39 1.5 1.4 NaN
40 1.5 1.5 NaN
41 1.5 1.6 NaN
42 1.6 1.0 NaN
43 1.6 1.1 NaN
44 1.6 1.2 NaN
45 1.6 1.3 NaN
46 1.6 1.4 NaN
47 1.6 1.5 NaN
48 1.6 1.6 NaN
Reindex is a straightforward solution. Similar to @jezrael's solution, but no need for merge.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
### Create all possible combinations of a,b
newindex = [(a,b) for a in range(1,df.a.max()+1) for b in range(1,df.b.max()+1)]
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex)
You can then reset the index if you want the numeric count as your overall index.
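The same idea can be written with pd.MultiIndex.from_product, which builds all (a, b) combinations and keeps the level names. A minimal sketch, using the three data values printed in the question instead of random numbers so the result is reproducible:

```python
import pandas as pd

# Values copied from the question's printed frame
df = pd.DataFrame({'a': [1, 3, 5], 'b': [2, 5, 5],
                   'data': [-0.824022, 0.503239, -0.540105]})

# Every (a, b) pair from 1 up to each column's maximum
full_index = pd.MultiIndex.from_product(
    [range(1, df['a'].max() + 1), range(1, df['b'].max() + 1)],
    names=['a', 'b'],
)

# Missing combinations are filled with NaN by reindex
true_df = df.set_index(['a', 'b']).reindex(full_index).reset_index()
print(true_df)
```

Because the level names are carried over, reset_index gives back the columns a, b and data directly.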
In the case that your index is floats you should use linspace and not arange:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1.0,1.3,1.5], 'b':[1.2,1.5,1.5], 'data':np.random.randn(3)})
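### Grid bounds and point counts (assumed here: a 0.1 grid over [1.0, 1.6) for both columns)
a_min, a_max, a_num = 1.0, 1.6, 6
b_min, b_max, b_num = 1.0, 1.6, 6
### Create all possible combinations of a,b (the third linspace argument is a number of points, not a step)
### Rounding to one decimal removes floating-point noise so the grid matches the stored labels
newindex = [(a,b) for a in np.round(np.linspace(a_min, a_max, a_num, endpoint=False), 1) for b in np.round(np.linspace(b_min, b_max, b_num, endpoint=False), 1)]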
### Set the index as a,b and reindex
df.set_index(['a','b']).reindex(newindex).reset_index()
Since you intend to pivot on a and b, you could obtain the pivoted result with
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
result = pd.DataFrame(np.nan, index=range(1,6), columns=range(1,6))
result.update(df.pivot(index='a', columns='b', values='data'))
print(result)
which yields
1 2 3 4 5
1 NaN 0.436389 NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN -1.066621
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN 0.328880
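A shorter variant of the same idea is to pivot first and then reindex both axes to the full range. This is only a sketch; the 1..5 ranges are taken from the example, and the data values are the ones printed in the question rather than random numbers:

```python
import pandas as pd

# Values copied from the question's printed frame
df = pd.DataFrame({'a': [1, 3, 5], 'b': [2, 5, 5],
                   'data': [-0.824022, 0.503239, -0.540105]})

# Pivot the existing rows, then add the missing row and column labels as NaN
dense = (df.pivot(index='a', columns='b', values='data')
           .reindex(index=range(1, 6), columns=range(1, 6)))
print(dense)
```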
This is a nice fast approach for converting numeric data from sparse to dense, using SciPy's sparse functionality. Works if your ultimate goal is the pivoted (i.e. dense) dataframe:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
df = pd.DataFrame({'a':[1,3,5], 'b':[2,5,5], 'data':np.random.randn(3)})
df_shape = df['a'].max()+1, df['b'].max()+1
sp_df = csr_matrix((df['data'], (df['a'], df['b'])), shape=df_shape)
df_dense = pd.DataFrame.sparse.from_spmatrix(sp_df)
