Apply custom rolling function to pandas dataframe with datetime index - python

I have a pandas dataframe on which I wish to apply my own custom rolling function as follows:
def testms(x, field):
    mu = np.sum(x[field])
    si = np.sum(x[field])/len(x[field])
    x['mu'] = mu
    x['si'] = si
    return x
df2 = pd.concat([pd.DataFrame({'A':[1,1,1,1,1,2,2,2,2,2]}),
                 pd.DataFrame({'B':random_dates(pd.to_datetime('2015-01-01'),
                                                pd.to_datetime('2018-01-01'), 10)}),
                 pd.DataFrame({'C':np.random.rand(10)})],axis=1)
df2
A B C
0 1 2016-08-25 01:09:42.953011200 0.791725
1 1 2017-02-23 13:30:20.296310399 0.528895
2 1 2016-10-23 05:33:14.994806400 0.568045
3 1 2016-08-20 17:41:03.991027200 0.925597
4 1 2016-04-09 17:59:00.805200000 0.071036
5 2 2016-12-09 13:06:00.751737600 0.087129
6 2 2016-04-25 00:47:45.953232000 0.020218
7 2 2017-09-05 06:35:58.432531200 0.832620
8 2 2017-11-23 03:18:47.370528000 0.778157
9 2 2016-02-25 15:14:53.907532800 0.870012
tester = lambda x: testms(x, 'C')
df2.set_index('B').groupby('A')['C'].rolling('90D', min_periods=1).apply(tester).reset_index()
However when I apply the above code, I get the following error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Rolling.apply works differently from GroupBy.apply: it processes each column separately, and the function must return a scalar, so it cannot return multiple columns.
So your solution needs two functions. The column cannot be passed to the function; instead, select the column to process after the groupby:
def testms1(x):
    mu = np.sum(x)
    return mu
def testms2(x):
    #same like mean
    #si = np.sum(x)/len(x)
    si = np.mean(x)
    return si
tester1 = lambda x: testms1(x)
tester2 = lambda x: testms2(x)
r = df2.set_index('B').groupby('A')['C'].rolling('90D', min_periods=1)
s1 = r.apply(tester1, raw=False).rename('mu')
s2 = r.apply(tester2, raw=False).rename('si')
df = pd.concat([s1, s2], axis=1).reset_index()
print (df)
A B mu si
0 1 2016-08-25 01:09:42.953011200 0.791725 0.791725
1 1 2017-02-23 13:30:20.296310399 0.528895 0.528895
2 1 2016-10-23 05:33:14.994806400 1.096940 0.548470
3 1 2016-08-20 17:41:03.991027200 2.022537 0.674179
4 1 2016-04-09 17:59:00.805200000 2.093573 0.523393
5 2 2016-12-09 13:06:00.751737600 0.087129 0.087129
6 2 2016-04-25 00:47:45.953232000 0.107347 0.053673
7 2 2017-09-05 06:35:58.432531200 0.832620 0.832620
8 2 2017-11-23 03:18:47.370528000 1.610777 0.805389
9 2 2016-02-25 15:14:53.907532800 2.480789 0.826930
Alternative solution with Rolling.aggregate:
r = df2.set_index('B').groupby('A')['C'].rolling('90D', min_periods=1)
df1 = r.agg(['sum','mean']).rename(columns={'sum':'mu', 'mean':'si'}).reset_index()
print (df1)
A B mu si
0 1 2016-08-25 01:09:42.953011200 0.791725 0.791725
1 1 2017-02-23 13:30:20.296310399 0.528895 0.528895
2 1 2016-10-23 05:33:14.994806400 1.096940 0.548470
3 1 2016-08-20 17:41:03.991027200 2.022537 0.674179
4 1 2016-04-09 17:59:00.805200000 2.093573 0.523393
5 2 2016-12-09 13:06:00.751737600 0.087129 0.087129
6 2 2016-04-25 00:47:45.953232000 0.107347 0.053673
7 2 2017-09-05 06:35:58.432531200 0.832620 0.832620
8 2 2017-11-23 03:18:47.370528000 1.610777 0.805389
9 2 2016-02-25 15:14:53.907532800 2.480789 0.826930
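Since random_dates is not defined in the question, here is a minimal self-contained sketch of the same idea on made-up data (the dates and values are assumptions chosen for illustration). It sorts the dates within each group first, which time-based windows generally expect, and passes raw=True so that each window arrives as a plain NumPy array and the function returns one scalar per window.

```python
import numpy as np
import pandas as pd

# Made-up sample data (not the data from the question).
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2],
    "B": pd.to_datetime(["2016-01-01", "2016-02-01", "2016-06-01",
                         "2016-01-15", "2016-02-15"]),
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
})

rolling = (
    df.sort_values(["A", "B"])   # dates ascending within each group
      .set_index("B")
      .groupby("A")["C"]
      .rolling("90D", min_periods=1)
)

# Each call receives one window of C as a 1-D array and returns one scalar.
mu = rolling.apply(np.sum, raw=True).rename("mu")
si = rolling.apply(np.mean, raw=True).rename("si")

result = pd.concat([mu, si], axis=1).reset_index()
print(result)
```

The row for 2016-06-01 only sees its own value, because the two earlier dates in group 1 lie more than 90 days back.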

Related

How to change Nan value to particular number in Dataframe?

DP 1 DP 2 DP 3 DP 4 DP 5 DP 6 DP 7 DP 8 DP 9 DP 10
OP 1 2.33 1.711 1.218 1.046 1.150 1.025 1.046 1.092 nan -
OP 2 3.043 1.691 1.362 1.174 1.067 1.048 1.051 1.059
OP 3 4.054 1.717 1.238 1.132 1.068 1.056 1.045
OP 4 3.014 1.748 1.327 1.103 1.093 1.116
OP 5 2.798 1.862 1.241 1.242 1.148
OP 6 3.973 1.589 1.553 1.161
OP 7 3.372 1.552 1.458
OP 8 3.359 1.871
OP 9 3.494
OP 10
This is the dataframe DF1:
for ele in DF1:
    x = ele+2.0
    print(x)
This gives the output:
DP 1 DP 2 DP 3 DP 4 DP 5 DP 6 DP 7 DP 8 DP 9 DP 10
OP 1 4.33 3.711 3.218 3.046 3.150 3.025 3.046 3.092 nan -
OP 2 5.043 3.691 3.362 3.174 3.067 3.048 3.051 3.059
OP 3 6.054 3.717 3.238 3.132 3.068 3.056 3.045
OP 4 5.014 3.748 3.327 3.103 3.093 3.116
OP 5 4.798 3.862 3.241 3.242 3.148
OP 6 5.973 3.589 3.553 3.161
OP 7 5.372 3.552 3.458
OP 8 5.359 3.871
OP 9 5.494
OP 10
But I need output like:
DP 1 DP 2 DP 3 DP 4 DP 5 DP 6 DP 7 DP 8 DP 9 DP 10
OP 1 4.33 3.711 3.218 3.046 3.150 3.025 3.046 3.092 2.0 -
OP 2 5.043 3.691 3.362 3.174 3.067 3.048 3.051 3.059
OP 3 6.054 3.717 3.238 3.132 3.068 3.056 3.045
OP 4 5.014 3.748 3.327 3.103 3.093 3.116
OP 5 4.798 3.862 3.241 3.242 3.148
OP 6 5.973 3.589 3.553 3.161
OP 7 5.372 3.552 3.458
OP 8 5.359 3.871
OP 9 5.494
OP 10
That is, adding a number to NaN should give that number.
Does this help?
# Assumes every column of DF1 is numeric.
for col in DF1.columns:
    for ind, val in DF1[col].items():
        if np.isnan(val):
            DF1.loc[ind, col] = 2.0
        else:
            DF1.loc[ind, col] = val + 2.0
As you want:
import pandas as pd
import numpy as np
data = [[1,10],[2,12],[3,13],[4,10],[5,12],[np.nan,13]]
df = pd.DataFrame(data,columns=['a','b'],dtype=float)
for element in df['a']:
    if np.isnan(element):
        x = 2.0  # NaN counts as 0, so the result is just the added number
    else:
        x = element + 2.0
    print(x)
Easy way (fill the NaNs with 0 first, then add):
df.fillna(0) + 2.0
One way is to simply redefine addition so that x+nan evaluates to x, but that's rather dangerous. Safer is to define a custom function:
def nan_sum(a,b):
    # "not a" is False for NaN (NaN is truthy) and True for 0, so test for NaN explicitly
    if np.isnan(a):
        return b
    if np.isnan(b):
        return a
    return a+b
Then you can apply it to the dataframe: DF1.applymap(lambda x: nan_sum(x,2.0))
You can use np.nan_to_num(), which by default replaces NaN with 0.0:
import numpy as np
df.applymap(lambda x: np.nan_to_num(x)+2)
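The element-wise answers above can also be written as a single vectorized call. The following is a small sketch on a made-up frame (the column names and values are assumptions, not the DF1 from the question). It uses the fill_value argument of DataFrame.add, which fills existing NaN values before the addition.

```python
import numpy as np
import pandas as pd

# Made-up numeric frame with a few missing values.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

# Treat NaN as 0 for the purpose of the addition, then add 2.0 everywhere.
shifted = df.add(2.0, fill_value=0)

# Equivalent two-step spelling.
same = df.fillna(0) + 2.0

print(shifted)
```

Both forms assume every column is numeric; a column containing strings such as "-" would have to be converted or excluded first.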

How to create Triangular moving average in python using for loop

I use python pandas to calculate the following formula
(https://i.stack.imgur.com/XIKBz.png)
I do it in python like this :
EURUSD['SMA2'] = EURUSD['Close'].rolling(2).mean()
EURUSD['TMA2'] = (EURUSD['Close'] + EURUSD['SMA2']) / 2
The problem is that this takes a lot of code for TMA 100, so I need a for loop that makes the TMA period easy to change.
Thanks in advance
Edit:
I found this code, but it gives an error:
values = []
for i in range(1,201):
    values.append(eurusd['Close']).rolling(window=i).mean()
values.mean()
TMA is average of averages.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
print(df)
# df['mean0']=df.mean(0)
df['mean1']=df.mean(1)
print(df)
df['TMA'] = df['mean1'].rolling(window=10,center=False).mean()
print(df)
Or you can easily print it.
print(df["mean1"].mean())
Here is how it looks:
0 1 2 3 4
0 0.643560 0.412046 0.072525 0.618968 0.080146
1 0.018226 0.222212 0.077592 0.125714 0.595707
2 0.652139 0.907341 0.581802 0.021503 0.849562
3 0.129509 0.315618 0.711265 0.812318 0.757575
4 0.881567 0.455848 0.470282 0.367477 0.326812
5 0.102455 0.156075 0.272582 0.719158 0.266293
6 0.412049 0.527936 0.054381 0.587994 0.442144
7 0.063904 0.635857 0.244050 0.002459 0.423960
8 0.446264 0.116646 0.990394 0.678823 0.027085
9 0.951547 0.947705 0.080846 0.848772 0.699036
0 1 2 3 4 mean1
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581
0 1 2 3 4 mean1 TMA
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449 NaN
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890 NaN
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470 NaN
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257 NaN
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397 NaN
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313 NaN
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901 NaN
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046 NaN
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842 NaN
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581 0.436115
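To make the period a parameter, as the question asks, one possible sketch follows. It assumes the formula in the linked image matches the question's own TMA2 code, that is, the TMA of period N is the average of the simple moving averages with windows 1 through N. The function name triangular_ma and the sample prices are made up for illustration.

```python
import pandas as pd

def triangular_ma(close: pd.Series, period: int) -> pd.Series:
    """Average of SMA(1), SMA(2), ..., SMA(period) of `close`."""
    smas = [close.rolling(window=n).mean() for n in range(1, period + 1)]
    # skipna=False keeps NaN until the longest window has enough data.
    return pd.concat(smas, axis=1).mean(axis=1, skipna=False)

# Made-up closing prices.
close = pd.Series([1.0, 2.0, 3.0, 4.0])
tma2 = triangular_ma(close, 2)
print(tma2)
```

With period=2 this reproduces (Close + SMA2) / 2 from the question. Other definitions of the triangular moving average exist (for example an SMA of an SMA), so check the result against the intended formula.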

Problem to implement count, groupby, np.repeat and agg with pandas

I have a pandas dataframe similar to this:
df = pd.DataFrame({'x': np.random.rand(61800), 'y':np.random.rand(61800), 'z':np.random.rand(61800)})
I need to work out my dataset for the following result:
extract = df.assign(count=np.repeat(range(10),10)).groupby('count',as_index=False).agg(['mean','min', 'max'])
But if I use np.repeat(range(150), 150), I receive an error.
This doesn't work because the .assign you're performing needs to have enough values to fit the original dataframe:
In [81]: df = pd.DataFrame({'x': np.random.rand(61800), 'y':np.random.rand(61800), 'z':np.random.rand(61800)})
In [82]: df.assign(count=np.repeat(range(10),10))
ValueError: Length of values does not match length of index
In this case, everything works fine if we do 10 groups repeated 6,180 times:
In [83]: df.assign(count=np.repeat(range(10),6180))
Out[83]:
x y z count
0 0.781364 0.996545 0.756592 0
1 0.609127 0.981688 0.626721 0
2 0.547029 0.167678 0.198857 0
3 0.184405 0.484623 0.219722 0
4 0.451698 0.535085 0.045942 0
... ... ... ... ...
61795 0.783192 0.969306 0.974836 9
61796 0.890720 0.286384 0.744779 9
61797 0.512688 0.945516 0.907192 9
61798 0.526564 0.165620 0.766733 9
61799 0.683092 0.976219 0.524048 9
[61800 rows x 4 columns]
In [84]: extract = df.assign(count=np.repeat(range(10),6180)).groupby('count',as_index=False).agg(['mean','min', 'max'])
In [85]: extract
Out[85]:
x y z
mean min max mean min max mean min max
count
0 0.502338 0.000230 0.999546 0.501603 0.000263 0.999842 0.503807 0.000113 0.999826
1 0.500392 0.000059 0.999979 0.499935 0.000012 0.999767 0.500114 0.000230 0.999811
2 0.498377 0.000023 0.999832 0.496921 0.000003 0.999475 0.502887 0.000028 0.999828
3 0.504970 0.000637 0.999680 0.500943 0.000256 0.999902 0.497370 0.000257 0.999969
4 0.501195 0.000290 0.999992 0.498617 0.000149 0.999779 0.497895 0.000022 0.999877
5 0.499476 0.000186 0.999956 0.503227 0.000308 0.999907 0.504688 0.000100 0.999756
6 0.495488 0.000378 0.999606 0.499893 0.000119 0.999740 0.495924 0.000031 0.999556
7 0.498443 0.000005 0.999417 0.495728 0.000262 0.999972 0.501255 0.000087 0.999978
8 0.494110 0.000014 0.999888 0.495197 0.000074 0.999970 0.493215 0.000166 0.999718
9 0.496333 0.000365 0.999307 0.502074 0.000110 0.999856 0.499164 0.000035 0.999927
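When the number of rows is not an exact multiple of the group size, the labels can be derived from the row positions instead of np.repeat, so their length always matches the frame. This is a sketch on a small made-up frame; rows_per_group is a hypothetical name for the chunk size.

```python
import numpy as np
import pandas as pd

# Made-up frame with 25 rows, which is not a multiple of the chunk size.
df = pd.DataFrame({"x": np.arange(25, dtype=float)})

rows_per_group = 10  # hypothetical chunk size
# Integer division of the row position yields 0,0,...,1,1,...,2,...
labels = np.arange(len(df)) // rows_per_group

extract = df.groupby(labels).agg(["mean", "min", "max"])
print(extract)
```

The last group is simply shorter (5 rows here) rather than raising a length mismatch.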

Multiplying data within columns python

I've been working on this all morning and for the life of me cannot figure it out. I'm sure this is very basic, but I've become so frustrated my mind is being clouded. I'm attempting to calculate the total return of a portfolio of securities at each date (monthly).
The formula is (1 + r1) * (1+r2) * (1+ r(t))..... - 1
Here is what I'm working with:
Adj_Returns = Adj_Close/Adj_Close.shift(1)-1
Adj_Returns['Risk Parity Portfolio'] = (Adj_Returns.loc['2003-01-31':]*Weights.shift(1)).sum(axis = 1)
Adj_Returns
SPY IYR LQD Risk Parity Portfolio
Date
2002-12-31 NaN NaN NaN 0.000000
2003-01-31 -0.019802 -0.014723 0.000774 -0.006840
2003-02-28 -0.013479 0.019342 0.015533 0.011701
2003-03-31 -0.001885 0.010015 0.001564 0.003556
2003-04-30 0.088985 0.045647 0.020696 0.036997
For example, with 2002-12-31 being base 100 for risk parity, I want 2003-01-31 to be 99.316 (100 * (1-0.006840)), 2003-02-28 to be 100.478 (99.316 * (1+ 0.011701)) so on and so forth.
Thanks!!
You want to use pd.DataFrame.cumprod
df.add(1).cumprod().sub(1).sum(1)
Consider the dataframe of returns df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.normal(.025, .03, (10, 5)), columns=list('ABCDE'))
df
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 0.024191 0.034487 0.035463 0.046461 0.048123
2 0.006754 0.035572 0.014424 0.012524 -0.002347
3 0.020724 0.047405 -0.020125 0.043341 0.037007
4 -0.003783 0.069827 0.014605 -0.019147 0.056897
5 0.056890 0.042756 0.033886 0.001758 0.049944
6 0.069609 0.032687 -0.001997 0.036253 0.009415
7 0.026503 0.053499 -0.006013 0.053447 0.047013
8 0.062084 0.029664 -0.015238 0.029886 0.062748
9 0.048341 0.065248 -0.024081 0.019139 0.028955
We can see the cumulative return or total return is
df.add(1).cumprod().sub(1)
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 -0.015641 0.020983 0.000139 0.001702 0.063343
2 -0.008993 0.057301 0.014565 0.014247 0.060847
3 0.011544 0.107423 -0.005853 0.058206 0.100105
4 0.007717 0.184750 0.008666 0.037944 0.162699
5 0.065046 0.235405 0.042847 0.039769 0.220768
6 0.139183 0.275786 0.040764 0.077464 0.232261
7 0.169375 0.344039 0.034505 0.135051 0.290194
8 0.241974 0.383909 0.018742 0.168973 0.371151
9 0.302013 0.474207 -0.005791 0.191346 0.410852
Plot it
df.add(1).cumprod().sub(1).plot()
Add sum of returns to new column
df.assign(Portfolio=df.add(1).cumprod().sub(1).sum(1))
A B C D E Portfolio
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521 -0.114311
1 0.024191 0.034487 0.035463 0.046461 0.048123 0.070526
2 0.006754 0.035572 0.014424 0.012524 -0.002347 0.137967
3 0.020724 0.047405 -0.020125 0.043341 0.037007 0.271425
4 -0.003783 0.069827 0.014605 -0.019147 0.056897 0.401777
5 0.056890 0.042756 0.033886 0.001758 0.049944 0.603835
6 0.069609 0.032687 -0.001997 0.036253 0.009415 0.765459
7 0.026503 0.053499 -0.006013 0.053447 0.047013 0.973165
8 0.062084 0.029664 -0.015238 0.029886 0.062748 1.184749
9 0.048341 0.065248 -0.024081 0.019139 0.028955 1.372626
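For the base-100 series described in the question, the same cumprod idea can be applied to the portfolio return column alone. The sketch below uses the first three portfolio returns from the question; the variable names are made up.

```python
import pandas as pd

# First three 'Risk Parity Portfolio' returns from the question.
returns = pd.Series(
    [0.0, -0.006840, 0.011701],
    index=pd.to_datetime(["2002-12-31", "2003-01-31", "2003-02-28"]),
    name="Risk Parity Portfolio",
)

# Compound (1 + r) over time and scale so the first date equals 100.
index_level = 100 * returns.add(1).cumprod()
print(index_level.round(3))
```

Subtracting 1 from the cumulative product instead of multiplying by 100 gives the total return to date, as in the formula (1 + r1) * (1 + r2) * ... - 1.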

Extract Digits from Pandas column (Object dtype)

I'm having trouble removing non-digits from a df column. I have tried a few methods, but quite a few rows still come out as NaN when the function is applied to the column.
I need the output to be only digits in an integer form (No leading zeros)
Cust #
0 10726
2 11699
5 12963
8 z13307
9 13405
12 14831-001
16 16416
17 16917
18 18027
24 19233z
dtype('O')
I have tried:
Unique_Stores['Cust #2']=Unique_Stores['Cust #2'].str.extract('(\d+)',expand=True)
Unique_Stores['Cust #2'].str.replace(r'(\D+)','')
Unique_Stores['Cust #2'].replace(to_replace="([0-9]+)", value=r"\1", regex=True, inplace=True)
Unique_Stores['Cust #2'] = pd.to_numeric(Unique_Stores['Cust #2'].str.replace(r'\D+', ''), errors='coerce')
Thank you in advance, and let me know if you need more info.
But no matter what I do, the first 1000 or so lines return NaN values- even when the value is an integer.
Link to actual dataset
UPDATE:
In [144]: df = pd.read_csv(r'D:\download\Customer_Numbers.csv', index_col=0)
In [145]: df['Cust #2'] = df['Cust #'].str.replace(r'\D+', '', regex=True).astype(int)
In [146]: df
Out[146]:
State Zip Code Cust # Cust #2
0 PA 16505 10726 10726
2 MI 48103 11699 11699
5 NH 3253 12963 12963
8 PA 18951 13307 13307
9 MA 2360 13405 13405
12 NY 11940 14831 14831
16 OH 44278 16416 16416
17 OH 45459 16917 16917
18 MA 1748 18027 18027
24 NY 14226 19233 19233
... ... ... ... ...
54393 WA 99207 005611-99 561199
54394 WA 99006 7775 7775
54395 WA 99353 8006 8006
54399 WA 99206 8888 8888
54404 CA 92117 444202 444202
54408 CA 90019 30066 30066
54411 CA 90026 443607 443607
54414 CA 90094 9242 9242
54417 CA 90405 9245 9245
54420 CA 90038 9247 9247
[6492 rows x 4 columns]
In [147]: df.dtypes
Out[147]:
State object
Zip Code object
Cust # object
Cust #2 int32
dtype: object
OLD answer:
In [123]: df
Out[123]:
val
0 10726
2 11699
5 12963
8 z13307
9 13405
12 14831-001
16 16416
17 16917
18 18027
24 19233z
In [124]: df['val'] = pd.to_numeric(df['val'].str.replace(r'\D+', '', regex=True), errors='coerce')
In [125]: df
Out[125]:
val
0 10726
2 11699
5 12963
8 13307
9 13405
12 14831001
16 16416
17 16917
18 18027
24 19233
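A self-contained sketch of the same approach follows, on made-up values in the style of the question. One plausible cause of the NaN rows (an assumption, since the dataset is not shown here) is an object column that mixes real integers with strings: the .str accessor returns NaN for non-string entries, so the column is cast to str first. regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Made-up object column mixing a real integer with strings.
cust = pd.Series([10726, "z13307", "14831-001", "19233z"], dtype=object, name="Cust #")

digits = (
    cust.astype(str)                            # make every entry a string
        .str.replace(r"\D+", "", regex=True)    # drop all non-digit characters
)

# Convert to integers; the nullable Int64 dtype keeps any missing values.
numbers = pd.to_numeric(digits, errors="coerce").astype("Int64")
print(numbers)
```

Converting to integers also drops any leading zeros, as requested in the question.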
