I have a dataframe with IDs of clients and their expenses for 2014-2018. What I want is the mean of the expenses per ID, but only the years before a certain date may be taken into account when calculating the mean (so the column 'Date' dictates which columns count towards the mean).
Example: for index 0 (ID 12) the date is '2016-03-08', so the mean should be taken over the columns 'y_2014' and 'y_2015', giving 111.0 for this index.
If the date is too early (e.g. somewhere in 2014 or earlier in this case), NaN should be returned (see indices 6 and 9).
Initial dataframe:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID
0 100.0 122.0 324 632 NaN 2016-03-08 12
1 120.0 159.0 54 452 541.0 2015-04-09 96
2 NaN 164.0 687 165 245.0 2016-02-15 20
3 180.0 421.0 512 184 953.0 2018-05-01 73
4 110.0 654.0 913 173 103.0 2017-08-04 84
5 130.0 NaN 754 124 207.0 2016-07-03 26
6 170.0 256.0 843 97 806.0 2013-02-04 87
7 140.0 754.0 95 101 541.0 2016-06-08 64
8 80.0 985.0 184 84 90.0 2019-03-05 11
9 96.0 65.0 127 130 421.0 2014-05-14 34
Desired output:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.0
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.0
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.0
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.0
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.0
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.6
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
Tried code: I'm still working on it, as I don't really know where to start. I have only loaded the dataframe so far; probably something with the datetime package is needed to get the desired dataframe?
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
'2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
print(df)
Due to your naming convention, you need to extract the years from the column names for comparison. Then you can mask the data and take the mean:
# extract the year from each expense column name
data = df.filter(like='y_')
data_years = data.columns.str.extract(r'(\d+)')[0].astype(int)
# the year of each row's Date
years = pd.to_datetime(df.Date).dt.year.values
# keep only the columns strictly before each row's year, then average
df['mean'] = data.where(data_years.values < years[:, None]).mean(1)
Output:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.00
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.00
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.00
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.00
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.00
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447.00
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.60
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
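Here `data_years.values < years[:, None]` broadcasts the (5,) array of column years against the (10, 1) array of row cutoff years into a 10x5 boolean mask; `where` keeps only the entries whose column year is strictly before the row's Date year, and `mean(1)` averages across each row while skipping NaNs, so rows with no eligible years (indices 6 and 9) come out as NaN.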
Another answer:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
'2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
# Subset of the original df used to calculate the mean
subset = df.loc[:, ['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']]
# An expense value only counts towards the mean once that year has passed,
# so 'y_2014' is relabelled '2015-01-01' (etc.) for comparison with 'Date'
subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']
# ISO date strings compare correctly as plain strings
s = subset.columns.values < df.Date.values[:, None]
t = s.astype(float)
t[t == 0] = np.nan          # masked years contribute nothing to the mean
df['mean'] = (subset * t).mean(1)
print(df)
# Additionally: the sum of expenses before the date in the 'Date' column
df['sum'] = (subset * t).sum(1)
print(df)
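One caveat: unlike mean, pandas' sum treats an all-NaN row as 0, so indices 6 and 9 get a sum of 0.0 rather than NaN here. If NaN is wanted for those rows too, min_count (available since pandas 0.22) handles it:
# Return NaN instead of 0 when no year qualifies
df['sum'] = (subset * t).sum(axis=1, min_count=1)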
ab = '1 234'
ab = ab.replace(" ", "")
ab
'1234'
It's easy to use replace() to get rid of the whitespace in a plain string, but not when I have a column of a pandas dataframe:
gbpusd['Profit'] = gbpusd['Profit'].replace(" ", "")
gbpusd['Profit'].head()
3 7 000.00
4 6 552.00
11 4 680.00
14 3 250.00
24 1 700.00
Name: Profit, dtype: object
But it didn't work; I googled many times but found no solutions. Since the whitespace is still there, I cannot do further analysis like sum():
gbpusd['Profit'].sum()
TypeError: can only concatenate str (not "int") to str
The problem is harder than I thought: the raw data is
gbpusd.head()
Ticket Open Time Type Volume Item Price S / L T / P Close Time Price.1 Commission Taxes Swap Profit
84 50204109.0 2019.10.24 09:56:32 buy 0.5 gbpusd 1.29148 0.0 0.0 2019.10.24 09:57:48 1.29179 0 0.0 0.0 15.5
85 50205025.0 2019.10.24 10:10:13 buy 0.5 gbpusd 1.29328 0.0 0.0 2019.10.24 15:57:02 1.29181 0 0.0 0.0 -73.5
86 50207371.0 2019.10.24 10:34:10 buy 0.5 gbpusd 1.29236 0.0 0.0 2019.10.24 15:57:18 1.29197 0 0.0 0.0 -19.5
87 50207747.0 2019.10.24 10:40:32 buy 0.5 gbpusd 1.29151 0.0 0.0 2019.10.24 15:57:24 1.29223 0 0.0 0.0 36
88 50212252.0 2019.10.24 11:47:14 buy 1.5 gbpusd 1.28894 0.0 0.0 2019.10.24 15:57:12 1.29181 0 0.0 0.0 430.5
when I did
gbpusd['Profit'] = gbpusd['Profit'].str.replace(" ", "")
gbpusd['Profit']
84 NaN
85 NaN
86 NaN
87 NaN
88 NaN
89 NaN
90 NaN
91 NaN
92 NaN
93 NaN
94 NaN
95 NaN
96 NaN
97 NaN
98 NaN
99 NaN
100 NaN
101 NaN
102 NaN
103 NaN
104 NaN
105 NaN
106 NaN
107 NaN
108 NaN
109 NaN
110 NaN
111 NaN
112 NaN
113 NaN
...
117 4680.00
118 NaN
119 NaN
120 NaN
121 NaN
122 NaN
123 NaN
124 NaN
125 NaN
126 NaN
127 NaN
128 NaN
129 NaN
130 -2279.00
131 -2217.00
132 -2037.00
133 -5379.00
134 -1620.00
135 -7154.00
136 -4160.00
137 1144.00
138 NaN
139 NaN
140 NaN
141 -1920.00
142 7000.00
143 3250.00
144 NaN
145 1700.00
146 NaN
Name: Profit, Length: 63, dtype: object
The whitespace is replaced, but some values that had no space are NaN now... someone else may have the same problem...
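A likely cause: the Profit column mixes real floats (15.5, -73.5, ...) with strings like '7 000.00', and .str.replace returns NaN for every element that is not a string. A minimal fix, assuming the gbpusd frame above, is to cast the whole column to str first, then convert back to float:
gbpusd['Profit'] = (gbpusd['Profit'].astype(str)       # make every value a string
                                    .str.replace(' ', '')
                                    .astype(float))    # numeric again, so sum() works
gbpusd['Profit'].sum()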
You also need the .str accessor: Series.replace on its own matches whole cell values rather than substrings (unless you pass regex=True), which is why nothing changed.
gbpusdprofit = gbpusd['Profit'].str.replace(" ", "")
Output:
0 7000.00
1 6552.00
2 4680.00
3 3250.00
4 1700.00
Name: Profit, dtype: object
and for sum:
gbpusd['Profit'].str.replace(" ", "").astype('float').sum()
Result:
23182.0
You can strip the spaces, convert to float, and sum in a one-liner:
gbpusd['Profit'].str.replace(' ', "").astype(float).sum()
I have a dataframe of daily stock data, which is indexed by a datetimeindex.
There are multiple stock entries, thus there are duplicate datetimeindex values.
I am looking for a way to:
1. group the dataframe by the stock symbol,
2. resample the prices for each symbol group into monthly price frequency data,
3. perform a pct_change calculation on each symbol group's monthly price, and
4. store it as a new column 'monthly_return' in the original dataframe.
I have been able to manage the first three operations. Storing the result in the original dataframe is where I'm having some trouble.
To illustrate this, I created a toy dataset which includes a 'dummy' index (idx) column which I use to assist creation of the desired output later on in the third code block.
import random
import pandas as pd
import numpy as np
PER = 62  # PER was undefined in the question; 62 matches the sample output below
datelist = pd.date_range('2018-01-01', periods=PER).to_pydatetime().tolist() * 2
ids = [random.choice(['A', 'B']) for i in range(len(datelist))]
prices = random.sample(range(200), len(datelist))
idx = range(len(datelist))
df1 = pd.DataFrame(data=list(zip(idx, ids, prices)), index=datelist,
                   columns='idx label prices'.split())
print(df1.head(10))
df1
idx label prices
2018-01-01 0 B 40
2018-01-02 1 A 190
2018-01-03 2 A 159
2018-01-04 3 A 25
2018-01-05 4 A 89
2018-01-06 5 B 164
...
2018-01-31 30 A 102
2018-02-01 31 A 117
2018-02-02 32 A 120
2018-02-03 33 B 75
2018-02-04 34 B 170
...
Desired Output
idx label prices monthly_return
2018-01-01 0 B 40 0.000000
2018-01-02 1 A 190 0.000000
2018-01-03 2 A 159 0.000000
2018-01-04 3 A 25 0.000000
2018-01-05 4 A 89 0.000000
2018-01-06 5 B 164 0.000000
...
2018-01-31 30 A 102 -0.098039
2018-02-01 31 A 117 0.000000
2018-02-02 32 A 120 0.000000
...
2018-02-26 56 B 152 0.000000
2018-02-27 57 B 2 0.000000
2018-02-28 58 B 49 -0.040816
2018-03-01 59 B 188 0.000000
...
2018-01-28 89 A 88 0.000000
2018-01-29 90 A 26 0.000000
2018-01-30 91 B 128 0.000000
2018-01-31 92 A 144 -0.098039
...
2018-02-26 118 A 92 0.000000
2018-02-27 119 B 111 0.000000
2018-02-28 120 B 34 -0.040816
...
What I have tried so far is:
dfX = df1.copy(deep=True)
dfX = df1.groupby('label').resample('M')['prices'].last().pct_change(1).shift(-1)
print(dfX)
Which outputs:
label
A 2018-01-31 -0.067961
2018-02-28 -0.364583
2018-03-31 0.081967
B 2018-01-31 1.636364
2018-02-28 -0.557471
2018-03-31 NaN
This is quite close to what I would like to do; however, I only get pct_change data for end-of-month dates, which is awkward to store back in the original dataframe (df1) as a new column.
Something like this doesn't work:
dfX = df1.copy(deep=True)
dfX['monthly_return'] = df1.groupby('label').resample('M')['prices'].last().pct_change(1).shift(-1)
As it yields the error:
TypeError: incompatible index of inserted column with frame index
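(The error arises because the grouped result carries a (label, month-end) MultiIndex, which pandas cannot align with df1's plain DatetimeIndex when inserting the column.)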
I have considered 'upsampling' the monthly_return data back into a daily series, but that would likely trigger the same error, since the original dataset can be missing dates (such as weekends). Resetting the index to clear the error would still cause problems, as the grouped dfX does not have the same number of rows or frequency as the daily-frequency df1.
I have a hunch that this can be done by using multi-indexing and dataframe merging however I am unsure how to go about doing so.
This generates my desired output, but it isn't as clean a solution as I was hoping for.
df1 is generated the same as before (code given in question):
idx label prices
2018-01-01 0 A 145
2018-01-02 1 B 86
2018-01-03 2 B 141
...
2018-01-25 86 B 12
2018-01-26 87 B 71
2018-01-27 88 B 186
2018-01-28 89 B 151
2018-01-29 90 A 161
2018-01-30 91 B 143
2018-01-31 92 B 88
...
Then:
def fun(x):
    dates = x.date                       # remember the group's original dates
    x = x.set_index('date', drop=True)
    # next month's % change, placed on the current month-end row
    x['monthly_return'] = x.resample('M').last()['prices'].pct_change(1).shift(-1)
    x = x.reindex(dates)                 # back to the original daily rows
    return x
dfX = df1.copy(deep=True)
dfX.reset_index(inplace=True)
dfX.columns = 'date idx label prices'.split()
dfX = dfX.groupby('label').apply(fun).droplevel(level='label')
print(dfX)
Which outputs the desired result (unsorted):
idx label prices monthly_return
date
2018-01-01 0 A 145 NaN
2018-01-06 5 A 77 NaN
2018-01-08 7 A 48 NaN
2018-01-09 8 A 31 NaN
2018-01-11 10 A 20 NaN
2018-01-12 11 A 27 NaN
2018-01-14 13 A 109 NaN
2018-01-15 14 A 166 NaN
2018-01-17 16 A 130 NaN
2018-01-18 17 A 139 NaN
2018-01-19 18 A 191 NaN
2018-01-21 20 A 164 NaN
2018-01-22 21 A 112 NaN
2018-01-23 22 A 167 NaN
2018-01-25 24 A 140 NaN
2018-01-26 25 A 42 NaN
2018-01-30 29 A 107 NaN
2018-02-04 34 A 9 NaN
2018-02-07 37 A 84 NaN
2018-02-08 38 A 23 NaN
2018-02-10 40 A 30 NaN
2018-02-12 42 A 89 NaN
2018-02-15 45 A 79 NaN
2018-02-16 46 A 115 NaN
2018-02-19 49 A 197 NaN
2018-02-21 51 A 11 NaN
2018-02-26 56 A 111 NaN
2018-02-27 57 A 126 NaN
2018-03-01 59 A 135 NaN
2018-03-03 61 A 28 NaN
2018-01-01 62 A 120 NaN
2018-01-03 64 A 170 NaN
2018-01-05 66 A 45 NaN
2018-01-07 68 A 173 NaN
2018-01-08 69 A 158 NaN
2018-01-09 70 A 63 NaN
2018-01-11 72 A 62 NaN
2018-01-12 73 A 168 NaN
2018-01-14 75 A 169 NaN
2018-01-15 76 A 142 NaN
2018-01-17 78 A 83 NaN
2018-01-18 79 A 96 NaN
2018-01-21 82 A 25 NaN
2018-01-22 83 A 90 NaN
2018-01-23 84 A 59 NaN
2018-01-29 90 A 161 NaN
2018-02-01 93 A 150 NaN
2018-02-04 96 A 85 NaN
2018-02-06 98 A 124 NaN
2018-02-14 106 A 195 NaN
2018-02-16 108 A 136 NaN
2018-02-17 109 A 134 NaN
2018-02-18 110 A 183 NaN
2018-02-19 111 A 32 NaN
2018-02-24 116 A 102 NaN
2018-02-25 117 A 72 NaN
2018-02-27 119 A 38 NaN
2018-03-02 122 A 137 NaN
2018-03-03 123 A 171 NaN
2018-01-02 1 B 86 NaN
2018-01-03 2 B 141 NaN
2018-01-04 3 B 189 NaN
2018-01-05 4 B 60 NaN
2018-01-07 6 B 1 NaN
2018-01-10 9 B 87 NaN
2018-01-13 12 B 44 NaN
2018-01-16 15 B 147 NaN
2018-01-20 19 B 92 NaN
2018-01-24 23 B 81 NaN
2018-01-27 26 B 190 NaN
2018-01-28 27 B 24 NaN
2018-01-29 28 B 116 NaN
2018-01-31 30 B 98 1.181818
2018-02-01 31 B 121 NaN
2018-02-02 32 B 110 NaN
2018-02-03 33 B 66 NaN
2018-02-05 35 B 4 NaN
2018-02-06 36 B 13 NaN
2018-02-09 39 B 114 NaN
2018-02-11 41 B 16 NaN
2018-02-13 43 B 174 NaN
2018-02-14 44 B 78 NaN
2018-02-17 47 B 144 NaN
2018-02-18 48 B 14 NaN
2018-02-20 50 B 133 NaN
2018-02-22 52 B 156 NaN
2018-02-23 53 B 159 NaN
2018-02-24 54 B 177 NaN
2018-02-25 55 B 43 NaN
2018-02-28 58 B 19 -0.338542
2018-03-02 60 B 127 NaN
2018-01-02 63 B 2 NaN
2018-01-04 65 B 97 NaN
2018-01-06 67 B 8 NaN
2018-01-10 71 B 54 NaN
2018-01-13 74 B 106 NaN
2018-01-16 77 B 74 NaN
2018-01-19 80 B 188 NaN
2018-01-20 81 B 172 NaN
2018-01-24 85 B 51 NaN
2018-01-25 86 B 12 NaN
2018-01-26 87 B 71 NaN
2018-01-27 88 B 186 NaN
2018-01-28 89 B 151 NaN
2018-01-30 91 B 143 NaN
2018-01-31 92 B 88 1.181818
2018-02-02 94 B 75 NaN
2018-02-03 95 B 103 NaN
2018-02-05 97 B 82 NaN
2018-02-07 99 B 128 NaN
2018-02-08 100 B 123 NaN
2018-02-09 101 B 52 NaN
2018-02-10 102 B 18 NaN
2018-02-11 103 B 21 NaN
2018-02-12 104 B 50 NaN
2018-02-13 105 B 64 NaN
2018-02-15 107 B 185 NaN
2018-02-20 112 B 125 NaN
2018-02-21 113 B 108 NaN
2018-02-22 114 B 132 NaN
2018-02-23 115 B 180 NaN
2018-02-26 118 B 67 NaN
2018-02-28 120 B 192 -0.338542
2018-03-01 121 B 58 NaN
Perhaps there is a more concise and pythonic way of doing this.
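One nitpick on the output above: reindex(dates) leaves NaN on rows that are not month ends, so a trailing fillna(0) on monthly_return would reproduce the zeros from the desired output.
A more direct route, sketched below assuming df1 is built as in the question (the names last, ret, month_end and keys are mine): compute the month-end returns per label once, then align them back onto the daily rows through a (label, month-end) key, avoiding the per-group apply.
import pandas as pd

dfX = df1.copy(deep=True)

# last price of each (label, month), as in the question's own attempt
last = df1.groupby('label').resample('M')['prices'].last()
# % change computed within each label so it never crosses symbols,
# then shifted back one month to match the question's convention
ret = (last.groupby(level='label').pct_change()
           .groupby(level='label').shift(-1))

# each daily row's month-end date; MonthEnd(0) rolls forward, or stays
# put if the date already is a month end
month_end = dfX.index + pd.offsets.MonthEnd(0)
keys = pd.MultiIndex.from_arrays([dfX['label'], month_end])

# align the monthly returns onto the daily rows, then zero out every
# row that does not fall on its month end
dfX['monthly_return'] = ret.reindex(keys).values
dfX.loc[dfX.index != month_end, 'monthly_return'] = 0.0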
I have a data frame like this which is imported from a CSV.
stock pop
Date
2016-01-04 325.316 82
2016-01-11 320.036 83
2016-01-18 299.169 79
2016-01-25 296.579 84
2016-02-01 295.334 82
2016-02-08 309.777 81
2016-02-15 317.397 75
2016-02-22 328.005 80
2016-02-29 315.504 81
2016-03-07 328.802 81
2016-03-14 339.559 86
2016-03-21 352.160 82
2016-03-28 348.773 84
2016-04-04 346.482 83
2016-04-11 346.980 80
2016-04-18 357.140 75
2016-04-25 357.439 77
2016-05-02 356.443 78
2016-05-09 365.158 78
2016-05-16 352.160 72
2016-05-23 344.540 74
2016-05-30 354.998 81
2016-06-06 347.428 77
2016-06-13 341.053 78
2016-06-20 363.515 80
2016-06-27 349.669 80
2016-07-04 371.583 82
2016-07-11 358.335 81
2016-07-18 362.021 79
2016-07-25 368.844 77
... ... ...
I wanted to add a new column MA that holds the rolling mean of the column pop. I tried the following:
df['MA']=data.rolling(5,on='pop').mean()
I get an error
ValueError: Wrong number of items passed 2, placement implies 1
So I thought I would check whether it works without assigning to a column. I used
data.rolling(5,on='pop').mean()
I got the output
stock pop
Date
2016-01-04 NaN 82
2016-01-11 NaN 83
2016-01-18 NaN 79
2016-01-25 NaN 84
2016-02-01 307.2868 82
2016-02-08 304.1790 81
2016-02-15 303.6512 75
2016-02-22 309.4184 80
2016-02-29 313.2034 81
2016-03-07 319.8970 81
2016-03-14 325.8534 86
2016-03-21 332.8060 82
2016-03-28 336.9596 84
2016-04-04 343.1552 83
2016-04-11 346.7908 80
2016-04-18 350.3070 75
2016-04-25 351.3628 77
2016-05-02 352.8968 78
2016-05-09 356.6320 78
2016-05-16 357.6680 72
2016-05-23 355.1480 74
2016-05-30 354.6598 81
2016-06-06 352.8568 77
2016-06-13 348.0358 78
2016-06-20 350.3068 80
2016-06-27 351.3326 80
2016-07-04 354.6496 82
2016-07-11 356.8310 81
2016-07-18 361.0246 79
2016-07-25 362.0904 77
... ... ...
I can't seem to apply a rolling mean to the column pop. What am I doing wrong?
To assign a column, you can create a rolling object based on your Series:
df['new_col'] = data['column'].rolling(5).mean()
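The rolling result keeps the Series' original index, so it aligns with df on assignment; the first four rows come out NaN until the 5-row window fills (min_periods can lower that threshold).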
The answer posted by ac2001 is not the most performant way of doing this: it calculates a rolling mean on every column in the dataframe and then assigns the "ma" column from the "pop" column. The first of the following two methods is much more efficient:
%timeit df['ma'] = data['pop'].rolling(5).mean()
%timeit df['ma_2'] = data.rolling(5).mean()['pop']
1000 loops, best of 3: 497 µs per loop
100 loops, best of 3: 2.6 ms per loop
I would not recommend using the second method unless you need to store computed rolling means on all other columns.
Edit: pd.rolling_mean is deprecated and will be removed from pandas in the future. Instead, use the DataFrame.rolling() method:
df['MA'] = df['pop'].rolling(window=5,center=False).mean()
for a dataframe df:
Date stock pop
0 2016-01-04 325.316 82
1 2016-01-11 320.036 83
2 2016-01-18 299.169 79
3 2016-01-25 296.579 84
4 2016-02-01 295.334 82
5 2016-02-08 309.777 81
6 2016-02-15 317.397 75
7 2016-02-22 328.005 80
8 2016-02-29 315.504 81
9 2016-03-07 328.802 81
To get:
Date stock pop MA
0 2016-01-04 325.316 82 NaN
1 2016-01-11 320.036 83 NaN
2 2016-01-18 299.169 79 NaN
3 2016-01-25 296.579 84 NaN
4 2016-02-01 295.334 82 82.0
5 2016-02-08 309.777 81 81.8
6 2016-02-15 317.397 75 80.2
7 2016-02-22 328.005 80 80.4
8 2016-02-29 315.504 81 79.8
9 2016-03-07 328.802 81 79.6
Documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
Old: although it is deprecated, you can use:
df['MA']=pd.rolling_mean(df['pop'], window=5)
to get:
Date stock pop MA
0 2016-01-04 325.316 82 NaN
1 2016-01-11 320.036 83 NaN
2 2016-01-18 299.169 79 NaN
3 2016-01-25 296.579 84 NaN
4 2016-02-01 295.334 82 82.0
5 2016-02-08 309.777 81 81.8
6 2016-02-15 317.397 75 80.2
7 2016-02-22 328.005 80 80.4
8 2016-02-29 315.504 81 79.8
9 2016-03-07 328.802 81 79.6
Documentation: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_mean.html
This solution worked for me.
data['MA'] = data.rolling(5).mean()['pop']
I think the issue is that on='pop' merely changes which column the rolling window is computed along, rather than using the index.
From the doc string: " For a DataFrame, column on which to calculate the rolling window, rather than the index"
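For completeness, on= is meant for the case where the dates live in a regular column rather than in the index, usually combined with a time-based window. A small sketch (choosing a '35D' window, which here covers five weekly rows, is my own illustration, not from the question):
weekly = data.reset_index()          # 'Date' becomes an ordinary column
weekly['MA'] = weekly.rolling('35D', on='Date')['pop'].mean()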