https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.rolling_quantile.html
I cannot see how best to ignore NaNs in the rolling percentile function. Would anyone know?
seriestest = pd.Series([1, 5, 7, 2, 4, 6, 9, 3, 8, 10])
and insert nans
seriestest2 = pd.Series([1, 5, np.NaN, 2, 4, np.nan, 9, 3, 8, 10])
Now, on the first series, I get expected output, using:
seriestest.rolling(window = 3).quantile(.5)
But, I wish to do the same and ignore NaNs on the test2 series.
seriestest2.rolling(window = 3).quantile(.5)
Gives:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 8.0
9 8.0
dtype: float64
But I think it would give something like this if we could pass skipna=True, which doesn't work for me:
0 NaN
1 NaN
2 5.0
3 2.0
4 4.0
5 4.0
6 4.0
7 3.0
8 8.0
9 8.0
dtype: float64
The issue is that NaN values leave fewer than the required number of valid elements (3) in your rolling window. By default min_periods equals the window size, so any window containing a NaN returns NaN. You can lower the minimum number of valid observations by setting the min_periods parameter of rolling.
seriestest2.rolling(window=3, min_periods=1).quantile(.5)
Alternatively, if you simply want to replace nan values, with say 0, you can use fillna:
seriestest2.fillna(value=0).rolling(window=3).quantile(.5)
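To make the difference between the two suggestions concrete, here is a minimal sketch that runs the default call, the min_periods=1 call, and the fillna(0) call side by side on the series from the question. The variable names (s, strict, relaxed, filled) are only illustrative and do not come from the original posts.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 5, np.nan, 2, 4, np.nan, 9, 3, 8, 10])

# Default: min_periods equals the window size, so a NaN in the window yields NaN.
strict = s.rolling(window=3).quantile(0.5)

# Relaxed: one valid observation is enough; NaNs inside the window are skipped.
relaxed = s.rolling(window=3, min_periods=1).quantile(0.5)

# Filled: NaNs become 0 and take part in the median like any other value.
filled = s.fillna(0).rolling(window=3).quantile(0.5)

print(pd.DataFrame({"strict": strict, "relaxed": relaxed, "filled": filled}))
```

Note that filling with 0 pulls the median down, while min_periods=1 only uses the values that are actually present, so the two approaches answer different questions.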
I have a dataset in which few values are null. I want to change them to either 4 or 5 randomly in specific rows. How do I do that?
data.replace(np.nan, np.random.randint(4,5))
I tried this, and every NaN value changed to 4 only, not to 4 and 5 at random. Also, I don't know how to replace NaN values only in specific rows, such as rows 1, 4, 5, and 8.
Use loc and select by index and isna. The upper bound of np.random.randint is exclusive, so randint(4,5) can only return 4; change it to (4,6) to get both fours and fives.
import pandas as pd
import numpy as np
data = {
'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'B': [0, np.nan, 1, 2.0, 2, np.nan, 3, 2.0, 7, np.nan]}
df = pd.DataFrame(data)
# A B
# 1 0.0
# 2 NaN
# 3 1.0
# 4 2.0
# 5 2.0
# 6 NaN
# 7 3.0
# 8 2.0
# 9 7.0
# 10 NaN
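# If index is 1 or 5, and the value is NaN, change B to 4 or 5
mask = df.index.isin([1, 5]) & pd.isna(df["B"])
# size=mask.sum() draws one value per matching row instead of one shared value
df.loc[mask, "B"] = np.random.randint(4, 6, size=mask.sum())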
# A B
# 1 0.0
# 2 4.0
# 3 1.0
# 4 2.0
# 5 2.0
# 6 4.0
# 7 3.0
# 8 2.0
# 9 7.0
# 10 NaN
Consider this simple example
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9,1,3,5,7,4,5,6,4,7,8,9],
'b':[3,5,6,2,4,6,2,5,7,1,9,5,3,2,5,4,3,6,4,1,1,9]})
I am trying to perform a rolling regression of a on b. I am trying to use the simplest pandas tool available: apply. I want to use apply because I want to keep the flexibility of returning any parameter of the regression.
However, the simple code below does not work
df.rolling(10).apply(lambda x: smf.ols('a ~ b', data = x).fit())
File "<string>", line 1, in <module>
PatsyError: Error evaluating factor: NameError: name 'b' is not defined
a ~ b
^
What is the issue?
Thanks!
Rolling.apply cannot work on multiple columns at once, nor can it return non-numeric values. Instead, we can take advantage of the fact that rolling objects are iterable. We also need to handle min_periods ourselves, because iterating over a rolling object yields every window, including the incomplete ones at the start, regardless of the other rolling arguments.
We can then write a function that turns each window's regression results into one row of output, something like:
def process(x):
    if len(x) >= 10:
        reg = smf.ols('a ~ b', data=x).fit()
        print(reg.params)
        return [
            # b from params
            reg.params['b'],
            # b from tvalues
            reg.tvalues['b'],
            # Both lower and upper b from conf_int()
            *reg.conf_int().loc['b', :].tolist()
        ]
    # Return NaN in the same dimension as the results
    return [np.nan] * 4
df = df.join(
    # join new DataFrame back to original
    pd.DataFrame(
        (process(x) for x in df.rolling(10)),
        columns=['coef', 't', 'lower', 'upper']
    )
)
df:
a b coef t lower upper
0 1 3 NaN NaN NaN NaN
1 3 5 NaN NaN NaN NaN
2 5 6 NaN NaN NaN NaN
3 7 2 NaN NaN NaN NaN
4 4 4 NaN NaN NaN NaN
5 5 6 NaN NaN NaN NaN
6 6 2 NaN NaN NaN NaN
7 4 5 NaN NaN NaN NaN
8 7 7 NaN NaN NaN NaN
9 8 1 -0.216802 -0.602168 -1.047047 0.613442
10 9 9 0.042781 0.156592 -0.587217 0.672778
11 1 5 0.032086 0.097763 -0.724742 0.788913
12 3 3 0.113475 0.329006 -0.681872 0.908822
13 5 2 0.198582 0.600297 -0.564258 0.961421
14 7 5 0.203540 0.611002 -0.564646 0.971726
15 4 4 0.236599 0.686744 -0.557872 1.031069
16 5 3 0.293651 0.835945 -0.516403 1.103704
17 6 6 0.314286 0.936382 -0.459698 1.088269
18 4 4 0.276316 0.760812 -0.561191 1.113823
19 7 1 0.346491 1.028220 -0.430590 1.123572
20 8 1 -0.492424 -1.234601 -1.412181 0.427332
21 9 9 0.235075 0.879433 -0.381326 0.851476
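The window-iteration idea can also be tried without statsmodels. The following is only a sketch under that assumption: it swaps in np.polyfit to estimate the slope of a on b for each full window, and the helper name slope is hypothetical, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
    'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9],
})

WINDOW = 10


def slope(window):
    """Least-squares slope of a on b for one window (hypothetical helper)."""
    # Iterating a rolling object also yields the short windows at the start,
    # so the min_periods logic has to be applied by hand.
    if len(window) < WINDOW:
        return np.nan
    # polyfit returns [slope, intercept] for a degree-1 fit.
    return np.polyfit(window['b'], window['a'], deg=1)[0]


# Each item produced by iterating df.rolling(...) is a DataFrame with both columns.
coef = pd.Series([slope(w) for w in df.rolling(WINDOW)], index=df.index, name='coef')
print(coef)
```

This only recovers the coefficient; t-values and confidence intervals still need a regression library such as statsmodels.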
Setup:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
df = pd.DataFrame({
'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9]
})
Rolling.apply applies the function to each column separately, so the lambda never sees a and b together (see the related question).
Following user3226167's answer in this thread, the easiest way to accomplish what you want seems to be RollingOLS.from_formula from statsmodels.regression.rolling.
from statsmodels.regression.rolling import RollingOLS
df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9,1,3,5,7,4,5,6,4,7,8,9],
'b':[3,5,6,2,4,6,2,5,7,1,9,5,3,2,5,4,3,6,4,1,1,9]})
model = RollingOLS.from_formula('a ~ b', data = df, window=10)
reg_obj = model.fit()
# estimated coefficient
b_coeff = reg_obj.params['b'].rename('coef')
# b t-value
b_t_val = reg_obj.tvalues['b'].rename('t')
# 95 % confidence interval of b
b_conf_int = reg_obj.conf_int(cols=[1]).droplevel(level=0, axis=1)
# join all the desired information to the original df
df = df.join([b_coeff, b_t_val, b_conf_int])
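Here reg_obj is a RollingRegressionResults object, which holds a lot of information about the regression (see all its attributes in the docs). The confidence bounds differ from those in the per-window OLS answer above because RollingOLS appears to base them on the normal distribution by default; passing use_t=True to fit should switch to the t distribution.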
Output
>>> type(reg_obj)
<class 'statsmodels.regression.rolling.RollingRegressionResults'>
>>> df
a b coef t lower upper
0 1 3 NaN NaN NaN NaN
1 3 5 NaN NaN NaN NaN
2 5 6 NaN NaN NaN NaN
3 7 2 NaN NaN NaN NaN
4 4 4 NaN NaN NaN NaN
5 5 6 NaN NaN NaN NaN
6 6 2 NaN NaN NaN NaN
7 4 5 NaN NaN NaN NaN
8 7 7 NaN NaN NaN NaN
9 8 1 -0.216802 -0.602168 -0.922460 0.488856
10 9 9 0.042781 0.156592 -0.492679 0.578240
11 1 5 0.032086 0.097763 -0.611172 0.675343
12 3 3 0.113475 0.329006 -0.562521 0.789472
13 5 2 0.198582 0.600297 -0.449786 0.846949
14 7 5 0.203540 0.611002 -0.449372 0.856452
15 4 4 0.236599 0.686744 -0.438653 0.911851
16 5 3 0.293651 0.835945 -0.394846 0.982147
17 6 6 0.314286 0.936382 -0.343553 0.972125
18 4 4 0.276316 0.760812 -0.435514 0.988146
19 7 1 0.346491 1.028220 -0.313981 1.006963
20 8 1 -0.492424 -1.234601 -1.274162 0.289313
21 9 9 0.235075 0.879433 -0.288829 0.758978
for example:
import pandas as pd
df_1 = pd.DataFrame({"A":[1, 5, 3, 4, 2],
"B":[3, 2, 4, 3, 4],
"C":[2, 2, 7, 3, 4],
"D":[4, 3, 6, 12, 7]})
df_2 = pd.DataFrame(index = list(range(5)),columns = ['a','c'])
df_2.loc[2,['a','c']] = df_1.loc[2,['A','C']]
print(df_1.loc[2,['A','C']])
print(df_2)
I got:
A 3
C 7
Name: 2, dtype: int64
a c
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
Obviously I failed to set multiple cells at the same time in one row. Is there any way to achieve this? (except using loops)
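Index alignment is at work here: the labels of the assigned Series (A, C) do not match the target columns (a, c), so pandas aligns them and assigns missing values, which leaves the row unchanged. The solution is to assign a NumPy array, which carries no labels and so avoids alignment: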
df_2.loc[2,['a','c']] = df_1.loc[2,['A','C']].values
print (df_2)
a c
0 NaN NaN
1 NaN NaN
2 3 7
3 NaN NaN
4 NaN NaN
If you rename the labels so that they match, it also works:
df_2.loc[2,['a','c']] = df_1.loc[2,['A','C']].rename({'A':'a','C':'c'})
#alternative
#df_2.loc[2,['a','c']] = df_1.rename(columns={'A':'a','C':'c'}).loc[2,['a','c']]
print (df_2)
a c
0 NaN NaN
1 NaN NaN
2 3 7
3 NaN NaN
4 NaN NaN
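As a compact illustration of the alignment point, here is a sketch that performs both assignments on copies of the same empty frame. It uses to_numpy() in place of .values, and the names aligned and positional are illustrative only.

```python
import pandas as pd

df_1 = pd.DataFrame({"A": [1, 5, 3, 4, 2], "C": [2, 2, 7, 3, 4]})
df_2 = pd.DataFrame(index=range(5), columns=["a", "c"])

# Label-based assignment: the Series is indexed by A/C, the target by a/c,
# so, as the question shows, alignment finds no match and row 2 stays NaN.
aligned = df_2.copy()
aligned.loc[2, ["a", "c"]] = df_1.loc[2, ["A", "C"]]

# Positional assignment: a bare array has no labels to align on,
# so the values are written in order.
positional = df_2.copy()
positional.loc[2, ["a", "c"]] = df_1.loc[2, ["A", "C"]].to_numpy()

print(aligned)
print(positional)
```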
I have a pandas dataframe with two dimensions. I want to calculate the rolling standard deviation along axis 1 while also including datapoints in the rows above and below.
So say I have this df:
data = {'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want a rectangular window 3 rows high and 2 columns across, moving from left to right. So, for example,
std_df.loc[1, 'C']
would be equal to
np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])
But I have no idea how to achieve this without very slow iteration.
Looks like what you want is DataFrame.shift
import pandas as pd
import numpy as np
data = {'A': [1,2,3,4], 'B': [5,6,7,8], 'C': [9,10,11,12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Shifting the dataframe you provided by 1 yields the row above
print(df.shift(1))
A B C
0 NaN NaN NaN
1 1.0 5.0 9.0
2 2.0 6.0 10.0
3 3.0 7.0 11.0
Similarly, shifting the dataframe you provided by -1 yields the row below
print(df.shift(-1))
A B C
0 2.0 6.0 10.0
1 3.0 7.0 11.0
2 4.0 8.0 12.0
3 NaN NaN NaN
so the code below should do what you're looking for (add_prefix prefixes the column names to make them unique)
above_df = df.shift(1).add_prefix('above_')
below_df = df.shift(-1).add_prefix('below_')
lagged = pd.concat([df, above_df, below_df], axis=1)
lagged['std'] = lagged.apply(np.std, axis=1)
print(lagged)
A B C above_A above_B above_C below_A below_B below_C std
0 1 5 9 NaN NaN NaN 2.0 6.0 10.0 3.304038
1 2 6 10 1.0 5.0 9.0 3.0 7.0 11.0 3.366502
2 3 7 11 2.0 6.0 10.0 4.0 8.0 12.0 3.366502
3 4 8 12 3.0 7.0 11.0 NaN NaN NaN 3.304038
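A slightly more direct variant is sketched below, assuming the same data. It calls DataFrame.std with ddof=0 (the population standard deviation, which is what np.std computes) instead of apply, and checks the result against the value the question asks for. The name row_std is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

# Put each row next to the row above it and the row below it.
lagged = pd.concat(
    [df, df.shift(1).add_prefix("above_"), df.shift(-1).add_prefix("below_")],
    axis=1,
)

# ddof=0 matches np.std; NaNs at the edges are skipped by default.
row_std = lagged.std(axis=1, ddof=0)

# Row 1 covers rows 0..2 of every column, as in the question.
expected = np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])
print(row_std.loc[1], expected)
```

Like the answer above, this spans all columns of the frame; restricting the window to a subset of columns would mean selecting those columns before shifting.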
I'd like to generate a series that's the incremental mean of a timeseries. Meaning that, starting from the first date (index 0), the mean stored in row x is the average of values [0:x]
data
index  value  mean         formula
0      4
1      5
2      6
3      7      5.5          average(0-3)
4      4      5.2          average(0-4)
5      5      5.166666667  average(0-5)
6      6      5.285714286  average(0-6)
7      7      5.5          average(0-7)
I'm hoping there's a way to do this without looping to take advantage of pandas.
Here's an update for newer versions of Pandas (starting with 0.18.0)
df['value'].expanding().mean()
or
s.expanding().mean()
As @TomAugspurger points out, in older versions of pandas you can use expanding_mean:
In [11]: s = pd.Series([4, 5, 6, 7, 4, 5, 6, 7])
In [12]: pd.expanding_mean(s, 4)
Out[12]:
0 NaN
1 NaN
2 NaN
3 5.500000
4 5.200000
5 5.166667
6 5.285714
7 5.500000
dtype: float64
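For current pandas, the same output (NaN for the first three rows, then the running mean) can probably be reproduced with the expanding method by passing min_periods. This is a sketch assuming the second argument of the old expanding_mean call was min_periods.

```python
import pandas as pd

s = pd.Series([4, 5, 6, 7, 4, 5, 6, 7])

# min_periods=4 keeps the first three results as NaN, like the old call above.
running_mean = s.expanding(min_periods=4).mean()
print(running_mean)
```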
Another approach is to use cumsum(), and divide by the cumulative number of items, for example:
In [1]:
s = pd.Series([4, 5, 6, 7, 4, 5, 6, 7])
s.cumsum() / pd.Series(np.arange(1, len(s)+1), s.index)
Out[1]:
0 4.000000
1 4.500000
2 5.000000
3 5.500000
4 5.200000
5 5.166667
6 5.285714
7 5.500000
dtype: float64