How does pandas rolling std calculate?

I am using a.rolling(5).std() to get the rolling standard deviation of a pd.Series a with a window of size 5,
but I found the result is not what I expected.
Here is an example:
In [15]: a = [-49, -50, -50, -51, -48]
In [16]: pd.Series(a).rolling(5).std()
Out[16]:
0 NaN
1 NaN
2 NaN
3 NaN
4 1.140175
dtype: float64
In [17]: np.std(a)
Out[17]: 1.0198039027185568
I think the last element of pd.Series(a).rolling(5).std() should be equal to np.std(a),
so why isn't it?

This is probably due to Pandas normalizing by N - 1 instead of N. See the first note at https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.std.html
You can change this behavior using the degrees of freedom argument, ddof, e.g. pd.Series(a).rolling(5).std(ddof=0).
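A quick check confirms this, reproducing the example above:
import numpy as np
import pandas as pd

a = [-49, -50, -50, -51, -48]

# pandas defaults to the sample standard deviation (ddof=1)
print(pd.Series(a).rolling(5).std().iloc[-1])        # 1.140175...
print(np.std(a, ddof=1))                             # 1.140175...

# ddof=0 reproduces numpy's default (population) standard deviation
print(pd.Series(a).rolling(5).std(ddof=0).iloc[-1])  # 1.019803...
print(np.std(a))                                     # 1.019803...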

Pandas series: conditional rolling standard deviation

I have a Pandas series of random numbers from -1 to +1:
from pandas import Series
from random import random
x = Series([random() * 2 - 1. for i in range(1000)])
x
Output:
0 -0.499376
1 -0.386884
2 0.180656
3 0.014022
4 0.409052
...
995 -0.395711
996 -0.844389
997 -0.508483
998 -0.156028
999 0.002387
Length: 1000, dtype: float64
I can get the rolling standard deviation of the full Series easily:
x.rolling(30).std()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
995 0.575365
996 0.580220
997 0.580924
998 0.577202
999 0.576759
Length: 1000, dtype: float64
But what I would like to do is get the standard deviation of only the positive numbers within the rolling window. In our example the window length is 30; if, say, only 15 of those numbers are positive, I want the standard deviation of just those 15.
One could remove all negative numbers from the Series and calculate the rolling standard deviation:
x[x > 0].rolling(30).std()
Output:
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
...
988 0.286056
990 0.292455
991 0.283842
994 0.291798
999 0.291824
Length: 504, dtype: float64
...But this isn't the same thing: here every window contains exactly 30 positive numbers, whereas in what I want the number of positive numbers per window varies.
I want to avoid iterating over the Series; I was hoping there might be a more Pythonic way to solve my problem. Can anyone help?
Mask the non-positive values with NaN, then calculate the rolling std with min_periods=1 and optionally set the first 29 values to NaN.
import numpy as np

w = 30
s = x.mask(x <= 0).rolling(w, min_periods=1).std()
s.iloc[:w - 1] = np.nan  # optional: treat the first w - 1 incomplete windows as NaN
Note
Passing min_periods=1 is important here because some windows will contain fewer non-null values than the window length, and without it those windows would come out as NaN.
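A small illustration of that with made-up numbers (a sketch, not the question's data):
import numpy as np
import pandas as pd

s = pd.Series([0.5, -0.2, 0.1, 0.8, -0.6, 0.3])
masked = s.mask(s <= 0)  # non-positive values become NaN

# the default min_periods equals the window size, so every window that
# contains a masked value comes out as NaN -- here, all of them
print(masked.rolling(3).std())

# min_periods=1 uses whatever non-null values each window has; windows with
# fewer than two non-null values still give NaN because std needs two points
print(masked.rolling(3, min_periods=1).std())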
Another possible solution:
pd.Series(np.where(x >= 0, x, np.nan)).rolling(30, min_periods=1).std()
Output:
0 NaN
1 NaN
2 NaN
3 0.441567
4 0.312562
...
995 0.323768
996 0.312461
997 0.304077
998 0.308342
999 0.301742
Length: 1000, dtype: float64
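One small caveat, not part of the original answer: np.where returns a plain array, so the wrapping pd.Series gets a fresh RangeIndex. That happens to coincide with x's index here, but if x had a non-default index you would want to carry it over:
pd.Series(np.where(x >= 0, x, np.nan), index=x.index).rolling(30, min_periods=1).std()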
You may first turn non-positive values into np.nan, then apply np.nanstd to each window. So
x[x.values <= 0] = np.nan
rolling_list = [np.nanstd(window.to_list()) for window in x.rolling(window=30)]
will return
[0.0,
0.0,
0.38190115685808856,
0.38190115685808856,
0.38190115685808856,
0.3704840425749437,
0.33234158296550925,
0.33234158296550925,
0.3045579286056045,
0.2962826377559198,
0.275920580105683,
0.29723758167880554,
0.29723758167880554,
0.29723758167880554,
0.29723758167880554,
0.29723758167880554
...]
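One caveat worth adding, not from the original answer: np.nanstd defaults to ddof=0, whereas pandas' rolling std uses ddof=1, so these numbers will differ slightly from the other answers unless you pass ddof=1:
# match pandas' rolling std (sample std); windows with a single non-NaN
# value then give nan instead of 0.0
rolling_list = [np.nanstd(window.to_list(), ddof=1) for window in x.rolling(window=30)]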
IIUC, you want to calculate the std of only the positive values in each rolling window:
out = x.rolling(30).apply(lambda w: w[w>0].std())
print(out)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
995 0.324031
996 0.298276
997 0.294917
998 0.304506
999 0.308050
Length: 1000, dtype: float64

I made code that copies data from a smaller dataframe into a larger dataframe based on date but it is slow

The code I have works but it is very slow. I have a large dataframe based on days and a smaller dataframe with the same data averaged into weekly/monthly/yearly intervals. I am moving the change in direction ("Turning Point") from the yearly dataframe (tempTimeScale) to the daily dataframe, based on when it changed on that day of the year rather than at the start/end of the year.
Is there a way to make it run faster?
import numpy as np
import pandas as pd
d = {"Turning Point Up": [10, np.nan, np.nan, 17, np.nan]}
dailyData = pd.DataFrame(data=d)
y = {"Turning Point Up": [17]}
tempTimeScale = pd.DataFrame(data=y)
tempTimeScale
def align(additive):
    # slow: nested iterrows loops compare every daily row against every yearly row
    for indexD, rowD in dailyData.iterrows():
        for indexY, rowY in tempTimeScale.iterrows():
            if rowD["Turning Point Up"] == rowY["Turning Point Up"]:
                dailyData.at[indexD, "Turning Point Up Y"] = rowY["Turning Point Up"]
o = {"Turning Point Up": [10, np.nan, np.nan, 17, np.nan], "Turning Point Up Y": [np.nan, np.nan, np.nan, 17, np.nan]}
exampleoutput = pd.DataFrame(data=o)
exampleoutput
Try:
m = dailyData["Turning Point Up"].isin(tempTimeScale["Turning Point Up"])  # True where the daily value appears in the yearly frame
dailyData["Turning Point Up Y"] = dailyData.loc[m, "Turning Point Up"]  # index-aligned assignment; non-matching rows become NaN
print(dailyData)
Prints:
Turning Point Up Turning Point Up Y
0 10.0 NaN
1 NaN NaN
2 NaN NaN
3 17.0 17.0
4 NaN NaN
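The assignment works because pandas aligns on the index when a shorter Series is assigned as a column: rows missing from the right-hand side are filled with NaN. A minimal sketch of that mechanism (toy data, not the question's frames):
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [10.0, np.nan, 17.0]})
mask = df["a"].isin([17.0])   # True only where the value appears in the smaller frame

# .loc[mask, "a"] keeps just the matching rows (here, only row 2); assigning it
# back as a new column reindexes against df, so the other rows become NaN
df["b"] = df.loc[mask, "a"]
print(df)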
Based on your expected output of:
Turning Point Up Turning Point Up Y
0 10.0 NaN
1 NaN NaN
2 NaN NaN
3 17.0 17.0
4 NaN NaN
you're setting Turning Point Up Y to the same value as Turning Point Up when there's a match, and NaN otherwise. Is this what you actually want, or do you want some sort of "indicator" for which values match in both? If it's the latter, then use the method in @Andrej's answer to set a flag indicating that:
dailyData["Turning Point Up Y"] = dailyData["Turning Point Up"].isin(tempTimeScale["Turning Point Up"])
Result:
Turning Point Up Turning Point Up Y
0 10.0 False
1 NaN False
2 NaN False
3 17.0 True
4 NaN False

Pandas: How to find the average length of days for a local outbreak to peak in a COVID-19 dataframe?

Let's say I have this dataframe containing the difference in number of active cases from previous value in each country:
[in]
import pandas as pd
import numpy as np
active_cases = {'Day(s) since outbreak':['0', '1', '2', '3', '4', '5'], 'Australia':[np.NaN, 10, 10, -10, -20, -20], 'Albania':[np.NaN, 20, 0, 15, 0, -20], 'Algeria':[np.NaN, 25, 10, -10, 20, -20]}
df = pd.DataFrame(active_cases)
df
[out]
Day(s) since outbreak Australia Albania Algeria
0 0 NaN NaN NaN
1 1 10.0 20.0 25.0
2 2 10.0 0.0 10.0
3 3 -10.0 15.0 -10.0
4 4 -20.0 0.0 20.0
5 5 -20.0 -20.0 -20.0
I need to find the average number of days it takes for a local outbreak to peak in this COVID-19 dataframe.
My idea is to find the row of the first negative value in each column (e.g., row 3 for 'Australia', row 5 for 'Albania') and average those row numbers.
However, I have no idea how to do this in pandas/Python.
Is there a way to perform this task with a few simple lines of Python/pandas code?
You can set_index on the column Day(s) since outbreak, use iloc to select all rows except the first one, and check where the values are less than 0 with lt. idxmax then gives, per column, the first row label where the value is negative; take the mean of those. With your input, it gives:
print (df.set_index('Day(s) since outbreak')\
.iloc[1:, :].lt(0).idxmax().astype(float).mean())
3.6666666666666665
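Spelled out step by step, the same chain looks like this (an expanded sketch of the one-liner above; variable names are illustrative):
import numpy as np
import pandas as pd

active_cases = {'Day(s) since outbreak': ['0', '1', '2', '3', '4', '5'],
                'Australia': [np.nan, 10, 10, -10, -20, -20],
                'Albania': [np.nan, 20, 0, 15, 0, -20],
                'Algeria': [np.nan, 25, 10, -10, 20, -20]}
df = pd.DataFrame(active_cases)

# index by day so idxmax returns the day label rather than a positional index
by_day = df.set_index('Day(s) since outbreak').iloc[1:, :]

# boolean frame: True where the day-on-day change is negative
declining = by_day.lt(0)

# idxmax gives, per column, the label of the first True
first_drop = declining.idxmax()         # Australia -> '3', Albania -> '5', Algeria -> '3'

print(first_drop.astype(float).mean())  # (3 + 5 + 3) / 3 = 3.666...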
IIUC: use df.where to keep the negative values, replacing everything else with np.NaN, and then calculate the mean.
cols= ['Australia','Albania','Algeria']
df.set_index('Day(s) since outbreak', inplace=True)
m = df < 0
df2 = df.where(m, np.NaN)
#df2 = df2.replace(0, np.NaN)
df2.mean()
Result

Pandas describe vs scipy.stats percentileofscore with NaN?

I'm having a weird situation where df.describe gives me percentile markers that disagree with scipy.stats.percentileofscore, because of NaNs, I think.
My df is:
f_recommend
0 3.857143
1 4.500000
2 4.458333
3 NaN
4 3.600000
5 NaN
6 4.285714
7 3.587065
8 4.200000
9 NaN
When I run df.describe(percentiles=[.25, .5, .75]) I get:
f_recommend
count 7.000000
mean 4.069751
std 0.386990
min 3.587065
25% 3.728571
50% 4.200000
75% 4.372024
max 4.500000
I get the same values when I run with NaN removed.
When I look up a specific value, however, running scipy.stats.percentileofscore(df['f_recommend'], 3.61, kind = 'mean') gives the 28th percentile with NaN and the 20th without.
Any thoughts to explain this discrepancy?
ETA:
I don't believe that the problem is that we're calculating percentiles differently. Because that only matters when you're calculating percentiles of the same 2 numbers in different ways. But here, describe gives 25 percentile as 3.72. So there is absolutely no way that 3.61 can be 28th percentile. None of the formulas should give that.
In particular, when I use describe on the 6 values without NaN, I get the same values, so that's ignoring NaN, which is fine. But when I run percentile of score without the NaN I get a number that doesn't match.
ETA 2:
Simpler example:
In [48]: d = pd.DataFrame([1,2,3,4,5,6,7])
In [49]: d.describe()
Out[49]:
0
count 7.000000
mean 4.000000
std 2.160247
min 1.000000
25% 2.500000
50% 4.000000
75% 5.500000
max 7.000000
In [50]: sp.stats.percentileofscore(d[0], 2.1, kind = 'mean')
Out[50]: 28.571428571428573
the "kind" argument doesn't matter because 2.1 is unique.
scipy.stats.percentileofscore does not ignore nan, nor does it check for the value and handle it in some special way. It is just another floating point value in your data. This means the behavior of percentileofscore with data containing nan is undefined, because of the behavior of nan in comparisons:
In [44]: np.nan > 0
Out[44]: False
In [45]: np.nan < 0
Out[45]: False
In [46]: np.nan == 0
Out[46]: False
In [47]: np.nan == np.nan
Out[47]: False
Those results are all correct--that is how nan is supposed to behave. But that means, in order to know how percentileofscore handles nan, you have to know how the code does comparisons. And that is an implementation detail that you shouldn't have to know, and that you can't rely on to be the same in future versions of scipy.
If you investigate the behavior of percentileofscore, you'll find that it behaves as if nan were infinite. For example, if you replace nan with a value larger than any other value in the input, you'll get the same results:
In [53]: percentileofscore([10, 20, 25, 30, np.nan, np.nan], 18)
Out[53]: 16.666666666666664
In [54]: percentileofscore([10, 20, 25, 30, 999, 999], 18)
Out[54]: 16.666666666666664
Unfortunately, you can't rely on this behavior. If the implementation changes in the future, nan might end up behaving like negative infinity, or have some other unspecified behavior.
The solution to this "problem" is simple: don't give percentileofscore any nan values. You'll have to clean up your data first. Note that this can be as simple as:
result = percentileofscore(a[~np.isnan(a)], score)
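For example, reusing the list from above (a sketch; the exact behavior with NaNs present depends on the SciPy version):
import numpy as np
from scipy.stats import percentileofscore

a = np.array([10, 20, 25, 30, np.nan, np.nan])

# with NaNs present the result is undefined (16.67 at the time of writing,
# because the NaNs behaved like +inf); newer SciPy versions may differ
print(percentileofscore(a, 18))

# drop the NaNs first: 1 of the 4 remaining values lies below 18, so 25.0
print(percentileofscore(a[~np.isnan(a)], 18))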
The answer is very simple: there is no universally accepted formula for computing percentiles, in particular when your data contains ties or when it cannot be broken down perfectly into equal-size buckets.
For instance, have a look at the documentation in R. There are more than seven types of formulas! https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html
At the end, it comes down to understanding which formula is used and whether the differences are big enough to be a problem in your case.
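NumPy exposes several of these rules directly, which makes the point easy to see: on [1, 2, 3, 4, 5, 6, 7], 'linear' gives a 25th percentile of 2.5 (matching describe), while 'lower' and 'higher' give 2 and 3 (a sketch; the method= keyword assumes NumPy >= 1.22, older versions call it interpolation):
import numpy as np

d = np.array([1, 2, 3, 4, 5, 6, 7])

# different interpolation rules, different "25th percentiles"
for m in ['linear', 'lower', 'higher', 'midpoint', 'nearest']:
    print(m, np.percentile(d, 25, method=m))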

pandas DataFrame Dividing a column by itself

I have a pandas dataframe that I filled with this:
import pandas.io.data as web
test = web.get_data_yahoo('QQQ')
The dataframe looks like this in iPython:
In [13]: test
Out[13]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 729 entries, 2010-01-04 00:00:00 to 2012-11-23 00:00:00
Data columns:
Open 729 non-null values
High 729 non-null values
Low 729 non-null values
Close 729 non-null values
Volume 729 non-null values
Adj Close 729 non-null values
dtypes: float64(5), int64(1)
When I divide one column by another, I get a float64 result that has a satisfactory number of decimal places. I can even divide one column by another column offset by one, for instance test.Open[1:]/test.Close[:], and get a satisfactory number of decimal places. When I divide a column by itself offset, however, I get just 1:
In [83]: test.Open[1:] / test.Close[:]
Out[83]:
Date
2010-01-04 NaN
2010-01-05 0.999354
2010-01-06 1.005635
2010-01-07 1.000866
2010-01-08 0.989689
2010-01-11 1.005393
...
In [84]: test.Open[1:] / test.Open[:]
Out[84]:
Date
2010-01-04 NaN
2010-01-05 1
2010-01-06 1
2010-01-07 1
2010-01-08 1
2010-01-11 1
I'm probably missing something simple. What do I need to do in order to get a useful value out of that sort of calculation? Thanks in advance for the assistance.
If you're looking to do operations between the column and lagged values, you should be doing something like test.Open / test.Open.shift().
shift realigns the data and takes an optional number of periods.
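A minimal, self-contained sketch of that idea (made-up prices rather than Yahoo data, column names as in the question):
import pandas as pd

test = pd.DataFrame(
    {"Open": [46.0, 46.5, 45.8, 46.2],
     "Close": [46.3, 46.1, 46.0, 46.4]},
    index=pd.to_datetime(["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07"]),
)

# each day's Open divided by the previous day's Open; the first row is NaN
print(test.Open / test.Open.shift())

# each day's Open divided by the previous day's Close (what the slicing
# attempt in the question seems to be after)
print(test.Open / test.Close.shift())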
You may not be getting what you think you are when you do test.Open[1:]/test.Close. Pandas matches up the rows based on their index, so you're still getting each element of one column divided by its corresponding element in the other column (not the element one row back). Here's an example:
>>> print d
A B C
0 1 3 7
1 -2 1 6
2 8 6 9
3 1 -5 11
4 -4 -2 0
>>> d.A / d.B
0 0.333333
1 -2.000000
2 1.333333
3 -0.200000
4 2.000000
>>> d.A[1:] / d.B
0 NaN
1 -2.000000
2 1.333333
3 -0.200000
4 2.000000
Notice that the values returned are the same for both operations. The second one just has NaN in the first row, since there was no corresponding value in the first operand.
If you really want to operate on offset rows, you'll need to dig down to the numpy arrays that underpin the pandas DataFrame, to bypass pandas's index-aligning features. You can get at these innards with the values attribute of a column.
>>> d.A.values[1:] / d.B.values[:-1]
array([-0.66666667, 8. , 0.16666667, 0.8 ])
Now you really are getting each value divided by the one before it in the other column. Note that here you have to explicitly slice the second operand to leave off the last element, to make them equal in length.
So you can do the same to divide a column by an offset version of itself:
>>> d.A.values[1:] / d.A.values[:-1]
array([-2. , -4. , 0.125, -4. ])
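For completeness, the index-aware way to get those same ratios is shift, as in the other answer; it keeps the result aligned as a Series instead of dropping down to a bare array:
>>> d.A / d.A.shift()   # NaN, -2.0, -4.0, 0.125, -4.0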
