calculate percentile using rolling window pandas - python

I create a pandas dataframe as
import pandas as pd

df = pd.DataFrame(data=[[1],[2],[3],[1],[2],[3],[1],[2],[3]])
df
Out[19]:
0
0 1
1 2
2 3
3 1
4 2
5 3
6 1
7 2
8 3
I calculate the 75th percentile on windows of length 3:
df.rolling(window=3,center=False).quantile(0.75)
Out[20]:
0
0 NaN
1 NaN
2 2.0
3 2.0
4 2.0
5 2.0
6 2.0
7 2.0
8 2.0
Then, just to check, I calculate the 75th percentile on the first window separately:
df.iloc[0:3].quantile(0.75)
Out[22]:
0 2.5
Name: 0.75, dtype: float64
Why do I get a different value?

This is a bug, referenced in GH9413 and GH16211.
The reason, as given by the devs:

It looks like the difference here is that quantile and percentile take
the weighted average of the nearest points, whereas rolling_quantile
simply uses the nearest point (no averaging).

Rolling.quantile did not interpolate when computing the quantiles.
The bug has been fixed as of pandas 0.21.
For older versions, the workaround is a rolling apply:
df.rolling(window=3, center=False).apply(lambda x: pd.Series(x).quantile(0.75))
0
0 NaN
1 NaN
2 2.5
3 2.5
4 2.5
5 2.5
6 2.5
7 2.5
8 2.5
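To see where the two numbers come from, you can compare interpolation modes on a single window; a quick sketch (the pre-0.21 rolling behaviour corresponds to a non-interpolating mode such as 'lower'):
import pandas as pd

window = pd.Series([1, 2, 3])

# Default linear interpolation: the 75th percentile falls halfway
# between the two nearest order statistics, 2 and 3.
print(window.quantile(0.75))                         # 2.5

# A non-interpolating mode picks a single order statistic and
# reproduces the old rolling result.
print(window.quantile(0.75, interpolation='lower'))  # 2.0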

Related

pandas - ranking with tolerance?

Is there a way to rank values in a dataframe but considering a tolerance?
Say I have the following values
import pandas as pd

ex = pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96])
and if I run rank:
ex.rank(method='average')
0 2.0
1 3.0
2 1.0
3 6.0
4 5.0
5 4.0
dtype: float64
But what I'd like as a result would be (with a tolerance of 0.01):
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
Any way to define this tolerance?
Thanks
This function may work:
def rank_with_tolerance(sr, tolerance=0.01+1e-10, method='average'):
    # Sort the unique values, map each value that is within `tolerance`
    # of its predecessor onto that predecessor, then rank the series
    # through that mapping.
    vals = pd.Series(sr.unique()).sort_values()
    vals.index = vals
    vals = vals.mask(vals - vals.shift(1) <= tolerance, vals.shift(1))
    return sr.map(vals).fillna(sr).rank(method=method)
It works for your given input:
ex = pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
And with more complex sets it seems to work too:
ex = pd.Series([16.52, 19.95, 19.96, 19.95, 19.97, 19.97, 19.98])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 1.0
1 3.0
2 3.0
3 3.0
4 5.5
5 5.5
6 7.0
dtype: float64
You could do some sort of min-max scaling, i.e.
ex = pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96])
# You scale the values to be between 0 and 1
ex_scaled = (ex - min(ex)) / (max(ex) - min(ex))
# You put them on a scale from 1 to the length of your series
result = ex_scaled * len(ex) + 1
# result
0 1.335347
1 4.444109
2 1.000000
3 7.000000
4 4.969789
5 4.453172
That way you are still ranking, but values that are close to each other end up with ranks that are close to each other.
You can sort the values, merge the close ones and rank on that:
s = ex.drop_duplicates().sort_values()
mapper = (s.groupby(s.diff().abs().gt(0.011).cumsum(), sort=False)
           .transform('mean')
           .reindex_like(ex)
          )
out = mapper.rank(method='average')
N.B. I used 0.011 as the threshold, as floating-point arithmetic does not always provide enough precision to detect a value close to the threshold.
output:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
intermediate mapper:
0 16.520
1 19.955
2 16.150
3 22.770
4 20.530
5 19.955
dtype: float64
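For reuse, the sort/merge/rank recipe above can be wrapped in a small function; a sketch (the name rank_merged and its threshold parameter are illustrative, not a pandas API; like the snippet above, it assumes the series has no exact duplicates):
import pandas as pd

def rank_merged(sr, threshold=0.011, method='average'):
    # Start a new group whenever the gap between consecutive sorted
    # values exceeds the threshold, replace each value by its group
    # mean, and rank the mapped values.
    s = sr.drop_duplicates().sort_values()
    groups = s.diff().abs().gt(threshold).cumsum()
    mapper = s.groupby(groups, sort=False).transform('mean').reindex_like(sr)
    return mapper.rank(method=method)

ex = pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96])
print(rank_merged(ex))  # 2.0, 3.5, 1.0, 6.0, 5.0, 3.5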

Create dataframe with (for each cell) averages of other dataframes

I have a list of about 20 dataframes, all with the same structure (same rows and columns).
I want to create a new df, where each cell is equal to the average of the corresponding (same row/column) cells of the listed dfs.
So, for example, if we have just 2 dfs (A and B), I need the following:
A=
A B C D
0 7 6 8 7
1 7 0 7 6
2 9 2 7 0
B=
A B C D
0 6 9 2 7
1 4 4 5 7
2 6 8 5 4
Average=
A B C D
0 6.5 7.5 5.0 7.0
1 5.5 2.0 6.0 6.5
2 7.5 5.0 6.0 2.0
I tried this code, but it's pretty slow (the real dfs are quite large) and messes up the order of columns:
dfs = [A,B]
Average = pd.concat([each.stack() for each in dfs], axis=1)\
            .apply(lambda x: x.mean(), axis=1)\
            .unstack()
Is there a better alternative? Thanks
Use:
(A+B) / 2
Output
A B C D
0 6.5 7.5 5.0 7.0
1 5.5 2.0 6.0 6.5
2 7.5 5.0 6.0 2.0
For scaling up to more dfs, put all of them in a list and just use sum(list). Edit: based on @younggoti's recommendation:
list_of_df = [A,B]
sum(list_of_df)/len(list_of_df)
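If some of the frames may contain NaNs, sum(list_of_df) will propagate them; a concat/groupby sketch (reusing A and B from the question) instead averages whatever values are present in each cell:
import pandas as pd

# The two frames from the question.
A = pd.DataFrame({"A": [7, 7, 9], "B": [6, 0, 2], "C": [8, 7, 7], "D": [7, 6, 0]})
B = pd.DataFrame({"A": [6, 4, 6], "B": [9, 4, 8], "C": [2, 5, 5], "D": [7, 7, 4]})

# Stack the frames and average cell-wise by the original row label;
# groupby().mean() skips NaNs instead of propagating them.
average = pd.concat([A, B]).groupby(level=0).mean()
print(average)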

Pandas: re-index and interpolate in multi-index dataframe

I'm having trouble understanding pandas reindex. I have a series of measurements, munged into a multi-index df, and I'd like to reindex and interpolate those measurements to align them with some other data.
My actual data has ~7 index levels and several different measurements. I hope the solution for this toy data problem is applicable to my real data. It's "small data"; each individual measurement is a couple KB.
Here's a pair of toy problems, one which shows the expected behavior and one which doesn't seem to do anything.
Single-level index, works as expected:
"""
step,value
1,1
3,2
5,1
"""
import numpy as np
import pandas as pd

df_i = pd.read_clipboard(sep=",").set_index("step")
print(df_i)
new_index = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
df_i = df_i.reindex(new_index).interpolate()
print(df_i)
Outputs, the original df and the re-indexed and interpolated one:
value
step
1 1
3 2
5 1
value
step
1 1.0
2 1.5
3 2.0
4 1.5
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
Works great.
Multi-index, currently not working:
"""
sample,meas_id,step,value
1,1,1,1
1,1,3,2
1,1,5,1
1,2,3,2
1,2,5,2
1,2,7,1
1,2,9,0
"""
df_mi = pd.read_clipboard(sep=",").set_index(["sample", "meas_id", "step"])
print(df_mi)
df_mi = df_mi.reindex(new_index, level="step").interpolate()
print(df_mi)
Output, unchanged after reindex (and therefore after interpolate):
value
sample meas_id step
1 1 1 1
3 2
5 1
2 3 2
5 2
7 1
9 0
value
sample meas_id step
1 1 1 1
3 2
5 1
2 3 2
5 2
7 1
9 0
How do I actually reindex a column in a multi-index df?
Here's the output I'd like, assuming linear interpolation:
value
sample meas_id step
1 1 1 1
2 1.5
3 2
5 1
6 1
7 1
8 1
9 1
2 1 NaN (or 2)
2 NaN (or 2)
3 2
4 2
5 2
6 1.5
7 1
8 0.5
9 0
I spent some sincere time looking over SO, and if the answer is in there, I missed it:
Fill multi-index Pandas DataFrame with interpolation
Resampling Within a Pandas MultiIndex
pandas multiindex dataframe, ND interpolation for missing values
https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing
Possibly related GitHub issues:
https://github.com/numpy/numpy/issues/11975
https://github.com/pandas-dev/pandas/issues/23104
https://github.com/pandas-dev/pandas/issues/17132
IIUC, create the index by using MultiIndex.from_product, then just reindex:
idx = pd.MultiIndex.from_product([df_mi.index.levels[0],
                                  df_mi.index.levels[1],
                                  new_index])
df_mi.reindex(idx).interpolate()
Out[161]:
value
1 1 1 1.000000
2 1.500000
3 2.000000
4 1.500000
5 1.000000
6 1.142857
7 1.285714
8 1.428571
9 1.571429
2 1 1.714286 # bad: interpolation takes the previous group's values into account
2 1.857143
3 2.000000
4 2.000000
5 2.000000
6 1.500000
7 1.000000
8 0.500000
9 0.000000
My approach:
def idx(x):
    idx = pd.MultiIndex.from_product([x.index.get_level_values(0).unique(),
                                      x.index.get_level_values(1).unique(),
                                      new_index])
    return idx

pd.concat([y.reindex(idx(y)).interpolate() for _, y in df_mi.groupby(level=[0, 1])])
value
1 1 1 1.0
2 1.5
3 2.0
4 1.5
5 1.0
6 1.0
7 1.0
8 1.0
9 1.0
2 1 NaN
2 NaN
3 2.0
4 2.0
5 2.0
6 1.5
7 1.0
8 0.5
9 0.0
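The same per-group idea can also be written with groupby.apply; a sketch assuming df_mi and new_index from the question (reindex_steps is just an illustrative helper):
import numpy as np

new_index = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

def reindex_steps(group):
    # Keep only the "step" level, then reindex and interpolate
    # within this single measurement.
    group = group.reset_index(["sample", "meas_id"], drop=True)
    return group.reindex(new_index).interpolate()

out = df_mi.groupby(level=["sample", "meas_id"]).apply(reindex_steps)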

pandas df.mean for multi-index across axis 0

How do you get the mean across axis 0 for a certain multi-index level (the second index column)? I have
df:
1 2 3
h a 1 4 8
h b 5 4 6
i a 9 3 6
i b 5 2 5
j a 2 2 2
j b 4 4 4
I would like to create df1, the mean over the 2nd index value across axis 0 ('a', 'b', 'a', 'b'):
df1:
1 2 3
0 a 4 3 5.3
1 b 4.6 3.3 5
I know that I can select certain rows
df.loc[['a','b']].mean(axis=0)
but I'm not sure how this relates to multi-index dataframes.
I think you need to group by the second level and take the mean:
print(df.groupby(level=1).mean())
1 2 3
a 4.000000 3.000000 5.333333
b 4.666667 3.333333 5.000000
And if necessary, round the values:
print(df.groupby(level=1).mean().round(1))
1 2 3
a 4.0 3.0 5.3
b 4.7 3.3 5.0
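For reference, a frame like the one in the question can be built as follows (a sketch; the index values are assumed to be plain strings):
import pandas as pd

df = pd.DataFrame(
    {1: [1, 5, 9, 5, 2, 4], 2: [4, 4, 3, 2, 2, 4], 3: [8, 6, 6, 5, 2, 4]},
    index=pd.MultiIndex.from_product([["h", "i", "j"], ["a", "b"]]),
)
print(df.groupby(level=1).mean())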

Looking for a pandas function analogous to DataFrame.fillna()

I would like to apply a function that acts like fillna() but targets a value other than NaN. Unfortunately DataFrame.replace() will not work in my case. Here is an example. Given a DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, -1, -1], [5, 6, -1], [7, 8, np.nan]])
0 1 2
0 1 2.0 3.0
1 4 -1.0 -1.0
2 5 6.0 -1.0
3 7 8.0 NaN
I am looking for a function which will output:
0 1 2
0 1 2.0 3.0
1 4 2.0 3.0
2 5 6.0 3.0
3 7 8.0 NaN
So df.replace() with to_replace=-1 and method='ffill' will not work, because it requires a column-independent replacement value, whereas in my example the value is column-dependent. I know I can code it with a loop, but I am looking for efficient code as it will be applied to a large DataFrame. Any suggestions? Thank you.
You can just replace the value with NaN and then call ffill:
In [3]:
df.replace(-1, np.nan).ffill()
Out[3]:
0 1 2
0 1 2 3
1 4 2 3
2 5 6 3
3 7 8 3
I think you're overthinking this. (Note that this also fills the NaN that was already in the last row; the edit below handles that case.)
EDIT
If you already have NaN values, then create a boolean mask and update just those elements, applying ffill to the inverse of the mask:
In [15]:
df[df == -1] = df[df != -1].ffill()
df
Out[15]:
0 1 2
0 1 2 3
1 4 2 3
2 5 6 3
3 7 8 NaN
Another method (thanks to #DSM in comments) is to use where to essentially do the same thing as above:
In [17]:
df.where(df != -1, df.replace(-1, np.nan).ffill())
Out[17]:
0 1 2
0 1 2 3
1 4 2 3
2 5 6 3
3 7 8 NaN
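A general helper along the same lines (a sketch; ffill_value is a hypothetical name, not a pandas API): mask the sentinel, forward-fill, and write the filled values back only at the sentinel positions, so pre-existing NaNs survive untouched:
import numpy as np
import pandas as pd

def ffill_value(df, sentinel):
    # Hypothetical helper: positions currently holding the sentinel.
    mask = df == sentinel
    # Blank out the sentinel, forward-fill column-wise, and take the
    # filled values only where the sentinel was.
    return df.mask(mask, df.mask(mask).ffill())

df = pd.DataFrame([[1, 2, 3], [4, -1, -1], [5, 6, -1], [7, 8, np.nan]])
print(ffill_value(df, -1))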
