pandas - ranking with tolerance? - python

Is there a way to rank values in a dataframe but considering a tolerance?
Say I have the following values
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
and if I ran rank:
ex.rank(method='average')
0 2.0
1 3.0
2 1.0
3 6.0
4 5.0
5 4.0
dtype: float64
But what I'd like as a result would be (with a tolerance of 0.01):
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
Any way to define this tolerance?
Thanks

This function may work:
def rank_with_tolerance(sr, tolerance=0.01+1e-10, method='average'):
    vals = pd.Series(sr.unique()).sort_values()
    vals.index = vals
    vals = vals.mask(vals - vals.shift(1) <= tolerance, vals.shift(1))
    return sr.map(vals).fillna(sr).rank(method=method)
It works for your given input:
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
And with more complex sets it seems to work too:
ex = pd.Series([16.52,19.95,19.96, 19.95, 19.97, 19.97, 19.98])
rank_with_tolerance(ex, tolerance=0.01+1e-10, method='average')
# result:
0 1.0
1 3.0
2 3.0
3 3.0
4 5.5
5 5.5
6 7.0
dtype: float64
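To see what the function is doing internally, you can print the intermediate value mapping it builds before ranking (a quick sketch using the same input; values within tolerance of their predecessor are replaced by that predecessor):
import pandas as pd

ex = pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96])
vals = pd.Series(ex.unique()).sort_values()
vals.index = vals
# 19.96 is within tolerance of 19.95, so it is mapped onto 19.95
vals = vals.mask(vals - vals.shift(1) <= 0.01 + 1e-10, vals.shift(1))
print(vals)
# after ex.map(vals), 19.95 and 19.96 share the same value and receive an average rank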

You could do some sort of min-max scaling i.e.
ex = pd.Series([16.52,19.95,16.15,22.77,20.53,19.96])
# You scale the values to be between 0 and 1
ex_scaled = (ex - min(ex)) / (max(ex) - min(ex))
# You stretch them onto a scale from 1 to len(ex) + 1
result = ex_scaled * len(ex) + 1
# result
0 1.335347
1 4.444109
2 1.000000
3 7.000000
4 4.969789
5 4.453172
That way you are still ranking, but values closer to each other have ranks close to each other
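If you want to reuse this, the same snippet can be packaged as a small helper (a sketch; the function name is mine, and note the output runs from 1 to len(sr) + 1):
import pandas as pd

def scaled_rank(sr):
    # min-max scale to [0, 1], then stretch onto a 1 .. len(sr) + 1 scale
    scaled = (sr - sr.min()) / (sr.max() - sr.min())
    return scaled * len(sr) + 1

ex = pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96])
print(scaled_rank(ex))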

You can sort the values, merge the close ones and rank on that:
s = ex.drop_duplicates().sort_values()
mapper = (s.groupby(s.diff().abs().gt(0.011).cumsum(), sort=False)
           .transform('mean')
           .reindex_like(ex)
          )
out = mapper.rank(method='average')
N.B. I used 0.011 as the threshold because floating point arithmetic does not always provide enough precision to detect a value close to the threshold.
output:
0 2.0
1 3.5
2 1.0
3 6.0
4 5.0
5 3.5
dtype: float64
intermediate mapper:
0 16.520
1 19.955
2 16.150
3 22.770
4 20.530
5 19.955
dtype: float64
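To see why the threshold needs a small margin, you can inspect the consecutive differences directly (a quick check; on typical IEEE-754 doubles the 19.95 -> 19.96 step does not come out as exactly 0.01):
import pandas as pd

ex = pd.Series([16.52, 19.95, 16.15, 22.77, 20.53, 19.96])
s = ex.drop_duplicates().sort_values()
print(s.diff())           # the 19.95 -> 19.96 step is only approximately 0.01
print(s.diff().gt(0.01))  # may flag that step as > 0.01, splitting values we want grouped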

Related

How to populate NaN values based on conditions from two other columns using Pandas?

I have a dataframe that looks something like this:
ID hiqual Wave
1  1.0    g
1  NaN    i
1  NaN    k
2  1.0    g
2  NaN    i
2  NaN    k
3  1.0    g
3  NaN    i
4  5.0    g
4  NaN    i
This is a long-format dataframe and I have my hiqual variable for my first measurement wave (g). I would like to populate the NaN values for the subsequent measurement waves (i and k) with the same value given in wave g for each ID.
I tried using fillna() but I am not sure how to provide the two conditions of ID and Wave and how to populate based on that. I would be grateful for any help/suggestions on this.
The exact expected output is unclear, but I think you might want:
m = df['hiqual'].isna()
df.loc[m, 'hiqual'] = df['hiqual'].mask(m).ffill()
If your dataframe is already ordered by the ID and Wave columns, you can simply forward-fill values:
>>> df.sort_values(['ID', 'Wave']).ffill()
ID hiqual Wave
0 1 1.0 g
1 1 1.0 i
2 1 1.0 k
3 2 1.0 g
4 2 1.0 i
5 2 1.0 k
6 3 1.0 g
7 3 1.0 i
8 4 5.0 g
9 4 5.0 i
You can also explicitly use the g values:
g_vals = df[df['Wave']=='g'].set_index('ID')['hiqual']
df['hiqual'] = df['hiqual'].fillna(df['ID'].map(g_vals))
print(df)
print(g_vals)
# Output
ID hiqual Wave
0 1 1.0 g
1 1 1.0 i
2 1 1.0 k
3 2 1.0 g
4 2 1.0 i
5 2 1.0 k
6 3 1.0 g
7 3 1.0 i
8 4 5.0 g
9 4 5.0 i
# g_vals
ID
1 1.0
2 1.0
3 1.0
4 5.0
Name: hiqual, dtype: float64
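For reference, a self-contained setup that reproduces the example (column values read off the table in the question), so the snippets above can be run as-is:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':     [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'hiqual': [1.0, np.nan, np.nan, 1.0, np.nan, np.nan, 1.0, np.nan, 5.0, np.nan],
    'Wave':   ['g', 'i', 'k', 'g', 'i', 'k', 'g', 'i', 'g', 'i'],
})

# e.g. the mapping approach from above
g_vals = df[df['Wave'] == 'g'].set_index('ID')['hiqual']
df['hiqual'] = df['hiqual'].fillna(df['ID'].map(g_vals))
print(df)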

Applying a function to chunks of the Dataframe

I have a Dataframe (df) (for instance - simplified version)
A B
0 2.0 3.0
1 3.0 4.0
and I generated 20 bootstrap resamples that are all now in the same df but differ in the Resample Nr.
A B
0 1 0 2.0 3.0
1 1 1 3.0 4.0
2 2 1 3.0 4.0
3 2 1 3.0 4.0
.. ..
.. ..
39 20 0 2.0 3.0
40 20 0 2.0 3.0
Now I want to apply a certain function on each Resample Nr. Say:
C = sum(df['A'] * df['B']) / sum(df['B'] ** 2)
The output would look like this:
A B C
0 1 0 2.0 3.0 Calculated Value X1
1 1 1 3.0 4.0 Calculated Value X1
2 2 1 3.0 4.0 Calculated Value X2
3 2 1 3.0 4.0 Calculated Value X2
.. ..
.. ..
39 20 0 2.0 3.0 Calculated Value X20
40 20 0 2.0 3.0 Calculated Value X20
So there are 20 different new values.
I know there is a df.iloc command where I can specify my row selection df.iloc[row, column] but I would like to find a command where I don't have to repeat the code for the 20 samples.
My goal is to find a command that identifies the Resample Nr. automatically and then calculates the function for each Resample Nr.
How can I do this?
Thank you!
Use DataFrame.assign to create two new columns x and y that correspond to df['A'] * df['B'] and df['B']**2, then use DataFrame.groupby on Resample Nr. (or level=1) and transform using sum:
s = df.assign(x=df['A'].mul(df['B']), y=df['B']**2)\
      .groupby(level=1)[['x', 'y']].transform('sum')
df['C'] = s['x'].div(s['y'])
Result:
A B C
0 1 0 2.0 3.0 0.720000
1 1 1 3.0 4.0 0.720000
2 2 1 3.0 4.0 0.750000
3 2 1 3.0 4.0 0.750000
39 20 0 2.0 3.0 0.666667
40 20 0 2.0 3.0 0.666667
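If the resample number lives in a regular column rather than in index level 1, the same idea works with groupby on that column (a small sketch under that assumption; the column name 'Resample' is made up here):
import pandas as pd

df = pd.DataFrame({
    'Resample': [1, 1, 2, 2],
    'A': [2.0, 3.0, 3.0, 3.0],
    'B': [3.0, 4.0, 4.0, 4.0],
})

s = (df.assign(x=df['A'] * df['B'], y=df['B'] ** 2)
       .groupby('Resample')[['x', 'y']].transform('sum'))
df['C'] = s['x'] / s['y']
print(df)   # C is 0.72 for resample 1 and 0.75 for resample 2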

Removing specific value from cell of dataframe and shifting the value towards left

I'm working with a pandas data frame. I have unwanted data in some cells. I need to clear that data from specific cells and shift the whole row towards the left by one cell. I have tried a couple of things but it is not working for me. Here is the example dataframe:
userId movieId ratings extra
0 1 500 3.5
1 1 600 4.5
2 1 www.abcd 700 2.0
3 2 1100 5.0
4 2 1200 4.0
5 3 600 4.5
6 4 600 5.0
7 4 1900 3.5
Expected Outcome:
userId movieId ratings extra
0 1 500 3.5
1 1 600 4.5
2 1 700 2.0
3 2 1100 5.0
4 2 1200 4.0
5 3 600 4.5
6 4 600 5.0
7 4 1900 3.5
I have tried the following code but it is showing the following error.
raw = df[df['ratings'].str.contains('www')==True]
# Here I am trying to set the specific cell value to empty, but it raises the following error:
df = df.at[raw, 'movieId'] = ' '
AttributeError: 'str' object has no attribute 'at'
# code for shifting the cell value to the left
df.iloc[raw,2:-1] = df.iloc[raw,2:-1].shift(-1,axis=1)
You can shift values by mask, but it is really important to match types: if column movieId is filled with strings (because of at least one string value), it is necessary to convert it to numeric with to_numeric, otherwise data is lost because of the mixed types:
m = df['movieId'].str.contains('www')
df['movieId'] = pd.to_numeric(df['movieId'], errors='coerce')
#if want shift only missing values rows
#m = df['movieId'].isna()
df[m] = df[m].shift(-1, axis=1)
df['userId'] = df['userId'].ffill()
df = df.drop('extra', axis=1)
print (df)
userId movieId ratings
0 1.0 500.0 3.5
1 1.0 600.0 4.5
2 1.0 700.0 2.0
3 2.0 1100.0 5.0
4 2.0 1200.0 4.0
5 3.0 600.0 4.5
6 4.0 600.0 5.0
7 4.0 1900.0 3.5
If you omit the conversion to numeric, you get a missing value:
m = df['movieId'].str.contains('www')
df[m] = df[m].shift(-1, axis=1)
df['userId'] = df['userId'].ffill()
df = df.drop('extra', axis=1)
print (df)
userId movieId ratings
0 1.0 500 3.5
1 1.0 600 4.5
2 1.0 NaN 2.0
3 2.0 1100 5.0
4 2.0 1200 4.0
5 3.0 600 4.5
6 4.0 600 5.0
7 4.0 1900 3.5
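For completeness, a minimal reproduction of the frame from the question (values read off the table above; I use astype(str) before str.contains so the mask stays boolean on the mixed-type column, which is a small adaptation on my part):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3, 4, 4],
    'movieId': [500, 600, 'www.abcd', 1100, 1200, 600, 600, 1900],
    'ratings': [3.5, 4.5, 700, 5.0, 4.0, 4.5, 5.0, 3.5],
    'extra':   [np.nan, np.nan, 2.0, np.nan, np.nan, np.nan, np.nan, np.nan],
})

m = df['movieId'].astype(str).str.contains('www')
df['movieId'] = pd.to_numeric(df['movieId'], errors='coerce')
df[m] = df[m].shift(-1, axis=1)
df['userId'] = df['userId'].ffill()
df = df.drop('extra', axis=1)
print(df)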
You can try this:
df['movieId'] = pd.to_numeric(df['movieId'], errors='coerce')
df = df.sort_values(by='movieId', ascending=True)

Pandas diffing rows in one group against rows of the previous group

How do I determine the difference between rows in column X between groups, rather than within groups? The diff value within each group should then be forward-filled.
df = pd.DataFrame({
    'Time' : [1,1,2,2,3,3],
    'X' : [1,1,3,3,6,6],
    'Y' : [1,1,1,1,2,2],
})
df['X'] = df['X'].diff()
df['X'] = df.groupby('Time')['X'].diff()
Intended Output:
Time X Y
0 1 0 1
1 1 0 1
2 2 2 1
3 2 2 1
4 3 3 2
5 3 3 2
If the values inside a group are equal (even when the number of rows per group is not), you can do this by subtracting from all rows in a group the value of the previous group.
df['X'] - df['Time'].map(df.groupby('Time')['X'].max().shift()).fillna(df['X'])
0 0.0
1 0.0
2 2.0
3 2.0
4 3.0
5 3.0
dtype: float64
Details
The first piece is to find the unique values in each group (I use max(), but you can just as well use unique() or first()):
df.groupby('Time')['X'].max()
Time
1 1
2 3
3 6
Name: X, dtype: int64
Next, shift them down:
_.shift()
Time
1 NaN
2 1.0
3 3.0
Name: X, dtype: float64
Map it back to "Time" (the grouper):
df['Time'].map(_)
0 NaN
1 NaN
2 1.0
3 1.0
4 3.0
5 3.0
Name: Time, dtype: float64
Fill the first group of NaNs with "X":
_.fillna(df['X'])
0 1.0
1 1.0
2 1.0
3 1.0
4 3.0
5 3.0
Name: Time, dtype: float64
Now you have your RHS. Just subtract this from "X" and you're done.
If you have a fixed number of rows for each group you can do:
>>> df.X = df.X.diff(periods=2).fillna(0) # assumes all groups have two rows
>>> df
Time X Y
0 1 0.0 1
1 1 0.0 1
2 2 2.0 1
3 2 2.0 1
4 3 3.0 2
5 3 3.0 2
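If the groups can have any size, another way to phrase the same idea (a sketch, not taken from either answer above) is to take one representative value per group, diff those, and map the result back onto the rows:
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 1, 2, 2, 3, 3],
    'X':    [1, 1, 3, 3, 6, 6],
    'Y':    [1, 1, 1, 1, 2, 2],
})

per_group = df.groupby('Time')['X'].first()   # one value per group
between = per_group.diff().fillna(0)          # difference against the previous group
df['X'] = df['Time'].map(between)
print(df)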

calculate percentile using rolling window pandas

I create a pandas dataframe as
df = pd.DataFrame(data=[[1],[2],[3],[1],[2],[3],[1],[2],[3]])
df
Out[19]:
0
0 1
1 2
2 3
3 1
4 2
5 3
6 1
7 2
8 3
I calculate the 75th percentile on windows of length 3:
df.rolling(window=3,center=False).quantile(0.75)
Out[20]:
0
0 NaN
1 NaN
2 2.0
3 2.0
4 2.0
5 2.0
6 2.0
7 2.0
8 2.0
Then, just to check, I calculate the 75th percentile on the first window separately:
df.iloc[0:3].quantile(0.75)
Out[22]:
0 2.5
Name: 0.75, dtype: float64
Why do I get a different value?
This is a bug, referenced in GH9413 and GH16211.
The reason, as given by the devs -
It looks like the difference here is that quantile and percentile take
the weighted average of the nearest points, whereas rolling_quantile
simply uses one of the nearest points (no averaging).
Rolling.quantile did not interpolate when computing the quantiles.
The bug has been fixed as of 0.21.
For older versions, the fix is to use a rolling apply:
df.rolling(window=3, center=False).apply(lambda x: pd.Series(x).quantile(0.75))
0
0 NaN
1 NaN
2 2.5
3 2.5
4 2.5
5 2.5
6 2.5
7 2.5
8 2.5
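To see the two conventions concretely on the first window [1, 2, 3]: linear interpolation puts the 75th percentile halfway between 2 and 3, while returning an existing data point (shown here with numpy's 'lower' rule, standing in for the "no averaging" behaviour described above) gives 2:
import numpy as np

window = np.array([1, 2, 3])
print(np.percentile(window, 75))                  # 2.5 with linear interpolation (the default)
print(np.percentile(window, 75, method='lower'))  # 2.0; on older numpy use interpolation='lower'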
