how rank is calculated in pandas - python

I'm confused about how the rank of a Series is computed. I thought rank was calculated from the highest value to the lowest value in a series, and that if two numbers are equal, pandas computes the average of their ranks.
In this example, the highest value is 7. Why do we get rank 5.5 for the number 7 and rank 1.5 for the number 4?
import pandas as pd
S1 = pd.Series([7, 6, 7, 5, 4, 4])
S1.rank()
Output:
0 5.5
1 4.0
2 5.5
3 3.0
4 1.5
5 1.5
dtype: float64

The rank is calculated in this way:
Arrange the elements in ascending order; ranks are assigned starting with 1 for the lowest element.
Elements - 4, 4, 5, 6, 7, 7
Ranks - 1, 2, 3, 4, 5, 6
Now consider the repeated items: average their corresponding ranks and assign the averaged rank to each occurrence.
Since 4 repeats twice, the final rank of each occurrence is the average of 1 and 2, which is 1.5.
In the same way, for 7 the final rank of each occurrence is the average of 5 and 6, which is 5.5.
Elements - 4, 4, 5, 6, 7, 7
Ranks - 1, 2, 3, 4, 5, 6
Final Rank - 1.5, 1.5, 3, 4, 5.5, 5.5
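You can see the intermediate, pre-averaging ranks directly with method='first', which assigns ranks in order of appearance instead of averaging ties (a quick check on the same series; the output below is worked out by hand from that rule):
S1.rank(method='first')
0    5.0
1    4.0
2    6.0
3    3.0
4    1.0
5    2.0
dtype: float64
Averaging the tied entries (5 and 6 for the two 7s, 1 and 2 for the two 4s) reproduces the 5.5 and 1.5 above.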

As Joachim commented, the rank function accepts a method argument that defaults to 'average'; that is, the final rank is the average of the ranks of all equal values.
Per the documentation, the other options for method are:
method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
    How to rank the group of records that have the same value (i.e. ties):
    average: average rank of the group
    min: lowest rank in the group
    max: highest rank in the group
    first: ranks assigned in order they appear in the array
    dense: like 'min', but rank always increases by 1 between groups
For example, with method='dense', S1.rank(method='dense') gives:
0 4.0
1 3.0
2 4.0
3 2.0
4 1.0
5 1.0
dtype: float64
which is somewhat equivalent to factorize.
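To compare all the methods side by side on the same series, you can build a small DataFrame (a quick sketch; the expected output is worked out by hand from the rules above):
methods = ['average', 'min', 'max', 'first', 'dense']
pd.DataFrame({m: S1.rank(method=m) for m in methods})
   average  min  max  first  dense
0      5.5  5.0  6.0    5.0    4.0
1      4.0  4.0  4.0    4.0    3.0
2      5.5  5.0  6.0    6.0    4.0
3      3.0  3.0  3.0    3.0    2.0
4      1.5  1.0  2.0    1.0    1.0
5      1.5  1.0  2.0    2.0    1.0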
Update: per your question, let's try writing a function that behaves like S1.rank():
import numpy as np
import pandas as pd

def my_rank(s):
    # sort s by value; mergesort is stable, so ties keep their
    # original order of appearance (like method='first')
    s_sorted = s.sort_values(kind='mergesort')
    # incremental ranks 1..n over the sorted values,
    # equivalent to s.rank(method='first')
    ranks = pd.Series(np.arange(len(s_sorted)) + 1, index=s_sorted.index)
    # average the ranks over groups of equal values
    avg_ranks = ranks.groupby(s_sorted).transform('mean')
    # restore the original order so the result lines up with s.rank()
    return avg_ranks.reindex(s.index)
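A quick check against the built-in (the values below follow from the steps above):
my_rank(S1)
0    5.5
1    4.0
2    5.5
3    3.0
4    1.5
5    1.5
dtype: float64
my_rank(S1).equals(S1.rank())  # True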

You were computing the default (average) rank; if you want the max rank, do the following:
S1 = pd.Series([7, 6, 7, 5, 4, 4])
S1.rank(method='max')
Here are all the rank methods supported by pandas:
method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
To compare them as columns, put the values in a DataFrame:
df = pd.DataFrame({'value': [7, 6, 7, 5, 4, 4]})
df['default_rank'] = df['value'].rank()
df['max_rank'] = df['value'].rank(method='max')
df['NA_bottom'] = df['value'].rank(na_option='bottom')
df['pct_rank'] = df['value'].rank(pct=True)
print(df)
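The expected output (worked out from the rules above; pct_rank is the default rank divided by the number of rows):
   value  default_rank  max_rank  NA_bottom  pct_rank
0      7           5.5       6.0        5.5  0.916667
1      6           4.0       4.0        4.0  0.666667
2      7           5.5       6.0        5.5  0.916667
3      5           3.0       3.0        3.0  0.500000
4      4           1.5       2.0        1.5  0.250000
5      4           1.5       2.0        1.5  0.250000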

Related

Pandas centred rolling window rank returns wrong value

I'm trying to calculate the rank of a column value within a rolling window in Pandas like this:
df = pd.DataFrame([[1, 10],
                   [2, 20],
                   [3, 50],
                   [4, 30],
                   [5, 40]],
                  columns=['order_col', 'rank_col'])
df['rank'] = df.rolling(3, center=True, min_periods=1, on='order_col')['rank_col'].rank()
The result of rank(), though, gives the rank of the last row in the window, not of the centre row as I expected.
Any ideas how I can get the rank of the correct row? I.e. I expect the ranks to be 1, 2, 3, 1, 2.
EDIT: I chose a small example to illustrate the problem but in actuality my dataframe has thousands of rows and the rolling window is of size 100+ rows.
The following is a workaround: use rank inside apply and explicitly take the centre value.
The code inspects the index of the series to recognize the first, truncated window (as opposed to the last, which is also short):
def series_rank_center(series):
    # the first window is truncated on the left, so its centre sits at position 0
    if 1 in series.index and len(series) < 3:
        return series.rank().iat[0]  # centre value for the first window
    else:
        return series.rank().iat[1]  # centre value
df.rolling(3, center=True, min_periods=1, on='order_col').apply(series_rank_center)
order_col rank_col
0 1 1.0
1 2 2.0
2 3 3.0
3 4 1.0
4 5 2.0
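Since the question mentions windows of 100+ rows, here is a sketch that generalizes the same idea to any odd window size (center_rank_factory is a name introduced here, and like the answer above it assumes a default RangeIndex):
def center_rank_factory(window):
    half = window // 2
    def center_rank(s):
        if len(s) == window:
            pos = half                # full window: centre is in the middle
        elif s.index[0] == 0:
            pos = s.index[-1] - half  # window truncated at the start
        else:
            pos = half                # window truncated at the end
        return s.rank().iat[pos]
    return center_rank

window = 3  # e.g. 101 on the real data
df.rolling(window, center=True, min_periods=1, on='order_col').apply(center_rank_factory(window), raw=False)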

Getting highest value out of a dataframe with value_counts()

I want to print out the highest non-unique value in my dataframe.
With df['Value'].value_counts() I can count them, but how do I select them by how often the numbers appear?
Value
1
2
1
2
3
2
As I understand it, you want the first highest value that has a frequency greater than 1. In this case you can write:
# .items() replaces .iteritems(), which was removed in pandas 2.0
for val, cnt in df['Value'].value_counts().sort_index(ascending=False).items():
    if cnt > 1:
        print(val)
        break
The sort_index sorts the items by 'Value' rather than by frequency. For example, if your 'Value' column has the values [1, 2, 3, 3, 2, 2, 2, 1, 3, 2], then the result of df['Value'].value_counts().sort_index(ascending=False).items() will be as follows:
3 3
2 5
1 2
Name: Value, dtype: int64
The answer in this example would then be 3 since it is the first highest value with frequency greater than 1.
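A vectorized alternative (a sketch that does the same thing without the explicit loop; vc and repeated are names introduced here):
vc = df['Value'].value_counts()
repeated = vc[vc > 1]        # keep only values that occur more than once
print(repeated.index.max())  # highest such value; 3 for the example above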

Groupby when given the start positional index of each group

I have one series of values that I would like to group, and another series containing the starting positional index of each group after the first (the first group is understood to begin at positional index 0). The series of values can have an arbitrary index. Is there a way to use this to produce a groupby-aggregate? Ideally empty groups would be preserved. Example:
values = pd.Series(np.arange(10, 20), index=np.arange(110, 120))
group_indices = pd.Series([3, 3, 8])
Now the grouping should be such that the first group is values.iloc[:3], the second is values.iloc[3:3] (an empty group), the third is values.iloc[3:8], and the fourth is values.iloc[8:], so that the grouped mean would be pd.Series([11.0, NaN, 15.0, 18.5]).
Here is an easy way (run with a default RangeIndex and the pre-update group_indices = pd.Series([0, 3, 8]); the question's arbitrary 110-119 index needs a positional approach, as in the asker's solution at the end):
values.groupby(values.index.isin(group_indices).cumsum()).mean()
Out[454]:
1 11.0
2 15.0
3 18.5
dtype: float64
Straightforwardly, with the numpy.split routine:
In [1286]: values = pd.Series(np.arange(10, 20))
In [1287]: group_indices = pd.Series([0, 3, 8])
In [1288]: pd.Series([s.mean() for s in np.split(values, group_indices) if s.size])
Out[1288]:
0 11.0
1 15.0
2 18.5
dtype: float64
To account for the "empty" group, just remove the if s.size check:
In [1304]: group_indices = pd.Series([3, 3, 8])
In [1305]: pd.Series([s.mean() for s in np.split(values, group_indices)])
Out[1305]:
0 11.0
1 NaN
2 15.0
3 18.5
dtype: float64
Given your update, here's an odd way to do this with pd.merge_asof. Some care needs to be taken to deal with the first group that's from 0 to your first index in the Series.
import pandas as pd
import numpy as np

(pd.merge_asof(values.to_frame('val'),
               values.iloc[np.r_[group_indices]].reset_index().reset_index().drop(columns=0),
               left_index=True, right_on='index',
               direction='backward')
   .fillna({'level_0': -1})  # because your first group is 0: first index
   .groupby('level_0').val.mean()
   .reindex([-1] + [*range(len(group_indices))])  # get 0-size groups in the output
)
level_0
-1 11.0
0 NaN
1 15.0
2 18.5
Name: val, dtype: float64
Let's change group_indices a bit, so that the group names (1, 2, 3) are visible:
group_indices = pd.Series([1, 2, 3], index=[0, 3, 8])
then
values.groupby(group_indices.reindex(values.index, method='ffill')).mean()
would give you what you want (assuming values has a default RangeIndex, so the labels 0, 3, 8 line up with positions).
Note that group_indices.reindex(values.index, method='ffill') gives you
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 3
9 3
which assigns each row of values with a group number.
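For the question's arbitrary 110-119 index, a positional variant of the same idea (a sketch; labels is a name introduced here) can build the group numbers with np.searchsorted:
import numpy as np
# label each position by how many group-start positions are <= it;
# with group_indices = [3, 3, 8] this gives [0, 0, 0, 2, 2, 2, 2, 2, 3, 3],
# so the empty group (label 1) is skipped automatically
labels = np.searchsorted(group_indices.values, np.arange(len(values)), side='right')
values.groupby(labels).mean()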
My solution involves keeping the inputs as they are and doing some ugly adjustments (again assuming a default RangeIndex, so the bin edges match positions):
pd.DataFrame(values).assign(
    group=pd.cut(pd.DataFrame(values).index, [-1, 2, 7, np.inf], labels=[0, 1, 2])
).groupby('group').mean()
Output
0
group
0 11.0
1 15.0
2 18.5
Thanks to all the answers, especially WeNYoBen's. The following will produce the correct groups and skip over empty groups.
# First, add the final index to `group_indices` so that
# we have a series of right endpoints, or interval upper bounds
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
upper_bounds = pd.concat([group_indices, pd.Series(values.shape[0])], ignore_index=True)
# Compute indices of nonempty groups
lower_bounds = upper_bounds.shift(fill_value=0)
nonempty_group_idxs = upper_bounds != lower_bounds
# Get means indexed from 0 to n_nonempty_groups-1
means = values.groupby(pd.RangeIndex(values.shape[0]).isin(upper_bounds).cumsum()).mean()
# Reassign index for the correct groups
means.index = nonempty_group_idxs.index[nonempty_group_idxs]
This will have a noncontiguous index, with skipped elements corresponding to empty groups in the original groupby. If you want to place NaN in those spots (there are len(group_indices) + 1 groups in total), you can do
means = means.reindex(index=pd.RangeIndex(group_indices.shape[0] + 1))
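With the example series above, this yields (worked through by hand from the steps):
0    11.0
1     NaN
2    15.0
3    18.5
dtype: float64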

How can I use Python rank() with same-values' counts?

I am dealing with a dataframe in Python.
Here is what I want to do:
1. equal values get the same rank
2. the next rank should skip ahead by the number of tied values
This is what I intended:
price  rank
5300   1
5300   1
5300   1
5200   4   <- previous rank (1) + count of 5300s (3)
5200   4   <- same value, same rank
5100   6   <- previous rank (4) + count of 5200s (2)
First, I tried the rank(method="dense") function, but it did not work as I expected:
df_sales["rank"] = df_sales["price"].rank(ascending=False, method="dense")
Thank you in advance.
You need to use method='min' and ascending=False:
df = pd.DataFrame({'x':[5300,5300,5300,5200,5200, 5100]})
df['r'] = df['x'].rank(method='min', ascending=False)
From pandas.Series.rank
method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}
average: average rank of group
min: lowest rank in group
max: highest rank in group
first: ranks assigned in order they appear in the array
dense: like ‘min’, but rank always increases by 1 between groups
Note that dense specifically increases the rank by exactly 1 between groups, so it never skips ranks after a tie; the min option is what you want.
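For the example frame, the min ranks work out as follows (derived by hand from the rule above):
      x    r
0  5300  1.0
1  5300  1.0
2  5300  1.0
3  5200  4.0
4  5200  4.0
5  5100  6.0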

Rank with ties in python when tie breaker is random

In R, breaking ties randomly with the rank function is simple:
rank(my_vec, ties.method = "random")
However, though both scipy (scipy.stats.rankdata) and pandas (pandas.Series.rank) have ranking functions, neither offers a method that breaks ties randomly.
Is there a simple way to do this with a Python framework, given that the list order has to remain the same?
Pandas' rank allows for these methods:
method : {'average', 'min', 'max', 'first', 'dense'}
* average: average rank of group
* min: lowest rank in group
* max: highest rank in group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
To "simply" accomplish your goal we can use 'first' after having randomized the Series.
Assume my series is named my_vec
my_vec.sample(frac=1).rank(method='first')
You can then put it back in the same order it was with
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
Example Runs
my_vec = pd.Series([1, 2, 3, 1, 2, 3])
Trial 1
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
0 2.0 <- I expect this and
1 4.0
2 6.0
3 1.0 <- this to be first ranked
4 3.0
5 5.0
dtype: float64
Trial 2
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
0 1.0 <- Still first ranked
1 3.0
2 6.0
3 2.0 <- but order has switched
4 4.0
5 5.0
dtype: float64
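If you need the random tie-breaking to be reproducible, sample accepts a random_state seed (a standard pandas argument):
my_vec.sample(frac=1, random_state=42).rank(method='first').reindex_like(my_vec)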
