In R, the way to break ties randomly when using the rank function is simple:
rank(my_vec, ties.method = "random")
However, although both scipy (scipy.stats.rankdata) and pandas (pandas.Series.rank) have ranking functions, neither of them offers a method that breaks ties randomly.
Is there a simple way to do this with an existing Python library, given that the list order has to remain the same?
Pandas' rank allows for these methods:
method : {'average', 'min', 'max', 'first', 'dense'}
* average: average rank of group
* min: lowest rank in group
* max: highest rank in group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
To "simply" accomplish your goal we can use 'first' after having randomized the Series.
Assume my series is named my_vec
my_vec.sample(frac=1).rank(method='first')
You can then put it back in the same order it was with
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
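If you need the result to be reproducible, sample accepts a random_state; a brief sketch (the seed value 42 is just illustrative):
my_vec.sample(frac=1, random_state=42).rank(method='first').reindex_like(my_vec)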
Example Runs
my_vec = pd.Series([1, 2, 3, 1, 2, 3])
Trial 1
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
0 2.0 <- I expect this and
1 4.0
2 6.0
3 1.0 <- this to be first ranked
4 3.0
5 5.0
dtype: float64
Trial 2
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
0 1.0 <- Still first ranked
1 3.0
2 6.0
3 2.0 <- but order has switched
4 4.0
5 5.0
dtype: float64
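For reference, a similar random tie-breaking rank can be built without pandas, using only numpy: shuffle the order in which elements are visited, assign ordinal ranks, then undo the shuffle. This is a minimal sketch under that assumption; the helper name random_tie_rank is just illustrative.
import numpy as np

def random_tie_rank(values, rng=None):
    # rank 1-D values, breaking ties randomly
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values)
    # visit the elements in a random order so ties are broken randomly
    perm = rng.permutation(len(values))
    shuffled = values[perm]
    # ordinal (1-based) ranks of the shuffled data
    order = np.argsort(shuffled, kind='stable')
    ranks_shuffled = np.empty(len(values), dtype=int)
    ranks_shuffled[order] = np.arange(1, len(values) + 1)
    # map the ranks back to the original positions
    ranks = np.empty(len(values), dtype=int)
    ranks[perm] = ranks_shuffled
    return ranks

random_tie_rank([1, 2, 3, 1, 2, 3])  # e.g. array([2, 3, 5, 1, 4, 6])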
Related
I'm trying to figure out whether the values in my dataframe are increasing into the next tens/hundreds place. For example, I created a dataframe with a few values, duplicated the values and shifted them, and now I'm able to compare them. But how do I find out whether the tens place is increasing, or whether the value is only increasing by a little, for example by 0.02 points?
import pandas as pd
import numpy as np
data = {'value':['9','10','19','22','31']}
df = pd.DataFrame(data)
df['value_copy'] = df['value'].shift(1)
df['Increase'] = np.where(df['value']<df['value_copy'],1,0)
The output in this case should be:
[nan,1,0,1,1]
IIUC, divide by 10, get the floor, then compare the successive values (diff(1)) to see if the difference is exactly 1:
np.floor(df['value'].astype(float).div(10)).diff(1).eq(1).astype(int)
If you want a jump to at least the next tens (or more) use ge (≥):
np.floor(df['value'].astype(float).div(10)).diff(1).ge(1).astype(int)
output:
0 0
1 1
2 0
3 1
4 1
Name: value, dtype: int64
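To see where eq and ge differ, here is a brief sketch with a value that jumps more than one tens bucket (the data here is illustrative, not from the question):
import numpy as np
import pandas as pd

s = pd.Series([9, 10, 31]).astype(float)      # 10 -> 31 skips the twenties
tens_diff = np.floor(s.div(10)).diff(1)

tens_diff.eq(1).astype(int).tolist()          # [0, 1, 0], only exact +1 steps
tens_diff.ge(1).astype(int).tolist()          # [0, 1, 1], any jump of one or more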
NB. if you insist on the NaN:
s = np.floor(df['value'].astype(float).div(10)).diff(1)
s.eq(1).astype(int).mask(s.isna())
output:
0 NaN
1 1.0
2 0.0
3 1.0
4 1.0
Name: value, dtype: float64
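An equivalent spelling that stays entirely in pandas, as a sketch, uses Series.floordiv instead of np.floor (assuming the df from the question):
df['value'].astype(float).floordiv(10).diff(1).ge(1).astype(int)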
I'm confused about how the rank of a Series works. I thought rank was calculated from the highest value to the lowest value in a series, and that if two numbers are equal, pandas takes the average of their ranks.
In this example, the highest value is 7. Why do we get rank 5.5 for the number 7 and rank 1.5 for the number 4?
S1 = pd.Series([7,6,7,5,4,4])
S1.rank()
Output:
0 5.5
1 4.0
2 5.5
3 3.0
4 1.5
5 1.5
dtype: float64
The Rank is calculated in this way
Arrange the elements in ascending order and the ranks are assigned starting with '1' for the lowest element.
Elements - 4, 4, 5, 6, 7, 7
Ranks - 1, 2, 3, 4, 5, 6
Now consider the repeated items: average their corresponding ranks and assign the averaged rank to each occurrence.
Since '4' repeats twice, the final rank of each occurrence is the average of 1 and 2, which is 1.5.
In the same way, for 7, the final rank of each occurrence is the average of 5 and 6, which is 5.5.
Elements - 4, 4, 5, 6, 7, 7
Ranks - 1, 2, 3, 4, 5, 6
Final Rank - 1.5, 1.5, 3, 4, 5.5, 5.5
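You can see both steps in code: method='first' shows the ordinal ranks before averaging, and the default averages them for tied values. A quick sketch:
S1 = pd.Series([7, 6, 7, 5, 4, 4])
S1.rank(method='first')   # 5.0, 4.0, 6.0, 3.0, 1.0, 2.0
S1.rank()                 # 5.5, 4.0, 5.5, 3.0, 1.5, 1.5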
As commented by Joachim, the rank function accepts an argument method, with default 'average'. That is, the final rank is the average of all the ranks of the same value.
Per the documentation, the other options for method are:
method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
How to rank the group of records that have the same value (i.e. ties):
* average: average rank of the group
* min: lowest rank in the group
* max: highest rank in the group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
For example, let's try method='dense'; then S1.rank(method='dense') gives:
0 4.0
1 3.0
2 4.0
3 2.0
4 1.0
5 1.0
dtype: float64
which is somewhat equivalent to factorize.
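To see the "somewhat equivalent to factorize" point concretely, a small sketch: the dense ranks are the sorted factorize codes plus one.
S1 = pd.Series([7, 6, 7, 5, 4, 4])
codes, uniques = pd.factorize(S1, sort=True)
codes + 1                              # array([4, 3, 4, 2, 1, 1])
S1.rank(method='dense').astype(int)    # 4, 3, 4, 2, 1, 1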
Update: per your question, let's try writing a function that behaves similarly to S1.rank():
def my_rank(s):
    # sort s by value (a stable sort keeps the original order of ties)
    s_sorted = s.sort_values(kind='mergesort')
    # incremental ranks, equivalent to s.rank(method='first')
    ranks = pd.Series(np.arange(len(s_sorted)) + 1, index=s_sorted.index)
    # average the ranks within each group of equal values
    avg_ranks = ranks.groupby(s_sorted).transform('mean')
    # restore the original order of s
    return avg_ranks.reindex(s.index)
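A quick check, as a sketch, that the handwritten version agrees with the built-in method:
S1 = pd.Series([7, 6, 7, 5, 4, 4])
my_rank(S1).equals(S1.rank())   # should be True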
You were performing the default ('average') rank. If you want the max rank, do the following:
S1 = pd.Series([7,6,7,5,4,4])
S1.rank(method='max')
Here are all the tie-breaking methods supported by pandas:
method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
# S1 is a Series, so build a DataFrame to hold the different rank columns
df = S1.to_frame('value')
df['default_rank'] = df['value'].rank()
df['max_rank'] = df['value'].rank(method='max')
df['NA_bottom'] = df['value'].rank(na_option='bottom')
df['pct_rank'] = df['value'].rank(pct=True)
print(df)
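This should print something like:
   value  default_rank  max_rank  NA_bottom  pct_rank
0      7           5.5       6.0        5.5  0.916667
1      6           4.0       4.0        4.0  0.666667
2      7           5.5       6.0        5.5  0.916667
3      5           3.0       3.0        3.0  0.500000
4      4           1.5       2.0        1.5  0.250000
5      4           1.5       2.0        1.5  0.250000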
I have a pandas data frame that looks like the following:
fastmoving[['dist','unique','id']]
Out[683]:
dist unique id
1 0.406677 4.0 4.997434e+09
2 0.406677 4.0 4.452593e+09
5 0.406677 4.0 4.188395e+09
1 0.434386 4.0 8.288070e+09
4 0.434386 4.0 3.274609e+09
What I want to achieve is to:
Find top n longest-distance entries. Column 'dist'
Find which ids have the largest percentage m in the top n entries. Column 'id'.
So far I was able to write the code for the maximum entries.
#Get the first id with the largest dist:
fastmoving.loc[fastmoving['dist'].idxmax(),'id']
#Get all id's with the largest dist:
fastmoving.loc[fastmoving['dist']==fastmoving['dist'].max(),'id']
What I'm missing is how to make my code work for more than one value.
So instead of just the maximum value, I want it to work for a range of the largest values (the top n values),
and then get all the ids that account for more than some percentage m of those top n entries.
Can you please help me on how I can achieve that in pandas?
Thanks a lot
Alex
You can use nlargest for the top n and quantile for the top m%, like this:
import pandas as pd
from io import StringIO
fastmoving = pd.read_csv(StringIO("""
dist unique id
1 0.406677 4.0 4.997434e+09
2 0.406677 4.0 4.452593e+09
5 0.406677 4.0 4.188395e+09
1 0.434386 4.0 8.288070e+09
4 0.434386 4.0 3.274609e+09"""), sep=r"\s+")
n = 3
m = 50
top_n_dist = fastmoving.nlargest(n, ["dist"])
top_m_percent_id_in_top_n_dist = top_n_dist[top_n_dist['id'] > top_n_dist['id'].quantile(m/100)]
print(top_m_percent_id_in_top_n_dist)
IIUC, you can leverage nlargest. The following example would take the top 3 values of dist, and from that, extract the top 2 values of id:
fastmoving.nlargest(3, ["dist", "id"]).nlargest(2, "id")
dist unique id
1 0.434386 4.0 8.288070e+09
1 0.406677 4.0 4.997434e+09
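If "largest percentage m in the top n entries" instead means the share of the top-n rows that each id occupies, one hedged sketch uses value_counts(normalize=True) (the values of n and m below are illustrative):
n, m = 3, 30   # m in percent
top_n = fastmoving.nlargest(n, 'dist')
share = top_n['id'].value_counts(normalize=True)
share[share >= m / 100].index.tolist()   # ids covering at least m% of the top n rows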
I've been trying to get a cumsum on a pandas groupby object. I need the cumsum to be shifted by one, which is achieved by shift(). However, doing both of these functions on a single groupby object gives some unwanted results:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [2, 3, 5, 2, 3, 5]})
df.groupby('A').cumsum().shift()
which gives:
B
0 NaN
1 2.0
2 5.0
3 10.0
4 2.0
5 5.0
I.e. the last value of the cumsum() on group 1 is shifted into the first value of group 2. What I want is for these groups to stay separated, and to get:
B
0 NaN
1 2.0
2 5.0
3 NaN
4 2.0
5 5.0
But I'm not sure how to get both functions to work on the groupby object combined. Can't find this question anywhere else. Have been playing around with agg but can't seem to work that out. Any help would be appreciated.
Use a lambda function with GroupBy.apply; it is also necessary to select the column after the groupby for processing:
df['B'] = df.groupby('A')['B'].apply(lambda x: x.cumsum().shift())
print (df)
A B
0 1 NaN
1 1 2.0
2 1 5.0
3 2 NaN
4 2 2.0
5 2 5.0
The result of your first operation df.groupby('A').cumsum() is a regular dataframe. It is equivalent to df.groupby('A')[['B']].cumsum(), but Pandas conveniently allows you to omit the [['B']] indexing part.
Any subsequent operation on this dataframe therefore will not by default be performed groupwise, unless you use GroupBy again:
res = df.groupby('A').cumsum().groupby(df['A']).shift()
But, as you can see, this repeats the grouping operation and will be inefficient. You can instead define a single function which combines cumsum and shift in the correct order, then apply this function on a single GroupBy object. Defining this single function is known as function composition, and it's not native to Python. Here are a few alternatives:
Define a new named function
This is an explicit and recommended solution:
def cum_shift(x):
    return x.cumsum().shift()
res1 = df.groupby('A')[['B']].apply(cum_shift)
Define an anonymous lambda function
A one-line version of the above:
res2 = df.groupby('A')[['B']].apply(lambda x: x.cumsum().shift())
Use a library which composes
This a pure functional solution; for example, via 3rd party toolz:
from toolz import compose
from operator import methodcaller
cumsum_shift_comp = compose(methodcaller('shift'), methodcaller('cumsum'))
res3 = df.groupby('A')[['B']].apply(cumsum_shift_comp)
All the above give the equivalent result:
assert res.equals(res1) and res1.equals(res2) and res2.equals(res3)
print(res1)
B
0 NaN
1 2.0
2 5.0
3 NaN
4 2.0
5 5.0
I've been wondering how to solve the following problem. Say I have a dataframe df, which looks like this:
Name quantity price
A 1 10.0
A 3 26.0
B 1 15.0
B 3 30.0
...
Now, say I wanted to extrapolate the price by quantity and, for each Name, create a row for quantity = 1, 2, 3, where the price is some function of the list of available quantities and their respective prices. I.e., say I have a function extrapolate(qts, prices, n) that computes a price for quantity = n based on known qts and prices; then the result would look like:
Name quantity price
A 1 10.0
A 2 extrapolate([1, 3], [10.0, 26.0], 2)
A 3 26.0
B 1 15.0
B 2 extrapolate([1, 3], [15.0, 30.0], 2)
B 3 30.0
...
I would appreciate some insight on how to achieve this, or a place to reference to learn more about how groupby can be used for this case
Thank you in advance
What you want is called missing data imputation. There are many approaches to it.
You may want to check the package called fancyimpute. It offers imputation using MICE, which seems to do what you want.
Other than that, if your case is just as simple in structure as the example is, you can always groupby('Name').mean() and you will get the middle value for each subgroup.
The following should do what you described:
def get_extrapolate_val(group, qts, prices, n):
    # do your actual calculation here; for now it returns just a dummy value
    some_value = (group[qts] * group[prices]).sum() / n
    return some_value
# some definitions
n = 2
quan_col = 'quantity'
price_col = 'price'
First we group by Name and then apply the function get_extrapolate_val to each group, passing the additional column names and n as arguments. As this returns a Series object, we need an additional reset_index and rename, which makes the concatenation easier.
new_stuff = df.groupby('Name').apply(get_extrapolate_val, quan_col, price_col, n).reset_index().rename(columns={0: price_col})
Add n as an additional column:
new_stuff[quan_col] = n
We concatenate the two dataframes and are done
final_df = pd.concat([df, new_stuff]).sort_values(['Name', quan_col]).reset_index(drop=True)
Name price quantity
0 A 10.0 1
1 A 44.0 2
2 A 26.0 3
3 B 15.0 1
4 B 52.5 2
5 B 30.0 3
The values I now added in are of course meaningless but are just there to illustrate the method.
OLD version
Assuming that there is always only 1 and 3 in your quantity column, the following should work:
new_stuff = df.groupby('Name', as_index=False)['price'].mean()
This gives
Name price
0 A 18.0
1 B 22.5
That - as written - assumes that it is always only 1 and 3, so we can simply calculate the mean.
Then we add the 2
new_stuff['quantity'] = 2
and concatenate the two dataframes with an additional sorting
pd.concat([df, new_stuff]).sort_values(['Name', 'quantity']).reset_index(drop=True)
which gives the desired outcome
Name price quantity
0 A 10.0 1
1 A 18.0 2
2 A 26.0 3
3 B 15.0 1
4 B 22.5 2
5 B 30.0 3
There are probably far more elegant ways to do this though...
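One arguably more elegant route, as a sketch, is to fill in the missing quantity by per-group linear interpolation with numpy.interp (sufficient for the in-between quantity 2 in this example; the column names follow the question, and the actual extrapolation logic could be swapped in):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B', 'B'],
                   'quantity': [1, 3, 1, 3],
                   'price': [10.0, 26.0, 15.0, 30.0]})

target_qty = [1, 2, 3]

def interpolate_prices(group):
    # interpolate the price at every target quantity from the known points
    prices = np.interp(target_qty, group['quantity'], group['price'])
    return pd.DataFrame({'quantity': target_qty, 'price': prices})

(df.groupby('Name')
   .apply(interpolate_prices)
   .reset_index(level=0)
   .reset_index(drop=True))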