pandas dataframe groupby rank generates unexpected order of ranking - python

I am using the following code to generate the rank column,
df["rank"] = df.groupby(['group1','userId'])[['rank_level1','rank_level2']].rank(method='first', ascending=True).astype(int)
but as you can see in the following example data it is generating the wrong order of ranking considering rank_level2 column
expected_Rank is the ranking order I am expecting
group1
userId
geoId
rank_level1
rank_level2
rank
expected_Rank
a
1
q
3
3.506102795
1
8
a
1
w
3
-9.359613563
2
2
a
1
e
3
-2.368458072
3
3
a
1
r
3
13.75731938
4
9
a
1
t
3
0.229777761
5
5
a
1
y
3
-10.25124866
6
1
a
1
u
3
2.82822285
7
7
a
1
i
3
0
8
4
a
1
o
3
1.120593402
9
6
a
1
p
4
1.98
10
10
a
1
z
4
5.110299374
11
11
b
1
p
2
-9.552317622
1
1
b
1
r
3
1.175083121
2
6
b
1
t
3
0
3
5
b
1
o
3
9.383253146
4
8
b
1
w
3
5.782528196
5
7
b
1
i
3
-0.680999413
6
4
b
1
y
3
-0.990387248
7
3
b
1
e
3
-11.18793533
8
2
b
1
z
3
12.33791512
9
9
b
1
u
4
-4.799979138
10
11
b
1
q
4
-25.92
11
10

Create tuples by both columns and then use GroupBy.transform with Series.rank and method='dense':
df["rank"] = (df.assign(new=df[['rank_level1','rank_level2']].agg(tuple, 1))
.groupby(['group1','userId'])['new']
.transform(lambda x: x.rank(method='dense', ascending=True))
.astype(int))
print (df)
group1 userId geoId rank_level1 rank_level2 rank expected_Rank
0 a 1 q 3 3.506103 8 8
1 a 1 w 3 -9.359614 2 2
2 a 1 e 3 -2.368458 3 3
3 a 1 r 3 13.757319 9 9
4 a 1 t 3 0.229778 5 5
5 a 1 y 3 -10.251249 1 1
6 a 1 u 3 2.828223 7 7
7 a 1 i 3 0.000000 4 4
8 a 1 o 3 1.120593 6 6
9 a 1 p 4 1.980000 10 10
10 a 1 z 4 5.110299 11 11
11 b 1 p 2 -9.552318 1 1
12 b 1 r 3 1.175083 6 6
13 b 1 t 3 0.000000 5 5
14 b 1 o 3 9.383253 8 8
15 b 1 w 3 5.782528 7 7
16 b 1 i 3 -0.680999 4 4
17 b 1 y 3 -0.990387 3 3
18 b 1 e 3 -11.187935 2 2
19 b 1 z 3 12.337915 9 9
20 b 1 u 4 -4.799979 11 11
21 b 1 q 4 -25.920000 10 10
because:
df["rank"] = df.assign(new=df[['rank_level1','rank_level2']].agg(tuple, 1)).groupby(['group1','userId'])['new'].rank(method='first', ascending=True).astype(int)
DataError: No numeric types to aggregate

Related

Python: How to add groupby but not affect ngroup()?

per user I want an unique item order (as they click through them). If a item already has been seen, then don't cumulative count, but place the already assigned value there. For example, c,d, g & b in the tables below. I used the function below, but its not getting the job done at the moment. If I add the 'user_id' to the grouper I mess up the ngroup(). Can anyone help me with this?
df['Order Number'] = df.groupby(pd.Grouper(key='Item',sort=False)).ngroup()+1
print(df)
Current Output:
User_id Item Order Number
0 1 b 1
1 1 a 2
2 1 c 3
3 1 d 4
4 1 c 3
5 1 d 4
6 1 e 5
7 1 b 1
8 1 f 6
9 1 g 7
10 1 b 1
-----------------------------
11 2 x 8
12 2 g 7
13 2 g 7
14 2 f 6
15 2 h 9
16 2 i 10
17 2 f 11
18 2 k 12
19 2 l 13
Desired Output:
User_id Item Order Number
0 1 b 1
1 1 a 2
2 1 c 3
3 1 d 4
4 1 c 3
5 1 d 4
6 1 e 5
7 1 b 1
8 1 f 6
9 1 g 7
10 1 b 1
-----------------------------
11 2 x 1
12 2 g 2
13 2 g 2
14 2 f 3
15 2 h 4
16 2 i 5
17 2 f 3
18 2 k 7
19 2 l 8
Use GroupBy.transform with factorize in lambda function:
df['Order Number'] = df.groupby('User_id')['Item'].transform(lambda x: pd.factorize(x)[0])+1
print (df)
User_id Item Order Number
0 1 b 1
1 1 a 2
2 1 c 3
3 1 d 4
4 1 c 3
5 1 d 4
6 1 e 5
7 1 b 1
8 1 f 6
9 1 g 7
10 1 b 1
11 2 x 1
12 2 g 2
13 2 g 2
14 2 f 3
15 2 h 4
16 2 i 5
17 2 f 3
18 2 k 6
19 2 l 7

Column that counts up within subgroups pandas

I've got a df
df1
a b
4 0 1
5 0 1
6 0 2
2 0 3
3 1 2
15 1 3
12 1 3
13 1 1
15 3 1
14 3 1
8 3 3
9 3 2
10 3 1
the df should be grouped by a and b and I need a column c that goes up from 1 to amount of groups within subgroups of a
df1
a b c
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 4
How can I do that?
We can do groupby + transform factorize
df['C']=df.groupby('a').b.transform(lambda x : x.factorize()[0]+1)
4 1
5 1
6 2
2 3
3 1
15 2
12 2
13 3
15 1
14 1
8 1
9 1
10 2
Name: b, dtype: int64
Just so we can see the loop version
from itertools import count
from collections import defaultdict
x = defaultdict(count)
y = {}
c = []
for a, b in zip(df.a, df.b):
if (a, b) not in y:
y[(a, b)] = next(x[a]) + 1
c.append(y[(a, b)])
df.assign(C=c)
a b C
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 1
One option is groupby a and then iterate through each group and groupby b. Then use can use ngroup
df['c'] = np.hstack([g.groupby('b').ngroup().to_numpy() for _,g in df.groupby('a')])
a b c
4 0 1 0
5 0 1 0
6 0 2 1
2 0 3 2
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 0
15 3 1 0
14 3 1 0
8 3 1 0
9 3 1 0
10 3 2 1
you can use groupby.rank if you don't care about the order in the data.
df['c'] = df.groupby('a')['b'].rank('dense').astype(int)

Python dataframe add columns in groups of 3

I have a data-frame with n rows:
df = 1 2 3
4 5 6
4 2 3
3 1 9
6 7 0
9 2 5
I want to add a columns with the same value in groups of 3.
n (num rows) is for sure divided by 3.
So the new df will be:
df = 1 2 3 A
4 5 6 A
4 2 3 A
3 1 9 B
6 7 0 B
9 2 5 B
What is the best way to do so?
First remove last rows if not dividsable by 3 with DataFrame.iloc and then create 100% unique group by divide by 3 with integer division by 3:
print (df)
a b d
0 1 2 3
1 4 5 6
2 4 2 3
3 3 1 9
4 6 7 0
5 9 2 5
6 0 0 4 <- removed last row
N = 3
num = len(df) // N * N
df = df.iloc[:num]
df['groups'] = np.arange(len(df)) // N
print (df)
a b d groups
0 1 2 3 0
1 4 5 6 0
2 4 2 3 0
3 3 1 9 1
4 6 7 0 1
5 9 2 5 1
IIUC, groupby:
df['new_col'] = df.sum(1).groupby(np.arange(len(df))//3).transform('sum')
Output:
0 1 2 new_col
0 1 2 3 30
1 4 5 6 30
2 4 2 3 30
3 3 1 9 42
4 6 7 0 42
5 9 2 5 42

Can You Preserve Column Order When Pandas Dataframe.Combine Or DataFrame.Combine_First?

If you have 2 dataframes, represented as:
A F Y
0 1 2 3
1 4 5 6
And
B C T
0 7 8 9
1 10 11 12
When combining it becomes:
A B C F T Y
0 1 7 8 2 9 3
1 4 10 11 5 12 6
I would like it to become:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
How do I combine 1 data frame with another but keep the original column order?
In [1294]: new_df = df.join(df1)
In [1295]: new_df
Out[1295]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12
OR you can also use pd.merge(not a very clean solution though)
In [1297]: df['tmp' ] =1
In [1298]: df1['tmp'] = 1
In [1309]: pd.merge(df, df1, on=['tmp'], left_index=True, right_index=True).drop('tmp', 1)
Out[1309]:
A F Y B C T
0 1 2 3 7 8 9
1 4 5 6 10 11 12

sort dataframe by position in group then by that group

consider the dataframe df
df = pd.DataFrame(dict(
A=list('aaaaabbbbccc'),
B=range(12)
))
print(df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
I want to sort the dataframe such if I grouped by column 'A' I'd pull the first position from each group, then cycle back and get the second position from each group if any are remaining. So on and so forth.
I'd expect results tot look like this
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
You can use cumcount for count values in groups first, then sort_values and reindex by Series cum:
cum = df.groupby('A')['B'].cumcount().sort_values()
print (cum)
0 0
5 0
9 0
1 1
6 1
10 1
2 2
7 2
11 2
3 3
8 3
4 4
dtype: int64
print (df.reindex(cum.index))
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Here's a NumPy approach -
def approach1(g, v):
# Inputs : 1D arrays of groupby and value columns
id_arr2 = np.ones(v.size,dtype=int)
sf = np.flatnonzero(g[1:] != g[:-1])+1
id_arr2[sf[0]] = -sf[0]+1
id_arr2[sf[1:]] = sf[:-1] - sf[1:]+1
return id_arr2.cumsum().argsort(kind='mergesort')
Sample run -
In [246]: df
Out[246]:
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
In [247]: df.iloc[approach1(df.A.values, df.B.values)]
Out[247]:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Or using df.reindex from #jezrael's post :
df.reindex(approach1(df.A.values, df.B.values))

Categories