Rank by group after sorting in pandas - python

I have a dataframe which looks like this:
pd.DataFrame({'A': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
              'B': ['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C2'],
              'X': [1, 2, 1, 2, 2, 3, 4, 5],
              'Y': [2, 1, 2, 2, 7, 5, 7, 7],
              'Z': [2, 1, 2, 1, 5, 8, 1, 9]})
Out[10]:
A B X Y Z
0 A C1 1 2 2
1 B C1 2 1 1
2 C C1 1 2 2
3 D C1 2 2 1
4 E C2 2 7 5
5 F C2 3 5 8
6 G C2 4 7 1
7 H C2 5 7 9
I need to sort the dataframe by columns B, Z, Y, X and then rank the rows within each group of B.
The resulting dataframe should look like this:
Out[12]:
A B X Y Z R
1 B C1 2 1 1 1
3 D C1 2 2 1 2
0 A C1 1 2 2 3
2 C C1 1 2 2 4
6 G C2 4 7 1 1
5 F C2 3 5 2 2
4 E C2 2 1 5 3
7 H C2 5 7 9 4
I know I can use df.sort_values(['B', 'Z', 'Y', 'X']) to get the right order, but I am struggling to apply the rank.
Is there a one-line way to do the sorting and ranking?

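You can use groupby().cumcount(); the result is aligned on the original index when assigned back, so the rows keep their original order:
df['R'] = df.sort_values(['B','Z','Y','X']).groupby('B').cumcount() + 1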
Output:
A B X Y Z R
0 A C1 1 2 2 3
1 B C1 2 1 1 1
2 C C1 1 2 2 4
3 D C1 2 2 1 2
4 E C2 2 7 5 2
5 F C2 3 5 8 3
6 G C2 4 7 1 1
7 H C2 5 7 9 4
To match your output, separate sort_values and groupby():
df = df.sort_values(['B','Z','Y','X'])
df['R'] = df.groupby('B').cumcount() + 1
Output:
A B X Y Z R
1 B C1 2 1 1 1
3 D C1 2 2 1 2
0 A C1 1 2 2 3
2 C C1 1 2 2 4
6 G C2 4 7 1 1
4 E C2 2 7 5 2
5 F C2 3 5 8 3
7 H C2 5 7 9 4
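If you really want a single chained expression, here is a minimal sketch that does the sort and the per-group numbering in one go. It assumes the same sample data as in the question; the name ranked is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'B': ['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C2'],
    'X': [1, 2, 1, 2, 2, 3, 4, 5],
    'Y': [2, 1, 2, 2, 7, 5, 7, 7],
    'Z': [2, 1, 2, 1, 5, 8, 1, 9],
})

# Sort first, then number the rows inside each group of B in that sorted order.
ranked = (
    df.sort_values(['B', 'Z', 'Y', 'X'])
      .assign(R=lambda d: d.groupby('B').cumcount() + 1)
)
print(ranked)
```

Note that cumcount numbers rows by position, so tied rows (A and C here) still get distinct ranks.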

Related

Pandas max for rows, top n max

I'm trying to create "top" columns that hold the largest values across several columns in each row. Pandas has the nlargest method, but I cannot get it to work row-wise. Pandas also has max and idxmax, which do exactly what I want, but only for the single largest value.
df = pd.DataFrame(np.array([[1, 2, 3, 5, 1, 9], [4, 5, 6, 2, 5, 9], [7, 8, 9, 2, 5, 10]]), columns=['a', 'b', 'c', 'd', 'e', 'f'])
cols = df.columns[:-1].tolist()
df['max_1_val'] = df[cols].max(axis=1)
df['max_1_col'] = df[cols].idxmax(axis=1)
Output:
a b c d e f max_1_val max_1_col
0 1 2 3 5 1 9 5 d
1 4 5 6 2 5 9 6 c
2 7 8 9 2 5 10 9 c
But I am trying to get max_n_val and max_n_col so the expected output for top 3 would be:
a b c d e f max_1_val max_1_col max_2_val max_2_col max_3_val max_3_col
0 1 2 3 5 1 9 5 d 3 c 2 b
1 4 5 6 2 5 9 6 c 5 b 5 e
2 7 8 9 2 5 10 9 c 8 b 7 a
To improve performance, use numpy.argsort to get the positions; for descending order, take the last 3 items of each row, reversed by slicing:
N = 3
a = df[cols].to_numpy().argsort()[:, :-N-1:-1]
print (a)
[[3 2 1]
[2 4 1]
[2 1 0]]
Then get the column names in c by indexing with a, and get the reordered values in d by indexing each row with the same positions:
c = np.array(cols)[a]
d = df[cols].to_numpy()[np.arange(a.shape[0])[:, None], a]
Finally, create the DataFrames, join them with concat, and reorder the column names with DataFrame.reindex:
df1 = pd.DataFrame(c).rename(columns=lambda x : f'max_{x+1}_col')
df2 = pd.DataFrame(d).rename(columns=lambda x : f'max_{x+1}_val')
c = df.columns.tolist() + [y for x in zip(df2.columns, df1.columns) for y in x]
df = pd.concat([df, df1, df2], axis=1).reindex(c, axis=1)
print (df)
a b c d e f max_1_val max_1_col max_2_val max_2_col max_3_val \
0 1 2 3 5 1 9 5 d 3 c 2
1 4 5 6 2 5 9 6 c 5 e 5
2 7 8 9 2 5 10 9 c 8 b 7
max_3_col
0 b
1 b
2 a
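Note that with tied values (the two 5s in the second row) the default argsort does not guarantee which column comes first. If you need ties in left-to-right column order, as in the expected output of the question, one possible variant is a stable sort on the negated values. This is only a sketch and assumes all values are numeric; the names order, top_vals, top_cols and out are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, 3, 5, 1, 9], [4, 5, 6, 2, 5, 9], [7, 8, 9, 2, 5, 10]]),
    columns=['a', 'b', 'c', 'd', 'e', 'f'],
)
cols = df.columns[:-1].tolist()
N = 3

values = df[cols].to_numpy()
# Negate so an ascending sort yields descending values;
# kind='stable' keeps tied values in their original column order.
order = np.argsort(-values, axis=1, kind='stable')[:, :N]
top_vals = np.take_along_axis(values, order, axis=1)
top_cols = np.array(cols)[order]

out = df.copy()
for i in range(N):
    out[f'max_{i + 1}_val'] = top_vals[:, i]
    out[f'max_{i + 1}_col'] = top_cols[:, i]
print(out)
```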

use meshgrid for rows with common values in column

my dataframes:
df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]),columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]),columns=['a', 'b', 'c'])
df1,df2:
a b c
0 1 2 3
1 4 2 3
2 7 8 8
a b c
0 1 2 3
1 4 2 3
2 5 8 8
I want to combine the values of column a from both dataframes in all possible pairs, but only where the values in columns b and c are equal.
Right now I only have a solution for all pairs in general, using this code:
x = np.array(np.meshgrid(df1.a.values,
df2.a.values)).T.reshape(-1,2)
df = pd.DataFrame(x)
print(df)
0 1
0 1 1
1 1 4
2 1 5
3 4 1
4 4 4
5 4 5
6 7 1
7 7 4
8 7 5
expected output for df1.a and df2.a only for rows where df1.b==df2.b and df1.c==df2.c:
0 1
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
So basically I need to group by common rows in the selected columns b and c.
Try DataFrame.merge, which does an inner merge by default:
df1.merge(df2, on=['b', 'c'])[['a_x', 'a_y']]
a_x a_y
0 1 1
1 1 4
2 4 1
3 4 4
4 7 5
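To see why the inner merge is equivalent to "meshgrid, then filter", here is a small sketch that builds both and lets you compare them. It assumes pandas 1.2 or later for how='cross'; the suffixes and the names pairs, full and filtered are chosen just for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [7, 8, 8]]), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 2, 3], [5, 8, 8]]), columns=['a', 'b', 'c'])

# Inner merge: each df1 row is paired with every df2 row sharing the same (b, c).
pairs = df1.merge(df2, on=['b', 'c'], suffixes=('_1', '_2'))[['a_1', 'a_2']]

# Cross-check: full Cartesian product (like meshgrid), then keep equal b and c.
full = df1.merge(df2, how='cross', suffixes=('_1', '_2'))
filtered = full.loc[
    (full['b_1'] == full['b_2']) & (full['c_1'] == full['c_2']), ['a_1', 'a_2']
].reset_index(drop=True)

print(pairs)
print(filtered)
```

The merge avoids materializing the full product, which matters once the frames get large.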

How to reassign the value of a column that has repeated values if it exist for some value?

I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'codes': [1, 2, 3, 4, 1, 2, 1, 2, 1, 2], 'results': ['a', 'b', 'c', 'd', None, None, None, None, None, None]})
I need to produce the following:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
It is guaranteed that, for any value in codes, the non-None value of results is unique. That is, there won't be two rows with the same codes value and different results values.
You can do this with merge:
df[['codes']].reset_index().merge(df.dropna()).set_index('index').sort_index()
Out[571]:
codes results
index
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or map
df['results']=df.codes.map(df.set_index('codes').dropna()['results'])
df
Out[574]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or groupby + ffill
df['results']=df.groupby('codes').results.ffill()
df
Out[577]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
Or reindex | .loc
df.set_index('codes').dropna().reindex(df.codes).reset_index()
Out[589]:
codes results
0 1 a
1 2 b
2 3 c
3 4 d
4 1 a
5 2 b
6 1 a
7 2 b
8 1 a
9 2 b
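One more option, as a sketch: groupby with transform('first'). Unlike ffill, it should not depend on the filled row coming before the empty ones, because first skips missing values by default.

```python
import pandas as pd

df = pd.DataFrame({
    'codes': [1, 2, 3, 4, 1, 2, 1, 2, 1, 2],
    'results': ['a', 'b', 'c', 'd', None, None, None, None, None, None],
})

# 'first' returns the first non-null value in each group, and transform
# broadcasts it back to every row of that group.
df['results'] = df.groupby('codes')['results'].transform('first')
print(df)
```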

Unroll a matrix in Pandas

I've got a matrix like this:
df = pd.DataFrame({'a':[7, 0, 3], 'b':[0, 4, 2], 'c':[3, 2, 9]})
df.index = list(df)
df
a b c
a 7 0 3
b 0 4 2
c 3 2 9
And I'd like to get something like this:
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
For which I've written the following code:
vv = pd.DataFrame(columns=['C1', 'C2', 'V'])
i = 0
for cat1 in df.index:
    for cat2 in df.index:
        vv.loc[i] = [cat1, cat2, df[cat1][cat2]]
        i += 1
vv['V'] = vv['V'].astype(int)
Is there a better/faster/more elegant way of doing this?
In [90]: df = df.stack().reset_index()
In [91]: df.columns = ['C1', 'C2', 'v']
In [92]: df
Out[92]:
C1 C2 v
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
You can use the stack() method, followed by resetting the index and renaming the columns.
df = pd.DataFrame({'a':[7, 0, 3], 'b':[0, 4, 2], 'c':[3, 2, 9]})
df.index = list(df)
result = df.stack().reset_index().rename(columns={'level_0':'C1', 'level_1':'C2',0:'V'})
print(result)
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
Use:
df = (df.rename_axis('C2')
        .reset_index()
        .melt('C2', var_name='C1', value_name='V')
        .reindex(columns=['C1','C2','V']))
print (df)
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
You can use stack:
df.stack()
a a 7
b 0
c 3
b a 0
b 4
c 2
c a 3
b 2
c 9
dtype: int64
Setting pd.set_option('display.multi_sparse', False) makes the display non-sparse, so the outer index value is repeated on every row.
Additionally, with proper renaming in a pipeline:
(df.stack()
   .reset_index()
   .rename(columns={'level_0': 'C1', 'level_1': 'C2', 0:'V'}))
yields:
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
To complete the answer and get the same output, I've added the following code:
vv = df.stack().reset_index()
vv.columns = ['C1', 'C2', 'V']
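A variant that avoids renaming level_0 and level_1 afterwards is to name the axes before stacking. This is just a sketch using the same sample matrix:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 0, 3], 'b': [0, 4, 2], 'c': [3, 2, 9]})
df.index = list(df)

# Name the row and column axes up front so stack() carries the names
# into the MultiIndex, then name the value column in reset_index.
vv = (
    df.rename_axis(index='C1', columns='C2')
      .stack()
      .reset_index(name='V')
)
print(vv)
```

Here C1 is the row label and C2 the column label, which matters if your matrix is not symmetric.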

Perform logical operations on every column of a pandas dataframe?

I'm trying to create a new df column based on a condition that is checked against all the other columns in each row.
df = pd.DataFrame([[1, 5, 2, 8, 2], [2, 4, 4, 20, 5], [3, 3, 1, 20, 2], [4, 2, 2, 1, 0],
[5, 1, 4, -5, -4]],
columns=['a', 'b', 'c', 'd', 'e'],
index=[1, 2, 3, 4, 5])
I tried:
df['f'] = ""
df.loc[(df.any() >= 10), 'f'] = df['e'] + 10
However I get:
IndexingError: Unalignable boolean Series key provided
This is the desired output:
a b c d e f
1 1 5 2 8 2
2 2 4 4 20 5 15
3 3 3 1 20 2 12
4 4 2 2 1 0
5 5 1 4 -5 -4
Use
In [984]: df.loc[(df >= 10).any(axis=1), 'f'] = df['e'] + 10
In [985]: df
Out[985]:
a b c d e f
1 1 5 2 8 2 NaN
2 2 4 4 20 5 15.0
3 3 3 1 20 2 12.0
4 4 2 2 1 0 NaN
5 5 1 4 -5 -4 NaN
Note that:
df.any()
a True
b True
c True
d True
e True
f True
dtype: bool
df.any() >= 10
a False
b False
c False
d False
e False
f False
dtype: bool
I assume you want to check whether any value in a row is >= 10. That would be done with (df >= 10).any(axis=1).
You should be able to do this in one step, using np.where:
df['f'] = np.where((df >= 10).any(axis=1), df.e + 10, '')
df
a b c d e f
1 1 5 2 8 2
2 2 4 4 20 5 15
3 3 3 1 20 2 12
4 4 2 2 1 0
5 5 1 4 -5 -4
If you'd prefer NaNs instead of blanks, use:
df['f'] = np.where((df >= 10).any(axis=1), df.e + 10, np.nan)
df
a b c d e f
1 1 5 2 8 2 NaN
2 2 4 4 20 5 15.0
3 3 3 1 20 2 12.0
4 4 2 2 1 0 NaN
5 5 1 4 -5 -4 NaN
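By using max (compute the mask before adding the string column f, since recent pandas versions raise an error when max has to compare strings with numbers):
mask = df.max(axis=1) >= 10
df['f'] = ""
df.loc[mask, 'f'] = df.e + 10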
Out[330]:
a b c d e f
1 1 5 2 8 2
2 2 4 4 20 5 15
3 3 3 1 20 2 12
4 4 2 2 1 0
5 5 1 4 -5 -4
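Yet another way to write this, as a sketch, is Series.where, which keeps the numeric dtype and leaves NaN in the rows that do not match. The name row_mask is only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 5, 2, 8, 2], [2, 4, 4, 20, 5], [3, 3, 1, 20, 2], [4, 2, 2, 1, 0], [5, 1, 4, -5, -4]],
    columns=['a', 'b', 'c', 'd', 'e'],
    index=[1, 2, 3, 4, 5],
)

# True for each row that holds at least one value >= 10.
row_mask = (df >= 10).any(axis=1)

# Series.where keeps e + 10 where the mask is True and puts NaN elsewhere.
df['f'] = (df['e'] + 10).where(row_mask)
print(df)
```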
