Pandas number rows within group in increasing order - python

Given the following data frame:
import pandas as pd
import numpy as np
df=pd.DataFrame({'A':['A','A','A','B','B','B'],
'B':['a','a','b','a','a','a'],
})
df
A B
0 A a
1 A a
2 A b
3 B a
4 B a
5 B a
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
I've tried this so far:
df['C']=df.groupby(['A','B'])['B'].transform('rank')
...but it doesn't work!

Use groupby/cumcount:
In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
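For readers who want to run the cumcount approach directly, here is a minimal self-contained sketch using the question's data. cumcount numbers rows from 0 within each group in order of appearance, which is why 1 is added.

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})

# cumcount() is 0-based within each (A, B) group, so add 1 to start at 1
df['C'] = df.groupby(['A', 'B']).cumcount() + 1
print(df)
```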

Use the groupby.rank function.
Here is a working example:
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
a 1
a 2
a 3
b 4
b 5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
a 1 1
a 2 2
a 3 3
b 4 1
b 5 2
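One detail worth checking when using rank this way: it returns floating-point values, so a cast may be wanted if integer row numbers are expected. The sketch below reuses the example data; the `rank` variable name is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})

# method="first" breaks ties by order of appearance within each group
rank = df.groupby("C1")["C2"].rank(method="first", ascending=True)

# rank() yields floats; cast to int for plain row numbers
df["RANK"] = rank.astype(int)
print(df)
```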

Related

Pandas cumulative count on new value

I have a data frame like the below one.
df = pd.DataFrame()
df['col_1'] = [1, 1, 1, 2, 2, 2, 3, 3, 3]
df['col_2'] = ['A', 'B', 'B', 'A', 'B', 'C', 'A', 'A', 'B']
df
col_1 col_2
0 1 A
1 1 B
2 1 B
3 2 A
4 2 B
5 2 C
6 3 A
7 3 A
8 3 B
I need to group by on col_1 and within each group, I need to update cumulative count whenever there is a new value in col_2. Something like below data frame.
col_1 col_2 col_3
0 1 A 1
1 1 B 2
2 1 B 2
3 2 A 1
4 2 B 2
5 2 C 3
6 3 A 1
7 3 A 1
8 3 B 2
I could do this using lists and a dictionary, but I couldn't find a way using pandas built-in functions.
Use factorize with a lambda function in GroupBy.transform:
df['col_3'] = df.groupby('col_1')['col_2'].transform(lambda x: pd.factorize(x)[0]+1)
print (df)
col_1 col_2 col_3
0 1 A 1
1 1 B 2
2 1 B 2
3 2 A 1
4 2 B 2
5 2 C 3
6 3 A 1
7 3 A 1
8 3 B 2
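To see why this works, it may help to look at factorize on its own: it assigns integer codes in order of first appearance, so applying it per group restarts the numbering for each col_1 value. A small sketch (the standalone Series is made up for illustration):

```python
import pandas as pd

# factorize returns (codes, uniques); codes follow order of first appearance
codes, uniques = pd.factorize(pd.Series(['A', 'B', 'B', 'A', 'C']))
print(codes)          # integer codes, one per input value
print(list(uniques))  # distinct values in order of first appearance

df = pd.DataFrame({'col_1': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'col_2': ['A', 'B', 'B', 'A', 'B', 'C', 'A', 'A', 'B']})

# Applied within each col_1 group, the codes restart at 0, hence the +1
df['col_3'] = df.groupby('col_1')['col_2'].transform(lambda x: pd.factorize(x)[0] + 1)
print(df)
```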

Unroll a matrix in Pandas

I've got a matrix like this:
df = pd.DataFrame({'a':[7, 0, 3], 'b':[0, 4, 2], 'c':[3, 2, 9]})
df.index = list(df)
df
a b c
a 7 0 3
b 0 4 2
c 3 2 9
And I'd like to get something like this:
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
For which I've written the following code:
vv = pd.DataFrame(columns=['C1', 'C2', 'V'])
i = 0
for cat1 in df.index:
    for cat2 in df.index:
        vv.loc[i] = [cat1, cat2, df[cat1][cat2]]
        i += 1
vv['V'] = vv['V'].astype(int)
Is there a better/faster/more elegant way of doing this?
In [90]: df = df.stack().reset_index()
In [91]: df.columns = ['C1', 'C2', 'v']
In [92]: df
Out[92]:
C1 C2 v
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
You can use the stack() method, then reset the index and rename the columns.
df = pd.DataFrame({'a':[7, 0, 3], 'b':[0, 4, 2], 'c':[3, 2, 9]})
df.index = list(df)
result = df.stack().reset_index().rename(columns={'level_0':'C1', 'level_1':'C2',0:'V'})
print(result)
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
Use:
df = (df.rename_axis('C2')
.reset_index()
.melt('C2', var_name='C1', value_name='V')
.reindex(columns=['C1','C2','V']))
print (df)
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
You can use stack:
df.stack()
a a 7
b 0
c 3
b a 0
b 4
c 2
c a 3
b 2
c 9
dtype: int64
Setting pd.set_option('display.multi_sparse', False) turns off the sparse display of the MultiIndex, so the outer label is shown on every row.
Additionally, with proper renaming in a pipeline:
(df.stack()
   .reset_index()
   .rename(columns={'level_0': 'C1', 'level_1': 'C2', 0: 'V'}))
yields:
C1 C2 V
0 a a 7
1 a b 0
2 a c 3
3 b a 0
4 b b 4
5 b c 2
6 c a 3
7 c b 2
8 c c 9
To complete the answer and get the same output, I've added the following code:
vv = df.stack().reset_index()
vv.columns = ['C1', 'C2', 'V']
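As a variation on the stack answers above, the axes can be named before stacking so that no renaming of 'level_0' / 'level_1' is needed afterwards. This is a sketch of one possible alternative, not something taken from the answers; the `result` name is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 0, 3], 'b': [0, 4, 2], 'c': [3, 2, 9]})
df.index = list(df)

# Name the row and column axes first so stack() produces named index levels,
# then let reset_index(name=...) name the value column
result = (df.rename_axis(index='C1', columns='C2')
            .stack()
            .reset_index(name='V'))
print(result)
```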

Get values and column names

I have a pandas data frame that looks something like this:
data = {'1' : [0, 2, 0, 0], '2' : [5, 0, 0, 2], '3' : [2, 0, 0, 0], '4' : [0, 7, 0, 0]}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd'])
df
1 2 3 4
a 0 5 2 0
b 2 0 0 7
c 0 0 0 0
d 0 2 0 0
I know I can get the maximum value and the corresponding column name for each row by doing (respectively):
df.max(1)
df.idxmax(1)
How can I get the values and the column name for every cell that is not zero?
So in this case, I'd want 2 tables, one giving me each value != 0 for each row:
a 5
a 2
b 2
b 7
d 2
And one giving me the column names for those values:
a 2
a 3
b 1
b 4
d 2
Thanks!
You can use stack to get a Series, filter it by boolean indexing, then apply rename_axis and reset_index. Finally, either drop a column or select a subset of columns:
s = df.stack()
df1 = s[s!= 0].rename_axis(['a','b']).reset_index(name='c')
print (df1)
a b c
0 a 2 5
1 a 3 2
2 b 1 2
3 b 4 7
4 d 2 2
df2 = df1.drop('b', axis=1)
print (df2)
a c
0 a 5
1 a 2
2 b 2
3 b 7
4 d 2
df3 = df1.drop('c', axis=1)
print (df3)
a b
0 a 2
1 a 3
2 b 1
3 b 4
4 d 2
df2 = df1[['a','c']]
print (df2)
a c
0 a 5
1 a 2
2 b 2
3 b 7
4 d 2
df3 = df1[['a','b']]
print (df3)
a b
0 a 2
1 a 3
2 b 1
3 b 4
4 d 2
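The same steps can be put together in one runnable sketch that keeps the row label as the index, which is closer to the two tables asked for in the question. The names 'row', 'col', 'value' and `long_df` are illustrative choices, not part of the original answer.

```python
import pandas as pd

data = {'1': [0, 2, 0, 0], '2': [5, 0, 0, 2], '3': [2, 0, 0, 0], '4': [0, 7, 0, 0]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])

# stack() turns the frame into a Series indexed by (row label, column label)
s = df.stack()

# Keep only non-zero cells, then flatten the index into ordinary columns
long_df = s[s != 0].rename_axis(['row', 'col']).reset_index(name='value')

values = long_df.set_index('row')['value']   # non-zero values per row
columns = long_df.set_index('row')['col']    # column names of those values
print(values)
print(columns)
```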

How to add new column with handling nan value

I have a dataframe like this
A B
0 a 1
1 b 2
2 c 3
3 d nan
4 e nan
I would like to add column C like below
A B C
0 a 1 a1
1 b 2 b2
2 c 3 c3
3 d nan d
4 e nan e
So I tried
df["C"]=df.A+df.B
but it returns
C
a1
b2
c3
nan
nan
How can I get the correct result?
I think the values in your dataframe are of type str, so try fillna:
In [10]: import pandas as pd
In [11]: import numpy as np
In [12]: df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
'B': ['1', '2', '3', np.nan, np.nan]})
In [13]: df.B.fillna('')
Out[13]:
0 1
1 2
2 3
3
4
Name: B, dtype: object
In [14]: df
Out[14]:
A B
0 a 1
1 b 2
2 c 3
3 d NaN
4 e NaN
[5 rows x 2 columns]
In [15]: df.B = df.B.fillna('')
In [16]: df["C"]=df.A+df.B
In [17]: df
Out[17]:
A B C
0 a 1 a1
1 b 2 b2
2 c 3 c3
3 d d
4 e e
[5 rows x 3 columns]
Another option is to join the values of each row as strings:
df['C'] = pd.Series(df.fillna('').values.tolist()).str.join('')
You can also use the add method with the fill_value parameter:
df['C'] = df.A.add(df.B, fill_value='')
df
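A compact runnable sketch of the fillna idea, assuming (as the first answer does) that column B holds strings with NaN for missing entries. The second line shows Series.str.cat with na_rep as another possible route; the 'C_cat' column name is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
                   'B': ['1', '2', '3', np.nan, np.nan]})

# Replace missing values with an empty string before concatenating
df['C'] = df['A'] + df['B'].fillna('')

# Alternative: str.cat substitutes na_rep for missing values while joining
df['C_cat'] = df['A'].str.cat(df['B'], na_rep='')
print(df)
```

If B were numeric rather than string, it would need converting to strings first, which this sketch does not cover.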

Assign series to DataFrame with unequal indices

I have the following dataframe and series with different indices, and I would like to add series 's' to dataframe df2.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [1, 1, 2, 2], 'c': [1, 2, 3,4]})
>>> df
a b c
0 1 1 1
1 2 1 2
2 2 2 3
3 3 2 4
>>> df2 = df.set_index(['a', 'b'])
>>> df2
c
a b
1 1 1
2 1 2
2 3
3 2 4
>>> s = pd.Series([10, 20, 30], pd.MultiIndex.from_tuples([[1], [2], [3]], names=['a']))
>>> s
a
1 10
2 20
3 30
dtype: int64
>>> df2['x'] = s
>>> df2
c x
a b
1 1 1 NaN
2 1 2 NaN
2 3 NaN
3 2 4 NaN
I know column 'x' is set to NaN because the indices don't match, but is there a way to add series 's' by matching only on the index levels the two objects share?
The expected result is
>>> df2
c x
a b
1 1 1 10
2 1 2 20
2 3 20 # because index a=2 (ignored 'b' because it didn't exist in series 's')
3 2 4 30
You can use DataFrame.join:
>>> df2.join(pd.DataFrame({"x": s}))
c x
a b
1 1 1 10
2 1 2 20
2 3 20
3 2 4 30
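As a possible alternative to join, the values can be looked up by the 'a' level of the index directly. The sketch below makes one simplifying assumption that differs from the question: `s` is built with a plain Index named 'a' instead of a one-level MultiIndex.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [1, 1, 2, 2], 'c': [1, 2, 3, 4]})
df2 = df.set_index(['a', 'b'])

# Assumption: s uses a plain Index named 'a', not a one-level MultiIndex
s = pd.Series([10, 20, 30], index=pd.Index([1, 2, 3], name='a'))

# Look up each row's 'a' label in s; the result is assigned by position
df2['x'] = df2.index.get_level_values('a').map(s)
print(df2)
```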
