Autoincrement indexing after groupby with pandas on the original table - python

I cannot solve a very easy/simple problem in pandas. :(
I have the following table:
df = pd.DataFrame(data=dict(a=[1, 1, 1,2, 2, 3,1], b=["A", "A","B","A", "B", "A","A"]))
df
Out[96]:
a b
0 1 A
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 1 A
I would like to assign an incrementing ID to each unique group (grouped by columns a and b). The result would look like this (column c):
Out[98]:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I tried with:
df.groupby(["a", "b"]).nunique().cumsum().reset_index()
Result:
Out[105]:
a b c
0 1 A 1
1 1 B 2
2 2 A 3
3 2 B 4
4 3 A 5
Unfortunately, this works only on the grouped dataset and not on the original one. As you can see, the original table has 7 rows and the grouped result has only 5.
Could someone please help me get the desired table:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Thank you in advance!

groupby + ngroup
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1

Use pd.factorize after creating a tuple from the (a, b) columns:
df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1
print(df)
# Output
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
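The two answers above agree on the question's data, but they number groups in different orders in general: ngroup follows the sorted order of the group keys by default, while pd.factorize follows the order of first appearance. A minimal sketch on a small made-up frame (the data below is an assumption chosen so that the keys appear out of sorted order):

```python
import pandas as pd

# Hypothetical data: the key (2, "B") appears before (1, "A")
df = pd.DataFrame({"a": [2, 1, 2], "b": ["B", "A", "B"]})

# Default groupby sorts the keys, so (1, "A") becomes group 1
by_sorted_key = df.groupby(["a", "b"]).ngroup() + 1

# sort=False numbers the groups in order of first appearance
by_first_seen = df.groupby(["a", "b"], sort=False).ngroup() + 1

# factorize also numbers values in order of first appearance
by_factorize = pd.factorize(df[["a", "b"]].apply(tuple, axis=1))[0] + 1

print(by_sorted_key.tolist())
print(by_first_seen.tolist())
print(by_factorize.tolist())
```

If the IDs should follow the row order of the original table, sort=False (or factorize) is probably the safer choice; if they should follow the sorted keys, the default ngroup is enough.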

Related

How to calculate count within the same group based on ID

My DataFrame looks like:
df = pd.DataFrame({"ID":['A','B','A','A','B','B','C','D','D','C'],
'count':[1,1,2,2,2,2,1,1,1,2]})
print(df)
ID count
0 A 1
1 B 1
2 A 2
3 A 2
4 B 2
5 B 2
6 C 1
7 D 1
8 D 1
9 C 2
I will have only the ID column and I want to calculate the count column. The logic is that I want to cumulatively count the occurrences of an ID, but if it is repeated immediately, as at index 2 and 3, both rows should get the same count. How can I achieve this?
My attempt, which does not give accurate results:
df['x'] = df['ID'].eq(df['ID'].shift(-1)).astype(int)
df.groupby('ID')['x'].transform('cumsum')+1
0 1
1 1
2 2
3 2
4 2
5 2
6 1
7 2
8 2
9 1
Name: x, dtype: int32
This is related to a groupby cumulative count, but it is not quite the same thing.
We can filter down to the first row of each consecutive run, then reindex back:
(df[df.ID.ne(df.ID.shift())].groupby('ID').cumcount().add(1)
.reindex(df.index,method='ffill'))
Out[10]:
0 1
1 1
2 2
3 2
4 2
5 2
6 1
7 1
8 1
9 2
dtype: int64
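The one-liner above can be unpacked into steps. The sketch below rebuilds the question's ID column, marks the first row of each consecutive run, numbers the runs per ID, and forward-fills that number onto the rest of each run. The variable names are illustrative, and the last variant (a per-ID cumulative sum of run starts) is an alternative I am assuming is equivalent, not something taken from the answer.

```python
import pandas as pd

df = pd.DataFrame({"ID": list("ABAABBCDDC")})

# True on the first row of every consecutive run of the same ID
run_start = df["ID"].ne(df["ID"].shift())

# Step-by-step version of the answer: number the runs per ID, then
# forward-fill that number onto the remaining rows of each run
per_run = df.loc[run_start].groupby("ID").cumcount().add(1)
filled = per_run.reindex(df.index, method="ffill")

# Alternative sketch: count how many runs of each ID have started so far
counted = run_start.astype(int).groupby(df["ID"]).cumsum()

print(filled.tolist())
print(counted.tolist())
```

Both results should match the count column from the question.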
You could also use groupby() with sort=False:
df['count2'] = df[df.ID.ne(df.ID.shift())].groupby('ID', sort=False).cumcount().add(1)
df['count2'] = df['count2'].ffill().astype(int)  # the partial assignment leaves NaN, so the column is float until cast
Output:
ID count count2
0 A 1 1
1 B 1 1
2 A 2 2
3 A 2 2
4 B 2 2
5 B 2 2
6 C 1 1
7 D 1 1
8 D 1 1
9 C 2 2

Adding columns to DataFrame from other DataFrame without intersection

I have two DataFrames with different sizes and columns. I need to add the columns from one DataFrame to the other and fill every row with the same data.
For instance, one of them:
Out[48]:
A B
0 1 2
1 1 2
2 1 2
3 1 2
and the other
Out[49]:
C D
0 3 4
I want to have a new one as:
A B C D
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
Is it possible?
You can use assign with a pd.Series (here df is the four-row frame and df1 is the one-row frame):
df.assign(**df1.loc[0])
Out[11]:
A B C D
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
Using join with ffill (here df1 is the four-row frame and df2 is the one-row frame):
df1.join(df2).ffill().astype(int)
A B C D
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
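The two answers use different variable names for the same frames, so here is a self-contained sketch with neutral names (left and right are my own labels, not from the thread). It shows the assign approach and, as an alternative I am adding, a cross merge, which should be available from pandas 1.2 onward.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 1, 1, 1], "B": [2, 2, 2, 2]})
right = pd.DataFrame({"C": [3], "D": [4]})

# Broadcast the single row of `right` as constant columns on `left`
via_assign = left.assign(**right.iloc[0].to_dict())

# Cross join: every row of `left` paired with every row of `right`
# (with a one-row `right` this just repeats its values)
via_cross = left.merge(right, how="cross")

print(via_cross)
```

The cross merge also generalizes to a right-hand frame with more than one row, where it would produce every combination rather than a constant fill.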

Keep values of between two columns based on third column in pandas

I have three columns, A, B and C. I want to create a fourth column D that contains values of A or B, based on the value of C. For example:
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
In the above example, column D takes the value of column A if the value of C is 1 and the value of column B if the value of C is 0. Is there an elegant way to do it in Pandas? Thank you for your help.
Use numpy.where:
In [20]: df
Out[20]:
A B C
0 1 2 1
1 2 3 0
2 3 4 0
3 4 5 1
In [21]: df['D'] = np.where(df.C, df.A, df.B)
In [22]: df
Out[22]:
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
pandas
In consideration of the OP's request ("Is there an elegant way to do it in Pandas?"), here is my opinion of elegant, idiomatic pure pandas: assign + pd.Series.where
df.assign(D=df.A.where(df.C.astype(bool), df.B))  # where expects a boolean condition, so cast the 0/1 column
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
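Response to a comment: "How would you modify the pandas answer if instead of 0, 1 in column C you had A, B?"
(DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the line below needs an older version.)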
df.assign(D=df.lookup(df.index, df.C))
A B C D
0 1 2 A 1
1 2 3 B 3
2 3 4 B 4
3 4 5 A 4
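For pandas versions without DataFrame.lookup, the same label-based selection can be sketched with NumPy indexing. This is an assumed replacement, not part of the original answer: it looks up the position of the column named in C for each row and picks that cell. With only two candidate columns, np.where on the label works as well.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 3, 4, 5], "C": ["A", "B", "B", "A"]})

# Only the columns that C can point at
values = df[["A", "B"]]

# Position (0 or 1) of the column named in C, row by row
col_pos = values.columns.get_indexer(df["C"])

# Pick one cell per row: row i, column col_pos[i]
picked = values.to_numpy()[np.arange(len(df)), col_pos]

# Two-column shortcut: choose A where C == "A", otherwise B
shortcut = np.where(df["C"].eq("A"), df["A"], df["B"])

df["D"] = picked
print(df)
```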

New column with column name from max column by index pandas

I want to create a new column that holds, for each row, the name of the column with the maximum value.
A tie should include both column names.
A B C D
TRDNumber
ALB2008081610 3 1 1 1
ALB200808167 1 3 4 1
ALB200808168 3 1 3 1
ALB200808171 2 2 5 1
ALB2008081710 1 2 2 5
Desired output
A B C D Best
TRDNumber
ALB2008081610 3 1 1 1 A
ALB200808167 1 3 4 1 C
ALB200808168 3 1 3 1 A,C
ALB200808171 2 2 5 1 C
ALB2008081710 1 2 2 5 D
I have tried the following code
df.groupby(['TRDNumber'])[cols].max()
You can compare each row against its own maximum and join the names of the matching columns:
>>> f = lambda r: ','.join(df.columns[r])
>>> df.eq(df.max(axis=1), axis=0).apply(f, axis=1)
TRDNumber
ALB2008081610 A
ALB200808167 C
ALB200808168 A,C
ALB200808171 C
ALB2008081710 D
dtype: object
>>> df['best'] = _
>>> df
A B C D best
TRDNumber
ALB2008081610 3 1 1 1 A
ALB200808167 1 3 4 1 C
ALB200808168 3 1 3 1 A,C
ALB200808171 2 2 5 1 C
ALB2008081710 1 2 2 5 D
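The session above relies on the REPL's `_` variable, so here is a script-friendly sketch of the same idea that rebuilds the question's table. The helper variable names are my own. It also shows idxmax for comparison, which returns only the first maximal column and therefore does not report ties.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [3, 1, 3, 2, 1], "B": [1, 3, 1, 2, 2],
     "C": [1, 4, 3, 5, 2], "D": [1, 1, 1, 1, 5]},
    index=pd.Index(
        ["ALB2008081610", "ALB200808167", "ALB200808168",
         "ALB200808171", "ALB2008081710"],
        name="TRDNumber",
    ),
)

# Boolean frame: True where a cell equals its row maximum
is_max = df.eq(df.max(axis=1), axis=0)

# Join the names of all maximal columns in each row (keeps ties)
best = is_max.apply(lambda row: ",".join(row.index[row.to_numpy()]), axis=1)

# idxmax keeps only the first maximal column, so ties are lost
first_only = df.idxmax(axis=1)

df["Best"] = best
print(df)
```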

combination of two DF, pandas

I have two DataFrames.
First df
A B C
1 1 3
1 1 2
1 2 5
2 2 7
2 3 7
Second df
B D
1 5
2 6
3 4
Column B has the same meaning in both DataFrames. What is the easiest way to add column D to the corresponding values in the first df? The output should be:
A B C D
1 1 3 5
1 1 2 5
1 2 5 6
2 2 7 6
2 3 7 4
Perform a 'left' merge in your case on column 'B':
In [206]:
df.merge(df1, how='left', on='B')
Out[206]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4
Another method would be to set 'B' on your second df as the index and then call map:
In [215]:
df1 = df1.set_index('B')
df['D'] = df['B'].map(df1['D'])
df
Out[215]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4
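Both approaches can be run end to end on the question's data. In the sketch below the frame names follow the answer (df is the first frame, df1 the lookup frame); the validate argument is an optional addition of mine that makes the merge raise if B is not unique in df1.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2], "B": [1, 1, 2, 2, 3], "C": [3, 2, 5, 7, 7]})
df1 = pd.DataFrame({"B": [1, 2, 3], "D": [5, 6, 4]})

# Left merge keeps every row of df and pulls in the matching D;
# validate guards against duplicate B values in the lookup frame
merged = df.merge(df1, how="left", on="B", validate="many_to_one")

# Same result via map: turn df1 into a B -> D lookup Series
mapped = df.assign(D=df["B"].map(df1.set_index("B")["D"]))

print(merged)
```

The map variant leaves NaN for any B value missing from df1, just as the left merge does, so the two should agree on unmatched rows as well.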
