Autoincrement indexing after groupby with pandas on the original table

Autoincrement indexing after groupby with pandas on the original table - python

I cannot solve a very easy/simple problem in pandas. :(
I have the following table:
df = pd.DataFrame(data=dict(a=[1, 1, 1,2, 2, 3,1], b=["A", "A","B","A", "B", "A","A"]))
df
Out[96]:
a b
0 1 A
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 1 A
I would like to make an incrementing ID of each grouped (grouped by columns a and b) unique item. So the result would like like this (column c):
Out[98]:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I tried with:
df.groupby(["a", "b"]).nunique().cumsum().reset_index()
Result:
Out[105]:
a b c
0 1 A 1
1 1 B 2
2 2 A 3
3 2 B 4
4 3 A 5
Unfortunatelly this works only for the grouped by dataset and not on the original dataset. As you can see in the original table I have 7 rows and the grouped by returns only 5.
So could someone please help me on how to get the desired table:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Thank you in advance!

groupby + ngroup
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1

Use pd.factorize after create a tuple from (a, b) columns:
df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1
print(df)
# Output
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1

Related

How to calculate count within the same group based on ID

My DataFrame looks like:
df = pd.DataFrame({"ID":['A','B','A','A','B','B','C','D','D','C'],
'count':[1,1,2,2,2,2,1,1,1,2]})
print(df)
ID count
0 A 1
1 B 1
2 A 2
3 A 2
4 B 2
5 B 2
6 C 1
7 D 1
8 D 1
9 C 2
I will be having only ID column and I want to calculate count column. The logic is I want to cumulatively count the occurrence of an ID. If its repeated immediately like index 2 & 3 they both should get same count. How can I achieve this?
My attempt which is not giving the accurate results:
df['x'] = df['ID'].eq(df['ID'].shift(-1)).astype(int)
df.groupby('ID')['x'].transform('cumsum')+1
0 1
1 1
2 2
3 2
4 2
5 2
6 1
7 2
8 2
9 1
Name: x, dtype: int32
The question is not directly related to groupby cumulative count, but it is different.

We can do filter then reindex back
(df[df.ID.ne(df.ID.shift())].groupby('ID').cumcount().add(1)
.reindex(df.index,method='ffill'))
Out[10]:
0 1
1 1
2 2
3 2
4 2
5 2
6 1
7 1
8 1
9 2
dtype: int64

You could also use groupby() with sort=False:
df['count2'] = df[(df.ID.ne(df.ID.shift()))].groupby('ID', sort=False).cumcount().add(1)
df['count2'] = df['count2'].ffill()
Output:
ID count count2
0 A 1 1
1 B 1 1
2 A 2 2
3 A 2 2
4 B 2 2
5 B 2 2
6 C 1 1
7 D 1 1
8 D 1 1
9 C 2 2

Adding columns to DataFrame from other DataFrame without intersection

I have on Dataframe with diff size and columns, I require to add the columns from one DataFrame to another, and fulfill with same data all rows.
for instance:
one of them:
Out[48]:
A B
0 1 2
1 1 2
2 1 2
3 1 2
and the other
Out[49]:
C D
0 3 4
I want to have a new one as:
A B C D
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
Is it possible?

You can assign with pd.Series
df.assign(**df1.loc[0])
Out[11]:
A B C D
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4

Using join with ffill:
df1.join(df2).ffill().astype(int)
A B C D
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4

Keep values of between two columns based on third column in pandas

I have three columns, A, B and C. I want to create a fourth column D that contains values of A or B, based on the value of C. For example:
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
In the above example, column D takes the value of column A if the value of C is 1 and the value of column B if the value of C is 0. Is there an elegant way to do it in Pandas? Thank you for your help.

Use numpy.where:
In [20]: df
Out[20]:
A B C
0 1 2 1
1 2 3 0
2 3 4 0
3 4 5 1
In [21]: df['D'] = np.where(df.C, df.A, df.B)
In [22]: df
Out[22]:
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4

pandas
In consideration of the OP's request
Is there an elegant way to do it in Pandas?
my opinion of elegance
and idiomatic pure pandas
assign + pd.Series.where
df.assign(D=df.A.where(df.C, df.B))
A B C D
0 1 2 1 1
1 2 3 0 3
2 3 4 0 4
3 4 5 1 4
response to comment
how would you modify the pandas answer if instead of 0, 1 in column C you had A, B?
df.assign(D=df.lookup(df.index, df.C))
A B C D
0 1 2 A 1
1 2 3 B 3
2 3 4 B 4
3 4 5 A 4

New column with column name from max column by index pandas

I want to create a new column with column name for the max value by index.
Tie would include both columns.
A B C D
TRDNumber
ALB2008081610 3 1 1 1
ALB200808167 1 3 4 1
ALB200808168 3 1 3 1
ALB200808171 2 2 5 1
ALB2008081710 1 2 2 5
Desired output
A B C D Best
TRDNumber
ALB2008081610 3 1 1 1 A
ALB200808167 1 3 4 1 C
ALB200808168 3 1 3 1 A,C
ALB200808171 2 2 5 1 C
ALB2008081710 1 2 2 5 D
I have tried the following code
df.groupby(['TRDNumber'])[cols].max()

you can do:
>>> f = lambda r: ','.join(df.columns[r])
>>> df.eq(df.max(axis=1), axis=0).apply(f, axis=1)
TRDNumber
ALB2008081610 A
ALB200808167 C
ALB200808168 A,C
ALB200808171 C
ALB2008081710 D
dtype: object
>>> df['best'] = _
>>> df
A B C D best
TRDNumber
ALB2008081610 3 1 1 1 A
ALB200808167 1 3 4 1 C
ALB200808168 3 1 3 1 A,C
ALB200808171 2 2 5 1 C
ALB2008081710 1 2 2 5 D

combination of two DF, pandas

I have two df,
First df
A B C
1 1 3
1 1 2
1 2 5
2 2 7
2 3 7
Second df
B D
1 5
2 6
3 4
The column Bhas the same meaning in the both dfs. What is the most easy way add column D to the corresponding values in the first df? Output should be:
A B C D
1 1 3 5
1 1 2 5
1 2 5 6
2 2 7 6
2 3 7 4

Perform a 'left' merge in your case on column 'B':
In [206]:
df.merge(df1, how='left', on='B')
Out[206]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4
Another method would be to set 'B' on your second df as the index and then call map:
In [215]:
df1 = df1.set_index('B')
df['D'] = df['B'].map(df1['D'])
df
Out[215]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Autoincrement indexing after groupby with pandas on the original table - python

groupby + ngroup df['c'] = df.groupby(['a', 'b']).ngroup() + 1 a b c 0 1 A 1 1 1 A 1 2 1 B 2 3 2 A 3 4 2 B 4 5 3 A 5 6 1 A 1

Use pd.factorize after create a tuple from (a, b) columns: df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1 print(df) # Output a b c 0 1 A 1 1 1 A 1 2 1 B 2 3 2 A 3 4 2 B 4 5 3 A 5 6 1 A 1

Related

How to calculate count within the same group based on ID

Adding columns to DataFrame from other DataFrame without intersection

Keep values of between two columns based on third column in pandas

New column with column name from max column by index pandas

combination of two DF, pandas

Categories

Resources