finding duplicates in a column without dropping - python

Input:
In [4]: df1
Out[4]:
A B
0 1 1
1 2 2
2 1 3
3 2 4
4 3 5
5 4 6
6 3 7
7 3 8
I need to get only the rows whose value in the "A" column of df1 is duplicated. I used df1['A'].duplicated(), but the result leaves out one row from each group of duplicates. My expected output is below.
Expected Output:
In [7]: df2
Out[7]:
A B
0 1 1
1 1 3
2 2 2
3 2 4
4 3 5
5 3 7
6 3 8

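Use:
df1[df1['A'].duplicated(keep=False)]
The keep=False option flags every occurrence of a duplicated value, including the first one, so no row from a duplicated group is left out. The selected rows keep their original order and index labels.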
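To reproduce the expected output exactly, the selected rows also have to be sorted by A and renumbered. The following is a minimal sketch, assuming the frame from the question is rebuilt as df1; the names dupes and df2 are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 1, 2, 3, 4, 3, 3],
                    "B": [1, 2, 3, 4, 5, 6, 7, 8]})

# keep=False marks every member of a duplicated group, not just the repeats
dupes = df1[df1["A"].duplicated(keep=False)]

# A stable sort puts equal A values together while preserving the original
# row order inside each group; reset_index renumbers the rows from 0
df2 = dupes.sort_values("A", kind="mergesort").reset_index(drop=True)
print(df2)
```

The stable sort matters only if the order of B within each group should match the input; without it, ties may be reordered.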

Related

pandas get first row for each unique value in a column

Given a pandas data frame, how can I get the first row for each unique value in a column?
for example, given:
a b key
0 1 2 1
1 2 3 1
2 3 3 1
3 4 5 2
4 5 6 2
5 6 6 2
6 7 2 1
7 8 2 1
8 9 2 3
the result when analyzing by column key should be
a b key
0 1 2 1
3 4 5 2
8 9 2 3
p.s. df src:
pd.DataFrame([{'a':1,'b':2,'key':1},
{'a':2,'b':3,'key':1},
{'a':3,'b':3,'key':1},
{'a':4,'b':5,'key':2},
{'a':5,'b':6,'key':2},
{'a':6,'b':6,'key':2},
{'a':7,'b':2,'key':1},
{'a':8,'b':2,'key':1},
{'a':9,'b':2,'key':3}])
drop_duplicates does this. By default it keeps the first row of each set of duplicates; the keep parameter changes that (keep='last' keeps the last row, keep=False drops every duplicated row).
df = df.drop_duplicates('key')
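As a quick illustration of the keep parameter, here is a small sketch that rebuilds the frame from the question and compares the default with keep='last'. The variable names first_rows and last_rows are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   "b": [2, 3, 3, 5, 6, 6, 2, 2, 2],
                   "key": [1, 1, 1, 2, 2, 2, 1, 1, 3]})

# keep="first" is the default: the first row seen for each key survives
first_rows = df.drop_duplicates("key")

# keep="last" keeps the last row seen for each key instead
last_rows = df.drop_duplicates("key", keep="last")

print(first_rows)
print(last_rows)
```

In both cases the surviving rows keep their original index labels; chain reset_index(drop=True) if a fresh 0..n-1 index is wanted.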

pop rows from dataframe based on conditions

From the dataframe
import pandas as pd
df1 = pd.DataFrame({'A':[1,1,1,1,2,2,2,2],'B':[1,2,3,4,5,6,7,8]})
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 5
5 2 6
6 2 7
7 2 8
I want to pop 2 rows where 'A' == 2, preferably in a single statement like
df2 = df1.somepopfunction(...)
to generate the following result:
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 7
5 2 8
print(df2)
A B
0 2 5
1 2 6
The pandas pop function sounds promising, but it only pops complete columns.
What statement can replace the pseudocode
df2 = df1.somepopfunction(...)
to generate the desired results?
pandas has no pop function for removing rows. You need to filter first and then drop the filtered rows from df1:
df2 = df1[df1.A.eq(2)].head(2)
print (df2)
A B
4 2 5
5 2 6
df1 = df1.drop(df2.index)
print (df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
6 2 7
7 2 8
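If a single call is preferred, the two steps can be wrapped in a small function. This is only a sketch: pop_rows is a hypothetical helper, not part of pandas, and it assumes the index labels of the input frame are unique (otherwise drop would remove more rows than intended). It also resets both indexes so the result matches the numbering in the question.

```python
import pandas as pd

def pop_rows(df, mask, n):
    """Hypothetical helper: split off the first n rows where mask is True.

    Returns (remaining, popped), both with a fresh 0..k-1 index.
    Assumes df has unique index labels.
    """
    popped = df[mask].head(n)
    remaining = df.drop(popped.index)
    return remaining.reset_index(drop=True), popped.reset_index(drop=True)

df1 = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 2, 2],
                    "B": [1, 2, 3, 4, 5, 6, 7, 8]})

# Pop the first two rows where A == 2
df1, df2 = pop_rows(df1, df1["A"].eq(2), 2)
print(df1)
print(df2)
```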

How to use two columns to distinguish data points in a pandas dataframe

I have a dataframe that looks as follows:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[[1,2,3],[1,2,3],[1,2,3]], 'c': [[4,5,6],[4,5,6],[4,5,6]]})
I want to explode the dataframe on columns b and c. I know that with a single column we can do
df.explode('column_name')
However, I can't find a way to do this with two columns. Here is the desired output.
output = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c': [4,5,6,4,5,6,4,5,6]})
I have tried
df.explode(['a','b'])
but it does not work and gives me a
ValueError: column must be a scalar.
Thanks.
Let us try
df=pd.concat([df[x].explode() for x in ['b','c']],axis=1).join(df[['a']]).reindex(columns=df.columns)
Out[179]:
a b c
0 1 1 4
0 1 2 5
0 1 3 6
1 2 1 4
1 2 2 5
1 2 3 6
2 3 1 4
2 3 2 5
2 3 3 6
You can use itertools.chain along with zip to get your result:
from itertools import chain
# repeat a once per list element (len(b)), not once per column
pd.DataFrame(chain.from_iterable(zip([a] * len(b), b, c)
for a, b, c in df.to_numpy()))
0 1 2
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
The list comprehension from @Ben is the fastest. However, if you are not too concerned about speed, you can use apply with pd.Series.explode
df.set_index('a').apply(pd.Series.explode).reset_index()
Or simply use apply on the whole frame. On non-list columns, explode returns the original values
df.apply(pd.Series.explode).reset_index(drop=True)
Out[42]:
a b c
0 1 1 4
1 1 2 5
2 1 3 6
3 2 1 4
4 2 2 5
5 2 3 6
6 3 1 4
7 3 2 5
8 3 3 6
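On pandas 1.3 or newer, explode also accepts a list of columns directly, provided the lists in each row have matching lengths. A minimal sketch under that version assumption, using the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
                   "c": [[4, 5, 6], [4, 5, 6], [4, 5, 6]]})

# Requires pandas >= 1.3; b and c must hold equal-length lists in every row.
# ignore_index=True renumbers the result 0..n-1 instead of repeating labels.
output = df.explode(["b", "c"], ignore_index=True)
print(output)
```

The exploded columns come back with object dtype, so an astype(int) may be needed afterwards if numeric columns are expected.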

pandas stack second column below first and vice versa

I have a DataFrame with two columns and I would like to stack the second column below the first and the first below the second.
df = pd.DataFrame({'A':[1,2,3], 'B': [4,5,6]})
A B
0 1 4
1 2 5
2 3 6
Desired output:
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
So far I have tried:
pd.concat([df, df[['B','A']].rename(columns={'A':'B', 'B':'A'})])
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
Is this the cleanest way?
Concat is better, if you ask me. But if you have 100 columns, renaming is a pain. A more general approach uses numpy's fliplr and vstack (pd.np was removed in pandas 2.0, so import numpy directly):
import numpy as np
v = df.values
pd.DataFrame(np.vstack((v, np.fliplr(v))), columns=df.columns)
A B
0 1 4
1 2 5
2 3 6
3 4 1
4 5 2
5 6 3
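The same idea can also be written without dropping to a numpy array, which may be preferable when the columns have different dtypes. The sketch below reverses the column order positionally and relabels it with the original names, so no explicit rename mapping is needed; flipped and stacked are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Reverse the column order by position, then reuse the original labels
flipped = df.iloc[:, ::-1].set_axis(df.columns, axis=1)

# Stack the flipped copy under the original and renumber the rows
stacked = pd.concat([df, flipped], ignore_index=True)
print(stacked)
```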

tracking maximum value in dataframe column

I want to produce a column B in a dataframe that tracks the maximum value reached in column A since row Index 0.
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
I want to avoid iterating. Is there a vectorized solution, and if so, what would it look like?
You're looking for cummax:
In [257]:
df['B'] = df['A'].cummax()
df
Out[257]:
A B
Index
0 1 1
1 2 2
2 3 3
3 2 3
4 1 3
5 3 3
6 4 4
7 2 4
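For completeness, here is a self-contained version that rebuilds column A from the question and checks the pandas result against numpy's running maximum. The variable name running_max is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 2, 1, 3, 4, 2]})
df.index.name = "Index"

# cummax returns, for each row, the largest value of A seen so far
df["B"] = df["A"].cummax()

# The same running maximum computed on the underlying numpy array
running_max = np.maximum.accumulate(df["A"].to_numpy())
print(df)
```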
