Pandas: find duplicates in another dataframe based on a subset - python

Assume DF 1:
A B C
0 1 1 1
1 1 1 2
2 2 1 1
3 1 9 0
4 9 9 9
And DF 2
A B C
0 6 1 1
1 1 1 2
2 2 1 1
3 1 9 0
4 1 9 6
I would like to add a column to DF 1 that counts, for each row, how many rows in DF 2 match it on a subset of columns.
For example, matching on the first two columns (A and B):
Result:
A B C Dupe
0 1 1 1 1
1 1 1 2 1
2 2 1 1 1
3 1 9 0 2
4 9 9 9 0

Sounds like you should group df2 by the subset columns, count the rows in each group, and then left-merge the counts into df1:
df=df1.merge(df2.groupby(['A','B']).size().to_frame('DUP').reset_index(),how='left').fillna(0)
A B C DUP
0 1 1 1 1.0
1 1 1 2 1.0
2 2 1 1 1.0
3 1 9 0 2.0
4 9 9 9 0.0
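The DUP column above comes out as float because the left join leaves NaN for unmatched rows before fillna(0) runs. A minimal, self-contained sketch that casts the counts back to integers is shown here; the explicit on= argument and the names subset, counts and result are my own additions for illustration, not part of the original answer.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({"A": [1, 1, 2, 1, 9], "B": [1, 1, 1, 9, 9], "C": [1, 2, 1, 0, 9]})
df2 = pd.DataFrame({"A": [6, 1, 2, 1, 1], "B": [1, 1, 1, 9, 9], "C": [1, 2, 1, 0, 6]})

subset = ["A", "B"]  # columns that define a "duplicate"

# Count how often each (A, B) combination occurs in df2.
counts = df2.groupby(subset).size().reset_index(name="Dupe")

# The left join keeps every row of df1; combinations missing from df2 become NaN.
result = df1.merge(counts, on=subset, how="left")

# Replace NaN with 0 and restore an integer dtype.
result["Dupe"] = result["Dupe"].fillna(0).astype(int)
print(result)
```

Naming the join keys explicitly avoids accidentally joining on any other column the two frames happen to share.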

Related

Duplicate a selected row and put the duplicate just below in a Pandas DataFrame

I have a Pandas dataframe like this :
id A B
0 2 2
1 1 1
2 3 3
3 7 7
And I want to duplicate the first row 3 times just below the selected row :
id A B
0 2 2
1 2 2
2 2 2
3 2 2
4 1 1
5 3 3
6 7 7
Is there a method for this that already exists in the Pandas library?
There is no built-in method for doing just this. However, you can build a list of repeat counts (one per row) and use df.loc with df.index.repeat:
new_df = df.loc[df.index.repeat([4] + [1] * (len(df) - 1))].reset_index(drop=True)
Output:
>>> new_df
id A B
0 0 2 2
1 0 2 2
2 0 2 2
3 0 2 2
4 1 1 1
5 2 3 3
6 3 7 7
Use reindex and Index.repeat to create your dataframe:
>>> df.reindex(df.index.repeat([3] + [1] * (len(df) - 1)))
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
Another way:
>>> df.loc[[df.index[0]]*3 + df.index[1:].tolist()]
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
A more generalized way proposed by @MuhammadHassan:
row_index = 0
repeat_time = 3
>>> df.reindex(df.index.tolist() + [row_index]*repeat_time).sort_index()
id A B
0 0 2 2
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
Let us try (DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead):
n=3
row = 0
df = pd.concat([df, df.loc[[row]*(n-1)]]).sort_index()
df
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
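The answers above differ in how many copies they produce and in whether the index is renumbered. As a hedged sketch, the Index.repeat idea can be wrapped in a small helper; repeat_row and its parameters are hypothetical names of mine, and the helper assumes the index labels are unique.

```python
import pandas as pd

def repeat_row(df, row_label, extra_copies):
    """Return a new frame with `extra_copies` additional copies of the row
    labelled `row_label`, placed directly below it (assumes a unique index)."""
    # One repeat count per row: 1 for ordinary rows, 1 + extra_copies for the target.
    repeats = [1 + extra_copies if label == row_label else 1 for label in df.index]
    # Index.repeat duplicates labels in place, so the copies stay adjacent.
    return df.loc[df.index.repeat(repeats)].reset_index(drop=True)

df = pd.DataFrame({"id": [0, 1, 2, 3], "A": [2, 1, 3, 7], "B": [2, 1, 3, 7]})
new_df = repeat_row(df, row_label=0, extra_copies=3)
print(new_df)
```

With extra_copies=3 the first row appears four times in total, which matches the output requested in the question.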

Python: create new column conditionally on values from two other columns

I would like to combine two columns in a new column.
Let's suppose I have:
Index A B
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 2
6 1 2
7 1 2
8 1 2
9 1 2
10 1 2
Now I would like to create a column C with the entries from A from Index 0 to 4 and from column B from Index 5 to 10. It should look like this:
Index A B C
0 1 0 1
1 1 0 1
2 1 0 1
3 1 0 1
4 1 0 1
5 1 2 2
6 1 2 2
7 1 2 2
8 1 2 2
9 1 2 2
10 1 2 2
Is there Python code that does this? Thanks in advance!
If Index is an actual column, you can use numpy.where and specify your condition:
import numpy as np
df['C'] = np.where(df['Index'] <= 4, df['A'], df['B'])
Index A B C
0 0 1 0 1
1 1 1 0 1
2 2 1 0 1
3 3 1 0 1
4 4 1 0 1
5 5 1 2 2
6 6 1 2 2
7 7 1 2 2
8 8 1 2 2
9 9 1 2 2
10 10 1 2 2
If Index is the DataFrame's actual index, you can slice each column by position with iloc and build the new column with concat:
import pandas as pd
df['C'] = pd.concat([df['A'].iloc[:5], df['B'].iloc[5:]])
print(df)
A B C
0 1 0 1
1 1 0 1
2 1 0 1
3 1 0 1
4 1 0 1
5 1 2 2
6 1 2 2
7 1 2 2
8 1 2 2
9 1 2 2
10 1 2 2
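When Index is the real index, numpy.where can also be applied to it directly, which avoids slicing and concatenating. The following is a minimal sketch under the assumption that the frame has the default integer index 0..10; the column name C_by_position is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with a default RangeIndex 0..10.
df = pd.DataFrame({"A": [1] * 11, "B": [0] * 5 + [2] * 6})

# Label-based condition: take A where the index label is <= 4, otherwise B.
df["C"] = np.where(df.index <= 4, df["A"], df["B"])

# Position-based variant, for when the index labels are not 0..n-1.
df["C_by_position"] = np.where(np.arange(len(df)) < 5, df["A"], df["B"])
print(df)
```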

Column that counts up within subgroups pandas

I've got a df
df1
a b
4 0 1
5 0 1
6 0 2
2 0 3
3 1 2
15 1 3
12 1 3
13 1 1
15 3 1
14 3 1
8 3 3
9 3 2
10 3 1
The df should be grouped by a and b, and I need a column c that counts up from 1 to the number of distinct b groups within each group of a:
df1
a b c
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 4
How can I do that?
We can do groupby + transform factorize
df['C']=df.groupby('a').b.transform(lambda x : x.factorize()[0]+1)
4 1
5 1
6 2
2 3
3 1
15 2
12 2
13 3
15 1
14 1
8 1
9 1
10 2
Name: b, dtype: int64
Just so we can see the loop version
from itertools import count
from collections import defaultdict
x = defaultdict(count)
y = {}
c = []
for a, b in zip(df.a, df.b):
    if (a, b) not in y:
        y[(a, b)] = next(x[a]) + 1
    c.append(y[(a, b)])
df.assign(C=c)
a b C
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 1
One option is to group by a, then iterate through each group and group by b. Then you can use ngroup (note that ngroup numbers the groups from 0):
import numpy as np
df['c'] = np.hstack([g.groupby('b').ngroup().to_numpy() for _,g in df.groupby('a')])
a b c
4 0 1 0
5 0 1 0
6 0 2 1
2 0 3 2
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 0
15 3 1 0
14 3 1 0
8 3 1 0
9 3 1 0
10 3 2 1
You can use groupby.rank if you don't need the numbering to follow the order of appearance in the data (a dense rank numbers the values of b by sorted value instead):
df['c'] = df.groupby('a')['b'].rank('dense').astype(int)
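To make the difference between the two numbering schemes concrete, here is a small sketch on made-up data (the frame and the column names by_appearance and by_value are mine, not from the question). factorize numbers values in order of first appearance, while a dense rank numbers them by sorted value.

```python
import pandas as pd

# Small illustrative frame; in group a == 1 the larger b value appears first.
df = pd.DataFrame({"a": [0, 0, 0, 1, 1, 1], "b": [1, 1, 2, 3, 3, 1]})

# Numbering by order of first appearance within each group of a.
df["by_appearance"] = df.groupby("a")["b"].transform(lambda s: s.factorize()[0] + 1)

# Numbering by sorted value within each group of a.
df["by_value"] = df.groupby("a")["b"].rank(method="dense").astype(int)
print(df)
```

The two columns agree for a == 0 and differ for a == 1, where b == 3 appears before b == 1.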

Reshaping groupby dataframe to fixed dimensions

I have dataframe df with following data.
A B C D
1 1 3 1
1 2 9 8
1 3 3 9
2 1 2 9
2 2 1 4
2 3 9 5
2 4 6 4
3 1 4 1
3 2 0 4
4 1 2 6
5 1 2 4
5 2 8 3
grp = df.groupby('A')
Next, I want every group of df (grouped on column A) to have the same number of rows, either by truncating extra rows or by padding with rows of 0. For the data above, I want every group to have 3 rows. I need the following result:
A B C D
1 1 3 1
1 2 9 8
1 3 3 9
2 1 2 9
2 2 1 4
2 3 9 5
3 1 4 1
3 2 0 4
3 0 0 0
4 1 2 6
4 0 0 0
4 0 0 0
5 1 2 4
5 2 8 3
5 0 0 0
Similarly, I may want to groupby on multiple columns, like
grp = df.groupby(['A','B'])
Use GroupBy.cumcount for counter column with DataFrame.reindex by MultiIndex.from_product:
df['g'] = df.groupby('A').cumcount()
mux = pd.MultiIndex.from_product([df['A'].unique(), range(3)], names=('A','g'))
df = (df.set_index(['A','g'])
        .reindex(mux, fill_value=0)
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
A B C D
0 1 1 3 1
1 1 2 9 8
2 1 3 3 9
3 2 1 2 9
4 2 2 1 4
5 2 3 9 5
6 3 1 4 1
7 3 2 0 4
8 3 0 0 0
9 4 1 2 6
10 4 0 0 0
11 4 0 0 0
12 5 1 2 4
13 5 2 8 3
14 5 0 0 0
Another solution with DataFrame.merge with left join with helper DataFrame:
from itertools import product
df['g'] = df.groupby('A').cumcount()
df1 = pd.DataFrame(list(product(df['A'].unique(), range(3))), columns=['A','g'])
df = df1.merge(df, how='left').fillna(0).astype(int).drop('g', axis=1)
print (df)
A B C D
0 1 1 3 1
1 1 2 9 8
2 1 3 3 9
3 2 1 2 9
4 2 2 1 4
5 2 3 9 5
6 3 1 4 1
7 3 2 0 4
8 3 0 0 0
9 4 1 2 6
10 4 0 0 0
11 4 0 0 0
12 5 1 2 4
13 5 2 8 3
14 5 0 0 0
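Both solutions rest on the same idea: number the rows inside each group, then reindex against every (group, position) pair so that missing positions are filled and surplus positions are dropped. A self-contained sketch of the reindex variant follows; fix_group_size, its parameters and the helper column _pos are hypothetical names of mine, and the data is copied from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5],
    "B": [1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 1, 2],
    "C": [3, 9, 3, 2, 1, 9, 6, 4, 0, 2, 2, 8],
    "D": [1, 8, 9, 9, 4, 5, 4, 1, 4, 6, 4, 3],
})

def fix_group_size(frame, key, size, fill_value=0):
    """Pad or truncate every group of `key` to exactly `size` rows."""
    # Position of each row inside its group: 0, 1, 2, ...
    numbered = frame.assign(_pos=frame.groupby(key).cumcount())
    # Every (group, position) pair that should exist in the result.
    full = pd.MultiIndex.from_product(
        [frame[key].unique(), range(size)], names=[key, "_pos"]
    )
    return (
        numbered.set_index([key, "_pos"])
        .reindex(full, fill_value=fill_value)  # pads missing rows, drops extra ones
        .reset_index(level="_pos", drop=True)
        .reset_index()
    )

out = fix_group_size(df, key="A", size=3)
print(out)
```

Group 2 loses its fourth row, while groups 3, 4 and 5 are padded with zero rows.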
EDIT:
df['g'] = df.groupby(['A','B']).cumcount()
mux = pd.MultiIndex.from_product([df['A'].unique(),
                                  df['B'].unique(),
                                  range(3)], names=('A','B','g'))
df = (df.set_index(['A','B','g'])
        .reindex(mux, fill_value=0)
        .reset_index(level=2, drop=True)
        .reset_index())
print (df.head(10))
A B C D
0 1 1 3 1
1 1 1 0 0
2 1 1 0 0
3 1 2 9 8
4 1 2 0 0
5 1 2 0 0
6 1 3 3 9
7 1 3 0 0
8 1 3 0 0
9 1 4 0 0
from itertools import product
df['g'] = df.groupby(['A','B']).cumcount()
df1 = pd.DataFrame(list(product(df['A'].unique(),
                                df['B'].unique(),
                                range(3))), columns=['A','B','g'])
df = df1.merge(df, how='left').fillna(0).astype(int).drop('g', axis=1)

Creating a new column in panda dataframe using logical indexing and group by

I have a data frame like below
df=pd.DataFrame({'a':['a','a','b','a','b','a','a','a'], 'b' : [1,0,0,1,0,1,1,1], 'c' : [1,2,3,4,5,6,7,8],'d':['1','2','1','2','1','2','1','2']})
df
Out[94]:
a b c d
0 a 1 1 1
1 a 0 2 2
2 b 0 3 1
3 a 1 4 2
4 b 0 5 1
5 a 1 6 2
6 a 1 7 1
7 a 1 8 2
I want something like below
df[(df['a']=='a') & (df['b']==1)]
In [97]:
df[(df['a']=='a') & (df['b']==1)].groupby('d')['c'].rank()
Out[97]:
0 1
3 1
5 2
6 2
7 3
dtype: float64
I want this rank as a new column in dataframe df, with NaN wherever there is no rank. So the final output will be something like below:
a b c d rank
0 a 1 1 1 1
1 a 0 2 2 NaN
2 b 0 3 1 NaN
3 a 1 4 2 1
4 b 0 5 1 NaN
5 a 1 6 2 2
6 a 1 7 1 2
7 a 1 8 2 3
I will appreciate all the help and guidance. Thanks a lot.
Almost there. You just need to call transform, which returns a Series whose index is aligned with your original df:
In [459]:
df['rank'] = df[(df['a']=='a') & (df['b']==1)].groupby('d')['c'].transform(pd.Series.rank)
df
Out[459]:
a b c d rank
0 a 1 1 1 1
1 a 0 2 2 NaN
2 b 0 3 1 NaN
3 a 1 4 2 1
4 b 0 5 1 NaN
5 a 1 6 2 2
6 a 1 7 1 2
7 a 1 8 2 3
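As a self-contained sketch of the same idea: the rank is computed on the filtered rows only, and assigning the result back to df aligns on the index, so rows outside the filter are left as NaN. The variable name mask is mine. In recent pandas versions, as far as I can tell, calling .rank() on the grouped column gives a Series aligned the same way, which is what the Out[97] listing in the question suggests.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "a", "b", "a", "b", "a", "a", "a"],
    "b": [1, 0, 0, 1, 0, 1, 1, 1],
    "c": [1, 2, 3, 4, 5, 6, 7, 8],
    "d": ["1", "2", "1", "2", "1", "2", "1", "2"],
})

# Rows that should receive a rank.
mask = (df["a"] == "a") & (df["b"] == 1)

# Rank c within each d group, using only the filtered rows.
# Assignment aligns on the index, so unfiltered rows become NaN.
df["rank"] = df[mask].groupby("d")["c"].rank()
print(df)
```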
