Pandas find duplicates with reversed values between columns - python

What is the fastest way to find duplicates where the value in Column A has been swapped with the value in Column B?
For example, if I have a DataFrame with:
Column A Column B
0 C P
1 D C
2 L G
3 A D
4 B P
5 B G
6 P B
7 J T
8 P C
9 J T
The expected result is:
Column A Column B
0 C P
8 P C
4 B P
6 P B
I tried:
df1 = df
df2 = df
for i in df2.index:
    res = df1[(df1['Column A'] == df2['Column B'][i]) & (df1['Column B'] == df2['Column A'][i])]
But this is very slow, and it keeps iterating over the same values...

Use merge with the same DataFrame, renaming the columns so that Column A and Column B are swapped:
d = {'Column A':'Column B','Column B':'Column A'}
df = df.merge(df.rename(columns=d))
print(df)
Column A Column B
0 C P
1 B P
2 P B
3 P C
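Note that the merge above returns a fresh 0..n-1 index; if you want to keep the original row labels (as in the expected result), one variation is to carry the index through the merge. A sketch, assuming df still holds the original frame:
d = {'Column A': 'Column B', 'Column B': 'Column A'}
# reset_index() turns the row labels into an ordinary column,
# so the merge keeps them; set_index() restores them afterwards
res = (df.reset_index()
         .merge(df.rename(columns=d))
         .set_index('index'))
print(res)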

You could try using reindex to invert the column order:
column_names = ["Column B", "Column A"]
df = df.reindex(columns=column_names)
Or you could just swap the two column labels:
col_list = list(df)  # get a list of the columns
col_list[0], col_list[1] = col_list[1], col_list[0]
df.columns = col_list  # assign the swapped names back
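Swapping the labels by itself only relabels the frame; to use this idea for the original question, one option is to test each (A, B) pair for membership among the swapped (B, A) pairs. A sketch, assuming df holds only the two original columns:
d = {'Column A': 'Column B', 'Column B': 'Column A'}
# rename swaps the labels; reselecting restores the A-before-B order,
# so each row of `swapped` is the reversed (B, A) pair
swapped = df.rename(columns=d)[['Column A', 'Column B']]
mask = df.apply(tuple, axis=1).isin(swapped.apply(tuple, axis=1))
print(df[mask])  # keeps the original indices 0, 4, 6, 8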

Related

How to explode multiple columns that contain a string?

I have a dataset that includes different types of tags. Each tag column holds a string containing a comma-separated list of tags.
How am I supposed to explode the selected columns at the same time?
   id   Tag1 Tag2
0   A  a,b,c  d,e
1   B    m,n    x
to this:
   id Tag1 Tag2
0   A    a    d
1   A    a    e
2   A    b    d
3   A    b    e
4   A    c    d
5   A    c    e
6   B    m    x
7   B    n    x
First, split the string values of each Tag column into lists, using Series.apply + Series.str.split. I'm using DataFrame.filter to select only the columns whose names start with 'Tag'.
Then, use DataFrame.explode in a loop to explode each Tag column of the df sequentially, turning the values of each list into new rows.
tag_cols = df.filter(like='Tag').columns
df[tag_cols] = df[tag_cols].apply(lambda col: col.str.split(','))
for col in tag_cols:
    df = df.explode(col, ignore_index=True)
print(df)
Output:
id Tag1 Tag2
0 A a d
1 A a e
2 A b d
3 A b e
4 A c d
5 A c e
6 B m x
7 B n x
Note that using just df.apply(lambda col: col.str.split(',').explode()) won't work in this case, because some rows have lists with a different number of elements: the rows can't be aligned correctly after exploding, and apply will complain.
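For reference, a self-contained version of the above; the construction of df is my assumption based on the sample data in the question:
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B'],
                   'Tag1': ['a,b,c', 'm,n'],
                   'Tag2': ['d,e', 'x']})

# split every Tag column into lists, then explode them one at a time
tag_cols = df.filter(like='Tag').columns
df[tag_cols] = df[tag_cols].apply(lambda col: col.str.split(','))
for col in tag_cols:
    df = df.explode(col, ignore_index=True)
print(df)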

Aggregate values pandas

I have a pandas dataframe like this:
Id  A  B  C  D
 1  a  b  c  d
 2  a  b     d
 2  a     c  d
 3  a        d
 3  a  b  c
I want to fill in the empty values in columns B, C and D using the values contained in the other rows with the same Id.
The resulting data frame should be the following:
Id  A  B  C  D
 1  a  b  c  d
 2  a  b  c  d
 3  a  b  c  d
There is also the possibility of having different values in the first column (A) for the same Id. In this case, instead of keeping the first instance, I would rather insert another value that flags this event.
So for example:
Id  A  B  C  D
 1  a  b  c  d
 2  a  b     d
 2  x     c  d
It becomes:
Id  A  B  C  D
 1  a  b  c  d
 2  f  b  c  d
IIUC, you can use groupby + agg:
>>> df.groupby('Id').agg({'A': lambda x: x.iloc[0] if len(x.unique()) == 1 else 'f',
...                       'B': 'first', 'C': 'first', 'D': 'first'})
A B C D
Id
1 a b c d
2 f b c d
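One caveat (my note, not part of the answer): GroupBy's 'first' skips NaN but not empty strings, so this only fills the blanks if they are actual missing values. If the blanks are empty strings, convert them first:
import numpy as np

# assumption: the blanks are '' rather than NaN
df = df.replace('', np.nan)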
The best way I can think of to do this is to iterate through each unique Id, slice it out of the original dataframe, and construct a new row as the product of merging the relevant rows:
def aggregate(df):
    ids = df['Id'].unique()
    rows = []
    for id in ids:
        relevant = df[df['Id'] == id]
        # start with an empty value for every column, then fill it in
        # from each of the rows sharing this Id
        newrow = {c: "" for c in df.columns}
        for _, row in relevant.iterrows():
            for col in newrow:
                if row[col]:
                    # skip values we have already recorded for this column
                    if len(newrow[col]):
                        if newrow[col][-1] == row[col]:
                            continue
                    newrow[col] += row[col]
        rows.append(newrow)
    return pd.DataFrame(rows)
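A usage sketch on the second sample (my construction of the frame; note that this version flags conflicting A values by concatenating them, e.g. 'ax', rather than inserting 'f'):
df = pd.DataFrame({'Id': ['1', '2', '2'],
                   'A':  ['a', 'a', 'x'],
                   'B':  ['b', 'b', ''],
                   'C':  ['c', '', 'c'],
                   'D':  ['d', 'd', 'd']})
print(aggregate(df))
#   Id   A  B  C  D
# 0  1   a  b  c  d
# 1  2  ax  b  c  d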

How to combine string from one column to another column at same index in pandas DataFrame?

I am working on a project in NLP.
My input is:
index name lst
    0    a   c
    0        d
    0        e
    1        f
    1    b   g
I need output like this:
index name lst combine
    0    a   c     a c
    0        d     a d
    0        e     a e
    1        f     b f
    1    b   g     b g
How can I achieve this?
You can use groupby + transform('max') to get each group's letter and broadcast it over the empty cells (any letter compares greater than a space or empty string, so 'max' picks the letter). The rest is a simple string concatenation per column:
df['combine'] = df.groupby('index')['name'].transform('max') + ' ' + df['lst']
Used input:
df = pd.DataFrame({'index': [0,0,0,1,1],
                   'name': ['a','','','','b'],
                   'lst': list('cdefg'),
                   })
NB: I treated "index" as a regular column here; if it is actually the index, use df.index in the groupby instead.
Output:
   index name lst combine
0      0    a   c     a c
1      0        d     a d
2      0        e     a e
3      1        f     b f
4      1    b   g     b g
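An equivalent but more explicit alternative (a sketch under the same assumptions, not part of the answer) is to treat the blanks as missing and take each group's first non-missing name:
import numpy as np

name_filled = (df['name'].replace('', np.nan)
                         .groupby(df['index'])
                         .transform('first'))
df['combine'] = name_filled + ' ' + df['lst']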

perform df.loc to groupby df

I have a df consisting of person, origin (O) and destination (D)
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
the df:
PersonID O D
1 A B
1 B A
2 C B
2 B A
2 A B
3 X Y
I grouped the df with df_grouped = df.groupby(['O','D']) and matched it with another dataframe, taxi:
TaxiID O D
T1 B A
T2 A B
T3 C B
Similarly, I grouped the taxi frame by its O and D. Then I merged the two after aggregating and counting the PersonID and TaxiID per O-D pair, to see how many taxis are available for how many people:
      PersonID TaxiID
O  D     count  count
A  B         2      1
B  A         2      1
C  B         1      1
Now, I want to perform df.loc to take only those PersonID that were counted in the merged frame. How can I do this? I've tried to use:
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
but it returns an empty dataframe. What can I do instead?
Edit: I attach the complete code for this case, using dummy data:
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
taxi = pd.DataFrame({'TaxiID':['T1','T2','T3'],'O':['B','A','C'],'D':['A','B','B']})
df_grouped = df.groupby(['O','D'])
taxi_grouped = taxi.groupby(['O','D'])
dfm = df_grouped.agg({'PersonID':['count',list]}).reset_index()
tgm = taxi_grouped.agg({'TaxiID':['count',list]}).reset_index()
merged = pd.merge(dfm, tgm, how='inner')
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
Select the MultiIndex column by its tuple, with Series.explode to get scalars out of the nested lists:
seek = df.loc[df.PersonID.isin(merged[('PersonID', 'list')].explode().unique())]
print(seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
For better performance, it is possible to use a set comprehension to flatten the nested lists:
seek = df.loc[df.PersonID.isin(set(z for x in merged[('PersonID', 'list')] for z in x))]
print(seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
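If the tuple indexing feels awkward, another option (my sketch, not from the answer) is to flatten the MultiIndex columns of merged first:
# ('PersonID', 'list') -> 'PersonID_list', ('O', '') -> 'O', etc.
merged.columns = ['_'.join(c).rstrip('_') for c in merged.columns]
seek = df.loc[df.PersonID.isin(merged['PersonID_list'].explode().unique())]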

Pandas - Remove duplicates across multiple columns

I am trying to efficiently remove duplicates in Pandas, where the duplicates are inverted across two columns. For example, in this data frame:
import pandas as pd
key = pd.DataFrame({'p1':['a','b','a','a','b','d','c'],'p2':['b','a','c','d','c','a','b'],'value':[1,1,2,3,5,3,5]})
df = pd.DataFrame(key,columns=['p1','p2','value'])
print(df)
p1 p2 value
0 a b 1
1 b a 1
2 a c 2
3 a d 3
4 b c 5
5 d a 3
6 c b 5
I would want to remove rows 1, 5 and 6, leaving me with just:
p1 p2 value
0 a b 1
2 a c 2
3 a d 3
4 b c 5
Thanks in advance for ideas on how to do this.
Reorder the p1 and p2 values so they appear in a canonical order:
mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])
df['second'] = df['p2'].where(mask, df['p1'])
yields
In [149]: df
Out[149]:
p1 p2 value first second
0 a b 1 a b
1 b a 1 a b
2 a c 2 a c
3 a d 3 a d
4 b c 5 b c
5 d a 3 a d
6 c b 5 b c
Then you can drop_duplicates:
df = df.drop_duplicates(subset=['value', 'first', 'second'])
Putting it all together:
import pandas as pd
key = pd.DataFrame({'p1':['a','b','a','a','b','d','c'],'p2':['b','a','c','d','c','a','b'],'value':[1,1,2,3,5,3,5]})
df = pd.DataFrame(key,columns=['p1','p2','value'])
mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])
df['second'] = df['p2'].where(mask, df['p1'])
df = df.drop_duplicates(subset=['value', 'first', 'second'])
df = df[['p1', 'p2', 'value']]
yields
In [151]: df
Out[151]:
p1 p2 value
0 a b 1
2 a c 2
3 a d 3
4 b c 5
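As an alternative sketch (not from the answer), the canonical ordering can also be built with numpy.sort, starting again from the original df and avoiding the helper columns:
import numpy as np

# sort each (p1, p2) pair so reversed pairs become identical rows
canon = pd.DataFrame(np.sort(df[['p1', 'p2']].to_numpy(), axis=1),
                     columns=['first', 'second'], index=df.index)
df = df[~pd.concat([canon, df['value']], axis=1).duplicated()]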
