Pandas: remove duplicate rows when matches can be in switched columns [duplicate] - python

What is the proper way to go from this df:
>>> df=pd.DataFrame({'a':['jeff','bob','jill'], 'b':['bob','jeff','mike']})
>>> df
a b
0 jeff bob
1 bob jeff
2 jill mike
To this:
>>> df2
a b
0 jeff bob
2 jill mike
where you're dropping a duplicate row based on the items in 'a' and 'b', without regard to their specific column.
I can hack together a solution using a lambda expression to create a key column and then drop duplicates based on that column, but I'm thinking there has to be a simpler way than this:
>>> df['c'] = df[['a', 'b']].apply(
...     lambda x: ''.join(sorted((x[0], x[1]), key=lambda x: x[0]) +
...                       sorted((x[0], x[1]), key=lambda x: x[1])), axis=1)
>>> df.drop_duplicates(subset='c', keep='first', inplace=True)
>>> df = df.iloc[:,:-1]

I think you can sort each row independently and then use duplicated to see which ones to drop.
dupes = df.apply(lambda x: x.sort_values().values, axis=1).duplicated()
df[~dupes]
A faster way to get dupes, thanks to @DSM:
dupes = df.T.apply(sorted).T.duplicated()

I think the simplest approach is to use apply with axis=1 to sort each row, and then call DataFrame.duplicated:
df = df[~df.apply(sorted, 1).duplicated()]
print (df)
a b
0 jeff bob
2 jill mike
A bit more complicated, but very fast, is to use numpy.sort with the DataFrame constructor:
df1 = pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns)
df = df[~df1.duplicated()]
print (df)
a b
0 jeff bob
2 jill mike
Timings:
np.random.seed(123)
N = 10000
df = pd.DataFrame({'A': np.random.randint(100, size=N).astype(str),
                   'B': np.random.randint(100, size=N).astype(str)})
#print (df)
In [63]: %timeit (df[~pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns).duplicated()])
100 loops, best of 3: 3.25 ms per loop
In [64]: %timeit (df[~df.apply(sorted, 1).duplicated()])
1 loop, best of 3: 1.09 s per loop
#Ted Petrou solution1
In [65]: %timeit (df[~df.apply(lambda x: x.sort_values().values, axis=1).duplicated()])
1 loop, best of 3: 2.89 s per loop
#Ted Petrou solution2
In [66]: %timeit (df[~df.T.apply(sorted).T.duplicated()])
1 loop, best of 3: 1.56 s per loop
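As an aside, a hashable, order-insensitive key per row also works. Here is a hedged sketch using frozenset; it is not from the answers above and has not been benchmarked here:
import pandas as pd

df = pd.DataFrame({'a': ['jeff', 'bob', 'jill'], 'b': ['bob', 'jeff', 'mike']})

# Map each row to a frozenset; frozensets are hashable and ignore order,
# so duplicated() can compare them directly.
key = df[['a', 'b']].apply(frozenset, axis=1)
print(df[~key.duplicated()])  # keeps rows 0 (jeff/bob) and 2 (jill/mike); row 1 (bob/jeff) is dropped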

Related

Python pandas how to scan string contains by row?

How do you scan if a pandas dataframe row contains a certain substring?
For example, I have a dataframe with 11 columns, all of which contain names:
ID name1 name2 name3 ... name10
-------------------------------------------------------
AA AA_balls AA_cakee1 AA_lavender ... AA_purple
AD AD_cakee AD_cats AD_webss ... AD_ballss
CS CS_cakee CS_cats CS_webss ... CS_purble
.
.
.
I would like to find the rows that contain, say, "ball" anywhere in the dataframe and get their IDs, so the result would be 'AA' and 'AD', since AA_balls and AD_ballss appear in those rows.
I have searched on Google but there seems to be no specific answer for this; people usually ask about searching for a substring in a specific column, not across all columns of a row:
df[df["col_name"].str.contains("ball")]
The methods I have thought of are as follows (you can skip this if you have little time):
(1) Loop through the columns:
hits = []
for col_name in col_names:
    hits.append(df[df[col_name].str.contains('ball')])
and then drop duplicate rows that have the same ID values, but this method would be very slow.
(2) Reshape the dataframe into two columns by stacking the name1-name10 columns into one column, then use df[df["concat_col"].str.contains("ball")]["ID"] to get the IDs and drop duplicates:
ID concat_col
AA AA_balls
AA AA_cakeee
AA AA_lavender
AA AA_purple
.
.
.
CS CS_purble
(3) Use the dataframe from (2) to make a dictionary, roughly
programs[concat_col_value] = ID
and then get the IDs with
[value for key, value in programs.items() if 'ball' in key]
but this method still loops through the dictionary and is slow.
If there is a faster method that avoids these steps, I would prefer to use it.
If anyone knows one, I would appreciate it a lot if you let me know :)
Thanks!
One idea is to use melt:
df = df.melt('ID')
a = df.loc[df['value'].str.contains('ball'), 'ID']
print (a)
0 AA
10 AD
Name: ID, dtype: object
Another:
df = df.set_index('ID')
a = df.index[df.applymap(lambda x: 'ball' in x).any(axis=1)]
Or:
mask = np.logical_or.reduce([df[x].str.contains('ball', regex=False) for x in df.columns])
a = df.loc[mask, 'ID']
Timings:
np.random.seed(145)
L = list('abcdefgh')
df = pd.DataFrame(np.random.choice(L, size=(4000, 10)))
df.insert(0, 'ID', np.arange(4000).astype(str))
a = np.random.randint(4000, size=15)
b = np.random.randint(1, 10, size=15)
for i, j in zip(a, b):
    df.iloc[i, j] = 'AB_ball_DE'
#print (df)
In [85]: %%timeit
...: df1 = df.melt('ID')
...: a = df1.loc[df1['value'].str.contains('ball'), 'ID']
...:
10 loops, best of 3: 24.3 ms per loop
In [86]: %%timeit
...: df.loc[np.logical_or.reduce([df[x].str.contains('ball', regex=False) for x in df.columns]), 'ID']
...:
100 loops, best of 3: 12.8 ms per loop
In [87]: %%timeit
...: df1 = df.set_index('ID')
...: df1.index[df1.applymap(lambda x: 'ball' in x).any(axis=1)]
...:
100 loops, best of 3: 11.1 ms per loop
Maybe this might work?
mask = df.apply(lambda row: row.map(str).str.contains('word').any(), axis=1)
df.loc[mask]
Disclaimer: I haven't tested this. Perhaps the .map(str) isn't necessary.
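For reference, a minimal self-contained sketch of the melt approach on data shaped like the question's; the column names and sample values here are assumptions for illustration only:
import pandas as pd

df = pd.DataFrame({
    'ID':    ['AA', 'AD', 'CS'],
    'name1': ['AA_balls', 'AD_cakee', 'CS_cakee'],
    'name2': ['AA_cakee1', 'AD_cats', 'CS_cats'],
    'name3': ['AA_lavender', 'AD_ballss', 'CS_webss'],
})

# Reshape to long form so every name column becomes a row, then filter once.
long_df = df.melt('ID')
ids = long_df.loc[long_df['value'].str.contains('ball'), 'ID'].unique()
print(ids)  # ['AA' 'AD']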

Convert whole dataframe from lower case to upper case with Pandas

I have a dataframe like the one displayed below:
# Create an example dataframe about a fictional army
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks'],
            'company': ['1st', '1st', '2nd', '2nd'],
            'deaths': ['kkk', 52, '25', 616],
            'battles': [5, '42', 2, 2],
            'size': ['l', 'll', 'l', 'm']}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'deaths', 'battles', 'size'])
My goal is to transform every single string inside the dataframe to upper case.
Notice: all data types are objects and must not be changed; the output must contain all objects. I want to avoid converting every single column one by one; I would like to do it over the whole dataframe in one go.
What I tried so far, without success:
df.str.upper()
astype(str) casts each Series to the object (string) dtype, and the .str accessor on the converted Series then calls upper() on every value. Note that after this, the dtype of all columns changes to object.
In [17]: df
Out[17]:
regiment company deaths battles size
0 Nighthawks 1st kkk 5 l
1 Nighthawks 1st 52 42 ll
2 Nighthawks 2nd 25 2 l
3 Nighthawks 2nd 616 2 m
In [18]: df.apply(lambda x: x.astype(str).str.upper())
Out[18]:
regiment company deaths battles size
0 NIGHTHAWKS 1ST KKK 5 L
1 NIGHTHAWKS 1ST 52 42 LL
2 NIGHTHAWKS 2ND 25 2 L
3 NIGHTHAWKS 2ND 616 2 M
You can later convert the 'battles' column to numeric again, using to_numeric():
In [42]: df2 = df.apply(lambda x: x.astype(str).str.upper())
In [43]: df2['battles'] = pd.to_numeric(df2['battles'])
In [44]: df2
Out[44]:
regiment company deaths battles size
0 NIGHTHAWKS 1ST KKK 5 L
1 NIGHTHAWKS 1ST 52 42 LL
2 NIGHTHAWKS 2ND 25 2 L
3 NIGHTHAWKS 2ND 616 2 M
In [45]: df2.dtypes
Out[45]:
regiment object
company object
deaths object
battles int64
size object
dtype: object
This can be solved with the following applymap call, which touches only the string values (swap in lower() if you want lower case instead):
df = df.applymap(lambda s: s.upper() if isinstance(s, str) else s)
Instead of applying a function to every cell of every row, get the column names and loop over them, converting each column's text with the vectorized .str accessor; this is faster than apply.
for col in dataset.columns:
    dataset[col] = dataset[col].str.upper()
Since .str only works on a Series, you can apply it to each column individually and then concatenate:
In [6]: pd.concat([df[col].astype(str).str.upper() for col in df.columns], axis=1)
Out[6]:
regiment company deaths battles size
0 NIGHTHAWKS 1ST KKK 5 L
1 NIGHTHAWKS 1ST 52 42 LL
2 NIGHTHAWKS 2ND 25 2 L
3 NIGHTHAWKS 2ND 616 2 M
Edit: performance comparison
In [10]: %timeit df.apply(lambda x: x.astype(str).str.upper())
100 loops, best of 3: 3.32 ms per loop
In [11]: %timeit pd.concat([df[col].astype(str).str.upper() for col in df.columns], axis=1)
100 loops, best of 3: 3.32 ms per loop
Both answers perform equally on a small dataframe.
In [15]: df = pd.concat(10000 * [df])
In [16]: %timeit pd.concat([df[col].astype(str).str.upper() for col in df.columns], axis=1)
10 loops, best of 3: 104 ms per loop
In [17]: %timeit df.apply(lambda x: x.astype(str).str.upper())
10 loops, best of 3: 130 ms per loop
On a large dataframe my answer is slightly faster.
Try this:
df2 = df2.apply(lambda x: x.str.upper() if x.dtype == "object" else x)
Checking the column dtype preserves the non-string columns. Note that isinstance(x, object) is not a useful test here, since every Python object passes it; check x.dtype == "object" instead, for example with strip() added:
df.apply(lambda x: x.str.upper().str.strip() if x.dtype == "object" else x)
If you want to conserve dtypes, or change only one type, try a for loop with an if:
for x in dados.columns:
    if dados[x].dtype == 'object':
        print('object - apply upper')
        dados[x] = dados[x].str.upper()
    else:
        print('other dtype - leave as is')
This changes only the column labels (here to lower case), not the values:
oh_df.columns = map(str.lower, oh_df.columns)
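Pulling these together, a minimal sketch (the helper name is my own, not from the answers above) that upper-cases string values everywhere while leaving non-string values, and therefore numeric data, untouched; it also avoids applymap, which newer pandas versions deprecate in favour of DataFrame.map:
import pandas as pd

def upper_strings(df: pd.DataFrame) -> pd.DataFrame:
    # Only touch actual str values; ints, floats, NaN, etc. pass through unchanged.
    return df.apply(lambda col: col.map(lambda v: v.upper() if isinstance(v, str) else v))

raw_data = {'regiment': ['Nighthawks', 'Nighthawks'],
            'deaths': ['kkk', 52],
            'battles': [5, '42']}
df = upper_strings(pd.DataFrame(raw_data))
print(df)  # strings are upper-cased ('kkk' -> 'KKK'); the ints 52 and 5 are left as-is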

How can I create many columns after a list in pandas?

I have a dataframe, and I want to create many new columns, named after the entries of a list and filled with 0. How can I do it?
For example:
df = pd.DataFrame({"a":["computer", "printer"]})
print(df)
a
0 computer
1 printer
I have a list
myList=["b","c","d"]
I want my new dataframe to look like:
a b c d
0 computer 0 0 0
1 printer 0 0 0
How can I do it?
Use the fastest solution:
for col in myList:
    df[col] = 0
print(df)
a b c d
0 computer 0 0 0
1 printer 0 0 0
Another solution is to use concat with the DataFrame constructor:
pd.concat([df, pd.DataFrame(columns=myList, index=df.index, data=0)], axis=1)
Timings:
[20000 rows x 300 columns]:
In [286]: %timeit pd.concat([df,pd.DataFrame(columns=myList)], axis=1).fillna(0)
1 loop, best of 3: 1.17 s per loop
In [287]: %timeit pd.concat([df3,pd.DataFrame(columns=myList, index=df.index,data=0)],axis=1)
10 loops, best of 3: 81.7 ms per loop
In [288]: %timeit (orig(df4))
10 loops, best of 3: 59.2 ms per loop
Code for timings:
myList=["b","c","d"] * 100
df = pd.DataFrame({"a":["computer", "printer"]})
print(df)
df = pd.concat([df]*10000).reset_index(drop=True)
df3 = df.copy()
df4 = df.copy()
df1= pd.concat([df,pd.DataFrame(columns=myList)], axis=1).fillna(0)
df2 = pd.concat([df3,pd.DataFrame(columns=myList, index=df.index, data=0)], axis=1)
print(df1)
print(df2)
def orig(df):
    for col in range(300):
        df[col] = 0
    return df
print (orig(df4))
For large dfs it'll be more performant to concat an empty df rather than adding new columns one at a time, since the latter grows the df incrementally instead of making a single allocation at the final df dimensions:
In [116]:
myList=["b","c","d"]
df = pd.concat([df,pd.DataFrame(columns=myList)], axis=1).fillna(0)
df
Out[116]:
a b c d
0 computer 0 0 0
1 printer 0 0 0
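Another option, not benchmarked in the answers above, is reindex with fill_value, which also adds all the new columns in a single allocation (a small sketch):
import pandas as pd

df = pd.DataFrame({"a": ["computer", "printer"]})
myList = ["b", "c", "d"]

# Keep the existing columns and append every name in myList, filled with 0.
df = df.reindex(columns=df.columns.tolist() + myList, fill_value=0)
print(df)  # columns b, c, d added, all filled with 0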

How to remove a pandas dataframe from another dataframe

How to remove a pandas dataframe from another dataframe, just like the set subtraction:
a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]
And now we have two pandas dataframe, how to remove df2 from df1:
In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
a b
0 1 2
1 3 4
2 5 6
In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
a b
0 1 2
1 5 6
Then we expect the df1 - df2 result to be:
In [14]: df
Out[14]:
a b
0 3 4
How to do it?
Thank you.
Solution
Use pd.concat followed by drop_duplicates(keep=False)
pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
It looks like
a b
1 3 4
Explanation
pd.concat adds the two DataFrames together by appending one right after the other. If there is any overlap, it will be caught by the drop_duplicates method. However, drop_duplicates by default keeps the first observation and removes every other occurrence; in this case we want every duplicate removed, hence the keep=False parameter, which does exactly that.
A special note on the repeated df2: with only one df2, any row of df2 that is not in df1 won't be considered a duplicate and will remain, so the single-df2 version only works when df2 is a subset of df1. If we concat df2 twice, every row of df2 is guaranteed to be a duplicate and will subsequently be removed, as the sketch below illustrates.
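A small illustration of that point; the extra row (7, 8) is my own example, not from the question:
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
# df2 contains a row (7, 8) that does not appear in df1.
df2 = pd.DataFrame([[1, 2], [7, 8]], columns=['a', 'b'])

# With df2 included once, (7, 8) occurs only once, so it survives:
print(pd.concat([df1, df2]).drop_duplicates(keep=False))       # rows (3, 4), (5, 6), (7, 8)

# With df2 included twice, (7, 8) is now a duplicate and is dropped as well:
print(pd.concat([df1, df2, df2]).drop_duplicates(keep=False))  # rows (3, 4), (5, 6)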
You can use .duplicated, which has the benefit of being fairly expressive:
%%timeit
combined = df1.append(df2)
combined[~combined.index.duplicated(keep=False)]
1000 loops, best of 3: 875 µs per loop
For comparison:
%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
100 loops, best of 3: 4.57 ms per loop
%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
1000 loops, best of 3: 987 µs per loop
%timeit df1[df1.apply(lambda x: x.values not in df2.values, axis=1)]
1000 loops, best of 3: 546 µs per loop
In sum, using the np.array comparison is fastest. Don't need the .tolist() there.
To get a dataframe with all records which are in DF1 but not in DF2:
DF = DF1[~DF1.isin(DF2)].dropna(how='all')
Note that DataFrame.isin(DataFrame) compares values position-wise (index and column labels must match), so this is not a true value-based difference; it only works when the rows to remove sit at the same index labels in both frames.
A set-logic approach: turn the rows of df1 and df2 into sets of tuples (via a MultiIndex), then use set subtraction to define a new DataFrame:
idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)
pd.DataFrame(list(idx1 - idx2), columns=df1.columns)
a b
0 3 4
My shot with merge on df1 and df2 from the question, using the indicator parameter:
In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]:
a b
1 3 4
This solution works when your df_to_drop is a subset of the main dataframe data and the rows share the same index labels:
data_clean = data.drop(df_to_drop.index)
A masking approach
df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]
a b
1 3 4
I think both .tolist() calls can be dropped and the comparison done on the numpy arrays directly (note that numpy's in checks element-wise matches, so it can be loose):
df1[df1.apply(lambda x: x.values not in df2.values, axis=1)]
An easy option is to use a boolean mask built with isin.
Concatenate df1 and df2 and reset the index:
df = pd.concat([df1, df2])
df.reset_index(drop=True, inplace=True)
Then flag the rows whose values appear in df2 and keep the rest:
mask_df2 = df["a"].isin(df2["a"]) & df["b"].isin(df2["b"])
result_data = df[~mask_df2]
(Note that isin works per column, so a row is flagged if its 'a' value appears anywhere in df2['a'] and its 'b' value anywhere in df2['b'], even if not in the same df2 row.)
Hope this helps new readers, although the question was posted a while ago :)
Solution if df1 contains duplicates + keeps the index.
A modified version of piRSquared's answer to keep the duplicates in df1 that do not appear in df2, while maintaining the index.
df1[df1.apply(lambda x: (x == pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)).all(1).any(), axis=1)]
If your dataframes are big, you may want to store the result of
pd.concat([df1.drop_duplicates(), df2, df2]).drop_duplicates(keep=False)
in a variable before the df1.apply call.

quickly drop dataframe columns with only one distinct value

Is there a faster way to drop columns that only contain one distinct value than the code below?
cols = df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist())) < 2:
        df = df.drop(col, axis=1)
This is really quite slow for large dataframes. Logically, this counts all the values in each column when in fact it could just stop counting after reaching 2 different values; a sketch of that idea follows.
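As an aside, here is a minimal sketch of that early-stopping idea (not from any answer in this thread; the helper names are my own):
import pandas as pd

def has_second_value(series: pd.Series) -> bool:
    """Return True as soon as a second distinct value is seen."""
    it = iter(series)
    try:
        first = next(it)
    except StopIteration:
        return False  # empty column: treat it as constant
    # Stops at the first value that differs from the first element.
    # Caveat: NaN != NaN, so an all-NaN column counts as non-constant here.
    return any(v != first for v in it)

def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    return df[[c for c in df.columns if has_second_value(df[c])]]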
You can use the Series.unique() method to find all the unique elements in a column, and drop any column whose .unique() returns only one element. Example -
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col, inplace=True, axis=1)
A method that does not do inplace dropping -
res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col, axis=1)
Demo -
In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])
In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col, inplace=True, axis=1)
   .....:
In [156]: df
Out[156]:
1
0 2
1 3
2 2
Timing results -
In [166]: %paste
def func1(df):
    res = df
    for col in df.columns:
        if len(df[col].unique()) == 1:
            res = res.drop(col, axis=1)
    return res
## -- End pasted text --
In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop
In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop
In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop
The fastest method still seems to be the method using unique and looping through the columns.
One step:
df = df[[c for c in list(df) if len(df[c].unique()) > 1]]
Two steps:
Create a list of column names that have more than 1 distinct value.
keep = [c for c in list(df) if len(df[c].unique()) > 1]
Drop the columns that are not in 'keep'
df = df[keep]
Note: this step can also be done using a list of columns to drop:
drop_cols = [c for c in list(df) if df[c].nunique() <= 1]
df = df.drop(columns=drop_cols)
df.loc[:,df.apply(pd.Series.nunique) != 1]
For example
In:
df = pd.DataFrame({'A': [10, 20, np.nan, 30], 'B': [10, np.nan, 10, 10]})
df.loc[:,df.apply(pd.Series.nunique) != 1]
Out:
A
0 10
1 20
2 NaN
3 30
Two simple one-liners for either returning a view (shorter version of jz0410's answer)
df.loc[:,df.nunique()!=1]
or dropping inplace (via drop())
df.drop(columns=df.columns[df.nunique()==1], inplace=True)
You can create a mask of your df by calling apply with value_counts; this produces NaN for every row except the one matching each count. You can then call dropna column-wise, passing thresh=2 so that a column must have 2 or more non-NaN counts to be kept:
In [329]:
df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
df
Out[329]:
a b c
0 1 0 0
1 1 1 0
2 1 2 2
3 1 3 2
4 1 4 2
In [342]:
df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
Out[342]:
b c
0 0 0
1 1 0
2 2 2
3 3 2
4 4 2
Output from the boolean conditions:
In [344]:
df.apply(pd.Series.value_counts)
Out[344]:
a b c
0 NaN 1 2
1 5 1 NaN
2 NaN 1 3
3 NaN 1 NaN
4 NaN 1 NaN
In [345]:
df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
Out[345]:
b c
0 1 2
1 1 NaN
2 1 3
3 1 NaN
4 1 NaN
Many examples in this thread and a related thread did not work for my df. These did:
# from: https://stackoverflow.com/questions/33144813/quickly-drop-dataframe-columns-with-only-one-distinct-value
# from: https://stackoverflow.com/questions/20209600/pandas-dataframe-remove-constant-column
import pandas as pd
import numpy as np
data = {'var1': [1, 2, 3, 4, 5, np.nan, 7, 8, 9],
        'var2': ['Order', np.nan, 'Inv', 'Order', 'Order', 'Shp', 'Order', 'Order', 'Inv'],
        'var3': [101, 101, 101, 102, 102, 102, 103, 103, np.nan],
        'var4': [np.nan, 1, 1, 1, 1, 1, 1, 1, 1],
        'var5': [1, 1, 1, 1, 1, 1, 1, 1, 1],
        'var6': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
        'var7': ["a", "a", "a", "a", "a", "a", "a", "a", "a"],
        'var8': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df = pd.DataFrame(data)
df_original = df.copy()
#-------------------------------------------------------------------------------------------------
df2 = df[[c for c in list(df) if len(df[c].unique()) > 1]]
#-------------------------------------------------------------------------------------------------
keep = [c for c in list(df) if len(df[c].unique()) > 1]
df3 = df[keep]
#-------------------------------------------------------------------------------------------------
keep_columns = [col for col in df.columns if len(df[col].unique()) > 1]
df5 = df[keep_columns].copy()
#-------------------------------------------------------------------------------------------------
for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col, inplace=True, axis=1)
I would like to throw in (pandas 1.0.3):
ids = df.nunique().values > 1
df.loc[:, ids]
Not that slow:
2.81 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df = df.loc[:, df.nunique() != 1]  # compare against whichever distinct-value count you want to drop
None of the solutions worked in my use case because my dataframe contains list items, so I got this error:
TypeError: unhashable type: 'list'
The solution that worked for me is this:
ndf = df.describe(include="all").T
new_cols = set(df.columns) - set(ndf[ndf['unique'] == 1].index)
df = df[list(new_cols)]
One line
df=df[[i for i in df if len(set(df[i]))>1]]
One of the solutions with pipe (convenient if used often):
def drop_unique_value_col(df):
    return df.loc[:, df.apply(pd.Series.nunique) != 1]

df.pipe(drop_unique_value_col)
This will drop all the columns with only one distinct value.
for col in Dataframe.columns:
    if len(Dataframe[col].value_counts()) == 1:
        Dataframe.drop([col], axis=1, inplace=True)
The most 'pythonic' way of doing it that I could find:
df = df.loc[:, (df != df.iloc[0]).any()]
