pandas drop rows that share similar value in other column


I have a DataFrame, df = pd.DataFrame({'col1': ["a","b","c","d","e"], 'col2': [1,3,3,2,6]}), that looks like this:
Input:
col1 col2
0 a 1
1 b 3
2 c 3
3 d 2
4 e 6
I would like to remove every row whose "col2" value is shared with another row. The expected output would look something like this:
Output:
col1 col2
0 a 1
3 d 2
4 e 6
What would be the process of doing this?

This short snippet should do the trick:
df.drop_duplicates(subset=['col2'], keep=False)
Explanation
drop_duplicates drops the duplicate rows. subset=['col2'] restricts the duplicate check to col2, as you requested, and keep=False drops every occurrence of a duplicated value instead of keeping the first occurrence of each. The call returns a new DataFrame and leaves df unchanged, so assign the result if you want to keep it.
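To make the effect of keep=False concrete, here is a small sketch that runs drop_duplicates on the question's data with the default keep='first' and with keep=False. The variable names first_kept and none_kept are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ["a", "b", "c", "d", "e"], 'col2': [1, 3, 3, 2, 6]})

# Default behaviour: the first row of each duplicated col2 value survives.
first_kept = df.drop_duplicates(subset=['col2'])

# keep=False: every row whose col2 value occurs more than once is dropped.
none_kept = df.drop_duplicates(subset=['col2'], keep=False)

print(first_kept)
print(none_kept)
```

With the default, row b (the first 3) stays and only row c is dropped; with keep=False both b and c are gone, which is what the question asks for.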

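This will do the trick:
import pandas as pd
from collections import Counter
df = pd.DataFrame({'col1': ["a","b","c","d","e"], 'col2': [1,3,3,2,6]})
# count how many times each col2 value occurs
c = Counter(df['col2'])
# keep only the values that occur exactly once
ls = [k for k,v in c.items() if v==1]
# boolean mask: True where the row's col2 value is unique
_fltr = df['col2'].isin(ls)
df.loc[_fltr,:]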
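The same boolean-mask idea can also be written without Counter. This is a sketch of an alternative, not part of the answer above: Series.duplicated(keep=False) marks every row whose value occurs more than once, and inverting the mask keeps the rest. The names dup_mask and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ["a", "b", "c", "d", "e"], 'col2': [1, 3, 3, 2, 6]})

# True for every row whose col2 value appears more than once.
dup_mask = df['col2'].duplicated(keep=False)

# Invert the mask to keep only rows with a unique col2 value.
result = df[~dup_mask]
print(result)
```

Having the mask as a separate object is handy when the duplicated rows themselves are also of interest, since df[dup_mask] returns them.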

Related

Insert 2 Blank Rows In DF by Group

I basically want the solution from this question to be applied to 2 blank rows.
Insert Blank Row In Python Data frame when value in column changes?
I've messed around with the solution but don't understand the code enough to alter it correctly.

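You can do:
num_empty_rows = 2
# block of empty-string rows with the same columns as df
blank = pd.DataFrame(data=[['']*len(df.columns)]*num_empty_rows, columns=df.columns)
# follow each group with the blank block, then stack all the pieces
df = (pd.concat([part for _, g in df.groupby('Col1') for part in (g, blank)])
      .reset_index(drop=True).iloc[:-num_empty_rows])
As you can see, each group is followed by a DataFrame of num_empty_rows empty rows, and reset_index then renumbers the result. pd.concat is used because DataFrame.append is no longer available as of pandas 2.0. The last iloc[:-num_empty_rows] is optional; it removes the empty rows after the final group.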
Example input:
df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'C'],
'Col2':['s','s','b','b','l'],
'Col3':['b','j','d','a','k'],
'Col4':['d','k','q','d','p']
})
Output:
Col1 Col2 Col3 Col4
0 A s b d
1 A s j k
2 A b d q
3
4
5 B b a d
6
7
8 C l k p
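Grouping on Col1 sorts the groups and merges rows with the same key even when they are not adjacent. If the blank rows should instead appear wherever the value changes, as in the linked question, one possible sketch is to group on consecutive runs. The helper name insert_blank_rows and the small sample frame are hypothetical, and a non-empty input is assumed.

```python
import pandas as pd

def insert_blank_rows(df, key, n_blank=2):
    """Hypothetical helper: add n_blank empty-string rows after each run of equal key values."""
    blank = pd.DataFrame([[''] * len(df.columns)] * n_blank, columns=df.columns)
    # A new run starts whenever the key differs from the previous row.
    run_id = (df[key] != df[key].shift()).cumsum()
    parts = []
    for _, run in df.groupby(run_id, sort=False):
        parts.extend([run, blank])
    # Drop the trailing blank block, then renumber the rows.
    return pd.concat(parts[:-1], ignore_index=True)

# Illustrative data: 'A' appears in two separate runs.
df = pd.DataFrame({'Col1': ['A', 'A', 'B', 'A'], 'Col2': ['s', 'b', 'l', 'k']})
out = insert_blank_rows(df, 'Col1')
print(out)
```

Here the second 'A' run stays where it was in the input rather than being merged into the first group.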

Add column to pandas dataframe from a reversed dictionary

I have a pandas DataFrame and a dictionary whose values are lists. The list elements are unique across all the keys. I want to add a new column to my DataFrame that holds, for each row, the dictionary key whose list contains that row's value. E.g. suppose I have a DataFrame like this
import pandas as pd
df = {'a':1, 'b':2, 'c':2, 'd':4, 'e':7}
df = pd.DataFrame.from_dict(df, orient='index', columns = ['col2'])
df = df.reset_index().rename(columns={'index':'col1'})
df
col1 col2
0 a 1
1 b 2
2 c 2
3 d 4
4 e 7
Now I also have dictionary like this
my_dict = {'x':['a', 'c'], 'y':['b'], 'z':['d', 'e']}
I want the output like this
col1 col2 col3
0 a 1 x
1 b 2 y
2 c 2 x
3 d 4 z
4 e 7 z
Presently I am doing this by reversing the dictionary first, i.e. like this
my_dict_rev = {value:key for key in my_dict for value in my_dict[key]}
df['col3']= df['col1'].map(my_dict_rev)
df
But I am sure that there must be some direct method.

I know this is an old question but here are two other ways to do the same job. First convert my_dict to a Series object, then explode it. Then reverse the mapping and use map:
tmp = pd.Series(my_dict).explode()
df['col3'] = df['col1'].map(pd.Series(tmp.index, tmp))
Another option starts the same way as above but uses merge instead of map:
df = df.merge(pd.Series(my_dict, name='col1').explode().rename_axis('col3').reset_index())
Output:
col1 col2 col3
0 a 1 x
1 b 2 y
2 c 2 x
3 d 4 z
4 e 7 z
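As a quick check, the sketch below runs the reversed-dictionary approach from the question next to the explode-based variants on the same data. One assumption differs from the answer: the merge uses how='left', so that rows of df with no entry in my_dict would be kept with NaN rather than dropped by the default inner join. The names via_dict, via_explode, via_merge and lookup are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd', 'e'], 'col2': [1, 2, 2, 4, 7]})
my_dict = {'x': ['a', 'c'], 'y': ['b'], 'z': ['d', 'e']}

# 1) Reverse the dictionary by hand, then map.
rev = {value: key for key, values in my_dict.items() for value in values}
via_dict = df['col1'].map(rev)

# 2) Explode the lists into one row per element, then flip index and values.
tmp = pd.Series(my_dict).explode()
via_explode = df['col1'].map(pd.Series(tmp.index, index=tmp.values))

# 3) Build a two-column lookup table and left-merge it on col1.
lookup = pd.Series(my_dict, name='col1').explode().rename_axis('col3').reset_index()
via_merge = df.merge(lookup, on='col1', how='left')

print(via_merge)
```

All three produce the same col3 values here; the map variants leave df's shape untouched, while merge returns a new DataFrame.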

Pandas mapping all, and a portion, of column value in another column

I am trying to search for values and portions of values from one column to another and return a third value.
Essentially, I have two dataframes: df and df2. The first has a part number in 'col1'. The second has the part number, or portion of it, in 'col1' and the value I want to put in df['col2'] in 'col2'.
import pandas as pd
df = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3',
'2-1-1', '2-1-2', '2-1-3']})
df2 = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3', '2-1'],
'col2': ['A', 'B', 'C', 'D']})
Of course this:
df['col1'].isin(df2['col1'])
Only covers everything that matches, not the portions:
df['col1'].isin(df2['col1'])
Out[27]:
0 True
1 True
2 True
3 False
4 False
5 False
Name: col1, dtype: bool
I tried:
df[df['col1'].str.contains(df2['col1'])]
but get:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I also tried using a dictionary made from df2, with the same approaches as above and also with map, with no luck.
The results for df I need would look like this:
col1 col2
'1-1-1' 'A'
'1-1-2' 'B'
'1-1-3' 'C'
'2-1-1' 'D'
'2-1-2' 'D'
'2-1-3' 'D'
I can't figure out how to get the 'D' value into 'col2' because df2['col1'] contains '2-1'--only a portion of the part number.
Any help would be greatly appreciated. Thank you in advance.

We can use str.findall with a pattern that joins all the df2 keys with |, take the first match, and map it to col2:
s=df.col1.str.findall('|'.join(df2.col1.tolist())).str[0].map(df2.set_index('col1').col2)
df['New']=s
df
col1 New
0 1-1-1 A
1 1-1-2 B
2 1-1-3 C
3 2-1-1 D
4 2-1-2 D
5 2-1-3 D

If your df and df2 have the specific format shown in the sample, another way is a dict map with fillna, where the fallback maps the part number with its last segment removed by rsplit:
d = dict(df2[['col1', 'col2']].values)
df['col2'] = df.col1.map(d).fillna(df.col1.str.rsplit('-', n=1).str[0].map(d))
Out[1223]:
col1 col2
0 1-1-1 A
1 1-1-2 B
2 1-1-3 C
3 2-1-1 D
4 2-1-2 D
5 2-1-3 D
Otherwise, besides using findall as in Wen's solution, you may also use extract with the dict d from above:
df.col1.str.extract('('+'|'.join(df2.col1)+')')[0].map(d)
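The patterns above are built by joining the raw keys, which works for the sample data. For part numbers that may contain regex metacharacters, or where one key is a prefix of another, a slightly more defensive sketch is to escape each key, try longer keys first, and anchor the match at the start of the string. The anchoring and the longest-first ordering are assumptions about the intended matching rule, not something the question states.

```python
import re
import pandas as pd

df = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3',
                            '2-1-1', '2-1-2', '2-1-3']})
df2 = pd.DataFrame({'col1': ['1-1-1', '1-1-2', '1-1-3', '2-1'],
                    'col2': ['A', 'B', 'C', 'D']})

# Longest keys first, so a full part number wins over a shorter prefix.
keys = sorted(df2['col1'], key=len, reverse=True)

# Escape each key and anchor at the start of the string.
pattern = '^(' + '|'.join(re.escape(k) for k in keys) + ')'

lookup = df2.set_index('col1')['col2']
df['col2'] = df['col1'].str.extract(pattern, expand=False).map(lookup)
print(df)
```

Rows that match no key end up with NaN in col2, which makes unmatched part numbers easy to spot.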

Count across dataframe columns based on str.contains (or similar)

I would like to count the number of cells within each row that contain a particular character string. Cells that contain the string more than once should be counted only once.
I can count the number of cells across a row which equal a given value, but when I expand this logic to use str.contains, I have issues, as shown below
d = {'col1': ["a#", "b","c#"], 'col2': ["a", "b","c#"]}
df = pd.DataFrame(d)
#can correctly count across rows using equality
thisworks =( df =="a#" ).sum(axis=1)
#can count across a column using str.contains
thisworks1=df['col1'].str.contains('#').sum()
#but cannot use str.contains with a dataframe so what is the alternative
thisdoesnt =( df.str.contains('#') ).sum(axis=1)
Output should be a series showing the number of cells in each row that contain the given character string.

str.contains is a Series method. To apply it to a whole DataFrame, you need either agg or apply, such as:
df.agg(lambda x: x.str.contains('#')).sum(1)
Out[2358]:
0 1
1 0
2 2
dtype: int64
If you like neither agg nor apply, you may use np.char.find (with numpy imported as np) to work directly on the underlying values of df:
(np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1)
Out[2360]: array([1, 0, 2])
To turn that into a Series, or a column of df, pass it to pd.Series with the index of df:
pd.Series((np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1), index=df.index)
Out[2361]:
0 1
1 0
2 2
dtype: int32

A solution using df.apply:
df = pd.DataFrame({'col1': ["a#", "b","c#"],
'col2': ["a", "b","c#"]})
df
col1 col2
0 a# a
1 b b
2 c# c#
df['sum'] = df.apply(lambda x: x.str.contains('#'), axis=1).sum(axis=1)
col1 col2 sum
0 a# a 1
1 b b 0
2 c# c# 2

Something like this should work:
df = pd.DataFrame({'col1': ['#', '0'], 'col2': ['#', '#']})
df['totals'] = df['col1'].str.contains('#', regex=False).astype(int) +\
df['col2'].str.contains('#', regex=False).astype(int)
df
# col1 col2 totals
# 0 # # 2
# 1 0 # 1
It should generalize to as many columns as you want.
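One way the generalization could look is to sum the per-column indicators over a list of column names. This is a sketch, not the answerer's code: col3 and the list cols are made up for illustration.

```python
import pandas as pd

# col3 is a made-up extra column to show more than two columns.
df = pd.DataFrame({'col1': ['a#', 'b', 'c#'],
                   'col2': ['a', 'b', 'c#'],
                   'col3': ['##', '#', 'x']})

cols = ['col1', 'col2', 'col3']

# Each column contributes 1 per cell containing '#', however many times it occurs.
totals = sum(df[c].str.contains('#', regex=False).astype(int) for c in cols)
print(totals)
```

Note that the cell '##' still counts once, which matches the requirement that repeated occurrences within a cell are counted only once.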

Mapping a Column from One Dataframe to Another

I would like to map the values in df2['col2'] to df['col1']:
df col1 col2
0 w a
1 1 2
2 2 3
I would like to use a column from the dataframe as a dictionary to get:
col1 col2
0 w a
1 A 2
2 B 3
However the data dictionary is just a column in df2, which looks like
df2 col1 col2
1 1 A
2 2 B
I have tried using this:
di = {df2['col1']: df2['col2']}
final = df1.replace({"df2['col2']": di})
But I get an error: TypeError: 'Series' objects are mutable, thus they cannot be hashed
I have about 200,000 rows. Any help would be appreciated.
Edit:
The sample dictionary would look like di = {1: "A", 2: "B"}, but is in df2['col1']: df2['col2']. I have 200k+ rows, can I convert df2['col1']: df2['col2'] to a tuple, etc?

You can build a lookup dictionary based on the col1:col2 of df2 and then use that to replace the values in df1.col1.
import pandas as pd
df1 = pd.DataFrame({'col1':['w',1,2],'col2':['a',2,3]})
df2 = pd.DataFrame({'col1':[1,2],'col2':['A','B']})
print(df1)
# col1 col2
#0 w a
#1 1 2
#2 2 3
print(df2)
# col1 col2
#0 1 A
#1 2 B
dataLookUpDict = dict(zip(df2['col1'], df2['col2']))
final = df1.replace({'col1': dataLookUpDict})
print(final)
# col1 col2
#0 w a
#1 A 2
#2 B 3
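As an alternative sketch to replace, the same lookup dictionary can be applied with Series.map, falling back to the original value when there is no entry. Whether this is faster than replace on 200,000 rows is worth timing on your own data. The name lookup is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['w', 1, 2], 'col2': ['a', 2, 3]})
df2 = pd.DataFrame({'col1': [1, 2], 'col2': ['A', 'B']})

# Build the lookup {1: 'A', 2: 'B'} from the two columns of df2.
lookup = dict(zip(df2['col1'], df2['col2']))

# Map each value through the dict; values without an entry stay as they are.
final = df1.copy()
final['col1'] = df1['col1'].map(lambda v: lookup.get(v, v))
print(final)
```

Working on a copy leaves df1 untouched, so the original values remain available for comparison.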
