pandas drop_duplicates condition on two other columns values - python

I have a dataframe with columns A, B and C.
Column A contains duplicate values. Column B contains either an email value or NaN. Column C contains either the value 'wait' or a number.
Among the rows that are duplicated in A, I would like to keep those that have a non-NaN value in B and a non-'wait' value in C (i.e. numbers).
How could I do that on a dataframe df?
I have tried df.drop_duplicates('A'), but I don't see any way to put conditions on the other columns.
Edit:
Sample data:
df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'B': ['a#b.com', np.nan, np.nan, 'c#d.com', 'np.nan', np.nan], 'C': [123, 456, 567, 'wait', 'wait', 'wait']})
>>> df
   A        B     C
0  1  a#b.com   123
1  1      NaN   456
2  2      NaN   567
3  2  c#d.com  wait
4  3   np.nan  wait
5  3      NaN  wait
I would like the resulting dataframe to be:
>>> df
   A        B     C
0  1  a#b.com   123
1  2  c#d.com   567
2  3   np.nan  wait
Thank you!

Solution: sort by columns A and C with a key that tests whether the value equals 'wait' (so 'wait' rows sort last), then take the first non-missing value, if one exists, per group of column A:
df = df.sort_values(['A', 'C'], key = lambda x: x.eq('wait')).groupby('A').first()
print(df)
         B     C
A
1  a#b.com   123
2  c#d.com   567
3   np.nan  wait
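The key argument of sort_values was added in pandas 1.1.0. On older versions, a minimal sketch that emulates it with a temporary helper column (the _wait name is just illustrative):
# pandas < 1.1 has no key= for sort_values; emulate it with a helper
# column that is True for 'wait' rows so they sort last within each A
df = (df.assign(_wait=df['C'].eq('wait'))
        .sort_values(['A', '_wait'])
        .drop(columns='_wait')
        .groupby('A')
        .first())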

Related

How to add value of dataframe to another dataframe?

I want to add one row of a dataframe to every row of another dataframe.
df1 = pd.DataFrame({"a": [1, 2],
                    "b": [3, 4]})
df2 = pd.DataFrame({"a": [4], "b": [5]})
I want to add df2's values to every row of df1.
I used df1 + df2 and got the following result:
     a    b
0  5.0  8.0
1  NaN  NaN
But I want to get the following result:
   a  b
0  5  7
1  7  9
Any help would be dearly appreciated!
If you really need to add one value of df2 per row of df1 (which only works when the number of columns in df2 equals the number of rows in df1), use:
df = df1.add(df2.loc[0].to_numpy(), axis=0)
print(df)
   a  b
0  5  7
1  7  9
If instead you need to add df2's row across the columns of df1 (the first value of df2 is added to the first column of df1, and so on), the output is different:
df = df1.add(df2.loc[0], axis=1)
print(df)
   a  b
0  5  8
1  6  9
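If you simply want to broadcast df2's single row across every row of df1, a sketch using NumPy broadcasting instead of label alignment should give the same result as the second case:
# df2.to_numpy() has shape (1, 2), so it broadcasts down df1's rows;
# this is positional, so the column order of df1 and df2 must match
df = df1 + df2.to_numpy()
print(df)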

How to reduce conditionality of a categorical feature using a lookup table

I have a dataframe (df1) with one categorical column:
df1 = pd.DataFrame({'COL1': ['AA','AB','BC','AC','BA','BB','BB','CA','CB','CD','CE']})
I have another dataframe (df2) which has two columns:
df2 = pd.DataFrame({'Category': ['AA','AB','AC','BA','BB','BC','CA','CB','CC','CD','CE','CF'], 'general_mapping': ['A','A','A','B','B','B','C','C','C','C','C','C']})
I need to modify df1 using df2 so that it finally looks like:
df1 ->> ({'COL1': ['A','A','B','A','B','B','B','C','C','C','C']})
You can use pd.Series.map after setting Category as the index using df.set_index:
df1['COL1'] = df1['COL1'].map(df2.set_index('Category')['general_mapping'])
df1
   COL1
0     A
1     A
2     B
3     A
4     B
5     B
6     B
7     C
8     C
9     C
10    C
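An equivalent sketch that first builds a plain dict from df2, in case you prefer not to set an index:
# build {'AA': 'A', 'AB': 'A', ...} from the two lookup columns
mapping = dict(zip(df2['Category'], df2['general_mapping']))
df1['COL1'] = df1['COL1'].map(mapping)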

Pandas - from duplicate rows, keeping rows without null values

I have the following dataframe:
df = pd.DataFrame({'id': [1, 1, 2, 3, 3, 4], 'test': ['a', np.nan, 'b', 'w', 'd', np.nan]})
As you can see, the "id" column has some duplicate values, with different values in the "test" column. From duplicate rows, I want to keep only those without null values; if none of a duplicated id's rows has a null value, I want to keep all of them.
The output should be like this:
   id test
0   1    a
1   2    b
2   3    w
3   3    d
4   4  NaN
I tried this, but it does not work because it also removes the duplicate rows where id = 3:
df = df.groupby('id', as_index=False, sort=False)['test'].first()
Any suggestions?
For your sample data:
dup_id = df['id'].duplicated(keep=False)
df[~(dup_id & df.test.isna())]
gives what you want:
   id test
0   1    a
2   2    b
3   3    w
4   3    d
5   4  NaN
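The same mask can also be stated positively, which is just De Morgan's rewrite of the expression above: keep a row if its test value is non-null, or if its id is not duplicated at all.
# keep rows that have a value, plus rows whose id occurs only once
keep = df['test'].notna() | ~df['id'].duplicated(keep=False)
print(df[keep])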

code multiple columns based on lists and dictionaries in Python

I have the following dataframe in pandas:
OfferPreference_A  OfferPreference_B  OfferPreference_C
A                  B                  A
B                  C                  C
C                  S                  G
I have the following dictionary of all unique values across the columns:
dict1 = {'A': 1, 'B': 2, 'C': 3, 'S': 4, 'G': 5, 'D': 6}
I also have a list of the column names:
columnlist = ['OfferPreference_A', 'OfferPreference_B', 'OfferPreference_C']
I am trying to get the following table as the output:
OfferPreference_A  OfferPreference_B  OfferPreference_C
1                  2                  1
2                  3                  3
3                  4                  5
How do I do this?
Use:
# if a value does not match, the result is NaN
df = df[columnlist].applymap(dict1.get)
Or:
# if a value does not match, the original value is kept
df = df[columnlist].replace(dict1)
Or:
# if a value does not match, the result is NaN
df = df[columnlist].stack().map(dict1).unstack()
print(df)
   OfferPreference_A  OfferPreference_B  OfferPreference_C
0                  1                  2                  1
1                  2                  3                  3
2                  3                  4                  5
You can use map for this as shown below, assuming the values always match:
for col in columnlist:
    df[col] = df[col].map(dict1)
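Note that DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so on a recent version the first option above would presumably become:
# pandas >= 2.1: DataFrame.map replaces the deprecated applymap
df = df[columnlist].map(dict1.get)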

Pandas - return a dataframe after groupby

I have a pandas df:
Name  No
A      1
A      2
B      2
B      2
B      3
I want to group by column Name, sum column No, and then return a two-column dataframe like this:
Name  No
A      3
B      7
I tried:
df.groupby(['Name'])['No'].sum()
but it does not return my desired dataframe; the result is a Series, so I can't add it to a dataframe as a column.
Really appreciate any help!
Add the parameter as_index=False to groupby:
print(df.groupby(['Name'], as_index=False)['No'].sum())
  Name  No
0    A   3
1    B   7
Or call reset_index:
print(df.groupby(['Name'])['No'].sum().reset_index())
  Name  No
0    A   3
1    B   7
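A sketch using named aggregation (available since pandas 0.25) gives the same flat result and also lets you rename the output column in one step:
# named aggregation: output column 'No' is the sum of input column 'No'
print(df.groupby('Name', as_index=False).agg(No=('No', 'sum')))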
