drop_duplicates() stopped working in Python pandas - python

This code had previously worked in python 3 to remove the duplicate values but keep first occurrence across an entire dataframe. After coming back to my script this no longer removes duplicates in a pandas dataFrame.
df = df.apply(lambda x: x.drop_duplicates(), axis=1)
so if I have
a b c
0 1 2
3 4 0
0 8 9
10 0 11
I want to get as an output
a b c
0 1 2
3 4
8 9
10 11
I don't mind if the blanks return as 'nan'
I also tried the following
df.drop_duplicates(subset = None, keep='first')
and
df.drop_duplicates(subset = None, keep='first', inplace =True)
Any advice / alternatives would be welcome!

After your attached the data , I think you can using duplicated
newdf=df[~df.stack().duplicated().unstack()]
newdf
Out[131]:
a b c
0 0.0 1.0 2.0
1 3.0 4.0 NaN
2 NaN 8.0 9.0
3 10.0 NaN 11.0

You need inplace to be True:
df.drop_duplicates(subset=None, keep='first', inplace=True)

Related

Is mask working differently using inplace?

I've just found out about this strange behaviour of mask, could someone explain this to me?
A)
[input]
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3, inplace=True)
[output]
A
B
C
0
NaN
NaN
hi
1
NaN
3.0
hi
2
4.0
5.0
hi
3
6.0
7.0
hi
4
8.0
9.0
hi
B)
[input]
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3)
[output]
A
B
C
0
NaN
NaN
NaN
1
NaN
3.0
NaN
2
4.0
5.0
NaN
3
6.0
7.0
NaN
4
8.0
9.0
NaN
Thank you in advance
The root cause of different result is that you pass a boolean dataframe that is not the same shape as the dataframe you want to mask. df.mask() fill the missing part with the value of inplace.
From the sourcecode, you can see pandas.DataFrame.mask() calls pandas.DataFrame.where() internally. pandas.DataFrame.where() then calls a _where() method that replaces values where the condition is False.
I just take df.where() as an example, here is the example code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])
df1 = df.where(df[['A', 'B']]<3)
df.where(df[['A', 'B']]<3, inplace=True)
In this example, the df is
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
df[['A', 'B']]<3, the value of cond argument, is
A B
0 True True
1 False False
2 False False
3 False False
Digging into _where() method, the following lines are the key part:
def _where(...):
# align the cond to same shape as myself
cond = com.apply_if_callable(cond, self)
if isinstance(cond, NDFrame):
cond, _ = cond.align(self, join="right", broadcast_axis=1)
...
# make sure we are boolean
fill_value = bool(inplace)
cond = cond.fillna(fill_value)
Since the shape of cond and df are different, cond.align() fills the missing with NaN value. After that, cond looks like
A B C
0 True True NaN
1 False False NaN
2 False False NaN
3 False False NaN
Then with cond.fillna(fill_value), the NaN values are replaced with the value of inplace. So C column has the same value with inplace value.
Though there are still some codes (L9048 and L9124-L9145) related with inplace. We needn't care about the detail, since the aim of these lines are to replace values where the condition is False.
Recall that the df is
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
df1=df.where(df[['A', 'B']]<3): The cond C column is False since the default value of inplace is False. After doing df.where(), the df C column is set to the value of other argument which is NaN by default.
df.where(df[['A', 'B']]<3, inplace=True): The cond C column is True. After doing df.where(), the df C column keeps the same.
# print(df1)
A B C
0 0.0 1.0 NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
# print(df) after df.where(df[['A', 'B']]<3, inplace=True)
A B C
0 0.0 1.0 2
1 NaN NaN 5
2 NaN NaN 8
3 NaN NaN 11
Think it simple.
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3)
The last code line is asking for the full dataframe (df.). The condition was applied to columns ['A', 'B'] so, once the column 'C' was not part of the condition it will return NaN for the column C.
This below would be the same of df.mask(df[['A', 'B']]<3)
>>> df[["A","B","C"]].mask(df[['A', 'B']]<3)
A B C
0 NaN NaN NaN
1 NaN 3.0 NaN
2 4.0 5.0 NaN
3 6.0 7.0 NaN
4 8.0 9.0 NaN
>>>
And, df.mask(df[['A', 'B', 'C']]<3) will generate an error, because column 'C' is string type
TypeError: '<' not supported between instances of 'str' and 'int'
Finally, to return only columns "A" and "B"
>>> df[["A","B"]].mask(df[['A', 'B']]<3)
A B
0 NaN NaN
1 NaN 3.0
2 4.0 5.0
3 6.0 7.0
4 8.0 9.0
When you apply the command to be done inplace, it will do nothing to column C because of the NaN, which in the mask method will be 'do nothing'

Efficiently combine dataframes on 2nd level index

I have two dataframes looking like
import pandas as pd
df1 = pd.DataFrame([2.1,4.2,6.3,8.4,10.5], index=[2,4,6,8,10])
df1.index.name = 't'
df2 = pd.DataFrame(index=pd.MultiIndex.from_tuples([('A','a',1),('A','a',4),
('A','b',5),('A','b',6),('B','c',7),
('B','c',9),('B','d',10),('B','d',11),
], names=('big', 'small', 't')))
I am searching for an efficient way to combine them such that I get
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
I.e. I want to get the index levels 0 and 1 of df2 as index levels 0 and 1 in df1.
Of course a loop over the dataframe would work as well, though not feasible for large dataframes.
EDIT:
It appears from comments below that I should add, the indices big and small should be inferred on t in df1 based on the ordering of t.
Assuming that you want the unknown index levels to be inferred based on the ordering of 't', we can use an other merge, sort the values and then re-create the MultiIndex using ffill logic (need a Series for this).
res = (df2.reset_index()
.merge(df1, on='t', how='outer')
.set_index(df2.index.names)
.sort_index(level='t'))
res.index = pd.MultiIndex.from_arrays(
[pd.Series(res.index.get_level_values(i)).ffill()
for i in range(res.index.nlevels)],
names=res.index.names)
print(res)
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
Try extracting the level values and reindex:
df2['0'] = df1.reindex(df2.index.get_level_values('t'))[0].values
Output:
0
big small t
A a 1 NaN
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
9 NaN
d 10 10.5
11 NaN
For more columns in df1, we can just merge:
(df2.reset_index()
.merge(df1, on='t', how='left')
.set_index(df2.index.names)
)

How to really filter a pandas dataset without leaving Nans everywhere

Say I have a huge DataFrame that only contains a handful of cells that match the filtering I perform. How can I end up with only the values that match it (and their indexes and columns) in a new dataframe without the entire other DataFrame that becomes Nan. Dropping Nan's with dropna just removes the whole column or row and filter replaces non matches with Nans.
Here's my code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((1000, 1000)))
# this one is almost filled with Nans
df[df<0.01]
If need non missing values in another format you can use DataFrame.stack:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(5, 3)))
# this one is almost filled with Nans
df1 = df[df<7]
print (df1)
0 1 2
0 0.0 NaN 3.0
1 6.0 3.0 3.0
2 NaN NaN 0.0
3 0.0 NaN NaN
4 3.0 NaN 2.0
df2 = df1.stack().rename_axis(('a','b')).reset_index(name='c')
print (df2)
a b c
0 0 0 0.0
1 0 2 3.0
2 1 0 6.0
3 1 1 3.0
4 1 2 3.0
5 2 2 0.0
6 3 0 0.0
7 4 0 3.0
8 4 2 2.0

transform a big dataframe with many None values to smaller one with indication of non null columns

I have a big dataframe with 4 columns with often 3 null values at every row. Sometimes there are 2 or 1 or even 0 null values but often 3.
I want to transform it to a two columns dataframe having in each row the non null value and the name of the column from which it was extracted.
Example: How to transform this dataframe
df
Out[1]:
a b c d
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 NaN NaN 3.0 2.0
3 NaN NaN 1.0 NaN
to this One:
resultDF
Out[2]:
value columnName
0 1 a
1 2 b
2 3 c
3 2 d
4 1 c
The goal is to do it without looping on rows. Is this possible?
You can use pd.melt for adjusting the dataframe :
import pandas as pd
# reading the csv
df = pd.read_csv('test.csv')
df = df.melt(value_vars=['a','b','c','d'], var_name='foo', value_name='foo_value')
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Output :
foo foo_value
0 a 1.0
1 b 2.0
2 c 3.0
3 c 1.0
4 d 2.0

Delete values from pandas dataframe based on logical operation

I want to delete the values that are greater than a certain threshold from a pandas dataframe. Is there an efficient way to perform this? I am doing it with apply and lambda, which works fine but a bit slow for a large dataframe and I feel like there must be a better method.
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df
A B
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
How can this be done without apply and lambda?
df['A'] = df.apply(lambda x: x['A'] if x['A'] < 3 else None, axis=1)
df
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
4 NaN 5
Use a boolean mask against the df:
In[21]:
df[df<3]
Out[21]:
A
0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
Here where the boolean condition is not met a False is returned, this will just mask out the df value returning NaN
If you actually want to drop these rows then self-assign:
df = df[df<3]
To compare a specific column:
In[22]:
df[df['A']<3]
Out[22]:
A
0 1
1 2
If you want NaN in the removed rows then you can use a trick where a double square brackets will return a single column df so we can mask the df:
In[25]:
df[df[['A']]<3]
Out[25]:
A
0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
If you have multiple columns then the above won't work as the boolean mask has to match the orig df, in which case you can reindex against the orig df index:
In[31]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df[df['A']<3].reindex(df.index)
Out[31]:
A B
0 1.0 1.0
1 2.0 2.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
EDIT
You've updated your question again, if you want to just overwrite the single column:
In[32]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df['A'] = df.loc[df['A'] < 3,'A']
df
Out[32]:
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
4 NaN 5

Categories