Delete Null Duplicates from DataFrame - python

I'm trying to drop duplicate rows in one column when the value in another column is null. Here's a sample dataframe:
Primary Application   Assigned To
Application 1         Jim Smith
Application 1         nan
Application 2         John Williams
Application 2         nan
Application 3         nan
Application 3         Sarah Smith
I'm trying to write a conditional that deletes the duplicate in Primary Application if the first or second value of a duplicate in Assigned To is null.
The ideal output would be:
Primary Application   Assigned To
Application 1         Jim Smith
Application 2         John Williams
Application 3         Sarah Smith
Here's what I've written so far:
df = df.groupby('Primary Application', as_index=False).apply(
    lambda x: x.drop_duplicates(subset=['Primary Application'], keep='first'
              if x['Assigned To'].iat[1].isnull()
              else x.drop_duplicates(subset=['Primary Application'], keep='last')))
The main issue is with the if condition using isnull(). I've also tried is None, which hasn't worked either.
A key thing I should have added to this question: there are NA values I do want to keep, just not ones that are duplicates of applications that already have someone assigned.

You can pass a custom function to groupby.agg:
import numpy as np

df.groupby('Primary Application', as_index=False).agg(
    lambda x: np.nan if x.isnull().all() else x.dropna().iloc[0])
# keeps the first non-null assignee per group, or NaN if the group has none
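As a quick check, here is a self-contained run of that aggregation on the sample frame from the question; the result matches the ideal output above:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Primary Application': ['Application 1', 'Application 1', 'Application 2',
                            'Application 2', 'Application 3', 'Application 3'],
    'Assigned To': ['Jim Smith', np.nan, 'John Williams', np.nan,
                    np.nan, 'Sarah Smith']})

out = df.groupby('Primary Application', as_index=False).agg(
    lambda x: np.nan if x.isnull().all() else x.dropna().iloc[0])
print(out)
#   Primary Application    Assigned To
# 0       Application 1      Jim Smith
# 1       Application 2  John Williams
# 2       Application 3    Sarah Smith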

Another way is via transform(): if the group size is one, keep it; otherwise keep all the non-NaN values in the group.
With a couple of additional cases added to represent non-duplicate situations:
import numpy as np
import pandas as pd

d = {'Primary Application': ['Application 1 ', 'Application 1 ', 'Application 2 ',
                             'Application 2 ', 'Application 3 ', 'Application 3 ',
                             'Application 4 ', 'Application 5 '],
     'Assigned To': ['Jim Smith', np.nan, 'John Williams', np.nan, np.nan,
                     'Sarah Smith', np.nan, 'Mark Meed']}
df = pd.DataFrame(d)
Primary Application Assigned To
0 Application 1 Jim Smith
1 Application 1 NaN
2 Application 2 John Williams
3 Application 2 NaN
4 Application 3 NaN
5 Application 3 Sarah Smith
6 Application 4 NaN
7 Application 5 Mark Meed
df[df.groupby('Primary Application') \
.transform(lambda x: (x.size==1) | (~pd.isna(x)))['Assigned To']]
Primary Application Assigned To
0 Application 1 Jim Smith
2 Application 2 John Williams
5 Application 3 Sarah Smith
6 Application 4 NaN
7 Application 5 Mark Meed
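To see what the filter is doing, you can look at the intermediate boolean mask on its own; groups of size one come back all True, and larger groups are True only at the non-NaN positions:
mask = df.groupby('Primary Application') \
         .transform(lambda x: (x.size == 1) | (~pd.isna(x)))['Assigned To']
print(mask.tolist())
# [True, False, True, False, False, True, True, True]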

Related

How to remove rows in a Python data frame with a condition?

I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains a string, that row should be deleted from the data frame. I tried:
df[['Leaves', 'Salary', 'Performance']].apply(pd.to_numeric, errors='coerce')
but this will convert the values to NaN.
Let's start with a note concerning your sample data: it contains Nan strings, which are not among the strings automatically recognized as NaN. To treat them as NaN, I read the source text with read_fwf, passing na_values=['Nan'].
Now, on to the main task.
Define a function to check whether a cell is acceptable:
import pandas as pd

def isAcceptable(cell):
    if pd.isna(cell) or cell == 'unknown':
        return True
    return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values. You also accept a cell if it contains just the bare word unknown, but you don't accept a cell if that word is enclosed in, e.g., quotes.
If you change your mind about what is or isn't acceptable, change the above function accordingly.
Then, to leave only rows with all acceptable values in all 3 mentioned
columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
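If you prefer to build on your pd.to_numeric attempt, a rough alternative is to use the coerced copy only as a mask instead of overwriting the columns (re-allowing NaN and the bare word unknown to mirror the function above; quoted values are still rejected):
import pandas as pd

cols = ['Leaves', 'Salary', 'Performance']
converted = df[cols].apply(pd.to_numeric, errors='coerce')   # coerced copy; df itself is untouched
acceptable = converted.notna() | df[cols].isna() | df[cols].eq('unknown')
df[acceptable.all(axis=1)]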

How to shift the values of a certain group by different amounts

I have a DataFrame that looks like this:
user data
0 Kevin 1
1 Kevin 3
2 Sara 5
3 Kevin 23
...
And I want to get the historical values (looking let's say 2 entries forward) as rows:
user data data_1 data_2
0 Kevin 1 3 23
1 Sara 5 24 NaN
2 Kim ...
...
Right now I'm able to do this through the following command:
_temp = df.groupby(['user'], as_index=False)['data']
for i in range(1, 2):
    df['data_{0}'.format(i)] = _temp.shift(-i)
I feel like my approach is very inefficient and that there is a much faster way to do this (especially when the number of lookahead/lookback values goes up)!
You can use groupby.cumcount() with set_index() and unstack():
m = df.assign(k=df.groupby('user').cumcount().astype(str)).set_index(['user', 'k']).unstack()
m.columns = m.columns.map('_'.join)
print(m)
data_0 data_1 data_2
user
Kevin 1.0 3.0 23.0
Sara 5.0 NaN NaN
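For reference, a minimal self-contained run assuming the four sample rows shown in the question (it reproduces the frame printed above):
import pandas as pd

df = pd.DataFrame({'user': ['Kevin', 'Kevin', 'Sara', 'Kevin'],
                   'data': [1, 3, 5, 23]})

m = df.assign(k=df.groupby('user').cumcount().astype(str)).set_index(['user', 'k']).unstack()
m.columns = m.columns.map('_'.join)   # flatten ('data', '0') -> 'data_0', etc.
print(m)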

Collapsing Values to the Left, Python/Pandas

I work with a dataset that occasionally has values removed before I get my hands on it. When a value is removed, it is generally replaced by NaN or ''. What is the most efficient way to collapse the values to the left?
Specifically, I'm trying to turn this:
1 2 3 4
bill sjd meoip
nick tredsn bana
fred ccrw aaaa cretwew bbbbb
tom eomwepo
jill dew weaedf
Into this:
1 2 3 4
bill sjd meoip
nick tredsn bana
fred ccrw aaaa cretwew bbbbb
tom eomwepo
jill dew weaedf
The column titles don't matter, the only thing that matters is that there are no leading empty cells and no empty cells between.
I would prefer to do this in a non-iterative fashion, as the df can be quite large.
Try this: if those blanks are '', use mask to turn them into np.nan first; otherwise you don't need mask or fillna:
df.mask(df == '').apply(lambda x: pd.Series(x.dropna().values), axis=1).fillna('')
Output:
0 1 2 3
bill sjd meojp
nick tredsn bana
fred ccrw aaaa cretwew bbbb
tom eomwep
jill dew weadf
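For a quick self-contained check of that one-liner, here is a made-up two-row frame with blanks (values purely illustrative):
import pandas as pd

df = pd.DataFrame({1: ['', 'dew'], 2: ['sjd', ''], 3: ['', 'weaedf'], 4: ['meoip', '']})
out = df.mask(df == '').apply(lambda x: pd.Series(x.dropna().values), axis=1).fillna('')
print(out)
#      0       1
# 0  sjd   meoip
# 1  dew  weaedf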

Apply value in column based on conditions while cross-evaluating 2 datasets

I have 2 DataFrames:
PROJECT1
key name deadline delivered
0 AA1 Tom 01/05/2018 02/05/2018
1 AA2 Sue 01/05/2018 30/04/2018
2 AA4 Jack 01/05/2018 04/05/2018
PROJECT2
key name deadline delivered
0 AA1 Tom 01/05/2018 30/04/2018
1 AA2 Sue 01/05/2018 30/04/2018
2 AA3 Jim 01/05/2018 03/05/2018
Is it possible to create a column in PROJECT2 named 'In PROJECT1' and apply a condition like this:
pseudo code
for row in PROJECT2:
    if in the same row, based on the key column, PROJECT1['delivered'] >= PROJECT2['deadline']:
        PROJECT2['In PROJECT1'] = 'project delivered before deadline'
    else:
        'Project delayed'
expected result
key name deadline delivered In PROJECT1
0 AA1 Tom 01/05/2018 30/04/2018 Project delayed
1 AA2 Sue 01/05/2018 30/04/2018 project delivered before deadline
2 AA3 Jim 01/05/2018 03/05/2018 NaN
Not sure how to approach it (iterrows(), a for loop, df.loc[conditions], np.where(), or perhaps I need to define some kind of function to use in df.apply()); any help highly appreciated.
You can use numpy.select to add a series with a list of conditions and values.
Note I believe you have your desired criteria reversed, i.e. delivered before deadline should give "project delivered before deadline" rather than vice versa.
import numpy as np
# convert series to datetime if necessary
for col in ['deadline', 'delivered']:
    df1[col] = pd.to_datetime(df1[col], dayfirst=True)
for col in ['deadline', 'delivered']:
    df2[col] = pd.to_datetime(df2[col], dayfirst=True)
# create series mapping key to delivered date in df1
s = df1.set_index('key')['delivered']
# define conditions and values
conditions = [~df2['key'].isin(s.index), df2['key'].map(s) <= df2['deadline']]
values = [np.nan, 'project delivered before deadline']
# apply conditions and values, with fallback value
df2['In Project1'] = np.select(conditions, values, 'Project delayed')
print(df2)
key name deadline delivered In Project1
0 AA1 Tom 2018-05-01 2018-04-30 Project delayed
1 AA2 Sue 2018-05-01 2018-04-30 project delivered before deadline
2 AA3 Jim 2018-05-01 2018-05-03 nan
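One small caveat with this approach: np.select casts all the choices to a common dtype, so the np.nan placeholder comes out as the string 'nan' (which is why the printed frame above shows nan rather than NaN). If you want a real missing value you could, for example, swap it back afterwards:
df2['In Project1'] = df2['In Project1'].replace('nan', np.nan)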
Here is an alternative way: join the two data sets. This avoids any need for a loop and will be faster.
## join the two data sets
# p1 = Project 1
# p2 = Project 2
p3 = p2.merge(p1.loc[:,['key','delivered']], on='key',how='left', suffixes=['_p2','_p1'])
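# note: as in the first answer, converting 'deadline' and 'delivered' with pd.to_datetime beforehand
# makes the comparison below a true date comparison rather than a dd/mm/yyyy string comparison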
p3['In PROJECT1'] = np.where((p3['delivered_p1'] >= p3['delivered_p2']),'project delivered before deadline','Project delayed')
# handle cases with NA
set_to_na = p3[['delivered_p1', 'delivered_p2']].isnull().any(axis=1)
p3.loc[set_to_na, 'In PROJECT1'] = np.nan
## remove unwanted columns and rename
p3.drop('delivered_p1', axis=1, inplace=True)
p3.rename(columns={'delivered_p2':'delivered'}, inplace=True)
print(p3)
key name deadline delivered In PROJECT1
0 AA1 Tom 01/05/2018 30/04/2018 Project delayed
1 AA2 Sue 01/05/2018 30/04/2018 project delivered before deadline
2 AA3 Jim 01/05/2018 03/05/2018 NaN

Group by pandas data frame unique first values - numpy array returned

From a pandas data frame with two string columns, looking like:
d = {'SCHOOL' : ['Yale', 'Yale', 'LBS', 'Harvard','UCLA', 'Harvard', 'HEC'],
'NAME' : ['John', 'Marc', 'Alex', 'Will', 'Will','Miller', 'Tom']}
df = pd.DataFrame(d)
Notice that the relationship between NAME and SCHOOL is n to 1.
I want to get the last school in case one person has gone to two different schools (see "Will" case).
So far I got:
df = df.groupby('NAME')['SCHOOL'].unique().reset_index()
Return:
NAME SCHOOL
0 Alex [LBS]
1 John [Yale]
2 Marc [Yale]
3 Miller [Harvard]
4 Tom [HEC]
5 Will [Harvard, UCLA]
PROBLEMS:
unique() returns both schools, not only the last one.
This line returns the SCHOOL column as an np.array instead of a string, which makes it very difficult to work with the df further.
Both problems were solved based on #IanS's comments.
Using last() instead of unique():
df = df.groupby('NAME')['SCHOOL'].last().reset_index()
Return:
NAME SCHOOL
0 Alex LBS
1 John Yale
2 Marc Yale
3 Miller Harvard
4 Tom HEC
5 Will UCLA
Use drop_duplicates with keep='last', specifying the column to check for duplicates:
df = df.drop_duplicates('NAME', keep='last')
print (df)
NAME SCHOOL
0 John Yale
1 Marc Yale
2 Alex LBS
4 Will UCLA
5 Miller Harvard
6 Tom HEC
Also, if you need sorting, add sort_values:
df = df.drop_duplicates('NAME', keep='last').sort_values('NAME')
print (df)
NAME SCHOOL
2 Alex LBS
0 John Yale
1 Marc Yale
5 Miller Harvard
6 Tom HEC
4 Will UCLA
