Given this sample dataset, I am attempting to alert various companies that they have duplicates in our database so that they can all communicate with each other and determine which company the person belongs to:
Name SSN Company
Smith, John 1234 A
Smith, John 1234 B
Jones, Mary 4567 C
Jones, Mary 4567 D
Williams, Joe 1212 A
Williams, Joe 1212 C
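For reference, a minimal construction of this sample data (column names taken from the table above):
import pandas as pd

df = pd.DataFrame({
    'Name': ['Smith, John', 'Smith, John', 'Jones, Mary',
             'Jones, Mary', 'Williams, Joe', 'Williams, Joe'],
    'SSN': [1234, 1234, 4567, 4567, 1212, 1212],
    'Company': ['A', 'B', 'C', 'D', 'A', 'C'],
})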
The ideal output is a data frame provided to each company alerting them to duplicates in the data and the identity of the other company claiming the same person as assigned to them. Something like this:
Company A dataframe
Name SSN Company
Smith, John 1234 A
Smith, John 1234 B
Williams, Joe 1212 A
Williams, Joe 1212 C
Company C dataframe
Name SSN Company
Jones, Mary 4567 C
Jones, Mary 4567 D
Williams, Joe 1212 A
Williams, Joe 1212 C
So I tried groupby(['Company']), but of course that only collects each company's rows into a single group; it omits the other company claiming the duplicate person and SSN. Some version of groupby (deep in the logic of that one) seems like it should work, perhaps grouping by multiple columns, but not quite: the output would be grouped by company yet still contain the duplicate rows associated with all the values in that company's group. An enigma, hence my post.
Perhaps groupby Company and then concatenate each Company group with each other group on the Name column?
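A rough sketch of that idea is possible without groupby at all: mark the duplicated (Name, SSN) pairs with duplicated(), then build each company's report by merging that company's keys back against all duplicated rows (a hedged sketch, using the df constructed above):
# every row whose (Name, SSN) is claimed by more than one company
dupes = df[df.duplicated(subset=['Name', 'SSN'], keep=False)]

for company in df['Company'].unique():
    # the duplicated people that involve this company
    keys = dupes.loc[dupes['Company'] == company, ['Name', 'SSN']].drop_duplicates()
    # all companies' rows for those people: this company's report
    report = dupes.merge(keys, on=['Name', 'SSN'])
    print(f'Company {company} dataframe')
    print(report)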
First we pivot on Company to see employees who are in multiple companies easily:
df2 = pd.pivot_table(df.assign(count=1), index=['Name', 'SSN'],
                     columns='Company', values='count', aggfunc='count')
produces
Company A B C D
Name SSN
Jones,Mary 4567 NaN NaN 1.0 1.0
Smith,John 1234 1.0 1.0 NaN NaN
Williams,Joe 1212 1.0 NaN 1.0 NaN
where the values are the count of an employee in that company, and NaN means they are not in it.
Now we can manipulate this to extract useful views for different companies. For A we can say 'pull everyone who is in company A and in any of the other companies':
dfA = df2[(~df2['A'].isna()) & (~df2[['B', 'C', 'D']].isna()).any(axis=1)].dropna(how='all', axis=1)
dfA
this produces
Company A B C
Name SSN
Smith,John 1234 1.0 1.0 NaN
Williams,Joe 1212 1.0 NaN 1.0
Note that via dropna(...) we dropped companies that are irrelevant here, in this case D: there were no overlaps between A and D, so column D was all NaNs.
We can easily write a function to produce a report for any company
def report_for(company_name):
    companies = df2.columns
    other_companies = [c for c in companies if c != company_name]
    return (df2[(~df2[company_name].isna())
                & (~df2[other_companies].isna()).any(axis=1)]
            .loc[:, [company_name] + other_companies]
            .dropna(how='all', axis=1))
Note we also re-order columns so the table for company 'B' has column 'B' first:
report_for('B')
generates
Company B A
Name SSN
Smith,John 1234 1.0 1.0
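To produce a report for every company in one pass, a small usage sketch:
reports = {company: report_for(company) for company in df2.columns}
for company, report in reports.items():
    print(f'Company {company} dataframe')
    print(report)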
I have a dataset with three columns:
Name Customer Value
Johnny Mike 1
Christopher Luke 0
Christopher Mike 0
Carl Marilyn 1
Carl Stephen 1
I need to create a new dataset with two columns: one with the unique values from the Name and Customer columns, and the Value column. Values in the Value column were assigned by Name (multiple rows with the same Name share the same value: Carl has value 1, Christopher has value 0, and Johnny has value 1), so Customer elements should have empty values in the Value column in the new dataset.
My expected output is
All Value
Johnny 1
Christopher 0
Carl 1
Mike
Luke
Marilyn
Stephen
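For reference, the sample data as a DataFrame (assuming it is loaded into a frame named file, as in the question's code below; the answers that follow call the same data df):
import pandas as pd

file = pd.DataFrame({
    'Name': ['Johnny', 'Christopher', 'Christopher', 'Carl', 'Carl'],
    'Customer': ['Mike', 'Luke', 'Mike', 'Marilyn', 'Stephen'],
    'Value': [1, 0, 0, 1, 1],
})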
For the unique values in the All column, I take unique().tolist() from both Name and Customer:
name = file['Name'].unique().tolist()
customer = file['Customer'].unique().tolist()
all_with_dupl = name + customer
customers = list(dict.fromkeys(all_with_dupl))
df = pd.DataFrame(columns=['All', 'Value'])
df['All'] = customers
I do not know how to assign the values in the new dataset after creating the list with all names and customers with no duplicates.
Any help would be great.
Split the columns, use .drop_duplicates to remove the duplicates, then concatenate the pieces back together (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
pd.concat(
    [df.drop(columns='Customer')
       .drop_duplicates()
       .rename(columns={'Name': 'All'}),
     df[['Customer']]
       .rename(columns={'Customer': 'All'})
       .drop_duplicates()],
    ignore_index=True
)
All Value
0 Johnny 1.0
1 Christopher 0.0
2 Carl 1.0
3 Mike NaN
4 Luke NaN
5 Marilyn NaN
6 Stephen NaN
Or to split the steps up:
names = df.drop(columns='Customer').drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
pd.concat([names, customers], ignore_index=True)
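Note the Value column comes out as float (1.0, NaN) because NaN forces a float dtype. If plain integers are needed to match the expected output, one hedged tweak is pandas' nullable integer dtype (assuming the concatenated result is assigned to a variable, here hypothetically named result):
result['Value'] = result['Value'].astype('Int64')  # 1, 0, <NA> instead of 1.0, 0.0, NaN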
Another way, which assumes Name and Customer were read in as a single space-separated 'Name Customer' column:
d = dict(zip(df['Name Customer'].str.split(r'\s').str[0], df['Value']))  # create dict mapping name -> value
df['Name Customer'] = df['Name Customer'].str.split(r'\s')
df = df.explode('Name Customer').drop_duplicates(keep='first').assign(Value='')  # explode dataframe and drop duplicates
df['Value'] = df['Name Customer'].map(d).fillna('')  # map values back
I want to de-dupe rows in pandas based on multiple criteria.
I have 3 columns: name, id and nick_name.
The first rule is to look for duplicate ids. When ids match, only keep rows where name and nick_name are different, as long as at least one row is kept.
In other words, if name and nick_name don't match, keep that row. If name and nick_name match, get rid of that row, as long as it isn't the only row that would be left for that id.
Example data:
data = {"name": ["Sam", "Sam", "Joseph", "Joseph", "Joseph", "Philip", "Philip", "James"],
"id": [1,1,2,2,2,3,3,4],
"nick_name": ["Sammie", "Sam", "Joseph", "Joe", "Joey", "Philip", "Philip", "James"]}
df = pd.DataFrame(data)
df
Produces:
name id nick_name
0 Sam 1 Sammie
1 Sam 1 Sam
2 Joseph 2 Joseph
3 Joseph 2 Joe
4 Joseph 2 Joey
5 Philip 3 Philip
6 Philip 3 Philip
7 James 4 James
Based on my rules above, I want a resulting dataframe to produce the following:
name id nick_name
0 Sam 1 Sammie
3 Joseph 2 Joe
4 Joseph 2 Joey
5 Philip 3 Philip
7 James 4 James
We can split this into 3 boolean conditions to filter your initial dataframe by.
# where the same (name, nick_name) pair appears more than once, keep one occurrence
# (duplicated with keep='first' marks all but the first, so the later duplicate survives)
con1 = df.duplicated(subset=['name', 'nick_name'], keep='first')
# where ids are duplicated and name is not equal to nick_name
con2 = df.duplicated(subset=['id'], keep=False) & df['name'].ne(df['nick_name'])
# where no duplicate id exists
con3 = df.groupby('id')['id'].transform('size').eq(1)
print(df.loc[con1 | con2 | con3])
name id nick_name
0 Sam 1 Sammie
3 Joseph 2 Joe
4 Joseph 2 Joey
6 Philip 3 Philip
7 James 4 James
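For comparison, a minimal groupby sketch that applies the stated rules directly; this variant keeps index 5 rather than 6 for the Philip rows, matching the expected output in the question:
def dedupe(g):
    # keep rows where name and nick_name differ, dropping exact repeats
    keep = g[g['name'] != g['nick_name']].drop_duplicates()
    # if that empties the group, keep a single row so the id survives
    return keep if not keep.empty else g.drop_duplicates().head(1)

print(df.groupby('id', group_keys=False).apply(dedupe))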
I have the following dataframe with firstname and lastname. I want to create a column fullname.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'firstname': ['jack', 'john', 'donald'],
                    'lastname': [np.nan, 'obrien', 'trump']})
print(df1)
firstname lastname
0 jack NaN
1 john obrien
2 donald trump
This works if there are no NaN values:
df1['fullname'] = df1['firstname']+df1['lastname']
However since there are NaNs in my dataframe, I decided to cast to string first. But it causes a problem in the fullname column:
df1['fullname'] = str(df1['firstname'])+str(df1['lastname'])
firstname lastname fullname
0 jack NaN 0 jack\n1 john\n2 donald\nName: f...
1 john obrien 0 jack\n1 john\n2 donald\nName: f...
2 donald trump 0 jack\n1 john\n2 donald\nName: f...
I can write some function that checks for nans and inserts the data into the new frame, but before I do that - is there another fast method to combine these strings into one column?
You need to treat NaNs using .fillna(). Here, you can fill them with ''.
df1['fullname'] = df1['firstname'] + ' ' + df1['lastname'].fillna('')
Output:
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
You may also use .add and specify a fill_value
df1.firstname.add(" ").add(df1.lastname, fill_value="")
PS: Chaining too many adds or + is not recommended for strings, but for one or two columns you should be fine
df1['fullname'] = df1['firstname']+df1['lastname'].fillna('')
There is also Series.str.cat which can handle NaN and includes the separator.
df1["fullname"] = df1["firstname"].str.cat(df1["lastname"], sep=" ", na_rep="")
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
What I will do (for the case where more than two columns need to be joined):
df1.stack().groupby(level=0).agg(' '.join)
Out[57]:
0 jack
1 john obrien
2 donald trump
dtype: object
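One caveat: stack drops NaNs, so a row whose name columns are all NaN would disappear from the result. A hedged guard is to restrict to the two columns and reindex back to the original index:
df1['fullname'] = (df1[['firstname', 'lastname']]
                   .stack().groupby(level=0).agg(' '.join)
                   .reindex(df1.index, fill_value=''))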
I have a large list of names and I am trying to cull down the duplicates. I am grouping them by name and consolidating the info as needed.
When two people don't have the same name there is no problem: we can just ffill and bfill. However, if two people have the same name, we need to do some extra checks.
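For that simpler case, a minimal sketch of the group-wise fill (assuming the full table is a frame df grouped on name):
# fill missing fields in both directions within each name group,
# then collapse the now-identical rows
consolidated = (df.groupby('name', group_keys=False)
                  .apply(lambda g: g.ffill().bfill())
                  .drop_duplicates())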
This is an example of a group:
name code id country yob
1137 Bobby Joe USA19921111 NaN NaN NaN
2367 Bobby Joe NaN 1223133121 USA 1992
4398 Bobby Joe USA19981111 NaN NaN NaN
The code contains the person's country and birthdate. Looking at it, we can see that the first and second rows are the same person, so we need to fill the info from the second row into the first row:
name code id country yob
1137 Bobby Joe USA19921111 1223133121 USA 1992
4398 Bobby Joe USA19981111 NaN NaN NaN
Here is what I have:
# Create a dictionary of all of the rows that contain
# codes, keyed by their indexes
code_rows = dict(zip(list(group['code'].dropna().index),
                     group['code'].dropna().unique()))
no_code_rows = group.loc[pd.isnull(group['code']), :]

if no_code_rows.empty or len(code_rows) == group.shape[0]:
    # No info to consolidate
    return group

for group_idx, code in code_rows.items():
    for row_idx, row in no_code_rows.iterrows():
        country_yob = row['country'] + str(int(row['yob']))
        if country_yob in code:
            group.loc[group_idx, 'id'] = row['id']
            group.loc[group_idx, 'country'] = row['country']
            group.loc[group_idx, 'yob'] = row['yob']
            group.drop(row_idx, inplace=True)
            # Drop from temp table so we don't have to iterate
            # over an extra row
            no_code_rows.drop(row_idx, inplace=True)
            break
return group
This works, but I have a feeling I am missing something. I feel like I shouldn't have to use two loops for this, and that maybe there is a pandas function?
EDIT
We don't know the order or how many rows we will have in each group
i.e.
name code id country yob
1137 Bobby Joe USA19921111 NaN NaN NaN
2367 Bobby Joe USA19981111 NaN NaN NaN
4398 Bobby Joe NaN 1223133121 USA 1992
I think you need:
m = df['code'].isnull()
df1 = df[~m]
df2 = df[m]

df = df1.merge(df2, on='name', suffixes=('', '_'))
# yob arrives as float because of the NaNs, so cast via int
# to rebuild the 'USA1992' style key
df['a_'] = df['country_'] + df['yob_'].astype(int).astype(str)

m = df.apply(lambda x: x['a_'] in x['code'], axis=1)
df.loc[m, ['id', 'country', 'yob']] = (df.loc[m, ['id_', 'country_', 'yob_']]
                                         .rename(columns=lambda x: x.strip('_')))
df = df.loc[:, ~df.columns.str.endswith('_')]
print(df)
name code id country yob
0 Bobby Joe USA19921111 1223133121 USA 1992
1 Bobby Joe USA19981111 NaN NaN NaN
From a pandas data frame with two string columns, looking like:
d = {'SCHOOL': ['Yale', 'Yale', 'LBS', 'Harvard', 'UCLA', 'Harvard', 'HEC'],
     'NAME': ['John', 'Marc', 'Alex', 'Will', 'Will', 'Miller', 'Tom']}
df = pd.DataFrame(d)
Notice that the relationship of NAME to SCHOOL is n to 1.
I want to get the last school in case one person has gone to two different schools (see the "Will" case).
So far I got:
df = df.groupby('NAME')['SCHOOL'].unique().reset_index()
Return:
NAME SCHOOL
0 Alex [LBS]
1 John [Yale]
2 Marc [Yale]
3 Miller [Harvard]
4 Tom [HEC]
5 Will [Harvard, UCLA]
PROBLEMS:
unique() returns both schools, not only the last one.
It also returns the SCHOOL column as a np.array instead of a string, which makes the resulting df very difficult to work with.
Both problems were solved based on @IanS's comments.
Using last() instead of unique():
df = df.groupby('NAME')['SCHOOL'].last().reset_index()
Return:
NAME SCHOOL
0 Alex LBS
1 John Yale
2 Marc Yale
3 Miller Harvard
4 Tom HEC
5 Will UCLA
Use drop_duplicates with keep='last', specifying the column to check for duplicates:
df = df.drop_duplicates('NAME', keep='last')
print (df)
NAME SCHOOL
0 John Yale
1 Marc Yale
2 Alex LBS
4 Will UCLA
5 Miller Harvard
6 Tom HEC
Also, if you need sorting, add sort_values:
df = df.drop_duplicates('NAME', keep='last').sort_values('NAME')
print (df)
NAME SCHOOL
2 Alex LBS
0 John Yale
1 Marc Yale
5 Miller Harvard
6 Tom HEC
4 Will UCLA