I have a large list of names and I am trying to cull the duplicates. I am grouping them by name and consolidating the info where needed.
When two people don't have the same name it is no problem; we can just ffill and bfill. However, if two people have the same name we need to do some extra checks.
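For the simple case (no name collisions between different people), a minimal sketch of the ffill/bfill consolidation I mean, using a hypothetical frame df:

import pandas as pd

# Hypothetical easy case: the two "Jane Doe" rows can be consolidated by
# plain ffill/bfill within the name group, then the duplicate dropped.
df = pd.DataFrame({
    'name': ['Jane Doe', 'Jane Doe', 'John Roe'],
    'code': ['USA19801231', None, None],
    'id': [None, 5551234, 7779999],
    'country': [None, 'USA', 'USA'],
    'yob': [None, 1980, 1985],
})

cols = ['code', 'id', 'country', 'yob']
df[cols] = df.groupby('name')[cols].transform(lambda s: s.ffill().bfill())
df = df.drop_duplicates()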
This is an example of a group:
name code id country yob
1137 Bobby Joe USA19921111 NaN NaN NaN
2367 Bobby Joe NaN 1223133121 USA 1992
4398 Bobby Joe USA19981111 NaN NaN NaN
The code contains the person's country and birthdate. Looking at it, we can see that the first and second rows are the same person, so we need to fill the info from the second row into the first row:
name code id country yob
1137 Bobby Joe USA19921111 1223133121 USA 1992
4398 Bobby Joe USA19981111 NaN NaN NaN
Here is what I have:
# Build a dict of {row index: code} for every row that has a code
code_rows = dict(zip(group['code'].dropna().index,
                     group['code'].dropna().values))
no_code_rows = group.loc[pd.isnull(group['code']), :]
if no_code_rows.empty or len(code_rows) == group.shape[0]:
    # No info to consolidate
    return group
for group_idx, code in code_rows.items():
    for row_idx, row in no_code_rows.iterrows():
        country_yob = row['country'] + str(int(row['yob']))
        if country_yob in code:
            group.loc[group_idx, 'id'] = row['id']
            group.loc[group_idx, 'country'] = row['country']
            group.loc[group_idx, 'yob'] = row['yob']
            group.drop(row_idx, inplace=True)
            # Drop from the temp table so we don't have to iterate
            # over an extra row
            no_code_rows.drop(row_idx, inplace=True)
            break
return group
This works, but I have a feeling I am missing something. I feel like I shouldn't need two loops for this and that maybe there is a pandas function for it?
EDIT
We don't know the order or how many rows we will have in each group
i.e.
name code id country yob
1137 Bobby Joe USA19921111 NaN NaN NaN
2367 Bobby Joe USA19981111 NaN NaN NaN
4398 Bobby Joe NaN 1223133121 USA 1992
I think you need:
# split rows with and without a code
m = df['code'].isnull()
df1 = df[~m]
df2 = df[m]
# pair every code row with every codeless row of the same name
df = df1.merge(df2, on='name', suffixes=('', '_'))
# does country + yob from the codeless row appear in the code?
df['a_'] = df['country_'] + df['yob_'].astype(str)
m = df.apply(lambda x: x['a_'] in x['code'], axis=1)
# copy the matched info across, then drop the helper columns
df.loc[m, ['id','country','yob']] = df.loc[m, ['id_','country_','yob_']].rename(columns=lambda x: x.strip('_'))
df = df.loc[:, ~df.columns.str.endswith('_')]
print (df)
name code id country yob
0 Bobby Joe USA19921111 1223133121 USA 1992
1 Bobby Joe USA19981111 NaN NaN NaN
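If you want to keep your per-group design, here is a sketch of the same idea wrapped in a function and applied with groupby; consolidate is just a hypothetical helper name, and yob may need to be a string (or int) so the country + yob concatenation matches the code:

def consolidate(group):
    m = group['code'].isnull()
    with_code, without_code = group[~m], group[m]
    if with_code.empty or without_code.empty:
        return group
    merged = with_code.merge(without_code, on='name', suffixes=('', '_'))
    # same trick as above: does country + yob from the codeless row appear in the code?
    merged['a_'] = merged['country_'] + merged['yob_'].astype(str)
    hit = merged.apply(lambda x: x['a_'] in x['code'], axis=1)
    merged.loc[hit, ['id', 'country', 'yob']] = (
        merged.loc[hit, ['id_', 'country_', 'yob_']]
              .rename(columns=lambda c: c.rstrip('_'))
    )
    # note: like the snippet above, codeless rows that match nothing are dropped
    return merged.loc[:, ~merged.columns.str.endswith('_')]

out = df.groupby('name', group_keys=False).apply(consolidate)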
Related
I have a small dataset where I need to reshape the layout by displaying all related records (related by ID) on the same row. The order of the columns changes once I pivot the data in pandas, as shown in the outcome below. How can I maintain the order of the columns from the original dataset, please?
original dataset
import pandas as pd
df = pd.DataFrame({
    'ID': [33, 33, 3, 21, 21, 3, 33],
    'FirstName': ['Joseph', 'Mary', 'Abram', 'Peter', 'John', 'Daniel', 'Cat'],
    'LastName': ['JosephL', 'MaryL', 'AbramL', 'PeterL', 'JohnL', 'DanielL', 'CatL'],
    'CAR': ['BMW', 'MB', 'Opel', 'Fiat', 'VW', '', 'Ford'],
    'Salary': [1250, 3254, 2599, 4566, 7855, 9999, 7500]
})
Outcome
Pivoting the data to stack related records as new columns instead of new rows.
g = df.groupby(['ID']).cumcount().add(1)
df = df.set_index(['ID',g]).unstack(fill_value=0).sort_index(axis=1,level=1)
df.columns=["{}{}".format(a,b) for a,b in df.columns]
df = df.reset_index()
df
The order is not right; it should be FirstName, LastName, Salary, CAR repeated for each related record (FirstName, LastName, Salary, CAR, FirstName, LastName, Salary, CAR, ...etc).
You can use pivot_table after defining a custom column index:
vals = ['FirstName', 'LastName', 'CAR', 'Salary']
idx = df.groupby('ID').cumcount().add(1).astype(str)
out = (df.pivot_table(index='ID', columns=idx, values=vals, aggfunc='first', fill_value=0)
         .sort_index(level=1, axis=1).reset_index())
out.columns = out.columns.to_flat_index().map(''.join)
print(out)
# Output
ID CAR1 FirstName1 LastName1 Salary1 CAR2 FirstName2 LastName2 Salary2 CAR3 FirstName3 LastName3 Salary3
0 3 Opel Abram AbramL 2599.0 Daniel DanielL 9999.0 NaN NaN NaN NaN
1 21 Fiat Peter PeterL 4566.0 VW John JohnL 7855.0 NaN NaN NaN NaN
2 33 BMW Joseph JosephL 1250.0 MB Mary MaryL 3254.0 Ford Cat CatL 7500.0
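If you also need the original FirstName, LastName, ... order inside each numbered block (the output above sorts the value columns alphabetically within each block), one sketch, building on your own unstack attempt, is to sort only on the counter level; sort_remaining=False keeps whatever order you list in vals (this relies on the level sort being stable):

vals = ['FirstName', 'LastName', 'Salary', 'CAR']   # the within-block order you want
idx = df.groupby('ID').cumcount().add(1)
wide = df.set_index(['ID', idx])[vals].unstack(fill_value=0)
# sort columns by the counter level only; the vals order is kept within each block
wide = wide.sort_index(axis=1, level=1, sort_remaining=False)
wide.columns = [f'{a}{b}' for a, b in wide.columns]
out = wide.reset_index()
print(out)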
I have 2 dataframes, which I am calling df1 and df2.
df1 has the columns KPI and Context and looks like this:
KPI Context
0 Does the company have a policy in place to man... Anti-Bribery Policy\nBroadridge does not toler...
1 Does the company have a supplier code of conduct? Vendor Code of Conduct Our vendors play an imp...
2 Does the company have a grievance/complaint ha... If you ever have a question or wish to report ...
3 Does the company have a human rights policy ? Human Rights Statement of Commitment Broadridg...
4 Does the company have a policies consistent wi... Anti-Bribery Policy\nBroadridge does not toler...
df2 has a single column 'keyword'
df2:
Keyword
0 1.5 degree
1 1.5°
2 2 degree
3 2°
4 accident
I want to create another dataframe out of these two dataframes wherein, if a particular value from the 'Keyword' column of df2 is present in the 'Context' of df1, I simply write the count of it.
For this I have used pd.crosstab(), however I suspect that it's not giving me the expected output.
Here's what I have tried so far.
new_df = df1.explode('Context')
new_df1 = df2.explode('Keyword')
new_df = pd.crosstab(new_df['KPI'], new_df1['Keyword'], values=new_df['Context'], aggfunc='count').reset_index().rename_axis(columns=None)
print(new_df.head())
The new_df looks like this:
KPI 1.5 degree 1.5° \
0 Does the Supplier code of conduct cover one or... NaN NaN
1 Does the companies have sites/operations locat... NaN NaN
2 Does the company have a due diligence process ... NaN NaN
3 Does the company have a grievance/complaint ha... NaN NaN
4 Does the company have a grievance/complaint ha... NaN NaN
2 degree 2° accident
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 1.0 NaN NaN
4 NaN NaN NaN
The expected output which I want is something like this.
KPI 1.5 degree 1.5° 2 degree 2° accident
Does the company have a policy in place to man... 44 2 3 5 9
What exactly am I missing? Please let me know, thanks!
There are multiple problems. First, explode works with list-like values, not with plain strings. Then, to extract the Keyword matches from Context you need Series.str.findall, and crosstab should use two columns from the same DataFrame, not from 2 different ones:
import re

# regex alternation of all keywords, matched as whole words, case-insensitively
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
# one row per (KPI, matched keyword), then count the occurrences
new_df = df1.explode('new')
out = pd.crosstab(new_df['KPI'], new_df['new'])
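A self-contained sketch of the same idea on hypothetical toy data, just to show the moving parts:

import re
import pandas as pd

# hypothetical toy frames standing in for df1 and df2
df1 = pd.DataFrame({
    'KPI': ['KPI one', 'KPI two'],
    'Context': ['targets a 1.5 degree and a 2 degree pathway', 'no accident was reported'],
})
df2 = pd.DataFrame({'Keyword': ['1.5 degree', '2 degree', 'accident']})

pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
new_df = df1.explode('new')
out = pd.crosstab(new_df['KPI'], new_df['new'])
print(out)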
I have a dataset with three columns:
Name Customer Value
Johnny Mike 1
Christopher Luke 0
Christopher Mike 0
Carl Marilyn 1
Carl Stephen 1
I need to create a new dataset with two columns: one with the unique values from the Name and Customer columns, and the Value column. Values in the Value column were assigned to Name (this means that multiple rows with the same Name have the same value: Carl has value 1, Christopher has value 0, and Johnny has value 1), so Customer elements should have empty values in the Value column of the new dataset.
My expected output is
All Value
Johnny 1
Christopher 0
Carl 1
Mike
Luke
Marilyn
Stephen
For the unique values in the All column I combine unique().tolist() from both Name and Customer:
name = file['Name'].unique().tolist()
customer = file['Customer'].unique().tolist()
all_with_dupl = name + customer
customers = list(dict.fromkeys(all_with_dupl))
df = pd.DataFrame(columns=['All', 'Value'])
df['All'] = customers
I do not know how to assign the values in the new dataset after creating the list with all names and customers with no duplicates.
Any help would be great.
Split the columns, use .drop_duplicates on each data frame to remove duplicates, and then append them back together:
(df.drop('Customer', 1)
   .drop_duplicates()
   .rename(columns={'Name': 'All'})
   .append(
       df[['Customer']].rename(columns={'Customer': 'All'})
         .drop_duplicates(),
       ignore_index=True
   ))
All Value
0 Johnny 1.0
1 Christopher 0.0
2 Carl 1.0
3 Mike NaN
4 Luke NaN
5 Marilyn NaN
6 Stephen NaN
Or to split the steps up:
names = df.drop('Customer', 1).drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
names.append(customers, ignore_index=True)
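Note that DataFrame.append and the positional axis argument to drop were removed in pandas 2.0, so on a current pandas the same idea becomes (a sketch):

names = df.drop(columns='Customer').drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
out = pd.concat([names, customers], ignore_index=True)
print(out)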
Another way (this assumes Name and Customer arrive combined in a single space-separated column, 'Name Customer'):
d = dict(zip(df['Name Customer'].str.split(r'\s').str[0], df['Value']))  # create dict of name -> value
df['Name Customer'] = df['Name Customer'].str.split(r'\s')
df = df.explode('Name Customer').drop_duplicates(keep='first').assign(Value='')  # explode dataframe and drop duplicates
df['Value'] = df['Name Customer'].map(d).fillna('')  # map values back
Given this sample dataset, I am attempting to alert various companies that they have duplicates in our database so that they can all communicate with each other and determine which company the person belongs to:
Name SSN Company
Smith, John 1234 A
Smith, John 1234 B
Jones, Mary 4567 C
Jones, Mary 4567 D
Williams, Joe 1212 A
Williams, Joe 1212 C
The ideal output is a data frame provided to each company alerting them to duplicates in the data and the identity of the other company claiming the same person as assigned to them. Something like this:
Company A dataframe
Name SSN Company
Smith, John 1234 A
Smith, John 1234 B
Williams, Joe 1212 A
Williams, Joe 1212 C
Company C dataframe
Name SSN Company
Jones, Mary 4567 C
Jones, Mary 4567 D
Williams, Joe 1212 A
Williams, Joe 1212 C
So I tried groupby on ['Company'], but of course that only collects each Company's rows into one group; it omits the other Company that has the duplicate person and SSN. Some version of groupby seems like it should work, but grouping by multiple columns doesn't quite do it either. The output should be grouped by company but also contain the duplicate rows shared with the other companies in that company's group. An enigma, hence my post.
Perhaps groupby Company and then concatenate each Company group with each other group on the Name column?
First we pivot on Company to see employees who are in multiple companies easily:
df2 = pd.pivot_table(df.assign(count = 1), index = ['Name','SSN'], columns='Company', values='count', aggfunc = 'count')
produces
Company A B C D
Name SSN
Jones,Mary 4567 NaN NaN 1.0 1.0
Smith,John 1234 1.0 1.0 NaN NaN
Williams,Joe 1212 1.0 NaN 1.0 NaN
where the values are the count of an employee's rows in that company and NaN means they are not in it.
Now we can manipulate this to extract useful views for different companies. For A we can say 'pull everyone who is in company A and in any of the other companies':
dfA = df2[(~df2['A'].isna()) & (~df2[['B','C','D']].isna()).any(axis=1) ].dropna(how = 'all', axis=1)
dfA
this produces
Company A B C
Name SSN
Smith,John 1234 1.0 1.0 NaN
Williams,Joe 1212 1.0 NaN 1.0
Note we dropped companies that are irrelevant here via dropna(...), in this case D, as there was no overlap between A and D and column D was all NaNs.
We can easily write a function to produce a report for any company
def report_for(company_name):
    companies = df2.columns
    other_companies = [c for c in companies if c != company_name]
    return (df2[(~df2[company_name].isna())
                & (~df2[other_companies].isna()).any(axis=1)]
            .loc[:, [company_name] + other_companies]
            .dropna(how='all', axis=1)
            )
Note we also re-order columns so the table for company 'B' has column 'B' first:
report_for('B')
generates
Company B A
Name SSN
Smith,John 1234 1.0 1.0
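If you want one report per company in a single pass, a small follow-up sketch (the reports name is just illustrative):

# build one report frame per company, keyed by company name
reports = {company: report_for(company) for company in df2.columns}
reports['A']  # the Company A view shown earlier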
I have the following dataframe:
name date_one date_two
-----------------------------------------
sue
sue
john
john 13-06-2019
sally 23-04-2019
sally 23-04-2019 25-04-2019
bob 18-05-2019 14-06-2019
bob 18-05-2019 17-06-2019
The data contains duplicate name rows. I need to filter the data based on the following (in this order of priority):
1. For each name, keep the row with the newest date_two. If the name doesn't have any rows with a value for date_two, go to step 2.
2. For each name, keep the row with the newest date_one. If the name doesn't have any rows with a value for date_one, go to step 3.
3. These names don't have any rows with a date_one or date_two, so just keep the first row for that name.
The above dataframe would be filtered to:
name date_one date_two
-----------------------------------------
sue
john 13-06-2019
sally 23-04-2019 25-04-2019
bob 18-05-2019 17-06-2019
This doesn't need to be done in the most performant way. The dataframe is only a few thousand rows and only needs to be done once. If it needs to be done in multiple (slow) steps that's fine.
Use DataFrameGroupBy.idxmax per group to get the rows with the maximal dates, then filter out already matched names with Series.isin, and finally join everything together with concat:
df['date_one'] = pd.to_datetime(df['date_one'], dayfirst=True)
df['date_two'] = pd.to_datetime(df['date_two'], dayfirst=True)
#rule1
df1 = df.loc[df.groupby('name')['date_two'].idxmax().dropna()]
#rule2
df2 = df.loc[df.groupby('name')['date_one'].idxmax().dropna()]
df2 = df2[~df2['name'].isin(df1['name'])]
#rule3
df3 = df[~df['name'].isin(df1['name'].append(df2['name']))].drop_duplicates('name')
df = pd.concat([df1, df2, df3]).sort_index()
print (df)
name date_one date_two
0 sue NaT NaT
3 john 2019-06-13 NaT
5 sally 2019-04-23 2019-04-25
7 bob 2019-05-18 2019-06-17
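An alternative sketch (a different approach from the answer above, assuming date_one and date_two have already been converted with to_datetime): because the priority is just 'newest date_two, then newest date_one, then original order', you can sort so the preferred row comes first within each name and keep it; the multi-key sort is stable, so names with no dates keep their original first row:

picked = (df.sort_values(['name', 'date_two', 'date_one'],
                         ascending=[True, False, False], na_position='last')
            .drop_duplicates('name')
            .sort_index())
print(picked)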