I have the following dataframe:
name date_one date_two
-----------------------------------------
sue
sue
john
john 13-06-2019
sally 23-04-2019
sally 23-04-2019 25-04-2019
bob 18-05-2019 14-06-2019
bob 18-05-2019 17-06-2019
The data contains duplicate name rows. I need to filter the data based on the following rules (in this order of priority):
1. For each name, keep the row with the newest date_two. If the name doesn't have any rows with a date_two value, go to step 2.
2. For each name, keep the row with the newest date_one. If the name doesn't have any rows with a date_one value, go to step 3.
3. These names have no rows with a date_one or date_two, so just keep the first row for that name.
The above dataframe would be filtered to:
name date_one date_two
-----------------------------------------
sue
john 13-06-2019
sally 23-04-2019 25-04-2019
bob 18-05-2019 17-06-2019
This doesn't need to be done in the most performant way. The dataframe is only a few thousand rows and only needs to be done once. If it needs to be done in multiple (slow) steps that's fine.
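For reference, the sample frame above can be reconstructed like this (the index values and exact NaN placement are assumptions based on the table):

```python
import pandas as pd
import numpy as np

# Hypothetical reconstruction of the sample data shown above
df = pd.DataFrame({
    'name': ['sue', 'sue', 'john', 'john', 'sally', 'sally', 'bob', 'bob'],
    'date_one': [np.nan, np.nan, np.nan, '13-06-2019', '23-04-2019',
                 '23-04-2019', '18-05-2019', '18-05-2019'],
    'date_two': [np.nan, np.nan, np.nan, np.nan, np.nan,
                 '25-04-2019', '14-06-2019', '17-06-2019'],
})
print(df)
```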
Use DataFrameGroupBy.idxmax to pick the row with the maximal date per group, filter out already matched names with Series.isin, and finally join the pieces together with concat. Rows whose date is missing are dropped before idxmax, since idxmax raises on all-NA groups in recent pandas (and Series.append was removed in pandas 2.0, hence pd.concat below):
df['date_one'] = pd.to_datetime(df['date_one'], dayfirst=True)
df['date_two'] = pd.to_datetime(df['date_two'], dayfirst=True)
#rule1
df1 = df.loc[df.dropna(subset=['date_two']).groupby('name')['date_two'].idxmax()]
#rule2
df2 = df.loc[df.dropna(subset=['date_one']).groupby('name')['date_one'].idxmax()]
df2 = df2[~df2['name'].isin(df1['name'])]
#rule3
df3 = df[~df['name'].isin(pd.concat([df1['name'], df2['name']]))].drop_duplicates('name')
df = pd.concat([df1, df2, df3]).sort_index()
print(df)
name date_one date_two
0 sue NaT NaT
3 john 2019-06-13 NaT
5 sally 2019-04-23 2019-04-25
7 bob 2019-05-18 2019-06-17
I have a small dataset that I need to reshape by displaying all related records (related by ID) on the same row. The order of the columns changes once I pivot the data in pandas, as shown in the output below. How can I maintain the column order of the original dataset?
original dataset
import pandas as pd

df = pd.DataFrame({
    'ID': [33, 33, 3, 21, 21, 3, 33],
    'FirstName': ['Joseph', 'Mary', 'Abram', 'Peter', 'John', 'Daniel', 'Cat'],
    'LastName': ['JosephL', 'MaryL', 'AbramL', 'PeterL', 'JohnL', 'DanielL', 'CatL'],
    'CAR': ['BMW', 'MB', 'Opel', 'Fiat', 'VW', '', 'Ford'],
    'Salary': [1250, 3254, 2599, 4566, 7855, 9999, 7500],
})
Outcome
Pivoting the data to stack related records as new columns instead of new rows.
g = df.groupby(['ID']).cumcount().add(1)
df = df.set_index(['ID', g]).unstack(fill_value=0).sort_index(axis=1, level=1)
df.columns = ["{}{}".format(a, b) for a, b in df.columns]
df = df.reset_index()
df
The order is not right, it should be FirstName, LastName, Salary, Car, FirstName, LastName, Salary, Car, FirstName, LastName, Salary, Car...etc
You can use pivot_table after defining a custom column index:
vals = ['FirstName', 'LastName', 'CAR', 'Salary']
idx = df.groupby('ID').cumcount().add(1).astype(str)
out = (df.pivot_table(index='ID', columns=idx, values=vals, aggfunc='first', fill_value=0)
.sort_index(level=1, axis=1).reset_index())
out.columns = out.columns.to_flat_index().map(''.join)
print(out)
# Output
ID CAR1 FirstName1 LastName1 Salary1 CAR2 FirstName2 LastName2 Salary2 CAR3 FirstName3 LastName3 Salary3
0 3 Opel Abram AbramL 2599.0 Daniel DanielL 9999.0 NaN NaN NaN NaN
1 21 Fiat Peter PeterL 4566.0 VW John JohnL 7855.0 NaN NaN NaN NaN
2 33 BMW Joseph JosephL 1250.0 MB Mary MaryL 3254.0 Ford Cat CatL 7500.0
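If the goal is the original FirstName, LastName, Salary, CAR order within each occurrence (pivot_table sorts level 0 of the column MultiIndex alphabetically, which is why CAR comes first above), one option is to reorder the columns with a custom sort key before flattening. A sketch along those lines, using the position of each name in vals as the secondary key:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [33, 33, 3, 21, 21, 3, 33],
    'FirstName': ['Joseph', 'Mary', 'Abram', 'Peter', 'John', 'Daniel', 'Cat'],
    'LastName': ['JosephL', 'MaryL', 'AbramL', 'PeterL', 'JohnL', 'DanielL', 'CatL'],
    'CAR': ['BMW', 'MB', 'Opel', 'Fiat', 'VW', '', 'Ford'],
    'Salary': [1250, 3254, 2599, 4566, 7855, 9999, 7500],
})

vals = ['FirstName', 'LastName', 'Salary', 'CAR']  # desired within-group order
idx = df.groupby('ID').cumcount().add(1).astype(str)
out = df.pivot_table(index='ID', columns=idx, values=vals, aggfunc='first')

# Reorder: primary key = occurrence number, secondary key = position in vals
ordered = sorted(out.columns, key=lambda c: (c[1], vals.index(c[0])))
out = out[ordered]
out.columns = [f'{a}{b}' for a, b in out.columns]
out = out.reset_index()
print(out.columns.tolist())
```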
I have a dataset with three columns:
Name Customer Value
Johnny Mike 1
Christopher Luke 0
Christopher Mike 0
Carl Marilyn 1
Carl Stephen 1
I need to create a new dataset where I have two columns: one with unique values from Name and Customer columns, and the Value column. Values in the Value column were assigned to Name (this means that multiple rows with same Name have the same value: Carl has value 1, Christopher has value 0, and Johnny has value 1), so Customer elements should have empty values in Value column in the new dataset.
My expected output is
All Value
Johnny 1
Christopher 0
Carl 1
Mike
Luke
Marilyn
Stephen
To get the unique values for the All column I take unique().tolist() from both Name and Customer:
name = file['Name'].unique().tolist()
customer = file['Customer'].unique().tolist()
all_with_dupl = name + customer
customers = list(dict.fromkeys(all_with_dupl))
df = pd.DataFrame(columns=['All', 'Value'])
df['All'] = customers
I do not know how to assign the values in the new dataset after creating the list with all names and customers with no duplicates.
Any help would be great.
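For reference, the sample frame from the question can be built as:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Johnny', 'Christopher', 'Christopher', 'Carl', 'Carl'],
    'Customer': ['Mike', 'Luke', 'Mike', 'Marilyn', 'Stephen'],
    'Value': [1, 0, 0, 1, 1],
})
```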
Split the columns, call .drop_duplicates on each piece to remove duplicates, and then concatenate them back together (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
pd.concat(
    [df.drop(columns='Customer')
       .drop_duplicates()
       .rename(columns={'Name': 'All'}),
     df[['Customer']].rename(columns={'Customer': 'All'})
       .drop_duplicates()],
    ignore_index=True
)
All Value
0 Johnny 1.0
1 Christopher 0.0
2 Carl 1.0
3 Mike NaN
4 Luke NaN
5 Marilyn NaN
6 Stephen NaN
Or to split the steps up:
names = df.drop(columns='Customer').drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
pd.concat([names, customers], ignore_index=True)
Another way, assuming Name and Customer are combined in a single space-separated 'Name Customer' column:
d = dict(zip(df['Name Customer'].str.split(r'\s').str[0], df['Value']))  # create dict
df['Name Customer'] = df['Name Customer'].str.split(r'\s')
df = df.explode('Name Customer').drop_duplicates(keep='first').assign(Value='')  # explode dataframe and drop duplicates
df['Value'] = df['Name Customer'].map(d).fillna('')  # map values back
I have a problem where I need to update a value if two people were at the same table.
import pandas as pd
data = {"p1":['Jen','Mark','Carrie'],
"p2":['John','Jason','Rob'],
"value":[10,20,40]}
df = pd.DataFrame(data,columns=["p1",'p2','value'])
meeting = {'person':['Jen','Mark','Carrie','John','Jason','Rob'],
'table':[1,2,3,1,2,3]}
meeting = pd.DataFrame(meeting,columns=['person','table'])
df is a relationship table and value is the field I need to update. So if two people were at the same table in the meeting dataframe, then update the corresponding df row.
For example: Jen and John were both at table 1, so I need to update the row in df that has Jen and John and set their value to value + 100, i.e. 110.
I thought about doing a self-join on meeting to get the format to match that of df, but I'm not sure if this is the easiest or fastest approach.
IIUC you could set the person as index in the meeting dataframe, and use its table values to replace the names in df. Then if both mappings have the same value (table), replace with df.value+100:
m = df[['p1','p2']].replace(meeting.set_index('person').table).eval('p1==p2')
df['value'] = df.value.mask(m, df.value+100)
print(df)
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
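A minimal step-by-step sketch of the same idea (map each name to its table number, then compare the two mapped columns), using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({"p1": ['Jen', 'Mark', 'Carrie'],
                   "p2": ['John', 'Jason', 'Rob'],
                   "value": [10, 20, 40]})
meeting = pd.DataFrame({'person': ['Jen', 'Mark', 'Carrie', 'John', 'Jason', 'Rob'],
                        'table': [1, 2, 3, 1, 2, 3]})

# Map each name to its table number
tables = meeting.set_index('person')['table'].to_dict()
mapped = df[['p1', 'p2']].replace(tables)

# Rows where both people sat at the same table get value + 100
same_table = mapped['p1'] == mapped['p2']
df.loc[same_table, 'value'] += 100
print(df)
```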
This could be an approach, using df.to_records():
groups = meeting.groupby('table').agg(set)['person'].to_list()
df['value'] = [row[-1] + 100 if set(list(row)[1:3]) in groups else row[-1]
               for row in df.to_records()]
Output:
df
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
I am relatively new to Python. Suppose I have the following two dataframes, df1 and df2 respectively:
df1
Id Name Job
1  Jim  Tester
2  Bob  Developer
3  Sam  Support
df2
Name Salary Location
Jim  100    Japan
Bob  200    US
Si   300    UK
Sue  400    France
I want to compare the 'Name' column in df2 to df1 such that if a person's name in df2 does not exist in df1, that row of df2 is output to another dataframe. So for the example above the output would be:
Name Salary Location
Si 300 UK
Sue 400 France
Si and Sue are output because they do not exist in the 'Name' column of df1.
You can use Boolean indexing:
res = df2[~df2['Name'].isin(df1['Name'].unique())]
We use hashing via pd.Series.unique as an optimization in case you have duplicate names in df1.
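A runnable version with the sample frames (the Id and Job values are taken from the tables above):

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3],
                    'Name': ['Jim', 'Bob', 'Sam'],
                    'Job': ['Tester', 'Developer', 'Support']})
df2 = pd.DataFrame({'Name': ['Jim', 'Bob', 'Si', 'Sue'],
                    'Salary': [100, 200, 300, 400],
                    'Location': ['Japan', 'US', 'UK', 'France']})

# Keep rows of df2 whose Name never appears in df1
res = df2[~df2['Name'].isin(df1['Name'].unique())]
print(res)
```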
I have a large list of names and I am trying to cull down the duplicates. I am grouping them by name and consolidating the info if need be.
When two people don't have the same name it's no problem: we can just ffill and bfill. However, if two people have the same name, we need to do some extra checks.
This is an example of a group:
name code id country yob
1137 Bobby Joe USA19921111 NaN NaN NaN
2367 Bobby Joe NaN 1223133121 USA 1992
4398 Bobby Joe USA19981111 NaN NaN NaN
The code contains the person's country and birth date. Looking at it, we can see that the first and second rows are the same person. So we need to fill the info from the second row into the first row:
name code id country yob
1137 Bobby Joe USA19921111 1223133121 USA 1992
4398 Bobby Joe USA19981111 NaN NaN NaN
Here is what I have:
# Create a dictionary mapping row index -> code for all of the rows
# that contain codes
code_rows = dict(zip(list(group['code'].dropna().index),
                     group['code'].dropna().unique()))
no_code_rows = group.loc[pd.isnull(group['code']), :]
if no_code_rows.empty or len(code_rows) == group.shape[0]:
    # No info to consolidate
    return group
for group_idx, code in code_rows.items():
    for row_idx, row in no_code_rows.iterrows():
        country_yob = row['country'] + str(int(row['yob']))
        if country_yob in code:
            group.loc[group_idx, 'id'] = row['id']
            group.loc[group_idx, 'country'] = row['country']
            group.loc[group_idx, 'yob'] = row['yob']
            group.drop(row_idx, inplace=True)
            # Drop from the temp table so we don't have to iterate
            # over an extra row
            no_code_rows.drop(row_idx, inplace=True)
            break
return group
This works but I have a feeling I am missing something? I feel like I shouldn't have to use two loops for this and that maybe there is a pandas function?
EDIT
We don't know the order or how many rows we will have in each group
i.e.
name code id country yob
1137 Bobby Joe USA19921111 NaN NaN NaN
2367 Bobby Joe USA19981111 NaN NaN NaN
4398 Bobby Joe NaN 1223133121 USA 1992
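For testing, the original group can be reconstructed as below (the dtypes are assumptions; yob is kept as a string here so the country + yob concatenation in the answer works without extra casting):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Bobby Joe', 'Bobby Joe', 'Bobby Joe'],
    'code': ['USA19921111', np.nan, 'USA19981111'],
    'id': [np.nan, '1223133121', np.nan],
    'country': [np.nan, 'USA', np.nan],
    'yob': [np.nan, '1992', np.nan],
}, index=[1137, 2367, 4398])
```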
I think you need:
m = df['code'].isnull()
df1 = df[~m]
df2 = df[m]
df = df1.merge(df2, on='name', suffixes=('', '_'))
df['a_'] = df['country_'] + df['yob_'].astype(str)
m = df.apply(lambda x: x['a_'] in x['code'], axis=1)
df.loc[m, ['id', 'country', 'yob']] = df.loc[m, ['id_', 'country_', 'yob_']].rename(columns=lambda x: x.strip('_'))
df = df.loc[:, ~df.columns.str.endswith('_')]
print(df)
name code id country yob
0 Bobby Joe USA19921111 1223133121 USA 1992
1 Bobby Joe USA19981111 NaN NaN NaN