Pandas groupby: get any non-NaN value - python

I'm trying to perform a groupby on a table where, for a given groupby key, all values are either correct or NaN. E.g.:
id country name
0 1 France None
1 1 France Pierre
2 2 None Marge
3 1 None Pierre
4 3 USA Jim
5 3 None Jim
6 2 UK None
7 4 Spain Alvaro
8 2 None Marge
9 3 None Jim
10 4 Spain None
11 3 None Jim
I just want to get the values for each of the 4 people, which should never clash, eg:
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro
I've tried:
groupby().first()
groupby.nth(0,dropna='any'/'all')
and even
groupby().apply(lambda x: x.loc[x.first_valid_index()])
All to no avail. What am I missing?
EDIT: to help you make the example dataframe for testing:
df = pd.DataFrame({'id':[1,1,2,1,3,3,2,4,2,3,4,3],'country':['France','France',None,None,'USA',None,'UK','Spain',None,None,'Spain',None],'name':[None,'Pierre','Marge','Pierre','Jim','Jim',None,'Alvaro','Marge','Jim',None,'Jim']})

Pandas groupby.first returns the first non-null value but does not treat None as missing, so convert the Nones to NaN first:
import numpy as np
df.fillna(np.nan).groupby('id').first()
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro

Another possibility is specifying dropna when the values are None:
df.groupby('id').first(dropna=True)
country name
id
1 France Pierre
2 UK Marge
3 USA Jim
4 Spain Alvaro
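For reference, a self-contained sketch of the fillna approach, combining the example frame from the question's EDIT with the answer above (it assumes only pandas and numpy):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 1, 3, 3, 2, 4, 2, 3, 4, 3],
    'country': ['France', 'France', None, None, 'USA', None,
                'UK', 'Spain', None, None, 'Spain', None],
    'name': [None, 'Pierre', 'Marge', 'Pierre', 'Jim', 'Jim',
             None, 'Alvaro', 'Marge', 'Jim', None, 'Jim'],
})
# convert the Nones to NaN so that groupby.first treats them as missing
print(df.fillna(np.nan).groupby('id').first())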

Related

How to compare pandas dataframes using given keys

I have two datasets. I want to compare them using id and name, and write the mismatched rows to a different dataframe, with the mismatched values replaced by "Mismatched" and the rest of each row kept as it is.
df1
Index id name dept addr
0 1 Jeff IT Delhi
1 2 Tom Support Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT Pune
4 5 Lee Dev Delhi
df2
Index id name dept addr
0 1 Jeff IT Delhi
1 2 Tom QA Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT Hyderabad
And I need a result like:
Result
Index id name dept addr
0 2 Tom Mismatched Bangalore
1 4 Kaif IT Mismatched
2 5 Lee Dev Delhi
One way to do what you intend (if 'id' and 'name' already match, as in the case you show) is to do an inner merge on the 'name' column and then change the 'dept' value to 'mismatch' wherever the 'dept_x' and 'dept_y' values of the merged dataframe don't match.
A = pd.merge(df1,df2, on='name', how='inner')
# It creates new columns
print(A)
id_x name dept_x addr_x id_y dept_y addr_y
0 1 Jeff IT Delhi 1 IT Delhi
1 2 Tom Support Bangalore 2 QA Bangalore
2 3 Peter Admin Pune 3 Admin Pune
3 4 Kaif IT Pune 4 IT Hyderabad
B = A.copy()
B['dept_x'] = A.apply(lambda x: 'mismatch' if x.dept_x != x.dept_y else x.dept_x, axis=1)
print(B)
id_x name dept_x addr_x id_y dept_y addr_y
0 1 Jeff IT Delhi 1 IT Delhi
1 2 Tom mismatch Bangalore 2 QA Bangalore
2 3 Peter Admin Pune 3 Admin Pune
3 4 Kaif IT Pune 4 IT Hyderabad
Then you can do the same for the address column, filter the rows with mismatch if you intend to only keep them, and rename or delete the columns that you need/don't need accordingly.
If you have many columns, you can use a function inside the .apply() to make it more general:
# the columns that you intend to check the mismatch for
cols = ['dept','addr']
# or if you want to do it on all columns except the first two because there's too many
cols = [a for a in df1.columns if a not in ['name','id']]
# define a function that compares for all columns
def is_mismatch(x):
    L = ['mismatch' if x[cols[i] + '_x'] != x[cols[i] + '_y'] else x[cols[i] + '_x'] for i in range(len(cols))]
    return pd.Series(L)
C = A.copy()
C[cols] = C.apply(is_mismatch, axis=1) # don't forget that axis=1 here !
print(C)
id_x name dept_x addr_x id_y dept_y addr_y dept \
0 1 Jeff IT Delhi 1 IT Delhi IT
1 2 Tom Support Bangalore 2 QA Bangalore mismatch
2 3 Peter Admin Pune 3 Admin Pune Admin
3 4 Kaif IT Pune 4 IT Hyderabad IT
addr
0 Delhi
1 Bangalore
2 Pune
3 mismatch
# if you want to clean the columns
C = C[['id_x','name']+cols]
print(C)
id_x name dept addr
0 1 Jeff IT Delhi
1 2 Tom mismatch Bangalore
2 3 Peter Admin Pune
3 4 Kaif IT mismatch
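As a side note, the same per-column comparison can be vectorized with numpy.where instead of a row-wise apply; a minimal sketch, assuming the merged frame A and the cols list from above:
import numpy as np

C = A.copy()
for col in cols:
    # mark 'mismatch' where the df1 and df2 values differ, otherwise keep the df1 value
    C[col] = np.where(A[col + '_x'] != A[col + '_y'], 'mismatch', A[col + '_x'])
C = C[['id_x', 'name'] + cols]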

Assign values (1 to N) for similar rows in a dataframe Pandas [duplicate]

This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
(4 answers)
Closed last year.
I have a dataframe df:
Name    Place    Price
Bob     NY       15
Jack    London   27
John    Paris    5
Bill    Sydney   3
Bob     NY       39
Jack    London   9
Bob     NY       2
Dave    NY       7
I need to assign an incremental value (from 1 to N) for each row which has the same name and place (price can be different).
df_out:
Name    Place    Price   Value
Bob     NY       15      1
Jack    London   27      1
John    Paris    5       1
Bill    Sydney   3       1
Bob     NY       39      2
Jack    London   9       2
Bob     NY       2       3
Dave    NY       7       1
I could do this by sorting the dataframe (on Name and Place) and then iteratively checking if they match between two consecutive rows. Is there a smarter/faster pandas way to do this?
You can use a grouped (on Name, Place) cumulative count and add 1 as it starts from 0:
df['Value'] = df.groupby(['Name','Place']).cumcount().add(1)
prints:
Name Place Price Value
0 Bob NY 15 1
1 Jack London 27 1
2 John Paris 5 1
3 Bill Sydney 3 1
4 Bob NY 39 2
5 Jack London 9 2
6 Bob NY 2 3
7 Dave NY 7 1
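For completeness, a self-contained sketch that reconstructs the example frame from the question and applies the cumulative count:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bob', 'Jack', 'John', 'Bill', 'Bob', 'Jack', 'Bob', 'Dave'],
    'Place': ['NY', 'London', 'Paris', 'Sydney', 'NY', 'London', 'NY', 'NY'],
    'Price': [15, 27, 5, 3, 39, 9, 2, 7],
})
# number the occurrences of each (Name, Place) pair, starting from 1
df['Value'] = df.groupby(['Name', 'Place']).cumcount().add(1)
print(df)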

Pandas: how to compare two columns in different sheets and return the matched value

I have two dataframes with multiple columns.
I would like to compare df1['id'] and df2['id'] and return a new df with another column that has the matched value.
example:
df1
id Name
1 1 Paul
2 2 Jean
3 3 Alicia
4 4 Jennifer
df2
id Name
1 1 Paul
2 6 Jean
3 3 Alicia
4 7 Jennifer
output
id Name correct_id
1 1 Paul 1
2 2 Jean N/A
3 3 Alicia 3
4 4 Jennifer N/A
Note: the lengths of the two columns I want to match are not the same.
Try:
df1["correct_id"] = (df1["id"].isin(df2["id"]) * df1["id"]).replace(0, "N/A")
print(df1)
Prints:
id Name correct_id
0 1 Paul 1
1 2 Jean N/A
2 3 Alicia 3
3 4 Jennifer N/A
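If real missing values (NaN) are preferable to the string "N/A", a variant of the same idea using Series.where (a sketch, assuming df1 and df2 as in the question):
# keep the id where it also appears in df2, NaN otherwise
df1["correct_id"] = df1["id"].where(df1["id"].isin(df2["id"]))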

How do you merge dataframes in pandas with different shapes?

I am trying to merge two dataframes in pandas with large sets of data, however it is causing me some problems. I will try to illustrate with a smaller example.
df1 has a list of equipment and several columns relating to the equipment:
Item ID Equipment Owner Status Location
1 Jackhammer James Active London
2 Cement Mixer Tim Active New York
3 Drill Sarah Active Paris
4 Ladder Luke Inactive Hong Kong
5 Winch Kojo Inactive Sydney
6 Circular Saw Alex Active Moscow
df2 has a list of instances where equipment has been used. It has some columns in common with df1; however, some of the fields are NaN values, and instances of equipment not in df1 have also been recorded:
Item ID Equipment Owner Date Location
1 Jackhammer James 08/09/2020 London
1 Jackhammer James 08/10/2020 London
2 Cement Mixer NaN 29/02/2020 New York
3 Drill Sarah 11/02/2020 NaN
3 Drill Sarah 30/11/2020 NaN
3 Drill Sarah 21/12/2020 NaN
6 Circular Saw Alex 19/06/2020 Moscow
7 Hammer Ken 21/12/2020 Toronto
8 Sander Ezra 19/06/2020 Frankfurt
The resulting dataframe I was hoping to end up with was this:
Item ID Equipment Owner Status Date Location
1 Jackhammer James Active 08/09/2020 London
1 Jackhammer James Active 08/10/2020 London
2 Cement Mixer Tim Active 29/02/2020 New York
3 Drill Sarah Active 11/02/2020 Paris
3 Drill Sarah Active 30/11/2020 Paris
3 Drill Sarah Active 21/12/2020 Paris
4 Ladder Luke Inactive NaN Hong Kong
5 Winch Kojo Inactive NaN Sydney
6 Circular Saw Alex Active 19/06/2020 Moscow
7 Hammer Ken NaN 21/12/2020 Toronto
8 Sander Ezra NaN 19/06/2020 Frankfurt
Instead, with the following code I'm getting duplicate rows, I think because of the NaN values:
data = pd.merge(df1, df2, how='outer', on=['Item ID'])
Item ID Equipment_x Equipment_y Owner_x Owner_y Status Date Location_x Location_y
1 Jackhammer NaN James James Active 08/09/2020 London London
1 Jackhammer NaN James James Active 08/10/2020 London London
2 Cement Mixer NaN Tim NaN Active 29/02/2020 New York New York
3 Drill NaN Sarah Sarah Active 11/02/2020 Paris NaN
3 Drill NaN Sarah Sarah Active 30/11/2020 Paris NaN
3 Drill NaN Sarah Sarah Active 21/12/2020 Paris NaN
4 Ladder NaN Luke NaN Inactive NaN Hong Kong Hong Kong
5 Winch NaN Kojo NaN Inactive NaN Sydney Sydney
6 Circular Saw NaN Alex NaN Active 19/06/2020 Moscow Moscow
7 NaN Hammer NaN Ken NaN 21/12/2020 NaN Toronto
8 NaN Sander NaN Ezra NaN 19/06/2020 NaN Frankfurt
Ideally I could just drop the _y columns, but the data in the bottom rows means I would be losing important information. Instead, the only thing I can think of is merging the columns and forcing pandas to compare the values in each column and always favour the non-NaN value. I'm not sure if this is possible or not, though?
merging the columns and forcing pandas to compare the values in each column and always favour the non-NaN value.
Is this what you mean?
In [45]: data = pd.merge(df1, df2, how='outer', on=['Item ID', 'Equipment'])
In [46]: data['Location'] = data['Location_y'].fillna(data['Location_x'])
In [47]: data['Owner'] = data['Owner_y'].fillna(data['Owner_x'])
In [48]: data = data.drop(['Location_x', 'Location_y', 'Owner_x', 'Owner_y'], axis=1)
In [49]: data
Out[49]:
Item ID Equipment Status Date Location Owner
0 1 Jackhammer Active 08/09/2020 London James
1 1 Jackhammer Active 08/10/2020 London James
2 2 Cement Mixer Active 29/02/2020 New York Tim
3 3 Drill Active 11/02/2020 Paris Sarah
4 3 Drill Active 30/11/2020 Paris Sarah
5 3 Drill Active 21/12/2020 Paris Sarah
6 4 Ladder Inactive NaN Hong Kong Luke
7 5 Winch Inactive NaN Sydney Kojo
8 6 Circular Saw Active 19/06/2020 Moscow Alex
9 7 Hammer NaN 21/12/2020 Toronto Ken
10 8 Sander NaN 19/06/2020 Frankfurt Ezra
(To my knowledge) you cannot really merge on a null column. However, you can use fillna to take another value in place of a NaN. Not a very elegant solution, but it seems to solve your example at least.
Also see pandas combine two columns with null values
Generically you can do that as follows:
# merge the two dataframes using a suffix that ideally does
# not appear in your data
suffix_string = '_DF2'
data = pd.merge(df1, df2, how='outer', on=['Item_ID'], suffixes=('', suffix_string))
# now remove the duplicate columns by merging their content:
# use the value of column + suffix_string if the column is empty
columns_to_remove = []
for col in df1.columns:
    second_col = f'{col}{suffix_string}'
    if second_col in data.columns:
        data[col] = data[second_col].where(data[col].isna(), data[col])
        columns_to_remove.append(second_col)
if columns_to_remove:
    data.drop(columns=columns_to_remove, inplace=True)
data
The result is:
Item_ID Equipment Owner Status Location Date
0 1 Jackhammer James Active London 08/09/2020
1 1 Jackhammer James Active London 08/10/2020
2 2 Cement_Mixer Tim Active New_York 29/02/2020
3 3 Drill Sarah Active Paris 11/02/2020
4 3 Drill Sarah Active Paris 30/11/2020
5 3 Drill Sarah Active Paris 21/12/2020
6 4 Ladder Luke Inactive Hong_Kong NaN
7 5 Winch Kojo Inactive Sydney NaN
8 6 Circular_Saw Alex Active Moscow 19/06/2020
9 7 Hammer Ken NaN Toronto 21/12/2020
10 8 Sander Ezra NaN Frankfurt 19/06/2020
On the following test data:
import io
import pandas as pd

df1 = pd.read_csv(io.StringIO("""Item_ID Equipment Owner Status Location
1 Jackhammer James Active London
2 Cement_Mixer Tim Active New_York
3 Drill Sarah Active Paris
4 Ladder Luke Inactive Hong_Kong
5 Winch Kojo Inactive Sydney
6 Circular_Saw Alex Active Moscow"""), sep=r'\s+')
df2 = pd.read_csv(io.StringIO("""Item_ID Equipment Owner Date Location
1 Jackhammer James 08/09/2020 London
1 Jackhammer James 08/10/2020 London
2 Cement_Mixer NaN 29/02/2020 New_York
3 Drill Sarah 11/02/2020 NaN
3 Drill Sarah 30/11/2020 NaN
3 Drill Sarah 21/12/2020 NaN
6 Circular_Saw Alex 19/06/2020 Moscow
7 Hammer Ken 21/12/2020 Toronto
8 Sander Ezra 19/06/2020 Frankfurt"""), sep=r'\s+')
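Since the rule here is "prefer the df1 value, fall back to the df2 value where df1 is missing", each duplicated column pair can equivalently be collapsed with fillna; a minimal sketch, assuming the frame data right after the pd.merge call above:
for col in df1.columns:
    dup = f'{col}{suffix_string}'
    if dup in data.columns:
        # take the df1 value, falling back to the df2 value where df1 is NaN
        data[col] = data[col].fillna(data[dup])
        data = data.drop(columns=dup)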

Creating new column based on column values in row and column values in other rows in df?

I have df below as:
id | name | status | country | ref_id
3 Bob False Germany NaN
5 422 True USA 3
7 Nick False India NaN
6 Chris True Australia 7
8 324 True Africa 28
28 Tim False Canada 53
I want to add a new column for each row: if the status for that row is True, and its ref_id exists in the id column of another row whose status is False, give me the value of the name from that other row.
So expected output below would be:
id | name | status | country | ref_id | new
3 Bob False Germany NaN NaN
5 422 True USA 3 Bob
7 Nick False India NaN NaN
6 Chris True Australia 7 Nick
8 324 True Africa 28 Tim
28 Tim False Canada 53 NaN
I have code below, used for other purposes, that just filters for rows that have a status of True and a ref_id value that exists in the id column:
df.loc[df["status"] & df["ref_id"].astype(float).isin(df.loc[~df["status"], "id"])]
But I am also trying to compute the new column described above, holding the value of the name if there is one.
Thanks!
Let us try
df['new']=df.loc[df.status,'ref_id'].map(df.set_index('id')['name'])
df
id name status country ref_id new
0 3 Bob False Germany NaN NaN
1 5 422 True USA 3.0 Bob
2 7 Nick False India NaN NaN
3 6 Chris True Australia 7.0 Nick
4 8 324 True Africa 28.0 Tim
5 28 Tim False Canada 53.0 NaN
This is essentially a merge:
merged = (df.loc[df['status'], ['ref_id']]
          .merge(df.loc[~df['status'], ['id', 'name']], left_on='ref_id', right_on='id')
)
df['new'] = (df['ref_id'].map(merged.set_index('id')['name'])
             .where(df['status'])
)
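For completeness, a self-contained sketch of the first (map-based) approach, with the example frame reconstructed from the question:
import pandas as pd

df = pd.DataFrame({
    'id': [3, 5, 7, 6, 8, 28],
    'name': ['Bob', '422', 'Nick', 'Chris', '324', 'Tim'],
    'status': [False, True, False, True, True, False],
    'country': ['Germany', 'USA', 'India', 'Australia', 'Africa', 'Canada'],
    'ref_id': [None, 3, None, 7, 28, 53],
})
# for rows with status True, look up ref_id in an id -> name mapping;
# unmatched ref_ids (like 53) and False rows come out as NaN
df['new'] = df.loc[df['status'], 'ref_id'].map(df.set_index('id')['name'])
print(df)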
