I am facing issues concatenating two DataFrames of different lengths. Below is the issue:
df1 =
emp_id emp_name counts
1 sam 0
2 joe 0
3 john 0
df2 =
emp_id emp_name counts
1 sam 0
2 joe 0
2 joe 1
3 john 0
My Expected output is:
Please note that I don't want to merge the two DataFrames into one. I want to concatenate them side by side and highlight the differences in such a way that, if there is a duplicate row in one df (for example df2), the corresponding row of df1 shows up as NaN/blank/None, or any null-like value.
Expected_output_df =
df1 df2
emp_id emp_name counts emp_id emp_name counts
1 sam 0 1 sam 0
2 joe 0 2 joe 0
NaN NaN NaN 2 joe 1
3 john 0 3 john 0
whereas I am getting the output below:
actual_output_df = pd.concat([df1, df2], axis='columns', keys=['df1','df2'])
The above code gives me the DataFrame shown below. How can I get the DataFrame shown in the expected output?
actual_output_df =
df1 df2
emp_id emp_name counts emp_id emp_name counts
1 sam 0 1 sam 0
2 joe 0 2 joe 0
3 john 0 2 joe 1
NaN NaN NaN 3 john 0
I tried pd.concat with different parameters but am not getting the expected result.
The main issue I have with concat is that I am not able to shift the duplicate rows down by one row.
Can anyone please help me with this? Thanks in advance.
This does not give the exact output you asked for, but it may solve your problem anyway:
df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)
Output:
emp_id emp_name counts _merge
0 1 sam 0 both
1 2 joe 0 both
2 3 john 0 both
3 2 joe 1 right_only
You don't have rows with NaNs as you wanted, but this way you can check whether a row is in the left df, the right df, or both by looking at the _merge column. You can also give that column a custom name using indicator='name'.
Update
To get the exact output you want you can do the following:
import numpy as np

output_df = df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)
output_df[['emp_id2', 'emp_name2', 'counts2']] = output_df[['emp_id', 'emp_name', 'counts']]
output_df.loc[output_df._merge == 'right_only', ['emp_id', 'emp_name', 'counts']] = np.nan
output_df.loc[output_df._merge == 'left_only', ['emp_id2', 'emp_name2', 'counts2']] = np.nan
output_df = output_df.drop('_merge', axis=1)
output_df.columns = pd.MultiIndex.from_tuples([('df1', 'emp_id'), ('df1', 'emp_name'), ('df1', 'counts'),
('df2', 'emp_id'), ('df2', 'emp_name'), ('df2', 'counts')])
Output:
df1 df2
emp_id emp_name counts emp_id emp_name counts
0 1.0 sam 0.0 1.0 sam 0.0
1 2.0 joe 0.0 2.0 joe 0.0
2 3.0 john 0.0 3.0 john 0.0
3 NaN NaN NaN 2.0 joe 1.0
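Putting the steps above together, here is a self-contained version you can run directly (the frames are rebuilt from the question; the float cast is an extra safety step so the integer columns can hold NaN):

```python
import numpy as np
import pandas as pd

# the two frames from the question
df1 = pd.DataFrame({'emp_id': [1, 2, 3],
                    'emp_name': ['sam', 'joe', 'john'],
                    'counts': [0, 0, 0]})
df2 = pd.DataFrame({'emp_id': [1, 2, 2, 3],
                    'emp_name': ['sam', 'joe', 'joe', 'john'],
                    'counts': [0, 0, 1, 0]})

out = df1.merge(df2, on=['emp_id', 'emp_name', 'counts'],
                how='outer', indicator=True)
# duplicate the key columns so each side can be blanked out independently
out[['emp_id2', 'emp_name2', 'counts2']] = out[['emp_id', 'emp_name', 'counts']]
# cast the integer columns to float up front so assigning NaN is safe
num = ['emp_id', 'counts', 'emp_id2', 'counts2']
out[num] = out[num].astype('float')
out.loc[out._merge == 'right_only', ['emp_id', 'emp_name', 'counts']] = np.nan
out.loc[out._merge == 'left_only', ['emp_id2', 'emp_name2', 'counts2']] = np.nan
out = out.drop('_merge', axis=1)
out.columns = pd.MultiIndex.from_product([['df1', 'df2'],
                                          ['emp_id', 'emp_name', 'counts']])
```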
Note: for simplicity's sake, I'm using a toy example, because copy/pasting DataFrames is difficult on Stack Overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values on one column to replace all zero values of another column.
df1:
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0 <- Y and Z have zero values for Nonprofit and Education
Z 0 0 0
Y 0 1 0
df2:
Name Nonprofit Education
Y 1 1 <- this df has the correct values.
Z 1 1
pd.merge(df1, df2, on='Name', how='outer')
Name Nonprofit_X Business Education_X Nonprofit_Y Education_Y
Y 1 1 1 1 1
Y 1 1 1 1 1
X 1 1 0 nan nan
Z 1 1 1 1 1
In a previous post, I tried combine_first and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2.
Furthermore, I want all rows with the same Names to be changed according to df2.
Name Nonprofit Business Education
Y 1 1 1
Y 1 1 1
X 1 1 0
Z 1 0 1
(To clarify: the value in the 'Business' column where Name = Z should be 0.)
My existing solution does the following:
I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.loc[pubunis.index, ['Education', 'Public']] = 1
searchnamesre(sdf, 'ORGS', regex)
Attention: in the latest version of pandas, neither of the answers above works anymore.
KSD's answer will raise an error:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,0,0]],columns=["Name","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1]],columns=["Name","Nonprofit", "Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
Well, it will work safely only if the values in the 'Name' column are unique and sorted the same way in both data frames.
Here is my answer:
Way 1:
df1 = df1.merge(df2, on='Name', how="left")
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(["Education_x", "Nonprofit_x"], inplace=True, axis=1)
df1.rename(columns={'Education_y': 'Education', 'Nonprofit_y': 'Nonprofit'}, inplace=True)
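For reference, here is a self-contained run of this merge-and-fillna approach against the example frames from earlier in the thread. Note that Business exists only in df1, so only the shared Nonprofit and Education columns pick up the _x/_y suffixes:

```python
import pandas as pd

df1 = pd.DataFrame([['X', 1, 1, 0], ['Y', 0, 1, 0], ['Z', 0, 0, 0], ['Y', 0, 0, 0]],
                   columns=['Name', 'Nonprofit', 'Business', 'Education'])
df2 = pd.DataFrame([['Y', 1, 1], ['Z', 1, 1]],
                   columns=['Name', 'Nonprofit', 'Education'])

# left-merge on Name; shared columns get _x (from df1) / _y (from df2)
df1 = df1.merge(df2, on='Name', how='left')
# keep df2's value where present, fall back to df1's original value
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1 = df1.drop(['Nonprofit_x', 'Education_x'], axis=1)
df1 = df1.rename(columns={'Nonprofit_y': 'Nonprofit', 'Education_y': 'Education'})
```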
Way 2:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
More on update: the columns the two data frames are indexed on don't need to have the same name before calling 'update' — you could use 'Name1' and 'Name2'. It also works if df2 contains extra rows that aren't in df1; those rows simply don't update anything. In other words, df2 doesn't need to be a superset of df1.
Example:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1],
['U',1,3]],columns=["Name2","Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0
Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
This is the correct one:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
The above will work only when all rows in df1 exist in df; in other words, df should be a superset of df1.
In case you have some rows in df1 that don't match df (i.e. df is not a superset of df1), you should do the following instead:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values
df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()
I have a df like this:
MemberID FirstName LastName ClaimID Amount
0 1 John Doe 001A 100
1 1 John Doe 001B 150
2 2 Andy Right 004C 170
3 2 Andy Right 005A 200
4 2 Andy Right 002B 100
I need to transpose the values in the 'ClaimID' column for each member into one row, so each member will have each claim as a value in a separate column named Claim(1-MaxNumOfClaims); the same logic goes for the Amount columns. The output needs to look like this:
MemberID FirstName LastName Claim1 Claim2 Claim3 Amount1 Amount2 Amount3
0 1 John Doe 001A 001B NaN 100 150 NaN
1 2 Andy Right 004C 005A 002B 170 200 100
I am new to Pandas and got myself stuck on this, any help would be greatly appreciated.
The operation you need is not a transpose — that swaps the row and column indexes.
This approach uses groupby() on the identifying columns and, for each remaining column, builds a dict of the values that should become columns 1..n.
Part two expands out these dicts: applying pd.Series expands a series of dicts into columns.
import io
import pandas as pd

df = pd.read_csv(io.StringIO(""" MemberID FirstName LastName ClaimID Amount
0 1 John Doe 001A 100
1 1 John Doe 001B 150
2 2 Andy Right 004C 170
3 2 Andy Right 005A 200
4 2 Andy Right 002B 100 """), sep=r"\s+")
cols = ["ClaimID","Amount"]
# aggregate to columns that define rows, generate a dict for other columns
df = df.groupby(
["MemberID","FirstName","LastName"], as_index=False).agg(
{c:lambda s: {f"{s.name}{i+1}":v for i,v in enumerate(s)} for c in cols})
# expand out the dicts and drop the now redundant columns
df = df.join(df["ClaimID"].apply(pd.Series)).join(df["Amount"].apply(pd.Series)).drop(columns=cols)
   MemberID FirstName LastName ClaimID1 ClaimID2 ClaimID3  Amount1  Amount2  Amount3
0         1      John      Doe     001A     001B      nan      100      150      nan
1         2      Andy    Right     004C     005A     002B      170      200      100
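Another way to get the same shape is to number each member's claims with cumcount and spread them out with pivot. This is a sketch, not part of the answer above, and the list-valued index argument to pivot needs a reasonably recent pandas (1.1+):

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""MemberID FirstName LastName ClaimID Amount
1 John Doe 001A 100
1 John Doe 001B 150
2 Andy Right 004C 170
2 Andy Right 005A 200
2 Andy Right 002B 100"""), sep=r"\s+")

# number the claims within each member: 1, 2, 3, ...
df['n'] = df.groupby('MemberID').cumcount() + 1
wide = df.pivot(index=['MemberID', 'FirstName', 'LastName'],
                columns='n', values=['ClaimID', 'Amount'])
# flatten the ('ClaimID', 1)-style MultiIndex into 'ClaimID1'
wide.columns = [f'{name}{i}' for name, i in wide.columns]
wide = wide.reset_index()
```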
I have two DataFrames:
df:
id Name Number Stat
1 co 4
2 ma 98
3 sa 0
df1:
id Name Number Stat
1 co 4
2 ma 98 5%
I want to merge both DataFrames into one (dfnew) as follows:
id Name Number Stat
1 co 4
2 ma 98 5%
3 sa 0
I used
dfnew = pd.concat([df, df1])
dfnew = dfnew.drop_duplicates(keep='last')
but I am not getting the result I want: the DataFrames are joined, but the duplicates are not deleted. I need help, please.
It seems you need to check only the first 3 columns for duplicates:
dfnew = pd.concat([df, df1]).drop_duplicates(subset=['id','Name','Number'], keep='last')
print (dfnew)
id Name Number Stat
2 3 sa 0 NaN
0 1 co 4 NaN
1 2 ma 98 5%
Try the pd.merge function with an inner/outer join, based on your requirement.
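Following that hint, here is a minimal sketch (the frames and the choice of merge keys are reconstructed from the question, so treat them as assumptions):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'Name': ['co', 'ma', 'sa'],
                   'Number': [4, 98, 0], 'Stat': [None, None, None]})
df1 = pd.DataFrame({'id': [1, 2], 'Name': ['co', 'ma'],
                    'Number': [4, 98], 'Stat': [None, '5%']})

# take Stat from df1 and keep every row of df via an outer join on the shared keys
dfnew = df.drop(columns='Stat').merge(df1, on=['id', 'Name', 'Number'], how='outer')
```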
I have a dataframe:
df = pd.DataFrame({'name':['John','Fred','John','George','Fred']})
How can I transform this to generate a new column giving me group membership by value? Such that:
new_df = pd.DataFrame({'name':['John','Fred','John','George','Fred'], 'group':[1,2,1,3,2]})
Use factorize:
df['group'] = pd.factorize(df['name'])[0] + 1
print (df)
name group
0 John 1
1 Fred 2
2 John 1
3 George 3
4 Fred 2
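A groupby-based alternative that produces the same numbering is ngroup, shown here as a sketch alongside the factorize answer:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Fred', 'John', 'George', 'Fred']})

# sort=False numbers the groups in order of first appearance, matching factorize
df['group'] = df.groupby('name', sort=False).ngroup() + 1
```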