Merge a DataFrame into another DataFrame without overwriting existing data - python

Need to merge a DataFrame with another DataFrame without affecting existing data
df1:
  Name Subject  mark
     a      Ta    52
     b      En   NaN
     c      Ma   NaN
     d      Ss    60
df2:
  Name  mark
     b    57
     c    58
Expected Output:
  Name Subject  mark
     a      Ta    52
     b      En    57
     c      Ma    58
     d      Ss    60

Use combine_first after setting Name as the index:
df1.set_index('Name').combine_first(df2.set_index('Name')).reset_index()
output:
Name Subject mark
0 a Ta 52.0
1 b En 57.0
2 c Ma 58.0
3 d Ss 60.0

Try using merge and combine_first:
>>> df = df1.merge(df2, on='Name', how='outer')
>>> df['mark'] = df.pop('mark_x').combine_first(df.pop('mark_y'))
>>> df
Name Subject mark
0 a Ta 52.0
1 b En 57.0
2 c Ma 58.0
3 d Ss 60.0
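In both answers the marks come back as floats, because the unmatched rows pass through NaN during the merge. If integer marks are wanted, an optional extra step (not part of the original answers) is to cast once every value is filled:
df['mark'] = df['mark'].astype(int)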

One of the ways in which you can achieve this is with the following steps (a sketch follows below):
Left join the two tables using the pandas.merge() command (an inner join would drop the rows whose Name is missing from df2).
Create a new column which checks if the mark from df1 is not null; if so, take that value, else take the df2 value.
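A minimal sketch of those steps, assuming the column names from the question; the suffixes and the where/notna fallback are one possible spelling, not the only one:
import pandas as pd

df1 = pd.DataFrame({'Name': ['a', 'b', 'c', 'd'],
                    'Subject': ['Ta', 'En', 'Ma', 'Ss'],
                    'mark': [52, None, None, 60]})
df2 = pd.DataFrame({'Name': ['b', 'c'], 'mark': [57, 58]})

df = df1.merge(df2, on='Name', how='left', suffixes=('_x', '_y'))
# keep df1's mark where present, otherwise fall back to df2's mark
df['mark'] = df['mark_x'].where(df['mark_x'].notna(), df['mark_y'])
df = df.drop(columns=['mark_x', 'mark_y'])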

Related

How to sum the previous pandas df column with the current if it contains the word "extra"

I have a pandas df with columns that need to be summed into the previous column iff their name contains the word "extra". For example, here is my pandas df:
id  laptops  laptops extra  battery  cables  monitor  monitor extra
 0       54             18      108      54       28             12
 1       33              9       48      20       10              4
 2       82             61       98      67       21              9
...
Is there a way in pandas to find columns that contain the word extra and sum them with the previous column? That would help clean up so much data.
Thank you
Remove the ' extra' text and aggregate sum for all columns:
df1 = (df.rename(columns=lambda x: x.replace(' extra', ''))
         .groupby(level=0, axis=1, sort=False)
         .sum())
Or filter the extra columns, strip ' extra' from their names and add them to the original columns, then drop the extra columns:
m = df.columns.str.endswith('extra')
df1 = (df.add(df.loc[:, m]
                .rename(columns=lambda x: x.replace(' extra', '')),
              axis=1, fill_value=0)
         .loc[:, ~m])
EDIT: To add each column whose name ends with 'extra' to the column immediately before it, shift the mask to select those preceding columns, assign the sums back to them, then drop the extra columns:
m = df.columns.to_series().str.endswith('extra')
prev = m.shift(-1, fill_value=False)  # True for the column right before each 'extra' column
df.loc[:, prev] = df.loc[:, prev].to_numpy() + df.loc[:, m].to_numpy()
df = df.loc[:, ~m]
print (df)
   id  laptops  battery  cables  monitor
0   0       72      108      54       40
1   1       42       48      20       14
2   2      143       98      67       30

How do I merge two sets of data with Pandas in Python without losing rows?

I'm using Pandas in Python to compare two data frames. I want to match up the data from one set to another.
Dataframe 1
Name
Sam
Mike
John
Matthew
Mark
Dataframe 2
  Name  Number
  Mike      76
  John      92
  Mark      32
This is the output I would like to get:
  Name     Number
  Sam           0
  Mike         76
  John         92
  Matthew       0
  Mark         32
At the moment I am doing this
df1 = pd.read_csv('data_frame1.csv', usecols=['Name', 'Number'])
df2 = pd.read_csv('data_frame2.csv')
df3 = pd.merge(df1, df2, on = 'Name')
df3.set_index('Name', inplace = True)
df3.to_csv('output.csv')
However, this is deleting the names which do not have a number. I want to keep them and assign 0 to them.
You can use pd.merge(..., how='outer'); this keeps all rows, inserting NaN where there is no match, and then .fillna(0) replaces the NaN with 0:
>>> pd.merge(df1, df2, on = 'Name', how = 'outer').fillna(0)
Name Number
0 Sam 0
1 Mike 76
2 John 92
3 Matthew 0
4 Mark 32
pd.merge(..., how='outer') keeps the rows of both DataFrames; if you want to merge another DataFrame into one, keeping only the latter's rows, use how='left' instead, see this example:
>>> df1 = pd.DataFrame({'Name': ['Mike','John','Mark','Matthew']})
>>> df2 = pd.DataFrame({'Name': ['Mike','John','Mark', 'Sara'], 'Number' : [76,92,32,50]})
>>> pd.merge(df1, df2, on='Name', how='outer').fillna(0)
Name Number
0 Mike 76.0
1 John 92.0
2 Mark 32.0
3 Matthew 0.0
4 Sara 50.0
>>> df1.merge(df2,on='Name', how='left').fillna(0)
Name Number
0 Mike 76.0
1 John 92.0
2 Mark 32.0
3 Matthew 0.0
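As an optional aside (not from the original answer): the Numbers become floats because the outer merge introduces NaN before fillna runs; if integers are wanted, a cast after filling restores them:
>>> pd.merge(df1, df2, on='Name', how='outer').fillna(0).astype({'Number': int})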

How to find out the difference between two dataframes irrespective of index? [duplicate]

I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?
In other words, a data frame that has all the rows/columns in df1 that are not in df2?
By using drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
Update:
The above method only works for those data frames that don't already have duplicates themselves. For example:
df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})
It will output as below, which is wrong:
Wrong Output :
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
A B
1 2 3
Correct Output
Out[656]:
A B
1 2 3
2 3 4
3 3 4
How to achieve that?
Method 1: Using isin with tuple
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]:
A B
1 2 3
2 3 4
3 3 4
Method 2: merge with indicator
df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]:
A B _merge
1 2 3 left_only
2 3 4 left_only
3 3 4 left_only
For rows, try this, where Name is the common key column (it can be a list for multiple common columns, or specify left_on and right_on):
m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)
The indicator=True setting is useful as it adds a column called _merge, with all changes between df1 and df2, categorized into 3 possible kinds: "left_only", "right_only" or "both".
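To keep only the rows that differ, filter on that column (a small follow-up to the merge above, not part of the original answer; changes is just a chosen name):
changes = m[m['_merge'] != 'both']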
For columns, try this:
set(df1.columns).symmetric_difference(df2.columns)
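pandas Index objects also expose this operation directly, which avoids the set() round-trip:
df1.columns.symmetric_difference(df2.columns)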
The accepted answer's Method 1 will not work for data frames with NaNs inside, as np.nan != np.nan. I am not sure if this is the best way, but it can be avoided by
df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]
It's slower, because it needs to cast the data to string, but thanks to this casting the NaN values (as the string 'nan') compare equal.
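A quick illustration of why the cast helps: a float NaN never equals itself, but its string form does.
>>> import numpy as np
>>> np.nan == np.nan
False
>>> str(np.nan) == str(np.nan)
True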
Let's go through the code. First we cast the values to string and apply the tuple function to each row.
df1.astype(str).apply(tuple, 1)
df2.astype(str).apply(tuple, 1)
Thanks to that, we get a pd.Series of tuples, where each tuple contains a whole row from df1/df2.
Then we apply the isin method on df1 to check whether each tuple "is in" df2.
The result is a pd.Series of bool values, True if a tuple from df1 is in df2. In the end, we negate the result with the ~ sign and use it to filter df1. Long story short, we get only those rows from df1 that are not in df2.
To make it more readable, we may write it as:
df1_str_tuples = df1.astype(str).apply(tuple, 1)
df2_str_tuples = df2.astype(str).apply(tuple, 1)
df1_values_in_df2_filter = df1_str_tuples.isin(df2_str_tuples)
df1_values_not_in_df2 = df1[~df1_values_in_df2_filter]
import pandas as pd

# given
df1 = pd.DataFrame({'Name': ['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
                    'Age': [23,45,12,34,27,44,28,39,40]})
df2 = pd.DataFrame({'Name': ['John','Smith','Wale','Tom','Menda','Yuswa'],
                    'Age': [23,12,34,44,28,40]})
# find elements in df1 that are not in df2
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)
# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)
# df1
# Age Name
# 0 23 John
# 1 45 Mike
# 2 12 Smith
# 3 34 Wale
# 4 27 Marry
# 5 44 Tom
# 6 28 Menda
# 7 39 Bolt
# 8 40 Yuswa
# df2
# Age Name
# 0 23 John
# 1 12 Smith
# 2 34 Wale
# 3 44 Tom
# 4 28 Menda
# 5 40 Yuswa
# df_1notin2
# Age Name
# 0 45 Mike
# 1 27 Marry
# 2 39 Bolt
Perhaps a simpler one-liner, with identical or different column names. It worked even when df2['Name2'] contained duplicate values:
newDf = (df1.set_index('Name1')
            .drop(df2['Name2'], errors='ignore')
            .reset_index(drop=False))
edit2: I figured out a new solution without the need to set an index:
newdf = pd.concat([df1, df2]).drop_duplicates(keep=False)
Okay, I found that the highest-voted answer already contains what I figured out. Yes, we can only use this code on the condition that there are no duplicates in either of the two dfs.
I have a tricky method. First we set 'Name' as the index of the two dataframes given by the question. Since we have the same 'Name' values in the two dfs, we can just drop the 'smaller' df's index from the 'bigger' df.
Here is the code.
df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)
newdf=df1.drop(df2.index)
Pandas now offers a new API to do data frame diff: pandas.DataFrame.compare
df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
In addition to the accepted answer, I would like to propose one more, wider solution that can find a 2D set difference of two dataframes with any index/columns (they might not coincide for both dataframes). The method also allows setting up a tolerance for float elements for the dataframe comparison (it uses np.isclose).
import numpy as np
import pandas as pd

def get_dataframe_setdiff2d(df_new: pd.DataFrame,
                            df_old: pd.DataFrame,
                            rtol=1e-03, atol=1e-05) -> pd.DataFrame:
    """Returns set difference of two pandas DataFrames"""
    union_index = np.union1d(df_new.index, df_old.index)
    union_columns = np.union1d(df_new.columns, df_old.columns)
    new = df_new.reindex(index=union_index, columns=union_columns)
    old = df_old.reindex(index=union_index, columns=union_columns)
    mask_diff = ~np.isclose(new, old, rtol, atol)
    df_bool = pd.DataFrame(mask_diff, union_index, union_columns)
    df_diff = pd.concat([new[df_bool].stack(),
                         old[df_bool].stack()], axis=1)
    df_diff.columns = ["New", "Old"]
    return df_diff
Example:
In [1]
df1 = pd.DataFrame({'A':[2,1,2],'C':[2,1,2]})
df2 = pd.DataFrame({'A':[1,1],'B':[1,1]})
print("df1:\n", df1, "\n")
print("df2:\n", df2, "\n")
diff = get_dataframe_setdiff2d(df1, df2)
print("diff:\n", diff, "\n")
Out [1]
df1:
A C
0 2 2
1 1 1
2 2 2
df2:
A B
0 1 1
1 1 1
diff:
New Old
0 A 2.0 1.0
B NaN 1.0
C 2.0 NaN
1 B NaN 1.0
C 1.0 NaN
2 A 2.0 NaN
C 2.0 NaN
As mentioned here,
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
is a correct solution, but it will produce the wrong output if
df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
In that case the above solution will give an empty DataFrame; instead you should use the concat method after removing duplicates from each dataframe.
Use concat with drop_duplicates:
df1=df1.drop_duplicates(keep="first")
df2=df2.drop_duplicates(keep="first")
pd.concat([df1,df2]).drop_duplicates(keep=False)
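For the df1/df2 pair above, this leaves the rows unique to either side (worked out by hand here):
   A  B
1  2  3
2  3  4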
I had issues handling duplicates when there were duplicates on one side and at least one on the other side, so I used collections.Counter to do a better diff, ensuring both sides have the same count. This doesn't return duplicates, but it won't return any rows if both sides have the same count.
import pandas as pd
from collections import Counter

def diff(df1, df2, on=None):
    """
    :param on: same as pandas.df.merge(on) (a list of columns)
    """
    on = on if on else df1.columns
    df1on = df1[on]
    df2on = df2[on]
    c1 = Counter(df1on.apply(tuple, 'columns'))
    c2 = Counter(df2on.apply(tuple, 'columns'))
    c1c2 = c1 - c2
    c2c1 = c2 - c1
    df1ondf2on = pd.DataFrame(list(c1c2.elements()), columns=on)
    df2ondf1on = pd.DataFrame(list(c2c1.elements()), columns=on)
    df1df2 = df1.merge(df1ondf2on).drop_duplicates(subset=on)
    df2df1 = df2.merge(df2ondf1on).drop_duplicates(subset=on)
    return pd.concat([df1df2, df2df1])
> df1 = pd.DataFrame({'a': [1, 1, 3, 4, 4]})
> df2 = pd.DataFrame({'a': [1, 2, 3, 4, 4]})
> diff(df1, df2)
a
0 1
0 2
There is a newer method in pandas, DataFrame.compare, that compares two dataframes and returns which values changed in each column for the data records.
Example
First Dataframe
Id Customer Status Date
1 ABC Good Mar 2023
2 BAC Good Feb 2024
3 CBA Bad Apr 2022
Second Dataframe
Id Customer Status Date
1 ABC Bad Mar 2023
2 BAC Good Feb 2024
5 CBA Good Apr 2024
Comparing Dataframes
print("Dataframe difference -- \n")
print(df1.compare(df2))
print("Dataframe difference keeping equal values -- \n")
print(df1.compare(df2, keep_equal=True))
print("Dataframe difference keeping same shape -- \n")
print(df1.compare(df2, keep_shape=True))
print("Dataframe difference keeping same shape and equal values -- \n")
print(df1.compare(df2, keep_shape=True, keep_equal=True))
Result
Dataframe difference --
Id Status Date
self other self other self other
0 NaN NaN Good Bad NaN NaN
2 3.0 5.0 Bad Good Apr 2022 Apr 2024
Dataframe difference keeping equal values --
Id Status Date
self other self other self other
0 1 1 Good Bad Mar 2023 Mar 2023
2 3 5 Bad Good Apr 2022 Apr 2024
Dataframe difference keeping same shape --
Id Customer Status Date
self other self other self other self other
0 NaN NaN NaN NaN Good Bad NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 3.0 5.0 NaN NaN Bad Good Apr 2022 Apr 2024
Dataframe difference keeping same shape and equal values --
Id Customer Status Date
self other self other self other self other
0 1 1 ABC ABC Good Bad Mar 2023 Mar 2023
1 2 2 BAC BAC Good Good Feb 2024 Feb 2024
2 3 5 CBA CBA Bad Good Apr 2022 Apr 2024
A slight variation of liangli's nice solution that does not require changing the index of the existing dataframes:
newdf = df1.drop(df1.join(df2.set_index('Name'), on='Name', how='inner').index)
Finding the difference by index, assuming one dataframe is a subset of the other (here df2 is a subset of df1) and the indexes are carried forward when subsetting:
df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()
# Example
df1 = pd.DataFrame({"gender": np.random.choice(['m','f'], size=5),
                    "subject": np.random.choice(["bio","phy","chem"], size=5)},
                   index=[1,2,3,4,5])
df2 = df1.loc[[1,3,5]]
df1
gender subject
1 f bio
2 m chem
3 f phy
4 m bio
5 f bio
df2
gender subject
1 f bio
3 f phy
5 f bio
df3 = df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()
df3
gender subject
2 m chem
4 m bio
Defining our dataframes:
df1 = pd.DataFrame({
    'Name': ['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
    'Age': [23,45,12,34,27,44,28,39,40]
})
df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]
df1
Name Age
0 John 23
1 Mike 45
2 Smith 12
3 Wale 34
4 Marry 27
5 Tom 44
6 Menda 28
7 Bolt 39
8 Yuswa 40
df2
Name Age
0 John 23
2 Smith 12
3 Wale 34
5 Tom 44
6 Menda 28
8 Yuswa 40
The difference between the two would be:
df1[~df1.isin(df2)].dropna()
Name Age
1 Mike 45.0
4 Marry 27.0
7 Bolt 39.0
Where:
df1.isin(df2) returns the rows in df1 that are also in df2.
~ (element-wise logical NOT) in front of the expression negates the results, so we get the elements in df1 that are NOT in df2, the difference between the two.
.dropna() drops the rows with NaN, presenting the desired output.
Note: this only works if len(df1) >= len(df2). If df2 is longer than df1 you can reverse the expression: df2[~df2.isin(df1)].dropna()
I found that the deepdiff library is a wonderful tool that also extends well to dataframes if more detail is required or ordering matters. You can experiment with diffing to_dict('records'), to_numpy(), and other exports:
import pandas as pd
from deepdiff import DeepDiff

df1 = pd.DataFrame({
    'Name': ['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
    'Age': [23,45,12,34,27,44,28,39,40]
})
df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]
DeepDiff(df1.to_dict(), df2.to_dict())
# {'dictionary_item_removed': [root['Name'][1], root['Name'][4], root['Name'][7], root['Age'][1], root['Age'][4], root['Age'][7]]}
Symmetric Difference
If you are interested in the rows that are only in one of the dataframes but not both, you are looking for the symmetric difference:
pd.concat([df1,df2]).drop_duplicates(keep=False)
⚠️ Only works, if both dataframes do not contain any duplicates.
Set Difference / Relational Algebra Difference
If you are interested in the relational algebra difference / set difference, i.e. df1-df2 or df1\df2:
pd.concat([df1,df2,df2]).drop_duplicates(keep=False)
⚠️ Only works, if both dataframes do not contain any duplicates.
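A small illustration (with made-up data) of why concatenating df2 twice yields df1 - df2: every row of df2 appears at least twice in the concatenation, so keep=False removes it, while rows only in df1 appear once and survive.
>>> df1 = pd.DataFrame({'A': [1, 2, 3]})
>>> df2 = pd.DataFrame({'A': [3, 4]})
>>> pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
   A
0  1
1  2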
Another possible solution is to use numpy broadcasting:
df1[np.all(~np.all(df1.values == df2.values[:, None], axis=2), axis=0)]
Output:
Name Age
1 Mike 45
4 Marry 27
7 Bolt 39
Using a lambda function you can filter the rows with _merge value "left_only" to get all the rows in df1 which are missing from df2:
df3 = df1.merge(df2, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']
df3
Try this one:
df_new = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop(columns='_merge')
It will result in a new dataframe with the differences: the values that exist in df1 but not in df2.

update one dataframe with data from another, for one specific column - Pandas and Python

I'm trying to update one dataframe with data from another, for one specific column called 'Data'. Both dataframes have a unique ID column called 'ID', and both have a 'Data' column. I want data from 'Data' in df2 to overwrite entries in df1's 'Data', for only the rows that are in df1. Where there is no corresponding 'ID' in df2, the df1 entry should remain.
import pandas as pd
from io import StringIO  # pd.compat.StringIO was removed in newer pandas
data1 = '''\
ID Data Data1
1 AA BB
2 AB BF
3 AC BK
4 AD BL'''
data2 = '''\
ID Data
1 AAB
3 AAL
4 MNL
5 AAP
6 MNX
8 DLP
9 POW'''
df1 = pd.read_csv(StringIO(data1), sep=r'\s+')
df2 = pd.read_csv(StringIO(data2), sep=r'\s+')
Expected outcome (new df3):
ID Data Data1
1 AAB BB
2 AB BF
3 AAL BK
4 MNL BL
df2 is a master list of values which never changes and has thousands of entries, where as df1 sometime only ever has a few hundred entries.
I have looked at pd.merge and combine_first however can't seem to get the right combination.
df3 = pd.merge(df1, df2, on='ID', how='left')
Any help much appreciated.
Create a new dataframe
Here is one way making use of update:
df3 = df1[:].set_index('ID')
df3['Data'].update(df2.set_index('ID')['Data'])
df3.reset_index(inplace=True)
Or we could use maps/dicts and reassign (Python >= 3.5)
m = {**df1.set_index('ID')['Data'], **df2.set_index('ID')['Data']}
df3 = df1[:].assign(Data=df1['ID'].map(m))
Python < 3.5:
m = df1.set_index('ID')['Data']
m.update(df2.set_index('ID')['Data'])
df3 = df1[:].assign(Data=df1['ID'].map(m))
Update df1
Are you open to updating df1? In that case:
df1.update(df2)
Or if ID is not the index:
m = df2.set_index('ID')['Data']
df1.loc[df1['ID'].isin(df2['ID']), 'Data'] = df1['ID'].map(m)
Or:
df1.set_index('ID',inplace=True)
df1.update(df2.set_index('ID'))
df1.reset_index(inplace=True)
Note: There might be something that makes more sense :)
Full example:
import pandas as pd
from io import StringIO
data1 = '''\
ID Data Data1
1 AA BB
2 AB BF
3 AC BK
4 AD BL'''
data2 = '''\
ID Data
1 AAB
3 AAL
4 MNL
5 AAP
6 MNX
8 DLP
9 POW'''
df1 = pd.read_csv(StringIO(data1), sep=r'\s+')
df2 = pd.read_csv(StringIO(data2), sep=r'\s+')
m = {**df1.set_index('ID')['Data'], **df2.set_index('ID')['Data']}
df3 = df1[:].assign(Data=df1['ID'].map(m))
print(df3)
Returns:
ID Data Data1
0 1 AAB BB
1 2 AB BF
2 3 AAL BK
3 4 MNL BL

Get rid of excess Labels on Pandas DataFrames

so I got a DataFrame by doing:
dfgrp=df.groupby(['CCS_Category_ICD9','Gender'])['f0_'].sum()
ndf=pd.DataFrame(dfgrp)
ndf
                            f0_
CCS_Category_ICD9 Gender
1                 F         889
                  M         796
                  U           2
2                 F       32637
                  M       33345
                  U          34
Where f0_ is the sum of the counts by Gender
All I really want is a simple one level dataframe similar to this which I got via
ndf=ndf.unstack(level=1)
ndf
f0_
Gender F M U
CCS_Category_ICD9
1 889.0 796.0 2.0
2 32637.0 33345.0 34.0
3 2546.0 1812.0 NaN
4 347284.0 213782.0 34.0
But what I want is:
CCS_Category_ICD9 F M U
1 889.0 796.0 2.0
2 32637.0 33345.0 34.0
3 2546.0 1812.0 NaN
4 347284.0 213782.0 34.0
I cannot figure out how to flatten or get rid of the levels associated with f0_ and Gender. All I need is the "M", "F", "U" column headings so I have a simple one-level dataframe. I have tried reset_index and set_index along with several other variations, with no luck...
In the end I want to have a simple crosstab with row and column totals (which my example does not show).
well I did (as suggested in one answer):
ndf = ndf.f0_.unstack()
ndf
Which gave me:
Gender F M U
CCS_Category_ICD9
1 889.0 796.0 2.0
2 32637.0 33345.0 34.0
3 2546.0 1812.0 NaN
4 347284.0 213782.0 34.0
Followed by:
nndf=ndf.reset_index(['CCS_Category_ICD9','F','M','U'])
nndf
Gender CCS_Category_ICD9 F M U
0 1 889.0 796.0 2.0
1 2 32637.0 33345.0 34.0
2 3 2546.0 1812.0 NaN
3 4 347284.0 213782.0 34.0
4 5 3493.0 7964.0 1.0
5 6 12295.0 9998.0 4.0
Which just about does it. But I cannot change the index name from Gender to something like Idx; no matter what I do, I get an extra row added with the new name, i.e. a row titled Idx just under Gender. Also, is there a more straightforward solution?
You can use
df.loc[:, 'f0_']
on the DataFrame resulting from .unstack(), i.e., select the first level of your MultiIndex columns, which only leaves the gender level, or alternatively
df.columns = df.columns.droplevel()
see the MultiIndex.droplevel docs
Because ndf is a pd.DataFrame, it has a column index. When you perform unstack(), it appends the last level of the row index to the column index. Since the columns already had f0_, you get a second level. To flatten the way you'd like, call unstack() on the column instead.
ndf = ndf.f0_.unstack()
The text Gender is the name of the column index. If you want to get rid of it, you have to overwrite the name attribute for that object.
ndf.columns.name = None
Use this right after the ndf.f0_.unstack()
Generally, use df.pivot when you want to use a column as the row index and another column as the column index. Use df.pivot_table when you need to aggregate values due to rows with duplicate (row, column) pairs. In this case, instead of df.groupby(...)[...].sum().unstack() you could use df.pivot_table:
import numpy as np
import pandas as pd

N = 100
df = pd.DataFrame({'CCS': np.random.choice([1,2], size=N),
                   'Gender': np.random.choice(['F','M','U'], size=N),
                   'f0': np.random.randint(10, size=N)})
result = df.pivot_table(index='CCS', columns='Gender', values='f0', aggfunc='sum')
result.columns.name = None
result = result.reset_index()
yields
CCS F M U
0 1 89 104 90
1 2 66 65 65
Notice that after calling pivot_table(), the DataFrame result has named
index and column Indexes:
In [176]: result = df.pivot_table(index='CCS', columns='Gender', values='f0', aggfunc='sum'); result
Out[176]:
Gender F M U
CCS
1 89 104 90
2 66 65 65
The index is named CCS:
In [177]: result.index
Out[177]: Int64Index([1, 2], dtype='int64', name='CCS')
and the columns index is named Gender:
In [178]: result.columns
Out[178]: Index(['F', 'M', 'U'], dtype='object', name='Gender') # <-- notice the name='Gender'
To remove the name from an Index, assign None to the name attribute:
In [179]: result.columns.name = None
In [180]: result
Out[180]:
F M U
CCS
1 95 68 67
2 82 63 68
Though it's not needed here, to remove the names from the levels of a MultiIndex, assign a list of Nones to the names (plural) attribute:
result.columns.names = [None] * result.columns.nlevels
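For the row and column totals the question mentions, pivot_table's margins option is one way; a sketch reusing the example df above, where 'Total' is just a chosen label:
result = df.pivot_table(index='CCS', columns='Gender', values='f0',
                        aggfunc='sum', margins=True, margins_name='Total')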
