Merging two or more columns which don't overlap - python

Follow up to this post:
Merging two columns which don't overlap and create new columns
import pandas as pd

df1 = pd.DataFrame([["2014", "q2", 2],
                    ["2013", "q1", 1]],
                   columns=('Year', 'Quarter', 'Value'))
df2 = pd.DataFrame([["2016", "q1", 3],
                    ["2015", "q1", 3]],
                   columns=('Year', 'Quarter', 'Value'))
print(df1.merge(df2, on='Year', how='outer'))
Results in:
   Year Quarter_x Value_x Quarter_y Value_y
0  2014        q2       2       NaN     NaN
1  2013        q1       1       NaN     NaN
2  2016       NaN     NaN        q1       3
3  2015       NaN     NaN        q1       3
But I want to get this:
   Year Quarter  Value
0  2014      q2      2
1  2013      q1      1
2  2016      q1      3
3  2015      q1      3
Note: This doesn't produce the desired result... :(
print(df1.merge(df2, on=['Year', 'Quarter', 'Value'], how='outer').dropna())
   Year Quarter  Value
0  2014      q2      2
1  2013      q1      1
... using 'left', 'right', or 'inner' doesn't cut it either.

Not sure what's happening here, but if I do
df1.merge(df2, on=['Year', 'Quarter', 'Value'], how='outer').dropna()
I get:
   Year Quarter  Value
0  2014      q2    2.0
1  2013      q1    1.0
2  2016      q1    3.0
3  2015      q1    3.0
You may want to take a look at the merge, join & concat docs.
The most 'intuitive' way for this is probably .append():
df1.append(df2)
   Year Quarter  Value
0  2014      q2    2.0
1  2013      q1    1.0
2  2016      q1    3.0
3  2015      q1    3.0
If you look into the source code, you'll find it calls concat behind the scenes.
Merge is useful and intended for cases where you have columns with overlapping values.
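Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions you need the concat equivalent directly; a minimal sketch with the question's frames:
# Equivalent of the old df1.append(df2); ignore_index renumbers the rows 0..n-1.
out = pd.concat([df1, df2], ignore_index=True)
print(out)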

pandas concat is much better suited for this.
pd.concat([df1, df2]).reset_index(drop=True)
   Year Quarter  Value
0  2014      q2      2
1  2013      q1      1
2  2016      q1      3
3  2015      q1      3
concat is intended to place one dataframe adjacent to another while keeping the index or columns aligned. In the default case, it keeps the columns aligned. Considering your example dataframes, the columns are aligned and your stated expected output shows df2 placed exactly after df1 where the columns are aligned. Every aspect of what you've asked for is exactly what concat was designed to provide. All I've done is point you to an appropriate function.
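As a quick illustration of that column alignment: concat matches on column labels rather than position, so reordering one frame's columns does not change the result. A small sketch with the question's frames:
# df2 with its columns in a different order still lines up by label.
df2_shuffled = df2[['Value', 'Year', 'Quarter']]
print(pd.concat([df1, df2_shuffled]).reset_index(drop=True))
# Same Year/Quarter/Value table as above.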

You're looking for the append feature:
df_final = df1.append(df2)

Related

Drop empty categories in sub groups using groupby in pandas?

I have a resulting table
Year  mycat
2019  A    2
      B    1
2020  A    0
      B    1
In the 3rd row (2020, A) you see zero. I want to get rid of lines like this.
Year  mycat
2019  A    2
      B    1
2020  B    1
How can I do this? Is there a way to let pandas handle that without "hacking" the resulting table after I've done .groupby().size()?
Here is the full code:
>>> import pandas as pd
>>> df = pd.DataFrame({'Year': [2019, 2019, 2019, 2020], 'mycat': list('AABB')})
>>> df.mycat = df.mycat.astype('category')
>>> df
   Year mycat
0  2019     A
1  2019     A
2  2019     B
3  2020     B
>>> df.groupby(['Year', 'mycat']).size()
Year  mycat
2019  A        2
      B        1
2020  A        0
      B        1
dtype: int64
Yes, there is a way to eliminate zero-instance groupby results even for Categoricals such as in your specified input dataframe:
df.groupby(['Year', 'mycat'], observed=True).size()
In the docs for groupby(), the observed argument is explained as follows:
observed : bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
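Applied to the question's dataframe, this drops the unobserved (2020, A) combination and matches the expected output:
>>> df.groupby(['Year', 'mycat'], observed=True).size()
Year  mycat
2019  A        2
      B        1
2020  B        1
dtype: int64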

Calculate new MultiIndex level from existing MultiIndex level values

For a DataFrame with two MultiIndex levels age and yearref, the goal is to add a new MultiIndex level yearconstr calculated as yearconstr = yearref - age.
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]},
                  index=pd.MultiIndex.from_tuples([(10, 2015), (3, 2015), (2, 2016)],
                                                  names=["age", "yearref"]))
print(df)
# input df:
             value
age yearref
10  2015         1
3   2015         2
2   2016         3
We could reset the index, calculate a new column and then put the original index back in place plus the newly defined column, but surely there must be a better way.
df = (df.reset_index()
        .assign(yearconstr=lambda df: df.yearref - df.age)
        .set_index(list(df.index.names) + ["yearconstr"]))
print(df)
# expected result:
                        value
age yearref yearconstr
10  2015    2005            1
3   2015    2012            2
2   2016    2014            3
For a concise and straightforward approach, we can:

- use eval to generate a new Series calculated from the existing MultiIndex; this is easy since eval treats index levels just like columns: df.eval("yearref - age")
- rename the new Series
- use set_index to append the Series to df via the append=True argument.

Putting everything together:
df.set_index(df.eval("yearref - age").rename("yearconstr"), append=True)
# result:
                        value
age yearref yearconstr
10  2015    2005            1
3   2015    2012            2
2   2016    2014            3
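If the new level should sit somewhere other than the last position, reorder_levels can rearrange it afterwards; a small sketch (the chosen order is just for illustration):
out = df.set_index(df.eval("yearref - age").rename("yearconstr"), append=True)
# Place yearconstr between age and yearref instead of at the end:
out = out.reorder_levels(["age", "yearconstr", "yearref"])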

How to make pivot table in pandas with row values as a column and separate rows based on date?

I have a df structured in the following setting and would like to reshape it so that the types found in the measure column become column headers, with the original result values filling those new type columns, condensing ids of the same date into one row. For example, I would like to change the following table:
id  name  measure  result  date
1   A     O1       X       2015
1   A     O2       X       2015
1   A     O3       X       2015
2   B     O2       Y       2015
1   A     O1       Z       2016
2   B     O1       Z       2016
...
To:
id  name  O1    O2    O3    date
1   A     X     X     X     2015
2   B     None  Y     None  2015
1   A     Z     None  None  2016
2   B     Z     None  None  2016
...
I know to use the pivot_table function in pandas; however, I am unsure how to take into account different years. Here are links to questions similar to mine, but they don't answer the same question:
How to make types in the rows of pandas dataframe to become the column header with result as row type?
How to pivot a dataframe in Pandas?
How can I pivot a dataframe?
We can use aggfunc='first' for string values (the default aggfunc='mean' cannot aggregate strings), then rename_axis and reset_index to clean up the format:
new_df = (
    df.pivot_table(index=['date', 'id', 'name'],
                   columns='measure',
                   values='result',
                   aggfunc='first')
      .rename_axis(columns=None)
      .reset_index()
)

# Re-order columns (move date to end)
new_df = new_df[[*new_df.columns[new_df.columns != 'date'], 'date']]
new_df:
   id name   O1   O2   O3  date
0   1    A    X    X    X  2015
1   2    B  NaN    Y  NaN  2015
2   1    A    Z  NaN  NaN  2016
3   2    B    Z  NaN  NaN  2016
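For a self-contained run, the question's input frame can be rebuilt like this (a minimal sketch reconstructing the example table above):
import pandas as pd

df = pd.DataFrame({
    'id':      [1, 1, 1, 2, 1, 2],
    'name':    ['A', 'A', 'A', 'B', 'A', 'B'],
    'measure': ['O1', 'O2', 'O3', 'O2', 'O1', 'O1'],
    'result':  ['X', 'X', 'X', 'Y', 'Z', 'Z'],
    'date':    [2015, 2015, 2015, 2015, 2016, 2016],
})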

How to findout difference between two dataframes irrespective of index? [duplicate]

I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?
In other words, a data frame that has all the rows/columns in df1 that are not in df2?
By using drop_duplicates
pd.concat([df1,df2]).drop_duplicates(keep=False)
Update:
The above method only works for those data frames that don't already have duplicates themselves. For example:
df1 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})
It will produce the output below, which is wrong.
Wrong output:
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
   A  B
1  2  3
Correct output:
Out[656]:
   A  B
1  2  3
2  3  4
3  3  4
How to achieve that?
Method 1: Using isin with tuple
df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]
Out[657]:
   A  B
1  2  3
2  3  4
3  3  4
Method 2: merge with indicator
df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
Out[421]:
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
For rows, try this, where Name is the joint index column (can be a list for multiple common columns, or specify left_on and right_on):
m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)
The indicator=True setting is useful as it adds a column called _merge, with all changes between df1 and df2, categorized into 3 possible kinds: "left_only", "right_only" or "both".
For columns, try this:
set(df1.columns).symmetric_difference(df2.columns)
The accepted answer's Method 1 will not work for data frames with NaNs inside, since np.nan != np.nan. I am not sure if this is the best way, but it can be avoided by
df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]
It's slower, because it needs to cast the data to string, but thanks to this casting the NaNs compare equal (as the string 'nan').
Let's go through the code. First we cast the values to string and apply the tuple function to each row:
df1.astype(str).apply(tuple, 1)
df2.astype(str).apply(tuple, 1)
Thanks to that, we get a pd.Series of tuples, where each tuple contains a whole row from df1/df2.
Then we apply the isin method on df1 to check whether each tuple "is in" df2.
The result is a pd.Series of bool values: True if a tuple from df1 is in df2. In the end, we negate the result with the ~ sign and use it to filter df1. Long story short, we get only those rows from df1 that are not in df2.
To make it more readable, we may write it as:
df1_str_tuples = df1.astype(str).apply(tuple, 1)
df2_str_tuples = df2.astype(str).apply(tuple, 1)
df1_values_in_df2_filter = df1_str_tuples.isin(df2_str_tuples)
df1_values_not_in_df2 = df1[~df1_values_in_df2_filter]
import pandas as pd

# given
df1 = pd.DataFrame({'Name': ['John', 'Mike', 'Smith', 'Wale', 'Marry', 'Tom', 'Menda', 'Bolt', 'Yuswa'],
                    'Age': [23, 45, 12, 34, 27, 44, 28, 39, 40]})
df2 = pd.DataFrame({'Name': ['John', 'Smith', 'Wale', 'Tom', 'Menda', 'Yuswa'],
                    'Age': [23, 12, 34, 44, 28, 40]})

# find elements in df1 that are not in df2
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)

# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)
# df1
#    Age   Name
# 0   23   John
# 1   45   Mike
# 2   12  Smith
# 3   34   Wale
# 4   27  Marry
# 5   44    Tom
# 6   28  Menda
# 7   39   Bolt
# 8   40  Yuswa
# df2
#    Age   Name
# 0   23   John
# 1   12  Smith
# 2   34   Wale
# 3   44    Tom
# 4   28  Menda
# 5   40  Yuswa
# df_1notin2
#    Age   Name
# 0   45   Mike
# 1   27  Marry
# 2   39   Bolt
Perhaps a simpler one-liner, with identical or different column names. It worked even when df2['Name2'] contained duplicate values:
newDf = (df1.set_index('Name1')
            .drop(df2['Name2'], errors='ignore')
            .reset_index(drop=False))
Edit 2: I figured out a new solution without the need of setting an index:
newdf = pd.concat([df1, df2]).drop_duplicates(keep=False)
Okay, I found that the highest-voted answer already contains what I figured out. Yes, we can only use this code on the condition that there are no duplicates in either of the two dfs.
I have a tricky method. First we set 'Name' as the index of the two dataframes given by the question. Since we have the same 'Name' values in the two dfs, we can just drop the 'smaller' df's index from the 'bigger' df.
Here is the code:
df1.set_index('Name', inplace=True)
df2.set_index('Name', inplace=True)
newdf = df1.drop(df2.index)
Pandas now offers a new API to do data frame diff: pandas.DataFrame.compare
df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
In addition to the accepted answer, I would like to propose one more, wider solution that can find a 2D set difference of two dataframes with any index/columns (they might not coincide for both dataframes). The method also allows you to set a tolerance for float elements in the dataframe comparison (it uses np.isclose).
import numpy as np
import pandas as pd

def get_dataframe_setdiff2d(df_new: pd.DataFrame,
                            df_old: pd.DataFrame,
                            rtol=1e-03, atol=1e-05) -> pd.DataFrame:
    """Returns set difference of two pandas DataFrames"""
    union_index = np.union1d(df_new.index, df_old.index)
    union_columns = np.union1d(df_new.columns, df_old.columns)
    new = df_new.reindex(index=union_index, columns=union_columns)
    old = df_old.reindex(index=union_index, columns=union_columns)
    mask_diff = ~np.isclose(new, old, rtol, atol)
    df_bool = pd.DataFrame(mask_diff, union_index, union_columns)
    df_diff = pd.concat([new[df_bool].stack(),
                         old[df_bool].stack()], axis=1)
    df_diff.columns = ["New", "Old"]
    return df_diff
Example:
In [1]:
df1 = pd.DataFrame({'A': [2, 1, 2], 'C': [2, 1, 2]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [1, 1]})
print("df1:\n", df1, "\n")
print("df2:\n", df2, "\n")
diff = get_dataframe_setdiff2d(df1, df2)
print("diff:\n", diff, "\n")
Out [1]:
df1:
    A  C
0  2  2
1  1  1
2  2  2

df2:
    A  B
0  1  1
1  1  1

diff:
     New  Old
0 A  2.0  1.0
  B  NaN  1.0
  C  2.0  NaN
1 B  NaN  1.0
  C  1.0  NaN
2 A  2.0  NaN
  C  2.0  NaN
As mentioned here,
df1[~df1.apply(tuple, 1).isin(df2.apply(tuple, 1))]
is a correct solution, but it will produce wrong output if
df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
In that case the above solution will give an empty DataFrame; instead you should use the concat method after removing duplicates from each dataframe.
Use concat with drop_duplicates:
df1 = df1.drop_duplicates(keep="first")
df2 = df2.drop_duplicates(keep="first")
pd.concat([df1, df2]).drop_duplicates(keep=False)
I had issues with handling duplicates when there were duplicates on one side and at least one on the other side, so I used collections.Counter to do a better diff, ensuring both sides have the same count. This doesn't return duplicates, but it won't return any if both sides have the same count.
from collections import Counter

import pandas as pd

def diff(df1, df2, on=None):
    """
    :param on: same as pandas.df.merge(on) (a list of columns)
    """
    on = on if on else df1.columns
    df1on = df1[on]
    df2on = df2[on]
    c1 = Counter(df1on.apply(tuple, 'columns'))
    c2 = Counter(df2on.apply(tuple, 'columns'))
    c1c2 = c1 - c2
    c2c1 = c2 - c1
    df1ondf2on = pd.DataFrame(list(c1c2.elements()), columns=on)
    df2ondf1on = pd.DataFrame(list(c2c1.elements()), columns=on)
    df1df2 = df1.merge(df1ondf2on).drop_duplicates(subset=on)
    df2df1 = df2.merge(df2ondf1on).drop_duplicates(subset=on)
    return pd.concat([df1df2, df2df1])
> df1 = pd.DataFrame({'a': [1, 1, 3, 4, 4]})
> df2 = pd.DataFrame({'a': [1, 2, 3, 4, 4]})
> diff(df1, df2)
   a
0  1
0  2
There is a newer method in pandas, DataFrame.compare, that compares two dataframes and returns which values changed in each column for the data records.
Example
First Dataframe:
Id  Customer  Status  Date
1   ABC       Good    Mar 2023
2   BAC       Good    Feb 2024
3   CBA       Bad     Apr 2022
Second Dataframe:
Id  Customer  Status  Date
1   ABC       Bad     Mar 2023
2   BAC       Good    Feb 2024
5   CBA       Good    Apr 2024
Comparing Dataframes
print("Dataframe difference -- \n")
print(df1.compare(df2))
print("Dataframe difference keeping equal values -- \n")
print(df1.compare(df2, keep_equal=True))
print("Dataframe difference keeping same shape -- \n")
print(df1.compare(df2, keep_shape=True))
print("Dataframe difference keeping same shape and equal values -- \n")
print(df1.compare(df2, keep_shape=True, keep_equal=True))
Result:

Dataframe difference --

    Id       Status            Date
  self other  self other      self     other
0  NaN   NaN  Good   Bad       NaN       NaN
2  3.0   5.0   Bad  Good  Apr 2022  Apr 2024

Dataframe difference keeping equal values --

    Id       Status            Date
  self other  self other      self     other
0    1     1  Good   Bad  Mar 2023  Mar 2023
2    3     5   Bad  Good  Apr 2022  Apr 2024

Dataframe difference keeping same shape --

    Id       Customer       Status            Date
  self other     self other  self other      self     other
0  NaN   NaN      NaN   NaN  Good   Bad       NaN       NaN
1  NaN   NaN      NaN   NaN   NaN   NaN       NaN       NaN
2  3.0   5.0      NaN   NaN   Bad  Good  Apr 2022  Apr 2024

Dataframe difference keeping same shape and equal values --

    Id       Customer       Status            Date
  self other     self other  self other      self     other
0    1     1      ABC   ABC  Good   Bad  Mar 2023  Mar 2023
1    2     2      BAC   BAC  Good  Good  Feb 2024  Feb 2024
2    3     5      CBA   CBA   Bad  Good  Apr 2022  Apr 2024
A slight variation of @liangli's nice solution that does not require changing the index of the existing dataframes:
newdf = df1.drop(df1.join(df2.set_index('Name').index))
Finding the difference by index, assuming one dataframe is a subset of the other and that the indexes are carried forward when subsetting:
df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()
# Example
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"gender": np.random.choice(['m', 'f'], size=5),
                    "subject": np.random.choice(["bio", "phy", "chem"], size=5)},
                   index=[1, 2, 3, 4, 5])
df2 = df1.loc[[1, 3, 5]]

df1
  gender subject
1      f     bio
2      m    chem
3      f     phy
4      m     bio
5      f     bio

df2
  gender subject
1      f     bio
3      f     phy
5      f     bio

df3 = df1.loc[set(df1.index).symmetric_difference(set(df2.index))].dropna()
df3
  gender subject
2      m    chem
4      m     bio
Defining our dataframes:
df1 = pd.DataFrame({
    'Name': ['John', 'Mike', 'Smith', 'Wale', 'Marry', 'Tom', 'Menda', 'Bolt', 'Yuswa'],
    'Age': [23, 45, 12, 34, 27, 44, 28, 39, 40]
})
df2 = df1[df1.Name.isin(['John', 'Smith', 'Wale', 'Tom', 'Menda', 'Yuswa'])]
df1
    Name  Age
0   John   23
1   Mike   45
2  Smith   12
3   Wale   34
4  Marry   27
5    Tom   44
6  Menda   28
7   Bolt   39
8  Yuswa   40
df2
    Name  Age
0   John   23
2  Smith   12
3   Wale   34
5    Tom   44
6  Menda   28
8  Yuswa   40
The difference between the two would be:
df1[~df1.isin(df2)].dropna()
    Name   Age
1   Mike  45.0
4  Marry  27.0
7   Bolt  39.0
Where:
df1.isin(df2) returns a boolean frame marking the values in df1 that are also in df2 (aligned on index and columns).
~ (element-wise logical NOT) in front of the expression negates the results, so we get the elements in df1 that are NOT in df2, i.e. the difference between the two.
.dropna() drops the rows with NaN, presenting the desired output.
Note: this only works if len(df1) >= len(df2). If df2 is longer than df1 you can reverse the expression: df2[~df2.isin(df1)].dropna()
I found the deepdiff library is a wonderful tool that also extends well to dataframes if different detail is required or ordering matters. You can experiment with diffing to_dict('records'), to_numpy(), and other exports:
import pandas as pd
from deepdiff import DeepDiff

df1 = pd.DataFrame({
    'Name': ['John', 'Mike', 'Smith', 'Wale', 'Marry', 'Tom', 'Menda', 'Bolt', 'Yuswa'],
    'Age': [23, 45, 12, 34, 27, 44, 28, 39, 40]
})
df2 = df1[df1.Name.isin(['John', 'Smith', 'Wale', 'Tom', 'Menda', 'Yuswa'])]

DeepDiff(df1.to_dict(), df2.to_dict())
# {'dictionary_item_removed': [root['Name'][1], root['Name'][4], root['Name'][7], root['Age'][1], root['Age'][4], root['Age'][7]]}
Symmetric Difference
If you are interested in the rows that are only in one of the dataframes but not both, you are looking for the symmetric difference:
pd.concat([df1,df2]).drop_duplicates(keep=False)
⚠️ Only works if neither dataframe contains any duplicates.
Set Difference / Relational Algebra Difference
If you are interested in the relational algebra difference / set difference, i.e. df1-df2 or df1\df2:
pd.concat([df1,df2,df2]).drop_duplicates(keep=False)
⚠️ Only works if neither dataframe contains any duplicates.
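A quick worked example of the df1-df2 variant, using small duplicate-free frames (the caveat above still applies): appending df2 twice guarantees each of its rows occurs at least twice in the concatenation, so keep=False removes them together with their matches in df1.
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

# (1, 2) appears three times and is dropped entirely; the rest of df1 survives.
print(pd.concat([df1, df2, df2]).drop_duplicates(keep=False))
#    A  B
# 1  2  3
# 2  3  4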
Another possible solution is to use numpy broadcasting:
df1[np.all(~np.all(df1.values == df2.values[:, None], axis=2), axis=0)]
Output:
    Name  Age
1   Mike   45
4  Marry   27
7   Bolt   39
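Step by step, this is what the broadcasting does (an annotated sketch; it assumes both frames share the same columns in the same order):
import numpy as np

# df2.values[:, None] has shape (len(df2), 1, n_cols); comparing it with
# df1.values of shape (len(df1), n_cols) broadcasts to
# (len(df2), len(df1), n_cols).
eq = df1.values == df2.values[:, None]

# axis=2 collapses the columns: True where a row of df2 equals a row of df1.
row_match = np.all(eq, axis=2)      # shape (len(df2), len(df1))

# Negate and collapse over df2's rows: keep df1 rows that match no df2 row.
keep = np.all(~row_match, axis=0)   # shape (len(df1),)
print(df1[keep])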
Using a lambda function you can filter the rows with _merge value "left_only" to get all the rows in df1 which are missing from df2:
df3 = df1.merge(df2, how='outer', indicator=True).loc[lambda x: x['_merge'] == 'left_only']
Try this one:
df_new = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop(columns='_merge')
It will result in a new dataframe with the differences: the values that exist in df1 but not in df2.

How to compare two rows and when they are different then create another dataframe to copy these two rows

Check column ['esn'] from df1. When any difference is found between two consecutive rows, produce another dataframe, df2. df2 only contains the before-change and after-change rows.
>>> df1 = pd.DataFrame([[2014, 1], [2015, 1], [2016, 1], [2017, 2], [2018, 2]],
...                    columns=['year', 'esn'])
>>> df1
   year  esn
0  2014    1
1  2015    1
2  2016    1
3  2017    2
4  2018    2
>>> df2  # new dataframe intended to create
   year  esn
0  2016    1
1  2017    2
I can't produce the above result in df2. Thanks in advance for your help.
Create a boolean mask by comparing the values with the shifted values using ne (not equal), replacing the first missing value via backfill; similarly, compare with the values shifted by -1, forward-filling the missing value. Chain the two with | (bitwise OR) and filter by boolean indexing:
mask = df1['esn'].ne(df1['esn'].shift().bfill()) | df1['esn'].ne(df1['esn'].shift(-1).ffill())
df2 = df1[mask]
print(df2)
   year  esn
2  2016    1
3  2017    2
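To see why this picks exactly the two boundary rows, here are the intermediate values for the example's esn column [1, 1, 1, 2, 2] (a small walkthrough derived from the code above):
# esn:               [1, 1, 1, 2, 2]
# shift().bfill():   [1, 1, 1, 1, 2]  -> ne: [F, F, F, T, F]  (differs from previous row)
# shift(-1).ffill(): [1, 1, 2, 2, 2]  -> ne: [F, F, T, F, F]  (differs from next row)
# OR:                                       [F, F, T, T, F]  -> rows 2 and 3
print(mask.tolist())  # [False, False, True, True, False]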
