PANDAS join data frames with different length by two columns [duplicate] - python

How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas?
How do I add NaNs for missing rows after a merge?
How do I get rid of NaNs after merging?
Can I merge on the index?
How do I merge multiple DataFrames?
Cross join with pandas
merge? join? concat? update? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
Table of Contents
For ease of access.
Merging basics - basic types of joins (read this first)
Index-based joins
Generalizing to multiple DataFrames
Cross join

This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
merging with different column names
merging with multiple columns
avoiding duplicate merge key column in output
What this post (and other posts by me on this thread) will not go through:
Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
Lastly, all visual representations of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Enough talk - just show me how to use merge!
Setup & Basics
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by
Note
This, along with the forthcoming figures all follow this convention:
blue indicates rows that are present in the merge result
red indicates rows that are excluded from the result (i.e., removed)
green indicates missing values that are replaced with NaNs in the result
To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
This returns only rows from left and right which share a common key (in this example, "B" and "D").
A LEFT OUTER JOIN, or LEFT JOIN is represented by
This can be performed by specifying how='left'.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN, specify how='right':
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
Here, keys from right are used, and missing data from left is replaced by NaN.
Finally, for the FULL OUTER JOIN, specify how='outer'.
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
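To compare the four join types side by side, here is a minimal sketch that rebuilds the left and right frames from the setup above and records which keys survive each value of how. The name keys_by_how is introduced only for this illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

# Collect the keys present in the result of each join type.
keys_by_how = {
    how: sorted(left.merge(right, on='key', how=how)['key'])
    for how in ['inner', 'left', 'right', 'outer']
}

for how, keys in keys_by_how.items():
    print(f'{how:>5}: {keys}')
```

The inner join keeps only the shared keys, left and right keep all keys of one side, and outer keeps the union.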
The documentation summarizes these various merges nicely:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can perform them in two steps.
For LEFT-Excluding JOIN, represented as
Start by performing a LEFT OUTER JOIN and then filtering to rows coming from left only (excluding everything from the right),
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop(columns='_merge'))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT-Excluding JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop(columns='_merge'))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you need a merge that retains only keys found in the left or in the right, but not both (in other words, an ANTI-JOIN), you can do this in similar fashion:
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop(columns='_merge'))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
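The three excluding joins above share one pattern: merge with indicator=True, filter on _merge, then drop the helper column. As a rough sketch, they could be wrapped in a single helper. The function name exclusive_merge and its side argument are hypothetical, not part of pandas, and the integer values are stand-ins for the random data used above.

```python
import pandas as pd

def exclusive_merge(left, right, on, side):
    """Keep rows whose key appears only on `side`:
    'left_only', 'right_only', or 'either' (full anti-join)."""
    how = {'left_only': 'left', 'right_only': 'right', 'either': 'outer'}[side]
    merged = left.merge(right, on=on, how=how, indicator=True)
    if side == 'either':
        mask = merged['_merge'] != 'both'
    else:
        mask = merged['_merge'] == side
    # Remove the helper column added by indicator=True.
    return merged[mask].drop(columns='_merge')

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

left_only = exclusive_merge(left, right, on='key', side='left_only')
anti = exclusive_merge(left, right, on='key', side='either')
print(anti)
```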
Different names for key columns
If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
Avoiding duplicate key column in output
When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')): keyLeft is now missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
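As a quick check of the claim that map and a LEFT OUTER JOIN give the same result, the sketch below compares the two on small stand-in data. It assumes the keys in right3 are unique, because map with a Series lookup needs a unique index. The names lookup, via_map and via_merge are illustrative.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right3 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'newcol': np.arange(4)})

# Build a key -> newcol lookup; this approach assumes the keys are unique.
lookup = right3.set_index('key')['newcol']
assert lookup.index.is_unique

via_map = left.assign(newcol=left['key'].map(lookup))
via_merge = left.merge(right3[['key', 'newcol']], on='key', how='left')

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(via_map, via_merge)
```

If the keys on the right are not unique, merge is the safer choice, since it produces one output row per match.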
Merging on multiple columns
To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).
left.merge(right, on=['key1', 'key2'], ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
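The two calls above are templates, so here is a small runnable sketch of a multi-column merge. The sales and targets frames and their column names are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'store': ['S1', 'S1', 'S2', 'S2'],
    'month': [1, 2, 1, 2],
    'sales': [100, 120, 80, 90],
})
targets = pd.DataFrame({
    'store': ['S1', 'S2', 'S2'],
    'month': [2, 1, 3],
    'target': [110, 85, 95],
})

# A row matches only when BOTH store and month agree.
both_keys = sales.merge(targets, on=['store', 'month'])

# Same join when the key columns are named differently on the right.
targets2 = targets.rename(columns={'store': 'shop', 'month': 'period'})
diff_names = sales.merge(targets2, left_on=['store', 'month'],
                         right_on=['shop', 'period'])
print(both_keys)
```

With left_on and right_on, the lists are paired by position, so their lengths and order have to match.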
Other useful merge* operations and functions
Merging a DataFrame with Series on index: See this answer.
Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.
pd.merge_ordered is a useful function for ordered JOINs.
pd.merge_asof (read: merge_asOf) is useful for approximate joins.
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.
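To give a flavor of the approximate joins mentioned above, here is a minimal pd.merge_asof sketch. The trades and quotes frames are invented for illustration; each trade is matched with the most recent quote at or before its timestamp.

```python
import pandas as pd

quotes = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-01 09:00:00',
                            '2020-01-01 09:00:05',
                            '2020-01-01 09:00:10']),
    'bid': [100.0, 100.5, 101.0],
})
trades = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-01 09:00:03',
                            '2020-01-01 09:00:07',
                            '2020-01-01 09:00:12']),
    'qty': [10, 20, 30],
})

# Both frames must be sorted by the `on` column.
# direction='backward' (the default) picks the last quote at or before each trade.
matched = pd.merge_asof(trades, quotes, on='time', direction='backward')
print(matched)
```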
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins *
Index-based joins
Generalizing to multiple DataFrames
Cross join
*You are here.

A supplemental visual view of pd.concat([df0, df1], kwargs).
Note that the meaning of the axis=0 or axis=1 keyword argument is less intuitive here than it is for df.mean() or df.apply(func).
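Since the figure is not reproduced here, the following sketch may help show what axis does in pd.concat. The frames df0 and df1 are tiny made-up examples with partly overlapping row labels.

```python
import pandas as pd

df0 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
df1 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]}, index=['y', 'z'])

# axis=0 stacks rows: columns are aligned, row labels are kept (so 'y' repeats).
stacked = pd.concat([df0, df1], axis=0)

# axis=1 places frames side by side: row labels are aligned (outer join by default).
side_by_side = pd.concat([df0, df1], axis=1)

print(stacked)
print(side_by_side)
```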

Joins 101
These animations may explain the joins better visually.
Credits: Garrick Aden-Buie tidyexplain repo
Inner Join
Outer Join or Full Join
Right Join
Left Join

In this answer, I will consider practical examples of:
pandas.concat
pandas.DataFrame.merge to merge dataframes from the index of one and the column of another one.
We will be using different dataframes for each of the cases.
1. pandas.concat
Considering the following DataFrames with the same column names:
Price2018 with size (8784, 5)
Year Month Day Hour Price
0 2018 1 1 1 6.74
1 2018 1 1 2 4.74
2 2018 1 1 3 3.66
3 2018 1 1 4 2.30
4 2018 1 1 5 2.30
5 2018 1 1 6 2.06
6 2018 1 1 7 2.06
7 2018 1 1 8 2.06
8 2018 1 1 9 2.30
9 2018 1 1 10 2.30
Price2019 with size (8760, 5)
Year Month Day Hour Price
0 2019 1 1 1 66.88
1 2019 1 1 2 66.88
2 2019 1 1 3 66.00
3 2019 1 1 4 63.64
4 2019 1 1 5 58.85
5 2019 1 1 6 55.47
6 2019 1 1 7 56.00
7 2019 1 1 8 61.09
8 2019 1 1 9 61.01
9 2019 1 1 10 61.00
One can combine them using pandas.concat, by simply
import pandas as pd
frames = [Price2018, Price2019]
df_merged = pd.concat(frames)
Which results in a DataFrame with size (17544, 5)
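One detail worth checking after a row-wise concat is the index: by default each frame keeps its own row labels, so they repeat. The sketch below uses two small stand-in frames (price_a and price_b, with illustrative values) rather than the full yearly tables.

```python
import pandas as pd

# Small stand-ins for the yearly price tables (values are illustrative only).
price_a = pd.DataFrame({'Year': [2018, 2018], 'Hour': [1, 2], 'Price': [6.74, 4.74]})
price_b = pd.DataFrame({'Year': [2019, 2019], 'Hour': [1, 2], 'Price': [66.88, 66.88]})

# Default: original row labels are kept, so the index is 0, 1, 0, 1.
kept = pd.concat([price_a, price_b])

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([price_a, price_b], ignore_index=True)

print(kept.index.tolist(), renumbered.index.tolist())
```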
If one wants to have a clear picture of what happened, it works like this
(Source)
2. pandas.DataFrame.merge
In this section, we will consider a specific case: merging the index of one dataframe and the column of another dataframe.
Let's say one has the dataframe Geo with 54 columns, one of which is Date, of type datetime64[ns].
Date 1 2 ... 51 52 53
0 2010-01-01 00:00:00 0.565919 0.892376 ... 0.593049 0.775082 0.680621
1 2010-01-01 01:00:00 0.358960 0.531418 ... 0.734619 0.480450 0.926735
2 2010-01-01 02:00:00 0.531870 0.221768 ... 0.902369 0.027840 0.398864
3 2010-01-01 03:00:00 0.475463 0.245810 ... 0.306405 0.645762 0.541882
4 2010-01-01 04:00:00 0.954546 0.867960 ... 0.912257 0.039772 0.627696
And the dataframe Price that has one column with the price named Price, and the index corresponds to the dates (Date)
Price
Date
2010-01-01 00:00:00 29.10
2010-01-01 01:00:00 9.57
2010-01-01 02:00:00 0.00
2010-01-01 03:00:00 0.00
2010-01-01 04:00:00 0.00
In order to merge them, one can use pandas.DataFrame.merge as follows
df_merged = pd.merge(Price, Geo, left_index=True, right_on='Date')
where Geo and Price are the previous dataframes.
That results in the following dataframe
Price Date 1 ... 51 52 53
0 29.10 2010-01-01 00:00:00 0.565919 ... 0.593049 0.775082 0.680621
1 9.57 2010-01-01 01:00:00 0.358960 ... 0.734619 0.480450 0.926735
2 0.00 2010-01-01 02:00:00 0.531870 ... 0.902369 0.027840 0.398864
3 0.00 2010-01-01 03:00:00 0.475463 ... 0.306405 0.645762 0.541882
4 0.00 2010-01-01 04:00:00 0.954546 ... 0.912257 0.039772 0.627696

This post will go through the following topics:
Merging with index under different conditions
options for index-based joins: merge, join, concat
merging on indexes
merging on index of one, column of other
effectively using named indexes to simplify merging syntax
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use case.
DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
supports inner/left/right/full
can only join two at a time
supports column-column, index-column, index-index joins
DataFrame.join (join on index)
supports inner/left (default)/right/full
can join multiple DataFrames at a time
supports index-index joins
pd.concat (joins on index)
supports inner/full (default)
can join multiple DataFrames at a time
supports index-index joins
Index to index joins
Setup & Basics
import pandas as pd
import numpy as np
np.random.seed([3, 14])
left = pd.DataFrame(data={'value': np.random.randn(4)},
index=['A', 'B', 'C', 'D'])
right = pd.DataFrame(data={'value': np.random.randn(4)},
index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Other joins follow similar syntax.
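For instance, LEFT and FULL OUTER joins on the index differ only in the how argument. This is a small sketch with integer stand-in values instead of the random data above.

```python
import pandas as pd

left = pd.DataFrame({'value': [1, 2, 3, 4]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right = pd.DataFrame({'value': [5, 6, 7, 8]},
                     index=pd.Index(['B', 'D', 'E', 'F'], name='idxkey'))

# Keep every index label from left; unmatched rows get NaN in value_y.
left_join = left.merge(right, left_index=True, right_index=True, how='left')

# Keep the union of index labels from both frames.
outer_join = left.merge(right, left_index=True, right_index=True, how='outer')

print(left_join)
print(outer_join)
```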
Notable Alternatives
DataFrame.join defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
This happens because the column names are the same. It would not be a problem if they were named differently.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so join='inner' is required here.
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For more information on concat, see this post.
Index to Column joins
To perform an inner join using the index of left and a column of right, use DataFrame.merge with a combination of left_index=True and right_on=....
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.
join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.
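The set_index pre-step mentioned above could look like the following sketch. The column names lvalue and rvalue are chosen here so that join needs no suffixes; they are not from the original example.

```python
import pandas as pd

left = pd.DataFrame({'lvalue': [1, 2, 3, 4]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right2 = pd.DataFrame({'colkey': ['B', 'D', 'E', 'F'], 'rvalue': [5, 6, 7, 8]})

# Move the key column into the index so both sides are keyed by index,
# then join index to index.
joined = left.join(right2.set_index('colkey'), how='inner')
print(joined)
```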
Effectively using Named Index [pandas >= 0.23]
If your index is named, then from pandas >= 0.23, DataFrame.merge allows you to specify the index name to on (or left_on and right_on as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For the previous example of merging with the index of left, column of right, you can use left_on with the index name of left:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins*
Generalizing to multiple DataFrames
Cross join
* you are here

This post will go through the following topics:
how to correctly generalize to multiple DataFrames (and why merge has shortcomings here)
merging on unique keys
merging on non-unique keys
Generalizing to multiple DataFrames
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:
df1.merge(df2, ...).merge(df3, ...)
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
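For reference, the naive chaining can itself be generalized to any number of DataFrames with functools.reduce. This is a common idiom rather than something taken from this post, and the helper name merge_all is hypothetical. It still performs one pairwise merge per frame.

```python
from functools import reduce

import pandas as pd

def merge_all(frames, on, how='inner'):
    """Fold a list of DataFrames into one by merging them pairwise."""
    return reduce(lambda acc, df: acc.merge(df, on=on, how=how), frames)

A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': [1, 2, 3, 4]})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': [5, 6, 7, 8]})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': [1.0, 1.0, 1.0, 1.0]})

# Only 'D' appears in all three frames, so the inner result has one row.
result = merge_all([A, B, C], on='key')
print(result)
```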
Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.
# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note: the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
Multiway merge on unique keys
If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.
# Merge on `key` column. You'll need to set the index before concatenating
pd.concat(
[df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Merge on `key` index.
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).
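As a sketch of the FULL OUTER variant, the example below concatenates three index-keyed frames without join='inner'. The integer values are stand-ins for the random data in the setup.

```python
import pandas as pd

A2 = pd.DataFrame({'valueA': [1, 2, 3, 4]},
                  index=pd.Index(['A', 'B', 'C', 'D'], name='key'))
B2 = pd.DataFrame({'valueB': [5, 6, 7, 8]},
                  index=pd.Index(['B', 'D', 'E', 'F'], name='key'))
C2 = pd.DataFrame({'valueC': [1.0, 1.0, 1.0, 1.0]},
                  index=pd.Index(['D', 'E', 'J', 'C'], name='key'))

# join='outer' is the default: every key from any frame is kept,
# and missing values are filled with NaN.
full = pd.concat([A2, B2, C2], axis=1)
print(full)
```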
Multiway merge on keys with duplicates
concat is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).
# Join on `key` column. Set as the index first.
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join([B2, C2], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Join on `key` index.
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames *
Cross join
* you are here

Pandas at the moment does not support inequality joins within the merge syntax; one option is with the conditional_join function from pyjanitor - I am a contributor to this library:
# pip install pyjanitor
import pandas as pd
import janitor
left.conditional_join(right, ('value', 'value', '>'))
left right
key value key value
0 A 1.764052 D -0.977278
1 A 1.764052 F -0.151357
2 A 1.764052 E 0.950088
3 B 0.400157 D -0.977278
4 B 0.400157 F -0.151357
5 C 0.978738 D -0.977278
6 C 0.978738 F -0.151357
7 C 0.978738 E 0.950088
8 D 2.240893 D -0.977278
9 D 2.240893 F -0.151357
10 D 2.240893 E 0.950088
11 D 2.240893 B 1.867558
left.conditional_join(right, ('value', 'value', '<'))
left right
key value key value
0 A 1.764052 B 1.867558
1 B 0.400157 E 0.950088
2 B 0.400157 B 1.867558
3 C 0.978738 B 1.867558
The conditions are passed as a variable number of tuples, each tuple comprising a column from the left dataframe, a column from the right dataframe, and the join operator, which can be any of (>, <, >=, <=, !=). In the example above, a MultiIndex column is returned, because of overlaps in the column names.
Performance wise, this is better than a naive cross join:
np.random.seed(0)
dd = pd.DataFrame({'value':np.random.randint(100000, size=50_000)})
df = pd.DataFrame({'start':np.random.randint(100000, size=1_000),
'end':np.random.randint(100000, size=1_000)})
dd.head()
value
0 68268
1 43567
2 42613
3 45891
4 21243
df.head()
start end
0 71915 47005
1 64284 44913
2 13377 96626
3 75823 38673
4 29151 575
%%timeit
out = df.merge(dd, how='cross')
out.loc[(out.start < out.value) & (out.end > out.value)]
5.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
280 ms ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True)
124 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
out = df.merge(dd, how='cross')
out = out.loc[(out.start < out.value) & (out.end > out.value)]
A = df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
columns = A.columns.tolist()
A = A.sort_values(columns, ignore_index = True)
out = out.sort_values(columns, ignore_index = True)
A.equals(out)
True
Depending on the data size, you could get more performance when an equi join is present. In this case, pandas merge function is used, but the final data frame is delayed until the non-equi joins are computed. There is no numba support when equi conditions are present. Let's look at data from here:
import pandas as pd
import numpy as np
import random
import datetime
def random_dt_bw(start_date,end_date):
    days_between = (end_date - start_date).days
    random_num_days = random.randrange(days_between)
    random_dt = start_date + datetime.timedelta(days=random_num_days)
    return random_dt
def generate_data(n=1000):
    items = [f"i_{x}" for x in range(n)]
    start_dates = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(n)]
    end_dates = [x + datetime.timedelta(days=random.randint(1,10)) for x in start_dates]
    offerDf = pd.DataFrame({"Item":items,
                            "StartDt":start_dates,
                            "EndDt":end_dates})
    transaction_items = [f"i_{random.randint(0,n)}" for x in range(5*n)]
    transaction_dt = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(5*n)]
    sales_amt = [random.randint(0,1000) for x in range(5*n)]
    transactionDf = pd.DataFrame({"Item":transaction_items,"TransactionDt":transaction_dt,"Sales":sales_amt})
    return offerDf,transactionDf
offerDf,transactionDf = generate_data(n=100000)
# convert the date objects to datetime64 columns
offerDf = offerDf.assign(
    StartDt = pd.to_datetime(offerDf.StartDt),
    EndDt = pd.to_datetime(offerDf.EndDt)
)
transactionDf = transactionDf.assign(TransactionDt = pd.to_datetime(transactionDf.TransactionDt))
# you can get more performance when using ints/datetimes
# in the equi join, compared to strings
offerDf = offerDf.assign(Itemr = offerDf.Item.str[2:].astype(int))
transactionDf = transactionDf.assign(Itemr = transactionDf.Item.str[2:].astype(int))
transactionDf.head()
Item TransactionDt Sales Itemr
0 i_43407 2020-05-29 692 43407
1 i_95044 2020-07-22 964 95044
2 i_94560 2020-01-09 462 94560
3 i_11246 2020-02-26 690 11246
4 i_55974 2020-03-07 219 55974
offerDf.head()
Item StartDt EndDt Itemr
0 i_0 2020-04-18 2020-04-19 0
1 i_1 2020-02-28 2020-03-07 1
2 i_2 2020-03-28 2020-03-30 2
3 i_3 2020-08-03 2020-08-13 3
4 i_4 2020-05-26 2020-06-04 4
# merge on ints ... usually faster
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on strings
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on integers
cond_join_int = (transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
# merge on strings
cond_join_str = (transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
292 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
253 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
256 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
71.8 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# check that both dataframes are equal
cols = ['Item', 'TransactionDt', 'Sales', 'Itemr_y','StartDt', 'EndDt', 'Itemr_x']
cond_join_str = cond_join_str.drop(columns=('right', 'Item')).set_axis(cols, axis=1)
(cond_join_str
.sort_values(cond_join_str.columns.tolist())
.reset_index(drop=True)
.reindex(columns=classic_str.columns)
.equals(
classic_str
.sort_values(classic_str.columns.tolist())
.reset_index(drop=True)
))
True

I think this belongs in your explanation, since it is a relevant merge that I see fairly often; I believe it is called a cross join. It applies when two DataFrames share no columns, and it pairs every row of one with every row of the other. With single-row frames, as here, that amounts to placing the two DataFrames side by side:
The setup:
names1 = [{'A':'Jack', 'B':'Jill'}]
names2 = [{'C':'Tommy', 'D':'Tammy'}]
df1=pd.DataFrame(names1)
df2=pd.DataFrame(names2)
df_merged = pd.merge(df1.assign(X=1), df2.assign(X=1), on='X').drop(columns='X')
This creates a dummy X column, merges on the X, and then drops it to produce
df_merged:
A B C D
0 Jack Jill Tommy Tammy
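On pandas 1.2 or later, the dummy-column trick can be replaced by how='cross', which is also used in the timing comparison earlier in this thread. A minimal sketch, with an extra row added to df1 (an assumption for illustration) so the row pairing is visible:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['Jack', 'Jim'], 'B': ['Jill', 'Jane']})
df2 = pd.DataFrame({'C': ['Tommy'], 'D': ['Tammy']})

# how='cross' pairs every row of df1 with every row of df2;
# no key column is needed (requires pandas >= 1.2).
crossed = df1.merge(df2, how='cross')
print(crossed)
```

The result has len(df1) * len(df2) rows, so this can grow quickly on large frames.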

Related

Pandas merge on values from multiple columns [duplicate]

How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas?
How do I add NaNs for missing rows after a merge?
How do I get rid of NaNs after merging?
Can I merge on the index?
How do I merge multiple DataFrames?
Cross join with pandas
merge? join? concat? update? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
Table of Contents
For ease of access.
Merging basics - basic types of joins (read this first)
Index-based joins
Generalizing to multiple DataFrames
Cross join
This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
merging with different column names
merging with multiple columns
avoiding duplicate merge key column in output
What this post (and other posts by me on this thread) will not go through:
Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so
you can play with them. Also, see this
post
on how to read DataFrames from your clipboard.
Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Enough talk - just show me how to use merge!
Setup & Basics
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by
Note
This, along with the forthcoming figures all follow this convention:
blue indicates rows that are present in the merge result
red indicates rows that are excluded from the result (i.e., removed)
green indicates missing values that are replaced with NaNs in the result
To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
This returns only rows from left and right which share a common key (in this example, "B" and "D).
A LEFT OUTER JOIN, or LEFT JOIN is represented by
This can be performed by specifying how='left'.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...
...specify how='right':
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
Here, keys from right are used, and missing data from left is replaced by NaN.
Finally, for the FULL OUTER JOIN, given by
specify how='outer'.
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarizes these various merges nicely:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.
For LEFT-Excluding JOIN, represented as
Start by performing a LEFT OUTER JOIN and then filtering to rows coming from left only (excluding everything from the right),
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT-Excluding JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),
You can do this in similar fashion—
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
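The three excluding joins above share one pattern: merge with indicator=True, filter on _merge, then drop the helper column. A minimal sketch of that pattern wrapped in a function follows. The name excluding_join and its side argument are hypothetical (not part of pandas), and the frames are hard-coded copies of left and right.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                     'value': [1.764052, 0.400157, 0.978738, 2.240893]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                      'value': [1.867558, -0.977278, 0.950088, -0.151357]})

def excluding_join(left, right, on, side='left'):
    """Keep rows whose key appears only in `side`: 'left', 'right', or 'either'.

    Hypothetical helper, written for this example only.
    """
    how = {'left': 'left', 'right': 'right', 'either': 'outer'}[side]
    merged = left.merge(right, on=on, how=how, indicator=True)
    if side == 'either':
        mask = merged['_merge'] != 'both'
    else:
        mask = merged['_merge'] == f'{side}_only'
    # Drop the helper column once the filtering is done.
    return merged[mask].drop(columns='_merge')

left_only = excluding_join(left, right, on='key', side='left')
right_only = excluding_join(left, right, on='key', side='right')
either = excluding_join(left, right, on='key', side='either')
print(either)
```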
Different names for key columns
If the key columns are named differently (for example, left has keyLeft and right has keyRight instead of key), then you will have to specify left_on and right_on as arguments instead of on:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
Avoiding duplicate key column in output
When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
Contrast this with the output of the previous command (that is, left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')): keyLeft is now missing. You can choose which column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
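An alternative that is not from the original answer, sketched here under the assumption that you are doing an INNER JOIN: merge on the two columns as usual and drop the redundant key afterwards. For an inner join both key columns hold identical values, so nothing is lost. For outer joins the columns can differ (one side holds NaN), so this shortcut would discard information there. The frames are hard-coded copies of left2 and right2.

```python
import pandas as pd

left2 = pd.DataFrame({'keyLeft': ['A', 'B', 'C', 'D'],
                      'value': [1.764052, 0.400157, 0.978738, 2.240893]})
right2 = pd.DataFrame({'keyRight': ['B', 'D', 'E', 'F'],
                       'value': [1.867558, -0.977278, 0.950088, -0.151357]})

merged = left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

# In an inner join the two key columns are row-for-row identical.
keys_match = (merged['keyLeft'] == merged['keyRight']).all()

# Keep one key column and give it a neutral name.
result = merged.drop(columns='keyRight').rename(columns={'keyLeft': 'key'})
print(result)
```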
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
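To confirm that the two approaches agree, here is a small sketch that builds both results and compares them. It assumes the keys in right3 are unique, which map needs in order to look up a single value per key. The frames are hard-coded copies of left and right3.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                     'value': [1.764052, 0.400157, 0.978738, 2.240893]})
right3 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                       'value': [1.867558, -0.977278, 0.950088, -0.151357],
                       'newcol': np.arange(4)})

# Build a key -> newcol lookup; this relies on unique keys in right3.
lookup = right3.set_index('key')['newcol']
keys_are_unique = lookup.index.is_unique

via_map = left.assign(newcol=left['key'].map(lookup))
via_merge = left.merge(right3[['key', 'newcol']], on='key', how='left')

# Raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(via_map, via_merge)
```

If right3 could contain repeated keys, the merge version is the safer choice, since a LEFT JOIN against duplicate keys legitimately produces extra rows.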
Merging on multiple columns
To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).
left.merge(right, on=['key1', 'key2'], ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
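The two calls above are schematic, so here is a small runnable sketch. The frames orders and prices and their columns are invented for this example; only rows that match on both key columns survive the inner join.

```python
import pandas as pd

# Hypothetical frames sharing two key columns.
orders = pd.DataFrame({'key1': ['a', 'a', 'b'],
                       'key2': [1, 2, 1],
                       'lval': [10, 20, 30]})
prices = pd.DataFrame({'key1': ['a', 'b', 'b'],
                       'key2': [1, 1, 2],
                       'rval': [0.5, 0.7, 0.9]})

# Same names on both sides: pass a list to `on`.
same_names = orders.merge(prices, on=['key1', 'key2'])

# Different names: pass parallel lists to left_on and right_on.
prices_renamed = prices.rename(columns={'key1': 'rkey1', 'key2': 'rkey2'})
diff_names = orders.merge(prices_renamed,
                          left_on=['key1', 'key2'],
                          right_on=['rkey1', 'rkey2'])
print(same_names)
print(diff_names)
```

Note that with left_on/right_on all four key columns appear in the output, as in the single-key case.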
Other useful merge* operations and functions
Merging a DataFrame with Series on index: See this answer.
Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.
pd.merge_ordered is a useful function for ordered JOINs.
pd.merge_asof (read: merge_asOf) is useful for approximate joins.
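As a brief illustration of an approximate join, the sketch below uses pd.merge_asof with its default backward direction: each row on the left picks up the most recent row on the right whose key is less than or equal to its own. The trades and quotes frames are invented for this example, and both must be sorted by the key.

```python
import pandas as pd

# Hypothetical data, already sorted by 'time' as merge_asof requires.
quotes = pd.DataFrame({'time': [1, 5, 10], 'bid': [100.0, 101.0, 102.0]})
trades = pd.DataFrame({'time': [0, 2, 6, 11], 'qty': [5, 10, 20, 30]})

# For each trade, take the latest quote at or before the trade time.
matched = pd.merge_asof(trades, quotes, on='time')
print(matched)
```

The trade at time 0 has no earlier quote, so its bid is NaN.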
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins *
Index-based joins
Generalizing to multiple DataFrames
Cross join
*You are here.
A supplemental visual view of pd.concat([df0, df1], kwargs).
Notice that the meaning of the kwarg axis=0 or axis=1 is not as intuitive here as it is for df.mean() or df.apply(func).
Joins 101
These animations may explain the joins better visually.
Credits: Garrick Aden-Buie tidyexplain repo
Inner Join
Outer Join or Full Join
Right Join
Left Join
In this answer, I will consider practical examples of:
pandas.concat
pandas.DataFrame.merge to merge dataframes from the index of one and the column of another one.
We will be using different dataframes for each of the cases.
1. pandas.concat
Considering the following DataFrames with the same column names:
Price2018 with size (8784, 5)
Year Month Day Hour Price
0 2018 1 1 1 6.74
1 2018 1 1 2 4.74
2 2018 1 1 3 3.66
3 2018 1 1 4 2.30
4 2018 1 1 5 2.30
5 2018 1 1 6 2.06
6 2018 1 1 7 2.06
7 2018 1 1 8 2.06
8 2018 1 1 9 2.30
9 2018 1 1 10 2.30
Price2019 with size (8760, 5)
Year Month Day Hour Price
0 2019 1 1 1 66.88
1 2019 1 1 2 66.88
2 2019 1 1 3 66.00
3 2019 1 1 4 63.64
4 2019 1 1 5 58.85
5 2019 1 1 6 55.47
6 2019 1 1 7 56.00
7 2019 1 1 8 61.09
8 2019 1 1 9 61.01
9 2019 1 1 10 61.00
One can combine them using pandas.concat, by simply
import pandas as pd
frames = [Price2018, Price2019]
df_merged = pd.concat(frames)
Which results in a DataFrame with size (17544, 5)
If one wants to have a clear picture of what happened, it works like this
(Source)
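A scaled-down sketch of the same row-wise concatenation, using two tiny made-up frames in place of Price2018 and Price2019. It also shows that the original row labels are kept unless ignore_index=True is passed.

```python
import pandas as pd

# Tiny stand-ins for Price2018 and Price2019 (values are illustrative).
price_a = pd.DataFrame({'Year': [2018, 2018], 'Hour': [1, 2],
                        'Price': [6.74, 4.74]})
price_b = pd.DataFrame({'Year': [2019, 2019, 2019], 'Hour': [1, 2, 3],
                        'Price': [66.88, 66.88, 66.00]})

# Rows are stacked; each frame keeps its own index labels.
stacked = pd.concat([price_a, price_b])

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([price_a, price_b], ignore_index=True)

print(stacked.shape)
print(renumbered)
```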
2. pandas.DataFrame.merge
In this section, we will consider a specific case: merging the index of one dataframe and the column of another dataframe.
Let's say one has the dataframe Geo with 54 columns, one of which is Date, of type datetime64[ns].
Date 1 2 ... 51 52 53
0 2010-01-01 00:00:00 0.565919 0.892376 ... 0.593049 0.775082 0.680621
1 2010-01-01 01:00:00 0.358960 0.531418 ... 0.734619 0.480450 0.926735
2 2010-01-01 02:00:00 0.531870 0.221768 ... 0.902369 0.027840 0.398864
3 2010-01-01 03:00:00 0.475463 0.245810 ... 0.306405 0.645762 0.541882
4 2010-01-01 04:00:00 0.954546 0.867960 ... 0.912257 0.039772 0.627696
And the dataframe Price, which has a single column named Price and whose index corresponds to the dates (Date):
Price
Date
2010-01-01 00:00:00 29.10
2010-01-01 01:00:00 9.57
2010-01-01 02:00:00 0.00
2010-01-01 03:00:00 0.00
2010-01-01 04:00:00 0.00
In order to merge them, one can use pandas.DataFrame.merge as follows
df_merged = pd.merge(Price, Geo, left_index=True, right_on='Date')
where Geo and Price are the previous dataframes.
That results in the following dataframe
Price Date 1 ... 51 52 53
0 29.10 2010-01-01 00:00:00 0.565919 ... 0.593049 0.775082 0.680621
1 9.57 2010-01-01 01:00:00 0.358960 ... 0.734619 0.480450 0.926735
2 0.00 2010-01-01 02:00:00 0.531870 ... 0.902369 0.027840 0.398864
3 0.00 2010-01-01 03:00:00 0.475463 ... 0.306405 0.645762 0.541882
4 0.00 2010-01-01 04:00:00 0.954546 ... 0.912257 0.039772 0.627696
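A reduced sketch of the same index-to-column merge, with small made-up frames geo and price standing in for Geo and Price (a single feature column named feature_1 is an assumption for brevity). Rows of geo whose Date has no match in the index of price are dropped, since the default is an inner join.

```python
import pandas as pd

stamps = pd.to_datetime(['2010-01-01 00:00', '2010-01-01 01:00',
                         '2010-01-01 02:00'])

# Stand-in for Geo: Date is an ordinary column.
geo = pd.DataFrame({'Date': stamps,
                    'feature_1': [0.565919, 0.358960, 0.531870]})

# Stand-in for Price: the dates live in the index, and one date is missing.
price = pd.DataFrame({'Price': [29.10, 9.57]},
                     index=pd.Index(stamps[:2], name='Date'))

# Match the index of price against the Date column of geo.
merged = pd.merge(price, geo, left_index=True, right_on='Date')
print(merged)
```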
This post will go through the following topics:
Merging with index under different conditions
options for index-based joins: merge, join, concat
merging on indexes
merging on index of one, column of other
effectively using named indexes to simplify merging syntax
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use case.
DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
supports inner/left/right/full
can only join two at a time
supports column-column, index-column, index-index joins
DataFrame.join (join on index)
supports inner/left (default)/right/full
can join multiple DataFrames at a time
supports index-index joins
pd.concat (joins on index)
supports inner/full (default)
can join multiple DataFrames at a time
supports index-index joins
Index to index joins
Setup & Basics
import pandas as pd
import numpy as np
np.random.seed([3, 14])
left = pd.DataFrame(data={'value': np.random.randn(4)},
index=['A', 'B', 'C', 'D'])
right = pd.DataFrame(data={'value': np.random.randn(4)},
index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Other joins follow similar syntax.
Notable Alternatives
DataFrame.join defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
This happens because the column names are the same. It would not be a problem if they were named differently.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so join='inner' is required here.
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For more information on concat, see this post.
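To see that the three index-to-index options give the same rows, the sketch below runs all of them on hard-coded copies of left and right and compares the values. concat keeps the duplicate column name value rather than adding suffixes, so the comparison is on the underlying arrays.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'value': [-0.602923, -0.402655, 0.302329, -0.524349]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right = pd.DataFrame({'value': [0.543843, 0.013135, -0.326498, 1.385076]},
                     index=pd.Index(['B', 'D', 'E', 'F'], name='idxkey'))

via_merge = left.merge(right, left_index=True, right_index=True)
via_join = left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
via_concat = pd.concat([left, right], axis=1, join='inner')

# Same rows and values from all three; only the column labels differ for concat.
same_values = (np.array_equal(via_merge.to_numpy(), via_join.to_numpy())
               and np.array_equal(via_merge.to_numpy(), via_concat.to_numpy()))
print(same_values)
print(via_concat.columns.tolist())
```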
Index to Column joins
To perform an inner join using the index of left and a column of right, use DataFrame.merge with left_index=True and right_on=... together.
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.
join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.
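A minimal sketch of that pre-step, assuming the same left and right2 frames (hard-coded here): moving colkey into the index turns the mixed join into an ordinary index-to-index join.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'value': [-0.602923, -0.402655, 0.302329, -0.524349]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right2 = pd.DataFrame({'colkey': ['B', 'D', 'E', 'F'],
                       'value': [0.543843, 0.013135, -0.326498, 1.385076]})

# Pre-step: promote the key column to the index, then join index to index.
via_join = left.join(right2.set_index('colkey'), how='inner',
                     lsuffix='_x', rsuffix='_y')

# Reference result using merge directly on index and column.
via_merge = left.merge(right2, left_index=True, right_on='colkey')

value_cols = ['value_x', 'value_y']
same_values = np.allclose(via_join[value_cols].to_numpy(),
                          via_merge[value_cols].to_numpy())
print(via_join)
```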
Effectively using Named Index [pandas >= 0.23]
If your index is named, then from pandas >= 0.23, DataFrame.merge allows you to specify the index name to on (or left_on and right_on as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For the previous example of merging with the index of left, column of right, you can use left_on with the index name of left:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins*
Generalizing to multiple DataFrames
Cross join
* you are here
This post will go through the following topics:
how to correctly generalize to multiple DataFrames (and why merge has shortcomings here)
merging on unique keys
merging on non-unique keys
Generalizing to multiple DataFrames
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:
df1.merge(df2, ...).merge(df3, ...)
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
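One common way to express the chained form for a list of arbitrary length is functools.reduce. This is only a sketch of the naive chaining approach, not the method recommended in the rest of this post; the helper name merge_all and the frames df1, df2, df3 (with placeholder values) are made up for the example.

```python
from functools import reduce

import pandas as pd

# Placeholder frames; only the keys matter for this illustration.
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'v1': [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'v2': [5.0, 6.0, 7.0, 8.0]})
df3 = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'v3': [1.0, 1.0, 1.0, 1.0]})

def merge_all(frames, **merge_kwargs):
    """Fold a list of DataFrames into one by repeated pairwise merge."""
    return reduce(lambda acc, df: acc.merge(df, **merge_kwargs), frames)

inner = merge_all([df1, df2, df3], on='key')
outer = merge_all([df1, df2, df3], on='key', how='outer')
print(inner)
print(len(outer))
```

Each step still builds a full intermediate DataFrame, which is one reason the alternatives discussed next can be preferable.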
Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.
# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note: the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
Multiway merge on unique keys
If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.
# Merge on `key` column. You'll need to set the index before concatenating
pd.concat(
[df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Merge on `key` index.
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).
Multiway merge on keys with duplicates
concat is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).
# Join on `key` column. Set as the index first.
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join([B2, C2], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Join on `key` index.
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0
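The choice between concat and join can be made programmatically by checking whether every key is unique. The helper multiway_inner below is hypothetical (written for this example), and the frames use placeholder values with the same keys as A3, B, and C.

```python
import pandas as pd

# Placeholder values; A3 has a duplicated 'D' key.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'],
                   'valueA': [0.1, 0.2, 0.3, 0.4, 0.5]})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                  'valueB': [1.0, 2.0, 3.0, 4.0]})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'],
                  'valueC': [1.0, 1.0, 1.0, 1.0]})

def multiway_inner(frames, key):
    """Inner-join several frames on `key`, using concat only when it is safe."""
    indexed = [df.set_index(key) for df in frames]
    if all(df.index.is_unique for df in indexed):
        return pd.concat(indexed, axis=1, join='inner')
    # Fall back to join, which tolerates repeated keys.
    first, *rest = indexed
    return first.join(rest, how='inner')

with_dupes = multiway_inner([A3, B, C], 'key')
unique_only = multiway_inner([A3.drop_duplicates('key'), B, C], 'key')
print(with_dupes)
print(unique_only)
```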
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames *
Cross join
* you are here
Pandas at the moment does not support inequality joins within the merge syntax; one option is with the conditional_join function from pyjanitor - I am a contributor to this library:
# pip install pyjanitor
import pandas as pd
import janitor
left.conditional_join(right, ('value', 'value', '>'))
left right
key value key value
0 A 1.764052 D -0.977278
1 A 1.764052 F -0.151357
2 A 1.764052 E 0.950088
3 B 0.400157 D -0.977278
4 B 0.400157 F -0.151357
5 C 0.978738 D -0.977278
6 C 0.978738 F -0.151357
7 C 0.978738 E 0.950088
8 D 2.240893 D -0.977278
9 D 2.240893 F -0.151357
10 D 2.240893 E 0.950088
11 D 2.240893 B 1.867558
left.conditional_join(right, ('value', 'value', '<'))
left right
key value key value
0 A 1.764052 B 1.867558
1 B 0.400157 E 0.950088
2 B 0.400157 B 1.867558
3 C 0.978738 B 1.867558
The columns are passed as a variable argument of tuples, each tuple comprising a column from the left dataframe, a column from the right dataframe, and the join operator, which can be any of (>, <, >=, <=, !=). In the example above, a MultiIndex column is returned, because of overlaps in the column names.
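For comparison, the same pairs can be reproduced in plain pandas with a cross join followed by a filter. This sketch assumes pandas >= 1.2 (for how='cross') and uses hard-coded copies of left and right. It shows the naive approach only and says nothing about how conditional_join works internally.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                     'value': [1.764052, 0.400157, 0.978738, 2.240893]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                      'value': [1.867558, -0.977278, 0.950088, -0.151357]})

# Every left row paired with every right row: 4 x 4 = 16 candidate pairs.
pairs = left.merge(right, how='cross', suffixes=('_left', '_right'))

# Keep only the pairs that satisfy the inequality.
greater = pairs[pairs['value_left'] > pairs['value_right']]
less = pairs[pairs['value_left'] < pairs['value_right']]

print(len(pairs), len(greater), len(less))
```

The row counts match the two conditional_join outputs above (12 and 4), but the intermediate frame grows with the product of the two lengths.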
Performance wise, this is better than a naive cross join:
np.random.seed(0)
dd = pd.DataFrame({'value':np.random.randint(100000, size=50_000)})
df = pd.DataFrame({'start':np.random.randint(100000, size=1_000),
'end':np.random.randint(100000, size=1_000)})
dd.head()
value
0 68268
1 43567
2 42613
3 45891
4 21243
df.head()
start end
0 71915 47005
1 64284 44913
2 13377 96626
3 75823 38673
4 29151 575
%%timeit
out = df.merge(dd, how='cross')
out.loc[(out.start < out.value) & (out.end > out.value)]
5.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
280 ms ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True)
124 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
out = df.merge(dd, how='cross')
out = out.loc[(out.start < out.value) & (out.end > out.value)]
A = df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
columns = A.columns.tolist()
A = A.sort_values(columns, ignore_index = True)
out = out.sort_values(columns, ignore_index = True)
A.equals(out)
True
Depending on the data size, you could get more performance when an equi join is present. In this case, pandas merge function is used, but the final data frame is delayed until the non-equi joins are computed. There is no numba support when equi conditions are present. Let's look at data from here:
import pandas as pd
import numpy as np
import random
import datetime
def random_dt_bw(start_date, end_date):
    days_between = (end_date - start_date).days
    random_num_days = random.randrange(days_between)
    random_dt = start_date + datetime.timedelta(days=random_num_days)
    return random_dt
def generate_data(n=1000):
    items = [f"i_{x}" for x in range(n)]
    start_dates = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(n)]
    end_dates = [x + datetime.timedelta(days=random.randint(1,10)) for x in start_dates]
    offerDf = pd.DataFrame({"Item":items,
                            "StartDt":start_dates,
                            "EndDt":end_dates})
    transaction_items = [f"i_{random.randint(0,n)}" for x in range(5*n)]
    transaction_dt = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(5*n)]
    sales_amt = [random.randint(0,1000) for x in range(5*n)]
    transactionDf = pd.DataFrame({"Item":transaction_items,"TransactionDt":transaction_dt,"Sales":sales_amt})
    return offerDf,transactionDf
offerDf,transactionDf = generate_data(n=100000)
# pd.to_datetime is used because unit-less astype(np.datetime64) raises in pandas >= 2.0
offerDf = (offerDf
           .assign(StartDt = pd.to_datetime(offerDf.StartDt),
                   EndDt = pd.to_datetime(offerDf.EndDt)
                   )
           )
transactionDf = transactionDf.assign(TransactionDt = pd.to_datetime(transactionDf.TransactionDt))
# you can get more performance when using ints/datetimes
# in the equi join, compared to strings
offerDf = offerDf.assign(Itemr = offerDf.Item.str[2:].astype(int))
transactionDf = transactionDf.assign(Itemr = transactionDf.Item.str[2:].astype(int))
transactionDf.head()
Item TransactionDt Sales Itemr
0 i_43407 2020-05-29 692 43407
1 i_95044 2020-07-22 964 95044
2 i_94560 2020-01-09 462 94560
3 i_11246 2020-02-26 690 11246
4 i_55974 2020-03-07 219 55974
offerDf.head()
Item StartDt EndDt Itemr
0 i_0 2020-04-18 2020-04-19 0
1 i_1 2020-02-28 2020-03-07 1
2 i_2 2020-03-28 2020-03-30 2
3 i_3 2020-08-03 2020-08-13 3
4 i_4 2020-05-26 2020-06-04 4
# merge on ints ... usually faster
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on strings
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on integers
cond_join_int = (transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
# merge on strings
cond_join_str = (transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
292 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
253 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
256 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
71.8 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# check that both dataframes are equal
cols = ['Item', 'TransactionDt', 'Sales', 'Itemr_y','StartDt', 'EndDt', 'Itemr_x']
cond_join_str = cond_join_str.drop(columns=('right', 'Item')).set_axis(cols, axis=1)
(cond_join_str
.sort_values(cond_join_str.columns.tolist())
.reset_index(drop=True)
.reindex(columns=classic_str.columns)
.equals(
classic_str
.sort_values(classic_str.columns.tolist())
.reset_index(drop=True)
))
True
I think you should include this in your explanation, as it is a relevant merge that I see fairly often; I believe it is termed a cross join. This is a merge that occurs when distinct DataFrames share no columns, and here it simply places the two DataFrames side by side (more generally, every row of one is paired with every row of the other):
The setup:
names1 = [{'A':'Jack', 'B':'Jill'}]
names2 = [{'C':'Tommy', 'D':'Tammy'}]
df1=pd.DataFrame(names1)
df2=pd.DataFrame(names2)
df_merged = pd.merge(df1.assign(X=1), df2.assign(X=1), on='X').drop(columns='X')
This creates a dummy X column, merges on the X, and then drops it to produce
df_merged:
A B C D
0 Jack Jill Tommy Tammy
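On pandas >= 1.2 the dummy column is not needed, because merge accepts how='cross' directly. The sketch below assumes that version and checks the result against the dummy-key approach. The extra frames small and big are made up to show that a cross join pairs every row with every row.

```python
import pandas as pd

df1 = pd.DataFrame([{'A': 'Jack', 'B': 'Jill'}])
df2 = pd.DataFrame([{'C': 'Tommy', 'D': 'Tammy'}])

# Dummy-key approach from above.
via_dummy = pd.merge(df1.assign(X=1), df2.assign(X=1), on='X').drop(columns='X')

# Built-in cross join (pandas >= 1.2).
via_cross = df1.merge(df2, how='cross')

# With more rows, the result is the Cartesian product: 2 x 3 = 6 rows.
small = pd.DataFrame({'left_id': [1, 2]})
big = pd.DataFrame({'right_id': ['x', 'y', 'z']})
product = small.merge(big, how='cross')

print(via_cross)
print(len(product))
```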

Concatenating Two Columns while keeping only the inner data [duplicate]

How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas?
How do I add NaNs for missing rows after a merge?
How do I get rid of NaNs after merging?
Can I merge on the index?
How do I merge multiple DataFrames?
Cross join with pandas
merge? join? concat? update? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
Table of Contents
For ease of access.
Merging basics - basic types of joins (read this first)
Index-based joins
Generalizing to multiple DataFrames
Cross join
This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
merging with different column names
merging with multiple columns
avoiding duplicate merge key column in output
What this post (and other posts by me on this thread) will not go through:
Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so
you can play with them. Also, see this
post
on how to read DataFrames from your clipboard.
Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Enough talk - just show me how to use merge!
Setup & Basics
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by
Note
This, along with the forthcoming figures all follow this convention:
blue indicates rows that are present in the merge result
red indicates rows that are excluded from the result (i.e., removed)
green indicates missing values that are replaced with NaNs in the result
To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
This returns only rows from left and right which share a common key (in this example, "B" and "D).
A LEFT OUTER JOIN, or LEFT JOIN is represented by
This can be performed by specifying how='left'.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...
...specify how='right':
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
Here, keys from right are used, and missing data from left is replaced by NaN.
Finally, for the FULL OUTER JOIN, given by
specify how='outer'.
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarizes these various merges nicely:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.
For LEFT-Excluding JOIN, represented as
Start by performing a LEFT OUTER JOIN and then filtering to rows coming from left only (excluding everything from the right),
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT-Excluding JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),
You can do this in similar fashion—
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
Different names for key columns
If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
Avoiding duplicate key column in output
When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
Merging on multiple columns
To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
Other useful merge* operations and functions
Merging a DataFrame with Series on index: See this answer.
Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.
pd.merge_ordered is a useful function for ordered JOINs.
pd.merge_asof (read: merge_asOf) is useful for approximate joins.
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins *
Index-based joins
Generalizing to multiple DataFrames
Cross join
*You are here.
A supplemental visual view of pd.concat([df0, df1], kwargs).
Notice that, kwarg axis=0 or axis=1 's meaning is not as intuitive as df.mean() or df.apply(func)
Joins 101
These animations might be better to explain you visually.
Credits: Garrick Aden-Buie tidyexplain repo
Inner Join
Outer Join or Full Join
Right Join
Left Join
In this answer, I will consider practical examples of:
pandas.concat
pandas.DataFrame.merge to merge dataframes from the index of one and the column of another one.
We will be using different dataframes for each of the cases.
1. pandas.concat
Considering the following DataFrames with the same column names:
Price2018 with size (8784, 5)
Year Month Day Hour Price
0 2018 1 1 1 6.74
1 2018 1 1 2 4.74
2 2018 1 1 3 3.66
3 2018 1 1 4 2.30
4 2018 1 1 5 2.30
5 2018 1 1 6 2.06
6 2018 1 1 7 2.06
7 2018 1 1 8 2.06
8 2018 1 1 9 2.30
9 2018 1 1 10 2.30
Price2019 with size (8760, 5)
Year Month Day Hour Price
0 2019 1 1 1 66.88
1 2019 1 1 2 66.88
2 2019 1 1 3 66.00
3 2019 1 1 4 63.64
4 2019 1 1 5 58.85
5 2019 1 1 6 55.47
6 2019 1 1 7 56.00
7 2019 1 1 8 61.09
8 2019 1 1 9 61.01
9 2019 1 1 10 61.00
One can combine them using pandas.concat, by simply
import pandas as pd
frames = [Price2018, Price2019]
df_merged = pd.concat(frames)
Which results in a DataFrame with size (17544, 5)
If one wants to have a clear picture of what happened, it works like this
(Source)
2. pandas.DataFrame.merge
In this section, we will consider a specific case: merging the index of one dataframe and the column of another dataframe.
Let's say one has the dataframe Geo with 54 columns, being one of the columns the Date, which is of type datetime64[ns].
Date 1 2 ... 51 52 53
0 2010-01-01 00:00:00 0.565919 0.892376 ... 0.593049 0.775082 0.680621
1 2010-01-01 01:00:00 0.358960 0.531418 ... 0.734619 0.480450 0.926735
2 2010-01-01 02:00:00 0.531870 0.221768 ... 0.902369 0.027840 0.398864
3 2010-01-01 03:00:00 0.475463 0.245810 ... 0.306405 0.645762 0.541882
4 2010-01-01 04:00:00 0.954546 0.867960 ... 0.912257 0.039772 0.627696
And the dataframe Price that has one column with the price named Price, and the index corresponds to the dates (Date)
Price
Date
2010-01-01 00:00:00 29.10
2010-01-01 01:00:00 9.57
2010-01-01 02:00:00 0.00
2010-01-01 03:00:00 0.00
2010-01-01 04:00:00 0.00
In order to merge them, one can use pandas.DataFrame.merge as follows
df_merged = pd.merge(Price, Geo, left_index=True, right_on='Date')
where Geo and Price are the previous dataframes.
That results in the following dataframe
Price Date 1 ... 51 52 53
0 29.10 2010-01-01 00:00:00 0.565919 ... 0.593049 0.775082 0.680621
1 9.57 2010-01-01 01:00:00 0.358960 ... 0.734619 0.480450 0.926735
2 0.00 2010-01-01 02:00:00 0.531870 ... 0.902369 0.027840 0.398864
3 0.00 2010-01-01 03:00:00 0.475463 ... 0.306405 0.645762 0.541882
4 0.00 2010-01-01 04:00:00 0.954546 ... 0.912257 0.039772 0.627696
This post will go through the following topics:
Merging with index under different conditions
options for index-based joins: merge, join, concat
merging on indexes
merging on index of one, column of other
effectively using named indexes to simplify merging syntax
BACK TO TOP
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use
case.
DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
supports inner/left/right/full
can only join two at a time
supports column-column, index-column, index-index joins
DataFrame.join (join on index)
supports inner/left (default)/right/full
can join multiple DataFrames at a time
supports index-index joins
pd.concat (joins on index)
supports inner/full (default)
can join multiple DataFrames at a time
supports index-index joins
Index to index joins
Setup & Basics
import pandas as pd
import numpy as np
np.random.seed([3, 14])
left = pd.DataFrame(data={'value': np.random.randn(4)},
index=['A', 'B', 'C', 'D'])
right = pd.DataFrame(data={'value': np.random.randn(4)},
index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Other joins follow similar syntax.
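To make "similar syntax" concrete, here is a small sketch that runs all four join types on the index by changing only the how argument. The frames use made-up values instead of the seeded random ones, and the results dict is just an illustration:

```python
import pandas as pd

left = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(list("ABCD"), name="idxkey"))
right = pd.DataFrame({"value": [5.0, 6.0, 7.0, 8.0]},
                     index=pd.Index(list("BDEF"), name="idxkey"))

# Same call each time; only `how` changes.
results = {
    how: left.merge(right, left_index=True, right_index=True, how=how)
    for how in ("inner", "left", "right", "outer")
}

for how, res in results.items():
    print(how, res.index.tolist())
```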
Notable Alternatives
DataFrame.join defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Since the column names are the same. This would not be a problem if they were differently named.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so join='inner' is required here.
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For more information on concat, see this post.
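Note that concat does not add suffixes, so the result above has two columns both called value. One way around that, sketched here with made-up values, is the keys argument, which labels each input and yields MultiIndex columns (the labels "left" and "right" are my own choice):

```python
import pandas as pd

left = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(list("ABCD"), name="idxkey"))
right = pd.DataFrame({"value": [5.0, 6.0, 7.0, 8.0]},
                     index=pd.Index(list("BDEF"), name="idxkey"))

# keys= adds an outer column level, so the two "value" columns stay distinguishable.
combined = pd.concat([left, right], axis=1, join="inner", keys=["left", "right"])

print(combined.columns.tolist())
print(combined[("right", "value")])
```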
Index to Column joins
To perform an inner join using the index of left and a column of right, you will use DataFrame.merge with a combination of left_index=True and right_on=....
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.
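concat is not capable of mixed merges, and join only supports them in one direction (a column of the calling DataFrame against the index of the other, through its on argument). In other cases, you will need to set the index as a pre-step using DataFrame.set_index.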
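As a rough sketch of the set_index pre-step, here is the same index-to-column inner join done with join. The values are made up, and the lsuffix/rsuffix choices simply mirror what merge would produce:

```python
import pandas as pd

left = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(list("ABCD"), name="idxkey"))
right2 = pd.DataFrame({"colkey": list("BDEF"), "value": [5.0, 6.0, 7.0, 8.0]})

# Move the key column of right2 into its index, then join index to index.
joined = left.join(right2.set_index("colkey"), how="inner",
                   lsuffix="_x", rsuffix="_y")

print(joined)
```

Unlike the merge version, the key ends up in the index of the result instead of in a colkey column.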
Effectively using Named Index [pandas >= 0.23]
If your index is named, then from pandas >= 0.23, DataFrame.merge allows you to specify the index name to on (or left_on and right_on as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For the previous example of merging with the index of left, column of right, you can use left_on with the index name of left:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins*
Generalizing to multiple DataFrames
Cross join
* you are here
This post will go through the following topics:
how to correctly generalize to multiple DataFrames (and why merge has shortcomings here)
merging on unique keys
merging on non-unique keys
BACK TO TOP
Generalizing to multiple DataFrames
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:
df1.merge(df2, ...).merge(df3, ...)
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
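For completeness, the chained form can be written for an unknown number of DataFrames with functools.reduce. This is only a sketch of the naive approach with made-up values; it still performs one pairwise merge per frame:

```python
from functools import reduce

import pandas as pd

frames = [
    pd.DataFrame({"key": ["A", "B", "C", "D"], "valueA": [1, 2, 3, 4]}),
    pd.DataFrame({"key": ["B", "D", "E", "F"], "valueB": [5, 6, 7, 8]}),
    pd.DataFrame({"key": ["D", "E", "J", "C"], "valueC": [9, 10, 11, 12]}),
]

# Fold the list pairwise: ((frames[0] merge frames[1]) merge frames[2]) ...
merged = reduce(lambda acc, nxt: acc.merge(nxt, on="key", how="inner"), frames)

print(merged)
```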
Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.
# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note: the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
Multiway merge on unique keys
If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.
# Merge on `key` column. You'll need to set the index before concatenating
pd.concat(
[df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Merge on `key` index.
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).
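A small sketch of the FULL OUTER case, with made-up values and frames that are already indexed by key: every key from any frame is kept, and the gaps are filled with NaN.

```python
import pandas as pd

A2 = pd.DataFrame({"valueA": [1.0, 2.0, 3.0, 4.0]}, index=pd.Index(list("ABCD"), name="key"))
B2 = pd.DataFrame({"valueB": [5.0, 6.0, 7.0, 8.0]}, index=pd.Index(list("BDEF"), name="key"))
C2 = pd.DataFrame({"valueC": [1.0, 1.0, 1.0, 1.0]}, index=pd.Index(list("DEJC"), name="key"))

# No join= argument: concat falls back to its default, a full outer join on the index.
full = pd.concat([A2, B2, C2], axis=1)

print(full)
```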
Multiway merge on keys with duplicates
concat is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).
# Join on `key` column. Set as the index first.
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join([B2, C2], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Join on `key` index.
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames *
Cross join
* you are here
Pandas at the moment does not support inequality joins within the merge syntax; one option is with the conditional_join function from pyjanitor - I am a contributor to this library:
# pip install pyjanitor
import pandas as pd
import janitor
left.conditional_join(right, ('value', 'value', '>'))
left right
key value key value
0 A 1.764052 D -0.977278
1 A 1.764052 F -0.151357
2 A 1.764052 E 0.950088
3 B 0.400157 D -0.977278
4 B 0.400157 F -0.151357
5 C 0.978738 D -0.977278
6 C 0.978738 F -0.151357
7 C 0.978738 E 0.950088
8 D 2.240893 D -0.977278
9 D 2.240893 F -0.151357
10 D 2.240893 E 0.950088
11 D 2.240893 B 1.867558
left.conditional_join(right, ('value', 'value', '<'))
left right
key value key value
0 A 1.764052 B 1.867558
1 B 0.400157 E 0.950088
2 B 0.400157 B 1.867558
3 C 0.978738 B 1.867558
The columns are passed as a variable argument of tuples, each tuple comprising a column from the left dataframe, a column from the right dataframe, and the join operator, which can be any of (>, <, >=, <=, !=). In the example above, a MultiIndex column is returned because of overlaps in the column names.
Performance-wise, this is better than a naive cross join:
np.random.seed(0)
dd = pd.DataFrame({'value':np.random.randint(100000, size=50_000)})
df = pd.DataFrame({'start':np.random.randint(100000, size=1_000),
'end':np.random.randint(100000, size=1_000)})
dd.head()
value
0 68268
1 43567
2 42613
3 45891
4 21243
df.head()
start end
0 71915 47005
1 64284 44913
2 13377 96626
3 75823 38673
4 29151 575
%%timeit
out = df.merge(dd, how='cross')
out.loc[(out.start < out.value) & (out.end > out.value)]
5.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
280 ms ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True)
124 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
out = df.merge(dd, how='cross')
out = out.loc[(out.start < out.value) & (out.end > out.value)]
A = df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
columns = A.columns.tolist()
A = A.sort_values(columns, ignore_index = True)
out = out.sort_values(columns, ignore_index = True)
A.equals(out)
True
Depending on the data size, you could get more performance when an equi join is present. In this case, pandas merge function is used, but the final data frame is delayed until the non-equi joins are computed. There is no numba support when equi conditions are present. Let's look at data from here:
import pandas as pd
import numpy as np
import random
import datetime
def random_dt_bw(start_date,end_date):
    days_between = (end_date - start_date).days
    random_num_days = random.randrange(days_between)
    random_dt = start_date + datetime.timedelta(days=random_num_days)
    return random_dt
def generate_data(n=1000):
    items = [f"i_{x}" for x in range(n)]
    start_dates = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(n)]
    end_dates = [x + datetime.timedelta(days=random.randint(1,10)) for x in start_dates]
    offerDf = pd.DataFrame({"Item":items,
                            "StartDt":start_dates,
                            "EndDt":end_dates})
    transaction_items = [f"i_{random.randint(0,n)}" for x in range(5*n)]
    transaction_dt = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(5*n)]
    sales_amt = [random.randint(0,1000) for x in range(5*n)]
    transactionDf = pd.DataFrame({"Item":transaction_items,"TransactionDt":transaction_dt,"Sales":sales_amt})
    return offerDf,transactionDf
offerDf,transactionDf = generate_data(n=100000)
offerDf = (offerDf
           .assign(StartDt = pd.to_datetime(offerDf.StartDt),
                   EndDt = pd.to_datetime(offerDf.EndDt)
                  )
          )
transactionDf = transactionDf.assign(TransactionDt = pd.to_datetime(transactionDf.TransactionDt))
# you can get more performance when using ints/datetimes
# in the equi join, compared to strings
offerDf = offerDf.assign(Itemr = offerDf.Item.str[2:].astype(int))
transactionDf = transactionDf.assign(Itemr = transactionDf.Item.str[2:].astype(int))
transactionDf.head()
Item TransactionDt Sales Itemr
0 i_43407 2020-05-29 692 43407
1 i_95044 2020-07-22 964 95044
2 i_94560 2020-01-09 462 94560
3 i_11246 2020-02-26 690 11246
4 i_55974 2020-03-07 219 55974
offerDf.head()
Item StartDt EndDt Itemr
0 i_0 2020-04-18 2020-04-19 0
1 i_1 2020-02-28 2020-03-07 1
2 i_2 2020-03-28 2020-03-30 2
3 i_3 2020-08-03 2020-08-13 3
4 i_4 2020-05-26 2020-06-04 4
# merge on ints ... usually faster
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on strings
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on integers
cond_join_int = (transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
# merge on strings
cond_join_str = (transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
292 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
253 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
256 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
71.8 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# check that both dataframes are equal
cols = ['Item', 'TransactionDt', 'Sales', 'Itemr_y','StartDt', 'EndDt', 'Itemr_x']
cond_join_str = cond_join_str.drop(columns=('right', 'Item')).set_axis(cols, axis=1)
(cond_join_str
.sort_values(cond_join_str.columns.tolist())
.reset_index(drop=True)
.reindex(columns=classic_str.columns)
.equals(
classic_str
.sort_values(classic_str.columns.tolist())
.reset_index(drop=True)
))
True
I think you should include this in your explanation, as it is a relevant merge that I see fairly often; I believe it is termed a cross join. This is a merge that occurs when distinct df's share no columns, and it simply merges the 2 dfs side-by-side:
The setup:
names1 = [{'A':'Jack', 'B':'Jill'}]
names2 = [{'C':'Tommy', 'D':'Tammy'}]
df1=pd.DataFrame(names1)
df2=pd.DataFrame(names2)
df_merged= pd.merge(df1.assign(X=1), df2.assign(X=1), on='X').drop(columns='X')
This creates a dummy X column, merges on the X, and then drops it to produce
df_merged:
A B C D
0 Jack Jill Tommy Tammy
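On newer pandas the dummy column is not needed. A short sketch, assuming pandas >= 1.2 where merge accepts how='cross'; the second row of names in each frame is made up to show that every pairing is produced:

```python
import pandas as pd

df1 = pd.DataFrame([{"A": "Jack", "B": "Jill"}, {"A": "John", "B": "Jane"}])
df2 = pd.DataFrame([{"C": "Tommy", "D": "Tammy"}, {"C": "Timmy", "D": "Tilly"}])

# how="cross" pairs every row of df1 with every row of df2; no key is given.
df_cross = df1.merge(df2, how="cross")

print(df_cross)
```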

Merging two dataframes based on timestamp column [duplicate]

How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas?
How do I add NaNs for missing rows after a merge?
How do I get rid of NaNs after merging?
Can I merge on the index?
How do I merge multiple DataFrames?
Cross join with pandas
merge? join? concat? update? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
Table of Contents
For ease of access.
Merging basics - basic types of joins (read this first)
Index-based joins
Generalizing to multiple DataFrames
Cross join
This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
merging with different column names
merging with multiple columns
avoiding duplicate merge key column in output
What this post (and other posts by me on this thread) will not go through:
Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so
you can play with them. Also, see this
post
on how to read DataFrames from your clipboard.
Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Enough talk - just show me how to use merge!
Setup & Basics
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by
Note
This, along with the forthcoming figures all follow this convention:
blue indicates rows that are present in the merge result
red indicates rows that are excluded from the result (i.e., removed)
green indicates missing values that are replaced with NaNs in the result
To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
This returns only rows from left and right which share a common key (in this example, "B" and "D").
A LEFT OUTER JOIN, or LEFT JOIN is represented by
This can be performed by specifying how='left'.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...
...specify how='right':
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
Here, keys from right are used, and missing data from left is replaced by NaN.
Finally, for the FULL OUTER JOIN, given by
specify how='outer'.
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
The documentation summarizes these various merges nicely:
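As a quick check of the four join types above, this sketch lists which keys survive under each how value. It rebuilds left and right from the printed values rather than from the random seed, and keys_by_how is just an illustrative name:

```python
import pandas as pd

left = pd.DataFrame({"key": list("ABCD"),
                     "value": [1.764052, 0.400157, 0.978738, 2.240893]})
right = pd.DataFrame({"key": list("BDEF"),
                      "value": [1.867558, -0.977278, 0.950088, -0.151357]})

# Collect the key column of the result for each join type.
keys_by_how = {
    how: left.merge(right, on="key", how=how)["key"].tolist()
    for how in ("inner", "left", "right", "outer")
}

for how, keys in keys_by_how.items():
    print(f"{how:>5}: {keys}")
```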
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can do them in two steps.
For LEFT-Excluding JOIN, represented as
Start by performing a LEFT OUTER JOIN and then filtering to rows coming from left only (excluding everything from the right),
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop(columns='_merge'))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT-Excluding JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop(columns='_merge'))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),
You can do this in similar fashion—
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop(columns='_merge'))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
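If all you need is the excluded rows themselves (without the NaN-filled columns from the other frame), a boolean mask with isin may be enough. This is a sketch of an alternative rather than what merge does internally, and it assumes a single key column:

```python
import pandas as pd

left = pd.DataFrame({"key": list("ABCD"),
                     "value": [1.764052, 0.400157, 0.978738, 2.240893]})
right = pd.DataFrame({"key": list("BDEF"),
                      "value": [1.867558, -0.977278, 0.950088, -0.151357]})

# Rows of left whose key never appears in right (LEFT-Excluding).
left_only = left[~left["key"].isin(right["key"])]

# Rows of right whose key never appears in left (RIGHT-Excluding).
right_only = right[~right["key"].isin(left["key"])]

print(left_only)
print(right_only)
```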
Different names for key columns
If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
Avoiding duplicate key column in output
When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
Contrast this with the output of the previous command (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')): you'll notice that keyLeft is missing. You can figure out which column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
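Another option, sketched here as an assumption-laden alternative, is to rename one key column so both frames share a name and then merge with on. For an OUTER JOIN this leaves a single, fully populated key column. The frames are rebuilt from the printed values:

```python
import pandas as pd

left2 = pd.DataFrame({"keyLeft": list("ABCD"),
                      "value": [1.764052, 0.400157, 0.978738, 2.240893]})
right2 = pd.DataFrame({"keyRight": list("BDEF"),
                       "value": [1.867558, -0.977278, 0.950088, -0.151357]})

# Align the key names first, then merge on the shared name.
merged = left2.merge(right2.rename(columns={"keyRight": "keyLeft"}),
                     on="keyLeft", how="outer")

print(merged)
```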
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
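A small self-contained sketch to compare the two approaches side by side. One caveat I am assuming here: map needs the lookup keys (right3's key column) to be unique, so the sketch checks that first. The frames are rebuilt from the printed values:

```python
import pandas as pd

left = pd.DataFrame({"key": list("ABCD"),
                     "value": [1.764052, 0.400157, 0.978738, 2.240893]})
right3 = pd.DataFrame({"key": list("BDEF"),
                       "value": [1.867558, -0.977278, 0.950088, -0.151357],
                       "newcol": [0, 1, 2, 3]})

# Build a key -> newcol lookup; map only works cleanly if the keys are unique.
lookup = right3.set_index("key")["newcol"]
assert lookup.index.is_unique

via_map = left.assign(newcol=left["key"].map(lookup))
via_merge = left.merge(right3[["key", "newcol"]], on="key", how="left")

print(via_map)
print(via_merge)
```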
Merging on multiple columns
To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
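Here is a runnable sketch of the first form. The frames, the key1/key2 columns and the lval/rval columns are all made up for illustration; a row matches only when both keys agree:

```python
import pandas as pd

left = pd.DataFrame({"key1": ["A", "A", "B"],
                     "key2": [1, 2, 1],
                     "lval": [10, 20, 30]})
right = pd.DataFrame({"key1": ["A", "B", "B"],
                      "key2": [1, 1, 2],
                      "rval": [100, 200, 300]})

# Inner join on the (key1, key2) pair.
merged = left.merge(right, on=["key1", "key2"])

print(merged)
```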
Other useful merge* operations and functions
Merging a DataFrame with Series on index: See this answer.
Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.
pd.merge_ordered is a useful function for ordered JOINs.
pd.merge_asof (read: merge_asOf) is useful for approximate joins.
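To give a flavour of an approximate join, here is a minimal merge_asof sketch. The trades and quotes frames are hypothetical; both must be sorted on the key, and by default each left row is matched to the most recent right row at or before it:

```python
import pandas as pd

trades = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 09:00:03", "2020-01-01 09:00:07"]),
    "qty": [100, 50],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 09:00:01",
                            "2020-01-01 09:00:05",
                            "2020-01-01 09:00:09"]),
    "price": [10.0, 10.5, 11.0],
})

# Each trade picks up the latest quote whose time is <= the trade time.
matched = pd.merge_asof(trades, quotes, on="time")

print(matched)
```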
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins *
Index-based joins
Generalizing to multiple DataFrames
Cross join
*You are here.
A supplemental visual view of pd.concat([df0, df1], kwargs).
Notice that the meaning of the kwarg axis=0 or axis=1 is not as intuitive here as it is for df.mean() or df.apply(func).
Joins 101
These animations might explain the joins better visually.
Credits: Garrick Aden-Buie tidyexplain repo
Inner Join
Outer Join or Full Join
Right Join
Left Join
In this answer, I will consider practical examples of:
pandas.concat
pandas.DataFrame.merge to merge dataframes from the index of one and the column of another one.
We will be using different dataframes for each of the cases.
1. pandas.concat
Considering the following DataFrames with the same column names:
Price2018 with size (8784, 5)
Year Month Day Hour Price
0 2018 1 1 1 6.74
1 2018 1 1 2 4.74
2 2018 1 1 3 3.66
3 2018 1 1 4 2.30
4 2018 1 1 5 2.30
5 2018 1 1 6 2.06
6 2018 1 1 7 2.06
7 2018 1 1 8 2.06
8 2018 1 1 9 2.30
9 2018 1 1 10 2.30
Price2019 with size (8760, 5)
Year Month Day Hour Price
0 2019 1 1 1 66.88
1 2019 1 1 2 66.88
2 2019 1 1 3 66.00
3 2019 1 1 4 63.64
4 2019 1 1 5 58.85
5 2019 1 1 6 55.47
6 2019 1 1 7 56.00
7 2019 1 1 8 61.09
8 2019 1 1 9 61.01
9 2019 1 1 10 61.00
One can combine them using pandas.concat, by simply
import pandas as pd
frames = [Price2018, Price2019]
df_merged = pd.concat(frames)
Which results in a DataFrame with size (17544, 5)
If one wants to have a clear picture of what happened, it works like this
(Source)
2. pandas.DataFrame.merge
In this section, we will consider a specific case: merging the index of one dataframe and the column of another dataframe.
Let's say one has the dataframe Geo with 54 columns, one of which is Date, of type datetime64[ns].
Date 1 2 ... 51 52 53
0 2010-01-01 00:00:00 0.565919 0.892376 ... 0.593049 0.775082 0.680621
1 2010-01-01 01:00:00 0.358960 0.531418 ... 0.734619 0.480450 0.926735
2 2010-01-01 02:00:00 0.531870 0.221768 ... 0.902369 0.027840 0.398864
3 2010-01-01 03:00:00 0.475463 0.245810 ... 0.306405 0.645762 0.541882
4 2010-01-01 04:00:00 0.954546 0.867960 ... 0.912257 0.039772 0.627696
And the dataframe Price that has one column with the price named Price, and the index corresponds to the dates (Date)
Price
Date
2010-01-01 00:00:00 29.10
2010-01-01 01:00:00 9.57
2010-01-01 02:00:00 0.00
2010-01-01 03:00:00 0.00
2010-01-01 04:00:00 0.00
In order to merge them, one can use pandas.DataFrame.merge as follows
df_merged = pd.merge(Price, Geo, left_index=True, right_on='Date')
where Geo and Price are the previous dataframes.
That results in the following dataframe
Price Date 1 ... 51 52 53
0 29.10 2010-01-01 00:00:00 0.565919 ... 0.593049 0.775082 0.680621
1 9.57 2010-01-01 01:00:00 0.358960 ... 0.734619 0.480450 0.926735
2 0.00 2010-01-01 02:00:00 0.531870 ... 0.902369 0.027840 0.398864
3 0.00 2010-01-01 03:00:00 0.475463 ... 0.306405 0.645762 0.541882
4 0.00 2010-01-01 04:00:00 0.954546 ... 0.912257 0.039772 0.627696
This post will go through the following topics:
Merging with index under different conditions
options for index-based joins: merge, join, concat
merging on indexes
merging on index of one, column of other
effectively using named indexes to simplify merging syntax
BACK TO TOP
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use
case.
DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
supports inner/left/right/full
can only join two at a time
supports column-column, index-column, index-index joins
DataFrame.join (join on index)
supports inner/left (default)/right/full
can join multiple DataFrames at a time
supports index-index joins
pd.concat (joins on index)
supports inner/full (default)
can join multiple DataFrames at a time
supports index-index joins
Index to index joins
Setup & Basics
import pandas as pd
import numpy as np
np.random.seed([3, 14])
left = pd.DataFrame(data={'value': np.random.randn(4)},
index=['A', 'B', 'C', 'D'])
right = pd.DataFrame(data={'value': np.random.randn(4)},
index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Other joins follow similar syntax.
Notable Alternatives
DataFrame.join defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Since the column names are the same. This would not be a problem if they were differently named.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so join='inner' is required here.
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For more information on concat, see this post.
Index to Column joins
To perform an inner join using the index of left and a column of right, you will use DataFrame.merge with a combination of left_index=True and right_on=....
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.
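concat is not capable of mixed merges, and join only supports them in one direction (a column of the calling DataFrame against the index of the other, through its on argument). In other cases, you will need to set the index as a pre-step using DataFrame.set_index.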
Effectively using Named Index [pandas >= 0.23]
If your index is named, then from pandas >= 0.23, DataFrame.merge allows you to specify the index name to on (or left_on and right_on as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For the previous example of merging with the index of left, column of right, you can use left_on with the index name of left:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins*
Generalizing to multiple DataFrames
Cross join
* you are here
This post will go through the following topics:
how to correctly generalize to multiple DataFrames (and why merge has shortcomings here)
merging on unique keys
merging on non-unique keys
BACK TO TOP
Generalizing to multiple DataFrames
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:
df1.merge(df2, ...).merge(df3, ...)
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.
# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note: the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
Multiway merge on unique keys
If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.
# Merge on `key` column. You'll need to set the index before concatenating
pd.concat(
[df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Merge on `key` index.
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).
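To make the default behaviour concrete, here is a minimal sketch with made-up values (not the seeded numbers from the setup). It shows the FULL OUTER result of pd.concat and one possible way to emulate a LEFT JOIN on unique keys by reindexing; the reindex trick is my own illustration and assumes every index is unique.

```python
import pandas as pd

# Three frames indexed by a unique "key" (values are illustrative only).
A2 = pd.DataFrame({'valueA': [1.0, 2.0, 3.0, 4.0]},
                  index=pd.Index(['A', 'B', 'C', 'D'], name='key'))
B2 = pd.DataFrame({'valueB': [5.0, 6.0, 7.0, 8.0]},
                  index=pd.Index(['B', 'D', 'E', 'F'], name='key'))
C2 = pd.DataFrame({'valueC': [1.0, 1.0, 1.0, 1.0]},
                  index=pd.Index(['D', 'E', 'J', 'C'], name='key'))

# Default join='outer': union of all keys, NaN where a frame has no row.
outer = pd.concat([A2, B2, C2], axis=1, sort=True)

# join='inner': only keys present in every frame.
inner = pd.concat([A2, B2, C2], axis=1, join='inner')

# Emulating a LEFT JOIN against A2: keep only A2's keys from the outer result.
left_like = outer.reindex(A2.index)
```

If the keys are not unique, this approach does not apply; join is the better fit in that case.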
Multiway merge on keys with duplicates
concat is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).
# Join on `key` column. Set as the index first.
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join([B2, C2], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Join on `key` index.
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames *
Cross join
* you are here
Pandas at the moment does not support inequality joins within the merge syntax; one option is with the conditional_join function from pyjanitor - I am a contributor to this library:
# pip install pyjanitor
import pandas as pd
import janitor
left.conditional_join(right, ('value', 'value', '>'))
left right
key value key value
0 A 1.764052 D -0.977278
1 A 1.764052 F -0.151357
2 A 1.764052 E 0.950088
3 B 0.400157 D -0.977278
4 B 0.400157 F -0.151357
5 C 0.978738 D -0.977278
6 C 0.978738 F -0.151357
7 C 0.978738 E 0.950088
8 D 2.240893 D -0.977278
9 D 2.240893 F -0.151357
10 D 2.240893 E 0.950088
11 D 2.240893 B 1.867558
left.conditional_join(right, ('value', 'value', '<'))
left right
key value key value
0 A 1.764052 B 1.867558
1 B 0.400157 E 0.950088
2 B 0.400157 B 1.867558
3 C 0.978738 B 1.867558
The conditions are passed as a variable number of tuples, each comprising a column from the left dataframe, a column from the right dataframe, and the join operator, which can be any of (>, <, >=, <=, !=); == is also accepted for equi joins, as used further below. In the example above, a MultiIndex column is returned because the column names overlap.
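For readers without pyjanitor installed, the same ">" join can be sketched in plain pandas as a cross join followed by a filter. This assumes pandas >= 1.2 (for how='cross'); the values are the seeded ones from the basics setup, typed in by hand.

```python
import pandas as pd

left = pd.DataFrame({'key': list('ABCD'),
                     'value': [1.764052, 0.400157, 0.978738, 2.240893]})
right = pd.DataFrame({'key': list('BDEF'),
                      'value': [1.867558, -0.977278, 0.950088, -0.151357]})

# Every left row paired with every right row (4 x 4 = 16 rows).
pairs = left.merge(right, how='cross', suffixes=('_left', '_right'))

# Keep only the pairs that satisfy the inequality.
greater = pairs[pairs['value_left'] > pairs['value_right']].reset_index(drop=True)
```

The intermediate frame grows as len(left) * len(right), which is why this is only practical for small inputs.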
Performance wise, this is better than a naive cross join:
np.random.seed(0)
dd = pd.DataFrame({'value':np.random.randint(100000, size=50_000)})
df = pd.DataFrame({'start':np.random.randint(100000, size=1_000),
'end':np.random.randint(100000, size=1_000)})
dd.head()
value
0 68268
1 43567
2 42613
3 45891
4 21243
df.head()
start end
0 71915 47005
1 64284 44913
2 13377 96626
3 75823 38673
4 29151 575
%%timeit
out = df.merge(dd, how='cross')
out.loc[(out.start < out.value) & (out.end > out.value)]
5.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
280 ms ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True)
124 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
out = df.merge(dd, how='cross')
out = out.loc[(out.start < out.value) & (out.end > out.value)]
A = df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
columns = A.columns.tolist()
A = A.sort_values(columns, ignore_index = True)
out = out.sort_values(columns, ignore_index = True)
A.equals(out)
True
Depending on the data size, you could get more performance when an equi join is present. In this case, pandas' merge function is used, but building the final data frame is delayed until the non-equi joins are computed. There is no numba support when equi conditions are present. Let's look at data from here:
import pandas as pd
import numpy as np
import random
import datetime
def random_dt_bw(start_date,end_date):
    days_between = (end_date - start_date).days
    random_num_days = random.randrange(days_between)
    random_dt = start_date + datetime.timedelta(days=random_num_days)
    return random_dt
def generate_data(n=1000):
    items = [f"i_{x}" for x in range(n)]
    start_dates = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(n)]
    end_dates = [x + datetime.timedelta(days=random.randint(1,10)) for x in start_dates]
    offerDf = pd.DataFrame({"Item":items,
                            "StartDt":start_dates,
                            "EndDt":end_dates})
    transaction_items = [f"i_{random.randint(0,n)}" for x in range(5*n)]
    transaction_dt = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(5*n)]
    sales_amt = [random.randint(0,1000) for x in range(5*n)]
    transactionDf = pd.DataFrame({"Item":transaction_items,"TransactionDt":transaction_dt,"Sales":sales_amt})
    return offerDf,transactionDf
offerDf,transactionDf = generate_data(n=100000)
# pd.to_datetime is used here because astype(np.datetime64) without a unit
# may be rejected by newer pandas versions
offerDf = (offerDf
           .assign(StartDt = pd.to_datetime(offerDf.StartDt),
                   EndDt = pd.to_datetime(offerDf.EndDt)
                  )
          )
transactionDf = transactionDf.assign(TransactionDt = pd.to_datetime(transactionDf.TransactionDt))
# you can get more performance when using ints/datetimes
# in the equi join, compared to strings
offerDf = offerDf.assign(Itemr = offerDf.Item.str[2:].astype(int))
transactionDf = transactionDf.assign(Itemr = transactionDf.Item.str[2:].astype(int))
transactionDf.head()
Item TransactionDt Sales Itemr
0 i_43407 2020-05-29 692 43407
1 i_95044 2020-07-22 964 95044
2 i_94560 2020-01-09 462 94560
3 i_11246 2020-02-26 690 11246
4 i_55974 2020-03-07 219 55974
offerDf.head()
Item StartDt EndDt Itemr
0 i_0 2020-04-18 2020-04-19 0
1 i_1 2020-02-28 2020-03-07 1
2 i_2 2020-03-28 2020-03-30 2
3 i_3 2020-08-03 2020-08-13 3
4 i_4 2020-05-26 2020-06-04 4
# merge on ints ... usually faster
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on strings
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on integers
cond_join_int = (transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
# merge on strings
cond_join_str = (transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
292 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
253 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
256 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
71.8 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# check that both dataframes are equal
cols = ['Item', 'TransactionDt', 'Sales', 'Itemr_y','StartDt', 'EndDt', 'Itemr_x']
cond_join_str = cond_join_str.drop(columns=('right', 'Item')).set_axis(cols, axis=1)
(cond_join_str
.sort_values(cond_join_str.columns.tolist())
.reset_index(drop=True)
.reindex(columns=classic_str.columns)
.equals(
classic_str
.sort_values(classic_str.columns.tolist())
.reset_index(drop=True)
))
True
I think this belongs in the explanation, since it is a merge I see fairly often; I believe it is called a cross join. It applies when two DataFrames share no columns, and it simply places them side by side (pairing every row of one with every row of the other):
The setup:
names1 = [{'A':'Jack', 'B':'Jill'}]
names2 = [{'C':'Tommy', 'D':'Tammy'}]
df1=pd.DataFrame(names1)
df2=pd.DataFrame(names2)
df_merged = pd.merge(df1.assign(X=1), df2.assign(X=1), on='X').drop(columns='X')
This creates a dummy X column, merges on the X, and then drops it to produce
df_merged:
A B C D
0 Jack Jill Tommy Tammy
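As a side note, newer pandas versions (1.2 and later, as far as I know) accept how='cross' directly, so the dummy column is no longer needed. A small sketch; the second pair of frames (small, other) is made up just to show the row count.

```python
import pandas as pd

df1 = pd.DataFrame([{'A': 'Jack', 'B': 'Jill'}])
df2 = pd.DataFrame([{'C': 'Tommy', 'D': 'Tammy'}])

# Same result as the dummy-column trick, without the temporary X column.
df_cross = df1.merge(df2, how='cross')

# With more rows, every row of the first frame is paired with every row of the second.
small = pd.DataFrame({'n': [1, 2]})
other = pd.DataFrame({'s': ['a', 'b', 'c']})
pairs = small.merge(other, how='cross')
```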

For UNEVEN structured data, 'comparing and replacing' values using Pandas [duplicate]

How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas?
How do I add NaNs for missing rows after a merge?
How do I get rid of NaNs after merging?
Can I merge on the index?
How do I merge multiple DataFrames?
Cross join with pandas
merge? join? concat? update? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
Table of Contents
For ease of access.
Merging basics - basic types of joins (read this first)
Index-based joins
Generalizing to multiple DataFrames
Cross join
This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
merging with different column names
merging with multiple columns
avoiding duplicate merge key column in output
What this post (and other posts by me on this thread) will not go through:
Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
Lastly, all visual representations of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Enough talk - just show me how to use merge!
Setup & Basics
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by
Note
This, along with the forthcoming figures all follow this convention:
blue indicates rows that are present in the merge result
red indicates rows that are excluded from the result (i.e., removed)
green indicates missing values that are replaced with NaNs in the result
To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
This returns only rows from left and right which share a common key (in this example, "B" and "D").
A LEFT OUTER JOIN, or LEFT JOIN, can be performed by specifying how='left'.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN, specify how='right':
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
Here, keys from right are used, and missing data from left is replaced by NaN.
Finally, for the FULL OUTER JOIN, specify how='outer'.
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
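A quick way to see the four join types side by side is to compare row counts. The sketch below retypes the seeded values from the setup by hand so that it runs on its own.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                     'value': [1.764052, 0.400157, 0.978738, 2.240893]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                      'value': [1.867558, -0.977278, 0.950088, -0.151357]})

# Number of rows produced by each join type on the same pair of frames.
row_counts = {how: len(left.merge(right, on='key', how=how))
              for how in ['inner', 'left', 'right', 'outer']}
# {'inner': 2, 'left': 4, 'right': 4, 'outer': 6}
```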
The documentation summarizes these various merges nicely:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can do them in two steps.
For a LEFT-Excluding JOIN, start by performing a LEFT OUTER JOIN and then filter to rows coming from left only (excluding everything from the right),
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop(columns='_merge'))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT-Excluding JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop(columns='_merge'))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you need a merge that retains only keys from the left or the right, but not both (in other words, an ANTI-JOIN), you can do this in similar fashion:
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop(columns='_merge'))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
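When only one frame's own columns are needed and there is a single key column, a boolean mask with isin is a possible alternative to the indicator approach. This is a sketch of that idea, not a drop-in replacement: the result has no value_x/value_y columns because no merge happens.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                     'value': [1.764052, 0.400157, 0.978738, 2.240893]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                      'value': [1.867558, -0.977278, 0.950088, -0.151357]})

# Rows of left whose key does not appear in right (LEFT-Excluding).
left_only = left[~left['key'].isin(right['key'])]

# Rows of right whose key does not appear in left (RIGHT-Excluding).
right_only = right[~right['key'].isin(left['key'])]
```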
Different names for key columns
If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
Avoiding duplicate key column in output
When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')): keyLeft is missing. You can figure out which column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
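Another option, sketched here as an assumption on my part rather than the approach above, is to merge on the columns and tidy up afterwards. For an INNER JOIN the two key columns are identical, so one can simply be dropped; for an OUTER JOIN each has gaps, so they can be combined into a single column first (the name key is just illustrative).

```python
import pandas as pd

left2 = pd.DataFrame({'keyLeft': ['A', 'B', 'C', 'D'],
                      'value': [1.764052, 0.400157, 0.978738, 2.240893]})
right2 = pd.DataFrame({'keyRight': ['B', 'D', 'E', 'F'],
                       'value': [1.867558, -0.977278, 0.950088, -0.151357]})

# INNER JOIN: both key columns hold the same values, so drop one of them.
inner = (left2.merge(right2, left_on='keyLeft', right_on='keyRight')
              .drop(columns='keyRight'))

# OUTER JOIN: each key column has NaNs, so coalesce them into one column.
outer = left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='outer')
outer['key'] = outer['keyLeft'].fillna(outer['keyRight'])
outer = outer.drop(columns=['keyLeft', 'keyRight'])
```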
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
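One caveat worth spelling out: the map approach relies on the lookup Series having unique keys, whereas merge would duplicate rows instead. A minimal sketch that checks this before mapping (the drop_duplicates step is my assumption about how you might want to resolve repeats):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                     'value': [1.764052, 0.400157, 0.978738, 2.240893]})
right3 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                       'value': [1.867558, -0.977278, 0.950088, -0.151357],
                       'newcol': np.arange(4)})

# Build the lookup and make sure each key maps to exactly one value.
lookup = right3.drop_duplicates('key').set_index('key')['newcol']
assert lookup.index.is_unique

via_map = left.assign(newcol=left['key'].map(lookup))
via_merge = left.merge(right3[['key', 'newcol']], on='key', how='left')
```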
Merging on multiple columns
To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
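Here is a small runnable version of both calls. The frames and column names (lval, rval, rkey1, rkey2) are made up for illustration; a row only matches when every key column agrees.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['A', 'A', 'B'],
                     'key2': [1, 2, 1],
                     'lval': [10, 20, 30]})
right = pd.DataFrame({'key1': ['A', 'B', 'B'],
                      'key2': [1, 1, 2],
                      'rval': [100, 200, 300]})

# Same key names on both sides: only (A, 1) and (B, 1) appear in both frames.
both = left.merge(right, on=['key1', 'key2'])

# Different key names: pair them up positionally with left_on / right_on.
right_renamed = right.rename(columns={'key1': 'rkey1', 'key2': 'rkey2'})
both2 = left.merge(right_renamed,
                   left_on=['key1', 'key2'],
                   right_on=['rkey1', 'rkey2'])
```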
Other useful merge* operations and functions
Merging a DataFrame with Series on index: See this answer.
Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.
pd.merge_ordered is a useful function for ordered JOINs.
pd.merge_asof (read: merge_asOf) is useful for approximate joins.
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.
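To give a flavour of the last two functions in the list, here is a minimal sketch with made-up trades and quotes frames. pd.merge_asof matches each left row to the most recent right row at or before its key (both frames must be sorted on the key), while pd.merge_ordered performs an outer merge by default and returns the keys in order.

```python
import pandas as pd

trades = pd.DataFrame({'time': [1, 5, 10], 'qty': [100, 200, 300]})
quotes = pd.DataFrame({'time': [1, 2, 3, 6, 7],
                       'price': [10.0, 10.5, 11.0, 11.5, 12.0]})

# For each trade, take the latest quote whose time is <= the trade time.
asof = pd.merge_asof(trades, quotes, on='time')

# Ordered outer merge: all times from both frames, sorted on the key.
ordered = pd.merge_ordered(trades, quotes, on='time')
```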
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins *
Index-based joins
Generalizing to multiple DataFrames
Cross join
*You are here.
A supplemental visual view of pd.concat([df0, df1], kwargs).
Notice that the meaning of the kwarg axis=0 or axis=1 is not as intuitive here as it is for df.mean() or df.apply(func).
Joins 101
These animations may explain the joins better visually.
Credits: Garrick Aden-Buie tidyexplain repo
Inner Join
Outer Join or Full Join
Right Join
Left Join
In this answer, I will consider practical examples of:
pandas.concat
pandas.DataFrame.merge to merge dataframes from the index of one and the column of another one.
We will be using different dataframes for each of the cases.
1. pandas.concat
Considering the following DataFrames with the same column names:
Price2018 with size (8784, 5)
Year Month Day Hour Price
0 2018 1 1 1 6.74
1 2018 1 1 2 4.74
2 2018 1 1 3 3.66
3 2018 1 1 4 2.30
4 2018 1 1 5 2.30
5 2018 1 1 6 2.06
6 2018 1 1 7 2.06
7 2018 1 1 8 2.06
8 2018 1 1 9 2.30
9 2018 1 1 10 2.30
Price2019 with size (8760, 5)
Year Month Day Hour Price
0 2019 1 1 1 66.88
1 2019 1 1 2 66.88
2 2019 1 1 3 66.00
3 2019 1 1 4 63.64
4 2019 1 1 5 58.85
5 2019 1 1 6 55.47
6 2019 1 1 7 56.00
7 2019 1 1 8 61.09
8 2019 1 1 9 61.01
9 2019 1 1 10 61.00
One can combine them using pandas.concat, by simply
import pandas as pd
frames = [Price2018, Price2019]
df_merged = pd.concat(frames)
Which results in a DataFrame with size (17544, 5)
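One detail to be aware of: by default pd.concat keeps each frame's original row labels, so the combined index contains repeats. A small sketch with two-row stand-ins for the price tables (the real ones are much longer) shows the difference ignore_index=True makes.

```python
import pandas as pd

Price2018 = pd.DataFrame({'Year': [2018, 2018], 'Month': [1, 1], 'Day': [1, 1],
                          'Hour': [1, 2], 'Price': [6.74, 4.74]})
Price2019 = pd.DataFrame({'Year': [2019, 2019], 'Month': [1, 1], 'Day': [1, 1],
                          'Hour': [1, 2], 'Price': [66.88, 66.88]})

# Default: original labels are kept, so 0 and 1 each appear twice.
stacked = pd.concat([Price2018, Price2019])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([Price2018, Price2019], ignore_index=True)
```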
If one wants to have a clear picture of what happened, it works like this
(Source)
2. pandas.DataFrame.merge
In this section, we will consider a specific case: merging the index of one dataframe and the column of another dataframe.
Let's say one has the dataframe Geo with 54 columns, one of which is Date, of type datetime64[ns].
Date 1 2 ... 51 52 53
0 2010-01-01 00:00:00 0.565919 0.892376 ... 0.593049 0.775082 0.680621
1 2010-01-01 01:00:00 0.358960 0.531418 ... 0.734619 0.480450 0.926735
2 2010-01-01 02:00:00 0.531870 0.221768 ... 0.902369 0.027840 0.398864
3 2010-01-01 03:00:00 0.475463 0.245810 ... 0.306405 0.645762 0.541882
4 2010-01-01 04:00:00 0.954546 0.867960 ... 0.912257 0.039772 0.627696
And the dataframe Price that has one column with the price named Price, and the index corresponds to the dates (Date)
Price
Date
2010-01-01 00:00:00 29.10
2010-01-01 01:00:00 9.57
2010-01-01 02:00:00 0.00
2010-01-01 03:00:00 0.00
2010-01-01 04:00:00 0.00
In order to merge them, one can use pandas.DataFrame.merge as follows
df_merged = pd.merge(Price, Geo, left_index=True, right_on='Date')
where Geo and Price are the previous dataframes.
That results in the following dataframe
Price Date 1 ... 51 52 53
0 29.10 2010-01-01 00:00:00 0.565919 ... 0.593049 0.775082 0.680621
1 9.57 2010-01-01 01:00:00 0.358960 ... 0.734619 0.480450 0.926735
2 0.00 2010-01-01 02:00:00 0.531870 ... 0.902369 0.027840 0.398864
3 0.00 2010-01-01 03:00:00 0.475463 ... 0.306405 0.645762 0.541882
4 0.00 2010-01-01 04:00:00 0.954546 ... 0.912257 0.039772 0.627696
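A possible alternative for this index-to-column case is DataFrame.join with its on argument, which matches a column of the caller against the index of the other frame. The sketch below uses tiny stand-ins for Geo and Price (three rows, one feature column named x) purely for illustration.

```python
import pandas as pd

dates = pd.to_datetime(['2010-01-01 00:00', '2010-01-01 01:00', '2010-01-01 02:00'])

# Stand-in for Geo: Date is an ordinary column.
Geo = pd.DataFrame({'Date': dates, 'x': [0.565919, 0.358960, 0.531870]})

# Stand-in for Price: Date is the index.
Price = pd.DataFrame({'Price': [29.10, 9.57, 0.00]},
                     index=pd.DatetimeIndex(dates, name='Date'))

# The merge from above: index of Price against the Date column of Geo.
via_merge = pd.merge(Price, Geo, left_index=True, right_on='Date')

# join with on=: Geo's Date column against Price's index (a LEFT JOIN by default).
via_join = Geo.join(Price, on='Date')
```

Note that the column order differs between the two results, and that join keeps every row of Geo unless how is changed.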
This post will go through the following topics:
Merging with index under different conditions
options for index-based joins: merge, join, concat
merging on indexes
merging on index of one, column of other
effectively using named indexes to simplify merging syntax
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use
case.
DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
supports inner/left/right/full
can only join two at a time
supports column-column, index-column, index-index joins
DataFrame.join (join on index)
supports inner/left (default)/right/full
can join multiple DataFrames at a time
supports index-index joins
pd.concat (joins on index)
supports inner/full (default)
can join multiple DataFrames at a time
supports index-index joins
Index to index joins
Setup & Basics
import pandas as pd
import numpy as np
np.random.seed([3, 14])
left = pd.DataFrame(data={'value': np.random.randn(4)},
index=['A', 'B', 'C', 'D'])
right = pd.DataFrame(data={'value': np.random.randn(4)},
index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Other joins follow similar syntax.
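For instance, the how argument works the same way when both sides are joined on their index. The sketch retypes the values printed above so that it runs on its own.

```python
import pandas as pd

left = pd.DataFrame({'value': [-0.602923, -0.402655, 0.302329, -0.524349]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right = pd.DataFrame({'value': [0.543843, 0.013135, -0.326498, 1.385076]},
                     index=pd.Index(['B', 'D', 'E', 'F'], name='idxkey'))

# One index-to-index merge per join type.
results = {how: left.merge(right, left_index=True, right_index=True, how=how)
           for how in ['inner', 'left', 'right', 'outer']}

row_counts = {how: len(df) for how, df in results.items()}
```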
Notable Alternatives
DataFrame.join defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Since the column names are the same. This would not be a problem if they were differently named.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so join='inner' is required here.
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For more information on concat, see this post.
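Since pd.concat does not add suffixes, the result above ends up with two columns both named value. One way to keep them apart, sketched here with the labels 'left' and 'right' chosen arbitrarily, is the keys argument, which adds an outer column level.

```python
import pandas as pd

left = pd.DataFrame({'value': [-0.602923, -0.402655, 0.302329, -0.524349]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right = pd.DataFrame({'value': [0.543843, 0.013135, -0.326498, 1.385076]},
                     index=pd.Index(['B', 'D', 'E', 'F'], name='idxkey'))

# keys= labels each input frame, producing MultiIndex columns.
wide = pd.concat([left, right], axis=1, join='inner', keys=['left', 'right'])

# A single frame's columns can then be selected by its label.
left_part = wide['left']
```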
Index to Column joins
To perform an inner join using index of left, column of right, you will use DataFrame.merge with a combination of left_index=True and right_on=....
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.
join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.
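As a sketch of that pre-step, the column of right2 is moved into the index first, after which join can be used as in the index-to-index case. The values are retyped from the output above.

```python
import pandas as pd

left = pd.DataFrame({'value': [-0.602923, -0.402655, 0.302329, -0.524349]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right2 = pd.DataFrame({'colkey': ['B', 'D', 'E', 'F'],
                       'value': [0.543843, 0.013135, -0.326498, 1.385076]})

# Turn the key column into the index, then join index-to-index.
joined = left.join(right2.set_index('colkey'),
                   how='inner', lsuffix='_x', rsuffix='_y')
```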
Effectively using Named Index [pandas >= 0.23]
If your index is named, then from pandas >= 0.23, DataFrame.merge allows you to specify the index name to on (or left_on and right_on as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For the previous example of merging with the index of left, column of right, you can use left_on with the index name of left:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
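If your index has no name yet, rename_axis can supply one so that the on syntax becomes available. A minimal sketch, assuming pandas >= 0.23 and using 'idxkey' as the chosen name:

```python
import pandas as pd

# Same data as above, but the indexes start out unnamed.
left = pd.DataFrame({'value': [-0.602923, -0.402655, 0.302329, -0.524349]},
                    index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': [0.543843, 0.013135, -0.326498, 1.385076]},
                     index=['B', 'D', 'E', 'F'])

# Name the index on both sides, then merge on that name.
left_named = left.rename_axis('idxkey')
right_named = right.rename_axis('idxkey')
out = left_named.merge(right_named, on='idxkey')
```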
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins*
Generalizing to multiple DataFrames
Cross join
* you are here
This post will go through the following topics:
how to correctly generalize to multiple DataFrames (and why merge has shortcomings here)
merging on unique keys
merging on non-unique keys
Generalizing to multiple DataFrames
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:
df1.merge(df2, ...).merge(df3, ...)
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.
# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note: the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
Multiway merge on unique keys
If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.
# Merge on `key` column. You'll need to set the index before concatenating
pd.concat(
[df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Merge on `key` index.
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).
Multiway merge on keys with duplicates
concat is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).
# Join on `key` column. Set as the index first.
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join([B2, C2], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Join on `key` index.
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0
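Duplicate keys silently multiply rows, as the two D rows above show. If that would be a mistake in your data, merge has a validate argument that raises instead. A sketch with made-up values:

```python
import pandas as pd
from pandas.errors import MergeError

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'],
                   'valueA': [1.0, 2.0, 3.0, 4.0, 5.0]})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                  'valueB': [5.0, 6.0, 7.0, 8.0]})

# 'one_to_one' requires unique keys on both sides; A3 has 'D' twice.
try:
    A3.merge(B, on='key', validate='one_to_one')
    raised = False
except MergeError:
    raised = True

# 'many_to_one' only requires unique keys on the right, which holds here.
ok = A3.merge(B, on='key', validate='many_to_one')
```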
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames *
Cross join
* you are here
Pandas at the moment does not support inequality joins within the merge syntax; one option is with the conditional_join function from pyjanitor - I am a contributor to this library:
# pip install pyjanitor
import pandas as pd
import janitor
left.conditional_join(right, ('value', 'value', '>'))
left right
key value key value
0 A 1.764052 D -0.977278
1 A 1.764052 F -0.151357
2 A 1.764052 E 0.950088
3 B 0.400157 D -0.977278
4 B 0.400157 F -0.151357
5 C 0.978738 D -0.977278
6 C 0.978738 F -0.151357
7 C 0.978738 E 0.950088
8 D 2.240893 D -0.977278
9 D 2.240893 F -0.151357
10 D 2.240893 E 0.950088
11 D 2.240893 B 1.867558
left.conditional_join(right, ('value', 'value', '<'))
left right
key value key value
0 A 1.764052 B 1.867558
1 B 0.400157 E 0.950088
2 B 0.400157 B 1.867558
3 C 0.978738 B 1.867558
The conditions are passed as a variable number of tuples, each comprising a column from the left dataframe, a column from the right dataframe, and the join operator, which can be any of (>, <, >=, <=, !=); == is also accepted for equi joins, as used further below. In the example above, a MultiIndex column is returned because the column names overlap.
Performance wise, this is better than a naive cross join:
np.random.seed(0)
dd = pd.DataFrame({'value':np.random.randint(100000, size=50_000)})
df = pd.DataFrame({'start':np.random.randint(100000, size=1_000),
'end':np.random.randint(100000, size=1_000)})
dd.head()
value
0 68268
1 43567
2 42613
3 45891
4 21243
df.head()
start end
0 71915 47005
1 64284 44913
2 13377 96626
3 75823 38673
4 29151 575
%%timeit
out = df.merge(dd, how='cross')
out.loc[(out.start < out.value) & (out.end > out.value)]
5.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
280 ms ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True)
124 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
out = df.merge(dd, how='cross')
out = out.loc[(out.start < out.value) & (out.end > out.value)]
A = df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
columns = A.columns.tolist()
A = A.sort_values(columns, ignore_index = True)
out = out.sort_values(columns, ignore_index = True)
A.equals(out)
True
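For intuition about what such a range join computes, here is a plain NumPy sketch on tiny made-up frames. It builds a boolean matrix by broadcasting and reads off the matching row pairs; it is not how conditional_join is implemented, and the matrix still needs len(df) * len(dd) cells, so it does not scale the way the timings above do.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [1, 5], 'end': [4, 9]})
dd = pd.DataFrame({'value': [2, 3, 6, 10]})

start = df['start'].to_numpy()[:, None]   # shape (2, 1)
end = df['end'].to_numpy()[:, None]       # shape (2, 1)
value = dd['value'].to_numpy()            # shape (4,)

# mask[i, j] is True when start[i] < value[j] < end[i].
mask = (start < value) & (end > value)

# Row positions of the matching (df, dd) pairs.
i, j = np.nonzero(mask)
out = pd.concat([df.iloc[i].reset_index(drop=True),
                 dd.iloc[j].reset_index(drop=True)], axis=1)
```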
Depending on the data size, you could get more performance when an equi join is present. In this case, pandas' merge function is used, but building the final data frame is delayed until the non-equi joins are computed. There is no numba support when equi conditions are present. Let's look at data from here:
import pandas as pd
import numpy as np
import random
import datetime
def random_dt_bw(start_date,end_date):
    days_between = (end_date - start_date).days
    random_num_days = random.randrange(days_between)
    random_dt = start_date + datetime.timedelta(days=random_num_days)
    return random_dt
def generate_data(n=1000):
    items = [f"i_{x}" for x in range(n)]
    start_dates = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(n)]
    end_dates = [x + datetime.timedelta(days=random.randint(1,10)) for x in start_dates]
    offerDf = pd.DataFrame({"Item":items,
                            "StartDt":start_dates,
                            "EndDt":end_dates})
    transaction_items = [f"i_{random.randint(0,n)}" for x in range(5*n)]
    transaction_dt = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(5*n)]
    sales_amt = [random.randint(0,1000) for x in range(5*n)]
    transactionDf = pd.DataFrame({"Item":transaction_items,"TransactionDt":transaction_dt,"Sales":sales_amt})
    return offerDf,transactionDf
offerDf,transactionDf = generate_data(n=100000)
# pd.to_datetime is used here because astype(np.datetime64) without a unit
# may be rejected by newer pandas versions
offerDf = (offerDf
           .assign(StartDt = pd.to_datetime(offerDf.StartDt),
                   EndDt = pd.to_datetime(offerDf.EndDt)
                  )
          )
transactionDf = transactionDf.assign(TransactionDt = pd.to_datetime(transactionDf.TransactionDt))
# you can get more performance when using ints/datetimes
# in the equi join, compared to strings
offerDf = offerDf.assign(Itemr = offerDf.Item.str[2:].astype(int))
transactionDf = transactionDf.assign(Itemr = transactionDf.Item.str[2:].astype(int))
transactionDf.head()
Item TransactionDt Sales Itemr
0 i_43407 2020-05-29 692 43407
1 i_95044 2020-07-22 964 95044
2 i_94560 2020-01-09 462 94560
3 i_11246 2020-02-26 690 11246
4 i_55974 2020-03-07 219 55974
offerDf.head()
Item StartDt EndDt Itemr
0 i_0 2020-04-18 2020-04-19 0
1 i_1 2020-02-28 2020-03-07 1
2 i_2 2020-03-28 2020-03-30 2
3 i_3 2020-08-03 2020-08-13 3
4 i_4 2020-05-26 2020-06-04 4
# merge on ints ... usually faster
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on strings
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on integers
cond_join_int = (transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
# merge on strings
cond_join_str = (transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
292 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
253 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
256 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
71.8 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# check that both dataframes are equal
cols = ['Item', 'TransactionDt', 'Sales', 'Itemr_y','StartDt', 'EndDt', 'Itemr_x']
cond_join_str = cond_join_str.drop(columns=('right', 'Item')).set_axis(cols, axis=1)
(cond_join_str
.sort_values(cond_join_str.columns.tolist())
.reset_index(drop=True)
.reindex(columns=classic_str.columns)
.equals(
classic_str
.sort_values(classic_str.columns.tolist())
.reset_index(drop=True)
))
True
I think you should include this in your explanation, as it is a relevant merge that I see fairly often; I believe it is termed a cross join. It applies when two DataFrames share no columns, and it pairs every row of one with every row of the other (with single-row frames, as here, that simply places the two side by side):
The setup:
names1 = [{'A':'Jack', 'B':'Jill'}]
names2 = [{'C':'Tommy', 'D':'Tammy'}]
df1=pd.DataFrame(names1)
df2=pd.DataFrame(names2)
df_merged= pd.merge(df1.assign(X=1), df2.assign(X=1), on='X').drop(columns='X')
This creates a dummy X column, merges on the X, and then drops it to produce
df_merged:
A B C D
0 Jack Jill Tommy Tammy

Combine Two Dataframes based on specific column in Pandas [duplicate]

How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas?
How do I add NaNs for missing rows after a merge?
How do I get rid of NaNs after merging?
Can I merge on the index?
How do I merge multiple DataFrames?
Cross join with pandas
merge? join? concat? update? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
Table of Contents
For ease of access.
Merging basics - basic types of joins (read this first)
Index-based joins
Generalizing to multiple DataFrames
Cross join
This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
merging with different column names
merging with multiple columns
avoiding duplicate merge key column in output
What this post (and other posts by me on this thread) will not go through:
Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.
Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
Lastly, all visual representations of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
Enough talk - just show me how to use merge!
Setup & Basics
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
For the sake of simplicity, the key column has the same name (for now).
An INNER JOIN is represented by
Note
This, along with the forthcoming figures all follow this convention:
blue indicates rows that are present in the merge result
red indicates rows that are excluded from the result (i.e., removed)
green indicates missing values that are replaced with NaNs in the result
To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
This returns only rows from left and right which share a common key (in this example, "B" and "D").
A LEFT OUTER JOIN, or LEFT JOIN is represented by
This can be performed by specifying how='left'.
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...
...specify how='right':
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
Here, keys from right are used, and missing data from left is replaced by NaN.
Finally, for the FULL OUTER JOIN, given by
specify how='outer'.
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
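As a quick sanity check on the four join types covered so far, the sketch below rebuilds left and right from the setup and records which keys survive for each value of how. The name keys_by_how is only illustrative and not part of pandas.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

# Collect the keys that appear in the result for each join type.
keys_by_how = {
    how: left.merge(right, on='key', how=how)['key'].tolist()
    for how in ['inner', 'left', 'right', 'outer']
}
for how, keys in keys_by_how.items():
    print(how, keys)
```

Only the keys are compared here; the value_x and value_y columns carry the NaNs described above.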
The documentation summarizes these various merges nicely:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can do them in two steps.
For LEFT-Excluding JOIN, represented as
Start by performing a LEFT OUTER JOIN and then filtering to rows coming from left only (excluding everything from the right),
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop(columns='_merge'))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT-Excluding JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop(columns='_merge'))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (in other words, performing an ANTI-JOIN), you can do this in a similar fashion:
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop(columns='_merge'))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
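When only the rows of one frame are needed (and no columns from the other), a LEFT-Excluding JOIN can also be sketched with a boolean mask instead of merge. This is an alternative technique, not the indicator approach shown above, and it assumes a single key column named key as in the setup.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

# Keep rows of `left` whose key never appears in `right`.
left_only = left[~left['key'].isin(right['key'])]

# The indicator-based version, for comparison.
via_indicator = (left.merge(right, on='key', how='left', indicator=True)
                     .query('_merge == "left_only"')
                     .drop(columns='_merge'))
print(left_only)
```

Note that the mask version returns only the columns of left, whereas the merge version also carries an all-NaN value_y column.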
Different names for key columns
If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
Avoiding duplicate key column in output
When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
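For an INNER JOIN there is a simpler option worth sketching: merge as usual and drop one of the two key columns afterwards, since both hold identical values. The frames below are rebuilt to mirror left2 and right2; the name deduped is illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left2 = pd.DataFrame({'keyLeft': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right2 = pd.DataFrame({'keyRight': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

merged = left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
# For an INNER JOIN the two key columns hold identical values, so either can go.
deduped = merged.drop(columns='keyRight')
print(deduped)
```

This shortcut is not safe for OUTER JOINs, where each key column has NaN in the rows that came only from the other frame.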
Merging only a single column from one of the DataFrames
For example, consider
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
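One caveat with the map approach is that the lookup Series needs unique keys. The sketch below checks that assumption first and then confirms that map and the LEFT OUTER JOIN produce the same newcol; the names lookup, via_map and via_merge are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
right3 = right.assign(newcol=np.arange(len(right)))

lookup = right3.set_index('key')['newcol']
# map needs an unambiguous lookup, so check the keys first.
assert lookup.index.is_unique

via_map = left.assign(newcol=left['key'].map(lookup))
via_merge = left.merge(right3[['key', 'newcol']], on='key', how='left')
print(via_map['newcol'].equals(via_merge['newcol']))
```

If right3 had repeated keys, merge would emit one output row per match, which map cannot express.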
Merging on multiple columns
To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
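To make the list form concrete, here is a small sketch with made-up frames in which key1 and key2 together identify a row. The column names lval and rval are hypothetical.

```python
import pandas as pd

# Hypothetical frames: (key1, key2) together identify a row.
left = pd.DataFrame({'key1': ['a', 'a', 'b'], 'key2': [1, 2, 1], 'lval': [10, 20, 30]})
right = pd.DataFrame({'key1': ['a', 'b', 'b'], 'key2': [1, 1, 2], 'rval': [100, 300, 400]})

# Rows match only when both key1 and key2 agree.
merged = left.merge(right, on=['key1', 'key2'])
print(merged)
```

Only ('a', 1) and ('b', 1) occur in both frames, so the inner join keeps two rows.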
Other useful merge* operations and functions
Merging a DataFrame with Series on index: See this answer.
Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.
pd.merge_ordered is a useful function for ordered JOINs.
pd.merge_asof (read: merge_asOf) is useful for approximate joins.
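As a taste of what an approximate join means, the following sketch uses pd.merge_asof on made-up trades and quotes frames (hypothetical data and column names). Each trade picks up the most recent quote at or before its time.

```python
import pandas as pd

# Hypothetical data; both frames must be sorted by the `on` column.
trades = pd.DataFrame({'time': [1, 5, 10], 'qty': [100, 200, 300]})
quotes = pd.DataFrame({'time': [1, 2, 3, 7], 'price': [9.0, 9.5, 9.7, 10.2]})

# For each trade, take the latest quote whose time is <= the trade time
# (direction='backward' is the default).
asof = pd.merge_asof(trades, quotes, on='time')
print(asof)
```

The result has one row per trade, so this behaves like a LEFT JOIN on the nearest earlier key rather than on an exact match.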
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins *
Index-based joins
Generalizing to multiple DataFrames
Cross join
*You are here.
A supplemental visual view of pd.concat([df0, df1], kwargs).
Notice that the meaning of the axis=0 or axis=1 kwarg is not as intuitive here as it is for df.mean() or df.apply(func).
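A tiny sketch of the two axis settings, using made-up frames df0 and df1, may help here:

```python
import pandas as pd

df0 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df1 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

stacked = pd.concat([df0, df1], axis=0)       # rows of df1 go below df0
side_by_side = pd.concat([df0, df1], axis=1)  # columns of df1 go next to df0, aligned on the index

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 4)
```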
Joins 101
These animations might explain the joins better visually.
Credits: Garrick Aden-Buie tidyexplain repo
Inner Join
Outer Join or Full Join
Right Join
Left Join
In this answer, I will consider practical examples of:
pandas.concat
pandas.DataFrame.merge to merge dataframes from the index of one and the column of another one.
We will be using different dataframes for each of the cases.
1. pandas.concat
Considering the following DataFrames with the same column names:
Price2018 with size (8784, 5)
Year Month Day Hour Price
0 2018 1 1 1 6.74
1 2018 1 1 2 4.74
2 2018 1 1 3 3.66
3 2018 1 1 4 2.30
4 2018 1 1 5 2.30
5 2018 1 1 6 2.06
6 2018 1 1 7 2.06
7 2018 1 1 8 2.06
8 2018 1 1 9 2.30
9 2018 1 1 10 2.30
Price2019 with size (8760, 5)
Year Month Day Hour Price
0 2019 1 1 1 66.88
1 2019 1 1 2 66.88
2 2019 1 1 3 66.00
3 2019 1 1 4 63.64
4 2019 1 1 5 58.85
5 2019 1 1 6 55.47
6 2019 1 1 7 56.00
7 2019 1 1 8 61.09
8 2019 1 1 9 61.01
9 2019 1 1 10 61.00
One can combine them using pandas.concat, by simply
import pandas as pd
frames = [Price2018, Price2019]
df_merged = pd.concat(frames)
Which results in a DataFrame with size (17544, 5)
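One detail worth knowing is that the concatenated frame keeps the original index labels of both inputs, so labels repeat. A minimal sketch with two-row stand-ins for Price2018 and Price2019 (values copied from the first rows above) shows how ignore_index=True renumbers them; df_renumbered is an illustrative name.

```python
import pandas as pd

# Tiny stand-ins for Price2018 and Price2019.
Price2018 = pd.DataFrame({'Year': [2018, 2018], 'Month': [1, 1], 'Day': [1, 1],
                          'Hour': [1, 2], 'Price': [6.74, 4.74]})
Price2019 = pd.DataFrame({'Year': [2019, 2019], 'Month': [1, 1], 'Day': [1, 1],
                          'Hour': [1, 2], 'Price': [66.88, 66.88]})

df_merged = pd.concat([Price2018, Price2019])                         # index labels: 0, 1, 0, 1
df_renumbered = pd.concat([Price2018, Price2019], ignore_index=True)  # index labels: 0, 1, 2, 3
print(df_merged.index.tolist(), df_renumbered.index.tolist())
```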
If one wants to have a clear picture of what happened, it works like this
(Source)
2. pandas.DataFrame.merge
In this section, we will consider a specific case: merging the index of one dataframe and the column of another dataframe.
Let's say one has the dataframe Geo with 54 columns, one of which is Date, of type datetime64[ns].
Date 1 2 ... 51 52 53
0 2010-01-01 00:00:00 0.565919 0.892376 ... 0.593049 0.775082 0.680621
1 2010-01-01 01:00:00 0.358960 0.531418 ... 0.734619 0.480450 0.926735
2 2010-01-01 02:00:00 0.531870 0.221768 ... 0.902369 0.027840 0.398864
3 2010-01-01 03:00:00 0.475463 0.245810 ... 0.306405 0.645762 0.541882
4 2010-01-01 04:00:00 0.954546 0.867960 ... 0.912257 0.039772 0.627696
And the dataframe Price that has one column with the price named Price, and the index corresponds to the dates (Date)
Price
Date
2010-01-01 00:00:00 29.10
2010-01-01 01:00:00 9.57
2010-01-01 02:00:00 0.00
2010-01-01 03:00:00 0.00
2010-01-01 04:00:00 0.00
In order to merge them, one can use pandas.DataFrame.merge as follows
df_merged = pd.merge(Price, Geo, left_index=True, right_on='Date')
where Geo and Price are the previous dataframes.
That results in the following dataframe
Price Date 1 ... 51 52 53
0 29.10 2010-01-01 00:00:00 0.565919 ... 0.593049 0.775082 0.680621
1 9.57 2010-01-01 01:00:00 0.358960 ... 0.734619 0.480450 0.926735
2 0.00 2010-01-01 02:00:00 0.531870 ... 0.902369 0.027840 0.398864
3 0.00 2010-01-01 03:00:00 0.475463 ... 0.306405 0.645762 0.541882
4 0.00 2010-01-01 04:00:00 0.954546 ... 0.912257 0.039772 0.627696
This post will go through the following topics:
Merging with index under different conditions
options for index-based joins: merge, join, concat
merging on indexes
merging on index of one, column of other
effectively using named indexes to simplify merging syntax
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use case.
DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
    supports inner/left/right/full
    can only join two at a time
    supports column-column, index-column, index-index joins
DataFrame.join (join on index)
    supports inner/left (default)/right/full
    can join multiple DataFrames at a time
    supports index-index joins
pd.concat (joins on index)
    supports inner/full (default)
    can join multiple DataFrames at a time
    supports index-index joins
Index to index joins
Setup & Basics
import pandas as pd
import numpy as np
np.random.seed([3, 14])
left = pd.DataFrame(data={'value': np.random.randn(4)},
index=['A', 'B', 'C', 'D'])
right = pd.DataFrame(data={'value': np.random.randn(4)},
index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Other joins follow similar syntax.
Notable Alternatives
DataFrame.join defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Since the column names are the same. This would not be a problem if they were differently named.
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so join='inner' is required here.
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For more information on concat, see this post.
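To see the three options side by side, this sketch rebuilds the setup and runs the same index-to-index inner join with merge, join and concat. The names via_merge, via_join and via_concat are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right = pd.DataFrame({'value': np.random.randn(4)},
                     index=pd.Index(['B', 'D', 'E', 'F'], name='idxkey'))

via_merge = left.merge(right, left_index=True, right_index=True)
via_join = left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
via_concat = pd.concat([left, right], axis=1, join='inner')

# Same rows and values from all three; only the column labels differ for concat.
print(via_merge.index.tolist(), via_join.index.tolist(), via_concat.index.tolist())
print(np.allclose(via_merge.to_numpy(), via_concat.to_numpy()))
```

concat leaves both columns named value, so rename them beforehand if you need to tell them apart.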
Index to Column joins
To perform an inner join using index of left, column of right, you will use DataFrame.merge with a combination of left_index=True and right_on=....
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
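Other joins follow a similar structure. Note that merge is the only one of the three that handles index to column joins in both directions. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.
concat is not capable of mixed merges, and join supports them in one direction only: its on argument matches a column of the calling DataFrame against the index of the other. In the remaining cases, you will need to set the index as a pre-step using DataFrame.set_index.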
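The set_index pre-step mentioned above could look like the following sketch, which moves colkey into the index of right2 so that join can align on it. The frames are rebuilt from the setup and joined is an illustrative name.

```python
import numpy as np
import pandas as pd

np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right2 = pd.DataFrame({'colkey': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

# Move the key column into the index so that join can align on it.
joined = left.join(right2.set_index('colkey'), how='inner',
                   lsuffix='_x', rsuffix='_y')
print(joined)
```

Unlike the merge version, the key ends up in the index of the result rather than in a colkey column.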
Effectively using Named Index [pandas >= 0.23]
If your index is named, then from pandas >= 0.23, DataFrame.merge allows you to specify the index name to on (or left_on and right_on as necessary).
left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
For the previous example of merging with the index of left, column of right, you can use left_on with the index name of left:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins*
Generalizing to multiple DataFrames
Cross join
* you are here
This post will go through the following topics:
how to correctly generalize to multiple DataFrames (and why merge has shortcomings here)
merging on unique keys
merging on non-unique keys
Generalizing to multiple DataFrames
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:
df1.merge(df2, ...).merge(df3, ...)
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
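For reference, the naive chain can be folded over a list of any length with functools.reduce. This is only a sketch of the chained-merge approach on made-up frames, not one of the alternatives introduced next.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames sharing a 'key' column.
dfs = [
    pd.DataFrame({'key': ['A', 'B', 'C'], 'x': [1, 2, 3]}),
    pd.DataFrame({'key': ['B', 'C', 'D'], 'y': [4, 5, 6]}),
    pd.DataFrame({'key': ['C', 'D', 'E'], 'z': [7, 8, 9]}),
]

# Equivalent to dfs[0].merge(dfs[1], on='key').merge(dfs[2], on='key').
merged = reduce(lambda acc, df: acc.merge(df, on='key'), dfs)
print(merged)
```

Each step materialises an intermediate DataFrame, which is part of why this gets expensive as the list grows.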
Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.
# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note: the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
Multiway merge on unique keys
If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.
# Merge on `key` column. You'll need to set the index before concatenating
pd.concat(
[df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Merge on `key` index.
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).
Multiway merge on keys with duplicates
concat is fast, but has its shortcomings. It cannot handle duplicates.
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).
# Join on `key` column. Set as the index first.
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join([B2, C2], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# Join on `key` index.
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0
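If you do not know in advance whether the keys are unique, one way to sketch the choice between the two approaches is a small helper that checks Index.is_unique first. The function name multiway_inner is hypothetical, and it assumes every frame already has the key as its index.

```python
import numpy as np
import pandas as pd

def multiway_inner(frames):
    """Inner-join frames on their index (hypothetical helper)."""
    if all(df.index.is_unique for df in frames):
        # Unique keys everywhere: concat is enough.
        return pd.concat(frames, axis=1, join='inner')
    # Otherwise fall back to join, which tolerates duplicate keys.
    first, *rest = frames
    return first.join(rest, how='inner')

np.random.seed(0)
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})

result = multiway_inner([df.set_index('key') for df in [A3, B, C]])
print(result)
```

Here A3 has a repeated 'D', so the helper takes the join branch and both 'D' rows appear in the result.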
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames *
Cross join
* you are here
Pandas at the moment does not support inequality joins within the merge syntax; one option is with the conditional_join function from pyjanitor - I am a contributor to this library:
# pip install pyjanitor
import pandas as pd
import janitor
left.conditional_join(right, ('value', 'value', '>'))
left right
key value key value
0 A 1.764052 D -0.977278
1 A 1.764052 F -0.151357
2 A 1.764052 E 0.950088
3 B 0.400157 D -0.977278
4 B 0.400157 F -0.151357
5 C 0.978738 D -0.977278
6 C 0.978738 F -0.151357
7 C 0.978738 E 0.950088
8 D 2.240893 D -0.977278
9 D 2.240893 F -0.151357
10 D 2.240893 E 0.950088
11 D 2.240893 B 1.867558
left.conditional_join(right, ('value', 'value', '<'))
left right
key value key value
0 A 1.764052 B 1.867558
1 B 0.400157 E 0.950088
2 B 0.400157 B 1.867558
3 C 0.978738 B 1.867558
The columns are passed as a variable number of tuples, each comprising a column from the left dataframe, a column from the right dataframe, and the join operator, which can be any of (>, <, >=, <=, !=); == is also accepted for equi joins, as the Item example further down shows. In the example above, a MultiIndex column is returned, because of overlaps in the column names.
Performance wise, this is better than a naive cross join:
np.random.seed(0)
dd = pd.DataFrame({'value':np.random.randint(100000, size=50_000)})
df = pd.DataFrame({'start':np.random.randint(100000, size=1_000),
'end':np.random.randint(100000, size=1_000)})
dd.head()
value
0 68268
1 43567
2 42613
3 45891
4 21243
df.head()
start end
0 71915 47005
1 64284 44913
2 13377 96626
3 75823 38673
4 29151 575
%%timeit
out = df.merge(dd, how='cross')
out.loc[(out.start < out.value) & (out.end > out.value)]
5.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
280 ms ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True)
124 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
out = df.merge(dd, how='cross')
out = out.loc[(out.start < out.value) & (out.end > out.value)]
A = df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
columns = A.columns.tolist()
A = A.sort_values(columns, ignore_index = True)
out = out.sort_values(columns, ignore_index = True)
A.equals(out)
True
Depending on the data size, you could get more performance when an equi join is present. In this case, pandas merge function is used, but the final data frame is delayed until the non-equi joins are computed. There is no numba support when equi conditions are present. Let's look at data from here:
import pandas as pd
import numpy as np
import random
import datetime
def random_dt_bw(start_date,end_date):
    days_between = (end_date - start_date).days
    random_num_days = random.randrange(days_between)
    random_dt = start_date + datetime.timedelta(days=random_num_days)
    return random_dt
def generate_data(n=1000):
    items = [f"i_{x}" for x in range(n)]
    start_dates = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(n)]
    end_dates = [x + datetime.timedelta(days=random.randint(1,10)) for x in start_dates]
    offerDf = pd.DataFrame({"Item":items,
                            "StartDt":start_dates,
                            "EndDt":end_dates})
    transaction_items = [f"i_{random.randint(0,n)}" for x in range(5*n)]
    transaction_dt = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(5*n)]
    sales_amt = [random.randint(0,1000) for x in range(5*n)]
    transactionDf = pd.DataFrame({"Item":transaction_items,"TransactionDt":transaction_dt,"Sales":sales_amt})
    return offerDf,transactionDf
offerDf,transactionDf = generate_data(n=100000)
# pd.to_datetime is used here because newer pandas versions reject
# the unit-less astype(np.datetime64) cast
offerDf = (offerDf
           .assign(StartDt = pd.to_datetime(offerDf.StartDt),
                   EndDt = pd.to_datetime(offerDf.EndDt)
                  )
          )
transactionDf = transactionDf.assign(TransactionDt = pd.to_datetime(transactionDf.TransactionDt))
# you can get more performance when using ints/datetimes
# in the equi join, compared to strings
offerDf = offerDf.assign(Itemr = offerDf.Item.str[2:].astype(int))
transactionDf = transactionDf.assign(Itemr = transactionDf.Item.str[2:].astype(int))
transactionDf.head()
Item TransactionDt Sales Itemr
0 i_43407 2020-05-29 692 43407
1 i_95044 2020-07-22 964 95044
2 i_94560 2020-01-09 462 94560
3 i_11246 2020-02-26 690 11246
4 i_55974 2020-03-07 219 55974
offerDf.head()
Item StartDt EndDt Itemr
0 i_0 2020-04-18 2020-04-19 0
1 i_1 2020-02-28 2020-03-07 1
2 i_2 2020-03-28 2020-03-30 2
3 i_3 2020-08-03 2020-08-13 3
4 i_4 2020-05-26 2020-06-04 4
# merge on ints ... usually faster
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on strings
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]
# merge on integers
cond_join_int = (transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
# merge on strings
cond_join_str = (transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
292 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
(merged_df['TransactionDt']<=merged_df['EndDt'])]
253 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Item', 'Item', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
256 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
(transactionDf
.conditional_join(
offerDf,
('Itemr', 'Itemr', '=='),
('TransactionDt', 'StartDt', '>='),
('TransactionDt', 'EndDt', '<=')
)
)
71.8 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# check that both dataframes are equal
cols = ['Item', 'TransactionDt', 'Sales', 'Itemr_y','StartDt', 'EndDt', 'Itemr_x']
cond_join_str = cond_join_str.drop(columns=('right', 'Item')).set_axis(cols, axis=1)
(cond_join_str
.sort_values(cond_join_str.columns.tolist())
.reset_index(drop=True)
.reindex(columns=classic_str.columns)
.equals(
classic_str
.sort_values(classic_str.columns.tolist())
.reset_index(drop=True)
))
True
I think you should include this in your explanation, as it is a relevant merge that I see fairly often; I believe it is termed a cross join. It applies when two DataFrames share no columns, and it pairs every row of one with every row of the other (with single-row frames, as here, that simply places the two side by side):
The setup:
names1 = [{'A':'Jack', 'B':'Jill'}]
names2 = [{'C':'Tommy', 'D':'Tammy'}]
df1=pd.DataFrame(names1)
df2=pd.DataFrame(names2)
df_merged= pd.merge(df1.assign(X=1), df2.assign(X=1), on='X').drop(columns='X')
This creates a dummy X column, merges on the X, and then drops it to produce
df_merged:
A B C D
0 Jack Jill Tommy Tammy
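On pandas 1.2 or later, the same result can be sketched without the dummy column by passing how='cross' to merge. The second pair of frames (two_rows, three_rows) is made up to show the row count of a cross join.

```python
import pandas as pd

df1 = pd.DataFrame([{'A': 'Jack', 'B': 'Jill'}])
df2 = pd.DataFrame([{'C': 'Tommy', 'D': 'Tammy'}])

# Requires pandas >= 1.2; no dummy key column is needed.
df_cross = df1.merge(df2, how='cross')
print(df_cross)

# A cross join pairs every row with every row: 2 x 3 = 6 rows.
two_rows = pd.DataFrame({'a': [1, 2]})
three_rows = pd.DataFrame({'b': [10, 20, 30]})
print(len(two_rows.merge(three_rows, how='cross')))
```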
