Pandas - merge two dataframes based off of intersection of columns - python

I am new to dataframe manipulation. I've been playing around with df.merge, df.join, and pd.concat, and I keep getting errors while being unable to merge without duplicates.
I have two representative dataframes I want to merge.
df1 = pd.DataFrame({'1990' : 1, '1991': 2, '1992': 3}, index = ['a','b','c'])
df2 = pd.DataFrame({'1989':0,'1990' : 1, '1991': 2, '1992': 3, '1993': 4}, index = ['d'])
I want to merge them on the intersection of the two dataframes' columns while appending the new row at the same time. Is there a way to use a dataframe method to do this?
The final product should look like:
   1990  1991  1992
a     1     2     3
b     1     2     3
c     1     2     3
d     1     2     3

Use concat with inner join:
df = pd.concat([df1, df2], join='inner')
print (df)
   1990  1991  1992
a     1     2     3
b     1     2     3
c     1     2     3
d     1     2     3
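For contrast, a minimal sketch using the same df1 and df2: with the default join='outer', concat would instead keep the union of the columns, filling the years missing from df1 with NaN:
# join='outer' (the default) keeps all columns from both frames;
# 1989 and 1993 exist only in df2, so rows a, b and c get NaN there.
df_outer = pd.concat([df1, df2])
print (df_outer)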


Match column to another column containing array

I have a very junior question in Python: I have a dataframe with a column containing some IDs, and a separate dataframe with 2 columns, one of which contains arrays:
df1 = pd.DataFrame({"some_id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame([["A", [1, 2]], ["B", [3, 4]], ["C", [5]]], columns=['letter', 'some_ids'])
I want to add to df1 a new column 'letter' that, for a given 'some_id', looks up df2, checks whether this id is in df2['some_ids'], and returns the matching df2['letter'].
I tried this:
df1['letter'] = df2[df1[some_id].isin(df2['some_ids')].letter
and get NaNs. Any suggestion where I made a mistake?
Your attempt fails because Series.isin compares against the values of df2['some_ids'] (the list objects themselves), not the elements inside them. Instead, create a dictionary by flattening the nested lists in a dict comprehension, then use Series.map:
d = {x: a for a,b in zip(df2['letter'], df2['some_ids']) for x in b}
df1['letter'] = df1['some_id'].map(d)
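For the example data above, the flattened dictionary maps every id to its letter:
print (d)
# {1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'C'}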
Or map via a Series created by DataFrame.explode with DataFrame.set_index:
df1['letter'] = df1['some_id'].map(df2.explode('some_ids').set_index('some_ids')['letter'])
Or use a left join after renaming the exploded column:
df1 = df1.merge(df2.explode('some_ids').rename(columns={'some_ids':'some_id'}), how='left')
print (df1)
   some_id letter
0        1      A
1        2      A
2        3      B
3        4      B
4        5      C

How To Merge Two Data Frames in Pandas Python [duplicate]

This question already has answers here:
How do I combine two dataframes?
(8 answers)
Pandas Merging 101
(8 answers)
Closed 10 months ago.
How To Merge/Concat Two Data Frames
I want to merge two dataframes: the first one has a single column with datetime64 dtype, and the second one has a single column with float64 dtype. This is what I have tried:
df1 = pd.DataFrame(df, columns = ['MemStartDate'])
df4 = pd.DataFrame(df, columns = ['TotalPrice'])
df_merge = pd.merge(df1, df4, left_on='MemStartDate', right_on='TotalPrice')
Error: You are trying to merge on datetime64[ns] and float64 columns. If you wish to proceed you should use pd.concat
But how can I do that ?
You can try this:
df_merge = pd.concat([df1, df4], axis=1)
The best option is to use pd.concat, but you can also try DataFrame.join. Note that join aligns on the index, which works here because df1 and df4 were both sliced from the same df and share its index. For more information, go through Merge, join, concatenate and compare.
df_merge = df1.join(df4)
Let us consider the following situation:
import pandas as pd
# Create a dataframe with one column of type datetime64 and one of type float64
dictionary = {'MemStartDate': ['2007-07-13', '2006-01-13', '2010-08-13'],
              'TotalPrice': [50.5, 10.4, 3.5]}
df = pd.DataFrame(dictionary)
df['MemStartDate'] = pd.to_datetime(df['MemStartDate'])  # dtype: datetime64[ns]
df1 = pd.DataFrame(df, columns = ['MemStartDate'])
df4 = pd.DataFrame(df, columns = ['TotalPrice'])
df.TotalPrice  # dtype: float64
Where you have df1 and df4 that are:
df1
Out:
  MemStartDate
0   2007-07-13
1   2006-01-13
2   2010-08-13
df4
Out:
   TotalPrice
0        50.5
1        10.4
2         3.5
If you want to concat df1 and df4, it means that you want to concatenate pandas objects along a particular axis with optional set logic along the other axes (see pandas.concat — pandas 1.4.2 documentation). Thus in practice:
df_concatenated = pd.concat([df1, df4], axis=1)
df_concatenated
The new resulting dataframe df_concatenated is this:
Out:
  MemStartDate  TotalPrice
0   2007-07-13        50.5
1   2006-01-13        10.4
2   2010-08-13         3.5
The axis parameter decides which axis to concatenate along. With axis=1 you have concatenated the second dataframe along the columns of the first one. You can try with axis=0:
df_concatenated = pd.concat([df1, df4], axis=0)
df_concatenated
The output is:
Out:
  MemStartDate  TotalPrice
0   2007-07-13         NaN
1   2006-01-13         NaN
2   2010-08-13         NaN
0          NaN        50.5
1          NaN        10.4
2          NaN         3.5
Now you have added the second dataframe along rows of the first dataframe.
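As a small follow-up sketch: if the repeated 0, 1, 2 index labels are unwanted, ignore_index=True builds a fresh RangeIndex instead:
df_concatenated = pd.concat([df1, df4], axis=0, ignore_index=True)
# The result is now indexed 0..5 instead of 0, 1, 2, 0, 1, 2.
print (df_concatenated)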
On the other hand, merge is used to join dataframes when they share some columns. It is useful because you may not want to store dataframes with the same contents repeatedly. For example:
# Create two dataframes
dictionary = {'MemStartDate': ['2007-07-13', '2006-01-13', '2010-08-13'],
              'TotalPrice': [50.5, 10.4, 3.5]}
dictionary_1 = {'MemStartDate': ['2007-07-13', '2006-01-13', '2010-08-13', '2010-08-14'],
                'Shop': ['Shop_1', 'Shop_2', 'Shop_3', 'Shop_4']}
df = pd.DataFrame(dictionary)
df_1 = pd.DataFrame(dictionary_1)
If you have df and df_1 that are:
df
Out:
  MemStartDate  TotalPrice
0   2007-07-13        50.5
1   2006-01-13        10.4
2   2010-08-13         3.5
and
df_1
Out:
  MemStartDate    Shop
0   2007-07-13  Shop_1
1   2006-01-13  Shop_2
2   2010-08-13  Shop_3
3   2010-08-14  Shop_4
You can merge them in this way:
df_merged = pd.merge(df,df_1, on='MemStartDate', how='outer')
df_merged
Out:
  MemStartDate  TotalPrice    Shop
0   2007-07-13        50.5  Shop_1
1   2006-01-13        10.4  Shop_2
2   2010-08-13         3.5  Shop_3
3   2010-08-14         NaN  Shop_4
In the new dataframe df_merged, you keep the common column of the old dataframes df and df_1 (MemStartDate) and add the two columns that differ between them (TotalPrice and Shop).
A couple of other explanatory examples about merging dataframes in pandas:
Example 1. Merging two dataframes on a key column shared by both:
left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
left
right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
right
result = pd.merge(left, right, on="key")
result
Out:
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
Example 2. Merging two dataframes to read all the combinations of matching key values:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})
result = pd.merge(df1, df2, left_on='lkey', right_on='rkey')
result
Out:
  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  foo        5  foo        5
3  foo        5  foo        8
4  bar        2  bar        6
5  baz        3  baz        7
Also in this case you can check the pandas.DataFrame.merge — pandas 1.4.2 documentation (from which I took the second example), and here you have other possible ways to manipulate your dataframes: Merge, join, concatenate and compare (from which I took the first example).
In the end, to sum up, you can intuitively understand what pd.concat() and pd.merge() do by studying the meaning of their names in spoken language:
Concatenate: to link together in a series or chain
Merge: to cause to combine, unite, or coalesce
And to come back to your error:
Error: You are trying to merge on datetime64[ns] and float64 columns. If you wish to proceed you should use pd.concat
It is telling you that the columns you are trying to merge on have different data types, so they cannot hold matching keys. Pandas infers that you are trying to do something that is pd.concat's job, and so it suggests using pd.concat instead.
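A minimal sketch of what triggers that message, assuming the df1 and df4 defined earlier (pandas 1.x raises a ValueError here):
try:
    df_merge = pd.merge(df1, df4, left_on='MemStartDate', right_on='TotalPrice')
except ValueError as e:
    # "You are trying to merge on datetime64[ns] and float64 columns. ..."
    print (e)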

how do you filter a Pandas dataframe by a multi-column set?

Is there a way to filter a large dataframe by comparing multiple columns against a set of tuples where each element in the tuple corresponds to a different column value?
For example, is there a .isin() method that compares multiple columns of the DataFrame against a set of tuples?
Example:
df = pd.DataFrame({
    'a': [1, 1, 1],
    'b': [2, 2, 0],
    'c': [3, 3, 3],
    'd': ['not', 'relevant', 'column'],
})
# Filter the DataFrame by checking if the values in columns [a, b, c] match any tuple in value_set
value_set = set([(1,2,3), (1, 1, 1)])
new_df = ?? # should contain just the first two rows of df
You can use Series.isin, but first it is necessary to create tuples from the first 3 columns:
print (df[df[['a','b','c']].apply(tuple, axis=1).isin(value_set)])
Or convert columns to index and use Index.isin:
print (df[df.set_index(['a','b','c']).index.isin(value_set)])
   a  b  c         d
0  1  2  3       not
1  1  2  3  relevant
Another idea is an inner join using DataFrame.merge with a helper DataFrame that has the same 3 column names; the on parameter can then be omitted, because merge joins on the intersection of the column names of both DataFrames:
print (df.merge(pd.DataFrame(value_set, columns=['a','b','c'])))
   a  b  c         d
0  1  2  3       not
1  1  2  3  relevant

Pandas - Merge 2 df with same column names but exclusive values

I have 1 main df MainDF, with column key and other columns not relevant.
I also have 2 other dfs, dfA and dfB, with 2 columns, key and tariff. The keys in dfA and dfB are exclusive, ie there is no key in both dfA and dfB.
On my MainDF, I do: MainDF.merge(dfA, how='left', on='key'), which adds the column "tariff" to my MainDF for the keys that are in both dfA and MainDF. This puts NaN for all keys in MainDF that are not in dfA.
Now, I need to do MainDF.merge(dfB, how = 'left', on='key') to add the tariff for the keys in MainDF but not in dfA.
When I do the second merge, it creates 2 columns in MainDF, tariff_x and tariff_y, because tariff is already in MainDF after the first merge. However, since the keys are exclusive, I need to keep only one tariff column with the non-NaN values where possible.
How should I do this in a pythonic way? I could add a new column that takes either tariff_x or tariff_y, but I don't find that very elegant.
Thanks
You can first concat dfA and dfB before merging with MainDF:
MainDF.merge(pd.concat([dfA, dfB], axis=0), how='left', on='key')
Do you need something like this?
dfA = pd.DataFrame({'tariff': [1, 2, 3], 'A': list('abc')})
dfB = pd.DataFrame({'tariff': [4, 5, 6], 'B': list('def')})
dfJoin = pd.concat([dfA, dfB], ignore_index=True)
     A    B  tariff
0    a  NaN       1
1    b  NaN       2
2    c  NaN       3
3  NaN    d       4
4  NaN    e       5
5  NaN    f       6
Now you can merge with dfJoin.
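Alternatively, if you have already done the two merges and are stuck with tariff_x and tariff_y, you can collapse them into a single column; a minimal sketch, relying on the fact that the keys are exclusive, so at most one of the two values is non-NaN per row:
MainDF['tariff'] = MainDF['tariff_x'].fillna(MainDF['tariff_y'])
MainDF = MainDF.drop(columns=['tariff_x', 'tariff_y'])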

How to keep index when using pandas merge

I would like to merge two DataFrames and keep the index from the first frame as the index on the merged dataset. However, when I do the merge, the resulting DataFrame has an integer index. How can I specify that I want to keep the index from the left data frame?
In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3},
                          'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})
In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3},
                          'to_merge_on': {0: 1, 1: 3, 2: 5}})
In [6]: a
Out[6]:
   col1  to_merge_on
a     1            1
b     2            3
c     3            4
In [7]: b
Out[7]:
   col2  to_merge_on
0     1            1
1     2            3
2     3            5
In [8]: a.merge(b, how='left')
Out[8]:
   col1  to_merge_on  col2
0     1            1   1.0
1     2            3   2.0
2     3            4   NaN
In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')
EDIT: Switched to example code that can be easily reproduced
In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
       col1  to_merge_on  col2
index
a         1            1   1.0
b         2            3   2.0
c         3            4   NaN
Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.
You can make a copy of the index in the left dataframe and do the merge, then restore the index from the copied column:
a['copy_index'] = a.index
a.merge(b, how='left').set_index('copy_index')
I found this simple method very useful while working with large dataframes and using pd.merge_asof() (or dd.merge_asof()).
This approach is superior when resetting the index is expensive (large dataframes).
There is a non-pd.merge solution using Series.map and DataFrame.set_index.
a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2'])
   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN
This doesn't introduce a dummy index name for the index.
Note, however, that Series.map has no DataFrame counterpart for this kind of lookup, so this approach only brings over one column at a time.
df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)
This preserves the index of df1, but note that it only applies when you are joining on the index itself rather than on a column.
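A minimal sketch with two hypothetical frames that share the same index:
left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'y': [3, 4]}, index=['a', 'b'])
print (left.merge(right, how='inner', left_index=True, right_index=True))
#    x  y
# a  1  3
# b  2  4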
Assuming that the resulting df has the same number of rows and order as your first df, you can do this:
c = pd.merge(a, b, on='to_merge_on')
c.set_index(a.index, inplace=True)
Another simple option is to rename the index to what it was before:
a.merge(b, how="left").set_axis(a.index)
merge preserves the row order of dataframe a but resets the index, so it is safe to use set_axis.
You can also use the DataFrame.join() method to achieve the same thing. The join method persists the original index. The column to join on can be specified with the on parameter.
In [17]: a.join(b.set_index("to_merge_on"), on="to_merge_on")
Out[17]:
   col1  to_merge_on  col2
a     1            1   1.0
b     2            3   2.0
c     3            4   NaN
I think I've come up with a different solution. I was joining the left table on index value and the right table on a column value based off the index of the left table. What I did was a normal merge:
First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')
Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:
First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()
Then I manually set the index back to the original, left table index based off pre-existing column called Line Number (the column value I joined on from left table index):
First10ReviewsJoined.set_index('Line Number', inplace=True)
Then removed the index name of Line Number so that it remains blank:
First10ReviewsJoined.index.name = None
Maybe a bit of a hack, but it seems to work well and is relatively simple. Also, I guess it reduces the risk of duplicates or messing up your data. Hopefully that all makes sense.
For people who want to maintain the left index as it was before the left join:
import pandas

def left_join(
    a: pandas.DataFrame, b: pandas.DataFrame, on: list[str], b_columns: list[str] = None
) -> pandas.DataFrame:
    if b_columns:
        # Keep only the join keys plus the requested columns from b
        # (as a list, since indexing a DataFrame with a set is not supported)
        b = b[list(set(on + b_columns))]
    df = (
        a.reset_index()
        .merge(
            b,
            how="left",
            on=on,
        )
        # Unnamed index levels come back from reset_index as a column named "index"
        .set_index(keys=[x or "index" for x in a.index.names])
    )
    df.index.names = a.index.names
    return df
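A quick usage sketch with made-up frames (the names here are illustrative, not from the original answer):
a = pandas.DataFrame({'key': [1, 2], 'val': [10, 20]}, index=['x', 'y'])
b = pandas.DataFrame({'key': [1, 2], 'extra': ['p', 'q']})
print (left_join(a, b, on=['key']))
#    key  val extra
# x    1   10     p
# y    2   20     q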
