Pandas - Merge 2 df with same column names but exclusive values - python

I have 1 main df MainDF, with column key and other columns not relevant.
I also have 2 other dfs, dfA and dfB, with 2 columns, key and tariff. The keys in dfA and dfB are exclusive, ie there is no key in both dfA and dfB.
On MainDF, I do MainDF.merge(dfA, how='left', on='key'), which adds the column "tariff" to MainDF for the keys present in both dfA and MainDF, and puts NaN for all keys in MainDF that are not in dfA.
Now, I need to do MainDF.merge(dfB, how = 'left', on='key') to add the tariff for the keys in MainDF but not in dfA.
When I do the second merge, it will create in MainDF 2 columns tariff_x and tariff_y because tariff is already in MainDF following the first merge. However, since the keys are exclusive, I need to keep only one column tariff with the not-NaN values when possible.
How can I do this in a Pythonic way? I could add a new column that takes either tariff_x or tariff_y, but I don't find that very elegant.
Thanks

You can first concat dfA and dfB before merging with MainDF:
MainDF.merge(pd.concat([dfA, dfB], axis=0), how='left', on='key')
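A minimal sketch of the concat-then-merge idea above. The sample keys, the "other" column and the tariff values are made up for illustration; only the names MainDF, dfA, dfB, key and tariff come from the question. The validate argument is an optional safety check that the stacked keys really are unique.

```python
import pandas as pd

# Hypothetical sample data: keys in dfA and dfB never overlap.
MainDF = pd.DataFrame({"key": ["k1", "k2", "k3", "k4"], "other": [10, 20, 30, 40]})
dfA = pd.DataFrame({"key": ["k1", "k3"], "tariff": [1.5, 2.5]})
dfB = pd.DataFrame({"key": ["k2"], "tariff": [9.0]})

# Stack the two lookup tables, then do a single left merge.
tariffs = pd.concat([dfA, dfB], ignore_index=True)

# validate="many_to_one" raises a MergeError if a key appears in both dfA and dfB.
merged = MainDF.merge(tariffs, how="left", on="key", validate="many_to_one")

print(merged)
```

Because only one merge happens, there is a single tariff column and no _x/_y suffixes; keys found in neither dfA nor dfB (k4 here) end up as NaN.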

Do you need something like this:
dfA = pd.DataFrame({'tariff': [1, 2, 3], 'A': list('abc')})
dfB = pd.DataFrame({'tariff': [4, 5, 6], 'A': list('def')})
dfJoin = pd.concat([dfA, dfB], ignore_index=True)
   tariff  A
0       1  a
1       2  b
2       3  c
3       4  d
4       5  e
5       6  f
Now you can merge with dfJoin.
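If the two separate merges have already been done, the pair of suffixed columns described in the question can still be collapsed into one. The sketch below assumes made-up sample data and uses custom suffixes (_A and _B, chosen here for readability instead of the default _x and _y); it is one possible way, not necessarily the most elegant.

```python
import pandas as pd

# Hypothetical sample data with mutually exclusive keys in dfA and dfB.
MainDF = pd.DataFrame({"key": ["k1", "k2", "k3", "k4"], "other": [10, 20, 30, 40]})
dfA = pd.DataFrame({"key": ["k1", "k3"], "tariff": [1.5, 2.5]})
dfB = pd.DataFrame({"key": ["k2"], "tariff": [9.0]})

# Two successive left merges produce two tariff columns.
step1 = MainDF.merge(dfA, how="left", on="key")
step2 = step1.merge(dfB, how="left", on="key", suffixes=("_A", "_B"))

# Take the dfA value where present, otherwise fall back to the dfB value.
step2["tariff"] = step2["tariff_A"].fillna(step2["tariff_B"])
result = step2.drop(columns=["tariff_A", "tariff_B"])

print(result)
```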

Related

How can I join 2 pandas dataframe using left join?

I have 2 dataframes which I need to join using left join. In sql I have the query as
SELECT A.* INTO NewTable FROM A LEFT JOIN B ON A.id=B.id WHERE B.id IS NULL;
I have the 2 dataframes as:
df1:
id  name
1   one
2   two
3   three
4   four
df2:
id
2
3
What I am expecting is:
id  name
1   one
4   four
What I have tried?
common = df1.merge(df2, on=['id', 'id'])
result = df1[~df1.id.isin(common.id)]
I get more results from this than the query returns. Any help is appreciated.
You have the right solution; you are only interpreting the results incorrectly.
This will print the result without the index:
import pandas as pd
d = {'id': [1, 2,3,4], 'col2': ['one','two','three','four']}
d1 = {'id': [2,3]}
df1 = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d1)
result = df1[~df1.id.isin(df2.id)]
print(result.to_string(index=False))
You can use left join with .merge() with indicator= parameter turned on. Then, filter the indicator values equal to "left_only" with .query(), as follows:
df1.merge(df2, on='id', how='left', indicator='ind').query('ind == "left_only"')
Result:
id name ind
0 1 one left_only
3 4 four left_only
Optionally, you can remove the indicator column, as follows:
df1.merge(df2, on='id', how='left', indicator='ind').query('ind == "left_only"').drop('ind', axis=1)
Result:
id name
0 1 one
3 4 four
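The indicator-based approach above can be wrapped in a small reusable function. This is only a sketch: the name anti_join is hypothetical, and the sample frames mirror the ones in the question.

```python
import pandas as pd


def anti_join(left: pd.DataFrame, right: pd.DataFrame, on: str) -> pd.DataFrame:
    """Return the rows of `left` whose `on` value has no match in `right`."""
    # Keep only the key column and drop duplicates so the merge cannot add rows.
    keys = right[[on]].drop_duplicates()
    merged = left.merge(keys, on=on, how="left", indicator=True)
    # indicator=True adds a "_merge" column with left_only / both / right_only.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


df1 = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["one", "two", "three", "four"]})
df2 = pd.DataFrame({"id": [2, 3]})

result = anti_join(df1, df2, "id")
print(result)
```

This corresponds to the SQL pattern LEFT JOIN ... WHERE B.id IS NULL from the question.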
Try:
print(df1[~df1["id"].isin(df2["id"])])
Prints:
id name
0 1 one
3 4 four
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['one', 'two', 'three', 'four']})
df2 = pd.DataFrame({'id': [2, 3]})
# drop() removes rows by index label, so index df1 by id before dropping df2's ids
result = df1.set_index('id').drop(df2['id'], errors='ignore').reset_index()
result

Merging dataframes and keeping columns in place?

I have the following data frames:
DF1
df1 = pd.DataFrame(columns = ['Key', 'Value'])
df1['Key'] = ['A', 'B', 'C', 'D']
DF2
df2 = pd.DataFrame(columns = ['Key', 'Value'])
df2['Key'] = ['A', 'C']
df2['Value'] = [1,7]
I would like to merge these two data frames such that the data from DF2 under the column 'Value' is filled in DF1, where the remaining letters 'B' and 'D' have zero.
I tried this:
df3 = pd.merge(df1,df2,how='outer', on = 'Key')
However, this creates an additional column Value_x and Value_y which is not what I want.
Thanks
I think the shortest way to accomplish this is
df1[['Key']].merge(df2, on='Key', how='outer')
By not including Value from the left frame, you avoid getting two Value columns in the resulting data frame.
You could remove the Value column from df1 and use your existing merge.
Or only use the Key column from df1 when merging.
df3 = pd.merge(df1['Key'],df2,how='outer', on = 'Key').fillna(value=0)
Key Value
0 A 1.0
1 B 0.0
2 C 7.0
3 D 0.0
Another way to do it is by concatenating the dataframes and then grouping the Key value like this:
df3 = pd.concat([df1, df2]).fillna(0).groupby('Key').sum().reset_index()
Output:
Key Value
0 A 1
1 B 0
2 C 7
3 D 0
This way is a little verbose but easier to read and extensible to more than 2 DFs.
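As a sketch of how the concat-and-group approach extends beyond two frames, the example below adds a third, made-up frame (df_extra is hypothetical and not part of the question).

```python
import pandas as pd

df1 = pd.DataFrame({"Key": ["A", "B", "C", "D"]})
df2 = pd.DataFrame({"Key": ["A", "C"], "Value": [1, 7]})
df_extra = pd.DataFrame({"Key": ["D"], "Value": [4]})  # hypothetical third frame

combined = (
    pd.concat([df1, df2, df_extra], ignore_index=True)
    .fillna({"Value": 0})            # keys without a value contribute 0
    .groupby("Key", as_index=False)["Value"]
    .sum()                           # one row per key
)

print(combined)
```

Note that summing is only a safe way to combine rows when each key has at most one real value across the frames; otherwise the values are added together.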

Outer merge between pandas and imputing NA with preceding row

I have two dataframes containing the same columns:
df1 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,3,4,5,6]})
df2 = pd.DataFrame({'a': [1,3,4],
'b': [2,4,5]})
I want df2 to have the same number of rows as df1. Any values of a that are missing from df2 should be copied over from df1, and the corresponding values of b should be taken from the preceding row.
In other words, I want df2 to look like this:
df2 = pd.DataFrame({'a': [1,2,3,4,5],
'b': [2,2,4,5,5]})
EDIT: I'm looking for an answer that is independent of the number of columns
Use DataFrame.merge with only column a from df1, then forward-fill the resulting missing values:
df = df1[['a']].merge(df2, how='left', on='a').ffill()
print (df)
a b
0 1 2.0
1 2 2.0
2 3 4.0
3 4 5.0
4 5 5.0
Or use merge_asof:
df = pd.merge_asof(df1[['a']], df2, on='a')
print (df)
a b
0 1 2
1 2 2
2 3 4
3 4 5
4 5 5
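Both variants work unchanged when df2 has more columns, which is what the EDIT in the question asks for. The sketch below adds a made-up column c to df2 to illustrate this; it assumes both frames are sorted by a, which merge_asof requires.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 3, 4, 5, 6]})
# Column "c" is an assumption added for illustration.
df2 = pd.DataFrame({"a": [1, 3, 4], "b": [2, 4, 5], "c": [10, 30, 40]})

# Variant 1: left merge on the key only, then forward-fill every other column.
filled = df1[["a"]].merge(df2, how="left", on="a").ffill()

# Variant 2: for each a in df1, take the last df2 row whose a is <= that value.
asof = pd.merge_asof(df1[["a"]], df2, on="a")

print(filled)
print(asof)
```

The left merge introduces NaN before filling, so the filled columns come back as floats, while merge_asof never creates NaN here and keeps the integer dtype.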

Match rows between dataframes and preserve order

I work in python and pandas.
Let's suppose that I have a dataframe like that (INPUT):
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
I want to process it to finally get a new dataframe which looks like that (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
To manage this I do the following:
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1
df_2['B'] -= 1
df_2['C'] = np.nan
df_2 looks like that for now:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
Now I want to match/merge df_1 and df_2 using columns A and B as keys.
I tried with isin() to do this:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
df_2.iloc[df_temp.index] = df_temp
but it gives me back the same df_2 as before without matching the common row 5 1 1 for A, B, C respectively:
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
How can I do this properly?
By the way, just to be clear, the matching should not be done like
1st row of df1 - 1st row of df2
2nd row of df1 - 2nd row of df2
3rd row of df1 - 3rd row of df2
...
But it has to be done as:
any row of df1 - any row of df2
based on the specified columns as keys.
I think this is why isin() in my code above does not work: it does the filtering/matching in the former way.
On the other hand, .merge() can do the matching in the latter way but it does not preserve the order of the rows in the way I want and it is pretty tricky or inefficient to fix that.
Finally, keep in mind that with my actual dataframes way more than only 2 columns (e.g. 15) will be used as keys for the matching so it is better that you come up with something concise even for bigger dataframes.
P.S.
See my answer below.
Here's my suggestion using a lambda function in apply. Should be easily scalable to more columns to compare (just adjust cols_to_compare accordingly). By the way, when generating df_2, be sure to copy df_1, otherwise changes in df_2 will carry over to df_1 as well.
So generating the data first:
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
df_2 = df_1.copy() # Be sure to create a copy here
df_2['B'] -= 1
df_2['C'] = np.nan
And now we 'scan' df_1 for the rows of interest:
cols_to_compare = ['A', 'B']
df_2['C'] = df_2.apply(lambda x: 1 if any((df_1.loc[:, cols_to_compare].values[:]==x[cols_to_compare].values).all(1)) else np.nan, axis=1)
What it does is check, for each row of df_2, whether the values in the compared columns also occur together in any row of df_1.
The output is:
A B C
0 2 7 NaN
1 5 1 1.0
2 3 3 NaN
3 5 0 NaN
Someone (I do not remember his username) suggested the following (which I think works) and then he deleted his post for some reason (??!):
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
You can accomplish this using two for loops:
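# compare every row of df_2 with every row of df_1 on the key columns
for i, row in df_2.iterrows():
    for _, row2 in df_1.iterrows():
        if [row['A'], row['B']] == [row2['A'], row2['B']]:
            # .loc avoids chained assignment, which may silently fail to write
            df_2.loc[i, 'C'] = row2['C']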
Just modify your below line:
df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
with:
df_1[df_1['A'].isin(df_2['A']) & df_1['B'].isin(df_2['B'])]
It works fine!!
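For comparison, here is a sketch of row-wise matching on the key columns with a left merge. According to the pandas documentation, how="left" preserves the key order of the left frame, so the rows stay in df_2's order. The drop_duplicates step is an assumption added here so that repeated key combinations in df_1 cannot multiply rows.

```python
import numpy as np
import pandas as pd

columns = ["A", "B", "C"]
data_1 = np.array([[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]).T
df_1 = pd.DataFrame(data=data_1, columns=columns)

df_2 = df_1.copy()  # copy, so df_1 is not modified
df_2["B"] -= 1
df_2["C"] = np.nan

keys = ["A", "B"]  # extend this list for more key columns

# One row per key combination in the lookup frame.
lookup = df_1.drop_duplicates(subset=keys)

# Left merge: every row of df_2 is kept, in its original order.
result = df_2[keys].merge(lookup, on=keys, how="left")
result.index = df_2.index  # restore the original index labels

print(result)
```

Only the key combination (5, 1) exists in both frames, so only that row receives a value for C; the others stay NaN.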

How to keep index when using pandas merge

I would like to merge two DataFrames, and keep the index from the first frame as the index on the merged dataset. However, when I do the merge, the resulting DataFrame has integer index. How can I specify that I want to keep the index from the left data frame?
In [4]: a = pd.DataFrame({'col1': {'a': 1, 'b': 2, 'c': 3},
'to_merge_on': {'a': 1, 'b': 3, 'c': 4}})
In [5]: b = pd.DataFrame({'col2': {0: 1, 1: 2, 2: 3},
'to_merge_on': {0: 1, 1: 3, 2: 5}})
In [6]: a
Out[6]:
col1 to_merge_on
a 1 1
b 2 3
c 3 4
In [7]: b
Out[7]:
col2 to_merge_on
0 1 1
1 2 3
2 3 5
In [8]: a.merge(b, how='left')
Out[8]:
col1 to_merge_on col2
0 1 1 1.0
1 2 3 2.0
2 3 4 NaN
In [9]: _.index
Out[9]: Int64Index([0, 1, 2], dtype='int64')
EDIT: Switched to example code that can be easily reproduced
In [5]: a.reset_index().merge(b, how="left").set_index('index')
Out[5]:
col1 to_merge_on col2
index
a 1 1 1
b 2 3 2
c 3 4 NaN
Note that for some left merge operations, you may end up with more rows than in a when there are multiple matches between a and b. In this case, you may need to drop duplicates.
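One way to catch the duplicate-match case early, sketched under the assumption that each key should appear at most once in b, is the validate argument of merge. The frame b_dup below is made up to trigger the check.

```python
import pandas as pd

a = pd.DataFrame({"col1": {"a": 1, "b": 2, "c": 3},
                  "to_merge_on": {"a": 1, "b": 3, "c": 4}})
b = pd.DataFrame({"col2": {0: 1, 1: 2, 2: 3},
                  "to_merge_on": {0: 1, 1: 3, 2: 5}})

# validate="many_to_one" asserts that the keys in b are unique,
# so the merged frame has exactly as many rows as a.
merged = (
    a.reset_index()
    .merge(b, how="left", on="to_merge_on", validate="many_to_one")
    .set_index("index")
)
merged.index.name = None  # drop the helper name added by reset_index

# Hypothetical frame with a repeated key: the same merge now raises.
b_dup = pd.concat([b, b.iloc[[0]]], ignore_index=True)
try:
    a.merge(b_dup, how="left", on="to_merge_on", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged)
print(duplicate_detected)
```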
You can make a copy of index on left dataframe and do merge.
a['copy_index'] = a.index
a.merge(b, how='left')
I found this simple method very useful while working with large dataframe and using pd.merge_asof() (or dd.merge_asof()).
This approach would be superior when resetting index is expensive (large dataframe).
There is a non-pd.merge solution using Series.map and DataFrame.set_index.
a['col2'] = a['to_merge_on'].map(b.set_index('to_merge_on')['col2'])
col1 to_merge_on col2
a 1 1 1.0
b 2 3 2.0
c 3 4 NaN
This doesn't introduce a dummy index name for the index.
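Note, however, that Series.map looks up a single key column (DataFrame.map, added in pandas 2.1, applies a function element-wise and is not a lookup), so this approach does not extend to multi-column keys.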
df1 = df1.merge(df2, how="inner", left_index=True, right_index=True)
This preserves the index of df1.
Assuming that the resulting df has the same number of rows and order as your first df, you can do this:
c = pd.merge(a, b, on='to_merge_on')
c.set_index(a.index,inplace=True)
another simple option is to rename the index to what was before:
a.merge(b, how="left").set_axis(a.index)
merge preserves the row order of dataframe 'a' and merely resets the index, so it's safe to use set_axis (as long as the merge does not add or drop rows).
You can also use DataFrame.join() method to achieve the same thing. The join method will persist the original index. The column to join can be specified with on parameter.
In [17]: a.join(b.set_index("to_merge_on"), on="to_merge_on")
Out[17]:
col1 to_merge_on col2
a 1 1 1.0
b 2 3 2.0
c 3 4 NaN
Think I've come up with a different solution. I was joining the left table on index value and the right table on a column value based off index of left table. What I did was a normal merge:
First10ReviewsJoined = pd.merge(First10Reviews, df, left_index=True, right_on='Line Number')
Then I retrieved the new index numbers from the merged table and put them in a new column named Sentiment Line Number:
First10ReviewsJoined['Sentiment Line Number']= First10ReviewsJoined.index.tolist()
Then I manually set the index back to the original, left table index based off pre-existing column called Line Number (the column value I joined on from left table index):
First10ReviewsJoined.set_index('Line Number', inplace=True)
Then removed the index name of Line Number so that it remains blank:
First10ReviewsJoined.index.name = None
Maybe a bit of a hack but seems to work well and relatively simple. Also, guess it reduces risk of duplicates/messing up your data. Hopefully that all makes sense.
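For those who want to keep the left index as it was before the left join:
import pandas
from typing import Optional

def left_join(
    a: pandas.DataFrame, b: pandas.DataFrame, on: list[str], b_columns: Optional[list[str]] = None
) -> pandas.DataFrame:
    if b_columns:
        # keep only the join keys plus the requested columns, as an ordered,
        # de-duplicated list (recent pandas does not accept a set as an indexer)
        b_columns = list(dict.fromkeys(on + b_columns))
        b = b[b_columns]
    df = (
        a.reset_index()
        .merge(
            b,
            how="left",
            on=on,
        )
        # reset_index() names an unnamed single-level index "index"
        .set_index(keys=[x or "index" for x in a.index.names])
    )
    df.index.names = a.index.names
    return df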
