Join two dataframes based on two columns [duplicate] - python

This question already has answers here:
pandas: merge (join) two data frames on multiple columns
(6 answers)
Closed 12 months ago.
I want to join two data frames, df1 and df2, on two columns. For example, in the following dataframes, I want to join them on columns a, b and a1, b1 and build a third dataframe.
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df1['a'] = [ 1, 2, 3 ]
df1['b'] = [ 2, 4, 6]
df1['c'] = [ 3, 5, 9]
df2['a1'] = [ 1, 2 ]
df2['b1'] = [ 4, 4]
df2['c1'] = [ 7, 5]
The expected output is the single matching row, with the columns renamed to a2, b2, c3 (see the answer below).

You can use pd.merge() with multiple keys, passing a, b and a1, b1 to left_on and right_on, as follows:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df1['a'] = [1, 2, 3]
df1['b'] = [2, 4, 6]
df1['c'] = [3, 5, 9]
df2['a1'] = [1, 2]
df2['b1'] = [4, 4]
df2['c1'] = [7, 5]
df3 = pd.merge(df1, df2, left_on=['a', 'b'], right_on=['a1', 'b1'], how='inner')
print(df3)  # df3 has all columns from both df1 and df2
# a b c a1 b1 c1
#0 2 4 5 2 4 5
df3 = df3.drop(df2.columns, axis=1)  # remove df2's columns, since they duplicate the join keys
df3.columns = ['a2', 'b2', 'c3']  # rename the columns as desired
print(df3)
# a2 b2 c3
#0 2 4 5
For more information about pd.merge(), please see: https://pandas.pydata.org/docs/reference/api/pandas.merge.html
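If you would rather avoid carrying the duplicated key columns at all, a minimal alternative sketch (using the same df1 and df2 as above) is to rename df2's keys to match df1 before merging, so a plain on= key can be used:
# Rename df2's key columns so both frames share the same key names, then merge on them.
df2_renamed = df2.rename(columns={'a1': 'a', 'b1': 'b'})
df3 = pd.merge(df1, df2_renamed, on=['a', 'b'], how='inner')
print(df3)
#    a  b  c  c1
# 0  2  4  5   5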

Related

Replace values of a dataframe with the value of another dataframe

I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
The index (key) of both dataframes is 'A'.
How to replace the values of df1's column 'B' with the values of df2 column 'B'?
RESULT of df1:
A B
1 8
3 10
5 12
Maybe DataFrame.isin() is what you're searching for:
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
A B
0 1 8
1 3 10
2 5 12
One possible solution:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', '']) \
         .drop(columns=['B_x'])
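Another option, a minimal sketch assuming every value in df1['A'] also appears in df2['A'], is to build a lookup Series from df2 and map it onto df1:
lookup = df2.set_index('A')['B']   # Series indexed by 'A', values taken from df2's 'B'
df1['B'] = df1['A'].map(lookup)    # look up each value of df1['A'] in that Series
print(df1)
#    A   B
# 0  1   8
# 1  3  10
# 2  5  12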

Conditioning pandas Dataframe on an Array

I'm trying to figure out how to condition on an array I've created.
first6 = df["Tbl_Name_Dur"].unique()[0:6]
for element in first6:
    print(element)

df_test = df[df['Tbl_Name_Dur'] for element in first6]
I've printed the elements and that works. How do I select rows of my dataframe based on first6? I've tried the following:
df_test = df[df['Tbl_Name_Dur'] in first6]
df_test = df[df['Tbl_Name_Dur'] == first6]
Any help would be much appreciated!
You can use the isin method. Here is an example:
import pandas as pd
data_dict = {'col': pd.Series([1, 2, 3, 4, 4, 5, 6, 7, 8, 8])}
df = pd.DataFrame(data_dict)
first6 = df.col.unique()[0:6]
df = df[df.isin(first6)]
df.dropna(inplace=True)
print(df)
Alternatively, you can use a lambda function together with map:
import pandas as pd
data_dict = {'col': pd.Series([1, 2, 3, 4, 4, 5, 6, 7, 8, 8])}
df = pd.DataFrame(data_dict)
first6 = df.col.unique()[0:6]
df = df[df.col.map(lambda x: x in first6)]
print(df)
Output:
col
0 1
1 2
2 3
3 4
4 4
5 5
6 6
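For the original Tbl_Name_Dur case, a minimal sketch that applies isin directly to the column gives a boolean mask, so no dropna step is needed (the column name is taken from the question):
first6 = df['Tbl_Name_Dur'].unique()[:6]
df_test = df[df['Tbl_Name_Dur'].isin(first6)]  # keep only rows whose value is in first6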

Pandas dataframe addition on selecting 2 or more columns

When there are two dataframes with the same columns, how do I select particular columns and add the dataframes?
The dataframes are as follows:
a_val = {'col1': [1, 2], 'col2': [3, 4], 'col3': [7, 8]}
b_val = {'col1': [1, 5, 2], 'col2': [3, 2, 4], 'col3': [7, 17, 33]}
a = pd.DataFrame(a_val)
b = pd.DataFrame(b_val)
How do I build the resultant dataframe c, keeping the rows where col1 and col2 match in both frames and adding their col3 values? (The expected result is shown in the answer below.)
I think you need merge and then sum the last column:
c = (pd.merge(a, b, on=['col1', 'col2'], suffixes=('', '_'))
       .assign(col3=lambda x: x.col3 + x.col3_)
       .drop('col3_', axis=1))
Which is the same as:
c = pd.merge(a, b, on=['col1', 'col2'], suffixes=('', '_'))
c.col3 = c.col3.add(c.col3_)
c = c.drop('col3_', axis=1)
print (c)
col1 col2 col3
0 1 3 14
1 2 4 41
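An alternative sketch, assuming the (col1, col2) pairs are unique within each frame, aligns the two frames on a MultiIndex and adds them, which avoids the suffix bookkeeping (the row order of the result may differ):
c = a.set_index(['col1', 'col2']).add(b.set_index(['col1', 'col2']))  # non-matching rows become NaN
c = c.dropna().astype(int).reset_index()  # keep only the shared (col1, col2) pairs
print(c)
#    col1  col2  col3
# 0     1     3    14
# 1     2     4    41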

Selecting columns from dataframe based on the name of other dataframe

I have 3 dataframes,
df
df = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'AC007', 'AC007', 'AC007'],
'AA_ID': [22, 22, 2, 2, 2],
'BB_ID':[4, 5, 6, 8, 9],
'CC_ID' : [2, 2, 3, 3, 3],
'DD_RE': [4,7,8,9,0],
'EE_RE':[5,8,9,9,10]})
and df_ID,
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'CFV', 'SAP', 'SOS']})
and the other one is df_RE. Both of these dataframes have the column Name, so I need to merge them with dataframe df and then select columns based on the last part of each dataframe's name. That is, if the dataframe is df_ID, I need all columns ending with "ID", plus "Name", for all rows of df whose Name matches; and if the dataframe is df_RE, I need all columns ending with "RE", plus "Name", from df. I want to save each result separately.
I know I could do this inside a loop:
for dfs in dataframes:
    ID = [col for col in df.columns if '_ID' in col]
    df_ID = pd.merge(df, df_ID, on='Name')
    df_ID = df_ID[ID]
But here ID has to change whenever the dataframe ends with RE, and so on. I have several files with different suffixes, so a better solution would be great.
So in the end, for df_ID I need all the columns ending with ID:
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16'],
'AA_ID': [22, 22],
'BB_ID':[4, 5],
'CC_ID' : [2, 2]})
Any help would be great
Assuming the columns in df are Name plus columns with a suffix like the ones you listed (e.g. _ID, _RE), you can first parse the column names to extract all the unique suffixes:
# since the suffixes follow a pattern of `_*`, then I can look for the `_` character
suffixes = list(set([col[-3:] for col in df.columns if '_' in col]))
Now, with the list of suffixes, you next want to create a dictionary of your existing dataframes, where the keys in the dictionary are suffixes, and the values are the dataframes with the suffix names (e.g. df_ID, df_RE):
dfs = {}
dfs['_ID'] = df_ID
dfs['_RE'] = df_RE
... # and so forth
Now you can loop through your suffixes list to extract the appropriate columns with each suffix in the list and do the merges and column extractions:
for suffix in suffixes:
    cols = [col for col in df.columns if suffix in col]
    dfs[suffix] = pd.merge(df, dfs[suffix], on='Name')
    dfs[suffix] = dfs[suffix][cols]
Now you have your dictionary of suffixed dataframes. If you want your dataframes as separate variables instead of keeping them in your dictionary, you can now set them back as individual objects:
df_ID = dfs['_ID']
df_RE = dfs['_RE']
... # and so forth
Putting it all together in an example:
import pandas as pd
df = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'AC007', 'AC007', 'AC007'],
'AA_ID': [22, 22, 2, 2, 2],
'BB_ID': [4, 5, 6, 8, 9],
'CC_ID': [2, 2, 3, 3, 3],
'DD_RE': [4, 7, 8, 9, 0],
'EE_RE': [5, 8, 9, 9, 10]})
# Get unique suffixes
suffixes = list(set([col[-3:] for col in df.columns if '_' in col]))
dfs = {} # dataframes dictionary
df_ID = pd.DataFrame({'Name': ['CTA15', 'CTA16', 'CFV', 'SAP', 'SOS']})
df_RE = pd.DataFrame({'Name': ['AC007']})
dfs['_ID'] = df_ID
dfs['_RE'] = df_RE
for suffix in suffixes:
    cols = [col for col in df.columns if suffix in col]
    dfs[suffix] = pd.merge(df, dfs[suffix], on='Name')
    dfs[suffix] = dfs[suffix][cols]
df_ID = dfs['_ID']
df_RE = dfs['_RE']
print(df_ID)
print(df_RE)
Result:
AA_ID BB_ID CC_ID
0 22 4 2
1 22 5 2
DD_RE EE_RE
0 8 9
1 9 9
2 0 10
You can first merge df with df_ID and then take the columns ending with ID.
pd.merge(df,df_ID,on='Name')[[e for e in df.columns if e.endswith('ID') or e=='Name']]
Out[121]:
AA_ID BB_ID CC_ID Name
0 22 4 2 CTA15
1 22 5 2 CTA16
Similarly, this can be done for the df_RE df as well.
pd.merge(df,df_RE,on='Name')[[e for e in df.columns if e.endswith('RE') or e=='Name']]
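A more compact variant of the same idea, sketched with DataFrame.filter and a regex instead of a list comprehension (the output variable names here are just placeholders):
df_ID_out = pd.merge(df, df_ID, on='Name').filter(regex=r'_ID$|^Name$')  # columns ending in _ID, plus Name
df_RE_out = pd.merge(df, df_RE, on='Name').filter(regex=r'_RE$|^Name$')  # columns ending in _RE, plus Name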

Column order in pandas.concat

I do as below:
data1 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
data2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
frames = [data1, data2]
data = pd.concat(frames)
data
a b
0 2 1
1 2 1
2 2 1
0 2 1
1 2 1
2 2 1
The columns of data are in alphabetical order. Why is that, and how do I keep the original order?
You are creating DataFrames out of dictionaries. In Python versions before 3.7, dictionaries are unordered, which means the keys do not have a specific order. So
d1 = {'key_a': 'val_a', 'key_b': 'val_b'}
and
d2 = {'key_b': 'val_b', 'key_a': 'val_a'}
are (probably) the same.
In addition to that, I assume that pandas sorts the dictionary's keys alphabetically by default when building the DataFrame (unfortunately I did not find any hint in the docs to prove that assumption), which leads to the behavior you encountered.
So the basic fix is to reorder the columns in your DataFrame. You can do this as follows:
import pandas as pd
data1 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
data2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
frames = [data1, data2]
data = pd.concat(frames)
print(data)
cols = ['b' , 'a']
data = data[cols]
print(data)
Starting from version 0.23.0, you can prevent the concat() method from sorting the returned DataFrame. For example:
df1 = pd.DataFrame({ 'a' : [1, 1, 1], 'b' : [2, 2, 2]})
df2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
df = pd.concat([df1, df2], sort=False)
Newer pandas versions (1.0 and later) already default to sort=False, so they no longer sort the columns.
def concat_ordered_columns(frames):
    columns_ordered = []
    for frame in frames:
        columns_ordered.extend(x for x in frame.columns if x not in columns_ordered)
    final_df = pd.concat(frames)
    return final_df[columns_ordered]

# Usage
dfs = [df_a, df_b, df_c]
full_df = concat_ordered_columns(dfs)
This should work.
You can create the original DataFrames with OrderedDicts
from collections import OrderedDict
odict = OrderedDict()
odict['b'] = [1, 1, 1]
odict['a'] = [2, 2, 2]
data1 = pd.DataFrame(odict)
data2 = pd.DataFrame(odict)
frames = [data1, data2]
data = pd.concat(frames)
data
b a
0 1 2
1 1 2
2 1 2
0 1 2
1 1 2
2 1 2
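On current versions (Python 3.7+ and pandas 0.23+), a plain dict already preserves insertion order, so a minimal sketch combining that with sort=False gives the b, a order without OrderedDict:
data1 = pd.DataFrame({'b': [1, 1, 1], 'a': [2, 2, 2]})  # plain dicts keep insertion order in Python 3.7+
data2 = pd.DataFrame({'b': [1, 1, 1], 'a': [2, 2, 2]})
data = pd.concat([data1, data2], sort=False)  # sort=False keeps the input column order
print(data)  # columns come out as b, a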
You can also specify the order like this:
import pandas as pd
data1 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
data2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
listdf = [data1, data2]
data = pd.concat(listdf)
sequence = ['b','a']
data = data.reindex(columns=sequence)
The simplest way is to first put the columns in the same order, then concat:
df2 = df2[df1.columns]
df = pd.concat((df1, df2), axis=0)
