Merging DataFrames with Different Columns - python

Suppose I have two dataframes df1 and df2, as shown by the first two dataframes in the image below. I want to combine them to get df_desired, as shown by the final dataframe in the image. My current attempt produces the third dataframe in the image; as you can see, it ignores the fact that it has already seen a row with name a.
My code:
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b'], 'data1': [3, 4]})
df2 = pd.DataFrame({'name': ['a', 'c'], 'data2': [1, 5]})

def collect_results(target_list, df_list):
    df = pd.DataFrame(columns=['name', 'data1', 'data2'])
    for i in range(2):
        target = target_list[i]
        df_target = df_list[i]
        smiles = list(df_target['name'])
        pxc50 = list(df_target[target])
        target_col_names = ['name', target]
        df_target_info = pd.DataFrame(columns=target_col_names)
        df_target_info['name'] = smiles
        df_target_info[target] = pxc50
        try:
            df = pd.merge(df, df_target_info, how="outer", on=["name", target])
        except IndexError:
            df = df.reindex_axis(df.columns.union(df_target_info.columns), axis=1)
    return df
How can I get the desired behaviour?

You can merge on name with an outer join using .merge():
df_desired = df1.merge(df2, on='name', how='outer')
Result:
print(df_desired)
name data1 data2
0 a 3.0 1.0
1 b 4.0 NaN
2 c NaN 5.0
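If you have more than two frames keyed on name (as the loop in the question suggests), one option — a sketch of my own, not from the original answer — is to fold them together with functools.reduce:

import pandas as pd
from functools import reduce

df1 = pd.DataFrame({'name': ['a', 'b'], 'data1': [3, 4]})
df2 = pd.DataFrame({'name': ['a', 'c'], 'data2': [1, 5]})
df3 = pd.DataFrame({'name': ['b', 'c'], 'data3': [7, 8]})  # hypothetical third frame

# Outer-merge every frame on 'name', accumulating pairwise left to right
df_all = reduce(lambda left, right: left.merge(right, on='name', how='outer'),
                [df1, df2, df3])
print(df_all)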

Related

How can I match two different dataframes to show which rows match and which have divergences (showing the divergence)?

I have two dataframes and want to check whether they contain the same data or not.
df1:
df1 = [['tom', 10],['nick',15], ['juli',14]]
df1 = pd.DataFrame(df1, columns = ['Name', 'Age'])
df2:
df2 = [['nick', 15],['tom', 10], ['juli',14]]
df2 = pd.DataFrame(df2, columns = ['Name', 'Age'])
Note that the information in them is exactly the same; the only difference is the row order.
I've written code to compare both dataframes, but it shows the dataframes as differing in the first two rows:
import numpy as np
import pandas as pd

ne = (df1 != df2).any(axis=1)
ne_stacked = (df1 != df2).stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['id', 'col']
difference_locations = np.where(df1 != df2)
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
divergences = pd.DataFrame({'df1': changed_from, 'df2': changed_to}, index=changed.index)
print(divergences)
I am receiving the below result:
         df1   df2
id col
0  Name   tom  nick
   Age     10    15
1  Name  nick   tom
   Age     15    10
I was expecting to receive:
Empty DataFrame
Columns: [df1, df2]
Index: []
How do I change the code so that it tests each row of the dataframes to check whether they match?
And what if I were comparing two dataframes with different numbers of rows?
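One way to make the comparison order-insensitive (a sketch of mine, assuming rows are identified by their values rather than their position; not from the original thread) is to sort both frames before comparing. An outer merge with indicator=True also copes with frames that have different numbers of rows:

import pandas as pd

df1 = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['nick', 15], ['tom', 10], ['juli', 14]], columns=['Name', 'Age'])

# Sort both frames identically so the row order no longer matters
s1 = df1.sort_values(list(df1.columns)).reset_index(drop=True)
s2 = df2.sort_values(list(df2.columns)).reset_index(drop=True)
print(s1.equals(s2))  # True: same data, different original order

# With differing row counts, an outer merge with indicator=True
# shows which rows exist on only one side
diff = df1.merge(df2, how='outer', indicator=True)
print(diff[diff['_merge'] != 'both'])  # empty here: every row appears in both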

Concat by taking the values from a column

I have a list ['df1', 'df2'] in which I have stored the names of some dataframes that have been filtered on a few conditions. Then I converted this list to a dataframe using
df = pd.DataFrame(list1)
Now df has only one column:
0
df1
df2
Sometimes it may also have:
0
df1
df2
df3
I want to concat all of these. My static code is
df_new = pd.concat([df1, df2], axis=1) or
df_new = pd.concat([df1, df2, df3], axis=1)
How can I make it dynamic (without specifying df1, df2 by name), so that it takes the values and concats them?
Use a list to collect the dataframes, then concat them:
import pandas as pd

lists = [[1, 2, 3], [4, 5, 6]]
arr = []
for l in lists:
    new_df = pd.DataFrame(l)
    arr.append(new_df)
df = pd.concat(arr, axis=1)
df
Result:
0 0
0 1 4
1 2 5
2 3 6
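If what you actually have is a list of dataframe names as strings (as the question suggests), one option — an assumption on my part, since holding the objects directly is usually cleaner — is to keep a dict from name to dataframe and look the objects up before concatenating:

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 4]})
df3 = pd.DataFrame({'c': [5, 6]})

frames = {'df1': df1, 'df2': df2, 'df3': df3}  # name -> dataframe
list1 = ['df1', 'df2', 'df3']                  # the list of names

# Look each name up and concat the actual objects column-wise
df_new = pd.concat([frames[name] for name in list1], axis=1)
print(df_new)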

Conditionally merge pd.DataFrames

I want to know if this is possible with pandas:
From df2, I want to create new1 and new2.
new1 is the latest date that can be found in df1 matching on columns A and B.
new2 is the latest date that can be found in df1 matching on column A but not B.
I managed to get new1 but not new2.
Code:
import pandas as pd
d1 = [['1/1/19', 'xy','p1','54'], ['1/1/19', 'ft','p2','20'], ['3/15/19', 'xy','p3','60'],['2/5/19', 'xy','p4','40']]
df1 = pd.DataFrame(d1, columns = ['Name', 'A','B','C'])
d2 =[['12/1/19', 'xy','p1','110'], ['12/10/19', 'das','p10','60'], ['12/20/19', 'fas','p50','40']]
df2 = pd.DataFrame(d2, columns = ['Name', 'A','B','C'])
d3 = [['12/1/19', 'xy','p1','110','1/1/19','3/15/19'], ['12/10/19', 'das','p10','60','0','0'], ['12/20/19', 'fas','p50','40','0','0']]
dfresult = pd.DataFrame(d3, columns = ['Name', 'A','B','C','new1','new2'])
Updated!
IIUC, you want to add two columns to df2: new1 and new2.
First I modified two things:
df1 = pd.DataFrame(d1, columns = ['Name1', 'A','B','C'])
df2 = pd.DataFrame(d2, columns = ['Name2', 'A','B','C'])
df1.Name1 = pd.to_datetime(df1.Name1)
I renamed Name to Name1 and Name2 for ease of use, then turned Name1 into a real date so we can take the maximum date per group.
Then we merge df2 with df1 on the A column. This gives us the rows that match on that column:
aux = df2.merge(df1, on='A')
Then, where the B column is the same in both dataframes, we take Name1 from it:
df2['new1'] = df2.index.map(aux[aux.B_x==aux.B_y].Name1).fillna(0)
Where they differ, we take the maximum date for every A group:
df2['new2'] = df2.A.map(aux[aux.B_x!=aux.B_y].groupby('A').Name1.max()).fillna(0)
Output:
Name2 A B C new1 new2
0 12/1/19 xy p1 110 2019-01-01 00:00:00 2019-03-15 00:00:00
1 12/10/19 das p10 60 0 0
2 12/20/19 fas p50 40 0 0
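For reference, here are the fragments above stitched into one runnable script (my consolidation of the answer, with fillna(0) matching the asker's 0 placeholders):

import pandas as pd

d1 = [['1/1/19', 'xy', 'p1', '54'], ['1/1/19', 'ft', 'p2', '20'],
      ['3/15/19', 'xy', 'p3', '60'], ['2/5/19', 'xy', 'p4', '40']]
df1 = pd.DataFrame(d1, columns=['Name1', 'A', 'B', 'C'])
d2 = [['12/1/19', 'xy', 'p1', '110'], ['12/10/19', 'das', 'p10', '60'],
      ['12/20/19', 'fas', 'p50', '40']]
df2 = pd.DataFrame(d2, columns=['Name2', 'A', 'B', 'C'])
df1.Name1 = pd.to_datetime(df1.Name1)

# Pair every df2 row with every df1 row sharing the same A
aux = df2.merge(df1, on='A')

# new1: Name1 where B also matches; new2: latest Name1 per A where B differs
df2['new1'] = df2.index.map(aux[aux.B_x == aux.B_y].Name1).fillna(0)
df2['new2'] = df2.A.map(aux[aux.B_x != aux.B_y].groupby('A').Name1.max()).fillna(0)
print(df2)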
You can do this by:
- a standard merge based on A
- removing all entries whose B values match
- sorting by date
- dropping duplicates on A, keeping the last date (n.b. this assumes the dates are in date format, not strings!)
- merging back on Name
Thus:
source = df1.copy()  # renamed
v = df2.merge(source, on='A', how='left')  # get all values where df2.A == source.A
v = v[v['B_x'] != v['B_y']]  # drop entries where B values are the same
nv = v.sort_values(by=['Name_y']).drop_duplicates(subset=['Name_x'], keep='last')
df2.merge(nv[['Name_y', 'Name_x']].rename(columns={'Name_y': 'new2', 'Name_x': 'Name'}),
          on='Name', how='left')  # keeps non-matching rows; consider inner
This yields:
Out[94]:
Name A B C new2
0 12/1/19 xy p1 110 3/15/19
1 12/10/19 das p10 60 NaN
2 12/20/19 fas p50 40 NaN
My initial thought was to do something like the below. Sadly, it is not elegant. Generally, this sort of row-wise approach to deriving a value is frowned upon, mostly because it fails to scale and gets especially slow with large data.
def find_date(row, source=df1):  # renamed df1 to source
    t = source[source['B'] != row['B']]
    t = t[t['A'] == row['A']]
    if t.empty:  # no match on A: nothing to return
        return None
    # assumes 'Name' holds real dates; returns the latest one
    return t.sort_values(by='Name', ascending=False).iloc[0]['Name']

df2['new2'] = df2.apply(find_date, axis=1)

Issues with append, merge and join for 3 different dataframe outputs from pandas with 1 index

I have 10,000 data points that I'm sorting into a dictionary and then exporting to a csv using pandas. I'm sorting temperatures, pressures and flow associated with a key. But when doing this I get: https://imgur.com/a/aNX7RHf
but I want something like this: https://imgur.com/a/ZxJgPv4
I'm transposing my dataframe so the index can be rows, but in this case I want only 3 rows (1, 2 and 3), with all the data populating those rows.
import pandas as pd

flow_dictionary = {'200:P1F1':[5.5, 5.5, 5.5]}
pres_dictionary = {'200:PT02':[200,200,200],
'200:PT03':[200,200,200],
'200:PT06':[66,66,66],
'200:PT07':[66,66,66]}
temp_dictionary = {'200:TE02':[27,27,27],
'200:TE03':[79,79,79],
'200:TE06':[113,113,113],
'200:TE07':[32,32,32]}
df = pd.DataFrame.from_dict(temp_dictionary, orient='index').T
df2 = pd.DataFrame.from_dict(pres_dictionary, orient='index').T
df3 = pd.DataFrame.from_dict(flow_dictionary, orient='index').T
df = df.append(df2, ignore_index=False, sort=True)
df = df.append(df3, ignore_index=False, sort=True)
df.to_csv('processedSegmentedData.csv')
SOLUTION:
df1 = pd.DataFrame.from_dict(temp_dictionary, orient='index').T
df2 = pd.DataFrame.from_dict(pres_dictionary, orient='index').T
df3 = pd.DataFrame.from_dict(flow_dictionary, orient='index').T
df4 = pd.concat([df1,df2,df3], axis=1)
df4.to_csv('processedSegmentedData.csv')
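The difference — my explanation, not part of the original post — is that append stacks frames vertically, filling the non-shared columns with NaN, whereas pd.concat(..., axis=1) aligns the frames on their shared row index and puts the columns side by side:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

print(pd.concat([a, b]))          # stacked: 4 rows, NaN where columns don't overlap
print(pd.concat([a, b], axis=1))  # aligned: 2 rows, columns side by side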

Nested merges in pandas with suffixes

I'm trying to merge multiple dataframes in pandas and keep the column labels straight in the resulting dataframe. Here's my test case:
import pandas as pd
df1 = pd.DataFrame(data = [[1,1],[3,1],[5,1]], columns = ['key','val'])
df2 = pd.DataFrame(data = [[1,2],[3,2],[7,2]], columns = ['key','val'])
df3 = pd.DataFrame(data = [[1,3],[2,3],[4,3]], columns = ['key','val'])
df = pd.merge(pd.merge(df1,df2,on='key', suffixes=['_1','_2']),df3,on='key',suffixes=[None,'_3'])
I'm getting this:
df =
key val_1 val_2 val
0 1 1 2 3
I'd like to see this:
df =
key val_1 val_2 val_3
0 1 1 2 3
The last pair of suffixes that I've specified is [None, '_3'], the logic being that the pair ['_1', '_2'] has already created unique column names in the previous merge.
The suffix is needed only when the merged dataframe has two columns with the same name. When you merge df3, your dataframe has the column names val_1 and val_2, so there is no overlap with val.
You can handle that by renaming val to val_3, like this:
df = df1.merge(df2, on = 'key', suffixes=['_1','_2']).merge(df3, on = 'key').rename(columns = {'val': 'val_3'})
You could also try this:
df = pd.merge(pd.merge(df1, df2, on='key', suffixes=[None, '_2']), df3, on='key', suffixes=['_1', '_3'])
It works for me.
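For an arbitrary number of frames, one pattern (a sketch of mine, not from the answers above) is to rename each val column up front so no suffix collisions can occur, then fold the merges with functools.reduce:

import pandas as pd
from functools import reduce

frames = [
    pd.DataFrame(data=[[1, 1], [3, 1], [5, 1]], columns=['key', 'val']),
    pd.DataFrame(data=[[1, 2], [3, 2], [7, 2]], columns=['key', 'val']),
    pd.DataFrame(data=[[1, 3], [2, 3], [4, 3]], columns=['key', 'val']),
]

# Rename 'val' to 'val_i' before merging so every column name is already unique
renamed = [df.rename(columns={'val': f'val_{i}'}) for i, df in enumerate(frames, start=1)]
df = reduce(lambda left, right: left.merge(right, on='key'), renamed)
print(df)  # key val_1 val_2 val_3 -> 1 1 2 3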
