How to compare and merge two dataframes based on common columns that have different dictionaries?
I have the following two dataframes,
df1 = pd.DataFrame({'name':['tom','keith','sam','joe'],'assets':[{'laptop':1,'scanner':2},{'laptop':1,'printer':3}, {'car':12,'keys':34},{'power-cables':24}]})
df2 = pd.DataFrame({'place':['ca','bal-vm'],'default_assets':[{'laptop':4,'printer':3,'scanner':2,'bag':8},{'car':12,'keys':34,'mat':24,'holder':45}]})
df1:
name assets
0 tom {'laptop':1,'scanner':2}
1 keith {'laptop':1,'printer':3}
2 sam {'car':12,'keys':34}
3 joe {'power-cables':24}
df2:
place default_assets
0 ca {'laptop':4,'printer':3,'scanner':2,'bag':8}
1 bal-vm {'car':12,'keys':34,'mat':24,'holder':45}
df2 should be merged with df1 whenever all the keys of df1.assets are present in df2.default_assets; otherwise None should be filled.
So the resultant df should be,
df:
name place assets default_assets
0 tom ca {'laptop':1,'scanner':2} {'laptop':4,'printer':3,'scanner':2,'bag':8}
1 keith ca {'laptop':1,'printer':3} {'laptop':4,'printer':3,'scanner':2,'bag':8}
2 sam bal-vm {'car':12,'keys':34} {'car':12,'keys':34,'mat':24,'holder':45}
3 joe None {'power-cables':24} None
You could do the following:
Do a cross join (cross product) of every row of df1 with df2.
Then keep only the rows where all the keys of df1.assets are in df2.default_assets.
Finally add back the original rows from df1 with pandas.concat, dropping duplicate names so the matched rows win.
For example:
# cross join
merged = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
# mask: keep rows whose assets keys form a (strict) subset of the default_assets keys
mask = [asset.keys() < default.keys() for asset, default in zip(merged['assets'], merged['default_assets'])]
# add back the original df1 rows; drop_duplicates keeps the matched rows that come first
result = pd.concat([merged.loc[mask], df1], sort=True).drop_duplicates('name')
# print in full
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(result)
Output
assets \
0 {'laptop': 1, 'scanner': 2}
2 {'laptop': 1, 'printer': 3}
5 {'car': 12, 'keys': 34}
3 {'power-cables': 24}
default_assets name place
0 {'laptop': 4, 'printer': 3, 'scanner': 2, 'bag... tom ca
2 {'laptop': 4, 'printer': 3, 'scanner': 2, 'bag... keith ca
5 {'car': 12, 'keys': 34, 'mat': 24, 'holder': 45} sam bal-vm
3 NaN joe NaN
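On pandas 1.2 or newer, the same idea can be written a bit more directly: merge with how='cross' instead of the dummy key column, and use <= on the dict key views so that an exact match of keys also passes. A minimal sketch under those assumptions:
import pandas as pd
df1 = pd.DataFrame({'name': ['tom', 'keith', 'sam', 'joe'],
                    'assets': [{'laptop': 1, 'scanner': 2}, {'laptop': 1, 'printer': 3},
                               {'car': 12, 'keys': 34}, {'power-cables': 24}]})
df2 = pd.DataFrame({'place': ['ca', 'bal-vm'],
                    'default_assets': [{'laptop': 4, 'printer': 3, 'scanner': 2, 'bag': 8},
                                       {'car': 12, 'keys': 34, 'mat': 24, 'holder': 45}]})
# cross join without the dummy key column (requires pandas >= 1.2)
merged = df1.merge(df2, how='cross')
# keep rows whose assets keys are a (possibly equal) subset of the default_assets keys
mask = [a.keys() <= d.keys() for a, d in zip(merged['assets'], merged['default_assets'])]
# add back the unmatched names from df1; drop_duplicates keeps the merged rows that come first
result = pd.concat([merged.loc[mask], df1], sort=True).drop_duplicates('name')
print(result)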
I have a df which compares the new and old data. Is there a way to calculate the difference between the old and new data? For generality, I don't want to sort the dataframe, but only compare root variables that have the suffixes "_old" and "_new".
df
apple_old daily banana_new banana_tree banana_old apple_new
0 5 3 4 2 10 6
for x in df.columns:
    if x.endswith("_old") and x.endswith("_new"):
        x = x.dif()
Expected Output; brackets are shown just for clarity
df_diff
apple_diff(old-new) banana_diff(old-new)
0 -1 (5-6) 6 (10-4)
Let's try creating a Multi-Index, then subtracting old from new.
Setup:
import pandas as pd
df = pd.DataFrame({'apple_old': {0: 5}, 'daily': {0: 3}, 'banana_new': {0: 4},
                   'banana_tree': {0: 2}, 'banana_old': {0: 10},
                   'apple_new': {0: 6}})
# Creation of Multi-Index:
df.columns = df.columns.str.rsplit('_', n=1, expand=True).swaplevel(0, 1)
# Subtract old from new:
output_df = (df['old'] - df['new']).add_suffix('_diff')
# Display:
print(output_df)
apple_diff banana_diff
0 -1 6
Create the Multi-Index with str.rsplit, using a maximum split of n=1 so column names containing multiple _ are handled safely:
df.columns = df.columns.str.rsplit('_', n=1, expand=True).swaplevel(0, 1)
old NaN new tree old new
apple daily banana banana banana apple
0 5 3 4 2 10 6
Then selection:
df['old']
apple banana
0 5 10
df['new']
banana apple
0 4 6
Subtraction aligns by column labels, and add_suffix then appends _diff to the resulting columns.
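If the Multi-Index feels like overkill, the same diff can also be built with a plain loop over the *_old/*_new column pairs. A minimal sketch, assuming every column of interest has both an _old and a _new variant:
import pandas as pd
df = pd.DataFrame({'apple_old': [5], 'daily': [3], 'banana_new': [4],
                   'banana_tree': [2], 'banana_old': [10], 'apple_new': [6]})
diffs = {}
for col in df.columns:
    if col.endswith('_old'):
        root = col[:-len('_old')]         # e.g. 'apple'
        new_col = root + '_new'
        if new_col in df.columns:         # skip roots without a *_new partner
            diffs[root + '_diff'] = df[col] - df[new_col]
output_df = pd.DataFrame(diffs)
print(output_df)
#    apple_diff  banana_diff
# 0          -1            6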
I have two different dataframes which I need to compare.
These two dataframes have different numbers of rows and don't have a single primary key; the composite primary key is (id||ver||name||prd||loc).
df1:
id ver name prd loc
a 1 surya 1a x
a 1 surya 1a y
a 2 ram 1a x
b 1 alex 1b z
b 1 alex 1b y
b 2 david 1b z
df2:
id ver name prd loc
a 1 surya 1a x
a 1 surya 1a y
a 2 ram 1a x
b 1 alex 1b z
I tried the code below and it is working if there are the same number of rows, but in a case like the above it is not working.
df1 = pd.DataFrame(Source)
df1 = df1.astype(str)  # converting all elements to strings for easy comparison
df2 = pd.DataFrame(Target)
df2 = df2.astype(str)  # converting all elements to strings for easy comparison
header_list = df1.columns.tolist()  # list of column names from df1, as both dfs have the same structure
df3 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
for x in range(len(header_list)):
    df3[header_list[x]] = np.where(df1[header_list[x]] == df2[header_list[x]], 'True', 'False')
df3.to_csv('Output', index=False)
Please let me know how to compare the datasets if there are different numbers of rows.
You can try this:
~df1.isin(df2)
# df1[~df1.isin(df2)].dropna()
Let's consider a quick example:
df1 = pd.DataFrame({
    'Buyer': ['Carl', 'Carl', 'Carl'],
    'Quantity': [18, 3, 5]})
# Buyer Quantity
# 0 Carl 18
# 1 Carl 3
# 2 Carl 5
df2 = pd.DataFrame({
    'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'],
    'Quantity': [2, 1, 18, 5]})
# Buyer Quantity
# 0 Carl 2
# 1 Mark 1
# 2 Carl 18
# 3 Carl 5
~df2.isin(df1)
# Buyer Quantity
# 0 False True
# 1 True True
# 2 False True
# 3 True True
df2[~df2.isin(df1)].dropna()
# Buyer Quantity
# 1 Mark 1
# 3 Carl 5
Another idea is to merge on the same column names.
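As a rough sketch of that merge idea: an outer merge on all common columns with indicator=True flags where each row came from, so rows present in only one frame can be pulled out. Note that, unlike isin, this compares whole rows regardless of index position:
import pandas as pd
df1 = pd.DataFrame({'Buyer': ['Carl', 'Carl', 'Carl'], 'Quantity': [18, 3, 5]})
df2 = pd.DataFrame({'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'], 'Quantity': [2, 1, 18, 5]})
# outer merge on all common columns; the _merge column tells where each row came from
compared = df1.merge(df2, how='outer', indicator=True)
# rows that exist only in df2, i.e. (Carl, 2) and (Mark, 1)
only_in_df2 = compared[compared['_merge'] == 'right_only'].drop(columns='_merge')
print(only_in_df2)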
Sure, tweak the code to your needs. Hope this helped :)
I have a dataframe with two columns: a and b
df
a b
0 john 123
1 john
2 mark
3 mark 456
4 marcus 789
I want to update the values of column b based on column a.
a b
0 john 123
1 john 123
2 mark 456
3 mark 456
4 marcus 789
If john has the value 123 in b, the remaining john rows must also have the same value.
Assuming your dataframe is:
df = pd.DataFrame({'a': ['john', 'john', 'mark', 'mark', 'marcus'], 'b': [123, '', '', 456, 789]})
You can groupby the dataframe on column a and then apply transform on column b of the grouped dataframe, returning the first non-empty value in each group.
Use:
df['b'] = (
    df.groupby('a')['b']
      .transform(lambda s: s[s.ne('')].iloc[0] if s.ne('').any() else s)
)
Result:
# print(df)
a b
0 john 123
1 john 123
2 mark 456
3 mark 456
4 marcus 789
Example:
df = pd.DataFrame({'A': [0, " ", 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df1 = df.replace({'A': " "}, 3)
Hope this helps. In your case it would be:
df1 = df.replace({'b': ''}, 123)
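That replace call hard-codes 123, so it only covers the john rows. A minimal sketch of combining both answers, assuming the blanks are empty strings and each name has at most one real value: replace the blanks with NaN, then broadcast each group's first non-missing value with groupby/transform:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['john', 'john', 'mark', 'mark', 'marcus'],
                   'b': [123, '', '', 456, 789]})
# turn the empty strings into real missing values
df['b'] = df['b'].replace('', np.nan)
# fill every row of a group with that group's first non-missing value
df['b'] = df.groupby('a')['b'].transform('first')
print(df)
#         a    b
# 0    john  123
# 1    john  123
# 2    mark  456
# 3    mark  456
# 4  marcus  789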
I have a table of sites with a land cover class and a state. I have another table with values linked to class and state. In the second table, however, some of the rows are linked only to class:
sites = pd.DataFrame({'id': ['a', 'b', 'c'],
                      'class': [1, 2, 23],
                      'state': ['al', 'ar', 'wy']})
values = pd.DataFrame({'class': [1, 1, 2, 2, 23],
                       'state': ['al', 'ar', 'al', 'ar', None],
                       'val': [10, 11, 12, 13, 16]})
I'd like to link the tables by class and state, except for those rows in the value table for which state is None, in which case they would be linked only by class.
A merge has the following result:
combined = sites.merge(values, how='left', on=['class', 'state'])
id class state val
0 a 1 al 10.0
1 b 2 ar 13.0
2 c 23 wy NaN
But I'd like val in the last row to be 16. Is there an inexpensive way to do this short of breaking up both tables, performing separate merges, and then concatenating the result?
How about merging them separately:
pd.concat([sites.merge(values, on=['class', 'state']),
           sites.merge(values[values['state'].isna()].drop('state', axis=1),
                       on=['class'])
           ])
Output:
id class state val
0 a 1 al 10
1 b 2 ar 13
0 c 23 wy 16
We can use combine_first here:
(sites.set_index(['class', 'state'])
      .combine_first(values.set_index(['class', 'state']))
      .dropna().reset_index())
class state id val
0 1 al a 10.0
1 2 ar b 13.0
2 23 wy c 16.0
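Another fairly cheap variant, as a sketch assuming the state-less rows in values act as per-class defaults (at most one such row per class): do the exact merge first, then fill the gaps from a class-to-val mapping:
import pandas as pd
sites = pd.DataFrame({'id': ['a', 'b', 'c'],
                      'class': [1, 2, 23],
                      'state': ['al', 'ar', 'wy']})
values = pd.DataFrame({'class': [1, 1, 2, 2, 23],
                       'state': ['al', 'ar', 'al', 'ar', None],
                       'val': [10, 11, 12, 13, 16]})
# exact merge on class and state
combined = sites.merge(values, how='left', on=['class', 'state'])
# class-only defaults taken from the rows whose state is missing
fallback = values[values['state'].isna()].set_index('class')['val']
# fill the unmatched rows from the fallback mapping
combined['val'] = combined['val'].fillna(combined['class'].map(fallback))
print(combined)
#   id  class state   val
# 0  a      1    al  10.0
# 1  b      2    ar  13.0
# 2  c     23    wy  16.0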
I want to filter rows by multi-column values.
For example, given the following dataframe,
import pandas as pd
df = pd.DataFrame({"name":["Amy", "Amy", "Amy", "Bob", "Bob",],
"group":[1, 1, 1, 1, 2],
"place":['a', 'a', "a", 'b', 'b'],
"y":[1, 2, 3, 1, 2]
})
print(df)
Original dataframe:
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
3 Bob 1 b 1
4 Bob 2 b 2
I want to select the samples whose [name, group, place] column combination appears in selectRow.
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]
Then the expected dataframe is :
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
I have tried it, but my method is not efficient and runs for a long time, especially when there are many samples in the original dataframe.
My Simple Method:
newdf = pd.DataFrame({})
for item in selectRow:
    print(item)
    tmp = df.loc[(df['name'] == item[0]) & (df['group'] == item[1]) & (df['place'] == item[2])]
    newdf = newdf.append(tmp)
newdf = newdf.reset_index(drop=True)
newdf.tail()
print(newdf)
I'm hoping for an efficient method to achieve this.
Try using isin:
print(df[df['name'].isin(list(zip(*selectRow))[0]) &
         df['group'].isin(list(zip(*selectRow))[1]) &
         df['place'].isin(list(zip(*selectRow))[2])])
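If the goal is to keep only the exact (name, group, place) combinations listed in selectRow (the per-column isin above can also pass rows that mix values from different entries), a vectorized alternative is an inner merge against a small lookup frame. A minimal sketch:
import pandas as pd
df = pd.DataFrame({"name": ["Amy", "Amy", "Amy", "Bob", "Bob"],
                   "group": [1, 1, 1, 1, 2],
                   "place": ['a', 'a', 'a', 'b', 'b'],
                   "y": [1, 2, 3, 1, 2]})
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]
# turn the wanted combinations into a dataframe and inner-merge on all three keys
keys = pd.DataFrame(selectRow, columns=['name', 'group', 'place'])
newdf = df.merge(keys, on=['name', 'group', 'place'])
print(newdf)
#   name  group place  y
# 0  Amy      1     a  1
# 1  Amy      1     a  2
# 2  Amy      1     a  3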