I have three dataframes
df1 = pd.DataFrame({'src': ['src1', 'src2', 'src3'],
                    'dst': ['dst1', 'dst2', 'dst3']})
df2 = pd.DataFrame({'src': ['dst1', 'dst1', 'dst3'],
                    'dst': ['dstDst1', 'dstDst2', 'dstDst3']})
df3 = pd.DataFrame({'src': ['dstDst3', 'dstDst3'],
                    'dst': ['dstDstDst1', 'dstDstDst2']})
I want to merge the three dataframes using the following rule:
Keep the initial src field and merge in every dst that can be traced back, through a chain of src->dst relationships, to a src in df1. To be more concrete, the result of the merge is:
df4 = pd.DataFrame({'src': ['src1', 'src2', 'src3'],
                    'dst': ['dst1, dstDst1, dstDst2', 'dst2',
                            'dst3, dstDst3, dstDstDst1, dstDstDst2']})
NOTE: it is guaranteed that df2's src values are subsets of df1's dst values and df3's src values are subsets of df2's dst values.
I came up with this solution; I'd like to know if there is a more elegant or idiomatic way of doing this.
df_collection = {}
df_collection[1] = df1
df_collection[2] = df2
df_collection[3] = df3

def merge(df1, df2):
    '''
    for each target in df1, find source in df2
    '''
    df3 = df1.copy(deep=True)
    # forward merge will be n complexity, ...save for later
    for i in range(0, len(df1)):
        for j in range(0, len(df2)):
            if df1.iloc[i]['dst'] == df2.iloc[j]['src']:
                # use .loc for the assignment; chained iloc[i]['dst'] = ...
                # may silently fail on a copy
                df3.loc[df3.index[i], 'dst'] = df3.iloc[i]['dst'] + ', ' + df2.iloc[j]['dst']
    return df3

for i in range(3, 1, -1):
    df_collection[i-1] = merge(df_collection[i-1], df_collection[i])
You can avoid the nested for loops by the following code.
Code:
import pandas as pd
# Create the sample dataframes
df1 = pd.DataFrame({'src': ['src1', 'src2', 'src3'], 'dst': ['dst1', 'dst2', 'dst3']})
df2 = pd.DataFrame({'src': ['dst1', 'dst1', 'dst3'], 'dst': ['dstDst1', 'dstDst2', 'dstDst3']})
df3 = pd.DataFrame({'src': ['dstDst3', 'dstDst3'], 'dst': ['dstDstDst1', 'dstDstDst2']})
# Merge the dataframes from back to front
df = pd.DataFrame(columns=['src', 'dst'])
for i, _df in enumerate([df3, df2, df1]):
    df = _df.merge(df, how='left', left_on='dst', right_on='src', suffixes=('', f'_{i}'))
    df = df.groupby(['src', 'dst'], as_index=False)[f'dst_{i}'].agg(lambda s: [e for e in s if pd.notna(e)])
    df['dst'] = df.apply(lambda r: ', '.join([r['dst']] + r[f'dst_{i}']), axis=1)
    df = df[['src', 'dst']]
print(df)
Output:
    src                                     dst
0  src1                  dst1, dstDst1, dstDst2
1  src2                                    dst2
2  src3  dst3, dstDst3, dstDstDst1, dstDstDst2
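Another way to look at the problem: each frame is just a set of src->dst edges, so you can build a single adjacency map and walk it from every src in df1. This is a sketch (the helper name `reachable` is illustrative, not from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'src': ['src1', 'src2', 'src3'], 'dst': ['dst1', 'dst2', 'dst3']})
df2 = pd.DataFrame({'src': ['dst1', 'dst1', 'dst3'], 'dst': ['dstDst1', 'dstDst2', 'dstDst3']})
df3 = pd.DataFrame({'src': ['dstDst3', 'dstDst3'], 'dst': ['dstDstDst1', 'dstDstDst2']})

# Build one adjacency list from all edge frames at once
edges = pd.concat([df1, df2, df3])
adj = edges.groupby('src')['dst'].apply(list).to_dict()

def reachable(node):
    """Walk the graph, collecting every dst reachable from `node`."""
    out = []
    stack = [node]
    while stack:
        for nxt in adj.get(stack.pop(), []):
            out.append(nxt)
            stack.append(nxt)
    return out

df4 = pd.DataFrame({'src': df1['src'],
                    'dst': [', '.join(reachable(s)) for s in df1['src']]})
print(df4)
```

This sidesteps the repeated merges entirely and works for any number of levels, not just three.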
Related
I have a pandas DataFrame 'df' with x rows, and another pandas DataFrame 'df2' with y rows
(x < y). I want to return the indexes where the values of df['Field'] equal the values of df2['Field'], in order to add the respective 'Manager' to df.
the code I have is as follows:
data2 = [['field1', 'Paul G'] , ['field2', 'Mark R'], ['field3', 'Roy Jr']]
data = [['field1'] , ['field2']]
columns = ['Field']
columns2 = ['Field', 'Manager']
df = pd.DataFrame(data, columns=columns)
df2 = pd.DataFrame(data2, columns=columns2)
farmNames = df['Field']
exists = farmNames.reset_index(drop=True) == df2['Field'].reset_index(drop=True)
This returns the error message:
ValueError: Can only compare identically-labeled Series objects
Does anyone know how to fix this?
As @NickODell mentioned, you could use a merge, basically a left join. See the code below.
df_new = pd.merge(df, df2, on='Field', how='left')
print(df_new)
Output:
Field Manager
0 field1 Paul G
1 field2 Mark R
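If you specifically want the matching indexes (as the question asks), `Series.isin` is one option; `Series.map` then attaches the manager without a merge. A runnable sketch using the frames from the question:

```python
import pandas as pd

df = pd.DataFrame([['field1'], ['field2']], columns=['Field'])
df2 = pd.DataFrame([['field1', 'Paul G'], ['field2', 'Mark R'], ['field3', 'Roy Jr']],
                   columns=['Field', 'Manager'])

# Indexes of df2 rows whose Field appears in df
idx = df2.index[df2['Field'].isin(df['Field'])]

# A merge-free way to attach the manager: map through a Field -> Manager Series
df['Manager'] = df['Field'].map(df2.set_index('Field')['Manager'])
print(list(idx), df['Manager'].tolist())
```

Unlike the `==` comparison, `isin` does not require the two Series to be the same length, which is what triggered the ValueError.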
When I use an aggregate function, the resulting columns under 'price' and 'carat' both end up with the column name 'mean'.
How do I rename the mean under price to price_mean and the one under carat to carat_mean?
I can't change them individually.
diamonds.groupby('cut').agg({
    'price': ['count', 'mean'],
    'carat': 'mean'
}).rename(columns={'mean': 'price_mean', 'mean': 'carat_mean'}, level=1)
You could try this:
# df is the multi-indexed result of the groupby/agg above
# Rename columns of level 1
df1 = df["price"]
df1.columns = ["count", "price_mean"]
df2 = df["carat"]
df2.columns = ["carat_mean"]
# Reassemble the renamed pieces under the level 0 columns
df = pd.concat([df1, df2], axis=1, keys=['price', 'carat'])
print(df)
# Outputs
          price                 carat
          count price_mean carat_mean
Fair 0.693995 -0.632283 0.789963
Good 0.099057 1.005623 0.143289
Ideal -0.277984 -0.105138 -0.611168
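Since pandas 0.25 you can also sidestep the duplicate-'mean' problem entirely with named aggregation, where each output column gets its name up front. A sketch with a toy frame standing in for `diamonds`:

```python
import pandas as pd

# Toy data standing in for the diamonds dataset
diamonds = pd.DataFrame({'cut': ['Fair', 'Good', 'Fair', 'Good'],
                         'price': [100, 200, 300, 400],
                         'carat': [0.5, 0.7, 0.9, 1.1]})

# Each keyword becomes a flat output column: name=(source_column, aggfunc)
out = diamonds.groupby('cut').agg(
    price_count=('price', 'count'),
    price_mean=('price', 'mean'),
    carat_mean=('carat', 'mean'),
)
print(out)
```

The result has a flat, single-level column index, so no rename step is needed at all.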
I am trying to iterate over a multi-index with the following DataFrame.
Picture of My DataFrame
Essentially, I am trying to reduce the DataFrame to the top QB, top 2 RBs, top 3 WRs, and top TE for each NFL team, based on their values in the "FantasyPoints" column. I have been trying to figure this out for hours but can't come up with a solution. I tried using groupby with no luck, and figured I may have to iterate over the multi-index, but haven't figured that out either. Thanks in advance to anyone who can help me figure this out. Below is the code used to generate the DataFrame in its current state. Here is a link to the CSV file that is being used: https://drive.google.com/file/d/1hX1Jmjk4RBxsH8tt8g1tqwKqrjZkFZp_/view?usp=sharing
#import our CSV file
df = pd.read_csv('2019.csv')
#drop unnecessary columns
df.drop(['Rk', '2PM', '2PP', 'FantPt', 'DKPt', 'FDPt',
'VBD', 'PosRank', 'OvRank', 'PPR', 'Fmb',
'GS', 'Age', 'Tgt', 'Y/A', 'Att', 'Att.1', 'Cmp', 'Y/R'], axis=1, inplace=True)
#fix name formatting
df['Player'] = df['Player'].apply(lambda x: x.split('*')[0]).apply(lambda x: x.split('\\')[0])
#rename columns
df.rename({
'TD': 'PassingTD',
'TD.1': 'RushingTD',
'TD.2': 'ReceivingTD',
'TD.3': 'TotalTD',
'Yds': 'PassingYDs',
'Yds.1': 'RushingYDs',
'Yds.2': 'ReceivingYDs',
}, axis=1, inplace=True)
df['FantasyPoints'] = (df['PassingYDs']*0.04 + df['PassingTD']*4 - df['Int']*2 + df['RushingYDs']*.1
+ df['RushingTD']*6 + df['Rec']*1 + df['ReceivingYDs']*.1 + df['ReceivingTD']*6 - df['FL']*2)
df = df[['Tm', 'FantPos', 'FantasyPoints']]
df = df[df['Tm'] != '2TM']
df = df[df['Tm'] != '3TM']
df.set_index(['Tm', 'FantPos'], inplace=True)
df = df.sort_index()
df.head(30)
Why use a multi-index at all? You can easily set up a dictionary to iterate through and grab the top n rows for each position:
import pandas as pd
#import our CSV file
df = pd.read_csv('2019.csv')
#drop unnecessary columns
df.drop(['Rk', '2PM', '2PP', 'FantPt', 'DKPt', 'FDPt',
'VBD', 'PosRank', 'OvRank', 'PPR', 'Fmb',
'GS', 'Age', 'Tgt', 'Y/A', 'Att', 'Att.1', 'Cmp', 'Y/R'], axis=1, inplace=True)
#fix name formatting
df['Player'] = df['Player'].apply(lambda x: x.split('*')[0]).apply(lambda x: x.split('\\')[0])
#rename columns
df.rename({
'TD': 'PassingTD',
'TD.1': 'RushingTD',
'TD.2': 'ReceivingTD',
'TD.3': 'TotalTD',
'Yds': 'PassingYDs',
'Yds.1': 'RushingYDs',
'Yds.2': 'ReceivingYDs',
}, axis=1, inplace=True)
df['FantasyPoints'] = (df['PassingYDs']*0.04 + df['PassingTD']*4 - df['Int']*2 + df['RushingYDs']*.1
+ df['RushingTD']*6 + df['Rec']*1 + df['ReceivingYDs']*.1 + df['ReceivingTD']*6 - df['FL']*2)
df = df[['Tm', 'FantPos', 'FantasyPoints']]
df = df[df['Tm'] != '2TM']
df = df[df['Tm'] != '3TM']
dictionary = {'QB': 1, 'RB': 2, 'WR': 3, 'TE': 1}
# DataFrame.append was removed in pandas 2.0; collect the pieces and concat once
results_df = pd.concat(
    [df[df['FantPos'] == pos].nlargest(n, columns='FantasyPoints')
     for pos, n in dictionary.items()],
    sort=True).reset_index(drop=True)
Output:
print (results_df)
FantPos FantasyPoints Tm
0 QB 415.68 BAL
1 RB 469.20 CAR
2 RB 314.80 GNB
3 WR 374.60 NOR
4 WR 274.10 TAM
5 WR 274.10 ATL
6 TE 254.30 KAN
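Note the dictionary approach above picks the top n league-wide. If you really do want the top n per position *within each team* (as the question states), grouping on both keys works; a sketch with toy rows standing in for the cleaned 2019.csv frame:

```python
import pandas as pd

# Toy rows standing in for the cleaned 2019.csv frame
df = pd.DataFrame({'Tm': ['BAL', 'BAL', 'BAL', 'BAL', 'CAR'],
                   'FantPos': ['QB', 'RB', 'RB', 'RB', 'RB'],
                   'FantasyPoints': [415.7, 80.0, 120.0, 50.0, 469.2]})

limits = {'QB': 1, 'RB': 2, 'WR': 3, 'TE': 1}

# For each (team, position) group, keep only the position's allowed top n
top = pd.concat(
    grp.nlargest(limits[pos], 'FantasyPoints')
    for (tm, pos), grp in df.groupby(['Tm', 'FantPos'])
)
print(top)
```

Here BAL's third RB (50.0 points) is dropped, while each team keeps its own leaders.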
I have two dataframes both of which have the same basic schema. (4 date fields, a couple of string fields, and 4-5 float fields). Call them df1 and df2.
What I want to do is basically get a "diff" of the two - where I get back all rows that are not shared between the two dataframes (not in the set intersection). Note, the two dataframes need not be the same length.
I tried using pandas.merge(how='outer'), but I was not sure what column to pass in as the 'key', as there really isn't one, and the various combinations I tried were not working. It is possible that df1 or df2 has two (or more) rows that are identical.
What is a good way to do this in pandas/Python?
Try this:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
diff_df = diff_df.loc[diff_df['Exist'] != 'both']
You will have a dataframe of all rows that don't exist in both df1 and df2.
IIUC:
You can use pd.Index.symmetric_difference
pd.concat([df1, df2]).loc[
df1.index.symmetric_difference(df2.index)
]
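One caveat worth noting: this compares index labels, not row contents, so it only works when the indexes are meaningful. A minimal runnable sketch:

```python
import pandas as pd

# symmetric_difference operates on index labels, not on row values
df1 = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'a': [2, 3]}, index=[1, 2])

out = pd.concat([df1, df2]).loc[df1.index.symmetric_difference(df2.index)]
print(out)
```

Rows that share a label in both frames (here label 1) drop out, regardless of their values.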
You can use this function; the output is an ordered dict of 6 dataframes which you can write to Excel for further analysis.
'df1' and 'df2' refer to your input dataframes.
'uid' refers to the column or combination of columns that makes up the unique key (e.g. 'Fruits').
'dedupe' (default=True) drops duplicates in df1 and df2 (refer to Step 4 in the comments).
'labels' (default=('df1', 'df2')) lets you name the input dataframes. If a unique key exists in both dataframes but has
different values in one or more columns, it is usually important to know these rows; the function puts them one on top of the other and labels each row so we know which dataframe it belongs to.
'drop' can take a list of columns to be excluded from consideration when computing the difference.
Here goes:
df1 = pd.DataFrame([['apple', '1'], ['banana', 2], ['coconut',3]], columns=['Fruits','Quantity'])
df2 = pd.DataFrame([['apple', '1'], ['banana', 3], ['durian',4]], columns=['Fruits','Quantity'])
dict1 = diff_func(df1, df2, 'Fruits')
In [10]: dict1['df1_only']
Out[10]:
Fruits Quantity
1 coconut 3
In [11]: dict1['df2_only']
Out[11]:
Fruits Quantity
3 durian 4
In [12]: dict1['Diff']
Out[12]:
Fruits Quantity df1 or df2
0 banana 2 df1
1 banana 3 df2
In [13]: dict1['Merge']
Out[13]:
Fruits Quantity
0 apple 1
Here is the code:
import pandas as pd
from collections import OrderedDict as od
def diff_func(df1, df2, uid, dedupe=True, labels=('df1', 'df2'), drop=[]):
    dict_df = {labels[0]: df1, labels[1]: df2}
    col1 = df1.columns.values.tolist()
    col2 = df2.columns.values.tolist()
    # There could be columns known to be different, hence allow the user to pass these as a list to be dropped.
    if drop:
        print('Ignoring columns {} in comparison.'.format(', '.join(drop)))
        col1 = list(filter(lambda x: x not in drop, col1))
        col2 = list(filter(lambda x: x not in drop, col2))
        df1 = df1[col1]
        df2 = df2[col2]
    # Step 1 - Check if no. of columns are the same:
    len_lr = len(col1), len(col2)
    assert len_lr[0] == len_lr[1], \
        'Cannot compare frames with different number of columns: {}.'.format(len_lr)
    # Step 2a - Check if the set of column headers are the same
    # (order doesn't matter)
    assert set(col1) == set(col2), \
        'Left column headers are different from right column headers.' \
        + '\n Left orphans: {}'.format(list(set(col1) - set(col2))) \
        + '\n Right orphans: {}'.format(list(set(col2) - set(col1)))
    # Step 2b - Check if the column headers are in the same order
    if col1 != col2:
        print('[Note] Reordering right DataFrame...')
        df2 = df2[col1]
    # Step 3 - Check datatypes are the same [order is important]
    if set((df1.dtypes == df2.dtypes).tolist()) - {True}:
        print('dtypes are not the same.')
        df_dtypes = pd.DataFrame({labels[0]: df1.dtypes, labels[1]: df2.dtypes,
                                  'Diff': (df1.dtypes == df2.dtypes)})
        df_dtypes = df_dtypes[df_dtypes['Diff'] == False][[labels[0], labels[1], 'Diff']]
        print(df_dtypes)
    else:
        print('DataType check: Passed')
    # Step 4 - Check for duplicate rows
    if dedupe:
        for key, df in dict_df.items():
            if df.shape[0] != df.drop_duplicates().shape[0]:
                print(key + ': Duplicates exist, they will be dropped.')
                dict_df[key] = df.drop_duplicates()
    # Step 5 - Check for duplicate uids.
    if type(uid) == str or type(uid) == list:
        print('Uniqueness check: {}'.format(uid))
        for key, df in dict_df.items():
            count_uid = df.shape[0]
            count_uid_unique = df[uid].drop_duplicates().shape[0]
            var = [0, 1][count_uid_unique == df.shape[0]]  # <-- round off to the nearest integer if it is 100%
            pct = round(100 * count_uid_unique / df.shape[0], var)
            print('{}: {} out of {} are unique ({}%).'.format(key, count_uid_unique, count_uid, pct))
    # Checks complete, begin merge. (Remember to dedupe; labels tag the rows in 'Diff'.)
    dict_result = od()
    df_merge = pd.merge(df1, df2, on=col1, how='inner')
    if not df_merge.shape[0]:
        print('Error: Merged DataFrame is empty.')
    else:
        dict_result[labels[0]] = df1
        dict_result[labels[1]] = df2
        dict_result['Merge'] = df_merge
        if type(uid) == str:
            uid = [uid]
        if type(uid) == list:
            # DataFrame.append was removed in pandas 2.0; use pd.concat instead
            df1_only = pd.concat([df1, df_merge]).reset_index(drop=True)
            df1_only['Duplicated'] = df1_only.duplicated(keep=False)  # keep=False marks all duplicates as True
            df1_only = df1_only[df1_only['Duplicated'] == False]
            df2_only = pd.concat([df2, df_merge]).reset_index(drop=True)
            df2_only['Duplicated'] = df2_only.duplicated(keep=False)
            df2_only = df2_only[df2_only['Duplicated'] == False]
            label = labels[0] + ' or ' + labels[1]
            df_lc = df1_only.copy()
            df_lc[label] = labels[0]
            df_rc = df2_only.copy()
            df_rc[label] = labels[1]
            df_c = pd.concat([df_lc, df_rc]).reset_index(drop=True)
            df_c['Duplicated'] = df_c.duplicated(subset=uid, keep=False)
            df_c1 = df_c[df_c['Duplicated'] == True]
            df_c1 = df_c1.drop('Duplicated', axis=1)
            df_uc = df_c[df_c['Duplicated'] == False]
            df_uc_left = df_uc[df_uc[label] == labels[0]]
            df_uc_right = df_uc[df_uc[label] == labels[1]]
            dict_result[labels[0] + '_only'] = df_uc_left.drop(['Duplicated', label], axis=1)
            dict_result[labels[1] + '_only'] = df_uc_right.drop(['Duplicated', label], axis=1)
            dict_result['Diff'] = df_c1.sort_values(uid).reset_index(drop=True)
    return dict_result
Set df2.columns = df1.columns.
Now, set every column as the index: df1 = df1.set_index(df1.columns.tolist()), and similarly for df2.
You can now do df1.index.difference(df2.index) and df2.index.difference(df1.index); the two results are your distinct rows.
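A runnable sketch of the steps above with toy frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df2 = pd.DataFrame({'a': [2, 3], 'b': ['y', 'z']})
df2.columns = df1.columns

# Every column becomes part of the index, so each whole row becomes an index label
i1 = df1.set_index(df1.columns.tolist()).index
i2 = df2.set_index(df2.columns.tolist()).index
only_in_df1 = i1.difference(i2)
only_in_df2 = i2.difference(i1)
print(list(only_in_df1), list(only_in_df2))
```

Each result is a MultiIndex of tuples, one tuple per row that is unique to that frame.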
With
left_df.merge(df, left_on=left_df.columns.tolist(), right_on=df.columns.tolist(), how='outer')
you can get the outer join result.
Similarly, you can get the inner join result. Then take the difference between the two, which is what you want.
I am trying to add a suffix to the dataframes called on by a dictionary.
Here is a sample code below:
import pandas as pd
import numpy as np
from collections import OrderedDict
from itertools import chain
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
num_periods_3 = 5
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
dates3 = pd.date_range('1/1/2000 02:00:00', periods=num_periods_3, freq='10min')
# column_names = ['WS Avg','WS Max','WS Min','WS Dev','WD Avg']
# column_names = ['A','B','C','D','E']
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
column_names_3 = ['E', 'B', 'C']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = pd.DataFrame(np.random.randn(num_periods_3, len(column_names_3)), index=dates3, columns=column_names_3)
sep0 = '<~>'
suf1 = '_1'
suf2 = '_2'
suf3 = '_3'
ddict = {'df1': df1, 'df2': df2, 'df3': df3}
frames_to_concat = {'Sheets': ['df1', 'df3']}
Suffs = {'Suffixes': ['Suffix 1', 'Suffix 2', 'Suffix 3']}
Suff = {'Suffix 1': suf1, 'Suffix 2': suf2, 'Suffix 3': suf3}
## apply suffix to each data frame selected in order HERE
# Suffdict = [Suff[x] for x in Suffs['Suffixes']]
# print(Suffdict)
df4 = pd.concat([ddict[x] for x in frames_to_concat['Sheets']],
axis=1,
join='outer')
I want to add a suffix to each dataframe so that they can be distinguished when the dataframes are concatenated. I am having some trouble looking the suffixes up and applying them to each dataframe. I have asked for df1 and df3 to be concatenated, and I would like suffix 1 to be applied to df1 and suffix 2 to df3.
Order does not matter for the suffixes: if df2 and df3 were selected, suffix 1 would be applied to df2 and suffix 2 to df3; the last suffix would simply go unused.
Unless you have Python 3.6+, you cannot guarantee order in dictionaries. Even if you could rely on 3.6's behavior, that would tie your code to newer Python versions. If you need order, you should be looking at lists instead.
You can store your dataframes as well as your suffixes in a list, and then use zip to add a suffix to each df in turn.
dfs = [df1, df2, df3]
sufs = [suf1, suf2, suf3]
df_sufs = [x.add_suffix(y) for x, y in zip(dfs, sufs)]
Based on your code/answer, you can load your dataframes and suffixes into lists, call zip, add a suffix to each one, and call pd.concat.
dfs = [ddict[x] for x in frames_to_concat['Sheets']]
sufs = [Suff[x] for x in Suffs['Suffixes']]
df4 = pd.concat([x.add_suffix(sep0 + y)
for x, y in zip(dfs, sufs)], axis=1, join='outer')
Ended up just making a simple iterator for the problem. Here is my solution:
n = 0
for df in frames_to_concat['Sheets']:
    print(ddict[df])
    ddict[df] = ddict[df].add_suffix(sep0 + Suff[Suffs['Suffixes'][n]])
    n = n + 1
Anyone have a better way to do this?
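The manual counter can be folded into `zip`, pairing each selected sheet with its suffix key directly. A self-contained sketch with toy frames standing in for the ones built above:

```python
import pandas as pd

sep0 = '<~>'
# Toy stand-ins for the frames and lookup dicts from the question
ddict = {'df1': pd.DataFrame({'A': [1]}), 'df3': pd.DataFrame({'A': [2]})}
frames_to_concat = {'Sheets': ['df1', 'df3']}
Suffs = {'Suffixes': ['Suffix 1', 'Suffix 2']}
Suff = {'Suffix 1': '_1', 'Suffix 2': '_2'}

# zip pairs the nth sheet with the nth suffix key, no counter needed
for name, suf_key in zip(frames_to_concat['Sheets'], Suffs['Suffixes']):
    ddict[name] = ddict[name].add_suffix(sep0 + Suff[suf_key])

print(ddict['df1'].columns.tolist())
print(ddict['df3'].columns.tolist())
```

`zip` also handles the "extra suffix" case for free: it stops at the shorter of the two lists, so an unused suffix is simply ignored.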