How can I merge one dataframe with another based on a lookup dataframe?
This is dataframe A, where I want to replace the values:
InfoType IncidentType DangerType
0 NaN A NaN
1 NaN C NaN
2 NaN B C
3 NaN B NaN
This is the lookup table:
ID ParamCode ParamValue ParmDesc1 ParamDesc2 SortOrder ParamStatus
0 1 IncidentType A ABC DEF 1 1
1 2 IncidentType B GHI JKL 2 1
2 3 IncidentType C MNO PQR 7 1
2 3 DangerType C STU VWX 6 1
The expected output:
InfoType IncidentType DangerType
0 NaN ABC NaN
1 NaN MNO NaN
2 NaN GHI STU
3 NaN GHI NaN
Note that ParamCode holds the column names, and I need to replace the values in the respective columns of dataframe A with ParmDesc1. Every column in dataframe A may contain NaN and I don't intend to remove them; just ignore them.
This is what I have done:
ntf_cols = ['InfoType','IncidentType','DangerType']
for c in ntf_cols:
    if (c in ntf.columns) & (c in param['ParamCode'].values):
        paramValue = param['ParamValue'].unique()
        for idx, pv in enumerate(paramValue):
            ntf['NewIncidentType'] = pd.np.where(ntf.IncidentType.str.contains(pv), param['ParmDesc1'].values, "whatever")
Error:
ValueError: operands could not be broadcast together with shapes (25,) (13,) ()
Use the lookup table to build a dict, and then replace the column values of the original dataframe. Assume the original dataframe is df1 and the lookup table is df2:
...
dict_map = dict(zip(df2.ParamCode + "-" + df2.ParamValue, df2.ParmDesc1))
df1['IncidentType'] = ("IncidentType" +'-'+ df1.IncidentType).replace(dict_map)
df1['DangerType'] = ("DangerType" +'-'+ df1.DangerType).replace(dict_map)
...
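Note that rows where the column is NaN are handled automatically: concatenating a string with NaN (e.g. "DangerType" + '-' + NaN) yields NaN, and replace leaves anything without a matching key untouched, so the NaN cells stay NaN as the question requires.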
EDIT: Lambda's answer gave me an idea for how you could do this for many columns that you want to apply this logical pattern to:
import pandas as pd

df1 = pd.DataFrame(dict(
    InfoType = [None, None, None, None],
    IncidentType = 'A C B B'.split(),
    DangerType = [None, None, 'C', None],
))
df2 = pd.DataFrame(dict(
    ParamCode = 'IncidentType IncidentType IncidentType DangerType'.split(),
    ParamValue = 'A B C C'.split(),
    ParmDesc1 = 'ABC GHI MNO STU'.split(),
))

for col in df1.columns[1:]:
    dict_map = dict(
        df2[df2.ParamCode == col][['ParamValue','ParmDesc1']].to_records(index=False)
    )
    df1[col] = df1[col].replace(dict_map)

print(df1)
This assumes every column after the first column in df1 is one that needs updating, and that the to-be-updated column names exist as values in the 'ParamCode' column of df2.
Python tutor link to code
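If that assumption does not hold, a more defensive sketch (not part of the original answer) only touches columns that actually appear in ParamCode:
for col in df1.columns.intersection(df2.ParamCode.unique()):
    sub = df2[df2.ParamCode == col]
    df1[col] = df1[col].replace(dict(zip(sub.ParamValue, sub.ParmDesc1)))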
This problem could be solved using some custom functions and pandas.Series.apply():
import pandas as pd

def find_incident_type(x):
    if pd.isna(x):
        return x
    return df2[
        (df2['ParamCode'] == 'IncidentType') & (df2['ParamValue'] == x)
    ]["ParmDesc1"].values[0]

def find_danger_type(x):
    if pd.isna(x):
        return x
    return df2[
        (df2['ParamCode'] == 'DangerType') & (df2['ParamValue'] == x)
    ]["ParmDesc1"].values[0]

df1 = pd.DataFrame(dict(
    InfoType = [None, None, None, None],
    IncidentType = 'A C B B'.split(),
    DangerType = [None, None, 'C', None],
))
df2 = pd.DataFrame(dict(
    ParamCode = 'IncidentType IncidentType IncidentType DangerType'.split(),
    ParamValue = 'A B C C'.split(),
    ParmDesc1 = 'ABC GHI MNO STU'.split(),
))

df1['IncidentType'] = df1['IncidentType'].apply(find_incident_type)
df1['DangerType'] = df1['DangerType'].apply(find_danger_type)
print(df1)
step through the code in python tutor
It is very possible there is a more efficient way to do this. Hopefully someone who knows it will share it.
Also, the reference to df2 from the outer scope is hard-coded into the custom functions, so they will only work with that variable name in the outer scope. Use an argument together with pandas.Series.apply's args parameter if you don't want these functions to depend on that reference.
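For example, a sketch that passes the lookup frame explicitly through apply's args (the find_desc name and its parameters are assumptions, not part of the original answer):
def find_desc(x, lookup, code):
    if pd.isna(x):
        return x
    match = lookup[(lookup['ParamCode'] == code) & (lookup['ParamValue'] == x)]
    return match['ParmDesc1'].values[0]

df1['IncidentType'] = df1['IncidentType'].apply(find_desc, args=(df2, 'IncidentType'))
df1['DangerType'] = df1['DangerType'].apply(find_desc, args=(df2, 'DangerType'))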
Related
Having the following dataframe:
name        aaa   bbb
Mick        None  None
Ivan        A     C
Ivan-Peter  1     None
Juli        1     P
I want to get two dataframes.
One with the rows where we have None in columns aaa and/or bbb, named filter_nulls in my code.
One where we do not have None at all, df_out in my code.
This is what I have tried and it does not produce the required dataframes.
import pandas as pd

df_out = {
    'name': ['Mick', 'Ivan', 'Ivan-Peter', 'Juli'],
    'aaa': [None, 'A', '1', '1'],
    'bbb': [None, 'C', None, 'P'],
}
print(df_out)

filter_nulls = df_out[df_out['aaa'].isnull()|(df_out['bbb'] is None)]
print(filter_nulls)

df_out = df_out.loc[filter_nulls].reset_index(level=0, drop=True)
print(df_out)
Use:
#DataFrame from sample data
df_out = pd.DataFrame(df_out)
#filter columns names by list and test if NaN or None at least in one row
m = df_out[['aaa','bbb']].isna().any(axis=1)
#OR test both columns separately
m = df_out['aaa'].isna() | df_out['bbb'].isna()
#filter matched and not matched rows
df1 = df_out[m].reset_index(drop=True)
df2 = df_out[~m].reset_index(drop=True)
print (df1)
name aaa bbb
0 Mick None None
1 Ivan-Peter 1 None
print (df2)
name aaa bbb
0 Ivan A C
1 Juli 1 P
Another idea with DataFrame.dropna and filtering the indices that do not exist in df2:
df2 = df_out.dropna()
df1 = df_out.loc[df_out.index.difference(df2.index)].reset_index(drop=True)
df2 = df2.reset_index(drop=True)
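This yields the same df1 and df2 as the mask approach above, since the name column has no missing values and dropna therefore only reacts to aaa and bbb.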
First of all, one needs to convert df_out to a dataframe with pandas.DataFrame as follows:
df_out = pd.DataFrame(df_out)
[Out]:
name aaa bbb
0 Mick None None
1 Ivan A C
2 Ivan-Peter 1 None
3 Juli 1 P
Then one can use, for both cases, pandas.Series.notnull.
With values, where we have None in columns aaa and/or bbb, named filter_nulls in my code
df1 = df_out[~df_out['aaa'].notnull() | ~df_out['bbb'].notnull()]
[Out]:
name aaa bbb
0 Mick None None
2 Ivan-Peter 1 None
Where we do not have None at all. df_out in my code.
df2 = df_out[df_out['aaa'].notnull() & df_out['bbb'].notnull()]
[Out]:
name aaa bbb
1 Ivan A C
3 Juli 1 P
Notes:
If needed, one can use pandas.DataFrame.reset_index to get the following:
df_new = df_out[~df_out['aaa'].notnull() | ~df_out['bbb'].notnull()].reset_index(drop=True)
[Out]:
name aaa bbb
0 Mick None None
1 Ivan-Peter 1 None
I have a dataframe as seen below:
col_a1, col_a2, col_b1, col_b2
abc lmn
def ghi qrs
zxv vbn
pej iop qaz
eki lod yhe wqe
I need two columns now, column A and Column B. Conditions summarized:
Column A = col_a2 if col_a2 is present else col_a1
Column B = col_b1 if col_b1 is present else col_b2
The required dataframe should be as follows:
Column A Column B
abc lmn
ghi qrs
zxv vbn
pej iop
lod yhe
Try:
df['A'] = df.apply(lambda x: x['col_a2'] if x['col_a2'] != '' else x['col_a1'], axis=1)
df['B'] = df.apply(lambda x: x['col_b1'] if x['col_b1'] != '' else x['col_b2'], axis=1)
print(df[['A', 'B']])
A B
0 abc lmn
1 ghi qrs
2 zxv vbn
3 pej iop
4 lod yhe
The !='' will work if you truly have nothing in the cell (as opposed to a NaN etc.). If you have actual NaN values use:
df['A'] = df.apply(lambda x: x['col_a2'] if pd.notna(x['col_a2']) else x['col_a1'], axis=1)
df['B'] = df.apply(lambda x: x['col_b1'] if pd.notna(x['col_b1']) else x['col_b2'], axis=1)
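If the missing cells are NaN, a vectorised alternative (a sketch, not part of the original answer) avoids the row-wise apply:
df['A'] = df['col_a2'].fillna(df['col_a1'])
df['B'] = df['col_b1'].fillna(df['col_b2'])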
I have the DF of this kind:
pd.DataFrame({'label': ['A','test1: A','test2: A','B','test1: B','test3: B'],
              'value': [1,2,3,4,5,6]})
label value
0 A 1
1 test1: A 2
2 test2: A 3
3 B 4
4 test1: B 5
5 test3: B 6
And I need to convert to this:
pd.DataFrame({'label': ['A','B'],
              'value': [1,4],
              'test1:': [2,5],
              'test2:': [3,None],
              'test3:': [None,6]})
label value test1: test2: test3:
0 A 1 2 3.0 NaN
1 B 4 5 NaN 6.0
I need to keep one row per unique label value, and the keys are merged to the right as columns when present in the data. The keys may vary and there can be different key names for one value.
Feel free to suggest how to rename the question, because I could not find a better way to name the problem.
EDIT:
This solution partly covers what I need, however there is no decent way to add the columns representing the keys in the label column. Ideally something like a function with a df input is needed.
Extract information into two data frames and merge them.
df2 = df[df['label'].str.contains('test')]
df3 = df2['label'].str.split(expand=True).rename(columns={0: "test", 1: "label"})
df3['value'] = df2['value']
df3 = df3.pivot_table(index='label', columns='test', values='value')
df2 = df[~df['label'].str.contains('test')]
df4 = pd.merge(df2, df3, on='label')
Output
label value test1: test2: test3:
0 A 1 2.0 3.0 NaN
1 B 4 5.0 NaN 6.0
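Since the EDIT asks for something like a function with a df input, a sketch wrapping the steps above (the widen_by_key name is hypothetical, and it filters on the ':' marker rather than the literal 'test' so other key names also work):
def widen_by_key(df):
    keyed = df[df['label'].str.contains(':')]
    parts = keyed['label'].str.split(expand=True).rename(columns={0: 'test', 1: 'label'})
    parts['value'] = keyed['value']
    wide = parts.pivot_table(index='label', columns='test', values='value')
    base = df[~df['label'].str.contains(':')]
    return pd.merge(base, wide, on='label')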
Here's a way to do that:
df.loc[~df.label.str.contains(":"), "label"] = df.loc[~df.label.str.contains(":"), "label"].str.replace(r"(^.*$)", r"value:\1")
labels = df.label.str.split(":", expand = True).rename(columns = {0: "label1", 1:"label2"})
df = pd.concat([df, labels], axis=1)
df = pd.pivot_table(df, index="label2", columns="label1", dropna=False)
df.columns = [c[1] for c in df.columns]
df.index.name = "label"
The output is:
test1 test2 test3 value
label
A 2.0 3.0 NaN 1.0
B 5.0 NaN 6.0 4.0
I have two DataFrames C and D as follows:
C
A B
0 AB 1
1 CD 2
2 EF 3
D
A B
1 CD 4
2 GH 5
I have to merge both dataframes, but the merge should overwrite the values with those from the right df. The rest of the rows from the left dataframe should not change.
Output
A B
0 AB 1
1 CD 4
2 EF 3
3 GH 5
The order of the rows of the df must not change, i.e. CD should remain at index 1. I tried using an outer merge, which handles the index but duplicates the columns instead of overwriting.
>>> pd.merge(c,d, how='outer', on='A')
A B_x B_y
0 AB 1.0 NaN
1 CD 2.0 4.0
2 EF 3.0 NaN
3 GH NaN 5.0
Basically B_y should have replaced the values in B_x (only where values occur).
I am using Python 3.7.
You will have to replace the rows to override the values in place. This is different from dropping duplicates, as that would change the ordering of the rows.
combine_dfs takes "pkey" as an argument, which is the main column on which the merge should happen.
def update_df_row(row=None, col_name="", df=pd.DataFrame(), pkey=""):
    try:
        match_index = df.loc[df[pkey] == col_name].index[0]
        row = df.loc[match_index]
    except IndexError:
        pass
    except Exception as ex:
        raise
    finally:
        return row

def combine_dfs(parent_df, child_df, pkey):
    filtered_child_df = child_df[child_df[pkey].isin(parent_df[pkey])]
    parent_df[parent_df[pkey].isin(child_df[pkey])] = parent_df[
        parent_df[pkey].isin(child_df[pkey])].apply(
            lambda row: update_df_row(row, row[pkey], filtered_child_df, pkey), axis=1)
    parent_df = pd.concat([parent_df, child_df]).drop_duplicates([pkey])
    return parent_df.reset_index(drop=True)
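A sketch of how it might be called for the frames in the question (the result name is an assumption):
C = pd.DataFrame({"A": ["AB", "CD", "EF"], "B": [1, 2, 3]})
D = pd.DataFrame({"A": ["CD", "GH"], "B": [4, 5]})
result = combine_dfs(C, D, pkey="A")
print(result)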
The output of the above code snippet will be:
A B
0 AB 1
1 CD 4
2 EF 3
3 GH 5
Use:
df = pd.merge(C,D, how='outer', on='A', suffixes=('_',''))
#filter columns names
new_cols = df.columns[df.columns.str.endswith('_')]
#remove last char from column names
orig_cols = new_cols.str[:-1]
#dictionary for rename
d = dict(zip(new_cols, orig_cols))
#filter columns and replace NaNs by new appended columns
df[orig_cols] = df[orig_cols].combine_first(df[new_cols].rename(columns=d))
#remove appended columns
df = df.drop(new_cols, axis=1)
print (df)
A B
0 AB 1.0
1 CD 4.0
2 EF 3.0
3 GH 5.0
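Note that B becomes float here because the outer merge introduces NaN before combine_first fills them in; if integers are required, df['B'] = df['B'].astype(int) will cast them back once no NaN remain.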
If it's acceptable to assume that column A is in alphabetical order:
C = pd.DataFrame({"A": ["AB", "CD", "EF"], "B": [1, 2, 3]})
D = pd.DataFrame({"A": ["CD", "GH"], "B": [4, 5]})
df_merge = pd.concat([C,D]).drop_duplicates('A', keep='last').sort_values(by=['A']).reset_index(drop=True)
df_merge
A B
0 AB 1
1 CD 4
2 EF 3
3 GH 5
Edit
This will do the job if the order in which each category appears in the original dataframes is most important:
C = pd.DataFrame({"A": ["AB", "CD", "EF"], "B": [1, 2, 3]})
D = pd.DataFrame({"A": ["CD", "GH"], "B": [4, 5]})
df_merge = pd.concat([C,D]).drop_duplicates('A', keep='last')
df_merge['A'] = pd.Categorical(df_merge['A'], C.A.append(D.A).drop_duplicates())
df_merge.sort_values(by=['A'], inplace=True)
df_merge.reset_index(drop=True, inplace=True)
df_merge
You can use update. In your case it would be:
C.update(D)
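DataFrame.update aligns on the index (and column labels) rather than on column A, and it does not add rows that exist only in D, so a sketch of one way to apply it here (an assumption, not spelled out in the answer) is to set A as the index first and then append the new keys:
C_idx = C.set_index('A')
D_idx = D.set_index('A')
C_idx.update(D_idx)  # overwrite B for the keys present in both (may upcast B to float)
result = pd.concat([C_idx, D_idx[~D_idx.index.isin(C_idx.index)]]).reset_index()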
My dataframes are like below
df1
id c1
1 abc
2 def
3 ghi
df2
id set1
1 [123,456]
2 [789]
When I join df1 and df2 with final_df = df1.merge(df2, how='left'), it gives me
final_df
id c1 set1
1 abc [123,456]
2 def [789]
3 ghi NaN
I'm using the below code to replace NaN with an empty list []:
for row in final_df.loc[final_df.set1.isnull(), 'set1'].index:
    final_df.at[row, 'set1'] = []
The issue is that if df2 is an empty dataframe, it gives:
ValueError: setting an array element with a sequence.
PS: I'm using pandas version 0.23.4.
Pandas is not designed to be used with series of lists. You lose all vectorised functionality and any manipulations on such series involve inefficient, Python-level loops.
One work-around is to define a series of empty lists:
res = df1.merge(df2, how='left')
empty = pd.Series([[] for _ in range(len(res.index))], index=res.index)
res['set1'] = res['set1'].fillna(empty)
print(res)
id c1 set1
0 1 abc [123, 456]
1 2 def [789]
2 3 ghi []
A better idea at this point, if viable, is to split your lists into separate series:
res = res.join(pd.DataFrame(res.pop('set1').values.tolist()))
print(res)
id c1 0 1
0 1 abc 123.0 456.0
1 2 def 789.0 NaN
2 3 ghi NaN NaN
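If named columns are preferred over 0 and 1, a variant of the join above (the set1_ prefix is just an assumption) is:
res = res.join(pd.DataFrame(res.pop('set1').values.tolist()).add_prefix('set1_'))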
This is not ideal but will get your work done:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,'abc'],[2,'def'],[3,'ghi']], columns=['id', 'c1'])
df2 = pd.DataFrame([[1,[123,456]],[2,[789]]], columns=['id', 'set1'])
df=pd.merge(df1,df2, how='left', on='id')
df['set1'].fillna(0, inplace=True)
df['set1']=df['set1'].apply( lambda x:pd.Series({'set1': [] if x == 0 else x}))
print(df)