I have four dataframes (df1, df2, df3, df4).
Sometimes df1 is empty, sometimes df2 is empty, and likewise for df3 and df4.
How can I do an outer merge so that any empty dataframe is automatically ignored? I am using the code below to merge at the moment:
df = (f1.result()
      .merge(f2.result(), how='left', left_on='time', right_on='time')
      .merge(f3.result(), how='left', left_on='time', right_on='time')
      .merge(f4.result(), how='left', left_on='time', right_on='time'))
and
df = reduce(lambda x, y: pd.merge(x, y, on='time', how='outer'),
            [f1.result(), f2.result(), f3.result(), f4.result()])
You can use the df.empty attribute (or len(df) > 0) to check whether a dataframe is empty.
Try this:
import pandas as pd
from functools import reduce

dfs = [df1, df2, df3, df4]
non_empty_dfs = [df for df in dfs if not df.empty]
df_final = reduce(lambda left, right: pd.merge(left, right, on='time', how='outer'), non_empty_dfs)
Or you could filter out the empty dataframes with:
non_empty_dfs = [df for df in dfs if len(df) > 0]
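A quick end-to-end example (made-up data; df2 plays the empty frame):

import pandas as pd
from functools import reduce

df1 = pd.DataFrame({'time': [1, 2], 'a': [10, 20]})
df2 = pd.DataFrame()  # empty: filtered out below
df3 = pd.DataFrame({'time': [2, 3], 'b': [30, 40]})
df4 = pd.DataFrame({'time': [1, 3], 'c': [50, 60]})

non_empty = [df for df in (df1, df2, df3, df4) if not df.empty]
df_final = reduce(lambda left, right: pd.merge(left, right, on='time', how='outer'), non_empty)
# df_final covers times 1, 2, 3, with NaN where a frame had no matching row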
Use pandas' DataFrame.empty attribute to filter out the empty dataframes; then you can concatenate or run whatever merge operation you have in mind:
import pandas as pd

df4 = pd.DataFrame({'A': []})  # empty dataframe
df1 = pd.DataFrame({'B': [2]})
df2 = pd.DataFrame({'C': [3]})
df3 = pd.DataFrame({'D': [4]})
dfs = [df1, df2, df3, df4]

# concat; you can run other operations too, now that the empty dataframe is gone
pd.concat([df for df in dfs if not df.empty], axis=1)
B C D
0 2 3 4
I have a list of dataframes:
all_df = [df1, df2, df3]
I would like to remove rows with duplicated indices in all dataframes in the list, such that the changes are reflected in the original dataframes df1, df2 and df3.
I tried to do
for df in all_df:
    df = df[~df.index.duplicated()]
But this only rebinds the loop variable, so neither the list nor the original dataframes change.
Essentially, I want to avoid doing the following:
df1 = df1[~df1.index.duplicated()]
df2 = df2[~df2.index.duplicated()]
df3 = df3[~df3.index.duplicated()]
all_df = [df1,df2,df3]
You need to recreate the list of DataFrames:
all_df = [df[~df.index.duplicated()] for df in all_df]
Or:
for i, df in enumerate(all_df):
    all_df[i] = df[~df.index.duplicated()]
print(all_df[0])
EDIT: If keeping the names matters, use a dictionary of DataFrames. As before, df1 and df2 are not modified in place; select the cleaned frames by their dict keys:
d = {'price': df1, 'volumes': df2}
d = {k: df[~df.index.duplicated()] for k, df in d.items()}
print(d['price'])
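A quick check of the deduplication itself, with a made-up frame:

import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'a', 'b'])
print(df1[~df1.index.duplicated()])  # keeps the first 'a' row and the 'b' row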
I have a list of dataframes and the respective column names to join on, for example:
dfs = [df1, df2, df3, df4]
col_join = ["col1", "col2", "col3"]
I have seen answers using the reduce function in Python.
import pandas as pd
from functools import reduce
reduce(lambda x, y: pd.merge(x, y, on=["Col1"], how="outer"), dfs)
What I want to achieve is the following:
df1 Columns:
Data1 Dim1 Dim2 Dim3
df2 Columns:
Example1 Example2 Example3
df3 Columns:
Other1 Other2 Other3
df1 to df2 is joined by Dim1 to Example1.
df1 to df3 is joined by Dim2 to Other1.
The list goes on to df(n), where n can be as many as 20 dataframes, each joined to df1 on different column names.
My idea is to pass a function the original df1 plus the remaining df2, df3, df4 ... dfn.
The other argument is the list of merge columns; for the example above it would be left_on=["Dim1"], right_on=["Example1"].
Next would be joining df1 (already including df2 in the join) to df3 on Dim2 and Other1.
Each dataframe will be joined to df1 on a different column, which may or may not share df1's column name, which is why left and right keys should be passed as arguments.
How can I incorporate the fact that the merge columns change at each join inside the reduce function?
Thank you in advance.
This might work (untested); since the key names can differ between the left and right frames, pass (left, right) column pairs rather than a single name per join:
result = df1
for df, (left_col, right_col) in zip(dfs[1:], col_pairs):
    result = pd.merge(result, df, left_on=left_col, right_on=right_col, how='outer')
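For the example in the question, the pairs would be:

col_pairs = [('Dim1', 'Example1'), ('Dim2', 'Other1')]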
I have 2 dataframes, df1 and df2, and I would like to merge them on the link column in df2. The link column holds comma-separated column=value conditions that must match rows in df1:
df1 = pd.DataFrame({'p':[1,2,3,4], 'a':[1,2,2,2],'b':['z','z','z','z'],'c':[3,3,4,4],'d':[5,5,5,6]})
df2 = pd.DataFrame({'e':[11,22,33,44], 'link':['a=1,c=3','a=2,c=3','a=2,c=4,d=5','a=2,c=4']})
The result should end with dataframe like this where column e from df2 are merge together with df1:
df_res = pd.DataFrame({'p':[1,2,3,3,4], 'a':[1,2,2,2,2],'b':['z','z','z','z','z'],'c':[3,3,4,4,4],'d':[5,5,5,5,6],'e':[11,22,33,44,44]})
How can this be done in pandas?
df1["e"] = df2["e"]
frames = [df1, df2]
result = pd.concat(frames)
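The concat above only lines the two frames up by position, which will not reproduce df_res in general. Here is a sketch that parses each link string into column/value conditions and filters df1 (reusing the question's df1 and df2, and assuming the linked columns all hold integers):

import pandas as pd

matched_parts = []
for _, row in df2.iterrows():
    # build a boolean mask from conditions like 'a=2,c=4,d=5'
    mask = pd.Series(True, index=df1.index)
    for cond in row['link'].split(','):
        col, val = cond.split('=')
        mask &= df1[col] == int(val)
    part = df1[mask].copy()
    part['e'] = row['e']
    matched_parts.append(part)

df_res = pd.concat(matched_parts, ignore_index=True)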
I have 10,000 data points that I'm sorting into a dictionary and then exporting to a csv using pandas. I'm sorting temperatures, pressures and flow associated with a key, but when doing this I get: https://imgur.com/a/aNX7RHf
What I want is something like this: https://imgur.com/a/ZxJgPv4
I'm transposing my dataframe so the index can be rows, but I want only 3 rows (1, 2 and 3), with all the data populating those rows.
flow_dictionary = {'200:P1F1':[5.5, 5.5, 5.5]}
pres_dictionary = {'200:PT02':[200,200,200],
'200:PT03':[200,200,200],
'200:PT06':[66,66,66],
'200:PT07':[66,66,66]}
temp_dictionary = {'200:TE02':[27,27,27],
'200:TE03':[79,79,79],
'200:TE06':[113,113,113],
'200:TE07':[32,32,32]}
df = pd.DataFrame.from_dict(temp_dictionary, orient='index').T
df2 = pd.DataFrame.from_dict(pres_dictionary, orient='index').T
df3 = pd.DataFrame.from_dict(flow_dictionary, orient='index').T
df = df.append(df2, ignore_index=False, sort=True)
df = df.append(df3, ignore_index=False, sort=True)
df.to_csv('processedSegmentedData.csv')
SOLUTION: concatenating along axis=1 aligns the three transposed frames on their shared row index (0, 1, 2), giving three rows with all columns populated:
df1 = pd.DataFrame.from_dict(temp_dictionary, orient='index').T
df2 = pd.DataFrame.from_dict(pres_dictionary, orient='index').T
df3 = pd.DataFrame.from_dict(flow_dictionary, orient='index').T
df4 = pd.concat([df1,df2,df3], axis=1)
df4.to_csv('processedSegmentedData.csv')
I have several dataframes df1, df2, ... with duplicate data and partly overlapping columns and rows (see below).
How can I lump all the dataframes into one dataframe?
df1 = pd.DataFrame({'A': [1,2], 'B': [4,5]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [5,6], 'C': [8,9]}, index=['b', 'c'])
df3 = pd.DataFrame({'A': [2,3], 'B': [5,6]}, index=['b', 'c'])
df4 = pd.DataFrame({'C': [7,8]}, index=['a', 'b'])
df5 = pd.DataFrame({'A': [1], 'B': [4], 'C': [7]}, index=['a'])
....
added: example data structure
A B C
a 1 4 7
b 2 5 8
c 3 6 9
added: what I am really looking for is a more efficient version of the following script, which is really slow for big dataframes
dfs = [df1, df2, df3, df4, df5]
cols, rows = [], []
for df in dfs:
    cols = cols + df.columns.tolist()
    rows = rows + df.index.tolist()
cols = np.unique(cols)
rows = np.unique(rows)
merged_dfs = pd.DataFrame(data=np.nan, columns=cols, index=rows)
for df in dfs:
    for col in df.columns:
        for row in df.index:
            merged_dfs[col][row] = df[col][row]
fast and easy solution (added 23 Dec 2015)
dfs = [df1, df2, df3, df4, df5]

# create empty DataFrame with all cols and rows
cols, rows = [], []
for df_i in dfs:
    cols = cols + df_i.columns.tolist()
    rows = rows + df_i.index.tolist()
cols = np.unique(cols)
rows = np.unique(rows)
df = pd.DataFrame(data=np.nan, columns=cols, index=rows)

# fill DataFrame
for df_i in dfs:
    df.loc[df_i.index, df_i.columns] = df_i.values
With index preservation
This is an updated version that preserves the index:
from functools import reduce
dfs = [df1, df2, df3, df3, df5]
def my_merge(df1, df2):
    res = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
    # overlapping columns come back with _x/_y suffixes; collapse each pair
    cols = sorted(res.columns)
    pairs = []
    for col1, col2 in zip(cols[:-1], cols[1:]):
        if col1.endswith('_x') and col2.endswith('_y'):
            pairs.append((col1, col2))
    for col1, col2 in pairs:
        res[col1[:-2]] = res[col1].combine_first(res[col2])
        res = res.drop([col1, col2], axis=1)
    return res
print(reduce(my_merge, dfs))
Output:
A B C
a 1 4 7
b 2 5 8
c 3 6 9
Without index preservation
This would be one way:
from functools import reduce # Python 3 only
dfs = [df1, df2, df3, df3, df5]
def my_merge(df1, df2):
    return pd.merge(df1, df2, how='outer')
merged_dfs = reduce(my_merge, dfs)
Results in:
A B C
0 1 4 NaN
1 2 5 8
2 NaN 6 9
3 3 6 NaN
4 1 4 7
You can adapt the join behaviour by setting how:
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
left: use only keys from left frame (SQL: left outer join)
right: use only keys from right frame (SQL: right outer join)
outer: use union of keys from both frames (SQL: full outer join)
inner: use intersection of keys from both frames (SQL: inner join)
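For instance, with two small made-up frames:

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})
pd.merge(left, right, on='key', how='inner')  # only key 'b' survives
pd.merge(left, right, on='key', how='outer')  # keys 'a', 'b', 'c', with NaN gaps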
If you like lambda, use this version for the same result:
reduce(lambda df1, df2: pd.merge(df1, df2, how='outer'), dfs)
Same idea as the other answer, but slightly different function:
def multiple_merge(lst_dfs, on):
    reduce_func = lambda left, right: pd.merge(left, right, on=on)
    return reduce(reduce_func, lst_dfs)
Here, lst_dfs is a list of dataframes.
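Note that pd.merge defaults to an inner join; pass how='outer' inside reduce_func if you want to keep unmatched rows. Usage, assuming the frames share a key column named time (hypothetical name):

merged = multiple_merge([df1, df2, df3, df4], on='time')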