pandas: merge several dataframes [duplicate] - python

This question already has answers here: Pandas Merging 101 (8 answers). Closed 4 years ago.
I have several dataframes df1, df2, ... with duplicate data and partly overlapping columns and rows (see below).
How can I combine all of the dataframes into one dataframe?
import pandas as pd

df1 = pd.DataFrame({'A': [1,2], 'B': [4,5]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [5,6], 'C': [8,9]}, index=['b', 'c'])
df3 = pd.DataFrame({'A': [2,3], 'B': [5,6]}, index=['b', 'c'])
df4 = pd.DataFrame({'C': [7,8]}, index=['a', 'b'])
df5 = pd.DataFrame({'A': [1], 'B': [4], 'C': [7]}, index=['a'])
....
Added: the combined data structure should look like this:
A B C
a 1 4 7
b 2 5 8
c 3 6 9
Added: what I am really looking for is a more efficient version of the following script, which is really slow for big dataframes:
import numpy as np

dfs = [df1, df2, df3, df4, df5]
cols, rows = [], []
for df in dfs:
    cols = cols + df.columns.tolist()
    rows = rows + df.index.tolist()
cols = np.unique(cols)
rows = np.unique(rows)
merged_dfs = pd.DataFrame(data=np.nan, columns=cols, index=rows)
for df in dfs:
    for col in df.columns:
        for row in df.index:
            # cell-by-cell chained assignment is what makes this so slow
            merged_dfs[col][row] = df[col][row]
Fast and easy solution (added 23 Dec 2015):
dfs = [df1, df2, df3, df4, df5]
# create an empty DataFrame with all cols and rows
cols, rows = [], []
for df_i in dfs:
    cols = cols + df_i.columns.tolist()
    rows = rows + df_i.index.tolist()
cols = np.unique(cols)
rows = np.unique(rows)
df = pd.DataFrame(data=np.nan, columns=cols, index=rows)
# fill the DataFrame block-wise instead of cell-by-cell
for df_i in dfs:
    df.loc[df_i.index, df_i.columns] = df_i.values
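For comparison, the same result can be reached with a concat/groupby one-liner. This is a common idiom, not from the original post, and it assumes, like the script above, that overlapping cells agree across frames:

# stack all frames (union of columns), then collapse duplicate index labels,
# keeping the first non-null value seen for each cell
merged = pd.concat(dfs).groupby(level=0).first()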

With index preservation
This is an updated version that preserves the index:
from functools import reduce

dfs = [df1, df2, df3, df3, df5]

def my_merge(df1, df2):
    res = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
    cols = sorted(res.columns)
    pairs = []
    # collect the _x/_y column pairs produced by overlapping column names
    for col1, col2 in zip(cols[:-1], cols[1:]):
        if col1.endswith('_x') and col2.endswith('_y'):
            pairs.append((col1, col2))
    for col1, col2 in pairs:
        res[col1[:-2]] = res[col1].combine_first(res[col2])
        res = res.drop([col1, col2], axis=1)
    return res

print(reduce(my_merge, dfs))
Output:
A B C
a 1 4 7
b 2 5 8
c 3 6 9
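A shorter route to the same index-preserving result (an alternative idiom, not part of this answer's original code) is a reduce over combine_first, which already aligns on both index and columns and fills missing cells from the right frame:

from functools import reduce

# assumes overlapping cells agree; left values win where both are non-null
merged = reduce(lambda left, right: left.combine_first(right), dfs)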
Without index preservation
This would be one way:
from functools import reduce  # Python 3 only

dfs = [df1, df2, df3, df3, df5]

def my_merge(df1, df2):
    return pd.merge(df1, df2, how='outer')

merged_dfs = reduce(my_merge, dfs)
Results in:
A B C
0 1 4 NaN
1 2 5 8
2 NaN 6 9
3 3 6 NaN
4 1 4 7
You can adapt the join method by setting how:
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
left: use only keys from left frame (SQL: left outer join)
right: use only keys from right frame (SQL: right outer join)
outer: use union of keys from both frames (SQL: full outer join)
inner: use intersection of keys from both frames (SQL: inner join)
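For instance, with the question's df1 and df2 (which share only column B), a quick sketch of how the how parameter changes the row set:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [4, 5]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [5, 6], 'C': [8, 9]}, index=['b', 'c'])

print(pd.merge(df1, df2, how='inner'))  # keeps only B=5 -> 1 row
print(pd.merge(df1, df2, how='outer'))  # keeps B=4, 5 and 6 -> 3 rows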
If you like lambda, use this version for the same result:
reduce(lambda df1, df2: pd.merge(df1, df2, how='outer'), dfs)

Same idea as the other answer, but with a slightly different function:
from functools import reduce

def multiple_merge(lst_dfs, on):
    reduce_func = lambda left, right: pd.merge(left, right, on=on)
    return reduce(reduce_func, lst_dfs)
Here, lst_dfs is a list of dataframes
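A quick usage sketch (the key column 'B' here is just an illustration; pass whichever column your frames share):

# inner-joins df1, df2 and df3 on their shared column 'B';
# overlapping non-key columns get _x/_y suffixes automatically
merged = multiple_merge([df1, df2, df3], on='B')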

Related

Joining multiple dataframes with multiple common columns

I have multiple dataframes like this:
df=pd.DataFrame({'a':[1,2,3],'b':[3,4,5],'c':[4,6,7]})
df2=pd.DataFrame({'a':[1,2,3],'d':[66,24,55],'c':[4,6,7]})
df3=pd.DataFrame({'a':[1,2,3],'f':[31,74,95],'c':[4,6,7]})
I want this output:
a c
0 1 4
1 2 6
2 3 7
These are the columns common to all 3 datasets. I am looking for a solution that works for multiple columns without having to specify the common columns explicitly, unlike the solutions I have seen on SO (the actual dataframes are huge).
If you need to filter column names whose content is the same in each DataFrame, you can convert each column to a tuple and compare:
dfs = [df, df2, df3]
df1 = pd.concat([x.apply(tuple) for x in dfs], axis=1)
cols = df1.index[df1.eq(df1.iloc[:, 0], axis=0).all(axis=1)]
df2 = df[cols]
print (df2)
a c
0 1 4
1 2 6
2 3 7
If the column names can differ and only the content needs to be compared:
df=pd.DataFrame({'a':[1,2,3],'b':[3,4,5],'c':[4,6,7]})
df2=pd.DataFrame({'r':[1,2,3],'t':[66,24,55],'l':[4,6,7]})
df3=pd.DataFrame({'f':[1,2,3],'g':[31,74,95],'m':[4,6,7]})
dfs = [df, df2, df3]
p = [x.apply(tuple).tolist() for x in dfs]
a = set(p[0]).intersection(*p)
print (a)
{(4, 6, 7), (1, 2, 3)}
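If you then need those shared contents back as a DataFrame (the original column names are lost once the columns are reduced to tuples), a minimal sketch:

# one column per common content tuple; sorted() just makes the order deterministic
common = pd.DataFrame(sorted(a)).T
print(common)
   0  1
0  1  4
1  2  6
2  3  7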
You can use reduce to apply the function r_common cumulatively to the dataframes of dfs, from left to right, reducing the list to a single dataframe df_common. Inside r_common, the intersection method finds the columns common to the two dataframes d1 and d2:
from functools import reduce

def r_common(d1, d2):
    cols = d1.columns.intersection(d2.columns).tolist()
    m = d1[cols].eq(d2[cols]).all()
    return d1[m[m].index]

df_common = reduce(r_common, dfs)  # dfs = [df, df2, df3]
Result:
# print(df_common)
a c
0 1 4
1 2 6
2 3 7
A combination of reduce, intersection, filter and concat could help with your use case:
dfs = (df,df2,df3)
cols = [ent.columns for ent in dfs]
cols
[Index(['a', 'b', 'c'], dtype='object'),
Index(['a', 'd', 'c'], dtype='object'),
Index(['a', 'f', 'c'], dtype='object')]
# find the columns common to all frames:
from functools import reduce
universal_cols = reduce(lambda x,y : x.intersection(y), cols).tolist()
universal_cols
['a', 'c']
#filter for only universal_cols for each df
updates = [ent.filter(universal_cols) for ent in dfs]
If the columns and contents of the columns are the same, then you can skip the list comprehension and just filter from only one dataframe:
#let's use the first dataframe
output = df.filter(universal_cols)
If the columns' contents are different, then concatenate and drop duplicates:
#concatenate and drop duplicates
res = pd.concat(updates).drop_duplicates()
res #output has the same result
a c
0 1 4
1 2 6
2 3 7

Ignore empty dataframe when merging

I have four df (df1,df2,df3,df4)
Sometimes df1 is null, sometimes df2 is null, sometimes df3 and df4 accordingly.
How can I do an outer merge so that the df which is empty is automatically ignored? I am using the below code to merge as of now:
df = f1.result().merge(f2.result(), how='left', left_on='time', right_on='time').merge(f3.result(), how='left', left_on='time', right_on='time').merge(f4.result(), how='left', left_on='time', right_on='time')
and
df = reduce(lambda x,y: pd.merge(x,y, on='time', how='outer'), [f1.result(),f2.result(),f3.result(),f4.result()])
You can use df.empty attribute or len(df) > 0 to check whether the dataframe is empty or not.
Try this:
from functools import reduce

dfs = [df1, df2, df3, df4]
non_empty_dfs = [df for df in dfs if not df.empty]
df_final = reduce(lambda left, right: pd.merge(left, right, on='time', how='outer'), non_empty_dfs)
Or, you could filter out the empty dataframes with:
non_empty_dfs = [df for df in dfs if len(df) > 0]
Use pandas' DataFrame.empty attribute to filter out the empty dataframes; then you can concatenate or run whatever merge operation you have in mind:
df4 = pd.DataFrame({'A':[]}) #empty dataframe
df1 = pd.DataFrame({'B':[2]})
df2 = pd.DataFrame({'C':[3]})
df3 = pd.DataFrame({'D':[4]})
dfs = [df1,df2,df3,df4]
# concat
# you can run other operations now that the empty dataframe is gone
pd.concat([df for df in dfs if not df.empty], axis=1)
B C D
0 2 3 4
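One edge case neither snippet covers: if every frame is empty, reduce over an empty list raises a TypeError. A guarded sketch (merge_non_empty is a hypothetical helper; the 'time' key comes from the question):

from functools import reduce
import pandas as pd

def merge_non_empty(dfs, on='time'):
    non_empty = [df for df in dfs if not df.empty]
    if not non_empty:
        # nothing to merge; return an empty frame instead of crashing
        return pd.DataFrame()
    return reduce(lambda l, r: pd.merge(l, r, on=on, how='outer'), non_empty)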

nunique into new dataframe

I have some data in pandas:
df1
df1['ID_A'].nunique()
5
df2
df2['ID_B'].nunique()
6
df3
df3['ID_A'].nunique()
2
df4
df4['ID_B'].nunique()
9
and so-on until 200 df.
How can I make a new dataframe based on these nunique values?
My expected result looks like this:
combine ID_A ID_B
combine_1 5 6
combine_2 2 9
thank you
Use a list comprehension over the list of DataFrames, and if necessary rename the index with a list comprehension and f-strings:
df1 = pd.DataFrame({'ID_A':[1,2,3,4,5,5],
                    'ID_B':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'ID_A':[1,2,1,2,1,1,1,2,1],
                    'ID_B':[1,2,3,4,5,6,7,8,9]})
dfs = [df1, df2]
df = pd.DataFrame([x.nunique() for x in dfs])
df.index = [f'combine_{x+1}' for x in df.index]
df.index.name= 'combine'
print (df)
ID_A ID_B
combine
combine_1 5 6
combine_2 2 9
If necessary, filter only the columns given in a list:
cols = ['ID_A', 'ID_B']
dfs = [df1, df2]
df = pd.DataFrame([x[cols].nunique() for x in dfs])
#filter only columns starting by ID_
#df = pd.DataFrame([x.filter(regex='^ID_').nunique() for x in dfs])
df.index = [f'combine_{x+1}' for x in df.index]
df.index.name= 'combine'
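The same table can also be built in one step with a dict comprehension (an alternative sketch, not part of the answer above):

# nunique() returns one Series per frame, keyed by column name;
# building a dict of those Series and transposing gives one row per frame
df = pd.DataFrame({f'combine_{i+1}': d.nunique() for i, d in enumerate(dfs)}).T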

Pandas merge on part of two columns

I have two dataframes with a common column called 'upc' as such:
df1:
upc
23456793749
78907809834
35894796324
67382808404
93743008374
df2:
upc
4567937
9078098
8947963
3828084
7430083
Notice that the df2 'upc' values are the innermost 7 digits of the df1 'upc' values.
Note that both df1 and df2 have other columns not shown above.
What I want to do is do an inner merge on 'upc' but only on the innermost 7 values. How can I achieve this?
1) Create both dataframes and convert the 'upc' columns to string type.
2) pd.merge the two frames, using the left_on keyword to take the inner 7 characters of your 'upc' series:
df1 = pd.DataFrame(data=[
    23456793749,
    78907809834,
    35894796324,
    67382808404,
    93743008374,
], columns=['upc1'])
df1 = df1.astype(str)

df2 = pd.DataFrame(data=[
    4567937,
    9078098,
    8947963,
    3828084,
    7430083,
], columns=['upc2'])
df2 = df2.astype(str)

pd.merge(df1, df2, left_on=df1['upc1'].astype(str).str[2:-2], right_on='upc2', how='inner')
Out[5]:
upc1 upc2
0 23456793749 4567937
1 78907809834 9078098
2 35894796324 8947963
3 67382808404 3828084
4 93743008374 7430083
Using str.extract, match all items in df1 against df2, then use the result as the merge key to merge with df2:
df1['keyfordf2']=df1.astype(str).upc.str.extract(r'({})'.format('|'.join(df2.upc.astype(str).tolist())),expand=True).fillna(False)
df1.merge(df2.astype(str),left_on='keyfordf2',right_on='upc')
Out[273]:
upc_x keyfordf2 upc_y
0 23456793749 4567937 4567937
1 78907809834 9078098 9078098
2 35894796324 8947963 8947963
3 67382808404 3828084 3828084
4 93743008374 7430083 7430083
You could make a new column in df1 and merge on that.
import pandas as pd
df1= pd.DataFrame({'upc': [ 23456793749, 78907809834, 35894796324, 67382808404, 93743008374]})
df2= pd.DataFrame({'upc': [ 4567937, 9078098, 8947963, 3828084, 7430083]})
df1['upc_old'] = df1['upc'] #in case you still need the old (longer) upc column
df1['upc'] = df1['upc'].astype(str).str[2:-2].astype(int)
merged_df = pd.merge(df1, df2, on='upc')
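All three answers assume the inner 7 digits identify each row uniquely; if two long upcs shared the same middle digits, an inner merge would fan out into duplicate rows. A quick sanity check (a sketch; run it before the 'upc' column is overwritten as in the last snippet):

# derive the 7-digit keys and make sure they don't collide
inner7 = df1['upc'].astype(str).str[2:-2]
assert inner7.is_unique, "inner 7-digit keys collide; the merge would duplicate rows"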

Pandas concatenate alternating columns

I have two dataframes as follows:
df2 = pd.DataFrame(np.random.randn(5,2),columns=['A','C'])
df3 = pd.DataFrame(np.random.randn(5,2),columns=['B','D'])
I wish to get the columns in an alternating fashion such that I get the result below:
df4 = pd.DataFrame()
for i in range(len(df2.columns)):
    df4[df2.columns[i]] = df2[df2.columns[i]]
    df4[df3.columns[i]] = df3[df3.columns[i]]
df4
A B C D
0 1.056889 0.494769 0.588765 0.846133
1 1.536102 2.015574 -1.279769 -0.378024
2 -0.097357 -0.886320 0.713624 -1.055808
3 -0.269585 -0.512070 0.755534 0.855884
4 -2.691672 -0.597245 1.023647 0.278428
I think I'm being really inefficient with this solution. What is the more Pythonic/pandas-idiomatic way of doing this?
P.S. In my specific case the column names are not A, B, C, D and are not alphabetically arranged; the example is just so you know which two dataframes I want to combine.
If you need something more dynamic, zip the column names of both DataFrames and then flatten the result:
df5 = pd.concat([df2, df3], axis=1)
print (df5)
A C B D
0 0.874226 -0.764478 1.022128 -1.209092
1 1.411708 -0.395135 -0.223004 0.124689
2 1.515223 -2.184020 0.316079 -0.137779
3 -0.554961 -0.149091 0.179390 -1.109159
4 0.666985 1.879810 0.406585 0.208084
#http://stackoverflow.com/a/10636583/2901002
print (list(sum(zip(df2.columns, df3.columns), ())))
['A', 'B', 'C', 'D']
print (df5[list(sum(zip(df2.columns, df3.columns), ()))])
A B C D
0 0.874226 1.022128 -0.764478 -1.209092
1 1.411708 -0.223004 -0.395135 0.124689
2 1.515223 0.316079 -2.184020 -0.137779
3 -0.554961 0.179390 -0.149091 -1.109159
4 0.666985 0.406585 1.879810 0.208084
How about this?
df4 = pd.concat([df2, df3], axis=1)
Or do they have to be in a specific order? Anyway, you can always reorder them:
df4 = df4[['A','B','C','D']]
And without writing out the columns:
df4 = df4[[item for items in zip(df2.columns, df3.columns) for item in items]]
You could concat and then reindex_axis (note that reindex_axis was deprecated and later removed from pandas; the modern equivalent is df.reindex(columns=...)):
df = pd.concat([df2, df3], axis=1)
df.reindex_axis(df.columns[::2].tolist() + df.columns[1::2].tolist(), axis=1)
Append even indices to df2's columns and odd indices to df3's columns, then sort on this new level:
df2_ = df2.T.set_index(np.arange(len(df2.columns)) * 2, append=True).T
df3_ = df3.T.set_index(np.arange(len(df3.columns)) * 2 + 1, append=True).T
df = pd.concat([df2_, df3_], axis=1).sort_index(axis=1, level=1)
df.columns = df.columns.droplevel(1)
df
