pandas: merge several dataframes [duplicate] - python

This question already has answers here: Pandas Merging 101 (8 answers). Closed 4 years ago.
I have several dataframes df1, df2, ... with duplicate data and partly overlapping columns and rows (see below).
How can I combine all of the dataframes into one dataframe?
import pandas as pd

df1 = pd.DataFrame({'A': [1,2], 'B': [4,5]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [5,6], 'C': [8,9]}, index=['b', 'c'])
df3 = pd.DataFrame({'A': [2,3], 'B': [5,6]}, index=['b', 'c'])
df4 = pd.DataFrame({'C': [7,8]}, index=['a', 'b'])
df5 = pd.DataFrame({'A': [1], 'B': [4], 'C': [7]}, index=['a'])
....
Added: the combined data structure should look like this:
A B C
a 1 4 7
b 2 5 8
c 3 6 9
Added: what I am really looking for is a more efficient version of the following script, which is really slow for big dataframes:
import numpy as np

dfs = [df1, df2, df3, df4, df5]
cols, rows = [], []
for df in dfs:
    cols = cols + df.columns.tolist()
    rows = rows + df.index.tolist()
cols = np.unique(cols)
rows = np.unique(rows)
merged_dfs = pd.DataFrame(data=np.nan, columns=cols, index=rows)
for df in dfs:
    for col in df.columns:
        for row in df.index:
            # cell-by-cell chained assignment is what makes this so slow
            merged_dfs[col][row] = df[col][row]
Fast and easy solution (added 23 Dec 2015):
dfs = [df1, df2, df3, df4, df5]
# create an empty DataFrame with all cols and rows
cols, rows = [], []
for df_i in dfs:
    cols = cols + df_i.columns.tolist()
    rows = rows + df_i.index.tolist()
cols = np.unique(cols)
rows = np.unique(rows)
df = pd.DataFrame(data=np.nan, columns=cols, index=rows)
# fill the DataFrame block-wise instead of cell-by-cell
for df_i in dfs:
    df.loc[df_i.index, df_i.columns] = df_i.values
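For comparison, the same result can be reached with a concat/groupby one-liner. This is a common idiom, not from the original post, and it assumes, like the script above, that overlapping cells agree across frames:

# stack all frames (union of columns), then collapse duplicate index labels,
# keeping the first non-null value seen for each cell
merged = pd.concat(dfs).groupby(level=0).first()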

With index preservation
This is an updated version that preserves the index:
from functools import reduce

dfs = [df1, df2, df3, df3, df5]

def my_merge(df1, df2):
    res = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
    cols = sorted(res.columns)
    pairs = []
    # collect the _x/_y column pairs produced by overlapping column names
    for col1, col2 in zip(cols[:-1], cols[1:]):
        if col1.endswith('_x') and col2.endswith('_y'):
            pairs.append((col1, col2))
    for col1, col2 in pairs:
        res[col1[:-2]] = res[col1].combine_first(res[col2])
        res = res.drop([col1, col2], axis=1)
    return res

print(reduce(my_merge, dfs))
Output:
A B C
a 1 4 7
b 2 5 8
c 3 6 9
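A shorter route to the same index-preserving result (an alternative idiom, not part of this answer's original code) is a reduce over combine_first, which already aligns on both index and columns and fills missing cells from the right frame:

from functools import reduce

# assumes overlapping cells agree; left values win where both are non-null
merged = reduce(lambda left, right: left.combine_first(right), dfs)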
Without index preservation
This would be one way:
from functools import reduce  # Python 3 only

dfs = [df1, df2, df3, df3, df5]

def my_merge(df1, df2):
    return pd.merge(df1, df2, how='outer')

merged_dfs = reduce(my_merge, dfs)
Results in:
A B C
0 1 4 NaN
1 2 5 8
2 NaN 6 9
3 3 6 NaN
4 1 4 7
You can adapt the join method by setting how:
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
left: use only keys from left frame (SQL: left outer join)
right: use only keys from right frame (SQL: right outer join)
outer: use union of keys from both frames (SQL: full outer join)
inner: use intersection of keys from both frames (SQL: inner join)
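For instance, with the question's df1 and df2 (which share only column B), a quick sketch of how the how parameter changes the row set:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [4, 5]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [5, 6], 'C': [8, 9]}, index=['b', 'c'])

print(pd.merge(df1, df2, how='inner'))  # keeps only B=5 -> 1 row
print(pd.merge(df1, df2, how='outer'))  # keeps B=4, 5 and 6 -> 3 rows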
If you like lambda, use this version for the same result:
reduce(lambda df1, df2: pd.merge(df1, df2, how='outer'), dfs)

Same idea as the other answer, but with a slightly different function:
from functools import reduce

def multiple_merge(lst_dfs, on):
    reduce_func = lambda left, right: pd.merge(left, right, on=on)
    return reduce(reduce_func, lst_dfs)
Here, lst_dfs is a list of dataframes
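A quick usage sketch (the key column 'B' here is just an illustration; pass whichever column your frames share):

# inner-joins df1, df2 and df3 on their shared column 'B';
# overlapping non-key columns get _x/_y suffixes automatically
merged = multiple_merge([df1, df2, df3], on='B')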

Related

Joining multiple dataframes with multiple common columns

I have multiple dataframes like this:
df=pd.DataFrame({'a':[1,2,3],'b':[3,4,5],'c':[4,6,7]})
df2=pd.DataFrame({'a':[1,2,3],'d':[66,24,55],'c':[4,6,7]})
df3=pd.DataFrame({'a':[1,2,3],'f':[31,74,95],'c':[4,6,7]})
I want this output:
a c
0 1 4
1 2 6
2 3 7
These are the columns common to all 3 datasets. I am looking for a solution that works for multiple columns without having to specify the common columns explicitly, unlike the solutions I have seen on SO (the actual dataframes are huge).
If you need to filter column names whose content is the same in each DataFrame, you can convert each column to a tuple and compare:
dfs = [df, df2, df3]
df1 = pd.concat([x.apply(tuple) for x in dfs], axis=1)
cols = df1.index[df1.eq(df1.iloc[:, 0], axis=0).all(axis=1)]
df2 = df[cols]
print (df2)
a c
0 1 4
1 2 6
2 3 7
If the column names can differ and only the content needs to be compared:
df=pd.DataFrame({'a':[1,2,3],'b':[3,4,5],'c':[4,6,7]})
df2=pd.DataFrame({'r':[1,2,3],'t':[66,24,55],'l':[4,6,7]})
df3=pd.DataFrame({'f':[1,2,3],'g':[31,74,95],'m':[4,6,7]})
dfs = [df, df2, df3]
p = [x.apply(tuple).tolist() for x in dfs]
a = set(p[0]).intersection(*p)
print (a)
{(4, 6, 7), (1, 2, 3)}
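If you then need those shared contents back as a DataFrame (the original column names are lost once the columns are reduced to tuples), a minimal sketch:

# one column per common content tuple; sorted() just makes the order deterministic
common = pd.DataFrame(sorted(a)).T
print(common)
   0  1
0  1  4
1  2  6
2  3  7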
You can use reduce to apply the function r_common cumulatively to the dataframes of dfs, from left to right, reducing the list to a single dataframe df_common. Inside r_common, the intersection method finds the columns common to the two dataframes d1 and d2:
from functools import reduce

def r_common(d1, d2):
    cols = d1.columns.intersection(d2.columns).tolist()
    m = d1[cols].eq(d2[cols]).all()
    return d1[m[m].index]

df_common = reduce(r_common, dfs)  # dfs = [df, df2, df3]
Result:
# print(df_common)
a c
0 1 4
1 2 6
2 3 7
A combination of reduce, intersection, filter and concat could help with your use case:
dfs = (df,df2,df3)
cols = [ent.columns for ent in dfs]
cols
[Index(['a', 'b', 'c'], dtype='object'),
Index(['a', 'd', 'c'], dtype='object'),
Index(['a', 'f', 'c'], dtype='object')]
# find the columns common to all frames:
from functools import reduce
universal_cols = reduce(lambda x,y : x.intersection(y), cols).tolist()
universal_cols
['a', 'c']
#filter for only universal_cols for each df
updates = [ent.filter(universal_cols) for ent in dfs]
If the columns and contents of the columns are the same, then you can skip the list comprehension and just filter from only one dataframe:
#let's use the first dataframe
output = df.filter(universal_cols)
If the columns' contents are different, then concatenate and drop duplicates:
#concatenate and drop duplicates
res = pd.concat(updates).drop_duplicates()
res #output has the same result
a c
0 1 4
1 2 6
2 3 7

Ignore empty dataframe when merging

I have four df (df1,df2,df3,df4)
Sometimes df1 is null, sometimes df2 is null, sometimes df3 and df4 accordingly.
How can I do an outer merge so that the df which is empty is automatically ignored? I am using the below code to merge as of now:
df = f1.result().merge(f2.result(), how='left', left_on='time', right_on='time').merge(f3.result(), how='left', left_on='time', right_on='time').merge(f4.result(), how='left', left_on='time', right_on='time')
and
df = reduce(lambda x,y: pd.merge(x,y, on='time', how='outer'), [f1.result(),f2.result(),f3.result(),f4.result()])
You can use df.empty attribute or len(df) > 0 to check whether the dataframe is empty or not.
Try this:
from functools import reduce

dfs = [df1, df2, df3, df4]
non_empty_dfs = [df for df in dfs if not df.empty]
df_final = reduce(lambda left, right: pd.merge(left, right, on='time', how='outer'), non_empty_dfs)
Or, you could filter out the empty dataframes with:
non_empty_dfs = [df for df in dfs if len(df) > 0]
Use pandas' DataFrame.empty attribute to filter out the empty dataframes; then you can concatenate or run whatever merge operation you have in mind:
df4 = pd.DataFrame({'A':[]}) #empty dataframe
df1 = pd.DataFrame({'B':[2]})
df2 = pd.DataFrame({'C':[3]})
df3 = pd.DataFrame({'D':[4]})
dfs = [df1,df2,df3,df4]
# concat
# you can run other operations now that the empty dataframe is gone
pd.concat([df for df in dfs if not df.empty], axis=1)
B C D
0 2 3 4
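One edge case neither snippet covers: if every frame is empty, reduce over an empty list raises a TypeError. A guarded sketch (merge_non_empty is a hypothetical helper; the 'time' key comes from the question):

from functools import reduce
import pandas as pd

def merge_non_empty(dfs, on='time'):
    non_empty = [df for df in dfs if not df.empty]
    if not non_empty:
        # nothing to merge; return an empty frame instead of crashing
        return pd.DataFrame()
    return reduce(lambda l, r: pd.merge(l, r, on=on, how='outer'), non_empty)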

nunique into new dataframe

I have some data in pandas:
df1
df1['ID_A'].nunique()
5
df2
df2['ID_B'].nunique()
6
df3
df3['ID_A'].nunique()
2
df4
df4['ID_B'].nunique()
9
and so-on until 200 df.
How can I make a new dataframe based on these nunique values?
My expected result looks like this:
combine ID_A ID_B
combine_1 5 6
combine_2 2 9
thank you
Use a list comprehension over the list of DataFrames, and if necessary rename the index with a list comprehension and f-strings:
df1 = pd.DataFrame({'ID_A':[1,2,3,4,5,5],
                    'ID_B':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'ID_A':[1,2,1,2,1,1,1,2,1],
                    'ID_B':[1,2,3,4,5,6,7,8,9]})
dfs = [df1, df2]
df = pd.DataFrame([x.nunique() for x in dfs])
df.index = [f'combine_{x+1}' for x in df.index]
df.index.name= 'combine'
print (df)
ID_A ID_B
combine
combine_1 5 6
combine_2 2 9
If necessary, filter only the columns given in a list:
cols = ['ID_A', 'ID_B']
dfs = [df1, df2]
df = pd.DataFrame([x[cols].nunique() for x in dfs])
#filter only columns starting by ID_
#df = pd.DataFrame([x.filter(regex='^ID_').nunique() for x in dfs])
df.index = [f'combine_{x+1}' for x in df.index]
df.index.name= 'combine'
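The same table can also be built in one step with a dict comprehension (an alternative sketch, not part of the answer above):

# nunique() returns one Series per frame, keyed by column name;
# building a dict of those Series and transposing gives one row per frame
df = pd.DataFrame({f'combine_{i+1}': d.nunique() for i, d in enumerate(dfs)}).T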

Pandas merge on part of two columns

I have two dataframes with a common column called 'upc' as such:
df1:
upc
23456793749
78907809834
35894796324
67382808404
93743008374
df2:
upc
4567937
9078098
8947963
3828084
7430083
Notice that the df2 'upc' values are the innermost 7 digits of the df1 'upc' values.
Note that both df1 and df2 have other columns not shown above.
What I want to do is do an inner merge on 'upc' but only on the innermost 7 values. How can I achieve this?
1) Create both dataframes and convert the 'upc' columns to string type.
2) pd.merge the two frames, using the left_on keyword to take the inner 7 characters of your 'upc' series:
df1 = pd.DataFrame(data=[
    23456793749,
    78907809834,
    35894796324,
    67382808404,
    93743008374,
], columns=['upc1'])
df1 = df1.astype(str)

df2 = pd.DataFrame(data=[
    4567937,
    9078098,
    8947963,
    3828084,
    7430083,
], columns=['upc2'])
df2 = df2.astype(str)

pd.merge(df1, df2, left_on=df1['upc1'].astype(str).str[2:-2], right_on='upc2', how='inner')
Out[5]:
upc1 upc2
0 23456793749 4567937
1 78907809834 9078098
2 35894796324 8947963
3 67382808404 3828084
4 93743008374 7430083
Using str.extract, match all items in df1 against df2, then use the result as the merge key to merge with df2:
df1['keyfordf2']=df1.astype(str).upc.str.extract(r'({})'.format('|'.join(df2.upc.astype(str).tolist())),expand=True).fillna(False)
df1.merge(df2.astype(str),left_on='keyfordf2',right_on='upc')
Out[273]:
upc_x keyfordf2 upc_y
0 23456793749 4567937 4567937
1 78907809834 9078098 9078098
2 35894796324 8947963 8947963
3 67382808404 3828084 3828084
4 93743008374 7430083 7430083
You could make a new column in df1 and merge on that.
import pandas as pd
df1= pd.DataFrame({'upc': [ 23456793749, 78907809834, 35894796324, 67382808404, 93743008374]})
df2= pd.DataFrame({'upc': [ 4567937, 9078098, 8947963, 3828084, 7430083]})
df1['upc_old'] = df1['upc'] #in case you still need the old (longer) upc column
df1['upc'] = df1['upc'].astype(str).str[2:-2].astype(int)
merged_df = pd.merge(df1, df2, on='upc')
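All three answers assume the inner 7 digits identify each row uniquely; if two long upcs shared the same middle digits, an inner merge would fan out into duplicate rows. A quick sanity check (a sketch; run it before the 'upc' column is overwritten as in the last snippet):

# derive the 7-digit keys and make sure they don't collide
inner7 = df1['upc'].astype(str).str[2:-2]
assert inner7.is_unique, "inner 7-digit keys collide; the merge would duplicate rows"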

Pandas concatenate alternating columns

I have two dataframes as follows:
df2 = pd.DataFrame(np.random.randn(5,2),columns=['A','C'])
df3 = pd.DataFrame(np.random.randn(5,2),columns=['B','D'])
I wish to get the columns in an alternating fashion such that I get the result below:
df4 = pd.DataFrame()
for i in range(len(df2.columns)):
    df4[df2.columns[i]] = df2[df2.columns[i]]
    df4[df3.columns[i]] = df3[df3.columns[i]]
df4
A B C D
0 1.056889 0.494769 0.588765 0.846133
1 1.536102 2.015574 -1.279769 -0.378024
2 -0.097357 -0.886320 0.713624 -1.055808
3 -0.269585 -0.512070 0.755534 0.855884
4 -2.691672 -0.597245 1.023647 0.278428
I think I'm being really inefficient with this solution. What is the more Pythonic/pandas-idiomatic way of doing this?
P.S. In my specific case the column names are not A, B, C, D and are not alphabetically arranged; the example is just so you know which two dataframes I want to combine.
If you need something more dynamic, zip the column names of both DataFrames and then flatten the result:
df5 = pd.concat([df2, df3], axis=1)
print (df5)
A C B D
0 0.874226 -0.764478 1.022128 -1.209092
1 1.411708 -0.395135 -0.223004 0.124689
2 1.515223 -2.184020 0.316079 -0.137779
3 -0.554961 -0.149091 0.179390 -1.109159
4 0.666985 1.879810 0.406585 0.208084
#http://stackoverflow.com/a/10636583/2901002
print (list(sum(zip(df2.columns, df3.columns), ())))
['A', 'B', 'C', 'D']
print (df5[list(sum(zip(df2.columns, df3.columns), ()))])
A B C D
0 0.874226 1.022128 -0.764478 -1.209092
1 1.411708 -0.223004 -0.395135 0.124689
2 1.515223 0.316079 -2.184020 -0.137779
3 -0.554961 0.179390 -0.149091 -1.109159
4 0.666985 0.406585 1.879810 0.208084
How about this?
df4 = pd.concat([df2, df3], axis=1)
Or do they have to be in a specific order? Anyway, you can always reorder them:
df4 = df4[['A','B','C','D']]
And without writing out the columns:
df4 = df4[[item for items in zip(df2.columns, df3.columns) for item in items]]
You could concat and then reindex_axis (note that reindex_axis was deprecated and later removed from pandas; the modern equivalent is df.reindex(columns=...)):
df = pd.concat([df2, df3], axis=1)
df.reindex_axis(df.columns[::2].tolist() + df.columns[1::2].tolist(), axis=1)
Append even indices to df2's columns and odd indices to df3's columns, then sort on this new level:
df2_ = df2.T.set_index(np.arange(len(df2.columns)) * 2, append=True).T
df3_ = df3.T.set_index(np.arange(len(df3.columns)) * 2 + 1, append=True).T
df = pd.concat([df2_, df3_], axis=1).sort_index(axis=1, level=1)
df.columns = df.columns.droplevel(1)
df
