Merge multiple data frames with different dimensions using Pandas [duplicate] - python

This question already has answers here:
Pandas Merging 101
I have the following data frames (in reality they are more than 3).
import pandas as pd
df1 = pd.DataFrame({'head1': ['foo', 'bix', 'bar'],'val': [11, 22, 32]})
df2 = pd.DataFrame({'head2': ['foo', 'xoo', 'bar','qux'],'val': [1, 2, 3,10]})
df3 = pd.DataFrame({'head3': ['xoo', 'bar',],'val': [20, 100]})
# Note that the value in column 'head' is always unique
What I want to do is merge them based on the head column, and whenever a head value does not exist in one data frame, assign it NA.
In the end it'll look like this:
     head1  head2  head3
foo     11      1     NA
bix     22     NA     NA
bar     32      3    100
xoo     NA      2     20
qux     NA     10     NA
How can I achieve that using Pandas?

You can use pandas.concat selecting the axis=1 to concatenate your multiple DataFrames.
Note however that I've first set the index of the df1, df2, df3 to use the variables (foo, bar, etc) rather than the default integers.
import pandas as pd
df1 = pd.DataFrame({'head1': ['foo', 'bix', 'bar'],'val': [11, 22, 32]})
df2 = pd.DataFrame({'head2': ['foo', 'xoo', 'bar','qux'],'val': [1, 2, 3,10]})
df3 = pd.DataFrame({'head3': ['xoo', 'bar',],'val': [20, 100]})
df1 = df1.set_index('head1')
df2 = df2.set_index('head2')
df3 = df3.set_index('head3')
df = pd.concat([df1, df2, df3], axis = 1)
columns = ['head1', 'head2', 'head3']
df.columns = columns
print(df)
     head1  head2  head3
bar     32      3    100
bix     22    NaN    NaN
foo     11      1    NaN
qux    NaN     10    NaN
xoo    NaN      2     20
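The same result can also be built without touching the index, by chaining outer merges on a shared key column. This is only a sketch under the assumption that every frame has the key first and the value second; the 'key' column name is made up for the example:

```python
import functools
import pandas as pd

df1 = pd.DataFrame({'head1': ['foo', 'bix', 'bar'], 'val': [11, 22, 32]})
df2 = pd.DataFrame({'head2': ['foo', 'xoo', 'bar', 'qux'], 'val': [1, 2, 3, 10]})
df3 = pd.DataFrame({'head3': ['xoo', 'bar'], 'val': [20, 100]})

# Rename each frame to a shared key column plus its own value column,
# then chain outer merges over the whole list.
frames = [df.set_axis(['key', name], axis=1)
          for df, name in [(df1, 'head1'), (df2, 'head2'), (df3, 'head3')]]
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, on='key', how='outer'), frames)
print(merged)
```

Because the merges are outer, any key missing from one frame simply gets NaN in that frame's column, which scales to "more than 3" frames with no extra code.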

Related

How To Merge Two Data Frames in Pandas Python [duplicate]

This question already has answers here:
How do I combine two dataframes?
Pandas Merging 101
How To Merge/Concat Two Data Frames
I want to merge two dataframes: the first has a single column of datetime64 dtype, and the second a single column of float dtype. This is what I have tried:
df1 = pd.DataFrame(df, columns=['MemStartDate'])
df2 = pd.DataFrame(df, columns=['TotalPrice'])
df_merge = pd.merge(df1, df2, left_on='MemStartDate', right_on='TotalPrice')
Error: You are trying to merge on datetime64[ns] and float64 columns. If you wish to proceed you should use pd.concat
But how can I do that ?
You can try this:
df_merge = pd.concat([df1, df2], axis=1)
pd.concat is the best option, but you can also try dataframe.join(dataframe):
df_merge = df1.join(df2)
For more information, see Merge, join, concatenate and compare.
Let us consider the following situation:
import pandas as pd
# Create a dataframe with one datetime64 column and one float64 column
dictionary = {'MemStartDate': ['2007-07-13', '2006-01-13', '2010-08-13'],
              'TotalPrice': [50.5, 10.4, 3.5]}
df = pd.DataFrame(dictionary)
df['MemStartDate'] = pd.to_datetime(df['MemStartDate'])  # dtype: datetime64[ns]
df1 = pd.DataFrame(df, columns=['MemStartDate'])
df4 = pd.DataFrame(df, columns=['TotalPrice'])
df.TotalPrice  # dtype: float64
Where you have df1 and df4 that are:
df1
Out:
MemStartDate
0 2007-07-13
1 2006-01-13
2 2010-08-13
df4
Out:
TotalPrice
0 50.5
1 10.4
2 3.5
If you want to concat df1 and df4, it means that you want to concatenate pandas objects along a particular axis with optional set logic along the other axes (see pandas.concat — pandas 1.4.2 documentation). Thus in practice:
df_concatenated = pd.concat([df1, df4], axis=1)
df_concatenated
The new resulting dataframe df_concatenated is this:
Out:
MemStartDate TotalPrice
0 2007-07-13 50.5
1 2006-01-13 10.4
2 2010-08-13 3.5
The axis decides where you want to concatenate along. With axis=1 you have concatenated the second dataframe along columns of the first dataframe. You can try with axis=0:
df_concatenated = pd.concat([df1, df4], axis=0)
df_concatenated
The output is:
Out:
MemStartDate TotalPrice
0 2007-07-13 NaN
1 2006-01-13 NaN
2 2010-08-13 NaN
0 NaN 50.5
1 NaN 10.4
2 NaN 3.5
Now you have added the second dataframe along rows of the first dataframe.
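If you do stack along rows and want a fresh 0..5 index instead of the repeated 0..2 shown above, ignore_index=True does that. A small sketch with the same two frames:

```python
import pandas as pd

df1 = pd.DataFrame({'MemStartDate': ['2007-07-13', '2006-01-13', '2010-08-13']})
df4 = pd.DataFrame({'TotalPrice': [50.5, 10.4, 3.5]})

# ignore_index=True renumbers the rows 0..5 instead of repeating 0..2 twice
stacked = pd.concat([df1, df4], axis=0, ignore_index=True)
print(stacked.index.tolist())
```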
On the other hand, merge is used to join dataframes when they share some columns. It is useful when you do not want to store the same contents repeatedly in several dataframes. For example:
# Create two dataframes
dictionary = {'MemStartDate': ['2007-07-13', '2006-01-13', '2010-08-13'],
              'TotalPrice': [50.5, 10.4, 3.5]}
dictionary_1 = {'MemStartDate': ['2007-07-13', '2006-01-13', '2010-08-13', '2010-08-14'],
                'Shop': ['Shop_1', 'Shop_2', 'Shop_3', 'Shop_4']}
df = pd.DataFrame(dictionary)
df_1 = pd.DataFrame(dictionary_1)
If you have df and df_1 that are:
df
Out:
MemStartDate TotalPrice
0 2007-07-13 50.5
1 2006-01-13 10.4
2 2010-08-13 3.5
and
df_1
Out:
MemStartDate Shop
0 2007-07-13 Shop_1
1 2006-01-13 Shop_2
2 2010-08-13 Shop_3
3 2010-08-14 Shop_4
You can merge them in this way:
df_merged = pd.merge(df,df_1, on='MemStartDate', how='outer')
df_merged
Out:
MemStartDate TotalPrice Shop
0 2007-07-13 50.5 Shop_1
1 2006-01-13 10.4 Shop_2
2 2010-08-13 3.5 Shop_3
3 2010-08-14 NaN Shop_4
In the new dataframe df_merged, you keep the common column of the old dataframes df and df_1 (MemStartDate) and add the two columns that are different in the two dataframes (TotalPrice and Shop).
A couple of other illustrative examples about merging dataframes in Pandas:
Example 1. Merging two dataframes preserving one column that is equal for both dataframes:
left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
result = pd.merge(left, right, on="key")
result
Out:
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
Example 2. Merging two dataframes in order to read all the combinations of values
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})
result = pd.merge(df1, df2, left_on='lkey', right_on='rkey')
result
Out:
lkey value_x rkey value_y
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7
Also in this case you can check pandas.DataFrame.merge — pandas 1.4.2 documentation (the source of the second example); the Merge, join, concatenate and compare guide (the source of the first example) shows other possible ways to manipulate your dataframes.
In the end, to sum up, you can intuitively understand what pd.concat() and pd.merge() do by studying the meaning of their names in spoken language:
Concatenate: to link together in a series or chain
Merge: to cause to combine, unite, or coalesce
And to come back to your error:
Error: You are trying to merge on datetime64[ns] and float64 columns. If you wish to proceed you should use pd.concat
It is telling you that the columns you are merging on have different data types. Pandas infers that you are trying to do something that is "pd.concat's job", and so it suggests using pd.concat instead.
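As a minimal sketch of that suggestion (with the column names taken from the question, and two made-up rows of data):

```python
import pandas as pd

dates = pd.DataFrame({'MemStartDate': pd.to_datetime(['2007-07-13', '2006-01-13'])})
prices = pd.DataFrame({'TotalPrice': [50.5, 10.4]})

# The two frames share no key column, so concat glues them side by side by position.
combined = pd.concat([dates, prices], axis=1)
print(combined)
```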

Pandas apply row-wise a function and create multiple new columns

What is the best way to apply a row-wise function and create multiple new columns?
I have two dataframes and a working code, but it's most likely not optimal
df1 (dataframe has thousands of rows and xx number of columns)
sic  data1       data2       data3       data4       data5
5    0.90783598  0.84722083  0.47149924  0.98724123  0.50654476
6    0.53442684  0.59730371  0.92486887  0.61531646  0.62784041
3    0.56806423  0.09619383  0.33846097  0.71878313  0.96316724
8    0.86933042  0.64965755  0.94549745  0.08866519  0.92156389
12   0.651328    0.37193774  0.9679044   0.36898991  0.15161838
6    0.24555531  0.50195983  0.79114578  0.9290596   0.10672607
df2 (column header maps to the sic-code in df1. There are in total 12 sic-codes and the dataframe is thousands of rows long)
        1           2           3           4           5           6           7           8           9           10          11          12
c_bar   0.4955329   0.92970292  0.68049726  0.91325006  0.55578465  0.78056519  0.53954711  0.90335326  0.93986402  0.0204794   0.51575764  0.61144255
a1_bar  0.75781444  0.81052669  0.99910449  0.62181902  0.11797144  0.40031316  0.08561665  0.35296894  0.14445697  0.93799762  0.80641802  0.31379671
a2_bar  0.41432552  0.36313911  0.13091618  0.39251953  0.66249636  0.31221897  0.15988528  0.1620938   0.55143589  0.66571044  0.68198944  0.23806947
a3_bar  0.38918855  0.83689178  0.15838139  0.39943204  0.48615188  0.06299899  0.86343819  0.47975619  0.05300611  0.15080875  0.73088725  0.3500239
a4_bar  0.47201384  0.90874121  0.50417142  0.70047698  0.24820601  0.34302454  0.4650635   0.0992668   0.55142391  0.82947194  0.28251699  0.53170308
I achieved the result I need with the following code:
import numpy as np

ind_list = np.arange(1, 13)  # list of the 12 industry codes

def c_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['const', i]

def a1_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a1bar', i]

def a2_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a2bar', i]

def a3_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a3bar', i]

def a4_bar(row):
    for i in ind_list:
        if row['sic'] == i:
            return mlev_mean.loc['a4bar', i]

mlev_merge['c_bar'] = mlev_merge.apply(c_bar, axis=1, result_type='expand')
mlev_merge['a1_bar'] = mlev_merge.apply(a1_bar, axis=1, result_type='expand')
mlev_merge['a2_bar'] = mlev_merge.apply(a2_bar, axis=1, result_type='expand')
mlev_merge['a3_bar'] = mlev_merge.apply(a3_bar, axis=1, result_type='expand')
mlev_merge['a4_bar'] = mlev_merge.apply(a4_bar, axis=1, result_type='expand')
The output is something like this:
sic  data1       data2       data3       data4       c_bar       a1_bar      a2_bar      a3_bar      a4_bar
5    0.10316948  0.61408639  0.04042675  0.79255749  0.56357931  0.42920472  0.20701581  0.67639811  0.37778029
6    0.5730904   0.16753145  0.27835136  0.00178992  0.51793793  0.06772307  0.15084885  0.12451806  0.33114948
3    0.87710893  0.66834187  0.14286608  0.12609769  0.75873957  0.72586804  0.6081763   0.14598001  0.21557266
8    0.24565579  0.56195558  0.93316676  0.20988936  0.67404545  0.65221594  0.79758557  0.67093021  0.33400764
12   0.79703344  0.61066111  0.94602909  0.56218703  0.92384307  0.30836159  0.72521994  0.00795362  0.76348227
6    0.86604791  0.28454782  0.97229172  0.21853932  0.75650652  0.40788056  0.53233553  0.60326386  0.27399405
Cell values in the example are randomly generated, but the point is to map based on sic-codes and add rows from df2 as new columns into df1.
To do this, you need to:
Transpose df2 so that its columns are correct for concatenation
Index it with the df1["sic"] column to get the correct rows
Reset the index of the obtained rows of df2 using .reset_index(drop=True), so that the dataframes can be concatenated correctly. (This replaces the current index e.g. 5, 6, 3, 8, 12, 6 with a new one e.g. 0, 1, 2, 3, 4, 5 while keeping the actual values the same. This is so that pandas doesn't get confused while concatenating them)
Concatenate the two dataframes
Note: I used a method based off of this to read in the dataframe, and it assumed that the columns of df2 were strings but the values of the sic column of df1 were ints. Therefore I used .astype(str) to get step 2 working. If this is not actually the case, you may need to remove the .astype(str).
Here is the single line of code to do these things:
merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)
Here is the full code I used:
from io import StringIO
import pandas as pd
df1 = pd.read_csv(StringIO("""
sic data1 data2 data3 data4 data5
5 0.90783598 0.84722083 0.47149924 0.98724123 0.50654476
6 0.53442684 0.59730371 0.92486887 0.61531646 0.62784041
3 0.56806423 0.09619383 0.33846097 0.71878313 0.96316724
8 0.86933042 0.64965755 0.94549745 0.08866519 0.92156389
12 0.651328 0.37193774 0.9679044 0.36898991 0.15161838
6 0.24555531 0.50195983 0.79114578 0.9290596 0.10672607
"""), sep="\t")
df2 = pd.read_csv(StringIO("""
1 2 3 4 5 6 7 8 9 10 11 12
c_bar 0.4955329 0.92970292 0.68049726 0.91325006 0.55578465 0.78056519 0.53954711 0.90335326 0.93986402 0.0204794 0.51575764 0.61144255
a1_bar 0.75781444 0.81052669 0.99910449 0.62181902 0.11797144 0.40031316 0.08561665 0.35296894 0.14445697 0.93799762 0.80641802 0.31379671
a2_bar 0.41432552 0.36313911 0.13091618 0.39251953 0.66249636 0.31221897 0.15988528 0.1620938 0.55143589 0.66571044 0.68198944 0.23806947
a3_bar 0.38918855 0.83689178 0.15838139 0.39943204 0.48615188 0.06299899 0.86343819 0.47975619 0.05300611 0.15080875 0.73088725 0.3500239
a4_bar 0.47201384 0.90874121 0.50417142 0.70047698 0.24820601 0.34302454 0.4650635 0.0992668 0.55142391 0.82947194 0.28251699 0.53170308
"""), sep="\t", index_col=[0])
merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)
print(merged)
which produces the output:
sic data1 data2 data3 ... a1_bar a2_bar a3_bar a4_bar
0 5 0.907836 0.847221 0.471499 ... 0.117971 0.662496 0.486152 0.248206
1 6 0.534427 0.597304 0.924869 ... 0.400313 0.312219 0.062999 0.343025
2 3 0.568064 0.096194 0.338461 ... 0.999104 0.130916 0.158381 0.504171
3 8 0.869330 0.649658 0.945497 ... 0.352969 0.162094 0.479756 0.099267
4 12 0.651328 0.371938 0.967904 ... 0.313797 0.238069 0.350024 0.531703
5 6 0.245555 0.501960 0.791146 ... 0.400313 0.312219 0.062999 0.343025
[6 rows x 11 columns]
Try transposing df2 and applying your transformation to it.
Transposing a data frame turns its rows into the columns of the data frame.
df2_tr = df2.T.apply(mapFunc, axis=0)  # mapFunc is a placeholder for your own function
Then you can concatenate the transformed columns of df2 with the columns of df1, using df1 = pd.concat([df1, df2_tr], axis=1).
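Another sketch of the same sic-code mapping, using merge instead of concat so that no index reset is needed. The small frames here are made up for illustration, not the asker's actual data:

```python
import pandas as pd

df1 = pd.DataFrame({'sic': [5, 6, 3], 'data1': [0.9, 0.5, 0.6]})
# Columns 1..12 play the role of the sic codes; two "bar" rows stand in for five.
df2 = pd.DataFrame({i: [i * 0.1, i * 0.2] for i in range(1, 13)},
                   index=['c_bar', 'a1_bar'])

# After transposing, df2.T is indexed by sic code, so a left merge lines the rows up.
merged = df1.merge(df2.T, left_on='sic', right_index=True, how='left')
print(merged)
```

Repeated sic codes in df1 simply pick up the same df2 row twice, which is the behavior the question asks for.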

Merge dataframes with different columns in a for loop

I need to join dataframes with different columns created in a for-loop.
This is the question in a simplified version. As you can see in the picture, I have made two dataframes.
The first dataframe has 5 columns, and the column numbers are not continuous (0, 2, 5 and 7 are missing).
The second has 6 columns, also not continuous (0, 6 and 7 are missing), and its columns do not completely match the first df's.
What I need to do is :
Step 1: create a new df with continuous column numbers 0,1,2,3,4,5,6,7,8.
Step 2: Add the rows of df1 and df2 under their corresponding column numbers. Any cell whose column number has no value should be NaN.
Note : This has to be done in a loop as I have thousands of dataframes to merge
So the resulting dataframe will be like this:
# store your dfs in an iterator
# df_list = [ ... ]
# the columns you want your final df to have
final_columns = range(9)
# add these columns with value None to your dfs if not there already
for df in df_list:
    for i in final_columns:
        if i not in df.columns:
            df[i] = None
# merge all of your dfs together
final_df = pd.concat(df_list, ignore_index=True)
final_df
Try concat + reindex:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[34, 56, 66, 77, 77]], columns=[1, 3, 4, 6, 7])
df2 = pd.DataFrame([[34, 56, 66, 77, 77, 66]], columns=[1, 2, 3, 4, 5, 8])
# Collection of all DataFrames
dfs = (df1, df2)
# Concat
new_df = pd.concat(dfs, ignore_index=True).reindex(columns=np.arange(0, 9))
print(new_df)
new_df:
0 1 2 3 4 5 6 7 8
0 NaN 34 NaN 56 66 NaN 77.0 77.0 NaN
1 NaN 34 56.0 66 77 77.0 NaN NaN 66.0

Fill in values based on a different dataframe's values in pandas [duplicate]

This question already has answers here:
Pandas Merging 101
I have the following dataframe:
df1 = pd.DataFrame({'ID': ['foo', 'foo', 'bar', 'foo', 'baz', 'foo'], 'value': [1, 2, 3, 5, 4, 3]})
df2 = pd.DataFrame({'ID': ['foo', 'bar', 'baz', 'foo'],'age': [10, 21, 32, 15]})
I would like to create a new column in df1 called age and take the values from df2 that match on 'ID'. Those values should be duplicated (instead of NaN) when an 'ID' value appears more than once in df1.
I tried a merge of df1 and df2, but they produce NaNs instead of duplicates.
The Pandas Merging 101 thread does not contain an answer to this problem.
I think you need outer join:
df = pd.merge(df1, df2, on='ID', how='outer')
print(df)
ID value age
0 foo 1 10
1 foo 1 15
2 foo 2 10
3 foo 2 15
4 foo 5 10
5 foo 5 15
6 foo 3 10
7 foo 3 15
8 bar 3 21
9 baz 4 32
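For completeness: the duplicate 'foo' in the question's df2 is what multiplies the rows above. When df2's IDs are unique, a left merge keeps exactly one output row per df1 row, which is often what "duplicate the value instead of NaN" means in practice. A sketch with de-duplicated IDs (an assumption, not the question's exact df2):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': ['foo', 'foo', 'bar', 'foo', 'baz', 'foo'],
                    'value': [1, 2, 3, 5, 4, 3]})
df2 = pd.DataFrame({'ID': ['foo', 'bar', 'baz'],   # unique IDs (assumption)
                    'age': [10, 21, 32]})

# Each df1 row gets the single matching age; every repeat of 'foo' gets 10.
out = df1.merge(df2, on='ID', how='left')
print(out)
```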

Create a Pandas DataFrame from series without duplicating their names?

Is it possible to create a DataFrame from a list of series without duplicating their names?
Ex, creating the same DataFrame as:
>>> pd.DataFrame({ "foo": data["foo"], "bar": other_data["bar"] })
But without needing to explicitly name the columns?
Try pandas.concat which takes a list of items to combine as its argument:
df1 = pd.DataFrame(np.random.randn(100, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.random.randn(100, 3), columns=list('xyz'))
df3 = pd.concat([df1['a'], df2['y']], axis=1)
Note that you need to use axis=1 to stack things together side-by side and axis=0 (which is the default) to combine them one-over-the-other.
Seems like you want to join the dataframes (works similar to SQL):
import numpy as np
import pandas

df1 = pandas.DataFrame(
    np.random.randint(low=0, high=11, size=(10, 2)),
    columns=['foo', 'bar'],
    index=list('ABCDEFHIJK')
)
df2 = pandas.DataFrame(
    np.random.randint(low=0, high=11, size=(10, 2)),
    columns=['bar', 'bax'],
    index=list('DEFHIJKLMN')
)
df1[['foo']].join(df2['bar'], how='outer')
The on kwarg takes a list of columns or None. If None, it joins on the indices of the two dataframes. You just need to make sure that you're using a dataframe for the left side -- hence the double brackets to force df1[['foo']] to a dataframe (df1['foo'] returns a series).
This gives me:
foo bar
A 4 NaN
B 0 NaN
C 10 NaN
D 8 3
E 2 0
F 3 3
H 9 10
I 0 9
J 5 6
K 2 9
L NaN 3
M NaN 1
N NaN 1
You can also do inner, left, and right joins.
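A small sketch of how the how= argument changes the join result (tiny toy frames, not the random ones above):

```python
import pandas as pd

a = pd.DataFrame({'foo': [1, 2, 3]}, index=list('ABC'))
b = pd.DataFrame({'bar': [10, 20]}, index=list('BC'))

inner = a.join(b, how='inner')   # keeps only the shared index labels B, C
left = a.join(b, how='left')     # keeps all of a's labels; A gets NaN for bar
print(inner)
print(left)
```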
I prefer the explicit way, as presented in your original post, but if you really want to write certain names once, you could try this:
import pandas as pd
import numpy as np
def dictify(*args):
    return dict((i, n[i]) for i, n in args)

data = {'foo': np.random.randn(5)}
other_data = {'bar': np.random.randn(5)}
print(pd.DataFrame(dictify(('foo', data), ('bar', other_data))))
The output is as expected:
bar foo
0 0.533973 -0.477521
1 0.027354 0.974038
2 -0.725991 0.350420
3 1.921215 0.648210
4 0.547640 1.652310
[5 rows x 2 columns]
