Difference(s) between merge() and concat() in pandas - python

What's the essential difference(s) between pd.DataFrame.merge() and pd.concat()?
So far, this is what I found, please comment on how complete and accurate my understanding is:
.merge() can only use columns (plus row-indices) and it is semantically suitable for database-style operations. .concat() can be used with either axis, using only indices, and gives the option for adding a hierarchical index.
Incidentally, this allows for the following redundancy: both can combine two dataframes using the row indices.
pd.DataFrame.join() merely offers a shorthand for a subset of the use cases of .merge()
(Pandas is great at addressing a very wide spectrum of use cases in data analysis. It can be a bit daunting exploring the documentation to figure out what is the best way to perform a particular task. )

A very high level difference is that merge() is used to combine two dataframes on the basis of values of common columns (indices can also be used; use left_index=True and/or right_index=True), and concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).
join() is used to merge 2 dataframes on the basis of the index; instead of using merge() with the option left_index=True, we can use join().
For example:
df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df1:
  Key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6
df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})
df2:
  Key  data2
0   a      0
1   b      1
2   d      2
#Merge
# The 2 dataframes are merged on the basis of values in column "Key" as it is
# a common column in 2 dataframes
pd.merge(df1, df2)
  Key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
#Concat
# df2 dataframe is appended at the bottom of df1
pd.concat([df1, df2])
  Key  data1  data2
0   b    0.0    NaN
1   b    1.0    NaN
2   a    2.0    NaN
3   c    3.0    NaN
4   a    4.0    NaN
5   a    5.0    NaN
6   b    6.0    NaN
0   a    NaN    0.0
1   b    NaN    1.0
2   d    NaN    2.0
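For the join() equivalence mentioned above, a minimal sketch (using the same df1 and df2; df2's Key is moved into its index so the join happens on it):
# join matches df1's "Key" column against df2's index (left join by default);
# equivalent to pd.merge(df1, df2.set_index('Key'),
#                        left_on='Key', right_index=True, how='left')
df1.join(df2.set_index('Key'), on='Key')
  Key  data1  data2
0   b      0    1.0
1   b      1    1.0
2   a      2    0.0
3   c      3    NaN
4   a      4    0.0
5   a      5    0.0
6   b      6    1.0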

At a high level:
.concat() simply stacks multiple DataFrames together, either
vertically, or stitched horizontally after aligning on the index
.merge() first aligns the two DataFrames' selected common column(s) or
index, and then picks up the remaining columns from the aligned rows of each DataFrame.
More specifically, .concat():
Is a top-level pandas function
Combines two or more pandas DataFrames vertically or horizontally
Aligns only on the index when combining horizontally
Errors on duplicate index labels when combining horizontally (and, with verify_integrity=True, whenever the resulting index would contain duplicates)
Defaults to outer join, with the option for inner join
And .merge():
Exists both as a top-level pandas function and as a DataFrame method
Combines exactly two DataFrames horizontally
Aligns the calling DataFrame's column(s) or index with the other
DataFrame's column(s) or index
Handles duplicate values on the joining columns or index by
performing a cartesian product
Defaults to inner join, with options for left, outer, and right
Note that when performing pd.merge(left, right), if left has two rows containing the same values in the joining columns or index, each such row will combine with right's corresponding row(s), resulting in a cartesian product. On the other hand, if .concat() is used to combine columns, we need to make sure no duplicate index exists in either DataFrame.
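A minimal sketch of both behaviors (the small frames here are illustrative, not from the Cookbook):
import pandas as pd

left = pd.DataFrame({'k': ['a', 'a'], 'v1': [1, 2]})
right = pd.DataFrame({'k': ['a', 'a'], 'v2': [3, 4]})

# duplicate 'a' keys on both sides: 2 x 2 = 4 rows (cartesian product)
pd.merge(left, right, on='k')
#    k  v1  v2
# 0  a   1   3
# 1  a   1   4
# 2  a   2   3
# 3  a   2   4

# horizontal concat with a duplicated index cannot align rows unambiguously
dup = pd.DataFrame({'v1': [1, 2]}, index=[0, 0])
other = pd.DataFrame({'v2': [3, 4]}, index=[0, 1])
pd.concat([dup, other], axis=1)  # raises; exact message varies by pandas
                                 # version, e.g. "Reindexing only valid with
                                 # uniquely valued Index objects"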
Practically speaking:
Consider .concat() first when combining homogeneous DataFrames, and
consider .merge() first when combining complementary DataFrames.
If you need to combine vertically, go with .concat(). If you need to combine
horizontally via columns, go with .merge(), which by default merges on the columns in common.
Reference: Pandas 1.x Cookbook

pd.concat takes an Iterable as its argument. Hence, it cannot take DataFrames directly (pass a list, e.g. [df1, df2]). Also, the DataFrames should line up along the concatenation axis; labels that don't match on the other axis are filled with NaN (or dropped with join='inner').
pd.merge can take DataFrames directly as its arguments, and is used to combine two DataFrames that share columns or an index. This can't be done with pd.concat, which would simply repeat the shared column in the result instead of matching on it.
Whereas join can be used to join two DataFrames with different indices.
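A quick sketch of the iterable requirement (df1 and df2 being any two frames):
pd.concat([df1, df2])  # OK: a list of frames
pd.concat(df1, df2)    # TypeError: first argument must be an iterable of
                       # pandas objects (exact wording varies by version)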

I am currently trying to understand the essential difference(s) between pd.DataFrame.merge() and pd.concat().
Nice question. The main difference:
pd.concat works on both axes.
The other difference is that pd.concat has only outer (the default) and inner joins, while pd.DataFrame.merge() has left, right, outer, and inner (the default) joins.
A third notable difference: pd.DataFrame.merge() has the option to set the column suffixes when merging columns with the same name, while for pd.concat this is not possible.
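For instance (a hedged sketch; the shared column x is hypothetical):
left = pd.DataFrame({'k': ['a'], 'x': [1]})
right = pd.DataFrame({'k': ['a'], 'x': [2]})
pd.merge(left, right, on='k', suffixes=('_left', '_right'))
#    k  x_left  x_right
# 0  a       1        2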
With pd.concat, by default you are able to stack rows of multiple dataframes (axis=0), and when you set axis=1 you mimic the pd.DataFrame.merge() function.
Some useful examples of pd.concat:
df2 = pd.concat([df] * 2, ignore_index=True)  # double the rows of a dataframe
df2 = pd.concat([df, df.iloc[[0]]])  # add the first row to the end
df3 = pd.concat([df1, df2], join='inner', ignore_index=True)  # concat two df's

The main difference between merge and concat is that merge allows you to perform a more structured "join" of tables, where the use of concat is broader and less structured.
Merge
Referring to the documentation, pd.DataFrame.merge takes right as a required argument; you can think of it as joining a left table and a right table according to some pre-defined, structured join operation. Note the definition of the parameter right.
Required Parameters
right: DataFrame or named Series
Optional Parameters
how: {‘left’, ‘right’, ‘outer’, ‘inner’} default ‘inner’
on: label or list
left_on: label or list, or array-like
right_on: label or list, or array-like
left_index: bool, default False
right_index: bool, default False
sort: bool, default False
suffixes: tuple of (str, str), default (‘_x’, ‘_y’)
copy: bool, default True
indicator: bool or str, default False
validate: str, optional
Important: pd.DataFrame.merge requires right to be a pd.DataFrame or named pd.Series object.
Output
Returns: DataFrame
Furthermore, if we check the docstring, the merge operation in pandas is described as below:
Perform a database (SQL) merge operation between two DataFrame or Series
objects using either columns as keys or their row indexes
Concat
Referring to the documentation of pd.concat, first note that the parameter is not named table, data_frame, series, matrix, etc., but objs instead. That is, you can pass many "data containers", which are defined as:
Iterable[FrameOrSeriesUnion], Mapping[Optional[Hashable], FrameOrSeriesUnion]
Required Parameters
objs: a sequence or mapping of Series or DataFrame objects
Optional Parameters
axis: {0/’index’, 1/’columns’}, default 0
join: {‘inner’, ‘outer’}, default ‘outer’
ignore_index: bool, default False
keys: sequence, default None
levels: list of sequences, default None
names: list, default None
verify_integrity: bool, default False
sort: bool, default False
copy: bool, default True
Output
Returns: object, type of objs
Example
Code
import pandas as pd

v1 = pd.Series([1, 5, 9, 13])
v2 = pd.Series([10, 100, 1000, 10000])
v3 = pd.Series([0, 1, 2, 3])

df_left = pd.DataFrame({
    "v1": v1,
    "v2": v2,
    "v3": v3
})
df_right = pd.DataFrame({
    "v4": [5, 5, 5, 5],
    "v5": [3, 2, 1, 0]
})

# concatenating Series end to end gives one long Series (not used below)
df_concat = pd.concat([v1, v2, v3])

# performing both operations with default parameters
merge_result = df_left.merge(df_right, left_index=True, right_index=True)
concat_result = pd.concat([df_left, df_right], sort=False)
print(merge_result)
print('=' * 20)
print(concat_result)
Code Output
   v1     v2  v3  v4  v5
0   1     10   0   5   3
1   5    100   1   5   2
2   9   1000   2   5   1
3  13  10000   3   5   0
====================
     v1       v2   v3   v4   v5
0   1.0     10.0  0.0  NaN  NaN
1   5.0    100.0  1.0  NaN  NaN
2   9.0   1000.0  2.0  NaN  NaN
3  13.0  10000.0  3.0  NaN  NaN
0   NaN      NaN  NaN  5.0  3.0
1   NaN      NaN  NaN  5.0  2.0
2   NaN      NaN  NaN  5.0  1.0
3   NaN      NaN  NaN  5.0  0.0
You can, however, achieve the first output (merge) with concat by changing the axis parameter:
concat_result = pd.concat([df_left, df_right], sort=False, axis=1)
Observe the following behavior:
concat_result = pd.concat([df_left, df_right, df_left, df_right], sort=False)
outputs:
     v1       v2   v3   v4   v5
0   1.0     10.0  0.0  NaN  NaN
1   5.0    100.0  1.0  NaN  NaN
2   9.0   1000.0  2.0  NaN  NaN
3  13.0  10000.0  3.0  NaN  NaN
0   NaN      NaN  NaN  5.0  3.0
1   NaN      NaN  NaN  5.0  2.0
2   NaN      NaN  NaN  5.0  1.0
3   NaN      NaN  NaN  5.0  0.0
0   1.0     10.0  0.0  NaN  NaN
1   5.0    100.0  1.0  NaN  NaN
2   9.0   1000.0  2.0  NaN  NaN
3  13.0  10000.0  3.0  NaN  NaN
0   NaN      NaN  NaN  5.0  3.0
1   NaN      NaN  NaN  5.0  2.0
2   NaN      NaN  NaN  5.0  1.0
3   NaN      NaN  NaN  5.0  0.0
You cannot perform a similar operation with merge, since it only allows a single DataFrame or named Series:
merge_result = df_left.merge([df_right, df_left, df_right], left_index=True, right_index=True)
outputs:
TypeError: Can only merge Series or DataFrame objects, a <class 'list'> was passed
Conclusion
As you may have noticed already, the inputs and outputs differ between "merge" and "concat".
As I mentioned at the beginning, the very first (main) difference is that "merge" performs a more structured join with a restricted set of objects and parameters, whereas "concat" performs a less strict/broader join with a broader set of objects and parameters.
All in all, merge is less tolerant of changes in the input, and "concat" is looser/less sensitive to changes. You can achieve "merge" by using "concat", but the reverse is not always true.
The "merge" operation uses DataFrame columns (or the name of a pd.Series object) or row indices, and since it uses those entities only, it performs a horizontal merge of DataFrames or Series, and does not apply a vertical operation as a result.
If you want to see more, you can dive into the source code a bit:
Source code for concat
Source code for merge

Only the concat function has an axis parameter. Merge is used to combine dataframes side by side based on values in shared columns, so there is no need for an axis parameter.

by default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join
pd.concat:
takes Iterable arguments. Thus, it cannot take DataFrames directly (use a list, e.g. [df1, df2])
dimensions of the DataFrames should match along the axis
join and pd.merge:
can take DataFrame arguments
The following three calls do the same thing:
df1.join(df2)
pd.merge(df1, df2, left_index=True, right_index=True)
pd.concat([df1, df2], axis=1)

Related

interpolate(method="nearest") in a groupby in pandas

I have a dataset that I want to groupby("CustomerID") and fill NaNs with the nearest number within the group.
I can fill by nearest number regardless of group like this:
df['num'] = df['num'].interpolate(method="nearest")
When I tried:
df['num'] = df.groupby('CustomerID')['num'].transform(lambda x: x.interpolate(method="nearest"))
I got ValueError: x and y arrays must have at least 2 entries, which I assume is because
some customers only have one entry with NaN or only NaNs.
However, when I extracted a select few rows that should have worked and made a new dataframe, nothing happened.
Is there a way I can group by customerID and fill NaNs with nearest number within the group, and skip customers with only NaNs or just one observation?
I ran into the same "ValueError: x and y arrays must have at least 2 entries" in my code. Adapted to your code (which I obviously could not reproduce), here is how I solved the problem:
import pandas as pd
import numpy as np

df.loc[:, 'num'] = df.groupby('CustomerID')['num'].apply(
    lambda group: group.interpolate(method='nearest')
    if np.count_nonzero(np.isnan(group)) < (len(group) - 1)
    else group)
df.loc[:, 'num'] = df.groupby('CustomerID')['num'].apply(
    lambda group: group.interpolate(method='linear', limit_area='outside',
                                    limit_direction='both'))
It does the following:
The first "groupby + apply" interpolates each group with the method 'nearest' ONLY if the group has at least two non NaNs values.
np.isnan(group) returns an array containing True where group has NaNs and False where it has values.
np.count_nonzero(np.isnan(group)) returns the number of True in the previous array (i.e. the number of NaNs in the group).
If the number of NaNs is strictly smaller than the length of the group minus 1 (i.e. there are at least two non NaNs in the group), the group is interpolated using 'nearest', otherwise it is left untouched.
The second "groupby + apply" finishes to interpolate each group, using method='linear' and argument limit_direction='both'.
If a group was fully interpolated in the previous step: nothing
happens.
If a group had only one non NaN value (therefore was left
untouched in the previous step): The non NaN value will be used to
fill the entire group.
If a group had only NaNs (therefore was left untouched in the previous step): the group remains full of NaNs.
Here's a dummy example using your notations:
df = pd.DataFrame({'CustomerID': ['a']*3 + ['b']*3 + ['c']*3,
                   'num': [1, np.nan, 2, np.nan, 1, np.nan, np.nan, np.nan, np.nan]})
df
  CustomerID  num
0          a  1.0
1          a  NaN
2          a  2.0
3          b  NaN
4          b  1.0
5          b  NaN
6          c  NaN
7          c  NaN
8          c  NaN
df.loc[:, 'num'] = df.groupby('CustomerID')['num'].apply(
    lambda group: group.interpolate(method='nearest')
    if np.count_nonzero(np.isnan(group)) < (len(group) - 1)
    else group)
df
  CustomerID  num
0          a  1.0
1          a  1.0
2          a  2.0
3          b  NaN
4          b  1.0
5          b  NaN
6          c  NaN
7          c  NaN
8          c  NaN
df.loc[:, 'num'] = df.groupby('CustomerID')['num'].apply(
    lambda group: group.interpolate(method='linear', limit_area='outside',
                                    limit_direction='both'))
df
  CustomerID  num
0          a  1.0
1          a  1.0
2          a  2.0
3          b  1.0
4          b  1.0
5          b  1.0
6          c  NaN
7          c  NaN
8          c  NaN
EDIT: important note
The interpolate method 'nearest' uses the numerical values of the index (see documentation https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html). It works well in my dummy example above because the index is clean. If the index of your dataframe is messy (e.g. after concatenating dataframes) you may want to do df.reset_index(inplace=True) before you interpolate.
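A small sketch of that index dependence (a hypothetical series; imports as above):
s = pd.Series([1.0, np.nan, 2.0], index=[0, 9, 10])
s.interpolate(method='nearest')
# 0     1.0
# 9     2.0   <- the NaN at label 9 is filled from label 10, which is
# 10    2.0      numerically nearer than label 0
# dtype: float64
After s.reset_index(drop=True), the same gap would instead be measured on positions 0, 1, 2.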

Automatically reshape pandas DataFrame as columns of different lengths are added to existing DataFrame?

I have a DataFrame that looks like this:
When I try to add a list of values (of arbitrary length) to one of the columns I get an error:
mydf['a','curr(A)'] = [6,6,6,6,6]
or
mydf['a','curr(A)'] = [6,6]
gives the following error:
"ValueError: Length of values does not match length of index"
But this works:
mydf['a','curr(A)'] = [6,6,6]
How can I add an arbitrary number of entries to a column and pad the DataFrame with NaN's when necessary? Is there a parameter I can set when defining the DataFrame to do this padding automatically?
Thanks for your help.
I think the best way to do this would be something with concat:
df2 = pd.DataFrame({
    0: [1, 2, 3],
    1: [1, 2, 3],
    2: [4, 5, 6]
})
row = pd.Series([6, 6, 6, 6])
pd.concat([df2, row], axis=0, ignore_index=True)
Results:
   0    1    2
0  1  1.0  4.0
1  2  2.0  5.0
2  3  3.0  6.0
3  6  NaN  NaN
4  6  NaN  NaN
5  6  NaN  NaN
6  6  NaN  NaN
I don't think you are able to do this by just assigning the values to a column
Turn the sequence into another df (with the same column names) and then use .combine_first().
df_val = pd.DataFrame({('a', 'curr(A)'): [6, 6, 6]})
df_final = mydf.combine_first(df_val)
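Since combine_first takes the union of the two indexes, this also covers the longer-than-the-frame case from the question (a sketch, assuming mydf has 3 rows):
df_val = pd.DataFrame({('a', 'curr(A)'): [6, 6, 6, 6, 6]})
mydf.combine_first(df_val)  # rows 3 and 4 exist only in df_val, so mydf's
                            # other columns are padded with NaN there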
I found a workaround to solve my specific problem but it only works because I have all the columns I want in the dataframe ahead of time.
# 2 pairs of lists I want to use as column data.
mydf = pd.DataFrame([[1, 2], [3, 4], [5, 6, 7, 8, 9], [-3, 4, -5, 6, 12]])
mydf = mydf.transpose()  # Transpose to go from 4 rows to 4 columns.
# Create a multilevel index with 4 entries
multi_idx = pd.MultiIndex.from_product([['a', 'b'], ['curr(A)', 'volt(V)']])
for col in mydf.columns:  # loop through to rename each column
    mydf = mydf.rename(columns={col: multi_idx[col]})
It works, but it seems like there must be a simpler way to do this.
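For reference, a shorter equivalent to the rename loop (assuming the column order already matches the product order) is to assign the MultiIndex directly:
mydf.columns = pd.MultiIndex.from_product([['a', 'b'], ['curr(A)', 'volt(V)']])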
Thanks for your help everyone!

updating multiple columns in the pandas data frame based another table

I have 2 CSV files like this, and want to update the df1 columns (LL, UL) based on df2's (LL, UL) by matching the columns (test, Cond) in both dataframes.
df1:
test  Cond  day  mode  LL  UL
a     T1    Tue  7
b     T2    mon  7
c     T2    sun  6
d     T3    fri  3
c     T2    sat  6
d     T3    wed  3
df2:
test  Cond  LL    UL
a     T1    15    23
b     T2    -3    -3.5
c     T2    -19   -11
d     T3    6.5   14.5
My expected output is df1 with its LL and UL columns filled in from df2.
def SpecsLL(cond1, test1):
    if ((cond1 == spec['Cond']) & (test1 == spec['test'])):
        return df2['LL']

df1['LL'] = df1.apply(lambda x: SpecsLL(x['Cond'], x['test']), axis=1)
I have tried the above code but it is not working.
Any ideas on how to do it?
Simply use the merge functionality of pandas:
df1.merge(df2)
Method 1: combine_first
index_cols = ['test', 'Cond']
(
df1
.set_index(index_cols)
.combine_first(
df2.set_index(index_cols)
).reset_index()
)
Explanation:
set_index moves the specified columns to the index, indicating that each row should be identified by its test and Cond columns.
foo.combine_first(bar) will identify matching index + column labels between foo and bar, and fill in values from bar wherever foo is NaN or has a column/row missing. In this case, thanks to the set_index, the two dataframes will have their rows matched where test and Cond are the same, and then the UL and LL values from df2 will be filled in to the corresponding columns of the output.
reset_index simply reverses the set_index call, so that test and Cond become regular columns again.
Note that this operation might mangle the order of your columns, so if that is important to you then you can call .reindex(df1.columns, axis=1) at the very end, which will reorder the columns to original order in df1.
Method 2: merge
Alternatively you can use the merge method, which allows you to operate on the columns directly without using set_index, but will require some other preprocessing:
index_cols = ['test', 'Cond']
(
df1
.drop(['LL', 'UL'], axis=1)
.merge(
df2,
on=index_cols
)
)
The .drop call is necessary because otherwise merge will include the UL and LL columns from both DataFrames in the output:
  test Cond  day  mode  LL_x  UL_x  LL_y  UL_y
0    a   T1  Tue     7   NaN   NaN  15.0  23.0
1    b   T2  mon     7   NaN   NaN  -3.0  -3.5
2    c   T2  sun     6   NaN   NaN -19.0 -11.0
3    c   T2  sat     6   NaN   NaN -19.0 -11.0
4    d   T3  fri     3   NaN   NaN   6.5  14.5
5    d   T3  wed     3   NaN   NaN   6.5  14.5
Which to use?
With the data that you have provided, merge seems like the more natural operation - if you never expect UL and LL to have any data in df1, then if possible I'd recommend simply removing those column headers entirely from the input CSV, so that df1 doesn't have those columns at all. In that case, the drop call would no longer be necessary and the required merge call is very expressive.
However, if you expect that df1 would sometimes have real values for UL or LL, and you want to include those values in the output, then the combine_first solution is what you want. Note that if both df1 and df2 have different non-null values for a particular row/column, then the df1.combine_first(df2) will select the value from df1 and ignore the df2 value. If you instead wanted to prioritise the values from df2 then you want to call it the other way round, i.e. df2.combine_first(df1).
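A tiny sketch of that priority rule (hypothetical one-cell frames):
left = pd.DataFrame({'LL': [1.0]})
right = pd.DataFrame({'LL': [9.0]})
left.combine_first(right)   # keeps 1.0: the calling frame wins
right.combine_first(left)   # keeps 9.0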

Pandas merging dataframes and overwriting the data in the original df

I'm trying to merge two pandas dataframes but I can't figure out how to get the result I need. These are the example versions of dataframes I'm looking at:
df1 = pd.DataFrame([["09/10/2019", None], ["10/10/2019", None], ["11/10/2019", 6],
                    ["12/10/2019", 5], ["13/10/2019", 3], ["14/10/2019", 3],
                    ["15/10/2019", 5], ["16/10/2019", None]],
                   columns=['Date', 'A'])
df2 = pd.DataFrame([["10/10/2019", 3], ["11/10/2019", 5], ["12/10/2019", 6],
                    ["13/10/2019", 1], ["14/10/2019", 2], ["15/10/2019", 4]],
                   columns=['Date', 'A'])
I have checked the Pandas merging 101 but still can't find the way to do it correctly. Essentially what I need using the same graphics as in the guide is this:
i.e. I want to keep the data from df1 that falls outside the shared keys section, but within the shared area I want df2's data from column 'A' to overwrite df1's. I'm not even sure that merge is the right tool to use.
I've tried using df1 = pd.merge(df1, df2, how='right', on='Date') with different options, but in most cases it creates two separate columns - A_x and A_y in the output.
This is what I want to get as the end result:
         Date    A
0  09/10/2019  NaN
1  10/10/2019  3.0
2  11/10/2019  5.0
3  12/10/2019  6.0
4  13/10/2019  1.0
5  14/10/2019  2.0
6  15/10/2019  4.0
7  16/10/2019  NaN
Thanks in advance!
Here is a way using combine_first:
df2.set_index('Date').combine_first(df1.set_index('Date')).reset_index()
Or reindex_like:
df2.set_index('Date').reindex_like(df1.set_index('Date')).reset_index()
         Date    A
0  09/10/2019  NaN
1  10/10/2019  3.0
2  11/10/2019  5.0
3  12/10/2019  6.0
4  13/10/2019  1.0
5  14/10/2019  2.0
6  15/10/2019  4.0
7  16/10/2019  NaN

Index out of bounds when replacing NaNs through a function in Pandas

I have created a function that replaces the NaNs in a Pandas dataframe with the means of the respective columns. I tested the function with a small dataframe and it worked. When I applied it to a much larger dataframe (30,000 rows, 9 columns), though, I got the error message: IndexError: index out of bounds
The function is the following:
# The 'update' function will replace all the NaNs in a dataframe with the mean of the respective columns
def update(df):  # the function takes one argument, the dataframe that will be updated
    ncol = df.shape[1]  # number of columns in the dataframe
    for i in range(0, ncol):  # loops over all the columns
        # subsets the df using the isnull() method, extracting the positions
        # in each column where the values are missing
        df.iloc[:, i][df.isnull().iloc[:, i]] = df.mean()[i]
    return df
The small dataframe I used to test the function is the following:
     0     1   2   3
0  NaN   NaN   3   4
1  NaN   NaN   7   8
2  9.0  10.0  11  12
Could you explain the error? Your advice will be appreciated.
I would use DataFrame.fillna() method in conjunction with DataFrame.mean() method:
In [130]: df.fillna(df.mean())
Out[130]:
     0     1   2   3
0  9.0  10.0   3   4
1  9.0  10.0   7   8
2  9.0  10.0  11  12
Mean values:
In [138]: df.mean()
Out[138]:
0 9.0
1 10.0
2 7.0
3 8.0
dtype: float64
The reason you are getting "index out of bounds" is that you are assigning the value df.mean()[i], where i is one iteration of what are supposed to be ordinal positions. df.mean() is a Series whose index consists of the columns of df, so df.mean()[something] implies that something had better be a column label. Your loop counters aren't column labels, and that's why you get your error.
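A quick sketch of the distinction (a hypothetical frame with string column labels, as a real dataset would typically have):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
df.mean()          # a Series indexed by the column names:
# a    1.0
# b    3.5
# dtype: float64
df.mean()['a']     # label-based lookup: 1.0
df.mean().iloc[0]  # positional lookup: 1.0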
Your code... fixed:
def update(df):  # the function takes one argument, the dataframe that will be updated
    ncol = df.shape[1]  # number of columns in the dataframe
    for i in range(0, ncol):  # loops over all the columns
        # subsets the df using the isnull() method, extracting the positions
        # in each column where the values are missing;
        # df.mean().iloc[i] makes the lookup positional
        df.iloc[:, i][df.isnull().iloc[:, i]] = df.mean().iloc[i]
    return df
Also, your function is altering the df directly. You may want to be careful. I'm not sure that's what you intended.
All that said, I'd recommend another approach:
def update(df):
    return df.where(df.notnull(), df.mean(), axis=1)
You could use any number of methods to fill missing with the mean. I'd suggest using #MaxU's answer.
df.where
keeps the values of df where the first argument is True, and otherwise takes them from the second argument
df.where(df.notnull(), df.mean(), axis=1)
df.combine_first with awkward pandas broadcasting
df.combine_first(pd.DataFrame([df.mean()], df.index))
np.where
pd.DataFrame(
np.where(
df.notnull(), df.values,
np.nanmean(df.values, 0, keepdims=1)),
df.index, df.columns)
