How do I combine the values of two rows that share the same index label and have no overlapping non-missing values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
#input
0 1 2
a 1.0 2.0 3.0
b 4.0 NaN NaN
b NaN 5.0 6.0
Desired output
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0
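Use stack(), which drops all NaNs by default, followed by unstack() (this relies on that default NaN-dropping behaviour, which may differ in newer pandas versions):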
df.stack().unstack()
If possible, simplify the solution: to take the first non-missing value per index label, use GroupBy.first:
df1 = df.groupby(level=0).first()
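If summing gives the same output for your data (as it does for the sample), you can also sum per label. Note that df.sum(level=0) was removed in pandas 2.0, so group by the index level instead:
df1 = df.groupby(level=0).sum()
If there are multiple non-missing values per group, you need to specify the expected output; that case is obviously more complicated.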
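To make the difference between the two groupby variants concrete, here is a minimal sketch. The first frame is the sample from the question; the second frame, df_overlap, is a hypothetical example added only to show what happens when a group has more than one non-missing value in the same column.

```python
import pandas as pd

# Sample frame from the question: label 'b' appears twice with no overlap.
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, 5, 6]],
                  index=['a', 'b', 'b'])
first = df.groupby(level=0).first()   # first non-missing value per label
total = df.groupby(level=0).sum()     # sum per label, NaN is skipped

# Hypothetical frame: column 1 has two non-missing values for label 'b'.
df_overlap = pd.DataFrame([[4, 1, None], [None, 5, 6]], index=['b', 'b'])
first_overlap = df_overlap.groupby(level=0).first()  # keeps the first value
sum_overlap = df_overlap.groupby(level=0).sum()      # adds the values up

print(first)
print(first_overlap)
print(sum_overlap)
```

With no overlap both approaches agree; with overlap, first() keeps one value and sum() adds them, so the choice depends on the output you expect.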
I want to add a new column, Category, in each of my 8 similar dataframes.
The values in this column are all the same and equal the name of the dataframe, like df61_p8 in this example.
I have used:
In: df61_p8.insert(3,"Category","df61_p8", True)
# or simply, df61_p8['Category']='df61_p8'
Out:
code violation_description Category
89491 9-1-503 Defective or obstructed duct system one- building df61_p8
102045 9-1-503 Defective or obstructed duct system one- building df61_p8
103369 9-1-503 Defective or obstructed duct system one- building df61_p8
130440 9-1-502 Failure to maintain at least one (1) elevator df61_p8
132446 9-1-503 Defective or obstructed duct system one- building df61_p8
Ultimately, I want to append/concat these 8 dataframes into one dataframe.
I wonder if there is a more efficient way to do it than calling .insert on each dataframe one by one,
something like a loop or a lambda. As a beginner, I am not sure how to apply them in my case. Thank you.
append_alldfs = []
x=[df61_p1,df61_p2,df61_p3,df61_p4,df61_p5,df61_p6,df61_p7,df61_p8]
lambdafunc = lambda x: x.insert(3,"Category","x",True)
You can use pd.concat with keys parameter then reset_index:
pd.concat([df0,df1,df2,df3], keys=['df0', 'df1', 'df2', 'df3']).reset_index(level=0)
MCVE:
import numpy as np
import pandas as pd
df0 = pd.DataFrame(np.ones((3,3)), columns=[*'ABC'])
df1 = pd.DataFrame(np.zeros((3,3)), columns=[*'ABC'])
df2 = pd.DataFrame(np.zeros((3,3))+3, columns=[*'ABC'])
df3 = pd.DataFrame(np.zeros((3,3))+4, columns=[*'ABC'])
df_out = pd.concat([df0,df1,df2,df3], keys=['df0', 'df1', 'df2', 'df3']).reset_index(level=0)
df_out
Output:
level_0 A B C
0 df0 1.0 1.0 1.0
1 df0 1.0 1.0 1.0
2 df0 1.0 1.0 1.0
0 df1 0.0 0.0 0.0
1 df1 0.0 0.0 0.0
2 df1 0.0 0.0 0.0
0 df2 3.0 3.0 3.0
1 df2 3.0 3.0 3.0
2 df2 3.0 3.0 3.0
0 df3 4.0 4.0 4.0
1 df3 4.0 4.0 4.0
2 df3 4.0 4.0 4.0
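Applied to the question, the same idea can produce the Category column directly. This is only a sketch: the two small frames below are hypothetical stand-ins for the eight real dataframes, and it assumes their index is unnamed, so that reset_index creates a column called level_0.

```python
import pandas as pd

# Hypothetical stand-ins for the frames in the question (only two shown).
frames = {
    'df61_p1': pd.DataFrame({'code': ['9-1-503'],
                             'violation_description': ['duct']}),
    'df61_p2': pd.DataFrame({'code': ['9-1-502', '9-1-503'],
                             'violation_description': ['elevator', 'duct']}),
}

combined = (
    pd.concat(frames)                 # dict keys become the outer index level
      .reset_index(level=0)           # move that level into column 'level_0'
      .rename(columns={'level_0': 'Category'})
)
print(combined)
```

Passing a dict keeps each name next to its dataframe, which avoids maintaining a separate keys list in the same order.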
Keep it simple and explicit.
for col_val, df in [
    ('df61_p1', df61_p1),
    ('df61_p2', df61_p2),
    ('df61_p3', df61_p3),
    ('df61_p4', df61_p4),
    ('df61_p5', df61_p5),
    ('df61_p6', df61_p6),
    ('df61_p7', df61_p7),
    ('df61_p8', df61_p8),
]:
    df['Category'] = col_val
While there are certainly more 'meta-programming-ey' ways of accomplishing the same task, these are usually quite convoluted and more complicated to understand and refactor.
Given the structure of this code, however, I imagine that there are ways you could get rid of this problem before you even get to this point.
For example, at what point did those dataframes get split up? Perhaps by never using separate DataFrames in the first place [keep the original dataframe together/concat at beginning] (and using apply, groupby, pivot and melt operations as needed), you can avoid this problem altogether.
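def add_column(df, col_name, col_value):
    # insert() modifies df in place and returns None, and it does not accept a negative loc,
    # so append at position len(df.columns) and return the frame explicitly
    df.insert(loc=len(df.columns), column=col_name, value=col_value, allow_duplicates=False)
    return df
df_list = [........]
col_name = ...
col_value = .... # copy column (Category) values
res = map(lambda df: add_column(df, col_name, col_value), df_list)
list(res)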
I have a pandas data-frame with multiple features, where I would like to insert rows of nans corresponding to only the first feature. In other words, I would like to transform something like this:
into this:
As I will be dealing with large datasets, the speed is important.
For a general solution that puts missing values into all the other columns, build a new DataFrame with DataFrame.drop_duplicates, keep only the feature columns, and overwrite feat2 with a placeholder. After concat, every other column is filled with missing values in those rows. Finally, restore the order with DataFrame.sort_values:
df1 = df.drop_duplicates('feat1')[['feat1','feat2']].assign(feat2='-')
df2 = (pd.concat([df1, df], sort=False, ignore_index=True)
         .sort_values('feat1'))
print (df2)
feat1 feat2 var
0 A - NaN
3 A x 0.0
4 A y 1.0
5 A z 2.0
1 B - NaN
6 B x 3.0
7 B y 4.0
8 B z 5.0
2 C - NaN
9 C x 6.0
10 C y 7.0
11 C z 8.0
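Since the input frame is not shown in text form, here is a self-contained sketch of the same steps. The input df below is an assumption reconstructed from the printed output, and kind='stable' is added so that the placeholder row reliably stays ahead of its group after sorting.

```python
import pandas as pd

# Assumed input, reconstructed from the printed result above.
df = pd.DataFrame({
    'feat1': list('AAABBBCCC'),
    'feat2': list('xyz') * 3,
    'var': range(9),
})

# One placeholder row per feat1 value; 'var' is absent, so concat fills it with NaN.
header = df.drop_duplicates('feat1')[['feat1', 'feat2']].assign(feat2='-')

# Placeholder rows come first in the concat, so a stable sort keeps them on top of each group.
df2 = (pd.concat([header, df], sort=False, ignore_index=True)
         .sort_values('feat1', kind='stable'))
print(df2)
```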
Hi, I have two dataframes. The first one was created by grouping another df by id (which is now the index) and then sorting by the 'due' column.
df1:
paid due
id
3 13.000000 5.000000
2 437.000000 5.000000
5 90.000000 5.000000
1 60.000000 5.000000
4 675.000000 5.000000
The other one is a normal dataframe which has 3 columns: 'id' 'name' and 'country'.
df2:
id name country
1 'AB' 'DE'
2 'CD' 'DE'
3 'EF' 'NL'
4 'HAH' 'SG'
5 'NOP' 'NOR'
So what I was trying to do is to add the 'name' column to the 1st dataframe based on the id number (which is index in first df and column in second one).
So I thought this code would work:
pd.merge(df1, df2['name'], left_index=True, right_on='id')
But I get error
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
You can use rename to map the index through a Series, which works like a dict:
df1['name'] = df1.rename(index=df2.set_index('id')['name']).index
print (df1)
paid due name
id
3 13.0 5.0 'EF'
2 437.0 5.0 'CD'
5 90.0 5.0 'NOP'
1 60.0 5.0 'AB'
4 675.0 5.0 'HAH'
You might find that pd.concat is a better option here because it can accept a mix of dataframe and series: http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-with-mixed-ndims.
Okay so I figured out that I can't really get one column of dataframe in that way but I can remake df2 so that it contains only one needed column:
df2=df2[['id', 'name']]
pd.merge(df1, df2, left_index=True, right_on='id')
And there is no error anymore.
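Two further ways to attach the name column are sketched below, using the sample data from the question: DataFrame.join with a named Series indexed by id, and Index.map. The variable names (names, joined, mapped) are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'paid': [13.0, 437.0, 90.0, 60.0, 675.0], 'due': [5.0] * 5},
    index=pd.Index([3, 2, 5, 1, 4], name='id'),
)
df2 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['AB', 'CD', 'EF', 'HAH', 'NOP'],
    'country': ['DE', 'DE', 'NL', 'SG', 'NOR'],
})

# A Series of names indexed by id; it keeps the name 'name'.
names = df2.set_index('id')['name']

joined = df1.join(names)                         # left join on df1's index
mapped = df1.assign(name=df1.index.map(names))   # look up each id in the Series

print(joined)
```

Both keep the row order of df1, because the lookup is driven by df1's index.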
What's the essential difference(s) between pd.DataFrame.merge() and pd.concat()?
So far, this is what I found, please comment on how complete and accurate my understanding is:
.merge() can only use columns (plus row-indices) and it is semantically suitable for database-style operations. .concat() can be used with either axis, using only indices, and gives the option for adding a hierarchical index.
Incidentally, this allows for the following redundancy: both can combine two dataframes using the rows indices.
pd.DataFrame.join() merely offers a shorthand for a subset of the use cases of .merge()
(Pandas is great at addressing a very wide spectrum of use cases in data analysis. It can be a bit daunting exploring the documentation to figure out what is the best way to perform a particular task. )
A very high-level difference is that merge() combines two dataframes on the basis of the values of common columns (indices can also be used, with left_index=True and/or right_index=True), while concat() appends one or more dataframes one below the other (or side by side, depending on whether the axis option is set to 0 or 1).
join() merges two dataframes on the basis of the index; instead of using merge() with left_index=True and right_index=True, we can use join().
For example:
df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df1:
Key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})
df2:
Key data2
0 a 0
1 b 1
2 d 2
#Merge
# The 2 dataframes are merged on the basis of values in column "Key" as it is
# a common column in 2 dataframes
pd.merge(df1, df2)
Key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
#Concat
# df2 dataframe is appended at the bottom of df1
pd.concat([df1, df2])
Key data1 data2
0 b 0.0 NaN
1 b 1.0 NaN
2 a 2.0 NaN
3 c 3.0 NaN
4 a 4.0 NaN
5 a 5.0 NaN
6 b 6.0 NaN
0 a NaN 0.0
1 b NaN 1.0
2 d NaN 2.0
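The join() case mentioned above is not covered by the example, so here is a small sketch with the same two frames. It assumes we want a left join: join() looks up df1's Key column in the index of the other frame, which matches merge() with how='left'.

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})

# join() matches df1's 'Key' column against the index of the other frame.
joined = df1.join(df2.set_index('Key'), on='Key')

# The equivalent merge() call; how='left' keeps every row of df1.
merged = pd.merge(df1, df2, on='Key', how='left')

print(joined)
```

The row with Key 'c' has no match in df2, so its data2 value is NaN in both results.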
At a high level:
.concat() simply stacks multiple DataFrames together, either vertically, or horizontally after aligning on the index
.merge() first aligns two DataFrames on the selected common column(s) or index, and then picks up the remaining columns from the aligned rows of each DataFrame.
More specifically, .concat():
Is a top-level pandas function
Combines two or more pandas DataFrames vertically or horizontally
Aligns only on the index when combining horizontally
May raise an error when combining horizontally if any of the DataFrames contains a duplicate index
Defaults to outer join with the option for inner join
And .merge():
Exists both as a top-level pandas function and a DataFrame method (as of pandas 1.0)
Combines exactly two DataFrames horizontally
Aligns the calling DataFrame's column(s) or index with the other DataFrame's column(s) or index
Handles duplicate values on the joining columns or index by performing a cartesian product
Defaults to inner join with options for left, outer, and right
Note that when performing pd.merge(left, right), if left has two rows containing the same values from the joining columns or index, each row will combine with right's corresponding row(s) resulting in a cartesian product. On the other hand, if .concat() is used to combine columns, we need to make sure no duplicated index exists in either DataFrame.
Practically speaking:
Consider .concat() first when combining homogeneous DataFrames, and consider .merge() first when combining complementary DataFrames.
If you need to combine vertically, go with .concat(). If you need to combine horizontally via columns, go with .merge(), which by default merges on the columns in common.
Reference: Pandas 1.x Cookbook
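The point about duplicates can be sketched as follows. All frames here are hypothetical. The merge part shows the cartesian product; the concat part is wrapped in try/except because the exact exception type for a duplicated index differs between pandas versions, so treat that half as an assumption to verify on your own version.

```python
import pandas as pd

# Duplicate join keys on both sides: merge pairs every left row with every right row.
left = pd.DataFrame({'key': ['k', 'k'], 'l': [1, 2]})
right = pd.DataFrame({'key': ['k', 'k'], 'r': [3, 4]})
merged = pd.merge(left, right, on='key')   # 2 x 2 = 4 rows

# Duplicate index labels with non-identical indexes: horizontal concat cannot align them.
a = pd.DataFrame({'x': [1, 2]}, index=['k', 'k'])
b = pd.DataFrame({'y': [3, 4]}, index=['k', 'm'])
try:
    pd.concat([a, b], axis=1)
    concat_error = None
except Exception as exc:                   # exception type is version dependent
    concat_error = type(exc).__name__

print(len(merged), concat_error)
```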
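pd.concat takes an iterable as its argument, so DataFrames cannot be passed to it directly; wrap them in a list such as [df1, df2]. Also, the DataFrames should match along the other axis when concatenating; with the default outer join, labels that do not match are filled with NaN.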
pd.merge can take DataFrames as its argument, and is used to combine two DataFrames with same columns or index, which can't be done with pd.concat since it will show the repeated column in the DataFrame.
Whereas join can be used to join two DataFrames with different indices.
I am currently trying to understand the essential difference(s) between pd.DataFrame.merge() and pd.concat().
Nice question. The main difference:
pd.concat works on both axes.
The other difference is that pd.concat has only outer (default) and inner joins, while pd.DataFrame.merge() has left, right, outer, and inner (default) joins.
A third notable difference is that pd.DataFrame.merge() has the option to set the column suffixes when merging columns with the same name, while for pd.concat this is not possible.
With pd.concat you stack the rows of multiple dataframes by default (axis=0), and when you set axis=1 you mimic an index-based pd.DataFrame.merge().
Some useful examples of pd.concat:
df2=pd.concat([df]*2, ignore_index=True) #double the rows of a dataframe
df2=pd.concat([df, df.iloc[[0]]]) # add first row to the end
df3=pd.concat([df1,df2], join='inner', ignore_index=True) # concat two df's
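The suffix difference mentioned above can be illustrated with a short sketch. The frames and the suffix strings '_left' and '_right' are made up for the example.

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'val': [10, 20]})
right = pd.DataFrame({'key': [1, 2], 'val': [30, 40]})

# merge() renames the clashing 'val' columns using the given suffixes.
merged = left.merge(right, on='key', suffixes=('_left', '_right'))

# concat() along axis=1 keeps both sets of columns under their original names.
side_by_side = pd.concat([left, right], axis=1)

print(list(merged.columns))
print(list(side_by_side.columns))
```

After the concat, selecting side_by_side['val'] returns both columns, which is usually a sign that merge() with suffixes is the better fit.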
The main difference between merge and concat is that merge lets you perform a more structured "join" of tables, whereas concat is broader and less structured.
Merge
According to the documentation, pd.DataFrame.merge takes right as a required argument, so you can think of it as joining a left table and a right table according to some predefined, structured join operation. Note the definition of the parameter right.
Required Parameters
right: DataFrame or named Series
Optional Parameters
how: {‘left’, ‘right’, ‘outer’, ‘inner’} default ‘inner’
on: label or list
left_on: label or list, or array-like
right_on: label or list, or array-like
left_index: bool, default False
right_index: bool, default False
sort: bool, default False
suffixes: tuple of (str, str), default (‘_x’, ‘_y’)
copy: bool, default True
indicator: bool or str, default False
validate: str, optional
Important: pd.DataFrame.merge requires right to be a pd.DataFrame or named pd.Series object.
Output
Returns: DataFrame
Furthermore, the docstring for the merge operation in pandas reads:
Perform a database (SQL) merge operation between two DataFrame or Series
objects using either columns as keys or their row indexes
Concat
In the documentation of pd.concat, first note that the parameter is not named table, data_frame, series, matrix, etc., but objs. That is, you can pass many "data containers", which are defined as:
Iterable[FrameOrSeriesUnion], Mapping[Optional[Hashable], FrameOrSeriesUnion]
Required Parameters
objs: a sequence or mapping of Series or DataFrame objects
Optional Parameters
axis: {0/’index’, 1/’columns’}, default 0
join: {‘inner’, ‘outer’}, default ‘outer’
ignore_index: bool, default False
keys: sequence, default None
levels: list of sequences, default None
names: list, default None
verify_integrity: bool, default False
sort: bool, default False
copy: bool, default True
Output
Returns: object, type of objs
Example
Code
import pandas as pd
v1 = pd.Series([1, 5, 9, 13])
v2 = pd.Series([10, 100, 1000, 10000])
v3 = pd.Series([0, 1, 2, 3])
df_left = pd.DataFrame({
"v1": v1,
"v2": v2,
"v3": v3
})
df_right = pd.DataFrame({
"v4": [5, 5, 5, 5],
"v5": [3, 2, 1, 0]
})
df_concat = pd.concat([v1, v2, v3])
# Performing operations on default
merge_result = df_left.merge(df_right, left_index=True, right_index=True)
concat_result = pd.concat([df_left, df_right], sort=False)
print(merge_result)
print('='*20)
print(concat_result)
Code Output
v1 v2 v3 v4 v5
0 1 10 0 5 3
1 5 100 1 5 2
2 9 1000 2 5 1
3 13 10000 3 5 0
====================
v1 v2 v3 v4 v5
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
3 NaN NaN NaN 5.0 0.0
You can achieve, however, the first output (merge) with concat by changing the axis parameter
concat_result = pd.concat([df_left, df_right], sort=False, axis=1)
Observe the following behavior,
concat_result = pd.concat([df_left, df_right, df_left, df_right], sort=False)
outputs;
v1 v2 v3 v4 v5
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
3 NaN NaN NaN 5.0 0.0
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
3 NaN NaN NaN 5.0 0.0
You cannot perform a similar operation with merge, since it only accepts a single DataFrame or named Series.
merge_result = df_left.merge([df_right, df_left, df_right], left_index=True, right_index=True)
outputs;
TypeError: Can only merge Series or DataFrame objects, a <class 'list'> was passed
Conclusion
As you may have noticed already, the inputs and outputs can differ between "merge" and "concat".
As I mentioned at the beginning, the main difference is that "merge" performs a more structured join with a restricted set of objects and parameters, whereas "concat" performs a looser, broader join with a wider set of objects and parameters.
All in all, "merge" is stricter about its input, while "concat" is looser and less sensitive to it. You can often reproduce a "merge" result with "concat", but the reverse is not always true.
"Merge" uses DataFrame columns (or the name of a pd.Series object) or row indices, and since it uses only those entities it combines DataFrames or Series horizontally and never stacks them vertically.
If you want to see more, you can deep dive in the source code a bit;
Source code for concat
Source code for merge
Only the concat function has an axis parameter. Merge combines dataframes side by side based on values in shared columns, so it has no need for an axis parameter.
by default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join
pd.concat:
takes Iterable arguments. Thus, it cannot take DataFrames directly (use [df,df2])
Dimensions of DataFrame should match along axis
Join and pd.merge:
can take DataFrame arguments
When df1 and df2 share the same index and have no overlapping column names, the three calls below give the same result:
df1.join(df2)
pd.merge(df1, df2, left_index=True, right_index=True)
pd.concat([df1, df2], axis=1)
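A quick sketch to check the equivalence, and to show where it stops holding. The frames are hypothetical; df3 is added only to show that the three calls diverge once the indexes differ, because their default join types (left, inner, outer) differ.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]}, index=list('xyz'))
df2 = pd.DataFrame({'b': [4, 5, 6]}, index=list('xyz'))

joined = df1.join(df2)
merged = pd.merge(df1, df2, left_index=True, right_index=True)
concatenated = pd.concat([df1, df2], axis=1)
same = joined.equals(merged) and merged.equals(concatenated)

# Hypothetical frame with a partly different index.
df3 = pd.DataFrame({'b': [4, 5, 6]}, index=list('yzw'))
n_join = len(df1.join(df3))                                            # left join
n_merge = len(pd.merge(df1, df3, left_index=True, right_index=True))   # inner join
n_concat = len(pd.concat([df1, df3], axis=1))                          # outer join

print(same, n_join, n_merge, n_concat)
```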