Merge dataframes without duplicating rows in python pandas [duplicate]

This question already has answers here:
Pandas left join on duplicate keys but without increasing the number of columns
(2 answers)
Closed 4 years ago.
I'd like to combine two dataframes using their similar column 'A':
>>> df1
A B
0 I 1
1 I 2
2 II 3
>>> df2
A C
0 I 4
1 II 5
2 III 6
To do so I tried using:
merged = pd.merge(df1, df2, on='A', how='outer')
Which returned:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 4
2 II 3.0 5
3 III NaN 6
However, since df2 only contained one value for A == 'I', I do not want this value to be duplicated in the merged dataframe. Instead I would like the following output:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 NaN
2 II 3.0 5
3 III NaN 6
What is the best way to do this? I am new to python and still slightly confused with all the join/merge/concatenate/append operations.

Let us create a helper key g with groupby().cumcount(), so that repeated values of A get a sequence number and each row can only pair up once:
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1.merge(df2, how='outer').drop(columns='g')  # drop('g', 1) with a positional axis is deprecated
Out[62]:
A B C
0 I 1.0 4.0
1 I 2.0 NaN
2 II 3.0 5.0
3 III NaN 6.0
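If you'd rather not mutate df1 and df2 in place, the same trick can be written with assign, which adds the helper key on temporary copies. This is a sketch of the same approach, rebuilding the question's example data:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['I', 'I', 'II'], 'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': ['I', 'II', 'III'], 'C': [4, 5, 6]})

# assign() adds the per-key counter without modifying the originals;
# merge() then joins on both shared columns, A and g
merged = (
    df1.assign(g=df1.groupby('A').cumcount())
       .merge(df2.assign(g=df2.groupby('A').cumcount()), how='outer')
       .drop(columns='g')
)
print(merged)
```

The counter g is 0 for the first occurrence of each A value, 1 for the second, and so on, so each row in df2 matches at most one row in df1.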

Related

Combining partially overlapping data frames on both axes pandas [duplicate]

This question already has answers here:
Merging two data frames and keeping the extra rows from first df
(2 answers)
Closed 1 year ago.
What is the best way to combine multiple data frames that partially overlap on both axes?
I already came up with a workable solution, but I'm unsure whether it's the best way, or whether I should be doing it at all.
So I have the following frames:
df1 = pd.DataFrame([[6,6],[7,7]],columns=['a','b'])
print(df1)
a b
0 6 6
1 7 7
df2 = pd.DataFrame([[7,7],[8,8]],columns=['b','c'], index=[1,2])
print(df2)
b c
1 7 7
2 8 8
Basically the only overlapping data point is b1, and I'd like to obtain the following:
a b c
0 6.0 6 NaN
1 7.0 7 7.0
2 NaN 8 8.0
If I do a regular concat I end up either with a duplicate on the index or on the columns. Now, the workaround I found is the following:
dfc = pd.concat([df1,df2],axis=0)
dfc = dfc.groupby(dfc.index).mean()
print(dfc)
a b c
0 6.0 6 NaN
1 7.0 7 7.0
2 NaN 8 8.0
I wonder if there is a better way to do it and more in general if this is best practice when handling data.
I should also add that in my datasets the overlapping data is always an exact duplicate and I "should" never have different values if the indexes are the same.
Thank you!
You want to use combine_first:
df1.combine_first(df2)
output:
a b c
0 6.0 6 NaN
1 7.0 7 7.0
2 NaN 8 8.0
NB. if b1 is different in the two dataframes, this will take the value from df1
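To illustrate that note, here is a small sketch (my own example, not from the question) where the shared cell disagrees between the two frames:

```python
import pandas as pd

df1 = pd.DataFrame([[6, 6], [7, 7]], columns=['a', 'b'])
# same shape as the question's df2, but b at index 1 now disagrees with df1 (99 vs 7)
df2 = pd.DataFrame([[99, 7], [8, 8]], columns=['b', 'c'], index=[1, 2])

# combine_first fills holes in df1 with values from df2;
# where both frames have a value, df1 wins
out = df1.combine_first(df2)
print(out.loc[1, 'b'])  # 7.0, not 99.0
```

So combine_first is only safe here because, as the asker says, overlapping data is always an exact duplicate.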

Pandas combining dataframes based on column value

I am trying to turn multiple dataframes into a single one based on the values in the first column, but not every dataframe has the same values in the first column. Take this example:
df1:
A 4
B 6
C 8
df2:
A 7
B 4
F 3
full_df:
A 4 7
B 6 4
C 8
F 3
How do I do this using python and pandas?
You can use pandas merge with an outer join, joining on the shared first column ('first_column' is a placeholder for its actual name):
df1.merge(df2, on=['first_column'], how='outer')
You can use pd.concat, remembering to align indices:
res = pd.concat([df1.set_index(0), df2.set_index(0)], axis=1)
print(res)
1 1
A 4.0 7.0
B 6.0 4.0
C 8.0 NaN
F NaN 3.0
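The concat answer above can be run end to end as follows, assuming the frames have unnamed integer column labels (0 for the key, 1 for the value), as they would after reading the data with header=None:

```python
import pandas as pd

# frames with integer column labels; column 0 holds the key
df1 = pd.DataFrame([['A', 4], ['B', 6], ['C', 8]])
df2 = pd.DataFrame([['A', 7], ['B', 4], ['F', 3]])

# setting the key column as index lets concat align rows on it,
# producing NaN where a key is missing from one frame
res = pd.concat([df1.set_index(0), df2.set_index(0)], axis=1)
print(res)
```

Keys present in only one frame (C and F here) get NaN in the other frame's column, which also coerces those columns to float64.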

append two data frames with unequal columns

I am trying to append two dataframes in pandas which have different numbers of columns.
Example:
df1
A B
1 1
2 2
3 3
df2
A
4
5
Expected concatenated dataframe
df
A B
1 1
2 2
3 3
4 Null(or)0
5 Null(or)0
I am using
df1.append(df2) when the columns are the same, but I have no idea how to deal with an unequal number of columns.
How about pd.concat?
>>> pd.concat([df1,df2])
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
Also, df1.append(df2) still works (note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so prefer pd.concat on current versions):
>>> df1.append(df2)
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
From the docs of df.append:
Columns not in this frame are added as new columns.
Use concat to join the two frames and pass the additional argument ignore_index=True to reset the index; otherwise you might end up with indexes like 0 1 2 0 1. For additional information, refer to the docs:
df1 = pd.DataFrame({'A':[1,2,3], 'B':[1,2,3]})
df2 = pd.DataFrame({'A':[4,5]})
df = pd.concat([df1,df2],ignore_index=True)
df
Output:
without ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
0 4 NaN
1 5 NaN
with ignore_index = True :
A B
0 1 1.0
1 2 2.0
2 3 3.0
3 4 NaN
4 5 NaN

Generate a dataframe from list with different length [duplicate]

This question already has answers here:
Creating dataframe from a dictionary where entries have different lengths
(8 answers)
Closed 4 years ago.
Here I have many lists of different lengths, like a=[1,2,3] and b=[2,3].
I would like to generate a pd.DataFrame from them, padding NaN at the end of the shorter lists, like this:
a b
1 1 2
2 2 3
3 3 nan
Any good idea to help me do so?
Use
In [9]: pd.DataFrame({'a': pd.Series(a), 'b': pd.Series(b)})
Out[9]:
a b
0 1 2.0
1 2 3.0
2 3 NaN
Or,
In [10]: pd.DataFrame.from_dict({'a': a, 'b': b}, orient='index').T
Out[10]:
a b
0 1.0 2.0
1 2.0 3.0
2 3.0 NaN
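Since the question mentions many lists, the first approach generalizes with a dict comprehension that wraps each list in a Series (the list names here are hypothetical):

```python
import pandas as pd

# hypothetical lists of different lengths
lists = {'a': [1, 2, 3], 'b': [2, 3], 'c': [5]}

# wrapping each list in a Series lets pandas align on the index
# and pad the shorter lists with NaN
df = pd.DataFrame({k: pd.Series(v) for k, v in lists.items()})
print(df)
```

Columns that needed padding are coerced to float64 because of the NaN values.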

Pandas Dataframe merge without duplicating either side?

I often get tables containing similar information from different sources for "QC". Sometimes I want to put these two tables side by side and output them to Excel to show others, so we can resolve discrepancies. To do so I want a 'lazy' merge with pandas dataframes.
say, I have two tables:
df a:
n I II
0 a 1 2
1 a 3 4
2 b 5 6
3 c 9 9
df b:
n III IV
0 a 1 2
1 a 0 0
2 b 5 6
3 b 7 8
I want to have results like:
a merge b
n I II III IV
0 a 1 2 1 2
1 a 3 4
2 b 5 6 5 6
3 b 7 8
4 c 9 9
of course this is what I got with merge():
a.merge(b, how='outer', on="n")
n I II III IV
0 a 1 2 1.0 2.0
1 a 1 2 0.0 0.0
2 a 3 4 1.0 2.0
3 a 3 4 0.0 0.0
4 b 5 6 5.0 6.0
5 b 5 6 7.0 8.0
6 c 9 9 NaN NaN
I feel there must be an easy way to do that, but all my solution were convoluted.
Is there a parameter in merge or concat for something like "no_copy"?
It doesn't look like you can do it with the information given alone; you need to introduce a cumulative count column to add to the merge keys. Consider this solution:
>>> import pandas
>>> dfa = pandas.DataFrame( {'n':['a','a','b','c'] , 'I' : [1,3,5,9] , 'II':[2,4,6,9]}, columns=['n','I','II'])
>>> dfb = pandas.DataFrame( {'n':['a','b','b'] , 'III' : [1,5,7] , 'IV':[2,6,8] }, columns=['n','III','IV'])
>>>
>>> dfa['nCC'] = dfa.groupby( 'n' ).cumcount()
>>> dfb['nCC'] = dfb.groupby( 'n' ).cumcount()
>>> dm = dfa.merge(dfb, how='outer', on=['n','nCC'] )
>>>
>>>
>>> dfa
n I II nCC
0 a 1 2 0
1 a 3 4 1
2 b 5 6 0
3 c 9 9 0
>>> dfb
n III IV nCC
0 a 1 2 0
1 b 5 6 0
2 b 7 8 1
>>> dm
n I II nCC III IV
0 a 1.0 2.0 0 1.0 2.0
1 a 3.0 4.0 1 NaN NaN
2 b 5.0 6.0 0 5.0 6.0
3 c 9.0 9.0 0 NaN NaN
4 b NaN NaN 1 7.0 8.0
>>>
It has the gaps (no duplication) where you want them, although the row order isn't quite identical to your desired output. Because NaNs are involved, the numeric columns get coerced to float64.
Adding the cumulative count essentially forces rows to pair up one-to-one across both sides: the first occurrence of a given key matches the first occurrence of that key on the other side, the second matches the second, and so on for every key.
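Since the answer notes the row order differs from the desired output, one way to tidy it up is to sort on the key and the counter before dropping the helper column. A sketch building on the same dfa/dfb data:

```python
import pandas as pd

dfa = pd.DataFrame({'n': ['a', 'a', 'b', 'c'], 'I': [1, 3, 5, 9], 'II': [2, 4, 6, 9]})
dfb = pd.DataFrame({'n': ['a', 'b', 'b'], 'III': [1, 5, 7], 'IV': [2, 6, 8]})

dfa['nCC'] = dfa.groupby('n').cumcount()
dfb['nCC'] = dfb.groupby('n').cumcount()
dm = dfa.merge(dfb, how='outer', on=['n', 'nCC'])

# sort on the key and counter so each n's rows sit together,
# then drop the helper column and renumber the index
tidy = dm.sort_values(['n', 'nCC']).drop(columns='nCC').reset_index(drop=True)
print(tidy)
```

This produces the a, a, b, b, c row order from the question, with NaN gaps where one side had no matching row.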
