Combining partially overlapping data frames on both axes pandas [duplicate] - python

This question already has answers here:
Merging two data frames and keeping the extra rows from first df
(2 answers)
Closed 1 year ago.
What is the best way to combine multiple data frames that partially overlap on both axes?
I already came up with a workable solution, but I'm not sure it's the best way, or whether I should be doing this at all.
So I have the following frames:
df1 = pd.DataFrame([[6,6],[7,7]],columns=['a','b'])
print(df1)
a b
0 6 6
1 7 7
df2 = pd.DataFrame([[7,7],[8,8]],columns=['b','c'], index=[1,2])
print(df2)
b c
1 7 7
2 8 8
Basically, the only overlapping data point is b1 (column b at index 1),
and I'd like to obtain the following:
a b c
0 6.0 6 NaN
1 7.0 7 7.0
2 NaN 8 8.0
If I do a regular concat I end up either with a duplicate on the index or on the columns. Now, the workaround I found is the following:
dfc = pd.concat([df1,df2],axis=0)
dfc = dfc.groupby(dfc.index).mean()
print(dfc)
a b c
0 6.0 6 NaN
1 7.0 7 7.0
2 NaN 8 8.0
I wonder if there is a better way to do it and, more generally, whether this is best practice when handling data.
I should also add that in my datasets the overlapping data is always an exact duplicate and I "should" never have different values if the indexes are the same.
Thank you!

You want to use combine_first:
df1.combine_first(df2)
output:
a b c
0 6.0 6 NaN
1 7.0 7 7.0
2 NaN 8 8.0
NB. If b1 differs between the two dataframes, this will take the (non-null) value from df1.
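Since the question says overlapping cells "should" always be identical, here is a minimal sketch that runs combine_first on the frames from the question and also checks that assumption. The conflict check and the names `combined`, `conflicts` and `has_conflicts` are my own additions for illustration, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame([[6, 6], [7, 7]], columns=['a', 'b'])
df2 = pd.DataFrame([[7, 7], [8, 8]], columns=['b', 'c'], index=[1, 2])

# Union of both indexes and both column sets; df1 wins where both have a value.
combined = df1.combine_first(df2)

# Optional sanity check on the cells that both frames define.
rows = df1.index.intersection(df2.index)
cols = df1.columns.intersection(df2.columns)
left = df1.loc[rows, cols]
right = df2.loc[rows, cols]
conflicts = (left != right) & left.notna() & right.notna()
has_conflicts = bool(conflicts.any().any())

print(combined)
print("conflicting overlap:", has_conflicts)
```

If `has_conflicts` ever comes back True, the duplicates are not exact and silently preferring df1 may hide a data problem.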

Related

Adding particular rows and changing their positions

I have an example(left one in image description).
There are several indexes in the first column. However, in the third repeat of the repeating characters, one entry ('GG') is missing. This is not limited to the third repeat, because I have data from more than a thousand repeating intervals.
Question: I want to add the missing rows (like 'GG') with the value 'NaN'.
I want to display the values in different columns based on the characters of the repeated section (from 'II' to '//\n').
Is there any way to do this?
Assuming your data is a dataframe with all data as columns (if not, you first need to reset_index):
(df
 #.reset_index()  # uncomment if first column is index
 .assign(cols=df.groupby('col1').cumcount())
 .pivot(index='col1', columns='cols', values='col2')
)
output:
cols 0 1 2
col1
A 0.0 4.0 7.0
B 1.0 5.0 8.0
C 2.0 9.0 NaN
D 3.0 6.0 10.0
input:
col1 col2
0 A 0
1 B 1
2 C 2
3 D 3
4 A 4
5 B 5
6 D 6
7 A 7
8 B 8
9 C 9
10 D 10
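For reference, a self-contained version of the above that rebuilds the sample input and applies the same cumcount + pivot steps. The column names `col1`/`col2` are the placeholders used in this answer, and `wide` is just an illustrative name.

```python
import pandas as pd

# Sample input from above: 'C' is missing from the second repeat.
df = pd.DataFrame({
    'col1': list('ABCDABDABCD'),
    'col2': range(11),
})

# cumcount numbers each occurrence of a key (0, 1, 2, ...);
# pivot then spreads those occurrences across columns.
wide = (df
        .assign(cols=df.groupby('col1').cumcount())
        .pivot(index='col1', columns='cols', values='col2'))

print(wide)
```

Note that cumcount numbers occurrences per key, so the NaN for 'C' lands in the last column rather than in the repeat where the row was actually missing.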

Efficiently combine dataframes on 2nd level index

I have two dataframes looking like
import pandas as pd
df1 = pd.DataFrame([2.1,4.2,6.3,8.4,10.5], index=[2,4,6,8,10])
df1.index.name = 't'
df2 = pd.DataFrame(index=pd.MultiIndex.from_tuples([('A','a',1),('A','a',4),
('A','b',5),('A','b',6),('B','c',7),
('B','c',9),('B','d',10),('B','d',11),
], names=('big', 'small', 't')))
I am searching for an efficient way to combine them such that I get
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
I.e. I want to get the index levels 0 and 1 of df2 as index levels 0 and 1 in df1.
Of course a loop over the dataframe would work as well, though not feasible for large dataframes.
EDIT:
It appears from the comments below that I should add: the indices big and small should be inferred for each t in df1 based on the ordering of t.
Assuming that you want the unknown index levels to be inferred based on the ordering of 't', we can use an outer merge, sort the values, and then re-create the MultiIndex using ffill logic (this needs a Series).
res = (df2.reset_index()
          .merge(df1, on='t', how='outer')
          .set_index(df2.index.names)
          .sort_index(level='t'))
res.index = pd.MultiIndex.from_arrays(
    [pd.Series(res.index.get_level_values(i)).ffill()
     for i in range(res.index.nlevels)],
    names=res.index.names)
print(res)
0
big small t
A a 1 NaN
2 2.1
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
8 8.4
9 NaN
d 10 10.5
11 NaN
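A variant of the same idea, as a sketch only: do the forward fill on ordinary columns before setting the index, which avoids rebuilding the MultiIndex afterwards. It assumes, as above, that a new 't' belongs to the group of the closest smaller 't'. The name `flat` is just illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([2.1, 4.2, 6.3, 8.4, 10.5], index=[2, 4, 6, 8, 10])
df1.index.name = 't'
df2 = pd.DataFrame(index=pd.MultiIndex.from_tuples(
    [('A', 'a', 1), ('A', 'a', 4), ('A', 'b', 5), ('A', 'b', 6),
     ('B', 'c', 7), ('B', 'c', 9), ('B', 'd', 10), ('B', 'd', 11)],
    names=('big', 'small', 't')))

# Outer merge on 't' keeps every t from both frames, then order by t.
flat = (df2.reset_index()
        .merge(df1.reset_index(), on='t', how='outer')
        .sort_values('t'))

# Rows that came only from df1 have no big/small yet: inherit from the row above.
flat[['big', 'small']] = flat[['big', 'small']].ffill()

res = flat.set_index(['big', 'small', 't'])
print(res)
```

A 't' smaller than every 't' in df2 would have nothing to inherit from and would keep NaN in big/small.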
Try extracting the level values and reindex:
df2['0'] = df1.reindex(df2.index.get_level_values('t'))[0].values
Output:
0
big small t
A a 1 NaN
4 4.2
b 5 NaN
6 6.3
B c 7 NaN
9 NaN
d 10 10.5
11 NaN
For more columns in df1, we can just merge:
(df2.reset_index()
.merge(df1, on='t', how='left')
.set_index(df2.index.names)
)
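To see that the two approaches in this answer agree, here is a small self-contained check on the frames from the question. The comparison and the names `via_reindex`, `via_merge` and `same` are my own additions. Both variants keep only the rows already present in df2, so t=2 and t=8 do not appear.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([2.1, 4.2, 6.3, 8.4, 10.5], index=[2, 4, 6, 8, 10])
df1.index.name = 't'
df2 = pd.DataFrame(index=pd.MultiIndex.from_tuples(
    [('A', 'a', 1), ('A', 'a', 4), ('A', 'b', 5), ('A', 'b', 6),
     ('B', 'c', 7), ('B', 'c', 9), ('B', 'd', 10), ('B', 'd', 11)],
    names=('big', 'small', 't')))

# Approach 1: look up each 't' of df2 in df1 (NaN where df1 has no such t).
via_reindex = df1[0].reindex(df2.index.get_level_values('t')).to_numpy()

# Approach 2: left merge on 't', then restore the MultiIndex.
via_merge = (df2.reset_index()
             .merge(df1.reset_index(), on='t', how='left')
             .set_index(['big', 'small', 't']))

same = np.allclose(via_reindex, via_merge[0].to_numpy(), equal_nan=True)
print(via_merge)
print("approaches agree:", same)
```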

Pandas combining dataframes based on column value

I am trying to turn multiple dataframes into a single one based on the values in the first column, but not every dataframe has the same values in the first column. Take this example:
df1:
A 4
B 6
C 8
df2:
A 7
B 4
F 3
full_df:
A 4 7
B 6 4
C 8
F 3
How do I do this using python and pandas?
You can use pandas merge with an outer join (replace 'first_column' with the name of your key column):
df1.merge(df2, on=['first_column'], how='outer')
You can use pd.concat, remembering to align indices:
res = pd.concat([df1.set_index(0), df2.set_index(0)], axis=1)
print(res)
1 1
A 4.0 7.0
B 6.0 4.0
C 8.0 NaN
F NaN 3.0
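A runnable sketch of the outer merge on the example data. The question does not give column names, so `key`, `val1` and `val2` are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical column names; the question shows the data without headers.
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'val1': [4, 6, 8]})
df2 = pd.DataFrame({'key': ['A', 'B', 'F'], 'val2': [7, 4, 3]})

# how='outer' keeps keys that appear in either frame; missing cells become NaN.
full_df = df1.merge(df2, on='key', how='outer')
print(full_df)
```

With more than two frames, something like `functools.reduce` over the same merge call should work, provided each frame has its own value column name.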

Generate a dataframe from list with different length [duplicate]

This question already has answers here:
Creating dataframe from a dictionary where entries have different lengths
(8 answers)
Closed 4 years ago.
I have many lists of different lengths, like a=[1,2,3] and b=[2,3].
I would like to generate a pd.DataFrame from them, padding the shorter lists with NaN at the end, like this:
a b
1 1 2
2 2 3
3 3 nan
Any good idea to help me do so?
Use
In [9]: pd.DataFrame({'a': pd.Series(a), 'b': pd.Series(b)})
Out[9]:
a b
0 1 2.0
1 2 3.0
2 3 NaN
Or,
In [10]: pd.DataFrame.from_dict({'a': a, 'b': b}, orient='index').T
Out[10]:
a b
0 1.0 2.0
1 2.0 3.0
2 3.0 NaN
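Since the question mentions many lists, the first approach generalizes with a dict comprehension. A minimal sketch, where the list `c` and the names `lists` and `padded` are made up for illustration:

```python
import pandas as pd

# 'a' and 'b' are from the question; 'c' is an extra hypothetical list.
lists = {'a': [1, 2, 3], 'b': [2, 3], 'c': [5]}

# Each Series gets a 0..n-1 index, and the DataFrame constructor aligns them,
# so shorter lists are padded with NaN at the end.
padded = pd.DataFrame({name: pd.Series(values) for name, values in lists.items()})
print(padded)
```

Columns that receive padding are upcast to float, because NaN cannot be stored in a plain integer column.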

Merge dataframes without duplicating rows in python pandas [duplicate]

This question already has answers here:
Pandas left join on duplicate keys but without increasing the number of columns
(2 answers)
Closed 4 years ago.
I'd like to combine two dataframes using their shared column 'A':
>>> df1
A B
0 I 1
1 I 2
2 II 3
>>> df2
A C
0 I 4
1 II 5
2 III 6
To do so I tried using:
merged = pd.merge(df1, df2, on='A', how='outer')
Which returned:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 4
2 II 3.0 5
3 III NaN 6
However, since df2 only contained one value for A == 'I', I do not want this value to be duplicated in the merged dataframe. Instead I would like the following output:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 NaN
2 II 3.0 5
3 III NaN 6
What is the best way to do this? I am new to python and still slightly confused with all the join/merge/concatenate/append operations.
Let us create a helper key g with cumcount, which numbers the occurrences of each value of A, so the merge matches on both A and g:
df1['g']=df1.groupby('A').cumcount()
df2['g']=df2.groupby('A').cumcount()
df1.merge(df2, how='outer').drop(columns='g')
Out[62]:
A B C
0 I 1.0 4.0
1 I 2.0 NaN
2 II 3.0 5.0
3 III NaN 6.0
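A self-contained version of the same technique that leaves df1 and df2 unmodified by using assign instead of adding the helper column in place. Naming the merge keys explicitly with `on=['A', 'g']` is my own choice for readability. The original relies on the default of merging on all shared columns.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['I', 'I', 'II'], 'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': ['I', 'II', 'III'], 'C': [4, 5, 6]})

# g numbers the occurrences of each A value (0 for the first, 1 for the second, ...),
# so the single 'I' row in df2 only matches the first 'I' row in df1.
left = df1.assign(g=df1.groupby('A').cumcount())
right = df2.assign(g=df2.groupby('A').cumcount())

merged = (left
          .merge(right, on=['A', 'g'], how='outer')
          .drop(columns='g'))
print(merged)
```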
