dataframe merge/join on rows with partial index overlap - python

I have the following 2 dataframes:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']},
index=[0, 1, 2])
and
df2 = pd.DataFrame({'E': ['E2', 'E3', 'E4'],
'F': ['F2', 'F3', 'F4']},
index=[2, 3, 4])
As you can see df1 and df2 only have index 2 as an overlap.
I want to combine these two dataframes in such a way that the end result is:
df3 = pd.DataFrame({'A': ['A0', 'A1', 'A2', np.nan, np.nan],
'B': ['B0', 'B1', 'B2', np.nan, np.nan],
'C': ['C0', 'C1', 'C2', np.nan, np.nan],
'D': ['D0', 'D1', 'D2', np.nan, np.nan],
'E': [np.nan, np.nan, 'E2', 'E3', 'E4'],
'F': [np.nan, np.nan, 'F2', 'F3', 'F4']},
index=[0, 1, 2, 3, 4])

Use combine_first:
df1.combine_first(df2)
Out:
A B C D E F
0 A0 B0 C0 D0 NaN NaN
1 A1 B1 C1 D1 NaN NaN
2 A2 B2 C2 D2 E2 F2
3 NaN NaN NaN NaN E3 F3
4 NaN NaN NaN NaN E4 F4

You can use concat with axis=1:
pd.concat([df1,df2],axis=1)
A B C D E F
0 A0 B0 C0 D0 NaN NaN
1 A1 B1 C1 D1 NaN NaN
2 A2 B2 C2 D2 E2 F2
3 NaN NaN NaN NaN E3 F3
4 NaN NaN NaN NaN E4 F4
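Both answers above align the two frames on their index. A third option that should give the same table is DataFrame.join with how='outer'. This is a minimal sketch, assuming the two frames share no column names (join raises an error on overlapping columns unless suffixes are supplied); the name `joined` is only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']},
                   index=[0, 1, 2])
df2 = pd.DataFrame({'E': ['E2', 'E3', 'E4'],
                    'F': ['F2', 'F3', 'F4']},
                   index=[2, 3, 4])

# Outer join on the index: keeps every index label found in either frame
# and fills the cells that have no source value with NaN.
joined = df1.join(df2, how='outer')
print(joined)
```

If the frames did share column names, combine_first would merge them cell by cell, whereas concat with axis=1 would produce duplicate column labels.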


pd.update with two matching rows

I am trying to fill in the empty columns of df1 from df2, which is created in a while-loop that requests data through an API. I want to keep all rows of df1 and their order.
df1:
df = pd.DataFrame({'A': ['c1', 'c1', 'c2','c2', 'c3', 'c3'], 'B': ['y1', 'y2', 'y1', 'y2', 'y1', 'y2'], 'C': ["","","","","",""], 'D': ["","","","","",""]})
A B C D
0 c1 y1
1 c1 y2
2 c2 y1
3 c2 y2
4 c3 y1
5 c3 y2
df2:
values_for_df = pd.DataFrame({'A': ['c1', 'c1', 'c2', 'c3'], 'B': ['y1', 'y2', 'y1', 'y2'], 'C': [4, 5, 4, 6], 'D': [7, 8, 9,""]})
A B C D
0 c1 y1 4 7
1 c1 y2 5 8
2 c2 y1 4 9
3 c3 y2 6
Output:
A B C D
0 c1 y1 4 7
1 c1 y2 5 8
2 c2 y1 4 9
3 c3 y2 6
4 c3 y1
5 c3 y2
Wanted output:
A B C D
0 c1 y1 4 7
1 c1 y2 5 8
2 c2 y1 4 9
3 c2 y2
4 c3 y1
5 c3 y2 6
This process will be repeated thousands of times. Can someone help me with this, share ideas or alternative approaches, or explain why the actual output differs from my expected output?
Try:
df = df.set_index(['A','B'])
values_for_df = values_for_df.set_index(['A','B'])
df.update(values_for_df, filter_func=lambda x: x=='')
df.reset_index()
A B C D
0 c1 y1 4 7
1 c1 y2 5 8
2 c2 y1 4 9
3 c2 y2
4 c3 y1
5 c3 y2 6
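As for why the original output differed: update matches rows by index label, not by the contents of columns A and B. The sketch below contrasts the two behaviours. It is only an illustration: dtype=object is an assumption made here so that strings and numbers can share a column, the names `by_position` and `by_key` are hypothetical, and filter_func is left out for brevity.

```python
import pandas as pd

# dtype=object keeps the mix of strings and numbers in one column
# (an assumption of this sketch, not part of the question).
df = pd.DataFrame({'A': ['c1', 'c1', 'c2', 'c2', 'c3', 'c3'],
                   'B': ['y1', 'y2', 'y1', 'y2', 'y1', 'y2'],
                   'C': [''] * 6,
                   'D': [''] * 6}, dtype=object)
values_for_df = pd.DataFrame({'A': ['c1', 'c1', 'c2', 'c3'],
                              'B': ['y1', 'y2', 'y1', 'y2'],
                              'C': [4, 5, 4, 6],
                              'D': [7, 8, 9, '']}, dtype=object)

# Plain update matches rows by index label (0..3), so row 3 of df is
# overwritten with the ('c3', 'y2') row even though its key is ('c2', 'y2').
by_position = df.copy()
by_position.update(values_for_df)

# With (A, B) as the index, rows are matched by key instead.
by_key = df.set_index(['A', 'B'])
by_key.update(values_for_df.set_index(['A', 'B']))
by_key = by_key.reset_index()

print(by_position)
print(by_key)
```

Without filter_func, update would also overwrite cells in df that already hold a value; the filter in the answer above restricts the update to cells that are still empty strings.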

Join two DataFrames by index and columns

I'm trying to join two DataFrames by index that can contain columns in common and I only want to add one to the other if that specific value is NaN or doesn't exist. I'm using the pandas example, so I've got:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
as
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
and
df4 = pd.DataFrame({'B': ['B2p', 'B3p', 'B6p', 'B7p'],
'D': ['D2p', 'D3p', 'D6p', 'D7p'],
'F': ['F2p', 'F3p', 'F6p', 'F7p']},
index=[2, 3, 6, 7])
as
B D F
2 B2p D2p F2p
3 B3p D3p F3p
6 B6p D6p F6p
7 B7p D7p F7p
and the desired result is:
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 F2p
3 A3 B3 C3 D3 F3p
6 NaN B6p NaN D6p F6p
7 NaN B7p NaN D7p F7p
This is a good use case for combine_first, where the row and column indices of the resulting dataframe will be the union of the two, i.e. where an index label or column is absent from one of the dataframes, the value from the other is used (the same behaviour as if it contained a NaN):
df1.combine_first(df4)
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 F2p
3 A3 B3 C3 D3 F3p
6 NaN B6p NaN D6p F6p
7 NaN B7p NaN D7p F7p
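The priority rule can be checked by swapping the two frames: the frame that combine_first is called on keeps its non-null cells, and the argument only fills the gaps. A short sketch (the names `keep_df1` and `keep_df4` are only for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2p', 'B3p', 'B6p', 'B7p'],
                    'D': ['D2p', 'D3p', 'D6p', 'D7p'],
                    'F': ['F2p', 'F3p', 'F6p', 'F7p']},
                   index=[2, 3, 6, 7])

# df1 is the caller, so its non-null cells win wherever both frames have a value.
keep_df1 = df1.combine_first(df4)
# Swapping the roles gives df4 priority on the overlapping cells.
keep_df4 = df4.combine_first(df1)

print(keep_df1.loc[2, 'B'])  # B2
print(keep_df4.loc[2, 'B'])  # B2p
```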

Pandas: Reducing multi-index to highest available level

I have a dataframe of the following type, with values grouped by three different categories A, B, C:
import pandas as pd
A = ['A1', 'A2', 'A3', 'A2', 'A1']
B = ['B3', 'B2', 'B2', 'B1', 'B3']
C = ['C2', 'C2', 'C3', 'C1', 'C3']
value = ['6','2','3','3','5']
df = pd.DataFrame({'categA': A,'categB': B, 'categC': C, 'value': value})
df
Which looks like:
categA categB categC value
0 A1 B3 C2 6
1 A2 B2 C2 2
2 A3 B2 C3 3
3 A2 B1 C1 3
4 A1 B3 C3 5
Now, when I want to unstack this df by the C category, .unstack() returns some multi-indexed dataframe with 'value' at the first level and my categories of interest C1, C2 & C3 at the second level:
df = df.set_index(['categA','categB','categC']).unstack('categC')
df
Output:
value
categC C1 C2 C3
categA categB
A1 B3 NaN 6 5
A2 B1 3 NaN NaN
B2 NaN 2 NaN
A3 B2 NaN NaN 3
Is there a quick and clean way to get rid of the multi-index by reducing it to the highest available level? This is what I'd like to have as output:
categA categB C1 C2 C3
A1 B3 NaN 6 5
A2 B1 3 NaN NaN
B2 NaN 2 NaN
A3 B2 NaN NaN 3
Many thanks in advance!
Edit:
print(df.reset_index())
gives:
categA categB value
categC C1 C2 C3
0 A1 B3 NaN 6 5
1 A2 B1 3 NaN NaN
2 A2 B2 NaN 2 NaN
3 A3 B2 NaN NaN 3
Select the value column as a Series before unstacking, so no extra column level is created, then add reset_index:
df.set_index(['categA','categB','categC']).value.unstack('categC').reset_index()
Out[875]:
categC categA categB C1 C2 C3
0 A1 B3 None 6 5
1 A2 B1 3 None None
2 A2 B2 None 2 None
3 A3 B2 None None 3
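If the frame has already been unstacked with the extra 'value' level, an alternative is to drop that level afterwards. This is a sketch rather than the answerer's method; the names `wide` and `flat` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'categA': ['A1', 'A2', 'A3', 'A2', 'A1'],
                   'categB': ['B3', 'B2', 'B2', 'B1', 'B3'],
                   'categC': ['C2', 'C2', 'C3', 'C1', 'C3'],
                   'value': ['6', '2', '3', '3', '5']})

# Unstacking the whole frame yields two column levels: 'value' and categC.
wide = df.set_index(['categA', 'categB', 'categC']).unstack('categC')

flat = (wide.droplevel(0, axis=1)        # drop the outer 'value' level
            .rename_axis(columns=None)   # remove the leftover 'categC' label
            .reset_index())              # turn categA / categB back into columns
print(flat)
```

The rename_axis call is optional; it only removes the 'categC' name that otherwise appears above the index in the printed output.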

Concatenate DataFrames in vertical and horizontal at the same time

I would like to concatenate several dataframes in both directions at the same time.
That is, if an index label does not exist, it is created.
And if a column does not exist, it is created as well.
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4'],
'D': ['D4']},
index=[4])
df3 = pd.DataFrame({'A': ['E4'],
'F': ['F4']},
index=[4])
result = pd.concat([df1, df2, df3])
It gives:
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
4 A4 NaN NaN D4 NaN
4 E4 NaN NaN NaN F4
Instead of:
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
4 E4 NaN NaN D4 F4
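One approach that appears to produce the wanted table is to chain combine_first instead of concat, since combine_first merges rows that share an index label. This is a sketch under one assumption: later frames take priority over earlier ones on overlapping cells (the wanted output keeps E4 from df3 rather than A4 from df2). The names `frames` and `result` are only for illustration.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4'],
                    'D': ['D4']},
                   index=[4])
df3 = pd.DataFrame({'A': ['E4'],
                    'F': ['F4']},
                   index=[4])

frames = [df1, df2, df3]
# Each later frame is the caller, so its non-null cells take priority;
# gaps are filled from what has been accumulated so far.
result = reduce(lambda acc, frame: frame.combine_first(acc), frames)
print(result)
```

Reversing the roles in the lambda (acc.combine_first(frame)) would instead give priority to the earliest frame, which would keep A4 in row 4.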

Updating dataframe on values from another dataframe

I want to update dataframe X with values from dataframe Y.
X = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']})
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
Y = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1'],
'C': ['C0xx', 'C1xx'],
'D': ['D0xx', 'D1xx']})
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
And the result to be:
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
Of course, my dataframe is much bigger.
1. Both DataFrames have the same index
This is the case you presented in the example given in your question.
You might want to use the update method:
>>> X.update(Y)
>>> X
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
It also works if the rows are in a different order in X and Y:
>>> Y = pd.DataFrame({'A': ['A1', 'A0'],
'B': ['B1', 'B0'],
'C': ['C1xx', 'C0xx'],
'D': ['D1xx', 'D0xx']},
index=[1,0])
>>> Y
A B C D
1 A1 B1 C1xx D1xx
0 A0 B0 C0xx D0xx
>>> X.update(Y)
>>> X
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
2. Different indexes
If Y has a different index:
>>> Y = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1'],
'C': ['C0xx', 'C1xx'],
'D': ['D0xx', 'D1xx']},
index=[2,1])
>>> Y
A B C D
2 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
You can still use update if you can find another column usable as an index (one that identifies the rows so that they match the rows to be replaced). I use the "A" column as an example, but a multi-column index would work as well.
>>> X2, Y2 = X.set_index("A"), Y.set_index("A")
>>> X2.update(Y2)
>>> X2.reset_index(inplace=True)
>>> X2
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
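The same pattern with a multi-column key can be sketched as follows. The choice of A and B as the key and the name `keys` are assumptions for illustration.

```python
import pandas as pd

X = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                  'B': ['B0', 'B1', 'B2'],
                  'C': ['C0', 'C1', 'C2'],
                  'D': ['D0', 'D1', 'D2']})
Y = pd.DataFrame({'A': ['A0', 'A1'],
                  'B': ['B0', 'B1'],
                  'C': ['C0xx', 'C1xx'],
                  'D': ['D0xx', 'D1xx']},
                 index=[2, 1])

keys = ['A', 'B']
X2 = X.set_index(keys)         # keep a reference so the in-place update is not lost
X2.update(Y.set_index(keys))   # rows are matched on the (A, B) pair
X2 = X2.reset_index()
print(X2)
```

Because update modifies its caller in place and returns None, the indexed frame has to be bound to a name before updating; calling update directly on the result of set_index would leave X untouched.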
I think you need combine_first with set_index if you need to fill in missing values, matching rows on the A and B columns of both dataframes:
print (Y.set_index(['A','B']).combine_first(X.set_index(['A','B'])).reset_index())
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
Unfortunately, update does not work well here, because it aligns on the index rather than on the A and B columns:
Y = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1'],
'C': ['C0xx', 'C1xx'],
'D': ['D0xx', 'D1xx']}, index=[2,1])
print (X)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
print (Y)
A B C D
2 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
X.update(Y)
print (X)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1xx D1xx
2 A0 B0 C0xx D0xx
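# Starting again from the original X: update works in place and returns None,
# so calling it on the temporary frame returned by set_index leaves X unchanged.
X.set_index(['A','B']).update(Y.set_index(['A','B']))
print (X)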
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
print (Y.set_index(['A','B']).combine_first(X.set_index(['A','B'])).reset_index())
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
