Join two DataFrames by index and columns - python

I'm trying to join two DataFrames by index. They can have columns in common, and I only want to take a value from the second one if that value is NaN or doesn't exist in the first. I'm using the pandas example, so I've got:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
as
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
and
df4 = pd.DataFrame({'B': ['B2p', 'B3p', 'B6p', 'B7p'],
'D': ['D2p', 'D3p', 'D6p', 'D7p'],
'F': ['F2p', 'F3p', 'F6p', 'F7p']},
index=[2, 3, 6, 7])
as
B D F
2 B2p D2p F2p
3 B3p D3p F3p
6 B6p D6p F6p
7 B7p D7p F7p
and the desired result is:
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 F2p
3 A3 B3 C3 D3 F3p
6 NaN B6p NaN D6p F6p
7 NaN B7p NaN D7p F7p

This is a good use case for combine_first, where the row and column indices of the resulting dataframe will be the union of the two, i.e., where an index is absent from one of the dataframes, the value from the other is used (the same behaviour as if it contained a NaN):
df1.combine_first(df4)
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 F2p
3 A3 B3 C3 D3 F3p
6 NaN B6p NaN D6p F6p
7 NaN B7p NaN D7p F7p
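The order of the call decides which frame wins where both have a value. A minimal sketch using the df1 and df4 from the question (the names df1_first and df4_first are made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df4 = pd.DataFrame({'B': ['B2p', 'B3p', 'B6p', 'B7p'],
                    'D': ['D2p', 'D3p', 'D6p', 'D7p'],
                    'F': ['F2p', 'F3p', 'F6p', 'F7p']},
                   index=[2, 3, 6, 7])

# df1 has priority: df4 only fills cells that are missing in df1.
df1_first = df1.combine_first(df4)
print(df1_first.loc[2, 'B'])   # B2

# Reversed call: df4 has priority, df1 only fills the gaps.
df4_first = df4.combine_first(df1)
print(df4_first.loc[2, 'B'])   # B2p
```

Since the question asks to keep df1's values wherever they exist, df1.combine_first(df4) is the variant that matches.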

Related

Calculations on unordered Data frame

Hi all, I have created a dummy DataFrame:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3','K0', 'K1', 'K2', 'K3'],
'C': [7, 11, 9, 13,9, 6, 10, 5],
'D': [1, 2, 1, 2,2, 1, 2,1]})
result = pd.merge(left, right, on='key')
output:
key A B C D
0 K0 A0 B0 7 1
1 K0 A0 B0 9 2
2 K1 A1 B1 11 2
3 K1 A1 B1 6 1
4 K2 A2 B2 9 1
5 K2 A2 B2 10 2
6 K3 A3 B3 13 2
7 K3 A3 B3 5 1
The problem I am trying to solve is that I want to group the entries by the key value and perform a mathematical operation on each group. You will notice that there are 2 entries for each key.
So for each key, if the top value is less than the bottom value in column D, perform a simple subtraction on the column C entries (top minus bottom). In this case the math would only be applied to index = [4,5,6,7] and the calculations would be
9-10 = -1
13-5 = 8
Ideally these results would be stored in a list. I know the data structure is not ideal, but this is what I have been given to work with, and I have no idea how to approach it.
# perform checking on column D (and add "check" column)
result["check"] = result.groupby('key')['D'].diff(periods=-1)
result
key A B C D check
0 K0 A0 B0 7 1 -1.0
1 K0 A0 B0 9 2 NaN
2 K1 A1 B1 11 2 1.0
3 K1 A1 B1 6 1 NaN
4 K2 A2 B2 9 1 -1.0
5 K2 A2 B2 10 2 NaN
6 K3 A3 B3 13 2 1.0
7 K3 A3 B3 5 1 NaN
# perform difference between column C values (and add it as a column)
result["diff"] = result.groupby('key')['C'].diff(periods=-1)
result
key A B C D check diff
0 K0 A0 B0 7 1 -1.0 -2.0
1 K0 A0 B0 9 2 NaN NaN
2 K1 A1 B1 11 2 1.0 5.0
3 K1 A1 B1 6 1 NaN NaN
4 K2 A2 B2 9 1 -1.0 -1.0
5 K2 A2 B2 10 2 NaN NaN
6 K3 A3 B3 13 2 1.0 8.0
7 K3 A3 B3 5 1 NaN NaN
# filter dataframe to get only desired results
result[(result.check < 0)]['diff']
0 -2.0
4 -1.0
# results as a list
list(result[(result.check < 0)]['diff'])
[-2.0, -1.0]
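The same list can probably be built without the helper columns. The sketch below assumes each key has exactly two rows, as in the example, and compares the first and last row of every group; the names top, bottom, keep and diffs are made up for illustration:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K0', 'K1', 'K2', 'K3'],
                      'C': [7, 11, 9, 13, 9, 6, 10, 5],
                      'D': [1, 2, 1, 2, 2, 1, 2, 1]})
result = pd.merge(left, right, on='key')

grouped = result.groupby('key')
top = grouped[['C', 'D']].first()      # first row of each key
bottom = grouped[['C', 'D']].last()    # last row of each key

# Keep only the keys whose top D value is less than the bottom one,
# then subtract the bottom C value from the top C value.
keep = top['D'] < bottom['D']
diffs = (top['C'] - bottom['C'])[keep].tolist()
print(diffs)   # [-2, -1]
```

Because no NaN is introduced along the way, the values stay integers instead of becoming floats.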

Pandas: Reducing multi-index to highest available level

I have a dataframe of the following type, with values that are grouped by 3 different categories A, B, C:
import pandas as pd
A = ['A1', 'A2', 'A3', 'A2', 'A1']
B = ['B3', 'B2', 'B2', 'B1', 'B3']
C = ['C2', 'C2', 'C3', 'C1', 'C3']
value = ['6','2','3','3','5']
df = pd.DataFrame({'categA': A,'categB': B, 'categC': C, 'value': value})
df
Which looks like:
categA categB categC value
0 A1 B3 C2 6
1 A2 B2 C2 2
2 A3 B2 C3 3
3 A2 B1 C1 3
4 A1 B3 C3 5
Now, when I want to unstack this df by the C category, .unstack() returns some multi-indexed dataframe with 'value' at the first level and my categories of interest C1, C2 & C3 at the second level:
df = df.set_index(['categA','categB','categC']).unstack('categC')
df
Output:
value
categC C1 C2 C3
categA categB
A1 B3 NaN 6 5
A2 B1 3 NaN NaN
B2 NaN 2 NaN
A3 B2 NaN NaN 3
Is there a quick and clean way to get rid of the multi-index by reducing it to the highest available level? This is what I'd like to have as output:
categA categB C1 C2 C3
A1 B3 NaN 6 5
A2 B1 3 NaN NaN
B2 NaN 2 NaN
A3 B2 NaN NaN 3
Many thanks in advance!
Edit:
print(df.reset_index())
gives:
categA categB value
categC C1 C2 C3
0 A1 B3 NaN 6 5
1 A2 B1 3 NaN NaN
2 A2 B2 NaN 2 NaN
3 A3 B2 NaN NaN 3
Add reset_index as well, and unstack the value column as a Series rather than the whole DataFrame:
df.set_index(['categA','categB','categC']).value.unstack('categC').reset_index()
Out[875]:
categC categA categB C1 C2 C3
0 A1 B3 None 6 5
1 A2 B1 3 None None
2 A2 B2 None 2 None
3 A3 B2 None None 3
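If the DataFrame has already been unstacked, dropping the outer column level should give the same shape. A minimal sketch, assuming the df from the question; the names wide, flat and tidy are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'categA': ['A1', 'A2', 'A3', 'A2', 'A1'],
                   'categB': ['B3', 'B2', 'B2', 'B1', 'B3'],
                   'categC': ['C2', 'C2', 'C3', 'C1', 'C3'],
                   'value': ['6', '2', '3', '3', '5']})

wide = df.set_index(['categA', 'categB', 'categC']).unstack('categC')

# Drop the outer 'value' level from the column MultiIndex.
flat = wide.droplevel(0, axis=1)

# Clear the leftover 'categC' columns name and turn categA/categB
# back into ordinary columns.
tidy = flat.rename_axis(None, axis=1).reset_index()
print(list(tidy.columns))   # ['categA', 'categB', 'C1', 'C2', 'C3']
```

Clearing the columns name is what removes the stray categC label that still appears in the header of the output above.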

Concatenate DataFrames in vertical and horizontal at the same time

I would like to concatenate several DataFrames in both directions at the same time.
That is, if an index label does not exist, it is created,
and if a column does not exist, it is created as well.
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4'],
'D': ['D4']},
index=[4])
df3 = pd.DataFrame({'A': ['E4'],
'F': ['F4']},
index=[4])
result = pd.concat([df1, df2, df3])
It gives :
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
4 A4 NaN NaN D4 NaN
4 E4 NaN NaN NaN F4
Instead of :
A B C D F
0 A0 B0 C0 D0 NaN
1 A1 B1 C1 D1 NaN
2 A2 B2 C2 D2 NaN
3 A3 B3 C3 D3 NaN
4 E4 NaN NaN D4 F4
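One possible approach is to chain combine_first, which unions both the index and the columns. The sketch below assumes that later frames should take priority when two of them supply the same cell (df3 over df2 over df1), which is inferred from the expected output where A at index 4 is E4:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4'],
                    'D': ['D4']},
                   index=[4])
df3 = pd.DataFrame({'A': ['E4'],
                    'F': ['F4']},
                   index=[4])

# Start from the highest-priority frame and let each earlier frame
# fill only the cells that are still missing.
result = df3.combine_first(df2).combine_first(df1)
print(result)
```

With more frames, the same chain could be written with functools.reduce over the reversed list.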

Updating dataframe on values from another dataframe

I want to update DataFrame X with values from DataFrame Y.
X = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2'],
'C': ['C0', 'C1', 'C2'],
'D': ['D0', 'D1', 'D2']})
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
Y = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1'],
'C': ['C0xx', 'C1xx'],
'D': ['D0xx', 'D1xx']})
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
And I want the result to be:
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
Of course, my dataframe is much bigger.
1. Both DataFrames have the same index
This is the case you presented in the example given in your question.
You might want to use the update method:
>>> X.update(Y)
>>> X
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
It also works if lines are in a different order in X and Y:
>>> Y = pd.DataFrame({'A': ['A1', 'A0'],
'B': ['B1', 'B0'],
'C': ['C1xx', 'C0xx'],
'D': ['D1xx', 'D0xx']},
index=[1,0])
>>> Y
A B C D
1 A1 B1 C1xx D1xx
0 A0 B0 C0xx D0xx
>>> X.update(Y)
>>> X
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
2. Different indexes
If Y has a different index:
>>> Y = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1'],
'C': ['C0xx', 'C1xx'],
'D': ['D0xx', 'D1xx']},
index=[2,1])
>>> Y
A B C D
2 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
You can still use update if you can find another column usable as an index (identifying the lines so that they match the lines to be replaced). I take the example of the "A" column but a multiple index would work as well.
>>> X2, Y2 = X.set_index("A"), Y.set_index("A")
>>> X2.update(Y2)
>>> X2.reset_index(inplace=True)
>>> X2
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
I think you need combine_first with set_index if you need to fill in missing values matched on the A and B columns of both DataFrames:
print (Y.set_index(['A','B']).combine_first(X.set_index(['A','B'])).reset_index())
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
Unfortunately, update does not work well here, because it aligns rows on the index:
Y = pd.DataFrame({'A': ['A0', 'A1'],
'B': ['B0', 'B1'],
'C': ['C0xx', 'C1xx'],
'D': ['D0xx', 'D1xx']}, index=[2,1])
print (X)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
print (Y)
A B C D
2 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
X.update(Y)
print (X)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1xx D1xx
2 A0 B0 C0xx D0xx
X.set_index(['A','B']).update(Y.set_index(['A','B']))
print (X)
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
print (Y.set_index(['A','B']).combine_first(X.set_index(['A','B'])).reset_index())
A B C D
0 A0 B0 C0xx D0xx
1 A1 B1 C1xx D1xx
2 A2 B2 C2 D2
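The unchanged X after the set_index call is most likely explained by update working in place and returning None: the call modified the temporary frame returned by set_index, which was then discarded. A minimal sketch of the same idea with the intermediate frame bound to a name (X2 and X_updated are made up for illustration):

```python
import pandas as pd

X = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                  'B': ['B0', 'B1', 'B2'],
                  'C': ['C0', 'C1', 'C2'],
                  'D': ['D0', 'D1', 'D2']})
Y = pd.DataFrame({'A': ['A0', 'A1'],
                  'B': ['B0', 'B1'],
                  'C': ['C0xx', 'C1xx'],
                  'D': ['D0xx', 'D1xx']}, index=[2, 1])

# Keep a reference to the re-indexed frame so the in-place update
# is not lost, then restore A and B as ordinary columns.
X2 = X.set_index(['A', 'B'])
X2.update(Y.set_index(['A', 'B']))
X_updated = X2.reset_index()
print(X_updated)
```

Unlike combine_first, this keeps X's rows and columns as they are and only overwrites cells for which Y has a non-NaN value.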

dataframe merge/join on rows with partial index overlap

I have the following 2 dataframes:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']},
                   index=[0, 1, 2])
and
df2 = pd.DataFrame({'E': ['E2', 'E3', 'E4'],
                    'F': ['F2', 'F3', 'F4']},
                   index=[2, 3, 4])
As you can see df1 and df2 only have index 2 as an overlap.
I want to combine these 2 df's in such a way that the end result is:
df3 = pd.DataFrame({'A': ['A0', 'A1', 'A2', np.nan, np.nan],
                    'B': ['B0', 'B1', 'B2', np.nan, np.nan],
                    'C': ['C0', 'C1', 'C2', np.nan, np.nan],
                    'D': ['D0', 'D1', 'D2', np.nan, np.nan],
                    'E': [np.nan, np.nan, 'E2', 'E3', 'E4'],
                    'F': [np.nan, np.nan, 'F2', 'F3', 'F4']},
                   index=[0, 1, 2, 3, 4])
Use combine_first:
df1.combine_first(df2)
Out:
A B C D E F
0 A0 B0 C0 D0 NaN NaN
1 A1 B1 C1 D1 NaN NaN
2 A2 B2 C2 D2 E2 F2
3 NaN NaN NaN NaN E3 F3
4 NaN NaN NaN NaN E4 F4
You can also use concat with axis=1:
pd.concat([df1,df2],axis=1)
A B C D E F
0 A0 B0 C0 D0 NaN NaN
1 A1 B1 C1 D1 NaN NaN
2 A2 B2 C2 D2 E2 F2
3 NaN NaN NaN NaN E3 F3
4 NaN NaN NaN NaN E4 F4
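The two answers agree here because df1 and df2 share no column names. With a shared column they would differ, as the sketch below suggests; df2b is a hypothetical frame that is not part of the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']},
                   index=[0, 1, 2])

# Hypothetical second frame that also has a 'D' column.
df2b = pd.DataFrame({'D': ['D2b', 'D3b'],
                     'E': ['E2', 'E3']},
                    index=[2, 3])

# concat places the frames side by side and keeps both 'D' columns.
side_by_side = pd.concat([df1, df2b], axis=1)
print(list(side_by_side.columns))   # ['A', 'B', 'C', 'D', 'D', 'E']

# combine_first merges the shared column, preferring df1's values.
filled = df1.combine_first(df2b)
print(list(filled.columns))         # ['A', 'B', 'C', 'D', 'E']
print(filled.loc[2, 'D'], filled.loc[3, 'D'])   # D2 D3b
```

So concat with axis=1 fits when the columns are known to be disjoint, while combine_first is the safer choice when they might overlap.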
