Merge/Append pandas dataframes but update overlapping rows - python

I have two (or more) dataframes that I want to append under each other (or outer merge, in a way). How do I append the two dataframes, but at the same time, whenever an index appears in both, update the value of the variable with the one from the second (dfB) dataframe?
As an illustration:
dfA =
Index Var1
A 5
B 6
C 7
dfB =
Index Var1
A 6
D 8
E 10
Desired output should look like
output =
Index Var1
A 6
B 6
C 7
D 8
E 10
Any help would be greatly appreciated!
Thanks

For this particular case, considering the update, you can use pd.concat() with ignore_index=True followed by drop_duplicates(['Index'], keep='last'):
output = pd.concat([dfA, dfB], ignore_index=True).drop_duplicates(['Index'], keep='last')
Example:
A = {'Index':['A','B','C'],'Var1':[5,6,7]}
B = {'Index':['A','D','E'],'Var1':[6,8,10]}
dfA = pd.DataFrame(A)
dfB = pd.DataFrame(B)
output = pd.concat([dfA,dfB],ignore_index=True).drop_duplicates(['Index'],keep='last')
print(output)
Index Var1
1 B 6
2 C 7
3 A 6
4 D 8
5 E 10
After this you can use sort_values() (and optionally set_index()) if you want the dataframe sorted alphabetically by the Index column.
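For example, a minimal follow-up sketch using the output from above:
output = output.sort_values('Index').set_index('Index')
print(output)
       Var1
Index
A         6
B         6
C         7
D         8
E        10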

You can also merge and fillna, so dfB's values win wherever both frames have the key:
final = (dfA.merge(dfB, on='Index', how='outer', suffixes=('_x',''))
            .assign(Var1=lambda x: x['Var1'].fillna(x['Var1_x']))[dfA.columns])
Index Var1
0 A 6.0
1 B 6.0
2 C 7.0
3 D 8.0
4 E 10.0
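Note that the outer merge introduces NaN before fillna runs, which promotes Var1 to float. If you want integers back, a cast afterwards is safe here, since no NaN remains after the fillna:
final['Var1'] = final['Var1'].astype(int)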

Related

How to overwrite multiple rows from one row (iloc/loc difference)?

I have a dataframe and would like to assign multiple values from one row to multiple other rows.
I get it to work with .iloc, but for some reason, when I use conditions or labels with .loc, it only returns NaN.
df = pd.DataFrame(dict(A = [1,2,0,0],B=[0,0,0,10],C=[3,4,5,6]))
df.index = ['a','b','c','d']
When I use loc with conditions or with direct index names:
df.loc[df['A']>0, ['B','C']] = df.loc['d',['B','C']]
df.loc[['a','b'], ['B','C']] = df.loc['d',['B','C']]
it will return
A B C
a 1.0 NaN NaN
b 2.0 NaN NaN
c 0.0 0.0 5.0
d 0.0 10.0 6.0
but when I use .iloc it actually works as expected
df.iloc[0:2,1:3] = df.iloc[3,1:3]
A B C
a 1 10 6
b 2 10 6
c 0 0 5
d 0 10 6
Is there a way to do this with .loc, or do I need to rewrite my code to get the row numbers from my mask?
When you use labels, pandas performs index alignment; in your case there are no common indices, hence the NaNs. Location-based indexing (.iloc) does not align.
You can assign a numpy array to prevent index alignment:
df.loc[['a','b'], ['B','C']] = df.loc['d',['B','C']].values
output:
A B C
a 1 10 6
b 2 10 6
c 0 0 5
d 0 10 6
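Equivalently, .to_numpy() (the modern spelling of .values) also strips the index and avoids alignment:
df.loc[df['A']>0, ['B','C']] = df.loc['d', ['B','C']].to_numpy()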

Alternative to pandas concat

I am trying to concatenate two data frames with
pd.concat([df1.set_index(["t", "tc"]), df2.set_index(["t", "tc"])], axis=1)
It can happen that the index in df1 is not unique. In that case, I want the corresponding entry in df2 to be inserted into all the rows with that index. Unfortunately, instead of doing that, concat gives me an error. I thought ignore_index=True might help, but I still get the error ValueError: cannot handle a non-unique multi-index!
Is there an alternative to concat that does what I want?
For example:
df1
t tc a
a 1 5
b 1 6
a 1 7
df2:
t tc b
a 1 8
b 1 10
result(after resetting the index):
t tc a b
a 1 5 8
b 1 6 10
a 1 7 8
Using .merge you can get what you need:
df1.merge(df2, on=['t', 'tc'])
#result
t tc a b
0 a 1 5 8
1 a 1 7 8
2 b 1 6 10
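If df1 can contain keys that are absent from df2 and you want to keep those rows too (with NaN in df2's columns), while also preserving df1's row order, a left merge is a reasonable variant of the same idea:
df1.merge(df2, on=['t', 'tc'], how='left')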

How to assign values to multiple non existing columns in a pandas dataframe?

So what I want to do is to add columns to a dataframe and fill each of them (across all rows) with a single value.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1,2],[3,4]]), columns = ["A","B"])
arr = np.array([7,8])
# this is what I would like to do
df[["C","D"]] = arr
# and this is what I want to achieve
# A B C D
# 0 1 2 7 8
# 1 3 4 7 8
# but it yields an "KeyError" sadly
# KeyError: "['C' 'D'] not in index"
I do know about the assign functionality and how I would tackle this issue if I were only adding one column at a time. I just want to know whether there is a clean and simple way to do this with multiple new columns, as I was not able to find one.
This works:
df[["C","D"]] = pd.DataFrame([arr], index=df.index)
Or join:
df = df.join(pd.DataFrame([arr], columns=['C','D'], index=df.index))
Or assign:
df = df.assign(**pd.Series(arr, index=['C','D']))
print (df)
A B C D
0 1 2 7 8
1 3 4 7 8
You can also use assign and pass a dict to it; zipping the new column names with the array maps each scalar to its column, and assign broadcasts the scalars down the rows:
df.assign(**dict(zip(['C','D'], arr)))
Out[755]:
   A  B  C  D
0  1  2  7  8
1  3  4  7  8
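For reference: in recent pandas releases the list-of-keys assignment itself is accepted when the right-hand side is 2-D with a matching shape, so depending on your version the original idea may work directly (a sketch, not guaranteed on older releases):
# tile the 1-D array into shape (n_rows, 2) so each row gets 7, 8
df[['C', 'D']] = np.tile(arr, (len(df), 1))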

Delete a row if it doesn't contain a specified integer value (Pandas)

I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it is possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working, as some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence I would like to remove anything that is not exactly an integer of that value.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with two isin masks combined with &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for anything unparseable, and then filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
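To combine both steps for the original question (keep only rows where a is one of [1,3] and b is one of [6,7], silently dropping values like 1abc), a minimal sketch:
a = pd.to_numeric(df['a'], errors='coerce')
b = pd.to_numeric(df['b'], errors='coerce')
df1 = df[a.isin([1, 3]) & b.isin([6, 7])]
Anything that fails the numeric conversion becomes NaN, and NaN is never in the isin list, so those rows drop out automatically.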
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
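To drop those rows instead of just inspecting them, dropna works (None counts as missing too):
df1 = df.dropna()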
You can use pandas' isin():
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7

Outer join in python Pandas

I have two data sets as following
A B
IDs IDs
1 1
2 2
3 5
4 7
How, in Pandas or NumPy, can I apply a join that gives me all the data from B which is not present in A?
Something like the following:
B
IDs
5
7
I know it can be done with a for loop, but I don't want that, since my real data runs into millions of rows. I am really not sure how to use Pandas or NumPy here; something like the following:
pd.merge(A, B, on='ids', how='right')
Thanks
You can use NumPy's setdiff1d, like so -
np.setdiff1d(B['IDs'],A['IDs'])
Also, np.in1d could be used for the same effect, like so -
B[~np.in1d(B['IDs'],A['IDs'])]
Please note that np.setdiff1d would give us a sorted NumPy array as output.
Sample run -
>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
IDs
1 7
2 5
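In newer NumPy, np.isin is the recommended, shape-preserving replacement for np.in1d; the same filter reads:
B[~np.isin(B['IDs'], A['IDs'])]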
You can use merge with the indicator parameter and then boolean indexing. Finally, you can drop the _merge column:
A = pd.DataFrame({'IDs':[1,2,3,4],
'B':[4,5,6,7],
'C':[1,8,9,4]})
print (A)
B C IDs
0 4 1 1
1 5 8 2
2 6 9 3
3 7 4 4
B = pd.DataFrame({'IDs':[1,2,5,7],
'A':[1,8,3,7],
'D':[1,8,9,4]})
print (B)
A D IDs
0 1 1 1
1 8 8 2
2 3 9 5
3 7 4 7
df = pd.merge(A, B, on='IDs', how='outer', indicator=True)
df = df[df._merge == 'right_only']
df = df.drop('_merge', axis=1)
print (df)
B C IDs A D
4 NaN NaN 5.0 3.0 9.0
5 NaN NaN 7.0 7.0 4.0
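The same logic compresses into a single chain if you prefer (a sketch using query against the indicator column):
df = (pd.merge(A, B, on='IDs', how='outer', indicator=True)
        .query("_merge == 'right_only'")
        .drop(columns='_merge'))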
You could convert the data series to sets and take the difference:
import pandas as pd
df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)}) # Take difference and convert back to DataFrame
The variable "C" then yields
C
0 5
1 7
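Note that converting to sets discards duplicates and ordering, so this approach fits best when the IDs are unique and the original row order does not matter.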
You can simply use pandas' .isin() method:
df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]
If these are separate DataFrames:
a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]
Output:
IDs
2 5
3 7
