I have two dataframes in python. I want to update rows in first dataframe using matching values from another dataframe. Second dataframe serves as an override.
Here is an example with same data and code:
DataFrame 1 :
DataFrame 2:
I want to update update dataframe 1 based on matching code and name. In this example Dataframe 1 should be updated as below:
Note : Row with Code =2 and Name= Company2 is updated with value 1000 (coming from Dataframe 2)
import pandas as pd
data1 = {
'Code': [1, 2, 3],
'Name': ['Company1', 'Company2', 'Company3'],
'Value': [200, 300, 400],
}
df1 = pd.DataFrame(data1, columns= ['Code','Name','Value'])
data2 = {
'Code': [2],
'Name': ['Company2'],
'Value': [1000],
}
df2 = pd.DataFrame(data2, columns= ['Code','Name','Value'])
Any pointers or hints?
Using DataFrame.update, which aligns on indices (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html):
>>> df1.set_index('Code', inplace=True)
>>> df1.update(df2.set_index('Code'))
>>> df1.reset_index() # to recover the initial structure
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
You can using concat + drop_duplicates which updates the common rows and adds the new rows in df2
pd.concat([df1,df2]).drop_duplicates(['Code','Name'],keep='last').sort_values('Code')
Out[1280]:
Code Name Value
0 1 Company1 200
0 2 Company2 1000
2 3 Company3 400
Update due to below comments
df1.set_index(['Code', 'Name'], inplace=True)
df1.update(df2.set_index(['Code', 'Name']))
df1.reset_index(drop=True, inplace=True)
You can merge the data first and then use numpy.where, here's how to use numpy.where
updated = df1.merge(df2, how='left', on=['Code', 'Name'], suffixes=('', '_new'))
updated['Value'] = np.where(pd.notnull(updated['Value_new']), updated['Value_new'], updated['Value'])
updated.drop('Value_new', axis=1, inplace=True)
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
There is a update function available
example:
df1.update(df2)
for more info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
You can align indices and then use combine_first:
res = df2.set_index(['Code', 'Name'])\
.combine_first(df1.set_index(['Code', 'Name']))\
.reset_index()
print(res)
# Code Name Value
# 0 1 Company1 200.0
# 1 2 Company2 1000.0
# 2 3 Company3 400.0
Assuming company and code are redundant identifiers, you can also do
import pandas as pd
vdic = pd.Series(df2.Value.values, index=df2.Name).to_dict()
df1.loc[df1.Name.isin(vdic.keys()), 'Value'] = df1.loc[df1.Name.isin(vdic.keys()), 'Name'].map(vdic)
# Code Name Value
#0 1 Company1 200
#1 2 Company2 1000
#2 3 Company3 400
You can use pd.Series.where on the result of left-joining df1 and df2
merged = df1.merge(df2, on=['Code', 'Name'], how='left')
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value)
>>> df1
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
You can change the line to
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value).astype(int)
in order to return the value to be an integer.
There's something I often do.
I merge 'left' first:
df_merged = pd.merge(df1, df2, how = 'left', on = 'Code')
Pandas will create columns with extension '_x' (for your left dataframe) and
'_y' (for your right dataframe)
You want the ones that came from the right. So just remove any columns with '_x' and rename '_y':
for col in df_merged.columns:
if '_x' in col:
df_merged .drop(columns = col, inplace = True)
if '_y' in col:
new_name = col.strip('_y')
df_merged .rename(columns = {col : new_name }, inplace=True)
Append the dataset
Drop the duplicate by code
Sort the values
combined_df = combined_df.append(df2).drop_duplicates(['Code'],keep='last').sort_values('Code')
None of the above solutions worked for my particular example, which I think is rooted in the dtype of my columns, but I eventually came to this solution
indexes = df1.loc[df1.Code.isin(df2.Code.values)].index
df1.at[indexes,'Value'] = df2['Value'].values
Related
I have two dataframes in python. I want to update rows in first dataframe using matching values from another dataframe. Second dataframe serves as an override.
Here is an example with same data and code:
DataFrame 1 :
DataFrame 2:
I want to update update dataframe 1 based on matching code and name. In this example Dataframe 1 should be updated as below:
Note : Row with Code =2 and Name= Company2 is updated with value 1000 (coming from Dataframe 2)
import pandas as pd
data1 = {
'Code': [1, 2, 3],
'Name': ['Company1', 'Company2', 'Company3'],
'Value': [200, 300, 400],
}
df1 = pd.DataFrame(data1, columns= ['Code','Name','Value'])
data2 = {
'Code': [2],
'Name': ['Company2'],
'Value': [1000],
}
df2 = pd.DataFrame(data2, columns= ['Code','Name','Value'])
Any pointers or hints?
Using DataFrame.update, which aligns on indices (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html):
>>> df1.set_index('Code', inplace=True)
>>> df1.update(df2.set_index('Code'))
>>> df1.reset_index() # to recover the initial structure
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
You can using concat + drop_duplicates which updates the common rows and adds the new rows in df2
pd.concat([df1,df2]).drop_duplicates(['Code','Name'],keep='last').sort_values('Code')
Out[1280]:
Code Name Value
0 1 Company1 200
0 2 Company2 1000
2 3 Company3 400
Update due to below comments
df1.set_index(['Code', 'Name'], inplace=True)
df1.update(df2.set_index(['Code', 'Name']))
df1.reset_index(drop=True, inplace=True)
You can merge the data first and then use numpy.where, here's how to use numpy.where
updated = df1.merge(df2, how='left', on=['Code', 'Name'], suffixes=('', '_new'))
updated['Value'] = np.where(pd.notnull(updated['Value_new']), updated['Value_new'], updated['Value'])
updated.drop('Value_new', axis=1, inplace=True)
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
There is a update function available
example:
df1.update(df2)
for more info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
You can align indices and then use combine_first:
res = df2.set_index(['Code', 'Name'])\
.combine_first(df1.set_index(['Code', 'Name']))\
.reset_index()
print(res)
# Code Name Value
# 0 1 Company1 200.0
# 1 2 Company2 1000.0
# 2 3 Company3 400.0
Assuming company and code are redundant identifiers, you can also do
import pandas as pd
vdic = pd.Series(df2.Value.values, index=df2.Name).to_dict()
df1.loc[df1.Name.isin(vdic.keys()), 'Value'] = df1.loc[df1.Name.isin(vdic.keys()), 'Name'].map(vdic)
# Code Name Value
#0 1 Company1 200
#1 2 Company2 1000
#2 3 Company3 400
You can use pd.Series.where on the result of left-joining df1 and df2
merged = df1.merge(df2, on=['Code', 'Name'], how='left')
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value)
>>> df1
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
You can change the line to
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value).astype(int)
in order to return the value to be an integer.
There's something I often do.
I merge 'left' first:
df_merged = pd.merge(df1, df2, how = 'left', on = 'Code')
Pandas will create columns with extension '_x' (for your left dataframe) and
'_y' (for your right dataframe)
You want the ones that came from the right. So just remove any columns with '_x' and rename '_y':
for col in df_merged.columns:
if '_x' in col:
df_merged .drop(columns = col, inplace = True)
if '_y' in col:
new_name = col.strip('_y')
df_merged .rename(columns = {col : new_name }, inplace=True)
Append the dataset
Drop the duplicate by code
Sort the values
combined_df = combined_df.append(df2).drop_duplicates(['Code'],keep='last').sort_values('Code')
None of the above solutions worked for my particular example, which I think is rooted in the dtype of my columns, but I eventually came to this solution
indexes = df1.loc[df1.Code.isin(df2.Code.values)].index
df1.at[indexes,'Value'] = df2['Value'].values
I want to know if this is possible with pandas:
From df2, I want to create new1 and new2.
new1 as the latest date that can find from df1 that match column A
and B.
new2 as the latest date that can find from df1 that match column A
but not B.
I managed to get new1 but not new2.
Code:
import pandas as pd
d1 = [['1/1/19', 'xy','p1','54'], ['1/1/19', 'ft','p2','20'], ['3/15/19', 'xy','p3','60'],['2/5/19', 'xy','p4','40']]
df1 = pd.DataFrame(d1, columns = ['Name', 'A','B','C'])
d2 =[['12/1/19', 'xy','p1','110'], ['12/10/19', 'das','p10','60'], ['12/20/19', 'fas','p50','40']]
df2 = pd.DataFrame(d2, columns = ['Name', 'A','B','C'])
d3 = [['12/1/19', 'xy','p1','110','1/1/19','3/15/19'], ['12/10/19', 'das','p10','60','0','0'], ['12/20/19', 'fas','p50','40','0','0']]
dfresult = pd.DataFrame(d3, columns = ['Name', 'A','B','C','new1','new2'])
Updated!
IIUC, you want to add two columns to df2 : new1 and new2.
First I modified two things:
df1 = pd.DataFrame(d1, columns = ['Name1', 'A','B','C'])
df2 = pd.DataFrame(d2, columns = ['Name2', 'A','B','C'])
df1.Name1 = pd.to_datetime(df1.Name1)
Renamed Name into Name1 and Name2 for ease of use. Then I turned Name1 into a real date, so we can get the maximum date by group.
Then, We merge df2 with df1 on A column. This will give us rows that match on that column
aux = df2.merge(df1, on='A')
Then when the B columns is the same on both dataframes, we get Name1 out of it:
df2['new1'] = df2.index.map(aux[aux.B_x==aux.B_y].Name1).fillna(0)
If they're different we get the maximum date for every A group:
df2['new2'] = df2.A.map(aux[aux.B_x!=aux.B_y].groupby('A').Name1.max()).fillna(0)
Ouput:
Name2 A B C new1 new2
0 12/1/19 xy p1 110 2019-01-01 00:00:00 2019-03-15 00:00:00
1 12/10/19 das p10 60 0 0
2 12/20/19 fas p50 40 0 0
You can do this by:
standard merge based on A
removing all entries which match B values
sorting for dates
dropping duplicates on A, keeping last date (n.b. assumes dates are in date format, not as strings!)
merging back on id
Thus:
source = df1.copy() # renamed
v = df2.merge(source, on='A', how='left') # get all values where df2.A == source.A
v = v[v['B_x'] != v['B_y']] # drop entries where B values are the same
nv = v.sort_values(by=['Name_y']).drop_duplicates(subset=['Name_x'], keep='last')
df2.merge(nv[['Name_y', 'Name_x']].rename(columns={'Name_y': 'new2', 'Name_x': 'Name'}),
on='Name', how='left') # keeps non-matching, consider inner
This yields:
Out[94]:
Name A B C new2
0 12/1/19 xy p1 110 3/15/19
1 12/10/19 das p10 60 NaN
2 12/20/19 fas p50 40 NaN
My initial thought was to do something like the below. Sadly, it is not elegant. Generally, this sort of way to determining some value are frowned upon mostly because it fails to scale and with large data, gets especially slow.
def find_date(row, source=df1): # renamed df1 to source
t = source[source['B'] != row['B']]
t = t[t['A'] == row['A']]
return t.sort_values(by='date', ascending=False).iloc[0]
df2['new2'] = df2.apply(find_date, axis=1)
I want to append a Series to a DataFrame where Series's index matches DataFrame's columns using pd.concat, but it gives me surprises:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
a b 0
a NaN NaN 1.0
b NaN NaN 2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not index-columns aware. So is it true that if I want to concat a Series as a new row to a DF, then I can only use df.append instead?
Need DataFrame from Series by to_frame and transpose:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2
"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns = ['a','b'])
sr = pd.Series({'a':1,'b':2,'c':3})
df.loc[1] = sr
df will be like:
a b
1 1 2
I have two data frame, one is user-item-rating and the other is side information of the items:
#df1
A12VH45Q3H5R5I B000NWJTKW 5.0
A3J8AQWNNI3WSN B000NWJTKW 4.0
A1XOBWIL4MILVM BDASK99000 1.0
#df2
B000NWJTKW ....
BDASK99000 ....
Now I w'd like to map the name of item and user to integer ID. I know there is a way of factorize:
df.apply(lambda x: pd.factorize(x)[0] + 1)
But I 'd like to ensure that the integer of the items in two data frame are consistent. So the resulting data frames is:
#df1
1 1 5.0
2 1 4.0
3 2 1.0
#df2
1 ...
2 ...
Do you know how to ensure that? Thanks in advance!
Concatenate the common column(s), and apply pd.factorize (or pd.Categorical) on that:
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
For example,
import pandas as pd
df1 = pd.DataFrame(
[('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])
df2 = pd.DataFrame(
[('B000NWJTKW', 10),
('BDASK99000', 20)], columns=['item', 'extra'])
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
codes, uniques = pd.factorize(df1['user'])
df1['user'] = codes + 1
print(df1)
print(df2)
yields
# df1
user item rating
0 1 1 5
1 2 1 4
2 3 2 1
# df2
item extra
0 1 10
1 2 20
Another way to work-around the problem (if you have enough memory) would be to merge the two DataFrames: df3 = pd.merge(df1, df2, on='item', how='outer'), and then factorize df3['item']:
df3 = pd.merge(df1, df2, on='item', how='outer')
for col in ['item', 'user']:
df3[col] = pd.factorize(df3[col])[0] + 1
print(df3)
yields
user item rating extra
0 1 1 5 10
1 2 1 4 10
2 3 2 1 20
Another option could be to apply factorize on the first dataframe, and then apply the resulting mapping to the second dataframe:
# create factorization:
idx, levels = pd.factorize(df1['item'])
# replace the item codes in the first dataframe with the new index value
df1['item'] = idx
# create a dictionary mapping the original code to the new index value
d = {code: i for i, code in enumerate(codes)}
# apply this mapping to the second dataframe
df2['item'] = df2.item.apply(lambda code: d[code])
This approach will only work if every level is present in both dataframes.
I'm having trouble using pd.merge after groupby. Here's my hypothetical:
import pandas as pd
from pandas import DataFrame
import numpy as np
df1 = DataFrame({'key': [1,1,2,2,3,3],
'var11': np.random.randn(6),
'var12': np.random.randn(6)})
df2 = DataFrame({'key': [1,2,3],
'var21': np.random.randn(3),
'var22': np.random.randn(3)})
#group var11 in df1 by key
grouped = df1['var11'].groupby(df1['key'])
# calculate the mean of var11 by key
grouped = grouped.mean()
print grouped
key
1 1.399430
2 0.568216
3 -0.612843
dtype: float64
print grouped.index
Int64Index([1, 2, 3], dtype='int64')
print df2
key var21 var22
0 1 -0.381078 0.224325
1 2 0.836719 -0.565498
2 3 0.323412 -1.616901
df2 = pd.merge(df2, grouped, left_on = 'key', right_index = True)
At this point, I get IndexError: list index out of range.
When using groupby, the grouping variable ('key' in this example) becomes the index for the resultant series, which is why I specify 'right_index = True'. I've tried other syntax without success. Any advice?
I think you should just do this:
In [140]:
df2 = pd.merge(df2,
pd.DataFrame(grouped, columns=['mean']),
left_on='key',
right_index=True)
print df2
key var21 var22 mean
0 1 0.324476 0.701254 0.400313
1 2 -1.270500 0.055383 -0.293691
2 3 0.804864 0.566747 0.628787
[3 rows x 4 columns]
The reason it didn't work is that grouped is a Series not a DataFrame