I've got two data frames that represent similar data, and I want to merge them after changing the column names. There are a few ways to achieve this, but given the size of my actual data frames I'd like to use the following method. The merge returns NaN values for all of the second df's columns.
import pandas as pd
df1 = pd.DataFrame({
'time': ['2012-08-02 09:50:20.0','2012-08-02 09:50:32.5','2012-08-02 09:50:34.8'],
'Val': ['1,2,3','1,2,3','1,2,3'],
'Val2': [1,2,3],
'Val3': [1.1,2.1,3.1]
})
df2 = pd.DataFrame({
'time': ['2012-08-02 09:50:20.0','2012-08-02 09:50:32.5','2012-08-02 09:50:34.8'],
'Val': ['1,2,3','1,2,3','1,2,3'],
'Val2': [1,2,3],
'Val3': [1.1,2.1,3.1]
})
df1['time'] = pd.to_datetime(df1['time'])
df2['time'] = pd.to_datetime(df2['time'])
df1.columns.values[1:4] = ['first_' + str(x) for x in df1.columns[1:4]]
df2.columns.values[1:4] = ['second_' + str(x) for x in df2.columns[1:4]]
df3 = pd.merge(df1, df2, on = 'time')
print(df3)
time first_Val first_Val2 first_Val3 second_Val second_Val2 second_Val3
0 2012-08-02 09:50:20.000 1,2,3 1 1.1 NaN NaN NaN
1 2012-08-02 09:50:32.500 1,2,3 2 2.1 NaN NaN NaN
2 2012-08-02 09:50:34.800 1,2,3 3 3.1 NaN NaN NaN
Intended output:
time first_Val first_Val2 first_Val3 second_Val second_Val2 second_Val3
0 2012-08-02 09:50:20.000 1,2,3 1 1.1 1,2,3 1 1.1
1 2012-08-02 09:50:32.500 1,2,3 2 2.1 1,2,3 2 2.1
2 2012-08-02 09:50:34.800 1,2,3 3 3.1 1,2,3 3 3.1
The issue is the slice assignment of the column names:
df1.columns.values[1:4] = new values
This fails in pandas 1.1.1 and 1.1.2 but works in 1.0.1 and 1.0.5. A pandas Index is meant to be immutable: writing into df.columns.values mutates its underlying array without updating the Index's internal state, so the renamed labels behave inconsistently downstream (here, in the merge).
In the code below, 'time' is set as the index, the columns are renamed in a list comprehension, and then the index is reset. This demonstrates that it's fine to rename the columns by assigning a new list to df.columns, but not by slicing into df.columns.values.
.reset_index() can be removed to leave 'time' as the index, in which case use df.join instead of pd.merge.
The options are to set the column that won't get a new name as the index, or to use .rename for the specific columns (a sketch of the .rename route follows the example below).
df1 = pd.DataFrame({
'time': ['2012-08-02 09:50:20.0','2012-08-02 09:50:32.5','2012-08-02 09:50:34.8'],
'Val': ['1,2,3','1,2,3','1,2,3'],
'Val2': [1,2,3],
'Val3': [1.1,2.1,3.1]
})
df1['time'] = pd.to_datetime(df1['time'])
df1.set_index('time', inplace=True)
df1.columns = ['first_' + str(x) for x in df1.columns]
df1.reset_index(inplace=True)
df2 = pd.DataFrame({
'time': ['2012-08-02 09:50:20.0','2012-08-02 09:50:32.5','2012-08-02 09:50:34.8'],
'Val': ['1,2,3','1,2,3','1,2,3'],
'Val2': [1,2,3],
'Val3': [1.1,2.1,3.1]
})
df2['time'] = pd.to_datetime(df2['time'])
df2.set_index('time', inplace=True)
df2.columns = ['second_' + str(x) for x in df2.columns]
df2.reset_index(inplace=True)
# merge
df3 = pd.merge(df1, df2, on = 'time', how='left')
time first_Val first_Val2 first_Val3 second_Val second_Val2 second_Val3
0 2012-08-02 09:50:20.000 1,2,3 1 1.1 1,2,3 1 1.1
1 2012-08-02 09:50:32.500 1,2,3 2 2.1 1,2,3 2 2.1
2 2012-08-02 09:50:34.800 1,2,3 3 3.1 1,2,3 3 3.1
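For completeness, here is a minimal sketch of the .rename route mentioned above (assuming the original df1/df2 with unprefixed Val/Val2/Val3 columns):
# Rename everything except the merge key; 'time' keeps its name.
df1 = df1.rename(columns={c: f'first_{c}' for c in df1.columns if c != 'time'})
df2 = df2.rename(columns={c: f'second_{c}' for c in df2.columns if c != 'time'})
df3 = pd.merge(df1, df2, on='time')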
Okay, let's try this a different way:
df1 = df1.set_index('time').add_prefix('first_')
df2 = df2.set_index('time').add_prefix('second_')
df3 = pd.merge(df1, df2, on = 'time')
print(df3)
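This works because, since pandas 0.23, merge keys can refer to index levels as well as columns, so on='time' matches the shared index. With both frames already indexed on 'time', a plain join is an equivalent sketch:
# join aligns on the shared 'time' index; reset_index() recovers 'time'
# as a regular column if the original layout is wanted back.
df3 = df1.join(df2).reset_index()
print(df3)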
Related
I have two dataframes in python. I want to update rows in the first dataframe using matching values from the second dataframe, which serves as an override.
Here is an example with sample data and code:
DataFrame 1:
Code      Name  Value
   1  Company1    200
   2  Company2    300
   3  Company3    400
DataFrame 2:
Code      Name  Value
   2  Company2   1000
I want to update dataframe 1 based on matching Code and Name. In this example, DataFrame 1 should be updated as below:
Code      Name  Value
   1  Company1    200
   2  Company2   1000
   3  Company3    400
Note: the row with Code = 2 and Name = Company2 is updated with Value 1000 (coming from DataFrame 2).
import pandas as pd
data1 = {
'Code': [1, 2, 3],
'Name': ['Company1', 'Company2', 'Company3'],
'Value': [200, 300, 400],
}
df1 = pd.DataFrame(data1, columns= ['Code','Name','Value'])
data2 = {
'Code': [2],
'Name': ['Company2'],
'Value': [1000],
}
df2 = pd.DataFrame(data2, columns= ['Code','Name','Value'])
Any pointers or hints?
Using DataFrame.update, which aligns on indices (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html):
>>> df1.set_index('Code', inplace=True)
>>> df1.update(df2.set_index('Code'))
>>> df1.reset_index() # to recover the initial structure
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
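One caveat worth adding (not in the original answer): update() copies only non-NaN values from the other frame, so a NaN in the override never clobbers an existing value:
# A NaN in the override frame leaves df1's value untouched.
df_nan = pd.DataFrame({'Code': [3], 'Value': [float('nan')]}).set_index('Code')
df1.update(df_nan)  # df1 still shows 400.0 for Code 3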
You can use concat + drop_duplicates, which updates the common rows and adds any new rows from df2:
pd.concat([df1,df2]).drop_duplicates(['Code','Name'],keep='last').sort_values('Code')
Out[1280]:
Code Name Value
0 1 Company1 200
0 2 Company2 1000
2 3 Company3 400
Update, per the comments below, to match on both Code and Name:
df1.set_index(['Code', 'Name'], inplace=True)
df1.update(df2.set_index(['Code', 'Name']))
df1.reset_index(drop=True, inplace=True)
You can merge the data first and then use numpy.where to pick the new value where one exists:
import numpy as np
updated = df1.merge(df2, how='left', on=['Code', 'Name'], suffixes=('', '_new'))
updated['Value'] = np.where(pd.notnull(updated['Value_new']), updated['Value_new'], updated['Value'])
updated.drop('Value_new', axis=1, inplace=True)
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
There is an update function available.
example:
df1.update(df2)
for more info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
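A caution with the bare call (my addition): update aligns on the index, so with the default RangeIndex df2's single row (index 0) would overwrite df1's first row (Code 1), not the Code-2 row. Align on the key first, as in the accepted answer:
df1 = df1.set_index('Code')
df1.update(df2.set_index('Code'))
df1 = df1.reset_index()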
You can align indices and then use combine_first:
res = df2.set_index(['Code', 'Name'])\
.combine_first(df1.set_index(['Code', 'Name']))\
.reset_index()
print(res)
# Code Name Value
# 0 1 Company1 200.0
# 1 2 Company2 1000.0
# 2 3 Company3 400.0
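A difference from update() worth noting (my addition): combine_first also keeps rows that exist only in df2, whereas update only modifies rows already present in df1:
# A Code/Name pair present only in the override frame survives combine_first.
extra = pd.DataFrame({'Code': [4], 'Name': ['Company4'], 'Value': [500]})
res = extra.set_index(['Code', 'Name']) \
           .combine_first(df1.set_index(['Code', 'Name'])) \
           .reset_index()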
Assuming company and code are redundant identifiers, you can also do
import pandas as pd
vdic = pd.Series(df2.Value.values, index=df2.Name).to_dict()
df1.loc[df1.Name.isin(vdic.keys()), 'Value'] = df1.loc[df1.Name.isin(vdic.keys()), 'Name'].map(vdic)
# Code Name Value
#0 1 Company1 200
#1 2 Company2 1000
#2 3 Company3 400
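A slightly shorter equivalent, under the same assumption that Name alone identifies a row (a sketch, not from the original answer):
# map() yields NaN for names absent from vdic; fillna keeps the old value there.
df1['Value'] = df1['Name'].map(vdic).fillna(df1['Value'])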
You can use pd.Series.where on the result of left-joining df1 and df2
merged = df1.merge(df2, on=['Code', 'Name'], how='left')
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value)
>>> df1
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
You can change that line to
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value).astype(int)
to cast the result back to integer.
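Alternatively, assuming pandas >= 0.24 is available, the nullable Int64 dtype avoids the float upcast without a separate cast back (a sketch, not from the original answer):
df1.Value = merged.Value_y.where(merged.Value_y.notnull(), df1.Value).astype('Int64')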
There's something I often do.
I merge 'left' first:
df_merged = pd.merge(df1, df2, how='left', on='Code')
Pandas will create columns with the suffix '_x' (for your left dataframe) and '_y' (for your right dataframe).
You want the ones that came from the right, so drop the '_x' columns and rename the '_y' ones:
for col in df_merged.columns:
    if col.endswith('_x'):
        df_merged.drop(columns=col, inplace=True)
    elif col.endswith('_y'):
        # Note: col.strip('_y') would be a bug here, since strip removes any
        # run of the characters '_' and 'y' from both ends (e.g. 'Money_y'
        # would become 'Mone'), so slice the suffix off instead.
        df_merged.rename(columns={col: col[:-2]}, inplace=True)
Be aware that this keeps only the right-hand values, so rows of df1 with no match in df2 end up with NaN in those columns.
Append the dataset, drop the duplicates by Code, and sort the values:
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# pd.concat is the equivalent (assuming combined_df starts out as df1):
combined_df = pd.concat([df1, df2]).drop_duplicates(['Code'], keep='last').sort_values('Code')
None of the above solutions worked for my particular example, which I think is rooted in the dtype of my columns, but I eventually came to this solution:
indexes = df1.loc[df1.Code.isin(df2.Code.values)].index
# .at accepts only a single label; use .loc for an array of labels. This also
# assumes df2's rows are in the same order as the matching rows of df1.
df1.loc[indexes, 'Value'] = df2['Value'].values
Input Explained:
I have two dataframes, df1 and df2, which hold the columns shown below.
df1
Description Col1 Col2
AAA 1.2 2.5
BBB 1.3 2.0
CCC 1.1 2.3
df2
Description Col1 Col2
AAA 1.2 1.3
BBB 1.3 2.0
Scenario:
I have to compare df1['Description'] with df2['Description']; where they match, compare df1['Col1'] with df2['Col1'] and df1['Col2'] with df2['Col2'], and produce the result expected below.
Expected Output:
Description Col1 Col2 Col1_Result Col2_Result
AAA 1.2 2.5 Pass Fail
BBB 1.3 2.0 Pass Pass
CCC 1.1 2.3 Not found in df2 Not found in df2
Tried Code:
I tried the code below for the scenario above, but it doesn't work; it throws the error "ValueError: Can only compare identically-labeled Series objects".
df1['Col1_Result'] = np.where(df1['Description']== df2['Description'],np.where(df1['Col1'] == df2['Col1'], 'Pass', 'Fail'),'Not found in df2')
df1['Col2_Result'] = np.where(df1['Description']== df2['Description'],np.where(df1['Col2'] == df2['Col2'], 'Pass', 'Fail'),'Not found in df2')
Thanks in Advance!
The comparison fails because df1 and df2 have different lengths and indexes, and element-wise == only works on identically labeled Series. Instead, use DataFrame.merge with a left join, select the added columns with DataFrame.filter, and build the output with numpy.select, testing first for missing values and then for equality between the column pairs (np.select takes the first condition that matches, and NaN == NaN is False, so the isna() test must come before eq() or unmatched rows would be labeled 'Fail'):
import numpy as np

df1['desc'] = df1['Description'].str.lower()
df2['desc'] = df2['Description'].str.lower()
df = (df1.merge(df2, on='desc', suffixes=['', '_Result'], how='left')
         .drop(['Description_Result', 'desc'], axis=1))
df3 = df.filter(like='_Result')
new = df3.rename(columns=lambda x: x.replace('_Result', ''))
df[df3.columns] = np.select([new.isna(), df[new.columns].eq(new)],
                            ['Not found in df2', 'Pass'], 'Fail')
print(df)
Description Col1 Col2 Col1_Result Col2_Result
0 AAA 1.2 2.5 Pass Fail
1 BBB 1.3 2.0 Pass Pass
2 CCC 1.1 2.3 Not found in df2 Not found in df2
Details:
print (df3)
Col1_Result Col2_Result
0 1.2 1.3
1 1.3 2.0
2 NaN NaN
print (new)
Col1 Col2
0 1.2 1.3
1 1.3 2.0
2 NaN NaN
Alternatively, the code below works with the example given; if there are edge cases, it can be modified as needed.
# Import libraries
import pandas as pd
# Create DataFrame
df1 = pd.DataFrame({
'Description':['AaA', 'BBB','CCC'],
'Col1': [1.2,1.3,1.1],
'Col2':[2.5,2.0,2.3]
})
df2 = pd.DataFrame({
'Description': ['AAA', 'BBB'],
'Col1': [1.2, 1.3],
'Col2': [1.3, 2.0]
})
# Convert to lower case
df1['Description'] = df1['Description'].str.lower()
df2['Description'] = df2['Description'].str.lower()
# Merge df
df = df1.merge(df2, on='Description', how='left')
# Compare
df['Col1_result'] = df.apply(lambda x: 'Not found in df2' if (pd.isna(x['Col1_y'])) else
'Pass' if x['Col1_x']==x['Col1_y'] else
'Fail', axis=1)
df['Col2_result'] = df.apply(lambda x: 'Not found in df2' if (pd.isna(x['Col2_y'])) else
'Pass' if x['Col2_x']==x['Col2_y'] else
'Fail', axis=1)
# Keep only columns from df1
df = df.drop(['Col1_y', 'Col2_y'], axis=1)
# Remove '_x' from column names (pass regex=True; from pandas 2.0 str.replace defaults to literal matching)
df.columns = df.columns.str.replace(r'_x$', '', regex=True)
# Change to upper case
df['Description'] = df['Description'].str.upper()
Output
df
Description Col1 Col2 Col1_result Col2_result
0 AAA 1.2 2.5 Pass Fail
1 BBB 1.3 2.0 Pass Pass
2 CCC 1.1 2.3 Not found in df2 Not found in df2
I have a dataframe keyed by [Year] & [Week] with some weeks missing. I have another dataframe that is a reference calendar, from which I can get the missing values. How do I fill in these missing rows using pandas?
I have tried using reindex to set them up, but I am getting the following error:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
import pandas as pd
d1 = {'Year': [2019,2019,2019,2019,2019], 'Week':[1,2,4,6,7], 'Value': [20,40,60,75,90]}
d2 = {'Year': [2019,2019,2019,2019,2019,2019,2019,2019,2019,2019], 'Week':[1,2,3,4,5,6,7,8,9,10]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df1 = df1.set_index(['Year', 'Week'])
df2 = df2.set_index(['Year', 'Week'])
df1 = df1.reindex(df2, fill_value=0)
print(df1)
You should pass the index, i.e. df2.index (passing the DataFrame itself gives reindex a 2-D object where a 1-D set of labels is expected, hence the buffer-dimensions error):
df1.reindex(df2.index, fill_value=0)
Out[851]:
Value
Year Week
2019 1 20
2 40
3 0
4 60
5 0
6 75
7 90
df2.index.difference(df1.index)
Out[854]:
MultiIndex(levels=[[2019], [3, 5]],
labels=[[0, 0], [0, 1]],
names=['Year', 'Week'],
sortorder=0)
Update: since df2 covers weeks 1-10, the plain reindex also appends trailing weeks 8-10 filled with 0. To fill only the gaps up to df1's last observed week, mask with a back-fill (positions past the last valid value stay NaN after bfill and are dropped):
s = df1.reindex(df2.index)
s[s.bfill().notnull().values].fillna(0)
Out[877]:
Value
Year Week
2019 1 20.0
2 40.0
3 0.0
4 60.0
5 0.0
6 75.0
7 90.0
import pandas as pd
d1 = {'Year': [2019,2019,2019,2019,2019], 'Week':[1,2,4,6,7], 'Value': [20,40,60,75,90]}
d2 = {'Year': [2019,2019,2019,2019,2019,2019,2019], 'Week':[1,2,3,4,5,6,7]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df1 = df1.set_index(['Year', 'Week'])
df2 = df2.set_index(['Year', 'Week'])
fill_value = df1['Value'].mean()  # value to fill `NaN` rows with - choose another statistic if the mean isn't wanted
df1 = df1.join(df2, how='right')
df1 = df1.fillna(value=fill_value)  # fill missing data; assign the result back, fillna is not in-place by default
print(df1)
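A compact equivalent (my combination of the two answers above, not from the original): reindex directly against the calendar's index with the mean as the fill value:
df1 = df1.reindex(df2.index, fill_value=df1['Value'].mean())
print(df1)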
I want to append a Series to a DataFrame where Series's index matches DataFrame's columns using pd.concat, but it gives me surprises:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
a b 0
a NaN NaN 1.0
b NaN NaN 2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not index-columns aware. So is it true that if I want to concat a Series as a new row to a DF, then I can only use df.append instead?
You need a DataFrame from the Series, via to_frame and a transpose (the 1 passed to to_frame becomes the column name, hence the row label after transposing):
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
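A forward-looking note (my addition, not from the original answer): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the to_frame().T pattern combines with pd.concat as the durable way to append a Series as a row:
# sr.name (here 1) becomes the new row's label after the transpose.
df = pd.concat([df, sr.to_frame().T])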
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2
"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns = ['a','b'])
sr = pd.Series({'a':1,'b':2,'c':3})
df.loc[1] = sr
df will be like:
a b
1 1 2