How to conditionally replace Pandas dataframe column values from another dataframe - python

I have the following 2 dataframes:
df1 = pd.DataFrame({"col1": [1, 2, 3],
                    "col2": ["a", "b", "c"]})
df1
Output:
col1 col2
0 1 a
1 2 b
2 3 c
And the second one:
df2 = pd.DataFrame({"col1": [1, 2, 3, 4, 5],
                    "col2": ["x", "y", "z", "q", "w"]})
df2
Output:
col1 col2
0 1 x
1 2 y
2 3 z
3 4 q
4 5 w
Additional info:
col1 in both data frames have unique values.
col2 does not necessarily have unique values.
What I want to achieve:
How can I replace the values of col2 in df1 with the corresponding col2 values from df2, matched on col1?
The desired final content of df1 is as follows:
col1 col2
0 1 x
1 2 y
2 3 z

Create a dict by zipping the two df2 columns, then use map to transfer the values over to df1:
df1['col2'] = df1['col1'].map(dict(zip(df2['col1'], df2['col2'])))
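For reference, a minimal runnable version of this approach using the sample frames from the question (the printed output matches the desired result above):
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
df2 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": ["x", "y", "z", "q", "w"]})

# Build a col1 -> col2 lookup from df2 and map it over df1['col1']
lookup = dict(zip(df2['col1'], df2['col2']))
df1['col2'] = df1['col1'].map(lookup)
print(df1)
#    col1 col2
# 0     1    x
# 1     2    y
# 2     3    z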

Try .map. Note that map with a Series argument looks values up by the Series index, so this relies on col1 being unique in df2 (which the question guarantees):
df1['col2'] = df1['col1'].map(df2.set_index('col1')['col2'])
# col1 col2
# 0 1 x
# 1 2 y
# 2 3 z

How to compare two dataframes in Python pandas and output the difference?

I have two dataframes with the same number of columns but different numbers of rows.
df1
col1 col2
0 a 1,2,3,4
1 b 1,2,3
2 c 1
df2
col1 col2
0 b 1,3
1 c 1,2
2 d 1,2,3
3 e 1,2
df1 is the existing list, df2 is the updated list. The expected result is whatever is in df2 that was not already in df1.
Expected result:
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2
I've tried
mask = df1['col2'] != df2['col2']
but it doesn't work when the dataframes have different numbers of rows.
Split the values in column col2 and use DataFrame.explode, then use DataFrame.merge with a right join and the indicator parameter, filter with boolean indexing for the right_only rows, and finally aggregate back with join:
# Split the comma-separated strings and explode to one value per row
df11 = df1.assign(col2=df1['col2'].str.split(',')).explode('col2')
df22 = df2.assign(col2=df2['col2'].str.split(',')).explode('col2')
# The right join tags each row as 'both' or 'right_only' in the _merge column
df = df11.merge(df22, indicator=True, how='right', on=['col1', 'col2'])
# Keep only the rows unique to df2, then re-join the values per col1
df = (df[df['_merge'].eq('right_only')]
      .groupby('col1')['col2']
      .agg(','.join)
      .reset_index(name='col2'))
print(df)
col1 col2
0 c 2
1 d 1,2,3
2 e 1,2

How to add a list of values to a pandas column

I have a pandas dataframe.
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
col1 col2
0 1 3
1 2 4
I want to add the list lst = [10, 20] element-wise to 'col1' to get the following dataframe.
col1 col2
0 11 3
1 22 4
How to do that?
If you want to edit the column in-place you could do,
df['col1'] += lst
after which df will be,
col1 col2
0 11 3
1 22 4
Similarly, other types of mathematical operations are possible, such as,
df['col1'] *= lst
df['col1'] /= lst
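One caveat, as a side note not in the original answer: true division replaces the integer column with floats, which you can confirm on a copy:
tmp = df.copy()
tmp['col1'] /= lst        # true division upcasts int64 to float64
print(tmp['col1'].dtype)  # float64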
If you want to create a new dataframe after the addition,
df1 = df.copy()
df1['col1'] = df['col1'].add(lst, axis=0)  # .add(lst) returns a Series; df['col1'] + lst also works
Now df1 is:
col1 col2
0 11 3
1 22 4

Reformat dataframe using pandas by adding new rows based on dictionary value

Given below is my dataframe
df = pd.DataFrame({'Col1':['1','2'],'Col2':[{'a':['a1','a2']},{'b':['b1']}]})
Col1 Col2
0 1 {u'a': [u'a1', u'a2']}
1 2 {u'b': [u'b1']}
I need to reformat this dataframe as below:
Col1 NCol2 NCol3
0 1 a a1
1 1 a a2
2 2 b b1
Basically, for each key-value pair in the dictionary, I am adding a row with the key and value in NCol2 and NCol3.
Thanks for the help in advance.
You can use the following solution:
# Expand the dict column into one column per key, explode the lists,
# stack to long format, then join the pairs back onto Col1 by index
df1 = (df['Col2'].apply(pd.Series)
       .apply(lambda x: x.explode())
       .stack().reset_index(level=1))
df1.columns = ['Col2', 'Col3']
(df.drop('Col2', axis=1)
   .merge(df1, left_index=True, right_index=True)
   .reset_index(drop=True))
Output:
Col1 Col2 Col3
0 1 a a1
1 1 a a2
2 2 b b1
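Alternatively, a plain-Python sketch of the same reshaping (my own variant, not from the answer above): iterate over the dict items and build the rows directly, here producing the NCol2/NCol3 names requested in the question.
import pandas as pd

df = pd.DataFrame({'Col1': ['1', '2'],
                   'Col2': [{'a': ['a1', 'a2']}, {'b': ['b1']}]})

# One output row per (key, list element) pair in each dict
rows = [(c1, k, v)
        for c1, d in zip(df['Col1'], df['Col2'])
        for k, vals in d.items()
        for v in vals]
out = pd.DataFrame(rows, columns=['Col1', 'NCol2', 'NCol3'])
print(out)
#   Col1 NCol2 NCol3
# 0    1     a    a1
# 1    1     a    a2
# 2    2     b    b1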

Mapping a Column from One Dataframe to Another

I would like to map the values in df2['col2'] to df['col1']:
df
  col1 col2
0    w    a
1    1    2
2    2    3
I would like to use a column from the dataframe as a dictionary to get:
col1 col2
0 w a
1 A 2
2 B 3
However the data dictionary is just a column in df2, which looks like
df2
  col1 col2
1    1    A
2    2    B
I have tried using this:
di = {"df2['col1']: df2['col2']}
final = df1.replace({"df2['col2']": di})
But get an error: TypeError: 'Series' objects are mutable, thus they cannot be hashed
I have about 200,000 rows. Any help would be appreciated.
Edit:
The sample dictionary would look like di = {1: "A", 2: "B"}, but the data is in df2['col1'] and df2['col2']. I have 200k+ rows; can I convert df2['col1']: df2['col2'] to a tuple, etc.?
You can build a lookup dictionary based on the col1:col2 of df2 and then use that to replace the values in df1.col1.
import pandas as pd
df1 = pd.DataFrame({'col1': ['w', 1, 2], 'col2': ['a', 2, 3]})
df2 = pd.DataFrame({'col1': [1, 2], 'col2': ['A', 'B']})
print(df1)
# col1 col2
#0 w a
#1 1 2
#2 2 3
print(df2)
# col1 col2
#0 1 A
#1 2 B
# row[0] is the index; row[1] and row[2] are col1 and col2
dataLookUpDict = {row[1]: row[2] for row in df2[['col1', 'col2']].itertuples()}
final = df1.replace({'col1': dataLookUpDict})
print(final)
# col1 col2
#0 w a
#1 A 2
#2 B 3
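Given the 200k+ rows mentioned in the question, a note: the lookup dict can also be built with dict(zip(...)), as in the first answer on this page, which avoids itertuples and is typically faster:
dataLookUpDict = dict(zip(df2['col1'], df2['col2']))
final = df1.replace({'col1': dataLookUpDict})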

Pandas Combining two rows into one [duplicate]

Given the following dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'A', 'A']})
df
COL1 COL2
0 A NaN
1 NaN A
2 A A
I would like to create a column ('COL3') that uses the value from COL1 per row unless that value is null (or NaN). If the value is null (or NaN), I'd like for it to use the value from COL2.
The desired result is:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
Thanks in advance!
In [8]: df
Out[8]:
COL1 COL2
0 A NaN
1 NaN B
2 A B
In [9]: df["COL3"] = df["COL1"].fillna(df["COL2"])
In [10]: df
Out[10]:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
You can use np.where to conditionally set column values.
df = df.assign(COL3=np.where(df.COL1.isnull(), df.COL2, df.COL1))
>>> df
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
If you don't mind mutating the values in COL2, you can update them directly to get your desired result.
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
>>> df
COL1 COL2
0 A NaN
1 NaN B
2 A B
df.COL2.update(df.COL1)
>>> df
COL1 COL2
0 A A
1 NaN B
2 A A
Using .combine_first, which gives precedence to non-null values in the Series or DataFrame calling it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
df['COL3'] = df.COL1.combine_first(df.COL2)
Output:
COL1 COL2 COL3
0 A NaN A
1 NaN B B
2 A B A
If we modify your df slightly, you will see that this works, and in fact it will work for any number of columns so long as each row has at least one valid value:
In [5]:
df = pd.DataFrame({'COL1': ['B', np.nan, 'B'],
                   'COL2': [np.nan, 'A', 'A']})
df
Out[5]:
COL1 COL2
0 B NaN
1 NaN A
2 B A
In [6]:
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Out[6]:
0 B
1 A
2 B
dtype: object
first_valid_index will return the index value (in this case column) that contains the first non-NaN value:
In [7]:
df.apply(lambda x: x.first_valid_index(), axis=1)
Out[7]:
0 COL1
1 COL2
2 COL1
dtype: object
So we can use this to index into the Series.
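To get the COL3 column the question asks for, the result of that apply can simply be assigned back (a small sketch using the same df as above):
df['COL3'] = df.apply(lambda x: x[x.first_valid_index()], axis=1)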
You can also use mask, which replaces the values where COL1 is NaN with those from COL2:
In [8]: df.assign(COL3=df['COL1'].mask(df['COL1'].isna(), df['COL2']))
Out[8]:
COL1 COL2 COL3
0 A NaN A
1 NaN A A
2 A A A
