Pandas conditional merge 2 dataframes with one to many relationship - python

I am trying to merge two pandas DataFrames that have a one-to-many relationship. However, there are a couple of caveats, explained below.
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'name': ['AA', 'BB', 'CC', 'DD'],
                    'col1': [1, 2, 3, 4],
                    'col2': [1, 2, 3, 4]})
df2 = pd.DataFrame({'name': ['AA', 'AA', 'BB', 'BB', 'CC', 'DD'],
                    'col3': [0, 10, np.nan, 11, 12, np.nan]})
I'd like to merge the two DataFrames but ignore the 0 and np.nan values in df2 when joining. I cannot simply filter df2 beforehand, because it has other columns that I need.
Basically, I'd like to join on the one-to-many rows whose col3 values are not 0 or NaN.
Expected output:
  name  col1  col2  col3
0   AA     1     1  10.0
1   BB     2     2  11.0
2   CC     3     3  12.0
3   DD     4     4   NaN

How about this:
merged_df = df1.merge(df2.sort_values(['name', 'col3'], ascending=False)
                         .groupby('name')
                         .head(1),
                      on='name')
Output:
  name  col1  col2  col3
0   AA     1     1  10.0
1   BB     2     2  11.0
2   CC     3     3  12.0
3   DD     4     4   NaN

One way:
>>> df1.merge(df2).drop_duplicates(subset=['name'], keep='last')
  name  col1  col2  col3
1   AA     1     1  10.0
3   BB     2     2  11.0
4   CC     3     3  12.0
5   DD     4     4   NaN
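Note that both answers pick one row per name by position (sort order or last occurrence) rather than explicitly dropping the 0. A minimal order-independent sketch, assuming 0 is a sentinel that should be treated like NaN:
# mask the 0 sentinel as NaN, then keep the best remaining match per name
cleaned = df2.assign(col3=df2['col3'].replace(0, np.nan))
best = (cleaned.sort_values('col3', ascending=False, na_position='last')
               .groupby('name')
               .head(1))
df1.merge(best, on='name', how='left')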


concatenate 2 dataframes in order to append some columns

I have two dataframes (df1 and df2) and I want to append them as follows:
df1 and df2 have some columns in common, but I want to append only the columns that exist in df2 and not in df1, while keeping the columns of df1 as they are.
df2 is empty (all rows are NaN).
I could just add the columns to df1 by hand, but in the future df2 could gain new columns, which is why I do not want to hardcode the column names but rather have this done automatically. I used to use append, but I get the following message:
df_new = df1.append(df2)
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead
I tried the following:
df_new = pd.concat([df1, df2], axis=1)
but it concatenates all the columns of both dataframes.
According to https://pandas.pydata.org/docs/reference/api/pandas.concat.html:
join : {‘inner’, ‘outer’}, default ‘outer’
    How to handle indexes on other axis (or axes).
INNER
df = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                  columns=['letter', 'number', 'animal'])
df2 = pd.DataFrame([[None, None, None, None], [None, None, None, None]],
                   columns=['letter', 'number', 'animal', 'newcol'])
print(pd.concat([df, df2], join='inner').dropna(how='all'))
output:
  letter number animal
0      c      3    cat
1      d      4    dog
OUTER
df = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                  columns=['letter', 'number', 'animal'])
df2 = pd.DataFrame([[None, None, None, None], [None, None, None, None]],
                   columns=['letter', 'number', 'animal', 'newcol'])
print(pd.concat([df, df2], join='outer').dropna(how='all'))
output:
  letter number animal newcol
0      c      3    cat    NaN
1      d      4    dog    NaN
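In both variants it is dropna(how='all') that removes df2's all-NaN rows afterwards; the join argument only decides whether df2's extra column survives (inner keeps the intersection of the columns, outer the union).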
You could use pd.concat() with axis=0 (the default) and join='outer' (the default). I'm illustrating with some examples:
df1 = pd.DataFrame({'col1': [3, 3, 3],
                    'col2': [4, 4, 4]})
df2 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [1, 2, 3],
                    'col3': [1, 2, 3],
                    'col4': [1, 2, 3]})
print(df1)
   col1  col2
0     3     4
1     3     4
2     3     4
print(df2)
   col1  col2  col3  col4
0     1     1     1     1
1     2     2     2     2
2     3     3     3     3
df3 = pd.concat([df1, df2], axis=0, join='outer')
print(df3)
   col1  col2  col3  col4
0     3     4   NaN   NaN
1     3     4   NaN   NaN
2     3     4   NaN   NaN
0     1     1   1.0   1.0
1     2     2   2.0   2.0
2     3     3   3.0   3.0
To concatenate just the columns from df2 that are not present in df1:
pd.concat([df1, df2.loc[:, [c for c in df2.columns if c not in df1.columns]]], axis=1)
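An equivalent selection, if you prefer, uses Index.difference (note: it returns the new columns sorted alphabetically rather than in df2's order):
# pick only the columns of df2 that df1 lacks, then concatenate side by side
new_cols = df2.columns.difference(df1.columns)
pd.concat([df1, df2[new_cols]], axis=1)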

Aggregate values in pandas dataframe based on lists of indices in a pandas series

Suppose you have a dataframe with an "id" column and a column of values:
df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'vals': [1, 2, 3]})
df1
  id  vals
0  a     1
1  b     2
2  c     3
You also have a series that contains lists of "id" values that correspond to those in df1:
df2 = pd.Series([['b', 'c'], ['a', 'c'], ['a', 'b']])
df2
0    [b, c]
1    [a, c]
2    [a, b]
dtype: object
Now, you need a computationally efficient method for taking the mean of the "vals" column in df1 using the corresponding ids in df2 and creating a new column in df1. For instance, for the first row (index=0) we would take the mean of the values for ids "b" and "c" in df1 (since these are the id values in df2 for index=0):
  id  vals  avg_vals
0  a     1       2.5
1  b     2       2.0
2  c     3       1.5
You could do it this way:
df1['avg_vals'] = df2.apply(lambda x: df1.loc[df1['id'].isin(x), 'vals'].mean())
df1
  id  vals  avg_vals
0  a     1       2.5
1  b     2       2.0
2  c     3       1.5
...but suppose it is too slow for your purposes, i.e. I need something much more computationally efficient if possible. Thanks for your help in advance.
Let us try:
df1['new'] = pd.DataFrame(df2.tolist()).replace(dict(zip(df1.id, df1.vals))).mean(axis=1)
df1
Out[109]:
  id  vals  new
0  a     1  2.5
1  b     2  2.0
2  c     3  1.5
Try something like:
df1['avg_vals'] = (df2.explode()
                      .map(df1.set_index('id')['vals'])
                      .groupby(level=0)
                      .mean())
output:
  id  vals  avg_vals
0  a     1       2.5
1  b     2       2.0
2  c     3       1.5
Thanks to @Beny and @mozway for their answers, but these still were not performing as efficiently as I needed. I was able to take some of mozway's answer and add a merge and groupby to it, which sped things up:
df1 = pd.DataFrame({'id': ['a', 'b', 'c'], 'vals': [1, 2, 3]})
df2 = pd.Series([['b', 'c'], ['a', 'c'], ['a', 'b']])
df2 = df2.explode().reset_index(drop=False)
df1['avg_vals'] = (pd.merge(df1, df2, left_on='id', right_on=0, how='right')
                     .groupby('index').mean()['vals'])
df1
  id  vals  avg_vals
0  a     1       2.5
1  b     2       2.0
2  c     3       1.5
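One caveat, assuming a recent pandas (2.0+): calling .mean() on the grouped frame now raises because of the non-numeric leftover columns ('id' and the exploded column named 0), so select the numeric column before aggregating:
df1['avg_vals'] = (pd.merge(df1, df2, left_on='id', right_on=0, how='right')
                     .groupby('index')['vals']  # aggregate only the numeric column
                     .mean())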

Pandas, check if a column contains characters from another column, and mark out the character?

There are two DataFrames, df1 & df2, e.g.
df1 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['12abc12', '12abcbla', 'abc', 'jh']})
df2 = pd.DataFrame({'col2': ['abc', 'efj']})
What I want looks like this (find all the rows that contain a string from df2, and tag them):
   index      col1 col2
0      1   12abc12  abc
1      2  12abcbla  abc
2      3       abc  abc
3      4        jh
I've found a similar question, but it's not exactly what I want. Thanks for any ideas in advance.
Use Series.str.extract if you need only the first matched value:
df1['new'] = df1['col1'].str.extract(f'({"|".join(df2["col2"])})', expand=False).fillna('')
print(df1)
   index      col1  new
0      1   12abc12  abc
1      2  12abcbla  abc
2      3       abc  abc
3      4        jh
If you need all matched values, use Series.str.findall and Series.str.join:
df1 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['12abc1defj2', '12abcbla', 'abc', 'jh']})
df2 = pd.DataFrame({'col2': ['abc', 'efj']})
df1['new'] = df1['col1'].str.findall("|".join(df2["col2"])).str.join(',')
print(df1)
   index         col1      new
0      1  12abc1defj2  abc,efj
1      2     12abcbla      abc
2      3          abc      abc
3      4           jh
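One caveat for both approaches: "|".join(df2["col2"]) treats the values as regex fragments, so values containing metacharacters (e.g. '.' or '+') would misbehave. A small sketch that escapes them first:
import re

# escape each value so it is matched literally inside the alternation
pattern = "|".join(map(re.escape, df2["col2"]))
df1['new'] = df1['col1'].str.findall(pattern).str.join(',')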

Subtract one row from another in Pandas DataFrame

I am trying to subtract one row from another in a Pandas DataFrame. I have multiple descriptor columns preceding one numerical column, forcing me to set the index of the DataFrame on the two descriptor columns.
When I do this, I get a KeyError on whichever column name is listed first in the set_index() call; in this case it is 'COL_A':
df = pd.DataFrame({'COL_A': ['A', 'A'],
                   'COL_B': ['B', 'B'],
                   'COL_C': [4, 2]})
df.set_index(['COL_A', 'COL_B'], inplace=True)
df.iloc[1] = (df.iloc[1] / df.iloc[0])
df.reset_index(inplace=True)
KeyError: 'COL_A'
I did not give this a second thought at first, and I cannot figure out why it resolves in a KeyError.
I came upon this question looking for a quick answer. Here's what my solution ended up being.
>>> df = pd.DataFrame(data=[[5, 5, 5, 5], [3, 3, 3, 3]], index=['r1', 'r2'])
>>> df
    0  1  2  3
r1  5  5  5  5
r2  3  3  3  3
>>> df.loc['r3'] = df.loc['r1'] - df.loc['r2']
>>> df
    0  1  2  3
r1  5  5  5  5
r2  3  3  3  3
r3  2  2  2  2
Not sure I understand you correctly:
df = pd.DataFrame({'COL_A': ['A', 'A'],
                   'COL_B': ['B', 'B'],
                   'COL_C': [4, 2]})
gives:
  COL_A COL_B  COL_C
0     A     B      4
1     A     B      2
then
df.set_index(['COL_A', 'COL_B'], inplace=True)
df.iloc[1] = (df.iloc[1] / df.iloc[0])
yields:
             COL_C
COL_A COL_B
A     B        4.0
      B        0.5
If you now want to subtract, say, row 0 from row 1, you can:
df.iloc[1].subtract(df.iloc[0])
to get:
COL_C   -3.5
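And if, per the question's title, the result should be stored back into the frame, assign it to a row:
# overwrite row 1 with (row 1 - row 0); values align on the column labels
df.iloc[1] = df.iloc[1].subtract(df.iloc[0])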

python pandas dataframe : fill nans with a conditional mean

I have the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'Cat': ['A', 'A', 'A', 'B', 'B', 'A', 'B'],
                        'Vals': [1, 2, 3, 4, 5, np.nan, np.nan]})
  Cat  Vals
0   A   1.0
1   A   2.0
2   A   3.0
3   B   4.0
4   B   5.0
5   A   NaN
6   B   NaN
And I want indexes 5 and 6 to be filled with the conditional mean of 'Vals' based on the 'Cat' column, namely 2 and 4.5.
The following code works fine:
means = df.groupby('Cat').Vals.mean()
for i in df[df.Vals.isnull()].index:
    df.loc[i, 'Vals'] = means[df.loc[i].Cat]
  Cat  Vals
0   A   1.0
1   A   2.0
2   A   3.0
3   B   4.0
4   B   5.0
5   A   2.0
6   B   4.5
But I'm looking for something nicer, like
df.Vals.fillna(df.Vals.mean(Conditionally to column 'Cat'))
Edit: I found this, which is one line shorter, but I'm still not happy with it:
means = df.groupby('Cat').Vals.mean()
df.Vals = df.apply(lambda x: means[x.Cat] if pd.isnull(x.Vals) else x.Vals, axis=1)
We wish to "associate" the Cat values with the missing NaN locations.
In Pandas such associations are always done via the index.
So it is natural to set Cat as the index:
df = df.set_index(['Cat'])
Once this is done, then fillna works as desired:
df['Vals'] = df['Vals'].fillna(means)
To return Cat to a column, you could then of course use reset_index:
df = df.reset_index()
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'Cat': ['A', 'A', 'A', 'B', 'B', 'A', 'B'],
     'Vals': [1, 2, 3, 4, 5, np.nan, np.nan]})
means = df.groupby(['Cat'])['Vals'].mean()
df = df.set_index(['Cat'])
df['Vals'] = df['Vals'].fillna(means)
df = df.reset_index()
print(df)
yields
  Cat  Vals
0   A   1.0
1   A   2.0
2   A   3.0
3   B   4.0
4   B   5.0
5   A   2.0
6   B   4.5
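A one-line alternative, using plain groupby.transform rather than the index trick above, avoids the set_index/reset_index round trip:
# transform('mean') broadcasts each group's mean back onto that group's
# original rows, so fillna can align on the untouched RangeIndex
df['Vals'] = df['Vals'].fillna(df.groupby('Cat')['Vals'].transform('mean'))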
