Pandas: divide two dataframes with different sizes - python

I have a dataframe df1 as:
col1 col2 Val1 Val2
A g 4 6
A d 3 8
B h 5 10
B p 7 14
I have another dataframe df2 as:
col1 Val1 Val2
A 2 3
B 1 4
I want to divide df1 by df2 based on col1, Val1 and Val2, so that the single row for A in df2 divides both A rows in df1.
My desired output of df1.div(df2) is as follows:
col1 col2 Val1 Val2
A g 2 2
A d 1.5 2
B h 5 2.5
B p 7 3.5

Convert col1 and col2 to a MultiIndex, convert col1 in the second DataFrame to the index, and then use DataFrame.div:
df = df1.set_index(['col1', 'col2']).div(df2.set_index('col1')).reset_index()
#alternative with specify level of index
#df = df1.set_index(['col1', 'col2']).div(df2.set_index('col1'), level=0).reset_index()
print (df)
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000

I think there is a slight mistake in your example: in column Val2, 2nd row, 8/3 should be 2.67. So the final output of df1.div(df2) should be:
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000
Anyways, here is a possible solution.
First, construct the two DataFrames:
import pandas as pd
df1 = pd.DataFrame(data={'col1':['A','A','B','B'], 'col2': ['g','d','h','p'], 'Val1': [4,3,5,7], 'Val2': [6,8,10,14]}, columns=['col1','col2','Val1','Val2'])
df2 = pd.DataFrame(data={'col1':['A','B'], 'Val1': [2,1], 'Val2': [3,4]}, columns=['col1','Val1','Val2'])
print (df1)
print (df2)
Output:
col1 col2 Val1 Val2
0 A g 4 6
1 A d 3 8
2 B h 5 10
3 B p 7 14
col1 Val1 Val2
0 A 2 3
1 B 1 4
Now we can just do an INNER JOIN of df1 and df2 on col1. If you are not familiar with SQL joins, have a look at this: sql-join. In pandas we can do the join using the merge() function:
## join df1, df2
merged_df = pd.merge(left=df1, right=df2, how='inner', on='col1')
print (merged_df)
Output:
col1 col2 Val1_x Val2_x Val1_y Val2_y
0 A g 4 6 2 3
1 A d 3 8 2 3
2 B h 5 10 1 4
3 B p 7 14 1 4
Now that we have the corresponding columns of df1 and df2 side by side, we can simply compute the division and drop the redundant columns:
# Val1 = Val1_x/Val1_y, Val2 = Val2_x/Val2_y
merged_df['Val1'] = merged_df['Val1_x']/merged_df['Val1_y']
merged_df['Val2'] = merged_df['Val2_x']/merged_df['Val2_y']
# delete the cols: Val1_x,Val1_y,Val2_x,Val2_y
merged_df.drop(columns=['Val1_x', 'Val1_y', 'Val2_x', 'Val2_y'], inplace=True)
print (merged_df)
Final Output:
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000
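The same idea in a more compact form, as a sketch using the df1 and df2 built above (the _div suffix names are my choice, not part of the original): merge with suffixes so the original column names survive, then divide by the raw values so pandas does not try to align column names.
# merge repeats each df2 row for every matching df1 row
merged = df1.merge(df2, on='col1', suffixes=('', '_div'))
# .to_numpy() sidesteps column-name alignment during the division
merged[['Val1', 'Val2']] = merged[['Val1', 'Val2']].div(
    merged[['Val1_div', 'Val2_div']].to_numpy())
result = merged.drop(columns=['Val1_div', 'Val2_div'])
print (result)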
I hope this solves your question :)

You can use the pandas.merge() function to execute a database-like join between dataframes, then use the result to divide column values:
# merge on col1 so df2's values are repeated for each matching row of df1
merged = pd.merge(df1[["col1"]], df2)
# the merged frame shares df1's row order, so div aligns row by row
df1[["Val1", "Val2"]] = df1[["Val1", "Val2"]].div(merged[["Val1", "Val2"]])
This produces:
col1 col2 Val1 Val2
0 A g 2.0 2.000000
1 A d 1.5 2.666667
2 B h 5.0 2.500000
3 B p 7.0 3.500000
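Note this relies on pd.merge preserving the left frame's row order. An alternative sketch that makes the alignment explicit, using map on col1 (it assumes df2's col1 values are unique, as they are here):
# build one divisor Series per value column, aligned to df1's rows
lookup = df2.set_index('col1')
for col in ['Val1', 'Val2']:
    df1[col] = df1[col] / df1['col1'].map(lookup[col])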

Related

How to replace column values based on other columns in pandas?

Assume I have a data frame such as
import pandas as pd
df = pd.DataFrame({'visitor': ['A','B','C','D','E'],
                   'col1': [1,2,3,4,5],
                   'col2': [1,2,4,7,8],
                   'col3': [4,2,3,6,1]})
visitor  col1  col2  col3
A        1     1     4
B        2     2     2
C        3     4     3
D        4     7     6
E        5     8     1
For each row/visitor: (1) first, if there are any identical values in a row, I would like to keep the first occurrence and replace the remaining duplicates in that row with NULL, such as
visitor  col1  col2  col3
A        1     NULL  4
B        2     NULL  NULL
C        3     4     NULL
D        4     7     6
E        5     8     1
Then (2) keep only the rows/visitors that still have more than one value, such as
Final Data Frame
visitor  col1  col2  col3
A        1     NULL  4
C        3     4     NULL
D        4     7     6
E        5     8     1
Any suggestions? many thanks
We can use Series.duplicated along the columns axis to identify the duplicates, mask them using where, and then keep only the rows where the count of non-duplicated values is greater than 1:
s = df.set_index('visitor')
m = ~s.apply(pd.Series.duplicated, axis=1)
s.where(m)[m.sum(1).gt(1)]
col1 col2 col3
visitor
A 1 NaN 4.0
C 3 4.0 NaN
D 4 7.0 6.0
E 5 8.0 1.0
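For intuition, a quick sketch printing the intermediate mask m from above (True marks a value that is kept, i.e. not a duplicate within its row):
print(m)
#          col1   col2   col3
# visitor
# A        True  False   True
# B        True  False  False
# C        True   True  False
# D        True   True   True
# E        True   True   True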
Let us try mask with pd.Series.duplicated, then dropna with thresh:
out = df.mask(df.apply(pd.Series.duplicated, axis=1)).dropna(thresh=df.shape[1] - 1)
Out[321]:
visitor col1 col2 col3
0 A 1 NaN 4.0
2 C 3 4.0 NaN
3 D 4 7.0 6.0
4 E 5 8.0 1.0
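A note on thresh, as I read the call above: dropna keeps rows with at least thresh non-NaN cells, and the visitor column is never NaN, so thresh = df.shape[1] - 1 = 3 keeps exactly the rows with two or more surviving data values. A sketch to verify:
masked = df.mask(df.apply(pd.Series.duplicated, axis=1))
print(masked.notna().sum(axis=1))  # per-row non-NaN counts: 3, 2, 3, 4, 4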

Imputing values into a dataframe based on another dataframe and a condition

Suppose I have the following dataframes:
df1 = pd.DataFrame({'col1':['a','b','c','d'],'col2':[1,2,3,4]})
df2 = pd.DataFrame({'col3':['a','x','a','c','b']})
I wonder how I can look up values in df1 and make a new column on df2 filled with the matching values from col2; where there is no match I want to impute 0. The result should look like the following:
col3 col4
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2
Use Series.map with Series.fillna:
df2['col2'] = df2['col3'].map(df1.set_index('col1')['col2']).fillna(0).astype(int)
print (df2)
col3 col2
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2
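One caveat worth noting (my addition): Series.map looks the col3 values up in the index of the mapping Series, so the col1 values in df1 must be unique. If duplicates were possible, a sketch like this deduplicates first:
mapping = df1.drop_duplicates('col1').set_index('col1')['col2']
df2['col2'] = df2['col3'].map(mapping).fillna(0).astype(int)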
Or DataFrame.merge, which is better if you need to append multiple columns:
df = df2.merge(df1.rename(columns={'col1':'col3'}), how='left').fillna(0)
print (df)
col3 col2
0 a 1.0
1 x 0.0
2 a 1.0
3 c 3.0
4 b 2.0
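Notice the merge variant leaves floats behind: the left join introduces NaN (and hence a float column) before fillna(0) runs. If integer output matters, a small follow-up sketch:
df['col2'] = df['col2'].astype(int)  # safe once every NaN has been filled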

Replace NA values with corresponding values from another dataframe

How can I replace NA values in df1
df1:
ID col1 col2 col3 col4
A NaN NaN NaN NaN
B 0 0 1 2
C NaN NaN NaN NaN
with the corresponding values from the other dataframe, so that df1's existing values are not overwritten?
df2:
ID col1 col2 col3 col4
A 1 2 1 11
B 2 2 4 8
C 0 0 NaN NaN
So the result is
ID col1 col2 col3 col4
A 1 2 1 11
B 0 0 1 2
C 0 0 NaN NaN
IIUC, if ID is the index in both DataFrames, use:
df = df1.fillna(df2)
Or:
df = df1.combine_first(df2)
print (df)
col1 col2 col3 col4
ID
A 1.0 2.0 1.0 11.0
B 0.0 0.0 1.0 2.0
C 0.0 0.0 NaN NaN
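The two differ when df2 has labels df1 lacks: fillna only fills NaNs in cells df1 already has, while combine_first also takes on rows and columns that exist only in df2. A minimal sketch with a hypothetical extra column col5:
import numpy as np
import pandas as pd

a = pd.DataFrame({'col1': [np.nan, 0]}, index=['A', 'B'])
b = pd.DataFrame({'col1': [1, 2], 'col5': [9, 9]}, index=['A', 'B'])

print(a.fillna(b))         # col5 is ignored; only col1's NaN is filled
print(a.combine_first(b))  # col5 is added to the result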
If ID are columns:
df = df1.set_index('ID').fillna(df2.set_index('ID'))
#alternative
#df = df1.set_index('ID').combine_first(df2.set_index('ID'))
import numpy as np
import pandas as pd

rows, columns = df1.shape
for i in range(rows):
    for j in range(columns):
        # df1.iloc[i, j] == np.nan is always False, so test with pd.isna
        if pd.isna(df1.iloc[i, j]):
            df1.iloc[i, j] = df2.iloc[i, j]
If all of df1's missing values have a corresponding value in df2, this works. Note that NaN never compares equal to anything, so a check written as df1.iloc[i, j] == np.nan would always be False; pd.isna is the reliable test. The loop also assumes the missing values in df1 really are np.nan, not strings such as 'NaN', which pd.isna would not catch.

How to groupby and update values in pandas?

I have a pandas DataFrame that looks similar to the following...
>>> df = pd.DataFrame({
... 'col1':['A','C','B','A','B','C','A'],
... 'col2':[np.nan,1.,np.nan,1.,1.,np.nan,np.nan],
... 'col3':[0,1,9,4,2,3,5],
... })
>>> df
col1 col2 col3
0 A NaN 0
1 C 1.0 1
2 B NaN 9
3 A 1.0 4
4 B 1.0 2
5 C NaN 3
6 A NaN 5
What I would like to do is group the rows by the value of col1 and then fill any NaN values in col2 with values that increment by 1 from the highest existing value in that group.
So that my expected results would look like the following...
>>> df
col1 col2 col3
0 A 1.0 4
1 A 2.0 0
2 A 3.0 5
3 B 1.0 2
4 B 2.0 9
5 C 1.0 1
6 C 2.0 3
I believe I can use something like groupby on col1, though I'm unsure how to increment the value in col2 based on the group's highest existing value. I've tried the following, but instead of incrementing the values it sets them all to 1.0 and adds an additional column...
>>> df1 = df.groupby(['col1'], as_index=False).agg({'col2': 'min'})
>>> df = pd.merge(df1, df, how='left', left_on=['col1'], right_on=['col1'])
>>> df
col1 col2_x col2_y col3
0 A 1.0 NaN 0
1 A 1.0 1.0 1
2 A 1.0 NaN 5
3 B 1.0 NaN 9
4 B 1.0 1.0 4
5 C 1.0 1.0 2
6 C 1.0 NaN 3
Use GroupBy.cumcount only for the rows with missing values, add the per-group maximum obtained with GroupBy.transform('max'), and finally fill the remaining positions with the original values via fillna:
df = pd.DataFrame({
    'col1': ['A','C','B','A','B','B','B'],
    'col2': [np.nan, 1., np.nan, 1., 3., np.nan, 0],
    'col3': [0, 1, 9, 4, 2, 3, 4],
})
print (df)
col1 col2 col3
0 A NaN 0
1 C 1.0 1
2 B NaN 9
3 A 1.0 4
4 B 3.0 2
5 B NaN 3
6 B 0.0 4
df = df.sort_values(['col1','col2'], na_position='last')
# per-group maximum of the existing col2 values
s = df.groupby('col1')['col2'].transform('max')
# number the NaN rows within each group, offset by the group max,
# then fall back to the original col2 for the non-NaN rows
df['new'] = (df[df['col2'].isna()]
               .groupby('col1')
               .cumcount()
               .add(1)
               .add(s)
               .fillna(df['col2'])
               .astype(int))
print (df)
col1 col2 col3 new
3 A 1.0 4 1
0 A NaN 0 2
6 B 0.0 4 0
4 B 3.0 2 3
2 B NaN 9 4
5 B NaN 3 5
1 C 1.0 1 1
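If you want the result in the single-column shape the question asked for, a small follow-up sketch (using the df and new column from the example above):
out = (df.assign(col2=df['new'])
         .drop(columns='new')
         .sort_values(['col1', 'col2'])
         .reset_index(drop=True))
print (out)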
Another way (note that this fills every NaN in a group with the same value, the group's most frequent col2 value plus 1, so it does not produce strictly incrementing values):
df['col2_new'] = df.groupby('col1')['col2'].apply(lambda x: x.replace(np.nan, x.value_counts().index[0]+1))
df = df.sort_values('col1')

Group by within a groupby then averaging

Let's say I have a dataframe (I'll just use a simple example) that looks like this:
import pandas as pd
df = {'Col1': [3,4,2,6,5,7,3,4,9,7,1,3],
      'Col2': ['B','B','B','B','A','A','A','A','C','C','C','C'],
      'Col3': [1,1,2,2,1,1,2,2,1,1,2,2]}
df = pd.DataFrame(df)
Which gives a dataframe like so:
Col1 Col2 Col3
0 3 B 1
1 4 B 1
2 2 B 2
3 6 B 2
4 5 A 1
5 7 A 1
6 3 A 2
7 4 A 2
8 9 C 1
9 7 C 1
10 1 C 2
11 3 C 2
What I want to do involves several steps:
1) For each unique value in Col2, and for each unique value in Col3, average Col1. So a desired output would be:
Avg Col2 Col3
1 3.5 B 1
2 4 B 2
3 6 A 1
4 3.5 A 2
5 8 C 1
6 2 C 2
2) Now, for each unique value in Col3, I want the highest average and the corresponding value in Col2. So
Best Avg Col2 Col3
1 8 C 1
2 4 B 2
My attempt has been using df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'}).groupby(['Col3']).agg({'Col1':'max'})
This gives me the highest average for each Col3 value, but not the corresponding Col2 label. Thank you for any help you can give!
After your first groupby, do sort_values + drop_duplicates:
g1=df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'})
g1.sort_values('Col1').drop_duplicates('Col3',keep='last')
Out[569]:
Col3 Col2 Col1
4 2 B 4.0
2 1 C 8.0
Or, in case you have duplicate max values of the mean:
g1[g1.Col1==g1.groupby('Col3').Col1.transform('max')]
Do the following (I modified your code slightly to make it a bit shorter):
df2 = df.groupby(['Col3','Col2'], as_index = False).mean()
When you print the result, for your input, you will get:
Col3 Col2 Col1
0 1 A 6.0
1 1 B 3.5
2 1 C 8.0
3 2 A 3.5
4 2 B 4.0
5 2 C 2.0
Then run:
res = df2.iloc[df2.groupby('Col3').Col1.idxmax()]
When you print the result, you will get:
Col3 Col2 Col1
2 1 C 8.0
4 2 B 4.0
As you can see, idxmax gives the index of the row with the maximal element for each group, and this result can be used as the argument of iloc.
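One caveat (my note, not from the answer): idxmax returns index labels, so iloc only works here because df2 carries the default RangeIndex. With any other index, loc is the safe choice:
res = df2.loc[df2.groupby('Col3')['Col1'].idxmax()]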
