Pandas dataframe: take the max value per group with groupby - python

I have a dataframe with many columns; 2 are categorical and the rest are numeric:
df = [type1  type2  type3  val1  val2  val3
      a      b      q      1     2     3
      a      c      w      3     5     2
      b      c      t      2     9     0
      a      b      p      4     6     7
      a      c      m      2     1     8]
I want to apply a merge based on the operation groupby(["type1","type2"]) that takes the max value of each column within the group:
df = [type1  type2  type3  val1  val2  val3
      a      b      q      4     6     7
      a      c      w      3     5     8
      b      c      t      2     9     0]
Explanation: val3 of the first row is 7 because this is the maximal val3 when type1 = a, type2 = b.
Similarly, val3 of the second row is 8 because this is the maximal val3 when type1 = a, type2 = c.
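For reference, the sample frame can be reconstructed like this (a minimal sketch of the data shown above):
import pandas as pd

# reconstruct the sample data from the question
df = pd.DataFrame({
    "type1": ["a", "a", "b", "a", "a"],
    "type2": ["b", "c", "c", "b", "c"],
    "type3": ["q", "w", "t", "p", "m"],
    "val1": [1, 3, 2, 4, 2],
    "val2": [2, 5, 9, 6, 1],
    "val3": [3, 2, 0, 7, 8],
})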

If you need to aggregate all columns by max:
df = df.groupby(["type1","type2"]).max()
print (df)
            type3  val1  val2  val3
type1 type2
a     b         q     4     6     7
      c         w     3     5     8
b     c         t     2     9     0
If you need to aggregate some columns differently, you can create a dictionary of column names mapped to aggregate functions and then set a different aggregate function for those columns, e.g. first is used for type3 and last for val1:
d = dict.fromkeys(df.columns.difference(['type1','type2']), 'max')
d['type3'] = 'first'
d['val1'] = 'last'
df = df.groupby(["type1","type2"], as_index=False, sort=False).agg(d)
print (df)
  type1 type2 type3  val1  val2  val3
0     a     b     q     4     6     7
1     a     c     w     2     5     8
2     b     c     t     2     9     0
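On pandas 0.25+ the same per-column aggregation can also be written with named aggregation, which avoids building the dictionary by hand (a sketch equivalent to the agg(d) call above):
df = df.groupby(["type1", "type2"], as_index=False, sort=False).agg(
    type3=("type3", "first"),
    val1=("val1", "last"),
    val2=("val2", "max"),
    val3=("val3", "max"),
)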

Related

Conditional average of columns based on two dataframes

I have two dataframes. The first is a metatable, the second is a table with values.
df1:
  Id  Con  Obs
  A   one  Day
  B   one  Night
  C   two  Day
  D   two  Night
df2:
  Entry   A  B   C  D
  val1    2  8   2  8
  val2    4  6   4  6
  val3    6  4   6  4
  val4    8  2   8  2
  val5   10  0  10  0
I wish to sum the columns of df2 based on the condition ('Con') column of df1. For this, I attempted to group by the Con column so I could feed the grouped Ids to df2.
level = df1.groupby(['Con'])['Id'].agg(','.join)
level = level.reset_index()
This produces the following:
   Con   Id
0  one  A,B
1  two  C,D
How do I supply this grouped Id to df2 to get the following output?
Entry  AB_sum  CD_sum
val1       10      10
val2       10      10
val3       10      10
val4       10      10
val5       10      10
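For a runnable setup, both frames can be reconstructed from the data above (a sketch):
import pandas as pd

df1 = pd.DataFrame({'Id': ['A', 'B', 'C', 'D'],
                    'Con': ['one', 'one', 'two', 'two'],
                    'Obs': ['Day', 'Night', 'Day', 'Night']})
df2 = pd.DataFrame({'Entry': ['val1', 'val2', 'val3', 'val4', 'val5'],
                    'A': [2, 4, 6, 8, 10], 'B': [8, 6, 4, 2, 0],
                    'C': [2, 4, 6, 8, 10], 'D': [8, 6, 4, 2, 0]})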
You can rename the columns and use groupby on the new column names:
(df2.set_index('Entry')
    .rename(columns=df1.set_index('Id')['Con'])
    .groupby(level=0, axis=1).sum()
)
Output:
       one  two
Entry
val1    10   10
val2    10   10
val3    10   10
val4    10   10
val5    10   10
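Note that newer pandas versions (2.1+) deprecate axis=1 in groupby; a transpose-based sketch that should produce the same result:
out = (df2.set_index('Entry')
          .rename(columns=df1.set_index('Id')['Con'])
          .T                       # move the renamed columns into rows
          .groupby(level=0).sum()  # sum rows sharing the same 'Con' label
          .T)                      # back to one column per condition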

pandas dataframe: how to merge all rows based on groupby

I have a dataframe with many columns; 2 are categorical and the rest are numeric:
df = [type1  type2  type3  val1  val2  val3
      a      b      q      1     2     3
      a      c      w      3     5     2
      b      c      t      2     9     0
      a      b      p      4     6     7
      a      c      m      2     1     8]
I want to apply a merge based on the operation groupby(["type1","type2"]) that will create the following dataframe:
df = [type1  type2  type3  val1  val2  val3  val1_b  val2_b  val3_b
      a      b      q      1     2     3     4       6       7
      a      c      w      3     5     2     2       1       8
      b      c      t      2     9     0     2       9       0]
Please notice: there could be 1 or 2 rows in each group, but not more. In the case of 1, just duplicate the single row.
The idea is to use GroupBy.cumcount as a counter per type1, type2 group. A MultiIndex is created and reshaped by DataFrame.unstack, missing values are forward filled per row by ffill and converted to integers, the columns are sorted by the counter level, and finally the MultiIndex is flattened in a list comprehension:
g = df.groupby(["type1","type2"]).cumcount()
df1 = (df.set_index(["type1", "type2", g])
         .unstack()
         .ffill(axis=1)
         .astype(int)
         .sort_index(level=1, axis=1))
df1.columns = [f'{a}_{b}' if b != 0 else a for a, b in df1.columns]
df1 = df1.reset_index()
print (df1)
  type1 type2  val1  val2  val3  val1_1  val2_1  val3_1
0     a     b     1     2     3       4       6       7
1     a     c     3     5     2       2       1       8
2     b     c     2     9     0       2       9       0
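An alternative sketch (my addition, not part of the answer above): since each group has at most two rows, GroupBy.first and GroupBy.last can be joined directly; in a single-row group, first == last, which duplicates the row as required:
g = df.groupby(["type1", "type2"])
first = g.first()
last = g.last().drop(columns="type3").add_suffix("_b")
df1 = first.join(last).reset_index()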

Pandas divide two dataframe with different sizes

I have a dataframe df1 as:
col1  col2  Val1  Val2
A     g        4     6
A     d        3     8
B     h        5    10
B     p        7    14
I have another dataframe df2 as:
col1  Val1  Val2
A        2     3
B        1     4
I want to divide df1 by df2 based on col1, Val1 and Val2, so that row A from df2 divides both A rows from df1.
My final output of df1.div(df2) is as follows:
col1  col2  Val1  Val2
A     g      2     2
A     d      1.5   2
B     h      5     2.5
B     p      7     3.5
Convert col1 and col2 to a MultiIndex, also convert col1 in the second DataFrame to the index, and then use DataFrame.div:
df = df1.set_index(['col1', 'col2']).div(df2.set_index('col1')).reset_index()
# alternative with specified index level
# df = df1.set_index(['col1', 'col2']).div(df2.set_index('col1'), level=0).reset_index()
print (df)
  col1 col2  Val1      Val2
0    A    g   2.0  2.000000
1    A    d   1.5  2.666667
2    B    h   5.0  2.500000
3    B    p   7.0  3.500000
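An equivalent broadcast sketch (my addition, not from the answer above): reindex df2's rows to follow df1's col1 keys and divide the value columns positionally:
# rows of df2 repeated to match df1's keys (A, A, B, B)
divisors = df2.set_index('col1').reindex(df1['col1']).to_numpy()
df1[['Val1', 'Val2']] = df1[['Val1', 'Val2']].to_numpy() / divisors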
I think there is a slight mistake in your example: for column Val2, 2nd row, 8/3 should be 2.67. So the final output of df1.div(df2) should be:
  col1 col2  Val1      Val2
0    A    g   2.0  2.000000
1    A    d   1.5  2.666667
2    B    h   5.0  2.500000
3    B    p   7.0  3.500000
Anyway, here is a possible solution.
Construct the 2 DataFrames:
import pandas as pd
df1 = pd.DataFrame(data={'col1': ['A', 'A', 'B', 'B'],
                         'col2': ['g', 'd', 'h', 'p'],
                         'Val1': [4, 3, 5, 7],
                         'Val2': [6, 8, 10, 14]},
                   columns=['col1', 'col2', 'Val1', 'Val2'])
df2 = pd.DataFrame(data={'col1': ['A', 'B'],
                         'Val1': [2, 1],
                         'Val2': [3, 4]},
                   columns=['col1', 'Val1', 'Val2'])
print (df1)
print (df2)
Output:
  col1 col2  Val1  Val2
0    A    g     4     6
1    A    d     3     8
2    B    h     5    10
3    B    p     7    14

  col1  Val1  Val2
0    A     2     3
1    B     1     4
Now we can just do an INNER JOIN of df1 and df2 on the column col1. If you are not familiar with SQL joins, have a look at this: sql-join. We can do the join in pandas using the merge() method:
## join df1, df2
merged_df = pd.merge(left=df1, right=df2, how='inner', on='col1')
print (merged_df)
Output:
  col1 col2  Val1_x  Val2_x  Val1_y  Val2_y
0    A    g       4       6       2       3
1    A    d       3       8       2       3
2    B    h       5      10       1       4
3    B    p       7      14       1       4
Now that we have got the corresponding columns of df1 and df2, we can simply compute the division and delete the redundant columns:
# Val1 = Val1_x/Val1_y, Val2 = Val2_x/Val2_y
merged_df['Val1'] = merged_df['Val1_x']/merged_df['Val1_y']
merged_df['Val2'] = merged_df['Val2_x']/merged_df['Val2_y']
# delete the cols: Val1_x,Val1_y,Val2_x,Val2_y
merged_df.drop(columns=['Val1_x', 'Val1_y', 'Val2_x', 'Val2_y'], inplace=True)
print (merged_df)
Final Output:
  col1 col2  Val1      Val2
0    A    g   2.0  2.000000
1    A    d   1.5  2.666667
2    B    h   5.0  2.500000
3    B    p   7.0  3.500000
I hope this solves your question :)
You can use the pandas.merge() function to execute a database-like join between dataframes, then use the result to divide column values:
# merge against col1 so we get a merged index
merged = pd.merge(df1[["col1"]], df2)
df1[["Val1", "Val2"]] = df1[["Val1", "Val2"]].div(merged[["Val1", "Val2"]])
This produces:
  col1 col2  Val1      Val2
0    A    g   2.0  2.000000
1    A    d   1.5  2.666667
2    B    h   5.0  2.500000
3    B    p   7.0  3.500000
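A caveat on the snippet above (my note): the division relies on df1 and merged sharing the same row order and index. An alignment-free variant maps each row's divisor from df2 by its col1 key:
divisors = df2.set_index('col1')
for col in ['Val1', 'Val2']:
    df1[col] = df1[col] / df1['col1'].map(divisors[col])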

Pandas - Delete cells based on ranking within column

I want to delete values based on their relative rank within their column. Specifically, I want to isolate the X highest and X lowest values within several columns. So if X=2 and my dataframe looks like this:
ID    Val1  Val2  Val3
001      2     8    14
002     10    15     8
003      3     1    20
004     11    11     7
005     14     4    19
The output should look like this:
ID    Val1  Val2  Val3
001      2   NaN   NaN
002    NaN    15     8
003      3     1    20
004     11    11     7
005     14     4    19
I know that I can make a sub-table to isolate the high and low rank using:
df = df.sort_values('Column Name')
df2 = df.head(X)  # OR: df.tail(X)
And I figure I can clear these sub-tables of values from the other columns using:
df2['Other Column'] = np.NaN
df2['Other Column B'] = np.NaN
Then merge the sub-tables back together in a way that replaces NaN values when there is data in one of the tables. I tried:
df2.update(df3) # df3 is a sub-table made the same way as df2 using a different column
Which only updated rows already present in df2.
I tried:
out = pd.merge(df2, df3, how='outer')
which gave me separate rows when a row appeared in both df2 and df3.
I tried:
out = df2.combine_first(df3)
which overwrote numerical values with NaN values in some cases, making it unsuitable.
There must be a way to do this: I want the original dataframe with NaN values plugged in whenever a value is not among the X highest or X lowest values in its column.
Interesting question. You can get the index of each value within the sorted values of its column (here in the mask DataFrame), and then keep the values whose index falls within your defined boundary.
In [98]:
print(df)
    Val1  Val2  Val3
ID
1      2     8    14
2     10    15     8
3      3     1    20
4     11    11     7
5     14     4    19
In [99]:
mask = df.apply(lambda x: np.searchsorted(sorted(x), x))
print(mask)
    Val1  Val2  Val3
ID
1      0     2     2
2      2     4     1
3      1     0     4
4      3     3     0
5      4     1     3
In [100]:
print((mask <= 1) | (mask >= (len(mask) - 2)))
     Val1   Val2   Val3
ID
1    True  False  False
2   False   True   True
3    True   True   True
4    True   True   True
5    True   True   True
In [101]:
print(df.where((mask <= 1) | (mask >= (len(mask) - 2))))
    Val1  Val2  Val3
ID
1      2   NaN   NaN
2    NaN    15     8
3      3     1    20
4     11    11     7
5     14     4    19
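On modern pandas the same masking can be written with DataFrame.rank, which avoids the apply/searchsorted round trip (a sketch, assuming X=2 as in the question):
X = 2
ranks = df.rank(method='first')  # 1-based rank of each value within its column
out = df.where((ranks <= X) | (ranks > len(df) - X))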

append columns of a data frame to a different data frame in pandas

Given these two pandas data frames:
>>> df1 = pd.DataFrame({'c1':['a','b','c','d'], 'c2':['x','y','y','x']})
  c1 c2
0  a  x
1  b  y
2  c  y
3  d  x
>>> df2 = pd.DataFrame({'c1':['d','c','a','b'], 'val1':[12,31,14,34], 'val2':[4,3,1,2]})
  c1  val1  val2
0  d    12     4
1  c    31     3
2  a    14     1
3  b    34     2
I'd like to append the columns val1 and val2 of df2 to the data frame df1, taking into account the elements in c1. The updated df1 would then look like:
>>> df1
  c1 c2  val1  val2
0  a  x    14     1
1  b  y    34     2
2  c  y    31     3
3  d  x    12     4
I thought of using a combination of set_index and update:
df1.set_index('c1').update(df2.set_index('c1')), but it didn't work...
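(An aside on why that attempt fails, as far as I can tell: DataFrame.update works in place and only fills columns that already exist in the caller, so it can never add val1 and val2 to df1.)
# sketch of the failure mode
tmp = df1.set_index('c1')
tmp.update(df2.set_index('c1'))  # returns None; updates in place
# tmp still has only c2 -- update never adds new columns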
You could use pd.merge:
import pandas as pd
df1 = pd.DataFrame({'c1':['a','b','c','d'], 'c2':['x','y','y','x']})
df2 = pd.DataFrame({'c1':['d','c','a','b'], 'val1':[12,31,14,34], 'val2':[4,3,1,2]})
df1 = pd.merge(df1, df2, on=['c1'])
print(df1)
yields
  c1 c2  val1  val2
0  a  x    14     1
1  b  y    34     2
2  c  y    31     3
3  d  x    12     4
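An index-based join does the same thing while keeping df1's row order explicit (my addition, not part of the answer above):
# align on c1 via the index, then restore it as a column
df1 = df1.set_index('c1').join(df2.set_index('c1')).reset_index()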
