Looking to get the row of a group that has the maximum value across multiple columns:
df = pd.DataFrame([{'grouper': 'a', 'col1': 1, 'col2': 3, 'uniq_id': 1}, {'grouper': 'a', 'col1': 2, 'col2': 4, 'uniq_id': 2}, {'grouper': 'a', 'col1': 3, 'col2': 2, 'uniq_id': 3}])
col1 col2 grouper uniq_id
0 1 3 a 1
1 2 4 a 2
2 3 2 a 3
In the above, I'm grouping by the "grouper" column. Within the "a" group, I want the row that holds the maximum across col1 and col2. Here that is the row with uniq_id 2, because its col2 value of 4 is the highest value in either column, so the outcome would be:
col1 col2 grouper uniq_id
1 2 4 a 2
In my actual example, I'm using timestamps, so I actually don't expect ties. But in the case of a tie, I am indifferent to which row I select in the group, so it would just be first of the group in that case.
One more way you can try:
# find row wise max value
df['row_max'] = df[['col1','col2']].max(axis=1)
# filter rows from groups
df.loc[df.groupby('grouper')['row_max'].idxmax()]
col1 col2 grouper uniq_id row_max
1 2 4 a 2 4
Later you can drop row_max using df.drop('row_max', axis=1)
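A variant of the same idea that avoids the helper column altogether, as a minimal sketch. The rows for group "b" are made up for illustration (they are not in the question) and include a tie, to show that idxmax keeps the first row of the group in that case.

```python
import pandas as pd

# Group "a" is the data from the question; group "b" is invented for illustration.
df = pd.DataFrame({
    "grouper": ["a", "a", "a", "b", "b"],
    "col1": [1, 2, 3, 7, 5],
    "col2": [3, 4, 2, 1, 7],
    "uniq_id": [1, 2, 3, 4, 5],
})

# Row-wise max over the two columns, kept as a separate Series
# so the original DataFrame is not modified.
row_max = df[["col1", "col2"]].max(axis=1)

# idxmax returns the first index label holding each group's maximum,
# so ties resolve to the earliest row in the group.
best_idx = row_max.groupby(df["grouper"]).idxmax()
result = df.loc[best_idx]
print(result)
```

Because row_max is never attached to df, there is nothing to drop afterwards.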
IIUC, you can use transform and then compare the result with the original DataFrame:
g=df.groupby('grouper')
s1=g.col1.transform('max')
s2=g.col2.transform('max')
s=pd.concat([s1,s2],axis=1).max(axis=1)
df.loc[df[['col1','col2']].eq(s,axis=0).any(axis=1)]
Out[89]:
col1 col2 grouper uniq_id
1 2 4 a 2
Interesting approaches all around. Adding another one just to show the power of apply (which I'm a big fan of) and using some of the other mentioned methods.
import pandas as pd
df = pd.DataFrame(
[
{"grouper": "a", "col1": 1, "col2": 3, "uniq_id": 1},
{"grouper": "a", "col1": 2, "col2": 4, "uniq_id": 2},
{"grouper": "a", "col1": 3, "col2": 2, "uniq_id": 3},
]
)
def find_max(grp):
# find max value per row, then find index of row with max val
max_row_idx = grp[["col1", "col2"]].max(axis=1).idxmax()
return grp.loc[max_row_idx]
df.groupby("grouper").apply(find_max)
value = pd.concat([df['col1'], df['col2']], axis = 0).max()
df.loc[(df['col1'] == value) | (df['col2'] == value), :]
col1 col2 grouper uniq_id
1 2 4 a 2
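This probably isn't the fastest way, but it will work in your case. Concatenate both columns and find the max, then search the df for rows where either column equals that value. Note that the max is taken over the whole DataFrame rather than per group, so this only fits data with a single group.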
You can use numpy and pandas as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3],
'col2': [3, 4, 2],
'grouper': ['a', 'a', 'a'],
'uniq_id': [1, 2, 3]})
df['temp'] = np.max([df.col1.values, df.col2.values],axis=0)
idx = df.groupby('grouper')['temp'].idxmax()
df.loc[idx].drop(columns='temp')
col1 col2 grouper uniq_id
1 2 4 a 2
Related
I have the following Pandas dataframe in Python:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
df.index=['A', 'B', 'C', 'D', 'E']
df
which gives the following output:
col1 col2
A 1 6
B 2 7
C 3 8
D 4 9
E 5 10
I need to write a function, say getNrRows(fromIndex), that takes an index value as input and returns the number of rows between that index and the last index of the dataframe.
For instance:
nrRows = getNrRows("C")
print(nrRows)
> 2
Because it takes 2 steps (rows) from the index C to the index E.
How can I write such a function in the most elegant way?
The simplest way might be
len(df[row_index:]) - 1
For your information, there is also the built-in Index method get_indexer_for:
len(df)-df.index.get_indexer_for(['C'])-1
Out[179]: array([2], dtype=int64)
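Either expression can be wrapped into the function the question asks for. The sketch below uses Index.get_loc and assumes the index labels are unique, in which case get_loc returns a single integer position. The name get_nr_rows and the extra frame parameter are my own choices, not something from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3, 4, 5], "col2": [6, 7, 8, 9, 10]},
    index=["A", "B", "C", "D", "E"],
)


def get_nr_rows(frame, from_index):
    """Return the number of rows that follow the label `from_index`."""
    # With a unique index, get_loc returns the integer position of the label.
    position = frame.index.get_loc(from_index)
    return len(frame) - position - 1


nr_rows = get_nr_rows(df, "C")
print(nr_rows)
```

A label that is not in the index raises a KeyError, which is probably the behaviour you want for bad input.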
I want to do row comparisons by group based on a condition across 2 columns. The condition is (col1(i)-col1(j))*(col2(i)-col2(j)) <= 0, where every row i is compared with every row j on columns col1 and col2. If the condition is satisfied for all row comparisons in a group, set True for that group, else False.
data = {'group':['A', 'A', 'A', 'B', 'B', 'B'],
'col1':[1, 2, 3, 2, 3, 1], 'col2':[4, 3, 2, 2, 3, 1]}
df = pd.DataFrame(data)
df
with output
A True
B False
You can use shift for comparison with the next row, along with groupby + all to check whether all items in the group are True:
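# a row passes if the condition holds against the next row, or if the next row belongs to a different group
cond=((df['col1']-df['col1'].shift(-1))*(df['col2']-df['col2'].shift(-1))<=0)|(df['group']!=df['group'].shift(-1))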
cond.groupby(df['group']).all()
group
A True
B False
dtype: bool
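The shift approach only compares each row with the one directly after it. If the condition really has to hold for every pair of rows in a group, as the question words it, one option is to enumerate the pairs explicitly. This is a minimal sketch; the helper name all_pairs_ok is hypothetical, and the pairwise loop is quadratic in the group size, so it suits small groups.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "col1": [1, 2, 3, 2, 3, 1],
    "col2": [4, 3, 2, 2, 3, 1],
})


def all_pairs_ok(grp):
    """True if (col1_i - col1_j) * (col2_i - col2_j) <= 0 for every pair of rows."""
    rows = list(zip(grp["col1"], grp["col2"]))
    return all(
        (a1 - b1) * (a2 - b2) <= 0
        for (a1, a2), (b1, b2) in combinations(rows, 2)
    )


# Evaluate the check once per group and collect the results in a Series.
result = pd.Series({name: all_pairs_ok(grp) for name, grp in df.groupby("group")})
print(result)
```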
I want to add the data of reference to data, so I use
data[reference.columns]=reference
but it only creates the columns with no values. How can I add the values?
Your two DataFrames are indexed differently, so when you do data[reference.columns] = reference it tries to align the new columns on indices. Since the indices of reference are not in data (or only align for index=0) it adds the columns, but fills the values with NaN.
It looks like you want to add multiple static columns to data with the values from reference. You can just assign these:
for col in reference.columns:
data[col] = reference[col].values[0]
Here's an illustration of the issue.
import pandas as pd
data = pd.DataFrame({'id': [1, 2, 3, 4],
'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4],
'val2': ['A', 'B', 'C', 'D']})
These have the same indices ranging from 0-3.
data[reference.columns] = reference
Outputs
id val1 id2 val2
0 1 A 1 A
1 2 B 2 B
2 3 C 3 C
3 4 D 4 D
But, if these DataFrames have different indices (that only partially overlap):
data = pd.DataFrame({'id': [1, 2, 3, 4],
'val1': ['A', 'B', 'C', 'D']})
reference = pd.DataFrame({'id2': [1, 2, 3, 4],
'val2': ['A', 'B', 'C', 'D']})
reference.index=[3,4,5,6]
data[reference.columns]=reference
Outputs:
id val1 id2 val2
0 1 A NaN NaN
1 2 B NaN NaN
2 3 C NaN NaN
3 4 D 1.0 A
As only the index value of 3 is shared.
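If the two frames have the same number of rows and you want to pair them by position rather than by index label, one way around the alignment is to assign the raw values. This is a sketch under that assumption; it is not what the static-column loop above does.

```python
import pandas as pd

data = pd.DataFrame({"id": [1, 2, 3, 4], "val1": ["A", "B", "C", "D"]})
reference = pd.DataFrame({"id2": [1, 2, 3, 4], "val2": ["A", "B", "C", "D"]})
reference.index = [3, 4, 5, 6]

# Assumes both frames have the same length and that row order,
# not the index labels, decides which rows belong together.
for col in reference.columns:
    # to_numpy() strips the index, so no label alignment takes place.
    data[col] = reference[col].to_numpy()

print(data)
```

If the lengths differ, the assignment raises a ValueError instead of silently filling NaN.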
When there are 2 dataframes with the same columns, how can I select particular columns and add the dataframes together?
The dataframes in pandas are as follows:
a_val = {'col1': [1, 2], 'col2': [3, 4], 'col3': [7, 8]}
b_val = {'col1': [1, 5, 2], 'col2': [3, 2, 4], 'col3': [7, 17, 33]}
a = pd.DataFrame(a_val)
b = pd.DataFrame(b_val)
How do I build the resulting DataFrame C from A (a above) and B (b above)?
I think you need merge and then sum last column:
c = (pd.merge(a,b, on=['col1', 'col2'], suffixes=('','_'))
       .assign(col3=lambda x: x.col3 + x.col3_)
       .drop(columns='col3_'))
This is the same as:
c = pd.merge(a,b, on=['col1', 'col2'], suffixes=('','_'))
c.col3 = c.col3.add(c.col3_)
c = c.drop(columns='col3_')
print (c)
col1 col2 col3
0 1 3 14
1 2 4 41
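The default inner merge drops the (5, 2) row of b because a has no matching keys. If you would rather keep such rows, a possible variant is an outer merge that treats the missing side as 0. Whether that is wanted here is an assumption on my part, and the name c_outer is just for illustration.

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [7, 8]})
b = pd.DataFrame({"col1": [1, 5, 2], "col2": [3, 2, 4], "col3": [7, 17, 33]})

# how="outer" keeps key pairs found in only one frame; the missing col3 becomes NaN.
merged = pd.merge(a, b, on=["col1", "col2"], how="outer", suffixes=("", "_"))

# Treat a missing side as 0 before summing, then restore the integer dtype.
merged["col3"] = (merged["col3"].fillna(0) + merged["col3_"].fillna(0)).astype(int)
c_outer = merged.drop(columns="col3_")
print(c_outer)
```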
I have a pandas dataframe:
df = pd.DataFrame({'one' : [1, 2, 3, 4] ,'two' : [5, 6, 7, 8]})
one two
0 1 5
1 2 6
2 3 7
3 4 8
Columns "one" and "two" together comprise (x, y) coordinates.
Let's say I have a list of coordinates: c = [(1,5), (2,6), (20,5)]
Is there an elegant way of obtaining the rows in df with matching coordinates? In this case, given c, the matching rows would be 0 and 1.
Related question: Using pandas to select rows using two different columns from dataframe?
And: Selecting rows from pandas DataFrame using two columns
This approach using pd.merge should perform better than the iterative solutions.
import pandas as pd
df = pd.DataFrame({"one" : [1, 2, 3, 4] ,"two" : [5, 6, 7, 8]})
c = [(1, 5), (2, 6), (20, 5)]
df2 = pd.DataFrame(c, columns=["one", "two"])
pd.merge(df, df2, on=["one", "two"], how="inner")
one two
0 1 5
1 2 6
You can use
>>> set.union(*(set(df.index[(df.one == i) & (df.two == j)]) for i, j in c))
{0, 1}
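Another option, sketched here as an alternative rather than a benchmarked recommendation, is to build a MultiIndex from the two columns and test membership with isin. Unlike merge, which returns a freshly numbered result, the boolean mask keeps the original row labels of df.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4], "two": [5, 6, 7, 8]})
c = [(1, 5), (2, 6), (20, 5)]

# Build a MultiIndex of (one, two) pairs and test each pair for membership in c.
mask = pd.MultiIndex.from_frame(df[["one", "two"]]).isin(c)

# Boolean indexing keeps the original index labels of the matching rows.
matches = df[mask]
print(matches.index.tolist())
```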