GroupBy aggregate count based on specific column


I've been looking for a few hours and can't seem to find a topic related to that exact matter.
So basically, I want to apply something other than the mean to a groupby. My groupby returns two columns, 'feature_name' and 'target_name', and I want to replace the value in 'target_name' with something else: the number of occurrences of 1, of 0, the difference between both, etc.
print(df[[feature_name, target_name]])
When I print my dataframe with the column I use, I get the following : screenshot
I already have the following code to compute the mean of 'target_name' for each value of 'feature_name':
df[[feature_name, target_name]].groupby([feature_name],as_index=False).mean()
Which returns : this.
And I want to compute different things than the mean. Here are the values I want to compute in the end : what I want
In my case, the feature 'target_name' will always be equal to either 1 or 0 (with 1 being 'good' and 0 'bad').
I have seen this example in an answer:
df.groupby(['catA', 'catB'])['scores'].apply(lambda x: x[x.str.contains('RET')].count())
But I don't know how to apply this to my case as x would be simply an int.
And after solving this issue, I still need to compute more than just the count!
Thanks for reading ☺

import pandas as pd
import numpy as np
def my_func(x):
    # Create your 3 metrics here
    calc1 = x.min()
    calc2 = x.max()
    calc3 = x.sum()
    # return a pandas series
    return pd.Series(dict(metric1=calc1, metric2=calc2, metric3=calc3))
# Apply the function you created
df.groupby(...)['columns needed to calculate formulas'].apply(my_func).unstack()
Optionally, using .unstack() at the end allows you to see all your 3 metrics as column headers
As an example:
df
Out[]:
Names A B
0 In 0.820747 0.370199
1 Out 0.162521 0.921443
2 In 0.534743 0.240836
3 Out 0.910891 0.096016
4 In 0.825876 0.833074
5 Out 0.546043 0.551751
6 In 0.305500 0.091768
7 Out 0.131028 0.043438
8 In 0.656116 0.562967
9 Out 0.351492 0.688008
10 In 0.410132 0.443524
11 Out 0.216372 0.057402
12 In 0.406622 0.754607
13 Out 0.272031 0.721558
14 In 0.162517 0.408080
15 Out 0.006613 0.616339
16 In 0.313313 0.808897
17 Out 0.545608 0.445589
18 In 0.353636 0.465455
19 Out 0.737072 0.306329
df.groupby('Names')['A'].apply(my_func).unstack()
Out[]:
metric1 metric2 metric3
Names
In 0.162517 0.825876 4.789202
Out 0.006613 0.910891 3.879669
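Applied to the question's 0/1 target, the same idea can be sketched with named aggregation instead of a custom function. This is only an illustration: the sample data below is made up, and the output column names (ones, total, zeros, diff) are my own choice, not something from the question.

```python
import pandas as pd

# Hypothetical sample data: a categorical feature and a 0/1 target
df = pd.DataFrame({
    "feature_name": ["a", "a", "a", "b", "b", "c"],
    "target_name": [1, 0, 1, 0, 0, 1],
})

counts = (
    df.groupby("feature_name")["target_name"]
      # the target is 0/1, so its sum is the number of 1s
      .agg(ones="sum", total="count")
      # zeros and the difference follow from the two counts
      .assign(
          zeros=lambda d: d["total"] - d["ones"],
          diff=lambda d: d["ones"] - d["zeros"],
      )
)
print(counts)
```

Because the target only takes the values 0 and 1, no lambda over the raw values is needed; any further metric can be added as another assign column.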

Related

python, finding target value of the dataset

I have a sample dataset. Here is:
import pandas as pd
import numpy as np
df = {'Point1': [50,50,50,45,45,35,35], 'Point2': [48,44,30,35,33,34,32], 'Dist': [4,6,2,7,8,3,6]}
df = pd.DataFrame(df)
df
And its output is here:
My goal is to find dist value with its condition and point2 value for each group of point1.
Here is my code. (It gives an error)
if df['dist'] < 5 :
    df1 = df[df['dist'].isin(df.groupby('Point1').max()['Dist'].values)]
else :
    df1 = df[df['dist'].isin(df.groupby('Point1').min()['Dist'].values)]
df1
And here is the expected output:
So, if a group has any Dist value less than 5, I would like to take the max of those values. If not, I would like to take the min one. I hope that is clear.

IIUC, you want to find the closest Dist to 5, with priority for values lower than 5.
For this you can compute two columns to help you sort the values in order of priority and take the first one. Here 'cond' sorts the values ≤5 first, then those >5, and 'cond2' sorts by absolute distance to 5.
thresh = 5
(df
 .assign(cond=df['Dist'].gt(thresh),
         cond2=df['Dist'].sub(thresh).abs(),
        )
 .sort_values(by=['cond', 'cond2'])
 .groupby('Point1', as_index=False).first()
 .drop(columns=['cond', 'cond2'])
)
output:
Point1 Point2 Dist
0 35 34 3
1 45 35 7
2 50 48 4
NB. this also sorts by Point1 in the process. If this is unwanted, one can create a function to sort a dataframe this way and apply it per group. Let me know if this is the case.
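For comparison, the rule as literally stated in the question (max of the Dist values below 5 if any exist, otherwise the min) can be sketched with an explicit loop over the groups. This is a rough illustration under my reading of the question, not a replacement for the sort-based approach; sort=False is used to keep the groups in their original order.

```python
import pandas as pd

df = pd.DataFrame({
    "Point1": [50, 50, 50, 45, 45, 35, 35],
    "Point2": [48, 44, 30, 35, 33, 34, 32],
    "Dist": [4, 6, 2, 7, 8, 3, 6],
})

thresh = 5
picked = []
for _, group in df.groupby("Point1", sort=False):
    below = group[group["Dist"] < thresh]
    if not below.empty:
        # at least one value under the threshold: keep the largest of those
        picked.append(below["Dist"].idxmax())
    else:
        # nothing under the threshold: keep the smallest value overall
        picked.append(group["Dist"].idxmin())

result = df.loc[picked].reset_index(drop=True)
print(result)
```

On this sample data it selects the same three rows as the sort-based version, only in the original group order (50, 45, 35).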

Since you are using a pandas DataFrame, you can use the bracket syntax to filter the data.
In your case:
df[df['Dist'] < 5]
The second part of the question was a little confusing. Can you explain more about "take the max one of these groups. If no, I would like to take the min one"?

how do you divide each value from a pandas series in sequence

Hi I was trying to figure out how to divide values from a DataFrame. But here I made an example for pandas series
a = pd.Series([1, 2, 16,64,128,360,720])
a
-----------------
0 1
1 2
2 16
3 64
4 128
5 360
6 720
So is there any way I could divide a number in a given row by the value from the previous row?
0 2
1 8
2 4
3 2
4 2.8
5 2
Furthermore, I also tried to get the output like "if the value is double, print the index".
Thank you for your help!

What it seems to me is that you are trying to divide a number in a given row by the one of the previous. This can be achieved using this code
import pandas as pd
import numpy as np
a = pd.Series([1, 2, 16,64,128,360,720])
division = pd.Series(np.divide(a.values[1:],a.values[:-1]))
index = division[division == 2].index  # positions where the value doubled
Note: your question is poorly posed. You didn't specify what you wanted to achieve; I worked it out from the example. You also included an incorrect code snippet. Please take care to write a clearer question next time.
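The same ratio can also be sketched with pandas alone using shift(), which lines each value up with the previous one. Treat this as one possible approach, not the only one; note that the index labels here are those of the later row in each pair, so they are shifted by one compared with the expected output in the question.

```python
import pandas as pd

a = pd.Series([1, 2, 16, 64, 128, 360, 720])

# Divide each value by the previous one; the first row has no predecessor
ratio = (a / a.shift(1)).dropna()

# Index labels of the rows that are exactly double the previous row
doubled = ratio.index[ratio == 2].tolist()

print(ratio)
print(doubled)
```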

Well, I don't know exactly what you want to divide your pandas Series by, but you can do it like this:
# If you want to store a new result
b = a / value_to_divide_by
# or if you want to apply it directly to your Series
a /= value_to_divide_by
or using a list comprehension:
b = [int(nb / your_value_here) for nb in a]
# with a minimum value for doing the division
b = [int(nb / your_value_here) for nb in a if nb > min_value]
There are probably other ways to do what you want, but these are two easy solutions.

Finding Mean in Pandas, where column has np.NaN value as well [duplicate]

I can't get the average or mean of a column in pandas. I have a dataframe. Neither of the things I tried below gives me the average of the column weight.
>>> allDF
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 0.94212
The following returns several values, not one:
allDF[['weight']].mean(axis=1)
So does this:
allDF.groupby('weight').mean()

If you only want the mean of the weight column, select the column (which is a Series) and call .mean():
In [479]: df
Out[479]:
ID birthyear weight
0 619040 1962 0.123123
1 600161 1963 0.981742
2 25602033 1963 1.312312
3 624870 1987 0.942120
In [480]: df.loc[:, 'weight'].mean()
Out[480]: 0.83982437500000007

Try df.mean(axis=0). The axis=0 argument calculates the column-wise mean of the dataframe, so the result is one value per column. axis=1 is the row-wise mean, which is why you are getting multiple values.
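Since the question title mentions np.NaN, here is a small sketch of how the two axes behave when a value is missing. The data is made up for illustration; the only thing relied on is that mean() skips NaN by default (skipna=True).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing weight
df = pd.DataFrame({
    "weight": [0.5, np.nan, 1.5],
    "height": [1.0, 2.0, 3.0],
})

col_means = df.mean(axis=0)   # one value per column, NaN is skipped
row_means = df.mean(axis=1)   # one value per row

weight_mean = df["weight"].mean()                     # NaN ignored
weight_mean_strict = df["weight"].mean(skipna=False)  # NaN propagates

print(col_means)
print(row_means)
print(weight_mean, weight_mean_strict)
```

If the column mean comes back as NaN or raises an error instead, the column is probably not numeric and needs converting first.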

Do try to give print (df.describe()) a shot. I hope it will be very helpful to get an overall description of your dataframe.

Mean for each column in df :
A B C
0 5 3 8
1 5 3 9
2 8 4 9
df.mean()
A 6.000000
B 3.333333
C 8.666667
dtype: float64
and if you want average of all columns:
df.stack().mean()
6.0

you can use
df.describe()
you will get basic statistics of the dataframe and to get mean of specific column you can use
df["columnname"].mean()

You can also access a column using the dot notation (also called attribute access) and then calculate its mean:
df.your_column_name.mean()

You can use either of the two statements below:
numpy.mean(df['col_name'])
# or
df['col_name'].mean()

Additionally, you can round the value after finding the mean:
import pandas as pd
# Create a DataFrame
df1 = {
    'Subject':['semester1','semester2','semester3','semester4','semester1',
               'semester2','semester3'],
    'Score':[62.73,47.76,55.61,74.67,31.55,77.31,85.47]}
df1 = pd.DataFrame(df1,columns=['Subject','Score'])
rounded_mean = round(df1['Score'].mean()) # specified nothing as decimal place
print(rounded_mean) # 62
rounded_mean_decimal_0 = round(df1['Score'].mean(), 0) # specified decimal place as 0
print(rounded_mean_decimal_0) # 62.0
rounded_mean_decimal_1 = round(df1['Score'].mean(), 1) # specified decimal place as 1
print(rounded_mean_decimal_1) # 62.2

Do note that it needs to be in the numeric data type in the first place.
import pandas as pd
df['column'] = pd.to_numeric(df['column'], errors='coerce')
Next find the mean on one column or for all numeric columns using describe().
df['column'].mean()
df.describe()
Example of result from describe:
column
count 62.000000
mean 84.678548
std 216.694615
min 13.100000
25% 27.012500
50% 41.220000
75% 70.817500
max 1666.860000

You can simply go for:
df.describe()
That will provide you with all the relevant details you need, but to find the min, max or average value of a particular column (say 'weight' in your case), use:
df['weight'].mean(): for the average value
df['weight'].max(): for the maximum value
df['weight'].min(): for the minimum value

You can use the method agg (aggregate):
df.agg('mean')
It's possible to apply multiple statistics:
df.agg(['mean', 'max', 'min'])

You can easily follow the following code
import pandas as pd
import numpy as np
classxii = {'Name':['Karan','Ishan','Aditya','Anant','Ronit'],
            'Subject':['Accounts','Economics','Accounts','Economics','Accounts'],
            'Score':[87,64,58,74,87],
            'Grade':['A1','B2','C1','B1','A2']}
df = pd.DataFrame(classxii,index = ['a','b','c','d','e'],columns=['Name','Subject','Score','Grade'])
print(df)
#use the below for mean if you already have a dataframe
print('mean of score is:')
print(df[['Score']].mean())

In Pandas, how to apply a customized function using Group mean on Groupby Object

Here is my input data.
df1= pd.DataFrame( np.random.randn(10,3), columns= list("ABC") )
A B C
0 0.557303 1.657976 -0.091638
1 -0.769201 1.305553 -0.248403
2 1.251513 -0.634947 0.100130
3 -1.030045 -0.268972 1.328666
4 0.665483 -0.133410 0.151235
5 0.703294 -0.525490 0.109413
6 0.549441 0.002626 -0.005841
7 0.454866 1.094490 -1.946760
8 -0.152995 -0.736689 -0.367252
9 -0.632906 1.066869 0.303271
I want to create groups based on the value of column A, so I bin A first and define a function. Then I use the apply method on the GroupBy object. I am expecting the new column to be the difference between B and C divided by the group mean of A.
b=np.linspace(-1, 1,5)
def tmpF(x):
    x['newCol']= (x['B']-x['C'])/df1['A'].mean()
    return x
df1.groupby(np.digitize(df1['A'],b)).apply(tmpF)
However, this only uses the mean value of the entire column A. I know df1['A'].mean() is wrong, but I don't know how to access the group mean instead.
How can I solve that?

You can change df1['A'] to x['A'] in function tmpF:
b=np.linspace(-1, 1,5)
def tmpF(x):
    x['newCol']= (x['B']-x['C'])/x['A'].mean()
    return x
df1.groupby(np.digitize(df1['A'],b)).apply(tmpF)
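An alternative that avoids modifying each group inside apply is to broadcast the group mean back to every row with transform. This is a sketch with small fixed data (made up so the result is easy to check), using the same bins as in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data instead of random values
df1 = pd.DataFrame({
    "A": [-0.8, -0.6, 0.2, 0.4],
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": [0.0, 1.0, 1.0, 1.0],
})

b = np.linspace(-1, 1, 5)
bins = np.digitize(df1["A"], b)  # bin number for each row

# transform('mean') returns the group mean aligned to the original rows
group_mean = df1.groupby(bins)["A"].transform("mean")
df1["newCol"] = (df1["B"] - df1["C"]) / group_mean

print(df1)
```

The first two rows fall into one bin (group mean about -0.7) and the last two into another (group mean about 0.3), so each row is divided by the mean of its own group.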

Find row where values for column is maximal in a pandas DataFrame

How can I find the row for which the value of a specific column is maximal?
df.max() will give me the maximal value for each column, I don't know how to get the corresponding row.

Use the pandas idxmax function. It's straightforward:
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1
Alternatively you could also use numpy.argmax, such as numpy.argmax(df['A']) -- it provides the same thing, and appears at least as fast as idxmax in cursory observations.
idxmax() returns index labels, not integers.
Example: if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not row 'd').
If you want the integer position of that label within the Index you have to get it manually (which can be tricky now that duplicate row labels are allowed).
HISTORICAL NOTES:
idxmax() used to be called argmax() prior to 0.11
argmax was deprecated prior to 1.0.0 and removed entirely in 1.0.0
back as of Pandas 0.16, argmax used to exist and perform the same function (though appeared to run more slowly than idxmax).
argmax function returned the integer position within the index of the row location of the maximum element.
pandas moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.
For example, consider this toy DataFrame with a duplicate row label:
In [19]: dfrm
Out[19]:
A B C
a 0.143693 0.653810 0.586007
b 0.623582 0.312903 0.919076
c 0.165438 0.889809 0.000967
d 0.308245 0.787776 0.571195
e 0.870068 0.935626 0.606911
f 0.037602 0.855193 0.728495
g 0.605366 0.338105 0.696460
h 0.000000 0.090814 0.963927
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
In [20]: dfrm['A'].idxmax()
Out[20]: 'i'
In [21]: dfrm.loc[dfrm['A'].idxmax()] # .ix instead of .loc in older versions of pandas
Out[21]:
A B C
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.
So you're left with hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests) -- otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figuring out that it's because idxmax can only report the label of the max row, and then being disappointed that no standard function automatically gets the positions of the max row for you, writing a buggy implementation yourself, editing the code, and praying you don't run into the problem again.
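For the duplicate-label case described above, one way to get the position rather than the label is to take argmax on the underlying NumPy array. This is a minimal sketch with a made-up three-row frame; going through to_numpy() is my choice here because it does not depend on how Series.argmax behaves in a given pandas version.

```python
import pandas as pd

# Hypothetical frame with a duplicated row label 'i'
dfrm = pd.DataFrame(
    {"A": [0.1, 0.7, 0.9], "B": [1.0, 2.0, 3.0]},
    index=["h", "i", "i"],
)

label = dfrm["A"].idxmax()            # label of the max row
pos = dfrm["A"].to_numpy().argmax()   # integer position of the max row

by_label = dfrm.loc[label]   # returns every row labelled 'i'
by_pos = dfrm.iloc[pos]      # returns only the row holding the max

print(label, pos)
print(by_label)
print(by_pos)
```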

You might also try idxmax:
In [5]: df = pandas.DataFrame(np.random.randn(10,3),columns=['A','B','C'])
In [6]: df
Out[6]:
A B C
0 2.001289 0.482561 1.579985
1 -0.991646 -0.387835 1.320236
2 0.143826 -1.096889 1.486508
3 -0.193056 -0.499020 1.536540
4 -2.083647 -3.074591 0.175772
5 -0.186138 -1.949731 0.287432
6 -0.480790 -1.771560 -0.930234
7 0.227383 -0.278253 2.102004
8 -0.002592 1.434192 -1.624915
9 0.404911 -2.167599 -0.452900
In [7]: df.idxmax()
Out[7]:
A 0
B 8
C 7
e.g.
In [8]: df.loc[df['A'].idxmax()]
Out[8]:
A 2.001289
B 0.482561
C 1.579985

Both answers above return only one index if multiple rows share the maximum value. If you want all of those rows, there does not seem to be a built-in function.
But it is not hard to do. Below is an example for Series; the same can be done for DataFrame:
In [1]: from pandas import Series, DataFrame
In [2]: s=Series([2,4,4,3],index=['a','b','c','d'])
In [3]: s.idxmax()
Out[3]: 'b'
In [4]: s[s==s.max()]
Out[4]:
b 4
c 4
dtype: int64

df.iloc[df['columnX'].argmax()]
In pandas 1.0 and later, argmax() returns the integer position of the max value in columnX. iloc can then be used to get the row of the DataFrame df at that position.

A more compact and readable solution using query() is like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
print(df)
# find row with maximum A
df.query('A == A.max()')
It also returns a DataFrame instead of Series, which would be handy for some use cases.

Very simple: we have df as below and we want to print a row with max value in C:
A B C
x 1 4
y 2 10
z 5 9
In:
df.loc[df['C'] == df['C'].max()] # condition check
Out:
A B C
y 2 10

If you want the entire row instead of just the id, you can use df.nlargest and pass in how many 'top' rows you want and you can also pass in for which column/columns you want it for.
df.nlargest(2,['A'])
will give you the rows corresponding to the top 2 values of A.
use df.nsmallest for min values.
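A short sketch of how nlargest behaves when the maximum is tied (the data is made up): by default only the first tied row is returned, while keep='all' returns every row sharing the value.

```python
import pandas as pd

# Hypothetical data where the maximum of A appears twice
df = pd.DataFrame({"A": [3, 1, 5, 5], "B": [10, 20, 30, 40]})

top_first = df.nlargest(1, "A")             # first row holding the max
top_all = df.nlargest(1, "A", keep="all")   # all rows tied for the max
bottom = df.nsmallest(1, "A")               # row holding the min

print(top_first)
print(top_all)
print(bottom)
```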

The direct ".argmax()" solution does not work for me.
The previous example provided by @ely
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1
returns the following message :
FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
So that my solution is :
df['A'].values.argmax()

mx.iloc[0].idxmax()
This one line of code returns the column label of the maximum value in a row of the dataframe. Here mx is the dataframe and iloc[0] selects the first row (position 0).

Considering this dataframe
[In]: df = pd.DataFrame(np.random.randn(4,3),columns=['A','B','C'])
[Out]:
A B C
0 -0.253233 0.226313 1.223688
1 0.472606 1.017674 1.520032
2 1.454875 1.066637 0.381890
3 -0.054181 0.234305 -0.557915
Assuming one wants to know the rows where column "C" is max, the following will do the work
[In]: df[df['C']==df['C'].max()]
[Out]:
A B C
1 0.472606 1.017674 1.520032

The idxmax method of the DataFrame returns the index label of the row with the maximum value, while the behavior of argmax depends on the version of pandas (right now it returns a warning). If you want to use the positional index, you can do the following:
max_row = df['A'].values.argmax()
or
import numpy as np
max_row = np.argmax(df['A'].values)
Note that np.argmax(df['A']) behaves the same as df['A'].argmax().

Use:
data.loc[data['A'].idxmax()]
data['A'].idxmax() - finds the index label of the row holding the max value
data.loc[] - returns the row with that label

If there are ties in the maximum values, then idxmax returns the index of only the first max value. For example, in the following DataFrame:
A B C
0 1 0 1
1 0 0 1
2 0 0 0
3 0 1 1
4 1 0 0
idxmax returns
A 0
B 3
C 0
dtype: int64
Now, if we want all indices corresponding to max values, then we could use max + eq to create a boolean DataFrame, then use it on df.index to filter out indexes:
out = df.eq(df.max()).apply(lambda x: df.index[x].tolist())
Output:
A [0, 4]
B [3]
C [0, 1, 3]
dtype: object

What worked for me is:
df[df['colX'] == df['colX'].max()]
You then get the row in your df with the maximum value of colX.
Then if you just want the index you can add .index at the end of the query.
