How can I average ACROSS groups in python-pandas?

I have a dataset like this:
Participant  Type  Rating
1            A     6
1            A     5
1            B     4
1            B     3
2            A     9
2            A     8
2            B     7
2            B     6
I want to obtain this:
Type  MeanRating
A     mean(6, 9)
A     mean(5, 8)
B     mean(4, 7)
B     mean(3, 6)
So, for each type, I want the mean of the highest value in each group, then the mean of the second-highest value in each group, and so on.
I can't think of a proper way to do this with pandas, since means always seem to be computed within groups, not across them.

First use groupby.rank to create a column that allows you to align the highest values, second highest values, etc. Then perform another groupby using the newly created column to compute the means:
# Get the grouping column.
df['Grouper'] = df.groupby(['Type', 'Participant'])['Rating'].rank(method='first', ascending=False)
# Perform the groupby and format the result.
result = df.groupby(['Type', 'Grouper'])['Rating'].mean().rename('MeanRating')
result = result.reset_index(level=1, drop=True).reset_index()
The resulting output:
   Type  MeanRating
0     A         7.5
1     A         6.5
2     B         5.5
3     B         4.5
I used the method='first' parameter of groupby.rank to handle the case of duplicate ratings within a ['Type', 'Participant'] group. You can omit it if this is not a possibility within your dataset, but it won't change the output if you leave it and there are no duplicates.
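As a quick illustration of that tie-breaking behaviour (the duplicated ratings below are made up for the example, not taken from the question):
import pandas as pd
# Hypothetical duplicate: participant 1 rates type A as 6 twice.
dup = pd.DataFrame({'Participant': [1, 1], 'Type': ['A', 'A'], 'Rating': [6, 6]})
# method='first' assigns ranks 1.0 and 2.0 in order of appearance instead of a
# tied 1.5, so the two identical ratings still fall into different Grouper buckets.
print(dup.groupby(['Type', 'Participant'])['Rating'].rank(method='first', ascending=False))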


Why do we need to add : when defining a new column using .iloc function

When we make a new column in a dataset in pandas
df["Max"] = df.iloc[:, 5:7].sum(axis=1)
If we are only getting the columns from index 5 to index 7, why do we need to pass : as well?
pandas.DataFrame.iloc is used purely for integer-location based indexing, i.e. selection by position (see the documentation). The : means all rows in the selected columns, here column indexes 5 and 6 (iloc is not inclusive of the last index).
You are using .iloc to take a slice out of the dataframe and apply an aggregate function across the columns of that slice.
Consider an example:
df = pd.DataFrame({"a":[0,1,2],"b":[2,3,4],"c":[4,5,6]})
df
would produce the following dataframe
a b c
0 0 2 4
1 1 3 5
2 2 4 6
You are using iloc to avoid dealing with named columns, so that
df.iloc[:,1:3]
would look as follows
b c
0 2 4
1 3 5
2 4 6
Now a slight modification of your code gets you the row-wise sums across those columns (which you could assign to a new column):
df.iloc[:,1:3].sum(axis=1)
0 6
1 8
2 10
Alternatively you could use function application:
df.apply(lambda x: x.iloc[1:3].sum(), axis=1)
0 6
1 8
2 10
Thus you explicitly tell pandas to apply the sum across columns. However, your original syntax is more succinct and preferable to explicit function application. The result is the same, as one would expect.
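To make the role of the : concrete, here is a small sketch with the same toy df from above: without the leading :, the slice is interpreted as a row selection rather than a column selection.
# All rows, columns at positions 1 and 2 (b and c).
df.iloc[:, 1:3]
# Without the ':', the same slice selects rows 1 and 2 instead, with all columns.
df.iloc[1:3]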

Pandas compute means by group using value closest to current date

I'm attempting to compute the mean by date for all categories. However, each category (called mygroup in the example) does not have a value for each date. I would like to use an apply in pandas to compute the mean at each date, filling in the value using the closest date less than or equal to the current date. For instance if I have:
pd.DataFrame({'date': ['1','2','3','6','1','3','4','5','1','2','3','4'],
              'mygroup': ['a','a','a','a','b','b','b','b','c','c','c','c'],
              'myval': [10,20,30,40,50,60,70,80,90,100,110,120]})
date mygroup myval
0 1 a 10
1 2 a 20
2 3 a 30
3 6 a 40
4 1 b 50
5 3 b 60
6 4 b 70
7 5 b 80
8 1 c 90
9 2 c 100
10 3 c 110
11 4 c 120
Computing the mean for date == 1 should be equal to (10 + 50 + 90)/3 = 50, which can be done with a typical groupby on date followed by a mean. However, for date == 6 I would like to use the last known values for each mygroup. The average for date == 6 would then be calculated as
(40 + 80 + 120)/3 = 80, since a has a value of 40 at date == 6, b has no value at date == 6 so its last known value (80, at date == 5) is used, and the last known value for c is 120 at date == 4. The final result should look like:
date meanvalue
1 50
2 56.67
3 66.67
4 73.33
5 76.67
6 80
Is it possible to compute the mean by date with a groupby and apply in this manner, using each mygroup and filling in with the last known value if there is no value for the current date? This will have to be done for thousands of dates and tens of thousands of categories, so for loops are to be avoided.
df.set_index(['mygroup', 'date']).unstack().ffill(axis=1) \
.stack().groupby(level=1).mean()
myval
date
1 50.000000
2 56.666667
3 66.666667
4 73.333333
5 76.666667
6 80.000000
set your index to the key columns
unstack the date level into columns
fill the gaps horizontally - you now have a dense matrix you can calculate against
put the date back into the index
group by date, which matches your expected output
apply the math - here you want a mean
The key point to remember, and one that is useful for a number of problems, is that stacking / unstacking / pivoting, etc. ("rubikscubing" your dataframe) always fills the gaps of a sparse format (like the columnar format you begin with) to produce a dense one full of NaNs.
So if you're able to do the calculation easily on a full dense matrix, then I encourage you to focus first on obtaining that dense matrix, so that you can do the easy math afterwards.
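As a minimal sketch of that idea with the df from the question (the same logic as the one-liner above, just spelled out so the intermediate dense matrix is visible; wide and dense are names introduced here):
# Wide view: one row per mygroup, one column per date, NaN where a group
# has no observation for that date.
wide = df.set_index(['mygroup', 'date'])['myval'].unstack('date')
# Forward-fill along the date axis so each group carries its last known value.
dense = wide.ffill(axis=1)
# Averaging each date column reproduces the per-date means (50, 56.67, ..., 80).
dense.mean(axis=0)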
You can convert all implicit missing values to explicit ones, fill them with a forward-fill scheme, and then do a normal groupby average:
from itertools import product
import pandas as pd
# get all combinations of date and mygroup using product function from itertools
all_combinations = list(product(df.date.drop_duplicates(), df.mygroup.drop_duplicates()))
# convert implicit missing values to explicit missing values by merging all combinations
# with original data frame
df1 = pd.merge(df, pd.DataFrame.from_records(all_combinations,
                                             columns=['date', 'mygroup']), 'outer')
# after sorting by group and date, forward-fill the missing myval entries,
# then take a normal groupby average by date
df1.sort_values(['mygroup', 'date']).ffill().groupby('date')[['myval']].mean()
# myval
#date
#1 50.000000
#2 56.666667
#3 66.666667
#4 73.333333
#5 76.666667
#6 80.000000
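If you prefer to build that explicit grid with pandas alone rather than itertools, here is a sketch of an equivalent construction (full_grid is a name introduced here):
# Cartesian product of all dates and groups, as a plain DataFrame.
full_grid = pd.MultiIndex.from_product(
    [df.date.unique(), df.mygroup.unique()],
    names=['date', 'mygroup']).to_frame(index=False)
# The outer merge turns the implicit missing (date, mygroup) rows into explicit
# NaNs, after which the same sort / ffill / groupby-mean steps apply.
df1 = pd.merge(df, full_grid, how='outer')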

Python pandas dataframe: find max for each unique values of an another column

I have a large dataframe (from 500k to 1M rows) which contains for example these 3 numeric columns: ID, A, B
I want to filter the results in order to obtain a table like the one in the image below, where, for each unique value of column id, I have the maximum and minimum value of A and B.
How can I do this?
EDIT: I have updated the image below to be clearer: when I get the max or min from a column, I also need the data associated with it from the other columns.
Sample data (note that you posted an image which can't be used by potential answerers without retyping, so I'm making a simple example in its place):
df=pd.DataFrame({ 'id':[1,1,1,1,2,2,2,2],
'a':range(8), 'b':range(8,0,-1) })
The key to this is just using idxmax and idxmin and then futzing with the indexes so that you can merge things in a readable way. Here's the whole answer, and you may wish to examine the intermediate dataframes to see how this works.
df_max = df.groupby('id').idxmax()
df_max['type'] = 'max'
df_min = df.groupby('id').idxmin()
df_min['type'] = 'min'
df2 = pd.concat([df_max, df_min]).set_index('type', append=True).stack().rename('index')
df3 = pd.concat([ df2.reset_index().drop('id',axis=1).set_index('index'),
df.loc[df2.values] ], axis=1 )
df3.set_index(['id','level_2','type']).sort_index()
                 a  b
id level_2 type
1  a       max   3  5
           min   0  8
   b       max   0  8
           min   3  5
2  a       max   7  1
           min   4  4
   b       max   4  4
           min   7  1
Note in particular that df2 looks like this:
id  type
1   max   a    3
          b    0
2   max   a    7
          b    4
1   min   a    0
          b    3
2   min   a    4
          b    7
The last column there holds the index values in df that were derived with idxmax & idxmin. So basically all the information you need is in df2. The rest of it is just a matter of merging back with df and making it more readable.
For anyone looking to get the min and max values of a specific column for each unique ID, this is how I modified the above code:
df_maxA = df.groupby('id')['A'].max().to_frame()
df_maxA['type'] = 'max'
df_minA = df.groupby('id')['A'].min().to_frame()
df_minA['type'] = 'min'
df_maxB = df.groupby('id')['B'].max().to_frame()
df_maxB['type'] = 'max'
df_minB = df.groupby('id')['B'].min().to_frame()
df_minB['type'] = 'min'
Then you can merge these together to create a single dataframe.
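For instance, one possible way to combine them (a sketch that assumes the df_maxA / df_minA / df_maxB / df_minB frames built above):
# Stack max and min for each column, then line A and B up by id and type.
a_stats = pd.concat([df_maxA, df_minA]).reset_index()   # columns: id, A, type
b_stats = pd.concat([df_maxB, df_minB]).reset_index()   # columns: id, B, type
summary = a_stats.merge(b_stats, on=['id', 'type'])     # one row per (id, type)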

Conditional selection of data in a pandas DataFrame

I have two columns in my pandas DataFrame.
A B
0 1 5
1 2 3
2 3 2
3 4 0
4 5 1
I need the value in A where the value of B is minimum. In the above case my answer would be 4 since the minimum B value is 0.
Can anyone help me with it?
To find the minimum in column B, you can use df.B.min(). For your DataFrame this returns 0.
To find values at particular locations in a DataFrame, you can use loc:
>>> df.loc[(df.B == df.B.min()), 'A']
3 4
Name: A, dtype: int64
So here, loc picks out all of the rows where column B is equal to its minimum value (df.B == df.B.min()) and selects the corresponding values in column A.
This method returns all values in A corresponding to the minimum value in B. If you only need to find one of the values, the better way is to use idxmin as #aus_lacy suggests.
Here's one way:
b_min = df.B.idxmin()
a_val = df.A[b_min]
idxmin() returns the index of the minimum value within column B. You then locate the value at that same index in column A.
or if you want a single, albeit less readable, line you can do it like:
a_val = df.A[df.B.idxmin()]
Also, as a disclaimer, this solution assumes that the minimum value in column B is unique. For example, if you were to have a data set that looked like this:
A B
1 2
2 5
3 0
4 3
5 0
My solution would return the first instance where B's minimum value is located, which in this case is the third row, with a corresponding A value of 3. If you believe that the minimum value of B may not be unique, then you should go with #ajcr's solution.
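A small sketch of that difference, assuming the table above is loaded as a DataFrame df with the default integer index:
# idxmin stops at the first minimum; loc keeps every row that attains it.
df.A[df.B.idxmin()]               # 3 (the first row where B == 0)
df.loc[df.B == df.B.min(), 'A']   # both 3 and 5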

Group counts. Why every column?

I often need to know how many entries I have in each group in a dataframe in Pandas. The following does it, but it returns one value for every column in my dataframe.
df.groupby(['A', 'B', 'C']).count()
That is, if I have, say 20 columns (where A, B and C are three of them), it would return 17 counts, all identical (at least every time I have done it) within each group.
What is the rationale behind this?
Is there any way to restrict the count to only one column? (or have it return only one value per group?)
Would that speed up the counts in any way?
The method DataFrameGroupBy.count doesn't seem to have an argument to specify which columns to count (I also could not find one in the API reference).
groupby(...).count() returns the count of non-null values in each column, so it can potentially differ from column to column.
example:
>>> df
jim joe jolie
0 4 NaN 4
1 8 0 NaN
>>> df.groupby('jim').count()
joe jolie
jim
4 0 1
8 1 0
.groupby(...).size() returns the size of each group.
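So, to get a single count per group, you can either select one column before counting or use size; a quick sketch with the example above:
# Count the non-null values of one column only.
df.groupby('jim')['joe'].count()
# Or count the rows in each group, regardless of nulls.
df.groupby('jim').size()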
