I am a newbie here; English is not my native language, so please excuse any grammatical mistakes. I need to compute the average BMI per hair colour using the df below.
# 1. Here we import pandas
import pandas as pd
# 2. Here we import numpy
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'Age': [18, 21, 28, 19, 23, 22, 18, 24, 25, 20],
                   'Hair colour': ['Blonde', 'Brown', 'Black', 'Blonde', 'Blonde',
                                   'Black', 'Brown', 'Brown', 'Black', 'Black'],
                   'Length (in cm)': np.random.normal(175, 10, 10).round(1),
                   'Weight (in kg)': np.random.normal(70, 5, 10).round(1)},
                  index=['Leon', 'Mirta', 'Nathan', 'Linda', 'Bandar',
                         'Violeta', 'Noah', 'Niji', 'Lucy', 'Mark'])
This should give me a DataFrame with a named row for each person.
First, I wrote a function for BMI:
# function
def BMI():
    df['weight (in kg)'] / (df['Length']/100)**2
However, I don't know what my next step is.
Can you advise me on how to find the average BMI per hair colour?
You can use df.groupby(), which is built-in pandas functionality.
For your particular case, once you have added a BMI column to df, you can use
df.groupby('Hair colour')['BMI'].mean()
(selecting the BMI column before aggregating, so only that column's mean is computed), which gives the output
Hair colour
Black 23.003356
Blonde 18.806844
Brown 23.271460
Name: BMI, dtype: float64
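Putting the pieces together, here is a minimal end-to-end version (the missing step in the question is adding the BMI column to df before grouping):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(
    {'Age': [18, 21, 28, 19, 23, 22, 18, 24, 25, 20],
     'Hair colour': ['Blonde', 'Brown', 'Black', 'Blonde', 'Blonde',
                     'Black', 'Brown', 'Brown', 'Black', 'Black'],
     'Length (in cm)': np.random.normal(175, 10, 10).round(1),
     'Weight (in kg)': np.random.normal(70, 5, 10).round(1)},
    index=['Leon', 'Mirta', 'Nathan', 'Linda', 'Bandar',
           'Violeta', 'Noah', 'Niji', 'Lucy', 'Mark'])

# BMI = weight in kg divided by height in metres squared
df['BMI'] = df['Weight (in kg)'] / (df['Length (in cm)'] / 100) ** 2

# mean BMI for each hair colour
means = df.groupby('Hair colour')['BMI'].mean()
print(means)
```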
You can either filter or use groupby.
Your BMI function does not make sense, because it:
references columns that do not exist ('weight (in kg)' and 'Length' instead of 'Weight (in kg)' and 'Length (in cm)')
does nothing with the computed value, so the result is silently discarded (nothing is returned or assigned)
Filtering:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'Age':[18, 21, 28, 19, 23, 22, 18, 24, 25, 20],
'Hair colour':['Blonde', 'Brown', 'Black', 'Blonde',
'Blonde', 'Black','Brown', 'Brown', 'Black',
'Black'],
'Length (in cm)':np.random.normal(175, 10, 10).round(1),
'Weight (in kg)':np.random.normal(70, 5, 10).round(1)},
index = ['Leon', 'Mirta', 'Nathan', 'Linda', 'Bandar',
'Violeta', 'Noah', 'Niji', 'Lucy', 'Mark'],)
print(df)
# calculate BMI - not as function, using correct column names
df["BMI"] = df['Weight (in kg)'] / (df['Length (in cm)']/100)**2
print(df)
# filter to brown
brown = df[df["Hair colour"] == "Brown"]
print(brown)
print(brown["BMI"].mean())
Output:
# calculated BMI
Age Hair colour Length (in cm) Weight (in kg) BMI
Leon 18 Blonde 192.6 70.7 19.059296
Mirta 21 Brown 179.0 77.3 24.125339
Nathan 28 Black 184.8 73.8 21.609884
Linda 19 Blonde 197.4 70.6 18.118006
Bandar 23 Blonde 193.7 72.2 19.243229
Violeta 22 Black 165.2 71.7 26.272359
Noah 18 Brown 184.5 77.5 22.767165
Niji 24 Brown 173.5 69.0 22.921875
Lucy 25 Black 174.0 71.6 23.649095
Mark 20 Black 179.1 65.7 20.482087
# filtered output
Age Hair colour Length (in cm) Weight (in kg) BMI
Mirta 21 Brown 179.0 77.3 24.125339
Noah 18 Brown 184.5 77.5 22.767165
Niji 24 Brown 173.5 69.0 22.921875
# avg BMI
23.271459786871446
Groupby:
# use groupby
grouped = df.groupby('Hair colour')
print(*grouped, sep="\n\n")
# https://stackoverflow.com/questions/51091331
print(grouped.get_group("Brown")["BMI"].mean())
Output:
# grouped output
('Black', Age Hair colour Length (in cm) Weight (in kg) BMI
Nathan 28 Black 184.8 73.8 21.609884
Violeta 22 Black 165.2 71.7 26.272359
Lucy 25 Black 174.0 71.6 23.649095
Mark 20 Black 179.1 65.7 20.482087)
('Blonde', Age Hair colour Length (in cm) Weight (in kg) BMI
Leon 18 Blonde 192.6 70.7 19.059296
Linda 19 Blonde 197.4 70.6 18.118006
Bandar 23 Blonde 193.7 72.2 19.243229)
('Brown', Age Hair colour Length (in cm) Weight (in kg) BMI
Mirta 21 Brown 179.0 77.3 24.125339
Noah 18 Brown 184.5 77.5 22.767165
Niji 24 Brown 173.5 69.0 22.921875)
# avg BMI
23.271459786871446
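get_group is handy for inspecting a single group, but if you want the average for every hair colour at once (and the group sizes as a sanity check), agg accepts several statistics in one call. A sketch on the same seeded data:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(
    {'Hair colour': ['Blonde', 'Brown', 'Black', 'Blonde', 'Blonde',
                     'Black', 'Brown', 'Brown', 'Black', 'Black'],
     'Length (in cm)': np.random.normal(175, 10, 10).round(1),
     'Weight (in kg)': np.random.normal(70, 5, 10).round(1)})
df['BMI'] = df['Weight (in kg)'] / (df['Length (in cm)'] / 100) ** 2

# one aggregation call covers every colour; no per-group get_group needed
stats = df.groupby('Hair colour')['BMI'].agg(['mean', 'count'])
print(stats)
```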
Related
I have a dataset that contains NBA players' average statistics per game. Some players' statistics are repeated because they've been on different teams during the season.
For example:
Player Pos Age Tm G GS MP FG
8 Jarrett Allen C 22 TOT 28 10 26.2 4.4
9 Jarrett Allen C 22 BRK 12 5 26.7 3.7
10 Jarrett Allen C 22 CLE 16 5 25.9 4.9
I want to average Jarrett Allen's stats and put them into a single row. How can I do this?
You can use groupby and agg to get the mean. For the non-numeric columns, let's take the first value:
df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
for k,v in df.dtypes[1:].items()})
output:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22 TOT 18.666667 6.666667 26.266667 4.333333
NB. content of the dictionary comprehension:
{'Pos': 'first',
'Age': 'mean',
'Tm': 'first',
'G': 'mean',
'GS': 'mean',
'MP': 'mean',
'FG': 'mean'}
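For reference, here is a self-contained version with the example rows from the question typed in, so the comprehension can be run as-is:

```python
import pandas as pd

# the three Jarrett Allen rows from the question
df = pd.DataFrame({'Player': ['Jarrett Allen'] * 3,
                   'Pos': ['C', 'C', 'C'],
                   'Age': [22, 22, 22],
                   'Tm': ['TOT', 'BRK', 'CLE'],
                   'G': [28, 12, 16],
                   'GS': [10, 5, 5],
                   'MP': [26.2, 26.7, 25.9],
                   'FG': [4.4, 3.7, 4.9]})

# 'mean' for numeric columns, 'first' for everything else;
# df.dtypes[1:] skips the grouping column 'Player'
agg_map = {k: 'mean' if v in ('int64', 'float64') else 'first'
           for k, v in df.dtypes[1:].items()}
out = df.groupby('Player').agg(agg_map)
print(out)
```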
x = [['a', 12, 5],['a', 12, 7], ['b', 15, 10],['b', 15, 12],['c', 20, 1]]
import pandas as pd
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')
df2 = df.groupby(['name', 'age']).mean()
print(df2)
Output:
name age score
0 a 12 5
1 a 12 7
2 b 15 10
3 b 15 12
4 c 20 1
-----------
score
name age
a 12 6
b 15 11
c 20 1
Option 1
If one considers the dataframe df that the OP shares in the question, the following will do the job:
df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22.0 TOT 18.666667 6.666667 26.266667 4.333333
This one uses:
pandas.DataFrame.groupby to group by the Player column
pandas.core.groupby.GroupBy.agg to aggregate the values with a custom lambda function
pandas.api.types.is_string_dtype to check whether a column is of string type
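A quick check of how the dtype test behaves. Note that is_string_dtype reports True for plain object dtype (which is what string columns usually are), so in the lambda it routes string-like columns to iloc[0] and everything else to mean():

```python
import pandas as pd

s_str = pd.Series(['PF', 'PG', 'C'])   # object dtype
s_num = pd.Series([22, 23, 24])        # int64 dtype

print(pd.api.types.is_string_dtype(s_str.dtype))  # True
print(pd.api.types.is_string_dtype(s_num.dtype))  # False
```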
Let's test it with a new dataframe, df2, with more elements in the Player column.
import numpy as np
df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young', 'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
'Age': np.random.randint(0, 100, 9),
'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
'G': np.random.randint(0, 100, 9),
'GS': np.random.randint(0, 100, 9),
'MP': np.random.uniform(0, 100, 9),
'FG': np.random.uniform(0, 100, 9)})
[Out]:
Player Pos Age Tm G GS MP FG
0 John Collins PF 71 ATL 75 39 16.123225 77.949756
1 John Collins PF 60 ATL 49 49 30.308092 24.788401
2 John Collins PF 52 ATL 33 92 11.087317 58.488575
3 Trae Young PG 72 ATL 20 91 62.862313 60.169282
4 Trae Young PG 85 ATL 61 77 30.248551 85.169038
5 Clint Capela C 73 ATL 5 67 45.817690 21.966777
6 Jarrett Allen C 23 TOT 60 51 93.076624 34.160823
7 Jarrett Allen C 12 BRK 2 77 74.318568 78.755869
8 Jarrett Allen C 44 CLE 82 81 7.375631 40.930844
If one tests the operation on df2, one will get the following
df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Clint Capela C 95.000000 ATL 30.000000 98.000000 46.476398 17.987104
Jarrett Allen C 60.000000 TOT 48.666667 19.333333 70.050540 33.572896
John Collins PF 74.333333 ATL 50.333333 52.666667 78.181457 78.152235
Trae Young PG 57.500000 ATL 44.500000 47.500000 46.602543 53.835455
Option 2
Depending on the desired output, assuming that one only wants to group by player (independently of Age or Tm), a simpler solution would be to just group by and pass .mean() as follows
df_new3 = df.groupby('Player').mean()
[Out]:
Age G GS MP FG
Player
Jarrett Allen 22.0 18.666667 6.666667 26.266667 4.333333
Notes:
The output of this previous operation won't display non-numerical columns (apart from the Player name).
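One version-dependent caveat: in pandas 2.0 and later, .mean() on a groupby raises a TypeError when non-numeric columns are present instead of silently dropping them, so you may need to pass numeric_only=True explicitly. A sketch (data from the question's example rows):

```python
import pandas as pd

df = pd.DataFrame({'Player': ['Jarrett Allen'] * 3,
                   'Pos': ['C', 'C', 'C'],   # non-numeric column
                   'Age': [22, 22, 22],
                   'G': [28, 12, 16]})

# drops 'Pos' instead of raising on it
out = df.groupby('Player').mean(numeric_only=True)
print(out)
```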
Can someone help me identify the problem? I have written the code below:
import numpy as np
import pandas as pd
retail = pd.read_csv('online_retail2.csv')
retail.groupby(['Country','Description'])['Quantity','Price'].agg([np.mean,max])
retail.loc[('Australia','DOLLY GIRL BEAKER'),('Quantity','mean')]
The groupby function has output:
Out[36]:
Quantity Price
mean max mean max
Country Description
Australia DOLLY GIRL BEAKER 200.0 200 1.08 1.08
I LOVE LONDON MINI BACKPACK 4.0 4 4.15 4.15
10 COLOUR SPACEBOY PEN 48.0 48 0.85 0.85
12 PENCIL SMALL TUBE WOODLAND 384.0 384 0.55 0.55
12 PENCILS SMALL TUBE RED SPOTTY 24.0 24 0.65 0.65
... ... ... ...
West Indies VINTAGE BEAD PINK SCARF 3.0 3 7.95 7.95
WHITE AND BLUE CERAMIC OIL BURNER 6.0 6 1.25 1.25
WOODLAND PARTY BAG + STICKER SET 1.0 1 1.65 1.65
WOVEN BERRIES CUSHION COVER 2.0 2 4.95 4.95
WOVEN FROST CUSHION COVER 2.0 2 4.95 4.95
[30696 rows x 4 columns]
while the .loc function resulted in the below error:
KeyError: "None of [Index(['Australia', 'DOLLY GIRL BEAKER'], dtype='object')] are in the [index]"
I think it's because you are not saving the result of the groupby + aggregation to a new variable. groupby + aggregation is not an in-place operation: it creates a new dataframe, and you need to assign the result, otherwise it is just computed, printed, and discarded. With your current code you are trying to index your initial dataframe retail, which causes the error.
You can modify your code as follows :
import numpy as np
import pandas as pd
retail = pd.read_csv('online_retail2.csv')
retail_aggregated = retail.groupby(['Country','Description'])[['Quantity','Price']].agg([np.mean,max])
Then you can index your aggregated dataframe as you want :
retail_aggregated.loc[('Australia','DOLLY GIRL BEAKER'),('Quantity','mean')]
Edit: a full working example.
import numpy as np
import pandas as pd
import random
random.seed(123)
np.random.seed(123)
# Here I generate a random dataframe
retail = pd.DataFrame({
"Country": [random.choice(["Australia", "West Indies"]) for _ in range(100)],
"Description": [random.choice([
"DOLLY GIRL BEAKER", "DOLLY GIRL BEAKER", "COLOUR SPACEBOY PEN", "VINTAGE BEAD PINK SCARF", "WOODLAND PARTY BAG + STICKER SET"
]) for _ in range(100)],
"Quantity": np.random.randint(1, 10, 100),
"Price": np.random.randint(1, 100, 100),
})
# Then I groupby and compute aggregate
retail_gp = retail.groupby(['Country','Description'])[['Quantity','Price']].agg([np.mean,max])
retail_gp.loc[('Australia','DOLLY GIRL BEAKER'),('Quantity','mean')]
Output :
4.894736842105263
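As an aside, recent pandas versions emit a FutureWarning when NumPy functions like np.mean (or the builtin max) are passed to agg, and suggest the string names instead. The same aggregation can be written as below; the data here is made up for illustration:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
retail = pd.DataFrame({
    'Country': ['Australia'] * 4 + ['West Indies'] * 4,
    'Description': ['DOLLY GIRL BEAKER', 'DOLLY GIRL BEAKER',
                    'WOVEN FROST CUSHION COVER', 'WOVEN FROST CUSHION COVER'] * 2,
    'Quantity': np.random.randint(1, 10, 8),
    'Price': np.random.randint(1, 100, 8),
})

# string aggregator names instead of np.mean / max
retail_gp = (retail.groupby(['Country', 'Description'])[['Quantity', 'Price']]
             .agg(['mean', 'max']))
print(retail_gp.loc[('Australia', 'DOLLY GIRL BEAKER'), ('Quantity', 'mean')])
```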
I have multiple columns of data and I want to find where a value lies within its "class" as well as overall.
Here is some example data (let's assume the "class" we're measuring against is eye_color and the metric is score):
raw_data = {'name': ['Alex', 'Alicia', 'Omar', 'Louise', 'Alice'],
'age': [20, 19, 35, 24, 32],
'eye_color': ['blue', 'blue', 'brown', "green", "brown"],
'score': [88, 92, 95, 70, 96]}
df = pd.DataFrame(raw_data)
df = df.sort_values(['eye_color', 'score'], ascending=[True, False])
I want to create a column that would use the current sort order to give a value of "Brown1" for Alice, "Brown2" for Omar, "Green1" for Louise, etc.
I'm not sure how to approach this, and I'm fairly sure there's an easy way to do it before I overengineer a function that re-sorts based on each class and then recreates an index or something...
Use groupby().cumcount():
df['new'] = df['eye_color'] + df.groupby('eye_color').cumcount().add(1).astype(str)
Output:
name age eye_color score new
1 Alicia 19 blue 92 blue1
0 Alex 20 blue 88 blue2
4 Alice 32 brown 96 brown1
2 Omar 35 brown 95 brown2
3 Louise 24 green 70 green1
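The question literally asked for "Brown1" / "Green1"; since eye_color is lowercase in the data, one way to match that casing (assuming that is what is wanted) is to capitalize the colour first:

```python
import pandas as pd

raw_data = {'name': ['Alex', 'Alicia', 'Omar', 'Louise', 'Alice'],
            'age': [20, 19, 35, 24, 32],
            'eye_color': ['blue', 'blue', 'brown', 'green', 'brown'],
            'score': [88, 92, 95, 70, 96]}
df = pd.DataFrame(raw_data)
df = df.sort_values(['eye_color', 'score'], ascending=[True, False])

# capitalize the colour, then append the 1-based rank within each colour
df['new'] = (df['eye_color'].str.capitalize()
             + df.groupby('eye_color').cumcount().add(1).astype(str))
print(df)
```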
Take the DataFrame in the answer of Loc vs. iloc vs. ix vs. at vs. iat? for example.
df = pd.DataFrame(
{'age':[30, 2, 12, 4, 32, 33, 69],
'color':['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
'food':['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
'height':[165, 70, 120, 80, 180, 172, 150],
'score':[4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
'state':['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX']},
index=['Jane', 'Nick', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia']
)
Now I want all columns except 'food' and 'height'.
I thought something like df.loc[:,['age':'color', 'score':'state']] would work, but Python returns SyntaxError: invalid syntax.
I am aware of that there is one way to work around: df.drop(columns = ['food', 'height']). However, in my real life situation, I have hundreds of columns to be dropped. Typing out all column names is so inefficient.
I am expecting something similar with dplyr::select(df, -(food:height)) or dplyr::select(df, age:color, score:state) in R language.
Also have read Selecting/Excluding sets of columns in Pandas.
First, find all columns lying between food and height (inclusive).
c = df.loc[:, 'food':'height'].columns
Next, filter with difference, isin, or setdiff1d:
df[df.columns.difference(c)]
Or,
df.loc[:, ~df.columns.isin(c)]
Or,
df[np.setdiff1d(df.columns, c)]
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
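One difference between these three worth knowing: columns.difference and np.setdiff1d return the remaining columns in sorted order (it is invisible in the example above only because age, color, score, state happen to be alphabetical already), while the isin mask keeps the dataframe's original column order. A sketch with deliberately unsorted hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame(columns=['b_keep', 'x_drop', 'a_keep', 'y_drop'])
c = ['x_drop', 'y_drop']

print(list(df[df.columns.difference(c)].columns))    # sorted: ['a_keep', 'b_keep']
print(list(df.loc[:, ~df.columns.isin(c)].columns))  # original order: ['b_keep', 'a_keep']
```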
First get the positions of the column names with Index.get_loc, and then use numpy.r_ to join all the slicers together:
a = np.r_[df.columns.get_loc('age'):df.columns.get_loc('color')+1,
df.columns.get_loc('score'):df.columns.get_loc('state')+1]
df = df.iloc[:, a]
print(df)
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
One option for flexible column selection is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.select_columns(slice('age', 'color'), slice('score', 'state'))
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX
df.select_columns(slice('food', 'height'), invert = True)
age color score state
Jane 30 blue 4.6 NY
Nick 2 green 8.3 TX
Aaron 12 red 9.0 FL
Penelope 4 white 3.3 AL
Dean 32 gray 1.8 AK
Christina 33 black 9.5 TX
Cornelia 69 red 2.2 TX