I have a pandas DataFrame as follows:
import pandas as pd
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue', 'blue', 'yellow', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)
I want to divide the numeric age and grade values by 125.0 in rows where favorite_color is blue, by 130.0 where it is yellow, and by 135.0 where it is green. The results must be inserted into new columns age_new and grade_new.
The code below raises an error.
df['age_new'] =(df.loc[df['favorite_color']=='blue']/125.0)
df['age_new'] =(df.loc[df['favorite_color']=='yellow']/130.0)
df['age_new'] =(df.loc[df['favorite_color']=='green']/135.0)
df['grade_new'] =(df.loc[df['favorite_color']=='blue']/125.0)
df['grade_new'] =(df.loc[df['favorite_color']=='yellow']/130.0)
df['grade_new'] =(df.loc[df['favorite_color']=='green']/135.0)
Error:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
Use map to look up the divisor for each color:
mods = {'blue': 125, 'yellow': 130, 'green': 135}
df.assign(
mods=df.favorite_color.map(mods),
age_new=lambda d: d.age / d.mods,
grade_new=lambda d: d.grade / d.mods
)
name age favorite_color grade mods age_new grade_new
0 Willard Morris 20 blue 88 125 0.160000 0.704000
1 Al Jennings 19 blue 92 125 0.152000 0.736000
2 Omar Mullins 22 yellow 95 130 0.169231 0.730769
3 Spencer McDaniel 21 green 70 135 0.155556 0.518519
Similarly, with div and join:
mods = {'blue': 125, 'yellow': 130, 'green': 135}
df.join(df[['age', 'grade']].div(df.favorite_color.map(mods), axis=0).add_suffix('_new'))
name age favorite_color grade age_new grade_new
0 Willard Morris 20 blue 88 0.160000 0.704000
1 Al Jennings 19 blue 92 0.152000 0.736000
2 Omar Mullins 22 yellow 95 0.169231 0.730769
3 Spencer McDaniel 21 green 70 0.155556 0.518519
You can use .replace instead of .loc, so that you only perform the operation once.
import pandas as pd
raw_data = {
'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue', 'blue', 'yellow', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)
color_d = {
"blue": 125,
"yellow": 130,
"green": 135
}
df[["age_new", "grade_new"]] = df[["age", "grade"]].div(
df['favorite_color'].replace(color_d),
axis=0)
df.head()
Which gives
name age favorite_color grade age_new grade_new
0 Willard Morris 20 blue 88 0.160000 0.704000
1 Al Jennings 19 blue 92 0.152000 0.736000
2 Omar Mullins 22 yellow 95 0.169231 0.730769
3 Spencer McDaniel 21 green 70 0.155556 0.518519
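As for why the original attempt fails: df.loc[df['favorite_color'] == 'blue'] selects whole rows, string columns included, so the division is also applied to name and favorite_color. Below is a minimal sketch that keeps the boolean-mask idea but restricts it to the numeric columns. The names divisors, src_cols and new_cols are illustrative and do not come from the question.

```python
import pandas as pd

raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
            'age': [20, 19, 22, 21],
            'favorite_color': ['blue', 'blue', 'yellow', 'green'],
            'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)

# Hypothetical helper names, chosen for this sketch only.
divisors = {'blue': 125.0, 'yellow': 130.0, 'green': 135.0}
src_cols = ['age', 'grade']
new_cols = ['age_new', 'grade_new']

# Create the target columns first so each mask only fills its own rows.
for col in new_cols:
    df[col] = float('nan')

for color, divisor in divisors.items():
    mask = df['favorite_color'] == color
    # Select only the numeric columns, then divide; to_numpy() avoids
    # label alignment between 'age'/'grade' and 'age_new'/'grade_new'.
    df.loc[mask, new_cols] = (df.loc[mask, src_cols] / divisor).to_numpy()

print(df)
```

The map and replace approaches above do the same thing in a single vectorized step, so this loop is mainly useful for seeing what the .loc version was missing.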
Related
I need to parse the values of a column in a data frame and save the first parsed section in a new column when the value contains a delimiter such as "-"; otherwise the new column should be left empty.
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'code': ['01-02-11-55-00115','11-02-11-55-00445','test', '31-0t-11-55-00115'],
'favorite_color': ['blue', 'blue', 'yellow', 'green'],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)
df.head()
I want to add a new column that holds the first parsed section. The expected column values are:
01
11
null
31
df['parsed'] = df['code'].apply(lambda x: x.split('-')[0] if '-' in x else 'null')
will output:
name code favorite_color grade parsed
0 Willard Morris 01-02-11-55-00115 blue 88 01
1 Al Jennings 11-02-11-55-00445 blue 92 11
2 Omar Mullins test yellow 95 null
3 Spencer McDaniel 31-0t-11-55-00115 green 70 31
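A vectorized variant using the .str accessor is also possible. The sketch below is an alternative to the apply call, not the answer's method, and it leaves a real missing value (NaN) where there is no delimiter instead of the string 'null'. The names has_delim and first_part are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'code': ['01-02-11-55-00115', '11-02-11-55-00445', 'test', '31-0t-11-55-00115'],
})

# True where the code contains a literal '-' (regex=False: no pattern matching).
has_delim = df['code'].str.contains('-', regex=False)

# First section of every code; for 'test' this is simply 'test'.
first_part = df['code'].str.split('-').str[0]

# Keep the first section only where a delimiter was present, else NaN.
df['parsed'] = first_part.where(has_delim)

print(df)
```

If the literal string 'null' is wanted instead of NaN, passing other='null' to where is one way to get it.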
I have a dataset that contains NBA players' average statistics per game. Some players' statistics are repeated because they played for different teams during the season.
For example:
Player Pos Age Tm G GS MP FG
8 Jarrett Allen C 22 TOT 28 10 26.2 4.4
9 Jarrett Allen C 22 BRK 12 5 26.7 3.7
10 Jarrett Allen C 22 CLE 16 5 25.9 4.9
I want to average Jarrett Allen's stats and put them into a single row. How can I do this?
You can groupby and use agg to get the mean. For the non numeric columns, let's take the first value:
df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
for k,v in df.dtypes[1:].items()})
output:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22 TOT 18.666667 6.666667 26.266667 4.333333
NB. content of the dictionary comprehension:
{'Pos': 'first',
'Age': 'mean',
'Tm': 'first',
'G': 'mean',
'GS': 'mean',
'MP': 'mean',
'FG': 'mean'}
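The comparison against 'int64' and 'float64' misses other numeric dtypes such as int32 or float32. A possible variant, sketched here on a hand-typed copy of the three example rows, builds the same dictionary with pandas.api.types.is_numeric_dtype. The name agg_spec is illustrative.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# The three example rows from the question, typed in by hand.
df = pd.DataFrame({
    'Player': ['Jarrett Allen'] * 3,
    'Pos': ['C', 'C', 'C'],
    'Age': [22, 22, 22],
    'Tm': ['TOT', 'BRK', 'CLE'],
    'G': [28, 12, 16],
    'GS': [10, 5, 5],
    'MP': [26.2, 26.7, 25.9],
    'FG': [4.4, 3.7, 4.9],
})

# 'mean' for any numeric dtype, 'first' for everything else;
# the grouping key itself is skipped.
agg_spec = {col: 'mean' if is_numeric_dtype(dtype) else 'first'
            for col, dtype in df.dtypes.items() if col != 'Player'}

result = df.groupby('Player').agg(agg_spec)
print(result)
```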
x = [['a', 12, 5],['a', 12, 7], ['b', 15, 10],['b', 15, 12],['c', 20, 1]]
import pandas as pd
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')
df2 = df.groupby(['name', 'age']).mean()
print(df2)
Output:
name age score
0 a 12 5
1 a 12 7
2 b 15 10
3 b 15 12
4 c 20 1
-----------
score
name age
a 12 6
b 15 11
c 20 1
Option 1
If one considers the dataframe df that OP shares in the question, the following will do the work:
df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22.0 TOT 18.666667 6.666667 26.266667 4.333333
This one uses:
pandas.DataFrame.groupby to group by the Player column
pandas.core.groupby.GroupBy.agg to aggregate the values based on a custom made lambda function.
pandas.api.types.is_string_dtype to check if a column is of string type
Let's test it with a new dataframe, df2, with more elements in the Player column.
import numpy as np
df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young', 'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
'Age': np.random.randint(0, 100, 9),
'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
'G': np.random.randint(0, 100, 9),
'GS': np.random.randint(0, 100, 9),
'MP': np.random.uniform(0, 100, 9),
'FG': np.random.uniform(0, 100, 9)})
[Out]:
Player Pos Age Tm G GS MP FG
0 John Collins PF 71 ATL 75 39 16.123225 77.949756
1 John Collins PF 60 ATL 49 49 30.308092 24.788401
2 John Collins PF 52 ATL 33 92 11.087317 58.488575
3 Trae Young PG 72 ATL 20 91 62.862313 60.169282
4 Trae Young PG 85 ATL 61 77 30.248551 85.169038
5 Clint Capela C 73 ATL 5 67 45.817690 21.966777
6 Jarrett Allen C 23 TOT 60 51 93.076624 34.160823
7 Jarrett Allen C 12 BRK 2 77 74.318568 78.755869
8 Jarrett Allen C 44 CLE 82 81 7.375631 40.930844
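If one tests the operation on df2, one will get output of the following form (df2 is built from unseeded random numbers, so the values below come from a different draw than the table printed above)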
df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Clint Capela C 95.000000 ATL 30.000000 98.000000 46.476398 17.987104
Jarrett Allen C 60.000000 TOT 48.666667 19.333333 70.050540 33.572896
John Collins PF 74.333333 ATL 50.333333 52.666667 78.181457 78.152235
Trae Young PG 57.500000 ATL 44.500000 47.500000 46.602543 53.835455
Option 2
Depending on the desired output, assuming that one only wants to group by player (independently of Age or Tm), a simpler solution would be to just group by and pass .mean() as follows
df_new3 = df.groupby('Player').mean(numeric_only=True)
[Out]:
Age G GS MP FG
Player
Jarrett Allen 22.0 18.666667 6.666667 26.266667 4.333333
Notes:
The output of this previous operation won't display non-numerical columns (apart from the Player name).
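If the non-numerical columns are still wanted next to the means, one possible sketch computes the two parts separately and joins them on the Player index. It assumes a pandas version where GroupBy.mean accepts numeric_only (to my understanding, recent versions raise a TypeError on string columns without it). The names means, firsts and combined are illustrative.

```python
import pandas as pd

# The three example rows from the question, typed in by hand.
df = pd.DataFrame({
    'Player': ['Jarrett Allen'] * 3,
    'Pos': ['C', 'C', 'C'],
    'Age': [22, 22, 22],
    'Tm': ['TOT', 'BRK', 'CLE'],
    'G': [28, 12, 16],
    'GS': [10, 5, 5],
    'MP': [26.2, 26.7, 25.9],
    'FG': [4.4, 3.7, 4.9],
})

# Means of the numeric columns only.
means = df.groupby('Player').mean(numeric_only=True)

# First value per player for the text columns.
firsts = df.groupby('Player')[['Pos', 'Tm']].first()

# Both frames are indexed by Player, so a plain join lines them up.
combined = firsts.join(means)
print(combined)
```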
Below is the dataframe (df). I want to save a sample of 3 rows from each category of the 'country' column.
My code is below, but it does not sample per category. I need a single CSV file containing the samples. Please suggest a fix.
data = {'country':['India', 'Nepal', 'Canada', 'USA','India', 'Nepal', 'Canada', 'USA','India', 'Nepal', 'Canada', 'USA','India', 'Nepal', 'Canada', 'USA','India', 'Nepal', 'Canada', 'USA'],
'Age':[20, 21, 19, 18,20, 21, 19, 18,20, 21, 19, 18,20, 21, 19, 18,20, 21, 19, 18]}
df = pd.DataFrame(data)
df.sample(n=3).to_csv('sampledata.csv', na_rep='NA', index=False)
GroupBy and then sample
df.groupby('country').sample(3)
country Age
2 Canada 19
6 Canada 19
10 Canada 19
4 India 20
0 India 20
12 India 20
1 Nepal 21
13 Nepal 21
9 Nepal 21
3 USA 18
11 USA 18
19 USA 18
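To get the single CSV the question asks for, the grouped sample can be passed straight to to_csv. The sketch below writes to an in-memory buffer so it runs anywhere; passing a path such as 'sampledata.csv' instead of the buffer would write the file. The random_state value is an arbitrary choice added here for repeatability, and the names sampled, buffer and csv_text are illustrative.

```python
import io

import pandas as pd

data = {'country': ['India', 'Nepal', 'Canada', 'USA'] * 5,
        'Age': [20, 21, 19, 18] * 5}
df = pd.DataFrame(data)

# Three rows per country; random_state is only there to make reruns repeatable.
sampled = df.groupby('country').sample(n=3, random_state=0)

# Write all sampled rows to one CSV. Replace `buffer` with a file path
# (for example 'sampledata.csv') to write to disk instead.
buffer = io.StringIO()
sampled.to_csv(buffer, na_rep='NA', index=False)
csv_text = buffer.getvalue()

print(csv_text)
```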
I have multiple columns of data and I want to find where a value lies within its "class" as well as overall.
Here is some example data (let's assume the "class" we're measuring against is eye_color and the metric is score):
raw_data = {'name': ['Alex', 'Alicia', 'Omar', 'Louise', 'Alice'],
'age': [20, 19, 35, 24, 32],
'eye_color': ['blue', 'blue', 'brown', "green", "brown"],
'score': [88, 92, 95, 70, 96]}
df = pd.DataFrame(raw_data)
df = df.sort_values(['eye_color', 'score'], ascending=[True, False])
I want to create a column that would use the current sort order to give a value of "Brown1" for Alice, "Brown2" for Omar, "Green1" for Louise, etc.
I'm not sure how to approach and am fairly sure there's an easy way to do it before I overengineer a function that re-sorts based on each class and then recreates an index or something...
Use groupby().cumcount():
df['new'] = df['eye_color'] + df.groupby('eye_color').cumcount().add(1).astype(str)
Output:
name age eye_color score new
1 Alicia 19 blue 92 blue1
0 Alex 20 blue 88 blue2
4 Alice 32 brown 96 brown1
2 Omar 35 brown 95 brown2
3 Louise 24 green 70 green1
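The question also asks where a value lies overall. A possible sketch using rank, which does not depend on the frame being sorted first, is shown below. The column names class_rank, overall_rank and label are illustrative, and str.capitalize is used only to match the "Brown1" spelling from the question.

```python
import pandas as pd

raw_data = {'name': ['Alex', 'Alicia', 'Omar', 'Louise', 'Alice'],
            'age': [20, 19, 35, 24, 32],
            'eye_color': ['blue', 'blue', 'brown', 'green', 'brown'],
            'score': [88, 92, 95, 70, 96]}
df = pd.DataFrame(raw_data)

# Rank within each eye_color, highest score first; method='first'
# breaks ties by order of appearance so the ranks stay whole numbers.
df['class_rank'] = (df.groupby('eye_color')['score']
                      .rank(ascending=False, method='first')
                      .astype(int))

# Rank across the whole frame, highest score first.
df['overall_rank'] = df['score'].rank(ascending=False, method='first').astype(int)

# Label such as 'Brown1', built from the class and the in-class rank.
df['label'] = df['eye_color'].str.capitalize() + df['class_rank'].astype(str)

print(df.sort_values(['eye_color', 'class_rank']))
```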