How to calculate specific cell values in python pandas?


There is a pandas DataFrame as follows:
import pandas as pd
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue', 'blue', 'yellow', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)
I want to divide the numeric age and grade values by 125.0 in rows where favorite_color is blue, by 130.0 where it is yellow, and by 135.0 where it is green. The results must be inserted in new columns age_new and grade_new.
With the code below I receive an error.
df['age_new'] =(df.loc[df['favorite_color']=='blue']/125.0)
df['age_new'] =(df.loc[df['favorite_color']=='yellow']/130.0)
df['age_new'] =(df.loc[df['favorite_color']=='green']/135.0)
df['grade_new'] =(df.loc[df['favorite_color']=='blue']/125.0)
df['grade_new'] =(df.loc[df['favorite_color']=='yellow']/130.0)
df['grade_new'] =(df.loc[df['favorite_color']=='green']/135.0)
Error:
TypeError: unsupported operand type(s) for /: 'str' and 'int'

Use map to look up the divisor for each color:
mods = {'blue': 125, 'yellow': 130, 'green': 135}
df.assign(
mods=df.favorite_color.map(mods),
age_new=lambda d: d.age / d.mods,
grade_new=lambda d: d.grade / d.mods
)
               name  age favorite_color  grade  mods   age_new  grade_new
0    Willard Morris   20           blue     88   125  0.160000   0.704000
1       Al Jennings   19           blue     92   125  0.152000   0.736000
2      Omar Mullins   22         yellow     95   130  0.169231   0.730769
3  Spencer McDaniel   21          green     70   135  0.155556   0.518519
A similar approach, using div and join:
mods = {'blue': 125, 'yellow': 130, 'green': 135}
df.join(df[['age', 'grade']].div(df.favorite_color.map(mods), axis=0).add_suffix('_new'))
               name  age favorite_color  grade   age_new  grade_new
0    Willard Morris   20           blue     88  0.160000   0.704000
1       Al Jennings   19           blue     92  0.152000   0.736000
2      Omar Mullins   22         yellow     95  0.169231   0.730769
3  Spencer McDaniel   21          green     70  0.155556   0.518519
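One caveat with map is that a color missing from the dictionary becomes NaN, so the new columns are NaN for that row as well. The sketch below illustrates this with a made-up color ('purple') and a hypothetical fallback divisor; neither is part of the original question.

```python
import pandas as pd

mods = {'blue': 125, 'yellow': 130, 'green': 135}

# 'purple' is a hypothetical color that has no entry in mods.
colors = pd.Series(['blue', 'purple'])

# Keys that are not found in the dictionary are mapped to NaN.
mapped = colors.map(mods)

# Assumption for illustration only: fall back to a divisor of 1
# (i.e. leave the value unchanged) when the color is unknown.
default_divisor = 1
safe = mapped.fillna(default_divisor)

print(mapped)
print(safe)
```

Whether a fallback makes sense depends on the data; leaving the NaN in place makes unmapped colors easy to spot.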

You can use .replace instead of .loc, so that you only perform the operation once.
import pandas as pd
raw_data = {
'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue', 'blue', 'yellow', "green"],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)
color_d = {
"blue": 125,
"yellow": 130,
"green": 135
}
df[["age_new", "grade_new"]] = df[["age", "grade"]].div(
df['favorite_color'].replace(color_d),
axis=0)
df.head()
Which gives
               name  age favorite_color  grade   age_new  grade_new
0    Willard Morris   20           blue     88  0.160000   0.704000
1       Al Jennings   19           blue     92  0.152000   0.736000
2      Omar Mullins   22         yellow     95  0.169231   0.730769
3  Spencer McDaniel   21          green     70  0.155556   0.518519
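As for why the original attempt fails: df.loc[df['favorite_color']=='blue'] selects whole rows, including the string columns name and favorite_color, so the division is attempted on strings. If one prefers to stay with .loc, a minimal sketch is to select the target column together with the row mask. The loop and the divisors name are illustrative choices, not taken from the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
    'age': [20, 19, 22, 21],
    'favorite_color': ['blue', 'blue', 'yellow', 'green'],
    'grade': [88, 92, 95, 70],
})

divisors = {'blue': 125.0, 'yellow': 130.0, 'green': 135.0}

# Create the target columns first so every row has a defined (NaN) value.
df['age_new'] = float('nan')
df['grade_new'] = float('nan')

for color, divisor in divisors.items():
    mask = df['favorite_color'] == color
    # Select rows AND a single numeric column, so no strings are divided.
    df.loc[mask, 'age_new'] = df.loc[mask, 'age'] / divisor
    df.loc[mask, 'grade_new'] = df.loc[mask, 'grade'] / divisor

print(df)
```

This runs one pass per color, so the map and replace versions above are usually the more compact choice.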

Related

Parse the column value and save the first section in new column

I need to parse the values of a column in a data frame and save the first parsed section in a new column when the value contains a delimiter such as "-"; otherwise the new column should be left empty.
raw_data = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'code': ['01-02-11-55-00115','11-02-11-55-00445','test', '31-0t-11-55-00115'],
'favorite_color': ['blue', 'blue', 'yellow', 'green'],
'grade': [88, 92, 95, 70]}
df = pd.DataFrame(raw_data)
df.head()
I want to add a new column that holds the first parsed section. The expected column values are:
01
11
null
31

df['parsed'] = df['code'].apply(lambda x: x.split('-')[0] if '-' in x else 'null')
will output:
               name               code favorite_color  grade parsed
0    Willard Morris  01-02-11-55-00115           blue     88     01
1       Al Jennings  11-02-11-55-00445           blue     92     11
2      Omar Mullins               test         yellow     95   null
3  Spencer McDaniel  31-0t-11-55-00115          green     70     31
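A vectorized alternative, sketched here as one possible variation rather than the method used above, is Series.str.extract with a regular expression. It leaves a real missing value (NaN) instead of the string 'null' when there is no delimiter. The pattern is an assumption about the code format: one or more non-dash characters followed by a dash.

```python
import pandas as pd

codes = pd.Series(['01-02-11-55-00115', '11-02-11-55-00445', 'test', '31-0t-11-55-00115'])

# Capture everything before the first '-'; rows without a '-' do not match
# and therefore become NaN. expand=False returns a Series, not a DataFrame.
parsed = codes.str.extract(r'^([^-]+)-', expand=False)

print(parsed)
```

If the literal string 'null' is required, parsed.fillna('null') would restore it.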

How to find sum of Few Columns in pandas Dataframe and leave the rest as it is? [duplicate]

I have a dataset that contains NBA players' average statistics per game. Some players' statistics are repeated because they played for different teams during the season.
For example:
           Player Pos  Age   Tm   G  GS    MP   FG
8   Jarrett Allen   C   22  TOT  28  10  26.2  4.4
9   Jarrett Allen   C   22  BRK  12   5  26.7  3.7
10  Jarrett Allen   C   22  CLE  16   5  25.9  4.9
I want to average Jarrett Allen's stats and put them into a single row. How can I do this?

You can groupby and use agg to get the mean. For the non numeric columns, let's take the first value:
df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
for k,v in df.dtypes[1:].items()})
output:
              Pos   Age   Tm          G        GS         MP        FG
Player
Jarrett Allen   C  22.0  TOT  18.666667  6.666667  26.266667  4.333333
NB. content of the dictionary comprehension:
{'Pos': 'first',
'Age': 'mean',
'Tm': 'first',
'G': 'mean',
'GS': 'mean',
'MP': 'mean',
'FG': 'mean'}
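To try this end to end, the three rows from the question can be rebuilt by hand. The sketch below also swaps the dtype-name comparison for select_dtypes(include='number'), which is an illustrative variation that covers other numeric dtypes such as int32; the names numeric_cols and agg_spec are made up for this example.

```python
import pandas as pd

# The three rows shown in the question, typed in manually.
df = pd.DataFrame({
    'Player': ['Jarrett Allen'] * 3,
    'Pos': ['C'] * 3,
    'Age': [22] * 3,
    'Tm': ['TOT', 'BRK', 'CLE'],
    'G': [28, 12, 16],
    'GS': [10, 5, 5],
    'MP': [26.2, 26.7, 25.9],
    'FG': [4.4, 3.7, 4.9],
})

# Columns with any numeric dtype get 'mean'; everything else gets 'first'.
numeric_cols = df.select_dtypes(include='number').columns
agg_spec = {
    col: ('mean' if col in numeric_cols else 'first')
    for col in df.columns
    if col != 'Player'
}

result = df.groupby('Player').agg(agg_spec)
print(result)
```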

import pandas as pd
x = [['a', 12, 5],['a', 12, 7], ['b', 15, 10],['b', 15, 12],['c', 20, 1]]
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')
df2 = df.groupby(['name', 'age']).mean()
print(df2)
Output:
  name  age  score
0    a   12      5
1    a   12      7
2    b   15     10
3    b   15     12
4    c   20      1
-----------
          score
name age
a    12     6.0
b    15    11.0
c    20     1.0
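In the result above, name and age end up in the index. If ordinary columns are preferred, one option (shown as a sketch on the same toy data) is to pass as_index=False to groupby; the variable name flat is only illustrative.

```python
import pandas as pd

x = [['a', 12, 5], ['a', 12, 7], ['b', 15, 10], ['b', 15, 12], ['c', 20, 1]]
df = pd.DataFrame(x, columns=['name', 'age', 'score'])

# as_index=False keeps the grouping keys as regular columns
# instead of moving them into a MultiIndex.
flat = df.groupby(['name', 'age'], as_index=False).mean()
print(flat)
```

Calling .reset_index() on the grouped result gives the same layout.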

Option 1
Taking the dataframe df that OP shares in the question, the following will do the work:
df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
              Pos   Age   Tm          G        GS         MP        FG
Player
Jarrett Allen   C  22.0  TOT  18.666667  6.666667  26.266667  4.333333
This one uses:
pandas.DataFrame.groupby to group by the Player column
pandas.core.groupby.GroupBy.agg to aggregate the values based on a custom made lambda function.
pandas.api.types.is_string_dtype to check if a column is of string type
Let's test it with a new dataframe, df2, with more elements in the Player column.
import numpy as np
df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young', 'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
'Age': np.random.randint(0, 100, 9),
'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
'G': np.random.randint(0, 100, 9),
'GS': np.random.randint(0, 100, 9),
'MP': np.random.uniform(0, 100, 9),
'FG': np.random.uniform(0, 100, 9)})
[Out]:
          Player Pos  Age   Tm   G  GS         MP         FG
0   John Collins  PF   71  ATL  75  39  16.123225  77.949756
1   John Collins  PF   60  ATL  49  49  30.308092  24.788401
2   John Collins  PF   52  ATL  33  92  11.087317  58.488575
3     Trae Young  PG   72  ATL  20  91  62.862313  60.169282
4     Trae Young  PG   85  ATL  61  77  30.248551  85.169038
5   Clint Capela   C   73  ATL   5  67  45.817690  21.966777
6  Jarrett Allen   C   23  TOT  60  51  93.076624  34.160823
7  Jarrett Allen   C   12  BRK   2  77  74.318568  78.755869
8  Jarrett Allen   C   44  CLE  82  81   7.375631  40.930844
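If one tests the operation on df2, one will get something like the following. Because df2 is filled with random numbers, the values change on every run; the output below comes from a different draw than the table above.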
df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
              Pos        Age   Tm          G         GS         MP         FG
Player
Clint Capela    C  95.000000  ATL  30.000000  98.000000  46.476398  17.987104
Jarrett Allen   C  60.000000  TOT  48.666667  19.333333  70.050540  33.572896
John Collins   PF  74.333333  ATL  50.333333  52.666667  78.181457  78.152235
Trae Young     PG  57.500000  ATL  44.500000  47.500000  46.602543  53.835455
Option 2
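Depending on the desired output, and assuming that one only wants to group by player (independently of Age or Tm), a simpler solution is to group by Player and call .mean(). On pandas 2.0 and later, numeric_only=True is needed because non-numeric columns are no longer dropped silently:
df_new3 = df.groupby('Player').mean(numeric_only=True)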
[Out]:
                Age          G        GS         MP        FG
Player
Jarrett Allen  22.0  18.666667  6.666667  26.266667  4.333333
Notes:
The output of this previous operation won't display non-numerical columns (apart from the Player name).

Pandas: How to get the sample rows from each category from specific column in dataframe and save in single csv?

Below is the dataframe (df). I want to save a sample of 3 rows from each category of the 'country' column.
My code is below, but it does not sample per category. I need a single CSV file containing the samples. Please suggest a solution.
data = {'country':['India', 'Nepal', 'Canada', 'USA','India', 'Nepal', 'Canada', 'USA','India', 'Nepal', 'Canada', 'USA','India', 'Nepal', 'Canada', 'USA','India', 'Nepal', 'Canada', 'USA'],
'Age':[20, 21, 19, 18,20, 21, 19, 18,20, 21, 19, 18,20, 21, 19, 18,20, 21, 19, 18]}
df = pd.DataFrame(data)
df.sample(n=3).to_csv('sampledata.csv', na_rep='NA', index = False)

GroupBy and then sample
df.groupby('country').sample(3)
   country  Age
2   Canada   19
6   Canada   19
10  Canada   19
4    India   20
0    India   20
12   India   20
1    Nepal   21
13   Nepal   21
9    Nepal   21
3      USA   18
11     USA   18
19     USA   18
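To finish the task from the question, the grouped sample can be written straight to one CSV file. This is a minimal sketch: the file name and to_csv arguments are taken from the question, while random_state=0 is an added assumption that only makes the sample repeatable.

```python
import pandas as pd

data = {
    'country': ['India', 'Nepal', 'Canada', 'USA'] * 5,
    'Age': [20, 21, 19, 18] * 5,
}
df = pd.DataFrame(data)

# Draw 3 rows from every country; random_state fixes the draw (optional).
sampled = df.groupby('country').sample(n=3, random_state=0)

# Write all sampled rows into a single CSV file.
sampled.to_csv('sampledata.csv', na_rep='NA', index=False)
```

DataFrameGroupBy.sample requires pandas 1.1 or later.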

Is there a way to create a pandas dataframe column based on current sort position and another column?

I have multiple columns of data and I want to find where a value lies within its "class" as well as overall.
Here is some example data (let's assume the "class" we're measuring against is eye_color and the metric is score):
raw_data = {'name': ['Alex', 'Alicia', 'Omar', 'Louise', 'Alice'],
'age': [20, 19, 35, 24, 32],
'eye_color': ['blue', 'blue', 'brown', "green", "brown"],
'score': [88, 92, 95, 70, 96]}
df = pd.DataFrame(raw_data)
df = df.sort_values(['eye_color', 'score'], ascending=[True, False])
I want to create a column that would use the current sort order to give a value of "Brown1" for Alice, "Brown2" for Omar, "Green1" for Louise, etc.
I'm not sure how to approach and am fairly sure there's an easy way to do it before I overengineer a function that re-sorts based on each class and then recreates an index or something...

Use groupby().cumcount():
df['new'] = df['eye_color'] + df.groupby('eye_color').cumcount().add(1).astype(str)
Output:
     name  age eye_color  score     new
1  Alicia   19      blue     92   blue1
0    Alex   20      blue     88   blue2
4   Alice   32     brown     96  brown1
2    Omar   35     brown     95  brown2
3  Louise   24     green     70  green1
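cumcount relies on the frame already being sorted. A sketch of an alternative that does not depend on the current row order is rank, which can also give the overall position the question mentions. The column names class_position and overall_position are made up for this example, and method='first' is an assumed tie-breaking rule.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alex', 'Alicia', 'Omar', 'Louise', 'Alice'],
    'age': [20, 19, 35, 24, 32],
    'eye_color': ['blue', 'blue', 'brown', 'green', 'brown'],
    'score': [88, 92, 95, 70, 96],
})

# Rank within each eye_color, highest score first; ties broken by row order.
class_rank = (
    df.groupby('eye_color')['score']
      .rank(ascending=False, method='first')
      .astype(int)
)
df['class_position'] = df['eye_color'] + class_rank.astype(str)

# Rank across the whole frame, highest score first.
df['overall_position'] = df['score'].rank(ascending=False, method='first').astype(int)

print(df)
```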
