pandas join rows by column values and remove duplicates [duplicate]

I have a dataset that contains NBA players' average statistics per game. Some players' statistics are repeated because they played for different teams during the season.
For example:
Player Pos Age Tm G GS MP FG
8 Jarrett Allen C 22 TOT 28 10 26.2 4.4
9 Jarrett Allen C 22 BRK 12 5 26.7 3.7
10 Jarrett Allen C 22 CLE 16 5 25.9 4.9
I want to average Jarrett Allen's stats and put them into a single row. How can I do this?

You can use groupby with agg to get the mean. For the non-numeric columns, let's take the first value:
df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
                          for k, v in df.dtypes[1:].items()})
output:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22 TOT 18.666667 6.666667 26.266667 4.333333
NB. content of the dictionary comprehension:
{'Pos': 'first',
'Age': 'mean',
'Tm': 'first',
'G': 'mean',
'GS': 'mean',
'MP': 'mean',
'FG': 'mean'}
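
A slightly more defensive way to build the same mapping, sketched here on the three sample rows from the question, is to test each column with pandas.api.types.is_numeric_dtype instead of comparing dtype names, so that other numeric widths such as int32 are averaged too. The names agg_map and result are only illustrative:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Sample rows copied from the question.
df = pd.DataFrame({
    'Player': ['Jarrett Allen'] * 3,
    'Pos': ['C', 'C', 'C'],
    'Age': [22, 22, 22],
    'Tm': ['TOT', 'BRK', 'CLE'],
    'G': [28, 12, 16],
    'GS': [10, 5, 5],
    'MP': [26.2, 26.7, 25.9],
    'FG': [4.4, 3.7, 4.9],
})

# Average numeric columns, keep the first value of everything else.
agg_map = {
    col: 'mean' if is_numeric_dtype(df[col]) else 'first'
    for col in df.columns.drop('Player')
}
result = df.groupby('Player').agg(agg_map)
print(result)
```

This avoids relying on the grouping column being the first one, which df.dtypes[1:] assumes.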

x = [['a', 12, 5],['a', 12, 7], ['b', 15, 10],['b', 15, 12],['c', 20, 1]]
import pandas as pd
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')
df2 = df.groupby(['name', 'age']).mean()
print(df2)
Output:
name age score
0 a 12 5
1 a 12 7
2 b 15 10
3 b 15 12
4 c 20 1
-----------
score
name age
a 12 6.0
b 15 11.0
c 20 1.0

Option 1
Given the dataframe df that the OP shares in the question, the following will do the job:
df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22.0 TOT 18.666667 6.666667 26.266667 4.333333
This one uses:
pandas.DataFrame.groupby to group by the Player column
pandas.core.groupby.GroupBy.agg to aggregate the values with a custom lambda function.
pandas.api.types.is_string_dtype to check whether a column's dtype is string-like (a plain object dtype also passes this check).
Let's test it with a new dataframe, df2, with more elements in the Player column.
import numpy as np
df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young', 'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
                    'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
                    'Age': np.random.randint(0, 100, 9),
                    'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
                    'G': np.random.randint(0, 100, 9),
                    'GS': np.random.randint(0, 100, 9),
                    'MP': np.random.uniform(0, 100, 9),
                    'FG': np.random.uniform(0, 100, 9)})
[Out]:
Player Pos Age Tm G GS MP FG
0 John Collins PF 71 ATL 75 39 16.123225 77.949756
1 John Collins PF 60 ATL 49 49 30.308092 24.788401
2 John Collins PF 52 ATL 33 92 11.087317 58.488575
3 Trae Young PG 72 ATL 20 91 62.862313 60.169282
4 Trae Young PG 85 ATL 61 77 30.248551 85.169038
5 Clint Capela C 73 ATL 5 67 45.817690 21.966777
6 Jarrett Allen C 23 TOT 60 51 93.076624 34.160823
7 Jarrett Allen C 12 BRK 2 77 74.318568 78.755869
8 Jarrett Allen C 44 CLE 82 81 7.375631 40.930844
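Running the same operation on df2 gives the following. Note that these numbers do not match the df2 printed above (Clint Capela, for instance, has a single row with Age 73 there), which suggests the unseeded random data was regenerated before this output was produced: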
df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Clint Capela C 95.000000 ATL 30.000000 98.000000 46.476398 17.987104
Jarrett Allen C 60.000000 TOT 48.666667 19.333333 70.050540 33.572896
John Collins PF 74.333333 ATL 50.333333 52.666667 78.181457 78.152235
Trae Young PG 57.500000 ATL 44.500000 47.500000 46.602543 53.835455
Option 2
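Depending on the desired output, and assuming one only wants to group by player (regardless of Age or Tm), a simpler solution is to group by Player and call .mean(). In pandas 2.0 and later, numeric_only=True has to be passed explicitly, otherwise the string columns raise a TypeError:
df_new3 = df.groupby('Player').mean(numeric_only=True)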
[Out]:
Age G GS MP FG
Player
Jarrett Allen 22.0 18.666667 6.666667 26.266667 4.333333
Notes:
The output of this operation drops the non-numeric columns (Pos and Tm); Player remains as the index.

Related

Python Dataframe Conditional If Statement Using pd.np.where Erroring Out

I have the following dataframe:
count country year age_group gender type
7 Albania 2006 014 f ep
1 Albania 2007 014 f ep
3 Albania 2008 014 f ep
2 Albania 2009 014 f ep
2 Albania 2010 014 f ep
I'm trying to change the "gender" column so that 'f' becomes 'female' and 'm' becomes 'male'.
I tried the following code:
who3['gender'] = pd.np.where(who3['gender'] == 'f', "female")
But it gives me this error:
Now when I try this code:
who3['gender'] = pd.np.where(who3['gender'] == 'f', "female",
                             pd.np.where(who3['gender'] == 'm', "male"))
I get error below:
What am I doing wrong?

You can also use .replace():
df["gender"] = df["gender"].replace({"f": "female", "m": "male"})
print(df)
Prints:
count country year age_group gender type
0 7 Albania 2006 14 female ep
1 1 Albania 2007 14 female ep
2 3 Albania 2008 14 female ep
3 2 Albania 2009 14 female ep
4 2 Albania 2010 14 female ep

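np.where takes the condition as its first argument, the value to use where the condition is met as the second, and the value to use where it is not met as the third. Each of your attempts contains a call with only two arguments, which np.where does not accept. Also, pd.np is deprecated and no longer available in recent pandas versions, so import numpy as np and call np.where directly. Try this: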
who3['gender'] = np.where(who3['gender'] == 'f', "female", 'male')
Another solution is using replace method:
who3['gender'] = who3['gender'].replace({'f': 'female', 'm': 'male'})
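
Note that passing 'male' as the fallback relabels every value that is not 'f', including anything unexpected. A sketch closer to the nested attempt in the question passes the original column as the final fallback instead. The who3 frame below is a small stand-in, not the OP's data, and the 'x' entry is only there to show the fallback:

```python
import numpy as np
import pandas as pd

# Stand-in data; 'x' is an unexpected code that should be left untouched.
who3 = pd.DataFrame({'gender': ['f', 'm', 'f', 'x']})

# Every np.where call gets all three arguments:
# condition, value where true, value where false.
who3['gender'] = np.where(
    who3['gender'] == 'f', 'female',
    np.where(who3['gender'] == 'm', 'male', who3['gender']),
)
print(who3['gender'].tolist())
```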

Pandas replacing the number only for the columns that contains number

I have a dataframe with more than 100 columns. I am trying to replace values across the whole dataframe, but only in the columns that contain numbers (int, float, or any other numeric format).
I know how to handle each column separately, but I am looking for a smart way to efficiently replace a value with -5 if it is <= 0 and with 111 if it is > 50.
Below is the code.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Name': ['Avery Bradley', 'Jae Crowder', 'John Holland', 'R.J. Hunter'],
                   'Team': ['Boston Celtics',
                            'Boston Celtics',
                            'Boston Celtics',
                            'Boston Celtics'],
                   'Number1': [0.0, 999.0, -30.0, 28.0],
                   'Number2': [1000, 500, -10, 25],
                   'Position': ['PG', 'SF', 'SG', 'SG']})
#df["Number1"].values[df["Number1"] > 50] = 999
#df["Number1"].values[df["Number1"] < 0] = -5
df[ df > 50 ] = 888
df[ df < 0 ] = -5

You can use select_dtypes with np.select for multiple conditions here:
m = df.select_dtypes(np.number)
df[m.columns] = np.select([m>50,m<0],[888,-5],m)
print(df)
Name Team Number1 Number2 Position
0 Avery Bradley Boston Celtics 0.0 888.0 PG
1 Jae Crowder Boston Celtics 888.0 888.0 SF
2 John Holland Boston Celtics -5.0 -5.0 SG
3 R.J. Hunter Boston Celtics 28.0 25.0 SG

Use:
c = df.select_dtypes(np.number).columns
df[c] = df[c].mask(df[c] > 50, 888)
df[c] = df[c].mask(df[c] < 0, -5)
print (df)
Name Team Number1 Number2 Position
0 Avery Bradley Boston Celtics 0.0 888 PG
1 Jae Crowder Boston Celtics 888.0 888 SF
2 John Holland Boston Celtics -5.0 -5 SG
3 R.J. Hunter Boston Celtics 28.0 25 SG
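
Since the question's text and code disagree on the exact thresholds and replacement values, it may help to wrap the mask approach in a small function with the limits as parameters. This is only a sketch: cap_numeric and its parameter names are hypothetical, and the defaults follow the answers above (888 for values above 50, -5 for values below 0). The sample frame is a shortened version of the one in the question:

```python
import numpy as np
import pandas as pd

def cap_numeric(df, high=50, high_value=888, low=0, low_value=-5):
    """Return a copy with numeric cells above `high` or below `low` replaced."""
    out = df.copy()
    cols = out.select_dtypes(include=np.number).columns
    num = out[cols]
    # Both conditions are evaluated on the original values in `num`.
    out[cols] = num.mask(num > high, high_value).mask(num < low, low_value)
    return out

df = pd.DataFrame({'Name': ['Avery Bradley', 'Jae Crowder', 'John Holland', 'R.J. Hunter'],
                   'Number1': [0.0, 999.0, -30.0, 28.0],
                   'Number2': [1000, 500, -10, 25]})
capped = cap_numeric(df)
print(capped)
```

Working on a copy leaves the original frame untouched, which makes it easy to compare before and after.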

Create column with multiple data frames and multiple conditions

I am looking at football data and trying to add an opponent column, but am struggling with the way that the data frames are organized.
****EDIT****
defense = {'week': [1, 1, 1, 1, 2, 2, 2, 2], 'team': ['GB', 'MIA', 'CHI', 'DET', 'GB', 'MIA', 'CHI', 'DET']}
games = {'week': [1, 1, 2, 2], 'winner': ['GB', 'MIA', 'GB', 'DET'], 'loser': ['CHI', 'DET', 'MIA', 'CHI']}
def_df = pd.DataFrame(data=defense)
games_df = pd.DataFrame(data=games)
def_df
team week
0 GB 1
1 MIA 1
2 CHI 1
3 DET 1
4 GB 2
5 MIA 2
6 CHI 2
7 DET 2
games_df
loser week winner
0 CHI 1 GB
1 DET 1 MIA
2 MIA 2 GB
3 CHI 2 DET
I am looking to add an 'Opponent' column to def_df, based on the games played that week.
team week Opponent
0 GB 1 CHI
1 MIA 1 DET
2 CHI 1 GB
3 DET 1 MIA
4 GB 2 MIA
5 MIA 2 GB
6 CHI 2 DET
7 DET 2 CHI
Thanks!

Here's one way using a nested dictionary mapping:
from collections import defaultdict
d = defaultdict(dict)
for row in games_df.itertuples(index=False):
    d[row.week].update({row.winner: row.loser, row.loser: row.winner})
def_df['opponent'] = def_df.apply(lambda x: d[x['week']][x['team']], axis=1)
print(def_df)
team week opponent
0 GB 1 CHI
1 MIA 1 DET
2 CHI 1 GB
3 DET 1 MIA
4 GB 2 MIA
5 MIA 2 GB
6 CHI 2 DET
7 DET 2 CHI
An equally valid alternative using tuple keys, which avoids collections:
d = {}
for row in games_df.itertuples(index=False):
    d[(row.week, row.winner)] = row.loser
    d[(row.week, row.loser)] = row.winner
def_df['opponent'] = def_df.set_index(['week', 'team']).index.map(d.get)

Updated
Create a column of opponents
opponent_list = []
for team, week in zip(def_df['team'], def_df['week']):
    for gameweek, winner, loser in zip(games_df['week'], games_df['winner'], games_df['loser']):
        if gameweek == week and (winner == team or loser == team):
            if winner == team:
                opponent_list.append(loser)
            else:
                opponent_list.append(winner)
def_df['opponent'] = opponent_list
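
For larger frames, a merge can replace the explicit loops. The following is a sketch rather than a drop-in replacement: it lists each game once from each side and then joins on week and team. The names as_winner, as_loser, lookup and result are only illustrative:

```python
import pandas as pd

def_df = pd.DataFrame({'week': [1, 1, 1, 1, 2, 2, 2, 2],
                       'team': ['GB', 'MIA', 'CHI', 'DET', 'GB', 'MIA', 'CHI', 'DET']})
games_df = pd.DataFrame({'week': [1, 1, 2, 2],
                         'winner': ['GB', 'MIA', 'GB', 'DET'],
                         'loser': ['CHI', 'DET', 'MIA', 'CHI']})

# One row per (week, team): each game appears once from each side.
as_winner = games_df.rename(columns={'winner': 'team', 'loser': 'opponent'})
as_loser = games_df.rename(columns={'loser': 'team', 'winner': 'opponent'})
lookup = pd.concat([as_winner, as_loser], ignore_index=True)

# A left merge keeps every row of def_df in its original order.
result = def_df.merge(lookup, on=['week', 'team'], how='left')
print(result)
```

With a left merge, a team that has no game in a given week ends up with a missing opponent instead of raising an error or shifting the later rows.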

To obtain names of 3 highest records in the dataframe in pandas

I have a dataframe df as shown below:
City Name Division Population
1 California A 100
2 Texas B 98
3 NewYork C 96
4 Florida D 94
5 Illinois E 92
6 Pennsylvania F 90
7 Ohio G 88
8 Michigan H 86
9 Georgia I 84
10 North Carolina J 82
I would like to write code to obtain the 3 most populous cities as string values.
Desired output:
['California', 'Texas', 'NewYork']
How can I do this?

You can use nlargest with tolist:
print (df.nlargest(3, 'Population')['City Name'].tolist())
['California', 'Texas', 'NewYork']
If the data are already sorted by Population in descending order, use iloc or head:
print (df['City Name'].iloc[:3].tolist())
['California', 'Texas', 'NewYork']
print (df['City Name'].head(3).tolist())
['California', 'Texas', 'NewYork']
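
To see why the sort order matters, here is a small sketch on deliberately shuffled rows (a made-up subset of the table in the question). head(3) just returns the first three rows, while nlargest and sort_values rank by Population:

```python
import pandas as pd

# Shuffled subset of the question's data.
df = pd.DataFrame({'City Name': ['Ohio', 'Texas', 'Florida', 'California', 'NewYork'],
                   'Population': [88, 98, 94, 100, 96]})

# head(3) only reflects row order, so it is wrong for unsorted data.
first_three = df['City Name'].head(3).tolist()

# nlargest and sort_values both rank by Population regardless of row order.
top_nlargest = df.nlargest(3, 'Population')['City Name'].tolist()
top_sorted = df.sort_values('Population', ascending=False)['City Name'].head(3).tolist()
print(first_three, top_nlargest, top_sorted)
```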
