Multiply values in certain columns by fixed metric if multiple conditions exist - python

I have a dataset of hockey statistics and I want to apply a weight multiplier to certain statistics based on certain conditions.
A snippet of my dataset:
Player Pos GP G GF/GP S Shots/GP S% TOI/GP
2 Andrew Cogliano 1.0 79.2 11.0 0.1 126.8 1.6 8.3 14.44
12 Artturi Lehkonen 2.0 73.0 14.6 0.2 158.6 2.2 9.3 15.29
28 Cale Makar 4.0 59.3 16.0 0.3 155.0 2.6 9.8 23.67
31 Darren Helm 1.0 66.6 10.5 0.2 125.0 1.9 8.6 14.37
61 Gabriel Landeskog 2.0 72.0 24.3 0.3 196.1 2.7 12.8 19.46
103 Nathan MacKinnon 1.0 73.8 27.8 0.4 274.4 3.7 9.9 19.69
What I am trying to do is create a function that multiplies 'G', 'GF/GP', 'S', and 'Shots/GP' by a specific weight - 1.1 for example.
But I want to only do that for players based on two categories:
Defence ('Pos' = 4.0) with 50 or more games ('GP') and 20 min or more time on ice per game ('TOI/GP')
Offense ('Pos' != 4.0) with 50 or more games ('GP') and 14 min or more time on ice per game ('TOI/GP')
I can identify these groups by:
def_cond = df.loc[(df["Pos"]==4.0) & (df["GP"]>=50) & (df["TOI/GP"] >=20.00)]
off_cond = df.loc[(df["Pos"]!=4.0) & (df["GP"]>=50) & (df["TOI/GP"] >=14.00)]
Output for def_cond:
Player Pos GP G GF/GP S Shots/GP S% TOI/GP
28 Cale Makar 4.0 59.3 16.0 0.3 155.0 2.6 9.8 23.67
41 Devon Toews 4.0 58.8 8.2 0.1 120.5 2.1 6.7 22.14
45 Erik Johnson 4.0 67.4 7.3 0.1 140.9 2.1 5.1 22.22
112 Samuel Girard 4.0 68.0 4.4 0.1 90.8 1.3 5.0 20.75
Issue:
What I want to do is take this output and multiply 'G', 'GF/GP', 'S', and 'Shots/GP' by a weight value - again 1.1 for example.
I have tried various things such as:
if def_cond == True:
    df[["G", "GF/GP", "S", "Shots/GP"]].multiply(1.1, axis="index")
Or simply
if def_cond == True:
    df["G"] = (df["G"]*1.1)
Pretty much everything I try results in the following error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I am new to this so any and all advice is welcome!

I would try this:
def f(df, weight):
    for i in df.index:
        if (
            (df.loc[i, 'Pos'] == 4.0 and df.loc[i, 'GP'] >= 50
             and df.loc[i, 'TOI/GP'] >= 20)
            or
            (df.loc[i, 'Pos'] != 4.0 and df.loc[i, 'GP'] >= 50
             and df.loc[i, 'TOI/GP'] >= 14)
        ):
            df.loc[i, ['G', 'GF/GP', 'S', 'Shots/GP']] *= weight
Though I'm pretty sure it is not the best solution...
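For what it's worth, here is a vectorized sketch that reuses the boolean masks from the question instead of looping over rows (apply_weight is just an illustrative name, not anything from the original post):
def apply_weight(df, weight=1.1):
    cols = ['G', 'GF/GP', 'S', 'Shots/GP']
    # Boolean masks for the two player groups, same conditions as in the question.
    def_mask = (df['Pos'] == 4.0) & (df['GP'] >= 50) & (df['TOI/GP'] >= 20.0)
    off_mask = (df['Pos'] != 4.0) & (df['GP'] >= 50) & (df['TOI/GP'] >= 14.0)
    mask = def_mask | off_mask
    # Multiply the stat columns only on rows where either condition holds.
    df.loc[mask, cols] = df.loc[mask, cols] * weight
    return df
This also sidesteps the original ValueError: the mask is used for indexing with .loc rather than being evaluated in an if statement.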

AttributeError: 'list' object has no attribute 'assign'

I have this dataframe:
SRC Coup Vint Bal Mar Apr May Jun Jul BondSec
0 JPM 1.5 2021 43.9 5.6 4.9 4.9 5.2 4.4 FNCL
1 JPM 1.5 2020 41.6 6.2 6.0 5.6 5.8 4.8 FNCL
2 JPM 2.0 2021 503.9 7.1 6.3 5.8 6.0 4.9 FNCL
3 JPM 2.0 2020 308.3 9.3 7.8 7.5 7.9 6.6 FNCL
4 JPM 2.5 2021 345.0 8.6 7.8 6.9 6.8 5.6 FNCL
5 JPM 4.5 2010 5.7 21.3 20.0 18.0 17.7 14.6 G2SF
6 JPM 5.0 2019 2.8 39.1 37.6 34.6 30.8 24.2 G2SF
7 JPM 5.0 2018 7.3 39.8 37.1 33.4 30.1 24.2 G2SF
8 JPM 5.0 2010 3.9 23.3 20.0 18.6 17.9 14.6 G2SF
9 JPM 5.0 2009 4.2 22.8 21.2 19.5 18.6 15.4 G2SF
I want to duplicate all the rows that have FNCL as the BondSec, and rename the value of BondSec in those new duplicate rows to FGLMC. I'm able to accomplish half of that with the following code:
if "FGLMC" not in jpm['BondSec']:
is_FNCL = jpm['BondSec'] == "FNCL"
FNCL_try = jpm[is_FNCL]
jpm.append([FNCL_try]*1,ignore_index=True)
But if I instead try to implement the change to the BondSec value in the same line as below:
jpm.append(([FNCL_try]*1).assign(**{'BondSecurity': 'FGLMC'}),ignore_index=True)
I get the following error:
AttributeError: 'list' object has no attribute 'assign'
Additionally, I would like to insert the duplicated rows based on an index condition, not just at the bottom as additional rows. The condition cannot be simply a row position because this will have to work on future files with different numbers of rows. So I would like to insert the duplicated rows at the position where the BondSec column values change from FNCL to FNCI (FNCI is not showing here, but basically it would be right below the last row with FNCL). I'm assuming this could be done with an np.where function call, but I'm not sure how to implement that.
I'll also eventually want to do this same exact process with rows with FNCI as the BondSec value (duplicating them and transforming the BondSec value to FGCI, and inserting at the index position right below the last row with FNCI as the value).
I'd suggest a helper function to handle all your duplications:
def duplicate_and_rename(df, target, value):
    return pd.concat([df, df[df["BondSec"] == target].assign(BondSec=value)])
Then
for target, value in (("FNCL", "FGLMC"), ("FNCI", "FGCI")):
    df = duplicate_and_rename(df, target, value)
Then, after all that, you can convert the BondSec column to a categorical with a custom order and sort by it, which drops the duplicated rows into place right below their originals:
ordering = ["FNCL", "FGLMC", "FNCI", "FGCI", "G2SF"]
df["BondSec"] = pd.Categorical(df["BondSec"], ordering)
df = df.sort_values("BondSec").reset_index(drop=True)
Alternatively, you can use a dictionary for your ordering, as explained in this answer.
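A minimal sketch of that dictionary approach, assuming pandas >= 1.1 for the key argument of sort_values (order_map is just an illustrative name):
order_map = {"FNCL": 0, "FGLMC": 1, "FNCI": 2, "FGCI": 3, "G2SF": 4}
# Sort rows by the rank each BondSec value maps to in the dictionary.
df = df.sort_values("BondSec", key=lambda s: s.map(order_map)).reset_index(drop=True)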

Web scraping with beautifulSoup and selenium NBA stats

I'm trying to get the data for the NBA advanced stats but keep getting errors. This is what I have. Please help.
from selenium import webdriver
from bs4 import BeautifulSoup as soup
d = webdriver.Chrome('C:/chromedriver.exe')
d.get('https://www.nba.com/stats/players/passing/?Season=2019-20&SeasonType=Regular%20Season&TeamID=1610612747')
s = soup(d.page_source, 'html.parser').find('table', {'class':'nba-stat-table__overflow'})
headers, [_, *data] = [i.text for i in soup.find_all('th')], [[i.text for i in soup.find_all('td')] for i in soup.find_all('tr')]
final_data = [i for i in data if len(i) > 1]
print(final_data)
There is no 'table' element with 'class'='nba-stat-table__overflow'; that class is on a 'div' element, and the actual 'table' sits underneath it. So this line is returning a NoneType:
s = soup(d.page_source, 'html.parser').find('table', {'class':'nba-stat-table__overflow'})
Just use pandas to read in the tables once you get the page source from selenium. Note you'll likely need to add an implicit wait for the page to render.
import pandas as pd
from selenium import webdriver
d = webdriver.Chrome('C:/Users/kgrab/OneDrive/Desktop/web mining/week3-twitter/chromedriver_win32/chromedriver.exe')
d.get('https://www.nba.com/stats/players/passing/?Season=2019-20&SeasonType=Regular%20Season&TeamID=1610612747')
df = pd.read_html(d.page_source)[0]
d.close()
Output:
print (df)
Player Team GP ... ASTAdj AST ToPass% AST ToPass% Adj
0 Alex Caruso LAL 64 ... 2.3 8.0 9.6
1 Anthony Davis LAL 62 ... 3.9 8.3 9.9
2 Avery Bradley LAL 49 ... 1.6 6.8 8.4
3 Danny Green LAL 68 ... 1.7 4.7 5.8
4 Devontae Cacok LAL 1 ... 1.0 16.7 16.7
5 Dion Waiters LAL 7 ... 3.6 9.8 14.5
6 Dwight Howard LAL 69 ... 0.8 2.8 3.2
7 JR Smith LAL 6 ... 0.7 6.0 8.0
8 JaVale McGee LAL 68 ... 0.7 3.3 4.4
9 Jared Dudley LAL 45 ... 0.8 5.7 7.0
10 Kentavious Caldwell-Pope LAL 69 ... 1.8 7.5 8.6
11 Kostas Antetokounmpo LAL 5 ... 0.6 11.8 17.6
12 Kyle Kuzma LAL 61 ... 1.6 6.4 7.9
13 LeBron James LAL 67 ... 11.8 16.4 19.0
14 Markieff Morris LAL 14 ... 0.8 4.1 5.7
15 Quinn Cook LAL 44 ... 1.1 8.3 8.5
16 Rajon Rondo LAL 48 ... 6.1 12.7 15.5
17 Talen Horton-Tucker LAL 6 ... 2.0 5.7 11.3
18 Troy Daniels DEN 41 ... 0.5 6.3 9.0
19 Zach Norvell Jr. GSW 2 ... 0.0 0.0 0.0
[20 rows x 16 columns]
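If pd.read_html runs before the table has rendered, it will fail to find any tables; an explicit wait is one way to guard against that. A minimal sketch, assuming the stats table is the first <table> element on the page:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

d = webdriver.Chrome('C:/chromedriver.exe')
d.get('https://www.nba.com/stats/players/passing/?Season=2019-20&SeasonType=Regular%20Season&TeamID=1610612747')
# Block for up to 15 seconds until at least one <table> element is present.
WebDriverWait(d, 15).until(EC.presence_of_element_located((By.TAG_NAME, 'table')))
df = pd.read_html(d.page_source)[0]
d.close()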

Creating a new column from two columns using a dictionary in Pandas

I want to create a new column based on a cutoff threshold applied to another column, with a different threshold for each group of a grouping column.
The dataframe is below:
df_in ->
unique_id myvalue identif
0 CTA15 19.0 TOP
1 CTA15 22.0 TOP
2 CTA15 28.0 TOP
3 CTA15 18.0 TOP
4 CTA15 22.4 TOP
5 AC007 2.0 TOP
6 AC007 2.3 SDME
7 AC007 2.0 SDME
8 AC007 5.0 SDME
9 AC007 3.0 SDME
10 AC007 31.4 SDME
11 AC007 4.4 SDME
12 CGT6 9.7 BTME
13 CGT6 44.5 BTME
14 TVF5 6.7 BTME
15 TVF5 9.1 BTME
16 TVF5 10.0 BTME
17 BGD1 1.0 BTME
18 BGD1 1.6 NON
19 GHB 51.0 NON
20 GHB 54.0 NON
21 GHB 4.7 NON
So I have created a dictionary with one threshold for each group of the 'identif' column:
md = {'TOP': 22, 'SDME': 10, 'BTME': 20, 'NON':20}
So my goal is to create a new column, say 'chk', based on the following condition:
If the "identif" column matches the key in the dictionary "md" and the value for that key is >= than the corresponding value in the "myvalue" column then
I will have 1, otherwise 0.
However, I am trying to find a good way using map/groupby/apply to create the new output data frame. Right now I am doing it in a very inefficient way (which takes considerable time on real data with millions of rows), using a function as follows:
def myfilter(df, idCol, valCol, mydict):
    for index, row in df.iterrows():
        for key, value in mydict.items():
            if row[idCol] == key and row[valCol] >= value:
                df.loc[index, 'chk'] = 1
            elif row[idCol] == key and row[valCol] < value:
                df.loc[index, 'chk'] = 0
    return df
Getting the output via the following call:
df_out = myfilter(df_in, 'identif', 'myvalue', md)
So my output will be like:
df_out ->
unique_id myvalue identif chk
0 CTA15 19.0 TOP 0
1 CTA15 22.0 TOP 1
2 CTA15 28.0 TOP 1
3 CTA15 18.0 TOP 0
4 CTA15 22.4 TOP 1
5 AC007 2.0 TOP 0
6 AC007 2.3 SDME 0
7 AC007 2.0 SDME 0
8 AC007 5.0 SDME 0
9 AC007 3.0 SDME 0
10 AC007 31.4 SDME 1
11 AC007 4.4 SDME 0
12 CGT6 9.7 BTME 0
13 CGT6 44.5 BTME 1
14 TVF5 6.7 BTME 0
15 TVF5 9.1 BTME 0
16 TVF5 10.0 BTME 0
17 BGD1 1.0 BTME 0
18 BGD1 1.6 NON 0
19 GHB 51.0 NON 1
20 GHB 54.0 NON 1
21 GHB 4.7 NON 0
This works but is extremely inefficient, and I would like a much better way to do it.
This should be faster:
import numpy as np

def func(identif, value):
    if identif in md:
        if value >= md[identif]:
            return 1.0
        else:
            return 0.0
    else:
        return np.nan

df['chk'] = df.apply(lambda row: func(row['identif'], row['myvalue']), axis=1)
The timing on this little example:
CPU times: user 1.64 ms, sys: 73 µs, total: 1.71 ms
Wall time: 1.66 ms
Your version timing:
CPU times: user 8.6 ms, sys: 1.92 ms, total: 10.5 ms
Wall time: 8.79 ms
Although on such a small example it's not conclusive.
First, for each row in the data frame you're traversing every element in your dictionary, so you effectively scan the data several times over. You can change your function to look the key up directly instead, which will speed up your original function. Try something like:
def myfilter(df, idCol, valCol, mydict):
    for index, row in df.iterrows():
        value = mydict.get(row[idCol])
        if row[valCol] >= value:
            df.loc[index, 'chk'] = 1
        else:
            df.loc[index, 'chk'] = 0
    return df
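For the speed the question asks about, a fully vectorized sketch is also possible: map the dictionary onto the identif column and compare once. This assumes every identif value has a key in md, as in the sample data; unmatched values would map to NaN and end up as 0 here, whereas the apply-based answer returns NaN for them.
import numpy as np

# One threshold per row, looked up from the dictionary.
thresholds = df_in['identif'].map(md)
df_in['chk'] = np.where(df_in['myvalue'] >= thresholds, 1, 0)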

Array into dataframe interpolation

I have the following array:
[299.13953679 241.1902389 192.58645951 ... 8.53750551 24.38822528
71.61117789]
For each value in the array I want to get the interpolated wind speed based on the values in the column power in the following pd.DataFrame:
wind speed power
5 2.5 0
6 3.0 25
7 3.5 82
8 4.0 154
9 4.5 244
10 5.0 354
11 5.5 486
12 6.0 643
13 6.5 827
14 7.0 1038
15 7.5 1272
16 8.0 1525
17 8.5 1794
18 9.0 2037
19 9.5 2211
20 10.0 2362
21 10.5 2386
22 11.0 2400
So basically I'd like to retrieve the following array:
[4.7 4.5 4.3 ... 2.6 3.0 3.4]
Any suggestions on where to start? I was looking at the pd.DataFrame.interpolate function, but reading through its functionality it does not seem helpful for my problem. Or am I wrong?
Using interp from numpy
np.interp(ary,df['power'].values,df['wind speed'].values)
Out[202]:
array([4.75063426, 4.48439022, 4.21436922, 2.67075011, 2.98776451,
3.40886998])
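A self-contained sketch of the same call, assuming the power-curve DataFrame from the question is named df and the array of power values is named ary:
import numpy as np

# np.interp expects the x-coordinates (here the power column) to be increasing,
# which holds for the power curve shown above.
wind_speeds = np.interp(ary, df['power'].values, df['wind speed'].values)
print(wind_speeds)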

Need to compare one Pandas (Python) dataframe with values from another dataframe

So I've pulled data from an SQL server and loaded it into a dataframe. All the data is discrete and increases in 0.1 steps in one direction (0.0, 0.1, 0.2, ..., 9.8, 9.9, 10.0), with multiple power values for each step (e.g. 1000, 1412, 134.5, 657.1 at 0.1; 14.5, 948.1, 343.8 at 5.5) - hopefully you see what I'm trying to say.
I've managed to group the data into these individual steps using the following, and have then taken the mean and standard deviation for each group.
group = df.groupby('step').power.mean()
group2 = df.groupby('step').power.std().fillna(0)
This results in two Series (group and group2) which hold the mean and standard deviation for each of the 0.1 steps. It's then easy to create an upper and lower limit for each step using the following:
upperlimit = group + 3*group2
lowerlimit = group - 3*group2
lowerlimit[lowerlimit<0] = 0
Now comes the bit I'm confused about! I need to go back into the original dataframe and remove rows/instances where the power value is outside these calculated limits (note there is a different upper and lower limit for each 0.1 step).
Here's 50 lines of the sample data:
Index Power Step
0 106.0 5.0
1 200.4 5.5
2 201.4 5.6
3 226.9 5.6
4 206.8 5.6
5 177.5 5.3
6 124.0 4.9
7 121.0 4.8
8 93.9 4.7
9 135.6 5.0
10 211.1 5.6
11 265.2 6.0
12 281.4 6.2
13 417.9 6.9
14 546.0 7.4
15 619.9 7.9
16 404.4 7.1
17 241.4 5.8
18 44.3 3.9
19 72.1 4.6
20 21.1 3.3
21 6.3 2.3
22 0.0 0.8
23 0.0 0.9
24 0.0 3.2
25 0.0 4.6
26 33.3 4.2
27 97.7 4.7
28 91.0 4.7
29 105.6 4.8
30 97.4 4.6
31 126.7 5.0
32 134.3 5.0
33 133.4 5.1
34 301.8 6.3
35 298.5 6.3
36 312.1 6.5
37 505.3 7.5
38 491.8 7.3
39 404.6 6.8
40 324.3 6.6
41 347.2 6.7
42 365.3 6.8
43 279.7 6.3
44 351.4 6.8
45 350.1 6.7
46 573.5 7.9
47 490.1 7.5
48 520.4 7.6
49 548.2 7.9
To put your goal another way, you want to perform some manipulations on grouped data, and then project the results of those manipulations back to the ungrouped rows so you can use them for filtering those rows. One way to do this is with transform:
The transform method returns an object that is indexed the same (same size) as the one being grouped. Thus, the passed transform function should return a result that is the same size as the group chunk.
You can then create the new columns directly (note that inside the lambda p.std() is a scalar, so it can't be chained with .fillna; guard the single-row groups instead):
df['upper'] = df.groupby('step').power.transform(lambda p: p.mean() + 3 * (p.std() if len(p) > 1 else 0))
df['lower'] = df.groupby('step').power.transform(lambda p: p.mean() - 3 * (p.std() if len(p) > 1 else 0))
df.loc[df['lower'] < 0, 'lower'] = 0
And filter accordingly:
df = df[(df['power'] <= df['upper']) & (df['power'] >= df['lower'])]
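An equivalent sketch that keeps the limits as temporary Series instead of adding helper columns to the frame (it assumes the lowercase 'step' and 'power' names used in the question's groupby code):
group_mean = df.groupby('step').power.transform('mean')
group_std = df.groupby('step').power.transform('std').fillna(0)
# Keep rows whose power lies within [max(mean - 3*std, 0), mean + 3*std].
df = df[df['power'].between((group_mean - 3 * group_std).clip(lower=0),
                            group_mean + 3 * group_std)]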
