The most elegant way to do a calculation on dataframe column - python

I'm a newbie in Python.
I have a column called [Weight] in a pandas DataFrame.
What is the most efficient and elegant way to rescale the securities' weights so that they sum to 1 (or 100%)? Something like the sample calculation below:
        weight   new weight
        0.05     14%
        0.10     29%
        0.20     57%
total   0.35     100%
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
print(df)
Security  Rating  Weight
ABC       AAA     0.05
DEF       BBB     0.10
GHI       AA      0.20

I think we can divide each weight by the sum of weights to get the percentage weight (newWeight):
import pandas as pd
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
df['newWeight'] = 100 * df['Weight'] / sum(df['Weight'])
print(df)
## Rating Security Weight newWeight
## 0 AAA ABC 0.05 14.285714
## 1 BBB DEF 0.10 28.571429
## 2 AA GHI 0.20 57.142857

Using the apply method is a neat way to solve this problem. You can do something like this:
import pandas as pd
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
total = df.Weight.sum()
df['newWeight'] = df.Weight.apply(lambda x: x / total)
The resulting DataFrame looks like this:
Security Rating Weight newWeight
0 ABC AAA 0.05 0.142857
1 DEF BBB 0.10 0.285714
2 GHI AA 0.20 0.571429
If you want to represent these as percentages, you need to convert them to strings; here's an example:
df['percentWeight'] = df.newWeight.apply(lambda x: "{}%".format(round(x * 100)))
And you get the result:
Security Rating Weight newWeight percentWeight
0 ABC AAA 0.05 0.142857 14%
1 DEF BBB 0.10 0.285714 29%
2 GHI AA 0.20 0.571429 57%
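As a side note, the same result can be obtained without apply by operating on the whole column at once, which is usually faster on large frames. A minimal sketch, assuming the same df as above:
import pandas as pd

df = pd.DataFrame({'Security': ['ABC', 'DEF', 'GHI'], 'Rating': ['AAA', 'BBB', 'AA'], 'Weight': [0.05, 0.1, 0.2]})

# Vectorized normalization: divide the column by its own sum.
df['newWeight'] = df['Weight'] / df['Weight'].sum()

# Percent strings, rounded to whole percents.
df['percentWeight'] = (df['newWeight'] * 100).round().astype(int).astype(str) + '%'
print(df)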

Related

Python: Conditional_join janitor package

Hi, I want to do a lookup to get the factor value for my dataset based on 3 conditions. Below is the lookup table:
import pandas as pd

Lookup_Table = {'State_Cd': ['TX','TX','TX','TX','CA','CA','CA','CA'],
                'Deductible': [0,0,1000,1000,0,0,1000,1000],
                'Revenue_1': [-99999999,25000000,-99999999,25000000,-99999999,25000000,-99999999,25000000],
                'Revenue_2': [24999999,99000000,24999999,99000000,24999999,99000000,24999999,99000000],
                'Factor': [0.15,0.25,0.2,0.3,0.11,0.15,0.13,0.45]
                }
Lookup_Table = pd.DataFrame(Lookup_Table, columns = ['State_Cd','Deductible','Revenue_1','Revenue_2','Factor'])
lookup output:
Lookup_Table
State_Cd Deductible Revenue_1 Revenue_2 Factor
0 TX 0 -99999999 24999999 0.15
1 TX 0 25000000 99000000 0.25
2 TX 1000 -99999999 24999999 0.20
3 TX 1000 25000000 99000000 0.30
4 CA 0 -99999999 24999999 0.11
5 CA 0 25000000 99000000 0.15
6 CA 1000 -99999999 24999999 0.13
7 CA 1000 25000000 99000000 0.45
And then below is my dataset.
Dataset = {'Policy': ['A','B','C'],
           'State': ['CA','TX','TX'],
           'Deductible': [0,1000,0],
           'Revenue': [10000000,30000000,1000000]
           }
Dataset = pd.DataFrame(Dataset, columns = ['Policy','State','Deductible','Revenue'])
Dataset output:
Dataset
Policy State Deductible Revenue
0 A CA 0 1500000
1 B TX 1000 30000000
2 C TX 0 1000000
So basically, to do the lookup, State must match State_Cd in the lookup table, Deductible must match Deductible in the lookup table, and Revenue should fall between Revenue_1 and Revenue_2 (Revenue_1 <= Revenue <= Revenue_2), in order to get the desired factor value. Below is my expected output:
Policy State Deductible Revenue Factor
0 A CA 0 1500000 0.11
1 B TX 1000 30000000 0.30
2 C TX 0 1000000 0.15
I'm trying the conditional_join from the janitor package. However, I'm getting an error. Is there something missing in my code?
import janitor

Data_Final = (Dataset.conditional_join(
    Lookup_Table,
    # variable arguments
    # col_from_left_df, col_from_right_df, comparator
    ('Revenue', 'Revenue_1', '>='),
    ('Revenue', 'Revenue_2', '<='),
    ('State', 'State_Cd', '=='),
    ('Deductible', 'Deductible', '=='),
    how='left', sort_by_appearance=False
))
Below is the error
TypeError: __init__() got an unexpected keyword argument 'copy'
Resolved by installing an older version of pandas (less than 1.5), e.g.:
pip install pandas==1.4
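As an aside, if pinning pandas is not desirable, the same lookup can be expressed in plain pandas with an equality merge followed by a range filter. This is only a sketch of an alternative, not the janitor API, and it assumes the Dataset and Lookup_Table frames defined above:
# Join on the exact-match keys, then keep rows whose Revenue falls in [Revenue_1, Revenue_2].
merged = Dataset.merge(
    Lookup_Table.rename(columns={'State_Cd': 'State'}),
    on=['State', 'Deductible'],
    how='left',
)
in_range = (merged['Revenue'] >= merged['Revenue_1']) & (merged['Revenue'] <= merged['Revenue_2'])
Data_Final = merged[in_range].drop(columns=['Revenue_1', 'Revenue_2'])
print(Data_Final)
Note that filtering after a left join drops policies with no matching band, so this behaves like an inner join on the matched rows.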

Find positive and negative bin limits based on multiple other columns

I have a dataframe like as shown below
ID raw_val var_name constant s_value
1 388 Qty 0.36 -0.032
2 120 Qty 0.36 -0.007
3 34 Qty 0.36 0.16
4 45 Qty 0.36 0.31
1 110 F1 0.36 -0.232
2 1000 F1 0.36 -0.17
3 318 F1 0.36 0.26
4 419 F1 0.36 0.31
My objective is to
a) Find the upper and lower limits (of raw_val) for each value of var_name for s_value >=0
b) Find the upper and lower limits (of raw_val) for each value of var_name for s_value <0
I tried the below
df['sign'] = np.where[df['s_value']<0, 'neg', 'pos']
s = df.groupby(['var_name','sign'])['raw_val'].series
df['buckets'] = pd.IntervalIndex.from_arrays(s)
Please note that my real data is large and has more than 200 unique values in the var_name column. The distribution of positive and negative values (s_value) may be uneven for each value of the var_name column. In the sample df, I have shown an even distribution of pos and neg values, but it may not be the case in real life.
I expect my output to be like as below
var_name sign low_limit upp_limit
Qty neg 120 388
F1 neg 110 1000
Qty pos 34 45
F1 pos 318 419
I think numpy.where combined with aggregating the minimal and maximal values is the way to go:
df['sign'] = np.where(df['s_value']<0, 'neg', 'pos')
df1 = (df.groupby(['var_name','sign'], sort=False, as_index=False)
         .agg(low_limit=('raw_val','min'), upp_limit=('raw_val','max')))
print (df1)
var_name sign low_limit upp_limit
0 Qty neg 120 388
1 Qty pos 34 45
2 F1 neg 110 1000
3 F1 pos 318 419
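For reference, the sample frame above can be reproduced like this (a sketch built from the data as printed in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 1, 2, 3, 4],
    'raw_val': [388, 120, 34, 45, 110, 1000, 318, 419],
    'var_name': ['Qty'] * 4 + ['F1'] * 4,
    'constant': [0.36] * 8,
    's_value': [-0.032, -0.007, 0.16, 0.31, -0.232, -0.17, 0.26, 0.31],
})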

AttributeError: 'NoneType' object has no attribute 'text' - BeautifulSoup

I have a little piece of code for scraping info from fbref (link for the data: https://fbref.com/en/comps/9/stats/Premier-League-Stats) and it worked well, but now I have problems with some features (I've checked that the fields which don't work now are "player", "nationality", "position", "squad", "age", "birth_year"). I have also checked that the fields have the same names on the web as they used to. Any ideas/help to solve the problem?
Many Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import sys, getopt
import csv

def get_tables(url):
    res = requests.get(url)
    ## The next two lines get around the issue with comments breaking the parsing.
    comm = re.compile("<!--|-->")
    soup = BeautifulSoup(comm.sub("", res.text), 'lxml')
    all_tables = soup.findAll("tbody")
    team_table = all_tables[0]
    player_table = all_tables[1]
    return player_table, team_table

def get_frame(features, player_table):
    pre_df_player = dict()
    features_wanted_player = features
    rows_player = player_table.find_all('tr')
    for row in rows_player:
        if (row.find('th', {"scope": "row"}) != None):
            for f in features_wanted_player:
                cell = row.find("td", {"data-stat": f})
                a = cell.text.strip().encode()
                text = a.decode("utf-8")
                if (text == ''):
                    text = '0'
                if ((f != 'player') & (f != 'nationality') & (f != 'position') & (f != 'squad') & (f != 'age') & (f != 'birth_year')):
                    text = float(text.replace(',', ''))
                if f in pre_df_player:
                    pre_df_player[f].append(text)
                else:
                    pre_df_player[f] = [text]
    df_player = pd.DataFrame.from_dict(pre_df_player)
    return df_player

stats = ["player","nationality","position","squad","age","birth_year","games","games_starts","minutes","goals","assists","pens_made","pens_att","cards_yellow","cards_red","goals_per90","assists_per90","goals_assists_per90","goals_pens_per90","goals_assists_pens_per90","xg","npxg","xa","xg_per90","xa_per90","xg_xa_per90","npxg_per90","npxg_xa_per90"]

def frame_for_category(category, top, end, features):
    url = (top + category + end)
    player_table, team_table = get_tables(url)
    df_player = get_frame(features, player_table)
    return df_player

top = 'https://fbref.com/en/comps/9/'
end = '/Premier-League-Stats'
df1 = frame_for_category('stats', top, end, stats)
df1
I suggest loading the table with pandas' read_html. There is a direct link to this table under Share & Export --> Embed this Table.
import pandas as pd
df = pd.read_html("https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fstats%2FPremier-League-Stats&div=div_stats_standard", header=1)
This outputs a list of dataframes; the table can be accessed as df[0]. Output of df[0].head():
   Rk  Player               Nation   Pos  Squad           Age     Born  MP  Starts  Min   90s   Gls  Ast  G-PK  PK  PKatt  CrdY  CrdR  Gls.1  Ast.1  G+A   G-PK.1  G+A-PK  xG   npxG  xA   npxG+xA  xG.1  xA.1  xG+xA  npxG.1  npxG+xA.1  Matches
0  1   Patrick van Aanholt  nl NED   DF   Crystal Palace  30-190  1990  16  15      1324  14.7  0    1    0     0   0      1     0     0      0.07   0.07  0       0.07    1.2  1.2   0.8  2        0.08  0.05  0.13   0.08    0.13       Matches
1  2   Tammy Abraham        eng ENG  FW   Chelsea         23-156  1997  20  12      1021  11.3  6    1    6     0   0      0     0     0.53   0.09   0.62  0.53    0.62    5.6  5.6   0.9  6.5      0.49  0.08  0.57   0.49    0.57       Matches
2  3   Che Adams            eng ENG  FW   Southampton     24-237  1996  26  22      1985  22.1  5    4    5     0   0      1     0     0.23   0.18   0.41  0.23    0.41    5.5  5.5   4.3  9.9      0.25  0.2   0.45   0.25    0.45       Matches
3  4   Tosin Adarabioyo     eng ENG  DF   Fulham          23-164  1997  23  23      2070  23    0    0    0     0   0      1     0     0      0      0     0       0       1    1     0.1  1.1      0.04  0.01  0.05   0.04    0.05       Matches
4  5   Adrián               es ESP   GK   Liverpool       34-063  1987  3   3       270   3     0    0    0     0   0      0     0     0      0      0     0       0       0    0     0    0        0     0     0      0       0          Matches
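On the live page the embedded table repeats its header row within the body and includes a Matches column that only holds link text; if those show up in df[0], a possible cleanup (a sketch, assuming the column names shown above) is:
stats_df = df[0]
stats_df = stats_df[stats_df['Rk'] != 'Rk']       # drop any repeated header rows
stats_df = stats_df.drop(columns=['Matches'])     # link-only column
stats_df = stats_df.reset_index(drop=True)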
If you're only after the player stats, change player_table = all_tables[1] to player_table = all_tables[2], because currently you are feeding the team table into the get_frame function.
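That is, only the indexing inside get_tables changes; a sketch of the adjusted function (otherwise the same code as in the question):
def get_tables(url):
    res = requests.get(url)
    comm = re.compile("<!--|-->")
    soup = BeautifulSoup(comm.sub("", res.text), 'lxml')
    all_tables = soup.findAll("tbody")
    team_table = all_tables[0]
    player_table = all_tables[2]  # was all_tables[1], which now points at a team table
    return player_table, team_table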
I tried it and it worked fine after that.

Change normalized integer values to categories for classification

I'm working on a dataset with the following columns, N/A counts, and an example record:
Serial No. 0
GRE Score 0
TOEFL Score 0
University Rating 0
SOP 0
LOR 0
CGPA 0
Research 0
Chance of Admit 0
dtype: int64
0: 1 337 118 4 4.5 4.5 9.65 1 0.92
1: 2 324 107 4 4.0 4.5 8.87 1 0.76
The column Chance of Admit is a normalized value ranging from 0 to 1. What I wanted to do was take this column and output corresponding ordered values where the chance would fall into bins such as (low, medium, high) or (unlikely, doable, likely), etc.
What I have come across is that pandas has a built-in function named to_categorical; however, I don't understand it well enough, and what I've read I still don't exactly get.
This dataset would be used for a decision tree where the labels would be the Chance of Admit.
Thank you for your help.
Since they are "normalized" values... why would you need to categorize them? A simple threshold should work, right?
i.e.
0-0.33 low
0.33-0.66 medium
0.66-1.0 high
The only reason you would want to use an automated method would probably be if your number of categories keeps changing?
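For what it's worth, that fixed-threshold idea can be written out directly before reaching for any binning helper. A minimal sketch, assuming the question's DataFrame is named df and the column is 'Chance of Admit':
# Hypothetical helper: map a 0-1 score to a coarse label using the thresholds above.
def chance_bucket(x):
    if x < 0.33:
        return 'low'
    elif x < 0.66:
        return 'medium'
    return 'high'

df['group'] = df['Chance of Admit'].apply(chance_bucket)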
To do the categorization, you could use pandas cut, but you will need to determine the range and the number of bins (categories). From the docs, something like this should work, I think.
In [6]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
In [8]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
In [9]: df.head(10)
Out[9]:
value group
0 65 60 - 69
1 49 40 - 49
2 56 50 - 59
3 43 40 - 49
4 43 40 - 49
5 91 90 - 99
6 32 30 - 39
7 87 80 - 89
8 36 30 - 39
9 8 0 - 9
You can then swap in your Chance of Admit column in place of df.value and fill in the necessary ranges for your discrete bins, either by threshold or automatically based on the number of bins.
For your reference:
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
IIUC, you want to map a continuous variable to a categorical value based on ranges, for example:
0.96 -> high,
0.31 -> low
...
So pandas provides a function for just that, cut; from the documentation:
Use cut when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable.
Setup
Serial No. GRE Score TOEFL Score ... CGPA Research Chance of Admit
0 1 337 118 ... 9.65 1 0.92
1 2 324 107 ... 8.87 1 0.76
2 2 324 107 ... 8.87 1 0.31
3 2 324 107 ... 8.87 1 0.45
[4 rows x 9 columns]
Assuming the above setup, you could use cut like this:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(labels)
Output
0 high
1 high
2 low
3 medium
Name: Chance of Admit, dtype: category
Categories (3, object): [low < medium < high]
Notice that we use 3 bins: [(0, 0.33], (0.33, 0.66], (0.66, 1.0]] and that the values of the column Chance of Admit are [0.92, 0.76, 0.31, 0.45]. If you want to change the label names, just change the value of the labels parameter, for example: labels=['unlikely', 'doable', 'likely']. If you need an ordinal value, do:
labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=list(range(3)))
print(labels)
Output
0 2
1 2
2 0
3 1
Name: Chance of Admit, dtype: category
Categories (3, int64): [0 < 1 < 2]
Finally to put all in perspective you could do the following to add it to your DataFrame:
df['group'] = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(df)
Output
Serial No. GRE Score TOEFL Score ... Research Chance of Admit group
0 1 337 118 ... 1 0.92 high
1 2 324 107 ... 1 0.76 high
2 2 324 107 ... 1 0.31 low
3 2 324 107 ... 1 0.45 medium
[4 rows x 10 columns]

Pairwise correlation from 2 dataframes in python

I have 2 dataframes:
import pandas as pd

df = pd.DataFrame({'SAMs': ['GOS', 'BUM', 'BEN', 'AUD', 'VWA', 'HON'],
                   'GN1': [22, 22, 2, 2, 2, 5],
                   'GN2': [1.1, 5.7, 4.8, 7.09, 10.876, 0.178]})
df
GN1 GN2 SAMs
0 22 1.100 GOS
1 22 5.700 BUM
2 2 4.800 BEN
3 2 7.090 AUD
4 2 10.876 VWA
5 5 0.178 HON
and df2:
df2 = pd.DataFrame({'SAMs': ['FAMS', 'SAP', 'KLM', 'SOS', 'LUD', 'EJT'],
                    'GN1': [22, 22, 2, 2, 2, 5],
                    'GN2': [1.1, 5.7, 4.8, 7.09, 10.876, 0.178]})
I need to calculate the Pearson correlations between the securities in the SAMs column of df and df2. For each value in the SAMs column from both df and df2, I'd like to make pairwise combinations and calculate their correlations.
At the end, the output should look like:
SAMs correlation_value P-value
GOS-FAMS 0.45 0.87
GOS-SAP 0.55 1
GOS-KLM 0.15 0.89
...
HON-EJT 0.156 0.98
Any suggestions would be great!
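No answer was posted for this one, but a minimal sketch of one approach, using scipy.stats.pearsonr over the cartesian product of rows (assuming each SAM's observations are its GN1/GN2 values, as shown in the question), might look like this. Note that with only two observations per SAM the correlation is always +1 or -1, so the real data presumably has more GN columns; the same loop works with a longer column list.
from itertools import product
from scipy.stats import pearsonr
import pandas as pd

gn_cols = ['GN1', 'GN2']  # extend with the real GN columns
rows = []
for (_, r1), (_, r2) in product(df.iterrows(), df2.iterrows()):
    corr, pval = pearsonr(r1[gn_cols].astype(float), r2[gn_cols].astype(float))
    rows.append({'SAMs': '{}-{}'.format(r1['SAMs'], r2['SAMs']),
                 'correlation_value': corr,
                 'P-value': pval})

result = pd.DataFrame(rows)
print(result)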
