Python: conditional_join from the janitor package

Hi, I want to do a lookup to get the factor value for my dataset based on three conditions. Below is the lookup table:
import pandas as pd
Lookup_Table = {'State_Cd': ['TX','TX','TX','TX','CA','CA','CA','CA'],
                'Deductible': [0,0,1000,1000,0,0,1000,1000],
                'Revenue_1': [-99999999,25000000,-99999999,25000000,-99999999,25000000,-99999999,25000000],
                'Revenue_2': [24999999,99000000,24999999,99000000,24999999,99000000,24999999,99000000],
                'Factor': [0.15,0.25,0.2,0.3,0.11,0.15,0.13,0.45]
                }
Lookup_Table = pd.DataFrame(Lookup_Table, columns = ['State_Cd','Deductible','Revenue_1','Revenue_2','Factor'])
Lookup output:
Lookup_Table
  State_Cd  Deductible  Revenue_1  Revenue_2  Factor
0       TX           0  -99999999   24999999    0.15
1       TX           0   25000000   99000000    0.25
2       TX        1000  -99999999   24999999    0.20
3       TX        1000   25000000   99000000    0.30
4       CA           0  -99999999   24999999    0.11
5       CA           0   25000000   99000000    0.15
6       CA        1000  -99999999   24999999    0.13
7       CA        1000   25000000   99000000    0.45
And then below is my dataset.
Dataset = {'Policy': ['A','B','C'],
           'State': ['CA','TX','TX'],
           'Deductible': [0,1000,0],
           'Revenue': [1500000,30000000,1000000]
           }
Dataset = pd.DataFrame(Dataset, columns = ['Policy','State','Deductible','Revenue'])
Dataset output:
Dataset
  Policy  State  Deductible   Revenue
0      A     CA           0   1500000
1      B     TX        1000  30000000
2      C     TX           0   1000000
So, to do the lookup, State must match State_Cd in the lookup table, Deductible must match Deductible in the lookup table, and Revenue must fall between Revenue_1 and Revenue_2 (Revenue_1 <= Revenue <= Revenue_2). That gives the desired factor value. Below is my expected output:
  Policy  State  Deductible   Revenue  Factor
0      A     CA           0   1500000    0.11
1      B     TX        1000  30000000    0.30
2      C     TX           0   1000000    0.15
I'm trying conditional_join from the janitor package, but I'm getting an error. Is something missing from my code?
import janitor
Data_Final = (Dataset.conditional_join(Lookup_Table,
    # variable arguments
    # col_from_left_df, col_from_right_df, comparator
    ('Revenue', 'Revenue_1', '>='),
    ('Revenue', 'Revenue_2', '<='),
    ('State', 'State_Cd', '=='),
    ('Deductible', 'Deductible', '=='),
    how = 'left', sort_by_appearance = False
))
Below is the error
TypeError: __init__() got an unexpected keyword argument 'copy'

Resolved by installing an older version of pandas (earlier than 1.5), e.g.:
pip install pandas==1.4
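If downgrading pandas is not an option, the same lookup can probably be done with plain pandas: merge on the two equality keys, then keep the rows whose Revenue falls inside the range. This is only a sketch and not the janitor API; the names lookup, dataset, merged and result are introduced here for illustration, and the revenue figures follow the printed Dataset output above.

```python
import pandas as pd

lookup = pd.DataFrame({
    'State_Cd': ['TX', 'TX', 'TX', 'TX', 'CA', 'CA', 'CA', 'CA'],
    'Deductible': [0, 0, 1000, 1000, 0, 0, 1000, 1000],
    'Revenue_1': [-99999999, 25000000] * 4,
    'Revenue_2': [24999999, 99000000] * 4,
    'Factor': [0.15, 0.25, 0.2, 0.3, 0.11, 0.15, 0.13, 0.45],
})
dataset = pd.DataFrame({
    'Policy': ['A', 'B', 'C'],
    'State': ['CA', 'TX', 'TX'],
    'Deductible': [0, 1000, 0],
    'Revenue': [1500000, 30000000, 1000000],
})

# Step 1: equi-join on state and deductible (each policy matches two bands).
merged = dataset.merge(lookup, how='left',
                       left_on=['State', 'Deductible'],
                       right_on=['State_Cd', 'Deductible'])

# Step 2: keep only the band with Revenue_1 <= Revenue <= Revenue_2.
in_range = merged['Revenue'].between(merged['Revenue_1'], merged['Revenue_2'])
result = (merged.loc[in_range, ['Policy', 'State', 'Deductible', 'Revenue', 'Factor']]
                .reset_index(drop=True))
print(result)
```

Note that the range filter drops any policy with no matching band, so this behaves like an inner join on the revenue condition. The intermediate merge also grows with the number of bands per key, which may matter for large tables.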

Related

Find positive and negative bin limits based on multiple other columns

I have a dataframe as shown below:
ID  raw_val  var_name  constant  s_value
1   388      Qty       0.36      -0.032
2   120      Qty       0.36      -0.007
3   34       Qty       0.36      0.16
4   45       Qty       0.36      0.31
1   110      F1        0.36      -0.232
2   1000     F1        0.36      -0.17
3   318      F1        0.36      0.26
4   419      F1        0.36      0.31
My objective is to
a) Find the upper and lower limits (of raw_val) for each value of var_name for s_value >=0
b) Find the upper and lower limits (of raw_val) for each value of var_name for s_value <0
I tried the below
df['sign'] = np.where[df['s_value']<0, 'neg', 'pos']
s = df.groupby(['var_name','sign'])['raw_val'].series
df['buckets'] = pd.IntervalIndex.from_arrays(s)
Please note that my real data is large and has more than 200 unique values in the var_name column. The distribution of positive and negative values (s_value) may be uneven for each value of var_name. The sample df shows an even distribution of positive and negative values, but that may not be the case in real life.
I expect my output to look like this:
var_name  sign  low_limit  upp_limit
Qty       neg   120        388
F1        neg   110        1000
Qty       pos   34         45
F1        pos   318        419
I think numpy.where combined with aggregating the minimum and maximum values is the way to go:
df['sign'] = np.where(df['s_value']<0, 'neg', 'pos')
df1 = (df.groupby(['var_name','sign'], sort=False, as_index=False)
         .agg(low_limit=('raw_val','min'), upp_limit=('raw_val','max')))
print (df1)
  var_name  sign  low_limit  upp_limit
0      Qty   neg        120        388
1      Qty   pos         34         45
2       F1   neg        110       1000
3       F1   pos        318        419
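The question also tried to build interval buckets. Once the limits are aggregated, one possible way to get them is pd.arrays.IntervalArray.from_arrays, as sketched below on the sample data. The names limits and bucket are made up for this illustration, and closed='both' is an assumption about how the limits should be interpreted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 1, 2, 3, 4],
    'raw_val': [388, 120, 34, 45, 110, 1000, 318, 419],
    'var_name': ['Qty'] * 4 + ['F1'] * 4,
    'constant': [0.36] * 8,
    's_value': [-0.032, -0.007, 0.16, 0.31, -0.232, -0.17, 0.26, 0.31],
})

# Label each row by the sign of s_value, then aggregate per (var_name, sign).
df['sign'] = np.where(df['s_value'] < 0, 'neg', 'pos')
limits = (df.groupby(['var_name', 'sign'], sort=False, as_index=False)
            .agg(low_limit=('raw_val', 'min'), upp_limit=('raw_val', 'max')))

# Build one closed interval per group from the aggregated limits.
limits['bucket'] = pd.arrays.IntervalArray.from_arrays(
    limits['low_limit'], limits['upp_limit'], closed='both')
print(limits)
```

Because the aggregation runs per group, an uneven split of positive and negative values within a var_name does not need special handling; a group that never occurs simply produces no row.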

Merging One Column on to Multiple Columns

I have the following two dataframes, DF1:
      location            vaccine1         vaccine2           vaccine3   vaccine4
0  Afghanistan  Oxford/AstraZeneca  Pfizer/BioNTech  Sinopharm/Beijing       None
1      Albania  Oxford/AstraZeneca  Pfizer/BioNTech            Sinovac  Sputnik V
2      Algeria           Sputnik V             None               None       None
3      Andorra  Oxford/AstraZeneca  Pfizer/BioNTech               None       None
DF2:
              Vaccine  Efficacy
0  Oxford/AstraZeneca      0.70
1     Pfizer/BioNTech      0.95
2   Sinopharm/Beijing      0.79
3             Sinovac      0.50
4           Sputnik V      0.92
I understand that you can merge as shown below, but the process has to be repeated four times, which is inefficient:
v1 = pd.merge(df1, vacc_eff, how='left', left_on='vaccine1', right_on='Vaccine')[['location', 'Efficacy']]
v2 = pd.merge(df1, vacc_eff, how='left', left_on='vaccine2', right_on='Vaccine')[['location', 'Efficacy']]
vmerged = pd.merge(v1,v2,on=['location'])
How can I merge the DF2 column 'Efficacy' onto each of the vaccine columns in DF1 without writing the same merge function again and again?
Here is a solution you can try: stack, then map, then unstack.
map_ = vacc_eff.set_index('Vaccine')['Efficacy'].to_dict()
print(
    df1[['location', 'vaccine1', 'vaccine2']].set_index('location')
    .stack().map(map_).unstack()
)
             vaccine1  vaccine2
location
Afghanistan      0.70      0.95
Albania          0.70      0.95
Algeria          0.92       NaN
Andorra          0.70      0.95
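A variation on the same idea, sketched here for all four vaccine columns, maps each column against the efficacy Series directly instead of stacking. It assumes the missing entries in DF1 are real None values (the string 'None' would also map to NaN); efficacy, vaccine_cols and eff are names introduced for this example.

```python
import pandas as pd

df1 = pd.DataFrame({
    'location': ['Afghanistan', 'Albania', 'Algeria', 'Andorra'],
    'vaccine1': ['Oxford/AstraZeneca', 'Oxford/AstraZeneca', 'Sputnik V', 'Oxford/AstraZeneca'],
    'vaccine2': ['Pfizer/BioNTech', 'Pfizer/BioNTech', None, 'Pfizer/BioNTech'],
    'vaccine3': ['Sinopharm/Beijing', 'Sinovac', None, None],
    'vaccine4': [None, 'Sputnik V', None, None],
})
vacc_eff = pd.DataFrame({
    'Vaccine': ['Oxford/AstraZeneca', 'Pfizer/BioNTech', 'Sinopharm/Beijing', 'Sinovac', 'Sputnik V'],
    'Efficacy': [0.70, 0.95, 0.79, 0.50, 0.92],
})

# Lookup Series: vaccine name -> efficacy.
efficacy = vacc_eff.set_index('Vaccine')['Efficacy']

# Map every vaccine column through the lookup; unknown or missing names become NaN.
vaccine_cols = ['vaccine1', 'vaccine2', 'vaccine3', 'vaccine4']
eff = df1.set_index('location')[vaccine_cols].apply(lambda col: col.map(efficacy))
print(eff)
```

This keeps one row per location and one efficacy column per vaccine slot, without writing a separate merge for each column.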

AttributeError: 'NoneType' object has no attribute 'text' - BeautifulSoup

I have a small script for scraping data from fbref (link to the data: https://fbref.com/en/comps/9/stats/Premier-League-Stats). It used to work well, but now some features fail (I've checked that the fields which no longer work are "player", "nationality", "position", "squad", "age", "birth_year"). I have also checked that the fields still have the same names on the website as before. Any ideas or help to solve the problem?
Many Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import sys, getopt
import csv
def get_tables(url):
    res = requests.get(url)
    ## The next two lines get around the issue with comments breaking the parsing.
    comm = re.compile("<!--|-->")
    soup = BeautifulSoup(comm.sub("",res.text),'lxml')
    all_tables = soup.findAll("tbody")
    team_table = all_tables[0]
    player_table = all_tables[1]
    return player_table, team_table
def get_frame(features, player_table):
    pre_df_player = dict()
    features_wanted_player = features
    rows_player = player_table.find_all('tr')
    for row in rows_player:
        if(row.find('th',{"scope":"row"}) != None):
            for f in features_wanted_player:
                cell = row.find("td",{"data-stat": f})
                a = cell.text.strip().encode()
                text=a.decode("utf-8")
                if(text == ''):
                    text = '0'
                if((f!='player')&(f!='nationality')&(f!='position')&(f!='squad')&(f!='age')&(f!='birth_year')):
                    text = float(text.replace(',',''))
                if f in pre_df_player:
                    pre_df_player[f].append(text)
                else:
                    pre_df_player[f] = [text]
    df_player = pd.DataFrame.from_dict(pre_df_player)
    return df_player
stats = ["player","nationality","position","squad","age","birth_year","games","games_starts","minutes","goals","assists","pens_made","pens_att","cards_yellow","cards_red","goals_per90","assists_per90","goals_assists_per90","goals_pens_per90","goals_assists_pens_per90","xg","npxg","xa","xg_per90","xa_per90","xg_xa_per90","npxg_per90","npxg_xa_per90"]
def frame_for_category(category,top,end,features):
    url = (top + category + end)
    player_table, team_table = get_tables(url)
    df_player = get_frame(features, player_table)
    return df_player
top='https://fbref.com/en/comps/9/'
end='/Premier-League-Stats'
df1 = frame_for_category('stats',top,end,stats)
df1
I suggest loading the table with pandas' read_html. There is a direct link to this table under Share & Export --> Embed this Table.
import pandas as pd
df = pd.read_html("https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fstats%2FPremier-League-Stats&div=div_stats_standard", header=1)
This returns a list of dataframes; the table can be accessed as df[0]. Output of df[0].head():
(The 33 columns are shown in three blocks for width.)
   Rk  Player               Nation   Pos  Squad           Age     Born  MP  Starts  Min   90s
0  1   Patrick van Aanholt  nl NED   DF   Crystal Palace  30-190  1990  16  15      1324  14.7
1  2   Tammy Abraham        eng ENG  FW   Chelsea         23-156  1997  20  12      1021  11.3
2  3   Che Adams            eng ENG  FW   Southampton     24-237  1996  26  22      1985  22.1
3  4   Tosin Adarabioyo     eng ENG  DF   Fulham          23-164  1997  23  23      2070  23
4  5   Adrián               es ESP   GK   Liverpool       34-063  1987  3   3       270   3
   Gls  Ast  G-PK  PK  PKatt  CrdY  CrdR  Gls.1  Ast.1  G+A   G-PK.1  G+A-PK
0  0    1    0     0   0      1     0     0      0.07   0.07  0       0.07
1  6    1    6     0   0      0     0     0.53   0.09   0.62  0.53    0.62
2  5    4    5     0   0      1     0     0.23   0.18   0.41  0.23    0.41
3  0    0    0     0   0      1     0     0      0      0     0       0
4  0    0    0     0   0      0     0     0      0      0     0       0
   xG   npxG  xA   npxG+xA  xG.1  xA.1  xG+xA  npxG.1  npxG+xA.1  Matches
0  1.2  1.2   0.8  2        0.08  0.05  0.13   0.08    0.13       Matches
1  5.6  5.6   0.9  6.5      0.49  0.08  0.57   0.49    0.57       Matches
2  5.5  5.5   4.3  9.9      0.25  0.2   0.45   0.25    0.45       Matches
3  1    1     0.1  1.1      0.04  0.01  0.05   0.04    0.05       Matches
4  0    0     0    0        0     0     0      0       0          Matches
If you're only after the player stats, change player_table = all_tables[1] to player_table = all_tables[2], because at the moment you are feeding the team table into the get_frame function.
I tried it and it worked fine after that.
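For context, the AttributeError itself comes from row.find(...) returning None when a row has no td with the requested data-stat, which is what happens when the wrong table is passed in. A defensive lookup along the following lines would surface that more gently. This is only a sketch on a tiny inline HTML snippet invented for the example; the helper name cell_text is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Made-up sample: the second row has no "player" cell.
html = """
<table><tbody>
<tr><th scope="row">1</th><td data-stat="player">Che Adams</td><td data-stat="goals">5</td></tr>
<tr><th scope="row">2</th><td data-stat="goals">3</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

def cell_text(row, stat, default="0"):
    """Return the stripped text of the cell for `stat`, or `default` if absent or empty."""
    cell = row.find("td", {"data-stat": stat})
    if cell is None:  # find() returns None when nothing matches
        return default
    text = cell.text.strip()
    return text if text else default

rows = soup.find("tbody").find_all("tr")
players = [cell_text(row, "player", default="") for row in rows]
goals = [float(cell_text(row, "goals")) for row in rows]
print(players, goals)
```

A guard like this hides the symptom rather than the cause, so checking which tbody is being parsed is still the actual fix.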

The most elegant way to do a calculation on dataframe column

I'm a newbie in python.
I have a column in pandas dataframe called [weight].
What is the most efficient way to rescale the securities' weights so that they sum to 1 (or 100%)? Something like the sample calculation below:
        weight  new weight
        0,05    14%
        0,10    29%
        0,20    57%
total   0,35    100%
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
print(df)
Security Rating Weight
ABC AAA 0.05
DEF BBB 0.10
GHI AA 0.20
I think we can divide each weight by the sum of the weights to get the percentage weight (newWeight):
import pandas as pd
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
df['newWeight'] = 100 * df['Weight'] / sum(df['Weight'])
print(df)
## Rating Security Weight newWeight
## 0 AAA ABC 0.05 14.285714
## 1 BBB DEF 0.10 28.571429
## 2 AA GHI 0.20 57.142857
Using the apply method is a neat way to solve this problem. You can do something like this:
import pandas as pd
df = pd.DataFrame({'Security' : ['ABC','DEF','GHI'], 'Rating': ['AAA', 'BBB','AA'], 'Weight' : [ 0.05, 0.1, 0.2]})
total = df.Weight.sum()
df['newWeight'] = df.Weight.apply(lambda x: x / total)
The resulting DataFrame looks like this:
Security Rating Weight newWeight
0 ABC AAA 0.05 0.142857
1 DEF BBB 0.10 0.285714
2 GHI AA 0.20 0.571429
If you want to represent these as percentages, you need to convert them to strings. Here's an example:
df['percentWeight'] = df.newWeight.apply(lambda x: "{}%".format(round(x * 100)))
And you get the result:
Security Rating Weight newWeight percentWeight
0 ABC AAA 0.05 0.142857 14%
1 DEF BBB 0.10 0.285714 29%
2 GHI AA 0.20 0.571429 57%
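Both answers can also be written without apply or the built-in sum, since pandas divides a whole column by a scalar directly. The following is a minimal sketch along those lines; the '{:.0%}' format string is one possible choice for the display column, not something taken from the answers above.

```python
import pandas as pd

df = pd.DataFrame({'Security': ['ABC', 'DEF', 'GHI'],
                   'Rating': ['AAA', 'BBB', 'AA'],
                   'Weight': [0.05, 0.1, 0.2]})

# Vectorized division: each weight over the column total.
df['newWeight'] = df['Weight'] / df['Weight'].sum()

# Whole-number percentage strings, for display only (keep newWeight for arithmetic).
df['percentWeight'] = df['newWeight'].map('{:.0%}'.format)
print(df)
```

Keeping the numeric newWeight column alongside the string column means later calculations do not have to parse the percentages back.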

Finding values based on specific categories

I was wondering how I would find estimated values based on several different categories. Two of the columns are categorical, one column contains two strings of interest, and the last contains numeric values.
I have a csv file called sports.csv
import pandas as pd
import numpy as np
#loading the data into data frame
df = pd.read_csv('sports.csv')
I'm trying to find a suggested price for a gym that has both baseball and basketball and an enrollment from 240 to 260, given that it is in region 4 and of type 1.
Region Type enroll estimates price Gym
2 1 377 0.43 40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
4 2 100 0.26 37 Baseball|Tennis
4 1 347 0.65 61 Basketball|Baseball|Ballet
4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
1 1 286 0.74 78 Swimming|Basketball
0 1 210 0.13 29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
0 1 263 0.91 31 Tennis
2 2 271 0.39 54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
0 1 109 0.12 17 Football|Hockey|Volleyball
I don't know how to piece everything together. I apologize if the syntax is incorrect; I'm just beginning Python. So far I have:
import pandas as pd
import numpy as np
#loading the data into data frame
df = pd.read_csv('sports.csv')
#group 4th region and type 1 together where enrollment is in between 240 and 260
group = df[df['Region'] == 4] df[df['Type'] == 1] df[240>=df['Enrollment'] <=260 ]
#split by pipe chars to find gyms that contain both Baseball and Basketball
df['Gym'] = df['Gym'].str.split('|')
df['Gym'] = df['Gym'].str.contains('Baseball'& 'Basketball')
price = df.loc[df['Gym'], 'Price']
Should I do a groupby instead? If so, how would I include the columns Type==1 Region ==4 and enrollment from 240 to 260 ?
You can create a mask with all your conditions specified and then use the mask for subsetting:
mask = (df['Region'] == 4) & (df['Type'] == 1) & \
       (df['enroll'] <= 260) & (df['enroll'] >= 240) & \
       df['Gym'].str.contains('Baseball') & df['Gym'].str.contains('Basketball')
df['price'][mask]
# Series([], Name: price, dtype: int64)
This returns an empty Series, since no record satisfies all of the conditions above.
I had to add an instance that would actually meet your criteria, or else you will get an empty result. You want to use df.loc with conditions as follows:
In [1]: import pandas as pd, numpy as np, io
In [2]: in_string = io.StringIO("""Region Type enroll estimates price Gym
...: 2 1 377 0.43 40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
...: 4 2 100 0.26 37 Baseball|Tennis
...: 4 1 247 0.65 61 Basketball|Baseball|Ballet
...: 4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
...: 1 1 286 0.74 78 Swimming|Basketball
...: 0 1 210 0.13 29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
...: 0 1 263 0.91 31 Tennis
...: 2 2 271 0.39 54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
...: 3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
...: 0 1 109 0.12 17 Football|Hockey|Volleyball""")
In [3]: df = pd.read_csv(in_string,delimiter=r"\s+")
In [4]: df.loc[df.Gym.str.contains(r"(?=.*Baseball)(?=.*Basketball)")
...: & (df.enroll <= 260) & (df.enroll >= 240)
...: & (df.Region == 4) & (df.Type == 1), 'price']
Out[4]:
2 61
Name: price, dtype: int64
Note I used a regex pattern for contains that essentially acts as an AND operator for regex. You could simply have done another conjunction of .contains conditions for Basketball and Baseball.
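As a third variation, the pipe-separated Gym strings can be split and checked as sets, which avoids accidental substring matches, and Series.between can express the enrollment range. The sketch below uses four rows taken from the modified sample in the answer above (the one with enroll 247); wanted, has_both and prices are names invented for the example.

```python
import io
import pandas as pd

raw = io.StringIO("""Region Type enroll estimates price Gym
4 2 100 0.26 37 Baseball|Tennis
4 1 247 0.65 61 Basketball|Baseball|Ballet
4 1 264 0.17 12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
3 3 247 0.51 33 Baseball|Hockey|Swimming|Cycling
""")
df = pd.read_csv(raw, sep=r"\s+")

# True where the gym offers every sport in `wanted` (exact names, not substrings).
wanted = {'Baseball', 'Basketball'}
has_both = df['Gym'].str.split('|').apply(lambda sports: wanted.issubset(sports))

# Combine all conditions; between() is inclusive on both ends by default.
mask = (
    (df['Region'] == 4)
    & (df['Type'] == 1)
    & df['enroll'].between(240, 260)
    & has_both
)
prices = df.loc[mask, 'price']
print(prices)
```

No groupby is needed for this kind of filter; a groupby would only come in if a price summary per Region and Type were wanted afterwards.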
