How do I create code for a VLOOKUP in Python? - python

df
Season      Date  Team         Team_Season_Code  TS  L     Opponent      Opponent_Season_Code  OS
2019    20181109  Abilene_Chr  1_2019            94  Home  Arkansas_St   15_2019               73
2019    20181115  Abilene_Chr  1_2019            67  Away  Denver        82_2019               61
2019    20181122  Abilene_Chr  1_2019            72  N     Elon          70_2019               56
2019    20181123  Abilene_Chr  1_2019            73  Away  Pacific       224_2019              71
2019    20181124  Abilene_Chr  1_2019            60  N     UC_Riverside  306_2019              48
Overall_Season_Avg
Team_Season_Code  Team          TS         OS         MOV
15_2019           Arkansas_St   70.909091  65.242424  5.666667
70_2019           Elon          73.636364  71.818182  1.818182
82_2019           Denver        74.03125   72.15625   1.875
224_2019          Pacific       78.333333  76.466667  1.866667
306_2019          UC_Riverside  79.545455  78.060606  1.484848
I have these two dataframes and I want to look up each Opponent_Season_Code from df in the "Team_Season_Code" column of Overall_Season_Avg and bring back "TS" and "OS" as two new columns in df called "OTS" and "OOS".
So for row 1 of df the new columns should be OOS with the value 65.24... and OTS with the value 70.90...
In Excel this is a simple VLOOKUP, but I haven't been able to apply the solutions I found in other VLOOKUP questions on Stack Overflow, so I decided to post my own question. I will also note that the Overall_Season_Avg dataframe was created with Overall_Season_Avg = df.groupby(['Team_Season_Code', 'Team']).agg({'TS': np.mean, 'OS': np.mean, 'MOV': np.mean}).

You can use a merge, after reworking Overall_Season_Avg a bit:
df.merge(Overall_Season_Avg
         .set_index(['Team_Season_Code', 'Team'])
         [['OS', 'TS']].add_prefix('O'),
         left_on=['Opponent_Season_Code', 'Opponent'],
         right_index=True, how='left'
         )
Output:
Season Date Team Team_Season_Code TS L Opponent Opponent_Season_Code OS OOS OTS
0 2019 20181109 Abilene_Chr 1_2019 94 Home Arkansas_St 15_2019 73 65.242424 70.909091
1 2019 20181115 Abilene_Chr 1_2019 67 Away Denver 82_2019 61 72.156250 74.031250
2 2019 20181122 Abilene_Chr 1_2019 72 N Elon 70_2019 56 71.818182 73.636364
3 2019 20181123 Abilene_Chr 1_2019 73 Away Pacific 224_2019 71 76.466667 78.333333
4 2019 20181124 Abilene_Chr 1_2019 60 N UC_Riverside 306_2019 48 78.060606 79.545455
merging only on Opponent_Season_Code/Team_Season_Code:
df.merge(Overall_Season_Avg
         .set_index('Team_Season_Code')
         [['OS', 'TS']].add_prefix('O'),
         left_on=['Opponent_Season_Code'],
         right_index=True, how='left'
         )
Output:
Season Date Team Team_Season_Code TS L Opponent Opponent_Season_Code OS OOS OTS
0 2019 20181109 Abilene_Chr 1_2019 94 Home Arkansas_St 15_2019 73 65.242424 70.909091
1 2019 20181115 Abilene_Chr 1_2019 67 Away Denver 82_2019 61 72.156250 74.031250
2 2019 20181122 Abilene_Chr 1_2019 72 N Elon 70_2019 56 71.818182 73.636364
3 2019 20181123 Abilene_Chr 1_2019 73 Away Pacific 224_2019 71 76.466667 78.333333
4 2019 20181124 Abilene_Chr 1_2019 60 N UC_Riverside 306_2019 48 78.060606 79.545455

df.merge(Overall_Season_Avg, on=['Team_Season_Code', 'Team'], how='left')
and then rename the columns,
or use transform instead of agg when building Overall_Season_Avg.
I haven't included the transform code because you didn't provide a reproducible example. Please post a simple, reproducible example:
https://stackoverflow.com/help/minimal-reproducible-example
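For reference, a minimal sketch of that transform idea, assuming df exactly as shown in the question (and assuming df also contains rows where the opponents appear as Team, which it must, since Overall_Season_Avg was built from it). The opponent's averages still need a mapping step on top of transform, because transform only attaches each row's own team averages:
team_avgs = df.groupby('Team_Season_Code')[['TS', 'OS']].transform('mean')
df['Team_Avg_TS'] = team_avgs['TS']  # each team's own season averages, row-aligned with df
df['Team_Avg_OS'] = team_avgs['OS']
# Map those averages onto the opponent via its season code.
avg_by_code = df.drop_duplicates('Team_Season_Code').set_index('Team_Season_Code')
df['OTS'] = df['Opponent_Season_Code'].map(avg_by_code['Team_Avg_TS'])
df['OOS'] = df['Opponent_Season_Code'].map(avg_by_code['Team_Avg_OS'])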

Related

Fill blank cells of a pandas dataframe column by matching with another dataframe column

I have a pandas dataframe, let's call it df1, that looks like this (the following is just a sample to give an idea of the dataframe):
Ac      Tp     Id       2020  2021  2022
Efecty  FC     IQ_EF    100   200   45
Asset   FC              52    48    15
Debt    P&G    IQ_DEBT  45    58    15
Tax     Other           48    45    78
And I want to fill the blank spaces in the 'Id' column using the following auxiliary dataframe, let's call it df2 (again, this is just a sample):
Ac      Tp     Id
Efecty  FC     IQ_EF
Asset   FC     IQ_AST
Debt    P&G    IQ_DEBT
Tax     Other  IQ_TAX
Income  BAL    IQ_INC
Invest  FC     IQ_INV
So that df1 ends up looking like this:
Ac      Tp     Id       2020  2021  2022
Efecty  FC     IQ_EF    100   200   45
Asset   FC     IQ_AST   52    48    15
Debt    P&G    IQ_DEBT  45    58    15
Tax     Other  IQ_TAX   48    45    78
I tried with this line of code but it did not work:
df1['Id'] = df1['Id'].mask(df1('nan')).fillna(df1['Ac'].map(df2('Ac')['Id']))
Can you guys help me?
Merge the two frames on the Ac and Tp columns and assign the Id column from the result to df1.Id. This works similarly to Excel's VLOOKUP functionality.
ac_tp = ['Ac', 'Tp']
df1['Id'] = df1[ac_tp].merge(df2[[*ac_tp, 'Id']])['Id']
In a similar vein you could also try:
df1['Id'] = (df1.merge(df2, on=['Ac', 'Tp'])
                .pipe(lambda d: d['Id_x'].mask(d['Id_x'].isnull(), d['Id_y'])))
Ac Tp Id 2020 2021 2022
0 Efecty FC IQ_EF 100 200 45
1 Asset FC IQ_AST 52 48 15
2 Debt P&G IQ_DEBT 45 58 15
3 Tax Other IQ_TAX 48 45 78
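For completeness, the map/fillna pattern the question's attempted line was reaching for could look like this. This is only a sketch, assuming the blanks in df1['Id'] are NaN and that 'Ac' alone is unique in df2, as it is in the sample:
id_by_ac = df2.set_index('Ac')['Id']                   # lookup Series: Ac -> Id
df1['Id'] = df1['Id'].fillna(df1['Ac'].map(id_by_ac))  # fill only the missing Ids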

Calculate deviation after groupby - loop of ufunc does not support argument 0

I have data about electric cars in the USA and I am trying to calculate the standard deviation for each state. I already calculated the mean this way:
import pandas as pd

df = pd.read_csv('https://gist.githubusercontent.com/AlbertKozera/6396b4333d1a9222193e11401069ed9a/raw/ab8733a2135bcf61999bbcac4f92e0de5fd56794/Pojazdy%2520elektryczne%2520w%2520USA.csv')
for col in df.columns:
    df[col] = df[col].astype(str)
df['range'] = pd.to_numeric(df['range'])
.
.
.
df_avg_range = df.drop(columns = ['state', 'brand', 'model', 'year of production', 'type']).groupby('code', as_index=False)['range'].mean()
And here is my return after that:
code range
0 AK 154.553600
1 AL 156.959936
2 AR 153.950400
3 AZ 152.756000
4 CA 152.359200
5 CO 159.084800
6 CT 155.212000
7 DE 156.322400
8 FL 153.728000
9 GA 154.748800
10 HI 154.503200
11 IA 155.746400
12 ID 157.851200
13 IL 155.200800
14 IN 153.338400
15 KS 154.240000
16 KY 154.162400
17 LA 156.728800
18 MA 134.643200
19 MD 137.080800
20 ME 142.263200
21 MI 132.828000
22 MN 135.828000
23 MO 138.376000
24 MS 132.704000
25 MT 132.552000
26 NC 133.800000
27 ND 136.096800
28 NE 137.150400
29 NH 131.498400
30 NJ 137.760800
31 NM 133.325600
32 NV 137.522400
33 NY 137.476000
34 OH 137.784800
35 OK 134.277600
36 OR 134.504000
37 PA 141.052000
38 RI 137.572000
39 SC 143.348000
40 SD 141.189600
41 TN 139.981600
42 TX 139.233600
43 UT 138.615200
44 VA 141.334400
45 VT 143.104000
46 WA 137.880800
47 WI 143.916800
48 WV 141.008000
49 WY 147.109600
Now I am trying to calculate the standard deviation in the same way:
df_dev_range = df.drop(columns = ['state', 'brand', 'model', 'year of production', 'type']).groupby('code', as_index=False)['range'].std()
And here is my error after that:
*** TypeError: loop of ufunc does not support argument 0 of type str which has no callable sqrt method
Can someone explain what I am doing wrong?
Try removing as_index=False from the groupby. With as_index kept as False, the standard deviation is applied to all remaining columns, including the groupby column.
To retain the 0-49 index, try the syntax below:
df_dev_range = df.drop(columns = ['state', 'brand', 'model', 'year of production', 'type']).groupby('code',as_index=False).agg({'range':'std'})
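Alternatively, a sketch of the first suggestion, keeping the grouping key as the index and selecting only the numeric 'range' column before calling .std() (assumes the same df as above):
df_dev_range = (df.drop(columns=['state', 'brand', 'model', 'year of production', 'type'])
                  .groupby('code')['range']
                  .std()
                  .reset_index())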

python pandas pivot table of numerical range from dataframe

Hello, I would like to make a pivot table out of a dataframe that lists companies according to their number of uploads on a website. Here is what I have:
df
Company Uploads
Nike 11
Adidas 26
Apple 55
Tesla 3
Amazon 97
Ralph Lauren 54
Tiffany 19
Walmart 77
Target 18
Facebook 48
Google 81
Desired output
Range Company Uploads
>10 Tesla 3
11-50 Adidas 26
Tiffany 19
Target 18
Nike 11
51-100 Amazon 97
Google 81
Walmart 77
Apple 55
Ralph Lauren 54
I'm thinking of adding a 'Range' column to df using np.where, then making a pivot table with pd.pivot_table or .groupby, and then using .sort_values to sort the pivot table by descending upload count.
I'm not sure whether this would work. Can anyone help me with this? I appreciate any assistance. Thanks in advance!
You can use pd.cut(), which bins numeric values, to classify each row into a range and give each bin a label.
import pandas as pd
import numpy as np
import io
data = '''
Company Uploads
Nike 11
Adidas 26
Apple 55
Tesla 3
Amazon 97
"Ralph Lauren" 54
Tiffany 19
Walmart 77
Target 18
Facebook 48
Google 81
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df['category'] = pd.cut(df['Uploads'], [0,10,50,100], labels=['>10','11-50','51-100'])
df.sort_values(['category','Uploads'], ascending=[True, True], inplace=True)
df.set_index(['category','Company'],inplace=True)
df
Uploads
category Company
>10 Tesla 3
11-50 Nike 11
Target 18
Tiffany 19
Adidas 26
Facebook 48
51-100 Ralph Lauren 54
Apple 55
Walmart 77
Google 81
Amazon 97
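If the descending Uploads order shown in the desired output matters, the same approach works with the sort direction for Uploads flipped; for example, replace the sort line above with:
df.sort_values(['category', 'Uploads'], ascending=[True, False], inplace=True)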
What you want is a MultiIndex instead of a groupby()
First create a column that bins your uploads like you proposed:
df = df.sort_values('Uploads',ascending=False)
df['Range'] = np.digitize(df['Uploads'],[0,11,51,100]) #bins <=10, 11-50, 50-100
#only handles up to 100, if there are values above 100 you need to expand that second list
Now we map the resulting values of our bin to a more descriptive string
df = df.sort_values('Range')
key_range = np.vectorize(lambda x: {1:'<10',2:'11-50',3:'51-100'}[x])
df['Range'] = key_range(df['Range'])
Create a multiIndex to get your desired df
df.set_index(['Range','Company'])
output:
Uploads
Range Company
<10 Tesla 3
11-50 Facebook 48
Adidas 26
Tiffany 19
Target 18
Nike 11
51-100 Amazon 97
Google 81
Walmart 77
Apple 55
Ralph 54

How to merge data frames in pandas that have some columns in common and some not without losing any data

For example, if I had
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4)), columns=list('ABCD'))
df1 = pd.DataFrame(np.random.randint(0,100,size=(8, 3)), columns=list('BCD'))
display(df,df1)
A B C D
0 63 16 89 55
1 17 29 81 17
2 88 82 9 64
B C D
0 21 38 36
1 54 88 80
2 44 53 53
3 24 58 29
And I would like to combine them into something like this:
     A   B   C   D
0   63  16  89  55
1   17  29  81  17
2   88  82   9  64
3  NaN  21  38  36
4  NaN  54  88  80
5  NaN  44  53  53
6  NaN  24  58  29
Is this possible? I have about 25 data frames, all organized with ascending dates as the columns and airports as the rows, containing how many times a plane has ascended at each airport on each day. To reiterate, airport names are the rows and dates are the columns. The problem is that every data frame, covering 7 days, has a different set of airport names, because some weeks some airports are inactive and some weeks they're not. For that reason it is really hard to merge them all into one dataframe, since each one has a lot of airports in common but not necessarily in the exact same position. Is there any way to merge them so that NaNs appear on the dates an airport is inactive, and so that the airport rows are never duplicated? Sorry it is so hard to explain, thank you!!
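One way to get the layout above is pd.concat, which aligns on shared labels and fills the gaps with NaN. A sketch covering both the toy example and the described airport layout; weekly_frames is a hypothetical list standing in for the ~25 weekly dataframes:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
df1 = pd.DataFrame(np.random.randint(0, 100, size=(8, 3)), columns=list('BCD'))

# Toy example: stack the frames; the missing 'A' column in df1 becomes NaN.
stacked = pd.concat([df, df1], ignore_index=True)

# For the airport data (airports as the index, dates as the columns), concatenating
# along the columns aligns airports by name and leaves NaN for inactive weeks.
# combined = pd.concat(weekly_frames, axis=1)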

Python parsing data from a website using regular expression

I'm trying to parse some data from this website:
http://www.csfbl.com/freeagents.asp?leagueid=2237
I've written some code:
import urllib.request
import re
name = re.compile('<td>(.+?)')
player_id = re.compile('<td><a href="(.+?)" onclick=')
#player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')
url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'
sock = urllib.request.urlopen(url).read().decode("utf-8")
#li = name.findall(sock)
name = name.findall(sock)
player_id = player_id.findall(sock)
#player_id_num = player_id_num.findall(sock)
#age = age.findall(sock)
stat_c = stat_c.findall(sock)
stat_p = stat_p.findall(sock)
First question: player_id returns the whole URL, "player.asp?playerid=4209661". I was unable to get just the number part. How can I do that?
(my attempt is described in #player_id_num)
Second question: I am not able to get stat_c when the span class is empty, as in "".
Is there a way to get these resolved? I am not very familiar with regular expressions; I looked up tutorials online, but it's still unclear what I am doing wrong.
Very simple using the pandas library.
Code:
import pandas as pd
url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
# print(dfs[3])
# dfs[3].to_csv("stats.csv")  # Send to a CSV file.
print(dfs[3].head())
Result:
0 1 2 3 4 5 6 7 8 9 10 \
0 Pos Name Age T PO FI CO SY HR RA GL
1 P George Pacheco 38 R 4858 7484 8090 7888 6777 4353 6979
2 P David Montoya 34 R 3944 5976 6673 8699 6267 6685 5459
3 P Robert Cole 34 R 5769 7189 7285 5863 6267 5868 5462
4 P Juanold McDonald 32 R 69100 5772 4953 4866 5976 67100 5362
11 12 13 14 15 16
0 AR EN RL Fatigue Salary NaN
1 3747 6171 -3 100% --- $3,672,000
2 5257 5975 -4 96% 2% $2,736,000
3 4953 5061 -4 96% 3% $2,401,000
4 5982 5263 -4 100% --- $1,890,000
You can apply whatever cleaning methods you want from here onwards. Code is rudimentary so it's up to you to improve it.
More Code:
import pandas as pd
import itertools
url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
df = dfs[3] # "First" stats table.
# The first row is the actual header.
# Also, notice the NaN at the end.
header = df.iloc[0][:-1].tolist()
# Fix that atrocity of a last column.
df.drop([15], axis=1, inplace=True)
# Last row is all NaNs. This particular
# table should end with Jeremy Dix.
df = df.iloc[1:-1,:]
df.columns = header
df.reset_index(drop=True, inplace=True)
# Pandas cannot create two rows without the
# dataframe turning into a nightmare. Let's
# try an aesthetic change.
sub_header = header[4:13]
orig = ["{}{}".format(h, "r") for h in sub_header]
clone = ["{}{}".format(h, "p") for h in sub_header]
# http://stackoverflow.com/a/3678930/2548721
comb = list(itertools.chain.from_iterable(zip(orig, clone)))
# Construct the new header.
new_header = header[0:4]
new_header += comb
new_header += header[13:]
# Slow but does it cleanly.
for s, o, c in zip(sub_header, orig, clone):
    df.loc[:, o] = df[s].apply(lambda x: x[:2])
    df.loc[:, c] = df[s].apply(lambda x: x[2:])
df = df[new_header] # Drop the other columns.
print(df.head())
More result:
Pos Name Age T POr POp FIr FIp COr COp ... RAp GLr \
0 P George Pacheco 38 R 48 58 74 84 80 90 ... 53 69
1 P David Montoya 34 R 39 44 59 76 66 73 ... 85 54
2 P Robert Cole 34 R 57 69 71 89 72 85 ... 68 54
3 P Juanold McDonald 32 R 69 100 57 72 49 53 ... 100 53
4 P Trevor White 37 R 61 66 62 64 67 67 ... 38 48
GLp ARr ARp ENr ENp RL Fatigue Salary
0 79 37 47 61 71 -3 100% $3,672,000
1 59 52 57 59 75 -4 96% $2,736,000
2 62 49 53 50 61 -4 96% $2,401,000
3 62 59 82 52 63 -4 100% $1,890,000
4 50 70 100 62 69 -4 100% $1,887,000
Obviously, what I did instead was separate the Real values from the Potential values. Some tricks were used, but it gets the job done, at least for the first table of players. The next few tables require a degree of manipulation.
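To address the original regex questions directly: restricting the capture group to digits pulls out just the id, and allowing the span class to be any length (including empty) makes stat_c more forgiving. This is only a sketch based on the HTML fragments quoted in the question, not verified against the live page:
import re

# Capture only the numeric playerid from the href (href pattern assumed from the question).
player_id_num = re.compile(r'<td><a href="player\.asp\?playerid=(\d+)" onclick=')

# Allow class="" (empty) or longer class names instead of exactly one character.
stat_c = re.compile(r'<td class="[^"]+" align="[^"]+"><span class="[^"]*">(.+?)</span><br><span class="[^"]*">')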
