How to create multiple triangles based on a given number of simulations? - python
Below is my code:

import chainladder as cl

triangle = cl.load_sample('genins')

# Use bootstrap sampler to get resampled triangles
bootstrapdataframe = cl.BootstrapODPSample(n_sims=4, random_state=42).fit(triangle).resampled_triangles_

# converting to dataframe
resampledtriangledf = bootstrapdataframe.to_frame()
print(resampledtriangledf)
In the above code I set n_sims (the number of simulations) to 4, so it generates the dataframe below:
0 2001 12 254,926
0 2001 24 535,877
0 2001 36 1,355,613
0 2001 48 2,034,557
0 2001 60 2,311,789
0 2001 72 2,539,807
0 2001 84 2,724,773
0 2001 96 3,187,095
0 2001 108 3,498,646
0 2001 120 3,586,037
0 2002 12 542,369
0 2002 24 1,016,927
0 2002 36 2,201,329
0 2002 48 2,923,381
0 2002 60 3,711,305
0 2002 72 3,914,829
0 2002 84 4,385,757
0 2002 96 4,596,072
0 2002 108 5,047,861
0 2003 12 235,361
0 2003 24 960,355
0 2003 36 1,661,972
0 2003 48 2,643,370
0 2003 60 3,372,684
0 2003 72 3,642,605
0 2003 84 4,160,583
0 2003 96 4,480,332
0 2004 12 764,553
0 2004 24 1,703,557
0 2004 36 2,498,418
0 2004 48 3,198,358
0 2004 60 3,524,562
0 2004 72 3,884,971
0 2004 84 4,268,241
0 2005 12 381,670
0 2005 24 1,124,054
0 2005 36 2,026,434
0 2005 48 2,863,902
0 2005 60 3,039,322
0 2005 72 3,288,253
0 2006 12 320,332
0 2006 24 1,022,323
0 2006 36 1,830,842
0 2006 48 2,676,710
0 2006 60 3,375,172
0 2007 12 330,361
0 2007 24 1,463,348
0 2007 36 2,771,839
0 2007 48 4,003,745
0 2008 12 282,143
0 2008 24 1,782,267
0 2008 36 2,898,699
0 2009 12 362,726
0 2009 24 1,277,750
0 2010 12 321,247
1 2001 12 219,021
1 2001 24 755,975
1 2001 36 1,360,298
1 2001 48 2,062,947
1 2001 60 2,356,983
1 2001 72 2,781,187
1 2001 84 2,987,837
1 2001 96 3,118,952
1 2001 108 3,307,522
1 2001 120 3,455,107
1 2002 12 302,932
1 2002 24 1,022,459
1 2002 36 1,634,938
1 2002 48 2,538,708
1 2002 60 3,005,695
1 2002 72 3,274,719
1 2002 84 3,356,499
1 2002 96 3,595,361
1 2002 108 4,100,065
1 2003 12 489,934
1 2003 24 1,233,438
1 2003 36 2,471,849
1 2003 48 3,672,629
1 2003 60 4,157,489
1 2003 72 4,498,470
1 2003 84 4,587,579
1 2003 96 4,816,232
1 2004 12 518,680
1 2004 24 1,209,705
1 2004 36 2,019,757
1 2004 48 2,997,820
1 2004 60 3,630,442
1 2004 72 3,881,093
1 2004 84 4,080,322
1 2005 12 453,963
1 2005 24 1,458,504
1 2005 36 2,036,506
1 2005 48 2,846,464
1 2005 60 3,280,124
1 2005 72 3,544,597
1 2006 12 369,755
1 2006 24 1,209,117
1 2006 36 1,973,136
1 2006 48 3,034,294
1 2006 60 3,537,784
1 2007 12 477,788
1 2007 24 1,524,537
1 2007 36 2,170,391
1 2007 48 3,355,093
1 2008 12 250,690
1 2008 24 1,546,986
1 2008 36 2,996,737
1 2009 12 271,270
1 2009 24 1,446,353
1 2010 12 510,114
2 2001 12 170,866
2 2001 24 797,338
2 2001 36 1,663,610
2 2001 48 2,293,697
2 2001 60 2,607,067
2 2001 72 2,979,479
2 2001 84 3,127,308
2 2001 96 3,285,338
2 2001 108 3,574,272
2 2001 120 3,630,610
2 2002 12 259,060
2 2002 24 1,011,092
2 2002 36 1,851,504
2 2002 48 2,705,313
2 2002 60 3,195,774
2 2002 72 3,766,008
2 2002 84 3,944,417
2 2002 96 4,234,043
2 2002 108 4,763,664
2 2003 12 239,981
2 2003 24 983,484
2 2003 36 1,929,785
2 2003 48 2,497,929
2 2003 60 2,972,887
2 2003 72 3,313,868
2 2003 84 3,727,432
2 2003 96 4,024,122
2 2004 12 77,522
2 2004 24 729,401
2 2004 36 1,473,914
2 2004 48 2,376,313
2 2004 60 2,999,197
2 2004 72 3,372,020
2 2004 84 3,887,883
2 2005 12 321,598
2 2005 24 1,132,502
2 2005 36 1,710,504
2 2005 48 2,438,620
2 2005 60 2,801,957
2 2005 72 3,182,466
2 2006 12 255,407
2 2006 24 1,275,141
2 2006 36 2,083,421
2 2006 48 3,144,579
2 2006 60 3,891,772
2 2007 12 338,120
2 2007 24 1,275,697
2 2007 36 2,238,715
2 2007 48 3,615,323
2 2008 12 310,214
2 2008 24 1,237,156
2 2008 36 2,563,326
2 2009 12 271,093
2 2009 24 1,523,131
2 2010 12 430,591
3 2001 12 330,887
3 2001 24 831,193
3 2001 36 1,601,374
3 2001 48 2,188,879
3 2001 60 2,662,773
3 2001 72 3,086,976
3 2001 84 3,332,247
3 2001 96 3,317,279
3 2001 108 3,576,659
3 2001 120 3,613,563
3 2002 12 358,263
3 2002 24 1,139,259
3 2002 36 2,236,375
3 2002 48 3,163,464
3 2002 60 3,715,130
3 2002 72 4,295,638
3 2002 84 4,502,105
3 2002 96 4,769,139
3 2002 108 5,323,304
3 2003 12 489,934
3 2003 24 1,570,352
3 2003 36 3,123,215
3 2003 48 4,189,299
3 2003 60 4,819,070
3 2003 72 5,306,689
3 2003 84 5,560,371
3 2003 96 5,827,003
3 2004 12 419,727
3 2004 24 1,308,884
3 2004 36 2,118,936
3 2004 48 2,906,732
3 2004 60 3,561,577
3 2004 72 3,934,400
3 2004 84 4,010,511
3 2005 12 389,217
3 2005 24 1,173,226
3 2005 36 1,794,216
3 2005 48 2,528,910
3 2005 60 3,474,035
3 2005 72 3,908,999
3 2006 12 291,940
3 2006 24 1,136,674
3 2006 36 1,915,614
3 2006 48 2,693,930
3 2006 60 3,375,601
3 2007 12 506,055
3 2007 24 1,684,660
3 2007 36 2,678,739
3 2007 48 3,545,156
3 2008 12 282,143
3 2008 24 1,536,490
3 2008 36 2,458,789
3 2009 12 271,093
3 2009 24 1,199,897
3 2010 12 266,359
Using the above dataframe I have to create 4 triangles based on the Total column:
For example:
Row Labels 12 24 36 48 60 72 84 96 108 120 Grand Total
2001 254,926 535,877 1,355,613 2,034,557 2,311,789 2,539,807 2,724,773 3,187,095 3,498,646 3,586,037 22,029,119
2002 542,369 1,016,927 2,201,329 2,923,381 3,711,305 3,914,829 4,385,757 4,596,072 5,047,861 28,339,832
2003 235,361 960,355 1,661,972 2,643,370 3,372,684 3,642,605 4,160,583 4,480,332 21,157,261
2004 764,553 1,703,557 2,498,418 3,198,358 3,524,562 3,884,971 4,268,241 19,842,659
2005 381,670 1,124,054 2,026,434 2,863,902 3,039,322 3,288,253 12,723,635
2006 320,332 1,022,323 1,830,842 2,676,710 3,375,172 9,225,377
2007 330,361 1,463,348 2,771,839 4,003,745 8,569,294
2008 282,143 1,782,267 2,898,699 4,963,110
2009 362,726 1,277,750 1,640,475
2010 321,247 321,247
Grand Total 3,795,687 10,886,456 17,245,147 20,344,022 19,334,833 17,270,466 15,539,355 12,263,499 8,546,507 3,586,037 128,812,009
...

Like this I need 4 triangles (4 is the number of simulations) using the 1st dataframe. If the user gives n_sims=900 then the dataframe contains 900 simulations, and from those we have to create 900 triangles. In the above example I displayed only the triangle for simulation 0, but I need the triangles for 1, 2 and 3 also.
Try:
df['sample_size'] = pd.to_numeric(df['sample_size'].str.replace(',',''))
df.pivot_table('sample_size','year', 'no', aggfunc='first')\
.pipe(lambda x: pd.concat([x,x.sum().to_frame('Grand Total').T]))
Output:
no 12 24 36 48 60 72 84 96 108 120
2001 254926.0 535877.0 1355613.0 2034557.0 2311789.0 2539807.0 2724773.0 3187095.0 3498646.0 3586037.0
2002 542369.0 1016927.0 2201329.0 2923381.0 3711305.0 3914829.0 4385757.0 4596072.0 5047861.0 NaN
2003 235361.0 960355.0 1661972.0 2643370.0 3372684.0 3642605.0 4160583.0 4480332.0 NaN NaN
2004 764553.0 1703557.0 2498418.0 3198358.0 3524562.0 3884971.0 4268241.0 NaN NaN NaN
2005 381670.0 1124054.0 2026434.0 2863902.0 3039322.0 3288253.0 NaN NaN NaN NaN
2006 320332.0 1022323.0 1830842.0 2676710.0 3375172.0 NaN NaN NaN NaN NaN
2007 330361.0 1463348.0 2771839.0 4003745.0 NaN NaN NaN NaN NaN NaN
2008 282143.0 1782267.0 2898699.0 NaN NaN NaN NaN NaN NaN NaN
2009 362726.0 1277750.0 NaN NaN NaN NaN NaN NaN NaN NaN
2010 321247.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Grand Total 3795688.0 10886458.0 17245146.0 20344023.0 19334834.0 17270465.0 15539354.0 12263499.0 8546507.0 3586037.0
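Note that the pivot above only ever yields one triangle: aggfunc='first' keeps the first value it sees for each year/age cell, i.e. simulation 0's. To get one triangle per simulation (4 here, or 900 for n_sims=900), group on the simulation column before pivoting. A hedged sketch, assuming the frame has been flattened to plain columns named 'sim', 'year', 'no' and 'sample_size' (rename 'sim' to match whatever your first column is actually called); it also rebuilds the Grand Total row and column from the example triangle:

import pandas as pd

triangles = {}
for sim, grp in df.groupby('sim'):
    tri = grp.pivot_table('sample_size', 'year', 'no', aggfunc='first')
    tri['Grand Total'] = tri.sum(axis=1)      # row totals per origin year
    tri.loc['Grand Total'] = tri.sum()        # column totals (incl. overall total)
    triangles[sim] = tri

print(triangles[0])   # triangle for simulation 0; the loop works unchanged for n_sims=900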
Related
In Selenium see if the 'a' anchor tag contains the text I want, and then extract multiple td's of text in the same row
For my python code, I have been trying to scrape data from NCAAF Stats. I have been having issues extracting the td's text after I evaluate whether the anchor tag 'a' contains the text I want. I want to be able to find the team's number of TDs, points, and PPG. I have been able to successfully find the school by text in Selenium, but after that I am unable to extract the info I want. Here is what I have coded so far:

from selenium import webdriver

driver = webdriver.Chrome('C:\\Users\\Carl\\Downloads\\chromedriver.exe')
driver.get('https://www.ncaa.com/stats/football/fbs/current/team/27')

# I plan to make a while or for loop later, that is why I used f-strings
team = "Coastal Carolina"
first = driver.find_element_by_xpath(f'//a[text()="{team}"]')

# This was the way another similarly asked question was answered but did not work
#tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text

# This grabs data from the very first row of data... not the one I want
tds = first.find_element_by_xpath('//following-sibling::td[4]').text
total_points = first.find_element_by_xpath('//following-sibling::td[10]').text
ppg = first.find_element_by_xpath('//following-sibling::td[11]').text
print(tds, total_points, ppg)

driver.quit()

I have tried to look around for a similarly asked question and was able to find this snippet:

tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text

It unfortunately did not help me out much. The html structure looks like this. I appreciate any help, and thank you!
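A side note on why the first row keeps coming back: an XPath beginning with // searches the whole document even when called on an element. Prefixing it with . makes it relative to first, and since the anchor sits inside a td you also need to step up to that cell first. A hedged sketch (the ancestor step and sibling offset are assumptions about this table's markup and may need adjusting):

tds = first.find_element_by_xpath('./ancestor::td/following-sibling::td[3]').text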
No need to use Selenium, the page isn't dynamic. Just use pandas to parse the table for you:

import pandas as pd

url = 'https://www.ncaa.com/stats/football/fbs/current/team/27'
df = pd.read_html(url)[0]

Output:

print(df)

   Rank                 Team  G  TDs  PAT  2PT  Def Pts  FG  Saf    Pts   PPG
0     1             Ohio St.  6   39   39    0        0   6    0  291.0  48.5
1     2           Pittsburgh  6   40   36    0        0   4    1  290.0  48.3
2     3     Coastal Carolina  7   43   42    0        0   6    1  320.0  45.7
3     4              Alabama  7   41   40    1        0   9    0  315.0  45.0
4     5             Ole Miss  6   35   30    1        0   6    1  262.0  43.7
5     6           Cincinnati  6   36   34    1        0   3    0  261.0  43.5
6     7             Oklahoma  7   35   34    1        1  17    0  299.0  42.7
7     -                  SMU  7   40   36    1        0   7    0  299.0  42.7
8     9                Texas  7   38   37    0        0   8    1  291.0  41.6
9    10          Western Ky.  6   31   27    1        0  10    0  245.0  40.8
10   11            Tennessee  7   36   36    0        0   7    1  275.0  39.3
11   12          Wake Forest  6   28   24    2        0  12    0  232.0  38.7
12   13                 UTSA  7   33   33    0        0  13    0  270.0  38.6
13   14             Michigan  6   28   25    1        0  12    0  231.0  38.5
14   15              Georgia  7   34   33    0        0  10    1  269.0  38.4
15   16               Baylor  7   35   35    0        0   7    1  268.0  38.3
16   17              Houston  6   30   28    0        0   5    0  223.0  37.2
17    -                  TCU  6   29   28    0        0   7    0  223.0  37.2
18   19             Marshall  7   34   33    0        0   7    0  258.0  36.9
19    -       North Carolina  7   34   32    2        0   6    0  258.0  36.9
20   21               Nevada  6   26   24    1        0  12    0  218.0  36.3
21   22             Virginia  7   31   29    2        0  10    2  253.0  36.1
22   23           Fresno St.  7   32   27    1        0  10    0  251.0  35.9
23    -              Memphis  7   33   26    3        0   7    0  251.0  35.9
24   25           Texas Tech  7   32   31    0        0   9    0  250.0  35.7
25   26               Auburn  7   29   28    1        0  12    1  242.0  34.6
26   27              Florida  7   33   29    1        0   4    0  241.0  34.4
27    -             Missouri  7   31   31    0        0   8    0  241.0  34.4
28   29              Liberty  7   33   29    1        0   3    1  240.0  34.3
29    -         Michigan St.  7   30   30    0        0  10    0  240.0  34.3
30   31                  UCF  6   28   26    0        0   3    1  205.0  34.2
31   32           Oregon St.  6   27   27    0        0   5    0  204.0  34.0
32   33               Oregon  6   26   26    0        0   7    0  203.0  33.8
33   34             Iowa St.  6   23   22    0        0  14    0  202.0  33.7
34   35                 UCLA  7   30   28    0        0   9    0  235.0  33.6
35   36        San Diego St.  6   25   24    1        0   7    0  197.0  32.8
36   37                  LSU  7   29   29    0        0   8    0  227.0  32.4
37   38           Louisville  6   24   23    0        0   9    0  194.0  32.3
38    -           Miami (FL)  6   24   22    1        0   8    1  194.0  32.3
39    -             NC State  6   25   24    0        0   6    1  194.0  32.3
40   41  Southern California  6   22   19    3        0  12    0  193.0  32.2
41   42               Tulane  7   31   23    4        0   2    0  223.0  31.9
42   43          Arizona St.  7   30   25    2        0   4    0  221.0  31.6
43   44                 Utah  6   25   22    1        0   5    0  189.0  31.5
44   45            Air Force  7   29   27    1        0   5    1  220.0  31.4
45   46            App State  7   27   24    0        0  11    0  219.0  31.3
46   47             Arkansas  7   27   25    0        0  10    0  217.0  31.0
47    -      Army West Point  6   25   22    0        0   4    1  186.0  31.0
48    -           Notre Dame  6   23   20    2        0   8    0  186.0  31.0
49    -        Western Mich.  7   28   25    0        0   8    0  217.0  31.0
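To pull a single team's numbers out of the parsed frame, plain boolean indexing is enough; the column names ('Team', 'TDs', 'Pts', 'PPG') are read off the printed output above:

row = df[df['Team'] == 'Coastal Carolina']
print(row[['TDs', 'Pts', 'PPG']])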
How can I fix this BeautifulSoup website scrape for NHL Reference?
I'm scraping National Hockey League (NHL) data for multiple seasons from this URL: https://www.hockey-reference.com/leagues/NHL_2018_skaters.html

I'm only getting a few instances here and have tried moving my dict statements throughout the for loops. I've also tried utilizing solutions I found on other posts with no luck. Any help is appreciated. Thank you!

import requests
from bs4 import BeautifulSoup
import pandas as pd

dict = {}
for i in range(2010, 2020):
    year = str(i)
    source = requests.get('https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html').text
    soup = BeautifulSoup(source, features='lxml')

    # identifying table in html
    table = soup.find('table', id="stats")

    # grabbing <tr> tags in html
    rows = table.findAll("tr")

    # creating passable values for each "stat" in td tag
    data_stats = [
        "player", "age", "team_id", "pos", "games_played", "goals", "assists",
        "points", "plus_minus", "pen_min", "ps", "goals_ev", "goals_pp",
        "goals_sh", "goals_gw", "assists_ev", "assists_pp", "assists_sh",
        "shots", "shot_pct", "time_on_ice", "time_on_ice_avg", "blocks",
        "hits", "faceoff_wins", "faceoff_losses", "faceoff_percentage"
    ]

    for rownum in rows:
        # grabbing player name and using as key
        filter = {"data-stat": 'player'}
        cell = rows[3].findAll("td", filter)
        nameval = cell[0].string
        list = []
        for data in data_stats:
            # iterating through data_stat to grab values
            filter = {"data-stat": data}
            cell = rows[3].findAll("td", filter)
            value = cell[0].string
            list.append(value)
        dict[nameval] = list
        dict[nameval].append(year)

# conversion to numeric values and creating dataframe
columns = [
    "player", "age", "team_id", "pos", "games_played", "goals", "assists",
    "points", "plus_minus", "pen_min", "ps", "goals_ev", "goals_pp",
    "goals_sh", "goals_gw", "assists_ev", "assists_pp", "assists_sh",
    "shots", "shot_pct", "time_on_ice", "time_on_ice_avg", "blocks",
    "hits", "faceoff_wins", "faceoff_losses", "faceoff_percentage", "year"
]
df = pd.DataFrame.from_dict(dict, orient='index', columns=columns)
cols = df.columns.drop(['player', 'team_id', 'pos', 'year'])
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
print(df)

Output

Craig Adams        Craig Adams        32  ...  43.9  2010
Luke Adam          Luke Adam          22  ... 100.0  2013
Justin Abdelkader  Justin Abdelkader  29  ...  29.4  2017
Will Acton         Will Acton         27  ...  50.0  2015
Noel Acciari       Noel Acciari       24  ...  44.1  2016
Pontus Aberg       Pontus Aberg       25  ...  10.5  2019

[6 rows x 28 columns]
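Worth flagging before the answers: the inner lookups above always read rows[3] rather than the loop variable rownum, which is why only a handful of players survive. A minimal hedged rewrite of that loop (player_rows stands in for the shadowed built-in name dict, and it assumes, as the original does, that every stat cell is present on a player row):

player_rows = {}
for rownum in rows:
    cell = rownum.findAll("td", {"data-stat": "player"})
    if not cell:                      # header/separator rows have no <td> cells
        continue
    nameval = cell[0].string
    values = [rownum.findAll("td", {"data-stat": stat})[0].string
              for stat in data_stats]
    values.append(year)
    player_rows[nameval] = values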
I'd just use pandas' .read_html(), it does the hard work of parsing tables for you (uses BeautifulSoup under the hood).

Code:

import pandas as pd

result = pd.DataFrame()
for i in range(2010, 2020):
    print(i)
    year = str(i)
    url = 'https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html'
    #source = requests.get('https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html').text
    df = pd.read_html(url, header=1)[0]
    df['year'] = year
    result = result.append(df, sort=False)

result = result[~result['Age'].str.contains("Age")]
result = result.reset_index(drop=True)

You can then save to file with result.to_csv('filename.csv', index=False)

Output:

print(result)

          Rk               Player Age   Tm Pos  GP  ...  BLK  HIT  FOW  FOL   FO%  year
0          1    Justin Abdelkader  22  DET  LW  50  ...   20  152  148  170  46.5  2010
1          2          Craig Adams  32  PIT  RW  82  ...   58  193  243  311  43.9  2010
2          3     Maxim Afinogenov  30  ATL  RW  82  ...   21   32    1    2  33.3  2010
3          4       Andrew Alberts  28  TOT   D  76  ...   88  216    0    1   0.0  2010
4          4       Andrew Alberts  28  CAR   D  62  ...   67  172    0    0   NaN  2010
5          4       Andrew Alberts  28  VAN   D  14  ...   21   44    0    1   0.0  2010
6          5    Daniel Alfredsson  37  OTT  RW  70  ...   36   41   14   25  35.9  2010
7          6          Bryan Allen  29  FLA   D  74  ...  137  120    0    0   NaN  2010
8          7          Cody Almond  20  MIN   C   7  ...    5    7   18   12  60.0  2010
9          8          Karl Alzner  21  WSH   D  21  ...   21   15    0    0   NaN  2010
10         9       Artem Anisimov  21  NYR   C  82  ...   41   45  310  380  44.9  2010
11        10         Nik Antropov  29  ATL   C  76  ...   35   82  481  627  43.4  2010
12        11      Colby Armstrong  27  ATL  RW  79  ...   29   74   10   10  50.0  2010
13        12      Derek Armstrong  36  STL   C   6  ...    0    4    7    8  46.7  2010
14        13         Jason Arnott  35  NSH   C  63  ...   17   24  526  551  48.8  2010
15        14          Dean Arsene  29  EDM   D  13  ...   13   18    0    0   NaN  2010
16        15     Evgeny Artyukhin  26  TOT  RW  54  ...   10  127    1    1  50.0  2010
17        15     Evgeny Artyukhin  26  ANA  RW  37  ...    8   90    0    1   0.0  2010
18        15     Evgeny Artyukhin  26  ATL  RW  17  ...    2   37    1    0 100.0  2010
19        16          Arron Asham  31  PHI  RW  72  ...   16   92    2   11  15.4  2010
20        17        Adrian Aucoin  36  PHX   D  82  ...   67  131    1    0 100.0  2010
21        18         Keith Aucoin  31  WSH   C   9  ...    0    2   31   25  55.4  2010
22        19           Sean Avery  29  NYR   C  69  ...   17  145    4   10  28.6  2010
23        20         David Backes  25  STL  RW  79  ...   60  266  504  561  47.3  2010
24        21      Mikael Backlund  20  CGY   C  23  ...    4   12  100   86  53.8  2010
25        22    Nicklas Backstrom  22  WSH   C  82  ...   61   90  657  660  49.9  2010
26        23          Josh Bailey  20  NYI   C  73  ...   36   67  171  255  40.1  2010
27        24        Keith Ballard  27  FLA   D  82  ...  201  156    0    0   NaN  2010
28        25           Krys Barch  29  DAL  RW  63  ...   13  120    0    3   0.0  2010
29        26           Cam Barker  23  TOT   D  70  ...   53   75    0    0   NaN  2010
...      ...                  ...  ..  ...  ..  ..  ...  ...  ...  ...  ...   ...   ...
10251    885        Chris Wideman  29  TOT   D  25  ...   26   35    0    0   NaN  2019
10252    885        Chris Wideman  29  OTT   D  19  ...   25   26    0    0   NaN  2019
10253    885        Chris Wideman  29  EDM   D   5  ...    1    7    0    0   NaN  2019
10254    885        Chris Wideman  29  FLA   D   1  ...    0    2    0    0   NaN  2019
10255    886      Justin Williams  37  CAR  RW  82  ...   32   55   92  150  38.0  2019
10256    887         Colin Wilson  29  COL   C  65  ...   31   55   20   32  38.5  2019
10257    888       Garrett Wilson  27  PIT  LW  50  ...   16  114    3    4  42.9  2019
10258    889         Scott Wilson  26  BUF   C  15  ...    2   29    1    2  33.3  2019
10259    890           Tom Wilson  24  WSH  RW  63  ...   52  200   29   24  54.7  2019
10260    891       Luke Witkowski  28  DET   D  34  ...   27   67    0    0   NaN  2019
10261    892    Christian Wolanin  23  OTT   D  30  ...   31   11    0    0   NaN  2019
10262    893           Miles Wood  23  NJD  LW  63  ...   27   97    0    2   0.0  2019
10263    894        Egor Yakovlev  27  NJD   D  25  ...   22   12    0    0   NaN  2019
10264    895      Kailer Yamamoto  20  EDM  RW  17  ...   11   18    0    0   NaN  2019
10265    896         Keith Yandle  32  FLA   D  82  ...   76   47    0    0   NaN  2019
10266    897          Pavel Zacha  21  NJD   C  61  ...   24   68  348  364  48.9  2019
10267    898         Filip Zadina  19  DET  RW   9  ...    3    6    3    3  50.0  2019
10268    899       Nikita Zadorov  23  COL   D  70  ...   67  228    0    0   NaN  2019
10269    900       Nikita Zaitsev  27  TOR   D  81  ...  151  139    0    0   NaN  2019
10270    901         Travis Zajac  33  NJD   C  80  ...   38   66  841  605  58.2  2019
10271    902         Jakub Zboril  21  BOS   D   2  ...    0    3    0    0   NaN  2019
10272    903       Mika Zibanejad  25  NYR   C  82  ...   66  134  830  842  49.6  2019
10273    904      Mats Zuccarello  31  TOT  LW  48  ...   43   57   10   20  33.3  2019
10274    904      Mats Zuccarello  31  NYR  LW  46  ...   42   57   10   20  33.3  2019
10275    904      Mats Zuccarello  31  DAL  LW   2  ...    1    0    0    0   NaN  2019
10276    905         Jason Zucker  27  MIN  LW  81  ...   38   87    2   11  15.4  2019
10277    906       Valentin Zykov  23  TOT  LW  28  ...    6   26    2    7  22.2  2019
10278    906       Valentin Zykov  23  CAR  LW  13  ...    2    6    2    6  25.0  2019
10279    906       Valentin Zykov  23  VEG  LW  10  ...    3   18    0    1   0.0  2019
10280    906       Valentin Zykov  23  EDM  LW   5  ...    1    2    0    0   NaN  2019

[10281 rows x 29 columns]
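One caveat if you run this on a current pandas: DataFrame.append was removed in pandas 2.0, so the accumulation step needs pd.concat instead. A sketch of the equivalent loop:

import pandas as pd

frames = []
for i in range(2010, 2020):
    url = f'https://www.hockey-reference.com/leagues/NHL_{i}_skaters.html'
    df = pd.read_html(url, header=1)[0]
    df['year'] = i
    frames.append(df)

# concatenate once at the end; filter out the repeated in-table header rows
result = pd.concat(frames, ignore_index=True)
result = result[~result['Age'].astype(str).str.contains("Age")]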
Scraping heavily formatted tables is positively painful with Beautiful Soup (not to bash on Beautiful Soup, it's wonderful for several use cases). There's a bit of a 'hack' I use for scraping data surrounded with dense markup, if you're willing to be a bit utilitarian about it:

1. Select the entire table on the web page
2. Copy + paste into Evernote (simplifies and reformats the HTML)
3. Copy + paste from Evernote to Excel or another spreadsheet software (removes the HTML)
4. Save as .csv

It isn't perfect. There will be blank lines in the CSV, but blank lines are easier and far less time-consuming to remove than such data is to scrape. Good luck!

As reference, I've linked my own conversions: Parsed to Evernote, Parsed to Excel.
Dataframe: How to transpose/merge sparse columns as rows
I have a Dataframe as below.

data = {'Year': ["2012", "2013", "2014", "2015", "2016"],
        'Matthew': [80, 83, 85, 90, 91],
        'Aakash': [85, 75, 95, 92, 93],
        'Jill': [90, 70, 100, 80, 85]}
df = pd.DataFrame(data)

   Year  Matthew  Aakash  Jill
0  2012       80      85    90
1  2013       83      75    70
2  2014       85      95   100
3  2015       90      92    80
4  2016       91      93    85

How do I transform it to the below?

Expected Result:

data2 = {'Year': ["2012","2012","2012","2013","2013","2013","2014","2014","2014","2015","2015","2015","2016","2016","2016"],
         'Name': ['Matthew','Aakash','Jill','Matthew','Aakash','Jill','Matthew','Aakash','Jill','Matthew','Aakash','Jill','Matthew','Aakash','Jill'],
         'Results': [80,85,90,83,75,70,85,95,100,90,92,80,91,93,85]}
df2 = pd.DataFrame(data2)

    Year     Name  Results
0   2012  Matthew       80
1   2012   Aakash       85
2   2012     Jill       90
3   2013  Matthew       83
4   2013   Aakash       75
5   2013     Jill       70
6   2014  Matthew       85
7   2014   Aakash       95
8   2014     Jill      100
9   2015  Matthew       90
10  2015   Aakash       92
11  2015     Jill       80
12  2016  Matthew       91
13  2016   Aakash       93
14  2016     Jill       85
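This shape is exactly what DataFrame.melt produces; a minimal sketch (the stable sort keeps the original column order, Matthew/Aakash/Jill, within each year):

df2 = (df.melt(id_vars='Year', var_name='Name', value_name='Results')
         .sort_values('Year', kind='stable', ignore_index=True))
print(df2)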
How to find ChangeCol1/ChangeCol2 and %ChangeCol1/%ChangeCol2 of DF
I have data that looks like this.

    Year  Quarter  Quantity  Price  TotalRevenue
0   2000        1        23    142          3266
1   2000        2        23    144          3312
2   2000        3        23    147          3381
3   2000        4        23    151          3473
4   2001        1        22    160          3520
5   2001        2        22    183          4026
6   2001        3        22    186          4092
7   2001        4        22    186          4092
8   2002        1        21    212          4452
9   2002        2        19    232          4408
10  2002        3        19    223          4237

I'm trying to figure out how to get the 'MarginalRevenue', where:

MR = (ΔTR/ΔQ)
MarginalRevenue = (Change in TotalRevenue) / (Change in Quantity)

I found df.pct_change(), but that seems to get the percentage change for everything. Also, I'm trying to figure out how to get something related:

ElasticityPrice = (%ΔQuantity / %ΔPrice)
Do you mean something like this?

df['MarginalRevenue'] = df['TotalRevenue'].pct_change() / df['Quantity'].pct_change()

or

df['MarginalRevenue'] = df['TotalRevenue'].diff() / df['Quantity'].diff()
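The elasticity half of the question follows the same pattern, using the %Δ formula given above:

df['ElasticityPrice'] = df['Quantity'].pct_change() / df['Price'].pct_change()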
Pandas Melt with Multiple Value Vars
I have a data set which is in wide format like this

Index  Country    Variable  2000  2001  2002  2003  2004  2005
0      Argentina  var1        12    15    18    17    23    29
1      Argentina  var2         1     3     2     5     7     5
2      Brazil     var1        20    23    25    29    31    32
3      Brazil     var2         0     1     2     2     3     3

I want to reshape my data to long so that year, var1, and var2 become new columns

Index  Country    year  var1  var2
0      Argentina  2000    12     1
1      Argentina  2001    15     3
2      Argentina  2002    18     2
....
6      Brazil     2000    20     0
7      Brazil     2001    23     1

I got my code to work when I only had one variable by writing

df = (pd.melt(df, id_vars='Country', value_name='Var1', var_name='year'))

I can't figure out how to do this for var1, var2, var3, etc.
Instead of melt, you can use a combination of stack and unstack:

(df.set_index(['Country', 'Variable'])
   .rename_axis(['Year'], axis=1)
   .stack()
   .unstack('Variable')
   .reset_index())

Variable    Country  Year  var1  var2
0         Argentina  2000    12     1
1         Argentina  2001    15     3
2         Argentina  2002    18     2
3         Argentina  2003    17     5
4         Argentina  2004    23     7
5         Argentina  2005    29     5
6            Brazil  2000    20     0
7            Brazil  2001    23     1
8            Brazil  2002    25     2
9            Brazil  2003    29     2
10           Brazil  2004    31     3
11           Brazil  2005    32     3
Option 1

Using melt then unstack for var1, var2, etc...

(df1.melt(id_vars=['Country','Variable'], var_name='Year')
    .set_index(['Country','Year','Variable'])
    .squeeze()
    .unstack()
    .reset_index())

Output:

Variable    Country  Year  var1  var2
0         Argentina  2000    12     1
1         Argentina  2001    15     3
2         Argentina  2002    18     2
3         Argentina  2003    17     5
4         Argentina  2004    23     7
5         Argentina  2005    29     5
6            Brazil  2000    20     0
7            Brazil  2001    23     1
8            Brazil  2002    25     2
9            Brazil  2003    29     2
10           Brazil  2004    31     3
11           Brazil  2005    32     3

Option 2

Using pivot then stack:

(df1.pivot(index='Country', columns='Variable')
    .stack(0)
    .rename_axis(['Country','Year'])
    .reset_index())

Output:

Variable    Country  Year  var1  var2
0         Argentina  2000    12     1
1         Argentina  2001    15     3
2         Argentina  2002    18     2
3         Argentina  2003    17     5
4         Argentina  2004    23     7
5         Argentina  2005    29     5
6            Brazil  2000    20     0
7            Brazil  2001    23     1
8            Brazil  2002    25     2
9            Brazil  2003    29     2
10           Brazil  2004    31     3
11           Brazil  2005    32     3

Option 3 (ayhan's solution)

Using set_index, stack, and unstack:

(df.set_index(['Country', 'Variable'])
   .rename_axis(['Year'], axis=1)
   .stack()
   .unstack('Variable')
   .reset_index())

Output:

Variable    Country  Year  var1  var2
0         Argentina  2000    12     1
1         Argentina  2001    15     3
2         Argentina  2002    18     2
3         Argentina  2003    17     5
4         Argentina  2004    23     7
5         Argentina  2005    29     5
6            Brazil  2000    20     0
7            Brazil  2001    23     1
8            Brazil  2002    25     2
9            Brazil  2003    29     2
10           Brazil  2004    31     3
11           Brazil  2005    32     3
numpy

years = df.drop(['Country', 'Variable'], 1)
y = years.values
m = y.shape[1]

c = df.Country.values
v = df.Variable.values

f0, u0 = pd.factorize(df.Country.values)
f1, u1 = pd.factorize(df.Variable.values)

w = np.empty((u1.size, u0.size, m), dtype=y.dtype)
w[f1, f0] = y

results = pd.DataFrame(dict(
    Country=u0.repeat(m),
    Year=np.tile(years.columns.values, u0.size),
)).join(pd.DataFrame(w.reshape(-1, m * u1.size).T, columns=u1))

results

      Country  Year  var1  var2
0   Argentina  2000    12     1
1   Argentina  2001    15     3
2   Argentina  2002    18     2
3   Argentina  2003    17     5
4   Argentina  2004    23     7
5   Argentina  2005    29     5
6      Brazil  2000    20     0
7      Brazil  2001    23     1
8      Brazil  2002    25     2
9      Brazil  2003    29     2
10     Brazil  2004    31     3
11     Brazil  2005    32     3