How to create multiple triangles based on a given number of simulations? - python
Below is my code:

import chainladder as cl

triangle = cl.load_sample('genins')

# Use bootstrap sampler to get resampled triangles
bootstrapdataframe = cl.BootstrapODPSample(n_sims=4, random_state=42).fit(triangle).resampled_triangles_

# converting to dataframe
resampledtriangledf = bootstrapdataframe.to_frame()
print(resampledtriangledf)
In the above code I set n_sims (the number of simulations) to 4, so it generates the dataframe below:
0 2001 12 254,926
0 2001 24 535,877
0 2001 36 1,355,613
0 2001 48 2,034,557
0 2001 60 2,311,789
0 2001 72 2,539,807
0 2001 84 2,724,773
0 2001 96 3,187,095
0 2001 108 3,498,646
0 2001 120 3,586,037
0 2002 12 542,369
0 2002 24 1,016,927
0 2002 36 2,201,329
0 2002 48 2,923,381
0 2002 60 3,711,305
0 2002 72 3,914,829
0 2002 84 4,385,757
0 2002 96 4,596,072
0 2002 108 5,047,861
0 2003 12 235,361
0 2003 24 960,355
0 2003 36 1,661,972
0 2003 48 2,643,370
0 2003 60 3,372,684
0 2003 72 3,642,605
0 2003 84 4,160,583
0 2003 96 4,480,332
0 2004 12 764,553
0 2004 24 1,703,557
0 2004 36 2,498,418
0 2004 48 3,198,358
0 2004 60 3,524,562
0 2004 72 3,884,971
0 2004 84 4,268,241
0 2005 12 381,670
0 2005 24 1,124,054
0 2005 36 2,026,434
0 2005 48 2,863,902
0 2005 60 3,039,322
0 2005 72 3,288,253
0 2006 12 320,332
0 2006 24 1,022,323
0 2006 36 1,830,842
0 2006 48 2,676,710
0 2006 60 3,375,172
0 2007 12 330,361
0 2007 24 1,463,348
0 2007 36 2,771,839
0 2007 48 4,003,745
0 2008 12 282,143
0 2008 24 1,782,267
0 2008 36 2,898,699
0 2009 12 362,726
0 2009 24 1,277,750
0 2010 12 321,247
1 2001 12 219,021
1 2001 24 755,975
1 2001 36 1,360,298
1 2001 48 2,062,947
1 2001 60 2,356,983
1 2001 72 2,781,187
1 2001 84 2,987,837
1 2001 96 3,118,952
1 2001 108 3,307,522
1 2001 120 3,455,107
1 2002 12 302,932
1 2002 24 1,022,459
1 2002 36 1,634,938
1 2002 48 2,538,708
1 2002 60 3,005,695
1 2002 72 3,274,719
1 2002 84 3,356,499
1 2002 96 3,595,361
1 2002 108 4,100,065
1 2003 12 489,934
1 2003 24 1,233,438
1 2003 36 2,471,849
1 2003 48 3,672,629
1 2003 60 4,157,489
1 2003 72 4,498,470
1 2003 84 4,587,579
1 2003 96 4,816,232
1 2004 12 518,680
1 2004 24 1,209,705
1 2004 36 2,019,757
1 2004 48 2,997,820
1 2004 60 3,630,442
1 2004 72 3,881,093
1 2004 84 4,080,322
1 2005 12 453,963
1 2005 24 1,458,504
1 2005 36 2,036,506
1 2005 48 2,846,464
1 2005 60 3,280,124
1 2005 72 3,544,597
1 2006 12 369,755
1 2006 24 1,209,117
1 2006 36 1,973,136
1 2006 48 3,034,294
1 2006 60 3,537,784
1 2007 12 477,788
1 2007 24 1,524,537
1 2007 36 2,170,391
1 2007 48 3,355,093
1 2008 12 250,690
1 2008 24 1,546,986
1 2008 36 2,996,737
1 2009 12 271,270
1 2009 24 1,446,353
1 2010 12 510,114
2 2001 12 170,866
2 2001 24 797,338
2 2001 36 1,663,610
2 2001 48 2,293,697
2 2001 60 2,607,067
2 2001 72 2,979,479
2 2001 84 3,127,308
2 2001 96 3,285,338
2 2001 108 3,574,272
2 2001 120 3,630,610
2 2002 12 259,060
2 2002 24 1,011,092
2 2002 36 1,851,504
2 2002 48 2,705,313
2 2002 60 3,195,774
2 2002 72 3,766,008
2 2002 84 3,944,417
2 2002 96 4,234,043
2 2002 108 4,763,664
2 2003 12 239,981
2 2003 24 983,484
2 2003 36 1,929,785
2 2003 48 2,497,929
2 2003 60 2,972,887
2 2003 72 3,313,868
2 2003 84 3,727,432
2 2003 96 4,024,122
2 2004 12 77,522
2 2004 24 729,401
2 2004 36 1,473,914
2 2004 48 2,376,313
2 2004 60 2,999,197
2 2004 72 3,372,020
2 2004 84 3,887,883
2 2005 12 321,598
2 2005 24 1,132,502
2 2005 36 1,710,504
2 2005 48 2,438,620
2 2005 60 2,801,957
2 2005 72 3,182,466
2 2006 12 255,407
2 2006 24 1,275,141
2 2006 36 2,083,421
2 2006 48 3,144,579
2 2006 60 3,891,772
2 2007 12 338,120
2 2007 24 1,275,697
2 2007 36 2,238,715
2 2007 48 3,615,323
2 2008 12 310,214
2 2008 24 1,237,156
2 2008 36 2,563,326
2 2009 12 271,093
2 2009 24 1,523,131
2 2010 12 430,591
3 2001 12 330,887
3 2001 24 831,193
3 2001 36 1,601,374
3 2001 48 2,188,879
3 2001 60 2,662,773
3 2001 72 3,086,976
3 2001 84 3,332,247
3 2001 96 3,317,279
3 2001 108 3,576,659
3 2001 120 3,613,563
3 2002 12 358,263
3 2002 24 1,139,259
3 2002 36 2,236,375
3 2002 48 3,163,464
3 2002 60 3,715,130
3 2002 72 4,295,638
3 2002 84 4,502,105
3 2002 96 4,769,139
3 2002 108 5,323,304
3 2003 12 489,934
3 2003 24 1,570,352
3 2003 36 3,123,215
3 2003 48 4,189,299
3 2003 60 4,819,070
3 2003 72 5,306,689
3 2003 84 5,560,371
3 2003 96 5,827,003
3 2004 12 419,727
3 2004 24 1,308,884
3 2004 36 2,118,936
3 2004 48 2,906,732
3 2004 60 3,561,577
3 2004 72 3,934,400
3 2004 84 4,010,511
3 2005 12 389,217
3 2005 24 1,173,226
3 2005 36 1,794,216
3 2005 48 2,528,910
3 2005 60 3,474,035
3 2005 72 3,908,999
3 2006 12 291,940
3 2006 24 1,136,674
3 2006 36 1,915,614
3 2006 48 2,693,930
3 2006 60 3,375,601
3 2007 12 506,055
3 2007 24 1,684,660
3 2007 36 2,678,739
3 2007 48 3,545,156
3 2008 12 282,143
3 2008 24 1,536,490
3 2008 36 2,458,789
3 2009 12 271,093
3 2009 24 1,199,897
3 2010 12 266,359
Using the above dataframe I have to create 4 triangles based on the Total column:
For example:
Row Labels 12 24 36 48 60 72 84 96 108 120 Grand Total
2001 254,926 535,877 1,355,613 2,034,557 2,311,789 2,539,807 2,724,773 3,187,095 3,498,646 3,586,037 22,029,119
2002 542,369 1,016,927 2,201,329 2,923,381 3,711,305 3,914,829 4,385,757 4,596,072 5,047,861 28,339,832
2003 235,361 960,355 1,661,972 2,643,370 3,372,684 3,642,605 4,160,583 4,480,332 21,157,261
2004 764,553 1,703,557 2,498,418 3,198,358 3,524,562 3,884,971 4,268,241 19,842,659
2005 381,670 1,124,054 2,026,434 2,863,902 3,039,322 3,288,253 12,723,635
2006 320,332 1,022,323 1,830,842 2,676,710 3,375,172 9,225,377
2007 330,361 1,463,348 2,771,839 4,003,745 8,569,294
2008 282,143 1,782,267 2,898,699 4,963,110
2009 362,726 1,277,750 1,640,475
2010 321,247 321,247
Grand Total 3,795,687 10,886,456 17,245,147 20,344,022 19,334,833 17,270,466 15,539,355 12,263,499 8,546,507 3,586,037 128,812,009
...

Like this I need 4 triangles (4 is the number of simulations) using the 1st dataframe. If the user gives n_sims=900 then the dataframe contains 900 simulations, and from those we have to create 900 triangles. In the above example I displayed only the triangle for simulation 0, but I need the triangles for 1, 2 and 3 also.
Try:
df['sample_size'] = pd.to_numeric(df['sample_size'].str.replace(',',''))
df.pivot_table('sample_size','year', 'no', aggfunc='first')\
.pipe(lambda x: pd.concat([x,x.sum().to_frame('Grand Total').T]))
Output:
no 12 24 36 48 60 72 84 96 108 120
2001 254926.0 535877.0 1355613.0 2034557.0 2311789.0 2539807.0 2724773.0 3187095.0 3498646.0 3586037.0
2002 542369.0 1016927.0 2201329.0 2923381.0 3711305.0 3914829.0 4385757.0 4596072.0 5047861.0 NaN
2003 235361.0 960355.0 1661972.0 2643370.0 3372684.0 3642605.0 4160583.0 4480332.0 NaN NaN
2004 764553.0 1703557.0 2498418.0 3198358.0 3524562.0 3884971.0 4268241.0 NaN NaN NaN
2005 381670.0 1124054.0 2026434.0 2863902.0 3039322.0 3288253.0 NaN NaN NaN NaN
2006 320332.0 1022323.0 1830842.0 2676710.0 3375172.0 NaN NaN NaN NaN NaN
2007 330361.0 1463348.0 2771839.0 4003745.0 NaN NaN NaN NaN NaN NaN
2008 282143.0 1782267.0 2898699.0 NaN NaN NaN NaN NaN NaN NaN
2009 362726.0 1277750.0 NaN NaN NaN NaN NaN NaN NaN NaN
2010 321247.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Grand Total 3795688.0 10886458.0 17245146.0 20344023.0 19334834.0 17270465.0 15539354.0 12263499.0 8546507.0 3586037.0
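Note that the pivot above only ever yields one triangle: aggfunc='first' keeps the first value it sees for each year/age cell, i.e. simulation 0's. To get one triangle per simulation (4 here, or 900 for n_sims=900), group on the simulation column before pivoting. A hedged sketch, assuming the frame has been flattened to plain columns named 'sim', 'year', 'no' and 'sample_size' (rename 'sim' to match whatever your first column is actually called); it also rebuilds the Grand Total row and column from the example triangle:

import pandas as pd

triangles = {}
for sim, grp in df.groupby('sim'):
    tri = grp.pivot_table('sample_size', 'year', 'no', aggfunc='first')
    tri['Grand Total'] = tri.sum(axis=1)      # row totals per origin year
    tri.loc['Grand Total'] = tri.sum()        # column totals (incl. overall total)
    triangles[sim] = tri

print(triangles[0])   # triangle for simulation 0; the loop works unchanged for n_sims=900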
Related
In Selenium see if the 'a' anchor tag contains the text I want, and then extract multiple td's of text in the same row
For my python code, I have been trying to scrape data from NCAAF Stats. I have been having issues extracting the td's text after I evaluate whether the anchor tag 'a' contains the text I want. I want to be able to find the team's number of TDs, points, and PPG. I have been able to successfully find the school by text in Selenium, but after that I am unable to extract the info I want. Here is what I have coded so far:

from selenium import webdriver

driver = webdriver.Chrome('C:\\Users\\Carl\\Downloads\\chromedriver.exe')
driver.get('https://www.ncaa.com/stats/football/fbs/current/team/27')

# I plan to make a while or for loop later, that is why I used f-strings
team = "Coastal Carolina"
first = driver.find_element_by_xpath(f'//a[text()="{team}"]')

# This was the way another similarly asked question was answered but did not work
#tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text

# This grabs data from the very first row of data... not the one I want
tds = first.find_element_by_xpath('//following-sibling::td[4]').text
total_points = first.find_element_by_xpath('//following-sibling::td[10]').text
ppg = first.find_element_by_xpath('//following-sibling::td[11]').text
print(tds, total_points, ppg)

driver.quit()

I have tried to look around for a similarly asked question and was able to find this snippet:

tds = driver.find_element_by_xpath(f'//td//a[text()="{apples}"]/../td[4]').text

It unfortunately did not help me out much. The html structure looks like this. I appreciate any help, and thank you!
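A side note on why the first row keeps coming back: an XPath beginning with // searches the whole document even when called on an element. Prefixing it with . makes it relative to first, and since the anchor sits inside a td you also need to step up to that cell first. A hedged sketch (the ancestor step and sibling offset are assumptions about this table's markup and may need adjusting):

tds = first.find_element_by_xpath('./ancestor::td/following-sibling::td[3]').text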
No need to use Selenium, the page isn't dynamic. Just use pandas to parse the table for you:

import pandas as pd

url = 'https://www.ncaa.com/stats/football/fbs/current/team/27'
df = pd.read_html(url)[0]

Output:

print(df)

   Rank                 Team  G  TDs  PAT  2PT  Def Pts  FG  Saf    Pts   PPG
0     1             Ohio St.  6   39   39    0        0   6    0  291.0  48.5
1     2           Pittsburgh  6   40   36    0        0   4    1  290.0  48.3
2     3     Coastal Carolina  7   43   42    0        0   6    1  320.0  45.7
3     4              Alabama  7   41   40    1        0   9    0  315.0  45.0
4     5             Ole Miss  6   35   30    1        0   6    1  262.0  43.7
5     6           Cincinnati  6   36   34    1        0   3    0  261.0  43.5
6     7             Oklahoma  7   35   34    1        1  17    0  299.0  42.7
7     -                  SMU  7   40   36    1        0   7    0  299.0  42.7
8     9                Texas  7   38   37    0        0   8    1  291.0  41.6
9    10          Western Ky.  6   31   27    1        0  10    0  245.0  40.8
10   11            Tennessee  7   36   36    0        0   7    1  275.0  39.3
11   12          Wake Forest  6   28   24    2        0  12    0  232.0  38.7
12   13                 UTSA  7   33   33    0        0  13    0  270.0  38.6
13   14             Michigan  6   28   25    1        0  12    0  231.0  38.5
14   15              Georgia  7   34   33    0        0  10    1  269.0  38.4
15   16               Baylor  7   35   35    0        0   7    1  268.0  38.3
16   17              Houston  6   30   28    0        0   5    0  223.0  37.2
17    -                  TCU  6   29   28    0        0   7    0  223.0  37.2
18   19             Marshall  7   34   33    0        0   7    0  258.0  36.9
19    -       North Carolina  7   34   32    2        0   6    0  258.0  36.9
20   21               Nevada  6   26   24    1        0  12    0  218.0  36.3
21   22             Virginia  7   31   29    2        0  10    2  253.0  36.1
22   23           Fresno St.  7   32   27    1        0  10    0  251.0  35.9
23    -              Memphis  7   33   26    3        0   7    0  251.0  35.9
24   25           Texas Tech  7   32   31    0        0   9    0  250.0  35.7
25   26               Auburn  7   29   28    1        0  12    1  242.0  34.6
26   27              Florida  7   33   29    1        0   4    0  241.0  34.4
27    -             Missouri  7   31   31    0        0   8    0  241.0  34.4
28   29              Liberty  7   33   29    1        0   3    1  240.0  34.3
29    -         Michigan St.  7   30   30    0        0  10    0  240.0  34.3
30   31                  UCF  6   28   26    0        0   3    1  205.0  34.2
31   32           Oregon St.  6   27   27    0        0   5    0  204.0  34.0
32   33               Oregon  6   26   26    0        0   7    0  203.0  33.8
33   34             Iowa St.  6   23   22    0        0  14    0  202.0  33.7
34   35                 UCLA  7   30   28    0        0   9    0  235.0  33.6
35   36        San Diego St.  6   25   24    1        0   7    0  197.0  32.8
36   37                  LSU  7   29   29    0        0   8    0  227.0  32.4
37   38           Louisville  6   24   23    0        0   9    0  194.0  32.3
38    -           Miami (FL)  6   24   22    1        0   8    1  194.0  32.3
39    -             NC State  6   25   24    0        0   6    1  194.0  32.3
40   41  Southern California  6   22   19    3        0  12    0  193.0  32.2
41   42               Tulane  7   31   23    4        0   2    0  223.0  31.9
42   43          Arizona St.  7   30   25    2        0   4    0  221.0  31.6
43   44                 Utah  6   25   22    1        0   5    0  189.0  31.5
44   45            Air Force  7   29   27    1        0   5    1  220.0  31.4
45   46            App State  7   27   24    0        0  11    0  219.0  31.3
46   47             Arkansas  7   27   25    0        0  10    0  217.0  31.0
47    -      Army West Point  6   25   22    0        0   4    1  186.0  31.0
48    -           Notre Dame  6   23   20    2        0   8    0  186.0  31.0
49    -        Western Mich.  7   28   25    0        0   8    0  217.0  31.0
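To pull a single team's numbers out of the parsed frame, plain boolean indexing is enough; the column names ('Team', 'TDs', 'Pts', 'PPG') are read off the printed output above:

row = df[df['Team'] == 'Coastal Carolina']
print(row[['TDs', 'Pts', 'PPG']])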
How can I fix this BeautifulSoup website scrape for NHL Reference?
I'm scraping National Hockey League (NHL) data for multiple seasons from this URL: https://www.hockey-reference.com/leagues/NHL_2018_skaters.html

I'm only getting a few instances here and have tried moving my dict statements throughout the for loops. I've also tried utilizing solutions I found on other posts with no luck. Any help is appreciated. Thank you!

import requests
from bs4 import BeautifulSoup
import pandas as pd

dict = {}
for i in range(2010, 2020):
    year = str(i)
    source = requests.get('https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html').text
    soup = BeautifulSoup(source, features='lxml')

    # identifying table in html
    table = soup.find('table', id="stats")

    # grabbing <tr> tags in html
    rows = table.findAll("tr")

    # creating passable values for each "stat" in td tag
    data_stats = [
        "player", "age", "team_id", "pos", "games_played", "goals", "assists",
        "points", "plus_minus", "pen_min", "ps", "goals_ev", "goals_pp",
        "goals_sh", "goals_gw", "assists_ev", "assists_pp", "assists_sh",
        "shots", "shot_pct", "time_on_ice", "time_on_ice_avg", "blocks",
        "hits", "faceoff_wins", "faceoff_losses", "faceoff_percentage"
    ]

    for rownum in rows:
        # grabbing player name and using as key
        filter = {"data-stat": 'player'}
        cell = rows[3].findAll("td", filter)
        nameval = cell[0].string
        list = []
        for data in data_stats:
            # iterating through data_stat to grab values
            filter = {"data-stat": data}
            cell = rows[3].findAll("td", filter)
            value = cell[0].string
            list.append(value)
        dict[nameval] = list
        dict[nameval].append(year)

# conversion to numeric values and creating dataframe
columns = [
    "player", "age", "team_id", "pos", "games_played", "goals", "assists",
    "points", "plus_minus", "pen_min", "ps", "goals_ev", "goals_pp",
    "goals_sh", "goals_gw", "assists_ev", "assists_pp", "assists_sh",
    "shots", "shot_pct", "time_on_ice", "time_on_ice_avg", "blocks",
    "hits", "faceoff_wins", "faceoff_losses", "faceoff_percentage", "year"
]
df = pd.DataFrame.from_dict(dict, orient='index', columns=columns)
cols = df.columns.drop(['player', 'team_id', 'pos', 'year'])
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
print(df)

Output

Craig Adams        Craig Adams        32  ...  43.9  2010
Luke Adam          Luke Adam          22  ... 100.0  2013
Justin Abdelkader  Justin Abdelkader  29  ...  29.4  2017
Will Acton         Will Acton         27  ...  50.0  2015
Noel Acciari       Noel Acciari       24  ...  44.1  2016
Pontus Aberg       Pontus Aberg       25  ...  10.5  2019

[6 rows x 28 columns]
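Worth flagging before the answers: the inner lookups above always read rows[3] rather than the loop variable rownum, which is why only a handful of players survive. A minimal hedged rewrite of that loop (player_rows stands in for the shadowed built-in name dict, and it assumes, as the original does, that every stat cell is present on a player row):

player_rows = {}
for rownum in rows:
    cell = rownum.findAll("td", {"data-stat": "player"})
    if not cell:                      # header/separator rows have no <td> cells
        continue
    nameval = cell[0].string
    values = [rownum.findAll("td", {"data-stat": stat})[0].string
              for stat in data_stats]
    values.append(year)
    player_rows[nameval] = values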
I'd just use pandas' .read_html(), it does the hard work of parsing tables for you (uses BeautifulSoup under the hood).

Code:

import pandas as pd

result = pd.DataFrame()
for i in range(2010, 2020):
    print(i)
    year = str(i)
    url = 'https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html'
    #source = requests.get('https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html').text
    df = pd.read_html(url, header=1)[0]
    df['year'] = year
    result = result.append(df, sort=False)

result = result[~result['Age'].str.contains("Age")]
result = result.reset_index(drop=True)

You can then save to file with result.to_csv('filename.csv', index=False)

Output:

print(result)

          Rk               Player Age   Tm Pos  GP  ...  BLK  HIT  FOW  FOL   FO%  year
0          1    Justin Abdelkader  22  DET  LW  50  ...   20  152  148  170  46.5  2010
1          2          Craig Adams  32  PIT  RW  82  ...   58  193  243  311  43.9  2010
2          3     Maxim Afinogenov  30  ATL  RW  82  ...   21   32    1    2  33.3  2010
3          4       Andrew Alberts  28  TOT   D  76  ...   88  216    0    1   0.0  2010
4          4       Andrew Alberts  28  CAR   D  62  ...   67  172    0    0   NaN  2010
5          4       Andrew Alberts  28  VAN   D  14  ...   21   44    0    1   0.0  2010
6          5    Daniel Alfredsson  37  OTT  RW  70  ...   36   41   14   25  35.9  2010
7          6          Bryan Allen  29  FLA   D  74  ...  137  120    0    0   NaN  2010
8          7          Cody Almond  20  MIN   C   7  ...    5    7   18   12  60.0  2010
9          8          Karl Alzner  21  WSH   D  21  ...   21   15    0    0   NaN  2010
10         9       Artem Anisimov  21  NYR   C  82  ...   41   45  310  380  44.9  2010
11        10         Nik Antropov  29  ATL   C  76  ...   35   82  481  627  43.4  2010
12        11      Colby Armstrong  27  ATL  RW  79  ...   29   74   10   10  50.0  2010
13        12      Derek Armstrong  36  STL   C   6  ...    0    4    7    8  46.7  2010
14        13         Jason Arnott  35  NSH   C  63  ...   17   24  526  551  48.8  2010
15        14          Dean Arsene  29  EDM   D  13  ...   13   18    0    0   NaN  2010
16        15     Evgeny Artyukhin  26  TOT  RW  54  ...   10  127    1    1  50.0  2010
17        15     Evgeny Artyukhin  26  ANA  RW  37  ...    8   90    0    1   0.0  2010
18        15     Evgeny Artyukhin  26  ATL  RW  17  ...    2   37    1    0 100.0  2010
19        16          Arron Asham  31  PHI  RW  72  ...   16   92    2   11  15.4  2010
20        17        Adrian Aucoin  36  PHX   D  82  ...   67  131    1    0 100.0  2010
21        18         Keith Aucoin  31  WSH   C   9  ...    0    2   31   25  55.4  2010
22        19           Sean Avery  29  NYR   C  69  ...   17  145    4   10  28.6  2010
23        20         David Backes  25  STL  RW  79  ...   60  266  504  561  47.3  2010
24        21      Mikael Backlund  20  CGY   C  23  ...    4   12  100   86  53.8  2010
25        22    Nicklas Backstrom  22  WSH   C  82  ...   61   90  657  660  49.9  2010
26        23          Josh Bailey  20  NYI   C  73  ...   36   67  171  255  40.1  2010
27        24        Keith Ballard  27  FLA   D  82  ...  201  156    0    0   NaN  2010
28        25           Krys Barch  29  DAL  RW  63  ...   13  120    0    3   0.0  2010
29        26           Cam Barker  23  TOT   D  70  ...   53   75    0    0   NaN  2010
...      ...                  ...  ..  ...  ..  ..  ...  ...  ...  ...  ...   ...   ...
10251    885        Chris Wideman  29  TOT   D  25  ...   26   35    0    0   NaN  2019
10252    885        Chris Wideman  29  OTT   D  19  ...   25   26    0    0   NaN  2019
10253    885        Chris Wideman  29  EDM   D   5  ...    1    7    0    0   NaN  2019
10254    885        Chris Wideman  29  FLA   D   1  ...    0    2    0    0   NaN  2019
10255    886      Justin Williams  37  CAR  RW  82  ...   32   55   92  150  38.0  2019
10256    887         Colin Wilson  29  COL   C  65  ...   31   55   20   32  38.5  2019
10257    888       Garrett Wilson  27  PIT  LW  50  ...   16  114    3    4  42.9  2019
10258    889         Scott Wilson  26  BUF   C  15  ...    2   29    1    2  33.3  2019
10259    890           Tom Wilson  24  WSH  RW  63  ...   52  200   29   24  54.7  2019
10260    891       Luke Witkowski  28  DET   D  34  ...   27   67    0    0   NaN  2019
10261    892    Christian Wolanin  23  OTT   D  30  ...   31   11    0    0   NaN  2019
10262    893           Miles Wood  23  NJD  LW  63  ...   27   97    0    2   0.0  2019
10263    894        Egor Yakovlev  27  NJD   D  25  ...   22   12    0    0   NaN  2019
10264    895      Kailer Yamamoto  20  EDM  RW  17  ...   11   18    0    0   NaN  2019
10265    896         Keith Yandle  32  FLA   D  82  ...   76   47    0    0   NaN  2019
10266    897          Pavel Zacha  21  NJD   C  61  ...   24   68  348  364  48.9  2019
10267    898         Filip Zadina  19  DET  RW   9  ...    3    6    3    3  50.0  2019
10268    899       Nikita Zadorov  23  COL   D  70  ...   67  228    0    0   NaN  2019
10269    900       Nikita Zaitsev  27  TOR   D  81  ...  151  139    0    0   NaN  2019
10270    901         Travis Zajac  33  NJD   C  80  ...   38   66  841  605  58.2  2019
10271    902         Jakub Zboril  21  BOS   D   2  ...    0    3    0    0   NaN  2019
10272    903       Mika Zibanejad  25  NYR   C  82  ...   66  134  830  842  49.6  2019
10273    904      Mats Zuccarello  31  TOT  LW  48  ...   43   57   10   20  33.3  2019
10274    904      Mats Zuccarello  31  NYR  LW  46  ...   42   57   10   20  33.3  2019
10275    904      Mats Zuccarello  31  DAL  LW   2  ...    1    0    0    0   NaN  2019
10276    905         Jason Zucker  27  MIN  LW  81  ...   38   87    2   11  15.4  2019
10277    906       Valentin Zykov  23  TOT  LW  28  ...    6   26    2    7  22.2  2019
10278    906       Valentin Zykov  23  CAR  LW  13  ...    2    6    2    6  25.0  2019
10279    906       Valentin Zykov  23  VEG  LW  10  ...    3   18    0    1   0.0  2019
10280    906       Valentin Zykov  23  EDM  LW   5  ...    1    2    0    0   NaN  2019

[10281 rows x 29 columns]
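One caveat if you run this on a current pandas: DataFrame.append was removed in pandas 2.0, so the accumulation step needs pd.concat instead. A sketch of the equivalent loop:

import pandas as pd

frames = []
for i in range(2010, 2020):
    url = f'https://www.hockey-reference.com/leagues/NHL_{i}_skaters.html'
    df = pd.read_html(url, header=1)[0]
    df['year'] = i
    frames.append(df)

# concatenate once at the end; filter out the repeated in-table header rows
result = pd.concat(frames, ignore_index=True)
result = result[~result['Age'].astype(str).str.contains("Age")]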
Scraping heavily formatted tables is positively painful with Beautiful Soup (not to bash on Beautiful Soup, it's wonderful for several use cases). There's a bit of a 'hack' I use for scraping data surrounded with dense markup, if you're willing to be a bit utilitarian about it:

1. Select the entire table on the web page
2. Copy + paste into Evernote (simplifies and reformats the HTML)
3. Copy + paste from Evernote to Excel or another spreadsheet software (removes the HTML)
4. Save as .csv

It isn't perfect. There will be blank lines in the CSV, but blank lines are easier and far less time-consuming to remove than such data is to scrape. Good luck!

As reference, I've linked my own conversions: Parsed to Evernote, Parsed to Excel.
Dataframe: How to transpose/merge sparse columns as rows
I have a Dataframe as below.

data = {'Year': ["2012", "2013", "2014", "2015", "2016"],
        'Matthew': [80, 83, 85, 90, 91],
        'Aakash': [85, 75, 95, 92, 93],
        'Jill': [90, 70, 100, 80, 85]}
df = pd.DataFrame(data)

   Year  Matthew  Aakash  Jill
0  2012       80      85    90
1  2013       83      75    70
2  2014       85      95   100
3  2015       90      92    80
4  2016       91      93    85

How do I transform it to the below?

Expected Result:

data2 = {'Year': ["2012","2012","2012","2013","2013","2013","2014","2014","2014","2015","2015","2015","2016","2016","2016"],
         'Name': ['Matthew','Aakash','Jill','Matthew','Aakash','Jill','Matthew','Aakash','Jill','Matthew','Aakash','Jill','Matthew','Aakash','Jill'],
         'Results': [80,85,90,83,75,70,85,95,100,90,92,80,91,93,85]}
df2 = pd.DataFrame(data2)

    Year     Name  Results
0   2012  Matthew       80
1   2012   Aakash       85
2   2012     Jill       90
3   2013  Matthew       83
4   2013   Aakash       75
5   2013     Jill       70
6   2014  Matthew       85
7   2014   Aakash       95
8   2014     Jill      100
9   2015  Matthew       90
10  2015   Aakash       92
11  2015     Jill       80
12  2016  Matthew       91
13  2016   Aakash       93
14  2016     Jill       85
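This shape is exactly what DataFrame.melt produces; a minimal sketch (the stable sort keeps the original column order, Matthew/Aakash/Jill, within each year):

df2 = (df.melt(id_vars='Year', var_name='Name', value_name='Results')
         .sort_values('Year', kind='stable', ignore_index=True))
print(df2)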
How to find ChangeCol1/ChangeCol2 and %ChangeCol1/%ChangeCol2 of DF
I have data that looks like this.

    Year  Quarter  Quantity  Price  TotalRevenue
0   2000        1        23    142          3266
1   2000        2        23    144          3312
2   2000        3        23    147          3381
3   2000        4        23    151          3473
4   2001        1        22    160          3520
5   2001        2        22    183          4026
6   2001        3        22    186          4092
7   2001        4        22    186          4092
8   2002        1        21    212          4452
9   2002        2        19    232          4408
10  2002        3        19    223          4237

I'm trying to figure out how to get the 'MarginalRevenue', where:

MR = (ΔTR/ΔQ)
MarginalRevenue = (Change in TotalRevenue) / (Change in Quantity)

I found df.pct_change(), but that seems to get the percentage change for everything. Also, I'm trying to figure out how to get something related:

ElasticityPrice = (%ΔQuantity / %ΔPrice)
Do you mean something like this?

df['MarginalRevenue'] = df['TotalRevenue'].pct_change() / df['Quantity'].pct_change()

or

df['MarginalRevenue'] = df['TotalRevenue'].diff() / df['Quantity'].diff()
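The elasticity half of the question follows the same pattern, using the %Δ formula given above:

df['ElasticityPrice'] = df['Quantity'].pct_change() / df['Price'].pct_change()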
Pandas Melt with Multiple Value Vars
I have a data set which is in wide format like this

Index  Country    Variable  2000  2001  2002  2003  2004  2005
0      Argentina  var1        12    15    18    17    23    29
1      Argentina  var2         1     3     2     5     7     5
2      Brazil     var1        20    23    25    29    31    32
3      Brazil     var2         0     1     2     2     3     3

I want to reshape my data to long so that year, var1, and var2 become new columns

Index  Country    year  var1  var2
0      Argentina  2000    12     1
1      Argentina  2001    15     3
2      Argentina  2002    18     2
....
6      Brazil     2000    20     0
7      Brazil     2001    23     1

I got my code to work when I only had one variable by writing

df = (pd.melt(df, id_vars='Country', value_name='Var1', var_name='year'))

I can't figure out how to do this for var1, var2, var3, etc.
Instead of melt, you can use a combination of stack and unstack:

(df.set_index(['Country', 'Variable'])
   .rename_axis(['Year'], axis=1)
   .stack()
   .unstack('Variable')
   .reset_index())

Variable    Country  Year  var1  var2
0         Argentina  2000    12     1
1         Argentina  2001    15     3
2         Argentina  2002    18     2
3         Argentina  2003    17     5
4         Argentina  2004    23     7
5         Argentina  2005    29     5
6            Brazil  2000    20     0
7            Brazil  2001    23     1
8            Brazil  2002    25     2
9            Brazil  2003    29     2
10           Brazil  2004    31     3
11           Brazil  2005    32     3
Option 1

Using melt then unstack for var1, var2, etc...

(df1.melt(id_vars=['Country','Variable'], var_name='Year')
    .set_index(['Country','Year','Variable'])
    .squeeze()
    .unstack()
    .reset_index())

Output:

Variable    Country  Year  var1  var2
0         Argentina  2000    12     1
1         Argentina  2001    15     3
2         Argentina  2002    18     2
3         Argentina  2003    17     5
4         Argentina  2004    23     7
5         Argentina  2005    29     5
6            Brazil  2000    20     0
7            Brazil  2001    23     1
8            Brazil  2002    25     2
9            Brazil  2003    29     2
10           Brazil  2004    31     3
11           Brazil  2005    32     3

Option 2

Using pivot then stack:

(df1.pivot(index='Country', columns='Variable')
    .stack(0)
    .rename_axis(['Country','Year'])
    .reset_index())

Output:

Variable    Country  Year  var1  var2
0         Argentina  2000    12     1
1         Argentina  2001    15     3
2         Argentina  2002    18     2
3         Argentina  2003    17     5
4         Argentina  2004    23     7
5         Argentina  2005    29     5
6            Brazil  2000    20     0
7            Brazil  2001    23     1
8            Brazil  2002    25     2
9            Brazil  2003    29     2
10           Brazil  2004    31     3
11           Brazil  2005    32     3

Option 3 (ayhan's solution)

Using set_index, stack, and unstack:

(df.set_index(['Country', 'Variable'])
   .rename_axis(['Year'], axis=1)
   .stack()
   .unstack('Variable')
   .reset_index())

Output:

Variable    Country  Year  var1  var2
0         Argentina  2000    12     1
1         Argentina  2001    15     3
2         Argentina  2002    18     2
3         Argentina  2003    17     5
4         Argentina  2004    23     7
5         Argentina  2005    29     5
6            Brazil  2000    20     0
7            Brazil  2001    23     1
8            Brazil  2002    25     2
9            Brazil  2003    29     2
10           Brazil  2004    31     3
11           Brazil  2005    32     3
numpy

years = df.drop(['Country', 'Variable'], 1)
y = years.values
m = y.shape[1]

c = df.Country.values
v = df.Variable.values

f0, u0 = pd.factorize(df.Country.values)
f1, u1 = pd.factorize(df.Variable.values)

w = np.empty((u1.size, u0.size, m), dtype=y.dtype)
w[f1, f0] = y

results = pd.DataFrame(dict(
    Country=u0.repeat(m),
    Year=np.tile(years.columns.values, u0.size),
)).join(pd.DataFrame(w.reshape(-1, m * u1.size).T, columns=u1))

results

      Country  Year  var1  var2
0   Argentina  2000    12     1
1   Argentina  2001    15     3
2   Argentina  2002    18     2
3   Argentina  2003    17     5
4   Argentina  2004    23     7
5   Argentina  2005    29     5
6      Brazil  2000    20     0
7      Brazil  2001    23     1
8      Brazil  2002    25     2
9      Brazil  2003    29     2
10     Brazil  2004    31     3
11     Brazil  2005    32     3