Can't split web scraped table on rows - python
I pulled a table of Tour de France winners from Wikipedia using BeautifulSoup, but it's returning the table as what appears to be a single dataset; the rows don't seem to be separable.
First, here is what I did to grab the page and table:
import requests
response = requests.get("https://en.wikipedia.org/wiki/List_of_Tour_de_France_general_classification_winners")
content = response.content
from bs4 import BeatifulSoup
parser = BeautifulSoup(content, 'html.parser')
# I know it's the second table on the page, so grab it as such
winners_table = parser.find_all('table')[1]
import pandas as pd
data = pd.read_html(str(winners_table), flavor = 'html5lib')
Note that I used html5lib here because PyCharm was telling me that there is no lxml, despite it certainly being installed. When I print out the data, it appears as a table with 116 rows and 9 columns, but it doesn't appear to split on rows. It looks like this:
[ 0 1 \
0 Year Country
1 1903 France
2 1904 France
3 1905 France
4 1906 France
5 1907 France
6 1908 France
7 1909 Luxembourg
8 1910 France
9 1911 France
10 1912 Belgium
11 1913 Belgium
12 1914 Belgium
13 1915 World War I
14 1916 NaN
15 1917 NaN
16 1918 NaN
17 1919 Belgium
18 1920 Belgium
19 1921 Belgium
20 1922 Belgium
21 1923 France
22 1924 Italy
23 1925 Italy
24 1926 Belgium
25 1927 Luxembourg
26 1928 Luxembourg
27 1929 Belgium
28 1930 France
29 1931 France
.. ... ...
86 1988 Spain
87 1989 United States
88 1990 United States
89 1991 Spain
90 1992 Spain
91 1993 Spain
92 1994 Spain
93 1995 Spain
94 1996 Denmark
95 1997 Germany
96 1998 Italy
97 1999[B] United States
98 2000[B] United States
99 2001[B] United States
100 2002[B] United States
101 2003[B] United States
102 2004[B] United States
103 2005[B] United States
104 2006 Spain
105 2007 Spain
106 2008 Spain
107 2009 Spain
108 2010 Luxembourg
109 2011 Australia
110 2012 Great Britain
111 2013 Great Britain
112 2014 Italy
113 2015 Great Britain
114 2016 Great Britain
115 2017 Great Britain
2 \
0 Cyclist
1 Garin, MauriceMaurice Garin
2 Garin, MauriceMaurice Garin Cornet, HenriHenri...
3 Trousselier, LouisLouis Trousselier
4 Pottier, RenéRené Pottier
5 Petit-Breton, LucienLucien Petit-Breton
6 Petit-Breton, LucienLucien Petit-Breton
7 Faber, FrançoisFrançois Faber
8 Lapize, OctaveOctave Lapize
9 Garrigou, GustaveGustave Garrigou
10 Defraye, OdileOdile Defraye
11 Thys, PhilippePhilippe Thys
12 Thys, PhilippePhilippe Thys
13 NaN
14 NaN
15 NaN
16 NaN
17 Lambot, FirminFirmin Lambot
18 Thys, PhilippePhilippe Thys
19 Scieur, LéonLéon Scieur
20 Lambot, FirminFirmin Lambot
21 Pélissier, HenriHenri Pélissier
22 Bottecchia, OttavioOttavio Bottecchia
23 Bottecchia, OttavioOttavio Bottecchia
24 Buysse, LucienLucien Buysse
25 Frantz, NicolasNicolas Frantz
26 Frantz, NicolasNicolas Frantz
27 De Waele, MauriceMaurice De Waele
28 Leducq, AndréAndré Leducq
29 Magne, AntoninAntonin Magne
.. ...
86 Delgado, PedroPedro Delgado
87 LeMond, GregGreg LeMond
88 LeMond, GregGreg LeMond
89 Indurain, MiguelMiguel Indurain
90 Indurain, MiguelMiguel Indurain
91 Indurain, MiguelMiguel Indurain
92 Indurain, MiguelMiguel Indurain
93 Indurain, MiguelMiguel Indurain
94 Riis, BjarneBjarne Riis[A]
95 Ullrich, JanJan Ullrich#
96 Pantani, MarcoMarco Pantani
97 Armstrong, LanceLance Armstrong
98 Armstrong, LanceLance Armstrong
99 Armstrong, LanceLance Armstrong
100 Armstrong, LanceLance Armstrong
101 Armstrong, LanceLance Armstrong
102 Armstrong, LanceLance Armstrong
103 Armstrong, LanceLance Armstrong
104 Landis, FloydFloyd Landis Pereiro, ÓscarÓscar ...
105 Contador, AlbertoAlberto Contador#
106 Sastre, CarlosCarlos Sastre*
107 Contador, AlbertoAlberto Contador
108 Contador, AlbertoAlberto Contador Schleck, And...
109 Evans, CadelCadel Evans
110 Wiggins, BradleyBradley Wiggins
111 Froome, ChrisChris Froome
112 Nibali, VincenzoVincenzo Nibali
113 Froome, ChrisChris Froome*
114 Froome, ChrisChris Froome
115 Froome, ChrisChris Froome
3 4 \
0 Sponsor/Team Distance
1 La Française 2,428 km (1,509 mi)
2 Conte 2,428 km (1,509 mi)
3 Peugeot–Wolber 2,994 km (1,860 mi)
4 Peugeot–Wolber 4,637 km (2,881 mi)
5 Peugeot–Wolber 4,488 km (2,789 mi)
6 Peugeot–Wolber 4,497 km (2,794 mi)
7 Alcyon–Dunlop 4,498 km (2,795 mi)
8 Alcyon–Dunlop 4,734 km (2,942 mi)
9 Alcyon–Dunlop 5,343 km (3,320 mi)
10 Alcyon–Dunlop 5,289 km (3,286 mi)
11 Peugeot–Wolber 5,287 km (3,285 mi)
12 Peugeot–Wolber 5,380 km (3,340 mi)
13 NaN NaN
14 NaN NaN
15 NaN NaN
16 NaN NaN
17 La Sportive 5,560 km (3,450 mi)
18 La Sportive 5,503 km (3,419 mi)
19 La Sportive 5,485 km (3,408 mi)
20 Peugeot–Wolber 5,375 km (3,340 mi)
21 Automoto–Hutchinson 5,386 km (3,347 mi)
22 Automoto 5,425 km (3,371 mi)
23 Automoto–Hutchinson 5,440 km (3,380 mi)
24 Automoto–Hutchinson 5,745 km (3,570 mi)
25 Alcyon–Dunlop 5,398 km (3,354 mi)
26 Alcyon–Dunlop 5,476 km (3,403 mi)
27 Alcyon–Dunlop 5,286 km (3,285 mi)
28 Alcyon–Dunlop 4,822 km (2,996 mi)
29 France 5,091 km (3,163 mi)
.. ... ...
86 Reynolds 3,286 km (2,042 mi)
87 AD Renting–W-Cup–Bottecchia 3,285 km (2,041 mi)
88 Z–Tomasso 3,504 km (2,177 mi)
89 Banesto 3,914 km (2,432 mi)
90 Banesto 3,983 km (2,475 mi)
91 Banesto 3,714 km (2,308 mi)
92 Banesto 3,978 km (2,472 mi)
93 Banesto 3,635 km (2,259 mi)
94 Team Telekom 3,765 km (2,339 mi)
95 Team Telekom 3,950 km (2,450 mi)
96 Mercatone Uno–Bianchi 3,875 km (2,408 mi)
97 U.S. Postal Service 3,687 km (2,291 mi)
98 U.S. Postal Service 3,662 km (2,275 mi)
99 U.S. Postal Service 3,458 km (2,149 mi)
100 U.S. Postal Service 3,272 km (2,033 mi)
101 U.S. Postal Service 3,427 km (2,129 mi)
102 U.S. Postal Service 3,391 km (2,107 mi)
103 Discovery Channel 3,593 km (2,233 mi)
104 Caisse d'Epargne–Illes Balears 3,657 km (2,272 mi)
105 Discovery Channel 3,570 km (2,220 mi)
106 Team CSC 3,559 km (2,211 mi)
107 Astana 3,459 km (2,149 mi)
108 Team Saxo Bank 3,642 km (2,263 mi)
109 BMC Racing Team 3,430 km (2,130 mi)
110 Team Sky 3,496 km (2,172 mi)
111 Team Sky 3,404 km (2,115 mi)
112 Astana 3,660.5 km (2,274.5 mi)
113 Team Sky 3,360.3 km (2,088.0 mi)
114 Team Sky 3,529 km (2,193 mi)
115 Team Sky 3,540 km (2,200 mi)
5 6 7 8
0 Time/Points Margin Stage wins Stages in lead
1 094 !94h 33' 14" 24921 !+ 2h 59' 21" 3 6
2 096 !96h 05' 55" 21614 !+ 2h 16' 14" 1 3
3 35 26 5 10
4 31 8 5 12
5 47 19 2 5
6 36 32 5 13
7 37 20 6 13
8 63 4 4 3
9 43 18 2 13
10 49 59 3 13
11 197 !197h 54' 00" 00837 !+ 8' 37" 1 8
12 200 !200h 28' 48" 00150 !+ 1' 50" 1 15
13 NaN NaN NaN NaN
14 NaN NaN NaN NaN
15 NaN NaN NaN NaN
16 NaN NaN NaN NaN
17 231 !231h 07' 15" 14254 !+ 1h 42' 54" 1 2
18 228 !228h 36' 13" 05721 !+ 57' 21" 4 14
19 221 !221h 50' 26" 01836 !+ 18' 36" 2 14
20 222 !222h 08' 06" 04115 !+ 41' 15" 0 3
21 222 !222h 15' 30" 03041 !+ 30 '41" 3 6
22 226 !226h 18' 21" 03536 !+ 35' 36" 4 15
23 219 !219h 10' 18" 05420 !+ 54' 20" 4 13
24 238 !238h 44' 25" 12225 !+ 1h 22' 25" 2 8
25 198 !198h 16' 42" 14841 !+ 1h 48' 41" 3 14
26 192 !192h 48' 58" 05007 !+ 50' 07" 5 22
27 186 !186h 39' 15" 04423 !+44' 23" 1 16
28 172 !172h 12' 16" 01413 !+ 14' 13" 2 13
29 177 !177h 10' 03" 01256 !+ 12' 56" 1 16
.. ... ... ... ...
86 084 !84h 27' 53" 00713 !+ 7' 13" 1 11
87 087 !87h 38' 35" 00008 !+ 8" 3 8
88 090 !90h 43' 20" 00216 !+ 2' 16" 0 2
89 101 !101h 01' 20" 00336 !+ 3' 36" 2 10
90 100 !100h 49' 30" 00435 !+ 4' 35" 3 10
91 095 !95h 57' 09" 00459 !+ 4' 59" 2 14
92 103 !103h 38' 38" 00539 !+ 5' 39" 1 13
93 092 !92h 44' 59" 00435 !+ 4' 35" 2 13
94 095 !95h 57' 16" 00141 !+ 1' 41" 2 13
95 100 !100h 30' 35" 00909 !+ 9' 09" 2 12
96 092 !92h 49' 46" 00321 !+ 3' 21" 2 7
97 091 !91h 32' 16" 00737 !+ 7' 37" 4 15
98 092 !92h 33' 08" 00602 !+ 6' 02" 1 12
99 086 !86h 17' 28" 00644 !+ 6' 44" 4 8
100 082 !82h 05' 12" 00717 !+ 7' 17" 4 11
101 083 !83h 41' 12" 00101 !+ 1' 01" 1 13
102 083 !83h 36' 02" 00619 !+ 6' 19" 5 7
103 086 !86h 15' 02" 00440 !+ 4' 40" 1 17
104 089 !89h 40' 27" 00032 !+ 32" 0 8
105 091 !91h 00' 26" 00023 !+ 23" 1 4
106 087 !87h 52' 52" 00058 !+ 58" 1 5
107 085 !85h 48' 35" 00411 !+ 4' 11" 2 7
108 091 !91h 59' 27" 00122 !+ 1' 22" 2 12
109 086 !86h 12' 22" 00134 !+ 1' 34" 1 2
110 087 !87h 34' 47" 00321 !+ 3' 21" 2 14
111 083 !83h 56' 20" 00420 !+ 4' 20" 3 14
112 089 !89h 59' 06" 00737 !+ 7' 37" 4 19
113 084 !84h 46' 14" 00112 !+ 1' 12" 1 16
114 089 !89h 04' 48" 00405 !+ 4' 05" 2 14
115 086 !86h 20' 55" 00054 !+ 54" 0 15
[116 rows x 9 columns]]
This is all well and good, but the problem is that it doesn't seem to be differentiating by rows. For instance, when I try to print just the first row, it reprints the whole dataset. Here's an example of trying to print just the first row and second column (so it should be just one value):
print(data[0][2])
0 Country
1 France
2 France
3 France
4 France
5 France
6 France
7 Luxembourg
8 France
9 France
10 Belgium
11 Belgium
12 Belgium
13 World War I
14 NaN
15 NaN
16 NaN
17 Belgium
18 Belgium
19 Belgium
20 Belgium
21 France
22 Italy
23 Italy
24 Belgium
25 Luxembourg
26 Luxembourg
27 Belgium
28 France
29 France
...
86 Spain
87 United States
88 United States
89 Spain
90 Spain
91 Spain
92 Spain
93 Spain
94 Denmark
95 Germany
96 Italy
97 United States
98 United States
99 United States
100 United States
101 United States
102 United States
103 United States
104 Spain
105 Spain
106 Spain
107 Spain
108 Luxembourg
109 Australia
110 Great Britain
111 Great Britain
112 Italy
113 Great Britain
114 Great Britain
115 Great Britain
Name: 1, Length: 116, dtype: object
All I want is for this to behave as a data frame, with 116 rows and 9 columns. Any idea how to fix this?
If we take a look at the documentation here, we can see that read_html actually outputs a list of DataFrames, not a single DataFrame. We can confirm this when we run:
>> print(type(data))
<class 'list'>
The format of the list is such that the first element of the list is the actual DataFrame containing your values.
>> print(type(data[0]))
<class 'pandas.core.frame.DataFrame'>
The simple solution to this is to reassign data to data[0]. From this you can then call individual rows. Indexing of rows for DataFrames doesn't behave like normal lists, so I would recommend looking into .iloc and .loc. This is a nice article I found on indexing of DataFrames.
An example of this solution:
>> data = data[0]
>> print(data.iloc[1])
0 1903
1 France
2 Garin, MauriceMaurice Garin
3 La Française
4 2,428 km (1,509 mi)
5 094 !94h 33' 14"
6 24921 !+ 2h 59' 21"
7 3
8 6
Name: 1, dtype: object
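To illustrate the .loc/.iloc distinction mentioned above, here is a minimal sketch with a made-up two-row frame (not the scraped data); the non-default index labels make the difference visible:

```python
import pandas as pd

# Toy frame with non-default index labels (made-up values)
df = pd.DataFrame({"year": [1903, 1904], "country": ["France", "France"]},
                  index=[10, 20])

print(df.iloc[0])   # positional: the first row, whatever its label (here label 10)
print(df.loc[20])   # label-based: the row whose index label is 20
```

`.iloc` counts positions from zero, while `.loc` looks up index labels, so the two only coincide when the index happens to be 0, 1, 2, ...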
The pandas function read_html returns a list of DataFrames. So in your case I believe you need to take the first element of the returned list, as done on the line that calls pd.read_html in the code below.
Also note that you have a typo in the import line of BeautifulSoup; please update the code in your question accordingly.
I hope my output is what you're looking for.
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
response = requests.get("https://en.wikipedia.org/wiki/List_of_Tour_de_France_general_classification_winners")
parser = BeautifulSoup(response.content, 'html.parser')
winners_table = parser.find_all('table')[1]
data = pd.read_html(str(winners_table), flavor = 'lxml')[0]
print("type of variable data: " + str(type(data)))
print(data[0][2])
Output:
type of variable data: <class 'pandas.core.frame.DataFrame'>
1904
Note: I used lxml instead of html5lib.
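As a side note (not part of the original answer), you can confirm read_html's list-of-DataFrames behaviour offline by feeding it an HTML string; the two-row toy table below is made up for illustration:

```python
from io import StringIO

import pandas as pd

# A throwaway two-row table (not the Wikipedia page) to show the return type
html = """
<table>
  <tr><th>Year</th><th>Country</th></tr>
  <tr><td>1903</td><td>France</td></tr>
  <tr><td>1904</td><td>France</td></tr>
</table>
"""

# read_html always returns a *list* of DataFrames, even for a single table
tables = pd.read_html(StringIO(html))
print(len(tables))
print(tables[0])
```

The `<th>` row is picked up as the header, so `tables[0]` is a 2x2 frame with columns Year and Country.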
You could try this:
df = data[0]

# iterate through the DataFrame using iterrows()
for index, row in df.iterrows():
    print("Col1:", row[0], " Col2:", row[1], " Col3:", row[2], " Col4:", row[3])  # etc. for all columns
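If you only need the cell values, `itertuples()` is usually faster than `iterrows()` because it avoids constructing a Series for every row; a minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"year": [1903, 1904], "country": ["France", "France"]})

# Each row arrives as a namedtuple, so fields are attribute accesses
for row in df.itertuples(index=False):
    print(row.year, row.country)
```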
I hope this helps!
Related
Is there a way to iterate through a column in pandas if it is an index
I have a pandas DataFrame which looks like this:

Region    Sub Region          Country      Size     Plants  Birds  Mammals
Africa    Northern Africa     Algeria      2380000  22      41     15
                              Egypt        1000000  8       58     14
                              Libya        1760000  7       32     8
          Sub-Saharan Africa  Angola       1250000  34      53     32
                              Benin        115000   20      40     12
          Western Africa      Cape Verde   4030     51      35     7
Americas  Latin America       Antigua      440      4       31     3
                              Argentina    2780000  70      42     52
                              Bolivia      1100000  106     8      55
          Northern America    Canada       9980000  18      44     24
                              Grenada      340      3       29     2
                              USA          9830000  510     251    91
Asia      Central Asia        Kazakhstan   2720000  14      14     27
                              Kyrgyz       200000   13      3      15
                              Uzbekistan   447000   16      7      19
          Eastern Asia        China        9560000  593     136    96
                              Japan        378000   50      77     49
                              South Korea  100000   31      28     33

So I am trying to prompt the user to input a value and, if the input exists within the Sub Region column, perform a particular task. I tried turning the 'Sub Region' column into a list and iterating through it to match the user input:

sub_region_list = []
for i in world_data.index.values:
    sub_region_list.append(i[1])
print(sub_region_list[0])

That is not the output I had in mind. I believe there is an easier way to do this but cannot seem to figure it out.
You can use get_level_values to filter:

sub_region = input("Enter a sub region:")
if sub_region not in df.index.get_level_values('Sub Region'):
    raise ValueError("You must enter a valid sub-region")

If you want to save the column values in a list, try:

df.index.get_level_values("Sub Region").unique().to_list()
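A self-contained sketch of the idea, using a tiny stand-in for the Region/Sub Region/Country index (the level names come from the question; the numbers are made up):

```python
import pandas as pd

# Tiny stand-in for the question's MultiIndexed frame
idx = pd.MultiIndex.from_tuples(
    [("Africa", "Northern Africa", "Algeria"),
     ("Africa", "Sub-Saharan Africa", "Angola"),
     ("Americas", "Latin America", "Argentina")],
    names=["Region", "Sub Region", "Country"],
)
df = pd.DataFrame({"Size": [2380000, 1250000, 2780000]}, index=idx)

sub_regions = df.index.get_level_values("Sub Region")
print("Latin America" in sub_regions)   # membership test for validating user input
print(sub_regions.unique().tolist())    # all distinct sub-regions
```

Membership (`in`) works directly on the Index returned by get_level_values, so no intermediate list is needed for the validation step.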
How can I fix this BeautifulSoup website scrape for NHL Reference?
I'm scraping National Hockey League (NHL) data for multiple seasons from this URL: https://www.hockey-reference.com/leagues/NHL_2018_skaters.html

I'm only getting a few instances here and have tried moving my dict statements throughout the for loops. I've also tried utilizing solutions I found on other posts with no luck. Any help is appreciated. Thank you!

import requests
from bs4 import BeautifulSoup
import pandas as pd

dict = {}
for i in range(2010, 2020):
    year = str(i)
    source = requests.get('https://www.hockey-reference.com/leagues/NHL_' + year + '_skaters.html').text
    soup = BeautifulSoup(source, features='lxml')
    # identifying table in html
    table = soup.find('table', id="stats")
    # grabbing <tr> tags in html
    rows = table.findAll("tr")
    # creating passable values for each "stat" in td tag
    data_stats = [
        "player", "age", "team_id", "pos", "games_played", "goals", "assists",
        "points", "plus_minus", "pen_min", "ps", "goals_ev", "goals_pp",
        "goals_sh", "goals_gw", "assists_ev", "assists_pp", "assists_sh",
        "shots", "shot_pct", "time_on_ice", "time_on_ice_avg", "blocks",
        "hits", "faceoff_wins", "faceoff_losses", "faceoff_percentage"
    ]
    for rownum in rows:
        # grabbing player name and using as key
        filter = {"data-stat": 'player'}
        cell = rows[3].findAll("td", filter)
        nameval = cell[0].string
        list = []
        for data in data_stats:
            # iterating through data_stat to grab values
            filter = {"data-stat": data}
            cell = rows[3].findAll("td", filter)
            value = cell[0].string
            list.append(value)
        dict[nameval] = list
        dict[nameval].append(year)

# conversion to numeric values and creating dataframe
columns = [
    "player", "age", "team_id", "pos", "games_played", "goals", "assists",
    "points", "plus_minus", "pen_min", "ps", "goals_ev", "goals_pp",
    "goals_sh", "goals_gw", "assists_ev", "assists_pp", "assists_sh",
    "shots", "shot_pct", "time_on_ice", "time_on_ice_avg", "blocks",
    "hits", "faceoff_wins", "faceoff_losses", "faceoff_percentage", "year"
]
df = pd.DataFrame.from_dict(dict, orient='index', columns=columns)
cols = df.columns.drop(['player', 'team_id', 'pos', 'year'])
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
print(df)

Output:

                              player  ...    FO%  year
Craig Adams              Craig Adams  ...   43.9  2010
Luke Adam                  Luke Adam  ...  100.0  2013
Justin Abdelkader  Justin Abdelkader  ...   29.4  2017
Will Acton                Will Acton  ...   50.0  2015
Noel Acciari            Noel Acciari  ...   44.1  2016
Pontus Aberg            Pontus Aberg  ...   10.5  2019

[6 rows x 28 columns]
I'd just use pandas' .read_html(); it does the hard work of parsing tables for you (it uses BeautifulSoup under the hood).

Code:

import pandas as pd

result = pd.DataFrame()
for i in range(2010, 2020):
    print(i)
    year = str(i)
    url = 'https://www.hockey-reference.com/leagues/NHL_' + year + '_skaters.html'
    #source = requests.get('https://www.hockey-reference.com/leagues/NHL_' + year + '_skaters.html').text
    df = pd.read_html(url, header=1)[0]
    df['year'] = year
    result = result.append(df, sort=False)

result = result[~result['Age'].str.contains("Age")]
result = result.reset_index(drop=True)

You can then save to file with result.to_csv('filename.csv', index=False)

Output:

print (result)
        Rk             Player Age   Tm Pos  GP  ...  BLK  HIT  FOW  FOL    FO%  year
0        1  Justin Abdelkader  22  DET  LW  50  ...   20  152  148  170   46.5  2010
1        2        Craig Adams  32  PIT  RW  82  ...   58  193  243  311   43.9  2010
2        3   Maxim Afinogenov  30  ATL  RW  82  ...   21   32    1    2   33.3  2010
3        4     Andrew Alberts  28  TOT   D  76  ...   88  216    0    1    0.0  2010
4        4     Andrew Alberts  28  CAR   D  62  ...   67  172    0    0    NaN  2010
5        4     Andrew Alberts  28  VAN   D  14  ...   21   44    0    1    0.0  2010
6        5  Daniel Alfredsson  37  OTT  RW  70  ...   36   41   14   25   35.9  2010
7        6        Bryan Allen  29  FLA   D  74  ...  137  120    0    0    NaN  2010
8        7        Cody Almond  20  MIN   C   7  ...    5    7   18   12   60.0  2010
9        8        Karl Alzner  21  WSH   D  21  ...   21   15    0    0    NaN  2010
10       9     Artem Anisimov  21  NYR   C  82  ...   41   45  310  380   44.9  2010
11      10       Nik Antropov  29  ATL   C  76  ...   35   82  481  627   43.4  2010
12      11    Colby Armstrong  27  ATL  RW  79  ...   29   74   10   10   50.0  2010
13      12    Derek Armstrong  36  STL   C   6  ...    0    4    7    8   46.7  2010
14      13       Jason Arnott  35  NSH   C  63  ...   17   24  526  551   48.8  2010
15      14        Dean Arsene  29  EDM   D  13  ...   13   18    0    0    NaN  2010
16      15   Evgeny Artyukhin  26  TOT  RW  54  ...   10  127    1    1   50.0  2010
17      15   Evgeny Artyukhin  26  ANA  RW  37  ...    8   90    0    1    0.0  2010
18      15   Evgeny Artyukhin  26  ATL  RW  17  ...    2   37    1    0  100.0  2010
19      16        Arron Asham  31  PHI  RW  72  ...   16   92    2   11   15.4  2010
20      17      Adrian Aucoin  36  PHX   D  82  ...   67  131    1    0  100.0  2010
21      18       Keith Aucoin  31  WSH   C   9  ...    0    2   31   25   55.4  2010
22      19         Sean Avery  29  NYR   C  69  ...   17  145    4   10   28.6  2010
23      20       David Backes  25  STL  RW  79  ...   60  266  504  561   47.3  2010
24      21    Mikael Backlund  20  CGY   C  23  ...    4   12  100   86   53.8  2010
25      22  Nicklas Backstrom  22  WSH   C  82  ...   61   90  657  660   49.9  2010
26      23        Josh Bailey  20  NYI   C  73  ...   36   67  171  255   40.1  2010
27      24      Keith Ballard  27  FLA   D  82  ...  201  156    0    0    NaN  2010
28      25         Krys Barch  29  DAL  RW  63  ...   13  120    0    3    0.0  2010
29      26         Cam Barker  23  TOT   D  70  ...   53   75    0    0    NaN  2010
...    ...                ...  ..  ...  ..  ..  ...  ...  ...  ...  ...    ...   ...
10251  885      Chris Wideman  29  TOT   D  25  ...   26   35    0    0    NaN  2019
10252  885      Chris Wideman  29  OTT   D  19  ...   25   26    0    0    NaN  2019
10253  885      Chris Wideman  29  EDM   D   5  ...    1    7    0    0    NaN  2019
10254  885      Chris Wideman  29  FLA   D   1  ...    0    2    0    0    NaN  2019
10255  886    Justin Williams  37  CAR  RW  82  ...   32   55   92  150   38.0  2019
10256  887       Colin Wilson  29  COL   C  65  ...   31   55   20   32   38.5  2019
10257  888     Garrett Wilson  27  PIT  LW  50  ...   16  114    3    4   42.9  2019
10258  889       Scott Wilson  26  BUF   C  15  ...    2   29    1    2   33.3  2019
10259  890         Tom Wilson  24  WSH  RW  63  ...   52  200   29   24   54.7  2019
10260  891     Luke Witkowski  28  DET   D  34  ...   27   67    0    0    NaN  2019
10261  892  Christian Wolanin  23  OTT   D  30  ...   31   11    0    0    NaN  2019
10262  893         Miles Wood  23  NJD  LW  63  ...   27   97    0    2    0.0  2019
10263  894      Egor Yakovlev  27  NJD   D  25  ...   22   12    0    0    NaN  2019
10264  895    Kailer Yamamoto  20  EDM  RW  17  ...   11   18    0    0    NaN  2019
10265  896       Keith Yandle  32  FLA   D  82  ...   76   47    0    0    NaN  2019
10266  897        Pavel Zacha  21  NJD   C  61  ...   24   68  348  364   48.9  2019
10267  898       Filip Zadina  19  DET  RW   9  ...    3    6    3    3   50.0  2019
10268  899     Nikita Zadorov  23  COL   D  70  ...   67  228    0    0    NaN  2019
10269  900     Nikita Zaitsev  27  TOR   D  81  ...  151  139    0    0    NaN  2019
10270  901       Travis Zajac  33  NJD   C  80  ...   38   66  841  605   58.2  2019
10271  902       Jakub Zboril  21  BOS   D   2  ...    0    3    0    0    NaN  2019
10272  903     Mika Zibanejad  25  NYR   C  82  ...   66  134  830  842   49.6  2019
10273  904    Mats Zuccarello  31  TOT  LW  48  ...   43   57   10   20   33.3  2019
10274  904    Mats Zuccarello  31  NYR  LW  46  ...   42   57   10   20   33.3  2019
10275  904    Mats Zuccarello  31  DAL  LW   2  ...    1    0    0    0    NaN  2019
10276  905       Jason Zucker  27  MIN  LW  81  ...   38   87    2   11   15.4  2019
10277  906     Valentin Zykov  23  TOT  LW  28  ...    6   26    2    7   22.2  2019
10278  906     Valentin Zykov  23  CAR  LW  13  ...    2    6    2    6   25.0  2019
10279  906     Valentin Zykov  23  VEG  LW  10  ...    3   18    0    1    0.0  2019
10280  906     Valentin Zykov  23  EDM  LW   5  ...    1    2    0    0    NaN  2019

[10281 rows x 29 columns]
Scraping heavily formatted tables is positively painful with Beautiful Soup (not to bash on Beautiful Soup; it's wonderful for several use cases). There's a bit of a 'hack' I use for scraping data surrounded by dense markup, if you're willing to be a bit utilitarian about it:

1. Select the entire table on the web page
2. Copy + paste into Evernote (simplifies and reformats the HTML)
3. Copy + paste from Evernote to Excel or another spreadsheet software (removes the HTML)
4. Save as .csv

It isn't perfect. There will be blank lines in the CSV, but blank lines are easier and far less time-consuming to remove than such data is to scrape. Good luck!
Merging the same labels for counting
I have a huge dataframe such as:

  country1  import1  export1 country2  import2  export2
0      USA       12       82  Germany       12       82
1  Germany       65       31   France       65       31
2  England       74       47    Japan       74       47
3    Japan       23       55  England       23       55
4   France       48       12      Usa       48       12

export1 and import1 belong to country1; export2 and import2 belong to country2.

I want to total the export and import values by country. Output may look like:

country | total_export | total_import
_____________________________________
USA     | 12211221     | 212121
France  | 4545         | 5454
...
Use wide_to_long first:

df = (pd.wide_to_long(data.reset_index(), ['country','import','export'], i='index', j='tmp')
        .reset_index(drop=True))
print (df)
   country  import  export
0      USA      12      82
1  Germany      65      31
2  England      74      47
3    Japan      23      55
4   France      48      12
5  Germany      12      82
6   France      65      31
7    Japan      74      47
8  England      23      55
9      Usa      48      12

And then aggregate sum:

df = df.groupby('country', as_index=False).sum()
print (df)
   country  import  export
0  England      97     102
1   France     113      43
2  Germany      77     113
3    Japan      97     102
4      USA      12      82
5      Usa      48      12
You can slice the table into two parts and concatenate them:

func = lambda x: x[:-1]  # or lambda x: x.rstrip('0123456789')

data.iloc[:, :3].rename(func, axis=1).\
    append(data.iloc[:, 3:].rename(func, axis=1)).\
    groupby('country').sum()

Output:

         import  export
country
England      97     102
France      113      43
Germany      77     113
Japan        97     102
USA          12      82
Usa          48      12
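Note that both answers keep 'USA' and 'Usa' as separate countries, because the source data is inconsistently cased. If those should count as one country, normalizing the grouping key first handles it; a minimal sketch on made-up totals (not the question's data):

```python
import pandas as pd

df = pd.DataFrame({"country": ["USA", "Usa", "France"],
                   "import": [12, 48, 113],
                   "export": [82, 12, 43]})

# Upper-case the grouping key so differently-cased spellings collapse together
totals = df.assign(country=df["country"].str.upper()).groupby("country").sum()
print(totals)
```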
How to pull the index from a Pandas dataframe?
I have a dataframe and I want to pull the first index value, as a string, after each time I sort the dataframe based on values. What I want my function to do is pull the country name at the top of the list; in this example, it would pull 'United States' as a string. Because the country names are the indexes and not Series values, I can't just do summer_gold.iloc[0].

             # Summer  Gold  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total   ID
Afghanistan        13     0       0       2      2         0       0         0         0        0       13       0         0         2               2  AFG
Algeria            12     5       2       8     15         3       0         0         0        0       15       5         2         8              15  ALG
Argentina          23    18      24      28     70        18       0         0         0        0       41      18        24        28              70  ARG
Armenia             5     1       2       9     12         6       0         0         0        0       11       1         2         9              12  ARM
Australasia         2     3       4       5     12         0       0         0         0        0        2       3         4         5              12  ANZ

So if I were to sort based on number of Gold medals I'd get a dataframe that looks like:

               # Summer  Gold  Silver  Bronze  Total  # Winter  Gold.1  \
United States        26   976     757     666   2399        22      96
Soviet Union          9   395     319     296   1010         9      78
Great Britain        27   236     272     272    780        22      10
France               27   202     223     246    671        22      31
China                 9   201     146     126    473        10      12

               Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  \
United States       102        84      282       48    1072       859
Soviet Union         57        59      194       18     473       376
Great Britain         4        12       26       49     246       276
France               31        47      109       49     233       254
China                22        19       53       19     213       168

               Bronze.2  Combined total   ID
United States       750            2681  USA
Soviet Union        355            1204  URS
Great Britain       284             806  GBR
France              293             780  FRA
China               145             526  CHN

So far my overall code looks like:

def answer_one():
    summer_gold = df.sort_values('Gold', ascending=False)
    summer_gold = summer_gold.iloc[0]
    return summer_gold

answer_one()

Output:

# Summer            26
Gold               976
Silver             757
Bronze             666
Total             2399
# Winter            22
Gold.1              96
Silver.1           102
Bronze.1            84
Total.1            282
# Games             48
Gold.2            1072
Silver.2           859
Bronze.2           750
Combined total    2681
ID                 USA
Name: United States, dtype: object

I want an output of 'United States' in this case, or the name of whatever country is at the top of my sorted dataframe.
After you have sorted your dataframe, you can access the first row's index label like:

df.index[0]
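Sketching that out, together with `idxmax`, which returns the label of the maximum directly without sorting (toy numbers taken from the question's sorted table):

```python
import pandas as pd

df = pd.DataFrame({"Gold": [976, 395, 236]},
                  index=["United States", "Soviet Union", "Great Britain"])

# Label of the first row after sorting descending...
top = df.sort_values("Gold", ascending=False).index[0]
# ...or equivalently, without sorting at all:
also_top = df["Gold"].idxmax()
print(top, also_top)  # → United States United States
```

`idxmax` is the shorter route when only the single top label is needed; the sort is only worth it if you also need the rest of the ranking.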
Combine certain rows values of duplicate rows Pandas
I have a dataframe based on football players. I am finding duplicate rows for when a player has transferred mid-season. My aim is to add together the points they accumulated in both leagues to make just one row. Here is a sample of the data:

           name                      full_name               club  Points  Start  Sub
84   S. Mustafi               Shkodran Mustafi            Arsenal      76     26    1
85   S. Mustafi               Shkodran Mustafi            Arsenal      -2      0    1
89        Bruno            Bruno Soriano Llido      Villarreal CF      43     15   16
90        Bruno         Bruno Gonzalez Cabrera          Getafe CF      43     15   16
119       Oscar       Oscar dos Santos Emboaba                NaN      16      5    8
120       Oscar       Oscar dos Santos Emboaba                NaN       1      0    2
121       Oscar         Oscar Rodriguez Arnaiz     Real Madrid CF      16      5    8
122       Oscar         Oscar Rodriguez Arnaiz     Real Madrid CF       1      0    2
188    C. Bravo                  Claudio Bravo    Manchester City      61     22    8
189    C. Bravo                  Claudio Bravo    Manchester City       1      1    0
193       Naldo    Ronaldo Aparecido Rodrigues      FC Schalke 04      58     19    1
194       Naldo         Edinaldo Gomes Pereira       RCD Espanyol      58     19    1
200   G. Castro                 Gonzalo Castro  Borussia Dortmund      79     23    6
201   G. Castro                 Gonzalo Castro          Malaga CF      79     23    6
209    Juanfran    Juan Francisco Torres Belen    Atletico Madrid      86     21    8
210    Juanfran    Juan Francisco Torres Belen    Atletico Madrid      74     34    2
211    Juanfran  Juan Francisco Moreno Fuertes          RC Coruna      86     21    8
212    Juanfran  Juan Francisco Moreno Fuertes          RC Coruna      74     34    2

My goal dataframe would have the Points, Start and Sub values of players like Mustafi added together to give just one row per player. Players like the two Brunos are clearly not the same person, so I don't want to add them together.

           name                      full_name               club  Points  Start  Sub
84   S. Mustafi               Shkodran Mustafi            Arsenal      74     26    2
89        Bruno            Bruno Soriano Llido      Villarreal CF      43     15   16
90        Bruno         Bruno Gonzalez Cabrera          Getafe CF      43     15   16
119       Oscar       Oscar dos Santos Emboaba                NaN      17      5   10
121       Oscar         Oscar Rodriguez Arnaiz     Real Madrid CF      17      5   10
188    C. Bravo                  Claudio Bravo    Manchester City      62     23    8
193       Naldo    Ronaldo Aparecido Rodrigues      FC Schalke 04      58     19    1
194       Naldo         Edinaldo Gomes Pereira       RCD Espanyol      58     19    1
200   G. Castro                 Gonzalo Castro  Borussia Dortmund     158     46   12
209    Juanfran    Juan Francisco Torres Belen    Atletico Madrid      86     21    8
212    Juanfran  Juan Francisco Moreno Fuertes          RC Coruna      74     34    2

Any help would be great!
You need:

df[['name','full_name','club']] = df[['name','full_name','club']].fillna('')
d = {'Points':'sum', 'Start':'sum', 'Sub':'sum', 'club':'first'}
df = (df.groupby(['name','full_name'], sort=False, as_index=False)
        .agg(d)
        .reindex(columns=df.columns))

with pd.option_context('display.expand_frame_repr', False):
    print (df)

         name                      full_name               club  Points  Start  Sub
0  S. Mustafi               Shkodran Mustafi            Arsenal      74     26    2
1       Bruno            Bruno Soriano Llido      Villarreal CF      43     15   16
2       Bruno         Bruno Gonzalez Cabrera          Getafe CF      43     15   16
3       Oscar       Oscar dos Santos Emboaba                         17      5   10
4       Oscar         Oscar Rodriguez Arnaiz     Real Madrid CF      17      5   10
5    C. Bravo                  Claudio Bravo    Manchester City      62     23    8
6       Naldo    Ronaldo Aparecido Rodrigues      FC Schalke 04      58     19    1
7       Naldo         Edinaldo Gomes Pereira       RCD Espanyol      58     19    1
8   G. Castro                 Gonzalo Castro  Borussia Dortmund     158     46   12
9    Juanfran    Juan Francisco Torres Belen    Atletico Madrid     160     55   10
10   Juanfran  Juan Francisco Moreno Fuertes          RC Coruna     160     55   10

Explanation:
First replace NaNs with '' by fillna, to avoid omitting those rows in groupby.
Aggregate with groupby and agg, using a dictionary that maps each column to its aggregating function.
Last, to temporarily display all columns together, use with pd.option_context.
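A compact, self-contained sketch of the same groupby/agg-dictionary pattern on toy rows (names taken from the question, stats shortened):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["S. Mustafi", "S. Mustafi", "C. Bravo", "C. Bravo"],
    "full_name": ["Shkodran Mustafi", "Shkodran Mustafi", "Claudio Bravo", "Claudio Bravo"],
    "club": ["Arsenal", "Arsenal", "Manchester City", "Manchester City"],
    "Points": [76, -2, 61, 1],
    "Start": [26, 0, 22, 1],
    "Sub": [1, 1, 8, 0],
})

# One output row per (name, full_name): sum the stats, keep the first club
d = {"Points": "sum", "Start": "sum", "Sub": "sum", "club": "first"}
out = (df.groupby(["name", "full_name"], sort=False, as_index=False)
         .agg(d)
         .reindex(columns=df.columns))
print(out)
```

The `reindex(columns=...)` at the end only restores the original column order, which `agg` with a dictionary would otherwise change.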