Combine certain row values of duplicate rows in Pandas - python
I have a dataframe of football players. I am finding duplicate rows where a player has transferred mid-season. My aim is to add together the points they accumulated in both leagues so that each player has just one row.
Here is a sample of the data:
name full_name club Points Start Sub
84 S. Mustafi Shkodran Mustafi Arsenal 76 26 1
85 S. Mustafi Shkodran Mustafi Arsenal -2 0 1
89 Bruno Bruno Soriano Llido Villarreal CF 43 15 16
90 Bruno Bruno Gonzalez Cabrera Getafe CF 43 15 16
119 Oscar Oscar dos Santos Emboaba NaN 16 5 8
120 Oscar Oscar dos Santos Emboaba NaN 1 0 2
121 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 16 5 8
122 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 1 0 2
188 C. Bravo Claudio Bravo Manchester City 61 22 8
189 C. Bravo Claudio Bravo Manchester City 1 1 0
193 Naldo Ronaldo Aparecido Rodrigues FC Schalke 04 58 19 1
194 Naldo Edinaldo Gomes Pereira RCD Espanyol 58 19 1
200 G. Castro Gonzalo Castro Borussia Dortmund 79 23 6
201 G. Castro Gonzalo Castro Malaga CF 79 23 6
209 Juanfran Juan Francisco Torres Belen Atletico Madrid 86 21 8
210 Juanfran Juan Francisco Torres Belen Atletico Madrid 74 34 2
211 Juanfran Juan Francisco Moreno Fuertes RC Coruna 86 21 8
212 Juanfran Juan Francisco Moreno Fuertes RC Coruna 74 34 2
In my goal dataframe, a player like Mustafi would have his Points, Start and Sub values added together to give just one row.
Players like Bruno are clearly not the same person, so I don't want to add the two Brunos together.
name full_name club Points Start Sub
84 S. Mustafi Shkodran Mustafi Arsenal 74 26 2
89 Bruno Bruno Soriano Llido Villarreal CF 43 15 16
90 Bruno Bruno Gonzalez Cabrera Getafe CF 43 15 16
119 Oscar Oscar dos Santos Emboaba NaN 17 5 10
121 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 17 5 10
188 C. Bravo Claudio Bravo Manchester City 62 23 8
193 Naldo Ronaldo Aparecido Rodrigues FC Schalke 04 58 19 1
194 Naldo Edinaldo Gomes Pereira RCD Espanyol 58 19 1
200 G. Castro Gonzalo Castro Borussia Dortmund 158 46 12
209 Juanfran Juan Francisco Torres Belen Atletico Madrid 86 21 8
212 Juanfran Juan Francisco Moreno Fuertes RC Coruna 74 34 2
Any help would be great!
You need:
import pandas as pd

df[['name','full_name','club']] = df[['name','full_name','club']].fillna('')
d = {'Points':'sum', 'Start':'sum', 'Sub':'sum', 'club':'first'}
df = (df.groupby(['name','full_name'], sort=False, as_index=False)
        .agg(d)
        .reindex(columns=df.columns))
with pd.option_context('display.expand_frame_repr', False):
    print(df)
name full_name club Points Start Sub
0 S. Mustafi Shkodran Mustafi Arsenal 74 26 2
1 Bruno Bruno Soriano Llido Villarreal CF 43 15 16
2 Bruno Bruno Gonzalez Cabrera Getafe CF 43 15 16
3 Oscar Oscar dos Santos Emboaba 17 5 10
4 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 17 5 10
5 C. Bravo Claudio Bravo Manchester City 62 23 8
6 Naldo Ronaldo Aparecido Rodrigues FC Schalke 04 58 19 1
7 Naldo Edinaldo Gomes Pereira RCD Espanyol 58 19 1
8 G. Castro Gonzalo Castro Borussia Dortmund 158 46 12
9 Juanfran Juan Francisco Torres Belen Atletico Madrid 160 55 10
10 Juanfran Juan Francisco Moreno Fuertes RC Coruna 160 55 10
Explanation:
First, replace NaNs with '' using fillna, so that groupby does not drop rows that have NaN in a key column.
Then aggregate with groupby and agg, passing a dictionary that specifies the columns and their aggregating functions.
Finally, to display all columns together without wrapping, temporarily set the display option using with.
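To see the grouping step in isolation, here is a minimal sketch on a hypothetical six-row slice of the sample data. The names players, text_cols, how and combined are made up for illustration; the values are copied from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data (values copied from the sample).
players = pd.DataFrame({
    "name": ["S. Mustafi", "S. Mustafi", "Bruno", "Bruno", "Oscar", "Oscar"],
    "full_name": ["Shkodran Mustafi", "Shkodran Mustafi",
                  "Bruno Soriano Llido", "Bruno Gonzalez Cabrera",
                  "Oscar dos Santos Emboaba", "Oscar dos Santos Emboaba"],
    "club": ["Arsenal", "Arsenal", "Villarreal CF", "Getafe CF", np.nan, np.nan],
    "Points": [76, -2, 43, 43, 16, 1],
    "Start": [26, 0, 15, 15, 5, 0],
    "Sub": [1, 1, 16, 16, 8, 2],
})

# Fill missing text values so no row is lost if a key column is NaN.
text_cols = ["name", "full_name", "club"]
players[text_cols] = players[text_cols].fillna("")

# Sum the numeric columns per (name, full_name); keep the first club seen.
how = {"Points": "sum", "Start": "sum", "Sub": "sum", "club": "first"}
combined = (players.groupby(["name", "full_name"], sort=False, as_index=False)
                   .agg(how)
                   .reindex(columns=players.columns))

print(combined)
```

Grouping on both name and full_name is what keeps the two Brunos apart while still merging the two Mustafi rows.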
Related
Is there a way to iterate through a column in pandas if it is an index
I have a pandas DataFrame which looks like this:
Region    Sub Region          Country      Size     Plants  Birds  Mammals
Africa    Northern Africa     Algeria      2380000  22      41     15
                              Egypt        1000000  8       58     14
                              Libya        1760000  7       32     8
          Sub-Saharan Africa  Angola       1250000  34      53     32
                              Benin        115000   20      40     12
          Western Africa      Cape Verde   4030     51      35     7
Americas  Latin America       Antigua      440      4       31     3
                              Argentina    2780000  70      42     52
                              Bolivia      1100000  106     8      55
          Northern America    Canada       9980000  18      44     24
                              Grenada      340      3       29     2
                              USA          9830000  510     251    91
Asia      Central Asia        Kazakhstan   2720000  14      14     27
                              Kyrgyz       200000   13      3      15
                              Uzbekistan   447000   16      7      19
          Eastern Asia        China        9560000  593     136    96
                              Japan        378000   50      77     49
                              South Korea  100000   31      28     33
So I am trying to prompt the user to input a value and, if the input exists within the Sub Region column, perform a particular task. I tried turning the 'Sub region' column into a list and iterating through it to see if it matches the user input:
sub_region_list=[]
for i in world_data.index.values:
    sub_region_list.append(i[1])
print(sub_region_list[0])
That is not the output I had in mind. I believe there is an easier way to do this but cannot seem to figure it out.
You can use get_level_values to filter.
sub_region = input("Enter a sub region:")
if sub_region not in df.index.get_level_values('Sub Region'):
    raise ValueError("You must enter a valid sub-region")
If you want to save the column values in a list, try:
df.index.get_level_values("Sub Region").unique().to_list()
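As a rough, self-contained sketch of the same idea, the snippet below builds a hypothetical four-row slice of the table with a three-level index and looks up one sub-region. The helper rows_for and the variable names are assumptions for illustration, and the user input is replaced by a fixed string so the example runs unattended.

```python
import pandas as pd

# Hypothetical slice of the question's table, indexed by three levels.
index = pd.MultiIndex.from_tuples(
    [("Africa", "Northern Africa", "Algeria"),
     ("Africa", "Northern Africa", "Egypt"),
     ("Africa", "Sub-Saharan Africa", "Angola"),
     ("Americas", "Latin America", "Argentina")],
    names=["Region", "Sub Region", "Country"],
)
world_data = pd.DataFrame({"Size": [2380000, 1000000, 1250000, 2780000]}, index=index)

# All sub-region labels, one per row, then de-duplicated in order of appearance.
sub_regions = world_data.index.get_level_values("Sub Region")
unique_sub_regions = sub_regions.unique().to_list()


def rows_for(sub_region):
    """Return the rows of one sub-region, or raise if the label is unknown."""
    if sub_region not in sub_regions:
        raise ValueError("You must enter a valid sub-region")
    # Boolean mask over the index level selects the matching rows.
    return world_data[sub_regions == sub_region]


print(unique_sub_regions)
print(rows_for("Northern Africa"))
```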
How to sort dataframe with values
I want to sort my dataframe in descending order by "Total Confirmed cases".
My Code
high_cases_sorted_df = df.sort_values(by='Total Confirmed cases',ascending=False)
print(high_cases_sorted_df)
Output
state Total Confirmed cases
19 Maharashtra 8590
14 Jharkhand 82
24 Puducherry 8
9 Goa 7
32 West Bengal 697
13 Jammu and Kashmir 546
15 Karnataka 512
30 Uttarakhand 51
16 Kerala 481
6 Chandigarh 40
12 Himachal Pradesh 40
7 Chhattisgarh 37
4 Assam 36
10 Gujarat 3548
5 Bihar 345
1 Andaman and Nicobar Islands 33
25 Punjab 313
8 Delhi 3108
11 Haryana 296
26 Rajasthan 2262
18 Madhya Pradesh 2168
17 Ladakh 20
20 Manipur 2
29 Tripura 2
31 Uttar Pradesh 1955
I don't know why it shows up like this; it should be (1. Maharashtra, 2. Gujarat, 3. Delhi, etc.). Complete script here.
The column holds strings, so the values are sorted as text rather than as numbers. Simply convert that column to integers before sorting:
df['Total_Confirmed_cases'] = df['Total_Confirmed_cases'].astype(int)
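The effect is easy to reproduce on a few rows. The sketch below uses four states from the question and deliberately stores the counts as strings; the names as_text and as_numbers are made up for illustration.

```python
import pandas as pd

# A few rows from the question, with the counts stored as strings on purpose.
df = pd.DataFrame({
    "state": ["Maharashtra", "Jharkhand", "Gujarat", "Delhi"],
    "Total Confirmed cases": ["8590", "82", "3548", "3108"],
})

# String comparison goes character by character, so "82" outranks "3548".
as_text = df.sort_values(by="Total Confirmed cases", ascending=False)

# After converting to integers the order is numeric.
df["Total Confirmed cases"] = df["Total Confirmed cases"].astype(int)
as_numbers = df.sort_values(by="Total Confirmed cases", ascending=False)

print(as_text["state"].tolist())
print(as_numbers["state"].tolist())
```

If some cells might hold non-numeric text, pd.to_numeric(..., errors="coerce") is one alternative to astype(int); it turns unparseable values into NaN instead of raising.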
Pybaseball: Extract standings data and save to disk using pandas
What I am trying to do is take this output from pybaseball, which comes back as a list:
[ Tm W L W-L% GB
1 Boston Red Sox 94 44 .681 --
2 New York Yankees 86 51 .628]
and put it into a csv file using pandas. So far these are the queries I have tried. I have the information for this output set as data. Whenever I try to import it from pd.DataFrame() it tells me that: AttributeError: 'list' object has no attribute 'to_csv'. So I add a dataframe to that using df = pd.DataFrame(data) and that prints out just the headers:
0 Teams W L W-L% GB 0 Tm Tm 1 W W 2 L L 3 W-L% W-L% 4 GB GB
How would I get this to import all of the information in the list to csv?
from pybaseball import standings
import pandas as pd
data = standings()
data.to_csv('file.csv', header = True, sep = ',')
Looks like standings() returns a list of dataframes:
from pybaseball import standings
import pandas as pd
data = standings()
print(type(data))
print(type(data[0]))
Output:
<class 'list'>
<class 'pandas.core.frame.DataFrame'>
To write it to file, you need to concatenate the list of dataframes into a single dataframe before writing:
all_data = pd.concat(data)
print(all_data)
all_data.to_csv("baseball_data.csv", sep=",", index=False)
Output:
Tm W L W-L% GB
1 Boston Red Sox 95 44 .683 --
2 New York Yankees 86 52 .623 8.5
3 Tampa Bay Rays 74 63 .540 20.0
4 Toronto Blue Jays 62 75 .453 32.0
5 Baltimore Orioles 40 98 .290 54.5
1 Cleveland Indians 77 60 .562 --
2 Minnesota Twins 63 74 .460 14.0
3 Chicago White Sox 56 82 .406 21.5
4 Detroit Tigers 55 83 .399 22.5
5 Kansas City Royals 46 91 .336 31.0
1 Houston Astros 85 53 .616 --
2 Oakland Athletics 83 56 .597 2.5
3 Seattle Mariners 77 61 .558 8.0
4 Los Angeles Angels 67 71 .486 18.0
5 Texas Rangers 60 78 .435 25.0
1 Atlanta Braves 76 61 .555 --
2 Philadelphia Phillies 72 65 .526 4.0
3 Washington Nationals 69 69 .500 7.5
4 New York Mets 62 75 .453 14.0
5 Miami Marlins 55 83 .399 21.5
1 Chicago Cubs 81 56 .591 --
2 Milwaukee Brewers 78 61 .561 4.0
3 St. Louis Cardinals 76 62 .551 5.5
4 Pittsburgh Pirates 67 71 .486 14.5
5 Cincinnati Reds 59 79 .428 22.5
1 Colorado Rockies 75 62 .547 --
2 Los Angeles Dodgers 75 63 .543 0.5
3 Arizona Diamondbacks 74 64 .536 1.5
4 San Francisco Giants 68 71 .489 8.0
5 San Diego Padres 55 85 .393 21.5
And you'll have a file baseball_data.csv which is a comma-separated representation of the dataframe above.
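The concatenation step can be tried without pybaseball installed. In the sketch below, tables is a hypothetical stand-in for the list that standings() returns, built from a few of the rows shown above; ignore_index=True is an optional extra that replaces the repeated 1..5 row labels with a single running index.

```python
import pandas as pd

# Hypothetical stand-in for what standings() returns: a list of DataFrames.
tables = [
    pd.DataFrame({"Tm": ["Boston Red Sox", "New York Yankees"], "W": [95, 86], "L": [44, 52]}),
    pd.DataFrame({"Tm": ["Cleveland Indians", "Minnesota Twins"], "W": [77, 63], "L": [60, 74]}),
]

# Stack the tables; ignore_index gives the result a fresh 0..n-1 index.
all_data = pd.concat(tables, ignore_index=True)

# to_csv returns the CSV text when no path is given.
csv_text = all_data.to_csv(index=False)
print(csv_text)
```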
Removing rows from one DataFrame based on rows from another DataFrame
I have two different dataframes with two different lengths of rows. I want df1 to match df2 but I don't want to create a new dataframe in the process (no merge).
df1
0 Alameda
1 Alpine
2 Amador
3 Butte
4 Calaveras
5 Colusa
6 Contra Costa
7 Del Norte
8 El Dorado
9 Fresno
10 Glenn
11 Humboldt
12 Imperial
13 Inyo
14 Kern
15 Kings
16 Lake
17 Lassen
18 Los Angeles
19 Madera
20 Marin
21 Mariposa
22 Mendocino
23 Merced
24 Modoc
25 Mono
26 Monterey
27 Napa
28 Nevada
29 Orange
30 Placer
31 Plumas
32 Riverside
33 Sacramento
34 San Benito
35 San Bernardino
36 San Diego
37 San Francisco
38 San Joaquin
39 San Luis Obispo
40 San Mateo
41 Santa Barbara
42 Santa Clara
43 Santa Cruz
44 Shasta
45 Sierra
46 Siskiyou
47 Solano
48 Sonoma
49 Stanislaus
50 Sutter
51 Tehama
52 Trinity
53 Tulare
54 Tuolumne
55 Ventura
56 Yolo
57 Yuba
df2
0 Alameda
1 Amador
2 Butte
3 Calaveras
4 Colusa
5 Contra Costa
6 Del Norte
7 El Dorado
8 Fresno
9 Glenn
10 Humboldt
11 Imperial
12 Inyo
13 Kern
14 Kings
15 Lake
16 Lassen
17 Los Angeles
18 Madera
19 Marin
20 Mariposa
21 Mendocino
22 Merced
23 Mono
24 Monterey
25 Napa
26 Nevada
27 Orange
28 Placer
29 Plumas
30 Riverside
31 Sacramento
32 San Benito
33 San Bernardino
34 San Diego
35 San Francisco
36 San Joaquin
37 San Luis Obispo
38 San Mateo
39 Santa Barbara
40 Santa Clara
41 Santa Cruz
42 Shasta
43 Siskiyou
44 Solano
45 Sonoma
46 Stanislaus
47 Sutter
48 Tehama
49 Tulare
50 Ventura
51 Yolo
52 Yuba
Is there a way to modify a column's rows in a dataframe using a column's rows from a different dataframe? Again I want to keep the dataframes separate, but the goal is to get the dataframes to have the same number of rows containing the same values.
Since you just want common rows, you can compute them quickly using np.intersect1d:
import numpy as np
import pandas as pd

i = df1.values.squeeze()
j = df2.values.squeeze()
df1 = pd.DataFrame(np.intersect1d(i, j))
And have df2 just become a copy of df1:
df2 = df1.copy(deep=True)
Using duplicated:
s=pd.concat([df1,df2],keys=[1,2])
df1,df2=s[s.duplicated(keep=False)].loc[1],s[s.duplicated(keep=False)].loc[1]
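For a quick check of the np.intersect1d approach, here is a small sketch on hypothetical four-row and three-row frames taken from the top of the two lists in the question. It uses to_numpy().ravel() to flatten the single column, which is an assumption of this sketch rather than part of the answers above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(["Alameda", "Alpine", "Amador", "Butte"])
df2 = pd.DataFrame(["Alameda", "Amador", "Butte"])

# Flatten each single-column frame to a 1-D array of values.
i = df1.to_numpy().ravel()
j = df2.to_numpy().ravel()

# intersect1d returns the sorted, unique values present in both arrays.
common = np.intersect1d(i, j)

df1 = pd.DataFrame(common)
df2 = df1.copy(deep=True)

print(df1)
```

Because np.intersect1d sorts and de-duplicates, the original row order and any repeated values are not preserved. If that matters, filtering with df1[df1[0].isin(df2[0])] is one alternative that keeps df1's order and index.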
Can't split web scraped table on rows
I pulled a table of Tour de France winners from Wikipedia using BeautifulSoup, but it's returning the table in what appears to be a dataset whose rows aren't separable. First, here is what I did to grab the page and table:
import requests
response = requests.get("Https://en.wikipedia.org/wiki/List_of_Tour_de_France_general_classification_winners")
content = response.content
from bs4 import BeatifulSoup
parser = BeautifulSoup(content, 'html.parser')
# I know its the second table on the page, so grab it as such
winners_table = parser.find_all('table')[1]
import pandas as pd
data = pd.read_html(str(winners_table), flavor = 'html5lib')
Note that I used html5lib here because pycharm was telling me that there is no lxml, despite it certainly being there. When I print out the table, it appears as a table with 116 rows and 9 columns, but it isn't appearing to split on rows. It looks like this:
[ 0 1 \ 0 Year Country 1 1903 France 2 1904 France 3 1905 France 4 1906 France 5 1907 France 6 1908 France 7 1909 Luxembourg 8 1910 France 9 1911 France 10 1912 Belgium 11 1913 Belgium 12 1914 Belgium 13 1915 World War I 14 1916 NaN 15 1917 NaN 16 1918 NaN 17 1919 Belgium 18 1920 Belgium 19 1921 Belgium 20 1922 Belgium 21 1923 France 22 1924 Italy 23 1925 Italy 24 1926 Belgium 25 1927 Luxembourg 26 1928 Luxembourg 27 1929 Belgium 28 1930 France 29 1931 France .. ... ...
86 1988 Spain 87 1989 United States 88 1990 United States 89 1991 Spain 90 1992 Spain 91 1993 Spain 92 1994 Spain 93 1995 Spain 94 1996 Denmark 95 1997 Germany 96 1998 Italy 97 1999[B] United States 98 2000[B] United States 99 2001[B] United States 100 2002[B] United States 101 2003[B] United States 102 2004[B] United States 103 2005[B] United States 104 2006 Spain 105 2007 Spain 106 2008 Spain 107 2009 Spain 108 2010 Luxembourg 109 2011 Australia 110 2012 Great Britain 111 2013 Great Britain 112 2014 Italy 113 2015 Great Britain 114 2016 Great Britain 115 2017 Great Britain 2 \ 0 Cyclist 1 Garin, MauriceMaurice Garin 2 Garin, MauriceMaurice Garin Cornet, HenriHenri... 3 Trousselier, LouisLouis Trousselier 4 Pottier, RenéRené Pottier 5 Petit-Breton, LucienLucien Petit-Breton 6 Petit-Breton, LucienLucien Petit-Breton 7 Faber, FrançoisFrançois Faber 8 Lapize, OctaveOctave Lapize 9 Garrigou, GustaveGustave Garrigou 10 Defraye, OdileOdile Defraye 11 Thys, PhilippePhilippe Thys 12 Thys, PhilippePhilippe Thys 13 NaN 14 NaN 15 NaN 16 NaN 17 Lambot, FirminFirmin Lambot 18 Thys, PhilippePhilippe Thys 19 Scieur, LéonLéon Scieur 20 Lambot, FirminFirmin Lambot 21 Pélissier, HenriHenri Pélissier 22 Bottecchia, OttavioOttavio Bottecchia 23 Bottecchia, OttavioOttavio Bottecchia 24 Buysse, LucienLucien Buysse 25 Frantz, NicolasNicolas Frantz 26 Frantz, NicolasNicolas Frantz 27 De Waele, MauriceMaurice De Waele 28 Leducq, AndréAndré Leducq 29 Magne, AntoninAntonin Magne .. ... 
86 Delgado, PedroPedro Delgado 87 LeMond, GregGreg LeMond 88 LeMond, GregGreg LeMond 89 Indurain, MiguelMiguel Indurain 90 Indurain, MiguelMiguel Indurain 91 Indurain, MiguelMiguel Indurain 92 Indurain, MiguelMiguel Indurain 93 Indurain, MiguelMiguel Indurain 94 Riis, BjarneBjarne Riis[A] 95 Ullrich, JanJan Ullrich# 96 Pantani, MarcoMarco Pantani 97 Armstrong, LanceLance Armstrong 98 Armstrong, LanceLance Armstrong 99 Armstrong, LanceLance Armstrong 100 Armstrong, LanceLance Armstrong 101 Armstrong, LanceLance Armstrong 102 Armstrong, LanceLance Armstrong 103 Armstrong, LanceLance Armstrong 104 Landis, FloydFloyd Landis Pereiro, ÓscarÓscar ... 105 Contador, AlbertoAlberto Contador# 106 Sastre, CarlosCarlos Sastre* 107 Contador, AlbertoAlberto Contador 108 Contador, AlbertoAlberto Contador Schleck, And... 109 Evans, CadelCadel Evans 110 Wiggins, BradleyBradley Wiggins 111 Froome, ChrisChris Froome 112 Nibali, VincenzoVincenzo Nibali 113 Froome, ChrisChris Froome* 114 Froome, ChrisChris Froome 115 Froome, ChrisChris Froome 3 4 \ 0 Sponsor/Team Distance 1 La Française 2,428 km (1,509 mi) 2 Conte 2,428 km (1,509 mi) 3 Peugeot–Wolber 2,994 km (1,860 mi) 4 Peugeot–Wolber 4,637 km (2,881 mi) 5 Peugeot–Wolber 4,488 km (2,789 mi) 6 Peugeot–Wolber 4,497 km (2,794 mi) 7 Alcyon–Dunlop 4,498 km (2,795 mi) 8 Alcyon–Dunlop 4,734 km (2,942 mi) 9 Alcyon–Dunlop 5,343 km (3,320 mi) 10 Alcyon–Dunlop 5,289 km (3,286 mi) 11 Peugeot–Wolber 5,287 km (3,285 mi) 12 Peugeot–Wolber 5,380 km (3,340 mi) 13 NaN NaN 14 NaN NaN 15 NaN NaN 16 NaN NaN 17 La Sportive 5,560 km (3,450 mi) 18 La Sportive 5,503 km (3,419 mi) 19 La Sportive 5,485 km (3,408 mi) 20 Peugeot–Wolber 5,375 km (3,340 mi) 21 Automoto–Hutchinson 5,386 km (3,347 mi) 22 Automoto 5,425 km (3,371 mi) 23 Automoto–Hutchinson 5,440 km (3,380 mi) 24 Automoto–Hutchinson 5,745 km (3,570 mi) 25 Alcyon–Dunlop 5,398 km (3,354 mi) 26 Alcyon–Dunlop 5,476 km (3,403 mi) 27 Alcyon–Dunlop 5,286 km (3,285 mi) 28 Alcyon–Dunlop 4,822 km (2,996 mi) 29 
France 5,091 km (3,163 mi) .. ... ... 86 Reynolds 3,286 km (2,042 mi) 87 AD Renting–W-Cup–Bottecchia 3,285 km (2,041 mi) 88 Z–Tomasso 3,504 km (2,177 mi) 89 Banesto 3,914 km (2,432 mi) 90 Banesto 3,983 km (2,475 mi) 91 Banesto 3,714 km (2,308 mi) 92 Banesto 3,978 km (2,472 mi) 93 Banesto 3,635 km (2,259 mi) 94 Team Telekom 3,765 km (2,339 mi) 95 Team Telekom 3,950 km (2,450 mi) 96 Mercatone Uno–Bianchi 3,875 km (2,408 mi) 97 U.S. Postal Service 3,687 km (2,291 mi) 98 U.S. Postal Service 3,662 km (2,275 mi) 99 U.S. Postal Service 3,458 km (2,149 mi) 100 U.S. Postal Service 3,272 km (2,033 mi) 101 U.S. Postal Service 3,427 km (2,129 mi) 102 U.S. Postal Service 3,391 km (2,107 mi) 103 Discovery Channel 3,593 km (2,233 mi) 104 Caisse d'Epargne–Illes Balears 3,657 km (2,272 mi) 105 Discovery Channel 3,570 km (2,220 mi) 106 Team CSC 3,559 km (2,211 mi) 107 Astana 3,459 km (2,149 mi) 108 Team Saxo Bank 3,642 km (2,263 mi) 109 BMC Racing Team 3,430 km (2,130 mi) 110 Team Sky 3,496 km (2,172 mi) 111 Team Sky 3,404 km (2,115 mi) 112 Astana 3,660.5 km (2,274.5 mi) 113 Team Sky 3,360.3 km (2,088.0 mi) 114 Team Sky 3,529 km (2,193 mi) 115 Team Sky 3,540 km (2,200 mi) 5 6 7 8 0 Time/Points Margin Stage wins Stages in lead 1 094 !94h 33' 14" 24921 !+ 2h 59' 21" 3 6 2 096 !96h 05' 55" 21614 !+ 2h 16' 14" 1 3 3 35 26 5 10 4 31 8 5 12 5 47 19 2 5 6 36 32 5 13 7 37 20 6 13 8 63 4 4 3 9 43 18 2 13 10 49 59 3 13 11 197 !197h 54' 00" 00837 !+ 8' 37" 1 8 12 200 !200h 28' 48" 00150 !+ 1' 50" 1 15 13 NaN NaN NaN NaN 14 NaN NaN NaN NaN 15 NaN NaN NaN NaN 16 NaN NaN NaN NaN 17 231 !231h 07' 15" 14254 !+ 1h 42' 54" 1 2 18 228 !228h 36' 13" 05721 !+ 57' 21" 4 14 19 221 !221h 50' 26" 01836 !+ 18' 36" 2 14 20 222 !222h 08' 06" 04115 !+ 41' 15" 0 3 21 222 !222h 15' 30" 03041 !+ 30 '41" 3 6 22 226 !226h 18' 21" 03536 !+ 35' 36" 4 15 23 219 !219h 10' 18" 05420 !+ 54' 20" 4 13 24 238 !238h 44' 25" 12225 !+ 1h 22' 25" 2 8 25 198 !198h 16' 42" 14841 !+ 1h 48' 41" 3 14 26 192 !192h 48' 58" 05007 !+ 50' 
07" 5 22 27 186 !186h 39' 15" 04423 !+44' 23" 1 16 28 172 !172h 12' 16" 01413 !+ 14' 13" 2 13 29 177 !177h 10' 03" 01256 !+ 12' 56" 1 16 .. ... ... ... ... 86 084 !84h 27' 53" 00713 !+ 7' 13" 1 11 87 087 !87h 38' 35" 00008 !+ 8" 3 8 88 090 !90h 43' 20" 00216 !+ 2' 16" 0 2 89 101 !101h 01' 20" 00336 !+ 3' 36" 2 10 90 100 !100h 49' 30" 00435 !+ 4' 35" 3 10 91 095 !95h 57' 09" 00459 !+ 4' 59" 2 14 92 103 !103h 38' 38" 00539 !+ 5' 39" 1 13 93 092 !92h 44' 59" 00435 !+ 4' 35" 2 13 94 095 !95h 57' 16" 00141 !+ 1' 41" 2 13 95 100 !100h 30' 35" 00909 !+ 9' 09" 2 12 96 092 !92h 49' 46" 00321 !+ 3' 21" 2 7 97 091 !91h 32' 16" 00737 !+ 7' 37" 4 15 98 092 !92h 33' 08" 00602 !+ 6' 02" 1 12 99 086 !86h 17' 28" 00644 !+ 6' 44" 4 8 100 082 !82h 05' 12" 00717 !+ 7' 17" 4 11 101 083 !83h 41' 12" 00101 !+ 1' 01" 1 13 102 083 !83h 36' 02" 00619 !+ 6' 19" 5 7 103 086 !86h 15' 02" 00440 !+ 4' 40" 1 17 104 089 !89h 40' 27" 00032 !+ 32" 0 8 105 091 !91h 00' 26" 00023 !+ 23" 1 4 106 087 !87h 52' 52" 00058 !+ 58" 1 5 107 085 !85h 48' 35" 00411 !+ 4' 11" 2 7 108 091 !91h 59' 27" 00122 !+ 1' 22" 2 12 109 086 !86h 12' 22" 00134 !+ 1' 34" 1 2 110 087 !87h 34' 47" 00321 !+ 3' 21" 2 14 111 083 !83h 56' 20" 00420 !+ 4' 20" 3 14 112 089 !89h 59' 06" 00737 !+ 7' 37" 4 19 113 084 !84h 46' 14" 00112 !+ 1' 12" 1 16 114 089 !89h 04' 48" 00405 !+ 4' 05" 2 14 115 086 !86h 20' 55" 00054 !+ 54" 0 15 [116 rows x 9 columns]] This is all well and good, but the problem is it doesn't seem to be differentiating by rows. For instance, when I try to print just the first row, it reprints the whole dataset. 
Here's an example of trying to just print the first row and second column (so should just be one value): print(data[0][2]) 0 Country 1 France 2 France 3 France 4 France 5 France 6 France 7 Luxembourg 8 France 9 France 10 Belgium 11 Belgium 12 Belgium 13 World War I 14 NaN 15 NaN 16 NaN 17 Belgium 18 Belgium 19 Belgium 20 Belgium 21 France 22 Italy 23 Italy 24 Belgium 25 Luxembourg 26 Luxembourg 27 Belgium 28 France 29 France ... 86 Spain 87 United States 88 United States 89 Spain 90 Spain 91 Spain 92 Spain 93 Spain 94 Denmark 95 Germany 96 Italy 97 United States 98 United States 99 United States 100 United States 101 United States 102 United States 103 United States 104 Spain 105 Spain 106 Spain 107 Spain 108 Luxembourg 109 Australia 110 Great Britain 111 Great Britain 112 Italy 113 Great Britain 114 Great Britain 115 Great Britain Name: 1, Length: 116, dtype: object All I want is for this to behave as a data frame, with 116 rows and 9 columns. Any idea how to fix this?
If we take a look at the documentation here we can see that read_html actually outputs a list of DataFrames and not a single DataFrame. We can confirm this when we run:
>> print(type(data))
<class 'list'>
The format of the list is such that the first element of the list is the actual DataFrame containing your values.
>> print(type(data[0]))
<class 'pandas.core.frame.DataFrame'>
The simple solution to this is to reassign data to data[0]. From this you can then call individual rows. Indexing of rows for DataFrames doesn't behave like normal lists so I would recommend looking into .iloc and .loc. This is a nice article I found on indexing of DataFrames. An example of this solution:
>> data = data[0]
>> print(data.iloc[1])
0 1903
1 France
2 Garin, MauriceMaurice Garin
3 La Française
4 2,428 km (1,509 mi)
5 094 !94h 33' 14"
6 24921 !+ 2h 59' 21"
7 3
8 6
Name: 1, dtype: object
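To make the difference between list indexing, column selection and row selection concrete, here is a small sketch that avoids the network and HTML parsers altogether. The list data is a hypothetical stand-in for what read_html returns, with integer column labels and the header text in row 0 as in the question; the cyclist names are shortened for readability.

```python
import pandas as pd

# Hypothetical stand-in for the list that read_html returns: one DataFrame,
# with integer column labels and the header text sitting in row 0.
data = [pd.DataFrame([
    ["Year", "Country", "Cyclist"],
    ["1903", "France", "Maurice Garin"],
    ["1904", "France", "Henri Cornet"],
])]

df = data[0]                 # take the DataFrame out of the list

column_1 = df[1]             # df[label] selects a column, not a row
first_data_row = df.iloc[1]  # .iloc selects by position: the second row
one_cell = df.iloc[1, 1]     # row position 1, column position 1
same_cell = df[1][1]         # column labelled 1, then row labelled 1

print(first_data_row)
print(one_cell)
```

This is why data[0][2] on the original list prints a whole column: data[0] is the DataFrame, and [2] then picks the column labelled 2.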
The pandas function read_html returns a list of dataframes. So in your case I believe you need to choose the first index of the returned list, as done in the 8th line in the code below. Also note that you have a typo in the import line of BeautifulSoup; please update your code accordingly in the question. I hope my output is what you're looking for.
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

response = requests.get("Https://en.wikipedia.org/wiki/List_of_Tour_de_France_general_classification_winners")
parser = BeautifulSoup(response.content, 'html.parser')
winners_table = parser.find_all('table')[1]
data = pd.read_html(str(winners_table), flavor = 'lxml')[0]
print("type of variable data: " + str(type(data)))
print(data[0][2])
Output:
type of variable data: <class 'pandas.core.frame.DataFrame'>
1904
Note I used lxml instead of html5lib.
You could try this:
df = data[0]
# iterate through the data frame using iterrows()
for index, row in df.iterrows():
    print ("Col1:", row[0], " Col2: ", row[1], "Col3:", row[2], "Col4:", row[3])
    #etc for all cols
I hope this helps!