How can I merge and sum the columns with the same name?
The output should have a single column named Canada that holds the sum of the 4 columns named Canada.
Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845
From its current state, this should give the outcome you're looking for:
df = df.set_index('Country/Region') # optional
df.groupby(df.columns, axis=1).sum() # Stolen from Scott Boston as it's a superior method.
Output:
index Brazil Canada
Country/Region
Week 1 0 3
Week 2 0 17
Week 3 0 21
Week 4 0 21
Week 5 0 23
Week 6 0 85
Week 7 0 214
Week 8 12 924
Week 9 182 5350
Week 10 737 27611
Week 11 1674 75442
Week 12 2923 134133
Week 13 4516 200888
Week 14 6002 271539
Week 15 6751 341306
Week 16 7081 409938
I found your dataset interesting. Here's how I would clean it up from step 1:
import numpy as np
import pandas as pd
df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc=np.sum)
print(df)
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-22 0 0 0 0 0 ... 0 0 0 0 0
2020-01-23 0 0 0 0 0 ... 0 0 0 0 0
2020-01-24 0 0 0 0 0 ... 0 0 0 0 0
2020-01-25 0 0 0 0 0 ... 0 0 0 0 0
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-07-30 36542 5197 29831 922 1109 ... 11548 10 1726 5555 3092
2020-07-31 36675 5276 30394 925 1148 ... 11837 10 1728 5963 3169
2020-08-01 36710 5396 30950 925 1164 ... 12160 10 1730 6228 3659
2020-08-02 36710 5519 31465 925 1199 ... 12297 10 1734 6347 3921
2020-08-03 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
If you now want to do weekly aggregations, it's as simple as:
print(df.resample('W').sum())
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
2020-02-02 0 0 0 0 0 ... 0 0 0 0 0
2020-02-09 0 0 0 0 0 ... 0 0 0 0 0
2020-02-16 0 0 0 0 0 ... 0 0 0 0 0
2020-02-23 0 0 0 0 0 ... 0 0 0 0 0
2020-03-01 7 0 6 0 0 ... 0 0 0 0 0
2020-03-08 10 0 85 7 0 ... 43 0 0 0 0
2020-03-15 57 160 195 7 0 ... 209 0 0 0 0
2020-03-22 175 464 705 409 5 ... 309 0 0 11 7
2020-03-29 632 1142 2537 1618 29 ... 559 0 0 113 31
2020-04-05 1783 2000 6875 2970 62 ... 1178 4 0 262 59
2020-04-12 3401 2864 11629 4057 128 ... 1847 30 3 279 84
2020-04-19 5838 3603 16062 4764 143 ... 2081 42 7 356 154
2020-04-26 8918 4606 21211 5087 174 ... 2353 42 7 541 200
2020-05-03 15149 5391 27943 5214 208 ... 2432 42 41 738 244
2020-05-10 25286 5871 36315 5265 274 ... 2607 42 203 1260 241
2020-05-17 39634 6321 45122 5317 327 ... 2632 42 632 3894 274
2020-05-24 61342 6798 54185 5332 402 ... 2869 45 1321 5991 354
2020-05-31 91885 7517 62849 5344 536 ... 3073 63 1932 7125 894
2020-06-07 126442 8378 68842 5868 609 ... 3221 63 3060 7623 1694
2020-06-14 159822 9689 74147 5967 827 ... 3396 63 4236 8836 2335
2020-06-21 191378 12463 79737 5981 1142 ... 4466 63 6322 9905 3089
2020-06-28 210487 15349 87615 5985 1522 ... 10242 70 7360 10512 3813
2020-07-05 224560 18707 102918 5985 2186 ... 21897 70 8450 11322 4426
2020-07-12 237087 22399 124588 5985 2940 ... 36949 70 9489 13002 6200
2020-07-19 245264 26845 149611 6098 4279 ... 52323 70 10855 16350 9058
2020-07-26 250970 31255 178605 6237 5919 ... 68154 70 11571 26749 14933
2020-08-02 255739 36370 208457 6429 7648 ... 80685 70 12023 38896 22241
2020-08-09 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
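The reshaping and weekly aggregation steps above can be tried on a tiny stand-in table. This is only a sketch: the three provinces and their counts are made up, the date headers are assumed to be in month/day/two-digit-year form, and melt is used in place of set_index/stack (it should produce the same long table).

```python
import pandas as pd

# Hypothetical miniature of the wide CSV: one row per province, one column per date.
id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
wide = pd.DataFrame({
    "Province/State": ["Alberta", "Ontario", "Sao Paulo"],
    "Country/Region": ["Canada", "Canada", "Brazil"],
    "Lat": [53.9, 51.2, -23.5],
    "Long": [-116.5, -85.3, -46.6],
    "1/22/20": [1, 2, 0],
    "1/23/20": [3, 4, 5],
})

# Wide -> long: one row per (province, date).
long = wide.melt(id_vars=id_cols, var_name="date", value_name="value")
long["date"] = pd.to_datetime(long["date"], format="%m/%d/%y")

# Long -> one column per country, summing the provinces of each country.
by_country = long.pivot_table(
    index="date", columns="Country/Region", values="value", aggfunc="sum"
)

# Weekly totals; "W" bins end on Sundays.
weekly = by_country.resample("W").sum()
print(by_country)
print(weekly)
```

Both sample dates fall in the week ending 2020-01-26, so the weekly frame has a single row.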
Try:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,(20,5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()
Output:
Z A B C
0 44 111 67 67
1 9 104 36 87
2 70 176 12 58
3 65 126 46 88
4 81 62 77 72
5 9 100 69 79
6 47 146 99 88
7 49 48 19 14
8 39 97 9 57
9 32 105 23 35
10 75 83 34 0
11 0 89 5 38
12 17 83 42 58
13 31 66 41 57
14 35 57 82 91
15 0 113 53 12
16 42 159 68 6
17 68 50 76 52
18 78 35 99 58
19 23 92 85 48
You can try a transpose and groupby, e.g. something similar to the below.
df_T = df.transpose()
df_T.groupby(df_T.index).sum().loc['Canada']  # countries are rows after the transpose, so select with .loc
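The transpose-and-groupby idea can be sketched on a small frame as follows. The two rows are copied from the question's Week 6 and Week 8. As far as I know, recent pandas releases deprecate groupby(..., axis=1), so grouping the transposed frame is also a way to avoid that argument.

```python
import pandas as pd

# Small frame with duplicate column labels, shaped like the question's data.
df = pd.DataFrame(
    [[0, 80, 0, 5, 0], [12, 702, 3, 199, 20]],
    index=["Week 6", "Week 8"],
    columns=["Brazil", "Canada", "Canada", "Canada", "Canada"],
)

# Transpose so the duplicate labels become the row index, group on that
# index, sum each group, and transpose back to weeks-by-country.
summed = df.T.groupby(level=0).sum().T
print(summed)
```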
Here's a way to do it:
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates.
Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.
Full test code is:
import pandas as pd
df = pd.DataFrame(
columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
index=['Week ' + str(i) for i in range(1, 17)],
data=[[i] * 5 for i in range(1, 17)])
df.columns.names=['Country/Region']
print(df)
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)
Output:
Country/Region Brazil Canada Canada Canada Canada
Week 1 1 1 1 1 1
Week 2 2 2 2 2 2
Week 3 3 3 3 3 3
Week 4 4 4 4 4 4
Week 5 5 5 5 5 5
Week 6 6 6 6 6 6
Week 7 7 7 7 7 7
Week 8 8 8 8 8 8
Week 9 9 9 9 9 9
Week 10 10 10 10 10 10
Week 11 11 11 11 11 11
Week 12 12 12 12 12 12
Week 13 13 13 13 13 13
Week 14 14 14 14 14 14
Week 15 15 15 15 15 15
Week 16 16 16 16 16 16
Brazil Canada
Week 1 1 4
Week 2 2 8
Week 3 3 12
Week 4 4 16
Week 5 5 20
Week 6 6 24
Week 7 7 28
Week 8 8 32
Week 9 9 36
Week 10 10 40
Week 11 11 44
Week 12 12 48
Week 13 13 52
Week 14 14 56
Week 15 15 60
Week 16 16 64
I have a pandas DataFrame which looks like this
Region Sub Region Country Size Plants Birds Mammals
Africa Northern Africa Algeria 2380000 22 41 15
Egypt 1000000 8 58 14
Libya 1760000 7 32 8
Sub-Saharan Africa Angola 1250000 34 53 32
Benin 115000 20 40 12
Western Africa Cape Verde 4030 51 35 7
Americas Latin America Antigua 440 4 31 3
Argentina 2780000 70 42 52
Bolivia 1100000 106 8 55
Northern America Canada 9980000 18 44 24
Grenada 340 3 29 2
USA 9830000 510 251 91
Asia Central Asia Kazakhstan 2720000 14 14 27
Kyrgyz 200000 13 3 15
Uzbekistan 447000 16 7 19
Eastern Asia China 9560000 593 136 96
Japan 378000 50 77 49
South Korea 100000 31 28 33
So I am trying to prompt the user to input a value and if the input exists within the Sub Region column, perform a particular task.
I tried turning the 'Sub Region' column into a list so I could iterate through it and check whether it matches the user input:
sub_region_list=[]
for i in world_data.index.values:
    sub_region_list.append(i[1])
print(sub_region_list[0])
That is not the output I had in mind.
I believe there is an easier way to do this, but I cannot seem to figure it out.
You can use get_level_values to filter.
sub_region = input("Enter a sub region:")
if sub_region not in df.index.get_level_values('Sub Region'):
    raise ValueError("You must enter a valid sub-region")
If you want to save the column values in a list, try:
df.index.get_level_values("Sub Region").unique().to_list()
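A minimal sketch of the check in context, assuming the frame is indexed by Region, Sub Region and Country as the printout suggests. The helper name rows_for_sub_region is hypothetical, and the frame is a three-row stand-in built from values in the question.

```python
import pandas as pd

# Hypothetical miniature of the question's frame with a three-level index.
world_data = pd.DataFrame(
    {
        "Region": ["Africa", "Africa", "Americas"],
        "Sub Region": ["Northern Africa", "Western Africa", "Latin America"],
        "Country": ["Algeria", "Cape Verde", "Bolivia"],
        "Size": [2380000, 4030, 1100000],
    }
).set_index(["Region", "Sub Region", "Country"])


def rows_for_sub_region(df, sub_region):
    """Return the rows for one sub region, or raise if it is unknown."""
    if sub_region not in df.index.get_level_values("Sub Region"):
        raise ValueError("You must enter a valid sub-region")
    # xs selects on the named index level and keeps the remaining levels.
    return df.xs(sub_region, level="Sub Region")


print(rows_for_sub_region(world_data, "Western Africa"))
```

In an interactive script the second argument would come from input(), as in the answer above.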
I have a huge dataframe as:
country1 import1 export1 country2 import2 export2
0 USA 12 82 Germany 12 82
1 Germany 65 31 France 65 31
2 England 74 47 Japan 74 47
3 Japan 23 55 England 23 55
4 France 48 12 Usa 48 12
export1 and import1 belong to country1
export2 and import2 belong to country2
I want to sum the export and import values by country.
The output might look like:
country | total_export | total_import
______________________________________________
USA | 12211221 | 212121
France | 4545 | 5454
...
...
Use wide_to_long first:
df = (pd.wide_to_long(data.reset_index(), ['country','import','export'], i='index', j='tmp')
.reset_index(drop=True))
print (df)
country import export
0 USA 12 82
1 Germany 65 31
2 England 74 47
3 Japan 23 55
4 France 48 12
5 Germany 12 82
6 France 65 31
7 Japan 74 47
8 England 23 55
9 Usa 48 12
And then aggregate sum:
df = df.groupby('country', as_index=False).sum()
print (df)
country import export
0 England 97 102
1 France 113 43
2 Germany 77 113
3 Japan 97 102
4 USA 12 82
5 Usa 48 12
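Note that USA and Usa end up as separate groups in the result above. If they are meant to be the same country (an assumption about the data, not something the question states), the names can be normalised before grouping. A sketch using the sample data from the question:

```python
import pandas as pd

data = pd.DataFrame({
    "country1": ["USA", "Germany", "England", "Japan", "France"],
    "import1": [12, 65, 74, 23, 48],
    "export1": [82, 31, 47, 55, 12],
    "country2": ["Germany", "France", "Japan", "England", "Usa"],
    "import2": [12, 65, 74, 23, 48],
    "export2": [82, 31, 47, 55, 12],
})

# Stack the "...1" and "...2" column groups on top of each other.
long = pd.wide_to_long(
    data.reset_index(), ["country", "import", "export"], i="index", j="tmp"
).reset_index(drop=True)

# Assumption: case differences such as "Usa" vs "USA" are data-entry noise.
long["country"] = long["country"].str.upper()

totals = long.groupby("country", as_index=False)[["import", "export"]].sum()
print(totals)
```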
You can slice the table into two parts and concatenate them:
func = lambda x: x[:-1] # or lambda x: x.rstrip('0123456789')
# pd.concat is used here because DataFrame.append was removed in pandas 2.0
pd.concat([data.iloc[:,:3].rename(func, axis=1),
           data.iloc[:,3:].rename(func, axis=1)]).\
    groupby('country').sum()
Output:
import export
country
England 97 102
France 113 43
Germany 77 113
Japan 97 102
USA 12 82
Usa 48 12
I pulled a table of Tour de France winners from Wikipedia using BeautifulSoup, but it's returning the table in what appears to be a dataset whose rows are not separable.
First, here is what I did to grab the page and table:
import requests
response = requests.get("Https://en.wikipedia.org/wiki/List_of_Tour_de_France_general_classification_winners")
content = response.content
from bs4 import BeatifulSoup
parser = BeautifulSoup(content, 'html.parser')
# I know its the second table on the page, so grab it as such
winners_table = parser.find_all('table')[1]
import pandas as pd
data = pd.read_html(str(winners_table), flavor = 'html5lib')
Note that I used html5lib here because PyCharm was telling me that there is no lxml, despite it certainly being there. When I print out the table, it appears as a table with 116 rows and 9 columns, but it doesn't appear to split into rows. It looks like this:
[ 0 1 \
0 Year Country
1 1903 France
2 1904 France
3 1905 France
4 1906 France
5 1907 France
6 1908 France
7 1909 Luxembourg
8 1910 France
9 1911 France
10 1912 Belgium
11 1913 Belgium
12 1914 Belgium
13 1915 World War I
14 1916 NaN
15 1917 NaN
16 1918 NaN
17 1919 Belgium
18 1920 Belgium
19 1921 Belgium
20 1922 Belgium
21 1923 France
22 1924 Italy
23 1925 Italy
24 1926 Belgium
25 1927 Luxembourg
26 1928 Luxembourg
27 1929 Belgium
28 1930 France
29 1931 France
.. ... ...
86 1988 Spain
87 1989 United States
88 1990 United States
89 1991 Spain
90 1992 Spain
91 1993 Spain
92 1994 Spain
93 1995 Spain
94 1996 Denmark
95 1997 Germany
96 1998 Italy
97 1999[B] United States
98 2000[B] United States
99 2001[B] United States
100 2002[B] United States
101 2003[B] United States
102 2004[B] United States
103 2005[B] United States
104 2006 Spain
105 2007 Spain
106 2008 Spain
107 2009 Spain
108 2010 Luxembourg
109 2011 Australia
110 2012 Great Britain
111 2013 Great Britain
112 2014 Italy
113 2015 Great Britain
114 2016 Great Britain
115 2017 Great Britain
2 \
0 Cyclist
1 Garin, MauriceMaurice Garin
2 Garin, MauriceMaurice Garin Cornet, HenriHenri...
3 Trousselier, LouisLouis Trousselier
4 Pottier, RenéRené Pottier
5 Petit-Breton, LucienLucien Petit-Breton
6 Petit-Breton, LucienLucien Petit-Breton
7 Faber, FrançoisFrançois Faber
8 Lapize, OctaveOctave Lapize
9 Garrigou, GustaveGustave Garrigou
10 Defraye, OdileOdile Defraye
11 Thys, PhilippePhilippe Thys
12 Thys, PhilippePhilippe Thys
13 NaN
14 NaN
15 NaN
16 NaN
17 Lambot, FirminFirmin Lambot
18 Thys, PhilippePhilippe Thys
19 Scieur, LéonLéon Scieur
20 Lambot, FirminFirmin Lambot
21 Pélissier, HenriHenri Pélissier
22 Bottecchia, OttavioOttavio Bottecchia
23 Bottecchia, OttavioOttavio Bottecchia
24 Buysse, LucienLucien Buysse
25 Frantz, NicolasNicolas Frantz
26 Frantz, NicolasNicolas Frantz
27 De Waele, MauriceMaurice De Waele
28 Leducq, AndréAndré Leducq
29 Magne, AntoninAntonin Magne
.. ...
86 Delgado, PedroPedro Delgado
87 LeMond, GregGreg LeMond
88 LeMond, GregGreg LeMond
89 Indurain, MiguelMiguel Indurain
90 Indurain, MiguelMiguel Indurain
91 Indurain, MiguelMiguel Indurain
92 Indurain, MiguelMiguel Indurain
93 Indurain, MiguelMiguel Indurain
94 Riis, BjarneBjarne Riis[A]
95 Ullrich, JanJan Ullrich#
96 Pantani, MarcoMarco Pantani
97 Armstrong, LanceLance Armstrong
98 Armstrong, LanceLance Armstrong
99 Armstrong, LanceLance Armstrong
100 Armstrong, LanceLance Armstrong
101 Armstrong, LanceLance Armstrong
102 Armstrong, LanceLance Armstrong
103 Armstrong, LanceLance Armstrong
104 Landis, FloydFloyd Landis Pereiro, ÓscarÓscar ...
105 Contador, AlbertoAlberto Contador#
106 Sastre, CarlosCarlos Sastre*
107 Contador, AlbertoAlberto Contador
108 Contador, AlbertoAlberto Contador Schleck, And...
109 Evans, CadelCadel Evans
110 Wiggins, BradleyBradley Wiggins
111 Froome, ChrisChris Froome
112 Nibali, VincenzoVincenzo Nibali
113 Froome, ChrisChris Froome*
114 Froome, ChrisChris Froome
115 Froome, ChrisChris Froome
3 4 \
0 Sponsor/Team Distance
1 La Française 2,428 km (1,509 mi)
2 Conte 2,428 km (1,509 mi)
3 Peugeot–Wolber 2,994 km (1,860 mi)
4 Peugeot–Wolber 4,637 km (2,881 mi)
5 Peugeot–Wolber 4,488 km (2,789 mi)
6 Peugeot–Wolber 4,497 km (2,794 mi)
7 Alcyon–Dunlop 4,498 km (2,795 mi)
8 Alcyon–Dunlop 4,734 km (2,942 mi)
9 Alcyon–Dunlop 5,343 km (3,320 mi)
10 Alcyon–Dunlop 5,289 km (3,286 mi)
11 Peugeot–Wolber 5,287 km (3,285 mi)
12 Peugeot–Wolber 5,380 km (3,340 mi)
13 NaN NaN
14 NaN NaN
15 NaN NaN
16 NaN NaN
17 La Sportive 5,560 km (3,450 mi)
18 La Sportive 5,503 km (3,419 mi)
19 La Sportive 5,485 km (3,408 mi)
20 Peugeot–Wolber 5,375 km (3,340 mi)
21 Automoto–Hutchinson 5,386 km (3,347 mi)
22 Automoto 5,425 km (3,371 mi)
23 Automoto–Hutchinson 5,440 km (3,380 mi)
24 Automoto–Hutchinson 5,745 km (3,570 mi)
25 Alcyon–Dunlop 5,398 km (3,354 mi)
26 Alcyon–Dunlop 5,476 km (3,403 mi)
27 Alcyon–Dunlop 5,286 km (3,285 mi)
28 Alcyon–Dunlop 4,822 km (2,996 mi)
29 France 5,091 km (3,163 mi)
.. ... ...
86 Reynolds 3,286 km (2,042 mi)
87 AD Renting–W-Cup–Bottecchia 3,285 km (2,041 mi)
88 Z–Tomasso 3,504 km (2,177 mi)
89 Banesto 3,914 km (2,432 mi)
90 Banesto 3,983 km (2,475 mi)
91 Banesto 3,714 km (2,308 mi)
92 Banesto 3,978 km (2,472 mi)
93 Banesto 3,635 km (2,259 mi)
94 Team Telekom 3,765 km (2,339 mi)
95 Team Telekom 3,950 km (2,450 mi)
96 Mercatone Uno–Bianchi 3,875 km (2,408 mi)
97 U.S. Postal Service 3,687 km (2,291 mi)
98 U.S. Postal Service 3,662 km (2,275 mi)
99 U.S. Postal Service 3,458 km (2,149 mi)
100 U.S. Postal Service 3,272 km (2,033 mi)
101 U.S. Postal Service 3,427 km (2,129 mi)
102 U.S. Postal Service 3,391 km (2,107 mi)
103 Discovery Channel 3,593 km (2,233 mi)
104 Caisse d'Epargne–Illes Balears 3,657 km (2,272 mi)
105 Discovery Channel 3,570 km (2,220 mi)
106 Team CSC 3,559 km (2,211 mi)
107 Astana 3,459 km (2,149 mi)
108 Team Saxo Bank 3,642 km (2,263 mi)
109 BMC Racing Team 3,430 km (2,130 mi)
110 Team Sky 3,496 km (2,172 mi)
111 Team Sky 3,404 km (2,115 mi)
112 Astana 3,660.5 km (2,274.5 mi)
113 Team Sky 3,360.3 km (2,088.0 mi)
114 Team Sky 3,529 km (2,193 mi)
115 Team Sky 3,540 km (2,200 mi)
5 6 7 8
0 Time/Points Margin Stage wins Stages in lead
1 094 !94h 33' 14" 24921 !+ 2h 59' 21" 3 6
2 096 !96h 05' 55" 21614 !+ 2h 16' 14" 1 3
3 35 26 5 10
4 31 8 5 12
5 47 19 2 5
6 36 32 5 13
7 37 20 6 13
8 63 4 4 3
9 43 18 2 13
10 49 59 3 13
11 197 !197h 54' 00" 00837 !+ 8' 37" 1 8
12 200 !200h 28' 48" 00150 !+ 1' 50" 1 15
13 NaN NaN NaN NaN
14 NaN NaN NaN NaN
15 NaN NaN NaN NaN
16 NaN NaN NaN NaN
17 231 !231h 07' 15" 14254 !+ 1h 42' 54" 1 2
18 228 !228h 36' 13" 05721 !+ 57' 21" 4 14
19 221 !221h 50' 26" 01836 !+ 18' 36" 2 14
20 222 !222h 08' 06" 04115 !+ 41' 15" 0 3
21 222 !222h 15' 30" 03041 !+ 30 '41" 3 6
22 226 !226h 18' 21" 03536 !+ 35' 36" 4 15
23 219 !219h 10' 18" 05420 !+ 54' 20" 4 13
24 238 !238h 44' 25" 12225 !+ 1h 22' 25" 2 8
25 198 !198h 16' 42" 14841 !+ 1h 48' 41" 3 14
26 192 !192h 48' 58" 05007 !+ 50' 07" 5 22
27 186 !186h 39' 15" 04423 !+44' 23" 1 16
28 172 !172h 12' 16" 01413 !+ 14' 13" 2 13
29 177 !177h 10' 03" 01256 !+ 12' 56" 1 16
.. ... ... ... ...
86 084 !84h 27' 53" 00713 !+ 7' 13" 1 11
87 087 !87h 38' 35" 00008 !+ 8" 3 8
88 090 !90h 43' 20" 00216 !+ 2' 16" 0 2
89 101 !101h 01' 20" 00336 !+ 3' 36" 2 10
90 100 !100h 49' 30" 00435 !+ 4' 35" 3 10
91 095 !95h 57' 09" 00459 !+ 4' 59" 2 14
92 103 !103h 38' 38" 00539 !+ 5' 39" 1 13
93 092 !92h 44' 59" 00435 !+ 4' 35" 2 13
94 095 !95h 57' 16" 00141 !+ 1' 41" 2 13
95 100 !100h 30' 35" 00909 !+ 9' 09" 2 12
96 092 !92h 49' 46" 00321 !+ 3' 21" 2 7
97 091 !91h 32' 16" 00737 !+ 7' 37" 4 15
98 092 !92h 33' 08" 00602 !+ 6' 02" 1 12
99 086 !86h 17' 28" 00644 !+ 6' 44" 4 8
100 082 !82h 05' 12" 00717 !+ 7' 17" 4 11
101 083 !83h 41' 12" 00101 !+ 1' 01" 1 13
102 083 !83h 36' 02" 00619 !+ 6' 19" 5 7
103 086 !86h 15' 02" 00440 !+ 4' 40" 1 17
104 089 !89h 40' 27" 00032 !+ 32" 0 8
105 091 !91h 00' 26" 00023 !+ 23" 1 4
106 087 !87h 52' 52" 00058 !+ 58" 1 5
107 085 !85h 48' 35" 00411 !+ 4' 11" 2 7
108 091 !91h 59' 27" 00122 !+ 1' 22" 2 12
109 086 !86h 12' 22" 00134 !+ 1' 34" 1 2
110 087 !87h 34' 47" 00321 !+ 3' 21" 2 14
111 083 !83h 56' 20" 00420 !+ 4' 20" 3 14
112 089 !89h 59' 06" 00737 !+ 7' 37" 4 19
113 084 !84h 46' 14" 00112 !+ 1' 12" 1 16
114 089 !89h 04' 48" 00405 !+ 4' 05" 2 14
115 086 !86h 20' 55" 00054 !+ 54" 0 15
[116 rows x 9 columns]]
This is all well and good, but the problem is it doesn't seem to be differentiating by rows. For instance, when I try to print just the first row, it reprints the whole dataset. Here's an example of trying to just print the first row and second column (so should just be one value):
print(data[0][2])
0 Country
1 France
2 France
3 France
4 France
5 France
6 France
7 Luxembourg
8 France
9 France
10 Belgium
11 Belgium
12 Belgium
13 World War I
14 NaN
15 NaN
16 NaN
17 Belgium
18 Belgium
19 Belgium
20 Belgium
21 France
22 Italy
23 Italy
24 Belgium
25 Luxembourg
26 Luxembourg
27 Belgium
28 France
29 France
...
86 Spain
87 United States
88 United States
89 Spain
90 Spain
91 Spain
92 Spain
93 Spain
94 Denmark
95 Germany
96 Italy
97 United States
98 United States
99 United States
100 United States
101 United States
102 United States
103 United States
104 Spain
105 Spain
106 Spain
107 Spain
108 Luxembourg
109 Australia
110 Great Britain
111 Great Britain
112 Italy
113 Great Britain
114 Great Britain
115 Great Britain
Name: 1, Length: 116, dtype: object
All I want is for this to behave as a data frame, with 116 rows and 9 columns. Any idea how to fix this?
If we take a look at the documentation here we can see that read_html actually outputs a list of DataFrames and not a single DataFrame. We can confirm this when we run:
>> print(type(data))
<class 'list'>
The format of the list is such that the first element of the list is the actual DataFrame containing your values.
>> print(type(data[0]))
<class 'pandas.core.frame.DataFrame'>
The simple solution to this is to reassign data to data[0]. From this you can then call individual rows. Indexing of rows for DataFrames doesn't behave like normal lists so I would recommend looking into .iloc and .loc. This is a nice article I found on indexing of DataFrames.
An example of this solution:
>> data = data[0]
>> print(data.iloc[1])
0 1903
1 France
2 Garin, MauriceMaurice Garin
3 La Française
4 2,428 km (1,509 mi)
5 094 !94h 33' 14"
6 24921 !+ 2h 59' 21"
7 3
8 6
Name: 1, dtype: object
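To make the difference between column selection and row selection concrete, here is a small sketch on a stand-in for data[0]. The three rows are abbreviated from the table above, and promoting the first row to column names is an optional extra step, not something the question asked for.

```python
import pandas as pd

# Stand-in for data[0]: integer column labels and the real header in row 0.
table = pd.DataFrame([
    ["Year", "Country", "Cyclist"],
    ["1903", "France", "Maurice Garin"],
    ["1905", "France", "Louis Trousselier"],
])

column_1 = table[1]      # [] with a label selects a whole column
row_1 = table.iloc[1]    # .iloc with one position selects a whole row
cell = table.iloc[1, 1]  # row position 1, column position 1: a single value

# Optionally promote the first row to column names and drop it from the data.
tidy = table.iloc[1:].reset_index(drop=True)
tidy.columns = table.iloc[0].tolist()

print(cell)
print(tidy.loc[0, "Country"])
```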
The pandas function read_html returns a list of dataframes. So in your case I believe you need to take the first element of the returned list, as done in the pd.read_html line in the code below.
Also note that you have a typo in the import line of BeautifulSoup; please update your code accordingly in the question.
I hope my output is what you're looking for.
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
response = requests.get("Https://en.wikipedia.org/wiki/List_of_Tour_de_France_general_classification_winners")
parser = BeautifulSoup(response.content, 'html.parser')
winners_table = parser.find_all('table')[1]
data = pd.read_html(str(winners_table), flavor = 'lxml')[0]
print("type of variable data: " + str(type(data)))
print(data[0][2])
Output:
type of variable data: <class 'pandas.core.frame.DataFrame'>
1904
Note I used lxml instead of html5lib
You could try this:
df = data[0]
# iterate through the data frame using iterrows()
for index, row in df.iterrows():
    print ("Col1:", row[0], " Col2: ", row[1], "Col3:", row[2], "Col4:", row[3]) #etc for all cols
I hope this helps!
This question is an extension of another question but with a different approach. I have the following 2 dfs:
(If someone can show me a more efficient way of creating the df below, instead of writing it out by hand, that would be great.)
yrs = pd.DataFrame({'years': [1950, 1951, 1952, 1953, 1954, 1955, \
1956, 1957,1958,1959,1960,1961,1962,1963,1964,1965,1967,1968,1969,\
1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,\
1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,\
1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,\
2009,2010,2011,2012,2013,2014]}, index=[1,2,3,4,5,6,7,8,9,10,11,12,\
13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,\
35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,51,52,53,54,55,56,57,\
58,59,60,61,62,63,64,65])
yrs
years
1 1950
2 1951
3 1952
4 1953
5 1954
........
58 2007
59 2008
60 2009
61 2010
62 2011
63 2012
64 2013
65 2014
dfyears.head(30).to_dict()
{'end': {0: 1995,1: 1997,2: 1999,3: 2001,4: 2003,5: 2005,6: 2007,7: 2013,
8: 2014,9: 1995,10: 2007,11: 2013,12: 2014,13: 1989,14: 1991, 15: 1993,
16: 1995,17: 1997,18: 1999,19: 2001,20: 2003,21: 2005,22: 2007,23: 2013,
24: 2014,25: 1985,26: 1987,27: 1989,28: 1991,29: 1993},'idthomas': {0: 136,1: 136,2: 136,3: 136,4: 136,5: 136,6: 136,7: 136,8: 136,9: 172,10: 172,
11: 172,12: 172,13: 174,14: 174,15: 174,16: 174,17: 174,18: 174,19: 174,
20: 174, 21: 174,22: 174,23: 174,24: 174,25: 179,26: 179,27: 179,28: 179,
29: 179}, 'start': {0: 1993,1: 1995,2: 1997,3: 1999,4: 2001,5: 2003,6: 2005,7: 2007,8: 2013,9: 1993,10: 2001,11: 2007,12: 2013,13: 1987,14: 1989,
15: 1991,16: 1993,17: 1995,18: 1997, 19: 1999,20: 2001,21: 2003, 22: 2005,
23: 2007,24: 2013, 25: 1983,26: 1985,27: 1987,28: 1989,29: 1991}}
dfyears.head(30)
end start idthomas
0 1995 1993 136
1 1997 1995 136
2 1999 1997 136
3 2001 1999 136
4 2003 2001 136
5 2005 2003 136
6 2007 2005 136
7 2013 2007 136
8 2014 2013 136
9 1995 1993 172
10 2007 2001 172
11 2013 2007 172
12 2014 2013 172
I want to create a column named served in yrs that returns 1 or 0 depending on whether the corresponding value in the years column is >= start and <= end, and at the same time create a column named idthomas that holds the idthomas value from the row the condition is being applied to. Below is an example of what I want:
years served idthomas
1 1950 0 136
2 1951 0 136
3 1952 0 136
4 1953 0 136
5 1954 0 136
...................
43 1993 1 136
44 1994 1 136
45 1995 1 136
46 1996 1 136
47 1997 1 136
48 1998 1 136
49 1999 1 136
51 2000 1 136
52 2001 1 136
53 2002 1 136
54 2003 1 136
55 2004 1 136
56 2005 1 136
57 2006 1 136
58 2007 1 136
59 2008 1 136
60 2009 1 136
61 2010 1 136
62 2011 1 136
63 2012 1 136
64 2013 1 136
65 2014 1 136
66 1950 0 172
67 1951 0 172
68 1952 0 172
69 1953 0 172
70 1954 0 172
...................
72 1993 1 172
73 1994 1 172
74 1995 1 172
75 1996 0 172
76 1997 0 172
77 1998 0 172
78 1999 0 172
79 2000 0 172
80 2001 1 172
81 2002 1 172
82 2003 1 172
83 2004 1 172
84 2005 1 172
85 2006 1 172
86 2007 1 172
87 2008 1 172
88 2009 1 172
89 2010 1 172
90 2011 1 172
91 2012 1 172
92 2013 1 172
93 2014 1 172
I typed out 'something' to code this. It's embarrassingly crude:
uu=dfyears.groupby('idthomas')
yrs['did_service'] == 1 if:
# somewhere in the next line I think that I need to do some sort of
# tuple so that I can grab the value in the 'idthomas' column that
# is associated with the comparison that I am doing.
x in years >= uu.start | x in years <= uu.end
else == 0
If this does not work then I will be doing the work by hand. I only ask that if someone tries and is not able to, they let me know so I can get an idea of the viability of the approach.
I can help with the time series. You don't need to type the data by hand; here's how you can do it:
pd.DataFrame(np.array(pd.date_range(start='1900', end='1920', freq='A').strftime('%Y')), columns=['years'])
Or drop the .strftime() call if you want full dates (year, month and day) instead of just the year.
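If plain integer years are all that is needed, a range may be simpler than a date range. This sketch assumes consecutive years from 1950 to 2014 and a consecutive 1-based index, which differs slightly from the hand-typed frame in the question.

```python
import pandas as pd

# Consecutive integer years 1950-2014 (assumed bounds, taken from the question).
yrs = pd.DataFrame({"years": range(1950, 2015)})

# Shift the default 0-based index to start at 1, like the question's frame.
yrs.index = yrs.index + 1
print(yrs.head())
```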
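For running the logic you are describing, I was thinking that np.where might work fine. Each comparison needs its own parentheses, the two conditions should be combined with & (a year has to be both >= start and <= end), and yrs['years'] cannot be compared directly with dfyears['start'] because the two Series have different lengths. For a single start/end pair taken from dfyears, it would be something like (not tested):
yrs['served'] = np.where((yrs['years'] >= start) & (yrs['years'] <= end), 1, 0)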
However, that wouldn't really address the fact that you want to add new rows to yrs, according to your example at least.
I know this is not a complete answer, but I hope it helps to some extent.
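One possible way to get a row for every year and every idthomas is a cross join followed by a range check. This is a sketch, not a tested solution for the full data: the frames are cut-down stand-ins using a few rows from the question, and merge(how="cross") assumes pandas 1.2 or later.

```python
import pandas as pd

# Toy versions of the two frames; the start/end rows come from the question's sample.
yrs = pd.DataFrame({"years": range(1990, 2003)})
dfyears = pd.DataFrame({
    "start": [1993, 1995, 1993, 2001],
    "end": [1995, 1997, 1995, 2007],
    "idthomas": [136, 136, 172, 172],
})

# Pair every year with every term, then flag the years that fall inside a term
# (between is inclusive on both ends by default).
pairs = yrs.merge(dfyears, how="cross")
pairs["served"] = pairs["years"].between(pairs["start"], pairs["end"]).astype(int)

# One row per (idthomas, year): served is 1 if any term covers that year.
result = pairs.groupby(["idthomas", "years"], as_index=False)["served"].max()
result = result[["years", "served", "idthomas"]]
print(result)
```

With these inputs, idthomas 172 gets served = 1 for 1993-1995, 0 for 1996-2000 and 1 again from 2001, which matches the pattern in the question's desired output.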