pd.read_html() not reading date - python

When I try to parse a wiki page for its tables, the tables are read correctly except for the date of birth column, which comes back empty. Is there a workaround for this? I've tried using Beautiful Soup but I get the same result.
The code I've used is as follows:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
pd.read_html(url)
Here's one of the tables in question: a squad table with a "Date of birth (age)" column.
One possible solution is to alter the page content with BeautifulSoup and then load it into pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# select the correct table, here the first one:
tbl = soup.select("table")[0]

# remove the (aged XX) part from the third column:
for td in tbl.select("td:nth-of-type(3)"):
    td.string = td.contents[-1].split("(")[0]

df = pd.read_html(str(tbl))[0]
print(df)
Prints:
No. Pos. Player Date of birth (age) Caps Club
0 1 GK Thomas Sørensen 12 June 1976 14 Sunderland
1 2 MF Stig Tøfting 14 August 1969 36 Bolton Wanderers
2 3 DF René Henriksen 27 August 1969 39 Panathinaikos
3 4 DF Martin Laursen 26 July 1977 15 Milan
4 5 DF Jan Heintze (c) 17 August 1963 83 PSV Eindhoven
5 6 DF Thomas Helveg 24 June 1971 67 Milan
6 7 MF Thomas Gravesen 11 March 1976 22 Everton
7 8 MF Jesper Grønkjær 12 August 1977 25 Chelsea
8 9 FW Jon Dahl Tomasson 29 August 1976 38 Feyenoord
9 10 MF Martin Jørgensen 6 October 1975 32 Udinese
10 11 FW Ebbe Sand 19 July 1972 44 Schalke 04
11 12 DF Niclas Jensen 17 August 1974 8 Manchester City
12 13 DF Steven Lustü 13 April 1971 4 Lyn
13 14 MF Claus Jensen 29 April 1977 13 Charlton Athletic
14 15 MF Jan Michaelsen 28 November 1970 11 Panathinaikos
15 16 GK Peter Kjær 5 November 1965 4 Aberdeen
16 17 MF Christian Poulsen 28 February 1980 3 Copenhagen
17 18 FW Peter Løvenkrands 29 January 1980 4 Rangers
18 19 MF Dennis Rommedahl 22 July 1978 19 PSV Eindhoven
19 20 DF Kasper Bøgelund 8 October 1980 2 PSV Eindhoven
20 21 FW Peter Madsen 26 April 1978 4 Brøndby
21 22 GK Jesper Christiansen 24 April 1978 0 Vejle
22 23 MF Brian Steen Nielsen 28 December 1968 65 Malmö FF

Try setting the parse_dates parameter to True inside the read_html call.
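For reference, a minimal sketch of that suggestion (hedged: the URL is the one from the question, and whether parse_dates fixes the empty column depends on how pandas extracts the cell text):
import pandas as pd

url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
# parse_dates is an optional read_html argument, forwarded to the
# underlying table parser much like read_csv's parse_dates
tables = pd.read_html(url, parse_dates=True)
print(tables[0].head())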

Related

I am trying to generate Choropleth Maps, nothing shows on screen

I am trying to generate a choropleth map of India with some data. Everything seems to work fine: I cleaned the data and made a proper pandas DataFrame, but somehow the map is still blank. The geojson _id matches the one in the dataframe, yet it doesn't resolve.
import plotly.express as px
import pandas as pd
import json
import numpy as np
india_map = json.load(open('INDIA_STATES.json','r'))
df = pd.read_excel("data.xlsx")
df_s = pd.DataFrame({'Stats':df['Unnamed: 7'][3:38]},)
#print(df_s)
df = pd.concat([df['Unnamed: 1'][3:38],df_s],axis=1,ignore_index=True)
df.reset_index(drop=True,inplace=True)
df = df.rename(columns={df.columns[0]:"State",df.columns[1]:"Stat"})
#print(df)
state_id_json = []
state_name_json = []
for st in india_map['features']:
    state_name_json.append(st['properties']['STATE'])
    state_id_json.append(st['_id'])
state_name_json = list(map(lambda x:x.upper(),state_name_json))
df['State'] = list(map(lambda x:x.upper(),df['State']))
state_id = pd.DataFrame({"State":state_name_json,"_id":state_id_json}).sort_values(by=['State'])
#print(state_id,df)
df = pd.merge(df,state_id,on='State')
print(df)
fig = px.choropleth(df,geojson=india_map,locations='_id',color='Stat',scope= 'asia')
fig.show()
I get this in the terminal:
State Stat _id
0 ANDAMAN AND NICOBAR ISLANDS 356.0 8
1 ANDHRA PRADESH 76210.0 9
2 ARUNACHAL PRADESH 1098.0 10
3 ASSAM 26656.0 11
4 BIHAR 82999.0 row_112
5 CHANDIGARH 901.0 35
6 CHANDIGARH 901.0 13
7 CHHATTISGARH 20834.0 row_251
8 DAMAN AND DIU 158.0 34
9 DELHI 13851.0 16
10 GOA 1348.0 17
11 GUJARAT 50671.0 18
12 HARYANA 21145.0 19
13 HIMACHAL PRADESH 6078.0 20
14 JAMMU AND KASHMIR 10144.0 7
15 JHARKHAND 26946.0 row_267
16 KARNATAKA 52851.0 21
17 KERALA 31841.0 22
18 LAKSHADWEEP 61.0 23
19 MADHYA PRADESH 60348.0 row_250
20 MAHARASHTRA 96879.0 25
21 MANIPUR 2294.0 26
22 MEGHALAYA 2319.0 27
23 MIZORAM 889.0 28
24 NAGALAND 1990.0 29
25 ODISHA 36805.0 30
26 PUDUCHERRY 974.0 37
27 PUNJAB 24359.0 1
28 RAJASTHAN 56507.0 2
29 SIKKIM 541.0 3
30 TAMIL NADU 62406.0 4
31 TRIPURA 3199.0 5
32 UTTAR PRADESH 166198.0 31
33 WEST BENGAL 80176.0 32
Plotly result: an empty map with no data.
Since no geojson was provided, I retrieved one from the web and built the code with the data you provided, so I can't be certain, but I think the problem is probably a missing featureidkey setting. In my example, I have associated the state name with NAME_1 in the geojson. Some of the state names do not match, so I adapted them to the geojson.
Note:
Some states are missing because the state names in the user data do not match NAME_1 in the geojson. Please adjust them for your data.
from urllib.request import urlopen
import json
import pandas as pd
import numpy as np
import io
import plotly.express as px
# https://github.com/Subhash9325/GeoJson-Data-of-Indian-States/blob/master/Indian_States
url = 'https://raw.githubusercontent.com/Subhash9325/GeoJson-Data-of-Indian-States/master/Indian_States'
with urlopen(url) as response:
    india_map = json.load(response)
data = '''
State Stat _id
0 "ANDAMAN AND NICOBAR ISLANDS" 356.0 8
1 "ANDHRA PRADESH" 76210.0 9
2 "ARUNACHAL PRADESH" 1098.0 10
3 ASSAM 26656.0 11
4 BIHAR 82999.0 row_112
5 CHANDIGARH 901.0 35
6 CHANDIGARH 901.0 13
7 CHHATTISGARH 20834.0 row_251
8 "DAMAN AND DIU" 158.0 34
9 DELHI 13851.0 16
10 GOA 1348.0 17
11 GUJARAT 50671.0 18
12 HARYANA 21145.0 19
13 "HIMACHAL PRADESH" 6078.0 20
14 "JAMMU AND KASHMIR" 10144.0 7
15 JHARKHAND 26946.0 row_267
16 KARNATAKA 52851.0 21
17 KERALA 31841.0 22
18 LAKSHADWEEP 61.0 23
19 "MADHYA PRADESH" 60348.0 row_250
20 MAHARASHTRA 96879.0 25
21 MANIPUR 2294.0 26
22 MEGHALAYA 2319.0 27
23 MIZORAM 889.0 28
24 NAGALAND 1990.0 29
25 ODISHA 36805.0 30
26 PUDUCHERRY 974.0 37
27 PUNJAB 24359.0 1
28 RAJASTHAN 56507.0 2
29 SIKKIM 541.0 3
30 "TAMIL NADU" 62406.0 4
31 TRIPURA 3199.0 5
32 "UTTAR PRADESH" 166198.0 31
33 "WEST BENGAL" 80176.0 32
'''
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
df['State'] = df['State'].str.title()
df['State'] = df['State'].str.replace(' And ', ' and ')
df.loc[0, 'State'] = 'Andaman and Nicobar'
fig = px.choropleth(df,
                    geojson=india_map,
                    locations='State',
                    color='Stat',
                    featureidkey="properties.NAME_1",
                    scope='asia')
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(autosize=False, width=1000, height=600)
fig.show()
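If you are unsure which key to point featureidkey at, a quick check (a sketch, assuming the india_map object loaded above) is to inspect the properties of one feature:
# featureidkey must be the dotted path to one of these property names,
# e.g. "properties.NAME_1"
print(india_map["features"][0]["properties"].keys())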

Web scrape Sports-Reference with Python Beautiful Soup

I am trying to scrape data from Nick Saban's sports reference page so that I can pull in the list of All-Americans he coached and then his Bowl-Win Loss Percentage.
I am new to Python so this has been a massive struggle. When I inspect the page I see div id="leaderboard_all-americans" class="data_grid_box".
When I run the code below I am getting the Coaching Record table, which is the first table on the site. I tried using different indexes thinking it may give me a different result but that did not work either.
Ultimately, I want to get the All-American data and turn it into a data frame.
import requests
import bs4
import pandas as pd
saban2 = requests.get("https://www.sports-reference.com/cfb/coaches/nick-saban-1.html")
saban_soup2 = bs4.BeautifulSoup(saban2.text,"lxml")
saban_select = saban_soup2.select('div',{"id":"leaderboard_all-americans"})
saban_df2 = pd.read_html(str(saban_select))
sports-reference.com stores the HTML tables as comments in the basic request response. You have to first grab the commented block with the All-Americans and bowl results, and then parse that result:
import bs4
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd

d = soup(requests.get('https://www.sports-reference.com/cfb/coaches/nick-saban-1.html').text, 'html.parser')
# find the HTML comment that contains the All-Americans leaderboard
block = [i for i in d.find_all(string=lambda text: isinstance(text, bs4.Comment)) if 'id="leaderboard_all-americans"' in i][0]
# parse the comment's contents as HTML
b = soup(str(block), 'html.parser')
players = [i for i in b.select('#leaderboard_all-americans table.no_columns tr')]
# each row holds a link with the player's name, followed by the year(s)
p_results = [{'name': i.td.a.text, 'year': i.td.contents[-1][2:-1]} for i in players]
all_americans = pd.DataFrame(p_results)
bowl_win_loss = b.select_one('#leaderboard_win_loss_pct_post td.single').contents[-2]
print(all_americans)
print(bowl_win_loss)
Output:
all_americans
name year
0 Jonathan Allen 2016
1 Javier Arenas 2009
2 Mark Barron 2011
3 Antoine Caldwell 2008
4 Ha Ha Clinton-Dix 2013
5 Terrence Cody 2008-2009
6 Landon Collins 2014
7 Amari Cooper 2014
8 Landon Dickerson 2020
9 Minkah Fitzpatrick 2016-2017
10 Reuben Foster 2016
11 Najee Harris 2020
12 Derrick Henry 2015
13 Dont'a Hightower 2011
14 Mark Ingram 2009
15 Jerry Jeudy 2018
16 Mike Johnson 2009
17 Barrett Jones 2011-2012
18 Mac Jones 2020
19 Ryan Kelly 2015
20 Cyrus Kouandjio 2013
21 Chad Lavalais 2003
22 Alex Leatherwood 2020
23 Rolando McClain 2009
24 Demarcus Milliner 2012
25 C.J. Mosley 2012-2013
26 Reggie Ragland 2015
27 Josh Reed 2001
28 Trent Richardson 2011
29 A'Shawn Robinson 2015
30 Cam Robinson 2016
31 Andre Smith 2008
32 DeVonta Smith 2020
33 Marcus Spears 2004
34 Patrick Surtain II 2020
35 Tua Tagovailoa 2018
36 Deionte Thompson 2018
37 Chance Warmack 2012
38 Ben Wilkerson 2004
39 Jonah Williams 2018
40 Quinnen Williams 2018
bowl_win_loss:
' .63 (#23)'
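The bowl figure comes back as that raw string; a hedged sketch for turning it into a number, assuming the 'value (#rank)' format holds:
# drop the '(#23)' rank suffix, then parse the remainder as a float
pct = float(bowl_win_loss.split('(')[0].strip())  # 0.63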

How to apply values in a column based on a condition on row values in Python

I have a dataframe that looks like this:
df
Name year week date
0 Adam 2016 16 2016-04-24
1 Mary 2016 17 2016-05-01
2 Jane 2016 20 2016-05-22
3 Joe 2016 17 2016-05-01
4 Arthur 2017 44 2017-11-05
5 Liz 2017 41 2017-10-15
6 Janice 2016 47 2016-11-27
And I want to create a column season, so that df['season'] assigns a season, MAM or OND, depending on the value in week.
The result should look like this:
df_final
Name year week date season
0 Adam 2016 16 2016-04-24 MAM
1 Mary 2016 17 2016-05-01 MAM
2 Jane 2016 20 2016-05-22 MAM
3 Joe 2016 17 2016-05-01 MAM
4 Arthur 2017 44 2017-11-05 OND
5 Liz 2017 41 2017-10-15 OND
6 Janice 2016 47 2016-11-27 OND
In essence, values of week that are below 40 should be paired with MAM and values above 40 should be OND.
So far I have this:
condition =df.week < 40
df['season'] = df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: 'OND')
But it is clunky and does not produce the desired result.
Thank you.
Use numpy.where:
import numpy as np

condition = df.week < 40
df['season'] = np.where(condition, 'MAM', 'OND')
print(df)
Name year week date season
0 Adam 2016 16 2016-04-24 MAM
1 Mary 2016 17 2016-05-01 MAM
2 Jane 2016 20 2016-05-22 MAM
3 Joe 2016 17 2016-05-01 MAM
4 Arthur 2017 44 2017-11-05 OND
5 Liz 2017 41 2017-10-15 OND
6 Janice 2016 47 2016-11-27 OND
EDIT:
To convert strings to integers, use astype:
condition = df.week.astype(int) < 40
Or convert column:
df.week = df.week.astype(int)
condition = df.week < 40

query a csv using pandas

I have a csv file like this:
year month Company A Company B Company C
1990 Jan 10 15 20
1990 Feb 11 14 21
1990 mar 13 8 23
1990 april 12 22 19
1990 may 15 12 18
1990 june 18 13 13
1990 june 12 14 15
1990 july 12 14 16
1991 Jan 11 16 13
1991 Feb 14 17 11
1991 mar 23 13 12
1991 april 23 21 10
1991 may 22 22 9
1991 june 24 20 32
1991 june 12 14 15
1991 july 21 14 16
1992 Jan 10 13 26
1992 Feb 9 11 19
1992 mar 23 12 18
1992 april 12 10 21
1992 may 17 9 10
1992 june 15 42 9
1992 june 16 9 26
1992 july 15 26 19
1993 Jan 18 19 20
1993 Feb 19 18 21
1993 mar 20 21 23
1993 april 21 10 19
1993 may 13 9 14
1993 june 14 23 23
1993 june 15 21 23
1993 july 16 10 22
I want to find out, for each company, the month and year in which they had their highest number of sales. For example, Company A in 1990 had their highest sale of 18. I want to do this using pandas, but I don't understand how to proceed. Pointers needed, please.
PS: here is what I have done so far.
import pandas as pd

df = pd.read_csv('SAMPLE.csv')
num_of_rows = len(df.index)
years_list = []
months_list = []
company_list = df.columns[2:]
for each in df.columns[2:]:
    each = []
for i in range(0, num_of_rows):
    years_list.append(df[df.columns[0]][i])
    months_list.append(df[df.columns[1]][i])
years_list = list(set(years_list))
months_list = list(set(months_list))
for each in years_list:
    for c in company_list:
        print(df[(df.year == each)][c].max())
I am getting the biggest number per year for each company, but I don't know how to also get the corresponding month and year.
Use a combination of idxmax() and loc to filter the dataframe:
In [36]:
import pandas as pd
import io
temp = """year month Company_A Company_B Company_C
1990 Jan 10 15 20
1990 Feb 11 14 21
1990 mar 13 8 23
1990 april 12 22 19
1990 may 15 12 18
1990 june 18 13 13
1990 june 12 14 15
1990 july 12 14 16
1991 Jan 11 16 13
1991 Feb 14 17 11
1991 mar 23 13 12
1991 april 23 21 10
1991 may 22 22 9
1991 june 24 20 32
1991 june 12 14 15
1991 july 21 14 16
1992 Jan 10 13 26
1992 Feb 9 11 19
1992 mar 23 12 18
1992 april 12 10 21
1992 may 17 9 10
1992 june 15 42 9
1992 june 16 9 26
1992 july 15 26 19
1993 Jan 18 19 20
1993 Feb 19 18 21
1993 mar 20 21 23
1993 april 21 10 19
1993 may 13 9 14
1993 june 14 23 23
1993 june 15 21 23
1993 july 16 10 22"""
df = pd.read_csv(io.StringIO(temp), sep=r'\s+')
# .unique() is needed because the same row is the max for both Company_A and Company_C
df.loc[df[['Company_A', 'Company_B', 'Company_C']].idxmax().unique()]
Out[36]:
year month Company_A Company_B Company_C
13 1991 june 24 20 32
21 1992 june 15 42 9
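To also get the month and year of each company's yearly maximum, as your loop attempts, here is a hedged sketch extending the same idxmax/loc idea (column names as in the snippet above):
# for each company, find the row label of every year's maximum,
# then look up the year, month and value for those rows
for c in ['Company_A', 'Company_B', 'Company_C']:
    idx = df.groupby('year')[c].idxmax()
    print(df.loc[idx, ['year', 'month', c]])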

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different lengths. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add columns of varying lengths to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep="\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
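Since the stated goal is a boxplot, note that pandas' box plot ignores NaN per column, so leaving the gaps unfilled is usually preferable to fillna(0), which would drag the quartiles down. A minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt

new_df = df.pivot(index="yindex", columns="Year", values="Money")
new_df.plot.box()  # NaN entries are skipped column by column
plt.show()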
