How to parse MLB team and player data into a Pandas DataFrame? - python

I am still learning and could use some help. I would like to parse the starting pitchers and their respective teams.
I would like the data in a Pandas DataFrame but do not know how to parse it correctly. Any suggestions would be very helpful. Thanks for your time!
Here is an example of the desired output:
Game  Team  Name
1     OAK   Chris Bassitt
1     ARI   Zac Gallen
2     SEA   Justin Dunn
2     LAD   Ross Stripling
Here is my code:
#url = https://www.baseball-reference.com/previews/index.shtml
#Data needed: 1) Team 2) Pitcher Name
import pandas as pd
url = 'https://www.baseball-reference.com/previews/index.shtml'
test = pd.read_html(url)
for t in test:
    name = t[1]
    team = t[0]
    print(team)
    print(name)
I feel like I have to create a Pandas DataFrame and append the Team and Name; however, I am not sure how to parse out just the desired output.

pandas.read_html returns a list of all the tables for a given URL
dataframes in the list can be selected using normal list slicing and selecting methods
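The two points above can be sketched offline with toy frames standing in for read_html's return value (the team and pitcher strings here are illustrative, not live data):

```python
import pandas as pd

# read_html would return something like this: game-header tables at even
# positions, pitcher tables at odd positions
tables = [
    pd.DataFrame({0: ['Cubs (13-6)'], 2: ['Preview']}),
    pd.DataFrame({0: ['CHC', 'STL'], 1: ['Tyson Miller (0-0)', 'Alex Reyes (1-0)']}),
    pd.DataFrame({0: ['Cardinals (4-4)'], 2: ['Preview']}),
    pd.DataFrame({0: ['STL', 'CHC'], 1: ['Kwang Hyun Kim (1-0)', 'Kyle Hendricks (2-1)']}),
]

games = tables[0::2]    # every other table starting at 0
players = tables[1::2]  # every other table starting at 1
print(len(games), len(players))
```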
import pandas as pd
url = 'https://www.baseball-reference.com/previews/index.shtml'
list_of_dataframes = pd.read_html(url)
# select and combine the dataframes for games; every other dataframe from 0 (even)
games = pd.concat(list_of_dataframes[0::2])
# display(games.head())
0 1 2
0 Cubs (13-6) NaN Preview
1 Cardinals (4-4) NaN 12:00AM
0 Cardinals (4-4) NaN Preview
1 Cubs (13-6) NaN 5:15PM
0 Red Sox (6-16) NaN Preview
# select the players from list_of_dataframes; every other dataframe from 1 (odd)
players = list_of_dataframes[1::2]
# add the Game to the dataframes
for i, df in enumerate(players, 1):
    df['Game'] = i
    players[i-1] = df
# combine all the dataframes
players = pd.concat(players).reset_index(drop=True)
# create a players column for the name only
players['name'] = players[1].str.split('(', expand=True)[0]
# rename the column
players.rename(columns={0: 'Team'}, inplace=True)
# drop column 1
players.drop(columns=[1], inplace=True)
# display(players.head(6))
Team Game name
0 CHC 1 Tyson Miller
1 STL 1 Alex Reyes
2 STL 2 Kwang Hyun Kim
3 CHC 2 Kyle Hendricks
4 BOS 3 Martin Perez
5 NYY 3 Jordan Montgomery

Love those sports-reference.com sites. Trenton's solution is perfect, so don't change the accepted answer, but I just wanted to throw out this alternative data source for probable pitchers in case you were interested.
It looks like mlb.com has a publicly available API to pull that info (I'm going to assume that's possibly where baseball-reference fills their probable-pitcher page). What I like about this is that you get much more data back to analyze, and you have the option of a wider date range: historical data, and possibly probable pitchers 2 or 3 days in advance (as well as day of). So give this code a look too, play with it, practice with it.
It could also set you up for your first machine-learning sort of project.
PS: Let me know if you figure out what strikeZoneBottom and strikeZoneTop mean here, if you even bother to look into this data. I haven't been able to figure out what they mean.
I'm also wondering whether there's data on the ballpark. The pitcher stats include a fly ball:ground ball ratio; if there were venue data too, a fly-ball pitcher in a park that yields lots of home runs might look quite different from the same pitcher in a park where fly balls don't travel as far, or where the fences are deeper (home runs essentially turn into warning-track fly outs, and vice versa).
Code:
import requests
import pandas as pd
from datetime import datetime, timedelta
url = 'https://statsapi.mlb.com/api/v1/schedule'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
yesterday = datetime.strftime(datetime.now() - timedelta(1), '%Y-%m-%d')
today = datetime.strftime(datetime.now(), '%Y-%m-%d')
tomorrow = datetime.strftime(datetime.now() + timedelta(1), '%Y-%m-%d')
#To get 7 days earlier; notice the minus sign
#pastDate = datetime.strftime(datetime.now() - timedelta(7), '%Y-%m-%d')
#To get 3 days later; notice the plus sign
#futureDate = datetime.strftime(datetime.now() + timedelta(3), '%Y-%m-%d')
#hydrate parameter is to get back certain data elements. Not sure how to alter it exactly yet, would have to play around
#But without hydrate, it doesn't return probable pitchers
payload = {
    'sportId': '1',
    'startDate': today,  # <-- Change these to get a wider range of games (to also get historical stats for machine learning)
    'endDate': today,    # <-- Change these to get probable pitchers for the next few days; just adjust the timedelta above
    'hydrate': 'team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),venue(location),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)'}
jsonData = requests.get(url, headers=headers, params=payload).json()
dates = jsonData['dates']
rows = []
for date in dates:
    games = date['games']
    for game in games:
        dayNight = game['dayNight']
        gameDate = game['gameDate']
        city = game['venue']['location']['city']
        venue = game['venue']['name']
        teams = game['teams']
        for k, v in teams.items():
            row = {}
            row.update({'dayNight': dayNight,
                        'gameDate': gameDate,
                        'city': city,
                        'venue': venue})
            homeAway = k
            teamName = v['team']['name']
            if 'probablePitcher' not in v.keys():
                row.update({'homeAway': homeAway,
                            'teamName': teamName})
                rows.append(row)
            else:
                probablePitcher = v['probablePitcher']
                fullName = probablePitcher['fullName']
                pitchHand = probablePitcher['pitchHand']['code']
                strikeZoneBottom = probablePitcher['strikeZoneBottom']
                strikeZoneTop = probablePitcher['strikeZoneTop']
                row.update({'homeAway': homeAway,
                            'teamName': teamName,
                            'probablePitcher': fullName,
                            'pitchHand': pitchHand,
                            'strikeZoneBottom': strikeZoneBottom,
                            'strikeZoneTop': strikeZoneTop})
                stats = probablePitcher['stats']
                for stat in stats:
                    if stat['type']['displayName'] == 'statsSingleSeason' and stat['group']['displayName'] == 'pitching':
                        playerStats = stat['stats']
                        row.update(playerStats)
                rows.append(row)
df = pd.DataFrame(rows)
Output: First 10 rows
print (df.head(10).to_string())
airOuts atBats balks baseOnBalls blownSaves catchersInterference caughtStealing city completeGames dayNight doubles earnedRuns era gameDate gamesFinished gamesPitched gamesPlayed gamesStarted groundOuts groundOutsToAirouts hitBatsmen hitByPitch hits hitsPer9Inn holds homeAway homeRuns homeRunsPer9 inheritedRunners inheritedRunnersScored inningsPitched intentionalWalks losses obp outs pickoffs pitchHand probablePitcher rbi runs runsScoredPer9 sacBunts sacFlies saveOpportunities saves shutouts stolenBasePercentage stolenBases strikeOuts strikeZoneBottom strikeZoneTop strikeoutWalkRatio strikeoutsPer9Inn teamName triples venue walksPer9Inn whip wildPitches winPercentage wins
0 15.0 44.0 0.0 9.0 0.0 0.0 0.0 Baltimore 0.0 day 2.0 8.0 6.00 2020-08-19T17:05:00Z 0.0 3.0 3.0 3.0 9.0 0.60 0.0 0.0 10.0 7.50 0.0 away 3.0 2.25 0.0 0.0 12.0 0.0 1.0 .358 36.0 0.0 R Tanner Roark 0.0 8.0 6.00 0.0 0.0 0.0 0.0 0.0 1.000 1.0 10.0 1.589 3.467 1.11 7.50 Toronto Blue Jays 0.0 Oriole Park at Camden Yards 6.75 1.58 0.0 .500 1.0
1 18.0 74.0 0.0 3.0 0.0 0.0 0.0 Baltimore 0.0 day 5.0 8.0 4.00 2020-08-19T17:05:00Z 0.0 4.0 4.0 4.0 18.0 1.00 1.0 1.0 22.0 11.00 0.0 home 1.0 0.50 0.0 0.0 18.0 0.0 2.0 .329 54.0 1.0 L Tommy Milone 0.0 11.0 5.50 1.0 1.0 0.0 0.0 0.0 1.000 1.0 18.0 1.535 3.371 6.00 9.00 Baltimore Orioles 1.0 Oriole Park at Camden Yards 1.50 1.39 1.0 .333 1.0
2 14.0 59.0 0.0 2.0 0.0 0.0 0.0 Boston 0.0 day 3.0 7.0 4.02 2020-08-19T17:35:00Z 0.0 3.0 3.0 3.0 14.0 1.00 0.0 0.0 17.0 9.77 0.0 away 2.0 1.15 0.0 0.0 15.2 0.0 2.0 .311 47.0 0.0 R Jake Arrieta 0.0 7.0 4.02 0.0 0.0 0.0 0.0 0.0 .--- 0.0 14.0 1.627 3.549 7.00 8.04 Philadelphia Phillies 0.0 Fenway Park 1.15 1.21 2.0 .333 1.0
3 2.0 14.0 1.0 3.0 0.0 0.0 0.0 Boston 0.0 day 1.0 5.0 22.50 2020-08-19T17:35:00Z 0.0 1.0 1.0 1.0 1.0 0.50 0.0 0.0 7.0 31.50 0.0 home 2.0 9.00 0.0 0.0 2.0 0.0 1.0 .588 6.0 0.0 L Kyle Hart 0.0 7.0 31.50 0.0 0.0 0.0 0.0 0.0 .--- 0.0 4.0 1.681 3.575 1.33 18.00 Boston Red Sox 0.0 Fenway Park 13.50 5.00 0.0 .000 0.0
4 8.0 27.0 0.0 0.0 0.0 0.0 0.0 Chicago 0.0 day 0.0 2.0 2.57 2020-08-19T18:20:00Z 0.0 1.0 1.0 1.0 7.0 0.88 0.0 0.0 6.0 7.71 0.0 away 0.0 0.00 0.0 0.0 7.0 0.0 0.0 .222 21.0 0.0 R Jack Flaherty 0.0 2.0 2.57 0.0 0.0 0.0 0.0 0.0 .--- 0.0 6.0 1.627 3.549 -.-- 7.71 St. Louis Cardinals 0.0 Wrigley Field 0.00 0.86 0.0 1.000 1.0
5 13.0 65.0 0.0 6.0 0.0 0.0 1.0 Chicago 0.0 day 2.0 6.0 2.84 2020-08-19T18:20:00Z 0.0 3.0 3.0 3.0 28.0 2.15 1.0 1.0 10.0 4.74 0.0 home 2.0 0.95 0.0 0.0 19.0 0.0 1.0 .236 57.0 0.0 R Alec Mills 0.0 6.0 2.84 0.0 0.0 0.0 0.0 0.0 .000 0.0 14.0 1.627 3.549 2.33 6.63 Chicago Cubs 0.0 Wrigley Field 2.84 0.84 0.0 .667 2.0
6 NaN NaN NaN NaN NaN NaN NaN Chicago NaN night NaN NaN NaN 2020-08-19T03:33:00Z NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN away NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Chicago Cubs NaN Wrigley Field NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN Chicago NaN night NaN NaN NaN 2020-08-19T03:33:00Z NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN home NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN St. Louis Cardinals NaN Wrigley Field NaN NaN NaN NaN NaN
8 13.0 92.0 0.0 8.0 0.0 0.0 1.0 Kansas City 0.0 day 6.0 10.0 3.91 2020-08-19T21:05:00Z 0.0 4.0 4.0 4.0 24.0 1.85 0.0 0.0 25.0 9.78 0.0 away 1.0 0.39 0.0 0.0 23.0 0.0 2.0 .327 69.0 0.0 R Luis Castillo 0.0 12.0 4.70 0.0 1.0 0.0 0.0 0.0 .000 0.0 31.0 1.589 3.467 3.88 12.13 Cincinnati Reds 1.0 Kauffman Stadium 3.13 1.43 0.0 .000 0.0
9 10.0 36.0 0.0 5.0 0.0 0.0 0.0 Kansas City 0.0 day 0.0 0.0 0.00 2020-08-19T21:05:00Z 0.0 2.0 2.0 2.0 11.0 1.10 1.0 1.0 5.0 4.09 0.0 home 0.0 0.00 0.0 0.0 11.0 0.0 0.0 .262 33.0 0.0 R Brad Keller 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 .--- 0.0 10.0 1.681 3.575 2.00 8.18 Kansas City Royals 0.0 Kauffman Stadium 4.09 0.91 0.0 1.000 2.0

Related

Ranking with multiple occurrences of ties in Pandas

I need to rank my df by some columns. Have a look at the print below.
The rows need to be ranked from 1 to 20 by the column df['pontos_na_rodada'].
If there are ties, which will occur, they have to be broken by the highest value in the column df['saldo_gols']. If the tie persists, break it by the column df['gols_feitos'], and lastly, if there are still ties, break them by the columns df['Red Cards'] and df['Yellow Cards'], where for these columns the lower value is the better one.
Can someone give me a hand?
Example of the data in the image:
league_season league_round fixture_id team.id resultado \
50885 2020 1.0 327986 118 3.0
46622 2020 1.0 327992 119 3.0
50863 2020 1.0 327986 120 0.0
60003 2020 1.0 327987 121 1.0
46637 2020 1.0 327991 123 3.0
46774 2020 1.0 327990 124 0.0
55991 2020 1.0 327994 126 3.0
46700 2020 1.0 327985 127 0.0
46730 2020 1.0 327988 128 1.0
46652 2020 1.0 327991 129 0.0
46758 2020 1.0 327990 130 3.0
50908 2020 1.0 327989 131 1.0
60024 2020 1.0 327987 133 1.0
46684 2020 1.0 327993 134 3.0
50931 2020 1.0 327989 144 1.0
46606 2020 1.0 327992 147 0.0
55970 2020 1.0 327994 151 0.0
46668 2020 1.0 327993 154 0.0
46743 2020 1.0 327988 794 1.0
46714 2020 1.0 327985 1062 3.0
gols_feitos saldo_gols Red Cards Yellow Cards pontos_na_rodada \
50885 2.0 1.0 0.0 3.0 3.0
46622 1.0 1.0 0.0 4.0 3.0
50863 1.0 -1.0 1.0 2.0 0.0
60003 1.0 0.0 0.0 1.0 1.0
46637 3.0 1.0 0.0 3.0 3.0
46774 0.0 -1.0 0.0 3.0 0.0
55991 3.0 3.0 0.0 NaN 3.0
46700 0.0 -1.0 0.0 3.0 0.0
46730 1.0 0.0 0.0 NaN 1.0
46652 2.0 -1.0 0.0 3.0 0.0
46758 1.0 1.0 0.0 2.0 3.0
50908 0.0 0.0 0.0 2.0 1.0
60024 1.0 0.0 0.0 1.0 1.0
46684 2.0 2.0 0.0 2.0 3.0
50931 0.0 0.0 0.0 NaN 1.0
46606 0.0 -1.0 0.0 3.0 0.0
55970 0.0 -3.0 0.0 3.0 0.0
46668 0.0 -2.0 1.0 3.0 0.0
46743 1.0 0.0 0.0 1.0 1.0
46714 1.0 1.0 0.0 2.0 3.0
rank
50885 NaN
46622 NaN
50863 NaN
60003 NaN
46637 NaN
46774 NaN
55991 NaN
46700 NaN
46730 NaN
46652 NaN
46758 NaN
50908 NaN
60024 NaN
46684 NaN
50931 NaN
46606 NaN
55970 NaN
46668 NaN
46743 NaN
46714 NaN
I just figured out an answer, shown here:
import numpy as np

df['rank'] = np.nan
df['Red Cards'] = df['Red Cards'] * -1
df['Yellow Cards'] = df['Yellow Cards'] * -1
df['rank'] = df.sort_values(by=['league_season', 'league_round', 'pontos_na_rodada',
                                'saldo_gols', 'gols_feitos', 'Red Cards', 'Yellow Cards'])\
               .groupby(['league_season', 'league_round']).cumcount(ascending=False) + 1
df[(df['league_round'] == 10) & (df['league_season'] == 2020)].sort_values(by='rank')
The result:
league_season league_round fixture_id team.id resultado \
49809 2020 10.0 328084 119 0.0
50032 2020 10.0 328076 133 3.0
49919 2020 10.0 328079 1062 3.0
49671 2020 10.0 328078 126 1.0
49964 2020 10.0 328077 121 1.0
49855 2020 10.0 328083 127 0.0
49648 2020 10.0 328078 128 1.0
49694 2020 10.0 328080 130 1.0
49740 2020 10.0 328075 124 3.0
49832 2020 10.0 328083 129 3.0
49899 2020 10.0 328081 144 3.0
49717 2020 10.0 328080 154 1.0
49876 2020 10.0 328081 118 0.0
49602 2020 10.0 328082 134 3.0
49987 2020 10.0 328077 123 1.0
49763 2020 10.0 328075 131 0.0
50009 2020 10.0 328076 120 0.0
49786 2020 10.0 328084 151 3.0
49625 2020 10.0 328082 147 0.0
49942 2020 10.0 328079 794 0.0
gols_feitos saldo_gols Red Cards Yellow Cards pontos_na_rodada \
49809 0.0 -1.0 -0.0 -3.0 20.0
50032 3.0 1.0 -0.0 -2.0 18.0
49919 2.0 1.0 -0.0 -1.0 18.0
49671 2.0 0.0 -0.0 -2.0 18.0
49964 2.0 0.0 -1.0 -3.0 18.0
49855 0.0 -2.0 -0.0 NaN 17.0
49648 2.0 0.0 -0.0 -3.0 15.0
49694 1.0 0.0 -1.0 -1.0 15.0
49740 2.0 1.0 -1.0 -2.0 14.0
49832 2.0 2.0 -0.0 -1.0 13.0
49899 1.0 1.0 -0.0 -2.0 13.0
49717 1.0 0.0 -1.0 -2.0 12.0
49876 0.0 -1.0 -1.0 -2.0 12.0
49602 1.0 1.0 -0.0 -4.0 11.0
49987 2.0 0.0 -1.0 -3.0 11.0
49763 1.0 -1.0 -0.0 -4.0 10.0
50009 2.0 -1.0 -0.0 -2.0 9.0
49786 1.0 1.0 -1.0 -4.0 8.0
49625 0.0 -1.0 -1.0 -2.0 8.0
49942 1.0 -1.0 -0.0 -1.0 7.0
rank
49809 1
50032 2
49919 3
49671 4
49964 5
49855 6
49648 7
49694 8
49740 9
49832 10
49899 11
49717 12
49876 13
49602 14
49987 15
49763 16
50009 17
49786 18
49625 19
49942 20
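As an aside, the same ranking can be had without flipping the sign of the card columns, because sort_values accepts a per-column list of ascending flags. A sketch on a toy four-row frame (invented values, not the real data):

```python
import pandas as pd

# miniature stand-in for the original frame
df = pd.DataFrame({
    'league_season': [2020] * 4,
    'league_round': [1.0] * 4,
    'pontos_na_rodada': [3, 3, 1, 0],
    'saldo_gols': [1, 1, 0, -1],
    'gols_feitos': [2, 1, 1, 0],
    'Red Cards': [0, 0, 0, 1],
    'Yellow Cards': [3, 4, 1, 2],
})

# points/goals rank high-is-better (descending), cards low-is-better (ascending)
order = ['pontos_na_rodada', 'saldo_gols', 'gols_feitos', 'Red Cards', 'Yellow Cards']
df['rank'] = (df.sort_values(by=['league_season', 'league_round'] + order,
                             ascending=[True, True, False, False, False, True, True])
                .groupby(['league_season', 'league_round'])
                .cumcount() + 1)
print(df['rank'].tolist())
```

Because the best row now sorts first within each group, a plain cumcount() + 1 assigns rank 1 to it, and the result aligns back to the original index automatically.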

Beautifulsoup not finding all class elements

I am trying to get all tables with a class of "stats_table". However, it is only pulling 2 tables, yet when I print the actual soup and search the document manually, I can find 9 tables.
from bs4 import BeautifulSoup
import requests
# function to get hitting stats
def get_hitting_stats(team, soup):
    # get tables
    tables = soup.find_all("table", class_="stats_table")
    print(tables)

# function to process game
def process_game(gamelink, headers):
    # get boxscore page
    req = requests.get(gamelink, headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    home_hitting = get_hitting_stats("home", soup)
    away_hitting = get_hitting_stats("away", soup)

headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
process_game("https://www.baseball-reference.com/boxes/CLE/CLE202208151.shtml", headers)
Originally I thought that the other tables might be retrieved from a different request but it doesn't make sense that when I look at the soup returned I can find more than the two tables my code does. Any help appreciated.
The content is within the comments. You need to pull it out.
Also, you never return anything in the functions. Is that what you want to do?
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

# function to get hitting stats
def get_hitting_stats(home_away, soup):
    # get tables; the away batting table appears first on the page
    idx = {'home': 1, 'away': 0}
    hitting = soup.find_all('table', {'id': re.compile('.*batting.*')})
    html = str(hitting[idx[home_away]])
    df = pd.read_html(html)[0]
    print(df)
    return df

# function to process game
def process_game(gamelink, headers):
    # get boxscore page; strip the comment markers so the hidden tables parse
    html = requests.get(gamelink, headers=headers).text
    html = html.replace('<!--', '').replace('-->', '')
    soup = BeautifulSoup(html, 'html.parser')
    home_hitting = get_hitting_stats("home", soup)
    away_hitting = get_hitting_stats("away", soup)

headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
process_game("https://www.baseball-reference.com/boxes/CLE/CLE202208151.shtml", headers)
Output:
Batting AB R H RBI ... acLI RE24 PO A Details
0 Steven Kwan LF 4.0 0.0 1.0 0.0 ... 1.76 0.1 0.0 0.0 NaN
1 Amed Rosario SS 4.0 1.0 0.0 0.0 ... 1.49 -1.5 0.0 1.0 NaN
2 Jose Ramirez 3B 4.0 0.0 2.0 1.0 ... 1.72 0.5 0.0 0.0 NaN
3 Andres Gimenez 2B 4.0 1.0 3.0 3.0 ... 1.84 3.6 0.0 4.0 HR,2B
4 Oscar Gonzalez RF 4.0 0.0 2.0 0.0 ... 1.38 0.3 3.0 0.0 2B
5 Owen Miller 1B 3.0 0.0 0.0 0.0 ... 1.71 -1.1 5.0 3.0 GDP
6 Nolan Jones DH 4.0 0.0 0.0 0.0 ... 1.77 -1.6 NaN NaN NaN
7 Austin Hedges C 3.0 0.0 0.0 0.0 ... 1.77 -1.2 13.0 0.0 NaN
8 Myles Straw CF 3.0 2.0 1.0 0.0 ... 1.09 1.0 3.0 0.0 2·SB
9 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
10 Aaron Civale P NaN NaN NaN NaN ... NaN NaN 1.0 0.0 NaN
11 James Karinchak P NaN NaN NaN NaN ... NaN NaN 0.0 0.0 NaN
12 Trevor Stephan P NaN NaN NaN NaN ... NaN NaN 2.0 0.0 NaN
13 Emmanuel Clase P NaN NaN NaN NaN ... NaN NaN 0.0 0.0 NaN
14 Team Totals 33.0 4.0 9.0 4.0 ... 1.59 0.3 27.0 8.0 NaN
[15 rows x 24 columns]
Batting AB R H RBI ... acLI RE24 PO A Details
0 Riley Greene CF 4.0 1.0 1.0 0.0 ... 0.0 -0.1 3.0 0.0 NaN
1 Victor Reyes RF 3.0 0.0 1.0 0.0 ... 0.0 0.0 1.0 0.0 SB
2 Javier Baez SS 4.0 0.0 1.0 0.0 ... 0.0 0.2 2.0 3.0 2B
3 Harold Castro 1B 4.0 0.0 0.0 1.0 ... 0.0 -0.7 7.0 0.0 NaN
4 Miguel Cabrera DH 4.0 0.0 0.0 0.0 ... 0.0 -0.8 NaN NaN NaN
5 Jeimer Candelario 3B 3.0 0.0 0.0 0.0 ... 0.0 -0.5 1.0 1.0 NaN
6 Eric Haase C 2.0 0.0 0.0 0.0 ... 0.0 -0.3 6.0 1.0 NaN
7 Jonathan Schoop 2B 3.0 0.0 0.0 0.0 ... 0.0 -0.5 1.0 2.0 NaN
8 Akil Baddoo LF 3.0 0.0 0.0 0.0 ... 0.0 -0.5 3.0 0.0 NaN
9 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
10 Drew Hutchison P NaN NaN NaN NaN ... NaN NaN 0.0 0.0 NaN
11 Will Vest P NaN NaN NaN NaN ... NaN NaN 0.0 1.0 NaN
12 Andrew Chafin P NaN NaN NaN NaN ... NaN NaN 0.0 0.0 NaN
13 Wily Peralta P NaN NaN NaN NaN ... NaN NaN 0.0 0.0 NaN
14 Team Totals 30.0 1.0 3.0 1.0 ... 0.0 -3.2 24.0 8.0 NaN
[15 rows x 24 columns]

Pandas df group by count elements

My dataframe looks like this.
# initialize list of lists
data = [[1998, 1998,2002,2003], [2001, 1999,1993,2003], [1998, 1999,2003,1994], [1998,1997,2003,1993], [1999,2001,1996, 1999]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
I would like to count, for each date, the number of occurrences in %, such that the dataframe looks like this:
1997 1998 1999
A 20% 80% 100%
B 30% 10% 0%
C 70% 10% 0%
I tried to use Pandas group-by.
The logic is not fully clear (since it looks like the provided output does not correspond to the provided input), but here are some approaches:
using crosstab
Percent per year
df2 = df.melt(value_name='year')
df2 = pd.crosstab(df2['variable'], df2['year'], normalize='columns').mul(100)
# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum()).mul(100)
Output:
year 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A 0.0 0.0 0.0 0.0 75.0 25.0 50.0 0.0 0.0
B 0.0 0.0 0.0 100.0 25.0 50.0 50.0 0.0 0.0
C 50.0 0.0 100.0 0.0 0.0 0.0 0.0 100.0 50.0
D 50.0 100.0 0.0 0.0 0.0 25.0 0.0 0.0 50.0
Percent per variable
df2 = df.melt(value_name='year')
pd.crosstab(df2['variable'], df2['year'], normalize='index').mul(100)
# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum(1), axis=0).mul(100)
Output:
year 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A 0.0 0.0 0.0 0.0 60.0 20.0 20.0 0.0 0.0
B 0.0 0.0 0.0 20.0 20.0 40.0 20.0 0.0 0.0
C 20.0 0.0 20.0 0.0 0.0 0.0 0.0 20.0 40.0
D 20.0 20.0 0.0 0.0 0.0 20.0 0.0 0.0 40.0
using groupby
(df.stack()
.groupby(level=1)
.apply(lambda s: s.value_counts(normalize=True))
.unstack(fill_value=0)
.mul(100)
)
Output:
1993 1994 1996 1997 1998 1999 2001 2002 2003
A 0.0 0.0 0.0 0.0 60.0 20.0 20.0 0.0 0.0
B 0.0 0.0 0.0 20.0 20.0 40.0 20.0 0.0 0.0
C 20.0 0.0 20.0 0.0 0.0 0.0 0.0 20.0 40.0
D 20.0 20.0 0.0 0.0 0.0 20.0 0.0 0.0 40.0
Another option could be the following:
# getting value_counts for each column
df2 = pd.concat([df[col].value_counts(normalize=True) for col in df.columns], axis=1)
# filling null values with 0
df2.fillna(0, inplace=True)
# transforming to string and adding %
df2 = df2.astype('int').astype('str')+'%'
# getting your output
df2.loc['1997':'1999', 'A':'C'].T
Output:
1997 1998 1999
A 20% 80% 100%
B 30% 10% 0%
C 70% 10% 0%
melt + groupby + unstack
(df.melt().groupby(['variable', 'value']).size()
/ df.melt().groupby('value').size()).unstack(1)
Out[1]:
value 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A NaN NaN NaN NaN 0.75 0.25 0.5 NaN NaN
B NaN NaN NaN 1.0 0.25 0.50 0.5 NaN NaN
C 0.5 NaN 1.0 NaN NaN NaN NaN 1.0 0.5
D 0.5 1.0 NaN NaN NaN 0.25 NaN NaN 0.5

Having issues trying to make my dataframe numeric

So I have a sqlite local database, I read it into my program as a pandas dataframe using
""" Seperating hitters and pitchers """
pitchers = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE BF_y >= 20 AND BF_x >= 20", northwoods_db)
hitters = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE PA_y >= 25 AND PA_x >= 25", northwoods_db)
But when I do this, some of the numbers are not numeric. Here is a head of one of the dataframes:
index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21 -0.3 Hillsdale GMAC NCAA None 5 None ... 4.0 None 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 -- None Duke ACC NCAA None 15 None ... 13 0 1 88 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21 0.1 Wisconsin-Milwaukee Horz NCAA None 8 None ... 1.0 0.0 2.0 21.0 2.25 9.0 0.0 11.3 11.3 1.0
3 357 2017 22 1.0 Nova Southeastern SSC NCAA None 15.0 None ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21 -0.4 Creighton BigE NCAA None 4 None ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25
When I try to make the dataframe numeric, I use these lines of code:
hitters = hitters.apply(pd.to_numeric, errors='coerce')
pitchers = pitchers.apply(pd.to_numeric, errors='coerce')
But when I did that, the new head of the dataframe was full of NaNs; it seems like it got rid of all of the string values, but I want to keep those.
index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21.0 -0.3 NaN NaN NaN NaN 5.0 NaN ... 4.0 NaN 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 NaN NaN NaN NaN NaN NaN 15.0 NaN ... 13.0 0.0 1.0 88.0 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21.0 0.1 NaN NaN NaN NaN 8.0 NaN ... 1.0 0.0 2.0 21.0 2.250 9.0 0.0 11.3 11.3 1.00
3 357 2017 22.0 1.0 NaN NaN NaN NaN 15.0 NaN ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21.0 -0.4 NaN NaN NaN NaN 4.0 NaN ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25
Is there a better way to make the number values numeric and keep all my string columns? Maybe there is an sqlite function that can do it better? I am not sure; any help is appreciated.
Maybe you can use combine_first:
hitters_new = hitters.apply(pd.to_numeric, errors='coerce').combine_first(hitters)
pitchers_new = pitchers.apply(pd.to_numeric, errors='coerce').combine_first(pitchers)
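A toy check of the combine_first idea (invented values, not the OP's table): to_numeric coerces every string cell to NaN, and combine_first then backfills those NaNs from the original frame, so real strings survive while numeric strings become numbers.

```python
import pandas as pd

hitters = pd.DataFrame({
    'Tm_x': ['Hillsdale', 'Duke'],   # genuine strings
    'Age_x': ['21', '--'],           # numbers stored as text, plus a placeholder
})

numeric = hitters.apply(pd.to_numeric, errors='coerce')  # strings -> NaN
hitters_new = numeric.combine_first(hitters)             # refill NaNs from original

print(hitters_new)
```

One caveat worth knowing: coerced placeholders like '--' come back as strings, so that column ends up with mixed types, which may or may not be what's wanted.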
You can try using astype or convert_dtypes. They both take an argument specifying the columns you want to convert; if you already know which columns are numeric and which are strings, that can work. Otherwise, take a look at this thread to do it automatically.
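A minimal sketch of converting only the known-numeric columns (the frame and column names here are toy stand-ins for the real schema):

```python
import pandas as pd

# toy frame standing in for the SQL result: stats stored as text,
# plus a genuine string column that should be left alone
hitters = pd.DataFrame({
    'Tm_x': ['Hillsdale', 'Duke'],
    'Age_x': ['21', '--'],   # contains a non-numeric placeholder
    'PA_x': ['71', '88'],    # clean numeric strings
})

# clean columns can use astype directly
hitters['PA_x'] = hitters['PA_x'].astype(int)

# columns with placeholders need to_numeric's coercion instead
hitters['Age_x'] = pd.to_numeric(hitters['Age_x'], errors='coerce')

print(hitters.dtypes)
```

The string column keeps its dtype, the placeholder becomes NaN, and the rest convert normally.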

BeautifulSoup and Selenium won't retrieve full html from website

This is the site I'm trying to retrieve information from: https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml I want to get the box score data, like the Oakland A's total batting average in the game, at bats in the game, etc. However, when I retrieve and print the html from the site, these box scores are missing completely. Any suggestions? Thanks.
Here's my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Please help! Thanks! I tried selenium and had the same problem.
The page is loaded by javascript. Try using the requests_html package instead; see the sample below.
from bs4 import BeautifulSoup
from requests_html import HTMLSession
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
s = HTMLSession()
page = s.get(url, timeout=20)
page.html.render()
soup = BeautifulSoup(page.html.html, 'html.parser')
print(soup.prettify())
The other tables are there in the requested html, but within the comments. So you need to parse out the comments to get those additional tables:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
comments = data.find_all(string=lambda text: isinstance(text, Comment))
tables = pd.read_html(url)
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(str(each))[0])
        except ValueError:
            # the comment mentioned a table but contained nothing parseable
            continue
Output:
Oakland
print(tables[2].to_string())
Batting AB R H RBI BB SO PA BA OBP SLG OPS Pit Str WPA aLI WPA+ WPA- cWPA acLI RE24 PO A Details
0 Mark Canha LF 6.0 1.0 1.0 3.0 0.0 0.0 6.0 0.247 0.379 0.415 0.793 23.0 19.0 0.011 0.58 0.040 -0.029% 0.01% 1.02 1.0 1.0 0.0 2B
1 Starling Marte CF 3.0 0.0 2.0 3.0 0.0 1.0 4.0 0.325 0.414 0.476 0.889 12.0 7.0 0.116 0.90 0.132 -0.016% 0.12% 1.57 2.8 1.0 0.0 2B,HBP
2 Stephen Piscotty PH-RF 1.0 0.0 1.0 2.0 0.0 0.0 2.0 0.211 0.272 0.349 0.622 7.0 3.0 0.000 0.00 0.000 0.000% 0% 0.00 2.0 1.0 0.0 HBP
3 Matt Olson 1B 6.0 0.0 1.0 2.0 0.0 0.0 6.0 0.283 0.376 0.566 0.941 21.0 13.0 -0.057 0.45 0.008 -0.065% -0.06% 0.78 -0.6 9.0 1.0 GDP
4 Mitch Moreland DH 5.0 3.0 2.0 2.0 0.0 0.0 6.0 0.230 0.290 0.415 0.705 23.0 16.0 0.049 0.28 0.064 -0.015% 0.05% 0.50 1.5 NaN NaN 2·HR,HBP
5 Josh Harrison 2B 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.294 0.366 0.435 0.801 7.0 3.0 0.057 1.50 0.057 0.000% 0.06% 2.63 0.6 0.0 0.0 NaN
6 Tony Kemp 2B 4.0 3.0 3.0 0.0 1.0 0.0 5.0 0.252 0.370 0.381 0.751 16.0 10.0 -0.001 0.14 0.009 -0.010% 0% 0.24 1.6 2.0 2.0 NaN
7 Sean Murphy C 4.0 3.0 2.0 2.0 2.0 1.0 6.0 0.224 0.318 0.419 0.737 25.0 15.0 0.143 0.38 0.151 -0.007% 0.15% 0.67 2.7 7.0 0.0 2B
8 Matt Chapman 3B 1.0 3.0 0.0 0.0 5.0 1.0 6.0 0.214 0.310 0.365 0.676 31.0 10.0 0.051 0.28 0.051 0.000% 0.05% 0.49 2.2 1.0 3.0 NaN
9 Seth Brown RF-CF 5.0 1.0 1.0 1.0 0.0 1.0 6.0 0.204 0.278 0.451 0.730 18.0 12.0 -0.067 0.40 0.000 -0.067% -0.07% 0.70 -1.7 4.0 0.0 SF
10 Elvis Andrus SS 5.0 2.0 1.0 2.0 1.0 0.0 6.0 0.233 0.283 0.310 0.593 20.0 15.0 0.015 0.42 0.050 -0.034% 0.02% 0.73 -0.1 0.0 4.0 NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 Chris Bassitt P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 0.0 NaN
13 A.J. Puk P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
14 Deolis Guerra P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
15 Jake Diekman P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
16 Team Totals 40.0 17.0 14.0 17.0 10.0 4.0 54.0 0.350 0.500 0.575 1.075 203.0 123.0 0.317 0.41 0.562 -0.243% 0.33% 0.72 12.2 27.0 10.0 NaN
