BeautifulSoup and Selenium won't retrieve full html from website


This is the site I'm trying to retrieve information from: https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml I want to get the box score data, such as the Oakland A's team batting average and at-bats in the game. However, when I retrieve and print the HTML from the site, these box scores are missing from it completely. Any suggestions? Thanks.
Here's my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Please help! Thanks! I tried selenium and had the same problem.

The page is rendered with JavaScript. Try the requests_html package instead; see the sample below.
from bs4 import BeautifulSoup
from requests_html import HTMLSession
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
s = HTMLSession()
page = s.get(url, timeout=20)
page.html.render()
soup = BeautifulSoup(page.html.html, 'html.parser')
print(soup.prettify())

The other tables are there in the requested html, but within the comments. So you need to parse out the comments to get those additional tables:
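import requests
from io import StringIO
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
# collect every HTML comment node in the page
comments = data.find_all(string=lambda text: isinstance(text, Comment))
# start with the tables that sit outside the comments
tables = pd.read_html(url)
for each in comments:
    if 'table' in str(each):
        try:
            # parse the commented-out markup and keep its first table;
            # StringIO avoids passing a literal HTML string, which recent pandas deprecates
            tables.append(pd.read_html(StringIO(str(each)))[0])
        except ValueError:
            # read_html raises ValueError when it finds no table
            continue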
Output:
Oakland
print(tables[2].to_string())
Batting AB R H RBI BB SO PA BA OBP SLG OPS Pit Str WPA aLI WPA+ WPA- cWPA acLI RE24 PO A Details
0 Mark Canha LF 6.0 1.0 1.0 3.0 0.0 0.0 6.0 0.247 0.379 0.415 0.793 23.0 19.0 0.011 0.58 0.040 -0.029% 0.01% 1.02 1.0 1.0 0.0 2B
1 Starling Marte CF 3.0 0.0 2.0 3.0 0.0 1.0 4.0 0.325 0.414 0.476 0.889 12.0 7.0 0.116 0.90 0.132 -0.016% 0.12% 1.57 2.8 1.0 0.0 2B,HBP
2 Stephen Piscotty PH-RF 1.0 0.0 1.0 2.0 0.0 0.0 2.0 0.211 0.272 0.349 0.622 7.0 3.0 0.000 0.00 0.000 0.000% 0% 0.00 2.0 1.0 0.0 HBP
3 Matt Olson 1B 6.0 0.0 1.0 2.0 0.0 0.0 6.0 0.283 0.376 0.566 0.941 21.0 13.0 -0.057 0.45 0.008 -0.065% -0.06% 0.78 -0.6 9.0 1.0 GDP
4 Mitch Moreland DH 5.0 3.0 2.0 2.0 0.0 0.0 6.0 0.230 0.290 0.415 0.705 23.0 16.0 0.049 0.28 0.064 -0.015% 0.05% 0.50 1.5 NaN NaN 2·HR,HBP
5 Josh Harrison 2B 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.294 0.366 0.435 0.801 7.0 3.0 0.057 1.50 0.057 0.000% 0.06% 2.63 0.6 0.0 0.0 NaN
6 Tony Kemp 2B 4.0 3.0 3.0 0.0 1.0 0.0 5.0 0.252 0.370 0.381 0.751 16.0 10.0 -0.001 0.14 0.009 -0.010% 0% 0.24 1.6 2.0 2.0 NaN
7 Sean Murphy C 4.0 3.0 2.0 2.0 2.0 1.0 6.0 0.224 0.318 0.419 0.737 25.0 15.0 0.143 0.38 0.151 -0.007% 0.15% 0.67 2.7 7.0 0.0 2B
8 Matt Chapman 3B 1.0 3.0 0.0 0.0 5.0 1.0 6.0 0.214 0.310 0.365 0.676 31.0 10.0 0.051 0.28 0.051 0.000% 0.05% 0.49 2.2 1.0 3.0 NaN
9 Seth Brown RF-CF 5.0 1.0 1.0 1.0 0.0 1.0 6.0 0.204 0.278 0.451 0.730 18.0 12.0 -0.067 0.40 0.000 -0.067% -0.07% 0.70 -1.7 4.0 0.0 SF
10 Elvis Andrus SS 5.0 2.0 1.0 2.0 1.0 0.0 6.0 0.233 0.283 0.310 0.593 20.0 15.0 0.015 0.42 0.050 -0.034% 0.02% 0.73 -0.1 0.0 4.0 NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 Chris Bassitt P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 0.0 NaN
13 A.J. Puk P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
14 Deolis Guerra P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
15 Jake Diekman P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
16 Team Totals 40.0 17.0 14.0 17.0 10.0 4.0 54.0 0.350 0.500 0.575 1.075 203.0 123.0 0.317 0.41 0.562 -0.243% 0.33% 0.72 12.2 27.0 10.0 NaN
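To see the comment trick in isolation, here is a minimal sketch that runs without hitting the site. The HTML snippet and the table ids (`visible_table`, `hidden_batting`) are made up for illustration; only the `Comment` filtering mirrors the answer above.

```python
from bs4 import BeautifulSoup, Comment

# Made-up page: one normal table and one table wrapped in an HTML comment
html = """
<html><body>
<table id="visible_table"><tr><td>line score</td></tr></table>
<!--
<table id="hidden_batting">
  <tr><th>Batting</th><th>AB</th></tr>
  <tr><td>Mark Canha LF</td><td>6</td></tr>
</table>
-->
<!-- an ordinary comment without any markup -->
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# A normal search only sees tables outside of comments
visible_ids = [t.get("id") for t in soup.find_all("table")]

# Comments are text nodes, so their contents must be parsed a second time
hidden_ids = []
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    if "<table" not in comment:
        continue
    inner = BeautifulSoup(str(comment), "html.parser")
    hidden_ids.extend(t.get("id") for t in inner.find_all("table"))

print(visible_ids)
print(hidden_ids)
```

The second parse is the important step: until the comment text is handed to a parser again, the table inside it is just a string.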

Related

Beautifulsoup not finding all class elements

I am trying to get all tables with a class of "stats_table". However, it only pulls 2 tables, even though I can find 9 tables when I print the soup and search the document manually.
from bs4 import BeautifulSoup
import requests
# function to get hitting stats
def get_hitting_stats(team, soup):
    # get tables
    tables = soup.find_all("table", class_="stats_table")
    print(tables)
# function to process game
def process_game(gamelink, headers):
    # get boxscore page
    req = requests.get(gamelink, headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    home_hitting = get_hitting_stats("home", soup)
    away_hitting = get_hitting_stats("away", soup)
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
process_game("https://www.baseball-reference.com/boxes/CLE/CLE202208151.shtml", headers)
Originally I thought the other tables might come from a different request, but that doesn't make sense, because the soup that is returned contains more than the two tables my code finds. Any help is appreciated.

The content is within the comments. You need to pull it out.
Also, you never return anything in the functions. Is that what you want to do?
from io import StringIO
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# function to get hitting stats
def get_hitting_stats(home_away, soup):
    # get tables; index 0 is the away team's batting table, index 1 the home team's
    idx = {'home':1, 'away':0}
    hitting = soup.find_all('table', {'id':re.compile('.*batting.*')})
    html = str(hitting[idx[home_away]])
    # StringIO avoids passing a literal HTML string, which recent pandas deprecates
    df = pd.read_html(StringIO(html))[0]
    print(df)
    return df
# function to process game
def process_game(gamelink, headers):
    # get boxscore page; pass headers by keyword, because the second
    # positional argument of requests.get is params
    html = requests.get(gamelink, headers=headers).text
    # strip the comment markers so the commented-out tables become regular markup
    html = html.replace('<!--', '').replace('-->', '')
    soup = BeautifulSoup(html, 'html.parser')
    home_hitting = get_hitting_stats("home", soup)
    away_hitting = get_hitting_stats("away", soup)
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
process_game("https://www.baseball-reference.com/boxes/CLE/CLE202208151.shtml", headers)
Output:
Batting AB R H RBI ... acLI RE24 PO A Details
0 Steven Kwan LF 4.0 0.0 1.0 0.0 ... 1.76 0.1 0.0 0.0 NaN
1 Amed Rosario SS 4.0 1.0 0.0 0.0 ... 1.49 -1.5 0.0 1.0 NaN
2 Jose Ramirez 3B 4.0 0.0 2.0 1.0 ... 1.72 0.5 0.0 0.0 NaN
3 Andres Gimenez 2B 4.0 1.0 3.0 3.0 ... 1.84 3.6 0.0 4.0 HR,2B
4 Oscar Gonzalez RF 4.0 0.0 2.0 0.0 ... 1.38 0.3 3.0 0.0 2B
5 Owen Miller 1B 3.0 0.0 0.0 0.0 ... 1.71 -1.1 5.0 3.0 GDP
6 Nolan Jones DH 4.0 0.0 0.0 0.0 ... 1.77 -1.6 NaN NaN NaN
7 Austin Hedges C 3.0 0.0 0.0 0.0 ... 1.77 -1.2 13.0 0.0 NaN
8 Myles Straw CF 3.0 2.0 1.0 0.0 ... 1.09 1.0 3.0 0.0 2·SB
9 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
10 Aaron Civale P NaN NaN NaN NaN ... NaN NaN 1.0 0.0 NaN
11 James Karinchak P NaN NaN NaN NaN ... NaN NaN 0.0 0.0 NaN
12 Trevor Stephan P NaN NaN NaN NaN ... NaN NaN 2.0 0.0 NaN
13 Emmanuel Clase P NaN NaN NaN NaN ... NaN NaN 0.0 0.0 NaN
14 Team Totals 33.0 4.0 9.0 4.0 ... 1.59 0.3 27.0 8.0 NaN
[15 rows x 24 columns]
Batting AB R H RBI ... acLI RE24 PO A Details
0 Riley Greene CF 4.0 1.0 1.0 0.0 ... 0.0 -0.1 3.0 0.0 NaN
1 Victor Reyes RF 3.0 0.0 1.0 0.0 ... 0.0 0.0 1.0 0.0 SB
2 Javier Baez SS 4.0 0.0 1.0 0.0 ... 0.0 0.2 2.0 3.0 2B
3 Harold Castro 1B 4.0 0.0 0.0 1.0 ... 0.0 -0.7 7.0 0.0 NaN
4 Miguel Cabrera DH 4.0 0.0 0.0 0.0 ... 0.0 -0.8 NaN NaN NaN
5 Jeimer Candelario 3B 3.0 0.0 0.0 0.0 ... 0.0 -0.5 1.0 1.0 NaN
6 Eric Haase C 2.0 0.0 0.0 0.0 ... 0.0 -0.3 6.0 1.0 NaN
7 Jonathan Schoop 2B 3.0 0.0 0.0 0.0 ... 0.0 -0.5 1.0 2.0 NaN
8 Akil Baddoo LF 3.0 0.0 0.0 0.0 ... 0.0 -0.5 3.0 0.0 NaN
9 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
10 Drew Hutchison P NaN NaN NaN NaN ... NaN NaN 0.0 0.0 NaN
11 Will Vest P NaN NaN NaN NaN ... NaN NaN 0.0 1.0 NaN
12 Andrew Chafin P NaN NaN NaN NaN ... NaN NaN 0.0 0.0 NaN
13 Wily Peralta P NaN NaN NaN NaN ... NaN NaN 0.0 0.0 NaN
14 Team Totals 30.0 1.0 3.0 1.0 ... 0.0 -3.2 24.0 8.0 NaN
[15 rows x 24 columns]
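One detail in the question's code is worth a second look: `requests.get(gamelink, headers)` passes the dict positionally, and the second positional parameter of `requests.get` is `params`, not `headers`. The sketch below shows the difference without sending anything over the network, by building prepared requests instead. The URL and the User-Agent string are placeholders, not values from the question.

```python
import requests

# Placeholder URL and header value, for illustration only
url = "https://example.com/boxes/game.shtml"
headers = {"User-Agent": "my-scraper/0.1"}

# Equivalent to requests.get(url, headers): the dict lands in `params`
# and is encoded into the query string
positional = requests.Request("GET", url, params=headers).prepare()

# Equivalent to requests.get(url, headers=headers): the dict is sent as headers
keyword = requests.Request("GET", url, headers=headers).prepare()

print(positional.url)                    # query string contains User-Agent=...
print(keyword.url)                       # URL is unchanged
print(keyword.headers["User-Agent"])
```

So if a site seems to ignore your User-Agent, it is worth checking that the dict is passed as `headers=...`.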

Having issues trying to make my dataframe numeric

I have a local SQLite database that I read into my program as pandas DataFrames using:
""" Separating hitters and pitchers """
pitchers = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE BF_y >= 20 AND BF_x >= 20", northwoods_db)
hitters = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE PA_y >= 25 AND PA_x >= 25", northwoods_db)
But when I do this, some of the numbers are not numeric. Here is a head of one of the dataframes:
index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21 -0.3 Hillsdale GMAC NCAA None 5 None ... 4.0 None 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 -- None Duke ACC NCAA None 15 None ... 13 0 1 88 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21 0.1 Wisconsin-Milwaukee Horz NCAA None 8 None ... 1.0 0.0 2.0 21.0 2.25 9.0 0.0 11.3 11.3 1.0
3 357 2017 22 1.0 Nova Southeastern SSC NCAA None 15.0 None ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21 -0.4 Creighton BigE NCAA None 4 None ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25
To make the dataframes numeric, I used these lines of code:
hitters = hitters.apply(pd.to_numeric, errors='coerce')
pitchers = pitchers.apply(pd.to_numeric, errors='coerce')
But after that, the head of each dataframe is full of NaNs. It seems to have wiped out all of the string values, which I want to keep.
index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21.0 -0.3 NaN NaN NaN NaN 5.0 NaN ... 4.0 NaN 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 NaN NaN NaN NaN NaN NaN 15.0 NaN ... 13.0 0.0 1.0 88.0 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21.0 0.1 NaN NaN NaN NaN 8.0 NaN ... 1.0 0.0 2.0 21.0 2.250 9.0 0.0 11.3 11.3 1.00
3 357 2017 22.0 1.0 NaN NaN NaN NaN 15.0 NaN ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21.0 -0.4 NaN NaN NaN NaN 4.0 NaN ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25
Is there a better way to make the number values numeric and keep all my string columns? Maybe there is an SQLite function that can do it better? I am not sure; any help is appreciated.

Maybe you can use combine_first:
hitters_new = hitters.apply(pd.to_numeric, errors='coerce').combine_first(hitters)
pitchers_new = pitchers.apply(pd.to_numeric, errors='coerce').combine_first(pitchers)

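You can try astype or convert_dtypes. astype accepts a dict that maps column names to dtypes, which works if you already know which columns are numeric and which ones are strings. convert_dtypes takes no list of columns; it infers a dtype for every column from the values already stored, so it will not parse numbers that are stored as strings. Otherwise, take a look at this thread to do this automatically.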
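If you don't know in advance which columns are numeric, one option is to test each column and convert only those that mostly parse as numbers. This is just a sketch: the helper name `to_numeric_where_possible` and the 0.5 threshold are my own choices, and the sample frame is a cut-down stand-in for the data in the question.

```python
import pandas as pd

# Cut-down stand-in for the frame in the question; everything is stored as text
df = pd.DataFrame({
    "Year": ["2020", "2018", "2019"],
    "Age_x": ["21", "--", "21"],
    "Tm_x": ["Hillsdale", "Duke", "Creighton"],
    "WHIP_y": ["1.132", "2.111", "2.25"],
})

def to_numeric_where_possible(frame, min_numeric_fraction=0.5):
    """Convert a column only if enough of its non-null values parse as numbers."""
    out = frame.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors="coerce")
        non_null = out[col].notna().sum()
        if non_null and converted.notna().sum() / non_null >= min_numeric_fraction:
            # Mostly numeric: keep the converted column (stray text becomes NaN)
            out[col] = converted
        # Otherwise leave the column untouched, so string columns survive
    return out

clean = to_numeric_where_possible(df)
print(clean.dtypes)
print(clean)
```

Compared with combine_first, a column like Age_x ends up fully numeric here (the "--" becomes NaN) instead of staying a mix of numbers and strings, while a column like Tm_x is left exactly as it was.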

How to Parse the MLB Team and Player data using Pandas DataFrame?

I am still learning and could use some help. I would like to parse the starting pitchers and their respective teams.
I would like the data in a Pandas Dataframe but do not know how to parse the data correctly. Any suggestions would be very helpful. Thanks for your time!
Here is an example of the desired output:
Game Team Name
OAK Chris Bassitt
1
ARI Zac Gallen
SEA Justin Dunn
2
LAD Ross Stripling
Here is my code:
#url = https://www.baseball-reference.com/previews/index.shtml
#Data needed: 1) Team 2) Pitcher Name
import pandas as pd
url = 'https://www.baseball-reference.com/previews/index.shtml'
test = pd.read_html(url)
for t in test:
    name = t[1]
    team = t[0]
    print(team)
    print(name)
I feel like I have to create a Pandas DataFrame and append the Team and Name, however, I am not sure how to parse out just the desired output.

pandas.read_html returns a list of all the tables for a given URL.
Dataframes in the list can be selected using normal list slicing and selecting methods.
import pandas as pd
url = 'https://www.baseball-reference.com/previews/index.shtml'
list_of_dataframes = pd.read_html(url)
# select and combine the dataframes for games; every other dataframe from 0 (even)
games = pd.concat(list_of_dataframes[0::2])
# display(games.head())
0 1 2
0 Cubs (13-6) NaN Preview
1 Cardinals (4-4) NaN 12:00AM
0 Cardinals (4-4) NaN Preview
1 Cubs (13-6) NaN 5:15PM
0 Red Sox (6-16) NaN Preview
# select the players from list_of_dataframes; every other dataframe from 1 (odd)
players = list_of_dataframes[1::2]
# add the Game to the dataframes
for i, df in enumerate(players, 1):
    df['Game'] = i
    players[i-1] = df
# combine all the dataframes
players = pd.concat(players).reset_index(drop=True)
# create a players column for the name only
players['name'] = players[1].str.split('(', expand=True)[0]
# rename the column
players.rename(columns={0: 'Team'}, inplace=True)
# drop 1
players.drop(columns=[1], inplace=True)
# display(players.head(6))
Team Game name
0 CHC 1 Tyson Miller
1 STL 1 Alex Reyes
2 STL 2 Kwang Hyun Kim
3 CHC 2 Kyle Hendricks
4 BOS 3 Martin Perez
5 NYY 3 Jordan Montgomery
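The slicing and numbering steps can be tried offline with a hand-built list in place of the read_html result. In this sketch the frames are made up: the team codes and names come from the output above, but the text in parentheses after each name is a placeholder, since I'm only assuming the page puts something like a record there.

```python
import pandas as pd

# Made-up stand-in for pd.read_html(url): game tables at even positions,
# pitcher tables at odd positions
list_of_dataframes = [
    pd.DataFrame([["Cubs (13-6)", None, "Preview"], ["Cardinals (4-4)", None, "12:00AM"]]),
    pd.DataFrame([["CHC", "Tyson Miller (0-0, 0.00)"], ["STL", "Alex Reyes (0-0, 0.00)"]]),
    pd.DataFrame([["Cardinals (4-4)", None, "Preview"], ["Cubs (13-6)", None, "5:15PM"]]),
    pd.DataFrame([["STL", "Kwang Hyun Kim (0-0, 0.00)"], ["CHC", "Kyle Hendricks (0-0, 0.00)"]]),
]

# Every other frame starting at position 1 holds the pitchers
pitcher_tables = list_of_dataframes[1::2]

# Number the games from 1 without modifying the original frames
frames = [df.assign(Game=i) for i, df in enumerate(pitcher_tables, start=1)]
players = pd.concat(frames, ignore_index=True)

# Keep only the text before the first "(" as the name
players["Name"] = players[1].str.split("(").str[0].str.strip()
players = players.rename(columns={0: "Team"}).drop(columns=[1])
players = players[["Game", "Team", "Name"]]

print(players)
```

Using assign and ignore_index=True is just a slightly more compact way to do the same thing as the loop and reset_index above.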

Love those Sports Reference sites. Trenton's solution is perfect, so don't change the accepted answer, but I wanted to offer an alternative data source for probable pitchers in case you're interested.
It looks like mlb.com has a publicly available API for that information (I assume that may be where baseball-reference gets the data for its probable pitchers page). What I like about it is that it returns much more data to analyze, and it lets you request a wider date range, so you can get historical data and possibly probable pitchers 2 or 3 days in advance (as well as for the current day). So look this code over too, play with it, and practice with it.
It could also set you up for a first machine learning project.
PS: Let me know if you figure out what strikeZoneBottom and strikeZoneTop mean here, if you look into this data. I haven't been able to figure that out.
I'm also wondering whether there's data on the ballparks. For example, the pitcher stats include the fly ball:ground ball ratio. With ballpark data, you might see that a fly-ball pitcher in a venue that yields lots of home runs fares differently than in a ballpark where fly balls don't travel as far or the fences are deeper (home runs essentially turn into warning-track fly outs, and vice versa).
Code:
import requests
import pandas as pd
from datetime import datetime, timedelta
url = 'https://statsapi.mlb.com/api/v1/schedule'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
yesterday = datetime.strftime(datetime.now() - timedelta(1), '%Y-%m-%d')
today = datetime.strftime(datetime.now(), '%Y-%m-%d')
tomorrow = datetime.strftime(datetime.now() + timedelta(1), '%Y-%m-%d')
#To get 7 days earlier; notice the minus sign
#pastDate = datetime.strftime(datetime.now() - timedelta(7), '%Y-%m-%d')
#To get 3 days later; notice the plus sign
#futureDate = datetime.strftime(datetime.now() + timedelta(3), '%Y-%m-%d')
#hydrate parameter is to get back certain data elements. Not sure how to alter it exactly yet, would have to play around
#But without hydrate, it doesn't return probable pitchers
payload = {
'sportId': '1',
'startDate': today, #<-- Change these to get a wider range of games (to also get historical stats for machine learning)
'endDate': today, #<-- Change these to get a wider range of games (to possible probable pitchers for next few days. just need to adjust timedelta above)
'hydrate': 'team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),venue(location),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)'}
jsonData = requests.get(url, headers=headers, params=payload).json()
dates = jsonData['dates']
rows = []
for date in dates:
    games = date['games']
    for game in games:
        dayNight = game['dayNight']
        gameDate = game['gameDate']
        city = game['venue']['location']['city']
        venue = game['venue']['name']
        teams = game['teams']
        # one row per side of the game (away and home)
        for k, v in teams.items():
            row = {}
            row.update({'dayNight':dayNight,
                        'gameDate':gameDate,
                        'city':city,
                        'venue':venue})
            homeAway = k
            teamName = v['team']['name']
            if 'probablePitcher' not in v.keys():
                # no pitcher announced: keep the row with team info only
                row.update({'homeAway':homeAway,
                            'teamName':teamName})
                rows.append(row)
            else:
                probablePitcher = v['probablePitcher']
                fullName = probablePitcher['fullName']
                pitchHand = probablePitcher['pitchHand']['code']
                strikeZoneBottom = probablePitcher['strikeZoneBottom']
                strikeZoneTop = probablePitcher['strikeZoneTop']
                row.update({'homeAway':homeAway,
                            'teamName':teamName,
                            'probablePitcher':fullName,
                            'pitchHand':pitchHand,
                            'strikeZoneBottom':strikeZoneBottom,
                            'strikeZoneTop':strikeZoneTop})
                stats = probablePitcher['stats']
                for stat in stats:
                    # merge the pitcher's single-season pitching stats into the row
                    if stat['type']['displayName'] == 'statsSingleSeason' and stat['group']['displayName'] == 'pitching':
                        playerStats = stat['stats']
                        row.update(playerStats)
                # append once per team, whether or not season stats were found
                rows.append(row)
df = pd.DataFrame(rows)
Output: First 10 rows
print (df.head(10).to_string())
airOuts atBats balks baseOnBalls blownSaves catchersInterference caughtStealing city completeGames dayNight doubles earnedRuns era gameDate gamesFinished gamesPitched gamesPlayed gamesStarted groundOuts groundOutsToAirouts hitBatsmen hitByPitch hits hitsPer9Inn holds homeAway homeRuns homeRunsPer9 inheritedRunners inheritedRunnersScored inningsPitched intentionalWalks losses obp outs pickoffs pitchHand probablePitcher rbi runs runsScoredPer9 sacBunts sacFlies saveOpportunities saves shutouts stolenBasePercentage stolenBases strikeOuts strikeZoneBottom strikeZoneTop strikeoutWalkRatio strikeoutsPer9Inn teamName triples venue walksPer9Inn whip wildPitches winPercentage wins
0 15.0 44.0 0.0 9.0 0.0 0.0 0.0 Baltimore 0.0 day 2.0 8.0 6.00 2020-08-19T17:05:00Z 0.0 3.0 3.0 3.0 9.0 0.60 0.0 0.0 10.0 7.50 0.0 away 3.0 2.25 0.0 0.0 12.0 0.0 1.0 .358 36.0 0.0 R Tanner Roark 0.0 8.0 6.00 0.0 0.0 0.0 0.0 0.0 1.000 1.0 10.0 1.589 3.467 1.11 7.50 Toronto Blue Jays 0.0 Oriole Park at Camden Yards 6.75 1.58 0.0 .500 1.0
1 18.0 74.0 0.0 3.0 0.0 0.0 0.0 Baltimore 0.0 day 5.0 8.0 4.00 2020-08-19T17:05:00Z 0.0 4.0 4.0 4.0 18.0 1.00 1.0 1.0 22.0 11.00 0.0 home 1.0 0.50 0.0 0.0 18.0 0.0 2.0 .329 54.0 1.0 L Tommy Milone 0.0 11.0 5.50 1.0 1.0 0.0 0.0 0.0 1.000 1.0 18.0 1.535 3.371 6.00 9.00 Baltimore Orioles 1.0 Oriole Park at Camden Yards 1.50 1.39 1.0 .333 1.0
2 14.0 59.0 0.0 2.0 0.0 0.0 0.0 Boston 0.0 day 3.0 7.0 4.02 2020-08-19T17:35:00Z 0.0 3.0 3.0 3.0 14.0 1.00 0.0 0.0 17.0 9.77 0.0 away 2.0 1.15 0.0 0.0 15.2 0.0 2.0 .311 47.0 0.0 R Jake Arrieta 0.0 7.0 4.02 0.0 0.0 0.0 0.0 0.0 .--- 0.0 14.0 1.627 3.549 7.00 8.04 Philadelphia Phillies 0.0 Fenway Park 1.15 1.21 2.0 .333 1.0
3 2.0 14.0 1.0 3.0 0.0 0.0 0.0 Boston 0.0 day 1.0 5.0 22.50 2020-08-19T17:35:00Z 0.0 1.0 1.0 1.0 1.0 0.50 0.0 0.0 7.0 31.50 0.0 home 2.0 9.00 0.0 0.0 2.0 0.0 1.0 .588 6.0 0.0 L Kyle Hart 0.0 7.0 31.50 0.0 0.0 0.0 0.0 0.0 .--- 0.0 4.0 1.681 3.575 1.33 18.00 Boston Red Sox 0.0 Fenway Park 13.50 5.00 0.0 .000 0.0
4 8.0 27.0 0.0 0.0 0.0 0.0 0.0 Chicago 0.0 day 0.0 2.0 2.57 2020-08-19T18:20:00Z 0.0 1.0 1.0 1.0 7.0 0.88 0.0 0.0 6.0 7.71 0.0 away 0.0 0.00 0.0 0.0 7.0 0.0 0.0 .222 21.0 0.0 R Jack Flaherty 0.0 2.0 2.57 0.0 0.0 0.0 0.0 0.0 .--- 0.0 6.0 1.627 3.549 -.-- 7.71 St. Louis Cardinals 0.0 Wrigley Field 0.00 0.86 0.0 1.000 1.0
5 13.0 65.0 0.0 6.0 0.0 0.0 1.0 Chicago 0.0 day 2.0 6.0 2.84 2020-08-19T18:20:00Z 0.0 3.0 3.0 3.0 28.0 2.15 1.0 1.0 10.0 4.74 0.0 home 2.0 0.95 0.0 0.0 19.0 0.0 1.0 .236 57.0 0.0 R Alec Mills 0.0 6.0 2.84 0.0 0.0 0.0 0.0 0.0 .000 0.0 14.0 1.627 3.549 2.33 6.63 Chicago Cubs 0.0 Wrigley Field 2.84 0.84 0.0 .667 2.0
6 NaN NaN NaN NaN NaN NaN NaN Chicago NaN night NaN NaN NaN 2020-08-19T03:33:00Z NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN away NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Chicago Cubs NaN Wrigley Field NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN Chicago NaN night NaN NaN NaN 2020-08-19T03:33:00Z NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN home NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN St. Louis Cardinals NaN Wrigley Field NaN NaN NaN NaN NaN
8 13.0 92.0 0.0 8.0 0.0 0.0 1.0 Kansas City 0.0 day 6.0 10.0 3.91 2020-08-19T21:05:00Z 0.0 4.0 4.0 4.0 24.0 1.85 0.0 0.0 25.0 9.78 0.0 away 1.0 0.39 0.0 0.0 23.0 0.0 2.0 .327 69.0 0.0 R Luis Castillo 0.0 12.0 4.70 0.0 1.0 0.0 0.0 0.0 .000 0.0 31.0 1.589 3.467 3.88 12.13 Cincinnati Reds 1.0 Kauffman Stadium 3.13 1.43 0.0 .000 0.0
9 10.0 36.0 0.0 5.0 0.0 0.0 0.0 Kansas City 0.0 day 0.0 0.0 0.00 2020-08-19T21:05:00Z 0.0 2.0 2.0 2.0 11.0 1.10 1.0 1.0 5.0 4.09 0.0 home 0.0 0.00 0.0 0.0 11.0 0.0 0.0 .262 33.0 0.0 R Brad Keller 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 .--- 0.0 10.0 1.681 3.575 2.00 8.18 Kansas City Royals 0.0 Kauffman Stadium 4.09 0.91 0.0 1.000 2.0
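If you want to experiment with the flattening logic without calling the API, here is a small sketch on a hand-written dict. The sample only mimics the nesting that the code above walks through (dates, games, teams, probablePitcher). It is not a real API response, the values are placeholders, and `flatten_schedule` is a name I made up. It uses dict.get so a missing probablePitcher does not need its own branch.

```python
# Hand-written sample with the same nesting as the code above expects.
# NOT a real API response; values are placeholders.
sample = {
    "dates": [{
        "games": [{
            "gameDate": "2020-08-19T17:05:00Z",
            "venue": {"name": "Oriole Park at Camden Yards"},
            "teams": {
                "away": {
                    "team": {"name": "Toronto Blue Jays"},
                    "probablePitcher": {"fullName": "Tanner Roark",
                                        "pitchHand": {"code": "R"}},
                },
                # no probablePitcher key here, as if no pitcher was announced
                "home": {"team": {"name": "Baltimore Orioles"}},
            },
        }]
    }]
}

def flatten_schedule(schedule):
    """Return one flat dict per team per game."""
    rows = []
    for date in schedule.get("dates", []):
        for game in date.get("games", []):
            for side, info in game["teams"].items():
                # Fall back to an empty dict when no pitcher is listed
                pitcher = info.get("probablePitcher", {})
                rows.append({
                    "gameDate": game.get("gameDate"),
                    "venue": game.get("venue", {}).get("name"),
                    "homeAway": side,
                    "teamName": info["team"]["name"],
                    "probablePitcher": pitcher.get("fullName"),
                    "pitchHand": pitcher.get("pitchHand", {}).get("code"),
                })
    return rows

rows = flatten_schedule(sample)
for row in rows:
    print(row["homeAway"], row["teamName"], row["probablePitcher"])
```

The resulting list of dicts can be passed to pd.DataFrame(rows) in the same way as in the code above; missing pitchers show up as None instead of raising a KeyError.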

Having trouble sorting by row index and `sort_index()` isn't working

I'm attempting to sort the row indexes below from largest to smallest:
My first attempt was:
plot_df_dropoff.sort_index(by=["dropoff_latitude"], ascending=False)
But I get a Key Value Error.
My second attempt, based on this link, didn't work either: it returned None.
This seems so simple but I can't figure it out. Any help would be much appreciated.
id
pickup_longitude (-74.03, -74.025] (-74.025, -74.02] (-74.02, -74.015] (-74.015, -74.01] (-74.01, -74.005] (-74.005, -74] (-74, -73.995] (-73.995, -73.99] (-73.99, -73.985] (-73.985, -73.98] ... (-73.82, -73.815] (-73.815, -73.81] (-73.81, -73.805] (-73.805, -73.8] (-73.8, -73.795] (-73.795, -73.79] (-73.79, -73.785] (-73.785, -73.78] (-73.78, -73.775] (-73.775, -73.77]
pickup_latitude
(40.63, 40.64] 5.0 10.0 8.0 2.0 3.0 1.0 NaN 2.0 1.0 1.0 ... NaN NaN NaN NaN 1.0 NaN 7.0 1.0 NaN NaN
(40.64, 40.65] 2.0 2.0 14.0 16.0 2.0 4.0 6.0 3.0 5.0 11.0 ... NaN NaN NaN 149.0 164.0 3580.0 7532.0 11381.0 5596.0 NaN
(40.65, 40.66] NaN NaN NaN 2.0 22.0 41.0 11.0 2.0 4.0 13.0 ... NaN 1.0 146.0 7.0 3.0 201.0 81.0 2.0 1.0 2.0
(40.66, 40.67] NaN NaN NaN NaN NaN 2.0 60.0 143.0 180.0 122.0 ... NaN 4.0 24.0 126.0 15.0 47.0 32.0 4.0 3.0 3.0
(40.67, 40.68] NaN NaN 7.0 44.0 18.0 200.0 328.0 65.0 293.0 590.0 ... 3.0 3.0 1.0 131.0 1.0 1.0 2.0 1.0 1.0 2.0
And here is a smaller segment that might be easier to work with:
id \
pickup_longitude (-74.03, -74.025] (-74.025, -74.02] (-74.02, -74.015]
pickup_latitude
(40.63, 40.64] 5.0 10.0 8.0
(40.64, 40.65] 2.0 2.0 14.0
(40.65, 40.66] NaN NaN NaN
(40.66, 40.67] NaN NaN NaN
(40.67, 40.68] NaN NaN 7.0
(40.68, 40.69] NaN NaN NaN
(40.69, 40.7] NaN 1.0 1.0
(40.7, 40.71] 1.0 1.0 3841.0
(40.71, 40.72] NaN 2.0 6537.0
(40.72, 40.73] NaN NaN NaN
(40.73, 40.74] 9.0 2.0 NaN

You can reset index and sort by values.
Try:
>>> plot_df_dropoff.reset_index().sort_values(by=["dropoff_latitude"], ascending=False)
And as @JohnE mentioned, you can also just use sort_index():
>>> plot_df_dropoff.sort_index(ascending=False)
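Here is a small sketch of the sort_index route on a made-up frame with interval bins as the row index. The "count" column and its values are invented. It also shows one possible reason for the None in the question: with inplace=True the call returns None by design. That is only a guess, since the second attempt isn't shown.

```python
import pandas as pd

# Made-up stand-in for the pivot table: latitude bins as the row index
bins = pd.IntervalIndex.from_breaks([40.63, 40.64, 40.65, 40.66], name="pickup_latitude")
df = pd.DataFrame({"count": [5.0, 2.0, 7.0]}, index=bins)

# sort_index returns a new, sorted DataFrame and leaves df unchanged
sorted_df = df.sort_index(ascending=False)

# With inplace=True the frame itself is sorted and the call returns None
copy_df = df.copy()
returned = copy_df.sort_index(ascending=False, inplace=True)

print(sorted_df)
print(returned)
```

So either assign the result of sort_index to a variable, or use inplace=True and ignore the return value, but not both.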

reshaping a DataFrame with non-unique index

I have the following DataFrame:
In [299]: df
Out[299]:
a b
DATE
2017-05-28 15:01:37 0.0 1.0
2017-05-28 15:01:39 1.0 0.0
2017-05-28 15:01:39 1.0 0.0
2017-05-28 15:01:39 1.0 0.0
2017-05-28 15:01:39 1.0 0.0
2017-05-28 15:01:39 1.0 0.0
2017-05-28 15:01:42 1.0 0.0
2017-05-28 15:02:10 1.0 0.0
2017-05-28 15:02:14 0.0 1.0
2017-05-28 15:02:23 0.0 1.0
2017-05-28 15:02:28 1.0 0.0
2017-05-28 15:02:34 0.0 1.0
2017-05-28 15:02:34 0.0 1.0
I can get the shape I'm looking for by doing the following:
In [300]: xa = df.groupby(df.index).apply(lambda x: x['a'].values)
In [301]: xb = df.groupby(df.index).apply(lambda x: x['b'].values)
In [302]: ya = pd.DataFrame(xa.tolist(), index=xa.index)
In [303]: yb = pd.DataFrame(xb.tolist(), index=xb.index)
In [304]: new_df = pd.concat([ya, yb], axis=1, keys=['a', 'b'])
In [305]: new_df
Out[305]:
a b
0 1 2 3 4 0 1 2 3 4
DATE
2017-05-28 15:01:37 0.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
2017-05-28 15:01:39 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
2017-05-28 15:01:42 1.0 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN
2017-05-28 15:02:10 1.0 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN
2017-05-28 15:02:14 0.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
2017-05-28 15:02:23 0.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
2017-05-28 15:02:28 1.0 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN
2017-05-28 15:02:34 0.0 0.0 NaN NaN NaN 1.0 1.0 NaN NaN NaN
Is there a more efficient way to get the same result?

Append an index level with cumcount
df.set_index(df.groupby(level='DATE').cumcount(), append=True).unstack()
a b
0 1 2 3 4 0 1 2 3 4
DATE
2017-05-28 15:01:37 0.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
2017-05-28 15:01:39 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
2017-05-28 15:01:42 1.0 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN
2017-05-28 15:02:10 1.0 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN
2017-05-28 15:02:14 0.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
2017-05-28 15:02:23 0.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
2017-05-28 15:02:28 1.0 NaN NaN NaN NaN 0.0 NaN NaN NaN NaN
2017-05-28 15:02:34 0.0 0.0 NaN NaN NaN 1.0 1.0 NaN NaN NaN
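To make the one-liner easier to follow, here is the same idea split into steps on a shortened, made-up copy of the data. cumcount numbers the repeats within each timestamp (0, 1, 2, ...), which makes the combined index unique, and unstack then turns that counter into a column level.

```python
import pandas as pd

# Shortened stand-in for the frame in the question (duplicate timestamps)
stamps = pd.to_datetime([
    "2017-05-28 15:01:37",
    "2017-05-28 15:01:39",
    "2017-05-28 15:01:39",
    "2017-05-28 15:01:39",
    "2017-05-28 15:02:34",
    "2017-05-28 15:02:34",
])
df = pd.DataFrame(
    {"a": [0.0, 1.0, 1.0, 1.0, 0.0, 0.0], "b": [1.0, 0.0, 0.0, 0.0, 1.0, 1.0]},
    index=pd.Index(stamps, name="DATE"),
)

# Position of each row within its group of identical timestamps
occurrence = df.groupby(level="DATE").cumcount()

# (DATE, occurrence) is unique, so the occurrence level can be unstacked
wide = df.set_index(occurrence, append=True).unstack()

print(occurrence.tolist())
print(wide)
```

Timestamps with fewer repeats than the maximum are padded with NaN, as in the output above.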
