So I have a sqlite local database, I read it into my program as a pandas dataframe using
""" Seperating hitters and pitchers """
pitchers = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE BF_y >= 20 AND BF_x >= 20", northwoods_db)
hitters = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE PA_y >= 25 AND PA_x >= 25", northwoods_db)
But when I do this, some of the numbers are not numeric. Here is a head of one of the dataframes:
index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21 -0.3 Hillsdale GMAC NCAA None 5 None ... 4.0 None 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 -- None Duke ACC NCAA None 15 None ... 13 0 1 88 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21 0.1 Wisconsin-Milwaukee Horz NCAA None 8 None ... 1.0 0.0 2.0 21.0 2.25 9.0 0.0 11.3 11.3 1.0
3 357 2017 22 1.0 Nova Southeastern SSC NCAA None 15.0 None ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21 -0.4 Creighton BigE NCAA None 4 None ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25
When I try to make the dataframe numeric, I used this line of code:
hitters = hitters.apply(pd.to_numeric, errors='coerce')
pitchers = pitchers.apply(pd.to_numeric, errors='coerce')
But when I did that, the new head of the dataframes is full of NaN's, it seems like it got rid of all of the string values but I want to keep those.
index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21.0 -0.3 NaN NaN NaN NaN 5.0 NaN ... 4.0 NaN 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 NaN NaN NaN NaN NaN NaN 15.0 NaN ... 13.0 0.0 1.0 88.0 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21.0 0.1 NaN NaN NaN NaN 8.0 NaN ... 1.0 0.0 2.0 21.0 2.250 9.0 0.0 11.3 11.3 1.00
3 357 2017 22.0 1.0 NaN NaN NaN NaN 15.0 NaN ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21.0 -0.4 NaN NaN NaN NaN 4.0 NaN ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25
Is there a better way to makethe number values numeric and keep all my string columns? Maybe there is an sqlite function that can do it better? I am not sure, any help is appriciated.
Maybe you can use combine_first:
hitters_new = hitters.apply(pd.to_numeric, errors='coerce').combine_first(hitters)
pitchers_new = pitchers.apply(pd.to_numeric, errors='coerce').combine_first(pitchers)
You can try using astype or convert_dtypes. They both take an argument which is the columns you want to convert, if you already know which columns are numeric and which ones are strings that can work. Otherwise, take a look at this thread to do this automatically.
Related
I am still learning and could use some help. I would like to parse the starting pitchers and their respective teams.
I would like the data in a Pandas Dataframe but do not know how to parse the data correctly. Any suggestions would be very helpful. Thanks for your time!
Here is an example of the desired output:
Game Team Name
OAK Chris Bassitt
1
ARI Zac Gallen
SEA Justin Dunn
2
LAD Ross Stripling
Here is my code:
#url = https://www.baseball-reference.com/previews/index.shtml
#Data needed: 1) Team 2) Pitcher Name
import pandas as pd
url = 'https://www.baseball-reference.com/previews/index.shtml'
test = pd.read_html(url)
for t in test:
name = t[1]
team = t[0]
print(team)
print(name)
I feel like I have to create a Pandas DataFrame and append the Team and Name, however, I am not sure how to parse out just the desired output.
pandas.read_html returns a list of all the tables for a given URL
dataframes in the list can be selected using normal list slicing and selecting methods
import pandas as pd
url = 'https://www.baseball-reference.com/previews/index.shtml'
list_of_dataframes = pd.read_html(url)
# select and combine the dataframes for games; every other dataframe from 0 (even)
games = pd.concat(list_of_dataframes[0::2])
# display(games.head())
0 1 2
0 Cubs (13-6) NaN Preview
1 Cardinals (4-4) NaN 12:00AM
0 Cardinals (4-4) NaN Preview
1 Cubs (13-6) NaN 5:15PM
0 Red Sox (6-16) NaN Preview
# select the players from list_of_dataframes; every other dataframe from 1 (odd)
players = list_of_dataframes[1::2]
# add the Game to the dataframes
for i, df in enumerate(players, 1):
df['Game'] = i
players[i-1] = df
# combine all the dataframe
players = pd.concat(players).reset_index(drop=True)
# create a players column for the name only
players['name'] = players[1].str.split('(', expand=True)[0]
# rename the colume
players.rename(columns={0: 'Team'}, inplace=True)
# drop 1
players.drop(columns=[1], inplace=True)
# display(players.head(6))
Team Game name
0 CHC 1 Tyson Miller
1 STL 1 Alex Reyes
2 STL 2 Kwang Hyun Kim
3 CHC 2 Kyle Hendricks
4 BOS 3 Martin Perez
5 NYY 3 Jordan Montgomery
Love those sports reference.com sites. Trenton's solution is perfect, so don't change the accepted answer, but just wanted to throw this alternative data source for probable pitchers incase you were interested.
Looks like mlb.com has a publicly available api to pull that info (I'm going to assume that's possibly where baseball-reference fills their probable pitcher page). But what I like about this is you can get much more data returned to analyse, and it gives you the option to get a wider date range to get historical data, and possibly probable pitchers 2 or 3 days in advance (as well as day of). So give this code a look over too, play with it, practice with it.
But this could set you up to your first machine learning sort of thing.
PS: Let me know if you figure out what strikeZoneBottom and strikeZoneTop means here if you even bother to look into this data. I haven't been able to figure out what those mean.
I'm also wondering too, if there's data regarding the ballpark. Like in the pitchers stats there's the fly ball:ground ball ratio. If there was data on the ballparks like if you have flyball pitcher in a venue that yields lots of homeruns, that you might see a different situation for that same pitcher in a ballpark where flyballs don't quite travel as far, or the stadium has deeper fences (essentially homeruns turn into warning track fly out and vice versa)??
Code:
import requests
import pandas as pd
from datetime import datetime, timedelta
url = 'https://statsapi.mlb.com/api/v1/schedule'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
yesterday = datetime.strftime(datetime.now() - timedelta(1), '%Y-%m-%d')
today = datetime.strftime(datetime.now(), '%Y-%m-%d')
tomorrow = datetime.strftime(datetime.now() + timedelta(1), '%Y-%m-%d')
#To get 7 days earlier; notice the minus sign
#pastDate = datetime.strftime(datetime.now() - timedelta(7), '%Y-%m-%d')
#To get 3 days later; notice the plus sign
#futureDate = datetime.strftime(datetime.now() + timedelta(3), '%Y-%m-%d')
#hydrate parameter is to get back certain data elements. Not sure how to alter it exactly yet, would have to play around
#But without hydrate, it doesn't return probable pitchers
payload = {
'sportId': '1',
'startDate': today, #<-- Change these to get a wider range of games (to also get historical stats for machine learning)
'endDate': today, #<-- Change these to get a wider range of games (to possible probable pitchers for next few days. just need to adjust timedelta above)
'hydrate': 'team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),venue(location),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)'}
jsonData = requests.get(url, headers=headers, params=payload).json()
dates = jsonData['dates']
rows = []
for date in dates:
games = date['games']
for game in games:
dayNight = game['dayNight']
gameDate = game['gameDate']
city = game['venue']['location']['city']
venue = game['venue']['name']
teams = game['teams']
for k, v in teams.items():
row = {}
row.update({'dayNight':dayNight,
'gameDate':gameDate,
'city':city,
'venue':venue})
homeAway = k
teamName = v['team']['name']
if 'probablePitcher' not in v.keys():
row.update({'homeAway':homeAway,
'teamName':teamName})
rows.append(row)
else:
probablePitcher = v['probablePitcher']
fullName = probablePitcher['fullName']
pitchHand = probablePitcher['pitchHand']['code']
strikeZoneBottom = probablePitcher['strikeZoneBottom']
strikeZoneTop = probablePitcher['strikeZoneTop']
row.update({'homeAway':homeAway,
'teamName':teamName,
'probablePitcher':fullName,
'pitchHand':pitchHand,
'strikeZoneBottom':strikeZoneBottom,
'strikeZoneTop':strikeZoneTop})
stats = probablePitcher['stats']
for stat in stats:
if stat['type']['displayName'] == 'statsSingleSeason' and stat['group']['displayName'] == 'pitching':
playerStats = stat['stats']
row.update(playerStats)
rows.append(row)
df = pd.DataFrame(rows)
Output: First 10 rows
print (df.head(10).to_string())
airOuts atBats balks baseOnBalls blownSaves catchersInterference caughtStealing city completeGames dayNight doubles earnedRuns era gameDate gamesFinished gamesPitched gamesPlayed gamesStarted groundOuts groundOutsToAirouts hitBatsmen hitByPitch hits hitsPer9Inn holds homeAway homeRuns homeRunsPer9 inheritedRunners inheritedRunnersScored inningsPitched intentionalWalks losses obp outs pickoffs pitchHand probablePitcher rbi runs runsScoredPer9 sacBunts sacFlies saveOpportunities saves shutouts stolenBasePercentage stolenBases strikeOuts strikeZoneBottom strikeZoneTop strikeoutWalkRatio strikeoutsPer9Inn teamName triples venue walksPer9Inn whip wildPitches winPercentage wins
0 15.0 44.0 0.0 9.0 0.0 0.0 0.0 Baltimore 0.0 day 2.0 8.0 6.00 2020-08-19T17:05:00Z 0.0 3.0 3.0 3.0 9.0 0.60 0.0 0.0 10.0 7.50 0.0 away 3.0 2.25 0.0 0.0 12.0 0.0 1.0 .358 36.0 0.0 R Tanner Roark 0.0 8.0 6.00 0.0 0.0 0.0 0.0 0.0 1.000 1.0 10.0 1.589 3.467 1.11 7.50 Toronto Blue Jays 0.0 Oriole Park at Camden Yards 6.75 1.58 0.0 .500 1.0
1 18.0 74.0 0.0 3.0 0.0 0.0 0.0 Baltimore 0.0 day 5.0 8.0 4.00 2020-08-19T17:05:00Z 0.0 4.0 4.0 4.0 18.0 1.00 1.0 1.0 22.0 11.00 0.0 home 1.0 0.50 0.0 0.0 18.0 0.0 2.0 .329 54.0 1.0 L Tommy Milone 0.0 11.0 5.50 1.0 1.0 0.0 0.0 0.0 1.000 1.0 18.0 1.535 3.371 6.00 9.00 Baltimore Orioles 1.0 Oriole Park at Camden Yards 1.50 1.39 1.0 .333 1.0
2 14.0 59.0 0.0 2.0 0.0 0.0 0.0 Boston 0.0 day 3.0 7.0 4.02 2020-08-19T17:35:00Z 0.0 3.0 3.0 3.0 14.0 1.00 0.0 0.0 17.0 9.77 0.0 away 2.0 1.15 0.0 0.0 15.2 0.0 2.0 .311 47.0 0.0 R Jake Arrieta 0.0 7.0 4.02 0.0 0.0 0.0 0.0 0.0 .--- 0.0 14.0 1.627 3.549 7.00 8.04 Philadelphia Phillies 0.0 Fenway Park 1.15 1.21 2.0 .333 1.0
3 2.0 14.0 1.0 3.0 0.0 0.0 0.0 Boston 0.0 day 1.0 5.0 22.50 2020-08-19T17:35:00Z 0.0 1.0 1.0 1.0 1.0 0.50 0.0 0.0 7.0 31.50 0.0 home 2.0 9.00 0.0 0.0 2.0 0.0 1.0 .588 6.0 0.0 L Kyle Hart 0.0 7.0 31.50 0.0 0.0 0.0 0.0 0.0 .--- 0.0 4.0 1.681 3.575 1.33 18.00 Boston Red Sox 0.0 Fenway Park 13.50 5.00 0.0 .000 0.0
4 8.0 27.0 0.0 0.0 0.0 0.0 0.0 Chicago 0.0 day 0.0 2.0 2.57 2020-08-19T18:20:00Z 0.0 1.0 1.0 1.0 7.0 0.88 0.0 0.0 6.0 7.71 0.0 away 0.0 0.00 0.0 0.0 7.0 0.0 0.0 .222 21.0 0.0 R Jack Flaherty 0.0 2.0 2.57 0.0 0.0 0.0 0.0 0.0 .--- 0.0 6.0 1.627 3.549 -.-- 7.71 St. Louis Cardinals 0.0 Wrigley Field 0.00 0.86 0.0 1.000 1.0
5 13.0 65.0 0.0 6.0 0.0 0.0 1.0 Chicago 0.0 day 2.0 6.0 2.84 2020-08-19T18:20:00Z 0.0 3.0 3.0 3.0 28.0 2.15 1.0 1.0 10.0 4.74 0.0 home 2.0 0.95 0.0 0.0 19.0 0.0 1.0 .236 57.0 0.0 R Alec Mills 0.0 6.0 2.84 0.0 0.0 0.0 0.0 0.0 .000 0.0 14.0 1.627 3.549 2.33 6.63 Chicago Cubs 0.0 Wrigley Field 2.84 0.84 0.0 .667 2.0
6 NaN NaN NaN NaN NaN NaN NaN Chicago NaN night NaN NaN NaN 2020-08-19T03:33:00Z NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN away NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Chicago Cubs NaN Wrigley Field NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN Chicago NaN night NaN NaN NaN 2020-08-19T03:33:00Z NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN home NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN St. Louis Cardinals NaN Wrigley Field NaN NaN NaN NaN NaN
8 13.0 92.0 0.0 8.0 0.0 0.0 1.0 Kansas City 0.0 day 6.0 10.0 3.91 2020-08-19T21:05:00Z 0.0 4.0 4.0 4.0 24.0 1.85 0.0 0.0 25.0 9.78 0.0 away 1.0 0.39 0.0 0.0 23.0 0.0 2.0 .327 69.0 0.0 R Luis Castillo 0.0 12.0 4.70 0.0 1.0 0.0 0.0 0.0 .000 0.0 31.0 1.589 3.467 3.88 12.13 Cincinnati Reds 1.0 Kauffman Stadium 3.13 1.43 0.0 .000 0.0
9 10.0 36.0 0.0 5.0 0.0 0.0 0.0 Kansas City 0.0 day 0.0 0.0 0.00 2020-08-19T21:05:00Z 0.0 2.0 2.0 2.0 11.0 1.10 1.0 1.0 5.0 4.09 0.0 home 0.0 0.00 0.0 0.0 11.0 0.0 0.0 .262 33.0 0.0 R Brad Keller 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 .--- 0.0 10.0 1.681 3.575 2.00 8.18 Kansas City Royals 0.0 Kauffman Stadium 4.09 0.91 0.0 1.000 2.0
My sample code is as follow:
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
I'm trying to interpolate various segments which contain the value 'nan'.
For context, I'm trying to track bus speeds using GPS data provided by the city (São Paulo, Brazil), but the data is scarce and with parts that do not provide the information, as the e.g., but there're segments which I know for a fact that they are stopped, such as dawn, but the information come as 'nan' as well.
What I need:
I've been experimenting with dataframe.interpolate() parameters (limit and limit_diretcion) but came up short. If I set df.interpolate(limit=2) I will not only interpolate the data that I need but the data where it shouldn't. So I need to interpolate between sections defined by a limit
Desired output:
Out[7]:
col1 col2 col3
0 1.0 20.00 15.00
1 nan nan nan
2 nan nan nan
3 nan nan nan
4 5.0 22.00 10.00
5 6.0 23.50 12.00
6 7.0 25.00 14.00
7 8.0 27.50 13.50
8 9.0 30.00 13.00
9 nan nan nan
10 nan nan nan
11 nan nan nan
12 13.0 25.00 9.00
The logic that I've been trying to apply is basically trying to find nan's and calculating the difference between their indexes and so createing a new dataframe_temp to interpolate and only than add it to another creating a new dataframe_final. But this has become hard to achieve due to the fact that 'nan'=='nan' return False
This is a hack but may still be useful. Likely Pandas 0.23 will have a better solution.
https://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#dataframe-interpolate-has-gained-the-limit-area-kwarg
df_fw = df.interpolate(limit=1)
df_bk = df.interpolate(limit=1, limit_direction='backward')
df_fw.where(df_bk.notna())
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Not a Hack
More legitimate way of handling it.
Generalized to handle any limit.
def interp(df, limit):
d = df.notna().rolling(limit + 1).agg(any).fillna(1)
d = pd.concat({
i: d.shift(-i).fillna(1)
for i in range(limit + 1)
}).prod(level=1)
return df.interpolate(limit=limit).where(d.astype(bool))
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
Can also handle variation in NaN from column to column. Consider a different df
dictx = {'col1':[1,'nan','nan','nan',5,'nan','nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan','nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9,'nan']}
df = pd.DataFrame(dictx).astype(float)
df
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN NaN NaN
6 NaN 25.0 14.0
7 7.0 NaN NaN
8 NaN NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 NaN
Then with limit=1
df.pipe(interp, 1)
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 NaN 23.5 12.0
6 NaN 25.0 14.0
7 7.0 NaN 13.5
8 8.0 NaN 13.0
9 9.0 30.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.0 25.0 9.0
And with limit=2
df.pipe(interp, 2).round(2)
col1 col2 col3
0 1.00 20.00 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.00 22.00 10.0
5 5.67 23.50 12.0
6 6.33 25.00 14.0
7 7.00 26.67 13.5
8 8.00 28.33 13.0
9 9.00 30.00 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN 9.0
13 13.00 25.00 9.0
Here is a way to selectively ignore rows which are consecutive runs of NaNs whose length is greater than a certain size (given by limit):
import numpy as np
import pandas as pd
dictx = {'col1':[1,'nan','nan','nan',5,'nan',7,'nan',9,'nan','nan','nan',13],\
'col2':[20,'nan','nan','nan',22,'nan',25,'nan',30,'nan','nan','nan',25],\
'col3':[15,'nan','nan','nan',10,'nan',14,'nan',13,'nan','nan','nan',9]}
df = pd.DataFrame(dictx).astype(float)
limit = 2
notnull = pd.notnull(df).all(axis=1)
# assign group numbers to the rows of df. Each group starts with a non-null row,
# followed by null rows
group = notnull.cumsum()
# find the index of groups having length > limit
ignore = (df.groupby(group).filter(lambda grp: len(grp)>limit)).index
# only ignore rows which are null
ignore = df.loc[~notnull].index.intersection(ignore)
keep = df.index.difference(ignore)
# interpolate only the kept rows
df.loc[keep] = df.loc[keep].interpolate()
print(df)
prints
col1 col2 col3
0 1.0 20.0 15.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 5.0 22.0 10.0
5 6.0 23.5 12.0
6 7.0 25.0 14.0
7 8.0 27.5 13.5
8 9.0 30.0 13.0
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 13.0 25.0 9.0
By changing the value of limit you can control how big the group has to be before it should be ignored.
This is a partial answer.
for i in list(df):
for x in range(len(df[i])):
if not df[i][x] > -100:
df[i][x] = 0
df
col1 col2 col3
0 1.0 20.0 15.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 5.0 22.0 10.0
5 0.0 0.0 0.0
6 7.0 25.0 14.0
7 0.0 0.0 0.0
8 9.0 30.0 13.0
9 0.0 0.0 0.0
10 0.0 0.0 0.0
11 0.0 0.0 0.0
12 13.0 25.0 9.0
Now,
df["col1"][1] == df["col2"][1]
True
I have a pandas'DataFrame, it looks like this:
# Output
# A B C D
# 0 3.0 6.0 7.0 4.0
# 1 42.0 44.0 1.0 3.0
# 2 4.0 2.0 3.0 62.0
# 3 90.0 83.0 53.0 23.0
# 4 22.0 23.0 24.0 NaN
# 5 5.0 2.0 5.0 34.0
# 6 NaN NaN NaN NaN
# 7 NaN NaN NaN NaN
# 8 2.0 12.0 65.0 1.0
# 9 5.0 7.0 32.0 7.0
# 10 2.0 13.0 6.0 12.0
# 11 NaN NaN NaN NaN
# 12 23.0 NaN 23.0 34.0
# 13 61.0 NaN 63.0 3.0
# 14 32.0 43.0 12.0 76.0
# 15 24.0 2.0 34.0 2.0
What I would like to do is fill the NaN's with the earliest preceding row's B value. Apart from Column D, on this row, I would like NaN's replaced with zeros.
I've looked into ffill, fillna.. neither seem to be able to do the job.
My solution so far:
def fix_abc(row, column, df):
# If the row/column value is null/nan
if pd.isnull( row[column] ):
# Get the value of row[column] from the row before
prior = row.name
value = df[prior-1:prior]['B'].values[0]
# If that values empty, go to the row before that
while pd.isnull( value ) and prior >= 1 :
prior = prior - 1
value = df[prior-1:prior]['B'].values[0]
else:
value = row[column]
return value
df['A'] = df.apply( lambda x: fix_abc(x,'A',df), axis=1 )
df['B'] = df.apply( lambda x: fix_abc(x,'B',df), axis=1 )
df['C'] = df.apply( lambda x: fix_abc(x,'C',df), axis=1 )
def fix_d(x):
if pd.isnull(x['D']):
return 0
return x
df['D'] = df.apply( lambda x: fix_d(x), axis=1 )
It feels like this quite inefficient, and slow. So I'm wondering if there is a quicker, more efficient way to do this.
Example output;
# A B C D
# 0 3.0 6.0 7.0 3.0
# 1 42.0 44.0 1.0 42.0
# 2 4.0 2.0 3.0 4.0
# 3 90.0 83.0 53.0 90.0
# 4 22.0 23.0 24.0 0.0
# 5 5.0 2.0 5.0 5.0
# 6 2.0 2.0 2.0 0.0
# 7 2.0 2.0 2.0 0.0
# 8 2.0 12.0 65.0 2.0
# 9 5.0 7.0 32.0 5.0
# 10 2.0 13.0 6.0 2.0
# 11 13.0 13.0 13.0 0.0
# 12 23.0 13.0 23.0 23.0
# 13 61.0 13.0 63.0 61.0
# 14 32.0 43.0 12.0 32.0
# 15 24.0 2.0 34.0 24.0
I have dumped the code including the data for the dataframe into a python fiddle available (here)
fillna allows for various ways to do the filling. In this case, column D can just fill with 0. Column B can fill via pad. And then columns A and C can fill from column B, like:
Code:
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
Test Code:
df = pd.read_fwf(StringIO(u"""
A B C D
3.0 6.0 7.0 4.0
42.0 44.0 1.0 3.0
4.0 2.0 3.0 62.0
90.0 83.0 53.0 23.0
22.0 23.0 24.0 NaN
5.0 2.0 5.0 34.0
NaN NaN NaN NaN
NaN NaN NaN NaN
2.0 12.0 65.0 1.0
5.0 7.0 32.0 7.0
2.0 13.0 6.0 12.0
NaN NaN NaN NaN
23.0 NaN 23.0 34.0
61.0 NaN 63.0 3.0
32.0 43.0 12.0 76.0
24.0 2.0 34.0 2.0"""), header=1)
print(df)
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
print(df)
Results:
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 NaN
5 5.0 2.0 5.0 34.0
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 NaN NaN NaN NaN
12 23.0 NaN 23.0 34.0
13 61.0 NaN 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 0.0
5 5.0 2.0 5.0 34.0
6 2.0 2.0 2.0 0.0
7 2.0 2.0 2.0 0.0
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 13.0 13.0 13.0 0.0
12 23.0 13.0 23.0 34.0
13 61.0 13.0 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
I have a dataframe of race results. I'd like to create a series that takes the last stage position and subtracts that by the average of all the stages before that. Here is a small slice for the df (could have more stages, countries and rows)
race_location stage1_position stage2_position stage3_position number_of_stages
AUS 2.0 2.0 NaN 2
AUS 1.0 5.0 NaN 2
AUS 3.0 4.0 NaN 2
AUS 4.0 8.0 NaN 2
AUS 10.0 6.0 NaN 2
AUS 9.0 7.0 NaN 2
FRA 23.0 1.0 10.0 3
FRA 6.0 12.0 24.0 3
FRA 14.0 11.0 14.0 3
FRA 18.0 10.0 1.0 3
FRA 15.0 14.0 4.0 3
USA 24.0 NaN NaN 1
USA 7.0 NaN NaN 1
USA 22.0 NaN NaN 1
USA 11.0 NaN NaN 1
USA 8.0 NaN NaN 1
USA 16.0 NaN NaN 1
USA 13.0 NaN NaN 1
USA 19.0 NaN NaN 1
USA 5.0 NaN NaN 1
USA 25.0 NaN NaN 1
The output would be
last_stage_minus_average
0
4
1
4
-4
-2
-2
15
1.5
-13
-10.5
0
0
0
0
0
0
0
0
0
0
0
This wont work, but I was thinking something like this:
new_series = []
for country in country_list:
num_stages = df.loc[df['race_location'] == country, 'number_of_stages']
differnce = df.ix[df['race_location'] == country, num_stages] -
df.iloc[:, 0:num_stages-1].mean(axis=1)
new_series.append(difference)
I'm not sure how to go about doing this. Any help or direction would be amazing!
#use pandas apply to take the mean for the first n-1 stages and subtract from last stage.
df.apply(lambda x: x.iloc[x.number_of_stages]-np.mean(x.iloc[1:x.number_of_stages]),axis=1).fillna(0)
Out[264]:
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
I'd use filter to get just he stage columns, then stack and groupby
stages = df.filter(regex='^stage\d+.*')
stages.stack().groupby(level=0).apply(
lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
how it works
stack will automatically drop the NaN values when converting to a series.
Now, position -1 is the last value within each group if we grouped by the first level of the new multiindex
So, we use a lambda and calculate the mean with every thing up to the last value x.iloc[:-1].mean()
And subtract that from the last value x.iloc[-1]
subtracts that by the average of all the stages before that
It's not a big deal but I'm just curious! Unlike your desired output but along to your description, if one of the racers finished only one race, shouldn't their result be inf or nan instead of 0? (to specify them from the one who has already done 2~3 race but last race result is exactly same with average of races? like racer #1 vs racer #11~20)
df_sp = df.filter(regex='^stage\d+.*')
df['last'] = df_sp.T.fillna(method='ffill').T.iloc[:, -1]
df['mean'] = (df_sp.sum(axis=1) - df['last']) / (df['number_of_stages'] - 1)
print(df['last'] - df['mean'])
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
I am sure there must be a very simple solution to this problem, but I am failing to find it (and browsing through previously asked questions, I didn't find the answer I wanted or didn't understand it).
I have a dataframe similar to this (just much bigger, with many more rows and columns):
x val1 val2 val3
0 0.0 10.0 NaN NaN
1 0.5 10.5 NaN NaN
2 1.0 11.0 NaN NaN
3 1.5 11.5 NaN 11.60
4 2.0 12.0 NaN 12.08
5 2.5 12.5 12.2 12.56
6 3.0 13.0 19.8 13.04
7 3.5 13.5 13.3 13.52
8 4.0 14.0 19.8 14.00
9 4.5 14.5 14.4 14.48
10 5.0 15.0 19.8 14.96
11 5.5 15.5 15.5 15.44
12 6.0 16.0 19.8 15.92
13 6.5 16.5 16.6 16.40
14 7.0 17.0 19.8 18.00
15 7.5 17.5 17.7 NaN
16 8.0 18.0 19.8 NaN
17 8.5 18.5 18.8 NaN
18 9.0 19.0 19.8 NaN
19 9.5 19.5 19.9 NaN
20 10.0 20.0 19.8 NaN
In the next step, I need to compute the derivative dVal/dx for each of the value columns (in reality I have more than 3 columns, so I need to have a robust solution in a loop, I can't select the rows manually each time). But because of the NaN values in some of the columns, I am facing the problem that x and val are not of the same dimension. I feel the way to overcome this would be to only select only those x intervals, for which the val is notnull. But I am not able to do that. I am probably making some very stupid mistakes (I am not a programmer and I am very untalented, so please be patient with me:) ).
Here is the code so far (now that I think of it, I may have introduced some mistakes just by leaving some old pieces of code because I've been messing with it for a while, trying different things):
import pandas as pd
import numpy as np
df = pd.read_csv('H:/DocumentsRedir/pokus/dataframe.csv', delimiter=',')
vals = list(df.columns.values)[1:]
for i in vals:
V = np.asarray(pd.notnull(df[i]))
mask = pd.notnull(df[i])
X = np.asarray(df.loc[mask]['x'])
derivative=np.diff(V)/np.diff(X)
But I am getting this error:
ValueError: operands could not be broadcast together with shapes (20,) (15,)
So, apparently, it did not select only the notnull values...
Is there an obvious mistake that I am making or a different approach that I should adopt? Thanks!
(And another less important question: is np.diff the right function to use here or had I better calculated it manually by finite differences? I'm not finding numpy documentation very helpful.)
To calculate dVal/dX:
dVal = df.iloc[:, 1:].diff() # `x` is in column 0.
dX = df['x'].diff()
>>> dVal.apply(lambda series: series / dX)
val1 val2 val3
0 NaN NaN NaN
1 1 NaN NaN
2 1 NaN NaN
3 1 NaN NaN
4 1 NaN 0.96
5 1 NaN 0.96
6 1 15.2 0.96
7 1 -13.0 0.96
8 1 13.0 0.96
9 1 -10.8 0.96
10 1 10.8 0.96
11 1 -8.6 0.96
12 1 8.6 0.96
13 1 -6.4 0.96
14 1 6.4 3.20
15 1 -4.2 NaN
16 1 4.2 NaN
17 1 -2.0 NaN
18 1 2.0 NaN
19 1 0.2 NaN
20 1 -0.2 NaN
We difference all columns (except the first one), and then apply a lambda function to each column which divides it by the difference in column X.