Pandas Merge Not Working When Values Are an Exact Match - python

Below is my code and Dataframes. stats_df is much bigger. Not sure if it matters, but the column values are EXACTLY as they appear in the actual files. I can't merge the two DFs without losing 'Alex Len' even though both DFs have the same PlayerID value of '20000852'
stats_df = pd.read_csv('stats_todate.csv')
matchup_df = pd.read_csv('matchup.csv')
new_df = pd.merge(stats_df, matchup_df[['PlayerID','Matchup','Started','GameStatus']])
I have also tried:
stats_df['PlayerID'] = stats_df['PlayerID'].astype(str)
matchup_df['PlayerID'] = matchup_df['PlayerID'].astype(str)
stats_df['PlayerID'] = stats_df['PlayerID'].str.strip()
matchup_df['PlayerID'] = matchup_df['PlayerID'].str.strip()
Any ideas?
Here are my two Dataframes:
DF1:
PlayerID SeasonType Season Name Team Position
20001713 1 2018 A.J. Hammons MIA C
20002725 2 2022 A.J. Lawson ATL SG
20002038 2 2021 Élie Okobo BKN PG
20002742 2 2022 Aamir Simms NY PF
20000518 3 2018 Aaron Brooks MIN PG
20000681 1 2022 Aaron Gordon DEN PF
20001395 1 2018 Aaron Harrison DAL SG
20002680 1 2022 Aaron Henry PHI SF
20002005 1 2022 Aaron Holiday PHO PG
20001981 3 2018 Aaron Jackson HOU PF
20002539 1 2022 Aaron Nesmith BOS SF
20002714 1 2022 Aaron Wiggins OKC SG
20001721 1 2022 Abdel Nader PHO SF
20002251 2 2020 Abdul Gaddy OKC PG
20002458 1 2021 Adam Mokoka CHI SG
20002619 1 2022 Ade Murkey SAC PF
20002311 1 2022 Admiral Schofield ORL PF
20000783 1 2018 Adreian Payne ORL PF
20002510 1 2022 Ahmad Caver IND PG
20002498 2 2020 Ahmed Hill CHA PG
20000603 1 2022 Al Horford BOS PF
20000750 3 2018 Al Jefferson IND C
20001645 1 2019 Alan Williams BKN PF
20000837 1 2022 Alec Burks NY SG
20001882 1 2018 Alec Peters PHO PF
20002850 1 2022 Aleem Ford ORL SF
20002542 1 2022 Aleksej Pokuševski OKC PF
20002301 3 2021 Alen Smailagic GS PF
20001763 1 2019 Alex Abrines OKC SG
20001801 1 2022 Alex Caruso CHI SG
20000852 1 2022 Alex Len SAC C
DF2:
PlayerID Name Date Started Opponent GameStatus Matchup
20000681 Aaron Gordon 4/1/2022 1 MIN 16
20002005 Aaron Holiday 4/1/2022 0 MEM 21
20002539 Aaron Nesmith 4/1/2022 0 IND 13
20002714 Aaron Wiggins 4/1/2022 1 DET 14
20002311 Admiral Schofield 4/1/2022 0 TOR 10
20000603 Al Horford 4/1/2022 1 IND 13
20002542 Aleksej Pokuševski 4/1/2022 1 DET 14
20000852 Alex Len 4/1/2022 1 HOU 22

You need to specify the column you want to merge on using the on keyword argument:
new_df = pd.merge(stats_df, matchup_df[['PlayerID','Matchup','Started','GameStatus']], on=['PlayerID'])
Otherwise it will merge using all of the shared columns.
Here is the explanation from the pandas docs:
on : label or list
Column or index level names to join on. These must be found in both
DataFrames. If on is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
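A minimal sketch of why the default can silently drop a row. The two small frames below are hypothetical stand-ins for the real files, and the extra shared `Started` column in `stats_df` is an assumption, used only to show what happens when more than one column name overlaps:

```python
import pandas as pd

# Hypothetical miniature frames; 'Started' in stats_df is an assumed
# extra column that also exists in matchup_df.
stats_df = pd.DataFrame({
    "PlayerID": [20000681, 20000852],
    "Name": ["Aaron Gordon", "Alex Len"],
    "Started": [1, 0],
})
matchup_df = pd.DataFrame({
    "PlayerID": [20000681, 20000852],
    "Started": [1, 1],
    "Matchup": [16, 22],
})

# No `on`: pandas joins on every shared column (PlayerID AND Started),
# so a row whose Started values differ between the frames is dropped.
implicit = pd.merge(stats_df, matchup_df)

# Explicit key: only PlayerID has to match; the overlapping column
# is kept twice, distinguished by suffixes.
explicit = pd.merge(stats_df, matchup_df, on="PlayerID",
                    suffixes=("_stats", "_matchup"))

print(implicit["Name"].tolist())
print(explicit["Name"].tolist())
```

If the real frames share other column names, comparing `stats_df.columns.intersection(matchup_df.columns)` is a quick way to see which columns the default merge would use.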

Related

Web scrape Sports-Reference with Python Beautiful Soup

I am trying to scrape data from Nick Saban's sports reference page so that I can pull in the list of All-Americans he coached and then his Bowl-Win Loss Percentage.
I am new to Python so this has been a massive struggle. When I inspect the page I see div id = #leaderboard_all-americans class = "data_grid_box"
When I run the code below I am getting the Coaching Record table, which is the first table on the site. I tried using different indexes thinking it may give me a different result but that did not work either.
Ultimately, I want to get the All-American data and turn it into a data frame.
import requests
import bs4
import pandas as pd
saban2 = requests.get("https://www.sports-reference.com/cfb/coaches/nick-saban-1.html")
saban_soup2 = bs4.BeautifulSoup(saban2.text,"lxml")
saban_select = saban_soup2.select('div',{"id":"leaderboard_all-americans"})
saban_df2 = pd.read_html(str(saban_select))
sports-reference.com stores the HTML tables as comments in the basic request response. You have to first grab the commented block with the All-Americans and bowl results, and then parse that result:
import bs4
from bs4 import BeautifulSoup as soup
import requests, pandas as pd
d = soup(requests.get('https://www.sports-reference.com/cfb/coaches/nick-saban-1.html').text, 'html.parser')
block = [i for i in d.find_all(string=lambda text: isinstance(text, bs4.Comment)) if 'id="leaderboard_all-americans"' in i][0]
b = soup(str(block), 'html.parser')
players = [i for i in b.select('#leaderboard_all-americans table.no_columns tr')]
p_results = [{'name':i.td.a.text, 'year':i.td.contents[-1][2:-1]} for i in players]
all_americans = pd.DataFrame(p_results)
bowl_win_loss = b.select_one('#leaderboard_win_loss_pct_post td.single').contents[-2]
print(all_americans)
print(bowl_win_loss)
Output:
all_americans
name year
0 Jonathan Allen 2016
1 Javier Arenas 2009
2 Mark Barron 2011
3 Antoine Caldwell 2008
4 Ha Ha Clinton-Dix 2013
5 Terrence Cody 2008-2009
6 Landon Collins 2014
7 Amari Cooper 2014
8 Landon Dickerson 2020
9 Minkah Fitzpatrick 2016-2017
10 Reuben Foster 2016
11 Najee Harris 2020
12 Derrick Henry 2015
13 Dont'a Hightower 2011
14 Mark Ingram 2009
15 Jerry Jeudy 2018
16 Mike Johnson 2009
17 Barrett Jones 2011-2012
18 Mac Jones 2020
19 Ryan Kelly 2015
20 Cyrus Kouandjio 2013
21 Chad Lavalais 2003
22 Alex Leatherwood 2020
23 Rolando McClain 2009
24 Demarcus Milliner 2012
25 C.J. Mosley 2012-2013
26 Reggie Ragland 2015
27 Josh Reed 2001
28 Trent Richardson 2011
29 A'Shawn Robinson 2015
30 Cam Robinson 2016
31 Andre Smith 2008
32 DeVonta Smith 2020
33 Marcus Spears 2004
34 Patrick Surtain II 2020
35 Tua Tagovailoa 2018
36 Deionte Thompson 2018
37 Chance Warmack 2012
38 Ben Wilkerson 2004
39 Jonah Williams 2018
40 Quinnen Williams 2018
bowl_win_loss:
' .63 (#23)'
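The comment trick can be tried offline. The sketch below uses a small, made-up HTML fragment (not the real page markup) to show that a selector on the outer document finds nothing, while re-parsing the comment text exposes the table:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical fragment: the table sits inside an HTML comment,
# so it is not part of the parsed element tree.
html = """
<div id="all_leaderboard">
<!--
<div id="leaderboard_all-americans">
  <table>
    <tr><td><a href="#">Jonathan Allen</a> (2016)</td></tr>
    <tr><td><a href="#">Javier Arenas</a> (2009)</td></tr>
  </table>
</div>
-->
</div>
"""

outer = BeautifulSoup(html, "html.parser")

# Selecting on the outer document returns nothing: the div is commented out.
visible = outer.select("#leaderboard_all-americans")

# Collect comment nodes, pick the one holding the target div, and re-parse it.
comments = outer.find_all(string=lambda text: isinstance(text, Comment))
block = next(c for c in comments if 'id="leaderboard_all-americans"' in c)
inner = BeautifulSoup(str(block), "html.parser")

names = [a.get_text() for a in inner.select("#leaderboard_all-americans td a")]
print(len(visible), names)
```

Using `next(...)` instead of indexing with `[0]` is only a stylistic choice here; both raise an exception if no comment contains the target id.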

Replacing variable name based on count

I have a DataFrame that looks like this:
player pos Count of pos
A.J. Derby FB 1
TE 10
A.J. Green WR 16
A.J. McCarron QB 3
Aaron Jones RB 12
Aaron Ripkowski FB 16
Aaron Rodgers QB 7
Adam Humphries TE 1
WR 15
Adam Shaheen TE 13
Adam Thielen WR 16
Adrian Peterson RB 10
Akeem Hunt RB 15
Alan Cross FB 1
TE 7
Albert Wilson WR 13
Aldrick Robinson WR 16
Alex Armah CB 1
FB 6
RB 2
Alex Collins RB 15
Alex Erickson WR 16
Alex Smith QB 15
Alfred Blue RB 11
Alfred Morris RB 14
Allen Hurns WR 10
Allen Robinson WR 1
Alshon Jeffery WR 16
Alvin Kamara FB 1
RB 15
Amara Darboh WR 16
Amari Cooper TE 2
WR 12
For a player that has more than one pos type I would like to replace all the pos types listed for that player with the pos type that has the highest count of pos. So, for the first player his FB type will be replaced with TE.
I've started with:
for p in df.player:
    if df.groupby('player')['pos'].nunique() > 1:
But am struggling with what the next step is for replacing the pos based on count of pos.
Appreciate any help on this. Thanks!
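Use GroupBy.transform with idxmax to give every row the pos that has the highest Count of pos within its player group:
import numpy as np
# only needed if player/pos are in the index, or if repeated player names are blank
df = df.reset_index()
df['player'] = df['player'].replace('', np.nan).ffill()
# with pos as the index, idxmax returns the pos label of the largest count per player
df['pos'] = (df.set_index('pos')
               .groupby('player')['Count of pos']
               .transform('idxmax')
               .to_numpy())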
print (df)
player pos Count of pos
0 A.J. Derby TE 1
1 A.J. Derby TE 10
2 A.J. Green WR 16
3 A.J. McCarron QB 3
4 Aaron Jones RB 12
5 Aaron Ripkowski FB 16
6 Aaron Rodgers QB 7
7 Adam Humphries WR 1
8 Adam Humphries WR 15
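A self-contained variant of the same idea, as a sketch on made-up rows taken from the question. It keeps the default integer index, so `idxmax` returns row labels rather than pos values and one extra `.loc` lookup is needed; this is an alternative arrangement, not the exact code above:

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A.J. Derby", "A.J. Derby", "A.J. Green",
               "Alex Armah", "Alex Armah", "Alex Armah"],
    "pos": ["FB", "TE", "WR", "CB", "FB", "RB"],
    "Count of pos": [1, 10, 16, 1, 6, 2],
})

# For every row, the index label of the row with the largest count
# inside the same player group.
top_rows = df.groupby("player")["Count of pos"].transform("idxmax")

# Look up the pos on those rows and write it back positionally.
df["pos"] = df.loc[top_rows, "pos"].to_numpy()

print(df)
```

The `.to_numpy()` call matters: without it, pandas would align the looked-up values on their (repeated) index labels instead of assigning them row by row.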

Updating Pandas Column Using Conditions and a List

This is similar to some other questions posted, but I can't find an answer that fits my needs.
I have a Dataframe with the following:
RK PLAYER SCHOOL YEAR POS POS RK HT WT 2019 2018 2017 2016
0 1 Nick Bosa Ohio St. Jr EDGE 1 6-4 266 Jr
1 2 Quinnen Williams Alabama Soph DL 1 6-3 303 Soph
2 3 Josh Allen Kentucky Sr EDGE 2 6-5 262 Sr
3 4 Ed Oliver Houston Jr DL 2 6-2 287 Jr
2018, 2017, and 2016 have np.NaN values, but I can't format this table correctly with them in it.
Now I have a separate list containing the following:
season = ['Sr', 'Jr', 'Soph', 'Fr']
The 2019 column shows each player's current status, and I would like the 2018 column to show their status as of the prior year. So if it was 'Sr', it should be 'Jr'. Essentially, I want each row to look up its 2019 value in the season list, move one index ahead, and write that value into the 2018 column. The result for 2018 should be:
RK PLAYER SCHOOL YEAR POS POS RK HT WT 2019 2018 2017 2016
0 1 Nick Bosa Ohio St. Jr EDGE 1 6-4 266 Jr Soph
1 2 Quinnen Williams Alabama Soph DL 1 6-3 303 Soph Fr
2 3 Josh Allen Kentucky Sr EDGE 2 6-5 262 Sr Jr
3 4 Ed Oliver Houston Jr DL 2 6-2 287 Jr Soph
I can think of a way to do this with a for k, v in iteritems loop that would check the values, but I'm wondering if there's a better way?
I'm not sure if this is much smarter than what you already have, but it's a suggestion:
import pandas as pd
def get_season(curr_season, curr_year, prev_year):
    season = ['Sr', 'Jr', 'Soph', 'Fr']
    try:
        # each year back moves one position further along the list
        return season[season.index(curr_season) + (curr_year - prev_year)]
    except IndexError:
        # Return some meaningful message perhaps?
        return '-'
df = pd.DataFrame({'2019': ['Jr', 'Soph', 'Sr', 'Jr']})
df['2018'] = [get_season(s, 2019, 2018) for s in df['2019']]
df['2017'] = [get_season(s, 2019, 2017) for s in df['2019']]
df['2016'] = [get_season(s, 2019, 2016) for s in df['2019']]
df
Out[18]:
2019 2018 2017 2016
0 Jr Soph Fr -
1 Soph Fr - -
2 Sr Jr Soph Fr
3 Jr Soph Fr -
Another possible solution is to write a function that accepts a row, slices the season list starting at that row's '2019' value, and returns the slice as a pandas.Series. We can then apply that function to each row using apply() with axis=1. I used part of your input DataFrame for testing.
In [3]: df
Out[3]:
WT 2019 2018 2017 2016
0 266 Jr NaN NaN NaN
1 303 Soph NaN NaN NaN
2 262 Sr NaN NaN NaN
3 287 Jr NaN NaN NaN
In [4]: def fill_row(row):
   ...:     season = ['Sr', 'Jr', 'Soph', 'Fr']
   ...:     data = season[season.index(row['2019']):]
   ...:     return pd.Series(data)
In [5]: cols_to_update = ['2019', '2018', '2017', '2016']
In [6]: df[cols_to_update] = df[cols_to_update].apply(fill_row, axis=1)
In [7]: df
Out[7]:
WT 2019 2018 2017 2016
0 266 Jr Soph Fr NaN
1 303 Soph Fr NaN NaN
2 262 Sr Jr Soph Fr
3 287 Jr Soph Fr NaN
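A third option, sketched here as an alternative to both answers rather than a restatement of either, is to turn the season list into a lookup table and chain `Series.map` from one year column to the next. The name `prev_status` is made up for this example:

```python
import pandas as pd

season = ["Sr", "Jr", "Soph", "Fr"]

# Map each status to the status one year earlier.
# 'Fr' has no predecessor, so it is simply absent from the mapping.
prev_status = dict(zip(season, season[1:]))

df = pd.DataFrame({"2019": ["Jr", "Soph", "Sr", "Jr"]})

# Each earlier column is derived from the column one year later.
# Values missing from the mapping (including NaN) become NaN.
for later, earlier in [("2019", "2018"), ("2018", "2017"), ("2017", "2016")]:
    df[earlier] = df[later].map(prev_status)

print(df)
```

This avoids a Python-level loop over rows and leaves NaN where a player was not yet enrolled, matching the output of the apply-based answer.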

How to use group by on multiple columns?

I am using pandas for some data processing. My pandas statement looks like this:
yearage.groupby(['year', 'Tm']).size()
It gives me data like this
2014 ATL 9
BOS 9
BRK 7
CHI 10
CHO 9
CLE 8
DAL 9
DEN 8
DET 9
GSW 8
When I convert it into a DataFrame, I get only two columns: the compound key and the count. What I actually want is three columns:
year, Tm, Size
How do I separate out the two compound keys after groupby?
You specify as_index=False in your groupby statement. As a side note, you probably want to use count (which excludes NaNs) instead of size.
>>> df.groupby(['year', 'Tm'], as_index=False).count()
year Tm a
0 2014 ATL 4
1 2014 BOS 4
2 2014 BRK 1
3 2014 CHI 1
4 2014 CHO 1
5 2014 CLE 1
6 2014 DAL 1
7 2014 DEN 1
8 2014 DET 1
9 2014 GSW 1
For size:
Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the size method. It returns a Series whose index are the group names and whose values are the sizes of each group.
For count:
Compute count of group, excluding missing values
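The difference between the two only shows up when a group contains missing values. A small sketch on made-up data (the frame and its column `a` are assumptions modelled on the question):

```python
import numpy as np
import pandas as pd

yearage = pd.DataFrame({
    "year": [2014, 2014, 2014, 2014],
    "Tm": ["ATL", "ATL", "BOS", "BOS"],
    "a": [9.0, np.nan, 9.0, 9.0],
})

# size() counts rows per group, NaN included; reset_index turns the
# two group keys back into ordinary columns.
sizes = yearage.groupby(["year", "Tm"]).size().reset_index(name="Size")

# count() counts non-missing values per remaining column.
counts = yearage.groupby(["year", "Tm"], as_index=False).count()

print(sizes)
print(counts)
```

Note that `count()` returns one count per non-key column, so on a wide frame it produces many columns where `size()` produces one.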
I think you can try reset_index with parameter name for new column name Size:
yearage.groupby(['year','Tm']).size().reset_index(name='Size')
Sample:
print(yearage)
year Tm a
0 2014 ATL 9
1 2014 ATL 9
2 2014 ATL 9
3 2014 ATL 9
4 2014 BOS 9
5 2014 BRK 7
6 2014 BOS 9
7 2014 BOS 9
8 2014 BOS 9
9 2014 CHI 10
10 2014 CHO 9
11 2014 CLE 8
12 2014 DAL 9
13 2014 DEN 8
14 2014 DET 9
15 2014 GSW 8
print(yearage.groupby(['year','Tm']).size().reset_index(name='Size'))
year Tm Size
0 2014 ATL 4
1 2014 BOS 4
2 2014 BRK 1
3 2014 CHI 1
4 2014 CHO 1
5 2014 CLE 1
6 2014 DAL 1
7 2014 DEN 1
8 2014 DET 1
9 2014 GSW 1
Without the name parameter, the new column is named 0:
print(yearage.groupby(['year','Tm']).size().reset_index())
year Tm 0
0 2014 ATL 4
1 2014 BOS 4
2 2014 BRK 1
3 2014 CHI 1
4 2014 CHO 1
5 2014 CLE 1
6 2014 DAL 1
7 2014 DEN 1
8 2014 DET 1
9 2014 GSW 1

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different length. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add varying length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep=r"\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
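For the boxplot use case, filling with 0 would distort the distributions, so it is probably safer to keep the NaNs and drop them per column. A sketch that builds the data inline instead of reading `money.txt`; the `per_year` dictionary is a made-up helper for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Paul", "Susan", "Gary", "Paul", "Andrea", "Albert", "Hal", "Paul"],
    "Money": [57.0, 67.0, 54.0, 77.0, 20.0, 23.0, 26.0, 23.0],
    "Year": [2012, 2012, 2011, 2011, 2011, 2011, 2010, 2010],
})

# Position of each row within its year: 0, 1, 2, ...
df["yindex"] = df.groupby("Year").cumcount()

# One column per year; shorter years are padded with NaN.
wide = df.pivot(index="yindex", columns="Year", values="Money")

# Ragged per-year lists with the padding removed again.
per_year = {year: wide[year].dropna().tolist() for year in wide.columns}

print(wide)
print(per_year)
```

If matplotlib is installed, `wide.boxplot()` can be called on the padded frame directly.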
