Sorry, I just asked this question, "Pythonic Way to have multiple Or's when conditioning in a dataframe", but marked it as answered prematurely: the answer passed my overly simplistic test case but isn't working more generally. (If it is possible to merge and reopen the question, that would be great...)
Here is the full issue:
sum(data['Name'].isin(eligible_players))
> 0
sum(data['Name'] == "Antonio Brown")
> 68
"Antonio Brown" in eligible_players
> True
Basically, if I understand correctly, I am showing that Antonio Brown is in eligible_players and that he is in the dataframe. However, for some reason .isin() isn't working properly.
As I said in my prior question, I am looking for a way to check many ors to select the proper rows.
____ EDIT ____
In[14]:
eligible_players
Out[14]:
Name
Antonio Brown 378
Demaryius Thomas 334
Jordy Nelson 319
Dez Bryant 309
Emmanuel Sanders 293
Odell Beckham 289
Julio Jones 288
Randall Cobb 284
Jeremy Maclin 267
T.Y. Hilton 255
Alshon Jeffery 252
Golden Tate 250
Mike Evans 236
DeAndre Hopkins 223
Calvin Johnson 220
Kelvin Benjamin 218
Julian Edelman 213
Anquan Boldin 213
Steve Smith 213
Roddy White 208
Brandon LaFell 205
Mike Wallace 205
A.J. Green 203
DeSean Jackson 200
Jordan Matthews 194
Eric Decker 194
Sammy Watkins 190
Torrey Smith 186
Andre Johnson 186
Jarvis Landry 178
Eddie Royal 176
Brandon Marshall 175
Vincent Jackson 175
Rueben Randle 174
Marques Colston 173
Mohamed Sanu 171
Keenan Allen 170
James Jones 168
Malcom Floyd 168
Kenny Stills 167
Greg Jennings 162
Kendall Wright 162
Doug Baldwin 160
Michael Floyd 159
Robert Woods 158
Name: Pts, dtype: int64
and
In [31]:
data.tail(110)
Out[31]:
Name Pts year week pos Team
28029 Dez Bryant 25 2014 17 WR DAL
28030 Antonio Brown 25 2014 17 WR PIT
28031 Jordan Matthews 24 2014 17 WR PHI
28032 Randall Cobb 23 2014 17 WR GB
28033 Rueben Randle 21 2014 17 WR NYG
28034 Demaryius Thomas 19 2014 17 WR DEN
28035 Calvin Johnson 19 2014 17 WR DET
28036 Torrey Smith 18 2014 17 WR BAL
28037 Roddy White 17 2014 17 WR ATL
28038 Steve Smith 17 2014 17 WR BAL
28039 DeSean Jackson 16 2014 17 WR WAS
28040 Mike Evans 16 2014 17 WR TB
28041 Anquan Boldin 16 2014 17 WR SF
28042 Adam Thielen 15 2014 17 WR MIN
28043 Cecil Shorts 15 2014 17 WR JAC
28044 A.J. Green 15 2014 17 WR CIN
28045 Jordy Nelson 14 2014 17 WR GB
28046 Brian Hartline 14 2014 17 WR MIA
28047 Robert Woods 13 2014 17 WR BUF
28048 Kenny Stills 13 2014 17 WR NO
28049 Emmanuel Sanders 13 2014 17 WR DEN
28050 Eddie Royal 13 2014 17 WR SD
28051 Marques Colston 13 2014 17 WR NO
28052 Chris Owusu 12 2014 17 WR NYJ
28053 Brandon LaFell 12 2014 17 WR NE
28054 Dontrelle Inman 12 2014 17 WR SD
28055 Reggie Wayne 11 2014 17 WR IND
28056 Paul Richardson 11 2014 17 WR SEA
28057 Cole Beasley 11 2014 17 WR DAL
28058 Jarvis Landry 10 2014 17 WR MIA
(Aside: once you posted what you were actually using, it only took seconds to see the problem.)
Series.isin(something) iterates over something to determine the set of things you want to test membership in. But your eligible_players isn't a list, it's a Series. And iteration over a Series is iteration over the values, even though membership (in) is with respect to the index:
In [72]: eligible_players = pd.Series([10,20,30], index=["A","B","C"])
In [73]: list(eligible_players)
Out[73]: [10, 20, 30]
In [74]: "A" in eligible_players
Out[74]: True
So in your case, you could use eligible_players.index instead to pass the right names:
In [75]: df = pd.DataFrame({"Name": ["A","B","C","D"]})
In [76]: df
Out[76]:
Name
0 A
1 B
2 C
3 D
In [77]: df["Name"].isin(eligible_players) # remember, this will be [10, 20, 30]
Out[77]:
0 False
1 False
2 False
3 False
Name: Name, dtype: bool
In [78]: df["Name"].isin(eligible_players.index)
Out[78]:
0 True
1 True
2 True
3 False
Name: Name, dtype: bool
In [79]: df["Name"].isin(eligible_players.index).sum()
Out[79]: 3
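To tie this back to the goal of selecting rows, here is a minimal sketch on a cut-down, hypothetical version of the question's data (only four rows, and only two eligible players, chosen for illustration). It assumes the names live in the index of eligible_players, as in the output shown in the question:

```python
import pandas as pd

# Hypothetical miniature of the question's objects: names are the index,
# points are the values.
eligible_players = pd.Series(
    [378, 334],
    index=pd.Index(["Antonio Brown", "Demaryius Thomas"], name="Name"),
    name="Pts",
)
data = pd.DataFrame({
    "Name": ["Antonio Brown", "Adam Thielen", "Demaryius Thomas", "Cecil Shorts"],
    "Pts": [25, 15, 19, 15],
})

# Passing the Series itself compares names against the point totals.
wrong_count = data["Name"].isin(eligible_players).sum()

# Passing the index compares names against names.
mask = data["Name"].isin(eligible_players.index)
selected = data[mask]
print(selected)
```

The boolean mask can be used directly for row selection, so no chain of "or" conditions is needed.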
When I try to parse a wiki page for its tables, the tables are read correctly except for the date of birth column, which comes back as empty. Is there a workaround for this? I've tried using beautiful soup but I get the same result.
The code I've used is as follows:
url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
pd.read_html(url)
Here's an image of one of the tables in question:
One possible solution is to alter the page content with BeautifulSoup and then load it into pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# select correct table, here I select the first one:
tbl = soup.select("table")[0]
# remove the (aged XX) part:
for td in tbl.select("td:nth-of-type(3)"):
    td.string = td.contents[-1].split("(")[0]
df = pd.read_html(str(tbl))[0]
print(df)
Prints:
No. Pos. Player Date of birth (age) Caps Club
0 1 GK Thomas Sørensen 12 June 1976 14 Sunderland
1 2 MF Stig Tøfting 14 August 1969 36 Bolton Wanderers
2 3 DF René Henriksen 27 August 1969 39 Panathinaikos
3 4 DF Martin Laursen 26 July 1977 15 Milan
4 5 DF Jan Heintze (c) 17 August 1963 83 PSV Eindhoven
5 6 DF Thomas Helveg 24 June 1971 67 Milan
6 7 MF Thomas Gravesen 11 March 1976 22 Everton
7 8 MF Jesper Grønkjær 12 August 1977 25 Chelsea
8 9 FW Jon Dahl Tomasson 29 August 1976 38 Feyenoord
9 10 MF Martin Jørgensen 6 October 1975 32 Udinese
10 11 FW Ebbe Sand 19 July 1972 44 Schalke 04
11 12 DF Niclas Jensen 17 August 1974 8 Manchester City
12 13 DF Steven Lustü 13 April 1971 4 Lyn
13 14 MF Claus Jensen 29 April 1977 13 Charlton Athletic
14 15 MF Jan Michaelsen 28 November 1970 11 Panathinaikos
15 16 GK Peter Kjær 5 November 1965 4 Aberdeen
16 17 MF Christian Poulsen 28 February 1980 3 Copenhagen
17 18 FW Peter Løvenkrands 29 January 1980 4 Rangers
18 19 MF Dennis Rommedahl 22 July 1978 19 PSV Eindhoven
19 20 DF Kasper Bøgelund 8 October 1980 2 PSV Eindhoven
20 21 FW Peter Madsen 26 April 1978 4 Brøndby
21 22 GK Jesper Christiansen 24 April 1978 0 Vejle
22 23 MF Brian Steen Nielsen 28 December 1968 65 Malmö FF
Try setting the parse_dates parameter to True in the read_html call.
I am trying to scrape data from Nick Saban's sports reference page so that I can pull in the list of All-Americans he coached and then his Bowl-Win Loss Percentage.
I am new to Python so this has been a massive struggle. When I inspect the page I see div id = #leaderboard_all-americans class = "data_grid_box"
When I run the code below I am getting the Coaching Record table, which is the first table on the site. I tried using different indexes thinking it may give me a different result but that did not work either.
Ultimately, I want to get the All-American data and turn it into a data frame.
import requests
import bs4
import pandas as pd
saban2 = requests.get("https://www.sports-reference.com/cfb/coaches/nick-saban-1.html")
saban_soup2 = bs4.BeautifulSoup(saban2.text,"lxml")
saban_select = saban_soup2.select('div',{"id":"leaderboard_all-americans"})
saban_df2 = pd.read_html(str(saban_select))
sports-reference.com stores the HTML tables as comments in the basic request response. You have to first grab the commented block with the All-Americans and bowl results, and then parse that result:
import bs4
from bs4 import BeautifulSoup as soup
import requests, pandas as pd
d = soup(requests.get('https://www.sports-reference.com/cfb/coaches/nick-saban-1.html').text, 'html.parser')
block = [i for i in d.find_all(string=lambda text: isinstance(text, bs4.Comment)) if 'id="leaderboard_all-americans"' in i][0]
b = soup(str(block), 'html.parser')
players = [i for i in b.select('#leaderboard_all-americans table.no_columns tr')]
p_results = [{'name':i.td.a.text, 'year':i.td.contents[-1][2:-1]} for i in players]
all_americans = pd.DataFrame(p_results)
bowl_win_loss = b.select_one('#leaderboard_win_loss_pct_post td.single').contents[-2]
print(all_americans)
print(bowl_win_loss)
Output:
all_americans
name year
0 Jonathan Allen 2016
1 Javier Arenas 2009
2 Mark Barron 2011
3 Antoine Caldwell 2008
4 Ha Ha Clinton-Dix 2013
5 Terrence Cody 2008-2009
6 Landon Collins 2014
7 Amari Cooper 2014
8 Landon Dickerson 2020
9 Minkah Fitzpatrick 2016-2017
10 Reuben Foster 2016
11 Najee Harris 2020
12 Derrick Henry 2015
13 Dont'a Hightower 2011
14 Mark Ingram 2009
15 Jerry Jeudy 2018
16 Mike Johnson 2009
17 Barrett Jones 2011-2012
18 Mac Jones 2020
19 Ryan Kelly 2015
20 Cyrus Kouandjio 2013
21 Chad Lavalais 2003
22 Alex Leatherwood 2020
23 Rolando McClain 2009
24 Demarcus Milliner 2012
25 C.J. Mosley 2012-2013
26 Reggie Ragland 2015
27 Josh Reed 2001
28 Trent Richardson 2011
29 A'Shawn Robinson 2015
30 Cam Robinson 2016
31 Andre Smith 2008
32 DeVonta Smith 2020
33 Marcus Spears 2004
34 Patrick Surtain II 2020
35 Tua Tagovailoa 2018
36 Deionte Thompson 2018
37 Chance Warmack 2012
38 Ben Wilkerson 2004
39 Jonah Williams 2018
40 Quinnen Williams 2018
bowl_win_loss:
' .63 (#23)'
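The comment-extraction step can be tried offline. The sketch below uses a small, hypothetical HTML string that only imitates the layout described above (a table wrapped in an HTML comment); the real page's markup may differ, and the two rows are taken from the output shown earlier:

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the page: the table sits inside an HTML comment,
# so a normal select() on the outer document cannot see it.
html = """
<div id="all_leaderboard">
<!--
<div id="leaderboard_all-americans" class="data_grid_box">
  <table class="no_columns">
    <tr><td><a href="#">Jonathan Allen</a> (2016)</td></tr>
    <tr><td><a href="#">Terrence Cody</a> (2008-2009)</td></tr>
  </table>
</div>
-->
</div>
"""

outer = BeautifulSoup(html, "html.parser")
visible = outer.select("#leaderboard_all-americans")  # empty: it is commented out

# Collect comment nodes and keep the one that holds the block we want.
comments = outer.find_all(string=lambda text: isinstance(text, Comment))
block = next(c for c in comments if 'id="leaderboard_all-americans"' in c)

# Parse the comment text as HTML in its own right.
inner = BeautifulSoup(str(block), "html.parser")
rows = inner.select("#leaderboard_all-americans table.no_columns tr")
records = [
    {"name": tr.td.a.get_text(), "year": tr.td.get_text().split("(")[-1].rstrip(")")}
    for tr in rows
]
print(records)
```

The list of dicts can then be passed to pd.DataFrame, as in the answer.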
I have a DataFrame that look like this:
player pos Count of pos
A.J. Derby FB 1
TE 10
A.J. Green WR 16
A.J. McCarron QB 3
Aaron Jones RB 12
Aaron Ripkowski FB 16
Aaron Rodgers QB 7
Adam Humphries TE 1
WR 15
Adam Shaheen TE 13
Adam Thielen WR 16
Adrian Peterson RB 10
Akeem Hunt RB 15
Alan Cross FB 1
TE 7
Albert Wilson WR 13
Aldrick Robinson WR 16
Alex Armah CB 1
FB 6
RB 2
Alex Collins RB 15
Alex Erickson WR 16
Alex Smith QB 15
Alfred Blue RB 11
Alfred Morris RB 14
Allen Hurns WR 10
Allen Robinson WR 1
Alshon Jeffery WR 16
Alvin Kamara FB 1
RB 15
Amara Darboh WR 16
Amari Cooper TE 2
WR 12
For a player that has more than one pos type I would like to replace all the pos types listed for that player with the pos type that has the highest count of pos. So, for the first player his FB type will be replaced with TE.
I've started with:
for p in df.player:
    if df.groupby('player')['pos'].nunique() > 1:
But am struggling with what the next step is for replacing the pos based on count of pos.
Appreciate any help on this. Thanks!
Use GroupBy.transform with idxmax: with pos set as the index, idxmax returns, for each player, the pos label of the row with the largest Count of pos:
import numpy as np

# if necessary
df = df.reset_index()
df['player'] = df['player'].replace('', np.nan).ffill()
df['pos'] = (df.set_index('pos')
               .groupby('player')['Count of pos']
               .transform('idxmax')
               .to_numpy())
print(df)
player pos Count of pos
0 A.J. Derby TE 1
1 A.J. Derby TE 10
2 A.J. Green WR 16
3 A.J. McCarron QB 3
4 Aaron Jones RB 12
5 Aaron Ripkowski FB 16
6 Aaron Rodgers QB 7
7 Adam Humphries WR 1
8 Adam Humphries WR 15
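For comparison, here is a self-contained sketch of a slightly different route to the same result: take idxmax per player to find the winning row, then map that row's pos back onto every row of the player. This is an alternative to the transform call above, not the same code, and the five-row frame is a cut-down sample of the question's data:

```python
import pandas as pd

# Cut-down sample of the question's data, already flattened to columns.
df = pd.DataFrame({
    "player": ["A.J. Derby", "A.J. Derby", "A.J. Green",
               "Adam Humphries", "Adam Humphries"],
    "pos": ["FB", "TE", "WR", "TE", "WR"],
    "Count of pos": [1, 10, 16, 1, 15],
})

# Row label of the largest count within each player.
top_rows = df.groupby("player")["Count of pos"].idxmax()

# Lookup table: player -> pos of that winning row.
top_pos = df.loc[top_rows].set_index("player")["pos"]

# Overwrite every row's pos with the player's most frequent pos.
df["pos"] = df["player"].map(top_pos)
print(df)
```

Keeping the lookup table as a separate Series makes it easy to inspect which position was chosen for each player before overwriting the column.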
I have a dataframe based on football players. I am finding duplicate rows when a player has transferred mid-season. My aim is to add together the points they accumulated in both leagues so that each player has just one row.
Here is a sample of the data:
name full_name club Points Start Sub
84 S. Mustafi Shkodran Mustafi Arsenal 76 26 1
85 S. Mustafi Shkodran Mustafi Arsenal -2 0 1
89 Bruno Bruno Soriano Llido Villarreal CF 43 15 16
90 Bruno Bruno Gonzalez Cabrera Getafe CF 43 15 16
119 Oscar Oscar dos Santos Emboaba NaN 16 5 8
120 Oscar Oscar dos Santos Emboaba NaN 1 0 2
121 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 16 5 8
122 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 1 0 2
188 C. Bravo Claudio Bravo Manchester City 61 22 8
189 C. Bravo Claudio Bravo Manchester City 1 1 0
193 Naldo Ronaldo Aparecido Rodrigues FC Schalke 04 58 19 1
194 Naldo Edinaldo Gomes Pereira RCD Espanyol 58 19 1
200 G. Castro Gonzalo Castro Borussia Dortmund 79 23 6
201 G. Castro Gonzalo Castro Malaga CF 79 23 6
209 Juanfran Juan Francisco Torres Belen Atletico Madrid 86 21 8
210 Juanfran Juan Francisco Torres Belen Atletico Madrid 74 34 2
211 Juanfran Juan Francisco Moreno Fuertes RC Coruna 86 21 8
212 Juanfran Juan Francisco Moreno Fuertes RC Coruna 74 34 2
In my goal dataframe, a player like Mustafi would have his Points, Start and Sub values added together to give just one row.
Players like Bruno are clearly not the same person, so I don't want to add the two Brunos together.
name full_name club Points Start Sub
84 S. Mustafi Shkodran Mustafi Arsenal 74 26 2
89 Bruno Bruno Soriano Llido Villarreal CF 43 15 16
90 Bruno Bruno Gonzalez Cabrera Getafe CF 43 15 16
119 Oscar Oscar dos Santos Emboaba NaN 17 5 10
121 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 17 5 10
188 C. Bravo Claudio Bravo Manchester City 62 23 8
193 Naldo Ronaldo Aparecido Rodrigues FC Schalke 04 58 19 1
194 Naldo Edinaldo Gomes Pereira RCD Espanyol 58 19 1
200 G. Castro Gonzalo Castro Borussia Dortmund 158 46 12
209 Juanfran Juan Francisco Torres Belen Atletico Madrid 86 21 8
212 Juanfran Juan Francisco Moreno Fuertes RC Coruna 74 34 2
Any help would be great!
You need:
df[['name','full_name','club']] = df[['name','full_name','club']].fillna('')
d = {'Points':'sum', 'Start':'sum', 'Sub':'sum', 'club':'first'}
df = (df.groupby(['name','full_name'], sort=False, as_index=False)
        .agg(d)
        .reindex(columns=df.columns))
with pd.option_context('display.expand_frame_repr', False):
    print (df)
name full_name club Points Start Sub
0 S. Mustafi Shkodran Mustafi Arsenal 74 26 2
1 Bruno Bruno Soriano Llido Villarreal CF 43 15 16
2 Bruno Bruno Gonzalez Cabrera Getafe CF 43 15 16
3 Oscar Oscar dos Santos Emboaba 17 5 10
4 Oscar Oscar Rodriguez Arnaiz Real Madrid CF 17 5 10
5 C. Bravo Claudio Bravo Manchester City 62 23 8
6 Naldo Ronaldo Aparecido Rodrigues FC Schalke 04 58 19 1
7 Naldo Edinaldo Gomes Pereira RCD Espanyol 58 19 1
8 G. Castro Gonzalo Castro Borussia Dortmund 158 46 12
9 Juanfran Juan Francisco Torres Belen Atletico Madrid 160 55 10
10 Juanfran Juan Francisco Moreno Fuertes RC Coruna 160 55 10
Explanation:
First, replace NaNs with '' using fillna so that rows containing them are not omitted by groupby.
Then aggregate with groupby and agg, passing a dictionary that specifies the columns and their aggregating functions.
Finally, to display all columns on one line, temporarily change the display option with a with block.
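The aggregation step can be checked on a small sample. The sketch below uses only four rows copied from the question (Mustafi and the two Brunos), so the fillna step is not needed here:

```python
import pandas as pd

# Four rows copied from the question's sample.
df = pd.DataFrame({
    "name": ["S. Mustafi", "S. Mustafi", "Bruno", "Bruno"],
    "full_name": ["Shkodran Mustafi", "Shkodran Mustafi",
                  "Bruno Soriano Llido", "Bruno Gonzalez Cabrera"],
    "club": ["Arsenal", "Arsenal", "Villarreal CF", "Getafe CF"],
    "Points": [76, -2, 43, 43],
    "Start": [26, 0, 15, 15],
    "Sub": [1, 1, 16, 16],
})

# Group on both name columns so that different players who share a short
# name (the two Brunos) stay separate; sum the stats, keep the first club.
merged = (df.groupby(["name", "full_name"], sort=False, as_index=False)
            .agg({"club": "first", "Points": "sum", "Start": "sum", "Sub": "sum"}))
print(merged)
```

Grouping on full_name as well as name is what keeps the two Brunos apart while collapsing Mustafi's two rows into one.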
I have the below dataframe
In [62]: df
Out[62]:
coverage name reports year
Cochice 45 Jason 4 2012
Pima 214 Molly 24 2012
Santa Cruz 212 Tina 31 2013
Maricopa 72 Jake 2 2014
Yuma 85 Amy 3 2014
Basically, I can filter the rows as below:
df[df["coverage"] > 30]
and I can drop/delete rows by label as below:
df.drop(['Cochice', 'Pima'])
But I want to delete a certain number of rows based on a condition. How can I do so?
The best approach is boolean indexing, but with the condition inverted: keep all rows whose coverage is greater than or equal to 72:
print (df[df["coverage"] >= 72])
coverage name reports year
Pima 214 Molly 24 2012
Santa Cruz 212 Tina 31 2013
Maricopa 72 Jake 2 2014
Yuma 85 Amy 3 2014
This is the same as using the ge method:
print (df[df["coverage"].ge(72)])
coverage name reports year
Pima 214 Molly 24 2012
Santa Cruz 212 Tina 31 2013
Maricopa 72 Jake 2 2014
Yuma 85 Amy 3 2014
Another possible solution is to invert the mask with ~:
print (df["coverage"] < 72)
Cochice True
Pima False
Santa Cruz False
Maricopa False
Yuma False
Name: coverage, dtype: bool
print (~(df["coverage"] < 72))
Cochice False
Pima True
Santa Cruz True
Maricopa True
Yuma True
Name: coverage, dtype: bool
print (df[~(df["coverage"] < 72)])
coverage name reports year
Pima 214 Molly 24 2012
Santa Cruz 212 Tina 31 2013
Maricopa 72 Jake 2 2014
Yuma 85 Amy 3 2014
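If the intent is literally to delete rows by condition with drop, a minimal sketch is to collect the index labels that match the condition and pass them to drop. The frame below is rebuilt from the question's data, and the threshold of 72 is the one used in the examples above:

```python
import pandas as pd

# Frame rebuilt from the question's data.
df = pd.DataFrame(
    {
        "coverage": [45, 214, 212, 72, 85],
        "name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
        "reports": [4, 24, 31, 2, 3],
        "year": [2012, 2012, 2013, 2014, 2014],
    },
    index=["Cochice", "Pima", "Santa Cruz", "Maricopa", "Yuma"],
)

# Index labels of the rows that should be deleted.
to_drop = df.index[df["coverage"] < 72]

# drop returns a new frame; df itself is left unchanged.
result = df.drop(to_drop)
print(result)
```

This gives the same rows as df[df["coverage"] >= 72]; boolean indexing is usually the shorter way to write it.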
We can use DataFrame.query() as well:
import pandas as pd
dict_ = {'coverage':[45,214,212,72,85], 'name': ['jason','Molly','Tina','Jake','Amy']}
df = pd.DataFrame(dict_)
print(df.query('coverage > 72'))