delete rows based on a condition in pandas

delete rows based on a condition in pandas - python

I have the below dataframe
In [62]: df
Out[62]:
coverage name reports year
Cochice 45 Jason 4 2012
Pima 214 Molly 24 2012
Santa Cruz 212 Tina 31 2013
Maricopa 72 Jake 2 2014
Yuma 85 Amy 3 2014
Basically i can filter the rows as below
df[df["coverage"] > 30
and i can drop/delete a single row as below
df.drop(['Cochice', 'Pima'])
But i want to delete a certain number of rows based on a condition, how can i do so?

The best is boolean indexing but need invert condition - get all values equal and higher as 72:
print (df[df["coverage"] >= 72])
coverage name reports year
Pima 214 Molly 24 2012
Santa Cruz 212 Tina 31 2013
Maricopa 72 Jake 2 2014
Yuma 85 Amy 3 2014
It is same as ge function:
print (df[df["coverage"].ge(72)])
coverage name reports year
Pima 214 Molly 24 2012
Santa Cruz 212 Tina 31 2013
Maricopa 72 Jake 2 2014
Yuma 85 Amy 3 2014
Another possible solution is invert mask by ~:
print (df["coverage"] < 72)
Cochice True
Pima False
Santa Cruz False
Maricopa False
Yuma False
Name: coverage, dtype: bool
print (~(df["coverage"] < 72))
Cochice False
Pima True
Santa Cruz True
Maricopa True
Yuma True
Name: coverage, dtype: bool
print (df[~(df["coverage"] < 72)])
coverage name reports year
Pima 214 Molly 24 2012
Santa Cruz 212 Tina 31 2013
Maricopa 72 Jake 2 2014
Yuma 85 Amy 3 2014

we can use pandas.query() functionality as well
import pandas as pd
dict_ = {'coverage':[45,214,212,72,85], 'name': ['jason','Molly','Tina','Jake','Amy']}
df = pd.DataFrame(dict_)
print(df.query('coverage > 72'))

Related

pd.read_html() not reading date

When I try to parse a wiki page for its tables, the tables are read correctly except for the date of birth column, which comes back as empty. Is there a workaround for this? I've tried using beautiful soup but I get the same result.
The code I've used is as follows:
url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
pd.read_html(url)
Here's an image of one of the tables in question:

One possible solution can be alter the page content with beautifulsoup and then load it to pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# select correct table, here I select the first one:
tbl = soup.select("table")[0]
# remove the (aged XX) part:
for td in tbl.select("td:nth-of-type(3)"):
td.string = td.contents[-1].split("(")[0]
df = pd.read_html(str(tbl))[0]
print(df)
Prints:
No. Pos. Player Date of birth (age) Caps Club
0 1 GK Thomas Sørensen 12 June 1976 14 Sunderland
1 2 MF Stig Tøfting 14 August 1969 36 Bolton Wanderers
2 3 DF René Henriksen 27 August 1969 39 Panathinaikos
3 4 DF Martin Laursen 26 July 1977 15 Milan
4 5 DF Jan Heintze (c) 17 August 1963 83 PSV Eindhoven
5 6 DF Thomas Helveg 24 June 1971 67 Milan
6 7 MF Thomas Gravesen 11 March 1976 22 Everton
7 8 MF Jesper Grønkjær 12 August 1977 25 Chelsea
8 9 FW Jon Dahl Tomasson 29 August 1976 38 Feyenoord
9 10 MF Martin Jørgensen 6 October 1975 32 Udinese
10 11 FW Ebbe Sand 19 July 1972 44 Schalke 04
11 12 DF Niclas Jensen 17 August 1974 8 Manchester City
12 13 DF Steven Lustü 13 April 1971 4 Lyn
13 14 MF Claus Jensen 29 April 1977 13 Charlton Athletic
14 15 MF Jan Michaelsen 28 November 1970 11 Panathinaikos
15 16 GK Peter Kjær 5 November 1965 4 Aberdeen
16 17 MF Christian Poulsen 28 February 1980 3 Copenhagen
17 18 FW Peter Løvenkrands 29 January 1980 4 Rangers
18 19 MF Dennis Rommedahl 22 July 1978 19 PSV Eindhoven
19 20 DF Kasper Bøgelund 8 October 1980 2 PSV Eindhoven
20 21 FW Peter Madsen 26 April 1978 4 Brøndby
21 22 GK Jesper Christiansen 24 April 1978 0 Vejle
22 23 MF Brian Steen Nielsen 28 December 1968 65 Malmö FF

Try setting the parse_dates parameter to True inside read_html method.

2 strings in one column, want to add the strings to a preexisting value

I have a dataframe which contains the goals and assists for teams in Spain. my problems is that some of the players played for 2 teams within a season and these are represented by the rows which contain 2 strings. due to this, python thinks these values are represent a new club but that's not the case, they moved from one club (the first string) to a new club (the second string).
How can i add the values of the second string to the clubs they are actually now representing, instead of making it look like its a new club ?
year country club goals assists
2020 Spain Alaves 35 21
617 2020 Spain Alaves,Athletic Club 0 0
618 2020 Spain Alaves,Granada 0 1
619 2020 Spain Athletic Club 42 31
620 2020 Spain Athletic Club,Real Valladolid 1 0
621 2020 Spain Atletico Madrid 65 53
622 2020 Spain Atletico Madrid,Osasuna 0 5
623 2020 Spain Atletico Madrid,Valencia 0 0
624 2020 Spain Barcelona 80 51
625 2020 Spain Barcelona,Getafe 2 2
626 2020 Spain Cadiz 32 16
627 2020 Spain Cadiz,Valencia 1 0
628 2020 Spain Celta Vigo 55 37
629 2020 Spain Celta Vigo,Real Valladolid 1 3
630 2020 Spain Eibar 25 17
631 2020 Spain Eibar,Sevilla 4 3
632 2020 Spain Elche 33 25
633 2020 Spain Getafe 24 11
634 2020 Spain Getafe,Villarreal 1 1
635 2020 Spain Granada 46 29
636 2020 Spain Levante 46 35
637 2020 Spain Osasuna 36 21
638 2020 Spain Real Betis 50 33
639 2020 Spain Real Madrid 63 52
640 2020 Spain Real Sociedad 56 36
641 2020 Spain Real Sociedad,Sevilla 1 2
642 2020 Spain Real Valladolid 33 20
643 2020 Spain SD Huesca 33 23
644 2020 Spain Sevilla 51 35
645 2020 Spain Valencia 48 35
646 2020 Spain Villarreal 57 33

If I understand your issue, you can remove the first team with:
df['current_club'] = df['club'].apply(lambda s: s.split(',')[-1])
This will create a new column with just the current club where the player is playing.
Then you can run your analysis (grouping by team etc) using that new column without double teams.

Filtering Dataframe based on many conditions

here is my problem:
I have a dataFrame that look like this :
Date Name Score Country
2012 Paul 45 Mexico
2012 Mike 38 Sweden
2012 Teddy 62 USA
2012 Hilary 80 USA
2013 Ashley 42 France
2013 Temari 58 UK
2013 Harry 78 UK
2013 Silvia 55 Italy
I want to select the two best scores, with a filter by date and also from a different country.
For example here : In 2012 Hilary has the best score (USA) so she will be selected.
Teddy has the second best score in 2012 but he won't be selected as he comes from the same country (USA)
So Paul will be selected instead as he comes from a different country (Mexico).
This is what I did :
df = pd.DataFrame(
{'Date':["2012","2012","2012","2012","2013","2013","2013","2013"],
'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru","Harry","Silvia"],
'Score': [45, 38, 62, 80, 42, 58,78,55],
"Country":["Mexico","Sweden","USA","USA","France","UK",'UK','Italy']})
And then I made the filter by Date and by Score :
df1 = df.set_index('Name').groupby('Date')['Score'].apply(lambda grp: grp.nlargest(2))
But I don't really know and to do the filter that takes into account that they have to come from a different country.
Does anyone have an idea on that ? Thank you so much
EDIT : The answer I am looking for should be something like that :
Date Name Score Country
2012 Hilary 80 USA
2012 Paul 45 Mexico
2013 Harry 78 UK
2013 Silvia 55 Italy
Filter two people by date, best score and from a different country

sort_values + tail
s=df.sort_values('Score').drop_duplicates(['Date','Country'],keep='last').groupby('Date').tail(2)
s
Date Name Score Country
0 2012 Paul 45 Mexico
7 2013 Silvia 55 Italy
6 2013 Harry 78 UK
3 2012 Hilary 80 USA

You can group by a list use the code below:
df1 = df.set_index('Name').groupby(['Date', 'Country'])['Score'].apply(lambda grp: grp.nlargest(1))
It will put out this:
Date Country Name Score
2012 Mexico Paul 45
Sweden Mike 38
USA Hilary 80
2013 France Ashley 42
Italy Silvia 55
UK Harry 78
EDIT:
Based on new information here is a solution. It might be able to be improved a bit but it works.
df.sort_values(['Score'],ascending=False, inplace=True)
df.sort_values(['Date'], inplace=True)
df.drop_duplicates(['Date', 'Country'], keep='first', inplace=True)
df1 = df.groupby('Date').head(2).reset_index(drop=True)
This outputs
Date Name Score Country
0 2012 Hilary 80 USA
1 2012 Paul 45 Mexico
2 2013 Harry 78 UK
3 2013 Silvia 55 Italy

df.groupby(['Country','Name','Date'])['Score'].agg(Score=('Score','first')).reset_index().drop_duplicates(subset='Country', keep='first')
result

I have used different longer approach, which anyone hasn't submitted so far.
df = pd.DataFrame(
{'Date':["2012","2012","2012","2012","2013","2013","2013","2013"],
'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru","Harry","Silvia"],
'Score': [45, 38, 62, 80, 42, 58,78,55],
"Country":["Mexico","Sweden","USA","USA","France","UK",'UK','Italy']})
df1=df.groupby(['Date','Country'])['Score'].max().reset_index()
df2=df.iloc[:,[1,2]]
df1.merge(df2)
This is little convoluted but does the work.

Problems with isin pandas

Sorry, I just asked this question: Pythonic Way to have multiple Or's when conditioning in a dataframe but marked it as answered prematurely because it passed my overly simplistic test case, but isn't working more generally. (If it is possible to merge and reopen the question that would be great...)
Here is the full issue:
sum(data['Name'].isin(eligible_players))
> 0
sum(data['Name'] == "Antonio Brown")
> 68
"Antonio Brown" in eligible_players
> True
Basically if I understand correctly, I am showing that Antonio Brown is in eligible players and he is in the dataframe. However, for some reason the .isin() isn't working properly.
As I said in my prior question, I am looking for a way to check many ors to select the proper rows
____ EDIT ____
In[14]:
eligible_players
Out[14]:
Name
Antonio Brown 378
Demaryius Thomas 334
Jordy Nelson 319
Dez Bryant 309
Emmanuel Sanders 293
Odell Beckham 289
Julio Jones 288
Randall Cobb 284
Jeremy Maclin 267
T.Y. Hilton 255
Alshon Jeffery 252
Golden Tate 250
Mike Evans 236
DeAndre Hopkins 223
Calvin Johnson 220
Kelvin Benjamin 218
Julian Edelman 213
Anquan Boldin 213
Steve Smith 213
Roddy White 208
Brandon LaFell 205
Mike Wallace 205
A.J. Green 203
DeSean Jackson 200
Jordan Matthews 194
Eric Decker 194
Sammy Watkins 190
Torrey Smith 186
Andre Johnson 186
Jarvis Landry 178
Eddie Royal 176
Brandon Marshall 175
Vincent Jackson 175
Rueben Randle 174
Marques Colston 173
Mohamed Sanu 171
Keenan Allen 170
James Jones 168
Malcom Floyd 168
Kenny Stills 167
Greg Jennings 162
Kendall Wright 162
Doug Baldwin 160
Michael Floyd 159
Robert Woods 158
Name: Pts, dtype: int64
and
In [31]:
data.tail(110)
Out[31]:
Name Pts year week pos Team
28029 Dez Bryant 25 2014 17 WR DAL
28030 Antonio Brown 25 2014 17 WR PIT
28031 Jordan Matthews 24 2014 17 WR PHI
28032 Randall Cobb 23 2014 17 WR GB
28033 Rueben Randle 21 2014 17 WR NYG
28034 Demaryius Thomas 19 2014 17 WR DEN
28035 Calvin Johnson 19 2014 17 WR DET
28036 Torrey Smith 18 2014 17 WR BAL
28037 Roddy White 17 2014 17 WR ATL
28038 Steve Smith 17 2014 17 WR BAL
28039 DeSean Jackson 16 2014 17 WR WAS
28040 Mike Evans 16 2014 17 WR TB
28041 Anquan Boldin 16 2014 17 WR SF
28042 Adam Thielen 15 2014 17 WR MIN
28043 Cecil Shorts 15 2014 17 WR JAC
28044 A.J. Green 15 2014 17 WR CIN
28045 Jordy Nelson 14 2014 17 WR GB
28046 Brian Hartline 14 2014 17 WR MIA
28047 Robert Woods 13 2014 17 WR BUF
28048 Kenny Stills 13 2014 17 WR NO
28049 Emmanuel Sanders 13 2014 17 WR DEN
28050 Eddie Royal 13 2014 17 WR SD
28051 Marques Colston 13 2014 17 WR NO
28052 Chris Owusu 12 2014 17 WR NYJ
28053 Brandon LaFell 12 2014 17 WR NE
28054 Dontrelle Inman 12 2014 17 WR SD
28055 Reggie Wayne 11 2014 17 WR IND
28056 Paul Richardson 11 2014 17 WR SEA
28057 Cole Beasley 11 2014 17 WR DAL
28058 Jarvis Landry 10 2014 17 WR MIA

(Aside: once you posted what you were actually using, it only took seconds to see the problem.)
Series.isin(something) iterates over something to determine the set of things you want to test membership in. But your eligible_players isn't a list, it's a Series. And iteration over a Series is iteration over the values, even though membership (in) is with respect to the index:
In [72]: eligible_players = pd.Series([10,20,30], index=["A","B","C"])
In [73]: list(eligible_players)
Out[73]: [10, 20, 30]
In [74]: "A" in eligible_players
Out[74]: True
So in your case, you could use eligible_players.index instead to pass the right names:
In [75]: df = pd.DataFrame({"Name": ["A","B","C","D"]})
In [76]: df
Out[76]:
Name
0 A
1 B
2 C
3 D
In [77]: df["Name"].isin(eligible_players) # remember, this will be [10, 20, 30]
Out[77]:
0 False
1 False
2 False
3 False
Name: Name, dtype: bool
In [78]: df["Name"].isin(eligible_players.index)
Out[78]:
0 True
1 True
2 True
3 False
Name: Name, dtype: bool
In [79]: df["Name"].isin(eligible_players.index).sum()
Out[79]: 3

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different length. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add varying length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhap I'm approaching this incorrectly? Thanks for any help!

I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep="\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

delete rows based on a condition in pandas - python

we can use pandas.query() functionality as well import pandas as pd dict_ = {'coverage':[45,214,212,72,85], 'name': ['jason','Molly','Tina','Jake','Amy']} df = pd.DataFrame(dict_) print(df.query('coverage > 72'))

Related

pd.read_html() not reading date

2 strings in one column, want to add the strings to a preexisting value

Filtering Dataframe based on many conditions

Problems with isin pandas

Adding columns of different length into pandas dataframe

Categories

Resources