Filtering a DataFrame based on multiple conditions - python

Here is my problem:
I have a DataFrame that looks like this:
Date Name Score Country
2012 Paul 45 Mexico
2012 Mike 38 Sweden
2012 Teddy 62 USA
2012 Hilary 80 USA
2013 Ashley 42 France
2013 Temaru 58 UK
2013 Harry 78 UK
2013 Silvia 55 Italy
I want to select the two best scores per date, with the condition that they come from different countries.
For example, in 2012 Hilary has the best score (USA), so she is selected.
Teddy has the second-best score in 2012, but he won't be selected because he comes from the same country (USA).
So Paul is selected instead, as he comes from a different country (Mexico).
This is what I did:
import pandas as pd

df = pd.DataFrame(
    {'Date': ["2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013"],
     'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru", "Harry", "Silvia"],
     'Score': [45, 38, 62, 80, 42, 58, 78, 55],
     'Country': ["Mexico", "Sweden", "USA", "USA", "France", "UK", "UK", "Italy"]})
Then I made the filter by Date and by Score:
df1 = df.set_index('Name').groupby('Date')['Score'].apply(lambda grp: grp.nlargest(2))
But I don't really know how to apply the filter that takes into account that they have to come from different countries.
Does anyone have an idea? Thank you so much.
EDIT: The output I am looking for should be something like this:
Date Name Score Country
2012 Hilary 80 USA
2012 Paul 45 Mexico
2013 Harry 78 UK
2013 Silvia 55 Italy

sort_values + tail
s = df.sort_values('Score').drop_duplicates(['Date', 'Country'], keep='last').groupby('Date').tail(2)
s
Date Name Score Country
0 2012 Paul 45 Mexico
7 2013 Silvia 55 Italy
6 2013 Harry 78 UK
3 2012 Hilary 80 USA
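The rows come out in ascending Score order because of the initial sort; if you want the output ordered like the expected result in the question, one more sort over s should do it:
s.sort_values(['Date', 'Score'], ascending=[True, False])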

You can group by a list of columns using the code below:
df1 = df.set_index('Name').groupby(['Date', 'Country'])['Score'].apply(lambda grp: grp.nlargest(1))
It will output this:
Date Country Name Score
2012 Mexico Paul 45
Sweden Mike 38
USA Hilary 80
2013 France Ashley 42
Italy Silvia 55
UK Harry 78
EDIT:
Based on the new information, here is a solution. It could probably be improved a bit, but it works.
df.sort_values('Score', ascending=False, inplace=True)
df.sort_values('Date', kind='stable', inplace=True)  # a stable sort preserves the Score order within each Date
df.drop_duplicates(['Date', 'Country'], keep='first', inplace=True)
df1 = df.groupby('Date').head(2).reset_index(drop=True)
This outputs:
Date Name Score Country
0 2012 Hilary 80 USA
1 2012 Paul 45 Mexico
2 2013 Harry 78 UK
3 2013 Silvia 55 Italy
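The same idea fits in one chain if you sort by both keys at once, which also avoids relying on sort stability (a sketch over the same df):
df1 = (df.sort_values(['Date', 'Score'], ascending=[True, False])
         .drop_duplicates(['Date', 'Country'])
         .groupby('Date').head(2)
         .reset_index(drop=True))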

Another option is a named aggregation followed by drop_duplicates:
df.groupby(['Country', 'Name', 'Date'])['Score'].agg(Score='first').reset_index().drop_duplicates(subset='Country', keep='first')
Note that deduplicating on Country alone keeps only one row per country across all dates.

I have used a different, longer approach, which no one has submitted so far.
df = pd.DataFrame(
    {'Date': ["2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013"],
     'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru", "Harry", "Silvia"],
     'Score': [45, 38, 62, 80, 42, 58, 78, 55],
     'Country': ["Mexico", "Sweden", "USA", "USA", "France", "UK", "UK", "Italy"]})
df1 = df.groupby(['Date', 'Country'])['Score'].max().reset_index()  # best score per (Date, Country)
df2 = df.iloc[:, [1, 2]]  # the Name and Score columns
df1.merge(df2)  # joins on Score to recover each top scorer's name
This is a little convoluted, but it does the job.
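This yields the best score per country for each date; to cut it down to the two best per date, one could finish with a sort and head (a sketch building on df1 and df2 above):
(df1.merge(df2)
    .sort_values(['Date', 'Score'], ascending=[True, False])
    .groupby('Date').head(2)
    .reset_index(drop=True))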

Related

Python summing selected values in a column that match given condition

Here's the data after the preliminary data cleaning.
year  country  employees
2001  US       9
2001  Canada   81
2001  France   22
2001  Japan    31
2001  Chile    7
2001  Mexico   15
2001  Total    165
2002  US       5
2002  Canada   80
2002  France   20
2002  Japan    30
2002  Egypt    35
2002  Total    170
...   ...      ...
2010  US       32
...   ...      ...
What I want to get is the table below, which sums up all countries except US, Canada, France, and Japan into 'Others'. The list of countries varies every year from 2001 to 2010, so I was thinking of using a for loop with an if condition to loop over every year.
year  country  employees
2001  US       9
2001  Canada   81
2001  France   22
2001  Japan    31
2001  Others   22
2001  Total    165
2002  US       5
2002  Canada   80
2002  France   20
2002  Japan    30
2002  Others   35
2002  Total    170
Any leads would be greatly appreciated!
You may consider dropping Total from your dataframe.
However, as stated, your question can be solved by using Series.where to map away values that you don't recognize:
country = df["country"].where(df["country"].isin(["US", "Canada", "France", "Japan", "Total"]), "Others")
df.groupby([df["year"], country]).sum(numeric_only=True)
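Put together on a small sample (a sketch; the reset_index at the end is an assumption about the desired flat layout):
import pandas as pd

df = pd.DataFrame({
    'year': [2001, 2001, 2001, 2001, 2001, 2001, 2001],
    'country': ['US', 'Canada', 'France', 'Japan', 'Chile', 'Mexico', 'Total'],
    'employees': [9, 81, 22, 31, 7, 15, 165],
})

# anything not in `keep` is mapped to 'Others' before grouping
keep = ['US', 'Canada', 'France', 'Japan', 'Total']
country = df['country'].where(df['country'].isin(keep), 'Others')
out = df.groupby([df['year'], country]).sum(numeric_only=True).reset_index()
print(out)  # Chile (7) and Mexico (15) collapse into one 'Others' row of 22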

Is it possible to conditionally combine data frame rows using pandas in python3?

I have the following data frame.
Names Counts Year
0 Jordan 1043 2000
1 Steve 204 2000
2 Brock 3 2000
3 Steve 33 2000
4 Mike 88 2000
... ... ... ...
20001 Bryce 2 2015
20002 Steve 11 2015
20003 Penny 24 2015
20004 Steve 15 2015
20005 Ryan 5 2015
I want to output the information about the name "Steve" over all years. The output should combine the "Counts" for the name "Steve" if the name appears multiple times within the same year.
Example output might look like:
Names Counts Year
0 Steve 237 2000
1 Steve 400 2001
2 Steve 35 2002
... ... ... ...
15 Steve 26 2015
Do you want something like this?
# make sure the relevant columns are numeric first
cols = ['Counts', 'Year']
df[cols] = df[cols].astype('int32')
# keep only Steve's rows, then sum his counts per year
df = df[df['Names'] == 'Steve']
df = df.groupby('Year')['Counts'].agg(['sum'])
Filter the records for Steve, then group by Year, and finally calculate the aggregates, i.e. first for Names and sum for Counts:
(df[df['Names'].eq('Steve')]
 .groupby('Year')
 .agg({'Names': 'first', 'Counts': 'sum'})
 .reset_index())
Year Names Counts
0 2000 Steve 237
1 2015 Steve 26
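A variant using named aggregation (available since pandas 0.25) gives the same table with explicit output column names; a minimal sketch over the same df:
(df[df['Names'].eq('Steve')]
 .groupby('Year')
 .agg(Names=('Names', 'first'), Counts=('Counts', 'sum'))
 .reset_index())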

How do you merge two pandas dataframes with the same attributes and overwrite the rows that are identical?

I have two datasets that I would like to merge. A simplified version is:
DF1
----
name age country
joe 25 uk
jim 24 usa
jill 46 spain
DF2
---
name age country
joe 25 uk
jim 24 usa
sam 27 france
I would like to merge the two datasets to produce
DF3
---
name age country
joe 25 uk
jim 24 usa
sam 27 france
jill 46 spain
Can anybody suggest how I can achieve this?
via outer merge:
merged_df = df1.merge(df2, how='outer')
OUTPUT:
name age country
0 joe 25 uk
1 jim 24 usa
2 jill 46 spain
3 sam 27 france
NOTE: sort by the age column if required:
merged_df = df1.merge(df2, how='outer').sort_values('age')
The simplest way to do it in this particular case is to concatenate df2 to df1 and then drop duplicates (DataFrame.append did the same job but was removed in pandas 2.0, so pd.concat is used here):
df3 = pd.concat([df1, df2], ignore_index=True).drop_duplicates().reset_index(drop=True)

Fastest way to "unpack" a pandas dataframe

Hope the title is not misleading.
I load an Excel file in a pandas dataframe as usual
df = pd.read_excel('complete.xlsx')
and this is what's inside (it's usually already ordered; this is a really small sample):
df
Out[21]:
Country City First Name Last Name Ref
0 England London John Smith 34
1 England London Bill Owen 332
2 England Brighton Max Crowe 25
3 England Brighton Steve Grant 55
4 France Paris Roland Tomas 44
5 France Paris Anatole Donnet 534
6 France Lyon Paulin Botrel 234
7 Spain Madrid Oriol Abarquero 34
8 Spain Madrid Alberto Olloqui 534
9 Spain Barcelona Ander Moreno 254
10 Spain Barcelona Cesar Aranda 222
what I need to do is automate an export of the data, creating a sqlite db for every country (i.e. 'England.sqlite'), which will contain a table for every city (i.e. London and Brighton), and every table will hold the related personnel info.
The sqlite part is not a problem; I'm only trying to figure out how to "unpack" the dataframe in the most rapid and "pythonic" way.
Thanks
You can loop over the DataFrame.groupby object:
for i, subdf in df.groupby('Country'):
    print(i)
    print(subdf)
    # processing
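Building on that loop, a minimal sketch of the export itself, assuming the standard-library sqlite3 module and DataFrame.to_sql; the 'England.sqlite' file naming and per-city table naming are assumptions taken from the question:
import sqlite3

for country, country_df in df.groupby('Country'):
    conn = sqlite3.connect(f'{country}.sqlite')  # e.g. England.sqlite (assumed naming)
    for city, city_df in country_df.groupby('City'):
        # one table per city, keeping only the personnel columns
        city_df.drop(columns=['Country', 'City']).to_sql(city, conn, if_exists='replace', index=False)
    conn.close()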

Removing data from a column in pandas

I'm trying to prune some data from my data frame, but only the rows where there are duplicates in the "To country" column.
My data frame looks like this:
Year From country To country Points
0 2016 Albania Armenia 0
1 2016 Albania Armenia 2
2 2016 Albania Australia 12
... ... ... ... ...
2129 2016 United Kingdom The Netherlands 0
2130 2016 United Kingdom Ukraine 10
2131 2016 United Kingdom Ukraine 5
[2132 rows x 4 columns]
I try this on it:
df.drop_duplicates(subset='To country', inplace=True)
And what happens is this:
Year From country To country Points
0 2016 Albania Armenia 0
2 2016 Albania Australia 12
4 2016 Albania Austria 0
... ... ... ... ...
46 2016 Albania The Netherlands 0
48 2016 Albania Ukraine 0
50 2016 Albania United Kingdom 5
[50 rows x 4 columns]
While this does get rid of the duplicated 'To country' entries, it also removes all the values of the 'From country' column. I must be using drop_duplicates() wrong, but the pandas documentation isn't helping me understand why it's dropping more than I'd expect it to.
No, this behavior is correct: assuming every team played every other team, it's keeping the first occurrence of each 'To country', and all of those firsts are "From" Albania.
From what you've said below, you want to keep row 0, but not row 1 because it repeats both the To and From countries. The way to eliminate those is:
df.drop_duplicates(subset=['To country', 'From country'], inplace=True)
The simplest solution is to group by the 'to country' name and take the first (or the last, if you prefer) row from each group:
df.groupby('To country').first().reset_index()
# To country Year From country Points
#0 Armenia 2016 Albania 0
#1 Australia 2016 Albania 12
#2 The Netherlands 2016 United Kingdom 0
#3 Ukraine 2016 United Kingdom 10
Compared to aryamccarthy's solution, this one gives you more control over which duplicates to keep.
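For example, to keep the highest-scoring row per 'To country' instead of the first one, sort before grouping (a sketch over the same df):
df.sort_values('Points', ascending=False).groupby('To country').first().reset_index()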
