Removing data from a column in pandas - python

I'm trying to prune some data from my data frame, but only the rows where there are duplicates in the "To country" column.
My data frame looks like this:
Year From country To country Points
0 2016 Albania Armenia 0
1 2016 Albania Armenia 2
2 2016 Albania Australia 12
... ... ... ...
2129 2016 United Kingdom The Netherlands 0
2130 2016 United Kingdom Ukraine 10
2131 2016 United Kingdom Ukraine 5
[2132 rows x 4 columns]
I try this on it:
df.drop_duplicates(subset='To country', inplace=True)
And what happens is this:
Year From country To country Points
0 2016 Albania Armenia 0
2 2016 Albania Australia 12
4 2016 Albania Austria 0
... ... ... ...
46 2016 Albania The Netherlands 0
48 2016 Albania Ukraine 0
50 2016 Albania United Kingdom 5
[50 rows x 4 columns]
While this does get rid of the duplicated 'To country' entries, it also leaves only Albania in the 'From country' column. I must be using drop_duplicates() wrong, but the pandas documentation isn't helping me understand why it's dropping more than I'd expect it to.

No, this behavior is correct: assuming every country appears against every other country, drop_duplicates() keeps only the first occurrence of each 'To country', and all of those first occurrences happen to be "From" Albania.
From what you've said below, you want to keep row 0, but not row 1 because it repeats both the To and From countries. The way to eliminate those is:
df.drop_duplicates(subset=['To country', 'From country'], inplace=True)

The simplest solution is to group by the 'To country' name and take the first (or the last, if you prefer) row from each group:
df.groupby('To country').first().reset_index()
# To country Year From country Points
#0 Armenia 2016 Albania 0
#1 Australia 2016 Albania 12
#2 The Netherlands 2016 United Kingdom 0
#3 Ukraine 2016 United Kingdom 10
Compared to aryamccarthy's solution, this one gives you more control over which duplicates to keep.
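If you want more of that control, for example keeping the row with the highest Points for each destination rather than simply the first one seen, sorting before dropping duplicates is one option; a small sketch (treating Points as the tiebreaker is just an assumption for illustration):
# keep the highest-scoring row per 'To country'
df.sort_values('Points', ascending=False).drop_duplicates(subset='To country')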

Related

Python summing selected values in a column that match a given condition

Here's the data after the preliminary data cleaning.
year country employees
2001 US 9
2001 Canada 81
2001 France 22
2001 Japan 31
2001 Chile 7
2001 Mexico 15
2001 Total 165
2002 US 5
2002 Canada 80
2002 France 20
2002 Japan 30
2002 Egypt 35
2002 Total 170
... ... ...
2010 US 32
... ... ...
What I want to get is the table below, which sums all countries except US, Canada, France, and Japan into 'Others'. The list of countries varies every year from 2001 to 2010, so I want to use a for loop with an if condition to loop over every year.
year country employees
2001 US 9
2001 Canada 81
2001 France 22
2001 Japan 31
2001 Others 22
2001 Total 165
2002 US 5
2002 Canada 80
2002 France 20
2002 Japan 30
2002 Others 35
2002 Total 170
Any leads would be greatly appreciated!
You may consider dropping Total from your dataframe.
However, as stated, your question can be solved by using Series.where to map away values that you don't recognize:
country = df["country"].where(df["country"].isin(["US", "Canada", "France", "Japan", "Total"]), "Others")
df.groupby([df["year"], country]).sum(numeric_only=True)
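For completeness, here is a minimal self-contained sketch of the same idea; the sample rows are made up just so it runs on its own:
import pandas as pd

df = pd.DataFrame({"year": [2001, 2001, 2001, 2001, 2001, 2001, 2001],
                   "country": ["US", "Canada", "France", "Japan", "Chile", "Mexico", "Total"],
                   "employees": [9, 81, 22, 31, 7, 15, 165]})
# anything outside the keep-list is relabelled 'Others' before grouping
keep = ["US", "Canada", "France", "Japan", "Total"]
country = df["country"].where(df["country"].isin(keep), "Others")
print(df.groupby([df["year"], country], sort=False).sum(numeric_only=True).reset_index())
#    year country  employees
# 0  2001      US          9
# 1  2001  Canada         81
# 2  2001  France         22
# 3  2001   Japan         31
# 4  2001  Others         22
# 5  2001   Total        165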

Fastest way to "unpack" a pandas dataframe

Hope the title is not misleading.
I load an Excel file in a pandas dataframe as usual
df = pd.read_excel('complete.xlsx')
and this is what's inside (it's usually already ordered; this is a really small sample):
df
Out[21]:
Country City First Name Last Name Ref
0 England London John Smith 34
1 England London Bill Owen 332
2 England Brighton Max Crowe 25
3 England Brighton Steve Grant 55
4 France Paris Roland Tomas 44
5 France Paris Anatole Donnet 534
6 France Lyon Paulin Botrel 234
7 Spain Madrid Oriol Abarquero 34
8 Spain Madrid Alberto Olloqui 534
9 Spain Barcelona Ander Moreno 254
10 Spain Barcelona Cesar Aranda 222
What I need to do is automate an export of the data, creating a sqlite db for every country (e.g. 'England.sqlite'), which will contain a table for every city (e.g. London and Brighton), and every table will have the related personnel info.
The sqlite part is not a problem; I'm only trying to figure out how to "unpack" the dataframe in the most rapid and "pythonic" way.
Thanks
You can loop over the DataFrame.groupby object:
for i, subdf in df.groupby('Country'):
    print(i)
    print(subdf)
    # processing
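For the export itself, one possible sketch (assuming one sqlite file per country and one table per city, with file and table names taken straight from the group keys):
import sqlite3

for country, country_df in df.groupby('Country'):
    conn = sqlite3.connect(f'{country}.sqlite')  # e.g. England.sqlite
    for city, city_df in country_df.groupby('City'):
        # one table per city, keeping just the personnel columns
        city_df.drop(columns=['Country', 'City']).to_sql(city, conn, if_exists='replace', index=False)
    conn.close()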

Filtering Dataframe based on many conditions

Here is my problem:
I have a DataFrame that looks like this:
Date Name Score Country
2012 Paul 45 Mexico
2012 Mike 38 Sweden
2012 Teddy 62 USA
2012 Hilary 80 USA
2013 Ashley 42 France
2013 Temari 58 UK
2013 Harry 78 UK
2013 Silvia 55 Italy
I want to select the two best scores for each date, but each one has to come from a different country.
For example, here: in 2012 Hilary has the best score (USA), so she will be selected.
Teddy has the second best score in 2012, but he won't be selected as he comes from the same country (USA).
So Paul will be selected instead, as he comes from a different country (Mexico).
This is what I did :
df = pd.DataFrame(
    {'Date': ["2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013"],
     'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru", "Harry", "Silvia"],
     'Score': [45, 38, 62, 80, 42, 58, 78, 55],
     'Country': ["Mexico", "Sweden", "USA", "USA", "France", "UK", "UK", "Italy"]})
And then I filtered by Date and by Score:
df1 = df.set_index('Name').groupby('Date')['Score'].apply(lambda grp: grp.nlargest(2))
But I don't really know how to apply the filter that takes into account that they have to come from different countries.
Does anyone have an idea? Thank you so much.
EDIT: The answer I am looking for should be something like this:
Date Name Score Country
2012 Hilary 80 USA
2012 Paul 45 Mexico
2013 Harry 78 UK
2013 Silvia 55 Italy
Filter two people by date, best score and from a different country
sort_values + tail
s = df.sort_values('Score').drop_duplicates(['Date', 'Country'], keep='last').groupby('Date').tail(2)
s
Date Name Score Country
0 2012 Paul 45 Mexico
7 2013 Silvia 55 Italy
6 2013 Harry 78 UK
3 2012 Hilary 80 USA
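If you also want the rows ordered by date, as in the expected output, a final sort can be chained:
s.sort_values(['Date', 'Score'], ascending=[True, False])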
You can group by a list of columns; use the code below:
df1 = df.set_index('Name').groupby(['Date', 'Country'])['Score'].apply(lambda grp: grp.nlargest(1))
It will output this:
Date Country Name Score
2012 Mexico Paul 45
Sweden Mike 38
USA Hilary 80
2013 France Ashley 42
Italy Silvia 55
UK Harry 78
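From that per-country maximum, one possible follow-up to cut it down to the two best per date could be:
df1.reset_index().sort_values('Score', ascending=False).groupby('Date').head(2)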
EDIT:
Based on the new information, here is a solution. It could probably be improved a bit, but it works.
df.sort_values(['Score'],ascending=False, inplace=True)
df.sort_values(['Date'], inplace=True)
df.drop_duplicates(['Date', 'Country'], keep='first', inplace=True)
df1 = df.groupby('Date').head(2).reset_index(drop=True)
This outputs
Date Name Score Country
0 2012 Hilary 80 USA
1 2012 Paul 45 Mexico
2 2013 Harry 78 UK
3 2013 Silvia 55 Italy
df.groupby(['Country', 'Name', 'Date'])['Score'].agg(Score='first').reset_index().drop_duplicates(subset='Country', keep='first')
I have used a different, longer approach, which no one has submitted so far.
df = pd.DataFrame(
    {'Date': ["2012", "2012", "2012", "2012", "2013", "2013", "2013", "2013"],
     'Name': ["Paul", "Mike", "Teddy", "Hilary", "Ashley", "Temaru", "Harry", "Silvia"],
     'Score': [45, 38, 62, 80, 42, 58, 78, 55],
     'Country': ["Mexico", "Sweden", "USA", "USA", "France", "UK", "UK", "Italy"]})
df1 = df.groupby(['Date', 'Country'])['Score'].max().reset_index()
df2 = df.iloc[:, [1, 2]]
df1.merge(df2)
This is a little convoluted, but it does the job.
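One caveat: because the merge key is just Score, two people with the same score would cross-match. A slightly safer variant (the same idea, just with more merge keys) could be:
df1.merge(df, on=['Date', 'Country', 'Score'])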

Wide to Long data frame returning NaN instead of float values

I have a large data frame that looks like this:
Country 2010 2011 2012 2013
0 Germany 4.625e+10 4.814e+10 4.625e+10 4.593e+10
1 France 6.178e+10 6.460e+10 6.003e+10 6.241e+10
2 Italy 4.625e+10 4.625e+10 4.625e+10 4.625e+10
I want to reshape the data so that the Country, Years, and Values are all columns. I used the melt method
dftotal = pd.melt(dftotal, id_vars='Country',
                  value_vars=[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2016, 2017],
                  var_name='Year', value_name='Total')
I was able to attain:
Country Year Total
0 Germany 2010 NaN
1 France 2010 NaN
2 Italy 2010 NaN
My issue is that the float values turn into NaN, and I don't know how to reshape the data frame so the values stay as floats.
Omit the value_vars argument and it works:
pd.melt(dftotal, id_vars='Country', var_name ='Year', value_name='Total')
Country Year Total
0 Germany 2010 4.625000e+10
1 France 2010 6.178000e+10
2 Italy 2010 4.625000e+10
3 Germany 2011 4.814000e+10
4 France 2011 6.460000e+10
5 Italy 2011 4.625000e+10
6 Germany 2012 4.625000e+10
7 France 2012 6.003000e+10
8 Italy 2012 4.625000e+10
9 Germany 2013 4.593000e+10
10 France 2013 6.241000e+10
11 Italy 2013 4.625000e+10
The problem is probably that your column names are not ints but strings, so you could do:
dftotal = pd.melt(dftotal, id_vars='Country',
                  value_vars=['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2016', '2017'],
                  var_name='Year', value_name='Total')
And it would also work.
Alternatively, using stack:
dftotal = (dftotal.set_index('Country').stack()
                  .reset_index()
                  .rename(columns={'level_1': 'Year', 0: 'Total'})
                  .sort_values('Year'))
This will get you the same output (but less succinctly).
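Either way, if you want the Year column as integers afterwards (just an assumption about what you need downstream), a final cast does it:
dftotal['Year'] = dftotal['Year'].astype(int)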

How to groupby and collapse with pandas?

I have a dataframe of the following type:
Country Year Age Male Female
0 Canada 2005 50 400 25
1 Canada 2005 51 100 25
2 Canada 2006 50 100 70
3 Columbia 2005 50 75 75
I would like to, for example, get the total number of males+females of any age, grouped by country and year. I.e. I'm trying to understand what operation could allow me to see a table such as
Country Year Total over ages and sexes
0 Canada 2005 550
1 Canada 2006 170
2 Columbia 2005 150
In the above example, the value 550 comes from the total number of males and females in Canada for the year 2005, regardless of age: so 550 = 400+25+100+25.
I probably need to groupby Country and Year, but I'm not sure how to collapse the ages and total the number of males and females.
df["Total"] = df.Male + df.Female
df.groupby(["Country", "Year"]).Total.sum()
Output:
Country Year
Canada 2005 550
2006 170
Columbia 2005 150
Name: Total, dtype: int64
Update
cᴏʟᴅsᴘᴇᴇᴅ's chained version:
(df.assign(Total=df.Male + df.Female)
   .groupby(['Country', 'Year'])
   .Total
   .sum()
   .reset_index(name='Total over ages and sexes'))
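If you'd rather not create a Total column at all, you can sum the two sex columns after grouping; a small variant along the same lines:
df.groupby(['Country', 'Year'])[['Male', 'Female']].sum().sum(axis=1).reset_index(name='Total over ages and sexes')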
