Rounding and sorting dataframe with pandas - python

https://github.com/haosmark/jupyter_notebooks/blob/master/Coursera%20week%203%20assignment.ipynb
At the very bottom of the notebook, in question 3, I'm trying to average, round, and sort the data, but for some reason the rounding and sorting don't seem to work at all:
i = df.columns.get_loc('2006')
avgGDP = df[df.columns[i:]].copy()
avgGDP = avgGDP.mean(axis=1).round(2).sort_values(ascending=False)
avgGDP
What am I doing wrong here?
This is what df looks like before I apply average, round, and sort.

Your series is actually sorted, the first line being 1.5e+13 and the last one 4.4e+11:
Country
United States 1.536434e+13
China 6.348609e+12
Japan 5.542208e+12
Germany 3.493025e+12
France 2.681725e+12
United Kingdom 2.487907e+12
Brazil 2.189794e+12
Italy 2.120175e+12
India 1.769297e+12
Canada 1.660648e+12
Russian Federation 1.565460e+12
Spain 1.418078e+12
Australia 1.164043e+12
South Korea 1.106714e+12
Iran 4.441558e+11
Rounding doesn't do anything visible here because the smallest value is 4e+11, and rounding it to 2 decimal places doesn't show on this scale. If you want to keep only 2 decimal places in the scientific notation, you can use .map('{:0.2e}'.format), see my note below.
Note: just for fun, you could also calculate the same with a one-liner:
df.filter(regex='^2').mean(1).sort_values()[::-1].map('{:0.2e}'.format)
Output:
Country
United States 1.54e+13
China 6.35e+12
Japan 5.54e+12
Germany 3.49e+12
France 2.68e+12
United Kingdom 2.49e+12
Brazil 2.19e+12
Italy 2.12e+12
India 1.77e+12
Canada 1.66e+12
Russian Federation 1.57e+12
Spain 1.42e+12
Australia 1.16e+12
South Korea 1.11e+12
Iran 4.44e+11
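To see why round(2) looks like a no-op here, a minimal sketch with three made-up averages on the same scale as the GDP figures (the Series below is an assumption for illustration, not the notebook's data):

```python
import pandas as pd

# Hypothetical averages on the same scale as the GDP figures above
avg = pd.Series(
    [1.536434e13, 4.441558e11, 6.348609e12],
    index=["United States", "Iran", "China"],
)

rounded = avg.round(2).sort_values(ascending=False)

# round(2) only affects digits after the decimal point,
# so values this large are practically unchanged
max_change = (rounded - avg).abs().max()

# String formatting controls how many digits are displayed
formatted = rounded.map("{:0.2e}".format)
print(formatted)
```

Note that the formatted result holds strings, so it is best applied as the last step, after any sorting or arithmetic.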

Related

finding id name of 5 most frequent value in a column in pandas

We have a dataset with a column named "birth_country".
I executed the following code:
import pandas as pd
df=pd.read_csv("data.csv")
df['birth_country'].value_counts()[:5]
output:
United States of America 259
United Kingdom 85
Germany 61
France 51
Sweden 29
I want my output to look like this:
United States of America
United Kingdom
Germany
France
Sweden
How can I do this?
For example:
df['birth_country'].value_counts().idxmax()
gives output:
United States of America
value_counts() keeps the country names in its index, so to get them as a Series, take the index of the first five entries:
pd.Series(df['birth_country'].value_counts()[:5].index)
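If a plain Python list of names is enough, the same index can be converted with tolist(). A small sketch, using a made-up stand-in for the birth_country column and taking the top 2 instead of the top 5:

```python
import pandas as pd

# Hypothetical stand-in for the birth_country column
df = pd.DataFrame(
    {"birth_country": ["USA", "USA", "USA", "UK", "UK", "Germany"]}
)

# value_counts() sorts by frequency (descending); the labels live in the index
top = df["birth_country"].value_counts().head(2).index.tolist()
print(top)
```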

How to convert Object YYYYM1 YYYYM2 to Month and Year

I have a data set of trade statistics. The data is in the following form:
reporter partner Time Period export import
0 Argentina United States 1990M2 1.304801e+08 5.984441e+07
1 Argentina United States 1990M3 1.237417e+08 5.092350e+07
2 Argentina United States 1990M4 1.020971e+08 4.884196e+07
3 Argentina United States 1990M5 1.569232e+08 5.583000e+07
4 Argentina United States 1990M6 1.539624e+08 6.869098e+07
5 Argentina United States 1990M7 1.491639e+08 6.207464e+07
6 Argentina United States 1990M8 1.675413e+08 8.482295e+07
7 Argentina United States 1990M9 1.459988e+08 7.731452e+07
8 Argentina United States 1990M10 1.613134e+08 1.061588e+08
9 Argentina United States 1990M11 1.392604e+08 9.931942e+07
10 Argentina United States 1990M12 1.266004e+08 1.003602e+08
11 Argentina United States 1991M1 1.183864e+08 8.458743e+07
12 Argentina United States 1991M2 1.107058e+08 7.544877e+07
13 Argentina United States 1991M3 1.034667e+08 7.632608e+07
14 Argentina United States 1991M4 1.078808e+08 9.906306e+07
and so on.
The "Time Period" column has dtype object. I want to change its format so that I get February 1990 instead of 1990M2, March 1990 instead of 1990M3, etc.
Convert the column to datetime64 dtype using pd.to_datetime with an explicit format, then extract the year using the dt accessor. For example:
df['year'] = pd.to_datetime(df['Time Period'], format='%YM%m').dt.year
If you wish, you could also extract the month via dt.month.
Alternatively, you could split the strings on 'M' and cast to two separate columns of dtype int, e.g.
df[['Y','M']] = df["Time Period"].str.split("M", expand=True).astype(int)
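To get the "February 1990" style the question asks for, one option is to format the parsed dates with dt.strftime. A minimal sketch on a few sample values; the "Month Year" column name is made up here, and %B yields English month names only under the default locale:

```python
import pandas as pd

df = pd.DataFrame({"Time Period": ["1990M2", "1990M10", "1991M1"]})

# Parse strings like "1990M2" into datetimes (first day of each month)
parsed = pd.to_datetime(df["Time Period"], format="%YM%m")

# %B is the full month name, %Y the four-digit year
df["Month Year"] = parsed.dt.strftime("%B %Y")
print(df)
```

The result is a string column again, so keep the parsed datetime column around if you need to sort or resample by date.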

Sorting values in a pandas series in ascending order not working when re-assigned

I am trying to sort a Pandas Series in ascending order.
Top15['HighRenew'].sort_values(ascending=True)
Gives me:
Country
China 1
Russian Federation 1
Canada 1
Germany 1
Italy 1
Spain 1
Brazil 1
South Korea 2.27935
Iran 5.70772
Japan 10.2328
United Kingdom 10.6005
United States 11.571
Australia 11.8108
India 14.9691
France 17.0203
Name: HighRenew, dtype: object
The values are in ascending order.
However, when I then modify the series in the context of the dataframe:
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True)
Top15['HighRenew']
Gives me:
Country
China 1
United States 11.571
Japan 10.2328
United Kingdom 10.6005
Russian Federation 1
Canada 1
Germany 1
India 14.9691
France 17.0203
South Korea 2.27935
Italy 1
Spain 1
Iran 5.70772
Australia 11.8108
Brazil 1
Name: HighRenew, dtype: object
Why is this giving me different output from the above?
I would be grateful for any advice.
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).to_numpy()
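or, if Top15 uses a default integer index (0, 1, 2, ...):
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).reset_index(drop=True)
sort_values reorders the rows but each value keeps its index label, and assigning a Series to a DataFrame column aligns on those labels, so every value lands back in its original row. Note that with a labelled index such as Country, the reset_index version aligns on the new integer labels instead and fills the column with NaN; to_numpy() sidesteps alignment altogether.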
Thank you to anky for providing me with this fantastic solution!
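The alignment behaviour can be reproduced on a tiny frame. This is a sketch with made-up numbers, not the original Top15 data:

```python
import pandas as pd

top = pd.DataFrame(
    {"HighRenew": [11.5, 1.0, 5.7]},
    index=["United States", "China", "Iran"],
)

sorted_col = top["HighRenew"].sort_values()

# Assigning a Series aligns on index labels: the original row order comes back
aligned = top.copy()
aligned["HighRenew"] = sorted_col

# Assigning a bare array is positional: the sorted order sticks,
# but the values no longer belong to the labels next to them
positional = top.copy()
positional["HighRenew"] = sorted_col.to_numpy()

# Sorting the whole frame keeps each value with its own row label
by_rows = top.sort_values("HighRenew")
```

If the goal is a table ordered by HighRenew with each country still attached to its own value, sorting the whole DataFrame as in the last line is probably what you want.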

Pandas Groupby results coming up based on the value_counts and ascending values

highest_medals_countries = olympics_merged.groupby(['Sport'])['Team'].value_counts()
highest_medals_countries.sort_values(ascending = False)[:10]
Output:
Sport Team
Athletics United States 3202
Great Britain 2240
Gymnastics United States 1939
Swimming United States 1622
Gymnastics France 1576
Athletics France 1494
Gymnastics Italy 1345
Swimming Great Britain 1291
Athletics Germany 1254
Gymnastics Hungary 1242
In the above output, I want the teams with the most medals stacked together by sport, but the rows are ordered by the overall counts, so the sports are interleaved. How can I avoid this and keep the countries together for Athletics, Gymnastics, Swimming, etc.?
Expected output is:
Sport Team
Athletics United States 3202
Great Britain 2240
France 1494
Gymnastics United States 1939
France 1576
Italy 1345
Hungary 1242
Swimming United States 1622
Great Britain 1291
Athletics Germany 1254
By calling sort_values on the grouped result, you sort the entire Series by count, whereas value_counts had already sorted the counts within each sport. So skip the highest_medals_countries.sort_values(ascending = False)[:10] step and the teams stay grouped by sport.
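If you still want only the overall top N rows but displayed sport by sport, one possible approach is to take the top N first and then re-sort by sport and count. A sketch on a made-up miniature of olympics_merged (the data and the "count" column name are assumptions):

```python
import pandas as pd

# Hypothetical miniature of olympics_merged
rows = (
    [("Athletics", "United States")] * 3
    + [("Athletics", "Great Britain")] * 1
    + [("Swimming", "United States")] * 4
    + [("Swimming", "France")] * 2
)
olympics = pd.DataFrame(rows, columns=["Sport", "Team"])

counts = olympics.groupby("Sport")["Team"].value_counts()

# Overall top 3, which interleaves the sports
top = counts.sort_values(ascending=False).head(3)

# Re-sort: sports alphabetically, counts descending within each sport
regrouped = (
    top.reset_index(name="count")
    .sort_values(["Sport", "count"], ascending=[True, False])
    .set_index(["Sport", "Team"])["count"]
)
print(regrouped)
```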

Filter and drop rows by proportion python

I have a dataframe called wine that contains a bunch of rows I need to drop.
How do I drop all rows whose 'country' value accounts for less than 1% of the whole?
Here are the proportions:
#proportion of wine countries in the data set
wine.country.value_counts() / len(wine.country)
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
New Zealand 0.009069
Israel 0.006133
Greece 0.004493
Canada 0.002526
Hungary 0.001755
Romania 0.001558
...
I got lazy and didn't include all of the results, but I think you catch my drift. I need to drop all rows with proportions less than .01.
Here is the head of my dataframe:
country designation points price province taster_name variety year price_category
Portugal Avidagos 87 15.0 Douro Roger Voss Portuguese Red 2011.0 low
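If the proportions are stored in a column named proportion (an assumption; the dataframe shown above has no such column), you can use something like this: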
df = df[df.proportion >= .01]
From that dataset it should give you something like this:
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
Figured it out:
# True for countries that make up at least 1% of all rows
country_filter = wine.country.value_counts(normalize=True) >= 0.01
# Index labels (country names) where the filter is True
country_index = country_filter[country_filter].index
wine = wine[wine.country.isin(country_index)]
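A slightly shorter variant maps each row's country to its share of the data and filters on that directly. This is a sketch on made-up data, not the actual wine dataset:

```python
import pandas as pd

# Hypothetical data: 200 rows, Hungary makes up 0.5% of them
wine = pd.DataFrame(
    {"country": ["US"] * 150 + ["France"] * 49 + ["Hungary"] * 1}
)

# Look up, for every row, the proportion of rows sharing its country
share = wine["country"].map(wine["country"].value_counts(normalize=True))

# Keep only rows whose country accounts for at least 1% of the data
filtered = wine[share >= 0.01]
print(filtered["country"].value_counts())
```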
