Use groupby keys as indexes of pandas dataframe - python

I have the following pandas dataframe df:
% Renewable Energy Supply
Country
China 19.754910 1.271910e+11
United States 11.570980 9.083800e+10
Japan 10.232820 1.898400e+10
United Kingdom 10.600470 7.920000e+09
Russian Federation 17.288680 3.070900e+10
Canada 61.945430 1.043100e+10
Germany 17.901530 1.326100e+10
India 14.969080 3.319500e+10
France 17.020280 1.059700e+10
South Korea 2.279353 1.100700e+10
Italy 33.667230 6.530000e+09
Spain 37.968590 4.923000e+09
Iran 5.707721 9.172000e+09
Australia 11.810810 5.386000e+09
Brazil 69.648030 1.214900e+10
I am grouping this dataframe (named Top15 in the code below) by the continent each country belongs to and by the bins obtained from pd.cut on the column % Renewable:
out, bins = pd.cut(Top15['% Renewable'].values, bins = 5, retbins = True)
grp = Top15.groupby(by = [ContinentDict, out])
where,
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
Now I want to create a new dataframe with the same columns as df plus an extra column 'Country'. The index of this new dataframe should be a hierarchical index built from the groupby keys ('Continent', 'out'). After hours of trial and error, I see no way to do this. Any ideas?

You can create a multi-index from continent and cut and assign it back to your data frame:
out, bins = pd.cut(Top15['% Renewable'].values, bins = 5, retbins = True)
con = Top15.index.to_series().map(ContinentDict).values
Top15.reset_index(inplace=True)
Top15.index = pd.MultiIndex.from_arrays([con, out])
Top15
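The same idea can be sketched on a small self-contained frame. This is only an illustration under assumptions: the four rows are copied from the question's table, while the lowercase names top15, continent_dict and result, and the level name "% Renewable bin", are made up for the example.

```python
import pandas as pd

# Small illustrative frame; the values are copied from the question's table.
top15 = pd.DataFrame(
    {"% Renewable": [19.754910, 11.570980, 10.232820, 69.648030]},
    index=pd.Index(["China", "United States", "Japan", "Brazil"], name="Country"),
)
continent_dict = {
    "China": "Asia",
    "United States": "North America",
    "Japan": "Asia",
    "Brazil": "South America",
}

# Bin the percentages into 5 equal-width intervals.
bins = pd.cut(top15["% Renewable"], bins=5)

# Move Country from the index into a column, then index by (continent, bin).
result = top15.reset_index()
result.index = pd.MultiIndex.from_arrays(
    [result["Country"].map(continent_dict), bins.values],
    names=["Continent", "% Renewable bin"],  # hypothetical level names
)
print(result)
```

Working on a copy returned by reset_index() instead of using inplace=True leaves the original frame untouched, which may be convenient if it is needed again later.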

Related

Converting dataframe to dictionary with country by continent

I have a .csv and dataframe which has 2 columns (country, continent). I want to create a dictionary, carrying the continent as key and a list of all countries as values.
The .csv has the following format:
country
continent
Algeria
Africa
Angola
Africa
and so on.
I tried using:
continentsDict = dict([(con, cou) for con, cou in zip(continents.continent, continents.country)])
But this gave me the following output:
{'Africa': 'Zimbabwe', 'Asia': 'Yemen', 'Europe': 'Vatican City', 'North America': 'United States Virgin Islands', 'Oceania': 'Wallis and Futuna', 'South America': 'Venezuela'}
This is the right format, but it only kept the last country found for each continent.
Does anyone have an idea?
Thank you!
Assuming continents is the instance of your pandas df, you could do:
continentsDict = continents.groupby("continent")["country"].apply(list).to_dict()
Given:
country continent
0 Algeria Africa
1 Angola Africa
Doing:
out = df.groupby('continent')['country'].agg(list).to_dict()
print(out)
Output:
{'Africa': ['Algeria', 'Angola']}
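The original attempt kept only one country per continent because a dict holds a single value per key, so each later pair overwrites the earlier one. A minimal sketch of that behaviour next to a manual equivalent of the groupby answer follows; the three-row frame and the names last_only and by_continent are assumptions made for illustration.

```python
from collections import defaultdict

import pandas as pd

continents = pd.DataFrame(
    {
        "country": ["Algeria", "Angola", "Yemen"],
        "continent": ["Africa", "Africa", "Asia"],
    }
)

# dict() keeps only the last value seen for a repeated key.
last_only = dict(zip(continents["continent"], continents["country"]))

# Appending to a list per key keeps every country instead.
by_continent = defaultdict(list)
for con, cou in zip(continents["continent"], continents["country"]):
    by_continent[con].append(cou)
by_continent = dict(by_continent)

print(last_only)
print(by_continent)
```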

Match countries with continents in streamlit selectbox

I have a df where I've used pycountry to get full names for country column and continent column and make select box in streamlit like so,
country continent
Hong Kong Asia
Montenegro Europe
Rwanda Africa
United States North America
Germany Europe
Myanmar Asia
Saudi Arabia Asia
etc.. etc..
Streamlit code:
continent_select = df['continent'].drop_duplicates()
country_select = df['country'].drop_duplicates()
continent_sidebar = st.sidebar.selectbox('Select a continent:', continent_select)
country_sidebar = st.sidebar.selectbox('Select a country:', country_select)
Desired output: I would like country names for specific continent to show in select box ie: Select 'Asia' from continent select box. Then in country select box, Hong Kong, China, India, etc... show up.
I've tried to group the rows into a list with continent
df2 = df.groupby('continent')['country'].apply(list)
continent_sidebar = st.sidebar.selectbox('Select a continent:', df2)
But this results in having a full list in the select box ie: [Rwanda, Morocco, Sudan, etc.]
Is there an easier way besides making a dictionary and grouping them manually?
See the comments in the code.
Code
import streamlit as st
import pandas as pd
data = {
'country': ['Hong Kong', 'Montenegro', 'Rwanda', 'Myanmar', 'Saudi Arabia'],
'continent': ['Asia', 'Europe', 'Africa', 'Asia', 'Asia']
}
df = pd.DataFrame(data)
continent_select = df['continent'].drop_duplicates()
# country_select = df['country'].drop_duplicates()
continent_sidebar = st.sidebar.selectbox('Select a continent:', continent_select)
# Get dataframe where continent is the continent_sidebar.
df1 = df.loc[df.continent == continent_sidebar]
# Get the country column from df1.
df2 = df1.country
# Show df2 in the side bar.
country_sidebar = st.sidebar.selectbox('Select a country:', df2)
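The filtering step can also be tried without Streamlit. The sketch below wraps it in a helper; countries_in is a hypothetical name, not part of pandas or Streamlit, and the sorting and de-duplication are an added design choice rather than something the answer above does.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["Hong Kong", "Montenegro", "Rwanda", "Myanmar", "Saudi Arabia"],
        "continent": ["Asia", "Europe", "Africa", "Asia", "Asia"],
    }
)


def countries_in(frame, continent):
    """Return the sorted, de-duplicated country names for one continent."""
    mask = frame["continent"] == continent
    return sorted(frame.loc[mask, "country"].unique())


print(countries_in(df, "Asia"))
```

The returned list could then be passed as the options of the second select box.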

How to append a column to a dataframe with values based on condition

I have the following dataframe:
Country is actually the index:
2014 2015 PopEst
Country
China 8.230121e+12 8.797999e+12 1.367645e+09
United States 1.615662e+13 1.654857e+13 3.176154e+08
Japan 5.642884e+12 5.669563e+12 1.274094e+08
United Kingdom 2.605643e+12 2.666333e+12 6.387097e+07
Russian Federation 1.678709e+12 1.616149e+12 1.435000e+08
Canada 1.773486e+12 1.792609e+12 3.523986e+07
Germany 3.624386e+12 3.685556e+12 8.036970e+07
India 2.200617e+12 2.367206e+12 1.276731e+09
France 2.729632e+12 2.761185e+12 6.383735e+07
South Korea 1.234340e+12 1.266580e+12 4.980543e+07
Italy 2.033868e+12 2.049316e+12 5.990826e+07
Spain 1.375605e+12 1.419821e+12 4.644340e+07
Iran 4.639027e+11 NaN 7.707563e+07
Australia 1.272520e+12 1.301251e+12 2.331602e+07
Brazil 2.412231e+12 2.319423e+12 2.059153e+08
And I have the following dict:
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
I need to append a column showing the Continent Name for each country.
how can I do this?
Use:
df['Continent'] = df.index.map(ContinentDict)
Try this:
df['Continent'] = df.apply(lambda row : ContinentDict[row.name] ,axis = 1)
Output:
2014 2015 PopEst Continent
China 8.230121e+12 8.797999e+12 1.367645e+09 Asia
United States 1.615662e+13 1.654857e+13 3.176154e+08 North America
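The two approaches differ when a country is missing from the dictionary. The sketch below illustrates this under assumptions: "Atlantis" and its population figure are made up to force a missing key, and continent_dict is a shortened stand-in for ContinentDict.

```python
import pandas as pd

df = pd.DataFrame(
    {"PopEst": [1.367645e09, 3.176154e08, 5.0e06]},  # last value is invented
    index=pd.Index(["China", "United States", "Atlantis"], name="Country"),
)
continent_dict = {"China": "Asia", "United States": "North America"}

# Index.map with a dict leaves NaN where a key is missing.
df["Continent"] = df.index.map(continent_dict)
print(df)

# A direct dict lookup inside apply raises KeyError instead.
try:
    df.apply(lambda row: continent_dict[row.name], axis=1)
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)
```

Which behaviour is preferable depends on whether a missing country should be reported as an error or tolerated as a gap in the data.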

Group the index of dataframe using a dictionary

I have a dataframe 'Top15' whose index is country :
Top15.index
Index(['China', 'United States', 'Japan', 'United Kingdom',
'Russian Federation', 'Canada', 'Germany', 'India', 'France',
'South Korea', 'Italy', 'Spain', 'Iran', 'Australia', 'Brazil'],
dtype='object', name='Country')
I have a dictionary which has continent for each country.
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
I want to group the countries by continent.
I created a column 'Continent':
Top15['Continent']=Top15.index.map(ContinentDict)
After that I tried to group by continent:
Top15.groupby('Continent')
I received the following output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023CBF691AC8>
When I checked the dataframe, it was not grouped.
Is it because country is the index and not a column?
What should I do?
import pandas as pd
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
df = pd.DataFrame(list(ContinentDict.items()), columns=['Country', 'Continent'])
print(df)
"""
Country Continent
0 China Asia
1 United States North America
2 Japan Asia
3 United Kingdom Europe
4 Russian Federation Europe
5 Canada North America
6 Germany Europe
7 India Asia
8 France Europe
9 South Korea Asia
10 Italy Europe
11 Spain Europe
12 Iran Asia
13 Australia Australia
14 Brazil South America
"""
df12 = (df.groupby('Continent').size().reset_index(name='Count')
.sort_values(['Count'], ascending=False).rename(columns={'index': 'Continent'}))
print(df12)
"""
Continent Count
2 Europe 6
0 Asia 5
3 North America 2
1 Australia 1
4 South America 1
"""
df1 = df.groupby('Continent')['Country'].apply(lambda x: x.tolist())
print(df1)
"""
Continent
Asia [China, Japan, India, South Korea, Iran]
Australia [Australia]
Europe [United Kingdom, Russian Federation, Germany, ...
North America [United States, Canada]
South America [Brazil]
Name: Country, dtype: object
"""

Groupby function in pandas dataframe of Python does not seem to work

I have a table with various information (e.g. energy supply, proportion of renewable energy supply) on 15 countries. From it I have to create a dataframe at continent level that shows the number of countries on each continent and the mean, standard deviation and sum of the population of those countries. My problem is that I can't seem to aggregate the data at continent level after mapping the 15 countries to their respective continents. I have to use a predefined dictionary to solve this task. Could you please help me with this? Please find my code below:
def answer_eleven():
    import numpy as np
    import pandas as pd
    Top15 = answer_one()
    Top15['Country Name'] = Top15.index
    ContinentDict = {'China':'Asia',
                     'United States':'North America',
                     'Japan':'Asia',
                     'United Kingdom':'Europe',
                     'Russian Federation':'Europe',
                     'Canada':'North America',
                     'Germany':'Europe',
                     'India':'Asia',
                     'France':'Europe',
                     'South Korea':'Asia',
                     'Italy':'Europe',
                     'Spain':'Europe',
                     'Iran':'Asia',
                     'Australia':'Australia',
                     'Brazil':'South America'}
    Top15['Continent'] = pd.Series(ContinentDict)
    #Top15['size'] = Top15['Country'].count()
    Top15['Population'] = (Top15['Energy Supply'] / Top15['Energy Supply per Capita'])
    #columns_to_keep = ['Continent', 'Population']
    #Top15 = Top15[columns_to_keep]
    #Top15 = Top15.set_index('Continent').groupby(level=0)['Population'].agg({'sum': np.sum})
    Top15.set_index(['Continent'], inplace = True)
    Top15['size'] = Top15.groupby(['Continent'])['Country Name'].count()
    Top15['sum'] = Top15.groupby(['Continent'])['Population'].sum()
    Top15['mean'] = Top15.groupby(['Continent'])['Population'].mean()
    Top15['std'] = Top15.groupby(['Continent'])['Population'].std()
    columns_to_keep = ['size', 'sum', 'mean', 'std']
    Top15 = Top15[columns_to_keep]
    #Top15['Continent Name'] = Top15.index
    #Top15.groupby(['Continent'], level = 0, sort = True)['size'].count()
    return Top15.iloc[:5]
answer_eleven()
I believe you need agg to aggregate with the dictionary as the grouping key:
def answer_eleven():
    Top15 = answer_one()
    ContinentDict = {'China':'Asia',
                     'United States':'North America',
                     'Japan':'Asia',
                     'United Kingdom':'Europe',
                     'Russian Federation':'Europe',
                     'Canada':'North America',
                     'Germany':'Europe',
                     'India':'Asia',
                     'France':'Europe',
                     'South Korea':'Asia',
                     'Italy':'Europe',
                     'Spain':'Europe',
                     'Iran':'Asia',
                     'Australia':'Australia',
                     'Brazil':'South America'}
    Top15['Population'] = (Top15['Energy Supply'] / Top15['Energy Supply per Capita'])
    Top15 = Top15.groupby(ContinentDict)['Population'].agg(['size','sum','mean','std'])
    return Top15
df = answer_eleven()
print (df)
sum mean std size
Country Name
Asia 2.771785e+09 9.239284e+08 6.913019e+08 3
Australia 2.331602e+07 2.331602e+07 NaN 1
Europe 4.579297e+08 7.632161e+07 3.464767e+07 6
North America 3.528552e+08 1.764276e+08 1.996696e+08 2
South America 2.059153e+08 2.059153e+08 NaN 1
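Since answer_one() is not shown, the aggregation can be checked on a tiny stand-in frame. This is a sketch under assumptions: the population numbers below are invented round values, not the real figures, and the lowercase names are made up for the example.

```python
import pandas as pd

# Invented round numbers so the aggregates are easy to verify by hand.
top15 = pd.DataFrame(
    {"Population": [10.0, 30.0, 5.0]},
    index=pd.Index(["China", "Japan", "Canada"], name="Country"),
)
continent_dict = {"China": "Asia", "Japan": "Asia", "Canada": "North America"}

# Group the index through the dict and compute all four statistics at once.
stats = top15.groupby(continent_dict)["Population"].agg(["size", "sum", "mean", "std"])
print(stats)
```

In recent pandas versions the result columns follow the order of the list passed to agg, and std is NaN for a continent with a single country because the sample standard deviation is undefined for one value.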
