Matplotlib Pie Graph with 'All Other Categories' - python

I have created a matplotlib pie chart:
df.plot(kind='pie', subplots=True, figsize=(6, 4))
My dataframe consists of two columns, Country and Value (% distribution), and has about 25 countries listed. I would like to plot only the top 10 countries by value (highest %) and, within the plot, sum the % values of the remaining countries into a single slice labelled 'All Other Countries'. How do I do this with matplotlib via the .plot function?
Country Value
Albania 4%
Brazil 3%
Denmark 5%
France 10%
Mexico 3%
Nigeria 15%
Spain 4%
U.S. 5%

As already stated in the comments, the best way to do this is probably to manipulate the data before plotting. Here's one way to do it:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
countries = [
'Albania',
'Brazil',
'Denmark',
'France',
'Mexico',
'Nigeria',
'Spain',
'Germany',
'Finland',
]
#the full dataframe
df = pd.DataFrame(
data = {'country': countries, 'value' :np.random.rand(len(countries))},
).sort_values('value', ascending = False)
#the top 5
df2 = df[:5].copy()
#others
new_row = pd.DataFrame(data = {
'country' : ['others'],
'value' : [df['value'][5:].sum()]
})
#combining top 5 with others
df2 = pd.concat([df2, new_row])
#plotting: all countries on the left for comparison,
#top 5 plus the combined 'others' on the right
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (9,4))
df.plot(kind = 'pie', y = 'value', labels = df['country'], ax = axes[0])
df2.plot(kind = 'pie', y = 'value', labels = df2['country'], ax = axes[1])
axes[0].set_title('all countries')
axes[1].set_title('top 5')
plt.show()
The result looks like this.
Hope this helps.
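For the top-10 case in the question, the same steps can be wrapped in a small helper. This is only a sketch: the function name top_n_with_other and its parameters are made up for illustration, and the sample values are the ones from the question's table with the % sign dropped.

```python
import pandas as pd

def top_n_with_other(df, n, label_col='country', value_col='value',
                     other_label='All Other Countries'):
    """Hypothetical helper: keep the n largest rows, sum the rest into one row."""
    ranked = df.sort_values(value_col, ascending=False)
    top = ranked.head(n)
    if len(ranked) <= n:
        # Nothing left over, so no extra row is needed.
        return top.reset_index(drop=True)
    other = pd.DataFrame({label_col: [other_label],
                          value_col: [ranked[value_col].iloc[n:].sum()]})
    return pd.concat([top, other], ignore_index=True)

# Sample data taken from the question (values in percent).
df = pd.DataFrame({
    'country': ['Albania', 'Brazil', 'Denmark', 'France',
                'Mexico', 'Nigeria', 'Spain', 'U.S.'],
    'value': [4, 3, 5, 10, 3, 15, 4, 5],
})

# n=2 only to keep the example small; the question would use n=10.
summary = top_n_with_other(df, n=2)
print(summary)
```

From there, summary.set_index('country').plot(kind='pie', y='value', legend=False) should draw one wedge per row, using the index values as labels.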

Related

Generate a world map with folium and value_counts()

I want to create a world map with folium. I am using value_counts because I want to show, like a heatmap, which countries occur most often.
Unfortunately, I do not know how to get the column headings that folium.Choropleth(..., columns=['Country', 'Total'], ...).add_to(m) expects. How can I generate a map from value_counts?
The problem is that value_counts does not return any column headings. Is there a way to get the headings needed for columns=['Country', 'Total']?
Dataframe
id country
0 1 DE
1 2 DE
2 3 CN
3 4 BG
4 3 CN
5 4 BG
6 5 BG
import pandas as pd
d = {'id': [1, 2, 3, 4, 3, 4, 5], 'country': ['DE', 'DE', 'CN', 'BG', 'CN', 'BG', 'BG']}
df = pd.DataFrame(data=d)
# print(df)
count_country = df['country'].value_counts()
[OUT]
BG 3
DE 2
CN 2
import folium
#Creating a base map
m = folium.Map()
folium.Choropleth(
data=count__orders_countries,
columns=['Country', 'Total'],
fill_color='PuRd',
nan_fill_color='white'
).add_to(m)
Check out this tutorial; it is helpful: https://towardsdatascience.com/creating-a-simple-folium-map-covid-19-worldwide-total-case-a0a1429c6e7c
Apparently, there are three key points that you are missing:
1- Setting up the world country data. This is done through a URL that you pass to the geo_data parameter of folium.Choropleth.
From the tutorial:
#Setting up the world countries data URL
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'
2- The country names in your dataframe need to match the names in the data from the URL, so you need to replace any names that differ.
For example, the tutorial had to change the following names (its dataframe with countries and data was called df_covid):
From the tutorial:
#Replacing the country name
df_covid.replace('USA', "United States of America", inplace = True)
df_covid.replace('Tanzania', "United Republic of Tanzania", inplace = True)
df_covid.replace('Democratic Republic of Congo', "Democratic Republic of the Congo", inplace = True)
df_covid.replace('Congo', "Republic of the Congo", inplace = True)
df_covid.replace('Lao', "Laos", inplace = True)
df_covid.replace('Syrian Arab Republic', "Syria", inplace = True)
df_covid.replace('Serbia', "Republic of Serbia", inplace = True)
df_covid.replace('Czechia', "Czech Republic", inplace = True)
df_covid.replace('UAE', "United Arab Emirates", inplace = True)
3- Finally, create your map. Pass the URL to geo_data, the dataframe to data, and the names of the columns that hold the country and the counts to columns.
From the tutorial:
folium.Choropleth(
geo_data=country_shapes,
name='choropleth COVID-19',
data=df_covid,
columns=['Country', 'Total Case'],
key_on='feature.properties.name',
fill_color='PuRd',
nan_fill_color='white'
).add_to(m)
Edit:
To get a data frame from the counts you could do something like this:
df_input = pd.DataFrame()
df_input['Country'] = count_country.index
df_input['Counts'] = count_country.values
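As a shorter variant of the same idea, the counts can be turned into a two-column frame in one chain. This is a minimal sketch using the sample data from the question; the column names 'Country' and 'Total' are simply the ones the question passes to columns=.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 3, 4, 5],
                   'country': ['DE', 'DE', 'CN', 'BG', 'CN', 'BG', 'BG']})

# value_counts returns a Series indexed by country.
# rename_axis names that index, and reset_index turns it into a regular
# column while naming the count column 'Total'.
counts = (df['country'].value_counts()
          .rename_axis('Country')
          .reset_index(name='Total'))
print(counts)
```

The resulting frame can then be passed as data= with columns=['Country', 'Total']. Note that the values in 'Country' still have to match whatever property key_on points to in the GeoJSON file.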

crosstab - plot - how to return a bar chart with the top 5

I want the plot to show only the top 5 teams by total number of medals won (gold, silver, and bronze).
medal_ranks = olympics[olympics['Medal'] != 'NaN'].groupby(['NOC', 'Year', 'Sport', 'Event', 'Season'])
medal_ranks = medal_ranks.first()
medal_ranks = medal_ranks.reset_index()
medal_ranks['NOC'].value_counts().head(5)
medal_colors = ['darkgoldenrod', 'gold', 'silver']
cross = pd.crosstab(medal_ranks.NOC, medal_ranks.Medal).plot(kind='bar',color = medal_colors,
stacked=True, figsize=(18,5))
(The Teams that I want to show:
USA 5261
GBR 4152
FRA 4135
ITA 3738
CAN 3649)
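One possible approach, sketched here on a tiny made-up medal_ranks frame rather than the real Olympics data, is to take the index of the value_counts result and use it to select rows of the crosstab before plotting. The sample rows and head(2) are assumptions chosen only to keep the example small.

```python
import pandas as pd

# Made-up stand-in for the medal_ranks frame from the question.
medal_ranks = pd.DataFrame({
    'NOC':   ['USA', 'USA', 'USA', 'GBR', 'GBR', 'FRA', 'ITA'],
    'Medal': ['Gold', 'Silver', 'Gold', 'Bronze', 'Gold', 'Silver', 'Bronze'],
})

# Teams ordered by total medal count; the question would use head(5).
top_teams = medal_ranks['NOC'].value_counts().head(2).index

# Build the full crosstab, then keep only the rows for the top teams,
# in the same order as the ranking.
cross = pd.crosstab(medal_ranks['NOC'], medal_ranks['Medal']).loc[top_teams]
print(cross)
```

Calling cross.plot(kind='bar', stacked=True, color=medal_colors, figsize=(18, 5)) on the filtered table should then draw only those teams. The crosstab columns come out in alphabetical order (Bronze, Gold, Silver), which lines up with the order of medal_colors in the question.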

Using Add() function to merge multiple dataframes in Panda

Given three data frames containing the number of gold, silver, and bronze Olympic medals won by some countries, determine the total number of medals won by each country.
Note: The three data frames don’t all contain the same countries. Also, sort the final dataframe by total medal count in descending order.
This is my code below, but I am not getting the desired output. Can someone please suggest what is wrong?
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
'Medals': [15, 13, 9]}
)
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
'Medals': [29, 20, 16]}
)
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
'Medals': [40, 28, 27]}
)
#gold.set_index('Country',inplace = True)
#silver.set_index('Country',inplace = True)
#bronze.set_index('Country',inplace = True)
Total = gold.add(silver,fill_value = 0).add(bronze,fill_value = 0)
Total.sort_values('Medals',ascending = True)
You can try:
pd.concat([gold, silver, bronze]).groupby('Country').sum().\
sort_values('Medals', ascending=False).reset_index()
This stacks the three dataframes into one, groups the rows by country, and sums the medals for each country. Finally, it sorts the result in descending order and resets the index.
Output:
Country Medals
0 USA 72
1 France 53
2 UK 27
3 Russia 25
4 Germany 20
You can also do it this way:
gold.set_index('Country', inplace=True)
silver.set_index('Country', inplace=True)
bronze.set_index('Country', inplace=True)
#print(gold)
#print(silver)
#print(bronze)
Total= gold.add(silver, fill_value=0).add(bronze,fill_value=0).sort_values('Medals', ascending=False)
Output:
Medals
Country
USA 72.0
France 53.0
UK 27.0
Russia 25.0
Germany 20.0
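The reason set_index matters here is that add aligns rows by index label. The sketch below, limited to the gold and silver Medals columns for brevity, contrasts the default positional index with a Country index.

```python
import pandas as pd

gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
                     'Medals': [15, 13, 9]})
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
                       'Medals': [29, 20, 16]})

# Default index 0, 1, 2: row 1 of gold (France) is added to
# row 1 of silver (Germany), which is not what we want.
by_position = gold['Medals'].add(silver['Medals'], fill_value=0)

# Country as index: values are matched by country name, and a country
# missing from one side is treated as 0 because of fill_value.
by_country = gold.set_index('Country')['Medals'].add(
    silver.set_index('Country')['Medals'], fill_value=0)

print(by_position)
print(by_country)
```

With the Country index, France keeps its 13 medals and Germany its 20, instead of the two being summed together.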
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
'Medals': [15, 13, 9]}
)
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
'Medals': [29, 20, 16]}
)
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
'Medals': [40, 28, 27]}
)
# Set the index of the dataframes to 'Country' so that you can get the countrywise
# medal count
gold.set_index('Country', inplace = True)
silver.set_index('Country', inplace = True)
bronze.set_index('Country', inplace = True)
# Add the three dataframes and set the fill_value argument to zero to avoid getting
# NaN values
total = gold.add(silver, fill_value = 0).add(bronze, fill_value = 0)
# Sort the resultant dataframe in a descending order
total = total.sort_values(by = 'Medals', ascending = False).astype("int64")
# Print the sorted dataframe
print(total)
astype("int64") converts the float values to integers; the 64 means each value is stored as a 64-bit integer.
import numpy as np
import pandas as pd
# Defining the three dataframes indicating the gold, silver, and bronze medal counts
# of different countries
gold = pd.DataFrame({'Country': ['USA', 'France', 'Russia'],
'Medals': [15, 13, 9]}
)
silver = pd.DataFrame({'Country': ['USA', 'Germany', 'Russia'],
'Medals': [29, 20, 16]}
)
bronze = pd.DataFrame({'Country': ['France', 'USA', 'UK'],
'Medals': [40, 28, 27]}
)
gold.set_index('Country' , inplace = True)
silver.set_index('Country' , inplace = True)
bronze.set_index('Country' , inplace = True )
total = gold.add(silver , fill_value = 0).add(bronze , fill_value = 0)
total = total.sort_values(by = 'Medals', ascending = False)
total = total.astype(int)
print(total)

Plotly: How to make data consistent across animation frames (i.e. avoiding vanishing data) in plotly express

This is starting to bug me: In plotly express when using animation_frame, I know it's important to set ranges so data can be displayed consistently, otherwise data may vanish across frames. But for a column with categorical values (say 'US', 'Russia', 'Germany'), I cannot find any way to avoid disappearing data when not every frame contains all categories if I want that column to appear with different colors (in the code below, that column would be 'AnotherColumn'). Plotly documentation points out
Animations are designed to work well when each row of input is present across all animation frames, and when categorical values mapped to symbol, color and facet are constant across frames. Animations may be misleading or inconsistent if these constraints are not met.
but while I can easily set a range_color when I have a continuous color range, nothing of the sort seems to work for categorical data. I can somewhat work around this by making my data numerical (e.g. 'US' -> 1, 'Russia' -> 2), but that is both fiddly and visually unappealing.
import plotly.express as px
...
fig = px.bar(data, x="NameColumn",
y="SomeColumn",
color="AnotherColumn",
animation_frame="AnimationColumn",
range_y=[0, max_y]
)
Here is a simple reproducible example:
import pandas as pd
import plotly.express as px
data_dict = {'ColorColumn': ['p', 'p', 'p', 'q'],
'xColumn': ['someName', 'someOtherName', 'someName', 'someOtherName'],
'yColumn': [10, 20, 30, 40],
'animationColumn': [1, 1, 2, 2]}
data = pd.DataFrame(data=data_dict)
fig = px.bar(data, x="xColumn",
y="yColumn",
color="ColorColumn",
animation_frame="animationColumn",
range_y=[0, 40]
)
fig.update_layout(xaxis={'title': '',
'visible': True,
'showticklabels': True})
fig.show()
If you try it out, you'll notice the second frame is missing a bar. If the ColorColumn had numeric data, you could fix this by specifying range_color (similar to the specification of range_y in the code above); my question would be, how to handle this with categorical data?
Second edit: Some requested additional data or a more reasonable example. This might be more appropriate:
import pandas as pd
import plotly.express as px
data_dict = {'Region': ['North America', 'Asia', 'Asia',
'North America', 'Asia', 'Europe',
'North America', 'Europe', 'Asia'],
'Country': ['US', 'China', 'Korea',
'US', 'Phillipines', 'France',
'Canada', 'Germany', 'Thailand'],
'GDP': [10, 20, 30,
40, 50, 60,
70, 80, 90],
'Year': [2017, 2017, 2017,
2018, 2018, 2018,
2019, 2019, 2019]}
data = pd.DataFrame(data=data_dict)
fig = px.bar(data, x="Country",
y="GDP",
color="Region",
animation_frame="Year",
range_y=[0, 80]
)
fig.update_layout(xaxis={'title': '',
'visible': True,
'showticklabels': True})
fig.show()
A similar question has been asked and answered under Plotly: How to specify categorical x-axis elements in a plotly express animation?. The necessary adjustments for your use case aren't exactly straightforward though, so I might as well set it up for you.
It all boils down to this setup using, among other things:
df['key']=df.groupby(['Year','Country']).cumcount()
df1 = pd.pivot_table(df,index='Year',columns=['key', 'Country'],values='GDP')
And:
df1 = pd.merge(df1, data[['Country', 'Region']], how='left', on='Country').drop_duplicates()
Using some neat properties of pd.pivot_table, this will give you a dataset that has all years and all countries for all regions even though GDP from these have not been specified.
The first two animation frames will look like this:
Complete code:
import pandas as pd
import plotly.express as px
data_dict = {'Region': ['North America', 'Asia', 'Asia',
'North America', 'Asia', 'Europe',
'North America', 'Europe', 'Asia'],
'Country': ['US', 'China', 'Korea',
'US', 'Phillipines', 'France',
'Canada', 'Germany', 'Thailand'],
'GDP': [10, 20, 30,
40, 50, 60,
70, 80, 90],
'Year': [2017, 2017, 2017,
2018, 2018, 2018,
2019, 2019, 2019]}
data = pd.DataFrame(data=data_dict)
# data munging
df = data.copy()
df['key']=df.groupby(['Year','Country']).cumcount()
df1 = pd.pivot_table(df,index='Year',columns=['key', 'Country'],values='GDP')
df1 = df1.stack(level=[0,1],dropna=False).reset_index()
df1 = pd.merge(df1, data[['Country', 'Region']], how='left', on='Country').drop_duplicates()
df1.columns=['Year', 'Key', 'Country', 'GDP', 'Region']
fig = px.bar(df1, x="Country",
y="GDP",
color="Region",
animation_frame="Year",
range_y=[0, 80]
)
fig.update_layout(xaxis={'title': '',
'visible': True,
'showticklabels': True})
fig.show()
The following is not a direct answer to your question (as in, what you need to change in plotly), but rather focuses on making the data in your DataFrame consistent.
The basic idea is that the "Primary Key" of each of your rows in the second example is ["Year", "Country"]. plotly expects a value for "GDP" as well as "Region" for each combination of those. The following creates a DataFrame that looks just like that (by reindexing with a MultiIndex).
unique_years = data["Year"].unique()
unique_countries = data["Country"].unique()
# Let's first separate the region of a country
region_per_country = data[["Country", "Region"]].drop_duplicates().set_index("Country")
# Removing the region
data = data[["Year", "Country", "GDP"]].set_index(["Year", "Country"])
# Creating all possible "Year" "Country" combinations
data = data.reindex(pd.MultiIndex.from_product([unique_years, unique_countries]))
# Cleanup
data = data.reset_index().rename(columns={"level_0": "Year", "level_1": "Country"})
# Re-adding the region
data = data.merge(region_per_country, left_on="Country", right_index=True)
Running this gives us the following DataFrame (shown without the .reset_index()):
                   GDP         Region
Year Country
2017 Canada        NaN  North America
     China        20.0           Asia
     France        NaN         Europe
     Germany       NaN         Europe
     Korea        30.0           Asia
     Phillipines   NaN           Asia
     Thailand      NaN           Asia
     US           10.0  North America
2018 Canada        NaN  North America
     China         NaN           Asia
     France       60.0         Europe
     Germany       NaN         Europe
     Korea         NaN           Asia
     Phillipines  50.0           Asia
     Thailand      NaN           Asia
     US           40.0  North America
2019 Canada       70.0  North America
     China         NaN           Asia
     France        NaN         Europe
     Germany      80.0         Europe
     Korea         NaN           Asia
     Phillipines   NaN           Asia
     Thailand     90.0           Asia
     US            NaN  North America
which plotly will then correctly plot.
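A more compact variant of the same reindexing idea is sketched below. It names the MultiIndex levels up front so no renaming is needed afterwards, and it fills missing GDP values with 0 instead of NaN; that fill value is an assumption made for this sketch, not something the answer above prescribes.

```python
import pandas as pd

data = pd.DataFrame({
    'Region': ['North America', 'Asia', 'Asia',
               'North America', 'Asia', 'Europe',
               'North America', 'Europe', 'Asia'],
    'Country': ['US', 'China', 'Korea',
                'US', 'Phillipines', 'France',
                'Canada', 'Germany', 'Thailand'],
    'GDP': [10, 20, 30, 40, 50, 60, 70, 80, 90],
    'Year': [2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019],
})

# Every (Year, Country) pair, with named levels.
full_index = pd.MultiIndex.from_product(
    [data['Year'].unique(), data['Country'].unique()],
    names=['Year', 'Country'])

# Lookup table: one region per country.
region = data.drop_duplicates('Country').set_index('Country')['Region']

# Reindex GDP onto the full grid; pairs without data get 0 (an assumption).
complete = (data.set_index(['Year', 'Country'])[['GDP']]
            .reindex(full_index, fill_value=0)
            .reset_index())
complete['Region'] = complete['Country'].map(region)
print(complete)
```

Each year now has a row for all eight countries, so every animation frame contains every category used for color.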

plot data from pandas DataFrame, colour dependent on column value

I have tried multiple codes to condition the bar plot colour to a particular value. It seems that the color function only checks for the first item in the index (in this case Germany) and sets the condition for all other items in the index. I would really appreciate if anybody could help:
colors = ['red' if 'Germany' else 'lightgrey' for x in first5_countries.index] #it colors all bars red
colors = ['r' if 'IT' else 'b' for index in first5_countries.index] #it colors everything red
colors = ['r' if pop_mln>85 else 'b' for pop_mln in first5_countries.pop_mln] #all bars blue
colors = ['r' if index=='Italy' else 'b' for index in first5_countries.index] #all bars blue
colors = ['b', 'b', 'b', 'r', 'b'] #yields blue
The whole code:
sorted_df = population_2019.sort_values(by='pop_mln', ascending=False)
first5_countries = sorted_df[:5]
colors = ['r' if index=='Italy' else 'b' for index in first5_countries.index]
first5_countries[['pop_mln']].plot.bar(figsize=(20,5), legend=False, color=colors)
plt.ylabel('Total population (in million)', size=12)
plt.xticks(rotation=30, ha='right')
plt.xlabel('')
plt.grid(axis='y')
plt.show()
Printout of first5_countries:
               geo sex    age  year   total_pop    pop_mln
geo_full
Germany         DE   T  TOTAL  2019  83019213.0  83.019213
France          FR   T  TOTAL  2019  67012883.0  67.012883
United Kingdom  UK   T  TOTAL  2019  66647112.0  66.647112
Italy           IT   T  TOTAL  2019  60359546.0  60.359546
Spain           ES   T  TOTAL  2019  46937060.0  46.937060
population
first5_countries.index.values
array(['Germany', 'France', 'United Kingdom', 'Italy', 'Spain'],
dtype=object)
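As a side note on the first two attempts: they turn every bar red because the condition is a bare string, and any non-empty string is truthy in Python, so the loop variable is never actually compared. A minimal sketch using the index values from the printout above:

```python
countries = ['Germany', 'France', 'United Kingdom', 'Italy', 'Spain']

# 'Germany' on its own is a non-empty string, which is always truthy,
# so the else branch is never taken.
always_red = ['red' if 'Germany' else 'lightgrey' for x in countries]

# Comparing the loop variable with the target name gives the intended result.
by_name = ['red' if x == 'Italy' else 'lightgrey' for x in countries]

print(always_red)
print(by_name)
```

The "all bars blue" attempts look like a separate issue. As far as I can tell, when a DataFrame is plotted, pandas applies the color list per column, so a single-column frame only uses the first entry; selecting the column as a Series applies one color per bar.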
You can define your colors like this:
colors = ['red' if x=='Italy' else 'lightgray' for x in first5_countries.index]
And then pass to the plot function:
first5_countries['pop_mln'].plot.bar(figsize=(20,5),color=colors, legend=False)
Together, you would do:
colors = ['red' if x=='Italy' else 'lightgray' for x in first5_countries.index]
first5_countries['pop_mln'].plot.bar(figsize=(20,5),color=colors, legend=False)
Output would be something like this:
