I have a problem. I want to create a world map with folium. I am using value_counts because I want to show, like a heatmap, which countries occur most often.
Unfortunately, I do not know how to supply the column headings that folium.Choropleth(..., columns=['Country', 'Total'], ...).add_to(m) expects. How can I generate a map from value_counts?
The problem is that value_counts does not produce any column headings. Is there a way to get the headings needed for columns=['Country', 'Total']?
Dataframe
id country
0 1 DE
1 2 DE
2 3 CN
3 4 BG
4 3 CN
5 4 BG
6 5 BG
import pandas as pd
d = {'id': [1, 2, 3, 4, 3, 4, 5], 'country': ['DE', 'DE', 'CN', 'BG', 'CN', 'BG', 'BG']}
df = pd.DataFrame(data=d)
# print(df)
count_country = df['country'].value_counts()
[OUT]
BG 3
DE 2
CN 2
import folium
#Creating a base map
m = folium.Map()
folium.Choropleth(
data=count_country,
columns=['Country', 'Total'],
fill_color='PuRd',
nan_fill_color='white'
).add_to(m)
Check out this tutorial; it is helpful. https://towardsdatascience.com/creating-a-simple-folium-map-covid-19-worldwide-total-case-a0a1429c6e7c
Apparently, there are a few key points that you are missing:
1- Setting up the world country data. This is done through a URL that you pass to the geo_data parameter of folium.Choropleth.
From the tutorial:
#Setting up the world countries data URL
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'
2- The country names in your dataframe must match the names in the data from the URL, so you need to replace any names that differ.
For example, in the tutorial they had to change the following names (the dataframe with the countries and data was called df_covid):
From the tutorial:
#Replacing the country name
df_covid.replace('USA', "United States of America", inplace = True)
df_covid.replace('Tanzania', "United Republic of Tanzania", inplace = True)
df_covid.replace('Democratic Republic of Congo', "Democratic Republic of the Congo", inplace = True)
df_covid.replace('Congo', "Republic of the Congo", inplace = True)
df_covid.replace('Lao', "Laos", inplace = True)
df_covid.replace('Syrian Arab Republic', "Syria", inplace = True)
df_covid.replace('Serbia', "Republic of Serbia", inplace = True)
df_covid.replace('Czechia', "Czech Republic", inplace = True)
df_covid.replace('UAE', "United Arab Emirates", inplace = True)
3- Finally, create your map. Pass the URL to geo_data, the dataframe to data, and the names of the columns that hold the countries and the counts to columns.
From the tutorial:
folium.Choropleth(
geo_data=country_shapes,
name='choropleth COVID-19',
data=df_covid,
columns=['Country', 'Total Case'],
key_on='feature.properties.name',
fill_color='PuRd',
nan_fill_color='white'
).add_to(m)
Edit:
To get a data frame from the counts you could do something like this:
df_input = pd.DataFrame()
df_input['Country'] = count_country.index
df_input['Counts'] = count_country.values
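As an alternative sketch, the counts can be turned into a two-column dataframe in one chain with rename_axis and reset_index. The headings 'Country' and 'Total' follow the question. The code_to_name dictionary is a hypothetical lookup that is not part of the question or the tutorial: it is added here because the sample data holds two-letter codes while key_on='feature.properties.name' matches on full names, and the names in it are assumed to match the GeoJSON.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 3, 4, 5],
    'country': ['DE', 'DE', 'CN', 'BG', 'CN', 'BG', 'BG'],
})

# Hypothetical lookup from the two-letter codes to full country names
# (assumed to match the names used in world-countries.json).
code_to_name = {'DE': 'Germany', 'CN': 'China', 'BG': 'Bulgaria'}

# value_counts returns a Series; naming the index and the values
# gives the result the two headings that Choropleth asks for.
df_counts = (
    df['country']
    .map(code_to_name)
    .value_counts()
    .rename_axis('Country')
    .reset_index(name='Total')
)
print(df_counts)
```

The resulting df_counts could then be passed as data=df_counts together with columns=['Country', 'Total'].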
I have a df where I've used pycountry to get full names for country column and continent column and make select box in streamlit like so,
country continent
Hong Kong Asia
Montenegro Europe
Rwanda Africa
United States North America
Germany Europe
Myanmar Asia
Saudi Arabia Asia
etc.. etc..
Streamlit code:
continent_select = df['continent'].drop_duplicates()
country_select = df['country'].drop_duplicates()
continent_sidebar = st.sidebar.selectbox('Select a continent:', continent_select)
country_sidebar = st.sidebar.selectbox('Select a country:', country_select)
Desired output: I would like only the countries of the selected continent to show in the country select box, i.e. after selecting 'Asia' in the continent select box, the country select box shows Hong Kong, China, India, etc.
I've tried to group the rows into a list with continent
df2 = df.groupby('continent')['country'].apply(list)
continent_sidebar = st.sidebar.selectbox('Select a continent:', df2)
But this results in each option being a full list in the select box, i.e. [Rwanda, Morocco, Sudan, etc.]
Is there an easier way besides making a dictionary and grouping them manually?
See the comments in the code.
Code
import streamlit as st
import pandas as pd
data = {
'country': ['Hong Kong', 'Montenegro', 'Rwanda', 'Myanmar', 'Saudi Arabia'],
'continent': ['Asia', 'Europe', 'Africa', 'Asia', 'Asia']
}
df = pd.DataFrame(data)
continent_select = df['continent'].drop_duplicates()
# country_select = df['country'].drop_duplicates()
continent_sidebar = st.sidebar.selectbox('Select a continent:', continent_select)
# Get dataframe where continent is the continent_sidebar.
df1 = df.loc[df.continent == continent_sidebar]
# Get the country column from df1.
df2 = df1.country
# Show df2 in the side bar.
country_sidebar = st.sidebar.selectbox('Select a country:', df2)
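The filtering step can also be pulled into a small helper so it can be checked without running the app. This is only a sketch: countries_in is a hypothetical name, and sorting and de-duplicating the result is an assumption about what the select box should show.

```python
import pandas as pd


def countries_in(df, continent):
    """Return the sorted, de-duplicated countries of one continent."""
    mask = df['continent'] == continent
    return sorted(df.loc[mask, 'country'].unique())


df = pd.DataFrame({
    'country': ['Hong Kong', 'Montenegro', 'Rwanda', 'Myanmar', 'Saudi Arabia'],
    'continent': ['Asia', 'Europe', 'Africa', 'Asia', 'Asia'],
})

asia = countries_in(df, 'Asia')
print(asia)

# In the Streamlit app the helper would feed the second select box, e.g.
# st.sidebar.selectbox('Select a country:', countries_in(df, continent_sidebar))
```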
Using the Plotly go.Table() function and Pandas, I'm attempting to create a table to summarize some data. My data is as follows:
import pandas as pd
test_df = pd.DataFrame({'Manufacturer':['BMW', 'Chrysler', 'Chrysler', 'Chrysler', 'Brokertec', 'DWAS', 'Ford', 'Buick'],
'Metric':['Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator'],
'Dimension':['Short', 'Short', 'Short', 'Long', 'Short', 'Short', 'Long', 'Long'],
'User': ['USA', 'USA', 'USA', 'USA', 'USA', 'New USA', 'USA', 'Los USA'],
'Value':[50, 3, 3, 2, 5, 7, 10, 5]
})
My desired output is as follows (summing the Dimension by Manufacturer):
Manufacturer Short Long
Chrysler 6 2
Buick 5 5
Mercedes 7 0
Ford 0 10
I need to shape the Pandas data frame a bit (and this is where I'm running into trouble). My code was as follows:
table_columns = ['Manufacturer', 'Longs', 'Shorts']
manufacturers = ['Chrysler', 'Buick', 'Mercedes', 'Ford']
df_new = (df[df['Manufacturer'].isin(manufacturers)]
.set_index(['Manufacturer', 'Dimension'])
['Value'].unstack()
.reset_index()[table_columns]
)
Then, create the table using the Plotly go.Table() function:
import plotly.graph_objects as go
direction_table = go.Figure(go.Table(
header=dict(
values=table_columns,
font=dict(size=12),
line_color='darkslategray',
fill_color='lightskyblue',
align='center'
),
cells=dict(
values=df_new.T, # using Transpose here
line_color='darkslategray',
fill_color='lightcyan',
align = 'center')
)
)
direction_table
The error I'm seeing is:
ValueError: Index contains duplicate entries, cannot reshape
What is the best way to work around this?
Thanks in advance!
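You need to use pivot_table with aggfunc='sum' instead of set_index followed by unstack, because unstack cannot reshape an index that contains duplicate (Manufacturer, Dimension) pairs: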
table_columns = ['Manufacturer', 'Long', 'Short']
manufacturers = ['Chrysler', 'Buick', 'Mercedes', 'Ford']
df_new = (test_df[test_df['Manufacturer'].isin(manufacturers)]
.pivot_table(index='Manufacturer', columns='Dimension',
values='Value', aggfunc='sum', fill_value=0)
.reset_index()
.rename_axis(columns=None)[table_columns]
)
print (df_new)
Manufacturer Long Short
0 Buick 5 0
1 Chrysler 2 6
2 Ford 10 0
Note that this is not the output you expected, but I don't think your input can produce it: there are no Mercedes rows, and Buick has only a Long value.
Or the same result with groupby.sum and unstack:
(test_df[test_df['Manufacturer'].isin(manufacturers)]
.groupby(['Manufacturer', 'Dimension'])
['Value'].sum()
.unstack(fill_value=0)
.reset_index()
.rename_axis(columns=None)[table_columns]
)
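To see where the error comes from, and to keep requested manufacturers that have no rows at all, something along these lines may help. It is a sketch under two assumptions that go beyond the answer above: that a missing manufacturer such as Mercedes should appear as a row of zeros, and that the rows should follow the order of the manufacturers list.

```python
import pandas as pd

test_df = pd.DataFrame({
    'Manufacturer': ['BMW', 'Chrysler', 'Chrysler', 'Chrysler',
                     'Brokertec', 'DWAS', 'Ford', 'Buick'],
    'Dimension': ['Short', 'Short', 'Short', 'Long',
                  'Short', 'Short', 'Long', 'Long'],
    'Value': [50, 3, 3, 2, 5, 7, 10, 5],
})

# Rows whose (Manufacturer, Dimension) pair occurs more than once;
# these are what make set_index(...).unstack() fail.
dupes = test_df[test_df.duplicated(['Manufacturer', 'Dimension'], keep=False)]
print(dupes)

manufacturers = ['Chrysler', 'Buick', 'Mercedes', 'Ford']

# pivot_table aggregates the duplicates; reindex both filters to the
# requested manufacturers and adds the absent ones as rows of zeros.
summary = (
    test_df
    .pivot_table(index='Manufacturer', columns='Dimension',
                 values='Value', aggfunc='sum', fill_value=0)
    .reindex(manufacturers, fill_value=0)
    .rename_axis(index='Manufacturer', columns=None)
    .reset_index()
)
print(summary)
```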
I have this CSV:
Name Species Country
0 Hobbes Tiger U.S.
1 SherKhan Tiger India
2 Rescuer Mouse Australia
3 Mickey Mouse U.S.
And I have a second CSV:
Continent Countries Unnamed: 2 Unnamed: 3 Unnamed: 4
0 North America U.S. Mexico Guatemala Honduras
1 Asia India China Nepal NaN
2 Australia Australia NaN NaN NaN
3 Africa South Africa Botswana Zimbabwe NaN
I want to use the second CSV to update the first file so that the output is:
Name Species Country
0 Hobbes Tiger North America
1 SherKhan Tiger Asia
2 Rescuer Mouse Australia
3 Mickey Mouse North America
So far this is the closest I have gotten:
import pandas as pd
# Import my data.
data = pd.read_csv('Continents.csv')
Animals = pd.read_csv('Animals.csv')
Animalsdf = pd.DataFrame(Animals)
# Transpose my data from horizontal to vertical.
data1 = data.T
# Clean my data and update my header with the first column.
data1.columns = data1.iloc[0]
# Drop now duplicated data.
data1.drop(data1.index[[0]], inplace = True)
# Build the dictionary.
data_dict = {col: list(data1[col]) for col in data1.columns}
# Update my csv.
Animals['Country'] = Animals['Country'].map(data_dict)
print(Animals)
This results in a dictionary that has lists as its values, and therefore I just get NaN out:
Name Species Country
0 Hobbes Tiger NaN
1 SherKhan Tiger NaN
2 Rescuer Mouse [Australia, nan, nan, nan]
3 Mickey Mouse NaN
I've tried flipping from list to tuples and this doesn't work. Have tried multiple ways to pull in the dictionary etc. I am just out of ideas.
Sorry if the code is super junky. I'm learning this as I go. Figured a project was the best way to learn a new language. Didn't think it would be this difficult.
Any suggestions would be appreciated. I need to be able to use the code so that when I get multiple reference CSVs, I can update my data with new keys. Hope this is clear.
Thanks in advance.
One intuitive solution is to use a dictionary mapping. Data from @WillMonge.
pd.DataFrame.itertuples works by producing namedtuples, but their fields may also be accessed by integer position.
# create mapping dictionary
d = {}
for row in df.itertuples():
d.update(dict.fromkeys(filter(None, row[2:]), row[1]))
# apply mapping dictionary
data['Continent'] = data['Country'].map(d)
print(data)
Country name Continent
0 China 2 Asia
1 China 5 Asia
2 Canada 9 America
3 Egypt 0 Africa
4 Mexico 3 America
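When the reference table comes from read_csv, empty cells are NaN rather than None, and NaN is truthy, so filter(None, ...) would keep it. A melt-based variant sidesteps that. The sketch below rebuilds the two tables from the question by hand (the real code would read them from the CSV files) and assumes every country appears under exactly one continent.

```python
import pandas as pd

nan = float('nan')
continents = pd.DataFrame({
    'Continent': ['North America', 'Asia', 'Australia', 'Africa'],
    'Countries': ['U.S.', 'India', 'Australia', 'South Africa'],
    'Unnamed: 2': ['Mexico', 'China', nan, 'Botswana'],
    'Unnamed: 3': ['Guatemala', 'Nepal', nan, 'Zimbabwe'],
    'Unnamed: 4': ['Honduras', nan, nan, nan],
})
animals = pd.DataFrame({
    'Name': ['Hobbes', 'SherKhan', 'Rescuer', 'Mickey'],
    'Species': ['Tiger', 'Tiger', 'Mouse', 'Mouse'],
    'Country': ['U.S.', 'India', 'Australia', 'U.S.'],
})

# Wide to long: one row per (Continent, Country) pair, empty cells dropped.
long_form = (
    continents
    .melt(id_vars='Continent', value_name='Country')
    .dropna(subset=['Country'])
)

# Series indexed by country, usable directly with Series.map.
country_to_continent = long_form.set_index('Country')['Continent']

animals['Country'] = animals['Country'].map(country_to_continent)
print(animals)
```

Countries that are missing from the reference table come out as NaN, which makes gaps easy to spot when a new reference CSV arrives.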
You should use DictReader and DictWriter from the csv module. You can learn how to use them at the link below.
https://docs.python.org/2/library/csv.html
Here is an update of your code, I have tried to add comments to explain
import pandas as pd
# Read data in (read_csv also returns a DataFrame directly)
data = pd.DataFrame({'name': [2, 5, 9, 0, 3], 'Country': ['China', 'China', 'Canada', 'Egypt', 'Mexico']})
df = pd.DataFrame({'Continent': ['Asia', 'America', 'Africa'],
'Country1': ['China', 'Mexico', 'Egypt'],
'Country2': ['Japan', 'Canada', None],
'Country3': ['Thailand', None, None ]})
# Unstack to get a row for each country (remove the continent rows)
premap_df = pd.DataFrame(df.unstack().drop('Continent')).dropna().reset_index()
premap_df.columns = ['_', 'continent_key', 'Country']
# Merge the continent back based on the continent_key (old row number)
map_df = pd.merge(premap_df, df[['Continent']], left_on='continent_key', right_index=True)[['Continent', 'Country']]
# Merge with the data now
pd.merge(data, map_df, on='Country')
For further reference, Wes McKinney's Python for Data Analysis is one of the best books out there for learning pandas.
You can always create buckets and run conditions:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Hobbes','SherKhan','Rescuer','Mickey'], 'Species':['Tiger','Tiger','Mouse','Mouse'],'Country':['U.S.','India','Australia','U.S.']})
North_America = ['U.S.', 'Mexico', 'Guatemala', 'Honduras']
Asia = ['India', 'China', 'Nepal']
Australia = ['Australia']
Africa = ['South Africa', 'Botswana', 'Zimbabwe']
conditions = [
(df['Country'].isin(North_America)),
(df['Country'].isin(Asia)),
(df['Country'].isin(Australia)),
(df['Country'].isin(Africa))
]
choices = [
'North America',
'Asia',
'Australia',
'Africa'
]
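# The default must share a dtype with the string choices; recent NumPy
# versions raise a TypeError for default=np.nan here.
df['Continent'] = np.select(conditions, choices, default='Unknown')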
df
I have created a matplotlib pie chart:
df.plot(kind='pie', subplots=True, figsize=(6, 4))
My dataframe consists of two columns - Country and Value (% distribution) and has about 25 countries listed. I would like to only plot the top 10 countries by values (by highest %) and within the plot, calculate the remaining countries % value and give it the title of 'All Other Countries'. How do I do this using matplotlib using the .plot function?
Country Value
Albania 4%
Brazil 3%
Denmark 5%
France 10%
Mexico 3%
Nigeria 15%
Spain 4%
U.S. 5%
As already stated in the comments, the best way to do this is probably to do the manipulations before plotting. Here's one way to do it:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
countries = [
'Albania',
'Brazil',
'Denmark',
'France',
'Mexico',
'Nigeria',
'Spain',
'Germany',
'Finland',
]
#the full dataframe
df = pd.DataFrame(
data = {'country': countries, 'value' :np.random.rand(len(countries))},
).sort_values('value', ascending = False)
#the top 5
df2 = df[:5].copy()
#others
new_row = pd.DataFrame(data = {
'country' : ['others'],
'value' : [df['value'][5:].sum()]
})
#combining top 5 with others
df2 = pd.concat([df2, new_row])
#plotting -- for comparison left all countries and right
#the others combined
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (9,4))
df.plot(kind = 'pie', y = 'value', labels = df['country'], ax = axes[0])
df2.plot(kind = 'pie', y = 'value', labels = df2['country'], ax = axes[1])
axes[0].set_title('all countries')
axes[1].set_title('top 5')
plt.show()
Hope this helps.
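The same idea can be wrapped in a function so that the cut-off (10 in the question, 5 above) becomes a parameter. This is a sketch: top_n_with_other and its parameter names are hypothetical, and it assumes the percentages arrive as strings such as '4%' that need converting to numbers first.

```python
import pandas as pd


def top_n_with_other(df, n, label_col='Country', value_col='Value',
                     other_label='All Other Countries'):
    """Keep the n largest rows and sum the remainder into one extra row."""
    ranked = df.sort_values(value_col, ascending=False)
    top = ranked.head(n)
    rest = ranked.iloc[n:]
    if rest.empty:
        return top.reset_index(drop=True)
    other = pd.DataFrame({label_col: [other_label],
                          value_col: [rest[value_col].sum()]})
    return pd.concat([top, other], ignore_index=True)


df = pd.DataFrame({
    'Country': ['Albania', 'Brazil', 'Denmark', 'France',
                'Mexico', 'Nigeria', 'Spain', 'U.S.'],
    'Value': ['4%', '3%', '5%', '10%', '3%', '15%', '4%', '5%'],
})

# Convert '4%' style strings to plain numbers before ranking.
df['Value'] = df['Value'].str.rstrip('%').astype(float)

result = top_n_with_other(df, 2)
print(result)
```

The result can be plotted the same way as df2 above, with y='Value' and labels=result['Country'].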
I have a function called handle_text that derives region and market values from a dataframe column:
def handle_text(txt):
if txt.lower()[:6] == 'deu_ga':
return 'Western Europe', 'Germany'
elif txt.lower()[:6] == 'fra_ga':
return 'Western Europe', 'France'
return 'Other', 'Other'
I apply handle_text on various dataframes in the following way:
campaigns_df['Region'], campaigns_df['Market'] = zip(*campaigns_df['Campaign Name'].apply(handle_text))
atlas_df['Region'], atlas_df['Market'] = zip(*atlas_df['Campaign Name'].apply(handle_text))
flashtalking_df['Region'], flashtalking_df['Market'] = zip(*flashtalking_df['Campaign Name'].apply(handle_text))
I was wondering if there was a way to do a for loop to apply the function to various dfs at once:
dataframes = [atlas_df, flashtalking_df, innovid_df, ias_viewability_df, ias_fraud_df]
columns_df = ['Campaign Name']
for df in dataframes:
for column in df.columns:
if column in columns_df:
zip(df.column.apply(handle_text))
However the error I get is:
AttributeError: 'DataFrame' object has no attribute 'column'
I managed to solve it like this:
dataframes = [atlas_df, flashtalking_df, innovid_df, ias_viewability_df, ias_fraud_df, mediaplan_df]
columns_df = 'Campaign Name'
for df in dataframes:
df['Region'], df['Market'] = zip(*df[columns_df].apply(handle_text))
You need to change the attribute access with . to the more general access with []:
zip(df.column.apply(handle_text))
to
zip(df[column].apply(handle_text))
EDIT:
Better solution:
atlas_df = pd.DataFrame({'Campaign Name':['deu_gathf', 'deu_gahf', 'fra_gagg'],'another_col':[1,2,3]})
flashtalking_df = pd.DataFrame({'Campaign Name':['deu_gahf','fra_ga', 'deu_gatt'],'another_col':[4,5,6]})
dataframes = [atlas_df, flashtalking_df]
columns_df = 'Campaign Name'
You can map by dict and then create new columns:
d = {'deu_ga': ['Western Europe','Germany'], 'fra_ga':['Western Europe','France']}
for df in dataframes:
df[['Region','Market']] = pd.DataFrame(df[columns_df].str.lower()
.str[:6]
.map(d)
.values.tolist())
#print (df)
print (atlas_df)
Campaign Name another_col Region Market
0 deu_gathf 1 Western Europe Germany
1 deu_gahf 2 Western Europe Germany
2 fra_gagg 3 Western Europe France
print (flashtalking_df)
Campaign Name another_col Region Market
0 deu_gahf 4 Western Europe Germany
1 fra_ga 5 Western Europe France
2 deu_gatt 6 Western Europe Germany
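Unlike handle_text, the dictionary mapping above has no fallback for prefixes that are not in the dictionary. A variant that keeps the 'Other', 'Other' default might look like this; add_region_market, PREFIX_MAP and DEFAULT are hypothetical names introduced only for this sketch.

```python
import pandas as pd

PREFIX_MAP = {
    'deu_ga': ('Western Europe', 'Germany'),
    'fra_ga': ('Western Europe', 'France'),
}
DEFAULT = ('Other', 'Other')  # same fallback as handle_text


def add_region_market(df, column='Campaign Name'):
    """Add Region and Market columns based on the first six characters."""
    prefixes = df[column].str.lower().str[:6]
    pairs = [PREFIX_MAP.get(prefix, DEFAULT) for prefix in prefixes]
    df['Region'] = [region for region, _ in pairs]
    df['Market'] = [market for _, market in pairs]
    return df


atlas_df = pd.DataFrame({'Campaign Name': ['deu_gathf', 'FRA_GA_x', 'xyz'],
                         'another_col': [1, 2, 3]})
flashtalking_df = pd.DataFrame({'Campaign Name': ['deu_gahf', 'fra_ga'],
                                'another_col': [4, 5]})

# The function modifies each dataframe in place, so a plain loop is enough.
for frame in [atlas_df, flashtalking_df]:
    add_region_market(frame)

print(atlas_df)
```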