Frequency map by country in Python

I am trying to make a world map with frequency data for some specific countries. I have tried to use plotly (below), but the base map is not available, and it won't let me load a new one I found.
The map I need should show a color scale (intensity) for the countries where this variable is present.
These are the data and the code with which I have tried to plot the map:
import pandas as pd
import plotly.express as px

database = px.data.gapminder()
d = {'Australia': [3],
     'Brazil': [2],
     'Canada': [6],
     'Chile': [3],
     'Denmark': [1],
     'France': [16],
     'Germany': [3],
     'Israel': [1]}
data = pd.DataFrame(d).T.reset_index()
data.columns = ['country', 'count']
df = pd.merge(database, data, how='left', on='country')
url = (
    "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
fig = px.choropleth(df, locations="country",
                    locationmode='ISO-3',
                    geojson=f"{url}/world-countries.json",
                    color="count")
I keep getting the same error.
Thank you for your help

I think the map is not displayed because the location mode and the locations column are mismatched. If the location mode is 'ISO-3', then for this data the locations column should be 'iso_alpha'; if the location mode is 'country names', then the locations column should be 'country'. Since the gapminder data contains many rows per country, I filtered it to a single year and changed the merge to an inner join.
import pandas as pd
import plotly.express as px

d = {'Australia': [3],
     'Brazil': [2],
     'Canada': [6],
     'Chile': [3],
     'Denmark': [1],
     'France': [16],
     'Germany': [3],
     'Israel': [1]}
data = pd.DataFrame(d).T.reset_index()
data.columns = ['country', 'count']

database = px.data.gapminder().query('year == 2007')
df = pd.merge(database, data, how='inner', on='country')
url = (
    "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
fig = px.choropleth(df,
                    locations="country",            # or "iso_alpha"
                    locationmode="country names",   # or "ISO-3" with iso_alpha
                    geojson=f"{url}/world-countries.json",
                    color="count")
fig.show()
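If you actually want to render against the downloaded GeoJSON rather than plotly's built-in base map, a minimal sketch along the following lines should also work. It assumes (unverified here) that the folium world-countries.json stores each country's name under properties.name, which is what featureidkey points at:
import json
from urllib.request import urlopen
import pandas as pd
import plotly.express as px

url = "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
with urlopen(f"{url}/world-countries.json") as f:
    world_geojson = json.load(f)  # load the GeoJSON once instead of passing a URL string

d = {'Australia': 3, 'Brazil': 2, 'Canada': 6, 'Chile': 3,
     'Denmark': 1, 'France': 16, 'Germany': 3, 'Israel': 1}
data = pd.DataFrame({'country': list(d), 'count': list(d.values())})

fig = px.choropleth(data,
                    geojson=world_geojson,
                    locations="country",             # country names in the data
                    featureidkey="properties.name",  # assumed name field in the GeoJSON
                    color="count")
fig.update_geos(fitbounds="locations")
fig.show()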

Related

Folium Color Issues

I'm working with Folium for the first time, attempting to make a choropleth map of housing values in North Carolina using Zillow data as the source. I've been running into lots of issues along the way, and right now I'm stuck on how to add colors to the map: if the property value is >100k make it green, slowly shifting the gradient toward orange as values approach 850k.
At the moment the map does generate the zip code data fine, but all of the polygons are a black-grey color. It's also not showing a color key or map name, and I have a feeling some of my earlier code could be off.
import folium
import pandas as pd
import requests
import os
working_directory = os.getcwd()
print(working_directory)
path = working_directory + '/Desktop/NCHomes.csv'
df = pd.read_csv(path)
df.head()
df['Homes'].min(), df['Homes'].max()
INDICATOR = 'North Carolina Home Values by Zip Code'
data = df[df['RegionName'] == INDICATOR]
max_value = data['Homes'].max()
data = data[data['Homes'] == max_value]
data.head()
geojson_url = 'https://raw.githubusercontent.com/OpenDataDE/State-zip-code-GeoJSON/master/nc_north_carolina_zip_codes_geo.min.json'
response = requests.get(geojson_url)
geojson = response.json()
geojson
geojson['features'][0]
map_data = data[['RegionName', 'Homes']]
map_data.head()
M = folium.Map(location=[20, 10], zoom_start=2)
folium.Choropleth(
    geo_data=geojson,
    data=map_data,
    columns=['RegionName', 'Homes'],
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name=INDICATOR
).add_to(M)
M
You can specify the threshold_scale parameter as follows:
folium.Choropleth(
    geo_data=geojson,
    data=map_data,
    columns=['RegionName', 'Homes'],
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    threshold_scale=[100000, 850000],
    legend_name=INDICATOR
).add_to(M)
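If you want bin edges that span the whole range of home values rather than just two cut points, one possible sketch (not tested against your data) is to build evenly spaced edges with numpy and pass them as threshold_scale; the scale generally has to cover every value in the column, otherwise folium may complain. The key_on value below is an assumption about the zip-code property name in that GeoJSON, and for bound data fill_color expects a ColorBrewer palette name such as 'YlOrRd' or 'RdYlGn':
import numpy as np

bins = list(np.linspace(map_data['Homes'].min(), map_data['Homes'].max(), 6))

folium.Choropleth(
    geo_data=geojson,
    data=map_data,
    columns=['RegionName', 'Homes'],
    key_on='feature.properties.ZCTA5CE10',  # assumed zip-code field in this GeoJSON
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    threshold_scale=bins,
    legend_name=INDICATOR
).add_to(M)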

How to generate a choropleth map based on region names?

I'm working on Python with a dataset that has data about a numerical variable for each italian region, like this:
import numpy as np
import pandas as pd
regions = ['Trentino Alto Adige', "Valle d'Aosta", 'Veneto', 'Lombardia', 'Emilia-Romagna', 'Toscana', 'Friuli-Venezia Giulia', 'Liguria', 'Piemonte', 'Marche', 'Lazio', 'Umbria', 'Abruzzo', 'Sardegna', 'Puglia', 'Molise', 'Basilicata', 'Calabria', 'Sicilia', 'Campania']
df = pd.DataFrame([regions,[10+(i/2) for i in range(20)]]).transpose()
df.columns = ['region','quantity']
df.head()
I would like to generate a map of Italy in which the colour of each region depends on the numeric value of the variable quantity (df['quantity']), i.e., a choropleth map like this:
How can I do it?
You can use geopandas.
The region names in your df don't exactly match those in the geojson. I'm sure you can find another geojson, or alter the names so they match.
import pandas as pd
import geopandas as gpd

regions = ['Trentino Alto Adige', "Valle d'Aosta", 'Veneto', 'Lombardia', 'Emilia-Romagna', 'Toscana', 'Friuli-Venezia Giulia', 'Liguria', 'Piemonte', 'Marche', 'Lazio', 'Umbria', 'Abruzzo', 'Sardegna', 'Puglia', 'Molise', 'Basilicata', 'Calabria', 'Sicilia', 'Campania']
df = pd.DataFrame([regions, [10+(i/2) for i in range(20)]]).transpose()
df.columns = ['region', 'quantity']

# Download a geojson of the region geometries
gdf = gpd.read_file(filename=r'https://raw.githubusercontent.com/openpolis/geojson-italy/master/geojson/limits_IT_municipalities.geojson')
gdf = gdf.dissolve(by='reg_name')  # The geojson is too detailed, dissolve boundaries by the reg_name attribute
gdf = gdf.reset_index()

# gdf.reg_name[~gdf.reg_name.isin(regions)]  Two region names don't match your df:
# 16    Trentino-Alto Adige/Südtirol
# 18    Valle d'Aosta/Vallée d'Aoste
gdf = pd.merge(left=gdf, right=df, how='left', left_on='reg_name', right_on='region')

ax = gdf.plot(
    column="quantity",
    legend=True,
    figsize=(15, 10),
    cmap='OrRd',
    missing_kwds={'color': 'lightgrey'})
ax.set_axis_off()
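If you'd rather keep the region labels exactly as they are in your df, a small follow-up sketch is to rename the two mismatched GeoJSON names (the two names come from the commented check above) and then merge, i.e. run this instead of the merge line above, so those regions get a value instead of falling back to missing_kwds:
name_fixes = {
    'Trentino-Alto Adige/Südtirol': 'Trentino Alto Adige',
    "Valle d'Aosta/Vallée d'Aoste": "Valle d'Aosta",
}
gdf['reg_name'] = gdf['reg_name'].replace(name_fixes)  # harmonise names before merging
gdf = pd.merge(left=gdf, right=df, how='left', left_on='reg_name', right_on='region')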

How to plot multiple traces with trendlines?

I'm trying to plot trendlines on multiple traces on scatters in plotly. I'm kind of stumped on how to do it.
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
                         y=df_df['Height (meters)'],
                         name='Douglas Fir', mode='markers'))
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
                         y=df_wp['Height (meters)'],
                         name='White Pine', mode='markers'))
fig.update_layout(title="Tree Circumference vs Height (meters)",
                  xaxis_title=df_df['Circumference (meters)'].name,
                  yaxis_title=df_df['Height (meters)'].name,
                  title_x=0.5)
fig.show()
Trying to get something like this:
Here's how I resolved it. Basically I used numpy's polyfit function to calculate the slope and intercept. I then added the trendline for each data set as a separate trace:
import numpy as np

df_m, df_b = np.polyfit(df_df['Circumference (meters)'].to_numpy(), df_df['Height (meters)'].to_numpy(), 1)
wp_m, wp_b = np.polyfit(df_wp['Circumference (meters)'].to_numpy(), df_wp['Height (meters)'].to_numpy(), 1)

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
                         y=df_df['Height (meters)'],
                         name='Douglas Fir', mode='markers'))
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
                         y=(df_m * df_df['Circumference (meters)'] + df_b),
                         name='douglas fir trendline',
                         mode='lines'))
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
                         y=df_wp['Height (meters)'],
                         name='White Pine', mode='markers'))
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
                         y=(wp_m * df_wp['Circumference (meters)'] + wp_b),
                         name='white pine trendline',
                         mode='lines'))
fig.update_layout(title="Tree Circumference vs Height (meters)",
                  xaxis_title=df_df['Circumference (meters)'].name,
                  yaxis_title=df_df['Height (meters)'].name,
                  title_x=0.5)
fig.show()
You've already put together a procedure that solves your problem, but I would like to mention that you can use plotly.express and do the very same thing with only a few lines of code. Using px.scatter() there are actually two slightly different approaches, depending on whether your data is in a long or a wide format. Your data seems to be in the latter (wide) format, since you're asking:
how can I make this work with separate traces?
So I'll start with that. And I'll use a subset of the built-in dataset px.data.stocks() since you haven't provided a data sample.
Code 1 - Wide data
fig_wide = px.scatter(df_wide, x='index', y=['AAPL', 'MSFT'],
                      trendline='ols')
Code 2 - Long data
fig_long = px.scatter(df_long, x='index', y='value',
                      color='variable',
                      trendline='ols')
Plot 1 - Identical results
About the data:
A dataframe of a wide format typically has an index with unique values in the left-most column, variable names in the column headers, and corresponding values for each variable per index in the columns like this:
index AAPL MSFT
0 1.000000 1.000000
1 1.011943 1.015988
2 1.019771 1.020524
3 0.980057 1.066561
4 0.917143 1.040708
Here, adding information about another variable would require adding another column.
A dataframe of a long format, on the other hand, typically organizes the same data with only (though not necessarily only) three columns: index, variable and value:
index variable value
0 AAPL 1.000000
1 AAPL 1.011943
.
.
100 MSFT 1.720717
101 MSFT 1.752239
And contrary to the wide format, this means that index will have duplicate values. But for a good reason.
So what's the difference?
If you look at Code 1 you'll see that the only thing you need to specify for px.scatter in order to get multiple traces with trendlines, in this case AAPL and MSFT on the y-axis versus an index on the x-axis, is trendline = 'ols'. This is because plotly.express automatically identifies the data format as wide and knows how to apply the trendlines correctly. Different columns mean different categories, for each of which a trace and trendline are produced.
As for the "long approach", you've got both AAPL and MSFT in the same variable column, and values for both of them in the value column. But setting color = 'variable' lets plotly.express know how to categorize the variable column, correctly separate the data in the value column, and thus correctly produce the trendlines. A different name in the variable column means that index and value in the same row belong to a different category, for which a new trace and trendline are built.
Any pros and cons?
The arguably only advantage with the wide format is that it's easier to read (particularly for those of us damaged by too many years of sub-excellent data handling with Excel). And one great advantage with the long format is that you can easily illustrate more dimensions of the data if you have more categories with, for example, different symbols or sizes for the markers.
Another advantage with the long format occurs if the dataset changes, for example with the addition of another variable 'AMZN'. Then the name and the values of that variable will occur in the already existing columns instead of adding another one like you would for the wide format. This means that you actually won't have to change the code in:
fig_long = px.scatter(df_long, x='index', y='value',
                      color='variable',
                      trendline='ols')
... in order to add the data to the figure.
While for the wide format, you would have to specify y = ['AAPL', 'MSFT', 'AMZN'] in:
fig_wide = px.scatter(df_wide, x='index', y=['AAPL', 'MSFT', 'AMZN'],
                      trendline='ols')
And I would strongly argue that this outweighs the slight inconvenience of specifying color = 'variable' in:
fig_long = px.scatter(df_long, x='index', y='value',
                      color='variable',
                      trendline='ols')
Plot 2 - A new variable:
Complete code
# imports
import pandas as pd
import plotly.express as px

# data
df = px.data.stocks()
# df.date = pd.to_datetime(df.date)
df_wide = df.drop(['date', 'GOOG', 'AMZN', 'NFLX', 'FB'], axis=1).reset_index()
# df_wide = df.drop(['date', 'GOOG', 'NFLX', 'FB'], axis=1).reset_index()
df_long = pd.melt(df_wide, id_vars='index')
df_long

fig_wide = px.scatter(df_wide, x='index', y=['AAPL', 'MSFT'],
                      trendline='ols')
fig_long = px.scatter(df_long, x='index', y='value',
                      color='variable',
                      trendline='ols')

# fig_long.show()
fig_wide.show()

Applying gradient background color in python pandas is not working

I'm trying to style my table, but I'm having a problem because of how the table is built. I want to color individual columns that I choose, for every year, but I don't want to hard-code the years because I have many tables with different years, so the table needs to stay in this format.
I need to apply a different gradient color to each of 4 columns. This is my code, but it doesn't work. There is a Dropbox link to the Excel document.
import pandas as pd
import numpy as np
df = pd.read_excel('products2.xlsx', index_col=[0])
df.columns = df.columns.str.split('_', expand=True)
new_data = df.stack(0)
new_data1 = new_data.eval('status = profit - loss + other')
new_data2 = new_data1.eval('index = (profit / status) / (loss / status)')
# if I apply the styling here it works, but after the unstack below it no longer does
#style = new_data2.style.background_gradient(subset=['status'])
output = new_data2.unstack(1).swaplevel(0,1, axis=1).sort_index(axis=1)
style = output.style.background_gradient(subset=['profit', 'loss', 'status', 'index'])
# I need profit to be a green gradient, loss red, status blue, and index some other color
style.to_excel('output_products.xlsx')
Any ideas?
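There is no posted answer here, but one approach that should work for two-level columns of the form (year, metric), which is what the unstack/swaplevel above produces, is to chain one background_gradient call per metric, each with its own cmap and a pd.IndexSlice subset, so the years never appear in the code. This is only a sketch on toy data (background_gradient needs matplotlib, and exporting styles with to_excel needs openpyxl):
import numpy as np
import pandas as pd

# Toy stand-in for `output`: two-level columns (year, metric), any number of years.
years = ['2020', '2021']
metrics = ['profit', 'loss', 'status', 'index']
cols = pd.MultiIndex.from_product([years, metrics])
output = pd.DataFrame(np.random.rand(5, len(cols)) * 100, columns=cols)

idx = pd.IndexSlice
style = (output.style
         .background_gradient(cmap='Greens', subset=idx[:, idx[:, 'profit']])
         .background_gradient(cmap='Reds', subset=idx[:, idx[:, 'loss']])
         .background_gradient(cmap='Blues', subset=idx[:, idx[:, 'status']])
         .background_gradient(cmap='Purples', subset=idx[:, idx[:, 'index']]))

style.to_excel('output_products.xlsx')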

How to get data about state which is currently hovered? Plotly Choropleth - USA map

I created an interactive map of individual US states. The map contains information on electric vehicles in the US. Currently, each state is colored by the average range (in kilometers) of its vehicles.
Here is my code:
import plotly.graph_objects as go
import pandas as pd

df = pd.read_csv('https://gist.githubusercontent.com/AlbertKozera/6396b4333d1a9222193e11401069ed9a/raw/ab8733a2135bcf61999bbcac4f92e0de5fd56794/Pojazdy%2520elektryczne%2520w%2520USA.csv')
for col in df.columns:
    df[col] = df[col].astype(str)
df['range'] = pd.to_numeric(df['range'])

df_range = df.drop(columns=['state', 'brand', 'model', 'year of production', 'type']).groupby('code', as_index=False)
df_range_mean = df_range.agg({'range': 'mean'})

fig = go.Figure(data=go.Choropleth(
    locations=df['code'].drop_duplicates(keep='first').reset_index(drop=True),
    z=round(df_range_mean['range'], 2),
    locationmode='USA-states',
    colorscale='Reds',
    autocolorscale=False,
    marker_line_color='black',
))
fig.update_layout(
    geo=dict(
        scope='usa',
        projection=go.layout.geo.Projection(type='albers usa'),
        showlakes=True,  # lakes
        lakecolor='rgb(255, 255, 255)'),
)
fig.show()
It looks like this:
Here is my question:
I need to dynamically return information about the state the mouse cursor is currently over. Unfortunately, I don't know how to do it, or whether it is possible at all. I have to implement a method that will display a different image (a Chernoff face) depending on which state is currently highlighted by the user.
Can anyone tell me if there is a method that returns data about the currently highlighted state? Or will I, unfortunately, have to write my own listener?
I searched for such a method in the documentation but couldn't find it.
The locations argument you pass to go.Choropleth (here the state codes) is what appears in the hover label, so the abbreviation of the state under the cursor is already shown whenever you point at it.
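If you need to react to the hover in Python, for example to swap in a different Chernoff face per state, the usual route is a small Dash app: dcc.Graph exposes a hoverData property and a callback can read the hovered state's code from it. A minimal sketch, assuming Dash 2.x, the fig built above, and a hypothetical faces_by_state dict mapping state codes to image URLs:
# A minimal Dash sketch (not part of the original answer): read the hovered
# state's code from hoverData and show a different image for it.
from dash import Dash, dcc, html, Input, Output

faces_by_state = {'CA': '/assets/ca_face.png'}  # hypothetical mapping: state code -> image URL

app = Dash(__name__)
app.layout = html.Div([
    dcc.Graph(id='ev-map', figure=fig),   # `fig` is the choropleth built above
    html.Img(id='state-face', style={'height': '200px'}),
])

@app.callback(Output('state-face', 'src'), Input('ev-map', 'hoverData'))
def show_face(hoverData):
    if not hoverData:                                   # nothing hovered yet
        return ''
    state_code = hoverData['points'][0]['location']     # e.g. 'CA'
    return faces_by_state.get(state_code, '')

if __name__ == '__main__':
    app.run(debug=True)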
