Encoding a list column to the legend of a plot - python

Apologies in advance, I am not sure how best to word this question:
I am working with a large dataset, and I would like to plot Latitude and Longitude where the colour of the points (actually the opacity) is encoded by a 'featureType' column bound to the legend. This way I can use the legend to highlight the various features I am looking for on my map.
Here is a picture of my map and legend so far
The problem is that in my dataset, the featureType column is a list of the features that can be found at each location (e.g. arch, bridge, etc.).
How can I make a point show up for both arch and bridge? At the moment each combination becomes its own category (arch,bridge etc.), leading to over 300 combinations of about 20 different feature types.
The dataset can be found at http://atlantides.org/downloads/pleiades/dumps/pleiades-locations-latest.csv.gz
N.B: I am using altair/pandas
import altair as alt
import pandas as pd
from vega_datasets import data
df = pd.read_csv('C://path/pleiades-locations.csv')
alt.data_transformers.enable('json')
countries = alt.topo_feature(data.world_110m.url, 'countries')
selection = alt.selection_multi(fields=['featureType'], bind='legend')
brush = alt.selection(type='interval', encodings=['x'])
map = alt.Chart(countries).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project('equirectangular').properties(
    width=500,
    height=300
)
points = alt.Chart(df).mark_circle().encode(
    alt.Latitude('reprLat:Q'),
    alt.Longitude('reprLong:Q'),
    alt.Color('featureType:N'),
    tooltip=['featureType','timePeriodsKeys:N'],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.0))
).add_selection(
    selection)
(map + points)

It is not possible for Altair to generate the labels you want from your current column format. You will need to turn your comma-separated string labels into lists and then explode the column so that you get one row per item in the list:
import altair as alt
import pandas as pd
from vega_datasets import data
alt.data_transformers.enable('data_server')
df = pd.read_csv('http://atlantides.org/downloads/pleiades/dumps/pleiades-locations-latest.csv.gz')[['reprLong', 'reprLat', 'featureType']]
df['featureType'] = df['featureType'].str.split(',')
df = df.explode('featureType')
countries = alt.topo_feature(data.world_110m.url, 'countries')
world_map = alt.Chart(countries).mark_geoshape(
    fill='lightgray',
    stroke='white')
points = alt.Chart(df).mark_circle(size=10).encode(
    alt.Latitude('reprLat:Q'),
    alt.Longitude('reprLong:Q'),
    alt.Color('featureType:N', legend=alt.Legend(columns=2)))
world_map + points
Note that having this many entries in the legend is not very meaningful, since the colors repeat. Interactivity would help somewhat, but I would consider splitting this up into multiple charts. I am not sure whether it is even possible to expand the legend to show the 81 hidden entries. Also double-check that the longitude/latitude locations line up correctly with the world map projection you are using; they seemed to move around when I changed the projection.
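To see what the split-and-explode step does, here is a minimal sketch on a tiny made-up frame. The rows and values are invented for illustration. Whether the real file separates entries with a bare comma or with a comma plus a space is something to check, which is why the sketch also strips whitespace:

```python
import pandas as pd

# Toy stand-in for the Pleiades data; the rows are invented for illustration.
df = pd.DataFrame({
    "reprLat": [41.9, 37.97],
    "reprLong": [12.5, 23.72],
    "featureType": ["arch, bridge", "temple"],
})

# Turn each comma-separated string into a list, then give every list item its own row.
df["featureType"] = df["featureType"].str.split(",")
exploded = df.explode("featureType").reset_index(drop=True)

# Strip stray whitespace so that " bridge" and "bridge" end up in the same category.
exploded["featureType"] = exploded["featureType"].str.strip()

print(exploded)
```

The first location now appears twice, once as arch and once as bridge, so a legend selection on either value would highlight it.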

Related

Plotly Express Choropleth for Country Regions

I have a dataframe created from a CSV file about the spread of Covid-19 across the Italian regions. I am trying to create a px.choropleth plot showing the Total Positive values for every region in Italy.
This is the code I tried:
italy_regions=[i for i in region['Region'].unique()]
fig = px.choropleth(italy_last, locations="Country",
                    locationmode=italy_regions,
                    color=np.log(italy_last["TotalPositive"]),
                    hover_name="Region", hover_data=['TotalPositive'],
                    color_continuous_scale="Sunsetdark",
                    title='Regions with Positive Cases')
fig.update(layout_coloraxis_showscale=False)
fig.show()
Some more information: 'Country' is a column in my dataframe that contains the same value, 'Italy', in every row. If I only pass locations="Country", the graph is fine and I can see Italy coloured on the world map.
The problems start when I try to make Plotly colour my regions. As I'm a newbie to Plotly Express, I read some examples and thought I had to create a list of Italian region names and pass it to choropleth as the 'locationmode' argument.
Clearly I'm wrong.
So, what is the procedure to make this work (if there is one)?
If needed, I can provide both the CSV file and the Jupyter notebook I'm working on.
You need to provide a GeoJSON object with the Italian region borders as the geojson parameter of plotly.express.choropleth, for instance this one:
https://gist.githubusercontent.com/datajournalism-it/48e29e7c87dca7eb1d29/raw/2636aeef92ba0770a073424853f37690064eb0ea/regioni.geojson
If you use this one, you need to explicitly pass featureidkey='properties.NOME_REG' as a parameter of plotly.express.choropleth.
Working example:
import pandas as pd
import requests
import plotly.express as px
regions = ['Piemonte', 'Trentino-Alto Adige', 'Lombardia', 'Puglia', 'Basilicata',
           'Friuli Venezia Giulia', 'Liguria', "Valle d'Aosta", 'Emilia-Romagna',
           'Molise', 'Lazio', 'Veneto', 'Sardegna', 'Sicilia', 'Abruzzo',
           'Calabria', 'Toscana', 'Umbria', 'Campania', 'Marche']
# Create a dataframe with the region names
df = pd.DataFrame(regions, columns=['NOME_REG'])
# For demonstration, create a column with the length of the region's name
df['name_length'] = df['NOME_REG'].str.len()
# Read the geojson data with Italy's regional borders from github
repo_url = 'https://gist.githubusercontent.com/datajournalism-it/48e29e7c87dca7eb1d29/raw/2636aeef92ba0770a073424853f37690064eb0ea/regioni.geojson'
italy_regions_geo = requests.get(repo_url).json()
# Choropleth representing the length of region names
fig = px.choropleth(data_frame=df,
                    geojson=italy_regions_geo,
                    locations='NOME_REG', # name of dataframe column
                    featureidkey='properties.NOME_REG', # path to field in GeoJSON feature object with which to match the values passed in to locations
                    color='name_length',
                    color_continuous_scale="Magma",
                    scope="europe",
                    )
fig.update_geos(showcountries=False, showcoastlines=False, showland=False, fitbounds="locations")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
Output image
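Since the match between locations and featureidkey is by exact string, a quick sanity check before plotting can save some head-scratching when a region stays blank. The sketch below uses a heavily trimmed, hypothetical GeoJSON with the same NOME_REG property. The helper name unmatched_locations and the sample names are made up for illustration:

```python
# Hypothetical, heavily trimmed GeoJSON with the same layout as the file above;
# real features also carry a "geometry" member, omitted here.
italy_regions_geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"NOME_REG": "Piemonte"}},
        {"type": "Feature", "properties": {"NOME_REG": "Lombardia"}},
        {"type": "Feature", "properties": {"NOME_REG": "Valle d'Aosta"}},
    ],
}

# Region names as they might appear in the dataframe column passed to `locations`.
df_regions = ["Piemonte", "Lombardia", "Valle d'Aosta/Vallée d'Aoste"]


def unmatched_locations(names, geojson, prop="NOME_REG"):
    """Return the names that have no feature with a matching properties[prop]."""
    known = {feature["properties"][prop] for feature in geojson["features"]}
    return [name for name in names if name not in known]


print(unmatched_locations(df_regions, italy_regions_geo))
```

Any name the helper returns would need to be renamed in the dataframe (or in the GeoJSON) before it can be matched to a shape.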

Pandas plot several df with different variables on the same barplot

I have 4 dataframes for different locations (Indonesia, Singapore, Malaysia and Total), each containing the percentage of the top 5 revenue-generating products. I have plotted them separately.
I want to combine them in one plot where the x-axis shows the different locations and the top revenue-generating products for each location.
I have printed the dataframes, and as you can see they contain different products.
print(Ind_top_cat, Sin_top_cat, Mal_top_cat, Tot_top_cat)
Category Amt
M020P 0.144131
MH 0.099439
ML 0.055052
PB 0.050057
PPDR 0.048315
Category Amt
ML 0.480781
M015 0.073034
PPDR 0.035412
M025 0.033418
M020 0.031836
Category Amt
TN 0.343650
PPDR 0.190773
NMCN 0.118425
M015 0.047539
NN 0.038140
Category Amt
M020P 0.158575
MH 0.092012
ML 0.064179
PPDR 0.050803
PB 0.044301
Thanks to joelostblom I was able to construct a plot, however, there are still some issues.
all_countries = pd.concat([Ind_top_cat, Sin_top_cat, Mal_top_cat, Tot_top_cat])
all_countries['Category'] = all_countries.index
sns.barplot(x='Country', y='Amt',hue = 'Category',data=all_countries)
Is there any way I can put the legend values on the x-axis (there is no need to colour the categories; I want to colour the countries instead) and put the data values on top of the bars? Also, the bars are not centred, and I have no idea how to fix that.
You could create a new column in each dataframe with the country name, e.g.
Ind_top_cat['Country'] = 'Indonesia'
Sin_top_cat['Country'] = 'Singapore'
Then you can create one big dataframe by concatenating the country dataframes together:
all_countries = pd.concat([Ind_top_cat, Sin_top_cat])
And finally, you can use a high level plotting library such as seaborn to assign one column to the x-axis location and one to the color of the bars:
import seaborn as sns
sns.barplot(x='Country', y='Amt', hue='Category', data=all_countries)
You can scroll down to the second example on this page to get an idea what such a plot would look like (also pasted below):
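Put together, the steps might look like the sketch below. It assumes that Category is the index of each frame (as the printed output in the question suggests) and copies only the first three rows of two of the frames. The helper plot_top_categories is a made-up name and is only defined, not called, so the data preparation runs without seaborn:

```python
import pandas as pd

# Top categories per country, copied from the printed frames in the question
# (only the first three rows of each, to keep the sketch short).
Ind_top_cat = pd.DataFrame({"Amt": [0.144131, 0.099439, 0.055052]},
                           index=pd.Index(["M020P", "MH", "ML"], name="Category"))
Sin_top_cat = pd.DataFrame({"Amt": [0.480781, 0.073034, 0.035412]},
                           index=pd.Index(["ML", "M015", "PPDR"], name="Category"))

# Tag every frame with its country, then stack them into one long table.
all_countries = pd.concat([
    Ind_top_cat.assign(Country="Indonesia"),
    Sin_top_cat.assign(Country="Singapore"),
]).reset_index()  # turn the Category index into an ordinary column

print(all_countries)


def plot_top_categories(data):
    """Grouped bar chart: one cluster per country, one bar colour per category."""
    import seaborn as sns  # imported here so the data prep runs without seaborn
    return sns.barplot(x="Country", y="Amt", hue="Category", data=data)
```

Calling plot_top_categories(all_countries) should then draw one group of bars per country.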

Which parts of my dataframe are being plotted?

The goal is to plot the data frame I'm working with on a single chart, with a line for each value of init_population where the y-axis is count and x-axis is tick_number.
I've figured out how to use groupby() and plot() together to make this:
As you can see, all the lines are there nicely, but I'm pretty confident that the blue at the top that doesn't follow the relationship the other lines are following is actually a different column of data.
So that this is reproducible, the data is available here.
import pandas as pd
max_runs_data = pd.read_csv('clean_table.csv')
del max_runs_data['visualization']
max_runs_data.columns = ['run_number','init_population', 'tick', 'turtle_count']
max_runs_data.set_index('tick', inplace = True)
test_plot_1 = max_runs_data.groupby('init_population')['turtle_count'].plot()
test_plot_2 = max_runs_data.groupby('init_population').plot(y='turtle_count')
test_plot_1 is the linked image, test_plot_2 is a separate plot for each group.
Is it obvious how to specify the columns for x and y without losing the grouping on a single chart?
Thanks
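One possible approach, offered as a sketch rather than something tested on the linked file: reshape the table so that each init_population becomes its own column, and then plot the resulting frame once. The data below is invented, and averaging repeated runs with aggfunc="mean" is an assumption about what should happen when several runs share an init_population. The helper plot_counts is a made-up name and is only defined, not called:

```python
import pandas as pd

# Invented stand-in for clean_table.csv: two runs per starting population.
max_runs_data = pd.DataFrame({
    "run_number":      [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "init_population": [10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20],
    "tick":            [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
    "turtle_count":    [10, 12, 15, 10, 14, 17, 20, 26, 30, 20, 24, 28],
})

# One row per tick, one column per init_population; repeated runs are averaged.
wide = max_runs_data.pivot_table(index="tick", columns="init_population",
                                 values="turtle_count", aggfunc="mean")
print(wide)


def plot_counts(table):
    """Draw every init_population column as its own line on one set of axes."""
    ax = table.plot()  # needs matplotlib; x is the tick index, y the counts
    ax.set_xlabel("tick")
    ax.set_ylabel("turtle_count")
    return ax
```

Because the reshaped frame contains only turtle_count values, no other column can end up on the chart, which makes it easier to tell where an unexpected line comes from.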

Setting col_colors in seaborn clustermap from pandas

I have a clustermap generated from a pandas dataframe. Two of the columns are used to generate the clustermap and I need to use a 3rd column to generate a col_colors bar using sns.palplot(sns.light_palette('red')) palette (values will be from 0 - 1, light - dark colors).
The pseudo-code looks something like this:
df=pd.DataFrame(input, columns = ['Source', 'amplicon', 'coverage', 'GC'])
tiles = df.pivot("Source", "amplicon", "coverage")
col_colors = [values from df['GC']]
sns.clustermap(tiles, vmin=0, vmax=2, col_colors=col_colors)
I'm struggling to find details on how to set up col_colors so that the correct values are linked to the appropriate tiles. Some direction would be greatly appreciated.
This will be much easier to explain with sample data. I don't know what your data looks like, but say you had a bunch of GC content measurements, for instance:
import seaborn as sns
import numpy as np
import pandas as pd
data = {'16S':np.random.normal(.52, 0.05, 12),
        'ITS':np.random.normal(.52, 0.05, 12),
        'Source':np.random.choice(['soil', 'water', 'air'], 12, replace=True)}
df=pd.DataFrame(data)
df[:3]
        16S       ITS Source
0  0.493087  0.460066    air
1  0.607229  0.592945  water
2  0.577155  0.440726  water
So the data is GC content, and there is a column describing the source. Say we want to plot a cluster map of the GC content where we use the Source column to define the network:
#create a color palette with the same number of colors as unique values in the Source column
network_pal = sns.light_palette('red', len(df.Source.unique()))
#Create a dictionary where the key is the category and the values are the
#colors from the palette we just created
network_lut = dict(zip(df.Source.unique(), network_pal))
#get the series of all of the categories
networks = df.Source
#map the colors to the series. Now we have a list of colors the same
#length as our dataframe, where unique values are mapped to the same color
network_colors = pd.Series(networks).map(network_lut)
#plot the heatmap with the 16S and ITS categories with the network colors
#defined by Source column
sns.clustermap(df[['16S', 'ITS']], row_colors=network_colors, cmap='BuGn_r')
Basically, most of the above code creates a vector of colors that correspond to the Source column of the data frame. You could of course create this manually, where the first color in the list would be mapped to the first row in the dataframe, the second color to the second row, and so on (this order will change when you plot it), but that would be a lot of work. I used a red color palette since that is what you mentioned in your question, though I might recommend using a different palette. I colored by rows, but you can do the same thing for columns. Hope this helps!
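For the column case with a continuous value such as GC, a rough sketch could look like the following. The data is invented, and light_to_red is a hand-rolled white-to-red blend standing in for a seaborn palette, so treat it as an assumption rather than the palette from the question. The key point is that the colour Series is indexed by amplicon, because seaborn matches a pandas Series passed as col_colors to the data by its index. The helper plot_tiles is a made-up name and is only defined, not called:

```python
import pandas as pd

# Invented long-format data in the shape described in the question.
df = pd.DataFrame({
    "Source":   ["s1", "s1", "s1", "s2", "s2", "s2"],
    "amplicon": ["a1", "a2", "a3", "a1", "a2", "a3"],
    "coverage": [0.5, 1.2, 1.9, 0.7, 1.0, 1.6],
    "GC":       [0.0, 0.5, 1.0, 0.0, 0.5, 1.0],
})
tiles = df.pivot(index="Source", columns="amplicon", values="coverage")


def light_to_red(value):
    """Blend from white (0) to pure red (1); a hand-rolled stand-in for a palette."""
    return (1.0, 1.0 - value, 1.0 - value)


# One GC value per amplicon, indexed by amplicon so it lines up with tiles.columns.
gc_per_amplicon = df.drop_duplicates("amplicon").set_index("amplicon")["GC"]
col_colors = gc_per_amplicon.map(light_to_red)
print(col_colors)


def plot_tiles(tiles, col_colors):
    import seaborn as sns  # imported here so the colour mapping runs without seaborn
    return sns.clustermap(tiles, vmin=0, vmax=2, col_colors=col_colors)
```

This assumes each amplicon has a single GC value; if GC varied per row, it would need to be aggregated first.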

class labels in Pandas scattermatrix

This question has been asked before, Multiple data in scatter matrix, but didn't receive an answer.
I'd like to make a scatter matrix, something like in the pandas docs, but with differently colored markers for different classes. For example, I'd like some points to appear in green and others in blue depending on the value of one of the columns (or a separate list).
Here's an example using the Iris dataset. The color of the points represents the species of Iris -- Setosa, Versicolor, or Virginica.
Does pandas (or matplotlib) have a way to make a chart like that?
Update: This functionality is now in the latest version of Seaborn. Here's an example.
The following was my stopgap measure:
def factor_scatter_matrix(df, factor, palette=None):
    '''Create a scatter matrix of the variables in df, with differently colored
    points depending on the value of df[factor].
    inputs:
        df: pandas.DataFrame containing the columns to be plotted, as well
            as factor.
        factor: string or pandas.Series. The column indicating which group
            each row belongs to.
        palette: A list of hex codes, at least as long as the number of groups.
            If omitted, a predefined palette will be used, but it only includes
            9 groups.
    '''
    import matplotlib.colors
    import numpy as np
    from pandas.plotting import scatter_matrix
    from scipy.stats import gaussian_kde
    if isinstance(factor, str):
        factor_name = factor  # save off the name
        factor = df[factor]  # extract column
        df = df.drop(factor_name, axis=1)  # remove from df, so it
        # doesn't get a row and col in the plot.
    classes = list(set(factor))
    if palette is None:
        palette = ['#e41a1c', '#377eb8', '#4eae4b',
                   '#994fa1', '#ff8101', '#fdfc33',
                   '#a8572c', '#f482be', '#999999']
    color_map = dict(zip(classes, palette))
    if len(classes) > len(palette):
        raise ValueError('''Too many groups for the number of colors provided.
We only have {} colors in the palette, but you have {}
groups.'''.format(len(palette), len(classes)))
    colors = factor.apply(lambda group: color_map[group])
    axarr = scatter_matrix(df, figsize=(10, 10), marker='o', c=colors, diagonal=None)
    # Draw one density curve per group on each diagonal panel
    for rc in range(len(df.columns)):
        for group in classes:
            y = df[factor == group].iloc[:, rc].values
            gkde = gaussian_kde(y)
            ind = np.linspace(y.min(), y.max(), 1000)
            axarr[rc][rc].plot(ind, gkde.evaluate(ind), c=color_map[group])
    return axarr, color_map
As an example, we'll use the same dataset as in the question, available here
>>> import pandas as pd
>>> iris = pd.read_csv('iris.csv')
>>> axarr, color_map = factor_scatter_matrix(iris,'Name')
>>> color_map
{'Iris-setosa': '#377eb8',
'Iris-versicolor': '#4eae4b',
'Iris-virginica': '#e41a1c'}
Hope this is helpful!
You can also call scatter_matrix from pandas as follows:
pd.plotting.scatter_matrix(df, color=colors)
with colors being a list of length len(df) containing one color per row.
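A minimal sketch of how such a colors list could be built from a class column. The rows are invented, the column names follow the Iris file used above, and plot_matrix is a made-up helper that is only defined here, not called:

```python
import pandas as pd

# A few invented rows in the layout of the Iris file used above.
iris = pd.DataFrame({
    "SepalLength": [5.1, 7.0, 6.3, 4.9],
    "PetalLength": [1.4, 4.7, 6.0, 1.4],
    "Name": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
})

# Fixed class-to-colour lookup, then one colour per row in row order.
color_map = {"Iris-setosa": "#377eb8",
             "Iris-versicolor": "#4eae4b",
             "Iris-virginica": "#e41a1c"}
colors = iris["Name"].map(color_map).tolist()
print(colors)


def plot_matrix(frame, colors):
    """Scatter matrix of the numeric columns, coloured row by row."""
    from pandas.plotting import scatter_matrix  # needs matplotlib installed
    return scatter_matrix(frame.drop(columns="Name"), color=colors)
```

Using an explicit dictionary rather than a set keeps the class-to-colour assignment the same from run to run.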
