xarray discrete scatter plot: specifying legend/colour order - python

Plotting a discrete xarray DataArray variable in a Dataset with xr.plot.scatter() yields a legend in which the discrete values are ordered arbitrarily, corresponding to unpredictable colour assignment to each level. Would it be possible to specify a specific colour or position for a given discrete value?
A simple reproducible example:
import xarray as xr
# get a predefined dataset
uvz = xr.tutorial.open_dataset("eraint_uvz")
# select a 2-D subset of the data
uvzr = uvz.isel(level=0, month=0, latitude=slice(150, 242),
                longitude=slice(240, 300))
# define a discrete variable based on levels of a continuous variable
uvzr['zone'] = 'A'
uvzr['zone'] = uvzr.zone.where(uvzr.u > 30, other='C')
uvzr['zone'] = uvzr.zone.where(uvzr.u > 10, other='B')
# do the plot
xr.plot.scatter(uvzr, x='longitude', y='latitude', hue='zone')
Is there a way to ensure that the legend entries are arranged 'A', 'B', 'C' from top to bottom, say? Or ensure that A is assigned to blue, and B to orange, for example?
I know I can reset the values of the matplotlib color cycler, but for that to be useful I first need to know which order the discrete values will be plotted in.
I'm using xarray v2022.3.0 on Python 3.8.6. With an earlier version of xarray (I think 0.16) the levels were arranged alphabetically.

I found an ugly workaround using xarray.Dataset.stack and xr.where(..., drop=True), in case anyone else is stuck with a similar problem.
import numpy as np # for unique, to cycle through values
import matplotlib.pyplot as plt # to get a legend
# instead of np.unique you could pass an iterable of your choice
# specifying the order
for value in np.unique(uvzr.zone):
    # convert to a 1-D dataset with a co-ordinate including all
    # unique combinations of latitude-longitude values
    uvzr_stacked = uvzr.stack({'location':('longitude', 'latitude')})
    # now select only those grid points in zone value
    uvzr_stacked = uvzr_stacked.where(uvzr_stacked.zone == value,
                                      drop=True)
    # the plotting function can't see the original dims any more;
    # a new name is required, however
    uvzr_stacked['lat'] = uvzr_stacked.latitude
    uvzr_stacked['lon'] = uvzr_stacked.longitude
    # plot!
    xr.plot.scatter(uvzr_stacked, x='lon', y='lat', hue='zone',
                    add_guide=False)
plt.legend(title='zone')
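Another way to get a fixed legend order and fixed colours is to bypass the xarray plotting wrapper and call matplotlib once per zone, in the order you choose. The following is only a minimal sketch under stated assumptions: the data are synthetic stand-ins, and `lon`, `lat`, `u` and `zone_colours` are hypothetical names. With the real dataset, the coordinate and zone arrays would first have to be extracted as flat NumPy arrays.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the question's data (hypothetical values).
rng = np.random.default_rng(0)
lon = rng.uniform(0, 45, 300)
lat = rng.uniform(-20, 50, 300)
u = rng.uniform(0, 40, 300)

# Same zoning rule as in the question: A if u > 30, C if 10 < u <= 30, else B.
zone = np.where(u > 30, "A", np.where(u > 10, "C", "B"))

# The dict fixes both the plotting order and the colour of each zone
# (dicts preserve insertion order in Python 3.7+).
zone_colours = {"A": "tab:blue", "B": "tab:orange", "C": "tab:green"}

fig, ax = plt.subplots()
for name, colour in zone_colours.items():
    mask = zone == name
    ax.scatter(lon[mask], lat[mask], color=colour, label=name, s=10)

legend = ax.legend(title="zone")
legend_labels = [text.get_text() for text in legend.get_texts()]
fig.savefig("zones.png")
```

Because each zone is drawn by its own `ax.scatter` call, the legend entries appear in the order of the dictionary keys, independent of how the values happen to be ordered in the data.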

Related

How to select a range of NumPy values for bar chart

I created a bar chart using Matplotlib from the count of unique strings in a NumPy array. Now I would like to display only the top 10 most frequent species in the bar chart. I am new to Python so I am having trouble figuring it out. This is also my first question here, so let me know if I'm missing any important information
test_indices = numpy.where((obj.year == 2014) & (obj.native == "Native"))
SpeciesList2014 = numpy.append(SpeciesList2014, obj.species_code[test_indices])
labels, counts = numpy.unique(SpeciesList2014, return_counts=True)
indexSort = numpy.argsort(counts)
plt.bar(labels[indexSort][::-1], counts[indexSort][::-1], align='center')
plt.xticks(rotation=45)
plt.show()
You already have the values in a sorted array but you only want to select the ten values with the most counts.
Your array is sorted with the larger counts last, so you can use NumPy slicing:
plt.bar(labels[indexSort][-1:-11:-1], counts[indexSort][-1:-11:-1], align='center')
where [a:b:c] means a = start index, b = end index (exclusive), c = step, and negative indices count from the end of the array.
Or alternatively:
n = counts.shape[0]
plt.bar(labels[indexSort][n-10:], counts[indexSort][n-10:], align='center')
which plots in increasing order.
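The slicing forms used in this answer can be checked on a small array. This is just an illustrative sketch with made-up counts, taking the top 3 rather than the top 10 to keep it short; `order` is a hypothetical name playing the role of `indexSort`.

```python
import numpy as np

# Made-up counts, for illustration only.
counts = np.array([3, 9, 1, 7, 5, 8, 2, 6, 4, 10, 12, 11])

# Indices that sort counts in ascending order (the role of indexSort).
order = np.argsort(counts)

# Last three elements of the sorted array, still in ascending order.
top3_ascending = counts[order][-3:]

# Start at the last element, stop before the fourth-from-last, step backwards.
top3_descending = counts[order][-1:-4:-1]

# Equivalent alternative: reverse everything, then take the first three.
top3_reversed = counts[order][::-1][:3]
```

For the top 10, replace `-3` with `-10` and `-4` with `-11`.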
Do yourself a favor and learn about NumPy indexing.
In this simple case, the last 10 elements of an array are selected with the notation [-10:], which you can read as "from the tenth-from-last element to the end".
import numpy as np
import matplotlib.pyplot as plt
# synthetic data
np.random.seed(20210428)
SpeciesList2014 = np.random.randint(0, 100, 2000)
# this is from your code
species, counts = np.unique(SpeciesList2014, return_counts=True)
topindices = np.argsort(counts)[-10:]
# here you probably can have, simply, topspecies = species[topindices]
topspecies = [repr(label) for label in species[topindices]]
topcounts = counts[topindices]
# plotting
plt.bar(topspecies, topcounts)
plt.show()

How to create a geographical heatmap passing custom radii

I want to create a visualization on a map using folium. In the map I want to observe how many items are related to a particular geographical point building a heatmap. Below is the code I'm using.
import pandas as pd
import folium
from folium import plugins
data = [[41.895278,12.482222,2873494.0,20.243001,20414,7.104243],
[41.883850,12.333330,3916.0,0.835251,4,1.021450],
[41.854241,12.567000,22263.0,1.132390,35,1.572115],
[41.902147,12.590388,19505.0,0.839181,37,1.896950],
[41.994240,12.48520,16239.0,1.383981,25,1.539504]]
df = pd.DataFrame(columns=['latitude','longitude','population','radius','count','normalized'],data=data)
middle_lat = df['latitude'].median()
middle_lon = df['longitude'].median()
m = folium.Map(location=[middle_lat, middle_lon],tiles = "Stamen Terrain",zoom_start=11)
# convert to an (n, 3) nd-array of (latitude, longitude, weight) for the heatmap
points = df[['latitude', 'longitude', 'normalized']].dropna().values
# plot heatmap
plugins.HeatMap(points, radius=15).add_to(m)
m.save(outfile='map.html')
Here is the result.
In this map, each point has the same radius. Instead, I want to create a heatmap in which the radius of each point is proportional to the radius of the city it belongs to. I already tried passing the radii as a list, but that does not work, and neither does passing the values in a for loop.
Any idea?
You need to add the points one at a time, so that you can specify the radius for each point. Like this:
import numpy
# split the (n, 3) array into n arrays of shape (1, 3), one per point
pointArrays = numpy.split(points, len(points))
radii = [5, 10, 15, 20, 25]
# add one HeatMap layer per point, each with its own radius
for point, radius in zip(pointArrays, radii):
    plugins.HeatMap(point, radius=radius).add_to(m)
m.save(outfile='map.html')
This way, each point has a different size.
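The hard-coded `radii` list in this answer could instead be derived from the `radius` column of the question's DataFrame. The following is a rough sketch, not part of the original answer: `scale_radii`, `min_px` and `max_px` are hypothetical names, and linear rescaling to a pixel range is only one possible choice.

```python
import numpy as np

def scale_radii(values, min_px=5, max_px=25):
    """Linearly rescale values to integer pixel radii in [min_px, max_px]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if lo == hi:
        # All values identical: fall back to the middle of the pixel range.
        return [int(round((min_px + max_px) / 2))] * len(values)
    scaled = np.interp(values, (lo, hi), (min_px, max_px))
    return np.rint(scaled).astype(int).tolist()

# The 'radius' column from the question's data.
city_radii = [20.243001, 0.835251, 1.132390, 0.839181, 1.383981]
radii = scale_radii(city_radii)
```

The resulting `radii` list can then be zipped with `pointArrays` exactly as in the loop above. With this data one city is far larger than the others, so a linear scale leaves the small cities at nearly the same size; a square-root or logarithmic scale may separate them better.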

Creating a faceted matplotlib/seaborn plot using indicator variables rather than a single column

Seaborn is great for creating faceted plots based on a categorical variable encoding the class of each facet. However, this assumes your categories are mutually exclusive. Is it possible to create a Seaborn FacetGrid (or similar) based on a set of indicator variables?
As a concrete example, think about comparing patients that are infected with one or more viruses, and plotting an attribute of interest by virus. It's possible that a patient carries more than one virus, so creating a virus column to create a grid on is not possible. You can, however, create a set of indicator variables (one for each virus) that flags the virus for each patient. There does not seem to be a way of passing a set of indicator variables to any of the Seaborn functions to do this.
I can't imagine I'm the first person to come across this scenario, so I'm hoping there are suggestions for how to do this without coding it by hand in Matplotlib.
I don't see how to do it with FacetGrid, possibly because this isn't facetting the data, since a data-record might appear several times or only once in the plot. One of the standard tricks with a set of bitfields is to read them as binary, so you see each combination of the bits. That's unambiguous but gets messy:
import pandas as pd
import seaborn as sns
from numpy.random import random, randint
from numpy import concatenate
import matplotlib.pyplot as plt
# Dummy data
vdata = pd.DataFrame(concatenate((randint(2, size=(32,4)), random(size=(32,2))), axis=1))
vdata.columns=['Species','v1','v2','v3','x','y']
# Making a binary number out of the "virusX?" fields
binary_v = vdata.v1 + vdata.v2*2 + vdata.v3*4
vdata = pd.concat((vdata, binary_v), axis=1)
vdata.columns=['Species','v1','v2','v3','x','y','binary_v']
# Plotting group membership by row
#g = sns.FacetGrid(vdata, col="Species", row='binary_v')
#g.map(plt.scatter, "x", "y")
#g.add_legend()
#plt.savefig('multiple_facet_binary_row') # Unreadably big.
h = sns.FacetGrid(vdata, col="Species", hue="binary_v")
h.map(plt.scatter, "x","y")
h.add_legend()
plt.savefig('multiple_facet_binary_hue')
If you have too many indicators to deal with the combinatorial explosion, explicitly making the new subsets works:
# Nope, need to pull out subsets:
# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
bdata = vdata[vdata.v1 + vdata.v2 + vdata.v3 == 0.].copy()
assert(len(bdata) > 0) # ... catch...
bdata['Virus'] = 'none'
for i in ['v1','v2','v3']:
    on = vdata[vdata[i] == 1.].copy()
    on['Virus'] = i
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    bdata = pd.concat([bdata, on])
j = sns.FacetGrid(bdata, col='Species', row='Virus')
j.map(plt.scatter, 'x', 'y')
j.add_legend()
j.savefig('multiple_facet_refish')
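The subset-building loop can also be written as a reshape from wide to long format with `DataFrame.melt`, so that a record appears once per virus it carries. This is a sketch of that alternative rather than the answer's own method; the data are random dummies and `long_data` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Dummy data with the same columns as in the answer.
rng = np.random.default_rng(1)
n = 32
vdata = pd.DataFrame({
    "Species": rng.integers(0, 2, n),
    "v1": rng.integers(0, 2, n),
    "v2": rng.integers(0, 2, n),
    "v3": rng.integers(0, 2, n),
    "x": rng.random(n),
    "y": rng.random(n),
})

# One row per (record, virus) pair, with the indicator value alongside.
long_data = vdata.melt(
    id_vars=["Species", "x", "y"],
    value_vars=["v1", "v2", "v3"],
    var_name="Virus",
    value_name="infected",
)

# Keep only the pairs where the indicator is set.
long_data = long_data[long_data["infected"] == 1].drop(columns="infected")
```

`long_data` now has a single categorical `Virus` column, so it can be passed to `sns.FacetGrid(long_data, col='Species', row='Virus')` as usual. Unlike the loop above, this sketch drops records with no virus at all; they would need to be added back as a separate 'none' group if wanted.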

Setting col_colors in seaborn clustermap from pandas

I have a clustermap generated from a pandas dataframe. Two of the columns are used to generate the clustermap and I need to use a 3rd column to generate a col_colors bar using sns.palplot(sns.light_palette('red')) palette (values will be from 0 - 1, light - dark colors).
The pseudo-code looks something like this:
df=pd.DataFrame(input, columns = ['Source', 'amplicon', 'coverage', 'GC'])
tiles = df.pivot("Source", "amplicon", "coverage")
col_colors = [values from df['GC']]
sns.clustermap(tiles, vmin=0, vmax=2, col_colors=col_colors)
I'm battling to find details on how to setup the col_colors so the correct values are linked to the appropriate tiles. Some direction would be greatly appreciated.
This example will be much easier to explain with sample data. I don't know what your data looks like, but say you had a bunch of GC content measurements. For instance:
import seaborn as sns
import numpy as np
import pandas as pd
data = {'16S':np.random.normal(.52, 0.05, 12),
'ITS':np.random.normal(.52, 0.05, 12),
'Source':np.random.choice(['soil', 'water', 'air'], 12, replace=True)}
df=pd.DataFrame(data)
df[:3]
        16S       ITS Source
0  0.493087  0.460066    air
1  0.607229  0.592945  water
2  0.577155  0.440726  water
So the data is GC content, and there is a column describing the source. Say we want to plot a cluster map of the GC content where we use the Source column to define the network.
#create a color palette with the same number of colors as unique values in the Source column
network_pal = sns.light_palette('red', len(df.Source.unique()))
#Create a dictionary where the key is the category and the values are the
#colors from the palette we just created
network_lut = dict(zip(df.Source.unique(), network_pal))
#get the series of all of the categories
networks = df.Source
#map the colors to the series. Now we have a list of colors the same
#length as our dataframe, where unique values are mapped to the same color
network_colors = pd.Series(networks).map(network_lut)
#plot the heatmap with the 16S and ITS categories with the network colors
#defined by Source column
sns.clustermap(df[['16S', 'ITS']], row_colors=network_colors, cmap='BuGn_r')
Basically, most of the above code creates a vector of colors that correspond to the Source column of the data frame. You could of course create this manually, where the first color in the list would be mapped to the first row in the dataframe, the second color to the second row, and so on (this order will change when you plot it), but that would be a lot of work. I used a red color palette as that is what you mentioned in your question, though I might recommend using a different palette. I colored by rows, but you can do the same thing for columns. Hope this helps!
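The question asked for a color bar driven by continuous values between 0 and 1 rather than by categories. One way this might be done is to push the values through a matplotlib colormap and keep the result as a pandas Series indexed by column label. The sketch below rests on assumptions: the amplicon names and GC values are invented, and the "Reds" colormap stands in for the light red palette mentioned in the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import Normalize, to_hex

# Hypothetical GC fraction per amplicon, indexed by amplicon name.
gc = pd.Series({"amp1": 0.0, "amp2": 0.5, "amp3": 1.0}, name="GC")

cmap = plt.get_cmap("Reds")            # light-to-dark red colormap
norm = Normalize(vmin=0.0, vmax=1.0)   # values are expected in [0, 1]

# One hex color per amplicon; the index is kept so that colors stay
# attached to their amplicon labels.
col_colors = gc.map(lambda v: to_hex(cmap(float(norm(v)))))
```

A Series like this could then be passed as `col_colors=col_colors` to `sns.clustermap`, provided its index matches the column labels of the pivoted table.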

class labels in Pandas scattermatrix

This question has been asked before, Multiple data in scatter matrix, but didn't receive an answer.
I'd like to make a scatter matrix, something like in the pandas docs, but with differently colored markers for different classes. For example, I'd like some points to appear in green and others in blue depending on the value of one of the columns (or a separate list).
Here's an example using the Iris dataset. The color of the points represents the species of Iris -- Setosa, Versicolor, or Virginica.
Does pandas (or matplotlib) have a way to make a chart like that?
Update: This functionality is now in the latest version of Seaborn. Here's an example.
The following was my stopgap measure:
def factor_scatter_matrix(df, factor, palette=None):
    '''Create a scatter matrix of the variables in df, with differently colored
    points depending on the value of df[factor].
    inputs:
        df: pandas.DataFrame containing the columns to be plotted, as well
            as factor.
        factor: string or pandas.Series. The column indicating which group
            each row belongs to.
        palette: A list of hex codes, at least as long as the number of groups.
            If omitted, a predefined palette will be used, but it only includes
            9 groups.
    '''
    import matplotlib.colors
    import numpy as np
    # scatter_matrix lived in pandas.tools.plotting in old pandas versions
    from pandas.plotting import scatter_matrix
    from scipy.stats import gaussian_kde
    if isinstance(factor, str):  # basestring in Python 2
        factor_name = factor  # save off the name
        factor = df[factor]  # extract column
        df = df.drop(factor_name, axis=1)  # remove from df, so it
        # doesn't get a row and col in the plot.
    classes = list(set(factor))
    if palette is None:
        palette = ['#e41a1c', '#377eb8', '#4eae4b',
                   '#994fa1', '#ff8101', '#fdfc33',
                   '#a8572c', '#f482be', '#999999']
    color_map = dict(zip(classes, palette))
    if len(classes) > len(palette):
        raise ValueError('''Too many groups for the number of colors provided.
We only have {} colors in the palette, but you have {}
groups.'''.format(len(palette), len(classes)))
    colors = factor.apply(lambda group: color_map[group])
    axarr = scatter_matrix(df, figsize=(10,10), marker='o', c=colors, diagonal=None)
    # draw one density curve per group on each diagonal panel
    for rc in range(len(df.columns)):  # xrange in Python 2
        for group in classes:
            y = df[factor == group].iloc[:, rc].values  # .icol(rc) in old pandas
            gkde = gaussian_kde(y)
            ind = np.linspace(y.min(), y.max(), 1000)
            axarr[rc][rc].plot(ind, gkde.evaluate(ind), c=color_map[group])
    return axarr, color_map
As an example, we'll use the same dataset as in the question, available here
>>> import pandas as pd
>>> iris = pd.read_csv('iris.csv')
>>> axarr, color_map = factor_scatter_matrix(iris,'Name')
>>> color_map
{'Iris-setosa': '#377eb8',
'Iris-versicolor': '#4eae4b',
'Iris-virginica': '#e41a1c'}
Hope this is helpful!
You can also call the scatter matrix from pandas directly, as follows (in current pandas the function lives in pd.plotting):
pd.plotting.scatter_matrix(df, color=colors)
with colors being a list of length len(df) containing one color per row.
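How the `colors` list gets built is left open above. One possible sketch maps each class label to a palette entry after sorting the labels, so that the class-to-color assignment is the same on every run. The small table is invented for illustration and is not the real Iris data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd

# Tiny made-up table standing in for the Iris data (hypothetical values).
df = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 5.8, 7.0, 6.4],
    "width": [3.5, 3.0, 3.3, 2.7, 3.2, 3.2],
    "Name": ["setosa", "setosa", "virginica",
             "virginica", "versicolor", "versicolor"],
})

palette = ["#e41a1c", "#377eb8", "#4eae4b"]

# Sorting the labels makes the class-to-color assignment deterministic.
classes = sorted(df["Name"].unique())
color_map = dict(zip(classes, palette))

# One color per row, in row order.
colors = df["Name"].map(color_map).tolist()

# Plot only the numeric columns; c is forwarded to matplotlib's scatter.
axes = pd.plotting.scatter_matrix(df.drop(columns="Name"), c=colors, marker="o")
```

Sorting matters because `list(set(...))` gives no guaranteed order for strings, so an unsorted mapping can assign different colors to the same class from one session to the next.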
