class labels in Pandas scattermatrix

This question has been asked before ("Multiple data in scatter matrix"), but it didn't receive an answer.
I'd like to make a scatter matrix, something like in the pandas docs, but with differently colored markers for different classes. For example, I'd like some points to appear in green and others in blue depending on the value of one of the columns (or a separate list).
Here's an example using the Iris dataset. The color of the points represents the species of Iris -- Setosa, Versicolor, or Virginica.
Does pandas (or matplotlib) have a way to make a chart like that?

Update: This functionality is now in the latest version of Seaborn. Here's an example.
The following was my stopgap measure:
def factor_scatter_matrix(df, factor, palette=None):
    '''Create a scatter matrix of the variables in df, with differently colored
    points depending on the value of df[factor].
    inputs:
        df: pandas.DataFrame containing the columns to be plotted, as well
            as factor.
        factor: string or pandas.Series. The column indicating which group
            each row belongs to.
        palette: A list of hex codes, at least as long as the number of groups.
            If omitted, a predefined palette will be used, but it only includes
            9 groups.
    '''
    import numpy as np
    # pandas.tools.plotting in old pandas releases
    from pandas.plotting import scatter_matrix
    from scipy.stats import gaussian_kde

    if isinstance(factor, str):  # basestring on Python 2
        factor_name = factor  # save off the name
        factor = df[factor]  # extract column
        df = df.drop(factor_name, axis=1)  # remove from df, so it
        # doesn't get a row and col in the plot.

    classes = list(set(factor))

    if palette is None:
        palette = ['#e41a1c', '#377eb8', '#4eae4b',
                   '#994fa1', '#ff8101', '#fdfc33',
                   '#a8572c', '#f482be', '#999999']

    color_map = dict(zip(classes, palette))

    if len(classes) > len(palette):
        raise ValueError('''Too many groups for the number of colors provided.
We only have {} colors in the palette, but you have {}
groups.'''.format(len(palette), len(classes)))

    # one color per row, looked up from the row's group
    colors = factor.apply(lambda group: color_map[group])
    axarr = scatter_matrix(df, figsize=(10, 10), marker='o', c=colors, diagonal=None)

    # draw one density curve per group on each diagonal panel
    for rc in range(len(df.columns)):  # xrange on Python 2
        for group in classes:
            y = df[factor == group].iloc[:, rc].values  # .icol(rc) in old pandas
            gkde = gaussian_kde(y)
            ind = np.linspace(y.min(), y.max(), 1000)
            axarr[rc][rc].plot(ind, gkde.evaluate(ind), c=color_map[group])

    return axarr, color_map
As an example, we'll use the same dataset as in the question, available here
>>> import pandas as pd
>>> iris = pd.read_csv('iris.csv')
>>> axarr, color_map = factor_scatter_matrix(iris,'Name')
>>> color_map
{'Iris-setosa': '#377eb8',
'Iris-versicolor': '#4eae4b',
'Iris-virginica': '#e41a1c'}
Hope this is helpful!

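You can also call the scatter matrix from pandas directly, as follows:
pd.scatter_matrix(df,color=colors)
with colors being a list of length len(df) containing one color per row. (In recent pandas versions the function lives at pd.plotting.scatter_matrix.)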
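As a minimal sketch of that one-liner, here is how the per-row color list could be built from a class column. The toy DataFrame, its column names, and the palette are assumptions made up for illustration; only scatter_matrix and the color keyword come from the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical toy data standing in for the real DataFrame.
df = pd.DataFrame({
    "length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "label": ["a", "a", "b", "b", "c", "c"],
})

# One color per class (assumed palette), then one color per row.
palette = {"a": "tab:green", "b": "tab:blue", "c": "tab:red"}
colors = df["label"].map(palette).tolist()

# Plot only the numeric columns; the color list is forwarded to the scatter panels.
axes = scatter_matrix(df[["length", "width"]], color=colors, diagonal="hist")
```

The list has to follow the row order of the DataFrame, which is why it is derived with map rather than typed by hand.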

Related

xarray discrete scatter plot: specifying legend/colour order

Plotting a discrete xarray DataArray variable in a Dataset with xr.plot.scatter() yields a legend in which the discrete values are ordered arbitrarily, corresponding to unpredictable colour assignment to each level. Would it be possible to specify a specific colour or position for a given discrete value?
A simple reproducible example:
import xarray as xr
# get a predefined dataset
uvz = xr.tutorial.open_dataset("eraint_uvz")
# select a 2-D subset of the data
uvzr = uvz.isel(level=0, month=0, latitude=slice(150, 242),
                longitude=slice(240, 300))
# define a discrete variable based on levels of a continuous variable
uvzr['zone'] = 'A'
uvzr['zone'] = uvzr.zone.where(uvzr.u > 30, other='C')
uvzr['zone'] = uvzr.zone.where(uvzr.u > 10, other='B')
# do the plot
xr.plot.scatter(uvzr, x='longitude', y='latitude', hue='zone')
Is there a way to ensure that the legend entries are arranged 'A', 'B', 'C' from top to bottom, say? Or ensure that A is assigned to blue, and B to orange, for example?
I know I can reset the values of the matplotlib color cycler, but for that to be useful I first need to know which order the discrete values will be plotted in.
I'm using xarray v2022.3.0 on python 3.8.6. With an earlier version of xarray (I think 0.16) the levels were arranged alphabetically.

I found an ugly workaround using xarray.Dataset.stack and xr.where(..., drop=True), in case anyone else is stuck with a similar problem.
import numpy as np # for unique, to cycle through values
import matplotlib.pyplot as plt # to get a legend
# instead of np.unique you could pass an iterable of your choice
# specifying the order
for value in np.unique(uvzr.zone):
    # convert to a 1-D dataframe with a co-ordinate including all
    # unique combinations of latitude-longitude values
    uvzr_stacked = uvzr.stack({'location':('longitude', 'latitude')})
    # now select only those grid points in zone value
    uvzr_stacked = uvzr_stacked.where(uvzr_stacked.zone == value,
                                      drop=True)
    # the plotting function can't see the original dims any more;
    # a new name is required, however
    uvzr_stacked['lat'] = uvzr_stacked.latitude
    uvzr_stacked['lon'] = uvzr_stacked.longitude
    # plot!
    xr.plot.scatter(uvzr_stacked, x='lon', y='lat', hue='zone',
                    add_guide=False)
plt.legend(title='zone')
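If bypassing the xarray plotting wrapper is acceptable, the same idea (one scatter call per level, in an order you choose) can be sketched with plain matplotlib. This is not the xarray API: the random arrays stand in for flattened longitude, latitude and u values, and the color assignments are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 1-D stand-ins for the flattened grid values.
rng = np.random.default_rng(0)
lon = rng.uniform(0, 10, 200)
lat = rng.uniform(0, 10, 200)
u = rng.uniform(0, 40, 200)

# Same thresholds as in the question: A above 30, B at or below 10, C in between.
zone = np.where(u > 30, "A", np.where(u > 10, "C", "B"))

# Explicit legend order and explicit level-to-color mapping.
order = ["A", "B", "C"]
zone_colors = {"A": "tab:blue", "B": "tab:orange", "C": "tab:green"}

fig, ax = plt.subplots()
for name in order:
    mask = zone == name
    ax.scatter(lon[mask], lat[mask], color=zone_colors[name], label=name, s=10)
ax.legend(title="zone")
```

Because each level is drawn by its own call, the legend follows the order list and the colors no longer depend on the color cycler.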

Encoding a list column to the legend of a plot

Apologies in advance, I am not sure how to word this question best:
I am working with a large dataset, and I would like to plot Latitude and Longitude where the colour of the points (actually the opacity) is encoded by a 'FeatureType' column bound to the legend. This way I can use the legend to highlight the various features I am looking for on my map.
Here is a picture of my map and legend so far
The problem is that in my dataset, the FeatureType column holds a list of the features that can be found at each location (e.g. arch, bridge, etc.).
How can I make a point show up for both arch and bridge? At the moment each combination becomes its own category (arch,bridge etc.), leading to over 300 combinations of about 20 different FeatureTypes.
The dataset can be found at http://atlantides.org/downloads/pleiades/dumps/pleiades-locations-latest.csv.gz
N.B: I am using altair/pandas
import altair as alt
import pandas as pd
from vega_datasets import data
df = pd.read_csv ('C://path/pleiades-locations.csv')
alt.data_transformers.enable('json')
countries = alt.topo_feature(data.world_110m.url, 'countries')
selection = alt.selection_multi(fields=['featureType'], bind='legend')
brush = alt.selection(type='interval', encodings=['x'])
map = alt.Chart(countries).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project('equirectangular').properties(
    width=500,
    height=300
)
points = alt.Chart(df).mark_circle().encode(
    alt.Latitude('reprLat:Q'),
    alt.Longitude('reprLong:Q'),
    alt.Color('featureType:N'),
    tooltip=['featureType','timePeriodsKeys:N'],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.0))
).add_selection(
    selection)
(map + points)

It is not possible for Altair to generate the labels you want from your current column format. You will need to turn your comma-separated string labels into lists and then explode the column so that you get one row per item in the list:
import altair as alt
import pandas as pd
from vega_datasets import data
alt.data_transformers.enable('data_server')
df = pd.read_csv('http://atlantides.org/downloads/pleiades/dumps/pleiades-locations-latest.csv.gz')[['reprLong', 'reprLat', 'featureType']]
df['featureType'] = df['featureType'].str.split(',')
df = df.explode('featureType')
countries = alt.topo_feature(data.world_110m.url, 'countries')
world_map = alt.Chart(countries).mark_geoshape(
    fill='lightgray',
    stroke='white')
points = alt.Chart(df).mark_circle(size=10).encode(
    alt.Latitude('reprLat:Q'),
    alt.Longitude('reprLong:Q'),
    alt.Color('featureType:N', legend=alt.Legend(columns=2)))
world_map + points
Note that having this many entries in the legend is not meaningful, since the colors repeat. Interactivity would help somewhat, but I would consider splitting this up into multiple charts. I am not sure whether it is even possible to expand the legend to show the 81 hidden entries. Also double-check that the longitude/latitude locations line up correctly with the world map projection you are using; they seemed to move around when I changed the projection.
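To see what the split-and-explode step does to the rows, here is a small sketch on made-up data. The three rows and their values are hypothetical; the strip call is an extra assumption, added in case the real column has spaces after the commas.

```python
import pandas as pd

# Hypothetical rows mimicking the comma-separated featureType column.
df = pd.DataFrame({
    "reprLat": [41.9, 37.9, 30.0],
    "reprLong": [12.5, 23.7, 31.2],
    "featureType": ["arch, bridge", "temple", "bridge,settlement"],
})

# Turn each string into a list, then emit one row per list item.
df["featureType"] = df["featureType"].str.split(",")
long_df = df.explode("featureType").reset_index(drop=True)

# Assumption: trim stray whitespace so "arch" and " arch" are one category.
long_df["featureType"] = long_df["featureType"].str.strip()

print(long_df["featureType"].tolist())
# ['arch', 'bridge', 'temple', 'bridge', 'settlement']
```

Each location now appears once per feature, so a legend selection on 'bridge' picks up both the first and the third location.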

Holoviews: how to customize histogram for linked time series Curve plots

I am just getting started with Holoviews. My questions are on customizing histograms, but also I am sharing a complete example as it may be helpful for other newbies to look at, since the documentation for Holoviews is very thorough but can be overwhelming.
I have a number of time series in text files, loaded as Pandas DataFrames, where each file is for a specific location, and at each location about 10 time series were collected, each with about 15,000 points.
I am building a small interactive tool where a Selector can be used to choose the location / DataFrame, and then another Selector to pick 3 of 10 of the time series to be plotted together.
My goal is to allow linked zooms (both x and y scales). The questions and code will focus on this aspect of the tool.
I cannot share the actual data I am using, unfortunately, as it is proprietary, but I have created 3 random walks with specific data ranges that are consistent with the actual data.
## preliminaries ##
import pandas as pd
import numpy as np
import holoviews as hv
from holoviews.util.transform import dim
from holoviews.selection import link_selections
from holoviews import opts
from holoviews.operation.datashader import shade, rasterize
import hvplot.pandas
hv.extension('bokeh', width=100)
## create random walks (one location) ##
data_df = pd.DataFrame()
npoints=15000
np.random.seed(71)
x = np.arange(npoints)
y1 = 1300+2.5*np.random.randn(npoints).cumsum()
y2 = 1500+2*np.random.randn(npoints).cumsum()
y3 = 3+np.random.randn(npoints).cumsum()
data_df.loc[:,'x'] = x
data_df.loc[:,'rand1'] = y1
data_df.loc[:,'rand2'] = y2
data_df.loc[:,'rand3'] = y3
This first block just plots the data and shows how, by design, one of the random walks has a different range from the other two:
data_df.hvplot(x='x',y=['rand1','rand2','rand3'],value_label='y',width=800,height=400)
As a result, although hvplot subplots work out of the box (for linking), ranges are different so the scaling is not quite there:
data_df.hvplot(x='x',y=['rand1','rand2','rand3'],
               value_label='y',subplots=True,width=800,height=200).cols(1)
So, my first attempt was to adapt the Python-based Points example from Linked brushing in the documentation:
colors = hv.Cycle('Category10').values
dims = ['rand1', 'rand2', 'rand3']
layout = hv.Layout([
    hv.Points(data_df, dim).opts(color=c)
    for c, dim in zip(colors, [['x', d] for d in dims])
])
link_selections(layout).opts(opts.Points(width=1200, height=300)).cols(1)
That is already an amazing result for 20 minutes of effort!
However, what I would really like is to plot a curve rather than points, and also see a histogram, so I adapted the comprehension syntax to work with Curve (after reading the documentation pages Applying customization, and Composing elements):
colors = hv.Cycle('Category10').values
dims = ['rand1', 'rand2', 'rand3']
layout = hv.Layout([
    hv.Curve(data_df,'x',dim).opts(height=300,width=1200,color=c).hist(dim)
    for c, dim in zip(colors,[d for d in dims])
])
link_selections(layout).cols(1)
Which is almost exactly what I want. But I still struggle with the different layers of opts syntax.
Question 1: with the comprehension from the last code block, how would I make the histogram share color with the curves?
Now, suppose I want to rasterize the plots (although I do not think that is necessary yet with 15,000 points, as in this case). I tried to adapt the first example with Points:
cmaps = ['Blues', 'Greens', 'Reds']
dims = ['rand1', 'rand2', 'rand3']
layout = hv.Layout([
    shade(rasterize(hv.Points(data_df, dims),
                    cmap=c)).opts(width=1200, height = 400).hist(dims[1])
    for c, dims in zip(cmaps, [['x', d] for d in dims])
])
link_selections(layout).cols(1)
This is a decent start, but again I struggle with the options/customization.
Question 2: in the above code block, how would I pass the colormaps (it does not work as it is now), and how do I make the histogram reflect data values as in the previous case (and also have the right colormap)?
Thank you!

Sander answered how to color the histogram, but for the other question about coloring the datashaded plot, Datashader renders your data with a colormap rather than a single color, so the parameter is named cmap rather than color. So you were correct to use cmap in the datashaded case, but (a) cmap is actually a parameter to shade (which does the colormapping of the output of rasterize), and (b) you don't really need shade, as you can let Bokeh do the colormapping in most cases nowadays, in which case cmap is an option rather than an argument. Example:
from bokeh.palettes import Blues, Greens, Reds
cmaps = [Blues[256][200:], Greens[256][200:], Reds[256][200:]]
dims = ['rand1', 'rand2', 'rand3']
layout = hv.Layout([
    # ds is ['x', <column>], so ds[1] is the y dimension of this panel
    rasterize(hv.Points(data_df, ds)).opts(cmap=c, width=1200, height=400).hist(ds[1])
    for c, ds in zip(cmaps, [['x', d] for d in dims])
])
link_selections(layout).cols(1)

To answer your first question (how to make the histogram share the color of the curve), I've added .opts(opts.Histogram(color=c)) to your code.
When you have a layout, you can specify the options of an element inside it this way.
colors = hv.Cycle('Category10').values
dims = ['rand1', 'rand2', 'rand3']
layout = hv.Layout(
    [hv.Curve(data_df,'x',dim)
       .opts(height=300,width=600, color=c)
       .hist(dim)
       .opts(opts.Histogram(color=c))
     for c, dim in zip(colors,[d for d in dims])]
)
link_selections(layout).cols(1)

Colour code the plot based on the two data frame values

I would like to colour-code a scatter plot based on two data frame columns: each distinct value of df[1] should get its own colour, and within each group of rows sharing the same df[1] value, the opacity of that colour should vary with df[2], so that the highest df[2] value in the group is 100% opaque and the lowest is the least opaque.
Here is the code:
def func():
...
df = pd.read_csv(PATH + file, sep=",", header=None)
b = 2.72
a = 0.00000009
popt, pcov = curve_fit(func, df[2], df[5]/df[4], p0=[a,b])
perr = np.sqrt(np.diag(pcov))
plt.scatter(df[1], df[5]/df[4]/df[2])
# Plot responsible for the datapoints in the figure
plt.plot(df[1], func_cpu(df[2], *popt)/df[2], "r")
# plot responsible for the curve in the figure
plt.legend(loc="upper left")
Here is the sample dataset:
df[0],df[1],df[2],df[3],df[4],df[5],df[6]
file_name_1_i1,31,413,36120,10,9,10
file_name_1_i2,31,1240,60488,10,25,27
file_name_1_i3,31,2769,107296,10,47,48
file_name_1_i4,31,8797,307016,10,150,150
file_name_2_i1,34,72,10868,11,9,10
file_name_2_i2,34,6273,250852,11,187,196
file_name_3_i1,36,84,29568,12,9,10
file_name_3_i2,36,969,68892,12,25,26
file_name_3_i3,36,6545,328052,12,150,151
file_name_4_i1,69,116,40712,13,25,26
file_name_4_i2,69,417,80080,13,47,48
file_name_4_i2,69,1313,189656,13,149,150
file_name_4_i4,69,3009,398820,13,195,196
file_name_4_i5,69,22913,2855044,13,3991,4144
file_name_5_i1,85,59,48636,16,47,48
file_name_5_i2,85,163,64888,15,77,77
file_name_5_i3,85,349,108728,16,103,111
file_name_5_i4,85,1063,253180,14,248,248
file_name_5_i5,85,2393,526164,15,687,689
file_name_5_i6,85,17713,3643728,15,5862,5867
file_name_6_i1,104,84,75044,33,137,138
file_name_6_i2,104,455,204792,28,538,598
file_name_6_i3,104,1330,513336,31,2062,2063
file_name_6_i4,104,2925,1072276,28,3233,3236
file_name_6_i5,104,6545,2340416,28,7056,7059
...
So, the x-axis would be df[1], which is 31, 31, 31, 31, 34, 34, ..., and the y-axis is df[5], df[4], df[2], which are 9, 10, 413. Each distinct value of df[1] needs a new colour. It would be fine to repeat the colour cycle after, say, 6 unique colours. Within each colour, the opacity should change with the value of df[2] (even though the y-axis is df[5], df[4], df[2]): the highest value gets the darkest version of the colour, and the lowest gets the lightest.
and the scatter plot:
This is roughly what the desired colour coding should look like:
I have around 200 entries in the csv file.
Would using NumPy be more advantageous in this scenario?

Let me know if this is appropriate or if I have misunderstood anything:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# not needed for you
# df = pd.read_csv('~/Documents/tmp.csv')
max_2 = pd.DataFrame(df.groupby('1').max()['2'])
no_unique_colors = 3
color_set = [np.random.random((3)) for _ in range(no_unique_colors)]
# assign colors to the unique df['1'] groups in cyclic order
max_2['colors'] = [color_set[unique_df2 % no_unique_colors] for unique_df2 in range(max_2.shape[0])]
# calculate the opacities for each entry in the dataframe
colors = [list(max_2.loc[df1].colors) + [float(df['2'].iloc[i])/max_2['2'].loc[df1]] for i, df1 in enumerate(df['1'])]
# repeat thrice so that df2, df4 and df5 share the same opacity
colors = [x for x in colors for _ in range(3)]
plt.scatter(df['1'].values.repeat(3), df[['2', '4', '5']].values.reshape(-1), c=colors)
plt.show()

Well, what do you know. I understood this task totally differently. I thought the point was to have alpha levels according to all df[2], df[4], and df[5] values for each df[1] value. Oh well, since I have done the work already, why not post it?
from matplotlib import pyplot as plt
import pandas as pd
from itertools import cycle
from matplotlib.colors import to_rgb
#read the data, column numbers will be generated automatically
df = pd.read_csv("data.txt", sep = ",", header=None)
#our figure with the ax object
fig, ax = plt.subplots(figsize=(10,10))
#definition of the colors
sc_color = cycle(["tab:orange", "red", "blue", "black"])
#get groups of the same df[1] value, they will also be sorted at the same time
dfgroups = df.iloc[:, [2, 4, 5]].groupby(by=df[1])
#plot each group with a different colour
for groupkey, groupval in dfgroups:
    #create group dataframe with df[1] value as x and df[2], df[4], and df[5] values as y
    groupval = groupval.melt(var_name="x", value_name="y")
    groupval.x = groupkey
    #get min and max y for the normalization
    y_high = groupval.y.max()
    y_low = groupval.y.min()
    #read out r, g, and b values of the next color in the cycle
    r, g, b = to_rgb(next(sc_color))
    #create a colour array with nonlinear normalized alpha levels
    #between 0.19 and 0.99, so that all data points are visible
    group_color = [(r, g, b, 0.19 + 0.8 * ((y_high-val) / (y_high-y_low))**7) for val in groupval.y]
    #and plot
    ax.scatter(groupval.x, groupval.y, c=group_color)
plt.show()
Sample output of your data:
Two main problems here. One is that alpha in a scatter plot does not accept an array. But color does, hence, the detour to read out the RGB values and create an RGBA array with added alpha levels.
The other is that your data are spread over a rather wide range. A linear normalization makes changes near the lowest values invisible. There is surely some optimization possible; I like for instance this suggestion.
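The RGBA detour can be isolated in a small helper. The sketch below is one possible variant, not the code above: the function name, the alpha bounds and the linear scaling are assumptions, and it makes the highest value the most opaque, as the question asked. The sample values are the df[2] entries of the df[1] == 31 group.

```python
import numpy as np
from matplotlib.colors import to_rgba

def rgba_with_alpha(base_color, values, alpha_min=0.2, alpha_max=1.0):
    """Return an (n, 4) RGBA array: one hue, alpha scaled linearly with values."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    # Guard against a group whose values are all identical.
    scaled = (values - values.min()) / span if span > 0 else np.ones_like(values)
    # Repeat the base RGBA once per point, then overwrite the alpha channel.
    rgba = np.tile(to_rgba(base_color), (len(values), 1))
    rgba[:, 3] = alpha_min + (alpha_max - alpha_min) * scaled
    return rgba

# df[2] values of the first group in the sample data.
colors = rgba_with_alpha("tab:blue", [413, 1240, 2769, 8797])
```

The resulting array can be passed as c=colors to ax.scatter for that group. For data spread over several orders of magnitude, scaling np.log(values) instead of the raw values may keep the low end distinguishable.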

Setting col_colors in seaborn clustermap from pandas

I have a clustermap generated from a pandas dataframe. Two of the columns are used to generate the clustermap and I need to use a 3rd column to generate a col_colors bar using sns.palplot(sns.light_palette('red')) palette (values will be from 0 - 1, light - dark colors).
The pseudo-code looks something like this:
df=pd.DataFrame(input, columns = ['Source', 'amplicon', 'coverage', 'GC'])
tiles = df.pivot("Source", "amplicon", "coverage")
col_colors = [values from df['GC']]
sns.clustermap(tiles, vmin=0, vmax=2, col_colors=col_colors)
I'm battling to find details on how to setup the col_colors so the correct values are linked to the appropriate tiles. Some direction would be greatly appreciated.

This will be much easier to explain with sample data. I don't know what your data looks like, but say you had a bunch of GC content measurements, for instance:
import seaborn as sns
import numpy as np
import pandas as pd
data = {'16S':np.random.normal(.52, 0.05, 12),
'ITS':np.random.normal(.52, 0.05, 12),
'Source':np.random.choice(['soil', 'water', 'air'], 12, replace=True)}
df=pd.DataFrame(data)
df[:3]
16S ITS Source
0 0.493087 0.460066 air
1 0.607229 0.592945 water
2 0.577155 0.440726 water
So the data is GC content, and there is a column describing the source. Say we want to plot a cluster map of the GC content where we use the Source column to define the network:
#create a color palette with the same number of colors as unique values in the Source column
network_pal = sns.light_palette('red', len(df.Source.unique()))
#Create a dictionary where the key is the category and the values are the
#colors from the palette we just created
network_lut = dict(zip(df.Source.unique(), network_pal))
#get the series of all of the categories
networks = df.Source
#map the colors to the series. Now we have a list of colors the same
#length as our dataframe, where unique values are mapped to the same color
network_colors = pd.Series(networks).map(network_lut)
#plot the heatmap with the 16S and ITS categories with the network colors
#defined by Source column
sns.clustermap(df[['16S', 'ITS']], row_colors=network_colors, cmap='BuGn_r')
Basically, most of the above code creates a vector of colors that corresponds to the Source column of the data frame. You could of course create this manually, where the first color in the list is mapped to the first row in the dataframe, the second color to the second row, and so on (this order will change when you plot it), but that would be a lot of work. I used a red color palette since that is what you mentioned in your question, though I might recommend a different palette. I colored by rows, but you can do the same thing for columns. Hope this helps!
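For the continuous case in the question (values from 0 to 1, light to dark), one option is to pass each value through a matplotlib colormap instead of a lookup dictionary. This is a sketch under assumptions: the amplicon names and GC values are made up, and the built-in "Reds" colormap is swapped in for sns.light_palette('red').

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import to_hex

# Hypothetical GC fraction per amplicon (the columns of the clustermap).
gc = pd.Series({"amp1": 0.0, "amp2": 0.35, "amp3": 0.62, "amp4": 1.0})

# "Reds" runs from light (0) to dark (1); assumed stand-in for a light red palette.
cmap = plt.get_cmap("Reds")

# One hex color per amplicon, indexed by amplicon name.
col_colors = gc.map(lambda v: to_hex(cmap(v)))
```

Since col_colors is a Series indexed by amplicon, it could then be passed as sns.clustermap(tiles, col_colors=col_colors), and seaborn should line the colors up with the pivoted columns by label rather than by position.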
