Update with reproducible code:
I have data in the form of numeric values stored in a nested dictionary called dict_data. This dictionary contains two variables (nested dictionaries), called ple and sampen. Each variable has three rois (lists of data) attached.
Inside the nested for loop (for variable in variables: / for i in range(3):), the dictionary's data is re-stored in a numpy array for ple and sampen respectively.
Next, I apply a quick Z transformation to the numpy array data and save the results in a new array called data_array_z.
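(As an aside, the element-wise Z loop in the full code below can be written as one vectorized line; a sketch assuming data_array is the numpy array built inside the loop:)
data_array_z = (data_array - data_array.mean()) / data_array.std()  # vectorized z-scoring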
Here comes my aim:
I would like to plot a scatterplot for both variables ple and sampen, including their three rois (see the list rois and the dictionary dict_data). Additionally, I would like to plot linear regression lines across the data points in the three rois. An image at the end shows the result that is reproducible with this code. All this works so far.
Problem:
I would like to apply a matplotlib colormap to both variables' scatter points and their regression lines. I do not understand how to set the c= or color= arguments so that matplotlib plots each new point and regression line with an individual color from a specified colormap, such as tab20c_r.
Currently I receive the following error: AttributeError: 'numpy.ndarray' object has no attribute 'index'
Here is my full code for overview:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from scipy import stats

# Data
rois = ["interoception", "exteroception", "cognitive"]
variables = ["ple", "sampen"]
dict_data = {
    "ple":
        {"interoception": [-1.10037122285797, -1.12865588460383, -0.703950853686781],
         "exteroception": [-1.17398360636007, -1.3990001171012, -1.04528124479161],
         "cognitive": [-1.41360171989828, -1.40283540165903, -0.97942989571082]},
    "sampen":
        {"interoception": [0.949609993391806, 0.855956231042865, 0.960562761361569],
         "exteroception": [0.910728054043017, 0.800209085753256, 0.884463613744541],
         "cognitive": [0.832925242590743, 0.782835997734383, 0.899277318610028]},
}

# Linear regression for each subject and variable across the three regions
x_pos_1 = np.array([0, 1, 2])
x_pos_2 = np.array([3, 4, 5])

for variable in variables:
    for i in range(3):
        data_array = np.array([
            dict_data[variable]["interoception"][i],
            dict_data[variable]["exteroception"][i],
            dict_data[variable]["cognitive"][i]
        ])

        # Z normalization
        data_array_z = []
        for i in data_array:
            z = (i - np.mean(data_array)) / np.std(data_array)
            data_array_z.append(z)

        # Plot
        cmap = plt.get_cmap("tab20c_r")
        segment_cmap = cmap(np.linspace(0, 1, len(data_array_z)))
        if variable == "ple":
            reg = sp.stats.linregress(x_pos_1, data_array_z)
            plt.scatter(x_pos_1, data_array_z, color=segment_cmap[data_array.index(i)])
            plt.plot(x_pos_1, reg[1] + reg[0]*x_pos_1,
                     linestyle="-", linewidth=1, color=segment_cmap[data_array.index(i)],
                     zorder=2)
        elif variable == "sampen":
            reg = sp.stats.linregress(x_pos_2, data_array_z)
            plt.scatter(x_pos_2, data_array_z)
            plt.plot(x_pos_2, reg[1] + reg[0]*x_pos_2,
                     linestyle="-", linewidth=1, zorder=2)

plt.show()
This is what the result looks like when removing the code line color=segment_cmap[data_array.index(i)] from the first plot, so that matplotlib simply uses its default color cycle.
Your code has two problems.
You reuse the name i in an inner loop; when the inner loop is completed, i is a float, not an index into the columns of your data.
You had the idea of choosing the colors from a qualitative color map, but Matplotlib already uses colors from a qualitative color map by default; you just have to use them.
Re ①, I changed the name of the inner loop variable; re ②, I used a cycler object to go through the colors from the tab20c_r color map. I used 3 colors, per your code, but the colors are repeated in the plot; is this what you want?
Further, I've added a legend.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
from cycler import cycler

# Data
rois = ["interoception", "exteroception", "cognitive"]
variables = ["ple", "sampen"]
dict_data = {
    "ple":
        {"interoception": [-1.10037122285797, -1.12865588460383, -0.703950853686781],
         "exteroception": [-1.17398360636007, -1.3990001171012, -1.04528124479161],
         "cognitive": [-1.41360171989828, -1.40283540165903, -0.97942989571082]},
    "sampen":
        {"interoception": [0.949609993391806, 0.855956231042865, 0.960562761361569],
         "exteroception": [0.910728054043017, 0.800209085753256, 0.884463613744541],
         "cognitive": [0.832925242590743, 0.782835997734383, 0.899277318610028]},
}

# Linear regression for each subject and variable across the three regions
x_pos_1 = np.array([0, 1, 2])
x_pos_2 = np.array([3, 4, 5])

# Cycle through three colors sampled from the tab20c_r colormap
plt.gca().set_prop_cycle(
    cycler(color=plt.get_cmap("tab20c_r")(np.linspace(0, 1, 3))))

for variable in variables:
    for i in range(3):
        data_array = np.array([
            dict_data[variable]["interoception"][i],
            dict_data[variable]["exteroception"][i],
            dict_data[variable]["cognitive"][i]
        ])

        # Z normalization
        data_array_z = []
        for y in data_array:
            z = (y - np.mean(data_array)) / np.std(data_array)
            data_array_z.append(z)

        # Plot
        if variable == "ple":
            reg = linregress(x_pos_1, data_array_z)
            plt.scatter(x_pos_1, data_array_z, label="%s, %s" % (variable, i+1))
            plt.plot(x_pos_1, reg[1] + reg[0]*x_pos_1,
                     linestyle="-", linewidth=1)
        elif variable == "sampen":
            reg = linregress(x_pos_2, data_array_z)
            plt.scatter(x_pos_2, data_array_z, label="%s, %s" % (variable, i+1))
            plt.plot(x_pos_2, reg[1] + reg[0]*x_pos_2,
                     linestyle="-", linewidth=1, zorder=2)

plt.legend()
plt.show()
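(If you would rather index the colormap directly than set a property cycle, a minimal sketch of that pattern; the key is to index the sampled colors with the integer loop counter, not with .index() on an ndarray. The random y values are stand-ins for one subject's z-scored data:)
import numpy as np
import matplotlib.pyplot as plt

colors = plt.get_cmap("tab20c_r")(np.linspace(0, 1, 3))  # one RGBA row per subject
for k in range(3):
    y = np.random.rand(3)  # stand-in data
    plt.scatter([0, 1, 2], y, color=colors[k])
    plt.plot([0, 1, 2], y, linewidth=1, color=colors[k])
plt.show()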
I have multidimensional compositional data (all dimensions sum to 1 or 100). I have learned how to use three of the variables to create a 2d ternary plot.
I would like to add a fourth dimension such that my plot looks like this.
I am willing to use Python or R. I am using rpy2 to create the ternary plots in Python via R right now, but only because that's an easy solution. If the ternary data could be transformed into 3d coordinates, a simple wire plot could be used.
This post shows how 3d compositional data can be transformed into 2d data so that normal plotting methods can be used. One solution would be to do the same thing in 3d.
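(The transform itself is small; a hedged sketch of the 3d analogue, mapping one 4-part composition onto the vertices of a regular tetrahedron. The vertex coordinates match the answer below, and the composition is the first row of the sample data:)
import numpy as np

verts = np.array([[0, 0, 0],
                  [1, 0, 0],
                  [0.5, np.sqrt(3) / 2, 0],
                  [0.5, 0.28867513, 0.81649658]])  # regular tetrahedron, one vertex per component

comp = np.array([0.082337, 0.097583, 0.048608, 0.771472])  # c1..c4 of row 0
xyz = comp @ verts  # barycentric weights times vertices = one 3d Cartesian point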
Here is some sample Data:
c1 c2 c3 c4
0 0.082337 0.097583 0.048608 0.771472
1 0.116490 0.065047 0.066202 0.752261
2 0.114884 0.135018 0.073870 0.676229
3 0.071027 0.097207 0.070959 0.760807
4 0.066284 0.079842 0.103915 0.749959
5 0.016074 0.074833 0.044532 0.864561
6 0.066277 0.077837 0.058364 0.797522
7 0.055549 0.057117 0.045633 0.841701
8 0.071129 0.077620 0.049066 0.802185
9 0.089790 0.086967 0.083101 0.740142
10 0.084430 0.094489 0.039989 0.781093
Well, I solved this myself using a Wikipedia article, an SO post, and some brute force. Sorry for the wall of code, but you have to draw all the plot outlines and labels and so forth.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection
from itertools import combinations
import pandas as pd

def plot_ax():  # plot tetrahedral outline
    verts = [[0, 0, 0],
             [1, 0, 0],
             [0.5, np.sqrt(3)/2, 0],
             [0.5, 0.28867513, 0.81649658]]
    lines = combinations(verts, 2)
    for x in lines:
        line = np.transpose(np.array(x))
        ax.plot3D(line[0], line[1], line[2], c='0')

def label_points():  # create labels for each vertex of the simplex
    a = np.array([1, 0, 0, 0])  # Barycentric coordinates of vertex A (c1)
    b = np.array([0, 1, 0, 0])  # Barycentric coordinates of vertex B (c2)
    c = np.array([0, 0, 1, 0])  # Barycentric coordinates of vertex C (c3)
    d = np.array([0, 0, 0, 1])  # Barycentric coordinates of vertex D (c4)
    labels = ['a', 'b', 'c', 'd']
    cartesian_points = get_cartesian_array_from_barycentric([a, b, c, d])
    for point, label in zip(cartesian_points, labels):
        if 'a' in label:
            ax.text(point[0], point[1]-0.075, point[2], label, size=16)
        elif 'b' in label:
            ax.text(point[0]+0.02, point[1]-0.02, point[2], label, size=16)
        else:
            ax.text(point[0], point[1], point[2], label, size=16)

def get_cartesian_array_from_barycentric(b):  # transform from "barycentric" composition space to cartesian coordinates
    verts = [[0, 0, 0],
             [1, 0, 0],
             [0.5, np.sqrt(3)/2, 0],
             [0.5, 0.28867513, 0.81649658]]
    # create transformation array via https://en.wikipedia.org/wiki/Barycentric_coordinate_system
    t = np.transpose(np.array(verts))
    t_array = np.array([t.dot(x) for x in b])  # apply transform to all points
    return t_array

def plot_3d_tern(df, c='1'):  # use "get_cartesian_array_from_barycentric" to plot the scatter points
    # args are df=dataframe to plot and c=scatter point color
    bary_arr = df.values
    cartesian_points = get_cartesian_array_from_barycentric(bary_arr)
    ax.scatter(cartesian_points[:, 0], cartesian_points[:, 1], cartesian_points[:, 2], c=c)

# Create Dataset 1
np.random.seed(123)
c1 = np.random.normal(8, 2.5, 20)
c2 = np.random.normal(8, 2.5, 20)
c3 = np.random.normal(8, 2.5, 20)
c4 = [100 - x for x in c1 + c2 + c3]  # make sure components sum to 100

# df unnecessary but that is the format of my real data
df1 = pd.DataFrame(data=[c1, c2, c3, c4], index=['c1', 'c2', 'c3', 'c4']).T
df1 = df1 / 100

# Create Dataset 2
np.random.seed(1234)
c1 = np.random.normal(16, 2.5, 20)
c2 = np.random.normal(16, 2.5, 20)
c3 = np.random.normal(16, 2.5, 20)
c4 = [100 - x for x in c1 + c2 + c3]
df2 = pd.DataFrame(data=[c1, c2, c3, c4], index=['c1', 'c2', 'c3', 'c4']).T
df2 = df2 / 100

# Create Dataset 3
np.random.seed(12345)
c1 = np.random.normal(25, 2.5, 20)
c2 = np.random.normal(25, 2.5, 20)
c3 = np.random.normal(25, 2.5, 20)
c4 = [100 - x for x in c1 + c2 + c3]
df3 = pd.DataFrame(data=[c1, c2, c3, c4], index=['c1', 'c2', 'c3', 'c4']).T
df3 = df3 / 100

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer auto-registers in recent matplotlib
plot_ax()       # draw tetrahedral outline
label_points()  # label the vertices
plot_3d_tern(df1, 'b')  # plot df1
plot_3d_tern(df2, 'r')  # ...plot df2
plot_3d_tern(df3, 'g')  # ...
plt.show()
The accepted answer explains how to do this in Python, but the question also asked about R.
I've provided an answer in this thread on how to do this 'manually' in R.
Otherwise, you can use the klaR package directly for this:
df <- matrix(c(
0.082337, 0.097583, 0.048608, 0.771472,
0.116490, 0.065047, 0.066202, 0.752261,
0.114884, 0.135018, 0.073870, 0.676229,
0.071027, 0.097207, 0.070959, 0.760807,
0.066284, 0.079842, 0.103915, 0.749959,
0.016074, 0.074833, 0.044532, 0.864561,
0.066277, 0.077837, 0.058364, 0.797522,
0.055549, 0.057117, 0.045633, 0.841701,
0.071129, 0.077620, 0.049066, 0.802185,
0.089790, 0.086967, 0.083101, 0.740142,
0.084430, 0.094489, 0.039989, 0.781094
), byrow = TRUE, nrow = 11, ncol = 4)
# install.packages(c("klaR", "scatterplot3d"))
library(klaR)
#> Loading required package: MASS
quadplot(df)
Created on 2020-08-14 by the reprex package (v0.3.0)
To illustrate my problem I prepared an example:
First, I have two arrays 'a' and 'b' and I'm interested in their distribution:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([1,2,2,2,2,4,8,1,9,5,3,1,2,9])
b = np.array([5,9,9,2,3,9,3,6,8,4,2,7,8,8])
n1,bin1,pat1 = plt.hist(a,np.arange(1,10,2),histtype='step')
n2,bin2,pat2 = plt.hist(b,np.arange(1,10,2), histtype='step')
plt.show()
This code gives me a histogram with two 'curves'. Now I want to subtract one 'curve' from the other, and by this I mean that I do this for each bin separately:
n3 = n2-n1
I don't need negative counts so:
for i in range(0, len(n2)):
    if n3[i] < 0:
        n3[i] = 0
    else:
        continue
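(Equivalently, the clipping loop can be written as one vectorized line:)
n3 = np.clip(n2 - n1, 0, None)  # negative bin counts become 0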
The new histogram curve should be plotted in the same range as the previous ones and it should have the same number of bins. So I have the number of bins and their positions (which will be the same as the ones for the other curves, please refer to the block above) and the frequency or counts (n3) that every bin should have. Do you have any idea how I can do this with the data that I have?
You can use a step function to plot n3 = n2 - n1. The only issue is that you need to provide one more value, otherwise the last value is not shown nicely. Also you need to use the where="post" option of the step function.
import numpy as np
import matplotlib.pyplot as plt
a = np.array([1,2,2,2,2,4,8,1,9,5,3,1,2,9])
b = np.array([5,9,9,2,3,9,3,6,8,4,2,7,8,8])
n1,bin1,pat1 = plt.hist(a,np.arange(1,10,2),histtype='step')
n2,bin2,pat2 = plt.hist(b,np.arange(1,10,2), histtype='step')
n3=n2-n1
n3[n3<0] = 0
plt.step(np.arange(1, 10, 2), np.append(n3, [n3[-1]]), where='post', lw=3)
plt.show()
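(On Matplotlib 3.4 or newer, plt.stairs takes the bin edges directly, so the duplicated last value is not needed; a sketch assuming the same n3 and bin edges as above:)
plt.stairs(n3, np.arange(1, 10, 2), lw=3)  # len(edges) == len(values) + 1
plt.show()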
I want to color my clusters with a color map that I made in the form of a dictionary (i.e. {leaf: color}).
I've tried following https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/ but the colors get messed up for some reason. The default plot looks good, I just want to assign those colors differently. I saw that there was a link_color_func but when I tried using my color map (D_leaf_color dictionary) I got an error b/c it wasn't a function. I've created D_leaf_color to customize the colors of the leaves associated with particular clusters. In my actual dataset, the colors mean something so I'm steering away from arbitrary color assignments.
I don't want to use color_threshold b/c in my actual data I have way more clusters, and SciPy repeats the colors, hence this question...
How can I use my leaf-color dictionary to customize the color of my dendrogram clusters?
I made a GitHub issue https://github.com/scipy/scipy/issues/6346 where I further elaborated on the approach to color the leaves, in Interpreting the output of SciPy's hierarchical clustering dendrogram? (maybe found a bug...), but I still can't figure out how to either: (i) use the dendrogram output to reconstruct my dendrogram with my specified color dictionary, or (ii) reformat my D_leaf_color dictionary for the link_color_func parameter.
# Init
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# Load data
from sklearn.datasets import load_diabetes

# Clustering
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list
from scipy.spatial import distance
from fastcluster import linkage  # You can use the SciPy one too
%matplotlib inline

# Dataset
A_data = load_diabetes().data
DF_diabetes = pd.DataFrame(A_data, columns=["attr_%d" % j for j in range(A_data.shape[1])])

# Absolute value of correlation matrix, then subtract from 1 for dissimilarity
DF_dism = 1 - np.abs(DF_diabetes.corr())

# Compute average linkage
A_dist = distance.squareform(DF_dism.values)  # .as_matrix() was removed in newer pandas
Z = linkage(A_dist, method="average")

# Color mapping
D_leaf_colors = {"attr_1": "#808080", # Unclustered gray

                 "attr_4": "#B061FF", # Cluster 1 indigo
                 "attr_5": "#B061FF",
                 "attr_2": "#B061FF",
                 "attr_8": "#B061FF",
                 "attr_6": "#B061FF",
                 "attr_7": "#B061FF",

                 "attr_0": "#61ffff", # Cluster 2 cyan
                 "attr_3": "#61ffff",
                 "attr_9": "#61ffff",
                 }

# Dendrogram
# To get this dendrogram coloring below `color_threshold=0.7`
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None, leaf_font_size=12, leaf_rotation=45, link_color_func=D_leaf_colors)
# TypeError: 'dict' object is not callable
I also tried "how do I get the subtrees of dendrogram made by scipy.cluster.hierarchy".
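(For reference, link_color_func must be a callable mapping a link id to a color string, which is why passing a dict raises the TypeError above; a minimal sketch that wraps a dict, with hypothetical link ids:)
link_colors = {10: "#B061FF", 11: "#61ffff"}  # hypothetical {link_id: color} mapping
D = dendrogram(Z, link_color_func=lambda link_id: link_colors.get(link_id, "#808080"))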
Here is a solution that uses the return matrix Z of linkage() (described early on, but a little hidden, in the docs) and link_color_func:
# see question for code prior to "color mapping"
# Color mapping
dflt_col = "#808080"  # Unclustered gray
D_leaf_colors = {"attr_1": dflt_col,

                 "attr_4": "#B061FF", # Cluster 1 indigo
                 "attr_5": "#B061FF",
                 "attr_2": "#B061FF",
                 "attr_8": "#B061FF",
                 "attr_6": "#B061FF",
                 "attr_7": "#B061FF",

                 "attr_0": "#61ffff", # Cluster 2 cyan
                 "attr_3": "#61ffff",
                 "attr_9": "#61ffff",
                 }

# notes:
# * rows in Z correspond to "inverted U" links that connect clusters
# * rows are ordered by increasing distance
# * if the colors of the connected clusters match, use that color for link
link_cols = {}
for i, i12 in enumerate(Z[:,:2].astype(int)):
    c1, c2 = (link_cols[x] if x > len(Z) else D_leaf_colors["attr_%d" % x]
              for x in i12)
    link_cols[i+1+len(Z)] = c1 if c1 == c2 else dflt_col

# Dendrogram
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None,
               leaf_font_size=12, leaf_rotation=45, link_color_func=lambda x: link_cols[x])
Here is the output:
Two-liner for applying custom colormap to cluster branches:
import numpy as np
import matplotlib as mpl
from matplotlib.pyplot import cm
from scipy.cluster import hierarchy

cmap = cm.rainbow(np.linspace(0, 1, 10))
hierarchy.set_link_color_palette([mpl.colors.rgb2hex(rgb[:3]) for rgb in cmap])
You can then replace rainbow with any cmap and change 10 to the number of clusters you want.
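(A usage sketch, assuming the Z from the question: the palette set above is what dendrogram draws from when coloring clusters below color_threshold:)
hierarchy.dendrogram(Z, color_threshold=0.7, above_threshold_color="grey")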
I found a hackish solution that does require using the color threshold (I need it in order to obtain the same original coloring, otherwise the colors are not the same as presented in the OP), but it could lead you to a solution. However, you may not have enough information to know how to set the color palette order.
# Init
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
# Load data
from sklearn.datasets import load_diabetes
# Clustering
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list, set_link_color_palette
from scipy.spatial import distance
from fastcluster import linkage # You can use SciPy one too
%matplotlib inline
# Dataset
A_data = load_diabetes().data
DF_diabetes = pd.DataFrame(A_data, columns = ["attr_%d" % j for j in range(A_data.shape[1])])
# Absolute value of correlation matrix, then subtract from 1 for dissimilarity
DF_dism = 1 - np.abs(DF_diabetes.corr())

# Compute average linkage
A_dist = distance.squareform(DF_dism.values)  # .as_matrix() was removed in newer pandas
Z = linkage(A_dist, method="average")
# Color mapping dict not relevant in this case
# Dendrogram
# To get this dendrogram coloring below `color_threshold=0.7`
# Change the color palette; I did not include the grey, which is used above the threshold
set_link_color_palette(["#B061FF", "#61ffff"])
D = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=.7, leaf_font_size=12, leaf_rotation=45,
               above_threshold_color="grey")
The result:
Pretty much exactly what the question states, but a little context:
I'm creating a program to plot a large number of points (~10,000, but it will be more later on). This is being done using matplotlib's plt.scatter. This command is part of a loop that saves the figure, so I can later animate it.
What I want to be able to do is randomly select a small portion of these particles (say, maybe 100?) and give them a different marker than the rest, even though they're part of the same data set. This is so I can use them as placeholders to see the motion of individual particles, as well as the bulk material.
Is there a way to use a different marker for a small subset of the same data?
For reference, the particles are uniformly distributed just using the numpy random sampler, but my code for that is:
for i in range(N):  # N number of particles
    particle_position[i] = np.random.uniform(0, xmax)  # Initialize in spatial domain
    particle_velocity[i] = np.random.normal(0, 5)      # Initialize in velocity space

for i in range(maxtime):
    plt.scatter(particle_position, particle_velocity, s=1, c=norm_xvel, cmap=br_disc, lw=0)
The position and velocity change on each iteration of the main loop (there's quite a bit of code), but these are the main initialization and plotting routines.
I had an idea that perhaps I could randomly select a bunch of i values from range(N), and use an ax.scatter() command to plot them on the same axes?
Here is a possible solution to have a subset of your points identified with a different marker:
import matplotlib.pyplot as plt
import numpy as np

SIZE = 100
SAMPLE_SIZE = 10

def select_subset(seq, size):
    """selects a subset of the data using ...
    """
    return seq[:size]

points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)

plt.scatter(points_x, points_y, marker=".", color="blue")
plt.scatter(select_subset(points_x, SAMPLE_SIZE),
            select_subset(points_y, SAMPLE_SIZE),
            marker="o", color="red")
plt.show()
It uses plt.scatter twice: once on the full data set, once on the sample points.
You will have to decide how you want to select the sample of points; that logic is isolated in the select_subset function.
You could also extract the sample points from the data set to prevent marking them twice, but numpy is rather inefficient at deleting or resizing.
Maybe a better method is to use a mask? A mask has the advantage of leaving your original data intact and in order.
Here is a way to proceed with masks:
import matplotlib.pyplot as plt
import numpy as np

SIZE = 100
SAMPLE_SIZE = 10

def make_mask(data_size, sample_size):
    mask = np.array([True] * sample_size + [False] * (data_size - sample_size))
    np.random.shuffle(mask)
    return mask

points_x = np.random.uniform(-1, 1, size=SIZE)
points_y = np.random.uniform(-1, 1, size=SIZE)

mask = make_mask(SIZE, SAMPLE_SIZE)
not_mask = np.invert(mask)

plt.scatter(points_x[not_mask], points_y[not_mask], marker=".", color="blue")
plt.scatter(points_x[mask], points_y[mask], marker="o", color="red")
plt.show()
As you see, scatter is called once on the points not selected in the sample and a second time on the sampled subset, drawing each subset with its own marker. It is efficient and leaves the original data intact.
The code below does what you want. I have selected a random set v_sub_index of N_sub indices in the correct range (0 to N) and drawn those (with the _sub suffix) from the larger samples particle_position and particle_velocity. Please note that you don't have to loop to generate random samples; numpy has great functionality for that without for loops.
import numpy as np
import matplotlib.pyplot as pl
N = 100
xmax = 1.
v_sigma = 2.5 / 2. # 95% of the samples contained within 0, 5
v_mean = 2.5 # mean at 2.5
N_sub = 10
v_sub_index = np.random.choice(N, N_sub, replace=False)  # sample indices without repeats

particle_position = np.random.rand(N) * xmax
particle_velocity = v_mean + v_sigma * np.random.randn(N)  # scale and shift the standard normal
particle_position_sub = np.array(particle_position[v_sub_index])
particle_velocity_sub = np.array(particle_velocity[v_sub_index])
particle_position_nosub = np.delete(particle_position, v_sub_index)
particle_velocity_nosub = np.delete(particle_velocity, v_sub_index)
pl.scatter(particle_position_nosub, particle_velocity_nosub, color='b', marker='o')
pl.scatter(particle_position_sub , particle_velocity_sub , color='r', marker='^')
pl.show()