I'm trying to draw a regression line in ternary space with Plotly in Python, but there doesn't seem to be an option like trendline='loess' for ternary scatter plots. Is there another way to achieve the same result for ternaries? Below is code from a previous post that draws a spline line, but not a regression.
import numpy as np
import plotly.graph_objects as go
a = np.array([0.15, 0.15, 0.17, 0.2 , 0.21, 0.24, 0.26, 0.27, 0.27, 0.29, 0.32, 0.35, 0.39, 0.4 , 0.4 , 0.41, 0.47, 0.48, 0.51, 0.52, 0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1 , 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04, 0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14, 0.15, 0.16, 0.18, 0.2 , 0.21, 0.22, 0.24, 0.25, 0.25])
c = np.array([0.71, 0.7 , 0.73, 0.63, 0.62, 0.58, 0.69, 0.57, 0.56, 0.67, 0.65, 0.51, 0.48, 0.47, 0.46, 0.45, 0.4 , 0.39, 0.35, 0.34, 0.31, 0.28, 0.23, 0.18, 0.16, 0.13, 0.07, 0.02, 0.01])
fig = go.Figure()
curve_portion = np.where((b < 0.15) & (c > 0.6))
curve_other_portion = np.where(~((b < 0.15) & (c > 0.6)))
def add_plot_spline_portions(fig, indices_groupings):
    for indices in indices_groupings:
        fig.add_trace(go.Scatterternary({
            'mode': 'lines',
            'connectgaps': True,
            'a': a[indices],
            'b': b[indices],
            'c': c[indices],
            'line': {'color': 'black', 'shape': 'spline', 'smoothing': 1},
            'marker': {'size': 2, 'line': {'width': 0.1}}
        })
        )
add_plot_spline_portions(fig, [curve_portion, curve_other_portion])
fig.show(renderer='png')
I can outline what I think is a general sort of solution - it doesn't have as much mathematical rigor as I would like, and involves some guess and check type work - but hopefully it's helpful.
The first consideration is that for this regression on a ternary plot, there are only two degrees of freedom because A+B+C=1 (you might find this explanation helpful). This means it only makes sense to consider the relationship between two of the variables at a time. What we really want to do is create a regression between two of the variables, then determine the value of the third variable using the equation A+B+C=1.
The second consideration is a bit harder to define, but since you are after a regression that captures the "reversing" nature of the variable A, we want a regression where A can take on repeated values. I think the most straightforward way to achieve this is for A to be the variable you are predicting.
For simplicity's sake, let's say we use a degree 2 polynomial regression that predicts A from either B or C. We can make a scatter plot of each pair and choose whichever polynomial fits better for our purposes.
Here is a quick EDA:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
a = np.array([0.15, 0.15, 0.17, 0.2 , 0.21, 0.24, 0.26, 0.27, 0.27, 0.29, 0.32, 0.35, 0.39, 0.4 , 0.4 , 0.41, 0.47, 0.48, 0.51, 0.52, 0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1 , 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04, 0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14, 0.15, 0.16, 0.18, 0.2 , 0.21, 0.22, 0.24, 0.25, 0.25])
c = np.array([0.71, 0.7 , 0.73, 0.63, 0.62, 0.58, 0.69, 0.57, 0.56, 0.67, 0.65, 0.51, 0.48, 0.47, 0.46, 0.45, 0.4 , 0.39, 0.35, 0.34, 0.31, 0.28, 0.23, 0.18, 0.16, 0.13, 0.07, 0.02, 0.01])
## eda to determine polynomial of best fit to predict A
fig_eda = make_subplots(rows=1, cols=2)
fig_eda.add_trace(go.Scatter(x=b, y=a, mode='markers'),row=1, col=1)
coefficients = np.polyfit(b,a,2)
p = np.poly1d(coefficients)
b_vals = np.linspace(min(b),max(b))
a_pred = np.array([p(x) for x in b_vals])
fig_eda.add_trace(go.Scatter(x=b_vals, y=a_pred, mode='lines'),row=1, col=1)
fig_eda.add_trace(go.Scatter(x=c, y=a, mode='markers'),row=1, col=2)
coefficients = np.polyfit(c,a,2)
p = np.poly1d(coefficients)
c_vals = np.linspace(min(c),max(c))
a_pred = np.array([p(x) for x in c_vals])
fig_eda.add_trace(go.Scatter(x=c_vals, y=a_pred, mode='lines'),row=1, col=2)
Notice how predicting A from B looks like it captures the reversing nature of A better than predicting A from C. If we try to make a degree 2 polynomial regression on A from C, we can see A is not going to repeat within the domain of C: [0,1] because of the very low sloping nature of that polynomial.
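As a rough numerical companion to the plots, the sketch below checks where the vertex of each fitted parabola falls: a degree 2 polynomial can only turn around inside the predictor's domain if its vertex lies in [0, 1]. The helper name vertex_in_domain is made up for this illustration and is not part of NumPy or Plotly.

```python
import numpy as np

def vertex_in_domain(coefficients, lo=0.0, hi=1.0):
    """Return (x_vertex, inside) for a2*x**2 + a1*x + a0 (hypothetical helper)."""
    a2, a1, _ = coefficients
    if a2 == 0:
        # Degenerate case: a straight line has no turning point.
        return None, False
    x_vertex = -a1 / (2.0 * a2)
    return x_vertex, bool(lo <= x_vertex <= hi)

a = np.array([0.15, 0.15, 0.17, 0.2, 0.21, 0.24, 0.26, 0.27, 0.27, 0.29, 0.32, 0.35, 0.39, 0.4, 0.4, 0.41, 0.47, 0.48, 0.51, 0.52, 0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1, 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04, 0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14, 0.15, 0.16, 0.18, 0.2, 0.21, 0.22, 0.24, 0.25, 0.25])
c = np.array([0.71, 0.7, 0.73, 0.63, 0.62, 0.58, 0.69, 0.57, 0.56, 0.67, 0.65, 0.51, 0.48, 0.47, 0.46, 0.45, 0.4, 0.39, 0.35, 0.34, 0.31, 0.28, 0.23, 0.18, 0.16, 0.13, 0.07, 0.02, 0.01])

# Fit A against each candidate predictor and report the turning point.
for name, predictor in (("B", b), ("C", c)):
    coeffs = np.polyfit(predictor, a, 2)
    x_v, inside = vertex_in_domain(coeffs)
    print(f"A ~ {name}: vertex at {x_v}, inside [0, 1]: {inside}")
```

This only inspects the shape of the fitted polynomial; it says nothing about how well either fit matches the data.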
So let's proceed with this regression with B as the predictor variable and A as the predicted variable (C is then also a predicted variable, obtained from C = 1 - (A + B)).
fig = go.Figure()
fig.add_trace(go.Scatterternary({
'mode': 'markers',
'connectgaps': True,
'a': a,
'b': b,
'c': c
}))
## since A+B+C = 1, we only need to fit a polynomial between two of the variables
## fit an n-degree polynomial to 2 of your variables
## source https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
coefficients = np.polyfit(b,a,2)
p = np.poly1d(coefficients)
## we use the entire domain of the input variable B
b_vals = np.linspace(0,1)
a_pred = np.array([p(x) for x in b_vals])
c_pred = 1 - (b_vals + a_pred)
fig.add_trace(go.Scatterternary({
'mode': 'lines',
'connectgaps': True,
'a': a_pred,
'b': b_vals,
'c': c_pred,
'marker': {'size': 2, 'color':'red', 'line': {'width': 0.1}}
}))
fig.show()
This is the lowest degree polynomial regression that allows for repeated values of A (a linear regression to predict A wouldn't allow A to take on repeated values). However, you can definitely experiment with increasing the degree of the polynomial you are using, and with predicting A from either B or C.
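One practical detail when evaluating the polynomial over the whole domain of B: some predicted points can fall outside the ternary triangle (a component below 0 or above 1). A minimal sketch of filtering those out before plotting is given below; the function name ternary_curve_from_fit is an assumption made for this example.

```python
import numpy as np

def ternary_curve_from_fit(coefficients, n_points=200):
    """Evaluate A = p(B) on [0, 1], derive C = 1 - (A + B),
    and keep only points that are valid compositions (hypothetical helper)."""
    p = np.poly1d(coefficients)
    b_vals = np.linspace(0.0, 1.0, n_points)
    a_pred = p(b_vals)
    c_pred = 1.0 - (a_pred + b_vals)
    # A valid ternary point has every component between 0 and 1.
    valid = (a_pred >= 0) & (a_pred <= 1) & (c_pred >= 0) & (c_pred <= 1)
    return a_pred[valid], b_vals[valid], c_pred[valid]

a = np.array([0.15, 0.15, 0.17, 0.2, 0.21, 0.24, 0.26, 0.27, 0.27, 0.29, 0.32, 0.35, 0.39, 0.4, 0.4, 0.41, 0.47, 0.48, 0.51, 0.52, 0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1, 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04, 0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14, 0.15, 0.16, 0.18, 0.2, 0.21, 0.22, 0.24, 0.25, 0.25])

# Same degree 2 fit of A on B as in the answer, then filter the curve.
a_line, b_line, c_line = ternary_curve_from_fit(np.polyfit(b, a, 2))
```

The three returned arrays can be passed as a, b and c to go.Scatterternary in place of a_pred, b_vals and c_pred. If the valid region is not contiguous, the line trace will join the separate pieces, so it may be worth plotting them as separate traces.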
I have a data set I need to augment. Therefore, I have implemented an augmentation method called magnitude warping that has two hyperparameters to tune, namely sigma and knots. To assess the quality, I train two models on the augmented data and test them on part of the real data. To compare the accuracy, I also trained the models on only the real data. Let's assume the following Python code:
# test accuracy trained on real data only
ref_dt_accuracy = 0.86
ref_lstm_accuracy = 0.85
# test accuracy for each pair of hyperparameters
sigma = [0.2, 0.35, 0.5, 0.65]
knots = [4,5,6,7]
dt_accuracy_mw = [
[0.82, 0.85, 0.83, 0.84],
[0.8, 0.79, 0.81, 0.79],
[0.78,0.77, 0.74, 0.76],
[0.74, 0.72, 0.78, 0.70]
]
lstm_accuracy_mw = [
[0.80, 0.83, 0.81, 0.82],
[0.78, 0.77, 0.79, 0.77],
[0.76,0.75, 0.72, 0.74],
[0.72, 0.7, 0.76, 0.68]
]
Now, I want to plot two (three if the last option is possible) matrices:
Plotting dt_accuracy_mw and lstm_accuracy_mw such that each sigma and knots are visualized:
sigma/knots 4 5 6 7
0.2
0.35 Actual matrix consisting of aforementioned accuracies
0.5
0.65
A combined version of the above such that each entry has the form dt_accuracy (ref_dt_accuracy - dt_accuracy) / lstm_accuracy (ref_lstm_accuracy - lstm_accuracy). That is, each entry shows the dt_accuracy with its difference to the reference, and the same for the lstm_accuracy. The scores of the two models are separated by the /.
How would one achieve this using open source libraries such as matplotlib, seaborn, etc.?
You can create a Seaborn heatmap as follows:
from matplotlib import pyplot as plt
import seaborn as sns
sigma = [0.2, 0.35, 0.5, 0.65]
knots = [4, 5, 6, 7]
dt_accuracy_mw = [[0.82, 0.85, 0.83, 0.84],
[0.8, 0.79, 0.81, 0.79],
[0.78, 0.77, 0.74, 0.76],
[0.74, 0.72, 0.78, 0.70]]
ax = sns.heatmap(data=dt_accuracy_mw, xticklabels=knots, yticklabels=sigma,
linewidths=1, linecolor='blue', clip_on=False, annot=True, cbar=False,
cmap=sns.color_palette(['white'], as_cmap=True))
ax.set_xlabel('knots')
ax.set_ylabel('sigma')
plt.tight_layout()
plt.show()
If I understand the second question correctly, a matrix of annotations would do the job (the data can be anything with the correct width and height):
from matplotlib import pyplot as plt
import seaborn as sns
ref_dt_accuracy = 0.86
ref_lstm_accuracy = 0.85
sigma = [0.2, 0.35, 0.5, 0.65]
knots = [4, 5, 6, 7]
dt_accuracy_mw = [[0.82, 0.85, 0.83, 0.84],
[0.8, 0.79, 0.81, 0.79],
[0.78, 0.77, 0.74, 0.76],
[0.74, 0.72, 0.78, 0.70]]
lstm_accuracy_mw = [[0.80, 0.83, 0.81, 0.82],
[0.78, 0.77, 0.79, 0.77],
[0.76, 0.75, 0.72, 0.74],
[0.72, 0.7, 0.76, 0.68]]
annot_matrix = [[f'{ref_dt_accuracy - dt:.2f} / {ref_lstm_accuracy - lstm:.2f}'
for dt, lstm in zip(dt_row, lstm_row)]
for dt_row, lstm_row in zip(dt_accuracy_mw, lstm_accuracy_mw)]
ax = sns.heatmap(data=dt_accuracy_mw, xticklabels=knots, yticklabels=sigma,
annot=annot_matrix, fmt='',
linewidths=2, linecolor='crimson', clip_on=False, cbar=False,
cmap=sns.color_palette(['aliceblue'], as_cmap=True))
ax.set_xlabel('knots')
ax.set_ylabel('sigma')
plt.tight_layout()
plt.show()
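The annotations above show only the differences to the reference. If each cell should also carry the accuracy itself, as the question describes, the strings could be built along these lines. This is a sketch; the helper name combined_annotations and the exact cell layout are assumptions.

```python
ref_dt_accuracy = 0.86
ref_lstm_accuracy = 0.85
dt_accuracy_mw = [[0.82, 0.85, 0.83, 0.84],
                  [0.8, 0.79, 0.81, 0.79],
                  [0.78, 0.77, 0.74, 0.76],
                  [0.74, 0.72, 0.78, 0.70]]
lstm_accuracy_mw = [[0.80, 0.83, 0.81, 0.82],
                    [0.78, 0.77, 0.79, 0.77],
                    [0.76, 0.75, 0.72, 0.74],
                    [0.72, 0.7, 0.76, 0.68]]

def combined_annotations(dt_rows, lstm_rows, ref_dt, ref_lstm):
    """Build 'dt (ref - dt) / lstm (ref - lstm)' strings (hypothetical helper)."""
    return [[f'{dt:.2f} ({ref_dt - dt:.2f}) / {lstm:.2f} ({ref_lstm - lstm:.2f})'
             for dt, lstm in zip(dt_row, lstm_row)]
            for dt_row, lstm_row in zip(dt_rows, lstm_rows)]

annot = combined_annotations(dt_accuracy_mw, lstm_accuracy_mw,
                             ref_dt_accuracy, ref_lstm_accuracy)
print(annot[0][0])
```

The resulting matrix can be passed as annot=annot together with fmt='' in the sns.heatmap call shown above; the longer strings may need a wider figure or a smaller font.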
The input to the Kmeans in both the Matlab and Python environments is the following list:
input = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74, 0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90, 0.64]
Matlab:
[idx, C] = kmeans(input',3,'Start',[0.3;0.9;1.5]);
Output
C = [0.596, 0.825, 1.035]
(idx==1) = 15, (idx==2) = 6, (idx==3) = 6
Python:
import numpy as np
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, n_init=1, init=np.array([0.3,0.9,1.5]).reshape(-1,1)).fit(np.array(input).reshape(-1, 1))
idx = kmeans.labels_
C = kmeans.cluster_centers_
Output
C = [0.430, 0.969, 0.637]
(idx==0) = 2, (idx==1) = 10, (idx==2) = 15
Clearly, the output centroids and the number of input points classified in the 3 clusters are different for these environments. What is the reason behind this even when the initial centroids are the same?
I've written a minimal k-means algorithm to test your dataset in MATLAB:
input = [1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42, 0.73, 0.74, 0.65, 0.59,
0.64, 0.98, 0.89, 0.62, 0.95, 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90,
0.64];
c = [0.3;0.9;1.5]
for ii = 1:10
    [~,idx] = min(abs(c-input)); % pairwise Euclidean distance
    c = accumarray(idx.',input,[],@mean) % compute the new centroid
end
After the first iteration, the index idx, which indicates the closest centroid for each value, looks like this:
2 2 2 2 2 2 2 1 1 2...
The last centroid (1.5 here) is NEVER the closest one! So, in order to keep 3 groups, the k-means algorithm has to compute, somehow, a new value for this centroid (because the mean of an empty set is undefined). It looks like Python and MATLAB implement this step differently.
To avoid this problem, make sure that every initial centroid is the closest one for at least one element of your dataset.
You can, for example, take the first three distinct values of your dataset.
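The same check can be reproduced in Python with NumPy alone. The sketch below performs only the first assignment step (not a full k-means run) and counts how many points each initial centroid attracts; the variable is called data rather than input to avoid shadowing the Python builtin.

```python
import numpy as np

data = np.array([1.11, 0.81, 0.61, 0.62, 0.62, 1.03, 1.16, 0.44, 0.42,
                 0.73, 0.74, 0.65, 0.59, 0.64, 0.98, 0.89, 0.62, 0.95,
                 0.88, 0.60, 0.61, 0.62, 0.62, 0.64, 0.98, 0.90, 0.64])

def assignment_counts(points, centroids):
    """Number of points whose nearest centroid is each entry of centroids."""
    labels = np.abs(points[:, None] - centroids[None, :]).argmin(axis=1)
    return np.bincount(labels, minlength=len(centroids))

# Initial centroids from the question: the last one attracts no points.
counts = assignment_counts(data, np.array([0.3, 0.9, 1.5]))
print(counts)

# First three distinct values of the dataset, in their original order.
safe_init = np.array(list(dict.fromkeys(data.tolist()))[:3])
safe_counts = assignment_counts(data, safe_init)
print(safe_init, safe_counts)
```

Because every centroid in safe_init is itself a data point, each one is the nearest centroid for at least that point, so no cluster starts out empty.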
Generation of a list of many lists each with different ranges
Isc_act = [0.1, 0.2, 0.3]
I_cel = []
a = []
for i in range(0,len(Isc_act)):
    a = np.arange(0, Isc_act[i], 0.1*Isc_act[i])
    I_cel[i].append(a)
print(I_cel)
Output is:
IndexError: list index out of range
My code raises an error, but I want to get I_cel = [[0,0.01,..,0.1],[0,0.02,0.04,...,0.2],[0, 0.03, 0.06,...,0.3]]. Hence, the 'nested list' I_cel has three lists and each list has 10 values.
The simplest fix to your code, probably what you were intending to do:
import numpy as np

Isc_act = [0.1, 0.2, 0.3]
I_cel = []
for i in range(0,len(Isc_act)):
    a = np.arange(0, Isc_act[i], 0.1*Isc_act[i])
    # append to the outer list; I_cel[i] does not exist yet
    I_cel.append(a)
print(I_cel)
Note that the endpoint will be one step less than you wanted! For row zero, for example, you have to pick two of the following:
Steps of size 0.01
Start point 0.0 and end point 0.1
10 elements total
You cannot have all three.
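To make the trade-off concrete, here is a small sketch for row zero using np.linspace, where each variant satisfies two of the three requirements. The variable names are chosen for this illustration only.

```python
import numpy as np

stop = 0.1

# Step 0.01 and 10 elements: the end point 0.1 is left out.
step_and_count = np.linspace(0, stop, 10, endpoint=False)

# Step 0.01 and end point 0.1: this needs 11 elements.
step_and_end = np.linspace(0, stop, 11)

# 10 elements and end point 0.1: the step becomes 0.1 / 9.
count_and_end = np.linspace(0, stop, 10)

for row in (step_and_count, step_and_end, count_and_end):
    print(len(row), row[-1])
```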
More numpythonic approach:
>>> Isc_act = [0.1, 0.2, 0.3]
>>> (np.linspace(0, 1, 11).reshape(11,1) @ [Isc_act]).T
array([[0. , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ],
[0. , 0.02, 0.04, 0.06, 0.08, 0.1 , 0.12, 0.14, 0.16, 0.18, 0.2 ],
[0. , 0.03, 0.06, 0.09, 0.12, 0.15, 0.18, 0.21, 0.24, 0.27, 0.3 ]])
linspace gives better control of the end point when dealing with floats:
In [84]: [np.linspace(0,x,11) for x in [.1,.2,.3]]
Out[84]:
[array([0. , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ]),
array([0. , 0.02, 0.04, 0.06, 0.08, 0.1 , 0.12, 0.14, 0.16, 0.18, 0.2 ]),
array([0. , 0.03, 0.06, 0.09, 0.12, 0.15, 0.18, 0.21, 0.24, 0.27, 0.3 ])]
Or we could scale just one array (arange with integers is predictable):
In [86]: np.array([.1,.2,.3])[:,None]*np.arange(0,11)
Out[86]:
array([[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ],
[0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ],
[0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3. ]])
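Note that the integer grid above runs from 0 to 10, so the rows end at 1, 2 and 3 rather than at 0.1, 0.2 and 0.3. A minimal sketch of the same idea rescaled to the requested ranges follows; n_steps is a name introduced here, not something from the question.

```python
import numpy as np

Isc_act = np.array([0.1, 0.2, 0.3])
n_steps = 10  # number of intervals per row, giving n_steps + 1 values

# Scale an integer grid 0..n_steps by each upper limit, then normalise.
I_cel = Isc_act[:, None] * np.arange(n_steps + 1) / n_steps
print(I_cel)
```

Each row then starts at 0 and ends at the corresponding entry of Isc_act, up to floating point rounding.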
I would like to plot a histogram with a non-uniform x-axis using Matplotlib.
For example, consider the following histogram:
import matplotlib.pyplot as plt
values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
plt.hist(values)
plt.show()
The first bin has high density, so I would like to zoom in there.
Ideally, I would like to change the values in the x-axis to something like [0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1], keeping the bin widths constant within the graph (but not numerically, of course). Is there an easy way to achieve this?
Any comments or suggestions are welcome.
Passing explicit bins will solve the problem. The bins argument lists the bin edges, and each value is counted in the interval it falls into; for example, 0.28 lands in the bin between 0.2 and 0.5. The code below shows an example:
import matplotlib.pyplot as plt
values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
plt.hist(values, bins=[0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1])
plt.show()
To plot it in a more suitable way, it can be handy to use a logarithmic x axis. Note that passing log=True to plt.hist only makes the y axis logarithmic:
plt.hist(values, bins=[0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1], log=True)
Adding the following line to your code makes the x axis of the histogram logarithmic instead:
plt.xscale('log')
The solution from André is nice, but the bin widths are not constant. Working with a log2 x-axis suits what I was looking for. I use np.logspace to make the bin widths constant in the graph.
That's what I ended up doing:
import numpy as np
import matplotlib.pyplot as plt
values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
bins = np.logspace(-10, 1, 20, base=2)
bins[0]=0
fig, ax = plt.subplots()
plt.hist(values, bins=bins)
ax.set_xscale('log', base=2)  # use basex=2 on Matplotlib versions older than 3.3
ax.set_xlim(2**-10, 1)
plt.show()
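If the exact edges from the question should be kept and every bin drawn with the same width, another option is to count with np.histogram and draw the counts as equally spaced bars labelled with the edges. This is a sketch of that alternative, not the approach used above; plot_equal_width is a hypothetical helper name.

```python
import numpy as np

values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
          0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
          0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
edges = [0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1]

# Count the values per bin using the non-uniform edges.
counts, _ = np.histogram(values, bins=edges)

def plot_equal_width(counts, edges):
    """Draw one unit-width bar per bin and label the ticks with the edges."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.bar(np.arange(len(counts)), counts, width=1.0, align='edge',
           edgecolor='black')
    ax.set_xticks(np.arange(len(edges)))
    ax.set_xticklabels([str(e) for e in edges])
    return fig, ax
```

Calling plot_equal_width(counts, edges) followed by plt.show() displays the figure. The x axis is then categorical rather than numeric, so positions within a bar no longer correspond to values.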
I have two arrays X and Y; X is the base array and Y is modified in a loop. As the loop runs, I want to compare the arrays to find where Y is closest to X. As an example, here is reproducible code:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate
x = np.array([[0.12, 0.11, 0.1, 0.09, 0.08],
[0.13, 0.12, 0.11, 0.1, 0.09],
[0.15, 0.14, 0.12, 0.11, 0.1],
[0.17, 0.15, 0.14, 0.12, 0.11],
[0.19, 0.17, 0.16, 0.14, 0.12],
[0.22, 0.19, 0.17, 0.15, 0.13],
[0.24, 0.22, 0.19, 0.16, 0.14],
[0.27, 0.24, 0.21, 0.18, 0.15],
[0.29, 0.26, 0.22, 0.19, 0.16]])
y = np.array([[0.07, 0.06, 0.05, 0.04, 0.03],
[0.08, 0.07, 0.06, 0.05, 0.04],
[0.10, 0.09, 0.07, 0.06, 0.05],
[0.14, 0.12, 0.11, 0.09, 0.08],
[0.16, 0.14, 0.13, 0.11, 0.09],
[0.19, 0.16, 0.14, 0.12, 0.10],
[0.22, 0.20, 0.17, 0.14, 0.12],
[0.25, 0.22, 0.19, 0.16, 0.13],
[0.27, 0.24, 0.20, 0.17, 0.14]])
for i in range(100):
    y = y + (i / 10000)
I want to break the loop when the closest values have been found. By closest I mean, the values should be within ±10% of the original values or some other percentage. How can this be done in Python?
You can compute the Euclidean distance between the two matrices:
import numpy as np
import scipy.spatial.distance
import matplotlib.pyplot as plt
x = np.array([[0.12, 0.11, 0.1, 0.09, 0.08],
[0.13, 0.12, 0.11, 0.1, 0.09],
[0.15, 0.14, 0.12, 0.11, 0.1],
[0.17, 0.15, 0.14, 0.12, 0.11],
[0.19, 0.17, 0.16, 0.14, 0.12],
[0.22, 0.19, 0.17, 0.15, 0.13],
[0.24, 0.22, 0.19, 0.16, 0.14],
[0.27, 0.24, 0.21, 0.18, 0.15],
[0.29, 0.26, 0.22, 0.19, 0.16]])
y = np.array([[0.07, 0.06, 0.05, 0.04, 0.03],
[0.08, 0.07, 0.06, 0.05, 0.04],
[0.10, 0.09, 0.07, 0.06, 0.05],
[0.14, 0.12, 0.11, 0.09, 0.08],
[0.16, 0.14, 0.13, 0.11, 0.09],
[0.19, 0.16, 0.14, 0.12, 0.10],
[0.22, 0.20, 0.17, 0.14, 0.12],
[0.25, 0.22, 0.19, 0.16, 0.13],
[0.27, 0.24, 0.20, 0.17, 0.14]])
dists = []
for i in range(100):
    y = y + (i / 10000.)
    dists.append(scipy.spatial.distance.euclidean(x.flatten(), y.flatten()))
plt.plot(dists)
This produces a plot of how the Euclidean distance between your two matrices evolves over the iterations.
To break the loop at the minimum, you can use:
dist = np.inf
for i in range(100):
    y = y + (i / 10000.)
    d = scipy.spatial.distance.euclidean(x.flatten(), y.flatten())
    if d < dist:
        dist = d
    else:
        # the distance started growing again, so the minimum has been passed
        break
print(dist)
# 0.0838525491562 #(the minimal distance)
print(y)
#[[ 0.1051 0.0951 0.0851 0.0751 0.0651]
#[ 0.1151 0.1051 0.0951 0.0851 0.0751]
#[ 0.1351 0.1251 0.1051 0.0951 0.0851]
#[ 0.1751 0.1551 0.1451 0.1251 0.1151]
#[ 0.1951 0.1751 0.1651 0.1451 0.1251]
#[ 0.2251 0.1951 0.1751 0.1551 0.1351]
#[ 0.2551 0.2351 0.2051 0.1751 0.1551]
#[ 0.2851 0.2551 0.2251 0.1951 0.1651]
#[ 0.3051 0.2751 0.2351 0.2051 0.1751]]
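The question also mentions stopping once the values are within ±10% of the originals. A minimal sketch of such an element-wise relative test is given below; the function names and the rel parameter are assumptions for this example. With a uniform shift like the one in the loop, it is possible that no iteration brings every element within the tolerance at once, in which case the helper returns None.

```python
import numpy as np

def within_tolerance(x, y, rel=0.10):
    """Boolean mask that is True where y lies within rel * |x| of x."""
    return np.abs(y - x) <= rel * np.abs(x)

def first_iteration_within(x, y, rel=0.10, n_iter=100):
    """Apply the same update as the loop above and return (i, y) for the
    first iteration where every element is within tolerance, else (None, y)."""
    y = np.array(y, dtype=float)
    for i in range(n_iter):
        y = y + i / 10000.0
        if within_tolerance(x, y, rel).all():
            return i, y
    return None, y
```

Replacing .all() with .mean() gives the fraction of elements inside the tolerance, which can be a more forgiving stopping criterion.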