I have a data set that I need to augment. I have implemented an augmentation method called magnitude warping, which has two hyperparameters to tune: sigma and knots. To assess the quality, I train two models on the augmented data and test them on part of the real data. For comparison, I also trained the models on the real data only. Let's assume the following Python code:
# test accuracy trained on real data only
ref_dt_accuracy = 0.86
ref_lstm_accuracy = 0.85
# test accuracy for each pair of hyperparameters
sigma = [0.2, 0.35, 0.5, 0.65]
knots = [4,5,6,7]
dt_accuracy_mw = [
[0.82, 0.85, 0.83, 0.84],
[0.8, 0.79, 0.81, 0.79],
[0.78,0.77, 0.74, 0.76],
[0.74, 0.72, 0.78, 0.70]
]
lstm_accuracy_mw = [
[0.80, 0.83, 0.81, 0.82],
[0.78, 0.77, 0.79, 0.77],
[0.76,0.75, 0.72, 0.74],
[0.72, 0.7, 0.76, 0.68]
]
Now, I want to plot two (three if the last option is possible) matrices:
Plotting dt_accuracy_mw and lstm_accuracy_mw such that each sigma and knots are visualized:
sigma/knots 4 5 6 7
0.2
0.35 Actual matrix consisting of aforementioned accuracies
0.5
0.65
A combined version of the above, where each entry has the form dt_accuracy (ref_dt_accuracy - dt_accuracy) / lstm_accuracy (ref_lstm_accuracy - lstm_accuracy). That is, each entry shows the dt accuracy with its difference to the reference, then the same for the lstm accuracy, with the two models separated by a /.
How would one achieve this using open source libraries such as matplotlib or seaborn?
You can create a Seaborn heatmap as follows:
from matplotlib import pyplot as plt
import seaborn as sns
sigma = [0.2, 0.35, 0.5, 0.65]
knots = [4, 5, 6, 7]
dt_accuracy_mw = [[0.82, 0.85, 0.83, 0.84],
[0.8, 0.79, 0.81, 0.79],
[0.78, 0.77, 0.74, 0.76],
[0.74, 0.72, 0.78, 0.70]]
ax = sns.heatmap(data=dt_accuracy_mw, xticklabels=knots, yticklabels=sigma,
linewidths=1, linecolor='blue', clip_on=False, annot=True, cbar=False,
cmap=sns.color_palette(['white'], as_cmap=True))
ax.set_xlabel('knots')
ax.set_ylabel('sigma')
plt.tight_layout()
plt.show()
If I understand the second question correctly, a matrix of annotations would do the job (the data can be anything with the correct width and height):
from matplotlib import pyplot as plt
import seaborn as sns
ref_dt_accuracy = 0.86
ref_lstm_accuracy = 0.85
sigma = [0.2, 0.35, 0.5, 0.65]
knots = [4, 5, 6, 7]
dt_accuracy_mw = [[0.82, 0.85, 0.83, 0.84],
[0.8, 0.79, 0.81, 0.79],
[0.78, 0.77, 0.74, 0.76],
[0.74, 0.72, 0.78, 0.70]]
lstm_accuracy_mw = [[0.80, 0.83, 0.81, 0.82],
[0.78, 0.77, 0.79, 0.77],
[0.76, 0.75, 0.72, 0.74],
[0.72, 0.7, 0.76, 0.68]]
annot_matrix = [[f'{ref_dt_accuracy - dt:.2f} / {ref_lstm_accuracy - lstm:.2f}'
for dt, lstm in zip(dt_row, lstm_row)]
for dt_row, lstm_row in zip(dt_accuracy_mw, lstm_accuracy_mw)]
ax = sns.heatmap(data=dt_accuracy_mw, xticklabels=knots, yticklabels=sigma,
annot=annot_matrix, fmt='',
linewidths=2, linecolor='crimson', clip_on=False, cbar=False,
cmap=sns.color_palette(['aliceblue'], as_cmap=True))
ax.set_xlabel('knots')
ax.set_ylabel('sigma')
plt.tight_layout()
plt.show()
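If each cell should show the accuracy itself together with its difference to the reference, as the question describes, the annotation strings can be extended. This is a minimal sketch using only the first two rows of the data for brevity; the helper name build_annotations is made up for illustration:

```python
ref_dt_accuracy = 0.86
ref_lstm_accuracy = 0.85
dt_accuracy_mw = [[0.82, 0.85, 0.83, 0.84],
                  [0.80, 0.79, 0.81, 0.79]]
lstm_accuracy_mw = [[0.80, 0.83, 0.81, 0.82],
                    [0.78, 0.77, 0.79, 0.77]]

def build_annotations(dt_rows, lstm_rows, ref_dt, ref_lstm):
    """Return strings of the form 'dt (ref - dt) / lstm (ref - lstm)'."""
    return [[f'{dt:.2f} ({ref_dt - dt:.2f}) / {lstm:.2f} ({ref_lstm - lstm:.2f})'
             for dt, lstm in zip(dt_row, lstm_row)]
            for dt_row, lstm_row in zip(dt_rows, lstm_rows)]

annot_matrix = build_annotations(dt_accuracy_mw, lstm_accuracy_mw,
                                 ref_dt_accuracy, ref_lstm_accuracy)
print(annot_matrix[0][0])
```

The result can be passed as annot=annot_matrix with fmt='' in the same way as in the heatmap call above. The longer strings will probably need a wider figure or a smaller annotation font.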
The question was answered by @QuangHoang in the comments: use np.tile(array, (repeats, 1)), or the slower np.vstack([array]*repeats).
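To make the difference between these calls and np.repeat concrete, here is a small sketch on a 2x2 array (the tiny array is only for illustration, it is not the one from the question):

```python
import numpy as np

array = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
repeats = 3

tiled = np.tile(array, (repeats, 1))          # whole block repeated along axis 0
stacked = np.vstack([array] * repeats)        # same result, built from a Python list
repeated = np.repeat(array, repeats, axis=0)  # each row repeated in place instead

print(tiled)
print(repeated)
```

np.tile keeps the row order of the original block in every copy, whereas np.repeat groups the copies of each row together, which is the behaviour described in the question.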
I have what seems to be a simple problem. I have a 2D NumPy array that I want to replicate by stacking copies of it vertically, keeping it a 2D array, say 50x. I tried np.repeat, but the order isn't preserved: it just copies each row however many times I specify. I can't seem to get broadcast_to to work either. I'm thinking maybe I need to flatten the array with .reshape(-1), do something with it and reshape it back, but I can't find the right commands. I know I can do it with .append and a for loop, but I'm looking for a vectorized solution. Here's an example with the initial array and then what it should look like afterwards (I'm only repeating it 3x due to space):
array = np.array([(1.00, 0.80, 0.95, 0.88, 0.97, 0.85),
(0.80, 1.00, 0.87, 0.97, 0.80, 0.92),
(0.95, 0.87, 1.00, 0.85, 0.92, 0.89),
(0.88, 0.97, 0.85, 1.00, 0.85, 0.95),
(0.97, 0.80, 0.92, 0.85, 1.00, 0.88),
(0.85, 0.92, 0.89, 0.95, 0.88, 1.00)])
Repeat 3x:
array([[1. , 0.8 , 0.95, 0.88, 0.97, 0.85],
[0.8 , 1. , 0.87, 0.97, 0.8 , 0.92],
[0.95, 0.87, 1. , 0.85, 0.92, 0.89],
[0.88, 0.97, 0.85, 1. , 0.85, 0.95],
[0.97, 0.8 , 0.92, 0.85, 1. , 0.88],
[0.85, 0.92, 0.89, 0.95, 0.88, 1. ],
[1. , 0.8 , 0.95, 0.88, 0.97, 0.85],
[0.8 , 1. , 0.87, 0.97, 0.8 , 0.92],
[0.95, 0.87, 1. , 0.85, 0.92, 0.89],
[0.88, 0.97, 0.85, 1. , 0.85, 0.95],
[0.97, 0.8 , 0.92, 0.85, 1. , 0.88],
[0.85, 0.92, 0.89, 0.95, 0.88, 1. ],
[1. , 0.8 , 0.95, 0.88, 0.97, 0.85],
[0.8 , 1. , 0.87, 0.97, 0.8 , 0.92],
[0.95, 0.87, 1. , 0.85, 0.92, 0.89],
[0.88, 0.97, 0.85, 1. , 0.85, 0.95],
[0.97, 0.8 , 0.92, 0.85, 1. , 0.88],
[0.85, 0.92, 0.89, 0.95, 0.88, 1. ]])
Appreciate the help!
I'm trying to draw a regression line in ternary space with Plotly in Python, but there doesn't seem to be an option like trendline='lowess' for scatter ternaries. Is there another way to achieve the same result for ternaries? Here is code from a previous post that draws a spline line, but not a regression:
import numpy as np
import plotly.graph_objects as go
a = np.array([0.15, 0.15, 0.17, 0.2 , 0.21, 0.24, 0.26, 0.27, 0.27, 0.29, 0.32, 0.35, 0.39, 0.4 , 0.4 , 0.41, 0.47, 0.48, 0.51, 0.52, 0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1 , 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04, 0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14, 0.15, 0.16, 0.18, 0.2 , 0.21, 0.22, 0.24, 0.25, 0.25])
c = np.array([0.71, 0.7 , 0.73, 0.63, 0.62, 0.58, 0.69, 0.57, 0.56, 0.67, 0.65, 0.51, 0.48, 0.47, 0.46, 0.45, 0.4 , 0.39, 0.35, 0.34, 0.31, 0.28, 0.23, 0.18, 0.16, 0.13, 0.07, 0.02, 0.01])
fig = go.Figure()
curve_portion = np.where((b < 0.15) & (c > 0.6))
curve_other_portion = np.where(~((b < 0.15) & (c > 0.6)))
def add_plot_spline_portions(fig, indices_groupings):
    for indices in indices_groupings:
        fig.add_trace(go.Scatterternary({
            'mode': 'lines',
            'connectgaps': True,
            'a': a[indices],
            'b': b[indices],
            'c': c[indices],
            'line': {'color': 'black', 'shape': 'spline', 'smoothing': 1},
            'marker': {'size': 2, 'line': {'width': 0.1}}
        }))
add_plot_spline_portions(fig, [curve_portion, curve_other_portion])
fig.show(renderer='png')
I can outline what I think is a general sort of solution - it doesn't have as much mathematical rigor as I would like, and involves some guess and check type work - but hopefully it's helpful.
The first consideration is that for this regression on a ternary plot, there are only two degrees of freedom because A+B+C=1 (you might find this explanation helpful). This means it only makes sense to consider the relationship between two of the variables at a time. What we really want to do is create a regression between two of the variables, then determine the value of the third variable using the equation A+B+C=1.
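As a quick check of that constraint, here is a sketch using the first five points of the data from the question; it only verifies that the parts sum to 1 and that one component follows from the other two:

```python
import numpy as np

# first five points of the data set from the question
a = np.array([0.15, 0.15, 0.17, 0.20, 0.21])
b = np.array([0.14, 0.15, 0.10, 0.17, 0.17])
c = np.array([0.71, 0.70, 0.73, 0.63, 0.62])

# every point is a composition: the three parts sum to 1
totals = a + b + c

# so any one component can be recovered from the other two
c_from_ab = 1 - (a + b)

print(totals)
print(np.allclose(c_from_ab, c))
```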
The second consideration is a bit harder to define, but since you are after a regression that captures the "reversing" nature of the variable A, we want a regression where A can take on repeated values. I think the most straightforward way to achieve this is for A to be the variable you are predicting.
For simplicity's sake, let's say we use a degree 2 polynomial regression that predicts A from either B or C. We can make a scatter plot and choose whichever polynomial fits better for our purposes.
Here is a quick EDA:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
a = np.array([0.15, 0.15, 0.17, 0.2 , 0.21, 0.24, 0.26, 0.27, 0.27, 0.29, 0.32, 0.35, 0.39, 0.4 , 0.4 , 0.41, 0.47, 0.48, 0.51, 0.52, 0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1 , 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04, 0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14, 0.15, 0.16, 0.18, 0.2 , 0.21, 0.22, 0.24, 0.25, 0.25])
c = np.array([0.71, 0.7 , 0.73, 0.63, 0.62, 0.58, 0.69, 0.57, 0.56, 0.67, 0.65, 0.51, 0.48, 0.47, 0.46, 0.45, 0.4 , 0.39, 0.35, 0.34, 0.31, 0.28, 0.23, 0.18, 0.16, 0.13, 0.07, 0.02, 0.01])
## eda to determine polynomial of best fit to predict A
fig_eda = make_subplots(rows=1, cols=2)
fig_eda.add_trace(go.Scatter(x=b, y=a, mode='markers'),row=1, col=1)
coefficients = np.polyfit(b,a,2)
p = np.poly1d(coefficients)
b_vals = np.linspace(min(b),max(b))
a_pred = np.array([p(x) for x in b_vals])
fig_eda.add_trace(go.Scatter(x=b_vals, y=a_pred, mode='lines'),row=1, col=1)
fig_eda.add_trace(go.Scatter(x=c, y=a, mode='markers'),row=1, col=2)
coefficients = np.polyfit(c,a,2)
p = np.poly1d(coefficients)
c_vals = np.linspace(min(c),max(c))
a_pred = np.array([p(x) for x in c_vals])
fig_eda.add_trace(go.Scatter(x=c_vals, y=a_pred, mode='lines'),row=1, col=2)
Notice how predicting A from B looks like it captures the reversing nature of A better than predicting A from C. If we fit a degree 2 polynomial to predict A from C, A is not going to repeat within the domain of C, [0, 1], because that polynomial has a very shallow slope.
So let's proceed with B as the predictor variable and A as the predicted variable (C is then also a predicted variable, using C = 1 - (A + B)).
fig = go.Figure()
fig.add_trace(go.Scatterternary({
'mode': 'markers',
'connectgaps': True,
'a': a,
'b': b,
'c': c
}))
## since A+B+C = 1, we only need to fit a polynomial between two of the variables
## fit an n-degree polynomial to 2 of your variables
## source https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
coefficients = np.polyfit(b,a,2)
p = np.poly1d(coefficients)
## we use the entire domain of the input variable B
b_vals = np.linspace(0,1)
a_pred = np.array([p(x) for x in b_vals])
c_pred = 1 - (b_vals + a_pred)
fig.add_trace(go.Scatterternary({
'mode': 'lines',
'connectgaps': True,
'a': a_pred,
'b': b_vals,
'c': c_pred,
'marker': {'size': 2, 'color':'red', 'line': {'width': 0.1}}
}))
fig.show()
This is the lowest-degree polynomial regression that allows for repeated values of A (a linear regression to predict A wouldn't allow A to take on repeated values). However, you can definitely experiment with increasing the degree of the polynomial, and with predicting A from either B or C.
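One possible way to run that experiment is to compare the sum of squared errors for a few degrees when predicting A from B. This is only a sketch: the helper name sum_squared_error is made up here, and the error is measured on the same points used for fitting, so it can only go down as the degree increases. A lower number does not by itself mean a better-looking curve on the ternary plot.

```python
import numpy as np

a = np.array([0.15, 0.15, 0.17, 0.2 , 0.21, 0.24, 0.26, 0.27, 0.27, 0.29,
              0.32, 0.35, 0.39, 0.4 , 0.4 , 0.41, 0.47, 0.48, 0.51, 0.52,
              0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1 , 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04,
              0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14,
              0.15, 0.16, 0.18, 0.2 , 0.21, 0.22, 0.24, 0.25, 0.25])

def sum_squared_error(degree):
    """Fit A as a polynomial in B and return the in-sample squared error."""
    coefficients = np.polyfit(b, a, degree)
    residuals = a - np.polyval(coefficients, b)
    return float(np.sum(residuals ** 2))

errors = {degree: sum_squared_error(degree) for degree in (1, 2, 3)}
for degree, error in errors.items():
    print(f'degree {degree}: SSE = {error:.4f}')
```

Swapping b for the C values gives the corresponding comparison for predicting A from C.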
I have the following set of 15 data points:
[0.287 , 0.0691, 0.856, 0.731, 0.895, 0.76, 0.496, 0.749, 0.77, 0.684, 0.667, 0.386, 0.4, 0.334, 0.346]
And I would like to reorder these data points so as to minimize the error with respect to the following set of 15 data points:
[0.1, 0.3, 0.5, 0.7, 0.9, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1]
I could just try all permutations of the first set of data points and see which one gives the smallest error but that would take forever...
I'm assuming by error you mean the summed absolute difference. It is not difficult to check that this error is minimized when a and b have the same rank order. The best reordering of a can thus be obtained using argsort:
>>> import numpy as np
>>> a = np.array([0.287 , 0.0691, 0.856 , 0.731 , 0.895 , 0.76 , 0.496 , 0.749 , 0.77 , 0.684 , 0.667 , 0.386 , 0.4 , 0.334 , 0.346 ])
>>> b = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1])
>>>
>>> best_shuffle = np.empty(a.size,int)
>>> best_shuffle[b.argsort(kind="stable")] = a.argsort(kind="stable")
>>>
>>> np.abs(b-a[best_shuffle]).sum()
1.3499000000000005
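The rank-order claim can be checked by brute force on a small case. The sketch below uses only the first six values of each list so that all 720 permutations can be enumerated; the function name rank_match is made up for illustration:

```python
from itertools import permutations

import numpy as np

def rank_match(a, b):
    """Reorder a so that it has the same rank order as b."""
    order = np.empty(a.size, dtype=int)
    order[b.argsort(kind="stable")] = a.argsort(kind="stable")
    return a[order]

a = np.array([0.287, 0.0691, 0.856, 0.731, 0.895, 0.76])
b = np.array([0.1, 0.3, 0.5, 0.7, 0.9, 0.9])

matched = rank_match(a, b)
matched_error = np.abs(b - matched).sum()

# exhaustive search over every ordering of a (6! = 720 candidates)
brute_error = min(np.abs(b - np.array(p)).sum() for p in permutations(a))

print(matched_error, brute_error)
```

Both errors agree on this subset, which is consistent with the argument above. The exhaustive search is only feasible for very short arrays.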
I would like to plot a histogram with a non-uniform x-axis using Matplotlib.
For example, consider the following histogram:
import matplotlib.pyplot as plt
values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
plt.hist(values)
plt.show()
The first bin has high density, so I would like to zoom in there.
Ideally, I would like to change the values in the x-axis to something like [0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1], keeping the bin widths constant within the graph (but not numerically, of course). Is there an easy way to achieve this?
Any comments or suggestions are welcome.
Passing bins will solve the problem. The bins argument lists the bin edges, and each value is counted in the interval it falls into; for example, 0.28 is counted in the bin from 0.2 to 0.5. The code below shows an example:
import matplotlib.pyplot as plt
values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
plt.hist(values, bins=[0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1])
plt.show()
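To see how the values are distributed over those edges without drawing anything, the counts can be computed with np.histogram, which is what plt.hist uses for binning. A small sketch:

```python
import numpy as np

values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
          0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
          0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
edges = [0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1]

counts, _ = np.histogram(values, bins=edges)

# bins are half-open [left, right), except the last one, which includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left} to {right}: {count}')

# index of the bin that 0.28 lands in
bin_index = int(np.digitize(0.28, edges)) - 1
print(edges[bin_index], edges[bin_index + 1])
```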
To plot it in a more suitable way, it can be handy to use a logarithmic scale. Note that
plt.hist(values, bins=[0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1], log=True)
puts the log scale on the y axis. Adding the following line to your code instead gives the histogram a logarithmic x axis:
plt.xscale('log')
The solution from André is nice, but the bin widths are not constant. Working with a log2 x-axis suits what I was looking for. I use np.logspace to make the bin widths constant in the graph.
That's what I ended up doing:
import matplotlib.pyplot as plt
import numpy as np
values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
bins = np.logspace(-10, 1, 20, base=2)
bins[0]=0
fig, ax = plt.subplots()
plt.hist(values, bins=bins)
ax.set_xscale('log', base=2)  # Matplotlib >= 3.3; older versions used basex=2
ax.set_xlim(2**-10, 1)
plt.show()
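The reason np.logspace gives bars of equal drawn width is that consecutive edges have a constant ratio, so their spacing in log2 units is constant. A short check of that, using the same logspace call as above but before the first edge is overwritten with 0:

```python
import numpy as np

bins = np.logspace(-10, 1, 20, base=2)   # 20 edges from 2**-10 to 2**1

# on a log2 axis the drawn width of a bin is log2(right) - log2(left)
log_widths = np.diff(np.log2(bins))
print(np.allclose(log_widths, log_widths[0]))

# in data units the widths are far from constant
print(bins[1] - bins[0], bins[-1] - bins[-2])
```

Once bins[0] is set to 0, the first bin no longer has a finite width on the log axis; the set_xlim call in the code above cuts it off at 2**-10, so it is drawn like the others.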