How to compare two arrays and find the optimal match in Python?

I have two arrays, X and Y. X is the base array and Y is modified in a loop. As the loop runs, I want to compare the arrays to find where Y is closest to X. Here is a reproducible example:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
from scipy import interpolate
x = np.array([[0.12, 0.11, 0.1, 0.09, 0.08],
[0.13, 0.12, 0.11, 0.1, 0.09],
[0.15, 0.14, 0.12, 0.11, 0.1],
[0.17, 0.15, 0.14, 0.12, 0.11],
[0.19, 0.17, 0.16, 0.14, 0.12],
[0.22, 0.19, 0.17, 0.15, 0.13],
[0.24, 0.22, 0.19, 0.16, 0.14],
[0.27, 0.24, 0.21, 0.18, 0.15],
[0.29, 0.26, 0.22, 0.19, 0.16]])
y = np.array([[0.07, 0.06, 0.05, 0.04, 0.03],
[0.08, 0.07, 0.06, 0.05, 0.04],
[0.10, 0.09, 0.07, 0.06, 0.05],
[0.14, 0.12, 0.11, 0.09, 0.08],
[0.16, 0.14, 0.13, 0.11, 0.09],
[0.19, 0.16, 0.14, 0.12, 0.10],
[0.22, 0.20, 0.17, 0.14, 0.12],
[0.25, 0.22, 0.19, 0.16, 0.13],
[0.27, 0.24, 0.20, 0.17, 0.14]])
for i in range(100):
    y = y + (i / 10000)
I want to break the loop when the closest values have been found. By closest I mean, the values should be within ±10% of the original values or some other percentage. How can this be done in Python?

You can compute the Euclidean distance between the two matrices:
import numpy as np
import scipy.spatial.distance
import matplotlib.pyplot as plt
x = np.array([[0.12, 0.11, 0.1, 0.09, 0.08],
[0.13, 0.12, 0.11, 0.1, 0.09],
[0.15, 0.14, 0.12, 0.11, 0.1],
[0.17, 0.15, 0.14, 0.12, 0.11],
[0.19, 0.17, 0.16, 0.14, 0.12],
[0.22, 0.19, 0.17, 0.15, 0.13],
[0.24, 0.22, 0.19, 0.16, 0.14],
[0.27, 0.24, 0.21, 0.18, 0.15],
[0.29, 0.26, 0.22, 0.19, 0.16]])
y = np.array([[0.07, 0.06, 0.05, 0.04, 0.03],
[0.08, 0.07, 0.06, 0.05, 0.04],
[0.10, 0.09, 0.07, 0.06, 0.05],
[0.14, 0.12, 0.11, 0.09, 0.08],
[0.16, 0.14, 0.13, 0.11, 0.09],
[0.19, 0.16, 0.14, 0.12, 0.10],
[0.22, 0.20, 0.17, 0.14, 0.12],
[0.25, 0.22, 0.19, 0.16, 0.13],
[0.27, 0.24, 0.20, 0.17, 0.14]])
dists = []
for i in range(100):
    y = y + (i / 10000.)
    dists.append(scipy.spatial.distance.euclidean(x.flatten(), y.flatten()))
plt.plot(dists)
plt.show()
This plots the evolution of the Euclidean distance between your two matrices over the iterations.
To break the loop at the minimum, you can use:
dist = np.inf
for i in range(100):
    y = y + (i / 10000.)
    d = scipy.spatial.distance.euclidean(x.flatten(), y.flatten())
    if d < dist:
        dist = d
    else:
        # y was already updated in this iteration, so it is one step past the minimum
        break
print(dist)
# 0.0838525491562 #(the minimal distance)
print(y)
#[[ 0.1051 0.0951 0.0851 0.0751 0.0651]
#[ 0.1151 0.1051 0.0951 0.0851 0.0751]
#[ 0.1351 0.1251 0.1051 0.0951 0.0851]
#[ 0.1751 0.1551 0.1451 0.1251 0.1151]
#[ 0.1951 0.1751 0.1651 0.1451 0.1251]
#[ 0.2251 0.1951 0.1751 0.1551 0.1351]
#[ 0.2551 0.2351 0.2051 0.1751 0.1551]
#[ 0.2851 0.2551 0.2251 0.1951 0.1651]
#[ 0.3051 0.2751 0.2351 0.2051 0.1751]]
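The question also asks for stopping once the values are within ±10% of the base array, which the distance minimum does not check directly. A minimal sketch of such a test follows; the helper names `within_tolerance` and `first_step_within_tolerance` and the small demo arrays are illustrative assumptions, not part of the original answer.

```python
import numpy as np


def within_tolerance(base, candidate, rel_tol=0.10):
    """Return True when every candidate value is within rel_tol of base."""
    return bool(np.all(np.abs(candidate - base) <= rel_tol * np.abs(base)))


def first_step_within_tolerance(base, start, rel_tol=0.10, max_steps=100):
    """Apply the question's update rule (y = y + i / 10000) and return
    (step, y) at the first iteration where y is within tolerance of base,
    or (None, y) if that never happens within max_steps."""
    y = np.array(start, dtype=float)
    for i in range(max_steps):
        y = y + i / 10000
        if within_tolerance(base, y, rel_tol):
            return i, y
    return None, y


# Small illustrative arrays (hypothetical, not the data from the question).
x_demo = np.array([0.10, 0.20])
y_demo = np.array([0.05, 0.15])
step, y_match = first_step_within_tolerance(x_demo, y_demo)
print(step, y_match)
```

With the full arrays from the question, no uniform shift brings every element within 10% at the same time (one element needs a shift of at least 0.042 while another allows at most 0.041), so this function would return None there. A looser tolerance, or counting the fraction of elements within tolerance, may be more useful in that case.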

Related

Plotly python regression in ternary space

I'm trying to draw a regression line in Plotly (Python) in ternary space, but there doesn't seem to be an option like trendline='loess' for ternary scatter plots. Is there another way to achieve the same result for ternaries? The code below, from a previous post, draws a spline line but not a regression.
import numpy as np
import plotly.graph_objects as go
a = np.array([0.15, 0.15, 0.17, 0.2 , 0.21, 0.24, 0.26, 0.27, 0.27, 0.29, 0.32, 0.35, 0.39, 0.4 , 0.4 , 0.41, 0.47, 0.48, 0.51, 0.52, 0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1 , 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04, 0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14, 0.15, 0.16, 0.18, 0.2 , 0.21, 0.22, 0.24, 0.25, 0.25])
c = np.array([0.71, 0.7 , 0.73, 0.63, 0.62, 0.58, 0.69, 0.57, 0.56, 0.67, 0.65, 0.51, 0.48, 0.47, 0.46, 0.45, 0.4 , 0.39, 0.35, 0.34, 0.31, 0.28, 0.23, 0.18, 0.16, 0.13, 0.07, 0.02, 0.01])
fig = go.Figure()
curve_portion = np.where((b < 0.15) & (c > 0.6))
curve_other_portion = np.where(~((b < 0.15) & (c > 0.6)))
def add_plot_spline_portions(fig, indices_groupings):
    for indices in indices_groupings:
        fig.add_trace(go.Scatterternary({
            'mode': 'lines',
            'connectgaps': True,
            'a': a[indices],
            'b': b[indices],
            'c': c[indices],
            'line': {'color': 'black', 'shape': 'spline', 'smoothing': 1},
            'marker': {'size': 2, 'line': {'width': 0.1}}
        })
        )
add_plot_spline_portions(fig, [curve_portion, curve_other_portion])
fig.show(renderer='png')

I can outline what I think is a general sort of solution. It doesn't have as much mathematical rigor as I would like and involves some guess-and-check work, but hopefully it's helpful.
The first consideration is that for this regression on a ternary plot, there are only two degrees of freedom because A+B+C=1 (you might find this explanation helpful). This means it only makes sense to consider the relationship between two of the variables at a time. What we really want to do is create a regression between two of the variables, then determine the value of the third variable using the equation A+B+C=1.
The second consideration is a bit harder to define, but since you are after a regression that captures the "reversing" nature of the variable A, we want a regression where A can take on repeated values. I think the most straightforward way to achieve this is for A to be the variable you are predicting.
For simplicity's sake, let's say we use a degree 2 polynomial regression that predicts A from either B or C. We can make a scatter plot for each pair and choose whichever polynomial fits better for our purposes.
Here is a quick EDA:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
a = np.array([0.15, 0.15, 0.17, 0.2 , 0.21, 0.24, 0.26, 0.27, 0.27, 0.29, 0.32, 0.35, 0.39, 0.4 , 0.4 , 0.41, 0.47, 0.48, 0.51, 0.52, 0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1 , 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04, 0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14, 0.15, 0.16, 0.18, 0.2 , 0.21, 0.22, 0.24, 0.25, 0.25])
c = np.array([0.71, 0.7 , 0.73, 0.63, 0.62, 0.58, 0.69, 0.57, 0.56, 0.67, 0.65, 0.51, 0.48, 0.47, 0.46, 0.45, 0.4 , 0.39, 0.35, 0.34, 0.31, 0.28, 0.23, 0.18, 0.16, 0.13, 0.07, 0.02, 0.01])
## eda to determine polynomial of best fit to predict A
fig_eda = make_subplots(rows=1, cols=2)
fig_eda.add_trace(go.Scatter(x=b, y=a, mode='markers'),row=1, col=1)
coefficients = np.polyfit(b,a,2)
p = np.poly1d(coefficients)
b_vals = np.linspace(min(b),max(b))
a_pred = np.array([p(x) for x in b_vals])
fig_eda.add_trace(go.Scatter(x=b_vals, y=a_pred, mode='lines'),row=1, col=1)
fig_eda.add_trace(go.Scatter(x=c, y=a, mode='markers'),row=1, col=2)
coefficients = np.polyfit(c,a,2)
p = np.poly1d(coefficients)
c_vals = np.linspace(min(c),max(c))
a_pred = np.array([p(x) for x in c_vals])
fig_eda.add_trace(go.Scatter(x=c_vals, y=a_pred, mode='lines'),row=1, col=2)
Notice how predicting A from B looks like it captures the reversing nature of A better than predicting A from C. If we fit a degree 2 polynomial to predict A from C, A is not going to repeat within the domain of C, [0, 1], because that polynomial curves so gently.
So let's proceed with the regression that uses B as the predictor variable and A as the predicted variable (with C then obtained from C = 1 - (A + B)).
fig = go.Figure()
fig.add_trace(go.Scatterternary({
'mode': 'markers',
'connectgaps': True,
'a': a,
'b': b,
'c': c
}))
## since A+B+C = 1, we only need to fit a polynomial between two of the variables
## fit an n-degree polynomial to 2 of your variables
## source https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html
coefficients = np.polyfit(b,a,2)
p = np.poly1d(coefficients)
## we use the entire domain of the input variable B
b_vals = np.linspace(0,1)
a_pred = np.array([p(x) for x in b_vals])
c_pred = 1 - (b_vals + a_pred)
fig.add_trace(go.Scatterternary({
'mode': 'lines',
'connectgaps': True,
'a': a_pred,
'b': b_vals,
'c': c_pred,
'marker': {'size': 2, 'color':'red', 'line': {'width': 0.1}}
}))
fig.show()
This is the lowest-degree polynomial regression that allows for repeated values of A (a linear regression to predict A wouldn't allow A to take on repeated values). However, you can definitely experiment with increasing the degree of the polynomial, and with predicting A from either B or C.
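Because the fitted polynomial is evaluated over the whole range of B, some predicted points may fall outside the ternary triangle (negative A or C). The sketch below repeats the fit without plotting and keeps only valid compositions. The function name `fit_ternary_curve` and the decision to drop out-of-triangle points are assumptions made for illustration, not part of the answer above.

```python
import numpy as np

a = np.array([0.15, 0.15, 0.17, 0.2, 0.21, 0.24, 0.26, 0.27, 0.27, 0.29,
              0.32, 0.35, 0.39, 0.4, 0.4, 0.41, 0.47, 0.48, 0.51, 0.52,
              0.54, 0.56, 0.59, 0.62, 0.63, 0.65, 0.69, 0.73, 0.74])
b = np.array([0.14, 0.15, 0.1, 0.17, 0.17, 0.18, 0.05, 0.16, 0.17, 0.04,
              0.03, 0.14, 0.13, 0.13, 0.14, 0.14, 0.13, 0.13, 0.14, 0.14,
              0.15, 0.16, 0.18, 0.2, 0.21, 0.22, 0.24, 0.25, 0.25])


def fit_ternary_curve(a, b, degree=2, num=50):
    """Fit A as a polynomial in B, derive C from A + B + C = 1,
    and keep only points that lie inside the ternary triangle."""
    coefficients = np.polyfit(b, a, degree)
    b_vals = np.linspace(0.0, 1.0, num)
    a_pred = np.polyval(coefficients, b_vals)
    c_pred = 1.0 - (a_pred + b_vals)
    inside = (a_pred >= 0) & (c_pred >= 0)  # b_vals is already in [0, 1]
    return a_pred[inside], b_vals[inside], c_pred[inside]


a_fit, b_fit, c_fit = fit_ternary_curve(a, b)
print(len(a_fit), "of 50 evaluated points lie inside the triangle")
```

The three returned arrays can be passed as the a, b and c values of the go.Scatterternary line trace in place of a_pred, b_vals and c_pred.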

why is numpy.fromstring reading numbers wrong?

I am writing code that uses numpy.fromstring to read arrays from xml element text.
It runs without error, but what it reads is very strange.
For example:
import numpy as np
nr = 24
r_string = '''
0.0000 0.0100 0.0200 0.0300 0.0400 0.0500 0.0600 0.0700
0.0800 0.0900 0.1000 0.1100 0.1200 0.1300 0.1400 0.1500
0.1600 0.1700 0.1800 0.1900 0.2000 0.2100 0.2200 0.2300
'''
r = np.fromstring(r_string, count = nr)
print(r)
This prints the following garbage:
[1.20737375e-153 1.48440234e-076 1.30354286e-076 6.96312257e-077
6.01356142e-154 1.20737830e-153 1.82984908e-076 1.30354286e-076
6.96312257e-077 3.22522589e-086 6.01347037e-154 6.03686893e-154
1.39804459e-076 9.72377416e-072 3.24245662e-086 6.01347037e-154
6.03686880e-154 1.39939399e-076 1.79371973e-052 1.91654811e-076
8.54289848e-072 6.96312257e-077 6.01356142e-154 1.20738399e-153]
What is going on here?
I will appreciate help here.

You need to pass sep=' ':
>>> r = np.fromstring(r_string, count = nr, sep=' ')
>>> r
array([0. , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21,
0.22, 0.23])
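The garbage values appear because, with no sep given, the string is treated as raw binary data and its bytes are reinterpreted as float64 values instead of being parsed as text. As an alternative sketch that avoids np.fromstring entirely, the text can be split on whitespace and converted. The shortened r_string below is an illustrative assumption, not the full data from the question.

```python
import numpy as np

# Shortened sample in the same layout as the question's XML element text.
r_string = '''
0.0000 0.0100 0.0200 0.0300
0.0400 0.0500 0.0600 0.0700
'''

# str.split() with no argument splits on any run of whitespace,
# including the newlines between rows.
r = np.array(r_string.split(), dtype=float)
print(r)
```

This form also makes it easy to check the number of parsed values against the expected count (nr in the question) before using the array.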

Running np.fromstring without sep specified actually emits the warning:
DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs.
You need to specify your separator, like:
np.fromstring(r_string, sep="\t")
Output:
array([0. , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ,
0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21,
0.22, 0.23])

Generation of a nested list with different ranges

Generation of a list of many lists each with different ranges
Isc_act = [0.1, 0.2, 0.3]
I_cel = []
a = []
for i in range(0,len(Isc_act)):
    a = np.arange(0, Isc_act[i], 0.1*Isc_act[i])
    I_cel[i].append(a)
print(I_cel)
Output is:
IndexError: list index out of range
My code raises an error, but I want to get I_cel = [[0,0.01,..,0.1],[0,0.02,0.04,...,0.2],[0, 0.03, 0.06,...,0.3]], that is, a nested list I_cel containing three lists with 10 values each.

The simplest fix to your code, probably what you were intending to do:
import numpy as np
Isc_act = [0.1, 0.2, 0.3]
I_cel = []
for i in range(0,len(Isc_act)):
    a = np.arange(0, Isc_act[i], 0.1*Isc_act[i])
    I_cel.append(a)  # append to the outer list; I_cel[i] does not exist yet
print(I_cel)
Note that the endpoint will be one step short of what you wanted, because np.arange excludes the stop value. For row zero, for example, you have to pick two of the below:
Steps of size 0.01
Start point 0.0 and end point 0.1
10 elements total
You can not have all three.
More numpythonic approach:
>>> Isc_act = [0.1, 0.2, 0.3]
>>> (np.linspace(0, 1, 11).reshape(11,1) @ [Isc_act]).T
array([[0. , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ],
[0. , 0.02, 0.04, 0.06, 0.08, 0.1 , 0.12, 0.14, 0.16, 0.18, 0.2 ],
[0. , 0.03, 0.06, 0.09, 0.12, 0.15, 0.18, 0.21, 0.24, 0.27, 0.3 ]])

linspace gives better control of the end point when dealing with floats:
In [84]: [np.linspace(0,x,11) for x in [.1,.2,.3]]
Out[84]:
[array([0. , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ]),
array([0. , 0.02, 0.04, 0.06, 0.08, 0.1 , 0.12, 0.14, 0.16, 0.18, 0.2 ]),
array([0. , 0.03, 0.06, 0.09, 0.12, 0.15, 0.18, 0.21, 0.24, 0.27, 0.3 ])]
Or we could scale just one array (arange with integers is predictable):
In [86]: np.array([.1,.2,.3])[:,None]*np.arange(0,11)
Out[86]:
array([[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ],
[0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ],
[0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3. ]])
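The endpoint trade-off raised in the answers above can be made explicit with np.linspace, which fixes the number of points instead of the step (np.arange with a float step can come out one element longer or shorter than expected). The sketch below shows both choices; the variable names closed and half_open are illustrative.

```python
import numpy as np

Isc_act = [0.1, 0.2, 0.3]

# 11 points per row, both endpoints included (step = stop / 10).
closed = [np.linspace(0, stop, 11) for stop in Isc_act]

# 10 points per row with the same step, stop value excluded.
half_open = [np.linspace(0, stop, 10, endpoint=False) for stop in Isc_act]

for row_closed, row_half_open in zip(closed, half_open):
    print(len(row_closed), row_closed[-1], len(row_half_open), row_half_open[-1])
```

Calling .tolist() on each row gives plain nested Python lists if the result should not contain NumPy arrays.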

Non-uniform axis in matplotlib histogram

I would like to plot a histogram with a non-uniform x-axis using Matplotlib.
For example, consider the following histogram:
import matplotlib.pyplot as plt
values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
plt.hist(values)
plt.show()
The first bin has high density, so I would like to zoom in there.
Ideally, I would like to change the values in the x-axis to something like [0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1], keeping the bin widths constant within the graph (but not numerically, of course). Is there an easy way to achieve this?
Any comments or suggestions are welcome.

Passing bins will solve the problem. The bins argument lists the bin edges, and each value is counted in the interval it falls into; for example, 0.28 falls into the bin from 0.2 to 0.5. The code below shows an example:
import matplotlib.pyplot as plt
values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
plt.hist(values, bins=[0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1])
plt.show()
To plot it in a more suitable way, it can be handy to use a logarithmic scale. Passing log=True:
plt.hist(values, bins=[0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1], log=True)
puts the y axis on a log scale. Adding the following line to your code makes the x axis of your histogram logarithmic:
plt.xscale('log')

The solution from André is nice, but the bin widths are not constant in the graph. Working with a log2 x-axis suits what I was looking for, and I use np.logspace so that the bins have constant width on that axis.
This is what I ended up doing:
import numpy as np
import matplotlib.pyplot as plt
values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
bins = np.logspace(-10, 1, 20, base=2)
bins[0] = 0  # extend the first bin down to zero so the zero values are counted
fig, ax = plt.subplots()
ax.hist(values, bins=bins)
ax.set_xscale('log', base=2)  # Matplotlib >= 3.3; older versions used basex=2
ax.set_xlim(2**-10, 1)
plt.show()
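Another way to get bars of equal width for the exact edges listed in the question is to count with np.histogram and draw the counts as a bar chart at evenly spaced positions, labelling the ticks with the real edges. This is a sketch of that alternative, not something proposed in the answers above; the function name plot_equal_width is hypothetical.

```python
import numpy as np

values = [0.68, 0.28, 0.31, 0.5, 0.25, 0.5, 0.002, 0.13, 0.002, 0.2, 0.3, 0.45,
          0.56, 0.53, 0.001, 0.44, 0.008, 0.26, 0., 0.37, 0.03, 0.002, 0.19, 0.18,
          0.04, 0.31, 0.006, 0.6, 0.19, 0.3, 0., 0.46, 0.2, 0.004, 0.06, 0.]
edges = [0, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1]

# Count values per bin using the non-uniform edges.
counts, _ = np.histogram(values, bins=edges)


def plot_equal_width(counts, edges):
    """Draw one unit-width bar per bin and label the ticks with the real edges."""
    import matplotlib.pyplot as plt  # lazy import: counting needs only NumPy

    fig, ax = plt.subplots()
    ax.bar(np.arange(len(counts)), counts, width=1.0, align='edge', edgecolor='black')
    ax.set_xticks(np.arange(len(edges)))
    ax.set_xticklabels([str(edge) for edge in edges])
    ax.set_xlabel('value')
    ax.set_ylabel('count')
    return fig, ax
```

Calling plot_equal_width(counts, edges) followed by plt.show() displays the chart. Note that the x axis is then purely categorical, so positions between ticks no longer correspond to numeric values.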

sorting of 2d array min to max in tensorflow

I have an array
x1 = tf.Variable([[0.51, 0.52, 0.53, 0.94, 0.35],
[0.32, 0.72, 0.83, 0.74, 0.55],
[0.23, 0.72, 0.63, 0.64, 0.35],
[0.11, 0.02, 0.03, 0.14, 0.15],
[0.01, 0.72, 0.73, 0.04, 0.75]], dtype=tf.float32)
I want to sort the elements in each row from min to max. Is there a function for doing this?
In the example here they use tf.nn.top_k on a 2D array; using this, I can loop over the rows to sort from max to min.
def sort(instance):
    matrix = []
    rows = tf.shape(instance)[0]
    col = tf.shape(instance)[1]
    for i in range(rows.eval()):
        matrix.append([tf.gather(instance[i], tf.nn.top_k(instance[i], k=col.eval()).indices)])
    return matrix
Is there anything similar for sorting from min to max, or a way to reverse the array in each row?

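As suggested by @Yaroslav, you can just use the top_k values.
import tensorflow as tf  # TensorFlow 1.x session API, as in the question
a = tf.Variable([[0.51, 0.52, 0.53, 0.94, 0.35],
[0.32, 0.72, 0.83, 0.74, 0.55],
[0.23, 0.72, 0.63, 0.64, 0.35],
[0.11, 0.02, 0.03, 0.14, 0.15],
[0.01, 0.72, 0.73, 0.04, 0.75]], dtype=tf.float32)
row_size = a.get_shape().as_list()[-1]
# top_k returns values in descending order, so negate the input and the result
top_k = tf.nn.top_k(-a, k=row_size)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(-top_k.values)
This prints for me: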
array([[ 0.34999999, 0.50999999, 0.51999998, 0.52999997, 0.94 ],
[ 0.31999999, 0.55000001, 0.72000003, 0.74000001, 0.82999998],
[ 0.23 , 0.34999999, 0.63 , 0.63999999, 0.72000003],
[ 0.02 , 0.03 , 0.11 , 0.14 , 0.15000001],
[ 0.01 , 0.04 , 0.72000003, 0.73000002, 0.75 ]], dtype=float32)
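The negation trick works because a descending sort of -a, negated again, is an ascending sort of a. The sketch below checks this with NumPy only; top_k_values is a hypothetical stand-in that mimics the values returned by tf.nn.top_k with k equal to the row length, and it is not a TensorFlow function.

```python
import numpy as np


def top_k_values(x):
    """NumPy stand-in for top_k values with k = row length:
    each row sorted in descending order."""
    return np.sort(x, axis=-1)[..., ::-1]


a = np.array([[0.51, 0.52, 0.53, 0.94, 0.35],
              [0.11, 0.02, 0.03, 0.14, 0.15]])

# Negate, sort descending, negate back: each row ends up in ascending order.
ascending = -top_k_values(-a)
print(ascending)
```

Negation is exact in floating point, so the result matches a direct ascending sort of each row element for element.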
