I have a np.array of 50 elements. For example:
data = np.array([9.22, 9. , 9.01, ..., 7.98, 6.77, 7.3 ])
For each element of the data np.array, I have a x and y data pair (both with the same length) that I want to interpolate with. For example:
x = np.array([[ 1, 2, 3, 4, 5 ],
...,
[ 1.01, 2.01, 3.02, 4.03, 5.07 ]])
y = np.array([[0. , 1. , 0.95, ..., 0.07, 0.06, 0.06],
...,
[0. , 0.99 , 0.85, ..., 0.03, 0.05, 0.06]])
I want to interpolate each data element with the respective np.array of x and y.
I have the following solution using map():
def cubic_spline(i):
return scipy.interpolate.splev(x=data[i],
tck=scipy.interpolate.splrep(x[i], y[i], k=3))
list(map(cubic_spline, np.arange(len(data)))
But I'm wondering if there is a way to do it directly with scipy and numpy to optimize the execution time. Something like:
scipy.interpolate.splev(x=data,
tck=scipy.interpolate.splrep(x, y, k=3))
Any suggestions will be appreciated. Thanks in advance.
If you have a single x array and multiple y arrays, newer interpolators (make_interp_spline, PchipInterpolator etc) support multidimensional y arrays automatically.
If you really have a collection of pairs of 1D arrays, x and y, where x arrays differ, and you want scipy to loop over these datasets, then no, scipy does not support that. You'd need to loop over them manually.
Related
I just cannot find a solution to the following problem:
Consider two NumPy.arrays, one of shape (10,64,10) and one of (x, 64).
Array A (10,64,10) represents 10 classes with 64 features and over each of these features I got a PDF split in 10 bins --> (Classes, Features, Bins). Each value in that innermost array represents a probability.
[[[0.62, 0., 0. ],
[0.12, 0.09, 0.01],
[0.59, 0.01, 0. ],
[0.62, 0., 0. ]],
[[0.62, 0., 0. ],
[0.59, 0.01, 0. ],
[0.62, 0., 0. ],
[0.62, 0., 0. ]]]
(simplified to (2,4,3) so you can test it by copying it directly. The representating classes are "0" and "1")
Array B (X, 64) are the instance of a dataset X and the bin-index the i`th feature belongs to\
[[0, 0, 2, 1]
[0, 0, 1, 0]
[0, 2, 1, 0]]
(simplified to (X=3, 4))
What I want to do is for each row in Array B, e.g. [0, 0, 2, 1] I want to get the probability that when the bin for the first feature is 0 the class is "1" and that the class is "0".
The expected output for the first instance here would be:
"0" = [0.62, 0.12, 0.00, 0.00]
"1" = [0.62, 0.59, 0.00, 0.00]
and if possible then for all X instances.
I do not expect any kind of Dictionary or anything alike but just some array that contains the values in a somewhat sorted manner (can also be another sort than shown in the example)
Of course, I could do all this in giant nested for-loops, but I want at least some vectorization. As anybody any good suggestions, our answer does not have to be a full-fledged solution.
EDIT:
The best nested loop I came up with was
prediction = np.empty((bins.shape[0], histograms.shape[1], histograms.shape[0]))
for n, instance in enumerate(bins):
for i, instance_bin in enumerate(instance):
prediction[n,i] = histograms[:,i, instance_bin] # Prob for every class x that the bin given in "instance_bin" of feature "i" corresponds to a possible instance of that class
histograms = Array A;
bins = Array B
Please also tell me of every other bad-practise if you find any in my way of working with numpy or anything else in this snipped.
There are N distributions which take on integer values 0,... with associated probabilities. Further, I assume 3 variables [value, prob]:
import numpy as np
x = np.array([ [0,0.3],[1,0.2],[3,0.5] ])
y = np.array([ [10,0.2],[11,0.4],[13,0.1],[14,0.3] ])
z = np.array([ [21,0.3],[23,0.7] ])
As there are N variables I convolve first x+y, then I add z, and so on.
Unfortunately numpy.convole() takes 1-d arrays as input variables, so it does not suit in this case directly. I play with variables to take them all values 0,1,2,...,23 (if value is not know then Pr=0)... I feel like there is another much better solution.
Does anyone have a suggestion for making it more efficient? Thanks in advance.
I don't see a built-in method for this in Scipy; there's a way to define a custom discrete random variables, but those don't support addition. Here is an approach using pandas, assuming import pandas as pd and x,y,z as in your example:
values = np.add.outer(x[:,0], y[:,0]).flatten()
probs = np.multiply.outer(x[:,1], y[:,1]).flatten()
df = pd.DataFrame({'values': values, 'probs': probs})
conv = df.groupby('values').sum()
result = conv.reset_index().values
The output is
array([[ 10. , 0.06],
[ 11. , 0.16],
[ 12. , 0.08],
[ 13. , 0.13],
[ 14. , 0.31],
[ 15. , 0.06],
[ 16. , 0.05],
[ 17. , 0.15]])
With more than two variables, you don't have to go back and forth between numpy and pandas: the additional variables can be included at the beginning.
values = np.add.outer(np.add.outer(x[:,0], y[:,0]), z[:,0]).flatten()
probs = np.multiply.outer(np.multiply.outer(x[:,1], y[:,1]), z[:,1]).flatten()
Aside: it would be better to keep values and probabilities in separate numpy arrays, if they have different intrinsic data types (integers vs reals).
I am seeing something strange while using AffinityPropagation from sklearn. I have a 4 x 4 numpy ndarray - which is basically the affinity-scores. sim[i, j] has the affinity score of [i, j]. Now, when I feed into the AffinityPropgation function, I get a total of 4 labels.
here is an similar example with a smaller matrix:
In [215]: x = np.array([[1, 0.2, 0.4, 0], [0.2, 1, 0.8, 0.3], [0.4, 0.8, 1, 0.7], [0, 0.3, 0.7, 1]]
.....: )
In [216]: x
Out[216]:
array([[ 1. , 0.2, 0.4, 0. ],
[ 0.2, 1. , 0.8, 0.3],
[ 0.4, 0.8, 1. , 0.7],
[ 0. , 0.3, 0.7, 1. ]])
In [217]: clusterer = cluster.AffinityPropagation(affinity='precomputed')
In [218]: f = clusterer.fit(x)
In [219]: f.labels_
Out[219]: array([0, 1, 1, 1])
This says (according to Kevin), that the first sample (0th-indexed row) is a cluster (Cluster # 0) on its own and the rest of the samples are in another cluster (cluster # 1). But, still, I do not understand this output. What is a sample here? What are the members? I want to have a set of pairs (i, j) assigned to one cluster, another set of pairs assigned to another cluster and so on.
It looks like a 4-sample x 4-feature matrix..which I do not want. Is this the problem? If so, how to convert this to a nice 4-sample x 4-sample affinity-matrix?
The documentation (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) says
fit(X, y=None)
Create affinity matrix from negative euclidean distances, then apply affinity propagation clustering.
Parameters:
X: array-like, shape (n_samples, n_features) or (n_samples, n_samples) :
Data matrix or, if affinity is precomputed, matrix of similarities / affinities.
Thanks!
By your description it sounds like you are working with a "pairwise similarity matrix": x (although your example data does not show that). If this is the case your matrix should be symmertric so that: sim[i,j] == sim[j,i] with your diagonal values equal to 1. Example similarity data S:
S
array([[ 1. , 0.08276253, 0.16227766, 0.47213595, 0.64575131],
[ 0.08276253, 1. , 0.56776436, 0.74456265, 0.09901951],
[ 0.16227766, 0.56776436, 1. , 0.47722558, 0.58257569],
[ 0.47213595, 0.74456265, 0.47722558, 1. , 0.87298335],
[ 0.64575131, 0.09901951, 0.58257569, 0.87298335, 1. ]])
Typically when you already have a distance matrix you should use affinity='precomputed'. But in your case, you are using similarity. In this specific example you can transform to pseudo-distance using 1-D. (The reason to do this would be because I don't know that Affinity Propagation will give you expected results if you give it a similarity matrix as input):
1-D
array([[ 0. , 0.91723747, 0.83772234, 0.52786405, 0.35424869],
[ 0.91723747, 0. , 0.43223564, 0.25543735, 0.90098049],
[ 0.83772234, 0.43223564, 0. , 0.52277442, 0.41742431],
[ 0.52786405, 0.25543735, 0.52277442, 0. , 0.12701665],
[ 0.35424869, 0.90098049, 0.41742431, 0.12701665, 0. ]])
With that being said, I think this is where your interpretation was off:
This says that the first 3-rows are similar, 4th row is a cluster on its own, and the 5th row is also a cluster on its own. Totally of 3 clusters.
The f.labels_ array:
array([0, 1, 1, 1, 0])
is telling you that samples (not rows) 0 and 4 are in cluster 0 AND that samples 2, 3, and 4 are in cluster 1. You don't need 25 different labels for a 5 sample problem, that wouldn't make sense. Hope this helps a little, try the demo (inspect the variables along the way and compare them with your data), which starts with raw data; it should help you decide if Affinity Propagation is the right clustering algorithm for you.
According to this page https://scikit-learn.org/stable/modules/clustering.html
you can use a similarity matrix for AffinityPropagation.
So I'm new to python AND data analysis, but have been tasked to create a scatter plot. The data set that I'm using has many elements containing None values. When I use the polyfit method to create a trendline(best-fit line) I get errors for the Nones. I've tried using lists and numpy arrays with dismal results. I've also tried masked_array, masked_invalid, ect. in MULTIPLE configurations, but it kept giving me an array filled with Nones. Is there a way of creating a trendline in such a way that I don't need to remove the elements that have None values? I need them to keep my plot dimensions correct. I'm using Python 2.7. This is what I got so far:
import matplotlib.pyplot as plt
import numpy as np
import numpy.ma as ma
import pylab
#The InterpolatedUnivariateSpline method popped up during my endeavor
#to extrapolate the trendline through the gaps in data.
#To be honest, I don't think its doing anything for me...
from scipy.interpolate import InterpolatedUnivariateSpline
fig, ax = plt.subplots(1,1)
ax.scatter(y, dbm, color = 'purple', marker = 'o', s = 100)
plt.xlim(min(y), max(y))
plt.xlabel('Temp - C')
dbm_array = np.asarray(dbm) #dbm and y are lists earlier in the program
y_array = np.asarray(y)
x = np.linspace(min(y), max(y), len(y))
order = 1
s = InterpolatedUnivariateSpline(y, dbm, k=order)
blah = s(x)
plt.plot(y, blah, '--k')
This gives me the scatter plot without the trendline for some reason. No errors, so I guess I got that going for me....
Thank you so much in advance!
First of all, if you have arrays, there should be no Nones in them, just nans. This is because None is an object which cannot be expressed as a number. So, the first problem may be here. Let's have a look:
import numpy as np
a = np.array([None, 1, 2, 3, 4, None])
What do we get?
>>> a
array([None, 1, 2, 3, 4, None], dtype=object)
This is most certainly something we did not. It is an array of objects, which is most of the time something not very useful. You cannot perform any calculations on that one:
>>> 2*a
unsupported operand type(s) for *: 'int' and 'NoneType'
This happens because the element-wise multiplication tries to multiply 2*None.
So, what you really want to have is:
>>> a = np.array([np.nan, 1, 2, 3, 4, np.nan])
>>> a
array([ nan, 1., 2., 3., 4., nan])
>>> a.dtype
dtype('float64')
>>> 2 * a
array([ nan, 2., 4., 6., 8., nan])
Now everything works as expected.
So, the first thing is to check that your input arrays have the correct form. If you then have problems with curve fitting, you may create an array without the nasty nans in there:
import numpy as np
a = np.array([[0,np.nan], [1, 1], [2, 1.5], [3.2, np.nan], [4, 5]])
b = a[-np.isnan(a[:,1])]
Let's see the contents of a and b:
>>> a
array([[ 0. , nan],
[ 1. , 1. ],
[ 2. , 1.5],
[ 3.2, nan],
[ 4. , 5. ]])
>>> b
array([[ 1. , 1. ],
[ 2. , 1.5],
[ 4. , 5. ]])
And this is what you want. The curve is fitted with b without any nans which have the habit of migrating around and making the results of calculations nans. (This is by design.)
How does this work, then? The np.isnan(a[:,1]) returns a boolean array with True at each position with a nan in column 1 in a and False for each valid number. As this is exactly the opposite of what we want, we'll negate it by adding the minus sign in front. And then the indexing picks only the rows which have numbers.
In case you have your X data and Y data in two different 1-D vectors, do this:
# original y data: Y
# original x data: X
# both have the same length
# calculate a mask to be used (a boolean vector)
msk = -np.isnan(Y)
# use the mask to plot both X and Y only at the points where Y is not NaN
plot(X[msk], Y[msk])
In some cases you may not have the X data at all, but you would like to number the points from, e.g. 0 onwards (as matplotlib does if you only give it one vector). There are a couple of possibilities, but this is one:
msk = -np.isnan(Y)
X = np.arange(len(Y))
plot(X[msk], Y[msk])
In numpy, the original array has the shape(2,2,2) like this
[[[0.2,0.3],[0.1,0.5]],[[0.1,0.3],[0.1,0.4]]]
I'd like to scale the array so that the max value of the a dimension is 1 like this:
As max([0.2,0.1,0.1,0.1]) is 0.2, and 1/0.2 is 5, so for the first element of the int tuple, multiple it by 5.
As max([0.3,0.5,0.3,0.4]) is 0.5, and 1/0.5 is 2, so for the second element of the int tuple, multiple it by 2
So the final array is like this:
[[[1,0.6],[0.5,1]],[[0.5,0.6],[0.5,0.8]]]
I know how to multiple an array with an integer in numpy, but I'm not sure how to multiple the array with different factor. Does anyone have ideas about this?
If your array = a:
>>> import numpy as np
>>> a = np.array([[[0.2,0.3],[0.1,0.5]],[[0.1,0.3],[0.1,0.4]]])
You can do this:
>>> a/np.amax(a.reshape(4,2),axis=0)
array([[[ 1. , 0.6],
[ 0.5, 1. ]],
[[ 0.5, 0.6],
[ 0.5, 0.8]]])