Generating three new features (membership class) based on probability clustering - python

I have a column of 100,000 temperatures with a minimum of 0°F and a maximum of 130°F. I want to create three new columns (features) from that temperature column for my model, based on the probability of membership in a cluster (I think this is also called fuzzy clustering or soft k-means clustering).
As illustrated by the cluster ranges below, I want to create 3 class memberships with overlap (cold, medium, hot), each with the probability of a data point belonging to that class of temperature. For example: a temperature of 39°F might have a class 1 (hot) membership of 0.05, a class 2 (medium) membership of 0.20, and a class 3 (cold) membership of 0.75 (note the sum of the three is 1). Is there any way to do this in Python?
cluster_1 = 0 to 30
cluster_2 = 50 to 80
cluster_3 = 100 to 130

Based on the description and the cluster ranges: this is more of an assignment problem based on known soft clusters than a clustering problem in itself.
If you have a vector of temperatures, [20, 30, 40, 50, 60, ...], that you want to convert to probabilities of being cold, warm, or hot based on the ranges above, you can achieve this with linear interpolation:
import numpy as np

def discretize(vec):
    out = np.zeros((len(vec), 3))
    for i, v in enumerate(vec):
        if v < 30:
            out[i] = [1.0, 0.0, 0.0]
        elif v <= 50:
            out[i] = [(50 - v) / 20, (v - 30) / 20, 0.0]
        elif v <= 80:
            out[i] = [0.0, 1.0, 0.0]
        elif v <= 100:
            out[i] = [0.0, (100 - v) / 20, (v - 80) / 20]
        else:
            out[i] = [0.0, 0.0, 1.0]
    return out

result = discretize(np.arange(20, 120, step=5))
Which will expand a length-N vector into an N×3 array:
[[1. 0. 0. ]
[1. 0. 0. ]
[1. 0. 0. ]
[0.75 0.25 0. ]
[0.5 0.5 0. ]
[0.25 0.75 0. ]
[0. 1. 0. ]
...
[0. 1. 0. ]
[0. 0.75 0.25]
[0. 0.5 0.5 ]
[0. 0.25 0.75]
[0. 0. 1. ]
...
[0. 0. 1. ]]
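As a side note, the per-element loop can be vectorized with np.interp, which applies the same piecewise-linear interpolation; a sketch equivalent to discretize above:
import numpy as np

def discretize_vectorized(vec):
    vec = np.asarray(vec, dtype=float)
    cold = np.interp(vec, [30, 50], [1.0, 0.0])   # 1 below 30, 0 above 50
    hot = np.interp(vec, [80, 100], [0.0, 1.0])   # 0 below 80, 1 above 100
    medium = 1.0 - cold - hot                     # memberships sum to 1
    return np.column_stack([cold, medium, hot])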
If you don't know the clusters ahead of time, a Gaussian mixture model can achieve something similar.
For example, consider a multimodal distribution X with modes at 25, 65, and 115 (to correspond roughly with the temperature example):
from numpy.random import default_rng

rng = default_rng(42)
X = np.c_[
    rng.normal(loc=25, scale=15, size=1000),
    rng.normal(loc=65, scale=15, size=1000),
    rng.normal(loc=115, scale=15, size=1000),
].reshape(-1, 1)
Fitting a Gaussian mixture corresponds to trying to estimate where the means are:
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=3, random_state=42)
model.fit(X)
print(model.means_)
Here, the means that are found are pretty close to where we placed them in the synthetic data:
[[115.85580935]
[ 25.33925571]
[ 65.35465989]]
Finally, the .predict_proba() method provides an estimate for how likely a value belongs to each cluster:
>>> np.round(model.predict_proba(X), 3)
array([[0. , 0.962, 0.038],
[0.002, 0.035, 0.963],
[0.989, 0. , 0.011],
...,
[0. , 0.844, 0.156],
[0.88 , 0. , 0.12 ],
[0.993, 0. , 0.007]])
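To turn these probabilities into the three new feature columns, note that GMM component order is arbitrary (see the means above), so sort the components by mean first. A hypothetical sketch, assuming a pandas DataFrame df with a "temp" column (names not from the original post):
import numpy as np
import pandas as pd

# Sort components so the columns come out as cold, medium, hot by mean
order = np.argsort(model.means_.ravel())
probs = model.predict_proba(df[["temp"]].to_numpy())[:, order]
df[["p_cold", "p_medium", "p_hot"]] = probs  # hypothetical column names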


how to reverse index a 2-d array

I have a 2d MxN array A, each row of which is a sequence of indices, padded with -1's at the end, e.g.:
[[ 2 1 -1 -1 -1]
[ 1 4 3 -1 -1]
[ 3 1 0 -1 -1]]
I have another MxN array of float values B:
[[ 0.7 0.4 1.5 2.0 4.4 ]
[ 0.8 4.0 0.3 0.11 0.53]
[ 0.6 7.4 0.22 0.71 0.06]]
and I want to use the indices in A to filter B, i.e. for each row, only the positions whose indices appear in A retain their values, and the values at all other locations are set to 0.0. The result would look like:
[[ 0.0 0.4 1.5 0.0 0.0 ]
[ 0.0 4.0 0.0 0.11 0.53 ]
[ 0.6 7.4 0.0 0.71 0.0]]
What's a good way to do this in "pure" numpy? (I would like to do this in pure numpy so I can jit it in jax.)
Numpy supports fancy indexing. Ignoring the "-1" entries for the moment, you can do something like this:
index = (np.arange(B.shape[0]).reshape(-1, 1), A)
result = np.zeros_like(B)
result[index] = B[index]
This works because the indices are broadcast together. The column np.arange(B.shape[0]).reshape(-1, 1) matches all the elements of a given row of A to the corresponding row in B and result.
This does not yet address the fact that -1 is a valid numpy index (it refers to the last column). You need to clear the last-column elements for rows where A contains a -1 but the index 4 (the last column) is not actually present:
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0
Here, the mask is [True, False, True], indicating that even though the second row has a -1 in it, it also contains a 4.
This approach is fairly efficient. It will create no more than a couple of boolean arrays of the same shape as A for the mask.
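For reference, a self-contained version assembling the snippets above, using the A and B from the question (the output matches the expected result):
import numpy as np

A = np.array([[ 2,  1, -1, -1, -1],
              [ 1,  4,  3, -1, -1],
              [ 3,  1,  0, -1, -1]])
B = np.array([[0.7, 0.4, 1.5,  2.0,  4.4 ],
              [0.8, 4.0, 0.3,  0.11, 0.53],
              [0.6, 7.4, 0.22, 0.71, 0.06]])

# Fancy indexing: pair each row index with the column indices in A
index = (np.arange(B.shape[0]).reshape(-1, 1), A)
result = np.zeros_like(B)
result[index] = B[index]

# Undo the spurious writes to the last column caused by -1 indices
mask = (A == -1).any(axis=1) & (A != A.shape[1] - 1).all(axis=1)
result[mask, -1] = 0.0
print(result)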
You can use broadcasting, but note that it will create a large intermediate array of shape (M, N, N) (in pure numpy at least):
import numpy as np
A = ...
B = ...
M, N = A.shape
out = np.where(np.any(A[..., None] == np.arange(N), axis=1), B, 0.0)
out:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
Another possible solution:
maxr = np.max(A, axis=1)
A = np.where(A == -1, maxr.reshape(-1,1), A)
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)
np.where(mask, B, 0)
Output:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 4. , 0. , 0.11, 0.53],
[0.6 , 7.4 , 0. , 0.71, 0. ]])
EDIT (when a row contains only -1)
The following code handles the possibility, raised by @MadPhysicist (whom I thank), of rows containing only -1; it only requires adding 2 lines to my previous code.
A = np.array([[ 2, 1, -1, -1, -1],
[ -1, -1, -1, -1, -1],
[ 3, 1, 0, -1, -1]])
B = np.array([[ 0.7, 0.4, 1.5, 2.0, 4.4 ],
[ 0.8, 4.0, 0.3, 0.11, 0.53],
[ 0.6, 7.4, 0.22, 0.71, 0.06]])
rminus1 = np.all(A == -1, axis=1) # new
maxr = np.max(A, axis=1)
A = np.where(A == -1, maxr.reshape(-1,1), A)
mask = np.zeros(np.shape(B), dtype=bool)
np.put_along_axis(mask, A, True, axis=1)
C = np.where(mask, B, 0)
C[rminus1, :] = 0 # new
Output:
array([[0. , 0.4 , 1.5 , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0.6 , 7.4 , 0. , 0.71, 0. ]])

Using for loop to replace values in a matrix but only the last replaced value is kept

x_n = np.arange(0, 1.0, 0.25)
u_m = np.arange(0, 1.0, 0.5)
for x in range(len(x_n)):
    for u in range(len(u_m)):
        zeros_array = np.zeros( (len(x_n), len(u_m)) )
        zeros_array[x,u] = x_n[x] - u_m[u]
zeros_array
#result
array([[ 0. , 0. ],
[ 0. , 0. ],
[ 0. , 0. ],
[ 0. , 0.25]])
Only the last replaced value is kept. I want to know how to keep all the replaced values.
You're initializing a new zeros_array on every iteration of the inner loop, so naturally, when the loops end, only the last assigned value is kept. To solve this, define zeros_array once outside the loops and keep updating it inside:
x_n = np.arange(0, 1.0, 0.25)
u_m = np.arange(0, 1.0, 0.5)
zeros_array = np.zeros((len(x_n), len(u_m)))
for x in range(len(x_n)):
    for u in range(len(u_m)):
        zeros_array[x, u] = x_n[x] - u_m[u]
print(zeros_array)
Output:
[[ 0. -0.5 ]
[ 0.25 -0.25]
[ 0.5 0. ]
[ 0.75 0.25]]
You have the initialization of zeros_array inside the loop, so it is re-created on every iteration.
Instead, do:
zeros_array = np.zeros((len(x_n), len(u_m)))
for x in range(len(x_n)):
    for u in range(len(u_m)):
        zeros_array[x,u] = x_n[x] - u_m[u]
output:
array([[ 0. , -0.5 ],
[ 0.25, -0.25],
[ 0.5 , 0. ],
[ 0.75, 0.25]])
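As a side note, the explicit loops can be avoided entirely with broadcasting, which builds the same difference table in one vectorized step:
import numpy as np

x_n = np.arange(0, 1.0, 0.25)
u_m = np.arange(0, 1.0, 0.5)
# (4, 1) minus (2,) broadcasts to the full (4, 2) difference table
zeros_array = x_n[:, None] - u_m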

Finding the logits with respect to labels Tensorflow Python

I have the label array and logits array as:
label = [1,1,0,1,-1,-1,1,0,-1,0,-1,-1,0,0,0,1,1,1,-1,1]
logits = [0.2,0.3,0.4,0.1,-1.4,-2,0.4,0.5,-0.231,1.9,1.4,-1.456,0.12,-0.45,0.5,0.3,0.4,0.2,1.2,12]
Using Tensorflow, I want to get the values from label and logits where:
1. label is greater than zero
2. label is less than zero
3. label equals zero
I would like a result something like this:
label1, logits1 = some_Condition_logic_Where(label > 0)  # returns the respective labels and logits
Can anyone suggest how this is achievable?
EDITED:
>>> label = [1,1,0,1,-1,-1,1,0,-1,0,-1,-1,0,0,0,1,1,1,-1,1]
>>> logits = [0.2,0.3,0.4,0.1,-1.4,-2,0.4,0.5,-0.231,1.9,1.4,-1.456,0.12,-0.45,0.5,0.3,0.4,0.2,1.2,12]
>>> label1 = [];logits1 = []
>>> for l1,l2 in zip(label,logits):
...     if(l1>0):
...         label1.append(l1)
...         logits1.append(l2)
...
>>> label1
[1, 1, 1, 1, 1, 1, 1, 1]
>>> logits1
[0.2, 0.3, 0.1, 0.4, 0.3, 0.4, 0.2, 12]
I want this logic implemented in Tensorflow, and the same for the values with -1 and 0. How can I achieve this?
You can use tf.boolean_mask.
import tensorflow as tf
label = tf.constant([1,1,0,1,-1,-1,1,0,-1,0,-1,-1,0,0,0,1,1,1,-1,1],dtype=tf.float32)
logits = tf.constant([0.2,0.3,0.4,0.1,-1.4,-2,0.4,0.5,-0.231,1.9,1.4,-1.456,0.12,-0.45,0.5,0.3,0.4,0.2,1.2,12],dtype=tf.float32)
# label>0
label1 = tf.boolean_mask(label,tf.greater(label,0))
logits1 = tf.boolean_mask(logits,tf.greater(label,0))
# label<0
label2 = tf.boolean_mask(label,tf.less(label,0))
logits2 = tf.boolean_mask(logits,tf.less(label,0))
# label=0
label3 = tf.boolean_mask(label,tf.equal(label,0))
logits3 = tf.boolean_mask(logits,tf.equal(label,0))
with tf.Session() as sess:
    print(sess.run(label1))
    print(sess.run(logits1))
    print(sess.run(label2))
    print(sess.run(logits2))
    print(sess.run(label3))
    print(sess.run(logits3))
[1. 1. 1. 1. 1. 1. 1. 1.]
[ 0.2 0.3 0.1 0.4 0.3 0.4 0.2 12. ]
[-1. -1. -1. -1. -1. -1.]
[-1.4 -2. -0.231 1.4 -1.456 1.2 ]
[0. 0. 0. 0. 0. 0.]
[ 0.4 0.5 1.9 0.12 -0.45 0.5 ]
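As a side note, the above uses the TensorFlow 1.x session API. On TensorFlow 2.x, where eager execution is the default, the same tf.boolean_mask calls work without a Session; a minimal sketch (not from the original answer):
import tensorflow as tf

label = tf.constant([1, 1, 0, 1, -1, -1, 1, 0], dtype=tf.float32)
logits = tf.constant([0.2, 0.3, 0.4, 0.1, -1.4, -2.0, 0.4, 0.5], dtype=tf.float32)

# Comparison operators produce boolean tensors directly, and results
# evaluate eagerly, so no Session is needed
label1 = tf.boolean_mask(label, label > 0)
logits1 = tf.boolean_mask(logits, label > 0)
print(label1.numpy(), logits1.numpy())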

Cartesian product from 2 series

I have this big series of length t (t = 200K rows):
prices = [200, 100, 500, 300 ..]
and I want to calculate a matrix (t×t) where each value is calculated as:
matrix[i][j] = prices[j]/prices[i] - 1
I tried this using a double for loop, but it's too slow. Any ideas how to perform it better?
for i, p0 in enumerate(prices):
    for j, p1 in enumerate(prices):
        matrix[i][j] = p1/p0 - 1
A vectorized solution uses np.meshgrid with prices and 1/prices as arguments (note that prices must be an array), multiplying the results and subtracting 1 in order to compute matrix[i][j] = prices[j]/prices[i] - 1:
a, b = np.meshgrid(p, 1/p)
a * b - 1
As an example:
p = np.array([1,4,2])
Would give:
a, b = np.meshgrid(p, 1/p)
a * b - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
Quick check of some of the cells:
(i,j) prices[j]/prices[i] - 1
--------------------------------
(1,1) 1/1 - 1 = 0
(1,2) 4/1 - 1 = 3
(1,3) 2/1 - 1 = 1
(2,1) 1/4 - 1 = -0.75
Another solution:
[p] / np.array([p]).T - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
There are two idiomatic ways of doing an outer product-type operation. Either use the .outer method of universal functions, here np.divide:
In [2]: p = np.array([10, 20, 30, 40])
In [3]: np.divide.outer(p, p)
Out[3]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
Alternatively, use broadcasting:
In [4]: p[:, None] / p[None, :]
Out[4]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
This p[None, :] could itself be spelled as a reshape, p.reshape((1, len(p))), but the None form is more readable.
Both are equivalent to a double for-loop:
In [6]: o = np.empty((len(p), len(p)))
In [7]: for i in range(len(p)):
   ...:     for j in range(len(p)):
   ...:         o[i, j] = p[i] / p[j]
   ...:
In [8]: o
Out[8]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
I guess it can be done in this way
import numpy
prices = [200., 300., 100., 500., 600.]
x = numpy.array(prices).reshape(1, len(prices))
matrix = (1/x.T) * x - 1
Let me explain in detail. This matrix is the (outer) product of a column vector of element-wise reciprocal price values and a row vector of the original price values; then a matrix of ones of the same size is subtracted from the result.
First of all, we create a row vector from the prices list:
x = numpy.array(prices).reshape(1, len(prices))
Reshaping is required here. Otherwise your vector will have shape (len(prices),), not the required (1, len(prices)).
Then we compute a column vector of element-wise reciprocal price values:
(1/x.T)
Finally, we compute the resulting matrix
matrix = (1/x.T) * x - 1
Here the trailing - 1 is broadcast to a matrix of the same shape as (1/x.T) * x.
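A caveat that applies to all of these answers: with t = 200K, the full t×t float64 matrix takes roughly 200,000² × 8 bytes ≈ 320 GB, so it cannot be materialized in memory. A sketch of one workaround, computing the matrix in row blocks and consuming each block immediately (block size and the processing step are placeholders):
import numpy as np

prices = np.asarray(prices, dtype=float)  # the length-t series
block = 1000                              # arbitrary block size

for start in range(0, len(prices), block):
    rows = prices[start:start + block]
    # matrix[i][j] = prices[j] / prices[i] - 1, for i in this block only
    chunk = prices[None, :] / rows[:, None] - 1
    ...  # process / aggregate `chunk` here instead of storing everything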

list of indexes of maximum values in ndarray

I have an ndarray. From this array I need to choose the list of N numbers with the biggest values. I found heapq.nlargest to find the N largest entries, but I need to extract the indexes.
I want to build a new array where only the N rows with the largest weights in the first column survive. The rest of the rows will be replaced by random values.
import numpy as np
import heapq  # for choosing the list of max values

a = [[1.1,2.1,3.1], [2.1,3.1,4.1], [5.1,0.1,7.1],[0.1,1.1,1.1],[4.1,3.1,9.1]]
a = np.asarray(a)
maxVal = heapq.nlargest(2, a[:,0])

if __name__ == '__main__':
    print a
    print maxVal
The output I have is:
[[ 1.1 2.1 3.1]
[ 2.1 3.1 4.1]
[ 5.1 0.1 7.1]
[ 0.1 1.1 1.1]
[ 4.1 3.1 9.1]]
[5.0999999999999996, 4.0999999999999996]
but what I need is [2, 4], the indexes to build a new array from. The indexes are the rows, so if in this example I want to replace the rest with 0, I need to finish with:
[[0.0 0.0 0.0]
[ 0.0 0.0 0.0]
[ 5.1 0.1 7.1]
[ 0.0 0.0 0.0]
[ 4.1 3.1 9.1]]
I am stuck at the point where I need the indexes. The original array has 1000 rows and 100 columns. The weights are normalized floating points and I don't want to do something like if a[:,1] == maxVal[0]: because sometimes the weights are very close and I can end up with more values equal to maxVal[0] than my original N.
Is there any simple way to extract the indexes in this setup so I can replace the rest of the array?
If you only have 1000 rows, I would forget about the heap and use np.argsort on the first column:
>>> np.argsort(a[:,0])[::-1][:2]
array([2, 4])
If you want to put it all together, it would look something like:
def trim_rows(a, n):
    idx = np.argsort(a[:,0])[:-n]
    a[idx] = 0
>>> a = np.random.rand(10, 4)
>>> a
array([[ 0.34416425, 0.89021968, 0.06260404, 0.0218131 ],
[ 0.72344948, 0.79637177, 0.70029863, 0.20096129],
[ 0.27772833, 0.05372373, 0.00372941, 0.18454153],
[ 0.09124461, 0.38676351, 0.98478492, 0.72986697],
[ 0.84789887, 0.69171688, 0.97718206, 0.64019977],
[ 0.27597241, 0.26705301, 0.62124467, 0.43337711],
[ 0.79455424, 0.37024814, 0.93549275, 0.01130491],
[ 0.95113795, 0.32306471, 0.47548887, 0.20429272],
[ 0.3943888 , 0.61586129, 0.02776393, 0.2560126 ],
[ 0.5934556 , 0.23093912, 0.12550062, 0.58542137]])
>>> trim_rows(a, 3)
>>> a
array([[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0.84789887, 0.69171688, 0.97718206, 0.64019977],
[ 0. , 0. , 0. , 0. ],
[ 0.79455424, 0.37024814, 0.93549275, 0.01130491],
[ 0.95113795, 0.32306471, 0.47548887, 0.20429272],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
And for your data size it's probably fast enough:
In [7]: a = np.random.rand(1000, 100)
In [8]: %timeit -n1 -r1 trim_rows(a, 50)
1 loops, best of 1: 7.65 ms per loop
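As a side note, for much larger arrays np.argpartition can select the n largest first-column entries without a full sort; a sketch equivalent to trim_rows (not from the original answer):
import numpy as np

def trim_rows_argpartition(a, n):
    # argpartition is O(len(a)), versus O(len(a) log len(a)) for argsort
    keep = np.argpartition(a[:, 0], -n)[-n:]  # indices of the n largest
    mask = np.ones(len(a), dtype=bool)
    mask[keep] = False
    a[mask] = 0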
