one-hot encoding of an array of floats using just keras - python

First off, I am new to Stack Overflow, so if there is a way to improve how I formulate my question, or if I missed something obvious, please do point it out to me!
I am building a classification convolutional network in Keras, where the network is asked to predict which parameter was used to generate the image. The classes are encoded as 5 float values; a list of the classes may look like this:
[[0.], [0.76666665], [0.5], [0.23333333], [1.]]
I want to one-hot encode these classes, using the keras.utils.to_categorical(y, num_classes=5, dtype='float32') function.
However, it returns the following:
array(
[
[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.]
],
dtype=float32)
It only takes integers as input, thus it maps all values < 1. to 0.
I could circumvent this by multiplying all values by a constant so they all become integers, and I think the problem could also be solved within scikit-learn, but both feel like big work-arounds for something that should be trivial to solve within just Keras, which makes me believe I am missing something obvious.
I hope somebody is able to point out a simple alternative using just Keras.

Another option is to use OneHotEncoder from sklearn:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories='auto')
input = [[0.], [0.76666665], [0.5], [0.23333333], [1.]]
output = encoder.fit_transform(input)
print(input)
print(output.toarray())
Outputs:
[[0.0], [0.76666665], [0.5], [0.23333333], [1.0]]
[[ 1. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 1.]]

Due to the continuous nature of floating-point values, it's not advisable to try to one-hot encode them directly. Instead, you could do something like this:
a = {}
classes = []
for item, i in zip(your_array, range(len(your_array))):
    a[str(i)] = item
    classes.append(str(i))
encoded_classes = to_categorical(classes)
The dictionary is there so that you can refer back to the actual float values later.
EDIT: Updated after comment from nuric.
from keras.utils import to_categorical

your_array = [[0.], [0.76666665], [0.5], [0.23333333], [1.]]
class_values = {}
classes = []
for i, item in enumerate(your_array):
    class_values[str(i)] = item
    classes.append(i)
encoded_classes = to_categorical(classes)
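If the float values themselves are the class labels, another lightweight option (a sketch of my own, assuming NumPy is available alongside Keras) is to map each distinct float to an integer index with np.unique and then one-hot encode those indices with to_categorical:

import numpy as np
from keras.utils import to_categorical

your_array = np.array([[0.], [0.76666665], [0.5], [0.23333333], [1.]])

# np.unique returns the sorted distinct values and, with return_inverse=True,
# the integer index of each original value within that sorted list.
unique_values, class_indices = np.unique(your_array.ravel(), return_inverse=True)

encoded_classes = to_categorical(class_indices, num_classes=len(unique_values))
# unique_values lets you map a column index back to the original float later.

Here unique_values plays the role of the dictionary in the loop-based version above, and the resulting encoding matches the OneHotEncoder output.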

Related

Calculating Confusion Matrix by Using the Array of Arrays

I am using the transformers and datasets libraries to train a multi-class NLP model on a specific real dataset, and I need an idea of how my model performs for each label. So, I'd like to calculate the confusion matrix. I have 4 labels. My result.prediction looks like
array([[ -6.906 , -8.11 , -10.29 , 6.242 ],
[ -4.51 , 3.705 , -9.76 , -7.49 ],
[ -6.734 , 3.36 , -10.27 , -6.883 ],
...,
[ 8.41 , -9.43 , -9.45 , -8.6 ],
[ 1.3125, -3.094 , -11.016 , -9.31 ],
[ -7.152 , -8.5 , -9.13 , 6.766 ]], dtype=float16)
Here, when the predicted value is positive the model predicts 1, otherwise it predicts 0. Next, my result.label_ids looks like
array([[0., 0., 0., 1.],
[1., 0., 0., 0.],
[0., 0., 0., 1.],
...,
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 0., 0., 1.]], dtype=float32)
As you can see, the model returns an array of 4 values per sample, assigning 0 to the false labels and 1 to the true label.
In general, I've been using the following function to calculate a confusion matrix, but in this case it didn't work, since this function is written for 1-dimensional arrays.
import numpy as np

def compute_confusion_matrix(labels, true, pred):
    K = len(labels)  # number of classes
    result = np.zeros((K, K))
    for i in range(len(true)):  # one increment per (true, predicted) pair
        result[true[i]][pred[i]] += 1
    return result
If possible, I'd like to modify this function to suit my case above. At the very least, I would like to understand how I can implement a confusion matrix for results that come in the form of multi-dimensional arrays.
A possibility could be reversing the encoding to the format required by compute_confusion_matrix; that way, it is still possible to use your function!
To convert the predictions it's possible to do:
pred = list(np.where(result.label_ids == 1.)[1])
where np.where(result.label_ids == 1.)[1] is a numpy 1-dimensional array containing the indexes of the 1.s in each row of result.label_ids.
So pred will look like this according to your result.label_ids:
[3, 0, 3, ..., 0, 0, 3]
so it should have the same format of the original true (if also true is one-hot encoded the same strategy could be used to convert it) and can be used as input of your function for computing the confusion matrix.
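To make the conversion concrete on both sides, here is a minimal sketch (my own illustration, with small made-up arrays standing in for result.prediction and result.label_ids): np.argmax picks the predicted class from the raw logits, while np.where recovers the true class from the one-hot labels.

import numpy as np

# Hypothetical stand-ins for result.prediction (logits) and result.label_ids
logits = np.array([[-6.9, -8.1, -10.3, 6.2],
                   [-4.5, 3.7, -9.8, -7.5]])
label_ids = np.array([[0., 0., 0., 1.],
                      [1., 0., 0., 0.]])

# For the logits, the predicted class is the position of the largest value.
pred = list(np.argmax(logits, axis=1))
# For one-hot labels, the true class is the position of the single 1.
true = list(np.where(label_ids == 1.)[1])

print(pred)  # [3, 1]
print(true)  # [3, 0]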
First of all I would like to thank Nicola Fanelli for the idea.
Both the function I gave above and sklearn.metrics.confusion_matrix() need to be provided lists of predicted and true values. After my prediction step, I try to retrieve my true and predicted values in order to calculate a confusion matrix. The results I was getting are in the following form
array([[0., 0., 0., 1.],
[1., 0., 0., 0.],
[0., 0., 0., 1.],
...,
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 0., 0., 1.]], dtype=float32)
Here the idea is to retrieve the positional index of the value 1. When I tried the approach suggested by Nicola Fanelli, the resulting sizes were lower than the initial ones and they didn't match, so the confusion matrix could not be calculated. To be honest, I couldn't find the reason behind it, but I'll investigate it more later.
So, I used a different technique to implement the same idea: np.argmax(), appending the resulting positions to a new list. Here is the code sample for the true values
true = []
for i in range(len(result.label_ids)):
    n = np.array(result.label_ids[i])
    true.append(np.argmax(n))
This way I got the results in the desired format without the sizes being changed.
Even though this is a working solution for my problem, I am still open to more elegant ways to approach this problem.
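One possible shortening (a sketch of mine, assuming result.label_ids and result.prediction have the shapes shown above) is to apply the same argmax idea to the whole arrays at once and feed the result straight into sklearn:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins for result.label_ids and result.prediction
label_ids = np.array([[0., 0., 0., 1.],
                      [1., 0., 0., 0.],
                      [0., 0., 0., 1.]])
logits = np.array([[-6.9, -8.1, -10.3, 6.2],
                   [-4.5, 3.7, -9.8, -7.5],
                   [-6.7, 3.4, -10.3, -6.9]])

# argmax along axis=1 collapses each row to the index of its largest value,
# turning both arrays into 1-dimensional vectors of class indices.
true = np.argmax(label_ids, axis=1)
pred = np.argmax(logits, axis=1)

print(confusion_matrix(true, pred))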

How to delete multiple values from a numpy matrix at low computational cost

I've recently been trying my hand at numpy, and I'm trying to find a way to delete the rows of the matrix whose value in column 2 equals one of the values stored in the variable element.
Since I have a large amount of data, I would like to know if there is a more efficient method that takes less time to execute than the classic for loop.
I enclose an example:
element = [85., 222., 166., 238.]
matrix = [[228., 1., 222.],
          [140., 0., 85.],
          [140., 0., 104.],
          [230., 0., 217.],
          [115., 1., 250.],
          [12., 1., 166.],
          [181., 1., 238.]]
the output:
matrix = [[140., 0., 104.],
          [230., 0., 217.],
          [115., 1., 250.]]
The method I used is the following:
for y in element:
    matrix = matrix[matrix[:, 2] != y]
When running it on a large amount of data it takes a long time. Is there anything more efficient, so that I can save on execution time?
Since you tagged numpy, I'd assume matrix is a numpy array. With that, you can use np.isin for your purpose:
matrix = np.array(matrix)
matrix[~np.isin(matrix[:, 2], element)]
Output:
array([[140.,   0., 104.],
       [230.,   0., 217.],
       [115.,   1., 250.]])
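For a self-contained version (a sketch using the sample data above), the key point is that np.isin builds one boolean mask over the whole column, so the array is filtered in a single pass instead of being re-filtered once per value as in the original loop:

import numpy as np

element = [85., 222., 166., 238.]
matrix = np.array([[228., 1., 222.],
                   [140., 0., 85.],
                   [140., 0., 104.],
                   [230., 0., 217.],
                   [115., 1., 250.],
                   [12., 1., 166.],
                   [181., 1., 238.]])

# Mask of rows whose column-2 value appears in `element`; ~ keeps the rest.
filtered = matrix[~np.isin(matrix[:, 2], element)]
print(filtered)
# [[140.   0. 104.]
#  [230.   0. 217.]
#  [115.   1. 250.]]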

Efficient Implementation of Gaussian Elimination in Python [duplicate]

Is there somewhere in the cosmos of scipy/numpy/... a standard method for Gauss-elimination of a matrix?
One finds many snippets via google, but I would prefer to use "trusted" modules if possible.
I finally found that it can be done using LU decomposition. Here the U matrix represents the reduced form of the linear system.
from numpy import array
from scipy.linalg import lu
a = array([[2.,4.,4.,4.],[1.,2.,3.,3.],[1.,2.,2.,2.],[1.,4.,3.,4.]])
pl, u = lu(a, permute_l=True)
Then u reads
array([[ 2., 4., 4., 4.],
[ 0., 2., 1., 2.],
[ 0., 0., 1., 1.],
[ 0., 0., 0., 0.]])
Depending on the solvability of the system this matrix has an upper triangular or trapezoidal structure. In the above case a line of zeros arises, as the matrix has only rank 3.
One function that can be worth checking is scipy.optimize._remove_redundancy, if you wish to remove repeated or redundant equations:
import numpy as np
import scipy.optimize
a = np.array([[1., 1., 1., 1.],
              [0., 0., 0., 1.],
              [0., 0., 0., 2.],
              [0., 0., 0., 3.]])
print(scipy.optimize._remove_redundancy._remove_redundancy(a, np.zeros_like(a[:, 0]))[0])
which gives:
[[1. 1. 1. 1.]
[0. 0. 0. 3.]]
As a note to @flonk's answer, using an LU decomposition might not always give the desired reduced row matrix. Example:
import numpy as np
import scipy.linalg
a = np.array([[1., 1., 1., 1.],
              [0., 0., 0., 1.],
              [0., 0., 0., 2.],
              [0., 0., 0., 3.]])
_,_, u = scipy.linalg.lu(a)
print(u)
gives the same matrix:
[[1. 1. 1. 1.]
[0. 0. 0. 1.]
[0. 0. 0. 2.]
[0. 0. 0. 3.]]
even though the last 3 rows are linearly dependent.
You can use the symbolic mathematics python library sympy
import sympy as sp
m = sp.Matrix([[1, 2, 1],
               [-2, -3, 1],
               [3, 5, 0]])
m_rref, pivots = m.rref() # Compute reduced row echelon form (rref).
print(m_rref, pivots)
This will output the matrix in reduced row echelon form, as well as a tuple of the pivot columns:
Matrix([[1, 0, -5],
[0, 1, 3],
[0, 0, 0]])
(0, 1)
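If you then need the result back as a plain NumPy array for further numerical work, one small sketch (continuing from the m_rref above) is:

import numpy as np

# sympy stores exact Integer/Rational entries; .tolist() plus an explicit
# dtype converts them to ordinary floats.
m_rref_np = np.array(m_rref.tolist(), dtype=float)
print(m_rref_np)
# [[ 1.  0. -5.]
#  [ 0.  1.  3.]
#  [ 0.  0.  0.]]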

I'd like to know how to calculate the similarity (numerical accuracy) of two numpy arrays in Python

I'm a student who just started deep learning with Python.
First of all, my native language is not English, so my phrasing may be a bit off since I am using a translator.
I used time series data in deep learning to create a model that predicts the likelihood of certain situations in the future. We've even completed visualizations using graphs.
But rather than visualizing it through graphs, I wanted to quantify how similar the train data and the test data are as a number.
The two data are in the following format:
In [51] : train_r
Out[51] : array([[0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
                 [0., 0., 0., ..., 0., 0., 0.],
Note: This data is composed of 0 and 1.
In [52] : test_r
Out[52] : array([[0. , 0. , 0. , ..., 0.03657577, 0.06709877, 0.0569071 ],
                 [0. , 0. , 0. , ..., 0.04707848, 0.07826   , 0.0819832 ],
                 [0. , 0. , 0. , ..., 0.04467918, 0.07355513, 0.08117414],
I used the cosine similarity method to compare these two sets of data, but an error occurred:
from numpy import dot
from numpy.linalg import norm
cos_sim = dot(train_r, test_r)/(norm(train_r)*norm(test_r))
ValueError: shapes (100,24) and (100,24) not aligned: 24 (dim 1) != 100 (dim 0)
So I searched the Internet for a different way, but it didn't help, because most of what I found was about string analysis.
How can I calculate the similarity between the two arrays and express it as a number?
Found the cause.
The reason for the error is that train_r and test_r each store 24 columns, and I was trying to compute all 24 at once, which is what raised the error.
The solution is simple: select a single column from train_r and test_r and compute the cosine similarity on that.
train_c = train_r[:,12]
test_c = test_r[:,12]
from numpy import dot
from numpy.linalg import norm
a = train_c
b = test_c
cos_sim = (dot(a, b)/(norm(a)*norm(b))) * 100
print(cos_sim)
95.18094658851624
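If you want the similarity of all 24 columns at once instead of picking a single one, a vectorized sketch (my own addition, assuming train_r and test_r are both shaped (100, 24)) looks like this:

import numpy as np

# Hypothetical data with the same shape as train_r / test_r
rng = np.random.default_rng(0)
train_r = rng.random((100, 24))
test_r = rng.random((100, 24))

# Column-wise cosine similarity: dot product of matching columns divided by
# the product of their norms, giving one similarity value per column.
cos_sim = np.sum(train_r * test_r, axis=0) / (
    np.linalg.norm(train_r, axis=0) * np.linalg.norm(test_r, axis=0)
)
print(cos_sim.shape)  # (24,)

Averaging those 24 values then gives a single overall similarity score, if that is what you are after.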

norm parameters in sklearn.preprocessing.normalize

The sklearn documentation says "norm" can be either
norm : ‘l1’, ‘l2’, or ‘max’, optional (‘l2’ by default)
The norm to use to normalize each non zero sample (or each non-zero feature if axis is 0).
The documentation about normalization doesn't clearly state how ‘l1’, ‘l2’, or ‘max’ are calculated.
Can anyone clarify this?
Informally speaking, the norm is a generalization of the concept of (vector) length; from the Wikipedia entry:
In linear algebra, functional analysis, and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space.
The L2-norm is the usual Euclidean length, i.e. the square root of the sum of the squared vector elements.
The L1-norm is the sum of the absolute values of the vector elements.
The max-norm (sometimes also called infinity norm) is simply the maximum absolute vector element.
As the docs say, normalization here means making our vectors (i.e. data samples) having unit length, so specifying which length (i.e. which norm) is also required.
You can easily verify the above adapting the examples from the docs:
from sklearn import preprocessing
import numpy as np
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_l1 = preprocessing.normalize(X, norm='l1')
X_l1
# array([[ 0.25, -0.25, 0.5 ],
# [ 1. , 0. , 0. ],
# [ 0. , 0.5 , -0.5 ]])
You can verify by simple visual inspection that the absolute values of the elements of X_l1 sum up to 1.
X_l2 = preprocessing.normalize(X, norm='l2')
X_l2
# array([[ 0.40824829, -0.40824829, 0.81649658],
# [ 1. , 0. , 0. ],
# [ 0. , 0.70710678, -0.70710678]])
np.sqrt(np.sum(X_l2**2, axis=1)) # verify that L2-norm is indeed 1
# array([ 1., 1., 1.])
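The 'max' norm can be checked the same way; a quick sketch continuing the example above:

X_max = preprocessing.normalize(X, norm='max')
X_max
# array([[ 0.5, -0.5,  1. ],
#        [ 1. ,  0. ,  0. ],
#        [ 0. ,  1. , -1. ]])
np.max(np.abs(X_max), axis=1)  # verify that the max absolute value per row is indeed 1
# array([1., 1., 1.])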
