Compute the unit vector for each row of an array - python

I have a large (n x dim) array, each row is a vector in a space (whatever the dimension but let's do it in 2D):
import numpy as np
A = np.array([[50,14],[26,11],[81,9],[-11,-19]])
A.shape
(4,2)
I want to quickly compute the unit vector for each of those rows.
N = np.linalg.norm(A, axis=1)
# something like this, but for each row:
A /= N # not working:
# ValueError: operands could not be broadcast together
# with shapes (4,2) (4,) (4,2)
# or in a pandas-like manner:
np.divide(A, N, axis=1, inplace=True) # not working either
How could you do that properly?

You can use a broadcasting operation such as:
A /= np.linalg.norm(A, axis=1)[:,None]
# or
A /= np.linalg.norm(A, axis=1).reshape(4,1)
both of which give the norm array a shape of (4,1) instead of (4,), so it broadcasts against A's shape (4,2).
But beware: A.dtype should be float64*, otherwise you will encounter this ufunc error when dividing in place:
A /= np.linalg.norm(A, axis=1)[:,None]
TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to
provided output parameter (typecode 'l') according to the casting
rule ''same_kind''
But doing it as follows will work, no matter the value of A.dtype:
A = A/np.linalg.norm(A, axis=1)[:,None]
*
To initialize the array with dtype float64 you can simply add a decimal point to one of the numbers:
A = np.array([[50.,14],[26,11],[81,9],[-11,-19]])
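As an aside (not part of the original answer), np.linalg.norm also accepts keepdims=True, which keeps the reduced axis as a length-1 dimension and saves the manual [:,None] or reshape step. A minimal sketch, continuing from the float64 A above (A_unit is just an illustrative name):
A_unit = A / np.linalg.norm(A, axis=1, keepdims=True) # norms have shape (4,1) and broadcast over (4,2)
np.linalg.norm(A_unit, axis=1) # array([1., 1., 1., 1.]) -- every row now has unit norm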
You can also use the normalize function from scikit-learn's preprocessing toolbox:
import sklearn
sklearn.__version__ # 0.24.2
from sklearn.preprocessing import normalize
normalize(A, norm="l2", axis=1)
array([[ 0.962964 , 0.2696299],
[ 0.9209673, 0.38964 ],
[ 0.9938837, 0.1104315],
[-0.5010363, -0.8654263]])
# as per the doc, you can set the copy flag to False to perform inplace row
# normalization and avoid a copy (if the input is already a numpy array or a
# scipy.sparse CSR matrix and if axis is 1):
normalize(A, norm="l2", axis=1, copy=False)
array([[ 0.962964 , 0.2696299],
[ 0.9209673, 0.38964 ],
[ 0.9938837, 0.1104315],
[-0.5010363, -0.8654263]])
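A quick check of that in-place behaviour (added here for illustration, assuming A is a float64 numpy array as noted above, so normalize does not need to copy or convert it):
import numpy as np
from sklearn.preprocessing import normalize
A = np.array([[50.,14],[26,11],[81,9],[-11,-19]]) # float64
normalize(A, norm="l2", axis=1, copy=False) # operates on A itself
np.linalg.norm(A, axis=1) # array([1., 1., 1., 1.]) -- A's rows were normalized in place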

Related

Changing dtype of view changes results of numpy functions without changing values of the array

I want the values of the np.mean function to be roughly the same, before and after the dtype change. The dtype has to remain float32.
array = np.random.randint(0, high=255, size=(3, 12000, 12000),dtype="int")
array = array[:,500:10000,500:10000]
array= array.reshape((-1,3))
# array.shape is now (90250000, 3)
print(np.mean(array,axis=0),array.dtype) # Nr.1
array = array.astype("float32")
print(np.mean(array,axis=0),array.dtype) # Nr.2
Results of the two print functions:
[127.003107 127.00156286 126.99015613] int32
[47.589664 47.589664 47.589664] float32
Adding a .copy() to the view line has no effect. The size of the view affects the impact on the float mean. Changing the size in both of the last dimensions to [500:8000] results in:
[76.35497 76.35497 76.35497] float32
Around [500:5000] and below, both means are actually about the same.
Changing the code starting from the reshape line:
array= array.reshape((-1,3))
array_float = array.astype("float32")
print(np.all(array_float==array),array.dtype,array_float.dtype)
Results in:
True int32 float32
So if the values are the same, why are the results from np.mean different ?
Your array:
In [50]: arr.shape, arr.dtype
Out[50]: ((90250000, 3), dtype('int32'))
You could have gotten this with np.random.randint(0, high=255, size=(90250000,3),dtype="int"). In fact we don't even need that size-3 dimension; it's just many numbers in the (0, 255) range.
The expected mean:
In [51]: np.mean(arr, axis=0)
Out[51]: array([126.9822936 , 126.99682718, 126.99214526])
But notice what we get if we just sum those numbers:
In [52]: np.sum(arr, axis=0)
Out[52]: array([-1424749891, -1423438235, -1423860778])
The int32 sum has overflowed and wrapped around. There are too many numbers. So mean must be doing something more sophisticated than simply summing and dividing by the count.
Taking mean on the float32 gives the funny values:
In [53]: np.mean(arr.astype('float32'), axis=0)
Out[53]: array([47.589664, 47.589664, 47.589664], dtype=float32)
but float64 matches the int case (at the cost of a longer conversion time):
In [54]: np.mean(arr.astype('float64'), axis=0)
Out[54]: array([126.9822936 , 126.99682718, 126.99214526])
It looks like the float mean is just doing the sum and divide method:
In [56]: np.sum(arr.astype('float64'), axis=0)
Out[56]: array([1.14601520e+10, 1.14614637e+10, 1.14610411e+10])
In [57]: np.sum(arr.astype('float32'), axis=0)
Out[57]: array([4.2949673e+09, 4.2949673e+09, 4.2949673e+09], dtype=float32)
In [58]: Out[56]/arr.shape[0]
Out[58]: array([126.9822936 , 126.99682718, 126.99214526])
In [59]: Out[57]/arr.shape[0]
Out[59]: array([47.58966533, 47.58966533, 47.58966533])
While the sum is within the range of float32:
In [60]: np.finfo('float32')
Out[60]: finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)
the problem is precision rather than range: float32 carries only about 7 significant digits, so once the running total reaches 2**32 the spacing between representable values is 512, and adding numbers smaller than 256 no longer changes it. The accumulation stalls right there (note that 4.2949673e+09 is exactly 2**32).
Note that the python sum has problems with the int version:
In [70]: sum(arr[:,0])
C:\Users\paul\AppData\Local\Temp\ipykernel_1128\1456076714.py:1: RuntimeWarning: overflow encountered in long_scalars
sum(arr[:,0])
Out[70]: -1424749891
There is a math.fsum that handles large sums better:
In [71]: math.fsum(arr[:,0])
Out[71]: 11460151997.0
Sum on the long ints also works fine:
In [72]: np.sum(arr.astype('int64'),axis=0)
Out[72]: array([11460151997, 11461463653, 11461041110], dtype=int64)
From the np.mean docs:
dtype : data-type, optional
Type to use in computing the mean. For integer inputs, the default
is `float64`; for floating point inputs, it is the same as the
input dtype.
Notes
-----
The arithmetic mean is the sum of the elements along the axis divided
by the number of elements.
Note that for floating-point input, the mean is computed using the
same precision the input has. Depending on the input data, this can
cause the results to be inaccurate, especially for `float32` (see
example below). Specifying a higher-precision accumulator using the
`dtype` keyword can alleviate this issue.
Playing with the dtype parameter:
In [74]: np.mean(arr, axis=0, dtype='int32')
Out[74]: array([-15, -15, -15])
In [75]: np.mean(arr, axis=0, dtype='int64')
Out[75]: array([126, 126, 126], dtype=int64)
In [76]: np.mean(arr, axis=0, dtype='float32')
Out[76]: array([47.589664, 47.589664, 47.589664], dtype=float32)
In [77]: np.mean(arr, axis=0, dtype='float64')
Out[77]: array([126.9822936 , 126.99682718, 126.99214526])
The -15 is explained by:
In [78]: -1424749891/arr.shape[0]
Out[78]: -15.786702393351801
In sum, if you want accurate results you need to use float64, either with the default mean dtype, or the appropriate astype. Working with float32 can give problems, especially with this many elements.
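A minimal sketch of those two routes (added here, not from the original transcript; the array is freshly generated and needs a few GB of RAM, so the exact means differ from the run above, but both stay near 127):
import numpy as np
arr32 = np.random.randint(0, high=255, size=(90250000, 3)).astype('float32')
m1 = arr32.mean(axis=0, dtype='float64') # keep float32 data, accumulate in float64
m2 = arr32.astype('float64').mean(axis=0) # convert the data itself first
print(m1, np.allclose(m1, m2)) # both near 127 and (up to rounding) equal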
Changing to "float64" solves the problem.
array = np.random.randint(0, high=255, size=(3, 12000, 12000),dtype="int")
array = array[:,500:10000,500:10000]
array= array.reshape((-1,3))
# array.shape is now (90250000, 3)
print(array.mean(axis=0),array.dtype) # Nr.1
array = array.astype("float64")
print(array.mean(axis=0),array.dtype) # Nr.2
Results in:
[126.98418438 126.9969912 127.00242922] int32
[126.98418438 126.9969912 127.00242922] float64

Compute KL divergence between rows of a matrix and a vector

I have a matrix (numpy 2d array) in which each row is a valid probability distribution. I have another vector (numpy 1d array), again a prob dist. I need to compute KL divergence between each row of the matrix and the vector. Is it possible to do this without using for loops?
This question asks the same thing, but none of the answers solve my problem. One of them suggests to use for loop which I want to avoid since I have large data. Another answer provides a solution in tensorflow, but I want for numpy arrays.
scipy.stats.entropy computes KL divergence between 2 vectors, but I couldn't get how to use it when one of them is a matrix.
The function scipy.stats.entropy can, in fact, do the vectorized calculation, but you have to reshape the arguments appropriately for it to work. When the inputs are two-dimensional arrays, entropy expects the columns to hold the probability vectors. In the case where p is two-dimensional and q is one-dimensional, a trivial dimension must be added to q to make the arguments compatible for broadcasting.
Here's an example. First, the imports:
In [10]: import numpy as np
In [11]: from scipy.stats import entropy
Create a two-dimensional p whose rows are the probability vectors, and a one-dimensional probability vector q:
In [12]: np.random.seed(8675309)
In [13]: p = np.random.rand(3, 5)
In [14]: p /= p.sum(axis=1, keepdims=True)
In [15]: q = np.random.rand(5)
In [16]: q /= q.sum()
In [17]: p
Out[17]:
array([[0.32085531, 0.29660176, 0.14113073, 0.07988999, 0.1615222 ],
[0.05870513, 0.15367858, 0.29585406, 0.01298657, 0.47877566],
[0.1914319 , 0.29324935, 0.1093297 , 0.17710131, 0.22888774]])
In [18]: q
Out[18]: array([0.06804561, 0.35392387, 0.29008139, 0.04580467, 0.24214446])
For comparison with the vectorized result, here's the result computed using a Python loop.
In [19]: [entropy(t, q) for t in p]
Out[19]: [0.32253909299531597, 0.17897138916539493, 0.2627905326857023]
To make entropy do the vectorized calculation, the columns of the first argument must be the probability vectors, so we'll transpose p. Then, to make q compatible with p.T, we'll reshape it into a two-dimensional array with shape (5, 1) (i.e. it contains a single column):
In [20]: entropy(p.T, q.reshape(-1, 1))
Out[20]: array([0.32253909, 0.17897139, 0.26279053])
Note: It is tempting to use q.T as the second argument, but that won't work. In NumPy, the transpose operation only swaps the lengths of existing dimensions--it never creates new dimensions. So the transpose of a one-dimensional array is itself. That is, q.T is the same shape as q.
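A quick shape check (added here for illustration) makes the point concrete:
q.shape, q.T.shape # ((5,), (5,)) -- transposing a 1-d array changes nothing
q.reshape(-1, 1).shape # (5, 1) -- a single-column 2-d array
q[:, np.newaxis].shape # (5, 1) -- an equivalent alternative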
Older version of this answer follows...
You can use scipy.special.kl_div or scipy.special.rel_entr to do this. Here's an example.
In [17]: import numpy as np
...: from scipy.stats import entropy
...: from scipy.special import kl_div, rel_entr
Make p and q for the example.
p has shape (3, 5); the rows are the probability distributions. q is a 1-d array with length 5.
In [18]: np.random.seed(8675309)
...: p = np.random.rand(3, 5)
...: p /= p.sum(axis=1, keepdims=True)
...: q = np.random.rand(5)
...: q /= q.sum()
This is the calculation that you want, using a Python loop and scipy.stats.entropy. I include this here so the result can be compared to the vectorized calculation below.
In [19]: [entropy(t, q) for t in p]
Out[19]: [0.32253909299531597, 0.17897138916539493, 0.2627905326857023]
We have constructed p and q so that the probability vectors each sum to 1. In this case, the above result can also be computed in a vectorized calculation with scipy.special.rel_entr or scipy.special.kl_div. (I recommend rel_entr. kl_div adds and subtracts additional terms that will ultimately cancel out in the sum, so it does a bit more work than necessary.) These functions compute only the point-wise part of the calculations; you have to sum the result to get the actual entropy or divergence.
In [20]: rel_entr(p, q).sum(axis=1)
Out[20]: array([0.32253909, 0.17897139, 0.26279053])
In [21]: kl_div(p, q).sum(axis=1)
Out[21]: array([0.32253909, 0.17897139, 0.26279053])
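For reference (added here): for positive entries, rel_entr(p, q) computes p*log(p/q) element-wise, while kl_div(p, q) computes p*log(p/q) - p + q. Because each row of p and the vector q sum to 1, the extra -p + q terms cancel in the row sums, which is why the two calls above agree. A quick check with the p and q defined above:
np.allclose(rel_entr(p, q), p * np.log(p / q)) # True
np.allclose(kl_div(p, q), p * np.log(p / q) - p + q) # True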

python difference between array(10,1) array(10,)

I'm trying to load MNIST dataset into arrays.
When I use
(X_train, y_train), (X_test, y_test)= mnist.load_data()
I get an array y_test(10000,) but I want it to be in the shape of (10000,1).
What is the difference between array(10000,1) and array(10000,)?
How can I convert the first array to the second array?
Your first array, with shape (10000,), is a 1-dimensional np.ndarray.
Since the shape attribute of numpy arrays is a tuple, and a tuple of length 1 needs a trailing comma, the shape is (10000,) and not (10000) (which would just be an int). So currently your data looks like this:
import numpy as np
a = np.arange(5) # >>> array([0, 1, 2, 3, 4])
print(a.shape) # >>> (5,)
What you want is a 2-dimensional array with shape (10000, 1).
Adding a dimension of length 1 doesn't require any additional data; it is basically an "empty" dimension. To add a dimension to an existing array you can use either np.expand_dims() or np.reshape().
Using np.expand_dims:
import numpy as np
b = np.array(np.arange(5)) # >>> array([0, 1, 2, 3, 4])
b = np.expand_dims(b, axis=1) # >>> array([[0],[1],[2],[3],[4]])
The function was specifically made for the purpose of adding empty dimensions to arrays. The axis keyword specifies which position the newly added dimension will occupy.
Using np.reshape:
import numpy as np
a = np.arange(5)
X_test_reshaped = np.reshape(a, (-1, 1)) # >>> array([[0],[1],[2],[3],[4]])
The (-1, 1) specifies what the new shape should look like after the reshape operation. The -1 itself will be replaced internally by numpy with whatever length 'fits the data'.
Reshape is a more powerful function than expand_dims and can be used in many different ways. You can read more about its other uses in the numpy docs for numpy.reshape().
An array with shape (10,1) is a 2D array with a single column.
An array with shape (10,) is a 1D array.
To convert (10,1) to (10,), you simply drop the column dimension. For example, take an x array with x.shape = (10,1); using x[:, 0] (or x.ravel()) you collapse the column and get x[:, 0].shape = (10,).
To convert (10,) to (10,1), you can add a dimension by using np.newaxis. So, after import numpy as np, take a y array with y.shape = (10,); using y[:, np.newaxis] you get a new array with shape (10,1).
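A short round-trip sketch of these conversions (added for illustration):
import numpy as np
y = np.arange(10) # shape (10,)
y_col = y[:, np.newaxis] # shape (10, 1); same as y.reshape(-1, 1)
y_flat = y_col[:, 0] # shape (10,); y_col.ravel() or np.squeeze(y_col) also work
y.shape, y_col.shape, y_flat.shape # ((10,), (10, 1), (10,))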

ValueError: operands could not be broadcast together with shapes - inverse_transform- Python

I know the ValueError question has been asked many times. I am still struggling to find an answer because I am using inverse_transform in my code.
Say I have an array a
a.shape
> (100,20)
and another array b
b.shape
> (100,3)
When I did a np.concatenate,
hat = np.concatenate((a, b), axis=1)
Now shape of hat is
hat.shape
(100,23)
After this, I tried to do this,
inversed_hat = scaler.inverse_transform(hat)
When I do this, I am getting an error:
ValueError: operands could not be broadcast together with shapes (100,23) (25,) (100,23)
Is this broadcast error in inverse_transform? Any suggestion will be helpful. Thanks in advance!
Although you didn't specify, I'm assuming you are using inverse_transform() from one of scikit-learn's scalers (such as StandardScaler or MinMaxScaler). You need to fit the scaler on the data first.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
In [1]: arr_a = np.random.randn(5*3).reshape((5, 3))
In [2]: arr_b = np.random.randn(5*2).reshape((5, 2))
In [3]: arr = np.concatenate((arr_a, arr_b), axis=1)
In [4]: scaler = MinMaxScaler(feature_range=(0, 1)).fit(arr)
In [5]: scaler.inverse_transform(arr)
Out[5]:
array([[ 0.19981115, 0.34855509, -1.02999482, -1.61848816, -0.26005923],
[-0.81813499, 0.09873672, 1.53824716, -0.61643731, -0.70210801],
[-0.45077786, 0.31584348, 0.98219019, -1.51364126, 0.69791054],
[ 0.43664741, -0.16763207, -0.26148908, -2.13395823, 0.48079204],
[-0.37367434, -0.16067958, -3.20451107, -0.76465428, 1.09761543]])
In [6]: new_arr = scaler.inverse_transform(arr)
In [7]: new_arr.shape == arr.shape
Out[7]: True
It seems you are using a pre-fitted scaler object from sklearn.preprocessing.
If that is the case, the data you used for fitting had shape (x, 25), whereas the data you are passing to inverse_transform has shape (x, 23), and that mismatch in the number of features is why you are getting this error.
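A minimal sketch reproducing that diagnosis (added for illustration; the 25 vs. 23 feature counts mirror the question, and the exact message can vary between scikit-learn versions):
import numpy as np
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(np.random.randn(100, 25)) # fitted on 25 features
hat = np.random.randn(100, 23) # but only 23 features are passed back
try:
    scaler.inverse_transform(hat)
except ValueError as e:
    print(e) # the 25-feature fit vs. 23-feature input mismatch triggers the error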

Numpy: Inverse Transforming Different Size Array

I'm trying to get the hang of normalizing my data, doing some work on it and then changing it back. When doing an inverse_transform do I have to always pass in the exact same shape as it was when I did a fit_transform? The code below will give me a "non-broadcastable output operand with shape (3,1) doesn't match the broadcast shape (3,3)"
import numpy as np
from sklearn.preprocessing import MinMaxScaler
first = np.array([[ 1.2345, 1.220000,1.26245],
[ 1.234,1.220000,7.0901],
[ 1.23450,1.22000,1.14795]])
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(first)
new_dataset = dataset[:,:1]
trainPredict2 = scaler.inverse_transform(new_dataset)
You don't have to pass a data set with exactly the same shape, but the number of columns must match the original data set, since each row is interpreted as a record and each column as a feature. Technically, you cannot omit features from the data set you pass in. So, for instance, slicing the rows will still work:
new_dataset = dataset[:1,:]
trainPredict2 = scaler.inverse_transform(new_dataset)
This gives back the first row of your original data set:
trainPredict2
# array([[ 1.2345 , 1.22 , 1.26245]])
If you really want to invert just one or two features, you can do it yourself by inverting the min-max transformation x' = (x - x_min) / (x_max - x_min), i.e. x = x' * (x_max - x_min) + x_min:
scaler.data_range_[:1] * dataset[:,:1] + scaler.data_min_[:1]
# array([[ 1.2345],
# [ 1.234 ],
# [ 1.2345]])
This gives back the first column of your original data set.
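The same idea works for any subset of columns (a sketch added here, continuing the example above; cols is just an illustrative choice). For the default feature_range=(0, 1), data_range_ and data_min_ are the attributes MinMaxScaler stores during fit:
cols = [0, 2]
scaler.data_range_[cols] * dataset[:, cols] + scaler.data_min_[cols]
# gives back columns 0 and 2 of the original data set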
