Standardize Pixel Input Data with many Zeros - python

I want to standardize my input data for a neural network.
Data looks like this:
data= np.array([[0,0,0,0,233,2,0,0,0],[0,0,0,23,50,2,0,0,0],[0,0,0,0,3,20,3,0,0]])
This is the function I used. It doesn't work because some columns are all zeros, so their standard deviation is zero and the division produces NaNs.
def standardize(data):  # expects a pandas DataFrame
    _, c = data.shape
    data_standardized = data.copy(deep=True)
    for j in range(c):
        x = data_standardized.iloc[:, j]
        avg = x.mean()
        std = x.std()
        x_standardized = (x - avg) / std
        data_standardized.iloc[:, j] = x_standardized
    return data_standardized

Use boolean indexing to avoid dividing by zero:
In [90]: data= np.array([[0,0,0,0,233,2,0,0,0],[0,0,0,23,50,2,0,0,0],[0,0,0,0,3,20,3,0,0]])
In [91]: new = np.zeros(data.shape)
In [92]: m = data.mean(0)
In [93]: std = data.std(0)
In [94]: r = data-m
In [95]: new[:,std.nonzero()] = r[:,std.nonzero()]/std[std.nonzero()]
In [96]: new
Out[96]:
array([[ 0. , 0. , 0. , -0.70710678, 1.3875163 ,
-0.70710678, -0.70710678, 0. , 0. ],
[ 0. , 0. , 0. , 1.41421356, -0.45690609,
-0.70710678, -0.70710678, 0. , 0. ],
[ 0. , 0. , 0. , -0.70710678, -0.9306102 ,
1.41421356, 1.41421356, 0. , 0. ]])
Or use sklearn.preprocessing.StandardScaler.
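A minimal StandardScaler sketch for reference; as far as I know, StandardScaler replaces a zero standard deviation with 1 internally, so all-zero columns simply come out as zeros instead of NaNs:
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0, 0, 0, 233, 2, 0, 0, 0],
                 [0, 0, 0, 23, 50, 2, 0, 0, 0],
                 [0, 0, 0, 0, 3, 20, 3, 0, 0]])
scaled = StandardScaler().fit_transform(data)  # zero-variance columns stay all zeros
print(scaled)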
Your function refactored:
def standardize(data):  # expects a pandas DataFrame
    data = data.values
    new = np.zeros(data.shape)
    m = data.mean(0)
    std = data.std(0)
    r = data - m
    new[:, std.nonzero()] = r[:, std.nonzero()] / std[std.nonzero()]
    return pd.DataFrame(new)
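A quick usage sketch for the refactored function, assuming the imports it relies on:
import numpy as np
import pandas as pd

data = pd.DataFrame([[0, 0, 0, 0, 233, 2, 0, 0, 0],
                     [0, 0, 0, 23, 50, 2, 0, 0, 0],
                     [0, 0, 0, 0, 3, 20, 3, 0, 0]])
print(standardize(data))  # all-zero columns stay zero, the rest are standardized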

Related

A robust way to keep the n-largest elements in rows or columns of a matrix

I would like to make a sparse matrix from a dense one, such that in each row or column only the n largest elements are preserved. I do the following:
import numpy as np
import scipy.sparse as spsp

def sparsify(K, min_nnz=5):
    '''
    Eliminates elements that are smaller than the min_nnz-th largest
    element in both their row and their column.
    Parameters
    ----------
    K : ndarray
        the input matrix
    min_nnz : int
        the minimal number of elements in a row or column to be preserved
    '''
    cond = np.bitwise_or(K >= -np.partition(-K, min_nnz - 1, axis=1)[:, min_nnz - 1][:, None],
                         K >= -np.partition(-K, min_nnz - 1, axis=0)[min_nnz - 1, :][None, :])
    return spsp.csr_matrix(np.where(cond, K, 0))
This approach works as intended, but it seems to be neither the most efficient nor the most robust one. What would you recommend as a better way to do it?
The example of usage:
A = np.random.rand(10, 10)
A_sp = sparsify(A, min_nnz = 3)
Instead of making another dense matrix, you can use coo_matrix to build the result from only the values you need:
return spsp.coo_matrix((K[cond], np.where(cond)), shape = K.shape)
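For context, coo_matrix accepts a (values, (row_indices, col_indices)) pair, so no second dense matrix is ever allocated; a tiny self-contained sketch with an illustrative threshold mask:
import numpy as np
import scipy.sparse as spsp

K = np.array([[0.9, 0.1],
              [0.2, 0.8]])
cond = K >= 0.5                                             # boolean mask of entries to keep
sp = spsp.coo_matrix((K[cond], np.where(cond)), shape=K.shape)
print(sp.toarray())                                         # [[0.9 0. ], [0.  0.8]]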
As for the rest, you can maybe short-circuit the second dimension, but your time savings will depend entirely on your inputs:
def sparsify(K, min_nnz=5):
    '''
    Eliminates elements that are smaller than the min_nnz-th largest
    element in both their row and their column.
    Parameters
    ----------
    K : ndarray
        the input matrix
    min_nnz : int
        the minimal number of elements in a row or column to be preserved
    '''
    cond = K >= -np.partition(-K, min_nnz - 1, axis=0)[min_nnz - 1, :]
    mask = cond.sum(1) < min_nnz
    cond[mask] = np.bitwise_or(cond[mask],
                               K[mask] >= -np.partition(-K[mask],
                                                        min_nnz - 1,
                                                        axis=1)[:, min_nnz - 1][:, None])
    return spsp.coo_matrix((K[cond], np.where(cond)), shape=K.shape)
Testing:
sparsify(A)
Out[]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 58 stored elements in COOrdinate format>
sparsify(A).A
Out[]:
array([[0. , 0. , 0.61362248, 0. , 0.73648987,
0.64561856, 0.40727807, 0.61674005, 0.53533315, 0. ],
[0.8888361 , 0.64548039, 0.94659603, 0.78474203, 0. ,
0. , 0.78809603, 0.88938798, 0. , 0.37631541],
[0.69356682, 0. , 0. , 0. , 0. ,
0.7386594 , 0.71687659, 0.67750768, 0.58002451, 0. ],
[0.67241433, 0.71923718, 0.95888737, 0. , 0. ,
0. , 0.82773085, 0.69788448, 0.63736915, 0.4263064 ],
[0. , 0.65831794, 0. , 0. , 0.59850093,
0. , 0. , 0.61913869, 0.65024867, 0.50860294],
[0.75522891, 0. , 0.93342402, 0.8284258 , 0.64471939,
0.6990814 , 0. , 0. , 0. , 0.32940821],
[0. , 0.88458635, 0.62460096, 0.60412265, 0.66969674,
0. , 0.40318741, 0. , 0. , 0.44116059],
[0. , 0. , 0.500971 , 0.92291245, 0. ,
0.8862903 , 0. , 0.375885 , 0.49473635, 0. ],
[0.86920647, 0.85157893, 0.89883006, 0. , 0.68427193,
0.91195162, 0. , 0. , 0.94762875, 0. ],
[0. , 0.6435456 , 0. , 0.70551006, 0. ,
0.8075527 , 0. , 0.9421039 , 0.91096934, 0. ]])
sparsify(A).A.astype(bool).sum(0)
Out[]: array([5, 6, 7, 5, 5, 6, 5, 7, 7, 5])
sparsify(A).A.astype(bool).sum(1)
Out[]: array([6, 7, 5, 7, 5, 6, 6, 5, 6, 5])

numpy array with diagonal equal to zero and [x,y] =-[y,x]

I want to create an N x N array in numpy such that the diagonal is zero and [x,y] = -[y,x].
For example:
np.array([[  0, 12,  2],
          [-12,  0,  3],
          [ -2, -3,  0]])
The values inside the array can be any float.
One way would be with scipy.spatial.distance.squareform -
import numpy as np
from scipy.spatial.distance import squareform

def diag_inverted(n):
    l = n*(n-1)//2
    out = squareform(np.random.randn(l))
    out[np.tri(len(out), k=-1, dtype=bool)] *= -1
    return out
Another with array-assignment and masking -
def diag_inverted_v2(n):
    l = n*(n-1)//2
    m = np.tri(n, k=-1, dtype=bool)
    out = np.zeros((n, n), dtype=float)
    out[m] = np.random.randn(l)
    out[m.T] = -out.T[m.T]
    return out
Sample runs -
In [148]: diag_inverted(2)
Out[148]:
array([[ 0. , -0.97873798],
[ 0.97873798, 0. ]])
In [149]: diag_inverted(3)
Out[149]:
array([[ 0. , -2.2408932 , -1.86755799],
[ 2.2408932 , 0. , 0.97727788],
[ 1.86755799, -0.97727788, 0. ]])
In [150]: diag_inverted(4)
Out[150]:
array([[ 0. , -0.95008842, 0.15135721, -0.4105985 ],
[ 0.95008842, 0. , 0.10321885, -0.14404357],
[-0.15135721, -0.10321885, 0. , -1.45427351],
[ 0.4105985 , 0.14404357, 1.45427351, 0. ]])
Here you go:
size = 3
a = np.random.normal(0, 1, (size, size))
ret = (a - a.transpose()) / 2
Output (random):
array([[ 0. , 0.11872306, 0.46792054],
[-0.11872306, 0. , 0.12530741],
[-0.46792054, -0.12530741, 0. ]])
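Either way, the result can be sanity-checked: a matrix built this way should be skew-symmetric (ret[x, y] == -ret[y, x]) with a zero diagonal. A minimal check:
import numpy as np

size = 3
a = np.random.normal(0, 1, (size, size))
ret = (a - a.transpose()) / 2
assert np.allclose(ret, -ret.T)        # [x, y] == -[y, x]
assert np.allclose(np.diag(ret), 0)    # zero diagonal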

bool masking in numpy array matrices

I have the following program:
import numpy as np
arr = np.random.randn(3,4)
print(arr)
regArr = (arr > 0.8)
print(regArr)
print(arr[regArr].reshape(arr.shape))
output:
[[ 0.37182134 1.4807685 0.11094223 0.34548185]
[ 0.14857641 -0.9159358 -0.37933393 -0.73946522]
[ 1.01842304 -0.06714827 -1.22557205 0.45600827]]
I am looking for an output where values in arr greater than 0.8 are kept and all other values are set to zero.
I tried boolean masking as shown above, but I am not able to solve this. Kindly help.
I'm not entirely sure what exactly you want to achieve, but this is what I did to filter.
arr = np.random.randn(3,4)
array([[-0.04790508, -0.71700005, 0.23204224, -0.36354634],
[ 0.48578236, 0.57983561, 0.79647091, -1.04972601],
[ 1.15067885, 0.98622772, -0.7004639 , -1.28243462]])
arr[arr < 0.8] = 0
array([[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[1.15067885, 0.98622772, 0. , 0. ]])
Thanks to user3053452, I have added one more solution in which the original data is not changed.
arr = np.random.randn(3,4)
array([[ 0.4297907 , 0.38100702, 0.30358291, -0.71137138],
[ 1.15180635, -1.21251676, 0.04333404, 1.81045931],
[ 0.17521058, -1.55604971, 1.1607159 , 0.23133528]])
new_arr = np.where(arr < 0.8, 0, arr)
array([[0. , 0. , 0. , 0. ],
[1.15180635, 0. , 0. , 1.81045931],
[0. , 0. , 1.1607159 , 0. ]])
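A third equivalent route, if you prefer an explicit copy over np.where, is to copy first and then mask in place (a minimal sketch):
import numpy as np

arr = np.random.randn(3, 4)
new_arr = arr.copy()           # leave the original array untouched
new_arr[new_arr < 0.8] = 0     # zero out everything below the threshold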

Keep First Non-Zero Element, Set All Others to 0

I have a 2-d NumPy array that looks like this:
array([[0. , 0. , 0.2, 0.2],
[0.3, 0. , 0.3, 0. ]])
I'd like to modify it so that each row consists of all 0's, except for the first non-zero entry. If it's all 0s to start with, we don't change anything.
I could do this:
example = np.array([[0, 0, 0.2, 0.2], [0.3, 0, 0.3, 0]])
my_copy = np.zeros_like(example)
for i, row in enumerate(example):
    for j, elem in enumerate(row):
        if elem > 0:
            my_copy[i, j] = elem
            break
But that's ugly and not vectorized. Any suggestions for how to vectorize this?
Thanks!
Here's a vectorised solution. The trick is to calculate your first non-zero entries via bool conversion and argmax.
import numpy as np
A = np.array([[0. , 0. , 0.2, 0.2],
[0.3, 0. , 0.3, 0. ],
[0. , 0. , 0. , 0. ]])
res = np.zeros(A.shape)
idx = np.arange(res.shape[0])
args = A.astype(bool).argmax(1)
res[idx, args] = A[idx, args]
print(res)
array([[ 0. , 0. , 0.2, 0. ],
[ 0.3, 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
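One subtlety: for an all-zero row, astype(bool).argmax(1) returns 0 (the argmax of an all-False row), but since A[i, 0] is itself 0 the assignment just writes a zero back, so all-zero rows stay all zeros, as the third row above shows. A quick check:
print(np.array([[0., 0., 0.]]).astype(bool).argmax(1))   # [0]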
Simply:
e = np.zeros(example.shape)
rows = np.arange(example.shape[0])
cols = np.argmax(example != 0, 1)
e[rows, cols] = example[rows, cols]
Setup
x = np.array([[0. , 0. , 0.2, 0.2],
[0.3, 0. , 0.3, 0. ],
[0. , 0. , 0. , 0. ]])
Using logical_and with np.eye:
m = (x!=0).argmax(1)
x[~np.logical_and(x, np.eye(x.shape[1])[m])] = 0
Output:
array([[0. , 0. , 0.2, 0. ],
[0.3, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ]])
This method will be slightly slower than the other two suggested.
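If you want to verify that on your own arrays, a rough timing harness along these lines works (the function names are just illustrative wrappers around two of the answers above):
import numpy as np
from timeit import timeit

def via_argmax(a):
    res = np.zeros(a.shape)
    idx = np.arange(a.shape[0])
    args = a.astype(bool).argmax(1)
    res[idx, args] = a[idx, args]
    return res

def via_eye(a):
    a = a.copy()                     # this approach mutates its input, so copy first
    m = (a != 0).argmax(1)
    a[~np.logical_and(a, np.eye(a.shape[1])[m])] = 0
    return a

a = np.random.rand(1000, 1000)
print(timeit(lambda: via_argmax(a), number=10))
print(timeit(lambda: via_eye(a), number=10))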

How to randomly select some non-zero elements from a numpy.ndarray?

I've implemented a matrix factorization model, say R = U*V, and now I would like to train and test it.
To this end, given a sparse matrix R (zero meaning a missing value), I want to first hide some non-zero elements during training and use those elements as a test set later.
How can I randomly select some non-zero elements from a numpy.ndarray? I also need to remember the row and column positions of the selected elements so I can use them for testing.
for example:
In [2]: import numpy as np
In [4]: mtr = np.random.rand(10,10)
In [5]: mtr
Out[5]:
array([[ 0.92685787, 0.95496193, 0.76878455, 0.12304856, 0.13804963,
0.30867502, 0.60245974, 0.00797898, 0.1060602 , 0.98277982],
[ 0.88879888, 0.40209901, 0.35274404, 0.73097713, 0.56238248,
0.380625 , 0.16432029, 0.5383006 , 0.0678564 , 0.42875591],
[ 0.42343761, 0.31957986, 0.5991212 , 0.04898903, 0.2908878 ,
0.13160296, 0.26938537, 0.91442668, 0.72827097, 0.4511198 ],
[ 0.63979934, 0.33421621, 0.09218392, 0.71520048, 0.57100522,
0.37205284, 0.59726293, 0.58224992, 0.58690505, 0.4791199 ],
[ 0.35219557, 0.34954002, 0.93837312, 0.2745864 , 0.89569075,
0.81244084, 0.09661341, 0.80673646, 0.83756759, 0.7948081 ],
[ 0.09173706, 0.86250006, 0.22121994, 0.21097563, 0.55090202,
0.80954817, 0.97159981, 0.95888693, 0.43151554, 0.2265607 ],
[ 0.00723128, 0.95690539, 0.94214806, 0.01721733, 0.12552314,
0.65977765, 0.20845669, 0.44663729, 0.98392716, 0.36258081],
[ 0.65994805, 0.47697842, 0.35449045, 0.73937445, 0.68578224,
0.44278095, 0.86743906, 0.5126411 , 0.75683392, 0.73354572],
[ 0.4814301 , 0.92410622, 0.85267402, 0.44856078, 0.03887269,
0.48868498, 0.83618382, 0.49404473, 0.37328248, 0.18134919],
[ 0.63999748, 0.48718656, 0.54826717, 0.1001681 , 0.1940816 ,
0.3937014 , 0.48768013, 0.70610649, 0.03213063, 0.88371607]])
In [6]: mtr = np.where(mtr>0.5, 0, mtr)
In [8]: mtr
Out[8]:
array([[ 0. , 0. , 0. , 0.12304856, 0.13804963,
0.30867502, 0. , 0.00797898, 0.1060602 , 0. ],
[ 0. , 0.40209901, 0.35274404, 0. , 0. ,
0.380625 , 0.16432029, 0. , 0.0678564 , 0.42875591],
[ 0.42343761, 0.31957986, 0. , 0.04898903, 0.2908878 ,
0.13160296, 0.26938537, 0. , 0. , 0.4511198 ],
[ 0. , 0.33421621, 0.09218392, 0. , 0. ,
0.37205284, 0. , 0. , 0. , 0.4791199 ],
[ 0.35219557, 0.34954002, 0. , 0.2745864 , 0. ,
0. , 0.09661341, 0. , 0. , 0. ],
[ 0.09173706, 0. , 0.22121994, 0.21097563, 0. ,
0. , 0. , 0. , 0.43151554, 0.2265607 ],
[ 0.00723128, 0. , 0. , 0.01721733, 0.12552314,
0. , 0.20845669, 0.44663729, 0. , 0.36258081],
[ 0. , 0.47697842, 0.35449045, 0. , 0. ,
0.44278095, 0. , 0. , 0. , 0. ],
[ 0.4814301 , 0. , 0. , 0.44856078, 0.03887269,
0.48868498, 0. , 0.49404473, 0.37328248, 0.18134919],
[ 0. , 0.48718656, 0. , 0.1001681 , 0.1940816 ,
0.3937014 , 0.48768013, 0. , 0.03213063, 0. ]])
Given such a sparse ndarray, how can I select 20% of the non-zero elements and remember their positions?
We'll use numpy.random.choice. First, we get arrays of the (i,j) indices where the data is nonzero:
i,j = np.nonzero(x)
Then we'll select 20% of these:
ix = np.random.choice(len(i), int(np.floor(0.2 * len(i))), replace=False)
Here ix is a list of random, unique indices, 20% the length of i and j (the length of i and j is the number of nonzero entries). To recover the indices, we do i[ix] and j[ix], so we can then select 20% of the nonzero entries of x by writing:
print(x[i[ix], j[ix]])
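Putting this together for the train/test split described in the question, a minimal sketch (variable names are illustrative): hide the sampled entries to form the training matrix and keep their positions and values as the test set:
import numpy as np

x = np.random.rand(10, 10)
x[x > 0.5] = 0                        # sparsify: zeros stand for missing values

i, j = np.nonzero(x)                  # positions of all observed entries
ix = np.random.choice(len(i), int(np.floor(0.2 * len(i))), replace=False)

test_rows, test_cols = i[ix], j[ix]
test_vals = x[test_rows, test_cols]   # held-out values for evaluation

train = x.copy()
train[test_rows, test_cols] = 0       # hide the test entries during training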
