I have the following program:
import numpy as np
arr = np.random.randn(3,4)
print(arr)
regArr = (arr > 0.8)
print(regArr)
print(arr[regArr].reshape(arr.shape))
output:
[[ 0.37182134 1.4807685 0.11094223 0.34548185]
[ 0.14857641 -0.9159358 -0.37933393 -0.73946522]
[ 1.01842304 -0.06714827 -1.22557205 0.45600827]]
I am looking for an output array in which values greater than 0.8 are kept and all other values are zero.
I tried boolean masking as shown above, but I am not able to solve this. Kindly help.
I'm not entirely sure what exactly you want to achieve, but this is what I did to filter.
arr = np.random.randn(3,4)
array([[-0.04790508, -0.71700005, 0.23204224, -0.36354634],
[ 0.48578236, 0.57983561, 0.79647091, -1.04972601],
[ 1.15067885, 0.98622772, -0.7004639 , -1.28243462]])
arr[arr < 0.8] = 0
array([[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[1.15067885, 0.98622772, 0. , 0. ]])
Thanks to user3053452, I have added one more solution in which the original data is not changed.
arr = np.random.randn(3,4)
array([[ 0.4297907 , 0.38100702, 0.30358291, -0.71137138],
[ 1.15180635, -1.21251676, 0.04333404, 1.81045931],
[ 0.17521058, -1.55604971, 1.1607159 , 0.23133528]])
new_arr = np.where(arr < 0.8, 0, arr)
array([[0. , 0. , 0. , 0. ],
[1.15180635, 0. , 0. , 1.81045931],
[0. , 0. , 1.1607159 , 0. ]])
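For completeness, a third sketch that sticks with the boolean mask from the question: multiplying the array by the mask zeroes out everything below the cutoff and also leaves the original data unchanged (this assumes the same 0.8 threshold as above).
new_arr = arr * (arr >= 0.8)  # the boolean mask acts as 1/0, so values below 0.8 become 0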
I have a numpy array as follows:
array([0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0.00791667, 0. , 0. , 0. , 0. ,
0. , 0.06837452, 0.09166667, 0.00370881, 0. ,
0. , 0.00489809, 0. , 0. , 0. ,
0. , 0. , 0.23888889, 0. , 0.05927778,
0.12138889, 0. , 0. , 0. , 0.36069444,
0.31711111, 0.16333333, 0.15005556, 0.01 , 0.005 ,
0.14357413, 0. , 0.15722222, 0.29494444, 0.3245 ,
0.31276639, 0.095 , 0.04750292, 0.09127039, 0. ,
0.06847222, 0.17 , 0.18039233, 0.21567804, 0.15913079,
0.4579781 , 0. , 0.2459 , 0.14886556, 0.08447222,
0. , 0.13722222, 0.28336984, 0.0725 , 0.077355 ,
0.45166391, 0. , 0.24892933, 0.25360062, 0. ,
0.12923041, 0.16145892, 0.48771795, 0.38527778, 0.29432968,
0.31983305, 1.07573089, 0.30611111, 0. , 0.0216475 ,
0. , 0.62268056, 0.16829156, 0.46239719, 0.6415958 ,
0.02138889, 0.76457155, 0.05711551, 0.35050949, 0.34856278,
0.15686164, 0.23158889, 0.16593262, 0.34961111, 0.21247575,
0.14116667, 0.19414785, 0.09166667, 0.93376627, 0.12772222,
0.00366667, 0.10297222, 0.173 , 0.0381225 , 0.22441667,
0.46686111, 0.18761111, 0.56037889, 0.47566111])
From this array, I need to calculate the area under the curve for each sub-array whose first value is 0, which then goes above 0, and whose last value is the 0 that follows a non-zero number. Obviously the array lengths will vary. It may also happen that two of these sub-arrays share a 0 value (the last 0 of the first array will be the first 0 of the second array).
The expected first two arrays should be:
[0. , 0.00791667, 0. ]
[0. , 0.06837452, 0.09166667, 0.00370881, 0. ]
I've tried searching for ways of splitting Python lists based on an element being equal to 0, but haven't found anything useful. What can I do?
See the code below - I think this is about as efficient as you'll be able to do it.
First, split the array using the indices of all of the zeroes. Where multiple zeroes are together, this produces several [ 0. ] arrays, so filter those out (based on length, as all arrays must necessarily begin with a zero) to produce C. Finally, since they all begin with zero, but none end with zero, append a zero to each array.
import numpy as np
# <Your array here>
A = np.array(...)
# Split into arrays based on zeroes
B = np.split(A, np.where(A == 0)[0])
# Filter out arrays of length 1
# (just a zero, caused by multiple zeroes together)
f = np.vectorize(lambda a: len(a) > 1)
C = np.extract(f(B), B)
# Append a zero to each array
g = np.vectorize(lambda a: np.append(a, 0), otypes=[object])
D = g(C)
# Output result
for array in D:
    print(array)
This gives the following output:
[ 0. 0.00791667 0. ]
[ 0. 0.06837452 0.09166667 0.00370881 0. ]
[ 0. 0.00489809 0. ]
[ 0. 0.23888889 0. ]
[ 0. 0.05927778 0.12138889 0. ]
[ 0. 0.36069444 0.31711111 0.16333333 0.15005556 0.01 0.005
0.14357413 0. ]
[ 0. 0.15722222 0.29494444 0.3245 0.31276639 0.095
0.04750292 0.09127039 0. ]
[ 0. 0.06847222 0.17 0.18039233 0.21567804 0.15913079
0.4579781 0. ]
[ 0. 0.2459 0.14886556 0.08447222 0. ]
[ 0. 0.13722222 0.28336984 0.0725 0.077355 0.45166391
0. ]
[ 0. 0.24892933 0.25360062 0. ]
[ 0. 0.12923041 0.16145892 0.48771795 0.38527778 0.29432968
0.31983305 1.07573089 0.30611111 0. ]
[ 0. 0.0216475 0. ]
[ 0. 0.62268056 0.16829156 0.46239719 0.6415958 0.02138889
0.76457155 0.05711551 0.35050949 0.34856278 0.15686164 0.23158889
0.16593262 0.34961111 0.21247575 0.14116667 0.19414785 0.09166667
0.93376627 0.12772222 0.00366667 0.10297222 0.173 0.0381225
0.22441667 0.46686111 0.18761111 0.56037889 0.47566111 0. ]
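To finish the area-under-the-curve part of the question, a minimal sketch is to apply np.trapz to each sub-array produced above (this assumes unit spacing between samples; pass dx or x to np.trapz if your points are spaced differently):
# Area under each sub-array, assuming unit spacing between points
areas = [np.trapz(segment) for segment in D]
print(areas)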
I am trying to create anti-aliased (weighted and not boolean) circular masks for making circular kernels for use in convolution.
radius = 3  # no. of pixels to be 1 on either side of the center pixel;
            # can be a decimal as well; not the true geometric radius
kernel_size = 9
kernel_radius = (kernel_size - 1) // 2
x, y = np.ogrid[-kernel_radius:kernel_radius+1, -kernel_radius:kernel_radius+1]
dist = ((x**2+y**2)**0.5)
mask = (dist-radius).clip(0,1)
print(mask)
and the output is
array([[1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. ],
[1. , 1. , 0.61, 0.16, 0. , 0.16, 0.61, 1. , 1. ],
[1. , 0.61, 0. , 0. , 0. , 0. , 0. , 0.61, 1. ],
[1. , 0.16, 0. , 0. , 0. , 0. , 0. , 0.16, 1. ],
[1. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. ],
[1. , 0.16, 0. , 0. , 0. , 0. , 0. , 0.16, 1. ],
[1. , 0.61, 0. , 0. , 0. , 0. , 0. , 0.61, 1. ],
[1. , 1. , 0.61, 0.16, 0. , 0.16, 0.61, 1. , 1. ],
[1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 1. ]])
Then we can do
mask = 1 - mask
print(mask)
to get
array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0.39, 0.84, 1. , 0.84, 0.39, 0. , 0. ],
[0. , 0.39, 1. , 1. , 1. , 1. , 1. , 0.39, 0. ],
[0. , 0.84, 1. , 1. , 1. , 1. , 1. , 0.84, 0. ],
[0. , 1. , 1. , 1. , 1. , 1. , 1. , 1. , 0. ],
[0. , 0.84, 1. , 1. , 1. , 1. , 1. , 0.84, 0. ],
[0. , 0.39, 1. , 1. , 1. , 1. , 1. , 0.39, 0. ],
[0. , 0. , 0.39, 0.84, 1. , 0.84, 0.39, 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
I can now normalize and use this as my circular filter (kernel) in convolution operations.
Note: the radius can be a decimal. E.g., get_circular_kernel(0.5, (5,5)) should give
array([[0. , 0. , 0. , 0. , 0. ],
[0. , 0.08578644, 0.5 , 0.08578644, 0. ],
[0. , 0.5 , 1. , 0.5 , 0. ],
[0. , 0.08578644, 0.5 , 0.08578644, 0. ],
[0. , 0. , 0. , 0. , 0. ]])
I want to generate at least a million of these, with kernel_size fixed and the radius changing, so is there a better or more efficient way to do this? (Maybe without costly operations like sqrt, while still staying accurate enough with respect to arc integrals, i.e., the area covered by the curve within each particular pixel?)
Since you want to generate a large number of kernels with the same size, you can greatly improve performance by constructing every kernel in one step rather than one after the other in a loop. You can create a single array of shape (num_radii, kernel_size, kernel_size) given num_radii radius values, one for each kernel. The price of this vectorization is memory: you'll have to fit all these values in RAM; otherwise you should chunk up your millions of radii into a handful of smaller batches and generate each batch separately.
The only thing you need to change is to take an array of radii (rather than a scalar radius), and inject two trailing singleton dimensions so that your mask creation triggers broadcasting:
import numpy as np
kernel_size = 9
kernel_radius = (kernel_size - 1) // 2
x, y = np.ogrid[-kernel_radius:kernel_radius+1, -kernel_radius:kernel_radius+1]
dist = (x**2 + y**2)**0.5 # shape (kernel_size, kernel_size)
# let's create three kernels for the sake of example
radii = np.array([3, 3.5, 4])[...,None,None] # shape (num_radii, 1, 1)
# using ... allows compatibility with arbitrarily-shaped radius arrays
masks = 1 - (dist - radii).clip(0,1) # shape (num_radii, kernel_size, kernel_size)
Now masks[0,...] (or masks[0] for short, but I prefer the explicit version) contains the example mask in your question, and masks[1,...] and masks[2,...] contain the kernels for radii 3.5 and 4, respectively.
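If the full (num_radii, kernel_size, kernel_size) array does not fit in RAM, here is a minimal sketch of the batching idea mentioned above; the million random radii and the chunk size of 10000 are arbitrary choices for illustration, and dist is the precomputed distance grid from the snippet above.
all_radii = np.random.uniform(1, 4, size=1000000)  # example radii, replace with your own
chunk = 10000
for start in range(0, len(all_radii), chunk):
    batch = all_radii[start:start + chunk][:, None, None]  # shape (chunk, 1, 1)
    masks = 1 - (dist - batch).clip(0, 1)                  # shape (chunk, kernel_size, kernel_size)
    # ... use or store `masks` here before moving on to the next batch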
If you want to build millions of masks, you should precompute once what never changes, and compute only what is strictly necessary for each radius.
You can try something like this:
import numpy as np

class Circle:
    def __init__(self, kernel_size):
        self._kernel_size = kernel_size
        self._kernel_radius = (self._kernel_size - 1) // 2
        x, y = np.ogrid[
            -self._kernel_radius:self._kernel_radius+1,
            -self._kernel_radius:self._kernel_radius+1]
        self._dist = np.sqrt(x**2 + y**2)

    def __call__(self, radius):
        mask = self._dist - radius
        mask = np.clip(mask, 0, 1, out=mask)
        mask *= -1
        mask += 1
        return mask

circle = Circle(kernel_size=9)
for radius in np.arange(1, 4, 0.2):  # range() does not accept a float step
    mask = circle(radius)
    print(mask)
I did the operations in place as much as possible to optimize for speed and memory, but for small arrays it won't matter much.
I have a numpy array:
arr=np.array([0,1,0,0.5])
I need to form a new array from it as follows: every zero element is repeated thrice, and every non-zero element gets 2 preceding zeroes followed by the non-zero number itself. In short, every element expands to three values: a zero stays as three zeroes, and a non-zero value gets 2 preceding 0s and then the number itself. It works out as follows:
[0, 1, 0, 0.5] ->
0, 0, 0    [for index 0]
0, 0, 1    [for index 1]
0, 0, 0    [for index 2, which again has a zero]
0, 0, 0.5  [for index 3]
final output should be:
new_arr=[0,0,0,0,0,1,0,0,0,0,0,0.5]
np.repeat() repeats all the array elements n times, but I don't want exactly that. How should this be done? Thanks for the help.
A quick reshape followed by a call to np.pad will do it:
np.pad(arr.reshape(-1, 1), ((0, 0), (2, 0)), 'constant')
Output:
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 1. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0.5]])
You'll want to flatten it back again. That's simply done by calling .reshape(-1, ).
>>> np.pad(arr.reshape(-1, 1), ((0, 0), (2, 0)), 'constant').reshape(-1, )
array([ 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. ,
0.5])
A variant on the pad idea is to concatenate a 2d array of zeros
In [477]: arr=np.array([0,1,0,0.5])
In [478]: np.column_stack([np.zeros((len(arr),2)),arr])
Out[478]:
array([[ 0. , 0. , 0. ],
[ 0. , 0. , 1. ],
[ 0. , 0. , 0. ],
[ 0. , 0. , 0.5]])
In [479]: _.ravel()
Out[479]:
array([ 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. ,
0.5])
or padding in the other direction:
In [481]: np.vstack([np.zeros((2,len(arr))),arr])
Out[481]:
array([[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 0.5]])
In [482]: _.T.ravel()
Out[482]:
array([ 0. , 0. , 0. , 0. , 0. , 1. , 0. , 0. , 0. , 0. , 0. ,
0.5])
I'm trying to learn how to reduce dimensionality in datasets. I came across some tutorials on Principal Component Analysis and Singular Value Decomposition. I understand that it takes the dimension of greatest variance and sequentially collapses dimensions of the next highest variance (overly simplified).
I'm confused about how to interpret the output matrices. I looked at the documentation but it wasn't much help. I followed some tutorials and was not too sure what the resulting matrices were exactly. I provided some code to get a feel for the distribution of each variable in the dataset (sklearn.datasets).
My initial input array is an (n x m) matrix of n samples and m attributes. I could do a common PCA plot of PC1 vs. PC2, but how do I know which dimensions each PC represents?
Sorry if this is a basic question. A lot of the resources are very math heavy, which I'm fine with, but a more intuitive answer would be useful. Nothing I've seen talks about how to interpret the output in terms of the original labeled data.
I'm open to using sklearn's decomposition.PCA
#Singular Value Decomposition
U, s, V = np.linalg.svd(X, full_matrices=True)
print(U.shape, s.shape, V.shape, sep="\n")
(442, 442)
(10,)
(10, 10)
As you stated above, matrix M can be decomposed as a product of 3 matrices: U * S * V*.
The geometric meaning is as follows: any transformation can be viewed as a sequence of rotation (V*), scaling (S) and rotation again (U). Here's a good description and animation.
What's important for us?
Matrix S is diagonal - all its values lying off the main diagonal are 0.
Like:
np.diag(s)
array([[ 2.00604441, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 1.22160478, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 1.09816315, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0.97748473, 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0.81374786, 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0.77634993, 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0.73250287, 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.65854628, 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.27985695, 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.09252313]])
Geometrically, each value is a scaling factor along a particular axis.
For our purposes (classification and regression) these values show the impact of a particular axis on the overall result.
As you can see, these values decrease from 2.0 down to 0.093.
One of the most important applications is easy low-rank matrix approximation with a given precision. If you do not need an ultra-precise decomposition (which is true for ML problems) you can drop the smallest values and keep only the important ones. In this way you can refine your solution step by step: estimate the quality on a test set, drop the least significant values and repeat. As a result you obtain a simple and robust solution.
Here, good candidates to be dropped are components 8 and 9, then 5-7, and as a last option you could approximate the model with only one value - the first.
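As a rough sketch of that idea using the arrays from the question (U, s, V as returned by np.linalg.svd, and X the original data), a rank-k approximation can be rebuilt from the first k singular values and vectors; here k = 8 drops the two smallest components:
k = 8
X_approx = U[:, :k] @ np.diag(s[:k]) @ V[:k, :]
print(np.abs(X - X_approx).max())  # reconstruction error from dropping the smallest components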