Creating a 'normal distribution' like range in numpy - python

I am trying to 'bin' an array (similar to building a histogram). I have an input array input_array and a range of bin edges bins = np.linspace(-200, 200, 200). The overall function looks something like this:
def bin(arr):
    bins = np.linspace(-100, 100, 200)
    return np.histogram(arr, bins=bins)[0]
So,
bin([64, 19, 120, 55, 56, 108, 16, 84, 120, 44, 104, 79, 116, 31, 44, 12, 35, 68])
would return:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
However, I want my bins to be more 'detailed' as I get close to 0... something similar to an ideal normal distribution. As a result, I could have more bins (i.e. shorter ranges) close to 0, with the bins getting bigger as I move out towards the extremes. Is that possible?
More specifically, rather than having equally wide bins across the range, can I have an array of bin edges where the bins towards the centre are narrower than those towards the extremes?
I have already looked at answers like this and numpy.random.normal, but something is just not clicking right.

Use the inverse error function to generate the bin edges. You'll need to scale them to get the exact range you want.
This transform works because the inverse error function is flatter around zero than near ±1, so evenly spaced inputs produce edges that bunch together near zero and spread out towards the extremes.
from scipy.special import erfinv
erfinv(np.linspace(-1, 1, 20))
# returns:
array([       -inf, -1.14541135, -0.8853822 , -0.70933273, -0.56893556,
       -0.44805114, -0.3390617 , -0.23761485, -0.14085661, -0.0466774 ,
        0.0466774 ,  0.14085661,  0.23761485,  0.3390617 ,  0.44805114,
        0.56893556,  0.70933273,  0.8853822 ,  1.14541135,         inf])
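To get edges over a finite range like [-100, 100], you can shrink the linspace slightly so the endpoints stay finite and then rescale. A minimal sketch; the function name, the squash parameter and the exact scaling are my own choices, not part of the original answer:

import numpy as np
from scipy.special import erfinv

def normal_like_edges(low=-100.0, high=100.0, n_bins=200, squash=0.999):
    # squash < 1 keeps erfinv finite at the outermost edges
    x = np.linspace(-squash, squash, n_bins + 1)
    y = erfinv(x)
    # rescale so the edges span exactly [low, high]
    return low + (y - y.min()) * (high - low) / (y.max() - y.min())

edges = normal_like_edges()
counts, _ = np.histogram([64, 19, 55, 56, 16, 84], bins=edges)

The edges near 0 end up much closer together than those near ±100, which is exactly the 'more detail near zero' behaviour asked for.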

Related

Converting an array to a list in Python

I have an array A. I want to identify all locations with element 1 and convert them to a list, as shown in the expected output, but I am getting an error.
import numpy as np
A = np.array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
B = np.where(A == 1)
B = B.tolist()
print(B)
The error is
in <module>
B=B.tolist()
AttributeError: 'tuple' object has no attribute 'tolist'
The expected output is
[1, 2, 5, 7, 10, 11]
np.where used with only a condition returns a tuple of index arrays, one array per dimension of the input. According to the docs, this behaviour is the same as np.nonzero, which is the recommended approach over np.where. So, since your array is one-dimensional, np.where returns a tuple with one element, inside of which is the array of indices from your expected output. You can resolve your problem by indexing into the tuple: np.where(A == 1)[0].tolist().
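For instance, on a shortened version of the array (just to show the shape of what comes back):

import numpy as np

A = np.array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1])
print(np.where(A == 1))               # (array([ 1,  2,  5,  7, 10, 11]),) -- a 1-tuple
print(np.where(A == 1)[0].tolist())   # [1, 2, 5, 7, 10, 11]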
However, I recommend using np.flatnonzero instead, which avoids the hassle entirely:
import numpy as np

A = np.array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
B = np.flatnonzero(A).tolist()
print(B)
# [1, 2, 5, 7, 10, 11]
PS: when all other elements are 0, you don't have to explicitly compare to 1 ;).
import numpy as np
A = np.array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
indices = np.where(A == 1)[0]
B = indices.tolist()
print(B)
You should access the first element of this tuple with B[0]:
import numpy as np

A = np.array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
B = np.where(A == 1)
B = B[0].tolist()
print(B)  # [1, 2, 5, 7, 10, 11]

Sklearn SVC with MNIST Dataset: Consistently wrong with the digit 5?

I have set up a very simple SVC to classify the MNIST digits. For some reason, the classifier is pretty consistently incorrectly predicting the digit 5, but when trying all other numbers it doesn't miss a single one. Does anyone have any idea if I might be setting this up wrong, or if it's just really bad at predicting the number 5?
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
data = datasets.load_digits()
images = data.images
targets = data.target
# Split into train and test sets
images_train, images_test, imlabels_train, imlabels_test = train_test_split(images, targets, test_size=.2, shuffle=False)
# Re-shape data so that it's 2D
images_train = np.reshape(images_train, (np.shape(images_train)[0], 64))
images_test = np.reshape(images_test, (np.shape(images_test)[0], 64))
svm_classifier = SVC(gamma='auto').fit(images_train, imlabels_train)
number_correct_svc = 0
preds = []
for label_index in range(len(imlabels_test)):
    pred = svm_classifier.predict(images_test[label_index].reshape(1, -1))
    if pred[0] == imlabels_test[label_index]:
        number_correct_svc += 1
    preds.append(pred[0])
print("Support Vector Classifier...")
print(f"\tPercent correct for all test data: {100*number_correct_svc/len(imlabels_test)}%")
confusion_matrix(preds,imlabels_test)
Here is the resulting confusion matrix:
array([[22, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 15, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 15, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 21, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 21, 0, 0, 0, 0, 0],
[13, 21, 20, 16, 16, 37, 23, 20, 31, 16],
[ 0, 0, 0, 0, 0, 0, 14, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 16, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 21]], dtype=int64)
I've been reading the sklearn page for SVC but can't tell what I'm doing wrong
Update:
I tried using SVC(gamma='scale') and it seems much more reasonable. It would still be nice to know why 'auto' doesn't work.
with scale:
array([[34, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 36, 0, 0, 0, 0, 0, 0, 1, 0],
[ 0, 0, 35, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 27, 0, 0, 0, 0, 0, 1],
[ 1, 0, 0, 0, 34, 0, 0, 0, 0, 0],
[ 0, 0, 0, 2, 0, 37, 0, 0, 0, 1],
[ 0, 0, 0, 0, 0, 0, 37, 0, 0, 0],
[ 0, 0, 0, 2, 0, 0, 0, 35, 0, 1],
[ 0, 0, 0, 6, 1, 0, 0, 1, 31, 1],
[ 0, 0, 0, 0, 2, 0, 0, 0, 1, 33]], dtype=int64)
The second question is much easier to deal with. With an RBF kernel, gamma controls how 'wiggly' the decision boundary is: the higher the value of gamma, the more tightly the boundary wraps around individual training points.
if gamma='scale' (default) is passed then it uses 1 / (n_features * X.var()) as the value of gamma,
if 'auto', it uses 1 / n_features.
In the second case the gamma is much higher: the digits pixels range from 0 to 16, so X.var() is well above 1 and 1 / (n_features * X.var()) is far smaller than 1 / n_features. The large gamma from 'auto' fits the decision boundary so tightly to the training points that it generalizes poorly on the test set, which is why 'scale' gives the better result.
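To see the difference concretely, you can compute the two gamma values directly from the digits data (a quick sketch based only on the two formulas above; variable names are mine):

from sklearn import datasets

X = datasets.load_digits().images.reshape(-1, 64)
gamma_auto = 1.0 / X.shape[1]                # what gamma='auto' uses
gamma_scale = 1.0 / (X.shape[1] * X.var())   # what gamma='scale' uses
print(gamma_auto, gamma_scale)
# gamma_auto is much larger, because the raw pixel variance (values 0-16)
# is well above 1; the larger gamma is what makes 'auto' overfit here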

numpy: check for 1 every 6 element every row

I need to have something like this:
arr = array([[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]])
Where each row contains 36 elements; every 6 elements in a row represent a hidden row, and each hidden row needs exactly one 1 and 0 everywhere else. In other words, every group of 6 consecutive entries must contain exactly one 1. This is my requirement for arr.
I have a table that's going to be used to compute a "fitness" value for each row. That is, I have a
table = np.array([10, 5, 4, 6, 5, 1, 6, 4, 9, 7, 3, 2, 1, 8, 3,
6, 4, 6, 5, 3, 7, 2, 1, 4, 3, 2, 5, 6, 8, 7, 7, 6, 4, 1, 3, 2])
table = table.T
and I'm going to multiply each row of arr with table. The result of that multiplication (a dot product, i.e. a single number) will be stored as the "fitness" value of the corresponding row, UNLESS the row does not fit the requirement described above, in which case the fitness should be 0.
An example of what should be returned is
result = array([5,12,13,14,20,34])
I need a way to do this but I'm too new to numpy to know how to.
(I'm assuming you want what you've asked for in the first half.)
I believe better or more elegant solutions exist, but this is what I think can do the job.
np.all(arr[:, 6] == 1) and np.all(arr[:, :6] == 0) and np.all(arr[:, 7:] == 0)
Alternatively, you can construct the array (with 0's and 1's) and then just compare with it, say using not_equal.
I'm also not 100% sure of your question, but I'll try to answer with the best of my knowledge.
Since you're saying your matrix has "hidden rows", to check whether it is well formed, the easiest way seems to be to just reshape it:
# First check, returns true if all elements are either 0 or 1
np.in1d(arr, [0,1]).all()
# Second check, provided the above was True, returns True if
# each "hidden row" has exactly one 1 and other 0.
(arr.reshape(6,6,6).sum(axis=2) == 1).all()
Both checks return "True" for your arr.
Now, my understanding is that for each "large" row of 36 elements, you want a scalar product with your "table" vector, unless that "large" row has an ill-formed "hidden small" row. In this case, I'd do something like:
# The following computes the result, not checking for integrity
results = arr.dot(table)
# Now remove the results that are not well formed.
# First, compute "large" rows where at least one "small" subrow
# fails the condition.
mask = (arr.reshape(6,6,6).sum(axis=2) != 1).any(axis=1)
# And set the corresponding answer to 0
results[mask] = 0
However, running this code against your data returns as answer
array([38, 31, 24, 24, 32, 20])
which is not what you mention; did I misunderstand your requirement, or was the example based on different data?
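For what it's worth, the pieces above can be wrapped into one helper (a sketch; the name fitness and the group_size parameter are mine, and it simply combines the reshape check with the dot product):

import numpy as np

def fitness(arr, table, group_size=6):
    # dot product of each row with table, zeroed for rows whose
    # hidden sub-rows don't each contain exactly one 1
    n_rows, n_cols = arr.shape
    results = arr.dot(table)
    groups = arr.reshape(n_rows, n_cols // group_size, group_size)
    bad = (groups.sum(axis=2) != 1).any(axis=1)
    results[bad] = 0
    return results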

Build numpy array with multiple custom index ranges without explicit loop

In Numpy, is there a pythonic way to create array3 with custom ranges from array1 and array2 without a loop? The straightforward solution of iterating over the ranges works but since my arrays run into millions of items, I am looking for a more efficient solution (maybe syntactic sugar too).
For ex.,
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
array3 = np.concatenate([np.arange(array1[i], array2[i]) for i in
                         np.arange(0, len(array1))])
print array3
result: [10,11,12,13,65,66,67,68,69,200,201,202,203].
Assuming the ranges do not overlap, you could build a mask which is nonzero where the index is between the ranges specified by array1 and array2 and then use np.flatnonzero to obtain an array of indices -- the desired array3:
import numpy as np
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
first, last = array1.min(), array2.max()
array3 = np.zeros(last-first+1, dtype='i1')
array3[array1-first] = 1
array3[array2-first] = -1
array3 = np.flatnonzero(array3.cumsum())+first
print(array3)
yields
[ 10 11 12 13 65 66 67 68 69 200 201 202 203]
For large len(array1), using_flatnonzero can be significantly faster than using_loop:
def using_flatnonzero(array1, array2):
    first, last = array1.min(), array2.max()
    array3 = np.zeros(last - first + 1, dtype='i1')
    array3[array1 - first] = 1
    array3[array2 - first] = -1
    return np.flatnonzero(array3.cumsum()) + first

def using_loop(array1, array2):
    return np.concatenate([np.arange(array1[i], array2[i]) for i in
                           np.arange(0, len(array1))])
array1, array2 = (np.random.choice(range(1, 11), size=10**4, replace=True)
                  .cumsum().reshape(2, -1, order='F'))
assert np.allclose(using_flatnonzero(array1, array2), using_loop(array1, array2))
In [260]: %timeit using_loop(array1, array2)
100 loops, best of 3: 9.36 ms per loop
In [261]: %timeit using_flatnonzero(array1, array2)
1000 loops, best of 3: 564 µs per loop
If the ranges overlap, then using_loop will return an array3 which contains duplicates. using_flatnonzero returns an array with no duplicates.
Explanation: Let's look at a small example with
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
The objective is to build an array which looks like goal, below. The 1's are located at index values [ 10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203] (i.e. array3):
In [306]: goal
Out[306]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int8)
Once we have the goal array, array3 can be obtained with a call to np.flatnonzero:
In [307]: np.flatnonzero(goal)
Out[307]: array([ 10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203])
goal has the same length as array2.max():
In [308]: array2.max()
Out[308]: 204
In [309]: goal.shape
Out[309]: (204,)
So we can begin by allocating
goal = np.zeros(array2.max()+1, dtype='i1')
and then filling in 1's at the index locations given by array1 and -1's at the indices given by array2:
In [311]: goal[array1] = 1
In [312]: goal[array2] = -1
In [313]: goal
Out[313]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
-1], dtype=int8)
Now applying cumsum (the cumulative sum) produces the desired goal array:
In [314]: goal = goal.cumsum(); goal
Out[314]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0])
In [315]: np.flatnonzero(goal)
Out[315]: array([ 10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203])
That's the main idea behind using_flatnonzero. The subtraction of first was simply to save a bit of memory.
Prospective Approach
I will work backwards through how to approach this problem.
Take the sample listed in the question. We have -
array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
Now, look at the desired result -
result: [10,11,12,13,65,66,67,68,69,200,201,202,203]
Let's calculate the group lengths, as we would be needing those to explain the solution approach next.
In [58]: lens = array2 - array1
In [59]: lens
Out[59]: array([4, 5, 4])
The idea is to use a 1's-initialized array which, when cumulatively summed across the entire length, gives us the desired result.
This cumulative summation would be the last step of our solution.
Why 1's-initialized? Because we have an array that increases in steps of 1, except at specific places where there are shifts
corresponding to new groups coming in.
Now, since cumsum would be the last step, the step before it should give us something like -
array([ 10, 1, 1, 1, 52, 1, 1, 1, 1, 131, 1, 1, 1])
As discussed before, it's 1's filled with [10, 52, 131] at specific places. That 10 comes from the first element of array1, but what about the rest?
The second one, 52, comes from 65 - 13 (looking at the result), where 13 is the last value of the group that started at 10 and ran for the first group's length of 4. So, if we do 65 - 10 - 4, we get 51, and adding 1 to accommodate the boundary stop gives 52, which is the
desired shifting value. Similarly, we would get 131.
Thus, those shifting-values could be computed, like so -
In [62]: np.diff(array1) - lens[:-1]+1
Out[62]: array([ 52, 131])
Next up, to get those shifting-places where such shifts occur, we can simply do cumulative summation on the group lengths -
In [65]: lens[:-1].cumsum()
Out[65]: array([4, 9])
For completeness, we need to prepend 0 to the array of shifting-places and array1[0] to the shifting-values.
So, we are set to present our approach in a step-by-step format!
Putting back the pieces
1] Get lengths of each group :
lens = array2 - array1
2] Get indices at which shifts occur and values to be put in 1's initialized array :
shift_idx = np.hstack((0,lens[:-1].cumsum()))
shift_vals = np.hstack((array1[0],np.diff(array1) - lens[:-1]+1))
3] Setup 1's initialized ID array for inserting those values at those indices listed in the step before :
id_arr = np.ones(lens.sum(),dtype=array1.dtype)
id_arr[shift_idx] = shift_vals
4] Finally do cumulative summation on the ID array :
output = id_arr.cumsum()
Listed in a function format, we would have -
def using_ones_cumsum(array1, array2):
    lens = array2 - array1
    shift_idx = np.hstack((0, lens[:-1].cumsum()))
    shift_vals = np.hstack((array1[0], np.diff(array1) - lens[:-1] + 1))
    id_arr = np.ones(lens.sum(), dtype=array1.dtype)
    id_arr[shift_idx] = shift_vals
    return id_arr.cumsum()
And it works on overlapping ranges too!
In [67]: array1 = np.array([10, 11, 200])
...: array2 = np.array([14, 18, 204])
...:
In [68]: using_ones_cumsum(array1, array2)
Out[68]:
array([ 10, 11, 12, 13, 11, 12, 13, 14, 15, 16, 17, 200, 201,
202, 203])
Runtime test
Let's time the proposed approach against the other vectorized approach in #unutbu's flatnonzero based solution, which already proved to be much better than the loopy approach -
In [38]: array1, array2 = (np.random.choice(range(1, 11), size=10**4, replace=True)
...: .cumsum().reshape(2, -1, order='F'))
In [39]: %timeit using_flatnonzero(array1, array2)
1000 loops, best of 3: 889 µs per loop
In [40]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 235 µs per loop
Improvement!
Now, appending and stacking are relatively costly in NumPy. So, those np.hstack calls can be avoided, giving a slightly improved version as listed below -
def get_ranges_arr(starts, ends):
    counts = ends - starts
    counts_csum = counts.cumsum()
    id_arr = np.ones(counts_csum[-1], dtype=int)
    id_arr[0] = starts[0]
    id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
    return id_arr.cumsum()
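As a quick sanity check (mine, not part of the original answer), the refactored function reproduces the sample result from the question:

array1 = np.array([10, 65, 200])
array2 = np.array([14, 70, 204])
print(get_ranges_arr(array1, array2))
# [ 10  11  12  13  65  66  67  68  69 200 201 202 203]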
Let's time it against our original approach -
In [151]: array1,array2 = (np.random.choice(range(1, 11),size=10**4, replace=True)\
...: .cumsum().reshape(2, -1, order='F'))
In [152]: %timeit using_ones_cumsum(array1, array2)
1000 loops, best of 3: 276 µs per loop
In [153]: %timeit get_ranges_arr(array1, array2)
10000 loops, best of 3: 193 µs per loop
So, we have a 30% performance boost there!
This is my approach combining vectorize and concatenate:
Implementation:
import numpy as np
array1, array2 = np.array([10, 65, 200]), np.array([14, 70, 204])
ranges = np.vectorize(lambda a, b: np.arange(a, b), otypes=[np.ndarray])
result = np.concatenate(ranges(array1, array2), axis=0)
print result
# [ 10 11 12 13 65 66 67 68 69 200 201 202 203]
Performance:
%timeit np.concatenate(ranges(array1, array2), axis=0)
100000 loops, best of 3: 13.9 µs per loop
Do you mean this?
In [440]: np.r_[10:14,65:70,200:204]
Out[440]: array([ 10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203])
or generalizing:
In [454]: np.r_[tuple([slice(i,j) for i,j in zip(array1,array2)])]
Out[454]: array([ 10, 11, 12, 13, 65, 66, 67, 68, 69, 200, 201, 202, 203])
Though this does involve a double loop: the explicit one to generate the slices, and one inside r_ that converts each slice to an arange:
for k in range(len(key)):
    scalar = False
    if isinstance(key[k], slice):
        step = key[k].step
        start = key[k].start
        ...
        newobj = _nx.arange(start, stop, step)
I mention this because it shows that numpy developers consider your kind of iteration normal.
I expect that #unutbu's clever, if somewhat obtuse (I haven't figured out what it is doing yet), solution is your best chance of speed. cumsum is a good tool when you need to work with ranges that can vary in length. It probably gains most when working with many small ranges. I don't think it works with overlapping ranges.
================
np.vectorize uses np.frompyfunc. So this iteration can also be expressed with:
In [467]: f=np.frompyfunc(lambda x,y: np.arange(x,y), 2,1)
In [468]: f(array1,array2)
Out[468]:
array([array([10, 11, 12, 13]), array([65, 66, 67, 68, 69]),
array([200, 201, 202, 203])], dtype=object)
In [469]: timeit np.concatenate(f(array1,array2))
100000 loops, best of 3: 17 µs per loop
In [470]: timeit np.r_[tuple([slice(i,j) for i,j in zip(array1,array2)])]
10000 loops, best of 3: 65.7 µs per loop
With #Darius's vectorize solution:
In [474]: timeit result = np.concatenate(ranges(array1, array2), axis=0)
10000 loops, best of 3: 52 µs per loop
vectorize must be doing some extra work to allow more powerful use of broadcasting. Relative speeds may shift if array1 is much larger.
#unutbu's solution isn't special with this small array1.
In [478]: timeit using_flatnonzero(array1,array2)
10000 loops, best of 3: 57.3 µs per loop
The OP's solution, iterative and without my r_ middleman, is good:
In [483]: timeit array3 = np.concatenate([np.arange(array1[i], array2[i]) for i in np.arange(0,len(array1))])
10000 loops, best of 3: 24.8 µs per loop
It's often the case that with a small number of loops, a list comprehension is faster than fancier numpy operations.
For #unutbu's larger test case, my timings are consistent with his - with a 17x speed up.
===================
For the small sample arrays, #Divakar's solution is slower, but for the large ones it's 3x faster than #unutbu's. So it has more of a setup cost, but it scales better.

Python Equivalent for bwmorph

I am still coding a fingerprint image preprocessor in Python. I see that MATLAB has a special function to remove H breaks and spurs:
bwmorph(a , 'hbreak')
bwmorph(a , 'spur')
I have searched scikit-image, OpenCV and others but couldn't find an equivalent for these two uses of bwmorph. Can anybody point me in the right direction, or do I have to implement my own?
Edit October 2017
The skimage module now has at least two options:
skeletonize and thin
Example with comparison
from skimage.morphology import thin, skeletonize
import numpy as np
import matplotlib.pyplot as plt
square = np.zeros((7, 7), dtype=np.uint8)
square[1:-1, 2:-2] = 1
square[0, 1] = 1
thinned = thin(square)
skel = skeletonize(square)
f, ax = plt.subplots(2, 2)
ax[0,0].imshow(square)
ax[0,0].set_title('original')
ax[0,0].get_xaxis().set_visible(False)
ax[0,1].axis('off')
ax[1,0].imshow(thinned)
ax[1,0].set_title('morphology.thin')
ax[1,1].imshow(skel)
ax[1,1].set_title('morphology.skeletonize')
plt.show()
Original post
I have found this solution by joefutrelle on github.
It seems (visually) to give similar results as the Matlab version.
Hope that helps!
Edit:
As it was pointed out in the comments, I'll extend my initial post as the mentioned link might change:
Looking for a substitute in Python for bwmorph from Matlab, I stumbled upon the following code from joefutrelle on Github (at the end of this post as it's very long).
I have figured out two ways to use it in my script (I'm a beginner and I'm sure there are better ways!):
1) copy the whole code into your script and then call the function (but this makes the script harder to read)
2) copy the code into a new Python file 'foo' and save it, then copy that file into the Python\Lib folder (e.g. C:\Program Files\Python35\Lib). In your original script you can call the function by writing:
from foo import bwmorph_thin
Then you'll feed the function your binary image (leaving n_iter at its default of None runs the thinning until the image stops changing):
skeleton = bwmorph_thin(foo_image)
import numpy as np
from scipy import ndimage as ndi
# lookup tables for bwmorph_thin
G123_LUT = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1,
0, 0, 0], dtype=bool)
G123P_LUT = np.array([0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0], dtype=bool)
def bwmorph_thin(image, n_iter=None):
    """
    Perform morphological thinning of a binary image

    Parameters
    ----------
    image : binary (M, N) ndarray
        The image to be thinned.
    n_iter : int, number of iterations, optional
        Regardless of the value of this parameter, the thinned image
        is returned immediately if an iteration produces no change.
        If this parameter is specified it thus sets an upper bound on
        the number of iterations performed.

    Returns
    -------
    out : ndarray of bools
        Thinned image.

    See also
    --------
    skeletonize

    Notes
    -----
    This algorithm [1]_ works by making multiple passes over the image,
    removing pixels matching a set of criteria designed to thin
    connected regions while preserving eight-connected components and
    2 x 2 squares [2]_. In each of the two sub-iterations the algorithm
    correlates the intermediate skeleton image with a neighborhood mask,
    then looks up each neighborhood in a lookup table indicating whether
    the central pixel should be deleted in that sub-iteration.

    References
    ----------
    .. [1] Z. Guo and R. W. Hall, "Parallel thinning with
           two-subiteration algorithms," Comm. ACM, vol. 32, no. 3,
           pp. 359-373, 1989.
    .. [2] Lam, L., Seong-Whan Lee, and Ching Y. Suen, "Thinning
           Methodologies-A Comprehensive Survey," IEEE Transactions on
           Pattern Analysis and Machine Intelligence, Vol 14, No. 9,
           September 1992, p. 879

    Examples
    --------
    >>> square = np.zeros((7, 7), dtype=np.uint8)
    >>> square[1:-1, 2:-2] = 1
    >>> square[0,1] = 1
    >>> square
    array([[0, 1, 0, 0, 0, 0, 0],
           [0, 0, 1, 1, 1, 0, 0],
           [0, 0, 1, 1, 1, 0, 0],
           [0, 0, 1, 1, 1, 0, 0],
           [0, 0, 1, 1, 1, 0, 0],
           [0, 0, 1, 1, 1, 0, 0],
           [0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
    >>> skel = bwmorph_thin(square)
    >>> skel.astype(np.uint8)
    array([[0, 1, 0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0, 0, 0],
           [0, 0, 0, 1, 0, 0, 0],
           [0, 0, 0, 1, 0, 0, 0],
           [0, 0, 0, 1, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
    """
    # check parameters
    if n_iter is None:
        n = -1
    elif n_iter <= 0:
        raise ValueError('n_iter must be > 0')
    else:
        n = n_iter

    # check that we have a 2d binary image, and convert it to uint8
    skel = np.array(image).astype(np.uint8)
    if skel.ndim != 2:
        raise ValueError('2D array required')
    if not np.all(np.in1d(image.flat, (0, 1))):
        raise ValueError('Image contains values other than 0 and 1')

    # neighborhood mask
    mask = np.array([[ 8,  4,   2],
                     [16,  0,   1],
                     [32, 64, 128]], dtype=np.uint8)

    # iterate either 1) indefinitely or 2) up to iteration limit
    while n != 0:
        before = np.sum(skel)  # count points before thinning
        # for each subiteration
        for lut in [G123_LUT, G123P_LUT]:
            # correlate image with neighborhood mask
            N = ndi.correlate(skel, mask, mode='constant')
            # take deletion decision from this subiteration's LUT
            D = np.take(lut, N)
            # perform deletion
            skel[D] = 0
        after = np.sum(skel)  # count points after thinning
        if before == after:
            # iteration had no effect: finish
            break
        # count down to iteration limit (or endlessly negative)
        n -= 1

    return skel.astype(bool)
"""
# here's how to make the LUTs
def nabe(n):
return np.array([n>>i&1 for i in range(0,9)]).astype(np.bool)
def hood(n):
return np.take(nabe(n), np.array([[3, 2, 1],
[4, 8, 0],
[5, 6, 7]]))
def G1(n):
s = 0
bits = nabe(n)
for i in (0,2,4,6):
if not(bits[i]) and (bits[i+1] or bits[(i+2) % 8]):
s += 1
return s==1
g1_lut = np.array([G1(n) for n in range(256)])
def G2(n):
n1, n2 = 0, 0
bits = nabe(n)
for k in (1,3,5,7):
if bits[k] or bits[k-1]:
n1 += 1
if bits[k] or bits[(k+1) % 8]:
n2 += 1
return min(n1,n2) in [2,3]
g2_lut = np.array([G2(n) for n in range(256)])
g12_lut = g1_lut & g2_lut
def G3(n):
bits = nabe(n)
return not((bits[1] or bits[2] or not(bits[7])) and bits[0])
def G3p(n):
bits = nabe(n)
return not((bits[5] or bits[6] or not(bits[3])) and bits[4])
g3_lut = np.array([G3(n) for n in range(256)])
g3p_lut = np.array([G3p(n) for n in range(256)])
g123_lut = g12_lut & g3_lut
g123p_lut = g12_lut & g3p_lut
"""`
You will have to implement those on your own since they aren't present in OpenCV or skimage as far as I know.
However, it should be straightforward to check MATLAB's code on how it works and write your own version in Python/NumPy.
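If you do roll your own, a common approximation of bwmorph(a, 'spur') is iterative pruning: delete foreground pixels that have at most one 8-connected neighbour. A minimal sketch of that idea (a generic pruning scheme, not a port of MATLAB's exact implementation):

import numpy as np
from scipy import ndimage as ndi

def remove_spurs(binary, n_iter=1):
    # delete endpoint pixels (those with <= 1 eight-connected neighbour),
    # repeating up to n_iter times or until nothing changes
    img = np.asarray(binary).astype(bool)
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]], dtype=np.uint8)
    for _ in range(n_iter):
        neighbours = ndi.convolve(img.astype(np.uint8), kernel, mode='constant')
        endpoints = img & (neighbours <= 1)
        if not endpoints.any():
            break
        img = img & ~endpoints
    return img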
Here is a guide that describes NumPy in detail for MATLAB users, with hints on equivalent functions in MATLAB and NumPy:
Link
