Working with two arrays of different lengths in NumPy (Python)

Is there a way I could modify the function below so that it can handle arrays of different lengths? The length of the Numbers array is 7 and the length of Formating is 5. The code below checks whether each number in Numbers falls between two consecutive values in Formating, and if so sums the values that fall in between. So for the first interval, since no element in Numbers is between 0 and 2, the result will be 0. The code was derived from: issue.
Code:
import numpy as np

Numbers = np.array([3, 4, 5, 7, 8, 10, 20])
Formating = np.array([0, 2, 5, 12, 15])
x = np.sort(Numbers)
l = np.searchsorted(x, Formating, side='left')
mask = (Formating[:-1, None] <= Numbers) & (Numbers < Formating[1:, None])
N = Numbers[:, None].repeat(5, 1).T
result = np.ma.masked_array(N, ~mask)
result = result.filled(0)
result = np.sum(result, axis=1)
Expected output:
[ 0 7 30 0]

Here's an approach with bincount. Note that you have your x and l mixed up (you sort Numbers but then search it for Formating), and you can/should use digitize instead:
# Formating goes here
x = np.sort(Formating)
# digitize
l = np.digitize(Numbers, x)
# output:
np.bincount(l, weights=Numbers)
Out:
array([ 0., 0., 7., 30., 0., 20.])
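The first and last entries of that output are the out-of-range bins (values below the first edge, and values at or above the last edge), so slicing them off recovers the expected per-interval sums. A minimal end-to-end sketch of this (my own consolidation of the above; minlength guards against empty trailing bins):
import numpy as np

Numbers = np.array([3, 4, 5, 7, 8, 10, 20])
Formating = np.array([0, 2, 5, 12, 15])

edges = np.sort(Formating)
idx = np.digitize(Numbers, edges)  # 0 = below edges[0]; len(edges) = at/above edges[-1]
sums = np.bincount(idx, weights=Numbers, minlength=len(edges) + 1)
print(sums[1:-1])  # [ 0.  7. 30.  0.]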


How to cut dataset into X and Y parameters

I have a time-series dataset that consists of evenly spaced timesteps and another parameter (say volume). I want to cut/split the dataset into X and Y parameters to train my ML model. I am looking for a logic/algorithm in Python that will be useful in tackling the simplified version below.
I have an array of even timesteps (1 timestep = 1day) ranging from 1 to 100:
array = [1,2,3,...,100]
I have also come up with the following parameters: N and K. N is used to build the X parameter and K is used to build the Y parameter.
If N = 5, then on the first iteration X = [1,2,3,4,5], on the second iteration X = [2,3,4,5,6], on the third iteration X = [3,4,5,6,7], and so forth. So the length of X is equal to N. If N = 10, then on the first iteration X = [1,2,3,...,10], on the second iteration X = [2,3,4,...,11], and so forth.
The K parameter represents the length of a geometric sequence. For example: K = 5 means k = (1,2,4,8,16), K = 3 means k = (1,2,4), and K = 7 means k = (1,2,4,8,16,32,64). The Y parameter takes the last element of the X array at each iteration and adds to it the values from the geometric sequence. So the length of Y is equal to K. If K = 5 -> len(Y) = 5, if K = 3 -> len(Y) = 3, and so forth.
Example 1: N = 5, K = 5:
First step:
X = [1,2,3,4,5] and Y = [6,7,9,13,21]
because K = (1,2,4,8,16) and Y = [5+1, 5+2, 5+4, 5+8, 5+16], with 5 being the last element of the array X
Second step:
X = [2,3,4,5,6] and Y = [7,8,10,14,22]
because K = (1,2,4,8,16) and Y = [6+1, 6+2, 6+4, 6+8, 6+16], with 6 being the last element of the array X
Third step:
X = [3,4,5,6,7] and Y = [8,9,11,15,23]
because K = (1,2,4,8,16) and Y = [7+1, 7+2, 7+4, 7+8, 7+16], with 7 being the last element of the array X
(other steps omitted)
Last step:
X = [?,?,?,?,?]; Y = [?,?,?,?,100]
k = (1,2,4,8,16), because 100 is the last element of the array
Example 2: N = 6, K = 3:
First step:
X = [1,2,3,4,5,6] and Y = [7,8,10], because K = (1,2,4) and Y = [6+1, 6+2, 6+4]
Second step:
X = [2,3,4,5,6,7] and Y = [8,9,11], because K = (1,2,4) and Y = [7+1, 7+2, 7+4]
Third step:
X = [3,4,5,6,7,8] and Y = [9,10,12], because K = (1,2,4) and Y = [8+1, 8+2, 8+4]
(other steps omitted)
Last step:
X = [91,92,93,94,95,96]; Y = [97,98,100], k = (1,2,4), because 100 is the last element of the array
Edit
I expect the function to look like:
def dataset_split(array, N, K):
It should return multiple X and Y arrays (basically chunks) built from the input array between 1 and 100. It should go over the steps and save the results for X and Y in the form of matrices or arrays. Based on my Example 1 above, my X array after the first three steps will be
X = [[1,2,3,4,5], [2,3,4,5,6], [3,4,5, 6, 7]]
and my Y array after first three steps will be
Y = [[6,7,9,13, 21], [7,8,10,14, 22], [8,9,11,15, 23]]
The procedure should continue until the last element of the array is reached, which is 100 in this case.
A very simple way to get all the values of X is to create a sliding window view into array. You can do this directly with np.lib.stride_tricks.sliding_window_view:
n = ...
k = ...
x = np.lib.stride_tricks.sliding_window_view(array, n)
The geometric sequence K can be trivially generated with np.logspace:
K = np.logspace(0, k - 1, k, base=2)
OR
K = 2.0**np.arange(k)
Either way, you can pre-generate all of y from the last element of each window:
y = x[:, -1:] + K
Now you have two arrays with all of the data you need:
>>> array = np.arange(1, 101)
>>> n = k = 5
>>> x = np.lib.stride_tricks.sliding_window_view(array, n)
>>> x
array([[  1,   2,   3,   4,   5],
       [  2,   3,   4,   5,   6],
       [  3,   4,   5,   6,   7],
       ...
       [ 94,  95,  96,  97,  98],
       [ 95,  96,  97,  98,  99],
       [ 96,  97,  98,  99, 100]])
>>> K = np.logspace(0, k - 1, k, base=2)
>>> K
array([ 1., 2., 4., 8., 16.])
>>> y = x[:, -1:] + K
>>> y
array([[  6.,   7.,   9.,  13.,  21.],
       [  7.,   8.,  10.,  14.,  22.],
       [  8.,   9.,  11.,  15.,  23.],
       ...
       [ 99., 100., 102., 106., 114.],
       [100., 101., 103., 107., 115.],
       [101., 102., 104., 108., 116.]])
The nice thing about this approach is that you don't need to copy the original data of array to make x, and everything is fully vectorized. Whatever operation you are planning on doing can likely be performed in bulk using numpy functions.
This satisfies your examples:
import numpy as np

def split_dataset(array, N, K):
    k = 2**np.arange(K)
    # column-stack i-place shifted copies of array for N columns
    X = np.c_[[np.roll(array, -i) for i in range(N)]].T[:-N+1]
    # mask rows whose Y would go past the last value in array
    mask = X[:, -1] + k[-1] <= array[-1]
    X = X[mask]
    # add k to the last column of X
    Y = X[:, -1].reshape(-1, 1) + k
    return X, Y

X, Y = split_dataset(array, 5, 5)
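As a quick sanity check (my own addition, expected output shown in comments), running this on the 1-to-100 array reproduces the examples above, and the final chunk's Y ends exactly at 100:
array = np.arange(1, 101)
X, Y = split_dataset(array, 5, 5)
print(X[:3])  # [[1 2 3 4 5]
              #  [2 3 4 5 6]
              #  [3 4 5 6 7]]
print(Y[:3])  # [[ 6  7  9 13 21]
              #  [ 7  8 10 14 22]
              #  [ 8  9 11 15 23]]
print(X[-1])  # [80 81 82 83 84]
print(Y[-1])  # [ 85  86  88  92 100]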

Subtract the average of first and last value of each row from all values in the row

I have a numpy array that looks like this:
77.132 2.075 63.365 74.880
49.851 22.480 19.806 76.053
16.911 8.834 68.536 95.339
0.395 51.219 81.262 61.253
72.176 29.188 91.777 71.458
54.254 14.217 37.334 67.413
44.183 43.401 61.777 51.314
65.040 60.104 80.522 52.165
90.865 31.924 9.046 30.070
11.398 82.868 4.690 62.629
and what I'm trying to do is
find the average of the first and last item in each row
subtract this average from each pixel in that row
repeat for each row
create a new image of subtracted pixels.
I've tried it using for loops but I can't get it working:
import numpy as np

# Create random arrays to simulate images
np.random.seed(10)
image = 100 * np.random.rand(10, 4)
no_disk_list = []
#for row in image:
#    left, right = row[0], row[-1]
#    average = (left + right) / 2.0
#    for i in row:
#        no_average = row[i] - average
#        print(average)
#        no_disk_list.append(no_average)
subtracted = np.ones_like(image)
height, width = image.shape
for row in image:
    left, right = image[0], image[-1]
    average = (left + right) / 2.0
    for element in row:
        subtracted[row, element] = image[row, element] - average
Both of the nested loops give an error:
File "C:/Users/Jeremy/Dropbox/Astro480/NEOWISE/subtract_disk.py", line 17, in <module>
no_disk_value = row[i] - disk_value
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
for the first loop and
File "C:/Users/Jeremy/Dropbox/Astro480/NEOWISE/subtract_pixels.py", line 23, in <module>
print(image[row, element])
IndexError: arrays used as indices must be of integer (or boolean) type
for the second. The questions here, here, and here are of limited use in my situation. Besides, I know that vectorization would be a better way to go, since the image I'll eventually be using has 1.3 million pixels. How can I get the loops working, or even better, vectorize the calculation?
If I understand the question correctly, this will work:
subtracted = np.ones_like(image)
height, width = image.shape
for row_no, row in enumerate(image):  # keep the row number using enumerate
    left, right = row[0], row[-1]     # you need the first and last value of the ROW!
    average = (left + right) / 2.0
    # Also use enumerate in the inner loop
    for col_no, element in enumerate(row):
        subtracted[row_no, col_no] = element - average
You can even use broadcasting ("vectorization") to shorten this considerably:
subtracted = image - (image[:, [0]] + image[:, [-1]]) / 2
The image[:, [0]] is the first column, the image[:, [-1]] is the last column. By adding and dividing them by 2 you get a 2D array containing the averages of each row. The final step is subtracting this from the image, which is easy in this case because it will broadcast correctly.
Step-by-step:
>>> arr = np.arange(20).reshape(4, 5)
>>> arr
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
>>> arr[:, [0]]  # first column
array([[ 0],
       [ 5],
       [10],
       [15]])
>>> arr[:, [-1]]  # last column
array([[ 4],
       [ 9],
       [14],
       [19]])
>>> (arr[:, [0]] + arr[:, [-1]]) / 2  # average
array([[ 2.],
       [ 7.],
       [12.],
       [17.]])
>>> arr - (arr[:, [0]] + arr[:, [-1]]) / 2  # subtracted
array([[-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.],
       [-2., -1.,  0.,  1.,  2.]])
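As a quick consistency check (my own snippet, not part of the original answer), the broadcast one-liner agrees with an explicit per-row loop on the simulated image:
import numpy as np

np.random.seed(10)
image = 100 * np.random.rand(10, 4)

vectorized = image - (image[:, [0]] + image[:, [-1]]) / 2

looped = np.empty_like(image)
for r, row in enumerate(image):
    # same arithmetic, one row at a time
    looped[r] = row - (row[0] + row[-1]) / 2.0

assert np.allclose(vectorized, looped)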

Python: how to find unique entries and get the minimum values from a matching array

I have a numpy array, indices:
array([[ 0,  0,  0],
       [ 0,  0,  0],
       [ 2,  0,  2],
       [ 0,  0,  0],
       [ 2,  0,  2],
       [95, 71, 95]])
I have another array of the same length called distances:
array([ 0.98713981,  1.04705992,  1.42340327, 74.0139111 ,
       74.4285216 , 74.84623217])
All of the rows in indices have a match in the distances array. The problem is, there are duplicates in the indices array, and they have different values in the corresponding distances array. I would like to get the minimum distance for all triplets of indices, and discard the others. Therefore, with the inputs above, I want the output:
indicesOUT =
array([[ 0,  0,  0],
       [ 2,  0,  2],
       [95, 71, 95]])
distancesOUT =
array([ 0.98713981,  1.42340327, 74.84623217])
My current strategy is as follows:
import numpy as np

indicesOUT = []
distancesOUT = []
for i in range(6):
    for j in range(6):
        for k in range(6):
            if len([s for s in indicesOUT if [i, j, k] == s]) == 0:
                current = np.array([i, j, k])
                ind = np.where((indices == current).all(-1) == True)[0]
                currentDistances = distances[ind]
                dist = np.amin(distances)
                indicesOUT.append([i, j, k])
                distancesOUT.append(dist)
The problem is, the actual arrays have about 4 million elements each, so this approach is way too slow. What is the most efficient way of doing this?
This is essentially a grouping operation, and NumPy is not well-optimized for it. Fortunately, the Pandas package has some very fast tools that can be adapted to this exact problem.
With your data above, we can do this:
import pandas as pd
def drop_duplicates(indices, distances):
    data = pd.Series(distances)
    grouped = data.groupby(list(indices.T)).min().reset_index()
    return grouped.values[:, :3], grouped.values[:, 3]
And the output for your data is
array([[ 0.,  0.,  0.],
       [ 2.,  0.,  2.],
       [95., 71., 95.]]),
array([ 0.98713981,  1.42340327, 74.84623217])
My benchmark shows that for 4,000,000 elements, this should run in about a second:
indices = np.random.randint(0, 100, size=(4000000, 3))
distances = np.random.random(4000000)
%timeit drop_duplicates(indices, distances)
# 1 loops, best of 3: 1.15 s per loop
As written above, the input order of the indices will not necessarily be preserved; keeping the original order would require a bit more thought.
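If you prefer to stay within NumPy, here is a minimal sketch of the same grouping using np.unique over rows plus an unbuffered reduction (my own addition; np.unique returns the unique rows in sorted order, which happens to match the expected output here):
import numpy as np

def drop_duplicates_numpy(indices, distances):
    # unique rows, plus the group id of each original row
    uniq, inverse = np.unique(indices, axis=0, return_inverse=True)
    out = np.full(len(uniq), np.inf)
    # unbuffered per-group minimum
    np.minimum.at(out, inverse.ravel(), distances)
    return uniq, out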

Convert array of indices to one-hot encoded array in NumPy

Given a 1D array of indices:
a = array([1, 0, 3])
I want to one-hot encode this as a 2D array:
b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
Create a zeroed array b with enough columns, i.e. a.max() + 1.
Then, for each row i, set the a[i]th column to 1.
>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max() + 1))
>>> b[np.arange(a.size), a] = 1
>>> b
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])
>>> values = [1, 0, 3]
>>> n_values = np.max(values) + 1
>>> np.eye(n_values)[values]
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])
In case you are using keras, there is a built-in utility for that:
from keras.utils.np_utils import to_categorical
categorical_labels = to_categorical(int_labels, num_classes=3)
And it does pretty much the same as @YXD's answer (see source code).
Here is what I find useful:
def one_hot(a, num_classes):
    return np.squeeze(np.eye(num_classes)[a.reshape(-1)])
Here num_classes stands for the number of classes you have. So if you have a vector with shape (10000,), this function transforms it to (10000, C). Note that a is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) will give [[1, 0], [0, 1]].
Exactly what you wanted to have I believe.
PS: the source is Sequence models - deeplearning.ai
You can also use the eye function of NumPy:
numpy.eye(number of classes)[vector containing the labels]
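A concrete version of that one-liner, assuming the labels start at 0:
import numpy as np

labels = np.array([1, 0, 3])
one_hot = np.eye(labels.max() + 1)[labels]
# array([[0., 1., 0., 0.],
#        [1., 0., 0., 0.],
#        [0., 0., 0., 1.]])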
You can use sklearn.preprocessing.LabelBinarizer:
Example:
import sklearn.preprocessing
a = [1,0,3]
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a)+1))
b = label_binarizer.transform(a)
print('{0}'.format(b))
output:
[[0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]]
Amongst other things, you may initialize sklearn.preprocessing.LabelBinarizer() so that the output of transform is sparse.
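For instance, passing sparse_output=True to the constructor makes transform return a scipy.sparse matrix instead of a dense ndarray:
import sklearn.preprocessing

label_binarizer = sklearn.preprocessing.LabelBinarizer(sparse_output=True)
label_binarizer.fit(range(4))
b_sparse = label_binarizer.transform([1, 0, 3])  # scipy.sparse CSR matrix
print(b_sparse.toarray())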
For one-hot encoding with pandas:
one_hot_encode = pandas.get_dummies(array)
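For example (note that get_dummies only creates columns for the values actually present in the input, and newer pandas returns boolean columns unless you pass a dtype):
import pandas as pd

print(pd.get_dummies([1, 0, 3], dtype=int))
#    0  1  3
# 0  0  1  0
# 1  1  0  0
# 2  0  0  1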
You can use the following code to convert into a one-hot vector. Let x be a class vector with a single column holding classes 0 to some number:
import numpy as np
np.eye(x.max() + 1)[x]
If 0 is not a class, remove the +1.
Here is a function that converts a 1-D vector to a 2-D one-hot array.
#!/usr/bin/env python
import numpy as np

def convertToOneHot(vector, num_classes=None):
    """
    Converts an input 1-D vector of integers into an output
    2-D array of one-hot vectors, where an i'th input value
    of j will set a '1' in the i'th row, j'th column of the
    output array.

    Example:
        v = np.array((1, 0, 4))
        one_hot_v = convertToOneHot(v)
        print(one_hot_v)

        [[0 1 0 0 0]
         [1 0 0 0 0]
         [0 0 0 0 1]]
    """
    assert isinstance(vector, np.ndarray)
    assert len(vector) > 0

    if num_classes is None:
        num_classes = np.max(vector) + 1
    else:
        assert num_classes > 0
        assert num_classes > np.max(vector)  # labels are zero-indexed

    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)
Below is some example usage:
>>> a = np.array([1, 0, 3])
>>> convertToOneHot(a)
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])
>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])
I think the short answer is no. For a more generic case in n dimensions, I came up with this:
# For 2-dimensional data, 4 values
a = np.array([[0, 1, 2], [3, 2, 1]])
z = np.zeros(list(a.shape) + [4])
z[tuple(list(np.indices(z.shape[:-1])) + [a])] = 1  # tuple() needed for modern NumPy indexing
I am wondering if there is a better solution -- I don't like that I have to create those lists in the last two lines. Anyway, I did some measurements with timeit and it seems that the numpy-based (indices/arange) and the iterative versions perform about the same.
Just to elaborate on the excellent answer from K3---rnc, here is a more generic version:
def onehottify(x, n=None, dtype=float):
    """1-hot encode x with the max value n (computed from data if n is None)."""
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]
Also, here is a quick-and-dirty benchmark of this method and a method from the currently accepted answer by YXD (slightly changed, so that they offer the same API except that the latter works only with 1D ndarrays):
def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b
The latter method is ~35% faster (MacBook Pro 13 2015), but the former is more general:
>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If using tensorflow, there is one_hot():
import tensorflow as tf
import numpy as np
a = np.array([1, 0, 3])
depth = 4
b = tf.one_hot(a, depth)
# <tf.Tensor: shape=(3, 4), dtype=float32, numpy=
# array([[0., 1., 0., 0.],
#        [1., 0., 0., 0.],
#        [0., 0., 0., 1.]], dtype=float32)>
def one_hot(n, class_num, col_wise=True):
    a = np.eye(class_num)[n.reshape(-1)]
    return a.T if col_wise else a
# Column for different hot
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10))
# Row for different hot
print(one_hot(np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9, 9, 8, 7]), 10, col_wise=False))
I recently ran into a problem of the same kind and found that the posted solutions are only satisfactory if your numbers follow a certain pattern. For example, if you want to one-hot encode the following list:
all_good_list = [0,1,2,3,4]
go ahead, the solutions posted above work. But what about this data:
problematic_list = [0,23,12,89,10]
If you do it with the methods mentioned above, you will likely end up with 90 one-hot columns. This is because all the answers include something like n = np.max(a)+1. I found a more generic solution that worked for me and wanted to share it:
import numpy as np
import sklearn.preprocessing

sklb = sklearn.preprocessing.LabelBinarizer()
a = np.asarray([1, 2, 44, 3, 2])
n = np.unique(a)
sklb.fit(n)
b = sklb.transform(a)
I hope this comes in handy for anyone who runs into the same restrictions with the solutions above.
Here's a dimensionality-independent standalone solution.
This will convert any N-dimensional array arr of nonnegative integers to a one-hot (N+1)-dimensional array one_hot, where one_hot[i_1,...,i_N,c] = 1 means arr[i_1,...,i_N] = c. You can recover the input via np.argmax(one_hot, -1).
def expand_integer_grid(arr, n_classes):
    """
    :param arr: N dim array of size i_1, ..., i_N
    :param n_classes: C
    :returns: one-hot N+1 dim array of size i_1, ..., i_N, C
    :rtype: ndarray
    """
    one_hot = np.zeros(arr.shape + (n_classes,))
    axes_ranges = [range(arr.shape[i]) for i in range(arr.ndim)]
    flat_grids = [_.ravel() for _ in np.meshgrid(*axes_ranges, indexing='ij')]
    one_hot[tuple(flat_grids + [arr.ravel()])] = 1  # tuple() needed for modern NumPy indexing
    assert (one_hot.sum(-1) == 1).all()
    assert np.allclose(np.argmax(one_hot, -1), arr)
    return one_hot
This type of encoding is usually done with a NumPy array. If you are using a NumPy array like this:
a = np.array([1, 0, 3])
then there is a very simple way to convert it to a one-hot encoding:
out = (np.arange(4) == a[:, None]).astype(np.float32)
That's it.
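The same comparison also works with the class count derived from the data instead of hard-coding 4 (my own variation):
import numpy as np

a = np.array([1, 0, 3])
num_classes = a.max() + 1  # or a fixed, known class count
out = (np.arange(num_classes) == a[:, None]).astype(np.float32)
print(out)
# [[0. 1. 0. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]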
Here p is a 2D ndarray. We want to know which value is the highest in each row, and put a 1 there and 0 everywhere else. A clean and easy solution:
max_elements_i = np.expand_dims(np.argmax(p, axis=1), axis=1)
one_hot = np.zeros(p.shape)
np.put_along_axis(one_hot, max_elements_i, 1, axis=1)
I find the easiest solution combines np.take and np.eye:
def one_hot(x, depth: int):
    return np.take(np.eye(depth), x, axis=0)
This works for x of any shape.
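A quick check of the any-shape claim (my own snippet): np.take indexes along axis 0, so a new one-hot axis is appended at the end whatever the shape of x.
import numpy as np

def one_hot(x, depth: int):
    return np.take(np.eye(depth), x, axis=0)

x2d = np.array([[0, 2], [1, 0]])
print(one_hot(x2d, 3).shape)  # (2, 2, 3)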
Here is an example function that I wrote to do this based upon the answers above and my own use case:
def label_vector_to_one_hot_vector(vector, one_hot_size=10):
    """
    Use to convert a column vector to a 'one-hot' matrix

    Example:
        vector: [[2], [0], [1]]
        one_hot_size: 3
        returns:
            [[ 0.,  0.,  1.],
             [ 1.,  0.,  0.],
             [ 0.,  1.,  0.]]

    Parameters:
        vector (np.array): of size (n, 1) to be converted
        one_hot_size (int), optional: size of the 'one-hot' row vector

    Returns:
        np.array of size (vector.size, one_hot_size): converted to a 'one-hot' matrix
    """
    squeezed_vector = np.squeeze(vector, axis=-1)
    one_hot = np.zeros((squeezed_vector.size, one_hot_size))
    one_hot[np.arange(squeezed_vector.size), squeezed_vector] = 1
    return one_hot
label_vector_to_one_hot_vector(vector=[[2], [0], [1]], one_hot_size=3)
I am adding for completeness a simple function, using only numpy operators:
def probs_to_onehot(output_probabilities):
    argmax_indices_array = np.argmax(output_probabilities, axis=1)
    # use the column count as the class count; np.unique would miscount
    # whenever some class never wins the argmax
    num_classes = output_probabilities.shape[1]
    onehot_output_array = np.eye(num_classes)[argmax_indices_array]
    return onehot_output_array
It takes as input a probability matrix, e.g.:
[[0.03038822 0.65810204 0.16549407 0.3797123 ]
...
[0.02771272 0.2760752 0.3280924 0.33458805]]
And it will return
[[0 1 0 0] ... [0 0 0 1]]
Use the following code. It works best.
def one_hot_encode(x):
    """
    argument
        - x: a list of labels
    return
        - one-hot encoding matrix (number of labels, number of classes)
    """
    encoded = np.zeros((len(x), 10))
    for idx, val in enumerate(x):
        encoded[idx][val] = 1
    return encoded
Found it here.
Using a Neuraxle pipeline step:
Set up your example
import numpy as np
a = np.array([1,0,3])
b = np.array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])
Do the actual conversion
from neuraxle.steps.numpy import OneHotEncoder
encoder = OneHotEncoder(nb_columns=4)
b_pred = encoder.transform(a)
Assert it works
assert (b_pred == b).all()  # elementwise comparison, so reduce with .all()
Link to documentation: neuraxle.steps.numpy.OneHotEncoder

Assigning points to bins

What is a good way to bin numerical values into a certain range? For example, suppose I have a list of values and I want to bin them into N bins by their range. Right now, I do something like this:
from scipy import *

num_bins = 3  # number of bins to use
values = # some array of integers...
min_val = min(values) - 1
max_val = max(values) + 1
my_bins = linspace(min_val, max_val, num_bins)
# assign points to bins
for v in values:
    best_bin = min_index(abs(my_bins - v))
where min_index returns the index of the minimum value. The idea is that you can find the bin the point falls into by seeing what bin it has the smallest difference with.
But I think this has weird edge cases. What I am looking for is a good representation of bins, ideally ones that are half closed half open (so that there is no way of assigning one point to two bins), i.e.
bin1 = [x1, x2)
bin2 = [x2, x3)
bin3 = [x3, x4)
etc...
What is a good way to do this in Python, using numpy/scipy? I am only concerned here with binning integer values.
Thanks very much for your help.
numpy.histogram() does exactly what you want.
The function signature is:
numpy.histogram(a, bins=10, range=None, normed=False, weights=None, new=None)
We're mostly interested in a and bins. a is the input data that needs to be binned. bins can be a number of bins (your num_bins), or it can be a sequence of scalars, which denote bin edges (half open).
import numpy
values = numpy.arange(10, dtype=int)
bins = numpy.arange(-1, 11)
freq, bins = numpy.histogram(values, bins)
# freq is now [0 1 1 1 1 1 1 1 1 1 1]
# bins is unchanged
To quote the documentation:
All but the last (righthand-most) bin is half-open. In other words, if bins is:
[1, 2, 3, 4]
then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.
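A quick check of that closed last bin (my own snippet, not from the original answer):
import numpy as np

freq, edges = np.histogram([1, 2, 3, 4], bins=[1, 2, 3, 4])
print(freq)  # [1 1 2] -- the last bin [3, 4] catches both 3 and 4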
Edit: You want to know the index in your bins of each element. For this, you can use numpy.digitize(). If your bins are going to be integral, you can use numpy.bincount() as well.
>>> values = numpy.random.randint(0, 20, 10)
>>> values
array([17, 14, 9, 7, 6, 9, 19, 4, 2, 19])
>>> bins = numpy.linspace(-1, 21, 23)
>>> bins
array([-1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.,
       12., 13., 14., 15., 16., 17., 18., 19., 20., 21.])
>>> pos = numpy.digitize(values, bins)
>>> pos
array([19, 16, 11, 9, 8, 11, 21, 6, 4, 21])
Since the interval is open on the upper limit, the indices are correct:
>>> (bins[pos-1] == values).all()
True
>>> import sys
>>> for n in range(len(values)):
...     sys.stdout.write("%g <= %g < %g\n"
...                      % (bins[pos[n]-1], values[n], bins[pos[n]]))
17 <= 17 < 18
14 <= 14 < 15
9 <= 9 < 10
7 <= 7 < 8
6 <= 6 < 7
9 <= 9 < 10
19 <= 19 < 20
4 <= 4 < 5
2 <= 2 < 3
19 <= 19 < 20
This is fairly straightforward in numpy using broadcasting--my example below is four lines of code (not counting first two lines to create bins and data points, which would of course ordinarily be supplied.)
import numpy as NP

# just creating 5 bins at random, each bin expressed as (x, y, z); this code
# is not limited by bin number or bin dimension
# (randint's upper bound is exclusive; the old random_integers is gone from NumPy)
bins = NP.random.randint(10, 100, 15).reshape(5, 3)
# creating 30 random data points
data = NP.random.randint(10, 100, 90).reshape(30, 3)
# for each data point i want the nearest bin, but before i can generate a distance
# matrix, i need to 'conform' the array dimensions
# 'broadcasting' is an excellent and concise way to do this
bins = bins[:, NP.newaxis, :]
data2 = data[NP.newaxis, :, :]
# now i can calculate the distance matrix
dist_matrix = NP.sqrt(NP.sum((data2 - bins)**2, axis=-1))
bin_assignments = NP.argmin(dist_matrix, axis=0)
'bin_assignments' is a 1d array of indices comprised of integer values from 0 to 4, corresponding to the five bins--the bin assignments for each of the 30 original points in the 'data' matrix above.
