So, I want to slice my 3D array so that it skips the first two subarrays along the first axis and then returns the next two, repeating this pattern (skip two, take two) for the rest of the array. I have found a solution, but I was wondering if there is a more elegant way to go about this, preferably without having to reshape?
arr = np.arange(1, 251).reshape((10, 5, 5))
sliced_array = np.concatenate((arr[2::4], arr[3::4]), axis=1).ravel().reshape((4, 5, 5))
You can use boolean indexing with a mask that repeats [False, False, True, True, ...]:
import numpy as np
arr = np.arange(1, 251).reshape((10, 5, 5))
mask = np.arange(arr.shape[0]) % 4 >= 2
out = arr[mask]
out:
array([[[ 51, 52, 53, 54, 55],
[ 56, 57, 58, 59, 60],
[ 61, 62, 63, 64, 65],
[ 66, 67, 68, 69, 70],
[ 71, 72, 73, 74, 75]],
[[ 76, 77, 78, 79, 80],
[ 81, 82, 83, 84, 85],
[ 86, 87, 88, 89, 90],
[ 91, 92, 93, 94, 95],
[ 96, 97, 98, 99, 100]],
[[151, 152, 153, 154, 155],
[156, 157, 158, 159, 160],
[161, 162, 163, 164, 165],
[166, 167, 168, 169, 170],
[171, 172, 173, 174, 175]],
[[176, 177, 178, 179, 180],
[181, 182, 183, 184, 185],
[186, 187, 188, 189, 190],
[191, 192, 193, 194, 195],
[196, 197, 198, 199, 200]]])
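If you prefer to build that repeating mask explicitly, np.tile works as well; a small sketch with the same arr (the reps arithmetic is just a ceiling division so the tiled pattern covers all 10 subarrays):
reps = -(-arr.shape[0] // 4)  # ceil(10 / 4) = 3
mask = np.tile([False, False, True, True], reps)[:arr.shape[0]]
out = arr[mask]  # same result as the modulo-based mask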
Since you want to select and skip the same number of subarrays, reshaping works.
For a 1d array:
In [97]: np.arange(10).reshape(5,2)[1::2]
Out[97]:
array([[2, 3],
[6, 7]])
which can then be ravelled.
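For example, raveling flattens the kept pairs back to 1D:
np.arange(10).reshape(5, 2)[1::2].ravel()
# -> array([2, 3, 6, 7])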
Generalizing to more dimensions:
In [98]: x = np.arange(100).reshape(10,10)
In [99]: x.reshape(5,2,10)[1::2,...].reshape(-1,10)
Out[99]:
array([[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79]])
I won't go on to 3D because the display will be longer, but it should be straightforward.
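For completeness, a minimal sketch of the same trick on the asker's (10, 5, 5) array: group axis 0 into consecutive pairs, then keep every other pair:
arr = np.arange(1, 251).reshape((10, 5, 5))
out = arr.reshape(5, 2, 5, 5)[1::2].reshape(-1, 5, 5)  # shape (4, 5, 5)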
I am using the sklearn.cluster KMeans package. Once I finish the clustering, how can I know which values were grouped together?
Say I had 100 data points and KMeans gave me 5 clusters. Now I want to know which data points are in cluster 5. How can I do that?
Is there a function that takes a cluster id and lists out all the data points in that cluster?
I had a similar requirement, and I am using pandas to create a new DataFrame with the index of the dataset and the labels as columns.
data = pd.read_csv('filename')
km = KMeans(n_clusters=5).fit(data)
cluster_map = pd.DataFrame()
cluster_map['data_index'] = data.index.values
cluster_map['cluster'] = km.labels_
Once the DataFrame is available, it is quite easy to filter.
For example, to filter all data points in cluster 3:
cluster_map[cluster_map.cluster == 3]
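If you want every cluster's members at once instead of filtering one id at a time, a sketch using pandas groupby on the same cluster_map:
members = {cid: grp['data_index'].values
           for cid, grp in cluster_map.groupby('cluster')}
members[3]  # indices of all points in cluster 3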
If you have a large dataset and you need to extract clusters on demand, you'll see some speed-up using numpy.where. Here is an example on the iris dataset:
from sklearn.cluster import KMeans
from sklearn import datasets
import numpy as np
centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target
km = KMeans(n_clusters=3)
km.fit(X)
Define a function to extract the indices of the cluster_id you provide. (Here are two functions, for benchmarking; they both return the same values.)
def ClusterIndicesNumpy(clustNum, labels_array):  # numpy
    return np.where(labels_array == clustNum)[0]

def ClusterIndicesComp(clustNum, labels_array):  # list comprehension
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])
Let's say you want all samples that are in cluster 2:
ClusterIndicesNumpy(2, km.labels_)
array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])
Numpy wins the benchmark:
%timeit ClusterIndicesNumpy(2,km.labels_)
100000 loops, best of 3: 4 µs per loop
%timeit ClusterIndicesComp(2,km.labels_)
1000 loops, best of 3: 479 µs per loop
Now you can extract all of your cluster 2 data points like so:
X[ClusterIndicesNumpy(2,km.labels_)]
array([[ 6.9, 3.1, 4.9, 1.5],
[ 6.7, 3. , 5. , 1.7],
[ 6.3, 3.3, 6. , 2.5],
... #truncated
Double-check the first three indices from the truncated array above:
print(X[52], km.labels_[52])
print(X[77], km.labels_[77])
print(X[100], km.labels_[100])
[ 6.9 3.1 4.9 1.5] 2
[ 6.7 3. 5. 1.7] 2
[ 6.3 3.3 6. 2.5] 2
Actually a very simple way to do this is:
clusters = KMeans(n_clusters=5).fit(df)
df[clusters.labels_ == 0]
The second line returns all the elements of df that belong to cluster 0 (note that the model has to be fit before labels_ exists). Similarly you can find the elements of the other clusters.
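Looping over the label values then lists every cluster in turn; a small sketch reusing the clusters object fitted above (df is whatever DataFrame you clustered):
for k in range(5):
    print(k, len(df[clusters.labels_ == k]))  # cluster id and its size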
To get the IDs of the points/samples/observations that are inside each cluster, do this:
Python 2
Example using Iris data and a nice pythonic way:
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(0)
# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X)  # KMeans is unsupervised; y is not needed for fitting
#Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_
#Labels of each point
clf.labels_
# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}
# Transform this dictionary into a list (if you need a list as a result)
dictlist = []
for key, value in mydict.iteritems():
    temp = [key, value]
    dictlist.append(temp)
RESULTS
#dict format
{0: array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149]),
1: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
2: array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])}
# list format
[[0, array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
[1, array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
[2, array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]
Python 3
Just change
for key, value in mydict.iteritems():
to
for key, value in mydict.items():
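Alternatively, a list comprehension produces the same list and runs unchanged on both versions:
dictlist = [[key, value] for key, value in mydict.items()]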
You can look at the attribute labels_.
For example:
km = KMeans(n_clusters=2)
km.fit([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
km.labels_
output: array([1, 1, 0], dtype=int32)
As you can see, the first and second points are in cluster 1 and the last point is in cluster 0.
You can simply store the labels in an array and convert the array to a DataFrame, then merge the data that you used to create the K-means model with the new DataFrame of clusters.
Display the DataFrame: now you should see each row with its corresponding cluster. If you want to list all the data with a specific cluster, use something like data.loc[data['cluster_label_name'] == 2], assuming 2 is your cluster of interest, as sketched below.
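A minimal sketch of what this describes, using hypothetical random data in place of whatever was originally fed to K-means:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the data used to fit the model
data = pd.DataFrame(np.random.rand(100, 4))
km = KMeans(n_clusters=5).fit(data)

# Store the labels in a new column next to the original features
labeled = data.assign(cluster_label_name=km.labels_)

# List all the data with a specific cluster, here cluster 2
labeled.loc[labeled['cluster_label_name'] == 2]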
I'm using FFmpeg to decode a video, and am piping the RGB24 raw data into Python.
So the format of the binary data is:
RGBRGBRGBRGB...
I need to convert this into a (640, 360, 3) numpy array, and was wondering if I could use reshape for this and, especially, how.
If rgb is a bytearray with 3 * 360 * 640 bytes, all you need is :
np.array(rgb).reshape(640, 360, 3)
As an example:
>>> import random
>>> import numpy as np
>>> bytearray(random.getrandbits(8) for _ in range(3 * 4 * 4))
bytearray(b'{)jg\xba\xbe&\xd1\xb9\xdd\xf9#\xadL?GV\xca\x19\xfb\xbd\xad\xc2C\xa8,+\x8aEGpo\x04\x89=e\xc3\xef\x17H#\x90]\xd5^\x94~/')
>>> rgb = bytearray(random.getrandbits(8) for _ in range(3 * 4 * 4))
>>> np.array(rgb)
array([112, 68, 7, 41, 175, 109, 124, 111, 116, 6, 124, 168, 146,
60, 125, 133, 1, 74, 251, 194, 79, 14, 72, 236, 188, 56,
52, 145, 125, 236, 86, 108, 235, 9, 215, 49, 190, 16, 90,
9, 114, 43, 214, 65, 132, 128, 145, 214], dtype=uint8)
>>> np.array(rgb).reshape(4,4,3)
array([[[112, 68, 7],
[ 41, 175, 109],
[124, 111, 116],
[ 6, 124, 168]],
[[146, 60, 125],
[133, 1, 74],
[251, 194, 79],
[ 14, 72, 236]],
[[188, 56, 52],
[145, 125, 236],
[ 86, 108, 235],
[ 9, 215, 49]],
[[190, 16, 90],
[ 9, 114, 43],
[214, 65, 132],
[128, 145, 214]]], dtype=uint8)
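As a side note, np.frombuffer gives a view on the underlying buffer instead of the element-wise copy np.array makes from a bytearray, which matters at video frame rates; a sketch assuming the question's layout:
import numpy as np

rgb = bytearray(640 * 360 * 3)  # stand-in for the bytes piped from FFmpeg
frame = np.frombuffer(rgb, dtype=np.uint8).reshape(640, 360, 3)
# Raw RGB24 arrives scanline by scanline, so for a 640-wide, 360-tall video,
# (360, 640, 3), i.e. (height, width, 3), may be the layout you actually want.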
You might want to look at existing NumPy and SciPy methods for image processing; scipy.misc.imread could be interesting (note it has since been deprecated, with imageio.imread as the suggested replacement).
Given an array image, which might be 2D, 3D, or 4D, but preferably nD, I want to extract a contiguous patch of the array around a point, with a list denoting how far I extend along each axis, and pad the array with a pad_value wherever the extension falls outside the image.
I came up with this:
def extract_patch_around_point(image, loc, extend, pad_value=0):
    offsets_low = []
    offsets_high = []
    for i, x in enumerate(loc):
        offset_low = -np.min([x - extend[i], 0])
        offsets_low.append(offset_low)
        offset_high = np.max([x + extend[i] - image.shape[i] + 1, 0])
        offsets_high.append(offset_high)
    upper_patch_offsets = []
    lower_image_offsets = []
    upper_image_offsets = []
    for i in range(image.ndim):
        upper_patch_offset = 2*extend[i] + 1 - offsets_high[i]
        upper_patch_offsets.append(upper_patch_offset)
        image_offset_low = loc[i] - extend[i] + offsets_low[i]
        image_offset_high = np.min([loc[i] + extend[i] + 1, image.shape[i]])
        lower_image_offsets.append(image_offset_low)
        upper_image_offsets.append(image_offset_high)
    patch = pad_value*np.ones(2*np.array(extend) + 1)
    # This is ugly
    A = np.ix_(range(offsets_low[0], upper_patch_offsets[0]),
               range(offsets_low[1], upper_patch_offsets[1]))
    B = np.ix_(range(lower_image_offsets[0], upper_image_offsets[0]),
               range(lower_image_offsets[1], upper_image_offsets[1]))
    patch[A] = image[B]
    return patch
Currently it only works in 2D because of the indexing trick with A and B. I do not want to check the number of dimensions and use a different indexing scheme for each. How can I make this independent of image.ndim?
Based on my understanding of the requirements, I would suggest a padded version, using slice notation to keep it generic over the number of dimensions, like so -
def extract_patch_around_point(image, loc, extend, pad_value=0):
    extend = np.asarray(extend)
    image_ext_shp = image.shape + 2*extend
    image_ext = np.full(image_ext_shp, pad_value)
    insert_idx = tuple(slice(i, -i) for i in extend)
    image_ext[insert_idx] = image
    region_idx = tuple(slice(i, j) for i, j in zip(loc, extend*2 + 1 + loc))
    return image_ext[region_idx]
Sample runs -
2D case :
In [229]: np.random.seed(1234)
...: image = np.random.randint(11,99,(13,8))
...: loc = (5,3)
...: extend = np.array([2,4])
...:
In [230]: image
Out[230]:
array([[58, 94, 49, 64, 87, 35, 26, 60],
[34, 37, 41, 54, 41, 37, 69, 80],
[91, 84, 58, 61, 87, 48, 45, 49],
[78, 22, 11, 86, 91, 14, 13, 30],
[23, 76, 86, 92, 25, 82, 71, 57],
[39, 92, 98, 24, 23, 80, 42, 95],
[56, 27, 52, 83, 67, 81, 67, 97],
[55, 94, 58, 60, 29, 96, 57, 48],
[49, 18, 78, 16, 58, 58, 26, 45],
[21, 39, 15, 93, 66, 89, 34, 61],
[73, 66, 95, 11, 44, 32, 82, 79],
[92, 63, 75, 96, 52, 12, 25, 14],
[41, 23, 84, 30, 37, 79, 75, 33]])
In [231]: image[loc]
Out[231]: 24
In [232]: out = extract_patch_around_point(image, loc, extend, pad_value=0)
In [233]: out
Out[233]:
array([[ 0, 78, 22, 11, 86, 91, 14, 13, 30],
[ 0, 23, 76, 86, 92, 25, 82, 71, 57],
[ 0, 39, 92, 98, 24, 23, 80, 42, 95], <-- At middle
[ 0, 56, 27, 52, 83, 67, 81, 67, 97],
[ 0, 55, 94, 58, 60, 29, 96, 57, 48]])
^
3D case :
In [234]: np.random.seed(1234)
...: image = np.random.randint(11,99,(13,5,8))
...: loc = (5,2,3)
...: extend = np.array([1,2,4])
...:
In [235]: image[loc]
Out[235]: 82
In [236]: out = extract_patch_around_point(image, loc, extend, pad_value=0)
In [237]: out.shape
Out[237]: (3, 5, 9)
In [238]: out
Out[238]:
array([[[ 0, 23, 87, 19, 58, 98, 36, 32, 33],
[ 0, 56, 30, 52, 58, 47, 50, 28, 50],
[ 0, 70, 93, 48, 98, 49, 19, 65, 28],
[ 0, 52, 58, 30, 54, 55, 46, 53, 31],
[ 0, 37, 34, 13, 76, 38, 89, 79, 71]],
[[ 0, 14, 92, 58, 72, 74, 43, 24, 67],
[ 0, 59, 69, 46, 68, 71, 94, 20, 71],
[ 0, 61, 62, 60, 82, 92, 15, 14, 57], <-- At middle
[ 0, 58, 74, 95, 16, 94, 83, 83, 74],
[ 0, 67, 25, 92, 71, 19, 52, 44, 80]],
[[ 0, 74, 28, 12, 12, 13, 62, 88, 63],
[ 0, 25, 58, 86, 76, 40, 20, 91, 61],
[ 0, 28, 42, 85, 22, 45, 64, 35, 66],
[ 0, 64, 34, 69, 27, 17, 92, 89, 68],
[ 0, 15, 57, 86, 17, 98, 29, 59, 50]]])
^
Here is a simple working example that demonstrates how to iteratively "whittle down" your input matrix to obtain the patch around a point in n dimensions:
import numpy as np

# Givens. Matrix to be sliced, point around which to slice,
# and the padding around the given point
matrix = np.random.normal(size=[5, 5, 5])
loc = (3, 3, 3)
padding = 2

# If one knows the dimensionality, the slice can be obtained easily
ans1 = matrix[loc[0] - padding:loc[0] + 1,
              loc[1] - padding:loc[1] + 1,
              loc[2] - padding:loc[2] + 1]

# If one does not know the dimensionality, the slice can be
# obtained iteratively
ans2 = matrix
for i in range(matrix.ndim):
    # Compute the slice for this particular axis
    s = slice(loc[i] - padding, loc[i] + 1, 1)
    # Move the axis to the front, slice it, then move it back
    ans2 = np.moveaxis(np.moveaxis(ans2, i, 0)[s], 0, i)

# Assert the two answers are equal
np.testing.assert_array_equal(ans1, ans2)
This example does not take into account slicing beyond the existing dimensions, but that exception can be easily caught in the loop.
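A dimension-agnostic variant of the same idea builds the tuple of slices directly, with no axis shuffling (a sketch; like the loop above, it ignores slicing beyond the existing dimensions):
ans3 = matrix[tuple(slice(c - padding, c + 1) for c in loc)]
np.testing.assert_array_equal(ans1, ans3)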
What is the most pythonic way of splitting a NumPy matrix (a 2-D array) into equal chunks both vertically and horizontally?
For example :
aa = np.reshape(np.arange(270),(18,15)) # a 18x15 matrix
then a "function" like
ab = np.split2d(aa,(2,3))
would result in a list of 6 matrices shaped (9,5) each. The first guess is to combine hsplit, map and vsplit, but how does the map have to be applied if there are two parameters to define for it, like:
map(np.vsplit(#,3),np.hsplit(aa,2))
Here's one approach staying within the NumPy environment -
def view_as_blocks(arr, BSZ):
    # arr is input array, BSZ is block-size
    m, n = arr.shape
    M, N = BSZ
    return arr.reshape(m//M, M, n//N, N).swapaxes(1, 2).reshape(-1, M, N)
Sample runs
1) Actual big case to verify shapes :
In [41]: aa = np.reshape(np.arange(270),(18,15))
In [42]: view_as_blocks(aa, (9,5)).shape
Out[42]: (6, 9, 5)
2) Small case to manually verify values:
In [43]: aa = np.reshape(np.arange(36),(6,6))
In [44]: aa
Out[44]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
In [45]: view_as_blocks(aa, (2,3)) # Blocks of shape (2,3)
Out[45]:
array([[[ 0, 1, 2],
[ 6, 7, 8]],
[[ 3, 4, 5],
[ 9, 10, 11]],
[[12, 13, 14],
[18, 19, 20]],
[[15, 16, 17],
[21, 22, 23]],
[[24, 25, 26],
[30, 31, 32]],
[[27, 28, 29],
[33, 34, 35]]])
If you are willing to work with other libraries, scikit-image could be of use here, like so -
from skimage.util import view_as_blocks as viewB
BSZ = (9, 5)  # block size
out = viewB(aa, tuple(BSZ)).reshape(-1, *BSZ)
Runtime test -
In [103]: aa = np.reshape(np.arange(270),(18,15))
# @EFT's soln
In [99]: %timeit split_2d(aa, (2,3))
10000 loops, best of 3: 23.3 µs per loop
# @glegoux's soln-1
In [100]: %timeit list(get_chunks(aa, 2,3))
100000 loops, best of 3: 3.7 µs per loop
# @glegoux's soln-2
In [111]: %timeit list(get_chunks2(aa, 9, 5))
100000 loops, best of 3: 3.39 µs per loop
# Proposed in this post
In [101]: %timeit view_as_blocks(aa, (9,5))
1000000 loops, best of 3: 1.86 µs per loop
Please note that I have used (2,3) for split_2d and get_chunks because, by their definitions, those take the number of blocks. In my case with view_as_blocks, the parameter BSZ indicates the block size, so I have (9,5) there. get_chunks2 follows the same convention as view_as_blocks. The outputs represent the same blocks.
You could use np.split & np.concatenate, the latter to allow the second split to be conducted in a single step:
def split_2d(array, splits):
    x, y = splits
    return np.split(np.concatenate(np.split(array, y, axis=1)), x*y)
ab = split_2d(aa,(2,3))
ab[0].shape
Out[95]: (9, 5)
len(ab)
Out[96]: 6
This also seems like it should be relatively straightforward to generalize to the n-dim case, though I haven't followed that thought all the way through just yet.
Edit:
For a single array as output, just add np.stack:
np.stack(ab).shape
Out[99]: (6, 9, 5)
To cut this (18,15) matrix:
+-----------+
|           |
|           |
+-----------+
into 2x3 blocks of shape (9,5) each, like this:
+---+---+---+
|   |   |   |
+---+---+---+
|   |   |   |
+---+---+---+
do:
from pprint import pprint
import numpy as np
M = np.reshape(np.arange(18*15),(18,15))
def get_chunks(M, n, p):
    n = len(M)//n
    p = len(M[0])//p
    for i in range(0, len(M), n):
        for j in range(0, len(M[0]), p):
            yield M[i:i+n, j:j+p]

def get_chunks2(M, n, p):
    for i in range(0, len(M), n):
        for j in range(0, len(M[0]), p):
            yield M[i:i+n, j:j+p]

# list(get_chunks2(M, 9, 5)) gives the same result, slightly faster
chunks = list(get_chunks(M, 2, 3))
pprint(chunks)
Output:
[array([[ 0, 1, 2, 3, 4],
[ 15, 16, 17, 18, 19],
[ 30, 31, 32, 33, 34],
[ 45, 46, 47, 48, 49],
[ 60, 61, 62, 63, 64],
[ 75, 76, 77, 78, 79],
[ 90, 91, 92, 93, 94],
[105, 106, 107, 108, 109],
[120, 121, 122, 123, 124]]),
array([[ 5, 6, 7, 8, 9],
[ 20, 21, 22, 23, 24],
[ 35, 36, 37, 38, 39],
[ 50, 51, 52, 53, 54],
[ 65, 66, 67, 68, 69],
[ 80, 81, 82, 83, 84],
[ 95, 96, 97, 98, 99],
[110, 111, 112, 113, 114],
[125, 126, 127, 128, 129]]),
array([[ 10, 11, 12, 13, 14],
[ 25, 26, 27, 28, 29],
[ 40, 41, 42, 43, 44],
[ 55, 56, 57, 58, 59],
[ 70, 71, 72, 73, 74],
[ 85, 86, 87, 88, 89],
[100, 101, 102, 103, 104],
[115, 116, 117, 118, 119],
[130, 131, 132, 133, 134]]),
array([[135, 136, 137, 138, 139],
[150, 151, 152, 153, 154],
[165, 166, 167, 168, 169],
[180, 181, 182, 183, 184],
[195, 196, 197, 198, 199],
[210, 211, 212, 213, 214],
[225, 226, 227, 228, 229],
[240, 241, 242, 243, 244],
[255, 256, 257, 258, 259]]),
array([[140, 141, 142, 143, 144],
[155, 156, 157, 158, 159],
[170, 171, 172, 173, 174],
[185, 186, 187, 188, 189],
[200, 201, 202, 203, 204],
[215, 216, 217, 218, 219],
[230, 231, 232, 233, 234],
[245, 246, 247, 248, 249],
[260, 261, 262, 263, 264]]),
array([[145, 146, 147, 148, 149],
[160, 161, 162, 163, 164],
[175, 176, 177, 178, 179],
[190, 191, 192, 193, 194],
[205, 206, 207, 208, 209],
[220, 221, 222, 223, 224],
[235, 236, 237, 238, 239],
[250, 251, 252, 253, 254],
[265, 266, 267, 268, 269]])]
For a simpler solution, I used np.array_split together with transposing the matrices. So let's say that I want it split into 3 equal chunks vertically and 2 chunks horizontally; then:
# Create your matrix
matrix = np.reshape(np.arange(270), (18, 15))  # an 18x15 matrix
# Container for your final matrices
final_matrices = []
# First split into 3 equal chunks vertically
vertically_split_matrices = np.array_split(matrix, 3)
for v_m in vertically_split_matrices:
    # Then split the transposed matrix into 2 chunks
    m1, m2 = np.array_split(v_m.T, 2)
    # And transpose the matrices back
    final_matrices.append(m1.T)
    final_matrices.append(m2.T)
So I end up with 6 chunks, all of the same height. Note, though, that 15 columns do not split evenly in two: np.array_split returns widths 8 and 7 here, so the chunks only all share the same width when the dimension divides evenly.
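A quick shape check makes that caveat concrete:
[m.shape for m in final_matrices]
# -> [(6, 8), (6, 7), (6, 8), (6, 7), (6, 8), (6, 7)]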