How to convert different numpy arrays to sets?

I have one numpy array that looks like this:
array([ 0, 1, 2, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16,
18, 19, 20, 22, 27, 28, 29, 32, 33, 34, 36, 37, 38,
39, 42, 43, 44, 45, 47, 48, 51, 52, 54, 55, 56, 60,
65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 77, 78, 80,
81, 82, 83, 84, 85, 86, 87, 88, 89, 92, 94, 95, 97,
98, 100, 101, 102, 105, 106, 108, 109, 113, 114, 117, 118, 119,
121, 123, 124, 126, 127, 128, 129, 131, 132, 133, 134, 135, 137,
138, 141, 142, 143, 144, 145, 147, 148, 149, 152, 154, 156, 157,
159, 160, 161, 163, 165, 166, 167, 168, 169, 170, 172, 176, 177,
179, 180, 182, 183, 185, 186, 187, 188, 191, 192, 194, 196, 197,
199, 200, 201, 202, 204, 205, 206, 207, 208])
I'm able to convert this to a set using set() with no problem.
However, I have another numpy array that looks like:
array([[ 2],
[ 4],
[ 10],
[ 10],
[ 12],
[ 13],
[ 14],
[ 16],
[ 19],
[ 21],
[ 21],
[ 22],
[ 29],
[209]])
When I try to use set() on this one, I get an error: TypeError: unhashable type: 'numpy.ndarray'
How can I convert my second numpy array to look like the first array, so that I can use set()?
For reference my second array is converted from a PySpark dataframe column using:
np.array(data2.select('row_num').collect())
And both arrays are used with set() in:
count = sorted(set(range(data1)) - set(np.array(data2.select('row_num').collect())))

As mentioned, use ravel, which returns a contiguous flattened array:
import numpy as np
arr = np.array(
    [[2], [4], [10], [10], [12], [13], [14], [16], [19], [21], [21], [22], [29], [209]]
)
print(set(arr.ravel()))
Outputs:
{2, 4, 10, 12, 13, 14, 16, 209, 19, 21, 22, 29}
This is roughly equivalent to reshaping to a single dimension equal to the array's size:
print(set(arr.reshape(arr.size)))
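Applied back to the question, the fix is to ravel the collected column before building the set. A minimal sketch, assuming data1 holds the total row count (the original code calls range(data1)):
import numpy as np

data1 = 210  # assumption: total number of rows, as implied by range(data1)

# Hypothetical stand-in for np.array(data2.select('row_num').collect()),
# which produces an (n, 1) column vector:
collected = np.array([[2], [4], [10], [209]])

count = sorted(set(range(data1)) - set(collected.ravel()))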

Related

Advanced 3d numpy array slicing with alternation

So, I want to slice my 3d array to skip the first 2 sub-arrays and then return the next two, with the slice alternating like this (skip 2, take 2, skip 2, take 2, ...). I have found a solution, but I was wondering if there is a more elegant way to go about this, preferably without having to reshape?
arr = np.arange(1, 251).reshape((10, 5, 5))
sliced_array = np.concatenate((arr[2::4], arr[3::4]), axis=1).ravel().reshape((4, 5, 5))
You can use boolean indexing using a mask that repeats [False, False, True, True, ...]:
import numpy as np
arr = np.arange(1, 251).reshape((10, 5, 5))
mask = np.arange(arr.shape[0]) % 4 >= 2
out = arr[mask]
out:
array([[[ 51, 52, 53, 54, 55],
[ 56, 57, 58, 59, 60],
[ 61, 62, 63, 64, 65],
[ 66, 67, 68, 69, 70],
[ 71, 72, 73, 74, 75]],
[[ 76, 77, 78, 79, 80],
[ 81, 82, 83, 84, 85],
[ 86, 87, 88, 89, 90],
[ 91, 92, 93, 94, 95],
[ 96, 97, 98, 99, 100]],
[[151, 152, 153, 154, 155],
[156, 157, 158, 159, 160],
[161, 162, 163, 164, 165],
[166, 167, 168, 169, 170],
[171, 172, 173, 174, 175]],
[[176, 177, 178, 179, 180],
[181, 182, 183, 184, 185],
[186, 187, 188, 189, 190],
[191, 192, 193, 194, 195],
[196, 197, 198, 199, 200]]])
Since you select and skip the same number of sub-arrays, reshaping works.
For a 1d array:
In [97]: np.arange(10).reshape(5,2)[1::2]
Out[97]:
array([[2, 3],
[6, 7]])
which can then be ravelled.
Generalizing to more dimensions:
In [98]: x = np.arange(100).reshape(10,10)
In [99]: x.reshape(5,2,10)[1::2,...].reshape(-1,10)
Out[99]:
array([[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79]])
I won't go on to 3d because the display will be longer, but it should be straightforward.
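For completeness, a short sketch of the 3d case using the question's own arr, verified against the boolean-mask result above:
import numpy as np

arr = np.arange(1, 251).reshape(10, 5, 5)

# Group the first axis into 5 blocks of 2 sub-arrays, keep every other
# block (the 2nd, 4th, ...), then restore the trailing (5, 5) shape.
out = arr.reshape(5, 2, 5, 5)[1::2].reshape(-1, 5, 5)

mask = np.arange(arr.shape[0]) % 4 >= 2
assert np.array_equal(out, arr[mask])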

How can I remove clusters of given indexes in kmeans? [duplicate]

I am using the sklearn.cluster KMeans package. Once I finish the clustering, how can I know which values were grouped together?
Say I had 100 data points and KMeans gave me 5 clusters. Now I want to know which data points are in cluster 5. How can I do that?
Is there a function that takes a cluster id and lists out all the data points in that cluster?
I had a similar requirement, and I am using pandas to create a new DataFrame with the index of the dataset and the labels as columns.
data = pd.read_csv('filename')
km = KMeans(n_clusters=5).fit(data)
cluster_map = pd.DataFrame()
cluster_map['data_index'] = data.index.values
cluster_map['cluster'] = km.labels_
Once the DataFrame is available, it is quite easy to filter.
For example, to select all data points in cluster 3:
cluster_map[cluster_map.cluster == 3]
If you have a large dataset and you need to extract clusters on demand, you'll see some speed-up using numpy.where. Here is an example on the iris dataset:
from sklearn.cluster import KMeans
from sklearn import datasets
import numpy as np
centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target
km = KMeans(n_clusters=3)
km.fit(X)
Define a function to extract the indices of the cluster_id you provide. (Here are two functions, for benchmarking; they both return the same values):
def ClusterIndicesNumpy(clustNum, labels_array):  # numpy
    return np.where(labels_array == clustNum)[0]

def ClusterIndicesComp(clustNum, labels_array):  # list comprehension
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])
Let's say you want all samples that are in cluster 2:
ClusterIndicesNumpy(2, km.labels_)
array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])
Numpy wins the benchmark:
%timeit ClusterIndicesNumpy(2,km.labels_)
100000 loops, best of 3: 4 µs per loop
%timeit ClusterIndicesComp(2,km.labels_)
1000 loops, best of 3: 479 µs per loop
Now you can extract all of your cluster 2 data points like so:
X[ClusterIndicesNumpy(2,km.labels_)]
array([[ 6.9, 3.1, 4.9, 1.5],
[ 6.7, 3. , 5. , 1.7],
[ 6.3, 3.3, 6. , 2.5],
... #truncated
Double-check the first three indices from the truncated array above:
print(X[52], km.labels_[52])
print(X[77], km.labels_[77])
print(X[100], km.labels_[100])
[ 6.9 3.1 4.9 1.5] 2
[ 6.7 3. 5. 1.7] 2
[ 6.3 3.3 6. 2.5] 2
Actually, a very simple way to do this is:
clusters = KMeans(n_clusters=5).fit(df)
df[clusters.labels_ == 0]
The second line returns all the rows of df that belong to cluster 0. Similarly, you can find the elements of the other clusters.
To get the IDs of the points/samples/observations that are inside each cluster, do this:
Python 2
Example using Iris data and a nice pythonic way:
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(0)
# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X,y)
#Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_
#Labels of each point
clf.labels_
# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}
# Transform this dictionary into list (if you need a list as result)
dictlist = []
for key, value in mydict.iteritems():
    temp = [key, value]
    dictlist.append(temp)
RESULTS
#dict format
{0: array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149]),
1: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
2: array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])}
# list format
[[0, array([ 50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114,
119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
[1, array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
[2, array([ 52, 77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]
Python 3
Just change
for key, value in mydict.iteritems():
to
for key, value in mydict.items():
You can look at the labels_ attribute.
For example
km = KMeans(2)
km.fit([[1,2,3],[2,3,4],[5,6,7]])
print(km.labels_)
output: array([1, 1, 0], dtype=int32)
As you can see, the first and second points are in cluster 1 and the last point is in cluster 0.
You can simply store the labels in an array, convert the array to a DataFrame, and then merge the data that you used to create the K-means model with the new DataFrame of clusters.
Display the DataFrame and you should see each row with its corresponding cluster. If you want to list all the data in a specific cluster, use something like data.loc[data['cluster_label_name'] == 2], assuming 2 is your cluster for now.
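A minimal sketch of that approach; the toy data and the column name cluster_label_name are made up for illustration:
import pandas as pd
from sklearn.cluster import KMeans

data = pd.DataFrame({'x': [1, 2, 10, 11], 'y': [1, 2, 10, 11]})
km = KMeans(n_clusters=2, n_init=10).fit(data)

# Attach the labels as a new column, then filter rows by cluster id.
data = data.assign(cluster_label_name=km.labels_)
print(data.loc[data['cluster_label_name'] == 1])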

Combine numpy subarrays of varying dimensions

I have a nested numpy array (dtype=object) containing 333 arrays that increase consistently in size, from 52x1 up to 52x333.
I would like to extract and concatenate these arrays so that I end up with a single 52x55611 array.
I imagine this may be straightforward, but my attempts using numpy.reshape have been unsuccessful.
If you want to stack them along the second axis, you can use numpy.hstack.
list_of_arrays = [array_1, ..., array_n]  # all these arrays have the same shape[0]
big_array = np.hstack(list_of_arrays)
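A short sketch mimicking the question's setup, with 3 blocks instead of 333 for brevity (the widths and the zeros content are placeholders):
import numpy as np

# Object array holding 52-row blocks whose widths grow: 52x1, 52x2, 52x3.
nested = np.empty(3, dtype=object)
for i in range(3):
    nested[i] = np.zeros((52, i + 1))

big = np.hstack(list(nested))  # concatenates 2-D blocks along axis 1
print(big.shape)               # (52, 6); the full 333 blocks give (52, 55611)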
If I have understood you correctly, you could use numpy.concatenate.
>>> import numpy as np
>>> a = np.array([range(52)])
>>> b = np.array([range(52,104), range(104, 156)])
>>> np.concatenate((a,b))
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51],
[ 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103],
[104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155]])
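Note that concatenate stacks along axis 0 by default; for the question's 52-row blocks, you would pass axis=1 to get the (52, 55611) result. A quick sketch with placeholder zeros:
>>> blocks = [np.zeros((52, n)) for n in range(1, 334)]
>>> np.concatenate(blocks, axis=1).shape
(52, 55611)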

Reshaping a 1D bytes object into a 3D numpy array

I'm using FFmpeg to decode a video, and am piping the RGB24 raw data into python.
So the format of the binary data is:
RGBRGBRGBRGB...
I need to convert this into a (640, 360, 3) numpy array, and was wondering if I could use reshape for this and, especially, how.
If rgb is a bytearray with 3 * 360 * 640 bytes, all you need is:
np.array(rgb).reshape(640, 360, 3)
As an example:
>>> import random
>>> import numpy as np
>>> bytearray(random.getrandbits(8) for _ in range(3 * 4 * 4))
bytearray(b'{)jg\xba\xbe&\xd1\xb9\xdd\xf9#\xadL?GV\xca\x19\xfb\xbd\xad\xc2C\xa8,+\x8aEGpo\x04\x89=e\xc3\xef\x17H#\x90]\xd5^\x94~/')
>>> rgb = bytearray(random.getrandbits(8) for _ in range(3 * 4 * 4))
>>> np.array(rgb)
array([112, 68, 7, 41, 175, 109, 124, 111, 116, 6, 124, 168, 146,
60, 125, 133, 1, 74, 251, 194, 79, 14, 72, 236, 188, 56,
52, 145, 125, 236, 86, 108, 235, 9, 215, 49, 190, 16, 90,
9, 114, 43, 214, 65, 132, 128, 145, 214], dtype=uint8)
>>> np.array(rgb).reshape(4,4,3)
array([[[112, 68, 7],
[ 41, 175, 109],
[124, 111, 116],
[ 6, 124, 168]],
[[146, 60, 125],
[133, 1, 74],
[251, 194, 79],
[ 14, 72, 236]],
[[188, 56, 52],
[145, 125, 236],
[ 86, 108, 235],
[ 9, 215, 49]],
[[190, 16, 90],
[ 9, 114, 43],
[214, 65, 132],
[128, 145, 214]]], dtype=uint8)
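If the data arrives as immutable bytes from the pipe, np.frombuffer avoids the element-by-element copy; a sketch assuming the (640, 360, 3) shape from the question (the resulting array is read-only, so copy() it if you need to modify pixels):
import numpy as np

raw = bytes(3 * 640 * 360)  # placeholder for the raw RGB24 output of FFmpeg
frame = np.frombuffer(raw, dtype=np.uint8).reshape(640, 360, 3)
print(frame.shape)  # (640, 360, 3)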
You might want to look at existing numpy and scipy methods for image processing. misc.imread could be interesting.

How to find differences of mat files in Python in human readable format?

.mat files can be loaded into Python with:
import scipy.io
matdata1 = scipy.io.loadmat('file1.mat')
matdata2 = scipy.io.loadmat('file2.mat')
The loaded data can then be dumped as text by calling the following Python function:
def mat2txt(matdata):
    for k, v in matdata.items():  # Python 3 specific
        if isinstance(v, dict):
            mat2txt(v)
        else:
            print(k, v)
The two .mat files that are being compared are of the same structure and type with different values.
I would like to be able to identify the differing values in a human-readable format, and not just their location.
I have tried:
diff matdata1.txt matdata2.txt
diff matdata1.txt matdata2.txt | grep "<" | sed 's/^<//g'
grep -v -F -x -f matdata1.txt matdata2.txt
These do not point to the specific differing values, and they run outside Python. I hoped to store the .mat files as .txt to create a static state for comparing a file's data at different dates, both against itself and against other files, as well as to store them in git for future comparisons.
A toy example of the resulting data files is:
matdata1.txt
b [[([[(array([[0]], dtype=uint8),)]],)]]
a [[([[(array([[0]], dtype=uint8),)]],)]]
c [[ ([[(array([[ ([[122, 139, 156, 173, 190, 207, 224, 1, 18, 35, 52, 69, 86, 103, 120], [138, 155, 172, 189, 206, 223, 15, 17, 34, 51, 68, 85, 102, 119, 121], [154, 171, 188, 205, 222, 14, 16, 33, 50, 67, 84, 101, 118, 135, 137], [170, 187, 204, 221, 13, 30, 32, 49, 66, 83, 100, 117, 134, 136, 153], [186, 203, 220, 12, 29, 31, 48, 65, 82, 99, 116, 133, 150, 152, 169], [202, 22, 11, 28, 45, 47, 64, 81, 98, 115, 132, 149, 151, 168, 185], [218, 10, 27, 33, 46, 63, 80, 97, 114, 131, 148, 165, 167, 184, 201], [9, 26, 43, 60, 62, 11, 96, 113, 130, 147, 164, 166, 183, 200, 217], [25, 42, 59, 61, 78, 95, 112, 99, 146, 163, 180, 182, 199, 216, 8], [41, 58, 75, 77, 94, 111, 128, 145, 162, 179, 181, 198, 215, 7, 24], [57, 74, 76, 93, 110, 127, 144, 161, 178, 195, 197, 214, 6, 23, 40], [73, 90, 92, 109, 126, 143, 160, 177, 194, 196, 213, 5, 22, 39, 56], [89, 91, 108, 125, 142, 159, 176, 193, 210, 212, 4, 21, 38, 55, 72], [105, 107, 124, 141, 158, 175, 192, 209, 211, 3, 20, 37, 54, 71, 88], [106, 123, 140, 157, 174, 191, 208, 225, 2, 19, 36, 53, 70, 87, 104]],)]],
dtype=[('c', 'O')]),)]],)]]
__header__ b'MATLAB 5.0 MAT-file, Platform: PCWIN64, Created on: Mon Jun 27 20:55:29 2016'
d [[1]]
__globals__ []
f ['string']
__version__ 1.0
e [[2]]
matdata2.txt
e [[2]]
d [[1]]
__globals__ []
f ['string']
__header__ b'MATLAB 5.0 MAT-file, Platform: PCWIN64, Created on: Mon Jun 27 20:54:48 2016'
c [[ ([[(array([[ ([[122, 139, 156, 173, 190, 207, 224, 1, 18, 35, 52, 69, 86, 103, 120], [138, 155, 172, 189, 206, 223, 15, 17, 34, 51, 68, 85, 102, 119, 121], [154, 171, 188, 205, 222, 14, 16, 33, 50, 67, 84, 101, 118, 135, 137], [170, 187, 204, 221, 13, 30, 32, 49, 66, 83, 100, 117, 134, 136, 153], [186, 203, 220, 12, 29, 31, 48, 65, 82, 99, 116, 133, 150, 152, 169], [202, 219, 11, 28, 45, 47, 64, 81, 98, 115, 132, 149, 151, 168, 185], [218, 10, 27, 44, 46, 63, 80, 97, 114, 131, 148, 165, 167, 184, 201], [9, 26, 43, 60, 62, 79, 96, 113, 130, 147, 164, 166, 183, 200, 217], [25, 42, 59, 61, 78, 95, 112, 129, 146, 163, 180, 182, 199, 216, 8], [41, 58, 75, 77, 94, 111, 128, 145, 162, 179, 181, 198, 215, 7, 24], [57, 74, 76, 93, 110, 127, 144, 161, 178, 195, 197, 214, 6, 23, 40], [73, 90, 92, 109, 126, 143, 160, 177, 194, 196, 213, 5, 22, 39, 56], [89, 91, 108, 125, 142, 159, 176, 193, 210, 212, 4, 21, 38, 55, 72], [105, 107, 124, 141, 158, 175, 192, 209, 211, 3, 20, 37, 54, 71, 88], [106, 123, 140, 157, 174, 191, 208, 225, 2, 19, 36, 53, 70, 87, 104]],)]],
dtype=[('c', 'O')]),)]],)]]
a [[([[(array([[1]], dtype=uint8),)]],)]]
b [[([[(array([[0]], dtype=uint8),)]],)]]
__version__ 1.0
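One way to report the differing values inside Python is a recursive compare over the loaded dictionaries. A minimal sketch: np.array_equal handles the plain numeric entries shown above, while deeply nested MATLAB structs (object dtypes) may need extra unwrapping:
import numpy as np
import scipy.io

def diff_mat(d1, d2, path=''):
    # Walk both dicts and print entries whose values differ.
    for k in sorted(set(d1) | set(d2)):
        if k.startswith('__'):  # skip __header__, __version__, __globals__
            continue
        v1, v2 = d1.get(k), d2.get(k)
        here = path + '/' + k
        if isinstance(v1, dict) and isinstance(v2, dict):
            diff_mat(v1, v2, here)
        elif not np.array_equal(v1, v2):
            print(here, v1, '!=', v2)

matdata1 = scipy.io.loadmat('file1.mat')
matdata2 = scipy.io.loadmat('file2.mat')
diff_mat(matdata1, matdata2)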
