How to find clusters in a matrix of values - python

I have a 640x480 matrix that contains temperature values (coming from thermal images);
each element of the matrix represents the temperature of a single pixel like so:
[[31.2 30.4 32.5 ... 31.3 31.6 31.7]
[30.0 37.4 40.5 ... 51.5 52.6 52.7]
...
[28.9 28.8 28.1 ... 31.2 32.4 32.3]]
I want to find clusters in this matrix taking into consideration:
temperature difference between two elements;
positional distance between two elements;
I tried to do this by using the DBSCAN clustering algorithm on an array containing the coordinates and values of the elements, like so:
coord = [[0 0 31.2]
[1 0 30.4]
[2 0 32.5]
...
[638 479 32.4]
[639 479 32.3]]
This is the code:
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array(coord)
db = DBSCAN(eps=2, min_samples=2, metric='manhattan').fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
cluster_dict = {i: X[labels == i] for i in range(n_clusters_)}
The problem is that I get a large number of clusters and the clustering is not precise (even with different values of eps and min_samples).
I would like to know if there is a more efficient method of doing this.
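One detail worth checking before swapping algorithms: when coordinates and temperatures are mixed into one feature array, eps applies equally to all three columns, so the relative scaling decides whether spatial distance or temperature difference dominates. Below is a minimal sketch of that idea, not a tested solution for your data; temp_weight and the eps/min_samples values are hypothetical tuning parameters, and coord is the array defined above.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array(coord, dtype=float)   # columns: x, y, temperature

# hypothetical weight: how many pixels of distance one degree of
# temperature difference should be worth
temp_weight = 2.0
X_scaled = X.copy()
X_scaled[:, 2] *= temp_weight

db = DBSCAN(eps=3, min_samples=5).fit(X_scaled)
labels = db.labels_                # -1 marks noise points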
Thank you all

Related

speed up time for loop iterations in python

I have "random" points and would like to check which points can be connected by straight lines. Therefore I iterate through a list of points and draw a line at different angles. After all lines at all angles for every single point is drawn, I iterate over each line checking whether they are connecting 3 or more points. If the line connects 3 or more points, it is saved by appending it to a new list (newLines), if not the next line gets tested.
The problem which the following code is that it is way to slow... My testing image took about 30 min and my actual image was not done after about 14 hours. I read about speeding up for loops by using numpy (like in this article). I found plenty of examples for replacing for loops with numpy but in these example it was just simple iterating over a list without declaring the values as variables for usage.
Any hint for speeding up the following code is appreciated, it does not necessarily need to be numpy.
from math import sqrt

import numpy as np
from shapely.geometry import Point, LineString, MultiLineString
from shapely.affinity import rotate

# list for saving rotated lines
lines = []
for point in points:
    # length of line is the diagonal of the point image so it still covers the whole image after rotation
    length = sqrt(image.shape[0]**2 + image.shape[1]**2)
    start = Point(point)
    end = Point(start.x + length, start.y)
    line = LineString([start, end])
    # rotating the generated line in 5 degree steps and appending it to the list
    for a in range(0, 360, 5):
        angle = np.deg2rad(a)
        line = rotate(line, angle, origin=start, use_radians=True)
        lines.append(line)
multiLines = MultiLineString(lines)

# list for rotated lines which connect 3 or more points
newLines = []
start = ()
for multiLine in multiLines.geoms:
    lst = list(multiLine.coords)
    # a: starting point of line | b: ending point of line
    a = np.asarray(lst[0])
    b = np.asarray(lst[1])
    count = 0
    # again iterating over point array to check which point is on line
    for point in points:
        p = np.asarray(point)
        # check if point (p) is on line (a - b)
        if np.cross(p - a, b - a) == 0:
            if count == 0:
                start = point
                count += 1
            else:
                end = point
                count += 1
    if count >= 3:
        line = (start, end)
        newLines.append(line)
I'm not sure what your current benchmarks are, but if you want to try numpy you can do something like this. I'm using pandas, which is built on top of numpy, but it's effectively doing the same thing.
I think this is doing the same thing as you want. I'm looking at each pair of points, calculating the m and c coefficients of the equation y = mx + c through the two points, and then checking for cases where these match. I expect you might want some accepted error depending on your input data.
Sorry if I'm way off piste.
import pandas as pd
import numpy as np
import random
import itertools
import time

def get_matches(points):
    # get all combinations of two points
    combinations_of_points = ([(a[0], a[1], b[0], b[1]) for a, b in itertools.combinations(points, 2) if a != b])
    data = pd.DataFrame(combinations_of_points, columns=['x1', 'y1', 'x2', 'y2'])
    data['m'] = (data.y1 - data.y2) / (data.x1 - data.x2)
    # swap negative gradients so all lines are in same direction
    data.loc[np.isfinite(data.m) & data.m < 0, 'm'] = -(1 / data.m)
    data.loc[np.isneginf(data.m), 'm'] = -data.m
    # y = mx + c
    data['c'] = data.y1 - (data.m * data.x1)
    data = data.sort_values(['m', 'c', 'x1']).reset_index(drop=True)
    # filter to items which are duplicated
    filtered = data[
        # matching m and c values
        (np.isfinite(data.m) & data.duplicated(['m', 'c'], keep=False)) |
        # infinite m and x equal (straight line up)
        (np.isposinf(data.m) & data.duplicated(['m', 'x1'], keep=False))
    ]
    return filtered

points = [(0, 0), (1, 1), (2, 2)]
print(get_matches(points))

random.seed(1)
count = 500
random_points = [(round(random.random(), 3), round(random.random(), 3)) for i in range(count)]
results = get_matches(random_points)
print(results)

print('\nPerformance with increasing points')
for i in [i ** 2 for i in range(5, 101, 5)]:
    random.seed(1)
    random_points = [(round(random.random(), 3), round(random.random(), 3)) for i in range(i)]
    start = time.perf_counter()
    results = get_matches(random_points)
    stop = time.perf_counter()
    print(f'{i:<9}{stop - start:03f}')
returns:
x1 y1 x2 y2 m c
0 0 0 1 1 1.0 0.0
1 0 0 2 2 1.0 0.0
2 1 1 2 2 1.0 0.0
x1 y1 x2 y2 m c
12243 0.606 0.262 0.400 0.880 -3.0 2.080
12244 0.606 0.262 0.440 0.760 -3.0 2.080
12251 0.378 0.970 0.506 0.586 -3.0 2.104
12252 0.505 0.589 0.378 0.970 -3.0 2.104
12253 0.505 0.589 0.506 0.586 -3.0 2.104
... ... ... ... ... ... ...
124741 0.971 0.382 0.971 0.716 inf -inf
124742 0.971 0.543 0.971 0.716 inf -inf
124744 0.983 0.593 0.983 0.296 inf -inf
124745 0.983 0.593 0.983 0.448 inf -inf
124746 0.983 0.296 0.983 0.448 inf -inf
[237 rows x 6 columns]
Performance with increasing points
25 0.010577
100 0.016897
225 0.045443
400 0.136834
625 0.338148
900 0.765913
1225 1.525819
1600 2.645753
2025 4.834811
2500 8.112012
3025 12.960043
3600 18.262522
4225 27.221498
4900 37.329662
5625 53.064736
6400 67.325213
7225 84.843119
8100 116.864120
9025 140.131420
10000 171.630961
As one of the comments pointed out earlier, the order of growth of the problem is approximately N^2, because it looks at all combinations of points, so the performance degrades very quickly with increasing numbers of points. Note that you could use this relationship to estimate how long your program would take to run if you know the number of points.
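Regarding the "accepted error" mentioned above: one simple option, shown here as a standalone sketch rather than part of the answer's function, is to round m and c before the duplicate check so that nearly-collinear pairs are grouped together. The decimals value is a hypothetical tolerance you would tune for your data.
import pandas as pd

# toy coefficients with floating point noise
data = pd.DataFrame({'m': [0.3333333, 0.3333334, 2.0],
                     'c': [1.0000001, 0.9999999, 5.0]})

decimals = 5  # hypothetical tolerance; tune for your data
data[['m', 'c']] = data[['m', 'c']].round(decimals)

# nearly-equal (m, c) pairs are now detected as duplicates
print(data.duplicated(['m', 'c'], keep=False))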

Montecarlo continuation of multicolumn pandas timeseries

I have a bunch of data points in a timeseries in a pandas dataframe. The columns are supposedly independent of each other. I want to create a Monte Carlo process to calculate expected values for each of the columns. For that, my expectation is that the underlying data follows a Brownian motion pattern, so I'd need to generate a normal distribution over the differences between points in time space.
I transform my data like this:
diffs = (data.diff() / data.shift(1))
This is what I have at the moment:
data = diffs.describe()
This gives the following output:
A B C
count 4986.000000 4963.000000 1861.000000
mean 0.000285 0.000109 0.000421
std 0.015759 0.015426 0.014676
...
I process it like this to generate more samples:
import numpy as np
desired_samples = 1000
random = np.random.default_rng().normal(loc=[data.loc[["mean"]].to_numpy()], scale=[data.loc[["std"]].to_numpy()], size=[len(data.columns), desired_samples])
However this gives me an error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (441, 1000) and arg 1 with shape (1, 1, 441).
What I'd want is just a matrix of random values whose columns have the same std and mean as the sample's columns, i.e. such that when I do random.describe(), I'd get something like:
A B C
count 1000.0 1000.0 1000.0
mean 0.000285 0.000109 0.000421
std 0.015759 0.015426 0.014676
...
What'd be the correct way to generate those samples?
You could use apply() to create a data frame of random normal values using the attributes of the associated columns.
Generate Test Data
import numpy as np
import pandas as pd

nv = 50
d = {'A': np.random.normal(1, 1, nv), 'B': np.random.normal(2, 2, nv), 'C': np.random.normal(3, 3, nv)}
df = pd.DataFrame(d)
print(df)
A B C
0 0.276252 -2.833479 5.746740
1 1.562030 1.497242 2.557416
2 0.883105 -0.861824 3.106192
3 0.352372 0.014653 4.006219
4 1.475524 3.151062 -1.392998
5 2.011649 -2.289844 4.371251
6 3.230964 3.578058 0.610422
7 0.366506 3.391327 0.812932
8 1.669673 -1.021665 4.262500
9 1.835547 4.292063 6.983015
10 1.768208 4.029970 3.971751
...
45 0.501706 0.926860 7.008008
46 1.759266 -0.215047 4.560403
47 1.899167 0.690204 -0.538415
48 1.460267 1.506934 1.306303
49 1.641662 1.066182 0.049233
df.describe()
A B C
count 50.000000 50.000000 50.000000
mean 0.962083 1.522234 2.992492
std 1.073733 1.848754 2.838976
Generate Random Values with same approx (calculated) Mean and STD
mat = df.apply(lambda x: np.random.normal(x.mean(),x.std(),100))
print(mat)
A B C
0 0.234955 2.201961 1.910073
1 1.973203 3.528576 5.925673
2 -0.858201 2.234295 1.741338
3 2.245650 2.805498 0.135784
4 1.913691 2.134813 2.246989
.. ... ... ...
95 2.996207 2.248727 2.792658
96 0.663609 4.533541 1.518872
97 0.848259 -0.348086 2.271724
98 3.672370 1.706185 -0.862440
99 0.392051 0.832358 -0.354981
[100 rows x 3 columns]
mat.describe()
A B C
count 100.000000 100.000000 100.000000
mean 0.877725 1.332039 2.673327
std 1.148153 1.749699 2.447532
If you want the matrix as a numpy array:
mat.to_numpy()
array([[ 0.78881292, 3.09428714, -1.22757096],
[ 0.13044099, -1.02564025, 2.6566989 ],
[ 0.06090083, 1.50629474, 3.61487469],
[ 0.71418932, 1.88441111, 5.84979454],
[ 2.34287411, 2.58478867, -4.04433653],
[ 1.41846256, 0.36414635, 8.47482082],
[ 0.46765842, 1.37188986, 3.28011085],
[ 0.87433273, 3.45735286, 1.13351138],
[ 1.59029413, 4.0227165 , 3.58282534],
[ 2.23663894, 2.75007385, -0.36242541],
[ 1.80967311, 1.29206572, 1.73277577],
[ 1.20787923, 2.75529187, 4.64721489],
[ 2.33466341, 6.43830387, 4.31354348],
[ 0.87379125, 3.00658046, 4.94270155],
etc ...
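If you prefer to stay in plain numpy and fix the shape mismatch from the question directly, a minimal sketch (assuming data is the diffs.describe() frame from the question) is to pass 1-D loc/scale arrays and let size broadcast column-wise:
import numpy as np
import pandas as pd

desired_samples = 1000
rng = np.random.default_rng()

# one mean/std per column; shape (n_columns,)
means = data.loc['mean'].to_numpy()
stds = data.loc['std'].to_numpy()

# the last axis of size must match the length of loc/scale for broadcasting
samples = rng.normal(loc=means, scale=stds,
                     size=(desired_samples, len(data.columns)))

random_df = pd.DataFrame(samples, columns=data.columns)
print(random_df.describe())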

How to add random points in between the given points?

I have data points in a dataframe, just like the ones represented in figure 1.
sample data
df=
74 34
74.5 34.5
75 34.5
75 34
74.5 34
74 34.5
76 34
76 34.5
74.5 34
74 34.5
75.5 34.5
75.5 34
75 34
75 34.5
I want to add random points in between those points but keep the shape of the initial points.
Desired output will be somewhat like figure 2 (the black dots represent the random points and the red line represents the boundary).
Any suggestions? I am looking for a general solution, since the geometry of the outer boundary will change from problem to problem.
interpolation might be worth looking into:
import numpy as np
# lets suppose y = 2x and x[i], y[i] is a data point
x = [1, 5, 16, 20, 60]
y = [2, 10, 32, 40, 120]
interp_here = [7, 8, 9, 21] # the points where we are going to interpolate values.
print(np.interp(interp_here, x, y)) ## [14. 16. 18. 42.]
If you want random points, then you could use the above as a guide line and then for each point add/subtract some delta.
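A minimal sketch of that suggestion, reusing the interpolation example above; the delta value is a hypothetical jitter size you would tune so the overall shape is preserved:
import numpy as np

# guide line data from the interpolation example above
x = [1, 5, 16, 20, 60]
y = [2, 10, 32, 40, 120]
interp_here = [7, 8, 9, 21]

base = np.interp(interp_here, x, y)

# hypothetical jitter size; keep it small so points stay near the guide line
delta = 0.5
jittered = base + np.random.uniform(-delta, delta, size=base.shape)
print(jittered)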
If the shape is convex it is pretty simple:
def get_random_point(points):
point_selectors = np.random.randint(0, len(points), 2)
scale = np.random.rand()#value between 0 and 1
start_point = points[point_selectors[0]]
end_point = points[point_selectors[1]]
return start_point + (end_point - start_point) * scale
The shape you have specified is not convex. But without you additionally specifying which points make up the exterior of your shape, or adding constraints (e.g. that lines may only run parallel to the x or y axis), the shape you see is mathematically not sufficiently specified.
Final remark: There are algorithms which can check whether a point is within a polygon (Point in polygon).
You can then 1) specify the bounding polygon, 2) generate a point within the bounding rectangle of your shape, and 3) test whether the point lies within the polygon; a sketch of this follows below.
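A minimal sketch of that idea, assuming shapely is available; boundary_points is a hypothetical ordered list of exterior points that you would have to supply yourself:
import random
from shapely.geometry import Point, Polygon

def random_points_in_polygon(boundary_points, n):
    """Rejection sampling: draw points in the bounding box, keep those inside."""
    poly = Polygon(boundary_points)
    minx, miny, maxx, maxy = poly.bounds
    result = []
    while len(result) < n:
        p = Point(random.uniform(minx, maxx), random.uniform(miny, maxy))
        if poly.contains(p):
            result.append((p.x, p.y))
    return result

# hypothetical example: exterior points given in order
boundary_points = [(74, 34), (76, 34), (76, 34.5), (74, 34.5)]
print(random_points_in_polygon(boundary_points, 10))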

Normalization by min-max & standard deviation method to only certain columns using Python or R

I have a dataframe which has 37 variables and 50,000 rows. There are both categorical and numerical features. I would like to apply normalization to only certain columns in the dataframe.
Here is a fake dataset:
diagnosis gender area age weight score compactness class
447 1 95.88 50 117.66 674.8 80 0
167 0 109.3 65 118.8 886.3 35.6 2
444 0 117.5 80 160.85 990 64.2 2
100 0 88.05 35 94.98 582.7 35.23 1
227 1 97.45 40 15.51 684.5 70 1
I want to apply normalization only to area, weight, score, and compactness, for example. How should I do it? BTW, I found a standard deviation method from here, but it is meant for normalizing the whole dataset, and the code is:
# identify outliers with standard deviation
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import std
# calculate summary statistics
data_mean, data_std = mean(data), std(data)
# identify outliers
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
# identify outliers
outliers = [x for x in data if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))
# remove outliers
outliers_removed = [x for x in data if x >= lower and x <= upper]
print('Non-outlier observations: %d' % len(outliers_removed))
My question is: how can I apply normalization to only some columns in a dataframe? Thanks for your help in advance!
I actually have a written function for automatic normalization that I use. It is the following:
n <- function(x){
  d = dim(x)
  c = colMeans(x)
  xm = sapply(1:d[2], function(i){
    x[,i] = x[,i] - c[i]
  })
  # xm is the x with removed means
  v = var(xm) # variance matrix
  xn = sapply(1:d[2], function(i){
    xm[,i] = xm[,i] / sqrt(v[i,i])
  })
  xn
}
Then just apply this function to the desired columns.
tochange=c("age","weight","score")
df[,tochange]=n(df[,tochange])
> df
diagnosis gender area age weight score
[1,] 447 1 95.88 -0.2161373 0.3000106 -0.5282662
[2,] 167 0 109.30 0.5943775 0.3212536 0.7290858
[3,] 444 0 117.50 1.4048924 1.1048216 1.3455747
[4,] 100 0 88.05 -1.0266521 -0.1226130 -1.0757939
[5,] 227 1 97.45 -0.7564805 -1.6034728 -0.4706004
compactness class
[1,] 80.00 0
[2,] 35.60 2
[3,] 64.20 2
[4,] 35.23 1
[5,] 70.00 1

Speed up numpy.where for extracting integer segments?

I'm trying to work out how to speed up a Python function which uses numpy. The output I have received from lineprofiler is below, and this shows that the vast majority of the time is spent on the line ind_y, ind_x = np.where(seg_image == i).
seg_image is an integer array which is the result of segmenting an image, thus finding the pixels where seg_image == i extracts a specific segmented object. I am looping through lots of these objects (in the code below I'm just looping through 5 for testing, but I'll actually be looping through over 20,000), and it takes a long time to run!
Is there any way in which the np.where call can be sped up? Or, alternatively, can the penultimate line (which also takes a good proportion of the time) be sped up?
The ideal solution would be to run the code on the whole array at once, rather than looping, but I don't think this is possible as there are side-effects to some of the functions I need to run (for example, dilating a segmented object can make it 'collide' with the next region and thus give incorrect results later on).
Does anyone have any ideas?
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5 def correct_hot(hot_image, seg_image):
6 1 239810 239810.0 2.3 new_hot = hot_image.copy()
7 1 572966 572966.0 5.5 sign = np.zeros_like(hot_image) + 1
8 1 67565 67565.0 0.6 sign[:,:] = 1
9 1 1257867 1257867.0 12.1 sign[hot_image > 0] = -1
10
11 1 150 150.0 0.0 s_elem = np.ones((3, 3))
12
13 #for i in xrange(1,seg_image.max()+1):
14 6 57 9.5 0.0 for i in range(1,6):
15 5 6092775 1218555.0 58.5 ind_y, ind_x = np.where(seg_image == i)
16
17 # Get the average HOT value of the object (really simple!)
18 5 2408 481.6 0.0 obj_avg = hot_image[ind_y, ind_x].mean()
19
20 5 333 66.6 0.0 miny = np.min(ind_y)
21
22 5 162 32.4 0.0 minx = np.min(ind_x)
23
24
25 5 369 73.8 0.0 new_ind_x = ind_x - minx + 3
26 5 113 22.6 0.0 new_ind_y = ind_y - miny + 3
27
28 5 211 42.2 0.0 maxy = np.max(new_ind_y)
29 5 143 28.6 0.0 maxx = np.max(new_ind_x)
30
31 # 7 is + 1 to deal with the zero-based indexing, + 2 * 3 to deal with the 3 cell padding above
32 5 217 43.4 0.0 obj = np.zeros( (maxy+7, maxx+7) )
33
34 5 158 31.6 0.0 obj[new_ind_y, new_ind_x] = 1
35
36 5 2482 496.4 0.0 dilated = ndimage.binary_dilation(obj, s_elem)
37 5 1370 274.0 0.0 border = mahotas.borders(dilated)
38
39 5 122 24.4 0.0 border = np.logical_and(border, dilated)
40
41 5 355 71.0 0.0 border_ind_y, border_ind_x = np.where(border == 1)
42 5 136 27.2 0.0 border_ind_y = border_ind_y + miny - 3
43 5 123 24.6 0.0 border_ind_x = border_ind_x + minx - 3
44
45 5 645 129.0 0.0 border_avg = hot_image[border_ind_y, border_ind_x].mean()
46
47 5 2167729 433545.8 20.8 new_hot[seg_image == i] = (new_hot[ind_y, ind_x] + (sign[ind_y, ind_x] * np.abs(obj_avg - border_avg)))
48 5 10179 2035.8 0.1 print obj_avg, border_avg
49
50 1 4 4.0 0.0 return new_hot
EDIT I have left my original answer at the bottom for the record, but I have actually looked into your code in more detail over lunch, and I think that using np.where is a big mistake:
In [63]: a = np.random.randint(100, size=(1000, 1000))
In [64]: %timeit a == 42
1000 loops, best of 3: 950 us per loop
In [65]: %timeit np.where(a == 42)
100 loops, best of 3: 7.55 ms per loop
You could get a boolean array (that you can use for indexing) in 1/8 of the time you need to get the actual coordinates of the points!!!
There is of course the cropping of the features that you do, but ndimage has a find_objects function that returns enclosing slices, and appears to be very fast:
In [66]: %timeit ndimage.find_objects(a)
100 loops, best of 3: 11.5 ms per loop
This returns a list of tuples of slices enclosing all of your objects, in 50% more time than it takes to find the indices of one single object.
It may not work out of the box as I cannot test it right now, but I would restructure your code into something like the following:
import numpy as np
import mahotas
from scipy import ndimage

def correct_hot_bis(hot_image, seg_image):
    # Need this to not index out of bounds when computing border_avg
    hot_image_padded = np.pad(hot_image, 3, mode='constant',
                              constant_values=0)
    new_hot = hot_image.copy()
    sign = np.ones_like(hot_image, dtype=np.int8)
    sign[hot_image > 0] = -1
    s_elem = np.ones((3, 3))
    for j, slice_ in enumerate(ndimage.find_objects(seg_image)):
        hot_image_view = hot_image[slice_]
        seg_image_view = seg_image[slice_]
        new_shape = tuple(dim + 6 for dim in hot_image_view.shape)
        new_slice = tuple(slice(dim.start,
                                dim.stop + 6,
                                None) for dim in slice_)
        indices = seg_image_view == j + 1
        obj_avg = hot_image_view[indices].mean()
        obj = np.zeros(new_shape)
        obj[3:-3, 3:-3][indices] = True
        dilated = ndimage.binary_dilation(obj, s_elem)
        border = mahotas.borders(dilated)
        border &= dilated
        border_avg = hot_image_padded[new_slice][border == 1].mean()
        new_hot[slice_][indices] += (sign[slice_][indices] *
                                     np.abs(obj_avg - border_avg))
    return new_hot
You would still need to figure out the collisions, but you could get about a 2x speed-up by computing all the indices simultaneously using a np.unique based approach:
a = np.random.randint(100, size=(1000, 1000))

def get_pos(arr):
    pos = []
    for j in xrange(100):
        pos.append(np.where(arr == j))
    return pos

def get_pos_bis(arr):
    unq, flat_idx = np.unique(arr, return_inverse=True)
    pos = np.argsort(flat_idx)
    counts = np.bincount(flat_idx)
    cum_counts = np.cumsum(counts)
    multi_dim_idx = np.unravel_index(pos, arr.shape)
    return zip(*(np.split(coords, cum_counts) for coords in multi_dim_idx))
In [33]: %timeit get_pos(a)
1 loops, best of 3: 766 ms per loop
In [34]: %timeit get_pos_bis(a)
1 loops, best of 3: 388 ms per loop
Note that the pixels for each object are returned in a different order, so you can't simply compare the returns of both functions to assess equality. But they should both return the same.
One thing you could do to save a little bit of time is to cache the result of seg_image == i so that you don't need to compute it twice. You're computing it on lines 15 & 47; you could add seg_mask = seg_image == i and then reuse that result (it might also be good to separate out that piece for profiling purposes).
While there are some other minor things you could do to eke out a little bit of performance, the root issue is that you're using an O(M * N) algorithm, where M is the number of segments and N is the size of your image. It's not obvious to me from your code whether there is a faster algorithm to accomplish the same thing, but that's the first place I'd try to look for a speedup.
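A minimal sketch of that caching suggestion; the arrays here are random stand-ins just to make the snippet runnable, and the update line is a placeholder for the real computation in the profiled function:
import numpy as np

# stand-in arrays with the same shape as the original images
seg_image = np.random.randint(0, 6, size=(480, 640))
hot_image = np.random.rand(480, 640)
new_hot = hot_image.copy()

for i in range(1, seg_image.max() + 1):
    seg_mask = seg_image == i            # computed once ...
    ind_y, ind_x = np.where(seg_mask)    # ... and reused below
    obj_avg = hot_image[ind_y, ind_x].mean()
    # placeholder update; the real code would also use the border average
    new_hot[seg_mask] = hot_image[seg_mask] + obj_avg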
