Numpy Interpolation Between Points Within Array (scipy.griddata) - python

I have a numpy array of a fixed size holding irregularly spaced data. An example would be:
[1 0 0 0 3 0 0 0 2 0
0 1 0 0 0 0 0 0 2 0
0 1 0 0 1 0 6 0 9 0
0 0 0 0 6 0 3 0 0 1]
I want to keep the array the same shape, but have all the 0 values overwritten with data interpolated from the points that do have data. If the data points in the array are thought of as height values, this would essentially be creating a surface over the points.
I have been trying to use scipy.interpolate.griddata but am continually getting errors. I start with an array of my known data points, as [x, y, value]. For the above (first row only, for brevity):
data = [[0, 0, 1],
        [0, 3, 3],
        [0, 8, 2],
        ...]
I then define
points = (data[:,0], data[:,1])
values = (data[:,2])
Next, I define the points to sample at (in this case, the grid I desire)
grid = np.indices((4,10))
Finally, call griddata
t = interpolate.griddata(points, values, grid, method = 'linear')
This returns the following error
ValueError: number of dimensions in xi does not match x
Am I using the wrong function?
Thanks!

Solved: You need to pass the desired sample points as a tuple of coordinate arrays
t = interpolate.griddata(points, values, (grid[0,:,:], grid[1,:,:]), method = 'linear')
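For reference, a minimal end-to-end sketch that runs without the error (the sample points are taken from rows 0 and 3 of the array above, so the values are illustrative only):

import numpy as np
from scipy import interpolate

# Known data points as rows of [x, y, value]
data = np.array([[0, 0, 1], [0, 4, 3], [0, 8, 2],
                 [3, 4, 6], [3, 6, 3], [3, 9, 1]])
points = (data[:, 0], data[:, 1])
values = data[:, 2]

# np.indices((4, 10)) has shape (2, 4, 10): one coordinate array per dimension
grid = np.indices((4, 10))

# Pass the coordinate arrays as a tuple, one array per dimension
t = interpolate.griddata(points, values, (grid[0], grid[1]), method='linear')

Note that with method='linear', grid points outside the convex hull of the inputs come back as NaN; the fill_value argument or method='nearest' can cover those.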

Related

Removing rows and columns if all zeros in non-diagonal entries

I am generating a confusion matrix to get an idea of my text classifier's predictions vs ground truth. The purpose is to understand which intents are being predicted as other intents. But the problem is that I have too many classes (more than 160), so the matrix is sparse and most of the fields are zeros. Obviously, the diagonal elements are likely to be non-zero, as they indicate correct predictions.
That being the case, I want to generate a simpler version of it, since we only care about non-zero elements off the diagonal. Hence, I want to remove the rows and columns where all the elements are zeros (ignoring the diagonal entries), so that the plot becomes much smaller and more manageable to view. How do I do that?
Following is the code snippet that I have done so far, it will produce mapping for all the intents i.e, (#intent, #intent) dimensional plot.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import DataFrame
import seaborn as sns
%matplotlib inline
sns.set(rc={'figure.figsize':(64,64)})
confusion_matrix = pd.crosstab(df['ground_truth_intent_name'], df['predicted_intent_name'])
variables = sorted(list(set(df['ground_truth_intent_name'])))
temp = DataFrame(confusion_matrix, index=variables, columns=variables)
sns.heatmap(temp, annot=True)
TL;DR
Here temp is a pandas dataframe. I need to remove all rows and columns where all elements are zeros (ignoring the diagonal elements, even if they are not zero).
You can use any on the boolean comparison, but first you need to fill the diagonal with False:
# also consider using
# a = np.isclose(confusion_matrix.to_numpy(), 0)
a = confusion_matrix.to_numpy() != 0
# fill diagonal
np.fill_diagonal(a, False)
# columns with at least one non-zero
cols = a.any(axis=0)
# rows with at least one non-zero
rows = a.any(axis=1)
# boolean indexing
confusion_matrix.loc[rows, cols]
Let's take an example:
# random data
np.random.seed(1)
# example data like the confusion matrix above
a = np.random.randint(0,2, (5,5))
a[2] = 0
a[:-1,-1] = 0
confusion_matrix = pd.DataFrame(a)
So the data would be:
0 1 2 3 4
0 1 1 0 0 0
1 1 1 1 1 0
2 0 0 0 0 0
3 0 0 1 0 0
4 0 1 0 0 1
and the code outputs (notice that row 2 and column 4 are gone):
0 1 2 3
0 1 1 0 0
1 1 1 1 1
3 0 0 1 0
4 0 1 0 0
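Wrapped up as a small helper for reuse (a sketch; the function name is just illustrative, and it assumes a square matrix whose rows and columns are in the same order):

import numpy as np
import pandas as pd

def drop_all_zero_off_diag(df):
    # Boolean mask of non-zero entries, with the diagonal ignored
    a = df.to_numpy() != 0
    np.fill_diagonal(a, False)
    # Keep rows/columns that have at least one off-diagonal non-zero
    return df.loc[a.any(axis=1), a.any(axis=0)]

smaller = drop_all_zero_off_diag(confusion_matrix)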

Create a matrix for datapoints in same or different clusters

I want to iterate through my datapoints and check whether they are in the same cluster, after using KMeans to cluster them.
And then I need to create a matrix for all the datapoints, with 1 if two points belong to the same cluster, and 0 if they don't.
After using Kmeans, I'm not sure how to retrieve which cluster every datapoint belongs to so I can create such matrix.
Do I do that using the labels_ attribute?
k_means = KMeans(n_clusters=5).fit(X)
labels_columns = k_means.labels_
labels_row = k_means.labels_
for row in labels_row:
    for column in labels_columns:
        if row == column:
            ...  # add 1 in matrix position
        else:
            ...  # add 0 in matrix position
How do I best create this matrix? Or does labels_ provide different information from what I understand?
Any help is appreciated!
You are on the right track. kmeans.labels_ returns a vector of n elements which tells you the cluster each point belongs to: [3, 4, 10, ...] means point 0 belongs to cluster 3, point 1 belongs to cluster 4, and so on.
You can build the matrix you want in many ways. One possibility which is a bit more elegant than two for loops would be the following:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
n_samples, n_features = 10, 2
X, y = make_blobs(n_samples, n_features)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
kmeans = KMeans(n_clusters=3).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.show()
# Repeat the label vector n_samples times: row i of the result is labels[i] everywhere
repeat_labels = np.repeat(kmeans.labels_, n_samples).reshape(n_samples, n_samples)
print(kmeans.labels_)
print(repeat_labels)
# Two points are in the same cluster exactly where the matrix equals its transpose
proximity_matrix = (repeat_labels == repeat_labels.T).astype(int)
print(proximity_matrix)
I use the vector of labels as my starting point. Let's say that it is the following:
[1 0 0 1 1 2 2 2 2 0]
I transform it into a 2D matrix with np.repeat, which looks like the following:
[[1 1 1 1 1 1 1 1 1 1]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[1 1 1 1 1 1 1 1 1 1]
.....
So I repeat the labels as many times as there are points, n. Then I can just check where this matrix and its transpose are equal. That is true only where two points belong to the same cluster:
[[1 0 0 1 1 0 0 0 0 0]
[0 1 1 0 0 0 0 0 0 1]
[0 1 1 0 0 0 0 0 0 1]
[1 0 0 1 1 0 0 0 0 0]
.....
I cast the matrix to int, but mind you that the original output is actually a boolean array.
I left the print statements and the plots in the code to hopefully make it more clear.
Hope it helps!
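As an aside, NumPy broadcasting gives the same proximity matrix without the repeat/reshape step; a minimal sketch:

import numpy as np

labels = np.array([1, 0, 0, 1, 1, 2, 2, 2, 2, 0])  # e.g. kmeans.labels_

# Compare a column vector of labels with a row vector of labels;
# broadcasting yields the (n, n) co-membership matrix directly
proximity_matrix = (labels[:, None] == labels[None, :]).astype(int)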

How can I optimize searching and matching through multi-dimensional arrays?

I'm trying to match up the elements in 2 different arrays. Array_A is a 3d map of A_Clouds, Array_B is a 3d map of B_Clouds. Each "cloud" is continuous, i.e. any isolated pixels would define a new cloud. The values of the pixels are a single, unique integer for each cloud. Non-cloud values are 0. Here's a 2D example:
[[0 0 0 0 0 0 0 0 0]
[0 0 0 1 1 1 0 0 0]
[0 0 1 1 1 1 1 1 0]
[0 0 0 1 1 1 1 1 0]
[0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0]]
The output I need is simply the IDs (for both clouds) of each A_Cloud which is overlapping with a B_Cloud, and the number (locations not needed) of pixels which are overlapping between those clouds.
The problem is that these are both very large 3 dimensional arrays (~2000x2000x200, both are the same size). I'm basically doing a bunch of nested for loops, which is of course very slow. Is there a faster way that I could approach this problem? Thanks in advance.
This is what I have right now (simplified to 2d):
import collections

final_matches = []
for Acloud_id in ACloud_list:
    Acloud_locs = list(set([(i, j) for j, line in enumerate(Array_A)
                            for i, pix in enumerate(line) if pix == Acloud_id]))
    matches = []
    for loc in Acloud_locs:
        Bcloud_pix = Array_B[loc[0]][loc[1]]
        if Bcloud_pix:
            matches.append(Bcloud_pix)
    counter = collections.Counter(matches)
    final_matches.append([Acloud_id, counter])
Some considerations here:
for Acloud_id in ACloud_list:
    Acloud_locs = list(set([(i, j) for j, line in enumerate(Array_A)
                            for i, pix in enumerate(line) if pix == Acloud_id]))
If I've read that right, this needs to check every pixel in the array in order to generate the set, and it repeats that for every cloud in A. So if you have 500 clouds, you're checking every pixel 500 times. This is not going to scale well!
Might be more efficient to store the overlap counts in a dict, and just go through the arrays once:
overlaps = dict()
for i in possible_x_coords:  # define these however you like
    for j in possible_y_coords:
        if Array_A[i][j] and Array_B[i][j]:
            key = (Array_A[i][j], Array_B[i][j])
            overlaps[key] = 1 + overlaps.get(key, 0)
(apologies for any errors, I'm on the road and can't test my code)
update: You've clarified that the arrays are about 80% sparse. If that figure were a lot higher, and if you had control over the format of your inputs, I'd suggest looking into sparse array formats: if your input only stores the non-zero values for A, this can save you the trouble of checking for zero values in A. However, for something that's only 80% sparse, I'm not sure how much efficiency this would add.
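If the inputs are NumPy arrays, the single pass can also be vectorised outright; here is a sketch using np.unique on the overlapping ID pairs (the tiny arrays are stand-ins for the real ~2000x2000x200 data):

import numpy as np

# Small stand-ins for the real cloud maps
Array_A = np.array([[0, 1, 1], [0, 1, 0], [2, 2, 0]])
Array_B = np.array([[0, 3, 0], [0, 3, 3], [4, 0, 0]])

# Positions where both arrays have a cloud
mask = (Array_A != 0) & (Array_B != 0)

# Stack the (A-id, B-id) pairs at those positions and count duplicates
pairs, counts = np.unique(
    np.stack([Array_A[mask], Array_B[mask]]), axis=1, return_counts=True
)
for (a_id, b_id), n in zip(pairs.T, counts):
    print(a_id, b_id, n)  # A cloud id, B cloud id, overlapping pixel count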

Efficient way of finding rectangle coordinates in 0-1 arrays

Say I have an MxN matrix of 0's and 1's. It may or may not be sparse.
I want a function to efficiently find rectangles in the array, where by rectangle I mean:
a set of 4 elements that are all 1's and form the 4 corners of a rectangle, such that the sides of the rectangle are orthogonal to the array axes. In other words, a rectangle is a set of 4 1's elements with coordinates [row index, column index] like so: [r1,c1], [r1,c2], [r2,c2], [r2,c1].
E.g. this setup has one rectangle:
0 0 0 1 0 1 0
0 0 0 0 0 0 0
0 1 0 0 0 0 0
1 0 0 1 0 1 0
0 0 0 0 0 0 0
0 0 0 1 0 0 1
For a given MxN array, I want a Python function F(A) that returns an array L of subarrays, where each subarray is the coordinate pair of the corner of a rectangle (and includes all of the 4 corners of the rectangle). For the case where the same element of the array is the corner of multiple rectangles, it's ok to duplicate those coordinates.
My thinking so far is:
1) find the coordinates of the apex of each right triangle in the array
2) check each right triangle apex coordinate to see if it is part of a rectangle
Step 1) can be achieved by finding those elements that are 1's and are in a column with a column sum >=2, and in a row with a row sum >=2.
Step 2) would then iterate through each coordinate determined to be the apex of a right triangle. For a given right-triangle coordinate pair, it would iterate through that column, looking at every other right-triangle coordinate from 1) that is in that column. For any pair of two right-triangle points in a column, it would then check which of the two rows has the smaller row sum, to know which row is faster to iterate through. It would then iterate through the right-triangle coordinates in that row and see if the other row also has a right-triangle point in the same column. If it does, those 4 points form a rectangle.
I think this will work, but there will be repetition, and overall this procedure seems like it would be reasonably computationally intensive. What are some better ways for detecting rectangle corners in 0-1 arrays?
This is off the top of my head, written during a 5-hour layover at LAX. Following is my algorithm:
Step 1: Search all rows for at least two ones
| 0 0 0 1 0 1 0
| 0 0 0 0 0 0 0
| 0 1 0 0 0 0 0
\|/ 1 0 0 1 0 1 0
0 0 0 0 0 0 0
0 0 0 1 0 0 1
Output:
-> 0 0 0 1 0 1 0
0 0 0 0 0 0 0
0 1 0 0 0 0 0
-> 1 0 0 1 0 1 0
0 0 0 0 0 0 0
-> 0 0 0 1 0 0 1
Step 2: For each pair of ones in a row, get the indices of the ones in the columns corresponding to that pair; let's say for the first row:
-> 0 0 0 1 0 1 0
you check for ones in the following columns:
| |
\|/ \|/
0 0 0 1 0 1 0
0 0 0 0 0 0 0
0 1 0 0 0 0 0
1 0 0 1 0 1 0
0 0 0 0 0 0 0
0 0 0 1 0 0 1
Step 3: If both indices match, return the indices of all four corners. These are easy to access, as you know the row and column of the ones at every step. In our case the searches at columns 3 and 5 are both going to return 3, assuming you start indexing from 0. So we get the indices of the following:
0 0 0 ->1 0 ->1 0
0 0 0 0 0 0 0
0 1 0 0 0 0 0
1 0 0 ->1 0 ->1 0
0 0 0 0 0 0 0
0 0 0 1 0 0 1
Step 4: Repeat for all pairs
Algorithm Complexity
I know you need to search columns * rows * number of pairs, but you can always use hashmaps to make the lookups O(1), which bounds the overall complexity by the number of pairs. Please feel free to comment with any questions.
Here's a Python implementation similar to PseudoAj's solution. It processes the rows from the top while constructing a dict where keys are x coordinates and values are sets of the respective y coordinates.
For every row, the following steps are done:
Generate a list of x-coordinates with 1s from the current row
If the length of the list is less than 2, move to the next row
Iterate over all coordinate pairs left, right where left < right
For every coordinate pair, take the intersection from the dict containing the processed rows
For every y coordinate in the intersection, add a rectangle to the result
Finally, update the dict with the coordinates from the current row
Code:
from collections import defaultdict
from itertools import combinations

arr = [
    [0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1]
]

# List of corner coords
result = []
# Dict {x: set(y1, y2, ...)} of 1s in processed rows
d = defaultdict(set)

for y, row in enumerate(arr):
    # Find indexes of 1s in the current row
    coords = [i for i, x in enumerate(row) if x]
    # Move to next row if there are fewer than two points
    if len(coords) < 2:
        continue
    # For every pair on this row find all pairs on previous rows
    for left, right in combinations(coords, 2):
        for top in d[left] & d[right]:
            result.append(((top, left), (top, right), (y, left), (y, right)))
    # Add coordinates of this row to processed rows
    for x in coords:
        d[x].add(y)

print(result)
Output:
[((0, 3), (0, 5), (3, 3), (3, 5))]

Counting of adjacent cells in a numpy array

It's past midnight and maybe someone has an idea how to tackle a problem of mine. I want to count the number of adjacent cells (meaning the number of array fields with other values, e.g. zeros, in the vicinity of array values) as a sum over all valid values.
Example:
import numpy
from scipy import ndimage

s = ndimage.generate_binary_structure(2,2)  # Structure can vary
a = numpy.zeros((6,6), dtype=int)           # Example array
a[2:4, 2:4] = 1; a[2,4] = 1                 # with example value structure
print(a)
[[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 1 1 1 0]
[0 0 1 1 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]]
# The value at position [2,4] is surrounded by 6 zeros, while the one at
# position [2,2] has 5 zeros in the vicinity if 's' is the assumed binary structure.
# Total sum of surrounding zeroes is therefore sum(5+4+6+4+5) == 24
How can I count the number of zeros in this way if the structure of my values varies?
I believe I must make use of SciPy's binary_dilation function, which is able to enlarge the value structure, but simple counting of overlaps can't lead me to the correct sum, or can it?
print(ndimage.binary_dilation(a, s).astype(a.dtype))
[[0 0 0 0 0 0]
[0 1 1 1 1 1]
[0 1 1 1 1 1]
[0 1 1 1 1 1]
[0 1 1 1 1 0]
[0 0 0 0 0 0]]
Use a convolution to count neighbours:
import numpy
import scipy.signal

a = numpy.zeros((6,6), dtype=int)  # Example array
a[2:4, 2:4] = 1; a[2,4] = 1        # with example value structure
b = 1 - a
c = scipy.signal.convolve2d(b, numpy.ones((3,3)), mode='same')
print(numpy.sum(c * a))
b = 1-a allows us to count each zero while ignoring the ones.
We convolve with a 3x3 all-ones kernel, which sets each element to the sum of it and its 8 neighbouring values (other kernels are possible, such as the + kernel for only orthogonally adjacent values). With these summed values, we mask off the zeros in the original input (since we don't care about their neighbours), and sum over the whole array.
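If the structuring element varies, as the question asks, one option is to use the structure itself as the convolution kernel; a sketch with scipy.ndimage.convolve, reusing s and a from the question:

import numpy
from scipy import ndimage

s = ndimage.generate_binary_structure(2,2)  # the neighbourhood; can vary
a = numpy.zeros((6,6), dtype=int)
a[2:4, 2:4] = 1; a[2,4] = 1

b = 1 - a  # 1 wherever the original array is 0
# Convolve with the structure itself; the centre cell contributes nothing,
# since b is 0 wherever a is 1
c = ndimage.convolve(b, s.astype(int), mode='constant', cval=0)
print(numpy.sum(c * a))  # 24 for the example data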
I think you already got it. After dilation, the number of 1s is 19; minus the 5 of the starting shape, you have 14, which is the number of distinct zeros surrounding your shape. Your total of 24 counts overlaps.
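A quick sketch of that distinction, reusing a and s from above: dilation counts each surrounding zero once, while the convolution above counts it once per neighbouring one:

from scipy import ndimage

# Distinct zeros touching the shape: 19 ones after dilation minus the 5 originals
print(ndimage.binary_dilation(a, s).sum() - a.sum())  # 14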
