Comparing 2 arrays for tolerance - python

What I am trying to do is take an array, transpose it, subtract the two arrays, and then see whether the difference in each cell is within a certain tolerance. I am able to get the subtracted array, but I don't know how to cycle through each item to compare the amounts. Ideally I would test for floating-point near-equality and return true if all items are within the tolerance and false otherwise. I am not sure how to do this last step either.
import numpy as np
a = np.array([[1, 2, 3], [2, 3, 8], [3, 4, 1]])
b = a.transpose(1, 0)
rows = a.shape[1]
col = a.shape[0]
r = abs(np.subtract(a, b)) # absolute difference of the two arrays
i = 0
while i < rows:
    j = 0
    while j < rows:
        if np.any(r[i][j]) > 3: # sample using 3 as tolerance
            print("false")
        j += 1
    print("true")
    i += 1

Is this not sufficient for your needs?
tolerance = 3
result = (abs(a - b) <= tolerance).all()

In this step
r = abs(np.subtract(a, b))
you already have a matrix of distances, so all you need to do is apply a comparison operator (which NumPy applies element-wise)
errors = r > 3
which results in a boolean array. If you want to see how many elements are True, just sum it
print( np.sum(r > 3) )
and to check if any is wrong, you can just do
print( np.sum(r > 3) > 0 ) # prints True iff any element of r is bigger than 3
There are also built-in methods, but this reasoning gives you more flexibility in expressing what is "near" or "good".
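One such built-in is np.allclose, which compares two arrays element-wise against an absolute and a relative tolerance. The following is a minimal sketch, assuming the tolerance of 3 from the question is meant as an absolute tolerance (hence rtol=0). The helper name within_tolerance is made up for illustration.

```python
import numpy as np

def within_tolerance(a, b, tol):
    """Return True if every |a - b| is at most tol (absolute tolerance only)."""
    # rtol=0 disables the relative term, so only atol applies.
    return bool(np.allclose(a, b, rtol=0, atol=tol))

a = np.array([[1, 2, 3], [2, 3, 8], [3, 4, 1]])
print(within_tolerance(a, a.T, 3))  # False, since |8 - 4| = 4 > 3
print(within_tolerance(a, a.T, 4))  # True
```

For real floating-point data you would probably want a small atol and possibly a non-zero rtol rather than an integer tolerance.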


Can NumPy help me quickly find the index at which an array's running sum is negative for the first time?

I need to do something like this
import numpy as np
a = np.random.rand(1000)
a -= .55
a[0] = 1
b = 0
for i in range(len(a)):
    b += a[i]
    if b < 0:
        print(i)
        break
a lot, and preferably it should be swift. Can NumPy help me with that? I understand that NumPy is built for vector calculation and not for this. Still, it can calculate the sum of an array blazingly fast - can I give it this specific condition (the sum is negative for the first time) to stop and tell me the index number?
You can use numpy.cumsum() and numpy.argmax(). First, compute the cumulative sum of the elements. Comparing it with 0 gives a boolean array that is True wherever the running sum is negative, and argmax() returns the index of the first True.
>>> (a.cumsum() < 0).argmax()
Compare both versions:
import numpy as np
a = np.random.rand(1000)
a -= .55
a[0] = 1
def check_1(a):
    b = 0
    for i in range(len(a)):
        b += a[i]
        if b < 0:
            print(i)
            break
def check_2(a):
    return (a.cumsum() < 0).argmax()
Output (the index depends on the random input):
>>> check_1(a)
6
>>> check_2(a)
6
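One caveat with the argmax() approach: on an all-False boolean array argmax() returns 0, so a running sum that never goes negative is indistinguishable from one that is negative at index 0. A small sketch that guards against this follows. The function name first_negative_prefix and the -1 sentinel are illustrative choices, not part of the original answer.

```python
import numpy as np

def first_negative_prefix(a):
    """Index where the running sum first drops below zero, or -1 if it never does."""
    negative = np.cumsum(a) < 0
    # argmax returns 0 for an all-False array, so check any() first.
    return int(negative.argmax()) if negative.any() else -1

print(first_negative_prefix(np.array([1.0, -0.5, -0.7, 0.3])))  # 2
print(first_negative_prefix(np.array([1.0, 2.0, 3.0])))         # -1
```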

Similarity Measure in Python

I am working on a coding challenge named Similarity Measure. My code works for some test cases but fails others with Time Limit Exceeded. The code is not wrong, but it takes more than 25 seconds for inputs of size 10^4.
I need to know what I can do to make it more efficient; I cannot think of any better solution than my code.
The question goes like this:
Given an array of positive integers, answer Q queries.
Query: given two indices L and R, determine the maximum absolute difference between the indices of two equal elements that lie between L and R.
If no two elements in the range are equal, return 0.
INPUT FORMAT
The first line contains N, no. of elements in the array A
The Second line contains N space separated integers that are elements of the array A
The third line contains Q the number of queries
Each of the Q lines contains L, R
CONSTRAINTS
1 <= N, Q <= 10^4
1 <= Ai <= 10^4
1 <= L, R <= N
OUTPUT FORMAT
For each query, print the ans in a new line
Sample Input
5
1 1 2 1 2
5
2 3
3 4
2 4
3 5
1 5
Sample Output
0
0
2
2
3
Explanation
[2,3] - No two elements are same
[3,4] - No two elements are same
[2,4] - there are two 1's so ans = |4-2| = 2
[3,5] - there are two 2's so ans = |5-3| = 2
[1,5] - there are three 1's and two 2's so ans = max(|4-2|, |5-3|, |4-1|, |2-1|) = 3
Here is my algorithm:
To take the input and test the range in a different method
Input will be L, R and the Array
For difference between L and R equal to 1, check if the next element is equal, return 1 else return 0
For difference more than 1, loop through array
Make a nested loop to check for the same element, if yes, store the difference into maxVal variable
Return maxVal
My Code:
def ansArray(L, R, arr):
    maxVal = 0
    if abs(R - L) == 1:
        if arr[L-1] == arr[R-1]: return 1
        else: return 0
    else:
        for i in range(L-1, R):
            for j in range(i+1, R):
                if arr[i] == arr[j]:
                    if (j-i) > maxVal: maxVal = j-i
    return maxVal
if __name__ == '__main__':
    input()
    arr = (input().split())
    for i in range(int(input())):
        L, R = input().split()
        print(ansArray(int(L), int(R), arr))
Please help me with this. I really want to learn a different and a more efficient way to solve this problem. Need to pass all the TEST CASES. :)
You can try this code:
import collections
def ansArray(L, R, arr):
    dct = collections.defaultdict(list)
    for index in range(L - 1, R):
        dct[arr[index]].append(index)
    return max(lst[-1] - lst[0] for lst in dct.values())
if __name__ == '__main__':
    input()
    arr = (input().split())
    for i in range(int(input())):
        L, R = input().split()
        print(ansArray(int(L), int(R), arr))
Explanation:
dct is a dictionary that keeps a list of indices for every number seen. Indices are appended in increasing order, so each list is sorted and lst[-1] - lst[0] gives the maximum absolute difference for that number. Applying max to all these differences gives the answer. The complexity is O(R - L) per query.
This can be solved in roughly O(N) per query in the following way:
from collections import defaultdict
def ansArray(L, R, arr):
    # collect the positions and save them into the dictionary
    # (L and R are 1-based and inclusive, so the slice is arr[L-1:R])
    positions = defaultdict(list)
    for i, j in enumerate(arr[L-1:R]):
        positions[j].append(i)
    # create the list of the max differences in index
    max_diff = list()
    for vals in positions.values():
        max_diff.append(max(vals) - min(vals))
    # now return the max element from the list we have just created
    if len(max_diff):
        return max(max_diff)
    else:
        return 0
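For comparison, here is a self-contained sketch of the same first-and-last-occurrence idea that stores only the first index of each value instead of a full list, checked against the sample input from the question. The function name max_same_distance is made up for illustration, and L and R are assumed to be 1-based and inclusive as in the problem statement.

```python
def max_same_distance(arr, L, R):
    """Largest index gap between two equal elements in arr[L-1:R] (1-based, inclusive)."""
    first_seen = {}
    best = 0
    for i in range(L - 1, R):
        value = arr[i]
        if value in first_seen:
            # The widest gap for a value always starts at its first occurrence.
            best = max(best, i - first_seen[value])
        else:
            first_seen[value] = i
    return best

arr = [1, 1, 2, 1, 2]
queries = [(2, 3), (3, 4), (2, 4), (3, 5), (1, 5)]
print([max_same_distance(arr, L, R) for L, R in queries])  # [0, 0, 2, 2, 3]
```

Each query is still linear in R - L, so this only removes the nested loop; it does not precompute anything across queries.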

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() python dbscan 3 dimensions point

I want to do clustering using the DBSCAN algorithm with a dataset that contains 3-dimensional points. This is the dataset:
1 5 7
12 8 9
2 4 10
6 3 21
11 13 0
6 3 21
11 13 0
3 7 1
1 9 2
1 5 7
I do clustering with this code :
from math import sqrt, pow
def __init__(eps=0.1, min_points=2):
    eps = 10
    min_points = 2
    visited = []
    noise = []
    clusters = []
    dp = []
def cluster(data_points):
    visited = []
    dp = data_points
    c = 0
    for point in data_points:
        if point not in visited:
            visited.append(point)
            print(point)
            neighbours = region_query(point)
            #print neighbours
            if len(neighbours) < min_points:
                noise.append(point)
            else:
                c += 1
                expand_cluster(c, neighbours)
#cluster(data_points)
def expand_cluster(cluster_number, p_neighbours):
    cluster = ("Cluster: %d" % cluster_number, [])
    clusters.append(cluster)
    new_points = p_neighbours
    while new_points:
        new_points = pool(cluster, new_points)
def region_query(p):
    result = []
    for d in dp:
        distance = (((d[0] - p[0])**2 + (d[1] - p[1])**2 + (d[2] - p[2])**2)**0.5)
        print(distance)
        if distance <= eps:
            result.append(d)
    return result
#p_neighbours = region_query(p=pcsv)
def pool(cluster, p_neighbours):
    new_neighbours = []
    for n in p_neighbours:
        if n not in visited:
            visited.append(n)
            n_neighbours = region_query(n)
            if len(n_neighbours) >= min_points:
                new_neighbours = unexplored(p_neighbours, n_neighbours)
        for c in clusters:
            if n not in c[1] and n not in cluster[1]:
                cluster[1].append(n)
    return new_neighbours
#staticmethod
def unexplored(x, y):
    z = []
    for p in y:
        if p not in x:
            z.append(p)
    return z
In this code, the point and n variables are elements of data_points, which contains the dataset. From reading the manual I think this code should work, but when I run the cluster() function there is an error.
Traceback (most recent call last):
File "<ipython-input-39-77eb6be20d82>", line 2, in <module>
if n not in visited:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I don't know why this code still gets that error, even when I replace the n or point variable with indexed data. Do you have any idea what's wrong with this code? How can I make it work?
Thank you for your help.
The error emerges from these lines:
if point not in visited:
visited.append(point)
The in operator calls list.__contains__, which iterates over the items in the visited list to see if any of them are equal to point. However, equality tests between numpy arrays do not yield a single Boolean value, but rather an array of bools representing the element-wise comparisons of the items in the arrays. For instance, the result of array([1, 2]) == array([1, 3]) is array([True, False]), not just False.
That's OK so far. Comparisons in Python are allowed to return whatever kind of object they want. However, when equality is being tested by in, it needs a Boolean result in the end, so bool is called on the result of the comparison. The exception you received comes from bool(array([...])), which as the message says, is ambiguous. Should bool(array([True, False])) be True or False? The library refuses to guess for you.
Unfortunately, I don't think there is a really good way to work around this. Perhaps you could convert your points to tuples before saving them in visited? As a nice side effect, this would let you use a set rather than a list (since tuples are hashable).
Another issue you may have is that equality testing between floats is inherently prone to inaccuracy. Two numbers that should be equal may not in fact be equal when compared using floats derived by different calculations. For instance, 0.1 + 0.2 == 0.3 is False because the rounding doesn't work out the same way on both sides of the equals sign. So, even if you have two points that should be equal, you may not be able to detect them in your data using only equality tests. You'd need to compute their difference and compare it to some small epsilon value, estimating the maximum error that could have grown out of your computations.
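To make the tuple suggestion concrete, here is a minimal sketch. It first reproduces the ValueError with a list of arrays, then tracks visited points as tuples in a set. The three sample points are taken from the question's dataset; the variable names are illustrative.

```python
import numpy as np

points = np.array([[1, 5, 7], [12, 8, 9], [1, 5, 7]])

# Membership tests on a list of arrays fail as soon as an element-wise
# comparison has to be reduced to a single bool.
try:
    points[1] in [points[0]]
    raised = False
except ValueError:
    raised = True

# Tuples compare as a whole and are hashable, so a set works.
visited = set()
unique_points = []
for p in points:
    key = tuple(p.tolist())
    if key not in visited:
        visited.add(key)
        unique_points.append(key)

print(raised)         # True
print(unique_points)  # [(1, 5, 7), (12, 8, 9)]
```

This relies on exact equality, so it suits integer coordinates like the ones above; for floats computed by different routes the epsilon comparison described earlier still applies.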
If you use numpy, you should use masks instead of lists:
import numpy
def cluster(data_points, eps=0.1, min_points=3):
    cluster_numbers = numpy.zeros(len(data_points), dtype=int)
    c = 0
    for idx, point in enumerate(data_points):
        if cluster_numbers[idx] == 0:
            print(point)
            neighbours = region_query(data_points, point, eps)
            #print neighbours
            if sum(neighbours) < min_points:
                # noise
                cluster_numbers[idx] = -1
            else:
                c += 1
                expand_cluster(c, data_points, cluster_numbers, neighbours, eps)
    return cluster_numbers
def region_query(points, point, eps=0.1):
    # boolean mask of all points within eps of the given point
    distance = ((points-point)**2).sum(axis=1) ** 0.5
    return distance <= eps
def expand_cluster(cluster_number, points, cluster_numbers, new_points, eps=0.1):
    while True:
        # unassigned points among the current candidates
        indices = numpy.where(new_points & (cluster_numbers==0))[0]
        if not len(indices):
            break
        new_points = False
        for idx in indices:
            cluster_numbers[idx] = cluster_number
            new_points = new_points | region_query(points, points[idx], eps)
What you get is an array of integers, one for each input point. Positions with the value -1 are noise points; 1 .. n are the different clusters.
So you can get the points for a cluster:
cluster_numbers = cluster(data_points)
noise_points = data_points[cluster_numbers == -1]
print("Total Clusters:", cluster_numbers.max())
for idx in range(1, cluster_numbers.max() + 1):
    cluster_points = data_points[cluster_numbers == idx]
    print("Cluster %d has %d points" % (idx, len(cluster_points)))

Return zero value if division by zero encountered

I have two lists a and b of equal length. I want to calculate the sum of their ratio:
c = np.sum(a/b)
How can I get a zero (0) for a term of the sum when there is a division by zero?
EDIT: Here are a couple of answers I tested for my case that still raise the error. Probably I am missing something. The array that contains zero elements is counts:
try:
    cnterr = (counts/np.mean(counts))*(((cnterr/counts)**2 + (meanerr/np.mean(counts))**2 ))**1/2
except ZeroDivisionError:
    cnterr = (counts/np.mean(counts))*(((meanerr/np.mean(counts))**2 ))**1/2
RuntimeWarning: divide by zero encountered in divide
cnterr = (counts/np.mean(counts))*(((cnterr/counts)**2 + (meanerr/np.mean(counts))**2 ))**1/2
And also by np.where():
cnterr = np.where(counts != 0, ((counts/np.mean(counts))*(((cnterr/counts)**2 + (meanerr/np.mean(counts))**2 ))**1/2), 0)
This raises the same warning.
To sum the ratios while skipping the divisions by 0:
sel = b != 0
c = np.sum(a[sel]/b[sel])
If the arrays are float, you may need to use
sel = np.bitwise_not(np.isclose(b, 0))
UPDATE
If a and b are not np.array, convert them first:
a = np.array(a)
b = np.array(b)
c = np.where(b != 0, a/b, 0).sum()
See: http://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
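Note that np.where(b != 0, a/b, 0) still evaluates a/b for every element before selecting, which is why the RuntimeWarning persists. A sketch that avoids the division entirely where b is zero uses the where and out arguments of np.divide. The function name safe_ratio_sum is made up for illustration, and exact zeros in b are assumed (swap in np.isclose if near-zero values should also be skipped).

```python
import numpy as np

def safe_ratio_sum(a, b):
    """Sum a/b element-wise, counting terms with b == 0 as 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Division only runs where b != 0; other slots keep the 0 from `out`.
    ratios = np.divide(a, b, out=np.zeros_like(a), where=b != 0)
    return ratios.sum()

print(safe_ratio_sum([1, 2, 3], [2, 0, 4]))  # 0.5 + 0 + 0.75 = 1.25
```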
This works, it puts a 0 in the list where there is a divide by zero:
c = np.sum([x/y if y else 0 for x,y in zip(a,b)])
Or, a variation on @mskimm's answer. Note, you first need to convert your input lists to numpy arrays:
a=np.array(a)
b=np.array(b)
c=np.sum(a[b!=0]/b[b!=0])
This should work.
c = []
for i, j in enumerate(a):
    if b[i] != 0:
        c += [j/b[i]]
    else:
        c += [0]
c = sum(c)
This is also simple, but note that it returns 0 for the whole sum as soon as any element of b is zero:
c = 0 if 0 in b else sum(a/b)

better algorithm for checking 5 in a row/col in a matrix

Is there a good algorithm for checking whether there are 5 same elements in a row, in a column, or diagonally, given a square matrix, say 6x6?
There is of course the naive algorithm of iterating through every spot and then, for each point in the matrix, iterating through that row, column, and diagonal. I am wondering if there is a better way of doing it.
You could keep a histogram in a dictionary (mapping element type -> int). And then you iterate over your row or column or diagonal, and increment histogram[element], and either check at the end to see if you have any 5s in the histogram, or if you can allow more than 5 copies, you can just stop once you've reached 5 for any element.
Simple, one-dimensional, example:
m = ['A', 'A', 'A', 'A', 'B', 'A']
h = {}
for x in m:
    if x in h:
        h[x] += 1
    else:
        h[x] = 1
print("Histogram:", h)
for k in h:
    if h[k]>=5:
        print("%s appears %d times." % (k,h[k]))
Output:
Histogram: {'A': 5, 'B': 1}
A appears 5 times.
Essentially, h[x] will store the number of times the element x appears in the array (in your case, this will be the current row, or column or diagonal). The elements don't have to appear consecutively, but the counts would be reset each time you start considering a new row/column/diagonal.
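The same tally can be written more compactly with collections.Counter from the standard library. This is a small sketch; the helper name elements_with_count is made up, and as above it counts occurrences within one strip without requiring them to be adjacent.

```python
from collections import Counter

def elements_with_count(strip, k):
    """Return the elements that occur at least k times in one row, column, or diagonal."""
    counts = Counter(strip)
    return [element for element, count in counts.items() if count >= k]

print(elements_with_count(['A', 'A', 'A', 'A', 'B', 'A'], 5))  # ['A']
```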
You can check whether there are k same elements in a matrix of integers in a single pass.
Suppose that n is the size of the matrix and m is the largest absolute value of any element. We have n columns, n rows, and 1 diagonal.
For each column, row, or diagonal we have at most n distinct elements.
Now we can create a histogram containing (n + n + 1) * (2 * m + 1) elements, representing
the rows, columns, and the diagonal, each of them with one slot per possible value.
size = (n + n + 1) * (2 * m + 1)
histogram = zeros(size, Int)
Now the tricky part is how to update this histogram.
Consider this function in pseudo-code:
updateHistogram(i, j, element)
    if (element < 0)
        element = m - element      // map negatives to m+1 .. 2m
    width = 2 * m + 1              // slots per row, column, or diagonal
    rowIndex = i * width + element
    columnIndex = (n + j) * width + element
    diagonalIndex = 2 * n * width + element
    histogram[rowIndex] = histogram[rowIndex] + 1
    histogram[columnIndex] = histogram[columnIndex] + 1
    if (i == j)
        histogram[diagonalIndex] = histogram[diagonalIndex] + 1
Now all you have to do is iterate through the histogram and check whether there is an entry >= k.
Your best approach may depend on whether you control the placement of elements.
For example, if you were building a game and just placed the most recent element on the grid, you could capture into four strings the vertical, horizontal, and diagonal strips that intersected that point, and use the same algorithm on each strip, tallying each element and evaluating the totals. The algorithm may be slightly different depending on whether you're counting five contiguous elements out of the six, or allow gaps as long as the total is five.
For rows you can keep a counter, which indicates how many of the same elements in a row you currently have. To do this, iterate through the row and
if current element matches the previous element, increase the counter by one. If counter is 5, then you have found the 5 elements you wanted.
if current element doesn't match previous element, set the counter to 1.
The same principle can be applied to columns and diagonals as well. You probably want to use array of counters for columns (one element for each column) and diagonals so you can iterate through the matrix once.
I wrote a small example for a smaller case, but you can easily change it:
n = 3
matrix = [[1, 2, 3, 4],
          [1, 2, 3, 1],
          [2, 3, 1, 3],
          [2, 1, 4, 2]]
col_counter = [1, 1, 1, 1]
for row in range(0, len(matrix)):
    row_counter = 1
    for col in range(0, len(matrix[row])):
        current_element = matrix[row][col]
        # check elements in a same row
        if col > 0:
            previous_element = matrix[row][col - 1]
            if current_element == previous_element:
                row_counter = row_counter + 1
                if row_counter == n:
                    print(n, 'in a row at:', row, col - n + 1)
            else:
                row_counter = 1
        # check elements in a same column
        if row > 0:
            previous_element = matrix[row - 1][col]
            if current_element == previous_element:
                col_counter[col] = col_counter[col] + 1
                if col_counter[col] == n:
                    print(n, 'in a column at:', row - n + 1, col)
            else:
                col_counter[col] = 1
I left out diagonals to keep the example short and simple, but for diagonals you can use the same principle as you use on columns. The previous element would be one of the following (depending on the direction of diagonal):
matrix[row - 1][col - 1]
matrix[row - 1][col + 1]
Note that the second case takes a little extra effort: for example, traverse the row in the inner loop from right to left.
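For readers who want all four directions in one place, here is a compact sketch. It deliberately swaps the running counters for a simpler direction-vector scan: from each cell it looks k - 1 steps right, down, down-right, and down-left. That is closer to the naive approach in cost, but it is short and covers both diagonals. The function name has_run is made up for illustration.

```python
def has_run(matrix, k):
    """True if k equal elements appear consecutively in a row, column, or diagonal."""
    rows, cols = len(matrix), len(matrix[0])
    # Directions: right, down, down-right, down-left.
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in directions:
                end_r, end_c = r + (k - 1) * dr, c + (k - 1) * dc
                # Skip runs that would leave the matrix.
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue
                if all(matrix[r + i * dr][c + i * dc] == matrix[r][c]
                       for i in range(1, k)):
                    return True
    return False

print(has_run([[1, 2, 3], [4, 1, 6], [7, 8, 1]], 3))  # True (main diagonal)
print(has_run([[1, 2], [3, 4]], 2))                   # False
```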
I don't think you can avoid iteration, but you can at least do an XOR of all elements and if the result of that is 0 => they are all equal, then you don't need to do any comparisons.
You can try to improve your method with some heuristics: use your knowledge of the matrix size to exclude element sequences that cannot fit, and stop unnecessary calculation early. If the given vector has size 6, you want to find 5 equal elements, and the first 3 elements are all different, further calculation makes no sense.
This approach can give you a significant advantage, if 5 equal elements in a row happen rarely enough.
If you code the rows/columns/diagonals as bitmaps, "five in a row" means "mask % 31== 0 && mask / 31 == power_of_two"
00011111 := 0x1f 31 (five in a row)
00111110 := 0x3e 62 (five in a row)
00111111 := 0x3f 63 (six in a row)
If you want to treat the six-in-a-row case also as five-in-a-row, the easiest way is probably to:
for ( ; mask && !(mask & 1) ; mask >>= 1 ) {;}
return ((mask & 0x1f) == 0x1f) ? 1 : 0;
Maybe the Stanford bit-tweaking department has a better solution or suggestion that does not need looping?
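One loop-free possibility is to AND the mask with four shifted copies of itself: a bit survives only if it and the four bits above it are all set. This is a sketch in Python rather than C, assuming the mask is a non-negative integer bitmap; the function name has_five_in_a_row is made up for illustration.

```python
def has_five_in_a_row(mask):
    """True if the integer bitmap contains five consecutive set bits."""
    # Bit i of the result is set only when bits i .. i+4 of mask are all set.
    return (mask & (mask >> 1) & (mask >> 2) & (mask >> 3) & (mask >> 4)) != 0

for mask in (0b011111, 0b111110, 0b111111, 0b101111):
    print(format(mask, '06b'), has_five_in_a_row(mask))
```

The first three masks report True and the last one False, and the same expression works for bitmaps wider than six bits.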
