Python integer and float multiplication error - python

The question seems trivial, but I cannot get it right. The output cm1 is expected to contain floats, but I only get zeros and ones.
import numpy as np
import scipy.spatial.distance
sim = scipy.spatial.distance.cosine
a = [2, 3, 1]
b = [3, 1, 2]
c = [1, 2, 6]
cm0 = np.array([a,b,c])
ca, cb, cc = 0.9, 0.7, 0.4
cr = np.array([ca, cb, cc])
cm1 = np.empty_like(cm0)
for i in range(3):
    for j in range(3):
        cm1[i,j] = cm0[i,j] * cr[i] * cr[j]
print(cm1)
And I get:
[[1 1 0]
[1 0 0]
[0 0 0]]

empty_like() matches the dtype of the given NumPy array by default, as hpaulj suggested in the comments. In your case cm0 has an integer dtype.
The empty_like function accepts several arguments though, one of which is dtype. Setting dtype to float should solve the problem:
cm1 = np.empty_like(cm0, dtype=float)
Also note that converting a float to an integer truncates it at the decimal point. In your case every product lies between 0.36 and 1.89, so truncation leaves only 0s and 1s.
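As an aside (not part of the original answer), once the array is float the double loop can also be written as a single outer product; a minimal sketch, assuming the same cm0 and cr as above:
import numpy as np

cm0 = np.array([[2, 3, 1], [3, 1, 2], [1, 2, 6]], dtype=float)
cr = np.array([0.9, 0.7, 0.4])

# cm1[i, j] = cm0[i, j] * cr[i] * cr[j], vectorized
cm1 = cm0 * np.outer(cr, cr)
print(cm1)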

As @hpaulj said in the comments, the problem is that empty_like keeps the dtype of cm0. To solve it, try:
cm1 = np.empty_like(cm0, dtype=float)

Related

How do I fix this weird behaviour in numpy arrays?

I am working on an implementation of a Gaussian elimination algorithm in Python using NumPy. While working on it, I have noticed some weird behaviour. Here is what I've done so far:
def gauss_elimination(a, b):
    n = len(b)
    print(a)
    for k in range(0, n-1):
        for i in range(k+1, n):
            lam = a[i, k] / a[k, k]
            a[i, k:n] = a[i, k:n] - (lam * a[k, k:n])
            b[i] = b[i] - lam * b[k]
    print(a)
    return b
With this code, considering the following arrays:
a = np.array([[4, -2, 1], [-2, 4, -2], [1, -2, 4]])
b = np.array([11, -16, 17])
The result will be:
array([ 11, -10, 10])
This is how the algorithm changed the array a:
array([[ 4, -2, 1],
[ 0, 3, -1],
[ 0, 0, 2]])
Which is wrong. For some reason, the value in the second row and third column is -1, when it should be -1.5. I've inserted some printing along the way to see what was actually happening, and for some reason NumPy is truncating the result. This is the changed code:
def gauss_elimination(a, b):
    n = len(b)
    print(a)
    for k in range(0, n-1):
        for i in range(k+1, n):
            lam = a[i, k] / a[k, k]
            print(a[i, k:n])
            print(lam * a[k, k:n])
            print(a[i, k:n] - lam * a[k, k:n])
            a[i, k:n] = a[i, k:n] - (lam * a[k, k:n])
            b[i] = b[i] - lam * b[k]
    print(a)
    return b
And considering the same arrays that were defined a while back, the results will be:
[[ 4 -2 1]
[-2 4 -2]
[ 1 -2 4]]
[ 0. 3. -1.5] # This shows that the value is being calculated as presumed
[[ 4 -2 1]
[ 0 3 -1] # But when the object a is updated, the value is -1 and not -1.5
[ 1 -2 4]]
[ 0. -1.5 3.75]
[[ 4 -2 1]
[ 0 3 -1]
[ 0 -1 3]]
[0. 2.6666666666666665]
[[ 4 -2 1]
[ 0 3 -1]
[ 0 0 2]]
I am a little bit confused. Perhaps I have made a mistake, but the printing shows that everything is being calculated as it should be. Any tips?
The problem
Your array's dtype is "int32":
>>> import numpy as np
>>> x = np.array([1, 2, 3])
>>> x[1] / x[2]
0.6666666666666666
>>> x[0] = x[1] / x[2]
>>> x
array([0, 2, 3])
>>> x.dtype
dtype('int32')
The "solution"
You can use dtype="float64" at instantiation to fix this issue:
>>> x = np.array([1, 2, 3], dtype="float64")
>>> x[0] = x[1] / x[2]
>>> x
array([0.66666667, 2. , 3. ])
Of course, you could also do this without having to explicitly specify the dtype by instantiating your array with floats in the first place:
>>> x = np.array([1., 2., 3.])
>>> x.dtype
dtype('float64')
However, whenever dealing with numerical methods in programming, you should be cognizant of the limitations of floating point arithmetic. Take a look at the note at the bottom of this answer for more details.
Explanation
NumPy arrays are homogeneous arrays, meaning that every element is of the same type. The dtype or "data type" determines how much memory each element in an array requires. This in turn dictates the memory footprint of an entire array in memory.
Because NumPy arrays are homogeneous, an array is stored in a contiguous chunk of memory, which makes accessing elements very fast since all elements take up the same number of bytes. This is part of the reason NumPy is so fast.
Just FYI, this is a very simplified explanation of what's going on here. Essentially, NumPy arrays are arrays of a single type ("int8", or "int32", or "float64", etc).
True division of integers always produces a float, but since NumPy doesn't assume you wanted to change the dtype of the entire array (which would require allocating a whole new array in memory), it quietly keeps the existing dtype and casts the result back to it when you assign into the array, truncating the fractional part.
Note
Anytime you're dealing with numerical computations, you need to be aware of the limitations of floating point arithmetic. Floating point arithmetic is prone to inaccuracy due to the nature of how these numbers are stored in memory.
In the case of solving systems of equations using linear algebra, even after solving for a particular vector x in Ax = b using a straightforward Gaussian elimination algorithm based on floating point arithmetic, the product Ax will not necessarily be exactly equal to b. I recommend reading further: https://en.wikipedia.org/wiki/Floating-point_arithmetic
You've stumbled upon a discipline known as numerical analysis, by the way! More specifically numerical linear algebra. This is a deep topic. Gaussian elimination is a lovely algorithm, but I also hope this answer galvanizes future readers to look into more advanced algorithms for solving systems of equations.
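Not part of the original answer, but a small, self-contained illustration of both points: it uses the float dtype fix from above and checks the residual with NumPy's own solver (np.linalg.solve) rather than the question's gauss_elimination function.
import numpy as np

# float inputs instead of the default int32 arrays from the question
a = np.array([[4, -2, 1], [-2, 4, -2], [1, -2, 4]], dtype=float)
b = np.array([11, -16, 17], dtype=float)

x = np.linalg.solve(a, b)
print(x)                      # [ 1. -2.  3.]
print(np.allclose(a @ x, b))  # True, but a @ x - b need not be exactly zero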

binary matrix aa (only contain 0, 1), why sum(sum(aa)) isn't equal sum(sum(aa>0))?

I have a binary mask named crop_mask, which only contains 0 and 1. Why isn't sum(sum(aa1)) equal to sum(sum(aa2))?
aa1 = crop_mask
aa2 = (aa1>0)
print(sum(sum(aa1)), sum(sum(aa2)))
This might be a minor issue, but I am just so confused now. Thanks for any help. I made a screenshot of the result in the attached figure.
By definition the sum should be the same.
The only thing I can think of is that the dtype of your array (assuming you are using a NumPy array) is not int or float.
Did you check that the "True"s in aa2 match the "1" in aa1?
EDIT:
dtype=np.uint8 limits the maximum value of a column sum to 255 (2^8 - 1), so the accumulation wraps around. sum(sum(a)) then becomes sum([0, 160, 0, ...]), where 160 is the remainder of 4000/256.
aa1 = aa1.astype(int) will solve your issue
a = np.zeros((4000, 4000)).astype(np.uint8)
a[:,1] = 1
a[:,4] = 1
b = (a > 0)
sum(sum(b)) # 8000
sum(sum(a)) # 320
a = a.astype(int)
sum(sum(a)) #8000
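A side note not from the original answer: the array's own sum method accumulates small integer dtypes in the platform's default integer, so it avoids the wraparound even without the astype call. A quick check under the same setup:
import numpy as np

a = np.zeros((4000, 4000), dtype=np.uint8)
a[:, 1] = 1
a[:, 4] = 1

# np.sum / ndarray.sum upcast uint8 to the default integer accumulator
print(a.sum())                # 8000
print(a.sum(dtype=np.int64))  # 8000, with the accumulator made explicit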
Assuming your cropmask is indeed a 2-dimensional ndarray with only 1s and 0s, this works:
import numpy as np
cropmask = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0]], np.uint8)
x = (cropmask > 0)
print(sum(sum(cropmask)), sum(sum(x)))
Result:
6 6
The most likely cause here is that you're wrong and your cropmask doesn't actually contain only 1s and 0s.
Have you tried:
print(sum(sum(np.logical_and((0 != crop_mask), (1 != crop_mask)))))
If that comes up greater than 0, there's something else in there.

Generate random matrix in numpy without rows of all 1's

I am generating a random matrix with
np.random.randint(2, size=(5, 3))
that outputs something like
[0,1,0],
[1,0,0],
[1,1,1],
[1,0,1],
[0,0,0]
How do I create the random matrix with the condition that each row cannot contain all 1's? That is, each row can be [1,0,0] or [0,0,0] or [1,1,0] or [1,0,1] or [0,0,1] or [0,1,0] or [0,1,1] but cannot be [1,1,1].
Thanks for your answers
Here's an interesting approach:
rows = np.random.randint(7, size=(6, 1), dtype=np.uint8)
np.unpackbits(rows, axis=1)[:, -3:]
Essentially, you are choosing an integer 0-6 for each row, i.e. 000-110 in binary; 7 would be 111 (all 1's). You then extract the binary digits as columns and take the last 3 digits (your 3 columns), since the output of unpackbits is 8 digits per byte.
Output:
array([[1, 0, 1],
[1, 0, 0],
[1, 0, 0],
[1, 0, 0],
[0, 1, 1],
[0, 0, 0]], dtype=uint8)
If you always have 3 columns, one approach is to explicitly list the acceptable rows and then choose randomly among them for each row you need:
import numpy as np
# every acceptable row
choices = np.array([
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 1]
])
n_rows = 5
# randomly pick which type of row to use for each row needed
idx = np.random.choice(range(len(choices)), size=n_rows)
# make an array by using the chosen rows
array = choices[idx]
If this needs to generalize to a large number of columns, it won't be practical to explicitly list all choices (even if you create the choices programmatically, the memory is still an issue; the number of possible rows grows exponentially in the number of columns). Instead, you can create an initial matrix and then just resample any unacceptable rows until there are none left. I'm assuming that a row is unacceptable if it consists only of 1s; it would be easy to adapt this to the case where the threshold is any number of 1s, though.
n_rows = 5
n_cols = 4
array = np.random.randint(2, size=(n_rows, n_cols))
all_1s_idx = array.sum(axis=-1) == n_cols
while all_1s_idx.any():
    array[all_1s_idx] = np.random.randint(2, size=(all_1s_idx.sum(), n_cols))
    all_1s_idx = array.sum(axis=-1) == n_cols
Here we just keep resampling all unacceptable rows until there are none left. Because all of the necessary rows are resampled at once, this should be quite efficient. Additionally, as the number of columns grows larger, the probability of a row having all 1s decreases exponentially, so efficiency shouldn't be a problem.
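Not part of the original answer, but a quick sanity check of the result, using the array and n_cols names from the snippet above:
# no row should consist entirely of 1s after resampling
assert not (array.sum(axis=-1) == n_cols).any()
print(array)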
@busybear beat me to it, but I'll post it anyway, as it is a bit more general:
import sys
import numpy as np

def not_all(m, k):
    if k > 64 or sys.byteorder != 'little':
        raise NotImplementedError
    sample = np.random.randint(0, 2**k - 1, (m,), dtype='u8').view('u1').reshape(m, -1)
    sample[:, k//8] <<= -k % 8
    return np.unpackbits(sample).reshape(m, -1)[:, :k]
For example:
>>> sample = not_all(1000000, 11)
# sanity checks
>>> unq, cnt = np.unique(sample, axis=0, return_counts=True)
>>> len(unq) == 2**11-1
True
>>> unq.sum(1).max()
10
>>> cnt.min(), cnt.max()
(403, 568)
And while I'm at it hijacking other people's answers, here is a streamlined version of @Nathan's acceptance-rejection method.
def accrej(m, k):
    sample = np.random.randint(0, 2, (m, k), bool)
    all_ones, = np.where(sample.all(1))
    while all_ones.size:
        resample = np.random.randint(0, 2, (all_ones.size, k), bool)
        sample[all_ones] = resample
        all_ones = all_ones[resample.all(1)]
    return sample.view('u1')
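A small usage example (not in the original answer), mirroring the sanity checks above:
sample = accrej(1000000, 11)
print(sample.shape)         # (1000000, 11)
print(sample.sum(1).max())  # at most 10, so no all-ones row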
Try this solution using sum():
import numpy as np
array = np.random.randint(2, size=(5, 3))
for i, entry in enumerate(array):
    if entry.sum() == 3:
        while True:
            new = np.random.randint(2, size=(1, 3))
            if new.sum() == 3:
                continue
            break
        array[i] = new
print(array)
Good luck my friend!

Randomize part of an array

I'm working on a project involving binary patterns (here np.arrays of 0 and 1).
I'd like to modify a random subset of these and return several altered versions of the pattern in which a given fraction of the values has been changed (like mapping a function over a random subset of an array of fixed size).
For example: take the pattern [0 0 1 0 1] and rate 0.2, return [[0 1 1 0 1] [1 0 1 0 1]]
It seems possible using auxiliary arrays and iterating with a condition, but is there a "clean" way to do it?
Thanks in advance!
The map function works on boolean arrays too. You could add the subsample logic to your function, like so:
import numpy as np
rate = 0.2
f = lambda x: np.random.choice((True, x),1,p=[rate,1-rate])[0]
a = np.array([0,0,1,0,1], dtype='bool')
map(f, a)
# This will output array a with on average 20% of the elements changed to "1"
# it can be slightly more or less than 20%, by chance.
Or you could rewrite a map function, like so:
import numpy as np
def map_bitarray(f, b, rate):
    '''
    maps function f on a random subset of b
    :param f: the function, should take a binary array of size <= len(b)
    :param b: the binary array
    :param rate: the fraction of elements that will be replaced
    :return: the modified binary array
    '''
    c = np.copy(b)
    num_elem = len(c)
    idx = np.random.choice(range(num_elem), int(num_elem * rate), replace=False)
    c[idx] = f(c[idx])
    return c
f = lambda x: True
b = np.array([0,0,1,0,1], dtype='bool')
map_bitarray(f, b, 0.2)
# This will output array b with exactly 20% of the elements changed to "1"
rate=0.2
repeats=5
seed=[0,0,1,0,1]
realizations=np.tile(seed,[repeats,1]) ^ np.random.binomial(1,rate,[repeats,len(seed)])
Use np.tile() to generate a matrix from the seed row.
np.random.binomial() to generate a binomial mask matrix with your requested rate.
Apply the mask with the xor binary operator ^
EDIT:
Based on @Jared Goguen's comments, if you want to change exactly 20% of the bits, you can build the mask by choosing the elements to change at random:
seed=[1,0,1,0,1]
rate=0.2
repeats=10
mask_list=[]
for _ in xrange(repeats):
    y = np.zeros(len(seed), np.int32)
    y[np.random.choice(len(seed), int(rate * len(seed)))] = 1
    mask_list.append(y)
mask = np.vstack(mask_list)
realizations=np.tile(seed,[repeats,1]) ^ mask
So, there's already an answer that provides sequences where each element has a random transition probability. However, it seems like you might want an exact fraction of the elements to change instead. For example, [1, 0, 0, 1, 0] can change to [1, 1, 0, 1, 0] or [0, 0, 0, 1, 0], but not [1, 1, 1, 1, 0].
The premise, based off of xvan's answer, uses the bit-wise xor operator ^. When a bit is xor'd with 0, its value does not change; when a bit is xor'd with 1, it flips. From your question, it seems like you want to change len(seq)*rate bits in the sequence. First create a mask that contains len(seq)*rate 1's. To get an altered sequence, xor the original sequence with a shuffled version of the mask.
Here's a simple, inefficient implementation:
import numpy as np
def edit_sequence(seq, rate, count):
    length = len(seq)
    change = int(length * rate)
    mask = [0]*(length - change) + [1]*change
    return [seq ^ np.random.permutation(mask) for _ in range(count)]
rate = 0.2
seq = np.array([0, 0, 1, 0, 1])
print edit_sequence(seq, rate, 5)
# [0, 0, 1, 0, 0]
# [0, 1, 1, 0, 1]
# [1, 0, 1, 0, 1]
# [0, 1, 1, 0, 1]
# [0, 0, 0, 0, 1]
I don't really know much about NumPy, so maybe someone with more experience can make this efficient, but the approach seems solid.
Edit: Here's a version that runs about 30% faster:
def edit_sequence(seq, rate, count):
    mask = np.zeros(len(seq), dtype=int)
    mask[:int(len(seq) * rate)] = 1
    output = []
    for _ in range(count):
        np.random.shuffle(mask)
        output.append(seq ^ mask)
    return output
It appears that this updated version scales very well with the size of seq and the value of count. Using dtype=bool in seq and mask yields another 50% improvement in the timing.
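A short usage example (not in the original answer), assuming the same seq and rate as before:
import numpy as np

seq = np.array([0, 0, 1, 0, 1], dtype=bool)
for altered in edit_sequence(seq, 0.2, 3):
    print(altered.astype(int))  # exactly one bit differs from seq each time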

Python one-liner for a confusion/contingency matrix needed

I want to write a one-liner to calculate a confusion/contingency matrix M (a square matrix with each dimension equal to the number of classes) that counts the cases presented in two vectors of length n: Ytrue and Ypredicted. Obviously the following does not work using Python and NumPy:
error = N.array([error[x,y]+1 for x, y in zip(Ytrue,Ypredicted)]).reshape((n,n))
Any hint for creating a one-liner confusion matrix calculator?
error = N.array([zip(Ytrue,Ypred).count(x) for x in itertools.product(classes,repeat=2)]).reshape(n,n)
or
error = N.array([z.count(x) for z in [zip(Ytrue,Ypred)] for x in itertools.product(classes,repeat=2)]).reshape(n,n)
The latter is more efficient but possibly more confusing.
import numpy as N
import itertools
Ytrue = [1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3]
Ypred = [1,1,2,1,2,1,3,1,
2,2,2,2,2,2,2,2,
3,3,2,2,2,1,1,1]
classes = list(set(Ytrue))
n = len(classes)
error = N.array([zip(Ytrue,Ypred).count(x) for x in itertools.product(classes,repeat=2)]).reshape(n,n)
print error
error = N.array([z.count(x) for z in [zip(Ytrue,Ypred)] for x in itertools.product(classes,repeat=2)]).reshape(n,n)
print error
Which produces
[[5 2 1]
[0 8 0]
[3 3 2]]
[[5 2 1]
[0 8 0]
[3 3 2]]
If your NumPy version is 1.6 or newer and Ytrue and Ypred are NumPy arrays, this code works:
np.bincount(n * (Ytrue - 1) + (Ypred -1), minlength=n*n).reshape(n, n)
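A quick check (not in the original answer), reusing the Ytrue, Ypred, and n from above and assuming the class labels are the consecutive integers 1..n, as they are here:
import numpy as np

Ytrue = np.array(Ytrue)
Ypred = np.array(Ypred)
error = np.bincount(n * (Ytrue - 1) + (Ypred - 1), minlength=n*n).reshape(n, n)
print(error)
# [[5 2 1]
#  [0 8 0]
#  [3 3 2]]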
