I am using Python 3 to try to find what linear combinations of a set of vectors will sum to another vector. I am using numpy arrays as vectors.
For example, I would have a target vector and matrix "choices" containing all the possible choices of vectors:
targetvector0 = numpy.array([0, 1, 2])
choices = numpy.array([[0, 1, 0], [0, 0, 1], [0, 0, 2], [1, 1, 0]])
I need something that would return all possible combinations and their integer multiples (need them to be integer multiples) that sum to the target and ignore those that don't:
option1 = [[1], [2], [0], [0]]
option2 = [[1], [0], [1], [0]]
I found some info on numpy.linalg.solve(x, y), but it doesn't quite do what I'm looking for or I don't know how to use it effectively.
I assume the multiples you are looking for are all non-negative.
You can increment the multiples one at a time, keeping only the combinations whose sum does not exceed the target vector in any component (this pruning assumes the choice vectors have no negative entries).
import numpy as np
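def solve(target_vector, choices):
    nb_choices, n = choices.shape
    # np.int was removed from recent NumPy releases; the builtin int works everywhere
    factors = np.zeros((1, nb_choices), dtype=int)
    i = 0
    while True:
        # Stop once the counter has overflowed past the last choice
        # (comparing against nb_choices - 1 would never try the last choice)
        if i == nb_choices:
            return
        factors[0, i] += 1
        # Use the function argument rather than a global variable
        difference_to_target = factors.dot(choices) - target_vector
        found_solution = np.all(difference_to_target == 0)
        factors_too_high = np.any(difference_to_target > 0)
        if found_solution:
            yield factors.copy()
        if found_solution or factors_too_high:
            # Reset the lower positions and carry over to the next choice
            factors[0, :i + 1] = 0
            i += 1
            continue
        i = 0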
targetvector = np.array([0, 1, 2])
choices = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 2], [1, 1, 0]])
print(list(solve(targetvector, choices)))
# [array([[1, 2, 0, 0]]), array([[1, 0, 1, 0]])]
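For comparison, here is a minimal brute-force sketch of the same idea. It is not the code from the answer above: the function name `bounded_combinations` is made up for illustration, and it assumes that every entry of the target and of the choices is non-negative and that no choice is the all-zero vector. It first bounds how often each choice can be used without overshooting the target, then tests every combination inside those bounds.

```python
import itertools
import numpy as np

def bounded_combinations(target, choices):
    """Yield tuples of non-negative integer multiples whose combination equals target.

    Hypothetical helper; assumes non-negative entries and no all-zero choice.
    """
    target = np.asarray(target)
    choices = np.asarray(choices)
    bounds = []
    for row in choices:
        nonzero = row > 0
        # Largest multiple of this choice that stays <= target in every component
        bounds.append(int(np.min(target[nonzero] // row[nonzero])))
    # Try every combination of multiples within the bounds
    for coeffs in itertools.product(*(range(b + 1) for b in bounds)):
        if np.array_equal(np.dot(coeffs, choices), target):
            yield coeffs

target = np.array([0, 1, 2])
choices = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 2], [1, 1, 0]])
solutions = list(bounded_combinations(target, choices))
print(solutions)
```

The search space grows as the product of the bounds, so this is only practical for small problems like the one in the question.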
I would like to know the fastest way to extract the indices of the first n non zero values per column in a 2D array.
For example, with the following array:
arr = np.array([
    [4, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [2, 0, 9, 0],
    [6, 0, 0, 0],
    [0, 7, 0, 0],
    [3, 0, 0, 0],
    [1, 2, 0, 0],
])
With n=2 I would have [0, 0, 1, 1, 2] as xs and [0, 3, 2, 5, 3] as ys. 2 values in the first and second columns and 1 in the third.
Here is how it is currently done:
x = []
y = []
n = 2  # matches the n=2 example above
for i, c in enumerate(arr.T):
    a = c.nonzero()[0][:n]
    if len(a):
        x.extend([i]*len(a))
        y.extend(a)
In practice I have arrays of size (405, 256).
Is there a way to make it faster?
Here is a method that does not require sorting the array; a single linear scan is enough to find the non-zero values. It is a little hard to read because it chains several functions:
n = 2
# Get indices with non null values, columns indices first
nnull = np.stack(np.where(arr.T != 0))
# split indices by unique value of column
cols_ids= np.array_split(range(len(nnull[0])), np.where(np.diff(nnull[0]) > 0)[0] +1 )
# Take n in each (max) and concatenate the whole
np.concatenate([nnull[:, u[:n]] for u in cols_ids], axis = 1)
outputs:
array([[0, 0, 1, 1, 2],
[0, 3, 2, 5, 3]], dtype=int64)
Here is one approach using argsort. Note that it returns the indices in a different order:
n = 2
m = arr!=0
# non-zero values first; a stable sort keeps them in their original row order
idx = np.argsort(~m, axis=0, kind='stable')
# get first 2 and ensure non-zero
m2 = np.take_along_axis(m, idx, axis=0)[:n]
y,x = np.where(m2)
# slice
x, idx[y,x]
# (array([0, 1, 2, 0, 1]), array([0, 2, 3, 3, 5]))
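Take the column indices returned by nonzero on the transposed array and compare them with a copy of themselves shifted by n. An entry is kept when the entry n positions earlier belongs to a different column: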
>>> n = 2
>>> i, j = arr.T.nonzero()
>>> mask = np.concatenate([[True] * n, i[n:] != i[:-n]])
>>> i[mask], j[mask]
(array([0, 0, 1, 1, 2], dtype=int64), array([0, 3, 2, 5, 3], dtype=int64))
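Another possible variant, sketched here as an illustration rather than taken from the answers above, uses a running count of non-zero entries per column. The function name `first_n_nonzero` is an assumption made for this example. An entry is kept when it is non-zero and the running count in its column has not yet exceeded n.

```python
import numpy as np

def first_n_nonzero(arr, n):
    """Return (column indices, row indices) of the first n non-zero values per column."""
    mask = arr != 0
    # Running count of non-zero entries going down each column
    running = np.cumsum(mask, axis=0)
    keep = mask & (running <= n)
    # Transpose so the result is grouped by column, rows ascending within a column
    cols, rows = np.nonzero(keep.T)
    return cols, rows

arr = np.array([
    [4, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [2, 0, 9, 0],
    [6, 0, 0, 0],
    [0, 7, 0, 0],
    [3, 0, 0, 0],
    [1, 2, 0, 0],
])
xs, ys = first_n_nonzero(arr, 2)
print(xs, ys)
```

This scans the whole array once and needs no sorting, at the cost of an intermediate integer array of the same shape as the input.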
My code works, but I am curious: is there a way to write it using a list comprehension? I want to remove the outer loop (for i in range(labels)) and replace it with a list comprehension. The problem I am facing is how to make the assignment (current_class_p = p[y == i]) that sits between the outer and inner loops.
For example, y = np.array([0, 1, 1, 0, 1]), p =np.array([1, 0, 1, 0, 1]), and confusion matrix for this [[1, 1],[1, 2]].
def confusion_matrix_version2(y, p):
    labels = len(np.unique(y))
    result = np.zeros((labels, labels), dtype=int)
    for i in range(labels):
        current_class_p = p[y == i]
        result[i] = [len(current_class_p[current_class_p == j]) for j in range(labels)]
    return result
Use sklearn's confusion_matrix to generate your desired output :
>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> cm = confusion_matrix(y_true, y_pred)
>>> cm
array([[2, 0, 0],
[0, 0, 1],
[1, 0, 2]])
where rows correspond to actual (true) values and columns correspond to predicted values. If you want this as a nested list, you can use cm.tolist()
>>> cm.tolist()
[[2, 0, 0], [0, 0, 1], [1, 0, 2]]
EDIT: Updated the list output from a list comprehension to the array's tolist function as per juanpa.arrivillaga's suggestion.
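If you would rather stay in plain NumPy, the following sketch shows two ways to drop the outer loop. Neither comes from the answer above, and both function names are made up for illustration. They assume the labels are the integers 0 to labels - 1. The first is the nested list comprehension asked about in the question; the second accumulates counts with np.add.at.

```python
import numpy as np

def confusion_matrix_comprehension(y, p):
    """Nested list comprehension: rows are true labels, columns are predictions."""
    labels = len(np.unique(y))
    return np.array([[np.sum((y == i) & (p == j)) for j in range(labels)]
                     for i in range(labels)])

def confusion_matrix_add_at(y, p):
    """Same result without Python-level loops, using unbuffered in-place addition."""
    labels = len(np.unique(y))
    result = np.zeros((labels, labels), dtype=int)
    # Adds 1 at position (y[k], p[k]) for every k, including repeated positions
    np.add.at(result, (y, p), 1)
    return result

y = np.array([0, 1, 1, 0, 1])
p = np.array([1, 0, 1, 0, 1])
cm_a = confusion_matrix_comprehension(y, p)
cm_b = confusion_matrix_add_at(y, p)
print(cm_a)
```

The comprehension avoids the intermediate assignment by folding the row condition (y == i) into the same boolean mask as the column condition (p == j).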
I have a tensor filled with 0 and 1. Now I want to randomly choose e.g. 50% of the elements which are equal to one. How do I do that?
For example I have the following tensor:
tensor = tf.constant([[0, 0, 1], [0, 1, 0], [1, 1, 0]])
Now I want to randomly choose the coordinates of 50% of the elements which are equal to one (in this case, I want to choose 2 elements out of the 4). The resulting tensor could look as follows:
[[0, 0, 1], [0, 0, 0], [0, 1, 0]]
You can use numpy.
import numpy as np
tensor = np.array([0, 1, 0, 1, 0, 1, 0, 1])
percentage = 0.5
ones_indices = np.where(tensor==1)
ones_length = np.shape(ones_indices)[1]
random_indices = np.random.permutation(ones_length)
ones_indices[0][random_indices][:int(ones_length * percentage)]
Edit: With your definition of a tensor I have adjusted the code:
import numpy as np
tensor = np.array([[0, 0, 1], [0, 1, 0], [1, 1, 0]])
percentage = 0.5
indices = np.where(tensor == 1)
length = np.shape(indices)[1]
random_idx = np.random.permutation(length)
random_idx = random_idx[:int(length * percentage)]
random_indices = (indices[0][random_idx], indices[1][random_idx])
z = np.zeros(np.shape(tensor), dtype=np.int64)
z[random_indices] = 1
# output
z
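The same steps can also be written with np.argwhere and NumPy's Generator API. This is a sketch under assumptions, not a TensorFlow solution: the function name `sample_ones` and the fixed seed are chosen here only for illustration.

```python
import numpy as np

def sample_ones(tensor, fraction, rng):
    """Keep a random fraction of the entries equal to 1 and zero out the rest."""
    coords = np.argwhere(tensor == 1)            # one row of coordinates per 1-entry
    n_keep = int(len(coords) * fraction)
    # Pick n_keep distinct positions among the 1-entries
    picked = rng.choice(len(coords), size=n_keep, replace=False)
    result = np.zeros_like(tensor)
    result[tuple(coords[picked].T)] = 1
    return result

rng = np.random.default_rng(0)                   # seed fixed only for reproducibility
tensor = np.array([[0, 0, 1], [0, 1, 0], [1, 1, 0]])
sampled = sample_ones(tensor, 0.5, rng)
print(sampled)
```

Because the positions are drawn without replacement, exactly int(4 * 0.5) = 2 of the four ones survive in this example; which two depends on the seed.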
I have a matrix with a certain number of columns that contain only the numbers 0 and 1. I want to count the number of [0, 0], [0, 1], [1, 0], and [1, 1] in each PAIR of columns.
So for example, if I have a matrix with four columns, I want to count the number of 00s, 01s, 10s, and 11s in the first and second column, append the final result to a list, then loop over the 3rd and 4th column and append that answer to the list.
Example input:
array([[0, 1, 1, 0],
[1, 0, 1, 0],
[0, 1, 0, 1],
[0, 0, 1, 1],
[1, 1, 0, 0]])
My expected output is:
array([[1, 1],
[2, 1],
[1, 2],
[1, 1]])
Explanation:
The first two columns have [0, 0] once. The second two columns also have [0, 0] once. The first two columns have [0, 1] twice, and the second two columns have [0, 1] once... and so on.
This is my latest attempt and it seems to work. Would like feedback.
# for each pair of columns calculate haplotype frequencies
# haplotypes:
# h1 = 11
# h2 = 10
# h3 = 01
# h4 = 00
# takes as input a pair of columns
def calc_haplotype_freq(matrix):
    h1_frequencies = []
    h2_frequencies = []
    h3_frequencies = []
    h4_frequencies = []
    colIndex1 = 0
    colIndex2 = 1
    for i in range(0, 2):  # number of columns divided by 2
        h1 = 0
        h2 = 0
        h3 = 0
        h4 = 0
        column_1 = matrix[:, colIndex1]
        column_2 = matrix[:, colIndex2]
        for row in range(0, matrix.shape[0]):
            if (column_1[row, 0] == 1).any() & (column_2[row, 0] == 1).any():
                h1 += 1
            elif (column_1[row, 0] == 1).any() & (column_2[row, 0] == 0).any():
                h2 += 1
            elif (column_1[row, 0] == 0).any() & (column_2[row, 0] == 1).any():
                h3 += 1
            elif (column_1[row, 0] == 0).any() & (column_2[row, 0] == 0).any():
                h4 += 1
        colIndex1 += 2
        colIndex2 += 2
        h1_frequencies.append(h1)
        h2_frequencies.append(h2)
        h3_frequencies.append(h3)
        h4_frequencies.append(h4)
    print("H1 Frequencies (11): ", h1_frequencies)
    print("H2 Frequencies (10): ", h2_frequencies)
    print("H3 Frequencies (01): ", h3_frequencies)
    print("H4 Frequencies (00): ", h4_frequencies)
For the sample input above, this gives:
----------
H1 Frequencies (11): [1, 1]
H2 Frequencies (10): [1, 2]
H3 Frequencies (01): [2, 1]
H4 Frequencies (00): [1, 1]
----------
Which is correct, but is there a better way to do this? How can I return these results from the function for further processing?
Starting with this -
x
array([[0, 1, 1, 0],
[1, 0, 1, 0],
[0, 1, 0, 1],
[0, 0, 1, 1],
[1, 1, 0, 0]])
Split your array into groups of 2 columns and concatenate them:
y = x.T
z = np.concatenate([y[i:i + 2] for i in range(0, y.shape[0], 2)], 1).T
Now, perform a broadcasted comparison and sum:
(z[:, None] == [[0, 0], [0, 1], [1, 0], [1, 1]]).all(2).sum(0)
array([2, 3, 3, 2])
If you want a per-column pair count, then you could do something like this:
def calc_haplotype_freq(x):
    counts = []
    for i in range(0, x.shape[1], 2):
        counts.append(
            (x[:, None, i:i + 2] == [[0, 0], [0, 1], [1, 0], [1, 1]]).all(2).sum(0)
        )
    return np.column_stack(counts)
calc_haplotype_freq(x)
array([[1, 1],
[2, 1],
[1, 2],
[1, 1]])
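A possible alternative, sketched here and not part of the answer above, is to encode each pair of bits as a single number 2*a + b (so 00 becomes 0, 01 becomes 1, 10 becomes 2, 11 becomes 3) and count the codes with np.bincount. The function name `pair_counts` is an assumption for this example, and the sketch assumes an even number of columns containing only 0 and 1.

```python
import numpy as np

def pair_counts(x):
    """Count 00, 01, 10, 11 (in that row order) for each consecutive pair of columns."""
    # Even-indexed columns are the first bit, odd-indexed columns the second bit
    codes = 2 * x[:, 0::2] + x[:, 1::2]          # shape (rows, n_pairs), values 0..3
    return np.stack(
        [np.bincount(codes[:, k], minlength=4) for k in range(codes.shape[1])],
        axis=1,
    )

x = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0]])
counts = pair_counts(x)
print(counts)
```

Returning the array instead of printing it also addresses the second part of the question: the caller can index counts[code, pair] for further processing.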
Let's say I have a 2D array with positive integers:
a = numpy.array([[1, 1, 2],
[1, 2, 5],
[1, 3, 6],
[3, 3, 3],
[3, 4, 6],
[4, 5, 6],
])
and a threshold (positive integer). I want to count, for each row, how many occurrences are < threshold, how many are >= threshold and < threshold+2, and how many are >= threshold+2. The results are to be stored in an n x 3 array, where n = a.shape[0] and each of the 3 columns corresponds to one partition.
For the example above and threshold = 3, it would be:
b = numpy.array([[3, 0, 0],
[2, 0, 1],
[1, 1, 1],
[0, 3, 0],
[0, 2, 1],
[0, 1, 2],
])
My solution was to use a for loop combined with masks, so that I could apply the masks individually for each row. But using for loops on arrays feels wrong. Is there a more optimized way to accomplish that?
My solution so far:
b = []
for row in a:
    b.append((numpy.sum(row < threshold),
              numpy.sum((row >= threshold) * (row < threshold + 2)),
              numpy.sum(row >= threshold + 2)))
b = numpy.array(b)
Approach #1
Making use of elementwise comparison against the thresholds and summing each row -
t = 3 # threshold
mask0 = (a<t)
mask2 = a>=t+2
mask1 = (a>=t) & ~mask2
out = np.c_[mask0.sum(1), mask1.sum(1), mask2.sum(1)]
Approach #2
If you think about it closely, we are creating three bins there. So, we could get the bin ID for each element and then count, for each row, how many elements carry each ID. We would use np.searchsorted to get those bin IDs and then compare elementwise and sum along each row.
Thus, we would have a solution, like so -
t = 3 # threshold
bins = [t, t+2] # Create intervals
N = len(bins)+1 # Number of cols in output
idx = np.searchsorted(bins,a,'right') # Get bin IDs
out = np.column_stack([(idx==i).sum(1) for i in range(N)])
We can vectorize the last step with broadcasting -
out = (idx == np.arange(N)[:,None,None]).sum(2).T
And one more vectorized alternative, which would also be memory efficient with np.bincount -
M = a.shape[0]
r = N*np.arange(M)[:,None]
out = np.bincount((idx + r).ravel(),minlength=M*N).reshape(M,N)
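As a quick consistency check, the sketch below runs the three variants of Approach #2 on the array from the question and compares them with the expected result. The variable names `out_loop`, `out_broadcast`, `out_bincount` and `expected` are introduced here only for illustration.

```python
import numpy as np

a = np.array([[1, 1, 2],
              [1, 2, 5],
              [1, 3, 6],
              [3, 3, 3],
              [3, 4, 6],
              [4, 5, 6]])
t = 3                                   # threshold
bins = [t, t + 2]                       # bin edges
N = len(bins) + 1                       # number of output columns
M = a.shape[0]                          # number of rows
idx = np.searchsorted(bins, a, side='right')   # bin ID of every element

# Variant 1: one comparison per bin
out_loop = np.column_stack([(idx == i).sum(1) for i in range(N)])
# Variant 2: all bins at once via broadcasting
out_broadcast = (idx == np.arange(N)[:, None, None]).sum(2).T
# Variant 3: offset each row's IDs so a single bincount covers every row
out_bincount = np.bincount((idx + N * np.arange(M)[:, None]).ravel(),
                           minlength=M * N).reshape(M, N)

expected = np.array([[3, 0, 0], [2, 0, 1], [1, 1, 1],
                     [0, 3, 0], [0, 2, 1], [0, 1, 2]])
print(out_bincount)
```

In the bincount variant, row r uses the ID range r*N to r*N + N - 1, which is why the flat counts can be reshaped straight into an M x N array.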
You have two break points, 3 and 5. We can use np.searchsorted to find where each element of a falls with respect to these break points.
np.searchsorted([3, 5], 1, side='right') will return 0 because 1 must be inserted at position 0 to keep the list sorted.
np.searchsorted([3, 5], 3, side='right') will return 1. A value of 3 could be inserted either before or after the existing 3 and the list would stay sorted. The default behavior (side='left') is to insert to the left of equal elements, which would give 0. With side='right' the insertion point is to the right of all equal elements, so only values strictly less than the threshold land in bin 0. This accounts for the condition < threshold.
np.searchsorted([3, 5], 5) will return 1 with the default side='left'; with side='right' it returns 2.
np.searchsorted([3, 5], 7) will return 2
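The difference between the two side settings can be seen by running both over the same values. This is a small illustrative sketch; the names `breaks`, `left` and `right` are chosen here and do not appear in the answer.

```python
import numpy as np

breaks = [3, 5]
values = [1, 3, 5, 7]

# Insertion points when ties go to the left (the default) and to the right
left = [int(np.searchsorted(breaks, v, side='left')) for v in values]
right = [int(np.searchsorted(breaks, v, side='right')) for v in values]

for v, l, r in zip(values, left, right):
    print(f"value {v}: side='left' -> {l}, side='right' -> {r}")
```

Only the values equal to a break point (3 and 5) change bins between the two settings, which is exactly what decides whether a boundary behaves as < or <=.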
I use np.eye to build sub arrays to sum over in order to count how many fall within each bin.
np.eye(3, dtype=int)[np.searchsorted([3, 5], a, side='right')].sum(1)
array([[3, 0, 0],
[2, 0, 1],
[1, 1, 1],
[0, 3, 0],
[0, 2, 1],
[0, 1, 2]])
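To see why indexing np.eye works as a counter, here is a small sketch with made-up bin IDs (the array `bin_ids` is hypothetical and not derived from the data above). Indexing the identity matrix with an integer array replaces every ID with a one-hot row, and summing over the element axis adds those rows up into per-bin counts.

```python
import numpy as np

eye = np.eye(3, dtype=int)
# Hypothetical bin IDs for two rows of three elements each
bin_ids = np.array([[0, 0, 2],
                    [1, 1, 1]])

one_hot = eye[bin_ids]        # shape (2, 3, 3): one one-hot vector per element
counts = one_hot.sum(1)       # sum over the elements of each row
print(counts)
```

The intermediate one-hot array has shape (rows, elements, bins), so memory use grows with the number of bins.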
We can generalize this with a function
def count_bins(a, threshold, interval_sizes):
    edges = np.append(threshold, interval_sizes).cumsum()
    eye = np.eye(edges.size + 1, dtype=int)
    return eye[edges.searchsorted(a, side='right')].sum(1)
count_bins(a, 3, [2])
array([[3, 0, 0],
[2, 0, 1],
[1, 1, 1],
[0, 3, 0],
[0, 2, 1],
[0, 1, 2]])
Or
count_bins(a, 3, [1, 1])
array([[3, 0, 0, 0],
[2, 0, 0, 1],
[1, 1, 0, 1],
[0, 3, 0, 0],
[0, 1, 1, 1],
[0, 0, 1, 2]])
But I'd rather return a pandas dataframe to see things more clearly
import pandas as pd

def count_bins(a, threshold, interval_sizes):
    edges = np.append(threshold, interval_sizes).cumsum()
    eye = np.eye(edges.size + 1, dtype=int)
    # Column labels of the form "lower to upper", with infinite outer edges
    labels = ['{:0.0f} to {:0.0f}'.format(i, j) for i, j in zip(np.append(-np.inf, edges), np.append(edges, np.inf))]
    return pd.DataFrame(
        eye[edges.searchsorted(a, side='right')].sum(1),
        columns=labels
    )
count_bins(a, 3, [2])
   -inf to 3  3 to 5  5 to inf
0          3       0         0
1          2       0         1
2          1       1         1
3          0       3         0
4          0       2         1
5          0       1         2